Homozygosity Haplotype Allows a Genomewide Search for the Autosomal Segments Shared among Patients
2007; Elsevier BV; Volume: 80; Issue: 6; Language: English
DOI: 10.1086/518176
ISSN: 1537-6605
Authors: Hitoshi Miyazawa, Masaaki Kato, Takuya Awata, Masakazu Kohda, Hiroyasu Iwasa, Nobuyuki Koyama, Tomoaki Tanaka, Huqun, Shunei Kyo, Yasushi Okazaki, Koichi Hagiwara
Topic(s): Genomics and Rare Diseases
Abstract: A promising strategy for identifying disease susceptibility genes for both single- and multiple-gene diseases is to search patients’ autosomes for shared chromosomal segments derived from a common ancestor. Such segments are characterized by the distinct identity of their haplotype. The methods and algorithms currently available have only a limited capability for determining a high-resolution haplotype genomewide. We herein introduce the homozygosity haplotype (HH), a haplotype described by the homozygous SNPs that are easily obtained from high-density SNP genotyping data. The HH represents haplotypes of both copies of homologous autosomes, allowing for direct comparisons of the autosomes among multiple patients and enabling the identification of the shared segments. The HH successfully detected the shared segments from members of a large family with Marfan syndrome, which is an autosomal dominant, single-gene disease. It also detected the shared segments from patients with model multigene diseases originating with common ancestors who lived 10–25 generations ago. The HH is therefore considered to be useful for the identification of disease susceptibility genes in both single- and multiple-gene diseases.

Current genetic approaches focus on the identification of disease susceptibility genes (hereafter referred to as “disease genes”) by exploiting the cosegregation of the disease phenotype over generations with a disease gene as well as a set of polymorphic marker types in its neighborhood (i.e., the haplotype). In a large family including multiple patients with a specific disease, the disease gene is usually derived from a single ancestor. On the basis of this assumption, haplotype analysis [1: Long JC, Williams RC, Urbanek M. An E-M algorithm and testing strategy for multiple-locus haplotypes. Am J Hum Genet. 1995;56:799-810] or linkage analysis [2: Morton NE. Sequential tests for the detection of linkage. Am J Hum Genet. 1955;7:277-318] has been used to find the gene. In affected-sib-pair analysis, a sib pair affected with the same disease is considered to share the same disease gene, inherited from their parents. The gene is searched for by looking at the genetic markers shared by the pair [1] [3: Kruglyak L, Lander ES. Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am J Hum Genet. 1995;57:439-454]. In whole-genome association studies, researchers try to capture segments containing disease-risk alleles derived from a limited number of very ancient ancestors, where a haplotype block is the ultimate unit of search [4: International HapMap Consortium. The International HapMap Project. Nature. 2003;426:789-796] [5: Di Rienzo A, Hudson RR. An evolutionary framework for common diseases: the ancestral-susceptibility model. Trends Genet. 2005;21:596-601] [6: Carlson CS, Eberle MA, Kruglyak L, Nickerson DA. Mapping complex disease loci in whole-genome association studies. Nature. 2004;429:446-452].

Because the haplotype contains the canonical information for every approach, determination of the haplotype is considered to greatly simplify the analyses [7: Gillanders EM, Pearson JV, Sorant JM, Trent JM, O’Connell JR, Bailey-Wilson JE. The value of molecular haplotypes in a family-based linkage study. Am J Hum Genet. 2006;79:458-468]. However, the haplotype is not easy to identify in diploid organisms such as humans, because the genotypes of polymorphic markers are obtained as a mixture of those of the two alleles. Although many methods have been developed to reconstruct the haplotype [1] [7] [8: Amos CI, Dawson DV, Elston RC. The probabilistic determination of identity-by-descent sharing for pairs of relatives from pedigrees. Am J Hum Genet. 1990;47:842-853] [9: Zhang K, Zhao H. A comparison of several methods for haplotype frequency estimation and haplotype reconstruction for tightly linked markers from general pedigrees. Genet Epidemiol. 2006;30:423-437], their capabilities are limited. It is currently not possible, at least on a genomewide basis, to obtain haplotype information from an arbitrary subject or to compare two unrelated subjects in order to search for chromosomal segments sharing the same haplotype.

In this study, we introduce the homozygosity haplotype (HH), which overcomes part of this problem. The HH is a form of haplotype described by the homozygous SNPs, and, therefore, it is easily obtained genomewide. Using a family affected with Marfan syndrome (MIM 154700) and patients with model multigene diseases, we demonstrate how HH analysis allows the identification of the location of disease genes.

An HH is a haplotype described by only homozygous SNPs and is obtained by the deletion of heterozygous SNPs (fig. 1Ai), leaving only the homozygous SNPs (fig. 1Aii). At this point, the haplotype of each chromosome is uniquely determined, because all SNPs are homozygous (fig. 1Aiii). Note that both copies of homologous autosomes have the same HH over their entire length. A compSNP is a SNP that is homozygous in two subjects (fig. 1B). We can compare the HHs between two subjects by use of the compSNPs (fig. 1C). An RCHH is a run of compSNPs matched for allelic type, the genetic length of which is longer than the cutoff value (fig. 1C). An RCHH is bounded by either a mismatched compSNP(s) or by the end(s) of an autosome. The RCHHs shared by multiple subjects are the overlap of the RCHHs for each subject pair (fig. 1D). An RCA is an autosomal region where subjects share a chromosomal segment derived from a common ancestor (i.e., a segment identical by descent) (fig. 2A).
In an RCA, subjects share the same segment on one or both copies of their homologous autosomes and thus share the same HH. Conversely, when subjects have the same HH in a region, it suggests the presence of an RCA. Note that the RCA is unknown, and its presence is merely predicted through the RCHHs (see the section entitled “The RCHHs, False Negatives, Type A False Positives, and Type B False Positives”).

The average genetic length of the RCAs decreases over generations. Figure 2B is a model pedigree with common ancestors (A and B). Two descendants (M and N), who are m and n generations removed from their common ancestors, share the RCAs derived from A and B. Assuming that the spouses (shown in gray shapes in fig. 2B) are not the descendants of A or B, then RCA(m,n) is the ratio of the total genetic length of the A- or B-derived RCA to the entire length of the autosomes. It is expressed as

\[
\mathrm{RCA}(m,n : m \ge n) =
\begin{cases}
2^{-m+1} & m \ge 1,\ n = 0 \\
\tfrac{3}{4} & m = 1,\ n = 1 \\
2^{-m-n+2} & \text{otherwise.}
\end{cases}
\tag{1}
\]

A detailed description of the deduction of equation (1) is given in appendix A. Note that RCA(1,0) is equal to 1, indicating that a parent and a child (i.e., m=1, n=0) share the RCAs over the entire lengths of their autosomes. We used Haldane’s Poisson process model [10: Haldane JBS. The combination of linkage values, and the calculation of distances between the loci of linked factors. J Genet. 1919;8:299-309] for the occurrence of crossovers and performed all calculations on the basis of this model.
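As a quick check on equation (1), the following is a minimal Python sketch that evaluates RCA(m,n) for a few relationships. The function name `rca_ratio` is hypothetical and is not taken from the authors' C program; the sketch only transcribes the three cases of the equation.

```python
def rca_ratio(m: int, n: int) -> float:
    """Expected fraction of autosomal length covered by RCAs, per equation (1).

    m and n are the numbers of generations separating the two descendants
    from their common ancestors. The name is illustrative only.
    """
    if m < n:
        m, n = n, m  # equation (1) is stated for m >= n
    if m < 1 or n < 0:
        raise ValueError("need m >= 1 and n >= 0")
    if n == 0:
        return 2.0 ** (-m + 1)      # direct line of descent
    if m == 1 and n == 1:
        return 0.75                 # siblings
    return 2.0 ** (-m - n + 2)      # all other relationships


for label, m, n in [("parent-child", 1, 0), ("siblings", 1, 1),
                    ("aunt-nephew", 2, 1), ("first cousins", 2, 2)]:
    print(f"{label}: {rca_ratio(m, n):.2f}")
```

These four cases give 1.00, 0.75, 0.50, and 0.25, the values that equation (1) predicts for the family pairs examined later in the article.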
Information on SNPs used by the 500K GeneChips Mapping Array Set (Affymetrix) was summarized in the GeneChip annotation files (4/13/2006 version; see Affymetrix Web site), where, for each SNP, the genetic distance from the telomere of the short arm of the chromosome was obtained by interpolation from the sex-averaged data by deCODE Genetics [11: Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, et al. A high-resolution recombination map of the human genome. Nat Genet. 2002;31:241-247]. The genetic length of an RCHH is the genetic distance between its bounding compSNPs. We restricted our analysis to a total of 492,554 SNPs that had assigned dbSNP refIDs (see National Center for Biotechnology Information Web site). The computer programs were written in the C programming language and were compiled by the GNU C compiler 4.0 (see the GNU Compiler Collection Web site). The program is available from our Web site and from the Saitama Medical University Web site.

When subjects who have common ancestors suffer from the same disease, the RCAs are the candidate regions in which to look for the disease gene. Because many RCAs are contained in the RCHHs, we established an algorithm that detects the RCHHs, thereby allowing us to identify the disease gene. As described, an RCHH is defined when a run of type-matched compSNPs is longer than the cutoff value (fig. 3A). Many RCHHs contain the RCAs; however, some do not. We defined three types of errors (fig. 3B). The false negatives are the RCAs that are not contained in the RCHHs. The type A false positives are the RCHHs that do not contain the RCAs. The type B false positives are the spaces between a containing RCHH and a contained RCA. The equations to calculate each of these errors are given in appendix A.
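The RCHH definition can be sketched in a few lines of Python. This is an illustrative reading of the definition, not the authors' C program. The genotype coding ("AA", "AB", "BB"), the function name `find_rchhs`, and the choice to measure a run from its first to its last matched compSNP are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

HOMOZYGOUS = {"AA", "BB"}  # assumed genotype coding; anything else is skipped


def find_rchhs(pos_cm: Sequence[float], geno1: Sequence[str],
               geno2: Sequence[str], cutoff_cm: float) -> List[Tuple[float, float]]:
    """Return (start, end) positions in cM of runs of matched compSNPs on one
    autosome whose genetic length exceeds cutoff_cm.

    pos_cm must be sorted. Run length is taken here as the distance between
    the first and last matched compSNP of the run (an assumption).
    """
    runs = []
    start = last = None
    for pos, g1, g2 in zip(pos_cm, geno1, geno2):
        if g1 not in HOMOZYGOUS or g2 not in HOMOZYGOUS:
            continue  # not a compSNP: heterozygous or uncalled in a subject
        if g1 == g2:
            if start is None:
                start = pos
            last = pos
        else:
            # a mismatched compSNP closes the current run
            if start is not None and last - start > cutoff_cm:
                runs.append((start, last))
            start = last = None
    # the end of the autosome also closes a run
    if start is not None and last - start > cutoff_cm:
        runs.append((start, last))
    return runs


positions = [0.0, 1.0, 2.0, 3.5, 4.0, 5.0, 9.0]
subject1 = ["AA", "AB", "BB", "AA", "AA", "BB", "AA"]
subject2 = ["AA", "AA", "BB", "AA", "BB", "BB", "AA"]
print(find_rchhs(positions, subject1, subject2, cutoff_cm=3.0))
```

In this toy input, the SNP at 1.0 cM is ignored because subject 1 is heterozygous there, and the mismatched compSNP at 4.0 cM splits the chromosome into two runs.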
Before the analysis, we calculated the ratios of the false negatives to the total length of the RCAs and of the type A and type B false positives to the entire length of the autosomes for a range of cutoff values, and we selected a value that minimizes the influence of errors.

This study was approved by the institutional review board of Saitama Medical University. All DNA samples were purified from peripheral blood drawn after written informed consent had been obtained. A family that included multiple patients with Marfan syndrome was genotyped, as were 46 unrelated subjects. In addition, the genotyping data from 45 unrelated Japanese subjects who had been enrolled in the International HapMap project (see International HapMap Web site) were obtained from the Affymetrix Web site (an average of 199,400 SNPs per subject had confidence values <.05).

Genomewide SNP genotyping was performed using the 500K GeneChips Mapping Array Set (i.e., the GeneChip Human Mapping 250K Nsp Array plus the GeneChip Human Mapping 250K Sty Array) (Affymetrix) or either of the two arrays. (Hereafter, the 500K GeneChips Mapping Array Sets will be abbreviated as “500k GeneChips” and the GeneChip Human Mapping 250K Nsp Array as “250k GeneChips.”) The 500k GeneChips were used for the analysis of the family with Marfan syndrome. The 250k GeneChips were used for the multigene disease simulation.

In the multigene disease simulations, genotyping data of patients who share an RCA at a specific position were constructed by replacing that part of their genotyping data with the genotyping data of a specific subject who acts as a common ancestor. The length of the replaced segment (x, in centimorgans) was taken at random from an exponential distribution with a probability density function of

\[
f(x) = \lambda e^{-\lambda x}, \qquad \lambda = \frac{m}{100}, \tag{2}
\]

where m is the age, in generations, of the common ancestor. The numbers of subjects who share an RCHH at a given position on an autosome were compared between the patient pool and the control pool.
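The segment-length draw in equation (2) can be sketched with the Python standard library. The function name `sample_segment_length_cm` is hypothetical. How the segment is positioned around the disease locus is not stated in the text, so the sketch samples only the length.

```python
import random


def sample_segment_length_cm(m: int, rng: random.Random) -> float:
    """Draw a replaced-segment length in cM from equation (2).

    m is the age of the common ancestor in generations; the rate is
    lambda = m / 100 per centimorgan, so the mean length is 100 / m cM.
    """
    lam = m / 100.0
    return rng.expovariate(lam)


rng = random.Random(2007)  # fixed seed only so that reruns are repeatable
lengths = [sample_segment_length_cm(10, rng) for _ in range(15)]
print([round(x, 1) for x in lengths])
```

With m=10 the mean segment length is 10 cM. That mean, combined with the wide spread of an exponential distribution, is one way to see why the choice of RCHH cutoff matters for older ancestors.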
The assumption was made that

\[
u_0 = \frac{\hat{P}_1^{*} - \hat{P}_2^{*}}{\sqrt{\hat{P}^{*}\left(1 - \hat{P}^{*}\right)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
\]

has a standard normal distribution, where

\[
\hat{P}_1^{*} = \frac{x_1 + 0.5}{n_1 + 1}, \qquad
\hat{P}_2^{*} = \frac{x_2 + 0.5}{n_2 + 1}, \qquad
\hat{P}^{*} = \frac{x_1 + x_2 + 0.5}{n_1 + n_2 + 1},
\]

x1 and x2 are the numbers of subjects sharing RCHHs in the patient pool and the control pool, respectively, and n1 and n2 are the total numbers of subjects in the patient pool and the control pool, respectively. The P value was calculated by

\[
P = \int_{u_0}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx .
\]

To investigate the utility of the HH in a family analysis, we studied a family that has multiple patients with Marfan syndrome. Marfan syndrome is an autosomal dominant disease characterized by an abnormality in the connective tissue. Mutations of either the fibrillin-1 gene (FBN-1) on 15q21.1 or of the TGF-β type II receptor gene (TGFBR2) on 3p24.2 are known to be the cause of this syndrome [12: Hayward C, Brock DJ. Fibrillin-1 mutations in Marfan syndrome and other type-1 fibrillinopathies. Hum Mutat. 1997;10:415-423]. The family had been studied, and six symptomatic members and three asymptomatic carriers had a heterozygous 1879C→T (R627C) mutation in FBN-1 (fig. 4A). Subject I-1 is considered to be the common ancestor for the disease gene. The questions we posed were as follows: (i) Could the HH identify the region containing FBN-1 by the data from six symptomatic members? and (ii) Could the HH further narrow the region with the inclusion into the analysis of three asymptomatic carriers?

Before beginning our analysis, we checked the accuracy of the genotyping data. Equation (1) indicates that a parent and a child share the same HH along the entire length of their autosomes. Therefore, the ratio of the number of the matched compSNPs to the total number of compSNPs indicates the accuracy of genotyping. We studied three pairs: II-1 and III-4, II-3 and III-1, and II-5 and III-3.
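The parent-child accuracy check can be sketched as a concordance ratio over compSNPs. The genotype coding, the per-call confidence lists, and the function name `compsnp_concordance` are assumptions made for this example. The only rule taken from the text is that calls with smaller confidence values are more reliable, so a call is kept when its confidence value is below the cutoff.

```python
from typing import Sequence

HOMOZYGOUS = {"AA", "BB"}  # assumed genotype coding


def compsnp_concordance(geno_a: Sequence[str], conf_a: Sequence[float],
                        geno_b: Sequence[str], conf_b: Sequence[float],
                        conf_cutoff: float = 0.05) -> float:
    """Fraction of compSNPs with matching alleles in two subjects.

    A SNP is used only if both calls have confidence values below
    conf_cutoff and both genotypes are homozygous.
    """
    comp = matched = 0
    for ga, ca, gb, cb in zip(geno_a, conf_a, geno_b, conf_b):
        if ca >= conf_cutoff or cb >= conf_cutoff:
            continue  # unreliable call in at least one subject
        if ga in HOMOZYGOUS and gb in HOMOZYGOUS:
            comp += 1
            if ga == gb:
                matched += 1
    if comp == 0:
        raise ValueError("no compSNPs pass the confidence cutoff")
    return matched / comp


parent = ["AA", "BB", "AB", "AA", "BB", "AA"]
parent_conf = [0.01, 0.02, 0.01, 0.01, 0.30, 0.01]
child = ["AA", "BB", "AA", "BB", "BB", "AA"]
child_conf = [0.01] * 6
print(compsnp_concordance(parent, parent_conf, child, child_conf))
```

For a true parent-child pair, any mismatched compSNP (such as the fourth SNP in this toy input) points to a genotyping error, which is why the ratio can serve as an accuracy estimate.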
In a GeneChip analysis, the genotyping result for each SNP is accompanied by a confidence value; the smaller the confidence values, the more reliable the data. The concordance ratios of the compSNPs in three parent-child pairs for a range of confidence-value cutoffs are shown in figure 4B and 4C. We chose a confidence-value cutoff of .05. Secondly, we determined the cutoff value for defining the RCHHs (hereafter, an “RCHH cutoff”). Figure 4D shows the relationship of the RCHH cutoffs and three types of errors. We chose 3.0 cM because it gave small rates of type A and type B false positives with an acceptable value for the false negatives. We then analyzed six symptomatic patients. In figure 5, we present the result stepwise. Patients II-3 and III-1 are a parent-child pair who share RCHHs over the entire length of the autosomes (fig. 5A). II-1 and II-2 are siblings whose RCHHs occupy 81% of the autosomes (eq. [1] predicted 75%) (fig. 5B). II-2 and III-1 are an aunt and nephew whose RCHHs occupy 56% of the autosomes (eq. [1] predicted 50%) (fig. 5C). III-2 and III-3 are first cousins whose RCHHs occupy 39% of the autosomes (eq. [1] predicted 25%) (fig. 5D). The RCHHs conserved among all symptomatic members contained 96% of the total length of the RCA (calculated from table A1), and they did indeed contain FBN-1 (fig. 5E). The inclusion of asymptomatic carriers (II-4, II-5, and III-4) (see fig. 4A) did further narrow the RCHHs (fig. 5F). 
These results demonstrate that HH analysis is both efficient and intuitive for identifying the location of disease genes in a large family.Table A1Ratios of False Negatives to the Total Length of the RCAs for a Range of RCHH CutoffsRatio of False Negatives to Total Length of the RCAs, by RCHH Cutoff (in cM)Calculation Method and Variable(s).2.4.6.811.21.41.61.822.22.42.62.833.23.43.63.844.24.44.64.855.25.45.65.86Monte Carlom+nmn220.000.000.000.000.000.000.000.000.000.001.001.001.001.001.001.001.002.002.002.002.003.003.003.003.004.004.004.004.005.00511.000.000.000.000.000.000.001.001.001.001.001.001.002.002.002.003.003.003.004.004.004.005.005.005.006.006.007.007.008.008330.000.000.000.000.000.001.001.001.001.002.002.002.003.003.003.004.004.005.005.006.007.007.008.008.009.010.011.012.012.01321.000.000.000.000.001.001.001.001.002.002.003.003.004.004.005.005.006.007.008.008.009.010.011.012.013.014.015.016.017.018440.000.000.000.000.001.001.001.002.002.003.003.004.005.005.006.007.008.009.010.011.012.013.014.015.017.018.019.021.022.02431.000.000.000.001.001.001.002.002.003.004.004.005.006.007.008.009.010.011.013.014.015.017.018.019.021.022.024.026.028.02922.000.000.000.001.001.002.002.003.004.004.005.006.007.009.010.011.013.014.015.017.019.020.022.024.026.028.030.032.034.036550.000.000.000.001.001.002.002.003.004.004.005.006.007.008.010.011.012.014.015.017.019.020.022.024.026.028.030.032.034.03741.000.000.001.001.001.002.003.004.004.005.007.008.009.010.012.013.015.017.019.020.022.025.027.029.031.034.036.039.041.04432.000.000.001.001.002.002.003.004.005.006.008.009.011.012.014.016.018.020.022.024.026.029.031.034.036.039.042.044.047.050660.000.000.001.001.002.002.003.004.005.006.008.009.010.012.014.016.018.020.022.024.026.029.031.034.036.039.041.044.047.05051.000.000.001.001.002.003.004.005.006.007.009.011.013.014.016.018.020.023.025.028.030.033.036.039.042.045.049.052.056.05942.000.000.001.001.002.003.004.006.007.009.010.012.014.016.018.021.023.026.029.032.035.038.041.044.
048.051.055.059.062.06633.000.000.001.001.002.003.004.006.007.009.010.012.014.016.019.021.024.026.029.032.035.038.042.045.048.052.055.059.063.066761.000.000.001.002.003.004.005.006.008.010.011.013.016.018.021.023.026.029.033.036.040.043.046.050.054.058.063.067.071.07552.000.000.001.002.003.004.006.007.009.011.013.016.018.021.024.027.030.034.037.041.044.048.052.056.061.065.070.075.080.08443.000.000.001.002.003.004.006.007.009.011.013.016.018.021.024.026.030.033.037.040.044.048.052.056.061.065.069.074.078.083862.000.001.001.003.004.005.007.009.012.014.017.020.023.026.029.033.037.041.046.050.055.060.064.069.075.081.086.092.097.10353.000.001.001.002.004.005.007.009.011.014.017.019.022.026.029.032.036.040.045.050.054.059.063.069.074.080.085.091.096.10144.000.001.001.002.004.005.007.009.011.014.017.019.022.026.029.032.036.040.045.049.054.059.063.068.073.079.084.089.094.099963.000.001.002.003.005.007.009.012.015.017.020.024.027.031.035.039.044.049.055.061.067.073.079.085.092.098.104.111.117.12454.000.001.002.003.004.006.009.011.013.016.019.022.026.030.034.038.043.047.052.057.063.069.074.081.088.094.100.105.111.1181064.000.001.002.004.006.008.011.014.017.021.024.028.032.036.041.046.051.056.062.068.074.081.090.097.104.111.119.126.134.14255.000.001.002.003.005.007.011.014.016.019.023.027.031.035.040.045.051.056.062.068.074.081.088.095.102.110.117.124.131.1381165.000.001.002.004.007.009.013.016.019.023.027.032.037.042.047.053.060.066.072.080.089.096.103.110.117.125.133.141.150.1581266.000.001.003.005.008.010.014.019.023.027.032.037.042.049.056.063.069.077.084.092.101.111.120.128.137.145.152.159.166.173Eq. 
(A1), rows indexed by m+n; the 30 values in each row correspond, left to right, to RCHH cutoffs of .2, .4, .6, ..., 6 cM:
m+n=12: .000 .001 .002 .004 .007 .009 .013 .016 .020 .025 .029 .034 .040 .045 .051 .057 .064 .070 .077 .084 .091 .099 .106 .114 .122 .130 .138 .146 .154 .163
m+n=13: .000 .001 .003 .005 .008 .011 .015 .019 .023 .028 .034 .040 .046 .052 .059 .066 .073 .081 .088 .096 .104 .113 .121 .130 .139 .148 .157 .166 .175 .184
m+n=14: .000 .002 .003 .006 .009 .013 .017 .022 .027 .033 .039 .045 .052 .059 .067 .075 .083 .091 .100 .109 .118 .127 .137 .146 .156 .166 .175 .185 .196 .206
m+n=15: .000 .002 .004 .007 .010 .014 .019 .025 .031 .037 .044 .051 .059 .067 .075 .084 .093 .103 .112 .122 .132 .142 .152 .163 .173 .184 .195 .206 .217 .228
m+n=16: .000 .002 .004 .008 .012 .016 .022 .028 .034 .041 .049 .057 .066 .075 .084 .094 .104 .114 .125 .135 .146 .157 .168 .180 .191 .203 .214 .226 .238 .250
m+n=17: .001 .002 .005 .008 .013 .018 .024 .031 .038 .046 .055 .064 .073 .083 .093 .104 .115 .126 .137 .149 .161 .173 .185 .197 .209 .222 .234 .247 .259 .272
m+n=18: .001 .002 .005 .009 .014 .020 .027 .034 .042 .051 .060 .070 .081 .091 .103 .114 .126 .138 .150 .163 .175 .188 .201 .214 .228 .241 .254 .267 .280 .294
m+n=19: .001 .003 .006 .010 .016 .022 .030 .038 .047 .056 .066 .077 .088 .100 .112 .125 .137 .150 .163 .177 .190 .204 .218 .232 .246 .260 .274 .288 .302 .316
m+n=20: .001 .003 .007 .012 .018 .025 .033 .041 .051 .062 .073 .084 .096 .109 .122 .135 .149 .163 .177 .191 .206 .220 .235 .250 .264 .279 .294 .308 .323 .337
m+n=25: .001 .005 .010 .018 .027 .037 .049 .062 .075 .090 .106 .122 .139 .156 .173 .191 .209 .228 .246 .264 .283 .301 .319 .337 .355 .373 .391 .408 .425 .442
m+n=30: .002 .007 .014 .025 .037 .051 .067 .084 .103 .122 .142 .163 .184 .206 .228 .250 .272 .294 .316 .337 .359 .380 .401 .422 .442 .462 .482 .501 .519 .537
m+n=35: .002 .009 .019 .033 .049 .067 .087 .109 .132 .156 .180 .206 .231 .257 .283 .308 .334 .359 .384 .408 .432 .455 .478 .501 .522 .543 .563 .583 .602 .620
m+n=40: .003 .012 .025 .041 .062 .084 .109 .135 .163 .191 .220 .250 .279 .308 .337 .366 .394 .422 .449 .475 .501 .525 .549 .572 .594 .615 .636 .655 .674 .692
m+n=50: .005 .018 .037 .062 .090 .122 .156 .191 .228 .264 .301 .337 .373 .408 .442 .475 .507 .537 .566 .594 .620 .645 .669 .692 .713 .733 .751 .769 .785 .801
m+n=60: .007 .025 .051 .084 .122 .163 .206 .250 .294 .337 .380 .422 .462 .501 .537 .572 .605 .636 .665 .692 .717 .740 .762 .782 .801 .818 .834 .849 .862 .874

Each multigene disease has a specific genetic structure.
Some are considered to be a collection of single-gene diseases of which the phenotypes are indistinguishable from each other. In others, several genes working together are required to produce symptoms [13: Pritchard DJ, Korf BR. Medical genetics at a glance. Blackwell Publishing, Birmingham, United Kingdom; 2003]. In either case, a subgroup of patients may share a disease gene from a common ancestor.

To investigate the utility of the HH in multigene diseases, we investigated a model multigene disease (fig. 6A). Here, SNP rs16823424 (the 100,000th SNP on 500k GeneChip) is the location of the disease gene. In this region, 15 patients share an RCA derived from a common ancestor who lived 10 generations ago. The genotyping data around rs16823424 in these 15 patients were replaced with the genotyping data at the corresponding position from a specific person who, in this analysis, acts as the common ancestor (fig. 6B). The lengths of the replacements were taken at random from an exponential distribution (eq. [2]: m=10). Therefore, comparison of two patients corresponds with the situation m=n=10 in fig. 2B. The patient pool included these 15 subjects together with 30 unrelated subjects (fig. 6C). The control pool consists of 45 unrelated Japanese samples obtained from the Affymetrix Web site.

Our aim was to identify the rs16823424 region. The strategy was as follows. In step 1, we divided autosomal regions into minute regions. In step 2, using the patient pool, we identified the HH shared by the greatest number of subjects for each region (i.e., the most common HH). We then concatenated the most common HHs for each region into a virtual HH for the entire autosome. A virtual subject who has this virtual HH was named “the representative” (step 1 in fig. 6C). We then counted, for each region, the subjects who shared RCHHs with “the representative” in both the patient pool and the control pool (step 2 in fig. 6C).
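The per-region counts from the two pools feed the statistic u0 defined in the methods. The following is a minimal Python sketch of that comparison. It assumes the usual two-proportion form with a square root in the denominator, which is the most natural reading of the flattened formula. The function name `pool_difference_p_value` is hypothetical, and the one-sided upper tail of the standard normal distribution is computed with `math.erfc`.

```python
import math


def pool_difference_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """One-sided P value for more RCHH sharing in the patient pool.

    x1, x2: subjects sharing RCHHs with the representative in the patient
            and control pools; n1, n2: pool sizes.
    """
    p1 = (x1 + 0.5) / (n1 + 1)
    p2 = (x2 + 0.5) / (n2 + 1)
    p = (x1 + x2 + 0.5) / (n1 + n2 + 1)
    # assumed form: pooled standard error under the null hypothesis
    u0 = (p1 - p2) / math.sqrt(p * (1.0 - p) * (1.0 / n1 + 1.0 / n2))
    # upper tail of the standard normal: 1 - Phi(u0)
    return 0.5 * math.erfc(u0 / math.sqrt(2.0))


# Illustrative counts only; these are not results from the article.
p_value = pool_difference_p_value(x1=15, n1=45, x2=0, n2=45)
print(f"P = {p_value:.2e}, -log10(P) = {-math.log10(p_value):.2f}")
```

Equal sharing in both pools gives u0 = 0 and P = 0.5, so only regions where the patient pool shares clearly more than the control pool produce large −log10(P) values.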
Finally, the differences between the pools were expressed by P values. The candidate region for the disease gene is the region that has the lowest P value (i.e., the greatest −log10(P) value) in the entire autosome.

Before the analysis, we determined an appropriate RCHH cutoff (fig. 6D). Here, the false negatives were plotted for several ages of common ancestors (the ages are expressed by m and n). As the number of generations increases, the RCA shortens and becomes more difficult to detect. Because an RCHH cutoff of 5 cM was considered suitable for m=n=10, this value is used hereafter. The ratio of false negatives to the total length of the RCAs decreases as more SNPs are included in the analysis. This will be discussed later.

We then performed the analysis. Figure 6E is a densitogram of the −log10(P) value; the denser the areas are, the higher the significance. The rs16823424 region provided a −log10(P) of 4.48 and was the only region with −log10(P)>3.0 (i.e., P<.001). The greatest −log10(P) outside of the rs16823424 region was 2.92, which provides the background of the analysis.

Next, we investigated the detection limit. For each number from 7 to 15, we constructed 100 patient pools in which that number of patients, out of 45, shared an RCA at the rs16823424 region. When the number is 30 (fig. 6D). The number of type A false positives is reduced as the number of SNPs increases (fig. 8). (The rate of type B false positives is heavily dependent on the actual genotyping data and thus was not plotted.) A larger number of SNPs will allow us to use a smaller RCHH cutoff. Figure 8 suggests that the genotyping data of 1,000,000 SNPs may expand the range of analysis to m+n>60.

In the model multigene diseases, we used the patient pool containing 45 subjects. However, smaller numbers of subjects worked as well.
For example, a pool of 18 subjects containing 6 subjects sharing an RCA clearly provided sufficient signal, although with a higher background (data not shown).

The four major methods for the identification of disease genes are haplotype analysis, linkage analysis, sib-pair analysis, and whole-genome association studies [15: Strachan T, Read A. Human molecular genetics. Garland Science/Taylor & Francis Group, Oxfordshire, United Kingdom; 2003]. The former two methods target single-gene diseases occurring in families (usually m+n<6), whereas the latter two methods target both single-gene and multigene diseases occurring in the general population. When m+n 12 (see table A1; compare the values for m+n=12).

Given that N_SNP is the total number of SNPs on a genotyping chip, and P_n and Q_n are the frequencies of the major and minor alleles for the nth SNP, respectively, the average frequencies of the major alleles (F_major allele) and the minor alleles (F_minor allele) are

\[
F_{\text{major allele}} = \frac{\sum_{n=1}^{N_{\mathrm{SNP}}} P_n}{N_{\mathrm{SNP}}}
\quad \text{and} \quad
F_{\text{minor allele}} = \frac{\sum_{n=1}^{N_{\mathrm{SNP}}} Q_n}{N_{\mathrm{SNP}}},
\]

respectively. The number of mismatched compSNPs (N_mismatched compSNP) is approximated by

\[
N_{\text{mismatched compSNP}} \approx 2 \left(F_{\text{major allele}}\right)^2 \left(F_{\text{minor allele}}\right)^2 \frac{N_{\mathrm{Pt1}}\, N_{\mathrm{Pt2}}}{N_{\mathrm{SNP}}},
\]

where N_Pt1 and N_Pt2 are the numbers of SNPs successfully genotyped for Pt1 and Pt2, respectively. N_mismatched compSNP is not a large number. For example, with use of the 500k GeneChips from Affymetrix, N_mismatched compSNP is 22,000 at maximum, spaced at 0.16 cM on average. This spacing is larger than most of the haplotype blocks, and thus the mismatched compSNPs are assumed to be randomly distributed over the entire autosome. The length between two mismatched compSNPs is considered to have an exponential distribution with a probability density function of

\[
f(x) = \lambda e^{-\lambda x}, \qquad \lambda = \frac{N_{\text{mismatched compSNP}}}{L_{\text{autosome}}},
\]

where L_autosome is the entire genetic length of the autosomes. Therefore, for the cutoff value c,

\[
R_{\text{Type A false positives}} = \frac{\int_{c}^{\infty} x f(x)\, dx}{\int_{0}^{\infty} x f(x)\, dx} = (1 + \lambda c)\, e^{-\lambda c}.
\]
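The two appendix formulas above can be evaluated directly. The sketch below is illustrative only, and the function names are hypothetical. The total autosomal length of 3,500 cM is an assumption, chosen because it is roughly what 22,000 mismatched compSNPs at 0.16 cM spacing imply; it is not a figure stated in the text.

```python
import math


def expected_mismatched_compsnps(f_major: float, f_minor: float,
                                 n_pt1: int, n_pt2: int, n_snp: int) -> float:
    """Approximate number of mismatched compSNPs between two subjects."""
    return 2.0 * f_major ** 2 * f_minor ** 2 * n_pt1 * n_pt2 / n_snp


def type_a_false_positive_ratio(n_mismatched: float, autosome_cm: float,
                                cutoff_cm: float) -> float:
    """Closed form (1 + lambda*c) * exp(-lambda*c), lambda = N / L."""
    lam = n_mismatched / autosome_cm
    return (1.0 + lam * cutoff_cm) * math.exp(-lam * cutoff_cm)


AUTOSOME_CM = 3500.0  # assumed total genetic length; not given in the text

for cutoff in (0.5, 1.0, 3.0, 5.0):
    ratio = type_a_false_positive_ratio(22000, AUTOSOME_CM, cutoff)
    print(f"cutoff {cutoff:.1f} cM -> type A ratio {ratio:.3g}")
```

Because the ratio falls off exponentially in the product of λ and c, denser genotyping (a larger N_mismatched compSNP) allows a smaller cutoff for the same type A error, which agrees with the remark that more SNPs permit a smaller RCHH cutoff.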
An RCHH containing an RCA is expected to have type B false positives with a length of (cutoff value)/2 on each end. It is impossible to distinguish RCHHs that contain the RCAs from those that do not (i.e., the type A false positives). We calculated R_Type B false positives under the assumption that every RCHH contains an RCA. Therefore, the R_Type B false positives calculation results in an overestimation, which we consider to be more appropriate than an underestimation when the appropriate RCHH cutoff is being determined.

The easiest way to compare the patient pool and the control pool is to directly compare the number of patients sharing the RCHHs at the given position (fig. A2A). This algorithm usually works, but it reduces the sensitivity. Assume that, at a specific position, the patient pool has 4 subjects sharing HH1 and 0 subjects sharing HH2, whereas the control pool has 0 subjects sharing HH1 and 4 subjects sharing HH2. Although the two pools differ in their frequency of HH1, the algorithm shown in figure A2A does not detect the difference. One way to solve this problem is to have a representative, as shown in fig. A2B, as we did in this study. For the actual algorithm, please see the program source code. This algorithm may have difficulty picking up the most common HHs in a region where there is no dominant HH but only many kinds of HHs with low frequencies, which we think does not cause any major problems. We have also provided the source code for an alternative algorithm. The source may be modified according to your uses.

Crossover interference increases the average size of the RCA and favors the RCHHs in detecting the RCA, which reduces the false negatives. This results in a better performance in HH analysis. Figure A3 shows an RCA in one generation. The RCA may be reduced in size in the next generation.
The reduction occurs by two processes: (1) crossover occurs in one or both subjects and (2) multiple crossovers occur in one or both subjects. Although occurrence of process 2 may be suppressed by crossover interference, process 1 is independent of the interference and is not suppressed. Moreover, as the size of the shared segments from the common ancestor (shown in gray in fig. A3) shortens over generations, multiple crossovers in a single RCA become less frequent, even without crossover interference, and process 1 becomes the main determinant of the size of the RCAs. Therefore, crossover interference has a limited effect on HH analysis, and so we chose not to make any adjustment in the algorithm.