Mapping Tumor-Suppressor Genes with Multipoint Statistics from Copy-Number–Variation Data
2006; Elsevier BV; Volume 79, Issue 1; Language: English
DOI: 10.1086/504354
ISSN: 1537-6605
Authors: Iuliana Ionita, Raoul-Sam Daruwala, Bud Mishra
Topic(s): Genomics and Phylogenetic Studies
Abstract: Array-based comparative genomic hybridization (arrayCGH) is a microarray-based technique that has been used to compare tumor genomes with normal genomes, thus providing rapid genomic assays of tumor genomes in terms of copy-number variations of those chromosomal segments that have been gained or lost. When properly interpreted, these assays are likely to shed important light on genes and mechanisms involved in the initiation and progression of cancer. Specifically, chromosomal segments deleted in one or both copies of the diploid genomes of a group of patients with cancer point to locations of tumor-suppressor genes (TSGs) implicated in the cancer. In this study, we focused on automatic methods for reliable detection of such genes and their locations, and we devised an efficient statistical algorithm to map TSGs, using a novel multipoint statistical score function. The proposed algorithm estimates the location of TSGs by analyzing segmental deletions (hemi- or homozygous) in the genomes of patients with cancer and the spatial relation of the deleted segments to any specific genomic interval. The algorithm assigns, to an interval of consecutive probes, a multipoint score that parsimoniously captures the underlying biology. It also computes a P value for every putative TSG by using concepts from the theory of scan statistics. Furthermore, it can identify smaller sets of predictive probes that can be used as biomarkers for diagnosis and therapeutics. We validated our method using several simulated data sets and one real data set, and we report encouraging results. We discuss how, with suitable modifications to the underlying statistical model, this algorithm can be applied generally to a wider class of problems (e.g., detection of oncogenes).
The process of carcinogenesis imparts many genetic changes to a cancer genome at many different scales: point mutations, translocations, segmental duplications, and deletions.
Whereas most of these changes have no direct impact on cellular functions, and may not contribute to carcinogenesis in any obvious manner, a few of these chromosomal aberrations have a disproportionately significant impact on the cell's ability to initiate and maintain processes involved in tumor growth; namely, through its ability to proliferate, escape senescence, achieve immortality, and signal to neighboring cells. Two classes of genes are critically involved in cancer development and are discernible in terms of their copy-number variations (CNVs): oncogenes, which are activated or altered in function, and tumor-suppressor genes (TSGs), which are deactivated in cancer cells. The effect of oncogenes is via gain-of-function mutations that lead to malignancy. For instance, a segmental amplification can increase the genomic copy number of a region containing an oncogene, thus leading to overexpression of the oncogene product. The mutation is dominant; that is, only one mutated allele is necessary for the cell to become malignant. TSGs affect cells via mutations (often involving segmental deletions) that contribute to malignancy through loss of function of both alleles of the gene. The "two-hit" hypothesis of Knudson (Mutation and cancer: statistical study of retinoblastoma. Proc Natl Acad Sci USA 1971;68:820-823) for tumorigenesis has been widely recognized as an important model of such losses of function involved in many cancers. Whole-genome-scale data and their computational analysis can now lead to rapid discovery and characterization of important genetic changes at significantly higher resolution, thus providing a systems-level understanding of the roles of oncogenes and TSGs in cancer development and its molecular basis.
As an example, whereas the BRCA1 and BRCA2 TSGs provide a better understanding of familial breast cancer, and other TSGs, including PTEN and p53, do so for sporadic breast cancer, we still lack a reasonably complete picture, since many important components remain undiscovered. Whole-genome analysis, now possible through array-based comparative genomic hybridization (arrayCGH) experiments, can remedy the situation by shedding light on many more genes and their interrelationships. In the current whole-genome analysis setup, microarray techniques are being used successfully to measure fluctuations in copy number for a large number of genomic regions in one genome relative to a different but related genome sample. For example, arrayCGH can map copy-number changes at a large number of chromosomal locations in one genome with respect to a reference genome and, from them, extrapolate to infer segments of the genome that have undergone the same degree of amplification or deletion. For references to and discussions of algorithms that estimate these CNVs, see Daruwala et al. (A versatile statistical analysis algorithm to detect genome copy number variation. Proc Natl Acad Sci USA 2004;101:16292-16297). In the present article, we examine how these CNV data can be used to identify TSGs. The intuitive basis of our approach can be easily stated, as follows. Suppose we have whole-genome CNV data for several patients who suffer from the same specific class of cancer, putatively caused by loss of function in both alleles of the same TSG. In that case, the loss-of-function event may have many underlying causes: for instance, a nonsynonymous point mutation in an exon, a mutation in the regulatory region, a small insertion-deletion event in the coding region, or a relatively large segmental deletion event that affects one or many exons of the gene.
In each case, the phenotypic result will be similar, but the whole-genome analysis will identify only segmental deletion events, which exhibit themselves through reduced copy-number values for genomic intervals. For any such deleted segment to effect a loss of function in the TSG, it must overlap with the genomic interval corresponding to the TSG. Even though events representing small, undetectable mutations will go unnoticed, by accounting for the CNVs, a suitable algorithm can infer the location of the TSG implicated in the disease. Our approach exploits these topological relationships among the genomic intervals and works by enumerating all possible intervals in the genome and then evaluating them with a score function that measures the likelihood of an interval being exactly the TSG. The mathematical derivation and properties of this score function appear in the appendix (online only). The rest of the article is organized as follows. We first present a formal description of the score function, which we have only intuitively sketched so far (see the "Methods" section), and then show how this function is used in evaluation of whether a region represents a TSG. Next, we illustrate our method, using this score function and several sets of simulated data, computed under a wide variety of scenarios (see the "Results" section); we also assess the power of the method by examining how accurately it discovers the true location (which is known to the simulator) of the TSG. Finally, we analyze and report the results from an arrayCGH data set (using 100K Affy-chips), obtained from several patients with lung cancer. We conclude with a discussion of the strengths and weaknesses of the proposed method (see the "Discussion" section). Our method for the identification of TSGs relies on a multipoint score function, computed over whole-genome-analysis data for a sufficiently large group of patients suffering from the same form of cancer.
In the following section, we present a systematic derivation of this score function, starting with a few simple assumptions about the underlying biology and the data. For any interval I (represented as a set of consecutive probes), we wish to quantify the strength of the association between deletions in I and the disease by analyzing the genomic data for many diseased individuals. For this purpose, we select a metric, the relative risk (RR), as it compares and assigns a numerical value to the risks of disease in two populations with respect to each other: the first population comprises subjects whose genomes contain a segmental deletion in the interval I, and the second comprises subjects whose genomes have no such segmental deletion in I:

$$RR_{I\,\text{deleted}} = \ln\frac{P(\text{disease}\mid I\ \text{deleted})}{P(\text{disease}\mid I\ \text{NOT deleted})} = \ln\!\left[\frac{P(I\ \text{deleted}\mid\text{disease})}{P(I\ \text{NOT deleted}\mid\text{disease})}\times\frac{P(I\ \text{NOT deleted})}{P(I\ \text{deleted})}\right] = \ln\!\left[\frac{P(I\ \text{deleted}\mid\text{disease})}{P(I\ \text{NOT deleted}\mid\text{disease})}\right] + \left\{-\ln\!\left[\frac{P(I\ \text{deleted})}{P(I\ \text{NOT deleted})}\right]\right\} . \quad (1)$$

We caution the reader that, in an abuse of definition, we will frequently use the shortened phrase "I deleted" to mean that "at least a part of I is deleted." The first term in equation (1) can be estimated from the tumor samples available:

$$\frac{P(I\ \text{deleted}\mid\text{disease})}{P(I\ \text{NOT deleted}\mid\text{disease})} = \frac{n_{I\ \text{deleted}}}{n_{I\ \text{NOT deleted}}} , \quad (2)$$

where $n_{I\ \text{deleted}}$ (or $n_{I\ \text{NOT deleted}}$) is simply the number of tumor samples in which I is deleted (or not deleted). The second part of equation (1), $-\ln[P(I\ \text{deleted})/P(I\ \text{NOT deleted})]$, incorporates prior information inherent in the statistical distribution of deletions. For instance, we may note that, if I is a small interval, then $P(I\ \text{deleted}) \ll P(I\ \text{NOT deleted})$ and, hence, this prior term is a large positive number. Similarly, if I is very large, then the situation is reversed and the prior term becomes a large negative number.
Consequently, the prior information, included in the distribution of random unrelated deletions in the genome, is reflected through an advantage accrued to small intervals; in other words, under the assumption that the same strength of evidence exists in the data for different sizes, preference is given to the smaller intervals. To derive a computational procedure for this prior score, we rely on a probabilistic model of how the genomic data may have been generated. In this simple, parsimonious model, we assume that, at any genomic location, a breakpoint may occur as a Poisson process at a rate of $\mu \geq 0$. At the places where any of these breakpoints start, a segmental deletion may occur, the length of which is distributed as an exponential random variable with a parameter $\lambda \geq 0$. Note the following lemma.

Lemma 1. Under the assumption of the generative process described above, the probability that an interval $I=[a,b]$ (in the genome) is deleted can be expressed as

$$P([a,b]\ \text{deleted}) = 1 - e^{-\mu(b-a)} \, e^{-\mu a \frac{1-e^{-\lambda a}}{2\lambda a}} \, e^{-\mu(G-b) \frac{1-e^{-\lambda(G-b)}}{2\lambda(G-b)}} , \quad (3)$$

where $[0,G]$ represents the region of interest (e.g., a chromosome) and $[a,b]$ is a specific genomic interval in this region. See the appendix (online only) for a proof of the lemma. Using equations (2) and (3), we can now compute the score function $RR_{I\ \text{deleted}}$ for an interval I. The parameters $\mu$ and $\lambda$, which appear in the score function, are assumed to have been estimated from data by a procedure described in the next section. In figure 1, we show how the additional prior score $-\ln[P(I\ \text{deleted})/P(I\ \text{NOT deleted})]$, computed using equation (3) in the lemma, varies as a function of the length of the interval. All the parameters ($\mu$, $\lambda$, and G) are the same as those in the simulation examples in the "Results" section. Figure 1 emphasizes the significantly higher prior advantage given to intervals of smaller length. Clearly, we expect the high-scoring intervals determined by this method to be treated as candidates for TSGs.
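As an illustration, the score of equations (1)-(3) can be evaluated with a short routine. The following Python sketch (function names and parameter values are ours, and equation (3) is transcribed as reconstructed above, so treat the exact closed form as an assumption) computes the prior deletion probability for an interval and the resulting RR score:

```python
import math

def p_deleted(a, b, G, mu, lam):
    """Prior probability that interval [a, b] of a chromosome [0, G] overlaps
    at least one deletion, per equation (3): breakpoints occur as a Poisson
    process with rate mu, and deletion lengths are Exponential(lam)."""
    def edge(d):
        # Chance that no deletion starting in a flanking region of length d
        # extends far enough to reach the interval.
        if d == 0:
            return 1.0
        return math.exp(-mu * d * (1.0 - math.exp(-lam * d)) / (2.0 * lam * d))
    # No breakpoint inside [a, b], and no deletion reaching in from either flank.
    p_not_deleted = math.exp(-mu * (b - a)) * edge(a) * edge(G - b)
    return 1.0 - p_not_deleted

def rr_score(n_del, n_not_del, a, b, G, mu, lam):
    """Multipoint score RR_{I deleted} of equation (1): the empirical log-odds
    from tumor samples (equation 2) plus the length-dependent prior term."""
    p = p_deleted(a, b, G, mu, lam)
    return math.log(n_del / n_not_del) - math.log(p / (1.0 - p))
```

With equal evidence (the same deletion counts), a shorter interval receives a strictly higher score, which is exactly the small-interval preference that figure 1 illustrates.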
We still need to define precisely how, and how many of, these intervals should be selected and then evaluated for their statistical significance. In the preceding section, we defined a score for an interval I ($RR_{I\ \text{deleted}}$), which depends on two extraneous parameters that describe a background genome-reorganization process. These two parameters, namely $\lambda$ and $\mu$, must be estimated from arrayCGH data. We recall that $\lambda$ is the parameter of the exponential distribution for generating deletions (that is, $1/\lambda$ is the average length of a deletion) and that $\mu$ is the parameter of the Poisson process used for generating the breakpoints (that is, $\mu$ is the mean number of breakpoints per unit length). Recently, several statistically powerful algorithms have been devised to analyze arrayCGH data and to render the underlying genome in terms of segments of regions of similar copy numbers. These algorithms readily yield an output that can be interpreted as alternating normal and abnormal segments, with the abnormal segments falling into two groups: segmental losses and segmental gains. If these segments satisfy the assumptions regarding the breakpoint and length distributions, the desired parameters $\mu$ and $\lambda$ can be estimated empirically from the segmentation of the data. Certain Bayesian algorithms, such as the one proposed by Daruwala et al. (Proc Natl Acad Sci USA 2004;101:16292-16297) and its variants (T. S. Anantharaman, M. Sobel, and B.M., unpublished data), include these assumptions in their prior and are thus able to estimate these parameters directly. The present algorithm builds on the latter class of segmentation algorithms but is not limited by this requirement.
In addition to estimating $\lambda$ and $\mu$, we also use the segmentation of individual samples to obtain the positions of the breakpoints (points where deletions start) in each sample and use these positions to assess the statistical significance of our results. The estimation procedure proceeds in a sequence of steps. In the first step, the algorithm computes the scores ($RR_{I\ \text{deleted}}$) for all the intervals I, with lengths taking values in a range determined by a lower and an upper bound, starting with small intervals containing a few markers and ending with very long intervals. We have evaluated two different statistical methods designed to estimate the location of the TSGs. The first and simplest method operates by choosing the maximum-scoring interval as the candidate TSG; namely, it selects the interval I with maximum $RR_{I\ \text{deleted}}$ in a genomic region of interest (e.g., a chromosome or a chromosomal arm) as the most plausible location of a causative TSG. We refer to this method as the "Max method." The other method functions by estimating the locations of the left and right boundaries of the TSG, with use of two scoring functions, as described below. Two scores, $S_L(x)$ and $S_R(x)$, are computed for every marker position $x \in [0,G]$. The first value, $S_L(x)$, is to be interpreted as the confidence that the point x is the left boundary of a TSG; symmetrically, $S_R(x)$ is the confidence that the point x is the right boundary of a TSG. These scores are defined more formally as

$$S_L(x) = \sum_{I \in \mathcal{IL}_x} RR_{I\ \text{deleted}} ,$$

where $\mathcal{IL}_x$ is the set of intervals whose left boundary is exactly at the marker x. Similarly,

$$S_R(x) = \sum_{I \in \mathcal{IR}_x} RR_{I\ \text{deleted}} ,$$

where $\mathcal{IR}_x$ is the set of intervals whose right boundary is exactly at x. Using these two scores, we can estimate the true position of the TSG as the interval $[x^*_L, x^*_R]$, where, for the left (right) boundary, we choose the marker position $x^*_L = \arg\max_x S_L(x)$ ($x^*_R = \arg\max_x S_R(x)$) that maximizes the $S_L(x)$ ($S_R(x)$) score.
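This boundary-scoring step reduces to a simple accumulation. A minimal Python sketch (the data layout and function name are our own, not from the original implementation): given an RR score for every candidate interval, sum scores per left and per right boundary marker and take the argmax on each side:

```python
from collections import defaultdict

def lr_boundaries(interval_scores):
    """Given {(left_marker, right_marker): RR score} for all candidate
    intervals, accumulate the boundary confidences S_L(x) and S_R(x)
    and return the maximizing markers (x*_L, x*_R)."""
    SL, SR = defaultdict(float), defaultdict(float)
    for (left, right), score in interval_scores.items():
        SL[left] += score   # intervals whose left boundary is exactly `left`
        SR[right] += score  # intervals whose right boundary is exactly `right`
    x_left = max(SL, key=SL.get)
    x_right = max(SR, key=SR.get)
    return x_left, x_right
```

Because every candidate interval contributes its full score to both of its boundaries, markers that repeatedly delimit high-scoring intervals dominate the sums.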
We refer to this boundary-based method as the "LR method." Thus far, we have seen how to estimate the putative location of a TSG either by maximizing the RR scores over many intervals or by estimating other related scores that characterize the boundaries of the gene. Irrespective of which method is chosen, the result is always an interval consisting of some number of markers; in the following, the computed interval is referred to as "Imax." The final step of our algorithm determines whether this finding is statistically significant; that is, it assigns a P value to Imax. Unfortunately, there is no obvious or readily available approach for analytically computing a P value for an interval Imax. Therefore, the algorithm must rely on an empirical method to compute the statistical significance; namely, it computes the P value from the observed distribution of breakpoints along the chromosome (as given by the segmentation algorithm). It uses a null hypothesis that no TSG resides on the chromosome; consequently, the breakpoints can be expected to be uniformly distributed. Note that, if a detailed and complete understanding of a genomewide distribution of breakpoints were available, the following discussion and derivations could be adapted mutatis mutandis with little difficulty. However, to avoid any unnecessary biases in our estimators, we chose, for the time being, to focus on an uninformative prior only, as reflected in our assumptions. We may now note that, if Imax is indeed a TSG, then its neighborhood could be expected to contain an unusually large number of breakpoints, thus signifying the presence of a deviant region that cannot be explained simply as random fluctuations in the null distribution of breakpoints.
Therefore, after counting the number of breakpoints on the chromosome (N) and the number of breakpoints in the interval Imax (k) across all samples, we need to address the following question: how unusual is it to find k breakpoints in a region of length $w=|I_{max}|$, given that there are N breakpoints uniformly distributed across the chromosome? We answer this question using results from the theory of scan statistics (Glaz J, Naus J, Wallenstein S. Scan statistics. New York: Springer-Verlag, 2001), as follows. Let $S_w$ be the largest number of breakpoints in any interval of fixed length w (the interval contains a fixed number of markers). This statistic is commonly referred to as the "scan statistic" and provides the necessary tool for our computation. Using this notation, we answer the question we posed: namely, how likely is it that we have k (of N) breakpoints in some interval of length $w=|I_{max}|$? The probability of this event is exactly $P(S_w \geq k)$. Wallenstein and Neff (An approximation for the distribution of the scan statistic. Stat Med 1987;6:197-207) derived an approximation for $P(S_w \geq k)$, using the following notation. Let

$$b(k;N,w) = \binom{N}{k} w^k (1-w)^{N-k}$$

and

$$G_b(k;N,w) = \sum_{i=k}^{N} b(i;N,w) .$$

Then

$$P(S_w \geq k) \approx (k w^{-1} - N - 1)\, b(k;N,w) + 2\, G_b(k;N,w) , \quad (4)$$

which is accurate when $P(S_w \geq k) < 0.10$ and remains so even for larger values. Note that, for the above formula to be applicable, w must be a number between 0 and 1. Therefore, in our derivation below, we use a normalized w, computed as the number of markers in the interval Imax divided by the total number of markers on the chromosome. To illustrate how this approximation of the P value performs, in figure 2, we plot the calculated P values against different numbers of breakpoints k, while examining the effect of different window sizes w. We used the following assumptions: the total number of breakpoints is N=50, $k \in \{1, \ldots, 20\}$, and $w \in \{1/300, 1/200, 1/100, 1/50, 1/20, 1/10\}$.
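Equation (4) transcribes directly into code. The following Python sketch (function name ours; it assumes w has already been normalized to a fraction of the chromosome, as described above) computes the Wallenstein-Neff approximation:

```python
from math import comb

def scan_pvalue(k: int, N: int, w: float) -> float:
    """Wallenstein-Neff approximation to P(S_w >= k), equation (4): the
    probability that some window covering a fraction w of the chromosome
    contains k or more of the N uniformly scattered breakpoints."""
    def b(i: int) -> float:
        # Binomial probability of exactly i breakpoints in a window of size w.
        return comb(N, i) * w**i * (1 - w)**(N - i)

    G_b = sum(b(i) for i in range(k, N + 1))  # upper binomial tail
    return (k / w - N - 1) * b(k) + 2 * G_b
```

As stated above, the approximation is reliable in the significant regime (P < 0.10), which is precisely where it is used here.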
Since the computation of P values in equation (4) depends on the size of the interval w, and since the size $w=|I_{max}|$ of the interval Imax (found by either the Max or the LR method) might not be the optimal length (e.g., because of underestimation of the length of the TSG), we also examine intervals overlapping Imax but of slightly different lengths and then compute a P value as before. From the resulting P values, we choose the smallest (most significant) value to measure the statistical significance. To account for the fact that multiple window sizes have been tested, we apply a conservative Bonferroni adjustment to the P values (we multiply the P values by the number of window sizes, and we use windows with lengths of up to 10 markers in the analysis of both simulated and real data). We applied our method to both simulated data and real data. Below, we describe the data sources, data qualities, and computed results; all details are relegated to the appendix (online only). We simulated data according to the generative process described above. The simulation works on a growing population of cells, starting with an individual normal cell whose genome contains a single TSG at a known fixed position. As the simulation proceeds, it introduces breakpoints at different positions in the genome, each occurring as a Poisson process with rate parameter $\mu$. At each of these breakpoints, the simulation also postulates a deletion with length distributed as an exponential random variable with parameter $\lambda$. Once, in some cell in the population, both copies of the TSG become nonfunctional (either by homozygous deletion or by hemizygous deletion in the presence of other mutations), the resulting precancerous cell in the simulation starts to multiply indefinitely.
Over time, the new progenitor cells also incur other independent "collateral damage" (i.e., deletions). Finally, the simulator randomly samples the population for tumor cells, mimicking the microdissection process used by a physician and thus assuming that the collected sample exhibits a composition of different tumor cells and some normal cells as well. In our simulations, we assumed that even the normal cells have some random deletions, whereas the different tumor cells all come from the same ancestral precancerous cell (fig. 3). In all our simulations, we fixed the parameters as listed below.

N = 50: number of diseased individuals.
G = 100 Mb: length of the chromosome.
P = 10,000 or P = 5,000: total number of probes (implying average resolutions of 10 kb and 20 kb, respectively).
C = 100: total number of cells per tumor sample, with 70% tumor cells and 30% normal cells.
μG = 2: mean number of breakpoints per cell. (This value corresponds to the background deletions that occur after the TSG becomes nonfunctional.)
1/λ = 50 kb: mean length of a deletion.
TSG = [10.0 Mb, 10.1 Mb]: the TSG is represented by an interval starting at 10.0 Mb and with a length of 100 kb.

To the resulting copy numbers, we added independent Gaussian noise, ~N(0, 0.1²). The simulated data were segmented using the publicly available software described by Daruwala et al. (NYU Versatile MAP Segmenter; Proc Natl Acad Sci USA 2004;101:16292-16297). A segment was called "deleted" if the log₂ ratio of the segmental mean (test:normal) for that segment was less than a threshold value of log₂(1/2) = −1.0. Table 1 shows the different simulated scenarios we used.
They all share the same set of parameters described above, with an additional complexity to reflect differences in the composition of the starting population: some samples are assumed to be diseased because of mutations in the TSG (p_homozygous + p_hemizygous), and some samples are sporadic (p_sporadic). Among the samples with mutations in the TSG, some have homozygous deletions (p_homozygous), and some have only hemizygous deletion of the TSG (p_hemizygous). Furthermore, the sporadic samples are assumed not to have deletions in the TSG under investigation; that is, they have only background deletions.

Table 1. Six Simulated Models (Percentage of Sample)

Model   p_homozygous   p_hemizygous   p_sporadic
1       100            0              0
2       50             50             0
3       0              100            0
4       50             0              50
5       25             25             50
6       0              50             50

Note: p_homozygous represents the percentage of samples in the data set with homozygous deletions, p_hemizygous is the percentage of samples with hemizygous deletions, and p_sporadic is the proportion of samples with no deletion in the TSG under investigation (randomly diseased).

The performance of our method was evaluated by the Jaccard measure of overlap between the estimated position of the TSG and the real position used in the simulation. If E is the estimated interval and T is the true one, then the Jaccard measure is defined simply as

$$J(E,T) = \frac{|E \cap T|}{|E \cup T|} ,$$

where $|E \cap T|$ is the length of the interval common to both, that is, the interval $E \cap T$. We also tested the capacity of the inferred TSG as a possible biomarker for cancer detection or classification. More precisely, we measured, for a postulated TSG, its sensitivity, which is defined as the percentage of diseased samples that have the estimated TSG deleted.
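For concreteness, the overlap measure defined above can be computed as follows (a minimal sketch; the function name is ours):

```python
def jaccard(e_start, e_end, t_start, t_end):
    """Jaccard measure J(E, T) = |E ∩ T| / |E ∪ T| for two genomic
    intervals E = [e_start, e_end] and T = [t_start, t_end]."""
    inter = max(0.0, min(e_end, t_end) - max(e_start, t_start))
    union = (e_end - e_start) + (t_end - t_start) - inter
    return inter / union if union > 0 else 0.0
```

The measure is 1 for a perfect match, 0 for disjoint intervals, and penalizes both over- and underestimation of the TSG boundaries.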
For models 4-6, which also contain sporadic samples, we considered, in our calculation of sensitivity, only the more meaningful situations, consisting only of samples that are diseased because of mutations in the TSG under investigation. Table 2 presents our results, with a summary of overlap and sensitivity measures for each of the six models outlined above and for the two marker resolutions simulated, 10 kb and 20 kb. The numbers that appear in the table are averages over 50 data sets simulated under the corresponding models. In all cases, the estimated P value is very small (<.001).

Table 2. Overlap between True Location and Estimated Location of the TSG and the Resulting Sensitivity for the Six Simulated Models

Average Intermarker        Jaccard Measure            Sensitivity
Distance (kb) and Model    LR          Max           LR          Max
10:
 1                         .82 ± .11   .72 ± .23     .80 ± .08   .79 ± .10
 2                         .84 ± .12   .67 ± .24     .69 ± .10   .67 ± .13
 3                         .84 ± .10   .62 ± .30     .56 ± .11   .54 ± .13
 4                         .74 ± .15   .23 ± .19     .80 ± .14   .69 ± .12
 5                         .73 ± .16   .33 ± .25     .69 ± .12   .59 ± .16
 6                         .74 ± .17   .26 ± .25     .54 ± .12   .46 ± .12
20:
 1                         .70 ± .15   .44 ± .27     .59 ± .16   .56 ± .16
 2                         .70 ± .19   .38 ± .30     .46 ± .14   .43 ± .15
 3                         .68 ± .20   .43 ± .30     .38 ± .14   .34 ± .16
 4                         .60 ± .21   .25 ± .21     .60 ± .18   .55 ± .15
 5                         .65 ± .20   .24 ± .22     .46 ± .15   .40 ± .14
 6                         .58 ± .28   .27 ± .28     .37 ± .15   .33 ± .14

Note: LR and Max refer to the two methods used to estimate the location of the TSG.

To present a better understanding of the entire distribution of scores, we also plotted box plots of the Jaccard measure and the sensitivity measure for all the simulated scenarios (figs. 4-7).

Figure 5. Box plots of the sensitivity measure for each of the six models (table 1). Fifty data sets are simulated according to each model, and the distribution of the resulting 50 sensitivity measures is depicted in each box plot. Average intermarker distance is 10 kb. A, LR. B, Max.
Figure 6. Box plots of the Jaccard measure of overlap for each of the six models (table 1). Fifty data sets are simulated according to each model, and the distribution of the resulting 50 overlap measures is depicted in each box plot. Average intermarker distance is 20 kb. A, LR. B, Max.
Figure 7. Box plots of the sensitivity measure for each of the six models (table 1). Fifty data sets are simulated according to each model, and the distribution of the resulting 50 sensitivity measures is depicted in each box plot. Average intermarker distance is 20 kb. A, LR. B, Max.