Artigo Acesso aberto Revisado por pares

SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

2017; Elsevier BV; Volume: 101; Issue: 1 Linguagem: Inglês

10.1016/j.ajhg.2017.05.017

ISSN

1537-6605

Autores

Di Zhang, Linhai Zhao, Biao Li, Zongxiao He, Gao T. Wang, Dajiang J. Liu, Suzanne M. Leal,

Tópico(s)

Epigenetics and DNA Methylation

Resumo

Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10−6) was observed with CCDC62 (SKAT-O [p = 6.89 × 10−7], combined multivariate collapsing [p = 1.48 × 10−6], and burden of rare variants [p = 1.48 × 10−6]). SEQSpark was also used to analyze 50,000 simulated exomes and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits. Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10−6) was observed with CCDC62 (SKAT-O [p = 6.89 × 10−7], combined multivariate collapsing [p = 1.48 × 10−6], and burden of rare variants [p = 1.48 × 10−6]). SEQSpark was also used to analyze 50,000 simulated exomes and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits. Massively parallel sequencing technologies are generating an unprecedented amount of sequence data on various kinds of samples including human exomes and genomes. Many rare variant association methods have been developed to elucidate the underlying disease etiology using large-scale population-based sequence datasets.1Li B. Leal S.M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data.Am. J. Hum. Genet. 2008; 83: 311-321Abstract Full Text Full Text PDF PubMed Scopus (1100) Google Scholar, 2Morris A.P. Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies.Genet. Epidemiol. 2010; 34: 188-193Crossref PubMed Scopus (371) Google Scholar, 3Madsen B.E. Browning S.R. A groupwise association test for rare mutations using a weighted sum statistic.PLoS Genet. 2009; 5: e1000384Crossref PubMed Scopus (823) Google Scholar, 4Price A.L. Kryukov G.V. de Bakker P.I.W. Purcell S.M. Staples J. Wei L.-J. Sunyaev S.R. Pooled association tests for rare variants in exon-resequencing studies.Am. J. Hum. Genet. 2010; 86: 832-838Abstract Full Text Full Text PDF PubMed Scopus (597) Google Scholar, 5Wu M.C. Lee S. Cai T. Li Y. Boehnke M. Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test.Am. J. Hum. Genet. 2011; 89: 82-93Abstract Full Text Full Text PDF PubMed Scopus (1573) Google Scholar Although some findings are promising,6Schick U.M. Auer P.L. Bis J.C. Lin H. Wei P. Pankratz N. Lange L.A. Brody J. Stitziel N.O. Kim D.S. et al.Cohorts for Heart and Aging Research in Genomic EpidemiologyNational Heart, Lung, and Blood Institute GO Exome Sequencing ProjectAssociation of exome sequences with plasma C-reactive protein levels in >9000 participants.Hum. Mol. Genet. 2015; 24: 559-571Crossref PubMed Scopus (30) Google Scholar statistical power analyses performed with simulated data demonstrate that large sample sizes of tens or even hundreds of thousands of individuals are required for adequately powered studies.7Auer P.L. Reiner A.P. Wang G. Kang H.M. Abecasis G.R. Altshuler D. Bamshad M.J. Nickerson D.A. Tracy R.P. Rich S.S. Leal S.M. NHLBI GO Exome Sequencing ProjectGuidelines for large-scale sequence-based complex trait association studies: lessons learned from the NHLBI Exome Sequencing Project.Am. J. Hum. Genet. 2016; 99: 791-801Abstract Full Text Full Text PDF PubMed Scopus (67) Google Scholar, 8Zuk O. Schaffner S.F. Samocha K. Do R. Hechter E. Kathiresan S. Daly M.J. Neale B.M. Sunyaev S.R. Lander E.S. Searching for missing heritability: designing rare variant association studies.Proc. Natl. Acad. Sci. USA. 2014; 111: E455-E464Crossref PubMed Scopus (408) Google Scholar Large-scale genetic epidemiological studies are currently ongoing, including the Trans-Omics for Precision Medicine program (TopMed) (see Web Resources) and UK BioBank9Trehearne A. Genetics, lifestyle and environment. UK Biobank is an open access resource following the lives of 500,000 participants to improve the health of future generations.Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz. 2016; 59: 361-367Crossref PubMed Scopus (16) Google Scholar studies. Additional large-scale genetic epidemiological studies are emerging that will generate whole-genome sequence (WGS) data or impute WGS data into existing genotype array data to better understand the genetic etiology of complex traits. It is problematic to analyze large datasets of massively parallel sequence data given the limitations of current analytic tools for annotation, data quality control, and association testing.9Trehearne A. Genetics, lifestyle and environment. UK Biobank is an open access resource following the lives of 500,000 participants to improve the health of future generations.Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz. 2016; 59: 361-367Crossref PubMed Scopus (16) Google Scholar, 10Wang K. Li M. Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.Nucleic Acids Res. 2010; 38 (e164–e164)Crossref Scopus (7872) Google Scholar Analytic tools such as PLINK/SEQ and Variant Association Tools (VAT)11Wang G.T. Peng B. Leal S.M. Variant association tools for quality control and analysis of large-scale sequence and genotyping array data.Am. J. Hum. Genet. 2014; 94: 770-783Abstract Full Text Full Text PDF PubMed Scopus (51) Google Scholar are designed to run on a single computer/processor, with poor support for parallel computation. For example, PLINK/SEQ is a single threaded program, and VAT can be multi-threaded on a single server for some tasks. None of the existing tools can leverage the computational resources of a cluster of multiple servers. Backend database systems for current tools are also obsolete, usually relying on a single flat file or a file-based relational database like SQLite, which is not suitable for high input/output applications. To address these issues, we developed SEQSpark, a new analysis tool for large-scale sequence data quality control, annotation, and rare variant association analysis. SEQSpark is based on Spark, a fast engine for large-scale data processing. The Spark platform was selected because it has a simple parallel computation model and it has nearly unlimited scalability, allowing for expanded computational resources, e.g., number of servers within a cluster to enhance the performance without modification of the software. Spark makes use of the Hadoop file system (HDFS), a distributed file system that allows for data storage in a cluster environment (Figure 1A). An additional advantage of Spark is its ability to store data in memory, allowing for many magnitudes faster analysis speeds compared to software that accesses data from hard drives. Spark can efficiently make use of today’s servers that have tens of gigabytes of memory. SEQSpark splits large datasets into many small blocks that are stored across an entire cluster of servers. The blocks (Figure 1A) can then be accessed and processed simultaneously, so the speed is enhanced by a factor that is proportional to the number of blocks. Variants in different blocks can be analyzed together when necessary. The enhancement in speed is dependent on the hardware resources in a cluster, i.e., the total number of processors, size of memory, and number of disks, but is not limited by the computation power and input/output throughput of a single server. By carefully implementing algorithms in the map-reduce form compatible with the processing engine, we overcome the computational and input/output bottleneck that is experienced by other tools. SEQSpark uses the memory caching advantage of Spark and outperforms PLINK/SEQ and VAT even in a single-server environment. PLINK/SEQ and VAT both use the file-based relational database, SQLite, which can considerably reduce computational speed, because it lacks optimization for large datasets and high-frequency access to data. SEQSpark also takes advantage of the sparse data structure of genotype data.6Schick U.M. Auer P.L. Bis J.C. Lin H. Wei P. Pankratz N. Lange L.A. Brody J. Stitziel N.O. Kim D.S. et al.Cohorts for Heart and Aging Research in Genomic EpidemiologyNational Heart, Lung, and Blood Institute GO Exome Sequencing ProjectAssociation of exome sequences with plasma C-reactive protein levels in >9000 participants.Hum. Mol. Genet. 2015; 24: 559-571Crossref PubMed Scopus (30) Google Scholar SEQSpark (Figure 1B) performs data quality control based on genotype and variant level metrics, e.g., read depth, quality score. Using a variety of databases, annotation and bioinformatics evaluation is performed at both variant and gene/region levels. Allele frequencies and bioinformatics scores from external databases can be used as weights in subsequent association tests. SEQSpark implements both single variant association tests and rare variant aggregate association methods, e.g., combined multivariate collapsing (CMC),1Li B. Leal S.M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data.Am. J. Hum. Genet. 2008; 83: 311-321Abstract Full Text Full Text PDF PubMed Scopus (1100) Google Scholar burden of rare variants (BRV),2Morris A.P. Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies.Genet. Epidemiol. 2010; 34: 188-193Crossref PubMed Scopus (371) Google Scholar, 12Auer P.L. Wang G. Leal S.M. Testing for rare variant associations in the presence of missing data.Genet. Epidemiol. 2013; 37: 529-538Crossref PubMed Scopus (17) Google Scholar variable threshold (VT),4Price A.L. Kryukov G.V. de Bakker P.I.W. Purcell S.M. Staples J. Wei L.-J. Sunyaev S.R. Pooled association tests for rare variants in exon-resequencing studies.Am. J. Hum. Genet. 2010; 86: 832-838Abstract Full Text Full Text PDF PubMed Scopus (597) Google Scholar sequence kernel association test (SKAT),5Wu M.C. Lee S. Cai T. Li Y. Boehnke M. Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test.Am. J. Hum. Genet. 2011; 89: 82-93Abstract Full Text Full Text PDF PubMed Scopus (1573) Google Scholar and SKAT-optimal (SKAT-O).13Lee S. Wu M.C. Lin X. Optimal tests for rare variant effects in sequencing association studies.Biostatistics. 2012; 13: 762-775Crossref PubMed Scopus (455) Google Scholar All methods are implemented in a regression framework so that important covariates can be included in the analysis and gene × gene and gene × environment interactions can be investigated. Conditional regression can also be performed to tease apart, for example, associations with susceptibility variants from those due to linkage disequilibrium. The speed and versatility of SEQSpark make it ideal for the analysis of small- to large-scale genetic studies of complex traits. WGS data from the UK10K and 50,000 simulated exomes were analyzed on a small cluster to demonstrate the versatility and speed of SEQSpark. The cluster consists of eight servers each with two AMD Opteron 8-core CPUs, 64 GB memory, and three 4 TB SATA hard drives. For the analysis, the number of blocks was set to 5,120. When comparing the performance of SEQSpark to PLINK/SEQ and VAT, 2,000 simulated exomes were analyzed on a workstation which consists of two Intel Xeon 8-core CPUs with hyper-threading turned on (32 virtual cores in total); eight 8 GB DDR3 memory sticks (64 GB in total); and six 1 TB SATA hard drives. For SEQSpark the number of blocks for the analysis was set to 1,024. We recommended to set the number of blocks for an analysis to a value so that each block contains 32–128 megabytes of data. It is preferable to have more blocks when using a very large cluster with many CPUs so that each block has fewer megabytes of data. No modifications were made to the workstation’s hardware to run SEQSpark and it could even be run on a laptop. Cloud computing can also be used to analyze data using SEQSpark. The speed of SEQSpark can be increased not only by adding additional CPUs but also by increasing the number of hard drives per server, although increasing the number of CPUs will have a greater impact on the speed than the number of hard drives. Analysis of UK10K data and the 50,000 simulated exomes were not performed using either PLINK/SEQ or VAT due to the extended computational time necessary to perform data analysis. For PLINK/SEQ and VAT, considerable computational time was needed to load these two datasets; e.g., for the WGS UK10K data it took PLINK/SEQ 8.5 hr and VAT 9.3 hr to load chromosome 1, and for the 50,000 simulated exomes PLINK/SEQ took 6.6 hr and VAT 26.3 hr to load the chromosome 1 data. The loading time depends not only on the size of the file but also the number of variant sites and genotypes. Although the simulated exome data contains 3× the number of genotypes of the UK10K, PLINK/SEQ can load this dataset quicker because there are fewer variant sites, where the loading time for VAT is impacted by the number of genotypes. In contrast to the loading times for VAT and PLINK/SEQ, it took SEQSpark 2.3 and 5.0 min to load the chromosome 1 data for the UK10K and 50,000 simulated exomes, respectively. Therefore, to compare the three programs in a reasonable time frame, 2,000 exomes were generated and analyzed. Analysis of waist-to-hip ratio was performed using WGS data from the Avon Longitudinal Study of Parents and Children (ALSPAC) cohort which was included in the UK10K. This study includes 1,927 individuals of which 1,811 individuals had waist-to-hip ratio (WHR) data available for analysis. The generation of the WGS data as well as the quality control, which was performed before distribution, has been previously described.14Walter K. Min J.L. Huang J. Crooks L. Memari Y. McCarthy S. Perry J.R. Xu C. Futema M. Lawson D. et al.UK10K ConsortiumThe UK10K project identifies rare variants in health and disease.Nature. 2015; 526: 82-90Crossref PubMed Scopus (649) Google Scholar Functional annotation was performed to determine gene boundaries and to classify coding variants, i.e., splice sites, nonsense, missense, and frameshift indels. Ti/Tv ratio was calculated for variants with an MAF < 0.01 and ≥ 0.01. It took 12 s to calculate Ti/Tv ratios and 1.4 min to annotate the rare variants (MAF < 0.01). Table 1 contains the benchmark times to complete each step of the analysis including benchmark times for each association method.Table 1Benchmarks for SEQSpark Analysis of UK10K Hip-to-Waist Ratio DataVariants: MAF ≥ 0.01aA total of 9,332,772 variants with an MAF ≥ 0.01 analyzed.Rare Variants: MAF < 0.01bA total of 542,616 rare variants within coding regions were loaded and after annotation, a total of 163,578 missense, splice-site, frameshift, and nonsense variants in 18,011 genes were available for analysis.Load datacThe dataset size is 669.4 GB in LZ4 compression format.21.75 min16.25 minAnnotationN/A1.40 minTi/Tv ratio1.92 min0.20 minPCAdTen PCs were generated using all variants with an MAF ≥ 0.01.11.65 minN/ASingle variant16.03 minN/ACMCN/A0.22 minBRVN/A0.23 minVTN/A7.90 minSKATN/A0.18 minSKAT-ON/A0.22 minQuality control was performed using data from 1,927 individuals with WGS data. PCA was performed using 1,811 individuals who had data on WHRs and association analysis was performed using 1,798 individuals with WHRs data who were not outliers in the PC analysis.a A total of 9,332,772 variants with an MAF ≥ 0.01 analyzed.b A total of 542,616 rare variants within coding regions were loaded and after annotation, a total of 163,578 missense, splice-site, frameshift, and nonsense variants in 18,011 genes were available for analysis.c The dataset size is 669.4 GB in LZ4 compression format.d Ten PCs were generated using all variants with an MAF ≥ 0.01. Open table in a new tab Quality control was performed using data from 1,927 individuals with WGS data. PCA was performed using 1,811 individuals who had data on WHRs and association analysis was performed using 1,798 individuals with WHRs data who were not outliers in the PC analysis. To determine whether there were outliers and to control for population substructure in the analysis, the first ten principal components (PCs) were generated for the 1,811 samples with WHR phenotype data using variants with an MAF ≥ 0.01. Individuals with a first or second PC value that was more than 4 standard deviations (SDs) from the mean were removed before analysis. Although for the first PC, all values were within 4 SDs of the mean, 13 individuals had a second PC value that was >4 SDs from the mean and therefore were removed from further analysis (Figure 2A). Stepwise regression was used to determine which covariates should be adjusted for in the analysis. Covariates age (p = 0.0362), sex (p < 2.0 × 10−16), and body mass index (BMI) (p < 2.0 × 10−16) were significant. Residuals were generated for analysis adjusting for these covariates. To evaluate whether inclusion of PCs aided in controlling for population substructure, analysis was performed without including any PCs, including the first PC, including the first and second PCs, and lastly including the first, second, and third PCs. When analysis was performed without inclusion of PCs, the lambda values were as follows: single variant (λ = 0.9991), CMC (λ = 1.0267), BRV (λ = 1.0302), VT (λ = 1.0096), SKAT (λ = 1.0669), and SKAT-0 (λ = 0.9603). Although for some tests, λ is slightly inflated, inclusion of additional PCs did not reduce the λ values and therefore the analysis was performed without inclusion of any PCs. It can be observed from the quantile-quantile plots that type one error is well controlled (Figure 2B). Single variant analysis was performed using the score test assuming an additive model for all variants with an MAF > 0.01. It took 16.3 min to perform the analysis for 9,332,772 variant sites. None of the single variant analyses reached the genome-wide significance level of p < 5.0 × 10−8. This is not surprising given the modest sample size of 1,798 study subjects used for the analysis. For each gene region, missense, nonsense, splice-site, or frameshift variants with an MAF < 0.01 were selected to perform aggregate rare variant association analysis using CMC, BRV, VT, SKAT, and SKAT-O. For CMC and BRV, a score test was performed. For gene-based aggregate rare variant association analysis of 15,937 genes, each with two or more variant sites and at least three alternative alleles, it took 14 s to perform the fixed effect BRV test and 11 s to perform the random effects SKAT. For the gene-based rare variant aggregate analysis, an association was observed with CCDC62 (MIM: 613481) that met exome-wide significance (p < 2.5 × 10−6) with SKAT-O (p = 6.89 × 10−7), CMC (p = 1.48 × 10−6), and BRV(p = 1.48 × 10−6) and suggestive evidence of association with VT (p = 2.15 × 10−5) and SKAT (p = 5.50 × 10−6). To validate the results obtained from SEQSpark, the analysis of rare variants in CCDC62 was also performed using VAT and PLINK/SEQ. The results were the same, except for CMC and BRV, because VAT implements the Wald test instead of the score test. When analysis was performed using the Wald test in SEQSpark, identical results were obtained for VAT and SEQSpark (CMC [p = 1.39 × 10−6]; BRV [p = 1.39 × 10−6]). PLINK/SEQ produced results only for SKAT, since WHR is a quantitative trait. The results for SKAT were identical for the three programs. There have been no previous reports of common or rare variants in CCDC62 being associated with WHR or obesity. There are also no functional studies that demonstrate the involvement of CCDC62 in metabolism. It is important that this association is replicated in an independent sample to determine whether rare variants in CCDC62 are truly associated with WHR. To benchmark and evaluate the performance of PLINK/SEQ, VAT, and SEQSpark, we simulated exome data using non-Finnish European allele frequencies obtained from 33,370 individuals from ExAC. Two samples were generated, one with 50,000 exomes to demonstrate the capabilities of SEQSpark and the other with 2,000 exomes to compare the performance and benchmark SEQSpark, PLINK/SEQ, and VAT. Genotype data were generated assuming Hardy-Weinberg equilibrium (HWE) and missing genotypes were introduced using a probability of 1%. In all a total of 3,681,143 variants in 18,295 genes were generated for the 22 autosomes for the 50,000 exomes and 872,218 variants and 18,295 genes for the sample of 2,000 exomes. To generate realistic sequence data to analyze and perform data quality control, we generated sequence data with variable read depths as well as genotype quality scores. For each genotype, read depth information was generated by drawing from a negative binomial (r,p) where r = 7 is the number of failures until the experiment is stopped and p = 0.2 is the probability of failure, that led to ∼2% of the genotypes having a read depth of <8×. GQ scores were also generated using a two-step sampling procedure: first for 95% of the genotypes the GQ was set to 99; for the remaining genotypes the GQ scores is obtained by drawing from a uniform (a,b) distribution where a = 0 is the starting value and b = 100 is the ending value, leading to ∼1% of the GQ scores being less than 20. None of the phenotype data were generated to be in association with the genotypes, i.e., data are generated under the null hypothesis of no association. For each sample that exome data were generated, age, sex, a quantitative trait, and a dichotomous trait were assigned. Age was generated using a negative binomial (r,p) distribution where r = 20 and p = 0.3, while the quantitative trait was simulated using a normal (μ,σ2) distribution where the mean μ = 25 and the variance σ2 = 3. Of the samples, 50% were assigned to be male. For the qualitative trait, 50% of the samples were assigned to be case subjects and the other half control subjects. When the quantitative trait was analyzed, three covariates (sex, age, and the binary trait) were included in the analysis, while for the analysis of the qualitative trait, the three covariates that were included in the analysis are sex, age, and the quantitative trait. First the 50,000 simulated exomes were loaded (44.6 min) and annotated (12.0 min) using SEQSpark on a small cluster. The simulated case-control and quantitative trait data were analyzed in a regression framework including covariates using several rare variant aggregate association tests (CMC, BRV, SKAT, and SKAT-O), obtaining p values analytically. A total of 18,219 genes, with at least two missense, nonsense, or splice site variants sites with an MAF < 0.01 and at least three alternative alleles, were analyzed. The analysis times varied for the same test depending on whether a binary or quantitative trait was analyzed. SKAT-O took more than twice the time to analyze the case-control data than the quantitative trait data. On the other hand, analysis of the case-control data was quicker for fixed-effect tests CMC and BRV and the random effects test SKAT than the analysis of the quantitative trait data. The analysis time for CMC, BRV, SKAT, and SKAT-O combined for the quantitative trait was 48.4 min and for the qualitative trait 72.4 min. Table 2 displays the benchmarks for the analysis of the 50,000 simulated exomes.Table 2Benchmarks for SEQSpark Analysis of 50,000 Simulated ExomesAnalysis Times for Rare VariantsaA total of 3,681,143 variants were loaded and after annotation, a total of 1,420,128 missense, splice-site, and nonsense variants with a MAF of < 0.01 in 18,219 genes with ≥2 variant sites and ≥3 alternative alleles were available for analysis.Load databThe dataset size is 659.6 GB in LZ4 compression format.44.63 minAnnotation12.00 minQuantitativeCase-ControlCMC14.40 min12.72 minBRV3.82 min2.97 minSKAT6.25 min5.92 minSKAT-O24.12 min51.13 mina A total of 3,681,143 variants were loaded and after annotation, a total of 1,420,128 missense, splice-site, and nonsense variants with a MAF of < 0.01 in 18,219 genes with ≥2 variant sites and ≥3 alternative alleles were available for analysis.b The dataset size is 659.6 GB in LZ4 compression format. Open table in a new tab The first step in analysis was to load the data into each of the software packages. This process was most lengthy for VAT, taking slightly more than 1 hr to perform, was faster for PLINK/SEQ taking almost 40 min, and the quickest for SEQSpark run on a cluster taking slightly over 4 min. Performing quality control of the genotype data, i.e., removal of genotypes with a read depth <8× and/or genotype quality score (GQ) < 20, took 48 min using PLINK/SEQ, 37 min for VAT, 7 min for SEQSpark on a single server, and 3 min when SEQSpark was run on a cluster. For calculation of Ti/Tv ratios, it took PLINK/SEQ 57 min, VAT 11 min, and SEQSpark run on a cluster 8 s. The calculations of allele frequencies took more than 1 hr for PLINK/SEQ, 41 min for VAT, and 5 min for SEQSpark on the server and 1 min on the cluster. Table 3 displays benchmark times for loading, annotation, quality control, and data exploration.Table 3Benchmark for Performing Quality ControlPLINK/SEQaAll software, PLINK/SEQ, VAT, and SEQSpark were run on a single server.VATaAll software, PLINK/SEQ, VAT, and SEQSpark were run on a single server.SEQSpark Single ServeraAll software, PLINK/SEQ, VAT, and SEQSpark were run on a single server.SEQSpark CusterbSEQSpark was also run on a cluster and is the only software which has this capability.Load data38.75 min61.75 min5.67 min4.35 minAnnotationN/AcPLINK/SEQ does not have a separate annotation step.3.32 min1.42 min1.38 minGenotype and variant removaldThose genotypes with a read depth of <8× and/or GQ score <20 were removed.48.43 min36.67 min6.75 min2.57 minCalculation of Ti/Tv ratios56.54 min10.83 min1.48 min0.13 minCalculation of allele frequencies62.5 min40.43 min5.03 min1.30 minExome variant data were generated for a total of 2,000 samples using ExAC non-Finnish European allele frequencies. A total of 872,218 variants were generated with genotype-specific read depths and quality scores.a All software, PLINK/SEQ, VAT, and SEQSpark were run on a single server.b SEQSpark was also run on a cluster and is the only software which has this capability.c PLINK/SEQ does not have a separate annotation step.d Those genotypes with a read depth of <8× and/or GQ score <20 were removed. Open table in a new tab Exome variant data were generated for a total of 2,000 samples using ExAC non-Finnish European allele frequencies. A total of 872,218 variants were generated with genotype-specific read depths and quality scores. Association testing of the 2,000 exomes was performed using several commonly used aggregate rare variant association tests, i.e., CMC, VT, SKAT, SKAT-O, BRV, “Burden Test” for both quantitative and qualitative traits controlling for confounders. For the BRV and CMC, SEQSpark implemented the score test to perform gene-based rare variant association testing. Missense, nonsense, or

Referência(s)