Article Open access Peer reviewed

Validation of a Customized Bioinformatics Pipeline for a Clinical Next-Generation Sequencing Test Targeting Solid Tumor–Associated Variants

2018; Elsevier BV; Volume: 20; Issue: 3; Language: English

10.1016/j.jmoldx.2018.01.007

ISSN

1943-7811

Authors

Thomas M. Schneider, Geoffrey H. Smith, Michael R. Rossi, Charles E. Hill, Linsheng Zhang

Topic(s)

Genomics and Phylogenetic Studies

Abstract

Bioinformatic analysis is an integral and critical part of clinical next-generation sequencing. It is especially challenging for some pipelines to consistently identify insertions and deletions. We present the validation of an open source tumor amplicon pipeline (OTA-pipeline) for clinical next-generation sequencing targeting solid tumor–associated variants. Raw data generated from 557 TruSight Tumor 26 samples and in silico data were analyzed by the OTA-pipeline and legacy pipeline and compared. Discrepant results were confirmed by orthogonal methods. The OTA-pipeline reported 22 variants that were not detected by the previously validated pipeline, including seven synonymous or intronic single-nucleotide variants, five single-nucleotide variants at frequency <5%, one insertion, and nine deletions. Variant allele frequencies reported by the two pipelines were highly concordant, although a few significant discrepancies were present. Analysis of in silico FASTQ files demonstrated a higher sensitivity of detecting complex insertions and deletions with the OTA-pipeline. The higher sensitivity came at a cost, because false-positive calls were increased in difficult-to-sequence regions. However, these calls were all flagged by our strand bias filter, distinguishing them from true variants. Our validation process provides a model for laboratories that want to establish an in-house bioinformatics pipeline for clinical next-generation sequencing.

The advancement in the knowledge of the underlying molecular mechanisms of cancer and successful development of drugs targeting specific driver variants or related dysfunctional molecular pathways have greatly promoted the adoption of clinical laboratory tests detecting cancer-associated variants. The variant profiling of a variety of solid tumors and hematolymphoid malignancies has gradually become standard of care. With the ability to detect a vast range of variants simultaneously, at great sensitivity and specificity, next-generation sequencing (NGS) has become widely adopted in the clinical laboratory to provide variant profile information for diagnostic cancer specimens. NGS tests in the clinical laboratory include the wet laboratory process to generate sequencing data and the data analysis to identify and annotate variants before interpreting and reporting results to the medical chart.
Many vendors or third-party software developers provide ready-to-use closed source bioinformatics programs to ease the burden of bringing up these tests for laboratories with little experience in bioinformatics. Bioinformatics pipelines set certain thresholds to help ascertain which calls are real and which are not. Not surprisingly, these parameters have a significant effect on the sensitivity and specificity of a pipeline. Clinical laboratories with targeted panels have different needs and requirements for their pipelines than research laboratories focusing on discovery of novel molecular mechanisms. Proprietary pipelines attempt to satisfy both groups, but there may not always be an optimal solution for either. In addition, these proprietary analysis pipelines are usually tied to a specific platform or library preparation kit. A laboratory-developed or custom bioinformatics solution with the ability to modify certain thresholds may be desired under certain circumstances. We have previously reported validation of Illumina's (San Diego, CA) TruSight Tumor 26 (TST26) targeted NGS assay for variant profiling of solid tumors [1: Fisher K.E., Zhang L., Wang J., Smith G.H., Newman S., Schneider T.M., Pillai R.N., Kudchadkar R.R., Owonikoko T.K., Ramalingam S.S., Lawson D.H., Delman K.A., El-Rayes B.F., Wilson M.M., Sullivan H.C., Morrison A.S., Balci S., Adsay N.V., Gal A.A., Sica G.L., Saxe D.F., Mann K.P., Hill C.E., Khuri F.R., Rossi M.R. Clinical validation and implementation of a targeted next-generation sequencing assay to detect somatic variants in nonsmall cell lung, melanoma, and gastrointestinal malignancies. J Mol Diagn. 2016; 18: 299-315]. The data analysis pipeline was validated as an integral part of the test (from here, referred to as legacy pipeline).
A multi-institutional proficiency study based on exchanging FASTQ files has shown that it is challenging to correctly detect insertions and deletions (indels) from the data generated by the TST26 library preparation [2: Davies K.D., Farooqi M.S., Gruidl M., Hill C.E., Woolworth-Hirschhorn J., Jones H., Jones K.L., Magliocco A., Mitui M., O'Neill P.H., O'Rourke R., Patel N.M., Qin D., Ramos E., Rossi M.R., Schneider T.M., Smith G.H., Zhang L., Park J.Y., Aisner D.L. Multi-institutional FASTQ file exchange as a means of proficiency testing for next-generation sequencing bioinformatics and variant interpretation. J Mol Diagn. 2016; 18: 572-579]. In our laboratory, we encountered a rare event in which a clinically significant EGFR exon 19 deletion was missed by the legacy pipeline. At the same time, when our laboratory was transitioning from Illumina's MiSeq platform to the NextSeq platform, the legacy pipeline was found not to work seamlessly on the new platform. Therefore, a customized informatics pipeline was built [from here, referred to as open-source tumor amplicon pipeline (OTA-pipeline)] on the basis of popular open source modules. Herein, we present our experience in developing and validating a customized and improved informatics package for the analysis of NGS data generated from TST26.

FASTQ files from 557 clinical NGS samples previously processed at Emory University Hospital's (Atlanta, GA) molecular diagnostic laboratory were used in this clinical validation. An additional FASTQ file with an insertion known to be challenging for various informatics pipelines in a previous FASTQ exchange study [2] was used as well. The nucleic acid extraction, quality control, and library preparation and sequencing were previously described in detail [1]. Briefly, DNA was extracted from formalin-fixed, paraffin-embedded sections in which tumor cell nuclei were confirmed to be ≥10%. The TST26 library was prepared per the manufacturer's protocol. TST26 is an amplicon-based method that targets and amplifies two libraries for every sample by capturing the sense and the antisense DNA strands of the same region. No more than 10 samples are processed in a single sequencing run to achieve maximum depth of coverage.

Challenging complex indels, either documented or similar to previously encountered variants, were tested against the legacy and OTA-pipeline using in silico FASTQ files generated by the open source software ART version ChocolateCherryCake-03-19-2015 [3: Huang W., Li L., Myers J.R., Marth G.T. ART: a next-generation sequencing read simulator. Bioinformatics. 2012; 28: 593-594]. ART generates FASTQ files given a set of provided reference sequences in FASTA format.
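As an illustration of how a FASTA input for a read simulator could be prepared, the following minimal sketch builds a mutant amplicon by applying a deletion-insertion to a wild-type sequence and writes both to FASTA text. The 40-bp sequence, the coordinates, and the names `apply_delins` and `to_fasta` are hypothetical and are not taken from the TST26 manifest or from the authors' scripts.

```python
from typing import Dict


def apply_delins(reference: str, start: int, end: int, inserted: str = "") -> str:
    """Replace reference[start..end] (1-based, inclusive) with `inserted`.

    An empty `inserted` string gives a plain deletion.
    """
    if not (1 <= start <= end <= len(reference)):
        raise ValueError("coordinates fall outside the reference sequence")
    return reference[:start - 1] + inserted + reference[end:]


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Format named sequences as FASTA text, wrapping lines at `width`."""
    lines = []
    for name, sequence in records.items():
        lines.append(f">{name}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Hypothetical 40-bp amplicon; not a real TST26 target.
wild_type = "ACGTACGTACGGATCCTTAGCATGCAAGCTTGGCATTCCG"

# Delete bases 11-20 and insert TGT in their place (a delins event).
mutant = apply_delins(wild_type, start=11, end=20, inserted="TGT")

fasta_text = to_fasta({"amplicon_wt": wild_type, "amplicon_delins": mutant})
print(fasta_text)
```

The resulting FASTA text would then be passed to the simulator as its reference input; the simulator's own command-line options are not shown here.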
The expected amplicon sequences in TST26, provided in Illumina's TST26 manifest files (Illumina product insert: TruSight Tumor 26 Product Files), along with amplicon sequences containing complex indels, were provided to ART in FASTA format to generate synthetic NGS reads (Supplemental Table S1). A total of 29 artificial indels were generated, 20 of which correspond to real variants in the Catalogue of Somatic Mutations in Cancer database [4: Forbes S.A., Beare D., Gunasekaran P., Leung K., Bindal N., Boutselakis H., Ding M., Bamford S., Cole C., Ward S., Kok C.Y., Jia M., De T., Teague J.W., Stratton M.R., McDermott U., Campbell P.J. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res. 2015; 43: D805-D811], 2 were modified from the FASTQ file of real cases with adjacent single-nucleotide variants (SNVs), and 7 were entirely novel creations.

In the legacy pipeline, FASTQ files were processed using the on-board Amplicon DS software plug-in for the MiSeq Reporter version 2.2.29 [1]. Amplicon DS is a closed source solution; however, in general, alignment is performed using a banded Smith-Waterman alignment algorithm with a band width of 25 bp. This band width limits the detection of a single insertion or deletion to a maximum of 25 bp.
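The band-width limit can be made concrete with a short sketch. In a banded alignment, the alignment path may drift from the main diagonal by at most the band width, so a single gap longer than the band cannot be represented. The check below encodes only that general principle; it is not the Amplicon DS implementation, which is closed source, and the function name and the example sizes are illustrative assumptions.

```python
BAND_WIDTH_BP = 25  # band width reported in the text for the legacy aligner


def gap_fits_in_band(gap_length_bp: int, band_width_bp: int = BAND_WIDTH_BP) -> bool:
    """Return True if a single gap of this length can lie inside the band."""
    if gap_length_bp < 0:
        raise ValueError("gap length must be non-negative")
    return gap_length_bp <= band_width_bp


# Illustrative single-gap sizes in bp (hypothetical inputs).
deletion_sizes_bp = [8, 25, 26, 30, 32]

# Gaps that a 25-bp band cannot represent as one event.
outside_band = [size for size in deletion_sizes_bp if not gap_fits_in_band(size)]
print(outside_band)
```

Under this simplification, a 25-bp gap is still representable and a 26-bp gap is not, which is the boundary the text describes.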
Alignments that include more than three indels are filtered from the alignment results (Illumina, https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/miseqreporter/miseq-reporter-amplicon-ds-workflow-guide-15042903-02.pdf, last accessed May 5, 2017). SNVs and short indels are identified using the Illumina-developed Somatic Variant Caller version 3.1.6.4. Variant scores are computed using a Poisson model and are excluded when the Phred quality score is 60.

For VarScan and FreeBayes variants, BCFtools was used to generate a strand bias filter, with strand bias simply defined as >90% of alternative reads coming from one read. Pool bias filters were added after merging the respective variant call files from each library into a single VCF file (Figure 1B). Pool bias in the OTA-pipeline is simply defined as a variant being present in one library pool and absent in another pool; differences in variant frequency are not taken into account in the OTA-pipeline. All of the calls made by the various variant callers are then unioned together.

The legacy pipeline annotated genomic variants using Illumina VariantStudio version 2.1.46, whereas the OTA-pipeline used ANNOVAR (2014-11-12) (Figure 1B) [10: Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010; 38: e164]. Annotation information includes gene name/symbol; location of variant (ie, exonic or intronic variant); chromosome; position; cDNA description of variant; protein description of variant; the aligner and variant detector used to generate the variant; quality filters associated with the variant; variant frequency and read depth of each library and the two combined; Exome Aggregation Consortium [11: Lek M., Karczewski K.J., Minikel E.V., Samocha K.E., Banks E., Fennell T., et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016; 536: 285-291] minor allele frequency (MAF) information, if known; Catalogue of Somatic Mutations in Cancer identification, if known; 1000 Genomes MAF information; and dbSNP MAF [12: Sherry S.T., Ward M.H., Kholodov M., Baker J., Phan L., Smigielski E.M., Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001; 29: 308-311]. Complete information on the columns is included in Supplemental Table S3.

The resulting alignment files, genome VCF, and annotated text files are then fed to an internal quality control program entitled CoverageQC (Figure 1C and Supplemental Figures S1, S2, S3, and S4). CoverageQC transforms the annotated text file into a Microsoft Excel file (Microsoft Corp., Redmond, WA), in which certain variants are highlighted or shaded out, depending on presumed classification (Supplemental Table S4). Synonymous and intronic variants, along with germline variants with an MAF >1%, are filtered by shading them out in the Excel file. In addition, annotated variants are filtered through an internal database to eliminate known artifactual variants on the basis of their overrepresentation in clinical samples, discovered in the initial validation of the legacy pipeline [1]. There are 45 of these variants currently filtered. Approximately nine of these variants are seen in a single case, on average.
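The strand bias, pool bias, and shading rules described above can be sketched as flags attached to each call, so that flagged calls stay in the list and remain visible to the reviewer. This is a minimal sketch, not the OTA-pipeline or CoverageQC code: the `Variant` fields, flag names, and example calls are hypothetical, and strand bias is assumed here to mean that more than 90% of alternative-allele reads come from one read direction.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Variant:
    key: str                      # hypothetical identifier for the call
    alt_forward: int              # alternative-allele reads, forward direction
    alt_reverse: int              # alternative-allele reads, reverse direction
    in_pool_a: bool               # called in the first library pool
    in_pool_b: bool               # called in the second library pool
    effect: str                   # e.g. "nonsynonymous", "synonymous", "intronic"
    population_maf: Optional[float] = None
    flags: List[str] = field(default_factory=list)


def flag_variant(variant: Variant, known_artifacts: Set[str]) -> Variant:
    """Attach quality and shading flags; the call itself is never dropped."""
    flags = []
    alt_total = variant.alt_forward + variant.alt_reverse
    dominant = max(variant.alt_forward, variant.alt_reverse)
    # Strand bias (assumed reading): >90% of alternative reads in one direction.
    if alt_total > 0 and dominant * 100 > 90 * alt_total:
        flags.append("strand_bias")
    # Pool bias: present in one pool, absent in the other; frequency is ignored.
    if variant.in_pool_a != variant.in_pool_b:
        flags.append("pool_bias")
    if variant.effect in {"synonymous", "intronic"}:
        flags.append("shaded_synonymous_or_intronic")
    if variant.population_maf is not None and variant.population_maf > 0.01:
        flags.append("shaded_common_germline")
    if variant.key in known_artifacts:
        flags.append("shaded_known_artifact")
    variant.flags = flags
    return variant


# Hypothetical calls for illustration only.
calls = [
    Variant("call_1", 48, 52, True, True, "nonsynonymous"),
    Variant("call_2", 97, 3, True, False, "nonsynonymous"),
    Variant("call_3", 40, 45, True, True, "synonymous", population_maf=0.12),
]
flagged = [flag_variant(v, known_artifacts={"call_9"}) for v in calls]
for v in flagged:
    print(v.key, v.flags)
```

Integer arithmetic is used for the 90% comparison so that a call with exactly 90% of reads in one direction is not flagged because of floating-point rounding.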
These known artifactual variants are listed in Supplemental Table S5. Like the synonymous and intronic variants, these are shaded out in the generated Excel file. The filtered variants are never removed and can be seen and reported if deemed appropriate. CoverageQC also generates an HTML document displaying the coverage of each nucleotide in all targeted regions for the current test and provides hyperlinks to Broad Institute's Integrative Genomics Viewer version 2.3 [13: Robinson J.T., Thorvaldsdottir H., Winckler W., Guttman M., Lander E.S., Getz G., Mesirov J.P. Integrative genomics viewer. Nat Biotechnol. 2011; 29: 24-26], in which the case's alignment files and variant call files are further examined (Supplemental Figures S2 and S3).

In the course of a routine clinical case sign out, the attending pathologist will examine a case starting from viewing the HTML file generated by CoverageQC in a web browser. Regions with low depth of coverage will be documented in the clinical reports. If adequate sequencing depth is not obtained in a significant number of regions, the run is deemed a failure and the case will need to be repeated or tested by other methods. Then, each variant call will be manually inspected and reviewed using all of the aforementioned tools. Allele percentage from one of the variant callers is used for the final report; however, because the allele fraction is not always accurate in amplicon-based NGS tests, the allele fraction is always reported as approximate. Manual investigation of reads in Integrative Genomics Viewer becomes critically important for indels because the variant callers may not recognize and define indels perfectly, especially if the indels are at the end of the reads. In this situation, many different calls with strand bias will be made by the pipeline.
Adjustments to indel calls may be necessary so that overlapping variants are incorporated into a single all-encompassing variant description conforming to Human Genome Variation Society nomenclature. At this time, these adjustments are performed manually by pathologists and recorded in the annotated Excel file, and they do not involve modification of the original VCF files.

The steps of running the OTA-pipeline, with a comparison to the legacy pipeline, are illustrated in Figure 2. The OTA-pipeline reduces the hands-on time compared with the legacy pipeline because the annotation step requires no user input. The OTA-pipeline is implemented as a Linux Bourne shell script and is processed in our laboratory's own 32-core Linux server [Proliant Generation 8 dual Intel Xeon 8-core hyperthreaded core processing units, 256 gigabytes of random access memory (Hewlett Packard, Palo Alto, CA); Enterprise Linux (Red Hat, Raleigh, NC)]. A MiSeq run of 10 samples takes approximately 30 minutes to process; for a NextSeq run, it takes approximately 45 minutes for every eight samples.

For validation purposes, synonymous and benign variant calls from both the legacy pipeline and the OTA-pipeline were not filtered out before comparison. To simplify variant comparison, VCF files from the legacy pipeline and the OTA-pipeline were both annotated by ANNOVAR [13]. Annotated variants were loaded into an Oracle SQL database (Express Edition 11g R2) and compared with each other. Calls with the same genomic coordinates and Human Genome Variation Society cDNA annotation were considered matches.
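The matching rule can be sketched with plain set operations. This is an illustrative stand-in for the database comparison, not the authors' SQL: the `Call` fields mirror the two matching criteria named in the text (genomic coordinates and cDNA annotation), and the chromosome names, positions, and cDNA strings are placeholders rather than real variants.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Call(NamedTuple):
    chrom: str      # chromosome of the call
    position: int   # genomic coordinate
    cdna: str       # HGVS cDNA description from the shared annotator


def compare_calls(legacy: Iterable[Call], ota: Iterable[Call]) -> Dict[str, Set[Call]]:
    """Split calls into matches and pipeline-specific calls.

    Two calls match only when chromosome, position, and cDNA annotation
    are all identical.
    """
    legacy_set, ota_set = set(legacy), set(ota)
    return {
        "matched": legacy_set & ota_set,
        "legacy_only": legacy_set - ota_set,
        "ota_only": ota_set - legacy_set,
    }


# Placeholder calls for illustration only.
legacy_calls = [
    Call("chrA", 1000, "c.100A>G"),
    Call("chrB", 2000, "c.200C>T"),
]
ota_calls = [
    Call("chrA", 1000, "c.100A>G"),
    Call("chrB", 2000, "c.200C>T"),
    Call("chrC", 3000, "c.300_325del"),
]

result = compare_calls(legacy_calls, ota_calls)
print(len(result["matched"]), sorted(result["ota_only"]))
```

Exact matching of this kind reports every difference in representation as a discrepancy, so calls listed under `legacy_only` or `ota_only` would still need manual review before being counted as true discrepancies.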
As mentioned in Variant Annotation and Review of Cases Clinically, manual inspection of variant calls is almost always performed with mandatory adjustments to make the most concise and correct variant calls in clinical reports. Therefore, the concept of a true discrepancy and processing discrepancy was introduced. For example, VarScan and UnifiedGenotyper will occasionally represent a dinucleotide substitution as two single point variants, whereas FreeBayes will represent this as a 2-bp substitution. This is a processing discrepancy but not a true discrepancy. Indels were another source of processing discrepancies because of overlapping reads. Indels in these scenarios are usually broken down into multiple calls that do not always match among the four variant callers. Therefore, a variant detector is considered to be concordant with other variant detectors as long as there is a similar call in the same region. All potential processing discrepancies were manually reviewed and excluded from true discrepancies.

Discrepant indels between the OTA-pipeline and the legacy pipeline were confirmed by amplifying the region using PCR primers, followed by fragment analysis (Table 1). PCR of a genomic DNA template (6 μL at 10 ng/μL) was performed using 12.5 μL of HotStar-Taq Master Mix (Qiagen Inc., Valencia, CA), 5 μL at 0.2 mmol/L final concentration of forward and reverse primers, and nuclease-free water in a total volume of 26 μL. Amplification was performed on an ABI9700 thermal cycler (Applied Biosystems, Foster City, CA), as follows: 94°C for 15 minutes; 40 cycles of 94°C for 30 seconds, 57°C for 30 seconds, and 72°C for 60 seconds; and 72°C for 30 minutes. Fragment analysis was performed by capillary electrophoresis using either a QIAxcel (Qiagen Inc.) or an ABI PRISM3100 (Applied Biosystems).
For fragment analysis by the QIAxcel, the machine was set up according to manufacturer's instructions, using the DNA Screening Kit and AM420 method, and an undiluted amplicon was loaded on the QIAxcel with the analysis performed by the QIAxcel ScreenGel software version 1.2.1 (Qiagen Inc.). For fragment analysis by the ABI PRISM3100, previous PCR product was diluted 1:200 in nuclease-free water. Formamide/size standard mix was prepared by adding 7.5 μL of 400HD ROX size standard (GeneScan; Applied Biosystems) to 500 μL of HiDi formamide (Applied Biosystems). Then, 1 μL of diluted (1:200) PCR product was transferred to a well containing 10 μL of formamide/size standard mix. Samples were heated to 95°C for 5 minutes on the thermal cycler, then snap chilled on ice for 5 minutes and loaded for capillary electrophoresis on an ABI PRISM3100. Fragment length analysis was performed using the GeneMapper Software version 3.7 (Applied Biosystems).

Table 1. Indel Variants Discrepant between Pipelines

| Case | Gene | Variant | Indel size, bp | Reason for missing | Confirmed | Forward primer | Reverse primer |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Emory1 | EGFR | NM_005228.4:c.2237_2257delinsTGT | 21 | Unknown | Yes | 5′-GCACCATCTCACAATTGCCAGTTA-3′ | 5′-AAAAGGTGGGCCTGAGGTTCA-3′ |
| Emory2 | TP53 | NM_000546.5:c.390_421del | 32 | >25-bp deletion | Yes | 5′-GAATCAACCCACAGCTGCAC-3′ | 5′-AGGAGGTGCTTACGCATGTT-3′ |
| Emory3 | TP53 | NM_000546.5:c.372_375+5del | 8 | Interval restriction | Yes | 5′-CGGCCAGGCATTGAAGTCT-3′ | 5′-CAGCTACGGTTTCCGTCTGG-3′ |
| Emory4 | APC | NM_000038.5:c.4534_4559del | 26 | >25-bp deletion | Yes | 5′-TTTCTTGTTCATCCAGCCTGAGT-3′ | 5′-GCTCTGATTCTGTTTCATTCCCA-3′ |
| Emory5 | APC | NM_000038.5:c.4634_4663del | 30 | >25-bp deletion | Yes | 5′-TGGGAATGAAACAGAATCAGAGC-3′ | 5′-TGTTGGCATGGCAGAAATAATACAT-3′ |
| Emory6 | MET | NM_001127500.1:c.3071_3082+18del | 30 | >25-bp deletion and interval restriction | Yes | 5′-GCCCAACTACAGAAATGGTTTCA-3′ | 5′-AACAATGTCACAACCCACTGA-3′ |
| Emory7 | TP53 | NM_000546.5:c.993_993+4del | 5 | Interval restriction | Yes | 5′-ACGGCATTTTGAGTGTTAGACTG-3′ | 5′-TCCTAGCACTGCCCAACAAC-3′ |
| Emory8 | STK11 | NM_000455.4:c.465-26_471del | 32 | >25-bp deletion and interval restriction | Yes | 5′-TGTGCCTGGACTTCTGTGAC-3′ | 5′-TCGGAGATTTTGAGGGTGCC-3′ |
| Emory9 | TP53 | NM_000546.5:c.782+1del | 1 | Interval restriction | Unavailable | NA | NA |
| FQ1 | KIT | NM_000222:c.1728_1729insCCTTATGATCACAAATGG | 18 | Unknown | Unavailable | NA | NA |
| ART2 | EGFR | NM_005228.3:c.[2232_2233delinsTG; 2237_2257delinsTGT] | 2; 21 | Unknown | Artificial | NA | NA |
| ART3 | EGFR | NM_005228.3:c.[2232_2233delinsTG; 2236_2257del] | 2; 21 | Unknown | Artificial | NA | NA |
| ART4 | EGFR | NM_005228.3:c.2185-3_2286delinsGACCT | 5 | Unknown | Artificial | NA | NA |
| ART6 | EGFR | NM_005228.3:c.[2185_2214del; 2221_2224delinsTTAA] | 30; 4 | Unknown | Artificial | NA | NA |
| ART7 | EGFR | NM_005228.3:c.[2202_2228del; 2238_2241insdelCCGA] | 27; 4 | Unknown | Artificial | NA | NA |
| ART8 | EGFR | NM_005228.3:c.[2231_2252del22; 2258_2261insdelTAGC] | 22; 4 | Unknown | Artificial | NA | NA |
| ART9 | EGFR | NM_005228.3:c.[2245_2246insdelCC; c.2252_2253insTGAACCG] | 2; 7 | Unknown | Artificial | NA | NA |
| ART22 | TP53 | NM_000546.5:c.864_865insTTCCGCGGCGCACAGAGGAAGAAGAGAAT | 29 | >25-bp deletion and interval restriction | Artificial | NA | NA |

Cases starting with the name Emory are actual cases encountered in our laboratory, whereas cases starting with ART are artificial FASTQ files. FQ1 is a real case from a separate laboratory. See Supplemental Table S1 for the breakdown of all artificial cases. The forward and reverse primers listed are the PCR primers used in the confirmatory tests (see Confirmation of the Discrepant Variants for detail). The reference sequence can be found with corresponding NM_numbers at: https://www.ncbi.nlm.nih.gov/nuccore. Indel, insertion/deletion; NA, not applicable.

Single-nucleotide variants were confir

Reference(s)