insiM

Article, Open access, Peer reviewed

2018; Elsevier BV; Volume: 21; Issue: 1; Language: English

10.1016/j.jmoldx.2018.08.001

ISSN

1943-7811

Authors

Sushant Patil, Ibro Mujacic, Lauren L. Ritterhouse, Jeremy P. Segal, Sabah Kadri

Topic(s)

Genomics and Phylogenetic Studies

Abstract

Lack of reliable reference samples containing different mutations of interest across large sets of disease-relevant loci limits the extensive validation of clinical next-generation sequencing (NGS) assays and their associated bioinformatics pipelines. Herein, we have generated a publicly available, highly flexible tool, in silico Mutator (insiM), to introduce point mutations, insertions, deletions, and duplications of any size into real data sets of amplicon-based or hybrid-capture NGS assays. insiM accepts an alignment file along with target territory and produces paired-end FASTQ files containing specified mutations via modification of original sequencing reads. Mutant signal is, thus, generated within the context of existing real-world data to most closely mimic assay performance. Resulting files may then be passed through the assay's bioinformatics pipeline to assist with assay/bioinformatics validation and to identify performance gaps in detection. To establish the basic functionality of the software, a series of simulation experiments with varying mutation types, sizes, and allele frequencies were performed across the entire clinical territory of hybrid-capture and amplicon-based clinical assays developed at The University of Chicago. This work demonstrates the utility of insiM as a supplementary tool during the validation of an NGS assay's bioinformatics pipeline.

The past few years have seen a tremendous push to expand the scope of clinical next-generation sequencing (NGS)–based testing for patients to increasingly larger sets of genes and associated disease-relevant loci. This places substantial pressure on laboratories, because NGS assays have many moving parts and their development may involve potentially extensive troubleshooting of both technical wet-laboratory and bioinformatics components. As per the College of American Pathologists' accreditation standards, NGS assays are required to undergo an extensive validation process covering all aspects of the test, including a full validation of the underlying bioinformatics [1: Aziz N. Zhao Q. Bry L. Driscoll D.K. Funke B. Gibson J.S. Grody W.W. Hegde M.R. Hoeltge G.A. Leonard D.G.B. Merker J.D. Nagarajan R. Palicki L.A. Robetorye R.S. Schrijver I. Weck K.E. Voelkerding K.V. College of American Pathologists' laboratory standards for next-generation sequencing clinical tests. Arch Pathol Lab Med. 2014; 139: 481-493].

One of the most challenging aspects of large-scale NGS laboratory-developed protocol validation is properly assessing the sensitivity and specificity of an assay across the entire clinical territory. As NGS panels expand to accommodate more genes, an obvious problem is the lack of sufficient reliable test samples containing anomalies at many (or most) genetic loci to assess the bioinformatics pipeline performance across the entire assay territory. Even if such specimens were available, sequencing all specimens would be impractical because of prohibitive cost. Previously, some groups have used synthetic DNA fragments [2: Baum P.D. Young J.J. Zhang Q. Kasakow Z. McCune J.M. Design, construction, and validation of a modular library of sequence diversity standards for polymerase chain reaction. Anal Biochem. 2011; 411: 106-115] [3: Zook J.M. Samarov D. McDaniel J. Sen S.K. Salit M. Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing. PLoS One. 2012; 7: e41356] and plasmid-based DNA constructs [4: Sims D.J. Harrington R.D. Polley E.C. Forbes T.D. Mehaffey M.G. McGregor P.M. Camalier C.E. Harper K.N. Bouk C.H. Das B. Conley B.A. Doroshow J.H. Williams P.M. Lih C.-J. Plasmid-based materials as multiplex quality controls and calibrators for clinical next-generation sequencing assays. J Mol Diagn. 2016; 18: 336-349], but these can be expensive for a large assay or may not adequately cover the entirety of the assay. Lack of a thorough performance evaluation across all loci can put the assay at risk for producing incorrect results at a higher than expected rate and negatively affecting patient care.
This is an area where in silico–generated data sets harboring specifically designed anomalies can be critical to validate the assay's analytical pipeline at a larger scale [5: Duncavage E.J. Abel H.J. Pfeifer J.D. In silico proficiency testing for clinical next-generation sequencing. J Mol Diagn. 2017; 19: 35-42]. Recently, the use of in silico data sets has been permitted by the Association for Molecular Pathology to supplement the validation of the bioinformatics pipeline of an assay [6: Roy S. Coldren C. Karunamurthy A. Kip N.S. Klee E.W. Lincoln S.E. Leon A. Pullambhatla M. Temple-Smolkin R.L. Voelkerding K.V. Wang C. Carter A.B. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J Mol Diagn. 2018; 20: 4-27]. It is true that in silico–mutated data sets can never entirely substitute for assay performance assessment using previously certified sample sets. However, once the assay itself has been tested on an acceptable number of true-positive and true-negative samples, in silico–mutated data sets can serve to supplement assessment of alignment and variant calling performance more broadly, including at every assay genomic region (or position, if necessary). This can improve confidence about the broad effectiveness of the pipeline that could not be achieved using only cell lines or specimens previously tested in clinical laboratories.

Currently available algorithms to produce in silico NGS FASTQ data sets mainly revolve around de novo production of reads fitted to the Browser Extensible Data (BED) coordinates of interest [5] [7: Escalona M. Rocha S. Posada D. A comparison of tools for the simulation of genomic next-generation sequencing data. Nat Rev Genet. 2016; 17: 459-469]. However, such data sets have artificial features that may not match the actual raw data produced by a particular prospective clinical assay. Roy et al [6] recommend that bioinformatics pipeline validation should emulate the real-world environment of the laboratory as closely as possible. Therefore, bioinformatics pipelines would be more appropriately challenged by manipulating the original data set by incorporation of mutations at particular sites. This would serve to retain the depth profiles and read characteristics of an original assay, thereby retaining any data abnormalities inherent to the assay while avoiding the introduction of spurious anomalies that may arise because of the de novo read generation. To our knowledge, there has been only one such program, MutationMaker [8: Duncavage E.J. Abel H.J. Merker J.D. Bodner J.B. Zhao Q. Voelkerding K.V. Pfeifer J.D. A model study of in silico proficiency testing for clinical next-generation sequencing. Arch Pathol Lab Med. 2016; 140: 1085-1091]; however, this program is not publicly available and is only able to simulate a subset of genetic anomalies [single-nucleotide variants (SNVs), dinucleotide substitutions, and deletions].
In addition, its performance has only been demonstrated on amplicon-based sequencing assay data.

Herein, we have generated a publicly available, highly flexible tool, in silico Mutator (insiM), to introduce most types of mutations, including point mutations, insertions, deletions, and duplications of any size and at any allele frequency, in real paired-end data sets of any amplicon-based or hybrid-capture DNA-based NGS assay. The software is hosted on GitHub (https://github.com/thesushantpatil/insiM). Detailed documentation of its features and performance is provided, and the utility of such software systems to serve as a powerful adjunct during clinical NGS bioinformatics validation is demonstrated.

Data sets for in silico modification were generated by our previously reported amplicon-based and hybrid-capture clinical NGS assays [9: Kadri S. Long B.C. Mujacic I. Zhen C.J. Wurst M.N. Sharma S. McDonald N. Niu N. Benhamed S. Tuteja J.H. Seiwert T.Y. White K.P. McNerney M.E. Fitzpatrick C. Wang Y.L. Furtado L.V. Segal J.P. Clinical validation of a next-generation sequencing genomic oncology panel via cross-platform benchmarking against established amplicon sequencing assays. J Mol Diagn. 2017; 19: 43-56]. The first assay (UCM-OncoPlus) is a 1213-gene targeted hybridization capture assay that uses a Roche-Nimblegen SeqCap EZ (Roche Sequencing, Pleasanton, CA) custom-capture design, with 147 genes clinically reported in the medical record, as of April 2018. Sequence data (2 × 101-bp paired-end reads) were generated using a HiSeq 2500 instrument (Illumina, San Diego, CA) in rapid run mode. The second assay (OncoScreen ST2.0) is a 50-gene targeted amplicon panel using the Ion Ampliseq Cancer Hotspot Panel v2 primer set (Thermo Fisher Scientific, Waltham, MA) for amplification of 207 hotspot regions, sequenced via an Illumina MiSeq using 2 × 152-bp sequencing.
All data are stored and processed on a Health Insurance Portability and Accountability Act of 1996–compliant cluster computing system maintained by The University of Chicago (Chicago, IL) Center for Research Informatics. The UCM-OncoPlus and OncoScreen ST2.0 pipelines use BWA-MEM v0.7.12 [10: Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv.org, 2013, arXiv:1303.3997] and NovoAlign v3.03.00 (Novocraft, Selangor, Malaysia) for alignment, respectively. Both pipelines use a combination of Samtools v0.1.19 [11: Li H. Handsaker B. Wysoker A. Fennell T. Ruan J. Homer N. Marth G. Abecasis G. Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009; 25: 2078-2079] mpileup and an in-house developed pileup file analyzer for variant calling [9]. The large insertion/deletion (indel) support for the OncoPlus pipeline is mainly provided by the local de novo indel realignment software ABRA v0.96 [12: Mose L.E. Wilkerson M.D. Hayes D.N. Perou C.M. Parker J.S. ABRA: improved coding indel detection via assembly-based realignment. Bioinformatics. 2014; 30: 2813-2815], whereas the OncoScreen ST2.0 pipeline uses Amplicon Indel Hunter for reference-free calling of large (>5-bp) insertions and deletions [13: Kadri S. Zhen C.J. Wurst M.N. Long B.C. Jiang Z.-F. Wang Y.L. Furtado L.V. Segal J.P. Amplicon indel hunter is a novel bioinformatics tool to detect large somatic insertion/deletion mutations in amplicon-based next-generation sequencing data. J Mol Diagn. 2015; 17: 635-643].

insiM is an in silico tool, implemented in Python version 2.7.6 (Python Software Foundation, Beaverton, OR; https://www.python.org), for manipulating aligned NGS data at specified target regions by introducing various types of mutations at the specified frequencies. insiM requires the pysam module in the Python PATH. Table 1 lists all the mandatory and optional parameters required by the software in the form of a plain text configuration file. insiM can then be executed as follows:

python insiM_v1.0.py [configuration file]

Table 1. insiM Input Parameters in the Configuration File

Argument            Value
-assay (m)          Assay type: amplicon or capture (hybrid capture)
-bam (m)            BAM file
-target (m)         BED file containing target loci. Mutations are introduced at the midpoints of the BED coordinates. For simulating a mutation at a specific position, start and end BED coordinates should denote that exact same position.
-mutation (m)       Mutation type: snv, ins, del, dup, or mix (multitype)
-genome (m)         Genome FASTA file
-ampliconbed (m)    Amplicon BED file. Required only when -assay == amplicon.
-read (m)           Read length. Required only when -assay == amplicon.
-vaf (m)            Variant allele frequency. For multitype mutations, a separate value can be specified for each locus in the BED file.
-len (o)            Size of insert. If not specified, random sequence of length 10 is used for ins/del/dup.
-seq (o)            Sequence of insert. For insertion, this sequence is inserted; if not specified, a random sequence of length specified by -len is inserted.
-out (o)            Output FASTQ base name

BED, Browser Extensible Data; del, deletion; dup, duplication; ins, insertion; m, mandatory; mix, multitype; o, optional; snv, single-nucleotide variant.
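The rules in Table 1 (which arguments are mandatory, and which are needed only for amplicon assays) can be checked before a run. The following is a minimal sketch, not part of insiM: it assumes the parameters have already been read into a Python dict keyed by argument name without the leading dash, and the function name `check_parameters` is hypothetical. It does not parse the configuration file itself, because the exact file syntax is not given here.

```python
# Argument names and the mandatory/optional split follow Table 1.
# The dict representation and this helper are illustrative assumptions.
MANDATORY = ("assay", "bam", "target", "mutation", "genome", "vaf")
AMPLICON_ONLY = ("ampliconbed", "read")
OPTIONAL = ("len", "seq", "out")
ASSAY_TYPES = {"amplicon", "capture"}
MUTATION_TYPES = {"snv", "ins", "del", "dup", "mix"}


def check_parameters(params):
    """Return a list of problems found in a Table 1 style parameter dict."""
    problems = []
    for name in MANDATORY:
        if name not in params:
            problems.append("missing mandatory parameter: -" + name)
    assay = params.get("assay")
    if assay is not None and assay not in ASSAY_TYPES:
        problems.append("unknown assay type: " + str(assay))
    mutation = params.get("mutation")
    if mutation is not None and mutation not in MUTATION_TYPES:
        problems.append("unknown mutation type: " + str(mutation))
    if assay == "amplicon":
        # -ampliconbed and -read are required only for amplicon assays.
        for name in AMPLICON_ONLY:
            if name not in params:
                problems.append("-" + name + " is required when -assay is amplicon")
    known = set(MANDATORY) | set(AMPLICON_ONLY) | set(OPTIONAL)
    for name in params:
        if name not in known:
            problems.append("unrecognized parameter: -" + name)
    return problems
```

An empty list means the parameter set satisfies the Table 1 rules; any other result lists what to fix before launching a simulation.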
The software offers the ability to introduce either the same type of mutation, at constant variant allele frequency (VAF), for all mutation loci; or multitype mutations, wherein a specific mutation type, VAF, variant allele, and indel length (if applicable), separated by a semicolon, can be specified in the fourth column of the input BED file for each locus (Supplemental Table S1). The software extracts uniquely mapped reads at each specified mutation locus randomly, on the basis of the specified variant frequency, and outputs all (mutated or otherwise) reads to new paired FASTQ files. insiM also outputs the exact fraction of mutated reads (VAF_insiM) and variant sequence (for insertion and SNVs) at each mutation locus to a Variant Call Format file for downstream performance evaluation of the bioinformatics pipeline.

insiM works by evaluating reads in the context of their pairs and mutating one or both reads on the basis of the overlap at the mutation position (Figure 1). For example, if both read mates of a pair overlap the mutation locus, then both will be altered. The new FASTQ files should be nearly identical to the original FASTQ files other than the mutated reads and any hard clipping that the original alignment produced in the input BAM.

The specific logic of mutation generation in NGS data varies depending on the type of anomaly. For SNVs, only the specified base is mutated within the read. For insertions and duplications, a new sequence is added at the specified locus, and the end of the existing reads is then truncated to the original read lengths. In case of deletion simulations in amplicon-based assays, the reads are extended to the specified read length by first appending the genomic sequence until the amplicon end; then, non-specific nucleotides may be added.
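The read-level logic described above can be sketched on plain strings. This is an illustration under simplifying assumptions rather than insiM's code: reads are taken in reference orientation, base qualities are ignored, offsets are 0-based positions within the read, and the padding character `N` stands in for the "non-specific nucleotides". All function names are hypothetical.

```python
def apply_snv(read, offset, alt_base):
    """Replace only the base at `offset`; the read length is unchanged."""
    return read[:offset] + alt_base + read[offset + 1:]


def apply_insertion(read, offset, inserted):
    """Insert `inserted` before `offset`, then trim the read end so the
    read keeps its original length. A duplication can be modeled the same
    way by passing a copy of the adjacent sequence as `inserted`."""
    mutated = read[:offset] + inserted + read[offset:]
    return mutated[:len(read)]


def apply_amplicon_deletion(read, offset, del_len, downstream_ref,
                            read_len, pad="N"):
    """Delete `del_len` bases at `offset`, then extend back to `read_len`.

    `downstream_ref` is the reference sequence between the original read
    end and the amplicon end. Once it is used up, the read is padded with
    `pad` (an assumed stand-in for non-specific nucleotides).
    """
    mutated = read[:offset] + read[offset + del_len:]
    mutated += downstream_ref
    if len(mutated) < read_len:
        mutated += pad * (read_len - len(mutated))
    return mutated[:read_len]
```

For example, `apply_insertion("ACGTACGT", 4, "TTT")` keeps the first four bases, adds the insert, and drops the bases pushed past the original eight-base read length.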
In future updates, library-specific primers may also be used to more accurately extend the reads past the amplicon end. For deletions in hybrid-capture assays, insiM first generates a library fragment from the read pair, introduces the deletion in the fragment, and then reconstructs the read pair. In cases in which the fragment is longer than the sum of the read lengths, the read mates do not overlap each other. For such nonoverlapping reads, the region of the fragment that is not covered by reads is derived from the reference genome, whereas for overlapping reads (shorter fragments), the fragment sequence in the overlapped region is obtained from the forward read. The deletion is introduced in the fragment at the specified position, and a downstream genomic sequence is used to extend the fragment by the deletion length, thus preserving the fragment length/insert size. Finally, read sequences of the original length are extracted from either end of the mutated fragment.

To establish the credibility of in silico–mutated data sets as a supplemental tool in bioinformatics pipeline validation, 57 variants (across 24 genes) from 29 clinically validated reference samples [9] were compared against the respective simulated variants in a nonmalignant formalin-fixed, paraffin-embedded spleen sample. These 29 samples, previously run on amplicon assay OncoScreen ST2.0, were a subset of validation samples used in the clinical validation of the hybrid-capture UCM-OncoPlus assay [9]. Before running insiM, it was confirmed that the nonmalignant sample did not contain mutations at the 57 loci used in this comparison. The sample was sequenced with 62 million read pairs (2 × 101-bp paired end), with a median depth of 894× and a median insert size of 170 bp. Three independent iterations of insiM were performed to simulate the 57 variants using the multitype mode of insiM. The input BED file to insiM is shown in Supplemental Table S1. The output FASTQ files were then passed through the UCM-OncoPlus bioinformatics pipeline. VAF_simulated was calculated as the mean VAF obtained across the three iterations. For comparison, VAF_experimental values were obtained by processing the UCM-OncoPlus experimental data set of validation samples.

To demonstrate the applicability of insiM to validate bioinformatics pipelines, mutations across a range of VAFs and sizes (for insertions, deletions, and duplications) were simulated, over three independent iterations, in the same formalin-fixed, paraffin-embedded spleen sample. The sample was first aligned with BWA-MEM [10] for UCM-OncoPlus and NovoAlign for OncoScreen ST2.0 and then used as input for insiM. The input BED file contained clinical exonic territory (2497 exons) for the UCM-OncoPlus and 207 amplicons for the OncoScreen ST2.0 assays. Before running insiM, it was confirmed that the sample did not contain mutations at the loci analyzed in the study.
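The hybrid-capture deletion procedure described earlier (rebuild the fragment from the read pair, delete, extend with downstream reference, re-extract the reads) can be sketched as follows. This is a simplified illustration, not insiM's implementation: it assumes both mates are given in reference orientation with 0-based start coordinates, align without gaps or clipping, and that the deletion starts inside the fragment. The function names are hypothetical.

```python
def build_fragment(ref, fwd_start, fwd_seq, rev_start, rev_seq):
    """Rebuild the library fragment spanned by a read pair."""
    fwd_end = fwd_start + len(fwd_seq)
    if rev_start >= fwd_end:
        # Mates do not overlap: fill the uncovered region from the reference.
        return fwd_seq + ref[fwd_end:rev_start] + rev_seq
    # Mates overlap: the forward read supplies the overlapped region.
    return fwd_seq + rev_seq[fwd_end - rev_start:]


def delete_in_fragment(ref, frag_start, fragment, del_pos, del_len):
    """Remove `del_len` bases starting at reference position `del_pos`,
    then append downstream reference so the fragment length is preserved."""
    offset = del_pos - frag_start
    frag_end = frag_start + len(fragment)
    kept = fragment[:offset] + fragment[offset + del_len:]
    # Downstream sequence starts after the fragment, or after the deleted
    # region if the deletion runs past the original fragment end.
    next_ref = max(frag_end, del_pos + del_len)
    missing = len(fragment) - len(kept)
    return kept + ref[next_ref:next_ref + missing]


def split_reads(fragment, read_len):
    """Take reads of the original length from either end of the fragment.
    The second read is returned in reference orientation; writing FASTQ
    would additionally require reverse-complementing it."""
    return fragment[:read_len], fragment[-read_len:]
```

Because the fragment is extended by exactly the number of bases removed, the insert size seen by a downstream aligner stays the same as in the original pair.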
As listed below, for each set of input parameters (mutation type, VAF, or indel size as a variable), three independent iterations were performed:

i) SNVs at 10%, 50%, and 100% VAFs;
ii) insertions, deletions, and duplications of 10-bp length, at VAFs of 0%, 1%, 3%, 5%, and 10% to 100% in increments of 10%; and
iii) insertions, deletions, and duplications of lengths from 10 to 100 bp, increased by increments of 10 bp, at 50% VAF.

The length range was extended up to 1000 bp, with 100-bp increments, for deletions in hybrid-capture data. The output FASTQ files were then passed through the respective clinical bioinformatics pipelines. The observed VAF (VAF_observed) was calculated as the mean VAF across the three iterations.

The in silico data mutation software, insiM, was generated to perform manipulation of NGS data for validation of variant calling in bioinformatics pipelines. insiM accepts a BAM file and a target region BED file, along with mutation types and VAFs for variants, for any paired-end NGS assay. This publicly available software is highly flexible and can work on amplicon or hybrid-capture data, either mutating an entire panel territory with a single mutation type or introducing different types of mutations at specified positions and VAFs simultaneously. insiM will output paired-end FASTQ files containing the type of mutation (SNV, deletion, insertion, or duplication) at all specified target coordinates and VAFs, along with a Variant Call Format file of mutations that can be used for direct comparison with the output of the bioinformatics pipeline. These FASTQ files may then be passed through the assay's bioinformatics pipeline, and the resulting variant list can be examined at each of the mutation coordinates to assess for the presence of the selected in silico mutant anomaly at the appropriate VAF.
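The simulation series and the per-locus comparison described above can be written down compactly. The sketch below is illustrative only: the input VAF and size series are taken from the text, while the dict-based data layout, the treatment of an undetected variant as a VAF of 0, the `tolerance` threshold, and the function names are assumptions.

```python
from statistics import mean

# Input series from the text (VAFs in percent, sizes in base pairs).
INPUT_VAFS_PERCENT = [0, 1, 3, 5] + list(range(10, 101, 10))
INDEL_SIZES_BP = list(range(10, 101, 10))
# Deletions in hybrid-capture data continue to 1000 bp in 100-bp steps.
CAPTURE_DELETION_SIZES_BP = INDEL_SIZES_BP + list(range(200, 1001, 100))


def mean_observed_vaf(iterations):
    """Average the pipeline-reported VAF per locus over all iterations.

    `iterations` is a list of dicts mapping a locus key to a VAF (0 to 1).
    A locus missing from an iteration is counted as VAF 0 (assumption).
    """
    loci = set()
    for calls in iterations:
        loci.update(calls)
    return {
        locus: mean(calls.get(locus, 0.0) for calls in iterations)
        for locus in sorted(loci)
    }


def find_gaps(truth, observed, tolerance=0.05):
    """List loci whose observed VAF differs from the introduced VAF by
    more than `tolerance` (a hypothetical acceptance threshold)."""
    return [
        locus for locus in sorted(truth)
        if abs(observed.get(locus, 0.0) - truth[locus]) > tolerance
    ]
```

Here `truth` would hold the exact VAFs that insiM reports in its Variant Call Format output, and `observed` the averaged pipeline calls; loci returned by `find_gaps` are candidates for manual review.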
To demonstrate the utility of insiM as a supplemental tool in bioinformatics pipeline validation, 29 clinical validation samples, previously tested on the clinical amplicon-based panel OncoScreen ST2.0 at The University of Chicago, were used. After filtering out common homozygous SNVs, the remaining 57 unique somatic mutations were used. These variants were simulated in a nonmalignant formalin-fixed, paraffin-embedded spleen sample with input VAFs derived from OncoScreen ST2.0 data. The simulated data sets were then processed through the UCM-OncoPlus bioinformatics pipeline to obtain VAF_simulated. The same set of samples was run on the UCM-OncoPlus assay, and the experimental data set was processed through the bioinformatics pipeline to obtain VAF_experimental. Figure 2, A and B, shows that the performance of simulated data sets is highly comparable to that of experimental data sets. Both simulated and experimental data sets exhibit a high degree of correlation (R² = 0.99 and R² = 0.96, respectively) when compared with the reference data set. These results exclude the possibility of data set–specific bioinformatics pipeline artifacts and thus establish the validity of in silico–mutated data sets as a supplemental tool in bioinformatics pipeline validation.

To establish the basic functionality of the software, SNVs, insertions, deletions, and duplications were generated at the midpoint of capture regions corresponding to 147 clinically reported genes for the UCM-OncoPlus assay and of each amplicon of OncoScreen ST2.0, at varying VAFs and indel sizes. Resultant FASTQ files were processed through the respective pipelines to detect simulated variants. For UCM-OncoPlus data, SNVs at three different input VAFs (10%, 50%, and 100%) were introduced as per these specified frequencies over the entire territory of the assay, with SDs of 0.6%, 1%, and 0.3% at VAF_insiM of 10%, 50%, and 100%, respectively (Figure 2C).
These SDs are expected because of random sampling of the fragmented reads in hybrid-capture assays. Similar deviations of experimental VAFs were seen when comparing experimental VAFs of the 57 variants with the input VAFs (Figure 2B). In contrast, simulations in amplicon data do not show this effect because all reads from an amplicon have the same start and end (Supplemental Figure S1A).

For insertions and deletions, input VAFs used in the validation process were 0%, 1%, 3%, 5%, and 10% to 100% in increments of 10% (with 10-bp length for insertions, duplications, and deletions) (Figure 3, A and B). The exact VAFs introduced by insiM with these inputs are indicated as VAF_insiM. Selected loci were also manually investigated in Integrative Genomics Viewer to ascertain the expected effect of the simulations [14: Robinson J.T. Thorvaldsdóttir H. Winckler W. Guttman M. Lander E.S. Getz G. Mesirov J.P. Integrative genomics viewer. Nat Biotechnol. 2011; 29: 24-26]. However, the VAF_observed values using the aligner (BWA-MEM) alone were found to be less than the VAF_insiM values. Such a deviation was more pronounced in the case of duplications (Supplemental Figure S2A). It was observed that for a few loci, the presence of a germline SNV in the local downstream sequence causes the aligner to align the duplicated sequence, derived from the reference genome, to the specified mutation locus. As a result, an insertion containing the SNV is introduced downstream of that locus (Supplemental Figure S2B). This lack of sensitivity for indel detection is not uncommon, and most laboratories use indel realignment software to help improve accuracy [15: Frampton G.M. Fichtenholtz A. Otto G.A. Wang K. Downing S.R. He J. et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol. 2013; 31: 1023-1031] [16: Cheng D.T. Mitchell T.N. Zehir A. Shah R.H. Benayed R. Syed A. Chandramohan R. Liu Z.Y. Won H.H. Scott S.N. Brannon A.R. O'Reilly C. Sadowska J. Casanova J. Yannes A. Hechtman J.F. Yao J. Song W. Ross D.S. Oultache A. Dogan S. Borsu L. Hameed M. Nafa K. Arcila M.E. Ladanyi M. Berger M.F. Memorial Sloan Kettering-integrated mutation profiling of actionable cancer targets (MSK-IMPACT): a hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. J Mol Diagn. 2015; 17: 251-264] (discussed below). Similarly, it was seen that the discrepancy between the VAF_insiM values and the aligner-based VAF_observed values is remedied by ABRA [12], an indel realignment tool, which has superior performance for deletions over insertions and duplications (Figure 3, A and B, and Supplemental Figure S2A). The respective data for the OncoScreen ST2.0 assay are shown in Supplemental Figure S1, B and C, in which a high level of concordance was observed between the VAF_insiM and VAF_observed values.

As noted above, indel detection efficiency has always been an area of concern for clinical NGS laboratories [17: Davies K.D. Farooqi M.S. Gruidl M. Hill C.E. Woolworth-Hirschhorn J. Jones H. Jones K.L. Magliocco A. Mitui M. O'Neill P.H. O'Rourke R. Patel N.M. Qin D. Ramos E. Rossi M.R. Schneider T.M. Smith G.H. Zhang L. Park J.Y. Aisner D.L. Multi-institutional FASTQ file exchange as a means of proficiency testing for next-generation sequencing bioinformatics and variant interpretation. J Mol Diagn. 2016; 18: 572-579]. Large (>20-bp) indel detection efficiency has been shown to be influenced by read length [18: Li R. Hsieh C.-L. Young A. Zhang Z. Ren X. Zhao Z. Illumina synthetic long read sequencing allows recovery of missing sequences even in the "Finished" C. elegans genome. Sci Rep. 2015; 5: 10814], which affects the VAF_observed in two ways. An aligner's performance worsens as the reads increasingly differ from the reference sequence. Unlike amplicon assays, in which the reads from each amplicon have the same start and end, hybrid-capture assays pull down DNA fragments that have different starts and ends. Because of this, sequencing reads may either contain the complete insertion or overlap it partially. In the latter case, the aligner usually aligns the reads as end insertions of various lengths or a cluster of small indels and mismatches. As a result, different variants, corresponding to varying insertion lengths, are observed at the mutation locus, and the observed VAF for the specified insertion length decreases. As the insertion length approaches the read length, reads become unmappable or are mapped non-specifically in the genome, resulting in low VAFs at the insertion locus. Similar performance issues as a function of indel length have also been reported in amplicon assays [13].

For insertions and duplications, sizes in the range of 10 to 100 bp, increased by increments of 10 bp, were tested at 50% input VAF. For deletions, the above range was further extended up to 1000 bp in hybrid-capture data, with 100-bp increments.
Correlation between VAFsinsiM and VAFsobserved for UCM-OncoPlus data, obtained by two different methods [aligner (BWA-MEM) and indel realigner (ABRA)], is shown in Figure 3, C and D, for varying sizes of insertions and deletions, respectively. No insertions were detected using only BWA-MEM at an insertion size of 50 bp and beyond, whereas VAFs obtained from ABRA realignment decrease linearly from 10 to 90 bp. In the case of deletions, no deletions were detected using only BWA-MEM at 70 bp and beyond. Using ABRA, the VAFsobserved decreased linearly from 10 to 200 bp and then stayed relatively consistent, albeit lower than expected for higher sizes. The results herein demonstrate the use of in silico–mutated data sets in identifying the gaps in detection accuracy of a pipeline and, thus, help establish the utility of incorporating a realignment step to improve indel calling performance in hybrid-capture pipelines. The respective data for the OncoScreen ST2.0 assay are shown in Supplemental Figure S3, in which the alignment-independent tool, amplicon indel hunter,13Kadri S. Zhen C.J. Wurst M.N. Long B.C. Jiang Z.-F. Wang Y.L. Furtado L.V. Segal J.P. Amplicon indel hunter is a novel bioinformatics tool to detect large somatic insertion/deletion mutations in amplicon-based next-generation sequencing data.J Mol Diagn. 2015; 17: 635-643Abstract Full Text Full Text PDF PubMed Scopus (20) Google Scholar outperforms the alignment-based approach (NovoAlign). Different types of mutations at different input VAFs can be generated at specified loci in a single iteration using insiM. NPM1, FLT3, CEBPA, and DNMT3A genes are recurrently mutated in acute myeloid leukemia.19Lin P. Falini B. Acute myeloid leukemia with recurrent genetic abnormalities other than translocations.Am J Clin Pathol. 
2015; 144: 19-28. For the demonstration of multitype mutation functionality, commonly reported SNV (DNMT3A, NM_022552.4:c.2644C>T), insertion (NPM1, NM_002520.6:c.863_864insCCTG), deletion (CEBPA, NM_004364.4:c.750del), and duplication (FLT3, NM_004119.2:c.1784_1804dup) mutations were introduced at different input VAFs in the same nonmalignant formalin-fixed, paraffin-embedded spleen sample used previously (Supplemental Figure S4A). All four mutations were detected successfully by the UCM-OncoPlus pipeline (Supplemental Figure S4, B–E).

Unlike other traditional molecular diagnostic systems, NGS assays typically involve complex custom analytical pipelines with many opportunities for implementation and design errors. The College of American Pathologists' accreditation standards require separate validation assessments of assay bioinformatics. Running clinical test samples through the entire laboratory assay process ensures that anomalies are faithfully captured, amplified, and recorded by every phase of the analytical system. However, assembling a suitable set of validation samples that harbor genomic anomalies of interest and have been previously tested in other clinical laboratories is one of the most challenging parts of the process. Spike-in synthetic DNA constructs to test particular mutations at specific loci can be expensive and can be performed for only a small subset of larger panel-sized assays. In this regard, in silico mutation testing fills a clear and otherwise unfillable gap. In silico data simulation offers an extremely low-cost method for evaluating a bioinformatics pipeline at many thousands of mutation sites in a single iteration at a variety of input VAFs. It may also be used to perform exhaustive assessments of candidate pipeline modules intended to perform various steps (eg, alignment and variant calling) during the pipeline design phase.
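One way to summarise such a many-site evaluation is to compare, locus by locus, the VAF that was spiked in with the VAF the pipeline reports, and to flag loci where the signal is largely lost. The sketch below is an illustration only and is not taken from insiM. The function names, the dictionary layout, and the 0.5 ratio threshold are all assumptions, and a laboratory would choose its own acceptance criteria.

```python
def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele frequency from read counts (0.0 when there is no coverage)."""
    return alt_reads / total_reads if total_reads else 0.0


def detection_gaps(expected: dict, observed: dict, min_ratio: float = 0.5) -> list:
    """Return loci whose observed VAF is below min_ratio times the expected VAF.

    expected: locus -> VAF introduced in silico
    observed: locus -> VAF reported by the pipeline (missing locus = not called)
    min_ratio: hypothetical acceptance threshold, not a published standard
    """
    gaps = []
    for locus, expected_vaf in expected.items():
        observed_vaf = observed.get(locus, 0.0)
        if observed_vaf < min_ratio * expected_vaf:
            gaps.append(locus)
    return gaps


# Made-up example values, for illustration only.
expected = {"locus_a": 0.5, "locus_b": 0.5, "locus_c": 0.1}
observed = {"locus_a": vaf(48, 100), "locus_b": vaf(10, 100)}
print(detection_gaps(expected, observed))
```

Treating a locus that is absent from the pipeline output as a VAF of zero keeps uncalled variants in the gap list rather than silently dropping them.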
Although there is no substitute for actual samples, performing in silico mutation of actual data generated using the same assay (rather than de novo read synthesis) helps preserve the general features of the assay itself, with the additional overlay of manufactured mutational signal. Herein, we have presented insiM, which allows the simulation of multiple mutation types in real targeted panel data, including SNVs, insertions, deletions, and duplications, at any input VAF. This can facilitate the validation of a clinical NGS bioinformatics pipeline for its accuracy in all the regions that it is intended to cover, helping to identify any performance gaps in analytical systems. It can also help to assess the reliability of the pipeline for larger (>20-bp) indels across the entire assay, and may be the only means of doing so, even with a large patient sample set. In addition, the ability to tailor the input VAFs allows the laboratory to test that the informatics system functions successfully across a broad territory at various dilutions. The software is designed for genomic anomalies, including SNVs, indels, and duplications. In the future, insiM can be made more comprehensive by incorporating additional features, such as simulation of large structural variations, simulation of variants in cis, and identification of mutation loci with preexisting variants. We suggest using insiM as a supplementary tool during the validation of an assay's bioinformatics pipeline to examine the limitations of the various modules and methods.

Supplemental Figure S1. A: Single-nucleotide variants at input variant allele frequencies (VAFs) of 10%, 50%, and 100% were introduced at the midpoints of 207 amplicons for the OncoScreen ST2.0 assay. The mean of VAF_insiM for three iterations has been plotted for each exon.
B and C: Correlation between VAF_insiM and mean VAF_observed, obtained by two different methods, is shown at different input VAFs for insertions (B) and deletions (C) of 10-bp size.

Supplemental Figure S2. A: Duplications of 10-bp size were introduced at the midpoints of exons of 147 clinically reported genes for the OncoPlus assay. Correlation between VAF_insiM and VAF_observed, obtained by three different methods, is shown at different input variant allele frequencies (VAFs). B: A schematic showing how the presence of a germline single-nucleotide variant (SNV) in the local downstream sequence of a mutation locus affects the VAF_observed at that locus. In such a case, for reads containing the SNV (A, highlighted in red), a duplicated sequence, derived from the reference genome (highlighted in green), is aligned to the specified mutation locus and an insertion containing the SNV is introduced downstream.

Supplemental Figure S3. Insertions/deletions at an input variant allele frequency (VAF) of 50% were introduced at the midpoints of 207 amplicons for the OncoScreen ST2.0 assay. VAF_observed, obtained by two different methods, is shown along with VAF_insiM, obtained from the insiM counter, for varying sizes of insertions (A) and deletions (B).

Supplemental Figure S4. A: An example input BED file of target mutations. Mutations are introduced at the midpoint of each BED entry. The fourth column contains the type of mutation, variant allele frequency (VAF), variant sequence, and variant length, each separated by a semicolon. B–E: Integrative Genomics Viewer screenshots of each of the four mutations specified in the input BED file. B: The NM_022552.4:c.2644C>T mutation in the DNMT3A gene. C: The NM_002520.6:c.863_864insCCTG in the NPM1 gene. D: The NM_004119.2:c.1784_1804dup in the FLT3 gene. E: The NM_004364.4:c.750del in the CEBPA gene.
chr, chromosome; del, deletion; dup, duplication; ins, insertion.

Supplemental Table S1
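The target BED layout described for Supplemental Figure S4A (mutation placed at the midpoint of each entry, with a semicolon-separated fourth column holding type, VAF, variant sequence, and variant length) could be read along the following lines. This is a minimal sketch and not insiM's own parser. The class and function names, the mutation-type tokens, the example coordinates, the use of fractional VAFs, the floor-rounded midpoint, and the handling of empty fields are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TargetMutation:
    chrom: str
    position: int   # midpoint of the BED interval (floor-rounded; an assumption)
    mut_type: str   # token spelling is hypothetical, e.g. "snv", "ins", "del", "dup"
    vaf: float      # assumed to be given as a fraction between 0 and 1
    sequence: str   # variant sequence; may be empty, e.g. for a deletion
    length: int     # variant length; 0 when the field is left empty


def parse_target_line(line: str) -> TargetMutation:
    """Parse one tab-separated BED line with a semicolon-separated fourth column."""
    chrom, start, end, spec = line.rstrip("\n").split("\t")[:4]
    # Order follows the caption: type; VAF; variant sequence; variant length.
    mut_type, vaf, sequence, length = spec.split(";")
    midpoint = (int(start) + int(end)) // 2
    return TargetMutation(
        chrom=chrom,
        position=midpoint,
        mut_type=mut_type,
        vaf=float(vaf),
        sequence=sequence,
        length=int(length) if length else 0,
    )


# Made-up coordinates, for illustration only.
example = "chr1\t1000\t1100\tins;0.3;CCTG;4"
print(parse_target_line(example))
```

Unpacking the fourth column into exactly four names makes a malformed entry fail immediately with a `ValueError` rather than being silently misread.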
