Review Open access Peer reviewed

Recommendations for the Use of in Silico Approaches for Next-Generation Sequencing Bioinformatic Pipeline Validation

2022; Elsevier BV; Volume: 25; Issue: 1 Language: English

10.1016/j.jmoldx.2022.09.007

ISSN

1943-7811

Authors

Eric J. Duncavage, Joshua F. Coleman, Monica E. de Baca, Sabah Kadri, Annette Leon, Mark J. Routbort, Somak Roy, Carlos J. Suarez, Chad Vanderbilt, Justin M. Zook

Topic(s)

Genomics and Phylogenetic Studies

Abstract

In silico approaches for next-generation sequencing (NGS) data modeling have utility in the clinical laboratory as a tool for clinical assay validation. In silico NGS data can take a variety of forms, including pure simulated data or manipulated data files in which variants are inserted into existing data files. In silico data enable simulation of a range of variants that may be difficult to obtain from a single physical sample. Such data allow laboratories to more accurately test the performance of clinical bioinformatics pipelines without sequencing additional cases. For example, clinical laboratories may use in silico data to simulate low variant allele fraction variants to test the analytical sensitivity of variant calling software or simulate a range of insertion/deletion sizes to determine the performance of insertion/deletion calling software. In this article, the Working Group reviews the different types of in silico data with their strengths and limitations, methods to generate in silico data, and how data can be used in the clinical molecular diagnostic laboratory. Survey data indicate how in silico NGS data are currently being used. Finally, potential applications for which in silico data may become useful in the future are presented.

Next-generation sequencing (NGS)–based molecular diagnostics have rapidly proliferated for both molecular somatic and germline applications, necessitating standards and guidelines for their commonplace use in the clinical laboratory.[1] Richards S. Aziz N. Bale S. Bick D. Das S. Gastier-Foster J. Grody W.W. Hegde M. Lyon E. Spector E. Voelkerding K. Rehm H.L. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015; 17: 405-424; [2] Li M.M. Datto M. Duncavage E.J. Kulkarni S. Lindeman N.I. Roy S. Tsimberidou A.M. Vnencak-Jones C.L. Wolff D.J. Younes A. Nikiforova M.N. Standards and guidelines for the interpretation and reporting of sequence variants in cancer: a joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. J Mol Diagn. 2017; 19: 4-23; [3] Roy S. Coldren C. Karunamurthy A. Kip N.S. Klee E.W. Lincoln S.E. Leon A. Pullambhatla M. Temple-Smolkin R.L. Voelkerding K.V. Wang C. Carter A.B.
Standards and guidelines for validating next-generation sequencing bioinformatics pipelines. J Mol Diagn. 2018; 20: 4-27. A major advantage of broad NGS-based panels is that NGS can identify variants in a large number of genes; however, the analytical validation of NGS panels can be challenging, as it is difficult to obtain physical samples with the vast number of variants capable of being detected by the assay. Instead, most laboratories will sequence a representative number of samples or cell lines with known variants and rely on sequencing metrics (ie, the fraction of targets with sufficient coverage) to infer performance across most sequenced positions as part of analytical assay validation.[4] Jennings L.J. Arcila M.E. Corless C. Kamel-Reid S. Lubin I.M. Pfeifer J. Temple-Smolkin R.L. Voelkerding K.V. Nikiforova M.N. Guidelines for validation of next-generation sequencing–based oncology panels: a joint consensus recommendation of the Association for Molecular Pathology and College of American Pathologists. J Mol Diagn. 2017; 19: 341-365. Several approaches have been developed to supplement real physical samples for analytical validation, including spiking in synthetic DNA and in silico data. Synthetic DNA with common or challenging variants of clinical interest can be added to reference samples (eg, from Genome in a Bottle; [5] Zook J.M. McDaniel J. Olson N.D. Wagner J. Parikh H. Heaton H. Irvine S.A. Trigg L. Truty R. McLean C.Y. De La Vega F.M. Xiao C. Sherry S. Salit M. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 2019; 37: 561-566) to help validate the method and bioinformatics parts of the assay, but these are limited to short-read sequencing and certain classes of variants.[5]; [6] He H.J. Stein E.V. Konigshofer Y. Forbes T. Tomson F.L. Garlick R. Yamada E. Godfrey T. Abe T. Tamura K. Borges M. Goggins M. Elmore S. Gulley M.L. Larson J.L. Ringel L. Haynes B.C. Karlovich C. Williams P.M. Garnett A. Ståhlberg A. Filges S. Sorbara L. Young M.R. Srivastava S. Cole K.D. Multilaboratory assessment of a new reference material for quality assurance of cell-free tumor DNA measurements. J Mol Diagn. 2019; 21: 658-676; [7] Lincoln S.E. Hambuch T. Zook J.M. Bristow S.L. Hatchell K. Truty R. Kennemer M. Shirts B.H. Fellowes A. Chowdhury S. Klee E.W. Mahamdallie S. Cleveland M.H. Vallone P.M. Ding Y. Seal S. DeSilva W. Tomson F.L. Huang C. Garlick R.K. Rahman N. Salit M. Kingsmore S.F. Ferber M.J. Aradhya S. Nussbaum R.L. One in seven pathogenic variants can be challenging to detect by NGS: an analysis of 450,000 patients with implications for clinical sensitivity and genetic test implementation. Genet Med. 2021; 23: 1673-1680; [8] Sims D.J. Harrington R.D. Polley E.C. Forbes T.D. Mehaffey M.G. McGregor P.M. Camalier C.E. Harper K.N. Bouk C.H. Das B. Conley B.A. Doroshow J.H. Williams P.M. Lih C.J. Plasmid-based materials as multiplex quality controls and calibrators for clinical next-generation sequencing assays. J Mol Diagn. 2016; 18: 336-349; [9] Ewing A.D. Houlahan K.E. Hu Y. Ellrott K. Caloian C. Yamaguchi T.N. Bare J.C. P'Ng C. Waggott D. Sabelnykova V.Y. Kellen M.R. Norman T.C. Haussler D. Friend S.H. Stolovitzky G. Margolin A.A. Stuart J.M. Boutros P.C.
Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat Methods. 2015; 12: 623-630. In silico NGS validation testing can take many forms and has been adopted by many clinical laboratories, and commercial proficiency testing programs are now available (eg, College of American Pathologists).[10] Duncavage E.J. Abel H.J. Merker J.D. Bodner J.B. Zhao Q. Voelkerding K.V. Pfeifer J.D. A model study of in silico proficiency testing for clinical next-generation sequencing. Arch Pathol Lab Med. 2016; 140: 1085-1091. Although many previous guidelines do not discuss in silico data, two previous guidelines for validation of oncology NGS panels and bioinformatics pipelines recommended using in silico data during the optimization and familiarization process and envisioned increasing use of in silico data to augment real samples for some pathogenic mutations.[3,4] In this article, the Working Group focuses on what constitutes in silico NGS testing, how in silico data files can be generated, how in silico testing can be used in clinical NGS assay validation, and the limitations of in silico testing compared with physical samples.

In silico NGS data may be broadly defined as any data that have been artificially manipulated or generated. For example, in silico data may be generated de novo by simulating reads from reference sequence data (purely simulated data) (Figure 1A and Supplemental Table S1).[11] Huang W. Li L. Myers J.R. Marth G.T. ART: a next-generation sequencing read simulator. Bioinformatics. 2012; 28: 593-594; [12] Frampton M. Houlston R. Generation of artificial FASTQ files to evaluate the performance of next-generation sequencing pipelines. PLoS One. 2012; 7: e49110; [13] Xie Q. Liu Q. Mao F. Cai W. Wu H. You M. Wang Z. Chen B. Sun Z.S. Wu J. A Bayesian framework to identify methylcytosines from high-throughput bisulfite sequencing data. PLoS Comput Biol. 2014; 10: e1003853; [14] Cao M.D. Ganesamoorthy D. Zhou C. Coin L.J.M. Simulating the dynamics of targeted capture sequencing with CapSim. Bioinformatics. 2018; 34: 873-874; [15] Caboche S. Audebert C. Lemoine Y. Hot D. Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data. BMC Genomics. 2014; 15: 264; [16] Li Y. Han R. Bi C. Li M. Wang S. Gao X. DeepSimulator: a deep simulator for nanopore sequencing. Bioinformatics. 2018; 34: 2899-2908; [17] Li Y. Wang S. Wang S. Bi C. Qiu Z. Li M. Gao X. DeepSimulator1.5: a more powerful, quicker and lighter simulator for nanopore sequencing. Bioinformatics.
2020; 36: 2578-2580; [18] Shcherbina A. FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets. BMC Res Notes. 2014; 7: 533; [19] Balzer S. Malde K. Lanzén A. Sharma A. Jonassen I. Characteristics of 454 pyrosequencing data-enabling realistic simulation with flowsim. Bioinformatics. 2010; 26: i420-i425; [20] McElroy K.E. Luciani F. Thomas T. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012; 13: 74; [21] Angly F.E. Willner D. Rohwer F. Hugenholtz P. Tyson G.W. Grinder: a versatile amplicon and shotgun sequence simulator. Nucleic Acids Res. 2012; 40: e94; [22] Yuan X. Zhang J. Yang L. IntSIM: an integrated simulator of next-generation sequencing data. IEEE Trans Biomed Eng. 2017; 64: 441-451; [23] Lau B. Mohiyuddin M. Mu J.C. Fang L.T. Asadi N.B. Dallett C. Lam H.Y.K. LongiSLND: in silico sequencing of lengthy and noisy datatypes. Bioinformatics. 2016; 32: 3829-3832; [24] Luo R. Sedlazeck F.J. Darby C.A. Kelly S.M. Schatz M.C. LRSim: a linked-reads simulator generating insights for better genome partitioning. Comput Struct Biotechnol J. 2017; 15: 478-484; [25] Holtgrewe M. Mason - A Read Simulator for Second Generation Sequencing Data. FU Berlin, Berlin, Germany; 2010; [26] Yang C. Chu J. Warren R.L. Birol I. NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience. 2017; 6: 1-6; [27] Stephens Z.D. Hudson M.E. Mainzer L.S. Taschuk M. Weber M.R. Iyer R.K. Simulating next-generation sequencing datasets from empirical mutation and sequencing models. PLoS One.
2016; 11: e0167047; [28] Wei Z.G. Zhang S.W. NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model. BMC Bioinformatics. 2018; 19: 177; [29] Ono Y. Asai K. Hamada M. PBSIM: PacBio reads simulator - toward accurate genome assembly. Bioinformatics. 2013; 29: 119-121; [30] Hu X. Yuan J. Shi Y. Lu J. Liu B. Li Z. Chen Y. Mu D. Zhang H. Li N. Yue Z. Bai F. Li H. Fan W. pIRS: profile-based Illumina pair-end reads simulator. Bioinformatics. 2012; 28: 1533-1535; [31] Xia Y. Liu Y. Deng M. Xi R. Pysim-sv: a package for simulating structural variation data with GC-biases. BMC Bioinformatics. 2017; 18: 53; [32] Bartenhagen C. Dugas M. RSVSim: an R/Bioconductor package for the simulation of structural variations. Bioinformatics. 2013; 29: 1679-1681; [33] Xing Y. Dabney A.R. Li X. Wang G. Gill C.A. Casola C. SECNVs: a simulator of copy number variants and whole-exome sequences from reference genomes. Front Genet. 2020; 11: 82; [34] Chen S. Han Y. Guo L. Hu J. Gu J. SeqMaker: a next generation sequencing simulator with variations, sequencing errors and amplification bias integrated. The Institute of Electrical and Electronics Engineers (IEEE) International Conference on Bioinformatics and Biomedicine (BIBM). 2016: 835-840; [35] Baker E.A.G. Goodwin S. McCombie W.R. Mendivil Ramos O. SiLiCO: a simulator of long read sequencing in PacBio and Oxford Nanopore. bioRxiv. 2016; ([Preprint] doi:10.1101/076901); [36] Stöcker B.K. Köster J. Rahmann S. SimLoRD: simulation of long read data. Bioinformatics. 2016; 32: 2704-2706; [37] Yue J.X. Liti G. SimuG: a general-purpose genome simulator. Bioinformatics.
2019; 35: 4442-4444; [38] Pattnaik S. Gupta S. Rao A.A. Panda B. SInC: an accurate and fast error-model based simulator for SNPs, indels and CNVs coupled with a read generator for short-read sequence data. BMC Bioinformatics. 2014; 15: 40; [39] Bolognini D. Sanders A. Korbel J.O. Magi A. Benes V. Rausch T. VISOR: a versatile haplotype-aware structural variant simulator for short- and long-read sequencing. Bioinformatics. 2020; 36: 1267-1269; [40] Kim S. Jeong K. Bafna V. Wessim: a whole-exome sequencing simulator based on in silico exome capture. Bioinformatics. 2013; 29: 1076-1077. More commonly, in clinical laboratories, in silico data have been generated by manipulating existing NGS data files (Table 1).[9]; [41] Samadian S. Bruce J.P. Pugh T.J. Bamgineer: introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets. PLoS Comput Biol. 2018; 14: e1006080; [42] Patil S.A. Mujacic I. Ritterhouse L.L. Segal J.P. Kadri S. insiM: in silico mutator software for bioinformatics pipeline validation of clinical next-generation sequencing assays. J Mol Diagn. 2019; 21: 19-26; [43] Li Z. Fang S. Zhang R. Yu L. Zhang J. Bu D. Sun L. Zhao Y. Li J. VarBen: generating in silico reference data sets for clinical next-generation sequencing bioinformatics pipeline evaluation. J Mol Diagn.
2021; 23: 285-299. For example, two data files from physical samples may be mixed at various ratios to simulate variants at different variant allele fractions (VAFs), so that bioinformatic pipeline performance can be evaluated across more VAFs than may be obtained with physical samples (mixing sample data) (Figure 1B).[44] Spencer D.H. Tyagi M. Vallania F. Bredemeyer A.J. Pfeifer J.D. Mitra R.D. Duncavage E.J. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J Mol Diagn. 2014; 16: 75-88. Data from a single physical sample may also be downsampled to simulate the effects of lower coverage depths on variant calling (downsampling FASTQ files) (Figure 1C).[44]; [45] Cottrell C.E. Al-Kateb H. Bredemeyer A.J. Duncavage E.J. Spencer D.H. Abel H.J. Lockwood C.M. Hagemann I.S. O'Guin S.M. Burcea L.C. Sawyer C.S. Oschwald D.M. Stratman J.L. Sher D.A. Johnson M.R. Brown J.T. Cliften P.F. George B. McIntosh L.D. Shrivastava S. Nguyen T.T. Payton J.E. Watson M.A. Crosby S.D. Head R.D. Mitra R.D. Nagarajan R. Kulkarni S. Seibert K. Virgin IV, H.W. Milbrandt J. Pfeifer J.D. Validation of a next-generation sequencing assay for clinical molecular oncology. J Mol Diagn. 2014; 16: 89-105. In silico data may also be generated by manipulating read-level data within data files generated from physical samples (manipulated assay data) (Figure 1D and Table 1).
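
The mixing and downsampling approaches described above come down to simple read-count arithmetic, which the following minimal Python sketch illustrates. It is not taken from the article or from any of the cited tools. The function names (`mixing_fraction`, `mix_read_pairs`, `downsample_read_pairs`) and the in-memory list of read-pair records are hypothetical, and the sketch assumes that the variant-negative sample carries no variant reads at the site and that both samples cover the site at comparable depth.

```python
import random


def mixing_fraction(target_vaf, source_vaf):
    """Fraction of read pairs to draw from the variant-positive sample.

    Assumption (illustrative): the variant-negative sample has no variant
    reads at the site and both samples have comparable coverage there, so
    the expected VAF of the mixture is fraction * source_vaf.
    """
    if not 0 < source_vaf <= 1:
        raise ValueError("source_vaf must be in (0, 1]")
    if not 0 <= target_vaf <= source_vaf:
        raise ValueError("target_vaf must be between 0 and source_vaf")
    return target_vaf / source_vaf


def mix_read_pairs(positive_pairs, negative_pairs, fraction_positive,
                   total_pairs, seed=0):
    """Build a mixed data set of total_pairs read pairs.

    round(total_pairs * fraction_positive) pairs are drawn without
    replacement from positive_pairs and the remainder from negative_pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the in silico file reproducible
    n_positive = round(total_pairs * fraction_positive)
    n_negative = total_pairs - n_positive
    mixed = (rng.sample(positive_pairs, n_positive)
             + rng.sample(negative_pairs, n_negative))
    rng.shuffle(mixed)  # avoid grouping reads by sample of origin
    return mixed


def downsample_read_pairs(pairs, keep_fraction, seed=0):
    """Keep each read pair with probability keep_fraction.

    Records are whole pairs, so both mates are kept or dropped together.
    """
    if not 0 <= keep_fraction <= 1:
        raise ValueError("keep_fraction must be in [0, 1]")
    rng = random.Random(seed)
    return [pair for pair in pairs if rng.random() < keep_fraction]


if __name__ == "__main__":
    # Toy records standing in for FASTQ read pairs: (name, mate 1, mate 2).
    positive = [(f"pos{i}", "ACGT", "TGCA") for i in range(500)]
    negative = [(f"neg{i}", "ACGT", "TGCA") for i in range(2000)]

    fraction = mixing_fraction(target_vaf=0.05, source_vaf=0.5)
    mixed = mix_read_pairs(positive, negative, fraction, total_pairs=1000)
    quarter_depth = downsample_read_pairs(mixed, keep_fraction=0.25)
    print(fraction, len(mixed), len(quarter_depth))
```

Under these assumptions, a heterozygous variant (VAF 0.5) is diluted to an expected VAF of 0.05 by drawing 10% of the read pairs from the variant-positive file. The VAF actually observed at any one site will still vary with random sampling and local coverage, which is one reason to fix the random seed and record it alongside the generated file.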
With manipulated assay data, for example, variants such as single-nucleotide variants (SNVs), insertions/deletions (indels), or even structural variants may be inserted into laboratory data files (BAM or FASTQ) to assess the ability of the bioinformatics pipeline to correctly identify and annotate the variant.[9,43]; [46] Balan J. Jenkinson G. Nair A. Saha N. Koganti T. Voss J. Zysk C. Barr Fritcher E.G. Ross C.A. Giannini C. Raghunathan A. Kipp B.R. Jenkins R. Ida C. Halling K.C. Blackburn P.R. Dasari S. Oliver G.R. Klee E.W. SeekFusion - a clinically validated fusion transcript detection pipeline for PCR-based next-generation sequencing of RNA. Front Genet. 2021; 12: 739054. This approach may be especially useful to determine the performance characteristics of a bioinformatics pipeline for the detection of variants for which there are few physical samples available, such as in the case of rare disorders or rare, clinically relevant somatic alterations. In silico approaches may also assess the bioinformatic pipeline's ability to detect challenging variants for which it is difficult to find physical samples, such as medium-sized indels between 15 and 1000 bp ([47] Kadri S. Zhen C.J. Wurst M.N. Long B.C. Jiang Z.F. Wang Y.L. Furtado L.V. Segal J.P. Amplicon Indel Hunter is a novel bioinformatics tool to detect large somatic insertion/deletion mutations in amplicon-based next-generation sequencing data. J Mol Diagn. 2015; 17: 635-643) or dinucleotide substitutions.

In silico NGS data have primarily been used for SNVs and indels, but there are opportunities for expanding to microsatellite instability (MSI), tumor mutation burden (TMB) ([48] Makrooni M.A. O'Sullivan B. Seoighe C. Bias and inconsistency in the estimation of tumour mutation burden. BMC Cancer. 2022; 22: 840), structural variants [9,30-32,37,43], copy number variations ([41-43]; [49] Ellingford J.M. Campbell C. Barton S. Bhaskar S. Gupta S. Taylor R.L. Sergouniotis P.I. Horn B. Lamb J.A. Michaelides M. Webster A.R. Newman W.G. Panda B. Ramsden S.C. Black G.C.M. Validation of copy number variation analysis for next-generation sequencing diagnostics. Eur J Hum Genet. 2017; 25: 719-724), RNA sequencing ([50] Bruno A.E. Miecznikowski J.C. Qin M. Wang J. Liu S. FUSIM: a software tool for simulating fusion transcripts. BMC Bioinformatics. 2013; 14: 13; see RNA Sequencing below for references for FUSIM, Polyester [https://bioconductor.org/packages/release/bioc/html/polyester.html, last accessed September 1, 2022], rlsim [https://github.com/sbotond/rlsim, last accessed September 1, 2022], RNASeqReadSimulator [https://github.com/davidliwei/RNASeqReadSimulator, last accessed September 1, 2022], and SimCT [https://github.com/jaudoux/simct, last accessed September 1, 2022]), bisulfite sequencing ([13]; https://github.com/BeyondTheSky/BSSim, last accessed September 1, 2022), and metagenomics.[18] Finally, in silico approaches may be used when changes are made only to the bioinformatics pipeline and not to the wet laboratory components of the assay; in this scenario, data previously generated by the laboratory are reanalyzed with the new version of the bioinformatics pipeline (data reanalysis) (Figure 1E).

Table 1. Bioinformatics Tools that Produce Manipulated Assay Data

Name | GitHub URL* | Summary
Bamgineer (no version given, last update July 30, 2020) | https://github.com/pughlab/bamgineer | Simulates haplotype-phased, allele-specific copy number variants in an existing BAM file. Requires legacy dependencies samtools version 1.2 and pysam version 0.8.4.[41]
BAMSurgeon version 1.3 | https://github.com/adamewing/bamsurgeon/ | Simulates SNVs, small indels, and structural variants in an existing BAM file.[9]
insiM version 1.0 | https://github.com/thesushantpatil/insiM | Simulates SNVs, small indels, and duplication events in an existing BAM file. Outputs a paired-end FASTQ file.[42]
VarBen (no version given, last update May 15, 2021) | https://github.com/nccl-jmli/VarBen | Simulates SNVs, indels, and structural variants in an existing BAM file.[43]

Indel, insertion/deletion; SNV, single-nucleotide variant.
* These are examples of software packages available at the time of this writing and not comprehensive lists. Inclusion does not represent an organizational endorsement by the Association for Molecular Pathology of any individual product or service. All websites last accessed April 12, 2022.

To better assess the technical aspects, limitations, and advantages of in silico variant simulation in the clinical laboratory, the Association for Molecular Pathology (AMP) convened a panel of subject matter experts to examine the topic. In this article, the Working Group explores the different types of in silico data files, how they can be used by clinical laboratories, and their advantages and disadvantages.
Data from an AMP survey reflect how in silico files are being used by laboratories today. Finally, the Working Group provides recommendations and future directions for the use of in silico NGS data.

In silico data can be a powerful tool, but it is important to understand the strengths and limitations of the various types of in silico data and which aspects of the pipeline they can help validate. Different types of data will be useful for identifying different sources of errors (eg, systematic sequencing errors versus alignment errors), mimicking different types of variants (eg, small variants versus structural variants), and testing different variant origins (eg, germline versus acquired/somatic). When a change is only made to the bioinformatics pip

Reference(s)