Using R and Bioconductor in Clinical Genomics and Transcriptomics
2019; Elsevier BV; Volume: 22; Issue: 1; Language: English
10.1016/j.jmoldx.2019.08.006
ISSN 1943-7811
Topic(s): Molecular Biology Techniques and Applications
Abstract: Bioinformatics pipelines are essential in the analysis of genomic and transcriptomic data generated by next-generation sequencing (NGS). Recent guidelines emphasize the need for rigorous validation and assessment of robustness, reproducibility, and quality of NGS analytic pipelines intended for clinical use. Software tools written in the R statistical language and, in particular, the set of tools available in the Bioconductor repository are widely used in research bioinformatics, and these frameworks offer several advantages for use in clinical bioinformatics, including the breadth of available tools, modular nature of software packages, ease of installation, enforcement of interoperability, version control, and short learning curve. This review provides an introduction to R and Bioconductor software, its advantages and limitations for clinical bioinformatics, and illustrative examples of tools that can be used in various steps of NGS analysis.
Robust bioinformatics approaches have become increasingly critical for the analysis of high-throughput, high-complexity molecular data produced by new assay technologies, such as microarrays and next-generation sequencing (NGS), especially when the results may influence clinical decisions. In clinical bioinformatics, a software pipeline is a set of predefined programmatic procedures (file operations, programs or tools, and database queries) that convert one or more inputs (eg, raw sequencing data) into one or more outputs, often in sequential and sometimes parallel series of steps. These steps can yield intermediate results that are useful on their own and that also serve as inputs to the next tool in the pipeline. [1] Roy S. Coldren C. Karunamurthy A. Kip N.S. Klee E.W. Lincoln S.E. Leon A. Pullambhatla M. Temple-Smolkin R.L. Voelkerding K.V. Wang C. Carter A.B. Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines: a Joint Recommendation of the Association for Molecular Pathology and the College of American Pathologists. J Mol Diagn. 2018; 20: 4-27; [2] Gargis A.S. Kalman L. Bick D.P. da Silva C. Dimmock D.P. Funke B.H. et al. Good laboratory practice for clinical next-generation sequencing informatics pipelines. Nat Biotechnol. 2015; 33: 689-693; [3] Oliver G.R. Hart S.N. Klee E.W. Bioinformatics for clinical next generation sequencing. Clin Chem. 2015; 61: 124-135. Current pipelines for genomic and transcriptomic assays with clinical applications have not been fully standardized, and bioinformatics approaches are highly variable among institutions using these assays for patient care purposes. 
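The pipeline definition above (inputs passed through sequential steps, with intermediate results retained) can be illustrated with a minimal R sketch. All function names, the toy "reads," and the three steps are hypothetical and chosen only for illustration; they do not correspond to any tool discussed in this review.

```r
# Hypothetical toy steps; real pipelines would call alignment, calling, and
# annotation tools instead of these simple string operations.
trim_reads <- function(reads, min_len = 5) {
  # Keep only reads at least min_len characters long
  reads[nchar(reads) >= min_len]
}

count_gc <- function(reads) {
  # Count G and C bases in each read
  vapply(strsplit(reads, ""),
         function(bases) sum(bases %in% c("G", "C")),
         integer(1))
}

summarise_gc <- function(gc) {
  c(n_reads = length(gc), mean_gc = mean(gc))
}

# Run the named steps in order, keeping every intermediate result
run_pipeline <- function(input, steps) {
  intermediates <- list()
  current <- input
  for (name in names(steps)) {
    current <- steps[[name]](current)
    intermediates[[name]] <- current
  }
  list(output = current, intermediates = intermediates)
}

raw <- c("ACGTGC", "AC", "GGGCCC", "ATATAT")
result <- run_pipeline(
  raw,
  list(trim = trim_reads, gc = count_gc, summary = summarise_gc)
)
```

Here `result$intermediates$trim` and `result$intermediates$gc` play the role of intermediate outputs that can be inspected on their own, while `result$output` is the final product of the last step.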
Recently, the Association for Molecular Pathology, with collaboration from the College of American Pathologists and the American Medical Informatics Association, published guidelines for the validation of clinical NGS bioinformatics pipelines, consisting of a set of 17 best practice consensus recommendations. [1] These guidelines are focused mainly on the use of NGS assays for the detection of clinically relevant genomic alterations (variants). Several laboratories also use NGS for identification of changes in RNA structure (eg, to detect genomic fusions) and abundance (eg, to detect mRNA or miRNA expression patterns), hereafter abbreviated as RNASeq. The guidelines emphasize the need to thoroughly validate the pipeline and to lock down the complete set of tools, code, operational environment, and network connections that compose the pipeline before using it for clinical purposes. Of importance, any change to any component of the pipeline requires revalidation to ensure that there is no impact on the performance characteristics of the pipeline. On the other hand, the field of genomics is constantly evolving, with advances in sequencing technology; in the processing of sequencing data and bioinformatics software; in knowledge of genomic structure, biological functions, and regulatory networks; and, most important, in understanding the clinical significance of genomic and epigenomic alterations. 
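The lock-down requirement described above can be made concrete with a small sketch that records the R version and the versions of selected packages at validation time and later reports any component that has changed. This is only an illustration under stated assumptions: the helper names `snapshot_environment` and `find_drift` are hypothetical, the package list is arbitrary, and a real clinical pipeline would also need to track external tools, reference data, and configuration.

```r
# Record the R version and the versions of the named packages (base R only).
snapshot_environment <- function(packages) {
  versions <- vapply(packages,
                     function(p) as.character(utils::packageVersion(p)),
                     character(1))
  c(R = as.character(getRversion()), versions)
}

# Return the names of components whose version differs between two snapshots,
# including components present in only one of them.
find_drift <- function(validated, current) {
  all_names <- union(names(validated), names(current))
  changed <- vapply(all_names, function(n) {
    !identical(unname(validated[n]), unname(current[n]))
  }, logical(1))
  all_names[changed]
}

validated <- snapshot_environment(c("stats", "utils"))

# Simulate a later environment in which one component was changed
current <- validated
current["utils"] <- "0.0.0"

drift <- find_drift(validated, current)
if (length(drift) > 0) {
  message("Revalidation needed; changed components: ",
          paste(drift, collapse = ", "))
}
```

A non-empty `drift` would signal that at least one recorded component no longer matches the validated state, which under the guidelines summarized above is the trigger for revalidation.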
Given this tension between a locked-down pipeline and a rapidly evolving field, the development, improvement, and proper validation of flexible bioinformatics pipelines that reflect the most recent advances in genomics are critical steps in providing optimal patient care. In many institutions, the tools of choice are acquired from commercial vendors and/or other well-established pipelines, such as the Genome Analysis Toolkit set of tools from the Broad Institute (Cambridge, MA). [4] McKenna A. Hanna M. Banks E. Sivachenko A. Cibulskis K. Kernytsky A. Garimella K. Altshuler D. Gabriel S. Daly M. DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010; 20: 1297-1303. In addition, several institutions choose to use a customized assembly of various tools, often from open-source repositories, that offers the following advantages: i) transparent access to the algorithms used, for better independent assessment of the software functionalities and limitations; ii) ability to fine-tune parameters and modify the code to best fit specific goals and operational frameworks and to rapidly adapt the pipeline to changes in technology, software, or knowledge bases or in clinical needs of the institution; iii) ability to choose the best-of-breed tool for a particular process in the pipeline; and iv) ability to explore, test, and prototype alternative approaches to the bioinformatics pipelines in a development setting. One of the most commonly used open-source repositories of bioinformatics tools for genomics, transcriptomics, and other NGS-based assays is the Bioconductor repository. [5] Gentleman R.C. Carey V.J. Bates D.M. Bolstad B. Dettling M. Dudoit S. Ellis B. Gautier L. Ge Y. Gentry J. Hornik K. Hothorn T. Huber W. Iacus S. Irizarry R. Leisch F. Li C. Maechler M. Rossini A.J. Sawitzki G. Smith C. Smyth G. Tierney L. Yang J.Y. Zhang J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 
2004; 5: R80; [6] Huber W. Carey V.J. Gentleman R. Anders S. Carlson M. Carvalho B.S. Bravo H.C. Davis S. Gatto L. Girke T. Gottardo R. Hahne F. Hansen K.D. Irizarry R.A. Lawrence M. Love M.I. MacDonald J. Obenchain V. Oleś A.K. Pagès H. Reyes A. Shannon P. Smyth G.K. Tenenbaum D. Waldron L. Morgan M. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015; 12: 115-121. Bioconductor tools are written in the R statistical programming language (hereafter abbreviated as R) and are freely available to download, install, and modify through an open-source and open-development model supported by the use of the GitHub repository system. In this review, we discuss tools written in R that can be useful in pipelines for the processing and analysis of NGS data, with a focus on Bioconductor and on genomic and transcriptomic assays with potential clinical applications. The R statistical language is a free, open-source implementation of the older statistical and graphing language S, with additional features such as the ability to extend base R functionalities by using self-contained code extensions, called packages, that can be easily installed from repositories, such as CRAN and Bioconductor. The source, version, and/or reference for all packages mentioned in this review are listed in Supplemental Table S1. [6]; [7] Bao L. Pu M. Messer K. AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data. Bioinformatics. 
2014; 30: 1056-1063; [8] Shen Y. Rahman M. Piccolo S.R. Gusenleitner D. El-Chaar N.N. Cheng L. Monti S. Bild A.H. Johnson W.E. ASSIGN: context-specific genomic profiling of multiple heterogeneous biological pathways. Bioinformatics. 2015; 31: 1745-1753; [9] Yu G. Zhang B. Bova G.S. Xu J. Shih I.M. Wang Y. BACOM: in silico detection of genomic deletion types and correction of normal cell contamination in copy number data. Bioinformatics. 2011; 27: 1473-1480; [10] Sengupta S. Wang J. Lee J. Müller P. Gulukota K. Banerjee A. Ji Y. Bayclone: Bayesian nonparametric inference of tumor subclones using NGS data. Pac Symp Biocomput. 2015: 467-478; [11] Kane M.J. Emerson J. Weston S. Scalable strategies for computing with massive data. J Stat Softw. 2013; 55: 1-19; [12] Durinck S. Spellman P.T. Birney E. Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc. 2009; 4: 1184-1191; [13] Zhu W. Kuziora M. Creasy T. Lai Z. Morehouse C. Guo X. Sebastian Y. Shen D. Huang J. Dry J.R. BubbleTree: an intuitive visualization to elucidate tumoral aneuploidy and clonality using next generation sequencing data. Nucleic Acids Res. 2015; 44: e38; [14] Purdom E. Ho C. Grasso C.S. Quist M.J. Cho R.J. Spellman P. Methods and challenges in timing chromosomal abnormalities within cancer samples. Bioinformatics. 2013; 29: 3113-3120; [15] Carrara M. Beccuti M. Cavallo F. Donatelli S. Lazzarato F. Cordero F. Calogero R.A. State of art fusion-finder algorithms are suitable to detect transcription-induced chimeras in normal tissues? BMC Bioinformatics. 2013; 14 Suppl 7: S2; [16] Lågstad S. Zhao S. Hoff A.M. Johannessen B. 
Lingjærde O.C. Skotheim R.I. Chimeraviz: a tool for visualizing chimeric RNA. Bioinformatics. 2017; 33: 2954-2956; [17] Oróstica K.Y. Verdugo R.A. chromPlot: visualization of genomic data in chromosomal context. Bioinformatics. 2016; 32: 2366-2368; [18] Zare H. Wang J. Hu A. Weber K. Smith J. Nickerson D. Song C. Witten D. Blau C.A. Noble W.S. Inferring clonal composition from multiple sections of a breast cancer. PLoS Comput Biol. 2014; 10: e1003703; [19] Klambauer G. Schwarzbauer K. Mayr A. Clevert D.-A. Mitterecker A. Bodenhofer U. Hochreiter S. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012; 40: e69; [20] Gusnanto A. Tcherveniakov P. Shuweihdi F. Samman M. Rabbitts P. Wood H.M. Stratifying tumour subtypes based on copy number alteration profiles using next-generation sequence data. Bioinformatics. 2015; 31: 2713-2720; [21] Gusnanto A. Wood H.M. Pawitan Y. Rabbitts P. Berri S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics. 2012; 28: 40-47; [22] Jiang Y. Oldridge D.A. Diskin S.J. Zhang N.R. CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res. 2015; 43: e39; [23] Kuilman T. Velds A. Kemper K. Ranzani M. Bombardelli L. Hoogstraat M. Nevedomskaya E. Xu G. de Ruiter J. Lolkema M.P. Ylstra B. Jonkers J. Rottenberg S. Wessels L.F. Adams D.J. Peeper D.S. Krijgsman O. CopywriteR: DNA copy number detection from off-target sequence data. Genome Biol. 2015; 16: 49; [24] Mock A. Murphy S. 
Morris J. Marass F. Rosenfeld N. Massie C. CVE: an R package for interactive variant prioritisation in precision oncology. BMC Med Genomics. 2017; 10: 37; [25] Fowler A. Mahamdallie S. Ruark E. Seal S. Ramsay E. Clarke M. Uddin I. Wylie H. Strydom A. Lunter G. Rahman N. Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN. Wellcome Open Res. 2016; 1: 20; [26] Ahn J. Yuan Y. Parmigiani G. Suraokar M.B. Diao L. Wistuba I.I. Wang W. DeMix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics. 2013; 29: 1865-1871; [27] Love M.I. Huber W. Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15: 550; [28] Buschmann T. DNABarcodes: an R package for the systematic construction of DNA sample tags. Bioinformatics. 2017; 33: 920-922; [29] Sayols S. Scherzinger D. Klein H. dupRadar: a Bioconductor package for the assessment of PCR artifacts in RNA-Seq data. BMC Bioinformatics. 2016; 17: 428; [30] Delhomme N. Padioleau I. Furlong E.E. Steinmetz L.M. easyRNASeq: a bioconductor package for processing RNA-Seq data. Bioinformatics. 2012; 28: 2532-2533; [31] Robinson M.D. McCarthy D.J. Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26: 139-140; [32] Rainer J. Gatto L. Weichenberger C.X. Ensembldb: an R package to create and use Ensembl-based annotation resources. Bioinformatics. 2019; 35: 3151-3153; [33] Chelaru F. Corrada Bravo H. Epiviz: a view inside the design of an integrated visual analysis software for genomics. BMC Bioinformatics. 
2015; 16 Suppl 11: S4; [34] Yoshihara K. Shahmoradgoli M. Martínez E. Vegesna R. Kim H. Torres-Garcia W. Treviño V. Shen H. Laird P.W. Levine D.A. Carter S.L. Getz G. Stemke-Hale K. Mills G.B. Verhaak R.G.W. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013; 4: 2612; [35] Sathirapongsasuti J.F. Lee H. Horst B.A.J. Brunner G. Cochran A.J. Binder S. Quackenbush J. Nelson S.F. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics. 2011; 27: 2648-2654; [36] Plagnol V. Curtis J. Epstein M. Mok K.Y. Stebbings E. Grigoriadou S. Wood N.W. Hambleton S. Burns S.O. Thrasher A.J. Kumararatne D. Doffinger R. Nejentsev S. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics. 2012; 28: 2747-2754; [37] Andor N. Graham T.A. Jansen M. Xia L.C. Aktipis C.A. Petritsch C. Ji H.P. Maley C.C. Pan-cancer analysis of the extent and consequences of intratumor heterogeneity. Nat Med. 2016; 22: 105-113; [38] Krijgsman O. Benner C. Meijer G.A. van de Wiel M.A. Ylstra B. FocalCall: an R package for the annotation of focal copy number aberrations. Cancer Inform. 2014; 13: 153-156; [39] Gendoo D.M. Ratanasirigulchai N. Schroder M.S. Pare L. Parker J.S. Prat A. Haibe-Kains B. Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics. 2016; 32: 1097-1099; [40] Akalin A. Franke V. Vlahoviček K. Mason C.E. Schübeler D. Genomation: a toolkit to summarize, annotate and visualize genomic intervals. Bioinformatics. 2015; 31: 1127-1129; [41] Lawrence M. Huber W. 
Pagès H. Aboyoun P. Carlson M. Gentleman R. Morgan M.T. Carey V.J. Software for computing and annotating genomic ranges. PLoS Comput Biol. 2013; 9: e1003118; [42] Yin T. Cook D. Lawrence M. Ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biol. 2012; 13: R77; [43] Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York, NY; 2009; [44] Hänzelmann S. Castelo R. Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013; 14: 7; [45] Hahne F. Ivanek R. Mathé E. Davis S. Statistical Genomics: Methods and Protocols. Springer New York, New York, NY; 2016: 335-351; [46] Lai Y.-P. Wang L.-B. Wang W.-A. Lai L.-C. Tsai M.-H. Lu T.-P. Chuang E.Y. iGC—an integrated analysis package of gene expression and copy number alteration. BMC Bioinformatics. 2017; 18: 35; [47] Law C.W. Chen Y. Shi W. Smyth G.K. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15: R29; [48] Ramos M. Schiffer L. Re A. Azhar R. Basunia A. Cabrera C.R. Chan T. Chapman P. Davis S. Gomez-Cabrero D. Culhane A.C. Haibe-Kains B. Hansen K. Kodali H. Louis M.S. Mer A.S. Reister M. Morgan M. Carey V. Waldron L. Software for the integration of multi-omics experiments in Bioconductor. Cancer Res. 2017; 77: e39-e42; [49] Hernandez-Ferrer C. Ruiz-Arenas C. Beltran-Gomila A. González J.R. MultiDataSet: an R package for encapsulating multiple data sets with application to omic data integration. BMC Bioinformatics. 2017; 18: 36; [50] Povysil G. Tzika A. Vogt J. Haunschmid V. Messiaen L. Zschocke J. Klambauer G. Hochreiter S. Wimmer K. 
panelcn.MOPS: copy-number detection in targeted NGS panel data for clinical diagnostics. Hum Mutat. 2017; 38: 889-897; [51] Liu C. Lehtonen R. Hautaniemi S. PerPAS: topology-based single sample pathway analysis method. IEEE/ACM Trans Comput Biol Bioinform. 2018; 15: 1022-1027; [52] Foroushani A. Agrahari R. Docking R. Chang L. Duns G. Hudoba M. Karsan A. Zare H. Large-scale gene network analysis reveals the significance of extracellular matrix pathway and homeobox genes in acute myeloid leukemia: an introduction to the Pigengene package and its applications. BMC Med Genomics. 2017; 10: 16; [53] Riester M. Singh A.P. Brannon A.R. Yu K. Campbell C.D. Chiang D.Y. Morrissey M.P. PureCN: copy number calling and SNV classification using targeted short read sequencing. Source Code Biol Med. 2016; 11: 13; [54] Scheinin I. Sie D. Bengtsson H. van de Wiel M.A. Olshen A.B. van Thuijl H.F. van Essen H.F. Eijk P.P. Rustenburg F. Meijer G.A. Reijneveld J.C. Wesseling P. Pinkel D. Albertson D.G. Ylstra B. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Res. 2014; 24: 2022-2032; [55] Gaidatzis D. Lerch A. Hahne F. Stadler M.B. QuasR: quantification and annotation of short reads in R. Bioinformatics. 2015; 31: 1130-1132; [56] Reinecke F. Satya R.V. DiCarlo J. Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls. BMC Bioinformatics. 2015; 16: 17; [57] Collado-Torres L. Nellore A. Kammers K. Ellis S.E. Taub M.A. Hansen K.D. Jaffe A.E. Langmead B. Leek J.T. Reproducible RNA-seq analysis using recount2. Nat Biotechnol. 
2017; 35: 319-321; [58] Collado-Torres L. Nellore A. Jaffe A.E. Recount workflow: accessing over 70,000 human RNA-seq samples with Bioconductor. F1000Res. 2017; 6: 1558; [59] Jabot-Hanin F. Varet H. Tores F. Alcais A. Jais J.-P. Rfpred: a random forest approach for prediction of missense variants in human exome. bioRxiv. 2016; (037127); [60] Wang S. Pandis I. Johnson D. Emam I. Guitton F. Oehmichen A. Guo Y. Optimising parallel R correlation matrix calculations on gene expression data using MapReduce. BMC Bioinformatics. 2014; 15: 351; [61] de Souza W. Carvalho B.S. Lopes-Cendes I. Rqc: a Bioconductor package for quality control of high-throughput sequencing data. J Stat Softw Code Snippets. 2018; 87: 1-14; [62] Liao Y. Smyth G.K. Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 2013; 41: e108; [63] Lawrence M. Gentleman R. Carey V. Rtracklayer: an R package for interfacing with genome browsers. Bioinformatics. 2009; 25: 1841-1842; [64] Favero F. Joshi T. Marquard A.M. Birkbak N.J. Krzystanek M. Li Q. Szallasi Z. Eklund A.C. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann Oncol. 2015; 26: 64-70; [65] Morgan M. Anders S. Lawrence M. Aboyoun P. Pagès H. Gentleman R. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics. 2009; 25: 2607-2608; [66] Chen M. Gunel M. Zhao H. SomatiCA: identifying, characterizing and quantifying somatic copy number aberrations from cancer genome sequencing data. PLoS One. 2013; 8: e78143; [67] Gehring J.S. Fischer B. Lawrence M. Huber W. 
SomaticSignatures: inferring mutational signatures from single-nucleotide variants. Bioinformatics. 2015; 31: 3673-3675; [68] Zhu Y. Stephens R.M. Meltzer P.S. Davis S.R. SRAdb: query and use public next-generation sequencing data from within R. BMC Bioinformatics. 2013; 14: 19; [69] H Backman T.W. Girke T. systemPipeR: NGS workflow and report generation environment. BMC Bioinformatics. 2016; 17: 388; [70] Colaprico A. Silva T.C. Olsen C. Garofano L. Cava C. Garolini D. Sabedot T.S. Malta T.M. Pagnotta S.M. Castiglioni I. Ceccarelli M. Bontempi G. Noushmehr H. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016; 44: e71; [71] Hummel M. Bonnin S. Lowy E. Roma G. TEQC: an R package for quality control in target capture experiments. Bioinformatics. 2011; 27: 1316-1317; [72] Ha G. Roth A. Khattra J. Ho J. Yap D. Prentice L.M. Melnyk N. McPherson A. Bashashati A. Laks E. Biele J. Ding J. Le A. Rosner J. Shumansky K. Marra M.A. Gilks C.B. Huntsman D.G. McAlpine J.N. Aparicio S. Shah S.P. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Res. 2014; 24: 1881-1893; [73] Soneson C. Love M.I. Robinson M.D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res. 2016; 4: 1521; [74] Wang N. Gong T. Clarke R. Chen L. Shih I.-M. Zhang Z. Levine D.A. Xuan J. Wang Y. UNDO: a Bioconductor R package for unsupervised deconvolution of mixed gene expressions in tumor samples. Bioinformatics. 2015; 31: 137-139; [75] Obenchain V. Lawrence M. Carey V. Gogarten S. Shannon P. Morgan M. 
VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics. 2014; 30: 2076-2078; [76] Knaus B.J. Grünwald N.J. VCFR: a package to manipulate and visualize variant call format data in R. Mol Ecol Resour. 2017; 17: 44-53; [77] Alvarez M.J. Shen Y. Giorgi F.M. Lachmann A. Ding B.B. Ye B.H. Califano A. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat Genet. 2016; 48: 838-847; [78] Pugh T.J. Amr S.S. Bowser M.J. Gowrisankar S. Hynes E. Mahanta L.M. Rehm H.L. Funke B. Lebo M.S. VisCap: inference and visualization of germ-line copy-number variants from targeted clinical sequencing data. Genet Med. 2016; 18: 712-719. Some features of the R programming language and environment of relevance to bioinformatics are described below. In contrast with several other statistical languages, including the S-plus implementation of S, R does not natively use a point-and-click graphic user interface; rather, commands are typed into a console or console-like window and immediately executed by pressing the return key. Alternatively, commands can be stored in text files or scripts, conventionally with the .R extension, which can be called with the source("filepath/filename.R") command or executed in total or in blocks from an R editor. Because R is an interpreted language, lines of code can be written and executed, and their results visualized, quickly. This allows rapid prototyping and testing of new functionality: only the modified parts of the code are rerun, step by step, while intermediate results from unmodified lines are often kept in memory. 
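The script-based and step-wise styles of execution described above can be sketched in a few lines of base R. The script name, its contents, and the variable names are hypothetical; the script is written to a temporary directory only so that the example is self-contained.

```r
# Write a small, illustrative script to a temporary file
script_path <- file.path(tempdir(), "example_step.R")
writeLines(c(
  "read_lengths <- c(75, 100, 150)",
  "mean_length <- mean(read_lengths)"
), script_path)

# Execute the whole script with source(), here inside a dedicated environment
# so that the objects it creates do not clutter the global workspace
env <- new.env()
source(script_path, local = env)
env$mean_length

# Step-wise prototyping: evaluate one new line of code while reusing the
# intermediate object (read_lengths) that is already in memory
eval(quote(median_length <- median(read_lengths)), envir = env)
env$median_length
```

In interactive use the same effect is usually obtained by running selected lines from an editor; the explicit environment and `eval()` call are used here only to make the reuse of in-memory results visible in a non-interactive example.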
When speed is an issue, portions of well-tested code can be reimplemented or compiled, usually as C++ code, for execution at run time. Although R can be run from a command-line console in a terminal-type application, many if not most users use RStudio (Boston, MA), a free, multiplatform, open-source integrated development environment providing a graphic user interface that displays various windows and tabs useful for R programming, including the following: i) R console, where commands can be typed and executed, and the results can be displayed; ii) Tabbed Source window containing various R scripts that can be executed in their entirety (sourced), or by specific lines or blocks of code; iii) Workspace window, showing the various objects (including package functions) loaded in the various environments; iv) History window containing a searchable list of all previously executed code; v) Files tab that can be used to browse directories and perform file operations; vi) Packages tab, listing packages loaded or installed; and vii) Help window containing documentation manual pages for the various packages installed. Graphs can be displayed in integrated or floating Plots windows or in a web browser window, or saved to a file. Markdown documents integrating code blocks, code outputs, and rich-text descriptions can be generated, edited, and exported in PDF, HTML, or Microsoft Word formats using the integrated knitr package. Of importance for clinical bioinformatics pipeline development, RStudio supports version control of projects by integrating Git repository functionalities, and can be deployed in a server environment, thereby avoiding multiple installations of R and associated libraries on local workstations. The modular design of R and the use of packages that function as optional plugins to extend basic functionalities are major features of R and Bioconductor and allow users to tailor solutions to their custom needs. 
Packages can be downloaded and installed from the default R repository (usually CRAN; https://cran.r-project.org, last accessed November 7, 2019) with the simple command install.packages("PackageName"). Standardized rules and guidelines for package generation, structure, dependency management, installation, and documentation are available in the Manual for Writing R Extensions (https://cran.r-project.org/doc/manuals/r-release/R-exts.html, last accessed April 28, 2019) and provide a framework for package robustness, ease of learning, and interoperability. On installation, integrity of transmission is verified by checking the MD5 checksum i
Reference(s)