Failure to Detect Mutations in U2AF1 due to Changes in the GRCh38 Reference Sequence
2022; Elsevier BV; Volume: 24; Issue: 3 Linguagem: Inglês
10.1016/j.jmoldx.2021.10.013
ISSN1943-7811
AutoresChristopher A. Miller, Jason Walker, Travis L. Jensen, William F. Hooper, Robert S. Fulton, Jeffrey S. Painter, Mikkael A. Sekeres, Timothy J. Ley, David H. Spencer, Johannes B. Goll, Matthew J. Walter,
Tópico(s)Cancer Genomics and Diagnostics
ResumoThe U2AF1 gene is a core part of mRNA splicing machinery and frequently contains somatic mutations that contribute to oncogenesis in myelodysplastic syndrome, acute myeloid leukemia, and other cancers. A change introduced in the GRCh38 version of the human reference build prevents detection of mutations in this gene, and others, by variant calling pipelines. This study describes the problem in detail and shows that a modified GRCh38 reference build with unchanged coordinates can be used to ameliorate the issue. The U2AF1 gene is a core part of mRNA splicing machinery and frequently contains somatic mutations that contribute to oncogenesis in myelodysplastic syndrome, acute myeloid leukemia, and other cancers. A change introduced in the GRCh38 version of the human reference build prevents detection of mutations in this gene, and others, by variant calling pipelines. This study describes the problem in detail and shows that a modified GRCh38 reference build with unchanged coordinates can be used to ameliorate the issue. The U2AF1 gene encodes a core component of the mRNA splicing machinery that is frequently mutated in myelodysplastic syndrome (MDS) and other cancers.1Graubert T.A. Shen D. Ding L. Okeyo-Owuor T. Lunn C.L. Shao J. Krysiak K. Harris C.C. Koboldt D.C. Larson D.E. McLellan M.D. Dooling D.J. Abbott R.M. Fulton R.S. Schmidt H. Kalicki-Veizer J. O'Laughlin M. Grillot M. Baty J. Heath S. Frater J.L. Nasim T. Link D.C. Tomasson M.H. Westervelt P. DiPersio J.F. Mardis E.R. Ley T.J. Wilson R.K. Walter M.J. Recurrent mutations in the U2AF1 splicing factor in myelodysplastic syndromes.Nat Genet. 2011; 44: 53-57Google Scholar, 2Yoshida K. Sanada M. Shiraishi Y. Nowak D. Nagata Y. Yamamoto R. Sato Y. Sato-Otsubo A. Kon A. Nagasaki M. Chalkidis G. Suzuki Y. Shiosaka M. Kawahata R. Yamaguchi T. Otsu M. Obara N. Sakata-Yanagimoto M. Ishiyama K. Mori H. Nolte F. Hofmann W.-K. Miyawaki S. Sugano S. Haferlach C. Koeffler H.P. Shih L.-Y. Haferlach T. Chiba S. Nakauchi H. Miyano S. Ogawa S. Frequent pathway mutations of splicing machinery in myelodysplasia.Nature. 2011; 478: 64-69Google Scholar, 3Sondka Z. Bamford S. Cole C.G. Ward S.A. Dunham I. Forbes S.A. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers.Nat Rev Cancer. 2018; 18: 696-705Google Scholar Though predominantly associated with hematopoietic cancers (73%), mutations are also recurrent in lung tumors (6.5%) and have been reported in 24 other tumor types. Specifically, mutations at residues S34 and Q157 have been shown to promote exon skipping and are confirmed driver mutations contributing to cancer pathogenesis.4Okeyo-Owuor T. White B.S. Chatrikhi R. Mohan D.R. Kim S. Griffith M. Ding L. Ketkar-Kulkarni S. Hundal J. Laird K.M. Kielkopf C.L. Ley T.J. Walter M.J. Graubert T.A. U2AF1 mutations alter sequence specificity of pre-mRNA binding and splicing.Leukemia. 2015; 29: 909-917Google Scholar, 5Shirai C.L. Ley J.N. White B.S. Kim S. Tibbitts J. Shao J. Ndonwi M. Wadugu B. Duncavage E.J. Okeyo-Owuor T. Liu T. Griffith M. McGrath S. Magrini V. Fulton R.S. Fronick C. O'Laughlin M. Graubert T.A. Walter M.J. Mutant U2AF1 expression alters hematopoiesis and pre-mRNA splicing in vivo.Cancer Cell. 2015; 27: 631-643Google Scholar, 6Ilagan J.O. Ramakrishnan A. Hayes B. Murphy M.E. Zebari A.S. Bradley P. Bradley R.K. U2AF1 mutations alter splice site recognition in hematological malignancies.Genome Res. 2015; 25: 14-26Google Scholar, 7Fei D.L. Zhen T. Durham B. Ferrarone J. Zhang T. Garrett L. Yoshimi A. Abdel-Wahab O. Bradley R.K. Liu P. Varmus H. Impaired hematopoiesis and leukemia development in mice with a conditional knock-in allele of a mutant splicing factor gene U2af1.Proc Natl Acad Sci U S A. 2018; 115: E10437-E10446Google Scholar Genomic data were collected as part of the MDS National History Study or The Cancer Genome Atlas (TCGA) project and appropriate consent was received under those protocols.8Sekeres M.A. Gore S.D. Stablein D.M. DiFronzo N. Abel G.A. DeZern A.E. Troy J.D. Rollison D.E. Thomas J.W. Waclawiw M.A. Liu J.J. Al Baghdadi T. Walter M.J. Bejar R. Gorak E.J. Starczynowski D.T. Foran J.M. Cerhan J.R. Moscinski L.C. Komrokji R.S. Deeg H.J. Epling-Burnette P.K. The National MDS Natural History Study: design of an integrated data and sample biorepository to promote research studies in myelodysplastic syndromes.Leuk Lymphoma. 2019; 60: 3161-3171Google Scholar,9Ley T.J. Miller C. Ding L. Raphael B.J. Mungall A.J. Robertson A.G. et al.Cancer Genome Atlas Research NetworkGenomic and epigenomic landscapes of adult de novo acute myeloid leukemia.N Engl J Med. 2013; 368: 2059-2074Google Scholar Sequencing reads from the MDS cohort were aligned to both masked and unmasked GRCh38 reference genomes using the BWA-MEM software package version 0.7.15 (https://github.com/lh3/bwa/releases/tag/v0.7.15),10Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.arXiv [q-bioGN]. 2013; ([Preprint])arXiv:1303.3997Google Scholar followed by sorting and deduplication, as detailed in a CWL (Common Workflow Language) workflow archived at https://git.io/JYbGl (last accessed December 1, 2021). Variants in the MDS cohort were called in single-sample mode using VarScan software version 2.4.2 with params "--min-coverage 20 --min-reads2 5 --min-var-freq 0.05 --strand-filter 1".11Koboldt D.C. Zhang Q. Larson D.E. Shen D. McLellan M.D. Lin L. Miller C.A. Mardis E.R. Ding L. Wilson R.K. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.Genome Res. 2012; 22: 568-576Google Scholar Data from TCGA acute myeloid leukemia samples were aligned using the same process, and somatic variants were called using an ensemble approach, described in detail in the CWL workflow at https://git.io/JYbGM (GitHub, last accessed December 1, 2021). The modified genome FASTA file used for these analyses is available at https://zenodo.org/record/4684553 (Zenodo, last accessed December 1, 2021). Sequence data from the MDS cohort are available in the database of Genotypes and Phenotypes (dbGaP) under accession id phs002714.v1.p1. The study is currently available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002714.v1.p1 (dbGaP, last accessed January 23, 2022). TCGA acute myeloid leukemia data are available via the Genomic Data Commons at https://portal.gdc.cancer.gov (last accessed December 1, 2021). Pipelines used for variant calling are available at https://github.com/genome/analysis-workflows (GitHub, last accessed December 1, 2021). All links in the Materials and Methods are to the specific versions of the workflows used. As part of the National Myelodysplastic Syndrome (MDS) Natural History Study (ClincialTrials.gov, https://www.clinicaltrials.gov, identifier: NCT02775383), a targeted gene panel was used to sequence bone marrow samples from 120 patients either diagnosed with MDS or suspected to have MDS.8Sekeres M.A. Gore S.D. Stablein D.M. DiFronzo N. Abel G.A. DeZern A.E. Troy J.D. Rollison D.E. Thomas J.W. Waclawiw M.A. Liu J.J. Al Baghdadi T. Walter M.J. Bejar R. Gorak E.J. Starczynowski D.T. Foran J.M. Cerhan J.R. Moscinski L.C. Komrokji R.S. Deeg H.J. Epling-Burnette P.K. The National MDS Natural History Study: design of an integrated data and sample biorepository to promote research studies in myelodysplastic syndromes.Leuk Lymphoma. 2019; 60: 3161-3171Google Scholar Of these patients, 38 were eventually confirmed to have MDS or a myeloproliferative neoplasm. Initial analyses looking at sequencing quality metrics revealed coverage levels and mutation frequencies that closely matched expectations, with one exception: mutations in the U2AF1 spliceosome gene are typically observed in nearly 10% of MDS or myeloproliferative neoplasm patients, but only 2 of the 38 MDS or myeloproliferative neoplasm patients (5.2%) in this group had such mutations, both at the Q157 hotspot.1Graubert T.A. Shen D. Ding L. Okeyo-Owuor T. Lunn C.L. Shao J. Krysiak K. Harris C.C. Koboldt D.C. Larson D.E. McLellan M.D. Dooling D.J. Abbott R.M. Fulton R.S. Schmidt H. Kalicki-Veizer J. O'Laughlin M. Grillot M. Baty J. Heath S. Frater J.L. Nasim T. Link D.C. Tomasson M.H. Westervelt P. DiPersio J.F. Mardis E.R. Ley T.J. Wilson R.K. Walter M.J. Recurrent mutations in the U2AF1 splicing factor in myelodysplastic syndromes.Nat Genet. 2011; 44: 53-57Google Scholar,12Walter M.J. Shen D. Shao J. Ding L. White B.S. Kandoth C. Miller C.A. Niu B. McLellan M.D. Dees N.D. Fulton R. Elliot K. Heath S. Grillot M. Westervelt P. Link D.C. DiPersio J.F. Mardis E. Ley T.J. Wilson R.K. Graubert T.A. Clonal diversity of recurrently mutated genes in myelodysplastic syndromes.Leukemia. 2013; 27: 1275-1282Google Scholar,13Papaemmanuil E. Gerstung M. Malcovati L. Tauro S. Gundem G. Van Loo P. et al.Clinical and biological implications of driver mutations in myelodysplastic syndromes.Blood. 2013; 122 (quiz 3699): 3616-3627Google Scholar Although this deviation was not significant (compared with the Walter et al12Walter M.J. Shen D. Shao J. Ding L. White B.S. Kandoth C. Miller C.A. Niu B. McLellan M.D. Dees N.D. Fulton R. Elliot K. Heath S. Grillot M. Westervelt P. Link D.C. DiPersio J.F. Mardis E. Ley T.J. Wilson R.K. Graubert T.A. Clonal diversity of recurrently mutated genes in myelodysplastic syndromes.Leukemia. 2013; 27: 1275-1282Google Scholar cohort; P = 0.53 via Fisher's exact test), both mutations had only about 15% of the expected sequence coverage (mean of 204× depth), whereas the full targeted panel had a median coverage of 1337.3× (Supplemental Table S1). Nearly the entirety of the U2AF1 gene was likewise affected, with a median depth of only 130× and four exons with no coverage at all (Figure 1). Manual inspection of the alignments with Integrated Genomics Viewer (IGV) revealed that this poor coverage extended to a 150-kb region of the genome where the majority of reads had a mapping quality of zero (Figure 1).14Robinson J.T. Thorvaldsdóttir H. Winckler W. Guttman M. Lander E.S. Getz G. Mesirov J.P. Integrative genomics viewer.Nat Biotechnol. 2011; 29: 24-26Google Scholar The sequence aligner BWA-MEM reserves this mapping quality zero score for reads that cannot be uniquely placed in the genome, and it generally indicates a highly repetitive sequence or problems with the underlying assemblies.10Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.arXiv [q-bioGN]. 2013; ([Preprint])arXiv:1303.3997Google Scholar The region with such problematic alignments spans the CBS, U2AF1, FRGCA, and CRYAA genes. Further investigation revealed that in the GRCh38 reference build, content added to the p-arm of chromosome 21 (chr21:6427259-6580,181) contained sequence that replicated the sequence of the U2AF1 locus (chr21:43035875-43187577) with 99.0% identity. The same issue does not exist in prior reference builds GRCh36 or GRCh37. After consultation with members of the Genome Reference Consortium (GRC), it was determined that a bacterial artificial chromosome (BAC) clone (Nucleotide, https://www.ncbi.nlm.nih.gov/nuccore; accession number FP236240.8) was incorrectly added to the reference genome, creating this duplicate sequence. This resulted in the alignment algorithm, BWA-MEM, splitting the reads among these two loci, thus lowering the overall coverage substantially (Figure 1, Supplemental Figure S1, and Supplemental Table S1). In addition, reads with mapping quality scores of zero are typically excluded or down-weighted during variant calling, due to the increased chance of artefactual calls, and these factors combined explained the paucity of mutations observed in U2AF1, especially at the S34F position. To address this issue, the authors' group created a modified version of GRCh38 that maintains the coordinate system, but masks the new duplicate sequence on chromosome 21p by replacing it with "N" characters. The authors realigned the data to this reference and observed a substantial increase in coverage and mapping qualities across the affected region. Over the exons of U2AF1, the coverage of reads with mapping quality >0 rose from a median of 0.3× to a median of 1195×. This enabled the discovery of an additional U2AF1 mutation (S34F) in this MDS cohort. The data were also aligned to GRCh37, and it was confirmed that this region was not problematic in the older reference, where the median coverage over the exons of U2AF1 was 1381× (Supplemental Table S1 and Supplemental Figure S2). To validate this finding in an orthogonal data set, data were retrieved from acute myeloid leukemia patients sequenced for the TCGA paper.9Ley T.J. Miller C. Ding L. Raphael B.J. Mungall A.J. Robertson A.G. et al.Cancer Genome Atlas Research NetworkGenomic and epigenomic landscapes of adult de novo acute myeloid leukemia.N Engl J Med. 2013; 368: 2059-2074Google Scholar The data in this study were originally aligned to genome build GRCh36, and six mutations in the U2AF1 gene were reported from exome sequencing. After this paper was published, the Genomic Data Commons took this sequence dataset, realigned it (with BWA-MEM) to the stock GRCh38.d1.vd1 reference, and produced variant calls using four different algorithms.10Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.arXiv [q-bioGN]. 2013; ([Preprint])arXiv:1303.3997Google Scholar,15Grossman R.L. Heath A.P. Ferretti V. Varmus H.E. Lowy D.R. Kibbe W.A. Staudt L.M. Toward a shared vision for cancer genomic data.N Engl J Med. 2016; 375: 1109-1112Google Scholar These variant files [in MAF (Mutation Annotation Format)] reported no U2AF1 mutations in the six expected samples. Their GRCh38 sequence alignments were then downloaded, and the same coverage and mapping quality issues on chromosome 21 were observed. One sample lacked any U2AF1 reads at all, due to unrelated sequencing problems (the TCGA consortium only identified the mutation using orthogonal assays). This sample was excluded from further analyses, leaving five evaluable samples. Running another somatic variant calling pipeline on these alignments also did not reveal any U2AF1 mutations. However, after realigning the sequence data to the masked version of GRCh38, the same somatic pipeline identified all five expected mutations in U2AF1 (Supplemental Table S2). The reference genome is essential to modern cancer genomics, but problems with these assemblies have the potential to cause both false positive and negative results. In this study, the authors describe the latter, where changes introduced in the GRCh38 human reference build cause mutations in a cancer driver gene to be missed with standard analysis approaches. GRCh38 was first released in late 2013, but widespread adoption lagged somewhat, so some studies involving myeloid malignancies (TCGA, the Beat AML trial) have been spared this issue because they used older versions of the reference.9Ley T.J. Miller C. Ding L. Raphael B.J. Mungall A.J. Robertson A.G. et al.Cancer Genome Atlas Research NetworkGenomic and epigenomic landscapes of adult de novo acute myeloid leukemia.N Engl J Med. 2013; 368: 2059-2074Google Scholar,16Tyner J.W. Tognon C.E. Bottomly D. Wilmot B. Kurtz S.E. Savage S.L. et al.Functional genomic landscape of acute myeloid leukaemia.Nature. 2018; 562: 526-531Google Scholar Nonetheless, large cancer databases, including the National Cancer Institute's Genomic Data Commons, are likely missing many U2AF1 mutations due to this artefact. This also has clinical implications because U2AF1 mutations have strong associations with prognosis, and clinical trials of splice-modulating drugs are being planned or are underway.5Shirai C.L. Ley J.N. White B.S. Kim S. Tibbitts J. Shao J. Ndonwi M. Wadugu B. Duncavage E.J. Okeyo-Owuor T. Liu T. Griffith M. McGrath S. Magrini V. Fulton R.S. Fronick C. O'Laughlin M. Graubert T.A. Walter M.J. Mutant U2AF1 expression alters hematopoiesis and pre-mRNA splicing in vivo.Cancer Cell. 2015; 27: 631-643Google Scholar,17Tefferi A. Finke C.M. Lasho T.L. Hanson C.A. Ketterling R.P. Gangat N. Pardanani A. U2AF1 mutation types in primary myelofibrosis: phenotypic and prognostic distinctions.Leukemia. 2018; 32: 2274-2278Google Scholar,18Seiler M. Yoshimi A. Darman R. Chan B. Keaney G. Thomas M. et al.H3B-8800, an orally available small-molecule splicing modulator, induces lethality in spliceosome-mutant cancers.Nat Med. 2018; 24: 497-504Google Scholar These findings have been reported to the GRC (https://www.ncbi.nlm.nih.gov/grc/human/issues/HG-2544, last accessed December 10, 2021), but as there are no patches or new releases scheduled for the human reference genome, the problem remains unresolved in the current release of GRCh38 (GRCh38.p13). However, in the time since these analyses were performed, the GRC has released a masking file that includes this chr21 region, along with two other contaminating sequences on alternate contigs.19Wagner J. Olson N.D. Harris L. McDaniel J. Cheng H. Fungtammasan A. et al.Towards a comprehensive variation benchmark for challenging medically-relevant autosomal genes.bioRxiv. 2021; ([Preprint]. doi:10.1101/2021.06.07.444885)Google Scholar Using this file to create a masked genome mirrors the approach that the authors' analyses used and likewise resolves the issues in U2AF1 reported in this paper. To apply this fix, the bed file can be downloaded from NCBI at https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_GRC_exclusions.bed (last accessed December 1, 2021), then applied to a GRCh38 FASTA file using the bedtools maskfasta tool, followed by reindexing with the aligner of choice.20Quinlan A.R. Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features.Bioinformatics. 2010; 26: 841-842Google Scholar Although masking the genome offers better quality alignments at this locus, the next leap forward will come with a new reference sequence, likely based on the draft genome recently produced by the Telomere-to-Telomere consortium.21Nurk S. Koren S. Rhie A. Rautiainen M. Bzikadze A.V. Mikheenko A. et al.The Complete Sequence of a Human Genome.bioRxiv. 2021; ([Preprint]. doi:10.1101/2021.05.26.445798)Google Scholar In the longer term, the genomics community can look forward to graph genome approaches capable of representing haplotypes from many different populations. These should further increase the accuracy of short-read genomic alignments, upon which many analyses, and increasingly clinical decisions, are based. As these genome reference improvements are released, the genomics community will need to validate them before they can be used on clinical cases. Previous data sets will need to be realigned to ensure that changes are understood and problematic portions of assemblies that might alter diagnostic results are identified. Genomic annotations and pipelines will also need to be updated, which can be resource intensive. Hence, GRCh38 will likely remain in use for years. The use of GRC-released bed file is recommended to create a masked reference that is coordinate-compatible to be used interchangeably with the standard GRCh38 reference in cancer genomics applications. Especially, applications where detection of U2AF1 mutations is critical, including sequencing of hematological cancers or studies of spliceosome dysfunction. We thank Nancy DiFronzo for leadership of the National MDS Natural History Study. The National MDS Natural History Study thanks the study participants, as well as the investigator teams at the participating clinical sites. C.A.M., J.R.W., T.L.J., W.F.H., R.S.F., D.H.S., J.B.G., and M.J.W. conceptualized the study; J.S.P. led sample acquisition; C.A.M., J.R.W., T.L.J., and W.F.H. performed data analysis; C.A.M., J.R.W., and T.L.J. visualized the data; M.A.S., T.J.L., J.B.G., and M.J.W. provided supervision and funding; C.A.M., J.B.G., T.L.J., and M.J.W. wrote and edited the manuscript. All authors read and approved the final manuscript. C.A.M. is the guarantor of this work and, as such, had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Download .pdf (3.18 MB) Help with pdf files Supplemental Figure S1Examples of poor coverage of other genes in the affected region. Examples shown are from exome sequencing of sample TCGA-AB-2912-11A. Integrated Genomics Viewer (IGV) views show sequence reads, with alignments to GRCh38 at top and alignments to the modified reference at bottom. Grey bars at top show overall coverage. Reads in white indicate multimapped reads, with mapping qualities of zero, whereas red and blue reads have higher quality alignments. A: Reads aligning to the CBS gene. B: Reads aligning to the CRYAA gene (FRGCA, the other gene in the affected region, was not targeted by this exome reagent). Download .pdf (1.55 MB) Help with pdf files Supplemental Figure S2Myelodysplastic syndrome (MDS) data aligned to GRCh37 from sample MDSEPID3178 (the same sample shown in Figure 1), showing no evidence of alignment issues. The U2AF1 transcript shown is ENST00000291552.9 (Ensembl, ensembl.org/index.html, last accessed January 10, 2022). Download .xlsx (.02 MB) Help with xlsx files Supplemental Table S1 Download .docx (.02 MB) Help with docx files Supplemental Table S2
Referência(s)