The genes of OMIM: A legacy of Victor McKusick
2021; Wiley; Volume: 185; Issue: 11; Language: English
10.1002/ajmg.a.62415
ISSN: 1552-4833
Authors: Alan F. Scott, Joanna Amberger
Topic(s): Nutrition, Genetics, and Disease
Abstract: Victor McKusick is largely remembered as a clinician and cataloger of inherited disorders; however, those of us who worked with him know that he was perhaps most proud of his role in the effort to map the human genome. This is evident from his 2006 biographical article, A 60-year tale of spots, maps, and genes, in the Annual Review of Genomics and Human Genetics (McKusick, 2006). Originally dissuaded from leaving cardiology to enter the then nascent field of genetics, McKusick followed his passion and curiosity about genetics, perhaps in part because he was an identical twin who referred to his niece as his "half daughter." (Victor's brother Vincent would occasionally visit Baltimore and joyfully confuse OMIM staff, some of whom did not know the two were twins.) His seminal work, Mendelian Inheritance in Man (MIM), and later its online version, OMIM, was begun at a time when the technology to physically isolate genes did not exist, and it was his way of trying to bring order to the growing, and often disorganized, body of literature about inherited disorders. MIM, first published in 1966 as a catalog of 1487 Mendelian disorders, now has over 25,700 entries describing both phenotypes (>9300) and genes (>16,400). The OMIM literature review process today focuses on identifying new gene-phenotype relationships or adding to knowledge about existing relationships. In addition, because approximately a third of surveyed users report that they use OMIM to learn about the biological function of genes, a substantial effort is made to identify and add genes with known biological function even when no Mendelian phenotypes have yet been associated with variants in those genes. The genetic code was fully deciphered in 1966, and in 1969 McKusick proposed at the International Conference on Birth Defects in The Hague that mapping all the genes would be a useful approach to understanding birth defects (Dronamraju, 2012).
Because of his enthusiasm within the then relatively small community of geneticists interested in mapping, McKusick became a key figure in the formative years of what would become the Human Genome Project (HGP). Among several efforts that McKusick championed were the Human Gene Mapping (HGM) workshops that he organized with Frank Ruddle from Yale, a pioneer in the use of interspecies somatic cell hybrids for gene mapping. The first workshop took place in 1973 and by 1977, when McKusick and Ruddle (1977) published "The status of the human gene map," 210 genes had been mapped, with at least one assigned to each chromosome. In 1980, McKusick (1980, 1981) proposed the aspirational goal that all human genes be mapped by the year 2000. The 11 international HGM workshops were held from 1974 to 1991 and led to the creation of the Genome Database (GDB), a repository of gene mapping information. The workshops dovetailed with the ramp-up of sequencing efforts, marked by the initiation of the Genome Mapping and Sequencing meetings at Cold Spring Harbor Laboratory (CSHL) in April 1988. At this first CSHL meeting, Sydney Brenner suggested establishing an international coordinating scientific body to bring together the nascent genome sequencing efforts around the world. In September 1988, the Human Genome Organization (HUGO) held its founding meeting in Montreux, Switzerland. Thirty-one scientists from around the world, including five Nobel laureates, attended, and McKusick was elected its president. Membership grew quickly and scientists from nearly two dozen countries joined. The goals of the organization, as detailed in the journal Genomics (founded and co-edited initially by McKusick and Frank Ruddle), were to coordinate research among scientists, exchange data, encourage new technologies, and foster discussion of the ethical, legal and commercial implications of the HGP (McKusick, 1989).
McKusick had a deep appreciation of history and frequently talked about the development of genetics as well as the role played by Johns Hopkins University (JHU), an institution that he cared about deeply. When McKusick began creating his catalog of human inherited disorders in the late 1950s, the molecular bases of very few diseases were known. The cloning of genes and the sequencing of the genome were in the realm of science fiction. As human geneticists identified families segregating phenotypes, and markers such as RFLPs (Restriction Fragment Length Polymorphisms) and STRs (Short Tandem Repeats) were developed, the idea of using linkage analysis as Morgan had done decades earlier in Drosophila became a reality. McKusick tracked the pivotal role that technological advances played in science and often described those that made the genome project possible in his presentations (Figure 1). The discovery of restriction enzymes in 1970 (by Tom Kelly and Hamilton Smith at JHU; Kelly Jr. & Smith, 1970) and their subsequent use in DNA cloning and restriction mapping (Nathans & Smith, 1975), followed by the development of the polymerase chain reaction (Mullis et al., 1986), were pivotal advances and worthy of the Nobel prizes that they were awarded. With the ability to produce large amounts of specific DNA molecules, techniques quickly arrived to sequence DNA, first using the method of Maxam and Gilbert (1977) and, in the same year, that of Fred Sanger (Sanger et al., 1977). Sanger's method was easier and quickly became the standard. Because sequencing reads were only a few hundred base pairs, it was clear that assembling a genome would be a monumental task. In fact, Sanger is said to have joked that sequencing the human genome might have been best suited for prison labor in lieu of making license plates.
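To give a rough sense of the scale of that task, the following minimal sketch estimates how many reads are needed for their combined length to cover a genome a given number of times. The ~3 Gb figure is the approximate size of the human genome; the 500 bp read length and the coverage values are illustrative assumptions standing in for "a few hundred base pairs," not numbers taken from the HGP.

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: float = 1.0) -> int:
    """Return the number of reads whose combined length equals coverage x genome size."""
    if genome_size_bp <= 0 or read_length_bp <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(coverage * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 3_000_000_000  # ~3 Gb human genome
READ_LENGTH_BP = 500            # assumed value for "a few hundred base pairs"

if __name__ == "__main__":
    # Coverage depths chosen only for illustration.
    for depth in (1, 10):
        print(f"{depth}x coverage: {reads_needed(GENOME_SIZE_BP, READ_LENGTH_BP, depth):,} reads")
```

Even at a single pass this comes to millions of reads, before any allowance for overlaps between reads, errors, or repeats, all of which real assembly has to handle.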
At the beginning of the HGP, computational tools for analyzing sequences were still in early development and the amount of data generated from a ~3 Gb (gigabases) human genome was too large to assemble in a reasonable time with the computational resources and software then available. Partly for that reason, many of the early efforts were focused on model organisms with more manageable genomes that could serve as test beds for the development of new approaches to make sequencing easier and more accurate. Automated DNA sequencing would be a key to completing the HGP. Fluorescent sequencing was largely the idea of Lloyd Smith and Leroy Hood (Smith et al., 1986). Hood received his MD at JHU and chaired the Biology Division at Caltech where prototype instruments were developed. The first-generation DNA sequencer was commercialized by Applied Biosystems Inc (ABI) in 1986 and started to appear in labs by 1988. AFS had been doing Maxam-Gilbert and Sanger DNA sequencing using 32P-labeled nucleotides in the late 1980s at Hopkins and learned of a fluorescent DNA sequencer that had been installed in the small, over-crowded NIH lab of Craig Venter. In 1988, AFS visited Venter to see the new ABI sequencer and was impressed by both the instrument and the audacious plans Venter had for sequencing all of the mRNAs as well as the entirety of the X chromosome. With the ABI instrument in hand, it was clear that the genome could be sequenced given appropriate effort and time. As early as 1965, McKusick had realized that computers could be used in the field of human genetics (McKusick, 1965). Indeed, he was an early adopter of IT methods: from its inception, MIM was stored on magnetic tape, and it was then migrated to early word processing software in the 1980s. It was at this time that MIM was selected to be included in the Knowledge Base Research Program of the Lister Hill Center for Biomedical Communications at the National Library of Medicine.
As output from the HGM Workshops grew, McKusick, who served on the Howard Hughes Medical Institute (HHMI) advisory board from 1967 to 1983, suggested that one way HHMI could support the genome sciences was to build a central database that could organize the growing flood of data in one place. In 1988, reports from a National Research Council (NRC) assessment of Mapping and Sequencing the Human Genome, of which McKusick was a member, and the Office of Technology Assessment (OTA) concluded that the genome project was important for the US government to pursue in a coordinated manner. In 1989, HHMI funded the Genome Database (GDB) at Johns Hopkins University. GDB filled an important, if transitional, role that served as a mechanism to tie disparate data sources together in a way that other resources did not. With improved technology and a mindset that doing "big" science had its advantages, a "gold rush" of gene discovery began. Human Genome Sciences (HGS), a company formed in 1992 by Craig Venter, sought to commercialize a growing catalog of anonymous mRNAs that were dubbed expressed sequence tags (ESTs). The public effort followed suit, and as ESTs were identified and mapped, GDB captured that information designating many as cORFs (chromosome assigned Open Reading Frames). This terminology has largely been replaced as anonymous sequences have been matched to genes and given meaningful names. However, some cORFs persist (e.g., c9orf72, the approved symbol for a gene whose repeat expansions cause an ALS phenotype; OMIM 614260). In 1998, Venter left HGS and started Celera with the goal of sequencing the Human Genome (years later it was revealed that Venter's genome was the individual they sequenced while the public genome was a composite of several people). McKusick joined Celera's Scientific Advisory Board that year and supported both the public and private genome efforts. 
He believed that completing the sequence of the human genome trumped the politics of how it was accomplished. McKusick was active in the planning of the public HGP and advocated a chromosome-based approach. This may have stemmed in part from improved methods of chromosome analysis (e.g., high resolution G-banding) and from somatic cell genetics that allowed the creation of mouse-human hybrid cells carrying single human chromosomes (see Figure 1). This approach also made sense at the time as a way to distribute the work and funding among various research institutions. In the late 1980s, McKusick was the Principal Investigator on an early NCHGR (the predecessor of NHGRI) program project genome grant, "Mapping the Chromosomes of Man." Interestingly, Hamilton Smith conceived of BAC-end sequencing and joined Craig Venter at Celera where this approach was applied at the whole genome level. This method depended on the large computational investment that Venter was able to build at Celera and proved to be a more efficient strategy than a chromosome-based approach. With the recent advent of long DNA methods, a whole genome strategy is still used today for de novo assemblies. The public sequencing effort was first headed by James Watson and later by Francis Collins, and involved labs in the United States and other countries. The bulk of these efforts occurred in the UK (Wellcome Trust Sanger Institute) and in the United States (Washington University, Baylor University, the Whitehead Institute, and the Joint Genome Institute of the Department of Energy with labs in California and New Mexico). With the establishment of large centers, sequencing and assembly proceeded within each group, again with a chromosome-oriented approach. There was sometimes tension between the public and private efforts, but ultimately the competition probably sped the completion of the genome by years. Both efforts were commended at the White House on June 26, 2000 in a ceremony that McKusick attended.
This was within the 20-year time frame he had proposed two decades earlier. McKusick was friendly with Venter and Collins and both men attended his funeral in 2008. Because the human genome belongs to everyone, McKusick strongly advocated for international collaboration and data sharing. The Bermuda data sharing agreement in 1996 required that all sequence greater than 1 kb produced at each site be publicly shared within 24 h. NCBI in the US, EMBL in Europe and DDBJ in Japan became the primary data repositories and exchanged data while developing different tools for annotation. In 2000, the UCSC genome browser became available and its graphical interface made the genome accessible (Lee et al., 2020). With the amount of completed sequence rapidly growing, a separate mapping database was no longer necessary and much of GDB's data was incorporated elsewhere. However, OMIM's catalog of genes and genetic phenotypes grew in importance largely because of its narratives and interpretative approach to selecting and organizing relevant information from the literature. In a landmark paper McKusick (2001) equated the completion of the HGP to the gross anatomy of Vesalius in 1543 in its benefit to medicine and society, an idea that he had introduced much earlier (McKusick, 1981). The wide availability of sequence data from human and model organisms was a boon to individual labs and a burst of gene discovery occurred quickly by both basic scientists and geneticists. As more researchers became involved in gene discovery the publication of the same sequences under different names led to duplicate records. In addition, communities of scientists working on other species often adopted their own nomenclature for homologous sequences adding to the confusion. The problem was exacerbated by publishers who did not require authors to submit their sequences to public databases before print, and there was no simple way to compare sequences to identify duplicate gene reports. 
In 1990, NCBI introduced BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990), a sequence comparison algorithm that allowed users to identify the same or similar sequences across species, and this greatly helped sort out redundant entries. The legacy of this era can still be seen in the large number of aliases listed in the NCBI, HGNC, OMIM, and other databases (e.g., Figure 2). The Human Gene Mapping Workshops, recognizing the need to establish norms for gene naming, established the Human Genome Nomenclature Committee (HGNC). Phyllis McAlpine was the first head of the HGNC; she was succeeded by Sue Povey at the Galton Laboratory and now by Elspeth Bruford. From the beginning, MIM and the HGNC worked closely together. In 1995, McKusick asked AFS, who ran a service center offering DNA sequencing and who was doing gene-based research, if he would help curate the flood of new genes that were being reported in the literature. As noted, because many journals accepted sequence (protein or DNA/RNA) without corresponding database IDs, we had to manually enter each for sequence analysis. To manage the volume of new papers, we created an Excel spreadsheet that was shared among the work group through a home-grown local area network. The spreadsheet tracked the citation, accession number, gene name(s), OMIM number, and so on. Initially, AFS and JSA were joined by Jeni Hart, who worked on this project (and later at NCBI on RefSeq, the reference sequence database), followed by Erik Janus and the occasional student. As we found duplicates, we would email HGNC and the mouse database at the Jackson Laboratory. In 1998, we met Greg Schuler from the NCBI at the Bar Harbor Short Course (another initiative of McKusick, begun in 1960 with John Fuller) and discussed the idea of having NIH take on what we had created. The result was a relational database called NomeMIM (Nomenclature and OMIM) developed by Schuler at NCBI.
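The duplicate-tracking step described above can be illustrated with a minimal sketch that groups literature reports by sequence accession number and flags any accession reported under more than one gene name. The record fields loosely mirror the spreadsheet columns mentioned in the text; the `Report` type, the `find_aliases` function, and every sample citation, accession, and gene name are hypothetical placeholders, not the curation group's actual data or code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Report(NamedTuple):
    """One literature report of a gene (hypothetical record layout)."""
    citation: str
    accession: str
    gene_name: str


def find_aliases(reports: Iterable[Report]) -> Dict[str, List[str]]:
    """Map each accession reported under several names to the sorted list of those names."""
    names_by_accession: Dict[str, Set[str]] = defaultdict(set)
    for report in reports:
        # Normalize case and whitespace so "acc0001" and "ACC0001 " are treated as one key.
        key = report.accession.strip().upper()
        names_by_accession[key].add(report.gene_name.strip())
    return {
        accession: sorted(names)
        for accession, names in names_by_accession.items()
        if len(names) > 1
    }


# Invented sample rows; none of these identifiers refer to real records.
SAMPLE = [
    Report("Paper A", "ACC0001", "GENE1"),
    Report("Paper B", "acc0001", "GENE1-LIKE"),
    Report("Paper C", "ACC0002", "GENE2"),
]

if __name__ == "__main__":
    print(find_aliases(SAMPLE))
```

Exact matching on accession only catches reports that cite the same database record. The harder case noted in the text, where the same sequence was published under different names without any database ID, needs a sequence comparison step such as BLAST, which this sketch does not attempt.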
This web-based database provided a simple way for us to notify HGNC of our gene curation data and share it with them. As we unraveled duplicate entries in OMIM or found discordant naming of orthologs in other species, we matched these to GenBank accession numbers and were thus able to bring some order to the genetic equivalent of a tower of Babel. The NCBI curation of NomeMIM was coordinated by Donna Maglott and Kim Pruitt. After a year or so, NCBI RefSeq curations were added to the database, which was renamed LocusXRef. This database then became integrated into NCBI's publicly available resource, EntrezGene, now named simply Gene (www.ncbi.nlm.nih.gov/gene). Subsequently, the curation output of MeSH indexers and of NCBI staff doing sequence analysis and sequence reconciliation was added to LocusXRef, resulting in the robust resource we have today. The LocusXRef/NCBI Gene collaboration continues, but it is increasingly rare to find a literature report of a gene that is not already described in the database. A screen shot (Figure 2) shows the "back end" of the editable database, which allows SQL searching and includes a variety of links to other databases. Gene entries start with a MIM number preceded by an asterisk. The preferred title (generally taken from HGNC) is followed by alternative names and aliases used in the literature. The HGNC approved gene symbol is explicitly provided. The cytogenetic and genomic locations (GRCh38), taken from data at NCBI, are included next. The text of an OMIM gene entry is taken from articles from the biomedical literature and provides a variety of information structured under headings. Most gene entries have a short overview followed by information on how the gene was cloned, a description of known transcript variants (when applicable), a summary of expression data, and a synopsis of the structure of the gene.
The mapping section may include information on methods by which the gene was mapped, the presence of pseudogenes, and mapping of homologous genes in other species. Functional studies of the gene or protein product(s) are summarized under the gene function heading, and there is a section for descriptions of animal models. Variants in a gene are organized under an Allelic Variants heading. Selected variants are given allelic variant numbers sequentially in the gene entry. The phenotype associated with the variant makes up the allelic variant title and below that is the mutation with links to variant databases (e.g., dbSNP, gnomAD, ClinVar). The text of the allelic variant includes in whom the variant was identified, the zygosity of the variant, the phenotypic consequence of the variant, and functional studies of that variant, when available. Although McKusick received encouragement as early as 1987 to make OMIM a directory of all human mutations, it was clear that this was beyond the scope of what was possible, and many other resources have assumed that role. The external links in each gene entry take users to that gene's data in other specialized databases. These databases include genome browsers, DNA sequence and protein databases, databases that curate various gene-related information (e.g., BioGPS, Gene Ontology, GeneCards), clinical resources such as Genetics Home Reference, the Genetic Testing Registry, or GARD (Genetic and Rare Disease Information Center), variation resources such as ClinVar or gnomAD, model organism databases, and cellular pathway databases. It is neither possible nor desirable to include all references on a particular gene in OMIM. Instead, the faculty and staff endeavor to select articles that provide significant landmarks along the expanding knowledge of a particular gene.
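The layout of a gene entry described above can be summarized as a small data structure. The sketch below is only an illustration of that description, not OMIM's actual schema: the class names, field names, and placeholder titles are hypothetical, and the zero-padded four-digit suffix used for allelic variant numbers is an assumption about formatting (the text says only that variants are numbered sequentially within the entry). The MIM number 614260 is the one the article cites for C9orf72.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AllelicVariant:
    number: str           # sequential identifier within the gene entry
    phenotype_title: str  # phenotype associated with the variant
    mutation: str         # description of the mutation itself


@dataclass
class GeneEntry:
    mim_number: int
    preferred_title: str  # generally taken from HGNC
    hgnc_symbol: str
    alternative_titles: List[str] = field(default_factory=list)
    allelic_variants: List[AllelicVariant] = field(default_factory=list)

    @property
    def display_number(self) -> str:
        """Gene entries are shown with an asterisk before the MIM number."""
        return f"*{self.mim_number}"

    def add_variant(self, phenotype_title: str, mutation: str) -> AllelicVariant:
        """Append a variant, assigning the next sequential number in this entry."""
        sequence = len(self.allelic_variants) + 1
        # Assumed formatting: MIM number, a dot, then a four-digit counter.
        variant = AllelicVariant(f"{self.mim_number}.{sequence:04d}", phenotype_title, mutation)
        self.allelic_variants.append(variant)
        return variant


if __name__ == "__main__":
    entry = GeneEntry(614260, "PLACEHOLDER TITLE", "C9orf72")
    entry.add_variant("PLACEHOLDER PHENOTYPE", "placeholder mutation")
    print(entry.display_number, [v.number for v in entry.allelic_variants])
```

Keeping the variant counter inside the entry reflects the point that variant numbers are local to a gene entry rather than global across the catalog.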
To provide directed access to the greater biomedical literature, OMIM includes a circled + sign at the end of each paragraph that leverages PubMed's "Similar Articles" tool for references cited in that paragraph. The references that support the text of the entry are included at the bottom of the entry and have links to PubMed and the full text of the article at the publishers' websites. As of December 2020, RefSeq recorded about 24,000 protein coding genes (excluding pseudogenes) of which about 16,000 were in OMIM. Most of the current ~8000 genes not yet included in OMIM are either members of large gene families (e.g., olfactory receptors) or have little, if any, functional information yet reported in the published literature. With time, many of these genes will be characterized with publications supporting their biology, and OMIM entries will be created. While the total number of genes is reaching a plateau, the number of genes with phenotypes has continued to increase by roughly 250 per year over the last decade. The criteria that OMIM sets for inclusion of genes with phenotypes are rigorous (Amberger et al., 2019), so the number of published purported phenotype-gene relationships may be higher. With the publication of additional research and patient reports, the validated relationships will be added to OMIM. The extent to which OMIM includes information on common traits with complex oligogenic underpinnings is limited. For example, GWAS studies that have found phenotypic traits (e.g., height) associated with regions near genes with monogenic phenotypes (e.g., Marfan's; Tcheandjieu et al., 2020), while intriguing, are not within the scope of OMIM. As shown in Figure 3, the number of new gene entries in OMIM (mostly protein coding) is reaching a plateau while the number of genes with phenotypes continues to climb, especially in the last decade as improved technologies have made sequencing faster, less expensive and more accessible.
Today whole exome sequencing (WES) or, especially, whole genome sequencing (WGS) makes identification of many types of DNA variants fairly simple. The most difficult task is to decide which variants in specific genes are deleterious and make sense with the phenotype. Tools such as knock-out animal models, CRISPR editing in cell lines, RNA sequencing in bulk and of single cells, new proteomic methods, and the ability to identify large structural variants or epigenetic modifications are among the many approaches that can build evidence for a gene to phenotype connection. With WES or WGS, large pedigrees are not needed as they were in linkage studies, but multiple affected probands are still very helpful. The difficulty in confirming which variants may be functional has led to the creation of several variant sharing databases, such as GeneMatcher (Sobreira et al., 2015) at JHU, that are pooled together for querying in the Matchmaker Exchange consortium (Philippakis et al., 2015). The use of the new sequencing technologies coupled with anonymous sharing of phenotype and sequence data accelerates the molecular elucidation of rare single-gene Mendelian disorders (Figure 4). As of 2021, the protein-coding portion of the genome is well annotated. Increasingly, however, other elements in the genome are being discovered and identified as having functional consequences. For example, RefSeq lists nearly 3000 loci expressing noncoding transcripts. In recent years, some of these have been implicated in regulatory roles that affect phenotypes (e.g., MIR96 and deafness, OMIM 611606). Soon, a telomere to telomere human chromosomal assembly will be completed, and pan-genome sequencing efforts producing similarly complete genomes from diverse human populations will help flesh out the remaining genes.
It is possible that new approaches such as large-scale peptide sequencing by mass spectrometry (Kim et al., 2014) may very well find novel short coding sequences or currently missed exons in known loci resulting from alternate splicing. With increasing knowledge about the genome and how genes are regulated and interact, the definition of a gene has evolved and become more nuanced. One definition might be that a gene is a block of sequence that has functional consequences at the biochemical and cellular level, at the level of clinical phenotypes and, ultimately, leads to evolutionary change. Presumably, sequences, both coding and noncoding, that have remained conserved over millennia are functionally important and each will, with time, have an entry in OMIM. For some, discovering their biology will be difficult as they may cause embryonic lethality (such is the case for approximately 25% of genes in mice; Dickinson et al., 2016), they may be redundant and alternative loci may act as functional substitutes in some cases, their loss may create phenotypes too mild to warrant attention, they may be genes whose variants may have undergone natural selection in the past but are currently neutral, or they may be genes with repeat expansions or structural anomalies that are difficult to identify with current methods. Advances in technology will continue to evolve opening new avenues for understanding the complexity of the genome. In the near future, repeat expansions and structural anomalies that alter gene function will be more readily identifiable with long DNA methods such as optical mapping or nanopore sequencing, both of which can also distinguish haplotypes. Nanopore and bisulfite sequencing will be used to identify epigenetic modifications. RNAseq methods for single cells and tissues will allow us to follow trajectories of gene expression during development and disease progression. 
The interpretation of variants both in genes and regulatory regions will improve with ever-growing datasets of sequence and genotype data. Furthermore, the use of artificial intelligence tools will more accurately predict the consequences of variants both in coding and regulatory sequences. New tools such as immunoglobulin repertoire sequencing will help us interpret the role of somatic recombination in antibody production and better understand the role of the immune system in health and disease. Inevitably, we may reach a point of genetic indeterminism where subtler phenotypes cannot be definitively associated with gene variants and lifestyle choices or random chance events may outweigh genetics. Until that point there is much to be done and OMIM will continue to serve as a source where genes and phenotypes cohabit as McKusick envisioned. Victor McKusick was 32 years old when the structure of DNA was reported, 40 years old when the genetic code was solved, 62 years old when PCR was developed, and 78 when the HGP was lauded in 2000. During the 100 years since his birth, genetics has matured beyond all expectations of what many thought possible and certainly from the time he was advised to work in a more traditional field of medicine. While he never spent time at the bench, McKusick understood and appreciated the technologies that appeared over his lifetime. His leadership, especially during the early phases of the HGP, was critical in how the project was organized and in its ultimate success. McKusick's legacy will live on. Over the next 100 years it is likely the burden of inherited disease will be greatly lessened, if not eliminated, thanks to all those who benefited from his pioneering efforts and vision.
The authors declared no competing or financial interests.
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Reference(s)