Nomenclature of genetic variants in hemostasis
2011; Elsevier BV; Volume: 9; Issue: 4; Language: English
DOI: 10.1111/j.1538-7836.2011.04191.x
ISSN: 1538-7933
Authors: Anne Goodeve, P.H. Reitsma, John H. McVey
Topic(s): Platelet Disorders and Treatments
Abstract: Standardized systems for naming genes and variations within them have been devised and have matured over the past 10 years. The Human Genome Organisation (HUGO) and International Federation of Human Genetics Societies have overseen these developments and, under their lead, the Human Genome Variation Society (HGVS) has devised an extensive scheme for nomenclature of sequence variants [1-3]. Some fields, such as diagnostic molecular genetics, have adopted these nomenclature schemes so widely that they have become commonplace. In other areas they have been used to a limited extent, or not at all. Many coagulation genes were cloned and initially sequenced during the 1980s, before standardized nomenclature systems were introduced. As a result, most genes and proteins have their own idiosyncrasies of naming and numbering, making it difficult for those unfamiliar with them to understand the conventions used. This can lead to confusion in the laboratory, the literature and the diagnostic setting. A number of investigators are independently using elements of the HGVS nomenclature, and its widespread introduction is now necessary to prevent confusion arising from different schemes being employed. It is particularly important in molecular genetic diagnostic work that differences in mutation nomenclature between laboratories cause no confusion; several external quality assessment schemes therefore require the use of HGVS nomenclature for unambiguous genetic analysis reports. These include the European Molecular Genetics Quality Network (EMQN) [4] and the UK National External Quality Assessment Service (UK NEQAS) for Molecular Genetics [5]. Many journals require HGVS recommendations to be followed, although reviewers police nomenclature use to varying extents.
von Willebrand factor (VWF) nomenclature was altered in line with HGVS conventions following a recommendation from the International Society on Thrombosis and Haemostasis Scientific and Standardisation Committee on VWF in 2001 [6], and nomenclature for human platelet antigens has followed the HGVS system since a recommendation in 2003 [7]. To reduce confusion in other areas of hemostasis and to enable investigators to readily understand numbering of genes/proteins with which they have not worked previously, full adoption of standard gene names and of DNA and protein sequence variant numbering is now recommended across hemostasis. The aim of this article is to provide a brief explanation of nomenclature recommendations and to highlight information sources that should be consulted for detailed guidance.

A standard system for naming human genes [8] is maintained by the HUGO Gene Nomenclature Committee (HGNC) [9]. Each unique gene symbol is a short representation of the descriptive gene name, containing only Latin letters and Arabic numerals, with no punctuation or reference to species. Hierarchical series are used for gene families such as coagulation factors and platelet proteins. To show that a gene is being referred to, the symbol is italicized, which discriminates gene from protein. For example, coagulation factor genes include F2, F5 and F8, and platelet glycoprotein genes include GP1BA, GP1BB and GP6. Pseudogenes are generally given the name of the gene to which they are similar followed by the letter P; for example, VWFP and PROSP denote VWF and protein S pseudogenes, respectively. More extensive assistance is given on the HGNC guideline page [9]. Currently used protein names are unchanged.

The HGVS system for consistent identification of sequence variants is explained by use of examples on its website [3].
The nomenclature does not discriminate between variants classified as mutations or polymorphisms; both are referred to using the same nomenclature, and an earlier system of 'X/Y' to indicate two alleles at a polymorphic position has been superseded. Sequence variants differing from a named reference sequence (below) should be recorded. Sequences are given a prefix denoting their type; cDNA has the prefix c. and is used as the standard reference sequence, in preference to genomic DNA (g.DNA), because cDNA numbering provides landmarks within the gene such as intron/exon boundaries [3]. Protein is given the prefix p. (Tables S1 and S2).

The A of the ATG initiator methionine at the start of each protein is used as the sequence start point (+1) and exonic nucleotides are numbered from there; there is no nucleotide 0, so the next 5′ nucleotide is numbered −1. Nucleotide alterations are always indicated following the nucleotide number, to help prevent confusion with amino acid alterations, so c.567A>T is a nucleotide substitution and p.A567T (or p.Ala567Thr) is a protein substitution. The greater than symbol '>' is used to indicate a nucleotide alteration; arrows are not used. Intronic numbering uses the closest exonic nucleotide and numbers from that point; for example, c.456+2T>C indicates the substitution of a T nucleotide by a C at the second nucleotide of an intron, numbered from the last nucleotide of the exon immediately 5′ to it. Insertions and deletions are indicated by 'ins' or 'del', for example c.234delC, c.678_9insA. Where more than one nucleotide is affected, the 5′ and 3′ ends of the variant sequence are given: c.345_7del3 or c.345_7delTAG. Table S1 gives examples of common mutation types.

Protein sequence is numbered from the first methionine of the protein as +1. This differs from many current numbering schemes in hemostasis, where the signal peptide and sometimes propeptide are numbered negatively and amino acid numbering starts from the beginning of the mature protein.
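The cDNA numbering rules described above (A of the ATG as +1, no nucleotide 0, intronic positions counted from the nearest exonic nucleotide) can be sketched in a few lines of Python. This is an illustration only, not an HGVS reference implementation; the function names and the toy transcript are hypothetical, and the sketch covers only substitutions in a spliced transcript plus the intron position immediately 3′ of an exon.

```python
def cdna_position(index: int, atg_index: int) -> int:
    """Map a 0-based index in a transcript string to a c. position.

    The A of the ATG start codon is +1 and the base immediately 5' of it
    is -1; position 0 does not exist.
    """
    offset = index - atg_index
    return offset + 1 if offset >= 0 else offset


def substitution(position: int, ref: str, alt: str) -> str:
    """Format a nucleotide substitution; the change follows the number."""
    return f"c.{position}{ref}>{alt}"


def intronic_substitution(last_exon_position: int, offset: int,
                          ref: str, alt: str) -> str:
    """Format a substitution in the 5' part of an intron, counted from the
    last nucleotide of the exon immediately 5' to it."""
    if offset < 1:
        raise ValueError("offset into the intron must be 1 or greater")
    return f"c.{last_exon_position}+{offset}{ref}>{alt}"


# Toy transcript (hypothetical): three 5' untranslated bases, then ATG.
transcript = "GGCATGAAA"
atg_index = transcript.index("ATG")

print(cdna_position(atg_index, atg_index))      # A of ATG
print(cdna_position(atg_index - 1, atg_index))  # next 5' nucleotide
print(substitution(567, "A", "T"))
print(intronic_substitution(456, 2, "T", "C"))
```

The two formatted strings reproduce the examples used in the text, c.567A>T and c.456+2T>C.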
Potential confusion can be avoided by always referring to the reference sequence and start point within it (below). Three-letter amino acid codes are used in preference to single-letter codes because they are less likely to be confused during transcription between documents; this is particularly recommended in any clinical work. However, for publication purposes, single-letter amino acid codes are acceptable. Amino acid designations are placed either side of the codon number: p.Gly143Ser or p.G143S. Superscript text indicating the codon number is not used, nor are arrows. Insertions and deletions are denoted by 'ins' or 'del' after a description of the flanking amino acids, for example p.Lys45_Leu46insGlnSer (or p.K45_L46insQS). Examples are given in Table S2 and on the HGVS website [3].

Some trivial mutation names such as factor V Leiden are unlikely to be replaced by this systematic nomenclature, so it is recommended that both HGVS nomenclature and the trivial name are used together (Table S3).

Where more than one sequence variant is identified in an individual, their distribution on the same or different alleles can be described through the use of square brackets (Table S4). Variants known to be on the same allele are contained within the same square brackets, for example c.[2220G>A; 3614G>A] and p.[Met740Ile; Arg1205His]. The plus symbol is used to separate each of two alleles; thus c.[421G>T]+[7603C>T] and p.[Asp141Tyr]+[Arg2535X] represent a compound heterozygous genotype, whereas c.[2561G>A]+[2561G>A] and p.[Arg854Gln]+[Arg854Gln] represent a homozygous genotype. The [=] symbol represents a normal allele, whereas the format c.[76A>C(+)283G>C] indicates that it is not known whether the two sequence variants are on the same allele or in a compound heterozygous arrangement.

Introns and exons are numbered sequentially from the 5′ end of the gene using Arabic numbers for both exons and introns.
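The protein and allele conventions described above lend themselves to a short illustration. The following Python sketch converts a single-letter protein substitution to the three-letter form and assembles square-bracket genotype descriptions. The function names are hypothetical, only simple substitutions are handled, and the stop codon is written X as in this article's examples.

```python
import re
from typing import Sequence

# Standard single-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "X": "X",  # stop codon, kept as X as in the article's examples
}

_CODES = "".join(THREE_LETTER)
_SUBSTITUTION = re.compile(rf"^p\.([{_CODES}])(\d+)([{_CODES}])$")


def to_three_letter(variant: str) -> str:
    """Convert e.g. 'p.G143S' to 'p.Gly143Ser' (substitutions only)."""
    match = _SUBSTITUTION.match(variant)
    if match is None:
        raise ValueError(f"not a single-letter substitution: {variant!r}")
    ref, codon, alt = match.groups()
    return f"p.{THREE_LETTER[ref]}{codon}{THREE_LETTER[alt]}"


def allele(variants: Sequence[str]) -> str:
    """Variants known to be on the same allele share one pair of brackets."""
    return "[" + "; ".join(variants) + "]"


def genotype(prefix: str, first: Sequence[str], second: Sequence[str]) -> str:
    """Join two alleles with '+'; prefix is 'c.' or 'p.'."""
    return f"{prefix}{allele(first)}+{allele(second)}"


print(to_three_letter("p.G143S"))
print("c." + allele(["2220G>A", "3614G>A"]))     # same allele
print(genotype("c.", ["421G>T"], ["7603C>T"]))   # compound heterozygous
print(genotype("c.", ["2561G>A"], ["2561G>A"]))  # homozygous
```

Keeping each allele as its own list makes the phase explicit in the data structure, mirroring what the square brackets express in the written description.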
For alternatively spliced transcripts, the reference sequence used (below) should represent the major and largest transcript. Alternatively spliced exons derived from sequences within the gene are numbered as for intronic sequences. Variants in transcripts initiating or terminating outside of this region can be described as for upstream/downstream sequences [3].

The US National Center for Biotechnology Information (NCBI) maintains a collection of genome, transcript and protein sequences designated the reference sequence collection (RefSeq) [10]. RefSeq records [11] integrate information from several sources and include the current description of the sequence and its features. Any description of sequence variation, including diagnostic patient reports and publications, should indicate the DNA and protein RefSeq number and version used, plus the numbering convention [3]. For example, a VWF mutation report should include the following: nucleotide and amino acid numbering is from the 'A' of the ATG methionine start codon; nucleotide RefSeq NM_000552.x and amino acid RefSeq NP_000543.x (where x indicates the version number). RefSeq records are not always numbered from the first methionine/A of ATG, so users should confirm that their numbering starts from this point. A link to related sequences from the RefSeq page [11] can be used to access the transcript via the Ensembl genome browser [12], which can be annotated with exons, codons, amino acid translation, sequence numbering and sequence variants.

Reference SNP (RefSNP) accession numbers, referred to as 'rs' numbers, are used by the single nucleotide polymorphism database (dbSNP) [13], a central repository for single base nucleotide substitutions plus short deletions and insertions, to uniquely identify sequence variants. Genomic, cDNA and protein locations are given using HGVS nomenclature and records contain allele frequency where available.
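The recommendation above that every report state the RefSeq accessions, their versions and the numbering convention could be enforced with a small helper such as the one below. This is a minimal sketch: the class name, the wording of the generated statement and the simplified accession pattern (NM_ or NP_, digits, a dot, a version number) are assumptions made for illustration. The accessions used are the SERPINC1 ones quoted in this article.

```python
import re
from dataclasses import dataclass

# Simplified pattern (an assumption): prefix, digits, '.', version digits.
_VERSIONED = re.compile(r"^(NM|NP)_\d+\.\d+$")


@dataclass(frozen=True)
class ReferenceSequences:
    """Transcript and protein RefSeq accessions, each including a version."""

    transcript: str  # must start with NM_
    protein: str     # must start with NP_

    def __post_init__(self) -> None:
        checks = ((self.transcript, "NM_"), (self.protein, "NP_"))
        for accession, prefix in checks:
            ok = accession.startswith(prefix) and _VERSIONED.match(accession)
            if not ok:
                raise ValueError(
                    f"expected a versioned {prefix} accession, got {accession!r}"
                )

    def statement(self) -> str:
        """Sentence for a report stating the numbering convention."""
        return (
            "Nucleotide and amino acid numbering is from the A of the ATG "
            f"methionine start codon: nucleotide RefSeq {self.transcript}, "
            f"amino acid RefSeq {self.protein}."
        )


refs = ReferenceSequences("NM_000488.3", "NP_000479.1")
print(refs.statement())
```

Rejecting an accession without a version number at construction time means a report cannot be generated with an ambiguous reference sequence.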
rs numbers, available from dbSNP, should be used when first describing a particular sequence variant to facilitate its unambiguous identification. Novel single nucleotide polymorphisms (SNPs) should be submitted to dbSNP by those identifying them, along with their frequency data in specified population(s), to increase knowledge of sequence variation [14].

Genetic variants should be described at both the DNA and protein level, with the gene named and exon/intron number identified. As alterations in the protein are generally theoretical predictions, this should be stated in clinical reports. When application of HGVS nomenclature results in different numbering from that used previously, both HGVS and legacy nomenclature should be shown to avoid confusion. Legacy nomenclature can be referred to in a footnote.

There are many locus-specific databases (LSDB) for genes encoding proteins involved in hemostatic disorders. As most genes and proteins in the field do not currently follow the above recommendations, DNA and protein numbering on these LSDB may require amendment, and some database managers have agreed to add field(s) to enable both legacy and HGVS numbering to be displayed. The Human Gene Mutation Database (HGMD) [15], which records published information on the first report of a mutation in many human disease genes, also uses legacy numbering schemes, and care should be taken to identify the numbering scheme when using these data. To facilitate use of standardized nomenclature, the CoagBase webpage [16] hosted by ISTH has been developed to provide links for several coagulation and platelet genes to HGNC gene names, RefSeq, LSDB, etc., along with an indication of the alterations required in protein numbering.
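Where legacy protein numbering starts at the mature protein, the alteration required is a fixed offset equal to the number of residues that precede the mature protein. The sketch below applies that offset and produces a dual HGVS/legacy description. The function names are hypothetical, only residues of the mature protein (positive legacy numbers) are handled, and the offset of 32 is inferred from the SERPINC1 example in this article (p.Ala416Ser versus legacy A384S) rather than taken from a database.

```python
def legacy_to_hgvs_codon(legacy_codon: int, leader_length: int) -> int:
    """Renumber a mature-protein residue from the initiator methionine (+1).

    leader_length is the number of residues (signal peptide plus any
    propeptide) that precede residue 1 of the mature protein.
    """
    if legacy_codon < 1:
        raise ValueError("only mature-protein residues (>= 1) are handled")
    if leader_length < 0:
        raise ValueError("leader_length cannot be negative")
    return legacy_codon + leader_length


def dual_description(ref: str, legacy_codon: int, alt: str,
                     leader_length: int) -> str:
    """Give the HGVS name first, followed by the legacy numbering."""
    hgvs_codon = legacy_to_hgvs_codon(legacy_codon, leader_length)
    return (
        f"p.{ref}{hgvs_codon}{alt} "
        f"(legacy numbering: {ref}{legacy_codon}{alt})"
    )


# Offset inferred from the article's antithrombin example: 416 - 384 = 32.
SERPINC1_LEADER = 32

print(legacy_to_hgvs_codon(384, SERPINC1_LEADER))
print(dual_description("Ala", 384, "Ser", SERPINC1_LEADER))
```

The offset differs for each protein, so it should be checked against the reference sequence in use rather than assumed.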
For a transitional period while familiarity is gained with the nomenclature, journal submissions and diagnostic genetic analysis reports should give both legacy and HGVS nomenclature at the first report of any substitution, for example c.1246G>T in exon 7 of SERPINC1 (RefSeq NM_000488.3; NP_000479.1), predicted to result in p.Ala416Ser, previously reported as 13,268G>T, A384S in exon 6 when numbered from the start of the mature protein according to Olds et al. [17]. For publication purposes, the RefSeq and a brief description of the HGVS numbering scheme should be included in addition to a reference to the legacy numbering used. Introduction of the scheme described in this manuscript should begin immediately.

The authors state that they have no conflict of interest.

Figure S1. Stylised gene illustrating HGVS numbering.
Table S1. Examples of nucleotide alteration nomenclature.
Table S2. Examples of amino acid alteration nomenclature.
Table S3. Examples of legacy and HGVS nomenclature for common mutations.
Table S4. More than one mutation in an individual.