Article Open access Peer-reviewed

An expanded sequence context model broadly explains variability in polymorphism levels across the human genome

2016; Nature Portfolio; Volume: 48; Issue: 4; Language: English

10.1038/ng.3511

ISSN

1546-1718

Authors

Varun Aggarwala, Benjamin F. Voight

Topic(s)

Genetic Associations and Epidemiology

Abstract

Varun Aggarwala and Benjamin Voight analyze human polymorphism data and develop an expanded sequence context model that explains >81% of variability in substitution probabilities, highlighting mutation-promoting motifs. Using their model, they present substitution intolerance scores for genes and a new intolerance score for amino acids, and demonstrate clinical use of the model in neuropsychiatric diseases.

The rate of single-nucleotide polymorphism varies substantially across the human genome and fundamentally influences evolution and incidence of genetic disease. Previous studies have only considered the immediately flanking nucleotides around a polymorphic site (the site's trinucleotide sequence context) to study polymorphism levels across the genome. Moreover, the impact of larger sequence contexts has not been fully clarified, even though context substantially influences rates of polymorphism. Using a new statistical framework and data from the 1000 Genomes Project, we demonstrate that a heptanucleotide context explains >81% of variability in substitution probabilities, highlighting new mutation-promoting motifs at ApT dinucleotide, CAAT and TACG sequences. Our approach also identifies previously undocumented variability in C-to-T substitutions at CpG sites, which is not immediately explained by differential methylation intensity. Using our model, we present informative substitution intolerance scores for genes and a new intolerance score for amino acids, and we demonstrate clinical use of the model in neuropsychiatric diseases.
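To make the notion of a sequence context concrete, the sketch below extracts the k-mer centred on a site (k = 3 for the trinucleotide context, k = 7 for the heptanucleotide context) and estimates a per-context substitution probability as a simple ratio of counts. It is an illustration only and is not the statistical framework described in the abstract; the function names, the 0-based `variants` mapping, and the counting estimator are all assumptions introduced here, and the sketch ignores strand symmetry and any model fitting.

```python
from collections import Counter


def context_at(seq, pos, k=7):
    """Return the k-mer centred on 0-based position `pos`, or None if the
    window would run off either end of the sequence. `k` must be odd."""
    flank = k // 2
    if pos - flank < 0 or pos + flank >= len(seq):
        return None
    return seq[pos - flank: pos + flank + 1]


def substitution_probabilities(seq, variants, k=7):
    """Estimate P(substitution to alt | context) by counting.

    `variants` is a hypothetical mapping {0-based position: alternate allele}.
    The estimate for (context, alt) is the number of polymorphic sites with
    that context and alternate allele, divided by the number of times the
    context occurs in `seq`.
    """
    flank = k // 2

    # Count every occurrence of each k-mer context in the reference sequence.
    context_counts = Counter()
    for pos in range(flank, len(seq) - flank):
        context_counts[seq[pos - flank: pos + flank + 1]] += 1

    # Count polymorphic sites per (context, alternate allele) pair.
    substitution_counts = Counter()
    for pos, alt in variants.items():
        ctx = context_at(seq, pos, k)
        if ctx is not None:  # skip sites too close to the sequence ends
            substitution_counts[(ctx, alt)] += 1

    return {
        (ctx, alt): n / context_counts[ctx]
        for (ctx, alt), n in substitution_counts.items()
    }
```

Moving from k = 3 to k = 7 raises the number of possible contexts from 4^3 = 64 to 4^7 = 16,384, so each context is observed far less often; this is one reason a plain counting estimator like the one above needs genome-scale data to be stable.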
