2022-03-31 National Institute of Standards and Technology (NIST)
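・The Telomere-to-Telomere (T2T) consortium, which completed the genome, is testing how well the full genome supports DNA sequencing of thousands of people. A new paper published in Science shows that it corrects tens of thousands of errors that arose with the previous genome sequence and that it is suited to the analysis of more than 200 medically important genes.
・The T2T genome could greatly advance research on genetic diseases, and in the future patients may benefit from more reliable diagnoses.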
<Related information>
- https://www.nist.gov/news-events/news/2022/03/first-complete-human-genome-poised-strengthen-genetic-analysis-nist-study
- https://www.science.org/doi/10.1126/science.abl3533
A complete reference genome improves analysis of human genetic variation
SERGEY AGANEZOV, STEPHANIE M. YAN, DANIELA C. SOTO, MELANIE KIRSCHE, SAMANTHA ZARATE, PAVEL AVDEYEV, DYLAN J. TAYLOR, KISHWAR SHAFIN, ALAINA SHUMATE, CHUNLIN XIAO, JUSTIN WAGNER, JENNIFER MCDANIEL, NATHAN D. OLSON, MICHAEL E. G. SAURIA, MITCHELL R. VOLLGER, ARANG RHIE, MELISSA MEREDITH, SKYLAR MARTIN, JOYCE LEE, SERGEY KOREN, JEFFREY A. ROSENFELD, BENEDICT PATEN, RYAN LAYER, CHEN-SHAN CHIN, FRITZ J. SEDLAZECK, NANCY F. HANSEN, DANNY E. MILLER, ADAM M. PHILLIPPY, KAREN H. MIGA, RAJIV C. MCCOY, MEGAN Y. DENNIS, JUSTIN M. ZOOK AND MICHAEL C. SCHATZ
Science Published: 1 Apr 2022
DOI: 10.1126/science.abl3533
Abstract
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.
For the past 20 years, the human reference genome (GRCh38) has served as the bedrock of human genetics and genomics (1–3). One of the central applications of the human reference genome, and of reference genomes in general, has been to serve as a substrate for clinical, comparative, and population genomic analyses. More than 1 million human genomes have been sequenced to study genetic diversity and clinical relationships, and nearly all of them have been analyzed by aligning the sequencing reads from the donors to the reference genome [e.g., (4–6)]. Even when donor genomes are assembled de novo, independent of any reference, the assembled sequences are almost always compared to a reference genome to characterize variation by leveraging deep catalogs of available annotations (7, 8). Consequently, human genetics and genomics benefit from the availability of a high-quality reference genome, ideally without gaps or errors that may obscure important variation and regulatory relationships.
The current human reference genome, GRCh38, is used for countless applications, with rich resources available to visualize and annotate the sequence across cell types and disease states (3, 9–12). However, despite decades of effort to construct and refine its sequence, the human reference genome still suffers from several major limitations that hinder comprehensive analysis. Most immediately, GRCh38 contains more than 100 million nucleotides that either remain entirely unresolved (currently represented as “N” characters), such as the p-arms of the acrocentric chromosomes, or are substituted with artificial models, such as the centromeric satellite arrays (13). Furthermore, GRCh38 possesses 11.5 Mbp of unplaced and unlocalized sequences that are represented separately from the primary chromosomes (3, 14). These sequences are difficult to study, and many genomic analyses exclude them to avoid identifying false variants and false regulatory relationships (6). Relatedly, artifacts such as an apparent imbalance between insertions and deletions (indels) have been attributed to systematic misassemblies in GRCh38 (15–17). Overall, these errors and omissions in GRCh38 introduce biases in genomic analyses, particularly in centromeres, satellites, and other complex regions.
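As a rough illustration of what "unresolved" means in practice, the sketch below tallies "N" characters per sequence in FASTA-formatted text. It is a minimal example under stated assumptions, not the method used in the paper: the function name `count_unresolved` and the toy input are hypothetical, and a real reference file would be streamed from disk rather than built in memory.

```python
import io
from collections import Counter


def count_unresolved(fasta_lines):
    """Return {sequence_name: count of N/n bases} for FASTA-formatted lines.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the name.
            name = line[1:].split()[0] if len(line) > 1 else "unnamed"
            counts[name] += 0  # register sequences that contain no N at all
        elif name is not None:
            # Sequence line: count unresolved bases, ignoring letter case.
            counts[name] += line.upper().count("N")
    return dict(counts)


# Toy input (assumption): two short made-up sequences, not real genome data.
toy = io.StringIO(">chrA demo\nACGTNNNN\nnnAC\n>chrB\nACGT\n")
result = count_unresolved(toy)
print(result)
```

Run against a real assembly, a tally like this would expose long "N" runs such as the acrocentric p-arms mentioned above. It would not detect regions filled with artificial models, since those contain ordinary bases.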
Another major concern regards the influence of the reference genome on analysis of variation across large cohorts for population and clinical genomics. Several studies, such as the 1000 Genomes Project (1KGP) (18) and gnomAD (6), have provided information about the extent of genetic diversity within and between human populations. Many analyses of Mendelian and complex diseases use these catalogs of single-nucleotide variants (SNVs), small indels, and structural variants (SVs) to rank and prioritize potential causal variants on the basis of allele frequencies (AFs) and other evidence (19–21). When evaluating these resources, the overall quality and representativeness of the human reference genome should be considered. Any gaps or errors in the sequence could obscure variation and its contribution to human phenotypes and disease.
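To make the role of allele frequencies concrete, here is a minimal sketch of computing an AF from diploid genotype calls and ordering candidate variants from rarest to most common. The variant names, genotype encoding (0 = reference allele, 1 = alternate allele, None = missing call), and the helper `allele_frequency` are all assumptions for illustration. Real prioritization pipelines combine AF with other evidence, as the text notes.

```python
def allele_frequency(genotypes):
    """Alternate-allele count divided by the number of called alleles.

    `genotypes` is a list of diploid calls such as (0, 1); None marks a
    missing allele. Returns None when nothing was called.
    """
    called = [allele for gt in genotypes for allele in gt if allele is not None]
    if not called:
        return None
    return sum(1 for allele in called if allele > 0) / len(called)


# Toy cohort (assumption): hypothetical variants with made-up genotypes.
candidates = {
    "varA": [(0, 0), (0, 1), (0, 0), (0, 0)],
    "varB": [(0, 1), (1, 1), (0, 0), (None, None)],
}

frequencies = {name: allele_frequency(gts) for name, gts in candidates.items()}

# Rank rarest first, skipping variants with no called genotypes.
ranked = sorted(
    (name for name, af in frequencies.items() if af is not None),
    key=lambda name: frequencies[name],
)
print(frequencies, ranked)
```

Because the AF is computed only from reads that could be mapped and called, a gap or error in the reference can distort or hide it. That is the concern the paragraph above raises.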
In addition to omissions such as centromeric sequences or acrocentric chromosome arms, the current reference genome possesses other errors and biases, including within genes of known medical relevance (22, 23). Moreover, GRCh38 was assembled from multiple donors with clone-based sequencing, which creates an excess of artificial haplotype structures that can subtly bias analyses (1, 24). Over the years, there have been attempts to replace certain rare alleles with more common alleles, but hundreds of thousands of artificial haplotypes and rare alleles remain to this day (3, 25, 26). Increasing the continuity, quality, and representativeness of the reference genome is therefore crucial for improving genetic diagnosis, as well as for understanding the complex relationship between genetic and phenotypic variation.
The Telomere-to-Telomere (T2T) CHM13 genome addresses many of the limitations of the current reference (27). Specifically, the T2T-CHM13v1.0 assembly adds nearly 200 Mbp of sequence and resolves errors present in GRCh38. Here, we demonstrate the impact of the T2T-CHM13 reference on variant discovery and genotyping in a globally diverse cohort. This includes all 3202 samples from the recently expanded 1KGP sequenced with short reads (28) along with 17 samples from diverse populations sequenced with long reads (8, 27, 29). Our analysis reveals more than 2 million variants within previously unresolved regions of the genome, genome-wide improvements in SV discovery, and enhancement in variant calling accuracy across 622 medically relevant genes. In summary, our work demonstrates universal improvements in read mapping and variant calling, thereby broadening the horizon for future genomic studies.