A complete reference genome improves analysis of human genetic variation
SERGEY AGANEZOV, STEPHANIE M. YAN, DANIELA C. SOTO, MELANIE KIRSCHE, SAMANTHA ZARATE, PAVEL AVDEYEV, DYLAN J. TAYLOR, KISHWAR SHAFIN, ALAINA SHUMATE, CHUNLIN XIAO, JUSTIN WAGNER, JENNIFER MCDANIEL, NATHAN D. OLSON, MICHAEL E. G. SAURIA, MITCHELL R. VOLLGER, ARANG RHIE, MELISSA MEREDITH, SKYLAR MARTIN, JOYCE LEE, SERGEY KOREN, JEFFREY A. ROSENFELD, BENEDICT PATEN, RYAN LAYER, CHEN-SHAN CHIN, FRITZ J. SEDLAZECK, NANCY F. HANSEN, DANNY E. MILLER, ADAM M. PHILLIPPY, KAREN H. MIGA, RAJIV C. MCCOY, MEGAN Y. DENNIS, JUSTIN M. ZOOK, AND MICHAEL C. SCHATZ
Science. Published: 1 Apr 2022
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.
For the past 20 years, the human reference genome (GRCh38) has served as the bedrock of human genetics and genomics (1–3). One of the central applications of the human reference genome, and of reference genomes in general, has been to serve as a substrate for clinical, comparative, and population genomic analyses. More than 1 million human genomes have been sequenced to study genetic diversity and clinical relationships, and nearly all of them have been analyzed by aligning the sequencing reads from the donors to the reference genome [e.g., (4–6)]. Even when donor genomes are assembled de novo, independent of any reference, the assembled sequences are almost always compared to a reference genome to characterize variation by leveraging deep catalogs of available annotations (7, 8). Consequently, human genetics and genomics benefit from the availability of a high-quality reference genome, ideally without gaps or errors that may obscure important variation and regulatory relationships.
The current human reference genome, GRCh38, is used for countless applications, with rich resources available to visualize and annotate the sequence across cell types and disease states (3, 9–12). However, despite decades of effort to construct and refine its sequence, the human reference genome still suffers from several major limitations that hinder comprehensive analysis. Most immediately, GRCh38 contains more than 100 million nucleotides that either remain entirely unresolved (currently represented as “N” characters), such as the p-arms of the acrocentric chromosomes, or are substituted with artificial models, such as the centromeric satellite arrays (13). Furthermore, GRCh38 possesses 11.5 Mbp of unplaced and unlocalized sequences that are represented separately from the primary chromosomes (3, 14). These sequences are difficult to study, and many genomic analyses exclude them to avoid identifying false variants and false regulatory relationships (6). Relatedly, artifacts such as an apparent imbalance between insertions and deletions (indels) have been attributed to systematic misassemblies in GRCh38 (15–17). Overall, these errors and omissions in GRCh38 introduce biases in genomic analyses, particularly in centromeres, satellites, and other complex regions.
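Because unresolved sequence is represented as "N" characters, the extent of such gaps in any assembly can in principle be tallied directly from its FASTA file. The following is a minimal sketch of that idea, not a method used in this study: the function name `count_unresolved_bases` and the toy sequences `chrToy1` and `chrToy2` are hypothetical, and the sketch assumes plain, uncompressed FASTA text in which both "N" and soft-masked "n" denote unresolved bases.

```python
from typing import Dict, Iterable


def count_unresolved_bases(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Count 'N'/'n' characters per sequence in FASTA-formatted lines.

    Hypothetical helper for illustration; assumes plain-text FASTA.
    """
    counts: Dict[str, int] = {}
    name = None
    for raw_line in fasta_lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            counts[name] = 0
        elif name is not None:
            # Sequence line: add unresolved bases to the current sequence's tally.
            counts[name] += line.count("N") + line.count("n")
    return counts


# Toy input for illustration only; these are not real genome sequences.
toy_fasta = [
    ">chrToy1 hypothetical example",
    "ACGTNNNNNNACGT",
    "NNacgtnn",
    ">chrToy2",
    "ACGTACGT",
]
toy_counts = count_unresolved_bases(toy_fasta)
print(toy_counts)
```

For a real assembly, the same function could be passed an open file handle in place of the list, since it reads the input one line at a time. Note that this only counts bases marked as unresolved; it would not detect regions substituted with artificial models, such as the centromeric satellite arrays, because those are written as ordinary nucleotides.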