Machine Learning Method Developed by CMU Researchers Illuminates Fundamental Aspects of Evolution

2023-04-28

2023-04-27 Carnegie Mellon University

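A team of researchers at Carnegie Mellon University has developed a new method for identifying the parts of the genome that are essential to understanding how particular traits evolved in different species. In a study published in the journal Science, they note the need for state-of-the-art AI and machine learning techniques and focus in particular on enhancers, the DNA regions that control gene expression.
Using the machine learning toolkit TACIT, important enhancer regions can be identified in newly sequenced genomes, with potential applications such as the conservation of endangered species.
The team applied TACIT to the genome sequences of 240 mammalian species to predict enhancer activity and identify regions involved in brain evolution. Enhancers associated with social behavior were also identified.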

<Related information>

Relating enhancer genetic variation across mammals to complex phenotypes using machine learning

Irene M. Kaplow, Alyssa J. Lawler, Daniel E. Schäffer, Chaitanya Srinivasan, Heather H. Sestili, Morgan E. Wirthlin, BaDoi N. Phan, Kavya Prasad, Ashley R. Brown, Xiaomeng Zhang, Kathleen Foley, Diane P. Genereux, Zoonomia Consortium, Elinor K. Karlsson, Kerstin Lindblad-Toh, Wynn K. Meyer, Andreas R. Pfenning
Science, Published: 28 Apr 2023
DOI:https://doi.org/10.1126/science.abm7993

Structured Abstract

INTRODUCTION
Diverse phenotypes, including large brains relative to body size, group living, and vocal learning ability, have evolved multiple times throughout mammalian history. These shared phenotypes may have arisen repeatedly by means of common mechanisms discernible through genome comparisons.

RATIONALE
Protein-coding sequence differences have failed to fully explain the evolution of multiple mammalian phenotypes. This suggests that these phenotypes have evolved at least in part through changes in gene expression, meaning that their differences across species may be caused by differences in genome sequence at enhancer regions that control gene expression in specific tissues and cell types. Yet the enhancers involved in phenotype evolution are largely unknown. Sequence conservation–based approaches for identifying such enhancers are limited because enhancer activity can be conserved even when the individual nucleotides within the sequence are poorly conserved. This is due to an overwhelming number of cases where nucleotides turn over at a high rate, but a similar combination of transcription factor binding sites and other sequence features can be maintained across millions of years of evolution, allowing the function of the enhancer to be conserved in a particular cell type or tissue. Experimentally measuring the function of orthologous enhancers across dozens of species is currently infeasible, but new machine learning methods make it possible to make reliable sequence-based predictions of enhancer function across species in specific tissues and cell types.
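
The point that enhancer function can persist while individual nucleotides turn over can be illustrated with a small toy example. The sketch below is not taken from the paper: the two "orthologous" sequences and the motif strings are hypothetical, chosen only so that the sequences share the same combination of motifs while differing at most aligned positions.

```python
def percent_identity(a, b):
    """Fraction of aligned positions carrying the same nucleotide (ungapped, equal length)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def motif_counts(seq, motifs):
    """Count (possibly overlapping) occurrences of each motif in seq."""
    return {
        m: sum(seq.startswith(m, i) for i in range(len(seq) - len(m) + 1))
        for m in motifs
    }


# Hypothetical toy motifs standing in for transcription factor binding sites.
MOTIFS = ["TGACTCA", "CACGTG"]

# Hypothetical orthologous enhancers: same motifs, different order and flanks.
species_a = "AAAATGACTCAGGGGCACGTGTTTT"
species_b = "CCCACGTGCCTTTGACTCATAGGAC"

identity = percent_identity(species_a, species_b)
counts_a = motif_counts(species_a, MOTIFS)
counts_b = motif_counts(species_b, MOTIFS)

print(f"nucleotide identity: {identity:.2f}")
print("motif counts A:", counts_a)
print("motif counts B:", counts_b)
```

A nucleotide-level conservation score would rate this pair as poorly conserved, even though both sequences carry one copy of each motif. This is the kind of case a function-level prediction is meant to capture.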

RESULTS
To overcome the limits of studying individual nucleotides, we developed the Tissue-Aware Conservation Inference Toolkit (TACIT). Rather than measuring the extent to which individual nucleotides are conserved across a region, TACIT uses machine learning to test whether the function of a given part of the genome is likely to be conserved. More specifically, convolutional neural networks learn the tissue- or cell type–specific regulatory code connecting genome sequence to enhancer activity using candidate enhancers identified from only a few species. This approach allows us to accurately associate differences between species in tissue or cell type–specific enhancer activity with genome sequence differences at enhancer orthologs. We then connect these predictions of enhancer function to phenotypes across hundreds of mammals in a way that accounts for species’ phylogenetic relatedness. We applied TACIT to identify candidate enhancers from motor cortex and parvalbumin neuron open chromatin data that are associated with brain size relative to body size, solitary living, and vocal learning across 222 mammals. Our results include the identification of multiple candidate enhancers associated with brain size relative to body size, several of which are located in linear or three-dimensional proximity to genes whose protein-coding mutations have been implicated in microcephaly or macrocephaly in humans. We also identified candidate enhancers associated with the evolution of solitary living near a gene implicated in separation anxiety and other enhancers associated with the evolution of vocal learning ability. We obtained distinct results for bulk motor cortex and parvalbumin neurons, demonstrating the value in applying TACIT to both bulk tissue and specific minority cell type populations. 
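
As a rough illustration of how a convolutional model maps sequence to a predicted activity, the sketch below runs the forward pass of a one-filter network in plain Python: one-hot encoding, a convolution with ReLU, global max pooling, and a logistic output. It is a minimal sketch under stated assumptions and not the TACIT architecture: the kernel, weight, and bias are hand-set hypothetical values rather than trained parameters, and the motif is a toy string.

```python
import math

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode DNA as rows of 4 floats (A, C, G, T); unknown bases become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq.upper()]


def conv1d_relu(x, kernel):
    """Slide one kernel (width x 4) along the sequence and apply ReLU at each position."""
    width = len(kernel)
    out = []
    for start in range(len(x) - width + 1):
        total = sum(x[start + i][j] * kernel[i][j] for i in range(width) for j in range(4))
        out.append(max(0.0, total))
    return out


def predict_activity(seq, kernels, weights, bias):
    """Convolution -> global max pooling -> logistic output in (0, 1).

    Assumes seq is at least as long as every kernel.
    """
    x = one_hot(seq)
    pooled = [max(conv1d_relu(x, k)) for k in kernels]
    logit = bias + sum(w * p for w, p in zip(weights, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


def motif_kernel(motif):
    """Hypothetical hand-set kernel: +1 for the motif base at each position, -1 otherwise."""
    return [[1.0 if letter == base else -1.0 for letter in ALPHABET] for base in motif]


# Hypothetical parameters; a real model would learn many kernels from open chromatin data.
KERNELS = [motif_kernel("TGACTCA")]
WEIGHTS = [1.0]
BIAS = -3.5

with_motif = predict_activity("AAAATGACTCAGGGGCACGTGTTTT", KERNELS, WEIGHTS, BIAS)
without_motif = predict_activity("A" * 25, KERNELS, WEIGHTS, BIAS)
print(f"with motif: {with_motif:.3f}, without motif: {without_motif:.3f}")
```

Because the pooled score depends on whether the motif occurs anywhere in the window, not on where it sits, the prediction is unchanged when the motif moves within the sequence. This is one reason such models can score orthologs whose nucleotides have diverged.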
To facilitate future analyses of our results and applications of TACIT, we released predicted enhancer activity of >400,000 candidate enhancers in each of 222 mammals and their associations with the phenotypes we investigated.
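
The abstract does not spell out the statistical procedure used to account for phylogenetic relatedness, so the following is only a simplified stand-in: a permutation test that shuffles phenotype labels within clades, so that an association driven purely by shared ancestry does not look significant. The function names, clade labels, and toy values are all hypothetical.

```python
import random
from collections import defaultdict


def mean_difference(activity, phenotype):
    """Mean predicted activity in species with the trait minus the mean in species without it."""
    with_trait = [a for a, p in zip(activity, phenotype) if p]
    without = [a for a, p in zip(activity, phenotype) if not p]
    return sum(with_trait) / len(with_trait) - sum(without) / len(without)


def clade_permutation_p(activity, phenotype, clades, n_perm=2000, seed=0):
    """Two-sided p-value from shuffling phenotype labels only within each clade."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, clade in enumerate(clades):
        groups[clade].append(idx)
    observed = abs(mean_difference(activity, phenotype))
    hits = 0
    for _ in range(n_perm):
        shuffled = list(phenotype)
        for members in groups.values():
            labels = [phenotype[i] for i in members]
            rng.shuffle(labels)
            for i, label in zip(members, labels):
                shuffled[i] = label
        # Small tolerance guards against floating-point rounding in the comparison.
        if abs(mean_difference(activity, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 3 clades of 4 species, predicted activity of one enhancer.
activity = [0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.3, 0.2, 0.8, 0.6, 0.1, 0.3]
clades = ["a"] * 4 + ["b"] * 4 + ["c"] * 4

# Trait evolved repeatedly: present in two species of every clade.
convergent = [1, 1, 0, 0] * 3
# Trait confined to one clade: fully confounded with ancestry.
confounded = [1] * 4 + [0] * 8

p_convergent = clade_permutation_p(activity, convergent, clades)
p_confounded = clade_permutation_p(activity, confounded, clades)
print(f"convergent trait p = {p_convergent:.4f}")
print(f"clade-confounded trait p = {p_confounded:.4f}")
```

In this toy setup, a trait that arose independently in several clades can reach a small p-value, while a trait confined to a single clade cannot, because shuffling within clades never changes its labels. A full phylogenetic model would use branch lengths rather than discrete clade labels.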

CONCLUSION
TACIT leverages predicted enhancer activity conservation rather than nucleotide-level conservation to connect genetic sequence differences between species to phenotypes across large numbers of mammals. TACIT can be applied to any phenotype with enhancer activity data available from at least a few species in a relevant tissue or cell type and a whole-genome alignment available across dozens of species with substantial phenotypic variation. Although we developed TACIT for transcriptional enhancers, it could also be applied to genomic regions involved in other components of gene regulation, such as promoters and splicing enhancers and silencers. As the number of sequenced genomes grows, machine learning approaches such as TACIT have the potential to help make sense of how conservation of, or changes in, subtle genome patterns can help explain phenotype evolution.
