2025-09-26 Chinese Academy of Sciences (CAS)
<Related information>
- https://english.cas.cn/newsroom/research_news/life/202509/t20250928_1055768.shtml
- https://www.pnas.org/doi/10.1073/pnas.2418254122
Language models reveal a complex sequence basis for adaptive convergent evolution of protein functions
Zhenqiu Cao, Hongjiu Zhang, and Zhengting Zou
Proceedings of the National Academy of Sciences September 23, 2025
DOI:https://doi.org/10.1073/pnas.2418254122
Significance
In biology, repeated emergence of the same functional trait in evolution is important as it provides opportunity to decode the relations between genome or protein sequences to specific functions. Such functional convergence has been largely linked to sequence convergence at the level of single sites, because conventional methods cannot measure similarity of high-order features of sequences. This study reveals that the recent protein language models can extract embeddings from protein sequences reflecting high-order features, and develops statistical tests to evaluate the adaptive convergence of such features. The findings emphasize an underrated sequence basis for functional trait convergence in evolution, provide corresponding detection framework, and demonstrate potential power of deep learning in investigating the complex sequence–function mapping in evolutionary biology.
Abstract
Convergent evolution, or convergence, refers to repeated, independent emergences of the same trait in two or more lineages of species during evolution, often indicating functional adaptation to specific environmental factors. Many computational methods have been proposed to investigate the genetic basis for organismal functional convergence, as an important way to decode the complex sequence–function map of proteins. These methods mostly focus on the convergence of amino acid states at the level of individual sites in functionally related proteins. However, even without site-level sequence similarity, protein function similarity may also stem from convergence of high-order protein features, which cannot be captured by the conventional methods. To fill this gap, we first derived numerical embeddings from protein sequences by pretrained protein language models (PLM). In four previously reported cases, we found that functionally convergent proteins have similar embeddings despite no site-level convergence, indicating that PLM embeddings can reflect convergence of high-order protein features. We then designed a pipeline to detect Adaptive Convergence by Embedding of Protein (ACEP). ACEP tests were significant on known and additional candidate genes with putative adaptive convergence like echolocation and crassulacean acid metabolism. Genome-wide application showed that the ACEP framework can effectively enrich such candidates. Relations between convergences of PLM embeddings and specific protein physicochemical features were further examined. In conclusion, PLM embeddings can indicate adaptive convergence of high-order protein features beyond site identities, demonstrating the power of deep learning tools for investigating the complex mapping between molecular sequences and functions.
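The core idea in the abstract, comparing protein-level embeddings rather than individual sites, can be illustrated with a small sketch. This is not the ACEP pipeline, whose statistical details the abstract does not give. Mean pooling, cosine similarity, and the empirical p-value against a background of unrelated pairs are all assumptions chosen for illustration, and the vectors are toy numbers standing in for real PLM output.

```python
import math
import random


def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def empirical_p_value(observed, background):
    """Share of background similarities at least as large as the observed one.

    The +1 correction keeps the estimate above zero for a finite background.
    """
    hits = sum(1 for s in background if s >= observed)
    return (hits + 1) / (len(background) + 1)


# Toy per-residue embeddings for two hypothetical proteins from
# independently evolved lineages (assumed values, not real PLM output).
protein_a = mean_pool([[0.9, 0.1, 0.0], [1.1, -0.1, 0.2]])
protein_b = mean_pool([[1.0, 0.2, 0.1], [0.8, 0.0, 0.1]])
observed = cosine_similarity(protein_a, protein_b)

# Hypothetical background: similarities between random vector pairs,
# standing in for protein pairs with no expected convergence.
rng = random.Random(0)
background = []
for _ in range(200):
    u = [rng.gauss(0.0, 1.0) for _ in range(3)]
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    background.append(cosine_similarity(u, v))

p_value = empirical_p_value(observed, background)
print(f"observed similarity: {observed:.3f}, empirical p-value: {p_value:.3f}")
```

In practice the per-residue vectors would come from a pretrained protein language model, and the background would need to account for phylogenetic relatedness, which this sketch ignores.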


