Streamlining protein function prediction: augmenting training data by combining molecular simulation and protein language models

2025-10-23

2025-10-22 National Institute of Advanced Industrial Science and Technology (AIST)

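The National Institute of Advanced Industrial Science and Technology (AIST) has developed an AI method that integrates molecular simulation with protein language models to predict protein functional values with high accuracy even from a small amount of experimental data. Conventional approaches required large amounts of experimental data; the new method achieves data augmentation by using computationally obtained functional values as "pseudo training data". The researchers showed that it can be applied to predicting a range of functions, including binding affinity, enzymatic activity, cytotoxicity, and fluorescence intensity. This is expected to greatly improve the efficiency of developing functional proteins, with applications in drug discovery, industrial enzymes, and biomaterial design. The research was carried out jointly by AIST's Artificial Intelligence Research Center and the University of Tokyo, and was published in Briefings in Bioinformatics (October 2025). The method supports the practical use of AI-driven biomolecular design and is attracting attention as a new foundational technology for protein engineering.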

Data augmentation improves the accuracy of protein functional value prediction

<Related information>

Data-efficient protein mutational effect prediction with weak supervision by molecular simulation and protein language models

Teppei Deguchi, Nur Syatila Ab Ghani, Yoichi Kurumida, Shinji Iida, Kaito Kobayashi, Yutaka Saito
Briefings in Bioinformatics Published: 10 October 2025
DOI: https://doi.org/10.1093/bib/bbaf536

Abstract

Machine learning-based protein mutational effect prediction is widely used in protein engineering and pathogenicity prediction, but training data scarcity remains a major challenge due to high costs of experimental measurements. A previous study proposed data augmentation using computational estimates by molecular simulation. However, this approach has been limited to predicting mutational effects on thermostability. Here, we present a new data augmentation method that combines molecular simulation with zero-shot prediction computed by protein language models. These computational estimates serve as ‘weak’ training data to supplement experimental training data. Our method dynamically adjusts the weight and inclusion of weak training data based on available experimental training data. This reduces potential negative impacts of weak training data while extending applicability to diverse protein properties such as binding affinity and enzymatic activity. Benchmark tests demonstrate that our method improves prediction accuracy particularly when experimental training data are scarce. These results indicate the capability of our approach to advance protein engineering and pathogenicity prediction in small data regimes.
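The abstract says that the weight and inclusion of weak training data are adjusted according to the available experimental data, but it does not give the rule. The sketch below is one way to illustrate the idea and is not the authors' implementation. It assumes a weighted ridge regression, random vectors standing in for sequence features, and a weak-data weight chosen by cross-validation on the experimental data alone, where a weight of 0.0 means the weak data are excluded. The names `fit_weighted_ridge` and `select_weak_weight` and the candidate weights are hypothetical.

```python
import numpy as np


def fit_weighted_ridge(X, y, sample_weight, alpha=1.0):
    """Closed-form ridge regression with per-sample weights (no intercept)."""
    W = sample_weight[:, None]
    A = X.T @ (W * X) + alpha * np.eye(X.shape[1])
    b = X.T @ (sample_weight * y)
    return np.linalg.solve(A, b)


def select_weak_weight(X_exp, y_exp, X_weak, y_weak,
                       candidates=(0.0, 0.1, 0.3, 1.0), n_folds=5, seed=0):
    """Choose the weak-data weight by cross-validation on experimental data.

    Assumption for illustration: validation error is measured only on
    held-out experimental samples, and a weight of 0.0 drops weak data.
    """
    n_exp = len(y_exp)
    n_folds = min(n_folds, n_exp)  # avoid empty folds when data are scarce
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_exp), n_folds)

    best_w, best_err = 0.0, np.inf
    for w in candidates:
        fold_errors = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(n_exp), held_out)
            # Experimental samples get weight 1, weak samples get weight w.
            X = np.vstack([X_exp[train], X_weak])
            y = np.concatenate([y_exp[train], y_weak])
            sw = np.concatenate([np.ones(len(train)),
                                 np.full(len(y_weak), float(w))])
            coef = fit_weighted_ridge(X, y, sw)
            pred = X_exp[held_out] @ coef
            fold_errors.append(np.mean((pred - y_exp[held_out]) ** 2))
        err = float(np.mean(fold_errors))
        if err < best_err:
            best_w, best_err = w, err
    return best_w


# Synthetic stand-in data: few accurate "experimental" labels and many
# noisier "computational" labels generated from the same linear model.
rng = np.random.default_rng(0)
true_coef = rng.normal(size=8)
X_exp = rng.normal(size=(12, 8))
y_exp = X_exp @ true_coef + 0.1 * rng.normal(size=12)
X_weak = rng.normal(size=(200, 8))
y_weak = X_weak @ true_coef + 1.0 * rng.normal(size=200)

w = select_weak_weight(X_exp, y_exp, X_weak, y_weak)
print("selected weak-data weight:", w)
```

Because the weight is scored only on held-out experimental samples, weak data that do not help prediction can be down-weighted or dropped, which mirrors the abstract's stated goal of reducing potential negative impacts of weak training data.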
