コンピュータ科学が生物学に貢献する方法(Q&A: How does computer science advance biology?)

ad

2025-06-24 ペンシルベニア州立大学(PennState)

ペンシルベニア州立大学のShao教授とKoslicki教授は、コンピュータサイエンスが生物学の発展にどのように貢献しているかを解説。彼らの研究は、DNA/RNAの解析、遺伝子発現、進化に伴うゲノム再構成の理解などに焦点を当てている。Shao教授のチームは、誤りに強い繰り返し配列検出ツール「EquiRep」や、SATソルバーを応用したゲノム変化計測法を開発。一方、Koslicki教授のチームは、k-merの適切なサイズ選択を効率化する「Prokrusteanグラフ」を発表。これらのツールは、個別化医療の進展、疾患診断、感染症追跡などに応用可能であり、現代医学に不可欠な計算基盤となる。特筆すべきは、修士・博士課程の学生たちがこれら研究に第一著者として貢献している点である。

<関連情報>

プロクルステン・グラフ K-merサイズ解析のための部分文字列インデックス Prokrustean Graph: A substring index for rapid k-mer size analysis

Adam Park, David Koslicki
bioRxiv  Posted: December 20, 2024
DOI:https://doi.org/10.1101/2023.11.21.568151

Abstract

Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k-mer-based object like Jaccard Similarity, de Bruijn graphs, k-mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive.

This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1, …, 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set.

Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties at exploring varying k-mer sizes due to their limitations at grouping co-occurring substrings.

We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics. The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.

 

DCJ距離の厳密かつ高速なSAT定式化 An Exact and Fast SAT Formulation for the DCJ Distance

Aaryan M. Sarnaik,Chen, Austin Diaz, Mingfu Shao
bioRxiv  Posted: November 08, 2024.
DOI:https://doi.org/10.1101/2024.11.05.622153

Abstract

Reducing into a satisfiability (SAT) formulation has been proven effective in solving certain NP-hard problems. In this work, we extend this research by presenting a novel SAT formulation for computing the double-cut-and-join (DCJ) distance between two genomes with duplicate genes. The DCJ distance serves as a crucial metric in studying genome rearrangement. Previous work achieved an exact solution by transforming it into a maximum cycle decomposition (MCD) problem on the corresponding adjacency graph of two genomes, followed by reducing this problem into an integer linear programming (ILP) formulation. Using both simulated datasets and real genomic datasets, we firmly conclude that the SAT-based method scales much better and runs faster than using ILP, being able to solve a whole new range of instances which the ILP-based method cannot solve in a reasonable amount of time. This underscores the SAT-based approach as a versatile and more powerful alternative to ILP, with promising implications for broader applications in computational biology.

生物工学一般
ad
ad
Follow
ad
タイトルとURLをコピーしました