2025-06-24 ペンシルベニア州立大学(PennState)
<関連情報>
- https://www.psu.edu/news/engineering/story/qa-how-does-computer-science-advance-biology
- https://www.biorxiv.org/content/10.1101/2023.11.21.568151v7.article-info
- https://www.biorxiv.org/content/10.1101/2024.11.05.622153v1.article-info
プロクルステン・グラフ K-merサイズ解析のための部分文字列インデックス Prokrustean Graph: A substring index for rapid k-mer size analysis
Adam Park, David Koslicki
bioRxiv Posted: December 20, 2024
DOI:https://doi.org/10.1101/2023.11.21.568151
Abstract
Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k-mer-based object like Jaccard Similarity, de Bruijn graphs, k-mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive.
This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1, …, 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set.
Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties at exploring varying k-mer sizes due to their limitations at grouping co-occurring substrings.
We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics. The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.
DCJ距離の厳密かつ高速なSAT定式化 An Exact and Fast SAT Formulation for the DCJ Distance
Aaryan M. Sarnaik,Chen, Austin Diaz, Mingfu Shao
bioRxiv Posted: November 08, 2024.
DOI:https://doi.org/10.1101/2024.11.05.622153
Abstract
Reducing into a satisfiability (SAT) formulation has been proven effective in solving certain NP-hard problems. In this work, we extend this research by presenting a novel SAT formulation for computing the double-cut-and-join (DCJ) distance between two genomes with duplicate genes. The DCJ distance serves as a crucial metric in studying genome rearrangement. Previous work achieved an exact solution by transforming it into a maximum cycle decomposition (MCD) problem on the corresponding adjacency graph of two genomes, followed by reducing this problem into an integer linear programming (ILP) formulation. Using both simulated datasets and real genomic datasets, we firmly conclude that the SAT-based method scales much better and runs faster than using ILP, being able to solve a whole new range of instances which the ILP-based method cannot solve in a reasonable amount of time. This underscores the SAT-based approach as a versatile and more powerful alternative to ILP, with promising implications for broader applications in computational biology.


