Bioinformatics data reduction techniques must be used with caution

2022-07-02

2022-07-01 Pennsylvania State University (Penn State)

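In bioinformatics, DNA can be analyzed using data sketching, a technique that systematically reduces the size of a data set so that scientists can analyze and approximate it more quickly. The scalability of this approach is attractive, but a research team at Pennsylvania State University has shown that two common tools used for data sketching can introduce inaccuracies and inconsistencies into analyses and results.
In Oxford Bioinformatics (published June 27), the researchers found that the minimizer Jaccard estimator is biased and inconsistent: no matter how many data points are placed in the sketch, the estimate of the divergence between two genomes remains inaccurate.
In genome sketching, data scientists extract a small but representative set of data points, called k-mers, to form a sketch that can be used to estimate the divergence between two genome sequences. Ideally, the estimated divergence is close to the true divergence and comes with a confidence interval within acceptable bounds. Contrary to a common assumption in the field, the researchers found that some sketching strategies used in bioinformatics do not meet these goals.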
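The researchers simulated and analyzed E. coli genomes, comparing minimizer Jaccard estimates for substrings of the E. coli data against true values calculated by hand, and examined where smaller substrings belong within the sequence. The results showed that this method may fail to find the correct location of a read within a large genome.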
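The kind of comparison described above can be illustrated with a toy example. The following Python sketch is a minimal illustration, not the researchers' code: it computes the true k-mer Jaccard similarity of two sequences and the value obtained by taking the Jaccard similarity of their minimizer sketches. The function names, the SHA-256 ordering of k-mers, the random test sequence, and the values of `k`, `w`, and the mutation rate are all assumptions chosen for illustration.

```python
import hashlib
import random


def kmers(seq, k):
    """Return all k-mers of seq in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def order(kmer):
    """Hypothetical k-mer ordering: compare k-mers by their SHA-256 digest."""
    return hashlib.sha256(kmer.encode()).digest()


def minimizer_sketch(seq, k, w):
    """Set of k-mers that are smallest under `order` in some window of w consecutive k-mers."""
    km = kmers(seq, k)
    return {min(km[i:i + w], key=order) for i in range(len(km) - w + 1)}


def mutate(seq, r, rng):
    """Substitute each nucleotide independently with probability r."""
    out = []
    for c in seq:
        if rng.random() < r:
            out.append(rng.choice([x for x in "ACGT" if x != c]))
        else:
            out.append(c)
    return "".join(out)


# Toy data: a random sequence and a mutated copy (assumed parameters).
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(5000))
b = mutate(a, 0.05, rng)
k, w = 15, 10

true_j = jaccard(kmers(a, k), kmers(b, k))
est_j = jaccard(minimizer_sketch(a, k, w), minimizer_sketch(b, k, w))
print(f"true Jaccard: {true_j:.4f}  minimizer Jaccard estimate: {est_j:.4f}")
```

On a single random sequence the two numbers will not necessarily differ by much. According to the paper's abstract, the size of the bias depends on how the shared k-mers are laid out along the sequences.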
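In the Journal of Computational Biology paper, the researchers examine whether the MinHash estimator, another technique commonly used for data sketching, is valid in genomic studies. In this work, they calculated the statistical properties of sketch data points affected by evolution.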

<Related information>

The minimizer Jaccard estimator is biased and inconsistent

Mahdi Belbasi, Antonio Blanca, Robert S Harris, David Koslicki, Paul Medvedev
Oxford Bioinformatics Published: 27 June 2022
DOI:https://doi.org/10.1093/bioinformatics/btac244

Abstract

Motivation
Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences.

Results
We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool.

Availability and implementation
Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce.

Supplementary information
Supplementary data are available at Bioinformatics online.

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Antonio Blanca, Robert S. Harris, David Koslicki, and Paul Medvedev
Journal of Computational Biology Published: 16 Feb 2022
DOI:https://doi.org/10.1089/cmb.2021.0431

Abstract

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
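
The mutation model in this abstract lends itself to a small worked example. Under the stated assumptions (each nucleotide mutates independently with probability r, and there are no spurious k-mer matches), a k-mer is mutated whenever at least one of its k nucleotides is, so the expected fraction of mutated k-mers is 1 - (1 - r)^k. The Python sketch below computes that expectation, inverts it into a simple moment-based point estimate of r, and checks both against a simulation. The function names and parameter values are hypothetical, and the paper's variance formulas, hypothesis tests, and confidence intervals are not reproduced here.

```python
import random


def expected_mutated_kmers(num_kmers, k, r):
    """Expected number of mutated k-mers: num_kmers * (1 - (1 - r)**k)."""
    return num_kmers * (1.0 - (1.0 - r) ** k)


def estimate_r(num_mutated, num_kmers, k):
    """Point estimate of r obtained by inverting the expectation above."""
    frac = num_mutated / num_kmers
    return 1.0 - (1.0 - frac) ** (1.0 / k)


def simulate_mutated_kmers(num_kmers, k, r, rng):
    """Mutate each position independently with probability r and count
    the k-mers that contain at least one mutated position."""
    length = num_kmers + k - 1  # a sequence of this length has num_kmers k-mers
    mutated = [rng.random() < r for _ in range(length)]
    return sum(any(mutated[i:i + k]) for i in range(num_kmers))


# Assumed example parameters, chosen only for illustration.
rng = random.Random(42)
num_kmers, k, r = 10_000, 21, 0.02

observed = simulate_mutated_kmers(num_kmers, k, r, rng)
print("expected mutated k-mers:", round(expected_mutated_kmers(num_kmers, k, r), 1))
print("simulated mutated k-mers:", observed)
print("point estimate of r:", round(estimate_r(observed, num_kmers, k), 4))
```

Because neighboring k-mers overlap, their mutation states are strongly correlated, so the simulated count varies more between runs than it would for independent trials. That dependence is why the paper derives the variance explicitly before building confidence intervals.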
