De Novo Structure Prediction from Tandem Mass Spectra: Algorithms, Benchmarks, and Limitations

Schneider, M.Y.; Kholmanskikh, D.D.; Romanov, K.Y.; Perekina, E.A.; Nikolenko, S.A.; Lukin, R.Y.; Golov, I.V.

Authors: Schneider Mark Yu, Kholmanskikh Daniil D., Romanov Kirill Ya, Perekina Elena A., Nikolenko Sergei A., Lukin Ruslan Yu, Golov Ivan V.
Journal: Molecules
Volume: 31
Issue: 5
Year of publication: 2026
Publisher: MDPI
Publisher location: Basel, Switzerland
First page: 769
DOI: 10.3390/molecules31050769
Abstract: The identification of unknown molecules from analytical data remains a fundamental challenge in chemistry, with critical implications for drug discovery, metabolomics, and natural product research. While tandem mass spectrometry provides rich structural fingerprints, most spectra are absent from reference libraries, spurring the development of de novo generative models. However, their true accuracy has been difficult to assess. Our critical analysis reveals that state-of-the-art models achieve only 4.1% top-10 accuracy on rigorously leakage-controlled benchmarks such as MassSpecGym. This sobering figure stands in stark contrast to earlier, overly optimistic reports, a discrepancy we attribute to pervasive data leakage in naive data splits. This review traces the field’s rapid evolution through three architectural eras: from fingerprint-conditioned RNN pipelines to end-to-end sequence models and, most recently, to graph-native diffusion under molecular-formula constraints. We demonstrate that explicitly conditioning generative models on a molecular formula significantly improves exact-match accuracy compared to unconstrained baselines. Crucially, our analysis distinguishes between two experimentally relevant paradigms: formula-conditioned generation for true unknown discovery and scaffold-based generation for hypothesis-driven research. While the latter shows high potential with oracle scaffolds, its performance drops drastically with predicted ones, revealing a critical bottleneck. To build the next generation of reliable tools, we propose a clear roadmap centered on standardized, leakage-aware benchmarking and transparent reporting.
Added to the system by: Schneider Mark Yuryevich
