De Novo Structure Prediction from Tandem Mass Spectra: Algorithms, Benchmarks, and Limitations
Article
Citation information for this article was retrieved from Scopus.
The article is published in a journal indexed in Web of Science and/or Scopus.
Date of the last search for the article in external sources: April 1, 2026.
Abstract: The identification of unknown molecules from analytical data remains a fundamental challenge in chemistry, with critical implications for drug discovery, metabolomics, and natural product research. While tandem mass spectrometry provides rich structural fingerprints, most spectra are absent from reference libraries, spurring the development of de novo generative models. However, their true accuracy has been difficult to assess. Our critical analysis reveals that state-of-the-art models achieve only 4.1% top-10 accuracy on rigorously leakage-controlled benchmarks such as MassSpecGym. This sobering figure stands in stark contrast to earlier, overly optimistic reports, a discrepancy we attribute to pervasive data leakage in naive data splits. This review traces the field’s rapid evolution through three architectural eras: from fingerprint-conditioned RNN pipelines to end-to-end sequence models and, most recently, to graph-native diffusion under molecular-formula constraints. We demonstrate that explicitly conditioning generative models on a molecular formula significantly improves exact-match accuracy compared to unconstrained baselines. Crucially, our analysis distinguishes between two experimentally relevant paradigms: formula-conditioned generation for true unknown discovery and scaffold-based generation for hypothesis-driven research. While the latter shows high potential with oracle scaffolds, its performance drops drastically with predicted ones, revealing a critical bottleneck. To build the next generation of reliable tools, we propose a clear roadmap centered on standardized, leakage-aware benchmarking and transparent reporting.
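To illustrate how a top-10 exact-match figure of the kind quoted in the abstract is typically computed, the following is a minimal sketch rather than the evaluation code of the article or of MassSpecGym. It assumes that every candidate and every ground-truth structure has already been reduced to a canonical string identifier (for example, a canonical SMILES or the connectivity block of an InChIKey), so that an exact match is a plain string comparison. The function name `top_k_accuracy` and the placeholder identifiers are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of spectra whose true structure is among the first k candidates.

    Assumption: identifiers are pre-canonicalized strings, so equality of
    strings stands in for equality of molecular structures.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1
        for candidates, truth in zip(predictions, targets)
        if truth in candidates[:k]
    )
    return hits / len(targets)


# Placeholder identifiers, not real molecules: one ranked candidate list per spectrum.
preds = [["AAA", "BBB"], ["CCC", "DDD"], ["EEE"], []]
truth = ["BBB", "XXX", "EEE", "YYY"]

print(top_k_accuracy(preds, truth, k=10))  # 0.5
print(top_k_accuracy(preds, truth, k=1))   # 0.25
```

Under this definition a spectrum with no candidates counts as a miss, so the metric penalizes models that fail to produce any valid structure. Whether a leakage-controlled split is used is decided upstream, when the evaluation set is assembled, and is not visible in the metric itself.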