Αlpha¹ Review in Progress

RECORD ID: 60FF2AAB
Peer-Reviewed Manuscript

vir2vec: A Genome-Wide Viral Embedding

Authors

Rancati, S.; Arozarena Donelli, P.; Nicora, G.; Bergomi, L.; Buonocore, T.; Sy, M. A.; Pandey, S.; Prosperi, M.; Salemi, M.; Bellazzi, R.; Boucher, C.; Parimbelli, E.; Marini, S.

Abstract

Genomic language models (gLMs) have recently emerged as powerful numerical surrogates for DNA, but existing architectures are largely focused on human DNA or trained on limited viral references, and no dedicated benchmark currently exists for viral genome understanding. Here we introduce vir2vec, a 422-million-parameter, decoder-only gLM obtained by continual pretraining of Mistral-DNA on a rigorously curated pan-viral corpus of 565,747 complete genomes spanning 295 viral species. vir2vec operates on byte-pair-encoded DNA subwords and exposes fixed-length genome-level embeddings that are reused across tasks. Additionally, we present vGUE, a unified benchmark for viral representation learning. In vGUE, we precompute vir2vec embeddings and feed them to simple classifiers (logistic regression, support vector machines, random forests) trained under nested cross-validation, to quantify how well the embeddings capture biologically motivated axes of viral variation. Using this framework, vGUE assesses viral genomic prediction tasks across five axes: (i) organism-level discrimination (virus vs non-virus genomes and reads), (ii) genome-wide evolutionary fingerprints (DNA vs RNA viruses, host-range prediction), (iii) intragenus species separation (HIV-1 vs HIV-2), (iv) fine-grained variant and subtype typing (SARS-CoV-2 lineages), and (v) phenotypic context signal detection (HIV-1 brain vs plasma tropism). vir2vec attains the highest balanced accuracy on seven of eight heterogeneous classification tasks, consistently outperforming both a human-trained genomic foundation model and a viral-specific one. By coupling a pan-viral gLM with a standardized evaluation suite, vir2vec and vGUE provide an open foundation for future viral genomic models, surveillance tools, and discovery pipelines.
vir2vec is released as a controlled-access resource with the understanding that it is designed for discriminative embedding and classification tasks, not for sequence generation; responsible deployment of viral genomic models requires consideration of dual-use implications and appropriate ethical and governance oversight.
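
The evaluation protocol described in the abstract (precomputed embeddings, a simple classifier, nested cross-validation, balanced accuracy) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `nested_cv_balanced_accuracy`, the fold counts, the hyperparameter grid, the feature scaling step, and the 32-dimensional synthetic features standing in for vir2vec embeddings are all assumptions made for illustration, and only logistic regression is shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def nested_cv_balanced_accuracy(X, y, seed=0):
    """Return one balanced-accuracy score per outer fold (hypothetical helper)."""
    # Fold counts are illustrative assumptions, not values from the manuscript.
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + 1)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    # Inner loop: pick the regularization strength using training folds only.
    search = GridSearchCV(
        model,
        param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
        scoring="balanced_accuracy",
        cv=inner,
    )

    # Outer loop: score the tuned model on folds it never saw during tuning.
    return cross_val_score(search, X, y, scoring="balanced_accuracy", cv=outer)


# Synthetic, class-imbalanced features standing in for precomputed
# genome embeddings. These are NOT real vir2vec outputs.
X, y = make_classification(
    n_samples=200,
    n_features=32,
    n_informative=8,
    weights=[0.8, 0.2],
    random_state=0,
)

scores = nested_cv_balanced_accuracy(X, y)
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Balanced accuracy is the mean of per-class recall, so a classifier that always predicts the majority class scores 0.5 on a binary task regardless of class imbalance. Keeping hyperparameter selection inside the inner loop keeps the outer-fold scores free of tuning leakage.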
