Αlpha¹ Review in Progress

RECORD ID: 3369A97B

Peer-Reviewed Manuscript

Benchmarking datasets for machine learning in protein function prediction

Authors

Jang, Y.; Qin, Q.-Q.; Wang, J.-L.; Kornmann, B.

DOI: 10.64898/2025.12.17.694800

Abstract

Machine learning has made remarkable progress, particularly in the accurate prediction of protein tertiary structures. Despite these advances, accurately annotating protein function with machine learning remains challenging, primarily because large-scale benchmarking data are scarce. In this study, we address this gap by systematically screening proteins in the UniProt database for functional annotations and assembling a benchmarking dataset of protein sequences paired with their annotations. The resulting Protein Annotation Dataset (PAD) can be used to train a wide range of machine learning models that assign functional annotations to previously unlabeled proteins. The dataset comprises four categories of functional annotation based on Enzyme Commission (EC) numbers and Gene Ontology (GO) terms, and it is partitioned into training, validation, and test subsets. We also include an independent set from 12 diverse species to support the development and evaluation of new machine learning models.
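The abstract does not say how the training, validation, and test subsets were drawn, so the following is only a minimal sketch of one common approach: a deterministic, hash-based split keyed on a protein accession. The record layout (`accession`, `sequence`, `ec`, `go`), the 80/10/10 fractions, and the function names `assign_split` and `partition` are all assumptions made for illustration and are not taken from PAD.

```python
import hashlib


def assign_split(accession: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically map an accession to 'train', 'validation', or 'test'.

    Hypothetical helper: the fractions are illustrative, not PAD's.
    """
    digest = hashlib.sha256(accession.encode("utf-8")).digest()
    # Read the first 4 bytes as an integer and scale it into [0, 1).
    u = int.from_bytes(digest[:4], "big") / 2**32
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "validation"
    return "train"


def partition(records, val_frac: float = 0.1, test_frac: float = 0.1):
    """Group records (dicts with an 'accession' key) by their assigned split."""
    splits = {"train": [], "validation": [], "test": []}
    for rec in records:
        name = assign_split(rec["accession"], val_frac, test_frac)
        splits[name].append(rec)
    return splits


# Toy records with made-up accessions; the field names are assumptions.
records = [
    {
        "accession": f"TOY{i:05d}",
        "sequence": "MKT",
        "ec": ["1.1.1.1"],
        "go": ["GO:0003824"],
    }
    for i in range(1000)
]

splits = partition(records)
for name, members in splits.items():
    print(name, len(members))
```

Because the assignment depends only on the accession, rerunning the split on an updated download leaves previously assigned proteins in the same subset. A purely random or hash-based split can still place near-identical sequences in both training and test data, so protein benchmarks often cluster sequences by similarity first and assign whole clusters to a subset; whether PAD does this is not stated here.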

Peer Reviews

Peer review in progress...
