Αlpha¹ Review in Progress
Benchmarking datasets for machine learning in protein function prediction
Jang, Y.; Qin, Q.-Q.; Wang, J.-L.; Kornmann, B.
Machine learning has achieved remarkable progress, particularly in the accurate prediction of protein tertiary structures. Despite these advances, annotating protein functions accurately with machine learning remains challenging, primarily because large-scale benchmarking data are scarce. In this study, we addressed this gap by systematically screening proteins in the UniProt database for functional annotations, producing a benchmarking dataset of protein sequences and their corresponding annotations. The resulting Protein Annotation Dataset (PAD) can be used to train a wide range of machine learning models that assign functional annotations to previously unlabeled proteins. The dataset comprises four categories of functional annotations based on Enzyme Commission (EC) numbers and Gene Ontology (GO) terms, and it is partitioned into training, validation, and test subsets. We also incorporated an independent set from 12 diverse species to support the development and evaluation of new machine learning models.
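As an illustration of the partitioning step, the following is a minimal sketch of a seeded random split into training, validation, and test subsets. It is not the authors' procedure: the `ProteinRecord` fields, the `split_records` helper, and the 80/10/10 ratios are all hypothetical, since the abstract does not state how PAD was split.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ProteinRecord:
    """Hypothetical record layout: one protein with its annotations."""
    accession: str
    sequence: str
    ec_numbers: list[str] = field(default_factory=list)
    go_terms: list[str] = field(default_factory=list)


def split_records(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle records reproducibly and cut them into three subsets.

    The fractions are illustrative assumptions; whatever remains after
    the training and validation cuts becomes the test subset.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test share")
    shuffled = list(records)               # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Toy records with placeholder sequences and annotations.
    demo = [
        ProteinRecord(f"P{i:05d}", "MKT", ["1.1.1.1"], ["GO:0008150"])
        for i in range(10)
    ]
    train, val, test = split_records(demo)
    print(len(train), len(val), len(test))
```

A purely random split like this one ignores sequence similarity between subsets, so a real benchmark may need a stricter criterion, such as grouping similar sequences before splitting.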