How can we train AI to develop new vaccines and medicines?

Machine learning and artificial intelligence are set to play a significant role in the future of drug development. Recent research has unveiled new insights into how these models should be supervised, using both so-called positive and negative data.

The figure illustrates how antibodies are fed into machine learning models

"Digitized antibody". Illustration image: Rahmad Akbar.

By Elin Martine Doeland, Institute of Clinical Medicine
Published Sep. 15, 2025

Imagine you are developing antibodies: drugs precisely aimed at a target, for example a viral protein or an onco-marker. You test a series of antibodies and find that some work, while others do not.

You would like to continue modifying them and see if you can make them even better. However, you do not want to waste time testing those that certainly will not work. To test only antibodies that might work, you need to filter out those that do not bind to your target before moving on to costly and time-consuming experiments.

AI models can determine which antibodies may work

One way to do this is to train a computational model that can support you in the process. Today, machine learning models are already helping experimental scientists narrow down their search.

Aygul Minnegalieva. Image: Åsne Rambøl Hillestad, UiO

“Moreover, machine learning models, once shown the data, can learn what makes an antibody bind: what features set binders apart from those that do not. Without such models, this is not obvious at all, as it lies beyond human perception and intuition,” says Aygul Minnegalieva, a PhD candidate at the University of Oslo.

She investigates how best to train AI models at the Greiff Lab.

“However, not all machine learning models will do that correctly. Only if models are trained with the right data, we can use them to gain an understanding of biological determinants. For example, what makes an antibody a binder,” she explains.

Minnegalieva and colleagues have recently published a study on this in Nature Machine Intelligence.

Teaching the models to recognise which antibodies bind to a pathogen

“One approach to accomplish this is to present the models with examples of both correct and incorrect responses regarding what we want them to recognise,” explains the PhD candidate.

Such incorrect examples or errors are referred to as negative data, while the correct examples are classified as positive data.

In the latest study, Minnegalieva and her colleagues found that the negative data shown to the models must be sufficiently challenging: the incorrect examples have to be hard for the models to tell apart from the correct ones.

“We need to show the models incorrect examples that closely resemble the correct ones. This way, the models learn more effectively,” Minnegalieva points out.
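One rough way to make "closely resemble" concrete is to measure how many positions separate each negative example from its nearest positive example. The sketch below is only an illustration: the sequences are invented toy strings, the helper names `hamming` and `mean_nearest_distance` are hypothetical, and Hamming distance is an assumed stand-in for similarity rather than the measure used in the study.

```python
from statistics import mean


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_nearest_distance(negatives, positives):
    """Average distance from each negative to its closest positive."""
    return mean(min(hamming(n, p) for p in positives) for n in negatives)


# Hypothetical toy sequences, not data from the study.
positives = ["CARWY", "CARWF"]          # assumed binders
easy_negatives = ["GSTLK", "DEQNP"]     # unrelated sequences
hard_negatives = ["CARLY", "CARAF"]     # one residue away from a binder

print("easy negatives:", mean_nearest_distance(easy_negatives, positives))
print("hard negatives:", mean_nearest_distance(hard_negatives, positives))
```

In this toy setup the easy negatives differ from every positive at all five positions, while each hard negative differs from its nearest positive at a single position. A lower value means a more challenging negative set.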

The AI models became better at reasoning

Specifically, the researchers presented the models with negative data consisting of antibodies that still bind to target proteins, for instance in a virus, but do so suboptimally.

“In this manner, the models improved their ability to accurately tell apart antibodies that would be effective in combating a pathogen from those that would not,” she explains.

Most importantly, this method enabled the models to capture the underlying sequence determinants in antibodies that help them bind to a protein in a pathogen.

“Those determinants made more biological sense,” Minnegalieva states, and continues:

“Essentially, the models became better at reasoning.”
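A toy calculation can illustrate why harder negatives point a model towards the real determinants. The sketch below does not train a model and is not the study's method. It assumes invented sequences in which a hypothetical "W" at position 3 is what drives binding, and it scores each position by the total variation distance between the residue frequencies of the two classes (0 means the classes look identical at that position, 1 means they share no residues). The function names are hypothetical.

```python
from collections import Counter


def residue_frequencies(sequences, position):
    """Relative frequency of each residue at one position."""
    counts = Counter(seq[position] for seq in sequences)
    total = len(sequences)
    return {residue: n / total for residue, n in counts.items()}


def position_scores(positives, negatives):
    """Total variation distance between the classes at each position."""
    scores = []
    for i in range(len(positives[0])):
        p = residue_frequencies(positives, i)
        q = residue_frequencies(negatives, i)
        residues = set(p) | set(q)
        scores.append(0.5 * sum(abs(p.get(r, 0.0) - q.get(r, 0.0)) for r in residues))
    return scores


# Hypothetical toy sequences; "W" at position 3 is the assumed determinant.
positives = ["CARWY", "CARWF", "CARWY", "CARWH"]
easy_negatives = ["GSTLK", "DEQNP", "MIVLA", "GTSKE"]   # unrelated sequences
hard_negatives = ["CARLY", "CARAF", "CARGY", "CARSH"]   # same scaffold, no "W"

print("easy:", position_scores(positives, easy_negatives))
print("hard:", position_scores(positives, hard_negatives))
```

With the easy negatives, every position separates the two classes perfectly, so nothing singles out position 3 and a model could rely on any of them as a shortcut. With the hard negatives, only position 3 carries signal.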

Accelerating the development of antibodies and medicines with AI

Machine learning is increasingly being employed in the development of new medicines, allowing researchers to reduce the number of experimental tests required.

“We can reduce the number of errors when developing new candidates of antibodies or medicines for targeting pathogens or cancer,” says Minnegalieva, continuing:

“The models we use must be both accurate and reliable. They must truly understand what matters from a biological point of view. Only then can we make sound predictions and save time.”

The new study outlines how the models can be trained to better meet these requirements.

The methodology can be applied across multiple fields

Though the study specifically focused on antibodies, the results can be broadly generalised across various fields where machine learning is applied.

“Fields such as language modelling, protein design, and the prediction of molecular properties also depend on the sampling of negative data. All these areas face the risk of models taking shortcuts if the negative examples are too simplistic,” concludes Minnegalieva.

Professor Victor Greiff. Image: Øystein Horgmo, UiO

Professor Victor Greiff, head of the Greiff Lab, also highlights the relevance and potential impact of the study.

"Our work shows that data curation is not a preprocessing step, it’s a scientific choice that encodes assumptions and determines what machine learning can discover. For immunology, drug discovery, and beyond, careful dataset design may be the key to building machine learning models that generalize and reveal true biological principles," Greiff says.

References

Ursu, E., Minnegalieva, A., Rawat, P., Chernigovskaya, M., Tacutu, R., Sandve, G. K., ... & Greiff, V. (2025). Training data composition determines machine learning generalization and biological rule discovery. Nature Machine Intelligence, 7(8), 1206-1219.

Ta, W., & Stokes, J. M. (2025). The importance of negative training data for robust antibody binding prediction: Machine learning. Nature Machine Intelligence, 7(8), 1192-1194. (News & Views article; subscription needed.)

Published Sep. 15, 2025 3:45 PM - Last modified Nov. 6, 2025 3:01 PM