Chemistry-aware AI can generate millions of plausible new molecules

by

Gaby Clark

scientific editor


Robert Egan

associate editor



Molecule created by CoCoGraph. Credit: URV.

Finding and developing new molecules is one of the great research endeavors of modern chemistry. From the development of new drugs to the creation of more sustainable materials, everything depends on finding new combinations of atoms with useful properties. Now, a research team from the Universitat Rovira i Virgili (URV) has developed an artificial intelligence tool capable of generating millions of new molecules which, although still unknown to science, comply with the laws of chemistry and could therefore be realistic possibilities. The research results have been published in the journal Nature Machine Intelligence.

The system, called CoCoGraph, works in a similar way to generative artificial intelligence tools for text or images, such as ChatGPT or DALL-E. "These models create new content that looks very much like the real thing. Our algorithm does the same, but with molecules," explains Roger Guimerà, an ICREA Research Professor in the Department of Chemical Engineering at the URV.

Unlike other AI tools, however, the model does not yet respond to specific instructions. For the moment it simply carries out the more basic task of generating plausible molecules, that is, structures that comply with the rules of chemistry.

Nevertheless, the task is enormous. Even when the system is given just one molecular formula (for example, that of paracetamol), it can construct a vast number of atomic combinations, although only a small fraction of these combinations turns out to be viable in reality.
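To see why a single formula leaves so much freedom, it helps to count what the formula actually fixes. The sketch below is an illustration only, not part of CoCoGraph: it assumes the usual textbook valences (H 1, C 4, N 3, O 2) and neutral atoms, and the function names are hypothetical. For paracetamol, C8H9NO2, the formula fixes the total "bond budget" and the number of rings plus multiple bonds, but says nothing about which atoms are joined to which.

```python
import re

# Textbook valences for neutral atoms; an assumption of this sketch.
STANDARD_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def parse_formula(formula: str) -> dict[str, int]:
    """Turn a simple formula such as 'C8H9NO2' into element counts."""
    counts: dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def bond_budget(counts: dict[str, int]) -> int:
    """Total bond order to distribute: every bond uses one valence on each end."""
    return sum(STANDARD_VALENCE[el] * n for el, n in counts.items()) // 2


def rings_plus_multiple_bonds(counts: dict[str, int]) -> int:
    """Bond order left over after a tree of single bonds connects all atoms."""
    n_atoms = sum(counts.values())
    return bond_budget(counts) - (n_atoms - 1)


paracetamol = parse_formula("C8H9NO2")
print("atoms:", sum(paracetamol.values()))
print("bond budget:", bond_budget(paracetamol))
print("rings + multiple bonds:", rings_plus_multiple_bonds(paracetamol))
```

Every arrangement of these 20 atoms that spends the same budget while respecting each atom's valence is a candidate structure, which is why the number of combinations grows so quickly.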

"The number of possible molecules is immense; it is estimated that there could be up to 10⁶⁰ different ones, which is far more than the number of water molecules in the ocean," explains Guimerà. In contrast, the number of known molecules is only a tiny fraction of this figure. The sheer enormity of the number of possible new molecules means that finding ones that are actually useful is like looking for a needle in a giant haystack.
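The comparison with the ocean can be checked with a rough calculation. The sketch below assumes an approximate ocean mass of 1.4 × 10²¹ kg, a commonly cited figure that does not come from the article, and treats the ocean as pure water.

```python
# Rough order-of-magnitude check. OCEAN_MASS_KG is an assumed, approximate
# figure that is not taken from the article.
AVOGADRO = 6.02214076e23        # molecules per mole (exact SI value)
WATER_MOLAR_MASS_G = 18.015     # grams per mole of H2O
OCEAN_MASS_KG = 1.4e21          # assumed approximate mass of Earth's oceans
CHEMICAL_SPACE_ESTIMATE = 1e60  # upper estimate quoted in the article


def water_molecules_in_ocean(ocean_mass_kg: float = OCEAN_MASS_KG) -> float:
    """Estimate the number of H2O molecules, treating the ocean as pure water."""
    moles = ocean_mass_kg * 1000.0 / WATER_MOLAR_MASS_G
    return moles * AVOGADRO


ocean = water_molecules_in_ocean()
ratio = CHEMICAL_SPACE_ESTIMATE / ocean
print(f"water molecules in the ocean: about {ocean:.1e}")
print(f"possible molecules per ocean water molecule: about {ratio:.1e}")
```

Under these assumptions the ocean holds on the order of 10⁴⁶ water molecules, so the estimated chemical space is larger by a factor of roughly 10¹³.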

How the model works

To generate these new molecules, CoCoGraph uses a diffusion model, a technique common in image generation. The process involves progressively "disordering" a real molecule and training the system to learn how to reconstruct it.

"We start with a real molecule, break the bonds and create new ones at random. The model learns to reverse this process and reconstruct coherent structures," comments Marta Sales-Pardo, a researcher in the Department of Chemical Engineering who also took part in the research.
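The "disordering" step can be pictured with a toy example. The article does not specify the exact noising scheme, so the sketch below assumes one plausible choice: a random bond swap that removes one unit of bond order from two existing bonds and adds it to two new bonds between the same four atoms. All names are hypothetical, and the sketch does not check that the molecule stays in one piece.

```python
import random
from collections import Counter

Bond = tuple[int, int]


def bond(i: int, j: int) -> Bond:
    """Canonical (sorted) key for the bond between atoms i and j."""
    return (i, j) if i < j else (j, i)


def valences(bonds: Counter) -> Counter:
    """Total bond order attached to each atom."""
    total: Counter = Counter()
    for (i, j), order in bonds.items():
        total[i] += order
        total[j] += order
    return total


def swap_step(bonds: Counter, rng: random.Random, max_order: int = 3) -> bool:
    """Try one random swap in place: a-b, c-d becomes a-c, b-d (or a-d, b-c)."""
    present = list(bonds)
    if len(present) < 2:
        return False
    (a, b), (c, d) = rng.sample(present, 2)
    if len({a, b, c, d}) < 4:
        return False  # the two bonds share an atom, so skip this attempt
    if rng.random() < 0.5:
        c, d = d, c
    new1, new2 = bond(a, c), bond(b, d)
    if bonds[new1] >= max_order or bonds[new2] >= max_order:
        return False  # would exceed a triple bond
    for old in (bond(a, b), bond(c, d)):
        bonds[old] -= 1
        if bonds[old] == 0:
            del bonds[old]
    bonds[new1] += 1
    bonds[new2] += 1
    return True


def noise(bonds: Counter, attempts: int, seed: int = 0) -> Counter:
    """Return a scrambled copy of the bond table after some swap attempts."""
    rng = random.Random(seed)
    noisy = Counter(bonds)
    for _ in range(attempts):
        swap_step(noisy, rng)
    return noisy


# Acetic acid with explicit hydrogens: atoms 0-1 are C, 2-3 are O, 4-7 are H.
acetic_acid = Counter({(0, 1): 1, (1, 2): 2, (1, 3): 1,
                       (0, 4): 1, (0, 5): 1, (0, 6): 1, (3, 7): 1})
scrambled = noise(acetic_acid, attempts=50)
print(valences(scrambled) == valences(acetic_acid))
```

Because every swap takes one bond unit from each of four atoms and gives one back to each, the scrambled structure keeps every atom's bond count. A generative model would then be trained to undo such swaps step by step.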

Unlike images, however, molecules are discrete structures, which makes the problem much more complex from a mathematical point of view.

Always-valid molecules

One of the main innovations of the model is that it directly incorporates the basic rules of chemistry. For example, each atom always maintains the correct number of bonds, and this guarantees that 100% of the molecules generated are chemically valid, unlike the impossible structures that can be produced by other models.
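What "the correct number of bonds" means can be shown with a simple validity check. This is a deliberately simplified sketch, not the rule set used in CoCoGraph: it assumes neutral atoms with textbook valences and ignores charges, aromaticity and hypervalent elements. The function name is hypothetical.

```python
# Textbook valences for neutral atoms; an assumption of this sketch.
STANDARD_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def is_valence_valid(elements: list[str],
                     bonds: dict[tuple[int, int], int]) -> bool:
    """True if every atom's summed bond order equals its standard valence."""
    totals = [0] * len(elements)
    for (i, j), order in bonds.items():
        totals[i] += order
        totals[j] += order
    return all(total == STANDARD_VALENCE[element]
               for element, total in zip(elements, totals))


# Formaldehyde, H2C=O: atom 0 is C, atom 1 is O, atoms 2 and 3 are H.
atoms = ["C", "O", "H", "H"]
formaldehyde = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
broken = {(0, 1): 3, (0, 2): 1, (0, 3): 1}  # carbon would carry five bonds

print(is_valence_valid(atoms, formaldehyde))
print(is_valence_valid(atoms, broken))
```

A model that only ever applies moves preserving these per-atom totals never needs such a check after the fact, which is the sense in which validity is built in rather than filtered for.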

Furthermore, the system is more efficient: it uses fewer parameters, requires less computing power and can generate molecules more quickly.

The research team has compared CoCoGraph with other state-of-the-art models and analyzed 36 physicochemical properties of the generated molecules, such as solubility and structural complexity. The result is that, for approximately two-thirds of these properties, the molecules generated are chemically more realistic than those from other models.

Verification by the scientific community

To check how plausible these molecules were, the team conducted an experiment with 121 chemistry experts from the URV itself. Each participant was shown twenty pairs of molecules, one real and one generated by the new AI, and had to identify which was the real one.

The results showed that the experts were wrong in approximately 4 out of 10 cases, meaning they often confused the generated molecules with the real ones. "This means that many of the molecules we generate are very convincing," explains Sales-Pardo.
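A back-of-the-envelope calculation puts the 4-in-10 figure in context. The sketch below assumes that every participant judged all twenty pairs, that the error rate was exactly 40% and that judgments were independent; none of these details is confirmed by the article, so the result is only indicative.

```python
import math

PARTICIPANTS = 121   # from the article
PAIRS_EACH = 20      # from the article
ERROR_RATE = 0.40    # "approximately 4 out of 10", taken here as exact

# Assumption: everyone answered every pair, giving this many judgments.
judgments = PARTICIPANTS * PAIRS_EACH
accuracy = 1.0 - ERROR_RATE

# Standard error of the accuracy if experts were guessing at random (p = 0.5).
standard_error = math.sqrt(0.5 * 0.5 / judgments)
z_score = (accuracy - 0.5) / standard_error

print(f"judgments: {judgments}")
print(f"accuracy: {accuracy:.0%} versus 50% for pure guessing")
print(f"z-score against guessing: {z_score:.1f}")
```

Under these assumptions the experts still did clearly better than a coin flip, yet they were much closer to guessing than to the 100% accuracy one would expect if the generated molecules were easy to spot.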

Although the model cannot yet design molecules with a specific function, promising tests have already been carried out. For example, researchers have identified molecules with properties similar to paracetamol from among the millions generated. They have also explored techniques to partially modify an existing molecule, a kind of chemical "tweak," to create new variants with similar characteristics.

These approaches could be useful in the future for optimizing drugs or developing new materials.


The first step towards an AI that designs bespoke molecules

The research team is clear that this is only the beginning. The main medium- to long-term goal is to be able to ask the artificial intelligence for a molecule with specific properties; for example, for a molecule that is soluble, non-toxic and useful for a specific application.

"For the moment, we are only generating molecules. The next step will be to apply specific objectives to this process," says Manuel Ruiz-Botella, a doctoral student who also participated in the research.

If successful, the technology could transform fields such as chemistry, pharmacology and materials science and accelerate the discovery of new solutions in a chemical universe that is still practically unexplored.

Publication details

Manuel Ruiz-Botella et al., "A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules," Nature Machine Intelligence (2026). DOI: 10.1038/s42256-026-01229-5

Journal information: Nature Machine Intelligence

Provided by University of Rovira i Virgili