The process of discovering molecules that have the properties needed to create new drugs and materials is cumbersome and expensive, consuming vast computational resources and months of human labor to narrow down the enormous space of potential candidates.
Large language models (LLMs), such as ChatGPT, could streamline this process, but enabling an LLM to understand and reason about the atoms and bonds that form a molecule, the same way it does with words that form sentences, has presented a scientific stumbling block.
Researchers from MIT and the MIT-IBM Watson AI Lab created a promising approach that augments an LLM with other machine-learning models known as graph-based models, which are specifically designed for generating and predicting molecular structures.
Their method employs a base LLM to interpret natural-language queries specifying desired molecular properties. It automatically switches between the base LLM and graph-based AI modules to design the molecule, explain the rationale, and generate a step-by-step synthesis plan. It interleaves text, graph, and synthesis-step generation, combining words, graphs, and reactions into a common vocabulary for the LLM to consume.
When compared to existing LLM-based approaches, this multimodal technique generated molecules that better matched user specifications and were more likely to have a valid synthesis plan, improving the success rate from 5 percent to 35 percent.
It also outperformed LLMs that are more than 10 times its size and that design molecules and synthesis routes only with text-based representations, suggesting multimodality is key to the new system's success.
“Our hope is that this could be an end-to-end solution where, from start to finish, we would automate the entire process of designing and making a molecule. If an LLM could just give you the answer in a few seconds, it would be a huge time-saver for pharmaceutical companies,” says Michael Sun, an MIT graduate student and co-author of a paper on this technique.
Sun's co-authors include lead author Gang Liu, a graduate student at the University of Notre Dame; Wojciech Matusik, a professor of electrical engineering and computer science at MIT who leads the Computational Design and Fabrication Group within the Computer Science and Artificial Intelligence Laboratory (CSAIL); Meng Jiang, associate professor at the University of Notre Dame; and senior author Jie Chen, a senior research scientist and manager in the MIT-IBM Watson AI Lab. The research will be presented at the International Conference on Learning Representations.
The best of both worlds
Large language models aren't built to understand the nuances of chemistry, which is one reason they struggle with inverse molecular design, the process of identifying molecular structures that have certain functions or properties.
LLMs convert text into representations called tokens, which they use to sequentially predict the next word in a sentence. But molecules are “graph structures,” composed of atoms and bonds with no particular ordering, which makes them difficult to encode as sequential text.
On the other hand, powerful graph-based AI models represent atoms and the bonds between them as interconnected nodes and edges in a graph. While these models are popular for inverse molecular design, they require complex inputs, can't understand natural language, and yield results that can be difficult to interpret.
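To make the contrast concrete, here is a minimal sketch (illustrative only, not the researchers' code) showing the same simple molecule, formaldehyde, encoded both as a token sequence, the way an LLM sees text, and as a graph of atoms and bonds, the way a graph-based model sees it:

```python
# The same molecule, formaldehyde (CH2O), encoded two ways.

# 1) Sequential text: the SMILES string "C=O" split into tokens for an LLM.
#    Order matters; the model predicts tokens left to right.
smiles_tokens = ["C", "=", "O"]

# 2) Graph: atoms are nodes, bonds are edges tagged with a bond order
#    (1 = single bond, 2 = double bond).
atoms = {0: "C", 1: "O", 2: "H", 3: "H"}        # node id -> element
bonds = [(0, 1, 2), (0, 2, 1), (0, 3, 1)]       # (node, node, bond order)

# A graph has no inherent ordering: listing the edges in any order
# describes the same molecule, whereas reordering the tokens would not
# describe the same text.
assert sorted(bonds) == sorted([(0, 3, 1), (0, 1, 2), (0, 2, 1)])
```

The edge list here is the simplest possible graph encoding; real graph-based models use richer featurizations, but the unordered nodes-and-edges structure is the same.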
The MIT researchers combined an LLM with graph-based AI models into a unified framework that gets the best of both worlds.
Llamole, which stands for large language model for molecular discovery, uses a base LLM as a gatekeeper to understand a user's query: a plain-language request for a molecule with certain properties.
For instance, perhaps a user is seeking a molecule that can penetrate the blood-brain barrier and inhibit HIV, given that it has a molecular weight of 209 and certain bond characteristics.
As the LLM predicts text in response to the query, it switches between graph modules.
One module uses a graph diffusion model to generate the molecular structure conditioned on the input requirements. A second module uses a graph neural network to encode the generated molecular structure back into tokens the LLM can consume. The final graph module is a graph reaction predictor, which takes an intermediate molecular structure as input and predicts a reaction step, searching for the exact set of steps needed to make the molecule from basic building blocks.
The researchers created a new type of trigger token that tells the LLM when to activate each module. When the LLM predicts a “design” trigger token, it switches to the module that sketches a molecular structure, and when it predicts a “retro” trigger token, it switches to the retrosynthetic planning module that predicts the next reaction step.
“The beauty of this is that everything the LLM generates before activating a particular module gets fed into that module itself. The module is learning to operate in a way that is consistent with what came before,” Sun says.
In the same manner, the output of each module is encoded and fed back into the LLM's generation process, so it understands what each module did and will continue predicting tokens based on those data.
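The trigger-token control flow described above can be sketched as a simple loop. The token strings, module interfaces, and stand-in modules below are hypothetical placeholders for illustration, not the authors' actual implementation:

```python
# Hypothetical sketch of trigger-token switching: the LLM's token stream is
# scanned, and special tokens hand control to a graph module whose output is
# encoded back into the shared context.

def run_pipeline(llm_tokens, design_module, retro_module):
    """Interleave text generation with graph modules via trigger tokens."""
    context = []
    for token in llm_tokens:            # stand-in for the LLM's token stream
        if token == "<design>":
            # Graph diffusion model sketches a structure from prior context.
            graph = design_module(context)
            context.append(f"[graph:{graph}]")   # re-encoded for the LLM
        elif token == "<retro>":
            # Reaction predictor proposes the next retrosynthesis step.
            step = retro_module(context)
            context.append(f"[step:{step}]")
        else:
            context.append(token)       # ordinary text token
    return context

def design_stub(context):
    # Toy stand-in for the graph diffusion module.
    return "C=O"

def retro_stub(context):
    # Toy stand-in for the reaction-prediction module.
    return "CO -> C=O"

out = run_pipeline(["Design", "a", "molecule.", "<design>", "<retro>"],
                   design_stub, retro_stub)
```

Because every module writes its encoded output back into the same context list, each module sees everything generated before it, mirroring the feedback loop the researchers describe.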
Better, simpler molecular structures
In the end, Llamole outputs an image of the molecular structure, a textual description of the molecule, and a step-by-step synthesis plan that provides the details of how to make it, down to individual chemical reactions.
In experiments involving designing molecules that matched user specifications, Llamole outperformed 10 standard LLMs, four fine-tuned LLMs, and a state-of-the-art domain-specific method. At the same time, it boosted the retrosynthetic planning success rate from 5 percent to 35 percent by generating molecules of higher quality, meaning they had simpler structures and cheaper building blocks.
“On their own, LLMs struggle to figure out how to synthesize molecules because it requires a lot of multistep planning. Our method can generate better molecular structures that are also easier to synthesize,” Liu says.
To train and evaluate Llamole, the researchers built two datasets from scratch, since existing datasets of molecular structures didn't contain enough details. They augmented hundreds of thousands of patented molecules with AI-generated natural-language descriptions and customized description templates.
The dataset they built to fine-tune the LLM includes templates related to 10 molecular properties, so one limitation of Llamole is that it is trained to design molecules with only those 10 numerical properties in mind.
In future work, the researchers want to generalize Llamole so it can incorporate any molecular property. In addition, they plan to improve the graph modules to boost Llamole's retrosynthesis success rate.
And in the long run, they hope to use this approach to go beyond molecules, creating multimodal LLMs that can handle other types of graph-based data, such as interconnected sensors in a power grid or transactions in a financial market.
“Llamole demonstrates the feasibility of using large language models as an interface to complex data beyond textual description, and we anticipate them being a foundation that interacts with other AI algorithms to solve graph problems,” says Chen.
This research is funded, in part, by the MIT-IBM Watson AI Lab, the National Science Foundation, and the Office of Naval Research.