Large language models (LLMs) have demonstrated considerable capabilities across various fields. However, a significant limitation is their lack of ability to access and reason over specialized, domain-specific knowledge and tools. This is particularly challenging in complex scientific areas like chemistry and molecular discovery, which require intricate data and dynamic analysis.
Inspired by ChemCrow, an LLM-powered assistant for chemical synthesis, this research introduces CACTUS (Chemistry Agent Connecting Tool-Usage to Science), an agent designed to bridge the gap between domain expertise and the reasoning strengths of LLMs. It integrates existing cheminformatics tools to enhance the reasoning and problem-solving abilities of LLMs in chemistry.
CACTUS utilizes the open-source LangChain platform, simplifying the integration of LLMs with external tools and data sources.
This research aims to advance drug discovery and molecular property prediction by providing a platform that combines the reasoning abilities of LLMs with the precision of domain-specific computational tools. Because the platform can run on consumer-grade hardware, it could democratize drug discovery by giving budget-constrained researchers access to cutting-edge tools.
Key Takeaways
- CACTUS significantly outperforms baseline LLMs on chemistry questions by integrating cheminformatics tools (cheminformatics focuses on storing, retrieving, analyzing, and manipulating chemical data) and domain prompting.
- Among the open-source models, Gemma-7b and Mistral-7b consistently achieved the highest accuracy, and Llama3-8b was also among the most accurate models in one evaluation. Gemini-Pro achieved the best results overall but is not open source and ran into quota restrictions.
- Domain-specific prompting strategies play a crucial role in enhancing the model's performance.
- Built with LangChain for extensibility, using a custom MRKL agent (tool, LLMChain, Agent).
- Uses a zero-shot agent with the ReAct framework to determine which tool to use.
- Accurate results can be achieved even with smaller models (such as Phi) on consumer-grade hardware, potentially increasing accessibility for researchers who have limited computational resources or need to cover large chemical spaces.
- Effectively leverages widely used open-source cheminformatics tools, such as those from the RDKit library.
- Provides a user-friendly platform to explore chemical space, identify promising compounds, and assist in tasks like molecular property prediction, similarity searching, and drug-likeness assessment.
- The framework is extensible, allowing for the integration of new tools and more advanced models. Future developments aim to incorporate capabilities like 3D molecular modeling, reinforcement learning, and enhanced explainability.
Overview
Large language models (LLMs) are foundational models trained on vast amounts of data to support diverse tasks. While powerful, these transformer-based models have limitations in deeply understanding chemical and biological texts.
They can struggle with tasks requiring access to dynamic, real-time, or confidential data not present in their training sets. Moreover, although LLM-generated answers may appear correct, they can lack true reasoning or subject knowledge (the so-called hallucination phenomenon), potentially leading to "robustness failures" that stem from purely statistical associations.
Current research explores augmenting LLMs with external tools to help solve problems and tasks more efficiently within the chemistry domain. Providing specific prompts tailored to a task can also improve the quality and relevance of text generated by these models. Combining these two approaches is the basis of frameworks like the Tool Augmented Language Model (TALM).
CACTUS builds upon this concept by developing an Intelligent Agent specifically for cheminformatics. Cheminformatics is a field focused on the storage, retrieval, analysis, and manipulation of chemical data, essentially connecting computational linguistics with chemistry to aid in drug discovery and development.
CACTUS is designed as an LLM-powered agent capable of intelligently determining the most suitable tools and their optimal sequence for a given task. It is built to be user-friendly, with an intuitive chat-based interface, and extensible, thanks to the LangChain framework.
The platform integrates familiar open-source Python tools, particularly from RDKit, a widely used library for representing, analyzing, and manipulating chemical structures.
The focus of CACTUS is not on improving the accuracy of these existing tools themselves, but rather on demonstrating how an LLM agent can intelligently utilize them within a more streamlined workflow for researchers.
Why it’s Important
LLMs, despite their broad training, inherently lack the specialized knowledge and practical abilities required for tasks such as predicting molecular properties or assessing drug-likeness.
By connecting LLMs with established cheminformatics tools, CACTUS gives the models the ability to access and reason over domain-specific information and to execute specific calculations. This approach combines the LLM's analytical strengths with the tools' precision and domain expertise.
CACTUS offers a promising pathway for de novo drug design and molecular discovery by assisting scientists in navigating the vast chemical space more effectively. It enables tasks like molecular property prediction, similarity searching, and drug-likeness assessment from natural language interaction.
The agent's workflow involves receiving a user question, identifying the necessary tools, executing them, and observing the output to formulate an informed answer.
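The workflow just described can be sketched as a small ReAct-style loop. This is a minimal illustration under stated assumptions, not the CACTUS or LangChain implementation: `scripted_llm` is a hypothetical stand-in for a real model, `fake_mol_weight` is a hypothetical tool that returns hard-coded values instead of calling a cheminformatics library, and the `Action:` / `Action Input:` / `Observation:` text format is one common convention chosen here for illustration.

```python
import re
from typing import Callable, Dict


def fake_mol_weight(smiles: str) -> str:
    """Hypothetical tool: looks up a hard-coded weight instead of computing it."""
    known = {"CCO": "46.07", "O": "18.02"}
    return known.get(smiles, "unknown")


# Tool registry (assumed layout): tool name -> callable taking and returning text.
TOOLS: Dict[str, Callable[[str], str]] = {"MolWt": fake_mol_weight}


def scripted_llm(prompt: str) -> str:
    """Stand-in for a real LLM: requests a tool first, then answers from its output."""
    if "Observation:" not in prompt:
        return "Thought: I need the molecular weight.\nAction: MolWt\nAction Input: CCO"
    observation = prompt.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I now know the answer.\nFinal Answer: {observation} g/mol"


ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)")


def run_agent(question: str, llm: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Receive a question, pick a tool, execute it, observe, and repeat until answered."""
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(prompt)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            return "Agent failed: could not parse an action."
        name, tool_input = match.group(1), match.group(2).strip()
        tool = tools.get(name)
        observation = tool(tool_input) if tool else f"Unknown tool: {name}"
        # Feed the tool output back so the next model call can reason over it.
        prompt += f"{reply}\nObservation: {observation}\n"
    return "Agent failed: step limit reached."


print(run_agent("What is the molecular weight of CCO?", scripted_llm, TOOLS))
```

The step limit is a design choice in this sketch: it keeps a model that never produces a final answer from looping indefinitely.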
The agent's ability to intelligently select and sequence tools can optimize workflows and potentially accelerate the often lengthy and resource-intensive drug development process.
Beyond drug discovery, the implications of intelligent agents like CACTUS are far-reaching. They pave the way for autonomous operation of complex tasks in scientific research, including data analysis, experimental planning, hypothesis generation, and testing.
Integrating agents into large, tool-based platforms could create natural language interfaces, further streamlining research workflows. The research suggests this could advance our understanding of what computational chemistry can achieve, potentially enabling autonomous experimentation with or without human involvement.
The trend of integrating AI agents with specialized domain tools could extend to many other scientific and engineering disciplines as well. For example, similar agents could be developed for materials science, catalysis design, or even more complex biological systems, where managing vast datasets and applying specific analytical techniques with high accuracy is crucial.
The potential to deploy such agents on consumer-grade hardware, a primary focus of the research, lowers the barrier to entry for smaller research groups, citizen researchers and academic institutions. However, the development still faces challenges, particularly in ensuring the robustness and explainability of the agent's reasoning and tool usage, which is necessary for researchers to trust and verify the results.
The synergistic relationship between human intelligence, AI, and specialized software tools facilitated by agents like CACTUS has the potential to transform the landscape of scientific discovery, making it faster, more efficient, accurate, and innovative.
Summary of Results
The research evaluated the performance of CACTUS by integrating it with a diverse set of open-source large language models and testing them on a benchmark of chemistry questions.
The authors created a benchmark of thousands of chemistry questions, divided into a qualitative set (Yes/No or True/False answers), a quantitative set (numerical answers), and a combined set.
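To make the two question types concrete, here is a minimal sketch of how answers of each kind might be scored. The record layout, the function names, and the 1% relative tolerance for numerical answers are all assumptions made for illustration; the scoring rules actually used in the research are not described here.

```python
import math
from typing import Iterable, Optional, Tuple


def normalize_qualitative(answer: str) -> Optional[bool]:
    """Map Yes/No and True/False answers onto a boolean; None if unrecognized."""
    text = answer.strip().lower().rstrip(".")
    if text in {"yes", "true"}:
        return True
    if text in {"no", "false"}:
        return False
    return None


def is_correct(predicted: str, expected: str, kind: str, rel_tol: float = 0.01) -> bool:
    """Score one answer. rel_tol is an assumed tolerance for numerical answers."""
    if kind == "qualitative":
        guess = normalize_qualitative(predicted)
        return guess is not None and guess == normalize_qualitative(expected)
    try:
        return math.isclose(float(predicted), float(expected), rel_tol=rel_tol)
    except ValueError:
        # A non-numeric reply to a quantitative question counts as wrong.
        return False


def accuracy(records: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (predicted, expected, kind) records scored as correct."""
    results = [is_correct(p, e, k) for p, e, k in records]
    return sum(results) / len(results) if results else 0.0


# Illustrative records only; these are not questions from the benchmark.
sample = [
    ("Yes", "yes", "qualitative"),
    ("False", "True", "qualitative"),
    ("46.07", "46.069", "quantitative"),
    ("n/a", "3.2", "quantitative"),
]
print(accuracy(sample))
```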
The evaluated LLMs included Gemma-7b, Falcon-7b, MPT-7b, Llama3-8b (or Llama2-7b, depending on the evaluation), and Mistral-7b. The key finding was that CACTUS significantly outperforms baseline LLMs.
Gemma-7b, Llama3-8b and Mistral-7b consistently achieved the highest accuracy regardless of the prompting strategy used.
The study also explored the impact of domain-specific prompting and different hardware configurations on performance. These factors were highlighted as important aspects of prompt engineering and deployment.
The study compared two prompting strategies, a Minimal prompt and a Domain prompt. The Minimal prompt was a default LangChain prompt that included only tool descriptions, while the Domain prompt added language to align the model with chemistry. The Domain prompt generally yielded better results than the Minimal prompt, especially for qualitative questions.
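The difference between the two strategies can be illustrated with a short sketch. The prompt wording, tool names, and tool descriptions below are hypothetical and are not the prompts used in the study; the point is only that the Domain prompt is the Minimal prompt plus chemistry-specific framing.

```python
from typing import Dict

# Hypothetical tool descriptions, for illustration only.
TOOL_DESCRIPTIONS: Dict[str, str] = {
    "MolWt": "Returns the molecular weight for a SMILES string.",
    "QED": "Returns a drug-likeness score between 0 and 1 for a SMILES string.",
}

# Hypothetical domain framing; the study's actual wording is not reproduced here.
DOMAIN_PREAMBLE = (
    "You are an expert chemist. Use the cheminformatics tools provided "
    "and report answers in standard chemical terms."
)


def render_tools(tools: Dict[str, str]) -> str:
    """List each tool as 'name: description', one per line."""
    return "\n".join(f"{name}: {desc}" for name, desc in tools.items())


def minimal_prompt(question: str, tools: Dict[str, str]) -> str:
    """Minimal strategy: tool descriptions and the question, nothing else."""
    return (
        "Answer the question using the tools below.\n"
        f"{render_tools(tools)}\n"
        f"Question: {question}"
    )


def domain_prompt(question: str, tools: Dict[str, str]) -> str:
    """Domain strategy: the same content, preceded by chemistry-specific framing."""
    return f"{DOMAIN_PREAMBLE}\n{minimal_prompt(question, tools)}"


print(domain_prompt("Is CCO drug-like?", TOOL_DESCRIPTIONS))
```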
A significant result was the demonstration that smaller models (such as Phi) can be deployed on consumer-grade hardware while maintaining high accuracy. This finding is important for wider adoption and accessibility, especially for researchers with limited computational resources.
Tests were conducted on different NVIDIA GPUs, including the data center-grade A100 80GB and the consumer-grade RTX 2080 Ti. Smaller models, such as Phi2 (2.7B parameters) and Phi3 (3.8B parameters), were found to perform surprisingly well on consumer-grade hardware, achieving accuracy comparable to larger models on more powerful hardware, particularly for quantitative tasks.
CACTUS achieves its capabilities by leveraging a suite of cheminformatics tools. These tools, primarily from the RDKit library for the purpose of demonstration, allow CACTUS to perform specific calculations and assessments. The table below outlines some of the currently supported tools.
By utilizing these tools, CACTUS assists researchers in making informed decisions during molecular discovery and prioritizing compounds with desirable characteristics. The accuracy of these individual methods is documented in RDKit documentation and their original publications and was not a focus of the current research.
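As a rough illustration of the kind of calculation such a tool wraps, the sketch below computes a few RDKit descriptors and applies Lipinski's rule of five, a familiar drug-likeness heuristic. Whether CACTUS exposes exactly this check is an assumption; the `MolDescriptors` class and both function names are hypothetical. RDKit is imported lazily inside `descriptors_from_smiles` so that the rule check itself runs without RDKit installed.

```python
from dataclasses import dataclass


@dataclass
class MolDescriptors:
    """Hypothetical container for the four rule-of-five inputs."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def lipinski_violations(d: MolDescriptors) -> int:
    """Count rule-of-five violations: MW > 500, logP > 5, donors > 5, acceptors > 10."""
    checks = [d.mol_weight > 500, d.logp > 5, d.h_donors > 5, d.h_acceptors > 10]
    return sum(checks)


def descriptors_from_smiles(smiles: str) -> MolDescriptors:
    """Compute the descriptors with RDKit (requires the rdkit package)."""
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, Lipinski

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return MolDescriptors(
        mol_weight=Descriptors.MolWt(mol),
        logp=Crippen.MolLogP(mol),
        h_donors=Lipinski.NumHDonors(mol),
        h_acceptors=Lipinski.NumHAcceptors(mol),
    )


# Hand-entered, illustrative values so the example runs without RDKit.
small_molecule = MolDescriptors(mol_weight=180.2, logp=1.3, h_donors=1, h_acceptors=3)
large_molecule = MolDescriptors(mol_weight=900.0, logp=6.5, h_donors=6, h_acceptors=12)
print(lipinski_violations(small_molecule), lipinski_violations(large_molecule))
```

With RDKit installed, `lipinski_violations(descriptors_from_smiles("CCO"))` would run the same check on descriptors computed from a SMILES string.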
The development and benchmarking process highlighted challenges related to model deployment and prompt engineering. The solutions implemented, such as using vLLM for hosting models and developing tailored prompts for each LLM, serve as a foundation for future work.
Conclusion
CACTUS is presented as a significant advancement in cheminformatics, offering a powerful and adaptable tool for researchers. It represents an innovative open-source agent that successfully leverages the combined power of large language models and domain-specific cheminformatics tools.
By integrating a wide array of computational methods, CACTUS creates a comprehensive platform enabling researchers and chemists to navigate the vast chemical space for molecular discovery, potentially facilitating the identification of promising therapeutic compounds.
The evaluation demonstrated that CACTUS, particularly when coupled with models like Gemma-7b and Mistral-7b and high-end hardware, provides a significant performance boost over baseline LLMs in answering chemistry questions.
The research also underscored the critical roles of prompt engineering and the feasibility of deploying CACTUS with smaller models on more accessible hardware, enhancing its potential for wider adoption.
Looking ahead, the future development of CACTUS aims to create an intelligent, comprehensive tool with enhanced explainability and symbolic reasoning capabilities, addressing common limitations of LLMs. The integration of advanced physics-based AI/ML models and additional tools for fragment identification, property calculation, and toxicity screening is also planned. As AI-driven scientific discovery progresses, agents like CACTUS are expected to play a pivotal role in transforming fields like drug discovery, catalysis, and materials science, accelerating innovation and contributing to advancements in human health.
CACTUS: Connecting Large Language Models and Cheminformatics for Molecular Discovery
CACTUS: Chemistry Agent Connecting Tool Usage to Science