ChatNT: Redefining Biological Data Conversations with AI

Conversational Intelligence Meets Genomics

Chat Nucleotide Transformer (ChatNT) is an AI model designed to transform how scientists and researchers approach genomics, proteomics, and related fields. It lets researchers interact with intricate biological data as easily as chatting with a colleague.

As biological datasets grow rapidly, ChatNT provides an intuitive, high-performance interface that removes the need for deep coding expertise, opens new doors for scientific exploration, and lowers the barrier to entry for scientific discovery.

Natural Language for Biological Complexity

ChatNT is a generative AI model that fuses the understanding of biological sequences with the adaptability of natural language processing. Drawing inspiration from advanced vision-language models, ChatNT reframes genomics tasks as text-to-text problems, allowing users to interact with complex data through seamless conversations. 

Unlike earlier models that required specialized systems for each task, this "generalist" model can handle dozens of genomics problems within a single interface, making research more efficient and accessible.

In benchmarking tests using InstaDeep’s Nucleotide Transformer framework, ChatNT achieved a Matthews Correlation Coefficient (MCC) of 0.77, surpassing previous specialist models that required a separate architecture and extensive rework for each individual task.
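For readers unfamiliar with the metric, MCC condenses a binary confusion matrix into a single value between -1 (total disagreement) and 1 (perfect prediction), with 0 meaning no better than chance. The sketch below is a minimal plain-Python illustration; the function name and the toy labels are made up for this example and are not taken from the ChatNT benchmark.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when a marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy labels (illustrative only): 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # prints 0.5
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which makes it a common choice for genomics tasks where positive and negative classes are unbalanced.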

The Architecture Behind ChatNT

Scalability and flexibility are at the heart of ChatNT’s design. The model was trained on a vast, diverse dataset covering 27 biological functions and multiple species, incorporating over 600 million DNA tokens and 273 million English tokens. Its processing pipeline includes several key stages:

  • DNA Encoder (NTv2): Breaks DNA into nucleotide chunks and converts them into numerical vectors to identify biological patterns.

  • Projection & Translation: Refines these vectors using neural networks and a perceiver resampler, filtering for the most relevant features.

  • English Decoder (Vicuna-7B): Embeds refined biological information into prompts for a language model, producing clear, plain-English responses.

This comprehensive pipeline enables ChatNT to provide not only precise numeric outputs but also conversational explanations, making it suitable for a broad spectrum of biological questions.
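The data flow through these three stages can be sketched with a toy, dependency-free Python example. Everything here is an illustrative assumption rather than ChatNT's actual code: the 6-mer tokenization, the random stand-in embeddings, the single-head cross-attention used as a simplified resampler, and the hypothetical "<DNA>" placeholder marking where the resampled vectors would be handed to the language model.

```python
import math
import random

K = 6            # assumed k-mer size for DNA tokenization
DIM = 8          # toy embedding width
NUM_LATENTS = 4  # fixed number of vectors handed to the language model

def tokenize(dna, k=K):
    """Split DNA into non-overlapping k-mers; leftover bases become single tokens."""
    cut = len(dna) - len(dna) % k
    return [dna[i:i + k] for i in range(0, cut, k)] + list(dna[cut:])

def embed(token, dim=DIM):
    """Stand-in for the DNA encoder: a deterministic pseudo-random vector per token."""
    rng = random.Random(token)
    return [rng.uniform(-1, 1) for _ in range(dim)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def resample(embeddings, latents):
    """Simplified resampler: each latent query attends over all token embeddings,
    so any sequence length is compressed into len(latents) vectors."""
    scale = math.sqrt(len(latents[0]))
    out = []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, e)) / scale for e in embeddings]
        weights = softmax(scores)
        out.append([sum(w * e[d] for w, e in zip(weights, embeddings))
                    for d in range(len(q))])
    return out

def build_prompt(question, n_slots):
    """Text prompt with hypothetical placeholder slots for the DNA vectors."""
    return question + " " + "<DNA>" * n_slots

dna = "ATGCGTATATAAGGCTAGCTAGGTAAGT"
tokens = tokenize(dna)
latents = [embed(f"latent-{i}") for i in range(NUM_LATENTS)]  # learned in a real model
resampled = resample([embed(t) for t in tokens], latents)

print(tokens)
print(len(resampled), "vectors of width", len(resampled[0]))
print(build_prompt("Is there a promoter in this sequence?", len(resampled)))
```

The point of the resampling step in this sketch is that the language model always receives the same number of vectors, however long the input sequence is.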

Building Trust Through Transparency

Trust is essential in scientific research, and ChatNT addresses this by introducing two critical innovations. 

First, it employs a perplexity-based scoring system to estimate its own confidence, offering probability-based answers instead of simple binaries. This is further enhanced by Platt scaling calibration, ensuring users can depend on the model’s self-assessment, even in sensitive applications.
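One way to picture this confidence pipeline is the following minimal sketch. It assumes per-token log-probabilities for a "yes" and a "no" answer are available from the language model, turns their perplexities into a raw score, and then fits a Platt-scaling sigmoid on held-out examples. The function names, the scoring rule, and all numbers are illustrative assumptions, not ChatNT's published procedure.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of an answer from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def raw_score(logprobs_yes, logprobs_no):
    """Positive when the 'yes' answer is less perplexing than the 'no' answer."""
    return math.log(perplexity(logprobs_no)) - math.log(perplexity(logprobs_yes))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Platt scaling: fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s / n
            grad_b += err / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Made-up held-out raw scores with ground-truth labels (not perfectly separable).
held_out_scores = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.2, 0.5, 1.0, 1.5, 2.0]
held_out_labels = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
a, b = fit_platt(held_out_scores, held_out_labels)

# Made-up token log-probabilities for one new question.
score = raw_score(logprobs_yes=[-0.1, -0.2], logprobs_no=[-2.0, -1.5])
print("raw score:", round(score, 2))
print("calibrated P(yes):", round(sigmoid(a * score + b), 3))
```

Platt scaling only learns two numbers, a slope and an offset, so it reshapes the raw scores into probabilities without changing how examples are ranked.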

Second, ChatNT offers gradient-based attribution for model interpretability. Researchers can see which DNA segments most influenced the AI’s predictions, with the model reliably pinpointing biologically significant motifs like GT/AG dinucleotides for splice sites or TATA boxes for promoters. This transparency builds user confidence and yields meaningful biological insights.
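The idea behind gradient-based attribution can be shown on a toy model. The sketch below is an assumption-laden illustration, not ChatNT's implementation: it uses a single hand-written "TATA" filter with a smooth-maximum readout in place of a trained network, computes the gradient of the output with respect to the one-hot input analytically, and multiplies it by the input (the "gradient x input" variant) to get one importance score per base.

```python
import math

BASES = "ACGT"
MOTIF = "TATA"  # toy filter; a trained model would learn its own weights

# Filter weights: +1 where the base matches the motif, -1 otherwise.
FILTER = [[1.0 if b == m else -1.0 for b in BASES] for m in MOTIF]

def one_hot(seq):
    return [[1.0 if b == base else 0.0 for b in BASES] for base in seq]

def window_scores(x):
    """Slide the filter over the one-hot sequence."""
    k = len(FILTER)
    return [sum(FILTER[j][c] * x[i + j][c] for j in range(k) for c in range(4))
            for i in range(len(x) - k + 1)]

def model(x):
    """Toy differentiable model: smooth maximum (log-sum-exp) of window scores."""
    s = window_scores(x)
    m = max(s)
    return m + math.log(sum(math.exp(v - m) for v in s))

def input_gradient(x):
    """Analytic gradient of model(x) with respect to every one-hot entry."""
    s = window_scores(x)
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    grad = [[0.0] * 4 for _ in x]
    for i, e in enumerate(exps):
        for j in range(len(FILTER)):
            for c in range(4):
                grad[i + j][c] += (e / total) * FILTER[j][c]
    return grad

def attribution(seq):
    """Gradient x input: one importance score per sequence position."""
    x = one_hot(seq)
    g = input_gradient(x)
    return [sum(g[i][c] * x[i][c] for c in range(4)) for i in range(len(seq))]

seq = "GCGCTATAGCGC"
for base, score in zip(seq, attribution(seq)):
    print(base, f"{score:+.2f}")
```

In this toy setting the four highest scores fall on the embedded TATA bases, which mirrors how attribution maps are read in practice: peaks mark the positions the prediction depends on most.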

Performance Across Genomics and Proteomics

ChatNT demonstrates excellence across a broad range of genomics tasks, often matching or exceeding the best specialist models. For example, it scored MCC 0.95 in promoter prediction and MCC 0.98 in splice site detection. The model also showed superior capabilities in human enhancer identification, though it left room for improvement in plant-specific enhancer prediction, highlighting areas for future growth.

Beyond DNA, ChatNT generalizes effectively to RNA and protein analysis. It performs well on regression and numeric tasks, outperforming established models in polyadenylation prediction (PCC 0.91) and protein melting point estimation (PCC 0.89), while maintaining high accuracy in other protein feature predictions.
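For regression tasks like these, a text-generating model has to state its prediction as a number inside a sentence, and the Pearson correlation coefficient (PCC) then measures how well those numbers track the measured values. The following is a minimal sketch under that assumption; the answer strings and measurements are invented for illustration and are not real ChatNT outputs.

```python
import math
import re

def parse_number(answer):
    """Pull the first decimal number out of a free-text answer."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    if match is None:
        raise ValueError(f"no number found in: {answer!r}")
    return float(match.group())

def pearson_corrcoef(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("PCC is undefined when one input is constant")
    return cov / (spread_x * spread_y)

# Invented model answers and invented measurements (degrees Celsius).
answers = [
    "The melting point is 54.2 degrees.",
    "Around 61.0 degrees.",
    "Predicted value: 47.5",
    "It should melt near 70.3 degrees.",
]
measured = [55.0, 60.1, 49.2, 68.7]

predicted = [parse_number(a) for a in answers]
print(predicted)
print(round(pearson_corrcoef(predicted, measured), 3))
```

A PCC close to 1 means predictions rise and fall with the measurements; it does not by itself guarantee that the absolute values are close.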

Future Directions: Scalability and Multimodality

The architecture behind ChatNT is designed for ongoing scalability. Future updates aim to increase task diversity and improve zero-shot generalization, enabling the model to tackle new challenges without retraining. Planned multimodal enhancements, such as specialized RNA and protein encoders and expanded context windows, promise even greater performance and versatility.

As the need for intuitive, multi-purpose AI in biology grows, ChatNT sets a new standard. It marks a foundational step toward truly conversational, generalist AI that empowers researchers and democratizes advanced biological analysis.

ChatNT’s introduction is a milestone for digital biology, making sophisticated genomics and proteomics accessible through natural conversation. As this AI continues to evolve, it is poised to revolutionize how scientists interact with biological data, accelerating discoveries and shaping the future of research.

Source: InstaDeep Blog


Joshua Berkowitz July 8, 2025