Toucan Dataset: Transforming AI Agents Into Digital Doers

Unlocking a New Era for AI Agents


Toucan, a groundbreaking open-source dataset from IBM and the University of Washington, is crafted to propel tool-calling capabilities in large language models (LLMs) to new heights.

For AI systems to move beyond simple conversation and become genuinely useful assistants, they must master tool-calling. This skill allows agents to identify, select, and use the right digital applications for a given job. Reliable tool-calling, however, depends on access to large volumes of realistic, high-quality training data, something the industry has sorely lacked until now.
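To make the idea concrete, here is a minimal sketch of what tool-calling looks like on the application side, assuming the model emits each call as a JSON string with a `name` and `arguments`. The tools (`get_weather`, `add_numbers`), the registry, and the `run_tool_call` helper are hypothetical names invented for illustration; they are not part of Toucan or of any particular agent framework.

```python
import json
from typing import Callable

# Hypothetical tools for illustration; a real agent would wrap actual APIs.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

def add_numbers(a: float, b: float) -> float:
    return a + b

# Registry mapping the tool names the model may request to Python callables.
TOOLS: dict[str, Callable] = {
    "get_weather": get_weather,
    "add_numbers": add_numbers,
}

def run_tool_call(raw_call: str) -> dict:
    """Execute one model-emitted tool call given as a JSON string.

    Returns an error dict when the requested tool is not available, so the
    agent can report that the task cannot be solved with its toolset.
    """
    call = json.loads(raw_call)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {call['name']}"}
    return {"ok": True, "result": tool(**call["arguments"])}

print(run_tool_call('{"name": "add_numbers", "arguments": {"a": 2, "b": 3}}'))
print(run_tool_call('{"name": "send_email", "arguments": {}}'))
```

The second call shows the failure path: the requested tool is not in the registry, so the helper returns an error instead of raising.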

Paper: TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

Dataset: Toucan dataset on Hugging Face

“LLMs trained on Toucan essentially learn how to choose the right tools for the task, create engaging dialogue to keep humans in the loop, and recognize when a task can’t be solved with the available toolset,” said Adriana Meza, an IBM Research engineer who co-led the dataset’s creation.

Why Toucan Is a Big Deal

Toucan stands out as the most extensive public dataset for tool-calling yet. It's built from 1.5 million real-world task scenarios, which together document how AI agents interact with more than 2,000 web services.

The scenarios span activities such as analyzing business data, drafting summaries, scheduling, and managing calendars. Each one is recorded as a trajectory that captures the agent's process from initial planning to final execution and summary, creating a rich training ground for LLMs.

Authenticity: The tasks are model-generated, but the trajectories record real API executions rather than simulated tool responses.

Unmatched scale: With five times more data than previous datasets, Toucan covers broader toolsets and more complex tasks, including parallel tool use.

Immediate community impact: The dataset quickly became a trending topic on Hugging Face, signaling widespread interest.

How Toucan Was Built

The Toucan team gathered metadata from MCP (Model Context Protocol) servers, which link AI agents to APIs. After careful filtering, data from 500 servers formed the dataset’s backbone.

Five open-source LLMs generated plausible task scenarios, while three more models translated these into detailed agent trajectories. To ensure quality, two additional LLMs rated each scenario for complexity and execution, so only the best examples were included.
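The rating-and-filtering step can be pictured with a short sketch. Everything specific here is an assumption made for illustration: the 1-to-5 score scale, the averaging across judges, the `min_score` threshold, and the `Rating`, `Scenario`, and `keep` names. The article does not describe how the Toucan team combined the two judges' ratings.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Rating:
    complexity: int  # assumed 1-5 scale
    execution: int   # assumed 1-5 scale

@dataclass
class Scenario:
    task: str
    ratings: list[Rating]  # one Rating per judge model

def keep(scenario: Scenario, min_score: float = 4.0) -> bool:
    """Keep a scenario only if both judge-averaged scores reach the threshold."""
    avg_complexity = mean(r.complexity for r in scenario.ratings)
    avg_execution = mean(r.execution for r in scenario.ratings)
    return avg_complexity >= min_score and avg_execution >= min_score

scenarios = [
    Scenario("Summarize Q3 sales and email the team", [Rating(5, 4), Rating(4, 4)]),
    Scenario("Say hello", [Rating(1, 5), Rating(2, 5)]),
]

# Only scenarios that pass the hypothetical quality bar survive.
kept = [s.task for s in scenarios if keep(s)]
print(kept)
```

In this toy run the trivial "Say hello" task is dropped because its averaged complexity score falls below the threshold, even though it executed cleanly.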

Compared to other resources, Toucan’s sheer size is a game-changer. About 20% of its scenarios require parallel tool-calling, pushing AI agents to handle multi-step and multi-tool workflows efficiently.
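Parallel tool-calling means issuing independent tool calls at the same time instead of one after another. The sketch below illustrates the idea with Python's `asyncio`; the two tools (`fetch_calendar`, `fetch_sales`) are hypothetical, and `asyncio.sleep` stands in for network latency.

```python
import asyncio
import time

# Hypothetical tools; asyncio.sleep stands in for a slow API request.
async def fetch_calendar(user: str) -> str:
    await asyncio.sleep(0.2)
    return f"{user}: 3 meetings"

async def fetch_sales(region: str) -> str:
    await asyncio.sleep(0.2)
    return f"{region}: sales up"

async def run_parallel() -> list[str]:
    # The two calls do not depend on each other, so they are issued together.
    # Total wait is roughly the slowest call rather than the sum of both.
    return await asyncio.gather(fetch_calendar("dana"), fetch_sales("EMEA"))

start = time.perf_counter()
results = asyncio.run(run_parallel())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

`asyncio.gather` returns results in the order the calls were passed in, so the agent can match each result back to the request that produced it.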

“Tool-calling is central to AI agents,” said Rameswar Panda, the IBM researcher who led the team behind Toucan. “How can you train better agents? Through diverse, high-quality examples sourced from the real world.”

Performance That Raises the Bar

Early results are already impressive. Smaller, open-source models trained on Toucan have outperformed much larger frontier models on major industry benchmarks like the Berkeley Function Calling Leaderboard (BFCLv3) and Salesforce’s MCP-Universe.

For instance, the Qwen-2.5-32B model, fine-tuned with Toucan, surpassed OpenAI’s GPT-4.5-Preview on BFCLv3, despite being a fraction of the size.

Efficiency: Parallel tool-calling cuts computational costs and accelerates workflows.

Smarter choices: Models learn to pick the best tool, engage users effectively, and recognize their own limitations.

Benchmark gains: Some models improved on the previous state of the art by up to nine percentage points.

The Road Ahead for Toucan

Toucan’s creators are already expanding the dataset by incorporating new MCP servers and tools, mirroring the fast-paced growth of the API ecosystem. They’re also developing a reinforcement learning gym and new benchmarks to help LLMs master enterprise workflows, leveraging insights gained from Toucan’s construction.

A Turning Point for Digital Agents

Toucan delivers the scale, diversity, and real-world context that LLMs need to become genuinely useful digital agents. As the dataset continues to grow, expect smarter, more capable AI tools ready to tackle practical, real-world tasks, bringing us closer to fully autonomous digital assistants.

Source: IBM Research Blog by Kim Martineau, 17 Oct 2025


Source: https://research.ibm.com/blog/toucan-for-tool-calling

Joshua Berkowitz October 25, 2025
