Toucan, a groundbreaking open-source dataset from IBM and the University of Washington, is built to advance tool-calling capabilities in large language models (LLMs).
For AI systems to move beyond simple conversation and become genuinely useful assistants, they must master tool-calling. This skill allows agents to identify, select, and use the right digital applications for a given job. Reliable tool-calling, however, depends on access to large volumes of realistic, high-quality training data, something the industry has sorely lacked until now.
Paper: TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
Dataset: Toucan dataset on Hugging Face
“LLMs trained on Toucan essentially learn how to choose the right tools for the task, create engaging dialogue to keep humans in the loop, and recognize when a task can’t be solved with the available toolset,” said Adriana Meza, an IBM Research engineer who co-led the dataset’s creation.
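To make those behaviors concrete, here is a minimal sketch of the runtime side of tool-calling: the model proposes a call, and a dispatcher either runs it or reports that nothing in the toolset fits. The tool names, the dictionary shape of a call, and the status strings are all assumptions made for illustration; they are not taken from Toucan or from any particular agent framework.

```python
from typing import Any, Callable, Dict


# Hypothetical tools; names and behavior are illustrative only.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


def add_numbers(a: float, b: float) -> float:
    return a + b


# The "available toolset" the agent is allowed to choose from.
TOOLS: Dict[str, Callable[..., Any]] = {
    "get_weather": get_weather,
    "add_numbers": add_numbers,
}


def dispatch(tool_call: Dict[str, Any]) -> Dict[str, Any]:
    """Run one model-proposed tool call, or report why it cannot run."""
    name = tool_call.get("name")
    args = tool_call.get("arguments", {})
    if name not in TOOLS:
        # The task cannot be solved with the available toolset.
        return {"status": "unavailable", "message": f"No tool named {name!r}"}
    try:
        return {"status": "ok", "result": TOOLS[name](**args)}
    except TypeError as exc:
        # The right tool was chosen, but with wrong or missing arguments.
        return {"status": "bad_arguments", "message": str(exc)}
```

In this sketch, a call such as dispatch({"name": "book_flight", "arguments": {}}) returns an "unavailable" status instead of raising, which gives the agent something to relay back to the user.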
Why Toucan Is a Big Deal
Toucan stands out as the most extensive public dataset for tool-calling yet. It’s built from 1.5 million real-world task scenarios that together show AI agents interacting with more than 2,000 web services.
These trajectories span activities like analyzing business data, drafting summaries, scheduling, and managing calendars. Every scenario captures the agent’s process from initial planning to final execution and summary, creating a rich training ground for LLMs.
- Authenticity: Scenarios are based on real API executions, not synthetic simulations.
- Unmatched scale: With five times more data than previous datasets, Toucan covers broader toolsets and more complex tasks, including parallel tool use.
- Immediate community impact: The dataset quickly became a trending topic on Hugging Face, signaling widespread interest.
How Toucan Was Built
The Toucan team gathered metadata from MCP (Model Context Protocol) servers, which link AI agents to APIs. After careful filtering, data from 500 servers formed the dataset’s backbone.
Five open-source LLMs generated plausible task scenarios, while three more models translated these into detailed agent trajectories. To ensure quality, two additional LLMs rated each scenario for complexity and execution, so only the best examples were included.
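The rating-and-filtering step can be pictured with a small sketch. The article says only that two LLMs rated each scenario for complexity and execution; the 1-to-5 scale, the averaging rule, the 4.0 cutoff, and every name below are assumptions chosen for illustration, not the Toucan team's actual criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RatedScenario:
    task: str
    complexity: List[int]  # one score per judge model (assumed 1-5 scale)
    execution: List[int]   # one score per judge model (assumed 1-5 scale)


def mean(scores: List[int]) -> float:
    return sum(scores) / len(scores)


def keep(scenario: RatedScenario, min_score: float = 4.0) -> bool:
    """Keep a scenario only if both averaged ratings clear the cutoff."""
    return (
        mean(scenario.complexity) >= min_score
        and mean(scenario.execution) >= min_score
    )


def filter_scenarios(scenarios: List[RatedScenario]) -> List[RatedScenario]:
    return [s for s in scenarios if keep(s)]
```

Requiring both ratings to pass, rather than their sum, stops a very complex task with a failed execution from slipping through.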
Compared to other resources, Toucan’s sheer size is a game-changer. About 20% of its scenarios require parallel tool-calling, in which an agent issues several tool calls at once rather than one after another, pushing AI agents to handle multi-step and multi-tool workflows efficiently.
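As a rough illustration of what parallel tool-calling means at execution time, the sketch below runs two independent tool calls concurrently with Python's asyncio. Both tools are hypothetical stand-ins that simulate network latency with a short sleep; a real agent would call MCP servers or APIs here.

```python
import asyncio
import time
from typing import List, Tuple


# Hypothetical tools; each sleep simulates a slow network request.
async def fetch_calendar(user: str) -> str:
    await asyncio.sleep(0.1)
    return f"calendar:{user}"


async def fetch_sales(region: str) -> str:
    await asyncio.sleep(0.1)
    return f"sales:{region}"


async def run_parallel() -> List[str]:
    # Neither call depends on the other, so both are issued together.
    # gather() returns results in the order the calls were passed in.
    return await asyncio.gather(fetch_calendar("ana"), fetch_sales("emea"))


def main() -> Tuple[List[str], float]:
    start = time.perf_counter()
    results = asyncio.run(run_parallel())
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Because the two waits overlap, the total time is close to that of the slowest call rather than the sum of both, which is where the efficiency benefit comes from.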
“Tool-calling is central to AI agents,” said Rameswar Panda, the IBM researcher who led the team behind Toucan. “How can you train better agents? Through diverse, high-quality examples sourced from the real world.”
Performance That Raises the Bar
Early results are already impressive. Smaller, open-source models trained on Toucan have outperformed much larger frontier models on major industry benchmarks like the Berkeley Function Calling Leaderboard (BFCLv3) and Salesforce’s MCP-Universe.
For instance, the Qwen-2.5-32B model, fine-tuned with Toucan, surpassed OpenAI’s GPT-4.5-Preview on BFCLv3, despite being a fraction of the size.
- Efficiency: Parallel tool-calling cuts computational costs and accelerates workflows.
- Smarter choices: Models learn to pick the best tool, engage users effectively, and recognize their own limitations.
- Benchmark gains: Some models improved by up to nine percentage points over the previous state of the art.
The Road Ahead for Toucan
Toucan’s creators are already expanding the dataset by incorporating new MCP servers and tools, mirroring the fast-paced growth of the API ecosystem. They’re also developing a reinforcement learning gym and new benchmarks to help LLMs master enterprise workflows, leveraging insights gained from Toucan’s construction.
A Turning Point for Digital Agents
Toucan delivers the scale, diversity, and real-world context that LLMs need to become genuinely useful digital agents. As the dataset continues to grow, expect smarter, more capable AI tools ready to tackle practical, real-world tasks, bringing us closer to fully autonomous digital assistants.

Toucan Dataset: Transforming AI Agents Into Digital Doers