Enterprises striving to leverage AI for complex tasks often face a trade-off: high accuracy usually comes at a high cost, especially with leading proprietary models. Recent Databricks research reveals that automated prompt optimization can break this trade-off, helping organizations achieve top-tier accuracy with open-source models while dramatically reducing operating expenses.
Benchmarking with Real-World Complexity
Extracting structured information from unstructured documents is a persistent challenge in enterprise AI: it requires handling diverse schemas and specialized terminology while ensuring reliability. To address this, Databricks introduced IE Bench, a benchmark suite designed for tough, domain-specific extraction tasks in finance, legal, healthcare, and more. IE Bench tests models in scenarios that mirror real business needs, providing a truer measure of operational readiness.
The Power of Automated Prompt Optimization
Manual prompt engineering is time-consuming and doesn't scale for enterprise workloads. Automated prompt optimization replaces guesswork with algorithmic rigor, improving prompts through iterative, feedback-driven processes. Notable techniques like GEPA, SIMBA, and MIPROv2 systematically search for the instructions or examples that maximize model accuracy, with no model weight updates or supervised fine-tuning required.
- GEPA combines language reflection and evolutionary search, leading to the largest accuracy gains among optimizers tested.
- These optimizations are pipeline-agnostic, supporting the multi-stage workflows common in enterprise AI.
- Utilizing stronger optimizer models, such as Claude Sonnet 4, can further enhance the performance of open-source models like gpt-oss-120b.
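To make the propose-evaluate-select loop behind these optimizers concrete, here is a minimal, self-contained sketch. Everything in it is a toy stand-in: the examples, the candidate prompts, and the `run_pipeline` function (which fakes an LLM call) are illustrative assumptions, and real optimizers like GEPA generate candidates via language reflection and evolutionary search rather than from a fixed list.

```python
# Toy (document, expected_field) pairs standing in for labeled
# extraction examples; these are illustrative, not IE Bench data.
EXAMPLES = [
    ("Invoice #123 total: $450", "450"),
    ("Invoice #900 total: $12", "12"),
]

def run_pipeline(prompt: str, doc: str) -> str:
    # Stand-in for an LLM call: pretend only a sufficiently specific
    # prompt extracts the field. A real optimizer would query a model.
    if "digits after 'total: $'" in prompt:
        return doc.split("total: $")[1]
    return doc  # vague prompt fails to extract anything

def score(prompt: str) -> float:
    # Fraction of examples extracted correctly: the feedback signal
    # that drives the optimization loop.
    hits = sum(run_pipeline(prompt, d) == y for d, y in EXAMPLES)
    return hits / len(EXAMPLES)

def optimize(seed_prompt: str, candidates: list[str]) -> str:
    # One greedy propose-evaluate-select pass: keep any candidate that
    # beats the current best. Real optimizers iterate, mutating and
    # recombining candidates instead of scanning a fixed list.
    best, best_score = seed_prompt, score(seed_prompt)
    for cand in candidates:
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best

seed = "Extract the total."
candidates = [
    "Extract the total amount.",
    "Return only the digits after 'total: $' in the document.",
]
best = optimize(seed, candidates)
print(f"seed={score(seed)}, optimized={score(best)}")  # seed=0.0, optimized=1.0
```

The point of the sketch is the shape of the loop, not the search strategy: the optimizer only ever sees a metric over examples, which is why the technique is pipeline-agnostic.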
Outperforming Proprietary Solutions at Scale
Applying GEPA optimization to gpt-oss-120b enabled it to surpass industry leaders like Claude Opus 4.1 and Claude Sonnet 4 on IE Bench, all while reducing serving costs by up to 90x. Even proprietary models benefit: GEPA optimization pushed Claude Opus 4.1 to its best-ever performance, showing that automated prompt optimization enhances all model types.
- GEPA-optimized gpt-oss-120b: Outperforms Claude Opus 4.1 by approximately 2.2 points at a fraction of the cost.
- GEPA-optimized Claude Opus 4.1: Sets a new benchmark for IE Bench performance.
- The quality-cost ratio improves markedly for all models post-optimization.
Prompt Optimization versus Supervised Fine-Tuning
While supervised fine-tuning (SFT) has long been the standard for boosting model quality, prompt optimization offers a cost-effective alternative. GEPA optimization alone matches or slightly exceeds SFT's performance while cutting serving costs by about 20%. Combining the two approaches yields even greater accuracy gains, though with a modest uptick in cost.
- Prompt optimization delivers a superior quality-cost balance compared to SFT alone.
- Combining GEPA and SFT maximizes accuracy, yet open-source models with prompt optimization still offer the lowest overall cost.
Scaling Up: Lifetime Cost Matters
For organizations running millions of AI-driven transactions, ongoing serving costs quickly outweigh the initial investment in optimization. GEPA-optimized gpt-oss-120b stands out, maintaining significantly lower total costs over time versus proprietary alternatives. This cost advantage persists even at massive scale, making it practical for high-volume, production-grade deployments.
- The upfront optimization expense is rapidly recouped as usage grows.
- Open-source models with automated prompt optimization deliver lasting savings at scale.
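The break-even arithmetic behind these claims can be sketched in a few lines. All dollar figures below are hypothetical assumptions chosen only to illustrate the shape of the calculation; they are not Databricks' published numbers.

```python
# Hypothetical inputs: a one-time optimization spend, and per-1K-request
# serving costs where the open model is roughly 90x cheaper to serve.
OPTIMIZATION_COST = 500.0        # one-time GEPA optimization run (assumed)
PROPRIETARY_COST_PER_1K = 0.90   # serving cost per 1K requests (assumed)
OPEN_COST_PER_1K = 0.01          # ~90x cheaper serving (assumed)

def lifetime_cost(requests: int, per_1k: float, upfront: float = 0.0) -> float:
    """Total cost: upfront spend plus serving cost for `requests` requests."""
    return upfront + (requests / 1000) * per_1k

# Break-even volume: the request count at which serving savings
# repay the one-time optimization spend.
savings_per_1k = PROPRIETARY_COST_PER_1K - OPEN_COST_PER_1K
break_even = OPTIMIZATION_COST / savings_per_1k * 1000
print(f"break-even at ~{break_even:,.0f} requests")

# At 10M requests, the upfront cost is dwarfed by serving savings.
open_total = lifetime_cost(10_000_000, OPEN_COST_PER_1K, OPTIMIZATION_COST)
closed_total = lifetime_cost(10_000_000, PROPRIETARY_COST_PER_1K)
print(f"open: ${open_total:,.0f} vs closed: ${closed_total:,.0f}")
```

Under these assumed numbers the optimization spend is recouped well before the first million requests, after which every additional request widens the gap, which is the sense in which lifetime cost, not upfront cost, dominates at scale.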
Key Takeaway
Databricks’ research underscores a pivotal shift: automated prompt optimization empowers enterprises to deploy high-performing, cost-efficient AI agents tailored to their real-world needs. Open-source models can now rival or outperform closed-source giants at a fraction of the price, and even proprietary solutions benefit from optimization. With these innovations integrated into Databricks Agent Bricks, organizations are equipped to quickly build, test, and optimize agents, unlocking unprecedented quality and efficiency for enterprise AI.