Blog Posts | Joshua Berkowitz

8 Articles

AI benchmarking

OfficeQA: The Next Frontier in AI Enterprise Reasoning Evaluation

The evolution of AI agents has brought us closer to automating complex business tasks, yet measuring their true capabilities remains a challenge. Databricks' OfficeQA is a newly released, open-source b...

AI benchmarking AI evaluation Databricks data retrieval document intelligence enterprise AI grounded reasoning OfficeQA

Dec 11, 2025

News

Automated Prompt Optimization: Efficient Performance at a Fraction of the Cost

Enterprises striving to leverage AI for complex tasks often face a trade-off: high accuracy usually comes at a high cost, especially with leading proprietary models. Recent Databricks research reveals...

AI benchmarking automation cost reduction Databricks enterprise AI large language models open-source AI prompt optimization

Dec 6, 2025

News

SEAL Showdown: How Real People Are Changing the AI Model Leaderboard

The explosion of large language models (LLMs) has unlocked new ways to interact with technology, but traditional benchmarks often fail to answer a critical question: Which AI model actually works best...

AI benchmarking data labeling demographics LLM comparison model evaluation Scale AI SEAL Showdown user preferences

Sep 30, 2025

News

How Reliable Are LLM Judges? Lessons from DataRobot's Evaluation Framework

Relying on automated judges powered by Large Language Models (LLMs) to assess AI output may seem efficient, but it comes with hidden risks. LLM judges can be impressively confident even when they're w...

AI benchmarking AI trust LLM evaluation machine learning open-source tools prompt engineering RAG systems

Sep 20, 2025

News

QuArch Puts AI Agents to the Test on Computer Architecture

Computer architecture is having an AI moment. Yet despite rapid progress in agentic tooling for coding and verification, hardware-centric knowledge remains stubbornly hard for language models to maste...

AI benchmarking Artificial Intelligence computer architecture Computer Science

Aug 28, 2025

Papers

Introducing LiveMCPBench: Evaluating Models on Large Tool Set Usage

A new arXiv preprint, LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools, from the Chinese Academy of Sciences and UCAS, introduces a benchmark to test AI agents in realistic tool-rich environme...

AI benchmarking AI tools Artificial Intelligence MCP MCP Server

Aug 13, 2025

Papers

TextArena Uses Competitive Gameplay to Advance AI

As language models quickly catch up with and surpass traditional benchmarks, the need for more effective measurement tools becomes urgent. TextArena steps in as an innovative, open-source platf...

agentic AI AI benchmarking LLM evaluation open source reinforcement learning soft skills text-based games TrueSkill

Jul 29, 2025

Papers

AI is Disrupting Medical Diagnostics: Surpassing Human Expertise and Reducing Costs

Imagine solving the toughest medical mysteries faster and more accurately than ever before. This is becoming reality as advanced AI systems are now outperforming even experienced clinicians in diagnos...

AI benchmarking AI healthcare clinical reasoning cost efficiency future of medicine generative AI medical diagnostics

Jul 14, 2025

News
