How Reliable Are LLM Judges? Lessons from DataRobot's Evaluation Framework
Relying on automated judges powered by Large Language Models (LLMs) to assess AI output may seem efficient, but it comes with hidden risks. LLM judges can be impressively confident even when they're w...
Tags: AI benchmarking, AI trust, LLM evaluation, machine learning, open-source tools, prompt engineering, RAG systems
QuArch Puts AI Agents to the Test on Computer Architecture
Computer architecture is having an AI moment. Yet despite rapid progress in agentic tooling for coding and verification, hardware-centric knowledge remains stubbornly hard for language models to maste...
Tags: AI benchmarking, Artificial Intelligence, computer architecture, Computer Science
Introducing LiveMCPBench: Evaluating Models on Large Tool Set Usage
A new arXiv preprint, LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools, from the Chinese Academy of Sciences and UCAS, introduces a benchmark to test AI agents in realistic tool-rich environme...
Tags: AI benchmarking, AI tools, Artificial Intelligence, MCP, MCP Server