The SimplyfAI Guide to LLM Evals

The AI labs ship a new model every few weeks. Each one is supposedly the smartest thing ever built. The benchmark numbers go up. The vibes are immaculate. But which numbers matter, which ones are cooked, and which ones tell you anything about whether the model will actually help you do your job? We pulled apart the eval landscape and kept 9 that are still worth paying attention to — split by whether you're building products, tracking the frontier, or just trying to pick the right AI tool for work.

Confidence key:
🟢 Verified — confirmed against primary sources
🟡 Illustrative — broadly supported, treat as directional
🔴 Needs refresh — leaderboard has moved since snapshot


For Founders & Builders

Can this model do real work?

1. SWE-bench Verified

Can it fix real bugs in real repos and get the tests green?

SWE-bench takes actual bug reports from popular open-source projects on GitHub and asks an AI to fix them — then runs the project's test suite to check. It's not toy puzzles or interview questions. It's the closest thing to "hand this model a real engineering ticket and see if it ships working code." Frontier models now resolve about 80% of these issues, but harder variants like SWE-bench Pro drop that to ~23%, showing how much real-world complexity remains.
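
The grading step is worth understanding because it's fully mechanical: a task counts as resolved only if the tests that reproduced the bug now pass and the rest of the suite still does. Here's a minimal sketch of that check, assuming the harness hands us post-patch test results keyed by test ID — the real SWE-bench harness runs each repo in a pinned environment, so this only illustrates the pass/fail logic:

```python
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Judge one SWE-bench-style task after the model's patch is applied.

    `results` maps test IDs to pass/fail from the post-patch test run.
    A task is resolved only if every originally-failing test now passes
    (FAIL_TO_PASS) and no originally-passing test regressed (PASS_TO_PASS).
    """
    fixed = all(results.get(t, False) for t in fail_to_pass)
    no_regressions = all(results.get(t, False) for t in pass_to_pass)
    return fixed and no_regressions
```

A patch that makes the bug's test pass but breaks an unrelated test scores zero — which is exactly how a human reviewer would treat it.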

🟡 Illustrative
1. Claude Opus 4.5: 80.9%
2. Claude Opus 4.6: 80.8%
3. Gemini 3.1 Pro: 80.6%
Source: swebench.com / marc0.dev, Mar 2026

* The official SWE-bench site is the source of truth but does not always expose per-model scores in a stable snapshot. These figures are drawn from aggregator reporting. Treat as illustrative — check swebench.com for current data.

2. GAIA

Can it use tools, browse the web, reason through steps, and arrive at a correct answer?

GAIA gives an AI questions that are easy for any human — things like "find a specific fact on a website, download a file, and do some maths with the result" — but require the model to chain together browsing, reading, and reasoning across multiple steps. Humans score 92%. When it launched, GPT-4 scored 15%. It exposes whether a model can actually operate in the real world or just answer textbook questions.

🟡 Illustrative
1. Claude Opus 4.6: ~75%
2. Gemini 3 Pro: ~72%
3. GPT-5.2: ~68%
Source: HAL Princeton / HuggingFace, approx. early 2026

* GAIA scores are heavily dependent on the agent framework and scaffold used. Comparing raw scores across models without controlling for prompting setup is not fully meaningful. Human baseline of 92% and GPT-4 baseline of 15% are verified from the original paper.

3. LiveCodeBench

Broader code ability — contamination-resistant, tests self-repair and execution

Older coding benchmarks like HumanEval had a problem: the questions leaked into training data, so models could memorise answers instead of actually solving them. LiveCodeBench fixes this by continuously pulling fresh problems from competitive programming contests. It also tests whether a model can run its own code, spot errors, and fix them — not just produce something that looks right at first glance.
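
The self-repair part of that loop is simple to sketch: run the candidate program, and if it fails, hand the error output back to the model for another attempt. Everything below is illustrative — `generate_fix` is a hypothetical stand-in for an LLM call, not part of any LiveCodeBench API:

```python
import subprocess
import sys
from typing import Callable, Optional

def run_with_repair(source: str,
                    generate_fix: Callable[[str, str], str],
                    stdin_data: str = "",
                    max_rounds: int = 3) -> Optional[str]:
    """Execute candidate code; on failure, request a fix and retry.

    `generate_fix(code, error)` is a stand-in for a model call that
    returns revised source. Returns the program's stdout on the first
    clean run, or None once the repair budget is exhausted.
    """
    for _ in range(max_rounds):
        try:
            proc = subprocess.run([sys.executable, "-c", source],
                                  input=stdin_data, capture_output=True,
                                  text=True, timeout=10)
        except subprocess.TimeoutExpired:
            source = generate_fix(source, "timed out")
            continue
        if proc.returncode == 0:
            return proc.stdout
        # Feed the traceback back so the next attempt can address it.
        source = generate_fix(source, proc.stderr)
    return None
```

In the benchmark proper, the harness also checks outputs against hidden test cases; a program that runs cleanly but prints the wrong answer still fails.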

🟡 Illustrative
1. Gemini 3 Pro: 91.7%
2. Gemini 3 Flash: 90.8%
3. DeepSeek V3.2: 89.6%
Source: Artificial Analysis / Price Per Token, Mar 2026

* Gemini 3 Pro at 91.7% is supported by public leaderboard aggregators. Full trio should be treated as illustrative — primary benchmark pages don't always surface identical rankings.

For Tracking Intelligence

How close are we to the frontier?

1. GPQA Diamond

PhD-level reasoning that's designed to be impossible to Google

GPQA consists of 198 graduate-level science questions written by PhD experts in physics, biology, and chemistry. The twist: even smart non-experts with full internet access only get 34% right. It was designed to test whether a model genuinely understands the science or is just pattern-matching from training data. Top models now score above 90%, which means GPQA is approaching saturation — it's running out of room to separate the best from the rest.

🟢 Verified
1. Gemini 3.1 Pro: 94.1%
2. GPT-5.4: 92.0%
3. GPT-5.3 Codex: 91.5%
Source: Artificial Analysis, Mar 2026

* GPT-5.4 at 92.0% matches Artificial Analysis. OpenAI's own launch post reports 92.8% — the difference is likely methodology. Gemini 3.1 Pro at 94.1% is confirmed across multiple sources.

2. ARC-AGI-2

A different kind of wall — tests genuine abstraction, not accumulated knowledge

Most benchmarks test what a model knows. ARC-AGI tests whether it can figure out something it has never seen before. Each task is a visual pattern puzzle — simple enough that random people off the street solve 60% of them within two attempts, but pure language models often score 0%. It measures the kind of fluid reasoning that more data and bigger models haven't automatically solved. Think of it as: can the model climb a wall it hasn't been wallpapered with during training?

🔴 Needs refresh
1. Gemini 3.1 Pro: 77.1%
2. Claude Opus 4.6: 68.8%
3. GPT-5.2: 54.2%
Source: ARC Prize / designforonline, Feb 2026

* This leaderboard moves fast. GPT-5.4 and GPT-5.4 Pro are now on the ARC Prize board, and refinement harnesses (e.g. Poetiq at 54% → model refinements above 95%) are pushing scores rapidly. These figures are an older snapshot — check arcprize.org/leaderboard for current data.

3. HLE (Humanity's Last Exam)

2,500 expert-vetted questions where even frontier models still mostly fail

Created by nearly 1,000 subject-matter experts across 50 countries and published in Nature, HLE was designed to be the last closed-ended academic exam AI would ever need — because once models pass it, there's nothing harder left to ask in a standardised format. When it launched in early 2025, the best models scored single digits. A year later, the top score is still only ~45%. It's the clearest measure of how far we remain from expert-level AI across all of human knowledge.

🟢 Verified
1. Gemini 3.1 Pro: 44.7%
2. GPT-5.4: 41.6%
3. GPT-5.3 Codex: 39.9%
Source: Artificial Analysis, Mar 2026

* GPT-5.4 at 41.6% matches Artificial Analysis. OpenAI's launch post reports 39.8% without tools, 52.1% with tools — framing matters. Scores from Scale AI's official leaderboard may differ slightly due to evaluation methodology.

For Everyday Users

Which AI should I actually use at work?

1. Chatbot Arena: the taste test

Real people compare two AIs blind and vote for the one they prefer

Arena shows you two anonymous AI responses side by side and lets you pick the better one. With over 5.4 million votes across 323 models, it's the largest crowd-sourced AI evaluation in the world. It captures something no automated test can: which model actually feels better to talk to — in tone, judgment, helpfulness, and knowing when to be concise versus thorough. If you only look at one metric before choosing an AI tool, this is the one.
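
Mechanically, each vote nudges two ratings: the winner takes points from the loser in proportion to how surprising the win was. A back-of-envelope sketch (the K-factor is illustrative, and Arena's published rankings actually come from a Bradley–Terry fit over all votes rather than this online update):

```python
def elo_update(r_winner: float, r_loser: float,
               k: float = 32.0) -> tuple[float, float]:
    """Apply one pairwise vote and return the new (winner, loser) ratings."""
    # Expected win probability for the eventual winner, from the rating gap.
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)  # small if the win was already expected
    return r_winner + delta, r_loser - delta
```

Two models at 1000 move to 1016 and 984 after one vote, while an upset over a much stronger opponent shifts the ratings further. That's why a stable lead at the top of the board takes a lot of votes to build.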

🟡 Illustrative
1. Claude Opus 4.6: Elo 1504
2. Gemini 3.1 Pro: Elo 1500
3. Grok 4.20: Elo 1493
Source: arena.ai, Mar 2026

* Supported by recent public reporting. Arena Elo shifts over time as new votes accumulate and models are added. Treat as a directional snapshot, not a fixed ranking.

2. MMLU-Pro: the knowledge test

A massive exam across dozens of subjects — is this model well-rounded or narrow?

MMLU-Pro is a 12,000-question multiple-choice exam covering 14 subject areas from business and law to science and engineering, with 10 answer options instead of the usual 4. It's the evolved version of MMLU, the old "general knowledge" benchmark that frontier models had mostly maxed out. The "Pro" version requires genuine multi-step reasoning rather than just recognising familiar patterns. Think of it as asking: if I give this AI a question outside its comfort zone, will it fall apart?

🟡 Illustrative
1. Gemini 3 Pro: 89.8%
2. Claude Opus 4.5: 89.5%
3. Gemini 3 Flash: 89.0%
Source: Artificial Analysis / Price Per Token, Mar 2026

* Gemini 3 Pro at 89.8% is supported by public aggregators. Full trio should be treated as illustrative. Note that some sources indicate MMLU-Pro is approaching saturation for frontier models.

3. GDPval: the real work test

Can AI produce work that matches what a human professional delivers?

Built by OpenAI with experienced professionals averaging 14 years in their fields, GDPval gives an AI the same tasks a real lawyer, nurse, engineer, or financial analyst would do at work — then has human experts blindly compare the AI's output to a professional's. It covers 44 occupations across 9 industries contributing to US GDP. It's the most direct answer to "could this AI do my actual job?" that any benchmark has attempted. Scores are Elo-based, where parity with a human expert sits around 50% win rate.

🟢 Verified
1. GPT-5.4: Elo 1667
2. Claude Sonnet 4.6: Elo 1633
3. Claude Opus 4.6: Elo 1606
Source: Artificial Analysis (GDPval-AA), Mar 2026

* All three figures are supported by Artificial Analysis snippets and Anthropic's Sonnet 4.6 reporting. GDPval was built by OpenAI — worth noting the benchmark creator's own models tend to be well-optimised for their own evals.
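
The Elo framing has a direct probabilistic reading: a rating gap implies an expected win rate, regardless of the absolute numbers. The standard Elo expectation in one line:

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Probability that A beats B implied by an Elo rating gap."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))
```

Equal ratings give exactly 0.5, and a 34-point lead like GPT-5.4's over Claude Sonnet 4.6 in the snapshot above implies only about a 55% expected win rate: a visible edge, not dominance.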

A word of caution: benchmarks can be gamed. The industry calls it "benchmaxing" — when labs optimise their models specifically to score well on popular evals, sometimes at the expense of general capability. A model that tops a leaderboard may have been fine-tuned on similar questions, trained with test-set contamination, or evaluated under conditions that flatter its strengths. High scores don't always mean better real-world performance, and a model that wins on paper can still fumble on your actual workload. Treat every number on this page as a signal, not a verdict.

This is the first in a series of eval posts from SimplyfAI. In the ones that follow, we'll be running our own tests — putting these models through real tasks, in real conditions, with real constraints — to give you something the leaderboards can't: an honest sense of what each model is actually like to use.
