Second Brain

Best LLM Judge Models in 2026: 8 Models Ranked on Calibration, Cost, and Self-Preference

Eight LLM judge models compared on human correlation, cost per score, latency, and self-preference bias. Pick by your rubric, not by SummEval.

April 18, 2025 · Updated May 20, 2026 · 17 min read

llm-as-judge llm-judge-models judge-calibration self-preference galileo-luna-2 turing-models prometheus-2 2026

Table of Contents
TL;DR: which judge wins on which axis
The three axes that actually matter
The eight judge models worth running in 2026
1. Claude Sonnet 4.5 / Opus 4.x: the calibration ceiling
2. GPT-5 / GPT-5-mini: the structured-output workhorse
3. Gemini 2.5 Pro / Flash: long context, multimodal, cost-balanced
4. Galileo Luna-2: the purpose-built evaluator
5. Future AGI Turing series (turing_large, turing_flash): fine-tune meets full stack
6. DeepSeek-V3: open-weight cost leader
7. Llama 3.3 70B: the regulated self-host pick
8. Prometheus 2 (8x7B): the open-weight evaluator fine-tune
How the eight stack up
The cascade pattern beats picking one judge
How to pick: a 90-minute calibration sprint
Common mistakes when picking an LLM judge model
Recent shifts worth tracking
Where Future AGI fits in this stack
Related reading

You ship a customer-support agent. The judge is GPT-4o and the helpfulness rubric reads 0.91 every Monday. In March the judge bumps to a 4o minor version.
In April the agent quotes a refund off by an order of magnitude. The rubric still reads 0.91. The signal stopped meaning what you thought it meant the day the judge changed.

Most posts on this topic rank judge models by a single SummEval Spearman or an MT-Bench winrate. Those numbers do a lot of unspoken work. A judge that hits 0.514 Spearman on summarization is not automatically the judge that holds for two years on your customer-support rubric. The benchmark measures one task on one dataset with one rubric the paper authors wrote.

The thesis this post defends: pick a judge model by three axes. Human correlation against your rubric. Cost per score at your traffic volume. Self-preference bias against your candidate models. The best judge for your eval wins on your rubric, not on SummEval. We compare eight models worth running on a calibration set in May 2026.

Methodology note: scoring axes below are calibration ceiling (kappa against human labels on subjective rubrics), cost per million output tokens, p95 latency on a 2K-token transcript, self-preference bias documented across published work, and license/deployment shape. Pricing verified May 2026 against vendor pricing pages. Calibration is task-dependent. Treat these as starting points, not procurement decisions.
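The calibration axis in the methodology note, kappa against human labels, can be made concrete with a few lines of Python. This is a minimal sketch of Cohen's kappa for one judge against one human labeler; the function name, the toy pass/fail labels, and the choice to return 1.0 when chance agreement is total are illustrative assumptions, not anything published with the post.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of examples where both raters match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[c] * judge_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined here.
        # Returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail labels on ten examples (1 = pass, 0 = fail).
human_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

kappa = cohens_kappa(human_labels, judge_labels)
print(round(kappa, 3))  # 0.6
```

Raw agreement on this toy set is 80 percent, but half of that is expected by chance with balanced labels, which is why kappa reads 0.6 and why it is the safer number to compare across judges.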
TL;DR: which judge wins on which axis

| Axis | Pick |
| --- | --- |
| Best calibration ceiling on subjective rubrics | Claude Sonnet 4.5, Claude Opus 4.x |
| Best structured-output reliability and cost-quality balance | GPT-5 + GPT-5-mini cascade |
| Best long-context judging (over 200K) and multimodal rubrics | Gemini 2.5 Pro |
| Cheapest frontier-tier judge at scale | Gemini 2.5 Flash, GPT-5-mini |
| Best fine-tuned eval-specific judge (closed) | Galileo Luna-2 |
| Best fine-tuned eval-specific judge (with full stack integration) | Future AGI turing_large / turing_flash |
| Best open-weight judge on cost | DeepSeek-V3 |
| Best self-hosted regulated judge | Llama 3.3 70B |
| Best open-weight evaluator-specific fine-tune | Prometheus 2 (8x7B) |
| Worst idea | Same model as judge and candidate |

If you read one row: there is no single winner. Run two or three candidates on a 100-to-300 example human-labeled set, measure kappa, divide by cost per score, then make the call.

The three axes that actually matter

Published leaderboards mislead because they collapse the choice to one number. In production, three axes bind separately and the binding axis changes by team.

Human correlation on your rubric. Cohen’s kappa or Spearman against a human-labeled hold-out, computed on your data, not on SummEval. A marketing-copy rubric tolerates kappa around 0.6. A medical-advice rubric needs 0.85 or higher. A judge that hits 0.85 on a public benchmark and 0.55 on your dataset is a judge that does not know your domain. The fix is unglamorous: hand-label 100 to 300 examples covering the failure modes that matter, run every candidate judge against the same set, and only then read the leaderboard.

Cost per score at your traffic volume. A frontier judge call on a 30-second agent trace costs $0.01 to $0.05 depending on judge and tokens. At a million traces a day that is $300K to $1.5M monthly. The judge that wins on kappa and loses on dollar-per-score loses the procurement.
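The monthly bill follows from simple arithmetic: cost per score times daily trace volume times days in the month. The sketch below works it through for the $0.01 and $0.05 per-score endpoints at one million traces a day. The function name, the 30-day month, and the `sample_rate` parameter (the fraction of traffic actually sent to the judge) are assumptions made for illustration.

```python
def monthly_judge_cost(cost_per_score_usd, traces_per_day,
                       days_per_month=30, sample_rate=1.0):
    """Estimated monthly judge spend in USD.

    sample_rate is a hypothetical knob: 1.0 scores every trace,
    0.1 scores a 10 percent sample.
    """
    scored_per_day = traces_per_day * sample_rate
    return cost_per_score_usd * scored_per_day * days_per_month


low = monthly_judge_cost(0.01, 1_000_000)
high = monthly_judge_cost(0.05, 1_000_000)
print(f"${low:,.0f} to ${high:,.0f} per month")  # $300,000 to $1,500,000 per month

# Scoring a 10 percent sample at the low-end price.
sampled = monthly_judge_cost(0.01, 1_000_000, sample_rate=0.1)
print(f"${sampled:,.0f} per month")  # $30,000 per month
```

Plugging in your own per-score cost and traffic makes it easy to see at what volume a cheaper judge or a sampling policy starts to matter.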
Three patterns rescue the bill: fine-tuned judges that score at one to ten percent of frontier cost per call, classifier cascades that escalate only close cases to the frontier model, and sample-don’t-score on routine traffic.

Self-preference bias against your candidates. A judge prefers outputs from its own family by a 10 to 25 percent margin, per Zheng et al. 2024. The cardinal mistake is using the same model as judge and candidate. The second mistake is judging GPT-5 candidates with a GPT-5 judge and Sonnet 4.5 candidates with a Sonnet judge in the same eval suite, which builds family bias into the comparison. The mitigation is a three-judge ensemble across families on launch decisions and single-family judges only for trend tracking.

Add one operational axis underneath the three: judge version stability. Sonnet 4.5 to Sonnet 4.6 is a minor bump that ships with a new training mix. The mean rubric score shifts 3 to 8 points; the distribution narrows. If you do not pin the judge model id inside the eval contract, the dashboard moves even though the agent did not. See why LLM-as-a-judge and G-Eval definitive guide for the contract pattern.

The eight judge models worth running in 2026

1. Claude Sonnet 4.5 / Opus 4.x: the calibration ceiling

Closed-weight frontier. Anthropic, Bedrock, Vertex.

Where it earns its bill. Subjective rubrics that need open-ended reasoning over long context. Multi-document RAG faithfulness, multi-turn conversation adherence, agent-trajectory rubrics. Sonnet 4.5 sits at the top of LMSYS Chatbot Arena’s pairwise preference table and produces reasoning chains dense enough to survive an audit log. The 200K window covers most production transcripts in one pass.

Cost. $3 input / $15 output per 1M tokens for Sonnet 4.5; Opus 4.x runs higher. A 2K-input, 200-output call lands near $0.009 per score. Viable as the second-stage judge in a cascade. Expensive as the first pass.

Latency. 1.5 to 3 seconds p95. Not an inline-guardrail judge.

Self-preference.
Claude prefers Claude-family outputs in published evaluations. Do not use Sonnet 4.5 to judge a Sonnet 4.5 candidate. Pair with GPT or Gemini in an ensemble for launch decisions.

Best for. High-stakes judging where calibration matters more than cost. Pre-launch validation. The frontier slot in a two-stage cascade.

2. GPT-5 / GPT-5-mini: the structured-output workhorse

Closed-weight frontier. OpenAI, Azure OpenAI.

Where it earns its bill. Structured-output evaluation. GPT-5 in JSON mode has the lowest parse-failure rate in this list, which matters when 100K judgments at a 5 percent parse-failure rate means 5,000 retries. The two-stage cascade, in which GPT-5-mini screens and GPT-5 rescores, lands under 30 percent of GPT-5-only cost at the same calibration target.

Cost. Verify on the pricing page. GPT-5-mini is roughly an order of magnitude cheaper per million tokens than GPT-5.

Latency. GPT-5-mini sub-second; GPT-5 at 1.5 to 3 seconds p95.

Self-preference. GPT-4 was the original self-preference data point in Zheng et al. 2024 at a 10 to 25 percent margin. Treat the bias as inherited by GPT-5 until measured.

Best for. General-purpose judge in OpenAI stacks. Strict structured-output rubrics where parse failures kill throughput.

3. Gemini 2.5 Pro / Flash: long context, multimodal, cost-balanced

Closed-weight frontier. Vertex AI, AI Studio.

Where it earns its bill. Context windows past 200K. Multimodal judging on image, audio, and video inputs. Gemini 2.5 Pro carries a 1M-plus context window with experimental 2M tiers. The only judge here that scores a whole-document multi-doc RAG response in one pass. Flash brings frontier-tier reasoning to a price point where million-span-a-day scoring is financially viable; Vertex Batch Prediction discounts offline workloads further.

Latency. Flash sub-second on short inputs; Pro at 2 to 4 seconds on long-context judging. Region-dependent.

Self-preference. Less publicly documented than GPT or Claude.
Assume it is present, measure it on your set, and do not judge Gemini candidates with a Gemini judge.

Worth flagging. Judging quality in the 1M-plus window degrades subtly past a few hundred thousand tokens. Calibrate empirically on long inputs.

Best for. Long-context judging where 200K is not enough. Multimodal rubrics. Cost-sensitive high-volume scoring with Flash.

4. Galileo Luna-2: the purpose-built evaluator

Closed-weight fine-tune. Galileo.

Where it earns its bill. Eval-specific fine-tunes win on cost-per-score for the rubrics they were trained on. Luna-2 is a 2B-parameter model fine-tuned for hallucination, context adherence, and tool-call correctness. Galileo published agreement numbers against GPT-4o on internal benchmarks and prices it at a fraction of frontier judge calls. The reference category for “small purpose-built judge beats the big general-purpose one on its home turf.”

Latency. Sub-second by design.

Se