LLM-as-judge tools compared: the question is not which one scores, it is which one you can trust

#ai #machinelearning #datascience #llm

TL;DR: I compared the main LLM-as-judge tools (DeepEval's G-Eval, Confident AI, Evidently, Braintrust, Promptfoo, and MLflow) on the axis that actually decides whether the scores mean anything: how well each helps you VALIDATE the judge against human labels. A judge that has not been checked against humans is just a second opinion with the same blind spots, and most tooling makes it easy to run a judge and hard to prove it agrees with you.

A judge you have not validated is not a measurement

An LLM-as-judge has known failure modes: position bias (prefers an answer because of where it appears, often the first), verbosity bias (prefers the longer one), and self-preference (prefers outputs from its own model family). Run it unvalidated and you inherit all three silently. The only thing that turns a judge into a measurement is checking its agreement with human labels on a held-out set, with an actual statistic (Cohen's kappa, not "looks about right"). So I judge the judge-tools by how much they help with that.
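One way to probe position bias is to judge each pair twice with the order swapped and count how often the verdict sticks to the slot instead of the answer. The sketch below assumes a hypothetical pairwise judge callable, `judge(question, first, second)`, that returns "first" or "second"; the two stub judges stand in for a real model call and are not any tool's API.

```python
def position_inconsistency_rate(judge, pairs):
    """Share of pairs where the verdict follows the slot, not the answer."""
    if not pairs:
        raise ValueError("need at least one (question, answer_x, answer_y) pair")
    inconsistent = 0
    for question, answer_x, answer_y in pairs:
        original = judge(question, answer_x, answer_y)
        swapped = judge(question, answer_y, answer_x)
        # A position-independent judge picks the same answer both times,
        # so the slot it names must change when the order is swapped.
        if original == swapped:
            inconsistent += 1
    return inconsistent / len(pairs)


# Hypothetical stub judges standing in for a real LLM call.
def always_first(question, first, second):
    return "first"


def prefers_longer(question, first, second):
    return "first" if len(first) >= len(second) else "second"


pairs = [
    ("What is 2 + 2?", "4", "It is 4."),
    ("Capital of France?", "Paris", "Paris, the capital city."),
]

first_rate = position_inconsistency_rate(always_first, pairs)
longer_rate = position_inconsistency_rate(prefers_longer, pairs)
print(first_rate, longer_rate)
```

The length-loving stub passes this check cleanly while being pure verbosity bias, which is the point: each bias needs its own probe, and none of them replaces agreement with human labels.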

The six, by how much they help you validate

DeepEval (G-Eval): the popular pick. G-Eval gives you chain-of-thought judge metrics out of the box and a pytest-style harness. Strong on running judges; you bring your own human-label comparison.
Confident AI: the hosted layer on DeepEval, useful for storing and sharing runs, with the same validation gap to close yourself.
Evidently: strong on report-style dashboards and drift, including LLM-judge descriptors; good if you want monitoring framing.
Braintrust: a clean UI for comparing judge outputs side by side across runs, which helps you eyeball disagreement even if it does not compute kappa for you.
Promptfoo: treats the judge as an assertion in a test matrix; lightweight and CI-friendly, thin on judge-vs-human stats.
MLflow: fits if MLflow is already your tracking backbone; judge metrics plug into the same runs and registry.

None of them, as of June 2026, makes "compute the judge's agreement with my human labels and show me the confusion matrix" a one-click default, which is the step that actually decides whether the judge is trustworthy. You still wire it.

How I actually validate a judge

Label 200 examples by hand. Run the judge on the same 200. Compute Cohen's kappa (chance-corrected agreement), not raw accuracy. If kappa comes in below about 0.6, the judge is not ready: read the confusion matrix to see which classes it confuses, fix the rubric, and re-measure. Only once kappa clears that bar do I trust the judge on the unlabeled rest.
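The wiring is small. Here is a minimal sketch in plain Python, assuming the human labels and judge labels are two aligned lists of categorical values; the function names and the ten-item toy data are illustrative, and in practice the lists would hold the 200 labeled examples.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both labels match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if both raters labeled at random with their own
    # marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


def confusion_counts(human, judge):
    """Count of each (human_label, judge_label) pair."""
    return Counter(zip(human, judge))


# Toy data: 10 items instead of 200, labels are illustrative.
human = ["pass"] * 5 + ["fail"] * 5
judge = ["pass"] * 4 + ["fail"] + ["fail"] * 4 + ["pass"]

kappa = cohen_kappa(human, judge)
matrix = confusion_counts(human, judge)

print(round(kappa, 3))  # prints 0.6
for (h, j), count in sorted(matrix.items()):
    print(f"human={h:<5} judge={j:<5} n={count}")
```

Raw accuracy on this toy set is 0.8, but half of that is what two coin-flippers with the same label balance would reach, so kappa lands at 0.6. If scikit-learn is already a dependency, `sklearn.metrics.cohen_kappa_score` and `sklearn.metrics.confusion_matrix` cover the same ground.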

Open question

Kappa against my labels assumes my labels are right. On genuinely subjective dimensions (helpfulness, tone) two careful humans disagree, so the ceiling on judge-human agreement is the human-human agreement, which I rarely measure. I do not have a clean way to know whether a kappa of 0.55 means a bad judge or an irreducibly subjective task. If you have one, I want to read it.
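Measuring that ceiling is at least mechanical, even if interpreting it is not. A minimal sketch, assuming a second annotator labels the same items (the annotator lists, the function names, and the toy labels are all hypothetical): compute human-human kappa next to judge-human kappa so the two numbers can be read side by side. It does not settle the question above; it only puts the ceiling on the page.

```python
from collections import Counter


def kappa(a, b):
    """Cohen's kappa for two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    a_counts, b_counts = Counter(a), Counter(b)
    expected = sum(a_counts[c] * b_counts[c] for c in a_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both use one identical label
    return (observed - expected) / (1.0 - expected)


def agreement_report(human_a, human_b, judge):
    """Judge-human kappa alongside the human-human ceiling."""
    return {
        "human_human": kappa(human_a, human_b),
        "judge_vs_a": kappa(judge, human_a),
        "judge_vs_b": kappa(judge, human_b),
    }


# Toy labels for a subjective dimension such as tone.
human_a = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
human_b = ["good", "good", "good", "bad", "bad", "bad", "bad", "good"]
judge = list(human_a)  # a judge that happens to mirror annotator A

report = agreement_report(human_a, human_b, judge)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

In this toy case the judge agrees with annotator B exactly as well as annotator A does (0.5), which reads as "the judge is as good as a second human" instead of "the judge is bad".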