Benchmarking LLM-evaluation platforms

Shashank Agarwal

6/27/2026

#noveum #evaluations #benchmark #llm-as-judge #novaeval #analysis #reports

Today we're publishing a benchmark of eight LLM-evaluation platforms — Noveum, Arize/Phoenix, Braintrust, Maxim, Galileo, Patronus, DeepEval, and Ragas. We built a retrieval-augmented support agent, ran it through 100 adversarial conversations — 440 assistant turns, of which 131 were hallucinations — and asked each platform to find the agent's mistakes using its own native scorers, run reference-free in its documented best setup. Ground truth was set by a model that none of the platforms use: Claude Opus-4.8.

Across six dimensions, Noveum has the highest overall accuracy and the most even coverage, with an equal-weighted composite score of 0.999, against 0.650 for the next platform, DeepEval. The field is closer on individual dimensions, and Noveum does not lead all of them: Ragas leads on context-relevance, Braintrust has the lowest false-alarm rate, and DeepEval ties Noveum on tool selection. All results are reported below.

Noveum builds one of the eight platforms, and the agent under test answers questions about Noveum's own documentation — a home-domain advantage that the neutral reference and the best-native competitor configurations reduce but do not fully remove. Ground truth came from a third party's model, every competitor ran in its strongest documented configuration, and the agent, the traces, the reference labels, every scorer, and the raw per-platform scores are all published, so any number here can be reproduced.

Results across every dimension

The table below summarizes the full study. Noveum leads or ties seven of the nine metrics; where another platform leads, the table records it: Braintrust has the lowest false-alarm rate (the share of correct answers a scorer wrongly flags) at 8%, Ragas the highest context-relevance correlation at 0.95, and DeepEval ties Noveum on tool selection at 100%.

Results table across nine metrics for eight platforms. Noveum leads composite (0.999), faithfulness F1 (0.84), answer-relevance (0.78), tool selection (100%), tool parameters (84%), role-adherence (0.79), and latency (0.59s). Braintrust has the lowest false-alarm (8%), DeepEval ties tool selection (100%), Ragas leads context-relevance (0.95).

The composite is an equal-weighted average of five dimensions that share a comparable metric. Noveum scores 0.999, then DeepEval 0.650, Arize/Phoenix 0.576, Galileo 0.558, Ragas 0.490, Maxim 0.345, Braintrust 0.274, and Patronus 0.078. It summarizes the table; the per-dimension charts that follow show the detail.

Composite score bar chart: Noveum 0.999, DeepEval 0.650, Arize 0.576, Galileo 0.558, Ragas 0.490, Maxim 0.345, Braintrust 0.274, Patronus 0.078.
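As a minimal sketch of what "equal-weighted" means here: each dimension contributes the same share, so the composite is a plain mean of the per-dimension scores. The dimension names and values below are hypothetical placeholders. The report does not list which five dimensions enter the composite or how each is normalized, so this shows only the averaging step.

```python
from statistics import fmean


def composite(dimension_scores: dict[str, float]) -> float:
    """Equal-weighted composite: every dimension counts the same."""
    if not dimension_scores:
        raise ValueError("need at least one dimension score")
    return fmean(dimension_scores.values())


# Hypothetical scores on a shared 0-1 scale, not taken from the study.
example_platform = {
    "dimension_a": 1.00,
    "dimension_b": 0.50,
    "dimension_c": 0.75,
    "dimension_d": 0.25,
    "dimension_e": 0.50,
}
print(f"{composite(example_platform):.3f}")
```

Because the weights are equal, one weak dimension pulls the composite down as much as one strong dimension lifts it, which is why the per-dimension results matter more than the single number.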

Faithfulness

Faithfulness is whether each claim in an answer is supported by the retrieved context. It is the most important dimension for evaluation that runs without a person checking the output. A reliable scorer must combine high recall (it catches real hallucinations) with a low false-alarm rate (it rarely flags a correct answer).
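The two quantities can be sketched from per-turn labels as follows. This is an illustration under assumptions: the function name, the data layout (one boolean per turn), and the toy data are hypothetical and are not the study's published format.

```python
from dataclasses import dataclass


@dataclass
class GateMetrics:
    recall: float            # share of real hallucinations that were flagged
    false_alarm_rate: float  # share of correct answers that were flagged


def gate_metrics(is_hallucination: list[bool], flagged: list[bool]) -> GateMetrics:
    """Compare a scorer's flags with reference labels, one entry per turn."""
    if len(is_hallucination) != len(flagged):
        raise ValueError("labels and flags must have the same length")
    pairs = list(zip(is_hallucination, flagged))
    true_pos = sum(1 for ref, flag in pairs if ref and flag)
    false_neg = sum(1 for ref, flag in pairs if ref and not flag)
    false_pos = sum(1 for ref, flag in pairs if not ref and flag)
    true_neg = sum(1 for ref, flag in pairs if not ref and not flag)
    if true_pos + false_neg == 0 or false_pos + true_neg == 0:
        raise ValueError("need at least one hallucinated and one correct turn")
    return GateMetrics(
        recall=true_pos / (true_pos + false_neg),
        false_alarm_rate=false_pos / (false_pos + true_neg),
    )


# Hypothetical toy data: 4 hallucinated turns followed by 6 correct turns.
reference = [True, True, True, True, False, False, False, False, False, False]
flags = [True, True, True, False, True, False, False, False, False, False]
metrics = gate_metrics(reference, flags)
print(metrics)
```

Note that the two rates use different denominators: recall is measured over hallucinated turns only, the false-alarm rate over correct turns only.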

Faithfulness on 440 turns. Recall with 95% confidence intervals: Noveum 89%, Arize 82%, Braintrust 80%, DeepEval 73%, Ragas 94%, Patronus 48%. False-alarm rate: Noveum 10%, Arize 11%, Braintrust 8%, DeepEval 10%, Ragas 37%, Patronus 18%.
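The report does not state how its 95% confidence intervals were computed. One common choice for a proportion such as recall is the Wilson score interval, sketched below under that assumption. The sample size of 131 is the number of hallucinated turns reported in the study; the recall values are the rounded figures above, so the intervals are approximate.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_HALLUCINATED = 131  # hallucinated turns in the study
reported_recall = {"Noveum": 0.89, "Arize": 0.82, "Braintrust": 0.80}
intervals = {
    name: wilson_interval(recall, N_HALLUCINATED)
    for name, recall in reported_recall.items()
}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f}")
```

With only 131 positive examples, each interval is several percentage points wide on either side, so single-digit differences in recall between scorers are hard to separate from noise.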

On 440 turns, Noveum catches 89% of hallucinations while wrongly flagging 10% of correct answers, for an F1 of 0.84. Braintrust (80% / 8%) and Arize (82% / 11%) are close behind, and their 95% confidence intervals overlap with Noveum's, so the top three are a statistical tie, and we make no recall or false-alarm claim within that group. Ragas catches the most, 94%, but flags 37% of correct answers; at that calibration it cannot run as an automatic gate, because it would block more than a third of the correct answers it sees. Patronus, a purpose-built hallucination model, catches 48%: it reliably flags blatant fabrications but misses the fluent, in-context-looking ones that dominate this dataset.
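The F1 figure can be re-derived from the two reported rates and the class sizes. The sketch below assumes the false-alarm rate is measured over the 309 turns that are not hallucinations (440 minus 131), and it works from the rounded percentages, so the results are approximate.

```python
def precision_and_f1(
    recall: float, false_alarm_rate: float, n_positive: int, n_negative: int
) -> tuple[float, float]:
    """Turn recall and false-alarm rate into precision and F1 for given class sizes."""
    true_pos = recall * n_positive             # hallucinations that were flagged
    false_pos = false_alarm_rate * n_negative  # correct answers that were flagged
    precision = true_pos / (true_pos + false_pos)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1


N_TURNS = 440
N_HALLUCINATED = 131
N_CORRECT = N_TURNS - N_HALLUCINATED  # 309, assumed denominator for false alarms

noveum_precision, noveum_f1 = precision_and_f1(0.89, 0.10, N_HALLUCINATED, N_CORRECT)
ragas_precision, ragas_f1 = precision_and_f1(0.94, 0.37, N_HALLUCINATED, N_CORRECT)

print(f"Noveum F1: {noveum_f1:.2f}")
print(f"Ragas share of flags that land on correct answers: {1 - ragas_precision:.2f}")
```

Under these assumptions the Noveum figure comes out at about 0.84, matching the reported F1. For Ragas, a 37% false-alarm rate means that close to half of everything it flags is a correct answer, because correct turns outnumber hallucinated ones on this dataset.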

To check that this reflects the method and not simply a stronger judge model, we ran Noveum's scorer on the same GPT model the competitors use. It still scored higher than Braintrust's scorer there — 82% versus 55% recall — so the difference is in the method, not only the model.

One further factor shapes the faithfulness result: which of a platform's own scorers you choose. Braintrust's ClosedQA scorer flags 8% of correct answers; its autoevals.ragas scorer flags 66% on the same data. DeepEval's GEval catches 73% of hallucinations; its FaithfulnessMetric catches 1%. Both are real, documented scorers. The unit of comparison is therefore not the platform but the specific scorer, in a specific configuration, on a given kind of traffic.

The aggregate recall also hides how differently scorers handle different kinds of mistakes.

Recall by failure type. Fabricated code: Ragas 88%, Noveum 75%, Braintrust 42%, Arize 38%, DeepEval 21%, Patronus 21%. Pressure for facts: Noveum 100%, Ragas 96%, Arize 89%, Braintrust 86%, DeepEval 86%, Patronus 79%.

Multi-step reasoning errors were caught by every general-purpose scorer at 100%; only the narrow Patronus model missed them, at 41%. Fabricated code — a non-existent SDK method written inside otherwise-valid code — was the hardest case: recall ranged from 21% to 88%. Noveum caught 75% of fabricated code and 100% of pressure-to-answer fabrications; its weakest family was prompt injection, at 55%, where per-chunk scorers did better. Ragas's 88% on fabricated code reflects the same low precision that produces its 37% false-alarm rate elsewhere. Part of what makes this hard is that 78% of the agent's hallucinations came with no retrieval at all — the agent skipped the lookup and answered anyway — so a faithfulness scorer has to treat an empty retrieval as a real signal rather than abstain.
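A per-family breakdown like the one above can be sketched by grouping hallucinated turns by failure type before computing recall. The row layout and the family labels in the example are hypothetical and are not the published dataset schema.

```python
from collections import defaultdict
from typing import Iterable


def recall_by_family(rows: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """rows: (failure_family, was_flagged) pairs for hallucinated turns only."""
    caught: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for family, was_flagged in rows:
        total[family] += 1
        caught[family] += int(was_flagged)
    return {family: caught[family] / total[family] for family in total}


# Hypothetical rows, not taken from the published dataset.
rows = [
    ("fabricated_code", True),
    ("fabricated_code", False),
    ("fabricated_code", False),
    ("fabricated_code", True),
    ("prompt_injection", True),
    ("prompt_injection", False),
    ("multi_step_reasoning", True),
    ("multi_step_reasoning", True),
]
per_family = recall_by_family(rows)
overall = sum(flag for _, flag in rows) / len(rows)
print(per_family, overall)
```

In this toy data the overall recall of 0.625 sits between a family caught every time and two caught half the time, which is the kind of spread a single aggregate number conceals.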

Relevance

Relevance has two parts: whether the retrieved chunks were relevant to the question (context-relevance), and whether the answer addressed the question (answer-relevance). We measure each as the correlation between a platform's score and the reference.

Correlation with the reference. Context-relevance: Noveum 0.945, Ragas 0.95, DeepEval 0.92, Arize 0.87, Braintrust 0.15. Answer-relevance: Noveum 0.78, Ragas 0.27, DeepEval 0.04, Arize 0.06, Braintrust 0.24.

Ragas leads context-relevance at r = 0.95, with Noveum just behind at 0.945, then DeepEval at 0.92 and Arize at 0.87; Braintrust's scorer was essentially uncorrelated here, at 0.15. On answer-relevance, Noveum reaches 0.78 while the rest fall between 0.04 and 0.27. This is the lowest-powered dimension in the study — the reference's answer-relevance labels vary little on this dataset — so this should be read as the other scorers failing to separate good answers from bad ones here while Noveum's does, rather than as Noveum being several times better.
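The report writes the correlation as r, which suggests Pearson's coefficient, though the exact statistic is not stated. Under that assumption, a minimal sketch looks like this. The score lists are hypothetical.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between a platform's scores and the reference scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A series with no variation carries no ranking information.
        raise ValueError("correlation is undefined when either series is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-turn scores on a 0-1 scale.
reference_scores = [1.0, 0.8, 0.2, 0.0, 0.6]
platform_scores = [0.9, 0.7, 0.3, 0.1, 0.5]
r = pearson_r(reference_scores, platform_scores)
print(f"r = {r:.2f}")
```

The guard on zero variance is the limiting case of the caveat above: when reference labels vary little, the denominator shrinks and the coefficient becomes sensitive to a handful of turns.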

Tool calls

Agents also call tools, and the failures that matter are calling the wrong tool or calling the right one with a fabricated argument. Four platforms expose a tool scorer that runs locally; Braintrust and Patronus ship none, and Galileo and Maxim run theirs server-side. The chart shows specificity — how often a scorer correctly leaves a correct tool call alone.

Tool selection specificity: Noveum 100%, DeepEval 100%, Arize 89%, Ragas 13%. Tool parameter specificity: Noveum 84%, Arize 81%, DeepEval 35%, Ragas 35%.

On tool selection, Noveum and DeepEval both reach 100%. On parameters there is a genuine trade-off: the exact-match scorers (DeepEval, Ragas) catch every fabricated argument but flag 65% of correct calls, for a specificity of 35%; Arize is the most balanced at 81%; Noveum favors leaving correct calls alone, at 84%. The right choice depends on whether the goal is an automatic gate or a strict regression test against a known set of arguments.
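The trade-off can be illustrated with a toy exact-match parameter check. This sketch shows one plausible way an exact comparison ends up flagging a correct call; the report does not say this is why the exact-match scorers flagged correct calls, and the function names, argument names, and calls below are all hypothetical.

```python
def exact_match_flags(expected_args: dict, actual_args: dict) -> bool:
    """Flag the call unless its arguments equal the expected ones exactly."""
    return expected_args != actual_args


def specificity(flags_on_correct_calls: list[bool]) -> float:
    """Share of correct tool calls that the scorer left unflagged."""
    if not flags_on_correct_calls:
        raise ValueError("need at least one correct call")
    unflagged = sum(1 for flagged in flags_on_correct_calls if not flagged)
    return unflagged / len(flags_on_correct_calls)


# Hypothetical correct calls: (expected arguments, arguments the agent sent).
correct_calls = [
    ({"query": "rate limits"}, {"query": "rate limits"}),
    ({"query": "rate limits"}, {"query": "API rate limits"}),          # harmless rewording
    ({"query": "sdk install", "top_k": 5}, {"query": "sdk install"}),  # default omitted
    ({"doc_id": "42"}, {"doc_id": "42"}),
]
flags = [exact_match_flags(expected, actual) for expected, actual in correct_calls]
print(specificity(flags))
```

Specificity is one minus the false-alarm rate, so flagging 65% of correct calls corresponds to the 35% reported for the exact-match scorers. An exact comparison suits a regression test against known arguments; a more tolerant check suits an automatic gate.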

Conversational behavior

Agents can also drift over the course of a conversation — abandoning their role, or conceding under pressure. Only Noveum and DeepEval ship a dedicated multi-turn scorer for this; the rest fall back to a general judge.

Role-adherence correlation: Noveum 0.79, DeepEval 0.68, Galileo 0.67, Ragas 0.67, Maxim 0.66, Patronus 0.65, Arize 0.59, Braintrust 0.56.

Noveum has the highest agreement with the reference, at 0.79; the rest fall between 0.56 and 0.68. Every platform, Noveum included, is more lenient toward the failing agent than the reference is — a useful caveat for anyone relying on conversational scores to gate deployments automatically.

Speed

For scoring production traffic, speed matters alongside accuracy.

Median judge latency in seconds (lower is better): Noveum 0.59, Arize 1.04, Braintrust 3.30, DeepEval 3.86, Ragas 13.62, Patronus 15.73.

Noveum's scorers run at a median of 0.59 seconds per call, the fastest in the study and roughly 1.8 to 27 times faster than the rest, on a small model (gemini-3.1-flash-lite). On this dataset, no platform that scored more accurately also ran faster. (Latency here is the local judge-model-call time, not a hosted-API round-trip, and is measured the same way for every platform.)
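The speed ratios follow directly from the median latencies listed above. A short sketch of the arithmetic, using only the reported figures:

```python
# Median judge latency in seconds, as reported in this section.
MEDIAN_LATENCY_S = {
    "Noveum": 0.59,
    "Arize": 1.04,
    "Braintrust": 3.30,
    "DeepEval": 3.86,
    "Ragas": 13.62,
    "Patronus": 15.73,
}

baseline = MEDIAN_LATENCY_S["Noveum"]
# How many times longer each other platform takes than the fastest one.
slowdown_vs_noveum = {
    name: latency / baseline
    for name, latency in MEDIAN_LATENCY_S.items()
    if name != "Noveum"
}
for name, ratio in sorted(slowdown_vs_noveum.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.1f}x")
```

The smallest ratio is Arize at about 1.8x and the largest is Patronus at about 26.7x. These are ratios of medians, so they describe typical calls and say nothing about tail latency.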

One practical difference comes before any score is computed: turning raw traces into clean, per-turn evaluation items. Noveum was the only platform that did this with no manual field-mapping — it read the trace structure and produced 440 ready-to-score items on its own. The other hosted platforms ingest traces but require the operator to map the fields; DeepEval and Ragas are libraries and do not ingest traces at all.

The agent, the dataset, the reference labels, every scorer, and the raw per-platform scores are published, so each figure in this report can be re-derived and re-run on a different judge model, a different domain, or an additional platform. To run these scorers on your own traces, see noveum.ai.
