Evaluation

Measure what your agents actually do.

NovaEval scores your real traffic with 106 calibrated scorers across 18 categories, including the market’s only dedicated voice suite, and builds custom scorers from your product requirements.

106 calibrated scorers

18 scorer categories

17 dedicated voice scorers

Screenshot: Noveum eval job score results (app.noveum.ai)

01 — Capabilities

An instrument you can trust, not just a number.

106 scorers, 18 categories

Hallucination, RAG quality, answer correctness, tool use, coherence, safety — assembled from years of studying how language models fail systematically.

The only voice suite

17 dedicated audio scorers: mispronunciation, audio breakage, word error rate, talk-ratio, barge-in, dead-air, and full STT / LLM / TTS pipeline latency.

Panel of LLM judges

Disagreement-aware judging instead of a single model’s opinion — fewer hallucinated scores, calibrated verdicts.

Custom scorers from your PRD

Hand over your product requirements. Instruction-driven scorers with templated placeholders (input_text, output_text, expected_output, retrieved_context, tool_calls) turn them into evaluation criteria, and NovaEval suggests further scorers from the traffic it sees.

Datasets from real traffic

ETL jobs convert production traces into living eval datasets, refreshed continuously — your logs become your benchmark.

Calibrated across versions

A 7/10 means the same thing after the model upgrade. Score distributions are monitored so your instrument never drifts silently.
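
One simple way to watch a score distribution for drift is to compare the mean of recent scores against a baseline window, measured in baseline standard deviations. The sketch below is only an illustration of that idea: the function name `drift_in_sigmas` and the choice of statistic are assumptions, not a description of how NovaEval monitors its scorers.

```python
from statistics import mean, stdev


def drift_in_sigmas(baseline: list[float], current: list[float]) -> float:
    """Shift of the current mean from the baseline mean, in baseline standard deviations.

    Hypothetical helper for illustration; the statistic is an assumption.
    """
    if len(baseline) < 2 or not current:
        raise ValueError("need at least two baseline scores and one current score")
    sigma = stdev(baseline)  # sample standard deviation of the baseline window
    if sigma == 0:
        raise ValueError("baseline scores have no variance")
    return (mean(current) - mean(baseline)) / sigma


# Scores from before and after a model upgrade (made-up values).
shift = drift_in_sigmas([6, 7, 8], [8, 9, 10])
if abs(shift) > 1.0:  # the alert threshold is an arbitrary example value
    print(f"score distribution moved by {shift:.1f} sigma; recalibrate before trusting it")
```

A check like this answers only whether the distribution moved, not why; a shift can come from the scorer or from the traffic itself.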

02 — The library

Eighteen categories. One calibration bar.

Every scorer reports a score and its reasoning, so you can audit the judge — not just trust it.

RAG Evaluation · Hallucination · Answer Quality · Agent Evaluation · Voice Latency · Audio & TTS Quality · Safety · Bias Detection · Conversational · Accuracy & Matching · Context Analysis · Format Validation · NLP Metrics · Panel Judging · Multi-Context · Relevance · G-Eval · Custom
Screenshot: hallucination scorer with reasoning (live product)

03 — Under the hood

What every score is made of.

106 calibrated scorers, a dedicated voice suite, template-driven custom criteria, and datasets that build themselves — with reasoning attached to every verdict.

Scorer library

106 calibrated scorers across 18 categories — hallucination, RAG quality, answer correctness, tool use, coherence, safety, bias, format validation, and more — assembled from years of studying how language models fail.

Voice & audio suite

17 dedicated voice scorers: mispronunciation, audio breakage, word error rate (WER), talk-ratio, barge-in, dead-air, and STT / LLM / TTS pipeline latency — signals a transcript never shows.
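
Of the signals listed above, word error rate has a standard textbook definition: substitutions, deletions, and insertions divided by the number of words in the reference transcript, found with a word-level edit distance. The sketch below implements that definition, assuming lowercase normalization and whitespace tokenization; it is not a description of NovaEval's own scorer, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.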

Panel of judges

A panel of LLM judges with disagreement awareness replaces a single model’s opinion — fewer hallucinated scores, calibrated verdicts you can defend.
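
To make "disagreement awareness" concrete, here is a minimal sketch of one possible aggregation: take the median of the judges' scores and flag the verdict for review when the spread is large. The median, the standard-deviation spread, the `max_spread` threshold, and all names are assumptions chosen for illustration; the source does not specify how the panel is combined.

```python
from dataclasses import dataclass
from statistics import median, pstdev


@dataclass
class PanelVerdict:
    score: float        # median of the judges' scores
    spread: float       # population standard deviation across judges
    needs_review: bool  # True when the judges disagree more than allowed


def aggregate_panel(scores: list[float], max_spread: float = 1.5) -> PanelVerdict:
    """Combine per-judge scores into one verdict (hypothetical aggregation rule)."""
    if len(scores) < 2:
        raise ValueError("a panel needs at least two judge scores")
    spread = pstdev(scores)
    return PanelVerdict(
        score=median(scores),
        spread=spread,
        needs_review=spread > max_spread,
    )


agreeing = aggregate_panel([7, 8, 7])  # judges close together
split = aggregate_panel([2, 9, 8])     # one judge far from the others
```

The median keeps a single outlying judge from dragging the score, while the spread records that the outlier existed.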

Custom scorers

Instruction-driven scorers with templated placeholders — input_text, output_text, expected_output, retrieved_context, tool_calls, and more — turn your product requirements into evaluation criteria.
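
As a rough illustration of how templated placeholders could turn a requirement into a judge prompt, the sketch below fills the placeholder names listed above using Python's built-in `str.format`. The curly-brace syntax, the `render_prompt` helper, and the refund requirement are all hypothetical; the actual template syntax is not shown in the source.

```python
# Placeholder names taken from the text above.
PLACEHOLDERS = (
    "input_text",
    "output_text",
    "expected_output",
    "retrieved_context",
    "tool_calls",
)

# A made-up product requirement written as an instruction template.
CRITERION = (
    "Requirement: refund answers must quote the policy found in the retrieved context.\n"
    "User input: {input_text}\n"
    "Agent output: {output_text}\n"
    "Retrieved context: {retrieved_context}\n"
    "Score 0-10 and explain your reasoning."
)


def render_prompt(template: str, record: dict) -> str:
    """Fill known placeholders from one record; missing fields become empty strings."""
    fields = {name: record.get(name, "") for name in PLACEHOLDERS}
    return template.format(**fields)


prompt = render_prompt(CRITERION, {
    "input_text": "Can I get a refund?",
    "output_text": "Yes, within 30 days of purchase.",
    "retrieved_context": "Refunds are available within 30 days.",
})
print(prompt)
```

With `str.format`, a template that needs literal braces must escape them as `{{` and `}}`.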

Scorer output

Every scorer returns a score from 0–10, a passed boolean, free-text reasoning, and structured metadata — so you audit the judge, not just trust the number.
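
The four fields named above map naturally onto a small record type. The sketch below models them as a Python dataclass and adds a helper that lists failures for audit; the class name `ScoreResult`, the field types, and the helper are assumptions for illustration, not NovaEval's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScoreResult:
    score: float     # 0-10, per the description above
    passed: bool     # pass/fail verdict
    reasoning: str   # free-text explanation from the judge
    metadata: dict[str, Any] = field(default_factory=dict)  # structured extras

    def __post_init__(self) -> None:
        if not 0 <= self.score <= 10:
            raise ValueError(f"score must be within 0-10, got {self.score}")


def failures(results: list[ScoreResult]) -> list[ScoreResult]:
    """Return failed results, lowest score first, ready for manual audit."""
    return sorted((r for r in results if not r.passed), key=lambda r: r.score)


results = [
    ScoreResult(9.0, True, "Answer matches the retrieved policy."),
    ScoreResult(3.0, False, "Cites a refund window not present in the context."),
    ScoreResult(1.0, False, "Invents a product name.", {"scorer": "hallucination"}),
]
for result in failures(results):
    print(result.score, result.reasoning)
```

Reading the reasoning next to each failing score is what makes the judge auditable.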

Datasets from traffic

Evaluation datasets are auto-built from production traces via ETL — your real logs become a living benchmark, refreshed continuously.
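
The transform step of such an ETL job can be pictured as a function from raw trace records to evaluation rows. The sketch below assumes a hypothetical trace shape with `input`, `output`, `context`, and `tool_calls` keys, drops incomplete traces, and removes exact duplicates; the real trace schema and pipeline are not described in the source.

```python
def traces_to_dataset(traces: list[dict]) -> list[dict]:
    """Turn raw trace records into eval rows, skipping incomplete traces and exact duplicates.

    The trace keys used here are a hypothetical schema for illustration.
    """
    rows: list[dict] = []
    seen: set[tuple[str, str]] = set()
    for trace in traces:
        input_text = trace.get("input")
        output_text = trace.get("output")
        if not input_text or not output_text:
            continue  # incomplete trace: nothing to evaluate
        key = (input_text, output_text)
        if key in seen:
            continue  # exact duplicate of a row already kept
        seen.add(key)
        rows.append({
            "input_text": input_text,
            "output_text": output_text,
            "retrieved_context": trace.get("context", ""),
            "tool_calls": trace.get("tool_calls", []),
        })
    return rows


dataset = traces_to_dataset([
    {"input": "Where is my order?", "output": "It ships tomorrow.", "context": "Order 12 ships tomorrow."},
    {"input": "Where is my order?", "output": "It ships tomorrow."},  # duplicate
    {"input": "Cancel my plan"},                                       # no output recorded
    {"input": "Reset my password", "output": "Check your email for a reset link."},
])
```

The output keys reuse the placeholder names from the custom scorers, so rows could feed a template without renaming.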

The loop

Part of one closed loop.

Evaluation is the control signal for everything else: NovaGuard enforces these scores in real time, and NovaPilot optimizes against them.

Next step

Put your production AI under control.

Start free and ship your first trace in 15 minutes — or book 30 minutes and we’ll integrate live on the call: your stack, your data, your first eval report before it ends.

SOC 2 Type II · HIPAA · GDPR · On-prem & BYO ClickHouse available