# Scorers Reference

Complete reference for all NovaEval scorers — organized by category with required fields and score ranges.

## Overview

NovaEval includes 80+ scorers organized into categories. Each scorer reads from the `StandardData` schema; the fields listed under "Requires" must be populated for a scorer to produce a meaningful result.

Scores are displayed on a 0–10 scale in the Noveum dashboard (higher is better). A score of 7.0+ is generally passing; the exact threshold is configurable per eval job.
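For example, a dataset item for a RAG eval might populate the schema like this (a minimal sketch: the field names match the "Requires" columns below, but the exact serialization NovaEval expects may differ):

```yaml
# Hypothetical item, serialized as YAML for illustration.
# Field names come from the StandardData schema documented below.
agent_task: "What is the capital of France?"
agent_response: "The capital of France is Paris."
ground_truth: "Paris"
retrieval_query: "capital of France"
retrieved_context:
  - "Paris is the capital and most populous city of France."
```

An item like this supports the Accuracy scorers (which need `agent_response` and `ground_truth`) as well as most of the RAG scorers.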


## Accuracy

Basic string and semantic comparison scorers.

| Scorer | Description | Requires | Score |
| --- | --- | --- | --- |
| `accuracy` | Semantic similarity between response and ground truth | `agent_response`, `ground_truth` | 0–10 |
| `exact_match` | Exact string equality check | `agent_response`, `ground_truth` | 0 or 10 |
| `f1` | Token-level F1 score between response and ground truth | `agent_response`, `ground_truth` | 0–10 |
| `bleu` | BLEU-4 n-gram overlap | `agent_response`, `ground_truth` | 0–10 |
| `rouge` | ROUGE-L longest common subsequence | `agent_response`, `ground_truth` | 0–10 |

## Accuracy & RAG

Evaluate retrieval quality, grounding, and factual correctness.

| Scorer | Description | Requires | Score |
| --- | --- | --- | --- |
| `answer_relevancy` | How relevant the response is to the query | `agent_task`, `agent_response` | 0–10 |
| `faithfulness` | Whether the response is grounded in retrieved context | `retrieved_context`, `agent_response` | 0–10 |
| `contextual_precision` | Precision of retrieved chunks for the query | `retrieval_query`, `retrieved_context`, `ground_truth` | 0–10 |
| `contextual_recall` | Recall of retrieved chunks vs. expected answer | `retrieval_query`, `retrieved_context`, `ground_truth` | 0–10 |
| `ragas` | Composite RAGAS score (faithfulness + relevancy + precision + recall) | `retrieval_query`, `retrieved_context`, `agent_response`, `ground_truth` | 0–10 |
| `hallucination_detection` | Detects when the agent generates information not supported by context | `agent_task`, `agent_response` | 0–10 |
| `claim_verification` | Verifies whether claims in the response are supported by available context | `agent_task`, `agent_response`, `retrieved_context` | 0–10 |
| `factual_accuracy` | Evaluates the factual correctness of the generated response | `agent_task`, `agent_response` | 0–10 |

## Conversational

Evaluate multi-turn dialogue quality.

| Scorer | Description | Requires | Score |
| --- | --- | --- | --- |
| `conversation_relevancy` | Whether each agent turn is relevant to the conversation | `conversation_context`, `agent_response` | 0–10 |
| `intention_fulfillment` | Whether the agent fulfilled the user's stated intent | `conversation_context`, `agent_task`, `agent_response` | 0–10 |
| `knowledge_retention` | Whether the agent remembered and used earlier context | `conversation_context` | 0–10 |
| `role_adherence` | Whether the agent stayed in its defined role | `agent_role`, `conversation_context`, `agent_response` | 0–10 |
| `conversational_role_adherence` | Conversational variant of role adherence | `agent_role`, `conversation_context` | 0–10 |
| `conversational_metrics` | Composite: relevancy + intention + retention + role adherence | `conversation_context`, `agent_task`, `agent_response`, `agent_role` | 0–10 |

## Agent

Evaluate tool use, task completion, and goal achievement.

| Scorer | Description | Requires | Score |
| --- | --- | --- | --- |
| `tool_correctness` | Whether the correct tool was called | `expected_tool_call`, `tool_calls` | 0 or 10 |
| `parameter_correctness` | Whether tool parameters match expected values | `tool_calls`, `parameters_passed`, `tool_call_results` | 0–10 |
| `task_progression` | Whether the agent made progress toward the task goal | `agent_task`, `agent_role`, `system_prompt`, `agent_response` | 0–10 |
| `context_relevancy` | Whether the agent's response is relevant to the task context | `agent_task`, `agent_role`, `agent_response` | 0–10 |
| `agent_role_adherence` | Whether the agent acted within its defined role | `agent_role`, `agent_task`, `agent_response`, `tool_calls` | 0–10 |
| `goal_achievement` | Whether the agent achieved the overall goal (requires full trace) | `agent_exit` (=True), `trace` | 0–10 |
| `conversation_coherence` | Whether the multi-step conversation was coherent end-to-end | `agent_exit` (=True), `trace` | 0–10 |

## Audio Quality

Evaluate voice output quality. These scorers use `raw_complete_audio` and Gemini multimodal analysis on the assistant audio channel.

| Scorer | Description | Requires | Score |
| --- | --- | --- | --- |
| `mos` | Mean Opinion Score: overall perceived audio quality | `raw_complete_audio` | 0–10 |
| `tone_clarity` | Clarity and naturalness of the agent's voice | `raw_complete_audio` | 0–10 |
| `pronunciation_audio` | Pronunciation accuracy across the session | `raw_complete_audio` | 0–10 |
| `gibberish` | Detects garbled, unintelligible, or nonsensical speech | `raw_complete_audio` | 0–10 |
| `audio_breakage` | Detects audio dropouts, clipping, and signal issues | `raw_complete_audio` | 0–10 |
| `mispronunciation` | Frequency of mispronounced words | `raw_complete_audio` | 0–10 |
| `speaking_over_user` | How often the agent interrupted while the user was speaking | `raw_complete_audio` | 0–10 |
| `word_accuracy` | Word-level accuracy of TTS output vs. input text | `tts_data`, `raw_complete_audio` | 0–10 |
| `assistant_average_pitch_hz` | Average fundamental frequency of the agent's voice | `raw_complete_audio` | 0–10 |
| `assistant_volume_rms` | RMS volume level of the agent's audio | `raw_complete_audio` | 0–10 |

## Latency

Measure real-time response speed; higher scores mean faster responses relative to the configured threshold. These scorers read from `metrics_collected` (VAD, STT, TTS, LLM, and EOU metrics captured by the SDK).

| Scorer | Description | Requires | Score |
| --- | --- | --- | --- |
| `llm_ttft` | LLM time-to-first-token | `metrics_collected` (LLMMetrics) | 0–10 |
| `llm_latency` | Total LLM processing time | `metrics_collected` (LLMMetrics) | 0–10 |
| `stt_latency` | Speech-to-text processing time | `metrics_collected` (STTMetrics) | 0–10 |
| `stt_audio_duration` | Duration of audio processed by STT | `metrics_collected` (STTMetrics) | 0–10 |
| `stt_processing_duration` | STT processing time excluding audio duration | `metrics_collected` (STTMetrics) | 0–10 |
| `tts_latency` | Total TTS synthesis time | `metrics_collected` (TTSMetrics) | 0–10 |
| `tts_ttfb` | TTS time-to-first-byte: latency before any audio starts | `metrics_collected` (TTSMetrics) | 0–10 |
| `tts_duration` | Actual TTS audio output duration | `metrics_collected` (TTSMetrics) | 0–10 |
| `tts_audio_duration` | Duration of TTS audio content | `metrics_collected` (TTSMetrics) | 0–10 |
| `e2e_latency` | End-to-end latency: user finishes speaking → agent starts responding | `metrics_collected` (all) | 0–10 |
| `end_of_turn_delay` | Delay between EOU detection and agent response start | `metrics_collected` (EOUMetrics) | 0–10 |
| `on_user_turn_completed_delay` | Delay from user turn completion to next pipeline stage | `metrics_collected` (EOUMetrics) | 0–10 |
| `assistant_latency` | Aggregate agent response latency | `metrics_collected` | 0–10 |
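As a sketch of how the threshold mentioned above might be set, an eval job could declare per-scorer latency budgets like this (the `threshold_ms` key and the surrounding structure are illustrative, not documented NovaEval options):

```yaml
scorers:
  - name: e2e_latency
    threshold_ms: 1500   # hypothetical: responses within 1.5 s score near 10
  - name: llm_ttft
    threshold_ms: 500    # hypothetical budget for LLM time-to-first-token
  - name: tts_ttfb
    threshold_ms: 300    # hypothetical budget before any audio starts
```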

## Conversation Quality

Evaluate behaviors specific to phone and voice calls. These scorers run LLM analysis on the `conversation_context`.

| Scorer | Description | Requires | Score |
| --- | --- | --- | --- |
| `instruction_adherence` | Whether the agent followed its given instructions throughout the call | `conversation_context`, `system_prompt` | 0–10 |
| `sentiment_csat` | Estimated customer satisfaction (CSAT) based on conversation sentiment | `conversation_context` | 0–10 |
| `drop_off_node` | Identifies where in the conversation the user disengaged or dropped off | `conversation_context` | 0–10 |
| `conversation_context_coherence` | Whether the agent maintained coherent context across the full call | `conversation_context` | 0–10 |
| `appropriate_call_termination` | Whether the call ended appropriately | `conversation_context` | 0–10 |

## Safety

Detect harmful, inappropriate, or unsafe outputs.

| Scorer | Description | Requires | Score |
| --- | --- | --- | --- |
| `content_moderation` | General content moderation check | `agent_response` | 0–10 |
| `toxicity` | Detects toxic, offensive, or harmful language | `agent_response` | 0–10 |
| `is_harmful_advice` | Detects advice that could cause real-world harm | `agent_task`, `agent_response` | 0–10 |
| `content_safety_violation` | Checks against a defined set of content policy rules | `agent_response` | 0–10 |
| `answer_refusal` | Detects when the agent refused to answer and whether the refusal was appropriate | `agent_task`, `agent_response` | 0–10 |

## LLM-as-Judge

Use a language model to evaluate against custom criteria.

### G-Eval (`g_eval`)

G-Eval uses chain-of-thought reasoning to score against custom evaluation criteria.

| Parameter | Description |
| --- | --- |
| `criteria` | The evaluation criteria as a string (e.g., "Is the response helpful and accurate?") |
| `steps` | Optional: step-by-step evaluation instructions |
| `score_range` | Min/max for the output score (default: 0–10) |
| `iterations` | Number of LLM judge calls to average (default: 1) |
| `chain_of_thought` | Enable CoT reasoning before scoring (default: true) |

Requires: `agent_task`, `agent_response` (plus any fields referenced in the criteria)
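A sketch of a G-Eval configuration using the parameters above (the parameter names come from the table; the YAML shape around them is illustrative):

```yaml
scorer: g_eval
criteria: "Is the response helpful, accurate, and free of unsupported claims?"
steps:                    # optional; omit to let the judge derive its own steps
  - "Read the task and the agent response."
  - "List the claims made in the response and check each for accuracy."
  - "Score based on helpfulness and accuracy."
score_range: [0, 10]
iterations: 3             # average three judge calls to reduce variance
chain_of_thought: true
```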

### Panel of Judges (`panel_judge`)

Runs multiple LLM judge models simultaneously and aggregates their scores.

| Parameter | Description |
| --- | --- |
| `judges` | List of JudgeConfig objects (model, weight, name, specialty, temperature) |
| `aggregation` | How to combine scores: mean, median, weighted, majority, consensus, min, max |
| `criteria` | Shared evaluation criteria for all judges |

Requires: `agent_task`, `agent_response`
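A sketch of a panel configuration (the JudgeConfig field names come from the table above; the model IDs and surrounding YAML shape are illustrative):

```yaml
scorer: panel_judge
criteria: "Rate the overall quality and safety of the response."
aggregation: weighted       # or: mean, median, majority, consensus, min, max
judges:
  - name: generalist
    model: gpt-4o           # illustrative model ID
    weight: 0.6
    temperature: 0.0
  - name: safety_reviewer
    model: claude-sonnet-4  # illustrative model ID
    specialty: "content safety"
    weight: 0.4
    temperature: 0.0
```

Weighted aggregation lets a specialist judge count for more (or less) than a generalist without discarding either signal.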


## Custom

### Custom scorer (`custom_scorer`)

Define a scoring template using `{field}` placeholders that reference any `StandardData` field.

```yaml
template: |
  Given this user question: {agent_task}
  And this agent response: {agent_response}
  Rate the response quality from 0 to 10.
```

Requires: any fields referenced in the template.
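Because templates can reference any combination of fields, a groundedness-style check is easy to sketch (a hypothetical template, using the placeholder mechanism documented above):

```yaml
template: |
  Context retrieved for the query: {retrieved_context}
  Agent response: {agent_response}
  Rate from 0 to 10 how well the response is supported by the context.
```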

### Item summary (`item_summary`)

Generates a human-readable summary of the evaluation item. Useful for debugging and report generation.

Requires: `agent_task`, `agent_response`


## Choosing scorers

- **For RAG pipelines:** `answer_relevancy` + `faithfulness` + `hallucination_detection` + `contextual_precision` + `contextual_recall` (see the sketch after this list)
- **For conversational agents:** `conversation_relevancy` + `intention_fulfillment` + `knowledge_retention` + `role_adherence`
- **For tool-calling agents:** `tool_correctness` + `parameter_correctness` + `task_progression` + `goal_achievement`
- **For voice agents (LiveKit / Pipecat):** `mos` + `tone_clarity` + `gibberish` + `e2e_latency` + `llm_ttft` + `tts_ttfb`
- **For telephony agents:** `instruction_adherence` + `sentiment_csat` + `appropriate_call_termination` + `conversation_context_coherence`
- **For any agent:** `g_eval` (with custom criteria) + `content_moderation` + `answer_refusal`
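As a sketch, the RAG set above might be declared together in a single eval job (the `scorers:` list shape is illustrative):

```yaml
scorers:
  - answer_relevancy
  - faithfulness
  - hallucination_detection
  - contextual_precision
  - contextual_recall
```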


Using "Recommend Scorers"

When creating an Eval Job, click **Recommend Scorers**: NovaEval analyzes a sample of your dataset items and suggests the most appropriate scorers based on which fields are populated. This is the fastest way to get started.

