# Scorers Reference

Complete reference for all NovaEval scorers — organized by category with required fields and score ranges.
## Overview

NovaEval includes 80+ scorers organized into categories. Each scorer reads from the StandardData schema — the fields listed under "Requires" must be populated for a scorer to produce a meaningful result.

Scores are displayed on a 0–10 scale in the Noveum dashboard (higher is better). A score of 7.0+ is generally passing; the exact threshold is configurable per eval job.
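As a rough sketch of how the "Requires" column works, consider a dataset item holding a few StandardData fields. The field names come from the tables below, but the dict layout and the `can_run` helper are illustrative only, not NovaEval API:

```python
# Illustrative dataset item; field names follow the "Requires" columns
# in the tables below. Consult the StandardData schema docs for the
# authoritative field list and types.
item = {
    "agent_task": "What is the capital of France?",
    "agent_response": "The capital of France is Paris.",
    "ground_truth": "Paris",
    "retrieved_context": ["Paris is the capital and largest city of France."],
}

# Hypothetical helper: a scorer only produces a meaningful result
# when all of its required fields are populated.
def can_run(item: dict, requires: list[str]) -> bool:
    return all(item.get(field) not in (None, "", []) for field in requires)
```

Here a scorer like exact_match would run because agent_response and ground_truth are populated, while conversational scorers (which need conversation_context) would be skipped.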
## Accuracy

Basic string and semantic comparison scorers.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| accuracy | Semantic similarity between response and ground truth | agent_response, ground_truth | 0–10 |
| exact_match | Exact string equality check | agent_response, ground_truth | 0 or 10 |
| f1 | Token-level F1 score between response and ground truth | agent_response, ground_truth | 0–10 |
| bleu | BLEU-4 n-gram overlap | agent_response, ground_truth | 0–10 |
| rouge | ROUGE-L longest common subsequence | agent_response, ground_truth | 0–10 |
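To make the 0–10 scale concrete, here is a minimal sketch of a token-level F1 computation mapped onto that range. It assumes simple whitespace tokenization; the shipped f1 scorer may normalize text differently:

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Token-level F1, scaled to the 0-10 dashboard range.

    Simplified sketch: lowercase whitespace tokens, bag-of-words overlap.
    """
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    # Count tokens appearing in both, respecting multiplicity
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 10 * (2 * precision * recall) / (precision + recall)
```

For example, `token_f1("paris france", "paris")` gives precision 0.5 and recall 1.0, so a scaled score of about 6.7.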
## Accuracy & RAG

Evaluate retrieval quality, grounding, and factual correctness.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| answer_relevancy | How relevant the response is to the query | agent_task, agent_response | 0–10 |
| faithfulness | Whether the response is grounded in retrieved context | retrieved_context, agent_response | 0–10 |
| contextual_precision | Precision of retrieved chunks for the query | retrieval_query, retrieved_context, ground_truth | 0–10 |
| contextual_recall | Recall of retrieved chunks vs. expected answer | retrieval_query, retrieved_context, ground_truth | 0–10 |
| ragas | Composite RAGAS score (faithfulness + relevancy + precision + recall) | retrieval_query, retrieved_context, agent_response, ground_truth | 0–10 |
| hallucination_detection | Detects when the agent generates information not supported by context | agent_task, agent_response | 0–10 |
| claim_verification | Verifies whether claims in the response are supported by available context | agent_task, agent_response, retrieved_context | 0–10 |
| factual_accuracy | Evaluates the factual correctness of the generated response | agent_task, agent_response | 0–10 |
## Conversational

Evaluate multi-turn dialogue quality.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| conversation_relevancy | Whether each agent turn is relevant to the conversation | conversation_context, agent_response | 0–10 |
| intention_fulfillment | Whether the agent fulfilled the user's stated intent | conversation_context, agent_task, agent_response | 0–10 |
| knowledge_retention | Whether the agent remembered and used earlier context | conversation_context | 0–10 |
| role_adherence | Whether the agent stayed in its defined role | agent_role, conversation_context, agent_response | 0–10 |
| conversational_role_adherence | Conversational variant of role adherence | agent_role, conversation_context | 0–10 |
| conversational_metrics | Composite: relevancy + intention + retention + role adherence | conversation_context, agent_task, agent_response, agent_role | 0–10 |
## Agent

Evaluate tool use, task completion, and goal achievement.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| tool_correctness | Whether the correct tool was called | expected_tool_call, tool_calls | 0 or 10 |
| parameter_correctness | Whether tool parameters match expected values | tool_calls, parameters_passed, tool_call_results | 0–10 |
| task_progression | Whether the agent made progress toward the task goal | agent_task, agent_role, system_prompt, agent_response | 0–10 |
| context_relevancy | Whether the agent's response is relevant to the task context | agent_task, agent_role, agent_response | 0–10 |
| agent_role_adherence | Whether the agent acted within its defined role | agent_role, agent_task, agent_response, tool_calls | 0–10 |
| goal_achievement | Whether the agent achieved the overall goal (requires full trace) | agent_exit (=True), trace | 0–10 |
| conversation_coherence | Whether the multi-step conversation was coherent end-to-end | agent_exit (=True), trace | 0–10 |
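The binary tool_correctness check can be sketched as follows. The dict shape (a "name" key on each call) is an assumption here; the real scorer's matching rules (call ordering, multiple expected calls) may be stricter:

```python
def tool_correctness(expected_tool_call: dict, tool_calls: list[dict]) -> float:
    """Sketch of a binary check: 10 if any recorded call used the
    expected tool, else 0 (matching the "0 or 10" score column)."""
    expected = expected_tool_call["name"]
    return 10.0 if any(call.get("name") == expected for call in tool_calls) else 0.0
```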
## Audio Quality

Evaluate voice output quality. These scorers use raw_complete_audio and Gemini multimodal analysis on the assistant audio channel.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| mos | Mean Opinion Score — overall perceived audio quality | raw_complete_audio | 0–10 |
| tone_clarity | Clarity and naturalness of the agent's voice | raw_complete_audio | 0–10 |
| pronunciation_audio | Pronunciation accuracy across the session | raw_complete_audio | 0–10 |
| gibberish | Detects garbled, unintelligible, or nonsensical speech | raw_complete_audio | 0–10 |
| audio_breakage | Detects audio dropouts, clipping, and signal issues | raw_complete_audio | 0–10 |
| mispronunciation | Frequency of mispronounced words | raw_complete_audio | 0–10 |
| speaking_over_user | How often the agent interrupted while the user was speaking | raw_complete_audio | 0–10 |
| word_accuracy | Word-level accuracy of TTS output vs. input text | tts_data, raw_complete_audio | 0–10 |
| assistant_average_pitch_hz | Average fundamental frequency of the agent's voice | raw_complete_audio | 0–10 |
| assistant_volume_rms | RMS volume level of the agent's audio | raw_complete_audio | 0–10 |
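As an illustration of the measurement behind assistant_volume_rms, the RMS level of normalized audio samples can be computed as below. How the platform maps this level onto the 0–10 scale is not specified here:

```python
import math

def volume_rms_dbfs(samples: list[float]) -> float:
    """RMS level of normalized samples (range -1.0..1.0) in dBFS.

    Sketch only: 0 dBFS is full scale; quieter audio is more negative.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Clamp to avoid log(0) on pure silence
    return 20 * math.log10(max(rms, 1e-12))
```

A full-scale square wave measures 0 dBFS; halving the amplitude drops the level by about 6 dB.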
## Latency

Measure real-time response speed. Higher score = faster response relative to the threshold. These scorers read from metrics_collected (VAD, STT, TTS, LLM, and EOU metrics captured by the SDK).

| Scorer | Description | Requires | Score |
|---|---|---|---|
| llm_ttft | LLM time-to-first-token | metrics_collected (LLMMetrics) | 0–10 |
| llm_latency | Total LLM processing time | metrics_collected (LLMMetrics) | 0–10 |
| stt_latency | Speech-to-text processing time | metrics_collected (STTMetrics) | 0–10 |
| stt_audio_duration | Duration of audio processed by STT | metrics_collected (STTMetrics) | 0–10 |
| stt_processing_duration | STT processing time excluding audio duration | metrics_collected (STTMetrics) | 0–10 |
| tts_latency | Total TTS synthesis time | metrics_collected (TTSMetrics) | 0–10 |
| tts_ttfb | TTS time-to-first-byte — latency before any audio starts | metrics_collected (TTSMetrics) | 0–10 |
| tts_duration | Actual TTS audio output duration | metrics_collected (TTSMetrics) | 0–10 |
| tts_audio_duration | Duration of TTS audio content | metrics_collected (TTSMetrics) | 0–10 |
| e2e_latency | End-to-end latency: user finishes speaking → agent starts responding | metrics_collected (all) | 0–10 |
| end_of_turn_delay | Delay between EOU detection and agent response start | metrics_collected (EOUMetrics) | 0–10 |
| on_user_turn_completed_delay | Delay from user turn completion to next pipeline stage | metrics_collected (EOUMetrics) | 0–10 |
| assistant_latency | Aggregate agent response latency | metrics_collected | 0–10 |
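To illustrate "higher score = faster", here is one hypothetical way a raw latency could map onto the 0–10 scale. The linear curve and the 2x cutoff are assumptions; the platform's actual mapping is threshold-configurable and may differ:

```python
def latency_score(measured_ms: float, threshold_ms: float) -> float:
    """Hypothetical linear mapping from latency to a 0-10 score.

    At or below the threshold scores 10; twice the threshold or
    worse scores 0; values in between fall off linearly.
    """
    if measured_ms <= threshold_ms:
        return 10.0
    if measured_ms >= 2 * threshold_ms:
        return 0.0
    return 10.0 * (2 - measured_ms / threshold_ms)
```

With a 500 ms threshold, a 400 ms response scores 10, a 750 ms response scores 5, and a 1200 ms response scores 0.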
## Conversation Quality

Evaluate phone- and voice-call-specific behaviors. These scorers use LLM analysis on the conversation_context.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| instruction_adherence | Whether the agent followed its given instructions throughout the call | conversation_context, system_prompt | 0–10 |
| sentiment_csat | Estimated customer satisfaction / CSAT based on conversation sentiment | conversation_context | 0–10 |
| drop_off_node | Identifies where in the conversation the user disengaged or dropped off | conversation_context | 0–10 |
| conversation_context_coherence | Whether the agent maintained coherent context across the full call | conversation_context | 0–10 |
| appropriate_call_termination | Whether the call ended appropriately | conversation_context | 0–10 |
## Safety

Detect harmful, inappropriate, or unsafe outputs.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| content_moderation | General content moderation check | agent_response | 0–10 |
| toxicity | Detects toxic, offensive, or harmful language | agent_response | 0–10 |
| is_harmful_advice | Detects advice that could cause real-world harm | agent_task, agent_response | 0–10 |
| content_safety_violation | Checks against a defined set of content policy rules | agent_response | 0–10 |
| answer_refusal | Detects when the agent refused to answer and whether the refusal was appropriate | agent_task, agent_response | 0–10 |
## LLM-as-Judge

Use a language model to evaluate against custom criteria.

### G-Eval (g_eval)

G-Eval uses chain-of-thought reasoning to score against custom evaluation criteria.

| Parameter | Description |
|---|---|
| criteria | The evaluation criteria as a string (e.g., "Is the response helpful and accurate?") |
| steps | Optional: step-by-step evaluation instructions |
| score_range | Min/max for the output score (default: 0–10) |
| iterations | Number of LLM judge calls to average (default: 1) |
| chain_of_thought | Enable CoT reasoning before scoring (default: true) |

Requires: agent_task, agent_response (+ any fields referenced in the criteria)
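A hypothetical g_eval configuration using the parameters above might look like the following. The dict syntax is illustrative only (jobs are typically configured in the UI or via the API), and aggregate_iterations simply shows what averaging iterations judge calls implies:

```python
# Illustrative configuration; parameter names follow the table above,
# but the concrete config format is an assumption.
g_eval_config = {
    "scorer": "g_eval",
    "criteria": "Is the response helpful, accurate, and grounded in the task?",
    "steps": [
        "Read the agent_task and agent_response.",
        "Check each factual claim against the task context.",
        "Assign a score with a one-sentence justification.",
    ],
    "score_range": (0, 10),
    "iterations": 3,           # average 3 judge calls to reduce variance
    "chain_of_thought": True,  # reason step-by-step before scoring
}

def aggregate_iterations(scores: list[float]) -> float:
    """What 'iterations' implies: the mean of repeated judge calls."""
    return sum(scores) / len(scores)
```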
### Panel of Judges (panel_judge)

Runs multiple LLM judge models simultaneously and aggregates their scores.

| Parameter | Description |
|---|---|
| judges | List of JudgeConfig objects (model, weight, name, specialty, temperature) |
| aggregation | How to combine scores: mean, median, weighted, majority, consensus, min, max |
| criteria | Shared evaluation criteria for all judges |

Requires: agent_task, agent_response
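Two of the aggregation modes can be sketched as plain functions. This is illustrative only; the shipped scorer also implements mean, majority, consensus, min, and max:

```python
import statistics

def aggregate_panel(scores: list[float], weights: list[float], method: str) -> float:
    """Sketch of the "weighted" and "median" aggregation modes.

    Weights would come from each judge's JudgeConfig.
    """
    if method == "weighted":
        # Weighted mean: judges with higher weight pull the score harder
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if method == "median":
        # Median is robust to a single outlier judge
        return statistics.median(scores)
    raise ValueError(f"unsupported method: {method}")
```

For example, three judges scoring 10, 5, and 0 with weights 1, 1, and 2 give a weighted score of 3.75, while the unweighted median is 5.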
## Custom

### Custom scorer (custom_scorer)

Define a scoring template using {field} placeholders that reference any StandardData field.

Requires: Any fields referenced in the template.
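A template might look like the following sketch, where placeholders are resolved against the item's StandardData fields before the judge model sees the prompt. The template text and the render_template helper are examples, not built-ins:

```python
# Example template referencing StandardData fields via {field} placeholders
template = (
    "Given the task: {agent_task}\n"
    "And the response: {agent_response}\n"
    "Rate how well the response addresses the task on a 0-10 scale."
)

def render_template(template: str, item: dict) -> str:
    """Hypothetical placeholder resolution via str.format-style substitution."""
    return template.format(**item)
```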
### Item summary (item_summary)

Generates a human-readable summary of the evaluation item. Useful for debugging and report generation.

Requires: agent_task, agent_response
## Choosing scorers

### For RAG pipelines

answer_relevancy + faithfulness + hallucination_detection + contextual_precision + contextual_recall

### For conversational agents

conversation_relevancy + intention_fulfillment + knowledge_retention + role_adherence

### For tool-calling agents

tool_correctness + parameter_correctness + task_progression + goal_achievement

### For voice agents (LiveKit / Pipecat)

mos + tone_clarity + gibberish + e2e_latency + llm_ttft + tts_ttfb

### For telephony agents

instruction_adherence + sentiment_csat + appropriate_call_termination + conversation_context_coherence

### For any agent

g_eval (with custom criteria) + content_moderation + answer_refusal
## Using "Recommend Scorers"

When creating an Eval Job, click Recommend Scorers and NovaEval will analyze a sample of your dataset items and suggest the most appropriate scorers based on which fields are populated. This is the fastest way to get started.
## Next steps

- Running Evaluations — create an eval job with your chosen scorers
- StandardData Schema — understand which fields to populate
- NovaSynth — generate test data with voice quality metrics