# Scorers Reference

Complete reference for all NovaEval scorers — organized by category with required fields and score ranges.
## Overview

NovaEval includes 80+ scorers organized into categories. Each scorer reads from the StandardData schema — the fields listed under "Requires" must be populated for a scorer to produce a meaningful result.

Scores are displayed on a 0–10 scale in the Noveum dashboard (higher is better). A score of 7.0+ is generally passing; the exact threshold is configurable per eval job.
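As a rough sketch of how the "Requires" column works, consider a dataset item holding a few StandardData fields. The field names come from the tables below, but the dict layout and the `can_run` helper are illustrative only, not NovaEval API:

```python
# Illustrative dataset item; field names follow the "Requires" columns
# in the tables below. Consult the StandardData schema docs for the
# authoritative field list and types.
item = {
    "agent_task": "What is the capital of France?",
    "agent_response": "The capital of France is Paris.",
    "ground_truth": "Paris",
    "retrieved_context": ["Paris is the capital and largest city of France."],
}

# Hypothetical helper: a scorer only produces a meaningful result
# when all of its required fields are populated.
def can_run(item: dict, requires: list[str]) -> bool:
    return all(item.get(field) not in (None, "", []) for field in requires)
```

Here a scorer like exact_match would run because agent_response and ground_truth are populated, while conversational scorers (which need conversation_context) would be skipped.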
## Accuracy

Basic string and semantic comparison scorers.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| accuracy | Semantic similarity between response and ground truth | agent_response, ground_truth | 0–10 |
| exact_match | Exact string equality check | agent_response, ground_truth | 0 or 10 |
| f1 | Token-level F1 score between response and ground truth | agent_response, ground_truth | 0–10 |
| bleu | BLEU-4 n-gram overlap | agent_response, ground_truth | 0–10 |
| rouge | ROUGE-L longest common subsequence | agent_response, ground_truth | 0–10 |
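To make the 0–10 scale concrete, here is a minimal sketch of a token-level F1 computation mapped onto that range. It assumes simple whitespace tokenization; the shipped f1 scorer may normalize text differently:

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Token-level F1, scaled to the 0-10 dashboard range.

    Simplified sketch: lowercase whitespace tokens, bag-of-words overlap.
    """
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    # Count tokens appearing in both, respecting multiplicity
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 10 * (2 * precision * recall) / (precision + recall)
```

For example, `token_f1("paris france", "paris")` gives precision 0.5 and recall 1.0, so a scaled score of about 6.7.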
## Accuracy & RAG

Evaluate retrieval quality, grounding, and factual correctness.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| answer_relevancy | How relevant the response is to the query | agent_task, agent_response | 0–10 |
| faithfulness | Whether the response is grounded in retrieved context | retrieved_context, agent_response | 0–10 |
| contextual_precision | Precision of retrieved chunks for the query | retrieval_query, retrieved_context, ground_truth | 0–10 |
| contextual_recall | Recall of retrieved chunks vs. expected answer | retrieval_query, retrieved_context, ground_truth | 0–10 |
| ragas | Composite RAGAS score (faithfulness + relevancy + precision + recall) | retrieval_query, retrieved_context, agent_response, ground_truth | 0–10 |
| hallucination_detection | Detects when the agent generates information not supported by context | agent_task, agent_response | 0–10 |
| claim_verification | Verifies whether claims in the response are supported by available context | agent_task, agent_response, retrieved_context | 0–10 |
| factual_accuracy | Evaluates the factual correctness of the generated response | agent_task, agent_response | 0–10 |
## Conversational

Evaluate multi-turn dialogue quality.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| conversation_relevancy | Whether each agent turn is relevant to the conversation | conversation_context, agent_response | 0–10 |
| intention_fulfillment | Whether the agent fulfilled the user's stated intent | conversation_context, agent_task, agent_response | 0–10 |
| knowledge_retention | Whether the agent remembered and used earlier context | conversation_context | 0–10 |
| role_adherence | Whether the agent stayed in its defined role | agent_role, conversation_context, agent_response | 0–10 |
| conversational_role_adherence | Conversational variant of role adherence | agent_role, conversation_context | 0–10 |
| conversational_metrics | Composite: relevancy + intention + retention + role adherence | conversation_context, agent_task, agent_response, agent_role | 0–10 |
## Agent

Evaluate tool use, task completion, and goal achievement.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| tool_correctness | Whether the correct tool was called | expected_tool_call, tool_calls | 0 or 10 |
| parameter_correctness | Whether tool parameters match expected values | tool_calls, parameters_passed, tool_call_results | 0–10 |
| task_progression | Whether the agent made progress toward the task goal | agent_task, agent_role, system_prompt, agent_response | 0–10 |
| context_relevancy | Whether the agent's response is relevant to the task context | agent_task, agent_role, agent_response | 0–10 |
| agent_role_adherence | Whether the agent acted within its defined role | agent_role, agent_task, agent_response, tool_calls | 0–10 |
| goal_achievement | Whether the agent achieved the overall goal (requires full trace) | agent_exit (=True), trace | 0–10 |
| conversation_coherence | Whether the multi-step conversation was coherent end-to-end | agent_exit (=True), trace | 0–10 |
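The binary tool_correctness check can be sketched as follows. The dict shape (a "name" key on each call) is an assumption here; the real scorer's matching rules (call ordering, multiple expected calls) may be stricter:

```python
def tool_correctness(expected_tool_call: dict, tool_calls: list[dict]) -> float:
    """Sketch of a binary check: 10 if any recorded call used the
    expected tool, else 0 (matching the "0 or 10" score column)."""
    expected = expected_tool_call["name"]
    return 10.0 if any(call.get("name") == expected for call in tool_calls) else 0.0
```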
## Audio Quality

Evaluate voice output quality. These scorers use raw_complete_audio and Gemini multimodal analysis on the assistant audio channel.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| mos | Mean Opinion Score — overall perceived audio quality | raw_complete_audio | 0–10 |
| tone_clarity | Clarity and naturalness of the agent's voice | raw_complete_audio | 0–10 |
| pronunciation_audio | Pronunciation accuracy across the session | raw_complete_audio | 0–10 |
| gibberish | Detects garbled, unintelligible, or nonsensical speech | raw_complete_audio | 0–10 |
| audio_breakage | Detects audio dropouts, clipping, and signal issues | raw_complete_audio | 0–10 |
| mispronunciation | Frequency of mispronounced words | raw_complete_audio | 0–10 |
| speaking_over_user | How often the agent interrupted while the user was speaking | raw_complete_audio | 0–10 |
| word_accuracy | Word-level accuracy of TTS output vs. input text | tts_data, raw_complete_audio | 0–10 |
| assistant_average_pitch_hz | Average fundamental frequency of the agent's voice | raw_complete_audio | 0–10 |
| assistant_volume_rms | RMS volume level of the agent's audio | raw_complete_audio | 0–10 |
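As an illustration of the measurement behind assistant_volume_rms, the RMS level of normalized audio samples can be computed as below. How the platform maps this level onto the 0–10 scale is not specified here:

```python
import math

def volume_rms_dbfs(samples: list[float]) -> float:
    """RMS level of normalized samples (range -1.0..1.0) in dBFS.

    Sketch only: 0 dBFS is full scale; quieter audio is more negative.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Clamp to avoid log(0) on pure silence
    return 20 * math.log10(max(rms, 1e-12))
```

A full-scale square wave measures 0 dBFS; halving the amplitude drops the level by about 6 dB.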
## Latency

Measure real-time response speed. Higher score = faster response relative to the threshold. These scorers read from metrics_collected (VAD, STT, TTS, LLM, and EOU metrics captured by the SDK).

| Scorer | Description | Requires | Score |
|---|---|---|---|
| llm_ttft | LLM time-to-first-token | metrics_collected (LLMMetrics) | 0–10 |
| llm_latency | Total LLM processing time | metrics_collected (LLMMetrics) | 0–10 |
| stt_latency | Speech-to-text processing time | metrics_collected (STTMetrics) | 0–10 |
| stt_audio_duration | Duration of audio processed by STT | metrics_collected (STTMetrics) | 0–10 |
| stt_processing_duration | STT processing time excluding audio duration | metrics_collected (STTMetrics) | 0–10 |
| tts_latency | Total TTS synthesis time | metrics_collected (TTSMetrics) | 0–10 |
| tts_ttfb | TTS time-to-first-byte — latency before any audio starts | metrics_collected (TTSMetrics) | 0–10 |
| tts_duration | Actual TTS audio output duration | metrics_collected (TTSMetrics) | 0–10 |
| tts_audio_duration | Duration of TTS audio content | metrics_collected (TTSMetrics) | 0–10 |
| e2e_latency | End-to-end latency: user finishes speaking → agent starts responding | metrics_collected (all) | 0–10 |
| end_of_turn_delay | Delay between EOU detection and agent response start | metrics_collected (EOUMetrics) | 0–10 |
| on_user_turn_completed_delay | Delay from user turn completion to next pipeline stage | metrics_collected (EOUMetrics) | 0–10 |
| assistant_latency | Aggregate agent response latency | metrics_collected | 0–10 |
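To illustrate "higher score = faster", here is one hypothetical way a raw latency could map onto the 0–10 scale. The linear curve and the 2x cutoff are assumptions; the platform's actual mapping is threshold-configurable and may differ:

```python
def latency_score(measured_ms: float, threshold_ms: float) -> float:
    """Hypothetical linear mapping from latency to a 0-10 score.

    At or below the threshold scores 10; twice the threshold or
    worse scores 0; values in between fall off linearly.
    """
    if measured_ms <= threshold_ms:
        return 10.0
    if measured_ms >= 2 * threshold_ms:
        return 0.0
    return 10.0 * (2 - measured_ms / threshold_ms)
```

With a 500 ms threshold, a 400 ms response scores 10, a 750 ms response scores 5, and a 1200 ms response scores 0.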
## Conversation Quality

Evaluate phone- and voice-call-specific behaviors. These scorers use LLM analysis on the conversation_context.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| instruction_adherence | Whether the agent followed its given instructions throughout the call | conversation_context, system_prompt | 0–10 |
| sentiment_csat | Estimated customer satisfaction / CSAT based on conversation sentiment | conversation_context | 0–10 |
| drop_off_node | Identifies where in the conversation the user disengaged or dropped off | conversation_context | 0–10 |
| conversation_context_coherence | Whether the agent maintained coherent context across the full call | conversation_context | 0–10 |
| appropriate_call_termination | Whether the call ended appropriately | conversation_context | 0–10 |
## Safety

Detect harmful, inappropriate, or unsafe outputs.

| Scorer | Description | Requires | Score |
|---|---|---|---|
| content_moderation | General content moderation check | agent_response | 0–10 |
| toxicity | Detects toxic, offensive, or harmful language | agent_response | 0–10 |
| is_harmful_advice | Detects advice that could cause real-world harm | agent_task, agent_response | 0–10 |
| content_safety_violation | Checks against a defined set of content policy rules | agent_response | 0–10 |
| answer_refusal | Detects when the agent refused to answer and whether the refusal was appropriate | agent_task, agent_response | 0–10 |
## LLM-as-Judge

Use a language model to evaluate against custom criteria.

### G-Eval (g_eval)

G-Eval uses chain-of-thought reasoning to score against custom evaluation criteria.

| Parameter | Description |
|---|---|
| criteria | The evaluation criteria as a string (e.g., "Is the response helpful and accurate?") |
| steps | Optional: step-by-step evaluation instructions |
| score_range | Min/max for the output score (default: 0–10) |
| iterations | Number of LLM judge calls to average (default: 1) |
| chain_of_thought | Enable CoT reasoning before scoring (default: true) |

Requires: agent_task, agent_response (+ any fields referenced in the criteria)
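A hypothetical g_eval configuration using the parameters above might look like the following. The dict syntax is illustrative only (jobs are typically configured in the UI or via the API), and aggregate_iterations simply shows what averaging iterations judge calls implies:

```python
# Illustrative configuration; parameter names follow the table above,
# but the concrete config format is an assumption.
g_eval_config = {
    "scorer": "g_eval",
    "criteria": "Is the response helpful, accurate, and grounded in the task?",
    "steps": [
        "Read the agent_task and agent_response.",
        "Check each factual claim against the task context.",
        "Assign a score with a one-sentence justification.",
    ],
    "score_range": (0, 10),
    "iterations": 3,           # average 3 judge calls to reduce variance
    "chain_of_thought": True,  # reason step-by-step before scoring
}

def aggregate_iterations(scores: list[float]) -> float:
    """What 'iterations' implies: the mean of repeated judge calls."""
    return sum(scores) / len(scores)
```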
### Panel of Judges (panel_judge)

Runs multiple LLM judge models simultaneously and aggregates their scores.

| Parameter | Description |
|---|---|
| judges | List of JudgeConfig objects (model, weight, name, specialty, temperature) |
| aggregation | How to combine scores: mean, median, weighted, majority, consensus, min, max |
| criteria | Shared evaluation criteria for all judges |

Requires: agent_task, agent_response
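Two of the aggregation modes can be sketched as plain functions. This is illustrative only; the shipped scorer also implements mean, majority, consensus, min, and max:

```python
import statistics

def aggregate_panel(scores: list[float], weights: list[float], method: str) -> float:
    """Sketch of the "weighted" and "median" aggregation modes.

    Weights would come from each judge's JudgeConfig.
    """
    if method == "weighted":
        # Weighted mean: judges with higher weight pull the score harder
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if method == "median":
        # Median is robust to a single outlier judge
        return statistics.median(scores)
    raise ValueError(f"unsupported method: {method}")
```

For example, three judges scoring 10, 5, and 0 with weights 1, 1, and 2 give a weighted score of 3.75, while the unweighted median is 5.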
## Custom

### Custom scorer (custom_scorer)

Define a scoring template using {field} placeholders that reference any StandardData field.

Requires: Any fields referenced in the template.
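A template might look like the following sketch, where placeholders are resolved against the item's StandardData fields before the judge model sees the prompt. The template text and the render_template helper are examples, not built-ins:

```python
# Example template referencing StandardData fields via {field} placeholders
template = (
    "Given the task: {agent_task}\n"
    "And the response: {agent_response}\n"
    "Rate how well the response addresses the task on a 0-10 scale."
)

def render_template(template: str, item: dict) -> str:
    """Hypothetical placeholder resolution via str.format-style substitution."""
    return template.format(**item)
```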
### Item summary (item_summary)

Generates a human-readable summary of the evaluation item. Useful for debugging and report generation.

Requires: agent_task, agent_response
## Choosing scorers

### For RAG pipelines

answer_relevancy + faithfulness + hallucination_detection + contextual_precision + contextual_recall

### For conversational agents

conversation_relevancy + intention_fulfillment + knowledge_retention + role_adherence

### For tool-calling agents

tool_correctness + parameter_correctness + task_progression + goal_achievement

### For voice agents (LiveKit / Pipecat)

mos + tone_clarity + gibberish + e2e_latency + llm_ttft + tts_ttfb

### For telephony agents

instruction_adherence + sentiment_csat + appropriate_call_termination + conversation_context_coherence

### For any agent

g_eval (with custom criteria) + content_moderation + answer_refusal
## Using "Recommend Scorers"

When creating an Eval Job, click Recommend Scorers and NovaEval will analyze a sample of your dataset items and suggest the most appropriate scorers based on which fields are populated. This is the fastest way to get started.
## Next steps

- Running Evaluations — create an eval job with your chosen scorers
- StandardData Schema — understand which fields to populate
- NovaSynth — generate test data with voice quality metrics