NovaEval - AI Evaluation Engine
Production-ready AI evaluation with 80+ specialized scorers covering accuracy, RAG, conversational, agent, voice quality, latency, telephony, and safety.
NovaEval is Noveum's comprehensive AI evaluation engine. It provides an intelligent, automated interface for evaluating AI agents, voice assistants, RAG pipelines, and conversational systems — all integrated seamlessly into your Noveum dashboard.
About NovaEval
NovaEval powers Noveum's evaluation pipeline with 80+ specialized scorers across 10 categories. It automatically selects the most relevant scorers based on your dataset's content and presents results directly in your Noveum dashboard.
Key Capabilities
- Comprehensive Scorer Library: 80+ specialized scorers covering accuracy, RAG, conversational, agent, audio quality, latency, telephony, safety, LLM-as-judge, and custom domains
- Voice Agent Evaluation: Audio quality scorers (MOS, pronunciation, gibberish detection) and latency scorers (TTFT, STT, TTS, E2E) built specifically for voice agents
- Telephony Scorers: Purpose-built scorers for phone agents — instruction adherence, CSAT, drop-off node detection, call termination quality
- Intelligent Selection: "Recommend Scorers" analyzes your dataset and suggests the best scorers automatically
- Custom Scorer Support: Create domain-specific evaluation metrics when standard scorers don't meet your needs
- Batch and Realtime Modes: Score historical datasets or continuously score new items as they arrive
- Production Ready: Built for enterprise-scale evaluation with robust error handling and scalability
Why Model Evaluation Matters
AI models perform differently across scenarios, and understanding their strengths and weaknesses is crucial for:
- Performance Optimization: Identify which models work best for specific use cases
- Cost Efficiency: Find the most cost-effective models without sacrificing quality
- Quality Assurance: Ensure consistent, reliable outputs across different scenarios
- Continuous Improvement: Track model performance over time and identify degradation
Comprehensive Scoring Framework
NovaEval provides a comprehensive suite of scorers organized by evaluation domain. Most scorers follow a shared scorer interface and support both synchronous and asynchronous evaluation. Agent-specific evaluators use dedicated evaluation primitives and are intentionally standalone.
🎯 Accuracy & Classification Metrics
ExactMatchScorer
Exact string matching with case-sensitive/insensitive options.
AccuracyScorer
Classification accuracy with intelligent answer extraction.
MultiPatternAccuracyScorer
Multiple matching patterns and strategies.
F1Scorer
Token-level F1 score with precision and recall.
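To make the arithmetic behind token-level F1 concrete, here is a minimal sketch of the standard metric. This illustrates the computation only; it is not NovaEval's actual implementation.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; one empty side scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "the cat" against the reference "the cat sat" yields precision 1.0 and recall 2/3, giving F1 = 0.8.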
📊 NLP Metrics
BLEUScorer
Translation quality evaluation with n-gram precision.
ROUGEScorer
Summarization quality with n-gram overlap.
LevenshteinSimilarityScorer
Edit distance-based similarity measurement.
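Edit-distance similarity is well defined independently of any library. The following sketch shows the classic dynamic-programming Levenshtein distance normalized into a 0–1 similarity score; again, this illustrates the metric, not NovaEval's internal code.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic DP edit distance: minimum inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the shorter row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))
```

For example, "kitten" and "sitting" differ by 3 edits, giving a similarity of 1 − 3/7 ≈ 0.571.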
💬 Conversational AI Metrics
KnowledgeRetentionScorer
Evaluates information retention across conversations.
ConversationRelevancyScorer
Response relevance to conversation context.
ConversationCompletenessScorer
Assesses if user intentions are fully addressed.
RoleAdherenceScorer
Consistency with assigned persona or role.
ConversationalMetricsScorer
Comprehensive conversational evaluation.
🔍 RAG (Retrieval-Augmented Generation) Metrics
AnswerRelevancyScorer
Answer relevance using semantic similarity.
FaithfulnessScorer
Measures faithfulness to context without hallucinations.
ContextualPrecisionScorer
Precision of retrieved context relevance.
ContextualRecallScorer
Recall of necessary information in context.
RAGASScorer
Composite RAGAS methodology combining multiple metrics.
ContextualPrecisionScorerPP
LLM-based chunk precision evaluation.
ContextualRecallScorerPP
LLM-based chunk recall evaluation.
RetrievalF1Scorer
F1 score for retrieval performance.
RetrievalRankingScorer
Ranking quality using NDCG and Average Precision.
SemanticSimilarityScorer
Semantic alignment using embeddings.
RetrievalDiversityScorer
Evaluates diversity and non-redundancy.
AggregateRAGScorer
Aggregates multiple RAG metrics.
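RetrievalRankingScorer is described as using NDCG and Average Precision. As a rough illustration of those standard ranking metrics (not NovaEval's exact code), both can be sketched in a few lines:

```python
import math

def ndcg(relevances, k=None):
    """Normalized Discounted Cumulative Gain for a ranked list of relevance grades."""
    rels = relevances[:k] if k is not None else relevances
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    # Ideal DCG: the same grades sorted best-first.
    ideal = sorted(relevances, reverse=True)[:len(rels)]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(relevant_flags):
    """Average Precision over a ranked list of binary relevance judgments."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant position
    return total / hits if hits else 0.0
```

A perfectly ordered result list scores NDCG = 1.0; pushing relevant chunks lower in the ranking discounts their contribution logarithmically.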
📝 Answer Quality Metrics
InformationDensityScorer
Information richness and content value.
ClarityAndCoherenceScorer
Language clarity and logical flow.
RAGAnswerQualityScorer
Comprehensive answer quality for RAG systems.
CitationQualityScorer
Citation relevance and accuracy.
AnswerCompletenessScorer
Completeness in addressing questions.
QuestionAnswerAlignmentScorer
Answer alignment with question intent.
TechnicalAccuracyScorer
Domain-specific technical accuracy.
ToneConsistencyScorer
Consistency of tone throughout.
TerminologyConsistencyScorer
Consistent use of terminology.
🛡️ Hallucination Detection Metrics
FactualAccuracyScorer
Factual accuracy against context.
ClaimVerificationScorer
Verifies individual claims against context.
HallucinationDetectionScorer
Detects unsupported or fabricated information.
🔀 Multi-Context Metrics
ConflictResolutionScorer
Handles contradictory information across contexts.
ContextPrioritizationScorer
Prioritizes most relevant context information.
ContextFaithfulnessScorerPP
Faithfulness across multiple context sources.
ContextGroundednessScorer
Measures grounding in provided contexts.
ContextCompletenessScorer
Utilization of all relevant context.
ContextConsistencyScorer
Consistency across multiple contexts.
🔒 Safety & Moderation Metrics
AnswerRefusalScorer
Appropriate refusal of out-of-scope questions.
ContentModerationScorer
Harmful or inappropriate content detection.
ContentSafetyViolationScorer
Specific safety violations detection.
ToxicityScorer
Toxic or offensive language detection.
IsHarmfulAdviceScorer
Harmful advice detection.
⚖️ Bias Detection Metrics
BiasDetectionScorer
General bias detection across multiple types.
NoGenderBiasScorer
Gender bias and stereotypes detection.
NoRacialBiasScorer
Racial bias and stereotypes detection.
NoAgeBiasScorer
Age bias and stereotypes detection.
CulturalSensitivityScorer
Cultural sensitivity and inclusivity.
✅ Format Validation Metrics
IsJSONScorer
Valid JSON format validation.
IsEmailScorer
Valid email address format validation.
ValidLinksScorer
URL and link validation.
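As a rough sketch of what these three format checks involve, using only Python's standard library (this is illustrative, not NovaEval's actual validation logic):

```python
import json
import re
from urllib.parse import urlparse

def is_json(text):
    """True if the text parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def is_email(text):
    """Lightweight email shape check (not full RFC 5322 validation)."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text) is not None

def is_valid_link(text):
    """True if the string parses as an http(s) URL with a host."""
    parsed = urlparse(text)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Note that the link check here validates URL shape only; a production ValidLinksScorer may additionally verify that links resolve.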
🤖 LLM-as-Judge Metrics
GEvalScorer
LLM evaluation with chain-of-thought reasoning.
PanelOfJudgesScorer
Multi-LLM evaluation with diverse perspectives.
SpecializedPanelScorer
Specialized panel configurations with expertise.
🎭 Agent Evaluation Metrics
Tool Relevancy Scoring
Appropriateness of tool call selection.
Tool Correctness Scoring
Correctness of tool call execution.
Parameter Correctness Scoring
Correctness of tool parameters.
Task Progression Scoring
Progress toward assigned tasks.
Context Relevancy Scoring
Response appropriateness for role and task.
Role Adherence Scoring
Consistency with assigned agent role.
Goal Achievement Scoring
Overall goal accomplishment measurement.
Conversation Coherence Scoring
Logical flow and context maintenance.
🔧 RAG Pipeline Metrics
QueryProcessingEvaluator
Query understanding and preprocessing.
RetrievalStageEvaluator
Retrieval quality and relevance.
RerankingEvaluator
Reranking and generation stage assessment.
RAGPipelineEvaluator
Comprehensive pipeline evaluation.
PipelineCoordinationScorer
Stage coordination and integration.
LatencyAnalysisScorer
Performance and bottleneck analysis.
ResourceUtilizationScorer
Resource efficiency evaluation.
ErrorPropagationScorer
Error handling across stages.
How Noveum Handles Evaluation
Noveum handles the entire evaluation process for you automatically:
1. Automatic Trace Processing
- Your AI traces are automatically processed and converted into evaluation datasets
- No manual data preparation or configuration required
- Real-world data from your actual AI interactions
2. Intelligent Scorer Selection
Based on your specific use case and AI application type, Noveum automatically selects the most relevant scorers from our comprehensive suite:
- RAG Applications: Contextual precision, faithfulness, answer relevancy
- Conversational AI: Knowledge retention, conversation relevancy, role adherence
- Agent Systems: Tool correctness, task progression, goal achievement
- Classification Tasks: Accuracy, F1 scores, exact match evaluation
- Voice Agents: MOS score, tone clarity, pronunciation, gibberish detection, E2E latency, STT/TTS latency
- Telephony / Phone: Instruction adherence, CSAT sentiment, drop-off node detection, appropriate call termination
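Conceptually, the mapping above pairs application types with default scorer sets. The sketch below is a hypothetical static table for illustration only; the actual "Recommend Scorers" feature analyzes your dataset's content rather than consulting a fixed lookup. Scorer names mirror the catalog above.

```python
# Hypothetical default mapping, for illustration only.
DEFAULT_SCORERS = {
    "rag": ["ContextualPrecisionScorer", "FaithfulnessScorer", "AnswerRelevancyScorer"],
    "conversational": ["KnowledgeRetentionScorer", "ConversationRelevancyScorer", "RoleAdherenceScorer"],
    "agent": ["Tool Correctness Scoring", "Task Progression Scoring", "Goal Achievement Scoring"],
    "classification": ["AccuracyScorer", "F1Scorer", "ExactMatchScorer"],
}

def recommend_scorers(app_type):
    """Return a default scorer set for an application type, with a safe fallback."""
    return DEFAULT_SCORERS.get(app_type, ["AccuracyScorer"])
```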
3. Custom Scorer Creation
When standard scorers don't meet your specific requirements, Noveum enables you to:
- Create custom evaluation metrics tailored to your business needs
- Define domain-specific scoring criteria
- Implement proprietary evaluation logic
- Integrate with your existing evaluation frameworks
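To make the custom-scorer idea concrete, here is a minimal sketch of what a domain-specific scorer might look like. The `BaseScorer` interface shown is an assumption for illustration; NovaEval's real base class, method names, and registration mechanism may differ.

```python
class BaseScorer:
    """Hypothetical scorer interface: score() maps an item to a 0..1 value."""
    name = "base"

    def score(self, prediction, reference="", context=""):
        raise NotImplementedError

class RequiredDisclaimerScorer(BaseScorer):
    """Example business rule: financial answers must carry a disclaimer."""
    name = "required_disclaimer"

    def score(self, prediction, reference="", context=""):
        # Binary pass/fail: 1.0 if the mandated phrase appears, else 0.0.
        return 1.0 if "not financial advice" in prediction.lower() else 0.0
```

Rule-based scorers like this run cheaply on every item; LLM-as-judge custom scorers follow the same shape but delegate the judgment to a model call.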
4. Dashboard Integration
All evaluation results are automatically presented in your Noveum dashboard:
- Real-time scoring and performance metrics
- Comparative analysis across different models
- Historical performance tracking
- Actionable insights and recommendations
Choosing the Right Scorers
With 80+ specialized scorers available, selecting the right combination is crucial for meaningful evaluation:
For RAG Applications
- AnswerRelevancyScorer: Ensures answers directly address the questions
- FaithfulnessScorer: Prevents hallucinations and maintains factual accuracy
- ContextualPrecisionScorer: Validates relevance of retrieved context
- ContextualRecallScorer: Ensures comprehensive information coverage
For Conversational AI
- KnowledgeRetentionScorer: Tracks information retention across conversations
- ConversationRelevancyScorer: Maintains context-aware responses
- RoleAdherenceScorer: Ensures consistent persona maintenance
- ConversationCompletenessScorer: Validates complete request fulfillment
For Agent Systems
- Tool Relevancy Scoring: Validates appropriate tool selection
- Task Progression Scoring: Measures goal advancement
- Goal Achievement Scoring: End-to-end task completion assessment
- Context Relevancy Scoring: Role-task alignment validation
For Voice Agents (LiveKit / Pipecat / NovaSynth)
- MOS Scorer: Mean Opinion Score — overall perceived audio quality, shown on a 0–10 scale in the dashboard
- Tone Clarity Scorer: Naturalness and clarity of synthesized speech
- Gibberish Scorer: Detects garbled or unintelligible audio segments
- Pronunciation Audio Scorer: Pronunciation accuracy across the session
- E2E Latency Scorer: End-to-end latency from user speech end to agent response start
- LLM TTFT Scorer: LLM time-to-first-token
- TTS TTFB Scorer: TTS time-to-first-byte
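The latency scorers above all reduce to differences between event timestamps in a trace. The sketch below shows that arithmetic; the event field names are assumptions for illustration, and real Noveum traces may label these events differently.

```python
def latency_metrics(events):
    """Derive TTFT, TTS TTFB, and E2E latency (seconds) from span timestamps.

    `events` maps hypothetical event names to epoch timestamps in seconds.
    """
    return {
        # LLM time-to-first-token: first generated token minus request start.
        "llm_ttft": events["llm_first_token"] - events["llm_request_start"],
        # TTS time-to-first-byte: first audio byte minus synthesis start.
        "tts_ttfb": events["tts_first_byte"] - events["tts_request_start"],
        # End-to-end: agent audio starts playing minus end of user speech.
        "e2e_latency": events["agent_speech_start"] - events["user_speech_end"],
    }
```

E2E latency is the number users actually feel on a call; TTFT and TTFB break it down so you can tell whether the LLM or the TTS stage is the bottleneck.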
For Telephony / Phone Agents
- Instruction Adherence Scorer: Whether the agent followed its instructions throughout the call
- Sentiment CSAT Scorer: Estimated customer satisfaction score based on conversation sentiment
- Drop-off Node Scorer: Identifies where in the conversation the user disengaged
- Conversation Context Coherence Scorer: Coherence across the full call
- Appropriate Call Termination Scorer: Whether the call ended appropriately
For Custom Requirements
When standard scorers don't meet your needs, create custom evaluation metrics that align with your specific business objectives and domain requirements.
Next Steps
- Scorers Reference — complete reference for all 80+ scorers with required fields
- Running Evaluations — create an eval job step by step
- SDK Integration — set up tracing to capture data for evaluation
- NovaSynth — generate voice test data for audio/telephony scorers
- NovaPilot — AI analysis and recommendations on eval results
Get Early Access to Noveum.ai Platform
Be the first to get notified when we open the Noveum Platform to more users. All users get free access to the Observability suite; early users also get free eval jobs and premium support for the first year.