NovaEval - AI Evaluation Engine
Production-ready AI evaluation with 80+ specialized scorers covering accuracy, RAG, conversational, agent, voice quality, latency, telephony, and safety.
NovaEval is Noveum's comprehensive AI evaluation engine. It provides an intelligent, automated interface for evaluating AI agents, voice assistants, RAG pipelines, and conversational systems — all integrated seamlessly into your Noveum dashboard.
About NovaEval
NovaEval powers Noveum's evaluation pipeline with 80+ specialized scorers across 10 categories. It automatically selects the most relevant scorers based on your dataset's content and presents results directly in your Noveum dashboard.
Key Capabilities
- Comprehensive Scorer Library: 80+ specialized scorers covering accuracy, RAG, conversational, agent, audio quality, latency, telephony, safety, LLM-as-judge, and custom domains
- Voice Agent Evaluation: Audio quality scorers (MOS, pronunciation, gibberish detection) and latency scorers (TTFT, STT, TTS, E2E) built specifically for voice agents
- Telephony Scorers: Purpose-built scorers for phone agents — instruction adherence, CSAT, drop-off node detection, call termination quality
- Intelligent Selection: "Recommend Scorers" analyzes your dataset and suggests the best scorers automatically
- Custom Scorer Support: Create domain-specific evaluation metrics when standard scorers don't meet your needs
- Batch and Realtime Modes: Score historical datasets or continuously score new items as they arrive
- Production Ready: Built for enterprise-scale evaluation with robust error handling and scalability
Why Model Evaluation Matters
AI models perform differently across scenarios, and understanding their strengths and weaknesses is crucial for:
- Performance Optimization: Identify which models work best for specific use cases
- Cost Efficiency: Find the most cost-effective models without sacrificing quality
- Quality Assurance: Ensure consistent, reliable outputs across different scenarios
- Continuous Improvement: Track model performance over time and identify degradation
Comprehensive Scoring Framework
NovaEval provides a comprehensive suite of scorers organized by evaluation domain. Most scorers follow a shared scorer interface and support both synchronous and asynchronous evaluation. Agent-specific evaluators use dedicated evaluation primitives and are intentionally standalone.
🎯 Accuracy & Classification Metrics
ExactMatchScorer
Exact string matching with case-sensitive/insensitive options.
AccuracyScorer
Classification accuracy with intelligent answer extraction.
MultiPatternAccuracyScorer
Multiple matching patterns and strategies.
F1Scorer
Token-level F1 score with precision and recall.
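To make the arithmetic behind token-level F1 concrete, here is a minimal sketch of the standard metric. This illustrates the computation only; it is not NovaEval's actual implementation.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; one empty side scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "the cat" against the reference "the cat sat" yields precision 1.0 and recall 2/3, giving F1 = 0.8.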
📊 NLP Metrics
BLEUScorer
Translation quality evaluation with n-gram precision.
ROUGEScorer
Summarization quality with n-gram overlap.
LevenshteinSimilarityScorer
Edit distance-based similarity measurement.
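Edit-distance similarity is well defined independently of any library. The following sketch shows the classic dynamic-programming Levenshtein distance normalized into a 0–1 similarity score; again, this illustrates the metric, not NovaEval's internal code.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic DP edit distance: minimum inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the shorter row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))
```

For example, "kitten" and "sitting" differ by 3 edits, giving a similarity of 1 − 3/7 ≈ 0.571.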
💬 Conversational AI Metrics
KnowledgeRetentionScorer
Evaluates information retention across conversations.
ConversationRelevancyScorer
Response relevance to conversation context.
ConversationCompletenessScorer
Assesses if user intentions are fully addressed.
RoleAdherenceScorer
Consistency with assigned persona or role.
ConversationalMetricsScorer
Comprehensive conversational evaluation.
🔍 RAG (Retrieval-Augmented Generation) Metrics
AnswerRelevancyScorer
Answer relevance using semantic similarity.
FaithfulnessScorer
Measures faithfulness to context without hallucinations.
ContextualPrecisionScorer
Precision of retrieved context relevance.
ContextualRecallScorer
Recall of necessary information in context.
RAGASScorer
Composite RAGAS methodology combining multiple metrics.
ContextualPrecisionScorerPP
LLM-based chunk precision evaluation.
ContextualRecallScorerPP
LLM-based chunk recall evaluation.
RetrievalF1Scorer
F1 score for retrieval performance.
RetrievalRankingScorer
Ranking quality using NDCG and Average Precision.
SemanticSimilarityScorer
Semantic alignment using embeddings.
RetrievalDiversityScorer
Evaluates diversity and non-redundancy.
AggregateRAGScorer
Aggregates multiple RAG metrics.
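RetrievalRankingScorer is described as using NDCG and Average Precision. As a rough illustration of those standard ranking metrics (not NovaEval's exact code), both can be sketched in a few lines:

```python
import math

def ndcg(relevances, k=None):
    """Normalized Discounted Cumulative Gain for a ranked list of relevance grades."""
    rels = relevances[:k] if k is not None else relevances
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    # Ideal DCG: the same grades sorted best-first.
    ideal = sorted(relevances, reverse=True)[:len(rels)]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(relevant_flags):
    """Average Precision over a ranked list of binary relevance judgments."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant position
    return total / hits if hits else 0.0
```

A perfectly ordered result list scores NDCG = 1.0; pushing relevant chunks lower in the ranking discounts their contribution logarithmically.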
📝 Answer Quality Metrics
InformationDensityScorer
Information richness and content value.
ClarityAndCoherenceScorer
Language clarity and logical flow.
RAGAnswerQualityScorer
Comprehensive answer quality for RAG systems.
CitationQualityScorer
Citation relevance and accuracy.
AnswerCompletenessScorer
Completeness in addressing questions.
QuestionAnswerAlignmentScorer
Answer alignment with question intent.
TechnicalAccuracyScorer
Domain-specific technical accuracy.
ToneConsistencyScorer
Consistency of tone throughout.
TerminologyConsistencyScorer
Consistent use of terminology.
🛡️ Hallucination Detection Metrics
FactualAccuracyScorer
Factual accuracy against context.
ClaimVerificationScorer
Verifies individual claims against context.
HallucinationDetectionScorer
Detects unsupported or fabricated information.
🔀 Multi-Context Metrics
ConflictResolutionScorer
Handles contradictory information across contexts.
ContextPrioritizationScorer
Prioritizes most relevant context information.
ContextFaithfulnessScorerPP
Faithfulness across multiple context sources.
ContextGroundednessScorer
Measures grounding in provided contexts.
ContextCompletenessScorer
Utilization of all relevant context.
ContextConsistencyScorer
Consistency across multiple contexts.
🔒 Safety & Moderation Metrics
AnswerRefusalScorer
Appropriate refusal of out-of-scope questions.
ContentModerationScorer
Harmful or inappropriate content detection.
ContentSafetyViolationScorer
Specific safety violations detection.
ToxicityScorer
Toxic or offensive language detection.
IsHarmfulAdviceScorer
Harmful advice detection.
⚖️ Bias Detection Metrics
BiasDetectionScorer
General bias detection across multiple types.
NoGenderBiasScorer
Gender bias and stereotypes detection.
NoRacialBiasScorer
Racial bias and stereotypes detection.
NoAgeBiasScorer
Age bias and stereotypes detection.
CulturalSensitivityScorer
Cultural sensitivity and inclusivity.
✅ Format Validation Metrics
IsJSONScorer
Valid JSON format validation.
IsEmailScorer
Valid email address format validation.
ValidLinksScorer
URL and link validation.
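As a rough sketch of what these three format checks involve, using only Python's standard library (this is illustrative, not NovaEval's actual validation logic):

```python
import json
import re
from urllib.parse import urlparse

def is_json(text):
    """True if the text parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def is_email(text):
    """Lightweight email shape check (not full RFC 5322 validation)."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text) is not None

def is_valid_link(text):
    """True if the string parses as an http(s) URL with a host."""
    parsed = urlparse(text)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Note that the link check here validates URL shape only; a production ValidLinksScorer may additionally verify that links resolve.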
🤖 LLM-as-Judge Metrics
GEvalScorer
LLM evaluation with chain-of-thought reasoning.
PanelOfJudgesScorer
Multi-LLM evaluation with diverse perspectives.
SpecializedPanelScorer
Specialized panel configurations with expertise.
🎭 Agent Evaluation Metrics
Tool Relevancy Scoring
Appropriateness of tool call selection.
Tool Correctness Scoring
Correctness of tool call execution.
Parameter Correctness Scoring
Correctness of tool parameters.
Task Progression Scoring
Progress toward assigned tasks.
Context Relevancy Scoring
Response appropriateness for role and task.
Role Adherence Scoring
Consistency with assigned agent role.
Goal Achievement Scoring
Overall goal accomplishment measurement.
Conversation Coherence Scoring
Logical flow and context maintenance.
🔧 RAG Pipeline Metrics
QueryProcessingEvaluator
Query understanding and preprocessing.
RetrievalStageEvaluator
Retrieval quality and relevance.
RerankingEvaluator
Reranking and generation stage assessment.
RAGPipelineEvaluator
Comprehensive pipeline evaluation.
PipelineCoordinationScorer
Stage coordination and integration.
LatencyAnalysisScorer
Performance and bottleneck analysis.
ResourceUtilizationScorer
Resource efficiency evaluation.
ErrorPropagationScorer
Error handling across stages.
How Noveum Handles Evaluation
Noveum handles the entire evaluation process for you automatically:
1. Automatic Trace Processing
- Your AI traces are automatically processed and converted into evaluation datasets
- No manual data preparation or configuration required
- Real-world data from your actual AI interactions
2. Intelligent Scorer Selection
Based on your specific use case and AI application type, Noveum automatically selects the most relevant scorers from our comprehensive suite:
- RAG Applications: Contextual precision, faithfulness, answer relevancy
- Conversational AI: Knowledge retention, conversation relevancy, role adherence
- Agent Systems: Tool correctness, task progression, goal achievement
- Classification Tasks: Accuracy, F1 scores, exact match evaluation
- Voice Agents: MOS score, tone clarity, pronunciation, gibberish detection, E2E latency, STT/TTS latency
- Telephony / Phone: Instruction adherence, CSAT sentiment, drop-off node detection, appropriate call termination
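Conceptually, the mapping above pairs application types with default scorer sets. The sketch below is a hypothetical static table for illustration only; the actual "Recommend Scorers" feature analyzes your dataset's content rather than consulting a fixed lookup. Scorer names mirror the catalog above.

```python
# Hypothetical default mapping, for illustration only.
DEFAULT_SCORERS = {
    "rag": ["ContextualPrecisionScorer", "FaithfulnessScorer", "AnswerRelevancyScorer"],
    "conversational": ["KnowledgeRetentionScorer", "ConversationRelevancyScorer", "RoleAdherenceScorer"],
    "agent": ["Tool Correctness Scoring", "Task Progression Scoring", "Goal Achievement Scoring"],
    "classification": ["AccuracyScorer", "F1Scorer", "ExactMatchScorer"],
}

def recommend_scorers(app_type):
    """Return a default scorer set for an application type, with a safe fallback."""
    return DEFAULT_SCORERS.get(app_type, ["AccuracyScorer"])
```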
3. Custom Scorer Creation
When standard scorers don't meet your specific requirements, Noveum enables you to:
- Create custom evaluation metrics tailored to your business needs
- Define domain-specific scoring criteria
- Implement proprietary evaluation logic
- Integrate with your existing evaluation frameworks
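To make the custom-scorer idea concrete, here is a minimal sketch of what a domain-specific scorer might look like. The `BaseScorer` interface shown is an assumption for illustration; NovaEval's real base class, method names, and registration mechanism may differ.

```python
class BaseScorer:
    """Hypothetical scorer interface: score() maps an item to a 0..1 value."""
    name = "base"

    def score(self, prediction, reference="", context=""):
        raise NotImplementedError

class RequiredDisclaimerScorer(BaseScorer):
    """Example business rule: financial answers must carry a disclaimer."""
    name = "required_disclaimer"

    def score(self, prediction, reference="", context=""):
        # Binary pass/fail: 1.0 if the mandated phrase appears, else 0.0.
        return 1.0 if "not financial advice" in prediction.lower() else 0.0
```

Rule-based scorers like this run cheaply on every item; LLM-as-judge custom scorers follow the same shape but delegate the judgment to a model call.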
4. Dashboard Integration
All evaluation results are automatically presented in your Noveum dashboard:
- Real-time scoring and performance metrics
- Comparative analysis across different models
- Historical performance tracking
- Actionable insights and recommendations
Choosing the Right Scorers
With 80+ specialized scorers available, selecting the right combination is crucial for meaningful evaluation:
For RAG Applications
- AnswerRelevancyScorer: Ensures answers directly address the questions
- FaithfulnessScorer: Prevents hallucinations and maintains factual accuracy
- ContextualPrecisionScorer: Validates relevance of retrieved context
- ContextualRecallScorer: Ensures comprehensive information coverage
For Conversational AI
- KnowledgeRetentionScorer: Tracks information retention across conversations
- ConversationRelevancyScorer: Maintains context-aware responses
- RoleAdherenceScorer: Ensures consistent persona maintenance
- ConversationCompletenessScorer: Validates complete request fulfillment
For Agent Systems
- Tool Relevancy Scoring: Validates appropriate tool selection
- Task Progression Scoring: Measures goal advancement
- Goal Achievement Scoring: End-to-end task completion assessment
- Context Relevancy Scoring: Role-task alignment validation
For Voice Agents (LiveKit / Pipecat / NovaSynth)
- MOS Scorer: Mean Opinion Score — overall perceived audio quality, shown on a 0–10 scale in the dashboard
- Tone Clarity Scorer: Naturalness and clarity of synthesized speech
- Gibberish Scorer: Detects garbled or unintelligible audio segments
- Pronunciation Audio Scorer: Pronunciation accuracy across the session
- E2E Latency Scorer: End-to-end latency from user speech end to agent response start
- LLM TTFT Scorer: LLM time-to-first-token
- TTS TTFB Scorer: TTS time-to-first-byte
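The latency scorers above all reduce to differences between event timestamps in a trace. The sketch below shows that arithmetic; the event field names are assumptions for illustration, and real Noveum traces may label these events differently.

```python
def latency_metrics(events):
    """Derive TTFT, TTS TTFB, and E2E latency (seconds) from span timestamps.

    `events` maps hypothetical event names to epoch timestamps in seconds.
    """
    return {
        # LLM time-to-first-token: first generated token minus request start.
        "llm_ttft": events["llm_first_token"] - events["llm_request_start"],
        # TTS time-to-first-byte: first audio byte minus synthesis start.
        "tts_ttfb": events["tts_first_byte"] - events["tts_request_start"],
        # End-to-end: agent audio starts playing minus end of user speech.
        "e2e_latency": events["agent_speech_start"] - events["user_speech_end"],
    }
```

E2E latency is the number users actually feel on a call; TTFT and TTFB break it down so you can tell whether the LLM or the TTS stage is the bottleneck.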
For Telephony / Phone Agents
- Instruction Adherence Scorer: Whether the agent followed its instructions throughout the call
- Sentiment CSAT Scorer: Estimated customer satisfaction score based on conversation sentiment
- Drop-off Node Scorer: Identifies where in the conversation the user disengaged
- Conversation Context Coherence Scorer: Coherence across the full call
- Appropriate Call Termination Scorer: Whether the call ended appropriately
For Custom Requirements
When standard scorers don't meet your needs, create custom evaluation metrics that align with your specific business objectives and domain requirements.
Next Steps
- Scorers Reference — complete reference for all 80+ scorers with required fields
- Running Evaluations — create an eval job step by step
- SDK Integration — set up tracing to capture data for evaluation
- NovaSynth — generate voice test data for audio/telephony scorers
- NovaPilot — AI analysis and recommendations on eval results
Get Early Access to Noveum.ai Platform
Be the first to get notified when we open the Noveum Platform to more users. All users get free access to the Observability suite; early users also get free eval jobs and premium support for the first year.