Evaluation with NovaEval

Learn how to evaluate your AI models using NovaEval, our open-source evaluation framework

NovaEval is our comprehensive, open-source AI model evaluation framework designed for production use. It provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.

About NovaEval

NovaEval is the evaluation engine that powers Noveum's intelligent model assessment. With 25+ specialized scorers spanning multiple domains, it automatically selects the most relevant evaluation metrics for your specific use case and presents the results directly in your Noveum dashboard.

Key Capabilities

  • Comprehensive Scorer Library: 25+ specialized scorers across accuracy, conversational AI, RAG, LLM-as-judge, and agent evaluation domains
  • Intelligent Selection: Automatic scorer selection based on your AI application type and use case
  • Custom Scorer Support: Create domain-specific evaluation metrics when standard scorers don't meet your needs
  • Seamless Integration: Automatic trace processing and dashboard integration with no manual configuration required
  • Real-time Results: Live scoring and performance metrics presented in your Noveum dashboard
  • Production Ready: Built for enterprise-scale evaluation with robust error handling and scalability

Why Model Evaluation Matters

AI models perform differently across tasks, domains, and input distributions, and understanding their strengths and weaknesses is crucial for:

  • Performance Optimization: Identify which models work best for specific use cases
  • Cost Efficiency: Find the most cost-effective models without sacrificing quality
  • Quality Assurance: Ensure consistent, reliable outputs across different scenarios
  • Continuous Improvement: Track model performance over time and identify degradation

Comprehensive Scoring Framework

NovaEval provides a comprehensive suite of scorers organized by evaluation domain. All scorers implement the BaseScorer interface and support both synchronous and asynchronous evaluation.
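
For orientation, here is a minimal sketch of what such a contract can look like. It is an illustration, not NovaEval's actual interface: the method names and signatures below are assumptions, so check the NovaEval repository for the authoritative definitions.

```python
# Minimal sketch of a scorer contract; NOT NovaEval's actual BaseScorer.
from abc import ABC, abstractmethod
from typing import Any, Optional

class BaseScorer(ABC):
    """Hypothetical common contract: one synchronous and one asynchronous entry point."""

    @abstractmethod
    def score(self, prediction: str, ground_truth: Optional[str] = None, **context: Any) -> float:
        """Return a score, typically in [0, 1]."""

    async def score_async(self, prediction: str, ground_truth: Optional[str] = None, **context: Any) -> float:
        """Async variant; by default delegates to the synchronous path."""
        return self.score(prediction, ground_truth, **context)
```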

🎯 Accuracy & Classification Metrics

ExactMatchScorer

Performs exact string matching between prediction and ground truth with case-sensitive/insensitive options and whitespace normalization.
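
The underlying logic is simple enough to show inline; this self-contained sketch uses illustrative option names, not NovaEval's exact parameters:

```python
def exact_match(prediction: str, ground_truth: str,
                case_sensitive: bool = False,
                normalize_whitespace: bool = True) -> float:
    """Binary exact-match score with optional normalization (illustrative)."""
    if normalize_whitespace:
        # Trim and collapse runs of whitespace before comparing.
        prediction = " ".join(prediction.split())
        ground_truth = " ".join(ground_truth.split())
    if not case_sensitive:
        prediction, ground_truth = prediction.lower(), ground_truth.lower()
    return 1.0 if prediction == ground_truth else 0.0

assert exact_match("  Paris ", "paris") == 1.0
assert exact_match("Paris", "paris", case_sensitive=True) == 0.0
```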

AccuracyScorer

Classification accuracy with intelligent answer extraction, well suited to MMLU-style multiple-choice questions.
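
To make "answer extraction" concrete, here is a hedged sketch of the kind of heuristic such a scorer can apply before comparing against the gold label; the regex is illustrative, not NovaEval's implementation:

```python
import re
from typing import Optional

def extract_choice(model_output: str) -> Optional[str]:
    """Heuristically pull a letter choice (A-D) out of free-form model output."""
    # Prefer an explicit "(the) answer is X" pattern...
    match = re.search(r"\b(?:the\s+)?answer(?:\s+is)?\s*[:\-]?\s*([A-D])\b",
                      model_output, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # ...otherwise fall back to the first standalone capital letter A-D.
    match = re.search(r"\b([A-D])\b", model_output)
    return match.group(1) if match else None

assert extract_choice("The answer is C because ...") == "C"
assert extract_choice("B") == "B"
```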

F1Scorer

Token-level F1 score for partial matching scenarios with configurable tokenization and precision/recall calculations.
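
For reference, token-level F1 is the harmonic mean of token precision and recall. A self-contained sketch with plain whitespace tokenization (NovaEval's tokenization is configurable and may differ):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# "the cat sat" vs. "the cat sat down": precision = 1.0, recall = 0.75, F1 ≈ 0.857
print(token_f1("the cat sat", "the cat sat down"))
```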

💬 Conversational AI Metrics

KnowledgeRetentionScorer

Evaluates whether the LLM retains information provided by users throughout a conversation, using sophisticated knowledge extraction.

ConversationRelevancyScorer

Measures response relevance to recent conversation context with sliding window analysis and LLM-based assessment.

ConversationCompletenessScorer

Assesses whether user intentions and requests are fully addressed with comprehensive coverage analysis.

RoleAdherenceScorer

Evaluates consistency with assigned persona or role throughout conversations with character maintenance assessment.

ConversationalMetricsScorer

Comprehensive conversational evaluation combining knowledge retention, relevancy, completeness, and role adherence.
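
To illustrate the sliding-window idea behind ConversationRelevancyScorer, the sketch below pairs each assistant turn with only its most recent context; in the real scorer an LLM judge then rates each pair. The structure is assumed for illustration and is not NovaEval code:

```python
def relevancy_windows(turns, window=3):
    """Pair each assistant turn with at most `window` preceding turns."""
    pairs = []
    for i, turn in enumerate(turns):
        if i % 2 == 1:  # assume strictly alternating user/assistant turns
            pairs.append((turns[max(0, i - window):i], turn))
    return pairs

turns = ["Hi!", "Hello! How can I help?",
         "Plan a trip to Kyoto.", "Sure - when are you travelling?"]
for context, response in relevancy_windows(turns):
    print(context, "->", response)  # the judge model would score each pair
```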

🔍 RAG (Retrieval-Augmented Generation) Metrics

AnswerRelevancyScorer

Evaluates how relevant answers are to the given questions, using semantic similarity and generation of multiple candidate questions.

FaithfulnessScorer

Measures whether responses are faithful to the provided context, without hallucinations, using three-tier verification.

ContextualPrecisionScorer

Evaluates precision of retrieved context relevance with intelligent context segmentation and relevance scoring.

ContextualRecallScorer

Measures whether all information needed to answer is present in the retrieved context, with comprehensive coverage analysis.

RAGASScorer

Composite RAGAS methodology combining Answer Relevancy, Faithfulness, Contextual Precision, and Contextual Recall.
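
The composite idea is easy to state: each sub-score is computed independently and then combined into one number. A hedged sketch with uniform weights (the actual weighting and sub-scorer APIs in NovaEval are assumptions to verify):

```python
def ragas_composite(scores, weights=None):
    """Weighted average of RAG sub-scores (uniform weights by default)."""
    weights = weights or {name: 1.0 for name in scores}
    total = sum(weights[name] * value for name, value in scores.items())
    return total / sum(weights.values())

print(ragas_composite({
    "answer_relevancy": 0.92,
    "faithfulness": 0.88,
    "contextual_precision": 0.75,
    "contextual_recall": 0.81,
}))  # -> 0.84
```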

🤖 LLM-as-Judge Metrics

GEvalScorer

Uses LLMs with chain-of-thought reasoning to score custom evaluation criteria, following the G-Eval research methodology.

CommonGEvalCriteria

Predefined criteria including Correctness, Relevance, Coherence, and Helpfulness for standardized evaluation.

PanelOfJudgesScorer

Multi-LLM evaluation with diverse perspectives and configurable aggregation methods for robust assessment.

SpecializedPanelScorer

Specialized panel configurations including Diverse Panel, Consensus Panel, and Weighted Expert Panel.
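
To make the panel idea concrete: each judge model scores the same output independently, and the panel aggregates the results. The judge names and scores below are made up, and which aggregation methods NovaEval exposes is an assumption to verify:

```python
from statistics import mean, median

# Independent scores from three hypothetical judge models.
judge_scores = {"judge_a": 0.90, "judge_b": 0.85, "judge_c": 0.40}

print(mean(judge_scores.values()))    # mean consensus: ~0.72
print(median(judge_scores.values()))  # median is robust to the one outlier judge: 0.85
```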

🎭 Agent Evaluation Metrics

Tool Relevancy Scoring

Evaluates appropriateness of tool calls given available tools with detailed tool selection assessment.

Tool Correctness Scoring

Compares actual tool calls against expected tool calls with detailed correctness assessment.

Parameter Correctness Scoring

Evaluates correctness of parameters passed to tool calls with comprehensive validation.

Task Progression Scoring

Measures agent progress toward assigned tasks with completion status and advancement quality analysis.

Context Relevancy Scoring

Assesses response appropriateness given the agent's role and task, with role-task-response alignment evaluation.

Role Adherence Scoring

Evaluates consistency with assigned agent role across actions with comprehensive role consistency tracking.

Goal Achievement Scoring

Measures overall goal accomplishment using complete interaction traces with end-to-end evaluation.

Conversation Coherence Scoring

Evaluates logical flow and context maintenance in agent conversations with coherence analysis.
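
As one concrete example from this family, tool correctness at its simplest is the fraction of expected tool calls the agent actually made. The sketch below is a simplification for illustration; the real scoring also weighs ordering and parameters:

```python
def tool_correctness(expected, actual):
    """Fraction of expected tool calls that appear in the actual call list."""
    if not expected:
        return 1.0
    hits = sum(1 for tool in expected if tool in actual)
    return hits / len(expected)

# The agent made one of the two expected calls -> 0.5
print(tool_correctness(["search_flights", "book_hotel"],
                       ["search_flights", "get_weather"]))
```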

How Noveum Handles Evaluation

Noveum handles the entire evaluation process for you automatically:

1. Automatic Trace Processing

  • Your AI traces are automatically processed and converted into evaluation datasets
  • No manual data preparation or configuration required
  • Real-world data from your actual AI interactions

2. Intelligent Scorer Selection

Based on your specific use case and AI application type, Noveum automatically selects the most relevant scorers from our comprehensive suite:

  • RAG Applications: Contextual precision, faithfulness, answer relevancy
  • Conversational AI: Knowledge retention, conversation relevancy, role adherence
  • Agent Systems: Tool correctness, task progression, goal achievement
  • Classification Tasks: Accuracy, F1 scores, exact match evaluation

3. Custom Scorer Creation

When standard scorers don't meet your specific requirements, Noveum enables you to (see the sketch after this list):

  • Create custom evaluation metrics tailored to your business needs
  • Define domain-specific scoring criteria
  • Implement proprietary evaluation logic
  • Integrate with your existing evaluation frameworks
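
As a hedged illustration, a custom scorer can be as small as a subclass with domain logic in its score method. This builds on the hypothetical BaseScorer contract sketched earlier, not NovaEval's exact API:

```python
from typing import Any, Optional

class ContainsDisclaimerScorer(BaseScorer):  # BaseScorer as sketched earlier (assumed)
    """Example domain metric: reward responses that include a required disclaimer."""

    def __init__(self, disclaimer: str):
        self.disclaimer = disclaimer.lower()

    def score(self, prediction: str, ground_truth: Optional[str] = None, **context: Any) -> float:
        return 1.0 if self.disclaimer in prediction.lower() else 0.0

scorer = ContainsDisclaimerScorer("not financial advice")
print(scorer.score("Buy index funds. This is not financial advice."))  # 1.0
```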

4. Dashboard Integration

All evaluation results are automatically presented in your Noveum dashboard:

  • Real-time scoring and performance metrics
  • Comparative analysis across different models
  • Historical performance tracking
  • Actionable insights and recommendations

Choosing the Right Scorers

With 25+ specialized scorers to choose from, selecting the right combination is crucial for meaningful evaluation. The lists below give starting points by application type; a sketch that assembles them into suites follows the agent list:

For RAG Applications

  • AnswerRelevancyScorer: Ensures answers directly address the questions
  • FaithfulnessScorer: Prevents hallucinations and maintains factual accuracy
  • ContextualPrecisionScorer: Validates relevance of retrieved context
  • ContextualRecallScorer: Ensures comprehensive information coverage

For Conversational AI

  • KnowledgeRetentionScorer: Tracks information retention across conversations
  • ConversationRelevancyScorer: Maintains context-aware responses
  • RoleAdherenceScorer: Ensures consistent persona maintenance
  • ConversationCompletenessScorer: Validates complete request fulfillment

For Agent Systems

  • Tool Relevancy Scoring: Validates appropriate tool selection
  • Task Progression Scoring: Measures goal advancement
  • Goal Achievement Scoring: End-to-end task completion assessment
  • Context Relevancy Scoring: Role-task alignment validation
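
Putting the guidance above together, one way to organize a starting suite per application type is a simple lookup. The scorer names mirror this page; whether NovaEval exposes them under exactly these names, and how a run is launched, are assumptions to verify against the repository:

```python
SCORER_SUITES = {
    "rag": ["AnswerRelevancyScorer", "FaithfulnessScorer",
            "ContextualPrecisionScorer", "ContextualRecallScorer"],
    "conversational": ["KnowledgeRetentionScorer", "ConversationRelevancyScorer",
                       "RoleAdherenceScorer", "ConversationCompletenessScorer"],
    "agent": ["tool_relevancy", "task_progression",
              "goal_achievement", "context_relevancy"],
    "classification": ["AccuracyScorer", "F1Scorer", "ExactMatchScorer"],
}

def suite_for(app_type: str) -> list:
    """Pick a starting scorer suite; fall back to exact match for unknown types."""
    return SCORER_SUITES.get(app_type, ["ExactMatchScorer"])

print(suite_for("rag"))
```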

For Custom Requirements

When standard scorers don't meet your needs, create custom evaluation metrics that align with your specific business objectives and domain requirements.

Next Steps


Ready to optimize your AI models? Explore our Available Scorers to find the perfect evaluation metrics for your use case!
