NovaEval - AI Evaluation Engine

Production-ready AI model evaluation with 73+ specialized scorers and intelligent metric selection

NovaEval is Noveum's AI model evaluation engine, designed for production use. It provides an automated interface for evaluating language models across datasets, metrics, and deployment scenarios, with results integrated into your Noveum dashboard.

About NovaEval

NovaEval powers Noveum's model assessment capabilities. It includes 73+ specialized scorers across multiple domains, selects the evaluation metrics most relevant to your use case, and presents the results directly in your Noveum dashboard.

Key Capabilities

  • Comprehensive Scorer Library: 73+ specialized scorers across accuracy, conversational AI, RAG, LLM-as-judge, and agent evaluation domains
  • Intelligent Selection: Automatic scorer selection based on your AI application type and use case
  • Custom Scorer Support: Create domain-specific evaluation metrics when standard scorers don't meet your needs
  • Seamless Integration: Automatic trace processing and dashboard integration with no manual configuration required
  • Real-time Results: Live scoring and performance metrics presented in your Noveum dashboard
  • Production Ready: Built for enterprise-scale evaluation with robust error handling and scalability

Why Model Evaluation Matters

AI models perform differently across scenarios. Understanding their strengths and weaknesses is crucial for:

  • Performance Optimization: Identify which models work best for specific use cases
  • Cost Efficiency: Find the most cost-effective models without sacrificing quality
  • Quality Assurance: Ensure consistent, reliable outputs across different scenarios
  • Continuous Improvement: Track model performance over time and identify degradation

Comprehensive Scoring Framework

NovaEval provides a comprehensive suite of scorers organized by evaluation domain. All scorers implement the BaseScorer interface and support both synchronous and asynchronous evaluation.
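
The actual BaseScorer signature is not shown on this page, so the following is only a rough sketch of how a scorer interface with both synchronous and asynchronous entry points might be shaped. The class name `SketchScorer`, the `score` and `score_async` methods, and their parameters are assumptions made for illustration, not NovaEval's API.

```python
import asyncio
from abc import ABC, abstractmethod


class SketchScorer(ABC):
    """Hypothetical stand-in for a scorer interface (not NovaEval's BaseScorer)."""

    @abstractmethod
    def score(self, prediction: str, ground_truth: str) -> float:
        """Return a score in [0.0, 1.0] for a single prediction."""

    async def score_async(self, prediction: str, ground_truth: str) -> float:
        # Default async path: run the synchronous scorer in a worker thread
        return await asyncio.to_thread(self.score, prediction, ground_truth)


class LengthRatioScorer(SketchScorer):
    """Toy scorer: ratio of the shorter string length to the longer one."""

    def score(self, prediction: str, ground_truth: str) -> float:
        longest = max(len(prediction), len(ground_truth))
        if longest == 0:
            return 1.0
        return min(len(prediction), len(ground_truth)) / longest


scorer = LengthRatioScorer()
sync_result = scorer.score("abcd", "abcdefgh")
async_result = asyncio.run(scorer.score_async("abcd", "abcdefgh"))
```

In this sketch a subclass only has to implement `score`; the asynchronous variant is derived from it, which keeps the two paths consistent.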

🎯 Accuracy & Classification Metrics

ExactMatchScorer

Exact string matching with case-sensitive/insensitive options.
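
As a minimal sketch of the idea, exact matching with a case-sensitivity switch could look like the function below. The function name, the `case_sensitive` parameter, and the whitespace stripping are assumptions for illustration, not the ExactMatchScorer implementation.

```python
def exact_match(prediction: str, reference: str, case_sensitive: bool = True) -> float:
    """Return 1.0 if the two strings match exactly, else 0.0."""
    # Assumption: leading and trailing whitespace is ignored
    pred, ref = prediction.strip(), reference.strip()
    if not case_sensitive:
        # casefold() is a more aggressive lower() intended for caseless comparison
        pred, ref = pred.casefold(), ref.casefold()
    return 1.0 if pred == ref else 0.0
```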

AccuracyScorer

Classification accuracy with intelligent answer extraction.
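
How AccuracyScorer extracts answers is not described here. One plausible approach for multiple-choice tasks is to pull a choice letter out of free-form model output with a few regular expressions and then compare it to the label. The patterns and the function names `extract_choice` and `accuracy` below are assumptions for illustration.

```python
import re
from typing import Optional

# Assumed patterns, tried in order: "answer is B" / "Answer: B", "(B)", then a bare letter
_PATTERNS = [
    re.compile(r"(?i:answer\s*(?:is|:)?)\s*\(?([A-D])\b"),
    re.compile(r"\(([A-D])\)"),
    re.compile(r"\b([A-D])\b"),
]


def extract_choice(text: str) -> Optional[str]:
    """Return the first choice letter A-D found in the text, or None."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def accuracy(outputs: list[str], labels: list[str]) -> float:
    """Fraction of outputs whose extracted choice equals the label."""
    if not outputs:
        return 0.0
    correct = sum(extract_choice(o) == l for o, l in zip(outputs, labels))
    return correct / len(outputs)
```

An output with no recognizable choice counts as incorrect in this sketch. The bare-letter fallback is deliberately loose and would need tightening for outputs that begin with the word "A".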

F1Scorer

Token-level F1 score with precision and recall.
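
Token-level F1 is commonly computed from the overlap between predicted and reference tokens. The sketch below assumes lowercasing and whitespace tokenization, which may differ from what F1Scorer actually does. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> dict[str, float]:
    """Compute token-level precision, recall, and F1 between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: a shared token counts as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For the prediction "the cat sat" against the reference "the cat sat down", all three predicted tokens appear in the reference, so precision is 1.0 and recall is 0.75.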

🔍 RAG (Retrieval-Augmented Generation) Metrics

AnswerRelevancyScorer

Answer relevance using semantic similarity.

FaithfulnessScorer

Measures how faithfully the answer sticks to the provided context, without hallucinated content.

ContextualPrecisionScorer

Precision of the retrieved context: how much of what was retrieved is relevant.
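
The exact formula behind ContextualPrecisionScorer is not given on this page. As an illustration only, two common ways to turn per-chunk relevance judgments into a precision score are sketched below. The relevance flags are assumed to come from some upstream judge (for example, a human or an LLM), and both function names are hypothetical.

```python
def context_precision(relevance: list[bool]) -> float:
    """Fraction of retrieved chunks judged relevant."""
    return sum(relevance) / len(relevance) if relevance else 0.0


def rank_weighted_precision(relevance: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision among the top-k chunks
    return total / hits if hits else 0.0
```

The rank-weighted variant rewards retrievers that place relevant chunks near the top: the ranking `[True, True, False]` scores higher than `[False, True, True]` even though both contain two relevant chunks.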

💬 Conversational AI Metrics

KnowledgeRetentionScorer

Evaluates information retention across conversations.

ConversationRelevancyScorer

Response relevance to conversation context.

RoleAdherenceScorer

Consistency with assigned persona or role.

🛡️ Hallucination Detection Metrics

FactualAccuracyScorer

Factual accuracy against context.

ClaimVerificationScorer

Verifies individual claims against context.

HallucinationDetectionScorer

Detects unsupported or fabricated information.

🎭 Agent Evaluation Metrics

Tool Relevancy Scoring

Appropriateness of tool call selection.

Task Progression Scoring

Progress toward assigned tasks.

Goal Achievement Scoring

Overall goal accomplishment measurement.

How Noveum Handles Evaluation

Noveum handles the entire evaluation process automatically:

1. Automatic Trace Processing

  • Your AI traces are automatically processed and converted into evaluation datasets
  • No manual data preparation or configuration required
  • Evaluation runs on real-world data from your actual AI interactions

2. Intelligent Scorer Selection

Based on your specific use case and AI application type, Noveum automatically selects the most relevant scorers from our comprehensive suite:

  • RAG Applications: Contextual precision, faithfulness, answer relevancy
  • Conversational AI: Knowledge retention, conversation relevancy, role adherence
  • Agent Systems: Tool correctness, task progression, goal achievement
  • Classification Tasks: Accuracy, F1 scores, exact match evaluation
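
Noveum performs this selection for you, and its internal logic is not documented here. Conceptually, the list above amounts to a lookup from application type to scorers, which can be pictured with the sketch below. The dictionary keys, scorer identifiers, and the `select_scorers` function are hypothetical names chosen for illustration.

```python
# Mapping transcribed from the list above; all identifiers are hypothetical
SCORERS_BY_APP_TYPE = {
    "rag": ["contextual_precision", "faithfulness", "answer_relevancy"],
    "conversational": ["knowledge_retention", "conversation_relevancy", "role_adherence"],
    "agent": ["tool_correctness", "task_progression", "goal_achievement"],
    "classification": ["accuracy", "f1", "exact_match"],
}


def select_scorers(app_type: str) -> list[str]:
    """Return the scorer names associated with an application type."""
    try:
        # Return a copy so callers cannot mutate the shared mapping
        return list(SCORERS_BY_APP_TYPE[app_type.lower()])
    except KeyError:
        known = ", ".join(sorted(SCORERS_BY_APP_TYPE))
        raise ValueError(f"Unknown app type {app_type!r}; expected one of: {known}") from None
```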

3. Custom Scorer Creation

When standard scorers don't meet your specific requirements, Noveum enables you to:

  • Create custom evaluation metrics tailored to your business needs
  • Define domain-specific scoring criteria
  • Implement proprietary evaluation logic
  • Integrate with your existing evaluation frameworks
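
To make "domain-specific scoring criteria" concrete, here is a small, self-contained sketch of a custom metric that checks whether an output mentions a set of required terms. The class `RequiredTermsScorer`, its fields, and the shape of the returned dictionary are all assumptions for illustration. How a custom scorer is registered with Noveum is not covered on this page.

```python
from dataclasses import dataclass


@dataclass
class RequiredTermsScorer:
    """Hypothetical custom metric: share of required terms present in the output."""

    required_terms: tuple[str, ...]
    threshold: float = 0.8  # assumed pass mark

    def score(self, output: str) -> dict:
        text = output.casefold()
        # Case-insensitive substring check for each required term
        found = [t for t in self.required_terms if t.casefold() in text]
        value = len(found) / len(self.required_terms) if self.required_terms else 1.0
        return {"score": value, "passed": value >= self.threshold, "matched": found}


scorer = RequiredTermsScorer(required_terms=("APR", "late fee", "grace period"))
full = scorer.score("Your APR is 19.9% and a late fee applies after the grace period.")
partial = scorer.score("The APR is 19.9%.")
```

Here `full` passes because all three terms appear, while `partial` matches only "APR" and falls below the assumed threshold.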

4. Dashboard Integration

All evaluation results are automatically presented in your Noveum dashboard:

  • Real-time scoring and performance metrics
  • Comparative analysis across different models
  • Historical performance tracking
  • Actionable insights and recommendations

Next Steps

Ready to optimize your AI models? Start by setting up SDK Integration to capture data for evaluation.
