
Trace + Eval Overview

Noveum.ai provides a complete solution for AI application observability and evaluation through two complementary open-source tools: Noveum Trace SDK and NovaEval Framework.

🔄 How They Work Together

Noveum Trace SDK → Data Collection

The Noveum Trace SDK collects real-time data from your LLM applications:

import noveum_trace
from openai import OpenAI
 
# Initialize tracing
noveum_trace.init(api_key="your-key", project="my-app")
 
@noveum_trace.trace_llm
def call_llm(prompt: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

What Noveum Trace captures:

  • Performance metrics: Latency, token usage, cost per request
  • Quality metrics: Error rates, response quality indicators
  • Business context: User interactions, feature usage patterns
  • Model behavior: Response patterns, hallucination detection
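
For a rough sense of what this looks like in practice, a single captured trace record might carry data along the following lines. The field names below are illustrative, not the SDK's exact schema:

# Illustrative only: the exact span/attribute schema depends on the SDK version.
example_trace_record = {
    "name": "call_llm",
    "model": "gpt-4",
    "latency_ms": 842,                # performance
    "prompt_tokens": 56,
    "completion_tokens": 212,
    "estimated_cost_usd": 0.0134,     # cost per request
    "error": None,                    # feeds error-rate / quality metrics
    "metadata": {"feature": "chat", "user_id": "u_123"},  # business context
}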

NovaEval Framework → Systematic Evaluation

NovaEval provides comprehensive evaluation capabilities:

from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer
 
# Create evaluation setup
dataset = MMLUDataset(subset="elementary_mathematics", num_samples=100)
model = OpenAIModel(model_name="gpt-4", temperature=0.0)
scorer = AccuracyScorer(extract_answer=True)
 
# Run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)
results = evaluator.run()

What NovaEval evaluates:

  • Accuracy: Exact match, semantic similarity, classification accuracy
  • Code quality: Syntax validation, execution success, test coverage
  • Cost efficiency: Token usage optimization, cost per accuracy point
  • Performance: Latency analysis, throughput testing
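
Criteria beyond the built-in scorers can be expressed as custom scorers. The sketch below assumes NovaEval exposes a BaseScorer base class with a score() method; treat the import path and signature as assumptions to verify against the library:

# Minimal custom-scorer sketch; BaseScorer and the score() signature are
# assumptions about NovaEval's extension API, not a verbatim example.
from novaeval.scorers import BaseScorer

class ConcisenessScorer(BaseScorer):
    """Scores 1.0 for concise answers and decays as responses grow longer."""

    def score(self, prediction: str, ground_truth: str, context=None) -> float:
        max_len = 500  # characters treated as "concise" in this sketch
        return min(1.0, max_len / max(len(prediction), 1))

# Used alongside the built-in scorers:
# evaluator = Evaluator(dataset=dataset, models=[model],
#                       scorers=[AccuracyScorer(), ConcisenessScorer()])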

🎯 Use Cases

1. Production Monitoring → Evaluation

Noveum Trace (Production) → NovaEval (Evaluation)

Workflow:

  1. Deploy your LLM app with Noveum Trace
  2. Collect real user interactions and performance data
  3. Identify problematic patterns or performance issues
  4. Create evaluation datasets from production logs (see the sketch after this list)
  5. Test model improvements with NovaEval
  6. Deploy optimized models back to production
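
Steps 3 and 4 might look roughly like the sketch below. It reuses noveum_trace.get_production_logs() from Integration Example 1 further down and assumes each log entry exposes error, latency, and prompt fields; those field names are illustrative:

# Sketch: turn problematic production traces into evaluation samples.
logs = noveum_trace.get_production_logs(
    start_date="2024-01-01",
    end_date="2024-01-31"
)

# Keep only traces that errored or were unusually slow.
problematic = [
    log for log in logs
    if log.get("error") is not None or log.get("latency_ms", 0) > 2000
]

# Add reviewed reference answers before evaluating; see Integration Example 1
# below for wiring these samples into a CustomDataset.
samples = [{"input": log["prompt"]} for log in problematic]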

2. Model Comparison → Production Validation

NovaEval (Testing) → Noveum Trace (Validation)

Workflow:

  1. Compare multiple models using NovaEval benchmarks
  2. Select the best performing model
  3. Deploy to production with Noveum Trace
  4. Monitor real-world performance
  5. Validate that production performance matches evaluation results

3. Continuous Improvement Loop

Production → Evaluation → Optimization → Production

Workflow:

  1. Monitor production with Noveum Trace
  2. Evaluate with NovaEval using production data
  3. Optimize models based on evaluation results
  4. Deploy improvements back to production
  5. Repeat the cycle for continuous improvement
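
One way to close this loop programmatically is to compare live metrics with the latest evaluation baseline and re-run NovaEval when they drift apart. The sketch below reuses get_production_metrics() and results.accuracy from Example 2 further down; the drift threshold and attribute names are assumptions:

# Sketch: trigger a fresh evaluation when production quality drifts.
DRIFT_THRESHOLD = 0.05  # five percentage points; tune for your application

production_metrics = noveum_trace.get_production_metrics()
baseline_accuracy = results.accuracy  # from the most recent NovaEval run

if baseline_accuracy - production_metrics.accuracy > DRIFT_THRESHOLD:
    # Rebuild the dataset from recent logs and re-evaluate candidate fixes.
    evaluator = Evaluator(dataset=ProductionDataset(), models=[model], scorers=[scorer])
    results = evaluator.run()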

📊 Integration Examples

Example 1: Production Data → Evaluation Dataset

# Step 1: Collect production data with Noveum Trace
@noveum_trace.trace_llm
def customer_service_bot(user_query: str) -> str:
    response = call_llm(user_query)  # your LLM call here (e.g., call_llm from the earlier example)
    return response
 
# Step 2: Create evaluation dataset from production logs
from novaeval.datasets import CustomDataset
 
class ProductionDataset(CustomDataset):
    def load_data(self):
        # Load data from Noveum Trace logs
        logs = noveum_trace.get_production_logs(
            start_date="2024-01-01",
            end_date="2024-01-31"
        )
        # process_logs_to_samples is your own helper that maps raw log entries to samples
        return self.process_logs_to_samples(logs)
 
# Step 3: Evaluate with NovaEval
dataset = ProductionDataset()
evaluator = Evaluator(dataset=dataset, models=[model], scorers=[scorer])
results = evaluator.run()

Example 2: Evaluation Results → Production Monitoring

# Step 1: Evaluate models with NovaEval
evaluator = Evaluator(
    dataset=MMLUDataset(subset="math"),
    models=[gpt4_model, claude_model],  # model instances configured beforehand
    scorers=[AccuracyScorer()]
)
results = evaluator.run()
 
# Step 2: Deploy best model with Noveum Trace
best_model = results.get_best_model()
@noveum_trace.trace_llm
def production_llm_call(prompt: str) -> str:
    return best_model.generate(prompt)
 
# Step 3: Monitor production performance
production_metrics = noveum_trace.get_production_metrics()
print(f"Production accuracy: {production_metrics.accuracy}")
print(f"Evaluation accuracy: {results.accuracy}")

🏗️ Architecture Benefits

Unified Data Pipeline

  • Single source of truth for both production and evaluation data
  • Consistent metrics across monitoring and evaluation
  • Seamless integration between real-time and batch processing

Comprehensive Coverage

  • Real-time monitoring with Noveum Trace
  • Systematic evaluation with NovaEval
  • Complete observability from development to production

Scalable Workflow

  • Local development with both tools
  • CI/CD integration for automated evaluation (see the sketch below)
  • Production deployment with monitoring
  • Cloud-native architecture for scale
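
For the CI/CD bullet above, a common pattern is a small script that runs an evaluation and fails the pipeline when a quality bar is not met. A rough sketch, with the threshold, dataset, and the results.accuracy attribute as placeholders to adapt:

# ci_eval_gate.py - sketch of an automated evaluation gate for CI.
import sys

from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

MIN_ACCURACY = 0.85  # placeholder quality bar

evaluator = Evaluator(
    dataset=MMLUDataset(subset="elementary_mathematics", num_samples=100),
    models=[OpenAIModel(model_name="gpt-4", temperature=0.0)],
    scorers=[AccuracyScorer()],
    output_dir="./ci-results"
)
results = evaluator.run()

if results.accuracy < MIN_ACCURACY:  # attribute assumed, mirroring Example 2 above
    print(f"Evaluation gate failed: accuracy {results.accuracy:.3f} < {MIN_ACCURACY}")
    sys.exit(1)
print(f"Evaluation gate passed: accuracy {results.accuracy:.3f}")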

🚀 Getting Started

1. Start with Noveum Trace

pip install noveum-trace

import noveum_trace
noveum_trace.init(api_key="your-key", project="my-app")
 
@noveum_trace.trace_llm
def my_llm_function(prompt: str) -> str:
    response = ...  # your LLM call here
    return response

2. Add NovaEval for Evaluation

pip install novaeval

from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer
 
evaluator = Evaluator(
    dataset=MMLUDataset(subset="math", num_samples=100),
    models=[OpenAIModel(model_name="gpt-4")],
    scorers=[AccuracyScorer()]
)
results = evaluator.run()

3. Integrate Both Tools

# Production monitoring
@noveum_trace.trace_llm
def production_call(prompt: str) -> str:
    return model.generate(prompt)
 
# Evaluation with production data
class ProductionDataset(CustomDataset):
    def load_data(self):
        return noveum_trace.get_production_logs()
 
evaluator = Evaluator(
    dataset=ProductionDataset(),
    models=[model],
    scorers=[scorer]
)
evaluation_results = evaluator.run()

📈 Benefits

For Developers

  • Faster iteration with real-time feedback
  • Confidence in deployments with comprehensive evaluation
  • Reduced debugging time with detailed observability

For Data Scientists

  • Rich datasets from production interactions
  • Systematic evaluation with multiple metrics
  • Continuous improvement through feedback loops

For Product Teams

  • Data-driven decisions on model selection
  • Cost optimization through performance monitoring
  • Quality assurance with automated evaluation

For Enterprises

  • Scalable architecture for large deployments
  • Compliance ready with detailed logging
  • Risk mitigation through systematic evaluation

🔗 Next Steps


Ready to build better AI applications? Start with Noveum Trace for monitoring and NovaEval for evaluation to create a complete AI observability solution.

Exclusive Early Access

Get Early Access to Noveum.ai Platform

Be the first to get notified when we open the Noveum Platform to more users. All users get free access to the Observability suite; early users also get free eval jobs and premium support for their first year.

Sign up now. We send access to a new batch of users every week.

Early access members receive premium onboarding support and influence our product roadmap. Limited spots available.