
Eval Jobs

An Eval Job is how Noveum systematically compares multiple AI models on a chosen dataset. With a single click, you can run GPT-4, Claude 2, DeepSeek, or any other model side by side and get a comprehensive summary of their performance, cost, and speed.

1. Defining an Eval Job

  1. Pick a Dataset: Choose from your curated logs or imported benchmarks.
  2. Select Models: Pick any models from providers configured in your AI Gateway (or from external providers set up in the dashboard).
  3. Set Evaluation Criteria:
    • Metric weighting (e.g., 50% accuracy, 30% cost, 20% latency)
    • Additional domain checks like toxicity or bias (optional)
  4. Run the Job: Noveum will automatically re-send the dataset queries to each selected model and capture responses.
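The metric weighting in step 3 can be illustrated with a small sketch. Note that the job layout and the `compute_composite_score` helper below are hypothetical, invented for illustration; they are not Noveum's actual API.

```python
# Hypothetical eval-job definition; field names are illustrative only.
eval_job = {
    "dataset": "support-tickets-v2",  # placeholder dataset name
    "models": ["gpt-4", "claude-2", "deepseek-chat"],
    "weights": {"accuracy": 0.5, "cost": 0.3, "latency": 0.2},
}

def compute_composite_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of normalized metrics (higher is better for each)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[name] * metrics[name] for name in weights)

# Example: metrics normalized to [0, 1], with cost and latency inverted
# so that a higher value means cheaper / faster.
score = compute_composite_score(
    {"accuracy": 0.90, "cost": 0.60, "latency": 0.80},
    eval_job["weights"],
)
```

With a 50/30/20 split, a model that is very accurate but expensive can still be outranked by a cheaper, slightly less accurate one.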

2. Evaluation Process

For each query in your dataset, Noveum does the following:

  1. Send: The query is sent to each model.
  2. Collect: Responses are captured, along with cost, token usage, and latency.
  3. Compare: Depending on your chosen evaluation criteria (e.g., references, LLM-based scoring, etc.), Noveum determines a “win,” “tie,” or “loss” for each model relative to your baseline.
  4. Aggregate: The results are summarized as accuracy, F1-scores, cost metrics, and more.
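The four steps above can be sketched as a loop. Everything here is a stand-in: `call_model` and the numeric scores are fabricated stubs for illustration, not real provider calls (Noveum performs the actual sends and scoring for you).

```python
# Hypothetical sketch of the per-query send/collect/compare/aggregate loop.

def call_model(model: str, query: str) -> dict:
    # Stub: a real implementation would hit the provider's API and
    # return the response plus cost / token / latency metadata.
    canned = {"gpt-4": 0.9, "claude-2": 0.8}
    return {"text": f"{model} answer", "score": canned.get(model, 0.5)}

def compare(score: float, baseline_score: float, tol: float = 0.01) -> str:
    """Map a score difference to a win / tie / loss outcome."""
    if score > baseline_score + tol:
        return "win"
    if score < baseline_score - tol:
        return "loss"
    return "tie"

def run_eval(queries, models, baseline):
    tallies = {m: {"win": 0, "tie": 0, "loss": 0} for m in models}
    for q in queries:
        base = call_model(baseline, q)                       # 1. send baseline
        for m in models:
            resp = call_model(m, q)                          # 1. send / 2. collect
            outcome = compare(resp["score"], base["score"])  # 3. compare
            tallies[m][outcome] += 1                         # 4. aggregate
    return tallies
```

A small tolerance band around the baseline score is one common way to record a "tie" rather than flipping on negligible differences.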

You can track the job in real time to see partial results or wait for a final consolidated report.

3. Automatic LLM-Based Scoring

Noveum includes an Evaluator feature that uses a panel of LLMs (Mixture of Experts) to judge each response when no clear “ground truth” is available. This is especially helpful for tasks like summarization or creative text generation where strict right/wrong answers don’t exist.
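One way to aggregate such a panel, shown here purely as an illustrative sketch (the rating scale and aggregation rule are assumptions, not the Evaluator's documented behavior), is to take the median of the judges' ratings, which dampens a single outlier judge:

```python
import statistics

def panel_score(ratings: list) -> float:
    """Aggregate per-judge ratings (e.g., 1-5) into one verdict via the median."""
    return statistics.median(ratings)

# Three hypothetical judge models rate the same summary; one disagrees sharply.
ratings = [4, 5, 2]
verdict = panel_score(ratings)  # median -> 4
```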

4. Scheduling Eval Jobs

To maintain ongoing optimization:

  • Daily/Weekly Evaluations: If your application is business-critical and new models appear frequently, schedule automated runs.
  • Threshold Triggers: Re-run jobs automatically if your usage logs show a performance dip or if you notice a new model is beating your current approach.

Scheduling is configured in the Automation tab of Noveum. You can also integrate it into your CI/CD pipeline if you want to run evals before deploying new features.
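As a sketch of what a threshold trigger might look like in a CI step, the check below re-runs an eval when the latest score dips a set fraction below a recorded baseline. The function name and 5% tolerance are assumptions for illustration, not part of Noveum's Automation tab.

```python
def should_rerun_eval(latest_score: float, baseline_score: float,
                      dip_tolerance: float = 0.05) -> bool:
    """Return True when the latest score has dipped more than
    dip_tolerance (relative) below the baseline."""
    return latest_score < baseline_score * (1.0 - dip_tolerance)

# Example: baseline 0.80. A score of 0.74 is a >5% dip, so re-run;
# 0.79 is within tolerance, so don't.
print(should_rerun_eval(0.74, 0.80))  # True
print(should_rerun_eval(0.79, 0.80))  # False
```

The same predicate can gate a deploy: fail the pipeline (or kick off a fresh Eval Job) whenever it returns True.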


Next Steps

Exclusive Early Access

Get Early Access to Noveum.ai Platform

Be the first to get notified when we open the Noveum Platform to more users. All users get free access to the Observability suite; early users also get free eval jobs and premium support for the first year.

Sign up now. We send access to a new batch every week.

Early access members receive premium onboarding support and influence our product roadmap. Limited spots available.
