
Understanding Eval Job Results

After running an Eval Job, Noveum generates an in-depth report showcasing each model’s performance. Below is a breakdown of the metrics and how to interpret them.

1. Accuracy & F1-Scores

  • Accuracy: The percentage of dataset queries for which the model’s answer was judged correct or best.
  • F1-Score: The harmonic mean of precision and recall, often used for classification tasks. A higher score indicates balanced performance.

Use these metrics to gauge how well a model fits your use case. In a QA application, for example, high accuracy is typically crucial.
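If you want to sanity-check these numbers offline, a minimal Python sketch is shown below. It assumes each eval record carries a boolean correct flag plus predicted/expected labels for a classification-style task; these field names are illustrative, not Noveum's actual export schema.

```python
# Minimal sketch: accuracy and F1 from judged eval records.
# Field names ("correct", "predicted", "expected") are illustrative.

def accuracy(records):
    """Fraction of records whose answer was judged correct."""
    return sum(r["correct"] for r in records) / len(records)

def f1_score(records, positive_label="yes"):
    """Harmonic mean of precision and recall for one positive label."""
    tp = sum(1 for r in records if r["predicted"] == positive_label and r["expected"] == positive_label)
    fp = sum(1 for r in records if r["predicted"] == positive_label and r["expected"] != positive_label)
    fn = sum(1 for r in records if r["predicted"] != positive_label and r["expected"] == positive_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```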

2. Cost Metrics

  • Cost/Request: The average cost of a single inference request.
  • Token Usage: If the provider charges per token (e.g., GPT-4), Noveum tracks the average number of tokens used per request.

A model with a slightly lower accuracy but significantly lower cost could be a viable trade-off, depending on your Optimization Priorities.
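For a back-of-the-envelope check on cost per request, the sketch below combines token counts with per-1K-token prices. The rates in the example are placeholders, not current provider pricing.

```python
# Illustrative only: estimating cost per request from token usage.
# The per-1K-token prices below are placeholders.

def cost_per_request(prompt_tokens, completion_tokens,
                     prompt_price_per_1k, completion_price_per_1k):
    """Return the dollar cost of one request given token counts and pricing."""
    return (prompt_tokens / 1000) * prompt_price_per_1k + \
           (completion_tokens / 1000) * completion_price_per_1k

# Example: 850 prompt tokens and 240 completion tokens at hypothetical rates.
print(round(cost_per_request(850, 240, 0.01, 0.03), 4))  # 0.0157
```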

3. Latency & TTFB

  • Latency: Total round-trip time from request to response.
  • Time to First Byte (TTFB): Measures how quickly the model starts returning a response.

For real-time or customer-facing applications, these metrics can be critical. Sometimes a model is accurate but too slow for user expectations.
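As a rough illustration of how the two numbers differ, the sketch below times a streaming HTTP completion call: TTFB is the delay until the first chunk arrives, while latency is the full round-trip. The endpoint URL and payload are placeholders, not a Noveum API.

```python
# Sketch: measuring TTFB and total latency for a streaming HTTP endpoint.
import time
import requests

def measure_request(url, payload):
    start = time.monotonic()
    ttfb = None
    with requests.post(url, json=payload, stream=True, timeout=60) as resp:
        for chunk in resp.iter_content(chunk_size=None):
            if ttfb is None and chunk:
                ttfb = time.monotonic() - start  # time until the first byte arrives
    total_latency = time.monotonic() - start      # full round-trip time
    return ttfb, total_latency
```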

4. Model Comparison Table

Noveum aggregates these metrics into a comparison table (or chart), highlighting the top performer based on your weighting criteria. For example:

Model      | Accuracy | Latency (ms) | Cost/Request ($) | Weighted Score
GPT-4      | 86.4%    | 1300         | 0.025            | 8.3
Claude 2   | 85.5%    | 1100         | 0.018            | 8.1
DeepSeek   | 82.9%    | 900          | 0.010            | 8.0

The Weighted Score is a single metric that combines multiple criteria according to the weights you configure.
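One plausible way such a score can be computed is sketched below: min-max normalize each metric across the candidate models, invert metrics where lower is better (latency, cost), then combine them with your weights. Noveum's exact formula and scaling may differ, and the weights in the example are arbitrary.

```python
# Sketch of a weighted-score calculation; Noveum's actual formula may differ.

def weighted_scores(models, weights, lower_is_better=("latency_ms", "cost_per_request")):
    names = list(weights)
    lo = {m: min(v[m] for v in models.values()) for m in names}
    hi = {m: max(v[m] for v in models.values()) for m in names}
    scores = {}
    for model, vals in models.items():
        total = 0.0
        for m in names:
            span = (hi[m] - lo[m]) or 1.0
            norm = (vals[m] - lo[m]) / span     # min-max normalize to 0..1
            if m in lower_is_better:
                norm = 1.0 - norm               # invert so higher is always better
            total += weights[m] * norm
        scores[model] = 10 * total / sum(weights.values())  # scale to 0-10
    return scores

candidates = {
    "GPT-4":    {"accuracy": 0.864, "latency_ms": 1300, "cost_per_request": 0.025},
    "Claude 2": {"accuracy": 0.855, "latency_ms": 1100, "cost_per_request": 0.018},
    "DeepSeek": {"accuracy": 0.829, "latency_ms": 900,  "cost_per_request": 0.010},
}
print(weighted_scores(candidates, {"accuracy": 0.6, "latency_ms": 0.2, "cost_per_request": 0.2}))
```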

5. Recommendations

Based on the aggregated scores, Noveum offers recommended models. If, for instance, your priority is reducing cost while keeping accuracy above 80%, the recommendation might be DeepSeek. If you only care about the best overall accuracy, GPT-4 might be flagged.
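The underlying logic amounts to constraint-based filtering, roughly like the sketch below: keep only models above an accuracy floor, then pick the cheapest survivor. The threshold and field names are illustrative.

```python
# Sketch: recommend the cheapest model that clears an accuracy floor.

def recommend(models, min_accuracy=0.80):
    """Return the cheapest model whose accuracy meets the floor, else None."""
    eligible = {n: m for n, m in models.items() if m["accuracy"] >= min_accuracy}
    return min(eligible, key=lambda n: eligible[n]["cost_per_request"]) if eligible else None

candidates = {
    "GPT-4":    {"accuracy": 0.864, "cost_per_request": 0.025},
    "Claude 2": {"accuracy": 0.855, "cost_per_request": 0.018},
    "DeepSeek": {"accuracy": 0.829, "cost_per_request": 0.010},
}
print(recommend(candidates))  # DeepSeek: cheapest model above the 80% accuracy floor
```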

6. Detailed Logs & Insights

For further analysis, you can drill down into:

  • Individual request logs
  • Panel-based evaluation results for open-ended tasks
  • Error analysis to see which queries or categories performed poorly

These insights help you fine-tune your data or possibly adjust your application logic.
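If you export the request logs, a simple offline error analysis can group failures by category, as in the sketch below. The "category" and "correct" fields are illustrative names, not a guaranteed log schema.

```python
# Sketch: count failures per category from exported request logs.
from collections import Counter

def failures_by_category(records):
    return Counter(r["category"] for r in records if not r["correct"])

logs = [
    {"category": "billing",  "correct": False},
    {"category": "billing",  "correct": True},
    {"category": "shipping", "correct": False},
]
print(failures_by_category(logs))  # Counter({'billing': 1, 'shipping': 1})
```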


What’s Next?

After reviewing your Eval Job results, you can:

  1. Switch providers seamlessly by updating a single configuration in the AI Gateway.
  2. Rerun the same job with a new dataset or model to confirm improvements.
  3. Schedule recurring jobs to stay up-to-date when new models are released.

Ready for more optimization? Check out how to integrate new providers or stay tuned for our upcoming fine-tuning feature.

