
Understanding Eval Job Results

After running an Eval Job, Noveum generates an in-depth report showcasing each model’s performance. Below is a breakdown of the metrics and how to interpret them.

1. Accuracy & F1-Scores

  • Accuracy: The percentage of dataset queries for which the model’s answer was judged correct or best.
  • F1-Score: The harmonic mean of precision and recall, often used for classification tasks. A higher score indicates balanced performance.

Use these metrics to gauge how well a model fits your use case. In a QA application, for example, high accuracy is typically crucial.
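If you want to sanity-check these numbers offline, a minimal Python sketch is shown below. It assumes each eval record carries a boolean correct flag plus predicted/expected labels for a classification-style task; these field names are illustrative, not Noveum's actual export schema.

```python
# Minimal sketch: accuracy and F1 from judged eval records.
# Field names ("correct", "predicted", "expected") are illustrative.

def accuracy(records):
    """Fraction of records whose answer was judged correct."""
    return sum(r["correct"] for r in records) / len(records)

def f1_score(records, positive_label="yes"):
    """Harmonic mean of precision and recall for one positive label."""
    tp = sum(1 for r in records if r["predicted"] == positive_label and r["expected"] == positive_label)
    fp = sum(1 for r in records if r["predicted"] == positive_label and r["expected"] != positive_label)
    fn = sum(1 for r in records if r["predicted"] != positive_label and r["expected"] == positive_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```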

2. Cost Metrics

  • Cost/Request: The average cost of a single inference request.
  • Token Usage: If the provider charges per token (e.g., GPT-4), Noveum tracks the average number of tokens used per request.

A model with a slightly lower accuracy but significantly lower cost could be a viable trade-off, depending on your Optimization Priorities.
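For a back-of-the-envelope check on cost per request, the sketch below combines token counts with per-1K-token prices. The rates in the example are placeholders, not current provider pricing.

```python
# Illustrative only: estimating cost per request from token usage.
# The per-1K-token prices below are placeholders.

def cost_per_request(prompt_tokens, completion_tokens,
                     prompt_price_per_1k, completion_price_per_1k):
    """Return the dollar cost of one request given token counts and pricing."""
    return (prompt_tokens / 1000) * prompt_price_per_1k + \
           (completion_tokens / 1000) * completion_price_per_1k

# Example: 850 prompt tokens and 240 completion tokens at hypothetical rates.
print(round(cost_per_request(850, 240, 0.01, 0.03), 4))  # 0.0157
```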

3. Latency & TTFB

  • Latency: Total round-trip time from request to response.
  • Time to First Byte (TTFB): Measures how quickly the model starts returning a response.

For real-time or customer-facing applications, these metrics can be critical. Sometimes a model is accurate but too slow for user expectations.
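As a rough illustration of how the two numbers differ, the sketch below times a streaming HTTP completion call: TTFB is the delay until the first chunk arrives, while latency is the full round-trip. The endpoint URL and payload are placeholders, not a Noveum API.

```python
# Sketch: measuring TTFB and total latency for a streaming HTTP endpoint.
import time
import requests

def measure_request(url, payload):
    start = time.monotonic()
    ttfb = None
    with requests.post(url, json=payload, stream=True, timeout=60) as resp:
        for chunk in resp.iter_content(chunk_size=None):
            if ttfb is None and chunk:
                ttfb = time.monotonic() - start  # time until the first byte arrives
    total_latency = time.monotonic() - start      # full round-trip time
    return ttfb, total_latency
```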

4. Model Comparison Table

Noveum aggregates these metrics into a comparison table (or chart), highlighting the top performer based on your weighting criteria. For example:

Model      | Accuracy | Latency (ms) | Cost/Request ($) | Weighted Score
GPT-4      | 86.4%    | 1300         | 0.025            | 8.3
Claude 2   | 85.5%    | 1100         | 0.018            | 8.1
DeepSeek   | 82.9%    | 900          | 0.010            | 8.0

The Weighted Score is a single metric that combines multiple criteria according to the weights you configure.
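One plausible way such a score can be computed is sketched below: min-max normalize each metric across the candidate models, invert metrics where lower is better (latency, cost), then combine them with your weights. Noveum's exact formula and scaling may differ, and the weights in the example are arbitrary.

```python
# Sketch of a weighted-score calculation; Noveum's actual formula may differ.

def weighted_scores(models, weights, lower_is_better=("latency_ms", "cost_per_request")):
    names = list(weights)
    lo = {m: min(v[m] for v in models.values()) for m in names}
    hi = {m: max(v[m] for v in models.values()) for m in names}
    scores = {}
    for model, vals in models.items():
        total = 0.0
        for m in names:
            span = (hi[m] - lo[m]) or 1.0
            norm = (vals[m] - lo[m]) / span     # min-max normalize to 0..1
            if m in lower_is_better:
                norm = 1.0 - norm               # invert so higher is always better
            total += weights[m] * norm
        scores[model] = 10 * total / sum(weights.values())  # scale to 0-10
    return scores

candidates = {
    "GPT-4":    {"accuracy": 0.864, "latency_ms": 1300, "cost_per_request": 0.025},
    "Claude 2": {"accuracy": 0.855, "latency_ms": 1100, "cost_per_request": 0.018},
    "DeepSeek": {"accuracy": 0.829, "latency_ms": 900,  "cost_per_request": 0.010},
}
print(weighted_scores(candidates, {"accuracy": 0.6, "latency_ms": 0.2, "cost_per_request": 0.2}))
```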

5. Recommendations

Based on the aggregated scores, Noveum offers recommended models. If, for instance, your priority is reducing cost while keeping accuracy above 80%, the recommendation might be DeepSeek. If you only care about the best overall accuracy, GPT-4 might be flagged.
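The underlying logic amounts to constraint-based filtering, roughly like the sketch below: keep only models above an accuracy floor, then pick the cheapest survivor. The threshold and field names are illustrative.

```python
# Sketch: recommend the cheapest model that clears an accuracy floor.

def recommend(models, min_accuracy=0.80):
    """Return the cheapest model whose accuracy meets the floor, else None."""
    eligible = {n: m for n, m in models.items() if m["accuracy"] >= min_accuracy}
    return min(eligible, key=lambda n: eligible[n]["cost_per_request"]) if eligible else None

candidates = {
    "GPT-4":    {"accuracy": 0.864, "cost_per_request": 0.025},
    "Claude 2": {"accuracy": 0.855, "cost_per_request": 0.018},
    "DeepSeek": {"accuracy": 0.829, "cost_per_request": 0.010},
}
print(recommend(candidates))  # DeepSeek: cheapest model above the 80% accuracy floor
```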

6. Detailed Logs & Insights

For further analysis, you can drill down into:

  • Individual request logs
  • Panel-based evaluation results for open-ended tasks
  • Error analysis to see which queries or categories performed poorly

These insights help you fine-tune your data or possibly adjust your application logic.
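If you export the request logs, a simple offline error analysis can group failures by category, as in the sketch below. The "category" and "correct" fields are illustrative names, not a guaranteed log schema.

```python
# Sketch: count failures per category from exported request logs.
from collections import Counter

def failures_by_category(records):
    return Counter(r["category"] for r in records if not r["correct"])

logs = [
    {"category": "billing",  "correct": False},
    {"category": "billing",  "correct": True},
    {"category": "shipping", "correct": False},
]
print(failures_by_category(logs))  # Counter({'billing': 1, 'shipping': 1})
```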


What’s Next?

After reviewing your Eval Job results, you can:

  1. Switch providers seamlessly by updating a single configuration in the AI Gateway.
  2. Rerun the same job with a new dataset or model to confirm improvements.
  3. Schedule recurring jobs to stay up-to-date when new models are released.

Ready for more optimization? Check out how to integrate new providers or stay tuned for our upcoming fine-tuning feature.

