Eval Jobs
An Eval Job is how Noveum systematically compares multiple AI models on a chosen dataset. With a single click, you can compare GPT-4, Claude 2, DeepSeek, or any other model side by side and get a comprehensive summary of their performance, cost, and speed.
1. Defining an Eval Job
- Pick a Dataset: Choose from your curated logs or imported benchmarks.
- Select Models: Pick any providers configured in your AI Gateway (or external providers configured in the dashboard).
- Set Evaluation Criteria:
  - Metric weighting (e.g., 50% accuracy, 30% cost, 20% latency)
  - Additional domain checks like toxicity or bias (optional)
- Run the Job: Noveum will automatically re-send the dataset queries to each selected model and capture responses.
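To make the metric weighting concrete, here is a minimal Python sketch of how a 50/30/20 split could be folded into one score per model. It is an illustration only, not Noveum's actual scoring formula: the `ModelResult` type, the `weighted_scores` function, the normalization of cost and latency against the best value, and the sample numbers are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """Hypothetical per-model summary; the field names are assumptions."""
    name: str
    accuracy: float   # fraction of correct answers, 0..1
    cost_usd: float   # total cost of the run
    latency_s: float  # mean latency per query, in seconds


def weighted_scores(results, weights):
    """Combine the three metrics into one score per model (higher is better)."""
    # Cost and latency are "lower is better", so each is scored
    # relative to the lowest value seen across the models.
    min_cost = min(r.cost_usd for r in results)
    min_latency = min(r.latency_s for r in results)
    scores = {}
    for r in results:
        cost_score = min_cost / r.cost_usd if r.cost_usd > 0 else 1.0
        latency_score = min_latency / r.latency_s if r.latency_s > 0 else 1.0
        scores[r.name] = (
            weights["accuracy"] * r.accuracy
            + weights["cost"] * cost_score
            + weights["latency"] * latency_score
        )
    return scores


# The 50% / 30% / 20% weighting from the example above.
WEIGHTS = {"accuracy": 0.5, "cost": 0.3, "latency": 0.2}

# Made-up numbers, for illustration only.
results = [
    ModelResult("model-a", accuracy=0.90, cost_usd=4.00, latency_s=2.0),
    ModelResult("model-b", accuracy=0.80, cost_usd=1.00, latency_s=1.0),
]

scores = weighted_scores(results, WEIGHTS)
best = max(scores, key=scores.get)
print(scores, best)
```

With these invented inputs, the cheaper and faster model comes out ahead despite its lower accuracy, which is the kind of trade-off the weighting is meant to surface.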
2. Evaluation Process
For each query in your dataset, Noveum does the following:
- Send: The query is sent to each model.
- Collect: The responses are collected, along with cost, token usage, and latency.
- Compare: Depending on your chosen evaluation criteria (e.g., references, LLM-based scoring, etc.), Noveum determines a “win,” “tie,” or “loss” for each model relative to your baseline.
- Aggregate: The results are summarized as accuracy, F1-scores, cost metrics, and more.
You can track the job in real time to see partial results or wait for a final consolidated report.
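The four steps above can be sketched as a small loop. This is a simplified Python illustration under stated assumptions, not Noveum's implementation: the models are plain local functions standing in for real provider calls, the comparison is an exact match against a reference answer, and `run_eval`, `baseline_model`, and `candidate_model` are hypothetical names.

```python
import time
from collections import Counter


def run_eval(dataset, models, baseline):
    """dataset: list of (query, reference) pairs.
    models: dict mapping a model name to a callable(query) -> str.
    baseline: the model name that every other model is compared against.
    """
    outcomes = {name: Counter() for name in models if name != baseline}
    correct_count = Counter()
    latency = {name: 0.0 for name in models}

    for query, reference in dataset:
        correct = {}
        for name, call in models.items():
            start = time.perf_counter()
            response = call(query)                        # Send
            latency[name] += time.perf_counter() - start  # Collect
            correct[name] = response.strip() == reference
            correct_count[name] += int(correct[name])

        for name in outcomes:                             # Compare
            if correct[name] == correct[baseline]:
                outcomes[name]["tie"] += 1
            elif correct[name]:
                outcomes[name]["win"] += 1
            else:
                outcomes[name]["loss"] += 1

    # Aggregate: accuracy per model over the whole dataset.
    accuracy = {name: correct_count[name] / len(dataset) for name in models}
    return outcomes, accuracy, latency


# Stand-in "models": canned answers instead of real API calls.
def baseline_model(query):
    return {"2+2": "4", "capital of France": "Lyon"}.get(query, "")


def candidate_model(query):
    return {"2+2": "4", "capital of France": "Paris"}.get(query, "")


dataset = [("2+2", "4"), ("capital of France", "Paris")]
models = {"baseline": baseline_model, "candidate": candidate_model}

outcomes, accuracy, latency = run_eval(dataset, models, baseline="baseline")
print(outcomes["candidate"], accuracy)
```

A real job would also record cost and token usage per response; those are omitted here because the stand-in functions have no billing data.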
3. Automatic LLM-Based Scoring
Noveum includes an Evaluator feature that uses a panel of LLMs (Mixture of Experts) to judge each response when no clear “ground truth” is available. This is especially helpful for tasks like summarization or creative text generation where strict right/wrong answers don’t exist.
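One common way to combine a panel of judges is a majority vote, sketched below in Python. How Noveum's Evaluator actually aggregates its judges is not described here, so treat this purely as an assumption. The three judge functions are simple heuristics standing in for real LLM calls, and every name in the block is hypothetical.

```python
from collections import Counter


def panel_verdict(judges, prompt, response_a, response_b):
    """Each judge returns "A", "B", or "tie"; the most common vote wins.
    If the two most common votes have the same count, the result is "tie".
    """
    votes = Counter(judge(prompt, response_a, response_b) for judge in judges)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Stand-in judges. In practice each would prompt a different LLM.
def judge_brevity(prompt, a, b):
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) < len(b) else "B"


def judge_coverage(prompt, a, b):
    # Prefers the response that shares more words with the prompt.
    prompt_words = set(prompt.lower().split())
    overlap_a = len(prompt_words & set(a.lower().split()))
    overlap_b = len(prompt_words & set(b.lower().split()))
    if overlap_a == overlap_b:
        return "tie"
    return "A" if overlap_a > overlap_b else "B"


def judge_length_budget(prompt, a, b, budget=40):
    # Prefers the response that stays within a character budget.
    a_ok, b_ok = len(a) <= budget, len(b) <= budget
    if a_ok == b_ok:
        return "tie"
    return "A" if a_ok else "B"


prompt = "Summarize: cats sleep most of the day."
response_a = "Cats sleep a lot."
response_b = "Cats are animals that tend to sleep for many hours each day."

verdict = panel_verdict(
    [judge_brevity, judge_coverage, judge_length_budget],
    prompt, response_a, response_b,
)
print(verdict)
```

Using an odd number of judges reduces the chance of a split vote, although a three-way split (one "A", one "B", one "tie") still resolves to "tie" in this sketch.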
4. Scheduling Eval Jobs
To maintain ongoing optimization:
- Daily/Weekly Evaluations: If your application is business-critical and new models appear frequently, schedule automated runs.
- Threshold Triggers: Re-run jobs automatically if your usage logs show a performance dip or if a new model starts outperforming your current setup.
Scheduling is configured in the Automation tab of Noveum. You can also integrate it into your CI/CD pipeline if you want to run evals before deploying new features.
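A threshold trigger of the kind described above can be sketched as a small Python check. The function name, the three-run window, and the 0.05 maximum drop are assumptions chosen for illustration; they are not Noveum settings.

```python
def should_rerun(recent_scores, baseline_score, max_drop=0.05, window=3):
    """Return True when the mean of the last `window` scores has fallen
    more than `max_drop` below the baseline score.
    """
    if len(recent_scores) < window:
        return False  # not enough data to judge a trend yet
    recent = recent_scores[-window:]
    return baseline_score - sum(recent) / window > max_drop


# Made-up score histories, for illustration only.
dipping = should_rerun([0.90, 0.82, 0.80, 0.81], baseline_score=0.90)
stable = should_rerun([0.90, 0.89, 0.88], baseline_score=0.90)
print(dipping, stable)
```

The same check could gate a CI/CD step, for example by exiting with a non-zero status when it returns True so the pipeline stops before deployment.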
Next Steps
Get Early Access to Noveum.ai Platform
Be the first to be notified when we open the Noveum Platform to more users. All users get free access to the Observability suite; early users also get free eval jobs and premium support for the first year.