
Eval Jobs

An Eval Job is how Noveum systematically compares multiple AI models on a chosen dataset. With a single click, you can run GPT-4, Claude 2, DeepSeek, or any other model side by side and get a comprehensive summary of their performance, cost, and speed.

1. Defining an Eval Job

  1. Pick a Dataset: Choose from your curated logs or imported benchmarks.
  2. Select Models: Pick any models from providers configured in your AI Gateway (or from external providers set up in the dashboard).
  3. Set Evaluation Criteria:
    • Metric weighting (e.g., 50% accuracy, 30% cost, 20% latency)
    • Additional domain checks like toxicity or bias (optional)
  4. Run the Job: Noveum will automatically re-send the dataset queries to each selected model and capture responses.
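The metric weighting in step 3 can be illustrated with a small sketch. Note that the job layout and the `compute_composite_score` helper below are hypothetical, invented for illustration; they are not Noveum's actual API.

```python
# Hypothetical eval-job definition; field names are illustrative only.
eval_job = {
    "dataset": "support-tickets-v2",  # placeholder dataset name
    "models": ["gpt-4", "claude-2", "deepseek-chat"],
    "weights": {"accuracy": 0.5, "cost": 0.3, "latency": 0.2},
}

def compute_composite_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of normalized metrics (higher is better for each)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[name] * metrics[name] for name in weights)

# Example: metrics normalized to [0, 1], with cost and latency inverted
# so that a higher value means cheaper / faster.
score = compute_composite_score(
    {"accuracy": 0.90, "cost": 0.60, "latency": 0.80},
    eval_job["weights"],
)
```

With a 50/30/20 split, a model that is very accurate but expensive can still be outranked by a cheaper, slightly less accurate one.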

2. Evaluation Process

For each query in your dataset, Noveum does the following:

  1. Send: The query is sent to each model.
  2. Collect: Responses are captured, along with cost, token usage, and latency.
  3. Compare: Depending on your chosen evaluation criteria (e.g., references, LLM-based scoring, etc.), Noveum determines a “win,” “tie,” or “loss” for each model relative to your baseline.
  4. Aggregate: The results are summarized as accuracy, F1-scores, cost metrics, and more.
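The four steps above can be sketched as a loop. Everything here is a stand-in: `call_model` and the numeric scores are fabricated stubs for illustration, not real provider calls (Noveum performs the actual sends and scoring for you).

```python
# Hypothetical sketch of the per-query send/collect/compare/aggregate loop.

def call_model(model: str, query: str) -> dict:
    # Stub: a real implementation would hit the provider's API and
    # return the response plus cost / token / latency metadata.
    canned = {"gpt-4": 0.9, "claude-2": 0.8}
    return {"text": f"{model} answer", "score": canned.get(model, 0.5)}

def compare(score: float, baseline_score: float, tol: float = 0.01) -> str:
    """Map a score difference to a win / tie / loss outcome."""
    if score > baseline_score + tol:
        return "win"
    if score < baseline_score - tol:
        return "loss"
    return "tie"

def run_eval(queries, models, baseline):
    tallies = {m: {"win": 0, "tie": 0, "loss": 0} for m in models}
    for q in queries:
        base = call_model(baseline, q)                       # 1. send baseline
        for m in models:
            resp = call_model(m, q)                          # 1. send / 2. collect
            outcome = compare(resp["score"], base["score"])  # 3. compare
            tallies[m][outcome] += 1                         # 4. aggregate
    return tallies
```

A small tolerance band around the baseline score is one common way to record a "tie" rather than flipping on negligible differences.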

You can track the job in real time to see partial results or wait for a final consolidated report.

3. Automatic LLM-Based Scoring

Noveum includes an Evaluator feature that uses a panel of LLMs (Mixture of Experts) to judge each response when no clear “ground truth” is available. This is especially helpful for tasks like summarization or creative text generation where strict right/wrong answers don’t exist.
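One way to aggregate such a panel, shown here purely as an illustrative sketch (the rating scale and aggregation rule are assumptions, not the Evaluator's documented behavior), is to take the median of the judges' ratings, which dampens a single outlier judge:

```python
import statistics

def panel_score(ratings: list) -> float:
    """Aggregate per-judge ratings (e.g., 1-5) into one verdict via the median."""
    return statistics.median(ratings)

# Three hypothetical judge models rate the same summary; one disagrees sharply.
ratings = [4, 5, 2]
verdict = panel_score(ratings)  # median -> 4
```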

4. Scheduling Eval Jobs

To maintain ongoing optimization:

  • Daily/Weekly Evaluations: If your application is business-critical and new models appear frequently, schedule automated runs.
  • Threshold Triggers: Re-run jobs automatically if your usage logs show a performance dip or if you notice a new model is beating your current approach.

Scheduling is configured in the Automation tab of Noveum. You can also integrate it into your CI/CD pipeline if you want to run evals before deploying new features.
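As a sketch of what a threshold trigger might look like in a CI step, the check below re-runs an eval when the latest score dips a set fraction below a recorded baseline. The function name and 5% tolerance are assumptions for illustration, not part of Noveum's Automation tab.

```python
def should_rerun_eval(latest_score: float, baseline_score: float,
                      dip_tolerance: float = 0.05) -> bool:
    """Return True when the latest score has dipped more than
    dip_tolerance (relative) below the baseline."""
    return latest_score < baseline_score * (1.0 - dip_tolerance)

# Example: baseline 0.80. A score of 0.74 is a >5% dip, so re-run;
# 0.79 is within tolerance, so don't.
print(should_rerun_eval(0.74, 0.80))  # True
print(should_rerun_eval(0.79, 0.80))  # False
```

The same predicate can gate a deploy: fail the pipeline (or kick off a fresh Eval Job) whenever it returns True.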


Next Steps

Exclusive Early Access

Get Early Access to Noveum.ai Platform

Be the first to get notified when we open the Noveum Platform to more users. All users get free access to the Observability suite; early users also get free eval jobs and premium support for the first year.

Sign up now. We send access to a new batch every week.

Early access members receive premium onboarding support and influence our product roadmap. Limited spots available.
