Running Evaluations
Create and run Eval Jobs to score your dataset with NovaEval — a step-by-step guide to the wizard, scorer selection, and reviewing results.
What is an Eval Job?
An Eval Job applies a set of scorers to every item in a dataset and records the results. After a run completes, you have per-item scores, aggregated statistics (mean, standard deviation, min, max), and pass/fail status for each scorer.
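As a rough illustration, the aggregated statistics can be derived from one scorer's per-item scores like this (the score values, threshold, and dictionary keys are hypothetical, not NovaEval's API):

```python
from statistics import mean, stdev

# Hypothetical per-item scores for one scorer on a five-item dataset.
scores = [0.92, 0.71, 0.88, 0.55, 0.97]
pass_threshold = 0.8  # items at or above this count as "pass"

summary = {
    "mean": round(mean(scores), 3),
    "stdev": round(stdev(scores), 3),
    "min": min(scores),
    "max": max(scores),
    "passed": sum(s >= pass_threshold for s in scores),
    "failed": sum(s < pass_threshold for s in scores),
}
print(summary)
```

These are the same summary values (mean, standard deviation, min, max, pass/fail counts) the run detail view reports per scorer.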
Eval Jobs are the engine behind NovaPilot reports and recommendations — NovaPilot analyzes eval job results to identify failure patterns and suggest prompt improvements.
Creating an Eval Job
There are two ways to start:
- From a Dataset detail page → click the Evaluate button
- From Org → Eval Jobs → click New Eval Job
Both open the three-step wizard.
Step 1 — Setup
| Field | Description |
|---|---|
| Job name | A descriptive name (e.g., production-v2-rag-eval) |
| Dataset | The dataset to evaluate. Recently used ETL-linked datasets are shown first. |
| Evaluation mode | batch or realtime (see below) |
| Description | Optional notes about this eval job |
Batch vs. realtime mode
| Mode | Description | Best for |
|---|---|---|
| Batch | Scores all items in the dataset in one pass | Historical evaluation, regression testing, one-time analysis |
| Realtime | Continuously scores new items as they arrive in the dataset | Production monitoring, live quality gates |
Use batch for evaluating a fixed dataset snapshot. Use realtime for ongoing ETL-fed datasets where you want scores to keep up with new traces.
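The difference between the two modes is essentially when scoring happens. A minimal sketch, assuming a placeholder scorer (the function names and item shape are illustrative, not the NovaEval API):

```python
def score_item(item):
    # Placeholder scorer: a real eval job applies every configured scorer.
    return {"passed": bool(item.get("output"))}

def run_batch(dataset):
    """Batch mode: score the whole dataset snapshot in one pass."""
    return [score_item(item) for item in dataset]

def run_realtime(item_stream):
    """Realtime mode: score items lazily as new traces arrive."""
    for item in item_stream:
        yield score_item(item)

dataset = [{"output": "answer A"}, {"output": ""}]
print(run_batch(dataset))                  # scores the fixed snapshot
print(list(run_realtime(iter(dataset))))   # same scores, produced as items arrive
```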
Step 2 — Browse & select scorers
The scorer browser lets you find and add scorers to the job.
Search and filter
- Search by name — type any part of a scorer name
- Filter by category — toggle categories: `accuracy`, `rag`, `conversational`, `agent`, `audio`, `latency`, `telephony`, `safety`, `g_eval`, `custom`
Adding scorers
Click a scorer card to add it to the job. A checkmark appears on selected scorers. Click again to remove.
Recommend Scorers
Click Recommend Scorers to let NovaEval analyze a sample of your dataset and suggest the most relevant scorers automatically.
This is the fastest way to pick the right scorers if you're unsure which to use.
Step 3 — Configure scorers
For each selected scorer, you can set:
| Setting | Description |
|---|---|
| Pass threshold | Minimum score to count as "pass" (default varies by scorer) |
| Model | For LLM-as-judge scorers (g_eval, panel_judge): which model to use |
| Criteria | For g_eval: the evaluation criteria text |
| Judges | For panel_judge: list of judge models and weights |
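To make the judges-and-weights setting concrete, here is one way a weighted panel could combine scores. This is an assumed aggregation rule for illustration; the judge names, weights, and panel_judge's actual combination logic are not taken from NovaEval:

```python
def panel_score(judge_scores, weights):
    """Combine judge scores as a weighted average (assumed aggregation
    rule for illustration; panel_judge's real rule may differ)."""
    total = sum(weights.values())
    return sum(judge_scores[judge] * w for judge, w in weights.items()) / total

# Hypothetical judge models and weights.
judge_scores = {"judge-a": 0.9, "judge-b": 0.7}
weights = {"judge-a": 2.0, "judge-b": 1.0}
print(panel_score(judge_scores, weights))
```

Giving one judge a higher weight pulls the combined score toward that judge's verdict, which is the point of configuring per-judge weights.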
After reviewing the configuration, click Create Eval Job.
Running an eval job
Manual run
From the Eval Job detail page, click Run Now. The run starts immediately.
Scheduled runs via NovaPilot Cron Jobs
For recurring evaluation, create a NovaPilot Cron Job that runs the eval job on a schedule (daily, weekly, etc.) and automatically generates a new NovaPilot report.
Viewing results
Runs list
The Runs tab shows all past runs with:
- Start time and duration
- Status: `running`, `completed`, or `failed`
- Items evaluated and items passed
- Overall pass rate
Run detail
Click a run to see:
Aggregated stats
For each scorer, a summary row shows:
- Mean score
- Standard deviation
- Min and max
- Number of items that passed / failed the threshold
Per-item scores
The items table shows every evaluated item with:
- Columns for each scorer (pass/fail badge + raw score)
- Click any item to open the full item detail (same as the Datasets UI item view)
- Filter items by pass/fail per scorer to quickly find failures
Score distribution
A histogram shows the distribution of scores for each scorer, making it easy to see if failures are clustered or spread out.
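The same clustering check can be sketched by bucketing scores into fixed-width bins (the bucket width and score values are illustrative):

```python
from collections import Counter

def score_histogram(scores, bucket_width=0.1):
    """Bucket scores into fixed-width bins; clustered failures show up
    as a spike in the low buckets."""
    return Counter(round(int(s / bucket_width) * bucket_width, 1) for s in scores)

scores = [0.31, 0.35, 0.38, 0.82, 0.85, 0.91]
print(score_histogram(scores))  # three items cluster in the 0.3 bucket
```

A spike in a single low bucket suggests a systematic failure mode worth investigating; an even spread points to noisier, item-specific issues.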
Understanding results
Pass/fail badges
Each item×scorer cell shows:
- Green ✓ — score meets or exceeds the pass threshold
- Red ✗ — score is below the threshold
- Gray — — scorer could not run (required fields missing)
Why items fail
Click a failing item and go to the score-details tab to see the scorer's explanation. LLM-as-judge scorers (g_eval, panel_judge) always include a reasoning paragraph.
Missing fields
If many items show gray (not scored), check the StandardData Schema to confirm your ETL mapper is populating the required fields for your chosen scorers.
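The gray-badge condition amounts to a required-fields check per scorer. A minimal sketch, with hypothetical scorer names and field requirements (check the Scorers Reference for the real ones):

```python
REQUIRED_FIELDS = {
    # Hypothetical requirements for illustration; see the Scorers
    # Reference for each scorer's actual required StandardData fields.
    "answer_relevancy": ("input", "output"),
    "context_recall": ("input", "output", "retrieved_context"),
}

def can_score(item, scorer):
    """A scorer runs only when every required field is populated;
    otherwise the item gets a gray badge."""
    return all(item.get(field) for field in REQUIRED_FIELDS[scorer])

item = {"input": "What is RAG?", "output": "Retrieval-augmented generation."}
print(can_score(item, "answer_relevancy"))  # True
print(can_score(item, "context_recall"))    # False -> gray badge
```

If one scorer shows gray across most of the dataset, the fix is usually in the ETL mapper, not the scorer configuration.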
Connecting to NovaPilot
Once an eval job has run, NovaPilot can analyze the results and generate:
- A Report with overall health, failure patterns, and system prompt suggestions
- Recommendations — specific, actionable fixes prioritized by impact
- Cron Jobs — scheduled recurring runs with automatic report generation
Navigate to Project → NovaPilot → Recommendations to get started.
Next steps
- NovaPilot — AI analysis and recommendations on your eval results
- Scorers Reference — every scorer with required fields and descriptions
- Datasets — manage the dataset your eval job runs against