Running Evaluations
Create and run Eval Jobs to score your dataset with NovaEval — a step-by-step guide to the wizard, scorer selection, and reviewing results.
What is an Eval Job?
An Eval Job applies a set of scorers to every item in a dataset and records the results. After a run completes, you have per-item scores, aggregated statistics (mean, standard deviation, min, max), and pass/fail status for each scorer.
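As a rough illustration, the aggregated statistics can be derived from one scorer's per-item scores like this (the score values, threshold, and dictionary keys are hypothetical, not NovaEval's API):

```python
from statistics import mean, stdev

# Hypothetical per-item scores for one scorer on a five-item dataset.
scores = [0.92, 0.71, 0.88, 0.55, 0.97]
pass_threshold = 0.8  # items at or above this count as "pass"

summary = {
    "mean": round(mean(scores), 3),
    "stdev": round(stdev(scores), 3),
    "min": min(scores),
    "max": max(scores),
    "passed": sum(s >= pass_threshold for s in scores),
    "failed": sum(s < pass_threshold for s in scores),
}
print(summary)
```

These are the same summary values (mean, standard deviation, min, max, pass/fail counts) the run detail view reports per scorer.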
Eval Jobs are the engine behind NovaPilot reports and recommendations — NovaPilot analyzes eval job results to identify failure patterns and suggest prompt improvements.
Creating an Eval Job
There are two ways to start:
- From a Dataset detail page → click the Evaluate button
- From Org → Eval Jobs → click New Eval Job
Both open the three-step wizard.
Step 1 — Setup
| Field | Description |
|---|---|
| Job name | A descriptive name (e.g., production-v2-rag-eval) |
| Dataset | The dataset to evaluate. Recently used ETL-linked datasets are shown first. |
| Evaluation mode | batch or realtime (see below) |
| Description | Optional notes about this eval job |
Batch vs. realtime mode
| Mode | Description | Best for |
|---|---|---|
| Batch | Scores all items in the dataset in one pass | Historical evaluation, regression testing, one-time analysis |
| Realtime | Continuously scores new items as they arrive in the dataset | Production monitoring, live quality gates |
Use batch for evaluating a fixed dataset snapshot. Use realtime for ongoing ETL-fed datasets where you want scores to keep up with new traces.
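The difference between the two modes is essentially when scoring happens. A minimal sketch, assuming a placeholder scorer (the function names and item shape are illustrative, not the NovaEval API):

```python
def score_item(item):
    # Placeholder scorer: a real eval job applies every configured scorer.
    return {"passed": bool(item.get("output"))}

def run_batch(dataset):
    """Batch mode: score the whole dataset snapshot in one pass."""
    return [score_item(item) for item in dataset]

def run_realtime(item_stream):
    """Realtime mode: score items lazily as new traces arrive."""
    for item in item_stream:
        yield score_item(item)

dataset = [{"output": "answer A"}, {"output": ""}]
print(run_batch(dataset))                  # scores the fixed snapshot
print(list(run_realtime(iter(dataset))))   # same scores, produced as items arrive
```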
Step 2 — Browse & select scorers
The scorer browser lets you find and add scorers to the job.
Search and filter
- Search by name — type any part of a scorer name
- Filter by category — toggle categories: `accuracy`, `rag`, `conversational`, `agent`, `audio`, `latency`, `telephony`, `safety`, `g_eval`, `custom`
Adding scorers
Click a scorer card to add it to the job. A checkmark appears on selected scorers. Click again to remove.
Recommend Scorers
Click Recommend Scorers to let NovaEval analyze a sample of your dataset and suggest the most relevant scorers automatically.
This is the fastest way to pick the right scorers if you're unsure which to use.
Step 3 — Configure scorers
For each selected scorer, you can set:
| Setting | Description |
|---|---|
| Pass threshold | Minimum score to count as "pass" (default varies by scorer) |
| Model | For LLM-as-judge scorers (g_eval, panel_judge): which model to use |
| Criteria | For g_eval: the evaluation criteria text |
| Judges | For panel_judge: list of judge models and weights |
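To make the judges-and-weights setting concrete, here is one way a weighted panel could combine scores. This is an assumed aggregation rule for illustration; the judge names, weights, and panel_judge's actual combination logic are not taken from NovaEval:

```python
def panel_score(judge_scores, weights):
    """Combine judge scores as a weighted average (assumed aggregation
    rule for illustration; panel_judge's real rule may differ)."""
    total = sum(weights.values())
    return sum(judge_scores[judge] * w for judge, w in weights.items()) / total

# Hypothetical judge models and weights.
judge_scores = {"judge-a": 0.9, "judge-b": 0.7}
weights = {"judge-a": 2.0, "judge-b": 1.0}
print(panel_score(judge_scores, weights))
```

Giving one judge a higher weight pulls the combined score toward that judge's verdict, which is the point of configuring per-judge weights.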
After reviewing the configuration, click Create Eval Job.
Running an eval job
Manual run
From the Eval Job detail page, click Run Now. The run starts immediately.
Scheduled runs via NovaPilot Cron Jobs
For recurring evaluation, create a NovaPilot Cron Job that runs the eval job on a schedule (daily, weekly, etc.) and automatically generates a new NovaPilot report.
Viewing results
Runs list
The Runs tab shows all past runs with:
- Start time and duration
- Status: `running`, `completed`, or `failed`
- Items evaluated and items passed
- Overall pass rate
Run detail
Click a run to see:
Aggregated stats
For each scorer, a summary row shows:
- Mean score
- Standard deviation
- Min and max
- Number of items that passed / failed the threshold
Per-item scores
The items table shows every evaluated item with:
- Columns for each scorer (pass/fail badge + raw score)
- Click any item to open the full item detail (same as the Datasets UI item view)
- Filter items by pass/fail per scorer to quickly find failures
Score distribution
A histogram shows the distribution of scores for each scorer, making it easy to see if failures are clustered or spread out.
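The same clustering check can be sketched by bucketing scores into fixed-width bins (the bucket width and score values are illustrative):

```python
from collections import Counter

def score_histogram(scores, bucket_width=0.1):
    """Bucket scores into fixed-width bins; clustered failures show up
    as a spike in the low buckets."""
    return Counter(round(int(s / bucket_width) * bucket_width, 1) for s in scores)

scores = [0.31, 0.35, 0.38, 0.82, 0.85, 0.91]
print(score_histogram(scores))  # three items cluster in the 0.3 bucket
```

A spike in a single low bucket suggests a systematic failure mode worth investigating; an even spread points to noisier, item-specific issues.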
Understanding results
Pass/fail badges
Each item×scorer cell shows:
- Green ✓ — score meets or exceeds the pass threshold
- Red ✗ — score is below the threshold
- Gray — — scorer could not run (required fields missing)
Why items fail
Click a failing item and go to the score-details tab to see the scorer's explanation. LLM-as-judge scorers (g_eval, panel_judge) always include a reasoning paragraph.
Missing fields
If many items show gray (not scored), check the StandardData Schema to confirm your ETL mapper is populating the required fields for your chosen scorers.
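The gray-badge condition amounts to a required-fields check per scorer. A minimal sketch, with hypothetical scorer names and field requirements (check the Scorers Reference for the real ones):

```python
REQUIRED_FIELDS = {
    # Hypothetical requirements for illustration; see the Scorers
    # Reference for each scorer's actual required StandardData fields.
    "answer_relevancy": ("input", "output"),
    "context_recall": ("input", "output", "retrieved_context"),
}

def can_score(item, scorer):
    """A scorer runs only when every required field is populated;
    otherwise the item gets a gray badge."""
    return all(item.get(field) for field in REQUIRED_FIELDS[scorer])

item = {"input": "What is RAG?", "output": "Retrieval-augmented generation."}
print(can_score(item, "answer_relevancy"))  # True
print(can_score(item, "context_recall"))    # False -> gray badge
```

If one scorer shows gray across most of the dataset, the fix is usually in the ETL mapper, not the scorer configuration.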
Connecting to NovaPilot
Once an eval job has run, NovaPilot can analyze the results and generate:
- A Report with overall health, failure patterns, and system prompt suggestions
- Recommendations — specific, actionable fixes prioritized by impact
- Cron Jobs — scheduled recurring runs with automatic report generation
Navigate to Project → NovaPilot → Recommendations to get started.
Next steps
- NovaPilot — AI analysis and recommendations on your eval results
- Scorers Reference — every scorer with required fields and descriptions
- Datasets — manage the dataset your eval job runs against