
Running Evaluations

Create and run Eval Jobs to score your dataset with NovaEval — a step-by-step guide to the wizard, scorer selection, and reviewing results.

What is an Eval Job?

An Eval Job applies a set of scorers to every item in a dataset and records the results. After a run completes, you have per-item scores, aggregated statistics (mean, standard deviation, min, max), and pass/fail status for each scorer.
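
Conceptually, a run is a nested loop: every scorer is applied to every item, and each raw score is compared against that scorer's pass threshold. A minimal Python sketch of that shape, where `scorer.name`, `scorer.threshold`, and `scorer.score()` are illustrative stand-ins rather than the real SDK:

```python
def run_eval_job(dataset, scorers):
    """Apply every scorer to every dataset item (illustrative only)."""
    results = []
    for item in dataset:                      # dataset: list of dicts
        row = {}
        for scorer in scorers:
            raw = scorer.score(item)          # hypothetical score() call
            row[scorer.name] = {
                "score": raw,
                "passed": raw >= scorer.threshold,
            }
        results.append(row)
    return results                            # per-item scores per scorer
```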

Eval Jobs are the engine behind NovaPilot reports and recommendations — NovaPilot analyzes eval job results to identify failure patterns and suggest prompt improvements.


Creating an Eval Job

There are two ways to start:

  • From a Dataset detail page → click the Evaluate button
  • From Org → Eval Jobs → click New Eval Job

Both open the three-step wizard.


Step 1 — Setup

  • Job name — A descriptive name (e.g., production-v2-rag-eval)
  • Dataset — The dataset to evaluate. Recently used ETL-linked datasets are shown first.
  • Evaluation mode — batch or realtime (see below)
  • Description — Optional notes about this eval job

Batch vs. realtime mode

  • Batch — Scores all items in the dataset in one pass. Best for historical evaluation, regression testing, and one-time analysis.
  • Realtime — Continuously scores new items as they arrive in the dataset. Best for production monitoring and live quality gates.

Use batch for evaluating a fixed dataset snapshot. Use realtime for ongoing ETL-fed datasets where you want scores to keep up with new traces.
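
If you think of the setup step as data, it maps to a handful of fields. A hypothetical job definition for illustration only (the wizard collects these interactively; the field names and dataset name here are assumptions):

```python
# Hypothetical eval-job definition mirroring the Setup fields.
eval_job = {
    "name": "production-v2-rag-eval",
    "dataset": "prod-traces-june",   # assumed dataset name
    "mode": "batch",                 # "batch": score a fixed snapshot once
                                     # "realtime": keep scoring new items
    "description": "Regression check before the v2 rollout",
}
```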


Step 2 — Browse & select scorers

The scorer browser lets you find and add scorers to the job.

Search and filter

  • Search by name — type any part of a scorer name
  • Filter by category — toggle categories: accuracy, rag, conversational, agent, audio, latency, telephony, safety, g_eval, custom

Adding scorers

Click a scorer card to add it to the job. A checkmark appears on selected scorers. Click again to remove.

Recommend Scorers

Click Recommend Scorers to let NovaEval analyze a sample of your dataset and suggest the most relevant scorers automatically:

1. Click Recommend Scorers in the scorer browser.
2. NovaEval samples up to 20 items from the dataset and analyzes which fields are populated.
3. A dialog shows recommended scorers with reasoning.
4. Click Add all or select individual recommendations.

This is the fastest way to pick the right scorers if you're unsure which to use.
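
Under the hood this amounts to a field-coverage check: sample items, see which fields are populated, and keep scorers whose required fields are all covered. A simplified sketch, where the scorer-to-required-fields mapping is invented for illustration (the real list is in the Scorers Reference):

```python
import random

# Invented scorer -> required-field mapping, for illustration only.
REQUIRED_FIELDS = {
    "answer_relevancy": {"input", "output"},
    "context_precision": {"input", "output", "retrieved_context"},
    "toxicity": {"output"},
}

def recommend_scorers(dataset, sample_size=20):
    """Suggest scorers whose required fields are populated in a sample."""
    sample = random.sample(dataset, min(sample_size, len(dataset)))
    populated = {
        field
        for item in sample
        for field, value in item.items()
        if value not in (None, "", [])
    }
    return [
        name for name, required in REQUIRED_FIELDS.items()
        if required <= populated
    ]
```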


Step 3 — Configure scorers

For each selected scorer, you can set:

  • Pass threshold — Minimum score to count as "pass" (default varies by scorer)
  • Model — For LLM-as-judge scorers (g_eval, panel_judge): which model to use
  • Criteria — For g_eval: the evaluation criteria text
  • Judges — For panel_judge: the list of judge models and their weights

After reviewing the configuration, click Create Eval Job.
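
For example, a panel_judge setup lists judge models with weights. The sketch below assumes the panel combines judges by weighted average; the config shape and model names are illustrative, not the platform's exact schema:

```python
panel_judge_config = {
    "threshold": 0.7,
    "judges": [
        {"model": "gpt-4o", "weight": 0.5},
        {"model": "claude-3-5-sonnet", "weight": 0.3},
        {"model": "gemini-1.5-pro", "weight": 0.2},
    ],
}

def combine(judge_scores, config):
    """Weighted average of per-judge scores, then threshold check."""
    total_weight = sum(j["weight"] for j in config["judges"])
    score = sum(
        s * j["weight"] for s, j in zip(judge_scores, config["judges"])
    ) / total_weight
    return score, score >= config["threshold"]
```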


Running an eval job

Manual run

From the Eval Job detail page, click Run Now. The run starts immediately.

Scheduled runs via NovaPilot Cron Jobs

For recurring evaluation, create a NovaPilot Cron Job that runs the eval job on a schedule (daily, weekly, etc.) and automatically generates a new NovaPilot report.
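
Schedules of this kind are commonly written as standard five-field cron expressions. A few typical patterns for reference (the exact format the Cron Job form accepts may differ):

```python
# minute  hour  day-of-month  month  day-of-week
SCHEDULES = {
    "daily_06_utc":  "0 6 * * *",   # every day at 06:00
    "weekly_monday": "0 6 * * 1",   # Mondays at 06:00
    "hourly":        "0 * * * *",   # top of every hour
}
```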


Viewing results

Runs list

The Runs tab shows all past runs with:

  • Start time and duration
  • Status: running, completed, failed
  • Items evaluated and items passed
  • Overall pass rate

Run detail

Click a run to see:

Aggregated stats

For each scorer, a summary row shows:

  • Mean score
  • Standard deviation
  • Min and max
  • Number of items that passed / failed the threshold
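
These aggregates are straightforward to reproduce from exported per-item scores. A sketch using Python's statistics module (the `results` shape matches the loop sketch earlier in this page):

```python
from statistics import mean, stdev

def summarize(results, scorer_name, threshold):
    """Recompute the summary-row aggregates for one scorer."""
    scores = [row[scorer_name]["score"] for row in results]
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        "passed": sum(s >= threshold for s in scores),
        "failed": sum(s < threshold for s in scores),
    }
```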

Per-item scores

The items table shows every evaluated item with:

  • Columns for each scorer (pass/fail badge + raw score)
  • Click any item to open the full item detail (same as the Datasets UI item view)
  • Filter items by pass/fail per scorer to quickly find failures

Score distribution

A histogram shows the distribution of scores for each scorer, making it easy to see if failures are clustered or spread out.
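
If you export the raw scores, the same view is a few lines of matplotlib (assuming scores fall in [0, 1]):

```python
import matplotlib.pyplot as plt

def plot_score_distribution(scores, threshold, scorer_name):
    """Histogram of raw scores with the pass threshold marked."""
    plt.hist(scores, bins=20, range=(0.0, 1.0), edgecolor="black")
    plt.axvline(threshold, color="red", linestyle="--",
                label=f"threshold = {threshold}")
    plt.xlabel("score")
    plt.ylabel("items")
    plt.title(f"{scorer_name} score distribution")
    plt.legend()
    plt.show()
```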


Understanding results

Pass/fail badges

Each item×scorer cell shows:

  • Green ✓ — score meets or exceeds the pass threshold
  • Red ✗ — score is below the threshold
  • Gray — — scorer could not run (required fields missing)
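
The badge logic reduces to a tiny function: treat a missing score as unscored, otherwise compare against the threshold. A sketch:

```python
def badge(score, threshold):
    """Map a raw score (or None when the scorer was skipped) to a badge."""
    if score is None:
        return "—"                      # gray: scorer could not run
    return "✓" if score >= threshold else "✗"
```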

Why items fail

Click a failing item and go to the score-details tab to see the scorer's explanation. LLM-as-judge scorers (g_eval, panel_judge) always include a reasoning paragraph.

Missing fields

If many items show gray (not scored), check the StandardData Schema to confirm your ETL mapper is populating the required fields for your chosen scorers.
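
A quick way to diagnose widespread gray cells is to count, per required field, how many items leave it empty. A sketch assuming each item is a dict keyed by StandardData field names:

```python
from collections import Counter

def missing_field_report(dataset, required_fields):
    """Count how many items leave each required field unpopulated."""
    missing = Counter()
    for item in dataset:
        for field in required_fields:
            if item.get(field) in (None, "", []):
                missing[field] += 1
    return missing

# e.g. missing_field_report(items, {"input", "output", "retrieved_context"})
```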


Connecting to NovaPilot

Once an eval job has run, NovaPilot can analyze the results and generate:

  • A Report with overall health, failure patterns, and system prompt suggestions
  • Recommendations — specific, actionable fixes prioritized by impact
  • Cron Jobs — scheduled recurring runs with automatic report generation

Navigate to Project → NovaPilot → Recommendations to get started.


Next steps

  • NovaPilot — AI analysis and recommendations on your eval results
  • Scorers Reference — every scorer with required fields and descriptions
  • Datasets — manage the dataset your eval job runs against