
What is a Dataset

Datasets are versioned collections of evaluation items that power NovaEval scoring and NovaPilot analysis. Learn how they are created, structured, and managed.

Overview

A Dataset in Noveum is a versioned collection of evaluation items, each one a structured snapshot of an AI interaction. Datasets are the bridge between your live application traces and the evaluation pipeline: you capture real (or synthetic) conversations and agent runs, transform them into a standardized format, and then run scorers against them to measure quality, safety, and performance.

Every item in a dataset conforms to the StandardData schema, which means scorers know exactly which fields to read regardless of the original trace format.
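
As a rough illustration of what such an item can contain, here is a hypothetical sketch in Python. Only the novaeval_item_type field is confirmed on this page; every other key is an illustrative stand-in, loosely modeled on the item detail tabs described later.

    # Hypothetical sketch of a dataset item, NOT the authoritative StandardData
    # schema. Only "novaeval_item_type" is documented on this page; the other
    # keys are illustrative, mirroring the item detail tabs shown further down.
    example_item = {
        "novaeval_item_type": "agent",  # "agent" or "conversational"
        "agent_info": {
            "name": "support-agent",
            "task": "Resolve a billing question",
            "exit_status": "success",
        },
        "conversation": [
            {"speaker": "user", "text": "Why was I charged twice?"},
            {"speaker": "assistant", "text": "Let me check your invoices."},
        ],
        "tools": [],      # available tool schemas would go here
        "retrieval": [],  # RAG queries and retrieved context chunks
        "ground_truth": "The duplicate charge came from a retried payment.",
    }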


How Datasets Are Created

There are three ways to populate a dataset:

1. Manual selection from Traces

From the Datasets section of your project, you can select traces and convert them directly into dataset items. This is useful when you want to build a curated golden set for regression testing.

1. Navigate to your project and open the Datasets tab.
2. Click New Dataset and give it a name.
3. Inside the dataset, use the trace selector to filter and browse available traces.
4. Select the traces you want to include and click Add to Dataset to convert those trace spans into StandardData items.

2. ETL Job (automated transformation)

An ETL Job continuously watches a trace environment and automatically transforms new spans into dataset items using an AI-generated Python mapper. This is the recommended approach for production monitoring and continuous evaluation.

See ETL Jobs for the full setup guide.
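
As a sketch of the idea, a mapper is a small Python function that turns one raw span into a StandardData-shaped item. The function name, span fields, and return shape below are assumptions for illustration, not the real mapper contract:

    # Illustrative mapper sketch: the actual contract is defined in the ETL
    # Jobs guide. "map_span" and the span field names here are hypothetical.
    def map_span(span: dict) -> dict:
        """Transform one raw trace span into a dataset item."""
        return {
            "novaeval_item_type": "conversational",
            "conversation": [
                {"speaker": m["role"], "text": m["content"]}
                for m in span.get("messages", [])
            ],
            "system_prompt": span.get("system_prompt", ""),
            "tools": span.get("tools", []),
        }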

3. NovaSynth synthetic runs

When you run NovaSynth tests, each synthetic session automatically generates dataset items, complete with audio metrics, STT/TTS data, and conversation turns. These are ideal for pre-production quality gates.

See NovaSynth for details.


Dataset Types

Every dataset item has a novaeval_item_type field that tells scorers how to evaluate it:

Type           | Description                                                                                   | Typical use case
---------------|-----------------------------------------------------------------------------------------------|-----------------------------------------------------
agent          | Single-turn or multi-step agentic interaction with tool calls, RAG retrieval, and exit status | LLM agents, function-calling pipelines, RAG systems
conversational | Multi-turn dialogue with speaker-tagged messages                                              | Chatbots, voice assistants, customer support bots

Items can be mixed within a dataset, and scorers will automatically apply the correct evaluation logic based on the type.
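
Conceptually, that dispatch is just a branch on the type field. A minimal sketch, with placeholder stubs standing in for real NovaEval scorers:

    # Conceptual sketch of type-based dispatch; both scorer functions are
    # placeholder stubs, not actual NovaEval scorers.
    def score_agent_run(item: dict) -> dict:
        exit_status = item.get("agent_info", {}).get("exit_status")
        return {"scorer": "agent-stub", "passed": exit_status == "success"}

    def score_conversation(item: dict) -> dict:
        turns = item.get("conversation", [])
        return {"scorer": "conversation-stub", "turns": len(turns)}

    def score_item(item: dict) -> dict:
        item_type = item["novaeval_item_type"]
        if item_type == "agent":
            return score_agent_run(item)
        if item_type == "conversational":
            return score_conversation(item)
        raise ValueError(f"Unknown novaeval_item_type: {item_type}")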


Dataset Versions

Datasets are versioned so you can evolve your evaluation set without losing history.

Draft and Published states

Every dataset starts in a draft state. While in draft, items can be freely added, edited, or removed. Once you're satisfied, you publish the dataset, creating an immutable snapshot that eval jobs can run against.

The dashboard shows an "Unreleased changes" banner whenever a dataset has been modified since its last published version.

Version diff

When reviewing a new draft, the version diff view shows exactly which items were added, modified, or removed since the last published version. This makes it easy to audit changes before committing them.
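
The underlying comparison is straightforward to picture: key both versions by a stable item ID and bucket the differences. A minimal sketch, assuming such an ID exists (this page does not specify one):

    # Minimal version-diff sketch. Assumes each version maps a stable
    # item ID to the item's payload; the keying scheme is an assumption.
    def diff_versions(published: dict, draft: dict) -> dict:
        added    = [i for i in draft if i not in published]
        removed  = [i for i in published if i not in draft]
        modified = [i for i in draft
                    if i in published and draft[i] != published[i]]
        return {"added": added, "removed": removed, "modified": modified}

For example, diff_versions({"a": 1, "b": 2}, {"b": 3, "c": 4}) reports "c" as added, "a" as removed, and "b" as modified.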


The Datasets UI

The Datasets interface is a three-pane layout:

┌─────────────────┬──────────────────────┬────────────────────────┐
│   Dataset List  │   Items Table        │   Item Detail          │
│                 │                      │                        │
│  ○ My Dataset   │  [Filter] [Search]   │  agent-info            │
│  ○ Voice Tests  │                      │  conversation          │
│  ○ RAG Eval     │  item_001  ✓ pass    │  execution             │
│                 │  item_002  ✗ fail    │  system-prompt         │
│  + New Dataset  │  item_003  - n/a     │  tools                 │
│                 │                      │  retrieval             │
│                 │                      │  response-analysis     │
│                 │                      │  evaluation-context    │
│                 │                      │  audio-metrics         │
│                 │                      │  score-details         │
└─────────────────┴──────────────────────┴────────────────────────┘

Item detail tabs

When you select a dataset item, the right pane shows structured tabs:

Tab                | What it shows
-------------------|---------------------------------------------------
agent-info         | Agent name, role, task description, exit status
conversation       | Multi-turn dialogue with speaker labels
execution          | Tool calls made, parameters passed, tool results
system-prompt      | The system prompt used for this interaction
tools              | Available tools and their schemas
retrieval          | RAG queries issued and retrieved context chunks
response-analysis  | Agent response, ground truth, extracted content
evaluation-context | Custom context fields for scorer input
audio-metrics      | STT/TTS latency, MOS score, audio quality signals
score-details      | Per-scorer results with pass/fail and raw scores

Dependency checking

Before you can delete a dataset, Noveum checks whether any Eval Jobs or NovaPilot Cron Jobs depend on it. If dependencies exist, a warning dialog lists them so you can update those jobs before proceeding.
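
The guard itself amounts to a reference check before deletion; a sketch, with the job shape assumed purely for illustration:

    # Illustrative pre-delete guard. The job structure is an assumption;
    # an empty result means the dataset is safe to delete.
    def find_dependents(dataset_id: str, jobs: list[dict]) -> list[dict]:
        return [job for job in jobs if job.get("dataset_id") == dataset_id]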


Next steps

ETL Jobs: set up automated, continuous dataset creation from your trace environments.
NovaSynth: generate synthetic test sessions for pre-production quality gates.
