
What is a Dataset

Datasets are versioned collections of evaluation items that power NovaEval scoring and NovaPilot analysis. Learn how they are created, structured, and managed.

Overview

A Dataset in Noveum is a versioned collection of evaluation items, each one a structured snapshot of an AI interaction. Datasets are the bridge between your live application traces and the evaluation pipeline: you capture real (or synthetic) conversations and agent runs, transform them into a standardized format, and then run scorers against them to measure quality, safety, and performance.

Every item in a dataset conforms to the StandardData schema, which means scorers know exactly which fields to read regardless of the original trace format.
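
As a rough illustration of what such an item can contain, here is a hypothetical sketch in Python. Only the novaeval_item_type field is confirmed on this page; every other key is an illustrative stand-in, loosely modeled on the item detail tabs described later.

    # Hypothetical sketch of a dataset item, NOT the authoritative StandardData
    # schema. Only "novaeval_item_type" is documented on this page; the other
    # keys are illustrative, mirroring the item detail tabs shown further down.
    example_item = {
        "novaeval_item_type": "agent",  # "agent" or "conversational"
        "agent_info": {
            "name": "support-agent",
            "task": "Resolve a billing question",
            "exit_status": "success",
        },
        "conversation": [
            {"speaker": "user", "text": "Why was I charged twice?"},
            {"speaker": "assistant", "text": "Let me check your invoices."},
        ],
        "tools": [],      # available tool schemas would go here
        "retrieval": [],  # RAG queries and retrieved context chunks
        "ground_truth": "The duplicate charge came from a retried payment.",
    }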


How Datasets Are Created

There are three ways to populate a dataset:

1. Manual selection from Traces

From the Datasets section of your project, you can select traces and convert them directly into dataset items. This is useful when you want to build a curated golden set for regression testing.

1. Navigate to your project and open the Datasets tab.
2. Click New Dataset and give it a name.
3. Inside the dataset, use the trace selector to filter and browse available traces.
4. Select the traces you want to include and click Add to Dataset to convert those trace spans into StandardData items.

2. ETL Job (automated transformation)

An ETL Job continuously watches a trace environment and automatically transforms new spans into dataset items using an AI-generated Python mapper. This is the recommended approach for production monitoring and continuous evaluation.

See ETL Jobs for the full setup guide.
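
As a sketch of the idea, a mapper is a small Python function that turns one raw span into a StandardData-shaped item. The function name, span fields, and return shape below are assumptions for illustration, not the real mapper contract:

    # Illustrative mapper sketch: the actual contract is defined in the ETL
    # Jobs guide. "map_span" and the span field names here are hypothetical.
    def map_span(span: dict) -> dict:
        """Transform one raw trace span into a dataset item."""
        return {
            "novaeval_item_type": "conversational",
            "conversation": [
                {"speaker": m["role"], "text": m["content"]}
                for m in span.get("messages", [])
            ],
            "system_prompt": span.get("system_prompt", ""),
            "tools": span.get("tools", []),
        }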

3. NovaSynth synthetic runs

When you run NovaSynth tests, each synthetic session automatically generates dataset items, complete with audio metrics, STT/TTS data, and conversation turns. These are ideal for pre-production quality gates.

See NovaSynth for details.


Dataset Types

Every dataset item has a novaeval_item_type field that tells scorers how to evaluate it:

Type           | Description                                                                                   | Typical use case
---------------|-----------------------------------------------------------------------------------------------|-----------------------------------------------------
agent          | Single-turn or multi-step agentic interaction with tool calls, RAG retrieval, and exit status | LLM agents, function-calling pipelines, RAG systems
conversational | Multi-turn dialogue with speaker-tagged messages                                              | Chatbots, voice assistants, customer support bots

Items can be mixed within a dataset, and scorers will automatically apply the correct evaluation logic based on the type.
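
Conceptually, that dispatch is just a branch on the type field. A minimal sketch, with placeholder stubs standing in for real NovaEval scorers:

    # Conceptual sketch of type-based dispatch; both scorer functions are
    # placeholder stubs, not actual NovaEval scorers.
    def score_agent_run(item: dict) -> dict:
        exit_status = item.get("agent_info", {}).get("exit_status")
        return {"scorer": "agent-stub", "passed": exit_status == "success"}

    def score_conversation(item: dict) -> dict:
        turns = item.get("conversation", [])
        return {"scorer": "conversation-stub", "turns": len(turns)}

    def score_item(item: dict) -> dict:
        item_type = item["novaeval_item_type"]
        if item_type == "agent":
            return score_agent_run(item)
        if item_type == "conversational":
            return score_conversation(item)
        raise ValueError(f"Unknown novaeval_item_type: {item_type}")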


Dataset Versions

Datasets are versioned so you can evolve your evaluation set without losing history.

Draft and Published states

Every dataset starts in a draft state. While in draft, items can be freely added, edited, or removed. Once you're satisfied, you publish the dataset, creating an immutable snapshot that eval jobs can run against.

The dashboard shows an "Unreleased changes" banner whenever a dataset has been modified since its last published version.

Version diff

When reviewing a new draft, the version diff view shows exactly which items were added, modified, or removed since the last published version. This makes it easy to audit changes before committing them.
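
The underlying comparison is straightforward to picture: key both versions by a stable item ID and bucket the differences. A minimal sketch, assuming such an ID exists (this page does not specify one):

    # Minimal version-diff sketch. Assumes each version maps a stable
    # item ID to the item's payload; the keying scheme is an assumption.
    def diff_versions(published: dict, draft: dict) -> dict:
        added    = [i for i in draft if i not in published]
        removed  = [i for i in published if i not in draft]
        modified = [i for i in draft
                    if i in published and draft[i] != published[i]]
        return {"added": added, "removed": removed, "modified": modified}

For example, diff_versions({"a": 1, "b": 2}, {"b": 3, "c": 4}) reports "c" as added, "a" as removed, and "b" as modified.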


The Datasets UI

The Datasets interface is a three-pane layout:

┌─────────────────┬──────────────────────┬────────────────────────┐
│   Dataset List  │   Items Table        │   Item Detail          │
│                 │                      │                        │
│  ○ My Dataset   │  [Filter] [Search]   │  agent-info            │
│  ○ Voice Tests  │                      │  conversation          │
│  ○ RAG Eval     │  item_001  ✓ pass    │  execution             │
│                 │  item_002  ✗ fail    │  system-prompt         │
│  + New Dataset  │  item_003  - n/a     │  tools                 │
│                 │                      │  retrieval             │
│                 │                      │  response-analysis     │
│                 │                      │  evaluation-context    │
│                 │                      │  audio-metrics         │
│                 │                      │  score-details         │
└─────────────────┴──────────────────────┴────────────────────────┘

Item detail tabs

When you select a dataset item, the right pane shows structured tabs:

Tab                | What it shows
-------------------|---------------------------------------------------
agent-info         | Agent name, role, task description, exit status
conversation       | Multi-turn dialogue with speaker labels
execution          | Tool calls made, parameters passed, tool results
system-prompt      | The system prompt used for this interaction
tools              | Available tools and their schemas
retrieval          | RAG queries issued and retrieved context chunks
response-analysis  | Agent response, ground truth, extracted content
evaluation-context | Custom context fields for scorer input
audio-metrics      | STT/TTS latency, MOS score, audio quality signals
score-details      | Per-scorer results with pass/fail and raw scores

Dependency checking

Before you can delete a dataset, Noveum checks whether any Eval Jobs or NovaPilot Cron Jobs depend on it. If dependencies exist, a warning dialog lists them so you can update those jobs before proceeding.
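
The guard itself amounts to a reference check before deletion; a sketch, with the job shape assumed purely for illustration:

    # Illustrative pre-delete guard. The job structure is an assumption;
    # an empty result means the dataset is safe to delete.
    def find_dependents(dataset_id: str, jobs: list[dict]) -> list[dict]:
        return [job for job in jobs if job.get("dataset_id") == dataset_id]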


Next steps

ETL Jobs: set up automated, continuous dataset creation from your trace environments.
NovaSynth: generate synthetic test sessions for pre-production quality gates.
