The Evaluation Pipeline
How to go from raw traces to scored dataset items and AI-generated improvement recommendations — the complete Noveum workflow.
Overview
Sending traces to Noveum is just the beginning. Once your application is instrumented, Noveum can automatically convert every LLM call in your traces into a structured evaluation item, score each one with 80+ built-in quality metrics, and surface prioritized recommendations for improving your agent.
This guide walks through the full 5-step pipeline so you know exactly what to set up and in what order.
The 5-step flow
Step 1 — Integrate the SDK and send traces
Install the Noveum SDK and initialize it in your application. Every LLM call, tool use, and agent decision your app makes will be captured as a trace.
Traces appear in your Noveum dashboard in real time. Each trace represents one end-to-end request through your agent — for example, one user conversation or one task execution.
See Quick Setup to get your first traces flowing in 5 minutes.
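Conceptually, each trace is a tree of spans: one span per LLM call, tool invocation, or agent decision. A minimal sketch of the shape (the span and field names here are illustrative assumptions, not the Noveum SDK's actual schema):

```python
# Illustrative trace shape; span types and field names are assumptions,
# not the Noveum SDK's actual schema.
trace = {
    "trace_id": "tr_abc123",
    "spans": [
        {"type": "llm_call",
         "input": "What does the Pro plan cost?",
         "output": "The Pro plan is $20/month."},
        {"type": "tool_call",
         "name": "lookup_pricing",
         "result": {"plan": "Pro", "price_usd": 20}},
        {"type": "agent_decision", "action": "respond"},
    ],
}
```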
Step 2 — Create an ETL Job
Raw traces contain rich telemetry, but scorers need structured, item-by-item data. An ETL Job bridges the two.
Key concept: one item per LLM call. Every LLM call inside a trace becomes one independently evaluated dataset item. A chatbot trace with 5 conversational turns produces 5 items. Each item contains exactly what the scorer needs to evaluate that specific decision in isolation: the input, the output, the system prompt, and the conversation history up to that point.
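The "one item per LLM call" rule can be sketched in plain Python. This is not Noveum's generated mapper code, just an illustration of the transformation it performs, using assumed span field names:

```python
def extract_items(trace: dict) -> list[dict]:
    """Turn every LLM-call span in a trace into one dataset item.

    Illustrative only: field names are assumptions, not the actual
    span schema a real Noveum mapper operates on.
    """
    items = []
    history = []
    for span in trace["spans"]:
        if span["type"] != "llm_call":
            continue
        items.append({
            "input": span["input"],
            "output": span["output"],
            "system_prompt": span.get("system_prompt", ""),
            # Snapshot of the conversation *before* this call,
            # so each item can be scored in isolation.
            "history": list(history),
        })
        history.append({"user": span["input"], "assistant": span["output"]})
    return items

# A 5-turn chatbot trace yields 5 items.
trace = {"spans": [
    {"type": "llm_call", "input": f"q{i}", "output": f"a{i}"}
    for i in range(5)
]}
assert len(extract_items(trace)) == 5
```

Note how the first item carries an empty history while the fifth carries the previous four turns: each item is self-contained.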
When you create an ETL Job, Noveum reads a sample of your recent traces and uses an AI agent to generate mapper code for you. The mapper code runs on every new trace and extracts the items. You review the code in the editor, make any adjustments using the "Improve Code" button, and apply it.
See ETL Jobs Overview to create your first ETL job, and Using the AI Mapper to understand how the AI-generated code works and how to improve it.
Step 3 — Review your Dataset
Once the ETL job runs, items appear in the Dataset linked to that job. Each item is a versioned record containing:
| Field | What it is |
|---|---|
| Input | The user's question or request |
| Output | The LLM's response |
| System prompt | The instructions the LLM was given |
| Conversation history | Prior turns (for conversational agents) |
| Tool calls | Any tools invoked and their results |
| Item type | agent or conversational — determines which scorers apply |
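The fields in the table above can be pictured as a record type. A hypothetical sketch (the real stored schema may differ):

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class DatasetItem:
    """Illustrative dataset item; field names mirror the table above,
    not Noveum's actual stored schema."""
    input: str
    output: str
    system_prompt: str = ""
    conversation_history: list[dict] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    # Determines which scorers apply to this item.
    item_type: Literal["agent", "conversational"] = "conversational"

item = DatasetItem(
    input="What does the Pro plan cost?",
    output="The Pro plan is $20/month.",
)
```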
Datasets are versioned: items accumulate over time as new traces arrive and the ETL job processes them. You can publish a version when you want to lock a snapshot for a specific evaluation run.
See What is a Dataset for the full dataset guide.
Step 4 — Run an Eval Job
An Eval Job applies a set of scorers to every item in your dataset. Each item is scored independently, so you get granular insight into which specific LLM decisions are failing and why.
Results include:
- A pass/fail badge per item per scorer
- Score values (0–10) for LLM-as-judge scorers
- Aggregated stats across all items (mean, pass rate, distribution)
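The aggregation step is simple to reason about. A minimal sketch of how per-item scorer results roll up into a summary, assuming an illustrative result shape (not Noveum's API):

```python
from statistics import mean

def summarize(results: list[dict]) -> dict:
    """Roll per-item scorer results up into summary stats.

    Each result dict is assumed to look like:
    {"item_id": ..., "scorer": ..., "passed": bool, "score": float | None}
    This shape is illustrative, not Noveum's actual result format.
    """
    scores = [r["score"] for r in results if r.get("score") is not None]
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "mean_score": mean(scores) if scores else None,
    }

results = [
    {"item_id": 1, "scorer": "faithfulness", "passed": True, "score": 8.0},
    {"item_id": 2, "scorer": "faithfulness", "passed": False, "score": 3.0},
]
summary = summarize(results)  # pass_rate 0.5, mean_score 5.5
```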
See Running Evaluations for the full eval job guide.
Step 5 — Get NovaPilot Recommendations
After at least one eval job has run, NovaPilot analyzes all your scored items and generates a prioritized improvement report.
NovaPilot identifies failure patterns — for example, "17 items failed the faithfulness scorer when the user asked about pricing" — and suggests specific changes to your system prompt or agent logic to fix them.
The report includes:
- Overall health score and trend vs. previous runs
- Failure patterns grouped by root cause
- Suggested system prompt changes you can copy and test immediately
- Cron schedules for recurring eval runs, so you can track improvement over time
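The pattern-grouping idea is easy to approximate: count failures by scorer and topic, then look at the most common pairs. This is a rough stand-in for NovaPilot's analysis, with assumed field names:

```python
from collections import Counter

def failure_patterns(results: list[dict]) -> Counter:
    """Count failures per (scorer, topic) pair.

    A rough stand-in for NovaPilot's pattern grouping; the "topic"
    field and result shape are illustrative assumptions.
    """
    return Counter(
        (r["scorer"], r["topic"]) for r in results if not r["passed"]
    )

results = [
    {"scorer": "faithfulness", "topic": "pricing", "passed": False},
    {"scorer": "faithfulness", "topic": "pricing", "passed": False},
    {"scorer": "relevance", "topic": "support", "passed": True},
]
patterns = failure_patterns(results)
# patterns.most_common() surfaces the dominant failure mode, e.g.
# pricing questions failing the faithfulness scorer.
```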
See NovaPilot for the full guide.
What the ETL mapper does
After you create an ETL Job, open the Mapper tab. Click Generate Code and Noveum reads a sample of your recent traces, then writes mapper code that you can review in the editor. The code defines how each span in your trace gets converted into a dataset item.
You can edit the code directly, or use Improve Code to describe a problem in plain English and let the AI fix it. Once you are happy with the output, click Apply to activate the mapper on all future traces.
See Using the AI Mapper for the full step-by-step guide.
Next steps
- Quick Setup — send your first traces in 5 minutes
- ETL Jobs Overview — create your first ETL job
- Using the AI Mapper — generate and improve mapper code
- What is a Dataset — understand dataset structure and versioning
- Running Evaluations — score your dataset items
- NovaPilot — get AI-generated recommendations
Get Early Access to Noveum.ai Platform
Be the first to get notified when we open the Noveum Platform to more users. All users get free access to the Observability suite; early users also get free eval jobs and premium support for the first year.