
The Evaluation Pipeline

How to go from raw traces to scored dataset items and AI-generated improvement recommendations — the complete Noveum workflow.

Overview

Sending traces to Noveum is just the beginning. Once your application is instrumented, Noveum can automatically convert every LLM call in your traces into a structured evaluation item, score each one with 80+ built-in quality metrics, and surface prioritized recommendations for improving your agent.

This guide walks through the full 5-step pipeline so you know exactly what to set up and in what order.


The 5-step flow

flowchart LR
    SDK["1. Integrate SDK\n(send traces)"] --> Traces["Traces\nin Noveum"]
    Traces --> ETL["2. ETL Job\n(AI mapper)"]
    ETL --> Dataset["3. Dataset\n(one item\nper LLM call)"]
    Dataset --> Eval["4. Eval Job\n(NovaEval scorers)"]
    Eval --> NovaPilot["5. NovaPilot\n(recommendations)"]

Step 1 — Integrate the SDK and send traces

Install the Noveum SDK and initialize it in your application. Every LLM call, tool use, and agent decision your app makes will be captured as a trace.

import noveum_trace
 
noveum_trace.init(
    api_key="YOUR_API_KEY",
    project="my-agent",
    environment="production",
)

Traces appear in your Noveum dashboard in real time. Each trace represents one end-to-end request through your agent — for example, one user conversation or one task execution.
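
To make this concrete, here is a rough sketch of what one captured trace might contain. The field names below are assumptions made for this guide, not Noveum's exact trace schema; open a real trace in the dashboard to see the actual structure.

# Illustrative sketch only; field names are assumptions, not the exact Noveum schema.
example_trace = {
    "trace_id": "trace_abc123",
    "project": "my-agent",
    "environment": "production",
    "spans": [
        {
            "type": "llm_call",
            "input": "What plans do you offer?",
            "output": "We offer Starter, Pro, and Enterprise plans.",
            "system_prompt": "You are a helpful billing assistant.",
        },
        {
            "type": "tool_call",
            "name": "lookup_pricing",
            "result": {"starter": 29, "pro": 99},
        },
    ],
}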

See Quick Setup to get your first traces flowing in 5 minutes.


Step 2 — Create an ETL Job

Raw traces contain rich telemetry, but scorers need structured, item-by-item data. An ETL Job bridges the two.

Key concept: one item per LLM call. Every LLM call inside a trace becomes one independently evaluated dataset item. A chatbot trace with 5 conversational turns produces 5 items. Each item contains exactly what the scorer needs to evaluate that specific decision in isolation: the input, the output, the system prompt, and the conversation history up to that point.
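
As a rough sketch, this is how a two-turn chatbot trace might expand into two dataset items, each carrying the conversation history up to that call. Field names are illustrative, not the exact Noveum item schema.

# Illustrative sketch of "one item per LLM call"; not the exact Noveum item schema.
items_from_one_trace = [
    {
        "input": "What plans do you offer?",
        "output": "We offer Starter, Pro, and Enterprise plans.",
        "system_prompt": "You are a helpful billing assistant.",
        "conversation_history": [],  # first call: no prior turns
        "item_type": "conversational",
    },
    {
        "input": "How much is Pro?",
        "output": "Pro is $99 per month.",
        "system_prompt": "You are a helpful billing assistant.",
        "conversation_history": [  # second call: includes the first turn
            {
                "user": "What plans do you offer?",
                "assistant": "We offer Starter, Pro, and Enterprise plans.",
            },
        ],
        "item_type": "conversational",
    },
]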

When you create an ETL Job, Noveum reads a sample of your recent traces and uses an AI agent to generate mapper code for you. The mapper code runs on every new trace and extracts the items. You review the code in the editor, make any adjustments using the "Improve Code" button, and apply it.

See ETL Jobs Overview to create your first ETL job, and Using the AI Mapper to understand how the AI-generated code works and how to improve it.


Step 3 — Review your Dataset

Once the ETL job runs, items appear in the Dataset linked to that job. Each item is a versioned record containing:

  • Input: The user's question or request
  • Output: The LLM's response
  • System prompt: The instructions the LLM was given
  • Conversation history: Prior turns (for conversational agents)
  • Tool calls: Any tools invoked and their results
  • Item type: agent or conversational — determines which scorers apply

Datasets are versioned: items accumulate over time as new traces arrive and the ETL job processes them. You can publish a version when you want to lock a snapshot for a specific evaluation run.

See What is a Dataset for the full dataset guide.


Step 4 — Run an Eval Job

An Eval Job applies a set of scorers to every item in your dataset. Each item is scored independently, so you get granular insight into which specific LLM decisions are failing and why.

1. Go to your project and click Eval Jobs → New Eval Job.
2. Select the dataset you created in Step 3.
3. Pick your scorers — or click Recommend Scorers to let Noveum suggest the best ones for your agent type.
4. Click Run. Results appear as each item is scored.

Results include:

  • A pass/fail badge per item per scorer
  • Score values (0–10) for LLM-as-judge scorers
  • Aggregated stats across all items (mean, pass rate, distribution)
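
To show how the aggregated stats relate to the per-item results, here is a small standalone sketch. The scores are made up, and the pass threshold of 7 is an assumption; actual thresholds depend on the scorer and your eval job configuration.

# Standalone sketch with made-up scores; the threshold of 7 is an assumption.
from statistics import mean

faithfulness_scores = [9.0, 6.5, 8.0, 4.0, 7.5, 9.5, 7.0, 5.5]  # 0-10 judge scores
PASS_THRESHOLD = 7.0

mean_score = mean(faithfulness_scores)
pass_rate = sum(s >= PASS_THRESHOLD for s in faithfulness_scores) / len(faithfulness_scores)

print(f"mean={mean_score:.2f}, pass rate={pass_rate:.0%}")  # mean=7.12, pass rate=62%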

See Running Evaluations for the full eval job guide.


Step 5 — Get NovaPilot Recommendations

After at least one eval job run, NovaPilot analyzes all your scored items and generates a prioritized improvement report.

NovaPilot identifies failure patterns — for example, "17 items failed the faithfulness scorer when the user asked about pricing" — and suggests specific changes to your system prompt or agent logic to fix them.

The report includes:

  • Overall health score and trend vs. previous runs
  • Failure patterns grouped by root cause
  • Suggested system prompt changes you can copy and test immediately
  • Cron jobs to schedule recurring eval runs and track improvement over time
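
If you adopt the recurring runs, a standard cron expression is one way to express the schedule. The value below is only an assumed example; the actual schedule is configured in Noveum.

# Standard cron syntax: minute hour day-of-month month day-of-week.
# Assumed example only; configure the real schedule in the Noveum UI.
weekly_eval_schedule = "0 6 * * 1"  # every Monday at 06:00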

See NovaPilot for the full guide.


What the ETL mapper does

After you create an ETL Job, open the Mapper tab. Click Generate Code and Noveum reads a sample of your recent traces, then writes mapper code that you can review in the editor. The code defines how each span in your trace gets converted into a dataset item.

You can edit the code directly, or use Improve Code to describe a problem in plain English and let the AI fix it. Once you are happy with the output, click Apply to activate the mapper on all future traces.
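
For a feel of what the generated code does, here is a hedged sketch of the kind of transformation a mapper performs. The map_trace entry point and the span field names are assumptions for illustration; the real generated code follows Noveum's mapper interface and your actual trace schema.

# Hedged sketch only: map_trace and the span fields below are assumptions,
# not Noveum's real mapper interface.
def map_trace(trace: dict) -> list[dict]:
    """Turn one trace into dataset items, one per LLM call span."""
    items = []
    history = []
    for span in trace.get("spans", []):
        if span.get("type") != "llm_call":
            continue  # skip tool calls, retrieval steps, etc.
        items.append({
            "input": span.get("input"),
            "output": span.get("output"),
            "system_prompt": span.get("system_prompt"),
            "conversation_history": list(history),
            "item_type": "conversational",
        })
        history.append({"user": span.get("input"), "assistant": span.get("output")})
    return items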

See Using the AI Mapper for the full step-by-step guide.


Next steps

  • Quick Setup: get your first traces flowing
  • ETL Jobs Overview and Using the AI Mapper: convert traces into dataset items
  • What is a Dataset: understand item structure and versioning
  • Running Evaluations: configure scorers and run eval jobs
  • NovaPilot: turn scored results into improvement recommendations
