Autonomous QA for enterprise AI agents

AI agents that test, monitor, and fix your AI agents.

You build the agent — Noveum is the autonomous QA team for everything after. It traces every run, catches failures before your users do, and ships fixes it has validated end-to-end as pull requests.

Free tier · no credit card · first trace in 15 minutes

SOC 2 Type II · HIPAA · GDPR · On-prem & BYO ClickHouse

agent.py
from noveum_trace.integrations.langchain import (
    NoveumTraceCallbackHandler,
)
handler = NoveumTraceCallbackHandler()
agent.invoke({"input": question},
             config={"callbacks": [handler]})

Noveum, automatically

Traced 1,248 agent runs today
NovaEval flagged hallucination_rate · 7.1%
NovaPilot opened PR #312 · +11% in simulation

Drop in the SDK — Noveum traces, evaluates, and ships the fix automatically.

Open-source SDK: Python · TypeScript on GitHub
60M+ spans traced / day
SOC 2 Type II · HIPAA · GDPR compliant

Running in production at

DarGlobal, MyOperator, and AI platform teams across telecom, real estate & financial services

Built on / partnered with

Amazon Web Services · Google Cloud · NVIDIA Inception Program · DigitalOcean · OVHcloud

09 — Customers

What our customers say.

DarGlobal · Switched from Langfuse
Being able to analyze spans and traces directly inside the platform — and identify issues right there — has saved a lot of debugging time. Before, we manually went through logs, copied traces into Claude Code, analyzed issues, then implemented fixes.
  • Maintenance and deployment overhead significantly reduced after switching from Langfuse
  • AI-assisted debugging directly on spans and traces, inside the platform
  • AI-suggested scorers based on the logs and traces actually captured
  • CrewAI integration requested as a custom requirement: shipped and integrated quickly

AI Engineering Team — DarGlobal

MyOperator · Voice agents at scale
We went from spending days debugging agent failures to having verified fixes delivered automatically.

84→95%+

call success rate

200×

faster iteration

4–6×

answer quality

10 min

trace → verified fix

Lead AI Engineer — MyOperator

Read the case study

08 — Proven in production

From flying blind to 95%+ call success.

MyOperator, a leading cloud-communications platform, runs AI voice agents at scale. Noveum was integrated in four lines of code. NovaPilot then tested 136 prompt variations across four evolutionary generations and shipped fixes it backtested on the failing calls and verified end-to-end in simulation — work that used to take days of manual log-diving.

84→95%+

call success rate, autonomously optimized

200×

faster iteration than manual debugging

4–6×

improvement in answer quality

10 min

from failing trace to verified fix

10 — The honest comparison

Everyone else shows you what broke. Noveum fixes it.

Observability and evals are table stakes now — every tool can point at a failing trace. The real test is what happens next. Noveum is the only one that ships the fix, on your own live traffic, with no dataset to maintain.

Capability · Noveum · Arize · Braintrust · Galileo · Langfuse · Maxim
Validated auto-fix shipped as a code PR: Prompt only · Suggestions · Insights
Evaluates live traffic (no datasets to maintain): Production-first · Dataset-first · Dataset-first · Dataset-first · Dataset-first · Dataset-first
Real-time guardrails & policy engine: Partial · Partial
Dedicated audio & voice evaluation: 30+ scorers · Partial · Partial
Voice simulation over real phone calls
Calibrated scorers out of the box: 106 · Library · Build own · 20+ · Bring own · Library
On-prem / BYO ClickHouse (every feature): VPC (ent.) · Hybrid · Ent. only · Self-host · In-VPC
White-glove integration on a live call

Compiled June 2026 from public documentation and pricing pages; capabilities evolve — verify against your own requirements. Langfuse was acquired by ClickHouse (Jan 2026). Galileo entered an agreement to be acquired by Cisco (Apr 2026).

01 — The reliability gap

Your agents pass the demo. Production is where they break.

In a demo, every path is the happy path. In production, agents update records, move money, file documents, and place calls across millions of runs a day — and a 0.1% failure rate becomes thousands of silent incidents your customers find before your team does.

40%+

of agentic AI projects will be canceled by the end of 2027

Gartner, 2025

42%

of enterprises scrapped most AI initiatives before production in 2025 — up from 17% a year earlier

S&P Global, 2025

~60%

end-to-end success rate of a 95%-per-step agent across a 10-step workflow

Compounding-error math
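
The compounding works multiplicatively: if every step must succeed, the per-step rates multiply. A minimal sketch of that arithmetic, assuming each step succeeds independently (an idealization; the function name is illustrative):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


rate = end_to_end_success(0.95, 10)
print(f"{rate:.1%}")  # 59.9%
```

Real agents rarely have fully independent steps, so this is best read as a back-of-the-envelope bound rather than a prediction.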

$290K

refunded to the Australian government after an AI tool fabricated citations in a consulting report

Reported Oct 2025

Past ~100,000 traces a day, reviewing logs by hand collapses. The teams shipping reliable AI automate the entire loop — detect, diagnose, fix, and prove — so reliability scales with usage instead of headcount.

02 — How Noveum works

How Noveum closes the loop most tools leave open.

Five systems wrap every agent you ship — guard live traffic, trace every run, evaluate it, simulate what's next, and ship the fix. Each one feeds the next.

01 · NOVAGUARD

Guard

Enforce policy on live traffic in real time.

02 · NOVATRACE

Trace

Capture every LLM call, tool call, retrieval, and agent step — one SDK, live in 15 minutes.

03 · NOVAEVAL

Evaluate

Score every run against 106 calibrated evaluators across 18 categories.

04 · NOVASYNTH

Simulate

Forward-simulate whole scenarios against the agent — the engine that validates end-to-end outcomes, with tool calls virtualized.

05 · NOVAPILOT

Fix

Ship fixes — backtested per-call and simulated end-to-end — as pull requests. Your engineers stay in control.

Every fix feeds the next run

03 — The platform

One platform for the entire agent lifecycle.

From pre-production testing to live monitoring, evaluation, guardrails, and autonomous fixes, Noveum covers the entire agent lifecycle in one platform. You build the agent; we handle everything after — in our cloud, your VPC, or fully on-prem.

Observability

NovaTrace

One open-source SDK captures every LLM call, tool invocation, retrieval, and agent step — with tokens, cost, and latency on every span. Deep native integrations for LangChain, LangGraph, LiveKit, Pipecat, and CrewAI.

  • Deep framework integrations
  • Threads & multi-agent graphs
  • Fail-soft transport

6M+ traces/day · OpenTelemetry-aligned · Apache-2.0 SDK

Figure: The Noveum three-pane traces explorer (app.noveum.ai)

04 — NovaPilot

From failing eval to merged fix.

Four analyzer agents read every low-scoring trace, separate real defects from scorer noise, and carry the fix all the way to your repository — as a prompt, tool, flow, or configuration change, with both the backtest and the simulation attached.

01 DIAGNOSE

Root-cause every failure class

Specialized analyzers isolate the variable that actually failed — prompt phrasing, context handling, tool-call format, or model configuration — and filter out scorer false positives first.

02 VALIDATE

Prove the local fix and the outcome

Backtest the exact calls the fix changed (same context, new prompt) to prove the local defect is gone; then forward-simulate the whole scenario to prove the end-to-end outcome holds. Any regression, local or run-level, blocks the ship.
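
The two-stage gate described here can be sketched in a few lines. This is an illustration of the idea, not NovaPilot's implementation: the `Validation` type, the `regressions` and `should_ship` functions, the higher-is-better score convention, and the tolerance parameter are all assumptions, and the scores are illustrative values.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical: scorer name -> score, normalized so that higher is better.
ScoreSet = Dict[str, float]


@dataclass
class Validation:
    backtest_before: ScoreSet
    backtest_after: ScoreSet
    simulation_before: ScoreSet
    simulation_after: ScoreSet


def regressions(before: ScoreSet, after: ScoreSet, tolerance: float = 0.0) -> List[str]:
    """Return scorers whose score dropped by more than `tolerance`.

    A scorer missing from `after` counts as a regression.
    """
    return sorted(
        name for name, old in before.items()
        if after.get(name, float("-inf")) < old - tolerance
    )


def should_ship(v: Validation) -> bool:
    """Ship only if neither the backtest nor the simulation regressed."""
    local = regressions(v.backtest_before, v.backtest_after)
    run_level = regressions(v.simulation_before, v.simulation_after)
    return not local and not run_level


candidate = Validation(
    backtest_before={"tool_call_success": 0.81},
    backtest_after={"tool_call_success": 0.98},
    simulation_before={"goal_completion": 0.84},
    simulation_after={"goal_completion": 0.95},
)
print(should_ship(candidate))  # True
```

Treating a missing score as a regression is a deliberately conservative choice in this sketch: a fix that cannot be re-scored is not proven safe.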

03 SHIP

Deliver as a pull request

The winning fix lands in your repository as a PR with the full report attached. Your engineers review, comment, and merge — NovaPilot iterates until checks pass.

acme/support-agent

fix(agent): ground refund answers in policy context, tighten tool retries #312

Opened by NovaPilot · backtested on 412 calls · simulated end-to-end

Validation report

hallucination_rate · 7.1% → 0.4% · FAIL
tool_call_success · 81% → 98% · FAIL
goal_completion · 84% → 95% · FAIL
106 scorers re-run on the changed calls + scenario re-simulated — no regressions
Awaiting human review · Merge fix

A real NovaPilot deliverable — the validated fix, opened as a PR

Evaluation quality is the control signal for the whole loop — a fix is only as good as the score that justifies it. That’s why the 106-scorer library came first, and why every fix ships with both its backtest and its simulation.

05 — Voice

Score the call, not just the transcript.

NovaSynth places real SIP calls to your agent — or runs full chat simulations — then NovaEval scores the audio itself: mispronunciations, interruptions, barge-in, VAD, dead air, and the entire latency pipeline. Catch what transcript-only tools miss, before a customer ever hears it.

  • 30+ dedicated voice & audio scorers — mispronunciation, audio breakage, MOS, gibberish, talk-ratio, pitch & volume
  • Full voice-pipeline timing: STT, LLM time-to-first-token, TTS time-to-first-byte, end-to-end
  • VAD, interruptions, and barge-in recovery — the failures a transcript never shows
  • Real SIP calls and chat simulations with personas, accents, and adversarial intent
  • LiveKit, Pipecat, Twilio, and custom voice stacks
Live call · NovaSynth → your agent · CALL #1042 · SIP

Agent: Thanks for calling. How can I help with your order today? [PASS]

Caller: I need to change the address on order eaze… ezy… EZ-4471.

Agent: Sure, updating order eee-zee forty-four seventy-one now. [MISPRONOUNCED]

Agent: … (3.1s of dead air after tool call) … [DEAD AIR]

242 ms

STT

486 ms

LLM TTFT

191 ms

TTS TTFB

1.3 s

E2E

98%

VAD

3.1 s

DEAD AIR

Flagged by NovaEval audio scorers — caught in pre-launch NovaSynth simulation.
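
As a rough reading of the timing panel, the per-stage figures can be summed and compared with the end-to-end number. This is only a sketch with illustrative variable names; it assumes the three stages run sequentially within one turn and that the rounded 1.3 s figure describes the same turn.

```python
# Stage timings from the sample call, in milliseconds.
stages_ms = {"stt": 242, "llm_ttft": 486, "tts_ttfb": 191}
end_to_end_ms = 1300  # shown on the page as 1.3 s (rounded)

accounted_ms = sum(stages_ms.values())
unaccounted_ms = end_to_end_ms - accounted_ms
slowest = max(stages_ms, key=stages_ms.get)

print(accounted_ms)    # 919
print(unaccounted_ms)  # 381
print(slowest)         # llm_ttft
```

The remainder is time the three stage timers do not cover; what it consists of depends on how a given pipeline is instrumented.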

30+ voice & audio scorers, including

Mispronunciation · Audio breakage · Word accuracy · MOS · Tone clarity · Gibberish · Talk ratio · Agent WPM · AI-interrupts-user · User-interrupts-agent · Assistant overlap · Assistant silence · TTS time-to-first-byte · STT latency · STT over-suppression · LLM time-to-first-token · End-to-end latency · End-of-turn delay · Sentiment / CSAT · Appropriate call termination

06 — Instrumentation

One SDK. Every framework you already run.

Noveum Trace is open-source and framework-native. Add a callback, a wrapper, or a single context manager — and your agents stream full traces in minutes, with tokens, cost, and latency on every span. No collector to operate, no OpenTelemetry pipeline to babysit.

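pip install noveum-trace

from noveum_trace.integrations.langchain import (
    NoveumTraceCallbackHandler,
)

# Attach the handler to an existing LangChain agent; `agent` and `question`
# are defined elsewhere in your application.
handler = NoveumTraceCallbackHandler()
agent.invoke(
    {"input": question},
    config={"callbacks": [handler]},
)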
  • ~3,000-line LangChain callback — LLM, chain, agent, tool & retrieval events
  • Playable STT/TTS audio captured per turn for voice agents
  • One line wires a whole CrewAI crew, including agent-to-agent delegation

Native integrations

LangChain

Drop-in callback handler

LiveKit

STT/TTS wrappers, playable audio

Custom Python

Context managers — wrap any call

LangGraph

Graph-aware, parent-resolved spans

Pipecat

Per-turn voice-pipeline spans

CrewAI

Crews, A2A delegation, MCP & flows

MCP server included

Let your coding agent wire it up

You still integrate the SDK — but you don't have to do it by hand. Point Claude Code or Cursor at the Noveum MCP server (~60 tools from our OpenAPI spec) and your agent adds the SDK, then drives the platform: query traces, run evals, launch NovaSynth, trigger NovaPilot.

Python 3.9–3.13 · Apache-2.0 · fail-soft, batched, sampled transport
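
To make "fail-soft, batched, sampled" concrete, here is a minimal sketch of a transport with those three properties. It is not the Noveum SDK's code: the `Transport` class, its parameters, and the `send` callable are hypothetical and exist only to illustrate the pattern.

```python
import random
from typing import Callable, Dict, List, Optional

Span = Dict[str, object]


class Transport:
    """Hypothetical span transport: sampled, batched, and fail-soft."""

    def __init__(
        self,
        send: Callable[[List[Span]], None],
        batch_size: int = 100,
        sample_rate: float = 1.0,
        rng: Optional[random.Random] = None,
    ) -> None:
        self._send = send
        self._batch_size = batch_size
        self._sample_rate = sample_rate
        self._rng = rng or random.Random()
        self._buffer: List[Span] = []
        self.dropped = 0  # spans lost to send failures

    def record(self, span: Span) -> None:
        # Sampled: keep roughly `sample_rate` of all spans.
        if self._rng.random() >= self._sample_rate:
            return
        # Batched: buffer spans and send them together once the batch is full.
        self._buffer.append(span)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        batch, self._buffer = self._buffer, []
        if not batch:
            return
        try:
            self._send(batch)
        except Exception:
            # Fail-soft: a telemetry failure must never break the agent.
            self.dropped += len(batch)


sent: List[List[Span]] = []
transport = Transport(send=sent.append, batch_size=2)
for i in range(5):
    transport.record({"id": i})
transport.flush()
print([len(batch) for batch in sent])  # [2, 2, 1]
```

Dropping a failed batch rather than retrying keeps the sketch short; a production transport would more likely retry with backoff before giving up.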

07 — MCP

Your coding agent runs Noveum for you.

The whole platform ships as an MCP server — ~60 tools auto-generated from our OpenAPI spec. Your agent can integrate the SDK and then operate Noveum end to end, in plain language, from Claude Code or Cursor.

  • Ask your agent to add the Noveum SDK and wire up tracing — it reads the docs as MCP tools
  • Then drive the full loop conversationally: traces → datasets → evals → NovaSynth → NovaPilot
  • Same API key as REST, scoped per project, with full audit logging
Claude Code
noveum-mcp · ~60 tools

Call success dropped since yesterday’s deploy. Find out why — and fix it.

traces.search · evals.run · novapilot.analyze

Found it: dead-air regressions on tool-call turns after the 14:02 deploy — 38 calls affected. NovaPilot generated a fix, backtested the calls it changed, and simulated the whole scenario end-to-end. No regressions.

Pull request opened — awaiting your review

Works in Claude Code · Cursor · VS Code · any MCP client

See it on your own agents.

Start free and ship your first trace in 15 minutes — or book a call and our team integrates live on the spot.

No credit card · SOC 2 Type II · on-prem available

11 — Enterprise

Enterprise-grade from day one.

Your security team's checklist isn't an afterthought here. Choose exactly where every trace lives — our cloud, your VPC, or fully air-gapped on-prem — with data residency, PII redaction, SSO/SAML, RBAC, and audit logging built in. SOC 2 Type II, HIPAA, and GDPR from the start.

Managed cloud

Multi-tenant SaaS on our infrastructure — RBAC, audit logs, encryption at rest, and a 99.9% uptime target.

SOC 2 Type II · fastest start

Bring your own ClickHouse

Hybrid: Noveum runs as managed software while your telemetry stays in your own ClickHouse. Trace data never leaves your cloud.

Your data plane · our control plane

On-premise

The full platform inside your VPC or fully air-gapped — Kubernetes and Helm, the same stack we run in production.

Air-gap ready · Kubernetes / Helm

What your infosec team needs, already in place

Data residency

Choose your region. With BYO ClickHouse or on-prem, trace data never leaves your cloud.

PII redaction

Pseudonymize or strip sensitive fields inside the SDK, before anything is ever sent.
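
A minimal sketch of what client-side redaction can look like, the point being that sensitive values are replaced before a span ever leaves the process. This is not the Noveum SDK's API: the field names, the email pattern, and the `pseudonymize` and `redact` functions are hypothetical.

```python
import hashlib
import hmac
import re
from typing import Any, Dict

SENSITIVE_KEYS = {"email", "phone", "ssn"}  # hypothetical field names
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(value: str, secret: bytes) -> str:
    """Keyed hash: the same input always maps to the same opaque token."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"pii_{digest[:12]}"


def redact(attributes: Dict[str, Any], secret: bytes) -> Dict[str, Any]:
    """Return a copy with sensitive fields pseudonymized and emails stripped from free text."""
    clean: Dict[str, Any] = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS and isinstance(value, str):
            clean[key] = pseudonymize(value, secret)
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[email]", value)
        else:
            clean[key] = value
    return clean


span = {
    "email": "ada@example.com",
    "input": "Contact ada@example.com about the refund",
    "tokens": 42,
}
safe = redact(span, secret=b"rotate-me")
print(safe["input"])  # Contact [email] about the refund
```

Pseudonymizing with a keyed hash keeps traces joinable (the same customer yields the same token) without storing the raw value; stripping is the stricter option when no join is needed.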

SSO, SAML & RBAC

Enterprise SSO, SCIM provisioning, and fine-grained role-based access control.

Audit logging

Every authorization decision and platform action is logged and exportable.

Encryption everywhere

TLS in transit, encryption at rest, and encrypted per-tenant credentials.

Compliance

SOC 2 Type II, HIPAA with BAA, and GDPR — DPA and subprocessor list on request.

SOC 2 Type II · HIPAA · BAA · GDPR · SSO / SAML · RBAC · Audit logs · PII redaction · Encryption at rest

Load-tested to 15,000 spans/sec · 60M+ spans/day in production

12 — Pricing

Start free. Scale org-flat.

Pricing that follows your usage, not your headcount. Every plan includes all five products — bring the whole team.

Free

$0/month

  • 2,500 credits every month
  • 1M spans · 30-day retention
  • No credit card required
Get your API key

Pro

Most popular

From $69/month

  • Org-flat — unlimited seats
  • Usage-based credits, rule-based scorers free
  • More spans, retention & NovaPilot runs
Start with Pro

Enterprise

Custom

  • On-prem & BYO ClickHouse
  • SSO/SAML · SLAs · audit & DPAs
  • White-glove integration on a live call
Book a demo

13 — FAQ

The questions enterprise teams ask first.

What happens if NovaPilot ships a bad fix?

It can’t reach production on its own — NovaPilot only opens a pull request. Every candidate fix is validated two ways before it reaches your queue: a backtest re-runs the exact calls it changed and re-scores them across all 106 evaluators, and a forward simulation runs the whole scenario end-to-end. If either regresses, the fix never reaches you. Your engineers review and merge every change.

How is our data isolated — and is it used to train your models?

Every organization gets isolated trace storage, with BYO ClickHouse or fully on-prem options. Your traces are never used to train shared models, and in air-gapped deployments nothing leaves your network — NovaPilot runs against models you control.

Can we run the whole platform on-prem or in our own VPC?

Yes: SaaS, hybrid with your own ClickHouse, or fully air-gapped on-premise via Kubernetes and Helm. Every feature is included in every deployment model, NovaPilot included; it doesn’t get weaker behind your firewall.

We run mission-critical voice agents. What can you catch that transcript-based tools can’t?

We score the audio itself: mispronounced names, talking over the customer, dead air after a tool call, and full pipeline latency (STT, LLM TTFT, TTS TTFB). NovaSynth places real SIP calls with accents and barge-in before launch, so failures surface in simulation — not with your customers.

You’re younger than the incumbents — what protects our investment?

Your instrumentation is an Apache-2.0 open-source SDK on OpenTelemetry-aligned schemas, and you own your trace data in your own store. There’s no proprietary lock-in on the layer that matters most — your data and integration are portable.

How does pricing work at scale?

Plans are org-flat and usage-based — you pay for evaluation work, not seats — so adding engineers and traces never multiplies your bill. You start free, and rule-based scorers run at no cost. Enterprise adds on-prem/BYOC, SSO/SAML, SLAs, and DPAs.

Next step

Put your production AI under control.

Start free and ship your first trace in 15 minutes — or book 30 minutes and we’ll integrate live on the call: your stack, your data, your first eval report before it ends.

SOC 2 Type II · HIPAA · GDPR · On-prem & BYO ClickHouse available