Four analyzer agents
Prompt, Tool, Flow, and General analyzers each own a failure class — isolating the variable that actually broke, not just the symptom.
The autonomous engineer
NovaPilot reads every failing eval, isolates the root cause with four specialized analyzers, and validates each candidate fix two ways — a backtest on the exact calls it changed, then a forward simulation of the whole scenario — before opening a pull request with the evidence attached. You stay the reviewer.
4
specialized analyzer agents
10 min
from failing trace to verified fix
200×
faster than manual debugging
fix(agent): ground refund answers in policy context, tighten tool retries #312
Opened by NovaPilot · backtested on 412 calls · simulated end-to-end
Validation report
01 — Capabilities
Before anything gets “fixed”, NovaPilot validates that the failure is real — scorer noise gets filtered out, not optimized against.
A backtest replays each call the fix changed, with the same context and the fix applied, and re-scores it across all 106 scorers to prove the local answer improved. That alone can't prove the run-level outcome, because a changed decision reroutes everything downstream. So a forward simulation runs the changed agent through the whole scenario end-to-end. Each claim is backed by the method that can measure it: the backtest for the local fix, the simulation for the outcome. Nothing ships until both pass.
The winning fix lands in your repo as a PR with the full report attached. Your engineers review, comment, and merge.
Evals, experiments, failure reports, and fix candidates run continuously — reliability work that doesn’t sleep between incidents.
NovaPilot suggests which scorers to run based on the traffic it actually sees — the setup work disappears.
02 — The pipeline
Each step produces an artifact you can audit — the report, the backtest, the simulation, the PR. Autonomy with evidence, not vibes.
01
Detect
Low-scoring traces stream in from production evals and NovaSynth runs.
02
Diagnose
Four analyzers isolate the failing variable and discard scorer false positives.
03
Generate
Candidate fixes target the specific variable — prompt, tool schema, flow, or configuration.
04
Validate
Backtest each changed call (same context, fix applied) to prove the local answer improved, then forward-simulate the whole scenario to prove the end-to-end outcome holds. A fix that regresses either the local call or the run-level outcome is disqualified.
05
Open the PR
The winner ships to your repository with the evidence attached, behind your review.

03 — Under the hood
Four specialized analyzers, fixes that target one variable, two-way validation across all 106 scorers, confidence scoring, and a full report attached to the PR.
Analyzer agents
Four specialized analyzers — Prompt, Tool, Flow, and General — each isolate one failure class, pinpointing the variable that actually broke and discarding scorer false positives.
Candidate fixes
Fixes target the specific failing variable — a prompt, a tool schema, a flow, or a configuration change — not a blanket rewrite.
Backtest validation
Each candidate is backtested by re-running the exact calls it changed — same context, new fix — re-scored across all 106 scorers to prove the local defect is gone.
Forward simulation
A full-scenario forward simulation runs the changed agent end-to-end, so the outcome is validated, not just the single call. A fix that regresses either the single call or the end-to-end outcome is disqualified.
Confidence & report
Each recommendation carries a confidence score, and the report includes an executive summary, actionable findings, failure patterns, a scorer breakdown, and the evidence behind it.
Pull request delivery
The winning fix opens a pull request, scheduled through BullMQ, with all evidence attached. Your engineers review, comment, and merge. You stay the reviewer.
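As a rough picture of what "evidence attached" could look like, the sketch below models the report sections listed above and renders them as a plain-text pull request body. The `FixReport` class, its field names, the 0 to 1 confidence range, and the sample values are all assumptions made for illustration; the actual report format is not specified here.

```python
from dataclasses import dataclass


@dataclass
class FixReport:
    """Hypothetical container for the report sections named in the text."""
    confidence: float                   # assumed to be in [0, 1]
    executive_summary: str
    findings: list[str]
    failure_patterns: list[str]
    scorer_breakdown: dict[str, float]  # scorer name -> score after the fix
    evidence: list[str]                 # e.g. references to backtest and simulation runs


def render_pr_body(report: FixReport) -> str:
    """Render the report as a plain-text pull request description."""
    lines = [f"Confidence: {report.confidence:.0%}", ""]
    lines += ["Executive summary", report.executive_summary, ""]
    lines += ["Findings"] + [f"- {item}" for item in report.findings] + [""]
    lines += ["Failure patterns"] + [f"- {p}" for p in report.failure_patterns] + [""]
    lines += ["Scorer breakdown"]
    lines += [f"- {name}: {score:.2f}" for name, score in sorted(report.scorer_breakdown.items())]
    lines += ["", "Evidence"] + [f"- {ref}" for ref in report.evidence]
    return "\n".join(lines)


# Sample values are invented for the example.
report = FixReport(
    confidence=0.87,
    executive_summary="Refund answers were not grounded in the policy context.",
    findings=["Prompt omitted the retrieved policy text."],
    failure_patterns=["Ungrounded refund answers"],
    scorer_breakdown={"grounding": 0.91, "tool_use": 0.88},
    evidence=["backtest run", "forward simulation run"],
)
body = render_pr_body(report)
```

Keeping the report as structured data and rendering it last means the same evidence can feed the pull request, a dashboard, or an audit log without re-parsing text.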
The loop
NovaPilot closes the loop: it consumes everything upstream (traces, scores, simulations, blocked outputs) and returns improvements as pull requests for you to review and merge.
Next step
Start free and ship your first trace in 15 minutes — or book 30 minutes and we’ll integrate live on the call: your stack, your data, your first eval report before it ends.
SOC 2 Type II · HIPAA · GDPR · On-prem & BYO ClickHouse available