The autonomous engineer

The AI engineer that ships verified fixes.

NovaPilot reads every failing eval, isolates the root cause with four specialized analyzers, and validates each candidate fix two ways — a backtest on the exact calls it changed, then a forward simulation of the whole scenario — before opening a pull request with the evidence attached. You stay the reviewer.

4

specialized analyzer agents

10 min

from failing trace to verified fix

200×

faster than manual debugging

acme/support-agent

fix(agent): ground refund answers in policy context, tighten tool retries #312

Opened by NovaPilot · backtested on 412 calls · simulated end-to-end

Validation report

hallucination_rate    7.1%    0.4%    FAIL
tool_call_success     81%     98%     FAIL
goal_completion       84%     95%     FAIL
106 scorers re-run on the changed calls + scenario re-simulated — no regressions
Awaiting human review · Merge fix

01 — Capabilities

Autonomy with evidence at every step.

Four analyzer agents

Prompt, Tool, Flow, and General analyzers each own a failure class — isolating the variable that actually broke, not just the symptom.
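
One way to picture the split between the four analyzers is a dispatch step keyed by failure class. The sketch below is illustrative only: `FailureClass`, `FailingTrace`, its boolean fields, and the `classify` heuristic are hypothetical names chosen for the example, not NovaPilot's API.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    PROMPT = "prompt"
    TOOL = "tool"
    FLOW = "flow"
    GENERAL = "general"


@dataclass
class FailingTrace:
    # Hypothetical signals; a real trace would carry far more context.
    trace_id: str
    tool_error: bool = False
    wrong_step_order: bool = False
    ungrounded_answer: bool = False


def classify(trace: FailingTrace) -> FailureClass:
    """Toy heuristic: return the first failure class whose signal is set."""
    if trace.tool_error:
        return FailureClass.TOOL
    if trace.wrong_step_order:
        return FailureClass.FLOW
    if trace.ungrounded_answer:
        return FailureClass.PROMPT
    return FailureClass.GENERAL


def route(traces: list[FailingTrace]) -> dict[FailureClass, list[str]]:
    """Group trace ids by the analyzer that would own them."""
    buckets: dict[FailureClass, list[str]] = {fc: [] for fc in FailureClass}
    for trace in traces:
        buckets[classify(trace)].append(trace.trace_id)
    return buckets
```

Anything that matches no specific signal falls through to the General bucket, which mirrors the idea that each analyzer owns one failure class.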

False-positive filtering

Before anything gets “fixed”, NovaPilot validates that the failure is real — scorer noise gets filtered out, not optimized against.
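
The page does not say how a failure is confirmed as real. One plausible approach, shown here purely as an assumption, is to re-run the scorer several times and keep the failure only if it fails consistently. The function name, the `threshold`, and the `min_fail_fraction` parameter are all hypothetical.

```python
from typing import Callable


def is_real_failure(
    score_once: Callable[[], float],
    threshold: float,
    repeats: int = 5,
    min_fail_fraction: float = 0.8,
) -> bool:
    """Return True only if the scorer fails in most of the repeated runs.

    score_once: re-scores the same trace and returns a score (higher is better).
    threshold: scores below this value count as a failure.
    """
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    fails = sum(1 for _ in range(repeats) if score_once() < threshold)
    return fails / repeats >= min_fail_fraction
```

A trace that fails once and passes four times would be treated as scorer noise under these defaults and dropped before any fix is generated.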

Validated two ways

A backtest replays each call the fix changed (same context, new fix) and re-scores it across all 106 scorers to prove the local answer improved. That alone can't prove the run-level outcome, because a changed decision reroutes everything downstream. So a forward simulation runs the changed agent through the whole scenario end-to-end. Each claim is earned by the method that can measure it: the local fix by backtest, the outcome by simulation. Both must hold before anything ships.
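
The two-stage check can be sketched as a gate that rejects a candidate if either stage shows a regression. This is a minimal illustration, not NovaPilot's implementation: `ScorePair`, `passes_gate`, and the `tolerance` parameter are hypothetical, and the assignment of metrics to the backtest or the simulation in the usage note is assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class ScorePair:
    before: float
    after: float
    higher_is_better: bool = True

    def regressed(self, tolerance: float = 0.0) -> bool:
        """True if the score moved in the wrong direction by more than tolerance."""
        delta = self.after - self.before
        if not self.higher_is_better:
            delta = -delta
        return delta < -tolerance


def passes_gate(
    backtest: dict[str, ScorePair],
    simulation: dict[str, ScorePair],
) -> bool:
    """A candidate survives only if neither stage shows a regression."""
    local_ok = not any(pair.regressed() for pair in backtest.values())
    outcome_ok = not any(pair.regressed() for pair in simulation.values())
    return local_ok and outcome_ok
```

With the figures from the validation report earlier on the page, and assuming `hallucination_rate` (7.1 to 0.4, lower is better) and `tool_call_success` (81 to 98) are call-level scores while `goal_completion` (84 to 95) is a scenario-level score, the gate passes. If `goal_completion` dropped instead, the same candidate would be rejected even though every backtested call improved.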

Deploy as a pull request

The winning fix lands in your repo as a PR with the full report attached. Your engineers review, comment, and merge.

Always on

Evals, experiments, failure reports, and fix candidates run continuously — reliability work that doesn’t sleep between incidents.

Scorer recommendations

NovaPilot suggests which scorers to run based on the traffic it actually sees — the setup work disappears.

02 — The pipeline

Five steps from failure to merged fix.

Each step produces an artifact you can audit — the report, the backtest, the simulation, the PR. Autonomy with evidence, not vibes.

01

Detect

Low-scoring traces stream in from production evals and NovaSynth runs.

02

Diagnose

Four analyzers isolate the failing variable and discard scorer false positives.

03

Generate

Candidate fixes target the specific variable — prompt, tool schema, flow, or configuration.

04

Validate

Backtest each changed call (same context, new fix) to prove the local answer, then forward-simulate the whole scenario to prove the end-to-end outcome holds. A fix that regresses either the local call or the run-level outcome is disqualified.

05

Open the PR

The winner ships to your repository with the evidence attached, behind your review.
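
The five steps above can be pictured as a chain of stages that each leave an auditable artifact and can halt the run. The sketch below is a simplified model under assumed names (`Run`, `Stage`, `run_pipeline`); the real pipeline's interfaces are not described on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Run:
    trace_id: str
    # Maps stage name to the artifact that stage produced.
    artifacts: dict[str, str] = field(default_factory=dict)
    halted_at: Optional[str] = None


# A stage returns an artifact description, or None to stop the run
# (for example, when a failure turns out to be scorer noise).
Stage = Callable[[Run], Optional[str]]


def run_pipeline(run: Run, stages: list[tuple[str, Stage]]) -> Run:
    """Execute stages in order, recording one artifact per completed stage."""
    for name, stage in stages:
        artifact = stage(run)
        if artifact is None:
            run.halted_at = name
            break
        run.artifacts[name] = artifact
    return run
```

Because a stage that returns nothing stops the chain, a candidate that fails validation never reaches the pull-request stage, and the recorded artifacts show how far the run got.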

NovaPilot · generated fix report with backtest + simulation evidence (live product)

03 — Under the hood

What ships in every pull request.

Four specialized analyzers, fixes that target one variable, two-way validation across all 106 scorers, confidence scoring, and a full report attached to the PR.

Analyzer agents

Four specialized analyzers — Prompt, Tool, Flow, and General — each isolate one failure class, pinpointing the variable that actually broke and discarding scorer false positives.

Candidate fixes

Fixes target the specific failing variable — a prompt, a tool schema, a flow, or a configuration change — not a blanket rewrite.

Backtest validation

Each candidate is backtested by re-running the exact calls it changed — same context, new fix — re-scored across all 106 scorers to prove the local defect is gone.

Forward simulation

A full-scenario forward simulation runs the changed agent end-to-end, so the outcome is validated — not just the single call. A fix that regresses either is disqualified.

Confidence & report

Each recommendation carries a confidence score, and the report includes an executive summary, actionable findings, failure patterns, a scorer breakdown, and the evidence behind it.
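
The report sections listed above map naturally onto a small record type. The following is a sketch under assumptions: the class name `FixReport`, the field types, and the 0-to-1 confidence range are hypothetical, since the page does not specify the report schema or the confidence scale.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FixReport:
    executive_summary: str
    actionable_findings: list[str]
    failure_patterns: list[str]
    scorer_breakdown: dict[str, float]  # scorer name -> score after the fix
    evidence: list[str]                 # e.g. links to backtest and simulation runs
    confidence: float                   # assumed here to lie in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_json(self) -> str:
        """Serialize the report so it can be attached to a pull request."""
        return json.dumps(asdict(self), indent=2)
```

Validating the confidence value at construction time keeps a malformed report from reaching reviewers.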

Pull request delivery

The winning fix opens a pull request, scheduled as a BullMQ job, with all evidence attached. Your engineers review, comment, and merge. You stay the reviewer.

The loop

Part of one closed loop.

NovaPilot closes the loop: it consumes everything upstream — traces, scores, simulations, blocked outputs — and returns merged improvements.

Next step

Put your production AI under control.

Start free and ship your first trace in 15 minutes — or book 30 minutes and we’ll integrate live on the call: your stack, your data, your first eval report before it ends.

SOC 2 Type II · HIPAA · GDPR · On-prem & BYO ClickHouse available