LLM-as-Judge Scorers

112 Calibrated LLM-as-Judge Scorers for Comprehensive AI Evaluation

Evaluate every dimension of your AI agents with Noveum.ai's calibrated scorer library. From hallucination detection to bias assessment, you get every eval metric you need out of the box.

112 production scorers

15 quality categories

76+ LLM-as-Judge metrics

Built for production AI evaluation with everything you need out of the box

Why use Noveum.ai Scorers/Evals?

LLM-as-Judge Technology

Powered by advanced LLM evaluation for nuanced quality assessment that understands context and intent.

No Manual Labeling Required

Evaluate agents automatically using system prompts as ground truth. No need to create expected outputs for every test case.

Comprehensive Coverage

112 calibrated scorers covering every dimension of AI quality from hallucination detection to bias assessment.

Enterprise-Ready

Used by leading enterprises for production agent evaluation with battle-tested reliability and scale.

Fully Customizable

Create custom scorers for your specific business needs. Extend existing scorers or build from scratch.

Trace-Based Evaluation

Evaluate complete agent workflows from traces. Analyze tool calls, reasoning steps, and multi-turn conversations.

36 voice & audio scorers

Voice agents, evaluated end to end

The only scorer library with a dedicated voice suite: 36 scorers covering speech recognition, synthesis quality, latency, and call dynamics — straight from your production audio, no reference transcripts needed.

Stage 1

Speech-to-Text

Was the caller heard correctly — and fast enough?

SttEfficacyScorer
WordAccuracyScorer
SttOverSuppressionScorer
TranscriptionDelayScorer
SttLatencyScorer

Stage 2

Reasoning

Did the model start responding without dead air?

LlmTtftScorer
LlmLatencyScorer

Stage 3

Text-to-Speech

Did the agent sound natural, clear, and correct?

MispronunciationScorer
AudioBreakageScorer
MosScorer
ToneClarityScorer
TtsTtfbScorer
TtsLatencyScorer

Stage 4

Call Dynamics

Interruptions, talk ratio, pacing, sentiment, and how the call ended.

E2eLatencyScorer
EndOfTurnDelayScorer
AiInterruptUserScorer
TalkRatioScorer
AgentWpmScorer
SentimentCsatScorer
AppropriateCallTerminationScorer

Reference-free — no ground-truth transcripts required

Audio-native judges listen to the actual call

Runs on real production calls, not synthetic samples
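
Two of the Stage 4 call-dynamics metrics, talk ratio and agent speaking rate, are simple enough to illustrate by hand. The sketch below is not the Noveum.ai implementation: the `Segment` structure, the function names, and the exact definitions (agent share of total speaking time, words per minute over the agent's own speaking time) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized stretch of speech (hypothetical structure)."""
    speaker: str    # "agent" or "user"
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    text: str       # transcript of the segment


def talk_ratio(segments: list[Segment]) -> float:
    """Fraction of total speaking time that belongs to the agent (assumed definition)."""
    agent = sum(s.end_s - s.start_s for s in segments if s.speaker == "agent")
    total = sum(s.end_s - s.start_s for s in segments)
    return agent / total if total > 0 else 0.0


def agent_wpm(segments: list[Segment]) -> float:
    """Agent words per minute, measured over the agent's own speaking time."""
    agent_segments = [s for s in segments if s.speaker == "agent"]
    words = sum(len(s.text.split()) for s in agent_segments)
    seconds = sum(s.end_s - s.start_s for s in agent_segments)
    return words / (seconds / 60.0) if seconds > 0 else 0.0


# A made-up three-turn call used only to exercise the functions.
call = [
    Segment("agent", 0.0, 6.0, "Thanks for calling, how can I help you today?"),
    Segment("user", 6.5, 10.5, "I need to reset my password."),
    Segment("agent", 11.0, 17.0, "Sure, I can help with that right now."),
]

print(f"talk ratio: {talk_ratio(call):.2f}")
print(f"agent WPM:  {agent_wpm(call):.1f}")
```

In this made-up call the agent speaks for 12 of 16 seconds and says 17 words, so the talk ratio is 0.75 and the speaking rate is about 85 words per minute.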

Explore All Scorers

Search and filter 112 scorers across 15 categories

Find the right scorers for your specific use case

Do You Have the Scorers You Need?

RAG System Evaluation

AnswerRelevancyScorer
FaithfulnessScorer
ContextualPrecisionScorer
ContextualRecallScorer
RAGASScorer

Safety & Compliance

ToxicityScorer
ContentSafetyViolationScorer
IsHarmfulAdviceScorer
ContentModerationScorer
AnswerRefusalScorer

Bias Detection

NoGenderBiasScorer
NoRacialBiasScorer
NoAgeBiasScorer
CulturalSensitivityScorer
BiasDetectionScorer

Everything you need to know about Noveum.ai scorers

Frequently Asked Questions

What are evaluation scorers?

Evaluation scorers are specialized metrics that assess different aspects of AI agent behavior. They use LLM-as-Judge technology to evaluate quality dimensions like faithfulness, relevancy, safety, and more—without requiring manual labeling or expected outputs.

How many scorers does Noveum.ai have?

Noveum.ai provides 112 calibrated scorers across 15 categories, covering every dimension of AI quality including RAG evaluation, hallucination detection, bias assessment, safety compliance, and more.

Do I need to provide expected outputs?

No, most scorers don't require expected outputs. Our system-prompt-as-ground-truth approach evaluates agents based on their intended behavior, making evaluation much faster and easier to set up.
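
To make the system-prompt-as-ground-truth idea concrete, here is a minimal sketch of how a judge prompt could be assembled without any expected output. The template wording, the 1 to 10 scale, the JSON reply format, and the function names are all assumptions for illustration; they do not describe Noveum.ai's actual prompts or SDK, and the LLM call itself is left out.

```python
import json

# Hypothetical judge instructions; the wording is an assumption for illustration.
JUDGE_TEMPLATE = (
    "You are an impartial evaluator.\n"
    "The agent was configured with this system prompt:\n"
    "---\n{system_prompt}\n---\n"
    "User message:\n{user_message}\n\n"
    "Agent response:\n{agent_response}\n\n"
    "Rate from 1 to 10 how well the response follows the system prompt.\n"
    "Reply with a JSON object containing the keys score and reasoning."
)


def build_judge_prompt(system_prompt: str, user_message: str, agent_response: str) -> str:
    """Fill the template; the system prompt plays the role of ground truth."""
    return JUDGE_TEMPLATE.format(
        system_prompt=system_prompt,
        user_message=user_message,
        agent_response=agent_response,
    )


def parse_judge_reply(reply: str) -> dict:
    """Parse the judge's JSON reply and normalize the score to the range 0.1 to 1.0."""
    data = json.loads(reply)
    raw = max(1, min(10, int(data["score"])))  # clamp out-of-range scores
    return {"score": raw / 10, "reasoning": str(data.get("reasoning", ""))}


prompt = build_judge_prompt(
    "You are a billing assistant. Never discuss medical topics.",
    "Can you explain my invoice?",
    "Your invoice covers the March subscription.",
)

# A canned reply stands in for the judge model's output.
result = parse_judge_reply('{"score": 9, "reasoning": "Stays on billing."}')
print(result)
```

Note that no expected answer appears anywhere in the prompt: the judge compares the response only against the behavior the system prompt describes.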

Can I create custom scorers?

Yes! Noveum.ai allows you to create custom scorers tailored to your specific business needs. You can extend existing scorers or build entirely new ones using our flexible SDK.

Which scorers should I use for my use case?

It depends on your specific needs. For RAG systems, use our RAG scorers (Faithfulness, ContextualPrecision, etc.). For agent evaluation, use Agent scorers. For safety compliance, use Safety and Bias scorers. Browse our categories to find the right fit.

How do scorers impact performance?

Our scorers are optimized for production use with minimal overhead. They can be run asynchronously and in batches. Most evaluations complete in seconds, and you can control which scorers to run based on your latency requirements.
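
As a rough illustration of running scorers asynchronously and in batches, the sketch below uses Python's standard `asyncio` with a concurrency cap. The `run_scorer` coroutine is a stand-in stub, not a Noveum.ai SDK call, and `score_batch` and `max_concurrency` are hypothetical names; in practice the stub would be replaced by whatever call actually invokes a scorer.

```python
import asyncio


async def run_scorer(name: str, sample: dict) -> float:
    """Stub scorer (assumption): pretends to do I/O, then returns a dummy score."""
    await asyncio.sleep(0.01)  # simulates network latency of a judge call
    return 1.0 if sample["response"] else 0.0


async def score_batch(samples: list[dict], scorer_names: list[str],
                      max_concurrency: int = 4) -> list[dict]:
    """Run every scorer on every sample, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(index: int, name: str, sample: dict):
        async with semaphore:
            return index, name, await run_scorer(name, sample)

    tasks = [
        guarded(i, name, sample)
        for i, sample in enumerate(samples)
        for name in scorer_names
    ]
    results: list[dict] = [{} for _ in samples]
    for index, name, score in await asyncio.gather(*tasks):
        results[index][name] = score
    return results


if __name__ == "__main__":
    batch = [{"response": "Paris is the capital of France."}, {"response": ""}]
    scores = asyncio.run(score_batch(batch, ["FaithfulnessScorer", "ToxicityScorer"]))
    print(scores)
```

Limiting the list of scorer names per run is one way to trade coverage for latency; the semaphore keeps a large batch from issuing every judge call at once.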

Can I use scorers in my own code?

Yes, all scorers are available via our SDK. You can integrate them directly into your CI/CD pipeline, testing framework, or production monitoring system.

What's the difference between LLM-based and rule-based scorers?

LLM-based scorers use advanced language models to evaluate nuanced quality aspects like relevancy and faithfulness. Rule-based scorers use deterministic algorithms for objective metrics like JSON validation or exact matching. Both have their place depending on your needs.
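
For contrast with LLM-based judging, the two rule-based checks named above, JSON validation and exact matching, can be sketched in a few lines of standard-library Python. The function names and the 0.0/1.0 scoring convention are assumptions for illustration, not the Noveum.ai API.

```python
import json


def json_validity_score(output: str) -> float:
    """Return 1.0 if the output parses as JSON, otherwise 0.0."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def exact_match_score(output: str, expected: str) -> float:
    """Return 1.0 if output equals expected after trimming surrounding whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


print(json_validity_score('{"status": "ok"}'))  # valid JSON
print(json_validity_score("{status: ok}"))      # unquoted keys, invalid JSON
print(exact_match_score(" 42 ", "42"))          # matches after trimming
```

Checks like these are deterministic: the same input always produces the same score, which is why they suit objective, format-level criteria, while nuanced qualities such as relevancy are left to an LLM judge.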

Ready to evaluate your AI agents?

Start using 112 calibrated scorers for free. No credit card required.

Explore All Scorers

AgentRoleAdherenceScorer

ContextRelevancyScorer

ConversationCoherenceScorer

GoalAchievementScorer

ParameterCorrectnessScorer

TaskProgressionScorer

ToolCorrectnessScorer

ToolRelevancyScorer

ConversationFlowAdherenceScorer

ConversationRelevancyScorer

ConversationalMetricsScorer

ConversationalRoleAdherenceScorer

IntentionFulfillmentScorer

KnowledgeRetentionScorer

RepetitionFuzzyMatchScorer

RoleAdherenceScorer

AppropriateCallTerminationScorer

ConversationContextCoherenceScorer

DropOffNodeScorer

InstructionAdherenceScorer

SentimentCsatScorer

AgentWpmScorer

AiInterruptUserScorer

AssistantAveragePitchScorer

AssistantLatencyScorer

AssistantOverlapScorer

AssistantVolumeScorer

AudioBreakageScorer

GibberishScorer

MispronunciationScorer

MosScorer

PronunciationAudioScorer

SpeakingOverUserScorer

SttEfficacyScorer

SttOverSuppressionScorer

TalkRatioScorer

ToneClarityScorer

UserInterruptAgentScorer

WordAccuracyScorer

E2eLatencyScorer

EndOfTurnDelayScorer

LlmLatencyScorer

LlmTtftScorer

OnUserTurnCompletedDelayScorer

SttAudioDurationScorer

SttLatencyScorer

SttProcessingDurationScorer

TranscriptionDelayScorer

TtsAudioDurationScorer

TtsDurationScorer

TtsLatencyScorer

TtsTtfbScorer

AggregateRAGScorer

AnswerRelevancyScorer

BiasDetectionScorer

ContextualPrecisionScorerPP

ContextualPrecisionScorer

ContextualRecallScorerPP

ContextualRecallScorer

FaithfulnessScorer

RAGASScorer

RetrievalDiversityScorer

RetrievalF1Scorer

RetrievalRankingScorer

SemanticSimilarityScorer

SourceAttributionScorer

ConflictResolutionScorer

ContextCompletenessScorer

ContextConsistencyScorer

ContextFaithfulnessScorerPP