Multi-Context Scorer | LLM-as-Judge

ContextFaithfulnessScorerPP

Overview

Enhanced faithfulness detection with fine-grained, claim-by-claim analysis. The scorer performs a two-stage evaluation: it first extracts individual factual claims from the answer, then verifies each claim against the context. It is particularly effective at detecting partial hallucinations, where only some of the claims in an answer are unsupported.
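The two-stage flow can be sketched in a few lines. This is only an illustration of the control flow, not the scorer's implementation: the real scorer uses an LLM for both stages, while the sketch below substitutes a sentence splitter for claim extraction and a word-overlap check for verification. All function names here are hypothetical.

```python
import re


def extract_claims(answer: str) -> list[str]:
    """Stage 1 stand-in: treat each sentence of the answer as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def is_supported(claim: str, context: list[str]) -> bool:
    """Stage 2 stand-in: a claim counts as supported when every word longer
    than three characters appears in a single context passage.
    (Assumption: an LLM judge would make this decision in the real scorer.)"""
    words = {w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3}
    for passage in context:
        passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
        if words <= passage_words:
            return True
    return False


def verify_answer(answer: str, context: "list[str] | str") -> dict:
    """Run both stages and report per-claim results."""
    if isinstance(context, str):
        context = [context]  # accept a single passage as well as a list
    claims = extract_claims(answer)
    unsupported = [c for c in claims if not is_supported(c, context)]
    return {
        "total_claims": len(claims),
        "verified_claims": len(claims) - len(unsupported),
        "unsupported": unsupported,
    }
```

Checking claims one at a time is what makes partial hallucinations visible: an answer with one fabricated sentence among several supported ones shows up as a single unsupported claim rather than being averaged away.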

Tags: multi-context, rag, hallucination, llm-judge, trace-evaluation, faithfulness, claims

Use Cases

  • RAG-based question answering systems
  • Hallucination detection in generated content

How It Works

This scorer uses the LLM-as-Judge approach to evaluate responses. It prompts a large language model with specific evaluation criteria and the content to assess, then parses the model's judgment into a score and detailed reasoning.
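As a rough sketch of the LLM-as-Judge pattern described above, the following builds a judging prompt, sends it to a model, and parses the reply into a score and reasoning. The prompt wording, the JSON reply format, and the `call_llm` callable are all assumptions made for illustration; the scorer's actual prompt and parsing are not documented here.

```python
import json
from typing import Callable

# Hypothetical prompt; the scorer's real prompt is not shown in this page.
PROMPT_TEMPLATE = """You are grading an answer for faithfulness to the context.

Context:
{context}

Answer:
{answer}

Reply with JSON: {{"score": <0-10>, "reasoning": "<text>"}}"""


def judge_faithfulness(
    answer: str,
    context: list[str],
    call_llm: Callable[[str], str],
) -> dict:
    """Prompt a judge model and parse its reply into score and reasoning."""
    prompt = PROMPT_TEMPLATE.format(
        context="\n---\n".join(context),
        answer=answer,
    )
    raw_reply = call_llm(prompt)  # call_llm is any function: prompt -> text
    data = json.loads(raw_reply)
    # Clamp to the documented 0-10 range in case the model drifts outside it.
    score = max(0.0, min(10.0, float(data["score"])))
    return {"score": score, "reasoning": str(data.get("reasoning", ""))}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return json.dumps({"score": 8, "reasoning": "Most claims are supported."})
```

Passing the model call in as a parameter keeps the sketch independent of any particular provider and makes it easy to test with a stub such as `fake_llm`.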

Input Schema

Parameter     Type              Required  Description
output_text   str               Yes       Answer containing claims to verify
input_text    str               No        Original question
context       list[str] | str   Yes       Context for claim verification
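The input schema can be mirrored with a small container type. This is a minimal sketch, assuming the three parameters listed above; the `ScorerInput` class and its validation rules are hypothetical and not part of any documented API.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ScorerInput:
    """Hypothetical container mirroring the documented input parameters."""

    output_text: str                    # required: answer containing claims
    context: Union[list[str], str]      # required: one passage or several
    input_text: Optional[str] = None    # optional: original question

    def __post_init__(self) -> None:
        if not self.output_text:
            raise ValueError("output_text is required")
        # Normalize a single string to a one-element list so downstream
        # code only ever handles list[str].
        if isinstance(self.context, str):
            self.context = [self.context]
        if not self.context:
            raise ValueError("context is required")
```

For example, `ScorerInput(output_text="Paris is in France.", context="Paris is the capital of France.")` yields an object whose `context` is a one-element list.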

Output Schema

Field                     Type   Description
score                     float  Faithfulness score (0-10)
passed                    bool   True if faithful
reasoning                 str    Verification analysis
metadata.verified_claims  int    Number of verified claims
metadata.total_claims     int    Total claims extracted
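The two metadata counters make it easy to see how much of an answer was unsupported. The sketch below assumes a result shaped like the table above; the `ScorerResult` class and helper functions are hypothetical. Note that the page does not state how the 0-10 score is derived from the claim counts, so the fraction computed here is a separate diagnostic and not a reconstruction of the score.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScorerResult:
    """Hypothetical container mirroring the documented output fields."""

    score: float
    passed: bool
    reasoning: str
    metadata: dict = field(default_factory=dict)


def unverified_claims(result: ScorerResult) -> int:
    """Number of extracted claims that could not be verified."""
    return result.metadata["total_claims"] - result.metadata["verified_claims"]


def verified_fraction(result: ScorerResult) -> Optional[float]:
    """Share of claims that were verified, or None if no claims were found."""
    total = result.metadata["total_claims"]
    if total == 0:
        return None
    return result.metadata["verified_claims"] / total
```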

Score Interpretation

Default threshold: 7/10

Score  Rating     Meaning
9-10   Excellent  Response fully meets all evaluation criteria
7-8    Good       Response meets most criteria with minor issues
5-6    Fair       Response partially meets criteria, needs improvement
3-4    Poor       Response has significant issues
0-2    Failing    Response fails to meet basic criteria
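The bands and threshold above can be applied to a score as follows. Two assumptions are made here because the page does not specify them: fractional scores fall into the band whose lower bound they reach (so 8.5 is rated Good), and a score equal to the threshold passes. The function names are hypothetical.

```python
# Lower bound of each band, highest first, taken from the table above.
BANDS = [
    (9.0, "Excellent"),
    (7.0, "Good"),
    (5.0, "Fair"),
    (3.0, "Poor"),
    (0.0, "Failing"),
]

DEFAULT_THRESHOLD = 7.0


def rating_for(score: float) -> str:
    """Map a 0-10 score to its rating label."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "Failing"  # not reached for valid scores; kept for completeness


def passes(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Assumption: the comparison is inclusive, so score == threshold passes."""
    return score >= threshold
```

With the default threshold, the pass boundary lines up with the bottom of the Good band; lowering the threshold to 5, for instance, would also let Fair responses pass.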

Frequently Asked Questions

When should I use this scorer?

Use ContextFaithfulnessScorerPP when you need to check whether the answers from a multi-context or RAG pipeline are supported by the supplied context. It's particularly useful for RAG-based question answering systems and for detecting hallucinations in generated content.

Why doesn't this scorer need expected output?

This scorer evaluates faithfulness to the supplied context, which doesn't require comparison against a reference answer. The context serves as the implicit ground truth against which each claim is verified.

Can I customize the threshold?

Yes, the default threshold of 7 can be customized when configuring the scorer.

Quick Info

Category: Multi-Context
Evaluation Type: LLM-as-Judge
Requires Expected Output: No
Default Threshold: 7/10
