⚖️ Legal Citation Benchmark
Evaluating LLM Accuracy in Legal Citation & Case Retrieval
🏛️ Why This Benchmark?
1. Focus on Case Law (Common Law)
Existing benchmarks often focus on statutory law (codes & regulations). However, the U.S. legal system relies heavily on Case Law and stare decisis (precedent). This benchmark specifically targets the ability to reason with and cite judicial opinions, which is the core of legal argumentation.
2. Combating Hallucination
For lawyers, an LLM that "hallucinates" (invents) cases is dangerous. Accuracy is non-negotiable. Our goal is to evaluate whether models can accurately retrieve and cite real-world cases without fabrication. We simulate the lawyer's workflow: searching for and citing binding precedent.
3. Data Source: Real-World Ground Truth
Derived from the Case Law Access Project, our dataset comprises 1,000 cases from the last 10 years, randomly selected and geographically distributed across the United States. We use the actual citations made by judges in these cases as the "Gold Standard" for evaluation.
📊 Dataset
- Questions: 24,845
- Categories: 5
- Root Cases: 1,000
🎯 Tasks
- Citation Retrieval (Cat1)
- Citation Completion (Cat2)
- Citation Error Detection (Cat3)
- Case Matching (Cat4-1)
- Case Verification & Correction (Cat4-2)
📚 Resources
- 📦 Datasets (5 categories)
- 📄 Paper: Coming soon
- 💻 Code: Coming soon
🏆 Overall Rankings
Status: Demo Version (v0.1)
Scoring Methodology
Category scores are computed using an LLM judge (GPT-4o-mini) and scaled to 0–100:
- Cat1 & Cat2 (Citation Retrieval/Completion): F1-based scoring. The judge extracts all citations from the model output and checks each against the ground truth using substring matching. Precision, Recall, and F1 are computed at the citation level, then scaled to 0–100. A model that outputs many hallucinated citations is penalized via low Precision; a model that misses many ground truth citations is penalized via low Recall.
- Cat3 (Citation Error Detection): Detection + correction accuracy. Score 100 = correctly identified error and provided the right fix; Score 40 = detected error but wrong fix; Score 20 = missed the error entirely.
- Cat4-1 (Case Matching): Identity match between model output and ground truth case. Score 100 = exact match; Score 60–80 = minor variation; Score 40 = related but wrong case.
- Cat4-2 (Case Verification): Binary verification task. Score 100 = correct judgment + correct case; Score 40 = correct rejection but wrong/missing correction.
- Overall (Norm.): Normalized average. Each category score is divided by the current best score in that category (0–1), then averaged equally across all 5 categories. Updates automatically as new models are added.
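The Cat1/Cat2 scoring and the normalized overall average can be sketched programmatically. Note that the benchmark itself uses an LLM judge (GPT-4o-mini) to extract citations from free-form output; the functions below are a simplified, deterministic sketch of the scoring logic described above, with hypothetical names.

```python
def citation_f1(predicted, gold):
    """F1-based Cat1/Cat2 score, scaled to 0-100.

    A predicted citation counts as a hit if it matches a ground-truth
    citation by substring in either direction.
    """
    if not predicted:
        return 0.0
    matched_pred = sum(any(p in g or g in p for g in gold) for p in predicted)
    matched_gold = sum(any(p in g or g in p for p in predicted) for g in gold)
    precision = matched_pred / len(predicted)
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


def normalized_overall(model_scores, leaderboard):
    """Divide each category score by the current best in that category,
    then average equally across the five categories (0-1 scale)."""
    cats = ["cat1", "cat2", "cat3", "cat4-1", "cat4-2"]
    ratios = []
    for cat in cats:
        best = max(entry[cat] for entry in leaderboard)
        ratios.append(model_scores[cat] / best if best else 0.0)
    return sum(ratios) / len(cats)
```

For example, a model that cites one of two ground-truth cases correctly and nothing else gets Precision 1.0, Recall 0.5, and an F1 score of about 66.7.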
Hallucination Rate measures the proportion of responses where the model answered but provided largely incorrect citations (score ≤ 40/100). It is computed only for Cat1, Cat2, and Cat4-1, the three retrieval-heavy tasks where fabricated citations are most consequential. A lower hallucination rate indicates the model is more reliable when it chooses to answer. Note that this metric does not penalize abstention: a model that declines to answer is not counted as hallucinating.
Lower is better for Hallucination Rate. Higher is better for all other scores.
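Under the definition above, the hallucination rate can be sketched as follows. This is an illustrative helper, not the benchmark's actual code; `None` is an assumed stand-in for an abstention and is excluded from the denominator.

```python
def hallucination_rate(scores, threshold=40.0):
    """Share of answered responses scoring <= threshold (0-100 scale).

    `scores` holds one per-question score from Cat1, Cat2, or Cat4-1;
    None marks an abstention, which is never counted as hallucinating.
    """
    answered = [s for s in scores if s is not None]
    if not answered:
        return 0.0
    return 100.0 * sum(s <= threshold for s in answered) / len(answered)
```

For example, with scores `[100, 40, 20, None]`, two of the three answered questions fall at or below 40, giving a rate of about 66.7%.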
📖 Task Descriptions
🧭 Task Taxonomy: Citation-level vs. Case-level Reasoning
Categories 1–3 are citation-level tasks: one judicial opinion contains many internal citations, and the model must retrieve, complete, or verify those citation items. Categories 4-1 and 4-2 are case-level tasks: the object is the judicial decision itself. The model must match or verify whether the correct underlying case is being identified or cited, rather than validating citation strings (volume/page/reporter).
📝 Submit Your Model Results
We welcome submissions from the research community. Submissions are currently manual; an automated evaluation pipeline is coming soon.
Result JSON format
{
  "model": "Your Model Name",
  "cat1": 88.5,
  "cat2": 84.3,
  "cat3": 82.1,
  "cat4-1": 86.7,
  "cat4-2": 84.5,
  "hallucination_rate": 12.3,
  "notes": "Optional: model size, context length, decoding, etc."
}
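A submission file in this format can be sanity-checked before sending. `validate_submission` is a hypothetical helper written against the field names shown above, not part of the benchmark's tooling.

```python
import json

# Score fields expected in a submission, per the format above.
REQUIRED_SCORES = {"cat1", "cat2", "cat3", "cat4-1", "cat4-2", "hallucination_rate"}


def validate_submission(raw: str) -> dict:
    """Parse a result JSON and check required fields and score ranges."""
    sub = json.loads(raw)
    missing = ({"model"} | REQUIRED_SCORES) - sub.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for key in REQUIRED_SCORES:
        value = sub[key]
        if not isinstance(value, (int, float)) or not 0 <= value <= 100:
            raise ValueError(f"{key} must be a number in [0, 100]")
    return sub
```

The `notes` field is optional and intentionally left unchecked.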
Contact: Submit via email or create an issue on our GitHub repository (links coming soon).
Last Updated: Feb-25-2026
⚠️ Demo version. Full benchmark + baseline results coming soon.