⚖️ Legal Citation Benchmark
Evaluating LLM Accuracy in Legal Citation & Case Retrieval
🏛️ Why This Benchmark?
1. Focus on Case Law (Common Law)
Existing benchmarks often focus on statutory law (codes & regulations). However, the U.S. legal system relies heavily on Case Law and stare decisis (precedent). This benchmark specifically targets the ability to reason with and cite judicial opinions, which is the core of legal argumentation.
2. Combating Hallucination
For lawyers, an LLM that "hallucinates" (invents) cases is dangerous. Accuracy is non-negotiable. Our goal is to evaluate whether models can accurately retrieve and cite real-world cases without fabrication. We simulate the lawyer's workflow: searching for and citing binding precedent.
3. Data Source: Real-World Ground Truth
Derived from the Case Law Access Project, our dataset comprises 1,000 cases from the last 10 years, randomly selected and geographically distributed across the United States. We use the actual citations made by judges in these cases as the "Gold Standard" for evaluation.
📊 Dataset
- Questions: 24,845
- Categories: 5
- Root Cases: 1,000
🎯 Tasks
- Citation Retrieval (Cat1)
- Citation Completion (Cat2)
- Citation Error Detection (Cat3)
- Case Matching (Cat4-1)
- Case Verification & Correction (Cat4-2)
📚 Resources
- 📦 Datasets (5 categories)
- 📄 Paper: Coming soon
- 💻 Code: Coming soon
🏆 Overall Rankings
Status: Demo Version (v0.1)
Scoring Methodology
Category scores are computed using an LLM judge (GPT-4o-mini) and scaled to 0–100:
- Cat1 & Cat2 (Citation Retrieval/Completion): F1-based scoring. The judge extracts all citations from the model output and checks each against the ground truth using substring matching. Precision, Recall, and F1 are computed at the citation level, then scaled to 0–100. A model that outputs many hallucinated citations is penalized via low Precision; a model that misses many ground truth citations is penalized via low Recall.
- Cat3 (Citation Error Detection): Detection + correction accuracy. Score 100 = correctly identified error and provided the right fix; Score 40 = detected error but wrong fix; Score 20 = missed the error entirely.
- Cat4-1 (Case Matching): Identity match between model output and ground truth case. Score 100 = exact match; Score 60–80 = minor variation; Score 40 = related but wrong case.
- Cat4-2 (Case Verification): Binary verification task. Score 100 = correct judgment + correct case; Score 40 = correct rejection but wrong/missing correction.
- Overall (Norm.): Normalized average. Each category score is divided by the current best score in that category (0–1), then averaged equally across all 5 categories. Updates automatically as new models are added.
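The Cat1/Cat2 scoring and the normalized overall average can be sketched programmatically. Note that the benchmark itself uses an LLM judge (GPT-4o-mini) to extract citations from free-form output; the functions below are a simplified, deterministic sketch of the scoring logic described above, with hypothetical names.

```python
def citation_f1(predicted, gold):
    """F1-based Cat1/Cat2 score, scaled to 0-100.

    A predicted citation counts as a hit if it matches a ground-truth
    citation by substring in either direction.
    """
    if not predicted:
        return 0.0
    matched_pred = sum(any(p in g or g in p for g in gold) for p in predicted)
    matched_gold = sum(any(p in g or g in p for p in predicted) for g in gold)
    precision = matched_pred / len(predicted)
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


def normalized_overall(model_scores, leaderboard):
    """Divide each category score by the current best in that category,
    then average equally across the five categories (0-1 scale)."""
    cats = ["cat1", "cat2", "cat3", "cat4-1", "cat4-2"]
    ratios = []
    for cat in cats:
        best = max(entry[cat] for entry in leaderboard)
        ratios.append(model_scores[cat] / best if best else 0.0)
    return sum(ratios) / len(cats)
```

For example, a model that cites one of two ground-truth cases correctly and nothing else gets Precision 1.0, Recall 0.5, and an F1 score of about 66.7.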
Hallucination Rate measures the proportion of responses where the model answered but provided largely incorrect citations (score ≤ 40/100). It is computed only for Cat1, Cat2, and Cat4-1, the three retrieval-heavy tasks where fabricated citations are most consequential. A lower hallucination rate indicates the model is more reliable when it chooses to answer. Note that this metric does not penalize abstention: a model that declines to answer is not counted as hallucinating.
Lower is better for Hallucination Rate. Higher is better for all other scores.
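Under the definition above, the hallucination rate can be sketched as follows. This is an illustrative helper, not the benchmark's actual code; `None` is an assumed stand-in for an abstention and is excluded from the denominator.

```python
def hallucination_rate(scores, threshold=40.0):
    """Share of answered responses scoring <= threshold (0-100 scale).

    `scores` holds one per-question score from Cat1, Cat2, or Cat4-1;
    None marks an abstention, which is never counted as hallucinating.
    """
    answered = [s for s in scores if s is not None]
    if not answered:
        return 0.0
    return 100.0 * sum(s <= threshold for s in answered) / len(answered)
```

For example, with scores `[100, 40, 20, None]`, two of the three answered questions fall at or below 40, giving a rate of about 66.7%.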
📖 Task Descriptions
🧭 Task Taxonomy: Citation-level vs. Case-level Reasoning
Categories 1–3 are citation-level tasks: one judicial opinion contains many internal citations, and the model must retrieve, complete, or verify those citation items. Categories 4-1 and 4-2 are case-level tasks: the object is the judicial decision itself. The model must match or verify whether the correct underlying case is being identified or cited, rather than validating citation strings (volume/page/reporter).
📝 Submit Your Model Results
We welcome submissions from the research community. Submissions are currently manual; an automated evaluation pipeline is coming soon.
Result JSON format
{
  "model": "Your Model Name",
  "cat1": 88.5,
  "cat2": 84.3,
  "cat3": 82.1,
  "cat4-1": 86.7,
  "cat4-2": 84.5,
  "hallucination_rate": 12.3,
  "notes": "Optional: model size, context length, decoding, etc."
}
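A submission file in this format can be sanity-checked before sending. `validate_submission` is a hypothetical helper written against the field names shown above, not part of the benchmark's tooling.

```python
import json

# Score fields expected in a submission, per the format above.
REQUIRED_SCORES = {"cat1", "cat2", "cat3", "cat4-1", "cat4-2", "hallucination_rate"}


def validate_submission(raw: str) -> dict:
    """Parse a result JSON and check required fields and score ranges."""
    sub = json.loads(raw)
    missing = ({"model"} | REQUIRED_SCORES) - sub.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for key in REQUIRED_SCORES:
        value = sub[key]
        if not isinstance(value, (int, float)) or not 0 <= value <= 100:
            raise ValueError(f"{key} must be a number in [0, 100]")
    return sub
```

The `notes` field is optional and intentionally left unchecked.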
Contact: Submit via email or create an issue on our GitHub repository (links coming soon).
Last Updated: Feb-25-2026
⚠️ Demo version. Full benchmark + baseline results coming soon.