Checked: 481 (grounding-checked)

Avg grounding: 0.81 (median 0.91)

Borderline: 15.2% of answers

Hallucinations: 8.5% badly ungrounded

[Charts not captured: Score distribution; Mean score over time]

Quality by model

Model	Evals	Mean score	Critical	Status
claude-opus-4-7	260	0.800	9.6%	healthy
gpt-4o	90	0.816	10.0%	healthy
gpt-4-mini	60	0.819	6.7%	healthy
claude-sonnet-4-6	71	0.843	4.2%	healthy
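
As a rough consistency check, the per-model rows can be pooled back into the headline figures. The sketch below assumes that "Mean score" is a per-model average over its "Evals" and that "Critical" is the share of that model's evals flagged as badly ungrounded. Neither definition is stated on the page, and the variable names are illustrative only.

```python
# Rows copied from the "Quality by model" table:
# (model, evals, mean score, critical %)
rows = [
    ("claude-opus-4-7", 260, 0.800, 9.6),
    ("gpt-4o", 90, 0.816, 10.0),
    ("gpt-4-mini", 60, 0.819, 6.7),
    ("claude-sonnet-4-6", 71, 0.843, 4.2),
]

total_evals = sum(evals for _, evals, _, _ in rows)

# Eval-weighted mean of the per-model mean scores.
weighted_mean = sum(evals * mean for _, evals, mean, _ in rows) / total_evals

# Assumption: "Critical" is a percentage of each model's evals, so the
# per-model counts can be estimated by rounding evals * pct / 100.
estimated_critical = sum(round(evals * pct / 100) for _, evals, _, pct in rows)
pooled_critical_pct = 100 * estimated_critical / total_evals

print(f"total evals: {total_evals}")
print(f"weighted mean score: {weighted_mean:.2f}")
print(f"pooled critical share: {pooled_critical_pct:.1f}%")
```

Under those assumptions the pooled values come out at 481 evals, a mean of about 0.81, and a critical share of about 8.5%, which lines up with the "Checked", "Avg grounding", and "Hallucinations" cards.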

Where answers fail

Grounded: 187 · 35%

supported by retrieved context

Contradicted by source: 201 · 37%

model conflicts with its context → fix generation / prompt

Not in retrieved context: 151 · 28%

answer-bearing source wasn't retrieved → fix retrieval

Of the 352 failed claims (out of 539 claims checked), 43% are retrieval misses and 57% are the model contradicting good context.
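
The percentages in this panel use two different denominators, which a short calculation makes explicit. This is a minimal sketch using only the three claim counts shown above. The dictionary keys and the `share` helper are illustrative names, not part of the dashboard.

```python
# Claim counts copied from the "Where answers fail" panel.
claim_counts = {
    "grounded": 187,
    "contradicted_by_source": 201,
    "not_in_retrieved_context": 151,
}

total_claims = sum(claim_counts.values())
failed_claims = total_claims - claim_counts["grounded"]


def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Shares of ALL claims (the figures shown next to each category).
share_of_all = {name: share(n, total_claims) for name, n in claim_counts.items()}

# Shares of FAILED claims only (the figures in the summary sentence).
retrieval_miss_share = share(claim_counts["not_in_retrieved_context"], failed_claims)
contradiction_share = share(claim_counts["contradicted_by_source"], failed_claims)

print(total_claims, failed_claims)
print(share_of_all)
print(retrieval_miss_share, contradiction_share)
```

The 35% / 37% / 28% split is taken over all claims, while the 43% / 57% split is taken over the failed claims only.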

Worst offenders

0.01 · claude-opus-4-7 · tenant: soylent
Refund approved under policy P-204(b), $189.99 returned to card ending 4421 within 3 days…

0.04 · claude-opus-4-7 · tenant: soylent
Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…

0.04 · gpt-4o · tenant: acme
Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…

0.06 · gpt-4-mini · tenant: acme
Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…
2 contradicted, 1 unsupported of 4 claims

0.06 · gpt-4-mini · tenant: globex
Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…

0.08 · gpt-4o · tenant: globex
Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…
0 contradicted, 1 unsupported of 2 claims

Why answers go wrong