Score distribution
Lower scores = more hallucinated content. Bars colored by threshold zone.
Mean score over time
Hourly mean for the last 24h. Red dots flag hours with at least one critical score.
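A minimal sketch of how an hourly series like this could be derived from raw evaluations. The dashboard does not state its critical cutoff or its data model, so `CRITICAL_THRESHOLD`, `hourly_summary`, and the sample records below are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical cutoff: the dashboard does not say what counts as "critical".
CRITICAL_THRESHOLD = 0.2


def hourly_summary(evals, critical_threshold=CRITICAL_THRESHOLD):
    """Group (timestamp, score) pairs by hour.

    Returns {hour: {"mean": float, "has_critical": bool}}, ordered by hour.
    """
    buckets = defaultdict(list)
    for ts, score in evals:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(score)
    return {
        hour: {
            "mean": sum(scores) / len(scores),
            # One score below the cutoff is enough to flag the hour.
            "has_critical": any(s < critical_threshold for s in scores),
        }
        for hour, scores in sorted(buckets.items())
    }


# Illustrative records only, not taken from the dashboard.
sample = [
    (datetime(2025, 1, 1, 9, 5), 0.90),
    (datetime(2025, 1, 1, 9, 40), 0.10),
    (datetime(2025, 1, 1, 10, 15), 0.80),
]
summary = hourly_summary(sample)
```

Under these assumptions the 09:00 bucket would be drawn with a red dot, because one of its two scores falls below the cutoff, while the 10:00 bucket would not.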
Quality by model
Routes for hard queries: pick from the top of this list.
| Model | Evals | Mean score | Critical | Status |
|---|---|---|---|---|
| claude-opus-4-7 | 260 | 0.800 | 9.6% | healthy |
| gpt-4o | 90 | 0.816 | 10.0% | healthy |
| gpt-4-mini | 60 | 0.819 | 6.7% | healthy |
| claude-sonnet-4-6 | 71 | 0.843 | 4.2% | healthy |
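The per-model rows can be rolled up into overall figures by weighting each mean by its evaluation count. The sketch below uses only the numbers in the table. The critical counts are back-computed from rounded percentages, so treat them as approximate.

```python
# Rows copied from the table: (model, evals, mean score, critical %).
rows = [
    ("claude-opus-4-7", 260, 0.800, 9.6),
    ("gpt-4o", 90, 0.816, 10.0),
    ("gpt-4-mini", 60, 0.819, 6.7),
    ("claude-sonnet-4-6", 71, 0.843, 4.2),
]

total_evals = sum(evals for _, evals, _, _ in rows)

# Weight each model's mean by how many evaluations produced it.
weighted_mean = sum(evals * mean for _, evals, mean, _ in rows) / total_evals

# Approximate critical counts, recovered from the rounded percentages.
critical_counts = {model: round(evals * pct / 100) for model, evals, _, pct in rows}
overall_critical_pct = 100 * sum(critical_counts.values()) / total_evals

print(f"{total_evals} evals, weighted mean {weighted_mean:.3f}, "
      f"~{overall_critical_pct:.1f}% critical")
```

With the table as shown, this gives 481 evaluations, a weighted mean of about 0.812, and roughly 8.5% critical (about 41 evaluations).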
Claim verdicts
Stacked across all detailed evaluations
Computed over 539 factual claims across detailed evaluations.
Worst offenders
Lowest-scoring LLM spans in the last 24 hours. Each links to its full trace.
- 0.01 · claude-opus-4-7 · tenant: soylent · View trace →
  Refund approved under policy P-204(b), $189.99 returned to card ending 4421 within 3 days…
- 0.04 · claude-opus-4-7 · tenant: soylent · View trace →
  Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…
- 0.04 · gpt-4o · tenant: acme · View trace →
  Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…
- 0.06 · gpt-4-mini · tenant: acme · View trace →
  Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…
  2 contradicted, 1 unsupported of 4 claims
- 0.06 · gpt-4-mini · tenant: globex · View trace →
  Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…
- 0.08 · gpt-4o · tenant: globex · View trace →
  Q3 revenue hit $4.2B, a 38% YoY jump, with operating margin expanding to 27%…
  0 contradicted, 1 unsupported of 2 claims