Agents are black boxes.
Peekr makes them transparent.
A profiler for every layer of your agent. Trace every LLM call and tool invocation, score every output for groundedness, and surface where time and money go, all from one library.
The SDK is MIT. The cloud is hosted.
Five problems Peekr makes obvious
Every team running LLMs in production has the same five complaints.
Peekr is one library that answers each, with the matching view in the dashboard. You don't buy Peekr for one of these; you buy it because all five show up the week you ship.
The complaint
“My agent gave the wrong answer.”
Peekr's fix · See exactly what the LLM received, not what you think you sent.
Malformed context is the silent killer. Peekr captures every message, every tool result, every retrieved chunk, exactly as the model saw them, so you can find the mismatch in seconds.
agent.run  2100ms
├─ tool.fetch_user  12ms
│    in:  {"user_id": 42}
│    out: null   ← returned null
└─ openai.chat [gpt-4o]  2088ms · 4821tok
     in: [{"role": "system", "content": "User profile: null..."}]
         ^ LLM received garbage
The complaint
“My agent is hallucinating.”
Peekr's fix · Score every output for groundedness. Catch the regression before customers do.
LLM-as-judge plus RAGAS-style claim decomposition. Every model response is scored 0 to 1 against its context, with per-claim verdicts (supported / contradicted / unsupported) you can query in SQL.
#1  ⬤ 0.00  gpt-4o-mini · acme · /api/qa
Q: When was the Eiffel Tower built and by whom?
SOURCE CONTEXT: The Eiffel Tower was completed in 1889 for the Paris World's Fair…
MODEL ANSWER:   Built in 1923 by Frank Lloyd Wright for the London Olympics.
contradicted  "1923"
contradicted  "Frank Lloyd Wright"
unsupported   "London Olympics"
eval_scores: {Hallucination: 0.0, Rubric: 0.5}
Peekr's 24h on demo: 481 evaluated · mean 0.81 · 41 critical · 73 warning
The complaint
“My agent is too slow.”
Peekr's fix · The trace shows exactly where time went. The LLM is rarely the bottleneck.
Most developers assume the model is slow and start swapping. Peekr's latency view splits trace time across LLM, tool, and your own code, so you optimise the right thing.
agent.run  4300ms
├─ tool.search_web  3800ms   ← 88% of total time. Cache this.
├─ tool.rerank  18ms
└─ openai.chat  490ms   ← not your problem
Peekr's 24h on demo: LLM 85% of in-trace time · slowest model: claude-opus-4-7 · trace p95 6.2s
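To make the split concrete, here is a minimal sketch of the arithmetic behind a view like this: each child span's duration as a share of the root span. The `time_share` helper and the span dictionary are illustrative assumptions, not Peekr's API; the durations are the ones from the example trace above.

```python
def time_share(total_ms: float, children: dict[str, float]) -> dict[str, int]:
    """Return each child span's duration as a whole-number percentage of the root span."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    return {name: round(100 * ms / total_ms) for name, ms in children.items()}


# Durations copied from the example trace (agent.run took 4300 ms).
children = {"tool.search_web": 3800, "tool.rerank": 18, "openai.chat": 490}
shares = time_share(4300, children)

# List the largest share first so the real bottleneck is at the top.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {pct:>3}%")
```

Sorting by share rather than by call order is a deliberate choice here: the span worth caching or parallelising is the one that dominates the total, wherever it sits in the tree.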
The complaint
“My API bill is too high.”
Peekr's fix · Token counts across traces reveal patterns invisible in code.
Cost attribution by feature, by user, by model. Plus a recommendation engine: model swaps, prompt-caching opportunities, output caps, fine-tune ROI, each backed by the spans behind it.
Trace 1: 18,432 tokens · $0.018
Trace 2: 21,104 tokens · $0.021
Trace 3: 24,891 tokens · $0.025   ← unbounded growth
Top fix: route short support_bot queries to gpt-4o-mini
evidence: 110 of 119 support_bot calls in 24h were under 600 input tokens on claude-opus-4-7.
saves $74/mo (99% on feature)
Peekr found $228/mo in savings (42% of $541/mo) · 4 recommendations
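Cost attribution is, at its core, a group-by over span costs. The sketch below shows that idea in plain Python. The `cost_by` helper, the field names, and the sample records are assumptions made for illustration; only the $228 and $541 figures come from the demo numbers above.

```python
from collections import defaultdict


def cost_by(spans, key):
    """Sum span cost in USD, grouped by one attribute (feature, user, or model)."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span["cost_usd"]
    return dict(totals)


# Hypothetical span records; a real store would hold many more fields.
spans = [
    {"feature": "support_bot", "model": "claude-opus-4-7", "cost_usd": 0.018},
    {"feature": "support_bot", "model": "claude-opus-4-7", "cost_usd": 0.021},
    {"feature": "search", "model": "gpt-4o-mini", "cost_usd": 0.002},
]

by_feature = cost_by(spans, "feature")
by_model = cost_by(spans, "model")

# Savings as a share of total spend, using the demo figures quoted above.
savings_pct = round(100 * 228 / 541)
print(by_feature, by_model, f"{savings_pct}%")
```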
The complaint
“It works locally but fails in prod.”
Peekr's fix · The bug is in your data pipeline, not your agent. The trace proves it.
Peekr captures what your tools actually returned, not what you assume. Compare local and prod traces side by side, and the diff jumps out: a retrieval miss, a flag flipped, a slow upstream that broke timing.
# local
tool.fetch_inventory  8ms
  out: [{"id":1,"qty":42}]
# prod
tool.fetch_inventory  8ms
  out: []   ← empty. Data pipeline bug.
# local
openai.chat [gpt-4o]  843ms · 1820tok
# prod
openai.chat [gpt-4o]  612ms · 480tok   ← model had no context to work with
Same agent. Different upstream. The trace makes it obvious.
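A side-by-side comparison like this can be approximated with a few lines of Python once both traces are loaded as lists of spans. This is a rough sketch: the `diff_tool_outputs` helper and the `name` / `out` field names are hypothetical, chosen to mirror the example above.

```python
def diff_tool_outputs(local, prod):
    """Return {span_name: (local_out, prod_out)} for spans whose outputs differ."""
    prod_by_name = {span["name"]: span for span in prod}
    diffs = {}
    for span in local:
        other = prod_by_name.get(span["name"])
        # Only report spans present in both traces with different outputs.
        if other is not None and span["out"] != other["out"]:
            diffs[span["name"]] = (span["out"], other["out"])
    return diffs


local_trace = [{"name": "tool.fetch_inventory", "out": [{"id": 1, "qty": 42}]}]
prod_trace = [{"name": "tool.fetch_inventory", "out": []}]

diffs = diff_tool_outputs(local_trace, prod_trace)
for name, (local_out, prod_out) in diffs.items():
    print(f"{name}: local={local_out!r} prod={prod_out!r}")
```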
Three primitives
Trace. Score. Slice.
Everything else is consolidation we run for you. The same three concepts power the OSS library and Peekr Cloud; you just don't pay the LLM bill for evaluators yourself.
Trace
Auto-patch your LLM client. Every call, tool, and agent step becomes a span.
- OpenAI / Anthropic / Gemini / Bedrock auto-instrumented
- LangChain / LlamaIndex / CrewAI agent steps captured
- Stream chunks rolled up into one parent span
- OTel exporter alongside HTTPExporter, so you can ship to Datadog too
Score
LLM-as-judge faithfulness. NLI tier for the cheap pass, judge tier for the verdict.
- Hallucination evaluator (RAGAS-style claim breakdown)
- Citation evaluator (every claim → source span)
- Rubric evaluator (your own 0 to 1 scoring prompt)
- Failures land in /quality with the offending output
Slice
tenant_id and retention_class are top-level columns. Filter and TTL without JSON gymnastics.
- Indexed (project_id, tenant_id) for sub-100ms tenant filtering
- retention_class per span: short / default / long / pii
- Per-tenant dashboards out of the box
- B2B agents subdivide every metric by their own customers
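For readers who want to picture what "top-level columns" buys you, here is a small SQLite sketch. The table layout, the index name, and the TTL values per retention class are all assumptions for illustration; the source only states that `tenant_id` and `retention_class` are columns and that (project_id, tenant_id) is indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spans (
    project_id      TEXT NOT NULL,
    span_id         TEXT NOT NULL,
    tenant_id       TEXT NOT NULL,
    retention_class TEXT NOT NULL DEFAULT 'default'
        CHECK (retention_class IN ('short', 'default', 'long', 'pii')),
    start_time      INTEGER NOT NULL,  -- Unix seconds
    attributes      TEXT,              -- JSON payload
    PRIMARY KEY (project_id, span_id)
);
CREATE INDEX idx_spans_project_tenant ON spans (project_id, tenant_id);
""")

# Assumed TTLs in days; the real values are not given in the source.
TTL_DAYS = {"short": 7, "default": 30, "long": 365}


def expire(conn, now):
    """Delete spans older than their class TTL; return how many rows were removed."""
    deleted = 0
    for cls, days in TTL_DAYS.items():
        cur = conn.execute(
            "DELETE FROM spans WHERE retention_class = ? AND start_time < ?",
            (cls, now - days * 86400),
        )
        deleted += cur.rowcount
    return deleted


now = 1_700_000_000
ten_days_ago = now - 10 * 86400
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("p1", "s1", "acme", "short", ten_days_ago, "{}"),
        ("p1", "s2", "acme", "long", ten_days_ago, "{}"),
        ("p1", "s3", "globex", "default", now - 86400, "{}"),
    ],
)
removed = expire(conn, now)

# Tenant filtering is a plain column predicate, no JSON extraction needed.
acme_spans = [
    row[0]
    for row in conn.execute(
        "SELECT span_id FROM spans WHERE project_id = ? AND tenant_id = ?",
        ("p1", "acme"),
    )
]
```

Because both fields are ordinary columns, filtering and expiry are simple predicates that the index can serve, which is the point the "no JSON gymnastics" line is making.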
Hallucination detection ยท built in
Catch the contradicted sentence before customers do.
Peekr's evaluator decomposes every model output into atomic factual claims, then assigns one of three verdicts using the RAGAS faithfulness method. You see the score, the breakdown, and the exact claim that failed.
- ✓ Supported: directly entailed by the context. The model didn't make it up.
- ✗ Contradicted: directly conflicts with the context. Hard fail.
- ? Unsupported: context is silent. Maybe correct, maybe invented. Soft warn.
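The three verdicts above roll up naturally into a score and a severity. The sketch below assumes the RAGAS-style ratio (supported claims divided by total claims) and a simple fail/warn/ok mapping. Both helpers are illustrative assumptions about how such a roll-up could work, not Peekr's documented scoring.

```python
def faithfulness(verdicts):
    """Share of claims marked 'supported'; None when the output had no claims."""
    if not verdicts:
        return None
    return sum(v == "supported" for v in verdicts) / len(verdicts)


def severity(verdicts):
    """Map per-claim verdicts to one label: any contradiction is a hard fail."""
    if "contradicted" in verdicts:
        return "fail"
    if "unsupported" in verdicts:
        return "warn"
    return "ok"


# Verdicts from the Eiffel Tower example: two contradicted claims, one unsupported.
eiffel = ["contradicted", "contradicted", "unsupported"]
print(faithfulness(eiffel), severity(eiffel))

mixed = ["supported", "supported", "unsupported", "supported"]
print(faithfulness(mixed), severity(mixed))
```

Keeping severity separate from the numeric score means a single contradicted claim can still raise a hard fail even when most of the answer is well supported.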
Hallucination · Acme Agents · 24h
481 evaluations · mean 0.812
vs. the alternatives
What you get vs. what else is out there.
| Capability | Peekr | Raw OpenTelemetry | Langfuse / Arize | Build it yourself |
|---|---|---|---|---|
| Auto-patch OpenAI/Anthropic/Gemini/Bedrock | partial | |||
| LangChain / LlamaIndex / CrewAI step capture | ||||
| Hallucination evaluator (LLM-judge) | addon | diy | ||
| RAGAS-style claim verdict breakdown | diy | |||
| Citation / Rubric / NotEmpty evaluators | partial | diy | ||
| tenant_id as a first-class indexed column | diy | |||
| retention_class per span (PII / long / short) | diy | |||
| OTel exporter so you can ship to Datadog too | - | partial | diy | |
| MIT SDK · dep-free · self-hostable | - | |||
| Hosted dashboard out of the box | ||||
yes = in the box · partial = on some providers · addon = paid add-on · diy = you write it
Everything you need. Nothing you don't.
One library. Sixteen capabilities. No backend required for any of them.
From basic tracing to evals, experiments, and a data flywheel, Peekr ships every primitive listed below. The Cloud is opt-in: when you outgrow a single-process file, the same wire format ships to the hosted backend.
Zero config
One call patches OpenAI, Anthropic, Gemini, and Bedrock. No wrappers, no env vars, no accounts.
Automatic nesting
Spans link to parents via Python's contextvars, so tool calls nest under the LLM that triggered them.
@trace decorator
Wrap any sync or async function. Captures inputs, outputs, latency, and errors.
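To show what a decorator of this shape has to handle, here is a stand-in called `traced` that covers both sync and async functions. It is a sketch under stated assumptions: the name, the `captured` list, and the record fields are invented for illustration and are not Peekr's `@trace` implementation.

```python
import asyncio
import functools
import inspect
import time

captured = []  # one record per call, in call-completion order


def traced(func):
    """Record inputs, output, error, and latency for a sync or async function."""

    def record(args, kwargs, start, output=None, error=None):
        captured.append({
            "name": func.__qualname__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "error": repr(error) if error is not None else None,
            "latency_ms": (time.perf_counter() - start) * 1000,
        })

    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await func(*args, **kwargs)
            except Exception as exc:
                record(args, kwargs, start, error=exc)
                raise  # never swallow the caller's exception
            record(args, kwargs, start, output=result)
            return result
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            record(args, kwargs, start, error=exc)
            raise
        record(args, kwargs, start, output=result)
        return result
    return sync_wrapper


@traced
def add(a, b):
    return a + b


@traced
async def fetch(user_id):
    await asyncio.sleep(0)
    return {"user_id": user_id}


@traced
def boom():
    raise ValueError("bad input")
```

Calling `add(2, 3)` or `asyncio.run(fetch(42))` appends one record each; `boom()` still raises, but the error is captured first.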
Session tracing
Stitch multi-turn agent runs into one trace tree with peekr.session(...).
LLM-as-judge eval
Score every response automatically with Hallucination, Citation, Rubric, NotEmpty.
Hallucination detection
Faithfulness score 0 to 1 per span. Queryable in SQL. Plug in your retrieved docs for RAG.
RAGAS claim decomposition
Each output broken into atomic claims with per-claim supported / contradicted / unsupported verdicts.
Observability dashboard
Health hero, channel × time heatmap, AI-generated recommendations, per-call action items.
Multi-tenant schema
tenant_id + retention_class as top-level columns. Indexed at the storage layer.
Alerts
Slack / webhook sinks fire when error rate, latency, token spend, or cost spikes cross your threshold.
A/B experiments
Tag prompt variants. Peekr compares cost-per-success across variants automatically.
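Cost-per-success is simple arithmetic: total spend on a variant divided by the number of runs that succeeded. The sketch below assumes each run carries a `variant` tag, a `cost_usd`, and a boolean `success`; those field names and the sample runs are hypothetical.

```python
def cost_per_success(runs):
    """Map variant -> total cost / successful runs (inf when nothing succeeded)."""
    totals = {}
    for run in runs:
        cost, wins = totals.get(run["variant"], (0.0, 0))
        totals[run["variant"]] = (cost + run["cost_usd"], wins + int(run["success"]))
    return {
        variant: (cost / wins if wins else float("inf"))
        for variant, (cost, wins) in totals.items()
    }


runs = [
    {"variant": "A", "cost_usd": 0.02, "success": True},
    {"variant": "A", "cost_usd": 0.02, "success": False},
    {"variant": "B", "cost_usd": 0.03, "success": True},
    {"variant": "B", "cost_usd": 0.03, "success": True},
]

scores = cost_per_success(runs)
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

In this made-up data, variant B costs more per call but less per successful outcome, which is the comparison the metric is designed to surface.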
Feedback + export
Rate traces good/bad. Export labeled data as OpenAI fine-tuning JSONL in one command.
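For a sense of what the export produces, the sketch below writes traces rated "good" as one JSON object per line in OpenAI's chat fine-tuning shape (a `messages` list ending with the assistant reply). The `export_finetune_jsonl` helper and the trace fields (`rating`, `messages`, `output`) are assumptions for illustration; Peekr's actual command and schema are not shown in the source.

```python
import io
import json


def export_finetune_jsonl(traces, out):
    """Write traces rated 'good' as chat-format JSONL; return the number of lines."""
    written = 0
    for trace in traces:
        if trace.get("rating") != "good":
            continue  # skip bad or unrated traces
        record = {
            "messages": trace["messages"]
            + [{"role": "assistant", "content": trace["output"]}]
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written


traces = [
    {
        "rating": "good",
        "messages": [{"role": "user", "content": "When was the Eiffel Tower completed?"}],
        "output": "It was completed in 1889.",
    },
    {
        "rating": "bad",
        "messages": [{"role": "user", "content": "Who built it?"}],
        "output": "Frank Lloyd Wright.",
    },
]

buffer = io.StringIO()
count = export_finetune_jsonl(traces, buffer)
print(buffer.getvalue(), end="")
```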
Trace replay
Re-run any stored trace with the same inputs. Debug production issues locally.
Custom exporters
One method to ship spans to Datadog, Honeycomb, Grafana Tempo, or your own backend.
SQLite storage
WAL mode for concurrent writes. Query traces across runs with SQL. Works inside Docker and CI.
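Here is a small, self-contained sketch of that storage model using Python's built-in `sqlite3`: open a file in WAL mode, insert a few spans, and compute mean groundedness per model over the last 24 hours. The three-column table and the sample rows are assumptions; the attribute paths follow the `eval_scores.Hallucination` naming used elsewhere on this page. It also assumes your SQLite build includes the JSON functions, which is the case for most current Python distributions.

```python
import json
import os
import sqlite3
import tempfile
import time

QUERY = """
SELECT json_extract(attributes, '$.model') AS model,
       AVG(json_extract(attributes, '$.eval_scores.Hallucination')) AS groundedness
FROM spans
WHERE start_time >= ?
GROUP BY model
ORDER BY groundedness DESC
"""


def open_store(path):
    """Open (or create) a span store in WAL mode; return the connection and journal mode."""
    conn = sqlite3.connect(path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spans ("
        "span_id TEXT PRIMARY KEY, start_time REAL NOT NULL, attributes TEXT NOT NULL)"
    )
    return conn, mode


def attrs(model, score):
    return json.dumps({"model": model, "eval_scores": {"Hallucination": score}})


now = time.time()
conn, journal_mode = open_store(os.path.join(tempfile.mkdtemp(), "traces.db"))
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?)",
    [
        ("s1", now - 60, attrs("gpt-4o", 1.0)),
        ("s2", now - 120, attrs("gpt-4o", 0.5)),
        ("s3", now - 180, attrs("gpt-4o-mini", 0.0)),
        ("s4", now - 2 * 86400, attrs("gpt-4o-mini", 1.0)),  # older than 24h, excluded
    ],
)
conn.commit()

rows = conn.execute(QUERY, (now - 86400,)).fetchall()
print(journal_mode, rows)
```

The cutoff is passed as a bound parameter rather than computed in SQL, which keeps the sketch independent of the SQLite version installed.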
Privacy first
Nothing leaves your machine by default. Drop-on-ingest for retention_class='pii'.
Query your agent like a database
SQLite storage means every trace is queryable. No dashboard needed.
-- Mean groundedness by model in the last 24h
SELECT json_extract(attributes,'$.model') AS model,
AVG(json_extract(attributes,'$.eval_scores.Hallucination')) AS groundedness,
SUM(json_extract(attributes,'$.tokens_total')) AS tokens
FROM spans
WHERE start_time >= unixepoch() - 86400
GROUP BY model
ORDER BY groundedness DESC;

Tour
Every view, populated with real-shaped data.
Insights
Save $$ · Ranked recommendations: model swaps, prompt caching, output caps, fine-tune ROI. Each backed by the spans behind it.
Costs
Spend by feature, by user, by model. Top spenders, projected monthly spend, and heavy-5% concentration in one place.
Quality
Hallucination · LLM-as-judge faithfulness scores, RAGAS claim verdicts, worst-offenders list.
Latency
LLM vs tool time, p50/p95/p99 by model, slowest traces. The LLM isn't usually the bottleneck.
Trace waterfall
Every LLM and tool call on the same axis. Inline attributes. Claim verdicts when present.
Overview
KPI strip, throughput, quality at a glance, top operations, recent traces.
Tenants
Subdivide every metric by tenant_id without query gymnastics. Indexed at the column level.
Three lines
Instrument your agent in less time than your build takes.
Patch your LLM client, point HTTPExporter at the cloud, and spans start flowing: batched, retried, and idempotent thanks to the server-side upsert. Failed sends are retried once, then logged and dropped, so nothing blocks your agent.
- Daemon-thread batched flush: no perf hit
- Server-side upsert on (project_id, span_id): safe retries
- Stays alongside JSONL / SQLite exporters: keep local logs
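The retry and upsert behaviour in the bullets above can be sketched in a few lines. Everything here is a hypothetical stand-in: `FakeIngest` plays the server with a dictionary keyed on (project_id, span_id), and `flush` is a synchronous version of the retry-once-then-drop policy, not Peekr's exporter code.

```python
import logging


class FakeIngest:
    """In-memory stand-in for the ingest endpoint, upserting on (project_id, span_id)."""

    def __init__(self, fail_first=0):
        self.rows = {}
        self.calls = 0
        self._fail_first = fail_first

    def __call__(self, batch):
        self.calls += 1
        if self.calls <= self._fail_first:
            raise ConnectionError("ingest unavailable")
        for span in batch:
            # Same key overwrites the same row, so a resend cannot duplicate spans.
            self.rows[(span["project_id"], span["span_id"])] = span


def flush(batch, send, retries=1):
    """Try to send a batch; retry `retries` times, then log and drop it."""
    last_error = None
    for _ in range(retries + 1):
        try:
            send(batch)
            return True
        except Exception as exc:  # never let telemetry errors reach the agent
            last_error = exc
    logging.warning("dropping %d spans: %s", len(batch), last_error)
    return False


batch = [
    {"project_id": "p1", "span_id": "a", "name": "openai.chat"},
    {"project_id": "p1", "span_id": "b", "name": "tool.fetch_user"},
]

flaky = FakeIngest(fail_first=1)
first_ok = flush(batch, flaky)   # fails once, succeeds on the retry
second_ok = flush(batch, flaky)  # duplicate delivery is harmless

down = FakeIngest(fail_first=10)
dropped_ok = flush(batch, down)  # two attempts, then dropped
```

In a real exporter the `flush` call would run on a background daemon thread; it is kept synchronous here so the behaviour is easy to follow.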
import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://ingest.peekr.cloud",
        api_key="pk_live_…",
    ),
)
peekr.eval.register(peekr.eval.Hallucination(detailed=True))

# Your OpenAI / Anthropic / Gemini / Bedrock code is now
# traced: model, tokens, latency, errors, tool calls,
# and per-claim hallucination verdicts.

Already in production
Powers Extremis Cloud out of the box.
Peekr is the observability layer bundled inside Extremis Cloud. Write-time NLI + LLM-judge marks every memory as grounded, unverified, or contradicted before it lands. Runtime tracing produces a span tree per recall, with reasons attached. The whole stack is dogfooding Peekr today.
- Hallucination checking embedded in a paid product
- Runtime tracing across multi-tenant Postgres
- Same Peekr you can install: no fork, no vendor lock-in
“Peekr runs inside every Extremis Cloud account, free. It's the standalone observability layer we use to check groundedness at write time and trace every recall at runtime, so when a memory feels off, the reason is one click away.”
Source: Extremis Cloud product page
Drop into your agent today.
First cohort is opening now. The free tier covers the first 10k spans per month; the first paid customers are invoiced manually.