Peekr Cloud · private beta

Agents are black boxes.
Peekr makes them transparent.

A profiler for every layer of your agent. Trace every LLM call and tool invocation, score every output for groundedness, and surface where time and money go, all from one library. The SDK is MIT-licensed. The cloud is hosted.

OpenAI · Anthropic · Gemini · Bedrock · LangChain · CrewAI · LlamaIndex

Five problems Peekr makes obvious

Every team running LLMs in production has the same five complaints.

Peekr is one library that answers each, with the matching view in the dashboard. You don't buy Peekr for one of these; you buy it because all five show up the week you ship.

The complaint

“My agent gave the wrong answer.”

Peekr's fix · See exactly what the LLM received, not what you think you sent.

Malformed context is the silent killer. Peekr captures every message, every tool result, and every retrieved chunk exactly as the model saw them, so you can find the mismatch in seconds.

Open a real trace
peekr view trace_a3f2b1c0
agent.run  2100ms
   └─ tool.fetch_user  12ms
         in:  {"user_id": 42}
         out: null                       ← returned null
   └─ openai.chat [gpt-4o]  2088ms · 4821tok
         in:  [{"role": "system", "content": "User profile: null..."}]
                                                            ^ LLM received garbage

The complaint

“My agent is hallucinating.”

Peekr's fix · Score every output for groundedness. Catch the regression before customers do.

LLM-as-judge plus RAGAS-style claim decomposition. Every model response is scored 0–1 against its context, with per-claim verdicts (supported / contradicted / unsupported) you can query in SQL.

Open Quality dashboard
peekr quality · worst-offender
#1 ⬤ 0.00  gpt-4o-mini · acme · /api/qa
Q: When was the Eiffel Tower built and by whom?

┌─ SOURCE CONTEXT ─────────┐  ┌─ MODEL ANSWER ────────────┐
│ The Eiffel Tower was     │  │ Built in 1923 by          │
│ completed in 1889 for    │  │ Frank Lloyd Wright for    │
│ the Paris World's Fair…  │  │ the London Olympics.      │
└──────────────────────────┘  └───────────────────────────┘

  contradicted  "1923"
  contradicted  "Frank Lloyd Wright"
  unsupported   "London Olympics"

  eval_scores: {Hallucination: 0.0, Rubric: 0.5}

▌ Peekr demo, last 24h: 481 evaluated · mean 0.81 · 41 critical · 73 warning

The complaint

“My agent is too slow.”

Peekr's fix · The trace shows exactly where time went. The LLM is rarely the bottleneck.

Most developers assume the model is slow and start swapping. Peekr's latency view splits trace time across LLM, tool, and your own code, so you optimise the right thing.

Open Latency view
peekr view --slow
agent.run  4300ms
   └─ tool.search_web  3800ms   ← 88% of total time. Cache this.
   └─ tool.rerank           18ms
   └─ openai.chat         490ms   ← not your problem

▌ Peekr demo, last 24h: LLM 85% of in-trace time · slowest model: claude-opus-4-7 · trace p95 6.2s
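The split shown in the trace above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not Peekr's implementation: the function name `time_shares` and the dict-of-durations input are hypothetical, and it assumes child spans run sequentially so their durations can be compared directly with the parent's wall time.

```python
def time_shares(total_ms, children):
    """Map each child span name to its fraction of the parent's wall time.

    Assumes child spans do not overlap (hypothetical simplification).
    """
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    shares = {name: ms / total_ms for name, ms in children.items()}
    # Whatever the children do not account for is attributed to the caller's own code.
    shares["(own code)"] = max(0.0, 1.0 - sum(shares.values()))
    return shares


# Durations taken from the sample trace above.
shares = time_shares(
    4300,
    {"tool.search_web": 3800, "tool.rerank": 18, "openai.chat": 490},
)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} {share:6.1%}")
```

Sorting by share puts the span worth optimising first; here that is the web search tool rather than the model call.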

The complaint

“My API bill is too high.”

Peekr's fix · Token counts across traces reveal patterns invisible in code.

Cost attribution by feature, by user, by model. Plus a recommendation engine: model swaps, prompt-caching opportunities, output caps, and fine-tune ROI, each backed by the spans behind it.

Open Insights
peekr view --tokens-over-time
Trace 1:  18,432 tokens   · $0.018
Trace 2:  21,104 tokens   · $0.021
Trace 3:  24,891 tokens   · $0.025   ← unbounded growth

Top fix: route short support_bot queries to gpt-4o-mini
  evidence: 110 of 119 support_bot calls in 24h were under 600 input tokens on claude-opus-4-7.
  saves $74/mo (99% on feature)

▌ Peekr found $228/mo in savings (42% of $541/mo) · 4 recommendations

The complaint

“It works locally but fails in prod.”

Peekr's fix · The bug is in your data pipeline, not your agent. The trace proves it.

Peekr captures what your tools actually returned, not what you assume. Compare local and prod traces side by side, and the diff jumps out: a retrieval miss, a flipped flag, or a slow upstream that broke timing.

See a trace with tool I/O
peekr diff local prod
# local   tool.fetch_inventory  8ms   out: [{"id":1,"qty":42}]
# prod    tool.fetch_inventory  8ms   out: []   ← empty. Data pipeline bug.

# local   openai.chat [gpt-4o]  843ms · 1820tok
# prod    openai.chat [gpt-4o]  612ms · 480tok   ← model had no context to work with

▌ Same agent. Different upstream. The trace makes it obvious.
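To illustrate the idea behind comparing two captured traces, here is a minimal sketch that pairs spans by name and reports the ones whose tool output differs. It is not how `peekr diff` is implemented: the `diff_spans` helper and the `{"name", "out"}` span shape are hypothetical, and the sketch assumes span names are unique within a trace.

```python
_MISSING = object()  # sentinel so that a legitimate None output is not confused with "absent"


def diff_spans(local, prod):
    """Return [(span_name, local_out, prod_out)] for spans whose outputs differ."""
    prod_out = {span["name"]: span["out"] for span in prod}
    diffs = []
    for span in local:
        other = prod_out.get(span["name"], _MISSING)
        if other is _MISSING:
            diffs.append((span["name"], span["out"], "<missing in prod>"))
        elif other != span["out"]:
            diffs.append((span["name"], span["out"], other))
    return diffs


# Outputs taken from the sample diff above.
local = [{"name": "tool.fetch_inventory", "out": [{"id": 1, "qty": 42}]}]
prod = [{"name": "tool.fetch_inventory", "out": []}]

for name, local_out, prod_out in diff_spans(local, prod):
    print(f"{name}: local={local_out!r} prod={prod_out!r}")
```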

Three primitives

Trace. Score. Slice.

Everything else is consolidation we run for you. The same three concepts power the OSS library and Peekr Cloud. With the Cloud, you just don't pay the LLM bill for evaluators yourself.

Primitive 01

Trace

Auto-patch your LLM client. Every call, tool, and agent step becomes a span.

  • OpenAI / Anthropic / Gemini / Bedrock auto-instrumented
  • LangChain / LlamaIndex / CrewAI agent steps captured
  • Stream chunks rolled up into one parent span
  • OTel exporter alongside HTTPExporter, so you can ship to Datadog too

Primitive 02

Score

LLM-as-judge faithfulness. NLI tier for the cheap pass, judge tier for the verdict.

  • Hallucination evaluator (RAGAS-style claim breakdown)
  • Citation evaluator (every claim → source span)
  • Rubric evaluator (your own 0–1 scoring prompt)
  • Failures land in /quality with the offending output

Primitive 03

Slice

tenant_id and retention_class are top-level columns. Filter and TTL without JSON gymnastics.

  • Indexed (project_id, tenant_id): sub-100ms tenant filtering
  • retention_class per span: short / default / long / pii
  • Per-tenant dashboards out of the box
  • B2B agents subdivide every metric by their own customers

Hallucination detection · built in

Catch the contradicted sentence before customers do.

Peekr's evaluator decomposes every model output into atomic factual claims, then assigns one of three verdicts using the RAGAS faithfulness method. You see the score, the breakdown, and the exact claim that failed.

  • ✓ Supported: directly entailed by the context. The model didn't make it up.
  • ✗ Contradicted: directly conflicts with the context. Hard fail.
  • ? Unsupported: context is silent. Maybe correct, maybe invented. Soft warn.
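As a rough illustration of how the three verdicts above could roll up into a single 0–1 score, the sketch below uses the usual RAGAS faithfulness ratio: supported claims divided by total claims. The `faithfulness` function and the list-of-strings input are hypothetical, not Peekr's API, and Peekr's own weighting of contradicted versus unsupported claims may differ.

```python
from collections import Counter

VERDICTS = ("supported", "contradicted", "unsupported")


def faithfulness(verdicts):
    """Return (score, counts), where score = supported claims / total claims."""
    counts = Counter(verdicts)
    unknown = set(counts) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    total = sum(counts.values())
    # Assumption: an output with no factual claims is treated as fully grounded.
    score = counts["supported"] / total if total else 1.0
    return score, {verdict: counts[verdict] for verdict in VERDICTS}


# The Eiffel Tower answer from the Quality dashboard example:
# two contradicted claims and one unsupported claim.
score, counts = faithfulness(["contradicted", "contradicted", "unsupported"])
print(score, counts)
```

With no supported claims the ratio is 0.0, which matches the Hallucination score shown for that example.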

Hallucination · Acme Agents · 24h

481 evaluations · mean 0.812

Status: healthy
[Chart: distribution of Hallucination score (1.0 = fully grounded)]
Healthy: 367 · Warning: 73 · Critical: 41

vs. the alternatives

What you get vs. what else is out there.

Capability | Peekr | Raw OpenTelemetry | Langfuse / Arize | Build it yourself
Auto-patch OpenAI/Anthropic/Gemini/Bedrock | partial
LangChain / LlamaIndex / CrewAI step capture
Hallucination evaluator (LLM-judge) | addon | diy
RAGAS-style claim verdict breakdown | diy
Citation / Rubric / NotEmpty evaluators | partial | diy
tenant_id as a first-class indexed column | diy
retention_class per span (PII / long / short) | diy
OTel exporter so you can ship to Datadog too | - | partial | diy
MIT SDK · dep-free · self-hostable | -
Hosted dashboard out of the box

yes = in the box · partial = on some providers · addon = paid add-on · diy = you write it

Everything you need. Nothing you don't.

One library. Sixteen capabilities. No backend required for any of them.

From basic tracing to evals, experiments, and a data flywheel, Peekr ships every primitive listed below. The Cloud is opt-in: when you outgrow a single-process file, the same wire format ships to the hosted backend.

Zero config

One call patches OpenAI, Anthropic, Gemini, and Bedrock. No wrappers, no env vars, no accounts.

Automatic nesting

Spans link to parents via Python's contextvars, so tool calls nest under the LLM that triggered them.

@trace decorator

Wrap any sync or async function. Captures inputs, outputs, latency, and errors.
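For readers curious how one decorator can cover both sync and async functions, here is a minimal standalone sketch. It is not Peekr's `@trace`: the local `trace` function, the `SPANS` list, and the record fields are hypothetical and exist only to show the pattern.

```python
import asyncio
import functools
import inspect
import time

SPANS = []  # hypothetical in-memory sink for recorded calls


def trace(fn):
    """Record inputs, output, latency, and errors for a sync or async function."""

    def record(args, kwargs, start, output=None, error=None):
        SPANS.append({
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "error": repr(error) if error is not None else None,
            "latency_ms": (time.perf_counter() - start) * 1000,
        })

    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await fn(*args, **kwargs)
            except Exception as exc:
                record(args, kwargs, start, error=exc)
                raise  # never swallow the caller's exception
            record(args, kwargs, start, output=result)
            return result
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            record(args, kwargs, start, error=exc)
            raise
        record(args, kwargs, start, output=result)
        return result
    return sync_wrapper


@trace
def add(a, b):
    return a + b


@trace
async def fetch_user(user_id):
    await asyncio.sleep(0)
    return {"user_id": user_id}


add(1, 2)
asyncio.run(fetch_user(42))
print([s["name"] for s in SPANS])
```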

Session tracing

Stitch multi-turn agent runs into one trace tree with peekr.session(...).

LLM-as-judge eval

Score every response automatically with Hallucination, Citation, Rubric, NotEmpty.

Hallucination detection

Faithfulness score 0–1 per span. Queryable in SQL. Plug in your retrieved docs for RAG.

RAGAS claim decomposition

Each output broken into atomic claims with per-claim supported / contradicted / unsupported verdicts.

Observability dashboard

Health hero, channel × time heatmap, AI-generated recommendations, per-call action items.

Multi-tenant schema

tenant_id + retention_class as top-level columns. Indexed at the storage layer.

Alerts

Slack / webhook sinks fire when error rate, latency, token spend, or cost spikes cross your threshold.

A/B experiments

Tag prompt variants. Peekr compares cost-per-success across variants automatically.

Feedback + export

Rate traces good/bad. Export labeled data as OpenAI fine-tuning JSONL in one command.
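As a sketch of what such an export could look like, the snippet below writes traces rated "good" in the chat format that OpenAI fine-tuning expects: one JSON object per line with a `messages` list. The `export_jsonl` helper and the trace dict shape (`feedback`, `messages`) are hypothetical; Peekr's own export command and storage schema are not shown in this document.

```python
import io
import json


def export_jsonl(traces, out):
    """Write traces rated 'good' as one {"messages": [...]} JSON object per line."""
    written = 0
    for trace_record in traces:
        if trace_record.get("feedback") != "good":
            continue
        line = json.dumps({"messages": trace_record["messages"]}, ensure_ascii=False)
        out.write(line + "\n")
        written += 1
    return written


traces = [
    {
        "feedback": "good",
        "messages": [
            {"role": "user", "content": "When was the Eiffel Tower completed?"},
            {"role": "assistant", "content": "It was completed in 1889."},
        ],
    },
    {
        "feedback": "bad",
        "messages": [
            {"role": "user", "content": "Who built it?"},
            {"role": "assistant", "content": "Frank Lloyd Wright."},
        ],
    },
]

buffer = io.StringIO()
count = export_jsonl(traces, buffer)
print(count, "example(s) exported")
```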

Trace replay

Re-run any stored trace with the same inputs. Debug production issues locally.

Custom exporters

One method to ship spans to Datadog, Honeycomb, Grafana Tempo, or your own backend.

SQLite storage

WAL mode so reads and writes don't block each other. Query traces across runs with SQL. Works inside Docker and CI.
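For context, this is roughly what enabling WAL looks like with Python's built-in `sqlite3` module. It is a minimal sketch with a hypothetical two-column `spans` table, not Peekr's storage code. In WAL mode SQLite still allows only one writer at a time, but readers are not blocked by that writer.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "traces.db")

writer = sqlite3.connect(db_path)
# The PRAGMA returns the journal mode that is actually in effect.
journal_mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE spans (span_id TEXT PRIMARY KEY, name TEXT)")
writer.execute("INSERT INTO spans VALUES ('s1', 'agent.run')")
writer.commit()

# A second connection (say, a CLI or a notebook) reads while the writer stays open.
reader = sqlite3.connect(db_path)
writer.execute("INSERT INTO spans VALUES ('s2', 'openai.chat')")  # not committed yet
seen_before = reader.execute("SELECT COUNT(*) FROM spans").fetchone()[0]
writer.commit()
seen_after = reader.execute("SELECT COUNT(*) FROM spans").fetchone()[0]

print(journal_mode, seen_before, seen_after)
```

The reader sees only committed rows, and its query proceeds while the writer holds an open transaction.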

Privacy first

Nothing leaves your machine by default. Drop-on-ingest for retention_class='pii'.

Query your agent like a database

SQLite storage means every trace is queryable. No dashboard needed.

-- Mean groundedness by model in the last 24h
SELECT json_extract(attributes,'$.model') AS model,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) AS groundedness,
       SUM(json_extract(attributes,'$.tokens_total')) AS tokens
FROM spans
WHERE start_time >= unixepoch() - 86400
GROUP BY model
ORDER BY groundedness DESC;
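
The same query can be run from Python with the standard `sqlite3` module. The sketch below builds a throwaway in-memory database to show the shape of the result. The three-column `spans` schema and the sample rows are assumptions made for illustration, not Peekr's actual schema. The cutoff is passed as a parameter instead of calling `unixepoch()`, which needs SQLite 3.38 or newer.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema: attributes is a JSON document stored as text.
conn.execute(
    "CREATE TABLE spans (span_id TEXT PRIMARY KEY, start_time INTEGER, attributes TEXT)"
)

now = int(time.time())
sample = [
    ("a", now - 60, {"model": "gpt-4o", "tokens_total": 1000, "eval_scores": {"Hallucination": 1.0}}),
    ("b", now - 120, {"model": "gpt-4o", "tokens_total": 500, "eval_scores": {"Hallucination": 0.5}}),
    ("c", now - 30, {"model": "gpt-4o-mini", "tokens_total": 200, "eval_scores": {"Hallucination": 0.0}}),
    # Older than 24h, so the WHERE clause excludes it.
    ("d", now - 2 * 86400, {"model": "gpt-4o-mini", "tokens_total": 999, "eval_scores": {"Hallucination": 1.0}}),
]
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?)",
    [(span_id, start, json.dumps(attrs)) for span_id, start, attrs in sample],
)

QUERY = """
SELECT json_extract(attributes, '$.model') AS model,
       AVG(json_extract(attributes, '$.eval_scores.Hallucination')) AS groundedness,
       SUM(json_extract(attributes, '$.tokens_total')) AS tokens
FROM spans
WHERE start_time >= ?
GROUP BY model
ORDER BY groundedness DESC
"""

result = conn.execute(QUERY, (now - 86400,)).fetchall()
for model, groundedness, tokens in result:
    print(f"{model:<12} groundedness={groundedness:.2f} tokens={tokens}")
```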

Three lines

Instrument your agent in less time than your build takes.

Patch your LLM client, point HTTPExporter at the cloud, and spans start flowing: batched, retried, and idempotent thanks to the server-side upsert. Failed sends are retried once, then logged and dropped, so nothing blocks your agent.

  • Daemon-thread batched flush: no perf hit
  • Server-side upsert on (project_id, span_id): safe retries
  • Runs alongside JSONL / SQLite exporters, so you keep local logs

agent.py
import peekr

peekr.instrument(
  tenant_id="acme",
  exporter=peekr.HTTPExporter(
    endpoint="https://ingest.peekr.cloud",
    api_key="pk_live_…",
  ),
)

peekr.eval.register(peekr.eval.Hallucination(detailed=True))

# Your OpenAI / Anthropic / Gemini / Bedrock code is now
# traced: model, tokens, latency, errors, tool calls,
# and per-claim hallucination verdicts.
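
Why an upsert keyed on (project_id, span_id) makes client retries safe can be shown with a small experiment. The sketch below uses SQLite's `INSERT ... ON CONFLICT DO UPDATE` purely for demonstration. The table layout and the `ingest` helper are hypothetical, and the hosted backend's actual database and schema are not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE spans (
        project_id TEXT NOT NULL,
        span_id    TEXT NOT NULL,
        name       TEXT,
        latency_ms REAL,
        PRIMARY KEY (project_id, span_id)
    )
""")

UPSERT = """
INSERT INTO spans (project_id, span_id, name, latency_ms)
VALUES (:project_id, :span_id, :name, :latency_ms)
ON CONFLICT (project_id, span_id) DO UPDATE SET
    name = excluded.name,
    latency_ms = excluded.latency_ms
"""


def ingest(batch):
    """Apply a batch atomically; replaying the same batch leaves the table unchanged."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, batch)


batch = [
    {"project_id": "p1", "span_id": "s1", "name": "agent.run", "latency_ms": 2100.0},
    {"project_id": "p1", "span_id": "s2", "name": "openai.chat", "latency_ms": 2088.0},
]
ingest(batch)
ingest(batch)  # simulated retry after a timeout: same spans sent again

row_count = conn.execute("SELECT COUNT(*) FROM spans").fetchone()[0]
print(row_count)
```

Because the second delivery updates the existing rows instead of inserting new ones, an exporter can resend a batch after an ambiguous failure without creating duplicates.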

Already in production

Powers Extremis Cloud out of the box.

Peekr is the observability layer bundled inside Extremis Cloud. Write-time NLI + LLM-judge marks every memory as grounded, unverified, or contradicted before it lands. Runtime tracing produces a span tree per recall, with reasons attached. The whole stack is dogfooding Peekr today.

  • Hallucination checking embedded in a paid product
  • Runtime tracing across multi-tenant Postgres
  • Same Peekr you can install: no fork, no vendor lock

Extremis Cloud
Managed memory for AI agents
“Peekr runs inside every Extremis Cloud account, free. It's the standalone observability layer we use to check groundedness at write time and trace every recall at runtime, so when a memory feels off, the reason is one click away.”

– Extremis Cloud product page

Bundled · free tier
Self-host the same library

Drop into your agent today.

The first cohort is opening now. The free tier covers the first 10k spans per month; the first paid customers are invoiced manually.