MemoSift
Benchmarks

Deterministic
by design.

Reproducible recall. Predictable latency. Verified on live cloud.

100%

Positional recall

0

Hallucinations

8

Types classified

2

Recall modes

Save your agent from context rot.

Offload bulky tool output before it eats the context window — without losing access to any of it.

Raw tool output

files · JSON · logs · diffs · PDFs

< 500 chars — pass through

~15%

Small, cheap, keeps flowing.

≥ 500 chars — externalized

~85%

Classified, typed, stored. Agent gets a handle + metadata.

CODE
JSON
CSV
PDF
LOG
MD
SQL
BIN
~85%

offloaded before it can bloat the context window. Still reachable through artifact_id the moment the agent needs it.
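The routing rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MemoSift's implementation: `route_tool_output`, the `Handle` dataclass, the in-memory `_STORE`, the hash-based ID scheme, and the crude `guess_kind` classifier are all hypothetical. Only the 500-character threshold and the `artifact_id` name come from this page.

```python
import hashlib
import json
from dataclasses import dataclass

EXTERNALIZE_THRESHOLD = 500  # characters; the cut-off stated on this page


@dataclass
class Handle:
    artifact_id: str  # hypothetical ID scheme: short content hash
    kind: str         # coarse type label, e.g. "JSON" or "LOG"
    size: int         # length of the stored output in characters


_STORE: dict[str, str] = {}  # stand-in for real artifact storage


def guess_kind(text: str) -> str:
    """Very rough classifier, for illustration only."""
    try:
        json.loads(text)
        return "JSON"
    except ValueError:
        return "LOG"


def route_tool_output(text: str):
    """Return small outputs unchanged; store large ones and return a handle."""
    if len(text) < EXTERNALIZE_THRESHOLD:
        return text
    artifact_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    _STORE[artifact_id] = text
    return Handle(artifact_id=artifact_id, kind=guess_kind(text), size=len(text))


def fetch(artifact_id: str) -> str:
    """Give the agent the full bytes back when it asks for them."""
    return _STORE[artifact_id]
```

The agent's context only ever holds the short string or the small `Handle`; the full output stays one `fetch` call away.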

Two modes. One call.

ms.recall(mode="fast" | "deep")

Fast

Hybrid SQL + rerank. Skips LLM hop.

~750 ms

p50

embed → query_analysis (skipped) → entity_expand (skipped) → hybrid_sql → rerank

Deep

Adds LLM query analysis + entity expansion.

~1.1 s

p50

embed → query_analysis → entity_expand → hybrid_sql → intent_anchor → rerank
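One way to picture the difference between the two modes is as a stage list chosen per call. The sketch below is an assumption-laden illustration: the stage names are copied from this page, but `stages_for` is a hypothetical helper, and the set of stages that "fast" skips is inferred from the descriptions above rather than confirmed.

```python
# Stage names are taken from this page. Which stages "fast" skips is an
# assumption based on "Skips LLM hop" and the Deep description.
DEEP_STAGES = [
    "embed",
    "query_analysis",
    "entity_expand",
    "hybrid_sql",
    "intent_anchor",
    "rerank",
]
FAST_STAGES = ["embed", "hybrid_sql", "rerank"]


def stages_for(mode: str) -> list[str]:
    """Return the ordered stages that would run for a recall mode."""
    if mode == "deep":
        return list(DEEP_STAGES)
    if mode == "fast":
        return list(FAST_STAGES)
    raise ValueError(f"unknown recall mode: {mode!r}")


for mode in ("fast", "deep"):
    print(mode, "→", " → ".join(stages_for(mode)))
```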

No recency bias. No primacy bias.

A single fact is planted at a known turn, then the full 170-turn session is ingested. A recall probe targeting that fact fires afterward. Each dot is one probe: its height is the top-hit cosine similarity score from the recall pipeline, and its horizontal position is where the needle sat in the session timeline.

170-turn workload
X axis · turn position as % of session length
Y axis · top-hit similarity score (0–1)
Dashed line · abstention threshold at 0.5
Probe results (turn · position in session · top-hit similarity score), with turn ≈ position% × 170:
turn 3 · 2% · 0.55
turn 9 · 5% · 0.55
turn 36 · 21% · 0.58
turn 128 · 75% · 0.54
turn 131 · 77% · 0.59
turn 162 · 95% · 0.71
turn 165 · 97% · 0.68

Probes

7 / 7

returned target in top three hits

Lowest score

0.54

still +0.04 above abstention threshold

Spread

0.54 – 0.71

no position drops below the threshold: early and mid-session needles score 0.54–0.59, the two latest 0.68–0.71

Why a flat curve is the good outcome

Long-context LLMs and naive RAG both suffer from “lost-in-the-middle” — facts placed in the middle of a session are retrieved less reliably than facts near the start or end. A flat line here means the retrieval score doesn't care where in the session the fact was mentioned. The agent is just as likely to find a turn-3 detail as a turn-165 one.

Every probe clears the abstention threshold, so the retrieved top hit is above the “I don't know” floor the system uses to avoid confabulation — it would rather say nothing than return something below that line.
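The summary figures above can be rechecked from the seven published data points. This is a minimal sketch: the scores and the 0.50 threshold are taken from this page, while `answer_or_abstain` and the choice of `>=` for the comparison are assumptions about how such a gate might look.

```python
ABSTENTION_THRESHOLD = 0.50

# (turn, top-hit similarity score) pairs as reported on this page
PROBES = [
    (3, 0.55), (9, 0.55), (36, 0.58), (128, 0.54),
    (131, 0.59), (162, 0.71), (165, 0.68),
]


def answer_or_abstain(top_score, hit, threshold=ABSTENTION_THRESHOLD):
    """Return the hit only if its score clears the threshold, else None."""
    return hit if top_score >= threshold else None


scores = [score for _, score in PROBES]
cleared = sum(score >= ABSTENTION_THRESHOLD for score in scores)
lowest, highest = min(scores), max(scores)
margin = round(lowest - ABSTENTION_THRESHOLD, 2)

print(f"{cleared} / {len(scores)} probes clear the threshold")
print(f"spread {lowest:.2f} to {highest:.2f}, margin {margin:+.2f}")
```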

Upstream oil & gas — where verbatim matters.

A digit off on a well API points to a different well. A rounded median loses the comparison. The pipeline kept them intact.

Column names

Schema order preserved byte-for-byte

Well IDs

API numbers retrievable by partial match

Comparison ratios

Median values + exact ratio between plays

Safety certs

Graph traversal → top hit by entity salience

Deliverables

Final artifact filename at hit #1

ms.recall() → verbatim

The same properties matter anywhere technical specificity drives decisions — medical, legal, regulatory, financial.

Predictable envelopes.

Wall-clock p50 and p95, measured from the client against an idle server. Non-recall tools stay in a narrow band: p50 between 400 and 670 ms, p95 at or below 720 ms.

list

enumerate · paginated

p50 400 ms · p95 625 ms

explore

4-axis graph walk

p50 400 ms · p95 590 ms

compress

SQL template · no LLM

p50 610 ms · p95 650 ms

fetch

artifact bytes

p50 670 ms · p95 720 ms

recall · fast

hybrid + rerank

p50 1100 ms · p95 1500 ms

recall · deep

+ LLM analysis

p50 1100 ms · p95 1500 ms
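For readers who want to produce comparable figures on their own workload, client-side p50 and p95 can be computed as sketched below. This is an illustration only: `percentile` uses the nearest-rank definition, `time_call_ms` is a hypothetical helper, and neither is taken from the MemoSift harness, which may use a different percentile method or sample count.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_call_ms(fn, runs=20):
    """Wall-clock latency of fn() in milliseconds, measured from the caller's side."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return samples


# Replace the lambda with a real client call to measure a tool end to end.
samples = time_call_ms(lambda: sum(range(1000)))
print(f"p50 {percentile(samples, 50):.3f} ms · p95 {percentile(samples, 95):.3f} ms")
```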

How we measured.

No self-scoring shortcuts. Every probe is hand-written, every ground truth is sourced from the raw session transcript, and an independent judge grades each probe against that ground truth.

47 hand-authored probes · 12 categories

Each probe is a question + ground-truth expected content + grading rubric. No autogenerated probes.

Probes · Category · What it checks
3 · Goal / intent recall · What was the session about?
8 · Fact recall · Exact numeric values, counts, parameters
4 · Structural recall · Column names, schemas, ports, file paths
6 · Enumeration · List all N items from a specific turn
8 · NIAH positional · Fact planted at turn X of 170. Retrievable?
3 · Graph traversal · Explore from entity → related memories
3 · Multi-hop reasoning · Chain facts across non-adjacent turns
4 · Artifact fetch + list · Retrieve bytes, enumerate by type
2 · Temporal / bi-temporal · What did the agent know at turn N?
2 · Abstention · Query for content never seen, expecting no hallucination
2 · Analysis recall · Median, ratio, derived comparison values
2 · Deliverable recovery · Final filename retrievable post-session

Independent per-probe judgment

Claude Opus 4.7 (1M context) · running in Claude Code

  • Ground truth hand-authored by the human operator directly from the raw session transcript — not the model.
  • The judge sees the probe question, the top-five recall hits, the ground truth, and a grading rubric. Grades PASS / PARTIAL / FAIL, with evidence quoted from the hits.
  • PARTIAL weighted as 0.5 — prevents the scoreboard from hiding near-misses. Strict-only score (partial = fail) and lenient (partial = pass) published alongside for full transparency.
  • No self-scoring shortcut — the judge never grades its own memory extraction. Judge is an independent Claude session; the system under test is the MemoSift cloud.
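The three scoreboards described above differ only in how PARTIAL is weighted. The sketch below shows the arithmetic under that reading; the `score` function and its dictionary keys are hypothetical names, and the grade counts in the usage lines are the ones reported on this page (46 PASS, 1 PARTIAL, 0 FAIL).

```python
from collections import Counter

# PARTIAL counts as half a pass in the effective score
WEIGHTS = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}


def score(grades):
    """Return effective, strict, and lenient pass rates for a list of grades."""
    counts = Counter(grades)
    unknown = set(counts) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown grades: {sorted(unknown)}")
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to score")
    return {
        "effective": sum(WEIGHTS[g] * n for g, n in counts.items()) / total,
        "strict": counts["PASS"] / total,                        # partial = fail
        "lenient": (counts["PASS"] + counts["PARTIAL"]) / total,  # partial = pass
    }


result = score(["PASS"] * 46 + ["PARTIAL"])
print(f"effective {result['effective']:.1%}")  # effective 98.9%
print(f"strict {result['strict']:.1%} · lenient {result['lenient']:.1%}")
```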

Run conditions

Live cloud · reproducible from the repo

  • Every probe hits the same Railway deployment users hit. No local stub, no mocked DB, no replay from cache.
  • Sessions replayed turn-by-turn through the full ingest pipeline (Phase 1 sync + Phase 2 async). Probes fire only after drain_complete: true.
  • Session scope: short (10 turns), medium (77 turns), long (170 turns). 12 query categories across the 3 sessions, 47 probes in total.
  • Harness, ground-truth files, and raw reports live under benchmarking/ in the repo. Anyone can rerun.

How MemoSift stacks up.

Results from established evaluation frameworks that we've run end-to-end against the deployed cloud. Every card shows our score next to published baselines.

Needle-in-a-Haystack

Gkamradt (adapted for agent memory)

A planted fact is retrieved after the full session ingests.

metric: Top-hit similarity · % above abstention threshold
context: 170-turn agent session · 7 needle positions (2% → 97%)
MemoSift · 7 / 7 · every position clears threshold
Long-context LLMs (lost-in-the-middle) · degrades mid-session · Liu et al. 2023
Naive vector RAG · recency bias · common failure mode on long sessions

LongMemEval · category parity

Wu et al. 2024 (probe-category adaptation)

Five long-memory capabilities: information extraction, multi-session reasoning, knowledge updates, temporal reasoning, abstention.

metric: Effective pass rate (PASS=1, PARTIAL=0.5)
context: 47 hand-authored probes across 10 / 77 / 170-turn sessions
MemoSift · 98.9% · 46 PASS · 1 PARTIAL · 0 FAIL

MTEB retrieval (inherited)

Muennighoff et al.

Embedding quality across 58 datasets spanning 8 task types.

metric: Inherited via the embedding model MemoSift uses
context: OpenAI text-embedding-3-small
text-embedding-3-small · 62.3 avg · MTEB leaderboard
text-embedding-3-large · 64.6 avg · upgrade path; not default

Additional benchmark runs (LoCoMo, LongMemEval full harness, RULER, HELMET, LongBench v2, MemGPT bench) are scheduled in the harness plan under docs/superpowers/plans/. Each lands here only after the end-to-end run is reproducible from the repo with a published comparison table.

Benchmark your own session.

The harness, ground-truth files, judge prompts, and raw reports live in the repo. Reproduce every number on this page — or run the same probes against your own workload.