Question 1

What was actually measured in the MemoSift benchmark?

Accepted Answer

The benchmark ran 47 hand-authored probes covering 11 query categories: goal and intent recall, fact recall, structural lookups (column names, schemas, URLs), enumeration (lists returned verbatim), artifact fetch and list, graph traversal, multi-hop reasoning, needle-in-a-haystack at known turn positions, bi-temporal snapshots, abstention on out-of-scope queries, and deliverable filename recovery. Every probe was judged PASS, PARTIAL, or FAIL against hand-authored ground truth.
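As a rough illustration of how three-way verdicts might be rolled up per category, here is a minimal Python sketch. The `tally` helper and the sample rows are hypothetical: they are not taken from the MemoSift harness and the rows are not benchmark results.

```python
from collections import Counter

VERDICTS = ("PASS", "PARTIAL", "FAIL")

def tally(results):
    """Group (category, verdict) pairs into one Counter of verdicts per category."""
    by_category = {}
    for category, verdict in results:
        if verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
        by_category.setdefault(category, Counter())[verdict] += 1
    return by_category

# Made-up rows for illustration only; these are NOT benchmark results.
example = [
    ("fact recall", "PASS"),
    ("fact recall", "PARTIAL"),
    ("enumeration", "PASS"),
]
summary = tally(example)
```

Keeping PARTIAL as its own bucket, instead of folding it into PASS or FAIL, preserves the distinction the judging scheme makes.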

Question 2

How is positional recall tested?

Accepted Answer

A specific fact is planted at a known turn position in a 170-turn session: at 1.8%, 5.3%, 21%, 75%, 77%, 95%, and 97% of the way through. The full session is ingested first, and only then does the probe fire. All seven positions return their target in the top three hits, with top-hit scores clearing the abstention threshold, so the results show neither recency bias nor primacy bias.
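The pass condition described above can be sketched in Python as follows. This is an assumption-laden illustration, not the harness code: the `Hit` type, the rounding used to map a percentage to a turn, and the 0.6 threshold are all hypothetical, since the answer does not state the real threshold value.

```python
from dataclasses import dataclass

SESSION_TURNS = 170
POSITIONS_PCT = [1.8, 5.3, 21, 75, 77, 95, 97]
ABSTENTION_THRESHOLD = 0.6  # hypothetical value, chosen only for illustration

def turn_for(pct, total=SESSION_TURNS):
    """Map a session percentage to a 1-based turn index (rounding is an assumption)."""
    return max(1, min(total, round(total * pct / 100)))

@dataclass
class Hit:
    turn: int     # turn the retrieved memory came from
    score: float  # retrieval confidence score

def probe_passes(hits, target_turn, threshold=ABSTENTION_THRESHOLD, k=3):
    """True if the target turn is in the top-k hits and the top hit clears the threshold."""
    top = hits[:k]
    if not top or top[0].score < threshold:
        return False
    return any(h.turn == target_turn for h in top)

# Approximate planted turns for the seven positions, under the rounding assumption.
planted_turns = [turn_for(p) for p in POSITIONS_PCT]
```

Running the same check at early, middle, and late positions is what lets a benchmark of this shape distinguish primacy bias from recency bias.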

Question 3

What does abstention accuracy mean here?

Accepted Answer

Out-of-scope queries, which ask about content that was never in the session, should not return fabricated answers. For these queries MemoSift returns only tangentially related content with low confidence scores (around 0.5), which agent layers can threshold on. All out-of-scope probes in the benchmark (100%) returned low-confidence neighbors rather than hallucinated facts.
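An agent layer might apply that threshold along these lines. This is a minimal Python sketch with hypothetical names: `answer_or_abstain` and the 0.6 cut-off are assumptions for illustration, and the hit tuples stand in for whatever shape the real recall response has.

```python
def answer_or_abstain(hits, threshold=0.6):
    """Return the best hit's text, or None when nothing clears the threshold.

    hits: list of (text, score) pairs, assumed sorted by descending score.
    threshold: hypothetical cut-off; tune it against your own sessions.
    """
    confident = [(text, score) for text, score in hits if score >= threshold]
    if not confident:
        return None  # abstain instead of passing a weak neighbor to the model
    return confident[0][0]

# Illustrative inputs only: one in-scope result set and one out-of-scope result set.
in_scope = [("well depth is 3,200 m", 0.91), ("rig schedule", 0.55)]
out_of_scope = [("rig schedule", 0.52), ("vendor list", 0.48)]
```

Returning `None` makes abstention an explicit branch the agent has to handle, instead of letting a score near 0.5 be treated as an answer.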

Question 4

Can I reproduce these benchmarks?

Accepted Answer

Yes. The full harness, ground-truth files, and raw reports live under benchmarking/ in the repository. Every request in the benchmark hits the same live Railway deployment users call in production. There is no local stub or mock.

Question 5

What is the fast-vs-deep recall latency trade-off?

Accepted Answer

Fast mode runs only hybrid SQL scoring and a cross-encoder rerank, skipping the LLM query-analysis hop; its idle p50 is around 750 ms server-side. Deep mode adds LLM query decomposition, entity expansion, and intent anchoring, with an idle p50 of around 1.1 s. The agent picks a mode per call: routine lookups stay fast, while exploratory queries get broader recall.
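One way an agent could make that per-call choice is sketched below in Python. The `choose_mode` function and its parameters are hypothetical and are not part of any MemoSift API; only the two approximate idle p50 figures come from the answer above.

```python
from typing import Optional

# Approximate idle p50 latencies quoted above, in milliseconds.
IDLE_P50_MS = {"fast": 750, "deep": 1100}

def choose_mode(exploratory: bool, latency_budget_ms: Optional[float] = None) -> str:
    """Pick "deep" only for exploratory queries whose budget covers deep mode's p50."""
    budget_ok = latency_budget_ms is None or latency_budget_ms >= IDLE_P50_MS["deep"]
    if exploratory and budget_ok:
        return "deep"
    return "fast"
```

For example, `choose_mode(exploratory=True, latency_budget_ms=900)` falls back to fast mode, because the budget is below deep mode's quoted p50.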

Deterministic by design.

