Essay

AI Memory Has a Confidence Problem

April 2026

AI agent memory is having a moment. In the past four months, three serious systems have shipped — each solving a different piece of the problem, each leaving a different gap.

The field is converging on an insight: flat memory doesn’t work. It hasn’t converged on what to do about it. Here’s the map.


The practitioner baseline

Cog shipped in March 2026 and pulled 337 GitHub stars in three weeks. It’s a convention-based memory system for Claude Code — markdown files, slash commands, progressive context loading. Observations are append-only. They condense into patterns, capped at 70 lines. Stale data archives. Indexes rebuild.

It solves a real problem: Claude Code’s built-in memory is unstructured and difficult to inspect. Cog gives it bones.

What Cog doesn’t do: track confidence. Every observation carries the same weight. There’s no mechanism for “I’m less sure about this one” or “this contradicts something I recorded last month.” A consistency sweep checks for contradictions, but there’s no structured record to hold what it finds. Contradictions get found and then lost.

Organization is necessary. Epistemology is what’s missing.


The academic response

Two systems formalize what Cog leaves implicit.

Hindsight, from Vectorize.io, organizes memory into four networks: facts, experiences, observations, and opinions. The key move is that opinions carry confidence scores — 0 to 1, updated through explicit reinforcement. Supporting evidence nudges confidence up. Contradiction pushes it down. They call this epistemic clarity: the clean separation between what an agent knows, has experienced, believes, and has inferred. It hit 91.4% on LongMemEval. VentureBeat covered it.
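Here’s one way that kind of reinforcement can work: a proportional update that moves confidence a fixed fraction of the remaining distance toward 1.0 on support, toward 0.0 on contradiction. This is my reconstruction of the general idea, not Hindsight’s actual code, and the rate is an assumption.

```python
def reinforce(confidence: float, supported: bool, rate: float = 0.2) -> float:
    """Nudge a 0-1 confidence score toward 1.0 (support) or 0.0 (contradiction).

    The update is proportional to the remaining distance, so confidence
    approaches but never reaches the extremes: no belief becomes unrevisable.
    """
    target = 1.0 if supported else 0.0
    return confidence + rate * (target - confidence)

c = 0.5
c = reinforce(c, supported=True)    # -> 0.60
c = reinforce(c, supported=True)    # -> 0.68
c = reinforce(c, supported=False)   # -> 0.544
```

The asymptotic shape matters: repeated confirmation saturates, so one contradiction still moves a well-supported belief, just by less.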

Kumiho, published the same month, goes further. It’s a graph-native memory architecture formally grounded in AGM belief revision — the gold standard for how rational agents should update beliefs when new information arrives. The paper proves satisfaction of the core AGM postulates. This is the first system that can say: our belief updates are provably rational.
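To give a flavor of what “provably rational” means here, these are some of the core AGM revision postulates as stated in the standard literature (not quoted from the Kumiho paper), for a belief set $K$ revised by new information $\varphi$:

```latex
\begin{align*}
\textbf{Success:}\quad & \varphi \in K * \varphi \\
\textbf{Inclusion:}\quad & K * \varphi \subseteq K + \varphi
  \quad \text{(revision never adds more than plain expansion)} \\
\textbf{Vacuity:}\quad & \neg\varphi \notin K \;\Rightarrow\; K + \varphi \subseteq K * \varphi \\
\textbf{Consistency:}\quad & K * \varphi \text{ is consistent whenever } \varphi \text{ is}
\end{align*}
```

In words: the new information is accepted, no more is changed than necessary, and the result stays consistent.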

Both advance the field. Neither targets individual practitioners. Hindsight is built for agentic workflows at scale. Kumiho is an SDK with Redis and Neo4j backends. These are infrastructure for production systems, not tools you drop into your working directory.


The gap between infrastructure and use

Mainstream agent memory frameworks — Mem0, Zep, Letta — haven’t caught up. They treat memory as retrieval: store, find, return. Mem0 resolves conflicts by overwriting. Zep tracks temporal validity but not subjective confidence. Letta surfaces conflicts as text-level merge issues. None model the difference between “I observed this” and “I believe this at 70% confidence based on three sessions of evidence.”

The canonical survey paper — “Memory in the Age of AI Agents,” December 2025 — maps a taxonomy of Forms, Functions, and Dynamics. Useful as an organizing frame. But it notes that traditional long/short-term memory models are insufficient without proposing what replaces them.

There’s a missing layer between “store and retrieve” and “formally rational belief revision.” Something a practitioner can adopt today, without deploying graph databases or proving postulate satisfaction.


What I built

Epistemic Memory is a protocol — not a library, not an SDK. Markdown files and two Claude Code skills that copy into your project directory.

Every belief carries metadata: confidence (0.0–1.0), first recorded, last confirmed, last challenged, and permanence class. The system doesn’t treat “you use dark mode” and “you think in systems” as the same kind of claim. It distinguishes them by evidence quality and decays them on different schedules.

I wrote about the design in a previous post. At that point, two pieces were missing: the Observe skill, which automates behavioral evidence collection at conversation end, and the session counter hook, which gives periodic reviews a mechanical trigger instead of relying on the AI to remember. Both now ship. All ten components are live.
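The mechanical trigger can be as small as a counter file bumped once per session. What follows is a hypothetical sketch of that pattern; the file name and review interval are illustrative, not the protocol’s actual values.

```python
from pathlib import Path
import tempfile

REVIEW_EVERY = 10  # assumed interval, not the protocol's actual value

def bump_session_counter(counter_file: Path) -> bool:
    """Increment the persisted session count; return True when a review is due."""
    count = int(counter_file.read_text()) if counter_file.exists() else 0
    count += 1
    counter_file.write_text(str(count))
    return count % REVIEW_EVERY == 0

with tempfile.TemporaryDirectory() as d:
    counter = Path(d) / "session_count"
    due = [bump_session_counter(counter) for _ in range(10)]
    # nine quiet sessions, then the tenth returns True and triggers a review
```

The value of the hook isn’t the arithmetic; it’s that the trigger fires whether or not the model remembers to check.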

The difference from Cog isn’t sophistication — it’s posture. Cog asks “where does this memory go?” Epistemic Memory asks “how confident should I be, and what would change my mind?” Cog is a filing cabinet. This is a belief revision system that happens to use files.

The difference from Hindsight and Kumiho is scope. Those are infrastructure for memory-augmented agents at scale. This is a protocol for one person working with one AI, adopted in five minutes by copying files.


What’s still missing

The decay rates — two-year half-life for stable traits, five months for durable, six weeks for situational — are starting guesses, not calibrated values.
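Under simple exponential decay, those half-lives translate directly into a confidence adjustment at read time. A sketch, with the day counts as rough conversions of the periods above:

```python
# Approximate half-lives in days: two years, five months, six weeks.
HALF_LIFE_DAYS = {"stable": 730, "durable": 150, "situational": 42}

def decayed(confidence: float, permanence: str, days_since_confirmed: int) -> float:
    """Halve confidence once per elapsed half-life for the belief's class."""
    half_life = HALF_LIFE_DAYS[permanence]
    return confidence * 0.5 ** (days_since_confirmed / half_life)

decayed(0.8, "situational", 42)   # one half-life elapsed -> 0.4
```

Calibrating means fitting those three constants to observed belief-accuracy data, which is exactly the ground truth the next paragraph says doesn’t exist yet.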

There’s no ground truth for belief accuracy. The periodic review asks the user directly, but that depends on the user’s self-knowledge — which has its own blind spots.

The entire protocol depends on the LLM following complex instructions reliably. LLMs don’t. The session counter mechanizes one enforcement point. The Observe skill mechanizes another. The rest runs on instruction-following that isn’t guaranteed.

And the term itself — “epistemic memory” — has near-zero adoption in AI agent discourse. The concepts are arriving under different names: epistemic clarity, epistemic governance. Whether this language sticks is an open question.


Where this is heading

The convergence is real. Four months ago, no one was talking about confidence-weighted beliefs for AI agents. Now there are three independent implementations, a survey paper, and a growing practitioner community that knows append-only memory isn’t enough. 337 stars for a well-organized filing cabinet tells you the appetite is there.

The next step is empirical. Run the protocol for N sessions with adversarial inputs — introduce contradictory behavior, test whether the system flags it, measure how many sessions until confidence adjusts. Compare against a flat baseline. The formal frameworks exist. The implementations exist. What’s missing is evidence that confidence-weighted memory outperforms the alternative in practice, not just in theory.
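A toy version of that measurement, assuming a proportional update in which each contradicting session moves confidence 20% of the way toward zero (the rate and flip threshold are assumptions for illustration):

```python
def sessions_until_flip(confidence: float, rate: float = 0.2,
                        threshold: float = 0.5) -> int:
    """Count contradicting sessions until confidence falls below the threshold."""
    sessions = 0
    while confidence >= threshold:
        confidence += rate * (0.0 - confidence)  # each session contradicts
        sessions += 1
    return sessions

sessions_until_flip(0.8)   # 0.8 -> 0.64 -> 0.512 -> 0.4096: flips after 3
```

The real experiment would run this against live sessions rather than a closed-form update, but the metric is the same: sessions-to-revision under contradiction, compared against a flat baseline that never revises at all.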

I think it does. But “I think” is a 0.6, not a 0.9. The protocol’s own epistemology demands I say so.