The Full Stack: A Cognitive Operating System Inside Claude Code
April 2026
Claude Code ships as a CLI coding assistant. It reads files, writes code, runs tests. What it doesn’t ship with is opinions about how to think. No reasoning strategies, no memory across sessions, no way to challenge its own blind spots. It’s a capable mind with no cognitive architecture around it.
Over the past several months I’ve built that architecture. Not by modifying Claude’s weights or calling custom APIs—just by composing the extension points Claude Code already provides: skills (slash commands), agents (spawnable subprocesses), hooks (shell triggers), and a persistent file-based memory system. The result is five architectural layers that turn a single-turn tool into something closer to a cognitive operating system—one that remembers what it learned last week, reviews its own work with adversarial fresh eyes, and restructures its reasoning on demand through empirically validated linguistic constraints.
This post is the full tour. Every component is generalizable. Nothing here requires my specific project context to work.
The five layers
Everything stacks. Each layer depends on the ones below it but can function independently. You could adopt just the lenses, or just the memory system, or just the adversarial review skill. They compose, but they don’t require each other.
Layer 1: Orchestration skills
A skill is a slash command: you type /parallax or /research and it expands into a multi-step orchestration plan that spawns agents, coordinates their work, and synthesizes the output. The user-facing interface is one word. The machinery behind it can involve three to five parallel agents, each with different constraints, converging into a single synthesized answer.
Five skills form the orchestration layer. Each one implements a different pattern for deploying cognitive diversity.
The common pattern: spawn diverse agents in parallel, let them work independently, then synthesize. The synthesis step is where the real value lives—surfacing blind spots, divergences, and consensus that no single agent would produce alone.
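In Python pseudocode, the fan-out-and-synthesize pattern looks roughly like this; `run_agent` is a stub standing in for spawning a real lensed subprocess, and the lens names are illustrative:

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Stub standing in for spawning a lensed subagent; in the real system
# each agent is a Claude Code subprocess with its own fresh context.
def run_agent(lens: str, question: str) -> str:
    return f"[{lens}] analysis of: {question}"

def parallax(question: str, lenses: list[str]) -> str:
    # Fan out: each lens works independently, in parallel.
    with ThreadPoolExecutor(max_workers=len(lenses)) as pool:
        outputs = list(pool.map(lambda lens: run_agent(lens, question), lenses))
    # Shuffle before synthesis so ordering carries no information about
    # which agent produced which output.
    random.shuffle(outputs)
    return "\n---\n".join(outputs)
```

The shuffle is the cheap stand-in for the real design choice: the synthesizer reads outputs without attribution, so no single agent's framing dominates.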
Layer 2: Cognitive lenses
I wrote about lenses in a previous post; this is the update after running a 15,600-trial experiment across six frontier models. The hypothesis held: vocabulary constraints measurably restructure model reasoning. But the surprise was which constraints work best.
Lenses come in two families. Vocabulary constraints ban specific words or grammatical structures, forcing the model to route around its default patterns. Reasoning modifiers prescribe analytical frameworks that change the shape of the output without touching vocabulary.
Vocabulary constraints (empirically validated)
These three constraints were tested in a controlled experiment: 130 analytical items, 6 models, 5 conditions each, with stochastic replication. The accuracy gains are real and statistically significant (p < 0.001 by Fisher’s exact test).
The punchline is counterintuitive: the lightest constraint wins. The regularizer just bans 18 filler words—very, quite, rather, somewhat, really, pretty, just, simply, basically, actually, literally, definitely, certainly, obviously, clearly, essentially, virtually, practically—and it produces the largest accuracy gain with near-zero compliance cost. E-Prime, the deepest constraint (banning all forms of “to be”), improves accuracy less and costs dramatically more: 45.6% of trials need retries for compliance violations.
Why? The working theory: shallow constraints disrupt the model’s default “fluency autopilot”—the pattern of generating smooth, confident-sounding text that papers over analytical gaps. Removing filler words doesn’t change what the model can think, but it removes the linguistic shorthand that lets it skip thinking. Deep constraints do something different: they force genuinely novel cognitive patterns (E-Prime makes you describe relationships instead of categories), which produces more diverse reasoning but isn’t as reliably more accurate.
This maps directly to infrastructure. The regularizer auto-applies to every unlensed agent as a free accuracy boost. The deeper constraints deploy when you want cognitive diversity, not when you want precision.
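As a sketch, the regularizer's compliance check reduces to a word filter plus a retry loop. The `violations` and `generate_compliant` helpers below are illustrative, not the production implementation:

```python
import re

# The 18 filler words the regularizer bans (from the experiment above).
BANNED = {
    "very", "quite", "rather", "somewhat", "really", "pretty", "just",
    "simply", "basically", "actually", "literally", "definitely",
    "certainly", "obviously", "clearly", "essentially", "virtually",
    "practically",
}

def violations(text: str) -> list[str]:
    """Return banned filler words present in the text, in order."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w in BANNED]

# Hypothetical retry loop: regenerate until the output complies,
# feeding the violations back into the prompt each round.
def generate_compliant(generate, prompt: str, max_retries: int = 3) -> str:
    for _ in range(max_retries + 1):
        out = generate(prompt)
        bad = violations(out)
        if not bad:
            return out
        prompt += f"\nAvoid these words: {', '.join(bad)}."
    return out
```

The retry cost mentioned above (45.6% for E-Prime) is exactly the number of passes through a loop like this that exceed one.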
Reasoning modifiers
Each lens lives in a single markdown file in .claude/agents/. Add a new file, and every skill that composes lenses can use it immediately. No code changes. The directory is the interface.
Lens profiles for Parallax
Parallax ships with pre-configured lens combinations tuned for specific problem types.
Layer 3: Epistemic memory
Most AI memory systems work like append-only logs: save facts, retrieve facts, done. This one works more like a belief system. Every piece of stored knowledge carries metadata about how confident the system should be in it—and that confidence changes over time.
Beliefs, not facts
A belief isn’t a fact. It’s an interpretation that the system holds with some degree of confidence, tracked explicitly:
```yaml
conf: 0.65            # developing — seen twice, could be situational
permanence: durable   # working pattern, ~year timescale
first: 2026-03-15
confirmed: 2026-03-28 # last session consistent with this
challenged: null      # nothing contradicted it yet
```
The confidence scale runs from speculative (0.0–0.2, inferred but never observed) through established (0.7–0.8, consistent across many sessions) to factual (0.9–1.0, reserved for verified facts only). Confidence gets earned through repeated observation, not assigned by how insightful a belief sounds.
Dormancy decay
Here’s the part that surprised me most in practice: confidence isn’t static. If the system goes dormant (30+ days without sessions), beliefs attenuate exponentially based on how permanent they are.
A belief like “heritage is Ghanaian” (stable, factual) barely decays across years. A belief like “currently frustrated with the auth migration” (situational, 0.5 confidence) is nearly at the floor after two months. This makes staleness visible as a gradient rather than a cliff edge—and prevents the system from confidently acting on months-old situational reads.
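A minimal sketch of the decay rule, with hypothetical half-lives per permanence class (the real system's constants may differ):

```python
import math

# Hypothetical half-lives by permanence class; illustrative values only.
HALF_LIFE_DAYS = {
    "situational": 10,   # fades within days once dormant
    "durable": 365,      # working patterns, ~year timescale
    "stable": 3650,      # near-factual, barely decays
}

def decayed_confidence(conf: float, permanence: str, dormant_days: int,
                       threshold: int = 30, floor: float = 0.05) -> float:
    """Exponential attenuation, applied only past a dormancy threshold."""
    if dormant_days <= threshold:
        return conf
    excess = dormant_days - threshold
    factor = 0.5 ** (excess / HALF_LIFE_DAYS[permanence])
    return max(floor, conf * factor)

# "currently frustrated" after two months dormant: near the floor.
print(decayed_confidence(0.5, "situational", 60))
# "heritage is Ghanaian" after a year dormant: barely moved.
print(decayed_confidence(0.95, "stable", 365))
```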
Tensions, not overwrites
When new observations contradict existing beliefs, most systems silently update. This one creates a tension—a dated record that says “belief A and observation B don’t agree, and we’re watching.” Real people are contradictory. A profile that fits too neatly is probably just a comfortable reduction.
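A sketch of the update rule, with illustrative penalty and reinforcement constants. The point is structural: contradiction appends a dated tension and nudges confidence down; it never silently replaces the claim.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Belief:
    claim: str
    conf: float
    tensions: list[str] = field(default_factory=list)

def integrate(belief: Belief, observation: str, contradicts: bool) -> Belief:
    if contradicts:
        # Log the disagreement and keep watching; the claim survives.
        belief.tensions.append(f"{date.today().isoformat()}: {observation}")
        belief.conf = max(0.1, belief.conf - 0.1)   # hypothetical penalty
    else:
        belief.conf = min(0.9, belief.conf + 0.05)  # hypothetical reinforcement
    return belief
```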
Layer 4: Reflection
The memory system stores beliefs. The reflection layer is what maintains them—the self-watching machinery that keeps the belief system honest over time.
Observe
The observe skill runs at two speeds simultaneously. Live mode fires mid-conversation when it notices a judgment call, pushback, process signal, or unprompted reveal. Harvest mode runs at the end of every substantive session as a backstop. Both write to the same observation log:
```
Suggests: Prefers pragmatic grouping over conventional small-PR discipline for refactors.
Therefore: For future refactors in this area, default to bundled PRs unless scope is truly independent.
Connects to: PR-style preference (new)
Strength: single
```
Quality gate: nothing gets written unless the Therefore field would change future behavior. No journaling for journaling’s sake.
Mirror
The mirror skill has four modes that run on a cadence (every 10 sessions or monthly):
- Portrait—prose character study with blind spots and contradictions
- Audit—epistemological dashboard: confidence distributions, drift, staleness flags
- Gut-check—quick validation of 6 beliefs (3 strongest, 3 weakest)
- Interview—8–12 structured questions generated from blind spots and tensions
The Portrait mode also promotes observations to beliefs: when 2+ observations cluster around the same pattern, they graduate to a belief at low initial confidence (0.30–0.40). The system learns, but slowly and skeptically.
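The promotion rule can be sketched as a simple clustering threshold; the field names here are illustrative:

```python
from collections import Counter

# Sketch of Portrait's promotion rule: when two or more observations
# share a "connects to" tag, graduate the pattern to a belief at low
# initial confidence.
def promote(observations: list[dict], min_cluster: int = 2,
            initial_conf: float = 0.35) -> list[dict]:
    tags = Counter(o["connects_to"] for o in observations)
    return [
        {"claim": tag, "conf": initial_conf, "source": "promoted"}
        for tag, n in tags.items() if n >= min_cluster
    ]
```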
Harvest
End-of-session knowledge extraction. Scans the conversation for insights, decisions, and facts, then proposes them as new or updated entries in a knowledge graph. Dual-writes everything: native storage and the graph, with bidirectional wikilinks for discoverability. Runs observe as its final step, making it a single atomic operation: extract knowledge and collect behavioral evidence.
Layer 5: Hooks & automation
The glue layer. Shell hooks fire on specific events; instruction-level auto-behaviors handle the rest.
| Trigger | What it does | Why it matters |
|---|---|---|
| SessionStart | Detects pending session handoffs, reports system status | Picks up where the last session left off without manual context-loading |
| SessionStart | Re-injects CLAUDE.md after context compaction | Long conversations compress old messages; this ensures instructions survive |
| Pre-Edit | Protected file guard | Prevents accidental writes to locked files (analysis plans, configs) |
| Post-Edit | Auto-format HTML | Keeps HTML files formatted consistently after every change |
| End of session | Auto-harvest (instruction-level) | Extracts knowledge + runs observe automatically. No manual trigger needed |
| Mid-session | Live observe (instruction-level) | Ambient behavioral evidence collection. Writes observations in real time |
The last two aren’t traditional shell hooks—they’re behaviors defined in CLAUDE.md (the master instruction file) that the model executes as part of its conversational loop. The distinction matters: shell hooks are deterministic triggers; instruction-level behaviors are probabilistic but pervasive. Together they create a system that maintains itself.
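For concreteness, a protected-file guard in the spirit of the Pre-Edit hook might look like the sketch below. The lock list and the assumption that the target path arrives as `argv[1]` are illustrative, not Claude Code's actual hook payload format:

```python
import sys
from pathlib import Path

# Hypothetical lock list; a real guard would read this from config.
PROTECTED = {"analysis-plan.md", "settings.json"}

def is_protected(path: str) -> bool:
    return Path(path).name in PROTECTED

def main() -> int:
    # Assumption: the hook receives the target path as argv[1].
    target = sys.argv[1] if len(sys.argv) > 1 else ""
    if is_protected(target):
        print(f"blocked: {target} is locked", file=sys.stderr)
        return 2  # nonzero exit rejects the edit
    return 0
```

The deterministic half of the glue layer is exactly this shape: a small script, an exit code, and an event that fires it.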
How the pieces wire together
Three primary composition patterns emerge from the architecture. Each one connects layers that were designed independently but compose naturally.
Analysis pipeline
Orient selects the right tools. Research gathers evidence across facets. Parallax triangulates the decision through diverse lenses. You decide with the full picture.
Production pipeline
Build the artifact. Scrutinize it with fresh-context adversarial review. Apply agreed changes. Harvest extracts what you learned and observes how you worked.
Learning loop (cross-session)
Extensibility: files, not code
The architecture grows by file addition. No code edits required for any of these:
- New lens: create `.claude/agents/my-lens.md` with a `type: lens` header and a constraint body. Immediately available to Parallax, Research, Constrain, and Orient.
- New skill: create `.claude/skills/my-skill/SKILL.md` with trigger definitions and instructions. Available as `/my-skill` instantly.
- New memory category: create an index file. The memory routing skill auto-discovers it.
- New Parallax profile: edit one table in the skill file. Three lenses, a name, done.
This is what makes the system generative rather than just configurable. The directory structure is the API. Anyone who can write a markdown file can extend the cognitive architecture.
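Auto-discovery over a directory of markdown files takes only a few lines. This sketch assumes a minimal `---`-delimited frontmatter containing a `type: lens` line; the exact frontmatter keys are illustrative:

```python
from pathlib import Path

def discover_lenses(root: str = ".claude/agents") -> list[str]:
    """Treat the directory as the API: every lens-typed file is a lens."""
    lenses = []
    for md in sorted(Path(root).glob("*.md")):
        parts = md.read_text().split("---")
        head = parts[1] if len(parts) >= 3 else ""  # frontmatter block
        if "type: lens" in head:
            lenses.append(md.stem)
    return lenses
```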
Design principles
After months of building and iterating, these are the principles that survived contact with daily use:
- Composition over monoliths. Skills orchestrate agents. Agents are parameterized by lenses. Small, recombinant pieces scale better than feature-rich single tools. A new lens multiplies the capability of every skill that uses lenses.
- Parallel execution by default. Research spawns parallel investigators. Parallax runs three lenses simultaneously. Constrain tests two agents side by side. Fresh contexts prevent groupthink and surface genuine blind spots. Serial execution is the exception.
- Shallow constraints beat deep ones. The biggest empirical surprise. Banning 18 filler words (+7.3pp accuracy) outperforms banning all forms of “to be” (+3.7pp). The regularizer auto-applies to every unlensed agent as a free accuracy boost. Deep constraints deploy when you want diversity, not precision.
- Confidence is earned, not assigned. Beliefs start at low confidence and climb through repeated confirming observations. Dormancy decay makes staleness visible. Contradictions create tensions rather than silent overwrites. The system stays humble about what it knows.
- Behavior over self-report. The observe skill tracks what users do under pressure—not what they say about themselves. When the two diverge, a tension is logged rather than smoothing over the contradiction.
- Fresh-context critique as default. Scrutinize’s critic has no creation history. It reads the artifact cold. This prevents the author bias that makes it nearly impossible to review your own work honestly. The same principle applies to every synthesizer: it reads agent outputs without knowing which agent produced which output.
- The directory is the interface. New capabilities are files, not code changes. The system grows by addition. This makes it inspectable (read the files to understand the system), forkable (copy the directory), and collaborative (PRs are just markdown changes).
- Reflexive observation. The system observes itself observing the user. Live observation catches real-time signals. Harvest backstop catches what live mode missed. Mirror synthesizes both into profile updates. No external auditor required—the review loop is built into the architecture itself.
The theoretical foundation for this work is Umwelt Engineering—a framework for designing the linguistic world an AI agent can think in. Lenses implement umwelt constraints. The empirical grounding comes from a 15,600-trial experiment across six frontier models, which will be published as “Cognitive Restructuring in Frontier Language Models.” The core finding: even trivial vocabulary constraints produce statistically significant reasoning improvements, because they subtract the linguistic shortcuts that let models skip past analytical depth.
Everything described here is built from Claude Code’s existing extension points: skills, agents, hooks, and file-based memory. No custom APIs, no model modifications, no external services. Just markdown files and shell scripts, composed into something that thinks about how it thinks.
Built and maintained with Claude Code. The infrastructure described in this post is the same infrastructure that helped write this post.