The Full Stack: How I Engineered a Cognitive Operating System Inside Claude Code

Claude Code ships as a CLI coding assistant. It reads files, writes code, runs tests. What it doesn’t ship with is opinions about how to think. No reasoning strategies, no memory across sessions, no way to challenge its own blind spots. It’s a capable mind with no cognitive architecture around it.

Over the past several months I’ve built that architecture. Not by modifying Claude’s weights or calling custom APIs—just by composing the extension points Claude Code already provides: skills (slash commands), agents (spawnable subprocesses), hooks (shell triggers), and a persistent file-based memory system. The result is five architectural layers that turn a single-turn tool into something closer to a cognitive operating system—one that remembers what it learned last week, reviews its own work with adversarial fresh eyes, and restructures its reasoning on demand through empirically validated linguistic constraints.

This post is the full tour. Every component is generalizable. Nothing here requires my specific project context to work.

The five layers

Everything stacks. Each layer depends on the ones below it but can function independently. You could adopt just the lenses, or just the memory system, or just the adversarial review skill. They compose, but they don’t require each other.

01 Orchestration Skills that spawn parallel agent teams and synthesize results

02 Cognitive Lenses Language constraints that change how agents reason (+7.3pp accuracy)

03 Epistemic Memory Beliefs with confidence metadata, dormancy decay, and tension tracking

04 Reflection Mirror, Harvest, Observe — the system that watches itself learning

05 Hooks & Automation Shell triggers and instruction-level auto-behaviors that wire it all together

Layer 1: Orchestration skills

A skill is a slash command: you type /parallax or /research and it expands into a multi-step orchestration plan that spawns agents, coordinates their work, and synthesizes the output. The user-facing interface is one word. The machinery behind it can involve three to five parallel agents, each with different constraints, converging into a single synthesized answer.

Five skills form the orchestration layer. Each one implements a different pattern for deploying cognitive diversity.

◎

Orient

/orient “I need to decide X”

The chief of staff. Surveys the full toolkit—every skill, lens, agent, and native capability—then recommends a deployment strategy for your specific problem. It doesn’t execute. It teaches you your own toolkit by explaining which tools to reach for and in what order.

◆

Research

/research “How do X and Y interact?”

Decomposes a question into 2–5 independent facets, spawns a parallel researcher agent for each one, then synthesizes findings. Each researcher works independently (preventing groupthink). The synthesizer maps agreement, divergence, and coverage gaps.

◈

Parallax

/parallax --profile architecture “Should we use X?”

Runs the same query through three cognitively diverse agents simultaneously, each constrained by a different lens. A synthesizer reads all three outputs blind and maps: blind spots (what only one lens caught), divergences (conflicting framings), and consensus (where all three agree).

◉

Constrain

/constrain --depth shallow “Analyze this code”

Tests how a specific linguistic constraint affects problem-solving. Spawns two parallel agents: one constrained, one baseline. The synthesizer compares where they agree, where they diverge, and whether the constraint exposed anything the baseline missed. Three depth levels: shallow (regularizer), medium (no-have), deep (E-Prime).

◇

Scrutinize

/scrutinize paper2.md

Adversarial architecture review. Spawns a fresh-context critic that reads your artifact cold—no creation history, no author bias. The critic delivers structured critique. Then a multi-round dialogue unfolds: advocate responds, critic rebuts, points get resolved into agreed changes vs. contested points. Fresh eyes catch what the builder’s eye skips.

The common pattern: spawn diverse agents in parallel, let them work independently, then synthesize. The synthesis step is where the real value lives—surfacing blind spots, divergences, and consensus that no single agent would produce alone.

Orient recommends tools → skills spawn parallel agents → synthesizer integrates

Layer 2: Cognitive lenses

If I wrote about lenses in my previous post, this is the update after running a 15,600-trial experiment across six frontier models. The hypothesis held: vocabulary constraints measurably restructure model reasoning. But the surprise was which constraints work best.

Lenses come in two families. Vocabulary constraints ban specific words or grammatical structures, forcing the model to route around its default patterns. Reasoning modifiers prescribe analytical frameworks that change the shape of the output without touching vocabulary.

Vocabulary constraints (empirically validated)

These three constraints were tested in a controlled experiment: 130 analytical items, 6 models, 5 conditions each, with stochastic replication. The accuracy gains are real and statistically significant (p < 0.001 by Fisher’s exact test).

Accuracy gain over unconstrained baseline

Regularizer

+7.3pp

No-Have

+5.4pp

E-Prime

+3.7pp

The punchline is counterintuitive: the lightest constraint wins. The regularizer just bans 18 filler words—very, quite, rather, somewhat, really, pretty, just, simply, basically, actually, literally, definitely, certainly, obviously, clearly, essentially, virtually, practically—and it produces the largest accuracy gain with near-zero compliance cost. E-Prime, the deepest constraint (banning all forms of “to be”), improves accuracy less and costs dramatically more: 45.6% of trials need retries for compliance violations.

Why? The working theory: shallow constraints disrupt the model’s default “fluency autopilot”—the pattern of generating smooth, confident-sounding text that papers over analytical gaps. Removing filler words doesn’t change what the model can think, but it removes the linguistic shorthand that lets it skip thinking. Deep constraints do something different: they force genuinely novel cognitive patterns (E-Prime makes you describe relationships instead of categories), which produces more diverse reasoning but isn’t as reliably more accurate.

This maps directly to infrastructure. The regularizer auto-applies to every unlensed agent as a free accuracy boost. The deeper constraints deploy when you want cognitive diversity, not when you want precision.

Reasoning modifiers

Counterfactual

For every assumption, explore what breaks if it were false. Finds hidden dependencies.

Analogical

Map findings to cross-domain parallels (biology, architecture, cooking) before stating them directly.

Minimal

Absolute minimum words. Every word earns its place or gets cut. No jargon, no hedging.

Evidential

Tag every claim: [observed], [inferred], [assumed], [uncertain]. Separates knowledge from guessing.

First Principles

Derive from axioms only. No conventions, no best practices, no appeals to authority.

Process-Only

Everything expressed as process, flow, and transformation. Nothing “is” anything; things happen.

Steel-Man

Build the strongest possible defense of the status quo before identifying where it breaks.

Each lens lives in a single markdown file in .claude/agents/. Add a new file, and every skill that composes lenses can use it immediately. No code changes. The directory is the interface.

Lens profiles for Parallax

Parallax ships with pre-configured lens combinations tuned for specific problem types:

Pre-configured lens combinations — each profile optimized for a specific problem type

Layer 3: Epistemic memory

Most AI memory systems work like append-only logs: save facts, retrieve facts, done. This one works more like a belief system. Every piece of stored knowledge carries metadata about how confident the system should be in it—and that confidence changes over time.

Beliefs, not facts

A belief isn’t a fact. It’s an interpretation that the system holds with some degree of confidence, tracked explicitly:

belief: “Prefers bundled PRs over many small ones for refactors”
conf: 0.65   # developing — seen twice, could be situational
permanence: durable   # working pattern, ~year timescale
first: 2026-03-15
confirmed: 2026-03-28   # last session consistent with this
challenged: null   # nothing contradicted it yet

The confidence scale runs from speculative (0.0–0.2, inferred but never observed) through established (0.7–0.8, consistent across many sessions) to factual (0.9–1.0, reserved for verified facts only). Confidence gets earned through repeated observation, not assigned by how insightful a belief sounds.

Dormancy decay

Here’s the part that surprised me most in practice: confidence isn’t static. If the system goes dormant (30+ days without sessions), beliefs attenuate exponentially based on how permanent they are:

Effective confidence after 6 months dormant

Stable

~2yr half-life

Durable

~5mo half-life

Situational

~6wk half-life

A belief like “heritage is Ghanaian” (stable, factual) barely decays across years. A belief like “currently frustrated with the auth migration” (situational, 0.5 confidence) is nearly floor after two months. This makes staleness visible as a gradient rather than a cliff edge—and prevents the system from confidently acting on months-old situational reads.

Tensions, not overwrites

When new observations contradict existing beliefs, most systems silently update. This one creates a tension—a dated record that says “belief A and observation B don’t agree, and we’re watching.” Real people are contradictory. A profile that fits too neatly is probably just a comfortable reduction.

Layer 4: Reflection

The memory system stores beliefs. The reflection layer is what maintains them—the self-watching machinery that keeps the belief system honest over time.

Observe

Running at two speeds simultaneously. Live mode fires mid-conversation when it notices a judgment call, pushback, process signal, or unprompted reveal. Harvest mode runs at the end of every substantive session as a backstop. Both write to the same observation log:

Evidence: User pushed back on splitting the PR, said “one bundled PR is the right call here.”
Suggests: Prefers pragmatic grouping over conventional small-PR discipline for refactors.
Therefore: For future refactors in this area, default to bundled PRs unless scope is truly independent.
Connects to: PR-style preference (new)
Strength: single

Quality gate: nothing gets written unless the Therefore field would change future behavior. No journaling for journaling’s sake.

Mirror

Four modes that run on cadence (every 10 sessions or monthly):

Portrait—prose character study with blind spots and contradictions
Audit—epistemological dashboard: confidence distributions, drift, staleness flags
Gut-check—quick validation of 6 beliefs (3 strongest, 3 weakest)
Interview—8–12 structured questions generated from blind spots and tensions

The Portrait mode also promotes observations to beliefs: when 2+ observations cluster around the same pattern, they graduate to a belief at low initial confidence (0.30–0.40). The system learns, but slowly and skeptically.

Harvest

End-of-session knowledge extraction. Scans the conversation for insights, decisions, and facts, then proposes them as new or updated entries in a knowledge graph. Dual-writes everything: native storage and the graph, with bidirectional wikilinks for discoverability. Runs observe as its final step, making it a single atomic operation: extract knowledge and collect behavioral evidence.

Layer 5: Hooks & automation

The glue layer. Shell hooks fire on specific events; instruction-level auto-behaviors handle the rest.

Trigger	Hook	What it does
SessionStart	Detects pending session handoffs, reports system status	Picks up where the last session left off without manual context-loading
SessionStart	Re-injects CLAUDE.md after context compaction	Long conversations compress old messages; this ensures instructions survive
Pre-Edit	Protected file guard	Prevents accidental writes to locked files (analysis plans, configs)
Post-Edit	Auto-format HTML	Keeps HTML files formatted consistently after every change
End of session	Auto-harvest (instruction-level)	Extracts knowledge + runs observe automatically. No manual trigger needed
Mid-session	Live observe (instruction-level)	Ambient behavioral evidence collection. Writes observations in real time

The last two aren’t traditional shell hooks—they’re behaviors defined in CLAUDE.md (the master instruction file) that the model executes as part of its conversational loop. The distinction matters: shell hooks are deterministic triggers; instruction-level behaviors are probabilistic but pervasive. Together they create a system that maintains itself.

How the pieces wire together

Three primary composition patterns emerge from the architecture. Each one connects layers that were designed independently but compose naturally.

Analysis pipeline

Orient

→

Research

→

Parallax

→

Decision

Orient selects the right tools. Research gathers evidence across facets. Parallax triangulates the decision through diverse lenses. You decide with the full picture.

Production pipeline

Build

→

Scrutinize

→

Revise

→

Harvest

Build the artifact. Scrutinize it with fresh-context adversarial review. Apply agreed changes. Harvest extracts what you learned and observes how you worked.

Learning loop (cross-session)

Conversations produce observations → observations accumulate → Mirror synthesizes into belief updates → beliefs inform future conversations

Extensibility: files, not code

The architecture grows by file addition. No code edits required for any of these:

New lens: create .claude/agents/my-lens.md with a type: lens header and a constraint body. Immediately available to Parallax, Research, Constrain, and Orient.
New skill: create .claude/skills/my-skill/SKILL.md with trigger definitions and instructions. Available as /my-skill instantly.
New memory category: create an index file. The memory routing skill auto-discovers it.
New Parallax profile: edit one table in the skill file. Three lenses, a name, done.

This is what makes the system generative rather than just configurable. The directory structure is the API. Anyone who can write a markdown file can extend the cognitive architecture.

Design principles

After months of building and iterating, these are the principles that survived contact with daily use:

Composition over monoliths. Skills orchestrate agents. Agents are parameterized by lenses. Small, recombinant pieces scale better than feature-rich single tools. A new lens multiplies the capability of every skill that uses lenses.
Parallel execution by default. Research spawns parallel investigators. Parallax runs three lenses simultaneously. Constrain tests two agents side by side. Fresh contexts prevent groupthink and surface genuine blind spots. Serial execution is the exception.
Shallow constraints beat deep ones. The biggest empirical surprise. Banning 18 filler words (+7.3pp accuracy) outperforms banning all forms of “to be” (+3.7pp). The regularizer auto-applies to every unlensed agent as a free accuracy boost. Deep constraints deploy when you want diversity, not precision.
Confidence is earned, not assigned. Beliefs start at low confidence and climb through repeated confirming observations. Dormancy decay makes staleness visible. Contradictions create tensions rather than silent overwrites. The system stays humble about what it knows.
Behavior over self-report. The observe skill tracks what users do under pressure—not what they say about themselves. When the two diverge, a tension is logged rather than smoothing over the contradiction.
Fresh-context critique as default. Scrutinize’s critic has no creation history. It reads the artifact cold. This prevents the author bias that makes it nearly impossible to review your own work honestly. The same principle applies to every synthesizer: it reads agent outputs without knowing which agent produced which output.
The directory is the interface. New capabilities are files, not code changes. The system grows by addition. This makes it inspectable (read the files to understand the system), forkable (copy the directory), and collaborative (PRs are just markdown changes).
Reflexive observation. The system observes itself observing the user. Live observation catches real-time signals. Harvest backstop catches what live mode missed. Mirror synthesizes both into profile updates. No external auditor required—the review loop is built into the architecture itself.

The theoretical foundation for this work is Umwelt Engineering—a framework for designing the linguistic world an AI agent can think in. Lenses implement umwelt constraints. The empirical grounding comes from a 15,600-trial experiment across six frontier models, which will be published as “Cognitive Restructuring in Frontier Language Models.” The core finding: even trivial vocabulary constraints produce statistically significant reasoning improvements, because they subtract the linguistic shortcuts that let models skip past analytical depth.

Everything described here is built from Claude Code’s existing extension points: skills, agents, hooks, and file-based memory. No custom APIs, no model modifications, no external services. Just markdown files and shell scripts, composed into something that thinks about how it thinks.

Built and maintained with Claude Code. The infrastructure described in this post is the same infrastructure that helped write this post.

The Full Stack: A Cognitive Operating System Inside Claude Code

The five layers

Layer 1: Orchestration skills

Layer 2: Cognitive lenses

Vocabulary constraints (empirically validated)

Reasoning modifiers

Lens profiles for Parallax

Layer 3: Epistemic memory

Beliefs, not facts

Dormancy decay

Tensions, not overwrites

Layer 4: Reflection

Observe

Mirror

Harvest

Layer 5: Hooks & automation

How the pieces wire together

Analysis pipeline

Production pipeline

Learning loop (cross-session)

Extensibility: files, not code

Design principles