Discovery Through Disconfirmation
May 2026
In March 2026 I published a paper arguing that deep linguistic constraints — E-Prime, No-Have, constructions that force entire grammatical categories out of reach — restructure how language models reason. Not just what they say. How they think. The claim rested on a specific theoretical commitment: I believed strong Sapir-Whorf was literally true for token predictors. Language is cognition for an LLM. Change the available vocabulary and you change the available computation.
The results looked good. A 16-agent ensemble under diverse linguistic constraints hit 100% accuracy on a benchmark where the unconstrained control scored 88%. A minimal subset — three agents — matched the full ensemble. The cross-model correlation between constraint effect and model capability was −0.75: what helped one model hurt another, suggesting constraints weren’t just adding effort but genuinely redirecting cognition.
I believed the theory. So I designed the experiment that would close the confounds Paper 1 left open. It closed them — and killed the theory in the process.
The active controls
Paper 1 had a baseline problem. Every constrained condition was compared to unconstrained English. That leaves a confound wide open: maybe the improvement comes from the model trying harder because it received a longer, more detailed prompt. Maybe it comes from the metacognitive overhead of monitoring its own output for violations. Neither of those requires the constraint to be linguistically meaningful. It just requires the constraint to exist.
Paper 2 added two active controls to close that gap.
The first was an elaborated prompt — matching the constraint prompts in length, detail, and metacognitive framing, but imposing no vocabulary restriction. If prompt complexity alone explained the improvement, this condition would match the constrained ones.
The second was the one that mattered: a neutral word ban. It banned twenty semantically inert filler words — intensifiers and hedges like very, quite, rather, somewhat, really, pretty, just, basically, actually, literally. This matched the metalinguistic self-monitoring load of E-Prime and No-Have without targeting any reasoning-relevant vocabulary. If the neutral ban showed null effects while the real constraints showed gains, the monitoring confound would be dead empirically. The distinction between deep and shallow constraints would hold.
That was the prediction. The neutral ban was supposed to be the null.
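For concreteness, here is roughly what that kind of compliance check looks like in code. This is a minimal sketch assuming simple word-level matching; the list below uses the ten example words named above, not the paper's full twenty-word list, and the exact matching rules in the experiments may differ.

```python
import re

# Illustrative ban list: the ten example words named in the text.
# The paper's full twenty-word list is not reproduced here.
NEUTRAL_BAN = {
    "very", "quite", "rather", "somewhat", "really",
    "pretty", "just", "basically", "actually", "literally",
}

def find_violations(text: str, banned: set[str] = NEUTRAL_BAN) -> list[str]:
    """Return every banned word that appears in the model's output."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(set(tokens) & banned)

def is_compliant(text: str) -> bool:
    """A trial counts as compliant only if the output contains no banned word."""
    return not find_violations(text)
```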
What the data showed
15,600 trials. Five conditions. Six models — three frontier, three small. 130 items across seven task types. Four repetitions per cell.
Control: 83.0%.
Neutral word ban: 89.7%.
No-Have: 88.4%.
Elaborated prompt: 87.2%.
E-Prime at full compliance: 86.7%.
The null condition outperformed every real constraint. Banning filler words improved reasoning more than banning all forms of to be.
Constraint depth — how much of the language the ban removes, how deeply it restructures available grammar — tracked inversely with effect size. The deeper the constraint, the smaller the improvement. E-Prime, the deepest, performed worst among the treatment conditions. The ordering is suggestive with four conditions, not a proven functional relationship — but the direction was the opposite of what the theory predicted.
And the cross-model correlation from Paper 1 — the −0.75 that suggested constraints redirect cognition differently depending on the model’s internal structure? It didn’t replicate. Paper 2: r = 0.005. Functionally zero.
What went wrong with the theory
The theory predicted that deeper constraints would produce larger effects because they force more radical cognitive restructuring. The data showed the opposite. That kills the causal mechanism. Constraints don’t improve reasoning by redirecting cognition into structurally different pathways. They improve reasoning by disrupting default token generation.
Any vocabulary ban — deep or shallow — pushes the model off its most probable output distribution. The model can’t produce its default phrasing, so it searches for alternatives. That search, in itself, functions as a regularizer. It breaks the ruts.
This regularization account explains the depth inversion. Shallow constraints are easy to comply with: the model spends minimal effort on compliance and captures the full regularization benefit. Deep constraints impose real compliance overhead: the model burns tokens monitoring for violations, retrying phrasing, working around grammatical gaps. E-Prime's position at the bottom may also reflect the overhead of its compliance retry loop (45.6% of E-Prime trials required a retry, a confound unique to that condition). The net gain shrinks as depth increases.
E-Prime doesn’t make a model think differently. It makes a model think less efficiently about thinking differently. The data are consistent with regularization as a mechanism — disruption of default generation patterns — though this interpretation is post-hoc and would need direct testing to confirm.
What survived
The results don’t invalidate the research program. They redirect it.
Every constrained condition outperformed the unconstrained control. Vocabulary bans improve LLM reasoning — reliably, across models, across task types. That finding is robust. Fourteen of seventy pooled comparisons survived false discovery rate correction, though the differences between treatment conditions (1–3 percentage points) are mostly not statistically distinguishable from each other. The rank ordering is a trend, not a certainty.
Not every model follows the pattern. Sonnet reverses the depth inversion entirely — E-Prime ranks second, the neutral ban ranks last. The paper discusses model-level variation in detail. The pooled result holds, but individual models can diverge.
The elaborated prompt — the “just try harder” control — also improved performance (+4.2 points over control), but the neutral ban beat it by another 2.5 points. So it’s not just prompt length and it’s not just effort. There’s something specific about vocabulary restriction that goes beyond metacognitive framing.
The practical upshot is striking: the cheapest, simplest constraint — banning a list of filler words — produces the largest improvement. You don’t need a theory of linguistic relativity. You don’t need E-Prime or Korzybski or Jakob von Uexküll. You need twenty banned words and a compliance check. Whether this translates outside the experimental setup — different benchmarks, different prompt formats — remains to be tested.
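A minimal sketch of that recipe, under stated assumptions: call_model is a placeholder for whatever chat-completion client you already use, the ban list holds only the ten example words named earlier, and the prompt wording and retry policy are my own illustration, not the paper's protocol.

```python
import re

# Hypothetical deployment sketch. call_model is a placeholder for whatever
# chat-completion client you already use. The list holds the ten example
# words named in the text, not the paper's full twenty-word list.
BANNED = ["very", "quite", "rather", "somewhat", "really",
          "pretty", "just", "basically", "actually", "literally"]

SYSTEM_PROMPT = (
    "Answer the question carefully and completely. "
    "Do not use any of these words: " + ", ".join(BANNED) + "."
)

def constrained_answer(question, call_model, max_retries=2):
    """Ask for an answer; if it contains a banned word, ask for a rewrite."""
    reply = call_model(system=SYSTEM_PROMPT, user=question)
    for _ in range(max_retries):
        hits = [w for w in BANNED
                if re.search(rf"\b{re.escape(w)}\b", reply, re.IGNORECASE)]
        if not hits:  # compliant: keep this reply
            break
        reply = call_model(
            system=SYSTEM_PROMPT,
            user=(question + "\n\nYour previous answer used the banned words "
                  + ", ".join(hits) + ". Rewrite it without them."),
        )
    return reply
```

Since shallow bans are easy to comply with, the retry branch should fire far less often than E-Prime's did in the experiments.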
What I got wrong
The Umwelt Engineering framework — the idea that you can design the linguistic world an AI agent inhabits, shaping its cognition by constraining its available vocabulary — was built on the assumption that the content of the constraint matters. That banning to be forces a specific kind of operational, process-oriented thinking. That banning possessive to have forces relational reasoning. That the constraint’s semantic target determines the cognitive effect.
The data say no. The constraint's semantic target is largely irrelevant. What matters is the disruption itself: the forcing function that pushes the model off its default distribution. The content of the ban is secondary to the fact of the ban.
This doesn’t make the Umwelt framework useless. It makes it a design lens rather than a causal theory. It’s still a useful way to think about what you’re doing when you constrain an agent’s language. But the mechanism it posited — deep linguistic restructuring of cognition — isn’t what the data support. The frame survives as metaphor. The causal claim doesn’t.
Why publish the disconfirmation
There’s a version of this story where Paper 2 never gets written. Paper 1 has a clean narrative, an interesting result, a novel framework. It stands on its own. The active controls that would challenge it are exactly the ones no reviewer demanded — I designed them myself, because I knew the confounds were open.
The reason to run the experiment anyway is the same reason the experiment was worth designing: you don’t learn anything from a theory you won’t test. The most informative result is the one that surprises you. Paper 1 told me what I expected to hear. Paper 2 told me something I didn’t expect, and the field is better for it.
The finding that trivial vocabulary bans outperform deep linguistic constraints is more useful than the finding that deep constraints work. It’s cheaper to implement, easier to explain, and harder to argue with. It points practitioners toward a technique they can deploy in five minutes — ban filler words from your agent’s system prompt — rather than one that requires a philosophy of language.
Discovery through disconfirmation. You learn the most when you’re wrong.
Written with AI assistance. Paper 1: arXiv:2603.27626. Paper 2: arXiv:2604.02699. Code: github.com/rodspeed/umwelt-engineering.