More tokens doesn't mean better results
TL;DR
- Tested a research-level problem (random walk on 2D torus)
- More web searches, more tokens, more instructions → same error
- The model consumed so many resources the chat errored with “overflow”
- Lesson: when it doesn’t understand the problem, more resources = more rationalization
The problem
I tried a research-level problem: calculate the probability that a random walk on a 2D torus visits the origin before returning to the starting point.
Correct answer: e^(-π/2) ≈ 0.208
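To make the setup concrete, here is a minimal Monte Carlo sketch (mine, not part of the original experiment) of the quantity being asked for: simulate a simple random walk on an N × N torus and count how often it reaches the origin before it first returns to its starting point x₀. The torus size, starting point, and trial count below are illustrative assumptions; the limit e^(-π/2) presumably corresponds to a particular scaling of those parameters that this post doesn't spell out.

```python
import random

def estimate_hit_before_return(N, x0, trials=10_000, seed=0):
    """Monte Carlo estimate of P(the walk started at x0 visits the origin
    before it first returns to x0) for simple random walk on the N x N torus.
    N, x0 and trials are placeholder choices, not the post's exact setup."""
    rng = random.Random(seed)
    origin = (0, 0)
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    hits = 0
    for _ in range(trials):
        x, y = x0
        while True:
            dx, dy = rng.choice(steps)
            x, y = (x + dx) % N, (y + dy) % N
            if (x, y) == origin:      # reached the origin first
                hits += 1
                break
            if (x, y) == x0:          # returned to the start first
                break
    return hits / trials

# Example with hypothetical parameters: estimate_hit_before_return(20, (10, 10))
```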
Strategy 1: Web search
| Setup | Result |
|---|---|
| No tools | Made up formulas (0, 1/e, 1/2) |
| With internet | Found the correct theory, extracted a wrong value |
| With the hint “that value is wrong” | Fixed the value, misapplied the formula |
Each layer of tools helped partially but introduced new errors.
Strategy 2: Exhaustive meta-prompt
I designed a prompt that instructed the model to:
- Search multiple sources
- Verify every extracted value
- Compare results between papers
- Only respond when everything matches
Result: the model ran so many searches and context compactions that the chat errored out with “no more compacts allowed.” It was the first time I had seen that error.
And the final answer, after consuming all those resources: 1/2 (incorrect, and the same simple heuristic as before).
Why it happened
The model used an elegant but incorrect argument:
“The origin and x₀ share 2 of 4 neighbors, so the probability is 1/2”
When it doesn’t understand the underlying problem, more resources just mean more space to rationalize the wrong answer.
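The frustrating part is that this kind of claim is cheap to check without any extra searches or tokens. Below is a minimal sketch (mine, not the model's output) that computes the exact hitting probability on a small N × N torus by solving the discrete harmonic system, assuming x₀ is a diagonal neighbor of the origin so that the two points share 2 of 4 neighbors, as in the model's argument. Even on a 3 × 3 torus the exact value is 0.45 rather than 1/2, which at least shows the shared-neighbors argument is not a derivation.

```python
import numpy as np
from itertools import product

def hit_before_return(N, x0):
    """Exact P(walk started at x0 hits the origin before returning to x0)
    for simple random walk on the N x N torus.

    h(y) = P_y(hit origin before x0) is harmonic away from the two special
    points, with h(origin) = 1 and h(x0) = 0; the answer is the average of
    h over the four neighbors of x0 (the walk's first step)."""
    origin = (0, 0)
    nodes = [p for p in product(range(N), repeat=2) if p not in (origin, x0)]
    idx = {p: i for i, p in enumerate(nodes)}

    def neighbors(p):
        a, b = p
        return [((a + 1) % N, b), ((a - 1) % N, b),
                (a, (b + 1) % N), (a, (b - 1) % N)]

    A = np.eye(len(nodes))
    rhs = np.zeros(len(nodes))
    for p in nodes:
        i = idx[p]
        for q in neighbors(p):
            if q == origin:
                rhs[i] += 0.25          # neighbor contributes h(origin) = 1
            elif q != x0:               # q == x0 contributes h(x0) = 0
                A[i, idx[q]] -= 0.25
    h = {origin: 1.0, x0: 0.0}
    for p, value in zip(nodes, np.linalg.solve(A, rhs)):
        h[p] = float(value)
    return float(np.mean([h[q] for q in neighbors(x0)]))

# hit_before_return(3, (1, 1)) gives 0.45, not the model's 1/2.
```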
Lesson
| More X | Better result? |
|---|---|
| More thinking tokens | ❌ If it doesn’t know, it rationalizes |
| More web searches | ⚠️ Can extract wrong data |
| More compactions | ❌ Loses useful context |
| More instructions | ❌ Can ignore them |
Prompt engineering has a ceiling. For problems that require specialized technical knowledge the model doesn't have, no prompt will produce the right answer.
This is the third experiment in the series, which started with the model that wouldn't commit.
To know when to use elaborate prompts and when not to, read my taxonomy of LLM failures.
This post is part of my series on the limits of prompting. For a complete view, read my prompt engineering guide.