Recent large language models (LLMs), such as OpenAI's o1/o3, DeepSeek's R1, and Anthropic's Claude 3.7, demonstrate that allowing a model to think longer and deeper at test time can significantly improve its reasoning capability. The core technique underlying this deep-thinking ability is chain-of-thought (CoT), where the model iteratively generates intermediate reasoning steps and appends them to the current context until producing the final answer.
However, as tasks become increasingly complex, the number of steps needed to solve them grows dramatically. For instance, consider solving NP-hard problems with CoT: the reasoning trace would inevitably span exponentially many steps, assuming a fixed-size Transformer as the base model and P ≠ NP. This raises an important question:
Will CoT-based test-time scaling hit hard ceilings?
Unfortunately, the answer is probably yes. Several limitations emerge for harder tasks: (1) chains will inevitably exceed the model's context window, (2) critical information becomes buried and nearly impossible to retrieve from the mass of preceding tokens, and (3) the quadratic cost of self-attention makes generating each new token prohibitively expensive.

In this article, we challenge the conventional "write-only" CoT reasoning paradigm that dominates current LLM architectures, from both theoretical and practical perspectives. We then explore a fundamentally different reasoning approach that allows the LLM not only to generate thoughts, but also to erase thoughts. This ability to erase thoughts not only offers significant practical benefits in performance and efficiency, but also proves fundamental for achieving optimal reasoning efficiency from the perspective of computational theory.
This post is based on the paper C. Yang et al., "PENCIL: Long thoughts with short memory", accepted at the International Conference on Machine Learning (ICML) 2025, a collaboration with Nathan Srebro, David McAllester, and Zhiyuan Li. Code is also available.

Not Everything Needs to Be Remembered
The idea of selectively discarding information has deep roots in the history of computer science, from the earliest computational models to modern systems. The classic Turing machine overwrites symbols on its tape rather than preserving every state; programming languages reclaim memory through stack frames that are automatically released when functions finish executing; and modern garbage collectors continuously identify and remove objects that are no longer reachable by the program. These mechanisms were not merely efficiency optimizations; they were essential design choices that made complex computation possible within finite resources.
This idea also applies to human reasoning. In theorem proving, once a lemma is established, we discard its detailed derivation while preserving the result; when exploring problem-solving approaches, we simply mark unproductive paths as "failed" without retaining their full traces. Throughout complex reasoning, we naturally compress information, retaining conclusions while discarding the scaffolding used to reach them.
✏️ PENCIL: A New Reasoning Paradigm
We therefore propose ✏️ PENCIL, a new reasoning paradigm for LLMs. Unlike ✒️ CoT, which only generates thoughts, PENCIL recursively generates and erases thoughts until reaching the final answer. It maintains only the minimal context required for generating future thoughts, so the model can think longer and deeper, solving harder tasks with a shorter working memory. The following figure illustrates how PENCIL works.

How Do Models Erase Thoughts?
PENCIL's erasure mechanism draws on two classical ideas. The first is rewriting rules, from logic and classical automated theorem proving, which repeatedly apply predefined rules to simplify complex logical or arithmetic expressions into canonical forms until a final answer is reached. The second is stack frames, from functional programming languages, which are created to store local variables when a function is called and released when it returns, automatically discarding intermediate states that are no longer needed.
Specifically, we introduce three special tokens, [CALL], [SEP], and [RETURN], and use the following reduction rule to implement erasure:
C [CALL] T [SEP] A [RETURN]   ⟹   C A
where C stands for the context, T for the intermediate thoughts, and A for the answer. Whenever the generated sequence completely matches the pattern on the left, PENCIL triggers the reduction rule, erasing the thoughts and merging the answer back into the context. It is important to note that C, T, and A can themselves contain special tokens, which supports recursive structures similar to nested function calls; for example, C may contain another [CALL] token, indicating that a new thinking subroutine has been initiated.
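To make the mechanics concrete, here is a minimal Python sketch of the reduction step over a flat list of string tokens (the matching strategy and the helper function are our own illustration, not the paper's released implementation):

```python
# Sketch of PENCIL's reduction rule over string tokens (illustrative only).
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_once(tokens):
    """Apply C [CALL] T [SEP] A [RETURN] => C A once.

    Fires when the sequence ends in [RETURN]; matches the last [SEP] and
    the last [CALL] before it, so nested calls reduce innermost-first.
    """
    if not tokens or tokens[-1] != RETURN:
        return tokens
    sep = max(i for i, t in enumerate(tokens) if t == SEP)
    call = max(i for i, t in enumerate(tokens[:sep]) if t == CALL)
    context = tokens[:call]        # C; may itself contain an earlier [CALL]
    answer = tokens[sep + 1:-1]    # A; the [RETURN] token is dropped
    return context + answer        # the thoughts T are erased

# A nested call: the inner subroutine reduces first, and its answer "a2"
# becomes part of the still-pending outer call's thoughts.
seq = ["x", CALL, "t1", CALL, "t2", SEP, "a2", RETURN]
print(reduce_once(seq))   # ['x', '[CALL]', 't1', 'a2']
```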
How to Use PENCIL?
PENCIL's erasure mechanism flexibly supports various reasoning patterns, such as:
1️⃣ Task Decomposition: use [CALL] to initiate a subproblem, generate intermediate results, and then use [SEP] and [RETURN] to merge the output and erase the subproblem's reasoning details;
2️⃣ Branch and Backtrack: use a [CALL], [SEP], [RETURN] triplet to manage each exploration branch of a search tree, erasing invalid paths upon conflicts or failures (see the solver sketch in the SAT example below);
3️⃣ Summarization / Tail Recursion: condense a lengthy reasoning trace into a concise summary, similar to tail-recursion optimization in programming:
C [CALL] T [SEP] [CALL] T′ [RETURN]   ⟹   C [CALL] T′
where T represents the original lengthy reasoning process (or a harder problem), and T′ represents the summarized or simplified content (or an equivalent, more tractable problem). Note this is just the general reduction rule with the answer A = [CALL] T′.
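Continuing the sketch above (same assumed token vocabulary), the summarization pattern is the general rule applied when the answer opens a fresh [CALL]:

```python
# Summarization / tail recursion: the "answer" opens a new [CALL],
# so the lengthy trace t1 t2 t3 is replaced by its summary t'.
seq = ["ctx", CALL, "t1", "t2", "t3", SEP, CALL, "t'", RETURN]
print(reduce_once(seq))   # ['ctx', '[CALL]', "t'"]
```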
Example on an NP-Complete Task
As an example, consider the classic NP-complete problem of Boolean satisfiability (SAT): given a Boolean formula, determine whether there exists a variable assignment that makes it true. This problem is widely believed to require exponential time, yet it needs only polynomial space to solve; the simplest approach traverses a binary search tree of depth n.
Traditional CoT accumulates intermediate calculations, so the context length grows in proportion to the number of nodes in the search tree, i.e., exponentially as O(2^n). In comparison, PENCIL can recursively branch to try True/False for a variable, backtrack upon conflict, and erase all thoughts within the failed branch. This keeps the context length proportional to the search depth, i.e., a space complexity of only O(n).
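As an illustration of the branch-and-backtrack pattern, here is a hypothetical sketch (our own trace format, not the paper's) of a backtracking SAT solver that emits PENCIL-style tokens: each branch opens a [CALL], and a failed branch returns only its verdict, so all of its thoughts get erased on reduction:

```python
def satisfied(formula, assignment):
    """Check a full assignment: every clause needs one true literal."""
    return all(any(assignment[abs(l)] == (l > 0) for l in clause)
               for clause in formula)

def solve(formula, n, assignment, emit):
    """Branch on variables 1..n, emitting a PENCIL-style token trace."""
    var = len(assignment) + 1
    if var > n:
        return satisfied(formula, assignment)
    for value in (True, False):
        emit(["[CALL]", f"x{var}={value}"])      # open a branch
        assignment[var] = value
        if solve(formula, n, assignment, emit):
            emit(["[SEP]", "SAT", "[RETURN]"])   # keep only the verdict
            return True
        del assignment[var]
        emit(["[SEP]", "UNSAT", "[RETURN]"])     # branch erased on reduction
    return False

# (x1 or x2) and (not x1 or x2): satisfiable, e.g. with x2=True.
trace = []
print(solve([[1, 2], [-1, 2]], 2, {}, trace.extend))  # True
```

After every [RETURN], the reduction rule collapses the finished branch, so the live context never holds more than one open [CALL] frame per level of the depth-n search tree.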
The following figure compares the maximum context length of vanilla CoT without reduction (blue) and PENCIL with reduction (red). As problem complexity increases, PENCIL achieves dramatic space efficiency; notably, it reduces the context length from 151,192 to just 3,335 tokens on Einstein's Puzzle.

Training and Experiments
The core difference between CoT and PENCIL during training lies in how the loss function is computed. For CoT, the loss for each new token is conditioned on the entire historical context; writing x_{1:t} for the first t tokens, the objective is the usual

L_CoT(θ) = − Σ_t log p_θ(x_{t+1} | x_{1:t}).

For PENCIL, after each "write-erase" iteration the model computes the loss for new tokens only on the reduced sequence: writing σ_i for the sequence left after the i-th reduction, the tokens generated in iteration i+1 are conditioned on σ_i rather than on the full generation history. Although both paradigms generate the same number of tokens, PENCIL substantially shortens the context length attached to each token and is thus more efficient.
It is also worth noting that after each reduction, the KV cache for the shared prefix C can be reused directly; only the cache for the (shorter) answer A needs to be recomputed.
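Here is a minimal PyTorch-style sketch of this training scheme (our own illustration; the data layout and masking are assumptions, not the paper's code). Each write-erase iteration contributes an ordinary next-token loss, but computed over the reduced sequence and restricted to the newly generated tokens:

```python
import torch
import torch.nn.functional as F

def pencil_loss(model, iterations):
    """Sum of next-token losses over write-erase iterations.

    `iterations` is a list of (prefix, new_tokens) pairs of 1D LongTensors:
    `prefix` is the reduced sequence carried over from earlier reductions
    (assumed non-empty; it at least contains the prompt), and `new_tokens`
    are the tokens generated in this iteration. Hypothetical interface.
    """
    total = 0.0
    for prefix, new_tokens in iterations:
        seq = torch.cat([prefix, new_tokens]).unsqueeze(0)  # (1, L)
        logits = model(seq)                                 # (1, L, vocab)
        # Position len(prefix)-1 predicts the first new token, and so on;
        # the prefix tokens themselves incur no loss.
        pred = logits[0, len(prefix) - 1 : -1]              # (len(new), vocab)
        total = total + F.cross_entropy(pred, new_tokens)
    return total
```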
Experimental Results
Our experiments focus on three inherently hard reasoning tasks: 3-SAT (NP-complete), QBF (PSPACE-complete), and Einstein's Puzzle (natural-language reasoning). For each task, we wrote a generator to produce training sets that include the special tokens. We train small Transformers (10.6M parameters for SAT/QBF; 25.2M parameters for Einstein's Puzzle) from random initialization.
📊 Compared to CoT, we find that PENCIL can solve larger-scale reasoning problems. As shown in the figure below, on SAT (left) and QBF (right), both CoT and PENCIL solve problems perfectly when the problem size is small; but as the size increases, traditional CoT's accuracy drops significantly (e.g., to only about 50% on SAT at n=10), while PENCIL maintains high accuracy ≥ 99%. This is mainly because CoT's context length explodes exponentially, whereas PENCIL avoids the explosion through dynamic reduction.

⚡️ Moreover, PENCIL significantly saves computational resources. As shown in the figure below, on QBF (n=3 to 6) we compared the convergence speed of CoT (blue) and PENCIL (red) under the same FLOPs budget. PENCIL quickly reaches 100% accuracy, while CoT, whose context keeps growing, requires far more FLOPs to approach optimality. The gap becomes more pronounced as the problem size increases.

Figure: convergence of CoT (blue) and PENCIL (red) on QBF (n = 3 to 6) under the same FLOPs budget. Circles and vertical lines indicate the first time each method reaches optimal performance.
🧩 We further considered a very difficult logical-reasoning problem: Einstein's Puzzle. Each instance consists of 5 houses and 5 attribute categories for the people living in them: color, nationality, drink, cigarette, and pet (e.g., Red/Green/Blue, Brit/German/Swede, Bird/Dog/Fish, etc.). Given clues like "the green house is right next to the bird owner's" and "the dog owner lives in the red house," the task is to deduce "who owns the fish?" This problem is an extreme challenge for current LLMs: even GPT-4 struggles to solve it. The figure below shows a simplified version with only 3 houses and 3 attribute categories:

As shown below, on this problem that even large models struggle with, PENCIL achieves 97% accuracy using only a small 25.2M-parameter model, whereas traditional CoT reaches only 25% accuracy (close to random guessing).

Theory: Universal Efficient Computation
We further demonstrate PENCIL's fundamental advantage over traditional CoT from the perspective of theoretical expressive power: PENCIL is Turing-complete with optimal space complexity, and can therefore solve arbitrary computable tasks efficiently. This is fundamentally impossible for CoT!
Main Results
Specifically, we prove: using a fixed, finite-size Transformer, PENCIL can simulate any Turing machine with optimal time and space complexity, and can thereby efficiently solve all computable problems.

In other words, for any Turing machine running in time T and space S, PENCIL needs only O(T) tokens, while maintaining a maximum context length of O(S), to produce identical results. Earlier work established that traditional CoT can make Transformers Turing-complete, but it demands a context length of O(T), with each token representing one intermediate computation step. This difference in maximum context length is crucial because, for most algorithms, the space complexity S is dramatically smaller than the time complexity T, especially for harder problems.
Consider NP-complete problems like Traveling Salesman or Hamiltonian Circuit, which are widely believed to require exponential time but are solvable in polynomial space. Traditional CoT cannot solve them within any polynomial bound on context length; it requires at least exponential length, exceeding the practical memory limits of any real system. PENCIL, in contrast, can solve them with only a polynomial maximum context length, making previously intractable reasoning tasks feasible.
Proof Sketch
We now briefly sketch the proof, whose key insight is to have PENCIL use a series of "Simulation-Summarization" iterations to clean up the memory.

Step 1: Using CoT to Encode Turing Machine Transitions. As illustrated in the left part of the figure above, we encode each Turing machine state transition as a token whose embedding represents the triplet ("new state", "written symbol", "head movement direction"). The model can use self-attention to compute the current head position and determine the symbol at that position. Without reduction, this process generates T tokens with context length O(T).
Step 2: Alternating "Simulation-Summarization". PENCIL achieves space/time optimality by alternating between:
- Simulation: continuously generate Turing machine state-transition tokens, simulating multiple computation steps;
- Summarization: once the new tokens exceed twice the space needed, summarize the computation using S tokens. The reduction rule then discards the earlier thoughts, keeping only the latest Turing machine configuration for the next round.
This strategy keeps the total number of generated tokens at O(T) while capping the context length at O(S).
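A self-contained toy sketch of this alternation (the machine encoding and token format are our own choices for illustration, not the construction used in the proof):

```python
def run_pencil_style(delta, state, tape, head, space_bound):
    """Simulate a Turing machine while keeping the context at O(S) tokens.

    delta maps (state, symbol) -> (new_state, written_symbol, move),
    with move in {-1, +1}; `tape` maps positions to symbols.
    """
    context = [("SUMMARY", state, tuple(sorted(tape.items())), head)]
    while state != "halt":
        # Simulation: one transition token per machine step.
        sym = tape.get(head, 0)
        state, tape[head], move = delta[(state, sym)]
        head += move
        context.append(("STEP", state, tape[head - move], move))
        # Summarization: once the context outgrows ~2x the space bound,
        # reduce it to a single O(S) snapshot of the configuration.
        if len(context) > 2 * space_bound:
            context = [("SUMMARY", state, tuple(sorted(tape.items())), head)]
    return state, context

# Toy machine: write 1 and move right three times, then halt.
delta = {("s", 0): ("s1", 1, +1), ("s1", 0): ("s2", 1, +1),
         ("s2", 0): ("halt", 1, +1)}
state, ctx = run_pencil_style(delta, "s", {}, 0, space_bound=1)
print(state, len(ctx))   # halt 2: summarization kept the context short
```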
Step 3: Transformer Implementation. To prove this process can be implemented by a Transformer, we developed the Full-Access Sequence Processing (FASP) programming language and proved that any algorithm written in FASP can be implemented by a fixed-size Transformer. In a FASP program, each variable corresponds to a Transformer sub-module, and each line of code transforms existing variables into a new variable through predefined functions, which amounts to constructing a more complex Transformer from sub-modules. The variable returned by the program is the desired Transformer encoding the algorithm. We wrote a FASP program that implements the "Simulation-Summarization" procedure, which implies there exists a constant-size Transformer that performs the same function.
Conclusion
In conclusion, we propose PENCIL, a new reasoning paradigm that alternates between generating and erasing thoughts, enabling models to think deeper and solve more complicated problems. Theoretically, we prove that PENCIL is Turing-complete with optimal time and space efficiency, and can therefore solve any computable problem efficiently. Looking ahead, a promising direction is to fine-tune LLMs to incorporate PENCIL's memory-efficient reasoning. We hope these findings will inspire a re-examination of current reasoning models from the perspective of the theory of computation.
