Training Language Models via Neural Cellular Automata by Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal (2026)

This note was initially drafted with LLM assistance. Generated notes are periodically reviewed and revised by the author.
tags
Cellular automata, Neural cellular automata, Transfer learning, Kolmogorov complexity
source
(Lee et al. 2026)

Summary

This paper proposes using neural cellular automata (NCA) as a source of synthetic, non-linguistic data for pre-pre-training large language models: an initial training phase on synthetic data that precedes standard pre-training on natural language corpora.

The authors use 2D discrete NCA on a 12x12 grid with 10 states, where the transition rule is parameterized by a randomly sampled neural network (3x3 convolution + MLP). Trajectories are tokenized into patches and serialized as sequences for autoregressive training with a 1.6B parameter Llama-based transformer.
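The generation setup can be sketched in a few lines of NumPy; the hidden width, weight scales, and one-hot encoding below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def random_nca_rule(n_states=10, hidden=32, rng=None):
    """Sample a random transition rule mapping a 3x3 neighborhood to a next state.

    A tiny randomly initialized network (linear -> ReLU -> linear) stands in for
    the paper's 3x3 convolution + MLP; shapes and scales here are assumptions.
    """
    rng = rng or np.random.default_rng()
    w1 = rng.normal(0, 1.0, size=(9 * n_states, hidden))
    w2 = rng.normal(0, 1.0, size=(hidden, n_states))

    def rule(neigh_onehot):  # (H*W, 9*n_states) -> (H*W,) next states
        h = np.maximum(neigh_onehot @ w1, 0)
        return (h @ w2).argmax(axis=1)

    return rule

def step(grid, rule, n_states=10):
    """One synchronous NCA update on a toroidal (wraparound) grid."""
    H, W = grid.shape
    # Gather each cell's 3x3 neighborhood via shifted copies of the grid.
    neigh = np.stack([np.roll(grid, (dy, dx), axis=(0, 1))
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)], axis=-1)
    onehot = np.eye(n_states)[neigh].reshape(H * W, 9 * n_states)
    return rule(onehot).reshape(H, W)

# Roll a random rule and unroll a short trajectory on a 12x12, 10-state grid.
rng = np.random.default_rng(0)
grid = rng.integers(0, 10, size=(12, 12))
rule = random_nca_rule(rng=rng)
trajectory = [grid]
for _ in range(16):
    trajectory.append(step(trajectory[-1], rule))
```

Each trajectory would then be cut into patches and serialized into a token sequence for the autoregressive objective.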

The key finding is that pre-pre-training on only 164M NCA tokens improves downstream language modeling perplexity by up to 6% and accelerates convergence by 1.6x across web text, math, and code domains. Remarkably, this outperforms pre-pre-training on 1.6B tokens of natural language (C4) with more compute. The gains transfer to reasoning benchmarks including GSM8K, HumanEval, and BigBench-Lite.

The paper investigates what drives transfer through ablations. Attention layers capture the most transferable computational primitives, while MLP layers encode domain-specific statistics. The optimal NCA complexity varies by downstream domain: code benefits from simpler dynamics (30 to 40% gzip compressibility), while web text and math favor more complex ones (50%+). This aligns with the intrinsic Kolmogorov complexity of the target corpora, suggesting that matching synthetic data complexity to the target domain maximizes transfer.
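The gzip-compressibility filter described above can be sketched as follows; the byte-level serialization and the band thresholds are assumptions for illustration:

```python
import gzip
import numpy as np

def gzip_ratio(trajectory):
    """Compression ratio of a serialized trajectory: compressed bytes / raw bytes.

    Lower ratios indicate more regular (simpler) dynamics, higher ratios more
    incompressible (complex) ones. Serializing states as raw uint8 bytes is an
    assumption; the paper's exact encoding may differ.
    """
    raw = np.concatenate([g.ravel() for g in trajectory]).astype(np.uint8).tobytes()
    return len(gzip.compress(raw)) / len(raw)

def in_band(trajectory, lo=0.30, hi=0.40):
    """Keep only rules whose trajectories fall in a target compressibility band,
    e.g. 30-40% for code-targeted pre-pre-training per the finding above."""
    return lo <= gzip_ratio(trajectory) <= hi
```

A degenerate all-zeros trajectory compresses almost completely, while a near-random one stays close to its entropy floor, so the ratio separates the two regimes cleanly.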

Key Ideas

  • Pre-pre-training paradigm: a three-stage pipeline (synthetic pre-pre-training → natural language pre-training → fine-tuning) where NCA data instills transferable computational priors before exposure to language.
  • NCA as synthetic data generators: randomly parameterized NCA produce diverse spatiotemporal dynamics with Zipfian token distributions resembling natural language, while being fully controllable and cheap to generate.
  • Complexity-based sampling: NCA rules are filtered by gzip compression ratio of their trajectories, providing a practical proxy for complexity metrics and enabling systematic control over data complexity.
  • Attention carries transferable structure: re-initialization experiments show attention layers account for the majority of transfer, acting as universal carriers of in-context learning and long-range dependency tracking. MLPs encode domain-specific patterns.
  • Domain-targeted complexity tuning: the optimal NCA complexity band varies by downstream task and can be tuned to match the intrinsic compressibility of the target corpus, a degree of control that natural language pre-training does not offer.
  • 164M synthetic tokens > 1.6B natural tokens: NCA pre-pre-training provides a purer signal for in-context rule inference, as every sequence requires inferring a hidden transition rule from context, avoiding the semantic shortcuts present in natural text.
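The re-initialization ablation behind the attention-vs-MLP finding can be mimicked on a toy state dict; the parameter names, dict layout, and init scale below are hypothetical, not the paper's actual model:

```python
import numpy as np

def reinit_modules(params, reinit_prefixes, rng=None):
    """Re-initialize a subset of transformer parameters selected by name.

    `params` is a flat dict mapping names like "layers.0.attn.wq" or
    "layers.0.mlp.w1" to weight arrays (a simplified stand-in for a real
    state dict). Re-initializing "mlp." weights while keeping "attn."
    weights tests whether attention carries the transferred structure.
    """
    rng = rng or np.random.default_rng()
    out = {}
    for name, w in params.items():
        if any(p in name for p in reinit_prefixes):
            out[name] = rng.normal(0, 0.02, size=w.shape)  # fresh random init
        else:
            out[name] = w.copy()  # keep pre-pre-trained weights
    return out
```

Comparing downstream loss after re-initializing only MLPs versus only attention would then attribute the transfer to whichever component's removal hurts more.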

Comments

This is a striking result that connects cellular automata research with LLM training in an unexpected way. The finding that 164M tokens of NCA data outperform 1.6B tokens of natural language for pre-pre-training is surprising and suggests that the structure of training data, not its semantic content, is what matters most for acquiring general computational primitives. An open question is whether the warm start depends on NCA dynamics specifically, or whether any sufficiently structured input tokens would confer the same benefit.

The connection to in-context learning is compelling: since each NCA sequence is generated by a unique random rule, next-token prediction forces the model to perform implicit Bayesian inference over the latent rule, exactly the mechanism underlying in-context learning in transformers.

The paper is reminiscent of the cellular automata as CNNs perspective from (Mordvintsev et al. 2020), but applied in the opposite direction: rather than using neural networks to simulate CA, this work uses CA dynamics to train neural networks.

The complexity-matching finding is also noteworthy: it suggests that effective synthetic pre-training requires calibrating the statistical properties of the synthetic data to match the target domain, which connects to broader questions about Kolmogorov complexity and what structural features matter for learning useful representations (Cisneros et al. 2019).

Connections

Bibliography

  1. Lee, Dan, Seungwook Han, Akarsh Kumar, and Pulkit Agrawal. 2026. "Training Language Models via Neural Cellular Automata". https://arxiv.org/abs/2603.10055.
  2. Mordvintsev, Alexander, Ettore Randazzo, Eyvind Niklasson, and Michael Levin. 2020. "Growing Neural Cellular Automata". Distill 5 (2): e23.
  3. Cisneros, Hugo, Josef Sivic, and Tomas Mikolov. 2019. "Evolving Structures in Complex Systems". In 2019 IEEE Symposium Series on Computational Intelligence (SSCI), 230–37. IEEE.