Training Language Models via Neural Cellular Automata by Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal (2026)

This note was initially drafted with LLM assistance. Generated notes are periodically reviewed and revised by the author.
tags
Cellular automata, Neural cellular automata, Transfer learning, Kolmogorov complexity
source
(Lee et al. 2026)

Summary

This paper proposes using neural cellular automata (NCA) as a source of synthetic, non-linguistic data for pre-pre-training large language models: an initial training phase on synthetic data that precedes standard pre-training on natural language corpora.

The authors use 2D discrete NCA on a 12x12 grid with 10 states, where the transition rule is parameterized by a randomly sampled neural network (3x3 convolution + MLP). Trajectories are tokenized into patches and serialized as sequences for autoregressive training with a 1.6B parameter Llama-based transformer.
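The generation setup can be sketched in a few lines of NumPy; the hidden width, weight scales, and one-hot encoding below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def random_nca_rule(n_states=10, hidden=32, rng=None):
    """Sample a random transition rule mapping a 3x3 neighborhood to a next state.

    A tiny randomly initialized network (linear -> ReLU -> linear) stands in for
    the paper's 3x3 convolution + MLP; shapes and scales here are assumptions.
    """
    rng = rng or np.random.default_rng()
    w1 = rng.normal(0, 1.0, size=(9 * n_states, hidden))
    w2 = rng.normal(0, 1.0, size=(hidden, n_states))

    def rule(neigh_onehot):  # (H*W, 9*n_states) -> (H*W,) next states
        h = np.maximum(neigh_onehot @ w1, 0)
        return (h @ w2).argmax(axis=1)

    return rule

def step(grid, rule, n_states=10):
    """One synchronous NCA update on a toroidal (wraparound) grid."""
    H, W = grid.shape
    # Gather each cell's 3x3 neighborhood via shifted copies of the grid.
    neigh = np.stack([np.roll(grid, (dy, dx), axis=(0, 1))
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)], axis=-1)
    onehot = np.eye(n_states)[neigh].reshape(H * W, 9 * n_states)
    return rule(onehot).reshape(H, W)

# Roll a random rule and unroll a short trajectory on a 12x12, 10-state grid.
rng = np.random.default_rng(0)
grid = rng.integers(0, 10, size=(12, 12))
rule = random_nca_rule(rng=rng)
trajectory = [grid]
for _ in range(16):
    trajectory.append(step(trajectory[-1], rule))
```

Each trajectory would then be cut into patches and serialized into a token sequence for the autoregressive objective.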

The key finding is that pre-pre-training on only 164M NCA tokens improves downstream language modeling perplexity by up to 6% and accelerates convergence by 1.6x across web text, math, and code domains. Remarkably, this outperforms pre-pre-training on 1.6B tokens of natural language (C4) with more compute. The gains transfer to reasoning benchmarks including GSM8K, HumanEval, and BigBench-Lite.

The paper investigates what drives transfer through ablations. Attention layers capture the most transferable computational primitives, while MLP layers encode domain-specific statistics. The optimal NCA complexity varies by downstream domain: code benefits from simpler dynamics (30 to 40% gzip compressibility), while web text and math favor more complex ones (50%+). This aligns with the intrinsic Kolmogorov complexity of the target corpora, suggesting that matching synthetic data complexity to the target domain maximizes transfer.
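The gzip-compressibility filter described above can be sketched as follows; the byte-level serialization and the band thresholds are assumptions for illustration:

```python
import gzip
import numpy as np

def gzip_ratio(trajectory):
    """Compression ratio of a serialized trajectory: compressed bytes / raw bytes.

    Lower ratios indicate more regular (simpler) dynamics, higher ratios more
    incompressible (complex) ones. Serializing states as raw uint8 bytes is an
    assumption; the paper's exact encoding may differ.
    """
    raw = np.concatenate([g.ravel() for g in trajectory]).astype(np.uint8).tobytes()
    return len(gzip.compress(raw)) / len(raw)

def in_band(trajectory, lo=0.30, hi=0.40):
    """Keep only rules whose trajectories fall in a target compressibility band,
    e.g. 30-40% for code-targeted pre-pre-training per the finding above."""
    return lo <= gzip_ratio(trajectory) <= hi
```

A degenerate all-zeros trajectory compresses almost completely, while a near-random one stays close to its entropy floor, so the ratio separates the two regimes cleanly.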

Key Ideas

  • Pre-pre-training paradigm: a three-stage pipeline (synthetic pre-pre-training → natural language pre-training → fine-tuning) where NCA data instills transferable computational priors before exposure to language.
  • NCA as synthetic data generators: randomly parameterized NCA produce diverse spatiotemporal dynamics with Zipfian token distributions resembling natural language, while being fully controllable and cheap to generate.
  • Complexity-based sampling: NCA rules are filtered by gzip compression ratio of their trajectories, providing a practical proxy for complexity metrics and enabling systematic control over data complexity.
  • Attention carries transferable structure: re-initialization experiments show attention layers account for the majority of transfer, acting as universal carriers of in-context learning and long-range dependency tracking. MLPs encode domain-specific patterns.
  • Domain-targeted complexity tuning: the optimal NCA complexity band varies by downstream task and can be tuned to match the intrinsic compressibility of the target corpus, a degree of control that natural language pre-training does not offer.
  • 164M synthetic tokens > 1.6B natural tokens: NCA pre-pre-training provides a purer signal for in-context rule inference, as every sequence requires inferring a hidden transition rule from context, avoiding the semantic shortcuts present in natural text.
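The re-initialization ablation behind the attention-vs-MLP finding can be mimicked on a toy state dict; the parameter names, dict layout, and init scale below are hypothetical, not the paper's actual model:

```python
import numpy as np

def reinit_modules(params, reinit_prefixes, rng=None):
    """Re-initialize a subset of transformer parameters selected by name.

    `params` is a flat dict mapping names like "layers.0.attn.wq" or
    "layers.0.mlp.w1" to weight arrays (a simplified stand-in for a real
    state dict). Re-initializing "mlp." weights while keeping "attn."
    weights tests whether attention carries the transferred structure.
    """
    rng = rng or np.random.default_rng()
    out = {}
    for name, w in params.items():
        if any(p in name for p in reinit_prefixes):
            out[name] = rng.normal(0, 0.02, size=w.shape)  # fresh random init
        else:
            out[name] = w.copy()  # keep pre-pre-trained weights
    return out
```

Comparing downstream loss after re-initializing only MLPs versus only attention would then attribute the transfer to whichever component's removal hurts more.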

Comments

This is a striking result that connects cellular automata research with LLM training in an unexpected way. The finding that 164M tokens of NCA data outperform 1.6B tokens of natural language for pre-pre-training is surprising and suggests that the structure of training data, not its semantic content, is what matters most for acquiring general computational primitives. An open question is whether the warm start depends on NCA dynamics specifically, or whether any sufficiently structured input tokens would confer the same benefit.

The connection to in-context learning is compelling: since each NCA sequence is generated by a unique random rule, next-token prediction forces the model to perform implicit Bayesian inference over the latent rule, exactly the mechanism underlying in-context learning in transformers.

The paper is reminiscent of the cellular automata as CNNs perspective from (Mordvintsev et al. 2020), but applied in the opposite direction: rather than using neural networks to simulate CA, this work uses CA dynamics to train neural networks.

The complexity-matching finding is also noteworthy: it suggests that effective synthetic pre-training requires calibrating the statistical properties of the synthetic data to match the target domain, which connects to broader questions about Kolmogorov complexity and what structural features matter for learning useful representations (Cisneros et al. 2019).

Connections

Bibliography

  1. Lee, Dan, Seungwook Han, Akarsh Kumar, and Pulkit Agrawal. 2026. "Training Language Models via Neural Cellular Automata". https://arxiv.org/abs/2603.10055.
  2. Mordvintsev, Alexander, Ettore Randazzo, Eyvind Niklasson, and Michael Levin. 2020. "Growing Neural Cellular Automata". Distill 5 (2): e23.
  3. Cisneros, Hugo, Josef Sivic, and Tomas Mikolov. 2019. "Evolving Structures in Complex Systems". In 2019 IEEE Symposium Series on Computational Intelligence (SSCI), 230–37. IEEE.