2. Insightful Limits Reveal Fundamental Behavior

"Appropriate asymptotic perspectives often render otherwise intractable systems analytically tractable." Modern deep learning systems regularly involve hundreds of interacting architectural components comprising hundreds of billions of parameters trained on trillions of tokens. Constructing microscopic theories that track every individual parameter in such practical setups seems all but hopeless. Fortunately, complex systems often simplify when approximated as effectively infinite in size, revealing simple mathematical structures that remain deeply informative about the original finite systems.

A. The Infinite Width Limit & The Lazy/Rich Dichotomy

Focuses on the behavior of networks as the number of hidden-layer neurons approaches infinity, separating the lazy (kernel) regime, where features stay frozen at their initial values, from the rich (mean-field) regime, where representations adapt during training.

Neal [1996] — Priors for infinite networks
Poole et al. [2016] — Exponential expressivity in deep neural networks through transient chaos
LeCun et al. [1998] — Gradient-based learning applied to document recognition
Jacot et al. [2018] — Neural tangent kernel: Convergence and generalization in neural networks
Lee et al. [2019] — Wide neural networks of any depth evolve as linear models under gradient descent
Chizat et al. [2019] — On lazy training in differentiable programming
Mei et al. [2019] — Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit
Rotskoff & Vanden-Eijnden [2018] — Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks
Chizat & Bach [2018] — On the global convergence of gradient descent for over-parameterized models using optimal transport
Bordelon & Pehlevan [2022] — Self-consistent dynamical field theory of kernel evolution in wide neural networks
Aubin et al. [2018] — The committee machine: Computational to statistical gaps in learning a two-layers neural network
Goldt et al. [2019] — Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup
Ren et al. [2025] — Emergence and scaling laws in SGD learning of shallow neural networks
Abbe et al. [2022] — The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks
Moniri et al. [2023] — A theory of non-linear feature learning with one gradient step in two-layer neural networks
Cui et al. [2024] — Asymptotics of feature learning in two-layer networks after one gradient-step
Defilippis et al. [2025] — Scaling laws and spectra of shallow neural networks in the feature learning regime
Montanari & Wang [2026] — Phase transitions for feature learning in neural networks
Saxe [2015] — Deep linear neural networks: A theory of learning in the brain and mind
Atanasov et al. [2021] — Neural networks as kernel learners: The silent alignment effect
Atanasov et al. [2025] — The optimization landscape of SGD across the feature learning strength
Maennel et al. [2018] — Gradient descent quantizes ReLU network features
Woodworth et al. [2020] — Kernel and rich regimes in overparametrized models
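
The lazy regime described at the top of this subsection can be illustrated numerically. The following is a minimal sketch, not the setup of any paper listed above: it assumes a two-layer ReLU network with scalar input, the 1/sqrt(n) output scaling used in the neural tangent kernel literature, standard normal initialization, and an arbitrary target y = 5. It takes one gradient step and reports how far the first-layer weights move relative to their initial norm.

```python
import math
import random


def relative_weight_movement(width, x=1.0, y=5.0, lr=1.0, seed=0):
    """One gradient step on a two-layer ReLU net in 1/sqrt(n) (NTK-style) scaling.

    Illustrative model: f(x) = (1/sqrt(n)) * sum_i a_i * relu(w_i * x),
    with scalar input x and a_i, w_i drawn i.i.d. from N(0, 1).
    Returns ||delta w|| / ||w|| for the first-layer weights.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    scale = 1.0 / math.sqrt(width)

    # Forward pass and residual of the squared loss 0.5 * (f - y)^2.
    f = scale * sum(ai * max(wi * x, 0.0) for ai, wi in zip(a, w))
    residual = f - y

    # Gradient w.r.t. w_i is residual * scale * a_i * relu'(w_i * x) * x.
    delta_w = [
        -lr * residual * scale * ai * (1.0 if wi * x > 0 else 0.0) * x
        for ai, wi in zip(a, w)
    ]
    norm_delta = math.sqrt(sum(d * d for d in delta_w))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return norm_delta / norm_w


if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, relative_weight_movement(n))
```

Under these assumptions the update norm stays of order one while the weight norm grows like sqrt(n), so the relative movement shrinks roughly as 1/sqrt(n). This is the sense in which features are "frozen" at large width in this parameterization; the rich regime requires a different scaling of the output or learning rate.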

The Bayesian Perspective

Lee et al. [2017] — Deep neural networks as Gaussian processes
Cohen et al. [2021b] — Learning curves for overparametrized deep neural networks: A field theory perspective
Lavie et al. [2024] — Towards understanding inductive bias in transformers: A view from infinity
Seroussi et al. [2023] — Separation of scales and a thermodynamic description of feature learning in some CNNs
Rubin et al. [2023] — Grokking as a first order phase transition in two layer networks
Rubin et al. [2025b] — From kernels to features: A multi-scale adaptive theory of feature learning
Rubin et al. [2025a] — Mitigating the curse of detail: Scaling arguments for feature learning and sample complexity
Yang et al. [2023a] — A theory of representation learning gives a deep generalisation of kernel methods
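
As a concrete instance of the Gaussian-process view of wide networks, the kernel of a single hidden ReLU layer has a closed form (the degree-1 arc-cosine kernel, up to a factor of 2). The sketch below compares that closed form with a Monte Carlo estimate from a finite random layer. It is a simplification under stated assumptions: no biases, unit weight variance, and hypothetical function names chosen for illustration; the weight and bias variance hyperparameters used in the literature are omitted.

```python
import math
import random


def relu_kernel_analytic(x, xp):
    """E[relu(w.x) * relu(w.xp)] for w ~ N(0, I): half the degree-1 arc-cosine kernel."""
    nx = math.sqrt(sum(v * v for v in x))
    nxp = math.sqrt(sum(v * v for v in xp))
    cos_t = sum(u * v for u, v in zip(x, xp)) / (nx * nxp)
    theta = math.acos(max(-1.0, min(1.0, cos_t)))  # clamp against rounding error
    return nx * nxp * (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / (2 * math.pi)


def relu_kernel_empirical(x, xp, width, seed=0):
    """Average of relu(w_i.x) * relu(w_i.xp) over `width` random hidden units."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        pre_x = sum(wi * xi for wi, xi in zip(w, x))
        pre_xp = sum(wi * xi for wi, xi in zip(w, xp))
        total += max(pre_x, 0.0) * max(pre_xp, 0.0)
    return total / width


if __name__ == "__main__":
    x, xp = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    print("analytic :", relu_kernel_analytic(x, xp))
    for n in (100, 20_000):
        print("width", n, ":", relu_kernel_empirical(x, xp, n))
```

The empirical average fluctuates around the analytic value with a spread of order 1/sqrt(width), which is the simplest example of the finite-size deviations from an infinite-width limit.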

B. Special Focus: The Tensor Programs Framework & µP

Highlights the infinite-width and infinite-depth scaling theories behind the Maximal Update Parameterization (µP), which enables zero-shot hyperparameter transfer from small proxy models to much larger target models.

Yang & Hu [2021] — Tensor programs IV: Feature learning in infinite-width neural networks
Yang & Littwin [2023] — Tensor programs IVb: Adaptive optimization in the infinite-width limit
Yang et al. [2022] — Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer
Noci et al. [2024] — Super consistency of neural network landscapes and learning rate transfer
Ghosh et al. [2025] — Understanding the mechanisms of fast hyperparameter transfer
Hayou [2025] — A proof of learning rate transfer under µP

C. The Infinite Depth Limit & Alternative Structural Directions

Focuses on continuous-depth limits in which residual networks approach differential equations, alongside scaling rules tailored to recurrent networks, multi-head attention blocks, and mixture-of-experts layers.

Bordelon et al. [2024b] — Infinite limits of multi-head transformer dynamics
Chizat [2025] — The hidden width of deep ResNets: Tight error bounds and phase diagrams
Chaintron et al. [2026] — ResNets of all shapes and sizes: Convergence of training dynamics in the large-scale limit
Chen et al. [2018] — Neural ordinary differential equations
Bordelon et al. [2023] — Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit
Yang et al. [2023b] — Tensor programs VI: Feature learning in infinite-depth neural networks
Dey et al. [2025] — Don't be lazy: CompleteP enables compute-efficient deep transformers
Clark et al. [2026] — Structure, disorder, and dynamics in task-trained recurrent neural circuits
Bauer et al. [2026] — A unified theory of feature learning in RNNs and DNNs
Hron et al. [2020] — Infinite attention: NNGP and NTK for deep attention networks
Małaśnicki et al. [2025] — µ-parameterization for mixture of experts
Jiang et al. [2026] — Hyperparameter transfer with mixture-of-expert layers
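
The link between deep residual networks and differential equations mentioned at the top of this subsection can be sketched in a few lines. The example below is a toy under stated assumptions: a scalar state, a single weight `a` shared by every layer, and a 1/depth multiplier on the residual branch, which makes the forward pass an explicit Euler discretization of dx/dt = tanh(a x) on the interval [0, 1]. Other depth scalings, such as 1/sqrt(depth) with independent weights per layer, lead to different limits and are not covered here.

```python
import math


def residual_forward(x0, depth, a=1.5):
    """Apply x <- x + (1/depth) * tanh(a * x) for `depth` layers (shared weight a)."""
    x = x0
    for _ in range(depth):
        x = x + math.tanh(a * x) / depth
    return x


if __name__ == "__main__":
    # A very deep network serves as a numerical stand-in for the continuous limit.
    reference = residual_forward(0.5, 100_000)
    for depth in (10, 100, 1_000):
        out = residual_forward(0.5, depth)
        print(depth, out, abs(out - reference))
```

In this toy setting the gap to the reference shrinks as depth grows, consistent with the first-order accuracy of the Euler method.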

D. Joint Scaling Limits

Focuses on high-dimensional regimes, often analyzed with random matrix theory and statistical physics, in which the number of samples and the number of parameters go to infinity together at controlled ratios.

Seung et al. [1992] — Statistical mechanics of learning from examples
Saad & Solla [1995] — Exact solution for on-line learning in multilayer neural networks
Zdeborová & Krzakala [2016] — Statistical physics of inference: Thresholds and algorithms
Li & Sompolinsky [2021] — Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization
Hoffmann et al. [2022] — Training compute-optimal large language models
Bordelon & Pehlevan [2025] — Deep linear network training dynamics from random initialization: Data, width, depth, and hyperparameter transfer
Hayou & Yang [2023] — Width and depth limits commute in residual networks

E. The Discretization Hypothesis & Finite-Size Corrections

Examines where noisy, finite-size networks trained with discrete updates depart from their continuous, infinite-size reference limits.

Hanin & Nica [2019] — Finite depth and width corrections to the neural tangent kernel
Li et al. [2022] — The neural covariance SDE: Shaped infinite depth-and-width networks at initialization
Noci et al. [2023] — The shaped transformer: Attention models in the infinite depth-and-width limit
Hanin & Jiang [2025] — Global universality of singular values in products of many large random matrices
Mandt et al. [2017] — Stochastic gradient descent as approximate Bayesian inference
Jastrzebski et al. [2017] — Three factors influencing minima in SGD
Roberts et al. [2022] — The principles of deep learning theory
Zavatone-Veth et al. [2021] — Asymptotics of representation learning in finite Bayesian neural networks
Segadlo et al. [2022] — Unified field theoretical approach to deep and recurrent neuronal networks
Bordelon & Pehlevan [2023] — Dynamics of finite width kernel and prediction fluctuations in mean field neural networks
Glasgow et al. [2025] — Propagation of chaos in one-hidden-layer neural networks beyond logarithmic time
