KAN: Rethinking the Building Blocks of Neural Networks

For decades, the multi-layer perceptron (MLP) has been the universal approximator of choice, the default building block for everything from transformers to diffusion models. But a paper from Liu et al. at MIT (accepted to ICLR 2025) proposes a fundamental rethinking of what a neural network layer should look like. Kolmogorov-Arnold Networks (KANs) replace the fixed activations on neurons with learnable activation functions on edges, parameterized as B-splines. On the paper's benchmarks, the result is a network with dramatically better parameter efficiency and faster neural scaling laws, backed by an approximation theorem, and, perhaps most importantly, one that can be visualized and interpreted far more directly than an MLP.

The Kolmogorov-Arnold Representation Theorem

The theoretical foundation of KANs is a mathematical result proved by Andrey Kolmogorov and Vladimir Arnold in the 1950s: any multivariate continuous function on a bounded domain can be written as a finite composition of continuous single-variable functions and the binary operation of addition. Formally, for a continuous function f: [0,1]^n → ℝ, we can write f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} Φ_q(Σ_{p=1}^{n} φ_{q,p}(x_p)), where every inner function φ_{q,p}: [0,1] → ℝ and every outer function Φ_q: ℝ → ℝ is continuous and univariate. In other words, the only truly multivariate operation required to represent any continuous function is addition; everything else reduces to univariate transformations. This is a very strong statement, but it was long considered practically useless for machine learning because the univariate functions can be non-smooth or even fractal in the worst case. The KAN paper's key insight is that for the smooth, compositionally structured functions that arise in science and engineering, this pathological behavior can often be avoided, and a neural architecture built on the theorem can exploit the resulting structure to beat the curse of dimensionality.
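
As a small concrete instance of this form, the product x_1 · x_2 can be written with only two outer terms: x_1 x_2 = ½(x_1 + x_2)² − ½(x_1² + x_2²). The sketch below checks that identity numerically. The function names (inner, outer, ka_product) are illustrative, and the decomposition is hand-picked for this one function; it is not the general construction used to prove the theorem.

```python
import numpy as np

def inner(q, p, x):
    # phi_{q,p}: the same for both inputs p here.
    # q = 0 uses the identity, q = 1 uses the square.
    return x if q == 0 else x ** 2

def outer(q, u):
    # Phi_q: q = 0 is u^2 / 2, q = 1 is -u / 2.
    return 0.5 * u ** 2 if q == 0 else -0.5 * u

def ka_product(x1, x2):
    # sum_q Phi_q( sum_p phi_{q,p}(x_p) ), with 0-based q and p
    xs = (x1, x2)
    return sum(
        outer(q, sum(inner(q, p, xs[p]) for p in range(2)))
        for q in range(2)
    )

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 1, 5), rng.uniform(0, 1, 5)
print(np.allclose(ka_product(x1, x2), x1 * x2))
```

Every operation that mixes the two inputs is an addition; the squaring and scaling act on one number at a time.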

Architecture: B-Splines on Edges Instead of Fixed Activations on Nodes

The core architectural innovation is deceptively simple. In a KAN, there are no linear weight matrices at all — every weight parameter in a traditional MLP is replaced by a learnable univariate function parameterized as a B-spline. Each edge in the computational graph computes φ(x) = w_b · silu(x) + w_s · spline(x), where the spline term is a linear combination of B-spline basis functions with learnable coefficients. The nodes simply sum incoming signals without applying any nonlinearity. A single KAN layer with n_in inputs and n_out outputs is defined as a matrix of 1D functions Φ = {φ_q,p} where p=1,...,n_in and q=1,...,n_out. Composing multiple such layers yields deep KANs. The original Kolmogorov-Arnold representation corresponds to a 2-layer KAN with shape [n, 2n+1, 1], but the authors generalize this to arbitrary widths and depths by stacking KAN layers — a conceptual breakthrough that makes deep KANs possible. B-splines are a particularly clever choice for the learnable functions because they are locally controllable (changing one coefficient affects only a local region of the function) and support grid extension, allowing the network to be refined after initial training without retraining from scratch.
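
A minimal NumPy sketch of one such layer may make the edge formula concrete. It assumes a uniform knot grid on [-1, 1], evaluates the B-spline basis with the Cox-de Boor recursion, and uses arbitrary initial values for the coefficients, w_b, and w_s. The names bspline_basis and KANLayer are made up for this illustration and are not the reference implementation's API; there is no training loop.

```python
import numpy as np

def bspline_basis(x, grid, k):
    """Evaluate all order-k B-spline basis functions at the points x.

    x: (m,) inputs; grid: (G + 2k + 1,) uniformly spaced, extended knots.
    Returns an (m, G + k) array via the Cox-de Boor recursion.
    """
    x = x[:, None]
    # Order 0: indicator of each half-open knot interval.
    B = ((x >= grid[:-1]) & (x < grid[1:])).astype(float)
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)])
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def silu(x):
    return x / (1.0 + np.exp(-x))

class KANLayer:
    """One layer: a matrix of 1D functions phi_{q,p}, summed at each output node."""

    def __init__(self, n_in, n_out, G=5, k=3, lo=-1.0, hi=1.0, seed=0):
        rng = np.random.default_rng(seed)
        h = (hi - lo) / G
        self.k = k
        # G intervals on [lo, hi], padded with k extra knots on each side.
        self.grid = lo + h * np.arange(-k, G + k + 1)
        # Initial values below are an arbitrary choice for this sketch.
        self.coef = 0.1 * rng.standard_normal((n_out, n_in, G + k))
        self.w_b = np.ones((n_out, n_in))
        self.w_s = np.ones((n_out, n_in))

    def __call__(self, x):
        # x: (batch, n_in) -> (batch, n_out)
        batch, n_in = x.shape
        B = bspline_basis(x.reshape(-1), self.grid, self.k)
        B = B.reshape(batch, n_in, -1)
        # spline[b, q, p] = sum_g coef[q, p, g] * B_g(x[b, p])
        spline = np.einsum("bpg,qpg->bqp", B, self.coef)
        phi = self.w_b * silu(x)[:, None, :] + self.w_s * spline
        # Nodes only add up their incoming edges; no extra nonlinearity.
        return phi.sum(axis=2)

# A [2, 5, 1] KAN is two layers composed.
x = np.random.default_rng(1).uniform(-1, 1, size=(8, 2))
y = KANLayer(5, 1, seed=1)(KANLayer(2, 5)(x))
print(y.shape)
```

In this sketch, hidden activations that leave [-1, 1] eventually fall outside the padded knot range, where the spline term is zero and only the silu term contributes. A real implementation would need to handle this, for example by adapting the grid to the observed activations.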

Parameter Efficiency and Neural Scaling Laws

The empirical results in the paper are striking. A KAN with just 2 layers and width 5 can match or exceed the accuracy of an MLP with 4 layers and width 100 or more, a 100x or greater improvement in parameter efficiency on tasks with compositional structure. The paper's approximation theory (Theorem 2.1) gives KANs a scaling exponent of α = k+1, where k is the spline order. With the typical k = 3 this yields α = 4, so test error falls roughly as N^(-4) in the number of parameters N, significantly better than MLPs, whose exponents are tied to the input dimensionality. In practice, this means that as model size increases, KANs improve far faster than MLPs. The paper demonstrates this on five synthetic benchmarks, including high-dimensional examples (up to 100 inputs), as well as 15 special functions from mathematical physics, Feynman equations from physics, and PDE solving for the Poisson equation. Across these experiments, the paper reports better Pareto frontiers for KANs (lower error at the same parameter count) than for comparably trained MLPs. The grid extension technique is particularly elegant: one can train a KAN on a coarse grid and then refine it by fitting a finer B-spline to the coarse one, which produces staircase-like loss curves where the loss drops sharply at each refinement step.
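
Grid extension for a single activation function can be sketched as a least-squares fit: sample the coarse spline, then solve for the fine-grid coefficients that reproduce those samples. The version below assumes uniform grids on [-1, 1] and evenly spaced sample points. The helpers knots, bspline_basis, and extend_grid are illustrative names, not the paper's code.

```python
import numpy as np

def knots(G, k, lo=-1.0, hi=1.0):
    # G intervals on [lo, hi], padded with k extra knots on each side.
    h = (hi - lo) / G
    return lo + h * np.arange(-k, G + k + 1)

def bspline_basis(x, grid, k):
    # Cox-de Boor recursion; returns an (len(x), G + k) basis matrix.
    x = x[:, None]
    B = ((x >= grid[:-1]) & (x < grid[1:])).astype(float)
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)])
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def extend_grid(coef_coarse, G_coarse, G_fine, k=3, n_samples=200,
                lo=-1.0, hi=1.0):
    """Return fine-grid coefficients whose spline matches the coarse spline."""
    # Sample points in [lo, hi); an assumption of this sketch.
    x = np.linspace(lo, hi, n_samples, endpoint=False)
    target = bspline_basis(x, knots(G_coarse, k, lo, hi), k) @ coef_coarse
    B_fine = bspline_basis(x, knots(G_fine, k, lo, hi), k)
    coef_fine, *_ = np.linalg.lstsq(B_fine, target, rcond=None)
    return coef_fine

coarse = np.random.default_rng(0).standard_normal(5 + 3)  # G = 5, k = 3
fine = extend_grid(coarse, G_coarse=5, G_fine=10)
print(fine.shape)
```

When the fine knots contain the coarse knots, as with G going from 5 to 10 here, the fine spline can reproduce the coarse one essentially exactly, so refinement starts from the already-trained function and the extra coefficients only add capacity. For non-nested grids the fit is an approximation.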

Interpretability: Opening the Black Box

Perhaps the most exciting aspect of KANs is their inherent interpretability. Because every learned function in a KAN is a 1D B-spline, it can be plotted and inspected directly. The authors propose a complete pipeline for making KANs interpretable: (1) train with L1 and entropy regularization to sparsify the network, (2) prune unimportant nodes based on incoming and outgoing scores, (3) visualize the remaining activation functions with transparency proportional to their magnitude, and (4) use symbolic regression to replace visually identified activation functions with symbolic forms (sin, exp, x², etc.) — fitting affine parameters to match the learned splines. This is not just a post-hoc analysis tool; it is a genuine interaction loop where the user can inspect the network, prune it, set symbolic functions, retrain, and converge to the exact symbolic formula underlying the data. The paper demonstrates this on f(x,y) = exp(sin(πx) + y²), where the KAN automatically prunes from [2,5,1] to [2,1,1] and the user can identify the sin, square, and exp functions by inspection. This kind of transparency is unprecedented in deep learning and has profound implications for scientific discovery — the authors show that KANs can (re)discover mathematical structures in knot theory and physical laws in Anderson localization.
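
Step (4) can be sketched as follows: for each candidate function f, search over the inner affine parameters (a, b) on a grid, solve for the outer parameters (c, d) by linear regression so that y ≈ c · f(a·x + b) + d, and keep the candidate with the highest R². The candidate list, the search range, and the names fit_affine and pick_symbolic are assumptions made for this illustration. A real implementation would refine the grid iteratively rather than use a single coarse pass.

```python
import numpy as np

# Illustrative candidate library; a real one would be much larger.
CANDIDATES = {"sin": np.sin, "x^2": np.square, "exp": np.exp}

def fit_affine(x, y, f, lim=4.0, n=41):
    """Grid-search (a, b); least-squares (c, d). Returns (r2, a, b, c, d)."""
    best = (-np.inf, 0.0, 0.0, 0.0, 0.0)
    for a in np.linspace(-lim, lim, n):
        for b in np.linspace(-lim, lim, n):
            z = f(a * x + b)
            A = np.stack([z, np.ones_like(z)], axis=1)
            (c, d), *_ = np.linalg.lstsq(A, y, rcond=None)
            r2 = 1.0 - np.var(y - (c * z + d)) / np.var(y)
            if r2 > best[0]:
                best = (r2, a, b, c, d)
    return best

def pick_symbolic(x, y):
    """Return the best-fitting candidate name and its (r2, a, b, c, d)."""
    fits = {name: fit_affine(x, y, f) for name, f in CANDIDATES.items()}
    name = max(fits, key=lambda key: fits[key][0])
    return name, fits[name]

# Stand-in for samples of one learned spline activation.
x = np.linspace(-1, 1, 200)
y = 1.5 * np.sin(2.0 * x + 0.4) - 0.3
name, (r2, a, b, c, d) = pick_symbolic(x, y)
print(name, round(r2, 4))
```

The recovered parameters are not unique. For sin, for example, flipping the signs of a, b, and c together gives the same curve, so it is the fitted function, not the individual numbers, that should be compared against the spline.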

Limitations and the Road Ahead

KANs are not without their limitations, and the authors are refreshingly honest about them. Training is slower than for MLPs with the same parameter count because B-spline computations are not as well optimized as plain matrix multiplications, and the symbolic branch in particular is not parallelizable. KANs are not a drop-in replacement for MLPs in large-scale settings like transformers: hyperparameters need significant tuning, and the authors suggest that KANs may be best used in latent spaces, with linear embedding and unembedding layers around them (as explored in GraphKAN). The most common question is whether KANs could be the next generation of LLM architecture; the authors do not have a strong intuition for this, noting that LLMs care about different kinds of accuracy and interpretability than scientific applications do. However, as a tool for AI-driven science, where accuracy on small data, interpretability, and the ability to discover symbolic relationships matter most, KANs represent a genuine breakthrough. The paper has already spawned numerous follow-ups (EfficientKAN, FourierKAN, GraphKAN, KAN-RL), suggesting that the community is actively working to address these limitations. As the lead author puts it, the message is less 'KANs are great' and more 'try thinking of current architectures critically and seeking fundamentally different alternatives that can do fun and useful stuff.' That is the kind of thinking that moves the field forward.