Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics

Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function represented. Partially, this is due to symmetries inherent within the NN parameterization, allowing multiple different parameter settings to result in an identical output function, resulting in both an unclear relationship and redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of the neurons and a continuous family of transformations of the scale of weight and bias parameters. We propose taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise linear splines. Using this spline lens, we study learning dynamics in shallow univariate ReLU NNs, finding unexpected insights and explanations for several perplexing phenomena. We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum. We also show that standard weight initializations yield very flat initial functions, and that this flatness, together with overparametrization and the initial weight scale, is responsible for the strength and type of implicit regularization, consistent with previous work. Our implicit regularization results are complementary to recent work, showing that initialization scale critically controls implicit regularization via a kernel-based argument. Overall, removing the weight scale symmetry enables us to prove these results more simply and enables us to prove new results and gain new insights while offering a far more transparent and intuitive picture. Looking forward, our quotiented spline-based approach will extend naturally to the multivariate and deep settings, and alongside the kernel-based view, we believe it will play a foundational role in efforts to understand neural networks. 
Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2.

Keywords: neural networks, symmetry, implicit bias, splines, learning dynamics

INTRODUCTION
Deep learning has revolutionized the field of machine learning (ML), leading to state-of-the-art performance in image segmentation, medical imaging, machine translation, chess playing, reinforcement learning, and more. Despite being intensely studied and widely used, theoretical understanding of some of its fundamental properties remains poor. One critical, yet mysterious property of deep learning is the root cause of the excellent generalization achieved by overparameterized networks (Zhang et al., 2016).
In contrast to other machine learning techniques, continuing to add parameters to a deep network (beyond zero training loss) tends to improve generalization performance. This has even been observed for networks that are massively overparameterized, wherein, according to traditional ML theory, they should (over)fit the training data (Neyshabur et al., 2015). This overparameterization leads to a highly complicated loss surface, with increasingly many local minima. How does training networks with excess capacity lead to generalization? How can generalization error decrease with overparameterization? How can gradient descent in a loss landscape riddled with local minima so frequently converge to near-global minima? One possible answer is a phenomenon known as implicit regularization (or implicit bias). Empirical studies (Zhang et al., 2016; Advani et al., 2020; Geiger et al., 2020) have shown that overparametrized NNs behave as if they possess strong regularization, even when trained from scratch with no explicit regularization; such regularization is thus termed implicit regularization or bias (IR or IB), since it is not explicitly imposed by the designer.
We approach these issues by considering symmetries of the NN parameterization. Several recent works have brought attention to the importance of symmetries to deep learning (Badrinarayanan et al., 2015; Kunin et al., 2020; Tayal et al., 2020). One important advance is building symmetry-aware architectures that automatically generalize, such as convolutional layers granting translation-invariant representations, or architectures exploiting other task-specific symmetries (Liu et al., 2016; Barbosa et al., 2021; Bertoni et al., 2021; Liu and Okatani, 2021). Understanding and exploiting symmetries may lead to similar improvements, resulting in better performance or models that require less training data due to the implementation of expert-knowledge-based symmetry groups.
We consider a new spline-based reparameterization of a shallow ReLU NN, explicitly based on taking a quotient with respect to an existing weight symmetry. In particular, we focus on shallow fully connected univariate ReLU networks, whose parameters always yield a continuous piecewise linear (CPWL) output. We reparameterize the CPWL function implemented by the network in terms of the locations of the nonlinearities (breakpoints), the associated changes in slope (delta-slopes), and a parameter that identifies which side of the breakpoint the associated neuron is active on (orientation). This parameterization is defined formally in section 2.1 and illustrated in Figure 2. We provide theoretical results for shallow networks, with experiments confirming these results.

Main Contributions
The main contributions of this work are as follows:
- Initialization: Increasingly Flat with Width. In the spline perspective, once the weight symmetry has been accounted for, neural network parameters determine the locations of breakpoints and their delta-slopes in the CPWL reparametrization. We prove that, for common initializations, these distributions are mean 0 with low standard deviation. Notably, the delta-slope distribution becomes increasingly concentrated near 0 as the width of the network increases, leading to flatter initial approximations.
- A Characterization of the Loss Surface, Critical Points, and Hessian, Revealing That Flat Minima Are Due to Degeneracy of Breakpoints. We fully characterize the Hessian of the loss surface at critical points, revealing that the Hessian is positive semi-definite, with 0 eigenvalues occurring when elements of its Gram matrix set are linearly dependent. We show that many of these 0 eigenvalues occur due to symmetry artifacts of the standard NN parametrization. We characterize the ways this can occur, including whenever multiple breakpoints share the same active data; thus we expect that at any given critical point in an overparametrized network, the loss surface will be flat in many directions.
- Implicit Regularization Is Due to Flat Initialization in the Overparametrized Regime. We find that implicit regularization in overparametrized fully connected (FC) ReLU nets is due to three factors: (i) the very flat initialization, (ii) the curvature-based parametrization of the approximating function (breakpoints and delta-slopes, made explicit by our symmetry-quotiented reparametrization), and (iii) the role of gradient descent (GD) in preserving the initialization and regularizing curvature. In particular, each neuron has a global, rather than local, impact, as each contributes an affine ReLU. Thus, backpropagation distributes the "work" of fitting the training data over many units, leading to regularized delta-slopes and breakpoints. All else equal, these non-local effects mean that more overparametrization leads to each unit mattering less, typically resulting in better generalization due to implicit regularization (Neyshabur et al., 2015, 2018).

Spline Parametrization and Notation
Consider a fully connected ReLU neural net f̂(x; θ) with a single hidden layer of width H, scalar input x ∈ R and scalar output y ∈ R, which we are attempting to match to a target function f(x), from which we have N sample input/output data pairs (x_n, y_n); the full vectors of inputs and outputs are denoted x and y, respectively. f̂(·; θ) is a continuous piecewise linear (CPWL) function since the ReLU nonlinearity is CPWL.
Frontiers in Artificial Intelligence | www.frontiersin.org

FIGURE 1 | The manifold of constant loss generated by the α-scaling transformation of the input weight, input bias, and output weight. Here b_0 and (w_i, b_i, v_i) denote the global bias and the input weight, bias, and output weight, respectively, of neuron i, and (·)_+ ≜ max{0, ·} denotes the ReLU function.
An important observation is that the ReLU NN parametrization is redundant: for every function f̂ represented by Equation (1), there exist infinitely many parameter settings θ′ representing the same function. These invariant transformations consist of (i) permutations of the hidden units and (ii) scalings of the weights and biases of the form (w_i, b_i, v_i) → (α w_i, α b_i, v_i/α) for α > 0 (Rolnick and Kording, 2019). The manifold of constant loss generated by the α-scaling symmetry is shown in Figure 1. The set G of such function-invariant transformations, together with function composition ◦, forms a group.
We want to understand the function implemented by this neural net, and so we ask: How do the CPWL parameters relate to the NN parameters? We answer this by taking the quotient with respect to the above α-scaling symmetry, transforming from the NN parametrization of weights and biases to the Breakpoint, Delta-Slope, Orientation (BDSO) spline parametrization:

f̂(x; θ_BDSO) = b_0 + Σ_{i=1}^{H} µ_i (x − β_i) [[s_i (x − β_i) > 0]],    (2)

where β_i ≜ −b_i/w_i is (the location of) the breakpoint (or knot) induced by neuron i, µ_i ≜ w_i v_i is the delta-slope contribution of neuron i, s_i ≜ sgn(w_i) ∈ {±1} is the orientation of β_i (left for s_i = −1, right for s_i = +1), and the Iverson bracket [[b]] is 1 when the condition b is true, and 0 otherwise. This terminology is illustrated in Figure 2.

FIGURE 2 | Illustration of BDSO terminology of a CPWL function. Breakpoints are where adjacent linear pieces meet, and the delta-slope gives the difference in slope between the adjacent linear pieces. Orientations determine on which side of a breakpoint the delta-slope is applied. In a shallow NN, each neuron contributes exactly one breakpoint, with an associated facing s_i and delta-slope µ_i; in the above example, neurons 1 and 2 are right-facing with positive delta-slopes (s_1 = s_2 = 1, µ_1, µ_2 > 0), and neuron 3 is left-facing with negative delta-slope (s_3 = −1, µ_3 < 0).
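The NN-to-BDSO map described above can be checked numerically. The following sketch (a toy width-8 net with hypothetical random parameters) computes breakpoints, delta-slopes, and orientations from the weights and verifies that both parametrizations evaluate to the same function:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8
# Standard NN parameters of a width-H shallow ReLU net:
#   f(x) = b0 + sum_i v_i * relu(w_i * x + b_i)
w = rng.normal(size=H)
b = rng.normal(size=H)
v = rng.normal(size=H)
b0 = rng.normal()

# BDSO parameters (quotient of the alpha-scaling symmetry):
beta = -b / w          # breakpoints beta_i = -b_i / w_i
mu = w * v             # delta-slopes mu_i = w_i * v_i
s = np.sign(w)         # orientations s_i = sgn(w_i)

def f_nn(x):
    return b0 + np.sum(v * np.maximum(w * x + b, 0.0))

def f_bdso(x):
    active = s * (x - beta) > 0          # Iverson bracket
    return b0 + np.sum(mu * (x - beta) * active)

xs = np.linspace(-3, 3, 101)
assert np.allclose([f_nn(x) for x in xs], [f_bdso(x) for x in xs])
```

The equality holds because v_i(w_i x + b_i)_+ = µ_i(x − β_i) exactly when w_i x + b_i > 0, i.e., when s_i(x − β_i) > 0.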
Considering the BDSO parametrization provides a new, useful lens with which to analyze neural nets, enabling us to reason more easily and transparently about the initialization, loss surface, and training dynamics. The benefits of this approach derive from taking the quotient with respect to the α-scaling symmetry, leading to two useful properties: (1) the loss depends on the NN parameters θ_NN only through the BDSO parameters (the approximating function) θ_BDSO, i.e., ℓ(θ_NN) = ℓ(θ_BDSO(θ_NN)), analogous to the concept of a minimum sufficient statistic in exponential family models; and (2) the BDSO parameters are more intuitive, allowing us to reproduce known results (Williams et al., 2019) more succinctly, and to reach new results more easily. Much recent related work has also moved in this direction, analyzing function space (Balestriero and Baraniuk, 2018; Hanin and Rolnick, 2019).
The BDSO parametrization makes clear that, fundamentally, f̂(x; θ_NN) is represented in terms of its second derivative. Examining f̂ further illustrates the meaning of the BDSO parameters, especially µ_i and β_i: the second derivative is zero everywhere except at x = β_i for some i, where there is a Dirac delta of magnitude µ_i. We note that the BDSO parametrization of a ReLU NN is closely related to, but different from, a traditional mth order spline parametrization f̂_spline(x) (Reinsch, 1967). The BDSO parametrization (i) lacks the base polynomial, and (ii) has two possible breakpoint orientations s_i ∈ {±1}, whereas the spline is canonically all right-facing. Additionally, adding the base polynomial (for the linear case m = 1) into the BDSO parametrization yields a ReLU ResNet parametrization.
It is also useful to consider the set of sorted breakpoints, {β_p}_{p=0}^{H+1}, where −∞ ≜ β_0 < β_1 < · · · < β_p < · · · < β_{H+1} ≜ ∞, which induces two partitions: (i) a partition of the domain X into intervals X_p ≜ [β_p, β_{p+1}); and (ii) a partition of the training data D_x ≜ {x_n}_{n=1}^{N} into pieces π_p ≜ X_p ∩ D_x, so that D_x = ∪_{p=0}^{H+1} π_p. Note that sorting the breakpoints (and hence the neurons) removes the permutation symmetry present in the standard NN parametrization.
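The induced data partition can be computed directly. A minimal sketch (toy breakpoints and data, drawn at random here for illustration) that buckets the training inputs into the pieces π_p delimited by the sorted breakpoints:

```python
import numpy as np

rng = np.random.default_rng(8)
beta = np.sort(rng.uniform(-1, 1, 5))    # sorted breakpoints beta_1 .. beta_H
x = np.sort(rng.uniform(-1, 1, 12))      # training inputs

# Extend with beta_0 = -inf and beta_{H+1} = +inf, then bucket the data
# into intervals X_p = [beta_p, beta_{p+1}):
edges = np.concatenate(([-np.inf], beta, [np.inf]))
piece_index = np.searchsorted(edges, x, side="right") - 1
pieces = [x[piece_index == p] for p in range(len(edges) - 1)]
for p, pi in enumerate(pieces):
    print(f"pi_{p}:", np.round(pi, 2))
```

A partition in which every `pieces[p]` has at most one element is exactly a lonely partition in the sense used later (section on critical points).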
Finally, we introduce some notation relevant to the training of the network: let ǫ ∈ R^{N×1} denote the vector of residuals y − ŷ, and let ǫ_i ≜ ǫ ⊙ 1_i ∈ R^{N×1} and x_i ≜ x ⊙ 1_i ∈ R^{N×1} denote the relevant residuals and inputs for neuron i, respectively, where 1_i indicates the data on which neuron i is active. In words, the relevant residuals and inputs for neuron i are those which are not zeroed out by the ReLU activation.

Basis Expansion, Infinite Width Limit
The expansion in Equation 2 can be further understood by viewing it as a basis expansion. Specifically, we can view the BDSO framework as a basis expansion of f̂″ in a basis of Dirac delta functions, or, from Equation 1, as a basis expansion of f̂ in a basis of functions (x − β_i)_{s_i}. The form of these basis functions is fixed as the ReLU activation (analogous to the mother basis function in wavelets (Rao, 2002)), but each can be translated via changes in β_i, reflected via s_i, or re-weighted via changes in µ_i. Also, note that if the breakpoints β_i are fixed, and only the delta-slope parameters µ_i are optimized, then we effectively have a (kernel) linear regression model. Thus, the power of neural network learning lies in the ability to translate/orient these basis functions in order to better model the data. Intuitively, in a good fit, the breakpoints β_i will tend to congregate near areas of high curvature in the ground truth function, i.e., where |f″(x)| ≫ 0, so that the sum of Diracs f̂″(x; θ_BDSO) better approximates f″(x).¹

Consider the infinite width limit H → ∞, rewriting Equation 2 with a 1/H scaling so that we may write µ_i s_i = c(β_i)/(H p_β(β_i)). Then, from Equation 3, we can expand this as µ_i s_i = f̂″(β_i; θ_BDSO)/(H p_β(β_i)).

¹ Although a "breakpoints near curvature" fit will be a good fit, that does not mean that gradient descent from a random initialization can find such a fit; see section 2.3.
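The "fixed breakpoints reduce to linear regression" observation above can be made concrete. A minimal sketch (toy target sin(2x), hypothetical evenly spaced breakpoints and alternating orientations) that solves for the delta-slopes µ by ordinary least squares in the one-sided ramp basis:

```python
import numpy as np

rng = np.random.default_rng(1)
N, H = 50, 10
x = np.sort(rng.uniform(-2, 2, N))
y = np.sin(2 * x)                       # toy target function

# Fix breakpoints and orientations; optimize only the delta-slopes mu.
beta = np.linspace(-2, 2, H)
s = np.where(np.arange(H) % 2 == 0, 1.0, -1.0)

# Feature matrix Phi[n, i] = (x_n - beta_i) * [[s_i (x_n - beta_i) > 0]]
ramp = x[:, None] - beta[None, :]
Phi = ramp * ((s[None, :] * ramp) > 0)

mu, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ordinary least squares in mu
residual = y - Phi @ mu
print("RMSE with fixed breakpoints:", np.sqrt(np.mean(residual**2)))
```

With β and s frozen, the model is linear in µ, so the fit is a convex least-squares problem; moving the breakpoints is what makes full NN training non-convex.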

Random Initialization in Function Space
We now study the random initializations commonly used in deep learning in function space. These include the independent Gaussian and independent Uniform initializations. We find that common initializations result in flat functions, becoming flatter with increasing width.
Theorem 1. Consider a fully connected ReLU neural net with scalar input and output, and a single hidden layer of width H. Let the weights and biases be initialized randomly according to a zero-mean Gaussian or Uniform distribution. Then the induced joint distributions of the function space parameters (breakpoints β, delta-slopes µ) can be computed in closed form for (a) an independent Gaussian initialization and (b) an independent Uniform initialization (see Supplement for the explicit densities). Using this result, we can immediately derive marginal and conditional distributions for the breakpoints and delta-slopes.
Corollary 1. Consider the same setting as Theorem 1.
(a) In the case of an independent Gaussian initialization, the marginal densities are expressible in terms of the Meijer G-function and the modified Bessel function of the second kind K_ν(·). (b) In the case of an independent Uniform initialization, the marginals involve Tri(·; a), the symmetric triangular distribution with base [−a, a] and mode 0. (The explicit densities are given in the Supplement.)

Implications
Corollary 1 implies that the breakpoint density drops quickly away from the origin for common initializations. As breakpoints are necessary to fit curvature, if the ground truth function f(x) has significant curvature far from the origin, then it may be far more difficult to fit. We show that this is indeed the case by training a shallow ReLU NN with an initialization that does not match the underlying curvature, finding that training becomes easier when the initial breakpoint distribution better matches the function curvature. This suggests that a data-dependent initialization, with more breakpoints near areas of high curvature, could be faster and easier to train. Note that this simple curvature-based interpretation is made possible by our symmetry-quotiented reparametrization.
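The concentration of breakpoints near the origin is easy to see by Monte Carlo. Under a zero-mean Gaussian initialization with equal weight and bias scales, each breakpoint β = −b/w is a ratio of independent zero-mean Gaussians and hence standard Cauchy distributed; the sketch below (unit variances assumed for illustration) checks the resulting mass near and far from the origin:

```python
import numpy as np

rng = np.random.default_rng(2)
M = 100_000
w = rng.normal(0.0, 1.0, M)
b = rng.normal(0.0, 1.0, M)
beta = -b / w                        # standard Cauchy when sigma_w = sigma_b

near = np.mean(np.abs(beta) < 1)     # mass within one unit of the origin
far = np.mean(np.abs(beta) > 10)     # tail mass far from the origin
print(f"P(|beta| < 1)  ~ {near:.3f}")   # ~0.5 for a standard Cauchy
print(f"P(|beta| > 10) ~ {far:.3f}")    # ~0.06
```

Roughly half the breakpoints land within one unit of the origin, so a target function whose curvature lives far from the origin starts with very few breakpoints available to fit it.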
Theorem 2. Consider the initial roughness ρ_0 under a Gaussian initialization. Under both the He and the Glorot initialization, the tail probability of ρ_0 concentrates around its mean as the width H increases (explicit tail bounds are given in the Supplement).

Thus, as the width H increases, the distribution of the roughness of the initial function f̂_0 gets tighter around its mean. In the case of the He initialization, this mean is constant; in the Glorot initialization, it decreases with H. In either case, for reasonable widths, the initial roughness is small with high probability, corresponding to small initial delta-slopes. This smoothness has implications for the implicit regularization phenomenon observed in recent work (Neyshabur et al., 2018), and studied later in section 2.6, in particular, Theorem 7.

Related Work. Several recent works analyze the random initialization in deep networks. However, there are two main differences: first, they focus on the infinite width case (Neal, 1994; Lee et al., 2017; Jacot et al., 2018; Savarese et al., 2019) and can thus use the Central Limit Theorem (CLT), whereas we focus on the finite width case and cannot use the CLT, thus requiring nontrivial mathematical machinery (see Supplement for detailed proofs). Second, they focus on the activations as a function of the input, whereas we also compute the joint densities of the BDSO parameters, i.e., breakpoints and delta-slopes. The latter is particularly important for understanding the non-uniform density of breakpoints away from the origin, as noted above. The most closely related work is Steinwart (2019), which considers only the breakpoints, and suggests a new initialization with a uniform breakpoint distribution.
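The width dependence claimed in Theorem 2 can be checked numerically. In the sketch below we take the sum of squared delta-slopes, Σ_i µ_i², as a stand-in for the roughness ρ_0 (an assumption; the paper's exact definition is in the Supplement) and compare He (Var = 2/fan_in) against Glorot (Var = 2/(fan_in + fan_out)) scaling:

```python
import numpy as np

rng = np.random.default_rng(3)

def roughness_proxy(H, init, trials=2000):
    """Monte Carlo mean of sum_i mu_i^2 at initialization, width H."""
    if init == "he":          # Var(w) = 2/fan_in per layer
        w = rng.normal(0, np.sqrt(2.0 / 1), (trials, H))
        v = rng.normal(0, np.sqrt(2.0 / H), (trials, H))
    else:                     # glorot: Var(w) = 2/(fan_in + fan_out)
        w = rng.normal(0, np.sqrt(2.0 / (1 + H)), (trials, H))
        v = rng.normal(0, np.sqrt(2.0 / (H + 1)), (trials, H))
    return np.mean(np.sum((w * v) ** 2, axis=1))

for H in (10, 100, 1000):
    print(H, roughness_proxy(H, "he"), roughness_proxy(H, "glorot"))
```

Under this proxy, the He mean stays roughly constant in H (E[Σµ²] = H · 2 · (2/H) = 4), while the Glorot mean shrinks like 4/H, matching the constant-vs-decreasing distinction stated above.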

Loss Surface in the Spline Parametrization
We now consider the squared loss (1/2) Σ_{n=1}^{N} (f̂(x_n; θ) − y_n)² as a function of either the NN parameters, ℓ(θ_NN), or the BDSO parameters, ℓ̃(θ_BDSO). We begin with the symmetry-quotiented case:

Theorem 3. The loss function ℓ̃(θ_BDSO = (β, µ, s)) is a continuous piecewise quadratic (CPWQ) spline. Furthermore, consider the evolution of the loss as we vary β_i along the x axis: it is composed of quadratic pieces p_j(β_i), which all have positive curvature, meeting at knots located at the datapoints x_n; let m_j ≜ arg min p_j(β_i). Then, with measure 1, the knots x_n fall into one of three types, as shown in Figure 3. We call Type I knots Passthrough knots because gradient flow starting from the higher side will simply pass through the knot. Similarly, we call Type II knots Repellers and Type III knots Attractors because β_i will be repelled by or attracted to the knot during gradient flow.
We now return to the loss ℓ(θ_NN) under the NN parametrization, and consider varying (w_i, v_i, b_i) such that the corresponding β_i changes but µ_i and s_i stay the same. Due to the α-scaling symmetry, this can be implemented in two distinct ways: (i) changing b_i alone, or (ii) changing (w_i, v_i, b_i) jointly, composed with the appropriate α-scaling, which leaves the loss invariant. In fact, there is a continuum of such transformations, each achievable as version (i) followed by an arbitrary α-transformation.
The 1-dimensional "slice" of the loss corresponding to version (i) is a horizontally reflected and scaled version of ℓ̃(β_i; β_{−i}, µ, s). The 1-dimensional "slice" corresponding to version (ii) will similarly be a stretched version of ℓ̃(β_i; β_{−i}, µ, s), with the caveat that this "slice" is along a hyperbola in the (v_i, w_i)-plane. Noting that any changes in (w_i, v_i, b_i) that implement the same change in β_i change the loss in the same way, let ℓ(β_i; θ_NN \ β_i) denote the equivalence class of 1-dimensional slices of ℓ(θ_NN). This gives the following corollary: the slices ℓ(β_i; θ_NN \ β_i) that implement a change only in β_i are CPWQ, with knots as in Theorem 3.

Critical Points of the Loss Surface
In addition to the fixed points associated with attractor datapoints, there are also critical points that lie in the quadratic regions, i.e., local (and global) minima which correspond to the minima of some parabolic piece for each neuron. We characterize these critical points here.
Theorem 4. Consider some arbitrary θ*_NN such that there exists at least one neuron that is active on all data. Then, θ*_NN is a critical point of ℓ̃(·) if and only if, for all domain partitions X_p, the restriction f̂(·; θ*_NN)|_{X_p} is an Ordinary Least Squares (OLS) fit of the training data contained in π_p.
An open question is how many such critical points exist. A starting point is to consider that there are C(N + H, H) ≜ (N + H)!/(N! H!) possible partitions of the data. Not every such partition will admit a piecewise-OLS solution which is also continuous, and it is difficult to analytically characterize such solutions.
Using Theorem 4, we can characterize the growth of the number of global minima in the overparametrized case. Call a partition Π lonely if each piece π_p ∈ Π contains at most one datapoint (see Figure 4). Then, we can prove the following results:

Lemma 1. For any lonely partition Π, there are infinitely many parameter settings θ_BDSO that induce Π and are global minima with ℓ̃(θ_BDSO) = 0. Furthermore, in the overparametrized regime H ≥ cN for some constant c ≥ 1, the total number of lonely partitions, and thus a lower bound on the total number of global minima of ℓ̃, is C(H + 1, N).

Remark 1. Suppose that the H breakpoints are uniformly spaced and that the N datapoints are uniformly distributed within the region of breakpoints. Then, in the overparametrized regime H ≥ dN² for some constant d ≥ 1, the induced partition is lonely with high probability 1 − e^{−N²/(H+1)} = 1 − e^{−1/d}.
Thus, with only O(N 2 ) hidden units, we can almost guarantee a lonely partition at initialization. This makes optimization easier but is not sufficient to guarantee that learning will converge to a global minimum, as breakpoint dynamics could change the partition to crowded (see section 2.5). Note how simple and transparent the spline-based explanation is for why overparametrization makes optimization easier. For brevity, we include experiments testing our theory of loneliness in Figure 14 in Appendix.

The Gradient and Hessian of the Loss ℓ(θ NN )
Previous work has revealed many key insights into the Hessian of the loss surface of neural networks. It has long been empirically known that GD tends to work surprisingly well, and that most or all local minima are actually near-optimal (Nguyen and Hein, 2017), although this is known to depend on the overparametrization ratio and the data distribution (Granziol et al., 2019). Flatter local minima have been shown to be connected to better generalization (Sankar et al., 2020), but the areas around critical points may be unusually flat, with the bulk of Hessian eigenvalues near 0, and only a few larger outliers (Ghorbani et al., 2019), meaning that many local minima are actually connected within a single basin of attraction (Sagun et al., 2017). Our theory can shed new light on many of these phenomena. The gradient of ℓ(θ_NN) can be expressed in closed form, and from it we can derive expressions for the Hessian of the loss ℓ(θ_NN) at any parameter θ_NN.
Theorem 5. Let θ* be a critical point of ℓ(·) such that β_i(θ*) ≠ x_n for all i ∈ [H] and for all n ∈ [N]. Then the Hessian H_ℓ(θ*) is the positive semi-definite Gram matrix of a set of 3H + 1 vectors, as shown in Equation (4). Thus, H_ℓ(θ*) is positive definite iff the vectors of this set are linearly independent.
Remark 2. The zero eigenvalues of H_ℓ(θ*) correspond to:
1. α-scaling: For each neuron i, the scaling transformation leaves the approximating function f̂(x) (and thus the training loss) invariant, generating a 1-dimensional hyperbolic manifold of constant loss; the tangent vector of this manifold will be in the null space of H_ℓ(θ) for all θ. This manifold is depicted in Figure 1.
2. Neurons with no active data: changes to w_i or v_i do not change f̂(x), thus contributing two 0 eigenvalues.
3. Functional changes: Any functional change that does not change the value of f̂(x) at the training data will leave the loss unchanged. In the sufficiently overparametrized regime, there are many ways to make such changes. Some examples are shown in Figure 5.
The first two types are artifacts of the NN parametrization, caused by invariance to the α-scaling symmetry.
Additionally, we note that these are all also zero eigenvalues for non-critical points, but that there are other zero eigenvalues for non-critical points. While H_ℓ(θ*) does not depend on ǫ at the above critical points, at other points, ǫ terms can "cancel out" the corresponding term in Equation (4), yielding additional zero eigenvalues (as well as negative eigenvalues). From this, we note that for any critical point θ*, there is a large connected sub-manifold of parameter space of constant loss. Moving along this manifold consists of continuously deforming f̂(x; θ*_NN) such that the output values at the datapoints are left invariant, or moving along the α-scaling manifolds, which leave f̂(x; θ*_NN) invariant. The 0 singularities lie at the intersection of every α-scaling manifold for neuron i, corresponding to the α = 0 and α = ∞ limits.
Consider the µ-only Hessian under the BDSO parametrization, d²ℓ/dµ², and rewrite our network as ŷ = Φ(x; β)µ, where Φ is the N × H feature matrix of activations (or basis functions) (x_n − β_i)_{s_i}, which is constant with respect to µ. Then, we have d²ℓ/dµ² = ΦᵀΦ. Applying an SVD Φ = USVᵀ, we get ΦᵀΦ = VS²Vᵀ, so that the eigenvalues of the Hessian are the squared singular values of Φ. Thus, we can decompose the vector of delta-slopes as µ = µ_r + µ_n, where µ_r is in the range of Φᵀ while µ_n is in the nullspace of Φ. Thus, ŷ = Φ(x; β)µ_r + Φ(x; β)µ_n = Φ(x; β)µ_r, and the gradient with respect to µ is always constrained to the linear subspace containing µ_r. Figure 6A shows a 2-dimensional representation of the loss surface, which is quadratic along µ_r and constant along µ_n. Figure 6B illustrates the effect of varying β, which changes Φ(x; β) and hence µ_r and µ_n.
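The rank argument above is easy to verify: since Φ is N × H, the µ-only Hessian ΦᵀΦ has at most N nonzero eigenvalues, so an overparametrized net (H > N) has at least H − N flat directions in µ. A minimal sketch with hypothetical random breakpoints and orientations:

```python
import numpy as np

rng = np.random.default_rng(5)
N, H = 5, 12                        # overparametrized: H > N
x = np.sort(rng.uniform(-1, 1, N))
beta = rng.uniform(-1, 1, H)        # fixed breakpoints
s = rng.choice([-1.0, 1.0], H)      # fixed orientations

# Feature matrix Phi[n, i] = (x_n - beta_i) * [[s_i (x_n - beta_i) > 0]]
ramp = x[:, None] - beta[None, :]
Phi = ramp * ((s[None, :] * ramp) > 0)

# mu-only Hessian of the squared loss (up to a constant factor):
H_mu = Phi.T @ Phi
eigs = np.sort(np.linalg.eigvalsh(H_mu))[::-1]
nonzero = int(np.sum(eigs > 1e-10))
print("nonzero eigenvalues (rank):", nonzero)          # at most N
print("zero eigenvalues (flat directions):", H - nonzero)
```

The eigenvalues of ΦᵀΦ are the squared singular values of Φ, so the flat directions are exactly the nullspace component µ_n described above.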

The Flatness of the Hessian
An important question is: How does overparametrization impact (the spectrum of) the Hessian? The symmetry-quotiented spline parametrization enables us to lower bound the number of zero eigenvalues of the Hessian (i.e., its "flatness") as follows.
Corollary 3. Let θ * be a critical point of ℓ(·) such that its data partition is lonely, and either at least one neuron is active on all data or there is at least one pair of oppositely-faced neurons in the same data gap, so that x ∈ span(B). Then, ℓ(θ * ) = 0 and H ℓ (θ * ) has exactly N non-zero eigenvalues, and thus 3H + 1 − N zero eigenvalues.
Intuitively, as overparametrization H/N increases, the number of neurons with shared activation patterns increases, which in turn means many redundant breakpoints between each pair of datapoints, which increases the number of flat directions (zero eigenvalues) of the Hessian. Additionally, each neuron is guaranteed one degree of freedom in the form of the α-scaling symmetry, leading to H intrinsic zero eigenvalues of the Hessian.

FIGURE 5 | Examples of functional changes that leave the loss invariant: (A) when there are multiple breakpoints with the same facing in the interval (x_n, x_{n+1}), they can change in ways that cancel out, leaving the function unchanged on the data; (B) if two breakpoints on either side of a datapoint (with no other data in between) have the same facing, they may "rotate" the portion of the function between them around x_n; (C) any neuron outside of the data range and facing away (which therefore has no active data) is completely unconstrained by the loss until β_i enters the data range.

FIGURE 6 | 2-dimensional representation of the loss surface as a function of µ, broken down into a loss-sensitive range and a loss-independent nullspace. (A) Example for fixed β. (B) Same system as before, except β was varied, resulting in a modified µ range and nullspace.

Gradient Flow Dynamics for Spline Parameters
Theorem 6. For a one-hidden-layer univariate ReLU network trained with gradient descent with respect to the neural network parameters θ_NN, the gradient flow dynamics of the function space parameters θ_BDSO = {(β_i, µ_i)}_{i=1}^{H} are governed by laws (6) and (7) (see Appendix B.8 for the derivation). Here i ∈ [H] indexes all hidden neurons, and the initial conditions β_i(0), µ_i(0) ∀i ∈ [H] must be specified by the initialization used.
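The induced (β, µ) dynamics can be observed directly by running plain gradient descent on the NN parameters and mapping to spline coordinates at each step. The following is a minimal full-batch sketch (a hypothetical toy target y = |x| with a single kink, small output weights for a flat start; not the paper's exact experimental setup):

```python
import numpy as np

rng = np.random.default_rng(7)
H, N, lr, steps = 20, 40, 0.05, 5000
x = np.sort(rng.uniform(-1, 1, N))
y = np.abs(x)                              # toy target with one kink at 0

w = rng.normal(0, 1, H)
b = rng.normal(0, 1, H)
v = rng.normal(0, 0.1, H)                  # small output weights -> flat init

def mse(w, b, v):
    return np.mean((np.maximum(np.outer(x, w) + b, 0.0) @ v - y) ** 2)

mse0 = mse(w, b, v)
for _ in range(steps):
    pre = np.outer(x, w) + b               # N x H pre-activations
    act = np.maximum(pre, 0.0)
    eps = act @ v - y                      # residuals
    gate = (pre > 0).astype(float)
    gw = (eps[:, None] * gate * v[None, :] * x[:, None]).mean(0)
    gb = (eps[:, None] * gate * v[None, :]).mean(0)
    gv = (eps[:, None] * act).mean(0)
    w, b, v = w - lr * gw, b - lr * gb, v - lr * gv

beta, mu = -b / w, w * v                   # induced spline parameters
print("MSE:", mse0, "->", mse(w, b, v))
print("breakpoints within 0.2 of the kink:", int(np.sum(np.abs(beta) < 0.2)))
```

Plotting `beta` over training steps in such a run is exactly the kind of breakpoint-dynamics visualization discussed below (migration toward the region of high curvature at the kink).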

Impact of Init Scale α
As mentioned previously, the standard NN parametrization has symmetries such that the function f̂ and the loss are invariant to α-scaling transformations of the NN parameters (section 2.1). However, such scalings do have a large effect on the initial NN gradients; specifically, applying the α-scaling transformation in Equations (6) and (7), the initial learning rate for β_i scales as 1/α²_i, while that of µ_i scales approximately as α²_i. Changing α at initialization thus has a large impact on the final learned solution. In section 2.6 we show how α determines the kind of (implicit) regularization seen in NN training (Woodworth et al., 2020).
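The asymmetry between function invariance and gradient non-invariance can be demonstrated directly. The sketch below (hypothetical toy data, bias-free output) applies the α-scaling (w, b, v) → (αw, αb, v/α), checks that the function is unchanged, and measures how the parameter gradients rescale:

```python
import numpy as np

def f_nn(x, w, b, v):
    return np.maximum(np.outer(x, w) + b, 0.0) @ v

def grads(x, y, w, b, v):
    """Gradients of the squared loss for a shallow ReLU net."""
    pre = np.outer(x, w) + b
    act = np.maximum(pre, 0.0)
    eps = act @ v - y                      # residuals
    gate = (pre > 0).astype(float)
    gw = (eps[:, None] * gate * v[None, :] * x[:, None]).sum(0)
    gb = (eps[:, None] * gate * v[None, :]).sum(0)
    gv = (eps[:, None] * act).sum(0)
    return gw, gb, gv

rng = np.random.default_rng(6)
H = 4
w, b, v = rng.normal(size=H), rng.normal(size=H), rng.normal(size=H)
x, y = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20)

alpha = 10.0
w2, b2, v2 = alpha * w, alpha * b, v / alpha   # alpha-scaling symmetry

# The function (and hence the loss) is invariant...
assert np.allclose(f_nn(x, w, b, v), f_nn(x, w2, b2, v2))

# ...but the gradients are not: the w-gradient shrinks by 1/alpha
# while the v-gradient grows by alpha.
gw1, _, gv1 = grads(x, y, w, b, v)
gw2, _, gv2 = grads(x, y, w2, b2, v2)
print(np.linalg.norm(gw2) / np.linalg.norm(gw1))   # 1/alpha
print(np.linalg.norm(gv2) / np.linalg.norm(gv1))   # alpha
```

Since the residuals and activation gates are identical at the two parameter settings, the rescaling of the gradients is exact, which is the mechanism behind the α-dependent effective learning rates for β_i and µ_i noted above.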

Breakpoint Dynamics
Next, we wish to build some intuition for the conditions that lead to the different knot types, and for the effect the types have on training dynamics and implicit regularization. First, we rewrite Equation (6) for β̇_i, and note that the dot product appearing there is a cumulative sum of weighted residuals (summing from the furthest datapoint on the active side of neuron i and moving toward β_i). Next, let A_i = {n | 1_{i,n} = 1} denote the set of data indices which are active for neuron i, and let X_i be the active half-space of neuron i, i.e., (−∞, β_i) or (β_i, ∞).
If we view ǫ_i(t) as a function ǫ(β_i, t) ≜ Σ_{n∈A_i} ǫ_n δ(β_i − x_n), we can rewrite the dot product as an integral and write β̇_i as a function ϱ_i(β, t) involving ψ(β, t); here ψ_+(·, ·) is just ψ_−(·, ·) reflected across the (finite) total integral, so we drop the subscript and refer to ψ(·, ·) from now on. Then, to identify Repeller and Attractor knots, we are interested in the datapoints β_i = x_n where β̇_i changes sign, i.e., in the zero-crossings of ψ(·, t). Very roughly, at earlier stages of learning, large regions of the input space have high residual, so that many neurons get large, similar updates, leading to broad, smooth changes to f̂(x; θ_NN) and hence broad, smooth changes to ǫ(x, t) and ψ(β, t). Applying such a change to ψ(β, t) will have the effect of "centering" ψ(β, t) so that it has an average value of 0 over large regions. Because the changes are broad at first, the steeper regions of ψ(β, t) will retain most of their steepness. Together with the "centering," this leads to the conclusion that the local extrema of ∂ψ(β, t)/∂β will be near zero-crossings of ψ(β, t).
Due to the broad centering effect, the second term is ≈ 0, so that ∂ψ(β, t)/∂β ≈ ǫ(β, t)(1 + β²). Therefore, the local extrema of ǫ(β, t) will be local extrema of ∂ψ(β, t)/∂β, and hence near zero-crossings of ψ(β, t). Note that the above analysis shows that β̇_i, and hence the classification of each datapoint, is determined by the same ψ(β, t), up to a scaling factor v_i(t)/w_i(t) and possibly a reflection based on sgn(w_i); i.e., groups of neurons with the same signs of v_i and w_i will yield the same datapoint classification. If a datapoint persists in being an Attractor knot for a group of nearby neurons for long enough, and is surrounded by Passthrough knots, then those nearby neurons will converge on the datapoint and form a cluster. This leaves the question: what conditions lead to local extrema of ǫ(β, t) that persist long enough for this behavior?
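The zero-crossing picture can be illustrated numerically. The sketch below is a simplified stand-in, not the paper's exact eq. (6): we take a hypothetical residual profile ǫ over the datapoints and, for a right-active neuron, accumulate the weighted residuals over its active half-space; sign changes of this sum mark candidate Attractor/Repeller locations.

```python
import numpy as np

# Hypothetical illustration: a residual profile over 9 datapoints, and a
# simplified cumulative weighted-residual sum psi(beta) for a right-active
# neuron (active half-space (beta, inf)). The exact weighting in the
# paper's eq. (6) may differ; this is a sketch of the sign structure only.
x = np.linspace(-1, 1, 9)
eps = np.sin(3 * x)                       # stand-in residual profile

def psi(beta):
    active = x > beta                     # right-active half-space
    return np.sum(eps[active] * (x[active] - beta))

vals = np.array([psi(b) for b in x])
# sign changes of psi mark where breakpoint velocity flips sign,
# i.e., candidate Attractor/Repeller datapoints
crossings = np.where(np.sign(vals[:-1]) != np.sign(vals[1:]))[0]
print(crossings)
```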
Empirically, we find concentrations of breakpoints forming near regions of high curvature or discontinuities in the training data, leading to a final fit that is close to a linear spline interpolation of the data. In particular, breakpoints migrate toward nearby curvature that is being underfit and whose inflection matches that breakpoint's delta-slope. These clusters can remain fixed for some time before suddenly shifting (all breakpoints in a cluster move together away from the shared attractor x_n) or splitting (two groups of β form two new subclusters that move away from x_n in each direction). These movements can cause "smearing," in which the cluster loses coherence. Sometimes this smearing effect is caused by the need to fit smooth curvature near a cluster: once the residual at the cluster falls below the residual of nearby curvature being underfit by the linear fit between clusters, the cluster tends to spread out; see Figure 7E.
As an illustrative example to explore this question, consider fitting a dataset such as that shown in Figure 7. At initialization (Figure 7A), f̂(x; θ_NN) is much flatter than f(x). After a short time of training (Figure 7B), f̂(x; θ_NN) has matched the average value of f(x), but is still much flatter, leading to relatively broad regions where ǫ(β, t) is large and of the same sign, with zero-crossings of ψ(β, t) near the first two inflection points. After a little more training (Figure 7C), we see that all three inflection points correspond to zero-crossings in ψ(β, t), and that these zero-crossings have persisted long enough that clusters of breakpoints have started to form, although the leftmost inflection point is already fit well enough that ψ(β, t) (and hence β̇_i) is quite low in the region, so those breakpoints are unlikely to make it "all the way" to a cluster before the data are well fit. Later (Figure 7D), the right two inflection points have nearby tight clusters. However, once f̂(x; θ_NN) reaches f(x) at these clusters, the local maxima of ǫ(β, t) move away from the clusters, and the clusters "spread out," as shown in the converged fit in Figure 7E.

Videos of Gradient Descent/Flow Dynamics
We have developed a visualization based on the BDSO spline parametrization. For many of the experiments in this paper, corresponding videos showing the learning dynamics are available at http://shorturl.at/tFWZ2.

Implicit Regularization
One of the most useful and perplexing properties of deep neural networks has been that, in contrast to other high-capacity function approximators, overparametrizing a neural network does not tend to lead to excessive overfitting (Savarese et al., 2019). Where does this generalization power come from? What is the mechanism? One promising idea is that implicit regularization (IR) plays a role. IR acts as if an extra regularization term had been applied to the optimizer, despite no such term being explicitly added by a designer. Instead, IR arises implicitly, brought about by some combination of architecture, initialization scheme, and optimization. Much recent work (Neyshabur et al., 2015, 2018) has argued that this IR is due to the optimization algorithm itself (i.e., SGD). Our symmetry-quotiented spline perspective makes the role of the α-scaling symmetry apparent: IR depends critically on breakpoint and delta-slope learning dynamics and initialization. These results follow from section 2.5, which shows that changes to the initialization scale result in dramatic changes to the relative speeds of learning dynamics for different parameters. In particular, in the kernel regime (α → ∞), breakpoints move very little whereas delta-slopes move a lot, resulting in diffuse populations of breakpoints that distribute curvature. In stark contrast, in the adaptive regime (α → 0), breakpoints move a lot whereas delta-slopes move very little, resulting in tight clusters of breakpoints that concentrate curvature.

Kernel Regime
We first analyze the so-called kernel regime, inspired by Chizat et al. (2019), where it is referred to as "lazy training." In this section, we omit the overall bias b_0.
Lemma 2. Consider the dynamics of gradient flow on ℓ(·) started from θ_NN,α = (αw_0, αb_0, v_0 = 0), where w_i ≠ 0 ∀i ∈ [H]. In the limit α → ∞, β(t) does not change, i.e., each breakpoint location and orientation is fixed. In this case, the θ_NN model reduces to a (kernel) linear regression ŷ = Φ(x; β)µ, where µ ∈ R^H are the regression weights and Φ(x; β) ∈ R^{N×H} is the matrix of nonlinear features φ_ni = 1_{i,n}(x_n − β_i)s_i.
Using this, we can then adapt a well-known result about linear regression (see e.g., Zhang et al., 2016):

Theorem 7. Let µ* be the converged µ parameter after gradient flow on the BDSO model eq. (8) starting from some µ_0, with β held constant. Furthermore, suppose that the model perfectly interpolates the training data, ℓ(θ_BDSO) = 0. Then µ* = argmin_{µ : ℓ(µ) = 0} ‖µ − µ_0‖₂.

Thus, the case where breakpoint locations and orientations are fixed and µ_0 = 0 reduces to ℓ₂-regularized linear regression on the delta-slope parameters µ. A geometric representation of this process is shown in Figure 8. Recalling that ‖µ‖₂² is the roughness of f̂(x; θ_BDSO), this result demonstrates the importance of the initialization, and in particular of the initial roughness (Theorem 2): with a high-roughness initialization, minimizing ‖µ − µ_0‖₂² will yield a high-roughness final fit.

FIGURE 8 | View of the loss surface as in Figure 6. GD starting from the origin (green dashed line) monotonically increases ‖µ‖₂ (shown as concentric circles). GD starting from some other initialization (red dashed line) minimizes ‖µ − µ_0‖₂ instead (shown as concentric arcs).
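The mechanism behind Theorem 7 admits a compact numerical sketch (with a hypothetical choice of feature map standing in for the frozen-breakpoint features): every gradient step moves µ only within the row space of Φ, so the flow stays on the affine set µ_0 + rowspace(Φ) and converges to the interpolant closest to the initialization in ℓ₂.

```python
import numpy as np

rng = np.random.default_rng(1)
N, H = 5, 40                                 # overparametrized: H >> N
x = np.sort(rng.uniform(-1, 1, N))
y = rng.normal(size=N)
beta = np.linspace(-1, 1, H)                 # frozen breakpoint locations
s = np.where(np.arange(H) % 2 == 0, 1.0, -1.0)
Phi = np.maximum(s * (x[:, None] - beta), 0.0)  # hypothetical frozen features

mu0 = rng.normal(size=H)
mu = mu0.copy()
lr = 1.0 / np.linalg.norm(Phi, 2) ** 2       # stable step size
for _ in range(5000):                        # plain gradient descent
    mu -= lr * Phi.T @ (Phi @ mu - y)

# each update lies in rowspace(Phi), so the iterates never leave the
# affine set mu0 + rowspace(Phi); the flow's limit is therefore the
# interpolant closest to mu0 in l2: mu* = mu0 + pinv(Phi)(y - Phi mu0)
P = Phi.T @ np.linalg.pinv(Phi.T)            # projector onto rowspace(Phi)
mu_star = mu0 + np.linalg.pinv(Phi) @ (y - Phi @ mu0)

assert np.sum((Phi @ mu - y) ** 2) < np.sum((Phi @ mu0 - y) ** 2)
assert np.allclose(P @ (mu - mu0), mu - mu0)
assert np.allclose(Phi @ mu_star, y)         # the limit interpolates exactly
```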
In the overparametrized regime with an initialization that yields a lonely partition, the model will reach ℓ(θ_BDSO) = 0. If we consider the infinite-width limit, we get the following result, which is a specialization of Theorem 5 of Williams et al. (2019):

Corollary 4. Consider the setting of Theorem 7, with the additional assumptions that µ_0 = 0 and the breakpoints are uniformly spaced, and let H → ∞. Then the learned function f̂_∞(x; µ*, β*) is the global minimizer of ∫ f″(x)² dx subject to interpolating the training data. (9)

As such, f̂_∞(x; µ*, β*) is a natural cubic smoothing spline with N degrees of freedom (Ahlberg et al., 1967).
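The "curvature spread over more units" intuition behind Corollary 4 can be checked directly: with µ_0 = 0 and frozen uniform breakpoints, the min-norm interpolant's roughness ‖µ‖² shrinks as the width grows. The sketch below uses hypothetical data and a simple two-orientation feature construction of our own:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
x = np.sort(rng.uniform(-1, 1, N))
y = np.sin(2 * x)

def min_norm_roughness(H):
    beta = np.linspace(-1.2, 1.2, H)
    # one left- and one right-oriented ReLU per breakpoint location
    Phi = np.concatenate([np.maximum(x[:, None] - beta, 0.0),
                          np.maximum(beta - x[:, None], 0.0)], axis=1)
    mu = np.linalg.pinv(Phi) @ y          # min-l2-norm interpolant (mu0 = 0)
    assert np.allclose(Phi @ mu, y)       # interpolation holds exactly
    return np.sum(mu ** 2)                # roughness proxy ||mu||^2

r_small, r_large = min_norm_roughness(20), min_norm_roughness(400)
assert r_large < r_small                  # more units -> smoother fit
```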
Theorem 5 of Williams et al. (2019) is a generalization of Corollary 4 to the non-uniform case: it replaces Equation (9) with a version in which the curvature penalty is weighted by the initial breakpoint density p_β(β) induced by the specific initialization used.² Deriving this result using the BDSO framework allows us to see that the initialization has the effect of weighting the curvature at certain locations more than others.

Kernel Regime Dynamics as PDE
Assume a flat (µ_i(0) = 0 ∀i ∈ [H]) kernel regime initialization. Specializing our spline parameterization dynamics results (Theorem 6), we have µ̇_i(t) = r_{1,i}(t) + r_{x,i}(t)β_i. Note that the terms r_{1,i}(t) and r_{x,i}(t) depend on the index i only through the mask 1_i; in other words, they are the same for all breakpoints with the same activation pattern. This pattern is entirely determined by the orientation s_i and the data interval (x_n, x_{n+1}) that β_i falls into. Thus, the possible activation patterns can be indexed by datapoint index n and orientation s. Letting ς_{n,s} denote the set of breakpoints with the activation pattern corresponding to (n, s), we have µ̇_i(t) = r_{1,n,s}(t) + r_{x,n,s}(t)β_i for i ∈ ς_{n,s}. Let r_{1,n,s}(t) and r_{x,n,s}(t) be vectors containing |ς_{n,s}| copies of r_{1,n,s}(t) (resp. r_{x,n,s}(t)), and let r_1(t) and r_x(t) be the concatenations of these over all n and s. This allows us to write µ̇(t) = r_1(t) + r_x(t) ⊙ β, where µ and β are the vectors of µ_i (resp. β_i) values for all i. Then µ(t) can be viewed as a function µ : [H] × {+, −} × R → R, which is isomorphic to the function µ : R × {+, −} × R → R with µ̇(x, s, t) = r_1(x, s, t) + r_x(x, s, t)x, where r_1(x, s, t) and r_x(x, s, t) are piecewise-constant functions of x with discontinuities at datapoints. Because r_1(x, s, t) is piecewise constant in x for all t, ∫₀ᵗ r_1(x, s, τ) dτ will also be piecewise constant, and likewise for r_x(x, s, t). Thus, µ(x, s, t) will be (discontinuous) piecewise linear in x for all t. Moreover, f̂″_∞(x; t) = µ(x, +, t) + µ(x, −, t), and as µ(x, s, t) is (discontinuous) piecewise linear, this implies that f̂_∞(x; t) will be continuous piecewise cubic in x for all t, consistent with Corollary 4. Stepping back, we see that a finite-width ReLU network trained by gradient descent is a discretization (in both space and time) of this underlying continuous PDE.

² In the original formulation (Williams et al., 2019), the p_β(x) term is defined much more opaquely as ν(x) = ∫ |w| p(b = ax | w) dp(w).
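The shared-pattern structure is easy to verify numerically: for a hypothetical residual profile and several breakpoints sharing an activation pattern, the loss gradient with respect to each µ_i is affine in β_i with the same coefficients r_1 and r_x. (Sign conventions here are ours; under gradient flow, µ̇_i = −grad_i.)

```python
import numpy as np

x = np.array([-1.0, -0.3, 0.4, 1.0])
y = np.array([0.5, -0.2, 0.1, 0.7])

# three right-oriented breakpoints inside the same data interval (x_3, x_4):
beta = np.array([0.5, 0.6, 0.8])
s = np.ones(3)
Phi = np.maximum(s * (x[:, None] - beta), 0.0)
mu = np.zeros(3)                       # flat kernel-regime init
eps = Phi @ mu - y                     # residuals

grad = Phi.T @ eps                     # d(loss)/d(mu_i); mu_dot = -grad
# all three breakpoints share the activation pattern (only the rightmost
# datapoint is active), so their gradients are affine in beta_i with
# shared coefficients r_1 and r_x:
active = x > beta[0]
r_1 = np.sum(eps[active] * x[active])
r_x = -np.sum(eps[active])
assert np.allclose(grad, r_1 + r_x * beta)
```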

Adaptive Regime
Previous work (Arora et al., 2019a) has shown that the kernel regression regime is insufficient to explain the generalization achieved by modern DL. Instead, the non-convexity of the optimization (i.e., the adaptive regime α → 0, called the "rich regime" in Woodworth et al., 2020) must be responsible. The spline parametrization clearly separates the two, with all adaptive-regime effects due to the breakpoint dynamics β_i(t): as α → 0, breakpoints become more mobile, and the breakpoint dynamics described in section 2.5 become more and more dominant. As shown in section 2.3, high breakpoint mobility can lead to clusters of breakpoints forming at certain specific datapoints. For low enough α, this clustering of breakpoints yields a fit that is more linear spline than cubic spline; intermediate values of α interpolate between these two extremes, giving a fit with higher curvature in some areas than others.
Intriguingly, this suggests an explanation for why the kernel regime is insufficient: the adaptive regime enables breakpoint mobility, allowing the NN to adjust the initial breakpoint distribution (and thus the basis of activation patterns). This suggests that data-dependent breakpoint initializations may be quite useful (see Table 1 for a simple experiment in this vein).
Spline theory traditionally places a knot at each datapoint for smoothing splines (James et al., 2013). However, in circumstances where this is not feasible, either computationally or due to using a different spline, a variety of methods exist (MLE, placement at set quantiles of the regression variable, greedy algorithms, and more) (Park, 2001; Ruppert, 2002; Walker et al., 2005). More recent work has started to develop adaptive splines, where knots are placed to minimize roughness, length, or a similar metric, and knot locations are found via Monte Carlo methods or evolutionary algorithms (Miyata and Shen, 2005; Goepp et al., 2018). This adaptivity allows the spline to better model the function by moving knots to areas of higher curvature, remarkably similar to the behavior of a low-α NN.
In general, an individual breakpoint does not have a large impact on the overall function (O(1/H) on average); it is only through the cooperation of many breakpoints that large changes to the function are made. Another way of formulating this distinction is that the function (and hence the loss) depends on the breakpoints only through the empirical joint density p̂_H(β, µ, s). Similarly, training dynamics depend on the θ_NN joint density p̂_H(w, v, b). This formulation is explored in Mei et al. (2018), who derive a dynamics equation for the joint density (in θ_NN). Adapting this work to explore the dynamics of the θ_BDSO joint density is ongoing work.

Comparison With Concurrent Work
Independently of and concurrently with previous versions of this work (Sahs et al., 2020a,b), Williams et al. (2019) obtained implicit regularization results in the kernel and adaptive regimes which parallel our results in this section rather closely. Despite the similarities, we take a significantly different approach. Comparing our results to those in Williams et al. (2019), the key differences are: (1) Theorem 5 of Williams et al. (2019) is a slightly more general form of our Corollary 4, extending to the case of non-uniform breakpoints. (6) Finally, we have many novel results regarding initialization, loss surface properties, the Hessian, and dynamics (everything outside of section 2.6). All in all, we believe our results are quite complementary to those of Williams et al., particularly as we extend our results to novel areas using our Hessian and loss surface analysis.

Extending to Multivariate Inputs
Throughout this work we chose to focus on the univariate case in order to build intuition and enable simpler theoretical results. However, in practice most networks have multivariate inputs; fortunately, our BDSO parametrization can be easily extended to multivariate inputs. For D-dimensional inputs, we write the network with the input weights represented as D-dimensional vectors w_i. We then use the reparametrization (η_i, ξ_i, γ_i), where η_i is a "delta-slope" parameter³ and (ξ_i, γ_i) parametrize the orientation and signed distance from the origin of a (D − 1)-dimensional "oriented breakplane," generalizing the 0-dimensional left-or-right oriented breakpoint represented by (s_i, β_i) in the 1-dimensional case. Generalizing our results to this parametrization is ongoing work.

FIGURE 9 | Training loss vs. number of pieces (∝ number of parameters) for various algorithms fitting a piecewise-linear (PWL) function to a quadratic. We observe a strict ordering of optimization quality among the stochastic gradient descent based algorithms, with Adam (an optimizer with momentum) outperforming BatchNorm SGD, which outperforms vanilla SGD.

Suboptimality of Gradient Descent
Focusing on univariate networks allows us to directly compare against existing (near-)optimal algorithms for fitting piecewise-linear (PWL) functions, including the Dynamic Programming algorithm (DP; Bai and Perron, 1998) and a very fast greedy approximation known as Greedy Merge (GM; Acharya et al., 2016) (Figure 9). Given a fixed budget of pieces (∝ number of parameters, e.g., network width), how well do these algorithms compare to SGD variants in fitting a quadratic (high-curvature) function? DP and GM both quickly reduce training error to near 0 with on the order of 100 pieces, with GM requiring slightly more pieces for similar performance. All the GD variants require far more pieces to reduce error below any target threshold, and their error may not even decrease monotonically with the number of pieces. These results show how inefficient GD is w.r.t. parameters, requiring orders of magnitude more for performance similar to that of the PWL fitting algorithms.
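For intuition about why a budget of order 100 pieces suffices, consider the simplest PWL baseline, a connect-the-dots interpolant on uniform knots. This is our own illustration, not the DP or GM algorithms from the text: the max error on a quadratic falls like O(1/k²) in the number of pieces k.

```python
import numpy as np

# Illustrative stand-in (not the DP or Greedy Merge algorithms): a
# piecewise-linear interpolant of x^2 on k uniform knots. For a C^2
# target, the interpolation error shrinks like O(1/k^2), so ~100 pieces
# already give near-zero error.
x = np.linspace(-1, 1, 1001)
target = x ** 2

def pwl_error(k):
    knots = np.linspace(-1, 1, k + 1)        # k linear pieces
    fit = np.interp(x, knots, knots ** 2)    # connect-the-dots PWL fit
    return np.max(np.abs(fit - target))

errs = [pwl_error(k) for k in (2, 4, 8, 16)]
assert all(a > b for a, b in zip(errs, errs[1:]))  # error decreases with k
assert errs[-1] < 1e-2                             # 16 pieces: error < 0.01
```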

Effect of Initial Breakpoint Distribution
We first ask whether the standard initializations will experience difficulty fitting functions that have significant curvature away from the origin (e.g., learning the energy function of a protein molecule). We train ReLU networks to fit a periodic function (sin(x)), which has high curvature both at and far from the origin. We find that the standard initializations do quite poorly away from the origin (Table 1, first row), consistent with our theory that breakpoints are essential for modeling curvature. Probing further, we observe empirically that breakpoints cannot migrate very far from their initial location, even if there are plenty of breakpoints overall, leading to highly suboptimal fits. To establish that it is indeed the breakpoint density that is causally responsible, we attempt to rescue the poor fits by using a simple data-dependent initialization that samples breakpoints uniformly over the training data range [x_min, x_max], achieved by exploiting eq. (1). We train shallow ReLU networks on training data sampled from a sine and a quadratic function, two extremes on the spectrum of curvature. The data show that uniform breakpoint density (Table 1, second row) rescues bad fits in cases with significant curvature far from the origin, with less effect on other cases, confirming the theory. We note that this could be a potentially useful data-dependent initialization strategy, one that can scale to high dimensions, but we leave this for future work.

FIGURE 10 | Top: "Spiky" (orange) and standard (blue) initializations, compared before (left) and after (right) training. Note that both cases reached similar, very low training-set error. Bottom: Roughness vs. width (left) and vs. initialization variance (right) for both data-gap cases shown in Figure 11. Each datapoint is the result of averaging over 4 trials trained to convergence.
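A sketch of this data-dependent initialization, assuming eq. (1) identifies the breakpoint of a univariate ReLU unit as β_i = −b_i/w_i: sample β_i uniformly over the data range, then back-solve for the bias.

```python
import numpy as np

rng = np.random.default_rng(4)
H = 64
x_train = rng.uniform(2.0, 10.0, 200)      # data far from the origin

w = rng.normal(size=H) * 0.5
# data-dependent init (sketch of the strategy in the text): sample the
# breakpoints beta_i uniformly over [x_min, x_max], then set b_i = -w_i*beta_i
# so that w_i x + b_i = 0 exactly at x = beta_i
beta = rng.uniform(x_train.min(), x_train.max(), H)
b = -w * beta

recovered = -b / w                         # breakpoint locations
assert np.allclose(recovered, beta)
# breakpoints now cover the data range instead of clustering near 0
assert recovered.min() >= x_train.min() - 1e-9
assert recovered.max() <= x_train.max() + 1e-9
```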

Generalization: Implicit Regularization Emerges From Flat Init and Curvature-Based Parametrization
Our theory predicts that IR depends critically upon the flatness of the initialization (Theorem 7 and Corollary 4). Here, we test this prediction for shallow and deep univariate ReLU nets. We compare training with the standard flat initialization to a "spiky" initialization, and find that both fit the training data near perfectly, but that the "spiky" initialization has much worse generalization error (Figure 10, Top, and Table 3 in Appendix).
It appears that generalization performance is not automatically guaranteed by GD, but is instead due in part to the flat initializations, which are then preserved by GD. Our theoretical explanation is simple: integrating the dynamics in Equation (7) yields µ_i(t) = µ_i(0) + · · ·, and so the initialization's impact remains.

Impact of Width and Init Variance
Next, we examine how smoothness (roughness) depends on the width H, focusing on settings with large gaps in the training data. We use two discontinuous target functions (shown in Figure 11), leading to a gap in the data, and test how increasing H (with α unchanged) affects the smoothness of f̂. We test this on a "smooth" data gap that is easily fit, as well as a "sharp" gap, where the fit requires a sharper turn. We trained shallow ReLU networks with varying width H and initial weight variance σ_w until convergence, and measured the total roughness of the resulting CPWL approximation in the data gaps. Figure 10 (Bottom) shows that roughness in the data gaps decreases with width and increases with initial weight variance, confirming our theory. A higher weight variance, and thus a rougher initialization, acts similarly to the "spiky" initialization and leads to increased roughness at convergence. In contrast, higher width distributes the curvature "work" over more units, leading to lower overall roughness.

Impact of Init Scale α
Finally, we explore how changing α impacts IR. Empirically, as α increases from 0 to ∞ we see three qualitative regimes: (i) an underfitting linear spline, (ii) an interpolating linear spline, and (iii) a roughness-minimizing natural cubic spline. This is quantified in Table 2, where we compare the NN fit to a linear interpolation and a natural cubic spline fit for varying α. We first test the special case in which the initial function approximation is perfectly flat; we find excellent agreement with the linear interpolation and cubic spline fits for α = 3, 100 (uniform initialization) and α = 10, 100 (He initialization). The impact of α on IR can be more easily visualized in our Supplementary Videos. To gauge the impact of the initial breakpoint distribution, we also test with a standard He initialization (Cauchy-distributed breakpoints, Corollary 1).
In this case, we find that generalization error is uniformly higher for all α. More strikingly, regime (ii) above disappears, as a result of breakpoints being more concentrated around the origin and the initialization roughness being significantly nonzero. This supports the idea that the initial parameter settings, and in particular the breakpoint distribution, have a critical impact on the final fit and its IR.
Taken together, our experiments strongly support the conclusion that a smooth, flat initialization and overparametrization are jointly responsible for the presence and strength of IR, while the initialization weight scale α critically determines the type of IR.
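The roughness measurements used in these experiments can be sketched as follows (a simplified stand-in for the exact metric in the text): for a CPWL function, roughness is the sum of squared delta-slopes, Σ_i µ_i², evaluated over the region of interest (e.g., a data gap).

```python
import numpy as np

# Simplified roughness measurement for a CPWL function given by (knots,
# values): the delta-slope mu_i at each interior knot is the change in
# slope there, and roughness is sum_i mu_i^2 over knots inside (lo, hi).
def roughness(knots, values, lo, hi):
    slopes = np.diff(values) / np.diff(knots)
    mu = np.diff(slopes)                   # delta-slope at each interior knot
    inside = (knots[1:-1] > lo) & (knots[1:-1] < hi)
    return np.sum(mu[inside] ** 2)

knots = np.linspace(0, 1, 11)
smooth = knots.copy()                           # a straight line
kinky = np.where(knots < 0.5, knots, 1 - knots) # one sharp bend near 0.5
assert roughness(knots, smooth, 0, 1) == 0.0    # no slope changes at all
assert roughness(knots, kinky, 0, 1) > 0.0      # the bend contributes
```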

DISCUSSION
We have shown that removing the α-scaling symmetry and examining neural networks in spline space enables us to glean new theoretical and practical insights. The spline view highlights the importance of initialization: a smooth initial approximation is required for a smooth final solution. Fortunately, existing initializations used in deep learning practice approximate this property. In spline space, we also achieve a surprisingly simple and transparent view of the loss surface, its critical points, its Hessian, and the phenomenon of overparametrization. It clarifies how increasing width relative to data size leads, with high probability, to lonely data partitions, which in turn are more likely to reach global minima. The spline view also allows us to explain the phenomenon of implicit regularization, and how it arises from overparametrization and the initialization scale α.

Related Work
Our approach is related to previous work (Frankle and Carbin, 2018; Arora et al., 2019b; Savarese et al., 2019) in that we wish to characterize parameterization and generalization. We differ from these works by focusing on small-width networks, rather than massively overparametrized or infinite-width networks, and by using a spline parameterization to study properties such as the smoothness of the approximated function. Previous work (Advani and Saxe, 2017) has hinted at the importance of low-norm initializations, but the spline perspective allows us to prove implicit regularization properties in shallow networks. Finally, Williams et al. (2019) is closely related and is discussed in detail at the end of section 2.6.

Explanation for Correlation Between Flatness of Minima and Generalization Error
A key unexplained empirical observation has been that flatter local minima tend to generalize better (Wei and Schwab, 2019). Our results above provide an explanation: as overparametrization O = H/N increases, the flatness of the local minima (as measured by the number of zero eigenvalues) increases (Corollary 3), and the smoothness of the implicitly regularized function (as measured by inverse roughness, ρ(f̂_H) ≥ ρ(f̂_∞)) also increases. As previously shown (Wu et al., 2017), flatter and simpler local minima imply better generalization.
Our work provides a parsimonious explanation for this: as we increase overparametrization, partitions become increasingly lonely, allowing for an increased redundancy in number of ways to exactly fit the training data (thus increasing the number of zero eigenvalues), while the inductive bias of gradient descent spreads the "work" (e.g., curvature changes due to delta-slopes) among many units, ensuring that each unit has a lesser effect and making the loss surface increasingly flat.

Future Work
Looking forward, there are still many questions to answer from the spline perspective: How does depth affect expressive power, learnability, and IR? What kinds of regularization are induced in the adaptive regime, and how do modern deep nets take advantage of them? How can data-dependent initializations of the breakpoints help rescue or improve the performance of GD? Can we design new global learning algorithms inspired by breakpoint (re)allocation? We believe the BDSO perspective can help answer these questions.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.