Rethinking Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics

Margin enlargement of training data has been an important strategy for perceptrons in machine learning for the purpose of boosting the confidence of training toward a good generalization ability. Yet Breiman (1999) shows a dilemma: a uniform improvement on margin distribution does not necessarily reduce generalization errors. In this paper, we revisit Breiman's dilemma in deep neural networks with recently proposed spectrally normalized margins from a novel perspective based on phase transitions of normalized margin distributions in training dynamics. The normalized margin distribution of a classifier over the data can be divided into two parts: low/small margins, such as some negative margins for misclassified samples, vs. high/large margins for highly confident, correctly classified samples, which often behave differently during the training process. Low margins for training and test datasets are often effectively reduced in training, along with reductions of training and test errors, whereas high margins may exhibit different dynamics, reflecting the trade-off between the expressive power of models and the complexity of data. When data complexity is comparable to the model expressiveness, high margin distributions for both training and test data undergo similar decrease-increase phase transitions during training. In such cases, one can predict the trend of generalization or test error through margin-based generalization bounds with restricted Rademacher complexities, shown in two ways in this paper with early stopping time exploiting such phase transitions. On the other hand, over-expressive models may have both low and high training margins undergoing uniform improvements with a distinct phase transition in test margin dynamics. This reconfirms Breiman's dilemma associated with over-parameterized neural networks, where margins fail to predict overfitting.
Experiments are conducted with some basic convolutional networks, AlexNet, VGG-16, and ResNet-18, on several datasets, including CIFAR10/100 and Mini-ImageNet.


INTRODUCTION
Margin, as a measurement of the robustness that allows some perturbations on classifiers without changing decisions on training data, has a long history in characterizing the performance of classification algorithms in machine learning. As early as [1], it played a central role in the proof of finite stopping, or convergence, of the perceptron algorithm when training data is separable. Equipped with convex optimization techniques, a plethora of large margin classifiers were triggered by support vector machines [2,3]. For neural networks, Bartlett [4,5] showed that the generalization error can be bounded by a margin-sensitive fat-shattering dimension, which is in turn bounded by the $\ell_1$-norm of weights, shedding light on the possible good generalization ability of over-parameterized networks with small weights despite the large VC dimensionality. The same idea was later applied to AdaBoost, an iterative algorithm to combine an ensemble of classifiers proposed by [6], which often exhibits a resistance to overfitting: during the training process, the generalization error does not increase even when the training error drops to zero. In pursuit of deciphering such resistance to overfitting, Schapire et al. [7] proposed an explanation that the training process keeps on improving a notion of classification margins in boosting, followed by later improvements [8] and works on establishing consistency of boosting via early stopping regularization [9][10][11]. Lately, such a resistance to overfitting was again observed in deep neural networks with over-parameterized models [12]. A renaissance of margin theory was brought by [13] with a normalization of networks using Lipschitz constants bounded by products of operator spectral norms. It has inspired many further investigations in various settings [14][15][16].
However, margin theory has a limitation: an improvement of the margin distribution does not necessarily guarantee a better generalization performance, which traces back at least to Breiman [17] in his effort to understand AdaBoost. In this work, Breiman designed an algorithm, arc-gv, such that the margin can be maximized via a prediction game. He then demonstrated an example in which one can achieve uniformly larger margin distributions on training data than AdaBoost but suffer a higher generalization error. At the end of this paper, Breiman made the following comments with a dilemma: "The results above leave us in a quandary. The laboratory results for various arcing algorithms are excellent, but the theory is in disarray. The evidence is that if we try too hard to make the margins larger, then overfitting sets in. ... My sense of it is that we just do not understand enough about what is going on." In this paper, we are going to revisit Breiman's dilemma in the context of deep neural networks. We shall see that margin distributions on training and test data may behave differently on the low and high parts during training processes. First of all, let us look at the following illustrative example.
Example 1.1 (Breiman's Dilemma with a CNN). A basic five-layer convolutional neural network of $c$ channels (see section 3 for details) is trained with the CIFAR-10 dataset whose 10% labels are randomly permuted as injected noises. When $c = 50$ with 92,610 parameters, Figure 1A shows the training error and generalization (test) error in solid curves. From the generalization error in Figure 1A, one can see that overfitting indeed happens after about 10 epochs despite the training error continuously dropping to zero. One can successfully predict such an overfitting phenomenon from Figure 1B, which shows the evolution of the normalized training margin distribution defined later in this paper.
In Figure 1B, while low or small margins are monotonically improved during training, high or large margins undergo a phase transition from increase to decrease around 10 epochs, such that one can predict the tendency of generalization error in Figure 1A using high margin dynamics. Two particular sections of high margin dynamics are highlighted in Figure 1B, one at 9.8 on the x-axis, which measures the percentage of normalized training margins no more than 9.8 (training margin error), and the other at 0.8 on the y-axis, which measures the normalized margins at quantile $q = 0.8$ (i.e., $1/\gamma_{q,t}$ defined later). Both of them match the tendency of generalization error in Figure 1A and find a good early stopping time to avoid overfitting. However, as we increase the channel number to $c = 400$, with about 5.8M parameters, and retrain the model, Figure 1C shows a similar overfitting phenomenon in terms of the generalization error; on the other hand, Figure 1D exhibits a uniform improvement of both low and high normalized margins without a phase transition during the training and thus fails to capture the overfitting. This demonstrates Breiman's dilemma in a wide CNN.
A key insight into this dilemma is that one needs a trade-off between the expressive power of models and the complexity of the dataset to endow training margins with prediction power. On one hand, when a model has a limited expressive power relative to the training dataset, in the sense that the low and high training margins cannot be uniformly improved during training, low margins can be effectively enlarged during training by reducing the training loss, though at the cost of sacrificing high margins, which do not affect the training loss as much as low margins, the latter indicating misclassified samples. In this case, the generalization or test error may be predicted from dynamics of normalized training margin distributions by the increase-decrease phase transition that high margins experience. On the other hand, if we push too hard to improve margins by giving models too many degrees of freedom, such that the training margins are uniformly improved during the training process, the predictability may be lost and overfitting sets in. A trade-off is thus necessary to balance the complexity of model and dataset; otherwise one is doomed to meet Breiman's dilemma when the models arbitrarily increase the expressive power.
The example above shows that the expressive power of the models relative to the complexity of the dataset can be observed from the dynamics of normalized margins in training instead of counting the number of parameters in neural networks. In the sequel, our main contributions are to make these precise by revisiting the Rademacher complexity bounds on network generalization error.
• With the Lipschitz-normalized margins, a linear inequality is established between training margin and test margin in Theorem 1. When both training and test normalized margin distributions undergo similar increase-decrease phase transitions during the training process, one may predict the generalization error based on the training margins, as illustrated in Figure 1.
• In a dual direction, one can define a quantile margin via the inverse of margin distribution functions to establish another linear inequality between the inverse quantile margins and the test margins, as shown in Theorem 2. Quantile margin is far easier to tune in practice and enjoys a stronger prediction power exploiting an adaptive selection of margins along model training.
• In all cases, Breiman's dilemma may fail both of the methods above when dynamics of normalized training margins undergo different phase transitions to that of test margins during training, where a uniform improvement of training margins results in overfitting.
Section 2 describes our method to derive the two linear inequalities of generalization bounds above. Extensive experimental results are shown in section 3 with basic CNNs, AlexNet, VGG, ResNet, and various datasets, including CIFAR10, CIFAR100, and mini-Imagenet. Conclusions and future directions are discussed in section 4. More experimental figures and proofs are collected in Appendices.

Definitions and Notation
Let $\mathcal{X}$ be the input space (e.g., $\mathcal{X} \subseteq \mathbb{R}^{C \times W \times H}$ in image classification of size #(channel)-by-#(width)-by-#(height)) and $\mathcal{Y} := \{1, \ldots, K\}$ be the space of $K$ classes. Consider a sample set of $n$ observations $S = \{(x_1, y_1), \ldots, (x_n, y_n) : x_i \in \mathcal{X}, y_i \in \mathcal{Y}\}$ that are drawn i.i.d. from $P_{X,Y}$. For any function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, let $Pf = \int_{\mathcal{X} \times \mathcal{Y}} f(X, Y)\, dP_{X,Y}$ be the population expectation and $P_n f = (1/n) \sum_{i=1}^{n} f(x_i, y_i)$ be the sample average.
Define $\mathcal{F}$ to be the space of functions $f : \mathcal{X} \to \mathbb{R}^K$ represented by neural networks of depth $l$, where $W_i$ is the weight matrix corresponding to a linear operator on $x_i$, and $\sigma_i$ stands for either the element-wise activation function (e.g., ReLU) or pooling operator, which are assumed to be Lipschitz bounded with constant $L_{\sigma_i}$. An example would be the convolutional network $W_i x_i + b_i = w_i * x_i + b_i$, where $*$ stands for the convolution between the input tensor $x_i$ and kernel tensor $w_i$. We equip $\mathcal{F}$ with the Lipschitz semi-norm $\|f\|_{\mathcal{F}}$, which for each $f$ satisfies
$$\|f\|_{\mathcal{F}} \le L_\sigma \prod_{i=1}^{l} \|W_i\|_\sigma =: L_f, \qquad (2)$$
where $\|\cdot\|_\sigma$ is the spectral norm and $L_\sigma = \prod_{i=1}^{l} L_{\sigma_i}$. Without loss of generality, we assume $L_\sigma = 1$ for simplicity. Moreover, we consider the family $\mathcal{H}$ of hypothesis functions given by the network mapping evaluated at $(x, y)$, where $[\cdot]_j$ denotes the $j$th coordinate, and we further define the class $\mathcal{H}_L$ induced by the Lipschitz semi-norm bound $\|f\|_{\mathcal{F}} \le L$ on $\mathcal{F}$.
Now, rather than merely looking at whether the prediction $f(x)$ on $y$ is correct or not, we further consider the prediction margin, which is defined as
$$\zeta(f(x), y) = [f(x)]_y - \max_{j \ne y} [f(x)]_j.$$
With that, we can define the ramp loss and margin error depending on the confidence of predictions. Given two thresholds $\gamma_2 > \gamma_1 \ge 0$, we define the ramp loss to be
$$\ell_{(\gamma_1, \gamma_2)}(\zeta) = \begin{cases} 1, & \zeta < \gamma_1, \\ (\gamma_2 - \zeta)/\Delta, & \gamma_1 \le \zeta \le \gamma_2, \\ 0, & \zeta > \gamma_2, \end{cases}$$
where $\Delta := \gamma_2 - \gamma_1$. In particular, when $\gamma_1 = 0$ and $\gamma_2 = \gamma$, we also write $\ell_\gamma = \ell_{(0,\gamma)}$ for simplicity. Define the margin error to measure whether $f$ has margin below a threshold $\gamma$,
$$e_\gamma(f(x), y) = 1[\zeta(f(x), y) < \gamma].$$
In particular, $e_0(f(x), y)$ is the common mis-classification error and $E[e_0(f(x), y)] = P[\zeta(f(x), y) < 0]$.
Note that $e_0 \le \ell_\gamma \le e_\gamma$, and $\ell_\gamma$ is Lipschitz bounded by $1/\gamma$.
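As an illustration of the definitions in this subsection, the following minimal Python sketch computes the prediction margin, the ramp loss, and the margin error for a single sample, and checks the sandwich relation between them. It assumes the standard multi-class margin (true-class score minus the largest other score) and a piecewise-linear ramp between the two thresholds; the function names and the toy scores are hypothetical and not taken from the experiments.

```python
def margin(scores, label):
    """Multi-class margin: true-class score minus the best other score."""
    best_other = max(s for j, s in enumerate(scores) if j != label)
    return scores[label] - best_other


def ramp_loss(zeta, gamma1, gamma2):
    """Ramp loss: 1 below gamma1, 0 above gamma2, linear in between."""
    if not gamma2 > gamma1 >= 0:
        raise ValueError("need gamma2 > gamma1 >= 0")
    if zeta < gamma1:
        return 1.0
    if zeta > gamma2:
        return 0.0
    return (gamma2 - zeta) / (gamma2 - gamma1)


def margin_error(zeta, gamma):
    """Indicator that the margin falls below the threshold gamma."""
    return 1.0 if zeta < gamma else 0.0


# Hypothetical 3-class scores for one sample whose true label is class 0.
scores, label = [2.0, 0.5, -1.0], 0
zeta = margin(scores, label)            # 2.0 - 0.5
gamma = 3.0
e0 = margin_error(zeta, 0.0)            # mis-classification error
ramp = ramp_loss(zeta, 0.0, gamma)      # ramp loss with thresholds (0, gamma)
e_gamma = margin_error(zeta, gamma)     # margin error at gamma

# The sandwich relation e_0 <= ramp <= e_gamma holds for this sample.
assert e0 <= ramp <= e_gamma
```

The sample is correctly classified, so its mis-classification error is zero, yet its margin is below the threshold, so the margin error at that threshold is one; the ramp loss interpolates between the two.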

Rademacher Complexity and the Scaling Issue
The central question we try to answer is: can we find a proper upper bound to predict the tendency of the generalization error during training, such that one can stop the training early, near the
We begin with the following lemma, as a margin-based generalization bound with network Rademacher complexity for multi-label classifications, using the uniform law of large numbers [8,13,18,19].

Lemma 2.1. (Rademacher Complexity based Generalization Bound). Given a $\gamma_0 > 0$, then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$ with $\|f\|_{\mathcal{F}} \le L$: where $\mathcal{R}_n(\mathcal{H}_L)$ is the Rademacher complexity of function class $\mathcal{H}_L$ with respect to $n$ samples, and the expectation is taken over
Unfortunately, direct application of such a bound in neural networks with a constant $\gamma_0$ will suffer from the so-called scaling issue. To see this issue, let us look at the following proposition as a lower bound of the Rademacher complexity term.

Proposition 1. (Lower Bound of the Rademacher Complexity). Consider the networks with activation functions $\sigma$, where we assume $\sigma$ is Lipschitz continuous and there exists $x_0$ such that $\sigma'(x_0) = 0$ and $\sigma''(x_0)$ exists. For any $L > 0$, then, it holds that where $C > 0$ is a constant that does not depend on $S$.
This proposition extends Theorem 3.4 in [13] to general activation functions and a multi-class scenario, and the proof is presented in the Appendix.
The scaling issue refers to the fact that, if the network Lipschitz constant $L \to \infty$, the upper bound (6) of Lemma 2.1 becomes trivial since, by Proposition 1, $\mathcal{R}_n(\mathcal{H}_L) \to \infty$. On the other hand, the gradient descent method with logistic regression (cross-entropy) loss [20] and exponential loss (boosting) [21] will drive weight estimates to approach infinity for max-margin classifiers when the data is linearly separable. In particular, the latter work shows the growth rate of weight estimates is $\log(t)$. As for the deep neural network with cross-entropy loss, the input of the last layer is usually viewed as several features extracted from the original input. Training the last layer with other layers being fixed is a logistic regression, and the features are linearly separable as long as the training error achieves zero. Therefore, without any normalization, the hypothesis space during training has no upper bound on $L$, and the upper bound (6) is thus useless.
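The scaling issue can be seen on a toy example. The sketch below is a hypothetical illustration, not the experimental setup of this paper: a two-class linear classifier is rescaled by a constant, which leaves the mis-classification error unchanged but shrinks the margin error at any fixed threshold, so an unnormalized margin carries no information by itself. Dividing by the Lipschitz constant of the rescaled map (known to be 10 for this toy, since the weight matrix is 10 times the identity) removes the effect.

```python
def scores(weights, x):
    """Linear classifier scores W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def margins(weights, data):
    """Prediction margins of a linear classifier on labeled samples."""
    out = []
    for x, y in data:
        s = scores(weights, x)
        out.append(s[y] - max(v for j, v in enumerate(s) if j != y))
    return out


def margin_error_rate(ms, gamma):
    """Fraction of margins strictly below gamma."""
    return sum(1 for m in ms if m < gamma) / len(ms)


# Hypothetical toy data: three 2-D samples with labels in {0, 1}.
W = [[1.0, 0.0], [0.0, 1.0]]
data = [([2.0, 1.0], 0), ([0.5, 1.5], 1), ([1.0, 1.2], 0)]
W_scaled = [[10.0 * w for w in row] for row in W]   # same decisions, larger scale

plain = margins(W, data)
scaled = margins(W_scaled, data)

# Mis-classification error (gamma = 0) is invariant to the rescaling ...
err_plain = margin_error_rate(plain, 0.0)
err_scaled = margin_error_rate(scaled, 0.0)

# ... but the margin error at a fixed gamma = 5 is not.
merr_plain = margin_error_rate(plain, 5.0)
merr_scaled = margin_error_rate(scaled, 5.0)

# Normalizing by the Lipschitz constant (10 here) restores the comparison.
normalized = [m / 10.0 for m in scaled]
```

In this toy case the margin error at the fixed threshold drops from 1 to 1/3 purely because of the rescaling, while the classifier's decisions are identical.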
To solve the scaling issue, in the following we are going to present normalization of margins and restricted Rademacher complexity within a unit Lipschitz ball. We are going to see when such bounds are tight enough to predict generalization errors based on training data.

Generalization Bounds by Normalized Margins and Restricted Rademacher Complexity
The first remedy is to restrict our attention to $\mathcal{H}_1$ by normalizing $f$ with its Lipschitz semi-norm $\|f\|_{\mathcal{F}}$ or some tight upper bound estimates. Note that a normalized network $\tilde{f} = f/C$ has the same mis-classification error as $f$ for all $C > 0$. For the choice of $C$, it is difficult in practice to directly compute the Lipschitz semi-norm of a network; instead, some approximate estimates on the upper bound $L_f$ in (2) are available as discussed in section 2.4.
In the sequel, let $\tilde{f} = f/L_f$ be the normalized network and $\tilde{h} = h/L_f = \zeta(f, y)/L_f = \zeta(\tilde{f}, y) \in \mathcal{H}_1$ be the corresponding normalized hypothesis function from (3). A simple idea is to regard $\mathcal{R}_n(\mathcal{H}_1)$ as a constant when the model is not over-expressive against the data; one can then predict the tendency of generalization error via the training margin error of the normalized network, which avoids the scaling issue and the exact computation of Rademacher complexity. In the following we present two bounds: one is a normalized margin error bound, as the direct application of Lemma 2.1, and the other is a quantile margin error bound, as the inverse of the former, which turns out to be more effective in applications.

Normalized Margin Error Bound
The following theorem states that the probability of normalized test margins being smaller than $\gamma_1$ is controlled by the percentage of normalized training margins less than $\gamma_2 > \gamma_1$, up to a term proportional to $\mathcal{R}_n(\mathcal{H}_1)/(\gamma_2 - \gamma_1)$, if the Rademacher complexity of the unit ball, $\mathcal{R}_n(\mathcal{H}_1)$, is not large.
Theorem 1. Given $\gamma_1$ and $\gamma_2$ such that $\gamma_2 > \gamma_1 \ge 0$ and $\Delta := \gamma_2 - \gamma_1 \ge 0$, for any $\delta > 0$, with probability at least $1 - \delta$, along the training epochs $t = 1, \ldots, T$, the following holds for each $f_t$:

Remark 1. When we take $\gamma_1 = 0$ and $\gamma_2 = \gamma > 0$, the bound above becomes
Recently, Liao et al. [16] investigated, for normalized networks, the strong linear relationship between cross-entropy training loss and test loss when the training epochs are large enough. However, the bound here applies to the whole training process for all epochs $t$, which enables us to find the early stopping time $t^*$ by looking at change points of $P_n 1[\zeta(\tilde{f}_t(x), y) < \gamma_2]$ in the dynamics of high training margin distributions that will be discussed below.
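The early stopping idea in Remark 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the normalized training margins of every epoch are assumed to be already available as lists of floats, the change point is taken to be the first minimum of the training margin error curve, and the function names and toy numbers are hypothetical.

```python
def training_margin_error(normalized_margins, gamma):
    """Fraction of normalized training margins strictly below gamma."""
    return sum(1 for m in normalized_margins if m < gamma) / len(normalized_margins)


def early_stopping_epoch(margins_per_epoch, gamma):
    """Return (t_star, curve): the epoch minimizing the training margin error.

    Taking the change point to be the first minimum of the curve is an
    assumption of this sketch.
    """
    curve = [training_margin_error(m, gamma) for m in margins_per_epoch]
    t_star = min(range(len(curve)), key=curve.__getitem__)
    return t_star, curve


# Hypothetical normalized margins of 4 training samples over 3 epochs.
# From epoch 1 to epoch 2 the low margins keep improving while the high
# margins are sacrificed, mimicking the phase transition described above.
margins_per_epoch = [
    [-1.0, 0.5, 1.0, 2.0],
    [0.2, 1.0, 3.5, 4.0],
    [0.8, 1.5, 2.5, 2.8],
]
t_star, curve = early_stopping_epoch(margins_per_epoch, gamma=3.0)
```

With a large threshold the curve tracks the high margins, so its minimum marks the epoch at which they start to be sacrificed; with a threshold near zero the same function reduces to the training error.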
Theorem 1 says that one can bound the normalized test margin error at $\gamma_1$ by the normalized training margin error at $\gamma_2 > \gamma_1$. In particular, one hopes to predict the trend of generalization (test) error by choosing $\gamma_1 = 0$ and a proper $\gamma > 0$ such that the high training margin errors $P_n[\zeta(\tilde{f}_t(x_i), y_i) < \gamma]$ enjoy a high correlation with the test error, up to a monotone transformation. The following facts make it possible to achieve this.
• First, we do not expect the bound, for example (10), to be tight for every choice of $\gamma > 0$. Instead, we hope there exists some $\gamma$ such that the training margin error changes almost monotonically with the generalization error. This indeed happens when the model complexity is not too large, so that one cannot uniformly enlarge the high training margins. For example, Figure 5 below shows the existence of such $\gamma$ when models are not too big by exhibiting rank correlations between the training margin error at various $\gamma$ and the test error for a CNN trained on the CIFAR10 dataset. Moreover, Figure 4 below shows that the training margin error at such a good $\gamma$ successfully recovers the tendency of generalization error.
• Second, the normalizing factor is not necessarily an upper bound of the Lipschitz semi-norm. The key point is to prevent the complexity term of the normalized network from going to infinity. Since, for any constant $c > 0$, normalization by $\tilde{L} = cL$ works in practice, where the constant could be absorbed into $\gamma$, we could ignore the Lipschitz constant introduced by general activation functions in the hidden layers.
However, such a strategy may fail. As shown by Example 1.1 using Figure 1 above, once the training margin distribution is uniformly improved, the dynamics of the training margin error fail to capture the change point (minimum) of the generalization error in the early stage. This is because when the network structure becomes complex and over-expressive enough against the data, the training margin distribution can be more easily improved. In this case, the restricted Rademacher complexity $\mathcal{R}_n(\mathcal{H}_1)$ in Theorem 1 will blow up, such that it is invalid to bound the generalization error using merely the training margin error, $P_n[\zeta(\tilde{f}_t(x_i), y_i) < \gamma]$, even though the latter is reduced in training. This is exactly the same observation made in [17], casting doubt on the margin theory in boosting-type algorithms. More detailed discussions will be given in section 3.3 with experiments.

Quantile Normalized Margin Error Bound
A serious limitation of Theorem 1 lies in that we must fix a $\gamma$ along the whole training process. In fact, the first and second terms in the bound (10) vary in opposite directions with respect to $\gamma$, and it is thus possible that different $f_t$ at different $t$ may prefer different $\gamma$ for a trade-off. Can we adaptively choose good $\gamma_t$ at different $t$?
The answer is yes. In fact, as shown in Figure 1B of Example 1.1 above, while choosing $\gamma$ is to fix an x-coordinate section of margin distributions, another direction is to look for a y-coordinate section, which enables different margins for different $f_t$. This motivates us to define the quantile margin below. Let $\hat{\gamma}_{q,f}$ be the $q$th quantile margin of the network $f$ with respect to sample $S$. The following theorem bounds the generalization error by the inverse of quantile margins on training data.

Remark 3. We simply denote $\hat{\gamma}_{q,t}$ for $\hat{\gamma}_{q,f_t}$ when there is no confusion.
Compared with the bound (10), the bound (12) of Theorem 2 makes it possible to choose $\gamma_t$ varying with $f_t$, at the cost of an additional constant term in $C_q$ as well as the constraint $\hat{\gamma}_{q,t} > \tau$, which typically holds for large enough $q$ in practice. In applications, the stochastic gradient descent method often effectively improves the training margin distributions along with the reduction of training errors; a small enough $\tau$ and large enough $q$ usually meet $\hat{\gamma}_{q,t} > \tau$. Moreover, even with the choice $\tau = \exp(-B)$, the constant term $\sqrt{[\log \log_2 (4(M + l)/\tau)]/n} = O(\sqrt{\log B / n})$ is still negligible, and thus very little cost is paid in the upper bound.
In practice, tuning q ∈ [0, 1] is far easier than tuning γ > 0 directly, and setting a large enough q usually provides lots of information about the generalization performance. The quantile margin works effectively when the dynamics of high margin distributions reflect the behavior of generalization error, e.g., as shown in Figure 1. In this case, after certain epochs of training, the high margins have to be sacrificed to further improve the low margins for reducing the training loss, which typically indicates a possible saturation or overfitting in test error.
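A minimal sketch of how the inverse quantile margin could be tracked over training is given below. It assumes the empirical quantile convention "smallest margin whose empirical distribution function reaches q", treats epochs whose quantile margin does not exceed the threshold as uninformative, and uses hypothetical function names and toy margins; it is not the exact procedure used in the experiments.

```python
import math


def quantile_margin(normalized_margins, q):
    """Empirical q-th quantile of the normalized training margins.

    Convention assumed here: the smallest margin g such that at least a
    fraction q of the margins are <= g.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    ordered = sorted(normalized_margins)
    # Small tolerance guards against floating-point round-off in q * n.
    k = max(0, math.ceil(q * len(ordered) - 1e-12) - 1)
    return ordered[k]


def inverse_quantile_margin(normalized_margins, q, tau=0.0):
    """1 / quantile margin, or infinity when the margin is not above tau."""
    g = quantile_margin(normalized_margins, q)
    return math.inf if g <= tau else 1.0 / g


# Hypothetical normalized margins of 4 training samples over 3 epochs.
margins_per_epoch = [
    [-1.0, 0.5, 1.0, 2.0],
    [0.2, 1.0, 3.5, 4.0],
    [0.8, 1.5, 2.5, 2.8],
]
curve = [inverse_quantile_margin(m, q=0.75) for m in margins_per_epoch]
# Candidate early stopping epoch: where the inverse quantile margin is smallest.
stop = min(range(len(curve)), key=curve.__getitem__)
```

Only the quantile level has to be chosen; the margin threshold implied by that level adapts to each epoch, which is the practical advantage described above.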

Estimate of Normalization Factors
It remains to be discussed how the Lipschitz constant bound in (2) should be estimated. Given an operator W associated with a convolutional kernel w, i.e., Wx = w * x, there are two ways to estimate its operator norm. We begin with the following proposition, of which part (A) is adapted from the continuous version of Young's convolution inequality in L p space (see Theorem 3.9.4 in [22]) and part (B) is a generalization to multiple channel kernels widely used in convolutional networks nowadays. The proof is presented in the Appendix B.5.

Proposition 2. (A) For a convolution operator $W$ with kernel $w$, $\|w * x\|_2 \le \|w\|_1 \|x\|_2$. In other words, $\|W\|_\sigma \le \|w\|_1$.
(B) Consider a multiple channel convolutional kernel $w \in \mathbb{R}^{C_{out} \times C_{in} \times Size}$ with stride $S$, which maps an input signal $x$ of $C_{in}$ channels to the output of $C_{out}$ channels by where $x$ and $w$ are assumed to be zero-padded outside their support. The following upper bounds hold.
In all these cases, the $\ell_1$-norm of $w$ dominates the estimates. In the following, we will thus simply call these bounds $\ell_1$-based estimates. Another method is given in [14] based on power iterations [23] as a fast numerical approximation for the spectral norm of the operator matrix. We compare the two estimates in Figure 10. It turns out both can be used to predict the tendency of generalization error using normalized margins, and both will fail when the network has large enough expressive power. Although using the $\ell_1$-based estimate is very efficient, the power iteration method may be tighter and have a wider range of predictability.
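The two kinds of estimates can be compared on a small example. The sketch below is an illustration under simplifying assumptions: a single-channel, one-dimensional circular convolution stands in for the convolution layer, its operator matrix is built explicitly, and a plain power iteration on the Gram matrix approximates the spectral norm. All function names, kernels, and sizes are hypothetical.

```python
import math
import random


def circular_conv_matrix(kernel, n):
    """Matrix of 1-D circular convolution with `kernel` on length-n signals."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k, w_k in enumerate(kernel):
            W[i][(i - k) % n] += w_k
    return W


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def norm2(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm_power_iteration(A, iters=200, seed=0):
    """Approximate the largest singular value of A by power iteration on A^T A."""
    rng = random.Random(seed)
    At = [list(col) for col in zip(*A)]
    v = [rng.gauss(0.0, 1.0) for _ in range(len(A[0]))]
    for _ in range(iters):
        u = matvec(At, matvec(A, v))
        size = norm2(u)
        if size == 0.0:          # v fell into the null space of A
            return 0.0
        v = [x / size for x in u]
    return norm2(matvec(A, v))   # v has unit norm here


def l1_estimate(kernel):
    """l1-based upper bound on the operator norm of the convolution."""
    return sum(abs(w) for w in kernel)


kernel_a = [1.0, 2.0, 1.0]       # non-negative kernel: the l1 bound is attained
kernel_b = [1.0, 0.5, -0.25]     # mixed signs: the l1 bound is strict
sigma_a = spectral_norm_power_iteration(circular_conv_matrix(kernel_a, 8))
sigma_b = spectral_norm_power_iteration(circular_conv_matrix(kernel_b, 8))
```

For the non-negative kernel the two estimates coincide, while for the mixed-sign kernel the power iteration value is smaller than the $\ell_1$-based one, which illustrates why the former may be tighter.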
However, a shortcoming of the power method is that it cannot be directly applied to the ResNet blocks. In the remainder of this section, we will discuss the treatment of ResNets. ResNet is usually a composition of the basic blocks shown in Figure 2 with short-cut structure. The following method is used in this paper to estimate upper bounds of spectral norm of such a basic block of ResNet.
B are mean and variance of batch samples, while keeping an online averaging as $\hat{\mu}$ and $\hat{\sigma}^2$. BN then rescales $x^+$ by using estimated parameters $\hat{\alpha}$, $\hat{\beta}$, and outputs $\hat{x} = \hat{\alpha} x^+ + \hat{\beta}$. The whole rescaling of BN on the kernel tensor $w$ of the convolution layer, therefore, is $\hat{w} = w \hat{\alpha} / \sqrt{\hat{\sigma}^2 + \epsilon}$, and the spectral norm of its corresponding rescaled operator is $\|\hat{W}\|_\sigma = \|W\|_\sigma \hat{\alpha} / \sqrt{\hat{\sigma}^2 + \epsilon}$.
(c) Activation and pooling: their Lipschitz constants can be known a priori, e.g., $L_\sigma = 1$ for ReLU, and hence can be ignored. In general, $L_\sigma$ cannot be ignored if they are in the shortcut as discussed below.
(d) Shortcut: In a residual network with the basic block in Figure 2, one has to treat the mainstream (Block 2, Block 3) and the shortcut Block 1 separately. Since $\|f + g\|_{\mathcal{F}} \le \|f\|_{\mathcal{F}} + \|g\|_{\mathcal{F}}$, in this paper we take the Lipschitz upper bound to be $L_{\sigma_{out}} (\|\hat{W}_1\|_\sigma + L_{\sigma_{in}} \|\hat{W}_2\|_\sigma \|\hat{W}_3\|_\sigma)$, where $\|\hat{W}_i\|_\sigma$ denotes a spectral norm estimate of the BN-rescaled convolutional operator $W_i$. In particular, $L_{\sigma_{out}}$ can be ignored since all paths are normalized by the same constant, while $L_{\sigma_{in}}$ cannot be ignored due to its asymmetry.
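The block-level bound described in (d), together with the batch normalization rescaling, can be written as two small helper functions. This is a sketch under assumptions: the spectral norm estimates of the individual convolutions are taken as given inputs, the BN scale is treated as a single scalar whose absolute value is used (an assumption of this sketch), and the function names and numbers are hypothetical.

```python
import math


def bn_rescaled_norm(conv_norm, alpha, running_var, eps=1e-5):
    """Norm estimate of a convolution followed by batch normalization.

    conv_norm is an estimate of the convolution's spectral norm; alpha and
    running_var are the BN scale and the online-averaged variance. Using
    abs(alpha) is an assumption made here so that the result is non-negative.
    """
    return conv_norm * abs(alpha) / math.sqrt(running_var + eps)


def basic_block_bound(shortcut_norm, main_norms, lip_in=1.0, lip_out=1.0):
    """Upper bound lip_out * (shortcut + lip_in * product of main-path norms)."""
    main = lip_in
    for s in main_norms:
        main *= s
    return lip_out * (shortcut_norm + main)


# Hypothetical norm estimates for one basic block.
w1 = bn_rescaled_norm(1.0, alpha=1.0, running_var=1.0, eps=0.0)   # shortcut
w2 = bn_rescaled_norm(2.0, alpha=0.5, running_var=0.25, eps=0.0)  # main path
w3 = bn_rescaled_norm(3.0, alpha=1.0, running_var=1.0, eps=0.0)   # main path
block_bound = basic_block_bound(w1, [w2, w3])
```

The shortcut and the main path are added rather than multiplied, following the triangle inequality for the semi-norm; the Lipschitz estimate of a whole network would then be the product of such block-level bounds.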

EXPERIMENTAL RESULTS
The spirit of the following experiments is to show when and how the margin bound above could be used to numerically predict the

Networks and Datasets
The networks and datasets used in the experiments are introduced in brief here. For the network, our illustration, Example 1.1, is based on a simple convolutional neural network whose architecture is shown in Figure 3 (more details in Figure A1 in Appendix), called basic CNN(c), here with c channels that will be specified in different experiments below. It essentially has five convolutional layers of c channels at each one, and this is followed by batch normalization and ReLU as well as a fully connected layer in the end. Furthermore, we consider various popular networks in applications, including AlexNet [24], VGG-16 [25], and ResNet-18 [26]. For the dataset, we consider CIFAR10, CIFAR100 [27], and Mini-ImageNet [28].

Success: Similar Phase Transitions in Training and Test Margin Dynamics
In this section, we show that when the expressive power of models is comparable to data complexity, the dynamics of training margin distributions and those of test margin distributions share similar phase transitions, which enables us to predict generalization (test) error utilizing the theorems in this paper.
In this experiment, we are going to demonstrate when there is a nearly monotone relationship between training margin error and test margin error such that Theorem 1 and Theorem 2 can be applied to predict the tendency of generalization (test) error. Let us first consider training a basic CNN(50) on the CIFAR10 dataset with and without random noise. The relations between the test error and the training margin error $e_\gamma(\tilde{f}(x), y)$ with $\gamma = 9.8$, and between the test error and the inverse quantile margin $1/\hat{\gamma}_{q,t}$ with $q = 0.6$, are shown in Figure 4. In this simple example, where the network is small and the dataset is simple, the bounds (9) and (12) show a good prediction power: they stop either near the epoch of sufficient training without noise (Left, original data) or before overfitting occurs with noise (Right, 10% of labels corrupted).
Why does it work in this case? Here are some detailed explanations of its mechanism. The training margin error ($P_n[\zeta(\tilde{f}_t(x_i), y_i) < \gamma]$) and the inverse quantile margin ($1/\hat{\gamma}_{q,t}$) are both closely related to the dynamics of training margin distributions. Figure 1B actually shows that the dynamics of training margin distributions undergo a phase transition: while the low margins have a monotonic increase, the high or large margins undergo a phase transition from increase to decrease, which is indicated by the red arrows. Therefore, different choices of $\gamma$ for the linear bounds (9) [a parallel argument holds for $q$ in (12)] will have different effects. In fact, the training margin error with a small $\gamma$ is close to the training error, while that with a large $\gamma$ is close to the test error. Figure 5 shows such a relation using rank correlations (in terms of Spearman-$\rho$ and Kendall-$\tau$) between training margin errors (or inverse quantile margins) and training errors, as well as between training margin errors (or inverse quantile margins) and test errors, for each $\gamma$ (or $q$, respectively). In these plots, we see that the dynamics of large margins have a trend that is similar to the test errors, while small margins are close to training errors in rank correlations. For a good prediction, one should thus choose a large enough $\gamma = 9.8$ (or $q = 6.8$, respectively) at the peak point of the rank correlation curve between training margins and test errors. Under these conditions, the epoch when the phase transition above happens is featured with a cross-over in the dynamics of training margin distributions in Figure 1B and lies near the optimum of the training margin error curve.
Although both the training margin error ($P_n[\zeta(\tilde{f}_t(x_i), y_i) < \gamma]$) and the inverse quantile margin ($1/\hat{\gamma}_{q,t}$) can be used here to successfully predict the trend of test (generalization) error, the latter can be more powerful in our studies. In fact, the dynamics of the inverse quantile margins can adaptively select $\gamma_t$ for each $f_t$ without access to the complexity term. Unlike merely looking at the training margin error with a fixed $\gamma$, the quantile margin bound (12) in Theorem 2 shows a stronger prediction power than (10) and is even able to capture more local optima. In Figure 6, the test error curve has two valleys corresponding to a local optimum and a global optimum, and the quantile margin curve with $q = 0.95$ successfully identifies both. However, if we consider the dynamics of training margin errors, it is rarely possible to recover the two valleys at the same time, since their critical thresholds $\gamma_{t_1}$ and $\gamma_{t_2}$ are different. Another example with ResNet-18 is given in Figure A2 in the Appendix.
In summary, when training and test margin dynamics share similar phase transitions, both theorems we developed can be used to predict test (generalization) error via normalized training margins, even providing a data-dependent early stopping rule to avoid overfitting when data is noisy. However, below we shall see a different scenario: when training and test margin dynamics have distinct phase transitions, such a prediction fails, as in Breiman's dilemma.
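The selection of a margin threshold by rank correlation can be sketched as follows. This is a hedged illustration only: it implements Spearman's rank correlation with average ranks for ties, computes the training margin error curve for each candidate threshold, and keeps the threshold whose curve correlates best with a given test error curve. The function names and the toy numbers are hypothetical and do not reproduce Figure 5.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def best_gamma(margins_per_epoch, test_errors, gammas):
    """Threshold whose training margin error curve best tracks the test error."""
    best = None
    for g in gammas:
        curve = [sum(1 for m in ms if m < g) / len(ms) for ms in margins_per_epoch]
        rho = spearman(curve, test_errors)
        if best is None or rho > best[1]:
            best = (g, rho)
    return best


# Hypothetical normalized training margins and test errors over 3 epochs.
margins_per_epoch = [
    [-1.0, 0.5, 1.0, 2.0],
    [0.2, 1.0, 3.5, 4.0],
    [0.8, 1.5, 2.5, 2.8],
]
test_errors = [0.5, 0.2, 0.3]
gamma_star, rho_star = best_gamma(margins_per_epoch, test_errors, [0.6, 3.0])
```

In this toy run the larger threshold is selected, in line with the observation that large-margin dynamics follow the test error more closely than small-margin dynamics. Note that the selection uses test errors, so it serves as a diagnostic rather than as a stopping rule.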

Failure: Distinct Phase Transitions in Margin Dynamics and Breiman's Dilemma
In this section, we show that when model complexity arbitrarily increases to be over-expressive against the dataset, the training margins can be monotonically improved, while high test margin dynamics undergo a distinct phase transition of decrease-increase. In this case, the prediction power of training-margin-based bounds is lost and overfitting may set in. This exhibits Breiman's dilemma in neural networks.
We conduct three sets of experiments in the following.

Experiment I: Basic CNNs on CIFAR10
In the first experiment, shown in Figure 7, we fix the dataset to be CIFAR10 with 10% of labels randomly permuted and gradually increase the number of channels from the basic CNN(50) to CNN(400). For CNN(50) (92,610 parameters) and CNN(100) (365,210 parameters), the training and test margin dynamics share a similar phase transition during training: small margins improve monotonically, while large margins first improve and then drop. The last row of Figure 7 shows heatmaps of the Spearman-ρ rank correlations between these two dynamics, drawn in the γ1-γ2 plane. The block-diagonal structure of these heatmaps illustrates the similarity in phase transitions. Specifically, small (or large) training margins and small (or large) test margins are highly rank-correlated, which appears as light diagonal blocks, while the difference between small and large margins appears as dark off-diagonal blocks. In particular, at γ1 = 0, the test (generalization) error dynamics can be predicted using large training margins, as their rank correlations are high.
However, as the number of channels increases to CNN(400) (5,780,810 parameters), the training margin distribution improves monotonically, without the phase transition above. This is not surprising: with strong representation power, the whole training margin distribution can be improved monotonically without sacrificing the large margins. The generalization or test error, on the other hand, cannot be improved monotonically. The heatmap of rank correlations between training and test margin dynamics reflects this distinction: the block-diagonal structure above changes to double column blocks for CNN(400). In particular, for γ1 ≤ 0, test margin dynamics have low rank correlations with all training margin dynamics, since the two undergo different phase transitions. As a result, one cannot predict the test error at γ = 0 using training margin dynamics.
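The rank-correlation heatmaps described above can be illustrated with a minimal sketch. It assumes that each "dynamics" is a margin error rate (the fraction of samples whose normalized margin is at most a threshold γ) recorded once per epoch, and that the test threshold γ1 indexes rows while the training threshold γ2 indexes columns. The function names, the tie handling, and the data layout are illustrative assumptions, not the code used for Figure 7.

```python
def ranks(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined here; 0.0 is a convention
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def margin_error(margins, gamma):
    """Fraction of samples with normalized margin <= gamma (assumed definition)."""
    return sum(m <= gamma for m in margins) / len(margins)


def correlation_heatmap(train_by_epoch, test_by_epoch, gammas_test, gammas_train):
    """heat[i][j] = rho between test dynamics at gammas_test[i]
    and training dynamics at gammas_train[j], both taken over epochs."""
    train_dyn = [[margin_error(m, g) for m in train_by_epoch] for g in gammas_train]
    heat = []
    for g1 in gammas_test:
        test_dyn = [margin_error(m, g1) for m in test_by_epoch]
        heat.append([spearman(test_dyn, td) for td in train_dyn])
    return heat
```

Here `train_by_epoch` and `test_by_epoch` are lists with one entry per epoch, each entry being the list of normalized margins at that epoch. A block-diagonal pattern in the returned matrix would correspond to the matched phase transitions discussed above.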

Experiment II: CNN(400) and ResNet-18 on CIFAR100 and Mini-ImageNet
In the second experiment, shown in Figure 8, we compare the normalized margin dynamics of training CNN(400) and ResNet-18 on two different datasets, CIFAR100 and Mini-ImageNet. CIFAR100 is more complex than CIFAR10 but less complex than Mini-ImageNet. The experiment shows that (a) CNN(400) is not over-expressive on CIFAR100: its normalized training margin dynamics exhibit a phase transition, that is, a sacrifice of large margins to improve small margins during training; and (b) ResNet-18 is over-expressive on CIFAR100, exhibiting a monotone improvement of training margins, but loses this power on Mini-ImageNet, where the training margin dynamics undergo phase transitions.
From this experiment, one can see that simply counting the number of parameters and samples cannot tell us whether the model is over-expressive for the data or whether model and data complexities are comparable. Instead, phase transitions of margin dynamics provide a tool to investigate their relationship. CNN(400) (5.8 M parameters) is too expressive for the simplest dataset, CIFAR10, so that the training margins can be improved monotonically during training; but its expressive power seems comparable to the complexity of the more complex CIFAR100. Similarly, the more complex model ResNet-18 (11 M parameters) is too expressive for CIFAR100 but seems comparable to Mini-ImageNet.
In this part, we collect comparisons of various networks on the CIFAR10/100 and Mini-ImageNet datasets. Figure 9 shows both success and failure cases with different networks and datasets. In particular, the predictability of generalization error based on Theorem 1 and Theorem 2 can be read off directly from the third column of Figure 9, the heatmaps of rank correlations between training margin dynamics and test margin dynamics. On one hand, one can use the training margins to predict the test error, as shown in the first column of Figure 9. In these cases, model complexity is comparable to data complexity, so that the training margin dynamics share similar phase transitions with the test margin dynamics, indicated by block-diagonal structures in the rank correlations [e.g., CNN(100)-CIFAR10, AlexNet-CIFAR100, AlexNet-MiniImageNet, VGG16-MiniImageNet, and ResNet-18-MiniImageNet]. On the other hand, such a prediction fails when models become over-expressive relative to the datasets, so that the training margin dynamics undergo different phase transitions from the test margin dynamics, indicated by the loss of block-diagonal structures in the rank correlations [e.g., CNN(400)-CIFAR10, ResNet-18-CIFAR100, and VGG16-CIFAR100].
As we have shown, phase transitions of margin dynamics play a central role in characterizing the trade-off between model expressive power and data complexity, and hence the predictability of generalization error by our theorems. If one tries hard to improve training margins by arbitrarily increasing the model complexity, the training margin distributions can be monotonically enlarged, but this may lead to overfitting. This phenomenon is not unfamiliar, since Breiman pointed out that the improvement of training margins is not enough to guarantee a small generalization or test error in boosting-type algorithms [17]. We find the same phenomenon ubiquitous in deep neural networks. In this paper, inspecting the trade-off between the expressive power of models and the complexity of data via phase transitions of margin dynamics provides a new perspective for studying Breiman's dilemma in applications.
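As a rough illustration of how one might check for the phase transition discussed in this section, the sketch below tracks a fixed quantile of the normalized margins across epochs and flags a series that peaks before the last epoch and then drops. The quantile choice, the tolerance `rel_tol`, and the function names are hypothetical; this is a simple heuristic under those assumptions, not the criterion used in the experiments.

```python
def quantile(values, q):
    """Empirical q-quantile (0 <= q <= 1) by linear interpolation."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def quantile_dynamics(margins_by_epoch, q):
    """q-quantile of the normalized margins at each epoch."""
    return [quantile(m, q) for m in margins_by_epoch]


def has_phase_transition(series, rel_tol=0.05):
    """True if the series rises to an interior peak and afterwards drops by
    more than rel_tol times its overall range (assumed heuristic)."""
    peak = max(series)
    peak_idx = series.index(peak)
    span = peak - min(series)
    if span == 0 or peak_idx == len(series) - 1:
        return False  # flat series, or still improving at the last epoch
    drop = peak - min(series[peak_idx:])
    return drop > rel_tol * span


# Toy usage: a large-margin quantile that improves and then deteriorates,
# versus one that improves monotonically.
print(has_phase_transition([0.1, 0.5, 0.9, 0.7, 0.4]))  # True
print(has_phase_transition([0.1, 0.2, 0.3, 0.4]))       # False
```

Under this heuristic, a model comparable to the data would be expected to trigger the flag for high quantiles of the training margins, while an over-expressive model would not.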

Discussion: Influence of Normalization Factor Estimates
Finally, it is worth mentioning that different choices of the normalization factor estimate may affect the range of predictability but may still exhibit Breiman's dilemma. In all experiments above, the normalization factor is estimated via the ℓ1-based estimate in Proposition 2 in Section 2.4. One could also use power iteration [14] to obtain a more precise estimate of the spectral norm. The ℓ1-based estimates usually lead to a coarser upper bound than power iteration; see Figure 10. In training margin dynamics, large margins typically improve more slowly than small margins. A more accurate estimate of the spectral norm that increases faster during training may thus bring cross-overs (or phase transitions) in large training margins and extend the range of predictability. Breiman's dilemma, however, still persists when the balance between model representation power and dataset complexity is broken as model complexity grows arbitrarily.

[Figure caption, beginning missing] ... Mini-ImageNet (Right) with 10% of the labels corrupted. With a fixed network structure, we further explore how the complexity of the dataset influences the margin dynamics. Taking ResNet-18 as an example, the margin dynamics on CIFAR100 do not have any cross-over (phase transition), but on Mini-ImageNet a cross-over occurs.

[Figure caption] In the top row, the spectral norm in L_f is estimated via the ℓ1-based estimate method, and in the middle row it is estimated by power iteration. The bottom pictures show the estimates of L_f by power iteration (in green) and by the ℓ1-based estimate method (in blue). The curves of the L_f estimates are rescaled for visualization, since a fixed scaling factor during training does not influence the occurrence of cross-overs or phase transitions. Note that the original ℓ1-based estimates are of order 1e+17, 1e+19, and 1e+21 (100 channels, 400 channels, and 900 channels, respectively), while the power iteration estimates are of order 1e+3 in all three cases. As shown above, a more accurate estimation of the spectral norm may extend the range of predictability but eventually faces Breiman's dilemma if the model representation power grows too much against the dataset complexity.
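To make the gap between a norm-based upper bound and a power-iteration estimate concrete, here is a minimal sketch for a single dense weight matrix. It assumes plain power iteration on AᵀA and uses the standard matrix inequality ‖A‖₂ ≤ sqrt(‖A‖₁ ‖A‖∞) as a stand-in for an ℓ1-type bound; this is not the estimate of Proposition 2, and the function names are illustrative only.

```python
import math
import random


def matvec(a, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def norm2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def spectral_norm_power_iteration(a, iters=200, seed=0):
    """Estimate the largest singular value of `a` by power iteration on A^T A."""
    rng = random.Random(seed)
    at = [list(col) for col in zip(*a)]  # transpose of a
    v = [rng.gauss(0.0, 1.0) for _ in range(len(a[0]))]
    n = norm2(v)
    v = [x / n for x in v]
    sigma = 0.0
    for _ in range(iters):
        w = matvec(at, matvec(a, v))  # one step: w = A^T A v
        nw = norm2(w)
        if nw == 0.0:
            return 0.0  # a maps v to zero; treat the norm estimate as 0
        v = [x / nw for x in w]
        sigma = norm2(matvec(a, v))  # ||A v|| for the current unit vector v
    return sigma


def l1_linf_upper_bound(a):
    """Upper bound sqrt(||A||_1 * ||A||_inf) on the spectral norm."""
    max_col_sum = max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))
    max_row_sum = max(sum(abs(x) for x in row) for row in a)
    return math.sqrt(max_col_sum * max_row_sum)


a = [[1.0, 2.0], [3.0, 4.0]]
print(spectral_norm_power_iteration(a), l1_linf_upper_bound(a))
```

For this 2×2 example the power-iteration value is about 5.465, while the bound is sqrt(42) ≈ 6.481. When such per-layer bounds are combined across many layers, the looseness can grow considerably, which is in line with the large difference in magnitude between the two estimates reported above.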

AUTHOR CONTRIBUTIONS
WZ proved the theorems, conducted some experiments, and wrote the paper. YH carried out major experiments. YY designed the project and wrote the paper. All authors contributed to the article and approved the submitted version.