Locally linear attributes of ReLU neural networks

A ReLU neural network functions as a continuous piecewise linear map from an input space to an output space. The weights in the neural network determine a partitioning of the input space into convex polytopes, where each polytope is associated with a distinct affine mapping. The structure of this partitioning, together with the affine map attached to each polytope, can be analyzed to investigate the behavior of the associated neural network. We investigate simple problems to build intuition on how these regions act and both how they can potentially be reduced in number and how similar structures occur across different networks. To validate these intuitions, we apply them to networks trained on MNIST to demonstrate similarity between those networks and the potential for them to be reduced in complexity.

1. Introduction. Building a better understanding of neural network behavior is critically important. Neural networks are state-of-the-art in a variety of contexts including facial recognition [2] and object recognition [19]. However, there is limited understanding of how these networks work or what they are truly doing to achieve such high performance. We present one path for building understanding and intuition by investigating the locally linear behavior of ReLU networks.
We investigate the linear region facets of ReLU neural networks, the small regions where the network behaves as a linear function. These can be considered both through the underlying piecewise linear structure of the network and through the gradient of the network in each region. Prior work has been done on establishing theoretical bounds on the number of regions that it is possible for a network to have [14,16,18] and on investigating metrics involving these structures [15].
We investigate the behavior of these facets for small networks trained on easily visualized problems and on larger, more modern networks trained to recognize handwritten digits [21,10,8]. We determine that clustering these facets can be carried out while preserving much of the performance of the networks and that the facets of two different networks, trained on the same problem, are related by a linear map that maintains high accuracy. In related work done by McNeely-White et al. [13], it was shown that one can apply a linear map to the feature vector (the outputs of the preclassification layer) of one network to obtain a vector, considered as a feature vector in the second network, that can then be used by the second network for classification while maintaining high accuracy. The clustering results suggest that networks have significant redundancy at the facet level, while the existence of the linear map suggests that networks follow qualitatively similar methods to solve problems. Although our methods are not currently usable for compression or simplification, they are useful for investigating the behavior of networks. There are aspects of what is being presented here that may be obvious to those that have thought about neural networks, but we present them to further build intuition for network behavior.
1.1. Outline. We first provide an overview of previous work before defining and providing an overview of linear region structures. We provide illustrated examples of linear regions in different contexts in Section 3. We then investigate potential applications of these methods:
• In Section 4, we show how the linear regions change throughout the training process of simple networks trained on two-dimensional problems, and how the structure of early layers induces behavior in later layers.
• In Section 5.1, we show experimentally that the number of linear regions a network has is not necessarily a perfect predictor of its complexity, as those linear regions can be clustered while preserving the accuracy of the network.
• In Section 5.2, we show experimentally that the linear regions of different networks exhibit similarity and that a simple affine mapping between those linear regions maintains a high level of classification accuracy.
2. Related Work. Much of the original work dealing with the linear regions of ReLU neural networks has focused on investigating expressivity and complexity. It has previously been shown that networks are universal approximators, that is, subject to certain mild constraints, they are able to approximate any well-behaved function to within arbitrary precision as the size of the network increases [5,1,7,11,10]. As meaningful as these results are, they are typically not applicable to practical neural networks and do not say anything about the expressivity of a given neural network. To assist with determining the expressivity of networks in practice, various groups found and improved bounds on the maximum number of linear regions that ReLU neural networks can have [14,16,18]. The main result of this work is that the maximum number of linear regions a network can have grows polynomially in the width and exponentially in the depth [18]. This partially explains the success of the trend in many modern neural networks to go deeper, such as ResNet [6].
However, empirical investigations of the number of linear regions actually achieved by many neural networks have shown different results. Untrained neural networks after initialization have a number of linear regions that tends to grow linearly in the number of ReLU functions along any one-dimensional subspace of the input space [3]. Furthermore, they tend to grow polynomially in the number of ReLU nodes in the network and exponentially in the dimension of the inputs to the network [4].
These linear regions have also been used empirically to measure the sensitivity of neural networks. As will be discussed in Section 3, the Jacobian of a neural network at a point, together with the value of the neural network at the point, describes exactly the linear function that agrees with the network in a polytope around that point. Novak et al. [15] utilized this fact to investigate the effect of hyperparameters on input sensitivity and found that overparameterization can help in generalization. Additionally, they and Zhang et al. [23] investigated how the linear region structure can be used to predict the quality of a network.
Zhang et al. [22] showed that due to the piecewise linear structure of these neural networks, and under certain assumptions, the set of ReLU neural networks, the set of piecewise linear functions, and the set of tropical rational functions are equivalent. We do not extend our results to the realm of tropical algebra, but we do take inspiration from the concept of the dual as commonly expressed in tropical algebra.
3. Linear Regions. Neural networks with piecewise linear activation functions, such as ReLU, are continuous piecewise linear maps from the input space to the output space [22]. Additionally, each of the linear portions of this mapping is supported on a convex polytope.
3.1. Definition. The piecewise linear and convex polytope structures of a ReLU neural network, $f : \mathbb{R}^d \to \mathbb{R}^o$ with $o$ outputs and inputs in $\mathbb{R}^d$, mean that it can be written as

$$
f(x) =
\begin{cases}
W_1 x + b_1, & A_1 x \le c_1, \\
\quad \vdots \\
W_m x + b_m, & A_m x \le c_m.
\end{cases}
\tag{3.1}
$$

For each $i$, the affine mapping defined by $W_i$ and $b_i$ is valid on the convex polytope defined by $A_i$ and $c_i$. One can find the values for these parameters that are valid at $x$ as follows. From a given value of $x$, one finds the associated ReLU activation pattern in the network. From this activation pattern one determines the affine function, from input space to output space, that agrees with the neural network at $x$ (i.e. the values of $W_i$ and $b_i$). Next, one determines the region of validity for this linear function (i.e. the values of $A_i$ and $c_i$). Putting it together, $W_i$ and $b_i$ determine the affine linear function that agrees with the ReLU neural network on the polytope, defined by $A_i$ and $c_i$, that contains $x$. The process of how one of these $W_i$, $b_i$ pairs can be calculated is shown in Figure 1. This piecewise linear mapping structure can be extended to various other common layer types, such as max and average pooling.

Fig. 1: An illustration of how the ReLU activation pattern for an input determines the linear mapping used for that input. The region of validity refers to the possible $x$ values for which this ReLU activation pattern exists. All the equations must be satisfied.
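The procedure described above (activation pattern, then affine map, then region of validity) can be sketched for a plain fully connected ReLU network. This is a minimal illustration under our own assumptions, not the implementation used for the experiments: the function names `forward` and `local_affine_map`, the random 2 → 4 → 3 → 1 network, and the sample point are all hypothetical.

```python
import numpy as np

def forward(weights, biases, x):
    """Evaluate a fully connected network with ReLU after every layer but the last."""
    for W_l, b_l in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W_l @ x + b_l, 0.0)
    return weights[-1] @ x + biases[-1]

def local_affine_map(weights, biases, x):
    """Return (W, b, A, c) such that f(z) = W z + b whenever A z <= c."""
    W = np.eye(len(x))      # affine map from the input to the current layer
    b = np.zeros(len(x))
    A_rows, c_rows = [], []
    for W_l, b_l in zip(weights[:-1], biases[:-1]):
        W, b = W_l @ W, W_l @ b + b_l           # pre-activations as a function of the input
        active = (W @ x + b) > 0                # ReLU activation pattern at x
        sign = np.where(active, -1.0, 1.0)
        # active unit:   W z + b >= 0  <=>  -W z <= b
        # inactive unit: W z + b <= 0  <=>   W z <= -b
        A_rows.append(sign[:, None] * W)
        c_rows.append(-sign * b)
        W, b = active[:, None] * W, active * b  # ReLU zeroes the inactive rows
    W, b = weights[-1] @ W, weights[-1] @ b + biases[-1]
    return W, b, np.vstack(A_rows), np.concatenate(c_rows)

# Hypothetical 2 -> 4 -> 3 -> 1 network with random weights (illustration only).
rng = np.random.default_rng(0)
sizes = [2, 4, 3, 1]
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

x = np.array([0.3, -0.7])
W, b, A, c = local_affine_map(weights, biases, x)
print(np.allclose(W @ x + b, forward(weights, biases, x)))  # does the affine map agree with f at x?
print(np.all(A @ x <= c + 1e-9))                            # does x lie in its own polytope?
```

Each hidden unit contributes one row to $A_i$ and one entry to $c_i$, so the polytope is described by as many inequalities as there are ReLU nodes; many of these are typically redundant.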
The $W_i$ and $A_i$ are also linked: the $W_i$ are selected based on which ReLU nodes are activated, and the $A_i$ describe where ReLU nodes switch from activated to deactivated or vice versa. This is partially illustrated in Figure 1, and a specific, smaller example is shown later in Equations 3.2 and 3.3. There are also similarities and relationships between different $W_i$ or $A_i$: because they come from the same network weights with rows removed, there is an inherent structure in the specific values used to construct them.
Note that, in general, the number of regions, $m$, has the potential to be very large, with exponential growth in the depth and polynomial growth in the width of the network [14,16,18]. Experimentally, trained networks have been shown to typically exhibit polynomial growth with the number of ReLU activations of the network, where the degree of the polynomial is the input dimension [4]. Although this is polynomial, networks applied in domains such as image recognition frequently have inputs with at least 1,000 dimensions, so this still results in very large numbers of regions [19].
The linear mapping network definition, Equation 3.1, highlights the fact that as long as none of the ReLU nodes switches from "activated" to "deactivated" or vice versa, the behavior of the network is purely linear. Since the network is a composition of continuous linear and piecewise linear functions, it is itself a continuous piecewise linear function that splits the input space into disjoint polytopes, on each of which there is an associated affine mapping. This is a simple way to conceptualize what ReLU networks compute but, unfortunately, the typically extreme growth in the number of facets in Equation 3.1 means that enumerating the full set of affine mappings is impractical. Equation 3.1 is of conceptual value but arguably not of much practical value by itself; however, it leads to several distinct, yet ultimately equivalent, views of neural networks. These views are:
• The weight matrix, $W_i$, is the Jacobian of the neural network in the region described by $A_i$. The $j$th row of $W_i$ is the gradient of the $j$th output of the network. This fact has been utilized previously to consider sensitivity metrics for neural networks [15]. This also allows for simple calculation of the $W_i$ and $b_i$ values.
• The weight matrices, $W_i$, and biases, $b_i$, form a set of linear maps which the neural network chooses from based on the value of the input. Each row of these $W_i$ is a surface normal to the hyperplane used for classification.
• The choices are based on the location of the input in a set of connected polytopes induced by the ReLU structure of the network. We provide animations showing how these structures evolve as networks train in Section 4.
• Each row of $W_i$ concatenated with the corresponding element of $b_i$ forms a point in $\mathbb{R}^{d+1}$. These points can be considered as lying in a "dual" space to the corresponding output of the network, and their structure is analyzed in that context. We show how this space forms in this section and Section 4, and analyze this space for clustering and similarity of networks in Sections 5.1 and 5.2.
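The first and last of these views can be checked numerically on a small example. The sketch below is our own illustration, assuming a hypothetical single-hidden-layer network with random weights: it compares the closed-form $W_i$ against a finite-difference Jacobian and then stacks each row of $W_i$ with the matching entry of $b_i$ to obtain the dual points.

```python
import numpy as np

# Hypothetical single-hidden-layer network: d inputs, h hidden ReLU nodes, o outputs.
rng = np.random.default_rng(1)
d, h, o = 3, 5, 2
W1, b1 = rng.standard_normal((h, d)), rng.standard_normal(h)
W2, b2 = rng.standard_normal((o, h)), rng.standard_normal(o)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_map(x):
    """Affine map (W, b) of the linear region containing x."""
    a = (W1 @ x + b1 > 0).astype(float)          # activation pattern at x
    return W2 @ (a[:, None] * W1), W2 @ (a * b1) + b2

def numerical_jacobian(x, eps=1e-6):
    """Central finite differences; valid when x is not on a region boundary."""
    cols = [(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(len(x))]
    return np.stack(cols, axis=1)

x = rng.standard_normal(d)
W, b = local_map(x)
print(np.allclose(numerical_jacobian(x), W, atol=1e-5))  # Jacobian view vs. closed form

# One dual point in R^(d+1) per output: (row j of W, entry j of b).
dual_points = np.hstack([W, b[:, None]])
print(dual_points.shape)
```

For architectures where a closed form is inconvenient, the same $W_i$ could instead be obtained with automatic differentiation, and $b_i$ recovered as $f(x) - W_i x$.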

3.2. Example on XOR. For an example of how the piecewise linear nature of ReLU neural networks works, we consider the XOR problem and a ReLU neural network that solves it, as presented in Figure 2. We choose XOR as it is a complex enough problem that it illustrates nonlinear aspects of network behavior, but simple enough that full analysis of that behavior is feasible. Note that for the XOR function itself, shown in Figure 2a, zero is replaced with minus one to make subsequent examination of network calculations easier. Figure 2b shows a network which solves the XOR problem. The functional form of that same network mapping from the two inputs $x$ and $y$ may be written as in Equation 3.2. As a function on $\mathbb{R}^2$, the network divides $\mathbb{R}^2$ into three linear regions with corresponding linear function/polytope pairs (Equation 3.3); for instance, neither ReLU is activated on the region $x + y \le -1$. These linear regions are shown in Figure 3.

Fig. 3: The polytopes and associated linear regions for a simple network to solve the XOR problem. Left: the cross-section of the network in the plane. Green corresponds to points that would be labeled in the positive class (neural network output greater than zero) and red corresponds to points that would be labeled in the negative class (neural network output less than zero). The black lines correspond to the points at which one of the two ReLU units "activates" or "deactivates" and switches the linear region used for classification. The three polytopes form bands in the plane. Right: the surface of the neural network. The points used for training are shown as green and red dots, the nonlinearities are shown as red lines, and the decision boundary (zeros of the network) is shown as black lines.

Even for this very simple example a complication arises: there is actually a "fourth" region, $-4x - 4y + 3$, tied to the case where the bottom ReLU unit is activated and the top is not. However, that case occurs in the empty polytope $0 \le x + y \le -1$, which cannot occur for any values of $x$ and $y$, and thus in practical terms this polytope does not exist. This is an example of a general phenomenon where cases exist in principle but are unreachable regardless of input. Further, the existence of such cases explains in part why the number of possible linear regions grows as it does and not simply exponentially in the number of ReLU functions.
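The unreachable activation pattern can be reproduced with a few lines of code. The weights below are hypothetical: they are not read off Figure 2b, but are chosen so that the two ReLU boundaries are $x + y = -1$ and $x + y = 0$ and the modified XOR points are classified correctly. Only three of the four possible activation patterns ever occur on a grid of inputs.

```python
def relu(t):
    return max(t, 0.0)

def xor_net(x, y):
    # Hypothetical weights (an assumption, not the network of Figure 2b).
    return 2.0 * relu(x + y + 1.0) - 3.0 * relu(x + y) - 1.0

def region(x, y):
    """Return (activation pattern, (w_x, w_y, b)) for the affine piece at (x, y)."""
    top, bottom = x + y + 1.0 > 0, x + y > 0
    w = 2.0 * top - 3.0 * bottom      # both inputs share the same coefficient
    b = 2.0 * top - 1.0
    return (top, bottom), (w, w, b)

# The four modified XOR points and the value the network assigns to each.
for point in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(point, xor_net(*point), region(*point))

# Collect every activation pattern that occurs on a grid over [-2, 2]^2.
grid = [i / 10.0 for i in range(-20, 21)]
patterns = {region(x, y)[0] for x in grid for y in grid}
print(sorted(patterns))   # the pattern (False, True) never appears
```

The pattern with the bottom unit on and the top unit off would require $x + y > 0$ and $x + y \le -1$ at the same time, which is the empty polytope discussed above.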
There are additional practical complications that can arise but do not on this network due to its simplicity: a network can be considered as a function on all of $\mathbb{R}^n$, but the data to which the network is actually applied lies in a bounded region within $\mathbb{R}^n$. Polytopes may exist outside of that region but not be meaningful for the given inputs. Furthermore, in many problems the data used is but a discrete subset of this bounded region. It is possible for the network to define polytopes lying in the bounded region but too small to contain any of the discrete data to which the network is applied. In general, the number of non-empty polytopes does typically grow beyond the number of actual training samples.
Returning to the regions shown in Figure 3, the weights and biases in these polytopes can be considered as $(d+1)$-dimensional points existing in a "dual space" to the original neural network. For example, a region on which the network computes $-x - y + 1$ induces the point $(-1, -1, 1)$. Further examples of these duals are illustrated in Figure 4. These can illustrate patterns in the behavior of the network, and as will be discussed in more detail, mapping between networks or clustering in this space can identify similarity metrics and areas where the neural network gives potentially unnecessary complexity.

3.3. Polytope Visualization. One way to think of the polytopes resulting from ReLU activation patterns is the way in which they arise as a consequence of the iterated perceptron structure inherent in this style of network. Each layer builds upon the nonlinearities in the previous layers by drawing a line in the output space of the previous layer. An example of this is illustrated in Figure 5.
The first hidden layer of the network, bottom left of Figure 5, is relatively simple: each of the nodes in the first layer has a line for a decision boundary (where the output of that node switches from positive to negative, resulting in the attenuation by ReLU). Each subsequent layer builds upon the previous. To illustrate this, the decision boundaries highlighted for each layer, in its plot in the bottom row of Figure 5, are reproduced in subsequent layers in gray. The more complicated decision boundaries for each subsequent layer are always locally linear, with changes in direction only arising where they intersect a boundary from a previous layer. This is a direct result, and also an illustration, of the fact that the nonlinearities of multi-layer networks must be built up from decision boundaries established by the previous layers in the network. Finally, notice in the bottom right of Figure 5 that the output layer of the network does as expected, constructing a valid piecewise linear approximation to the original classification task.

3.4. Extension to Image Data. The idea of investigating and visualizing linear regions can be extended to higher dimensions and specifically to image data, although visualizations are no longer as simple. We use the MNIST dataset of handwritten digits, which contains 60,000 training samples and 10,000 test samples [8]. MNIST was chosen as an image classification dataset due to its relative simplicity. We used PyTorch [17] to train four networks on the MNIST dataset. These networks are
• A dense network with a single hidden layer consisting of 128 nodes. This network achieves an accuracy of 96.03%. The training process used cross-entropy loss and PyTorch's SGD function with parameters of 0.01 update rate, 0.5 momentum, 0.01 weight decay, and a batch size of 64 over 30 epochs.
• A simple convolutional network consisting of a convolutional layer with 10 filters and a kernel size of 5, followed by a max pool, followed by a convolutional layer with 20 filters and a kernel size of 5, followed by a max pool, followed by a fully connected layer from 320 nodes to 50, followed by a linear layer from 50 nodes to the 10 outputs. This network achieves an accuracy of 98.07%. The training process used cross-entropy loss and PyTorch's SGD function with parameters of 0.01 update rate, 0.5 momentum, 0.01 weight decay, and a batch size of 64 over 30 epochs.
• A network with the Inception-v3 architecture as implemented in Torchvision's models subpackage, trained from scratch [12,21]. This network uses more complex layer structures, but to the best of our knowledge none of them result in the network not being a piecewise linear map. This network achieves an accuracy of [...]

For a given input image and a given output node, each network determines a polytope, within the input space, which contains the image. By restricting the neural network to one output node, the gradient of this restricted neural network, at the input image, can be displayed in the same format as the input image. The collection of 40 different gradient "images", computed by considering each of the 4 neural networks and each of the 10 output nodes, is visualized at the given input image in Figure 6. The dense network has relatively little complexity, so it is classifying based on its "ideal" shape of each output. The other networks have more complexity, tend to focus more sharply on the relevant information being passed in, and classify based on that input. ResNet has behavior that is not as human-interpretable. The visualization of these linear regions is similar to the idea of saliency mappings, although many modern forms of saliency mapping are more sophisticated than simply visualizing the gradients at an input image, as is done here [20].
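As a rough illustration of how such gradient "images" arise, the sketch below computes them for a dense network with one hidden layer of 128 nodes. Everything here is a stand-in: the weights are random rather than trained, the input is random noise rather than an MNIST digit, and the biases are set to zero for simplicity. For the convolutional, Inception, and ResNet models one would more likely rely on automatic differentiation than on this closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for trained weights of a 784 -> 128 -> 10 dense network.
W1, b1 = 0.05 * rng.standard_normal((128, 784)), np.zeros(128)
W2, b2 = 0.05 * rng.standard_normal((10, 128)), np.zeros(10)

def logits(image):
    return W2 @ np.maximum(W1 @ image.reshape(-1) + b1, 0.0) + b2

def gradient_images(image):
    """Return a (10, 28, 28) array: the gradient of each output w.r.t. the pixels."""
    active = (W1 @ image.reshape(-1) + b1) > 0   # activation pattern of this image
    jacobian = W2 @ (active[:, None] * W1)       # row j is the gradient of output j
    return jacobian.reshape(10, 28, 28)

image = rng.random((28, 28))                     # stand-in for a 28 x 28 digit
grads = gradient_images(image)
print(grads.shape)
```

Each of the ten slices of `grads` can be plotted with the same layout as the input image, which is the format used for the panels described above.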

4. Polytope Evolution Through Training. The polytope structures discussed in Section 3.3 and their associated linear mappings change as the network trains. For an example of this, we continue with the problem of classifying a circle versus a surrounding annulus and additionally consider a more complex problem that is a combination of the XOR problem and the circle versus annulus problem, both illustrated in Figure 7.

Fig. 8: The training process of the simplest possible network (three hidden nodes in a single layer) on this problem. An animation of this process is available at https://www.youtube.com/watch?v=lpXQI-UJIZM.
There exist many simple solutions to the single circle versus single annulus problem, but neural networks do not intrinsically take advantage of the rotational symmetry of this problem to express these solutions. As has been demonstrated previously [18,5], any network that solves this problem requires a minimum of three hidden nodes in at least one of its hidden layers. A node in any layer creates a line in the embedding that is its input, but when mapped back to the original input space that line becomes a trajectory that "breaks" by re-angling whenever it encounters a line created by the activation boundary of a node in a previous layer. A network with a maximum width of two is unable to solve this problem as it is unable to create a closed region in the input space. To see this, note that each layer can only partition space into four regions (both on, one on, the other on, both off), one of which (both off) will be constant. Due to this, any such network cannot form a closed region in space and will instead have each of its polytopes extend to infinity.

Fig. 9: A solution to this problem found by a more complex network (three hidden layers, each with eight nodes). An animation of the training process of this network is available at https://www.youtube.com/watch?v=rANyD9t-X-c.

To illustrate how these polytopes and decision boundaries change as the neural network trains, we have two examples. One is the simplest possible network with three nodes in the single hidden layer, and the other is a far more complex network with three hidden layers each containing eight nodes. Still images of the polytope development throughout the training process for the simple network are shown in Figure 8, and the end result of the complex network is shown in Figure 9. Full videos of the evolution of their polytope structure throughout the training are available at https://www.youtube.com/watch?v=lpXQI-UJIZM and https://www.youtube.com/watch?v=rANyD9t-X-c, respectively.
An example of the polytopes constructed by the more complex network on the more complex problem is in Figure 10. A video of the training process is shown at https://www.youtube.com/watch?v=T uoGBUOgUY.
It has previously been shown by Raghu et al. [18] that earlier layers are more important than later layers for the quality of a network, and certain visualizations of this were included in their work. These animations provide additional intuitive examples of this: the structures constructed by the early layers are passed on, and many of the deeper layers provide only slight modifications to the structures apparent in the first layers.
5. Region Modifications. Rather than focusing on the polytope structure of the networks, we can also investigate the affine mappings that arise on each polytope. This is useful for a number of reasons, but the two simplest are that the visualizations in the previous section cannot be done as simply in high dimension, and that the number of polytopes increases significantly with the complexity of the network. Even for the simplest network on MNIST, nearly every image in the dataset lies in a unique polytope, and that behavior has been shown to extend to other networks and datasets [15]. For investigating these affine mappings there are two useful steps to make: constructing notation to allow us to refer to the set of affine mappings potentially used for a specific output of a network, and considering only the affine mappings that are used for training or testing to reduce the number to something computationally manageable.

Fig. 10: A solution to this problem found by a more complex network (three hidden layers, each with eight nodes). An animation of the training process of this network is available at https://www.youtube.com/watch?v=T uoGBUOgUY.
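In terms of notation, the $W_i$ and $b_i$ described in Equation 3.1 can be written as

$$
W_i = \begin{bmatrix} w_{i,1} \\ \vdots \\ w_{i,o} \end{bmatrix}, \qquad
b_i = \begin{bmatrix} b_{i,1} \\ \vdots \\ b_{i,o} \end{bmatrix},
\tag{5.1}
$$

where each $w_{i,j}$ and $b_{i,j}$ correspond to the affine mapping in region $i$ for the $j$th output of the network. Then, it is possible to construct the matrix containing the set of linear regions used for a given output, $j$, as

$$
C_j = \begin{bmatrix} w_{1,j} & b_{1,j} \\ \vdots & \vdots \\ w_{m,j} & b_{m,j} \end{bmatrix}.
\tag{5.2}
$$

In practice, it is computationally infeasible to calculate all $m$ linear regions, so for the purpose of empirical studies we choose $p$ points in the input space to sample and construct the matrix

$$
C_j = \begin{bmatrix} w_{i_1,j} & b_{i_1,j} \\ \vdots & \vdots \\ w_{i_p,j} & b_{i_p,j} \end{bmatrix},
\tag{5.3}
$$

where $i_k$ denotes the index of the region containing the $k$th sampled point. For simple two-dimensional problems, we choose the $p$ points by sampling from a uniform grid. We also consider the MNIST dataset [8], where the points we sample are the 60,000 training and 10,000 testing input samples from that dataset. We construct the $C_j$ matrices using the training samples, and we additionally construct test-set matrices $\hat{C}_j$ using the testing samples for evaluation of how various modifications impact accuracy.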
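The sampled matrix can be assembled directly from activation patterns. The sketch below is our own illustration for a single-hidden-layer network evaluated on a 101 × 101 uniform grid in $[-1, 1]^2$; the function name `sample_dual_matrix` is hypothetical, and the weights are an assumed XOR-style network rather than the one in Figure 2b.

```python
import numpy as np

def sample_dual_matrix(W1, b1, W2, b2, points, j=0):
    """Row k holds (w, b) of output j for the linear region containing points[k]."""
    active = (points @ W1.T + b1) > 0          # (p, h) activation patterns
    coeff = active * W2[j]                     # output weights of the active nodes
    w = coeff @ W1                             # (p, d) gradients
    b = coeff @ b1 + b2[j]                     # (p,) offsets
    return np.hstack([w, b[:, None]])

# Hypothetical XOR-style network: f = 2*relu(x+y+1) - 3*relu(x+y) - 1.
W1, b1 = np.array([[1.0, 1.0], [1.0, 1.0]]), np.array([1.0, 0.0])
W2, b2 = np.array([[2.0, -3.0]]), np.array([-1.0])

axis = np.linspace(-1.0, 1.0, 101)
points = np.array([(x, y) for x in axis for y in axis])   # p = 10201 samples

C = sample_dual_matrix(W1, b1, W2, b2, points)
print(C.shape)
print(np.unique(C, axis=0))    # distinct dual points, one per region that was hit
```

Counting the distinct rows gives the number of linear regions that the sample actually touches, which is generally far smaller than $m$.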

5.1. Clustering Regions. Even for potentially large numbers of sampled affine maps, it is likely that many samples will have a unique $w_{i,j}$ due to the large number of total linear regions. For example, even for simple networks on the MNIST dataset, fewer than 1% of the inputs share a linear region with another input. This is not necessarily surprising, simply due to the sheer number of possible linear regions the network can construct.
However, although these weights are not necessarily equivalent, there is potentially a great deal of redundancy or similarity among them. As shown in Figure 4, patterns appear in the induced linear regions that can indicate redundant behavior. We can cluster the linear maps and determine how well those clusters are able to replicate the behavior of the network. Although it would be ideal to take advantage of this fact to simplify network structures, we do not currently have an algorithm for modifying the neural network based on clustering the points in the dual, so for now clustering is limited to a tool purely for analysis. To investigate the degree to which clustering impacts accuracy, we use the MNIST dataset.
The process for evaluating the impact on the MNIST dataset is as follows:
1. Calculate the $C_j$ and test-set $\hat{C}_j$ matrices.
2. Train a K-means clustering model on each collection of points in the $C_j$, $j = 1, ..., 10$, matrices.
3. For each row of each of the $\hat{C}_j$, determine which cluster center is closest.
4. Use that cluster center as a linear mapping from input space to determine the value for that output.
5. Classify the input based on which of the newly calculated outputs is highest.
Accuracies for this process with different numbers of cluster centers are shown in [...]

Fig. 11: The decision boundaries, wireframes, and weights in the dual space for two different XOR networks and the result of training an affine mapping from the linear regions of one network to the other. In the dual space the four points of XOR are in red and all other points sampled uniformly from the grid in the original input space are in blue.

Interestingly, the less complex networks capture the linear behavior of the MNIST dataset well. The dense, single-hidden-layer network in particular is able to recover a solution very close to the best linear classifier in the single cluster case, suggesting it is in some way strongly linear. This matches previous work that shows that wide networks tend to behave in highly linear ways [9].
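The five steps of the clustering evaluation can be sketched on a toy problem. This is an illustration under several assumptions of our own: a small random two-input network stands in for the MNIST networks, a basic Lloyd-style `kmeans` written in NumPy stands in for whatever K-means implementation was actually used, and agreement with the unmodified network stands in for labeled test accuracy.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Basic Lloyd iterations; returns the (k, d+1) cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return centers

def clustered_predict(C_test, X_test, centers):
    """C_test[j]: (p, d+1) dual rows for output j; centers[j]: (k, d+1)."""
    outputs = []
    for Cj, Kj in zip(C_test, centers):
        dist = ((Cj[:, None, :] - Kj[None, :, :]) ** 2).sum(-1)
        nearest = Kj[dist.argmin(axis=1)]                                   # step 3
        outputs.append((nearest[:, :-1] * X_test).sum(1) + nearest[:, -1])  # step 4
    return np.stack(outputs, axis=1).argmax(axis=1)                         # step 5

# Toy stand-in for a trained classifier: 2 inputs, 16 hidden nodes, 3 outputs.
rng = np.random.default_rng(0)
d, h, o = 2, 16, 3
W1, b1 = rng.standard_normal((h, d)), rng.standard_normal(h)
W2, b2 = rng.standard_normal((o, h)), rng.standard_normal(o)

def network(X):
    return np.maximum(X @ W1.T + b1, 0.0) @ W2.T + b2

def dual_rows(X, j):
    coeff = ((X @ W1.T + b1) > 0) * W2[j]
    return np.hstack([coeff @ W1, (coeff @ b1 + b2[j])[:, None]])

X_train, X_test = rng.uniform(-1, 1, (500, d)), rng.uniform(-1, 1, (200, d))
C_train = [dual_rows(X_train, j) for j in range(o)]        # step 1
C_test = [dual_rows(X_test, j) for j in range(o)]

for k in (1, 5, 20):
    centers = [kmeans(Cj, k) for Cj in C_train]            # step 2
    agree = (clustered_predict(C_test, X_test, centers) == network(X_test).argmax(1)).mean()
    print(k, agree)
```

With a single cluster per output, every input is classified by the same affine map, so the clustered model reduces to a linear classifier; this is the case referred to as the single cluster case in the discussion of results.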
Additionally, although the Inception architecture performs around the accuracy that would be expected from a random classifier with a single cluster, it is able to recover 96.6% accuracy (better than the original dense network) with as few as 10 clusters per output.

5.2. Affine Maps Between Sample Centered Local Functions. Another area where representing the weights of these linear regions as points in space can be useful is in finding similarities between two networks. Given $C_{j,\text{network1}}$ and $C_{j,\text{network2}}$, we can train a least-squares regression model to find a matrix $M_j \in \mathbb{R}^{(d+1) \times (d+1)}$ that minimizes

$$
\left\| C_{j,\text{network1}} M_j - C_{j,\text{network2}} \right\|_F^2.
\tag{5.4}
$$

This method finds a mapping between the linear region weights, or, equivalently, between the gradients of the outputs with respect to the input. Due to this, as with the K-means clustering method, this method requires running inputs through each original network, calculating the weights, then applying the transformation. This is similar to the work done by McNeely-White et al. [13], where the authors demonstrated that the outputs of the final layer before the linear classifier of networks trained on ImageNet are affine-equivalent. Unlike their work, our work investigates the connection between the affine mappings of networks, rather than the feature vectors of networks.
For XOR, we use the $C$ matrices (there is only one output) constructed by sampling points on the 101 × 101 uniform grid. For MNIST, we use the $C_j$ matrices arising from the training set, then evaluate the degree to which the constructed $M_j$ reduce accuracy on the test-set matrices $\hat{C}_j$ calculated on the testing set.
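A minimal sketch of fitting such a map is given below, assuming the least-squares objective is the Frobenius norm of $C_{j,\text{network1}} M_j - C_{j,\text{network2}}$. The matrices here are synthetic stand-ins rather than dual points of real networks, and the helper names `fit_dual_map` and `outputs_from_dual` are hypothetical.

```python
import numpy as np

def fit_dual_map(C_src, C_dst):
    """Least-squares M minimizing the Frobenius norm of C_src @ M - C_dst."""
    M, *_ = np.linalg.lstsq(C_src, C_dst, rcond=None)
    return M

def outputs_from_dual(C, X):
    """Evaluate each row's affine map (w, b) at the matching input: w . x + b."""
    return (C[:, :-1] * X).sum(axis=1) + C[:, -1]

rng = np.random.default_rng(0)
p, d = 200, 2
X = rng.uniform(-1, 1, (p, d))                      # sampled inputs
C_net1 = rng.standard_normal((p, d + 1))            # stand-in for C_{j,network1}
M_true = rng.standard_normal((d + 1, d + 1))
# Stand-in for C_{j,network2}: a noisy linear image of the first network's dual points.
C_net2 = C_net1 @ M_true + 0.01 * rng.standard_normal((p, d + 1))

M = fit_dual_map(C_net1, C_net2)
mapped = C_net1 @ M
print(np.abs(mapped - C_net2).max())                # mismatch in the dual space
print(np.abs(outputs_from_dual(mapped, X) - outputs_from_dual(C_net2, X)).max())
```

On real networks, the mapped rows would be fitted on training samples and then used in place of the second network's own affine maps on test samples, so that the resulting outputs can be classified and scored.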
Results of this process for the XOR networks are shown in Figure 11. The resulting points of $C_{j,\text{network1}} M_j$ are very similar to $C_{j,\text{network2}}$ and vice versa. This method can also be used to get comparisons on the MNIST dataset. Examples of this are illustrated for five input samples for $W_{0,\text{dense}}$ and $W_{0,\text{conv}}$ in Table 12. Qualitatively, the mapped linear regions are similar, but not equivalent, to the target.
Table 2 shows the results of the affine mapping trained on the training set and evaluated on the testing set for the four networks trained on MNIST. In general, there is degradation in accuracy, but a high level of accuracy is nevertheless maintained. A few interesting points stand out. Inception does not reproduce the dense network well; this matches the result that Inception has poor accuracy with one cluster, and suggests that the architecture is doing something that is not as simple. ResNet is able to reproduce all except the dense network extremely well, suggesting that it forms a strong representation of the data that includes the behavior of the other networks.
One note of caution here is that in some respects this result is not necessarily unexpected: these four networks are trained to solve the same problem (and even use the same loss function), so on a certain level they are all approximating the same function. Additionally, the accuracies may not tell the whole story: although a high level of performance is preserved, it may be the case that the slight variations in accuracy represent significant, qualitatively meaningful differences in what the networks do. However, these results demonstrate that there is ostensibly an interesting relationship between these different networks and their similar behaviors. Further research is needed to gauge the extent to which networks trained to solve the same problem exhibit equivalent or near-equivalent behavior on a global level.

6. Conclusion. We have extended the work of Raghu et al. [18] in visualizing the polytope structure of neural networks with two inputs by constructing animations of the evolution of the polytope structure. These animations demonstrate how early layers have significant influence over the structure of subsequent layers and how the polytope structures form through training.
Additionally, we have shown experimentally that the number of linear regions that networks have is not necessarily a perfect predictor of their complexity. The linear regions of all networks considered, except ResNet, can be clustered to as few as one or ten cluster centers for networks trained on MNIST while preserving much of their accuracy.
We have also shown experimentally that the linear regions of different networks are similar under an affine mapping.Applying such an affine mapping preserves a high level of accuracy in the resulting classifier, suggesting that many of the considered networks are solving problems in globally similar ways.
For the future, we are interested in investigating the extent to which these results carry to different datasets and potentially more dissimilar networks: MNIST is a relatively simple, near linear dataset, and that may potentially skew our results. Additionally, all four MNIST networks considered were trained using the same process. Although this demonstrates that their disparate architectures do not seem to differentiate the approximations they learn, investigating to what extent structures such as different loss functions or other forms of regularization impact similarity could prove interesting.
We also provide support for the tantalizing idea that different networks converge to similar solutions that have a great deal more simplicity than would be suggested by their complex architectures.We would like to continue to explore the extent to which that idea is correct for modern neural networks.

Fig. 2: The modified XOR problem. (a) The input and output values; inputs and outputs are rescaled to range from −1 to 1 rather than from 0 to 1. (b) A network architecture and its associated weights that solves this problem. Nodes in red have ReLU applied after calculating their associated input values.

Fig. 4: The decision boundaries (left), wireframe representations of output (center), and dual representation of the linear regions (right) for three networks designed to solve the modified XOR problem. The top network is the simple one described previously. The center and bottom are single hidden layer neural networks, with the center having 20 hidden nodes and the bottom having 100 hidden nodes. In the dual, blue dots represent linear regions used on the 101 × 101 uniform grid in [−1, 1]^2. The red dots represent the linear regions used for the actual classification of the four data points. Note that the top image has only three such dots, rather than four, as the network has a total of only three linear regions.
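The region count in the top row can be reproduced in spirit by enumerating activation patterns on the same 101 × 101 grid. The sketch below uses hypothetical weights chosen by us to solve the rescaled XOR problem with three linear regions; they are an assumption and are not claimed to match the weights shown in Fig. 2.

```python
import numpy as np

# Hypothetical weights (an assumption, not those of Fig. 2):
# two ReLU units acting on x + y, giving f(x, y) = 1 - 2 * (h1 + h2).
W1 = np.array([[1.0, 1.0], [-1.0, -1.0]])
b1 = np.array([-1.0, -1.0])
w2 = np.array([-2.0, -2.0])
b2 = 1.0

def forward(X):
    return np.maximum(X @ W1.T + b1, 0.0) @ w2 + b2

def activation_patterns(X):
    """One row of 0/1 flags per input: which hidden units are active."""
    return (X @ W1.T + b1 > 0).astype(int)

# The four data points of the rescaled XOR problem and their targets.
corners = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
targets = np.array([-1.0, 1.0, 1.0, -1.0])

# 101 x 101 uniform grid on [-1, 1]^2.
axis = np.linspace(-1.0, 1.0, 101)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)

# Each distinct activation pattern corresponds to one linear region.
grid_regions = np.unique(activation_patterns(grid), axis=0)
corner_regions = np.unique(activation_patterns(corners), axis=0)
```

With these weights the two points labeled +1 fall in the same region, which is why the four data points occupy only three regions.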

Fig. 5: The polytopes resulting from the various layers of a simple network to classify a circle versus a surrounding annulus. Top left: the original problem and the decision boundary determined by the network. Top middle: the outputs of the network. Top right: the dual weights. Bottom first: the zeros for each of the perceptrons from the original input space to the first hidden layer. These decision boundaries are all lines, as the perceptrons at this stage are purely linear in the original input. Each color corresponds to one of eight nodes in this hidden layer (and the colors do not relate between each of the four bottom plots). Not every node has zeros occurring in the window shown. Bottom second: the zeros for each of the perceptrons from the original input to the second hidden layer, with the boundaries for the first hidden layer in light gray. These are lines in the output space of the first layer, but appear non-linear when shown in the original input space. Each boundary can only break at one of the lines from the previous layer. Bottom third: the zeros for each of the perceptrons from the original input to the third layer, with the boundaries of the first two layers in gray. Breaks in this layer can occur at any location where it crosses a zero of a previous layer. Bottom fourth: zeros in the output layer. This forms the decision boundary shown in the top left.

Fig. 6: The linear regions for each output of the four networks for the input four in the top left corner. These visualizations are similar to simple forms of saliency mapping [20].

Fig. 7: Two classification problems used to show animations of polytope structures during the training process. Left: the goal is to classify points in the red annulus as belonging to a separate class from those in the blue annulus. Right: a combination of the left problem with the XOR problem to demonstrate more sophisticated network behavior.

Fig. 12: An example of the affine mapping from the dense network to the simple convolutional network.

Table 1: The accuracies of networks (columns: Dense, Conv, Inception, ResNet) on the MNIST dataset after applying K-means clustering to their collection of local linear maps. Values reported are the number of correctly labeled test set samples out of 10,000. Note that the number of clusters for a given network is technically 10 times larger than stated in the table: for a given output there will be that many clusters, but there are 10 outputs for the MNIST networks.

Table 2: Number of correct labels on the test set (out of 10,000) after applying the affine mapping. Diagonal elements are the original accuracies of the networks.

[…] function resulting from this is no longer continuous: because the bias is part of what is being mapped, the result is able to vary based on the position in the plane and regions may no longer join at their boundaries.