Immiscible Color Flows in Optimal Transport Networks for Image Classification

In classification tasks, it is crucial to meaningfully exploit the information contained in data. While much work on these tasks is devoted to building complex algorithmic infrastructures that process inputs in a black-box fashion, less attention is paid to how to exploit the various facets of the data before feeding them into an algorithm. Here, we focus on this latter perspective by proposing a physics-inspired dynamical system that adapts Optimal Transport principles to effectively leverage the color distributions of images. Our dynamics regulates immiscible fluxes of colors traveling on a network built from images. Instead of aggregating colors together, it treats them as different commodities that interact through a shared capacity on edges. The resulting optimal flows can then be fed into standard classifiers to distinguish images of different classes. We show how our method can outperform competing approaches on image classification tasks in datasets where color information matters.


I. INTRODUCTION
Optimal Transport (OT) is a powerful method for computing the distance between two data distributions. This problem has a cross-disciplinary domain of applications, ranging from logistics and route optimization [1-3], to biology [4,5] and computer vision [6-10], among others. Within this broad variety of problems, OT is widely utilized in machine learning [11], and deployed for solving classification tasks, where the goal is to optimally match discrete distributions that are typically learned from data. Relevant usage examples are also found in multiple fields of physics, as in protein fold recognition [12], stochastic thermodynamics [13], the design of transportation networks [14,15], routing in multilayer networks [16], and general relativity [17]. A prominent application is image classification [18-23], where the goal is to measure the similarity between two images. OT solves this problem by interpreting an image pair as two discrete distributions and then assessing their similarity via the Wasserstein (W1) distance ([24], Definition 6.1), a measure obtained by minimizing the cost needed to transform one distribution into the other. Using W1 for image classification carries many advantages over other similarity measures between histograms. For example, W1 preserves all properties of a metric [9,24], it is robust under domain shift between train and test data [22], and it provides meaningful gradients for learning data distributions on non-overlapping domains [25]. Because of these, and several other desirable properties, much research effort has been put into speeding up algorithms to calculate W1 [12,19,20,26,27]. However, all these methods overlook the potential of using image colors directly in the OT formulation. As a result, practitioners have access to increasingly efficient algorithms that do not necessarily improve prediction accuracy, as we lack a framework that fully exploits the richness of the input information.
Colored images, originally encoded into three-dimensional histograms (one dimension per color channel), are often compressed into lower-dimensional data using feature extraction algorithms [9,23]. Here, we propose a different approach that maps the three distinct color histograms to multicommodity flows transported on a network built from images' pixels. We combine recent developments in OT with physical insights from capacitated network models [1,5,28-31] to treat colors as mass of different types that flows through the edges of a network. The different flows are coupled together through a shared conductivity and minimize a unique cost function. This setup is reminiscent of the distinction between modeling the flow of one substance, e.g. water, and modeling flows of multiple substances that do not mix, e.g. immiscible fluids, sharing the same network infrastructure. By virtue of this multicommodity treatment, we achieve stronger classification performance than state-of-the-art OT-based algorithms on real datasets where color information matters.

II. PROBLEM FORMULATION

A. Unicommodity Optimal Transport
Given two m- and n-dimensional probability vectors g, h and a positive-valued ground cost matrix C, the goal of a standard (unicommodity) OT problem is to find an optimal transport path P* satisfying the conservation constraints Σ_j P_ij = g_i ∀i and Σ_i P_ij = h_j ∀j, while minimizing J(g, h) = Σ_ij P_ij C_ij. Entries P_ij can be interpreted as the mass transported from g_i to h_j when paying a cost C_ij, while J*, i.e. J evaluated at P*, encodes the minimum effort needed to transport g to h. Notably, if all entries C_ij are distances between i and j, then J* is the W1 distance between g and h (see [24] for a standard proof and [9] for derivations focusing on the discrete case).
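As a concrete illustration, the unicommodity problem above is a linear program and can be solved directly for small histograms. The following sketch uses SciPy's LP solver; the function name ot_cost and the toy histograms are our own illustrative choices, not taken from the paper's code.

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(g, h, C):
    """Solve min_P sum_ij P_ij C_ij s.t. row sums = g, column sums = h, P >= 0."""
    m, n = C.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j P_ij = g_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # sum_i P_ij = h_j
    b_eq = np.concatenate([g, h])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, n)

# Two 2-bin histograms: half the mass must move at unit cost, so J* = 0.5.
g = np.array([1.0, 0.0])
h = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
cost, P = ot_cost(g, h, C)
```

The returned P satisfies both marginal constraints, and the optimal cost equals the mass that had to move times the unit distance.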

B. Physics-inspired Multicommodity Optimal Transport
Interpreting colors as mass traveling along a network built from images' pixels (as defined in detail below), unicommodity OT could be used to capture the similarity between grayscale images. However, it may not be ideal for colored images, when color information matters. The limitation of the unicommodity OT of Section II A is that it does not fully capture the variety of information contained in different color channels, as it is not able to distinguish them. Motivated by this, we tackle this challenge and move beyond the standard setting by incorporating insights from the dynamics of immiscible flows in physics. Specifically, we treat the pixels' color channels as mass of different types that do not mix, but rather travel and interact on the same network infrastructure, while optimizing a unique cost function. By assuming capacitated edges with conductivities proportional to the amount of mass traveling through an edge, we can define a set of ODEs that regulate fluxes and conductivities. These are optimally distributed along a network to better account for color information, while satisfying physical conservation laws. Similar ideas have been successfully used to route different types of passengers in transportation networks [2,16,32].
Formally, we couple together the histograms of M = 3 color channels, the commodities, indexed by a = 1, ..., M. We define g^a, h^a as m- and n-dimensional probability vectors of mass of type a. More compactly, we define the matrix G with entries G_ia = g^a_i (resp. H for h), each containing the intensity of color channel a in pixel i of the first (resp. second) image. These regulate the sources and sinks of mass in our setting. We then enforce conservation of mass for each commodity index a: Σ_i g^a_i = Σ_j h^a_j. This ensures that all the color mass in the first image is accounted for in the second one, and vice versa, and that this holds for each mass type.
Moreover, we define the set Π(G, H) of (m × n × M)-dimensional tensors P with entries P^a_ij, being transport paths between g^a and h^a. These regulate how fluxes of colors of different types travel along a network. We enforce interaction between transport paths of different commodities by introducing a shared cost:

J_Γ(G, H) = Σ_ij C_ij ||P_ij||^Γ ,    (1)

where ||P_ij|| is the 2-norm of the vector P_ij = (P^1_ij, ..., P^M_ij) and 0 < Γ < 4/3 is a regularization parameter. We take Γ > 0 since a negative exponent would favor the proliferation of loops with infinite mass [28]. Instead, we conventionally consider Γ < 4/3 (see Section III B) since the cost J_Γ exhibits the same convexity properties for any Γ > 1, i.e., it is strictly convex, and OT paths do not substantially change with Γ in this regime [2]. We can thus formulate the corresponding multicommodity OT problem as that of finding a tensor P* solution of:

P* = argmin_{P ∈ Π(G,H)} J_Γ(G, H) .    (2)

Notice that for M = 1 and Γ = 1 we recover the standard unicommodity OT setup. The problem in Eq. (2) admits a precise physical interpretation. In fact, it can be recast as a constrained minimization problem whose objective function is the energy dissipated by the multicommodity flows (Joule's law), with constant total conductivity. Furthermore, transport paths follow Kirchhoff's law enforcing conservation of mass [2,32,33] (see Supplementary Material for a detailed discussion).
Noticeably, J_Γ takes into account all the different mass types, and the OT paths P* are found through a unique optimization problem. We emphasize that this is fundamentally different from solving M independent unicommodity problems, where the different types of mass are not coupled together as in our setting, and then combining their optimal costs to estimate images' similarity. Estimating J_Γ(G, H) directly gives a quantitative and principled measure of the similarity between two images G and H: the lower this cost, the higher the similarity of the two images. While this also holds for the unicommodity cost of Section II A, the difference here is that we account for color information by distinguishing colors via the M-dimensional vector P_ij. The cost in Eq. (2) then properly couples colors by following the physical laws regulating immiscible flows. The idea is that, if this information matters for the given classification task, incorporating it into the minimization problem yields a cost that helps distinguish images better, e.g. with higher accuracy.
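To make the coupling concrete, the shared cost J_Γ can be evaluated in a few lines. The sketch below (illustrative names, random toy tensors, not the paper's code) also checks numerically that for Γ < 1 the shared 2-norm makes the coupled cost cheaper than summing M independent unicommodity costs, which is the subadditivity invoked later in the text.

```python
import numpy as np

def multicommodity_cost(P, C, Gamma):
    """J_Gamma = sum_ij C_ij ||P_ij||_2^Gamma, 2-norm taken over the M commodities."""
    return np.sum(C * np.linalg.norm(P, axis=2) ** Gamma)

rng = np.random.default_rng(0)
m, n, M = 4, 5, 3
P = rng.random((m, n, M))    # toy multicommodity transport paths
C = rng.random((m, n))       # toy ground costs

# For Gamma < 1 the shared norm is subadditive: grouping the M commodities on
# shared edges is cheaper than paying for each channel as a separate problem.
coupled = multicommodity_cost(P, C, Gamma=0.5)
independent = sum(np.sum(C * P[:, :, a] ** 0.5) for a in range(M))
```

Since ||P_ij||_2^Γ ≤ Σ_a (P^a_ij)^Γ for Γ ≤ 1 and nonnegative entries, `coupled` is strictly smaller than `independent` for these random positive paths.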

A. Optimal Transport network on images
Having introduced the main ideas and intuitions, we now explain in detail how to adapt the OT formalism to images. Specifically, we introduce an auxiliary bipartite network K_mn(V1, V2, E12), the first building block of the network where the OT problem is solved. A visual representation is shown in Fig. 1. Images 1 and 2 are represented as matrices G and H of size m × M and n × M, respectively, where M is the number of color channels (M = 3 in our examples). The sets of nodes V1, V2 of the network K_mn are the pixels of Image 1 and Image 2, respectively. The set of edges E12 contains a subset of all pixel pairs between the two images, as detailed below. We assign to an edge (i, j) the cost

C_ij(θ, τ) = min{ (1 − θ) ||v_i − v_j||_2 + θ ||G_i − H_j||_1 , τ } ,    (3)

where the vector v_i = (x_i, y_i) contains the horizontal and vertical coordinates of pixel i of Image 1 (similarly v_j for Image 2), and both terms are suitably rescaled (see Supplementary Material). The quantity θ ∈ [0, 1] is a hyperparameter given in input, which can be chosen with cross-validation. It acts as a weight in a convex combination of the Euclidean distance between pixels and the difference in their color intensities, following the intuition in [9,23]. When θ = 0, the OT path P* minimizes only the geometrical distance between pixels. Instead, when θ = 1, pixels' locations are no longer considered, and transport paths are only weighted by color distributions. The parameter τ is introduced following [22,23] with the scope of removing all edges with cost C_ij(θ, τ) = τ, i.e. those for which (1 − θ) ||v_i − v_j||_2 + θ ||G_i − H_j||_1 ≥ τ. These are substituted by m + n transshipment edges e ∈ E', each of cost τ/2, connected to one unique auxiliary vertex u1.
Thresholding the cost significantly decreases the computational complexity of OT, making it linear in the number of nodes. Furthermore, we relax the conservation of mass by allowing Σ_i G_ia ≠ Σ_j H_ja. The excess mass m^a = Σ_j H_ja − Σ_i G_ia is assigned to a second auxiliary node, u2. We connect it to the network with n additional transshipment edges e ∈ E', each penalizing the total cost by c = max_ij C_ij / 2. This construction improves classification when the histograms' total masses differ largely [22]. Intuitively, this can happen when comparing "darker" images against "brighter" ones; more precisely, when the entries of g^a and h^a are farther apart in the RGB color space.
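A minimal sketch of this construction (illustrative function and variable names, random toy pixels, not the paper's code): the ground cost saturates at τ, the surviving edges form E12, and the per-commodity excess mass m^a is what gets routed through u2.

```python
import numpy as np

def edge_costs(coords1, coords2, G, H, theta, tau):
    """Convex combination of rescaled pixel distance and color distance,
    saturated at tau (edges hitting tau are later replaced by transshipment
    links of cost tau / 2 each)."""
    X = np.linalg.norm(coords1[:, None, :] - coords2[None, :, :], axis=2)
    Y = np.abs(G[:, None, :] - H[None, :, :]).sum(axis=2)    # 1-norm over RGB
    X, Y = X / X.max(), Y / Y.max()                          # rescale to [0, 1]
    return np.minimum((1 - theta) * X + theta * Y, tau)

rng = np.random.default_rng(1)
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # 2x2 pixel grid
G, H = rng.random((4, 3)), rng.random((4, 3))                     # RGB intensities
C = edge_costs(coords, coords, G, H, theta=0.25, tau=0.125)
kept = C < 0.125                          # edges surviving the trimming (E12)
excess = H.sum(axis=0) - G.sum(axis=0)    # per-commodity excess mass m^a for u2
```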
Overall, we obtain a network K with node set V = V1 ∪ V2 ∪ {u1, u2} and edge set E = E12 ∪ E', i.e. the original bipartite graph K_mn together with the auxiliary transshipment links and nodes. Note that in its entirety the system is isolated, i.e. the total mass is conserved. See Supplementary Material for a detailed description of the OT setup.
Given this auxiliary graph, the OT problem is then solved by injecting in nodes i ∈ V1 the color mass contained in Image 1, as specified by G, and extracting it at nodes j ∈ V2 of Image 2, as specified by H. This is done by transporting mass using either (i) an edge in E12, or (ii) a transshipment one in E'. In the following section we describe how this problem is solved mathematically.

B. Optimizing immiscible color flows: the dynamics
We solve the OT problem by proposing the following ODEs controlling mass transportation:

Σ_j L_ij[x] φ^a_j = S^a_i    ∀i, a ,    (4)

dx_e/dt = x_e^β ||φ_i − φ_j||_2^2 / ℓ_e^2 − x_e    ∀e = (i, j) ,    (5)

which constitute the pivotal equations of our model. Here we introduce the shared conductivities x_e ≥ 0 and the node potentials φ^a; L[x] is the weighted Laplacian of K with weights x_e/ℓ_e (ℓ_e being the length of edge e), S^a collects the sources and sinks of commodity a as dictated by G and H, the 2-norm in Eq. (5) is taken over the commodities a, and β ∈ (0, 2) is a hyperparameter that needs to be chosen before solving Eqs. (4)-(5). Depending on the modeling task, its value can be fixed a priori (e.g. β = 1 for the shortest path problem [34], β ≈ 5/3 for river networks [35], and β → 2⁻ for the Steiner tree problem [36]), or cross-validated, as we do here for image classification. The exponent aggregates paths following a principle of economy of scale if 1 < β < 2; it dilutes them along the network otherwise, with the goal of reducing traffic congestion. This behavior is a direct consequence of the subadditivity of J_Γ in Eq. (2) for β > 1 (Γ < 1), and respectively of its superadditivity for β < 1 (Γ > 1). It has been theoretically discussed and empirically observed, for example, in [32,37,38].
The feedback mechanism of Eq. (5) defines multicommodity fluxes P^a_e that are admissible for the minimization problem introduced in Eq. (2). Particularly, for color of type a on edge e = (i, j), we couple the potentials φ^a_i, solutions of Eq. (4), and the shared conductivities x_e to define:

P^a_e = x_e (φ^a_i − φ^a_j) / ℓ_e .    (6)

This also highlights another physical interpretation: interpreting the φ^a_i as pressure potentials, the fluxes arise from a difference in pressure between two nodes, as in hydraulic or electrical networks. Crucially, this allocation is governed by one unique conductivity for all commodities, whose dynamics depends on the 2-norm over a of the differences of potentials, as in Eq. (5). In analogy with immiscible flows, this ensures that flows of different types share the same infrastructure; in practice, it couples them into a unique optimization problem.
In the case of only one commodity (M = 1), variants of this dynamics have been used to model transport optimization in various physical systems [1,5,[29][30][31].
The salient result of our construction is that the asymptotic trajectories of Eqs. (4)-(5) are equivalent to minimizers of Eq. (2), i.e. lim_{t→+∞} P(t) = P* (see Supplementary Material for derivations following [32,33]). Therefore, numerically integrating our dynamics solves the multicommodity OT problem. In other words, this allows us to estimate the optimal cost in Eq. (2), and to use it to compute similarities between images. A pseudo-code of the algorithmic implementation is shown in Algorithm 1.
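The update loop of Algorithm 1 can be sketched on a toy path graph as follows. The incidence-matrix formulation, explicit-Euler step size, and lower clipping of x are our own implementation choices for illustration, not the paper's code; on this path, Kirchhoff's law forces one unit of each commodity through every edge.

```python
import numpy as np

def step(x, B, ell, S, beta, dt):
    """One explicit-Euler step of the shared-conductivity dynamics.
    B: |V| x |E| signed incidence matrix; S: |V| x M net sources per commodity."""
    L = B @ np.diag(x / ell) @ B.T               # weighted Laplacian L[x]
    phi = np.linalg.pinv(L) @ S                  # potentials, Kirchhoff's law (Eq. 4)
    dphi = B.T @ phi                             # potential drops on edges, per commodity
    flux = (x / ell)[:, None] * dphi             # fluxes from potentials and conductivity
    grad2 = np.sum(dphi ** 2, axis=1) / ell**2   # 2-norm over commodities, squared
    x = x + dt * (x ** beta * grad2 - x)         # conductivity update (Eq. 5)
    return np.maximum(x, 1e-12), flux            # clip to keep conductivities nonnegative

# Path graph 0 - 1 - 2 with unit edge lengths; one unit of each of M = 2
# commodities is injected at node 0 and extracted at node 2.
B = np.array([[1.0, 0.0],
              [-1.0, 1.0],
              [0.0, -1.0]])
ell = np.ones(2)
S = np.array([[1.0, 1.0],
              [0.0, 0.0],
              [-1.0, -1.0]])
x = np.ones(2)
for _ in range(200):
    x, flux = step(x, B, ell, S, beta=1.0, dt=0.1)
```

For β = 1 and unit fluxes on both edges, the stationarity condition x = ||Δφ||² / x gives x = √2 on each edge, which the iteration approaches.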

C. Computational complexity
In principle, our multicommodity method has a computational complexity of order O(M |V|^2) for complete transport network topologies, i.e., when edges of the transport network K are assigned to all pixel pairs. Nonetheless, we substantially decrease this complexity by sparsifying the graph with the trimming procedure of [22,23], reducing it to O(M |V|). More details are given in the Supplementary Material. Empirically, we observe that when running Eqs. (4)-(5) most of the entries of x decay to zero after a few steps, producing a progressively sparser weighted Laplacian L[x]. This allows for faster computation of the Moore-Penrose inverse L†[x], and of the least-squares potentials φ^a. A thorough experimental analysis of the convergence properties of the OT dynamics has been carried out in [39].

A. Classification task
We provide empirical evidence that our multicommodity dynamics outperforms competing OT algorithms on classification tasks. As anticipated above, we use the optimal OT cost J_Γ as a measure of similarity between two images, and perform supervised classification with a k-nearest neighbor (k-NN) classifier as in [20]. Alternative methods (e.g. SVM as in [19]) could also be employed for this task. However, these may require the cost J_Γ to satisfy the distance axioms in order to properly induce a kernel. While it is not straightforward to verify these conditions for the OT cost in Eq. (2), this is not necessary for the k-NN classifier, which requires looser conditions on J_Γ.
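Since the k-NN classifier only needs a ranking of the pairwise costs, not a metric, it can be run directly on a precomputed matrix of J_Γ values. A minimal hand-rolled sketch (illustrative names and toy costs, not the paper's code):

```python
import numpy as np

def knn_predict(cost, train_labels, k):
    """Assign each test sample the majority label among its k lowest-cost
    train samples; cost[t, s] holds J_Gamma between test t and train s."""
    preds = []
    for row in cost:
        votes = train_labels[np.argsort(row)[:k]]        # k nearest by cost
        labels, counts = np.unique(votes, return_counts=True)
        preds.append(labels[np.argmax(counts)])          # majority vote
    return np.array(preds)

# Toy cost matrix: each test image is cheap to transport to its own class.
cost = np.array([[0.1, 0.2, 0.9, 0.8],
                 [0.9, 0.8, 0.1, 0.2]])
train_labels = np.array([0, 0, 1, 1])
preds = knn_predict(cost, train_labels, k=3)   # -> [0, 1]
```

Only the ordering of the costs enters the prediction, which is why symmetry and the triangle inequality are not required of J_Γ.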
We compare the classification accuracy of our model against: (i) the Sinkhorn algorithm [19,40] (utilizing the more stable scheme proposed in [41]); (ii) a unicommodity dynamics executed on grayscale images, i.e., with color information compressed into one single commodity (M = 1); (iii) the Sinkhorn algorithm on grayscale images. All methods are tested on two datasets: the Jena Flowers 30 Dataset (JF30) [42] and the Fruit Dataset (FD) of [43]. The first consists of 1,479 images of 30 wild-flowering angiosperms (flowers), labeled with their species; the goal of the classification task is to infer the species. The second dataset contains 163 images of 15 fruit types; here we want to classify the fruit type. The parameters of the OT problem setup (θ and τ), as well as the regularization ones (β, and ε, which enforces the entropic barrier in the Sinkhorn algorithm [19]), have been cross-validated for both datasets (see Sections III and IV in Supplementary Material). All methods are then tested in their optimal configurations (see Supplementary Material for implementation details). Classification results are shown in Table I. In all cases, leveraging colors leads to higher accuracy (about an 8% increase) with respect to classification performed on grayscale images. This signals that, in the datasets under consideration, color information is a relevant feature for differentiating image samples. Remarkably, we obtain a similar increase in performance (about 7-8%) on both colored datasets when comparing our multicommodity dynamics against Sinkhorn. As the two algorithms use the same (colored) input, we can attribute this increment to the effective usage of color that our approach is capable of.
In addition, analyzing the results in more detail, we first observe that on JF30 all methods perform best when θ = 0.25, i.e., when 25% of the information used to build C comes from colors. This trend does not recur on FD, where both dynamics favor θ = 0 (Euclidean C). Hence, our model is able to leverage color information via the multicommodity OT dynamical formulation.
Second, on JF30 both dynamics perform best with τ = 0.125, contrary to the Sinkhorn-based methods, which prefer τ = 0.05. Thus, Sinkhorn's classification accuracy is negatively affected both by low τ (many edges of the transport network are cut) and by large τ (noisy color information is used to build C). We do not observe this behavior in our model, where trimming fewer edges is advantageous. All optimal values of τ are lower on FD, since the color distributions of this dataset are naturally light-tailed (see Supplementary Material).
Lastly, we investigate the interplay between θ and β. We notice that θ = 0 (FD) corresponds to a higher β = 1.5. Instead, for larger θ = 0.25 (JF30), the model prefers lower β (β = 1 and 1.25 for the multicommodity and the unicommodity dynamics, respectively). In the former case (θ = 0, so C_ij is the Euclidean distance), the cost is zero for pixels with the same locations; thus, consolidation of transport paths (large β) is favored on cheap links. Instead, increasing θ leads to more edges with comparable costs, as colors distribute smoothly over images. In this second scenario, better performance is achieved with distributed transport paths, i.e. lower β (see Supplementary Material).

B. Performance in terms of sensitivity
We assess the effectiveness of our method against benchmarks by comparing the sensitivity of our multicommodity dynamics with that of the Sinkhorn algorithm on the colored JF30 dataset. Specifically, we set all algorithms' parameters to their best configurations, as shown in Table I. Then, for each of the 30 classes in JF30 we compute its one-vs-all sensitivity, i.e. the true positive rate. This is defined for any class c as

S(c) = TP(c) / (TP(c) + FN(c)) ,    (7)

where TP(c) is the number of true positives, i.e. the number of images in c that are correctly classified, and FN(c) is the number of false negatives, i.e. the number of c-samples that are assigned a label different from c. Hence, Eq. (7) returns the probability that a sample is assigned label c, given that it belongs to c. We find that our method robustly outperforms Sinkhorn. Specifically, the multicommodity dynamics has the highest sensitivity 50% of the time (15 classes out of the total 30), as shown in Fig. 2. In 9 cases Sinkhorn has higher sensitivity, and for 6 classes both methods give the same values of S. Furthermore, we find that in 2/3 (20 out of 30) of the classes the multicommodity dynamics returns S(c) ≥ 1/2. This means that our model predicts the correct label more than 50% of the time. In only 3 of these 20 cases does Sinkhorn attain higher values of S, while in most instances where Sinkhorn outperforms our method it has low sensitivity, S < 1/2; these are classes where both methods have difficulties in distinguishing images.
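Eq. (7) amounts to a per-class recall and is straightforward to compute from predicted labels. A small sketch with toy labels (illustrative values, not the paper's data):

```python
import numpy as np

def sensitivity(y_true, y_pred, c):
    """One-vs-all sensitivity S(c) = TP(c) / (TP(c) + FN(c))."""
    in_c = (y_true == c)
    tp = np.sum(in_c & (y_pred == c))   # class-c samples classified as c
    fn = np.sum(in_c & (y_pred != c))   # class-c samples given another label
    return tp / (tp + fn)

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])
s0 = sensitivity(y_true, y_pred, 0)    # 2 of 3 class-0 samples recovered -> 2/3
```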

C. The impact of colors
To further assess the importance of leveraging color information, we conduct three different experiments highlighting, both qualitatively and quantitatively, performance differences between the unicommodity and the multicommodity approaches. As the two share the same principled OT-based dynamics, with the main difference being that the multicommodity one does not compress the color information, we can use this analysis to better understand how fully exploiting color information drives better classification.

FIG. 2. Sinkhorn RGB (red triangles). Markers are sorted in descending order of S, regardless of the method. Background colors are blue, red, and gray when S is higher for the multicommodity method, for Sinkhorn, or for neither, respectively. In green we plot frequency bars for all classes in the test set.
Experiment 1: landscape of the optimal cost. Here we focus on a qualitative comparison of the cost landscapes obtained with the two approaches. We consider an individual image taken from the test set of FD and plot the landscape of the optimal costs J_Γ obtained comparing it against the train set. Results for the multicommodity dynamics (M = 3) and for the unicommodity dynamics (M = 1) on grayscale images are in Fig. 3. Here, we highlight the 5 lowest values of the cost and mark them in green if they correspond to correctly classified train samples, and in red otherwise. While at first glance one may conclude that the performance is identical (both dynamics classify 3 samples out of 5 correctly), we notice how the multicommodity dynamics consistently clusters the correct samples at the bottom of the cost landscape, thus ranking them in a better order. This may explain why the cross-validated best value of k (the number of nearest neighbors in the k-NN classifier) is higher for the unicommodity methods on this dataset. On a larger sample of data, this results in better overall classification performance, as shown in Table I.
Experiment 2: controlling for shape. We further probe this tendency with a second experiment where we select a subset of FD composed of images belonging to three classes of fruits that have similar shapes but different colors: red apples, orange apricots, and yellow melons. As we expect shape to be less informative than color in this custom set, we can assess the extent to which color plays a crucial role in the classification process. Specifically, the test set is made of three random samples, each drawn from one of these classes (top row of the rightmost panel in Fig. 3), while the train set contains the remaining instances of the classes. We plot the cost landscape J_Γ for the train set and draw in red, orange, and yellow the values of J_Γ corresponding to samples compared against the test apple, apricot, and melon, respectively. We also sort the train samples so that they are grouped in three regions (highlighted by the background color in Fig. 3), corresponding to train melons, apricots, and apples. With this construction, if the minimum cost among the yellow markers falls in the yellow region, it corresponds to a correctly classified sample (resp. for orange and red). We further mark the yellow, orange, and red minima in green if test and train labels correspond, i.e. the markers' and background's colors are the same, and in red otherwise. Train and test samples are also shown in Fig. 3. The multicommodity dynamics correctly labels each test image. In contrast, the unicommodity dynamics fails at this task, labeling a melon as an apricot. This suggests that the multicommodity approach is able to use color information in datasets where this feature is more informative than others, e.g. shape.
Experiment 3: when shape matters. Having shown results on a custom dataset where shape was controlled to matter less, we now do the opposite and select a dataset where this feature should be more informative. The goal is to assess whether the multicommodity approach helps in this case as well, as its main input information may no longer be as relevant. Specifically, we select as a test sample a cherry, whose form is arguably distinguishable from that of many other fruits in the dataset. One can expect that comparing it against the train set of FD will result in both the unicommodity and the multicommodity dynamics assigning low J_Γ to train cherries, and higher costs to other fruits. This intuition is confirmed by the results in Fig. 4. Here, train cherries (in green) strongly cluster in the lower portion of the cost landscape, whereas all the other fruits have higher costs. In Fig. 4 we also plot some of the correctly classified train samples. These results suggest that when color information is negligible compared to another type of information (e.g. shape), the unicommodity and multicommodity formulations perform similarly. In light of this, we reinforce the claim that our multicommodity formulation can boost classification in contexts where color information does matter, but may not give any advantage when other types of information are more informative. We encourage practitioners to evaluate when this is the case based on domain knowledge, when available.

FIG. 3. Evaluating the effect of colors. Experiment 1: the top black-framed image is the one to be classified. Predictions given by the multicommodity and the unicommodity dynamics (those with lower J_Γ) are shown on the right side of the panel, displayed in sorted fashion from worst to best (bottom to top). Experiment 2: the top right samples are the three test images to be classified. The middle and bottom rows are the predictions given by the two dynamics. Markers, backgrounds, and test images share the color code: red for apples, orange for apricots, yellow for melons. In both panels: green circles and red crosses highlight correctly classified and misclassified images, respectively. All algorithms are executed with their optimal configurations in Table I.

V. CONCLUSION
We propose a physics-informed multicommodity OT formulation for effectively using color information to improve image classification. We model colors as immiscible flows traveling on a capacitated network and propose equations for their dynamics, with the goal of optimizing the flow distribution on edges. Color flows are regulated by a shared conductivity and minimize a unique cost function. Thresholding the ground cost as in [22,23] makes our model computationally efficient.
We outperform other OT-based approaches, such as the Sinkhorn algorithm, on two datasets where color matters. Our model also assigns lower costs to correctly classified images than its unicommodity counterpart, and it is more robust on datasets where items have similar shapes, and thus color information is distinctly relevant. We note that, for some datasets, color information may not matter, as when another type of information (e.g. shape) has stronger discriminative power. However, while we focused here on color channels as the different commodities of our formulation, the ideas of this paper can be extended to scenarios where other relevant information can be distinguished into different types. For instance, one could combine several features together, e.g. colors, contours, and objects' orientations, when available.
Our model can be further improved. While it uses the thresholding of [22,23] to speed up convergence (as mentioned in Section III A), it is still slower than Sinkhorn-based methods. Hence, investigating approaches aimed at improving its computational performance is an important direction for future work. Speed-ups can be achieved, for example, with the implementation of [39], where the unicommodity OT problem on sparse topologies is solved in O(|E|^0.36) time steps. This bound has been found using a backward Euler scheme combined with the inexact Newton-Raphson method for the update of x, and solving Kirchhoff's law with an algebraic multigrid method [44].
Our main goal is to frame an image classification task as that of finding optimal flows of mass of different types on networks built from images. We follow physics principles to assess whether treating colors as immiscible flows can give an advantage over other standard OT-based methods that do not incorporate such insights. The increased classification performance observed in our experiments motivates the integration of similar ideas into deep network architectures [45] as a relevant avenue for future work. Combining their prediction capabilities with our insights on how to better exploit the various facets of the input data has the potential to push the performance of deep classifiers even further. For example, one could extend the state-of-the-art architecture of Eisenberger et al. [45], which efficiently computes implicit gradients for generic Sinkhorn layers within a neural network, by including edge, shape, and contour information for the computation of Wasserstein barycenters or for image clustering.

CODE AND DATA AVAILABILITY STATEMENT
To facilitate practitioners in using our algorithms, we make our Python code publicly available [46]. The Jena Flowers 30 Dataset (JF30) and the Fruit Dataset (FD) used for this study can be found in [42] and [43], respectively.

Immiscible Color Flows in Optimal Transport Networks for Image Classification: Supplemental Material (SM)
Alessandro Lonardi, Diego Baptista, and Caterina De Bacco

Here, we extensively explain the procedure used to construct the transport network K. All the essential steps are schematically drawn in Fig. S1, and are as follows.
Step 1. Initially, a pair of images (Image 1, Image 2) is given as a couple of multidimensional arrays of dimensions (w_i, h_i, M = 3), with i = 1, 2. We denote by w the images' widths and by h their heights. The third dimension has size M = 3 and corresponds to the three RGB color channels. We then assign a cost to each edge of the complete bipartite graph between the two images, using both the information given by pixels' locations and that given by the images' colors. In particular, we define:

C_e = (1 − θ) X_e + θ Y_e ,    (S1)

X_e = ||v_i − v_j||_2 ,    (S2)

Y_e = ||G_i − H_j||_1 ,    (S3)

where X_e accounts for the geometric distance between pixels and Y_e contributes with colors to edges' costs. We model the effect of colors by taking, in Eq. (S3), the 1-norm between the arrays G_i and H_j, containing the RGB intensities in i (pixel of Image 1) and j (pixel of Image 2). Both X_e and Y_e have been opportunely rescaled to the range [0, 1]. Lastly, we use the scalar parameter 0 ≤ θ ≤ 1 to weigh Y_e and X_e in a convex combination, in Eq. (S1).
Step 2. Once Step 1 is complete and a cost C_e is assigned to each edge of the complete bipartite graph between the two images, we implement a trimming procedure similar to that of [22,23] to cut highly expensive links. In particular, we trim all edges e with cost C_e > τ, where τ > 0 is a threshold fixed a priori. The links between V1 and V2 that do not get cut make up the set E12. We then add a first transshipment node, u1, to the network, and connect it with m + n links to the sets V1 and V2. Each transshipment link is assigned a fixed cost C_e = τ/2. This implies that one needs to pay a total cost of τ to transport mass from a node of Image 1 (in V1) to one of Image 2 (in V2) when traversing transshipment links.
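Steps 1-2 can be sketched as follows (illustrative function name and toy cost matrix, not the paper's code): edges below the threshold form E12, and every pixel node gains a transshipment edge of cost τ/2 to u1.

```python
import numpy as np

def build_network(C, tau):
    """Keep bipartite edges cheaper than tau (the set E12) and connect every
    pixel node to one auxiliary transshipment node u1 at cost tau / 2."""
    m, n = C.shape
    u1 = m + n                                   # ids: 0..m-1 (V1), m..m+n-1 (V2), u1
    E12 = [(i, m + j, C[i, j])
           for i in range(m) for j in range(n) if C[i, j] < tau]
    E_aux = [(v, u1, tau / 2) for v in range(m + n)]
    return E12, E_aux

C = np.array([[0.02, 0.30],
              [0.40, 0.05]])
E12, E_aux = build_network(C, tau=0.125)   # 2 cheap edges kept, 4 transshipment edges
```

Routing one unit of mass through u1 traverses two transshipment edges and thus costs exactly τ, matching the saturation of the trimmed direct edges.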
There are several benefits in thresholding the cost: (i) from a purely intuitive standpoint, humans perceive distances as saturated distances [47]; (ii) many natural color distributions are noisy and heavy-tailed, thus thresholding makes it possible to assign a fixed cost to outliers; (iii) thresholded distances induce a W_1 distance between distributions in standard unicommodity OT problems [23]. More practically, thresholding improves both the accuracy and the speed of OT [23] (see also the Computational Cost Section in this SM).
Step 3. The last step required to obtain K is the introduction of a second auxiliary node, u_2, together with its edges, to relax mass balance. In detail, in a standard OT setting Σ_i G_ia = Σ_j H_ja = Λ_a > 0 holds ∀a = 1, . . ., M, i.e., the two histograms to be transported belong to the same simplex of mass Λ_a > 0. We relax this constraint by permitting Σ_i G_ia ≠ Σ_j H_ja and penalizing Eq. (1) (main text). In particular, we use a relaxation similar to that of [23], which we generalize to the multicommodity setup by minimizing the relaxed objective

J_Γ + α max_{e∈E_12} C_e Σ_a |Σ_j H_ja − Σ_i G_ia| . (S4)

The intuition of Eq. (S4) is that the OT problem is penalized proportionally to the net difference between the inflowing and the outflowing mass. Hence, two images whose colors strongly differ return a higher cost and, in a supervised classification task, are less likely to be assigned the same label. We fix α = 1/2 as in [22].
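A minimal sketch of the penalization term of Eq. (S4) follows. The function name is ours, and `C_kept` stands for the costs of the surviving edges in E_12.

```python
import numpy as np

def mass_relaxation_penalty(G, H, C_kept, alpha=0.5):
    """Penalty added to the OT cost when the two tensors do not carry the
    same total mass per channel (our reading of Eq. (S4)):
        alpha * max_e C_e * sum_a |sum_j H_ja - sum_i G_ia| .

    G: (m, M) and H: (n, M) intensity tensors; C_kept: costs of edges in E_12.
    """
    # 1-norm over commodities of the per-channel mass imbalance
    excess = np.abs(H.sum(axis=0) - G.sum(axis=0)).sum()
    return alpha * C_kept.max() * excess
```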
This penalization can be translated to the transport network with the addition of n links, each costing c = α max_{e∈E_12} C_e = max_{e∈E_12} C_e / 2, connected to u_2. The excess of mass m^a = Σ_j H_ja − Σ_i G_ia given by each commodity is injected in u_2 to guarantee that the whole system is isolated. With this expedient, one recovers exactly the relaxed OT formulation in Eq. (S4). In fact, all the transport paths that do not flow into one of the n nodes of Image 2 penalize the cost by traversing the edges connected to u_2. From conservation of mass one can see that these transport paths satisfy Σ_j P^a_{j u_2} = |m^a| for each commodity a. Thus, summing over a returns exactly Σ_{a,j} P^a_{j u_2} = ||Σ_j H_j − Σ_i G_i||_1, with the 1-norm taken over the commodities. This is precisely the penalization we introduced in Eq. (S4).

II. EQUIVALENCE BETWEEN MULTICOMMODITY DYNAMICS AND OT SETUP
With the following derivations (similar to [32,33]), we show that asymptotic solutions of Eqs. (3)-(4) (main text) are equivalent to minimizers of Eq. (1) (main text). This implies that by solving the multicommodity dynamics we find a solution of the multicommodity OT minimization problem. More practically, for a given pair of images, running a numerical scheme on Eqs. (3)-(4) (main text) allows us to compute lim_{t→∞} P(t) = P*, hence J_Γ* = J_Γ|_{P=P*}, and use the latter as a measure of similarity between them.
More in detail, we first demonstrate the equivalence between stationary solutions of the multicommodity dynamics and minimizers of the multicommodity OT problem by introducing a second, accessory minimization problem. Stationary solutions are proven to be asymptotes of Eq. (4) (main text) only afterwards, with the introduction of a Lyapunov functional for the multicommodity dynamics.

A. Stationary solutions of the multicommodity dynamics and OT minimizers
Initially, we observe that stationary solutions of the multicommodity dynamics satisfy the scaling relation

x_e = ||P_e||_2^{2/(1+γ)} , (S5)

that one can derive setting the left-hand side of Eq. (4) (main text) to zero, defining P^a_e = x_e (φ^a_i − φ^a_j)/C_e for e = (i, j), and γ = 2 − β. We recover a scaling identical to Eq. (S5) by introducing the following auxiliary constrained minimization problem:

min_{x,P} J + W_γ , (S6)
s.t. Σ_e B_ie P^a_e = S^a_i ∀i, a , (S7)

where J = Σ_e Σ_a C_e (P^a_e)^2 / x_e is the energy dissipated by the flows, and W_γ = Σ_e C_e x_e^γ / γ is the cost needed to build the network infrastructure. The constraints in Eq. (S7), identical to Eq. (3) (main text), are equivalent to Kirchhoff's law, enforcing conservation of mass. Minimizing Eq. (S6) with respect to x_e yields

x_e = ||P_e||_2^{2/(1+γ)} , (S8)

identical to the scaling in Eq. (S5).
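The derivation of the scaling in Eq. (S5) can be sketched as follows, assuming the dynamics of Eq. (4) (main text) takes the gradient-flow form used in [32,33]; the explicit normalization below is our assumption, not a quote of the main text.

```latex
% Stationarity of the conductivity dynamics (assumed form):
\dot{x}_e \;=\; x_e^{\beta}\,\frac{\sum_a \left(\phi_i^a-\phi_j^a\right)^2}{C_e^{2}} \;-\; x_e \;=\; 0
\quad\Longrightarrow\quad
x_e^{\,1-\beta} \;=\; \frac{\sum_a \left(\phi_i^a-\phi_j^a\right)^2}{C_e^{2}} .
% Substituting P_e^a = x_e(\phi_i^a-\phi_j^a)/C_e and \gamma = 2-\beta:
\lVert P_e\rVert_2^{2}
= x_e^{2}\,\frac{\sum_a \left(\phi_i^a-\phi_j^a\right)^2}{C_e^{2}}
= x_e^{\,3-\beta}
= x_e^{\,1+\gamma}
\quad\Longrightarrow\quad
x_e=\lVert P_e\rVert_2^{\,2/(1+\gamma)} .
```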
Most remarkably, the scaling of Eq. (S8) can also be recast in Eq. (S6) to find that J_Γ = J + W_γ (neglecting multiplicative constants). This connects the multicommodity dynamics with the objective function of Eq. (1) (main text); the explicit computation is carried out in Eqs. (S9)-(S10). To complete the mapping between the multicommodity dynamics and the minimization setup, we show that the space of transport tensors Π(G, H) is exactly the same space defined by Eq. (S7). This can be seen with the chain of equalities in Eqs. (S13)-(S15): we take the difference between the OT constraints of Π(G, H) in Eq. (S13), we then use the definition of S in Eq. (S14), and compact the plus and minus signs using the signed incidence matrix B in Eq. (S15). This allows us to recover Kirchhoff's law as formulated in Eq. (S7) and Eq. (3) (main text).

B. Multicommodity dynamics asymptotes: Lyapunov functional
We complete our discussion by introducing the Lyapunov functional for Eq. (4) (main text) proposed in [32,33]. The functional reads

L_γ[x] = (1/2) Σ_a Σ_i φ^a_i[x] S^a_i + (1/(2γ)) Σ_e C_e x_e^γ , (S16)

and it is a multicommodity generalization of that originally introduced in [34]. This is a well-defined Lyapunov functional for the multicommodity dynamics: in fact, along a curve x(t) solution of Eq. (4) (main text),

dL_γ[x(t)]/dt ≤ 0 , (S17)

with the equality satisfied if and only if x(t) is a stationary point of Eq. (4) (main text). This can be shown as follows.
We claim that

∂/∂x_e [ Σ_a Σ_i φ^a_i S^a_i ] = − Σ_a (φ^a_i − φ^a_j)^2 / C_e , e = (i, j) . (S18)

This equality can be retrieved by differentiating both sides of Eq. (3) (main text) with respect to x_e, thus obtaining Eq. (S20). Then, multiplying Eq. (S20) by φ^a_i and summing over i one gets Eq. (S21); further summing over a yields Eq. (S22), where in the left-hand side of Eq. (S22) we used Eq. (3) (main text). From Eq. (S22), the equality in Eq. (S18) follows immediately. Now, thanks to Eq. (S18), we can prove that the Lie derivative of the functional is less than or equal to zero, as in Eqs. (S23)-(S25), with the equality in Eq. (S25) recovered if and only if (i) x_e = 0, or (ii) the scaling in Eq. (S8) holds. Finally, we show that the Lyapunov functional is identical to the total sum of dissipation and transport cost, i.e., L_γ = J + W_γ. This can be done by multiplying both sides of Eq. (3) (main text) by φ^a_i and then summing over i and a, as in Eq. (S26), where we used P^a_e = x_e (φ^a_i − φ^a_j)/C_e, for e = (i, j). This allows us to conclude. In summary, we showed that the multicommodity dynamics admits a well-defined Lyapunov functional, which is equivalent to the sum of a dissipation and an infrastructure cost. These two contributions, which are jointly minimized by Eq. (4) (main text), when evaluated along their minimizers correspond to the multicommodity OT cost J_Γ of Eq. (1) (main text). Introducing the Lyapunov functional is crucial to formally show that asymptotics of the dynamics are equivalent to minimizers of the cost, namely lim_{t→∞} P(t) = P*.
Lastly, we remark on the effect of γ (resp. β) on the minimization problem. In the setting where γ > 1 (β < 1), the functional L_γ is convex, with one unique minimizer. For γ < 1 (β > 1), the functional landscape becomes rugged and strongly non-convex, with multiple minimizers, each corresponding to a local minimum of the cost. Hence, in this second scenario, running Eq. (4) (main text) makes the dynamics converge to a stationary point, which however may not be the global minimum.

III. CROSS-VALIDATION: FLOWERS DATASET
We perform a 4-fold cross-validation on both the parameters used for the construction of the ground cost, θ and τ, and on the algorithms' regularization parameters, β and ε. We briefly summarize it in this section.
The JF30 Dataset [42] is made of 1,479 elements, divided in 30 classes. First, we separate it into two subsets, train and test, with classes' frequencies being the same in these subsets as in the entire dataset. To cross-validate our methods, we further separate the train set into 4 folds of equal size, each to be used in turn as a validation set. More in detail, each experiment is executed by fixing the validation fold and an image belonging to it; then, the Optimal Transport costs J_Γ between such an image and all the other images in the train set (made of the other three folds) are calculated. This procedure is repeated for all images in the validation set, swapping each of the 4 train folds as validation set. We use a k-nearest neighbors classifier over J_Γ to assign to an image in the validation set its label; that is, for each validation image we consider the k train samples with lowest J_Γ, and label the validation sample with the most frequent class among these k. This allows us to calculate the classification accuracy of a given fold, and then to average the accuracy over the 4 permutations of the validation and train sets. The total number of experiments we ran in order to cross-validate the model is approximately 50,000.
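The labeling rule above can be sketched as follows, assuming the OT costs J_Γ have already been collected into a matrix; the function name and the majority-vote tie-breaking (first label in sorted order) are ours.

```python
import numpy as np

def knn_predict(J, train_labels, k):
    """k-nearest-neighbor labeling over OT costs: J[v, t] is the cost
    J_Gamma between validation image v and train image t; each validation
    image receives the most frequent label among the k cheapest train images."""
    preds = []
    for row in J:
        nearest = np.argsort(row)[:k]            # k train samples with lowest cost
        labels, counts = np.unique(train_labels[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # majority vote
    return np.array(preds)
```

The same classification could equivalently be run through scikit-learn's KNeighborsClassifier with metric='precomputed'.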
Results are shown in Fig. S2. Noticeably, the accuracy monotonically increases (resp. decreases) with β for a fixed value of θ, namely θ = 0 (resp. θ = 0.25). This can be attributed to the fact that, when no color information is taken into account in the construction of the ground metric (θ = 0), it is more advantageous to consolidate transport paths on cheap edges corresponding to pixels whose positions are close, thus choosing a larger β. On the other hand, introducing colors in C (θ > 0), and thus creating a more disordered ground cost matrix, favors distributing transport paths over the network (see the Model Interpretability Section in this SM).
Remarkably, τ also has an impact on the classification accuracy of our algorithms: the larger we set its value (thus trimming fewer edges from the transport network), the more accurate the classification becomes. This behavior is evidently different for the Sinkhorn algorithm, as explained below.
Cross-validation of the Sinkhorn algorithm is taken a step further. Motivated by the classification accuracy drop observed in Fig. S2, we refine the grid of values of τ. Notice that both Sinkhorn GS and Sinkhorn RGB return bell-shaped curves when changing τ. In particular, low classification accuracy is attained when strongly reducing τ, as well as when the trimming threshold is high (approximately τ ≥ 0.5). In the first case, many elements of the ground cost matrix are cut, and not enough information is retained to discriminate between images. The optimal parameters are (θ, τ, ε) = (0.25, 0.05, 100) (see Fig. S5).

IV. EXPERIMENTAL DETAILS: FRUITS DATASET
Here, we describe in detail the experimental setup designed for the Fruit Dataset (FD) [43]. FD consists of 163 images of 15 fruit types. We split the whole dataset into train and test sets, with 70% and 30% of the available images, respectively. As for the other dataset, classes' frequencies are the same in these subsets as in the entire dataset. Given the rather small size of this dataset, we directly perform classification comparing train and test sets.

V. IMAGE PREPROCESSING
The elements of both datasets are processed in the following way. First, each image is coarsened with an average pooling; the only input needed for this step is the size of the square mask, ms. Its stride is in fact set to stride = ms, and the padding to pad = 0. All images were conveniently trimmed so that both their widths and heights are divisible by the pooling mask size. We set ms = 40 for JF30, and ms = 30 for FD. Furthermore, we smooth the images using a Gaussian filter on each color channel, with standard deviation σ = 0.5.
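The pooling step can be sketched as follows; this is a minimal numpy version (the function name is ours), relying on the stated guarantee that width and height are divisible by ms.

```python
import numpy as np

def average_pool(img, ms):
    """Average pooling with a square mask of size ms, stride = ms, pad = 0.

    img: (w, h, M) array with w and h divisible by ms (images are trimmed
    beforehand so that this holds). Returns a (w // ms, h // ms, M) array."""
    w, h, M = img.shape
    # split each spatial axis into (blocks, ms) and average within blocks
    return img.reshape(w // ms, ms, h // ms, ms, M).mean(axis=(1, 3))
```

The subsequent smoothing can be done per channel with, e.g., scipy.ndimage.gaussian_filter with sigma = 0.5.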
Moreover, to convert colored images into grayscale ones, which are given as input to Sinkhorn GS and to our unicommodity dynamics (M = 1), we preprocess them as follows. Let (R, G, B) be the three color channels composing each pixel of a colored image; these are converted into a unique channel (its grayscale counterpart), whose intensity I is calculated with the weighted sum I = 0.2125R + 0.7154G + 0.0721B. The weights correspond to those used by cathode-ray tube (CRT) phosphors, as they are more suitable to represent the human perception of red, green and blue than equally valued weights [48].
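This conversion amounts to a single dot product per pixel; a minimal sketch (the function name is ours):

```python
import numpy as np

def to_grayscale(img):
    """Luminance conversion with the CRT weights quoted in the text:
    I = 0.2125 R + 0.7154 G + 0.0721 B.

    img: (..., 3) array of RGB values; returns an array of shape (...)."""
    weights = np.array([0.2125, 0.7154, 0.0721])  # sums to 1
    return img @ weights
```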

A. Color distributions of images
As shown in Table I (main text), for the multicommodity and the unicommodity dynamics, optimal values of the trimming threshold τ are much lower for the Fruit Dataset [43] than for the Jena Flowers 30 Dataset [42]. This can be attributed to the fact that the color distributions of fruits, belonging to the first dataset, are drastically light-tailed compared to those of flowers in the second dataset. Thus, the cost C is naturally noisier in the latter case, and a larger trimming is necessary to remove such noise from classification. Most of the noise in pictures of flowers comes from the background: while all flowers are photographed in nature, fruits are depicted on a white background. This can be seen in Fig. S6.

VI. SINKHORN BENCHMARKS
In our experiments, we compare the multicommodity and unicommodity dynamics against the Sinkhorn algorithm, popularized by the seminal work of [19]. The idea of Sinkhorn is to regularize the standard OT problem by adding an entropic barrier to the cost function. More in detail, and following the notation adopted in our manuscript, the minimization problem proposed in [19] is:

min_P Σ_ij C_ij P_ij − ε h(P) , s.t. Σ_j P_ij = g_i ∀i , Σ_i P_ij = h_j ∀j , (S28)

with ε > 0 a regularization parameter. Here the transport paths P, which generally lie in the polyhedral set described by the constraints Σ_j P_ij = g_i ∀i and Σ_i P_ij = h_j ∀j, are smoothed by the entropy h(P). This trick makes the optimization problem strictly convex, and permits solving it with a very efficient matrix scaling algorithm, Sinkhorn's fixed-point iteration.
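The fixed-point iteration can be sketched as below. This is the textbook alternating-scaling scheme, not the benchmarked implementation; names, the fixed iteration count, and the absence of log-domain stabilization (needed for very small ε) are ours.

```python
import numpy as np

def sinkhorn(C, g, h, eps, n_iter=500):
    """Entropy-regularized OT via alternating matrix scaling: returns
    transport paths P approximately minimizing <C, P> - eps * h(P)
    subject to the marginals P @ 1 = g and P.T @ 1 = h."""
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(g)
    for _ in range(n_iter):
        v = h / (K.T @ u)         # enforce column marginals
        u = g / (K @ v)           # enforce row marginals
    return u[:, None] * K * v[None, :]
```

Larger ε gives smoother (more spread-out) transport paths; as ε shrinks, P approaches an unregularized OT solution.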
We generalize the problem in Eq. (S28) in order to take into account transport tensors, G and H, which carry the information of multiple color channels, and transport paths P. In detail, we propose the minimization problem of Eq. (S29) for each commodity (color channel) a.

VII. COMPUTATIONAL COST

An ε-approximation of the 1-Wasserstein distance between two probability distributions can be computed in O(|V|^2/ε^2) arithmetic operations [26]. Recently, an Adaptive Primal-Dual Accelerated Mirror Descent (APDAMD) scheme with complexity O(min{|V|^{9/4}/ε, |V|^2/ε^2}) for the same ε-perturbed problem has been presented in [27].
In principle, our multicommodity method has a computational complexity of order O(M|V|^2) for complete transport graph topologies, i.e., when edges in the transport network K are assigned to all pixels' pairs. Nonetheless, we achieve a substantial decrease in complexity by sparsifying the graph with the trimming procedure of [22,23]. Similarly to [22], the final complexity of our algorithm is O(M|V|). This improvement can be formally justified as follows. We start from a complete bipartite graph with |E| = |V|^2/4 (for simplicity, m = n = |V|/2 is assumed). First, we trim expensive links, reducing the number of edges of the transport network to ⟨K⟩|V| + |V|, where ⟨K⟩ is the average number of edges connected to a node that are not trimmed by τ, and the second term |V| counts the number of inflowing and outflowing transshipment links. Second, we add |V|/2 links to the transport network to enforce the Kirchhoff's law penalization, so that the final number of links amounts to |E| = |V|(⟨K⟩ + 3/2), which is linear with respect to the number of nodes.
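The edge bookkeeping above can be made concrete as follows (a toy count with our function name, using the network layout of Section I: kept bipartite edges, m + n transshipment links to u_1, and n relaxation links to u_2):

```python
import numpy as np

def count_edges(C, tau):
    """Number of edges of the trimmed transport network K:
    surviving bipartite edges + (m + n) links to u1 + n links to u2."""
    m, n = C.shape
    kept = int((C <= tau).sum())
    return kept + (m + n) + n
```

With m = n = |V|/2 and an average of ⟨K⟩ surviving edges per node, this total matches the |V|(⟨K⟩ + 3/2) count quoted in the text.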
Additionally, it is shown in [39] that, for β = 1, the Optimal Transport paths of the unicommodity OT problem on sparse topologies can be recovered with z time steps as in Eq. (3) (main text), with O(1) < z < O(|E|^{0.36}). This bound has been found using a backward Euler scheme combined with the inexact Newton-Raphson method for the update of x, and solving Kirchhoff's law using an algebraic multigrid method.

FIG. 2. Sensitivity on the JF30 dataset. Sensitivity values are shown for the multicommodity dynamics (blue circles) and for Sinkhorn RGB (red triangles). Markers are sorted in descending order of S, regardless of the method. Background colors are blue, red, and gray when S is higher for the multicommodity method, for Sinkhorn, or for neither of them, respectively. In green we plot frequency bars for all classes in the test set.

FIG. 4. Evaluating the importance of colors: when shapes matter most. Experiment 3: the top black-framed image is the one to be classified. The best 3 (out of 10) predictions returned by the two dynamics are shown on the right. We mark with green circles training samples belonging to the same class as the test image. All algorithms are executed with their optimal configurations in Table I.
FIG. S1. Detailed construction of transport networks. Step 1: conversion of colored images to tensors and construction of the first complete bipartite graph. Step 2: trimming of expensive edges and addition of the transshipment node, u_1, with its links (in brown). Step 3: relaxation of mass balance with the addition of the second auxiliary node, u_2, together with its links (in magenta).
FIG. S2. Cross-validation results for τ = 0.1. Figures are accuracy values obtained with the 4-fold cross-validation on JF30. Cells are colored with a darkest-to-brightest palette based on the accuracies. Subplots correspond to: (A) the multicommodity dynamics, (B) the unicommodity dynamics, (C) Sinkhorn on colored images, and (D) Sinkhorn on grayscale images.
FIG. S3. Cross-validation results for τ = 0.125. Figures are accuracy values obtained with the 4-fold cross-validation on JF30. Cells are colored with a darkest-to-brightest palette based on the accuracies. Subplots correspond to: (A) the multicommodity dynamics, (B) the unicommodity dynamics, (C) Sinkhorn on colored images, and (D) Sinkhorn on grayscale images.
FIG. S4. Sinkhorn's cross-validation results varying τ. Subplots correspond to: (A) Sinkhorn GS, (B) Sinkhorn RGB. In each subplot, circular markers correspond to the accuracy values of each fold, while squares and bars represent their averages and standard deviations. In the insets, we refine the grid of τ in an interval of interest, where the classification accuracy is peaked.
FIG. S5. Refined cross-validation results for Sinkhorn GS and τ = 0.05. Figures are accuracy values obtained with the 4-fold cross-validation on JF30. Cells are colored with a darkest-to-brightest palette based on the accuracies.
FIG. S6. (A)-(D): In subplot (A) we show four images randomly sampled from the Fruit Dataset, in (C) four random samples of the Jena Flowers 30 Dataset. In (B) and (D) we plot the average color intensity of the RGB channels for 100 random samples belonging to the two datasets. Here, the histograms in (B) are relative to the fruits, those in (D) to the flowers. From the plots it can be clearly seen that the color distributions of Fig. S6 (B) are starkly peaked around (R, G, B) = (255, 255, 255) = white in standard RGB encoding.

FIG. S7. Effect of θ and τ on OT. On the left, we display the two samples used to build the ground costs C. Highlighted rows in blue and orange are those considered to extract G and H. On the right side of the panel we plot C for θ = {0, 0.25, 0.5, 0.75} and τ = {0.1, 0.25, 0.5, 1}. White regions correspond to trimmed values, i.e., entries of C that are larger than τ.
FIG. S8. Effect of β on OT. In the leftmost portion of the panel we plot Image 1 and Image 2, used in the OT problem. From these we extract the (3 × 3)-dimensional tensors G and H. These are drawn together with a heatmap of the cost C, and with the corresponding transport network. Color scales of edges and of entries of C are identical. We also use the same numbering and color scheme for tensors' entries, indices of C, and network nodes. Brown and magenta auxiliary nodes and edges are added after trimming, and after relaxing Kirchhoff's law. On the right side of the panel we plot the transport network again, but with edge thickness proportional to the Optimal Transport paths retrieved from Eqs. (3)-(4) (main text), and with colors corresponding to those of the commodities a. Node sizes are proportional to the values of g^a and h^a, for a = 1, . . ., 3.
taking values S^a_{u_1} = 0 and S^a_{u_2} = m^a on the auxiliary nodes. With L_ij[x] = Σ_e (x_e/C_e) B_ie B_je we denote the weighted Laplacian of K, where B is its signed incidence matrix; ∂i is the neighborhood of node i. Lastly, φ^a_i are scalar potentials acting on nodes, for a given commodity a.

FIG. 1. Bipartite network representation for multicommodity OT. The two images (shown in the leftmost and rightmost sides of the panel) are encoded in the RGB matrices G and H, which regulate the flow traveling on the network K. The graph is made of m + n + 2 nodes, i.e., the total number of pixels plus the two auxiliary vertices introduced in Section III A. Grey edges (belonging to the set E_12) connect nodes in Image 1 to nodes in Image 2; these edges are trimmed according to a threshold τ. We highlight the entries of the matrix C in red if these are larger than τ. Transshipment and auxiliary edges used to relax mass conservation are in brown and magenta.
Algorithm 1 (main text), remaining steps:
2: Initialize conductivities, e.g., x_e ∼ U(0, 1)
3: Construct a bipartite network K_{m,n} between G and H (complexity O(m · n))
4: Assign C_ij(θ, τ) = min{(1 − θ) ||v_i − v_j||_2 + θ ||G_i − H_j||_1, τ} to every edge (i, j) in K_{m,n}, as in Eq. (3)
5: Remove from K_{m,n} all edges s.t. C_ij > τ (complexity O(m + n))
6: Add u_1 to K_{m,n} and its m + n auxiliary links, each costing τ/2
7: Balance mass: add u_2, with inflowing mass m^a = Σ_j H_ja − Σ_i G_ia
8: while convergence is False do . . .

The color channels are flattened to obtain the tensors G and H, the first for Image 1, and the second for Image 2. In detail, each channel is vectorized to have dimension m × 1, with m = w_1 · h_1, for g^a (resp. n × 1, with n = w_2 · h_2, for h^a); these are the inflows and outflows of our multicommodity dynamics. In this way, the tensors G and H, which are obtained by stacking the g^a and h^a, have size m × M and n × M, respectively. Entries of G and H are in standard RGB encoding, hence they are integers ranging from 0 to 255.
To obtain the transport network K, we first generate a complete bipartite graph between m + n = |V_1| + |V_2| nodes: the first m = |V_1| are the pixels of Image 1, and the other n = |V_2| are those of Image 2.