Characterization of Diffusion Metric Map Similarity in Data From a Clinical Data Repository Using Histogram Distances

As the sharing of data is mandated by funding agencies and journals, reuse of data has become more prevalent. It has therefore become imperative to develop methods to characterize the similarity of data. While users can group data based on the acquisition parameters stored in the file headers, these parameters give no indication of whether a file can be combined with other data without increasing the variance in the data set. Methods have been implemented that characterize the signal-to-noise ratio or identify signal drop-outs in the raw image files, but potential users of data often have access only to calculated metric maps, and these are more difficult to characterize and compare. Here we describe a histogram-distance-based method applied to diffusion metric maps of fractional anisotropy and mean diffusivity that were generated using data extracted from a repository of clinically acquired MRI data. We describe the generation of the data set, the pitfalls specific to diffusion MRI data, and the results of the histogram-distance analysis. We find that, in general, data from GE scanners are less similar to one another than are data from Siemens scanners. We also find that the distribution of distance-metric values is not Gaussian at any selection of the acquisition parameters considered here (field strength, number of gradient directions, b-value, and vendor).

Here $b$ is the total number of bins in the set $x = x_0, x_1, \ldots, x_{b-1}$, $n$ is the total number of binned elements, and $H_i(X)$ is the frequency of histogram $X$ in bin $i$.
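As an illustration of this notation, the following Python/NumPy sketch builds a normalized histogram $H(X)$ from a masked metric map; the bin count, value range, and variable names are hypothetical choices for illustration, not those used in the study.

```python
import numpy as np

def normalized_histogram(values, bins=128, value_range=(0.0, 1.0)):
    # Bin the voxel values and normalize so the bin frequencies sum to one.
    counts, _ = np.histogram(np.ravel(values), bins=bins, range=value_range)
    return counts / counts.sum()

# Hypothetical usage with two masked FA maps, fa_x and fa_y:
# H_X = normalized_histogram(fa_x)
# H_Y = normalized_histogram(fa_y)
```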
The Canberra and Lorentzian metrics are variations of the L1-norm. The Canberra metric (Webb and Copsey, 2011) normalizes the absolute difference in each bin by the sum of the two bin values and is known to be sensitive to small changes near zero; the Lorentzian metric (Deza and Deza, 2012) is essentially the log of the L1-norm, with unity added to each term to ensure non-negativity and to avoid taking the log of zero.

Canberra:
$$d_{Can}(X,Y) = \sum_{i=0}^{b-1} \frac{\left| H_i(X) - H_i(Y) \right|}{H_i(X) + H_i(Y)} \quad (4a)$$

Lorentzian:
$$d_{Lor}(X,Y) = \sum_{i=0}^{b-1} \ln\!\left( 1 + \left| H_i(X) - H_i(Y) \right| \right) \quad (4b)$$
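A minimal sketch of these two L1-norm variants under the same assumptions as above; the small eps guard against empty bins in the Canberra denominator is an implementation assumption, not something specified here.

```python
import numpy as np

def canberra(H_X, H_Y, eps=1e-12):
    # Absolute bin differences normalized by the sum of the two bin values;
    # eps avoids division by zero when both bins are empty (illustrative choice).
    return np.sum(np.abs(H_X - H_Y) / np.maximum(H_X + H_Y, eps))

def lorentzian(H_X, H_Y):
    # Log of the L1 differences, with unity added to each term (log1p).
    return np.sum(np.log1p(np.abs(H_X - H_Y)))
```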

Intersection Family
The non-intersection metric (Duda et al., 2000) is based on the minimum value of the two histograms at each bin. Since we are interested here in the differences between histograms, we choose the non-intersection rather than the intersection metric; because we are dealing with normalized histograms, this metric is zero when the histograms are identical. Other, more complicated members of the Intersection family, such as the Czekanowski metric (Gordon, 1999), can be shown to be equivalent to the simpler non-intersection metric used here in the case of normalized histograms.

Non-intersection:
$$d_{NI}(X,Y) = 1 - \sum_{i=0}^{b-1} \min\left( H_i(X), H_i(Y) \right)$$
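A minimal sketch of the non-intersection distance, using the same normalized histogram arrays assumed above:

```python
import numpy as np

def non_intersection(H_X, H_Y):
    # One minus the histogram intersection; zero when two normalized
    # histograms are identical, approaching one as their overlap vanishes.
    return 1.0 - np.sum(np.minimum(H_X, H_Y))
```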

Fidelity Family
This family contains metrics that employ the sum of the modified geometric means of the histograms, using the square root rather than the b-th root. We choose the commonly used Hellinger metric (Deza and Deza, 2012) and the Squared-chord metric (Deza and Deza, 2012), the latter being the most general version. Other commonly used metrics from this family are the Bhattacharyya distance (Bhattacharyya, 1943; Choi and Lee, 2003) and the Matusita distance (Matusita, 1951; 1955). The Bhattacharyya distance has been shown to be a bound on the Bayesian minimum mis-classification probability and is related in form to the Matusita distance.

Hellinger:
$$d_{Hel}(X,Y) = \sqrt{2 \sum_{i=0}^{b-1} \left( \sqrt{H_i(X)} - \sqrt{H_i(Y)} \right)^2}$$
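A sketch of the two Fidelity-family distances used here. The Hellinger scaling shown (an overall factor of 2 under the square root) follows the Deza and Deza convention and is our assumption; some references instead fold a factor of 1/2 into the definition.

```python
import numpy as np

def squared_chord(H_X, H_Y):
    # Sum of squared differences between the square roots of the bin values.
    return np.sum((np.sqrt(H_X) - np.sqrt(H_Y)) ** 2)

def hellinger(H_X, H_Y):
    # Hellinger distance expressed through the squared-chord sum.
    return np.sqrt(2.0 * squared_chord(H_X, H_Y))
```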

Inner Product Family
Metrics in the Inner Product family treat the two histograms as vectors and calculate their inner product normalized by some factor. Here we choose the Cosine metric (Webb and Copsey, 2011), in which the inner product is normalized by the square root of the sum of the squares of each histogram's elements. This family also contains the familiar Jaccard (Jaccard, 1901) and Dice (Dice, 1945) metrics, which also contain the inner product but use a different normalizing factor.

Cosine:
$$d_{Cos}(X,Y) = 1 - \frac{\sum_{i=0}^{b-1} H_i(X)\, H_i(Y)}{\sqrt{\sum_{i=0}^{b-1} H_i(X)^2}\; \sqrt{\sum_{i=0}^{b-1} H_i(Y)^2}}$$
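A sketch of the cosine distance; expressing it as one minus the normalized inner product (so that identical histograms give zero) is our assumption about the convention used rather than something stated above.

```python
import numpy as np

def cosine_distance(H_X, H_Y):
    # Inner product normalized by the product of the Euclidean norms,
    # subtracted from one so identical histograms give a distance of zero.
    denom = np.linalg.norm(H_X) * np.linalg.norm(H_Y)
    return 1.0 - np.dot(H_X, H_Y) / denom
```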

Squared-L2 Norm
Here we use the squared χ² metric (Deza and Deza, 2012) because its normalizing factor is symmetric in the two histograms (as opposed to the Pearson (Deza and Deza, 2012) and Neyman (Deza and Deza, 2012) χ² metrics, which normalize the squared differences in the numerator by the bin value of one or the other of the histograms). This metric is essentially the normalized Euclidean distance between two vectors.
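A sketch of the squared χ² metric, taking the symmetric normalizing factor to be the sum of the two bin values (its standard form); the eps guard for empty bins is again an illustrative choice.

```python
import numpy as np

def squared_chi2(H_X, H_Y, eps=1e-12):
    # Squared bin differences normalized by the symmetric sum of the bin values.
    return np.sum((H_X - H_Y) ** 2 / np.maximum(H_X + H_Y, eps))
```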

Shannon Entropy
These metrics take the form of Shannon's entropy (the quantity $p \ln p$) with various choices of normalizing factors. When applied to two histograms, they measure the minimum cross entropy of the two probability distributions. The metrics in this family are not, in fact, true distances since they are not symmetric with respect to the ordering of the input histograms. We used the Kullback-Leibler (Kullback and Leibler, 1951) and Jeffreys (Jeffreys, 1946) metrics, the latter being the symmetric form of the former.
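A sketch of these two entropy-based measures; clamping empty bins with a small eps to keep the logarithm finite is an implementation assumption, since the handling of empty bins is not specified here.

```python
import numpy as np

def kullback_leibler(H_X, H_Y, eps=1e-12):
    # Relative entropy of H_X with respect to H_Y; not symmetric in its arguments.
    P = np.maximum(H_X, eps)
    Q = np.maximum(H_Y, eps)
    return np.sum(P * np.log(P / Q))

def jeffreys(H_X, H_Y):
    # Symmetric form: the Kullback-Leibler divergence summed in both directions.
    return kullback_leibler(H_X, H_Y) + kullback_leibler(H_Y, H_X)
```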

Earth Mover's Family
The Earth Mover's Distance (EMD) (Rubner et al., 1998) calculates the minimum amount of work that is necessary to transform one distribution into another. We have used the Cha-Srihari distance for ordinal data (Cha and Srihari, 2002), which is related to the EMD, rather than the EMD itself, because the Cha-Srihari metric is a special univariate case of the EMD and has O(b) rather than O(b³) complexity, and therefore a lower computational burden. Note that all the metrics discussed here operate on single bins except for the Cha-Srihari metric.
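A sketch of the Cha-Srihari ordinal distance in its commonly stated univariate form, the sum of absolute prefix differences, which assumes a unit ground distance between adjacent bins and runs in O(b):

```python
import numpy as np

def cha_srihari(H_X, H_Y):
    # Cumulative (prefix) differences between the histograms; the sum of their
    # absolute values is the univariate earth-mover cost with unit bin spacing.
    prefix = np.cumsum(H_X - H_Y)
    return np.sum(np.abs(prefix))
```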