Visualizing and Interpreting Unsupervised Solar Wind Classifications

One of the goals of machine learning is to eliminate tedious and arduous repetitive work. The manual and semi-automatic classification of millions of hours of solar wind data from multiple missions can be replaced by automatic algorithms that can discover, in mountains of multi-dimensional data, the real differences in the solar wind properties. In this paper we present how unsupervised clustering techniques can be used to segregate different types of solar wind. We propose the use of advanced data reduction methods to pre-process the data, and we introduce the use of Self-Organizing Maps to visualize and interpret 14 years of ACE data. Finally, we show how these techniques can potentially be used to uncover hidden information, and how they compare with previous manual and automatic categorizations.


Introduction
The effects of solar activity on the magnetic environment of the Earth have been observed since the publication of Edward Sabine's work in 1852 [52]. Over almost two hundred years we have learned about the intimate connection between our star and the plasma environment of the Earth. Three main physical processes connect the Sun to us: the transfer of electromagnetic radiation, the transport of energetic particles, and the flow of solar wind. The latter is a continuous stream of charged particles that carries the solar magnetic field out of the corona and into interplanetary space.
The name 'solar wind' was coined by Parker in 1958 because 'the gross dynamical properties of the outward streaming gas [from the Sun] are hydrodynamic in character' [41]. Over time we have learned that the wind also has many more complex properties. Initially, it was natural to classify the solar wind by defining a boundary between 'fast' and 'slow' winds [25]. The former has been associated with mean speed values of 750 km/s, while the latter shows a limit at 500 km/s, where the compositional ratio (Fe/O) shows a break [21,55]. The solar wind also carries information about its origins on the Sun. At certain solar distances the ion composition of the solar wind is expected to be frozen-in, reflecting the electron temperature in the corona and the region of origin [21,64,55]. These particles have multiple energies and show a variety of kinetic properties, including non-Maxwellian velocity distributions [44,35].
The solar wind is also connected to the Sun by the Interplanetary Magnetic Field (IMF), through magnetic field lines directed towards the Sun, away from the Sun, or, in the case of flux ropes, connected at both ends [40,24]. The thin region where solar magnetic fields of opposite directions meet is called the Heliospheric Current Sheet (HCS). When a spacecraft crosses the HCS, instruments onboard can detect the change in magnetic field direction as a 180° reversal. Changes in the flow properties are also observed around the HCS. This perturbed zone is called the Heliospheric Plasma Sheet (HPS), and the passage of the spacecraft from one side of the HPS to the other is known as a Sector Boundary Crossing (SBC) [60]. In spacecraft observations these are sometimes confused with Corotating Interaction Regions (CIR), which are zones of the solar wind where fast flows have caught up with slow downstream solar wind, compressing the plasma.
From the point of view of a spacecraft, SBCs and CIRs can show similar sudden changes in the plasma properties. These two in turn are often grouped and mixed with other transient events, like Coronal Mass Ejections (CME) and Magnetic Clouds (MC). Since 1981, when [12] described the propagation of a MC behind an interplanetary shock, it has been suspected that CMEs and MCs were coupled. However, more recent studies show that CMEs observed near the Sun do not necessarily become MCs, but instead 'pressure pulses' [23,62].
Much more recently it has been revealed, by observations from Parker Solar Probe, that the properties of the solar wind can be drastically different closer to the Sun, where the plasma flow is more pristine and has not yet mixed with the interplanetary environment. Patches of large intermittent magnetic field reversals, associated with jets of plasma and enhanced Poynting flux, have been observed and named 'switchbacks' [5,6].
The solar wind is thus not only a hydrodynamic flow, but a compressible mix of different populations of charged particles and electromagnetic fields that carry information about their solar origin (helmet streamers, coronal holes, filaments, solar active regions, etc.) and is the domain of complex plasma interactions (ICMEs, MCs, CIRs, SBCs, switchbacks).
To identify and study each one of these phenomena we have relied in the past on a manual search, identification and classification of spacecraft data. Multiple authors have created empirical methods of wind type identification based on in-situ satellite observations and remote imaging of the solar corona. Over the years the number and types of solar wind classes have changed, following our understanding of the complexity of heliospheric physics. Solar wind classification serves three main roles: 1. characterizing the origins of the wind in the corona, 2. identifying the conditions under which the solar wind is geoeffective, and 3. isolating different plasma populations in order to perform statistical analyses.
Among these classifications we can include the original review work by [61], the impressive continuous inventory by [48,47,49], and the detailed studies by [64] and [63]. These publications classify the solar wind based on its origins and on the transient events detected. Each system includes two, three or four classes, generally involving coronal-hole origins, CMEs, streamer belt origins and sector reversal regions.
We are moving now towards a new era of data analysis, where manual human intervention can be replaced by 'intelligent' software. The trend has already started, with the work by [13] who used the [63] classes to train a Gaussian Process algorithm that autonomously assigns the solar wind to the proper class, and more recently by [50] who used unsupervised classification to perform a 4 and 8 class solar wind classification. Machine learning algorithms have been used in the past in other applications in solar physics [33,45,2,9,10,37,14], and astrophysics [57,38,26,56,4,11].
The most basic machine learning techniques learn using two methods: a) in supervised learning the algorithms are shown a group of inputs and outputs with the goal of finding a non-linear relationship between them, b) in unsupervised learning the machine is presented with a cloud of multi-dimensional points that have to be autonomously categorized into different classes. This means that we can program the computer to learn about the different types of solar wind using the existing empirical classifications, or by allowing it to independently detect patterns in the solar wind properties.
In the present work we show how the second method, unsupervised classification, can be used to segregate different types of solar wind. In addition, we show how to visualize and interpret such results. The goal of this paper is to introduce the method to the community, together with the best use practices and the opportunities that such a system can bring. We implement one specific type of classification, called Self-Organizing Maps, and we compare it to simpler classification techniques, showing that it can reveal hidden information in years of solar wind data.
In the next sections we present in detail the techniques of data processing (section 2.1), data dimension reduction (sections 2.2.1 and 2.2.2) and data clustering (section 2.2.3) that we have used. We then present in detail the Self-Organizing Map technique and all its properties in section 2.2.4. We show how to connect all of these parts together in section 2.2.5, and finally we show how the full system can be used to study 14 years of solar wind data from the ACE spacecraft in section 3.

Data Set Used
The solar wind data used in this work was obtained by the Advanced Composition Explorer (ACE) spacecraft, during a period of 14 years, between 1998 and 2011. The data can be downloaded from the FTP servers of The ACE Science Center (ASC). The files in this repository correspond to a compilation of hourly average data from three instruments: MAG (Magnetometer), SWEPAM (Solar Wind Electron, Proton, and Alpha Monitor) and EPAM (Electron, Proton, and Alpha Monitor). A detailed description of the entries in this data set can be found in the ASC website listed in section 5.
A total of 122712 data points are available. However, routine maintenance operations, low statistics, instrument saturation and instrument degradation produce gaps and errors in the data. The SWICS data includes a flag assessing the quality of the calculated plasma moments. We retain only 'Good quality' entries. Our pre-processed data set contains a total of 72454 points.

Additional Derived Features
We created additional features for each entry, based on previous knowledge of the physical properties of the solar wind. Some are derived from the existing properties in the data set, others computed from statistical analysis of their evolution. We introduce here the additional 'engineered' features included in our data set.
Multiple techniques have been proposed in the literature to identify ejecta, Interplanetary Coronal Mass Ejections (ICME), and solar wind origins in the ACE data. [64] suggest that, during solar cycle 23, three classes of solar wind can be identified using its speed, V_sw, and the oxygen ion charge state ratio, O^7+/O^6+. It has been shown that slow winds originating in coronal streamers correlate with high values of the charge state ratio, while fast winds coming from coronal holes present low values [16]. Plasma formed in coronal loops associated with CMEs also shows high values of the charge state ratio [63]. The classification boundaries of the 'Zhao classification' are presented in Table 1. [63] suggested an alternative four-class system based on the proton-specific entropy, S_p = T_p/n_p^{2/3}, the Alfvén speed, V_A = B/(μ_0 m_p n_p)^{1/2}, and the expected proton temperature, T_exp = (V_sw/258)^{3.113}. The classification conditions based on these three parameters are also presented in Table 1. This solar wind discrimination system will be known in this manuscript as the Xu classification [63]. For each entry in the data set we have included the values of S_p, V_A, T_exp, and the solar wind type given by the two classification methods. Two of the classes, i.e. ICME/ejecta and coronal hole, are common to the Xu and Zhao classifications. The number given to each class is arbitrary, but the two common classes share the same identification.
Auxiliary variables, like the Alfvén Mach number (M_A) and the temperature ratio (T_exp/T_p), have also been included in the data set. These features have been selected because they are the main components of the Xu classification and we want to compare the automatic classification methods against empirical models.
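As an illustration, the three Xu parameters can be computed directly from the hourly plasma moments. The sketch below is our own minimal implementation, assuming SI inputs for density and magnetic field; the function name and unit conventions are not from the original pipeline.

```python
import numpy as np

def derived_features(n_p, T_p, B, V_sw):
    """Compute the Xu-classification parameters from hourly moments.

    n_p  : proton number density [m^-3]
    T_p  : proton temperature [K]
    B    : magnetic field magnitude [T]
    V_sw : solar wind speed [km/s]
    """
    mu0 = 4e-7 * np.pi              # vacuum permeability [H/m]
    m_p = 1.6726e-27                # proton mass [kg]
    S_p = T_p / n_p**(2.0 / 3.0)            # proton-specific entropy
    V_A = B / np.sqrt(mu0 * m_p * n_p)      # Alfven speed [m/s]
    T_exp = (V_sw / 258.0)**3.113           # expected proton temperature
    return S_p, V_A, T_exp
```

For a typical slow wind sample (n_p = 5 cm^-3, T_p = 10^5 K, B = 5 nT, V_sw = 400 km/s) this yields an Alfvén speed of roughly 50 km/s.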
In addition to these instantaneous quantities, we perform statistical operations over a six-hour time window, computing the maximum, minimum, mean, standard deviation, variance, auto-correlation, and range of each feature. With these quantities we expect to capture transient events in the immediate upstream and downstream solar wind. They complement the stationary solar wind parameters mentioned above and add information about the temporal evolution of the plasma. The selection of the statistical parameters and of the window of time is arbitrary and will require a closer examination in a future publication.
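These windowed statistics can be sketched with pandas rolling operations, assuming an hourly cadence so that a six-hour window spans six samples; the column names and the toy series below are illustrative, not the actual data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two days of hourly solar wind speed samples.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Vsw": 400.0 + 50.0 * rng.standard_normal(48)})

# Hourly cadence: a six-hour window spans six consecutive samples.
window = df["Vsw"].rolling(6, min_periods=6)
stats = pd.DataFrame({
    "Vsw_max": window.max(),
    "Vsw_min": window.min(),
    "Vsw_mean": window.mean(),
    "Vsw_std": window.std(),
    "Vsw_var": window.var(),
    "Vsw_range": window.max() - window.min(),
    # lag-1 autocorrelation computed inside each window
    "Vsw_acorr": window.apply(lambda w: w.autocorr(), raw=False),
})
```

The first five rows are NaN because a full six-sample window is required; in practice such incomplete entries are dropped, consistent with the data cleaning described above.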
Two additional terms, which have been successfully used in the study of solar wind turbulence and wave characterization [65,1,34,16], are included here to account for additional time correlations. These are the normalized cross-helicity, σ_c, eq.(1), and the normalized residual energy, σ_r, eq.(2), where b = (B − ⟨B⟩)/(μ_0 m_p n_p)^{1/2} is the fluctuating magnetic field in Alfvén units, v = V_sw − ⟨V_sw⟩ is the fluctuating solar wind velocity, z^± = v ± b are the Elsässer variables [19], and ⟨·⟩ denotes the averaging of quantities over the time window.
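For a single window, the two quantities can be sketched as below, using the standard definitions σ_c = 2⟨v·b⟩/(⟨|v|²⟩+⟨|b|²⟩) and σ_r = (⟨|v|²⟩−⟨|b|²⟩)/(⟨|v|²⟩+⟨|b|²⟩) from the turbulence literature; the function signature and the test signal are our own assumptions.

```python
import numpy as np

def sigma_c_sigma_r(V, B, n_p):
    """Normalized cross-helicity and residual energy over one window.

    V   : (N, 3) solar wind velocity samples [m/s]
    B   : (N, 3) magnetic field samples [T]
    n_p : window-averaged proton density [m^-3], for Alfven units
    """
    mu0, m_p = 4e-7 * np.pi, 1.6726e-27
    b = (B - B.mean(axis=0)) / np.sqrt(mu0 * m_p * n_p)  # fluctuations, Alfven units
    v = V - V.mean(axis=0)                               # velocity fluctuations
    ev = np.mean(np.sum(v * v, axis=1))    # kinetic fluctuation energy
    eb = np.mean(np.sum(b * b, axis=1))    # magnetic fluctuation energy
    evb = np.mean(np.sum(v * b, axis=1))   # cross term <v.b>
    sigma_c = 2.0 * evb / (ev + eb)
    sigma_r = (ev - eb) / (ev + eb)
    return sigma_c, sigma_r
```

A purely Alfvénic interval, where v and b fluctuate identically, gives σ_c = 1 and σ_r = 0, which provides a simple sanity check.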
Due to gaps in the data, some of the above quantities can not be obtained. We eliminate from the data set all entries for which the derived features presented in this section could not be calculated. This leaves a total of 51608 entries in the data set used in the present work.
To account for the differences in units and scale, each feature column F in the data set is normalized to values between 0 and 1 using F' = (F − min F)/(max F − min F).
Not all of the features might be useful and some of them can be strongly correlated. We do not perform here a detailed evaluation of the interdependencies of the different features, and we leave that task for a future work. The present manuscript focuses on the description of the methodology and on the visualization and interpretation capabilities of unsupervised machine learning classification. We limit our work here to testing two models that use different numbers of features. These are listed in Table 2 and named Amaya-21 and Roberts-8, in honor of the work done by [50]. As the names suggest, the models use respectively 21 and 8 features of the data set.
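This [0, 1] normalization can be sketched as column-wise min-max scaling (our assumption of the exact form, since it is the standard way to map a feature onto that interval):

```python
import numpy as np

def minmax_scale(F):
    """Scale a feature column to [0, 1]: F' = (F - min F) / (max F - min F)."""
    F = np.asarray(F, dtype=float)
    return (F - F.min()) / (F.max() - F.min())
```

Applied column by column, this removes the unit and scale differences between, e.g., speeds in km/s and dimensionless charge state ratios.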

Complementary Data Catalogs
We support the interpretation of our results using data from three solar wind event catalogs. The first is the well-known Cane and Richardson catalog, which contains information about ICMEs detected in the solar wind ahead of the Earth [15,47]. We used the August 16, 2019 revision. As the authors state on their website, there is no spreadsheet or text version of this catalog, so offline editing was necessary: we downloaded and re-formatted the catalog to use it in our application. The CSV file created has been made available in our repository. We call this the Richardson and Cane catalog.
The second catalog corresponds to the ACE List of Disturbances and Transients produced by the University of New Hampshire. As in the previous case, the catalog is only available as an HTML webpage, so we have manually edited the file and extracted the catalog data into a file also available in our repository. This is hereafter referred to as the UNH catalog.
Finally, we also included data from the Shock Database maintained by Dr. Michael L. Stevens and Professor Justin C. Kasper at the Harvard-Smithsonian Center for Astrophysics. Once again we have gathered and edited multiple web pages into a single file available in our repository. In this work this database will be known as the CfA catalog.

Dimension Reduction using PCA
Principal Component Analysis (PCA) is a mathematical tool used in data analysis to simplify and extract the most relevant features in a complex data set. This technique is used to create entries composed of linearly independent 'principal components'. These are the eigenvectors of the covariance matrix Σ of the centered data, eq.(4), ordered from the largest to the smallest eigenvalue, λ_1 ≥ λ_2 ≥ … ≥ λ_n, where X̄ is the mean value of each original feature, eq.(3). The projection of the data onto the principal component space ensures a maximal variance in the direction of the first component. Each subsequent principal component is orthogonal to the previous ones and points in the direction of maximal variance in the residual sub-space [54].
The PCA transformation creates as many components in the transformed space, X̃, as features in the original data space X. However, components with small eigenvalues belong to a dimension where the variance is so small that it is impossible to separate points in the data. It is general practice in data reduction to keep only the first k components that explain at least a significant portion of the total variance of the data, Σ_{i=1..k} λ_i / Tr(Σ) > ε. This allows for a selection of the information that will effectively differentiate data points, and for a reduction of the amount of data to process during analysis. Many techniques have been suggested for the selection of the value of k and the cut-off ε [46]. We use the value k = 3 to simplify the comparison among the different models, but for a detailed study of the solar wind, if a PCA transformation is applied, it is important to use a fixed criterion for the selection of the cut-offs.
Fig. 1 (A) is a 3D scatter plot of all the data points, colored by the Xu classification, projected on the first three PCA components. The features used to create this figure are presented in section 2.2.5. Panel (B) contains the same data colored by the Zhao classification. These projections show that the Xu and Zhao classifications are defined by hyper-planes separating the points, even after the data has been linearly transformed by the PCA. Class 2, ICMEs-ejecta, is restricted to a small domain in this coordinate system (for both the Xu and Zhao classifications). The lateral plots on panel (A) are 2D histograms of the point distribution on the three main PCA planes. They show that the concentration of points is not homogeneous and that different zones can be isolated using unsupervised classification techniques. There is a clear segregation of points in the (1st, 2nd) component plane: as we will see in subsequent sections, one of the features of the solar wind presents a strong bimodal distribution that is prioritized by the PCA.
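The k = 3 reduction and the retained-variance fraction can be sketched with scikit-learn, one of the libraries used in this work; the toy feature matrix below is an illustrative stand-in for the normalized data set.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the normalized feature matrix (rows: hourly entries).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
X[:, 0] *= 5.0   # give one direction a dominant variance
X[:, 1] *= 2.0

pca = PCA(n_components=3)        # keep only the first k = 3 components
Xt = pca.fit_transform(X)

# Fraction of the total variance retained by the first k components,
# i.e. sum_i lambda_i / Tr(Sigma).
retained = pca.explained_variance_ratio_.sum()
```

In practice one would check `retained` against a fixed cut-off ε instead of hard-coding k, as discussed above.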

Dimension Reduction Using Autoencoders
PCA has a limitation: the principal components are linear combinations of the original properties of the solar wind. An alternative for data reduction is the use of autoencoders (AE). These are machine learning techniques that can create non-linear combinations of the original features, projected on a latent space with fewer dimensions [27]. This is accomplished by creating a system where an encoding function, φ, maps the original data X to a latent space, F, eq.(5). A decoder function, ψ, then maps the latent space back to the original input space, eq.(6). The objective of the autoencoder is to minimize the error between the original data and the data produced by the compression-decompression procedure, as shown in eq.(7).
Autoencoders can be represented as feed-forward neural networks, where fully connected layers lead to a central bottleneck layer with few nodes and then expand to reach again the input layer size. An encoded element, z ∈ F, can be obtained from a data entry, x ∈ X, following the standard neural network function, eq.(8), where W is the weights matrix, b is the bias, and σ is the non-linear activation function.
The decoding procedure, shown in eq.(9), transforms z → x̂, where the primed quantities are associated with the decoder. The loss function, L(x, x̂), is the objective minimized during the training of the neural network using gradient descent. Once training is completed, the vector z is a projection of the input vector x onto the lower dimensional space F.
Additional enhancements and variations of this simple autoencoder setup exist in the literature, including multiple regularization techniques to minimize over-fitting [31], Variational Autoencoders (VAE) that produce encoded Gaussian distribution functions [30], and Generative Adversarial Networks (GAN) that produce new (unseen) data [22]. In this work we use the most basic form of autoencoders, presented above.
The second column of Fig.1, panels (C) and (D), contains the same information as the first column, but with the data set encoded in the three-dimensional latent space F. Panel (C) shows that all the classes in the Xu and Zhao classifications are easy to distinguish, including ICMEs-ejecta (class 2), which is difficult to discern in the PCA. This projection also shows that class 4 from the Zhao classification, in the bottom panels, overlaps with class 3 (sector reversal origin) and partially with class 2 (ejecta) in the Xu classification on the top panels. Panel (C) shows, on the side planes, 2D histograms of the density of points. These can be seen as the volume integral of the point density in each direction. Here again it is possible to observe multiple zones of high concentration, suggesting that multiple types of solar wind are present in the data and that they can be differentiated using an unsupervised classification technique.

Clustering Techniques
The goal of unsupervised machine learning is to group data points into a limited number of clusters in the N-dimensional space Ω ⊂ R^N, where N is the number of features (components or properties) in the data set. Multiple techniques can be used to perform multi-dimensional clustering. We present in Fig. 2 the three clustering techniques used to classify our 3D reduced data. The panels in the first column show the data projected in the PCA-reduced space, X̃, while the second column shows the data in the latent AE space, F. Each row corresponds to a different clustering method. The colors in the top panels (A) and (D) were obtained using the k-means method [32], and the colors in the middle panels (B) and (E) were obtained using the Gaussian Mixture Model (GMM) [8]. The bottom panels are colored by the classes from the Self-Organizing Maps described later in section 2.2.4.
The k-means technique has already been used in a recent publication for the determination of solar wind states [50]. To our knowledge other clustering methods have never been used in the literature to classify the solar wind, but [18] has used the GMM to characterize magnetic reconnection regions in simulations using their velocity distribution information.
The colors used in Fig.2 are assigned randomly by each clustering technique. The most glaring issue is that different methods can lead to different clusters of points. The GMM and the k-means agree on their classification in the PCA space, but show dissimilar results in the AE space. Moreover, for a single method, e.g. k-means, slight modifications of the clustering parameters, e.g. using a different seed for the random number generator, can lead to very different results. We address this last issue using an algorithm that launches the k-means and GMM algorithms up to 500 times, until the methods converge to a quasi-steady set of clusters. But we warn that the results are implementation dependent.
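The two clustering methods and the re-initialization strategy can be sketched with scikit-learn; the two-blob toy data stands in for the 3-component reduced space, and the cluster counts are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy stand-in for the 3-component reduced data (PCA or AE space):
# two well-separated populations.
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0.0, 0.3, (300, 3)),
               rng.normal(2.0, 0.3, (300, 3))])

# Many re-initializations reduce the seed dependence noted in the text.
km = KMeans(n_clusters=2, n_init=100, random_state=0).fit(Z)
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(Z)

km_labels = km.labels_
gmm_labels = gmm.predict(Z)
```

On such separable data both methods recover the two populations; on a homogeneous convex cloud, as discussed below, they instead carve the space into Voronoï-like regions.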
In the present data set, the cloud of points is convex and well distributed in all three components. This raises one additional issue, observed most clearly in the first column of Fig.2: when classical clustering methods are applied to relatively homogeneously dense data, they divide the feature space into Voronoï regions with linear hyper-plane boundaries. This is an issue with all clustering techniques based on the discrimination of groups using their relative distances (to a centroid or to the mean of the distribution). To avoid this problem, density-based techniques, such as DBSCAN [20], and agglomerative clustering methods use a different approach. However, we cannot apply them here because, in such a homogeneous cloud of points, these techniques lead to a trivial solution where all data points are assigned to a single class.
There is no guarantee that a single classification method, with a particular set of parameters, will converge to a physically meaningful classification of the data if the points in the data do not have some level of separability, or do not present multiple zones of high density. This is also true for classification methods based on 'supervised learning'. In those applications the same issues will be observed if the training data uses target classes derived from dense data clouds using simple hyper-plane boundaries, as done for the Zhao and Xu classes. An example of such an application was published by [13]. The authors used the Xu classification to train a Gaussian Process classifier.

Self-Organizing Maps
Classical SOM Following the definitions and notation of [59], a class C_i can be defined as the set of points of Ω closer to the code word w_i than to any other code word, eq.(10). The clusters {C_i} then form a partition of Ω, and {w_i} are the associated code words (also known as nodes, weights or centroids). The mapping from the data space to the code word set, Φ : Ω → W, is obtained by finding the closest neighbor between the points x and the code words w, eq.(11). The code word w_s, the closest node to the input x_s, is called the 'winning element'. The class C_i corresponds to a Voronoï region of Ω with center in w_i.
A Self-Organizing Map (SOM) is also composed of structured nodes, arranged in a lattice and assigned fixed positions p_i in R^q, where q is the dimension of the lattice (generally q = 2). The map nodes are characterized by their associated code words. The SOM learns by adjusting the code words w_i as input data x is presented.
The SOM is the ensemble of code words and nodes. For a particular entry x_s, the winning node p_s is the one whose code word w_s is the closest to x_s. At every iteration of the method, all code words of the SOM are shifted towards x following the update rule of eq.(12), with h_σ(t, i, s) defined as the lattice neighbor function of eq.(13), where ε(t) is the time-dependent learning rate, eq.(14), and σ(t) is the time-dependent lattice neighbor width, eq.(15). The training of the SOM is an iterative process where each data point in the data set is presented to the algorithm multiple times, t = 0, 1, ..., t_f. In these equations the subscript 0 refers to initial values at t = 0 and the subscript f to values at t = t_f.
This procedure places the code words in the data space Ω in such a way that neighboring nodes in the lattice are also neighbors in the data space. The lattice can be presented as a q-dimensional image, called map, where nodes sharing similar properties are organized in close proximity.
The main metric for the evaluation of the SOM performance is called the quantization error, eq.(16): the average over the data set of the distance between each entry x_i and its winning code word, QE = (1/M) Σ_i ||x_i − w_{s(i)}||, where M is the total number of entries in the data set.
Once the training of the SOM is finished, the code words w i can be grouped together using any clustering technique, e.g. k-means. The nodes of the SOM with close properties will be made part of the same class. The classes thus created are an ensemble of Voronoï subspaces, allowing a complex non-linear partitioning of the data space Ω.
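The training loop and the second-level clustering of the code words can be condensed into a short NumPy sketch. The square lattice, Gaussian neighbor function and exponential decay schedules below are common choices standing in for eqs. (12)-(15) (the paper itself uses MiniSom with hexagonal nodes); lattice size and schedules are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_som(X, rows=6, cols=6, t_f=2000, eps0=0.5, eps_f=0.01,
              sig0=3.0, sig_f=0.5, seed=0):
    """Minimal sketch of the classical SOM update rule.

    X : (M, N) data in the reduced feature space.
    Returns the (rows*cols, N) code words w_i.
    """
    rng = np.random.default_rng(seed)
    # Node positions p_i on a square lattice (hexagonal in the paper).
    P = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = rng.uniform(X.min(0), X.max(0), (rows * cols, X.shape[1]))
    for t in range(t_f):
        frac = t / t_f
        eps = eps0 * (eps_f / eps0) ** frac   # decaying learning rate eps(t)
        sig = sig0 * (sig_f / sig0) ** frac   # decaying neighbor width sigma(t)
        x = X[rng.integers(len(X))]           # present one data point
        s = np.argmin(((W - x) ** 2).sum(1))  # winning element w_s
        h = np.exp(-((P - P[s]) ** 2).sum(1) / (2 * sig ** 2))
        W += eps * h[:, None] * (x - W)       # shift all code words toward x
    return W

rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 0.2, (400, 3)),
               rng.normal(1.5, 0.2, (400, 3))])
W = train_som(Z)
# Second-level classification: group the code words with k-means.
classes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(W)
```

After training, neighboring nodes hold similar code words, and clustering those 36 code words (rather than the full data set) yields the non-linear partition of Ω described above.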
The final number of clusters is an input of the algorithm, but can also be calculated autonomously. The Within Cluster Sum of Squares (WCSS) can be used as a metric of the compactness of the clustered nodes. As its name implies, the WCSS is the sum of the squared distances from each node to its cluster centroid. If only one class is selected, the large spread of the nodes produces a high WCSS. The lowest possible value of the WCSS is obtained for a very high number of classes, when the number of classes is equal to the number of nodes. But such an extreme solution is also impractical. The optimal number of clusters can be obtained using the Kneedle knee-detection algorithm [53]. In the present work we do not use this technique: we perform comparisons with previous publications that propose a fixed number of solar wind types. We will explore the use of automatic class number selection in a future publication.
Dynamic SOM The time dependence of the SOM training allows the code words w_i to reach steady coordinates by slowing down their movement over the iterations. Due to the minimization of the distance in eq.(11), code words tend to agglomerate around high density zones of the feature space. The Dynamic Self-Organizing Map (DSOM), introduced by [51], eliminates the time dependence and allows the code words to cover larger zones of the space outside of the high density regions.
The DSOM is a variation of the SOM where the learning function (12) and the neighbor function (13) are replaced by eqs. (17) and (18) respectively, where ε is a constant learning rate, h_η(i, s, x) is the new lattice neighbor function, and η is the 'elasticity' parameter. In their work, [51] show that the DSOM can be used to better sample the feature space Ω, reducing the agglomeration of code words around high density zones. The DSOM does not converge to a steady solution, due to the lack of a temporal damping factor.
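A single DSOM update can be sketched as follows. This follows our reading of the scheme in [51], with a constant learning rate, the neighbor width modulated by the distance to the winning code word, and the per-node step scaled by the distance to the sample; it is a hedged sketch to be checked against eqs. (17) and (18), not a verified reproduction of them.

```python
import numpy as np

def dsom_step(W, P, x, eps=0.1, eta=2.0):
    """One DSOM update (sketch after the scheme of [51]).

    W   : (K, N) code words, P : (K, q) lattice positions, x : (N,) sample.
    eps : constant learning rate, eta : elasticity parameter.
    """
    d2 = ((W - x) ** 2).sum(1)        # squared distances in feature space
    s = np.argmin(d2)                 # winning element
    if d2[s] == 0.0:                  # x already matched exactly: no move
        return W
    # Neighbor function widens when the winner is far from the sample.
    h = np.exp(-((P - P[s]) ** 2).sum(1) / (eta ** 2 * d2[s]))
    return W + eps * np.sqrt(d2)[:, None] * h[:, None] * (x - W)
```

Because ε and η are constant, repeated calls never damp the motion of the code words, which is the non-convergent behavior noted above.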
Visualization of SOM and DSOM Clustering techniques do not necessarily converge to a steady immutable solution. Differences in the training parameters or slight changes in the data can have an important impact on the final classification. These tools can be used for statistical analysis, comparisons, data visualization and training of supervised methods. But it will be practically impossible to claim the existence of a general objective set of states discovered only by the use of these basic clustering techniques.
However, SOMs and DSOMs provide an important tool for the study of the solar wind: the maps are composed of nodes that share similar properties with their immediate neighbors. This allows for visual identification of patterns and targeted statistical analysis.
We used the python package MiniSom [58] as the starting point of our developments. Multiple methods of MiniSom have been overloaded to implement the DSOM and to use a lattice with hexagonal nodes. All auxiliary procedures used to calculate inter-nodal distances, node clustering, data-to-node mapping, and class boundary detection have been implemented by us. All visualization routines are original and have been developed using the python library Matplotlib [28].
Fig.3 shows the basic types of plots and maps that can be generated using the SOM/DSOM techniques. This figure uses data from the model Amaya-21, which has been encoded, using the AE, into a set of entries z_i, each one composed of three components. Panel (A) shows a histogram of the first two components of the feature space Ω, with dots marking the positions of the code words w_i. The colors of the dots represent their SOM classification. The red lines connect a single code word w_s with its six closest neighbors. Panel (B) shows the 'hit map' of the SOM. It contains the lattice nodes p_i associated with the code words w_i. They are depicted as hexagons, with sizes representing the number of data points connected to each node, and colored by their SOM class. The thickness of the lines between lattice nodes represents the relative distance to their neighbors in the feature space Ω. Red lines connect the node p_s, associated with the code word w_s in panel (A), to its closest neighbors.
Panel (C) of Fig.3 corresponds to the value of a single feature associated with each node; as an example we use the ionized oxygen ratio O^7+/O^6+ ('O7to6'). To improve visualization, all hexagon sizes have been set to their maximum and the inter-node distance lines have been colored white. In order to obtain the correct value for each node, we must first perform a decoding of the data from the latent space, Ω = F, to the original data set space, X.
Panel (D) of Fig.3 shows that the nodes of the lattice can also be used to present data that has not been used in the training of the SOM. The method keeps track of the data set points associated with each lattice node; it is then possible to perform independent statistical operations on those points alone. Moreover, it is possible to activate the SOM with just a subset of the data, i.e. with points that feature a specific solar wind type. In this case, as an example, we have colored the map using the average oxygen charge state Q_O ('avqO'), and we have set the size of the nodes to represent the frequency of points with solar wind type Xu=2 (ejecta). The dark lines between the lattice nodes designate the boundaries between different SOM classes.
These four representations are only a few examples of the variety of data that can be represented using SOMs. The most important aspect of the SOMs is that data is represented in simple 2D lattices where the nodes share properties with their neighbors. Here we also decided to use hexagonal nodes, connecting 6 equidistant nodes, but other types of representations are also valid, e.g. square or triangular nodes.
The bottom row of Fig.3 displays all three components of the code words w i associated with each one of the p i nodes. In the first panel they have been mapped to the basic colors Red, Green and Blue (RGB). The remaining panels have been colored using each individual component. The first panel is then the RGB composition of the three remaining ones, with the boundaries between the SOM classes highlighted.

The Full Architecture
The previous sections introduced all the individual pieces that we use for the present work. Here we give a global view of the full model. Fig.4 shows how all the components are interconnected. The data set is composed of clean and processed entries. We tested the PCA transformation in cases Amaya-21 and Roberts-8, keeping only the first three principal components. This possible setup is presented on the left of the figure. It is also possible to perform an unsupervised clustering directly on the un-processed data, as shown at the top of the figure, but it is not recommended. For the remainder of this manuscript we present only the cases that use the non-linear AE encoding, shown at the right of Fig.4. The bottleneck of the AE network is three nodes wide, i.e. the data is encoded in three components. The transformed data is then used to train the SOM.
After training, the code words of the SOM are clustered to group together nodes that share similar properties. This second-level classification is done using the k-means++ algorithm with 500 re-initializations (it is in general recommended to use between 100 and 1000 re-initializations). The total number of classes is an input of the model and has been set to 8. This arbitrary choice was made following the results presented by [50]. All the software was implemented in Python, using as main libraries PyTorch [42], Scikit-learn [43], Matplotlib [28], MiniSom [58], Pandas [36] and NumPy [39].
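A minimal sketch of this second-level classification with Scikit-learn, assuming the trained code words are available as a NumPy array (the array shapes and variable names are illustrative, not our production code):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
code_words = rng.random((64, 3))   # e.g. an 8x8 SOM with 3-component code words

# Second-level classification: k-means++ with 500 re-initializations,
# grouping the SOM nodes into 8 classes (choice following the literature).
km = KMeans(n_clusters=8, init="k-means++", n_init=500, random_state=0)
node_class = km.fit_predict(code_words)

assert node_class.shape == (64,)          # one class label per lattice node
assert len(np.unique(node_class)) == 8    # all 8 classes are populated
```

Because the clustering acts on the code words rather than on the full data set, the many re-initializations remain cheap even on a workstation.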
Autoencoder architecture We use a basic, fully connected feed-forward neural network for the encoding-decoding process. The bottleneck of the network has been fixed to three neurons in order to simplify the visualization. This arbitrary choice is another parameter of the model that needs further investigation. The neural network is symmetric in size, but the weights of the encoder and of the decoder are not tied (see eqs. (8), (9)). We use multiple fully connected hidden layers, where the central layer is the size of the bottleneck. Each layer is composed of a linear regressor, followed by batch normalization and a ReLU activation function. The output layer of the network contains a linear regressor followed by a hyperbolic tangent activation function. The autoencoder has been coded in Python using the PyTorch framework [42].
We use an Adam optimizer [29] for the gradient descent with a learning rate of 0.001 and a weight decay of 0.0001 for regularization. The loss function is the Mean Squared Error (MSE). We train the network for 30 epochs, after which we see no additional improvement in the loss function. The full data set was randomly divided 50%/50% between training and testing sets.
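A sketch of this architecture and one training step in PyTorch is given below. The hidden layer sizes are illustrative placeholders, not the tuned values listed in Table 2; the optimizer settings follow the text.

```python
import torch
import torch.nn as nn

def block(n_in, n_out):
    # Each hidden layer: linear regressor + batch normalization + ReLU.
    return nn.Sequential(nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out), nn.ReLU())

class Autoencoder(nn.Module):
    """Symmetric fully connected AE with a 3-neuron bottleneck.
    Hidden sizes here are placeholders, not the tuned ones."""
    def __init__(self, n_features=21, hidden=(16, 8), n_latent=3):
        super().__init__()
        self.encoder = nn.Sequential(
            block(n_features, hidden[0]), block(hidden[0], hidden[1]),
            nn.Linear(hidden[1], n_latent))
        self.decoder = nn.Sequential(
            block(n_latent, hidden[1]), block(hidden[1], hidden[0]),
            nn.Linear(hidden[0], n_features), nn.Tanh())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss()

x = torch.rand(32, 21)          # a mini-batch of normalized feature vectors
loss = loss_fn(model(x), x)     # reconstruction error
loss.backward()
optimizer.step()
assert model.encoder(x).shape == (32, 3)   # data encoded in three components
```

Note that the encoder and decoder weights are separate tensors, i.e. not tied, matching the architecture described above.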

Two Models of Solar Wind Classification
We have tested the two models presented in Table 2. The models are inspired by the work of [50]. We call these cases Amaya-21 and Roberts-8. The table lists all the features used in each model. A detailed description of each feature can be found in the ACE Level 2 documentation. To spread the data over a larger range of values in each component, we have used the logarithm of all the quantities, except for those marked with an asterisk in the table.
Features 15 to 20 contain an additional suffix, corresponding to a statistical operation performed on the corresponding feature. The operations include the mean, the range, the standard deviation and the auto-correlation of quantities over a time window of 6 hours. This window allows us to capture temporal (spatial) fluctuations in some of the solar wind parameters.
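Such windowed statistics can be derived with pandas rolling operations. The sketch below uses synthetic hourly data as a stand-in for the ACE measurements; the series and column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hourly magnetic field magnitude samples (synthetic stand-in for ACE data).
bgse = pd.Series(rng.random(240),
                 index=pd.date_range("2003-01-01", periods=240, freq="h"))

win = bgse.rolling(window=6)                 # 6-hour sliding window
features = pd.DataFrame({
    "Bgse_mean":  win.mean(),                # mean over the window
    "Bgse_range": win.max() - win.min(),     # range over the window
    "Bgse_std":   win.std(),                 # standard deviation
    # Pearson auto-correlation of the windowed signal at a lag of one sample
    "Bgse_acor":  bgse.rolling(window=6).apply(lambda w: w.autocorr(lag=1)),
})
assert features.shape == (240, 4)   # four derived features per time stamp
```

The first five rows are NaN because a full 6-hour window is not yet available; in practice those entries are dropped during data cleaning.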
On the lower part of Table 2 we present the range of dates used for each model. For Amaya-21 we use the full data set, while for the model Roberts-8 we try to replicate as much as possible the choices made in [50]. The same table also contains the hyper-parameters selected to run the two models. The number of neurons per layer in the encoding half of the neural network is listed in the table and was manually selected to minimize the final loss value of the AE.
All the figures presented until now correspond to the processing of data from model Amaya-21. The amount of data and figures produced in this work is very large, and it is not possible to include all of them in the present document. We will present some highlights in the next section, but more detailed analyses of each one of the cases will be presented in future publications.
Budget Machine learning models require fine tuning of different parameters, from the selection and testing of multiple methods, to the parameterization of the final architecture. [17] suggests that every publication in machine learning should include a section on the budget used for the development and training of the method. The budget is the amount of resources used in the data processing, the selection of the model hyper-parameters (HP), and its training.
The most time-consuming tasks in the present work have been the data preparation, the model setup and debugging, and the writing of the SOM visualization routines. All the techniques described in the previous sections have been coded in Python and are freely accessible in the repositories listed in section 5. We estimate the total effort to bring this work from scratch at two person-months.
Of these, one person-week was dedicated to the manual testing and selection of different model HPs (autoencoder architecture, feature selection, learning rates, initialization methods, number of epochs for training, selection of data compression method, size of the time windows, etc.).
All classic clustering techniques presented in section 2.2.3 require only a few lines of code and can be trained in minutes on a mid-range workstation (e.g. a Dell Precision T5600, featuring two Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz with four cores and eight threads each). The most time-consuming tasks of our models are the training of the autoencoder (5% of the total run time), the multiple passes of the clustering algorithms (15% of the run time), and the optimization of the SOM hyper-parameters (80% of the run time). The training of the SOM itself is performed in less than a minute.
For reference, the total run-time for each one of the models used in this work are: 60 minutes for the Amaya-21 model and 20 minutes for the Roberts-8 model.

Hyper-Parameter Optimization
Our main goal in this manuscript is to introduce the use of SOMs for the classification of solar wind data. SOMs require the selection of four main Hyper-Parameters (HPs): the size of the lattice, (m × n), the initial learning rate, ε 0 , and the initial neighbor radius, σ 0 . In the case of the DSOM algorithm, these last two HPs are replaced by the constant learning rate, ε, and the elasticity, η. The automatic selection of the best HPs for a machine learning model is called Hyper-Parameter Optimization (HPO).
We use the library 'optuna' [3] to perform an automatic optimization of the four HPs. The optimization is based on a technique called Tree-structured Parzen Estimator (TPE) [7], which uses Bayesian Optimization to minimize a target function provided by the user. We propose the use of the objective function, H, described in eq. (19), where Q E is the SOM quantization error, m and n are the numbers of lattice nodes in each dimension, and m max and n max are the given maximum numbers of possible nodes. The weight factors α, β and γ are used to impose restrictions on each term. We have fixed their values to α = β = 0.4Q 0 E and γ = 0.08Q 0 E , where Q 0 E is the quantization error at iteration zero of the first optimization trial. When a large number of nodes is available, smaller values of Q E are automatically obtained because the mean distance from the data set entries to the code words is reduced. The second and third terms on the RHS of H lead the optimizer to reduce the number of nodes in the SOM. The squaring term γmn forces the map to be as square as possible.
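The quantization error Q E can be computed as the mean distance from each data entry to its closest code word. The plain NumPy sketch below (with random stand-in data, not the tuned objective of eq. (19)) also illustrates why larger maps automatically obtain smaller Q E , which is what motivates the penalty terms in H:

```python
import numpy as np

def quantization_error(codebook, data):
    """Mean Euclidean distance from each data entry to its closest code word."""
    d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(3)
data = rng.random((1000, 3))
qe_small = quantization_error(rng.random((48, 3)), data)    # m*n = 48 nodes
qe_big = quantization_error(rng.random((480, 3)), data)     # 10x more nodes

# Denser codebooks trivially reduce Q_E, so Q_E alone would drive the
# optimizer toward the largest allowed lattice; H penalizes map size.
assert qe_big < qe_small
```

Without the size penalties, the TPE optimizer would always saturate m and n at their maximum values.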
After a total of 100 trial runs of the model using different HPs, the optimizer selected the parameters presented in the lower section of Table 2. The total run time for the HPO of the case Amaya-21 is about 40 minutes. HPO is, understandably, one of the most expensive procedures in our entire setup.

Results and comparisons
3.1 Interpretation of ACE data using SOMs

3.1.1 General overview

Fig.3 (A) shows the distribution of code words in the latent space for the model Amaya-21. The feature and component maps in the bottom row additionally show that lattice nodes share common attributes with their neighbors. The regularity in the colors of the feature map confirms that the SOM keeps its most important property: organization. This is clear from the proximity of neighboring code words in the latent space, marked with red lines. We expect, then, to find common patterns in all the following maps in this section.
In the same figure, panels (B) and (C), and the component maps in the bottom row, show black and white lines indicating the relative distances between the lattice nodes: thicker lines represent larger inter-node distances. This shows that there are sections of the map (groups of code words in the latent space) that form separate groups. This group separation has been highlighted in the 'Feature map': the k-means clustering of the code words divides the space following the inter-nodal divisions. The classified points in the 3D latent space are shown in the third row of Fig.2.
The 'Hit map' in Fig.3 presents a good distribution of points among all the lattice nodes, except in regions isolated from the rest. These zones represent solar wind types that have atypical properties. One of the goals of the DSOM is to cover those isolated zones, where rare events can be classified in separate nodes. In contrast, the classic SOM method tends to cluster the code words in regions of high density, hindering the categorization of rare events, like ICMEs, ejecta or magnetic clouds, in the solar wind data.

Fig.5 shows the distribution of the SOM training data in the normalized range [0..1]. This is a violin plot superposed on a box plot, showing the data distribution for each one of the features listed in table 2. Normalization of the data is performed using the maximum and minimum values of each feature, but outliers (extremely large or small values) can hinder the use of particular features. In any classification problem, outlier detection and elimination is extremely important. The figure shows that all features are well distributed over their range. Even in cases like feature 13 (the Alfvén Mach number), where the distribution is narrow, it still covers a significant part of the total range. This figure also shows that feature 6 (cross-helicity, σ c ) has a bimodal distribution, with two peaks close to the limits. A significant part of the data lies close to the σ c = ±1 limits. The cross-helicity is a measure of the Alfvénicity of the solar wind, representing the direction and intensity of the propagation of Alfvénic fluctuations. At Earth's orbit this is an indication of the origins of the solar wind in the northern or southern hemisphere of the Sun.
Feature 20 (Pearson auto-correlation of the magnetic field magnitude) also shows a broad distribution, with a marked peak near one. This quantity was calculated for a time window of six hours and a time lag of one hour. It shows the extent to which the values of the magnetic field have changed in one hour. High autocorrelation values represent situations in which the magnetic field does not change during the window, i.e. values in the window at time t are the same as values shifted by one hour, t − 1. Completely uncorrelated signals, produced by random changes in time, will produce autocorrelation values close to 0 (0.5 after normalization), i.e. the data in the window at time t is different from the data in the shifted window. Values close to one (minus one) correspond to a signal that is correlated (anti-correlated) at a lag of one hour. Additional time lags could be used to create extra features, but here we use only one to test its effectiveness.
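The behavior described above can be checked with a small pandas example, using synthetic signals in place of the magnetic field data (the signals and thresholds are purely illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
t = np.arange(240)

slow = pd.Series(np.sin(2 * np.pi * t / 24))   # slowly varying, 24-sample period
noise = pd.Series(rng.standard_normal(240))    # uncorrelated random signal

# Pearson auto-correlation at a lag of one sample ("one hour" of hourly data).
r_slow = slow.autocorr(lag=1)
r_noise = noise.autocorr(lag=1)

# Min-max normalization maps [-1, 1] onto [0, 1],
# so uncorrelated signals end up near 0.5.
norm = lambda r: (r + 1) / 2
assert norm(r_slow) > 0.9               # field barely changes in one hour
assert abs(norm(r_noise) - 0.5) < 0.15  # random changes: near the midpoint
```

This is why the peak of feature 20 near one in Fig.5 indicates long stretches of quasi-steady magnetic field magnitude.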
The features selected for this model have not followed a meticulous vetting process. We included features inspired by previous publications, as well as interesting new additions. Our goal in this work was to test whether the data transformation into an encoded latent space can account for redundant or uninteresting features. This is a very useful property for data sets where expert knowledge is not available. It also shows how the SOM can point to features that do not add value. As we will see in the next sections, the method converges to meaningful classes, even when some of the features used turn out to be not very relevant. We will perform a more detailed selection of the features, based on the experience of human experts, in a future publication.

Interpretation of the Automatic Classification
Lattice nodes are characterized by their weights (code words). By applying the reverse transformation, followed by a re-scaling, we obtain their values in the original N-dimensional space. Fig.6 shows the DSOM clustering and the 21 solar wind properties associated with each lattice node. We have clustered the nodes into eight classes. This is a subjective selection inspired by other works in the literature. In our case the clustering leads to contiguous groups of nodes.
The maps show the properties that differentiate each of the eight classes. We can try to attribute a physical significance to the classes by analyzing together the characteristic features of each class in Fig.6 and examples of the recorded solar wind data in Fig.7. In the latter, the empirical classification boundaries (see table 1) are plotted using a black continuous line. The red area corresponds to 'non-coronal hole origin' solar wind: data points receive this classification if the dotted line enters the red zone. If it stays above it, the solar wind is considered an ICME. If the curve drifts below the red zone, the wind is considered to have origins in a coronal hole.
The three time series in panels (A), (B) and (C) show that multiple techniques can be used for the clustering of solar wind properties, but SOMs allow a fast visualization and interpretation not available with other clustering methods. All time series have a strong tendency to split the solar wind into two groups, depending on the heliospheric sector. This is due to the importance of the cross-helicity in the data set and its bimodal distribution.
Class 5 can be mapped to transient events, CMEs and ejecta. It presents very high values of the oxygen ion charge state ratio ('O7to6'), a feature that [64,55,63] associate with CME plasma. It is also characterized by high solar wind velocity, cross-helicity σ c ∼ 0, and high values of magnetic field magnitude and Alfvén speed. These features are usually associated with explosive transient activity [50,63]. Fig. 7, panel (C) shows that class 5 maps well to the Richardson and Cane, UNH and CfA CME catalogs.
We can then use the cross-helicity, σ c , to identify two groups of classes: the ones with mostly positive (1 and 3) and mostly negative cross-helicity (0, 2, 4, 6, 7). As already done in [50], we associate them to solar wind plasma originating from areas with different magnetic polarity, respectively northern and southern sector. Inspection of Fig. 7 confirms this association.
Among the classes with negative σ c , class 7 can be quite confidently associated with coronal hole plasma. It is characterized by the very low values of the O 7+ /O 6+ ratio that [64,55,63] associate with plasma originating from open magnetic field lines. It also exhibits high wind speed, low proton density and proton density variability, high absolute values of cross-helicity and equipartition levels of the residual energy σ r (telltale signs of Alfvénicity), high proton entropy, moderately high values of Alfvén speed, and high proton temperature and temperature variability. These are characteristics widely associated with fast coronal hole solar wind plasma. Inspection of Fig. 7 again supports this identification.
Classes 4 and 0 are both possibly composed of a mix of slow Alfvénic wind [16] and 'conventional' slow and intermediate speed wind. Slow Alfvénic wind shares its coronal hole origin, and a number of characteristics related to that origin (O 7+ /O 6+ ratio, low density values, high |σ c | and low σ r values), with the fast Alfvénic wind. The main difference, apart of course from the speed, is the proton temperature, which tends to be lower in Alfvénic and 'conventional' slow wind than in the Alfvénic fast wind. The parameters that are usually used to distinguish between slow and fast wind (speed, density, proton entropy, proton temperature) span a quite large interval in both classes 0 and 4, and in fact point to the presence of a mix of slow and fast wind in both. The main difference between classes 0 and 4 is given by the very high values of the residual energy σ r in class 4, which point to kinetically dominated structures. Class 2 is characterized mainly by very high values of the Alfvénic Mach number. It is a class with a very low number of hits (see the hit map), and it is rarely spotted in Fig. 7.
Among the classes associated with positive cross-helicity, classes 1 and 3, we associate class 3 with coronal hole origin and class 1 with streamer belt origin. Our class 3 maps quite closely to the 'red' class in Figs. 1 and 2 of [50], associated there to coronal hole plasma from sectors of positive polarity. Class 3 indeed shows high absolute values of cross-helicity and near-zero values of residual energy, as expected from Alfvénic wind from coronal holes. We notice, however, that its O 7+ /O 6+ ratio, velocity, density, proton entropy and proton temperature values are somewhat less 'coronal hole-like' than the ones observed in class 7 for the opposite polarity. A qualitative difference between the wind from sectors of opposite magnetic polarity can be seen in Fig. 7, and was already remarked upon in [50]. We are quite surprised that it extends to the large time interval that we used to generate the SOM, and we plan to conduct further analysis on the topic in the future.
We will perform further refinements of the model and its interpretation in a future work. These preliminary results show the great potential of the techniques introduced in this paper. SOMs show the variability of solar wind and how it can be visually characterized. The SOM is a helpful guide in the study of the different types of solar wind, but is not necessarily an objective, unbiased and final classification method. SOMs open the possibility for a fast visual characterization of large and complex data sets.
The short analysis of the different SOM classes performed above was also informed by the data presented in Fig.8. Each row corresponds to a single solar wind class, represented by its 21 features using box plots. The colors correspond to the class number (0 to 7 from top to bottom). The first column has been built from a classification using k-means, the second with GMM and the third with DSOM. Here we can find again the properties described in the previous section. We call these plots class 'fingerprints'. Different classification methods lead to different classes with different fingerprints. Fingerprints are much more difficult to interpret visually than SOMs. [50] and [63] performed detailed descriptions of particular solar wind classes based on the mean values observed in each subset of points. But Fig.8 shows that some features can have very large distributions. For example, the values of the solar wind speed, feature 2, have a very large spread in all the classes and all classifications, except for class 7 in the GMM classification. Other features with a large fingerprint spread include the cross-helicity (6) and the residual energy (7). These are a consequence of the bimodal nature of the two features, as shown in the violin plots of Fig.5. This bimodal distribution plays a critical role in the separation of classes, as shown in our results and the results presented by [50]. Depending on the expected use of the solar wind classification, using the raw values of σ c might not be recommended.
It is noticeable that in the k-means classification multiple classes have very similar fingerprints, with the exception of a single feature. For example, classes 6 and 7 have similar characteristics except for feature 6 (cross-helicity). The same is true for classes 0 and 2 in the GMM classification, where the largest difference is the spread of feature 20. SOM classes tend to present fingerprints that are more variable. This is a consequence of the use of the dynamic version of the SOM, which does not agglomerate nodes in zones with a high density of points. On the contrary, k-means and GMM tend to put more points in high-density zones, creating very similar class fingerprints.
Maps of Empirical Classifications SOMs allow the visual analysis of previously published results.
In this section we show how the Xu and Zhao classifications activate different nodes of the SOM. We use two properties of the SOM simultaneously: the size of the lattice nodes will represent the number of hits for a particular class, and the color will represent one property of the solar wind.
To perform this analysis, instead of using the full data set, we extract three subsets corresponding to the entries categorized as CHW, ICME and NCHW in the Zhao catalog. Each one of these three subsets is passed through the Amaya-21 model, and we observe how each one activates the SOM. All properties are normalized between zero and one, using the maximum and minimum values of each feature in the full data set, so that we can perform comparisons among all the subsets. Fig.9 shows the SOMs of the three Zhao classes produced by the Amaya-21 model. The CHW, ICME and NCHW classes have different numbers of hits in the SOM. If these solar wind types were clustered in different SOM classes, they would activate different nodes in the lattice. However, the CHW and NCHW classes activate very similar regions in this model's map. The colors in each map are homogeneous, demonstrating that all points in a class share similar solar wind properties. The values of the oxygen ion state ratio and the solar wind speed do not seem to play an important role in the automatic classification of our model, and both classes activate similar nodes with similar solar wind properties. The goal of our team for our next publication is to work in collaboration with solar experts to design a more sophisticated model that can accurately reproduce the Zhao and Xu classes in our maps.
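Activating the map with a catalog subset amounts to mapping only those entries to their best-matching nodes and counting hits per node. A minimal NumPy sketch, with random stand-in data and labels in place of the encoded ACE entries and Zhao classes:

```python
import numpy as np

def activate(codebook, subset):
    """Map a subset of the data onto the lattice: hit count per node."""
    bmu = np.linalg.norm(subset[:, None, :] - codebook[None, :, :],
                         axis=2).argmin(axis=1)   # best-matching unit per entry
    return np.bincount(bmu, minlength=len(codebook))

rng = np.random.default_rng(5)
codebook = rng.random((48, 3))            # trained SOM code words
data = rng.random((2000, 3))              # encoded data set entries
labels = rng.integers(0, 3, size=2000)    # stand-in for CHW/ICME/NCHW labels

# Activate the map with one catalog class only; node sizes ~ hit frequency.
hits_chw = activate(codebook, data[labels == 0])
assert hits_chw.sum() == (labels == 0).sum()   # every entry lands on one node
```

Comparing the `hits_*` arrays of two catalog classes directly shows whether they occupy distinct or overlapping regions of the lattice.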
In these maps it is clear that the ICME class is mainly contained in the zone corresponding to class 5, which has been previously identified as such in the map and time series analysis above. Here the total number of hits is only 445, which explains why it is so difficult to observe in the time series. This is an additional benefit of using SOMs: we are able to detect important data points that can easily be overlooked with other methods.
In a similar way, Fig.10 shows the SOMs of the Xu classification. This time we used four subsets of the data set, each one corresponding to a different Xu class. Once again, 'ejecta' is confined to the region of SOM class 5, and 'sector reversal origin' solar wind activates some of the nodes corresponding to the non-coronal hole wind in the Zhao classification. This same zone also overlaps with the 'streamer belt origin' zones in the Xu classification. This class seems to be included in SOM class 2, and in part in class 3. It is possible to isolate singular nodes and study all their characteristics in more detail, but this is outside the scope of the present work and will be presented in a future publication.
The separation of the Xu classes is also not perfect in this model. We have tested other models in which the separation is clearer. Those models were based on different numbers of features and time ranges. An example of such a model is presented in the next section.

Model Roberts-8
The same techniques used for model Amaya-21 were applied to this model. The only differences between the two are the amount of data used and the selected initial features. We can see that these two modifications can have an important impact on the final results. Fig.11 (A) shows the volume-integrated point density and the distribution of the code words. This is a projection in the latent space after transformation using an autoencoder. The 'hit map' and the 'feature map' show a clear segregation of points, allowing for a proper splitting of the data set. The time series in Fig.11 shows that the model can differentiate zones of high and low speed, as well as zones of polarity inversion and some of the shocks and ICMEs. The use of fewer features gives more dominance to properties with a larger distribution spread, like the cross-helicity. It is clear from the time series that classes 4 and 5 are dominant northwards and southwards of the HCS, respectively. Class 3 in this model is associated with transient events, including ICMEs and sector reversals.
For simplicity we will not present a detailed analysis of the Roberts-8 model. We would only like to point out an important difference with model Amaya-21: Fig.12 shows how the solar wind classifications by Xu and Zhao are interpreted by the model. First, it is important to notice that all classes activate nodes predominantly in different SOM classes. Second, the small variations in colors inside each map demonstrate that the classes are well represented by the main properties proposed by Xu and Zhao. One exception is the class 'ejecta', which shows uneven values of proton specific entropy, S p , and temperature ratio T exp /T p .
Coronal hole solar wind classes by Xu and Zhao activate exactly the same nodes in the map, corresponding to class 4 in the Roberts-8 model. ICME from Zhao is considered as a subset of the ejecta class from Xu, while the NCHW class from Zhao contains the sector reversal class from Xu.
A careful selection of features and the data range of the models can produce a particularly powerful tool for the analysis of solar wind information. We are currently working towards the creation of an accurate solar wind classification system based on the developments presented here.

Discussion
In this paper we show how the categorization of solar wind can be informed by classic unsupervised clustering methods and Self-Organizing Maps (SOM). We demonstrate that a single technique used in isolation can be misleading for the interpretation of automatic classifications. We show that it is important to examine the SOM lattices, in conjunction with the solar wind fingerprints and the time series. Thanks to these tools we can differentiate classes associated with known heliospheric events.
We are convinced that basic unsupervised clustering techniques will have difficulties in finding characteristic solar wind classes when they are applied to unprocessed data. A combination of feature engineering, non-linear autoencoding and SOM training leads to a more appropriate segmentation of the data points.
The classification of the solar wind also depends on the objectives to be attained: if the goal is to classify the solar wind to study its origin on the Sun, features related to solar activity must be included in the model; however, if the goal is to identify geoeffectiveness, other parameters should be added to the list of features, including geomagnetic indices.
In this work we have presented a first test of the capabilities of the SOMs for the analysis of data from a full solar cycle. Due to the extent of the work done, in this paper we introduce all the methods and techniques developed, but we leave for a future publication a detailed selection of all the model parameters, and the corresponding detailed physical interpretation of the solar wind classification.
All the tools and the techniques presented here can be applied to any other data set consisting of large amounts of points with a fixed number of properties. SOMs have already been used in astrophysics [56] and magnetospheric physics [14], and we are currently working on their deployment for the study of active regions on the Sun.
All the software and the data used in this work are freely available for reproduction and improvement of the results presented above.

Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions
JA created the software used in this work, built the data sets and wrote the manuscript. RD provided important insights into the use of the machine learning techniques, and performed revisions of the different drafts. MEI gathered information from external collaborators, provided insights into the data usage, and proofread different drafts. GL supervised the work. All authors contributed to manuscript revision, read and approved the submitted version.

Funding
The work presented in this manuscript has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 754304 (DEEP-EST, www.deepprojects.eu), and from the European Union's Horizon 2020 research and innovation programme under grant agreement No 776262 (AIDA, www.aida-space.eu).