^{1}

^{2}

^{3}

^{2}

^{4}

^{2}

^{3}

^{2}

^{3}

^{1}

^{5}

^{6}

^{7}

^{*}

^{2}

^{3}

^{*}

^{1}

^{2}

^{3}

^{4}

^{5}

^{6}

^{7}

Edited by: Miguel Castelo-Branco, Coimbra Institute for Biomedical Imaging and Translational Research (CIBIT), Portugal

Reviewed by: Li Dong, University of Electronic Science and Technology of China, China; Xi Jiang, University of Electronic Science and Technology of China, China

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

In independent component analysis (ICA), the selection of model order (i.e., number of components to be extracted) has crucial effects on functional magnetic resonance imaging (fMRI) brain network analysis. Model order selection (MOS) algorithms have been used to determine the number of estimated components. However, simulations show that even when the model order equals the number of simulated signal sources, traditional ICA algorithms may misestimate the spatial maps of the signal sources. In principle, increasing model order will consider more potential information in the estimation, and should therefore produce more accurate results. However, this strategy may not work for fMRI because large-scale networks are widely spatially distributed and thus have increased mutual information with noise. As such, conventional ICA algorithms with high model orders may not extract these components at all. This conflict makes the selection of model order a problem. We present a new strategy for model order free ICA, called Snowball ICA, that obviates these issues. The algorithm collects all information for each network from fMRI data without the limitations of network scale. Using simulations and

Functional connectivity and network-based analysis of functional magnetic resonance imaging (fMRI) have revolutionized our understanding of the overall functional organization of the brain (

Model order selection (MOS – choosing the number of extracted components) in ICA is a significant methodological concern that contributes to this variation in fMRI brain network analysis (

Many solutions have been proposed to address this issue. Information-theoretic criteria (ITC) have been used in numerous signal processing applications to estimate model order, including minimum code length based minimum description length (MDL) criterion (

In the field of fMRI data processing, the Group ICA of fMRI Toolbox(GIFT^{1}; ^{2};

In this article, we propose a new ICA strategy that does not require MOS before ICA decomposition, the Snowball ICA. This approach differs from traditional ICA algorithms by iteratively collecting information about source signals from the fMRI data. First, we demonstrate how conventional MOS procedures contribute to inaccurate estimation of signal sources. Our Snowball ICA will then be compared with standard implementations using MELODIC and GIFT to determine the accuracy of spatial quality estimation using simulated data and

For conventional multi-subject fMRI data analysis, both noise free ICA (

where _{s} ∈ ^{T×M} represents the _{s}. At this stage, components that explain 90% variance are typically used to construct the reduced data. For group ICA, data from different subjects is temporally concatenated after the data reduction, as follows:

Singular value decomposition of the aggregate data is as follows:

where ^{N×M} is the aggregate matrix of all subjects fMRI data. _{k} on diagonal of ∑ is the square root of variance of each component.

Based on PCA theory, the dimensionality of the aggregate matrix after concatenation is reduced again:

where R is the model order. The formula in Eq. 4 is therefore selecting the first R vectors of ∑

Conventional ICA algorithm. In conventional ICA, the model order is determined and then data reduction is applied prior to ICA. The removal of this information as “noise” may also result in removal of meaningful signals.

Noise-free ICA is given by the following:

where ^{R×M} is the matrix to be fed into ICA unmixing program. ^{R×R} represents the unmixing matrix. ^{R×M} is the independent component matrix that is used to estimate the source matrix.

Reconstruction of time courses is done as follows:

With [_{m}_{m}^{−1}], (

Information included in the ICA algorithm is represented by the sum of the first

where

In order to study the distribution of the meaningful source information that may be contained in the PCA components labeled “noise,” a “pseudo-ICA” approach is used. In these simulations, the spatial correlation coefficient,

where _{p} represents the components matrix estimated by Pseudo-ICA. Each row of _{p} represents one component estimated by Pseudo-ICA. The order of the component matrix is exactly same as that of sources matrix. _{1:R,:} is exactly the same with what would be fed into conventional ICA algorithms. _{:,1:R} works as the unmixing matrix but without the limitation of independent constrain. Pseudo-ICA is therefore able to examine the effect of the number of PCA components on the accuracy of estimated results, neglecting the restriction of independence between components.

The correlation between the ground truth and PCA components represents the information ratio for each source signal for PCA components. Both the distribution of source information throughout the PCA components labeled “noise” and the accuracy of Pseudo-ICA components were used to evaluate performance of the PCA.

Mutual information is a criterion used to describe the dependence between two variables. Under the independence constraint of spatial ICA, components with high mutual information will not be extracted appropriately. For two discrete random variables Y_{1} and Y_{2}, the mutual information is defined as (

Once the model order was selected for the conventional ICA procedure, the PCA data reduction limits the resulting information for each source that is fed into ICA. For example,

The overall workflow of Snowball ICA is illustrated in ^{T×M}, which is organized with subjects’ fMRI data temporally concatenated without data reduction.

Snowball ICA algorithm. In order to avoid the necessity of specifying model order in conventional ICA, Snowball ICA estimates components one by one (or a few at a time) until no stable components can be further estimated from the data. After seed creation for a new component(s) to be estimated via information collection, the estimated snowball component(s) are removed from the multi-subject data and seed creation is repeated to determine the next seed pattern(s). The Snowball ICA will stop when there are no more stable components estimated during the seed creation step.

The first stage is seed creation, in which a randomly selected individual subject’s fMRI data is selected from the group fMRI dataset. Since the information about each component from different subjects will be collected in the next stage (Information Collection), the initial seed creation is not that critical. It can be created only from a single subject, from large chunks of randomly selected data, or from conventional spatial group ICA results. In order to save time, we use a randomly selected subject to create the seed in this study. Once the data have been selected, ICA is repeated many times. The most stable component from these ICAs is selected as the seed. ICASSO (ICASSO-software package^{3}) was used to evaluate component stability. The algorithm was run for ICA repeatedly with the same parameters and the same algorithm. Then hierarchical clustering was implemented to cluster all of the extracted components (

In order to ensure the stability of results, the most stable component was selected as the seed to feed into the information collection stage. _{seed} denotes the stable component estimated from Seed Creation. The Snowball ICA algorithm stops once stable components are no longer extracted.

In the second stage, information collection, the seed component is concatenated with randomly selected new scans, and these new aggregate data are then fed into the ICA unmixing algorithm. The resulting ICA component that most closely matches the seed is then used as the new seed to be concatenated with more of the original data in the next iteration. Gradually, the seed will collect all information about the signal it represents from each scan, resulting in accurate ICA components.

First, the aggregate data

where _{k} ∈ ^{T/K×M}. Then for the first block, _{1}, ICA with reference (ICA-R) (_{seed} being selected as the most stable component after repeated ICAs. Once the information belonging to the first block is collected, the estimated component is designated _{seed}. Then the next block of data, _{k}, goes through the same procedure to create an updated _{seed_ new} as reference for the next block, with remaining blocks going through the same procedure iteratively until all blocks are used. The order of processing blocks may be random. Once all blocks have been used, the resulting component is Y_{1}, the first extracted ICA spatial map (SM). This process then repeats, after removing the estimated Y_{1} component from the original data, to identify the next component Y_{2}, and so on. For each _{k}, the implementation of ICA-R is as follow:

where _{k} is unmixing matrix estimated with independence constraint. _{seed_ new} is updated seed that represents network information collected about _{seed} from _{k}. It will replace _{seed} in the next block iteration. The seed component, _{seed}, works as a reference and the information belonging to this reference network is gradually collected as more and more blocks are used. Once _{seed} is the SM of estimated component of Snowball ICA and is represented by _{snowball}.

As shown in

where _{snowball} is the accurate component estimated via the information collection stage of the Snowball ICA. There are several ways to reconstruct subject components from group ICA results (_{snowball} represents time series that is calculated with the first stage of dual regression (_{s} represents the _{s_ new} replaces _{s}in the next iteration of Snowball ICA to identify subsequent components. With this method, the information from components that has been estimated accurately will not be considered during creation of the next seed for new component estimation. Note that _{s_ new} is

The steps of Snowball ICA are as follows:

Identify the seeds _{seed} as most stable ICA component (Iq > 0.9) from a small amount of the total data.

Aggregate _{seed} with randomly selected scans chosen from all scans of all subjects to form a new data matrix and apply the ICA algorithm to decompose the new data matrix. Each scan is selected only once.

The most similar IC(s) to the seed(s) are selected as new seed(s) _{seed}.

Repeat Steps 2 and 3 until all scans of all subjects have been used and the components are estimated accurately.

Remove the resulting snowball ICA components (_{snowball}) from the multi-subject data and repeat to determine the next seed pattern. The Snowball ICA will stop when there are no more stable components estimated in Step 1.

The ICA SMs are thresholded using a Z-threshold criterion (|

For processing of real data that lack ground truth, the pros and cons of different algorithms can be compared as to whether they meet the ICA assumption. In theory, the purpose of ICA is to extract non-Gaussian signals. The stronger the non-Gaussianity of the signals, the more this assumption of ICA is satisfied. The standard measure of non-Gaussianity is kurtosis (

Independent component analysis spatial components are estimated by maximizing non-Gaussianity, therefore we use Kurt to evaluate the “goodness” of independence components. The higher the index is, the easier a network is distinguished from background (Gaussian) noise.

Simulated phantom fMRI data experiment aims to explore the reason that MOS effect the ICA decomposition. Simulated data were generated with the MATLAB toolbox, SimTB (^{4}. In SimTB, we adopt a data generation model that is largely consistent with the assumption of spatial ICA. In other words, data can be expressed as the product of activation temporal courses (TCs) and non-Gaussian sources, which we refer to as SMs. For subjects

Ground truth spatiotemporal signal sources used in the simulations for simulated phantom fMRI data.

In the traditional ICA algorithm, it is generally true that the first

For the simulated data, Matlab FastICA (

In order to test the performance of Snowball ICA, MELODIC, and GIFT under different CNR levels, a range of CNR (0.1–20) was also applied when subjects’ data generated. All the other parameters were kept exactly same. All three methods were then applied to the same dataset. In MELODIC and GIFT, the number of independent components was set to equal the number of sources (29 for the simulated data). Estimation accuracy is calculated as the average over all components of the spatial cross correlation between the independent components and their corresponding ground truth signals.

Resting state fMRI for 50 healthy unrelated subjects were utilized from the WU-Minn Human Connectome Project (HCP:

Independent component analysis components estimated using the traditional ICA algorithm and the Snowball ICA strategy were compared. To apply the FastICA algorithm to the

To assess how selection of the initial parameters affects estimation, estimation consistency was tested across model orders in seed creation and across block sizes of information collection. For seed creation, model orders of 20, 40, and 60 were used to estimate seeds from randomly selected single subject’s fMRI data or from conventional group ICA SMs. The right frontoparietal network was then selected as the seed. For information collection, the block size was set to 20, 40, and 60. The consistency of the estimated networks with these parameter combinations was then assessed by visual inspection and similarity score, as shown in

Ideally, the PCA data reduction procedure will be able to retain all information related to signal sources while removing only noise.

Correlation of PCA spatial maps with simulated signal source maps. Group PCA was applied to the simulated group data (

The second simulated signal source was selected as an example to show the Pseudo-ICA results under different model orders (

Pseudo-ICA results with different model orders. Without limitations of the spatial independence constraint of spatial ICA, the ICA estimation would be more accurate as more PCA components are retained with increasing model order (MO).

The components estimated by Matlab FastICA under different model orders are shown in

FastICA results under different model orders, MO (shown on the far left side of the figure). Snowball ICA results are shown in the row above the ground truth signals (shown in the bottom row). As model order increases for FastICA, the accuracy of estimation of individual networks increases but some networks with wide spatial distributions disappear at higher MO. In addition, FastICA only identifies a maximum of about 65% of the simulated sources, and only 9/29 at the true MO of the simulated data. In contrast, Snowball ICA estimates 26/29 ground truth signal sources, including widely distributed and highly focal networks, with high accuracy. Additional details on this figure can be found in the

The disappeared component under each model order is defined as the component that cannot be estimated under the model order but can be estimated with lower model order. This means that the disappearance is caused by independent constraint but not information insufficient of the component. As shown in

The comparison of Snowball ICA and traditional FastICA results are shown in

The performance of Snowball ICA, MELODIC, and GIFT under different CNR levels are compared in terms of spatial estimation accuracy, as shown in

Spatial estimation accuracy of Snowball, MELODIC, and GIFT across different levels of contrast-to-noise ratio (CNR). Snowball ICA demonstrates greater accuracy across the ranges of CNR that may be observed in BOLD fMRI signals as GIFT and MELODIC.

The model order was chosen from 25 to 50 for fMRI data processing (

Real

For the dataset, the estimated model order is found to be 16 using the MDL criteria, 648 using the LAP, 830 using ER_AR and 806 using ER_FM. The wide range of estimated model order when using different criteria makes MOS a big problem in real-world applications. The explained information under model order 16 is only 22.36%. Under the model order of 648, 830, or 806, based on simulation results, some components with large scale could not be extracted. Even though model order 830 is very high, the information used is only 71.34%.

The corresponding components estimated by Snowball ICA, GIFT, and MELODIC were compared by visual inspection and kurtosis. Visual inspection of components estimated by each method suggests that Snowball ICA produces SMs with the cleanest spatial distribution (

Real

Real

The results of different parameter choices are shown in

Real

The main findings of this study are that (1) traditional ICA methods have significant limitations in estimation of signal source number and spatial component quality, (2) information is lost during data reduction, and (3) large and small scale components cannot be accurately estimated with the same model order due to independence constraints. Furthermore, our proposed method, the Snowball ICA, addresses these limitations and outperforms traditional ICA algorithms on a number of metrics. We believe these findings are important for a number of reasons.

Information is lost by data reduction. When the model order is different, different numbers of PCA components are used, and the information fed into ICA unmixing program is markedly different. Therefore, the accuracy of components varies based on model order. Traditional ICA presumes that PCA data reduction reduces redundant information and only leaves useful information. This is easy to implement when the signal-to-noise ratio is high, such as mixed high-quality acoustic signals. However, because fMRI data is much more complex than acoustic signals, the removal of this information as “noise” may be obscuring meaningful signals in the brain and contributing to issues with reproducibility. While increasing the number of PCA components used increases the accuracy of components and the percentage of information used, traditional methods are unable to include all of the meaningful information. Snowball ICA is able to estimate these components accurately, represent higher proportions of information, and do so without increasing model order to unreasonable levels. Even though PCA is also applied in Snowball ICA seed creation part, the lost information would be collected in the information collection stage.

The spatial scale of estimated components decreases as the model order increases with conventional approaches. Based on information theory with independence constraints, the mutual information between components should be low (

The information collection step of Snowball can be followed with any kind of current ICA. There are two kinds of ICA algorithms in fMRI data processing. The first is GIFT, based on the ICASSO, the conventional method introduced in this article. The other is probabilistic ICA (

In addition to the application of ICA for fMRI data analyses, ICA is also widely used in the analysis and processing of electroencephalogram (EEG) and EEG-fMRI data fusion (

When Snowball ICA is applied for group wise fMRI decomposition, there are several factors that may influence the accuracy of estimation. First, differences in brain shapes of different subjects may result in misalignment of the network SMs across subjects. Second, it is logical to assume some components are stronger in some subjects/scans. So, if iteration in information collection is not stable, the order where these scans are fed into Snowball may impact the estimation of components. Third, there are a number of parameters that are important for snowball ICA (seed model order, threshold Iq, block size of information collection iterations). In this study, theses parameters are selected based on our experience. The threshold Iq was also selected as 0.9 based on our experience and on previous published studies.

Model order selection is also a crucial step in temporal ICA. Theoretically, Snowball ICA can also be applied for temporal ICA. However, it is not clear if the improvements we observe with spatial ICA will be realized with temporal ICA. In the case of spatial ICA, Snowball is effective at estimating networks without specifying a model order because it utilizes more information in the data (e.g., it obviates loss of information during data reduction with PCA) and is better able to identify networks that share large degree of mutual information with the noise. It is likely that temporal ICA will have different factors that may impact whether improvements in estimation would be seen with Snowball ICA.

While Snowball ICA is time-consuming due to the iterative process, it will be important for researchers determine the cost-benefit based on their hypothesized source signal scale. Further research will be necessary to explore ways to decrease computational costs for Snowball ICA. Besides, Snowball ICA is an empirical strategy that combining the conventional ICA and iteration of ICA with reference to solve the mode order problem. New algorithm with overall theoretical principles to solve the model order problem worth further investigation. Due to the lack of acknowledge of ground truth in real-world application, even though Snowball ICA can be identified with better performance compared with conventional ICA in some sense, more comparison of them in terms of neuroscience such as assessing relative heritability, or behavioral prediction accuracy, or split-half reproducibility is needed.

In this article, we present a novel strategy, called Snowball ICA, to solve the MOS problem of ICA for applications to fMRI data processing. Choice of model order for ICA, and the PCA data reduction step prior to the ICA, directly impacts how much variance in the data is utilized for ICA estimation. In addition, shared mutual information between estimated sources and noise varies with the network spatial scale, making optimization of model order a challenging problem. Snowball ICA ultimately utilizes much more information contained in the data for the ICA decomposition, and is able to estimate signal sources that share mutual information with noise, which results in improved network estimation when compared with traditional ICA. The effectiveness of the proposed method is demonstrated through extensive simulations and by application to

Publicly available datasets were analyzed in this study. This data can be found here:

GH, FC, and LN: conceptualization. GH: methodology, software, formal analysis, and writing—original draft preparation. GH, SA, and BF: validation. GH and LN: investigation. GH, AW, and LN: writing—review and editing. AW: visualization. FC and LN: supervision and project administration. FC, LN, and BF: funding acquisition. All authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at: