Introducing Region Based Pooling for handling a varied number of EEG channels for deep learning models

Introduction: A challenge when applying an artificial intelligence (AI) deep learning (DL) approach to novel electroencephalography (EEG) data is the DL architecture's lack of adaptability to changing numbers of EEG channels. That is, the number of channels can vary neither in the training data nor upon deployment. Such highly specific hardware constraints put major limitations on the clinical usability and scalability of the DL models. Methods: In this work, we propose a technique for handling such varied numbers of EEG channels by splitting the EEG montages into distinct regions and merging the channels within the same region to a region representation. The solution is termed Region Based Pooling (RBP). The procedure of splitting the montage into regions is performed repeatedly with different region configurations, to minimize potential loss of information. As RBP maps a varied number of EEG channels to a fixed number of region representations, both current and future DL architectures may apply RBP with ease. To demonstrate and evaluate the adequacy of RBP to handle a varied number of EEG channels, sex classification based solely on EEG was used as a test example. The DL models were trained on 129 channels, and tested on 32-, 65-, and 129-channel versions of the data using the same channel positions scheme. The baselines for comparison were zero-filling the missing channels and applying spherical spline interpolation. The performances were estimated using 5-fold cross validation. Results: For the 32-channel system version, the mean AUC values across the folds were: RBP (93.34%), spherical spline interpolation (93.36%), and zero-filling (76.82%). Similarly, on the 65-channel system version, the performances were: RBP (93.66%), spherical spline interpolation (93.50%), and zero-filling (85.58%). Finally, the 129-channel system version produced the following results: RBP (94.68%), spherical spline interpolation (93.86%), and zero-filling (91.92%).
Conclusion: In conclusion, RBP obtained similar results to spherical spline interpolation, and superior results to zero-filling. We encourage further research and development of DL models in the cross-dataset setting, including the use of methods such as RBP and spherical spline interpolation to handle a varied number of EEG channels.


Introduction
Recent advancements in artificial intelligence (AI) have opened up new opportunities for the fields of cognitive neuroscience and clinical brain health research. In this context, the EU Horizon 2020 funded project AI-Mind (www.ai-mind.eu) has been established, which aims at developing AI-based tools to estimate the risk of dementia for people affected by mild cognitive impairment. The project collects a comprehensive set of biomarkers, including blood samples, sociodemographic information, digital cognitive test scores, and electroencephalography (EEG) data. A combination of traditional machine learning (ML) and deep learning (DL)-based algorithms will be employed. While the former commonly provides improved transparency and integration of domain knowledge, the latter has the capacity to find patterns and extract features in complex and unstructured data beyond what can be obtained by hand-crafted features.
DL is a method in AI with potential to significantly transform healthcare services (Hinton, 2018). By processing data in multiple layers, DL learns representations with different levels of abstraction. Breakthroughs of DL include processing of images, video, speech, audio, and text (LeCun et al., 2015). Despite the progress in research and development, there are still significant gaps to be filled for deployment of AI in clinical practice, such as mitigating discriminatory bias and improving generalization to new populations (Kelly et al., 2019; Chen et al., 2023). In particular, AI systems trained on datasets with an underrepresentation of marginalized groups have an elevated risk of bias toward those groups (Rajpurkar et al., 2022). Furthermore, AI algorithms trained on data generated by a single system (e.g., when all imaging data are collected using the same camera with fixed settings) may exhibit single-source bias, resulting in a decrease in performance on inputs collected from other systems (Rajpurkar et al., 2022). For the AI-Mind project, such biases may pose challenges requiring particular considerations. While about two-thirds of dementia cases are in low-income and middle-income countries (LMICs), extrapolating predictive models developed in high-income countries to LMICs is not always feasible (Stephan et al., 2020). A technical prerequisite for extrapolating models to LMICs is the availability of hardware needed for data acquisition. As a neuroimaging modality, EEG is low-cost and mobile compared to magnetic resonance imaging and magnetoencephalography. Moreover, it does not require a dedicated isolated room. Extrapolation of EEG biomarkers to LMICs is thus not hindered by difficulties in installation of the acquisition hardware.
The recent progress of DL has significantly increased its relevance for EEG data analysis (Roy et al., 2019). Domains of application include emotion recognition (Houssein et al., 2022), driver drowsiness (Stancin et al., 2021; Mohammed et al., 2022), classification of alcoholic EEG (Farsi et al., 2021), epileptic seizure detection (Ahmad et al., 2022), mental disorders (de Bardeci et al., 2021), schizophrenia (Oh et al., 2019), major depressive disorder and bipolar disorder detection (Yasin et al., 2021), motor imagery and other brain computer interface (BCI)-related problems (Lotte et al., 2018; Abo Alzahab et al., 2021). Despite the attention DL has received in EEG research, little research has focused on issues relating to the cross-dataset setting and generalization (Wei et al., 2022). As AI-Mind will use EEG signals for its algorithm development, enabling our tools for deployment on multiple data acquisition systems and mitigating discriminatory bias is a necessity.
However, a common limitation of many existing DL architectures, specific to EEG, is their inherent inability to handle a varied number of channels as input data (Wei et al., 2022). This lack of compatibility conflicts with the real-world high variety of EEG hardware and hinders training and deployment on heterogeneous datasets where both the number of electrodes and their positions on the scalp may vary. Hence, this challenge not only prevents integration of DL models into diverse EEG setups but also limits the inclusion of larger sample sizes as well as more heterogeneous and representative data. Moreover, evidence from clinical neurology research suggests that the number of channels used during EEG recording may have a significant impact on the data's ability to capture spatially limited phenomena (Hatlestad-Hall et al., 2023). The inability to handle this diversity originates from tensors such as matrices and vectors requiring fixed dimensions to be compatible from a linear algebraic perspective. To address this technical issue, this work aims at introducing a simple methodological framework which can be used in combination with current and future DL models to handle a varied number of electrodes. Here, two methods for scaling the data to fit into the DL model are used as baselines for comparison: (1) zero-filling missing channels and (2) applying spherical spline interpolation (Perrin et al., 1989).
There exist several techniques which may leverage external datasets to improve DL models, which we hypothesize will play a significant role in cross-dataset learning and generalization. Approaches such as unsupervised and self-supervised learning may be utilized even in the absence of the target of interest. Improvements may be in terms of, e.g., performance or generalization, and are considered to play an important role for data efficiency of DL (Hinton, 2018; Hendrycks et al., 2019; Banville et al., 2021). In the field of EEG research, Kostas et al. (2021) obtained improved results on multiple downstream datasets by using contrastive self-supervised learning on a large dataset for pretraining. Furthermore, Banville et al. (2021) successfully applied self-supervised learning to sleep staging and pathology detection. Another approach on heterogeneous EEG datasets is to use transfer learning, shown in the BEETL competition (Wei et al., 2022). Furthermore, a desired outcome of AI-Mind is to characterize brain networks from EEG data. While metrics from neuroscientific literature have known cognitive relevance (Stam et al., 2006), a DL methodology to obtain features of similar neurophysiological meaning seems non-trivial. This is due to features of DL being learned in a data-driven manner rather than human defined to capture the underlying neurophysiological phenomena. We hypothesize, however, that feature learning and pre-trained models may be viable alternatives.
The intended purposes for developing methods for handling a varied number of channels with possibly different positions on the scalp are (1) to enable the application of DL models on a range of existing and varied EEG systems. For clinical implementation, a highly desired property is to have a method which works on the EEG systems currently in use at different clinical centers around the world. The number of channels and channel locations are indeed varied, meaning that it is a necessity to handle this diversity, to maximize outreach and clinical usefulness; (2) to be able to pretrain or perform representation learning on heterogeneous and large amounts of data. There are many open-source datasets from a range of nationalities, pathologies, age groups, and cohorts. To generalize across such data, methods including pre-training and representation learning on multiple and heterogeneous datasets may be a step in the right direction, as they can lead to more robust and generalized features. Improving the robustness and generalization may in turn improve the fairness and equity of the developed AI models. This relevance extends to all medical use and integration of DL in EEG, including the generation of synthetic data (Goodfellow et al., 2014) and digital twins (Grieves and Vickers, 2017), and enabling of simulation techniques for improved clinical treatment selection. Indeed, developing methods to facilitate the evolution of such precision medicine approaches is essential. This study does not carry out such pre-training or representation learning but introduces a framework for enabling it to be performed at a larger scale, with a varied number of electrodes. Instead, this study conducts an initial evaluation to ascertain the efficacy or inadequacy of the framework.
Our framework is designed to be model agnostic, meaning that both current and future DL architectures can apply it with ease. The code is publicly available and may be used to develop customized implementations of the framework, or to combine it with other DL architectures. Furthermore, we aim to experimentally demonstrate that by applying our framework, the algorithm performance in itself remains the same.

Materials and methods
In this section, the dataset, methods, models, and experiments are described. A high-level overview of the workflow is provided in Figure 1.

2.1 Data
The data used for this study is an open-source dataset from Child Mind Institute (Alexander et al., 2017). It contains a large high-density EEG (129 electrodes) dataset from subjects aged 5-21 years, including male and female subjects, with varied brain pathologies. The objective of the DL models was to classify the sex of a subject, given the EEG data. After removing samples which did not fulfill the inclusion criteria for data quality (see Section 2.1.1), the dataset was balanced by downsampling the more abundant class, resulting in a final dataset with 1,788 subjects. Only the resting-state EEG data files were extracted. The first 30 s of the recordings were skipped, as the first parts of the EEG are more likely to contain unwanted artifacts. The subsequent 10 s were used as input for the models. Only a single 10 s window was used per subject, and the splitting of data was thus made on subject level. The sampling frequency was kept at 500 Hz as in the original dataset.
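As a small illustration of the windowing described above, the following sketch computes the sample indices of the 10 s input window at 500 Hz. The helper name `window_indices` and its parameter names are hypothetical and are not taken from the published code.

```python
def window_indices(sfreq_hz=500, skip_s=30, length_s=10):
    """Return (start, stop) sample indices of the input window.

    Hypothetical helper: skips the first `skip_s` seconds and keeps the
    following `length_s` seconds; `stop` is exclusive.
    """
    start = int(skip_s * sfreq_hz)
    stop = start + int(length_s * sfreq_hz)
    return start, stop


start, stop = window_indices()
num_samples = stop - start  # T, the number of time samples per channel
```

With the values stated in the text, the window covers samples 15,000 to 20,000 (exclusive), i.e., T = 5,000 time samples per channel.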

2.1.1 Preprocessing
The raw data was preprocessed using an automated data cleaning pipeline developed in MATLAB, using functions from the EEGLAB toolbox (Delorme and Makeig, 2004). Channels with low-quality data were removed by iterative exclusion of signals with amplitude standard deviation SD > 75 µV or no amplitude variation at all. The EEG file was rejected if the number of excluded channels exceeded 39 (>30%). Line artifacts were removed with Zapline (de Cheveigné, 2020), and the signals were band-pass filtered between 1 and 45 Hz. Excluded channels were replaced with interpolated signals to ensure data dimension consistency. The channels were re-referenced to the average of all scalp channels. The pipeline is available at GitHub.
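The channel and file rejection rules can be sketched as follows. This is only an illustration in Python, not the MATLAB pipeline itself: it applies the thresholds in a single pass (the original exclusion is described as iterative), uses the population standard deviation as an assumption, and the names `flag_bad_channels` and `reject_file` are hypothetical.

```python
import statistics


def flag_bad_channels(data_uv, sd_threshold_uv=75.0):
    """Return names of channels with SD above the threshold or no variation.

    data_uv: dict mapping channel name -> list of amplitudes in microvolts.
    """
    bad = []
    for name, signal in data_uv.items():
        sd = statistics.pstdev(signal)  # assumption: population SD
        if sd > sd_threshold_uv or sd == 0.0:
            bad.append(name)
    return bad


def reject_file(num_bad_channels, max_bad_channels=39):
    """A file is rejected if more than 39 of the 129 channels are excluded."""
    return num_bad_channels > max_bad_channels


# Toy example with made-up amplitudes.
example = {
    "E1": [0.0, 10.0, -10.0, 5.0],       # acceptable variation
    "E2": [3.0, 3.0, 3.0, 3.0],          # flat signal
    "E3": [0.0, 200.0, -200.0, 100.0],   # SD well above 75 µV
}
bad = flag_bad_channels(example)
```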

2.2 Inception network
The Inception network is a convolutional neural network (CNN) based architecture, which is the main building block of InceptionTime. Here, the Inception network is briefly described, and for further details on the architecture, the reader is referred to the original study (Ismail Fawaz et al., 2020).
An Inception network is composed of multiple Inception modules, with linear shortcut connections for every third Inception module. A key component of the Inception module is the bottleneck layer, which effectively computes linear combinations of the input time series. Furthermore, the Inception module applies filters of different lengths simultaneously on the same input time series, and resulting feature maps are aggregated by concatenation. After passing the data through all Inception modules, global average pooling is performed in the temporal dimension. Finally, while the original Inception network used a fully connected layer with softmax activation, this was changed to a single fully connected layer with sigmoid activation (Ismail Fawaz et al., 2020).
The hyperparameters of our Inception network were set as described in the original study. This includes a depth of six Inception modules, and 32 filters for all convolutional kernels in all Inception modules (Ismail Fawaz et al., 2020).

2.3 Methods for handling a varied number of channels
Three methods for handling a varied number of channels were tested on a binary classification problem, sex prediction. The three methods were (1) zero-filling, (2) spherical spline interpolation (Perrin et al., 1989), and our suggested new method (3) Region Based Pooling (RBP). Inception network (Ismail Fawaz et al., 2020) was the DL model used after zero-filling, interpolation, or applying RBP, with the exception that the final layer used scalar output and sigmoid as activation function for predictions.
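The zero-filling baseline can be illustrated with a short sketch: a recording with fewer channels is arranged in the channel order of the full system, and every missing channel is replaced by zeros. The function name `zero_fill` and the dictionary-based data layout are assumptions made for this example only.

```python
def zero_fill(recording, full_channel_names):
    """Arrange a recording in the full channel order, zero-filling the gaps.

    recording: dict mapping channel name -> list of samples (equal lengths).
    full_channel_names: channel order expected by the model.
    """
    num_samples = len(next(iter(recording.values())))
    return [
        list(recording[name]) if name in recording else [0.0] * num_samples
        for name in full_channel_names
    ]


# Toy example: the model expects four channels, the recording has two.
full_system = ["E1", "E2", "E3", "E4"]
recording = {"E1": [1.0, 2.0], "E3": [3.0, 4.0]}
filled = zero_fill(recording, full_system)
```

The model input thus keeps a fixed shape, at the cost of feeding uninformative rows for the missing channels.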
Preprocessing pipeline (Section 2.1.1): https://github.com/hatlestad-hall/prep-childmind-eeg

2.4 Region based pooling
RBP splits the topology of the EEG montage into regions, as illustrated in Figure 2. The channels within a single region are pooled into one or more region representations, and hence the name Region Based Pooling. To minimize the loss of information, multiple splits with different region formations are performed. RBP introduces three new optimization problems: (1) how to split the EEG montage into regions (both the number of montage splits and the algorithm separating the regions), (2) how to pool the channels within the same region, and (3) how to merge the outputs of the different montage splits. The following two subsections intend to illustrate how the first two problems can be addressed and are meant as examples of implementation.
All RBP models in the experiments of this study merged the outputs of the montage splits by concatenation. Furthermore, all channels within the same region were merged to a single region representation. Finally, all region representations were normalized by subtracting the mean and dividing by the standard deviation in the temporal dimension.

2.4.1 Method for splitting into regions
A montage split is a region-based partitioning of the EEG montage. The set of all montage splits is denoted $\{M_1, M_2, \ldots, M_n\}$, where $n$ is the number of montage splits. Each montage split contains multiple regions, where the regions may or may not overlap. Furthermore, a montage split may or may not cover the entire EEG montage. Given a channel system $C$ which is compatible with the partitioning, the $j$-th region of the $i$-th montage split, $R_j^{(i)} \in M_i$, contains the channels $R_j^{(i)} \supset R_j^{(i)} \cap C$, where $R_j^{(i)} \cap C$ denotes the set of channels of channel system $C$ positioned within the boundaries of $R_j^{(i)}$. The algorithm used in all experiments for splitting the montage into regions is illustrated in Figure 3. It follows an iterative procedure and was designed to not have overlapping regions. Furthermore, all regions are used for all montage splits. The algorithm requires one to fix a split vector $\mathbf{k} = (k_1, k_2, \ldots, k_p)$, where the elements of $\mathbf{k}$ and $p$ are design choices/hyperparameters. As a pre-step of the algorithm, all channel positions are mapped to 2D coordinates. Thereafter, the centroid of the channel positions is calculated, and a random angle is generated. With the centroid and the random angle as starting point and angle, $k_1 - 1$ angles are computed such that the angles split the channels into $k_1$ equally sized regions. Here, the size of a region refers to the number of channels within it. For all newly generated regions, the same procedure is repeated: (1) compute the centroid, (2) generate a random angle, and (3) generate $k_2 - 1$ angles such that $k_2$ equally sized regions are formed. This iterative approach is executed either $p$ times, or until the number of channels in the regions is too low, as defined by the stopping criterion min_nodes.
For the experiments, there were seven different split vectors, [...] probabilities. The stopping criterion was one of the hyperparameters for grid search and included min_nodes ∈ {1, 2, 3}.

FIGURE 2
Multiple montage splits may be performed, and the number of montage splits equals two in this figure, $M_1$ and $M_2$. If there is at least one channel in all used regions, the mapping from channels into region representations can be made. This is illustrated as channel system A and channel system B have unequal numbers of channels with different channel locations, and they can both obtain region representations. After pooling channels into region representations, the region representations are stacked/row concatenated. The sequence of stacking represents an arbitrarily chosen design.

FIGURE 3
Example of how the EEG montage may be split into regions. In this example, the split vector was set to $\mathbf{k} = (5, 3)$. This can be observed, as the montage was first split into five regions, followed by splitting those into three regions.
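The splitting procedure of this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: channel positions are assumed to be already projected to 2D, the function names `split_region` and `montage_split` are hypothetical, and the stopping rule (split only if every child region would hold at least `min_nodes` channels) is a simplification, since the exact rule is not spelled out in the text.

```python
import math
import random


def split_region(channels, k, rng):
    """Split `channels` (name -> (x, y)) into k angular sectors of near-equal size."""
    n = len(channels)
    cx = sum(x for x, _ in channels.values()) / n  # centroid of the region
    cy = sum(y for _, y in channels.values()) / n
    start = rng.uniform(0.0, 2.0 * math.pi)  # random starting angle

    def angle_from_start(name):
        x, y = channels[name]
        return (math.atan2(y - cy, x - cx) - start) % (2.0 * math.pi)

    # Sorting by angle and cutting into k consecutive chunks amounts to
    # choosing k - 1 further angles that give regions with equal channel counts.
    ordered = sorted(channels, key=angle_from_start)
    bounds = [(i * n) // k for i in range(k + 1)]
    return [
        {name: channels[name] for name in ordered[bounds[i]:bounds[i + 1]]}
        for i in range(k)
    ]


def montage_split(channels, split_vector, min_nodes, rng):
    """Apply the splits k_1, ..., k_p in turn; returns a list of regions."""
    regions = [channels]
    for k in split_vector:
        next_regions = []
        for region in regions:
            # Assumed stopping rule: keep the region intact if it is too small.
            if len(region) // k >= min_nodes:
                next_regions.extend(split_region(region, k, rng))
            else:
                next_regions.append(region)
        regions = next_regions
    return regions


# Toy montage: 30 channels at made-up 2D positions.
channels = {
    f"Ch{i}": (math.cos(i) * (1 + i % 3), math.sin(i) * (1 + i % 3))
    for i in range(30)
}
regions = montage_split(channels, split_vector=(5, 3), min_nodes=1, rng=random.Random(0))
```

Calling `montage_split` repeatedly with different random states would give several montage splits, each with non-overlapping regions that together cover all channels.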

2.4.2 Pooling operations
To enable compatibility with a varied number of channels with possibly different channel positions, it is a prerequisite to define pooling mechanisms which can take as input and handle multivariate time series of different dimensions within the regions. That is, the mechanisms applied within the regions must be able to map a varied number of channels to a single region representation. Finding sophisticated mechanisms with this property may be crucial for RBP. This subsection presents several approaches for pooling mechanisms.

2.4.2.1 Average
The first pooling mechanism is to merge the channels within a region by computing their mean over the channel dimension. This offers a simple and time-efficient method and aggregates the channels with equal contributions for computing region representations.
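A minimal sketch of average pooling, followed by the temporal normalization applied to all region representations, is given below. The function names are hypothetical, and the use of the population standard deviation is an assumption, as the text does not specify which estimator was used.

```python
import statistics


def average_pool(region_channels):
    """Mean over channels at each time step.

    region_channels: list of equal-length time series, one per channel.
    """
    return [sum(values) / len(values) for values in zip(*region_channels)]


def normalize(series):
    """Subtract the temporal mean and divide by the temporal SD."""
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)  # assumption: population SD
    return [(value - mean) / sd for value in series]


# Toy region with two channels and three time samples.
pooled = average_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
region_representation = normalize(pooled)
```

Because the mean is taken over however many channels fall inside the region, the output length depends only on the number of time samples, not on the channel count.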

2.4.2.2 Channel attention
A second pooling mechanism is to select the key channels by first assigning an importance score, and secondly merging the channels by computing a weighted average based on the importance scores. Mathematically, this may be accomplished by defining a function $g : \mathbb{R}^{1 \times T} \to \mathbb{R}$, where $T$ denotes the number of time samples, applied on all time series within the region, and using the values obtained to compute coefficients of a linear combination, as illustrated in Figure 4. Applying $g$ to each channel in a region gives an importance scalar for each channel, which is subsequently concatenated and passed to a softmax activation function, giving the channel attention vector of the $i$-th montage split and $j$-th region, $a^{(i,j)} \in \{q \in (0, 1)^{|R_j^{(i)} \cap C|} : \|q\|_1 = 1\}$. The vectors $a^{(i,j)}$ have the properties that the entries are positive and sum to one due to the softmax activation function. After computing $a^{(i,j)}$, the channels of the $i$-th montage split and $j$-th region are pooled by weighted averaging, $f^{(i,j)}(X_{R_j^{(i)} \cap C}) = (a^{(i,j)})^\top X_{R_j^{(i)} \cap C}$, where $X_{R_j^{(i)} \cap C} \in \mathbb{R}^{|R_j^{(i)} \cap C| \times T}$ are the EEG time series of all channels within the region.

FIGURE 4
Illustration of channel attention mechanism. An importance scalar is computed for each channel, and the attention vector is computed by applying softmax on a concatenation of these. The elements of the attention vector are used as coefficients to compute a linear combination of the channels.

ROCKET-based features: Random Convolutional Kernel Transform (ROCKET) (Dempster et al., 2020) is a highly efficient time series classifier, which obtained high performance in a short time frame in a multivariate time series classification bake off (Ruiz et al., 2021). For feature extraction, ROCKET applies a large number of diverse, random and non-trainable convolutional kernels, and computes the proportion of positive values and maximum value of the resulting feature maps. This was adopted as a pooling mechanism, where the proportion of positive values and max values of the feature maps were used for computing the importance score of a channel. From the num_kernels · 2 features, a trainable fully connected module with scalar output and specific to the $i$-th montage split and $j$-th region, $\mathrm{FC}^{(i,j)} : \mathbb{R}^{\text{num\_kernels} \cdot 2} \to \mathbb{R}$, was applied. After computing the importance scores for all time series in the region, a softmax activation function was applied to obtain positive coefficients only, which sum to one. A desirable property of using non-trainable convolutional kernels is that the output feature maps (along with proportion of positive values and max values) are computed only once per subject, prior to training. Therefore, the computational cost of a large number of convolutions may be justified by the fact that they can be pre-computed.
The number of convolutional kernels was set to 1000, and the maximum receptive field in the temporal dimension to 250, which corresponds to half a second with the given sampling rate. This was based on computational feasibility, taking both time consumption and memory usage on limited hardware into account. Furthermore, no padding was used, in contrast to the original implementation. The ROCKET features were pre-computed prior to training, as the convolutional kernel weights were frozen, and the proportion of positive values and max values of the feature maps were thus constant per channel and subject during training. Furthermore, the ROCKET kernels were shared across all regions and montage splits to reduce runtime. The FC modules mapping the num_kernels · 2 features to a single coefficient used only a single fully connected layer with linear activation function. That is, for every subject, the importance score of the $k$-th channel in the $j$-th region of the $i$-th montage split, prior to softmax normalization, was computed as $g^{(i,j)}(x_k) = \mathrm{FC}^{(i,j)}(z_k) = w_{i,j}^\top z_k$, where $w_{i,j} \in \mathbb{R}^{\text{num\_kernels} \cdot 2}$ is a trainable weight vector of the $j$-th region of the $i$-th montage split, $x_k \in \mathbb{R}^T$ is the time series of the $k$-th channel, and $z_k \in \mathbb{R}^{\text{num\_kernels} \cdot 2}$ is the vector of pre-computed ROCKET features of channel $k$.
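The channel attention pooling described above can be sketched as follows, assuming that the per-channel feature vectors $z_k$ have already been computed. The ROCKET transform itself is not reproduced here; the feature values, the weight vector, and the function names are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def channel_attention_pool(region_channels, features, weights):
    """Pool the channels of one region with a softmax-weighted average.

    region_channels: list of K equal-length time series x_k.
    features: list of K pre-computed feature vectors z_k.
    weights: weight vector w of the region (trainable in the actual model).
    """
    # Importance score of channel k: inner product of w and z_k.
    scores = [sum(w * z for w, z in zip(weights, z_k)) for z_k in features]
    attention = softmax(scores)  # positive entries that sum to one
    pooled = [
        sum(a * x for a, x in zip(attention, values))
        for values in zip(*region_channels)
    ]
    return attention, pooled


# Toy region: two channels with identical features get equal attention.
attention, pooled = channel_attention_pool(
    region_channels=[[1.0, 1.0], [3.0, 5.0]],
    features=[[1.0, 0.0], [1.0, 0.0]],
    weights=[0.3, -0.2],
)
```

In the actual model the weights would be learned by backpropagation, whereas the features stay fixed because the convolutional kernels are frozen.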

2.4.2.3 Continuous channel attention
Another possible pooling mechanism is to apply continuous channel attention, which is illustrated in Figure 5. In the channel attention mechanism explained in Section 2.4.2.2, it is impossible for the model to adapt its channel attention in time. Therefore, continuous channel attention is implemented by defining a function $g : \mathbb{R}^{1 \times T} \to \mathbb{R}^{1 \times T}$, applying $g$ to every channel, and applying a softmax activation function in the channel dimension. That is, what was in Section 2.4.2.2 an attention vector of the $j$-th region in the $i$-th montage split, $a^{(i,j)} \in \{q \in (0, 1)^{|R_j^{(i)} \cap C|} : \|q\|_1 = 1\}$, is replaced by an attention matrix $A^{(i,j)} \in \{Q \in (0, 1)^{|R_j^{(i)} \cap C| \times T} : \|Q_{:,t}\|_1 = 1 \ \forall t \in \{1, 2, \ldots, T\}\}$, where all elements are positive and each column sums to 1 due to the softmax activation function. The region representation of the $j$-th region in the $i$-th montage split is subsequently computed as $f^{(i,j)}(X_{R_j^{(i)} \cap C}) = \mathbf{1}^\top (A^{(i,j)} \odot X_{R_j^{(i)} \cap C})$, where $\mathbf{1}$ is a vector of ones, $\odot$ is the Hadamard product (element-wise multiplication), and $X_{R_j^{(i)} \cap C} \in \mathbb{R}^{|R_j^{(i)} \cap C| \times T}$ is the EEG data of the channels in $R_j^{(i)} \cap C$. This formulation is equivalent to applying a unique attention vector per time step.

FIGURE 5
Illustration of continuous channel attention. An importance scalar is computed for every channel and time step, and the attention matrix is computed by applying softmax on a concatenation of these in the channel dimension. The attention matrix is used to compute a linear combination of the channels per time step. That is, a new linear combination is computed for each time step, allowing the pooling mechanism to shift its attention through time.
In the experiments, an Inception network (Ismail Fawaz et al., 2020) was used as g. The depth of the architecture was set to two Inception modules, and the number of filters was set to two for all convolutional kernels and Inception modules. These hyperparameters were set smaller than in the original study due to high memory consumption.
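The per-time-step weighting can be sketched as below. The score matrix stands in for the output of g (an Inception network in the experiments), which is not reproduced here; the scores and the function names are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def continuous_attention_pool(region_channels, score_rows):
    """Pool channels with a separate attention vector at every time step.

    region_channels: list of K equal-length time series.
    score_rows: score_rows[k][t] is the importance of channel k at time t.
    """
    pooled = []
    for t, values in enumerate(zip(*region_channels)):
        # Softmax over the channel dimension gives one column of the
        # attention matrix; the column sums to one.
        column = softmax([row[t] for row in score_rows])
        pooled.append(sum(a * x for a, x in zip(column, values)))
    return pooled


# Toy region: equal attention at t = 0, almost all attention on the
# first channel at t = 1.
pooled = continuous_attention_pool(
    region_channels=[[1.0, 1.0], [3.0, 3.0]],
    score_rows=[[0.0, 50.0], [0.0, -50.0]],
)
```

The example shows the attention shifting over time: the pooled value is the plain average at the first time step and close to the first channel's value at the second.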

2.4.2.4 Region based pooling with head region
With the pooling mechanisms described in Sections 2.4.2.1, 2.4.2.2 and 2.4.2.3, RBP is not able to tailor the region representations based on other regions. As this may be an important property to possess, RBP can be extended to Region Based Pooling with a Head Region, which is illustrated in Figure 6. A head region is selected, which exhibits the property of being able to influence the aggregation of channels in non-head regions.
The region representation is computed as an aggregation of the channels, given a vector embedding of the head region. For every montage split [...] where $s$ is the search vector embedding of the head region $H^{(i)}$ with relevance to region $R_j^{(i)}$, [...] is the function mapping the channels of the head region to $s$, and $\mathrm{AGG}^{(i,j)}$ is an aggregation function. The vector embedding of the head region may thus depend on the region to compute a region representation of. The motivation of this is that the head region systematically searches for certain characteristics in the other regions, and such characteristics may depend on the given regions.
The region representation of the head region was computed as in ROCKET channel attention, introduced in Section 2.4.2.2. The search embeddings $s$ [...] where $\sigma$ is the softmax activation function computed in the channel dimension, $Z_{H^{(i)} \cap C} \in \mathbb{R}^{\text{num\_kernels} \cdot 2 \times |H^{(i)} \cap C|}$ is a concatenation of the ROCKET features, and $W^{(1)}_{i,j}$ and $W^{(2)}_{i,j}$ are trainable weight matrices of the search embedding function of region $R_j^{(i)}$. The use of softmax allows the search embedding to weight the different channels in the head region differently for each ROCKET feature. The region representations of the regions $R_j^{(i)} \in M_i \setminus \{H^{(i)}\}$ are computed per subject as [...] with $a_k$ being the elements of $a$. Note that the same embedding functions ($f_1^{(i,j)}$) are used on the channels of $R_j^{(i)} \in M_i \setminus \{H^{(i)}\}$ as on the channels of the head region $H^{(i)}$. This may be beneficial, as the embeddings share the same space, and computing similarity may thus be more meaningful.
For the experiments in this study, the number of rows in the weight matrices $W^{(1)}_{i,j}$ and $W^{(2)}_{i,j}$ (and hence the dimensionality of the search vector embeddings $s$) was set to 64, for all $i$ and $j$.

FIGURE 6
Region based pooling with a head region. The head region may influence how the channels in the non-head region should be aggregated. This is done by passing an embedding vector of the head region to the aggregation functions. By passing different embeddings to the different non-head regions, the head region is allowed to search for different features in the different spatial locations.

2.5 Experiments
All models were implemented using PyTorch (Paszke et al., 2019), version 1.10.1+cu113. The hardware used was a computer equipped with an NVIDIA GeForce RTX 3060 12GB GPU. The code is publicly available on GitHub (https://github.com/thomastveitstol/RegionBasedPoolingEEG).
All models were run with learning rate set to 0.0001. The maximum number of epochs was set to 50, except for RBP with continuous channel attention, which used 20 epochs due to high time consumption. The batch size was mainly set to 16, although some models required a smaller batch size due to memory constraints. The exceptions are listed in Table 1. Experiments using zero-filling and spherical spline interpolation were run with batch size set to 4, 8, 16, and 32, to ensure that potential improvements were not due to differences in batch size. Adam (Kingma and Ba, 2015) and binary cross-entropy (computed from logits for improved numerical stability) were used as optimization technique and loss function, respectively.
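To illustrate why computing the loss from logits improves numerical stability, the sketch below implements one standard stable formulation of binary cross-entropy on a single logit in plain Python. It is an illustration of the general technique rather than the library code used in the experiments, and the function name is hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy for one example, computed from the raw logit.

    Algebraically equal to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))],
    but rearranged so that exp() is only evaluated on non-positive inputs.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


moderate = bce_with_logits(0.3, 1.0)
extreme = bce_with_logits(1000.0, 0.0)  # finite, no overflow
```

A naive implementation would first evaluate the sigmoid and then take its logarithm, which for extreme logits rounds to log(0) and fails, whereas the rearranged form stays finite.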
For all experiments, a 5-fold cross validation strategy was carried out. For every fold, the 4 folds not used for testing were split into training and validation 75/25. The training data was used to optimize the trainable parameters of the DL models, whereas the validation data was used to select the epoch at which to stop. During a single fold, only the model parameters which obtained the highest area under the receiver operating characteristics curve (AUC) on the validation set (computed as the mean performance on the 32-, 65-, and 129-channel versions of the channel system) were used when testing on the test data fold.
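The data splitting scheme can be sketched as follows. The shuffling, the seed, and the absence of stratification are assumptions of this example, since the text only specifies subject-level 5-fold cross validation with a 75/25 split of the remaining folds; the function name is hypothetical.

```python
import random


def cross_validation_splits(subject_ids, num_folds=5, train_fraction=0.75, seed=0):
    """Return one (train, validation, test) tuple of subject IDs per fold."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    folds = [ids[i::num_folds] for i in range(num_folds)]

    splits = []
    for i, test in enumerate(folds):
        # Pool the folds not used for testing, then split them 75/25.
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        rng.shuffle(rest)
        cut = int(train_fraction * len(rest))
        splits.append((rest[:cut], rest[cut:], test))
    return splits


# Toy example with 100 subject IDs.
splits = cross_validation_splits(range(100))
```

Since each subject contributes a single window, splitting on subject IDs ensures that no subject appears in more than one of the training, validation, and test sets within a fold.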
To evaluate the sensitivity with respect to two new hyperparameters introduced by RBP, a grid search was made for all pooling mechanisms. The first hyperparameter was min_nodes, which is the smallest number of channels allowed in a region in the 32-channel version of the channel system. The smaller the min_nodes, the smaller the regions are allowed to be when splitting the montage. The second hyperparameter was num_montage_splits, which is the number of montage splits performed. The grid search was carried out with min_nodes ∈ {1, 2, 3} and num_montage_splits ∈ {5, 10, 25, 50}, with the exception of RBP using continuous channel attention, which was restricted to num_montage_splits ∈ {5, 10, 25} due to memory limitations.

Results
Figures 7-9 show the results of the grid search for the different pooling methods on the 32-, 65-, and 129-channel versions of the channel system, respectively. The number in each entry represents the average performance estimate on the test sets after conducting a 5-fold cross validation. The results show that the performance is more sensitive to the selected hyperparameters for the low-resolution channel systems than for the 129-channel system version. In particular, RBP seems to favor smaller regions per montage split for the downsampled channel systems.
Figure 10 compares the performance of RBP, spherical spline interpolation, and zero-filling. The selected RBP model used ROCKET channel attention as pooling mechanism, with the number of montage splits set to 25 and min_nodes set to 1. The model selection was based on the mean validation performance over the 5-fold cross validation, maximizing the mean performance across the three channel systems. The selected models using spherical spline interpolation and zero-filling used batch sizes of 32 and 8, respectively, following the same model selection procedure as for RBP. For the 32-channel system version, the mean AUC values were as follows: RBP (93.34%), spherical spline interpolation (93.36%), and zero-filling (76.82%). On the 65-channel system version, the performances were RBP (93.66%), spherical spline interpolation (93.50%), and zero-filling (85.58%). Finally, the 129-channel system version produced the following results: RBP (94.68%), spherical spline interpolation (93.86%), and zero-filling (91.92%).

Discussion
FIGURE
Mean performance on the channel system with electrodes, as a function of number of montage splits and number of allowed electrodes in the smallest channel system.

FIGURE
Mean performance on the channel system with electrodes, as a function of number of montage splits and number of allowed electrodes in the smallest channel system.

4.1 RBP for handling a varied number of channels
RBP shows highly similar performance to spherical spline interpolation for all channel systems, as seen in Figure 10. Both RBP and spherical spline interpolation demonstrate robustness in handling a varied number of channels, as indicated by the minor performance degradation observed on the down-sampled channel systems. A decrease in performance when reducing the number of channels should not necessarily be interpreted as a weakness of these methods, as it may be due to a loss of information when removing channels. The objective of the methods is to handle the channel down-sampling with as small a reduction in performance as possible, although no method can restore information that is entirely lost. In contrast to RBP and spherical spline interpolation, zero-filling missing channels vastly reduces the performance on the lower resolution channel systems. Zero-filling is therefore not a recommended approach for handling missing channels, despite its use in, e.g., the official preprocessed version of the EEG data of the Child Mind Institute (Alexander et al., 2017).
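For reference, the zero-filling baseline discussed above amounts to arranging a recording in the channel order expected by the model and inserting all-zero signals where channels are absent. The following is a minimal sketch, assuming a recording is stored as a mapping from channel name to samples; the channel names and the function name are hypothetical.

```python
def zero_fill(recording, target_channels):
    """Arrange a recording in the target channel order, zero-filling absent channels.

    recording: dict mapping channel name -> list of samples (equal lengths).
    target_channels: channel names in the order expected by the model.
    """
    num_samples = len(next(iter(recording.values())))
    zeros = [0.0] * num_samples
    return [list(recording.get(name, zeros)) for name in target_channels]


if __name__ == "__main__":
    # Hypothetical three-channel target montage with one channel missing.
    low_resolution = {"E1": [1.0, 2.0], "E3": [3.0, 4.0]}
    print(zero_fill(low_resolution, ["E1", "E2", "E3"]))
```

The model then receives input of the expected shape, but the inserted rows carry no signal, which is consistent with the performance drop reported for the down-sampled channel systems.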
The results from the grid searches on the different pooling mechanisms indicate that the selection of pooling mechanism was unimportant for the selected task and dataset, except for continuous channel attention with 25 montage splits. In that case, however, the batch size was set to 1 due to memory constraints, which is not optimal for training and thus a strong confounder. More research is therefore needed to assess whether continuous channel attention failed with a high number of montage splits due to inadequacy of the pooling mechanism or solely due to the batch size. No pooling mechanism was superior to the others for all hyperparameters. A consistent trend appears to be that RBP benefits from smaller regions, as the performance on the channel systems with 32 and 65 channels in particular seems to increase when the stopping criterion min_nodes decreases. This is not an unexpected finding, as using smaller regions increases the spatial resolution per montage split. The current results further suggest that solely increasing the number of montage splits is insufficient when the regions are excessively large. However, as future work may include even smaller channel systems, larger regions may be beneficial from a practical point of view. Finding the optimal balance between compatibility with low-resolution channel systems and model performance may therefore be important for future research. Moreover, as the model was trained only on 129 channels, the performance on the low-resolution channel systems may be increased by including them in the training data as well. For extension to the large-scale setting with multiple datasets, this is likely to be a feasible approach. Furthermore, it may be used as a data augmentation technique, in particular when the high-resolution channel system has low-resolution equivalents.
This study proposed an algorithm for splitting the EEG montage into regions, although no optimization of montage splits was performed. It is likely that different EEG-related problems may benefit from different montage splits. This is because the important spatial features may be task related and require higher or lower resolution in some areas. Furthermore, as only one algorithm for splitting the EEG montage into regions was tested, future work could benefit from exploring and evaluating alternative methods. Note that with the current use of regions having defined boundaries, where an electrode is either inside or not inside a region, optimizing montage splits by gradient-based methods cannot work directly. This is because an infinitely small change to the boundaries of the region will cause either zero change in output or an output change of fixed size (not infinitely small, as required). The gradients would thus be either zero or infinite, making gradient-based learning infeasible. Two potential solutions are further discussed in Section 4.4.2.
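The argument about zero or infinite gradients can be checked numerically with a deliberately simplified sketch. It assumes a one-dimensional region whose membership is decided by a single boundary, which is an illustrative reduction and not the splitting algorithm used in this study; the function names are hypothetical.

```python
def hard_membership(electrode_x: float, boundary: float) -> float:
    """1.0 if the electrode lies inside the region (left of the boundary), else 0.0."""
    return 1.0 if electrode_x < boundary else 0.0


def finite_difference(boundary: float, electrode_x: float, step: float) -> float:
    """Forward-difference estimate of d(membership)/d(boundary)."""
    after = hard_membership(electrode_x, boundary + step)
    before = hard_membership(electrode_x, boundary)
    return (after - before) / step


if __name__ == "__main__":
    electrode_x = 0.5
    # Boundary far from the electrode: a small move changes nothing.
    print(finite_difference(0.3, electrode_x, 1e-6))
    # Boundary at the electrode: the change has fixed size 1, so the
    # difference quotient grows without bound as the step shrinks.
    for step in (1e-2, 1e-4, 1e-6):
        print(step, finite_difference(0.5, electrode_x, step))
```

Away from the electrode the estimate is exactly zero, and at the electrode it scales as one over the step size, so no useful gradient signal is available in either case.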

4.2 Related work
As discussed in Wei et al. (2022), few studies have focused on generalizing DL models to handle the cross-dataset setting and a varied number of channels. A desired outcome of the BEETL competition was to develop transfer learning techniques in the cross-dataset setting (Wei et al., 2022). However, the top three entries selected simple methods to handle a varied number of channels and the difference in channel locations: channel removal, dataset removal, or both. Furthermore, to handle a varied number of channels in the pre-training and downstream training, Kostas et al. (2021) mapped all datasets to 19 channels and, in that process, sacrificed a considerable part of the data for several of the datasets used for downstream training. However, research from clinical neurology suggests that certain characteristics require high-density EEG with an increased number of channels (Kuhnke et al., 2018; Hatlestad-Hall et al., 2023). The feasibility of downsampling the spatial resolution may therefore be limited to only a subset of EEG-related tasks.

FIGURE 10
Results of sex prediction using Inception network in combination with RBP (blue), spherical spline interpolation (orange), and zero-filling (green). The splitting into folds was equal for the different methods, and only the five performance estimates from the test sets are plotted. For the channel system with c = 129, interpolation and zero-filling are technically the same, as there are no channels to interpolate or zero-fill. The model selection procedure, however, selected different batch sizes, and the performance differences are therefore attributed to both the model selection and differences in initialization of weights.
Li and Metsis (2022) developed SPP-EEGNET, an architecture designed for inter-dataset transfer learning that is compatible with a varied number of channels. However, SPP-EEGNET pools the feature maps by spatial pyramid pooling (SPP) (He et al., 2014) after convolutions have been applied channel-wise. Cross-channel patterns can therefore not be extracted by the convolutional module of SPP-EEGNET, as the receptive field of each feature map is bounded to its respective single channel. Such cross-channel patterns may only be extracted by the fully connected module, after applying the SPP layer. As the success of signal processing is mostly attributed to the convolutional module, this approach may be sub-optimal. Furthermore, many existing DL architectures for EEG data apply 1D convolutions across channels, which hinders the application of this approach to many of the currently existing architectures. This contrasts with RBP, which is compatible with any DL model for multivariate time series classification/regression. This is beneficial, as the current high-performing models from the literature may apply RBP with ease (simply use RBP as the initial layer), meaning the accumulated research and development on DL architectures over time is respected. Furthermore, it offers a simple solution for working on the cross-dataset and cross-channel system setting in the future. Note also that, although this study represented the EEG data as time series, other representations such as power spectral density or wavelet-transformed images are popular choices of input to DL models. RBP is indeed compatible with such representations, although the pooling mechanisms must be tailored to fit the input domain. Finally, the pooling in RBP is performed based on the spatial positions of the electrodes, whereas SPP-EEGNET does not precisely specify how the feature maps of the different channels were merged. If the pooling is made only by the data matrix X [as if it was an image, following the original SPPnet (He et al., 2014)], then inconsistency in which channels end up in which spatial region will occur.

Footnote: One requirement, however, is differentiability, if the pooling mechanism has parameters to be optimized as part of the gradient-based learning.

Frontiers in Neuroinformatics frontiersin.org

4.3 Limitations of the study
A limitation of this study is its reliance on a single dataset and classification problem, which may restrict the generalizability of the findings. In particular, the size of the dataset was larger than what is commonly available for EEG datasets with more clinically relevant labels. When the total number of region representations exceeds the number of channels in a given channel system, RBP effectively expands the dimensionality of the data. This is especially the case when the regions are small and the number of montage splits is high. For smaller datasets in particular, this may lead to an increased risk of overfitting. The generalizability of the results to smaller datasets, and in particular the effect of the hyperparameters min_nodes and num_montage_splits, is therefore poorly investigated. While testing the methods on sex classification allowed for a large dataset with low chance of false labeling, its clinical utility is low. Thus, classification/regression problems with higher clinical relevance should be considered in the future. Furthermore, only a single model (Inception network) was used in combination with the three different methods for handling a varied number of electrodes. Although Inception network is an effective DL model for multivariate time series analysis, generalization to other models was not assessed. This is needed due to the high number of DL models used for EEG analysis. Finally, hardware limitations prevented training all RBP models with the same batch size, potentially reducing the performance of the models with smaller batch size. By testing with more models, datasets, and classification/regression problems, the relevance of the methods will thus be better addressed. In particular, to fully explore the potential and relevance of the investigated methods, experiments including datasets with even smaller numbers of EEG channels, such as 19 or 25, are required.

4.4 Future work

4.4.1 Pooling mechanisms and hyperparameters
The use of features as computed in ROCKET, and a single linear layer to compute the importance score of a channel, provides a light-weight method for computing channel attention. It was selected for its low computational cost, as the sole purpose of the pooling mechanism is to compute the coefficients of a linear combination. Furthermore, the extracted ROCKET features could be pre-computed prior to training, making it a pragmatic choice for run-time efficiency. Using more powerful DL models was hypothesized to be unnecessary and overpowered for such a task, although in the absence of proper experimental results in this regard, final conclusions cannot be drawn. Using pooling mechanisms which select not only the channels of interest but also the frequency bands of interest is a possible future direction.
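The idea of a pooling mechanism that only computes the coefficients of a linear combination can be sketched as below. This is a minimal illustration, not the study's implementation: the per-channel feature vectors stand in for pre-computed ROCKET features, the softmax normalization of the importance scores is an assumption, and all function names are hypothetical.

```python
import math


def importance_score(features, weights, bias=0.0):
    """Single linear layer applied to the feature vector of one channel."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def pool_region(signals, features, weights):
    """Pool the channels of one region into a single region representation.

    signals:  list of per-channel time series (equal lengths).
    features: list of per-channel feature vectors (stand-ins for ROCKET features).
    weights:  weights of the linear layer, shared across channels.
    """
    scores = [importance_score(f, weights) for f in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # softmax: an assumed normalization
    total = sum(exps)
    coeffs = [e / total for e in exps]
    num_samples = len(signals[0])
    pooled = [sum(c * x[t] for c, x in zip(coeffs, signals)) for t in range(num_samples)]
    return pooled, coeffs
```

Because each score depends only on the features of its own channel, the same function also works for a region containing a single channel, in which case the coefficient is 1 and the region representation equals that channel.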
All pooling mechanisms used in the experiments were compatible with a single channel per region. This is the case, e.g., for computing channel attention using ROCKET features, as the function g for computing the importance score of a channel only uses the features of that very channel. Future work may attempt to define pooling mechanisms which require more than one channel per region. This may be accomplished by, e.g., extending the input domain and output range of g to $g: \mathbb{R}^{p_{\mathrm{in}} \times T} \rightarrow \mathbb{R}^{p_{\mathrm{out}}}$, where $p_{\mathrm{in}}$ is the lower bound of the accepted number of channels in a region, and $p_{\mathrm{out}}$ is the number of output features per application of g. However, as this may either require larger regions (which by the current results does not appear to be favorable) or lead to incompatibility with the low-resolution channel systems, it is important to determine in future research whether the potential benefits outweigh the drawbacks.
All experiments in this study merged the different montage splits by concatenation directly after the pooling was made. Another approach could be to apply convolutional modules separately on the montage splits, prior to merging them. Furthermore, summation, averaging, or alternating between applying convolution and adding a montage split, as with skip connections, are examples of other possible merging strategies. In particular, merging montage splits by skip connections and using dynamic neural networks (Han et al., 2022) to, e.g., perform a sample- or channel-system-conditioned number of montage splits by early exiting or layer skipping is a possible future direction. By using dynamic architectures, more montage splits could be used on the high-resolution channel systems, and fewer montage splits could be used on the low-resolution channel systems. Furthermore, montage splits with small regions could be used on high-resolution channel systems only, possibly alleviating the trade-off observed here between performance and low-resolution compatibility.
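The difference between merging by concatenation and merging by averaging can be made concrete with a small sketch. It assumes each montage split is represented as a list of region representations (each a list of samples); the function names are hypothetical, and averaging is shown only as one of the alternatives mentioned above.

```python
def merge_by_concatenation(montage_splits):
    """Stack the region representations of all montage splits row-wise."""
    return [region for split in montage_splits for region in split]


def merge_by_averaging(montage_splits):
    """Element-wise mean over montage splits.

    Unlike concatenation, this requires every montage split to have the same
    number of regions and the same number of samples per region.
    """
    num_splits = len(montage_splits)
    return [
        [sum(values) / num_splits for values in zip(*regions)]
        for regions in zip(*montage_splits)
    ]


if __name__ == "__main__":
    splits = [
        [[1.0, 2.0], [3.0, 4.0]],  # montage split 1: two regions, two samples each
        [[3.0, 4.0], [5.0, 6.0]],  # montage split 2
    ]
    print(merge_by_concatenation(splits))  # four rows
    print(merge_by_averaging(splits))      # two rows
```

Concatenation grows the number of rows with the number of montage splits, whereas averaging keeps it fixed but assumes that regions at the same index are comparable across splits, which may not hold for independently generated splits.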

4.4.2 Splitting into regions
While the current study did not perform any optimization of the splitting of the EEG montage into regions, two possible solutions which may be explored in the future are (1) using optimization techniques other than gradient-based methods, where one approach could be to generate many splits and apply sparsity, and (2) introducing soft regions, where each electrode is assigned a non-binary weight for its presence in the region. A region could, e.g., be represented as a Gaussian, where the mean and standard deviation are treated as trainable parameters. The influence of a specific channel on a region representation would be determined by both an importance score calculated from a function g operating on the time series, and its spatial importance given the properties of the region (e.g., mean and standard deviation).
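The soft-region idea can be sketched as follows. This is a hypothetical illustration only: it assumes two-dimensional electrode positions, an isotropic Gaussian with a single standard deviation, non-negative importance scores, and a multiplicative combination of spatial weight and importance score followed by normalization. None of these choices are specified in the text above.

```python
import math


def gaussian_weight(position, mean, std):
    """Unnormalized isotropic Gaussian membership of an electrode in a soft region."""
    squared_distance = sum((p - m) ** 2 for p, m in zip(position, mean))
    return math.exp(-squared_distance / (2.0 * std ** 2))


def soft_region_coefficients(positions, importance_scores, mean, std):
    """Combine spatial weights with importance scores and normalize to sum to one.

    The product of the two terms is an assumed way of combining them.
    """
    raw = [gaussian_weight(p, mean, std) * s for p, s in zip(positions, importance_scores)]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    positions = [(0.0, 0.0), (1.0, 0.0)]
    print(soft_region_coefficients(positions, [1.0, 1.0], mean=(0.0, 0.0), std=1.0))
```

Since the Gaussian weight varies smoothly with the mean and standard deviation, a small change in the region parameters produces a small change in the coefficients, which is the property that hard region boundaries lack.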

4.4.3 Training strategies with large amounts of data
A major motivation behind RBP is to enable the use of multiple and heterogeneous datasets with a varied number of channels for different training strategies. Large-scale use of multiple datasets should be tested for methods such as pre-training (e.g., transfer learning or self-supervised learning), representation learning (e.g., self-supervised or unsupervised learning), and simply using more datasets if the same targets are available. Fixing different electrode arrays and using spherical spline interpolation in the case of varied channel systems across the datasets should be used as baselines.
For the AI-Mind project, this may be of high relevance for improving both the DL model performance and generalization. While the project aims at collecting a dataset comprising 1,000 participants and possibly expanding this with synthetic data, this is not guaranteed to be sufficient for DL models. Improving data efficiency and model performance by the abovementioned training strategies may be enhanced by enabling them in the cross-channel system setting. Furthermore, data collection from four different countries and five different clinical sites is likely to mitigate bias to some extent. However, its sufficiency is difficult to assess a priori. Two arguments against its sufficiency are that (1) all clinical sites are situated in European countries, and (2) the hardware for EEG recordings is the same. Thus, by applying the abovementioned training strategies to heterogeneous datasets, the ability of the DL models to generalize across populations and hardware may be improved.

Conclusion
Region based pooling was introduced for deep learning models to handle a varied number of EEG channels. Furthermore, its adequacy in maintaining performance when downsampling the channel system was experimentally demonstrated. Grid search was used to assess the effect of two new hyperparameters, which relate to the size of the regions and the number of montage splits. Several pooling mechanisms were introduced and tested, yielding highly similar results. Region based pooling obtained similar results to spherical spline interpolation, and superior results to zero-filling missing channels when downsampling the channel system to 65 and 32 channels. Zero-filling missing channels is therefore not a recommended method for handling a varied number of channels. Future work includes applying region based pooling on multiple and heterogeneous datasets with different EEG channel systems. In particular, large-scale pre-training and representation learning in combination with region based pooling will be investigated.

FIGURE
High-level overview of the workflow. The different hyperparameters for each model are described in Section . .
, and $k = (3, 4, 2)^T$. For each montage split, the selection of k was made by random sampling with equal

FIGURE
Region based pooling. The EEG montage is split into multiple regions. All channels in the same region are pooled into a region representation. Multiple montage splits may be performed, and the number of montage splits equals two in this figure, M and M . If there is at least one channel in all used regions, the mapping from channels into region representations can be made. This is illustrated by channel system A and channel system B, which have unequal numbers of channels with different channel locations and can both obtain region representations. After pooling channels into region representations, the region representations are stacked/row concatenated. The sequence of stacking represents an arbitrarily chosen design.

FIGURE
Mean performance on the channel system with electrodes, as a function of number of montage splits and number of allowed electrodes in the smallest channel system.