Stage-independent, single lead EEG sleep spindle detection using the continuous wavelet transform and local weighted smoothing

Sleep spindles are critical in characterizing sleep and have been associated with cognitive function and pathophysiological assessment. Typically, their detection relies on the subjective and time-consuming visual examination of electroencephalogram (EEG) signal(s) by experts, and has led to large inter-rater variability as a result of poor definition of sleep spindle characteristics. Hitherto, many algorithmic spindle detectors inherently make signal stationarity assumptions (e.g., Fourier transform-based approaches) which are inappropriate for EEG signals, and frequently rely on additional information which may not be readily available in many practical settings (e.g., more than one EEG channels, or prior hypnogram assessment). This study proposes a novel signal processing methodology relying solely on a single EEG channel, and provides objective, accurate means toward probabilistically assessing the presence of sleep spindles in EEG signals. We use the intuitively appealing continuous wavelet transform (CWT) with a Morlet basis function, identifying regions of interest where the power of the CWT coefficients corresponding to the frequencies of spindles (11–16 Hz) is large. The potential for assessing the signal segment as a spindle is refined using local weighted smoothing techniques. We evaluate our findings on two databases: the MASS database comprising 19 healthy controls and the DREAMS sleep spindle database comprising eight participants diagnosed with various sleep pathologies. We demonstrate that we can replicate the experts' sleep spindles assessment accurately in both databases (MASS database: sensitivity: 84%, specificity: 90%, false discovery rate 83%, DREAMS database: sensitivity: 76%, specificity: 92%, false discovery rate: 67%), outperforming six competing automatic sleep spindle detection algorithms in terms of correctly replicating the experts' assessment of detected spindles.

Sleep spindles are critical in characterizing sleep and have been associated with cognitive function and pathophysiological assessment. Typically, their detection relies on the subjective and time-consuming visual examination of electroencephalogram (EEG) signal(s) by experts, and has led to large inter-rater variability as a result of poor definition of sleep spindle characteristics. Hitherto, many algorithmic spindle detectors inherently make signal stationarity assumptions (e.g., Fourier transform-based approaches) which are inappropriate for EEG signals, and frequently rely on additional information which may not be readily available in many practical settings (e.g., more than one EEG channels, or prior hypnogram assessment). This study proposes a novel signal processing methodology relying solely on a single EEG channel, and provides objective, accurate means toward probabilistically assessing the presence of sleep spindles in EEG signals. We use the intuitively appealing continuous wavelet transform (CWT) with a Morlet basis function, identifying regions of interest where the power of the CWT coefficients corresponding to the frequencies of spindles (11-16 Hz) is large. The potential for assessing the signal segment as a spindle is refined using local weighted smoothing techniques. We evaluate our findings on two databases: the MASS database comprising 19 healthy controls and the DREAMS sleep spindle database comprising eight participants diagnosed with various sleep pathologies. We demonstrate that we can replicate the experts' sleep spindles assessment accurately in both databases (MASS database: sensitivity: 84%, specificity: 90%, false discovery rate 83%, DREAMS database: sensitivity: 76%, specificity: 92%, false discovery rate: 67%), outperforming six competing automatic sleep spindle detection algorithms in terms of correctly replicating the experts' assessment of detected spindles.

Introduction
Sleep spindles are characteristic oscillatory patterns of brain activity which can be visually detected in human electroencephalography (EEG) signals. These transient patterns are typically portrayed as nearly sinusoidal waxing and waning waveforms with a characteristic frequency profile of 11-16 Hz [formerly this range was narrowed between 12 and 14 Hz in the Rechtschaffen and Kales criteria (Rechtschaffen and Kales, 1968), and different research labs might use slightly different frequency ranges] (Iber et al., 2007;Kryger et al., 2010). Interestingly, although seep spindles exhibit substantially varying characteristics (amplitude, duration, density) in the population, they are fairly stable for individuals (Werth et al., 1997). Spindles are generated in the thalamus, and contemporary evidence suggests they can be classified into slow spindles (11)(12)(13) and fast spindles (13)(14)(15)(16), which are believed to regulate different activation patterns (DeGennaro and Ferrara, 2003).
The presence of sleep spindles is one of the hallmarks for determining stage 2 (S2) in the hypnogram, which provides an overall representation of sleep structure successively assigning short signal segments (known as epochs, usually of 30 s duration) to one of five sleep stages (Iber et al., 2007). They have been associated with various higher cognitive processes in particular memory (Tamminen et al., 2010), but also learning performance (Schmidt et al., 2006) and skill performance (Astill et al., 2015). Moreover, there is a growing body of research literature highlighting their potential as biomarkers: a number of studies have reported clinically significant differences in spindle characteristics for a range of neurological disorders (Ferrarelli et al., 2007;Wamsley et al., 2012;Christensen et al., 2014).
The gold standard for the determination of sleep spindles has traditionally been achieved through visual inspection of the EEG by sleep physiology experts. Despite the best attempts of experts to standardize protocols, expert-based assessments rely on expensive human resources, depend on the rater's experience and level of expertise, are laborious and prone to errors due to fatigue, and by nature cannot scale to handle very large datasets. As with all cases where the gold standard is set by subjective assessments of trained experts, there can always be an argument that an automated algorithmic process could provide an alternative, often sufficiently accurate, robust, scalable, replicable, cost-effective, and objective mode to achieve the aim; indicative studies highlighting these concepts include Grove and Meehl (1996), Seshadrinathan et al. (2010), and Tsanas (2012) amongst many others. At the very least, the development of algorithmic tools can facilitate and expedite the work of trained experts particularly due to the sheer amount of the growing availability of massive datasets.
There are several approaches that have been proposed to tackle the problem of automatic sleep spindle detection. The majority of the proposed algorithms rely on a time-frequency analysis. In all cases, a major hurdle is the determination of appropriate thresholds, which may need to be optimized for each individual. Unfortunately, it is difficult to define universally applicable thresholds due to the large variability in spindle characteristics amongst individuals (Werth et al., 1997). Frequently, the setting of these thresholds for many algorithms require prior hypnogram assessment, and subsequent focusing only on stage 2 sleep (Mölle et al., 2002;Wamsley et al., 2012) or Non Rapid Eye Movement (NREM) sleep (Ferrarelli et al., 2007;Martin et al., 2013). However, we argue that all these approaches are quite restrictive, particularly because in practice we want to completely automate the EEG signal processing task without requiring prior hypnogram assessment by experts. Detecting spindles might be the end goal in one application, but could also be used to guide automated sleep staging assessment. Another generic approach for many algorithms is attempting to determine the presence of spindles by successively searching over pre-defined short windowed EEG segments [typically 1 s, e.g., see Huupponen et al. (2007), although some approaches rely on the detection of spindles in the more traditional 30-s epochs used in hypnogram assessment]. A major limitation with this approach is that one needs to specify a small signal segment to assess whether a spindle occurred within that segment and loosely approximate the spindle onset and offset.
Recently Wendt et al. (2012) introduced a fusion approach to detect spindles applying their sleep detection algorithm on two EEG channels (central and occipital). However, spindles are known to occur locally (Kryger et al., 2010) and hence there is no guarantee that both the central and occipital deflections will identify the spindle; furthermore, this complicates the practical task of spindle assessment by imposing the requirement that additional recordings are available (ideally a single channel would be sufficient for detecting spindles locally). It should be noted that localized sleep can occur, and therefore a single channel cannot reveal the overall sleep structure for the entire brain. In practice we want to focus on specific brain areas, detecting spindles locally, e.g., at the central regions where the spindle density is maximal (Kryger et al., 2010); some interesting recent work has focused on spindle propagation (O'Reilly and Nielsen, 2014b).
One of the simplest algorithmic approaches for detecting spindles is to band-pass the EEG signal and assess the presence of spindles by setting an appropriate (relative) threshold on the amplitude of the band-passed version of the signal (Schimicek et al., 1994), which is both sensible and remains topical to this day at least as a benchmark. Similarly, the ubiquitous Fourier Transform (FT) has been investigated in this application (Huupponen et al., 2007). However, there are inherent limitations of the FT in that it implicitly assumes a periodic signal, and also that it requires a sufficiently adequate number of samples for the spectrum estimation; in practice this sets a minimum requirement of about 1 s signal segment (Pardey et al., 1996). In turn, this means that with FT it is fundamentally impossible to correctly determine the spindle onset and offsets accurately as highlighted previously. Wavelet analysis is particularly suitable for analyzing non-stationary signals (such as the EEG), thus overcoming certain shortcomings of the traditional spectral analysis with the FT, and hence has justifiably attracted interest recently in the spindle detection domain (Sitnikova et al., 2009;Wamsley et al., 2012).
This study extends the methodology of recent approaches using the Continuous Wavelet Transform (CWT) with Morlet basis functions (Sitnikova et al., 2009;Wamsley et al., 2012). The Morlet wavelet has been widely used in many practical applications because it has the desirable property that it minimizes the product of the wavelet's time and frequency spreads; hence it optimizes the time-frequency resolution (Addison, 2002). The main novelty of this work lies in the processing of the relative normalized power of the CWT coefficients to determine spindle candidates. Whereas previous studies computed the moving average of the power of the CWT coefficients to detect spindles directly, we first rank the CWT coefficients in terms of their normalized power at each time instant. Then, we compute the instantaneous ratio of the CWT coefficients falling within the scale spindle range (corresponding to the standard 11-16 Hz frequency range) over the top 10 ranked CWT coefficients. This ratio denotes the "instantaneous strength" of detecting a spindle, which is subsequently processed with weighted moving average methods to detect spindles. The proposed algorithm overcomes several shortcomings of competing algorithms: (a) it does not require processing successive small (e.g., 1 s) signal segments which blur the determination of true onset and offset of spindles (instead the algorithm works directly the entire signal), (b) it does not require prior hypnogram assessment, (c) it uses a single EEG lead. Moreover, using the proposed algorithm we can determine the frequency variation contour as a function of time within each spindle: these features may have clinical relevance, a fact which is often overlooked by contemporary competing approaches (for example, FT-based approaches cannot readily provide this information).

Materials and Methods
This section summarizes the dataset used in this study, summarizes some of the previously published algorithms against which the new sleep spindle detection algorithm developed in this study is benchmarked, and outlines the evaluation criteria for assessing the performance of the algorithms.

Data
We used two publicly available databases in this study.
The first database was collected during the DREAMS project (Devuyst et al., 2011), which aimed to provide a platform to assist assessment of automatic detection algorithms. The sleep spindles database contains recordings from eight participants with diverse sleep pathologies (dysomnia, restless legs syndrome, insomnia, apnoea/hypopnoea syndrome). Two EOG channels (P8-A1, P18-A1), three EEG channels (CZ-A1 or C3-A1, FP1-A1, and O1-A1) and one submental EMG channel were recorded, using a sampling frequency of 200 Hz (six signals), 100 Hz (one signal), or 50 Hz (one signal). A segment of 30 min of a central EEG channel (C3-A1 or Cz-A1) was extracted from each wholenight recording, and two experts have independently annotated the presence of sleep spindles. The second expert has only annotated six out of the eight recordings, and has not provided the exact duration of the assessed spindles (hence, it was all assigned to be 1 s in duration). Although the hypnograms (according to standard Rechtschaffen and Kales criteria) were available, these were not used in the assessment of the spindles by the experts. The dataset along with additional information is publicly available from: http://www.tcts.fpms.ac.be/∼devuyst/ Databases/DatabaseSpindles/.
The second database was collected as part of a large project looking into sleep, the Montreal Archive of Sleep Studies (MASS) (O'Reilly et al., 2014a). It contains overnight PSG recordings from 19 healthy controls: specifically, electroencephalography (EEG) montage of 19 channels, 4 electro-oculography (EOG), electromyography (EMG), electrocardiography (ECG), and respiratory signals. The EEG signals were sampled at 256 Hz. The database was annotated independently by two experts for sleep spindles. The second expert has only annotated 15 out of the 19 signals for sleep spindles. Hypnograms (according to standard Rechtschaffen and Kales criteria) were also made available. For further details see O'Reilly et al. (2014a). The dataset became available to the authors of this study after the development of the algorithms and the original submission of the manuscript; we deliberately decided not to further fine-tune the original algorithms developed using the DREAMS data to guide the sleep spindle estimation process, in order not to bias the presented findings in any way. The dataset can be accessed from: http:// www.ceams-carsm.ca/en/MASS.
In all cases, the EEG signals were resampled at 100 Hz.

Methods
Before delving into the details of the sleep spindle detection algorithms, it is useful to revisit the definition of spindles, and visualize some examples annotated by experts in order to motivate the subsequent algorithmic development. According to the latest recommendation of the AASM Manual for the scoring of sleep, a spindle is defined as "a train of distinct waves with frequency 11-16 Hz (most commonly 12-14 Hz) with a duration ≥0.5 s, usually maximal in amplitude in the central derivations." (Iber et al., 2007). The spindle frequency range is nowadays generally accepted to be 11-16 Hz, but the range over which researchers focus may vary slightly depending on the research lab, e.g., 10.5-16 Hz (Huupponen et al., 2007), or 12-15 Hz (Ferrarelli et al., 2007); the standard reference book "Principles of Sleep Medicine" quotes the range 10-15 Hz (Kryger et al., 2010). We note there is no formal recommendation for the use of amplitude thresholds to detect a spindle, although many researchers have explicitly used amplitude criteria in their algorithmic implementations (Devuyst et al., 2011;Wamsley et al., 2012). Also, many researchers have relaxed the requirement of the minimum spindle duration, e.g., 0.4 s (Wamsley et al., 2012) or even as low 0.3 s instead (Warby et al., 2014). In practice, most spindles are typically around 0.5-1.5 s (very occasionally might be over 2 s), and typically most researchers impose a maximum length constraint (typically 3 s, e.g., Warby et al., 2014) in their algorithmic approaches. Sleep textbooks often depict sleep spindles as waxing and waning, nearly sinusoidal waveforms; however, in practice spindle waveforms are markedly noisy, exhibiting diverse characteristics. Figure 1 illustrates some spindles detected by experts for the same signal in the DREAMS sleep spindle database (Devuyst et al., 2011). It is striking that all these transient waveforms FIGURE 1 | Exemplary sleep spindles annotated by one of the experts for one of the EEG signals in the DREAMS sleep spindles database (the sampling frequency of the signal is 100 Hz). We can visually appreciate the wide variability of sleep spindle characteristics within the same EEG signal. Both the original signal segment and the band-passed (11-16 Hz) version of the signal segment are presented to assist visualization. The solid red line indicates the start of the spindle and the dashed line indicates the end; the green lines indicate the envelope of the signal. In practice, some experts use both the signal and the band-passed version of the signal to assess the presence of spindles.
(stemming from the same EEG recording and being only a few seconds or minutes apart) display such widely varying features (for example compare the peak-to-peak amplitudes). Nevertheless, all these illustrative examples are considered true spindles according to at least one of the two experts and set the ground truth against which all automated sleep spindle detection algorithms are benchmarked. For each signal we also present its band-passed version at the spindle frequency range. Following visual inspection of these plots, we can postulate that amplitude may be a misleading criterion to assess automatically the presence of spindles; on the other hand, the presence of the spindle appears to be more consistent when also observing the band-pass version of the signals. This exploratory step may assist in the motivation and understanding of the sleep detection algorithms which are presented in the following sections.

Contemporary Sleep Spindle Detection Algorithms
For simplicity and to conform to the terminology of Warby et al. (2014) we will denote with a x each of the sleep spindle detection algorithms used in this study, where the subscript indicates the corresponding algorithm. In this section we summarize the six spindle detection algorithms used in Warby et al. (2014) (denoted here with a 1 -a 6 ), and in the following section we will introduce the new algorithmic approaches. These algorithms (occasionally with slight modifications) have been widely used in a number of studies, and therefore can be considered indicative of the most popular contemporary approaches to automatically detect sleep spindles. We used the Matlab implementations provided by Warby et al. (2014) for a 1 -a 6 and the description of the algorithms below follows their algorithmic modifications; hence the described algorithms differ slightly in comparison to the original algorithms. Our own algorithms were also implemented in Matlab, and are made freely available on Physionet (www. physionet.org) and the first author's website.

Algorithm a 1 , Bódizs' average amplitude spectrum
The first algorithm, a 1 , is due to Bódizs et al. (2009), and attempts to tackle the problem of intra-subject variability in terms of EEG characteristics by incorporating subject-specific information (hence building upon the findings of Werth et al. (1997) that the variability of the spindle characteristics is low for each individual). The algorithm detects spindles in customized frequency ranges (identifying slow and fast spindles) using the average amplitude spectrum of NREM sleep using epochs of 4 s. The decision to evaluate the presence of a spindle is based on the amplitude threshold in each of the two band-pass regions for slow spindles or fast spindles. The implementation by Warby et al. (2014) used here requires both a central and an occipital EEG channel.

Algorithm a 2 , Ferrarelli's band pass and signal envelope algorithm
The second algorithm, a 2 , was proposed by Ferrarelli et al. (2007) and with slight modifications has been used in some recent studies, e.g., Astill et al. (2015). The algorithm applies a band-pass filter (11-15 Hz) to the NREM data (epochs), and the envelope of the resulting signal is subsequently used. An amplitude threshold (threshold 1 ) is then set relative to the mean signal amplitude (because different channels exhibit different amplitude profiles). A spindle is marked by first detecting a local maximum in the envelope of the filtered signal above threshold 1 , and its duration is determined by identifying the preceding and following instances when this amplitude falls below a lower threshold (threshold 2 ), i.e., detecting the nearest troughs below threshold 2 (local minima). The slightly different versions of this type of algorithm set threshold 1 and threshold 2 slightly differently than the original algorithm, but the essential main idea remains the same.

Algorithm a 3 , Mölle's band pass RMS overlapping moving window
The third algorithm, a 3 , was described by Mölle et al. (2002). This algorithm is also band-pass filtering the NREM data at the spindle frequency range (12-15 Hz), and subsequently computes the Root Mean Squared (RMS) value of the filtered data over a short-frame overlapping (50%) moving window of 100 ms. Then, spindles are determined only on the data from sleep stage 2 depending on whether the RMS value exceeds an amplitude threshold (set at 1.5 times the standard deviation of the band-pass filtered signal) and the duration is within the acceptable spindle limits (0.3-3 s).

Algorithm a 4 , Martin's band pass RMS percentile moving window
The fourth algorithm, a 4 , by Martin et al. (2013) is conceptually very similar to a 3 . It differs from a 3 in terms of the spindle frequency range used (11-15 Hz) for the band-pass filter, the use of a non-overlapping time window (25 ms) to compute the RMS values, and the threshold for detecting the spindle which is set to be the 95th percentile of the RMS signal.

Algorithm a 5 , Wamsley's CWT moving average
The fifth algorithm, a 5 , was developed by Wamsley et al. (2012). Contrary to the algorithms described so far, this algorithm is based on the CWT, which has some desirable properties for analyzing EEG signals as discussed previously. The algorithm relies on prior hypnogram assessment and attempts to detect spindles during stage 2. The signal is transformed into the wavelet domain using the complex Morlet wavelet basis function. The Morlet scales corresponding approximately to the pseudo-frequencies of interest (10-16 Hz) were used, and the moving average of the coefficients using a 100 ms sliding window was computed; when it exceeded a threshold for a minimum of 0.3 s a spindle was registered. The threshold was set using only the amplitude of epochs assessed as stage 2 by experts.

Algorithm a 6 , Wendt's two-channel band pass and signal envelope combination
The sixth algorithm, a 6 , was developed by Wendt et al. (2012). This algorithm is conceptually similar to a 2 , the main difference is that the boundaries for the spindle detection are determined using local extrema of the signal envelope and its rate of change, whereas a 2 relied on local minima. A further difference is that both a central and an occipital EEG channels are used in the band 11-16 Hz, and the spindle detection is a result of the combination of the two different sets of envelopes.
Recently, Warby et al. (2014) applied the six algorithms described so far in a large private database with sleep spindles from 110 healthy controls, and reported that the best algorithm in terms of accurately detecting spindles and minimizing false detections was a 5 , closely followed by a 4 . We note that all six algorithms described so far (a 1 -a 6 ) rely on prior hypnogram assessment, which was provided given that the sleep stages assessed by experts was available for this database. We note that this fact effectively places competing algorithms which do not have access to hypnogram information at a disadvantage when it comes to direct algorithmic performance comparisons. The following new algorithms (a 7 -a 8 ) do not rely on prior sleep staging information, but we aim to demonstrate that the new algorithms are nevertheless very competitive.

Novel Sleep Spindle Detection Algorithms
We have already highlighted the intuitively appealing features of the CWT for analyzing EEG signals due to its time-frequency localization properties, and the fact that it does not make assumptions regarding signal periodicity. Exploring the data by visual inspection of the true spindles (see Figure 1) seems to indicate that amplitude-based characteristics may be misleading (this is also implicit in the AASM criteria where no amplitude recommendation is made when assessing spindles); hence the primary focus of the developed algorithms is on the frequency content of the signal. Strictly speaking, we work directly with the CWT scales which correspond to the (pseudo)frequencies of interest (11-16 Hz). We defined 131 Morlet scales with a resolution of 0.1 in the range 2-15 (corresponding pseudo-frequencies: 5.4-40.6 Hz), which led to 24 scales lying within the spindle scale range. There is a non-linear mapping of the scales to their corresponding pseudo-frequencies, which is a function of the wavelet basis function and the sampling frequency of the signal. For the Morlet wavelet with a signal sampling frequency of 100 Hz, the scales of interest (spindle scale range) are 5.1-7.4. We used a lower threshold of pseudo-frequency at 5.4 Hz above which we try to assess the probability of having a spindle so as to avoid challenging settings of spindles occurring on the background of largeamplitude slow oscillations (the delta frequency range, 1-4 Hz). Conceptually the starting basis of the proposed algorithms is similar to the study by Wamsley et al. (2012) (algorithm a 5 ), who subsequently thresholded the CWT coefficients at the spindle frequency range using a moving average of 100 ms sliding window. What distinguishes the algorithms proposed in this study compared to previous algorithms using the CWT is the different processing of the extracted Morlet CWT coefficients and the fact that we do not rely on expert-based hypnogram (in particular determining sleep stage 2) assessment. Figure 2 presents a high-level flowchart of the two new algorithms introduced in this study. All sleep spindle detection algorithms developed in the research literature have some free parameters (typically these are some thresholds, e.g., on amplitude values). Similarly, the proposed algorithms in this study rely on a number of free parameters which need to be optimized: the chosen values were determined by testing on random subsamples of the training data so that regions of relative stability were found; exhaustive searches over the parameter space were not possible due to the size of the data set. We deliberately decided not to pursue rigorous optimization of these parameter values, in order to avoid overfitting the characteristics in the DREAMS database (effectively this would be training and testing on the same data). It is likely that the parameter values chosen could benefit from further refinement to optimize the outputs of the proposed algorithms, but a larger database would be needed.

Algorithm a 7 , CWT instantaneous probabilistic estimate with moving averaging
The algorithm a 7 , uses the following steps after the computation of the CWT coefficients: (a) Computes the normalized percentage power of the CWT coefficients (henceforth referred to as normalized coefficients). (b) Sorts the normalized coefficients in descending order at each time instant and works on the top 10 Morlet CWT scales corresponding to the top normalized coefficients (thus resulting in a matrix of size number of signal samples × 10). (c) Computes instantaneous probabilistic estimate of spindle occurrence at the spindle scale range using the following algorithmic expression: where P(s i ) denotes the probability of having a spindle at a given sample i, T is the cardinality of the top 10 scales corresponding to the sorted top 10 CWT normalized coefficients at instant i coinciding with the spindle scale range (i.e., for each sample i, we find how many of the top 10 sorted scales corresponding to the normalized coefficients match the scales in the spindle scale range), M i contains the positions of the detected scales intersecting with the spindle scale range in the 10-element vector and the operator "./" denotes element-wise division. The value P(s i ) effectively expresses the confidence that the sample i is part of a spindle (the higher the value, the more likely this sample may be part a spindle). The underlying concept is that if a sufficiently large number of successive samples (corresponding to some minimum time duration to be defined) have large probabilities denoting spindles, then that sequence will be denoted as a spindle. Effectively, we determine how many of the top 10 sorted scales matched the spindle scale range, and weigh these scales based on where they feature in the list with the instantaneous top 10 scales.
If none of the sorted top 10 scales overlapped with the spindle scale range then P(s i ) is zero. L denotes a normalization constant factor which was computed as L = T i = 1 (1./ 1 . . . 10 ).
(d) Now, we need to smooth the instantaneous P(s i ) estimates based on their K neighbors P(s i − K/2 ) . . . P(s i + K/2 ) to determine whether some EEG segments (regions) of arbitrary length within some duration boundaries (here 0.5-1.5 s) correspond to a spindle. Essentially, we have scales corresponding to the spindle scale frequency and we want to smooth neighboring regions to decide whether these are above the minimum duration threshold (in practice we very rarely have all consecutive samples in a spindle exhibiting large proportion of the scales belonging to the spindle scale range). Conceptually, this is similar to the concept that Wamsley et al. (2012) used, smoothing the data using a moving average of 0.1 s. Similarly, we used a moving average filter of 0.1 s to obtain the P smooth (s i ). (e) It is possible that certain P (s i ) < P smooth (s i ) and we want to encourage relative large values to maximize the probability of detecting true spindles; hence we applied a final check: P final (s i ) = max ∀i (P smooth (s i ) , P (s i )). (f) The candidate spindle instances (as a first pass) were detected at those samples when P final (s i ) > 0.3 (for as many successive samples as the threshold remains valid). We remark this threshold (and all free parameters in this spindle detection algorithm such as number of top scales to investigate and K) were not rigorously optimized to avoid over-fitting the database used in this study. Instead we have attempted to determine "good" parameter values, which may be refined if presented with additional databases which will assist in properly optimizing the values of the free parameters. (g) Finally, we need to group together regions which contain series of samples with high probabilities of denoting spindles. This was achieved using flags to denote if successive regions containing candidate spindles would group in terms of their proximity, average probabilistic estimate of having a spindle in a region defined between samples (i 1 , . . . , i 2 ) P(s i 1 ) . . . P(s i 2 ) , and the duration of the candidate spindle. Specifically, we grouped successive candidate spindles in the following cases: (i) The duration between successive spindles was less than 0.3 s, and both successive spindles exhibited average Process the instantaneous P(s i ) esƟmates using local weighƟng smoothing methods based on their K neighbors. Aim: determine if there are segments in the Ɵme series which could be spindles, i.e. successive P(s i ) all above a probability threshold for a minimum period of Ɵme (set to 0.3 seconds). The difference between the algorithms a 7 and a 8 is on how the local smoothing is performed probabilistic strength above a threshold, i.e., both spindles appeared to be very likely true spindles: > 0.7, and the duration of both successive spindles was at least 0.1 s (case: "strong" spindles). (ii) The duration between successive spindles was less than 0.3 s and both successive spindles exhibited average probabilistic strength: 1 i 2 − i 1 · i 2 i = i 1 P (s i ) > 0.6 and both were at least 0.3 s long (case: "long spindles").

Algorithm a 8 , CWT instantaneous probabilistic estimate with distance and amplitude weighted averaging
The algorithm a 8 , is very similar to a 7 . The difference lies in how we process the instantaneous probability spindle estimates P(s i ) to affect neighboring P(s j ) values. That is, the first steps (a)-(c) are identical, and step (d) processes the computed P(s i ) using the exponential weighted moving average concept (instead of moving average). The underlying idea is that we want to update P(s i ) values depending on their neighboring P(s j ) values as a weighted function of their distances and a weighted function of their magnitude (which is weighted exponentially to promote EEG regions where instantaneous P(s i ) estimates are large). Specifically, step (d) now becomes: (d) We used smoothing over 0.2 s, linearly scaling the effect of samples P(s j ) on P(s i ) as a function of their distance from P(s i ), i.e., {w t } 10 t=−10, t =0 = 1 |t| · P(s i + t ). In order to augment the effect of large P(s i ) values (which denote great confidence that the sample i is part of a spindle) we exponentiated these values. Overall, conceptually it is similar to using an exponential weighted moving average approach. Algorithmically this is expressed as: where the notation ⌈·⌉ denotes that the value is upper bounded to be 1, and the notation ". * " denotes element wise multiplication. The subsequent steps (e)-(g) are identical to a 7 to detect a spindle. We remark that a 8 is by design heavily weighting regions where there is a possibility of observing a spindle, but these regions will likely contain many cases which are not likely to be spindles.

Evaluation of Sleep Spindle Detection Algorithms
Both the DREAMS sleep spindles database and the MASS database have been annotated by two experts. Given the large inter-rater variability (e.g., for the DREAMS database the first rater has marked 289 spindles whereas the second rater has marked 409 spindles), there are two approaches to determine the ground truth. One approach is to only consider cases where both experts agree, an approach used previously for the DREAMS database by other researchers (Devuyst et al., 2011;Nonclercq et al., 2013). However, this biases the results, because one might argue that cases where both experts agree may denote "easily detectable" spindles; hence in this study we used all assessments by both experts, removing one of the double entries (in those cases where both experts agreed, in the DREAMS database we removed the assessment by the second expert because only the first expert had also provided the duration of the assessed spindle). Each of the sleep spindle algorithms used in this study results in estimates summarized in the format N×2, where N denotes the number of detected spindles for each EEG signal: the first column contains the estimated onset, and the second column the spindle duration. This facilitates direct comparison with the ground truth which is in the same format. In order to assess the performance and fairly compare all algorithms, we used the following commonly used metrics: Cohen's kappa coefficient was originally developed to assess inter-rater agreement, and some researchers suggest it takes into account agreement between raters which could be attributed to chance. Effectively, this implies that when raters are uncertain they guess about their decision, which some researchers have suggested is unlikely in many practical settings. Some of the problems and limitations of Cohen's kappa have been discussed by Gwet (2008); we cautiously include it in this study because some research papers published in the sleep spindle detection literature have used it. We also used and put greater emphasis on the weighted kappa in this study because spindles are rare events in the EEG signal and we wanted to weigh accordingly for spindles correctly detected and spindles missed by the spindle detection algorithms (that is, we set the weight for TP and FN to be 10 times compared to the weight assigned to FP and TN).
(e) Absolute difference in the onset timings between the ground truth and the estimated onset.
where True Positive (TP) denotes agreement between the algorithm and the ground truth about the detection of a spindle, False Negative (FN) denotes a true spindle as assessed by the experts which was missed by the algorithm, False Positive (FP) when the algorithm detected a spindle that was not assessed as a spindle by the experts, and True Negative (TN) was defined as in Devuyst et al. (2011): TN = signal duration in seconds − FP − TP − FN. We assess a true positive when the absolute difference between the onset of the ground truth and the estimated spindle onset by the algorithm is less than 0.5 s. Other studies have used different, less stringent definitions to assess whether an algorithm has matched the expert's assessment in correctly detecting a spindle. Some studies assess whether a spindle was detected within a sliding prespecified time-interval (epoch), e.g., Duman et al. (2009), however this does not assess directly the accuracy in determining the spindle onset. Other studies, e.g., Nonclercq et al. (2013), consider than an algorithm has correctly detected a spindle if there was any overlap between the duration of the estimated spindle and the true spindle duration. However, this may positively bias sleep detection algorithms which provide spindle estimates with large durations.

Evaluation of the Spindle Detection Algorithms on the DREAMS Sleep Spindles Database
Tables 1-3 summarize the performance of the sleep spindle detection algorithms used in this study for each of the eight signals. Ideally, a good algorithm exhibits large sensitivity and specificity, and low false discovery rate. We observe relatively large deviations in the performance of the sleep spindle detection algorithms across the eight signals. Overall, the new algorithm a 7 exhibits large sensitivity and specificity. The more complicated new algorithm a 8 can accurately The best performing algorithm for each case appears in bold. The best performing algorithm for each case appears in bold. The best performing algorithm for each case appears in bold.
Frontiers in Human Neuroscience | www.frontiersin.org detect more spindles than the competing approaches including a 7 (large sensitivity), at the cost of decreased specificity and increased false discovery rate. We have also evaluated the absolute difference in the onset timings between the ground truth and the estimated onset: this was fairly consistent amongst the algorithms with a mean absolute difference in onset timings ranging between 0.15 and 0.2 s and the standard deviation ranging between 0.11 and 0.15 s. Overall, all algorithms performed similarly with respect to correctly detecting onset spindle timing. We have emphasized that Cohen's kappa suffers from certain limitations (Gwet, 2008) and we use it here cautiously simply to facilitate comparisons with other studies in the research literature. Specifically the (unweighted) Cohen kappa was (mean ± standard deviation): a 1 = 0.15 ± 0.12, a 2 = 0.19 ± 0.11, a 3 = 0.29 ± 0.22, a 4 = 0.46 ± 0.20, a 5 = 0.37 ± 0.19, a 6 = 0.25 ± 0.18, a 7 = 0.40 ± 0.20, a 8 = 0.18 ± 0.14.

Evaluation of the Spindle Detection Algorithms on the MASS Database
We have also evaluated the performance of all eight algorithms in terms of correctly detecting the sleep spindles in the MASS database. The results are summarized in Table 4. Interestingly, the findings in terms of sensitivity, specificity, and FDR are similar across the two databases used in this study. The algorithm a 7 outperforms the competing approaches in terms of sensitivity whilst being very competitive in terms of specificity. As indicated previously, we prefer the weighted Cohen kappa (see Table 4) penalizing more severely missed true spindles compared to false positives. Nevertheless, to facilitate direct comparisons with the research literature the unweighted Cohen kappa for the algorithms is also reported (mean ± standard deviation): a 1 = 0.20 ± 0.11, a 2 = 0.22 ± 0.04, a 3 = 0.28 ± 0.24, a 4 = 0.51 ± 0.13, a 5 = 0.38 ± 0.18, a 6 = 0.37 ± 0.18, a 7 = 0.24 ± 0.12, a 8 = 0.16 ± 0.09.

Algorithmic Comparisons with Results Reported in the Research Literature
Many researchers have indicated that it is not easy to directly compare the performance of different algorithms across studies because of the different criteria used to detect spindles and assess the performance of the automated algorithms (Devuyst et al., 2011;Nonclercq et al., 2013). Table 4 attempts to summarize many of these published findings in the research literature as an indicative reference, but we emphasize these results should be cautiously interpreted when comparing algorithms unless they have been tested on the same database using identical criteria to assess performance. Table 5 summarizes the four performance metrics in this study (sensitivity, specificity, FDR, weighted Cohen's kappa) in terms of percentile scores, thus providing a good overview of the overall performance of each algorithm (including their behavior at extremes).

Discussion
This study revisited the problem of accurate and automatic detection of sleep spindles using a single EEG channel. We reviewed some indicative and widely used signal processing approaches toward this aim, and highlighted some of the underlying problems. Two new signal processing approaches which are based on the CWT with Morlet basis were proposed and demonstrated to be very competitive against some commonly used algorithms found in the research literature. Interestingly, there was no universally best algorithm for all signals, although a 3 , a 6 , and a 7 appear to display relatively large sensitivity and specificity scores. We found that the new algorithm a 7 led to a range of 65.6-88.9% sensitivity scores and a range of 78.1-97.3% specificity scores for the DREAMS database, which compare favorably against competing approaches. The new algorithm a 8 exhibits higher sensitivity and lower specificity in the DREAMS database, on average, hence it might be more suitable primarily in cases where a human expert will post-process the estimates to determine whether the detected spindles correspond to true spindles. We re-iterate that the DREAMS sleep spindles database used in this study suffers from large inter-rater variability: the first rater has marked 289 spindles whereas the second rater has marked 409 spindles. Hence, the inter-rater agreement is lower than the agreement between raters reported in other studies (Huupponen et al., 2007), which may suggest automatic detection of spindles in this dataset may be challenging. The original manuscript submission did not include the MASS database and hence the development of the spindle detection algorithm relied only on the DREAMS data. We have deliberately refrained from any additional fine-tuning of a 7 and a 8 to optimize performance in the MASS data, which might have potentially improved our reported results on the MASS database. It is reassuring that the proposed algorithms work very well on the MASS data, in particular a 7 . It is also encouraging to see that the results of sensitivity, specificity, FDR and weighted Cohen's kappa are similar across the two databases (see Table 4) for all algorithms: this inspires confidence regarding the objective merits of each algorithm, and may be a good indicator of the performance of the sleep spindle detection algorithms in new, unseen datasets. It is possible that other studies relying on a single database to develop and test their spindle detection algorithms might have over-trained on that particular dataset, so we find the reported findings on the MASS database (truly out-of-sample) to be particularly compelling. Table 5 provides an overall summary of performance of the sleep spindle algorithms on both databases, including extremes (i.e., the algorithms at their worst and at their best) by reporting percentile values. We note that a 7 in particular is very competitive across the entire range of the distribution of performances, particularly for the MASS database (and interestingly, exhibiting good performance even for the 5th and 25th percentiles, i.e., it is fairly stable across individuals compared to many of the competing algorithms).
For reference purposes we have summarized the findings of multiple sleep spindle studies in the research literature in Table 4. However, direct comparison of findings across studies in this application is not straightforward for a number of reasons: (a) many studies rely solely on data stemming from healthy controls which are arguably easier to analyze than data from pathological cohorts (or process EEG artifact-free data, whereas the DREAMS sleep spindle database used here contains data from various sleep disorders), (b) the criteria for identifying sleep   Nonclercq et al. (2013) reported sensitivity scores ranging between 65.8 and 82.8% and specificity scores ranging between 96.7 and 98.7% for the first six signals in the database. However, we note that in both studies the authors used as ground truth only those cases where the experts agreed on the first six signals, which potentially biases the results (spindles detected by either one of the raters are probably borderline and more difficult to assess, but on the other hand are probably also more interesting). Similarly, the MASS database is a new publicly available database and we anticipate future studies will benchmark algorithms against this database. Ideally, a sleep spindle detection algorithm should correctly detect all true spindles without indicating the presence of additional (erroneous) spindles (an artifact or other class of event erroneously considered to be spindle). In practice, there is a tradeoff compromising between maximizing the detection of true spindles (true positive rate) and minimizing the false assessment of EEG segments as spindles. Essentially this is the case with the closely related algorithms a 7 and a 8 proposed in this study. The algorithm a 8 can typically correctly detect more spindles than a 7 at the cost of increasing the number of falsely detected spindles (increased false discovery rate). We note that a 6 and a 3 are similarly more prone compared to competing algorithms to decide that spindles have occurred in the EEG signal: this causes their The first row for each algorithm a 1 -a 8 corresponds to the (5,25,50,75,95) percentiles in the DREAMS database, and the second row to the percentiles in the MASS database.
true positive rate to be generally higher at the cost of additional false positives. O'Reilly and Nielsen (2014b) envisage that "most probably, manual [sleep spindle] scoring will progress toward semiautomation benefitting from further advances in signal processing" an assertion we find plausible. In that sense, if sleep spindle assessment is performed semi-automatically (prior assessment by an algorithm and subsequent checking by an expert) it is beneficial to correctly detect as many spindles as possible, even at the cost of erroneously recording spindles (i.e., increasing sensitivity at the cost of an increased false positive rate). There is probably no universal solution to this problem, and the sensitivity trade-off might need to be a free parameter of sleep spindle algorithms which could be appropriately adjusted by the operator of the algorithm. We remark that some of the sleep spindle detection algorithms used in this study require more than a single-EEG channel to detect spindles. For example, a 1 and a 6 require the use of an additional EEG channel, and a 1 -a 5 need to be presented with the hypnogram assessment (moreover the algorithm a 5 explicitly requires stage 2 assessments). We emphasize again that the proposed algorithms in this study (a 7 and a 8 ) have minimal requirements in terms of the input data in order to detect spindles: a single EEG channel is sufficient. Therefore, we argue that these new algorithms may be more readily deployable on databases which have not been scored by experts prior to sleep spindle estimation (no sleep staging requirement). Nevertheless, future studies could further explore whether the use of additional EEG channels and/or hypnogram might increase the sleep spindle detection accuracy.
A critical aspect for comparing algorithms in this application is the definition of TP, TN, FP, FN. In some studies it is not explicitly clear how authors deemed that the automated sleep spindle detector has matched the assessment of an expert in correctly identifying a sleep spindle. There is no clear consensus in the research literature currently; the last column in Table 4 summarizes some of the different approaches that have been used. We agree with Causa et al. (2010) who criticize other studies that the criteria used for algorithmic assessment are not made explicit, and would encourage other researchers to meticulously report the methodology followed to mark their assessments; ideally this methodology should be standardized to facilitate direct comparisons of algorithmic concepts.
Inspection of the results revealed that different sleep spindle detection algorithms have the potential to detect different spindles under different conditions. This would suggest that exploring some data fusion approaches might have good potential in this application. Data fusion in conceptually related applications (combining the outputs of multiple signal processing algorithms which estimate some property of the signal) has shown great promise (Mitchell, 2012;Tsanas et al., 2014;Zhu et al., 2014). In fact, simple combination approaches of the first six sleep spindle detection algorithms used in this study have been previous explored by Warby et al. (2014) but the authors did not report any significant improvement over the single best algorithm; future studies could further explore some principled data fusion frameworks in this application.