A Novel 9-Class Auditory ERP Paradigm Driving a Predictive Text Entry System

Brain–computer interfaces (BCIs) based on event-related potentials (ERPs) aim to offer communication pathways that are independent of muscle activity. While most visual ERP-based BCI paradigms require good control of the user's gaze direction, auditory BCI paradigms overcome this restriction. The present work proposes a novel approach using auditory evoked potentials, demonstrated with a multiclass text spelling application. To control the ERP speller, BCI users focus their attention on two-dimensional auditory stimuli that vary in both pitch (high/medium/low) and direction (left/middle/right) and that are presented via headphones. The resulting nine different control signals are exploited to drive a predictive text entry system. It enables the user to spell a letter by a single nine-class decision plus two additional decisions to confirm a spelled word. This paradigm, called PASS2D, was investigated in an online study with 12 healthy participants. Users spelled more than 0.8 characters per minute on average (3.4 bits/min), which makes PASS2D a competitive method. It could enrich the toolbox of existing ERP paradigms for BCI end users such as people with late-stage amyotrophic lateral sclerosis.

In an approach by Klobassa et al. (2009), 36 characters and symbols were arranged in a 6 × 6 matrix. Each row and column is represented by one of six different sounds. A character is chosen by first attending to the sound representing its row and then to the corresponding column. Thus, each selection of a symbol requires two steps.
A similar approach was studied by Furdea et al. (2009), where 25 letters are presented in a 5 × 5 grid. In this paradigm, each row and column is coded by a number which is presented auditorily. This paradigm was tested with severely paralyzed patients in the end-stage of neurodegenerative diseases.
To increase discriminability of auditory stimuli, there are recent approaches for multiclass BCI paradigms (Schreuder et al., 2009) that include a second spatial dimension. In their paradigm, stimuli vary in two dimensions (pitch and direction) with both dimensions transmitting the same (redundant) information: a tone with a specific pitch was always presented from the same direction. Letters are organized in six groups and each selection requires two steps.
There are other auditory paradigms for BCI which are not linked to speller applications. In Hill et al. (2005) and Kim et al. (2011), the authors proposed paradigms where subjects focused attention to one of two concurrent auditory stimulus sequences. Both sequences have individual target events and stimulus onset asynchronies (SOA, also called inter stimulus onset interval, or ISOI). A similar approach was described by Kanoh et al. (2008) with two sequences having the same SOA. Those paradigms enable a binary selection. Halder et al. (2010) proposed a three-stimulus paradigm with two target stimuli and one frequent stimulus. Subjects attend to one specified target stimulus, masking the other two non-target events. It was found that a paradigm with targets differing in pitch performs better than a paradigm with targets differing in loudness or direction. Furthermore, there are recent studies that investigate ERP correlates for perceived and imagined rhythmic auditory patterns which can potentially be used for BCI paradigms in the future (Schaefer et al., 2011).
This work presents an approach that combines several characteristics of the above-mentioned work by using novel spatial auditory stimuli in a nine-class paradigm. Similarly to the approach of Schreuder et al. (2010), the auditory stimuli vary in both pitch (high, medium, low) and direction (left, middle, right). However, the information transmitted by the two dimensions is independent and not redundant as in Schreuder's approach. The resulting 3 × 3 design offers an arrangement of nine stimuli that are easy to discriminate from each other. The effectiveness of this nine-class BCI is demonstrated in a spelling application. It establishes an intuitive spelling procedure that can be used even in the presence of visual impairments. Initial results of this new approach have been presented in Höhne et al. (2010).
The paradigm was named "Predictive Auditory Spatial Speller with two-dimensional stimuli," or PASS2D. Twelve healthy participants were asked to spell two sentences in an online experiment with the finding that PASS2D is more accurate and faster than most of the auditory ERP spellers previously reported.

PARTICIPANTS
Twelve healthy volunteers (9 male, mean age: 25.1 years, range: 21-34, all non-smokers) participated in a single session of a BCI experiment. Table 1 provides details about the age and sex of the participants. A session consisted of a calibration phase first and a subsequent online spelling part. It lasted 3-4 h. Subjects were not paid for participation. Two of them (VPmg and VPja) had previous experience with BCI. Each participant provided informed consent and did not suffer from a neurological disease and had normal hearing. The analysis and presentation of data was anonymized. Two subjects (VPnx and VPmg) were excluded from the online phase due to a poor estimated classification performance based on the calibration data.

DATA ACQUISITION
Electroencephalogram signals were recorded monopolarly using a Fast'n Easy Cap (EasyCap GmbH) with 63 wet Ag/AgCl electrodes placed at symmetrical positions based on the International 10-20 system. Channels were referenced to the nose. Electrooculogram (EOG) signals were recorded in addition. Signals were amplified using two 32-channel amplifiers (Brain Products), sampled at 1 kHz, and filtered by an analog bandpass filter between 0.1 and 250 Hz. Further analysis was performed in Matlab. The online feedback was implemented in the pythonic feedback framework (PyFF; Venthur et al., 2010). After applying the analog filter, the data was low-pass filtered to 40 Hz and downsampled to 100 Hz. The data was then epoched between −150 and 800 ms relative to each stimulus onset, using the first 150 ms as a baseline.
Figure 1 shows the course of the experiment. Both parts of the experiment consisted of an auditory oddball task where the participants were asked to focus on target stimuli while ignoring all non-target stimuli. Participants were asked to minimize eye movements and the generation of any other muscle artifacts during all parts of the experiment.
They sat in a comfortable chair facing a screen that showed the visual representation of a 3 × 3 pad with 9 numbers ordered row-wise (see Figure 2). Light neckband headphones (Sennheiser PMX 200) were positioned comfortably. While the EEG cap was being prepared, the participants got used to the tonal character and speed of the stimulus presentation by listening to the stimuli that were used in the spelling paradigm later on.

Collection of calibration data
Three calibration runs were recorded per subject. To differentiate target and non-target subtrials in later experimental stages, the collected data was subsequently used to train a binary classifier as described in Section 2.5. Each calibration run consisted of nine trials (i.e., nine multiclass selections) with each of the nine sounds being the target during one of the trials, see Figure 1. In addition, one practice run (run 0) without recording was performed initially. Prior to the start of each calibration trial, the current target cue was presented to the subject three times while, in addition, the corresponding number on the 3 × 3 grid was highlighted on the screen. During the calibration phase, each trial consisted of 13 or 14 pseudo-random sequences of all nine auditory stimuli. Visual stimuli were not given during these trials. While the last 12 sequences were used to train the classifier, the first one or two sequences were dismissed to ensure a balanced distribution of stimuli in the calibration data. The presentation of one tone stimulus together with the corresponding EEG data (an epoch of up to 800 ms after a stimulus) is called a subtrial. Thus a single trial provided 9 × 12 = 108 subtrial epochs for classifier training.

CLASSIFICATION
Binary classification of target and non-target epochs was performed using a (linear) Fisher discriminant analysis (FDA). The features used for classification consisted of two to four amplitude values per channel (2.6 on average). These values represented the mean potential in intervals of epochs that were discriminative for the classification task. The intervals were selected manually based on a discriminance analysis (r-square values of targets vs. non-targets) of the calibration data. Due to the large dimensionality of the features [{2, 3, 4} intervals × 63 channels = {126, 189, 252} features], a shrinkage method (Blankertz et al., 2011) was applied to regularize the FDA classifier. All 2916 subtrial epochs (minus those discarded with artifact rejection) were used to train the classifier.
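The feature extraction described above can be illustrated with a minimal sketch. It assumes epochs sampled at 100 Hz that start 150 ms before stimulus onset, as stated in the data acquisition section; the function name `interval_mean_features`, the half-open interval convention, and the example intervals are hypothetical and not taken from the study.

```python
def interval_mean_features(epoch, intervals_ms, fs_hz=100, t0_ms=-150):
    """Mean potential per channel in each selected time interval.

    epoch        : list of channels, each a list of samples (first sample at t0_ms)
    intervals_ms : list of (start, end) pairs in ms relative to stimulus onset,
                   treated as half-open [start, end) (an assumption of this sketch)
    Returns a flat feature vector ordered interval by interval, channel by channel.
    """
    features = []
    for start, end in intervals_ms:
        first = round((start - t0_ms) * fs_hz / 1000)
        last = round((end - t0_ms) * fs_hz / 1000)
        for channel in epoch:
            segment = channel[first:last]
            features.append(sum(segment) / len(segment))
    return features


# 63 channels x 95 samples (-150 ms to 800 ms at 100 Hz), filled with dummy values
dummy_epoch = [list(range(95)) for _ in range(63)]
example_intervals = [(230, 300), (400, 500)]  # hypothetical interval choice
features = interval_mean_features(dummy_epoch, example_intervals)
print(len(features))  # 2 intervals x 63 channels
```

With two, three, or four intervals this yields 126, 189, or 252 features per epoch, which is the dimensionality that motivates the shrinkage regularization.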
The classification error for unseen data was estimated by cross validation. Because the feature interval borders were selected globally, a slightly over-optimistic estimate was accepted, as the performance during the online task in a later step would provide a precise error measure.
In the online spelling task, each one-out-of-nine multiclass decision was based on a fixed number (n = 135) of subtrials and their classifier outputs. To select the attended key based on the 15 classifier outputs for every key, a one-sided t-test with unequal variances was applied for each key and the most significant key (i.e., the one with the lowest p-value) was chosen.
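A simplified sketch of this decision rule is given below. Two assumptions are made that the text does not spell out: each key's 15 outputs are compared against the pooled outputs of the other eight keys, and target epochs yield more negative classifier outputs. In addition, the sketch ranks keys by the Welch t statistic rather than by the p-value, which would also depend on the Welch-Satterthwaite degrees of freedom; it is therefore an approximation of the described procedure, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample, rest):
    """Welch t statistic for mean(sample) - mean(rest) with unequal variances."""
    standard_error = math.sqrt(
        variance(sample) / len(sample) + variance(rest) / len(rest)
    )
    return (mean(sample) - mean(rest)) / standard_error


def select_key(outputs_by_key):
    """Return the key whose outputs lie most clearly below those of all other keys.

    outputs_by_key: dict mapping each key (1..9) to its list of classifier outputs.
    """
    scores = {}
    for key, own in outputs_by_key.items():
        rest = [x for other, values in outputs_by_key.items()
                if other != key for x in values]
        scores[key] = welch_t(own, rest)
    # most negative statistic = strongest one-sided evidence for "attended"
    return min(scores, key=scores.get)


# Synthetic example: key 5 is shifted toward more negative outputs
outputs = {k: [0.1 * (((i * 7 + k) % 5) - 2) for i in range(15)]
           for k in range(1, 10)}
outputs[5] = [x - 1.0 for x in outputs[5]]
print(select_key(outputs))
```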

PREDICTIVE TEXT SYSTEM
For the presented ERP speller, the commonly used T9 predictive text system from mobile phones [discussed in Dunlop and Crossan (2000)] was applied in a modified version. A similar approach was presented in Jin et al. (2010) in order to effectively communicate Chinese characters in a visual BCI paradigm. The standard T9 system uses more than nine keys: key "1" codes for dot/comma, keys "2" to "9" code for the alphabet, "0" for space, and "+" and "#" for symbols or further functions. The system was modified such that instead of the 12 keys mentioned above, only nine keys were needed for spelling while retaining an intuitive control scheme. The system was constrained to words in a corpus of about 10,000 frequently used words of the German language, which can be arbitrarily extended.
Each calibration trial contributed 12 target subtrials and 8 × 12 non-target subtrials to the classifier training. The combined training data from all runs comprised 108 × 27 = 2916 subtrial epochs per person (minus a small fraction of artifactual epochs that were discarded).
Participants were asked to count the targets and to report the number of occurrences at the end of each trial (counting task). A simple minimum/maximum-threshold method was applied to exclude artifactual epochs from the calibration data: epochs were rejected if their peak-to-peak voltage difference in any channel exceeded 100 μV.
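The rejection criterion can be written down in a few lines. The following is a minimal sketch, assuming epochs are stored as nested lists of samples in μV; the function name `reject_artifacts` is hypothetical.

```python
def reject_artifacts(epochs, threshold_uv=100.0):
    """Keep only epochs whose peak-to-peak amplitude does not exceed the threshold.

    epochs: list of epochs; each epoch is a list of channels;
            each channel is a list of samples in microvolts.
    An epoch is rejected as soon as any single channel exceeds the threshold.
    """
    kept = []
    for epoch in epochs:
        if all(max(channel) - min(channel) <= threshold_uv for channel in epoch):
            kept.append(epoch)
    return kept


clean = [[0.0, 10.0, -10.0], [5.0, 5.0, 5.0]]
noisy = [[0.0, 10.0, -10.0], [-60.0, 50.0, 0.0]]  # second channel spans 110 uV
print(len(reject_artifacts([clean, noisy])))
```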

Online spelling task
After the calibration of the binary classifier (see Section 2.5), two online spelling runs were performed. Subjects were asked to spell a short German sentence ("Klaus geht zur Uni") composed of 18 characters (including space characters) and a long sentence composed of 36 characters ("Franz jagt im Taxi quer durch Berlin") in separate runs, see Figure 1. The task was to finish both sentences without mistakes, thus each false selection had to be corrected.
The order of the sentences was randomized. In order to help the participants to remember the text, the whole sentence was printed on a sheet of paper, which was placed next to the screen. Each trial consisted of 135 subtrials (15 iterations of 9 stimuli, see Figure 1). Artifact correction was not applied in the online runs.

AUDITORY STIMULI
The selection of stimuli is a crucial element for any kind of ERP BCI system. Here, three tones varying in pitch (high/medium/low) and tonal character were carefully chosen in a way that they are, on a subjective scale, as different as possible from each other. The tones were generated artificially with 708 Hz (high), 524 Hz (medium), and 380 Hz (low) as base frequencies.
Each tone was presented on the headphones from three different directions: only on the left channel, only on the right channel, and on both channels. This two-dimensional 3 × 3 design closely mirrors the number pad of a standard mobile phone (see Figure 2), where, e.g., key 4 is represented by the medium tone pitch (used for keys 4, 5, and 6) presented on the left channel only (used for keys 1, 4, and 7).
Each stimulus lasted 100 ms, SOA was 225 ms (see Figure 1), and a low-latency USB sound card (Terratec DMX 6Fire USB) was used to reduce latency and jitter. The pseudo-random sequences of stimuli were generated such that two subsequent stimuli did not have the same pitch. Moreover, the same stimulus was repeated only after at least three other stimuli had appeared. During the calibration and during the spelling of the short sentence, the stimulus classes were sequence-wise balanced.
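The key layout and the two sequence constraints can be illustrated with a short sketch. It assumes that "sequence-wise balanced" means each block of nine subtrials contains every stimulus exactly once, and it uses plain rejection sampling; the generator actually used in the study is not described, so the procedure and all names below are hypothetical.

```python
import random

PITCHES = ("high", "medium", "low")        # rows of the 3 x 3 pad
DIRECTIONS = ("left", "middle", "right")   # columns of the 3 x 3 pad


def key_for(pitch, direction):
    """Key number (1..9) on the row-wise ordered pad."""
    return 3 * PITCHES.index(pitch) + DIRECTIONS.index(direction) + 1


def pitch_of(key):
    return PITCHES[(key - 1) // 3]


def is_valid(stream):
    """Check both constraints on a stream of key numbers."""
    for i, key in enumerate(stream):
        if i > 0 and pitch_of(stream[i - 1]) == pitch_of(key):
            return False   # two subsequent stimuli share a pitch
        if key in stream[max(0, i - 3):i]:
            return False   # repeated before three other stimuli appeared
    return True


def generate(n_sequences, rng, max_tries=100000):
    """Concatenate balanced blocks of nine stimuli that respect the constraints."""
    stream = []
    for _ in range(n_sequences):
        for _ in range(max_tries):
            block = rng.sample(range(1, 10), 9)
            # only the last three stimuli can interact with the new block
            if is_valid(stream[-3:] + block):
                break
        else:
            raise RuntimeError("no valid block found")
        stream.extend(block)
    return stream


print(key_for("medium", "left"))           # key 4, as in the example above
print(generate(2, random.Random(0)))
```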
The visual domain was only used to report which selections were made, which text had already been spelled, or which words were available to choose from in the so-called control mode (see Figure 3B and Section 2.6 for an explanation of the two system modes during the online spelling phase).
On average over all 10 remaining subjects, 77.7% of the stimuli were correctly classified. To account for the imbalance between non-targets and targets, the classwise balanced accuracy was used, which is the average decision accuracy across classes (target vs. non-target, chance level 50%).
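The classwise balanced accuracy can be made concrete with a small sketch (the function name is hypothetical). With one target among nine stimuli, a classifier that always answers "non-target" reaches a plain accuracy of 8/9 but a balanced accuracy of only 0.5.

```python
def balanced_accuracy(labels, predictions):
    """Mean of the per-class accuracies for a binary target/non-target task.

    labels, predictions: sequences of booleans (True = target).
    Both classes must occur at least once in labels.
    """
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for label, prediction in zip(labels, predictions):
        totals[label] += 1
        if label == prediction:
            hits[label] += 1
    return 0.5 * (hits[True] / totals[True] + hits[False] / totals[False])


labels = [True] + [False] * 8
always_non_target = [False] * 9
print(balanced_accuracy(labels, always_non_target))
```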

LOCATION AND LATENCY OF N200 AND P300
Figure 4 depicts the grand average ERPs at electrodes "Cz" and "FC5" together with the corresponding scalp maps for two time intervals. As expected, the ERPs for the non-target stimuli (gray lines) show a regular pattern that reflects the neural processing of the auditory stimuli. It occurs every 225 ms and is dominated by an N200 component. Moreover, those plots illustrate the different EEG signatures of the non-targets and the targets. At frontal electrodes a lateral and symmetric class-discriminative negativity is observed 230-300 ms after stimulus onset. It directly follows the N200 component. For simplicity, it will be referred to as the N200 component in the following. Starting from approximately 350 ms after the stimulus onset, a second class-discriminative interval is observed for target stimuli. It is a symmetric positive component located at central electrodes and will be referred to as the P300 component in the following. The amount of class discriminability that is contained in the two electrodes during different time intervals is represented by two colored bars (Figure 4B). They depict a modified version of the area under the ROC curve (AUC), which is in the range of [0, 1]. However, the AUC does not provide information about the direction of an effect. Thus, we used a simple modification of the AUC which is signed and linearly scaled to the range of [−1, 1]. The resulting measure will be referred to as ssAUC in the following. Positive ssAUC values are colored in red and represent time intervals where target ERP amplitudes are larger than non-target ERP amplitudes. Negative ssAUC values are colored in blue and represent time intervals with target ERP amplitudes smaller than non-target ERP amplitudes.
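One way to obtain such a signed and scaled measure is sketched below. The exact definition used for ssAUC is not given in the text; the sketch assumes the linear map ssAUC = 2 · AUC − 1, which sends the AUC range [0, 1] to [−1, 1] and gives positive values when target amplitudes tend to be larger. Function names are hypothetical.

```python
def auc(targets, nontargets):
    """Probability that a random target value exceeds a random non-target value.

    Ties count as one half. Quadratic in the number of samples, which is
    adequate for an illustration.
    """
    wins = 0.0
    for t in targets:
        for n in nontargets:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(targets) * len(nontargets))


def ss_auc(targets, nontargets):
    """Signed, scaled AUC in [-1, 1] (assumed form: 2 * AUC - 1)."""
    return 2.0 * auc(targets, nontargets) - 1.0


print(ss_auc([3.0, 4.0], [1.0, 2.0]))   # targets larger
print(ss_auc([1.0, 2.0], [3.0, 4.0]))   # targets smaller
```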
Due to the contra-lateral processing of auditory stimuli (Langers et al., 2005), the N200 was expected to vary for each stimulus. Figure 5A depicts the grand average ssAUC scalp maps of the N200 for each of the nine stimuli, illustrating that the early negative deflection varies spatially for different auditory stimuli, whereas the P300 does not (Figure 5B). In most multiclass ERP paradigms the classification is based on a two-class problem (target vs. non-target). Thus, the fact that there might be a variability in the spatial (or temporal) distribution of discriminant information for different stimuli is mostly disregarded. Although the classification procedure in the presented approach is also based on a two-class decision, Figure 5A shows that there is some spatial class-discriminant information which is not exploited yet. Generally, the discriminative information of the N200 component seems to be stronger on the left hemisphere (cp. the grand average in Figure 4D). In addition, this spatial distribution seems to follow a slight contra-lateral tendency (Figure 5A): stimuli presented on the right audio channel (right column in the grid of scalp maps) induce a more discriminative N200 component on the left hemisphere. Conversely, the class-discriminative N200 components induced by stimuli on the left audio channel (left column in the grid of scalp maps) are located rather on the right hemisphere.
To obtain a spelling scheme that is easy and intuitive to use and at the same time flexible and fast with only nine keys, two modes were implemented: a spelling mode and a control mode, see Figure 3.
In analogy to the predictive text system in mobile phones, a word was spelled by entering a sequence of keys. To spell a character, the user had to select the corresponding key ("2" to "9") in the spelling mode. Each key codes for three or four characters, see Figure 3A.
After selecting the correct sequence of keys for a specific word, the user chose key "1" to switch the system into the control mode. In this mode, the user sees the desired word in a list, together with all other words which can be represented by the entered sequence. By choosing one of the keys "4" to "8", the user determines the desired word with one additional selection step. The list of matching words is ordered such that more frequent words (i.e., words with a higher rank according to the underlying corpus) are represented by smaller key numbers. As an example: after entering the keys "6346361" the user can choose from "nehmen," "meinem," "meinen," and "Nehmen" (see Figure 3B) as all these words can be represented by the entered sequence of keys. In the control mode, the system is limited to a maximum list size of five words, which was sufficient to spell each word of the underlying corpus.
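The lookup behind the control mode can be sketched as follows. The letter-to-key table is the standard phone layout, which is consistent with the example sequence above but is an assumption here, since Figure 3A is not reproduced; the handling of umlauts is not covered. The toy corpus, its ordering, and the assignment of the first match to key "4" are likewise hypothetical.

```python
# Assumed standard phone layout for keys "2" to "9"
KEY_LETTERS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
               "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {letter: key
                 for key, letters in KEY_LETTERS.items() for letter in letters}


def encode(word):
    """Key sequence that spells a word (case-insensitive, a-z only)."""
    return "".join(LETTER_TO_KEY[letter] for letter in word.lower())


def candidates(code, ranked_corpus, max_words=5):
    """Map control-mode keys "4".."8" to the words matching a key sequence.

    ranked_corpus: words ordered from most to least frequent.
    code: the entered key sequence without the trailing "1".
    """
    matches = [word for word in ranked_corpus if encode(word) == code]
    return {str(4 + i): word for i, word in enumerate(matches[:max_words])}


toy_corpus = ["nehmen", "meinem", "meinen", "Nehmen", "geht", "zur"]
print(encode("nehmen"))
print(candidates("634636", toy_corpus))
```

Entering "634636" followed by "1" would then present the four matching words on keys "4" to "7", and one further selection picks the intended word.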
In case the user performed an erroneous multiclass selection, it could be corrected by one to three additional selections. If the selection of the last sequence key did not conform to the corpus (there was no word in the corpus that fitted the entered code), it was not accepted and could be corrected with only one selection. If the mode was changed by mistake, it took one selection to return to the correct mode and another to choose the right key. In all other cases of erroneous selections, it took the user two selections to delete an erroneous key (change the mode by entering "1" and then delete the last key by entering "3") and a third selection to enter the intended key.
Table 1 shows the results of the counting task during the calibration phase. Row "|diff| in counts" contains the sum of the absolute differences between the correct and the reported number of target presentations for each trial. A variation ranging from 1 to 35 was observed, which also indicates a varying ability across subjects to discriminate between the stimuli. It can be seen that this behavioral data is not directly linked to the spelling performance: some subjects (VPny, VPoe) have poor behavioral results (i.e., inaccurate counting) but perform well in the online spelling. On the other hand, those subjects with a bad spelling performance did not have particularly bad behavioral results (see VPmg and VPnx). Based on these results, the behavioral data alone could not be used as a predictor for the spelling performance in the online phase.

OFFLINE BINARY ACCURACY
The accuracy of a binary decision (based on the epoch of one subtrial) was estimated on the calibration data for each participant. Based on the estimated errors, participants VPnx and VPmg were excluded from the following online experiments due to a poor binary classification performance of less than 70% (classwise balanced). The estimates were obtained by a cross validation analysis (see Table 1).

DISCRIMINATIVE INFORMATION IN THE SPATIAL AND TEMPORAL DOMAIN
The impact of the spatial and the temporal domain on the classifier was investigated separately (Figure 6) by analysis of isolated short time segments. The most discriminative information was found 400-500 ms after the stimulus onset, which reflects the importance of the P300 component. The most discriminative information was found at central-lateral locations such as C4/C5, when averaging intervals were selected heuristically. Comparing this to the grand average ERP scalp maps in Figure 4, one can find an overlap of N200 and P300 in the mentioned areas. This stresses the importance of the N200, although a stimulus-specific variation (see Figure 5 and results above) was found. It can be concluded that both components N200 and P300 can be used for classification, but the P300 component contains more discriminative information.

ONLINE BIT RATE AND CHARACTERS PER MINUTE
It took 15-26 min (μ = 20.9) to spell the short sentence and 31-76 min (μ = 43.5) for the long sentence. Variation in the number of multiclass selections originates from the different number of false selections (see Table 1), which then had to be corrected. Since the sentences were not spelled word by word but in one go, all kinds of pauses are taken into account. However, the time for individual relaxation and fixed inter-trial periods are among the main factors influencing the spelling speed. Figure 7 shows that neglecting the time for individual relaxation, and thereby only considering the stimulation time (∼31 s) and a fixed inter-trial time (4 s), results in an average benefit of more than 1 bit/min or 0.25 char/min. In general, a higher multiclass accuracy can be obtained by increasing the number of subtrials. The rate of communication (Wolpaw et al., 2002) counterbalances this effect, making it possible to compare different studies more accurately.
On average, subjects achieved an information transfer rate (ITR) of 3.4 bits/min in the online condition (based on the nine-class decision, including all pauses), see Table 2. An average online spelling speed of 0.89 characters/minute was observed, see also Table 1 and Figure 7.
In general, the information of one character is coded by at least 4.75 bits (1 out of 27, 26 letters plus space). Considering that the BCI controlled speller application presented here enables an average spelling speed of 0.89 characters/min, the ITR could also be quantified with 4.23 bits/min (0.89 × 4.75) in a hypothetical BCI paradigm with 27 classes. The discrepancy between 3.4 and 4.23 bits/min can be explained with the predictive text system, which thus increases the ITR by at least 0.83 bits/min or 24%.
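The arithmetic in this comparison can be retraced with a short sketch. The first function uses the widely cited bits-per-selection formula associated with Wolpaw et al. (2002); that the study applied exactly this variant is an assumption. The remaining lines reproduce the 27-class estimate and the relative gain from the reported averages; function names are hypothetical.

```python
import math


def wolpaw_bits_per_selection(n_classes, accuracy):
    """Bits per selection for N classes and selection accuracy P.

    B = log2(N) + P*log2(P) + (1 - P)*log2((1 - P) / (N - 1))
    Terms with a zero factor are skipped to avoid log2(0).
    """
    bits = math.log2(n_classes)
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits


def bits_per_minute(n_classes, accuracy, selections_per_minute):
    return wolpaw_bits_per_selection(n_classes, accuracy) * selections_per_minute


# Hypothetical 27-class speller reaching the observed spelling speed
chars_per_min = 0.89
bits_per_char = math.log2(27)              # about 4.75 bits
equivalent_itr = chars_per_min * bits_per_char
observed_itr = 3.4
gain = equivalent_itr - observed_itr
print(round(equivalent_itr, 2), round(gain, 2), round(100 * gain / observed_itr))
```

The bits-per-minute value depends on the time per selection, which varies across subjects and pauses; the sketch therefore leaves `selections_per_minute` as an input rather than deriving the reported 3.4 bits/min.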

ONLINE MULTICLASS ACCURACY
Averaged over all trials and participants of the online experiments, 89.37% of the multiclass decisions were correct (chance level is 1/9, thus 11.11%). The multiclass accuracy was slightly higher for the short sentence (92.63%) than for the long sentence (87.95%), see Table 1. Table 3 reveals that none of the stimuli has a significantly increased or decreased accuracy. Nevertheless, one can observe that the low-pitch keys "7," "8," and "9" (although chosen less frequently than other keys) have the highest selection accuracy.
Furthermore it was found that keys "1," "3," and "4" are chosen more frequently than others. This is due to the design of the predictive text system.
In the presented paradigm, the nine auditory stimuli are not completely independent: for each target there are four non-targets being equal in one dimension, i.e., two stimuli with the same pitch (same row) and two stimuli with the same direction (same column). Since these similarities could influence the results, it was tested if this is reflected in the binary classifier outputs or multiclass decisions.
(1) False positives for non-targets with the same pitch as the target: The probability of false positives in single-epoch binary classification for these non-targets was in fact higher than for other non-targets, as the classifier outputs were significantly more negative and thus more similar to target outputs (p < 10^−20). An increased probability for erroneous multiclass selections with the correct pitch but wrong direction was observed as well. Table 3 reveals that 47 out of 79 multiclass errors had an equal pitch. Assuming no dependency, one would expect 19.75 (2 out of 8). This is a significant deviation (χ² test with p < 10^−11).
(2) False positives for non-targets with the same direction as the target: No significant effect was found for non-targets with a correct direction but with different pitch in comparison to other non-targets (p = 0.13), although the average classifier outputs were again more negative. Multiclass selection errors toward a decision with a correct direction but wrong pitch were not accumulated.
According to these results, the classifier could resolve the dimension "pitch" better than the dimension "direction", which is also in line with the findings by Halder et al. (2010).
Four subjects (VPnv, VPnz, VPoc, VPoe) had a sudden drop of multiclass accuracy within the online phase. The exact reason for that effect remains unclear. Technical problems as well as physiological instabilities or lack of concentration may have caused this effect, but could be neither found nor excluded. Experiments for VPnv, VPoc, and VPoe were stopped after the drop of accuracy.

Table 3: The diagonal elements (correct decisions) are marked in bold. Column "Acc" provides the specific accuracy for each key.

DISCUSSION
It is clear that the stimulus characteristics have a strong impact on the BCI performance. Our decision for a 3 × 3 design was partially driven by the possibility to use a T9-like text encoding system, even though other designs could potentially be better in terms of signal-to-noise ratio. Prior work showed that both stimulus types (direction and pitch) contain valuable information for a discrimination task, and that a redundant combination can enhance the separability compared to the single stimulus types. For a multiclass approach (nine or more classes) with redundant coding, however, the pitch steps or angle distance might get too small for a reliable perception. Thus, the decision to use the two stimulus dimensions independently, instead of using them to code the same information in a redundant way, represents a trade-off between large steps in pitch/angle and a relatively large number of nine classes.
Our results show that the PASS2D paradigm offers fast spelling speed (average of 0.89 chars/min and 3.4 bits/min) and an intuitive interaction scheme while being driven by simple stimuli from headphones.
Using modern machine learning approaches for ERP classification (Müller et al., 2008), the individual discriminative ERP signatures of subjects could be exploited reasonably well and in real-time and most participants could spell two complete sentences during a single session.
The direct performance comparison of BCI spellers is difficult, as the reported ITRs (in bits/min or char/min) are based on different assumptions or definitions. For example, the inclusion or exclusion of pause times between trials or runs has a strong impact on the calculation of ITR (see Section 3.5). Moreover, several formulae for the calculation of ITR were proposed by the BCI community (Wolpaw et al., 2002; Schlögl et al., 2007) and are actively used. The majority of reported ITRs for different visual BCI paradigms with healthy subjects range approximately from 10 to 20 bits/min (with some positive outliers of up to 80 bits/min).
Although among the fastest currently available audio paradigms for BCI, the present work is not reaching the ITR level of these visual paradigms yet, but it is not far from this performance. As the line of research of auditory BCI is relatively young, the potential future development is promising. Moreover, as pointed out in Section 1, it represents a qualitatively new solution for end users with visual impairments.
Embedded in the patient-oriented TOBI project, the presented paradigm follows principles of user-centered design. Firstly, this is expressed by the decision to use a T9-like text entry method. The spelling process in PASS2D is easy to understand and widely known because of its similarity to T9 spelling in mobile phones. Moreover, it implements a predictive text entry system that improves the spelling speed and usability.
Secondly, although the spatial dimension as a class-discriminative cue could be exploited in a more fine-grained way (cp. the approach of Schreuder et al. (2010) using up to eight spatial directions), the PASS2D approach was restricted to three directions only. With this decision, the hardware complexity and space requirements for the setup of the system at a patient's home can be reduced, as three directions can be implemented by off-the-shelf headphones and simple stereo sound cards.
Thirdly, the PASS2D paradigm has the potential to adapt to its user in terms of the underlying language model: the predictive text system can consider individual spelling profiles via updates of the text corpus. This implements an important aspect of flexibility, as patients tend to use a lot of individual abbreviations of frequently used terms in order to speed up their communication.
Fourthly, the presented speller design is flexible with respect to the sensory modality. Although operated as a spelling interface with auditory feedback, the interaction scheme is also well suited for visual ERP stimuli or control via eye tracking assistive technology with full visual feedback. In combination with suitable visual highlighting effects (Hill et al., 2009; Tangermann et al., 2011), the graphical representation of the speller (see Figure 3) can directly be used to elicit ERP effects by a visual oddball. Thus, patients in LIS with remaining gaze control could use either a visual or a hybrid (del Millán et al., 2010) visual-auditory version of the speller. With a progressing neurodegenerative disease, a further decrease in gaze control, or daily changing conditions, the patient has the opportunity to switch from the visual to the hybrid or to the purely auditory setting. As the elicited ERPs are expected to change during this transition, the underlying feature extraction and classification should of course be adapted. If it is possible to perform this transition in a transparent manner, patients can simply continue to use the same interaction scheme independent of the stimulus modality in action.
It is concluded that this auditory ERP speller enables BCI users to kick-start communication within a single session and thereby offers a promising alternative for patients in LIS or CLIS. The next step will be to further simplify the spelling procedure such that it allows a purely auditory navigation.
Future work will also be conducted to further improve the paradigm with respect to spelling speed, pleasantness, intuitiveness, and applicability for patients in locked-in state and complete locked-in state. Experiments with patients are planned.