Eye Movements During Everyday Behavior Predict Personality Traits

Besides allowing us to perceive our surroundings, eye movements are also a window into our mind and a rich source of information on who we are, how we feel, and what we do. Here we show that eye movements during an everyday task predict aspects of our personality. We tracked eye movements of 42 participants while they ran an errand on a university campus and subsequently assessed their personality traits using well-established questionnaires. Using a state-of-the-art machine learning method and a rich set of features encoding different eye movement characteristics, we were able to reliably predict four of the Big Five personality traits (neuroticism, extraversion, agreeableness, conscientiousness) as well as perceptual curiosity only from eye movements. Further analysis revealed new relations between previously neglected eye movement characteristics and personality. Our findings demonstrate a considerable influence of personality on everyday eye movement control, thereby complementing earlier studies in laboratory settings. Improving automatic recognition and interpretation of human social signals is an important endeavor, enabling innovative design of human–computer systems capable of sensing spontaneous natural user behavior to facilitate efficient interaction and personalization.

A ratio between eye movements describes the ratio between the numbers of occurrences of two event types in one time window. For example, if there were 1 fixation and 2 saccades in a time window, the ratio of fixations to saccades would be 1/2. The rate of an eye movement is defined as its number of occurrences per second (e.g. 5 fixations per second).
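The two definitions above can be sketched as follows; the function names and the example numbers are our own, taken from the text:

```python
# Sketch of the ratio and rate definitions above (function names are hypothetical).
def event_ratio(n_a, n_b):
    """Ratio between the occurrence counts of two event types in a time window."""
    return n_a / n_b if n_b else float("nan")

def event_rate(n_events, window_seconds):
    """Number of occurrences per second within a time window."""
    return n_events / window_seconds

# Example from the text: 1 fixation and 2 saccades in a window
# give a fixation-to-saccade ratio of 1/2.
ratio = event_ratio(1, 2)   # 0.5
rate = event_rate(5, 1.0)   # 5 fixations per second
```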
• fixation/saccade rate (2)
• small/large saccade rate (2)
• left/right saccade rate (2)
• saccade to fixation ratio
• ratio of small/large/right/left saccades to the total number of saccades (4)
• mean/variance/minimum/maximum saccade amplitude (4)
• mean/variance/minimum/maximum saccadic peak velocity (4)
• mean/variance/minimum/maximum of the mean pupil diameter during saccades (4)
• mean/variance/minimum/maximum of the fixation duration (4)
• dwelling time, i.e. sum of all fixation durations
• mean/variance of the mean of angles between subsequent fixations (2)
• mean/variance of the variance of angles between subsequent fixations (2)
• mean/variance of the variance in raw x coordinates during fixations (2)
• mean/variance of the variance in raw y coordinates during fixations (2)
• mean/variance of the mean pupil diameter during fixations (2)
• mean/variance of the variance in pupil diameter during fixations (2)
• mean/variance/minimum/maximum of blink duration (4)
• blinks per second

Statistics over raw gaze data
Features that are computed on the raw x coordinates, raw y coordinates and pupil diameter separately are abbreviated as 'XYD'.
• heatmap counts with 8 × 8 cells (64)
• mean of XYD (3)
• 3rd quartile of the distribution over XYD (3)
• interquartile range of the distribution over XYD (3)
• absolute value of the mean difference between subsequent XYD (3)
• mean difference between subsequent XYD (3)
• mean angle between two subsequent raw gaze points (i.e. between the x-axis and the vector connecting the two gaze points)
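As an illustration, a few of the statistics listed above can be computed with NumPy on synthetic gaze coordinates (the data and variable names below are made up for this sketch):

```python
import numpy as np

# Synthetic raw gaze coordinates for illustration only.
x = np.array([0.1, 0.4, 0.2, 0.8, 0.5, 0.7])   # raw x coordinates
y = np.array([0.0, 0.1, 0.3, 0.2, 0.6, 0.4])   # raw y coordinates

mean_x = x.mean()                                # mean of X
q3_x = np.percentile(x, 75)                      # 3rd quartile of the distribution
iqr_x = q3_x - np.percentile(x, 25)              # interquartile range
mean_diff_x = np.diff(x).mean()                  # mean difference between subsequent X
abs_mean_diff_x = abs(mean_diff_x)               # absolute value of that mean difference

# mean angle between two subsequent raw gaze points, measured against the x-axis
mean_angle = np.arctan2(np.diff(y), np.diff(x)).mean()

# heatmap counts with 8 x 8 cells (64 features)
heatmap, _, _ = np.histogram2d(x, y, bins=8)
```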

Information on the temporal course of saccades and fixations
n-grams describe series of n events. For instance, 2-grams based on the events {short fixation, long fixation, saccade} could be [short fixation, saccade] or any of the 3² = 9 possible combinations of events. The frequency of one such n-gram is defined as its number of occurrences, e.g. the example 2-gram [short fixation, saccade] could occur twice in a given time window. The frequencies of all possible n-grams can be computed, concatenated, and then features such as the mean over these frequencies can be extracted.
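A minimal sketch of this n-gram frequency extraction, using the example event labels from the text (the event sequence itself is invented):

```python
from collections import Counter

# Invented event sequence using the example labels from the text.
events = ["short fixation", "saccade", "short fixation", "saccade", "long fixation"]

# Extract all 2-grams (overlapping windows of two subsequent events).
n = 2
ngrams = [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]
freqs = Counter(ngrams)

# The example 2-gram [short fixation, saccade] occurs twice here.
count = freqs[("short fixation", "saccade")]

# Summary features over the observed n-gram frequencies:
values = list(freqs.values())
mean_freq = sum(values) / len(values)
freq_range = max(values) - min(values)
```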
n-grams with n from 1 to 4 were computed, once based on saccades only and once based on both saccades and fixations. For saccades only, each saccade was labelled as small or large, and with one out of 8 possible directions. For saccades and fixations, a fixation was labelled as either short or long; a saccade was labelled as small or large and with one out of 4 possible directions.
• number of different movements
• highest/lowest frequency within the n-gram (2)
• most/least frequent movements (2)
• range/mean/variance of frequencies within the n-gram (3)

Table S1. Personality score ranges for each personality trait. Participants with raw values smaller than or equal to boundary 1 were assigned to personality score range 1, those larger than boundary 1 but smaller than or equal to boundary 2 were assigned to score range 2, and those larger than boundary 2 were assigned to score range 3.

DESCRIPTIVE ANALYSIS
To facilitate comparisons with related work we also provide correlation coefficients between personality scores and those features extracted from a sliding window with a length of 15 seconds (which was the window length that was most frequently selected in our training scheme) in Table S2. These correlation coefficients describe properties of the collected data and, in contrast to the feature importance scores in Figure 2 in the main article, are independent of the machine learning approach.
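Such a feature-trait correlation coefficient can be computed directly; the feature values and scores below are invented for five hypothetical participants, not taken from the study's data:

```python
import numpy as np

# Invented per-participant values for illustration only.
fixation_duration = np.array([3.1, 2.8, 3.5, 4.0, 2.2])  # a hypothetical feature
neuroticism_score = np.array([2.0, 1.5, 2.5, 3.0, 1.0])  # hypothetical trait scores

# Pearson correlation coefficient between feature and trait.
r = np.corrcoef(fixation_duration, neuroticism_score)[0, 1]
```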

TRAINING PROCEDURE
In the following, we describe the scheme used to train a single random forest classifier (see GitHub for our implementation). A priori, no clear hypotheses existed about useful window sizes for the sliding window approach or about informative features for everyday-life recordings. Therefore, each classifier was trained in a nested cross validation scheme that fits these parameters during training while preventing overfitting. Cross validation means that a given set of recordings is not simply split into one training and one test set; instead, the algorithm cycles through different splits into training and test sets. On each pair of training and test sets, a separate classifier is trained, and the performance measure is then averaged over all splits. Typically, a single classifier is trained on the training set and later evaluated on the unseen test set. In nested cross validation, however, another (so-called inner) cross validation takes place on the training set. That is, the algorithm again cycles through different splits of the (outer) training set into (inner) training and (inner) test sets. In this inner cross validation, different parameters can be fitted on the inner training set and evaluated on the inner test set. Once the best performing parameters (in our case, window size and features) are chosen, a classifier with these parameters is newly trained on the entire outer training set and evaluated on the still unseen outer test set. Figure S1 shows the distribution of all chosen window sizes.

Figure S1. During the training procedure, a nested cross validation scheme is used to determine the best performing parameters, one of which is the size of a sliding window used to process raw gaze data. The histogram shows the frequency of each window size (x-axis) summed over all cycles of the training scheme. While the y-axis shows relative frequency, the number above each bar is the absolute number of times a window was chosen.
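The nested cross validation idea can be sketched with scikit-learn on synthetic data; the parameter grid here merely stands in for the window sizes and feature subsets tuned in the actual training scheme, and the fold counts are illustrative rather than those of the study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic classification data standing in for windowed gaze features.
X, y = make_classification(n_samples=90, n_features=10, random_state=0)

# Inner cross validation: chooses the best parameters on each outer training set.
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 20]},  # placeholder for window size / features
    cv=3,
)

# Outer cross validation: evaluates the tuned classifier on unseen outer test sets.
outer_scores = cross_val_score(inner, X, y, cv=5)
```

Note that in the study the outer splits were made over participants (5 held out per fold), so a grouped splitter would be used in place of the plain `cv=5` above.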
The inner cross validation cycle included 3 different data splits (folds), chosen to contain an approximately equal number of samples for each class. The performance of a classifier on the inner test set was evaluated using the accuracy over all samples (i.e. independent of the participant they belong to). For the outer cross validation, the data used for testing is cycled such that each time 5 different participants are in the test set and the remaining 37 in the training set. Finally, a prediction for each participant in the outer test set was derived by majority voting over all time windows associated with that particular participant. The predictions from all outer folds were merged into a set of exactly one prediction per participant and evaluated in terms of an F1 score with macro averaging. The F1 score is defined as the harmonic mean of precision and recall, where precision for a certain class captures how many of the instances predicted to belong to that class actually do, whereas recall captures how many of the actual class instances were correctly detected. Macro averaging means that one F1 score is computed per class and these are then averaged. This reflects the intuition that a personality prediction system should also predict the less frequent, more extreme score ranges correctly, as opposed to a system that simply predicts the most frequent personality class. Feature selection within the inner cross validation cycle was performed based on the random forest's feature importance: all features that did not reach an average importance above 0.005 across all splits were disabled for the outer cross validation.
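The per-participant majority vote and the macro-averaged F1 evaluation can be sketched as follows; the participant IDs, window predictions and labels below are invented for illustration:

```python
from collections import Counter
from sklearn.metrics import f1_score

# Invented window-level predictions and true score ranges per participant.
window_preds = {"p1": [1, 1, 2], "p2": [2, 2, 2], "p3": [3, 1, 3]}
true_labels = {"p1": 1, "p2": 2, "p3": 3}

# Majority vote over all time windows of each participant.
final_preds = {p: Counter(v).most_common(1)[0][0] for p, v in window_preds.items()}

# One F1 score per class, then averaged (macro averaging).
participants = sorted(true_labels)
macro_f1 = f1_score(
    [true_labels[p] for p in participants],
    [final_preds[p] for p in participants],
    average="macro",
)
```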