Identifying Active Travel Behaviors in Challenging Environments Using GPS, Accelerometers, and Machine Learning Algorithms

Background: Active travel is an important area in physical activity research, but objective measurement of active travel is still difficult. Automated methods to measure travel behaviors will improve research in this area. In this paper, we present a supervised machine learning method for transportation mode prediction from global positioning system (GPS) and accelerometer data. Methods: We collected a dataset of about 150 h of GPS and accelerometer data from two research assistants following a protocol of prescribed trips consisting of five activities: bicycling, riding in a vehicle, walking, sitting, and standing. We extracted 49 features from 1-min windows of this data. We compared the performance of several machine learning algorithms and chose a random forest algorithm to classify the transportation mode. We used a moving average output filter to smooth the output predictions over time. Results: The random forest algorithm achieved 89.8% cross-validated accuracy on this dataset. Adding the moving average filter to smooth output predictions increased the cross-validated accuracy to 91.9%. Conclusion: Machine learning methods are a viable approach for automating measurement of active travel, particularly for measuring travel activities that traditional accelerometer data processing methods misclassify, such as bicycling and vehicle travel.


INTRODUCTION
Individual travel behavior has been important in transportation research and traffic planning for decades (1). More recently, active travel has also become a focus for public health (2). Studies of adults and children have shown that individuals who walk or bike for transportation, or use public transportation, accumulate more physical activity and are more likely to meet public health recommendations (3,4). In some countries active travel has been related to obesity (5). These relationships, however, have been poorly studied because they are reliant on self-report data, which provide crude metrics (e.g., number of days vs. total minutes of active travel). The premise of active living research is that built environment can support more routine physical activity behaviors, and that if active travel is an equal choice compared to car travel, more people are likely to take advantage. Improvement in measurement of active travel will enable intervention studies trying to promote routine daily behaviors such as active travel.
Traditionally, travel behavior has been measured by travel and time use diaries or self-report surveys (6). Not only are these burdensome to participants, but also recall of events is often inaccurate and potentially biased (7)(8)(9). The emergence of lightweight, low cost, and accurate global positioning system (GPS) devices has enabled researchers to objectively track the location of individuals. However, while it is relatively straightforward to view and understand GPS data using Geographic Information System (GIS) packages, it would be extremely time consuming to do large-scale data analysis by manually interpreting each GPS track. In light of this, researchers began looking into automated ways of segmenting trips and identifying transportation mode from GPS data. Early studies of GPS data in transportation research focused on vehicle travel, simplifying the development of algorithms somewhat. More recently, multiple transportation modes have been studied, including active transportation.
There have been a variety of approaches to predicting transportation mode automatically from GPS data. These include heuristic rule-based algorithms, (10)(11)(12)(13), fuzzy logic (14,15), neural networks (16,17), Bayesian models (18,19), and decision trees (20,21). These approaches rarely include non-travel activities (e.g., sitting or standing), and many incorporate map matching or GIS components, which are particularly useful in order to ascertain when a user is traveling on a public transit route. Physical activity researchers, however, may not have access to GIS data. In contrast, they are likely to include accelerometer data when assessing active travel (3,4). Previous studies have shown that specific behaviors such as housework can be derived from accelerometer signals (22). These studies rarely include vehicle travel as an activity mode, are often performed in highly controlled lab settings, and mostly do not include GPS data that can inform trip mode.
Only a few studies have employed GPS and accelerometer data. Reddy et al. (23) use decision trees and Markov models to determine transportation mode on mobile phones, using both GPS and accelerometer data. They report 93% accuracy in predicting five activities (still, walking, running, bicycling, and vehicle). Troped et al. (24) also combine accelerometer and GPS data, but from standalone devices, using linear discriminant analysis to predict five activities (walking, running, bicycling, inline skating, and driving a car). They report 90% accuracy in predicting activities, but with a relatively small dataset of 712 min of data.
Many of these algorithms in the literature to date are only tested in ideal conditions or controlled environments, which may overestimate their accuracy. Conditions like instantaneous mode changes, cold start journeys, and trips in urban canyons can all interfere with signal detection and challenge the effectiveness of algorithms. Our novel contribution to the research includes a validation protocol that tested travel modes in multiple real world conditions, and collected a comprehensive dataset consisting of about 150 h of annotated data. We employed both GPS and accelerometer data, and used machine learning algorithms to identify transportation mode, using multiple features of both devices to inform the prediction model. We used a random forest algorithm, an efficient algorithm that to our knowledge has not previously been used to predict transportation modes from accelerometer and GPS data, although Lustrek and Kaluza (25) use a random forest algorithm for activity recognition from 12 small infrared motion tags placed on a user's body and Casale et al. (26) use random forests for physical activity recognition from accelerometer data.

DATA COLLECTION PROCEDURES
Two trained research assistants in San Diego followed a data collection protocol. They collected data under varying conditions (i.e., open space vs. urban, indoor vs. outdoor), for a variety of transportation modes (walking, driving, etc.). The researchers wore an Actigraph GT3X+ accelerometer on the hip and 12 devices attached to 2 wooden boards (9 × 12 ) carried in a backpack. Board-mounted devices were attached with Velcro to ensure device antennae were all aligned in the same way. The board contained 12 devices, each with different settings: 2 different GPS models, at 3 different epochs with either warm or cold start conditions (i.e., device on and signal obtained or device switched on immediately before travel and no signal established). The GPS device we used in this analysis was a Qstarz BT1000X set to collect data at 15 s epochs using warm starts. The accelerometer device collected data at 30 Hz on three axes. The researchers followed a set protocol of trips, pauses, and locations. There were at least 4 example trips per condition resulting in over 500 trips. The researchers kept a log of each trip, location, and condition settings and noted start times for each event. A research manager reviewed files on a daily basis and reallocated trips that were not successfully completed or noted.
(The full study protocol is available from the authors). All the data are considered at the trip level, the individual characteristics of the two data collectors were not included. Table 1 outlines the different conditions under which GPS and accelerometer data were collected following the protocol. Table 2 reports the total minutes of data collected in each transportations mode, and includes periods of time between the prescribed trips during data collection.

CLASSIFICATION PIPELINE
Each step in the data classification pipeline is detailed in the following sections. Raw data were first preprocessed to remove common GPS errors. The data were then split into 1-min windows and feature vectors were extracted from each window. Each window was then classified using a machine learning algorithm. Figure 1 shows an overview of the classification pipeline, which starts from raw sensor data and produces classifications for each minute of data.

DATA PREPROCESSING
Data smoothing typically occurs before data are employed in a machine learning environment. GPS data were first processed by the Personal Activity Location Measurement System (PALMS) (27). PALMS filters spurious data points and smoothes out common GPS interference patterns. PALMS uses a set of simple filters to remove invalid coordinates and reduce data volume. The filters include removal of points with excess speed; large changes in elevation or very small changes in distance between consecutive points; and scatter caused by interference from buildings. For periods of signal loss PALMS imputes the previous valid coordinates (28).

TRIP SEGMENTATION
Many previous approaches to travel mode classification do a first step of segmenting the GPS stream into cohesive trips consisting of a single travel mode (10,12,14). This segmentation is usually done based on simple rules that make some assumptions about the way people travel -for example, people always walk in between trips of different modes, or are always stationary for a certain length of time in between trips. However, these assumptions may not hold in the real world -in fact the data in this study were collected in order to explicitly violate these assumptions. Therefore, instead of segmenting the data into trips, in our method we individually classified each minute of data with a travel mode. These predictions are then smoothed with a simple moving average filter that encourages consecutive minutes of data to be classified with the same mode. Trips can then be easily defined by grouping consecutive minutes classified with the same travel mode. This approach prevents the use of heuristics that enforce a specific ordering of transportation modes.

FEATURE EXTRACTION
Most machine learning algorithms require inputs to be example data points consisting of real numeric data. These inputs are called feature (or attribute) vectors. Our input data consist of streams of accelerometer and GPS data. The feature extraction step is the process of transforming these data streams into feature vectors that capture relevant and predictive information. We used a sliding window to break the data stream into 1-min windows of accelerometer and GPS data, each with a corresponding transportation mode label. If the window spanned multiple different transportation modes, or an unlabeled segment of data, we left it unlabeled. Consecutive windows overlap by 30 s. We summarized each 1-min window by computing a feature vector consisting of descriptive statistics of the data in that window (e.g., average speed, correlation between accelerometer axes, etc.). We normalized the features to have mean zero and standard deviation one, to account for the scale difference between features (i.e., acceleration measurement is between ±6 G, while GPS speed is commonly above 40 mph). We computed a 49-dimensional feature vector for each minute of data, consisting of 43 acceleration features and 6 GPS features. Using a 1-min window resulted in 17,916 example minutes in our dataset, 14,307 of which had valid labels.

Acceleration features
An 1-min window of acceleration measurements contains T = 60s × 30 Hz × 3 axes = 5400 samples of acceleration measurements along the x, y, and z axes, which we represent as a matrix, The data from this window are condensed to 43 acceleration features. Most features are computed from the vector magnitude of the 3-axis acceleration, although some features (for example, correlations between axes) are computed differently. We compute the following features: • Basic descriptive statistics computed from the vector magnitudes a 1:T : mean, standard deviation, 25th and 75th percentiles, minimum, and maximum. These are features commonly used in previous work predicting physical activity from accelerometers (21,23,24). • Skewness and Kurtosis, descriptive statistics derived from the third and forth moments of the data distribution, that measure the asymmetry and peakedness, respectively, of the distribution of accelerometer magnitudes in a minute. • Autocorrelation of the vector magnitude with 1-s lag (17).
• Correlations between each pair of axes of the accelerometer (i.e., x-y correlation, x-z correlation, and y-z correlation). • Entropy, a measurement of the randomness of the distribution of accelerometer magnitudes. • Angular features, to provide information about the orientation of the accelerometer in space. The roll, pitch, and yaw are measurements used in aeronautics to describe the rotation of an aircraft, and are calculated by: tan −1 a t ,y , a t ,x Figure 2 shows these angles on the coordinate axes.
• Principal direction of motion, obtained via Eigen-decomposition of the acceleration covariance matrix, AA T . In particular, we determined the principal direction of motion by taking the eigenvector ν of AA T with corresponding maximal eigenvaluethis corresponds to the direction with maximum variation. • Autoregressive coefficients: we model the acceleration vector magnitude by an autoregressive model of order p = 5, where c 0 , . . . ,c p are the model coefficients, and ε t is white noise. • Dominant frequency, the frequency corresponding to maximal power in the power spectrum, and corresponding power.

GPS features
We obtained six features from the GPS device: average speed, average number of satellites used and in view, average signal-to-noise ratio (SNR) of satellites used and in view, and net distance traveled in the minute. These 6 features were appended to the 43 acceleration features to obtain the 49-dimensional feature vector that describes each minute of data.

MACHINE LEARNING METHODS
Our goal in this work is to use supervised machine learning methods to predict transportation mode from streams of accelerometer and GPS data. The term"supervised"refers to the fact that we make use of a training data set containing examples of accelerometer and GPS data with known corresponding labels (i.e., transportation mode). A supervised learning algorithm analyzes the training data and produces an inferred function, or classifier, which maps a data point (represented by its feature vector), to a label. This is in contrast to "unsupervised" learning algorithms, which use data without labels to identify some sort of underlying structure in the data (i.e., clustering methods). A labeled training dataset is made up of pairs of feature vectors and labels. There are a wide variety of classifier functions and learning algorithms that could be used, and we tested several of the more popular choices.

Classification
We tested several well-known machine learning algorithms to classify transportation mode: k-nearest neighbor (kNN), support vector machines (SVM), naive Bayes, decision trees, and random forests. Many software packages exist that implement these algorithms and allow them to be used as a "black box" to perform classification of data. Of these algorithms, the random Frontiers in Public Health | Public Health Education and Promotion forest algorithm, which is an ensemble method based on decision trees, produced the highest accuracy, and thus the remainder of the analysis in this paper will focus on the random forest algorithm. A decision tree is a type of classifier that consists of leaves representing class labels and branches representing conjunctions of features that lead to those class labels. For a test data point, the class label is found by traversing the tree according to the conjunctions in the branches of the tree, and when a leaf is reached the label in that leaf is assigned to the data point. The training phase of the algorithm consists of building the decision tree, i.e., learning the branches that lead to a tree that correctly classifies as many examples in the training data set as possible. A random forest combines the outputs of multiple randomized decision trees. Shotton et al. (29) use random forests to do human pose recognition for the Xbox Kinect. To learn each decision tree, we chose a random subset of 10,000 training examples (at 1-min epoch) and a random subset of 25 features. We learned 100 of these randomized decision trees. To classify a given test example, we traversed each tree until we arrived at a leaf node. Each leaf node has a probability score for each transportation mode, according to the ratio of training examples of each transportation mode that land in that node. We summed these probability scores in the final leaf node over the 100 trees, and chose the transportation mode with highest probability for our test example. We choose the parameters for our classification algorithms (i.e., number of trees to use) using a held-out day of data that was not included in the final cross-validation results.
For the decision trees, we set the minimum number of examples in a leaf to be 10. For the kNN algorithm k was chosen to be 3 and for SVM the regularization parameter was chosen to be 10.

MOVING AVERAGE OUTPUT FILTER
We filter the output predictions from the random forest classifier with a simple moving average filter. This filter looks at the predictions made in the 2 min previous to and 2 min following the minute in question, and outputs the mode that is predicted the highest number of times. If there is a tie, it outputs either the prediction from the current minute (if this is one of the tied predictions), or the prediction from the earlier of the tied minutes. This prevents rapid switching between different modes and encourages successive predictions to belong to the same travel mode.

RESULTS
We tested the performance of each machine learning algorithm by performing leave-one-day-out cross-validation. This corresponds to a realistic setting in which the algorithm would be used -the data from within a trip are never used as training data to classify another piece of data from that same trip. Using standard k-fold cross validation that allows data from the same trip in the test and training set produces artificially high accuracy scores. In addition to overall accuracy, we evaluated the performance of each classifier using precision, recall, and F -score.  FN), where TP is the number of true positives, and FN is the number of false negatives. F -score is a measure of accuracy, and is computed as the harmonic mean of precision and recall, F -score = 2PR/(P + R). These metrics provide detailed information about how the algorithm performs on each class. The F -scores obtained from each algorithm are shown in Table 3. The random forest algorithm showed the highest performance, with an overall accuracy of 89.8%.
Using the moving average output filter to smooth predictions significantly improved results, leading to an average precision of 0.900, average recall of 0.882, and overall accuracy of 91.9%. Table 4 reports the precision, and recall scores before and after output filtering. Figure 3 shows an example day of data, before and after smoothing, compared to the ground truth annotations for the day. Table 5 shows the confusion matrix for the   smoothed random forest classifier, which reports the number of test examples classified in each transportation mode. The diagonal emboldened numbers represent the correct predictions by mode.

IMPORTANCE OF FEATURES
In order to gain insight into the usefulness of each feature, we computed an importance score. This score was computed by summing the changes in the training error each time a feature was used to create a new branch in a decision tree, and averaged over each tree in the random forest. We then normalize the scores to sum 1 over all the features. Table 6 shows the top 15 features according to this importance score.  (23)] and machine learning techniques to predict transportation mode. Our study demonstrated similar accuracy rates as Reddy et al., but deployed the devices over a larger number of trips, which varied by environmental features, thereby providing a more challenging test of the algorithms. Another difference was the Reddy study employed smart phones whereas we used research-grade accelerometers and GPS devices mounted at specific locations. Although smartphones are becoming ubiquitous technologies for continuous sensing of geolocation and acceleration data, they are limited because of competing power demands of the phone or other functions (e.g., there must be sufficient power for the phone to make calls, text, read email, surf the web, etc., with limited interruption to other sensor inputs). It is also unclear if the integrity of smartphone sensor data relies on the phone being in a fixed position (i.e., the person keeping the phone in the same position all day). Although recent studies have attempted to circumvent these issues, solutions appear largely experimental or prototypical (30,31). Importantly, there have been few studies in the transportation literature that have included accelerometer data to improve prediction of transportation mode. Previous studies have employed machine learning techniques on GPS data alone, but the diversity of the training data was unknown.
The analysis of the importance of each feature demonstrated that the accelerometer data contribute additional predictive power above the GPS data. The feature with highest importance was the standard deviation of the acceleration, which captures information about the signal variability. Stationary sitting and standing should consistently produce low speeds and accelerations, while bus and car have a much wider range of possible values. The feature with the second highest importance score was the speed from the GPS, which differentiates fairly well between activities with very different average speeds, such as vehicle vs. walking and sitting or standing. Although vehicle speeds are often higher than active transportation modes, in downtown corridors vehicle speeds can be slow including periods where the vehicle has stopped altogether, such as at a traffic light. These movements may mimic that of walking or biking. The net distance covered feature is computed from the GPS data, by simply computing the distance between the first and last latitude and longitude points in a data window. It is helpful in determining whether substantial forward progress was made during the minute, which is helpful, for example, in differentiating walking from standing. Another important feature was the power at the maximum frequency -this measures whether the acceleration signal has a strong dominant frequency, which is exhibited in signals that are highly periodic, such as walking or bicycling. Walking in particular has a very consistent dominant frequency between 3 and 4 Hz, which accounts for the high importance of the 3-4 Hz feature. Another interesting feature is the average roll, which provides information about the angle with which the device is positioned. Since the device is firmly affixed to the participant's hip can provide information about whether the subject is bent at the hip, i.e., sitting vs. standing.
This study demonstrated that under varied environmental conditions known to affect GPS signal, transportation modes could be detected with high accuracy. Although the accelerometer and GPS data were collected from separate devices, most mobile phones now collect GPS and accelerometer data. Additional studies could investigate algorithms applied to such mobile phone data. While mobile phone implementation of this study would lose some standardization of device placement (i.e., subjects could hold the phone wherever they like), which may make the angular features in particular less helpful, these features may instead provide some information about device placement, which can be correlated with transportation mode. Studies in free-living populations should also be conducted to confirm the generalizability of these algorithms. New sensors such as the SenseCam that provides accompanying image data may make this possible (32).
Limitations of this study include the small sample of subjects (n = 2; although each subject provided over 200 trips). When participants are performing prescribed trips, there should be less variation between participants than in a free-living scenario. In particular, the GPS data of two participants performing the same prescribed trip should look very similar. For this reason we chose to focus on collecting data from a wide variety of environments rather than a wide variety of participants. Validating methods on a controlled dataset is a first step before applying them to larger free-living dataset, which is the subject of future work. Additional limitations are that the data were only collected in one county (although the environments purposefully varied), and application of the algorithms to only one GPS model. Additionally, we only tested our algorithms on windows of data that fell completely within a bout of certain transportation mode -i.e., we did not include windows containing transitions between activities. This was done because evaluating the performance of our algorithm on these windows that have no single ground truth label is not straightforward. Future work should determine a metric for assessing the performance of the algorithm on these split class windows. One of the main sources of error in our classification was confusion between bus and car transportation modes, which is to be expected since the two modes are so similar. Including information from GIS systems, such as the proximity of public transportation routes, has the potential of greatly improving the distinction between these two modes, and we plan on addressing this in future work. Additionally, future work will test the effect of warm vs. cold start GPS data.

AUTHOR CONTRIBUTIONS
Katherine Ellis developed and implemented the algorithm, analyzed results, and wrote the manuscripts. Suneeta Godbole coordinated data collection and data management and critically revised the manuscript. Simon Marshall and Jacqueline Kerr developed the study design, coordinated data collection, and critically revised the manuscript. Gert Lanckriet and John Staudenmayer advised on algorithms and critically revised the manuscript.

ACKNOWLEDGMENTS
The analysis in this paper was supported in part through a collaboration with the PALMS Project (UCSD-Palms-Project.wikispaces.com) and TREC project 4 at the University of California, San Diego, CA, USA. Funding was provided www.frontiersin.org