A deep learning approach to diabetic blood glucose prediction

We consider the question of 30-minute prediction of blood glucose levels measured by continuous glucose monitoring devices, using clinical data. While most studies of this nature deal with one patient at a time, we take a certain percentage of patients in the data set as training data, and test on the remainder of the patients; i.e., the machine need not re-calibrate on the new patients in the data set. We demonstrate how deep learning can outperform shallow networks in this example. One novelty is to demonstrate how a parsimonious deep representation can be constructed using domain knowledge.


Introduction
Deep Neural Networks, especially of the convolutional type (DCNNs), have started a revolution in the field of artificial intelligence and machine learning, triggering a large number of commercial ventures and practical applications. There is therefore a great deal of theoretical investigations about when and why deep (hierarchical) networks perform so well compared to shallow ones. For example, Montufar and Bengio [14] showed that the number of linear regions that can be synthesized by a deep network with ReLU nonlinearities is much larger than by a shallow network. Examples of specific functions that cannot be represented efficiently by shallow networks have been given very recently by Telgarsky [28] and by Safran and Shamir [22].
It is argued in [13] that from a function approximation point of view, deep networks are able to overcome the so-called curse of dimensionality if the target function is hierarchical in nature; e.g., a target function of the form h l (h 3 (h 21 (h 11 (x 1 , x 2 ), h 12 (x 3 , x 4 )), h 22 (h 13 (x 5 , x 6 ), h 14 (x 7 , x 8 )))), where each function has a bounded gradient, can be approximated by a deep network comprising n units, organized as a binary tree, up to an accuracy O(n −1/2 ). In contrast, a shallow network that cannot take into account this hierarchical structure can yield an accuracy of at most O(n −1/8 ). In theory, if samples of the functions h, h 1,2 , . . . are known, one can construct the networks explicitly, without using any traditional learning algorithms such as back propagation.
One way to think of the function in (1) is to think of the inner functions as the features of the data, obtained in a hierarchical manner. While classical machine learning with shallow neural networks requires that the relevant features of the raw data should be selected by using domain knowledge before the learning can start, deep learning algorithms appear to select the right features automatically. However, it is typically not clear how to interpret these features. Indeed, from a mathematical point of view, it is easy to show that a structure such as (1) is not unique, so that the hierarchical features cannot be defined uniquely, except perhaps in some very special examples.
In this paper, we examine how a deep network can be constructed in a parsimonious manner if we do allow domain knowledge to suggest the compositional structure of the target function as well as the values of the constituent functions. We study the problem of predicting, based on the past few readings of a continuous glucose monitoring (CGM) device, both the blood glucose (BG) level and the rate at which it would be changing 30 minutes after the last reading. From the point of view of diabetes management, a reliable solution to this problem is of great importance. If a patient has some warning that that his/her BG will rise or drop in the next half hour, the patient can take certain precautionary measures to prevent this (e.g., administer an insulin injection or take an extra snack) [17,26].
Our approach is to first construct three networks based on whether a 5-minute prediction, using ideas in [12], indicates the trend to be in the hypoglycemic (0-70 mg/dL), euglycemic (70-180 mg/dL), or hyperglycemic (180-450 mg/dL) range. We then train a "judge" network to get a final prediction based on the outputs of these three networks. Unlike the tacit assumption in the theory in [13], the readings and the outputs of the three constituent networks are not dense in a Euclidean cube. Therefore, we use diffusion geometry ideas in [5] to train the networks in a manner analogous to manifold learning.
From the point of view of BG prediction, a novelty of our paper is the following. Most of the literature on this subject which we are familiar with does the prediction patient-bypatient; for example, by taking 30% of the data for each patient to make the prediction for that patient. In contrast, we consider the entire data for 30% of the patients as training data, and predict the BG level for the remaining 70% of patients in the data set. Thus, our algorithm transfers the knowledge learned from one data set to another, although it does require the knowledge of both the data sets to work with. From this perspective, the algorithm has similarity with the meta-learning approach by [17], but in contrast to the latter, it does not require a meta-features selection.
We will explain the problem and the evaluation criterion in Section 2. Some prior work on this problem is reviewed briefly in Section 3. The methodology and algorithm used in this paper are described in Section 4. The results are discussed in Section 5. The mathematical background behind the methods described in Section 4 is summarized in Appendix A and Appendix B.

Problem statement and evaluation
We use a clinical data set provided by the DirectNet Central Laboratory [1], which lists BG levels of different patients taken at 5-minute intervals with the CGM device; i.e., for each patient p in the patient set P , we are given a time series {s p (t j )}, where s p (t j ) denotes the BG level at time t j . Our goal is to predict for each j, the level s p (t j+m ), given readings s p (t j ), . . . , s p (t j−d+1 ) for appropriate values of m and d. For a 30-minute prediction, m = 6, and we took d = 7 (a sampling horizon t j − t j−d+1 of 30 minutes has been suggested as the optimal one for BG prediction in [12,7]).
In this problem, numerical accuracy is not the central objective. To quantify the clinical accuracy of the considered predictors, we use the Prediction Error-Grid Analysis (PRED-EGA) [25], which has been designed especially for glucose predictors and which, together with its predecessors and variations, is by now a standard metric in the blood glucose prediction problem (see, for example, [16,17,19,21]) This assessment methodology records reference BG estimates paired with the BG estimates predicted for the same moments, as well as reference BG directions and rates of change paired with the corresponding estimates predicted for the same moments. As a result, the PRED-EGA reports the numbers (in percent) of Accurate (Acc.), Benign (Benign) and Erroneous (Error) predictions in the hypoglycemic, euglycemic and hyperglycemic ranges separately. This stratification is of great importance because consequences caused by a prediction error in the hypoglycemic range are very different from ones in the euglycemic or the hyperglycemic range.

Prior work
Given the importance of the problem, many researchers have worked on it in several directions. Relevant to our work is the work using a linear predictor and work using supervised learning.
The linear predictor method estimates s p (t j ) based on the previous d readings, and then predicts Perhaps the most classical of these is the work [23] by Savitzky and Golay, that proposes an approximation of s p (t) by the derivative of a polynomial of least square fit to the data (t k , s p (t k )), k = j, . . . , j −d+1. The degree n of the polynomial acts as a regularization parameter. However, in addition to the intrinsic ill-conditioning of numerical differentiation, the solution of the least square problem as posed above involves a system of linear equations with the Hilbert matrix of order n, which is notoriously ill-conditioned. Therefore, it is proposed in [9] to use Legendre polynomials rather than the monomials as the basis for the space of polynomials of degree n. A procedure to choose n is given in [9], together with error bounds in terms of n and the estimates on the noise level in the data, which are optimal up to a constant factor for the method in the sense of the oracle inequality. A substantial improvement on this method was proposed in [12], which we summarize in Appendix A. As far as we are aware, this is the state of the art in this approach in short term blood glucose prediction using linear prediction technology.
There exist several BG prediction algorithms in the literature that use a supervised learning approach. These can be divided into three main groups.
The first group of methods employ kernel-based regularization techniques to achieve prediction (for example, [15,17] and references therein), where Tikhonov regularization is used to find the best least square fit to the data (t k , s p (t k )), k = j, . . . , j − d + 1, assuming the minimizer belongs to a reproducing kernel Hilbert space (RKHS). Of course, these methods are quite sensitive to the choice of kernel and regularization parameters. Therefore, the authors in [15,17] develop methods to choose both the kernel and regularization parameter adaptively, or through meta-learning ("learning to learn") approaches.
The second group consists of artificial neural network models (such as [18,19]). In [19], for example, a feed-forward neural network is designed with eleven neurons in the input layer (corresponding to variables such as CGM data, the rate of change of glucose levels, meal intake and insulin dosage), and nine neurons with hyperbolic tangent transfer function in the hidden layer. The network was trained with the use of data from 17 patients and tested on data from 10 other patients for a 75-minute prediction, and evaluated using the classical Clarke Error-Grid Analysis (EGA) [4], which is a predecessor of the PRED-EGA assessment metric. Classical EGA differs from PRED-EGA in the sense that it only compares absolute BG concentrations with reference BG values (and not rates of change in BG concentrations as well). Although good results are achieved in the EGA grid in this paper, a limitation of the method is the large amount of additional input information necessary to design the model, as described above.
The third group consists of methods that utilize time-series techniques such as autoregressive (AR) models (for example, [21,27]). In [21], a tenth-order AR model is developed, where the AR coefficients are determined through a regularized least square method. The model is trained patient-by-patient, typically using the first 30% of the patient's BG measurements, for a 30-minute or 60-minute prediction. The method is tested on a time series containing glucose values measured every minute, and evaluation is again done through the classical EGA grid. The authors in [27] develop a first-order AR model, patient-by-patient, with time-varying AR coefficients determined through weighted least squares. Their method is tested on a time series containing glucose values measured every three minutes, and quantified using statistical metrics such as measuring the mean square of the errors. As noted in [17], these methods seem to be sensitive to gaps in the input data.

Methodology in the current paper
Our proposed method represents semi-supervised learning that follows an entirely different approach from those described above. It is not a classical statistics/optimization based approach; instead, it is based on function approximation on data defined manifolds, using diffusion polynomials. In this section, we describe our deep learning method, which consists of two layers, in details.
Given the time series {s p (t j )} of BG levels at time t j for each patient p in the patient set P , where t j −t j−1 = 5 minutes, we start by formatting the data into the form {(x j , y j )}, where x j = (s p (t j−d+1 ), · · · , s p (t j )) ∈ R d and y j = s p (t j+m ) ∈ R, for all patients p ∈ P. We will use the notation We also construct the diffusion matrix from P. This is done by normalizing the rows of the weight matrix W ε N in (8), following the approach in [8, pp. 33-34].
Having defined the input data P and the corresponding diffusion matrix, our method proceeds as follows.

First layer: Three networks in different clusters
To start our first layer training, we form the training patient set T P by randomly selecting (according to a uniform probability distribution) M % of the patients in P . The training data are now defined to be all the data (x j , y j ) corresponding to the patients in T P . We will use the notations Next, we make a short term prediction L x j (t j+1 ) of the BG level s p (t j+1 ) after 5 minutes, for all the given measurements x j ∈ C, by applying the linear predictor method (summarized in Section 3 and Appendix A). Based on these 5-minute predictions, we divide the measurements in C into three clusters C o , C e and C r , The motivation of this step is to gather more information concerning the training set to ensure more accurate predictions in each BG range -as noted previously, consequences of prediction error in the separate BG ranges are very different.
In the sequel, let S(Γ, x j ) denote the result of the method of Appendix B (that is, S(Γ, x j ) is defined by equation (7)), used with training data Γ and evaluated at a point x j ∈ P. After obtaining the three clusters C o , C e and C r , we compute the three predictors as well as the "judge" predictor, based on the entire training set C , using the summability method of Appendix B (specifically, (7)). We remark that, as discussed in [11], our approximations in (7) can be computed as classical radial basis function networks, with exactly as many neurons as the number of eigenfunctions used in (5). (As mentioned in Appendix B, this number is determined to ensure that the system (6) is still well conditioned.) The motivation of this step is to decide which one of the three predictions (f o , f e or f r ) is the best prediction for each datum x ∈ P. Since we do not know in advance in which blood glucose range a particular datum will result, we need to use all of the training data for the judge predictor f J to choose the best prediction.

Second layer (judge): Final output
In the last training layer, a final output is produced based on which f , ∈ {o, e, r} gives the best placement in the PRED-EGA grid, using f J as the reference value. The PRED-EGA grid is constructed by using comparisons of f o (resp., f e and f r ) with the reference value f J -specifically, it involves comparing for all x j ∈ P. Based on these comparisons, PRED-EGA classifies f o (x j ) (resp., f e (x j ) and f r (x j )) as being Accurate, Benign or Erroneous. As our final output f (x j ) of the target function at x j ∈ P, we choose the one among f (x j ), ∈ {o, e, r} with the best classification, and if there is more than one achieving the best grid placement, we output the one among f (x j ), ∈ {o, e, r} that has value closest to f J (x j ). For the first and last x j for each patient p ∈ P (for which the rate of change (2) cannot be computed), we use the one among f (x j ), ∈ {o, e, r} that has value closest to f J (x j ) as the final output.

Evaluation
Lastly, to evaluate the performance of the final output f (x j ), x j ∈ P, we use the actual reference values y j to place f (x j ) in the PRED-EGA grid.
We repeat the process described in Subsections 4.1 -4.3 for a fixed number of trials, after which we report the average of the PRED-EGA grid placements, over all x ∈ P and over all trials, as the final evaluation.
A summary of the method is given in Algorithm 1. A flowchart diagram of the algorithm is shown in Figure 1.

Remarks
(i) An optional smoothing step may be applied before the first training layer step in Subsection 4.1 to remove any large spikes in the given time series {s p (t j )} , p ∈ P, that may be caused by patient movement, for example. Following ideas in [27], we may apply flat low-pass filtering (for example, a first-order Butterworth filter). In this case, the evaluation of the final output in Subsection 4.3 is done by using the original, unsmoothed measurements as reference values.
Algorithm 1 Deep Network for BG prediction input : Time series {s p (t j )}, p ∈ P , formatted as P = {x j } with x j = (s p (t j−d+1 ), · · · , s p (t j )) and y j = s p (t j+m ), and corresponding diffusion matrix d ∈ N (specifies sampling horizon), m ∈ N (specifies prediction horizon) M ∈ (0, 100) (percentage of data used for training). Let T P contain M % of patients from P (drawn according to uniform prob. distr.) Set C = {x j } and C = {(x j , y j )} for all patients p ∈ T P output: Prediction f (x j ) ≈ s p (t j+m ) for x j ∈ P.
First layer: Second layer: for x j ∈ P do for ∈ {o, e, r} do Construct PRED-EGA grid for f (x j ) using f J (x j ) as reference value end Let j ∈ {o, e, r} denote the subscript for which f j (x j ) produces the best PRED-EGA grid placement Set final output f (x j ) = f j (x j ) end (ii) We remark that, as explained in Section 1, the training set may also be implemented in a different way: instead of drawing M % of the patients in P and using their entire data sets for training, we could construct the training set C for each patient p ∈ P separately by drawing M % of a given patient's data for training, and then construct the networks f , ∈ {o, e, r} and f J for each patient separately. This is a different problem, studied often in the literature (see, for example, [21,27,6]). We do not pursue this approach in the current paper.

Results and discussion
As mentioned in Section 2, we apply our method to data provided by the DirecNet Central Laboratory. Time series for 25 patients are considered. This specific data set was designed to study the performance of CGM devices in children with Type I diabetes; as such, all of the patients are less than 18 years old. Each time series {s p (t j )} contains more than 160 BG measurements, taken at 5-minute intervals. Our method is a general purpose algorithm, where these details do not play any significant role, except in affecting the outcome of the experiments.
We provide results obtained by implementing our method in Matlab, as described in Algorithm 1. For our implementation, we employ a sampling horizon t j − t j−d+1 of 30 minutes (d = 7), a prediction horizon t j+m − t j of 30 minutes (m = 6), and a total of 100 trials (T = 100). We provide results for both 30% training data (M = 30) and 50% training data (M = 50) (which is comparable to approaches followed in for example [19,21]). After testing our method on all 25 patients, the average PRED-EGA scores (in percent) are displayed in Table 1. For 30% training data, the percentage of accurate predictions and predictions with benign consequences is 84.32% in the hypoglycemic range, 97.63% in the euglycemic range, and 82.89% in the hyperglycemic range, while for 50% training data, we have 93.21% in the hypoglycemic range, 97.68% in the euglycemic range, and 86.78% in the hyperglycemic range.
For comparison, Table 1 also displays the PRED-EGA scores obtained when implementing a shallow (ie, one layer) feed-forward network with 20 neurons using Matlab's Neural Network Toolbox, using the same parameters d, m, M and T as in our implementation, with Matlab's default hyperbolic tangent sigmoid activation function. The motivation for using 20 neurons is the following. As mentioned in Subsection 4.1, each layer in our deep network can be viewed as a classical neural network with exactly as many neurons as the number of eigenfunctions used in (5) and (7). This number is determined to ensure that the system in (6) is well conditioned; in our experiments, it turned out to be at most 5. Therefore, to compare our two-layer deep network (where the first layer consists of three separate networks) with a classical shallow neural network, we use 20 neurons. It is clear that our deep network performs substantially better than the shallow network; in all three BG ranges, our method produces a lower percentage of erroneous     predictions.
For a further comparison, we also display in Table 1 the PRED-EGA scores obtained when training through supervised learning using a standard Tikhonov regularization to find the best least square fit to the data; specifically, we implemented the method described in [20, pp 3-4] using a Gaussian kernel with σ = 100 and regularization constant γ = 0.0001. Again, our method produces superior results, especially in the hypoglycemic range. Figures 2 and 3 display boxplots for the percentage accurate predictions in each BG range for each method over the 100 trials.
We also provide results obtained when applying smoothing through flat low-pass filtering to the given time series, as explained in Subsection 4.4. For our implementations, we used a first-order Butterworth filter with cutoff frequency 0.8, with the same input parameters as before. The results are given in Table 2. For 30% training data, the percentage of accurate predictions and predictions with benign consequences is 88.92% in the hypoglycemic range, 97.64% in the euglycemic range, and 77.58% in the hyperglycemic range, while for 50% training data, we have 96.43% in the hypoglycemic range, 97.96% in the euglycemic range, and 85.29% in the hyperglycemic range. Figures 4 and 5 display boxplots for the percentage accurate predictions in each BG range for each method over the 100 trials.
In all our experiments the percentage of erroneous predictions is substantially smaller with deep networks than the other two methods we have tried, with the only exception of hyperglycemic range with 30% training data and flat low pass filtering. Our method's performance also improves when the amount of the training data increases from 30% to 50%, because the percentage of erroneous predictions decreases in all of the BG ranges in all experiments.  Moreover, from this viewpoint, our deep learning method outperforms the considered competitors, except in the hyperglycemic BG range in Table 2. A possible explanation for this is that the DirecNet study was done on children (less than 18 years of age) with type 1 diabetes, for a period of roughly 26 hours. It appears that children usually have prolonged hypoglycemic periods as well as profound postprandial hyperglycemia (high blood sugar "spikes" after meals) -according to [3], more than 70% of children display prolonged hypoglycemia while more than 90% of children display significant postprandial hyperglycemia. In particular, many of the patients in the data set only exhibit very limited hyperglycemic BG spikes -for the patient labeled 43, for example, there exists a total of 194 BG measurements, of which only 5 measurements are greater than 180 mg/dL. This anomaly might have affected the performance of our algorithm in the hyperglycemic range. Obviously, it performs remarkably better than the other techniques we have tried, including fully supervised training, while ours is only semi-supervised.

Conclusions
The prediction of blood glucose levels based on continuous glucose monitoring system data 30 minutes ahead is a very important problem with many consequences for the health care industry. In this paper, we suggest a deep learning paradigm based on a solid mathematical theory as well as domain knowledge to solve this problem accurately as assessed by the PRED-EGA grid, developed specifically for this purpose. It is demonstrated in [13] that deep networks perform substantially better than shallow networks in terms of expressiveness for function approximation when the target function has a compositional structure. Thus, the blessing of compositionality cures the curse of dimensionality. However, the compositional structure is not unique, and it is open problem to decide whether a given target function has a certain compositional structure. In this paper, we have demonstrated an example where domain knowledge can be used to build an appropriate compositional structure, leading to a parsimonious deep learning design.

A Filtered Legendre expansion method
In this appendix, we review the mathematical background for the method developed in [12] for short term blood glucose prediction. As explained in the text, the main mathematical problem can be summarized as that of estimating the derivative of a function at the end point of an interval, based on measurements of the function in the past. Mathematically, if f : [−1, 1] → R is continuously differentiable, we wish to estimate f (1) given the noisy values We summarize only the method here, and refer the reader to [12] for the detailed proof of the mathematical facts.
For this purpose, we define the Legendre polynomials recursively by Let h : R → R be an even, infinitely differentiable function, such that h(t) = 1 if t ∈ [0, 1/2] and h(t) = 0 if t ≥ 1. We define Next, given the points {t j } d j=1 ⊂ [−1, 1], we use least squares to find n = n d such that and the following estimates are valid for all polynomials of degree < 2n: for some positive constant A. We do not need to determine A, but it is guaranteed to be proportional to the condition number of the system in (3).
Finally, our estimate of f (1) based on the values y j = f (t j ) + j , j = 1, · · · , d, is given by It is proved in [12,Theorem 3.2] that if f is twice continuously differentiable, and | j | ≤ δ then the minimum being over all polynomials P of degree < n/2.

B Function approximation on data defined spaces
While classical approximation theory literature deals with function approximation based on data that is dense on a known domain, such as a cube, torus, sphere, etc., this condition is generally not satisfied in the context of machine learning. For example, the set of vectors (s p (t j ), . . . , s p (t j − d + 1)) is unlikely to be dense in a cube in R d . A relatively recent idea is to think of the data as being sampled from a distribution on an unknown manifold, or more generally, a locally compact, quasi-metric measure space. In this appendix, we review the theoretical background that underlies our experiments reported in this paper. The discussion here is based mainly on [10,11,5].
Let X be a locally compact quasi-metric measure space, with ρ being the quasi-metric and µ * being the measure. In the context of machine learning, µ * is a probability measure and X is its support. The starting point of this theory is a non-decreasing sequence of numbers {λ k } ∞ k=0 , such that λ 0 = 0 and λ k → ∞ as k → ∞, and a corresponding sequence of bounded functions {φ k } ∞ k=0 that forms an orthonormal sequence in L 2 (µ * ). Let Π λ = span{φ k : λ k < λ}, and analogously to (4), we define for a uniformly continuous, bounded function f : X → R, With the function h as defined in Appendix A, we define Φ λ (h; x, y) = k:λ k <λ h λ k λ φ k (x)φ k (y), x, y ∈ X.
Next, let {x j } M j=1 ⊂ X be the "training data". Our goal is to learn a function f : X → R based on the values {f (x j )} M j=1 . Toward this goal, we solve an under-determined system of equations M j=1 W j φ k (x j ) = 1, if k = 0, 0, if k > 0, k : λ k < λ (6) for the largest possible λ for which this system is still well conditioned. We then define an approximation, analogous to the classical radial basis function networks, by It is proved in [10,11,5] that under certain technical conditions, for a uniformly continuous, bounded function f : X → R, we have where c > 0 is a generic positive constant. Thus, the approximation σ λ (f ) is guaranteed to yield a "good approximation" that is asymptotically best possible given the training data.
In practice, one has a point cloud P = {y i } N i=1 ⊃ {x j } M j=1 rather than the space X. The set P is a subset of some Euclidean space R D for a possibly high value of D, but is assumed to lie on a compact sub-manifold X of R D , with this manifold having a low dimension. We build the graph Laplacian from P as follows: With a parameter ε > 0, induces the weight matrix W ε N defined by W ε N ;i,j = exp(−|y i − y j | 2 /ε).
We build the diagonal matrix D ε N ;i,i = N j=1 W ε N ;i,j and define the (unnormalized) graph Laplacian as Various other versions of this Laplacian are also used in practice. For M → ∞ and ε → 0, the eigenvalues and interpolations of the eigenvectors of L ε M converge towards the "interesting" eigenvalues λ k and eigenfunctions φ k of a differential operator ∆ X . This behavior has been studied in detail by many authors, e.g., [2,8,24]. When {y i } M i=1 are uniformly distributed on X, this operator is the Laplace-Beltrami operator on X. If {y i } M i=1 are distributed according to the density p, then the graph Laplacian approximates the elliptic Schrödinger type operator ∆ + ∆p p , whose eigenfunctions φ k also form an orthonormal basis for L 2 (X, µ).