Causal Discovery in Radiographic Markers of Knee Osteoarthritis and Prediction for Knee Osteoarthritis Severity With Attention–Long Short-Term Memory

The goal of this study is to build a prognostic model to predict the severity of radiographic knee osteoarthritis (KOA) and to identify long-term disease progression risk factors for early intervention and treatment. We designed a long short-term memory (LSTM) model with an attention mechanism to predict Kellgren/Lawrence (KL) grade for knee osteoarthritis patients. The attention scores reveal a time-associated impact of different variables on KL grades. We also employed a fast causal inference (FCI) algorithm to estimate the causal relation of key variables, which will aid in clinical interpretability. Based on the clinical information of current visits, we accurately predicted the KL grade of the patient's next visits with 90% accuracy. We found that joint space narrowing was a major contributor to KOA progression. Furthermore, our causal structure model indicated that knee alignments may lead to joint space narrowing, while symptoms (swelling, grinding, catching, and limited mobility) have little impact on KOA progression. This study evaluated a broad spectrum of potential risk factors from clinical data, questionnaires, and radiographic markers that are rarely considered in previous studies. Using our statistical model, providers are able to predict the risk of the future progression of KOA, which will provide a basis for selecting proper interventions, such as proceeding to joint arthroplasty for patients. Our causal model suggests that knee alignment should be considered in the primary treatment and KOA progression was independent of clinical symptoms.


INTRODUCTION
Osteoarthritis (OA) is a common disease in older individuals, and the economic burden of OA rapidly increases with obesity prevalence and aging in the United States. Knee osteoarthritis (KOA) is the most prevalent type of osteoarthritis with around 22.7% (54.4 million) adults diagnosed with arthritis in the United States (1). The main symptoms of knee OA (KOA) are pain, stiffness, and swelling. These symptoms cause inconvenience to everyday life, and some people may even lose their physical ability due to this. Once the symptoms appear, it is hard to be cured since damage to the joints cannot be reserved. There is no effective disease-specific drug for this irreversible degenerative disease currently (2). Medication only helps to relieve OA symptoms, primarily pain. Since OA is a slow-developing disease, it is often undiagnosed until symptoms appear, so the best treatment opportunity is missed. It is estimated that the number of total knee replacement surgery or knee arthroplasty (TKA) will reach 1.26 million by 2030 in the United States (3). Considering the rapid increase of KOA patients, it is important to detect KOA at an early stage and perform intervention treatment before knee condition deteriorates.
The most common standard of quantifying the severity of KOA is the five-grade Kellgren and Lawrence (KL) system (4). This grading system divides KOA into four stages, ranging from 0 to 4. Grade 0 indicates no evidence of KOA and grade 4 indicates severe KOA. Medically, incident radiographic KOA is defined when KL grade is ≥2 (5). A great number of works have been done to identify risk factors for the occurrence of KOA (6,7). Zhang et al. used a logistic regression model and identified age, sex, body mass index (BMI), occupational risks, injury, and family history of KOA as risk factors (8). However, these studies only identified risk factors for the occurrence of KOA (whether KL ≥ 2), instead of the progression of the disease. There were also some attempts to quantify KOA severity. Du et al. employed support vector machine, random forest, and naive Bayes to predict the progression of KOA from 3-D magnetic resonance (MR) imaging (9). Although these specific models can quantify KOA severity, most attempts were built on image processing and were hard to interpret in the clinical setting. Therefore, our study was designed to build a predictive model with patients' assessment data such as symptoms, questionnaire data, and interpretable image features and identify the key risk factors during the disease progression. Besides the predictive model, we also adopted the causal inference to verify the risk factors identified in predictive models. In order to get higher accuracy, the predictive model may include some unnecessary predictors since it only considers the association between the dependent variable and predictors. However, if physicians take all predictors under consideration in intervention treatment, it may cause overtreatment, because some predictors are not the cause of the disease. In order to eliminate the effect of unknown factors (also called confounders) and avoid unnecessary therapies, we adopted a causal analysis to estimate the causal effect relationship between clinical factors and radiographic markers. The directed graphical causal models (DGCM) can identify the causal relationship instead of association (10). This causal relation is the result of multiple hypothetical experiments by measuring how much an error score will change when the values of a variable are randomly permuted (11). These experimental predictions are computed from the probability distribution. Therefore, a causal relation is defined when variable A and variable B co-vary if we only changed variable A (12). In this study, we used the fast causal inference (FCI) algorithm (12) for causal inference. FCI is designed to test conditional independence. It first generates a complete undirected graph and then deletes recursively edges based on the conditional independence decisions.
The purpose of this study is to use short-term data to predict long-term KOA progression by inputting observed time series into an attention-long short-term memory (LSTM) model and outputting the likelihood of patients' KL grade. Additionally, we built a causal model that evaluates the causal structure of potential predictors and identified the primary contributors to KOA progression and pain progression. To characterize the OA progression, we used 5-year data from the Osteoarthritis Initiative (OAI), specifically patients' clinical assessment data such as symptoms, questionnaire data, and interpretable radiographic image features.

Study Population
The OAI is a multicenter observational cohort study with longitudinal clinical and image data. This database includes MRI/CT imaging data, genotyping data, and clinical data for evaluating potential biomarkers and characterizing OA incidence and progression. Individuals from this study were between the ages of 45 and 79 and were at high risk for KOA. We selected patients with at least five visits over the 5-year longitudinal study and excluded patients who underwent knee replacement surgery. In total, 518 patients were included in our study. Among these patients, 394 patients remained in the same level of KL while the knee conditions for other 124 patients had worsen during the 5-year following-up period. We used predictive mean value to impute the missing values. For patients with OA in both knees, we selected the knee with a higher KL grade. We also extracted data from the Cerner Health Facts database to cross-validate our predictive model. This database contains clinical health records from over 500 health care centers across the United States. This clinical dataset uses ICD-9/10 diagnosis codes instead of KL grades.

Predictors
To capture the full picture of disease progression, we extracted clinical data, questionnaires, and radiographic markers from the OAI. In order to assess the functional status of patients, we adopted the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC). The WOMAC questionnaire evaluates the level of pain, stiffness, and function of the knee. The American College of Rheumatology considers this questionnaire as the gold standard for KOA functional status (13). For the clinical data, we chose patients' characteristics that could potentially predict future KOA, such as age, BMI, and measurements of physical ability. The radiographic markers, such as joint space width and knee alignment (the hip-kneeankle angle), were extracted from X-ray. The minimal joint space width (mJSW) has been considered as a proxy for cartilage thickness (14). We first analyzed different independent predictors of KL grade in a multivariate logistic regression model, which is assumed as the basic standard in KOA analysis. BMI and age were analyzed as a categorical variable. BMI is divided into four groups: underweight (BMI [10][11][12][13][14][15][16][17][18][19], normal (BMI 20-26), overweight (BMI 26-34), and obese (35+). Age was divided into 5-year intervals, namely 50-54, 55-59, 60-64, 65-69, 70-74, and 75-79. We set the normal BMI and age between 45 and 49 as the reference for the odds ratio. For the questionnaire, we marked "no symptom" as the reference and the highest score as the worst condition of the individual in that question. Table 1 shows the independent predictors of KL grade in a multivariate logistic regression model.

Predictive Model
The traditional prediction methods based on time series primarily comprise the autoregressive integrated moving average model (ARIMA), hidden Markov model (HMM), and recurrent neural networks (RNN) (15). However, ARIMA models are usually applied where data shows evidence of non-stationarity and suitable for numerical sequence (16). Since we included some questionnaires as inputs, ARIMA is not a good option. A Markov model is a useful tool to characterize the moving process where an individual transits among multiple states. In this study, we used Markov processes to estimate KOA disease stage transition probabilities. However, one limitation of the Markov model is that it assumes that the future state only depends on the current state (17). Unlike the Markov model, RNN allows the future states to depend on all past states (18).
RNN is a special neural network which could efficiently pass forward information to the next cell at each point (19). For each hidden layer at time point t, it not only includes the input layer at time point t but also considers the output of the hidden layer at the previous time point t−1 (20). Although RNN did a very good job in dynamically combining the sequential information based on its internal recurrence (21), RNN may have some issues of gradient vanishing when dealing with long-term data. Therefore, an advanced type of RNN named LSTM was designed to solve the gradient disappearing (22). Unlike other RNNs, LSTM introduced a forget gate (22) to decide whether to keep or drop a cell state based on the previous hidden state and the current input variables. Another important component of LSTM is cell state which is used to control whether to add or remove the information in the previous cell state. It has been proven that LSTM did a better job in long-term time series data (23).
The attention mechanism focuses on certain time points in the time series when processing the data. For example, BMI is very important in the early stage of the disease, while joint space width is more important in the late stage. It allows the model to pay more attention to the most important time point based on what it has learned so far. We adopted the attention mechanism from Nauta's repository and calculated the context vector as a weighted sum of each input vector instead of each time series so that an attention vector learns weights corresponding to input features (11). The attention mechanism assigns a different weight to different variables based on its ability to forecast. All initial attention scores are set to one and updated in every training epoch.

Causal Inference
In this study, we used FCI with directed acyclic graphs (DAGs). DAGs are commonly used to represent causal relationships, where vertices denote variables and the edges represent causal relationships between variables. The PC algorithm uses statistical tests to find conditional independence and constructs the structure of DAGs based on the results (12). The FCI algorithm is a generalization of the PC algorithm, except that it allows the existence of confounder variables (12). The FCI algorithm is able to detect a Markov equivalence class of DAGs with latent variables based on conditional independence information from the observed variables (12).
There are two important structures in FCI: the "V" structure and the "Y" structure (24). The "V" structure is defined when A and C are independent but dependent conditionally on B, marked as A→B←C. The "Y" structure is defined when A and C are independent of D conditional on B, marked as A→B←C and B→D. The FCI starts with a complete, undirected graph and removes recursively edges based on conditional independence decisions. After finding the skeleton of DAG, edges are oriented by identifying the "V" and "Y" structures, and further orientation rules given by Zhang (25) are applied.

Validation and Data Integration
We used 10-fold cross-validation for performance evaluation and compared true KL grade vs. the KL grade predicted. Figure 1 reports areas under the receiving operating characteristic curves (ROC). The x-axis presents sensitivity (true-positive rate) and the y-axis represents specificity (false-positive rate). The area under the curve is defined as AUC, which is a standard of performance of classification. The higher the AUC is, the better the classifier.
One limitation of the OAI dataset is that not every hospital measures KOA-specific features. In order to include these clinical indicators, we chose 30 clinical features from the Cerner database, which are commonly used for KOA diagnosis, but not included in the OAI, such as pulse popliteal of the knee. Previous studies identified age and BMI as the most significant risk factors in the development of KOA (26); therefore, we assumed that patients may have similar clinical indicators with those who shared the same BMI and age. Under this assumption, we used values from the Cerner dataset to estimate the values of variables that are missing in the OAI dataset. For each patient in the OAI dataset, we extracted clinical features from age-and BMI-matched patients from the Cerner database. OAI patients who are age-and BMI-matched with only one patient from the Cerner dataset are directly assigned the matched patient's Cerner clinical values. For patients who matched with multiple patients from the Cerner dataset, we took the average of matched patients and assigned averages to corresponding OAI patients. However, the prediction accuracy of LSTM after this imputation reduced to 85%. Therefore, we applied autoencoder and principal components analysis (PCA) as the denoising feature extractor. The performance of autoencoder is better than PCA and achieved a prediction accuracy of 93% in the dataset combined with the OAI and Cerner data.

Characterizing Disease Progression
Before we built the predictive model, we used logistic regression for feature selection. Compared with the other predictive models, such analysis can avoid confounding effects by considering the

Predicting Disease Progression With Attention-LSTM
We investigated the performance of LSTM algorithms in predicting KL grade using clinical data spanning 1 year. The predicted accuracy of LSTM achieved 90%. We compared the LSTM model against the previous models, namely, random forest, support vector machine, and naive Bayes. Random forest (RF) constructs a set of multiple decision trees on data and then averages the prediction from each of them. Compared with a single decision tree, RF reduces the chance of overfitting. Support vector machine (SVM) adopts kernels, which transform a lower dimensional input space into a higher dimensional space. This conversion made SVM more flexible and accurate. Naive Bayes is another classification technology based on Bayes' theory, which assumes that all the predictors are independent to each other. Our LSTM model performs better than RF, SVM, and naive Bayes in predicting KL grade. ROC curves are shown in Figure 1. A value of 0.5 in AUC represents as a random guessing and a value higher than 0.8 is considered quite good. Based on the results, LSTM and RF both did a good job in KL classification. The AUC value of LSTM is higher than that of SVM and naive Bayes for FIGURE 1 | AUC curve for predictive models. The x-axis represents sensitivity and the y axis represents specificity. all KL grades, and LSTM has better performance than random forest for KL = 3/4. In other words, random forest did a better job in diagnosing KOA, while LSTM did a better job in predicting KOA severity.
To better describe disease progression, we consider Markov states as disease stages. The Markov chain result is shown in Figure 2. For example, the patients at high risk (KL = 1) have 17% chance to move to the next stage (KL = 2) and 78% chance to maintain their KL grade after the 1-year follow-up visit. It should be noted that patients diagnosed with knee osteoarthritis (KL ≥ 2) are more likely to remain at this stage and very less likely to revert back to previous disease states. Figure 3 graphically depicts the important indexes of random forest. We used the mean decrease in accuracy index to evaluate the importance of variables in classification. It shows that the variable "XRJSM" (joint space narrowing) stands out among all the variables with the largest mean decrease in accuracy. Variables "P01BMI" (BMI), "MCMJSW" (medial minimum joint space width), "V00AGE" (age), and "WOMADL" (WOMAC disability score) are also relatively important for predicting KOA severity based on the indexes of variable importance.

Importance Analysis of Variables
The attention maps in Figure 4 demonstrate how attention scores can identify the dynamics. Although it was hard to distinguish which visit would have the model attended the most, XRJSM (OARSI joint space narrowing grade) was significantly important in predicting KL grade. The attention mechanism identified that joint space narrowing leads to the worsening status of KOA, which causes pain. The finding also explained why joint space narrowing is the most important factor in predicting KL progression in the LSTM model.

Causal Relationship of Variables
To understand the dependency and independency of important features in predicting KL progression, we employed FCI. We found that misalignments of the left and right knee are the reason for joint space narrowing. The angles of tibiofemoral and patellofemoral joints affected the alignment of the knees and caused joint space narrowing. This is consistent with previous reports that various knee alignment is associated with the radiographic measures of KOA severity (28). The FCI output is presented in Figure 5. The bidirected edge between left alignment and right alignment indicates that there exists at least one unmeasured confounder of left alignment and right alignment. The "o" symbols at alignment and joint space narrowing (JSM) indicate that it is difficult to distinguish whether   the connection between alignment and JSM connection is a directed edge or an unmeasured confounder. This result also eliminated some unnecessary predictors from the LSTM model. Although the symptoms, such as swelling, feel grinding, knee catch, straighten knee full, and bend knee full, may be good predictors for KL grade, they are not the reason causing KOA progression. Symptoms seem to be a good predictor of pain progression. Using FCI, we find that the change in joint space is the real reason for disease progression. Our attention-LSTM model not only reliably and accurately predicts KL grade and progression but also identifies the key factors among different time points. Using our model, clinicians can predict the possible KOA progression of patients and take preventative measures in advance. The attention mechanism dynamically shows the importance of variables at different disease stages and indicates that joint space narrowing is the primary factor in KOA progression at all time points. This method will provide strong support in clinical decision-making, including diagnosis, appropriate treatments, and preventive care. Using causal inference analysis, we identified the real cause of joint space narrowing and pain. They are misalignments of tibiofemoral and patellofemoral joints. Current primary preventive intervention is limited to weight loss (29). This finding recommended another possible intervention in clinical practice. The causal discovery also helped providers to avoid overtreatment. For example, treatment for symptoms such as swelling, grinding, catching, and inability to straighten or fully flex the knee may not be able to prevent the worsening status of knee OA. Our work had two major clinical contributions.

DISCUSSION
(1) We evaluated a broad spectrum of potential risk factors (clinical variables such as age, BMI, clinical symptoms, WOMAC questionnaires, and image measurements from X-rays) and investigated the performance of RNN algorithms in predicting KL grade. By using Markov hidden states as the disease stages, we have gained more knowledge about the stage transition, which enables a deeper understanding of the temporal progression of OA. (2) Considering the existence of hidden confounding variables, we built a causal structure of candidate risk factors and identified the preventable factors for treatment.
This study has several limitations to be considered. First, the research target of the OAI is individuals at high risk, with expected overweight and aging population. Thus, our model may not be applicable to the general population. The continuing work will focus on testing its generalizability of the models to different populations. To improve the expansion ability, we would consider transfer learning and generativeadversarial-network-based method in further studies. Second, when interpreting the findings, we found that symptoms of right knee and left knee have different impacts on KOA progression. This finding required some external validation. An additional limitation of this study is the small sample size. We have included prominent OA symptoms such as swelling joint pain, stiffness, and bending (30) in this study. However, we were unable to adjust for factors that may affect long-term outcomes, such as other symptomatic joint diseases.
In conclusion, we used attention scores in LSTM to describe feature importance at different time points and compared our model with previous works. In addition, we used causal inference to identify the key diagnostic criteria in disease progression. Our study has illustrated that clinical symptoms are important in predicting disease severity but may not be essential in disease progression. With the help of causal inference, LSTM is a better tool to help physicians in decision-making.

DATA AVAILABILITY STATEMENT
The original contributions presented in this study are included in the article/supplementary materials, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
YW: data preparation, predicting models, causal inference, statistical analysis, and manuscript preparation. LY: study design and manuscript editing. JC: manuscript editing. LL: manuscript review. PN: clinical support and manuscript editing. XZ: study design, manuscript editing, and responsible for the integrity of this work. WZ: final version manuscript review. YZ: data extract and data management from Cerner Heath Facts database. HX: responsible for Cerner Health Facts and manuscript review. All authors contributed to the article and approved the submitted version.