Host Blood Gene Signatures Can Detect the Progression to Severe and Cerebral Malaria

Malaria is a major international public health problem that affects millions of patients worldwide, especially in sub-Saharan Africa. Although many tests have been developed to diagnose malaria infections, we still lack reliable diagnostic biomarkers for the identification of disease severity, especially in endemic areas where the diagnosis of cerebral malaria is very difficult and requires the exclusion of all other possible causes. Previous host and pathogen transcriptomic studies have not yielded homogeneous results that can be harnessed into a reliable diagnostic tool. Here we utilized a multi-cohort analysis approach using machine-learning algorithms to identify blood gene signatures that can distinguish severe and cerebral malaria from moderate and non-cerebral cases. Using a Regularized Random Forest model, we identified 28-gene and 32-gene signatures that can reliably distinguish severe and cerebral malaria, respectively. We tested the specificity of both signatures against other common infectious diseases to ensure the signatures' reliability and suitability as diagnostic markers. The severe and cerebral malaria gene signatures were further integrated through k-top scoring pairs classifiers into ten and nine gene pairs that could distinguish severe and cerebral malaria, respectively. These signatures could serve as the basis for blood-based diagnostic tools for malaria severity in endemic countries.


INTRODUCTION
Malaria is an important vector-transmitted infectious disease that affects millions of patients worldwide, especially in sub-Saharan Africa, with an estimated 228 million new cases and 405,000 deaths in 2018 alone (World Malaria Report, 2019). Despite the decreasing number of new patients, a result of multinational efforts, and various advancements in diagnosis and treatment options, it is still a large burden, especially on the countries most affected.
The disease is caused by the infection of human erythrocytes with protozoa of the genus Plasmodium, of which P. falciparum is by far the most relevant (White et al., 2014). P. falciparum infection can lead to several severe complications such as respiratory distress, hypoglycemia, metabolic acidosis, and severe anemia (Trampuz et al., 2003). Cerebral malaria is one of the most severe complications, especially in children, and can lead to long-term neurological effects and a higher mortality rate (Hora et al., 2016).
Although many diagnostic tests have been developed for the identification and screening of malaria infections (McMorrow et al., 2011), and some clinical signs such as retinopathy are hypothesized to be associated with severe and cerebral malaria (Beare et al., 2006), we still lack a reliable diagnostic biomarker for the identification of disease severity. In disease-endemic regions, cerebral malaria is a diagnosis of exclusion (Idro et al., 2005), and patients with other etiologies such as viral encephalopathy may happen to additionally have asymptomatic parasitemia (Taylor et al., 2004). More sensitive diagnostic and prognostic tools are required to enable rapid identification of severe and cerebral malaria to ensure adequate therapeutic response, which would improve disease outcome (Mwangi et al., 2005; Vinnemeier et al., 2012).
Many transcriptomic studies have tried to elucidate characteristic features of the host immune response to malaria infection and subsequently define promising candidates for biomarker development and treatment. However, studies with large sample numbers are rare, and the platform and design heterogeneity of the studies performed so far have made it difficult to define uniform biomarkers (Hodgson et al., 2019). A practical approach to harness the potential of these studies while overcoming the heterogeneity caused by study-specific methods is to use multi-cohort analysis, which compensates for study-specific biases and increases the analysis sensitivity by pooling the many samples analyzed across these studies. In this way it is possible to distinguish the most relevant features of the tested phenotype (Haynes et al., 2016).
This approach has been successful in combining gene-expression studies to identify reliable biomarkers and novel gene signatures for various diseases such as bacterial (Sweeney et al., 2016; Badr et al., 2021) and viral infections (Barral-Arca et al., 2020; Li et al., 2020), and in elucidating novel molecular mechanisms responsible for the development of infectious and autoimmune diseases (Badr and Häcker, 2019; Zhong et al., 2020).
Here we implemented a multi-cohort analysis using machine-learning algorithms to identify gene signatures from the whole blood and PBMC of malaria patients that we find capable of distinguishing cerebral and severe cases from mild malaria as well as from infections with other agents.

Collection of Gene Expression Data
Collection of the meta-analysis data was carried out by searching public expression databases (NCBI GEO and ArrayExpress; accessed September 2020). For the GEO query, we used the search terms "Plasmodium" and "malaria" with the filters organism (Homo sapiens), study type (expression profiling by array), and entry type (Dataset/Series). The ArrayExpress query was executed using the search terms "Plasmodium" and "malaria" with the filters organism (Homo sapiens) and experiment type (array assay). Initially 89 entries from GEO and 34 entries from ArrayExpress were retrieved. Duplicates and irrelevant studies were excluded, and 19 studies remained and were further refined using the inclusion criteria (below) to identify the final nine studies included in our analysis. We included only studies that had analyzed gene expression in whole blood, PBMC or blood cell components but excluded studies using other tissues, ex vivo experiments, and cell line infection models. The database search followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement and is documented in the PRISMA Flow Diagram (Supplementary File 1). Only datasets with available raw data were included. After a thorough search and excluding datasets as specified above, nine datasets with 417 samples were selected for further analysis.

Data Pre-Processing and Normalization
We removed samples taken from healthy controls, keeping 318 patient samples for the downstream analysis. We ensured that all datasets were normalized and log-scaled before analysis. Since our analysis includes datasets from experiments with different technologies, we further Z-transformed the gene expression of each dataset separately to ensure that all datasets are on the same scale. The nine datasets were combined into a single merged dataset based on a subset of common genes (2578 genes), and samples were labeled as severe or non-severe and cerebral or non-cerebral using the phenotype information provided in each dataset. In terms of malaria severity, samples without available annotation were labeled as severe if they had one or more of the following: a) cerebral malaria; b) severe anemia; c) hyperparasitemia. These criteria are based on the World Health Organization (WHO) criteria for the diagnosis of severe malaria infection (World Health Organization, 2000). Subsequently, we divided the data into 70% training and 30% testing sets using balanced stratification, ensuring that both divisions have a similar representation of the important covariates including age, sex, WBC count, and the original dataset. Finally, the training and testing data were quantile-normalized separately (Supplementary Figures 1, 2).
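The per-dataset scaling and merging step can be illustrated with a minimal sketch. It assumes that the Z-transformation is applied gene by gene within each dataset, which the text does not state explicitly, and the function names `z_transform` and `combine_on_common_genes` as well as the toy gene names are hypothetical. It is not the authors' R pipeline.

```python
from statistics import mean, pstdev

def z_transform(dataset):
    """Standardize each gene within one dataset.

    dataset: dict mapping gene -> list of log-scaled values (one per sample).
    Assumption: scaling is done per gene; genes with zero variance map to 0.0.
    """
    scaled = {}
    for gene, values in dataset.items():
        mu = mean(values)
        sd = pstdev(values)
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled

def combine_on_common_genes(datasets):
    """Z-transform every dataset separately, then merge on the shared genes."""
    common = set.intersection(*(set(d) for d in datasets))
    combined = {gene: [] for gene in sorted(common)}
    for d in datasets:
        scaled = z_transform(d)
        for gene in combined:
            combined[gene].extend(scaled[gene])
    return combined

# Two toy datasets measured on very different scales (hypothetical values).
array_a = {"G1": [1.0, 2.0, 3.0], "G2": [5.0, 5.0, 5.0], "G3": [0.0, 1.0, 2.0]}
array_b = {"G1": [100.0, 200.0], "G2": [10.0, 30.0]}
merged = combine_on_common_genes([array_a, array_b])
```

After merging, only the genes present in every dataset remain, and each dataset contributes values centered at zero regardless of its original platform scale.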

Identification of the Gene Signatures
To identify parsimonious gene signatures of both severe and cerebral malaria, we performed a feature selection process using regularized random forest (RRF) models (Deng and Runger, 2012; Deng and Runger, 2013) on the training data. RRF is similar to random forest but returns a subset of non-redundant features by penalizing the features used for node splitting if their information gain is similar to features used at previous splits. Since the selected features might depend on the specific data used to build the model, we bootstrapped the training data 100 times and built an RRF model on each bootstrap sample. We hypothesized that consistently selected features would be important to the phenotype under study, so we included those selected at least five times in the final models. These consistently selected features were then used to train standard RF models on the training data, and the number of variables randomly sampled for splitting at each tree node (mtry) was selected using the "tuneRF" function. This whole process was performed for both phenotypes to identify two small subsets of genes that can distinguish severe from non-severe and cerebral from non-cerebral malaria.
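The bootstrap stability filter described above can be sketched as follows. The sketch resamples the training data 100 times and keeps features selected at least five times, as in the text. The selector itself is a deliberately simple stand-in (`toy_selector`, a hypothetical mean-difference rule), not a regularized random forest, and all names and data are illustrative assumptions.

```python
import random
from collections import Counter

def toy_selector(boot, min_gap=1.0):
    """Hypothetical stand-in for one RRF fit: keep genes whose class means
    differ by at least min_gap. A real analysis would fit an RRF model here."""
    pos = [x for x, y in boot if y == 1]
    neg = [x for x, y in boot if y == 0]
    if not pos or not neg:  # a resample may miss one class entirely
        return []
    selected = []
    for gene in pos[0]:
        gap = abs(sum(x[gene] for x in pos) / len(pos)
                  - sum(x[gene] for x in neg) / len(neg))
        if gap >= min_gap:
            selected.append(gene)
    return selected

def stable_features(samples, select_features, n_boot=100, min_count=5, seed=0):
    """Count how often each feature is selected across bootstrap resamples
    and return those selected at least min_count times."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample the training set with replacement, keeping its size.
        boot = [rng.choice(samples) for _ in samples]
        counts.update(set(select_features(boot)))
    return sorted(g for g, c in counts.items() if c >= min_count)

# Toy training data: (expression dict, label); GENE_A separates the classes.
toy_samples = ([({"GENE_A": 5.0, "GENE_B": 0.0}, 1)] * 10
               + [({"GENE_A": 0.0, "GENE_B": 0.0}, 0)] * 10)
kept = stable_features(toy_samples, toy_selector)
```

Passing the selector as a function keeps the resampling logic independent of the model, so any feature selection routine that returns a list of gene names could be substituted.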

Independent Evaluation of Performance
We evaluated both signatures on the unseen testing data using different performance metrics including the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPRC). To compute the ROC and precision-recall curves together with the AUC values, we used the predicted class probabilities (ranging from 0 to 1) returned by the RF model together with the true class labels (Fawcett, 2006). These probabilities were transformed to binary classes (severe vs non-severe and cerebral vs non-cerebral) using the default cutoff (0.5). The predicted classes were compared with the true labels to calculate the other metrics including the accuracy, sensitivity, and specificity. Notably, since these metrics can be misleading, especially in the case of unbalanced datasets (Bekkar et al., 2013; Wald and Bestwick, 2014), the Matthews correlation coefficient (MCC) was used as an additional metric to assess the signatures' performance (Matthews, 1975) since it takes the class imbalance into account. MCC can be interpreted as the correlation between the class predictions and the true labels, with values ranging from -1 (worst prediction) to 1 (best prediction) (Chicco et al., 2021).
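As an illustration of how these metrics relate to each other, the sketch below thresholds predicted probabilities at 0.5 and derives sensitivity, specificity, accuracy, and MCC from the confusion matrix, plus a rank-based AUC. The function names and the toy probabilities are hypothetical, and treating a probability exactly equal to the cutoff as positive is an assumption.

```python
from math import sqrt

def threshold_metrics(probs, labels, cutoff=0.5):
    """Sensitivity, specificity, accuracy, and MCC at a fixed cutoff.
    Assumption: probabilities >= cutoff are called positive."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(labels),
        # MCC is conventionally set to 0 when a margin of the table is empty.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def roc_auc(probs, labels):
    """AUC as the probability that a random positive scores higher than a
    random negative (ties count one half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions: three positives followed by three negatives.
toy_probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_metrics = threshold_metrics(toy_probs, toy_labels)
toy_auc = roc_auc(toy_probs, toy_labels)
```

Note that the AUC uses the full ranking of probabilities, whereas the other four metrics depend on the chosen cutoff, which is why the two kinds of metric can disagree for the same model.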
To examine whether the severe malaria signature can capture some of the molecular changes induced by malaria in non-blood tissues, we applied the signature to a dataset of 20 placental samples (GSE7586), ten of which have placental malaria (PM) and the other ten are from controls. Eight samples have signs of placental inflammation, seven with and one without PM. The signature was used to distinguish PM-positive from PM-negative samples and to distinguish samples with inflammation from inflammation-free samples.

Specificity of the Signatures
Since many infectious diseases may induce similar, non-specific molecular changes in the blood, we proceeded to test the specificity of the two malaria signatures. For this purpose, we used the signatures to classify dengue fever (DF) versus healthy controls and DF versus severe dengue (dengue hemorrhagic fever (DHF) and dengue shock syndrome (DSS)) in blood samples from six different datasets (GSE51808, GSE96656, GSE25001, GSE18090, GSE17924, and GSE13053). We used DF to test the specificity of our signatures since malaria and DF have a similar geographical distribution, both are mosquito-transmitted, and both share several immunopathogenic features (Arias et al., 2014; Mendonca et al., 2015). Similarly, we used the malaria signatures to distinguish pulmonary or extra-pulmonary tuberculosis (TB) from healthy controls in blood samples from four datasets (GSE19444, GSE73408, GSE62525, and GSE83456) and meningitis from healthy controls using blood samples from two datasets (GSE80496 and GSE40586). Finally, the signatures were also tested in six other datasets (GSE40396, GSE42026, GSE6269, GSE63990, GSE39940, and GSE46681) with samples from multiple viral and bacterial infections including TB,

Improving the Interpretability of the Signatures
Since interpretability of the gene signatures is essential for their potential clinical uses, we proceeded to test whether we could simplify the decision rules of the two malaria signatures. For this purpose, we divided the genes comprising the signatures into two sets of up- and down-regulated genes. These were subsequently used to build gene pairs, with each pair consisting of one up-regulated and one down-regulated gene. We used the resulting gene pairs to build K-Top Scoring Pairs (K-TSPs) models with the target of identifying a subset of gene pairs that can separate severe from non-severe and cerebral from non-cerebral malaria. K-TSPs is a rank-based classification method that selects K gene pairs whose expression levels consistently switch their ranking between the two classes of interest (Geman et al., 2004). Each pair votes for one class based on the relative ordering of the two genes, and the final prediction is simply determined by the sum of votes.
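The pair-selection idea can be sketched as follows. Each candidate pair is scored by how differently the ordering "up-regulated gene above down-regulated gene" occurs in the two classes, and the highest-scoring pairs are kept. The greedy choice of non-overlapping pairs is an assumption of this sketch rather than a detail given in the text, and the function names, gene names, and values are hypothetical.

```python
from itertools import product

def tsp_score(samples, labels, up, down):
    """|P(up > down | class 1) - P(up > down | class 0)| for one gene pair."""
    def fraction(cls):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        return sum(s[up] > s[down] for s in rows) / len(rows)
    return abs(fraction(1) - fraction(0))

def select_top_pairs(samples, labels, up_genes, down_genes, k):
    """Rank all up/down pairs by score and greedily keep k disjoint pairs.
    Assumption: each gene may appear in at most one selected pair."""
    ranked = sorted(product(up_genes, down_genes),
                    key=lambda pair: tsp_score(samples, labels, *pair),
                    reverse=True)
    chosen, used = [], set()
    for up, down in ranked:
        if up not in used and down not in used:
            chosen.append((up, down))
            used.update((up, down))
        if len(chosen) == k:
            break
    return chosen

# Toy expression data: two class-1 samples followed by two class-0 samples.
toy_expr = [
    {"U1": 5, "D1": 1, "U2": 3, "D2": 2},
    {"U1": 6, "D1": 2, "U2": 1, "D2": 2},
    {"U1": 1, "D1": 5, "U2": 1, "D2": 2},
    {"U1": 2, "D1": 6, "U2": 1, "D2": 3},
]
toy_classes = [1, 1, 0, 0]
top_pairs = select_top_pairs(toy_expr, toy_classes, ["U1", "U2"], ["D1", "D2"], k=2)
```

Because only the within-sample ordering of two genes enters the score, the result does not change under any monotone rescaling of a sample's expression values.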

Software and Packages
We used the R programming language (version 4.0.2) for initial processing and analysis of the datasets. The datasets were accessed from the NCBI GEO database using the GEOquery R package. The feature selection processes were performed using the RRF package (Deng and Runger, 2012) and the random forest models were constructed using the randomForest package (Liaw and Wiener, 2002). Visualization and clustering of the samples were done using PCA and heatmap methods implemented in the R packages pcaMethods, pheatmap, ClustVis, and ggplot2.

Data Acquisition
From the initial datasets acquired by searching public databases, nine matched our predetermined inclusion criteria (see Methods). The datasets included samples from 99 healthy controls and 318 malaria patients, of whom 137 were asymptomatic or had mild malaria, 51 had severe non-cerebral malaria, and 130 had cerebral malaria. The data summary of the included datasets is shown in Table 1.

Discovery of Gene Signatures of Severe and Cerebral Malaria
For severe malaria, we used a bootstrap process to identify 28 genes that were frequently selected (≥ 5%) by the RRF model.

Evaluation of the Identified Signatures
When evaluated on the unseen testing dataset, both the severe and cerebral malaria signatures showed a good performance. The severe malaria signature was able to distinguish severe from non-severe malaria in the testing dataset with an AUC of 0.85, sensitivity of 0.91, specificity of 0.62, and MCC of 0.54 (Figure 2A). Similarly, the cerebral malaria signature could distinguish cerebral from non-cerebral malaria with an AUC of 0.98, sensitivity of 0.89, specificity of 0.93, and MCC of 0.81 in the testing dataset (Figure 2B). See Table 2 for complete performance. Additionally, the severe malaria signature was able to distinguish PM from non-PM samples and samples with inflammation from those without inflammation with AUCs of 0.70 and 0.76, respectively (see Supplementary Figure 5).

Signature Specificity and Comparison With Other Infectious Diseases
To examine the specificity of the signatures, we applied them to different datasets of other infectious diseases (Supplementary Table 1). The signatures were used to distinguish DF from healthy controls and complicated DF (DHF, DSS) from uncomplicated DF. In all DF datasets, the severe malaria signature performed poorly with AUCs ranging from 0.37 to 0.64 (Figure 3) while the cerebral signature had a relatively better performance with AUCs ranging from 0.30 to 0.92 (Supplementary Figure 6). Both signatures also failed to distinguish primary pulmonary and extra-pulmonary TB from healthy controls in four different datasets with AUCs ranging from 0.32 to 0.566 and 0.15 to 0.65 for the severe and cerebral signatures, respectively (Supplementary Figures 7 and 8).
Similarly, the signatures were also applied to six different datasets comprising multiple viral and bacterial infections in which they also failed to distinguish infected from non-infected samples (Supplementary Figures 9 and 10). Surprisingly, the severe malaria signature (Supplementary Figure 11) had a much better performance in distinguishing meningitis from healthy controls in blood samples compared with the cerebral malaria signature (Supplementary Figure 12).

Simplifying the Signatures
We proceeded to improve the interpretability of the two malaria signatures to improve their clinical utility. The genes comprising each signature were divided into up- and down-regulated genes based on their mean expression in severe vs non-severe and cerebral vs non-cerebral samples (see Supplementary Tables 2 and 3). A total of 14 up-regulated and 9 down-regulated genes showed a large difference in their mean expression in cerebral versus non-cerebral malaria and were subsequently used to build a list of 126 gene pairs (14 × 9). Similarly, the up- and down-regulated genes in the severe malaria signature were used to build a list of 192 pairs. These gene pairs were fed to a K-TSPs classifier to select the top pairs relative to the phenotype being predicted. The severe malaria K-TSPs model identified ten gene pairs capable of differentiating severe from non-severe malaria: SLC38A2-SCML1, SLC25A40-MAP2K7, DNALI1-AGPAT3, LIFR-TBCD, STK17B-ORC2, SF3B1-USP48, ZNF148-ZCCHC2, CBX5-CHAF1A, CNOT7-PLXNA2, and CREM-IDH1. When evaluated on the unseen testing data, the signature achieved an AUC of 0.68, accuracy of 0.66, sensitivity of 0.64, specificity of 0.71, and MCC of 0.30 (Figure 4A). Similarly, the K-TSPs model for cerebral malaria identified nine gene pairs: TTC17-C18orf8, PUM2-ASB7, RABEP1-MYH11, SETX-SPATS2L, XRCC5-TRIP12, ELF2-CHRNA10, LARP4-ANK2, MREG-KPNA6, and ZNF197-CD53. Those nine pairs distinguished cerebral from non-cerebral malaria in the testing data with an AUC of 0.79, accuracy of 0.73, sensitivity of 0.78, specificity of 0.67, and MCC of 0.45, a somewhat lower performance than that obtained by the RF model but with better interpretability owing to its simple decision rules (Figure 4B).
For both signatures, each pair votes for a particular class based on the relative ordering of the two genes and the final prediction is determined by the sum of votes. Thresholds of five and four votes were used for the severe and cerebral malaria K-TSPs signatures, respectively. That is, a sample with ≥ 5 votes would be classified as severe malaria, and a sample with ≥ 4 votes would be classified as cerebral malaria. Heatmaps of the TSPs votes in the testing data are shown in Supplementary Figures 13 and 14.
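The voting rule lends itself to a very short sketch. Because the text does not state which gene of each reported pair is expected to be higher in severe disease, the example below uses placeholder gene names (`UP0`/`DOWN0` and so on) and assumes that a pair votes for the disease class when the first gene is expressed above the second. The function names are hypothetical.

```python
def count_votes(sample, pairs):
    """Number of pairs whose first gene is expressed above the second gene.
    Assumption: this ordering is the one that votes for the disease class."""
    return sum(1 for first, second in pairs if sample[first] > sample[second])

def classify(sample, pairs, min_votes):
    """Call the sample positive when it collects at least min_votes votes."""
    return count_votes(sample, pairs) >= min_votes

# Ten placeholder pairs, mirroring the size of the severe malaria classifier.
placeholder_pairs = [(f"UP{i}", f"DOWN{i}") for i in range(10)]

def toy_sample(n_voting):
    """Build a sample in which exactly n_voting pairs vote for the disease class."""
    sample = {}
    for i, (first, second) in enumerate(placeholder_pairs):
        sample[first] = 1.0 if i < n_voting else 0.0
        sample[second] = 0.5
    return sample

# Severe malaria threshold from the text: five or more votes out of ten pairs.
five_votes_is_severe = classify(toy_sample(5), placeholder_pairs, min_votes=5)
four_votes_is_severe = classify(toy_sample(4), placeholder_pairs, min_votes=5)
```

Only comparisons within a single sample are needed, so the rule can be evaluated without normalizing against a reference cohort.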

DISCUSSION
Malaria is one of the main global public health problems; it tops the WHO priority list and remains one of the top causes of death in many low-income countries (World Malaria Report, 2019). New approaches to rapidly diagnose severely affected patients are essential to combat its high mortality rate. The available diagnostic tools lack a reliable and accessible measure to distinguish severe and cerebral malaria from mild cases, especially in high-endemicity areas, where other infections accompanied by asymptomatic malaria parasitemia can be mistaken for malaria. Autopsies of fatal cerebral malaria cases indicated that the misdiagnosis of cerebral malaria could reach as high as 23% (Taylor et al., 2004). In our study, we demonstrate two blood gene signatures that can identify severe and cerebral malaria patients. To select the most relevant genes able to classify disease status in our cohort, we implemented a multi-step analysis, where we combined a data-preprocessing pipeline to ensure reliable integration of samples from different datasets and used a two-step genomics classification model to select the most important features. For the first selection, we used regularized random forest (RRF) models, which modify standard random forest models by penalizing the features used for splitting the trees, meaning that new features are added only when they offer a predictive value superior to those used in previous splits, which ensures choosing the features most relevant to the model accuracy (Ancuceanu et al., 2020).
We identified 28-gene and 32-gene signatures that can reliably distinguish severe and cerebral malaria with an AUC of 0.85 and 0.98, and sensitivity of 0.91 and 0.89, respectively. The high performance of these signatures in the malaria datasets without cross-reacting with other infectious diseases makes them suitable candidates for new diagnostic platforms for malaria severity.
These signatures provide a substantial improvement over previously detected host-gene signatures, which were mainly focused on distinguishing acute malaria from healthy controls (Griffiths et al., 2005) or harbored too many genes to be implemented in a diagnostic tool (Nallandhighal et al., 2019).
Our multi-cohort approach could detect many genes that may have been missed in individual study analysis. ATP5G3, which was downregulated in the two malaria signatures, plays a part in energy metabolism and energy production. Its downregulation in both types of disease can indicate an infection-induced mitochondrial injury, which can lead to reduced energy production, reducing the capacity of immune cells to stop the infection (Lobet et al., 2015).
Several immunological aspects have been associated with the development of severe and cerebral malaria in comparison with mild cases, such as the levels of tumor necrosis factor (TNF) (Grau et al., 2010), although TNF-dependent regulation of the immune response is essential in various infectious diseases such as cerebral tuberculosis (Francisco et al., 2015). In our cerebral malaria signature, we see that the immune-cell specific tetraspanin CD53, which is downregulated in cerebral patients, can be a better marker for cerebral disease status, as it also belongs to one of the gene pairs in the K-TSPs analysis, and was shown to be down-regulated during neutrophil activation with TNF (Mollinedo et al., 1998). Furthermore, CD53 plays an important role in the adaptive immune response, especially in B cell activation and differentiation (Dunlock, 2020), and its deficiency is associated with recurrent infections (Mollinedo et al., 1997). Moreover, its expression is preserved between blood and brain tissue, highlighting its importance as a diagnostic biomarker for cerebral malaria (Cai et al., 2010).

FIGURE 4 | Performance of the K-TSPs severe and cerebral malaria signatures. (A) The performance of the severe malaria 10-TSPs model at distinguishing severe from non-severe malaria. (B) The performance of the cerebral malaria 9-TSPs model at distinguishing cerebral from non-cerebral malaria. Shown are the ROC curves in the training (red) and testing (green) data. The set of genes comprising each signature was divided into up- and down-regulated genes and used to build a K-top scoring pairs (K-TSPs) model with improved interpretability. AUC: area under the ROC curve.

Most genes in the two signatures have not been previously reported to be associated with the severity of malaria infection but some play a role in other infectious diseases. Isocitrate Dehydrogenase (NADP(+)) 1 (IDH1), one of the genes we identified as down-regulated in severe malaria, has also been found to be associated with HIV infection. Specifically, Chinn et al. reported that SNPs in IDH1 were significantly associated with HIV infection, three of which were found in transcription factor binding sites (Chinn et al., 2010). Similarly, CNOT7 and ADAP2, both down-regulated in severe malaria, were previously reported to have a protective role during viral infections (Shu et al., 2015; Chalabi Hagkarim et al., 2018). Of the up-regulated genes in severe malaria, TRA2A was found to promote human influenza A virus replication by inhibiting the splicing of the NS segment of its mRNA (Zhu et al., 2020). CREM was found to play a role in T cell exhaustion by reducing IL-2 production (Maine et al., 2016) and its expression is increased in mice infected with Entamoeba histolytica (Wojcik et al., 2018).
The cerebral malaria signature consists of 19 up-regulated and 13 down-regulated genes. The Pumilio protein PUM2, which is up-regulated in cerebral malaria patients, plays a role in the regulation of RIG-I signaling, which is essential for pathogen detection (Narita et al., 2014). XRCC5, the gene encoding the KU80 protein, which plays a role in the repair of DNA double-strand breaks (Grabsch et al., 2006), is up-regulated in cerebral patients in comparison with non-cerebral ones. This indicates a DNA-damage response by the host in response to cerebral malaria infection that may explain some of the long-term effects of cerebral malaria such as neurocognitive defects seen in survivors (Schiess et al., 2020). Both Senataxin (SETX) and MORC Family CW-Type Zinc Finger 2 (MORC2) are associated with a number of neurological disorders including cerebellar ataxia (Coutelier et al., 2018) and Charcot-Marie-Tooth disease (CMT) (Sevilla et al., 2016); however, SETX was also found to decrease the expression of anti-viral genes like IFN-β, delaying the resolution of infection (Miller et al., 2015). EPH Receptor A4 (EPHA4) and other Eph receptors are known to be up-regulated after neuronal injury (Goldshmit et al., 2006). Although the role of EPHA4 has not been explored in malaria, it was proposed as a blood mRNA biomarker for tuberculosis (de Araujo et al., 2016). O-Linked N-Acetylglucosamine (GlcNAc) Transferase (OGT) was found to promote influenza A virus replication and cytokine production and its overexpression has been linked to hepatitis C virus (HCV) infectivity and HCV-induced hepatocellular carcinoma (Herzog et al., 2020).
Gene expression markers have been gaining increased attention for their suitability in point-of-care testing tools that aim at a precise diagnosis of complicated infectious diseases. In daily practice it is important to distinguish bacterial from viral infections (Herberg et al., 2016; Gómez-Carballa et al., 2019), and in the same way malaria has to be differentiated from other severe diseases. To improve the clinical utility of both signatures, we enhanced their interpretability using a gene-pair system (K-TSPs) that can be easily integrated in a point-of-care molecular-based test with various nucleotide amplification techniques. The K-TSPs uses a simple classification mechanism which selects a set of features that consistently switch their ranking between the two classes of interest and subsequently uses these features to construct gene pairs (Tan et al., 2005). Each pair votes for one class based on the relative ordering of the two genes, and the final prediction is determined by the sum of votes given by all the pairs in the final classifier. Using this approach, we managed to simplify the severe and cerebral malaria signatures into ten and nine gene pairs that can still distinguish severe from non-severe and cerebral from non-cerebral malaria, respectively. Since this classification mechanism depends solely on the relative ranking of genes rather than the absolute expression values, it is very flexible and can be implemented through different platforms like RT-PCR.
Notably, our study has some limitations. First, while our signatures have been tested on independent datasets, there is still a need to further validate their performance in large patient cohorts using RT-PCR or other testing platforms. Second, given that malaria is geographically prevalent in low-income countries with limited infrastructure, any diagnostic test should be low-cost and feasible (Gallup and Sachs, 2001). Achieving this would require extensive collaboration between researchers, physicians, industry personnel and other entities to design and validate a prototype based on these signatures that can be used as a point-of-care diagnostic test in malaria-endemic regions. With this in mind, we made a special effort to transform the RF-based signatures into interpretable ones with simple rank-based decision rules using the K-TSPs algorithm. This feature makes both signatures platform-friendly and would expedite their clinical use.
In conclusion, we identify two gene signatures capable of detecting severe and cerebral malaria infections. To the best of our knowledge, this is the first study to implement RRF and K-TSP algorithms coupled with multi-cohort analysis to identify gene signatures capable of distinguishing cerebral and severe malaria patients. While it is clear that these signatures have to be further validated in prospectively curated large cohorts, especially in malaria-endemic areas, they provide at this stage a basis for the first diagnostic assay for predicting malaria disease severity and distinguishing cerebral malaria from other causes of encephalitis.
Our study demonstrates the power of exploiting heterogeneous datasets through multi-cohort analysis and rigorous preprocessing and data cleaning approaches in guiding new molecular studies and disease biomarker discoveries. These signatures can play a role in closing a fundamental gap in the efforts to decrease the disease burden and to combat disease mortality.

DATA AVAILABILITY STATEMENT
The datasets analyzed in this study are publicly available on the Gene Expression Omnibus (GEO) and ArrayExpress under the corresponding accession numbers. The code for this analysis is available on GitHub and can be accessed using the following link: https://github.com/MohamedOmar2020/Malaria.