Biomarker-based classification of bacterial and fungal whole-blood infections in a genome-wide expression study

Sepsis is a clinical syndrome that can be caused by bacteria or fungi. Early knowledge of the nature of the causative agent is a prerequisite for targeted anti-microbial therapy. Besides currently used detection methods like blood culture and PCR-based assays, the analysis of the transcriptional response of the host to infecting organisms holds great promise. In this study, we aim to examine the transcriptional footprint of infections caused by the bacterial pathogens Staphylococcus aureus and Escherichia coli and the fungal pathogens Candida albicans and Aspergillus fumigatus in a human whole-blood model. Moreover, we use the expression information to build a random forest classifier that determines whether a sample contains a bacterial, fungal, or mock infection. After normalizing the transcription intensities using stably expressed reference genes, we filtered the gene set for biomarkers of bacterial or fungal blood infections. This selection is based on differential expression and an additional gene relevance measure. In this way, we identified 38 biomarker genes, including IL6, SOCS3, and IRG1, which have already been associated with sepsis in other studies. Using these genes, we trained the classifier and assessed its performance. It yielded 96% accuracy (sensitivities >93%, specificities >97%) in a 10-fold stratified cross-validation and 92% accuracy (sensitivities and specificities >83%) on an additional test dataset comprising Cryptococcus neoformans infections. Furthermore, the classifier is robust to Gaussian noise, suggesting correct class predictions on datasets of new species. In conclusion, this genome-wide approach demonstrates an effective feature selection process in combination with the construction of a well-performing classification model. Further analyses of genes with pathogen-dependent expression patterns can provide insights into the systemic host responses, which may lead to new anti-microbial therapeutic advances.


THE RANDOM FOREST CLASSIFIER
We examined the effect of the parameters mtry and ntree on the classification accuracy using a 10-fold stratified cross-validation. As suggested by Liaw and Wiener (2002), we tested whether doubling or halving the default value of mtry (√g, where g is the number of genes in the input dataset) affects the results. Furthermore, we reduced the number of trees to 10,000 and 1,000. For all tested values, our results were identical (Supplementary Table 1). Additionally, we assessed the performance of the classifier with extremely low parameter values (mtry*0.01, ntree=10). With such a small mtry, only one gene is considered at each split when building a tree, so the accuracies could be expected to decrease. However, the calculated accuracy values were unchanged, except for one run using mtry*2 and ntree=10 (Supplementary Table 1). The results are stable because most biomarkers show strong expression differences between the classes. When examining the expression intensities (Supplementary Figure 1), we found nearly no overlap between the expression values of the samples belonging to the class of a biomarker gene and those of the samples of the other two classes. Consequently, a tree that the random forest algorithm builds from only two genes associated with different classes can already classify a sample correctly. For example, using a fungal biomarker gene (e.g., FOSB) at one split and a bacterial biomarker (e.g., IRG1) at another split is sufficient to distinguish between the three classes. Moreover, in some cases a single gene might be sufficient for classification, because its expression values lie on a different level for each class, e.g., PPAP2B or HERC6 (Supplementary Figure 1).
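The two-split argument can be illustrated with a toy decision rule. The sketch below is not the trained classifier: the thresholds and expression values are invented, and it assumes, purely for illustration, that each biomarker is elevated in its own class. Only the gene names FOSB and IRG1 are taken from the text.

```python
# Toy two-split decision rule; all thresholds and values are invented.
FOSB_THRESHOLD = 1.0  # hypothetical cut-off for the fungal biomarker
IRG1_THRESHOLD = 1.0  # hypothetical cut-off for the bacterial biomarker


def classify(sample):
    """Assign a class from two normalized expression values."""
    # First split: the fungal biomarker separates fungal samples from the rest.
    if sample["FOSB"] > FOSB_THRESHOLD:
        return "fungal"
    # Second split: the bacterial biomarker separates bacterial from mock.
    if sample["IRG1"] > IRG1_THRESHOLD:
        return "bacterial"
    return "mock"


samples = [
    {"FOSB": 2.4, "IRG1": 0.2},  # invented fungal-like profile
    {"FOSB": 0.1, "IRG1": 3.1},  # invented bacterial-like profile
    {"FOSB": 0.0, "IRG1": 0.1},  # invented mock-like profile
]
predictions = [classify(s) for s in samples]
print(predictions)
```

As long as the class-specific values do not overlap with those of the other classes, any threshold placed in the gap gives the same predictions, which is one way to read the insensitivity to mtry and ntree.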
As large expression differences can be found for most selected biomarkers of each class, the classifications yield high accuracies even for very small parameter values. Thus, neither mtry nor the number of trees has a major influence on the performance of the classifier.
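The tested parameter values can be worked out explicitly. The following is a minimal sketch under stated assumptions: g is taken to be the 38 selected biomarker genes, the default mtry is taken as the floor of √g, scaled values are rounded and clamped to at least 1, and the combinations are assumed to form a full grid. The function and variable names are hypothetical.

```python
import itertools
import math

N_GENES = 38  # assumption: the classifier input is the 38 biomarker genes


def mtry_candidates(n_genes, factors=(0.01, 0.5, 1, 2)):
    """Map each scaling factor to a usable mtry value.

    The default is floor(sqrt(g)); scaled values are rounded and clamped
    to the range [1, g] (an assumed convention for this sketch).
    """
    default = math.floor(math.sqrt(n_genes))
    return {f: max(1, min(n_genes, round(default * f))) for f in factors}


MTRY = mtry_candidates(N_GENES)
NTREE = [10_000, 1_000, 10]  # reduced tree counts named in the text

# Assumed full grid of parameter combinations to evaluate by cross-validation.
GRID = list(itertools.product(MTRY.values(), NTREE))
print(MTRY)       # {0.01: 1, 0.5: 3, 1: 6, 2: 12}
print(len(GRID))  # 12
```

Under these assumptions, mtry*0.01 collapses to a single gene per split, which matches the behaviour described for the extremely low settings.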

SELECTION OF BIOMARKER GENES
Supplementary

TEST FOR NOISE-ROBUSTNESS
In addition to assessing how well our classifier performs on new and/or independent data, we evaluated its ability to cope with fluctuations in the expression data. We simulated different expression intensities for the selected biomarker genes across all samples by adding increasing levels of noise to the gene expression data (Materials and Methods). With increasing noise, the expression values cover a wider range of intensities and the individual data points become increasingly scattered within this range.
To test for noise-robustness, we simulated noise for the gene expression data. The noise values added to the expression intensities were drawn from a normal distribution with mean 0 and standard deviation σ. The magnitude of σ was based on the average standard deviation (ASD) of all genes in the analysis, which was 0.4859. The ASD was multiplied by scalars (0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5) to increase the effect stepwise. The noise was added before normalizing the data.
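The noise generation described above can be sketched in a few lines. This is an illustration rather than the original analysis code: the function name and the example intensities are hypothetical, while the ASD value and the scalars are taken from the text.

```python
import random
import statistics

ASD = 0.4859  # average standard deviation of all genes (from the text)
SCALARS = [0.5 * k for k in range(11)]  # 0, 0.5, ..., 5


def add_noise(values, scalar, rng):
    """Add N(0, sigma) noise with sigma = scalar * ASD to raw intensities."""
    sigma = scalar * ASD
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(42)           # fixed seed for reproducibility
raw = [8.0, 9.5, 11.2]            # invented raw expression intensities
unchanged = add_noise(raw, 0.0, rng)   # scalar 0 leaves the data as-is
noise_2asd = add_noise([0.0] * 20000, 2.0, rng)
print(unchanged)
print(round(statistics.pstdev(noise_2asd), 2))
```

The empirical standard deviation of the simulated noise should be close to 2*ASD (about 0.97) for the scalar 2, with the exact printed value depending on the seed.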
The noise was added to the raw data, so that the data processing steps (i.e., normalization and classification) are applied to the manipulated data. We calculated 11 levels of increasing noise, based on the ASD of all genes. For each of these levels, we repeated 1000 times the process of drawing a random sample, adding noise, normalizing the gene expression values by the reference genes, and classifying the sample according to the type of infection. The accuracy decreases with increasing amounts of noise (see Supplementary Table 3 for sensitivities and specificities). For up to 2*ASD, over 95% of the classifications were still correct, while we achieved accuracies of 88%, 78%, and 74% for 3, 4, and 5 times ASD, respectively (Supplementary Figure 3).
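The repeated draw, perturb, normalize, and classify procedure can be sketched as follows. Several parts are assumptions made only for this illustration: normalization is modelled as subtracting the mean reference-gene intensity, noise is applied to the biomarker genes only, the classifier is a toy threshold rule standing in for the random forest, and all gene values and the names REF1 and REF2 are invented.

```python
import random
from statistics import mean

ASD = 0.4859  # average standard deviation of all genes (from the text)
REFERENCE_GENES = ("REF1", "REF2")  # hypothetical reference gene names


def normalize(raw):
    """Assumed normalization: subtract the mean reference-gene intensity."""
    offset = mean(raw[g] for g in REFERENCE_GENES)
    return {g: v - offset for g, v in raw.items() if g not in REFERENCE_GENES}


def toy_classifier(expr):
    """Stand-in for the random forest; thresholds are invented."""
    if expr["FOSB"] > 1.5:
        return "fungal"
    if expr["IRG1"] > 1.5:
        return "bacterial"
    return "mock"


def noisy_accuracy(samples, scalar, repeats=1000, seed=0):
    """Fraction of correct calls over repeated noisy draws at one noise level."""
    rng = random.Random(seed)
    sigma = scalar * ASD
    correct = 0
    for _ in range(repeats):
        raw, label = rng.choice(samples)  # draw a random sample
        noisy = {
            g: v if g in REFERENCE_GENES else v + rng.gauss(0.0, sigma)
            for g, v in raw.items()
        }  # noise on the raw biomarker intensities, before normalization
        if toy_classifier(normalize(noisy)) == label:
            correct += 1
    return correct / repeats


SAMPLES = [  # invented raw intensities
    ({"REF1": 8.0, "REF2": 8.2, "FOSB": 11.0, "IRG1": 8.0}, "fungal"),
    ({"REF1": 8.1, "REF2": 7.9, "FOSB": 8.1, "IRG1": 11.5}, "bacterial"),
    ({"REF1": 8.0, "REF2": 8.0, "FOSB": 8.0, "IRG1": 8.1}, "mock"),
]
accuracies = {k / 2: noisy_accuracy(SAMPLES, k / 2) for k in range(11)}
print(accuracies)
```

With the toy data, the accuracy at scalar 0 is 1.0 by construction; the values at higher noise levels depend on the invented intensities and are not comparable to the accuracies reported in the study.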
As with the accuracy, the certainty scores drop with increasing noise (Supplementary Figure 4). Starting at 0.975, the average score decreases to 0.352 for 5*ASD. However, unlike the accuracy, the scores decrease sharply at lower noise levels and level off at higher amounts of noise. Separating the certainty scores by class revealed that the mock-infected samples achieved the highest scores. Classification of the fungal and bacterial samples led to lower certainty scores, with similar values for both classes.
Supplementary Table 3. Sensitivities and specificities for the levels of increasing noise.