Cross-modal predictive modeling of multi-omic data in 3D airway organ tissue equivalents during viral infection

Rezapour, Mostafa; McNutt, Patrick M.; Ornelles, David A.; Walker, Stephen J.; Murphy, Sean V.; Atala, Anthony; Gurcan, Metin Nafi

doi:10.3389/fgene.2025.1658577

ORIGINAL RESEARCH article

Front. Genet., 25 September 2025

Sec. Computational Genomics

Volume 16 - 2025 | https://doi.org/10.3389/fgene.2025.1658577

Cross-modal predictive modeling of multi-omic data in 3D airway organ tissue equivalents during viral infection

Anthony Atala²

Metin Nafi Gurcan¹

¹Center for Artificial Intelligence Research, Wake Forest University School of Medicine, Winston-Salem, NC, United States
²Wake Forest Institute for Regenerative Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, United States
³Microbiology Immunology, Wake Forest University School of Medicine, Winston-Salem, NC, United States

Introduction: Developing robust predictive models from multi-omics data is challenging because sample sizes are typically small (often fewer than 100) while the feature space is vast (over 20,000 molecular features such as genes, transcripts, and proteins), which increases the risk of overfitting and limits generalizability. To address this challenge, this study introduces the Magnitude-Altitude Score Analysis for Tracking Infection and Time-Dependent Genes (MASIT), a novel method adept at filtering out irrelevant features/genes while focusing on important ones.

Methods: Applied to the 3D airway organ tissue equivalent model that mimics human airway physiology, MASIT employed both RNA-Seq and NanoString technologies for a comprehensive analysis. RNA-Seq offered a transcriptomic overview of 19,671 protein coding genes, whereas NanoString targeted 773 specific genes. We used MASIT to analyze gene expression changes in the airway tissue equivalent after exposure to Influenza A virus, Human metapneumovirus, and Parainfluenza virus type 3 at 24- and 72-hour post-infection. MASIT was trained and validated on NanoString data, tested on the held-out RNA-Seq test set, and benchmarked against widely used feature selection approaches, including Fisher score, minimum Redundancy Maximum Relevance, embedded Lasso regression, and Boruta feature importance.

Results: MASIT achieved a 92% accuracy in differentiating eight groups of infected samples. Our findings showed that MASIT outperformed models using the full gene set, notably in algorithms like Random Forest, XGBoost, and AdaBoost. Selected genes such as IFIT1, IFIT2, IFIT3, OASL, IFI44, and OAS3 were particularly effective in categorizing samples by viral type and infection stage. Benchmarking further demonstrated that MASIT not only exceeded the performance of existing feature selection methods within NanoString data but also uniquely maintained high accuracy and stability when applied to held-out RNA-Seq data.

Discussion: These results provide insights into the host’s molecular response to viral infections and highlight MASIT as a robust tool for analyzing high-dimensional, small-sample multi-omics datasets.

1 Introduction

The human bronchial tree is important not only for air conduction but also for its role as a key interface for several biological functions. It enables mucociliary clearance, airway humidification, and the detection and defense against pathogens and particulates, significantly impacting the immune system (Crystal et al., 2008). The diversity in airway epithelium, which includes cell types such as ciliated cells, goblet cells, and basal cells (Knight and Holgate, 2003), is directly linked to respiratory diseases such as asthma and chronic obstructive pulmonary disease (COPD) (Hiemstra et al., 2015). This epithelial functionality is augmented by its interactions with the surrounding microenvironment, including various cell types and the extracellular matrix (ECM), which are important for maintaining structural integrity and cellular communication within the airways (Tam et al., 2011; Sacco et al., 2004; Liu et al., 2021).

Traditional models using monocultures of human bronchial epithelial (HBE) cells on porous polymer membranes at air-liquid interface (ALI) have been beneficial for studying certain airway functions like mucus production and ciliation. However, these models lack comprehensive cell-cell interactions with non-epithelial cells and do not adequately mimic the complex cell-ECM interactions. Additionally, the typically stiffer growth surfaces can alter the phenotype, diversity, and functionality of HBE cells, thus not accurately reflecting the conditions of live airway epithelia (Hirst et al., 2014; Jaroch et al., 2018; Barros et al., 2021).

Our group has developed an innovative planar airway 3D organ tissue equivalent (OTE) model that significantly enhances the simulation of human airway physiology in vitro (Leach et al., 2023). This model integrates a well-differentiated HBE layer maintained at an ALI on a hydrogel substrate, which supports the inclusion of native lung fibroblasts and solubilized human lung extracellular matrix (ECM) (Leach et al., 2023). This configuration allows for a more realistic representation of the physiological interactions between cells and the ECM, as well as biomechanical influences that are important for studying airway function and disease (Leach et al., 2023).

Our OTE model addresses the traditional models’ limitations by providing a dynamic 3D microenvironment where cellular interactions and ECM mechanics more closely resemble those found in vivo. This advanced setup not only supports the differentiation and functional activities of airway epithelial cells but also facilitates intricate studies on how these cells interact with other airway components under physiological and pathological conditions (Leach et al., 2023).

In our recent study, we investigated the gene expression dynamics within 3D airway OTEs following infections by Influenza A virus (IAV), Human metapneumovirus (MPV), and Parainfluenza virus type 3 (PIV3) at both 24- and 72-h post-infection. Using Generalized Linear Models (GLMs) (Nelder and Wedderburn, 1972) with Quasi-Likelihood F-tests (QL) (Wedderburn, 1974) alongside the novel Magnitude-Altitude Score (MAS) and Relaxed Magnitude-Altitude Score (RMAS) algorithms, we analyzed RNA-Seq data across 19,671 genes (Rezapour et al., 2024a). This approach enabled the identification of genes significantly altered by viral infection, with an emphasis on those important for initiating and sustaining the host’s immune response. The Gene Ontology (GO) analysis provided a detailed view of how IAV, MPV, and PIV3 impact vital biological processes and cellular components, and highlighted a strategic adaptation of cellular functions to support viral replication. Notably, the activation of interferon-stimulated genes such as IFIT1, IFIT2, IFIT3, and OAS1, along with alterations in cilium and mitochondrial ribosome assembly, underscored a robust antiviral response and the manipulation of host machinery to favor viral survival (Rezapour et al., 2024a).

In another study, we used the 3D airway OTE model but employed the NanoString platform instead of RNA-Seq for gene expression analysis, focusing on 773 specific genes (Rezapour et al., 2024b). This shift to NanoString technology was driven by its capability to provide precise and direct quantification of mRNA transcripts, which is important for accurately capturing the nuances of gene expression changes at predefined time points post-infection. We employed the Magnitude-Altitude Score (MAS) algorithm, an analytical tool designed to integrate biological significance with statistical rigor. The MAS algorithm not only identified genes with substantial fold changes but also applied the Benjamini-Hochberg correction to control for false discoveries, enhancing the reliability of our data (Rezapour et al., 2024b). This dual approach of using NanoString, coupled with the MAS algorithm, allowed us to unveil distinct patterns of gene expression in response to viral infections. At 24-h post-infection, a pronounced interferon-stimulated gene response was evident, particularly against IAV, showcasing an immediate robust antiviral defense. Meanwhile, MPV and PIV3 infections illustrated different aspects of the immune response, emphasizing the versatility of the host’s defensive strategies. By 72 h, the evolving gene expression profiles indicated an adaptation to ongoing viral presence, with shifts towards maintaining a balanced state of antiviral defense and cellular homeostasis.

We also conducted a comparative analysis of RNA-Seq and NanoString platforms to evaluate their efficacy in profiling gene expression within the same 3D airway OTE model under various viral infections at 24- and 72-h post-infection (Rezapour M. et al., 2024). The comprehensive evaluation employed a range of analytical techniques including Spearman correlation, Distance correlation, and Bland-Altman analysis, along with the use of Generalized Linear Models (GLMs), Huber regression, and the concordance analysis. Our findings highlighted a high degree of agreement between the platforms, particularly in their capacity to identify crucial antiviral defense genes such as ISG15, MX1, RSAD2, and the OAS family.

Building on our extensive research using our innovative 3D OTE model (Leach et al., 2023), this paper aims to further elucidate gene expression dynamics across different viral infections and post-infection time points using RNA-Seq and NanoString technologies. Unlike our previous studies (Rezapour et al., 2024a; Rezapour et al., 2024b; Rezapour M. et al., 2024), which employed these platforms for differential expression analysis and biomarker identification across entire datasets, the primary objective of this paper is to develop a novel predictive model tailored for multi-omics data with limited sample sizes, such as six samples per group.

The challenge with employing complex machine learning models in scenarios with small sample sizes is their tendency to overfit (Dietterich, 1995). Overfitting occurs because these models have a large number of parameters relative to the amount of available training data, which leads them to learn noise and anomalies in the data as if they were meaningful patterns. This not only prevents the model from generalizing to new datasets but often results in poor training performance as the model fails to converge on a stable solution. To address these issues, we introduce a novel filtering system that precedes the application of different classifiers such as Logistic Regression (LR) (Kleinbaum et al., 2002; Christodoulou et al., 2019), Support Vector Machine (SVM) (Pisner and Schnyer, 2020), Naive Bayes (Yang, 2018), Random Forest (RF) (Speiser et al., 2019), XGBoost (Chen and Guestrin, 2016), AdaBoost (Hastie et al., 2009), Gradient Boosting Machine (GB) (Friedman, 2001), Extremely Randomized Trees (ER) (Geurts et al., 2006) and k-Nearest Neighbors (kNN) (Guo et al., 2003). This system, namely Magnitude-Altitude Score Analysis for Tracking Infection and Time-Dependent Genes (MASIT) is designed to carefully select transcripts that are most indicative of different viral infection conditions and their respective post-infection times.

The MASIT strategically reduces the number of input features to focus on genes identified through rigorous statistical testing, which addresses the challenge of training models on comprehensive genomic datasets with limited sample sizes. By prioritizing statistically robust features, we reduce the risk of overfitting and enhance the reliability and interpretability of our predictive models. Here, we outline the key contributions of our study:

• Our study introduces the first supervised predictive model that leverages stringent differential expression analysis for gene selection. This model uniquely combines predictive functionality with statistical rigor and focuses on genes whose expression patterns are strongly associated with viral infection responses and treatment effects. By ensuring that the selected genes demonstrate both substantial effect sizes (fold changes) and statistical significance, we enhance the interpretability and applicability of our predictive outcomes.

• Addressing a common challenge in the field, our work pioneers a predictive modeling approach tailored for multi-omics datasets characterized by small sample sizes. Through the development of a novel pre-filtering system that precedes a classifier, our model effectively minimizes overfitting. This is important in scenarios where the number of parameters could easily overwhelm the amount of available data, thus preserving the integrity and reliability of the model’s predictions. Our approach sets a new standard for handling limited-sample datasets in biomedical research, offering a robust framework that can be adapted to various diseases and conditions.

• Uniquely, our novel filtering system, MASIT, is trained and validated on NanoString data and tested using a held-out RNA-Seq dataset, marking a first in the field. This cross-modal validation not only demonstrates the filtering system’s adaptability and robustness across different technological platforms but also underscores its accuracy and effectiveness in diverse experimental settings.

2 Materials and methods

In this study, we carefully prepared the virus infection medium and conducted a series of controlled infections on our 3D airway OTE model (Leach et al., 2023) with three different viruses: IAV, MPV, and PIV3. The infection process was standardized by involving the preparation of the infection medium, sterilization, and specific conditions for incubation and viral application to the OTEs. Following the infection, RNA was extracted using a protocol that ensured the removal of potential DNA contaminants, where it preserved the integrity of our samples for detailed molecular analysis. The extracted RNA underwent sequencing and analysis. We used RNA sequencing to construct cDNA libraries, which were then processed and sequenced on an Illumina^® NovaSeq 6000 System. Additionally, we applied the NanoString nCounter^® Analysis System for highly multiplexed detection of mRNA targets, which is especially suitable for samples where RNA integrity may vary. For data normalization and analysis, we used the Trimmed Mean of M-values (TMM) method for RNA-Seq and a two-step process involving Positive Control and CodeSet Content Normalization for NanoString data. These procedures, important for accurately dissecting the complex interactions within the OTE model, are described in full detail in (Rezapour M. et al., 2024).

Table 1 illustrates the eight groups of infected OTEs classified based on (1) virus and (2) post-infection time within both NanoString and RNA-Seq data. Our primary goal is to identify gene biomarkers using the NanoString data and subsequently validate these biomarkers within the held-out RNA-Seq data. This method allows us to directly assess whether the identified biomarkers from NanoString hold up under the broader transcriptomic scrutiny provided by RNA-Seq, without the intermediary step of training. The rationale for this approach stems from the technological complementarity of NanoString and RNA-Seq.

Table 1

Table 1. Eight infection conditions were classified based on (1) virus and (2) post-infection time within both NanoString and RNA-Seq data.

2.1 Magnitude-altitude score (MAS)

In the field of differential gene expression analysis, accurately identifying genes that signal specific conditions is very important. Traditional methods often focus solely on statistical significance (measured by adjusted p-values) or only on effect size (measured by fold change), potentially overlooking the importance of integrating both aspects. This may lead to the selection of genes that, despite showing large expression changes, lack statistical support, or vice versa. To address this limitation, we employed the Magnitude-Altitude Score (MAS) algorithm [see Algorithm 1 in (Rezapour et al., 2024b)] in our previous studies (Rezapour et al., 2024a; Rezapour et al., 2024b; Rezapour M. et al., 2024; Rezapour et al., 2024d). The MAS algorithm integrates effect size (captured by fold change) with statistical significance (adjusted p-values), which ensures that the identified genes are both statistically robust and strongly altered in expression. This dual consideration offers a more balanced and comprehensive perspective on gene expression. Notably, the MAS algorithm has already demonstrated superior performance in complex problems, such as identifying axillary lymph node metastasis in breast cancer (Rezapour et al., 2024d), outperforming existing state-of-the-art methods.

Initially, for data such as from NanoString, where assumptions of normality and equal variance are reasonable, MAS performs a two-sample t-test (α = 0.05) (Kim, 2015) between Baseline (mock-infected samples) and Treated (infected) conditions for each gene to detect differences in expression. The p-values from these tests are then adjusted using the Benjamini-Hochberg method (Benjamini et al., 1995) to control the false discovery rate. For genes that show significant differential expression, the algorithm calculates fold changes, which provides a measure of the magnitude of expression change associated with the treatment. The MAS assigns a score to significant genes using the formula:

{MAS}_{l} = |{(\log_{2} (FC}_{l}) |^{M}| {(\log_{10} (p}_{l}^{BH})) |^{A},

where $p_{l}^{B H}$ denotes BH adjusted p-values for l $= 1, 2, \dots, s$ , where $s$ is the number of rejected null hypotheses based on BH adjusted p-value ( $p_{l}^{B H} < α = 0.05)$ . The hyperparameters M and A, both set to 1 in this study, are chosen to balance the p-value and the log fold change. It then assigns a rank to each significant gene such that the gene with the maximum RMAS score is ranked 1. The final ranking of genes based on their MAS scores highlights those with the most substantial evidence of a treatment effect. The output lists these BH-significant genes along with their MAS ranks and scores, providing a prioritized view of the treatment’s impact.

The MAS algorithm can be concisely represented as a function/mapping, $MAS (B, T, M, A) = (G, R, Υ),$ where $B$ and $T$ denote Baseline and Treated gene expressions, respectively, and $M$ and $A$ are hyper-parameters, $G$ is the list of BH-significant genes, $R$ provides their MAS ranks and $Υ$ contains their MAS scores. In the MAS algorithm, the hyper-parameters $M$ and $A$ play key roles in the computation of the MAS score.

2.2 Magnitude-altitude score analysis for tracking infection and time-dependent genes (MASIT)

In this section, we introduce the Magnitude-Altitude Score Analysis for Tracking Infection and Time-Dependent Genes (MASIT) algorithm (Algorithm 1), designed to enhance our understanding of how gene expression evolves in response to infection over time. Unlike our previous studies (Rezapour et al., 2024a; Rezapour et al., 2024b; Rezapour M. et al., 2024) that primarily focused on identifying highly expressed genes across various conditions, MASIT is engineered to detect and analyze genes that can distinctly differentiate between multiple infection states and their corresponding post-infection timelines.

The primary objective of MASIT is to identify BH-significant genes capable of distinguishing various active infection conditions and their respective post-infection times. This involves assessing gene expressions from a baseline (mock) and multiple treated (infected) groups, each treated with a different virus, at specific post-infection time points. The unique aspect of this algorithm is its ability to not only differentiate each treated group from the baseline at each post-infection time but also to distinguish between each pair of treated groups across distinct time points.

For instance, in a scenario where we have multiple treated groups of OTEs, each actively infected with a distinct virus, and a baseline group (mock), the algorithm assesses the expression data at various post-infection times $T_{1}, T_{2}, . . ., T_{t}$ . Our initial and primary aim is to identify at least one gene, infection-dependent, within each post-infection time $T_{i}$ that can differentiate all treated groups from each other and from the baseline at that specific time $T_{i}$ . Subsequently, for predictive purposes, we seek time-dependent genes, which are used solely for tracking post-infection time points. Biologically, the time-dependent genes may or may not be directly related to the infection but are important for determining the post-infection time. It is important to note that the combination of infection-dependent and time-dependent genes provides a comprehensive picture of whether a sample is infected with a specific virus and at which time.

Algorithm 1

Algorithm 1. Magnitude-Altitude Score Analysis for Tracking Infection and Time-Dependent Genes (MASIT).

The final output of MASIT, a set of distinctive genes categorized under $G_{Selected}$ provides a robust basis for enhancing predictive models. By focusing on both the specificity of infection-related changes and their temporal dynamics, these features substantially enhance the predictive power of our models. Depending on the number of samples, we can adjust the number of selected genes. We adhere to the rule of thumb that suggests using one feature for every 10 samples to ensure the model does not overfit and remains generalizable (Ying, 2019). In our study, due to the limited number of samples, which is 48 with 6 per class, we only select one gene for each post infection time point and one for temporal classification during the predictive modeling.

MASIT operates under the assumption that the gene sets identified by MAS as common BH-significant among all treated groups and baseline conditions are non-empty. In scenarios where no common BH-significant genes are found, MASIT can adapt by relying on raw p-values to continue its analysis. Given the robust performance of MAS and MASIT, which are designed to identify the top gene demonstrating the highest statistical significance, we can further relax the significance level if raw p-values yielded non-empty sets.

Additionally, the degree of similarity between the conditions of the treated groups is expected to be sufficient to allow the identification of non-empty sets of significant genes under BH-correction, or at the very least, using raw p-values. In our study, the inclusion of three viral infection conditions (IAV, MPV, and PIV3) under NanoString data (see Sections 2.3, 2.4) allowed us to fully apply MASIT using the Benjamini-Hochberg method without encountering any empty sets.

2.3 Cross-modal capability of MASIT

To explore the cross-modal capability of MASIT across NanoString and RNA-Seq platforms, we initially apply MASIT to the NanoString data, without evaluating their predictive capabilities. From this analysis, we identify approximately 1% of the genes, specifically 8 key markers. These genes are classified into two categories: infection-dependent and time-dependent genes.

Infection-dependent genes are identified based on their significant expression changes under viral infection conditions. These genes are biologically meaningful as they play key roles in mediating the OTE’s response to viral infections and provide key insights into the biological interpretation of viral infection dynamics. Time-dependent genes, on the other hand, are selected for their ability to enhance classification accuracy by marking specific post-infection time points. While Time-dependent genes are important for improving the temporal classification of the samples, they do not necessarily hold significant biological relevance in terms of the infection process itself. Their inclusion is primarily aimed at refining the model’s ability to differentiate between various stages of infection based on gene expression patterns, thereby improving the overall accuracy of the classification system.

We proceed to test the effectiveness of these 8 selected genes in clustering RNA-Seq samples. Hierarchical clustering (Murtagh and Contreras, 2012) was performed using the Ward’s method (Saraçli et al., 2013) with Euclidean distance as the metric. This analysis aims to validate whether these genes can effectively group the samples based on similarities in their expression profiles across platforms, which confirms their discriminative power and relevance in broader genomic studies.

2.4 Cross-modal predictive modeling with MASIT

In this section, we employ the MASIT within a cross-modal predictive modeling framework to analyze gene expression dynamics influenced by viral infections at various post-infection times. Our methodology comprises two primary phases: training and validating of MASIT on NanoString data, followed by testing on the held-out RNA-Seq data.

Initially, we categorize the NanoString data according to different post-infection times, each consisting of baseline and various treated conditions. We undertake a training process using stratified K-fold cross-validation (Wong and YehPo-Yang, 2019) to ensure that our model robustly captures the nuances of the data while preventing overfitting. During each fold of the cross-validation, MASIT is applied solely to the training set to identify a selected group of genes. These genes serve as potential biomarkers and predictors to distinguish between different infection states and timings. These identified genes are subsequently employed to train several models such as logistic regression (Kleinbaum et al., 2002) using one-versus-one (ovo) and one-versus-all (ovr) strategies (Rocha and Goldenstein, 2013), Support Vector Machines (SVM) (Steinwart and Christmann, 2008) with a linear kernel (Cristianini and Shawe-Taylor, 2000), Naive Bayes (Yang, 2018), Random Forest (RF) (Speiser et al., 2019), XGBoost (Chen and Guestrin, 2016), AdaBoost (Hastie et al., 2009), Gradient Boosting Machine (GB) (Friedman, 2001), Extremely Randomized Trees (ER) (Geurts et al., 2006) and k-Nearest Neighbors (kNN) (Guo et al., 2003).

To minimize overfitting, for models such as Random Forest, XGBoost, AdaBoost, Extremely Randomized Trees, and Gradient Boosting, we not only limit the number of genes to three MASIT-chosen genes but also use a limited number of trees and a restricted maximum depth. For the AdaBoost model, although it typically uses decision stumps (trees with a maximum depth of 1) by default, we use decision trees with varying maximum depths to enhance the model’s capacity to capture more complex patterns in the data.

After optimizing and validating our model with NanoString data, we used the MASIT-selected genes as input features for classifiers from the training-validation stage to test them on the held-out RNA-Seq data using K-fold cross-validation. The goal of using cross-validation on RNA-Seq data is to fine-tune the classifiers' parameters to accommodate the different scales of NanoString and RNA-Seq data.

Note that we could standardize the scales of both NanoString and RNA-Seq data to directly apply the trained NanoString model to RNA-Seq. However, we have chosen to retain the original scales to ensure that the model can accurately identify the virus and post-infection times in new samples from either platform without relying on any transformations. Figure 1 illustrates our structured approach.

Figure 1

This figure illustrates the two-phase process of training and validating of on NanoString data followed by testing on the held-out RNA-Seq data. It highlights the categorization of data by post-infection times, the application of MASIT for gene selection during training, and the use of optimal linear classifiers for cross-validation and final testing to evaluate the translational efficacy of predictive insights across different genomic technologies.

Figure 1. This figure illustrates the two-phase process of training and validating of on NanoString data followed by testing on the held-out RNA-Seq data. It highlights the categorization of data by post-infection times, the application of MASIT for gene selection during training, and the use of optimal linear classifiers for cross-validation and final testing to evaluate the translational efficacy of predictive insights across different genomic technologies.

For specifying hyperparameters of tree-based models, we employed two complementary strategies. Because of the limited number of samples, we first constrained model capacity by fixing the number of estimators and maximum depth to the lowest feasible values. This conservative approach reduces the risk of overfitting and ensures stability when data availability is highly restricted. However, fixing hyperparameters a priori may also prevent models from capturing informative patterns. To address this limitation, we also implemented a nested cross validation procedure. In this framework, the dataset was first split into six outer folds using stratified partitioning to preserve the class distribution. For each outer iteration, five folds were used for training and one was held out for outer testing. Within each outer training set, a second level of five fold inner cross validation was performed, where a grid search was used to evaluate all combinations of the number of estimators and maximum depth. For every parameter combination, inner training accuracy was defined as the mean performance of the model on the training portion of the inner folds, and inner validation accuracy as the mean score on the corresponding validation folds.

The combination of parameters that maximized the inner validation accuracy was then selected as the “inner optimal” configuration. This configuration was subsequently refitted on the entire outer training set and evaluated on the outer test fold to yield the outer validation accuracy, which serves as an unbiased estimate of generalization. The process was repeated across all outer folds, and mean and standard deviation values for training, inner validation, and outer validation accuracies were computed to summarize central tendency and variability.

This design allowed us to systematically evaluate model capacity and identify optimal configurations for Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost classifiers. While the fixed minimal-capacity approach is more appropriate in small-sample contexts, nested cross-validation provides a more flexible and rigorous framework for hyperparameter optimization. By combining both approaches, we ensured that our findings were not dependent on arbitrary parameter choices, while maintaining methodological robustness under limited data conditions.

Finally, to evaluate the practical advantages of MASIT, we conducted a systematic benchmarking analysis against widely adopted feature selection strategies. Feature selection methods are generally categorized as filtering, wrapper, embedded, or hybrid approaches (Chandrashekar and Sahin, 2014). In this study, we selected one representative method from each category: Fisher score (filtering) (Sun et al., 2021), minimum Redundancy Maximum Relevance (mRMR, wrapper) (Peng et al., 2005), embedded Lasso regression (Tibshirani, 1996), and Boruta feature importance (hybrid) (Kursa and Rudnicki, 2010). These methods have been extensively applied in gene expression studies and thus provide a relevant basis for comparison. In this comparative framework, we replaced the MAS component of MASIT with each of these methods while keeping the overall structure identical. For all benchmarking experiments, we employed Random Forest as the common downstream classifier. Random Forest was chosen because it is widely used in transcriptomic analyses and handles high-dimensional data effectively. Using a single classifier ensured that observed differences in performance could be attributed to the feature selection strategies themselves rather than classifier-specific effects.

3 Results

3.1 Cross-modal capability of MASIT

In this section, we aim to apply the MASIT algorithm across the entire NanoString dataset, which includes expression data for 773 genes from OTEs infected with IAV, MPV, and PIV3 at two post-infection times, $T_{1} = 24$ hours and $T_{2} = 72$ hours. As previously outlined in Section 2.3, our goal is to identify a select group of transcripts, specifically, 8 genes, representing 1% of the total genes within NanoString. These genes will be used to perform hierarchical clustering within the RNA-Seq data. To do so, we identify three top infection genes at time $T_{1} = 24$ , and three infection-dependent genes at time $T_{2} = 72$ , and two top time-dependent genes for tracking post infection time based on MASIT.

It turns out that the top three selected infection-dependent genes at time 24 are IFIT1, IFIT2, and IFIT3, and at time 72, they are IFI44, OAS3, and OASL. Additionally, MASIT was applied to identify time-dependent genes across the entire NanoString dataset, resulting in the selection of IL33 and CCL20 as the top time-dependent genes. Following this selection process, which was conducted exclusively on the NanoString data and without prior exposure to the RNA-Seq data (comprising 19,671 genes), Figure 2 illustrates the hierarchical clustering of all infected OTEs within the RNA-Seq dataset using these eight specifically selected genes.

Figure 2

This figure illustrates the hierarchical clustering of all infected OTEs within the RNA-Seq dataset, using the eight genes identified through the MASIT algorithm applied exclusively to the NanoString dataset. The genes include six infection-dependent genes (IFIT1, IFIT2, IFIT3, OASL, OAS3 and IFI44) marked with blue in the annotation, and two time-dependent genes (IL33, CCL20) highlighted in red.

Figure 2. This figure illustrates the hierarchical clustering of all infected OTEs within the RNA-Seq dataset, using the eight genes identified through the MASIT algorithm applied exclusively to the NanoString dataset. The genes include six infection-dependent genes (IFIT1, IFIT2, IFIT3, OASL, OAS3 and IFI44) marked with blue in the annotation, and two time-dependent genes (IL33, CCL20) highlighted in red.

3.2 Cross-modal predictive modeling with MASIT

In this section, we apply MASIT to NanoString data as described in Section 2.4. Our approach uses a 6-fold stratified cross-validation to train and validate models that simultaneously classify infections by multiple viruses: IAV, MPV, and PIV3. These models determine whether the OTEs are infected and, if so, identify the specific virus and ascertain the post-infection times of 24 and 72 h.

We refer to our models as MASIT-model, reflecting the integration of the MASIT process with various classification models. Each model begins by applying MASIT to the training set to identify the most statistically significant infection-dependent genes, as well as time-dependent genes. These genes are then employed as predictors (input features) for predictive models, including Logistic Regression in both one-vs-one (OVO) and one-vs-rest (OVR) configurations, and for SVM with OVO and OVR strategies using linear kernels. Additional models used include Support Vector Machines (SVM) with a linear kernel, Naive Bayes, Random Forest (RF), XGBoost, AdaBoost, Gradient Boosting Machine (GB), Extremely Randomized Trees (ER), and k-Nearest Neighbors (kNN). The efficacy of these models is evaluated on the validation set using only the genes selected during the training phase.

To ensure maximum generalizability and robustness of our models, we incorporated bootstrapping (Kohavi, 1995) into our cross-validation process. This involves randomly resampling the training dataset with replacement to create multiple bootstrap datasets. For each bootstrap sample, the MASIT process identifies key genes, and models are subsequently trained. This repetition helps stabilize the predictive power of the models by mitigating overfitting and providing a more reliable estimate of model performance.

Figure 3a illustrates the infection and time-dependent genes for each fold within the 6-fold stratified cross-validation strategy, where we identified IFIT1 and CXCL10 as top infection-dependent genes for 24- and 72-h post-infection, respectively, and IL33 as the top time-dependent gene. In Figure 3b, the effectiveness of models such as SVM, LR, and three variants of tree-based algorithms using MASIT-selected genes as input features are demonstrated using the NanoString dataset.

Figure 3

(a) Illustrates the infection- and time-dependent genes within Nanostring data for each fold within the 6-fold stratified cross-validation strategy, where we identified IFIT1 and CXCL10 as top infection-dependent genes at 24- and 72-h post-infection, respectively, and IL33 as the top timedependent gene. (b,c) illustrate the performance accuracies of various models using MASIT-selected genes (IFIT1, CXCL10, and IL33) on both NanoString (b) and RNA-Seq (c) datasets. The models evaluated include SVM, LR, Random Forest with 5 trees and a maximum depth of 4 (MASIT-RF-5-4), XGBoost with 3 trees and a maximum depth of 4 (MASIT-XGBoost-3-4), and AdaBoost with 3 trees and a maximum depth of 4 (MASIT-AdaBoost-3-4).

Figure 3. (a) Illustrates the infection- and time-dependent genes within Nanostring data for each fold within the 6-fold stratified cross-validation strategy, where we identified IFIT1 and CXCL10 as top infection-dependent genes at 24- and 72-h post-infection, respectively, and IL33 as the top time-dependent gene. (b,c) illustrate the performance accuracies of various models using MASIT-selected genes (IFIT1, CXCL10, and IL33) on both NanoString (b) and RNA-Seq (c) datasets. The models evaluated include SVM, LR, Random Forest with 5 trees and a maximum depth of 4 (MASIT-RF-5-4), XGBoost with 3 trees and a maximum depth of 4 (MASIT-XGBoost-3-4), and AdaBoost with 3 trees and a maximum depth of 4 (MASIT-AdaBoost-3-4).

Note that in our previous studies (Rezapour et al., 2024a; Rezapour et al., 2024b; Rezapour M. et al., 2024), we have comprehensively discussed the set of biomarkers for each virus. However, our goal here is to identify the top infection and time-dependent genes that can effectively identify the infection and time status even for unseen data. Table 2 shows the training and validation accuracies of the 6-fold cross-validation for various classifiers using the MASIT-selected genes within on NanoString data. The models evaluated include Random Forest (RF), XGBoost, AdaBoost (AB), Gradient Boosting (GB), and Extremely Randomized Trees (ET) with varying numbers of trees and maximum depths, as well as Support Vector Machines (SVM) with both one-vs-one (OVO) and one-vs-rest (OVR) configurations, Logistic Regression (LR) with OVO and OVR, Naive Bayes, and k-Nearest Neighbors (kNN). Each model’s performance is reported as the mean ± standard deviation for both training and validation sets.

Table 2

Table 2. This table presents the training and validation accuracies obtained from 6-fold cross-validation of different classifiers using the genes selected by the MASIT algorithm within each fold of the NanoString dataset.

Using the most common MASIT-selected genes within all folds, IFIT1 (infection dependent), CXCL10 (infection dependent), and IL33 (time dependent), we applied SVM (ovo and ovr), LR (ovo and ovr), Random Forest with 5 trees and a maximum depth of 4 (MASIT-RF-5-4), XGBoost with 3 trees and a maximum depth of 4 (MASIT-XGBoost-3-4), and AdaBoost with 3 trees and a maximum depth of 4 (MASIT-AdaBoost-3-4) on the held-out RNA-Seq samples using the same 6-fold cross-validation to compare the results of classification within RNA-Seq and NanoString using the MASIT-selected genes from NanoString. To further evaluate the model’s robustness, we applied a similar bootstrapping strategy to the RNA-Seq dataset. Figure 3c shows the performance of the models using the MASIT-selected genes from NanoString data on the held-out RNA-Seq dataset, which illustrates the cross-platform validation of predictive capabilities using these key genes.

To illustrate the efficiency of MASIT-selected genes in classification performance, we also used the entire gene set within RNA-Seq (comprising 19,671 genes) to train and validate the performance of these models. Figure 4 illustrates the validation accuracies for both strategies: using MASIT-selected genes (only three genes, which helps avoid overfitting) and the entire gene set for all 6 folds.

Figure 4

This figure compares the validation accuracies of various models trained using MASIT-selected genes (IFIT1, CXCL10, and IL33) and the entire gene set (19,671 genes) within RNA-Seq data. The models evaluated include SVM with OVO and OVR configurations, Logistic Regression with OVO and OVR configurations, Random Forest with 5 trees and a maximum depth of 4 (RF-5-4), XGBoost with 3 trees and a maximum depth of 4 (XGBoost-3-4), and AdaBoost with 3 trees and a maximum depth of 4 (AdaBoost-3-4).

Figure 4. This figure compares the validation accuracies of various models trained using MASIT-selected genes (IFIT1, CXCL10, and IL33) and the entire gene set (19,671 genes) within RNA-Seq data. The models evaluated include SVM with OVO and OVR configurations, Logistic Regression with OVO and OVR configurations, Random Forest with 5 trees and a maximum depth of 4 (RF-5-4), XGBoost with 3 trees and a maximum depth of 4 (XGBoost-3-4), and AdaBoost with 3 trees and a maximum depth of 4 (AdaBoost-3-4).

To compare the expression of these three genes, IFIT1, CXCL10, and IL33 between NanoString and RNA-Seq, Figure 5 provides a 3D visualization of all samples using IFIT1, CXCL10, and IL33 as the coordinate axes. The left panels are for NanoString and the right panels are for RNA-Seq, where the top panels visualize all samples, and the bottom panels specifically show only IAV-infected samples and Mock samples as examples.

Figure 5

This figure provides a 3D visualization comparing the log2-transformed expression of IFIT1, CXCL10, and IL33 between NanoString and RNA-Seq datasets. The left panels represent the NanoString data, while the right panels represent the RNA-Seq data. The top panels show all samples, and the bottom panels focus specifically on IAV-infected samples and Mock samples as examples. Each axis corresponds to the log2-transformed expression levels of one of the three genes.

Figure 5. This figure provides a 3D visualization comparing the log2-transformed expression of IFIT1, CXCL10, and IL33 between NanoString and RNA-Seq datasets. The left panels represent the NanoString data, while the right panels represent the RNA-Seq data. The top panels show all samples, and the bottom panels focus specifically on IAV-infected samples and Mock samples as examples. Each axis corresponds to the log2-transformed expression levels of one of the three genes.

To assess how the performance of tree-based ensemble models changes with model capacity, we performed nested cross-validation over a grid of the number of estimators and maximum tree depth. Figure 6a shows the resulting inner-CV mean-accuracy surface for the Random Forest model. The black cross marks the optimal configuration, which was identified as six estimators (trees) with a maximum depth of five, based on the highest inner validation mean accuracy. In this framework, “training accuracy” refers to the average score obtained on the inner training folds during the grid search, “inner validation accuracy” refers to the average score on the held out inner folds used for tuning, and “outer validation accuracy” is computed by refitting the model with the inner optimal parameters on each outer training set and scoring on the corresponding outer test set, which provides. For each metric, the reported mean is the arithmetic average across folds, and the standard deviation quantifies variability across folds.

Figure 6

(a) Nested cross validation surface of Random Forest performance within the NanoString dataset across a grid of the number of estimators and maximum depth. The color surface represents the mean inner validation accuracy, averaged across outer folds. The black cross denotes the optimal configuration, identified as six estimators with a maximum depth of five, based on the highest inner validation mean accuracy. Reported metrics include mean ± standard deviation for training accuracy (averaged across inner training folds), inner validation accuracy (averaged across inner validation folds), and outer validation accuracy (averaged across outer folds at the optimal configuration). (b) Comparative results of nested cross validation for treebased ensemble models, including Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost, within the NanoString dataset. For each model, the optimal hyperparameters were selected by the inner grid search. Bars show mean accuracy ± standard deviation for training, inner validation, and outer validation performance at these optimal configurations. (c) Performance of the same models on RNA-Seq data using MASIT-selected genes from the NanoString dataset. For this analysis, hyperparameters were fixed to the optimal values previously identified within NanoString (panel b), and models were evaluated using 6-fold stratified cross validation.

Figure 6. (a) Nested cross validation surface of Random Forest performance within the NanoString dataset across a grid of the number of estimators and maximum depth. The color surface represents the mean inner validation accuracy, averaged across outer folds. The black cross denotes the optimal configuration, identified as six estimators with a maximum depth of five, based on the highest inner validation mean accuracy. Reported metrics include mean $\pm$ standard deviation for training accuracy (averaged across inner training folds), inner validation accuracy (averaged across inner validation folds), and outer validation accuracy (averaged across outer folds at the optimal configuration). (b) Comparative results of nested cross validation for tree-based ensemble models, including Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost, within the NanoString dataset. For each model, the optimal hyperparameters were selected by the inner grid search. Bars show mean accuracy $\pm$ standard deviation for training, inner validation, and outer validation performance at these optimal configurations. (c) Performance of the same models on RNA-Seq data using MASIT-selected genes from the NanoString dataset. For this analysis, hyperparameters were fixed to the optimal values previously identified within NanoString (panel b), and models were evaluated using 6-fold stratified cross validation.

We then applied the same nested cross validation procedure to other tree-based ensemble models, including Extra Trees, Gradient Boosting, AdaBoost, and XGBoost. For each model, the inner grid search identified the hyperparameter configuration that maximized mean inner validation accuracy, after which performance was summarized across training folds, inner validation folds, and outer validation folds. Figure 6b presents a comparative overview of these models, which shows mean accuracy and standard deviation for training, inner validation, and outer validation under their respective optimal configurations.

Finally, to assess how the NanoString-derived optimal configurations generalize to RNA-Seq data, we applied the same models using the hyperparameters selected from the NanoString nested cross validation. In this case, we did not repeat nested cross validation for the RNA-Seq dataset. Instead, we directly evaluated each model using 6-fold stratified cross validation on the RNA-Seq data, restricted to the same MASIT-selected genes and the optimal configuration for the model learned from NanoString. Figure 6c shows these results, which provide a direct comparison of model performance when transferring NanoString-optimized hyperparameters and gene features to RNA-Seq data.

Finally, to directly compare MASIT with existing feature selection approaches, we replaced the MAS component of MASIT with Fisher score, mRMR, embedded Lasso, or Boruta, while keeping the remainder of the framework identical. To minimize overfitting in this small-sample setting, each method was constrained to select only three genes per fold within the six-fold stratified cross-validation on NanoString data, and Random Forest classifiers were tuned via nested cross-validation using the number of trees and maximum depth.

Figure 7a summarizes the optimal configurations and performance for each method. Within NanoString data, MASIT and mRMR achieved the highest outer validation accuracies (0.90 $\pm$ 0.10 and 0.88 $\pm$ 0.11, respectively), whereas Lasso, Fisher, and Boruta produced substantially lower accuracies (ranging from 0.52 to 0.71). To further assess cross-platform robustness, we selected the two best-performing feature selection strategies from the NanoString experiments, MASIT and mRMR, and applied them directly to the held-out RNA-Seq dataset. Using the same Random Forest classifier with the optimal hyperparameters identified in the NanoString nested cross-validation (tree depth = 5, number of estimators = 5), we evaluated whether the selected genes could generalize to the broader transcriptome. As shown in Figure 7b, MASIT achieved a considerably higher validation accuracy (0.87 $\pm$ 0.12) compared to mRMR (0.67 $\pm$ 0.12), while maintaining similar training performance. These findings highlight MASIT’s ability not only to match state-of-the-art methods within NanoString but also to transfer effectively across platforms.

Figure 7

(a) Nested cross-validation results within NanoString data using Random Forest classifiers with tuned hyperparameters (tree depth and number of estimators). MASIT and mRMR achieved the highest outer validation accuracies (0.90 ± 0.10 and 0.88 ± 0.11, respectively), while Fisher, Lasso, and Boruta yielded lower accuracies ranging from 0.52 to 0.71. Bars show mean ± standard deviation across folds for training, inner validation, and outer validation accuracies. (b) Cross-platform evaluation of MASIT and mRMR on the held-out RNA-Seq dataset. Using the optimal Random Forest configuration identified from NanoString (tree depth = 5, estimators = 5), genes selected by each method were directly applied to RNA-Seq. MASIT achieved a substantially higher validation accuracy (0.87 ± 0.12) compared to mRMR (0.67 ± 0.12), while both methods maintained comparable training performance.

Figure 7. (a) Nested cross-validation results within NanoString data using Random Forest classifiers with tuned hyperparameters (tree depth and number of estimators). MASIT and mRMR achieved the highest outer validation accuracies (0.90 $\pm$ 0.10 and 0.88 $\pm$ 0.11, respectively), while Fisher, Lasso, and Boruta yielded lower accuracies ranging from 0.52 to 0.71. Bars show mean $\pm$ standard deviation across folds for training, inner validation, and outer validation accuracies. (b) Cross-platform evaluation of MASIT and mRMR on the held-out RNA-Seq dataset. Using the optimal Random Forest configuration identified from NanoString (tree depth = 5, estimators = 5), genes selected by each method were directly applied to RNA-Seq. MASIT achieved a substantially higher validation accuracy (0.87 $\pm$ 0.12) compared to mRMR (0.67 $\pm$ 0.12), while both methods maintained comparable training performance.

4 Discussion

4.1 Cross-modal capability of MASIT

The robustness of the MASIT in selecting statistically significant genes is exemplified by its success in identifying key markers for viral infection states and their progression over time. The MASIT selected genes, IFIT1, IFIT2, IFIT3, OASL, IFI44, OAS3, IL33, and CCL20 from NanoString data, not only effectively categorize RNA-Seq samples by viral type and infection stage as shown in Figure 2 but also offer insights into the host/OTE’s molecular response to viral infections.

The IFIT family, comprising IFIT1, and IFIT3, plays a key role in the antiviral defense, particularly against RNA viruses such as IAV, MPV, and PIV3. These proteins, by binding to viral RNA with a triphosphate group (PPP-RNA), prevent viral replication and translation, and showcase a specific defense against viruses that utilize PPP-RNA structures, such as IAV (Pichlmair et al., 2011). Moreover, the interaction of IFIT proteins with the 5′ cap of viral RNA blocks the translation of viral proteins is important for regulating the host’s immune response via the JAK-STAT signaling pathway (Zhou et al., 2013). This mechanism not only underscores the adaptability of IFIT proteins in controlling viral infections but also enhances our understanding of their broader role in immune regulation. Fensterl et al. (Fensterl and Sen, 2015) contributed additional insights into the non-enzymatic antiviral mechanisms of IFITs, highlighting how these proteins inhibit virus replication by binding and sequestering RNA molecules from viruses and modulating protein interactions.

OASL and OAS3 are significant for their roles in the immune response to viral infections. OAS3’s activation by double-stranded RNA (dsRNA), typical of viral replication intermediates, and its subsequent induction of RNase L lead to the degradation of viral RNA, effectively stopping virus replication (Choi et al., 2015). This role is important in combating viruses like IAV. OASL, on the other hand, enhances the RIG-I signaling pathway that is important during the early stages of viral infections, and its ability to modulate RIG-I sensitivity significantly impacts the immune response to viral infection (Melchjorsen et al., 2009).

IFI44, highlighted by its upregulation in conditions such as respiratory syncytial virus (RSV) infections, plays a role in the formation of microtubular structures and pathways activated by IFN-α stimulation. This positions IFI44 as a potential therapeutic target and a biomarker for RSV infections, providing insights into its dual role in enhancing and suppressing immune activity (Li et al., 2021; DeDiego et al., 2019).

IL33 and CCL20, selected as time-dependent genes, reveal the MASIT’s ability to recognize genes whose expression patterns are indicative of specific physiological or immunological events. IL33, for instance, is important in type-2 immune activation and plays a significant role in allergic lung inflammation, which aligns with its temporal expression patterns (Hardman et al., 2013). CCL20’s upregulation in response to inflammatory stimuli further underscores the model’s utility in capturing key temporal dynamics related to immune responses (Schutyser et al., 2003).

The clustering in Figure 2 validates the MASIT’s effectiveness in identifying relevant biomarkers that maintain consistency across NanoString and RNA-Seq platforms. This versatility in gene expression profiling technologies demonstrates the model’s utility in providing comprehensive insights into the OTE (host)’s molecular response to infections. The ability to identify and validate these key biomarkers across platforms not only enhances our understanding of viral pathogenesis but also strengthens the MASIT framework’s application in future infectious disease research and therapeutic development.

4.2 Cross-modal predictive modeling with MASIT

Table 2 illustrates the performance of several machine learning classifiers on NanoString data, using only three MASIT-selected genes from each fold’s training set. To better understand MASIT’s efficiency, we selected a range of models, including both linear and some tree-based ensemble models that demonstrated strong performance during validation. These models include MASIT-SVM (OVO and OVR) with a linear kernel, MASIT-LR (OVO and OVR), Random Forest with five trees and a maximum depth of four (MASIT-RF-5-4), XGBoost with three trees and a maximum depth of four (MASIT-XGBoost-3-4), and AdaBoost with three trees and a maximum depth of four (MASIT-AdaBoost-3-4). We compared the results of 8-class classification on the RNA-Seq and NanoString data, which are visualized in Figure 3.

Using only three MASIT-selected genes as predictors in classifiers, particularly for linear models such as SVM and Logistic Regression, as well as tree-based ensemble models with few estimators (trees), significantly reduced the risk of overfitting. This limitation in the models’ complexity, combined with a small number of predictors, inherently lowers the variance and prevents the models from learning noise and idiosyncrasies in the data. This focused approach not only enhances model generalizability but also maintains robustness in prediction accuracy. The alignment of training and validation performances for models utilizing MASIT-selected genes, as illustrated in Figure 3, validates the efficacy of MASIT in identifying the most informative and generalizable features from complex genomic data.

Figure 4 showcases the validation accuracies for various classifiers using either three MASIT-selected genes or all genes within the RNA-Seq dataset, which highlights a clear contrast in performance between more focused and extensive feature sets. The box plots illustrate a trend where models using just three MASIT selected genes generally achieve higher median validation accuracies compared to those using all genes. This difference is particularly notable in complex models such as Random Forest, XGBoost, and AdaBoost, where the performance with all genes markedly decreases, which signals potential overfitting when the feature space is too large. In contrast, the simpler, linear models (SVM and Logistic Regression) demonstrate less variation in performance between the two feature sets, though they still benefit from the reduced complexity afforded by using only the selected genes.

Additionally, the use of only three MASIT-selected genes leads to more stable and consistent validation performances across all models. This suggests that the MASIT selection process effectively captures the most predictive features, thereby enhancing the models’ ability to generalize without being misled by the noise and redundancy often present in larger datasets.

Moreover, the significant disparity in performance between the full gene set and the MASIT-selected gene set, highlighted in Figure 4, underscores the tendency for complex models to overfit when equipped with unrestricted gene sets. Notably, when all genes were used in RNA-Seq, overfitting occurred across all models because we observed 100% accuracies in training, while the validation performances did not match those of training, indicating a significant generalization gap. This demonstrates the risk associated with using a large number of predictors and the benefits of a more targeted approach. Therefore, MASIT’s method of selecting a concise, impactful set of genes is a key strategy in genomic studies that facilitate the development of robust, accurate, and generalizable models, and thus establishing itself as an invaluable tool in the field of genomics.

Figure 5 demonstrates that both NanoString and RNA-Seq platforms provide consistent and comparable expression data for IFIT1, CXCL10, and IL33. The 3D visualization emphasizes the ability of these genes to serve as biomarkers for infection status and progression. This comparative analysis underscores the robustness of using these genes for studying infection dynamics and highlights the effectiveness of both NanoString and RNA-Seq technologies in capturing gene expression profiles.

These findings, supported by the comparative analyses shown in Figures 3–5, suggest that employing MASIT for feature selection should be considered a best practice in genomic model training, especially when addressing high-dimensional data where overfitting poses a significant risk. This approach does not merely simplify the training process but also ensures the relevance and reliability of the models in real-world applications.

The comparative evaluation of tree-based ensemble classifiers in Figure 6 highlights that Random Forest, XGBoost, and AdaBoost consistently achieved strong validation performance when applied to MASIT-selected features. This indicates that MASIT enhances predictive accuracy across different ensemble modeling frameworks, which demonstrates that its utility is not confined to a single algorithm.

Figure 7 provides a structured comparison between MASIT and widely used feature selection approaches evaluated under identical modeling conditions. Within NanoString using nested cross validation, MASIT and mRMR achieved the highest outer validation accuracies, which were 0.90 $\pm$ 0.10 and 0.88 $\pm$ 0.11, respectively, whereas Fisher score, embedded Lasso, and Boruta produced substantially lower accuracies ranging from 0.52 to 0.71. The gap between training and outer validation for Fisher, Lasso, and Boruta indicates limited generalization.

A key distinction emerges in the cross platform experiment. When we transferred the feature subsets and the Random Forest configuration selected within NanoString to the held out RNA Seq dataset, MASIT maintained high validation accuracy of 0.87 $\pm$ 0.12, whereas mRMR dropped to 0.67 $\pm$ 0.12 despite similar training performance. This divergence suggests that MASIT’s selection strategy, which integrates effect size with multiple testing control and then enforces agreement across infection conditions and time points, yields features that are more stable to platform specific measurement differences.

These results support two practical advantages. First, MASIT matches or exceeds state of the art methods within NanoString while selecting a very small and interpretable set of genes. Second, MASIT generalizes across platforms, which directly addresses a common source of failure for many feature selection pipelines that exhibit platform specific bias. Together, these findings clarify MASIT’s intended contribution: it is a statistically driven pre filter that emphasizes reproducible effect under stringent error control, which improves external validity while keeping model capacity low.

4.3 Limitations of the study

This study has several limitations: First, the small number of biological replicates limits statistical power and may constrain the generalizability of the findings, a challenge common in multi-omics studies. Second, although MASIT reduces overfitting by prioritizing informative features, some residual risk of overfitting cannot be fully eliminated, particularly when applying complex classifiers. Third, while MASIT successfully transferred between NanoString and RNA-Seq platforms, broader cross-platform applications (e.g., to proteomic or metabolomic datasets) will require further validation and potential scaling adjustments. Finally, although the RNA-Seq dataset was held out for testing, its limited size may restrict the strength of claims regarding independence and reproducibility. These considerations highlight areas for future work to further validate and extend the MASIT framework.

5 Conclusion

In our study, we tackled the challenge of developing robust predictive models from multi-omics data, contending with the dual constraints of small sample sizes and expansive feature spaces. We introduced the novel Magnitude-Altitude Score Analysis for Tracking Infection and Time-Dependent Genes (MASIT) methodology to enhance the accuracy and generalization capabilities of our models. MASIT adeptly sifts through gene, transcript, or protein features, pinpointing those important for understanding the dynamic responses in the OTE model. Using MASIT, we explored gene expression changes in OTEs following exposure to Influenza A virus (IAV), Human metapneumovirus (MPV), and Parainfluenza virus type 3 (PIV3) at 24- and 72-h post-infection. This approach leveraged the complementary strengths of RNA-Seq and NanoString technologies: RNA-Seq provided a broad transcriptomic overview of 19,671 genes, while NanoString offered precise quantification of 773 targeted genes, enabling a focused and comprehensive analysis.

Our methodology incorporated a K-Fold Stratified Cross-validation within MASIT to systematically select the most informative genes, each marking distinct infection types and stages, and others serving as temporal markers between time points. This selective feature strategy significantly mitigated the risk of overfitting, a prevalent issue in models trained on large datasets with limited samples. The predictive models, enhanced by MASIT, employed a concise set of genes chosen through this rigorous process. These models exhibited superior performance and generalizability compared to those utilizing the full gene set, particularly in complex models like Random Forest, XGBoost, and AdaBoost. Key markers identified by MASIT, including IFIT1, IFIT2, IFIT3, OASL, IFI44, and OAS3, were instrumental in effectively categorizing both NanoString and RNA-Seq samples by viral type and infection stage. MASIT’s robust cross-modal capability extends to predictive modeling. By strategically selecting a small number of highly informative genes, MASIT reduces the common risk of overfitting in genomic studies with extensive datasets. This focused approach not only enhances the models’ generalizability but also maintains prediction accuracy across different data modalities. The alignment of training and validation performances further underscores MASIT’s efficiency in feature selection, and affirms its efficacy in maintaining robust predictive accuracy across different platforms.

A comparative analysis between MASIT selected genes and the full gene set underscores a marked contrast in performance, especially in complex models such as Random Forest, XGBoost, and AdaBoost. Using a concise set of MASIT selected genes leads to higher validation accuracies and more consistent performance across models, which highlights the effectiveness of MASIT in capturing the most predictive features essential for robust modeling. This strategy not only simplifies the training process but also ensures the relevance and reliability of the models, and establishes MASIT as an invaluable tool in genomics. The ability of MASIT to consistently identify and validate key biomarkers across NanoString and RNA-Seq platforms enhances our understanding of viral pathogenesis and strengthens the framework’s application in infectious disease research and therapeutic development. This underscores the important role of MASIT in advancing genomic research and its potential to significantly influence the development of diagnostic and therapeutic strategies in viral infections.

Benchmarking against widely used feature selection approaches, including Fisher score (filtering), mRMR (wrapper), Lasso (embedded), and Boruta (hybrid), further confirmed MASIT’s advantages: it not only exceeded their performance within NanoString data but also uniquely maintained high accuracy and reproducibility on held-out RNA-Seq data. These results underscore MASIT’s superior robustness, reproducibility, and cross-platform generalizability.

Data availability statement

All data supporting the findings of this study are available from the corresponding author upon reasonable request.

Author contributions

MR: Formal Analysis, Validation, Methodology, Conceptualization, Writing – review and editing, Data curation, Software, Investigation, Writing – original draft, Visualization. PM: Conceptualization, Validation, Formal Analysis, Data curation, Supervision, Writing – review and editing, Methodology, Investigation, Writing – original draft, Resources. DO: Writing – original draft, Conceptualization, Methodology, Writing – review and editing, Data curation, Validation, Investigation. SW: Data curation, Methodology, Validation, Conceptualization, Supervision, Writing – original draft, Writing – review and editing, Investigation. SM: Validation, Methodology, Writing – review and editing, Investigation, Conceptualization, Writing – original draft, Supervision. AA: Project administration, Validation, Writing – review and editing, Methodology, Supervision, Conceptualization, Resources, Funding acquisition. MG: Writing – original draft, Investigation, Conceptualization, Resources, Writing – review and editing, Methodology, Formal Analysis, Validation, Data curation, Supervision.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. Effort sponsored by the U.S. Government under Other Transaction number W15QKN-16-9-1002 between the MCDC, and the Government. The US Government is authorized to reproduce and distribute reprints for Governmental purposes, notwithstanding any copyright notation thereon.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Author disclaimer

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.

References

Barros, A. S., Costa, A., and Sarmento, B. (2021). Building three-dimensional lung models for studying pharmacokinetics of inhaled drugs. Adv. drug Deliv. Rev. 170, 386–395. doi:10.1016/j.addr.2020.09.008

PubMed Abstract | CrossRef Full Text | Google Scholar

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57 (1), 289–300. doi:10.1111/j.2517-6161.1995.tb02031.x

CrossRef Full Text | Google Scholar

Chandrashekar, G., and Sahin, F. (2014). A survey on feature selection methods. Comput. and Electr. Eng. 40 (1), 16–28. doi:10.1016/j.compeleceng.2013.11.024

CrossRef Full Text | Google Scholar

Chen, T., and Guestrin, C. (2016). “Xgboost: a scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794.

Google Scholar

Choi, U.Y., Kang, J-S., Hwang, Y. S., and Kim, Y. J. (2015). Oligoadenylate synthase-like (OASL) proteins: dual functions and associations with diseases. Exp. and Mol. Med. 47 (3), e144. doi:10.1038/emm.2014.110

PubMed Abstract | CrossRef Full Text | Google Scholar

Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Jan, Y., and Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 110, 12-–22. doi:10.1016/j.jclinepi.2019.02.004

PubMed Abstract | CrossRef Full Text | Google Scholar

Cristianini, N., and Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.

Google Scholar

Crystal, R. G., Scott, H., Engelhardt, J. F., Voynow, J., and Sunday, M. E. (2008). Airway epithelial cells: current concepts and challenges. Proc. Am. Thorac. Soc. 5 (7), 772–777. doi:10.1513/pats.200805-041HR

PubMed Abstract | CrossRef Full Text | Google Scholar

DeDiego, M. L., Nogales, A., Martinez-Sobrido, L., and Topham, D. J. (2019). Interferon-induced protein 44 interacts with cellular FK506-binding protein 5, negatively regulates host antiviral responses, and supports virus replication. MBio 10 (4), e01839-19–1128. doi:10.1128/mBio.01839-19

PubMed Abstract | CrossRef Full Text | Google Scholar

Dietterich, T. (1995). Overfitting and undercomputing in machine learning. ACM Comput. Surv. (CSUR) 27 (3), 326–327. doi:10.1145/212094.212114

CrossRef Full Text | Google Scholar

Fensterl, V., and Sen, G. C. (2015). Interferon-induced ifit proteins: their role in viral pathogenesis. J. virology 89 (5), 2462–2468. doi:10.1128/JVI.02744-14

PubMed Abstract | CrossRef Full Text | Google Scholar

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Ann. statistics 29, 1189–1232. doi:10.1214/aos/1013203451

CrossRef Full Text | Google Scholar

Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Mach. Learn. 63, 3–42. doi:10.1007/s10994-006-6226-1

CrossRef Full Text | Google Scholar

Guo, G., Wang, H., Bell, D., Bi, Y., and Greer, K. (2003). “KNN model-based approach in classification,” in On the move to meaningful internet systems 2003: coopis, DOA, and ODBASE: OTM confederated international conferences, CoopIS, DOA, and ODBASE 2003, catania, Sicily, Italy, November 3-7, 2003. Proceedings (Springer), 986–996.

Google Scholar

Hardman, C. S., Panova, V., and McKenzie, A. N. J. (2013). IL-33 citrine reporter mice reveal the temporal and spatial expression of IL-33 during allergic lung inflammation. Eur. J. Immunol. 43 (2), 488–498. doi:10.1002/eji.201242863

PubMed Abstract | CrossRef Full Text | Google Scholar

Hastie, T., Rosset, S., Zhu, J., and Zou, H. (2009). Multi-class adaboost. Statistics its Interface 2 (3), 349–360. doi:10.4310/sii.2009.v2.n3.a8

CrossRef Full Text | Google Scholar

Hiemstra, P. S., McCray, P. B., and Bals, R. (2015). The innate immune function of airway epithelial cells in inflammatory lung disease. Eur. Respir. J. 45 (4), 1150–1162. doi:10.1183/09031936.00141514

PubMed Abstract | CrossRef Full Text | Google Scholar

Hirst, R. A., Jackson, C. L., Coles, J. L., Williams, G., Rutman, A., Goggin, , et al. (2014). Culture of primary ciliary dyskinesia epithelial cells at air-liquid interface can alter ciliary phenotype but remains a robust and informative diagnostic aid. PloS one 9 (2), e89675. doi:10.1371/journal.pone.0089675

PubMed Abstract | CrossRef Full Text | Google Scholar

Jaroch, K., Jaroch, A., and Bojko, B. (2018). Cell cultures in drug discovery and development: the need of reliable in vitro-in vivo extrapolation for pharmacodynamics and pharmacokinetics assessment. J. Pharm. Biomed. Analysis 147, 297–312. doi:10.1016/j.jpba.2017.07.023

PubMed Abstract | CrossRef Full Text | Google Scholar

Kim, T. K. (2015). T test as a parametric statistic. Korean J. Anesthesiol. 68 (6), 540–546. doi:10.4097/kjae.2015.68.6.540

PubMed Abstract | CrossRef Full Text | Google Scholar

Kleinbaum, D. G., Dietz, K., Gail, M., Klein, M., and Klein, M. (2002). Logistic regression. Springer.

Google Scholar

Knight, D. A., and Holgate, S. T. (2003). The airway epithelium: structural and functional properties in health and disease. Respirology 8 (4), 432–446. doi:10.1046/j.1440-1843.2003.00493.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Ijcai. 1137–1145.

Google Scholar

Kursa, M. B., and Rudnicki, W. R. (2010). Feature selection with the boruta package. J. Stat. Softw. 36, 1–13. doi:10.18637/jss.v036.i11

CrossRef Full Text | Google Scholar

Leach, T., Gandhi, U., Reeves, K. D., Stumpf, K., Okuda, K., Marini, F. C., et al. (2023). Development of a novel air--liquid interface airway tissue equivalent model for in vitro respiratory modeling studies. Sci. Rep. 13 (1), 10137. doi:10.1038/s41598-023-36863-1

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, L., Ni, , An and Song, Y., Yi, Z., and Wang, F. (2021). Identification of pathogenic genes and transcription factors in respiratory syncytial virus. BMC Pediatr. 21, 27–10. doi:10.1186/s12887-020-02480-4

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, L., Stephens, B., Bergman, M., May, A., and Chiang, T. (2021). Role of collagen in airway mechanics. Bioengineering 8 (1), 13. doi:10.3390/bioengineering8010013

PubMed Abstract | CrossRef Full Text | Google Scholar

Melchjorsen, J., Kristiansen, H., Christiansen, R., Rintahaka, J., Matikainen, S., Paludan, S. R., et al. (2009). Differential regulation of the OASL and OAS1 genes in response to viral infections. J. interferon cytokine Res. 29 (4), 199–207. doi:10.1089/jir.2008.0050

PubMed Abstract | CrossRef Full Text | Google Scholar

Murtagh, F., and Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2 (1), 86–97. doi:10.1002/widm.53

CrossRef Full Text | Google Scholar

Nelder, J. A., and Wedderburn, R. W. M. (1972). Generalized linear models. J. R. Stat. Soc. Ser. A General. 135 (3), 370–384. doi:10.2307/2344614

CrossRef Full Text | Google Scholar

Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. pattern analysis Mach. Intell. 27 (8), 1226–1238. doi:10.1109/TPAMI.2005.159

PubMed Abstract | CrossRef Full Text | Google Scholar

Pichlmair, A., Lassnig, C., Eberle, C.-A., Gorna, , Maria, W., Baumann, , et al. (2011). IFIT1 is an antiviral protein that recognizes 5'-triphosphate RNA. Nat. Immunol. 12 (7), 624–630. doi:10.1038/ni.2048

PubMed Abstract | CrossRef Full Text | Google Scholar

Pisner, D. A., and Schnyer, D. M. (2020). Support vector machine. Mach. Learn., 101–121. doi:10.1016/b978-0-12-815739-8.00006-7

CrossRef Full Text | Google Scholar

Rezapour, M., Walker, S. J., Ornelles, D. A., McNutt, P. M., Atala, A., and Gurcan, M. N. (2024a). Analysis of gene expression dynamics and differential expression in viral infections using generalized linear models and quasi-likelihood methods. Front. Microbiol. 15, 1342328. doi:10.3389/fmicb.2024.1342328

PubMed Abstract | CrossRef Full Text | Google Scholar

Rezapour, M., Walker, S. J., Ornelles, D. A., Khan Niazi, M. K., McNutt, P. M., Atala, A., et al. (2024b). Exploring the host response in infected lung organoids using NanoString technology: a statistical analysis of gene expression data. PloS one. 19 (11). e0308849. doi:10.1371/journal.pone.0308849

PubMed Abstract | CrossRef Full Text | Google Scholar

Rezapour, M., Walker, S. J., Ornelles, D. A., Niazi, M. K. K., McNutt, P. M., Atala, A., et al. (2024c). A comparative analysis of RNA-seq and NanoString technologies in deciphering viral infection response in upper airway lung organoids. Front. Genet. 15, 1327984. doi:10.3389/fgene.2024.1327984

PubMed Abstract | CrossRef Full Text | Google Scholar

Rezapour, M., Wesolowski, R., and Gurcan, M. N. (2024d). Identifying key genes involved in axillary lymph node metastasis in breast cancer using advanced RNA-seq analysis: a methodological approach with GLMQL and MAS. Int. J. Mol. Sci. 25 (13), 7306. doi:10.3390/ijms25137306

PubMed Abstract | CrossRef Full Text | Google Scholar

Rocha, A., and Goldenstein, S. K. (2013). Multiclass from binary: expanding one-versus-all, one-versus-one and ecoc-based approaches. IEEE Trans. neural Netw. Learn. Syst. 25 (2), 289–302. doi:10.1109/TNNLS.2013.2274735

PubMed Abstract | CrossRef Full Text | Google Scholar

Sacco, O., Silvestri, M., Sabatini, F., Sale, R., Defilippi, A.-C., and Rossi, G. A. (2004). Epithelial cells and fibroblasts: structural repair and remodelling in the airways. Paediatr. Respir. Rev. 5, S35–S40. doi:10.1016/s1526-0542(04)90008-5

PubMed Abstract | CrossRef Full Text | Google Scholar

Saraçli, S., Doğan, N., and Doğan, İ. (2013). Comparison of hierarchical cluster analysis methods by cophenetic correlation. J. inequalities Appl., 1–8.

CrossRef Full Text | Google Scholar

Schutyser, E., Struyf, S., and Damme, V. (2003). The CC chemokine CCL20 and its receptor CCR6. Cytokine and growth factor Rev. 14 (5), 409–426. doi:10.1016/s1359-6101(03)00049-2

CrossRef Full Text | Google Scholar

Speiser, J. L., Miller, M. E., Tooze, J., and Ip, E. (2019). A comparison of random forest variable selection methods for classification prediction modeling. Expert Syst. Appl. 134, 93–101. doi:10.1016/j.eswa.2019.05.028

PubMed Abstract | CrossRef Full Text | Google Scholar

Steinwart, I., and Christmann, A. (2008). Support vector machines. Springer Science and Business Media.

Google Scholar

Sun, L., Wang, T., Ding, W., Xu, J., and Lin, Y. (2021). Feature selection using fisher score and multilabel neighborhood rough sets for multilabel classification. Inf. Sci. 578, 887–912. doi:10.1016/j.ins.2021.08.032

CrossRef Full Text | Google Scholar

Tam, A., Wadsworth, S., Dorscheid, D., Man, S. F. P., and Sin, D. D. (2011). The airway epithelium: more than just a structural barrier. Ther. Adv. Respir. Dis. 5 (4), 255–273. doi:10.1177/1753465810396539

PubMed Abstract | CrossRef Full Text | Google Scholar

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1), 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x

CrossRef Full Text | Google Scholar

Wedderburn, R. W. (1974). Quasi-likelihood functions, generalized linear models, and the gauss—newton method. Biometrika 61 (3), 439–447. doi:10.2307/2334725

CrossRef Full Text | Google Scholar

Wong, T.-T., Yeh, , and Po-Yang, (2019). Reliable accuracy estimates from k-fold cross validation. IEEE Trans. Knowl. Data Eng. 32 (8), 1586–1594.

Google Scholar

Yang, F.-J. (2018). “An implementation of naive bayes classifier,” in 2018 international conference on computational science and computational intelligence (CSCI) (IEEE), 301–306.

CrossRef Full Text | Google Scholar

Ying, X. (2019). An overview of overfitting and its solutions. J. Phys. Conf. Ser. 1168, 022022. doi:10.1088/1742-6596/1168/2/022022

CrossRef Full Text | Google Scholar

Zhou, X., Michal, J. J., Zhang, L., Ding, B., Lunney, J. K., Bang, L., et al. (2013). Interferon induced IFIT family genes in host antiviral defense. Int. J. Biol. Sci. 9, (2). 200, 208. doi:10.7150/ijbs.5613

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: predictive modeling, 3D airway organ tissue equivalent (OTEs), viral infection, RNA-Seq data, NanoString technologies, differential expression analysis

Citation: Rezapour M, McNutt PM, Ornelles DA, Walker SJ, Murphy SV, Atala A and Gurcan MN (2025) Cross-modal predictive modeling of multi-omic data in 3D airway organ tissue equivalents during viral infection. Front. Genet. 16:1658577. doi: 10.3389/fgene.2025.1658577

Received: 02 July 2025; Accepted: 28 August 2025;
Published: 25 September 2025.

Edited by:

Yuriy Gusev, Georgetown University, United States

Reviewed by:

Valentina Di Salvatore, University of Catania, Italy
Yunus Emre Işik, Sivas Cumhuriyet University, Türkiye

Copyright © 2025 Rezapour, McNutt, Ornelles, Walker, Murphy, Atala and Gurcan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Mostafa Rezapour, bXJlemFwb3VAd2FrZWhlYWx0aC5lZHU=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.