Cytokine TGFβ Gene Polymorphism in Asthma: TGF-Related SNP Analysis Enhances the Prediction of Disease Diagnosis (A Case-Control Study With Multivariable Data-Mining Model Development)

Introduction TGF-β and its receptors play a crucial role in asthma pathogenesis and bronchial remodeling in the course of the disease. TGF-β1, TGF-β2, and TGF-β3 isoforms are responsible for chronic inflammation, bronchial hyperreactivity, myofibroblast activation, fibrosis, bronchial remodeling, and change the expression of approximately 1000 genes in asthma. TGF-β SNPs are associated with the elevated plasma level of TGF-β1, an increased level of total IgE, and an increased risk of remodeling of bronchi. Methods The analysis of selected TGF-β1, TGF-β2, TGF-β3-related single-nucleotide polymorphisms (SNP) was conducted on 652 DNA samples with an application of the MassARRAY® using the mass spectrometry (MALDI-TOF MS). Dataset was randomly split into training (80%) and validation sets (20%). For both asthma diagnosis and severity prediction, the C5.0 modelling with hyperparameter optimization was conducted on: clinical and SNP data (Clinical+TGF), only clinical (OnlyClinical) and minimum redundancy feature selection set (MRMR). Area under ROC (AUCROC) curves were compared using DeLong’s test. Results Minor allele carriers (MACs) in SNP rs2009112 [OR=1.85 (95%CI:1.11-3.1), p=0.016], rs2796821 [OR=1.72 (95%CI:1.1-2.69), p=0.017] and rs2796822 [OR=1.71 (95%CI:1.07-2.71), p=0.022] demonstrated an increased odds of severe asthma. Clinical+TGF model presented better diagnostic potential than OnlyClinical model in both training (p=0.0009) and validation (AUCROC=0.87 vs. 0.80,p=0.0052). At the same time, the MRMR model was not worse than the Clinical+TGF model (p=0.3607 on the training set, p=0.1590 on the validation set), while it was better in comparison with the Only Clinical model (p=0.0010 on the training set, p=0.0235 on validation set, AUCROC=0.85 vs. 0.87). On validation set Clinical+TGF model allowed for asthma diagnosis prediction with 88.4% sensitivity and 73.8% specificity. Discussion Derived predictive models suggest the analysis of selected SNPs in TGF-β genes in combination with clinical factors could predict asthma diagnosis with high sensitivity and specificity, however, the benefit of SNP analysis in severity prediction was not shown.


INTRODUCTION Background/Rationale
The latest concept of chronic airway inflammation in asthma implies the existence of a complex interaction between the epithelium, innate lymphoid cells (ILCs), lymphocytes and, finally, effector cells. Current advances in basic sciences have allowed researchers to discover three basic forms of responses of airway epithelium to allergic and non-allergic factors leading to its damage. Such a division results from the discovery of various types of ILCs, cytokine profiles and responses of the effector cells (1,2). Type 1 immunity consists of T-bet(+) IFN-g-producing group 1 ILCs (ILC1) and natural killer cells, CD8(+) cytotoxic T cells (Tc1), and CD4(+) Th1 cells and forms a mechanism protecting against viral infections. Type 2 immunity is composed of GATA-3(+) ILC2s, Tc2 cells, and Th2 cells producing IL-4, IL-5, and IL-13 which activate mast cells, B lymphocytes, basophils, and eosinophils and are responsible for anti-allergic and anti-parasitic reactions. Type 3 immunity is regulated by the retinoic acid-related orphan receptor gt(+) ILC3s, Tc17 cells, and Th17 cells producing IL-17, IL-22, or both, and mediates antifungal and antibacterial reactions. On the other hand, types 1 and 3 determine the development of autoimmune diseases (non-allergic diseases), while type 2 is responsible for the development of allergic diseases (2).
It is the epithelium/Th2/ILC2 system that determines the lack of control of asthma symptoms, progression of the disease and development of its complications. Eosinophils can induce EMT (Epithelial-Mesenchymal Transition) in airway epithelial cells via increased production of the transforming growth factor (TGFb), which is the main and most important molecular and cellular mechanism causing airway remodeling. This data has been confirmed by the latest experimental research (3)(4)(5). Experimental murine airway remodeling models have proven that blocking TGF-b mediated inflammation by targeting Smad proteins, c-Jun Nterminal kinase and phosphoinositide 3-kinase signaling pathways decreases bronchial fibrosis. Undoubtedly, the above proteins are responsible for chronic inflammation, bronchial hyperreactivity, myofibroblast activation, fibrosis, bronchial remodeling, and they change the expression of approximately 1000 genes in asthma, especially those of MMPs, PAI-1, CTGF, MCP-1, IL-6, TGF-b, TSP-1, TGFR-1/ 2, fibronectine, proteoglycans, as well as type I and III collagen (3,4,6).
The TGFb (1-3 isoforms, are small, 25 kDa secreted homodimeric signaling proteins) and especially TGFb1 superfamily is responsible for immunosuppresion of T and B lymphocytes as well as NK cells, chemotaxis of macrophages and fibroblasts, stimulation of proliferation, an increase in fibroblast synthesis, stimulation of synthesis of fibronectin, proteoglycans, and type I and III collagen, eosinophil chemotaxis after allergen exposure, MAP kinase phosphorylation, increase in bronchial myocyte proliferation, inhibition of collagenase and matrix metalloproteinase gene expression, inhibition of MHC class II antigen expression and inhibition of surfactant synthesis by type II pneumocytes (6,7). On the other hand, hyperactivity of the TGFb-Smad (a family of proteins similar to the gene products of the Drosophila gene 'mothers against decapentaplegic' (Mad) and the C. elegans gene Sma) signaling pathway underlies many human disorders, such as excess deposition of the extracellular matrix, fibrotic disorders, and progressive cancers (6)(7)(8).
Numerous studies conducted on diverse populations have shown that genetic factors largely contribute to variability in the pulmonary function and to familial aggregation of asthmatic patients.
We detected base substitutions as single-stranded conformational polymorphisms. We screened each polymorphism by a case-control analysis in order to find association with allergy and asthma using our data base containing 237 atopic asthmatic and 268 non-asthmatic families. Table 1 presents analyzed polymorphisms in the TGFb1, TGFb2 and TGFb3 genes. Polymorphism rs8179181 in the TGFb1 gene is associated with an increased risk of childhood asthma and atopy. It is associated with a more severe course of the disease and increased levels of TGFb1 mRNA (8). rs4803455 correlates with the risk of the disease. Moreover, it worsens the lung function and causes airway remodeling in asthma (9). rs1800469 in the TGF-b1 promoter has been found to be related to an elevated plasma level of TGF-b1, an elevated level of total IgE and an increased risk of remodeling bronchi, as well as the development of asthma (10)(11)(12)(13). rs11083616 is associated with bronchial obturation as well as with airway wall phenotypes -airway wall thickness. It is a significant risk of obturatory diseases (14). It has not really confirmed that rs8109627 in the TGFb1 gene contributes to an increased risk of asthma. The role of tagging polymorphisms in the TGFb2 gene (rs10495098, rs17047703, rs17558745, rs2799085, rs2009112, rs10482751, rs2027567, rs10779329, rs2796821, rs2796822, rs4846479, rs2798631, rs10863399) as well as in the TGFb3 gene (rs4903359, rs3917187, rs2284792, rs2268626) has not been confirmed yet. Nevertheless, it should be stressed that asthma phenotypic differences that result from altered expression due to SNPs are sometimes inconsistent and disease association studies are often ambiguous (15).

Aims and Objectives
The aim of our study was to identify SNPs in TGF-b family potentially associated with asthma occurrence and severity, and subsequently test their predictive value. To that end, we decided to evaluate the prevalence of SNPs in TGF-b1, TGF-b2 and TGF-b3 in both asthmatic and non-asthmatic polish population. Collected data were intended to serve as a base for binary classification models.

Consent of the Bioethics Committee
The study was approved by the local ethics committee (Consent of Research Review Board at the Medical University of Lodz, Poland, No RNN/133/09/KE). At the commencement of the study, participants were invited to take part voluntarily. Before enrollment, a written informed consent was obtained from each patient.

Variables and Subjects
Asthma diagnosis was established according to GINA (The Global Initiative For Asthma) recommendations, based on clinical asthma symptoms and a lung function test. The level of asthma severity and control was determined on the basis of GINA Report Guidelines. All the participants underwent structuralized anamnesis and clinical examination, to collect details on factors such as: gender, obesity, tobacco smoking, duration of bronchial asthma, allergy to house dust mites, animal fur, mould spores, cockroaches allergens, hypersensitivity to non-steroid anti-inflammatory drugs (NSAIDs), etc., in order to determine their role in the development of resistance to glucocorticoids, as well as to establish genetic predisposition (obtained from medical records of particular patients). If results of spirometry or allergological tests were unavailable, such examinations were additionally performed during the recruitment visit. Subjects suffering from clinically significant exacerbations, using drugs which might induce resistance to glucocorticoids (such as rifampicin, phenobarbital, phenytoin, effedrine), subjects with signs of viral infections, either generalized, or affecting the respiratory tract, as well as subjects failing to comply with the doctor's recommendations, were excluded from the patient group. The control arm included a group of healthy adults, who met the following criteria: no history or symptoms of either bronchial asthma or other pulmonary diseases, no history or symptoms of allergy, no history or symptoms of atopic dermatitis, no history, or signs of hypersensitivity to aspirin, negative results of skin tests for 12 common allergens, no first-degree relatives with bronchial asthma or atopic disorders. Spirometry tests were conducted in the Outpatient Department according to ERS (European Respiratory Society)/ATS (American Thoracic Society) standards, whereas allergological tests were performed according to EAACI (European Academy of Allergy and Clinical Immunology) guidelines (10-13, 16, 17).
The whole study group consisted of 652 individuals whose mean age was 47.4 ± 15.9 years. Detailed characteristics of the patients were presented in Table 2.

Statistical Methods
The statistical analysis was performed with an application of Welch two-sample t-test and Pearson's Chi-squared test (with Yates' continuity correction if appropriate) in intragroup association testing. Unadjusted Chi-Square test statistic was also used in pairwise linkage disequilibrium analysis prior to the modelling. Standard r-squared (r2) and p-values were calculated for each pair.
Logistic regression analysis was performed to derive odds ratios with their confidence intervals in univariable analysis. In the first step, the analysis was performed for minor allele carriers (MACs) i.e. presence of at least one minor allele (so called recessive model). In this analysis, the lack of a minor allele was considered as reference. The second step of the analysis included testing the association of particular genotypes (i.e. additive model) with asthma diagnosis and severity, with the most common genotype considered as a reference. The goodness of fit was assessed using the likelihood-ratio test (LR-test). Just before the analysis, as missing data constituted only the 2% (n=723/35860) of the overall dataset, multiple imputation using chained equations was performed. Predictive mean matching was performed with a maximum of 500 iterations.
In order to create the clinically useful multivariable models, the data-mining procedures followed the gold-standards of predictive model development. Since both asthma occurrence and asthma severity were variables of interest, binary classification models were created for asthma diagnosis (asthma vs healthy) and asthma severity (mild vs severe; using only data asthmatic patients), respectively. Both scenarios were executed independently, with identical steps.
Firstly, the dataset was randomly divided (with stratification) into training and validation set in 80%:20% ratio. To answer whether addition of SNP-related data adds to discriminatory power of models, we developed models for 3 scenarios (1): jointly clinical data and TGF-related SNPs (scenario further referred to as "TGF+clinical") (2), only clinical data (scenario further referred to as "clinical") (3), selected features from clinical data and TGFrelated SNPs using minimum redundancy feature selection (scenario further referred to as "MRMR"). Predictive models were developed using Quinlan's C5.0 algorithm with hyperparameter optimization (including 500 iterations of random search). As decision trees employ own build-it feature selection and pruning, no additional feature selection was performed. To counteract possible overfitting, the best set of hyperparameters was selected based on the accuracy of metrics derived from 10-fold crossvalidation performed on the training set, thus the validation set had no impact on selection of best hyperparameters.
The Quinlan's C5.0 algorithm extends the C4.5 classification algorithm and can produce decision trees or collections of rules. Both of those can be further boosted, creating ensemble models. Information gain (entropy) is used as its splitting criteria, while C5.0 pruning technique adopts the binomial confidence limit method. All of those lead to detection of far more complex patterns than frequently used logistic regression.
The best models were finally validated on hold-out validation set. To avoid bias, the minimum redundancy feature selection (MRMR) was performed after dataset splitting. (Figure 1) DeLong's test for two correlated ROC curves (receiver operating characteristic curves) was used to compare predictive abilities between sets and scenarios. The DeLong method was also applied in calculations of 95% confidence intervals (95%CI) for the area under the ROC curves (AUC ROC) (20,21).
The analysis was performed utilizing R programming languages (version 3.6.1) with the following crucial packages: caret (version 6.0-84), mRMRe (version 2.0.9) and C50 (version 0.1.2). Final caret models were extracted in RDS format and placed as Supplementary Files for reproducibility and further validations. As all comparisons were preplanned, no multiple comparison correction was applied. The analysis code was published in GitHub repository: https:// github.com/kstawiski/Panek_TGF. Additional data may be provided upon readers' requests (20,21).

Participants
A comparative analysis of biometric parameters revealed differences with asthmatic patients and healthy controls. Detailed characteristics of the patients were presented in Table 2. As it was assumed, differences in sex, height, and weight as well as in smoking pack-years were not statistically significant. However, we have noticed that healthy controls were slightly younger and had a lower BMI. The difference in BMI was not noticed between patients with severe and non-severe asthma ( Table 3). An analysis of samples with an application of mass spectrometer MassARRAY4, the authors obtained raw results (Supplementary Figure 1) presented in the form of mass spectra. They were used to detect polymorphisms in the studied genes. Figure 2 presents a distribution of homo-and heterozygotes for all analyzed samples depending on the yield (Figure 2A), height of the mass spectrum peak ( Figure 2B) and common logarithm (LOG) from the height of the mass spectrum peak ( Figure 2C).
The graphs presented in Figure 2 show a result of an analysis of rs2009112 polymorphism for three randomly selected samples. They are image representations and output data for identification of polymorphism in the analyzed sample. Supplementary Figure 2 Similarly to the Hardy-Weinberg principle, the co-occurrence of different SNPs should be theoretically random. However, the pairwise linkage disequilibrium analysis showed that 100 out of 190 comparisons were significantly associated. Not surprisingly, based on the r2 values, the strongest associations were found to be between SNPs from the same genes. However, the results od this analysis were rather mixed, indicating complex genetic landscape of selected SNPs. Please see the network of statistically significant disequilibrium on Figure 3. Information redundancy and significant associations between SNPs further supported application of feature selection and data-mining modeling with embedded feature selection.
As shown in Table 4, the univariable analysis has revealed a significance of minor allele carriers (MACs) of rs2009112, rs2796821, and rs2796822 regarding severe asthma development. None of the SNPs was significantly associated with the risk of asthma in the univariable analysis.
Furthermore, A/A genotype of rs4803455 presented to be protective against severe asthma development in comparison with C/C genotype. Multiple SNPs were significantly associated with asthma severity. Both C/T and T/T of rs2009112 were associated with an increased severity of asthma in comparison with C/C genotype, like C/T in rs2796821, A/G in rs2796822, A/ C in rs10863399. In contrast, in the analysis of the risk of asthma diagnosis, only the C/C genotype in rs10779329 was associated with a significantly lower risk of disease and rs4903359 A/G in comparison with most common genotypes.
To assess the predictive abilities of studied SNPs, we have developed benchmark predictive models for both asthma diagnosis (asthma vs. healthy) and severity (severe vs. nonsevere, as defined in the "Methods" section).
As shown in Table 5, out of the total number of 49 features, 23 remained after MRMR feature selection for modeling of asthma diagnosis prediction. For asthma severity, an application of prediction MRMR feature selection allowed to reduce the dimensionality of the dataset to 17 features. Similar observations were not noted for asthma severity. Although the Clinical+TGF model was better than the Only Clinical model alone on the training set (p<0.0001), no difference in AUC ROC on the validation set was noted (p=0.7977), which indicated overfitting and lack of benefit from a SNP analysis. MRMR feature selection decreased predictive performance of models on the training set (p<0.0001) while the performance for validation remained similar (AUC ROC 0.77 vs. 0.76, p=0.8393). No statistically significant benefit was observed between Clinical +TGF, Only Clinical and MRMR models on validation sets. ROC curves were shown in Figure 4.

DISCUSSION
Despite the fact that allergy, which can be detected in around 90% of patients, a combination of genetic factors and other environmental determinants are responsible for an occurrence of the disease. From the clinical point of view, of all candidate gene groups of allergic diseases and asthma, those genes which are associated with the function of Th2 lymphocytes, epithelial cells and the lung function, bronchial remodeling and asthma severity are particularly important. This group includes TGF-b1, TGF-b2 and TGF-b3 genes (22)(23)(24)(25)(26).
Today, the importance and role of SNPs in the pathogenesis of asthma is widely discussed. It should be noted however that there are a lot of studies on this issue, conducted on populations of different character and different sizes. Not all results are always replicable, either. It should be pointed out that many studies have shown and confirmed the functional role of SNPs of TGF-b1, TGF-b2 and TGF-b3 genes in asthma (12, 17-19, 23, 24). It is important, from the point of view of basic and clinical sciences, to know how these SNPs influence signaling pathways regulated by the TGF-b gene in asthma.
In this paper we analyzed SNPs in TGFb1, TGFb2 and TGFb3 genes. Their functions and effects on the expression of TGFb1, TGFb2 and TGFb3 mRNA, as well as a new pool not yet studied in asthma, had been known before. We present a comprehensive analysis of 20 polymorphisms of TGF-b1, TGF-b2 and TGF-b3 genes as a predictor of the disease as well as its severity. SNPs were tested for both MAC differences between asthmatic patients and healthy controls as well as between patients with non-severe and severe asthma. There were no statistically significant differences regarding any of studied MACs of SNPs in TGF-b1, TGF-b2 and TGF-b3 genes between the asthmatics and healthy subjects. However, rs2009112 The specific SNP may be more commonly presented in patients with asthma (increased risk of severe asthma), and it was higher by 93% for heterozygous forms of rs2796821 and by 102% for the heterozygote forms of rs2796822. A statistical analysis of single SNPs, particularly on selected genes, as shown in the results, provides incomplete knowledge on the role of SNPs in the development of asthma, as well as in its more severe forms. In our study, we did not confirm the role of several SNPs in the TGF-b1 gene, but we discovered a new functional significance of other SNPs in the TGF-b2 gene (6, 10-13, 16, 17). In this part of the work, genotyping of 20 in TGF-b1, TGF-b2 and TGF-b3 genes allowed to discover two new SNPs that increase the risk of asthma (rs10779329 and rs4903359). To confirm these analyzes, it is worth investigating in the future the expression of TGF/ Smad signaling pathway on cell models, and in particular, to determine the levels of Smad2/3 and Smad4, due to the fact that these proteins play a special role in stimulating the synthesis of fibronectin, proteoglycans, type I and III collagen and the intensification of eosinophil chemotaxis after allergen exposure in bronchi of asthmatics (16)(17)(18)(19).   Considering the fact that alleles occur in SNP with different frequency in different populations as well as different results in the analysis of genotypes of the same SNPs using different molecular techniques, it should be concluded that the statistical analysis of single SNPs is of low molecular and clinical importance in the development of chronic inflammatory respiratory diseases, such as asthma.
In the light of the above, in the subsequent part of the study, we tested whether the analysis of selected SNPs could increase the predictive potential of well-known clinical factors in terms of asthma diagnosis and severity prediction. By splitting the dataset (hold-out validation), performing hyperparameter optimization and analysis of ROC curves we followed the golden standard of predictive model development. To further check whether selection of particular SNPs and clinical features could counteract overfitting the MRMR algorithm was used for dimensionality reduction. In the results section we showed that not all clinical and genomic features are needed to develop overfitting-resilient model for asthma diagnosis prediction. Based on the metrics in hold-out validation, in our opinion, the MRMR model could be recommended for asthma diagnosis prediction in clinical settings. One can reuse our models for prediction using RDS files in Supplementary Appendix via predict function in R caret package. Anonymized data of an individual patient were added to the appendix to facilitate the reproducibility and further research.
However, few things have to discussed at this step. First, univariable analysis has shown limited independent association of particular SNPs with asthma diagnosis and severity. By application of C5.0 algorithm in this paper we were able to find more complex patterns that show the information gain, however, one has to acknowledge that derived model is the ensemble of decision tree, thus does not provide a simple explanation of the predictions. Additionally, Only Clinical model was not worse than MRMR model. Although, we applied standard data mining pipeline (with hyperparameter optimization and hold-out validation) the model requires further external validation. Furthermore, limitations inherited with technology could serve as a source of bias. Although the missing data rate was law, data imputation was required for predictive modelling. This was due to the specificity of subsequent experiments, impurities that can sediment the on SpectroCHIP and may interfere with signal detection, as well as the likelihood of DNA degradation in the obtained samples. Lastly, the results may be valid only for Polish population, which is quite homogeneous.
Nevertheless, in this study we show that analysis of selected SNPs in combination with selected clinical factors predicts the asthma diagnosis better than just clinical factors. Proven validity of MRMR model could implement preventative methods against asthma in particular groups of patients (asthma endotypes). Therefore, earlier identification of patients burdened with risk of more severe disease (carriers of specific SNPs in TGF-b1, TGF-b2 and TGF-b3 genes) is possible. This would facilitate faster implementation of intensive anti-inflammatory treatment (GCS, glucocorticoids) and prevent disease progression, exacerbations and bronchial remodeling (regulation of remodeling by TGF-b1, TGF-b2 and TGF-b3 genes).

SUMMARY
This work is the first in the Polish population to analyze the problem of the functional impact of 20 SNPs of TGF-b1, TGF-b2 and TGF-b3 genes on the risk of asthma. We have revealed new relationships between the occurrence of SNP rs10779329 and rs4903359 of the TGF-b2 gene and a statistically significantly increased risk of asthma. This observation is particularly important because the TGF-b gene affects eosinophil levels, bronchial hyperreactivity and bronchial obturation as well as clinical symptoms of asthma. The TGF-b1-3 gene complex is an important regulator of the immune response in asthma. We also proposed new predictive models which proven that analysis of selected SNPs in combination with selected clinical factors predicts the asthma diagnosis better than just clinical factors   for asthma diagnosis prediction. This was not proved for asthma severity prediction. Good validation properties indicate that presented models may be of great clinical potential.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Consent of the Bioethics Committee. The study was approved by the local ethics committee (Consent of Research Review Board at the Medical University of Lodz, Poland, No RNN/133/09/KE). At the commencement of the study, participants were invited to take part voluntarily. Before enrolment, a written informed consent was obtained from each patient. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
MP conceived of the presented idea. MP, KS, and MK developed the theory and performed the computations. KS and MK verified the analytical methods. PK and MP encouraged KS and MK to investigate SNP models in asthma and supervised the findings of this work. All authors discussed the results and contributed to the final manuscript.

FUNDING
The study has been financed from the Polpharm Scientific Foundation, grant no. 16  The graph presenting the intensity of mass spectra for T homozygote, lack of signal for C allele; (B) The graph presenting the intensity of mass spectra for CT heterozygote; (C) The graph presenting the intensity of mass spectra for C homozygote, lack of signal for T allele.