An integrated signature of extracellular matrix proteins and a diastolic function imaging parameter predicts post-MI long-term outcomes

Background Patients suffering from acute myocardial infarction (AMI) are at risk of secondary outcomes including major adverse cardiovascular events (MACE) and heart failure (HF). Comprehensive molecular phenotyping and cardiac imaging during the post-discharge time window may provide cues for risk stratification for the outcomes. Materials and methods In a prospective AMI cohort in New Zealand (N = 464), we measured plasma proteins and lipids 30 days after hospital discharge and inferred a unified partial correlation network with echocardiographic variables and established clinical biomarkers (creatinine, c-reactive protein, cardiac troponin I and natriuretic peptides). Using a network-based data integration approach (iOmicsPASS+), we identified predictive signatures of long-term secondary outcomes based on plasma protein, lipid, imaging markers and clinical biomarkers and assessed the prognostic potential in an independent cohort from Singapore (N = 190). Results The post-discharge levels of plasma proteins and lipids showed strong correlations within each molecular type, reflecting concerted homeostatic regulation after primary MI events. However, the two molecular types were largely independent with distinct correlation structures with established prognostic imaging parameters and clinical biomarkers. To deal with massively correlated predictive features, we used iOmicsPASS + to identify subnetwork signatures of 211 and 189 data features (nodes) predictive of MACE and HF events, respectively (160 overlapping). The predictive features were primarily imaging parameters, including left ventricular and atrial parameters, tissue Doppler parameters, and proteins involved in extracellular matrix (ECM) organization, cell differentiation, chemotaxis, and inflammation. The network signatures contained plasma protein pairs with area-under-the-curve (AUC) values up to 0.74 for HF prediction in the validation cohort, but the pair of NT-proBNP and fibulin-3 (EFEMP1) was the best predictor (AUC = 0.80). This suggests that there were a handful of plasma proteins with mechanistic and functional roles in predisposing patients to the secondary outcomes, although they may be weaker prognostic markers than natriuretic peptides individually. Among those, the diastolic function parameter (E/e' - an indicator of left ventricular filling pressure) and two ECM proteins, EFEMP1 and follistatin-like 3 (FSTL3) showed comparable performance to NT-proBNP and outperformed left ventricular measures as benchmark prognostic factors for post-MI HF. Conclusion Post-discharge levels of E/e', EFEMP1 and FSTL3 are promising complementary markers of secondary adverse outcomes in AMI patients.


Supplementary
. The comparison of the unadjusted and adjusted log hazard ratios to evaluate the effect of medication use during hospital stay to baseline for MACE (top panel) and incident HF (bottom panel) in CDCS and IMMACULATE study. In CDCS, the log hazard ratios are adjusted for the use of beta-blockers, clopidogrel, calcium channel antagonist, long-acting nitrates, ACE inhibitors, aspirin, diuretics, statin and warfarin. For IMMACULATE, the log hazard ratios are adjusted for the use of beta-blockers, ACE inhibitors and angiotensin receptor blockers (ARB). The markers from the identified network signatures are colored in blue (protein), green (lipid), salmon (clinical biomarker) and purple (echo imaging). Markers that are not a part of the signature are shown in grey. Supplementary Information

The Coronary Disease Cohort Study (CDCS)
The CDCS cohort consists of 2,140 patients admitted into two tertiary hospitals (Christchurch Hospital and Auckland City Hospital) in New Zealand (NZ) for an ACS event, recruited from 2002 and 2009. They were invited to return to the hospital one month after discharge for baseline measurements. Patients were excluded from the study if they had a severe comorbidity that limited their life expectancy to less than 3 years. Clinical measurements, blood sample collection, and echocardiogram were all obtained at baseline (30 days from discharge) and 4 months, 6 months and 12 months after discharge. These individuals were followed up until the end of the study on 31 st January 2012, death or lost-to-follow-up. Subsequent clinical events and mortality information were obtained from the NZ Nation Health Information System. The study was approved by the New Zealand Multi-region Ethics Committee (CTY/02/02/018) and all participants had provided a written informed consent prior to their participation. More information on the study can be found in Prickett et al (1).
Out of these 2140 patients, a nested sub-cohort of 750 patients matched by age and gender was selected, and both aptamer-based SOMALOGIC protein assay and multiple reaction monitoring (MRM)-based lipidomics were used to measure proteins and lipids, respectively. Among these, 250 patients were readmitted for HF after their hospital discharge, 250 patients were readmitted for myocardial infarction (MI) episode and 250 patients were selected as controls without any recurrence of MI or incidence of heart failure hospitalization.
The controls were further defined as free of adverse cardiac remodelling and any cardiovascular-related death over the follow-up period. They were selected by matching each HF patient to a MI patient and a control by age and gender using the MatchIT package in R (2).
Following a quality control step for the molecular data, 741 patients were retained for network inference.
For the supervised analysis using the iOmicsPASS method, 286 patients were removed due to the following exclusion criteria: 69 patients with cardiac remodelling, 81 patients with a medical history of heart failure, 97 patients with history of stroke, 78 patients with a prior MI within the previous five years, and 19 patients with missing or unknown status to minimize possible error in the outcomes. A total of 464 patients was retained for predictive subnetwork identification for MACE and HF. Of those, 185 patients remained event-free, and 279 patients had a secondary MACE, including 117 patients hospitalised for HF during follow-up.

The Improving Outcomes in Myocardial Infarction through Reversal of Cardiac Remodelling (IMMACULATE) registry
The IMMACULATE registry consists of 859 patients who were admitted for MI into three local hospitals in Singapore (National University Hospital, Tan Tock Seng Hospital and Singapore General Hospital) from 2011 to 2014. The patients were followed up for a median of 3.9 years (interquartile range IQR: 2.0-4.8 years).
Inclusion criteria were patients with a ST-segment elevation MI (STEMI) or non ST-segment elevation MI (NSTEMI) with ischemic chest pain or angina symptoms, concentrations of troponin exceeding 99 th percentile URL and >50% occlusion of one or more coronary arteries based on angiography. Patients were excluded if they were anaemic (haemoglobin <8 g/dL in men and <7 g/dL in women), had severe renal impairment (estimated glomerular filtration rate, eGFR<15 ml/min/m 2 ), were in cardiogenic shock not able to be weaned off inotropes or intra-aortic balloon pump, history of malignancy or other conditions limiting life expectancy diagnosed within the last 12 months.
Hospital records were used to capture readmissions due to a recurrent ACS event or other morbidity.
Mortality data and primary cause of death was obtained from the National Registry of Death Office (NRDO). undertaken. In the current report, the IMMACULATE study serves as validation and only measurements one month from hospital discharge post-MI, a temporal match with CDCS, have been used. Among these, 190 subjects had all relevant biochemistry data for cardiac markers, proteomics, and lipidomics data. We further removed one patient with self-reported history of HF and six patients with history of stroke. However, we did not remove any patients with self-reported prior MI as the duration of prior MI to primary ACS event was unknown, as well as due to the constraint in the sample size. As a result, we recorded 152 patients who remained event-free and 38 patients with secondary 5-point MACE (15 4-point MACE and 23 HF). Here, we use IMMACULATE cohort as a validation for the evaluation of classification performance.

SOMALOGIC Protein Assay
Plasma samples (50 μL) collected 30 days after the index event were analysed using a Slow Off-rate Modified Aptamer (SOMAmer)-based capture array, "SOMAscan" (somaLogic, Inc, Boulder, CO, USA) (3). The abundance of a protein is reported using relative fluorescent units (RFU). A total of 1,305 human proteins were quantified (47% secreted proteins, 28% extracellular domains and 25% intracellular proteins). Details of the experimental methods and data analysis protocol are described in Chan et al (4). We note that the protein names in all diagrams and tables are given by the gene symbols of the encoding genes (all italicised), with the exception of NT-proBNP and cardiac troponin I, which are measured by both immunoassay and SOMALOGIC.

Lipidomics Assay
Samples were randomized into 10 analytical batches and batch quality control (BQC) samples were prepared by pooling equal plasma aliquots from all the samples. To assess analytical performances, one BQC sample was included for every ten study samples. Plasma (10 µL) was mixed with 190 µL 1-butanol/methanol (1:1, v/v), [BuMe], containing 4.5 µL of SPLASH™ Lipidomix® Mass Spec Standard I (#330707) and 4.5 µL Cer/Sph Mixture I (#LM6002) from Avanti Polar Lipids, Inc. The mixture was vortexed for 30 seconds, then sonicated for 30 min (ice was added into the sonication bath) and then centrifuged at 14,000 x g for 10 min at 10°C. The supernatant was transferred into autosampler vials for mass spectrometry. Extracted blanks were prepared using the same extraction protocol but without plasma samples. Technical quality control (TQC) samples were generated by pooling lipid extracts from the study samples and were used to monitor the analytical system performance.
Analysis of plasma extracts was performed on an Agilent 6495A QQQ mass spectrometer coupled to an Agilent 1290 Infinity-II UHPLC system and the MRM data acquired in positive ion mode were processed using the MRMkit tool (5). Peaks were integrated and the concentration data were further normalized by a time-trend and batch effect correction algorithm built in MRMkit. Only lipid species with the overall coefficient-of-variation smaller than 25% were included for the downstream analysis. We did not perform multi-point calibration to derive molar concentrations for this analysis since the current peak area values normalized by the internal standard provide the same quantitative information as the final absolute concentration values (linear transforms).

Quality control for proteomics and lipidomics data
For both CDCS and IMMACULATE studies, only the measurements collected one month after hospital discharge (baseline) were used. For the CDCS study, a total of 376 unique lipids were used for the subsequent analysis. For the IMMACULATE study, a total of 418 lipid species were retained for validation analysis for relaxed criteria (CoV 50%). Internal standards were used to normalize the raw peak areas in the corresponding lipid class and concentrations were further normalized to the protein concentration in the original sample. Due to the methodology used in our study, we do not report absolute concentration values. Endogenous species were quantified using one standard per lipid class, hence our method can only deliver relative quantification results, which does not hamper testing of concentration differences between subjects within any given lipid species. For the proteome analysis in both studies, a standard SOMALOGIC 1.3k array was used. For the purpose of the analysis, we removed 5 proteins that were from human viruses (i.e. human papillomavirus) and samples that failed QC, resulting in 747 samples in CDCS and 197 in IMMACULATE study.

Biological function enrichment analysis incorporating tissue specific gene expression
The proteins included in the final predictive signatures were annotated in terms of their biological processes in Gene Ontology (GO) using gProfiler (6). Enrichment analysis was repeated with an additional filtering of query and background gene lists with regard to the minimal expression level (transcript per million reads, or TPM > 5) in different tissues, heart (atrial appendage, left ventricle), arteries (aorta, coronary), kidneys (cortex, medulla), liver, lungs and skeletal muscle, in GTEx (7).

iOmicsPASS+
iOmicsPASS has been expanded and reimplemented as iOmicsPASS+ to allow for greater flexibility and generalizability to other types of omics data. First, the tool now allows for only one network and one data as input. If only one data (X) is used as input, then the network input is assumed to be one which links the features within the data. If more than one data is used as input (e.g. X, Y and Z), then a second network should be provided to link the features between the data in X to the data in Y. Data Z is used to normalize against matching features in data Y as in the original iOmicsPASS.
Second, the network input now allows for direction of association for the pair of interacting or covarying data features. The user can specify an additional column in the network file to include whether the association between the pair of features is positive (i.e. 1) or negative (i.e. -1). For the latter, the interaction scores are calculated as ratios (differences of two Z-scores), instead of the usual product (sum of two Z-scores).
Within iOmicsPASS+, a sample correlation matrix is first calculated to determine if the direction of the association is consistent with the indicated direction by the user in the network. An edge with interaction score is created only if the sign of the correlation between two data features is consistent with the specified direction of association. Otherwise, if inconsistent, the pair of data features is not used as predictive features for the subsequent analysis. This enables the application to other omics data such as microRNAs and DNA methylation where the inhibition of gene expression occurs as a result of interaction with the target genes.
Last, several new modules are added to the R package to facilitate users to build, install, create input parameter file and run the software. One of the most important module is a network inference module which estimates the best co-varying network between the different types of data, using an existing R-package "huge" (9), which carries out graphical LASSO (GLASSO) estimation. Within the module, we propose two different approaches for estimating the network: a supervised approach and a hybrid approach. In the supervised approach, the network is fully estimated from the precision matrix while the latter fuses a user-provided network input based on prior knowledge with the estimated GLASSO-derived partial correlation network to produce a combined network.
Another essential module in the R package is the prediction module that uses the iOmicsPASS results to assign new samples to the phenotypic groups based on the set of predictive signatures. The prediction module uses a discriminant model to assign each sample to the phenotype group with the highest discriminant score. In this module, the discriminant score is modified such that the prior probability incorporates other clinical information such as age, gender and BMI, that may influence the classification.
The following sections will describe the network inference method in greater detail.

Notation
We first define a few notations for the subsequent sections. Consider a p-dimensional data, = to be conditionally independent given other variables present. If ≠ 0, then and are not conditionally independent and an edge can be form between the two in a network setting.

Network inference module
In this module, there are two proposed ways to create a pseudo network, linking one -omics to another. First, a supervised approach is completely driven by the estimates from the data. Second, a hybrid approach, which uses a mix of what is known as a prior and what is derived from the data by the supervised approach to produce a combined network adjacency matrix. The subsections below provide a detailed description of the method in the two approaches.

Supervised approach
In the supervised approach, the input data is first standardized to unit standard deviation so that the sample covariance matrix and the sample correlation matrix are equivalent. Before the sample covariance matrix is computed, PCA is used to identify any outliers, defined to be more than 4 SDs away from the median of the first 2 PCs derived from concatenating all the data into a single matrix. After filtering outliers, the sample covariance matrix is computed and corrected to the nearest PSD matrix if any of its eigenvalues are negative.
To convert a matrix to the nearest PSD matrix, the approach by Nicholas J. Higham (15) is used: After selecting an optimal regularization parameter , the final GLASSO model is fitted and the estimated sparse inverse covariance matrix, Ω, is produced. The non-zero entries in this matrix are then converted into an edge-level network file to be used for the predictive analysis module. In addition, a partial correlation matrix is produced as output to provide a measure of confounding-free correlation for every pair of data features present in the network. For entries where the estimated inverse covariance matrix are shrunken to zero, the partial correlations remain as zero. The regularized partial correlation estimate between data features and is calculated as follows (16,17): where ̂, represents the ( , ) ℎ entry in the estimated precision matrix, Ω. Then, heat maps illustrating the derivation of the sample covariance matrix to the estimated nearest PD covariance matrix, the estimated precision matrix and the corresponding adjacency matrix, are also produced in PDF files.

Hybrid approach
The hybrid approach is a semi-supervised approach where the prior knowledge of the network is combined with the estimated network by the supervised approach above into a single adjacency matrix. First, a network file is required as input as the prior (matrix ), then the supervised approach is carried out to obtain an estimated precision matrix (matrix ̂) . Let us first define the following matrices: We extract the corresponding elementwise sample covariances in from the non-zero entries in matrix by defining the Hadamard product of the two matrices to result in a non-negative matrix: A score matrix * , constrained between 0 to 1, can be computed as follows: where ′ = ( ′ ) = and max = , ′ . Here, the matrix = Lastly, a fused matrix is computed by: adding to the prior if the edge is supported by what is estimated in Ω (i.e. ̂, ≠ 0), or penalizing the prior if the edge is present in the prior but not supported by the estimated network (i.e. ̂, = 0).
The final network is generated by forming an edge between feature and if the absolute value in ( , ) ℎ entry of matrix is at least 0.5 and the direction of association of the edge is determined by the sign of the entry.
Heatmaps illustrating the derivation of the sample covariance matrix to the estimated precision matrix, the input prior matrix, the product fused matrix and the corresponding adjacency matrix, are produced as a graphical output at the end.

Model selection criterion
To optimize the regularization parameter, , for the penalised log-likelihood in GLASSO that controls for the sparsity of the precision matrix, several model selection criteria are used, including Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) (18,19). However, when dealing with molecular data, we often encounter the curse of the high dimensionality where the size of the feature space, p, is much greater than the sample size, n. Hence, other criteria such as extended Bayesian information criterion (e-BIC) and K-fold CV may be more relevant.

Akaike Information Criterion, AIC
AIC is a model selection criterion which penalises for overfitting of the model. The formula for AIC is: where is the number of non-zero elements in the estimated precision matrix (Ω) and ℒ is the likelihood function without the penalty term. The best regularised parameter would correspond to the model which yield the lowest AIC value.

Bayesian Information Criterion, BIC
BIC is similar to AIC but penalizes more against over-parameterization. Due to the heavier penalty, linear regression models chosen by BIC tends to be the same or simpler models than the one chosen by AIC (20). The formula for BIC is: where is the sample size. One underlying assumption of BIC is that the observations should be independent and identically distributed (21). Otherwise, the effective sample size, , should be adjusted accordingly. Then, the regularized parameter should be picked with the model which has the lowest BIC value.

Extended Bayesian Information Criterion, e-BIC
Another recently proposed model selection criterion is the extended BIC, which came about after Foyel et al.
and Chen et al. showed that AIC and BIC lose consistency when the feature space exceeds the sample size (i.e. ≫ ). They propose a modification to the BIC formula as follows (19,22): where 0 ≤ ≤ 1, with higher values yielding simpler and more parsimonious models. When = 0, the formula reverts to BIC and = 0.5 is suggested to be used for more consistent models (19). Similar to both AIC and BIC, the best regularized parameter corresponds to the model with the smallest e-BIC value.

K-fold Cross-Validation, CV
An alternative approach to estimate a regularized parameter is through learning directly from the data. In 2009, Fan et al. (23) proposed a K-fold cross validation score given by: where log |Ω| = log (det(Ω)), denotes the sample size and the set represents the test dataset in the ℎ fold. Here, Ω − ( ) is the precision matrix estimated by using the training dataset, ⋃ ≠ . Typically, a fivefold cross-validation is recommended unless the sample size is too small and the regularised parameter is chosen at the value with the largest CV score. inputDat) and combining multiple datasets into a single matrix. PCA is then carried out to screen for outlying samples that are more than 4 SDs away from the median of the first 2 PCs. After this, a supervised or hybrid approach is performed to estimate the adjacency matrix from the precision matrix or fused matrix, respectively, and to form the resulting network.
-----createPrior() -----This function fits a logistic model for a binary outcome or a multiple logistic model for a phenotypic group with more than 2 outcomes, adjusting for the user-specified clinical variables. User has to specify which variables are categorical and thus need to be converted to factor variables, otherwise they are treated as continuous variables. It then calculates the probability of every sample's membership to each phenotypic outcome based on the fitted model. The probabilities calculated can be fed into run.iOmicsPASS() to be used as prior in the discriminant model for classification.
The data can also be split into a training and test dataset where the logistic model is fitted on the training dataset and the classification probabilities are calculated on the test dataset, by using the option "predict=TRUE". The probabilities calculated can be fed into Predict.iOmicsPASS() function as prior probabilities used in the discriminant model to assign test samples to the phenotypic outcome groups in the training model.
-----createInputParam() -----This function creates an input parameter file needed to run iOmicsPASS with user's specifications and the directory containing the required input files.
-----run.iOmicsPASS() -----This function calls system and execute the compiled binary, performing predictive analysis to identify network signatures that can separate the phenotypic groups of interest. The software may take some time if the dimensionality of the network is very large. It is advised to first run CV once to create a misclassification error plot by setting Cross.Validate=TRUE, then turn off the CV option and re-specify a suitable threshold that strikes a good balance between change in mean CV error and the number of selected edges. Its function also outputs the key model parameters required for prediction module below.
-----Predict.iOmicsPASS() -----This function requires the model parameters associated with the network signatures obtained from a training dataset. Using the same network features in the test data to compute a co-expression matrix, class probabilities are computed to assign each new sample to a phenotypic group.