A Novel Sample Selection Approach to Aid the Identification of Factors That Correlate With the Control of HIV-1 Infection

Individuals infected with HIV display varying rates of viral control and disease progression, with a small percentage of individuals being able to spontaneously control infection in the absence of treatment. In attempting to define the correlates associated with natural protection against HIV, extreme heterogeneity in the datasets generated from systems methodologies can be further complicated by the inherent variability encountered at the population, individual, cellular and molecular levels. Furthermore, such studies have been limited by the paucity of well-characterised samples and linked epidemiological data, including duration of infection and clinical outcomes. To address this, we selected 10 volunteers who rapidly and persistently controlled HIV, and 10 volunteers each, from two control groups who failed to control (based on set point viral loads) from an acute and early HIV prospective cohort from East and Southern Africa. A propensity score matching approach was applied to control for the influence of five factors (age, risk group, virus subtype, gender, and country) known to influence disease progression on causal observations. Fifty-two plasma proteins were assessed at two timepoints in the 1st year of infection. We independently confirmed factors known to influence disease progression such as the B*57 HLA Class I allele, and infecting virus Subtype. We demonstrated associations between circulating levels of MIP-1α and IL-17C, and the ability to control infection. IL-17C has not been described previously within the context of HIV control, making it an interesting target for future studies to understand HIV infection and transmission. An in-depth systems analysis is now underway to fully characterise host, viral and immunological factors contributing to control.

Keywords: HIV-1, elite controllers, infection-immunology, viral control, immunology & infectious diseases

INTRODUCTION
Individuals infected with HIV display varying rates of viral control and disease progression, with a small percentage being able to spontaneously control in vivo viral replication without the need for anti-retroviral treatment (ART) (1). Such exquisite control is likely to happen in the very early battle between host and virus in acute and early HIV infection (2). Our understanding of host-pathogen interactions and the mechanisms underpinning the immune response to HIV infection has been informed by studies of individuals who demonstrate an enhanced ability to control in vivo viral replication, and by studies of non-pathogenic SIV infection in non-human primates (NHP) (3). However, many of the studies of HIV control are cross-sectional and were conducted after set point viral load and control had been achieved. Many of these studies have focussed on Clade A and C infection. A full understanding of the mechanisms governing such spontaneous control of infection has been hampered by the paucity of informative and linked samples coupled to technology with sufficient resolution to define this phenomenon.
Systems-based approaches have helped define novel factors driving disease progression and protection during infections such as tuberculosis (4,5), yellow fever (6), malaria (7), and influenza (8). But their application to aid the definition of the drivers of spontaneous control in HIV has been limited. The gene signature analysis of early gut mucosal T cell responses to HIV-1 suggests that the absence of an inflammatory gene signature may define long-term non-progressors (LTNPs) (9). But recent scRNA-Seq profiling during acute HIV infection in a limited number of treatment-naïve subjects from the Females Rising through Education, Support and Health (FRESH) (10) cohort described an interferon response gene signature before peak viraemia as well as the presence of gene modules associated with antiviral control (APOBEC3A, IFITM1, and IFITM3) in individuals able to naturally control infection (11). The post hoc integrated systems analysis of the RV144 trial samples also uncovered roles for Type I and II interferons, as well as IRF7 and mTORC1, in susceptibility to infection post-vaccination (12). The mammalian target of rapamycin metabolic pathway has also been shown to be key to enhanced CD8 activity in elite controllers (13). These studies highlight the potential to utilise systems methods to define the correlates associated with the control of HIV-1 infection.
Heterogeneity in the data generated using high throughput systems methodologies can be further complicated by the inherent variability encountered at the population, individual, cellular and molecular levels (14). Studies by Chowdhury et al. (13) and others (15,16) have highlighted the diversity of transcriptional profiles that exist within a single subset of T lymphocytes that accounts in part for control of HIV infection. The control of HIV replication in vivo is multifactorial. Indeed, viral control has been shown to be associated with age at infection, time post-infection, gender, HLA type, virus subtype and route of infection (17-22). Obtaining sufficient numbers of samples to allow for the control of all these confounders and the discovery of new correlates of disease trajectory poses a real challenge (23,24).
We applied a unique approach to retrospectively classify HIV-infected individuals in order to aid the delineation of a profile associated with early and persistent in vivo control of HIV-1 replication in the absence of antiretroviral treatment. Using this approach, we defined three groups of HIV-infected volunteers from Protocol C, a multisite early infection prospective cohort consisting of 613 participants recruited from nine clinical research centres in five African countries (25,26) (Figure 1, also on https://dataspace.iavi.org/). These groups comprised volunteers with low (n = 10), medium (n = 10) and high (n = 10) set point viral load who were identified within days of their estimated date of HIV infection and followed over time for up to 7 years. Importantly, the low viral load volunteers showed rapid and persistent control of viral replication in the absence of treatment, and had sufficient samples available during the resolution of peak infection to enable the investigation of signatures associated with rapid and persistent HIV control. We present the profile for fifty-two soluble proteins in the acute phase of HIV infection across the three groups, demonstrating the potential to identify unique signatures associated with ART-naïve viral control using this selection approach.

Study Population and Selection Approach
Volunteers included in this study were selected from a historic acute and early HIV infection prospective cohort drawn from nine clinical research centres in South Africa, Zambia, Uganda, Kenya and Rwanda enrolled from 2006 to 2011 (Figure 1). Details of study characteristics, distributions, recruitment procedures, initial immunological methods and epidemiological profiling data of the Protocol C cohort are described elsewhere (18,22,25,26).
Individuals from the study were ranked according to the magnitude of their mean viral load (geometric mean) measurements taken between 9 and 36 months post-EDI (estimated day of infection), and divided into quartiles. Mean viral load was calculated for 362 of the 613 volunteers from the Protocol C cohort who did not receive antiretroviral treatment. A matching algorithm (27) based on the nearest neighbour was then applied to define the groups of volunteers for this study. Briefly, we selected 10 Low viral load volunteers (LVLVs) from the first quartile of the ranked dataset who had a visible period of dynamic control of infection, demonstrated by the presence of a downslope in their viral load measurements, and who were able to control viral load to ≤2,000 copies/mL in the first 3 years of infection. The LVLVs were then matched with Intermediate viral load volunteers (IVLVs, n = 10) from the second and third quartiles, and High viral load volunteers (HVLVs, n = 10) from the fourth quartile of the ranked dataset. Volunteers were matched on age, clade, country, gender and risk group. Soluble proteins in plasma were assessed at two timepoints in the period following peak viraemia and within the 1st year of infection for each of the selected volunteers (Supplementary Table 1). All individuals assessed for this study were treatment naïve.
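The ranking step can be illustrated with a minimal Python sketch. It is not the code used in the study: the volunteer identifiers and viral load values are hypothetical, and the rule for splitting the ranking into four near-equal groups is an assumption made for illustration.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive viral load measurements (copies/mL)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def assign_quartiles(mean_vl_by_id):
    """Rank volunteers by mean viral load and split the ranking into four
    near-equal groups (1 = lowest viral loads, 4 = highest)."""
    ranked = sorted(mean_vl_by_id, key=mean_vl_by_id.get)
    n = len(ranked)
    return {vid: (rank * 4) // n + 1 for rank, vid in enumerate(ranked)}

# Hypothetical measurements taken 9-36 months post-EDI (not study data).
viral_loads = {
    "V01": [1200, 900, 1500],
    "V02": [25000, 31000, 18000],
    "V03": [64000, 80000, 71000],
    "V04": [7000, 5200, 9100],
}
mean_vl = {vid: geometric_mean(vls) for vid, vls in viral_loads.items()}
quartiles = assign_quartiles(mean_vl)
```

The geometric mean is equivalent to averaging on the log scale, which limits the influence of a single high measurement on the ranking.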

HLA Frequency Calculation
To determine the HLA I frequencies within the 362 ART-naïve Protocol C volunteers, two-digit allelic frequencies were calculated using the Los Alamos National Laboratory HLA frequency and Graphing tool (https://www.hiv.lanl.gov/content/immunology/hla). For each MHC Class I allele with an allele frequency >5%, we compared the set point viral load of all positive volunteers with that of all negative volunteers. Statistical tests used are described in subsequent sections.
Cryopreserved plasma from the incidence study was thawed at room temperature and applied to the panels according to the manufacturer's protocol. Plates were read on the MSD plate reader model MESO QuickPlex SQ 120. All plasma samples for the study were thawed and run at the same time and grouped on plates in the order in which they were selected for the study to avoid intra-assay variability. Data were collected for two replicates per sample using the MSD software (Discovery Workbench Version 4.0). A five-parameter logistic regression formula was used to derive sample concentrations from the standard curves. Analytes below the lower limit of detection were assigned a concentration of half the lower limit of quantification (LLOQ).
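As an illustration of the back-calculation described above, the sketch below uses one common parameterisation of the five-parameter logistic (5PL) curve and inverts it to recover a concentration from a signal. The parameter values, function names and limits are hypothetical; the curve fitting itself was done in the vendor software and is not reproduced here.

```python
def five_pl(x, a, b, c, d, g):
    """Signal predicted by a 5PL curve at concentration x.
    a: signal at zero concentration, d: signal at saturation,
    c: mid-range concentration, b: slope, g: asymmetry."""
    return d + (a - d) / (1 + (x / c) ** b) ** g

def inverse_five_pl(y, a, b, c, d, g):
    """Concentration that yields signal y (valid for a < y < d)."""
    ratio = (a - d) / (y - d)
    return c * (ratio ** (1 / g) - 1) ** (1 / b)

def impute_low(concentration, llod, lloq):
    """Values below the lower limit of detection become half the LLOQ."""
    return lloq / 2 if concentration < llod else concentration

# Hypothetical fitted parameters for one analyte (not study values).
params = dict(a=100.0, b=1.2, c=200.0, d=30000.0, g=0.8)
signal = five_pl(50.0, **params)
concentration = inverse_five_pl(signal, **params)  # recovers ~50.0
```

The inverse is only defined for signals strictly between the two asymptotes, so out-of-range readings would need to be handled separately before imputation.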

Statistical Analysis
We analysed the MSD data using a non-parametric approach, because of the small sample size and the non-Gaussian distribution, as determined using the Shapiro-Wilk test.
Non-parametric analysis (Mann-Whitney test, comparing ranks) of the differences in VL measurements for individuals expressing HLA alleles was performed in GraphPad Prism 8 Software. P-values <0.05 were considered significant. Propensity score matching to define study populations was executed using the MatchIt package (27) in R. A non-parametric test for similarities in the age distribution between the study groups was performed in SPSS 24.
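The rank comparison between allele carriers and non-carriers can be sketched as follows. This is a simplified Mann-Whitney U test using a normal approximation without tie or continuity corrections, so its p-values may differ from those reported by statistical packages, which often use exact methods for small samples. The example values are hypothetical log10 viral loads.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def mann_whitney_u(x, y):
    """Return (U, two-sided p) for two independent samples."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return u, p

# Hypothetical log10 set point viral loads (not study data).
carriers = [3.1, 3.4, 2.9, 3.8, 3.3]
non_carriers = [4.4, 4.9, 4.1, 5.0, 4.6, 4.3]
u_stat, p_value = mann_whitney_u(carriers, non_carriers)
```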
We computed descriptive summary statistics, including the median and inter-quartile range (IQR) and Spearman correlations and excluded MIP-3α from further analyses due to missing data (40%). The null hypothesis of the difference between the two time points was assessed using Wilcoxon Signed Ranks tests. We computed the (rank based) correlation matrix for each group (LVLVs, IVLVs, HVLVs) by averaging the concentrations over time and presented the correlation matrices as heat maps.
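The rank-based correlation matrices described above can be sketched in a few lines. The example averages each analyte over the two timepoints and then computes Spearman correlations as Pearson correlations of the ranks. The analyte names and concentrations are hypothetical, and constant columns (which would give a zero denominator) are not handled.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def timepoint_mean(t1, t2):
    """Average each volunteer's concentration over the two timepoints."""
    return [(a + b) / 2 for a, b in zip(t1, t2)]

def correlation_matrix(analytes):
    """analytes: dict name -> list of per-volunteer mean concentrations."""
    names = list(analytes)
    return {a: {b: spearman(analytes[a], analytes[b]) for b in names}
            for a in names}

# Hypothetical concentrations (pg/mL) for one group of four volunteers.
group = {
    "analyte_A": timepoint_mean([1.0, 2.5, 3.0, 4.2], [1.2, 2.1, 3.4, 4.0]),
    "analyte_B": timepoint_mean([9.0, 7.5, 6.1, 2.0], [8.8, 7.0, 5.5, 2.4]),
}
matrix = correlation_matrix(group)
```

Plotting the lower triangle of such a matrix for each group gives a heat map of the kind described in the text.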
To investigate the association between the mean viral load of volunteers (which was used to define the study groups) and the concentration of proteins in peripheral blood, we fitted a linear robust regression model where a function of the ranks of the residuals was used instead of the Euclidean distance in the least squares estimation (28). SAA and CRP were excluded from the model due to singularity issues. Based on these univariate results, we selected the significant proteins and estimated a multivariate robust regression model after adjusting for multiple comparison (Bonferroni). We then fitted a multivariate regression model with the significant (adjusted p-value <0.05) proteins and removed those that were highly correlated to avoid multicollinearity issues in the regression fitting.
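The multiple-comparison step can be illustrated with a short sketch of the Bonferroni adjustment, in which each p-value is multiplied by the number of tests and capped at 1. The analyte names and p-values below are hypothetical and the helper names are ours; the rank-based regression itself is not reproduced.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def significant(p_by_name, alpha=0.05):
    """Names whose adjusted p-value falls below alpha."""
    names = list(p_by_name)
    adjusted = bonferroni([p_by_name[n] for n in names])
    return [n for n, p in zip(names, adjusted) if p < alpha]

# Hypothetical univariate p-values (not study results).
univariate_p = {"protein_A": 0.0004, "protein_B": 0.02, "protein_C": 0.5}
carried_forward = significant(univariate_p)
```

With three tests, a raw p-value of 0.02 becomes 0.06 after adjustment and is no longer carried forward, which shows how conservative the correction becomes as the number of analytes grows.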

RESULTS
The Outcome of Infection Is Linked to Gender, Viral Subtype, and the Expression of Immune Receptors on Lymphocytes
Set-point viral load represents a dynamic state of equilibrium between infecting virus and the immune response in the absence of complete elimination of the virus (29). It remains an important measure of disease progression. Set-point viral load was calculated for 362 of the 613 volunteers who did not receive antiretroviral treatment over at least 36 months of follow-up. Volunteers were then ranked and divided into equal quartiles to explore any associations with disease progression (Figure 2A). Median set point VL for the 362 volunteers was 26,061 copies/mL (IQR: 6,981-65,813 copies/mL). The mean CD4 count calculated for the same period for all 362 volunteers was inversely correlated with mean viral load (r = −0.2892, p < 0.0001) (Figure 2B).
We examined the distribution of gender and viral subtype within our ranked dataset. In agreement with previous studies (18,22), our analysis showed that women had lower set point viral loads than men. There was also a higher representation of Subtype A in the lower quartiles, with the opposite being true for Subtype C-infected subjects (Figure 2C).
To assess the impact of MHC on disease progression, we compared the influence of Class I alleles with an allele frequency >5% on the set point viral load and found that individuals with B*57 (p < 0.0001) and C*04 (p = 0.0335) had lower and higher set point viral loads, respectively, compared with individuals lacking either HLA allele (Figures 3A,B).

A Propensity-Based Approach to Sampling an HIV Incidence Cohort to Aid Systems Analysis
During untreated HIV infection, the rate of viral replication and set-point probably reflects the dynamic interaction between the virus and host responses (23). We defined Low Viral Load Volunteers (LVLVs) as those with a set point of <2,000 copies/mL in line with previous studies (1) and were able to identify 40 Protocol C volunteers within this category. Ten of the 40 LVLVs identified had sufficient samples and viral load measurements collected over the period following peak viraemia for analysis. There was a steady decrease in viral load in the LVLVs over the 12 months following peak viraemia in the absence of ART (Figure 4A, Supplementary Figure 1). We focussed initially on the period immediately following peak viraemia in an effort to describe correlates of the early control of viral replication.
The ability to control viral replication in vivo has been linked to factors such as age at infection, time post-infection, gender, HLA type, route of virus entry and HIV subtype (17-22). In an effort to control for the confounding effects of some of these factors, we utilised a propensity score matching approach (27), which allows the matching of persons in one group with persons in another group based on each case's propensity score. For all the volunteers in the first group (LVLVs), we selected an equal number of volunteers from quartiles 2 and 3 (Figure 2A) based on their propensity scores for age, time post infection, gender, route of virus entry and infecting virus subtype. We designated the group selected from quartiles 2 and 3 as Intermediate viral load volunteers (IVLVs). We applied the same matching approach to volunteers in quartile 4 to identify 10 High viral load volunteers (HVLVs) from the ranked dataset. It was impossible to match on HLA haplotypes due to the diversity of alleles represented in the cohort.
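The idea of nearest-neighbour matching on a propensity score can be sketched as below. This is a simplified greedy 1:1 match without replacement, not the MatchIt implementation used in the study. It assumes the propensity scores have already been estimated (for example by logistic regression on the matching covariates); all identifiers and scores are hypothetical.

```python
def nearest_neighbour_match(index_group, candidates):
    """Greedy 1:1 matching without replacement.
    index_group, candidates: dict of volunteer id -> propensity score.
    Returns dict of index id -> matched candidate id."""
    available = dict(candidates)
    matches = {}
    # Assumption: index volunteers are processed from highest to lowest score.
    for vid, score in sorted(index_group.items(),
                             key=lambda item: item[1], reverse=True):
        if not available:
            break  # no candidates left to match
        best = min(available, key=lambda cid: abs(available[cid] - score))
        matches[vid] = best
        del available[best]  # each candidate is used at most once
    return matches

# Hypothetical propensity scores (not study data).
lvlv_scores = {"L1": 0.30, "L2": 0.60}
quartile4_scores = {"C1": 0.28, "C2": 0.61, "C3": 0.90}
pairs = nearest_neighbour_match(lvlv_scores, quartile4_scores)
```

Because the match is greedy, the processing order can change which pairs are formed; checking covariate balance after matching remains necessary.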
We successfully identified three distinct groups consisting of 10 individuals per group, from the ranked dataset that were matched on age, gender, risk group (route of infection), country and infecting subtype (Figures 4A-C, Table 1) and for whom samples were available at two timepoints within the initial phase of control of viral replication immediately after peak viraemia. Given the age of the cohort, sample availability within this period (obtained from dataspace.iavi.org) was a real challenge. Days post EDI was also considered during the selection of these early timepoints with matched timepoints no more than 6 months apart where possible (Supplementary Table 1).

Concentrations of Soluble Markers in the Acute Phase Are Associated With Early and Sustained Control of in vivo Viral Load
We measured the levels of 52 soluble proteins in plasma at two timepoints following peak viraemia and report the median and IQR for the two timepoints for all volunteers (Table 2). For the most part, protein concentrations were not significantly different across the two early timepoints assessed, with the exception of VCAM1 (p = 0.02) and IL-10 (p = 0.012) for the overall dataset, IL-6 (p = 0.037) and IL-17C (p = 0.006) for LVLVs, SAA (p = 0.027) and CRP (p = 0.027) for IVLVs, and IFN-γ (p = 0.01) for HVLVs (group data also shown in Supplementary Figures 2, 3).
To further investigate the association between the mean viral load of volunteers (which was used to define the study groups) and the concentration of proteins in peripheral blood, we applied a univariate regression model where a function of the ranks of the residuals was used instead of the Euclidean distance in the least squares estimation (28). The estimates for nine analytes that were significant are shown in Table 3 (GM-CSF, p < 0.001; MIP-1α, p < 0.001; IL-8, p < 0.001; IFN-γ, p < 0.001; IL-2, p < 0.001; IL-13, p = 0.05; IL-17C, p < 0.001; IL-9, p = 0.02; IL-31, p = 0.02).
Based on the results of the univariate analyses, we generated a multivariate robust regression model using the proteins which were associated with mean viral load, after adjusting for multiple comparison (Bonferroni) and excluding highly correlated analytes to avoid issues of multicollinearity. The only protein that remained significantly associated with mean viral load was MIP-1α (p < 0.001) after p-value adjustment (Hommel). IL-8 was excluded from the model because it was highly correlated with GM-CSF (Spearman correlation, 0.71).
Exploratory heatmaps based on the lower triangle Spearman correlation matrices and using the average plasma protein value between the two timepoints for each group suggest that differences exist in the relationships between different plasma proteins across the groups (Figure 5), with positively correlated proteins seen more frequently in IVLVs and HVLVs than in LVLVs.

DISCUSSION
We present a unique approach to classifying individuals drawn from an acute and early HIV infection cohort that considers a range of factors known to have an impact on disease progression, to efficiently define the peripheral secretory profile associated with early and sustained control of in vivo viral replication. This selection approach enabled us to define the profile of 52 proteins deployed within the specific period of dynamic immunological control of viral replication for all volunteers in the absence of antiretroviral treatment. As expected, our ranking approach shows that measurements for CD4 cells, which are the first cells to become infected during transmission (30,31) and continue to be a primary target for HIV-1 (32), were negatively correlated with set point viral load calculations.
HIV subtype has been shown to be associated with disease trajectory and outcome (21,22,33,34) but limitations of study size and design have meant that such findings have been largely descriptive in nature. By matching the selected individuals from Protocol C based on other relevant confounding variables, we were able to address the independent contribution of viral subtype to the control of viral replication in vivo. We report similar observations to Price et al. (22), who performed a subtype-by-geographic-region covariate analysis on the whole Protocol C cohort and showed that Subtype A-infected volunteers were more likely to control viral load than Subtype C-infected volunteers. Also in this cohort, Amornkul et al. (18) showed that subtype C is associated with faster progression to AIDS and CD4+ T cell decline compared to subtype A. Here we also show a higher representation of Subtype A- relative to Subtype C-infected subjects in the lower quartiles of the viral load-ranked dataset of the same cohort.
Whilst the diversity of HLA types represented in the cohort did not permit complete matching of volunteers based on this factor, well-reported trends like the favourable influence of B*57 on disease control were evident. The presence of the less studied C*04 HLA Class I allele, which is reportedly associated with B*35 on chromosome 6 (35,36), appeared to have a less favourable influence on disease control in this study. This potentially deleterious effect of the C*04 allele on disease progression has only been reported by a few studies that focused on either HIV-1 subtype B alone (35-37) or mostly C (38). Our data covering HIV-1 subtype A, C, D and recombinant viral subtypes suggest that this effect of C*04 may apply regardless of the subtype of the virus. It has been suggested that this effect may be mitigated by the association of C*04 with other deleterious HLA Class I alleles or with the killer-cell immunoglobulin-like receptor (KIR) KIR2DS4 (37). These observations provide some validation of the novel propensity matching method presented in this study.
Early HIV infection is characterised by a cytokine storm that is detectable at the levels of gene (11) and protein expression (39). We were able to focus on the dynamics of the plasma protein response following peak viraemia and during the 1st year of infection when the immune system is most actively involved in the control of viral replication. The fact that the circulating levels of most of the plasma proteins appeared to be largely stable in all volunteers with a few exceptions (IL-6, IL-17C, SAA, CRP, and IFN-γ) over the period assessed was somewhat surprising. For the most part the indicated analytes have been linked previously to HIV disease outcomes (40,41), even if the strength of their associations remains poorly understood. A notable exception was the falling levels of IL-17C observed in LVLVs.
In contrast to other members of the IL-17 family (IL-17A and IL-17F), IL-17C is predominantly produced by epithelial cells (42-44) and not leucocytes (45), with broad activity on epithelial cells, TH17 leucocytes (46) and monocytic lineage cells (43). Although not fully described within the context of HIV-1 infection, early release of IL-17C in other models of infection suggests a dual function in the regulation of both innate and adaptive immune responses (45). The decreasing levels of serum IL-17C seen in LVLVs may suggest a role in early recruitment and differentiation of innate and adaptive modulators in response to HIV infection, i.e., prior to peak viral load, and subsequent downregulation in those who eventually go on to control infection. The difference in its cellular source compared to other members of the IL-17 cytokine family makes it a potentially interesting target for further studies to examine a potential role in barrier immunity during and immediately after transmission at mucosal surfaces.
Our regression analysis suggests a relationship between the levels of nine plasma proteins including IL-17C in the period following peak viraemia and set point viral load, with MIP-1α being the most significantly associated with mean viral load in the multivariate analysis. MIP-1α is one of three well-characterised β-chemokines produced by immune cells including CD8 and CD4 T cells that have been implicated in the inhibition of HIV infection (47,48). Whilst the evidence strongly suggests that T cell capacity to secrete β-chemokines like MIP-1α is strongly associated with in vivo and in vitro viral control (49-52), the relationship between the quantity of circulating soluble β-chemokines and disease outcome appears to be less clear (53-55). Our results suggest that the concentration of MIP-1α in the period following peak viraemia is strongly associated with set point viral load.
Several groups have assessed the relationships between specific cytokines and the control of viral replication (56-58). Our exploratory analysis of the correlations between 52 proteins during the period following peak viraemia suggests that disease progression is underpinned by clear differences in the deployment of immune modulators. Whilst the functional implications of the negatively correlated proteins in LVLVs are not immediately clear, this and previous observations validate the unique selection approach presented here and the potential to apply it to support systems investigation of the correlates that define natural control of HIV-1 infection.
The impact of biological sex on the outcome of viral infections has been highlighted by other groups (59,60). Our ranked data set showed that women had lower set point viral loads than men. In comparing the two groups (men and women), we identified differences in the levels of IL-7, a cytokine implicated in early T cell development, proliferation and differentiation (61), in the period studied. Levels of thymus and activation-regulated chemokine (TARC), a chemokine constitutively produced in the thymus and by keratinocytes and dendritic cells (62,63) with a powerful chemoattractant effect on T cells, were also found to be higher in men in this study. Whilst recent studies examining these sex differences point to a possible role for increased levels of innate inflammatory cytokines (60) on disease progression during viral infections, our data may indicate differences in how T cells are activated/recruited and deployed in men and women. A recent study by El-Badry et al. highlights the impact of plasma levels of 17β-estradiol in women on T cell activation in the acute phase of HIV infection. It is worth noting the small sample size from which these observations were made; confirmation of our observations will be required.
Taken together with previous results, it is reasonable to state that whilst our results suggest an association between levels of soluble MIP-1α in the period of active immune suppression of viral replication and disease progression, they also support the notion that the mechanism of in vivo suppression of HIV is likely multifactorial (64). As such, efforts aimed at reducing the potential for noise in datasets will go a long way to enable the definition of the correlates associated with immunological control of HIV.
We present our unique selection approach as a way to potentially counter some of the noise associated with extreme heterogeneity in datasets, allowing for the application of high-resolution systems methodologies to define the correlates associated with natural control of HIV infection. Whilst the ranking and propensity-based selection methods presented may not directly predict correlates of natural protection against HIV-1, they enable the exclusion of any noise arising as a result of the factors that are controlled for in the study design. Given that HIV pathogenesis is multifactorial, the tendency for such noise to obscure valid observations is considered a real barrier to the application of high dimensional (or systems) analytical methods to aid the definition of the correlates of natural control (14). We demonstrate the utility of the ranking and matching approach by independently confirming known factors associated with disease progression, albeit in a small study population. We also identify novel soluble proteins such as IL-17C and illustrate differences in the pattern of deployment of peripheral cytokines that tally with disease progression. These findings demonstrate the utility of the unique ranking and selection approach for systems analysis in subsequent studies.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
This study was reviewed and approved by the following ethical review boards: the Kenya Medical Research Institute Ethical Review Committee, the Kenyatta National Hospital Ethical Review Committee of the University of Nairobi, the Rwanda National Ethics Committee, the Uganda Virus Research Institute Science and Ethics Committee (Currently the UVRI Research Ethics Committee) and the Uganda National Council of Science and Technology, the University of Cape Town Health Science Research and Ethics Committee, the Bio-Medical Research Ethics Committee at the University of KwaZulu Natal, the University of Zambia Research Ethics Committee, and the Emory University Institutional Review Board. Informed consent was obtained from all volunteers prior to the collection of study related resource. All methods were carried out in accordance with relevant guidelines and regulations. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
JM was responsible for conceptualisation, methodology, formal analysis, data curation, sample application preparation, original draft preparation, review, and editing. EN was responsible for methodology development, sample application preparation, and original draft review. AF-S was responsible for statistical analysis. CS was responsible for running immunological assays. CK, JD, SB, GM, and PH were responsible for methodology development and original draft review. JH was responsible for assay review, processing and approval of sample application. DK, SJ, EM, and BA were responsible for methodology development, data curation, assay review and original draft review. EH, MP, ES, and JG were responsible for project conceptualisation, methodology development, data curation, granting sample access, review, and editing. The IAVI protocol C investigators were responsible for the initiation and successful completion of the Protocol C study. All authors contributed to the article and approved the submitted version.

FUNDING
This work was made possible by IAVI, which was supported by funding from many donors, including USAID, the Bill and Melinda Gates Foundation, the Ministry of Foreign Affairs of Denmark, Irish Aid, the Ministry of Finance of Japan in partnership with The World Bank, the Ministry of Foreign Affairs of the Netherlands, the Norwegian Agency for Development Cooperation, the United Kingdom Department for International Development, and the US Agency for International Development (the full list of IAVI donors is available at: http://www.iavi.org).