Factors Associated With HIV Infection in Zimbabwe Over a Decade From 2005 to 2015: An Interval-Censoring Survival Analysis Approach

Objectives: The main objective of this study was to compare results from two approaches for estimating the effect of different factors on the risk of HIV infection and determine the best fitting model. Study design: We performed secondary data analysis on cross-sectional data which was collected from the Zimbabwe Demographic Health Survey (ZDHS) from 2005 to 2015. Methods: Survey and cluster adjusted logistic regression was used to determine variables for use in survival analysis with HIV status as the outcome variable. Covariates found significant in the logistic regression were used in survival analysis to determine the factors associated with HIV infection over the 10 years. The data for the survival analysis were modeled assuming age at survey imputation (Model 1) and interval-censoring (Model 2). Results: Model goodness of fit test based on the Cox-Snell residuals against the cumulative hazard indicated that Model 1 was the best model. On the contrary, the Akaike Information Criterion (AIC) indicated that Model 2 was the best model. Factors associated with a high risk of HIV infection were being female, number of sexual partners, and having had an STI in the past year prior to the survey. Conclusion: The difference between the results from the Cox-Snell residuals graphical method and the model estimates and AIC value maybe due to the lack of adequate methods to test the goodness-of -fit of interval-censored data. We concluded that Model 2 with interval-censoring gave better estimates due to its consistency with the published results from literature. Even though we consider the interval-censoring model as the superior model with regards to our specific data, the method had its own set of limitations.


INTRODUCTION
The 90-90-90 targets was launched by the Joint United Nations Programme on HIV/AIDS (UNAIDS) and partners with the aim to diagnose 90% of all HIV positive persons, provide antiretroviral therapy (ART) for 90% of those diagnosed, and achieve viral suppression for 90% of those treated by 2020 (1).
In Zimbabwe, a population based survey carried out in 2016 reported that 74.2% of people living with HIV (PLHIV) aged 15-64 years knew their HIV status. Amoung the PLHIV who knew their status, 86.8% self-reported current use of Antiretroviral treatment (ART), with 86.5% of those who self-reported, are virally suppressed (2). In order for these 90-90-90 targets to be met, prevalence, and incidence rates estimates are crucial in understanding the current status of the HIV epidemic and determine whether the trends are improving in a manner which can facilitate to achieve the 2020 target.
The gold standard for estimating the HIV incidence is to test uninfected individuals for new infections periodically, however this method is feasible though costly and time-consuming. In addition, even if an HIV negative cohort is followed over-time, the exact date of infection is rarely observed (3). In this scenario, an interval can be determined between the latest negative and the earliest positive test dates. Taking into consideration the issue of cost and time, cohort analysis for estimating the HIV incidence rate in a general epidemic will not produce estimates which are representative of the whole population. Due to these reasons, sentinel surveillance systems have been set up to monitor the spread of the pandemic (4). In addition, population-based surveys, in which HIV tests are performed, are carried after every 5-years in Zimbabwe. The advantage of the population-based survey is that, data is more representative of the population than a cohort. On the other hand, the same data does not provide the exact date of the infection but rather provide what is called "current status data." Current status data occurs when an individual is observed at one single point, and the only information obtained is whether the event of interest has occurred (5). An example of current status data includes information collected during a demographic health survey, in which an individual is tested whether they are HIV positive or negative. If an individual were found to be HIV positive, the individual was recorded as left censored at the time the test was done. If an individual were HIV negative, they were recorded to be right censored. Sometimes current status data can be referred to as case intervalcensored data, with case II interval-censored data referred to as the general case (6). Interval censoring takes into account the range, that is, an interval inside of which one can say the outcome of interest has occurred (7). Given that we would want to determine the factors associated with the hazard of infection using data from these surveys, then survival analysis can be implemented.
In the setting of standard survival analysis, modeling the hazard rate of HIV infection can be achieved by imputing the time-to-onset of disease as the time at diagnostic visit (3). Modeling current status data using the mentioned two approaches may overestimate the hazard rate. However, analyzing this type of data using interval censoring will be a better approach. Although they are documented advantages of interval censoring compared to the standard cause-specific survival model, the superiority of the interval-censoring remains unclear in estimating the effect of different exposures on the risk of HIV infection. With this argument in mind, the main objective of this study was to compare results from these different approaches for estimating the effect of different factors on the risk of HIV infection and determine the best fitting model.

Study Design and Area
This study used data from Zimbabwe, a landlocked country, bordered by Mozambique on the East, South Africa on the South, Botswana on the West, and Zambia on the North and Northwest. Zimbabwe is sub-divided into 10 Provinces which are: Matabeleland South, Matabeleland North, Mashonaland East, Mashonaland Central, Mashonaland West, Midlands, Masvingo, Manicaland, Harare, and Bulawayo. Each province is subdivided into districts, and each district is made up of wards. The designs for the three 5 yearly surveys were cross-sectional.

Zimbabwe Demographic Health Survey Data and Sampling
Demographic and Health Surveys (DHS) are nationallyrepresentative household surveys that have been implemented in approximately 70 countries since 1984 (8)(9)(10). They provide data for a wide range of monitoring and impact evaluation in the areas of population, health and nutrition. Data used for the analysis were obtained from 2005-06, 2010-11, and 2015 ZDHS and were retrieved from the DHS programme website (https:// dhsprogram.com) (11). A representative probability sample of 10,800, 10,828, and 11,196 households were selected for the 2005-06, 2010-11 and 2015 ZDHS, respectively. A two-stage cluster sampling technique was used to select the households. The first stage selected 400, 406, and 400 enumeration areas (EAs) for 2005-06, 2010-11, and 2015 ZDHS, respectively. At the second stage, using a complete listing of households in the selected EAs, a fixed number of households were randomly chosen. This  allowed the use of EAs specific weights to be assigned in the design (8-10).

Measurement of the Outcome (HIV Status) and Explanatory Variables
With consent from the respondent or parent/guardian (for minors), blood samples were collected in all households for HIV testing in the laboratory for females aged 0-49 and males aged 0-54. Blood spots were collected on filter paper from a finger prick and transported to a laboratory for testing. An initial ELISA test was performed, and then retesting of all positive and 5-10 % of the negative tests with a second ELISA was done. If they were discordant results on the two ELISA tests, a new ELISA or a Western Blot was performed. The data used for this study was obtained from the DHS Data Archives (11) and only included individuals aged 15-49 years. The following explanatory variables were extracted from the dataset: sex, marital status, education level, religion, currently employed, place of residence, STI treatment in the past 12 months and number of sexual partners (8-10).

Ethical Considerations
The  survey data files for legitimate academic research. Registration was required for access to data.

Statistical Methodology
Application to Zimbabwe Demographic Health Survey The socio-demographic datasets for men and women records were appended to provide a single analysis dataset for all the three surveys. The appended dataset was then merged using the unique combination of the individual line number, household line number, and cluster (EA) number to the HIV prevalence dataset. All individuals without an HIV test result, never been sexually active, and individuals who did not have an age at first intercourse were excluded from the analysis. In the case of individuals who were HIV positive when the survey was conducted, the age at HIV infection was defined as age at survey date and for individuals who were HIV negative, the age at HIV infection was right-censored at the date of survey in Models 1 (Figure 1). Accounting for interval-censoring, the age at HIV infection was interval-censored between age at first sexual intercourse and age at date of survey, but right-censored at the age at survey, for individuals who were HIV negative for Model 2. All the models assumed a parametric Weibull distribution for the baseline hazard λ 0 (t) which allowed estimation of β which is the vector of regression coefficient. Model Specification for all the models, refer to Supplementary Data.

Statistical Analysis
In this study, the outcome or response variable was the HIV status, a binary variable. The study investigated the socio-cultural, socio-economic, behavioral, and demographic factors, which are associated with HIV. Trends in HIV prevalence were assessed using the non-parametric trend test in STATA. A stepwise logistic regression approach was adopted in STATA SE version 15.1 statistical software (12) using the command svy: swaic (13) on some selected explanatory variables highlighted earlier. Factors that were significantly associated with HIV from the stepwise survey logistic regression were then considered for the parametric survival analysis. The most suitable baseline hazard function was investigated using the package icenReg (14) in R software. The data were modeled assuming age at survey imputation and intervalcensoring (Figure 1). The model goodness-of-fit (GOF) test was assessed using the Akaike Information Criterion (AIC) and two graphical methods which included the Cox-Snell residuals. All analysis were performed in STATA SE version 15.1 and R statistical software.

RESULTS
The 2005 The parametric Weibull distribution was used to investigate the factors associated with HIV infection. The Weibull parametric function was suitable to model the baseline hazard distribution, as shown in Figure 4. Model goodness of fit test indicated that Model 2 was the best model based on the Akaike Information Criterion (AIC) presented in Table 2. A graphical goodness of fit test was performed, where the semi-parametric model was compared to the parametric model. According to the results in Figure 3, Model 1 fits the data better than Model 1 and Model 2. Based on Figure 3, Model 1 overestimates the survival rates between 15 and 35 years and underestimates the survival rates between 35 and 49 years. However, Figure 4 with the Cox-Snell residuals shows that Model 1 is better than Model 2 as the estimated cumulative hazards are close to the reference line which is formed by the Cox-Snell residuals. Furthermore, dot charts depicting the importance of variables in the three models were plotted ( Figure 5). Place of residence was the least important variable for all the three models, while marital status and sex where the most important variables for Model 1 and Model, respectively ( Figure 5).
The risk of HIV infection was lower in males than females, as shown in Table 3   the risk of HIV infection for all the survey years was lower in individuals who were separated/divorced/widowed in reference to Model 1, but Model 2 results indicated that the risk was higher for that particular group of individuals as compared to those who were single ( Table 3). According to the results in Model 1, the risk of HIV infection was almost the same for those who had one or more than two sexual partners, as compared to those who did not have any sexual partners in the past 12 months prior to the survey. However, Model 2, the risk was more than four times for those who had more than two sexual partners, as compared to those who did not have any sexual partners in the past 12 months prior to the survey. Of interest, the risk of HIV infection for those who had more than two sexual partners increased over time for Model 2, with Model 1 having a decreasing trend ( Table 3).

DISCUSSION
Frequently, researchers are interested in using standard survival models in determining the failure times, however interval censoring has become increasingly common. The purpose of this study was to identify risk factors for HIV infection using three different models and determine the best fitting model. To identify the best fitting model, we utilized the Akaike Information Criterion (AIC), where the model with the least AIC value was the best fitting model. We also used graphical methods to ascertain the best fitting model, however graphical goodness-of-fit test for interval-censored data is rare with the available methods are still lacking in implementation (15)(16)(17)(18). We managed to check the goodness-of-fit for all the models by overlaying the semiparametric model with the fitted parametric model of the survival function and also plotting the Cox-Snell residuals against the cumulative hazard. The model with interval censored data resulted in better estimates of the risk of HIV infection as compared to the standard survival model i.e., Models 1 based on the AIC value and Figure 3. The superiority of Model 2 was due to the ability to precisely mimic some of the results in literature from previous studies which also used population-based HIV surveys or specific cohorts in Zimbabwe. For example, a study in Zimbabwe conducted between 1999 and 2001 reported that they observed an increasing trend of HIV incidence among the educated individuals, which was rather unexpected; however, this might have been due to a higher socio-economic status, a factor reported to be associated with HIV infection in the region of Africa (19,20). However, a systematic review which explored time trends in the association between educational attainment and risk of HIV infection in sub-Saharan Africa, reported a shift in the HIV epidemic from educated to the uneducated (21). According to a study in which they used the 2005/06 Zimbabwe Demographic Health Survey (ZDHS) to determine the relationship between HIV status and the demographic and socio-economic characteristics among adults in Zimbabwe by construction the risk profile of the average adult, they concluded that there was a significant negative association between HIV infection and education (22). They further clarified that an extra year of schooling to an average of 8 years (i.e., secondary education and above) was associated with a 0.5 percent point decrease in the probability of HIV infection for Zimbabwe (22). Another study which used the 2010/11 ZDHS also reported a lower risk of HIV infection of individuals with secondary level education and above (23). Based on these findings from literature, and the trend observed, the model with intervalcensoring was consistent with the reported results, however, the other two models reported a higher risk of HIV infection for individuals with secondary level education, which was contrary to the findings from literature. according to an in-depth analysis of the 2005/06 ZDHS on the risk factors associated with HIV infection, it was reported that individuals who never had any sexually transmitted infection 12 months prior the survey were significantly associated with a 0.437 times lower risk of HIV infection compared with their counterparts who had a sexually transmitted infection during the same period (24). This result was close to the results obtained for the hazard of HIV infection in the model with interval-censoring. Another example was a study of the analysis of 2005/06 ZDHS, which reported that the likelihood of being HIV infected increased with the number of sexual partners and decreased with the level of faithfulness to a spousal partner. In the same study, it was reported that the odds of being HIV infected were 3 to 4 times greater among those who had two more sexual partner (25). These results were again consistent with results from the model with interval-censoring rather than the other two models for the 2005/06 ZDHS. On a general note, a 2013 study noted that having a large number of life partners increased HIV infection in a cohort from Manicaland (26).
Comparisons of the three models used in this study reveal a consistent match on the factors associated with HIV infection estimated from ZDHS data and the results obtained from previous studies using population-based HIV surveys or specific cohorts in Zimbabwe. For example, in all the three models, and based on all the three surveys, i.e., ZDHS 2005/06, 2010/11 and 2015, females were more at risk of HIV infection than men. Similarly, studies which determined the factors associated with HIV infection using the 2005/06 and 2010/11 ZDHS reported the same findings (22)(23)(24). However, though the surveys were the same, they did not use the same analytical methods to reach the same conclusions. Results suggest that marriage was associated with a lower risk of infection based on all the models. This is further supported by a study which analyzed the 2005/06 and 2010/11 ZDHS data (23). The study reported that marital union was positively associated with the decline of HIV infection for both men and women (23). Another study to determine the baseline predictions of HIV-1 acquisition among women reported that being unmarried was the strongest risk factor for HIV-1 acquisition (27). Results from these studies are again consistent with what all the models in our studies. Even though the models reported results similar to what had already been reported in literature, the precision of the model with interval-censoring in explaining some of the covariates is what stood out the most. However, the Cox-Snell residuals clearly showed that Model 1 was the best fitting model. The difference between the Cox-Snell residuals graphical method and the model estimates maybe due to the lack of adequate methods to test the goodness-of -fit of interval-censored data as cited by other authors (15-18).
The main strength of this study dependent on the quality of the data obtained from the surveys. These data were derived from population-based surveys, which in reality provides more reliable and robust data. Another strength of this study was due to the fact that we did not restrict our analysis to one method, however, we had the opportunity to determine the best model to fit the hazard of infection by comparing two different scenarios. For instance, if the median survival time for HIV infection was 5 years given the type of data we had, and the intervals were about 3 to 6 months wide, then we would have no reason to complicate the analysis by considering interval censoring. On the other hand, if the intervals were about 1 year or longer, then accounting for uncertainty in the analysis was necessary, which we did when we implemented the intervalcensoring approach. Another reason for concluding that intervalcensoring gave better estimates was due to its consistency with the published results from literature. Even though we consider the interval-censoring model as the superior model with regards to our specific data, the method had its own set of limitations. These limitations included the wide range of intervals used, which could have underestimated or overestimated the effect of other factors on the risk of HIV infection. Inclusion of competing risks factors in the model would have greatly improved the modeling approach. Further studies can be done on imputation models, which imputes an estimated time of HIV infection based on the data.

AUTHOR CONTRIBUTIONS
RB and EM conceived of the presented idea and verified the analytical methods and models. RB developed the theory and performed the data analysis. EM supervised the findings of this work. All authors discussed the results and contributed to the final manuscript.