Test–Retest Reliability of Self-Reported Sexual Behavior History in Urbanized Nigerian Women

Background: Studies assessing risk of sexual behavior and disease are often plagued by questions about the reliability of self-reported sexual behavior. In this study, we evaluated the reliability of self-reported sexual history among urbanized women in a prospective study of cervical HPV infections in Nigeria. Methods: We examined test–retest reliability of sexual practices using questionnaires administered at study entry and at follow-up visits. We used the root mean squared approach to calculate the within-person coefficient of variation (CVw) and calculated the intra-class correlation coefficient (ICC) using two-way mixed effects models for continuous variables and kappa (κ) statistics for discrete variables. To evaluate the potential predictors of reliability, we used linear regression and log binomial regression models for the continuous and categorical variables, respectively. Results: We found that self-reported sexual history was generally reliable, with overall ICC ranging from 0.7 to 0.9; however, the reliability varied by the nature of the sexual behavior evaluated. Frequency reports of non-vaginal sex (agreement = 63.9%, 95% CI: 47.5–77.6%) were more reliable than those of vaginal sex (agreement = 59.1%, 95% CI: 55.2–62.8%). Reports of time-invariant behaviors were also more reliable than frequency reports. The CVw for age at sexual debut was 10.7 (95% CI: 10.6–10.7) compared with the CVw for lifetime number of vaginal sex partners, which was 35.2 (95% CI: 35.1–35.3). The test–retest interval was an important predictor of reliability of responses, with longer intervals resulting in increased inconsistency (average change in unreliability for each 1-month increase = 0.04, 95% CI = 0.01–0.06, p = 0.003). Conclusion: Our findings suggest that, overall, self-reported sexual history among urbanized Nigerian women is reliable.

Introduction

Information about sexual behavior and sexual health is often collected in epidemiologic studies. These include research on risk factors for acquisition and spread of communicable diseases such as sexually transmitted diseases including HIV/AIDS and on non-communicable diseases such as cancers. Sexual history is also relevant in diseases where it does not have a direct etiological relationship but where it may provide insights into overall wellbeing and quality of life, or where disease may negatively affect the sexual domain. Examples include chronic illnesses such as stroke and diabetes, whose progression, treatment, or resolution may affect sexual function and quality of life.
In most epidemiologic studies, the history of sexual practices and sexual hygiene are elicited through self-reports, but the validity of such data has been repeatedly questioned (1)(2)(3). In the absence of precise biomarkers that can serve as gold standards to evaluate the accuracy of self-reports, several studies have evaluated methods of testing reliability. The most popular method is the test-retest correlation of responses to questionnaires, while another method uses biomarkers of vaginal exposure to semen, such as the presence of sperm, prostate-specific antigen, or Y chromosome in vaginal fluids (4)(5)(6)(7)(8). These latter methods are more relevant for evaluation of recent unprotected sexual intercourse in women and may not be relevant in most epidemiological studies where long-term exposure and variety of exposures are of interest (9). Other methods that have been used include correlation of partner reports of sexual behavior and the use of sexual diaries (2,9). Partner reports of sexual behavior are not ideal because they may be influenced by the nature of the relationship between partners and raise problems of confidentiality in reporting on the behavior of another. The use of sexual diaries in large, population-based studies may not be practicable because of the burden on research participants, which may lead to high attrition rates, non-compliance, recall bias, and participants' reactivity (2,9,10,11).
Studies that used test-retest correlations for measuring reliability of sexual history among diverse populations in the United States have yielded intraclass correlation coefficients (ICC) ranging from 0.3 to 0.9 for reports of lifetime number of sexual partners (9,12,13,14,15). Most studies in low- and middle-income countries (LMICs) that have evaluated the reliability of self-reported sexual history have been restricted to either the young (15-24 years old) or to practitioners of high-risk sexual behavior (16,17). One possible reason for the high prevalence of these types of research in LMICs is that sexual behavior measures are commonly used as indicators for monitoring the HIV/AIDS epidemic by the Joint United Nations Program on HIV/AIDS (UNAIDS) (18). Self-report of sexual behavior may be subject to self-presentation and social desirability bias, which may differ by age, sex, and population characteristics that reflect acceptable norms and cultural attitudes toward talking about sex in any given society. It is therefore important to evaluate reliability in the context of conducting epidemiological research in resource-limited settings (6).
In this study, we examined a 14-item questionnaire used to collect sexual behavior history from urbanized Nigerian women to determine its reliability and that of similar instruments used for self-report of sexual behavior in epidemiological research.

Materials and Methods

Study Population
Between August 2012 and December 2013, we recruited women from our cervical cancer screening clinics in Abuja, Nigeria, into a prospective study of the host and viral factors associated with persistent hrHPV infection in Nigeria. We enrolled women who were at least 18 years old and had engaged in vaginal sexual intercourse. We excluded women who had a total hysterectomy, were pregnant, or were unable to provide informed consent. At enrollment, we used interviewer-administered questionnaires to collect data on sociodemographic characteristics, lifestyle risk factors, and reproductive and sexual behavior histories. Trained nurses performed gynecologic examinations on all participants, collected biological samples for HPV detection, and examined the cervix for premalignant lesions through visual inspection with acetic acid/Lugol's reagent (VIA/VILI). We treated all women diagnosed with premalignant cervical lesions with thermocoagulation if the lesions met specific criteria: complete visualization of the lesions, lesions covering less than 75% of the transformation zone, lesions amenable to complete coverage by the tip of the cryoprobe, and lesions not suspicious of cancer (19). All participants were scheduled for a follow-up visit after 6 months. At the follow-up visit, the same nurses who administered the baseline questionnaires readministered the questionnaires to all returning participants. At the baseline visit, participants were not informed that they would be asked the same questions at follow-up. All nurses were trained to administer the questionnaires either in English or in local languages in cases where participants could not speak English. All questionnaires were completed prior to biological sample collection.

Main Outcome Measures
We adapted protocols for sexual behavior history from the PhenX Toolkit version 5.0 (February 24, 2012) and developed a 14-item questionnaire. We piloted the questionnaire among 50 women of reproductive age with characteristics similar to those of our study population. Details of these items and coding of responses are shown in Table 1. Six items were coded as continuous variables while eight items were coded as categorical. To distinguish between actual behavior change and test-retest reliability, we asked all participants to report changes in sexual behavior in the period between the first and second questionnaires, and adjusted our analysis to account for any reported changes.

Statistical Analysis
For categorical variables, we estimated the kappa coefficient (κ) to determine agreement beyond what would be expected by chance. We estimated 95% confidence intervals (95% CI) for κ using bootstrap methods with bias-corrected estimation, as some of the variables, such as type of sex at sexual debut and the types and frequency of practice of different types of sex, had more than two categories (20)(21)(22). We compared the κ statistics  (24). For continuous variables, we calculated indices of absolute and relative test-retest reliability. For absolute reliability, the degree to which repeated responses varied for individuals, we used within-person coefficients of variation (CVw), Bland and Altman's limits of agreement, paired t-tests, and differences in responses between study entry and retest. For relative reliability, the degree to which individuals maintain their position in the group, we used ICCs from a two-way mixed effects model. We chose the two-way mixed effects ICC model because the same set of research assistants administered the same questionnaires to all participants at study entry and retest. Therefore, the research assistants and questionnaires were considered to be fixed effects while the random effects were participants and possibly the interactions between participants and the research assistants. We used the guidelines suggested by Cicchetti to interpret the correlation coefficients, with values below 0.40 interpreted as poor, values of 0.40-0.59 as fair, values of 0.60-0.74 as good, and values of 0.75-1.00 as excellent (25).
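The indices for continuous variables described above can be illustrated with a minimal sketch. It is not the authors' Stata analysis: the function names and the example ages are hypothetical, the CVw assumes exactly two responses per participant with a non-zero pair mean, and the ICC shown is the single-measure consistency form of the two-way mixed model, ICC(3,1), which is an assumption because the text does not state whether consistency or absolute agreement was estimated.

```python
import math
from statistics import mean, stdev


def within_person_cv(test, retest):
    """Root-mean-square within-person CV (%) for paired responses."""
    ratios = []
    for x1, x2 in zip(test, retest):
        pair_mean = (x1 + x2) / 2        # assumed non-zero
        pair_var = (x1 - x2) ** 2 / 2    # within-person variance of a pair
        ratios.append(pair_var / pair_mean ** 2)
    return 100 * math.sqrt(mean(ratios))


def limits_of_agreement(test, retest):
    """Bland-Altman lower limit, mean difference, upper limit (>= 2 pairs)."""
    diffs = [x1 - x2 for x1, x2 in zip(test, retest)]
    d, s = mean(diffs), stdev(diffs)
    return d - 1.96 * s, d, d + 1.96 * s


def icc_consistency(test, retest):
    """ICC(3,1): (MS_rows - MS_err) / (MS_rows + (k - 1) * MS_err), k = 2.

    Assumes participants differ from one another (MS_rows > 0).
    """
    n, k = len(test), 2
    rows = list(zip(test, retest))
    grand = mean([*test, *retest])
    ss_rows = k * sum((mean(r) - grand) ** 2 for r in rows)
    ss_cols = n * sum((mean(c) - grand) ** 2 for c in (test, retest))
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


# Hypothetical ages at sexual debut reported at entry and at retest.
age_entry = [18, 20, 17, 22, 19, 16]
age_retest = [18, 21, 17, 20, 19, 17]
print(within_person_cv(age_entry, age_retest))
print(limits_of_agreement(age_entry, age_retest))
print(icc_consistency(age_entry, age_retest))
```

A constant shift between administrations (every participant reporting one year older at retest, for example) leaves the consistency ICC at 1 but moves the Bland-Altman mean difference away from 0, which is one reason the absolute and relative indices are reported side by side.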
To investigate the association between potential correlates and test-retest reliability, we used two different types of regression models: log binomial regression models for sexual behavior responses collected as categorical variables, and linear regression models for sexual behavior responses collected as continuous variables (Table 1). We evaluated age-adjusted models of the outcome and the potential predictors such as interval between test administrations, marital status, level of education, self-perception of general health, HIV status, and others identified from the literature. We identified predictors with p-values less than 0.20 in the age-adjusted models and included them in multivariable regression models (9,26).
We used principal component analysis to create a summary measure of reliability for the continuous variables. Using the eigenvalue cutoff of 1, the scree plot, and interpretability of factors, we retained one factor, which explained a cumulative variance of 53%. We predicted scores for test-retest reliability using the factor loadings for the retained factor, such that participants with high scores may be considered to have higher levels of inconsistency in their responses compared with participants with low scores. We used the summary measure in linear regression models testing for test-retest reliability for each participant (Table 1).
For the categorical variables (Table 1), we created a summary variable such that participants who had any disagreement in the categorical variables at test and retest had a score of 1 and participants who were consistent in their test-retest responses had a score of 0. Next, we used this summary measure in log binomial models to evaluate the association between potential correlates and reliability of responses provided for sexual behavior questions collected as categorical variables (Table 1).
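The summary variable for categorical items can be sketched as follows. The function names, item values, and grouping variable are hypothetical. The second function computes only a crude prevalence ratio: with a single binary predictor, the exponentiated coefficient of a log binomial model equals this ratio, whereas the age-adjusted and multivariable models described in the text are not reproduced here.

```python
def any_disagreement(test_items, retest_items):
    """Return 1 if any categorical item differs between test and retest, else 0."""
    return int(any(a != b for a, b in zip(test_items, retest_items)))


def crude_prevalence_ratio(outcomes, exposed):
    """Ratio of P(outcome = 1) in the exposed group to that in the unexposed group.

    Assumes both groups are non-empty and the unexposed proportion is above 0.
    """
    in_exposed = [y for y, x in zip(outcomes, exposed) if x]
    in_unexposed = [y for y, x in zip(outcomes, exposed) if not x]
    p_exposed = sum(in_exposed) / len(in_exposed)
    p_unexposed = sum(in_unexposed) / len(in_unexposed)
    return p_exposed / p_unexposed


# Hypothetical responses: (ever oral sex, ever anal sex, frequency of vaginal sex).
entry = [("no", "no", "weekly"), ("yes", "no", "monthly"), ("no", "no", "weekly")]
retest = [("no", "no", "weekly"), ("yes", "no", "weekly"), ("yes", "no", "weekly")]
scores = [any_disagreement(a, b) for a, b in zip(entry, retest)]
print(scores)
```

Because a single discordant item is enough to score 1, this indicator is strict; it measures whether a participant was fully consistent, not how inconsistent she was.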
We considered a p-value <0.05 as significant. Formal adjustments for multiplicity were not considered appropriate as inferences for itemized questionnaire items were not based on significance of individual endpoints. In regression models, where inferences were based on significance of the endpoint, we used summary variables as endpoints. All statistical analyses were conducted using Stata version 13 (StataCorp, College Station, TX, USA).

Results

Study Characteristics
Of the 725 participants included in this study, 346 (48%) were HIV positive, 354 (49%) were HIV negative, and the HIV status of 25 (3%) participants was unknown. The latter were excluded from regression models and comparisons of reliability between HIV-positive and HIV-negative participants. The mean (SD) age of participants was 38.5 (7.8) years and the mean (SD) interval between questionnaire administrations was 8.6 (4.0) months (Table 2). Most of the participants were married (67%) and had more than 6 years of formal education (88%). The prevalence of oral and anal sex among the participants at study entry was 16% and 2%, respectively.

Indices of Absolute Reliability
The mean of the difference (SD) in responses provided at study entry and at retest for all but one of the continuous variables was close to 0 [age at sexual debut, 0.4 (3.1); lifetime number of partners, 0.0 (2.3); age at oral sex debut, −0.1 (4.5); lifetime number of oral sex partners, 0.1 (1.3); age at anal sex debut, −1.0 (6.2)] (Table 3). Except for age at oral and anal sex debut, the responses provided at retest were generally lower than the responses provided at baseline, as shown by the positive direction of the mean of the difference between the responses (Table 3). The 95% limits of agreement for the mean of the differences between responses at study entry and retest are shown in the Bland and Altman plots (Figure 1). The plots show that responses became less reliable as the number of sexual partners reported increased.
Comparing HIV-negative women to HIV-positive women in univariate analyses, there were no significant differences for the sexual history measures collected as continuous variables: age at sexual initiation (p = 0.25), lifetime number of partners (p = 0.86), age at oral sex debut (p = 0.61), lifetime number of oral sex partners (p = 0.76), and age at anal sex debut (p = 0.76). The intraindividual variability was lower for time-invariant measures (age at sexual debut CVw = 10.7 and age at oral sex debut CVw = 11.5) compared with frequency measures (lifetime number of partners CVw = 35.2 and lifetime number of oral sex partners CVw = 34.1) (Table 3).

Indices of Relative Reliability
As shown in Table 3, the ICCs for the total study population were good to excellent: age at sexual debut (0.8), total lifetime number of partners (0.8), age at oral sex debut (0.7), and total lifetime number of oral sex partners (0.9).
HIV-negative women had a higher ICC than HIV-positive women for lifetime number of partners (0.9 vs 0.8, p < 0.001). Conversely, HIV-negative women had a lower ICC than HIV-positive women for age at oral sex debut (0.4 vs 0.9, p = 0.001).

Agreement for Categorical Variables
There was a high level of agreement between responses at study entry and responses at retest for sexual orientation (98.8%), type of sex at sexual debut (97.8%), ever practiced oral sex (85.4%), and ever practiced anal sex (98.2%) (Table 4). However, agreement for frequency of sexual activity was relatively lower, ranging from 59.1% for frequency of vaginal sex to 63.9% for oral sex. Despite the high levels of agreement, κ statistics were slight to moderate. Generally, HIV-negative individuals had higher κ statistics than HIV-positive individuals.
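High percent agreement alongside a modest κ is expected when a behavior is rare, because most agreement is then attributable to chance. The following minimal sketch uses hypothetical counts (not study data) and hypothetical function names to show the effect for a binary item with unweighted Cohen's κ; it does not reproduce the bootstrap confidence intervals described in the methods.

```python
from collections import Counter


def percent_agreement(test, retest):
    """Proportion of participants giving the same answer on both occasions."""
    return sum(a == b for a, b in zip(test, retest)) / len(test)


def cohen_kappa(test, retest):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(test)
    p_o = percent_agreement(test, retest)
    c1, c2 = Counter(test), Counter(retest)
    # Chance agreement from the marginal category proportions.
    p_e = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / n ** 2
    if p_e == 1:
        return float("nan")  # undefined when everyone gives one identical answer
    return (p_o - p_e) / (1 - p_e)


# Hypothetical rare behavior in 100 women:
# 96 no/no, 1 yes/yes, 2 yes-then-no, 1 no-then-yes.
entry = ["no"] * 96 + ["yes"] + ["yes"] * 2 + ["no"]
retest = ["no"] * 96 + ["yes"] + ["no"] * 2 + ["yes"]
print(percent_agreement(entry, retest))  # 0.97
print(round(cohen_kappa(entry, retest), 3))  # 0.385
```

In this example 97% of answers agree, yet chance alone would produce about 95% agreement, so κ is only about 0.39.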

Predictors of Reliability
In Model 1 for continuous variables, we found that a 1-month increase in test-retest interval resulted in an average increase of 0.04 points in inconsistency of responses (95% CI = 0.01-0.06, p-value = 0.003) (Table 5). HIV infection was also statistically significantly associated with reliability, with HIV-positive individuals having an average increase of 0.22 points in inconsistency compared to HIV-negative individuals (95% CI = 0.07-0.38, p-value = 0.005). In Model 2 for categorical variables, we did not observe any significant relationships.

Discussion
In this study of test-retest reliability of self-reported sexual behavior using interviewer-administered questionnaires, we found that self-report of sexual behaviors was reasonably reliable overall. However, we observed varying levels of reliability based on the nature of sexual behavior reported. The reports on frequency of non-vaginal sexual practices were more reliable than those of vaginal sexual practices. Differences in the patterns of reliabilities for frequency of vaginal and non-vaginal sexual practices may reflect differences in the frequencies of the behaviors. Among heterosexual women, vaginal sexual practices tend to occur more frequently than non-vaginal sexual practices (14). Reports of less frequent behavior are generally more stable, as people tend to use more efficient recall strategies (28,29). Enumeration recall strategies, where each event is recalled and counted separately, are commonly used for infrequent behaviors, especially when these behaviors are associated with particularly distinctive time periods, events, or people. However, for frequent behaviors, enumeration may be too difficult or time consuming; therefore, estimation recall strategies, where rate-based mental calculations are made without recalling individual events, are commonly used (30)(31)(32). We found that reports of time-invariant events (age at sexual debut, ever-practiced oral sex, ever-practiced anal sex) were more reliable than frequency reports (number of partners, frequency of sex). This finding may reflect the different psychological processes that underlie these two types of reports. Time-invariant events may be associated with more vividness and personal salience, especially when accompanied by strong emotions at the time of the encounter, for example, age at sexual initiation or ever practiced anal sex (9). Conversely, frequency reports that ask about number of events may involve less vivid memories, especially in people with high levels of sexual networking. This is further complicated by the need for rate-based inferences, which require mental calculations that can be inconsistent (9,11,30).
For continuous measures, reliability also decreased significantly with increasing interval between questionnaire administrations after controlling for age, HIV status, marital status, perception of general health, and level of education. One possible explanation for our finding is that behaviors may change with increasing intervals between tests and, therefore, responses provided at retest may reflect current behavior at the time of test administration rather than instability. Several research studies have evaluated the relationship between recall periods and reliability. Results from some of these studies showed that shorter recall periods were more reliable than longer recall periods (9,33). On the other hand, other studies reported no association or increased reliability for longer recall periods for particular behaviors such as lifetime number of sexual partners (15,30,34). These varying results may reflect underlying differences in the nature of behaviors evaluated, mode of assessment, and study population. This can be observed in the results from our models that evaluated reliability for variables collected as categorical, where there were no associations between HIV status, test-retest interval, and reliability. While our results provide the only estimates for women living in an urban community in Nigeria, similar findings have been reported in adolescent populations in South Africa (35). The optimal recall period for studies on sexual risk remains an active area of research (2). Although the indices for absolute test-retest reliability for lifetime number of partners showed high levels of reliability, responses were less reliable as the self-reported number of partners increased, which is consistent with results from previous studies (10,29,30,36).
This may be explained by a combination of several factors, such as different recall strategies and the attitudinal propensity toward casual sex among people with multiple sexual partners compared to people who report being abstinent or monogamous (30). Participants who have been sexually inactive or monogamous during the recall period may use enumeration strategies to report 0 or 1, respectively, whereas participants who have had multiple sexual partners may use rate-based mental calculations, which yield imprecise estimates. The cognitive processes involved for the abstinent or monogamous participant are straightforward and probably result in higher degrees of reliability than in women reporting multiple sexual partners. Additionally, studies show that people with higher numbers of sexual partners display more favorable attitudes toward casual sex, which tends to be less vivid and to involve less psychological involvement than sex in the context of sustained relationships (30). As recall is associated with vividness of events, it is understandable that discrepancies are higher with increasing number of partners (37).

Strengths of This Study
A notable strength of our study is that we evaluated reliability of individual sexual behaviors, rather than assuming that reliability of measures of one sexual behavior confers reliability on other measures of sexual behavior. This has important implications for researchers in making informed decisions about the collection of self-reported sexual history.
In estimating sample size for epidemiologic studies, the importance of considering measurement errors of important covariates has been described by several authors (38,39). One simple approach is to adjust the sample size estimates based on desired    (38). An alternative to sample size adjustments is to incorporate expected levels of measurement error into the data analysis (40). These approaches require that the magnitude of the measurement error for the covariates is known. Because the true values are difficult to determine, correlation estimates between true and self-reported sexual behavior history are not available; in their absence, our test-retest correlation estimates provide some guidance for sample size adjustments to account for measurement errors in the use of self-reported sexual behavior history in epidemiologic studies. Another strength of our study is that by examining a time-invariant sexual attribute such as age at sexual debut, we were able to evaluate test-retest reliability without the confounding effects of behavior change that may occur during the test interval. For time-variant measures such as lifetime number of sexual partners, we included a question in the retest questionnaire for participants to record the number of new partners since the administration of the first test. Motivation to participate in a research study and the topic of the research study may be important sources of response bias (41). Participants in reproductive and sexual health research studies may give more thoughtful responses to questions on sexual practices for altruistic reasons, to help investigators arrive at useful answers, or because they perceive that their responses may affect their clinical management. This may lead to better reliability than among participants in other types of studies, where sexual behavior may not be perceived as being important.
Our study was hospital-based and conducted among adult females in the context of cervical cancer screening; therefore, our participants may have given responses that can be generalized to populations who participate in similar research.

Limitations
In our study, we used interviewer-administered in-person interviews, and this may have led participants to provide more socially desirable responses. We minimized interviewer influences by using well-trained interviewers and by arranging sensitive questions after less sensitive ones so that the participants' trust would be high by the time sensitive questions were asked. There is some evidence to suggest that participants respond more objectively to self-administered interviews than to interviewer-administered ones, particularly for behaviors that may be considered embarrassing, stigmatizing, or illegal (42). This may be due to the increased privacy afforded by self-administered interviews and the ability of participants to control the pace of the interview. However, other studies have found no difference between the two methods, especially for sexual behavior history that may include complex branch and skip patterns (43). Audio-assisted computer self-administered questionnaires may improve objectivity of self-administered questionnaires, but they require respondents to comprehend questions and provide relevant responses. Their infrastructural demands and literacy requirements may preclude their use in large-scale epidemiological studies in LMICs (44,45).