Edited by: Holmes Finch, Ball State University, USA
Reviewed by: Kenn Konstabel, National Institute for Health Development, Estonia; Wolfgang Rauch, Heidelberg University, Germany
*Correspondence: Kodi B. Arfer
This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
According to theory, choices relating to patience and self-control in domains as varied as drug use and retirement saving are driven by generalized preferences about delayed rewards. Past research has shown that measurements of these time preferences are associated with these choices. Research has also attempted to examine how well such measurements can predict choices, but only with inappropriate analytical methods. Moreover, it is not clear which of the many kinds of time-preference tests that have been proposed are most useful for prediction, and a theoretically important aspect of time preferences, nonstationarity, has been neglected in measurement. In Study 1, we examined three approaches to measuring time preferences with 181 users of Mechanical Turk. Retest reliability, for both immediate and 1-month intervals, was decent, as was convergent validity between tests, and association was similar to previous results, but predictive accuracy for 10 criterion variables (e.g., tobacco use) was approximately nil. In Study 2, we examined one other approach to measuring time preferences, and 40 criterion variables, using 7,127 participants in the National Longitudinal Survey of Youth 1979. Time preferences were significantly related to criterion variables, but predictive accuracy was again poor. Our findings imply serious problems for using time-preference tests to predict real-world decisions. The results of Study 1 further suggest there is little value in measuring nonstationarity separately from patience.
People frequently need to choose between one outcome available soon (the smaller sooner, or SS, option) and a more desirable outcome available later (the larger later, or LL, option). Deciding whether to indulge in a dessert or stick to a diet, to splurge on an impulse buy or save up for a more desirable item, and to relax or study for an upcoming exam, can all be characterized as intertemporal choices. Intertemporal choice is often called “delay discounting,” because of people's and animals' tendency to treat LL as less valuable the more it is delayed, which has led to the development and popularization of quantitative models of intertemporal choice based on multiplicative discount factors (Samuelson,
It is generally agreed that a high-quality standardized test of a personality construct is important for research into that construct. Actually, while time preferences have been studied from such diverse perspectives as cognitive psychology, economics, and psychiatry, they have not generally been regarded as a personality construct (see Odum,
Despite widespread interest in time preferences and use of tests to measure them, there has been little agreement on which tests to use. Instead, researchers have employed a wide variety of tests. Some tests have subjects make binary choices (Rachlin and Green,
Given this wide array of possibilities, what test of time preferences should an investigator use? Past research has examined various aspects of how time preferences are measured (such as forced-choice vs. free-response formats), how these measurement details affect the resulting scores, and how well scores obtained from different tests agree (e.g., Smith and Hantula,
One way in which the many extant tests are similar, but possibly deficient, is that all but a few of them measure only patience, that is, people's overall willingness to wait for rewards. Patience is distinct from nonstationarity, which is how people's patience changes as the delays to both SS and LL increase in equal measure. It is nonstationarity that is theoretically linked to self-control (Thaler and Shefrin,
A statistical issue that is particularly important for criterion validity and applied uses of tests is the distinction between association and predictive accuracy. Typically, when psychologists wish to quantify the relationship between variables, they use a measure of how well values can be associated with each other, such as a correlation coefficient, or how well a model fits the data, such as root mean squared error (RMSE). The question of predictive accuracy, by contrast, is how well a model can estimate the value of a dependent variable (DV) when the model does not already have access to that value. For example, the predictive accuracy of a time-preference test might signify how accurately the test can predict number of cigarettes smoked among people whose smoking has not been measured. Notice that the prediction in question is of individual data values (e.g., the number of cigarettes smoked by subject 3) in numeric terms (e.g., 7 cigarettes). Such predictions are distinct from what is usually meant when a researcher states that a theory “predicts” something: namely, ordinal effects such as “Less patient people smoke more.” Prediction, in the statistical sense of the term, need not be future-oriented: we can examine the predictive accuracy of one variable for another variable measured at the same time, or even before the predictor. What is important is not the timing of measurements, but that the model is not trained with the very same cases it is trying to predict.
This sort of criterion validity—predictive rather than merely associative—is especially useful in applied contexts, for assessment and decision-making. In such situations, tests are used for their ability to inform us about what we do not already know concerning the examinee. Test scores are used as tools to estimate these other, unknown quantities. Hence, the more accurate the test is in predicting the unknown quantities, the more useful it will be for guiding our decisions.
Predictive accuracy can be quantified with some of the same statistics as association, such as RMSE. However, association is optimistically biased as a measure of predictive accuracy, due to overfitting (Wasserman,
Many past studies have claimed to find that a test of time preferences, or at least a test of patience, can indeed predict CVs. These are only a subset of the large literature showing association between time preferences and CVs. But generally, they have not used appropriate methods to assess and quantify predictive accuracy, so they are not informative as to the question of prediction, despite the intentions of the authors. For example, Daugherty and Brase (
The present studies sought to provide direct insight on the question of how to measure time preferences and how well tests of time preferences can predict CVs. Study 1 compared three families of time-preference tests on reliability, convergent validity, and ability to predict 10 self-reported CVs, ranging from overweight to credit-card debt. Subjects were 181 US residents recruited from the crowdsourcing website Amazon Mechanical Turk. Study 2 examined the predictive accuracy of another test of time preferences for 40 self-reported CVs, which covered many of the same content areas, as well as new areas such as flu vaccination and age of sexual debut. Subjects were thousands of people from the nationally representative National Longitudinal Survey of Youth 1979.
To help answer the question of how best to measure time preferences, rather than manipulate a few administration details, we compared three representative families of time-preference tests. One family used the popular items of Kirby et al. (
We evaluated the preference tests on their retest reliability, their convergent validity, their association with a number of practically important CVs, and, crucially, their accuracy at predicting said CVs. In fact, we used predictive rather than associative methods for retest reliability and convergent validity as well as for criterion validity. We administered the preference tests three times, with the third administration separated from the first two by 1 month, allowing us to estimate retest reliability over immediate as well as 1-month intervals.
The procedure was approved by the Stony Brook University Committee on Research Involving Human Subjects. All subjects provided informed consent. The committee waived the requirement for documentation of informed consent, since the study was conducted on Mechanical Turk.
Subjects completed three families of tests of time preferences. Each family used a different approach to measure the same two theoretical constructs, patience (i.e., willingness to wait for larger rewards) and nonstationarity (i.e., the effect of front-end delays on willingness to wait). Each family comprised two very similar tests (for a total of six distinct tests), which differed only in whether a front-end delay of 1 month was added to all the options presented to subjects. Tests with this delay are termed the
The first family, which we call the
We designed the second family, of
Finally, the family of
Notice that all tests were scored in the direction with greater scores implying greater patience. Notice also that these scores, which we used for all analyses below, are not the same as model-specific measures common in the study of intertemporal choice such as the discount rate
To check that the selected numbers of trials for the bisection and matching families were adequate, we ran pilot studies in which we administered both tests in each family twice, to 14 users of Mechanical Turk per family. We examined the difference between the discount factors yielded by the first and second administration for each test. For the two families, 75 and 83% of differences, respectively, were less than 0.1 (on the scale of discount factors, from 0 to 1), which we judged to be adequate.
Subjects self-reported about a wide variety of real-world behaviors that are in theory related to patience and self-control, as well as their demographics. Chabris et al. (
Are you male or female? [“Male” or “Female”]
How old are you? [Integer]
How tall are you? [Length, US or SI units]
How much do you weigh? [Weight, US or SI units]
Do you use tobacco? [“Yes” or “No”]
■ [Shown only if the subject answered “Yes”] How many packs of cigarettes do you smoke per week? (Enter 0 if you don't smoke cigarettes.) [Integer]
How many hours per week are you physically active (for example, working out)? [Integer]
For how many of your meals do you choose the amount or kind of food you eat with health or fitness concerns in mind? [Percentage]
How many times per week do you use dental floss? [Integer]
Have you used a credit card at all in the past 2 years? [“Yes” or “No”]
■ [Shown only if the subject answered “Yes”] Over the past 2 years, how many times were you charged a late fee for making a credit card payment after the deadline? [Integer]
■ [Shown only if the subject answered “Yes”] Over the past 2 years, how many of your credit-card payments were for less than your total balance? [Percentage]
Over the past 2 years, how much of your income have you saved? (Please include savings into retirement plans and any other form of savings that you do.) [Percentage]
On how many days per month do you gamble? (Gambling includes such activities as playing at casinos, playing cards for stakes, buying lottery tickets, and betting on sports.) [Integer]
The theoretical relationship of the CVs to time preferences, and thus self-control, is that each is related to choices between small rewards available soon and larger rewards available later. For example, smoking entails getting the immediate pleasure of a cigarette and forgoing the long-term health benefits of not smoking. Overweight is related to overeating, which again entails getting immediate pleasure and forgoing long-term health benefits. And credit-card debt is accumulated by making immediate purchases with the effect of having to pay much more than the original purchase price over the long term. None of these variables is supposed to be determined by self-control alone, but each partly involves self-control.
Subjects were recruited from the crowdsourcing website Amazon Mechanical Turk. They were required to live in the United States. Of the 200 subjects who participated in session 1, 74 (37%) were female, and the median age was 31 (95% sample interval 19–64). All 200 were invited to complete session 2, and 103 did so. Subjects provided informed consent before each session. They were compensated with $1 for completing session 1 (median completion time 15 min) and $0.50 for completing session 2 (median completion time 6 min).
Session 1 took place from February 15th to 21st, 2014. In session 1, subjects completed all six preference tests in a random order (round 1), then did so again in another random order (round 2), with no delay in between and no notification that the same tests were being administered twice. The order of tests was randomized separately for each subject. Finally, subjects completed the criterion questionnaire.
Without prior warning, subjects were invited to participate in session 2 by email approximately 30 days later. To balance server load, invitations were sent out in 5 batches of 10 per day starting on March 18th, 2014, so all participants in session 1 had been invited back by March 22nd. Subjects had until March 25th to complete session 2. This session consisted entirely of a third administration of the six preference tests, again in random order (round 3).
As an extra check on subject attention, we included in each bisection test two catch trials. In one catch trial, the ratio of amounts (SS amount divided by LL amount) was set to 0.07, making LL clearly preferable for all but the most impatient of subjects. In the other catch trial, the ratio of amounts was set to 1.13, so that one option had both the lesser delay and the greater amount, giving subjects no clear justification to prefer the other option. Choices in these trials were not used for scoring any tests.
The raw data for both Study 1 and Study 2, as well as task code and analysis code, can be found at
Of the 200 subjects who participated in session 1, we excluded from analysis 4 subjects who gave a nonsensical answer to a question in the criterion questionnaire, 10 subjects who made the designated incorrect choice in at least 3 of the 8 catch trials in session 1, and 6 subjects who gave an LL response smaller than SS in at least 3 of the 40 matching trials in session 1. (No subjects made 3 or more errors in either category in session 2.) Accounting for overlap in these groups, 181 subjects remained. These 181 subjects constitute the sample for the bulk of the following analyses.
Of the 103 subjects who participated in both sessions, 7 were already excluded by the above rules, and we excluded an additional 3 subjects who completed session 2 in less than 3 min, for a final sample size of 93. This smaller sample of 93 is used only for the 1-month retest reliability analysis (the right-hand half of
Descriptive statistics for preference-test scores in round 1 are shown in Table
Statistic | Fixed near | Fixed far | Bisection near | Bisection far | Matching near | Matching far |
Q3 | 11 | 11 | 0.88 | 0.87 | 0.87 | 0.84 |
Median | 9 | 9 | 0.76 | 0.77 | 0.69 | 0.66 |
Q1 | 7 | 5 | 0.59 | 0.58 | 0.50 | 0.50 |
MAD | 2 | 2 | 0.15 | 0.16 | 0.18 | 0.16 |
We transformed and coded the CVs as follows, based on inspection of their distributions (without reference to the preference tests), so as to retain variability while maximizing their suitability as a DV for linear regression, dichotomous probit regression, or ordinal probit regression.
Hours of exercise per week was incremented by 1 and log-transformed.
Healthy meals and savings were clipped to [0.005, 0.995] and logit-transformed.
Overweight was calculated by computing body mass index (as weight in kilograms divided by the square of height in meters), then dichotomizing using a threshold of 25.
Gambling, credit-card late fees, and credit-card subpayments were dichotomized according to whether the subject's answer was greater than 0.
Number of cigarettes smoked was ignored in favor of the dichotomous variable of whether the subject used tobacco.
Flossing was coded into three ordered categories: 0 (less than once per week), 1–6 (less than once per day), and 7 or more (once or more per day).
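The codings listed above can be sketched in a few lines of Python. This is only an illustration, not the analysis code used for the study: the function names are hypothetical, the natural logarithm is assumed, percentages are assumed to have been converted to proportions first, and a body mass index of exactly 25 is assumed to count as overweight (the text gives only the threshold).

```python
import math

def log1p_hours(hours):
    """Exercise hours per week: increment by 1, then take the (natural) log."""
    return math.log(hours + 1)

def clipped_logit(p, lo=0.005, hi=0.995):
    """Proportion in [0, 1]: clip to [lo, hi], then apply the logit."""
    p = min(max(p, lo), hi)
    return math.log(p / (1 - p))

def overweight(weight_kg, height_m, threshold=25):
    """Body mass index (kg / m^2) dichotomized at the threshold.
    Treating exactly 25 as overweight is an assumption."""
    return weight_kg / height_m ** 2 >= threshold

def floss_category(times_per_week):
    """Three ordered categories: 0 -> 0, 1-6 -> 1, 7 or more -> 2."""
    if times_per_week == 0:
        return 0
    return 1 if times_per_week < 7 else 2

print(clipped_logit(0.0), clipped_logit(1.0))  # finite thanks to clipping
print(overweight(80, 1.75), floss_category(3))
```

Clipping before the logit keeps answers of 0% and 100% finite, which is presumably why the bounds 0.005 and 0.995 were used.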
For each CV and family of preference tests, we assessed how well the preference tests (administered in round 1) could account for variation in the CV with a regression model. Each model had four terms: an intercept, main effects for the near and far test scores, and an interaction. The model was an ordinary linear regression model for the continuous CVs (exercise, healthy meals, and savings), a dichotomous probit regression model for the dichotomous CVs (overweight, tobacco, gambling, credit-card late fees, and credit-card subpayment), and an ordinal probit regression model for flossing. Models with a CV related to credit cards included only subjects who stated they had used a credit card in the past 2 years.
Measures of model fit are shown under “Association” in Table
Fixed | 0.026 | 0.037 | 0.004 | 0.005 | 0.019 | 0.025 | 0.009 | 0.025 | 0.012 |
Bisection | 0.084 | 0.054 | 0.025 | 0.002 | 0.035 | 0.055 | 0.004 | 0.033 | 0.013 |
Matching | 0.035 | 0.040 | 0.013 | 0.001 | 0.003 | 0.040 | 0.009 | 0.015 | 0.038 |
Baseline | 6.880 | 0.335 | 0.170 | 0.560 | 0.760 | 0.830 | 0.360 | 0.750 | 0.600 |
Fixed | 7.110 | 0.348 | 0.188 | 0.490 | 0.750 | 0.820 | 0.350 | 0.730 | 0.570 |
Bisection | 6.770 | 0.352 | 0.186 | 0.520 | 0.760 | 0.820 | 0.280 | 0.740 | 0.540 |
Matching | 7.150 | 0.360 | 0.188 | 0.560 | 0.760 | 0.820 | 0.330 | 0.740 | 0.580 |
In the previous section, we assessed the association of preference tests with CVs. Here, by contrast, we assess how accurately the tests could predict unseen data. To do this, we subjected the same models just described to tenfold cross-validation. The results are shown under “Predictive accuracy” in Table
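As a rough illustration of this procedure for a continuous CV, the following Python sketch fits the four-term linear model (intercept, near score, far score, interaction) within tenfold cross-validation and compares its held-out RMSE with that of a baseline that predicts the training-fold mean. The data are simulated noise, not the study's data; the function names, the random fold assignment, and the choice of the training mean as the baseline are assumptions made for the example.

```python
import numpy as np

def design(near, far):
    """Four terms: intercept, near score, far score, and their interaction."""
    return np.column_stack([np.ones_like(near), near, far, near * far])

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def cv_rmse(near, far, y, k=10, seed=0):
    """Return (model RMSE, baseline RMSE), both computed on held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    X = design(near, far)
    pred = np.empty(len(y))
    base = np.empty(len(y))
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred[test] = X[test] @ coef
        base[test] = y[train].mean()  # baseline ignores the test scores
    return rmse(y, pred), rmse(y, base)

# Simulated data: the CV is unrelated to the preference-test scores.
rng = np.random.default_rng(1)
near, far = rng.uniform(0, 1, 181), rng.uniform(0, 1, 181)
y = rng.normal(size=181)

full_coef, *_ = np.linalg.lstsq(design(near, far), y, rcond=None)
in_sample = rmse(y, design(near, far) @ full_coef)
model_cv, baseline_cv = cv_rmse(near, far, y)
print(in_sample, model_cv, baseline_cv)
```

For least-squares models, the in-sample RMSE can never exceed the cross-validated RMSE, which is the optimism that cross-validation is meant to remove. A test is predictively useful only to the extent that the cross-validated error falls below the baseline's.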
We reasoned that if there really existed a predictively useful relationship between the preference tests and CVs, perhaps it was too complex to be exploited by these models. We thus tried several more complex procedures:
The notion of reliability most closely related to predictive criterion validity is a test's accuracy in predicting itself. We therefore assessed the retest reliability of our preference tests by assessing how well scores in round 1 predicted scores in rounds 2 and 3. We attempted to predict round-2 and round-3 scores with unaltered round-1 scores rather than using a statistical model, since the question is how stable the scores are on their own. This approach penalizes bias, that is, a systematic shift in scores from one round to the next; Pearson correlations, for example, do not. A convenience of this approach is that cross-validation is not necessary for estimating predictive accuracy, because there is no model to train.
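The difference between predicting with unaltered scores and correlating can be made concrete with a small Python sketch. It assumes that the proportion of variance accounted for (PVAF) is computed as 1 minus the ratio of squared prediction error to total squared deviation from the mean, and that bias is the mean of later score minus round-1 score; both definitions, the function names, and the toy scores are assumptions for illustration.

```python
import statistics

def pvaf(observed, predicted):
    """1 - SSE / SST; unlike a squared correlation, it can be negative."""
    mean = statistics.fmean(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean) ** 2 for o in observed)
    return 1 - sse / sst

def bias(observed, predicted):
    """Mean signed error (observed minus predicted)."""
    return statistics.fmean(o - p for o, p in zip(observed, predicted))

def median_abs_error(observed, predicted):
    return statistics.median(abs(o - p) for o, p in zip(observed, predicted))

def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Toy scores: every subject drifts upward by 0.10 between rounds.
round1 = [0.50, 0.60, 0.70, 0.80, 0.90]
round2 = [score + 0.10 for score in round1]

print(pearson(round1, round2))           # unaffected by the uniform shift
print(pvaf(round2, round1))              # reduced by the shift
print(bias(round2, round1), median_abs_error(round2, round1))
```

In this toy case the Pearson correlation is essentially 1 even though every prediction is off by 0.10, whereas PVAF drops to about 0.5 and the bias of 0.10 is reported directly.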
Our three rounds of preference testing in two sessions allowed for estimating retest reliability over two intervals, immediately and 1 month (Table
Statistic | Immediate, fixed near | Immediate, fixed far | Immediate, bisection near | Immediate, bisection far | Immediate, matching near | Immediate, matching far | 1-month, fixed near | 1-month, fixed far | 1-month, bisection near | 1-month, bisection far | 1-month, matching near | 1-month, matching far |
PVAF | 0.70 | 0.49 | 0.51 | 0.62 | 0.75 | 0.82 | 0.44 | 0.57 | 0.44 | 0.64 | 0.57 | 0.56 |
Abs err, median | 2.00 | 2.00 | 0.04 | 0.05 | 0.03 | 0.03 | 2.00 | 2.00 | 0.06 | 0.04 | 0.06 | 0.05 |
Abs err, 90th %ile | 5.00 | 7.00 | 0.23 | 0.18 | 0.14 | 0.13 | 5.00 | 5.00 | 0.16 | 0.19 | 0.19 | 0.21 |
Abs err, 95th %ile | 5.00 | 9.00 | 0.28 | 0.24 | 0.20 | 0.18 | 7.00 | 6.00 | 0.21 | 0.23 | 0.28 | 0.26 |
Bias | 0.10 | 0.09 | 0.00 | −0.01 | −0.01 | −0.01 | −0.37 | −0.29 | 0.01 | 0.01 | −0.02 | −0.02 |
Kendall τ | 0.71 | 0.61 | 0.65 | 0.63 | 0.74 | 0.78 | 0.66 | 0.65 | 0.66 | 0.67 | 0.65 | 0.62 |
Pearson r | 0.85 | 0.72 | 0.76 | 0.81 | 0.88 | 0.91 | 0.77 | 0.80 | 0.72 | 0.82 | 0.81 | 0.79 |
Table
If our tests could predict themselves but not external CVs, could they predict each other? We estimated convergent validity by examining mutual prediction in round 1 with tenfold cross-validated linear regression (clipping predictions to the legal range of the DV). We found moderate convergent validity, with proportions of variance accounted for (PVAF) ranging from 0.25 to 0.50. These figures are comparable to those of Smith and Hantula (
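Mutual prediction of this kind might be sketched as follows in Python: one test score predicts another through a one-predictor linear regression fit within tenfold cross-validation, and each prediction is clipped to the legal range of the DV before PVAF is computed. The function names, fold assignment, PVAF definition (1 minus squared error over total squared deviation), and the simulated scores are all assumptions for illustration rather than the procedure actually used.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def cv_clipped_pvaf(x, y, lo, hi, k=10, seed=0):
    """k-fold cross-validated PVAF with predictions clipped to [lo, hi]."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    pred = [None] * len(y)
    for fold in range(k):
        test = set(idx[fold::k])
        train = [i for i in idx if i not in test]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            pred[i] = min(max(b0 + b1 * x[i], lo), hi)  # clip to legal range
    mean = statistics.fmean(y)
    sse = sum((o - p) ** 2 for o, p in zip(y, pred))
    sst = sum((o - mean) ** 2 for o in y)
    return 1 - sse / sst, pred

# Simulated discount-factor-like scores on [0, 1], strongly related by design.
rng = random.Random(1)
x = [rng.random() for _ in range(100)]
y = [min(max(0.1 + 0.8 * a + rng.gauss(0, 0.05), 0.0), 1.0) for a in x]

score, predictions = cv_clipped_pvaf(x, y, 0.0, 1.0)
print(score)
```

Clipping matters only when the fitted line extrapolates beyond the range of possible scores, but it guarantees that no prediction is one the DV could never take.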
Descriptively, what nonstationarity did subjects exhibit in their responses to the preference tests? Table
Response pattern | Fixed | Bisection | Matching |
Less patient later | 75 | 95 | 100 |
Stationary | 64 | 0 | 10 |
More patient later | 42 | 86 | 71 |
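One way such a tabulation could be produced is sketched below in Python. It assumes that a subject is classed by directly comparing the near and far scores within a family, with "later" referring to the far test and greater scores meaning greater patience; the function name and the toy scores are hypothetical.

```python
from collections import Counter

def classify(near, far):
    """Compare patience on the far test (front-end delay) with the near test."""
    if far < near:
        return "Less patient later"
    if far > near:
        return "More patient later"
    return "Stationary"

# Toy (near, far) score pairs for four hypothetical subjects.
pairs = [(9, 9), (7, 5), (11, 11), (5, 8)]
counts = Counter(classify(near, far) for near, far in pairs)
print(counts)
```

Under this rule, exact ties are required for a subject to count as stationary, which would explain why ties are common for integer-valued scores and rare for continuous ones.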
To examine whether the presence of the front-end delay could make a difference for prediction, we conducted an analysis similar to the reliability analyses described earlier with round-1 near scores as the predictor and round-2 far scores as the DV. The results (Table
Statistic | Fixed | Bisection | Matching |
PVAF | 0.47 | 0.42 | 0.68 |
Abs err, median | 2.00 | 0.05 | 0.04 |
Abs err, 90th %ile | 7.00 | 0.22 | 0.17 |
Abs err, 95th %ile | 9.00 | 0.31 | 0.23 |
Bias | 0.46 | −0.02 | −0.01 |
Kendall τ | 0.58 | 0.60 | 0.73 |
Pearson r | 0.70 | 0.75 | 0.86 |
In Study 1, none of our tests of time preferences could predict any of the CVs with more than trivial accuracy. The findings for retest reliability and convergent validity support the quality of the time-preference tests. For the CVs, however, we do not have this psychometric information, raising the possibility that the low predictive accuracy resulted from something unusual or deficient about the 10 items we happened to use, which were based heavily on the items of Chabris et al. (
The National Longitudinal Survey of Youth 1979 (NLSY79;
Suppose you have won a prize of $1000, which you can claim immediately. However, you can choose to wait 1 month to claim the prize. If you do wait, you will receive more than $1000. What is the smallest amount of money in addition to the $1000 you would have to receive 1 month from now to convince you to wait rather than claim the prize now?
The second item was the same except with an interval of 1 year instead of 1 month. Subjects could answer with any nonnegative integer. In theory, greater responses indicate less patience, and as in Study 1, a comparison of responses between the two timepoints should capture nonstationarity. Of 7,649 subjects interviewed, 7,127 provided a valid response to both questions. The remainder, who refused to answer one of the two questions or said they did not know, were given follow-up questions asking them to provide a range estimate. It was not clear to us how to use range estimates alongside the usual responses, and relatively few subjects (fewer than 30 for each of the month and year scenarios) provided a range estimate, so we restricted our analyses to the 7,127 subjects who answered the month and year questions with integers. To remain neutral on the question of how to model intertemporal choice itself (as in Study 1), we used the responses directly rather than fitting them to a model of intertemporal choice such as hyperbolic discounting.
Of these 7,127 subjects, 51% were female. Regarding race, 52% were white, 30% were black, 19% were Hispanic, and 1% were Asian (subjects could endorse more than one category). The median net family income (which we also use as a CV) was $54,975. Among the 93% of subjects whom the surveyors could classify as living in a rural or an urban area in 2006, 29% lived in a rural area.
For CVs, we searched the NLSY79 for items concerning real-world self-control, covering both the domains considered in Study 1 (obesity, exercise, drug use, healthy eating, oral hygiene, debt, and saving; we found no items concerning overall gambling behavior) and new domains (sleep, health insurance, vaccination, sexual debut, divorce, crime, and income). This search produced a list of 1,034 items, some of which were the same question asked in different years or were otherwise indistinct from other items, and some of which were not asked of any subjects who answered the patience items. Below, we describe the 40 variables we produced from these items. When items were available for multiple years, we preferred years closer to 2006 (the year the patience questions were asked), breaking ties in favor of years after rather than before 2006.
As in Study 1, we transformed and coded the CVs based on inspection of their distributions (without reference to the preference tests) so as to retain variability while maximizing their suitability as DVs. Table
Variable | N | Year | Type | Description |
Difficult to run mile | 4,932 | XRND | Binary | S said they could not or did not run a mile, or that it was very difficult |
Not easy to climb stairs | 4,995 | XRND | Binary | S rated climbing stairs as more difficult than “Not at all difficult” |
Overweight | 6,923 | 2006 | Binary | Body mass index 25 or more |
Exercise, light, ever | 6,789 | 2006 | Binary | S reported nonzero frequency of light or moderate exercise |
Exercise, light, min/y | 5,012 | 2006 | Continuous (log) | Calculated minutes per year of light or moderate exercise (nonzero only) |
Exercise, vigorous, ever | 6,855 | 2006 | Binary | S reported nonzero frequency of vigorous exercise |
Exercise, vigorous, min/y | 4,667 | 2006 | Continuous (log) | Calculated minutes per year of vigorous exercise (nonzero only) |
Exercise, strength, ever | 7,097 | 2006 | Binary | S reported nonzero frequency of strength training |
Checks nutrition often | 7,053 | 2006 | Binary | S “often” or “always” reads nutritional info while shopping |
Eats fast food | 6,858 | 2008 | Binary | S ate fast food at least once in past week |
Drinks soft drinks | 6,850 | 2008 | Binary | S drank a (non-diet) soft drink at least once in past week |
Sleep min, weekday | 4,997 | XRND | Continuous | Minutes of sleep S usually gets on weekdays |
Sleep min, weekend | 4,994 | XRND | Continuous | Minutes of sleep S usually gets on weekends |
Health insurance | 7,123 | 2006 | Binary | S has health insurance |
Flu vaccine | 6,852 | 1979 | Binary | S received flu vaccine in past 2 years |
Sees dentist | 6,856 | 1979 | Binary | S saw a dentist in past 2 years |
Brushes teeth 2/day | 6,551 | 2008 | Binary | S brushes teeth twice daily |
Flosses daily | 6,544 | 2008 | Binary | S flosses daily |
Smoked 100 cigs | 6,855 | 2008 | Binary | S smoked 100 cigarettes in lifetime |
Smoking | 6,856 | 2008 | Binary | S smokes “occasionally” or “daily” |
Drinking | 7,052 | 2006 | Binary | S drank alcohol in past month |
Drinking, heavy | 7,045 | 2006 | Binary | S drank more than 6 drinks in one occasion in past month |
Drinks in last month | 3,691 | 2006 | Continuous (log) | Calculated number of drinks in past month (nonzero only) |
Cannabis | 6,662 | 1998 | Binary | S ever used cannabis |
Cocaine | 6,690 | 1998 | Binary | S ever used cocaine |
Stimulants | 6,711 | 1998 | Binary | S ever used stimulants recreationally |
Other drugs | 6,694 | 1980 | Binary | S ever used illegal drugs (other than cannabis) |
Sexual debut | 6,563 | 1979 | Continuous | Age S first had “sexual intercourse” |
Divorced | 7,127 | XRND | Binary | S ever divorced or separated |
Stopped by police | 6,907 | 1980 | Binary | S ever stopped by police (other than for minor traffic violation) |
Convicted | 6,910 | 1980 | Binary | S ever convicted (other than for minor traffic violation) |
Net family income | 6,748 | 2006 | Continuous (sqrt) | Calculated net family income in previous calendar year (top-coded) |
Saving | 6,618 | 2000 | Binary | S or partner has money in bank account or US savings bonds |
Retirement account | 6,599 | 2000 | Binary | S or partner has money in IRA, Keogh, 401(k), etc. |
Missed bill payment | 6,831 | 2008 | Binary | S missed or was 2 months late to a bill in past 5 years |
CC debt, any | 6,609 | 2008 | Binary | S or partner had nonzero CC balance after most recent payment |
CC debt, dollars | 3,100 | 2008 | Continuous (log) | Dollars of credit-card debt (nonzero only) |
CC maxed out | 6,775 | 2008 | Binary | S or partner has a maxed-out credit card |
Debt to businesses | 6,811 | 2008 | Binary | S or partner in debt to a store, hospital, bank, etc. |
Negative net worth | 6,759 | 2008 | Binary | S's liabilities exceed S's assets |
To ensure we replicated past studies' findings of significant association between CVs and patience, we ran Wilcoxon rank-sum tests for the binary CVs (treating patience as the DV rather than as the predictor, as is usual in this literature) and Kendall correlation tests for the continuous CVs. We chose nonparametric tests so any monotonic relationship could be detected, regardless of scaling. Two tests were run for each CV, one for month patience and one for year patience. We found that of the 64 tests for the binary CVs, 47 were significant (34 after a Holm-Bonferroni correction was applied for 64 comparisons), and of the 16 tests for the continuous CVs, 5 were significant (all of which remained significant after a Holm-Bonferroni correction for 16 comparisons). Exactly which tests were significant is indicated by superscripts in Tables
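For readers unfamiliar with the Holm-Bonferroni correction, a minimal Python sketch of the step-down procedure follows. The significance level of 0.05, the function name, and the example p-values are assumptions for illustration; they are not taken from the analyses reported here.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Sort the p-values in ascending order and compare the one at 0-based
    rank r with alpha / (m - r), stopping at the first failure.
    Returns a list of booleans in the original order (True = significant).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # this and all larger p-values are retained
        reject[i] = True
    return reject

# Hypothetical p-values: all four are below 0.05 before correction.
example = [0.001, 0.04, 0.03, 0.02]
print(holm_reject(example))
```

In this example only the smallest p-value survives correction, because the second smallest (0.02) exceeds its threshold of 0.05 / 3. This is how a set of nominally significant tests can shrink after the correction is applied.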
Difficult to run mile | 0.524 | 0.005 | 0.538 | 0.527 | 0.064 | 0.574 | 0.491 |
Not easy to climb stairs | 0.604 | 0.008 | 0.603 | 0.602 | 0.071 | 0.629 | 0.581 |
Overweight | 0.717 | 0.004 | 0.717 | 0.716 | 0.048 | 0.723 | 0.705 |
Exercise, light, ever | 0.738 | 0.013 | 0.738 | 0.738 | 0.073 | 0.746 | 0.729 |
Exercise, vigorous, ever | 0.681 | 0.011 | 0.682 | 0.682 | 0.069 | 0.692 | 0.668 |
Exercise, strength, ever | 0.628 | 0.009 | 0.627 | 0.625 | 0.062 | 0.646 | 0.607 |
Checks nutrition often | 0.514 | 0.010 | 0.539 | 0.539 | 0.060 | 0.579 | 0.524 |
Eats fast food | 0.646 | 0.004 | 0.646 | 0.646 | 0.054 | 0.660 | 0.632 |
Drinks soft drinks | 0.579 | 0.010 | 0.580 | 0.579 | 0.065 | 0.613 | 0.577 |
Health insurance | 0.810 | 0.020 | 0.810 | 0.810 | 0.075 | 0.814 | 0.804 |
Flu vaccine | 0.681 | 0.003 | 0.681 | 0.681 | 0.052 | 0.691 | 0.662 |
Sees dentist | 0.668 | 0.021 | 0.670 | 0.668 | 0.073 | 0.681 | 0.652 |
Brushes teeth 2/day | 0.741 | 0.001 | 0.741 | 0.740 | 0.058 | 0.748 | 0.732 |
Flosses daily | 0.599 | 0.001 | 0.599 | 0.599 | 0.046 | 0.616 | 0.569 |
Smoked 100 cigs | 0.575 | 0.006 | 0.576 | 0.574 | 0.060 | 0.605 | 0.554 |
Smoking | 0.728 | 0.009 | 0.727 | 0.727 | 0.064 | 0.736 | 0.717 |
Drinking | 0.527 | 0.009 | 0.541 | 0.538 | 0.070 | 0.594 | 0.545 |
Drinking, heavy | 0.859 | 0.000 | 0.859 | 0.859 | 0.051 | 0.860 | 0.850 |
Cannabis | 0.617 | 0.004 | 0.617 | 0.616 | 0.055 | 0.638 | 0.605 |
Cocaine | 0.767 | 0.003 | 0.767 | 0.767 | 0.052 | 0.771 | 0.749 |
Stimulants | 0.887 | 0.005 | 0.887 | 0.887 | 0.058 | 0.889 | 0.882 |
Other drugs | 0.824 | 0.003 | 0.824 | 0.824 | 0.050 | 0.825 | 0.806 |
Divorced | 0.536 | 0.004 | 0.537 | 0.538 | 0.058 | 0.580 | 0.525 |
Stopped by police | 0.824 | 0.003 | 0.824 | 0.824 | 0.058 | 0.826 | 0.810 |
Convicted | 0.951 | 0.002 | 0.951 | 0.951 | 0.050 | 0.951 | 0.947 |
Saving | 0.716 | 0.034 | 0.715 | 0.714 | 0.097 | 0.729 | 0.705 |
Retirement account | 0.523 | 0.034 | 0.572 | 0.570 | 0.092 | 0.610 | 0.559 |
Missed bill payment | 0.787 | 0.010 | 0.787 | 0.787 | 0.062 | 0.791 | 0.769 |
CC debt, any | 0.531 | 0.005 | 0.541 | 0.537 | 0.059 | 0.587 | 0.533 |
CC maxed out | 0.887 | 0.006 | 0.887 | 0.887 | 0.062 | 0.889 | 0.881 |
Debt to businesses | 0.796 | 0.005 | 0.796 | 0.796 | 0.058 | 0.799 | 0.779 |
Negative net worth | 0.885 | 0.014 | 0.885 | 0.885 | 0.065 | 0.886 | 0.877 |
Exercise, light, min/y | 25,940.94 | 25,921.10 | 25,940.86 | 25,427.31 | 26,886.97 |
Exercise, vigorous, min/y | 22,205.79 | 22,173.85 | 22,197.58 | 21,730.30 | 23,316.54 |
Sleep min, weekday | 63.32 | 63.18 | 63.46 | 60.37 | 66.33 |
Sleep min, weekend | 73.66 | 73.18 | 73.34 | 69.09 | 76.23 |
Drinks in last month | 17.92 | 17.91 | 17.95 | 17.34 | 18.80 |
Sexual debut | 1.91 | 1.91 | 1.92 | 1.83 | 1.99 |
Net family income | 44,145.32 | 43,151.85 | 43,285.46 | 40,976.40 | 44,268.49 |
CC debt, dollars | 5,972.24 | 5,961.44 | 5,979.99 | 5,715.25 | 6,269.98 |
As in Study 1, we examined the strength of association of time preferences with each CV and also the accuracy with which time preferences could predict CVs. Table
As can be seen in Table
Table
As can be seen in Table
The overall picture is similar to that of Study 1: the available measures of time preferences cannot predict the available CVs with more than trivial accuracy (except, perhaps, in the case of retirement savings). Weak association under the log models is no doubt related to this; however, our findings for the nominal models exemplify the fact that stronger association is not sufficient for predictive accuracy.
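The gap between a significant association and useful predictive accuracy can be illustrated with a small simulation. The Python sketch below is not the analysis reported in this paper: the sample sizes, the intercept and slope of the data-generating model, and all function names are illustrative assumptions. It fits a one-predictor logistic regression to a simulated binary CV with a base rate of roughly 80%, then compares held-out accuracy against the trivial rule of always predicting the majority class.

```python
import math
import random


def simulate(n, intercept, slope, rng):
    """Draw (x, y) pairs: x is a standardized score, y a binary outcome."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        data.append((x, 1 if rng.random() < p else 0))
    return data


def fit_logistic(data, iterations=25):
    """Fit a one-predictor logistic regression by Newton's method.

    Returns the intercept, the slope, and the slope's standard error.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient, intercept
            g1 += (y - p) * x    # gradient, slope
            h00 += w             # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Information matrix from the final iteration; after convergence this
    # is effectively evaluated at the estimate.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


def accuracy(data, b0, b1):
    """Proportion of cases where the predicted class matches the outcome."""
    hits = sum(1 for x, y in data if (b0 + b1 * x > 0) == (y == 1))
    return hits / len(data)


# Hypothetical settings: a weak negative effect of the score on the CV.
rng = random.Random(0)
train = simulate(3500, 1.4, -0.3, rng)
test = simulate(3500, 1.4, -0.3, rng)

b0, b1, se = fit_logistic(train)
z = b1 / se  # Wald statistic for the slope

# Baseline: always predict the majority class observed in training.
majority = 1 if 2 * sum(y for _, y in train) >= len(train) else 0
baseline_acc = sum(1 for _, y in test if y == majority) / len(test)
model_acc = accuracy(test, b0, b1)

print(f"slope = {b1:.3f}, z = {z:.2f}")
print(f"baseline accuracy = {baseline_acc:.3f}, model accuracy = {model_acc:.3f}")
```

Under these assumed settings, the slope is many standard errors from zero, yet the fitted probability of the majority outcome stays above one half across essentially the whole range of the score. The model therefore makes nearly the same predictions as the baseline and gains almost nothing in accuracy.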
We assessed how accurately several tests of time preferences, comprising both patience and nonstationarity, could predict a variety of CVs. Study 1, using 181 users of Mechanical Turk, 10 CVs, and 3 distinct families of time-preference tests, found low to zero predictive accuracy. Study 2 replicated this finding for 7,127 participants in the NLSY79, 40 new CVs (covering all but one of the content areas of Study 1 as well as some others), and one new test of time preferences. The studies complement each other in that Study 1 took special care to ensure the quality of measurement of time preferences, whereas Study 2 used a much larger, nationally representative sample, and considered a richer set of CVs. In Study 1, we found that the three families of time-preference tests had decent retest reliability and convergent validity, supporting the idea that our negative result was not due to poorly chosen time-preference tests. In Study 2, we found that many of the relationships between time preferences and the CVs were significantly nonzero, exemplifying that significance does not imply predictive accuracy, and we found that the nominal model's greater strength of association did not translate into greater predictive accuracy, exemplifying the gap between association and prediction.
The consistently observed lack of predictive accuracy may be surprising in light of previous studies. However, as discussed in the introduction, past studies of time preferences have generally failed to evaluate predictive accuracy, despite the intentions of researchers. For, while these and other studies have found many significant associations between time preferences and CVs, significance and strength of association are distinct from predictive accuracy, as discussed in the introduction and demonstrated in Study 2. Hence, our findings are in no way
Our finding of no predictive accuracy, which supports and strengthens the negative claim of Chabris et al. (
What can be made of Study 1's findings concerning reliability and convergent validity? They are useful as reassurance that we did not choose particularly poor tests, to which subjects responded mostly randomly. Arfer and Luhmann (
Because nonstationarity plays an important role in economic thinking on self-control, but has been somewhat neglected in behavioral research on intertemporal choice, we took special care to include it in our tests. Not only did this fail to yield predictive accuracy for CVs, but we also found that preference tests could predict variations of themselves with different front-end delays just as well as they could predict themselves unaltered. This finding suggests there is little value in measuring nonstationarity separately from patience, supporting the usual practice of measuring only patience.
Our conclusions are qualified by the limitations of our methods and the scope of our study. First, in terms of independent variables, we concerned ourselves exclusively with abstract time preferences. Even if time preferences are not predictively useful on their own, perhaps they have predictively useful interactions with other variables. Unfortunately, because there is no limit to the number and diversity of other independent variables that might be considered, there is no real way to falsify this idea. Another avenue we did not explore is measures of intertemporal choice that consider a domain other than money or that match the context of test-taking to the context of the behavior of interest. What is perhaps the most famous patience test uses marshmallows rather than money (Mischel et al.,
All our measures of time preferences had subjects give judgments about hypothetical rather than real scenarios, which may seem questionable. However, past research contrasting real and hypothetical rewards has found no effect on time preferences (e.g., Johnson and Bickel,
Finally, in Study 1, we cannot know 1-month retest reliabilities among the many subjects who did not return for session 2. Our return rate, at 52%, was less than the 60% obtained in another retest-reliability study on Mechanical Turk by Buhrmester et al. (
Both authors provided ideas, planned the study, and edited the manuscript. KA wrote task code and collected the data for Study 1, conducted analyses, and wrote initial drafts of the manuscript.
This study was funded by the National Institute of Mental Health (T32MH109205) and the UCLA Center for HIV Identification, Prevention and Treatment Services (P30MH58107).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
1 We also generated Tables
2 We regenerated all the tables for Study 1 in this paper using all 200 subjects (i.e., without any exclusion criteria, except of course that we could not assess 1-month retest reliability among subjects who did not return for session 2). We also regenerated Tables