Conducting Online Behavioral Research Using Crowdsourcing Services in Japan

Majima, Yoshimasa; Nishiyama, Kaoru; Nishihara, Aki; Hata, Ryosuke

doi:10.3389/fpsyg.2017.00378

ORIGINAL RESEARCH article

Front. Psychol., 14 March 2017

Sec. Quantitative Psychology and Measurement

Volume 8 - 2017 | https://doi.org/10.3389/fpsyg.2017.00378

Yoshimasa Majima¹*

Kaoru Nishiyama¹

Aki Nishihara²

Ryosuke Hata³

¹Department of Psychology for Well-Being, Hokusei Gakuen University, Sapporo, Japan
²Department of Foreign Language Education, Hokusei Gakuen University, Sapporo, Japan
³Department of Social Work, Hokusei Gakuen University, Sapporo, Japan

Recent research on human behavior has often collected empirical data from the online labor market, through a process known as crowdsourcing. As in the United States and the major European countries, there are several crowdsourcing services in Japan. For research purposes, Amazon's Mechanical Turk (MTurk) is the most widely used platform among crowdsourcing services. Previous validation studies have shown many commonalities between MTurk workers and participants from traditional samples based on not only personality but also performance on reasoning tasks. The present study aims to extend these findings to non-MTurk (i.e., Japanese) crowdsourcing samples in which workers have different ethnic backgrounds from those of MTurk workers. We conducted three surveys (N = 426, 453, 167, respectively) designed to compare Japanese crowdsourcing workers and university students in terms of their demographics, personality traits, reasoning skills, and attention to instructions. The results generally align with previous studies and suggest that non-MTurk participants are also eligible for behavioral research. Furthermore, small screen devices are found to impair participants' attention to instructions. Several recommendations concerning this sample are presented.

Introduction

Online survey research is becoming increasingly popular in psychology and other social sciences that study human behavior. Researchers often collect data from participants in online labor markets, through a process known as crowdsourcing. Recruiting participants from a crowdsourcing service is attractive to researchers because of its advantages over using traditional samples.

Estelles-Arolas and Gonzalez-Ladron-De-Guevara (2012) define crowdsourcing as an online activity in which a group of diverse individuals (users) voluntarily undertake a task proposed by an individual or a profit/non-profit organization (crowdsourcer) and in which the users receive monetary and other forms of compensation in exchange for their contributions, while the crowdsourcer benefits from the work performed by the users. In behavioral research, crowdsourcing websites offer researchers a useful platform that provides convenient access to a large set of people who are willing to undertake tasks, including research studies, at a relatively low cost. One of the most well-known crowdsourcing sites is Amazon's Mechanical Turk, which is often abbreviated as MTurk.

MTurk specializes in recruiting users, who are referred to as workers, to complete small tasks that are known as HITs (human intelligence tasks). For research purposes, researchers (requesters) post HITs that contain surveys and/or experiments that can be completed on a computer using supplied templates. Sometimes, requesters post a link to external survey tools, such as SurveyMonkey and Qualtrics. Workers browse or search for tasks, and they are paid in exchange for their successful contribution to a task.

Mason and Suri (2012) identified four advantages of MTurk: access to a large, stable pool of participants; participant diversity; a low cost and built-in payment system; and a faster theory and/or experiment cycle. Because of these benefits, MTurk is becoming popular as a potential participant pool for psychology and other social sciences.

Along with increasing usage in behavioral research, the validity of the data obtained from MTurk participants has been examined in several studies (for a recent review, see Paolacci and Chandler, 2014). These investigations have typically compared MTurk data with those from traditional samples, such as university students and other community members. First, demographic surveys have shown that MTurk workers are mostly residents of the United States and India and that they tend to be in their thirties, which is older than typical students who are in their late teens and twenties (e.g., Paolacci et al., 2010; Behrend et al., 2011; Goodman et al., 2013). In addition, workers and participants from traditional samples differ in terms of their personality traits. For example, MTurk workers are less extraverted and emotionally stable and show lower self-esteem than students. They also value money more than time and exhibit higher materialism than an age-matched community sample (Goodman et al., 2013).

The two samples also differ in their reasoning performance and attention to instructions. For example, Goodman et al. (2013) found that MTurk workers show lower cognitive capabilities than students on the Cognitive Reflection Test (Frederick, 2005), which requires effortful System-2 thinking, and on a “trap” task that involves what is known as instructional manipulation checks (IMCs), which detect participants' inattentive responses to the survey questions. However, Goodman et al. also pointed out that failures on the IMC were mainly found in ESL and non-US participants. Therefore, language proficiency, as well as careful reading of study materials, is essential for passing IMCs. Furthermore, Hauser and Schwarz (2015) showed that answering IMCs before “tricky” reasoning tasks improves performance on subsequent tasks; the authors explained that the IMC itself alters participants' attention to subsequent tasks and prompts participants to adopt a more deliberative thinking strategy, which results in improved performance on these tasks.

The MTurk and traditional samples also have several commonalities. For example, MTurk workers and students show similar performance on classical heuristic-bias judgment tasks, such as the Linda problem (Tversky and Kahneman, 1983) and the Asian disease problem (Tversky and Kahneman, 1981). Paolacci et al. (2010) showed that both students and MTurk workers exhibited a significant framing effect, conjunction fallacy, and outcome bias. They also exhibit a significant anchoring-and-adjustment effect; however, the anchoring bias is mainly shown in the community sample, and the MTurk workers do not show anchoring bias, partly because they might “search” for the correct answer on the Internet. In addition, MTurk workers perform similarly in traditional experimental psychology tasks, such as the Stroop, Flanker, attentional blink, and categorical learning tasks (Crump et al., 2013).

In sum, although MTurk participants and traditional participants differ in terms of a few features, they share many common properties. Therefore, crowdsourcing is considered to be a fruitful data collection tool for psychology and other social sciences (Goodman et al., 2013; Paolacci and Chandler, 2014).

MTurk appears to provide a promising approach to behavioral studies owing to its advantages over traditional offline data collection. Despite these advantages, there are some limitations of MTurk as a participant pool for empirical studies. First, there are issues with sample diversity. Demographic surveys have repeatedly shown that the majority of MTurk workers are Caucasian residents of the United States, followed by Asian workers who live in India (Paolacci et al., 2010; Behrend et al., 2011; Goodman et al., 2013). Currently, MTurk requires its workers to provide valid US taxpayer identification information when they get paid (either in US dollars or Indian Rupees); otherwise, they can only transfer their earnings to an Amazon gift card. This restriction on monetary compensation may substantially reduce the number of non-US workers. Because of this biased population, it is difficult for researchers in other countries to collect data from residents of their own cultures. Of course, there are other crowdsourcing services, such as Prolific Academic and CrowdFlower, although it seems that Caucasian residents of the USA, the UK, and other European countries are also the predominant participants of these pools. Therefore, researchers who wish to collect data from samples of other ethnicities or nationalities should utilize other crowdsourcing services. This is exactly the case with Japanese researchers.

The second issue is of a technical nature. At this time, a US bank account is required to be a requester in MTurk. This requirement also constitutes an obstacle to adopting MTurk as a participant pool for researchers outside the United States¹. On these grounds, MTurk is considered to provide limited access to participant pools for behavioral researchers around the world.

Several studies also pointed out potential pitfalls of online studies with MTurk. First, Zhou and Fishbach (2016) claimed that researchers should pay attention to attrition rates, which pose a threat to the internal validity of a study. They also recommended that researchers not only implement dropout-reduction strategies, but also explore causes of, increase the visibility of, and report participant attrition. Second, Chandler et al. (2014, 2015) suggested that MTurk workers are likely to participate in multiple surveys; hence, workers might be less naïve than participants from other (e.g., student) samples. They also pointed out that prior experience with commonly used survey questions (e.g., Cognitive Reflection Test, Frederick, 2005) inflates performance on the task, and suggested that the repeated participation of workers may threaten the predictive accuracy of the task and reduce effect sizes of research findings. Furthermore, Stewart et al. (2015) estimated the size of the population of active MTurk workers and suggested that the average laboratory can collect data from only a relatively small number of active workers (about 7,300 compared to 500,000 registered MTurk workers). Thus, repeated participation in similar surveys is more likely to happen than expected. These pitfalls, the high rate of non-naivety of participants in particular, can be resolved if researchers recruit participants from alternative crowdsourcing services. In addition, conducting online surveys and experiments with multiple crowdsourcing platforms will be beneficial for researchers who look for a more diverse sample.

As noted previously, the quality of data collected from MTurk participants has been verified. It has also been shown that other crowdsourcing pools, such as Clickworker and Prolific Academic, are practical alternatives to MTurk (e.g., Lutz, 2016; Peer et al., 2017). However, data from other crowdsourcing samples, particularly from non-Caucasian samples, have not as yet been fully investigated. To promote research using other crowdsourcing services, we must examine whether the data obtained from other crowdsourcing pools are as reliable as those from MTurk.

Research Objectives and General Research Method

The primary goal of the present study is to extend existing findings of previous validation studies of MTurk to other non-MTurk crowdsourcing samples. Specifically, we investigated the following questions.

Question 1: Do the demographic properties of workers from the other (i.e., non-MTurk) crowdsourcing samples differ from those of students? If so, how are they different?

Question 2: Do psychometric properties, such as personality traits or those of consumer behavior, differ across non-MTurk workers and students?

Question 3: How does the quality of non-MTurk workers' performance on reasoning and judgment tasks that require effortful System-2 thinking compare with that of students?

Question 4: How do non-MTurk workers respond to “trap” questions? Are they more (or less) attentive to the instructions for these tasks?

The present study compared crowdsourcing participants with university students in terms of their personality, psychometric properties regarding decision making, and consumer behavior (Survey 1), as well as their thinking disposition, reasoning performance, and attention to the study materials (Surveys 2 and 3). In all of the surveys, the crowdsourcing participants were recruited from CrowdWorks (a Japanese crowdsourcing service, which is abbreviated as CW hereafter; https://crowdworks.jp). We adopted CW as a participant pool for the following reasons. First, CW has a growing and sufficiently large pool of registered workers for validation studies (a total of more than 1 million workers as of August 2016). Second, because it offers a user interface that is written in Japanese, the majority of workers are native Japanese speakers, and as a result, it enables data collection from participants of different ethnic groups than MTurk. Third, it offers a payment system similar to that of MTurk, and it does not charge a commission fee for micro tasks. In addition, it accepts several payment methods, such as bank transfer, credit cards, and PayPal. The student sample was collected from two middle-sized universities that are located in Sapporo, which is a large northern city of Japan. The CW participants received monetary compensation in exchange for their participation in the survey. The students, in contrast, either received extra course credit or participated voluntarily.

All of the participants answered web-based questionnaires that were administered by SurveyMonkey (Surveys 1 and 2) or Qualtrics (Survey 3). For the CW sample, we posted a link to the survey site in the CW task. When the participants reached the site, they were presented with general instructions, and they were asked to provide their consent to participate in the survey by clicking an “agree” button. If they agreed to take the survey, the online questionnaires were presented in a designed sequence. After they completed the questionnaires, they received a randomized completion code, and they were asked to enter it into the CW task page to receive payment. Because CW allocates a unique ID per person, it is possible to prevent the same worker from taking a single task more than once. In addition, we enabled the SurveyMonkey and Qualtrics restriction features to prohibit multiple participations. After the correct completion code had been entered, the experimenter approved the compensation to be sent to the participants' accounts. The CW participants were completely anonymous throughout the entire survey process.
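The completion-code procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure as implemented by SurveyMonkey, Qualtrics, or CW: the class `CompletionDesk`, its method names, the worker ID strings, and the 8-character code format are all hypothetical.

```python
import secrets
import string

# Hypothetical code format: 8 characters drawn from A-Z and 0-9.
ALPHABET = string.ascii_uppercase + string.digits


class CompletionDesk:
    """Toy model of the survey-side code issuer and the payment-side check."""

    def __init__(self):
        self.issued = set()        # codes handed out at the end of the survey
        self.redeemed = set()      # codes already exchanged for payment
        self.paid_workers = set()  # worker IDs that have already been approved

    def issue_code(self, length=8):
        """Return a fresh random code for a participant who finished the survey."""
        while True:
            code = "".join(secrets.choice(ALPHABET) for _ in range(length))
            if code not in self.issued:
                self.issued.add(code)
                return code

    def approve(self, worker_id, code):
        """Approve payment at most once per worker and once per issued code."""
        if worker_id in self.paid_workers:
            return False
        if code not in self.issued or code in self.redeemed:
            return False
        self.redeemed.add(code)
        self.paid_workers.add(worker_id)
        return True
```

In this sketch the survey side never sees a worker ID and the payment side never sees survey responses, which mirrors the anonymity of the CW participants described above.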

The university students were recruited from introductory psychology, statistics, English, or social welfare classes, and they were provided with a leaflet that described a link to the equivalent web-based survey site. When they reached the site, they received the same general instructions and the same request for their consent to take the survey as the CW participants. After they completed the survey, they were provided with a randomly generated completion code that was required for them to receive credit.

The present study was approved by the Hokusei Gakuen University Ethics Committee and conducted in compliance with its guidelines. All of the participants gave their web-based informed consent instead of written consent.

Survey 1: Personality and Psychometric Properties

Survey 1 compared the CW and university samples in terms of their demographic status, personality traits and psychometric properties, which included the so-called Big Five traits, as well as self-esteem, goal orientation, and materialism as an aspect of consumer behavior. These scales were adopted from previous validation studies of MTurk (e.g., Behrend et al., 2011; Goodman et al., 2013).

Method

Participants

A total of 319 crowdsourcing workers agreed to participate in the survey; however, we excluded 7 participants because they did not complete the questionnaire. We also excluded 17 responses because of IP address duplication, which left 295 in the final sample. The participants received 50 JPY for completing a 10 min survey.

In addition, we recruited 144 students, but we excluded 12 participants from the analyses for the following reasons: incomplete responses (11 participants) and IP address duplication (1 participant). We also excluded one participant from the analyses because of a failure to indicate that he or she was currently a university student in the demographic question. A final sample of 131 undergraduate students participated in the survey.

The sample size of the present survey was decided in reference to previous validation studies of MTurk and for other practical reasons. For example, Behrend et al. (2011) collected 270 MTurk and 270 undergraduate students, and Goodman et al. (2013) sampled 207 MTurk and 131 student participants. In addition, based on our previous experience in using CW, we estimated that growth in the number of CW participants would slow down once we had recruited more than 300 participants. Furthermore, the size of the student sample was determined by a rather practical reason, i.e., class attendance. However, as shown above, the present survey collected as many student participants as Goodman et al.'s (2013) study. Although Goodman et al. (2013) did not mention effect sizes, Behrend et al. (2011) reported that effect sizes for the difference in personality traits between MTurk and student samples ranged from d = 0.31 to 0.86. We conducted a power analysis in G*Power to determine a sufficient sample size using an alpha of 0.05, a power of 0.8, an effect size of d = 0.3, and two tails. Based on the aforementioned assumptions, the desired sizes for the first and the second sample were 285 and 127. The result indicated that the sample size of the present survey was sufficiently large.
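As a rough cross-check of the power analysis above, the following sketch approximates the power of a two-tailed independent-samples t-test for d = 0.3 with group sizes of 285 and 127. It is not a reproduction of the G*Power computation: it assumes a normal approximation to the noncentral t distribution, and the function name `approx_power_two_sample` is ours.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_sample(d, n1, n2, alpha=0.05):
    """Approximate two-tailed power of an independent-samples t-test.

    Uses the normal approximation to the noncentral t distribution,
    so the result can differ slightly from an exact calculation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    delta = d * sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)


power = approx_power_two_sample(d=0.3, n1=285, n2=127)
print(f"approximate power: {power:.3f}")
```

Under this approximation the power for the stated sample sizes comes out close to the 0.8 target, which agrees with the conclusion that the final samples (295 and 131) were large enough to detect d = 0.3.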

Materials and Procedure

As measures of personality traits, we administered two widely used personality inventories: a brief measure of the Big-Five personality dimensions (10-Item Personality Inventory, TIPI; Gosling et al., 2003) and Rosenberg's self-esteem scale (RSE; Rosenberg, 1965). In this survey, we adopted the Japanese version of the TIPI (TIPI-J; Oshio et al., 2012) and the RSE, which was translated into Japanese by Yamamoto et al. (1982). Furthermore, we administered two additional scales that were also used in the previous validation studies: the performance prove/avoid goal orientation scale (PPGO and PAGO; Vandewalle, 1997) and the Material Value Scale (MVS; Richins, 2004). Finally, we asked for participants' demographic status: age, gender, ethnicity, nationality, educational level, and employment status.

All of the participants completed identical measures in an identical order. In the first step, they answered each of the TIPI-J items on a 7-point Likert scale (1 = Disagree Strongly to 7 = Agree Strongly). Next, the participants answered the PAGO and PPGO items in mixed order (6-point scale, 1 = Strongly disagree to 6 = Strongly agree). Subsequently, the participants were presented with the 10 items of the RSE followed by a nine-item version of the MVS and provided answers to each item on a 5-point scale that ranged from 1 = Strongly disagree to 5 = Strongly agree. Finally, they answered demographic questions before the end of the survey.

Results

All of the statistical analyses of the present study were performed using SPSS 21.0. In addition, when we report η² as an index of effect size for ANOVA, the value designates partial η².

Demographics

Table 1 summarizes the demographic status of both samples. Compared with the students (UNIV), the CW workers were significantly older, M_C = 36.9 vs. M_U = 19.6 years, t₍₄₂₃₎ = 22.4, p < 0.001, d = 2.35; included a higher percentage of females, CW = 63.7% vs. UNIV = 43.5%, χ²₍₁₎ = 15.2, p < 0.001; and had a higher median level of education, Mdn_C = “associate degree,” Mdn_U = “high school,” Wilcoxon's Z = 10.3, p < 0.001. The two samples also differed in their years of work experience, M_C = 12.4 vs. M_U = 2.5, t₍₂₆₃₎ = 4.7, p < 0.001, d = 1.17; and employment status, χ²₍₅₎ = 77.4, p < 0.001. On the one hand, 74.8% of the students were not currently employed, and 17.6% were part-time workers. On the other hand, 40.3% of the CW workers were not employed, 21.4% were full-time employees, 19.7% were self-employed, and 13.2% were part-time workers.
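The gender comparison above can be checked with a short calculation. The sketch below is an illustration rather than the original SPSS analysis: the cell counts are reconstructed from the reported percentages and the final sample sizes (295 CW workers, 131 students), which is an assumption, and the statistic is the Pearson chi-square without continuity correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Counts reconstructed from the reported percentages (an assumption):
# 63.7% of 295 CW workers and 43.5% of 131 students were female.
cw_n, univ_n = 295, 131
cw_female = round(0.637 * cw_n)
univ_female = round(0.435 * univ_n)

chi2 = chi_square_2x2(cw_female, cw_n - cw_female,
                      univ_female, univ_n - univ_female)
print(f"chi-square(1) = {chi2:.1f}")
```

With these reconstructed counts the statistic rounds to the reported value of 15.2.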

Table 1. Demographic results of Surveys 1 and 2.

Personality Traits

Table 2 summarizes the results of the Big Five personality and self-esteem scales. In the following analyses, we considered sample and gender as independent variables (age was excluded because of a strong point-biserial correlation with sample, r_pb = 0.74). Gender was included because several previous studies with Japanese participants have shown gender differences in these personality traits (e.g., Kawamoto et al., 2015; Okada et al., 2015; the gender issue is discussed in the Discussion Section). The TIPI-J scores were submitted to a sample × gender MANOVA, which showed a significant multivariate effect of the sample, F_{(5, 418)} = 5.95, Wilks' Λ = 0.93, p < 0.001, η² = 0.07. The multivariate effect was also significant for gender, F_{(5, 418)} = 6.63, Λ = 0.93, p < 0.001, η² = 0.08. The univariate F-tests revealed significant differences between the samples in Extraversion, F_{(1, 422)} = 8.57, MSE = 8.55, p = 0.004, η² = 0.02; Agreeableness, F_{(1, 422)} = 5.22, MSE = 5.38, p = 0.023, η² = 0.01; and Conscientiousness, F_{(1, 422)} = 6.07, MSE = 6.97, p = 0.014, η² = 0.01. The two samples were not different in Emotional Stability and Openness (Fs < 1). The results also showed that males were significantly higher than females in Emotional Stability, F_{(1, 422)} = 14.08, MSE = 6.39, p < 0.001, η² = 0.03; and Openness, F_{(1, 422)} = 9.11, MSE = 6.34, p = 0.003, η² = 0.02. Gender differences were not found in Extraversion, Agreeableness, and Conscientiousness, Fs_{(1, 422)} < 1.44, ps > 0.23. Although the multivariate sample × gender interaction was not significant, F_{(5, 418)} = 1.29, Λ = 0.98, p = 0.267, η² = 0.02, the univariate analysis showed a significant sample × gender interaction on Openness, F_{(1, 422)} = 5.25, MSE = 6.34, p = 0.022, η² = 0.01. The analysis of the simple main effect indicated that male students were more open than female students, F_{(1, 422)} = 10.38, MSE = 6.34, p = 0.001, η² = 0.02; however, no such difference was found for CW workers, F < 1.
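Because the reported η² values are partial η², they can be recovered from each F statistic and its degrees of freedom through the identity partial η² = (F × df_effect) / (F × df_effect + df_error). The following sketch applies this identity to the univariate sample effects reported above; the function name is ours, and the check only confirms internal consistency of the reported values.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# (F value, reported partial eta squared) for the univariate sample effects.
reported = {
    "Extraversion": (8.57, 0.02),
    "Agreeableness": (5.22, 0.01),
    "Conscientiousness": (6.07, 0.01),
}

for trait, (f_value, eta_reported) in reported.items():
    eta = partial_eta_squared(f_value, df_effect=1, df_error=422)
    print(f"{trait}: computed {eta:.3f}, reported {eta_reported}")
```

Each computed value rounds to the reported one, which also illustrates how small these sample differences are relative to the error variance.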

Table 2. Means, standard deviations, and Cronbach's alpha coefficients of personality traits and psychometric properties as functions of the sample (UNIV, student; CW, CrowdWorks) and gender (Survey 1).

A similar ANOVA on the RSE scale failed to show significant sample and gender differences, Fs_{(1, 422)} < 2.1, ps > 0.148. However, we found a significant sample × gender interaction, F_{(1, 422)} = 5.02, MSE = 59.51, p = 0.026, η² = 0.01. The analysis of simple effects revealed that male students scored slightly higher than female students, F_{(1, 422)} = 5.0, p = 0.026, MSE = 59.51, η² = 0.01; however, no gender difference was found for the CW sample, F < 1.

Goal Orientation and Material Value

Table 2 shows the results of goal orientation and materialism. A MANOVA on the two goal orientations indicated a multivariate effect of the sample, F_{(2, 421)} = 5.11, Λ = 0.98, p = 0.006, η² = 0.02. Subsequent univariate F-tests revealed that the students were higher than the workers in both PAGO and PPGO, Fs_{(1, 422)} = 8.43, 5.45, MSEs = 10.93, 11.40, ps = 0.004, 0.020, η²s = 0.02, 0.01, respectively. However, neither the effect of gender nor the interaction effect was significant, Fs < 1.

Then, a sample × gender ANOVA was conducted on the MVS scores, and the result showed that the students were more materialistic than the crowdsourcing sample, F_{(1, 422)} = 3.94, MSE = 34.34, p = 0.048, η² = 0.01. However, the gender main effect and the interaction were not significant, Fs_{(1, 422)} = 2.72, 0.83, ps = 0.100, 0.362.

Discussion

In Survey 1, we found a significant, but not surprising, difference between the students and the CW workers in terms of their demographic status. The findings also showed that some personality characteristics differed between the two samples. For example, the CW participants were less extraverted and agreeable, although they were more conscientious than the students. In addition, the CW participants were less materialistic, and their pursuit of performance-avoid and performance-prove goals was lower than that of the students.

Some of these results, such as demographics, extraversion, openness, and performance-avoid goal orientation, were compatible with the previous validation studies using MTurk (Paolacci et al., 2010; Behrend et al., 2011; Goodman et al., 2013). Several other results on the difference between the two samples were inconsistent with the previous studies. For example, Goodman et al. (2013) showed that MTurk workers were more emotionally unstable, i.e., neurotic, than the students and community sample; however, we did not find any such difference between the samples, but we did find a gender difference. Recently, Kawamoto et al. (2015) showed that Japanese females scored higher in neuroticism than males, particularly in their younger adulthood. Our result is compatible with this finding if we consider the distribution of age in both of the samples (UNIV = the majority of the participants were in their late teens or early twenties, CW = 40% were in their thirties, 30% were in their forties, and 19% were in their twenties). Goodman et al. (2013) also found that MTurk workers were less conscientious than students. However, our results were in the opposite direction; they were consistent with findings on Big Five personality showing that conscientiousness is likely to develop during adulthood (e.g., McCrae et al., 2000; Srivastava et al., 2003; Kawamoto et al., 2015). Furthermore, we found that the male students were higher in self-esteem than the female students; however, no gender difference was found in the CW sample. Our findings were compatible with the previous offline investigations, which showed that males had higher self-esteem than females, and that this gender difference decreased throughout adulthood (Kling et al., 1999; Robins et al., 2002; Okada et al., 2015).

To summarize, our results indicated both similarities and differences between the CW workers and the students, which is generally consistent with existing findings. It is also important to note that the effect sizes of the sample differences were relatively small, as has been shown in previous studies.

Survey 2: Attentional Check and System-2 Thinking

Survey 2 aimed to compare the crowdsourcing workers and students in terms of their thinking disposition, as well as their reasoning and judgment biases related to systematic System-2 thinking.

As a measure of thinking disposition, we administered the Cognitive Reflection Test (CRT; Frederick, 2005), which is a set of widely used tasks to measure individual differences in dual process thought, particularly in effortful System 2 thinking. We also administered the following three tasks to measure the participants' biases in reasoning and judgment. The first task was the probabilistic reasoning task (Toplak et al., 2011), which aimed to measure denominator neglect bias in a hypothetical scenario. The second task was the logical reasoning task, which consisted of eight syllogisms (Markovits and Nantel, 1989; Majima, 2015) in which the validity of the conclusion always conflicted with common belief. These syllogisms were designed to measure the strength of the belief bias effect (Evans et al., 1983). The third task was a classical anchoring-and-adjustment task (Tversky and Kahneman, 1974).

We also investigated sample differences in attention to instructions by using instructional manipulation checks (IMCs; Oppenheimer et al., 2009). In addition, we examined whether answering the IMCs promoted successful solutions to the other “tricky” reasoning tasks, as shown in Hauser and Schwarz (2015). To investigate whether the interventional effects of an IMC on the subsequent tasks were replicated in the Japanese sample, two questionnaire orders were introduced: IMC-first, in which the IMC was administered before the CRT and other reasoning tasks, and IMC-last, in which the IMC was administered after those tasks.
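The two questionnaire orders can be expressed as a small sketch. It is illustrative only: the order of tasks within the reasoning block and the use of random assignment to conditions are assumptions made here, and the function names are hypothetical.

```python
import random

# Assumed order within the reasoning block; the tasks themselves are those
# named in the text (CRT, probabilistic reasoning, syllogisms, anchoring).
REASONING_TASKS = ["CRT", "probabilistic reasoning", "syllogisms", "anchoring"]
CONDITIONS = ("IMC-first", "IMC-last")


def questionnaire_sequence(condition):
    """Return the task sequence for one of the two questionnaire orders."""
    if condition == "IMC-first":
        return ["IMC"] + REASONING_TASKS
    if condition == "IMC-last":
        return REASONING_TASKS + ["IMC"]
    raise ValueError(f"unknown condition: {condition!r}")


def assign_condition(rng):
    """Pick a questionnaire order at random (assumed assignment scheme)."""
    return rng.choice(CONDITIONS)


rng = random.Random(2017)  # fixed seed so the illustration is repeatable
condition = assign_condition(rng)
print(condition, questionnaire_sequence(condition))
```

Comparing reasoning performance between the two sequences is what allows the effect of answering an IMC beforehand to be separated from sample differences.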