Improving the Estimation of Psychometric Functions in 2AFC Discrimination Tasks

Ulrich and Vorberg (2009) presented a method that fits distinct functions for each order of presentation of standard and test stimuli in a two-alternative forced-choice (2AFC) discrimination task, which removes the contaminating influence of order effects from estimates of the difference limen. The two functions are fitted simultaneously under the constraint that their average evaluates to 0.5 when test and standard have the same magnitude, which was regarded as a general property of 2AFC tasks. This constraint implies that physical identity produces indistinguishability, which is valid when test and standard are identical except for magnitude along the dimension of comparison. However, indistinguishability does not occur at physical identity when test and standard differ on dimensions other than that along which they are compared (e.g., vertical and horizontal lines of the same length are not perceived to have the same length). In these cases, the method of Ulrich and Vorberg cannot be used. We propose a generalization of their method for use in such cases and illustrate it with data from a 2AFC experiment involving length discrimination of horizontal and vertical lines. The resultant data could be fitted with our generalization but not with the method of Ulrich and Vorberg. Further extensions of this method are discussed.

Introduction
Two-alternative forced-choice (2AFC) procedures are regarded as objective methods to gather psychophysical evidence, although they suffer from some methodological problems. In a temporal 2AFC discrimination task, one of the intervals presents a fixed stimulus (the standard) and the other presents a test (or comparison) stimulus whose magnitude differs across trials. Test magnitudes may vary from well below that of the standard to well above it, and test and standard may have the same magnitude in some trials. The order of presentation of test and standard is randomized across trials, ideally with the constraint that at each test magnitude the test is presented in the first interval on half of the trials. Observers are asked to report the interval in which perceived magnitude was stronger. When the proportion of times that the test was judged stronger is plotted as a function of test magnitude, the data typically describe a sigmoidal pattern that is often summarized by fitting a psychometric function whose location and slope are free parameters. Ulrich and Vorberg (2009) argued that the location of this psychometric function should not be a free parameter but should instead be fixed according to theoretical constraints. They started by noting that 2AFC data come from a mixture of trials involving two orders of presentation of test and standard. They then discussed order effects whose origin is unknown but whose consequence is that the psychometric functions Ψ 1 and Ψ 2 that hold for trials in which the test is presented first or second may differ in slope, location, or both. Thus, points of subjective equality (PSEs) and difference limens vary with order of presentation in 2AFC tasks (see Woodruff et al., 1975; Masin and Fanton, 1989) and Ulrich and Vorberg noted that estimates of the difference limen obtained by fitting a single psychometric function to data aggregated across presentation orders may be seriously contaminated.
In their search for uncontaminated estimation of the difference limen in the presence of order effects, Ulrich and Vorberg claimed that the psychometric function Ψ 2AFC must have its 50% point at the magnitude of the standard. Their argument relies on two facts. The first one is that Ψ 2AFC (x) = [Ψ 1 (x) + Ψ 2 (x)]/2. The second is that when the test has the same magnitude as the standard the two stimuli are physically identical. If x s is the magnitude of the standard, Ulrich and Vorberg's contention is that Ψ 1 (x s ) + Ψ 2 (x s ) = 1 and, hence, that Ψ 2AFC (x s ) = 0.5. They claim that this result (which we will refer to as "the contention") is a property of 2AFC tasks. Ulrich (2010, p. 1188) further emphasized that Ψ 2AFC (x s ) = 0.5 "must always hold, and it is not a theoretical constraint but a tautology associated with the 2AFC methodology. In other words, the PSE in a 2AFC task is always equal to [x s ] (…). If PSE is estimated using some model that allows it to differ from [x s ] and if the estimated PSE does differ from [x s ], this merely reflects statistical noise." As a result, Ulrich and Vorberg as well as Ulrich recommended that 2AFC psychometric functions always be fitted under this constraint.
Ulrich and Vorberg (2009) illustrated their method by fitting psychometric functions to 2AFC data from an experiment in which observers were asked to indicate which of two temporal intervals was longer. One of these intervals had fixed length on all trials and served as the standard stimulus; the other interval varied in length across trials and served as the test stimulus. In such an experiment, in which test and standard differ in length but are identical in all other respects, the contention seems tenable and its validity can be proved formally on the reasonable assumption that stimuli that are physically identical in all respects are perceived identically. But the assumption that physical identity implies perceived identity does not necessarily hold when the two stimuli differ on dimensions other than that along which they are compared. This latter characteristic is actually prevalent in empirical use of 2AFC discrimination tasks, which leaves Ulrich and Vorberg's method inapplicable.
This paper discusses the validity of Ulrich and Vorberg's (2009) contention as a general property of 2AFC tasks and our main goal is to propose a generalization that is always valid and, thus, which allows obtaining uncontaminated estimates of the difference limen in all circumstances. To make the paper self-contained, the next section describes briefly Ulrich and Vorberg's approach to fitting 2AFC data under the contention. Next, the inadequacy of the contention for 2AFC tasks in general is discussed in the light of countering and well-known empirical evidence. The contention is next amended so that it can be applied in all cases and an experiment is reported whose data reveal the inadequacy of the original contention and the validity of our generalization. Further extensions and improvements of the method are finally discussed, which should help to improve the estimation of psychometric functions from 2AFC discrimination data and, hence, to obtain estimates of the difference limen that are free of contamination from order effects.

Fitting 2AFC data under Ulrich and Vorberg's (2009) contention
Using Ulrich and Vorberg's (2009) contention to fit 2AFC discrimination data involves three steps: (1) collecting data in 2AFC trials designed so that at each test level half of the trials display the test in the first interval while the other half displays it in the second interval, (2) segregating data by order of presentation of test and standard to compute the proportion of times that the test is judged stronger at each test level and separately for each presentation order, and (3) fitting Ψ 1 and Ψ 2 to the applicable subsets of data simultaneously under a constraint arising from their proof that the psychometric function Ψ 2AFC for data aggregated across presentation orders must satisfy Ψ 2AFC (x) = [Ψ 1 (x; a 1 , b 1 ) + Ψ 2 (x; a 2 , b 2 )]/2, where a i and b i are the location and slope parameters of Ψ 1 and Ψ 2 . The constraint, given by the contention, is that Ψ 2AFC (x s ) = 0.5.
Neither Ulrich and Vorberg (2009) nor Ulrich (2010) stated condition (1) explicitly as a necessary requirement, but it is implicit in their discussion and examples. The requirement can indeed be relaxed, but we will defer a discussion of this issue to a later section of this paper.
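Only a 1 , b 1 , and b 2 are free parameters in the simultaneous fit of Ψ 1 and Ψ 2 because a 2 is determined by the constraint. The functional relation of a 2 to a 1 , b 1 , b 2 , and x s depends on the forms of Ψ 1 and Ψ 2 . When they are both logistic functions given by
$$\Psi_i(x; a_i, b_i) = \frac{1}{1 + \exp[-(x - a_i)/b_i]}, \quad i \in \{1, 2\}, \tag{1}$$
Ulrich and Vorberg showed that
$$a_2 = x_s + \frac{b_2}{b_1}\,(x_s - a_1). \tag{2}$$
If the logistic functions Ψ 1 and Ψ 2 include a range restriction determined by asymptote parameters λ 1 and λ 2 so that
$$\Psi_i(x; a_i, b_i, \lambda_i) = \lambda_i + \frac{1 - 2\lambda_i}{1 + \exp[-(x - a_i)/b_i]}, \tag{3}$$
García-Pérez and Alcalá-Quintana (2010a) showed that the relation is
$$a_2 = x_s + b_2 \ln\frac{1 - \lambda_1 - \lambda_2 + (\lambda_1 - \lambda_2)\exp[(a_1 - x_s)/b_1]}{\lambda_1 - \lambda_2 + (1 - \lambda_1 - \lambda_2)\exp[(a_1 - x_s)/b_1]}. \tag{4}$$
Prior to fitting Ψ 1 and Ψ 2 , a 2 in the expression of Ψ 2 must be replaced by the right-hand side of Eq. 2 or 4 as appropriate. This is what eliminates a 2 as a parameter and it also demands a simultaneous fit because a 1 and b 1 are then common parameters in Ψ 1 and Ψ 2 .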
The results of applying this strategy can be summarized in a plot that includes three sets of data and three functions (see Figure 4 in Ulrich and Vorberg, 2009). The first set is the empirical proportion of times that the test was judged stronger when presented in the first interval and is accompanied by the fitted Ψ 1 which should run through this data set; the second set represents data from trials in which the test was presented in the second interval and is also accompanied by the fitted Ψ 2 which should run through these data; and the third set consists of the usual proportions computed from all trials (without distinction according to presentation order) and is accompanied by the average function Ψ 2AFC , which should run through the points in this set even though Ψ 2AFC was not actually fitted to these data but merely computed as the average of the fitted Ψ 1 and Ψ 2 .
If the data have been fitted through Eqs 1 and 2 and parameter estimates $\hat{a}_1$, $\hat{a}_2$, $\hat{b}_1$, and $\hat{b}_2$ have been obtained, an estimate of the difference limen that is uncontaminated by order effects is given by the average inverse slope of Ψ 1 and Ψ 2 , that is, by $(\hat{b}_1 + \hat{b}_2)\ln(3)/2$ (see Eq. 14 in Ulrich and Vorberg, 2009); if the data have been fitted through Eqs 3 and 4 instead, the uncontaminated estimate is obtained in the same way because the asymptote parameters λ 1 and λ 2 are independent of slope parameters.
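The constrained parameterization and the resulting difference limen can be sketched in a few lines. This is only an illustration under stated assumptions: the logistic is taken to have location a and scale b, that is, 1/(1 + exp[−(x − a)/b]), which is the form implied by the ln(3) factor in the difference limen; the function names and all numerical values are hypothetical and do not come from any data set.

```python
import math

def logistic(x, a, b):
    """Two-parameter logistic with location a and scale b (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(x - a) / b))

def a2_from_constraint(a1, b1, b2, x_s):
    """Location of Psi_2 implied by the constraint Psi_1(x_s) + Psi_2(x_s) = 1."""
    return x_s + b2 * (x_s - a1) / b1

def difference_limen(b1, b2):
    """Average of the per-order limens b_i * ln(3)."""
    return (b1 + b2) * math.log(3.0) / 2.0

# Hypothetical parameter values, for illustration only.
x_s, a1, b1, b2 = 500.0, 480.0, 40.0, 60.0
a2 = a2_from_constraint(a1, b1, b2, x_s)

# The average of the two functions evaluates to 0.5 at the standard.
psi_2afc_at_standard = 0.5 * (logistic(x_s, a1, b1) + logistic(x_s, a2, b2))
dl = difference_limen(b1, b2)
```

Because a 2 is computed rather than estimated, any optimizer used for the simultaneous fit would search only over a 1 , b 1 , and b 2 .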

Validity of the contention
As discussed above, Ulrich and Vorberg (2009) seem to have derived the contention on the assumption that test and standard stimuli differ only in magnitude along the dimension on which observers compare them, and they overstated the validity of the contention by implying that it holds for all 2AFC discrimination tasks in general.
To set the stage for a discussion of the general validity of the contention, consider the 2AFC discrimination data reported by Armstrong and Marks (1997) in their Figure 1, which we reproduce and annotate in our Figure 1. These data come from a study involving length discrimination of horizontal and vertical lines, and reflect the proportion of times in which a vertical line (the test) was judged longer than a horizontal line (the standard), as a function of the length of the vertical line. The experiment involved five different
horizontal line. It is certainly hard to reconcile these data with Ulrich and Vorberg's (2009) contention that the PSE in 2AFC data aggregated across presentation orders must occur at the point of objective equality (POE), which is what Ψ 2AFC (x s ) = 0.5 means. And it is also hard to regard these differences between the PSE and the POE as mere statistical noise.
The results just discussed reflect the well-known horizontal-vertical illusion: Vertical and horizontal lines of the same length are not perceived equal, and the vertical line must be shorter than the horizontal line for them to be perceived equal. In contrast, the contention Ψ 2AFC (x s ) = 0.5 implies that indistinguishability occurs when x = x s and, hence, that vertical and horizontal lines would be perceived equal when they have the same physical length. All extant evidence on the horizontal-vertical illusion refutes the contention (see, e.g., Künnapas, 1955; Girgus and Coren, 1975; Prinzmetal and Gettleman, 1993; Armstrong and Marks, 1997; Richter et al., 2007; Searleman et al., 2009; Hamburger and Hansen, 2010; Mamassian and de Montalembert, 2010) and numerous studies involving all other illusory figures concur.
It must be stressed that an empirical discrepancy between the PSE and the POE is not limited to illusory phenomena. More often than not, test and standard stimuli differ on more dimensions than that along which their magnitudes are compared, and the presence of these different dimensions may push the PSE away from the POE. Consider the classical Georgeson and Sullivan (1975) study, which measured the contrast that a (test) grating of a given spatial frequency should have for it to be perceived equal to the contrast of a (standard) grating of another spatial frequency. Their study thus estimated the PSE for grating contrast across spatial frequency. Their results showed that the PSE does not occur at the POE at low standard contrasts although it certainly does at high standard contrasts. [Georgeson and Sullivan collected their data with the method of adjustment, but replications of their experiment using 2AFC tasks under various conditions have always rendered analogous results (see, e.g., Stephens and Banks, 1985;St. John et al., 1987).] A similar quest for whether or not the PSE matches the POE underlies other studies in contrast perception, where 2AFC procedures revealed a mismatch between the PSE and the POE for contrast when test and standard differed in luminance or size (Peli, 1995), phase or bandwidth (Peli, 1997;Benton and Johnston, 1999), or direction of motion (García-Pérez and Peli, 2001). In another context, research on perceptual aftereffects also relies on discrepancies between PSE and POE (e.g., Knapen et al., 2010). Situations in which PSE and POE differ are myriad and the method of Ulrich and Vorberg (2009) cannot be used in those cases. The question is, then, how one can estimate the difference limen without contamination from order effects in cases in which the PSE does not lie at the POE, a question that calls for a generalization of Ulrich and Vorberg's method such that the (unknown) location of the PSE becomes an additional free parameter.

Generalizing the contention
The preceding section has emphasized that perceived identity does not generally accompany physical identity, particularly when test and standard differ on extra dimensions. The assumption that physical identity implies perceived identity was laid out in signal detection theoretic terms by Ulrich (2010, p. 1190), who stated (in a
lengths for the standard horizontal line, hence the five curves in each panel. The study involved a temporal 2AFC paradigm with equal numbers of trials at each test level and also with equal numbers of trials for each presentation order. The data plotted in Figure 1 represent aggregates across presentation orders. For data on the left panel, the two stimuli in each trial appeared at different times and locations on the screen (a sort of spatio-temporal 2AFC task); data on the right panel are thoroughly analogous but in this case the two stimuli appeared at different times on the same location on the screen (a pure temporal 2AFC task). These are indeed instances of 2AFC tasks, for which Ulrich and Vorberg (2009) claimed that the contention holds.
No psychometric functions were fitted by Armstrong and Marks (1997), but the data clearly question the validity of the contention. The length of the standard is printed next to the leftmost point on the applicable curve in Figure 1, and green vertical lines help identify the z-score (i.e., the probit transformation of empirical proportion) when horizontal and vertical lines had the same physical length. These z-scores are all in the range 1.2-1.7, implying that the test was judged longer 88-96% of the times and, thus, remarkably above 50% (a level that is represented by a z-score of 0, marked by a red horizontal line across the panels). The intersection of the red horizontal line with each data curve indicates the PSE, that is, the length that the vertical line must have for it to be judged equal in length to the horizontal line. The intersection always occurs when the length of the vertical line is smaller than that of the standard
Annotations: The red horizontal line across the panels indicates the 50% level (a z-score of 0), which reveals the PSE upon intersection with each curve. The abscissa at which the red line crosses a given curve indicates the length that the vertical stimulus must have to be perceived equal in length to the horizontal stimulus. Green vertical lines are drawn at the abscissa corresponding to the actual length of the horizontal stimulus for each curve and extend up to the data point on the curve reflecting the condition in which horizontal and vertical stimuli were physically identical; the ordinate of the upper end of each green line thus indicates the proportion of times (upon transformation of the z-score) that the vertical stimulus was judged longer than the horizontal stimulus when they actually had the same length.
so that fitting Ψ 1 and Ψ 2 involves estimating a 1 , b 1 , b 2 , λ 1 , λ 2 , and x PSE under the constraint of Eq. 6.
It may happen that the estimated x PSE equals x s within sampling error, which would provide evidence that μ t (x s ) = μ s (x s ) and, hence, that Ulrich and Vorberg's (2009) contention is empirically valid in such case. In other cases, this strategy will show that μ t (x s ) ≠ μ s (x s ) and will serve the more important goal of estimating x PSE under the theoretical constraints. Note also that the new parameter x PSE only shifts the functions Ψ 1 and Ψ 2 so as to "center" them away from the POE. Hence, the change of location does not alter the difference limen defined as the average inverse slope of Ψ 1 and Ψ 2 , which is still obtained through Ulrich and Vorberg's Eq. 14.
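A minimal sketch of how the amended constraint enters the fit is given next. It assumes that the range-restricted logistic has the form λ + (1 − 2λ)/(1 + exp[−(x − a)/b]), consistent with a range of (λ, 1 − λ), and it solves Ψ 1 (x PSE ) + Ψ 2 (x PSE ) = 1 numerically for a 2 instead of relying on a closed-form expression. Function names, parameter values, and response counts are hypothetical and serve only to exercise the code.

```python
import math

def psi(x, a, b, lam):
    """Logistic restricted to the range (lam, 1 - lam); assumed form."""
    return lam + (1.0 - 2.0 * lam) / (1.0 + math.exp(-(x - a) / b))

def a2_from_pse(a1, b1, b2, lam1, lam2, x_pse):
    """Solve Psi_1(x_pse) + Psi_2(x_pse) = 1 for the location a2 of Psi_2."""
    target = 1.0 - psi(x_pse, a1, b1, lam1)      # value Psi_2 must take at x_pse
    core = (target - lam2) / (1.0 - 2.0 * lam2)  # underlying unrestricted logistic
    if not 0.0 < core < 1.0:
        raise ValueError("constraint cannot be met with these asymptotes")
    return x_pse + b2 * math.log((1.0 - core) / core)

def neg_log_likelihood(params, levels, k1, n1, k2, n2):
    """Binomial negative log-likelihood (constants dropped) over both orders."""
    a1, b1, b2, lam1, lam2, x_pse = params
    a2 = a2_from_pse(a1, b1, b2, lam1, lam2, x_pse)
    total = 0.0
    for x, y1, m1, y2, m2 in zip(levels, k1, n1, k2, n2):
        p1 = psi(x, a1, b1, lam1)
        p2 = psi(x, a2, b2, lam2)
        total -= y1 * math.log(p1) + (m1 - y1) * math.log(1.0 - p1)
        total -= y2 * math.log(p2) + (m2 - y2) * math.log(1.0 - p2)
    return total

# Hypothetical parameters (a1, b1, b2, lam1, lam2, x_pse) and counts.
params = (100.0, 2.0, 3.0, 0.01, 0.02, 101.0)
a2 = a2_from_pse(*params)
average_at_pse = 0.5 * (psi(101.0, 100.0, 2.0, 0.01) + psi(101.0, a2, 3.0, 0.02))
nll = neg_log_likelihood(params, [98.0, 101.0, 104.0],
                         [20, 60, 90], [100, 100, 100],
                         [10, 40, 85], [100, 100, 100])
```

Minimizing this function over the six free parameters with any general-purpose optimizer would yield the constrained estimates; the optimizer itself is left out of the sketch.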

Empirical demonstration
To illustrate the procedure just described, a 2AFC task was used in which observers judged whether a horizontal line or a vertical line was longer. The horizontal line was the standard and had a length of 104 pixels (∼3.1 cm); the vertical line was the test, whose length on each trial had one of eight values in the range between 94 and 110 pixels, in steps of two pixels. Line width was five pixels and all lines were black on a uniform light background. Each trial presented the lines in either an 'L' or a 'Γ' configuration, which is to say that the test could be placed above or below the standard in the spatial 2AFC paradigm typically used for the study of geometrical illusions. Hence, references to first and second intervals in the preceding discussion of temporal 2AFC should be understood here as referring to upper and lower spatial positions. Spatial 2AFC was used for convenience, but it should be inconsequential because (i) differences in the perceived length of horizontal and vertical lines have also been reported in spatial 2AFC tasks (e.g., Hamburger and Hansen, 2010; Mamassian and de Montalembert, 2010), (ii) order effects in temporal 2AFC have also been shown to occur as position effects in spatial 2AFC (e.g., Hellström, 2003), and (iii) Ulrich and Vorberg (2009) claimed that the contention is a property of all 2AFC tasks.
Test lines of each length were presented 100 times in each configuration, for a total of 800 trials with each configuration. Data were collected in five sessions of 320 trials each (20 presentations at each of the eight test lengths in each of the two configurations); the sequence of trials was newly randomized in each session for each observer. Each session took 12-17 min, and observers applied self-administered pauses between sessions. The two authors participated in the experiment.
Stimuli were presented on a 19-in HP L1950g LCD monitor (flat screen size: 37.7 cm horizontally by 30.1 cm vertically) with a spatial resolution of 1280 × 1024 pixels and a 1:1 aspect ratio. All experimental events (randomization of the sequence of trials, stimulus display, and collection of responses) were controlled by a computer running custom software. Viewing distance was 60 cm, so that 1 cm on the screen subtended ∼0.95° of visual angle. Each trial started by displaying a configuration from the set of 320 in the current session. To prevent observers from developing strategies based on spatial cues, stimuli were displayed at a location on the screen such that the center of the putative rectangle closing the configuration would lie at a random position within 10 pixels of the center of the monitor. The stimulus remained present until the observer had responded
different notation) that the internal representation of a standard of magnitude x s is a normally distributed random variable with mean μ(x s ) whereas the internal representation of a test of magnitude x is also normally distributed with mean μ(x). He thus assumed that the function μ is the same for test and standard stimuli and, hence, x = x s inevitably yields internal representations with equal means for test and standard and chance performance on a 2AFC task. Although the assumption seems valid for the type of stimuli that Ulrich and Vorberg (2009) used in their experiment, it does not hold in general and needs to be replaced.
Consider again the horizontal-vertical illusion. The mean of internal representations (perceived length) of horizontal and vertical lines of length x s cannot be given by the same function μ because the defining property of the illusion is that the perceived length of a vertical line is larger than that of a horizontal line of the same physical length. Thus, empirical evidence shows that the mean of the internal representation of the test is given by a function μ t that differs from the function μ s that gives the mean for the standard. Empirical estimates of these functions were reported by Armstrong and Marks (1997, p. 1208 and Figure 5), showing that horizontal lines are perceived nearly veridically whereas the length of vertical lines is overestimated: When perceived length was plotted against physical length, magnitude-estimation data for horizontal lines described a unit-slope line through the origin whereas data for vertical lines described a line with a slope higher than unity.
Thus, in terms of signal detection theory, chance performance on a 2AFC discrimination task (or, in other terms, Ψ 2AFC (x) = 0.5) does not necessarily occur at x = x s (i.e., when test and standard are physically equal on the dimension of comparison) but rather at the point x 0 satisfying μ t (x 0 ) = μ s (x s ) (i.e., when the perceived magnitudes of test and standard are equal). Although nothing prevents μ t and μ s from being identical in special cases, those cases must be identified empirically. This seems to suggest that one would need to know the functions μ t and μ s (or, at least, know if they differ) in order to estimate Ψ 1 and Ψ 2 under the applicable constraints. Quite on the contrary, a simple amendment of Ulrich and Vorberg's (2009) contention suffices.
Potential differences between μ t and μ s do not alter the unquestionable validity of Ulrich and Vorberg's (2009) observation that Ψ 2AFC should be the average of constrained functions Ψ 1 and Ψ 2 , but we will use it with the three-parameter logistic function in Eq. 3, which yields
$$\Psi_{2AFC}(x) = \frac{\Psi_1(x; a_1, b_1, \lambda_1) + \Psi_2(x; a_2, b_2, \lambda_2)}{2}. \tag{5}$$
Even if μ t and μ s differ, it is still true that Ψ 2AFC (x) = 0.5 when Ψ 1 (x; a 1 , b 1 , λ 1 ) + Ψ 2 (x; a 2 , b 2 , λ 2 ) = 1 but, by the above discussion, this does not occur when x = x s but at an unknown point x PSE for which μ t (x PSE ) = μ s (x s ). All that this means is that the constraint Ψ 2AFC (x s ) = 0.5 must be replaced by Ψ 2AFC (x PSE ) = 0.5, where x PSE is another free parameter. When Ψ 1 and Ψ 2 are logistic functions, the amended constraint transforms Eq. 4 into
$$a_2 = x_{PSE} + b_2 \ln\frac{1 - \lambda_1 - \lambda_2 + (\lambda_1 - \lambda_2)\exp[(a_1 - x_{PSE})/b_1]}{\lambda_1 - \lambda_2 + (1 - \lambda_1 - \lambda_2)\exp[(a_1 - x_{PSE})/b_1]} \tag{6}$$

requirements cannot follow the path of the data. The fitted curves represent the solution that is least inconsistent with the data, which is still overly unacceptable. Figure 3 shows the results of fitting the data with our amended method, and it is obvious that the psychometric functions fit well. Order effects are also captured by the fact that Ψ 1 and Ψ 2 are laterally shifted away from x PSE in opposite directions, consistent with

with key presses indicating whether the horizontal or the vertical line appeared longer. The next trial started 500 msec after the observer's response.
The proportion of times that the test was judged longer was computed at each test level in each configuration, and an overall proportion was also computed at each test level irrespective of configuration. These three sets of data are plotted as circles in Figure 2 in a separate panel for each observer; raw counts are reported in Table 1. Contrary to Ulrich and Vorberg's (2009) expectation that the PSE for each test location should be displaced away from the POE in an opposite direction while that for aggregated data should be around the POE, all three PSEs are displaced to the left of the POE. The functions Ψ 1 (red curve) and Ψ 2 (blue curve) were fitted using Ulrich and Vorberg's method but they do not do justice to the data (red and blue circles), nor does their average (Ψ 2AFC ; black curve) follow the path of aggregated data (black circles). Clearly, this outcome is not a failure to find the "correct" parameter estimates but, rather, a proof of the failure of a contention imposing (i) that the black curve passes through the point (

Ulrich and Vorberg's (2009) method, which fails to provide a good fit because it forces the functions to pass through points that are away from the data. Estimated parameters for Ψ 1 (when the test was presented above) and Ψ 2 (when the test was presented below) are given in insets. To make sure that the lack of fit was not caused by our asymptote parameters λ 1 and λ 2 , their values were fixed at 0 and not regarded as free parameters. Repeating the procedure with free λ 1 and λ 2 did not result in any improvement because the problem is in the different horizontal location of the data and the fitted curves, not in the vertical range.

Figure 2, but curves are now fitted with our amended method and also with free λ 1 and λ 2 . By removing the constraint that the 50% point on the black curve (for aggregated data across presentation orders) must occur at x s = 104, the curves can shift horizontally and meet the data.
The amended constraint involves the same basic relation of Ψ 2AFC to Ψ 1 and Ψ 2 , but allows the curves to displace to the point x PSE indicated by the data, which becomes a free parameter in the fit. Estimates of x PSE for each observer are given in insets and their location is indicated by a solid vertical line. Estimates of the difference limen (DL) computed through Eq. 14 in Ulrich and Vorberg (2009) are also given in the inset.

To better understand the relevance of parameters κ and λ outside the context of finger errors or lapses of attention, consider the model described by Ulrich (2010, p. 1191 and Figure 12). This model states that observers make their judgment by comparing the stimulus presented in the second interval with a stable internal standard and produces order effects such that Ψ 2 will be adequately approximated by a logistic function with κ 2 = λ 2 = 0 whereas Ψ 1 (x) = y with constant 0 < y < 1. By being flat and independent of test level x, the shape of Ψ 1 can only be described through the four-parameter function in Eq. 7 with κ 1 = y and λ 1 = 1 − y. Although the model seems implausible in this particular form (as no evidence of flat psychometric functions seems to have ever been reported), other variants of this model can produce non-flat psychometric functions that can only be described through different and non-zero values for κ and λ. Empirical evidence will tell whether a four-parameter function is actually necessary, or in what cases.

Equal numbers of trials for each presentation order at each test level
Ulrich and Vorberg (2009) applied their method to data collected in equal numbers of trials for each presentation order at each test level. We have described this characteristic as step (1) of their method, although we noted that it is replaceable. Actually, neither Ulrich and Vorberg (2009) nor Ulrich (2010) declared this as a requisite, but it is worth discussing the effects of relaxing this requisite before we present the main issue that we want to raise here.
Suppose there are n 1 = 200 trials at each test level for the presentation order relevant to Ψ 1 but only n 2 = 100 for the presentation order relevant to Ψ 2 . Suppose also, and only to facilitate our presentation, that Ψ 1 and Ψ 2 both fit the applicable data perfectly so that the curves run on top of the data points or, formally, that the empirical proportion p ij at test level x = x j equals Ψ i (x j ). (To simplify the presentation, we will drop parameters a, b, κ, and λ from the notation.) Consider a sample case in which p 1j = 160/200 = 0.8 = Ψ 1 (x j ) whereas p 2j = 20/100 = 0.2 = Ψ 2 (x j ). Then, x PSE = x j because Ψ 1 (x j ) + Ψ 2 (x j ) = 1 and, thus, Ψ 2AFC (x j ) = 0.5, but the empirical proportion from aggregated data at x j would be (160 + 20)/(200 + 100) = 0.6. In a plot, Ψ 1 and Ψ 2 would run on top of their reference data points but Ψ 2AFC would lie below all points in its reference data set, which might be taken as a sign of poor fit. In order for Ψ 2AFC to match its reference data when n 1 ≠ n 2 , the imbalance that affects proportions computed from aggregated data should be applied upon averaging Ψ 1 and Ψ 2 , yielding Ψ 2AFC (x) = [n 1 Ψ 1 (x) + n 2 Ψ 2 (x)]/(n 1 + n 2 ). This reduces to the simple average when n 1 = n 2 . This discussion reveals the graphical consequences of using n 1 ≠ n 2 . It should nevertheless be kept in mind that Ulrich and Vorberg (2009) aptly noted that the only "true" functions are Ψ 1 and Ψ 2 , whereas Ψ 2AFC is a misleading byproduct. Then, the shape described by Ψ 2AFC and whether or not it follows the path of its reference data is immaterial. Of course, a potential graphical mismatch can be entirely eliminated by ensuring that experiments are carried out with n 1 = n 2 , which brings us to our main issue.
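The arithmetic of the sample case above can be checked directly. The short sketch below uses the counts given in the text (160 of 200 and 20 of 100) and compares the simple average of Ψ 1 and Ψ 2 with the aggregated proportion and with the weighted average; the function name is hypothetical.

```python
def weighted_psi_2afc(p1, p2, n1, n2):
    """Average of Psi_1 and Psi_2 weighted by the number of trials per order."""
    return (n1 * p1 + n2 * p2) / (n1 + n2)

n1, n2 = 200, 100   # trials per presentation order at test level x_j
k1, k2 = 160, 20    # "test judged stronger" responses in each order

p1 = k1 / n1        # 0.8, equal to Psi_1(x_j) under a perfect fit
p2 = k2 / n2        # 0.2, equal to Psi_2(x_j) under a perfect fit

simple_average = (p1 + p2) / 2.0          # unweighted Psi_2AFC(x_j)
aggregated = (k1 + k2) / (n1 + n2)        # proportion from pooled data
weighted_average = weighted_psi_2afc(p1, p2, n1, n2)
```

The simple average evaluates to 0.5 whereas the pooled proportion and the weighted average both evaluate to 0.6, which is the graphical mismatch described above.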
The experiment of Ulrich and Vorberg (2009) and the experiment reported here have both used the method of constant stimuli: The same number of trials was administered at each of a number of fixed test levels, and this number of trials was also the same across presentation orders (i.e., n 1 = n 2 ). Our sample case in the preceding paragraph also implied the method of constant

what the data indicate. Estimated values of x PSE are printed on the panels, and the location parameters a 1 and a 2 reveal where the PSE lies for each presentation order. Estimated values of x PSE compare well with the average values reported by Mamassian and de Montalembert (2010) and they are also within the range reported by Hamburger and Hansen (2010): Vertical lines have to be 1.53% (Observer #1) or 4.46% (Observer #2) shorter than horizontal lines to be perceived equally long. More importantly, estimates of the difference limen that are free of contamination from order effects can be obtained which are also reported in Figure 3.

Extensions and further improvements
Our generalized method lends itself to improvements that should increase its potential for providing a good fit to data and for the investigation of order effects. This section comments on them.
Mathematical forms of Ψ 1 and Ψ 2
We have used logistic psychometric functions in our illustration, as did Ulrich and Vorberg (2009). In a signal detection theoretic framework, the form of the psychometric function is determined by assumptions about the distribution of internal responses or how their mean and variance change with stimulus level, and by what the decision rule is (for a formal analysis in the context of contrast perception, see García-Pérez and Alcalá-Quintana, 2007). Only under restrictive conditions will the resultant psychometric functions have a logistic form, but differences across alternative functional forms are generally small and inconsequential. Logistic functions are reasonable approximations because they are sufficiently flexible to accommodate the typical patterns that empirical data show. Ulrich and Vorberg (2009) discussed how the location and slope of Ψ 1 and Ψ 2 may differ as a result of Type-A and Type-B order effects. Arguably, order effects may also cause Ψ 1 and Ψ 2 to differ in mathematical form through changes in some of the components determining them. Then, research on order effects using our amended method should be alert to empirical signs of different forms for Ψ 1 and Ψ 2 and not only to different estimated parameters of logistic functions.

Asymptotes
The logistic function in Eq. 3 includes a further parameter that reduces the range of Ψ i to the interval (λ i , 1 − λ i ), compared to the full range (0, 1) of the function in Eq. 1. This parameter has traditionally been dubbed "lapsing rate" or "finger error rate" because it was meant to describe unexpected empirical deviations from perfect performance attributed to lapses of attention or finger errors upon hitting the response keys (see Meese, 1995). Yet, parameter λ has a new meaning in this context because order effects may affect the asymptotes of Ψ 1 or Ψ 2 for reasons unrelated to finger errors or lapses of attention. Moreover, the lower and upper asymptotes might be differently affected so as to demand a four-parameter logistic function given by
$$\Psi_i(x; a_i, b_i, \kappa_i, \lambda_i) = \kappa_i + (1 - \kappa_i - \lambda_i)\,\Psi_i(x; a_i, b_i), \tag{7}$$
where Ψ i (x; a i , b i ) is the function in Eq. 1, and whose range is the interval (κ i , 1 − λ i ). We should stress that κ in Eq. 7 is by no means the "guessing rate" parameter typically included in psychometric functions for 2AFC detection tasks.
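A minimal sketch of a logistic function with separate lower and upper asymptotes follows. It assumes a core of the form 1/(1 + exp[−(x − a)/b]); the function name, this parameterization, and the numerical values are hypothetical and serve only to show how κ and λ bound the range.

```python
import math

def logistic4(x, a, b, kappa, lam):
    """Logistic function ranging over (kappa, 1 - lam).

    The core 1 / (1 + exp(-(x - a) / b)) is an assumed parameterization,
    with a as location and b as spread.
    """
    core = 1.0 / (1.0 + math.exp(-(x - a) / b))
    return kappa + (1.0 - kappa - lam) * core

# Hypothetical parameter values, for illustration only.
params = dict(a=100.0, b=4.0, kappa=0.03, lam=0.01)

low = logistic4(0.0, **params)          # far below the location parameter
high = logistic4(200.0, **params)       # far above the location parameter
mid = logistic4(params["a"], **params)  # at the location parameter

print(low, mid, high)
```

Far from the location parameter the function approaches κ from above and 1 − λ from below. Note that when κ ≠ λ the value at x = a is κ + (1 − κ − λ)/2 rather than 0.5, so the location parameter no longer marks the 50% point exactly.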
García-Pérez and Alcalá-Quintana (2010a) described a model of response bias and showed how it produces Type-A order effects. They also illustrated how recording and treating undecided cases as Fechner (1860/1966) suggested eliminates this response bias, which is not to say that Type-B order effects can also be eliminated in this way or that this removes all sources of Type-A order effects. This strategy has been empirically shown to eliminate bias in 2AFC contrast detection tasks (García-Pérez and Alcalá-Quintana, 2010b) and to eliminate Type-A order effects in 2AFC contrast discrimination tasks (Alcalá-Quintana and García-Pérez, 2011). Although one of the virtues of Ulrich and Vorberg's (2009) method and our amendment is that the constrained estimation of Ψ 1 and Ψ 2 isolates order effects and eliminates their contaminating influence on estimates of the difference limen, removing order effects whose cause is known can only help to identify the cause of those that remain.
To illustrate, the experiment reported earlier was repeated with the difference that observers were allowed to express indecision on a trial by hitting another response key. Thus, at a given test level for a given presentation order, the observer will have given N v "vertical longer" responses, N h "horizontal longer" responses, and N u "undecided" responses. Following Fechner (1860/1966), undecided responses were counted as half right and half wrong to yield adjusted counts of N v + N u /2 "vertical longer" responses and N h + N u /2 "horizontal longer" responses. The results are shown in Figure 4, and Table 2 gives raw counts and the resultant adjusted proportions.
stimuli within each presentation order, although with n 1 ≠ n 2 across orders. This method is known to be inefficient (Meese, 1995), an inefficiency that is intensified when Type-A order effects push Ψ 1 and Ψ 2 apart from one another and Type-B order effects make them vary in support. Hence, the set of stimulus levels that is informative of one of the functions is likely to be uninformative of the other.
Adaptive methods are more efficient and much more widely used nowadays, and some of them provide optimal sampling plans for accurate estimation of psychometric functions even with small numbers of trials (García-Pérez and Alcalá-Quintana, 2005). The prevalence of adaptive methods raises the issue of whether our amended method could be used with data gathered through them, which thus comprise numbers of trials that differ across test levels within and across presentation orders. A satisfactory solution does not present itself upon first examination of the problem, but a reasonable-looking approach that requires thorough evaluation consists of (1) running separate (though interwoven along the experimental session) adaptive tracks designed so that each individual track deploys trials with a fixed order of presentation, because individual adaptive tracks are efficient at gathering data appropriate for a fixed psychometric function, not for a mixture of them, (2) setting the length of each individual track so as to ensure that the overall number N 1 of trials across all tracks pertaining to Ψ 1 is the same as its counterpart N 2 for Ψ 2 , and (3) to the extent that the above provides accurate constrained estimates of Ψ 1 and Ψ 2 with N 1 = N 2 , computing Ψ 2AFC as the simple average of Ψ 1 and Ψ 2 to reflect what aggregated data would be like in an experiment in which both presentation orders are equally frequent at each test level.
The validity of this approach must be evaluated in studies that might also identify alternative and/or more appropriate approaches. Yet, it should be kept in mind that the primary goal is obtaining accurate estimates of Ψ 1 and Ψ 2 ; how well the average of the estimated Ψ 1 and Ψ 2 follows the path of aggregated data seems secondary and largely immaterial.
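The bookkeeping behind this three-step proposal can be sketched as follows. The sketch only builds an interwoven trial schedule with equal totals per presentation order and averages two already estimated functions; it does not implement any adaptive placement rule. All function names, the fixed track length, the two-parameter logistic core, and the parameter values are hypothetical.

```python
import math
import random

def logistic(x, a, b):
    """Two-parameter logistic with location a and spread b (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(x - a) / b))

def interleave_tracks(tracks_per_order, trials_per_track, seed=0):
    """Return a shuffled list of (order, track) labels, one per trial.

    Every track belongs to a single presentation order, and both orders
    receive the same total number of trials.
    """
    rng = random.Random(seed)
    schedule = [
        (order, track)
        for order in (1, 2)
        for track in range(tracks_per_order)
        for _ in range(trials_per_track)
    ]
    rng.shuffle(schedule)
    return schedule

def psi_2afc(x, a1, b1, a2, b2):
    """Simple average of the two order-specific functions."""
    return 0.5 * (logistic(x, a1, b1) + logistic(x, a2, b2))

def level_at_half(a1, b1, a2, b2, lo, hi, tol=1e-9):
    """Bisection for the level at which the average crosses 0.5."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi_2afc(mid, a1, b1, a2, b2) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

schedule = interleave_tracks(tracks_per_order=4, trials_per_track=25)
n1 = sum(1 for order, _ in schedule if order == 1)
n2 = len(schedule) - n1

# Hypothetical parameter estimates for the two presentation orders.
a1, b1, a2, b2 = 94.0, 3.0, 100.0, 5.0
x_half = level_at_half(a1, b1, a2, b2, lo=50.0, hi=150.0)
print(n1, n2, round(x_half, 3))
```

Fixing the number of trials per track is the simplest way to guarantee N 1 = N 2 ; adaptive rules that stop on other criteria would need an explicit check on the totals.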

Response Bias
Trials in a 2AFC discrimination task often present stimuli that are subjectively so similar that observers cannot make a judgment. The strategies that observers use to respond when unable to make a decision are known to bias 2AFC tasks (see, e.g., Morgan et al., 1990; Jäkel and Wichmann, 2006), and may indeed be one of the sources of Type-A order effects. On presenting his "Method of Right and Wrong cases," Fechner (1860/1966) noted that there appears to be an "interval of uncertainty" within which observers cannot make a decision. He then suggested that these undecided cases be recorded separately and counted as "half right and half wrong" and stated that this strategy "is the only one which can also yield a basis for elimination and precise determination of the influences (…) which cause constant errors" (p. 78). But Fechner's observation seems to have gone unnoticed and undecided cases are rarely recorded (for some exceptions, see Künnapas, 1955; Hellström, 2003; van Vleet and Robertson, 2006; Alcalá-Quintana and García-Pérez, 2007; García-Pérez, 2010), let alone treated as Fechner suggested.
Figure 4 (caption, partial): Estimates of x PSE (see the inset) are virtually identical to those reported in Figure 3. Also compared to Figure 3, the red and blue data sets (and the red and blue curves) are closer together, indicating that differences in performance across presentation orders are quantitatively smaller when response bias is eliminated with Fechner's method.
Data from each presentation order are now closer together than in Figure 3 and Ψ 1 and Ψ 2 are also more similar; at the same time, estimates of x PSE stay close to what they were when undecided cases were not recorded (compare with Figure 3). A difference between Ψ 1 and Ψ 2 that is surely significant remains, and it is likely to reflect Type-A order effects of some other origin, but it seems safe to assume that the effects of response bias have been removed. As the raw counts in Table 2 reveal, observers gave undecided responses predominantly around the estimated PSE for each presentation order (i.e., around the values of the location parameters a 1 and a 2 reported in the insets of Figure 4).
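Fechner's "half right and half wrong" treatment amounts to a one-line adjustment of the raw counts. The sketch below computes the adjusted proportion of "vertical longer" responses, (N v + N u /2)/N, at each test level. The function name and the counts are hypothetical and are not the data of Table 2.

```python
def adjusted_proportion(n_v, n_h, n_u):
    """Proportion of 'vertical longer' responses with undecided cases split evenly.

    n_v: 'vertical longer' responses; n_h: 'horizontal longer' responses;
    n_u: 'undecided' responses at the same test level and presentation order.
    """
    total = n_v + n_h + n_u
    if total == 0:
        raise ValueError("no trials at this test level")
    return (n_v + n_u / 2.0) / total

# Hypothetical counts (n_v, n_h, n_u) at three test levels, 100 trials each.
counts = [(12, 80, 8), (35, 40, 25), (70, 20, 10)]
p_star = [adjusted_proportion(*c) for c in counts]
print(p_star)
```

Because each undecided case adds one half to both adjusted counts, the adjusted proportions of "vertical longer" and "horizontal longer" responses still add up to 1 at every test level.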
Conclusion
Ulrich and Vorberg's (2009) contention holds in the special cases in which test and standard differ only on the dimension along which they are compared. These cases are an exception in empirical studies, which more often than not include additional differences between test and standard to assess their effect on sensory processing along the dimension of comparison, or simply to estimate the magnitude of visual illusions, aftereffects, or other instances of non-veridical perception. We have shown that their contention can be replaced by a more realistic one that renders a more general method also capable of isolating order effects and removing their contaminating effect on estimates of the difference limen. Removal of order effects caused by response bias through application of Fechner's "half right and half wrong" treatment of undecided cases also reveals itself as a useful strategy for the investigation of order effects.
Note to Table 2. N v , number of "vertical longer" responses across 100 trials; N u , number of "undecided" cases across 100 trials; p*, adjusted proportions plotted in Figure 4 and defined as (N v + N u /2)/100.