
Edited by: Megan Peters, University of California, Riverside, United States

Reviewed by: Jorge Morales, Johns Hopkins University, United States; Jan Brascamp, Michigan State University, United States

This article was submitted to Consciousness Research, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

By definition, metacognitive processes may monitor or regulate various stages of first-order processing. By combining causal analysis with hypotheses expressed by other authors, we derive the theoretical and methodological consequences of this special relation between metacognition and the underlying processes. In particular, we prove that because multiple processing stages may be monitored or regulated and because metacognition may form latent feedback loops, (1) without strong additional causal assumptions, typical measures of metacognitive monitoring or regulation are confounded; (2) without strong additional causal assumptions, typical methods of controlling for first-order task performance (i.e., calibration, staircase, including first-order task performance in a regression analysis, or analyzing correct and incorrect trials separately) not only fail to deconfound measures of metacognition but may even introduce bias; and (3) the first two problems cannot be solved by using simple models of decision-making derived from Signal Detection Theory. We conclude the paper by advocating robust methods of discovering properties of latent mechanisms.

In this paper, the term metacognition denotes cognitive processes that monitor other cognitive processes, as well as the results of such monitoring, including metacognitive regulation. This broad definition seems to be in agreement with what can be found in the majority of introductory chapters of various monographs on metacognition, of which there are now many (e.g., Nelson and Narens,

One of the reasons that it is difficult to study metacognition is that it is a latent mechanism which has a dual causal role, i.e., it monitors and so is influenced by the underlying cognitive process, but, since one of the main functions of monitoring is regulation, it may also regulate and so influence the monitored process. To further complicate the matter, not every case of a first-order process influencing metacognition may represent genuine metacognitive monitoring; for example, a first-order process could become more resource-consuming, thus limiting the amount of resources available for metacognition. Similarly, it is theoretically possible for metacognition to influence a first-order process in a non-metacognitive way.

The aim of this paper is to use causal analysis to derive the theoretical and methodological consequences of this special relation between metacognition and the underlying processes. Even though this is a theoretical paper, we made sure that it does not contain any speculative claims: instead of providing our own hypotheses about how metacognition works, we combine causal analysis with the hypotheses expressed by other authors. In that sense, we are proposing a meta-theoretical causal framework for studying hierarchical tasks.

The usefulness of this approach is illustrated by showing how it can help identify important limitations of certain widespread practices in studies on metacognition. We prove that every measure of metacognitive monitoring or regulation is confounded unless strong additional causal assumptions are introduced. In particular, without additional causal assumptions, neither metacognitive judgements (e.g., confidence ratings) nor correlations between performance (e.g., accuracy or sensitivity) and metacognitive judgements are unbiased measures of metacognitive monitoring or regulation. We also show that controlling for first-order task performance may not only fail to deconfound measures of metacognition, but it may even introduce bias. Finally, we show that measures based on Signal Detection Theory or some of its generalizations are just as confounded as simpler measures of statistical dependence because they use the same information in the data. We conclude the paper by advocating robust methods of discovering properties of latent mechanisms.

Almost one-third of our paper is devoted to introducing elements of causal analysis. It is only after we describe the relevant formalism and its interpretation that we begin to address the issues directly related to metacognition. The reason for this is that we cannot assume that a researcher interested in studying metacognition will also be acquainted with causal inference, and we decided not to rely on the introductory books or papers on the subject since they contain much more information than we need to derive the main results. Note also that while we are fairly specific in our criticism of the way in which metacognition is often studied, the constructive part of our paper, in which we try to provide advice on how to do some things better, is rather generic and may not be directly applicable to any specific research problem. This is a consequence of the fact that the problems that we identify are general, but the solutions to these problems depend on the particular characteristics of each study.

We will rely on Pearl's Structural Causal Model (Pearl,

We will be concerned with qualitative causal structure, i.e., with the mere presence or absence of causal connections between variables. The quantitative properties of causal relations, such as how best to describe an effect by some deterministic function or statistical model, will be considered only to illustrate a general point. The qualitative structure of causal relations will be represented by graphs consisting of nodes (i.e., variables) and arrows. Unless we state that a certain effect simply exists, an arrow from

The graph may still be valid even if some of its arrows do not correspond to any real processes, as long as no real arrows connecting the modeled variables are omitted. That is because, as long as they stand for theoretically possible effects, additional arrows may only limit the statistical implications of causal graphs. Moreover, the presence of an arrow from

Because of the dual causal role of metacognition, there will be causal loops in some of our graphs. Given that causal processes take time, a loop might be taken to mean that causality can go back in time. That is not our intention: a loop may arise because the arrows comprising it cannot be theoretically excluded or because there may be a genuine feedback connection. A real feedback loop can only connect time-aggregated variables and is shorthand for mutual influence occurring over time, as illustrated by

Interpretation of causal loops which represent feedback processes.

Here, the subscript i indexes discrete time.
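The time-unrolled reading of such loops can be made concrete in a toy simulation. The sketch below is our own illustrative assumption, not a model from the metacognition literature: a loop between a first-order variable F and a monitoring variable M is unrolled into the arrows F_i → M_{i+1} and M_i → F_{i+1}, with arbitrary linear coefficients.

```python
import random

random.seed(0)

def simulate_feedback(n_steps=50, a=0.5, b=0.3):
    """Unroll a feedback loop F <-> M over discrete time.

    Each variable at step i + 1 depends only on the *other* variable at
    step i, so no causal effect ever travels backwards in time.  The
    linear update and the coefficients a (monitoring strength) and
    b (regulation strength) are purely illustrative assumptions.
    """
    F, M = [1.0], [0.0]
    for i in range(n_steps):
        F.append(b * M[i] + random.gauss(0.0, 0.1))  # M_i -> F_{i+1}
        M.append(a * F[i] + random.gauss(0.0, 0.1))  # F_i -> M_{i+1}
    return F, M

F, M = simulate_feedback()
```

Aggregating F and M over time would yield two variables connected by a loop in the graph, even though every underlying arrow points forward in time.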

A causal graph can be used to predict, interpret or explain the data because causal relations have statistical implications, e.g., if ^{1}
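One standard statistical implication of this kind can be illustrated with a simulated chain X → Y → Z (a linear-Gaussian toy model of our own invention): the endpoints are correlated marginally, but approximately independent once the middle variable is held fixed.

```python
import random

random.seed(1)

n = 20000
# A causal chain X -> Y -> Z with standard-normal noise at every stage.
X = [random.gauss(0, 1) for _ in range(n)]
Y = [x + random.gauss(0, 1) for x in X]
Z = [y + random.gauss(0, 1) for y in Y]

def corr(u, v):
    """Pearson correlation of two equal-length samples."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

r_marginal = corr(X, Z)  # substantial: the chain is conductive

# Crude conditioning on Y: keep only observations with Y near zero.
sliced = [(x, z) for x, y, z in zip(X, Y, Z) if abs(y) < 0.1]
r_conditional = corr([x for x, _ in sliced], [z for _, z in sliced])
# Near zero: conditioning on the mediator blocks the chain.
```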

We will often make use of the following important fact: variables

The causal graph representing a study may contain many variables and many arrows, but usually, the researcher will be primarily interested in only a small subset of paths—often just a single arrow. Following Pearl (

The importance of causal assumptions can be illustrated by elaborating on the essential difference between experimental and observational studies. If

Often more than one conductive path corresponds to the same correlation. This is especially true if the two variables are only observed, since in general either one of the two observed variables may cause the other, or the two variables may have common causes, as illustrated in

Conductive paths in experimental and observational studies.

Here,
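The difference between the two kinds of study can be simulated directly. In the hypothetical linear model below (variable names and coefficients are our own assumptions), a latent common cause C drives both A and B: the observational correlation between A and B is substantial even though A has no effect on B, while randomly assigning A breaks the back-door path.

```python
import random

random.seed(2)

n = 20000
# Latent common cause C drives both A and B; A does NOT cause B.
C = [random.gauss(0, 1) for _ in range(n)]
A_obs = [c + random.gauss(0, 1) for c in C]     # observed (non-randomized) A
B = [c + random.gauss(0, 1) for c in C]
A_rnd = [random.gauss(0, 1) for _ in range(n)]  # randomly assigned A

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

r_observational = corr(A_obs, B)  # ~0.5 here, despite no A -> B effect
r_experimental = corr(A_rnd, B)   # ~0: randomization breaks the path
```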

In SCM we usually talk about confounding paths, not variables, because a variable by itself cannot imply any statistical dependence. A path is confounding only with respect to some target causal quantity (e.g.,

If some plausible alternative causal explanations are not ruled out, i.e., if some confounding paths are not neutralized or broken, then the estimate of the target effect may be biased, i.e., the expected value of the estimate—what it actually measures—may be a mixture of the target causal effect and other confounding effects. In particular, even when the estimated statistical effect is different from zero, the contribution of the target causal effect to the estimate may be null, in which case the researcher will miss the target quantity entirely.

Every confounding path is critical unless something is already known about the relative strength of the relevant causal effects. Unlike noise or measurement error, bias resulting from the presence of confounding paths cannot be dealt with by increasing the sample size because it depends on what is being measured, not on how reliable the measurement is. This bias can only be dealt with—if at all—by changing the design, the method of analysis, or both.

Deconfounding is crucial when doing basic research, especially when the study is concerned with discovering a latent mechanism, such as the mechanism of metacognition. Of course, no study is perfect, but once the confounding paths are identified, they need to be addressed. As is commonly accepted in observational studies, the burden of proof is on the researcher who omits certain arrows and thus dismisses alternative explanations.

There are several non-exclusive ways of dealing with confounding. One is by intervention, as in experimental design. However, despite their inherent strength, experimental studies rarely, if ever, provide definitive answers; this is partly because, especially in disciplines such as psychology, many variables cannot be altered directly, and the effects of interest may not be directly observable. For example, let the target quantity be the influence of short-term memory load (^{2}

Confounding paths in an experiment with indirect manipulation and measurement.

For example, the

Another approach to deconfounding is by conditioning, i.e., by selecting observations or subjects with some property (e.g., only correct trials), or by introducing additional variables in the possibly non-linear regression analysis (see Pearl et al.,
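Conditioning can also open paths rather than break them. The simulation below (an invented example) shows the classic collider effect: X and Y are independent causes of a common effect K, yet selecting observations with high K, much like analyzing only correct trials, induces a spurious negative correlation between them.

```python
import random

random.seed(3)

n = 20000
# X and Y are independent; both cause the "collider" K (think: accuracy).
X = [random.gauss(0, 1) for _ in range(n)]
Y = [random.gauss(0, 1) for _ in range(n)]
K = [x + y + random.gauss(0, 0.5) for x, y in zip(X, Y)]

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

r_all = corr(X, Y)  # ~0: X and Y are unconditionally independent

# Selecting on the common effect (e.g., keeping only high-K "trials")
# conditions on a collider and induces a spurious dependence.
kept = [(x, y) for x, y, k in zip(X, Y, K) if k > 1.0]
r_selected = corr([x for x, _ in kept], [y for _, y in kept])
```

Within the selected subset, a high X makes a high Y less necessary to reach the cutoff, which is why the induced correlation is negative.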

By now it should already be clear why we take the arrow to mean that the causal effect is merely possible. All it takes for some path to provide a valid candidate explanation of the observed correlation is for the path to be theoretically possible and conductive. That is why the fewer arrows there are in the graph, the stronger the assumptions: there are fewer alternative explanations and more can be inferred from data about the generating process. It follows that the more theoretically possible forms of monitoring or regulation there are, the harder it is to deconfound measures of metacognition in general. As we will now show, the relevant literature clearly indicates that it is more difficult to list processing stages that cannot possibly be monitored or regulated than it is to list ones that, at least theoretically, can be.

Metacognition is usually studied using tasks in which the stimuli or their properties can be experimentally controlled and both the first-order (e.g., classification or free recall) and the second-order (some form of metacognitive judgement) responses are measured, sometimes simultaneously. The generic graph representing the theoretically possible causal processes responsible for performing such tasks is shown later in the paper (see

A partial causal graph representing a generic hierarchical task. Here

For our purposes it will be useful to divide the

We will restrict our attention to monitoring or regulatory processes that operate during a task trial. If the data are not aggregated over trials, then the effects of any trial-level events on subsequent trials can often be safely ignored, which simplifies the graph considerably. This alone is a good reason not to aggregate repeated measures data: for example, by performing statistical analysis on data that are not aggregated over trials, we can ignore the possibility that the perceived distribution of the stimuli on previous trials alters the decision criterion used on the following trial, or the possible effect of confidence in a given trial on confidence in the following trials (Rahnev et al., ), i.e., effects of the form X_{i} → X_{i+1}, where i indexes trials and X stands for the affected variable.

When deriving the graph for a generic hierarchical task, we will not assume anything about the stimulus or the response other than that the stimulus is randomly assigned. In this way, our model can be applied both to finite alternative forced-choice tasks as well as to tasks where the space of valid responses is not clearly defined, such as learning tasks with a free recall stage.

In the following section, we will provide a non-comprehensive list of theoretical and empirical arguments for introducing specific arrows in the graph that represents the overall process responsible for performing a generic hierarchical task. Note that the fact that we mention a study or a hypothesis does not necessarily mean that we agree with the interpretation of the results given by the authors; the reason that we do not preface most of the interpretations of the results in terms of metacognitive monitoring or regulation with the phrase “according to the authors” is readability. We do not believe that such conclusions are demonstrably false but, as our results imply, establishing the validity of such claims may require careful analysis of confounding paths. Usually, the fact that we list some study as indicating that a certain causal effect may exist only means that the authors expressed a hypothesis that has causal meaning and—because it broadens the scope of possible explanations—that should be taken into account when designing or interpreting the results of an experiment on metacognition.

Some variables that influence metacognition change asynchronously with task stimuli. For example, Samaha et al. (

Most theories of confidence assume that metacognitive assessments are informed by stimulus-related information, such as the quality of a perceptual item, its intensity or its size (e.g., Vickers and Lee,

The first-order decision process that follows the stimulus-encoding stage may also be monitored by metacognitive processes. For example, an important class of hypotheses in metamemory studies concerns the relation between fluency or ease of processing and metacognitive judgments (Kelley and Lindsay,

The very act of making a decision may also affect metacognition, for example, by reducing uncertainty (Busemeyer et al.,

It seems that the majority of studies on metacognition are concerned with monitoring, while metacognitive regulation is studied less frequently, especially in basic research. Sometimes authors (including us) may even omit the regulatory role when defining the term “metacognition,” stating, for example, that it refers to the ability to monitor one's cognitive processes or to knowledge about ongoing task performance (e.g., Metcalfe and Shimamura,

Metacognitive regulation during stimulus-encoding stages is probably ubiquitous, given the assumption that perception is an active process (for review see: Stark and Ellis,

The generalizations of Signal Detection Theory provide theoretical arguments for the existence of a regulatory arrow from metacognition to the decision process, as well as for the existence of an arrow that enters the stage of making the decision. According to the common interpretation of the diffusion model (Ratcliff and McKoon,

Metacognitive regulation has also been studied in the context of learning. These studies indicate that the allocation of learning time or the selection of learning strategies may be guided by metacognitive monitoring and metacognitive knowledge. For example, feeling-of-knowing judgements positively correlate with the time spent on a question before giving up (e.g., Gruneberg et al.,

Finally, correlations between motor response properties and confidence judgments found in many studies may also be interpreted as manifestations of the regulatory role of metacognition. For example, the positive correlation between confidence and reaction time may be at least partially explained by the hypothesis that when confidence in a decision is high there is little need to be cautious and the motor execution of the response can be relatively fast (see Gajdos et al.,

We are now in a position to draw, in

To improve the readability of this graph, causal loops are represented by bi-directional edges. Note that the

Because there is more than one possible causal loop in

The majority of studies on metacognition target metacognitive monitoring. The results of metacognitive monitoring, such as choice confidence, are only observed—they are not experimentally manipulated—and the sources of monitored information are also often not subject to experimental manipulation, at least not directly. For this reason alone, but also because of the possibility of metacognitive regulation, any measure of metacognitive monitoring may be biased.

Imagine that a researcher was interested in metacognitive monitoring, or in metacognitive “resolution” or “accuracy” interpreted as a property of metacognitive monitoring. This researcher measured both accuracy and confidence and interpreted their correlation as a measure of metacognitive monitoring. This situation is so common in studies on metacognition that it deserves a graph, shown in

Confounding paths in a generic metacognitive monitoring study.

Here

The arrows from

If the researcher was interested only in the correlation between _{1} → _{1} → _{2} → _{2}⋯_{n} →

Imagine also that the data were analyzed with a linear regression of the form β_{0} + β_{1}·accuracy + β_{2}·group + β_{3}·accuracy·group [the between-group difference in the accuracy–confidence relation corresponds to the test β_{3} ≠ 0, assuming linearity]. This would certainly explain the between-group difference in the

With some modifications, the graph from

Common use of simple deconfounding strategies such as controlling for first-order task performance clearly indicates that researchers who study metacognition are well aware of the critical importance of deconfounding. However, as we will now demonstrate, these popular simple deconfounding strategies not only fail to address this issue in its full generality but may even introduce bias.

A popular approach to deconfounding measures of metacognition, or measures of effects of various manipulations on metacognition, is by attempting to make some chosen performance measure equal between the conditions, either by intervention, as in calibration or staircase^{3}

The basic idea, which dates back at least to Nelson (

Unfortunately, this is not how deconfounding works. Statistically controlling for a variable just because it correlates with the effect of interest may just as easily introduce bias instead of removing it. Trying to intervene on a variable (here by staircasing or calibration) may alter this particular variable and may remove all the other arrows that point to it, but this does not mean that it removes all the confounding paths to which this variable is somehow connected.

In order to achieve deconfounding one first has to consider how confounding may arise: it is only after assuming something about the way in which the observed effects may be causally attributed to first-order and metacognitive processing that something meaningful can be said about the role of controlling for first-order task performance. We will now prove that the claim that controlling for first-order task performance deconfounds measures of metacognition is not true without additional strong causal assumptions and that it is, in fact, unlikely to be true in general. We will only consider two popular ways of controlling for first-order task performance, namely calibration and including the performance estimate in a regression analysis, but with minor modifications, our reasoning can be easily generalized to other cases.

Controlling for first-order task performance by calibration in metacognition studies usually consists of altering the stimuli in the preliminary stage of the experiment in such a way as to make the chosen performance measure more or less equal between the conditions. Anything that we say about calibration can also be said about staircasing, but not vice versa, since staircasing is often continued throughout the task. As long as performance does not change during the experiment, calibration may render any observed differences in measures of metacognition statistically unrelated to the calibrated performance measure.

Calibration certainly limits the set of possible paths between the stimulus and the response to those that correspond to the fixed performance score. However, this is a purely statistical constraint, not a causal one.

Common trust in the deconfounding power of calibration or staircasing is based on a conceptual error: just because the first-order task performance is made equal between conditions, it does not follow that the underlying first-order processing is equal as well.
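A toy staircase illustrates why equalized performance does not constrain the latent process. The 1-up/2-down rule below, applied to an invented psychometric function, pins accuracy near its ~70.7% target while referring only to observed responses, never to the stages that produce them.

```python
import math
import random

random.seed(6)

def p_correct(intensity):
    """Hypothetical psychometric function rising from 0.5 to 1.0."""
    return 0.5 + 0.5 / (1.0 + math.exp(-3.0 * (intensity - 1.0)))

# 1-up/2-down staircase: two successive correct responses make the task
# harder, a single error makes it easier; converges near 70.7% accuracy.
intensity, step, streak = 2.0, 0.1, 0
outcomes = []
for _ in range(2000):
    correct = random.random() < p_correct(intensity)
    outcomes.append(correct)
    if correct:
        streak += 1
        if streak == 2:
            intensity -= step
            streak = 0
    else:
        intensity += step
        streak = 0

acc_tail = sum(outcomes[1000:]) / 1000  # accuracy after convergence
```

Nothing in the procedure distinguishes between the many causal structures that could generate p_correct, which is exactly why equal performance does not imply equal underlying processing.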

Task performance can also be controlled for statistically. When there is uncertainty as to which performance measure is most relevant, the researcher can perform separate analyses, each time controlling for a different performance measure to see if the results hold. One way to statistically control for first-order task performance is by introducing the performance measure as a predictor in the regression analysis that is aimed at estimating the metacognitive effects of interest. This method succeeds only if (1) it breaks all the confounding paths that are not dealt with by other means and (2) the first-order performance measure is not influenced by any stage along the target path. The second, arguably less obvious but equally important condition is necessary because conditioning on the descendant of a stage along the target path takes away some (or all, if the variable conditioned on ^{4}
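Condition (2) can be illustrated with a deliberately simple linear system (all names and coefficients are our own assumptions): the effect of a manipulation Z on a metacognitive judgement M runs entirely through an encoding stage E, and accuracy Acc is a noisy descendant of E. “Controlling” for Acc then absorbs part of the very effect being estimated.

```python
import random

random.seed(4)

n = 20000
Z = [random.gauss(0, 1) for _ in range(n)]   # randomized manipulation
E = [z + random.gauss(0, 1) for z in Z]      # stage on the target path
M = [e + random.gauss(0, 1) for e in E]      # metacognitive judgement
Acc = [e + random.gauss(0, 1) for e in E]    # performance, a descendant of E

def slope(x, y):
    """OLS slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sum((a - mx) ** 2 for a in x)

def residuals(y, x):
    """Residuals of y after regressing out x."""
    b, mx, my = slope(x, y), sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

b_total = slope(Z, M)  # ~1.0, the true total effect of Z on M
# Partial slope of M on Z "controlling for" Acc (Frisch-Waugh residualization):
b_partial = slope(residuals(Z, Acc), residuals(M, Acc))  # ~0.5: biased downward
```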

To see when conditioning on first-order performance may result in successful deconfounding, consider a study in which some stimulus-level manipulation (

Deconfounding by conditioning on first-order task performance.

Note that here we generously assume that

Our reasoning generalizes to statistical control of first-order task performance when it is built into simplified models of decision-making, such as models of metacognitive judgement based on Signal Detection Theory. In fact, one such model, called meta-

In every model of metacognitive judgement based on Signal Detection Theory that we are aware of, the process of arriving at a decision is represented either by an internal evidence sample, as in the meta-
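A minimal sketch of this family of models (our own simplification, not a faithful implementation of any published model): a single internal evidence sample determines both the decision and the confidence rating, so any measure computed from decisions and confidence ratings draws on the same latent quantity.

```python
import random

random.seed(7)

d_prime = 1.5         # first-order sensitivity (illustrative value)
conf_criterion = 1.0  # |evidence| above this yields "high" confidence

n = 10000
records = []
for _ in range(n):
    stim = random.choice([-1, 1])
    # One evidence sample drives BOTH responses.
    x = random.gauss(stim * d_prime / 2.0, 1.0)
    decision = 1 if x > 0 else -1
    high_conf = abs(x) > conf_criterion
    records.append((stim == decision, high_conf))

hits_high = [c for c, h in records if h]
hits_low = [c for c, h in records if not h]
p_high = sum(hits_high) / len(hits_high)  # accuracy on high-confidence trials
p_low = sum(hits_low) / len(hits_low)     # accuracy on low-confidence trials
```

In this model p_high exceeds p_low by construction: an accuracy-confidence correlation appears without any separate monitoring process, which is why such a correlation cannot by itself isolate metacognitive monitoring.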

No part of an SDT model can help in disentangling the vertical arrows in

As we hope we have already demonstrated, without formal causal analysis it is not easy to see whether deconfounding of measures of metacognition has been achieved, even in the case of the widely practiced, intuitively sound and seemingly straightforward control of first-order task performance. The limitations of performance equalization, or of fitting simplified models of decision-making, as methods of studying metacognition are a consequence of several properties that make metacognition a challenging subject of study: little is known about the mechanism of metacognition, so the researcher is forced to consider many arrows and paths, which in turn may force the researcher to address many confounding paths. Moreover, these confounding paths can be particularly problematic because, by definition, metacognitive processes may be connected uni- or bi-directionally with many different stages of the first-order process.

Ultimately, the limitations of all the approaches to deconfounding metacognition that we have analyzed so far are consequences of strong causal assumptions which are implicit in simplified models or in simple statistical corrections. There are several general-purpose approaches to deconfounding which can be used in studies on metacognition and which are robust in the sense that they may not require strong unsubstantiated causal assumptions. We describe these methods here because compared to a fully-fledged causal analysis targeted at a particular research problem and study design, they are relatively easy to apply, and they may already be familiar to many researchers who study metacognition.

For reasons of space, the purpose of this final section is only to provide a set of pointers and examples of how some already established practices could help in addressing various confounding issues. We want to stress that none of these methods is powerful enough to replace causal analysis. Moreover, their robustness comes at a price: as we mentioned at the beginning of the paper, these methods are rather generic, which means that they are not based on strong causal assumptions about the target latent mechanism, and so they may not allow for particularly strong causal conclusions. As we will see, in a way all these methods revolve around the idea of deleting arrows or paths.

Sometimes confounding paths can be guaranteed to be broken by the design of the study. One example is the studies on the effect of response order in which some of us were involved in the past (Siedlecka et al.,

To the extent that it is possible that the arrows belonging to a confounding path are not real, it makes sense to try to demonstrate this empirically. Interestingly, demonstrating that some conductive path does not exist does not require an unbiased estimate of the path. To see why, imagine that a researcher was interested in the causal effect of

We are aware of two ways of solving the problem of obtaining evidence for the null, but we will only mention them briefly since this is not a paper on statistical analysis. One popular solution is to use Bayesian inference. The null hypothesis significance testing framework is ill-suited to the task of arguing for the null hypothesis because a lack of statistical significance in no way indicates that the effect does not exist; it only means that the effect was not reliably detected. Moreover, in frequentist inference it is impossible to obtain a probabilistic statement about the null hypothesis, because point hypotheses such as the null are not points of some probability space, and so frequentist point hypotheses can only be true or false. In Bayesian inference, a set of mutually exclusive and exhaustive hypotheses may form a probability space associated with a prior probability distribution and, once the data are obtained, a posterior probability distribution. A common approach to arguing for the null in Bayesian inference is by using the Bayes Factor in the form of the Savage-Dickey ratio (Wagenmakers et al.,
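For a conjugate normal model, the Savage-Dickey ratio can be computed in closed form. The sketch below assumes a N(0, τ²) prior on the effect δ and a sample mean ȳ distributed as N(δ, σ²/n); the numerical inputs are invented for illustration.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def savage_dickey_bf01(ybar, n, sigma=1.0, tau=1.0):
    """BF01 = posterior density of delta at 0 / prior density of delta at 0."""
    se2 = sigma ** 2 / n                      # variance of the sample mean
    post_var = 1.0 / (1.0 / tau ** 2 + 1.0 / se2)
    post_mean = post_var * ybar / se2         # conjugate normal update
    return normal_pdf(0.0, post_mean, post_var) / normal_pdf(0.0, 0.0, tau ** 2)

bf01_null = savage_dickey_bf01(ybar=0.02, n=200)    # near-zero observed mean
bf01_effect = savage_dickey_bf01(ybar=0.50, n=200)  # sizeable observed mean
```

Here bf01_null comes out well above 1 (evidence for the null), while bf01_effect is effectively zero, illustrating how Bayesian inference can quantify support for a point null rather than merely fail to reject it.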

Arguing for the null is also an essential part of Sternberg's method of demonstrating separate modifiability by selective influence (Sternberg,

Given all of the above, it seems worthwhile to briefly introduce the method of separate modifiability. In its most basic form, this method consists of finding two distinct randomly assigned factors,
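The logic of separate modifiability can be sketched with an invented additive-stage model: factor F affects only stage 1 and factor G only stage 2, so their effects on the summed completion time are additive and the 2 × 2 interaction is, up to noise, zero. All stage durations and effect sizes below are arbitrary assumptions.

```python
import random

random.seed(8)

def completion_time(f_level, g_level):
    # Hypothetical two-stage process: F influences only stage 1,
    # G influences only stage 2; the observed time is their sum.
    stage1 = 300.0 + 50.0 * f_level + random.gauss(0, 20)
    stage2 = 200.0 + 80.0 * g_level + random.gauss(0, 20)
    return stage1 + stage2

def mean_time(f, g, n=5000):
    return sum(completion_time(f, g) for _ in range(n)) / n

m00, m01 = mean_time(0, 0), mean_time(0, 1)
m10, m11 = mean_time(1, 0), mean_time(1, 1)

effect_f = m10 - m00                     # ~50: F is effective
effect_g = m01 - m00                     # ~80: G is effective
interaction = (m11 - m10) - (m01 - m00)  # ~0: effects add, stages separable
```

Observing two effective factors with a null interaction is the data pattern that separate modifiability interprets as evidence for distinct stages, which is why arguing for the null interaction matters.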

The purpose of separate modifiability is to decompose a latent mechanism by providing information about the separate parts from which it could be composed. By itself, this method does not deconfound anything, but it is a robust method that may help in understanding the problem of confounding by providing information about the latent causal structure.

In order to derive valid conclusions from a study, researchers may have to acknowledge the inherent limitations of the chosen method and settle for a modest interpretation of the results. Similarly, sometimes the only way of dealing with the problem of confounding may be to look for a different target quantity. When deconfounding measures of metacognition, it does not matter whether the measure of statistical dependence is theory-based (e.g., meta-d′ or SDT thresholds) or not (e.g., a logistic regression slope or gamma correlation), because our results hold for any measure of statistical dependence between first-order and metacognitive responses.

This means that often the terms “metacognitive monitoring,” “metacognitive sensitivity,” or “metacognitive efficiency” may have to be replaced with something else. One alternative is to use the term “metacognitive accuracy,” interpreted strictly as denoting the statistical relation between accuracy and some metacognitive judgement; another is to introduce a new term, such as “metacognitive coupling,” to emphasize that some unknown causal connection is there and that it may or may not be bi-directional. Perhaps the term “metacognitive judgement formation,” when used carefully, may also be appropriate. Admittedly, this will often make conclusions much less impressive, but it may also be the only way to ensure that what the researcher argues for is not just wishful thinking, i.e., that the conclusions actually follow from the theoretical assumptions and the data.

In this paper, we have demonstrated the limitations of common approaches to studying metacognition, including methods specifically aimed at deconfounding. Our analysis shows that detailed questions about metacognition are unlikely to be answered using simple statistical corrections such as conditioning on performance, or by fitting overly simplified mathematical models, such as various generalizations of Signal Detection Theory.

Because by definition metacognitive processes may be connected uni- or bi-directionally with arbitrary stages of first-order processing, confounding is a major problem, and formal causal analysis may be required to correctly identify all the theoretically possible alternative causal explanations of the obtained statistical results, or to design a study that can potentially provide unbiased estimates of target causal quantities. It would be unreasonable to expect that every theoretically possible confounding effect has been identified and discussed, but for the causal conclusions to logically follow from the data and the theoretical assumptions, every possible confounding path has to be addressed.

As the understanding of metacognition advances, some confounding paths may become irrelevant while new confounding paths may appear, thus making studies that once seemed valid look unconvincing or vice versa. In fact, the theoretical analysis that we have presented in this paper led us to question what we thought our own past studies on metacognition indicated.

We believe that it is not unreasonable to expect that every study provides results which are valid given the explicitly stated assumptions. To this end we have advocated modesty when interpreting the data, using selective influence and special designs that break confounding paths in order to better identify distinct parts of metacognition, and, most importantly, supplementing intuitive understanding of causality with formal analysis.

The main results were derived by BP with some help from MS. Most of the text was written by BP with great help from MS, with the exception of the sections describing the hypotheses about metacognitive monitoring or regulation expressed by other authors, which were written mostly by MS. MK provided valuable feedback and was partially responsible for finding the relevant literature. All authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

^{1}

^{2}This can be proved: if

^{3}The limitations of staircasing in studies on metacognition were recently discussed by Rahnev and Fleming (

^{4}This may also introduce bias in another way (see Pearl,