Minimalist Social-Affective Value for Use in Joint Action: A Neural-Computational Hypothesis

Joint Action is typically described as social interaction that requires coordination among two or more co-actors in order to achieve a common goal. In this article, we put forward a hypothesis for the existence of a neural-computational mechanism of affective valuation that may be critically exploited in Joint Action. Such a mechanism would serve to facilitate coordination between co-actors permitting a reduction of required information. Our hypothesized affective mechanism provides a value function based implementation of Associative Two-Process (ATP) theory that entails the classification of external stimuli according to outcome expectancies. This approach has been used to describe animal and human action that concerns differential outcome expectancies. Until now it has not been applied to social interaction. We describe our Affective ATP model as applied to social learning consistent with an “extended common currency” perspective in the social neuroscience literature. We contrast this to an alternative mechanism that provides an example implementation of the so-called social-specific value perspective. In brief, our Social-Affective ATP mechanism builds upon established formalisms for reinforcement learning (temporal difference learning models) nuanced to accommodate expectations (consistent with ATP theory) and extended to integrate non-social and social cues for use in Joint Action.


INTRODUCTION
The notion of Joint Action has received various definitions. The popular perspective from the 90s onwards has viewed it as a manifestation of shared intentions to act between two or more individuals (e.g., Gilbert, 1990; Searle, 1990; Bratman, 1992; Tuomela, 1993). Tomasello, for example, has stated: "[t]he sine qua non of collaborative action is a joint goal and a joint commitment" (Tomasello, 2010, p. 181). Bratman's (1992) shared intentional position on Joint Action can be described as follows: (i) inter-acting agents have intentional behavior toward an outcome, (ii) agents adjust ("mesh") subplans of the intentional behavior to account for the other(s), and (iii) the agents are aware of the (adjusting) intentions of the other(s). The collective aspect is supposedly captured by this form, and there is a requirement of interrelatedness of individual intentions among group members: group members have the "same" individual thought on this collective form. An important ingredient in Tuomela's social ontology, for example, is the collective mode of thought (we-mode) to be distinguished from the me-mode. In this view, broadly, individuals can act as members of groups either for satisfying private ends or for satisfying group ends.
Definitions abound that attempt to de-emphasize the role that shared intentions play in Joint Action, many of which are action- or outcome-focused. Butterfill (2012), for example, bases his definition of Joint Action on shared goals rather than shared intentions. On this account, goal-directed 1 behavior need not be intentional: "there are ways of representing actions as goal-directed which do not involve representing intentions or any other propositional attitudes of agents" (p. 13). Other "minimalist" accounts of Joint Action can be found. For example, Miller (1992) posits that in Joint Action: "there is more than one agent; each agent is performing (at least) one action; each agent's action is dependent on the actions of the other agents" (p. 275). Knoblich and Jordan (2002) define Joint Action as a "[situation] where neither member of a group can achieve a common goal on his own but only with the help of the other member" (p. 2), and Sebanz et al. (2006) describe Joint Action as "any form of social interaction whereby two or more individuals coordinate their actions in space and time to bring about a change in the environment" (p. 70).
These more "minimalist," action/goal-oriented perspectives focus on those mechanisms that are requisite to many Joint Actions of the type that require coordination in time and space. Minimal mechanisms are pertinent when tasks are new to the actors and/or challenging (not mastered): "minimal representations may be formed when a novice learns to perform a new joint action, particularly if the novice's task is cognitively demanding and leaves insufficient resources to form rich representations that include all components of the joint action" (Loehr and Vesper, 2016, p. 536). It is an open question as to what extent such non-mental mechanisms underlie, constrain, or even substitute for the "higher" cognitive mentalizing purported by the advocates of shared intentionality in Joint Action.
The remainder of this article breaks down as follows: In Section Minimal Mechanisms and Coordination "Smoothers" in Joint Action we discuss minimalist mechanisms that enable Joint Action. Section An Affective Account of Associative Two-Process Theory concerns a description of a value function based on ATP theory, which has been applied to individual learning of differential affective states. In this section, we also introduce our (novel) hypothesis suggesting that such an "affective" implementation of ATP may be applied to a social context relevant to Joint Action. We call this the Social Aff-ATP hypothesis. In Section Neural-Computational Basis for Affective Valuation, we describe our existing neural-computational account of ATP as it applies to the individual, and then propose the (neural-computational) mechanisms that underlie our Social Aff-ATP hypothesis. Finally, in Section Discussion we provide a discussion of the mechanism's functional relevance to a Joint Action context.
1 Shared goals here are defined as "complexes of states and relations" (p. 13).

Investigating Minimal Mechanisms of Joint Action
The notion of minimalism appeals to evolutionary (Tomasello et al., 2005; de Waal, 2008; Decety, 2011) and developmental (Milward et al., 2014; Milward and Sebanz, 2016; Steinbeis, 2016) continuity regarding the mechanisms applicable to social interaction. It provides a bottom-up approach, which attempts to minimize assumptions about the cognitive mechanisms that may account for a particular behavioral phenomenon. Minimalization is closely related to fundamental imperatives to minimize the complexity of Joint Action in Bayesian or active inference treatments of communication and neural hermeneutics (Frith and Wentzer, 2013; Friston and Frith, 2015). Here, the idea is to maximize the marginal likelihood (or model evidence) of forward models of self and other, where the marginal likelihood is equal to accuracy minus complexity. For a given level of accuracy, this means that optimal exchange and Joint Action should be minimally complex and as "smooth" as possible. Thereby, a minimalist approach can be seen as a perspective that fosters deeper understanding of the origins and functions of processes that underlie, or contribute to, performance of Joint Actions.
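The relation between model evidence, accuracy, and complexity referred to above can be written out explicitly. The following is a sketch in assumed notation (observations o, hidden states s, model m, approximate posterior q), following the standard variational formulation rather than anything stated in this article. The bound becomes an equality only when q(s) matches the true posterior.

```latex
\ln p(o \mid m) \;\ge\;
\underbrace{\mathbb{E}_{q(s)}\big[\ln p(o \mid s, m)\big]}_{\text{accuracy}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big[\,q(s)\,\|\,p(s \mid m)\,\big]}_{\text{complexity}}
```

Read this way, two models that explain the observations equally well (equal accuracy) differ in evidence only through the complexity term, which is the sense in which a less complex exchange is preferred.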
A typical "minimalist" Joint Action example is given in the form of a table-moving scenario. Two individuals are said to have as their goal to move a table from place A to B (cf. Sebanz et al., 2006). The table may be too heavy for one actor but manageable for the two actors. This example requires that the actors continually take into account, and adjust to, the patterns of sensorimotor activity of the other. The actors must not simply react to the actions of the other but also predict the other's actions and adapt to them in order to best realize the achieving of the common goal.
A number of studies have sought to investigate the minimal mechanisms that may underlie different varieties of Joint Action (cf. Sebanz et al., 2003, 2005; Richardson et al., 2012). In such settings, "representing" task-based states of others (action outcomes, task rules) is not required for successful completion of the joint activity 2. The indication of presence of such representations, however, is suggestive of their ubiquity and general applicability in social interactions. Apparently, individuals can't help but represent the spatiotemporally coincident (or overlapping) activities of others. The work by Sebanz et al. (2003) and Sebanz et al. (2005) has, respectively, inferred the existence of action-based and task-based representations of others according to scenarios that entailed joint activity where the successful completion of the task for either individual did not depend on the performance of the other in the task. Atmaca et al. (2011), similar to the findings of Sebanz and colleagues above, found that subjects will represent task rules of another co-acting participant even when such knowledge does not beneficially impact upon performance. The general finding of Atmaca et al. (2011) was that participants produced a bigger difference in reaction times when responding to incompatible, vs. compatible, stimuli when they were in a joint condition (another participant present) compared to when they were in the individual condition. The experimenters also found that it mattered whether participants believed that the "other" in the joint condition acted intentionally. As for the Sebanz et al. (2005) experiment, Atmaca et al. (2011) concluded that people in a Joint Activity setting have a strong tendency to represent the task (stimulus-response, or S-R, mappings) of others even when it is not required for successful completion of the task.
2 Such activity is not considered Joint Action since the participants' behavior is not required to be in any way adapted to each other in order to achieve the desired outcome.
The above examples provide evidence that humans can't help but represent information about others when it concerns actions and (arbitrary) task rules using simple stimulus-response mappings. Such tendencies may bear on, or may even have evolved in the context of, Joint Action. In the remainder of Section Minimal Mechanisms and Coordination "Smoothers" in Joint Action and in subsequent sections, we will present how humans may also have a tendency to represent others' value, including affective-based outcomes (and expectancies), and how these may be brought to bear in Joint Action.
Vesper et al. (2010) have proposed a minimalist perspective on Joint Action, which emphasizes the sensorimotor coordination required in physical Joint Action tasks. They suggest that whilst classical Joint Action perspectives that address planning and high level reasoning are not well-equipped to deal with issues of fine-grained spatial-temporal sensorimotor coordination, the opposite is true of sensorimotor-focused perspectives. Vesper et al.'s focus has been to posit an approach for bridging the gap between these two perspectives by focusing on short-term planning, monitoring and predicting the actions of others. This minimalist approach views Joint Action as involving dedicated mechanisms for coordination and is concerned with how Joint Action is performed.

The Role of Emotion in Joint Action
Much literature in Joint Action theory has concerned the shared representation of action effects (or outcomes) (e.g., Knoblich and Jordan, 2002; Sebanz and Knoblich, 2009). These minimalist approaches to Joint Action have, however, overlooked a potentially equally central aspect of Joint Action: shared value states, their expression, perception and inference. Where Joint Action is goal-based, representations of value provide a basis for expectations concerning the outcome of goal-directed behavior. By observing another's emotional state as an expression of anticipation of a goal-directed outcome or through contextually inferring its existence (e.g., empathizing), the monitoring burden (of others' actions and behavior) can be reduced. Michael (2011), like Vesper et al. (2010), has advocated for a minimalist approach to the study of Joint Action, and suggested that emotions may have an important role to play in such an approach. Michael claimed "none of [the] minimalist proposals has addressed the potential role of emotions as coordinating factors in joint actions. In fact, no proposal of any kind has addressed this issue" (Michael, 2011, p. 3). However, there are indications that the potential role of affective 3 states in Joint Action is beginning to garner interest. For example, the role of empathy, which, broadly, concerns the vicarious experience of particular affective states, has been alluded to in several recent Joint Action studies. It has been suggested that self-other representative states can only be understood in relation to the interdependence of motoric, cognitive and affective states (Sacheli et al., 2015; de Guzman et al., 2016; Milward and Sebanz, 2016; Steinbeis, 2016).

AN AFFECTIVE ACCOUNT OF ASSOCIATIVE TWO-PROCESS THEORY
Associative Two-Process
In this sub-section, we will discuss ATP theory (cf. Trapold, 1970; Urcuioli, 2005, 2013). We will also discuss differential outcomes training procedures that can illuminate a function for affective states in individuals. This description provides the foundation for understanding a minimalist affective learning mechanism (value function) for use in Joint Action.
ATP theory has been used to explain behavioral and learning phenomena that result when different (and arbitrary) stimulus-response (S-R) mappings are paired with different outcomes. These outcomes may be motivational stimuli, e.g., food pellets (for rewarding pigeons or rats), or they may be salient outcomes (e.g., light flashes, visual stimuli). The differential outcomes training paradigm has been used on non-human animals (typically rats and pigeons, cf. Peterson and Trapold, 1982), but also on infant and adult humans (e.g., Estévez et al., 2001, 2003; Holden and Overmier, 2014). According to this training paradigm, different outcomes are associated with different, but "correct" 4, stimulus-response (S-R) mappings.
In the differential outcomes paradigm schematized in Figure 1, arbitrary task rules (S-R mappings) can also be learned but those "correct," e.g., "rewarding," mappings are associated with differential outcomes. In the example in Figure 2, the outcome may simply be the probability of reward (1.0 vs. 0.5) for making the correct response to the presented stimulus.
ATP theory proposes that outcome expectancies, during a training procedure wherein different S-R mappings lead to different outcomes, can cue responses in place of, or in combination with, the external stimuli. The outcome expectancy for a particular reinforcer becomes a stimulus: "the reinforcer itself is part of what is learned" (Urcuioli, 2005, p. 1). In this sense, the classical conception of the stimulus-response-outcome, or (S-R)-O, sequential relation (with S-R in brackets denoting that the stimulus-response association is learned), is more accurately portrayed as (S-E-R)-O where E is the learned expectation tied to a particular outcome. This relationship is captured in Figure 2, which shows how differential outcomes conditions yield different expectations in application of the different task rules (S-R mappings). These differential expectations provide, thereby, an additional source of information to response choice that can potentially facilitate, or even substitute for, the information about the task rules (S-R mappings).
FIGURE 1 | Typical differential outcomes training schedule. In this particular task, the training subject is required to respond differentially to one of two (or more) stimuli (S1, S2 in the figure) for every trial. After some delay (e.g., 3 s) where the Stimulus is removed, two (or more) new stimuli are presented which afford responses (R1 and R2 in the figure). Only one of the two responses gives a reward. Different S-R mappings, however, provide different outcomes (e.g., rewards). In the case depicted here, S1-R1 gives a reward 100% of the time, S2-R2 gives a reward 50% of the time, i.e., a differential outcome according to probability of reward (cf. Urcuioli, 1990). Other S-R mappings receive no reward. Key: ITI, inter-trial interval (in seconds); λ, reward probability; Ø, no reward.
Differential outcomes training procedures have also been applied to Transfer-of-Control (TOC) paradigms whereby learning and adaptive behavior is tested according to changes in the outcome contingencies that the individual experiences over learning trials. A schematic of a TOC is provided in Figure 3 along with the ATP theoretical explanation of the expected learning/behavior. The first two phases consist of a number of conditioning trials for the human/animal to make different associations based on S-R, S-E, and E-R contingencies. Since the outcomes (O1 and O2) are differential for the different S-R mappings in Phase 1 (Discrimination Training), it is possible to effectively classify new stimuli, introduced in Phase 2 (i.e., S3 and S4), by these same outcomes (cf. Urcuioli, 2005, 2013). As a result, when Phase 3 (Transfer Test) occurs, since the animal/human has learned to classify S1 and S3 according to the same outcome (O1), that is, it has formed S1-E1 and S3-E1 associations, S3 automatically cues the response associated with E1 (learned in Phase 1). No new learning is required for this in spite of the fact that the subject has not been exposed to the task rule (S3-R1 mapping) previously. This transfer of control constitutes a form of adaptive switching.
Such a result cannot be explained by recourse to task rules (S-R mappings) alone. The S-E-R route (see Figure 2) provides the means for the subject to produce the adaptive response: it effectively generalizes its previous knowledge to the new setting. This S-E-R route is otherwise referred to as the prospective route (Urcuioli, 1990) since a growing expectation of an outcome is maintained in memory during the interval between Stimulus presentation and Response option presentation. This is contrasted with the S-R retrospective route, so called because the memory of the stimulus is retroactively maintained until response options are presented. Subjects can construct new task rules as a result of this type of inferential 5 behavior.
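The three-phase transfer-of-control logic described above can be illustrated with a toy lookup-table learner. This is a minimal sketch under strong simplifying assumptions, not the model proposed in this article: associations are stored as all-or-none dictionary entries rather than graded weights, and the class name `ATPAgent` and its methods are hypothetical names introduced only for illustration.

```python
class ATPAgent:
    """Toy associative two-process learner (hypothetical, illustration only)."""

    def __init__(self):
        self.s_r = {}  # stimulus -> response (retrospective S-R task rules)
        self.s_e = {}  # stimulus -> expectancy (Pavlovian S-E associations)
        self.e_r = {}  # expectancy -> response (E-R associations)

    def discrimination_trial(self, stimulus, response, outcome):
        # Phase 1: a rewarded S-R pairing forms S-R, S-E and E-R links.
        self.s_r[stimulus] = response
        self.s_e[stimulus] = outcome
        self.e_r[outcome] = response

    def pairing_trial(self, stimulus, outcome):
        # Phase 2: Pavlovian pairing forms an S-E link only (no response made).
        self.s_e[stimulus] = outcome

    def respond(self, stimulus):
        # Use a known task rule if one exists (retrospective route);
        # otherwise let the cued expectancy select the response (prospective route).
        if stimulus in self.s_r:
            return self.s_r[stimulus], "retrospective"
        expectancy = self.s_e.get(stimulus)
        if expectancy in self.e_r:
            return self.e_r[expectancy], "prospective"
        return None, "none"


agent = ATPAgent()
# Phase 1: Discrimination Training (S1-R1 -> O1, S2-R2 -> O2)
agent.discrimination_trial("S1", "R1", "O1")
agent.discrimination_trial("S2", "R2", "O2")
# Phase 2: Pairing (S3 -> O1, S4 -> O2)
agent.pairing_trial("S3", "O1")
agent.pairing_trial("S4", "O2")
# Phase 3: Transfer Test; S3 was never paired with R1 directly.
transfer = agent.respond("S3")
```

In the transfer test the agent has no S3-R1 rule stored, so the response can only be reached through the S3-E1 and E1-R1 links, which is the sense in which the prospective route substitutes for the missing task rule.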

Associative Two-Process Theory and Affect
If we consider the schematized differential outcomes experimental set-up given in Figure 2, the different outcomes concern reward probabilities of 1.0 and 0.5, respectively. Overmier and Lawry (1979), and Kruse and Overmier (1982), suggested behavioral responding, following stimulus presentation, can be mediated by anticipatory frustration or reward according to the strength of the respective expectancies. In the sense of Figure 2, the expectancies (E1 and E2) can represent reward acquisition expectation, and reward omission expectation. Responses are associated with these two types of affective expectation as a function of how often they are rewarded. Thus, "anticipatory frustration... [can] gain at least partial control over one response, while the expectancy of reward [can gain] full control over the other" (Kruse and Overmier, 1982, p. 518). Kruse and Overmier (1982) provided evidence for this phenomenon experimentally. Whilst differential outcomes training procedures have focused primarily on differential sensory outcomes, or otherwise differences in magnitude of rewarding outcomes (and only sometimes on probabilities), the same principle may also be applied to differential punishing outcomes (Overmier and Lawry, 1979).
The notion of classifying emotionally-relevant stimuli by differential affective states has much in common with the work of Damasio (1994, 1999). Damage or absence (through lesioning) of brain structures (amygdala, prefrontal cortex) implicated in emotion elicitation and regulation led patients of Damasio into perseverative, overly rationalized or otherwise inappropriate decision-making. Damage to areas such as orbitofrontal cortex has also been implicated in rigidity of decision-making by Schoenbaum et al. (2003) (also Delamater, 2007) and Rolls (1999). In such cases, reversing responses to previously, but no longer, rewarded behaviors may be compromised.
FIGURE 3 | Transfer of Control Paradigm with Differential Outcomes (Discriminative) Training. The conditioning consists of three phases: Phase 1, a Discrimination Training phase where different stimulus-response (S-R) mappings (S1-R1, S2-R2) yield different outcomes (O1, O2); Phase 2, a Pavlovian learning phase where new Stimuli are presented and associated with previously experienced outcomes; Phase 3, an instrumental transfer phase where the Stimuli from Phase 2 are re-presented as are the response options from Phase 1. ATP theory predicts that responding will be based on already existing S-E and E-R associations learned from the first two Phases where the theorized preferred selections (underlined Rs) are shown here. This has been described in terms of cueing the response(s) associated with those stimuli classified by a common outcome; in this case S1 and S3 are classified by O1. Adapted from Urcuioli (2005).
Furthermore, Miceli and Castelfranchi (2014) have referred to an S->A->R mode of associative processing where A stands for "Affect." In this case the links between affect and stimuli are hypothesized as being bi-directional (also see de Wit and Dickinson, 2009 for discussion). An adaptive benefit of this bidirectionality may be that affective states can have a role in selective attention to external stimuli, suppressing attention to those stimuli incongruent with the present affective state.

Affective Associative Two-Process Theory and Joint Action
To our knowledge, whilst differential outcomes training procedures and ATP theory have been applied to human learning and decision-making (cf. Maki et al., 1995; Estévez et al., 2001; Urcuioli, 2005; Esteban et al., 2014; Holden and Overmier, 2015), no application has been made to the social interaction domain. While the relevance of the paradigm (separate instrumental and Pavlovian learning phases) might appear opaque to the types of Joint Action scenarios used to investigate the possibility of shared task representations given by Sebanz et al. (2005) and Atmaca et al. (2011), we suggest the significance of the above-mentioned Transfer of Control (TOC) paradigm to Joint Action is as follows:
1. Co-actors' observation of others' stimulus (event)-outcome contingencies permits a type of Pavlovian learning.
2. Observing others' stimulus-outcome associations, and learning therefrom, may help avoid the correspondence problem (mapping physical movements of others to those of self; cf. Brass and Heyes, 2005; Heyes and Bird, 2008) involved in learning by others' actions only.
3. Learning by differential outcomes can facilitate the learning of task rules both of self and other, and can lessen the importance of having explicit knowledge of task rules.
Although a social TOC paradigm does not directly entail Joint Action, similar to the paradigms of Sebanz et al. (2003, 2005) and Atmaca et al. (2011), this paradigm may be used to provide evidence for tendencies for individuals to represent others' affective states for use in Joint Action. According to our postulates 1-3 above, being able to appraise events for self and emotionally appraise the state of the other serves as an additional coordination facilitator that lessens the burden on monitoring and detecting the other's actions both in terms of learning how to perform a task and also in terms of learning the task (rules). Taking the example of Figure 3, one actor in the Pavlovian ("Pairing") phase would, instead of passively encountering newly presented stimuli, perceive these stimuli presented to an observed co-actor. The observer would then vicariously experience, or otherwise learn, these associations and relate them to their own behavior. In this way, the perceiver, having previously learned, for example, an E1-R1 association ("Discrimination Training" phase) and an S3-E1 association (vicariously in the "Pairing" phase), would in the instrumental "Transfer Test" phase already have access to the S3-E1-R1 affective (anticipatory) route that can substitute for explicit knowledge of (or exposure to) the S3-R1 task rule. This means that, without any new learning, the observer would be able to transfer vicariously experienced knowledge to his/her own behavior. Consistent with the requirement of minimal monitoring for spatiotemporally synchronized Joint Actions (Vesper et al., 2010; Michael, 2011), the requirement for the above-described social transfer of control (or social TOC) would be that the observer is, minimally, attentive to the co-actor's stimuli and outcomes but would not require monitoring of ongoing actions.
Requisite to this perspective are neural-computational mechanisms that can relate other's outcome, or expected outcome, to one's own response repertoire. We will turn to this in the next section.

Neural-Computational Basis for Affective Valuation in Individual Action
In previous work we have described a computational model of differential outcomes expectancies based on reward (acquisition) expectation and reward omission expectation learning (Lowe et al., 2014). Our model provided a qualitative replication, in simulation, of the results of Maki et al. (1995) and Estévez et al. (2001) concerning differential outcomes training of children of different ages between 4 and 8.5 years. We describe here only the expectation-based component of the model responsible for learning S-E associations. We focus on this component because of the role it plays in affectively "classifying" stimuli, permitting transfer of control. It thereby provides the basis for the prospective route of behavior. The full model is found in Lowe et al. (2014).
The model, depicted in Figure 4 (right), is a temporal difference (TD) learning neural network instantiation of the Balkenius and Morén (2001) network (Figure 4, left). This TD network, contrary to standard TD learning algorithms, computes a value function according to two dimensions: magnitude, or reward strength, and omission, or reward omission probability. Specifically, the value function computes temporally discounted reinforcer (reward or punisher 6) magnitude (right-side of network) valuation of a given external stimulus (S1, S2,...Si) presented to the network. From this magnitude valuation is derived an omission valuation. Although Balkenius and Morén (2001) did not explicitly state that the "omission" node (depicted in our network schematic of the model) computes omission probability, it effectively does so as a fraction of the magnitude size; therefore, given that the reinforcer magnitude presented to the network is equal to 1.0, the omission valuation will be a probability computation based on experience. The requirement for omission computation is that the magnitude network learns, but does not unlearn, the valuation of the reinforcer. The omission network, on the other hand, necessarily both learns and unlearns its valuation using prediction errors so as to refine its omission probability approximation. This functionality corresponds to that of the orbitofrontal cortex (cf. Schoenbaum et al., 1998, 2003; Delamater, 2007; Watanabe et al., 2007). The requirement for the magnitude network to not unlearn is biologically plausible when this part of the network is considered to implement the learning in the amygdala. Morén (2002), for example, states: "There are studies that imply that conditioning in the [a]mygdala is permanent, or at least very hard to reverse ... The rationale behind this is that once learned, a reaction - especially a negative one - is so expensive to retest that it pays to assume this negative association is valid everywhere unless definite proof otherwise has been established" (Morén, 2002, p. 85). The network, thus, does not unlearn the value but, through inhibition of the output of the reinforcer magnitude network, can learn to inhibit behavioral responding. The model has been demonstrated to capture the "savings effect" whereby behavioral responding, inhibited as a result of a reinforcer no longer being presented to the network, is relearned more quickly than it was initially acquired. This occurs as a result of the relatively fast learning rate of the omission network in the model. This implements a known computational component of the orbitofrontal cortex, i.e., fast and flexible "reversal" learning (cf. Schoenbaum et al., 2007).
Our temporal difference learning adaptation of this model (Figure 4, right side) addresses one limitation of the Balkenius and Morén (2001) model, namely its lack of sensitivity to the interval between stimulus presentation and reinforcer onset. Our model thereby implements a "prospective" component of learning: temporal difference based valuation. The TD learning model of Sutton and Barto (1998) predicted the profile of dopaminergic phasic activation transfer from reinforcer onset to earliest predictive stimulus onset (Schultz, 1998, 2007). The equations of our model are given in the Appendix section (Appendix A). They adapt Doya's (2000) continuous time TD learning equations providing a more neurobiologically plausible implementation of Sutton and Barto's (1998) discrete time TD learning. The TD learning mechanism in our model is described and exemplified in detail in Appendix B.
6 ... punishment systems is, however, considered unlikely by many neuroscientists (e.g., Boureau and Dayan, 2010).
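To make the idea of temporal difference based valuation concrete, the following is a minimal sketch of discrete-time tabular TD(0) learning in the style of Sutton and Barto (1998). It is not the continuous-time formulation of Appendix A, and the function name, trial structure, and parameter values (`alpha`, `gamma`, five time steps) are assumptions chosen only for illustration.

```python
def td_learn(n_steps=5, n_trials=200, alpha=0.2, gamma=0.9, reward=1.0):
    """Tabular TD(0) over a trial: stimulus onset at step 0,
    reinforcer delivered on the transition out of the last step."""
    # values[n_steps] is the terminal state and stays at 0.
    values = [0.0] * (n_steps + 1)
    for _ in range(n_trials):
        for t in range(n_steps):
            r = reward if t == n_steps - 1 else 0.0
            # TD prediction error: obtained plus discounted expected value,
            # minus the current estimate for this time step.
            delta = r + gamma * values[t + 1] - values[t]
            values[t] += alpha * delta
    return values[:n_steps]


values = td_learn()
```

After training, the value at stimulus onset approaches the reinforcer magnitude discounted by the number of intervening steps (here `gamma ** 4`), which illustrates the sensitivity to the stimulus-reinforcer interval discussed above.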
The outputs of the two value functions (for magnitude and omission), when non-linearly transformed, e.g., by a sigmoid function, serve to approximately classify outputs of this value computational network. In the example in Figure 4, non-linearly transformed output (of 0.5) will provide strong inhibition to the output of the Magnitude value function (see Figure 4, right). This stimulus (stimulus 1), is thus classified by omission expectation-a pessimistic affective judgment-and its output may be associated with particular responses (permitting the E-R associations of ATP theory). Conversely, a low omission probability is insufficient to inhibit the magnitude output leading to a dominant optimistic affective judgment.
In the Balkenius and Morén (2001) model (Figure 4, left), outputs of both the omission and magnitude value functions are linear. It should be noted that using a Heaviside threshold function allows for complete classification but at the expense of failing to generate the faster (re)learning characteristic of the savings effect that the Balkenius and Morén (2001) model captures. Thus, the output functions in our model, in using semi-linear functions, are able to produce approximate affective (pessimistic/optimistic) classifications of stimuli whilst preserving the savings effect.
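The division of labor between the magnitude and omission value functions can be sketched at the level of whole trials. This is a hedged simplification and not the model of Lowe et al. (2014): it ignores within-trial timing, uses a deterministic alternating schedule to stand in for a reward probability of 0.5, and the learning rates, sigmoid gain and threshold, and function names are all assumptions.

```python
import math


def sigmoid(x, gain=10.0, threshold=0.25):
    # Semi-linear squashing function; gain and threshold are assumed values.
    return 1.0 / (1.0 + math.exp(-gain * (x - threshold)))


def train(rewards, alpha_mag=0.1, alpha_omit=0.2):
    """Learn magnitude and omission valuations for one stimulus."""
    magnitude, omission = 0.0, 0.0
    for reward in rewards:
        # Magnitude learns from positive errors only, so it never unlearns.
        magnitude += alpha_mag * max(0.0, reward - magnitude)
        # Omission tracks how much of the expected magnitude failed to arrive;
        # it both learns and unlearns.
        omitted = max(0.0, magnitude - reward)
        omission += alpha_omit * (omitted - omission)
    return magnitude, omission


def classify(magnitude, omission):
    """Omission output inhibits magnitude output; the larger one labels the stimulus."""
    omission_out = sigmoid(omission)
    acquisition_out = max(0.0, magnitude - omission_out)
    return "pessimistic" if omission_out > acquisition_out else "optimistic"


# Reward on every trial (probability 1.0) vs. every other trial (stand-in for 0.5).
mag_certain, omit_certain = train([1.0] * 200)
mag_partial, omit_partial = train([1.0, 0.0] * 100)
label_certain = classify(mag_certain, omit_certain)
label_partial = classify(mag_partial, omit_partial)
```

With a reinforcer magnitude of 1.0, the omission valuation for the partially rewarded stimulus settles near 0.5, in line with the reading of the omission node as an omission probability, while the magnitude valuation approaches 1.0 under both schedules.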
The manner in which stimuli classified by differential outcomes can then be associated with responses, consistent with biologically plausible TD learning methods, e.g., Actor-Critic methods (cf. Houk et al., 1995), concerns use of a three-factor learning rule. This is Hebbian learning (pre-synaptic and post-synaptic activations are associated) gated by the reward prediction error generated by the "Critic," which in our model is the inverted prediction error produced by the Omission "Critic" (also see Lowe et al., 2014). ATP theory has been used to explain generic differential outcomes learning findings (Urcuioli, 2005). However, as described with recourse to our neural-computational model, a type of differential outcomes learning involves classifying stimuli by differential probability of reward (cf. Overmier and Lawry, 1979; Kruse and Overmier, 1982). Where probabilities are sufficiently distinct, differential expectations are learned that concern an expectation of an omission of reward and an expectation of an acquisition of reward. A network that implements expectation-based learning of this type can be likened to Rolls' (1999, 2013) stimulus-reinforcer contingency "appraisal" model. Rolls considered the neurobiological underpinnings of this network to be the orbitofrontal cortex (OFC) as it interacts with the amygdala. Interestingly, Watanabe et al. (2007), in relation to work by Watanabe (2000, 2004), described the finding of neural activity in the orbitofrontal cortex correlating with omission of expected reward during a delay period (from predictive cue onset to the time at which reward is intermittently delivered). McDannald et al. (2005) have suggested that it is the interaction between the orbitofrontal cortex and the basolateral component of the amygdala (BLA) that is responsible for the encoding of reward and omission expectations associated with eliciting primary stimuli and responses: "the OFC and the BLA form a circuit that may mediate both learned motivational functions and the use of outcome expectancies to guide behavior" (p. 4626). Delamater (2007) has, similar to McDannald et al., noted impairments in differential outcomes-based and devaluation (omission)-based learning as a result of OFC lesions. Concerning links between Stimulus valuations (i.e., S-E associations) and how they bear on decision making (i.e., via E-R associations), medial prefrontal cortex (Passingham and Wise, 2012) and dorsolateral prefrontal cortex (Watanabe et al., 2007) have been suggested to have respective roles in outcome-contingent choice, and integration of retrospective and prospective memory that may provide a sort of competition mediating response choice.
In sum, there exists abundant neurological and behavioral evidence for this neural computational model of ATP theory providing an affective value function.
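The three-factor learning rule referred to in this section can be illustrated with a short sketch. It is a minimal illustration only, assuming a plain rate-based update; the function name, the learning rate and the example activations are hypothetical and are not taken from the model equations.

```python
def three_factor_update(weights, pre, post, gate, lr=0.1):
    """Hebbian update (pre x post) gated by a scalar third factor.

    weights: list of rows, weights[j][i] links pre unit i to post unit j
    pre:     pre-synaptic activations (e.g., expectation nodes)
    post:    post-synaptic activations (e.g., response nodes)
    gate:    scalar gating signal (e.g., a Critic's prediction error)
    lr:      learning rate (hypothetical value)
    """
    return [
        [w + lr * gate * post_j * pre_i for w, pre_i in zip(row, pre)]
        for row, post_j in zip(weights, post)
    ]


# Hypothetical example: two expectation nodes, two candidate responses.
w0 = [[0.0, 0.0], [0.0, 0.0]]
pre = [1.0, 0.0]   # first expectation node active
post = [0.0, 1.0]  # second response selected

w_gated = three_factor_update(w0, pre, post, gate=0.5)    # error present
w_ungated = three_factor_update(w0, pre, post, gate=0.0)  # no error, no change
```

With a non-zero gate only the connection between the co-active expectation and response units changes; with a zero gate the weights are left untouched, which is the sense in which the prediction error "gates" the Hebbian association.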

Neural-Computational Basis for Affective Valuation in Joint Action
In the domain of Social Neuroscience, which dates back to Cacioppo and Berntson (1992), a key controversy to the present day, and critically significant to Joint Action, concerns whether social value qualitatively differs from non-social value or is fundamentally the same but entails differential pre-processing of (social) stimuli. Resolving this debate is central to understanding the extent to which individuals can detect and monitor the affective states (expected outcomes) of others for facilitating Joint Action.
Adolphs (2010) discussed whether social processing is unique or whether the information processing is of the "same type" as non-social processing. He categorized social processing into: (i) social perception, (ii) social cognition, (iii) social regulation.
All three of the identified domains of information processing are related to the processing of affective information. Adolphs further stated: "An important question for the future is therefore why, when, and how emotion participates in social cognition" (p. 755).

Social Valuation: Extended Common Currency (ECC) vs. Social-Valuation-Specific (SVS)
Ruff and Fehr (2014) reviewed whether a neurobiological distinction between social and non-social value can be made. They highlighted three core aspects of value: (i) Experienced value, (ii) Anticipated value, (iii) Decision value. In the case of (i), orbitofrontal cortex (OFC), amygdala, insula and anterior cingulate cortex (ACC) are linked to the experience of actual reward (or punishment). In the case of (ii), value concerns the use of prediction errors as they derive from anticipated-value signals. In individual decision-making, dopaminergic neurons encode prediction error signals while striatum, OFC and amygdala are said to constitute the reward neural circuitry correlating with value anticipation (cf. Schoenbaum et al., 2003, 2007; Rolls, 1999). Decision value (iii), on the other hand, concerns choice-based preference and is differentiated from anticipated reward value. Its strongest neural correlate, according to Ruff and Fehr (2014), appears to be in the ventral medial prefrontal cortex (vmPFC); also see Damasio (1994, 1999). The above value components have been considered within a social value conceptual framework. Ruff and Fehr identify a dichotomous perspective in the empirical and modeling literature regarding neural circuitry concerned with valuating social signals. On the one hand, social value representations are considered as utilizing the neural circuitry of non-social value representations ("identical neural processes assign motivational relevance to social and nonsocial factors," Ruff and Fehr, 2014, p. 550). This constitutes an "extended common currency" (ECC) perspective whereby distinction between social and non-social information is made outside the value-representation circuit (see Figure 5, left).
An alternative perspective concerns social value and non-social value being represented in separate dedicated circuits (see Figure 5, right) whose anatomical structure and computational processing may, nevertheless, be similar or even identical.
The particular set of modules comprising the value representation are considered by Ruff and Fehr (2014) to "not show specific brain areas and connections but rather...abstract principles of how brain areas and their interactions could implement these computations" (Ruff and Fehr, 2014, p. 551). Such areas can include, therefore, value components that concern (i) Experience, (ii) Anticipation, and (iii) Decision valuation, as listed above. Whether all three aspects of valuation should be considered to fall into the ECC or SVS perspective is not addressed by Ruff and Fehr (2014), however.

Social Valuation and Joint Action
Knoblich and Jordan (2002) provided a high-level "minimalist" Joint Action Architecture based on action outcome effects of a mirror neuron system (see Figure 6). This can be seen as providing a framework from which to interpret models pertinent to Joint Action. In this architecture, a mirror neuron system becomes active when either the individual registers outcomes of actions (e.g., the expected end point of an action), or when the individual observes another organism achieving the same action outcome. This implies an ECC hypothesis as advanced by Ruff and Fehr (2014).
In this Joint Action context, however, these "social" and "nonsocial" effects are further modulated by a system that accounts for the complementarity of an individual's or other's action. Thus, if the particular task requires Joint Action and the engagement with the other is perceived as such Joint Action, the actions of self and other may be modified. Bicho et al. (2011) produced a neural-(dynamic) computational architecture of Joint Action that implements such a division between joint action and individual components, for use in an autonomous robot that was able to interact, via dialogue, with humans on a task that required complementary actions.
While neural computational architectures of Joint Action and emotions exist (cf. Silva et al., in press)7, we are not aware of those that focus on affective learning mechanisms that comprise TD-based value functions. Suzuki et al. (2012) [...] perspective of the nature of social value functions concerns: how humans learn another person's value to inform their own decision-making behavior. These two issues allude to Ruff and Fehr's (2014) identification of Anticipatory and Decision value, where a separation may be made between valuation of stimuli (Anticipatory) and valuation of choices (Decision).
FIGURE 6 | [...] (1) A Mirror (neuron) System whose activity may reflect either the individual effects of the "Self" or those of a perceived "Other"; (2) A Joint Action System whose activity reflects the action outcome effects of Joint Action. Adapted from Knoblich and Jordan (2002).
Figure 7 depicts Suzuki et al.'s (2012) reinforcement learning model of social value. Figure 7A (left) shows a standard (non-TD) Reinforcement Learning (RL) model that updates a value function for the self (S) based on the reward prediction error (RPE) generated following action selection. Each action is valuated by previous experience according to whether it leads to reward or not. In this model, unlike our model (illustrated in Figure 4, right), a single value dimension is depicted, which is labeled "Rwd Prob" (i.e., reward probability). Reward magnitude, held constant in the social condition of Suzuki et al. (2012), was multiplied by reward probability. Suzuki et al. (2012) posit the existence of two types of simulated prediction error that can be used when predicting the outcome of the Other in a particular task. An sRPE (simulated reward prediction error) uses the perceived outcome of the Other to update a predicted value function of the Other. Replicating the Self value function (Figure 7, left), this function valuates different actions, which are then compared as part of action selection. Moreover, the sAPE (simulated action prediction error) also updates the Other's value function, which is used to help predict the choice of the Other, increasing the ability to predict the Other's outcome and subsequent response choice.
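The Self update and the two simulated prediction errors described above can be sketched as follows. This is a minimal illustration only: the softmax choice rule, the learning rates, the particular form given to the sAPE (one minus the predicted probability of the Other's observed choice), the way both errors are summed into one update, and all function names are assumptions made here for illustration, not the equations reported by Suzuki et al. (2012).

```python
import math


def softmax_prob(q, beta=3.0):
    """Choice probabilities from action values (assumed softmax rule)."""
    exps = [math.exp(beta * v) for v in q]
    total = sum(exps)
    return [e / total for e in exps]


def update_self(q, action, reward, lr=0.2):
    """Non-TD update of the Self value function from the RPE."""
    rpe = reward - q[action]
    q_new = list(q)
    q_new[action] += lr * rpe
    return q_new, rpe


def update_simulated_other(q_other, other_action, other_reward,
                           lr_reward=0.2, lr_action=0.2, beta=3.0):
    """Update the simulated Other's value function from sRPE and sAPE."""
    p = softmax_prob(q_other, beta)
    srpe = other_reward - q_other[other_action]  # simulated reward PE
    sape = 1.0 - p[other_action]                 # simulated action PE (assumed form)
    q_new = list(q_other)
    q_new[other_action] += lr_reward * srpe + lr_action * sape
    return q_new, srpe, sape


# Hypothetical trial: two options, the Other chooses option 0 and is rewarded.
q_self, rpe = update_self([0.5, 0.5], action=0, reward=1.0)
q_other, srpe, sape = update_simulated_other([0.5, 0.5], 0, 1.0)
```

Setting `lr_action` to zero in this sketch gives an sRPE-only variant, and setting `lr_reward` to zero gives an sAPE-only variant, mirroring the model comparisons discussed in the text.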
In the validation experiment of Suzuki et al. (2012), they found that their Simulation-RL model was better able to capture behavioral data of participants in a condition requiring them to predict the choices of another subject (in reality a computer program). These choices were valuated by an abstract and probabilistic monetary reward. The Simulation-RL model replicated the empirical data relatively worse, though still fairly accurately, when only sRPE was used as compared to both sRPE and sAPE (reward and action prediction errors). The model did not match the empirical data at all when using only the (Self) RPE or only the sAPE. Suzuki et al. (2012) found that reward prediction error (and simulated reward prediction error) was correlated with neural activity (BOLD signals) in the ventral-medial prefrontal cortex (vmPFC) indicating that, consistent with the ECC perspective of Ruff and Fehr (2014), the simulation of Other's outcome prediction errors recruits circuitry used for individual outcome prediction errors. The authors suggested that their findings provided: "the first direct evidence that vmPFC is the area in which representations of reward prediction error are shared between the self and the simulated-other" (Suzuki et al., 2012, p. 1132). More generally, throughout the decision making process made by Self (for Self) and Self on behalf of Other, vmPFC showed very similar activation in both cases: "the same region of the vmPFC contains neural signals for the subjects' decisions in both the Control and Other tasks, as well as signals for learning from reward prediction errors either with or without simulation" (Suzuki et al., 2012, p. 1132). This finding would suggest that at least one component of value identified by Ruff and Fehr (2014), i.e., Anticipatory value, is shared in neural-computation of value of Self and of Other.
On the other hand, dorsal lateral/medial prefrontal cortex was implicated in generating a simulated action prediction error (of Other). Ruff and Fehr (2014) interpreted these findings as being evidence of a Social-Valuation-Specific (SVS) explanation (see Figure 5, right) of social stimuli processing based on "spatially and functionally distinct prediction errors that nevertheless follow similar computational principles" (p. 556).
In relation to the Joint Action architecture of Knoblich and Jordan (2002; Figure 6), the Suzuki et al. (2012) architecture (Figure 7, right) embeds within an individual circuit additional computational processes for simulating the (action and outcome) effects on other that then lead to motoric outputs in the self. Simulated other prediction errors (correlating with vmPFC activity) provide a basis for a "shared representation" of value that may be requisite to coordinated joint activity (e.g., Joint Action).

Social Valuation and ATP
Let us now refer back to Section Associative Two-Process and the traditional use of TOC experiments as a means of validating the existence of an ATP (See Figure 3). Pavlovian conditioning, as a passive form of learning, i.e., where the subject's responses do not influence the onset of stimuli and outcomes, may also be conceived in a social context. In relation to the pavlovian phase in Figure 3, we postulate that individuals, rather than passively perceiving Stimulus-Outcome pairs in relation to Self, may perceive Stimulus-Outcome pairs in relation to Other. In the sense of the Suzuki et al. (2012) model/experiment described in Section Social Valuation and Joint Action, the subject may perceive the Other's observed (reward) outcome. This could be the result of at least three experimentally manipulated interaction scenarios: (i) Competitive-the Other receives a non-reward (or punisher); (ii) Collaborative-the Other receives a reward (that benefits Self); (iii) Vicarious-the Other receives a reward (neutral to the Self). Suzuki et al.'s (2012) set up explicitly concerned scenario (iii) here. In their set-up external reward was, however, provided for correctly predicting the other's choice (vicarious decision making). The authors provided behavioral and neural-computational modeling evidence to suggest that vicarious reward was not merely ego-centrically experienced, i.e., where the other's actions and outcomes were not perceived as belonging to the other.
The individual's knowledge of the social interaction scenario in which (s)he is placed permits differential pre-processing of social stimuli thereafter valuated according to ECC or SVS neural computational circuitry. Such pre-processing involves perceiving Other as competitor requiring a comparing of outcomes (i), or as a collaborator requiring monitoring of collectively attained outcomes (ii), or focus purely on Other's outcomes (iii).
Central to the perspective of ATP theory is that individuals are able to transfer knowledge (and "generalize," or best guess) from previously experienced instrumental and pavlovian phases to a novel instrumental phase, i.e., one in which new Stimulus-Response (S-R) pairings are presented. Using the prospective (Stimulus-[Outcome]Expectation-Response) route, given a differential outcomes and transfer-of-control set-up, subjects are able to find the correct S-R mapping with minimal or no learning. Given Collaborative or Vicarious rewarding social scenarios as outlined above, independently representing outcomes of Other subjects is less obviously "typical" (does not require comparison of Self to Other) than for Competitive scenarios (does require comparison).
We consider a social version of the TOC using a differential outcomes procedure an excellent methodological paradigm for testing an ECC hypothesis, specifically, our Social Aff-ATP hypothesis. This is because if the individual case of transfer can also apply to the social case (transfer based on simulated Other's Stimulus-Outcome pairings), subjects can potentially substitute for a lack of information (about action choice). They can do this through: (i) the use of observed differential outcomes (values) that they vicariously experience and/or (ii) the perception of expressions of Other that map to the Other's value functions (outcome expectations/predictions). Such adaptive switching behavior may facilitate coordination of Joint Action, i.e., provide a coordination smoother (Vesper et al., 2010).
FIGURE 8 | [...] (2001) network, only the stimuli used in the social Pavlovian phase are added. These "Social" stimuli would be required to have direct input into the Non-Social value function. In this sense, social valuations would directly recruit the existing non-social network. This does not preclude the existence of other social valuation representations (e.g., of the like depicted in Figure 9), however. On the right hand side is shown "Other" inputs to the network. It may be possible that the Other's internal process is not so opaque given that affective expressions (of expectations or outcome reactions) map to affective states isomorphic to those of the observer's.

A Neural Computational Implementation of Social Aff-ATP
In this sub-section, we present a Social Aff-ATP neural computational mechanism that constitutes our hypothesis in this article. It is depicted in Figure 8.
The mechanism parsimoniously adapts that of Balkenius and Morén (2001), Balkenius et al. (2009) and Lowe et al. (2014) (see Figure 4) by viewing social processing as a pre-valuation stage (not described here) that exploits the existing function for individual/Self stimuli valuations. The neurobiological, behavioral and neural-computational evidence for the existence of this mechanism, previously considered in terms of individual effects, was given in Section Neural-Computational Basis for Affective Valuation in Individual Action. It is an explicit implementation of the ECC schema of Ruff and Fehr (2014). It also comprises a type of mirror system as described by Knoblich and Jordan (2002) but as it applies to perception/representation of (affective) value outcomes rather than action effects. We discuss below how this mechanism, while not accounting as comprehensively for Joint Action effects as the full Knoblich and Jordan model, permits effects that can be used for Joint Action and is thereby more minimalist. The equations for this mechanism, in viewing social inputs as stimuli to be valuated, are the same as for the originally-conceived individual (non-social) model found in Appendix A.
This mechanism is apt for processing social valuations in vicarious and collaborative social scenarios where representations of Other are not functionally critical.
An alternative mechanism to our Social Aff-ATP is provided in Figure 9. Here valuation of social stimuli is computed using a separate "Other" /Social value circuit and is considered a SVS circuit. More specifically it is a SVS-ATP alternative mechanism.
Consistent with Suzuki et al. there are separate Non-Social and Social value functions for Self and (simulated) Other. Also consistent with Suzuki et al. (2012), both Non-Social and Social value functions recruit the same reward prediction error computations. Suzuki et al. suggest these computations correlate with vmPFC (BOLD) neural activity. Such prediction errors have also been attributed to activity in the striatum (cf. Redgrave et al., 1999). Our interest is, above all, in the neural computational plausibility of such a network rather than the specific anatomical root of the computations. However, in line with Balkenius and Morén (2001), as mentioned in Section Associative Two-Process Theory and Affect, we posit that the value nodes compute activity correspondent to that in the orbitofrontal cortex and amygdala 8 . In our ATP model, prediction errors code for reward magnitude (a non-negative value) and reward omission, i.e., two dimensions of value whose outputs relate to emotions elicited according to the acquisition and omission reward contingencies, respectively (cf. Rolls, 1999).
FIGURE 9 | ATP Pavlovian adaptation of Suzuki et al. (2012). The Non-Social network used the output function of Lowe et al. (2014) and the value function of Balkenius and Morén (2001); we use this latter function rather than our own temporal difference learning version to simplify the illustration. As for Suzuki et al. (2012), the prediction error nodes for Magnitude and Omission expectations are used both by the Non-Social network to update the Self's value function and by the social-valuation-specific network. Simulated Other (Social) value predictions (for Magnitude and Omission), as for Suzuki et al. (2012), are separately represented and have separate outputs. The Other's internal process here may still be seen as a black box, as is the Other's choice. The only information required is the Other's Outcome.
The depicted mechanism in Figure 9, as for the Social Aff-ATP mechanism (Figure 8), outputs value computations of magnitude and omission to relay nodes (E2 and E1) whose function is to non-linearly "classify" the outputs of the stimulus valuations as omission expectant (pessimistic) or acquisition expectant (optimistic); see Lowe et al. (2014) for computational details. These outputs in turn, through a three-factor learning rule (Hebbian learning gated by reward prediction error/dopaminergic projections), can be associated with actions/choices. In the alternative mechanism shown in Figure 9, the Social value representation projects instead to sE1 and sE2 relay nodes, i.e., separate output representations. For the Social Aff-ATP (and SVS-ATP) mechanism we do not focus on the action selection component of the algorithm, which can be represented simply by a winner-take-all laterally inhibiting layer of nodes (each node representing an action/choice). Nevertheless, mathematically, the link between value function output and action selection is analogous in Suzuki et al. and in the Social Aff-ATP mechanism. For Suzuki et al.
stimulus valuations are computed as: Q(S1) = p(S1)R(S1), where Q(S1) is the valuation of stimulus 1 (S1) computed as the product of probability of reward for S1, i.e., p(S1), and magnitude of reward for S1, i.e., R(S1). In the Social Aff-ATP (and SVS-ATP) value functions, E1 is calculated as E1 ≈ R(S1) − (1 − p(S1)), where (1 − p(S1)) is the omission probability and is given by the relayed output of E2 subject to non-linear transformation. When R(S1) is fixed at 1.0, as it is for Suzuki et al. in their social condition, E1 = Q(S1). A difference in our ATP-based models is that both pessimistic/omission probability focused (E2) and optimistic/acquisition probability focused (E1) outputs are permissible, allowing for differential expectation-response associations. Another difference is that Suzuki et al. valuate vicarious actions by incorporating within Q(S1) an action valuation for S1 which substitutes for p(S1). Actions and stimuli are, therefore, not dissociated as they are for the prospective route of the ATP network: the actions elicited by E1/E2 do not have "knowledge" of the stimulus, which permits the classification of a number of stimuli by affective value to then be associated with particular actions, which is critical for TOC effects to manifest.
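The correspondence between the two valuations can be checked numerically. The sketch below is illustrative only: it uses the idealized expressions given in the text, omits the non-linear transformation applied to the relayed E2 output, and the function names and example values are hypothetical.

```python
def q_value(p_reward, magnitude):
    """Stimulus valuation as a product: Q(S1) = p(S1) * R(S1)."""
    return p_reward * magnitude


def e1_value(p_reward, magnitude):
    """Idealized E1: magnitude minus omission probability (1 - p)."""
    omission_probability = 1.0 - p_reward
    return magnitude - omission_probability


# With reward magnitude fixed at 1.0 the two valuations coincide.
matches_at_unit_magnitude = all(
    abs(q_value(p, 1.0) - e1_value(p, 1.0)) < 1e-12
    for p in (0.2, 0.5, 0.8)
)

# With a different magnitude (hypothetical value 0.5) they diverge.
q_half = q_value(0.8, 0.5)    # product form
e1_half = e1_value(0.8, 0.5)  # subtractive form
```

The equality therefore holds only in the special case R(S1) = 1.0; for other magnitudes the product and the subtractive forms give different values.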
The ATP-based circuitry here (Figures 9, 10) focuses on what would be required for transfer of pavlovian knowledge from Other to Self, i.e., for our Social Aff-ATP hypothesis to hold. Importantly, from the perspective of a Social TOC, the network above-described (Figure 9) would not allow for transfer from Other to Self of the learned Stimulus-(Outcome) Expectancy maps in the instrumental transfer phase. This is because although it may be possible to learn the Other's (Social) value function (stimulus outcome valuations) in the pavlovian phase, the association between Other's outcome expectation and Self response cannot be made in the initial instrumental phase as sE1/sE2 outputs would have separate associations with actions / choices to E1/E2 outputs. This description is schematized in Figure 10. It is arguable as to whether the SVS-ATP mechanism depicted in Figure 9, would be more representative of the Suzuki et al. model if Social value magnitude and omission representations/nodes had direct inputs to the Non-Social equivalent nodes. A Social TOC would indeed, in this case, transpire. It would also make the Social value representation redundant when not tied to separate (simulated Other) actions. We have suggested that the SVS-ATP network would be useful when individuals wish to compare their valuations with those simulated for others and the actions they expect others to make in comparison to themselves. This might be viewed in terms of a competitive interaction scenario, but could also be useful in a Joint Action scenario where complementarity of other's valuations and actions to the self should often occur.
FIGURE 10 | Schematic of Associative Two-Process theoretic description of Pavlovian-Instrumental Transfer (TOC). Top: Non-Social TOC (standard ATP description of TOC). Bottom: Social TOC, ATP description according to our Suzuki et al. (2012) compatible social-valuation-specific ATP mechanism (Figure 9). Dashed lines represent learnable connections; solid lines represent causal links; red solid lines represent links of the Other that are hidden to the Self. The three panels (left to right) concern instrumental, pavlovian, and instrumental transfer phase, respectively.
In Figure 10, the standard TOC (non-social/individualistic) is schematized along with the learned associations in each of the first two stages and the causal links that are exploited in the final (instrumental transfer) phase. This is a simplified setup since a standard TOC would include multiple S-R mappings in each phase (allowing for differential outcomes classification of stimuli). The Social TOC, using the SVS-ATP mechanism (Figure 9), shows how such a transfer would not be possible. In the instrumental transfer phase, there is no learned (causal) link between the valuation of S2 for Other (Eother) and the response, and so there is not a complete prospective route by which the correct response (R1) can be automatically cued (i.e., without further learning being necessary). Since the (Self) subject has not previously learned an S2-R1 association (via the retrospective route), there is no causal route to the correct response. Only if there is a further link between Eother and Eself value representations could a transfer be possible. The Social Aff-ATP mechanism (Figure 8), however, in utilizing the value function of Self for stimuli relevant to Other through vicarious stimulus processing, should reproduce the standard TOC found in individuals (Figure 10, top).
In summary, the Social Aff-ATP neural computational hypothesis would predict TOC effects that have been found neural-computationally using a model capturing data for an individual task. This mechanism conforms to the ECC perspective of Ruff and Fehr (2014). An alternative mechanism conforming to the SVS alternative perspective put forward by Ruff and Fehr (2014), and likened to the reinforcement learning model of Suzuki et al. (2012), should not produce a social TOC.
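The contrast between the two predictions can be made concrete with a toy sketch of the three TOC phases. It is a deliberately simplified illustration under stated assumptions: learned associations are reduced to dictionary look-ups, a single stimulus is used per phase, and the function and variable names are hypothetical; the node labels (S1, S2, E1, sE1, R1) follow the text.

```python
def run_social_toc(shared_value_nodes):
    """Return the response cued by S2 in the transfer phase, or None.

    shared_value_nodes=True  mimics the Social Aff-ATP (ECC) mechanism:
        the Other's outcomes are valuated by the Self's own E1 node.
    shared_value_nodes=False mimics the SVS-ATP mechanism:
        the Other's outcomes are valuated by a separate sE1 node.
    """
    # Phase 1 (instrumental, Self): S1 -> E1 and E1 -> R1 are learned.
    stimulus_to_expectation = {"S1": "E1"}
    expectation_to_response = {"E1": "R1"}

    # Phase 2 (social pavlovian): S2 is paired with the Other's outcome.
    other_node = "E1" if shared_value_nodes else "sE1"
    stimulus_to_expectation["S2"] = other_node

    # Phase 3 (instrumental transfer): S2 is presented to the Self.
    # No S2-R1 association exists, so only the prospective route
    # (stimulus -> expectation -> response) can cue a response.
    expectation = stimulus_to_expectation["S2"]
    return expectation_to_response.get(expectation)


ecc_response = run_social_toc(shared_value_nodes=True)   # prospective route complete
svs_response = run_social_toc(shared_value_nodes=False)  # route broken at sE1
```

In the shared-node case the expectation evoked by S2 is already associated with R1, so the response is cued without further learning; in the separate-node case sE1 has no association with any Self response, so no response is cued.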

DISCUSSION
In this article, we have posited a neural computational hypothesis for a minimalist affective-learning mechanism for use in Joint Action. We have called this the Social Aff-ATP (neural computational) hypothesis, which provides a specific, testable implementation of the ECC hypothesis (cf. Ruff and Fehr, 2014). We discussed the ATP theory of differential outcomes learning. We then discussed our neural-computational modeling of this process and how a tweak of the model allowing for the incorporation of social stimuli inputs provides a social variant of the model. We also suggested an alternative mechanism that implements a SVS mechanism comparable to that of Suzuki et al.'s with separate social and non-social value functions. We have presented a schematic describing why this SVS-ATP implementation would not permit a social transfer of control (TOC) of the type that typically manifests in non-social contexts.
In the remainder of the Discussion Section, we will reiterate and elaborate on why we think our Social Aff-ATP mechanism constitutes a minimal mechanism that can have a useful function in Joint Action. Vesper et al. (2010) have suggested that a feature of monitoring and detecting others' actions and action outcomes during Joint Action is to facilitate sensorimotor coordination during the Joint Action. Prediction can "smooth coordination" by enabling co-actors to accommodate each other in space and time or to compensate for deficiencies in performance of the other. Michael (2011) posited that emotions can provide such a role, serving to facilitate alignment and the monitoring and detecting of others (e.g., when the other expresses frustration). A perspective of Urcuioli (2005, 2013) is that outcome expectancies provide a means to effectively classify stimuli (see Figure 3). Action selection can then be simplified through exploiting affordances of the subset of those actions already associated with the outcome expectancy classes. This is a reason why participants under differential outcomes TOC training can immediately select the unique action that leads to the desired outcome even though the stimulus-action (response) contingency has previously not been experienced: subjects have already classified the stimuli according to a given outcome expectancy previously associated with an action. This of course depends on a passive pavlovian phase. We conjecture that, from an evolutionary perspective, it is natural that such observation might be exploited in a social context. In this case, agents observing the stimulus context of another (con-specific), irrespective of strong monitoring of actions, can learn from the stimulus-outcome contingencies and, via ECC circuitry, bring such knowledge to bear on their own instrumental capabilities. Such an ability facilitates coordination as it obviates the need for a ponderous learning process during the Joint Action.
Thereby, particularly when Joint Action is of a sensorimotorically complex nature or requires rapid and flexible interactions, a Social Aff-ATP mechanism may reduce the need to monitor the other's (behavioral) activity. It may often suffice to be aware of the (stimulus) context and the (affective) outcomes of the Other.
The affective component of ATP, which concerns outcome expectancy classifications for differential dimensions of rewards (e.g., differing reward magnitudes, presentation/omission probabilities, qualitatively different rewards) or punishers, may be particularly pertinent to Joint Action. The affective properties of emotion contagion, and empathy identified by Michael (2011) are particularly relevant. In the case of the former, an actor may align his/her affective (outcome expectancy) state with the coactor thereby cuing a subset of action possibilities similar to those cued in the observed actor. In this case, observation of (affective) outcomes may not be necessary to "smooth" coordination but rather observation of the expression of the other when it is isomorphic with the other's affective state. This expression can thereby be predictive of the outcome and facilitate (corrective) action in anticipation of (undesired) desired outcomes. We could envisage social stimuli-face and other bodily expressive computed inputs-as providing an input to the value function alternatives depicted in Figure 8 (social Aff-ATP) and Figure 9 (SVS-ATP) in this case.
The case of empathy relates to our Social Aff-ATP hypothesis where perception of stimuli, in the context of the presence of another, innervates circuits (e.g., mirror neuron circuits) that relate to one's own affective experience. Empathy and contagion may draw on related neural circuits (cf. De Vignemont and Singer, 2006), which recruit similar neural structures as those alluded to for our Social Aff-ATP (see Section Neural-Computational Basis for Affective Valuation in Individual Action)-and also in Knoblich and Jordan (2002). Bernhardt and Singer (2012) (see also Decety, 2011), for example, advocate a core (value-based) circuitry involving amygdala, vmPFC, dopaminergic pathways, as well as anterior cingulate cortex and insula.
Joint Action is, of course, a dynamic phenomenon, and it might be contended that it is not captured using a minimalist, turn-taking, procedure such as the differential outcomes TOC procedure. Such a controlled approach, however, allows for relative analytic ease in identifying mechanisms that may be used in the "wild." A means of further bridging this turn-taking minimalist approach with a more dynamic method would be to include parallel learning and action selection (in line with the minimalist approaches of Sebanz et al., 2003, 2005; Atmaca et al., 2011; Loehr and Vesper, 2016). It is, in fact, possible to employ TOC procedures where the pavlovian and transfer phase are run concurrently (Overmier and Lawry, 1979). Turn-taking TOC procedures might also be used with fine resolution interleaving, e.g., switching from (social) pavlovian to transfer trials every other trial as compared to switching following a block of (social) pavlovian-only or transfer-only trials. Neural-dynamic computational models, of the type we have proposed in this article, may also be deployed in controlled but continuous Joint Action settings when used as controllers for artificial agents, e.g., virtual agents in virtual environments.
Notwithstanding the above arguments for appropriate testing of a social TOC, it is critical to appropriately evaluate whether subjects are, in fact, perceiving stimuli as social (e.g., vicarious) or not (ego-centric). Suzuki et al. (2012) employed sophisticated behavioral analytic means to suggest subjects did indeed act vicariously in choosing reward for others (given that they were themselves rewarded for correct Other choice predictions). Social valuation circuitry may even dynamically express itself as more or less ECC-based or SVS-based depending on the interactive nature of the task (note that Nicolle et al., 2012, found that different circuits might be deployed dynamically according to changing interactive scenarios). Neural representations for individual Self/Other and joint (action) effects may similarly entail dynamic expression according to changing perceptions of the social context in which the agents are acting.

AUTHOR CONTRIBUTIONS
RL conceived of the model and contributed most of the written content. CV and JM provided background theories, discussion, and feedback upon which the model and novel content were based. AA, GL, and PG contributed discussions, written feedback, and written content to the article.

APPENDIX

Appendix A
Our model is given by Equations (A1-A7).
$$V_e(t) = \sum_{s \in S} \theta^e_s(t)\,\phi_s(t)$$

where $V_e(t)$ is the learned value function; $\theta^e_s(t)$ is given by the value function update rule (Equation A2); $e \in \{m, o\}$ is an index denoting the Magnitude or Omission value function, respectively; $t$ is time in $[1, T]$, where $T$ is the total time over which a temporal representation of stimulus ($s$) memory is presented to the network; and $s$ indexes the different (social or non-social) stimuli in $[1, S]$, where $S = 2$ for most cases of differential outcomes training.
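As a minimal sketch of the weighted sum above, the following computes a value function over all time steps from per-stimulus weights and stimulus inputs. Reading the stimulus input as a decaying memory trace, as well as the array shapes and the numerical values, are assumptions made purely for illustration; the article does not specify parameter values.

```python
import numpy as np


def value(theta, phi):
    """Value function as a sum over stimuli of weight times stimulus input.

    theta, phi: arrays of shape (S, T), i.e. S stimuli by T time steps.
    Returns an array of length T.
    """
    theta = np.asarray(theta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    return (theta * phi).sum(axis=0)


# Hypothetical example with S = 2 stimuli and T = 4 time steps.
phi = np.array([[1.0, 0.5, 0.25, 0.0],   # assumed decaying trace of stimulus 1
                [0.0, 0.0, 0.0, 0.0]])   # stimulus 2 absent on this trial
theta_m = np.full((2, 4), 0.8)           # made-up magnitude weights
v_m = value(theta_m, phi)
print(v_m)
```

The same function would serve for the Magnitude and the Omission value function; only the weight array differs.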
Equations (A3-A5) implement the output node computations of the network. $E_2(t)$ provides the output of $V_o(t)$ and $E_1(t)$ provides the output of $V_m(t)$. The sigmoid function in Equation (A5) transforms the outputs of the $V_m(t)$ and $V_o(t)$ value functions to effectively produce expectations (or classifications) of stimulus inputs according to reward acquisition and reward omission expectation, respectively.
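The role of the sigmoid can be sketched with a generic logistic function that squashes a value-function output into an expectation between 0 and 1. This is not the article's Equation (A5): the logistic form, the gain, and the threshold are hypothetical choices used only to show how a graded value becomes a near-binary classification.

```python
import math


def expectation(v, gain=10.0, threshold=0.5):
    """Logistic squashing of a value-function output into (0, 1).

    gain and threshold are hypothetical and not taken from the article.
    """
    return 1.0 / (1.0 + math.exp(-gain * (v - threshold)))


# Hypothetical inputs: a high magnitude value and a low omission value.
e1 = expectation(1.0)   # strong reward acquisition expectation
e2 = expectation(0.0)   # weak reward omission expectation
print(e1, e2)
```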
Equation (A6) gives Doya's (2000) continuous time TD learning prediction error (for the magnitude value function), where $\lambda(t - \Delta t)$ is the reinforcer input (stimulus), $\lambda \in \{0, 1\}$, and $\tau > \Delta t$ is the decay constant. Equation (A7) in our model provides the omission value function prediction error $\sigma_o(t)$, which subtracts the magnitude value function prediction error $\sigma_m(t)$ and computes its own temporally discounted prediction error for expected omission. The same equations may be used both for our individual ATP model (Section Associative Two-Process Theory and Affect) and for our social ATP model (Section Social Valuation and ATP). We do not provide specific values for variables, as the social ATP model currently provides our hypothesis to be evaluated.
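For orientation, the following sketch implements the backward-Euler discretization of the continuous time TD error described by Doya (2000), in which the error combines the reinforcer input, the current value discounted by the decay constant, and the value at the previous time step. It illustrates that general form only and is not a transcription of Equation (A6); the function name, the time step `dt`, and the default parameter values are assumptions.

```python
def td_error(reinforcer, v_now, v_prev, dt=1.0, tau=10.0):
    """Discretized continuous time TD error (backward Euler form):

        delta = reinforcer + ((1 - dt / tau) * v_now - v_prev) / dt

    dt is the simulation time step and tau the decay constant (tau > dt).
    """
    if tau <= dt:
        raise ValueError("tau must exceed dt")
    return reinforcer + ((1.0 - dt / tau) * v_now - v_prev) / dt


# Hypothetical cases with a reinforcer of magnitude 1.
print(td_error(1.0, v_now=0.0, v_prev=1.0))  # fully predicted reinforcer
print(td_error(1.0, v_now=0.0, v_prev=0.0))  # unpredicted reinforcer
print(td_error(0.0, v_now=0.0, v_prev=1.0))  # predicted reinforcer omitted
```

A fully predicted reinforcer yields no error, an unpredicted one a positive error, and an omitted one a negative error, which is the signal the omission value function builds on.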

Appendix B
Figure A1 depicts the computation of omission probability using the prediction error rule of Equation (A7), which is the novel feature of our TD model. The network shows the activation levels at the time step preceding reinforcement presentation (Figure A1, left) and at the time step of reinforcement presentation (Figure A1, right). These computations show reinforcement predictions of an external stimulus, via prediction error updating, that provide a magnitude value of 1.0 and an omission probability value of 0.5. Figure A1 (left) shows, following learning, both the magnitude and omission values ($V_m(t)$ and $V_o(t)$, respectively) of the reinforcer when it is perfectly predicted. The prediction nodes of the Magnitude and Omission Critics are set to zero (white circles), as the input from the magnitude value node (black circle) is perfectly temporally predicted via the relayed discounted value node (lighter gray circle, symbolizing a lower value), which receives input $\gamma V_m(t)$. When reinforcement is presented ($\lambda(t - \Delta t) = 1$ in Figure A1, right), the input from the magnitude value node to the prediction error node at the previous time step ($V_m(t - \Delta t) = -1$) cancels the effect of the reinforcement input ($\lambda(t - \Delta t)$). For the Omission Critic, however, there is a 0.5 prediction of omission and, since reinforcement presentation (of magnitude 1) cancels the $V_m(t - \Delta t)$ input, this produces a −0.5 prediction error and reduces the omission value weights according to its update rate (Equation A7). Conversely, when reinforcement is unexpectedly omitted (not shown), nothing reduces the previous time step $V_m(t - \Delta t) = 1$ input, so the Omission Critic prediction error node generates an error of +0.5. For this reason, the prediction of omission probability will fluctuate around 0.5 following presentations of expected, or omitted, reinforcement as a function of the learning rate: the smaller the rate, the less the fluctuation.
The magnitude value, once learned, is not affected by the negative prediction error, as there is no unlearning of the intrinsic magnitude of the reinforcer. Both the Omission Critic and the Magnitude Critic learn by their temporal difference learning rules, producing asymptotically growing value functions from stimulus onset time until reinforcer presentation time.
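The qualitative behavior described above can be reproduced with a deliberately simplified, trial-level sketch. It collapses the within-trial temporal dynamics into one update per trial and uses plain delta rules rather than Equations (A6) and (A7); the learning rate `beta`, the trial count, and the random seed are hypothetical. The magnitude estimate is updated only when the reinforcer is delivered (no unlearning), while the omission estimate receives a negative error on delivery and a positive error on omission.

```python
import random
import statistics


def simulate(beta, n_trials=2000, p_omit=0.5, seed=0):
    """Trial-level sketch of a magnitude and an omission estimate.

    beta is a hypothetical learning rate; p_omit is the true omission
    probability of a reinforcer of magnitude 1.
    """
    rng = random.Random(seed)
    magnitude, omission = 0.0, 0.0
    history = []
    for _ in range(n_trials):
        reinforced = rng.random() >= p_omit
        if reinforced:
            # Magnitude learns only from delivered reinforcers (no unlearning).
            magnitude += beta * (1.0 - magnitude)
        # Omission error: negative when reinforced, positive when omitted.
        target = 0.0 if reinforced else 1.0
        omission += beta * (target - omission)
        history.append(omission)
    return magnitude, history


for beta in (0.02, 0.2):
    magnitude, history = simulate(beta)
    tail = history[len(history) // 2:]  # discard the initial learning phase
    print(beta, round(magnitude, 3),
          round(statistics.mean(tail), 3),
          round(statistics.pstdev(tail), 3))
```

Under these assumptions, the omission estimate settles near 0.5 and fluctuates less for the smaller learning rate, while the magnitude estimate approaches 1.0 regardless of omissions.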