<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">645516</article-id>
<article-id pub-id-type="doi">10.3389/frai.2021.645516</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Developing a Conversational Agent&#x2019;s Capability to Identify Structural Wrongness in Arguments Based on Toulmin&#x2019;s Model of Arguments</article-title>
<alt-title alt-title-type="left-running-head">Mirzababaei and Pammer-Schindler</alt-title>
<alt-title alt-title-type="right-running-head">Identifying Structural Wrongness in Arguments</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Mirzababaei</surname>
<given-names>Behzad</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<xref ref-type="fn" rid="fn1">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1028292/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Pammer-Schindler</surname>
<given-names>Viktoria</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="fn" rid="fn1">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/830956/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<label>
<sup>1</sup>
</label>Know-Center GmbH, <addr-line>Graz</addr-line>, <country>Austria</country>
</aff>
<aff id="aff2">
<label>
<sup>2</sup>
</label>Institute for Interactive Systems and Data Science, Graz University of Technology, <addr-line>Graz</addr-line>, <country>Austria</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/644678/overview">Marcus Specht</ext-link>, Delft University of Technology, Netherlands</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/230818/overview">Brian Riordan</ext-link>, Educational Testing Service, United&#x20;States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1366524/overview">Alaa Alslaity</ext-link>, Dalhousie University, Canada</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Behzad Mirzababaei, <email>bmirzababaei@know-center.at</email>
</corresp>
<fn fn-type="equal" id="fn1">
<label>
<sup>&#x2020;</sup>
</label>
<p>These authors have contributed equally to this&#x20;work</p>
</fn>
<fn fn-type="other">
<p>This article was submitted to AI for Human Learning and Behavior Change, a section of the journal Frontiers in Artificial Intelligence</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>11</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>4</volume>
<elocation-id>645516</elocation-id>
<history>
<date date-type="received">
<day>23</day>
<month>12</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>10</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Mirzababaei and Pammer-Schindler.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Mirzababaei and Pammer-Schindler</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>This article discusses the usefulness of Toulmin&#x2019;s model of arguments for structuring an assessment of different types of wrongness in an argument. We discuss the usability of the model within a conversational agent that aims to support users in developing a good argument. Within the article, we present a study and the development of classifiers that identify the existence of the structural components of a good argument, namely a claim, a warrant (underlying understanding), and evidence. Based on a dataset (three sub-datasets with 100, 1,026, and 211 responses, respectively) in which users argue about the intelligence or non-intelligence of entities, we have developed classifiers for these components: The existence and direction (positive/negative) of claims can be detected with a weighted average F1 score over all classes (positive/negative/unknown) of 0.91. The existence of a warrant (with warrant/without warrant) can be detected with a weighted F1 score over all classes of 0.88. The existence of evidence (with evidence/without evidence) can be detected with a weighted average F1 score of 0.80. We argue that these scores are high enough to be of use within a conditional dialogue structure based on Bloom&#x2019;s taxonomy of learning, and we show by argument an example conditional dialogue structure that allows us to conduct coherent learning conversations. While our experiments show how Toulmin&#x2019;s model of arguments can be used to identify structural problems with argumentation, we also discuss how the model could be used in conjunction with a content-wise assessment of correctness, especially of the evidence component, to identify more complex types of wrongness in arguments, where argument components are not well aligned. Owing to progress in argument mining and conversational agents, a next challenge could be developing agents that support learning argumentation. Such agents could identify more complex types of wrongness in arguments that result from wrong connections between argumentation components.</p>
</abstract>
<kwd-group>
<kwd>Toulmin&#x2019;s model of argument</kwd>
<kwd>argument mining</kwd>
<kwd>argument quality detection</kwd>
<kwd>educational technology</kwd>
<kwd>educational conversational agent</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Imagine an intelligent entity with whom a learner can discuss definitions of core concepts in a learning domain, moving from simply checking back whether the learner&#x2019;s memory of concepts and abstract understanding of concepts is correct, toward discussing increasingly complex application concepts. This, in a nutshell, is what good teachers do; and much research in artificial intelligence for education has gone into developing computational systems that are able to, at least partially, fulfill some of the functions that (good) human tutors take on (<xref ref-type="bibr" rid="B33">Koedinger et&#x20;al., 1997</xref>; <xref ref-type="bibr" rid="B22">Gertner and Kurt, 2000</xref>).</p>
<p>In the above description, the tutoring conversation(s) first focuses on reviewing knowledge, and then increasingly on comprehension and application of knowledge to concrete examples. Such a procedure follows the revised version (<xref ref-type="bibr" rid="B3">Anderson et&#x20;al., 2001</xref>) of the taxonomy of <xref ref-type="bibr" rid="B4">Bloom et&#x20;al. (1956)</xref>. Bloom&#x2019;s taxonomy is a hierarchical categorization of educational goals. Essentially, it proposes to describe the different ways in which one can know and learn about a learning subject. Additionally, it proposes a hierarchy in the sense of stating which steps need to be taken before others. This makes it suitable for designing an intelligent tutor, in the sense of providing the tutor with a didactical structure along which to proceed. In this taxonomy, remembering, understanding, and applying are proposed as the first three types of learning with respect to knowledge that should be learned. This taxonomy can hence be understood as the design rationale for our conversational agent&#x2019;s dialogue structure.</p>
<p>In our overarching research, we are working on a conversational agent with whom one can discuss what intelligence is, and in what sense specific entities that can be encountered in real life are intelligent or not. The choice of topic&#x2014;discussing in what sense an entity is intelligent&#x2014;has been made against the background of understanding the development of AI literacy as important in a society pervaded by increasingly powerful technology that is based on data analytics and other methods from AI (<xref ref-type="bibr" rid="B40">Long and Magerko, 2020</xref>). One puzzle piece in this is understanding what AI is (ibid); as a precursor and surrounding discussion, we see the question of what is meant by intelligence, and, more specifically, in what sense different entities can be understood as intelligent.</p>
<p>In this article, we focus on the part of the tutorial conversation where the student is asked to apply agreed-upon definitions of intelligence to a concrete (type of) entity, such as &#x201c;a cat,&#x201d; &#x201c;a chair,&#x201d; or &#x201c;a self-driving car.&#x201d; In the ensuing discussion, the conversational agent has the role of a tutor who develops the student&#x2019;s argumentation into a reasonable and clear argument. Such an agent needs to assess both structure and content of the argument.</p>
<p>For assessing the content-wise correctness of unconstrained answers in conversational educational systems, approaches such as comparing user responses with predefined correct answers exist (<xref ref-type="bibr" rid="B26">Graesser et&#x20;al., 2000</xref>; <xref ref-type="bibr" rid="B10">Cai et&#x20;al., 2020</xref>). In parallel, research on argument mining has worked on identifying argumentative parts in longer texts (<xref ref-type="bibr" rid="B44">Moens et&#x20;al., 2007</xref>; <xref ref-type="bibr" rid="B62">Stab and Gurevych, 2014</xref>, <xref ref-type="bibr" rid="B63">2017</xref>). In complement to such prior research, this work addresses the challenge of understanding the structure of a given argument, so that a (conversational) intelligent tutor can give specific feedback on which elements of an argument are missing. To achieve this goal, we investigate the suitability of Toulmin&#x2019;s model of what components a reasonable argument has and should have (<xref ref-type="bibr" rid="B68">Toulmin, 2003</xref>) as a conceptual basis for a computational model of argument quality. This model has already been used in various ways, for example, in the field of computational linguistics (<xref ref-type="bibr" rid="B27">Habernal and Gurevych, 2017</xref>), outside the field (<xref ref-type="bibr" rid="B61">Simosi, 2003</xref>), and in computer-supported learning (<xref ref-type="bibr" rid="B18">Erduran et&#x20;al., 2004</xref>; <xref ref-type="bibr" rid="B64">Stegmann et&#x20;al., 2012</xref>; <xref ref-type="bibr" rid="B21">Garcia-Mila et&#x20;al., 2013</xref>).</p>
<p>The goal of argumentative conversational agents can be to persuade users of a specific topic or idea (<xref ref-type="bibr" rid="B67">Toniuc and Groza, 2017</xref>; <xref ref-type="bibr" rid="B11">Chalaguine and Anthony, 2020</xref>) or simply to convey information by offering arguments that keep the dialogue comprehensive and meaningful (<xref ref-type="bibr" rid="B36">Le et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B52">Rakshit et&#x20;al., 2019</xref>). Neither goal focuses on the educational aspect of argumentation.</p>
<p>One aspect that is missing here is the user&#x2019;s argumentation, or how they learn to argue. Our focus in this work is on analyzing and giving feedback on the argument structure of the human user, which is novel in comparison with other works that emphasize the retrieval of suitable (counter-)arguments within a conversation over feedback or persuasion. Furthermore, in this work, we used Toulmin&#x2019;s model in an educational conversational agent to teach how to argue and what a good argument should look&#x20;like.</p>
<p>This article is organized as follows. In <italic>Background Knowledge and Related Work</italic>, we review ongoing research on chatbots, especially in the field of education, and on argumentation mining, and lay out Toulmin&#x2019;s model of argument as background for our work. In <italic>Research Questions</italic>, we concretize the research questions that we ask and answer in this work. In <italic>Methodology</italic>, we describe the method used to answer the research questions, including data collection, annotation, inter-rater agreement, data preprocessing, feature selection, overarching model development, and the evaluation process. In <italic>Results</italic>, we describe results in line with the research questions from <italic>Research Questions</italic>, and we conclude the article with a discussion in <italic>Discussion and Conclusion</italic>.</p>
</sec>
<sec id="s2">
<title>2 Background Knowledge and Related Work</title>
<sec id="s2-1">
<title>2.1 Toulmin&#x2019;s Model of Argument</title>
<p>In this work, for argument classification, that is, identifying the components of arguments, we used Toulmin&#x2019;s (2003) model of argument. Toulmin&#x2019;s model, which originates from a philosophical view, is essentially a structure for analyzing arguments. Based on Toulmin&#x2019;s conceptual schema, an argument can be broken into six components: a claim, evidence/data/observation/fact/ground, a warrant, qualifiers, a rebuttal, and backing. A claim is a conclusion whose validity must be proven. Evidence is a statement or a piece of knowledge that is used to prove the claim. The connection between the claim and the evidence is established by a warrant. A qualifier is a word or a phrase that shows the certainty of the claim, for instance, &#x201c;probably&#x201d; or &#x201c;completely.&#x201d; A rebuttal can be considered another valid view of the claim. Finally, backing refers to support for the warrant, especially when the warrant is only mentioned implicitly. Although the model comprises six parts, its main components are the claim, the warrant, and the fact or evidence (<xref ref-type="bibr" rid="B68">Toulmin, 2003</xref>). In <xref ref-type="fig" rid="F1">Figure&#x20;1</xref>, the relations among the components are illustrated. The warrant connects the claim and the evidence, while the rebuttal and the backing relate to the claim and the warrant, respectively.</p>
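<p>As an illustration of the three core components, a conversational agent might represent an argument roughly as in the following sketch. This is a hypothetical data structure invented for illustration, not the classifiers developed in this article:</p>

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical representation of the three core Toulmin components.
@dataclass
class ToulminArgument:
    claim: Optional[str] = None      # conclusion whose validity must be proven
    warrant: Optional[str] = None    # connection between claim and evidence
    evidence: Optional[str] = None   # fact or knowledge used to prove the claim

    def missing_components(self):
        """Return the names of core components that are absent."""
        return [name for name, value in
                [("claim", self.claim), ("warrant", self.warrant),
                 ("evidence", self.evidence)] if value is None]

arg = ToulminArgument(claim="A cat is intelligent",
                      evidence="Cats learn to open doors by observation")
print(arg.missing_components())  # -> ['warrant']
```

<p>A tutoring agent with access to such a structure could ask the learner specifically for the missing warrant.</p>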
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>The components of arguments based on Toulmin&#x2019;s scheme (<xref ref-type="bibr" rid="B68">Toulmin, 2003</xref>).</p>
</caption>
<graphic xlink:href="frai-04-645516-g001.tif"/>
</fig>
<p>Toulmin&#x2019;s model of argument has also been used successfully in educational contexts. In <xref ref-type="bibr" rid="B60">Simon (2008)</xref>, the author supported teachers in teaching and evaluating argumentation in science contexts by using a program through which the teachers learn how to identify Toulmin&#x2019;s components in discussions and how to teach students to argue. In the program, the teachers identified the components of arguments in a list of arguments. The author indicated that using Toulmin&#x2019;s model of arguments as a methodological framework could be useful for analyzing argumentation in classrooms. Toulmin&#x2019;s model has also been used in computational argumentation; for instance, <xref ref-type="bibr" rid="B27">Habernal and Gurevych (2017)</xref> used machine learning approaches to identify Toulmin&#x2019;s components of arguments in essays.</p>
<p>In this work, we focused on identifying the core components: claims, warrants, and evidence. This investigation was done in the context of a conversational agent with whom one can discuss the concept of intelligence. Conceptually, similar conversations can be carried out with respect to concepts other than &#x201c;intelligence.&#x201d;</p>
</sec>
<sec id="s2-2">
<title>2.2 Conversational Agents</title>
<p>Conversational agents are now studied in different application domains, such as for administering surveys (<xref ref-type="bibr" rid="B32">Kim et&#x20;al., 2019</xref>), for healthcare (<xref ref-type="bibr" rid="B46">M&#xfc;ller et&#x20;al., 2019</xref>), and of course, for learning (<italic>Conversational Agents in Education</italic>), and argumentation support (<italic>Conversational Agents in Argumentation</italic>).</p>
<p>Technically, conversational agents can be classified into two&#x20;different types: retrieval and generative agents. In retrieval agents, for each turn in a conversation, a list of possible responses is considered and the most appropriate response is selected by&#x20;different techniques such as information retrieval or different kinds of similarity (<xref ref-type="bibr" rid="B36">Le et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B52">Rakshit et&#x20;al., 2019</xref>). Consequently, such conversational agents rely on predefined conditional dialogue structures, where they only have the freedom to decide between different branches (notwithstanding the potential complexity, and even possible continual growth, of the dialogue structure). Generative conversational agents generate responses from scratch, at the time of the conversation. The responses can be generated by different approaches; for instance, in <xref ref-type="bibr" rid="B9">Cahn (2017)</xref>, a parallel corpus from an argumentative dialogue was used to train a statistical machine translation model by which users&#x2019; utterances were translated into the chatbot&#x2019;s responses. The work we present here, like most other related work on conversational agents in education and argumentation (see below), falls into the category of retrieval chatbots.</p>
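<p>The retrieval idea described above can be sketched minimally as follows. The candidate responses and the bag-of-words cosine similarity are invented for illustration; the cited systems use more sophisticated matching:</p>

```python
import math
import re
from collections import Counter

# Hypothetical pool of candidate responses for a retrieval-based tutor.
CANDIDATES = [
    "Could you explain why you think this entity is intelligent?",
    "What evidence supports your claim?",
    "Let us move on to the next entity.",
]

def bag_of_words(text):
    """Lowercased word counts; punctuation is stripped."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def select_response(user_utterance, candidates=CANDIDATES):
    """Pick the candidate most similar to the user's utterance."""
    user = bag_of_words(user_utterance)
    return max(candidates, key=lambda c: cosine(bag_of_words(c), user))

print(select_response("Here is my evidence for the claim"))
```

<p>The agent never composes text; it only chooses among predefined branches, which is exactly the constraint discussed above.</p>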
<sec id="s2-2-1">
<title>2.2.1 Conversational Agents in Education</title>
<p>The vision of computational systems that resemble human tutors is old, and substantial research has been and is being carried out on intelligent tutoring systems in many forms (<xref ref-type="bibr" rid="B17">Emran and Shaalan, 2014</xref>; <xref ref-type="bibr" rid="B28">Hussain et&#x20;al., 2019</xref>; <xref ref-type="bibr" rid="B45">Montenegro et&#x20;al., 2019</xref>). As both machine learning in general and natural language processing technologies in particular progress, conversational interaction with intelligent tutoring systems has increasingly come into focus (<xref ref-type="bibr" rid="B29">Io and Lee, 2017</xref>). Expectations toward conversational agents as supporting learning are high; typical expectations are that they address students&#x2019; sociocultural needs and engender high engagement and motivation (cp. <xref ref-type="bibr" rid="B70">Veletsianos and Russell, 2014</xref>), because they offer interaction in natural language and thereby give an increased sense of talking to a social&#x20;other.</p>
<p>In educational conversations, retrieval-based conversational agents, which have a limited set of responses, are typically used. For example, in <xref ref-type="bibr" rid="B26">Graesser et&#x20;al. (2000)</xref>, for each question that their agent asks, there is a list of expectations or goals, good answers, and bad answers. The agent uses the latent semantic analysis (LSA) algorithm to calculate the similarity of users&#x2019; responses to the list of good and bad answers. LSA is a natural language processing technique for analyzing the relations between the words of documents and the different topics of documents. In <xref ref-type="bibr" rid="B79">Wolfbauer et&#x20;al. (2020)</xref>, which describes a reflective conversational agent for apprentices, the responses of apprentices are matched to different concepts, and the agent&#x2019;s response is then selected from the pool of responses related to the concept.</p>
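<p>The LSA-style matching described above can be sketched roughly as follows. This is an illustrative toy example, not AutoTutor&#x2019;s implementation: the answer texts are invented, and a plain SVD on a tiny count matrix stands in for LSA on a real corpus:</p>

```python
import numpy as np

# Hypothetical good and bad answers about the intelligence of a cat.
docs = [
    "a cat can learn and adapt to new situations",  # good answer
    "a cat is intelligent because it is furry",     # bad answer
]
user = "cats adapt and learn new tricks"

# Build a small document-term count matrix over the shared vocabulary.
vocab = sorted({w for text in docs + [user] for w in text.split()})

def count_vector(text):
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

X = np.stack([count_vector(t) for t in docs + [user]])  # documents x terms

# Reduce to a rank-2 latent space via SVD (the core of LSA).
U, s, _ = np.linalg.svd(X, full_matrices=False)
latent = (U * s)[:, :2]  # each document as a point in the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity of the user response to each predefined answer.
sims = [cosine(latent[-1], latent[i]) for i in range(len(docs))]
```

<p>Here the user response lands closer to the good answer than to the bad one, so an agent following this scheme would take the &#x201c;expectation met&#x201d; branch of its dialogue.</p>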
<p>In retrieval-based conversational agents, users&#x2019; responses are typically analyzed based on content and language markers. The analysis of content can be based on a lexicon or on patterns; for instance, <xref ref-type="bibr" rid="B79">Wolfbauer et&#x20;al. (2020)</xref> used a dictionary-based approach and regular expressions to classify apprentices&#x2019; utterances. Also, <xref ref-type="bibr" rid="B26">Graesser et&#x20;al. (2000)</xref> analyzed users&#x2019; messages with language modules built around a large lexicon (about 10,000 words). Each entry of the lexicon contained a word with its alternative syntactic classes and its frequency of usage in the English language. In addition to the lexicon, <xref ref-type="bibr" rid="B26">Graesser et&#x20;al. (2000)</xref> classified learners&#x2019; content into five classes: WH-questions, YES/NO questions, assertions, directives, and short responses. The chatbot A.L.I.C.E. used AIML, the Artificial Intelligence Markup Language, to match user inputs to different categories and then find the most appropriate response.</p>
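<p>A dictionary-and-regular-expression classification of utterances, in the spirit of the pattern-based approaches mentioned above, might look like the following minimal sketch. The concepts and keyword patterns are invented for illustration, not taken from the cited systems:</p>

```python
import re

# Hypothetical mapping from reflection concepts to keyword patterns.
CONCEPT_PATTERNS = {
    "task_description": re.compile(r"\b(today I|we had to|my task)\b", re.I),
    "emotion":          re.compile(r"\b(happy|frustrated|proud|bored)\b", re.I),
    "learning":         re.compile(r"\b(I learned|I now know|next time)\b", re.I),
}

def classify(utterance):
    """Return the first concept whose pattern matches the utterance."""
    for concept, pattern in CONCEPT_PATTERNS.items():
        if pattern.search(utterance):
            return concept
    return "unknown"

print(classify("Today I had to solder a circuit board"))  # -> task_description
```

<p>An agent using such a classifier would then draw its next turn from the pool of responses associated with the matched concept.</p>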
<p>The subjects of the dialogues, that is, the topics that educational agents focus on, vary widely: mathematics (<xref ref-type="bibr" rid="B42">Melis and Siekmann, 2004</xref>; <xref ref-type="bibr" rid="B57">Sabo et&#x20;al., 2013</xref>; <xref ref-type="bibr" rid="B2">Aguiar et&#x20;al., 2014</xref>; <xref ref-type="bibr" rid="B80">Zhang and Jia, 2017</xref>), physics (<xref ref-type="bibr" rid="B69">VanLehn et&#x20;al., 2002</xref>; <xref ref-type="bibr" rid="B49">P&#xe9;rez-Mar&#xed;n and Boza, 2013</xref>), medicine (<xref ref-type="bibr" rid="B20">Frize and Frasson, 2000</xref>; <xref ref-type="bibr" rid="B66">Suebnukarn and Peter, 2004</xref>; <xref ref-type="bibr" rid="B41">Martin et&#x20;al., 2009</xref>), and computer science (<xref ref-type="bibr" rid="B73">Wallace, 1995</xref>; <xref ref-type="bibr" rid="B78">Weerasinghe and Mitrovic, 2011</xref>; <xref ref-type="bibr" rid="B34">Koedinger et&#x20;al., 2013</xref>; <xref ref-type="bibr" rid="B76">Wang et&#x20;al., 2015</xref>). In these examples, conversational agents are used to support learning about a particular subject, and the key element in all these agents is their domain knowledge (implemented by different means).</p>
<p>The agents also communicate with a wide range of learners, for instance K-12 students, i.e., pupils from kindergarten to 12th grade (<xref ref-type="bibr" rid="B73">Wallace, 1995</xref>; <xref ref-type="bibr" rid="B15">Dzikovska et&#x20;al., 2010</xref>), university students (<xref ref-type="bibr" rid="B69">VanLehn et&#x20;al., 2002</xref>; <xref ref-type="bibr" rid="B66">Suebnukarn and Peter, 2004</xref>; <xref ref-type="bibr" rid="B78">Weerasinghe and Mitrovic, 2011</xref>), or apprentices (<xref ref-type="bibr" rid="B79">Wolfbauer et&#x20;al., 2020</xref>).</p>
<p>In <xref ref-type="bibr" rid="B25">Graesser et&#x20;al. (1999</xref>, <xref ref-type="bibr" rid="B24">2001</xref>, <xref ref-type="bibr" rid="B26">2000</xref>), a conversational agent called AutoTutor is studied. AutoTutor aims to support college students in learning basic knowledge in computer science, such as hardware or operating systems. AutoTutor has predefined expectations for users&#x2019; responses. It uses LSA to match the students&#x2019; responses to the expectations, and depending on which expectation is met (i.e., to which predefined answer the student&#x2019;s response is most similar according to LSA), AutoTutor selects the next step of the conversation.</p>
<p>In the present work, similar to AutoTutor (<xref ref-type="bibr" rid="B25">Graesser et&#x20;al., 1999</xref>; <xref ref-type="bibr" rid="B24">2001</xref>; <xref ref-type="bibr" rid="B26">2000</xref>), we have expectations that we look for in users&#x2019; answers. However, our focus is not on analyzing and teaching content, but on analyzing and giving feedback on argument structure. Also, instead of LSA or other similarity measures, we use classifiers to predict the next step in the conversation. This is novel with respect to the above-discussed literature, which focuses on teaching content (<xref ref-type="bibr" rid="B2">Aguiar et&#x20;al., 2014</xref>; <xref ref-type="bibr" rid="B80">Zhang and Jia, 2017</xref>) or on the overall structure of a reflective tutoring conversation (<xref ref-type="bibr" rid="B79">Wolfbauer et&#x20;al., 2020</xref>). Of course, in any fully realized conversational agent, both elements (the capability to analyze and react to structure, and the capability to analyze and react to content) must be present.</p>
</sec>
<sec id="s2-2-2">
<title>2.2.2 Conversational Agents in Argumentation</title>
<p>Beyond education, conversational agents have also been studied as discussion partners for general argumentation. In this case, the conversational agent does not have an educational goal; having an argumentative dialogue is the goal (<xref ref-type="bibr" rid="B36">Le et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B52">Rakshit et&#x20;al., 2019</xref>). For instance, in <xref ref-type="bibr" rid="B52">Rakshit et&#x20;al. (2019)</xref>, a retrieval-based agent named Debbie is presented. Their agent talked to its audience about three topics: the death penalty, gun control, and gay marriage. The main goal of the agent was to keep a meaningful conversation going until it was ended by the&#x20;user.</p>
<p>Besides the mentioned work, <xref ref-type="bibr" rid="B11">Chalaguine and Anthony (2020)</xref> created a conversational agent that tried to persuade its audience regarding a specific topic, meat consumption. The agent selected an argument from its knowledge base that related to the audience&#x2019;s concerns in order to increase the chance of persuasion. The agent&#x2019;s knowledge, collected by a crowdsourcing method, was a list of arguments and counterarguments about the&#x20;topic.</p>
<p>Other conversational agents try to inform or persuade audiences about different controversial topics, such as global warming (<xref ref-type="bibr" rid="B67">Toniuc and Groza, 2017</xref>), where the agent had a conversation about climate change and explained issues related to global warming.</p>
<p>In general, the available argumentative conversational agents either persuade (<xref ref-type="bibr" rid="B67">Toniuc and Groza, 2017</xref>; <xref ref-type="bibr" rid="B11">Chalaguine and Anthony, 2020</xref>) or simply convey information by offering arguments that keep the dialogue comprehensive and meaningful (<xref ref-type="bibr" rid="B36">Le et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B52">Rakshit et&#x20;al., 2019</xref>).</p>
<p>One aspect that is missing here is the user&#x2019;s argumentation, or how they argue. Our focus in this work is on analyzing and giving feedback on the argument structure of the human user, which is novel with respect to the above-discussed literature, which emphasizes the retrieval of suitable (counter-)arguments within a conversation over feedback. Again, of course, in any fully realized educational conversational agent, both elements (specific feedback on the learner&#x2019;s argument structure, and presentation of similar or different arguments as a basis for further developing an argument content-wise) must be present.</p>
</sec>
</sec>
<sec id="s2-3">
<title>2.3 Argument Mining</title>
<p>Argument mining, or argumentation mining, is a research area in natural language processing. It refers to the automatic identification and understanding of arguments in text by machines and is one of the challenging tasks in natural language processing. Based on <xref ref-type="bibr" rid="B75">Wambsganss et&#x20;al. (2020)</xref>, there are three levels in argument mining: argument identification, discourse analysis, and argument classification. The machine learning approaches applied at these levels can be supervised, which requires an annotated dataset, or unsupervised, which eliminates the need for annotated data. In the rest of the section, we first focus on supervised learning approaches and then on unsupervised learning approaches.</p>
<p>At the first level, argument identification, the main goal is extracting or detecting the parts of documents that contain an argument; in other words, the parts are classified into argumentative and nonargumentative (<xref ref-type="bibr" rid="B44">Moens et&#x20;al., 2007</xref>; <xref ref-type="bibr" rid="B62">Stab and Gurevych, 2014</xref>; <xref ref-type="bibr" rid="B51">Poudyal et&#x20;al., 2016</xref>; <xref ref-type="bibr" rid="B81">Zhang et&#x20;al., 2016</xref>). For instance, in <xref ref-type="bibr" rid="B81">Zhang et&#x20;al. (2016)</xref>, the main research goal was to design a model that can detect argumentative sentences in online discussions. <xref ref-type="bibr" rid="B51">Poudyal et&#x20;al. (2016)</xref>, in contrast, focused on case law, which in terms of formality is completely different from online discussions. In another study, <xref ref-type="bibr" rid="B14">Dusmanu et&#x20;al. (2017)</xref> tackled the first level of argument mining by identifying argumentative sentences in tweets. After detecting argumentative sentences, they classified them as factual information or opinions using supervised classifiers. Finally, the source of the factual information extracted in the previous step was identified.</p>
<p>The second level of argument mining is discourse analysis, which refers to identifying the relations, such as support or attack, among the claims and premises in documents (<xref ref-type="bibr" rid="B47">Palau and Moens, 2009</xref>; <xref ref-type="bibr" rid="B7">Cabrio and Villata, 2013</xref>; <xref ref-type="bibr" rid="B5">Boltu&#x17e;i&#x107; and &#x160;najder, 2014</xref>). Similar to <xref ref-type="bibr" rid="B81">Zhang et&#x20;al. (2016)</xref>, <xref ref-type="bibr" rid="B5">Boltu&#x17e;i&#x107; and &#x160;najder (2014)</xref> dealt with online discussions. They tried to match users&#x2019; comments to a predefined set of topics, which can be either supported or not supported.</p>
<p>In the last level, argument classification refers to classify the components of arguments (<xref ref-type="bibr" rid="B43">Mochales and Moens, 2011</xref>; <xref ref-type="bibr" rid="B54">Rooney et&#x20;al., 2012</xref>; <xref ref-type="bibr" rid="B62">Stab and Gurevych, 2014</xref>). In this case, argumentative parts can be classified into different classes, such as claims and premises (<xref ref-type="bibr" rid="B43">Mochales and Moens, 2011</xref>; <xref ref-type="bibr" rid="B54">Rooney et&#x20;al., 2012</xref>; <xref ref-type="bibr" rid="B62">Stab and Gurevych, 2014</xref>) or claim, backing, rebuttal, premise, and refutation based on <xref ref-type="bibr" rid="B68">Toulmin&#x2019;s (2003)</xref> model of argument (<xref ref-type="bibr" rid="B27">Habernal and Gurevych, 2017</xref>). For example, <xref ref-type="bibr" rid="B27">Habernal and Gurevych (2017)</xref> proposed a sequence labeling approach in which many different types of features such as lexical, structural, morphological, semantic, and embedding features were used to vectorize sentences. The authors used SVM<sup>hmm</sup> (<xref ref-type="bibr" rid="B31">Joachims et&#x20;al., 2009</xref>) which is an implementation of Support Vector Machines specifically used for sequence labeling<xref ref-type="fn" rid="fn2">
<sup>1</sup>
</xref>. The authors annotated the documents based on the BIO encoding, which is the minimal encoding for distinguishing the boundaries of argumentative components. It works as follows: the first word of an argumentative component is labeled with <italic>B</italic>, marking the beginning of the component; the label <italic>I</italic> is used for the rest of the words in the component; and all tokens in nonargumentative parts are labeled with&#x20;<italic>O</italic>.</p>
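<p>As an illustration of BIO encoding (the sentence and component boundaries below are invented, not taken from the cited work):</p>

```python
# Minimal illustration of BIO encoding for argumentative components.
# The sentence and span boundaries are invented for illustration only.
tokens = ["I", "think", "cats", "are", "intelligent",
          "because", "they", "learn", "from", "experience", "."]

# Suppose tokens 2..9 (inclusive) form one argumentative component.
component_start, component_end = 2, 9

labels = []
for i, tok in enumerate(tokens):
    if i == component_start:
        labels.append("B")   # first token of the component
    elif component_start < i <= component_end:
        labels.append("I")   # remaining tokens inside the component
    else:
        labels.append("O")   # tokens outside any argumentative component

print(list(zip(tokens, labels)))
```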
<p>There are also works that tackled all the levels mentioned by <xref ref-type="bibr" rid="B75">Wambsganss et&#x20;al. (2020)</xref>. In <xref ref-type="bibr" rid="B77">Wang et&#x20;al. (2020</xref>), the authors dealt with online discussions about usability in the issue-tracking systems of open-source projects. Since a large number of issues and comments with different perspectives are posted daily, project contributors face a major challenge in digesting the rich information embedded in the issue-tracking systems to determine the actual user needs and consolidate the diverse feedback. Thus, the authors&#x2019; ultimate goal was to make usability issues more readable. To do this, they first discriminated argumentative comments and then classified them along two independent dimensions, components and standpoints.</p>
<p>Technologically, a range of classical machine learning methods has been applied to address the different levels of argument mining. For instance, in <xref ref-type="bibr" rid="B44">Moens et&#x20;al. (2007</xref>), a multinomial Naive Bayes classifier and a maximum entropy model were used as classifiers for detecting arguments in legal texts. The authors converted the sentences to feature vectors containing unigrams, bigrams, trigrams, verbs, argumentative keywords such as &#x201c;but,&#x201d; &#x201c;consequently,&#x201d; and &#x201c;because of,&#x201d; and statistical features, namely average word length and the number of punctuation marks. In <xref ref-type="bibr" rid="B23">Goudas et&#x20;al. (2014</xref>), the authors studied the applicability of several machine learning classifiers to social media text in two steps: first, they identified argumentative sentences using different machine learning techniques such as Logistic Regression, Random Forest, and Support Vector Machines; second, using Conditional Random Fields, they detected the boundaries of the premises in argumentative sentences. Other machine learning methods such as support vector machines (<xref ref-type="bibr" rid="B54">Rooney et&#x20;al., 2012</xref>; <xref ref-type="bibr" rid="B58">Sardianos et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B27">Habernal and Gurevych, 2017</xref>), logistic regression (<xref ref-type="bibr" rid="B38">Levy et&#x20;al., 2014</xref>; <xref ref-type="bibr" rid="B53">Rinott et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B14">Dusmanu et&#x20;al., 2017</xref>), random forests (<xref ref-type="bibr" rid="B16">Eckle-Kohler et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B14">Dusmanu et&#x20;al., 2017</xref>), and conditional random fields (<xref ref-type="bibr" rid="B23">Goudas et&#x20;al., 2014</xref>; <xref ref-type="bibr" rid="B58">Sardianos et&#x20;al., 2015</xref>) have also been used in argument mining.</p>
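<p>Feature vectors in the spirit of the Moens et al. (2007) description above can be sketched as follows; the feature names and the short keyword list are illustrative assumptions, not the original feature set:</p>

```python
# Hedged sketch of sentence-level features: n-gram counts, argumentative
# keywords, and simple statistics. The keyword list is a small placeholder.
import re

KEYWORDS = ["but", "consequently", "because"]

def sentence_features(sentence):
    tokens = re.findall(r"[a-z']+", sentence.lower())
    feats = {}
    # unigram count features (a stand-in for full n-gram counts)
    for tok in tokens:
        feats[f"uni={tok}"] = feats.get(f"uni={tok}", 0) + 1
    # argumentative keyword indicators
    for kw in KEYWORDS:
        feats[f"kw={kw}"] = int(kw in tokens)
    # statistical features: average word length, punctuation count
    feats["avg_word_len"] = sum(map(len, tokens)) / max(len(tokens), 1)
    feats["n_punct"] = len(re.findall(r"[.,;:!?]", sentence))
    return feats

f = sentence_features("Cats are intelligent, because they learn.")
print(f["kw=because"], f["n_punct"])
```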
<p>In <xref ref-type="bibr" rid="B75">Wambsganss et&#x20;al. (2020</xref>), an adaptive tool named AL, through which students received feedback on the argumentative structure of their written text, was designed, built, and evaluated. The authors addressed two research questions, concerning the acceptance of AL and how effective it was in helping users write more persuasive texts. For the latter research question, they first created two classifiers that identified argumentative sentences and the relations among them, supported and non-supported. Second, they evaluated the texts by measuring readability, coherence, and persuasiveness. By presenting these scores and their definitions, users understood how to improve their&#x20;texts.</p>
<p>In <xref ref-type="bibr" rid="B59">Shnarch et&#x20;al.</xref> <xref ref-type="bibr" rid="B59">(2017</xref>), the authors presented an algorithm named GrASP (Greedy Augmented Sequential Patterns) for the weak labeling of argumentative components using multilayer patterns. The algorithm produces highly indicative and expressive patterns by augmenting input n-grams with various layers of attributes, such as named entities, domain knowledge, and hypernyms. By considering many aspects of each n-gram, GrASP can identify the most distinguishing attributes and iteratively extend the extracted patterns using information from the different attributes. The greedy part of the algorithm refers to the end of each iteration, in which only the top k predictive patterns are kept for the next iteration.</p>
<p>Besides supervised machine learning approaches that rely on annotated training data, there are unsupervised approaches that eliminate the need for training data. For instance, <xref ref-type="bibr" rid="B50">Persing and Ng (2020)</xref> developed a novel unsupervised approach to end-to-end argument mining in the persuasive student essays collected and annotated by <xref ref-type="bibr" rid="B63">Stab and Gurevych (2017)</xref>. They applied a bootstrapping method starting from a small dataset of arguments, using reliable contextual cues and some simple heuristics, which relied on the number of paragraphs, the location of the sentence, and the context n-grams, for labeling the different components of arguments.</p>
<p>Another unsupervised approach was presented by <xref ref-type="bibr" rid="B19">Ferrara et&#x20;al. (2017)</xref>, based on topic modeling. In their research, they focused on detecting argument units at sentence-level granularity. Their method, named Attraction to Topics (A2T), has two main steps: the first step identifies the argumentative sentences, and the second step classifies the argumentative sentences discovered in the first step according to their role, as major claims (the main standpoint), claims, or premises.</p>
<p>In comparison with the literature discussed above, in this article we work on an argumentative-educational conversational agent that gives feedback on missing core components. The conversational agent tries to teach argumentation instead of persuading users or giving (counter-)arguments based on similarity to continue the conversation. The challenge we address in this article is identifying the core components of arguments based on Toulmin&#x2019;s model of arguments (<xref ref-type="bibr" rid="B68">Toulmin, 2003</xref>), namely claim, warrant, and evidence or grounds. Others have already identified these elements based on traditional machine learning algorithms such as Random Forests (<xref ref-type="bibr" rid="B6">Breiman, 2001</xref>) and SVM (<xref ref-type="bibr" rid="B30">Joachims, 1998</xref>). In line with these authors, we also use traditional ML methods, namely K-Nearest Neighbors, SVM, Decision Trees, Random Forest, and AdaBoost. We explicitly do not use deep learning methods in this work, as we have too little data, and do not use transfer learning, as no suitable models from sufficiently similar problems are available.</p>
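<p>As a minimal sketch (not the study&#x2019;s actual pipeline; the data here are synthetic placeholders), the named classical classifiers can be compared with scikit-learn as follows:</p>

```python
# Minimal sketch comparing the classical classifiers named in the text.
# The dataset is synthetic; in the study, TFIDF vectors of responses are used.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# mean 5-fold cross-validation accuracy per classifier
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```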
</sec>
</sec>
<sec id="s3">
<title>3 Research Questions</title>
<p>In pursuing our overall goal, to study how to enable a conversational agent to identify different types of structural wrongness in an argument, we here investigate the suitability of using Toulmin&#x2019;s model of argument within conversational agents to operationalize what a good structure of an argument is, and subsequently to identify different types of structural wrongness. In the present article, we study a conversational agent with whom one can discuss a single question: Is &#x3c; an entity &#x3e; intelligent, and in what sense? In this domain of discussion, we ask and answer the following three research questions:<list list-type="simple">
<list-item>
<p>&#x2022; RQ1 (overarching): Can Toulmin&#x2019;s model of argument be used to model different types of structural wrongness within conversational agents in the given domain?</p>
</list-item>
<list-item>
<p>&#x2022; RQ2: How well can components of Toulmin&#x2019;s model of argument be identified in the given domain?</p>
</list-item>
<list-item>
<p>&#x2022; RQ3: Can a conditional dialogue structure with conditions based on the existence of components from Toulmin&#x2019;s model of argument lead to coherent conversations in the given domain?</p>
</list-item>
</list>
</p>
<p>Our methodology is as follows:<list list-type="simple">
<list-item>
<p>&#x2022; To develop classifiers that operationalize Toulmin&#x2019;s model of argument, providing evidence for RQ2 (how well can different elements of Toulmin&#x2019;s model of argument be identified) in this case (preparatory work: <italic>Apparatus&#x2014;Educational Scenario &#x201c;Is &#x3c; an entity &#x3e; Intelligent or Not? Why?&#x201d;, Data Annotation, Inter-rater Agreement, Data Processing, and Feature Selection</italic>; classifier development and evaluation in <italic>Results</italic>)</p>
</list-item>
<list-item>
<p>&#x2022; To set up a conditional dialogue structure with conditions based on the existence of arguments following Toulmin&#x2019;s model of argument and show, by example, that it can lead to a coherent conversation (existential proof by example; in answer to&#x20;RQ3).</p>
</list-item>
<list-item>
<p>&#x2022; To discuss the overall suitability of Toulmin&#x2019;s model of argument as a suitable basis for modeling different types of wrongness in conversational agents (RQ1) based on results on the collected dataset.</p>
</list-item>
</list>
</p>
<p>As discussed in the related work and background section above, by answering these research questions we contribute to the existing scientific discourse around conversational agents in education and argument mining: knowledge about how Toulmin&#x2019;s model of argument can be operationalized, and how this operationalization can be used within a conversational agent to allow a coherent conversation that helps users develop a (structurally) good argument. This is useful and novel in complement to existing research on conversational agents that use domain background knowledge to facilitate the acquisition of factual knowledge or to develop argumentation along the dimension of content, and to educational conversational agents that moderate discussions by injecting discussion prompts.</p>
</sec>
<sec id="s4">
<title>4 Methodology</title>
<p>Below we describe the data collection study (<italic>Data Collection</italic>), the educational materials used in the data collection (<italic>Apparatus&#x2014;Educational Scenario &#x201c;Is &#x3c; an entity &#x3e; Intelligent or Not? Why?&#x201d;</italic>), the data annotation process and labels used (<italic>Data Annotation</italic>), the achieved inter-rater agreement as a measure of the quality of annotations and subsequently datasets (<italic>Inter-rater Agreement</italic>), the data preprocessing (<italic>Data Processing</italic>), and finally the feature selection for the three classifiers that aim to identify the existence of a claim, a warrant, and evidence in a given user statement (<italic>Feature Selection</italic>).</p>
<sec id="s4-1">
<title>4.1 Data Collection</title>
<p>To collect data, Amazon Mechanical Turk<xref ref-type="fn" rid="fn3">
<sup>2</sup>
</xref> (MTurk) was used. It is a system for crowdsourcing work and has been used in many academic fields to support research. By using crowdsourcing methods, a large number of diverse arguments can be collected and the data are free from researchers&#x2019; bias (<xref ref-type="bibr" rid="B12">Chalaguine et&#x20;al., 2019</xref>).</p>
<p>The data were collected in three rounds. In each round, essentially the question <italic>&#x201c;Is &#x3c; an entity &#x3e; intelligent or not? Why?&#x201d;</italic> was asked to study participants. The materials prepared for all three rounds are described in <italic>Apparatus&#x2014;Educational Scenario &#x201c;Is &#x3c; an entity &#x3e; Intelligent or Not? Why?&#x201d;</italic>&#x20;below.</p>
<p>To increase the chance of collecting meaningful data without spelling or grammatical errors, we defined some qualification requirements for participants. Participants who wanted to answer the questions were required to be <italic>master</italic> workers, meaning they needed a minimum acceptance rate of 95% in order to qualify. This qualification requirement ensures the high quality of the results. Furthermore, an additional qualification requirement was added: holding an academic degree equal to or higher than a US bachelor&#x2019;s degree. The rationale was to obtain more formal responses with fewer spelling or grammatical errors. The data have been collected in three rounds (see <xref ref-type="table" rid="T1">Table&#x20;1</xref>).</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>MTurk experiments for collecting&#x20;data.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Datasets</th>
<th align="center">Number of collected responses</th>
<th align="center">Qualification requirement</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Dataset 1</td>
<td align="char" char=".">100</td>
<td align="left">&#x2022; HIT Approval Rate (%) &#x2265; 95</td>
</tr>
<tr>
<td rowspan="2" align="left">Dataset 2</td>
<td rowspan="2" align="char" char=".">1,026</td>
<td align="left">&#x2022; HIT Approval Rate (%) &#x2265; 95</td>
</tr>
<tr>
<td align="left">&#x2022; At least US Bachelor&#x2019;s Degree</td>
</tr>
<tr>
<td rowspan="2" align="left">Dataset 3</td>
<td rowspan="2" align="char" char=".">211</td>
<td align="left">&#x2022; HIT Approval Rate (%) &#x2265; 95</td>
</tr>
<tr>
<td align="left">&#x2022; At least US Bachelor&#x2019;s Degree</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As shown in <xref ref-type="table" rid="T1">Table&#x20;1</xref>, in the first pilot study, 100 responses to the question <italic>&#x201c;Is &#x3c; an entity &#x3e; intelligent or not? Why?&#x201d;</italic> were collected, and the only qualification requirement was an approval rate of at least 95%. In the second round, 1,026 responses were collected, with the second qualification requirement added. In the last round, the same qualification requirements as in the second round were used, and 211 new responses were collected to serve as a test set. Overall, 1,335 records were collected. The data from the first two rounds, datasets 1 and 2, are used as training and validation data, and the records of the last round, dataset 3, as test&#x20;data.</p>
</sec>
<sec id="s4-2">
<title>4.2 Apparatus&#x2014;Educational Scenario &#x201c;Is &#x3c; an entity &#x3e; Intelligent or Not? Why?&#x201d;</title>
<p>We prepared the following materials for data collection: five different definitions of intelligence with brief explanations, a list of eight entities, a definition of the properties of a good response, and a few samples of good and bad responses.</p>
<p>The following five definitions were given and explained as follows to study participants: There are plenty of definitions by which something or someone would be called intelligent. In this task, we focus on five of them. We will call an object intelligent if it thinks humanly, acts humanly, thinks rationally, acts rationally; or if it is able to learn from experience to better reach its&#x20;goals.</p>
<p>These definitions were chosen on the background of understanding intelligence as a foundational concept for arguing about capabilities as well as non-capabilities of artificial intelligence. The first four are discussed as having an impact on the discussion around intelligence in relation to the development of artificial intelligence and inspired different directions of artificial intelligence research (cp. <xref ref-type="bibr" rid="B56">Russell and Peter, 2002</xref>). The fifth definition more closely mirrors the understanding of learning in psychology and learning sciences.</p>
<p>Every study participant was asked to decide and argue about the intelligence of one (type of) entity, which was chosen such that in each dataset the following categories are similarly represented: <italic>inanimate objects, plants, animals, AI-enabled technologies</italic>. These categories are ontologically different, general judgments about their intelligence are possible, and we can expect different types of argumentation per category. As a general judgment, inanimate objects can be considered not intelligent according to any definition; plants could, with some difficulty, be argued to be intelligent as a species if evolutionary aspects are put to the forefront; and animals and AI-enabled technologies could, in general, be argued to be intelligent, even though in a differentiated manner.</p>
<p>In dataset 1, these categories were instantiated by: <italic>tables</italic> (inanimate object), <italic>trees</italic> (plants), <italic>cats, fish</italic> (animals), and <italic>Google search engine</italic> (AI-enabled technologies)<italic>.</italic> We collected 100 records for dataset 1 (see <xref ref-type="table" rid="T1">Table&#x20;1</xref>) which means 20 records for each entity.</p>
<p>For datasets 2 and 3, we used two examples per category: <italic>office chairs</italic> and <italic>the New York Statue of Liberty</italic> (inanimate objects), <italic>sunflowers</italic> and <italic>Venus flytraps</italic> (plants), <italic>snakes</italic> and <italic>monkeys</italic> (animals), and <italic>self-driving cars</italic> and <italic>Google search engine</italic> (AI-enabled technologies). We collected 1,000 records for dataset 2 (125 records per entity) and 200 records for dataset 3 (25 records per entity). In the end, we collected slightly more records for datasets 2 and 3: a few answers were too short, so we asked other participants to answer again, which produced the extra&#x20;records.</p>
<p>While collecting the data, it was also explained that a good response should be argumentative; contain a claim, reasoning, and an explanation; have at least 10 words; and be checked again for typos. Furthermore, examples of good and bad responses were illustrated in the explanation. In <xref ref-type="table" rid="T2">Table&#x20;2</xref>, some statistics of the collected data are shown. For datasets 1, 2, and 3, we collected 20, 125, and 25 responses per entity, respectively. Since some of the responses were too short or irrelevant, we did not approve them and asked new participants to answer again; this is the reason for the small deviations in the number of responses per category. However, we used all responses, rejected and approved, in our models. Overall, 349, 332, 329, and 327 responses were collected for animals, plants, inanimate objects, and AI-enabled technologies, respectively.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>The descriptive statistics of different categories of entities in the datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Category</th>
<th colspan="2" align="center">Datasets 1 and 2</th>
<th colspan="2" align="center">Dataset 3 (test data)</th>
</tr>
<tr>
<th align="center">&#x23; of responses</th>
<th align="center">The average number of tokens</th>
<th align="center">&#x23; of responses</th>
<th align="center">The average number of tokens</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Animals</td>
<td align="char" char=".">296</td>
<td align="char" char=".">36.73</td>
<td align="char" char=".">53</td>
<td align="char" char=".">29.77</td>
</tr>
<tr>
<td align="left">Plants</td>
<td align="char" char=".">277</td>
<td align="char" char=".">34.06</td>
<td align="char" char=".">55</td>
<td align="char" char=".">34.85</td>
</tr>
<tr>
<td align="left">Inanimate objects</td>
<td align="char" char=".">277</td>
<td align="char" char=".">31.54</td>
<td align="char" char=".">52</td>
<td align="char" char=".">30.23</td>
</tr>
<tr>
<td align="left">AI-enabled technologies</td>
<td align="char" char=".">276</td>
<td align="char" char=".">39.13</td>
<td align="char" char=".">51</td>
<td align="char" char=".">32.31</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-3">
<title>4.3 Data Annotation</title>
<p>The whole annotation process was done by two annotators (the authors) in three steps. First, in a group session, we reached a conclusion about the definition of each component and how to annotate it. Second, we randomly selected 100 records from dataset 2 and annotated them separately to measure agreement (the details of measuring inter-rater agreement are given in the next section). In the last step, the first author annotated the rest of the unannotated data. The data were annotated based on the three core components of Toulmin&#x2019;s model of arguments: claim, warrant, and evidence (cp. <italic>Argument Mining</italic> and <xref ref-type="fig" rid="F1">Figure&#x20;1</xref>, <xref ref-type="bibr" rid="B68">Toulmin, 2003</xref>).</p>
<p>Three different annotation values were considered for the claim: &#x201c;positive&#x201d; means the user claimed that the entity is intelligent; &#x201c;negative&#x201d; refers to the opposite, i.e., the user claimed that the entity is not intelligent; and &#x201c;unknown&#x201d; refers to responses in which there is no specific claim or stance regarding the question.</p>
<p>For the warrant, two values were considered, &#x201c;with warrant&#x201d; or &#x201c;without warrant,&#x201d; referring to the existence of a warrant in the response: &#x201c;with warrant&#x201d; is assigned to responses in which at least one of the definitions of intelligence is mentioned.</p>
<p>For evidence, a binary value was considered. The responses are annotated with &#x201c;with evidence&#x201d; if there are some parts in the responses in which users use their background knowledge or observation to justify their claims. <xref ref-type="table" rid="T3">Table&#x20;3</xref> represents the collected data in terms of these labels. The collected and annotated data are accessible for other researchers as an appendix to this publication<xref ref-type="fn" rid="fn4">
<sup>3</sup>
</xref>. We explicitly discarded at this stage an evaluation of how reasonable the evidence is; this is discussed further in <italic>Can Toulmin&#x2019;s Model Of Argument Be Used To Model Different Types Of Structural Wrongness Within Conversational Agents In The Given Domain? (RQ1)</italic>.</p>
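<p>For illustration, one annotated record under this scheme might look as follows (the response text and field names are invented, not taken from the published dataset):</p>

```python
# Illustrative annotated record under the labeling scheme described above.
# The response text and field names are invented for illustration only.
record = {
    "entity": "office chair",
    "response": "No, an office chair is not intelligent because it cannot "
                "act rationally; it just stands there.",
    "claim": "negative",          # positive / negative / unknown
    "warrant": "with warrant",    # cites a definition ("acting rationally")
    "evidence": "with evidence",  # an observation justifies the claim
}
print(record["claim"], record["warrant"], record["evidence"])
```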
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>The number of different labels for each component in training and test&#x20;data.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Component</th>
<th colspan="3" align="center">Claim</th>
<th colspan="2" align="center">Warrant</th>
<th colspan="2" align="center">Evidence</th>
</tr>
<tr>
<th align="left">Annotation</th>
<th align="center">Positive</th>
<th align="center">Negative</th>
<th align="center">Unknown</th>
<th align="center">With warrant</th>
<th align="center">Without warrant</th>
<th align="center">With evidence</th>
<th align="center">Without evidence</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Training data (datasets 1 and 2)</td>
<td align="center">477</td>
<td align="center">594</td>
<td align="center">55</td>
<td align="center">691</td>
<td align="center">435</td>
<td align="center">835</td>
<td align="center">291</td>
</tr>
<tr>
<td align="left">Test data (dataset 3)</td>
<td align="center">102</td>
<td align="center">99</td>
<td align="center">10</td>
<td align="center">111</td>
<td align="center">100</td>
<td align="center">159</td>
<td align="center">52</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-4">
<title>4.4&#x20;Inter-rater Agreement</title>
<p>One of the reasons that argument mining and its sub-tasks are so challenging is disagreement in annotating datasets. Most of the available datasets do not report inter-rater agreement (<xref ref-type="bibr" rid="B39">Lippi and Torroni, 2016</xref>). In principle, the overall quality of an argument is something humans sometimes cannot agree on because, based on <xref ref-type="bibr" rid="B71">Wachsmuth et&#x20;al. (2017a</xref>), some aspects of argumentation quality are subjective and overall quality is hard to measure. <xref ref-type="bibr" rid="B71">Wachsmuth et&#x20;al. (2017a)</xref> also showed that some dimensions of argument quality in practice were not correlated to any theoretical dimension of argument quality, or could not be separated and matched to theoretical dimensions.</p>
<p>In general, analyzing arguments and annotating texts is contentious most of the time, which makes tasks such as detecting claims, warrants, or evidence even more challenging. To train and evaluate the performance of detecting the core components, high-quality annotated datasets are required. In this article, Cohen&#x2019;s &#x3ba; is used for evaluating inter-rater agreement; this measure takes into account both the observed agreement among the labels and the agreement occurring by chance. The equation for &#x3ba; is<disp-formula id="equ1">
<mml:math id="m1">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3ba;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="bold-italic">Pr</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold-italic">a</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="bold-italic">Pr</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold-italic">e</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="bold-italic">Pr</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="bold-italic">e</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>In this equation, <inline-formula id="inf1">
<mml:math id="m2">
<mml:mrow>
<mml:mi>Pr</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the relative observed agreement among raters, and <inline-formula id="inf2">
<mml:math id="m3">
<mml:mrow>
<mml:mi>Pr</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the hypothetical probability of chance agreement. Different thresholds are defined for the value of &#x3ba;. In general, &#x3ba; ranges from &#x2212;1 to 1, and higher values mean higher agreement between the raters. If the raters are in complete agreement, &#x3ba; &#x3d; 1; if there is no agreement among the raters other than what would be expected by chance, &#x3ba; &#x3d; 0. Based on <xref ref-type="bibr" rid="B35">Landis and Koch (1977</xref>), values below 0 are considered poor, between 0 and 0.20 slight, between 0.21 and 0.4 fair, between 0.41 and 0.6 moderate, between 0.61 and 0.80 substantial, and above 0.81 almost perfect inter-rater reliability. In <xref ref-type="bibr" rid="B65">Stemler and Tsai (2008</xref>), the threshold of 0.5 was recommended for exploratory research. For natural language processing (NLP) tasks, agreement is considered significant when &#x3ba; is greater than 0.6 (<xref ref-type="bibr" rid="B8">Cabrio and Villata, 2018</xref>). The values of &#x3ba; for the claim, warrant, and evidence components were 0.94, 0.92, and 0.65, respectively. The &#x3ba; values for claim and warrant are above 0.9, indicating almost perfect inter-rater reliability: the definitions of these components are straightforward, and the coders knew exactly what they were looking for. In contrast, evidence can be anything, based on users&#x2019; background knowledge or observations, that is related to the users&#x2019; claim, so there is a chance that the coders have different opinions on some responses. Even though there are unlimited ways of providing evidence that supports the claim of whether and in what sense an entity is intelligent, there is substantial agreement between the two raters on the existence of evidence (&#x3ba; &#x3d; 0.65). In an analysis of disagreements, these mostly stemmed from the raters&#x2019; different quality thresholds on what would count as acceptable evidence. For instance, there were disagreements on samples such as &#x201c;No. The statue of liberty cannot think and has no mind or brain.&#x201d; or &#x201c;An office chair is not intelligent because neither it can do work on its own nor it can think and act.&#x201d;</p>
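<p>The &#x3ba; computation above can be sketched directly from its definition (the two label sequences below are invented for illustration):</p>

```python
# Cohen's kappa computed from two annotators' label sequences,
# following the definition of Pr(a) and Pr(e) given above.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Pr(a): relative observed agreement among raters
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pr(e): hypothetical probability of chance agreement
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    pr_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    return (pr_a - pr_e) / (1 - pr_e)

# invented warrant annotations from two raters
rater1 = ["with", "with", "without", "with", "without", "without"]
rater2 = ["with", "with", "without", "without", "without", "with"]
print(round(cohens_kappa(rater1, rater2), 3))  # → 0.333
```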
</sec>
<sec id="s4-5">
<title>4.5 Data Processing</title>
<p>The preprocessing steps are the same for all the models we created for detecting claims, warrants, and evidence. The steps are as follows: 1) converting all responses to lowercase, 2) removing extra spaces at the beginning, end, and middle of the responses, 3) replacing the various forms of the entities&#x2019; names with a specific token, &#x201c;ENT,&#x201d; and 4) tokenizing and lemmatizing the responses.</p>
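<p>The four steps above can be sketched as follows (a minimal sketch: the entity-name variants and the toy suffix-stripping lemmatizer are placeholders for illustration; a real pipeline would use a full lemmatizer, e.g. from NLTK or spaCy):</p>

```python
# Hedged sketch of the four preprocessing steps. The entity-name variants and
# the toy "lemmatizer" are placeholders, not the pipeline used in the study.
import re

ENTITY_VARIANTS = ["office chairs", "office chair", "chairs", "chair"]

def lemmatize(token):
    # toy stand-in: strip a plural "s" (real lemmatization is more involved)
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def preprocess(response):
    text = response.lower()                      # 1) lowercase
    text = re.sub(r"\s+", " ", text).strip()     # 2) normalize whitespace
    for variant in ENTITY_VARIANTS:              # 3) replace entity names
        text = text.replace(variant, "ENT")
    tokens = re.findall(r"[A-Za-z']+", text)     # 4) tokenize ...
    return [lemmatize(t) for t in tokens]        # ... and lemmatize

print(preprocess("  An Office Chair is not  intelligent. It has no thoughts."))
```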
<p>Replacing the entities&#x2019; names is crucial for two reasons. First, by replacing the entities&#x2019; names with &#x201c;ENT,&#x201d; it is possible to create a single claim detection model that covers all types of entities. Second, we wanted to remove the impact of the entities themselves on the prediction, because the names of entities would otherwise affect the models&#x2019; predictions. For example, in 86 per cent of the responses in which we asked about the intelligence of monkeys, the users&#x2019; claims were positive, which means a claim detection model would tend to assign a positive claim to responses related to the monkey entity. The same holds for other entities such as &#x201c;an office chair&#x201d;: in 91 per cent of the responses related to office chairs, the claim was negative, i.e., the users claimed that the entity is not intelligent. In the next subsection, the features used to create the models are presented.</p>
</sec>
<sec id="s4-6">
<title>4.6 Feature Selection</title>
<p>To create classifiers, user responses need to be converted to vectors that machine learning classifiers can use. In this subsection, we report on features in the sense of how these vectors are created. Overall, we developed three classifiers, one for each core component of Toulmin&#x2019;s model of argument: claim, warrant, and evidence (see <italic>Argument Mining</italic>). For each classifier, different features were used, and we report below for each classifier separately which features these were. Some features were shared by all classifiers (general features, namely the TFIDF representation of the user response), and some were component-specific, i.e., specific to one core component of Toulmin&#x2019;s model of argument.</p>
<sec id="s4-6-1">
<title>4.6.1 Claim</title>
<p>We report on the features that were used as input to the classifier that aims to detect the existence of a claim (see <italic>Argument Mining</italic>) in a user response. We aimed to differentiate between three classes: positive claims, negative claims, and unknown claims. We identified these classes with two groups of features, general features and component-specific features.</p>
<p>Term-frequency-inverse-document-frequency (TFIDF) was used throughout this work as a general representation of user responses: datasets 1 and 2 served as the full document set, and the dictionary vector contains bigrams and trigrams. Unigrams were ignored because they were not informative or indicative: words such as &#x201c;is,&#x201d; &#x201c;intelligent,&#x201d; and &#x201c;not&#x201d; did not lead to correct predictions about claims, whereas bigrams and trigrams such as &#x201c;is intelligent&#x201d; and &#x201c;is not intelligent&#x201d; were what we needed to predict users&#x2019; claims. After the preprocessing steps (see <italic>Data Processing</italic>), only the 500 most frequent bigrams and trigrams across datasets 1 and 2 were used as TFIDF vectors. The underlying rationale was to avoid highly sparse vectors.</p>
<p>In addition, we used general background knowledge as well as information from pilot studies to add features that are specific both to the &#x201c;claim&#x201d; as one component of the argument to be classified and to the particular question that was asked (is an entity intelligent or not). Two regular expressions were used to indicate whether a response started or ended with phrases or words such as &#x201c;yes,&#x201d; &#x201c;no,&#x201d; &#x201c;it is intelligent,&#x201d; or &#x201c;it is not intelligent.&#x201d; If one of these patterns was found in a response, a ternary value (&#x2212;1, 0, 1), depending on whether the pattern was negative or positive, was added to the general feature vector of the response.</p>
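<p>A minimal sketch of this component-specific claim feature follows; the regular expressions are illustrative stand-ins, not the exact patterns used in the study.</p>

```python
import re

# Illustrative patterns: a response starting with "yes"/"no" or ending
# with "it is (not) intelligent" signals the direction of the claim.
POSITIVE = re.compile(r"^yes\b|it is intelligent\W*$", re.IGNORECASE)
NEGATIVE = re.compile(r"^no\b|it is not intelligent\W*$", re.IGNORECASE)

def claim_hint(response: str) -> int:
    """Ternary claim feature: 1 for positive, -1 for negative, 0 otherwise."""
    text = response.strip()
    if NEGATIVE.search(text):  # check the negation pattern first
        return -1
    if POSITIVE.search(text):
        return 1
    return 0
```

<p>The resulting value is appended to the general TFIDF feature vector of the response.</p>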
</sec>
<sec id="s4-6-2">
<title>4.6.2 Warrant</title>
<p>In this study, we asked participants to use one of five definitions of intelligence as the link between their claim and their concrete evidence: &#x201c;acting humanly,&#x201d; &#x201c;acting rationally,&#x201d; &#x201c;thinking rationally,&#x201d; &#x201c;thinking humanly,&#x201d; and &#x201c;learning from experience to better reach its goals.&#x201d; We aimed to differentiate between two classes only: with a warrant, in the sense of a reference to one of these definitions, and without a warrant. Note that in Toulmin&#x2019;s description, a warrant can also be implicit, as the underlying understanding based on which a human makes an argument. In this work, we looked for explicit warrants in labeling, and subsequently in classification.</p>
<p>Part of the feature vector for the warrant classifier was the same TFIDF representation of the user response (vector length 500; bigrams and trigrams represented). Our goal was to identify the existence of warrants (or their absence) in the sense of detecting the usage of one of the five predefined definitions of intelligence. Hence, regular expressions were used as a component-specific feature to detect the presence of the different forms of these definitions in responses. The phrases that we looked for with regular expressions were indicative phrases such as &#x201c;act humanly,&#x201d; &#x201c;think rationally,&#x201d; &#x201c;can learn,&#x201d; &#x201c;learn from experience,&#x201d; or &#x201c;reach better.&#x201d; Based on the existence of these patterns, a binary value (0 or 1) was added to the general features.</p>
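<p>The warrant-specific feature can be sketched as a single binary flag; the patterns below are illustrative approximations of the indicative phrases listed above.</p>

```python
import re

# Illustrative patterns for the five predefined definitions of intelligence.
WARRANT_PATTERNS = re.compile(
    r"act(s|ing)?\s+(humanly|rationally)"
    r"|think(s|ing)?\s+(humanly|rationally)"
    r"|learn(s|ing)?\s+from\s+experience"
    r"|can\s+learn"
    r"|reach\s+better",
    re.IGNORECASE,
)

def has_warrant_phrase(response: str) -> int:
    """Binary warrant feature: 1 if a definition phrase occurs, else 0."""
    return 1 if WARRANT_PATTERNS.search(response) else 0
```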
</sec>
<sec id="s4-6-3">
<title>4.6.3 Evidence</title>
<p>We aimed to differentiate between two classes: that some evidence is given, or that it is not. Since no predefined list of facts or observations was given in the present studies, the evidence part was the freest and hence most variable part of user responses. As general features, similar to the other components, we used TFIDF vectors but with different parameters. Datasets 1 and 2 served as the full document set, and the dictionary vector contained unigrams and bigrams. In contrast to the claim and warrant components, for which trigram phrases can indicate existence, for detecting the evidence component unigrams and bigrams such as &#x201c;no brain,&#x201d; &#x201c;inanimate object,&#x201d; &#x201c;prey,&#x201d; or &#x201c;making tools&#x201d; can be discriminative. Furthermore, in contrast to the feature vectors for detecting claims and warrants, the 3,000 (instead of 500) most frequent unigrams and bigrams for the whole dataset were used in the dictionary vector, and TFIDF was computed based on this reduced dictionary vector. The underlying rationale for limiting the dictionary was again to avoid highly sparse vectors; the length was nonetheless larger than for the other components because we expected more variance in the evidence part of the given arguments. Finally, to remove phrases related to the claims and the warrants and retain more relevant and meaningful phrases, we ignored all phrases that occurred in more than 30 per cent of all responses.</p>
<p>As component-specific features, we used two different evidence-specific feature sets: a list of evidence-specific keywords, and the response length. For the evidence-specific keywords, we assume that when one of them appears in the user response, there is a high likelihood that this keyword is part of the evidence for the argument, and hence that the statement belongs to the &#x201c;with evidence&#x201d; class. The evidence component is the only argument component in which users need to talk about aspects specific to the entities, based on their experience and background knowledge; that is, users use their own keywords to justify their claims. <xref ref-type="table" rid="T4">Table&#x20;4</xref> shows the 30 keywords that we identified in dataset 2. To identify these keywords, we used dataset 2 and did the following preprocessing: First, all phrases related to the claim and warrant (component-specific features) were eliminated. Second, we extracted unigrams and removed stop words. For each remaining unigram, a vector with the length of the number of responses was created that indicates the presence of the unigram in each response. Then, the Matthews correlation coefficient between each unigram vector and the class values of the evidence class (with/without evidence) was calculated. The 500 most correlated unigrams were chosen; the cutoff was empirical because subsequent unigrams seemed too random. This yielded the bold entries in <xref ref-type="table" rid="T4">Table&#x20;4</xref>. The non-bold entries are keywords that were added to the list because they are synonyms of, similar to, or relevant to other highly correlated unigrams, such as &#x201c;by human,&#x201d; &#x201c;handmade,&#x201d; and &#x201c;made by&#x201d; for &#x201c;man-made.&#x201d;</p>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>The 30 terms that correlate most with the class &#x201c;with evidence&#x201d; in dataset 2.</p>
</caption>
<table>
<tbody valign="top">
<tr>
<td align="left">
<bold>Instinct</bold>
</td>
<td align="left">
<bold>Plant</bold>
</td>
<td align="left">
<bold>Prey</bold>
</td>
<td align="left">
<bold>Steel</bold>
</td>
<td align="left">
<bold>Inanimate</bold>
</td>
<td align="left">
<bold>Sunlight</bold>
</td>
</tr>
<tr>
<td align="left">Hunt</td>
<td align="left">
<bold>brain</bold>
</td>
<td align="left">
<bold>object</bold>
</td>
<td align="left">
<bold>trap</bold>
</td>
<td align="left">handmade</td>
<td align="left">
<bold>living</bold>
</td>
</tr>
<tr>
<td align="left">Survive</td>
<td align="left">
<bold>lifeless</bold>
</td>
<td align="left">
<bold>aware</bold>
</td>
<td align="left">
<bold>insect</bold>
</td>
<td align="left">
<bold>man made</bold>
</td>
<td align="left">
<bold>grow</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Tool</bold>
</td>
<td align="left">alive</td>
<td align="left">
<bold>cognition</bold>
</td>
<td align="left">feed</td>
<td align="left">made by</td>
<td align="left">
<bold>food</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>group</bold>
</td>
<td align="left">
<bold>metal</bold>
</td>
<td align="left">
<bold>program</bold>
</td>
<td align="left">by human</td>
<td align="left">
<bold>stone</bold>
</td>
<td align="left">
<bold>sun</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
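<p>The keyword-selection procedure described above can be sketched as follows; the labeled mini-dataset is invented for illustration, and <monospace>matthews_corrcoef</monospace> from scikit-learn computes the correlation between keyword presence and the evidence label.</p>

```python
from sklearn.metrics import matthews_corrcoef

# Toy labeled responses: 1 = with evidence, 0 = without evidence.
labeled = [
    ("it has no brain", 1),
    ("it is an inanimate object", 1),
    ("it hunts prey in groups", 1),
    ("yes it is intelligent", 0),
    ("no it is not intelligent", 0),
]
texts = [t for t, _ in labeled]
labels = [y for _, y in labeled]

# Correlate the binary presence vector of each unigram with the labels.
vocab = sorted({w for t in texts for w in t.split()})
scores = {
    w: matthews_corrcoef(labels, [1 if w in t.split() else 0 for t in texts])
    for w in vocab
}

# Keep the most correlated unigrams as candidate evidence keywords.
keywords = sorted(scores, key=scores.get, reverse=True)[:5]
```

<p>Evidence-bearing unigrams such as &#x201c;brain&#x201d; correlate positively with the &#x201c;with evidence&#x201d; class, whereas claim-related words such as &#x201c;intelligent&#x201d; correlate negatively and fall out of the keyword list.</p>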
<p>These words are related to entities, and study participants often used them in the evidence part of responses to argue why a particular definition of intelligence (warrant) applied to an entity or not. This feature set conceptually captures evidence given at an abstraction level higher than the single entity types, in the sense of referring to an entity&#x2019;s decisive characteristic, such as being an inanimate object or having a brain, which in turn is true of many more entities than the Statue of Liberty or snakes. To use this feature, a binary value was added to the general vector of evidence to indicate the presence of one of these evidence-specific keywords.</p>
<p>The second evidence-specific feature was the length of the response in number of words. Conceptually, if one wants to make a claim, refer to a predefined definition, and in addition describe evidence that links claim and warrant, one needs more words than if one does not add evidence. When cross-checking this intuition in datasets 1 and 2, there is a significant difference between responses that contain evidence (M &#x3d; 39.16, SD &#x3d; 23.27) and those without evidence (M &#x3d; 25.10, SD &#x3d; 16.73). Welch&#x2019;s <italic>t</italic>-test showed that the difference was significant: t &#x3d; &#x2212;11.07, <italic>p</italic> &#x3c; 0.0001, degrees of freedom &#x3d; 701.63. To show that the significant difference was due to the length of the evidence component and not the other components, we first reduced the length value by 4 and 5 words if responses contained the claim and the warrant component, respectively. Then, we ran Welch&#x2019;s <italic>t</italic>-test again on the new length values for responses with evidence (M &#x3d; 32.27, SD &#x3d; 22.9) and without evidence (M &#x3d; 18.39, SD &#x3d; 16.86). Based on the new length values, there was again a significant difference: t &#x3d; &#x2212;10.95, <italic>p</italic> &#x3c; 0.0001, degrees of freedom &#x3d; 684.48. This feature intuitively makes sense, yet of course is very coarse: it can fail in single instances if no claim or warrant exists (a shorter overall response that still includes evidence), can fail in single instances if claim and warrant are expressed very verbosely, and of course does not capture the correctness of the evidence or the soundness of the overall argument in any&#x20;way.</p>
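<p>The length check above corresponds to a Welch&#x2019;s <italic>t</italic>-test; a sketch with SciPy follows, where the samples are synthetic stand-ins drawn to match the reported means and standard deviations, not the study&#x2019;s data.</p>

```python
import numpy as np
from scipy import stats

# Synthetic length samples matching the reported means/SDs.
rng = np.random.default_rng(0)
with_evidence = rng.normal(loc=39.16, scale=23.27, size=400)
without_evidence = rng.normal(loc=25.10, scale=16.73, size=300)

# equal_var=False selects Welch's t-test (unequal variances).
t_stat, p_value = stats.ttest_ind(with_evidence, without_evidence, equal_var=False)
```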
<p>In <xref ref-type="table" rid="T5">Table&#x20;5</xref>, we summarize the features that were used for identifying the existence of the core components in responses.</p>
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>The features used for training classifiers of claim, warrant, and evidence components.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Component</th>
<th align="center">General feature</th>
<th align="center">Component-specific feature</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Claim</td>
<td align="left">&#x2022; TFIDF of bigrams and trigrams (The length of vector &#x3d; 500)</td>
<td align="left">&#x2022; Regular expressions to indicate phrases such as &#x201c;it is (not) intelligent&#x201d;</td>
</tr>
<tr>
<td align="left">Warrant</td>
<td align="left">&#x2022; TFIDF of bigrams and trigrams (The length of vector &#x3d; 500)</td>
<td align="left">&#x2022; Regular expressions to indicate the proposed definitions of intelligence</td>
</tr>
<tr>
<td rowspan="2" align="left">Evidence</td>
<td rowspan="2" align="left">&#x2022; TFIDF of unigrams and bigrams (The length of vector &#x3d; 3,000)</td>
<td align="left">&#x2022; The entity-specific keywords (<xref ref-type="table" rid="T4">Table&#x20;4</xref>)</td>
</tr>
<tr>
<td align="left">&#x2022; The length of responses based on the number of words</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5 Results</title>
<sec id="s5-1">
<title>5.1 How Well Can Components of Toulmin&#x2019;s Model of Argument Be Identified in the Given Domain? (RQ2)</title>
<p>In this section, we answer RQ2&#x2014;How well can different elements of Toulmin&#x2019;s model of argument be identified?&#x2014;by developing classifiers for the three core components of Toulmin&#x2019;s model of argument, namely claims, warrants, and evidence (cp. <italic>Argument Mining</italic>), based on datasets 1 and 2 as training data and dataset 3 as unseen test data for evaluating these classifiers (see <italic>Data Collection</italic>, especially regarding the three different datasets). The classifiers were developed using vector representations of user statements with the features described above (<italic>Feature Selection</italic>). We use traditional ML methods such as K-Nearest Neighbors, SVM, Decision Trees, Random Forest, and Ada Boost. We explicitly do not use deep learning methods in this work, as we have too little data; and we do not use transfer learning, as no suitable models from sufficiently similar problems are available.</p>
<p>For selecting the best classifier for each core component, we measured the F1-score in 10-fold cross-validation on dataset 2 for the mentioned traditional ML methods. Furthermore, we used dataset 1 as a held-out dataset to compare the ML models based on the F1-score. After we had selected the best classifier, a final model was trained on both datasets 1 and 2. To avoid overfitting, the dataset for tuning hyperparameters was slightly larger and more diverse than that for initial model training and comparison (datasets 1 and 2 were collected with slightly different materials&#x2014;see <italic>Apparatus&#x2014;Educational Scenario &#x201c;Is &#x3c; an entity &#x3e; Intelligent or Not? Why?&#x201d;</italic>); and the dataset for evaluation needed to be previously unseen, as is standard practice in the ML literature. We note that datasets 1 and 2 were lexically relatively similar, first because we had removed concrete entity names in preprocessing (replacement with ENT), and second because user arguments differ mainly across categories of entities (inanimate object, plant, animal, AI-enabled technology) and not so much between different entities (e.g., cat vs. snake).</p>
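<p>The model-comparison step can be sketched as follows; synthetic feature vectors stand in for dataset 2, and the classifier list mirrors the one above with scikit-learn defaults.</p>

```python
# Compare standard classifiers by macro-F1 under 10-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized training responses.
X, y = make_classification(n_samples=300, n_features=50, random_state=0)

classifiers = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Ada Boost": AdaBoostClassifier(random_state=0),
}

scores = {
    name: cross_val_score(clf, X, y, cv=10, scoring="f1_macro").mean()
    for name, clf in classifiers.items()
}
best = max(scores, key=scores.get)  # classifier selected for fine-tuning
```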
<sec id="s5-1-1">
<title>5.1.1 Claim Component</title>
<p>In this subsection, we describe the development and evaluation of a classifier for deciding the existence and the direction of a claim. <xref ref-type="table" rid="T6">Table&#x20;6</xref> shows real responses, before any preprocessing, for the different values of a&#x20;claim.</p>
<table-wrap id="T6" position="float">
<label>TABLE 6</label>
<caption>
<p>Real samples regarding the different values of the claim component. These are users&#x2019; responses without any modification.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">User&#x2019;s response</th>
<th align="center">Claim</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">&#x201c;Monkeys and humans are evolutionary speaking very close. Whilst it can&#x2019;t be said to think or act &#x201c;humanly&#x201d; (by definition only humans can do that), it can certainly think and act both intelligently and rationally, and most certainly learns from experiences. Therefore it is intelligent.&#x201d;</td>
<td align="left">Positive</td>
</tr>
<tr>
<td align="left">&#x201c;I think that a self-driving car is intelligent. It learns from experiences and adapts and makes decisions based on what it has learned.&#x201d;</td>
<td align="left">Positive</td>
</tr>
<tr>
<td align="left">&#x201c;I think a venus flytrap just wants to feed itself. That would be the goal it wants to reach.&#x201d;</td>
<td align="left">Unknown</td>
</tr>
<tr>
<td align="left">&#x201c;the New York Statue of Liberty is made of copper and it exhibits positivity to the people around it and also the toes of this statue denotes the stableness to the world.&#x201d;</td>
<td align="left">Unknown</td>
</tr>
<tr>
<td align="left">&#x201c;no I don&#x2019;t believe a self-driving car is intelligent I believe the people who wrote the code that make the car self-driving are intelligent. The car can only do what is it is programed to do.&#x201d;</td>
<td align="left">Negative</td>
</tr>
<tr>
<td align="left">&#x201c;no&#x201d;</td>
<td align="left">Negative</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We compared standard machine learning classifiers (K-Nearest Neighbors, SVM, decision tree, Random Forest, Ada Boost) using 10-fold cross-validation over dataset 2 and evaluation of performance on dataset 1 as a held-out dataset to identify the best classification model. To compare classifiers, we report the mean and standard deviation of macro-F1 scores over all training-and-test iterations. The result is shown in <xref ref-type="table" rid="T7">Table&#x20;7</xref>.</p>
<table-wrap id="T7" position="float">
<label>TABLE 7</label>
<caption>
<p>The result of 10-fold cross-validation on dataset 2 in detecting claims and evaluation of performance on the held-out dataset.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Classifiers</th>
<th colspan="2" align="center">The result of 10-fold cross-validation on dataset 2</th>
<th colspan="2" align="center">The result of using dataset 1 as a held-out dataset</th>
</tr>
<tr>
<th align="center">Average of macro F1-scores</th>
<th align="center">Standard deviation of macro F1-scores</th>
<th align="center">Macro F1-score</th>
<th align="center">Accuracy</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">K-Nearest Neighbors</td>
<td align="char" char=".">0.61</td>
<td align="char" char=".">0.02</td>
<td align="char" char=".">0.56</td>
<td align="char" char=".">0.70</td>
</tr>
<tr>
<td align="left">SVM</td>
<td align="char" char=".">0.76</td>
<td align="char" char=".">0.07</td>
<td align="char" char=".">0.63</td>
<td align="char" char=".">0.93</td>
</tr>
<tr>
<td align="left">Decision Tree</td>
<td align="char" char=".">0.75</td>
<td align="char" char=".">0.03</td>
<td align="char" char=".">0.71</td>
<td align="char" char=".">0.88</td>
</tr>
<tr>
<td align="left">Random Forest</td>
<td align="char" char=".">
<bold>0.79</bold>
</td>
<td align="char" char=".">0.07</td>
<td align="char" char=".">
<bold>0.80</bold>
</td>
<td align="char" char=".">
<bold>0.94</bold>
</td>
</tr>
<tr>
<td align="left">Ada Boost</td>
<td align="char" char=".">0.68</td>
<td align="char" char=".">0.07</td>
<td align="char" char=".">0.62</td>
<td align="char" char=".">0.92</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the highest F1&#x2010;scores and accuracy values.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>As is shown in <xref ref-type="table" rid="T7">Table&#x20;7</xref>, the Random Forest classifier achieved the best results, and hence we proceeded to fine-tune this classifier. We observed that the average F1-scores are reasonable for multiple classifiers (SVM, decision tree, and Random Forest); this highlights the fundamental feasibility of separating statements with a claim from those without a&#x20;claim.</p>
<p>The data used to train the final Random Forest classifier were the whole of datasets 1 and 2 (1,126 records). Since dataset 2 was imbalanced and also contained the majority of records, the whole training set was imbalanced: there were 594 records with &#x201c;Negative&#x201d; labels, 477 records with &#x201c;Positive&#x201d; labels, and only 55 records with &#x201c;Unknown&#x201d; labels. The data were thus extremely imbalanced, as only 4.8 per cent of the training data were annotated with &#x201c;Unknown&#x201d; labels. To tackle this, we generated synthetic examples via the Synthetic Minority Oversampling Technique (SMOTE), which creates new minority-class examples based on selected nearest neighbors (<xref ref-type="bibr" rid="B13">Chawla et&#x20;al., 2002</xref>).</p>
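<p>The core idea of SMOTE can be sketched in a few lines; in practice one would use a library implementation such as imbalanced-learn, and the feature vectors below are toy values for illustration only.</p>

```python
# Minimal sketch of SMOTE's mechanism (Chawla et al., 2002): synthesize a
# minority example by interpolating between a minority sample and one of
# its nearest minority-class neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(size=(10, 4))  # toy minority-class feature vectors

nn = NearestNeighbors(n_neighbors=3).fit(minority)

def smote_sample(X):
    """One synthetic example: a random point on the segment between a
    minority sample and a randomly chosen nearest minority neighbor."""
    i = int(rng.integers(len(X)))
    _, idx = nn.kneighbors(X[i : i + 1])
    j = int(rng.choice(idx[0][1:]))  # skip the point itself
    gap = rng.random()
    return X[i] + gap * (X[j] - X[i])

synthetic = np.array([smote_sample(minority) for _ in range(5)])
```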
<p>To select the tuning parameters of the Random Forest classifier, the GridSearchCV function of Scikit-learn (<xref ref-type="bibr" rid="B48">Pedregosa et&#x20;al., 2011</xref>) was used. The tuning function was parameterized to train the Random Forest classifier with 5-fold cross-validation. The parameters that we tried to optimize were the number of estimators (n_estimators) and the maximum depth (max_depth). The rest of the parameters used default values<xref ref-type="fn" rid="fn5">
<sup>4</sup>
</xref>. For the number of estimators, the range [100, 150, 200, 250, 300], and for the maximum depth, the range [10, 20, 30, 40, 50], were considered to find the best combination. The optimum Random Forest in terms of macro F1-measure had 200 estimators and a maximum depth of 40. In the final step, the Random Forest classifier was trained on the oversampled datasets 1 and 2. <xref ref-type="table" rid="T8">Table&#x20;8</xref> illustrates the performance of the model assessed on the unseen test data, dataset&#x20;3.</p>
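<p>The hyperparameter search can be sketched as follows; synthetic data stands in for the oversampled training set, and the grid is reduced here to keep the example small (the study searched n_estimators in [100, 300] and max_depth in [10, 50]).</p>

```python
# Grid search over n_estimators and max_depth with 5-fold CV and macro-F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [10, 30]},
    cv=5,
    scoring="f1_macro",
)
search.fit(X, y)  # best_params_ holds the selected combination
```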
<table-wrap id="T8" position="float">
<label>TABLE 8</label>
<caption>
<p>The performance of detecting claims on the test data (dataset 3) based on each&#x20;class.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="center">Precision</th>
<th align="center">Recall</th>
<th align="center">F1-score</th>
<th align="center">&#x23; of instances</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Positive</td>
<td align="char" char=".">
<bold>0.97</bold>
</td>
<td align="char" char=".">0.91</td>
<td align="char" char=".">
<bold>0.94</bold>
</td>
<td align="char" char=".">102</td>
</tr>
<tr>
<td align="left">Negative</td>
<td align="char" char=".">0.95</td>
<td align="char" char=".">
<bold>0.93</bold>
</td>
<td align="char" char=".">
<bold>0.94</bold>
</td>
<td align="char" char=".">99</td>
</tr>
<tr>
<td align="left">Unknown</td>
<td align="char" char=".">0.33</td>
<td align="char" char=".">0.60</td>
<td align="char" char=".">0.43</td>
<td align="char" char=".">10</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the highest precision, recall, and F1-score values.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The results are shown per class in terms of precision, recall, and F1-score. All scores for the positive and negative classes are above 90 per cent: the positive class has the highest precision, the negative class the highest recall, and both achieve the same F1-score. The &#x201c;Unknown&#x201d; class was extremely imbalanced: there were only 55 such responses in the training data (datasets 1 and 2) and only 10 in the test data (dataset 3). Overall, the performance on the positive and the negative class was very satisfactory; for the unknown class, it was not. The low precision for the unknown class means that a user statement labeled &#x201c;unknown&#x201d; by the model has a reasonable likelihood of actually belonging to another class, whereas a statement labeled positive or negative is very unlikely to be anything else. In a separate experiment, a new model was trained without SMOTE oversampling; without it, all the scores of the minority class were&#x20;zero.</p>
<p>Furthermore, since the claim detection model was a multi-class classifier, macro and weighted metrics are reported. Besides&#x20;these scores, overall accuracy and Cohen&#x2019;s &#x3ba; are reported. In <xref ref-type="table" rid="T9">Table&#x20;9</xref>, the macro and the weighted score of precision, recall, and F1-score and also accuracy and Cohen&#x2019;s &#x3ba; are illustrated.</p>
<table-wrap id="T9" position="float">
<label>TABLE 9</label>
<caption>
<p>The overall performance of detecting claims on the test data (dataset 3).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Random forest classifier</th>
<th align="center">Precision</th>
<th align="center">Recall</th>
<th align="center">F1-score</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Macro average</td>
<td align="char" char=".">0.75</td>
<td align="char" char=".">0.81</td>
<td align="char" char=".">0.77</td>
</tr>
<tr>
<td align="left">Weighted average</td>
<td align="char" char=".">0.93</td>
<td align="char" char=".">0.91</td>
<td align="char" char=".">0.91</td>
</tr>
<tr>
<td align="left">Accuracy</td>
<td colspan="3" align="char" char=".">0.91</td>
</tr>
<tr>
<td align="left">Cohen&#x2019;s &#x3ba;</td>
<td colspan="3" align="char" char=".">0.83</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Macro average precision, which is the unweighted average of the precision of all classes, is 0.75; weighted average precision, which weights each class by its number of records, is 0.93. We also measured macro and weighted averages for the recall and F1-score metrics. The claim model had an accuracy of 0.91, which is satisfactory.</p>
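<p>The macro/weighted distinction and the additional metrics above can be reproduced with scikit-learn&#x2019;s metric functions; the toy label vectors below are invented for illustration and do not correspond to the actual predictions.</p>

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    precision_recall_fscore_support,
)

# Toy three-class predictions standing in for the claim model's output.
y_true = ["pos", "pos", "pos", "neg", "neg", "unk", "pos", "neg", "neg", "unk"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "unk", "pos", "neg", "unk", "pos"]

# Macro: unweighted mean over classes; weighted: mean weighted by support.
macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
```

<p>With a poorly predicted minority class, the weighted averages exceed the macro averages, mirroring the gap between Tables 8 and 9.</p>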
</sec>
<sec id="s5-1-2">
<title>5.1.2 Warrant Component</title>
<p>In this subsection, we describe the development and evaluation of a classifier for deciding the existence of an explicit warrant, in the sense of an explicit reference to one of the five predefined definitions of intelligence. <xref ref-type="table" rid="T10">Table&#x20;10</xref> shows several real responses from the&#x20;study.</p>
<table-wrap id="T10" position="float">
<label>TABLE 10</label>
<caption>
<p>Real samples regarding the different values of the warrant component.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">User&#x2019;s response</th>
<th align="center">Warrant</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">&#x201c;Yes, I think that any action that involves the act of thinking and acting, involves a certain level of intelligence, in my opinion they are very intelligent, because they are born doing things that we humans are not born doing, they learn new things, things which is outside the animal world, things that only we humans learn, but of course there is a limitation in that.&#x201d;</td>
<td align="center">With</td>
</tr>
<tr>
<td align="left">&#x201c;I think a monkey is very intelligent because it can learn just like a human.&#x201d;</td>
<td align="center">With</td>
</tr>
<tr>
<td align="left">&#x201c;Snakes have the ability to adjust their behavior as determined by their surroundings and, as such, are able to learn from their experiences, so, yes, they are intelligent.&#x201d;</td>
<td align="center">With</td>
</tr>
<tr>
<td align="left">&#x201c;A self-driving car is intelligent as long as it has the correct information for it to function. It needs to have &#x201c;brains&#x201d; in order to work properly.&#x201d;</td>
<td align="center">Without</td>
</tr>
<tr>
<td align="left">&#x201c;No, I think that the actins of reptiles which include apparent stealth and self-direction, do not correspond to selecting from a set of alternative actions. The action is the only option and it is conjured by the needs of instinct&#x201d;</td>
<td align="center">Without</td>
</tr>
<tr>
<td align="left">&#x201c;It was intelligent it shows the friendship between two countries namely France and United&#x20;States and mostly it representing liberty the enlightening the world. The torch really shows the path to freedom.&#x201d;</td>
<td align="center">Without</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Again, we compared the same standard machine learning classifiers (K-Nearest Neighbors, SVM, decision tree, Random Forest, and Ada Boost), analogously to the claim component. The result is shown in <xref ref-type="table" rid="T11">Table&#x20;11</xref>.</p>
<table-wrap id="T11" position="float">
<label>TABLE 11</label>
<caption>
<p>The result of 10-fold cross-validation on dataset 2 in detecting warrants and evaluation of performance on the held-out dataset.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Classifiers</th>
<th colspan="2" align="center">The result of 10-fold cross-validation on dataset 2</th>
<th colspan="2" align="center">The result of using dataset 1 as a held-out dataset</th>
</tr>
<tr>
<th align="center">Average of macro F1-scores</th>
<th align="center">Standard deviation of macro F1-scores</th>
<th align="center">Macro F1-score</th>
<th align="center">Accuracy</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">K-Nearest Neighbors</td>
<td align="char" char=".">0.76</td>
<td align="char" char=".">0.03</td>
<td align="char" char=".">0.55</td>
<td align="char" char=".">0.58</td>
</tr>
<tr>
<td align="left">SVM</td>
<td align="char" char=".">0.85</td>
<td align="char" char=".">0.03</td>
<td align="char" char=".">0.61</td>
<td align="char" char=".">0.61</td>
</tr>
<tr>
<td align="left">Decision Tree</td>
<td align="char" char=".">0.81</td>
<td align="char" char=".">0.04</td>
<td align="char" char=".">0.64</td>
<td align="char" char=".">0.64</td>
</tr>
<tr>
<td align="left">Random Forest</td>
<td align="char" char=".">
<bold>0.87</bold>
</td>
<td align="char" char=".">0.02</td>
<td align="char" char=".">
<bold>0.68</bold>
</td>
<td align="char" char=".">
<bold>0.68</bold>
</td>
</tr>
<tr>
<td align="left">Ada Boost</td>
<td align="char" char=".">0.85</td>
<td align="char" char=".">0.03</td>
<td align="char" char=".">0.65</td>
<td align="char" char=".">0.65</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the highest F1&#x2010;scores and accuracy values.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Based on the results in <xref ref-type="table" rid="T11">Table&#x20;11</xref>, the Random Forest classifier was selected for detecting the existence of a warrant in users&#x2019; responses. Similar to the claim classifier, the GridSearchCV function of Scikit-learn (<xref ref-type="bibr" rid="B48">Pedregosa et&#x20;al., 2011</xref>) was used to tune the parameters of the Random Forest classifier. The hyperparameters that we optimized were the number of estimators (n_estimators) and the maximum depth (max_depth). For the number of estimators, the range [50, 100, 150, 200, 250], and for the maximum depth, the range [10, 20, 30, 40, 50], were considered to find the best combination. The optimum Random Forest in terms of F1-measure had 100 estimators and a maximum depth of 30. <xref ref-type="table" rid="T12">Table&#x20;12</xref> reports the performance of the model assessed on the unseen test data, dataset&#x20;3.</p>
<table-wrap id="T12" position="float">
<label>TABLE 12</label>
<caption>
<p>The overall performance of detecting warrants on the test data (dataset 3).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Random forest classifier</th>
<th align="center">Precision</th>
<th align="center">Recall</th>
<th align="center">F1-score</th>
<th align="center">&#x23; of instances</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">With warrant</td>
<td align="char" char=".">0.95</td>
<td align="char" char=".">0.83</td>
<td align="char" char=".">0.88</td>
<td align="char" char=".">111</td>
</tr>
<tr>
<td align="left">Without warrant</td>
<td align="char" char=".">0.83</td>
<td align="char" char=".">0.95</td>
<td align="char" char=".">0.88</td>
<td align="char" char=".">100</td>
</tr>
<tr>
<td align="left">Accuracy</td>
<td colspan="4" align="char" char=".">0.89</td>
</tr>
<tr>
<td align="left">Cohen&#x2019;s &#x3ba;</td>
<td colspan="4" align="char" char=".">0.77</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Based on <xref ref-type="table" rid="T12">Table&#x20;12</xref>, the &#x201c;with warrant&#x201d; category had the highest precision, while the best recall was achieved for the &#x201c;without warrant&#x201d; category. The overall accuracy and Cohen&#x2019;s &#x3ba; were 0.89 and 0.77, respectively. Besides the metrics reported in <xref ref-type="table" rid="T12">Table&#x20;12</xref>, the average F1-score for the model was 0.88. These values are overall very reasonable. We note, however, that for our use case the lower precision and higher recall for &#x201c;without warrant&#x201d; mean that additional fine-tuning might need to further penalize a wrong &#x201c;without warrant&#x201d; classification; in such a case the conversational agent would mistakenly ask for an explicit warrant (a reference to one of the five definitions of intelligence in our study) even though the user had already given one. This should only be done given substantial evidence that such a question does more harm (annoys users) than&#x20;good (helps users develop clear argumentative structures).</p>
</sec>
<sec id="s5-1-3">
<title>5.1.3 Evidence Component</title>
<p>In this subsection, we describe the development and evaluation of a classifier for deciding on the existence of concrete evidence that illustrates the (non-)intelligence of an entity. In <xref ref-type="table" rid="T13">Table&#x20;13</xref>, several example responses are&#x20;shown.</p>
<table-wrap id="T13" position="float">
<label>TABLE 13</label>
<caption>
<p>Real samples regarding the different values of the evidence component.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">User&#x2019;s response</th>
<th align="center">Evidence</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">&#x201c;In my opinion, a monkey is an intelligent being, as he presents aspects similar to those in humans, such as concern for the group, being able to perceive what is best for his community with its due limitations, motor intelligence, intelligence to solve situations that demand creativity.&#x201d;</td>
<td align="left">With</td>
</tr>
<tr>
<td align="left">&#x201c;Actually, yes, I do. It doesn&#x2019;t &#x201c;think humanely, or act humanely.&#x201d; I&#x2019;m not sure if it thinks rationally or not, but it acts rationally: seeking out light in order to maximize its nutritional opportunities. It also, as all plants, learns from experience, in that it grows to match environmental conditions.&#x201d;</td>
<td align="left">With</td>
</tr>
<tr>
<td align="left">&#x201c;I don&#x2019;t believe Google search engine meets the definition of intelligent because humans are behind the code of Google so Google itself is not doing the thinking. It is also only acting on what humans tell it to do. The only learning it might do is remembering what you&#x2019;ve searched for previously and remembering cookies.&#x201d;</td>
<td align="left">With</td>
</tr>
<tr>
<td align="left">&#x201c;Based on the definition provided the venus fly trap is not intelligent. I believe it meets some of the criteria (Thinks and acts rationally, learns from experience) but not all. It does not think or act humanly&#x201d;</td>
<td align="left">Without</td>
</tr>
<tr>
<td align="left">&#x201c;yes because it behaves humanly and can be able to adapt to changes to its environment&#x201d;</td>
<td align="left">Without</td>
</tr>
<tr>
<td align="left">&#x201c;A Table is unintelligent, because it cannot think like a human, move on its own or adapt behavior to a changing environment.&#x201d;</td>
<td align="left">Without</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Similar to the claim and warrant sections, we compared the same standard machine learning classifiers (K-Nearest Neighbors, SVM, Decision Tree, Random Forest, Ada Boost). The results are shown in <xref ref-type="table" rid="T14">Table&#x20;14</xref>.</p>
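The comparison reported in Table 14 can be sketched with scikit-learn's cross-validation utilities. This is a minimal illustration, assuming a feature matrix `X` and evidence labels `y` built from dataset 2 as described earlier; hyperparameters are library defaults, not necessarily those used in the study, and the helper `compare` is hypothetical.

```python
# Minimal sketch of the classifier comparison: mean and standard deviation
# of 10-fold F1-scores per model, mirroring the left half of Table 14.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Ada Boost": AdaBoostClassifier(random_state=0),
}

def compare(X, y):
    """Return (mean, std) of 10-fold F1-scores for each classifier."""
    results = {}
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
        results[name] = (scores.mean(), scores.std())
    return results
```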
<table-wrap id="T14" position="float">
<label>TABLE 14</label>
<caption>
<p>The result of 10-fold cross-validation on dataset 2 in detecting evidence and evaluation of performance on the held-out dataset.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Classifiers</th>
<th colspan="2" align="center">The result of 10-fold cross-validation on dataset 2</th>
<th colspan="2" align="center">The result of using dataset 1 as a held-out dataset</th>
</tr>
<tr>
<th align="center">Average of F1-scores</th>
<th align="center">Standard deviation of F1-scores</th>
<th align="center">Macro F1-score</th>
<th align="center">Accuracy</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">K-Nearest Neighbors</td>
<td align="char" char=".">0.87</td>
<td align="char" char=".">0.02</td>
<td align="char" char=".">0.70</td>
<td align="char" char=".">0.77</td>
</tr>
<tr>
<td align="left">SVM</td>
<td align="char" char=".">0.86</td>
<td align="char" char=".">0.01</td>
<td align="char" char=".">0.44</td>
<td align="char" char=".">0.70</td>
</tr>
<tr>
<td align="left">Decision Tree</td>
<td align="char" char=".">0.86</td>
<td align="char" char=".">0.01</td>
<td align="char" char=".">
<bold>0.72</bold>
</td>
<td align="char" char=".">0.77</td>
</tr>
<tr>
<td align="left">Random Forest</td>
<td align="char" char=".">
<bold>0.90</bold>
</td>
<td align="char" char=".">0.01</td>
<td align="char" char=".">
<bold>0.72</bold>
</td>
<td align="char" char=".">
<bold>0.81</bold>
</td>
</tr>
<tr>
<td align="left">Ada Boost</td>
<td align="char" char=".">0.88</td>
<td align="char" char=".">0.02</td>
<td align="char" char=".">0.63</td>
<td align="char" char=".">0.74</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the highest F1&#x2010;scores and accuracy.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Similar to the claim and warrant classifiers, GridSearchCV was used for fine-tuning the model&#x2019;s parameters by training on datasets 1 and 2 together. The tuning function was parameterized for training a Random Forest classifier with 5-fold cross-validation over different numbers of estimators and maximum depths. The values considered for the number of estimators were [100, 200, 300, 400, 500] and for the maximum depth [40, 50, 60, 70]. The optimal Random Forest in terms of average F1-score used 300 estimators and a maximum depth of 60. After finding the best parameters, a Random Forest classifier was trained on datasets 1 and 2. The performance of the model was assessed on the unseen test dataset 3 (<xref ref-type="table" rid="T15">Table&#x20;15</xref>).</p>
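The tuning step described above can be sketched as follows. The parameter grid mirrors the values reported in the text, while the scoring choice and random seed are illustrative assumptions.

```python
# Sketch of the GridSearchCV tuning step: 5-fold cross-validation over the
# estimator counts and maximum depths reported in the text. X and y (built
# from datasets 1 and 2) are assumed to be available.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [40, 50, 60, 70],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,          # 5-fold cross-validation, as in the article
    scoring="f1",  # optimize average F1-score (assumed binary scorer)
)
# search.fit(X, y) would then expose the best setting via
# search.best_params_ (in our experiments: 300 estimators, max depth 60).
```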
<table-wrap id="T15" position="float">
<label>TABLE 15</label>
<caption>
<p>The overall performance of detecting evidence on the test data (dataset 3).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Random forest classifier</th>
<th align="center">Precision</th>
<th align="center">Recall</th>
<th align="center">F1-score</th>
<th align="center">&#x23; of instances</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">With evidence</td>
<td align="char" char=".">0.83</td>
<td align="char" char=".">0.96</td>
<td align="char" char=".">0.89</td>
<td align="char" char=".">159</td>
</tr>
<tr>
<td align="left">Without evidence</td>
<td align="char" char=".">0.79</td>
<td align="char" char=".">0.42</td>
<td align="char" char=".">0.54</td>
<td align="char" char=".">52</td>
</tr>
<tr>
<td align="left">Accuracy</td>
<td colspan="4" align="char" char=".">0.83</td>
</tr>
<tr>
<td align="left">Cohen&#x2019;s &#x3ba;</td>
<td colspan="4" align="char" char=".">0.45</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Based on <xref ref-type="table" rid="T15">Table&#x20;15</xref>, the highest precision and recall were related to the category of &#x201c;with evidence.&#x201d; The overall accuracy and Cohen&#x2019;s &#x3ba; were 0.83 and 0.45, respectively. In addition to the metrics reported in <xref ref-type="table" rid="T15">Table&#x20;15</xref>, the average F1-score for this model was 0.80. These precision and recall values are overall very reasonable. In our case, identifying the evidence component is the most challenging task in comparison with the other components, since the evidence part of an argument is based on users&#x2019; experiences or observations. In contrast to the warrant classifier, for which users needed to mention warrants explicitly, there was no explicit answer for the evidence part. Thus, even if the evidence classifier wrongly predicts the category of &#x201c;without_evidence,&#x201d; the whole conversation remains coherent because, in this case, the agent simply asks the user to elaborate on the response.</p>
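The reported per-class precision, recall, and F1-scores together with Cohen's &#x3ba; can be computed with scikit-learn as in the following sketch, assuming gold labels `y_true` and predictions `y_pred` on the held-out test set; the helper `evaluate` is hypothetical.

```python
# Hypothetical evaluation helper: per-class precision/recall/F1 plus
# Cohen's kappa, i.e. the metrics reported in Tables 12 and 15.
from sklearn.metrics import classification_report, cohen_kappa_score

def evaluate(y_true, y_pred):
    """Return a dict of per-class metrics and Cohen's kappa.

    Label 0 is assumed to mean "without evidence" and label 1
    "with evidence" (an illustrative convention, not the study's code).
    """
    report = classification_report(
        y_true, y_pred,
        target_names=["without evidence", "with evidence"],
        output_dict=True,
    )
    report["cohen_kappa"] = cohen_kappa_score(y_true, y_pred)
    return report
```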
</sec>
</sec>
<sec id="s5-2">
<title>5.2 Can a Conditional Dialogue Structure With Conditions Based on the Existence of Components From Toulmin&#x2019;s Model of Argument Lead to Coherent Conversations in the Given Domain? (RQ3)</title>
<p>Above, we have ascertained that the existence of core components from Toulmin&#x2019;s model of argument can be detected reasonably well for the given dataset. In this section, we ask whether the availability of such classifiers enables us to create a conditional dialogue structure that can lead to coherent conversations (RQ3). We answer this research question by example, in the following sense: First, we are looking for only a single reasonable dialogue structure. There surely are many reasonable dialogue structures, but we need just one. Second, in the current study, we are interested only in showing that the dialogue structure can lead to coherent conversations, not in showing how often this is the case in a given setting. In assessing the quality of the conditional dialogue structure, we use the concept of conversational coherence as a quality indicator. By coherence we understand the linguistic concept, denoting the extent to which a text (in this case, the conversation between the agent and the user) is meaningful and the thoughts in it are well and logically connected. We thus treat conversational coherence as a fundamental quality that a tutorial conversation needs to&#x20;have.</p>
<p>Coherence in the case of a retrieval-based conversational agent relies on the quality of 1) the conditional dialogue structure and 2) the developed classifiers as well as the alignment of the two. The conditional dialogue structure needs to be well designed in that it is overall a reasonable path through a conversation, with an introduction, and a reasonable sequence of questions that suit the overall goal. The developed classifiers need to be able to decide between the conditional branches. The alignment between the two is necessary because depending on the quality of the classifiers, the responses of the agent need to show a different level of confidence toward the human user, in order to better perform in cases of the wrong classification.</p>
<p>In the example conditional dialogue structure below, the question about an entity&#x2019;s intelligence is embedded in a longer tutorial interaction. The interaction follows the revised version (<xref ref-type="bibr" rid="B3">Anderson et&#x20;al., 2001</xref>) of the taxonomy of educational goals proposed by <xref ref-type="bibr" rid="B4">Bloom and others, (1956)</xref>. In the revised version, the first four steps of the taxonomy are introducing knowledge, remembering, understanding, and applying the knowledge. The focus of this article is on the applying step. Below, we elaborate on each step.<list list-type="simple">
<list-item>
<p>&#x2022; Introduction: The introduction is adapted to a use case setting in which the definitions of intelligence have already previously been discussed, e.g., in an introductory lecture on artificial intelligence.</p>
</list-item>
<list-item>
<p>&#x2022; Remember: This part asks the user to repeat the learned definitions. Conditional branching with feedback can be designed for, but was outside our scope in this article.</p>
</list-item>
<list-item>
<p>&#x2022; Understand: This part asks the user to explain in own words. Conditional branching with feedback can be designed for, but was outside our scope in this article. Additionally, it could be advisable even if problematic reasoning were detected here, to proceed immediately to the application stage, in order to switch between concrete and abstract reasoning; and only to come back to this level of understanding after a successful argumentation on a concrete example was carried&#x20;out.</p>
</list-item>
<list-item>
<p>&#x2022; Apply: This part is the focus of the present article, and the goal of the dialogue structure below is to show how the classifiers that decide upon the existence of core components of Toulmin&#x2019;s model of argument can be used to decide between branches in the dialogue structure. The dialogue flowchart is illustrated in <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>; in the subsequent explanation, the identifiers in brackets denote the decision points from <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>. The classifiers are executed sequentially. We first check for the existence and direction of a claim (C2), and act on identifying a missing claim; then we check for the existence of a warrant (W2), and act on identifying a missing warrant; finally, we check for the existence of evidence (E2), and act on identifying missing evidence. Whenever a component (claim, warrant, evidence) is detected as missing, the agent uses increasing levels of scaffolding. The first scaffold is to point concretely to the missing core component (C3, W3, and E3); the second scaffold is to give the learner the start of an argumentative sentence or paragraph that just needs to be completed (C4, W4, and E4). When the last scaffold fails, in the current dialogue structure, the conversation is (gracefully) ended, currently by the agent apologizing for its own limited capability.</p>
</list-item>
</list>
</p>
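To illustrate, the sequential checks (C2, W2, E2) and the two scaffolding levels per component can be sketched in code. The function `next_action`, the prompt texts, and the predicate-based classifier stand-ins below are all hypothetical simplifications of the dialogue structure in Figure&#x20;2, not the implemented agent.

```python
# Hypothetical sketch of the conditional dialogue logic: classifiers run
# sequentially, and each missing component triggers up to two scaffolding
# prompts before the conversation ends gracefully.

PROMPTS = {  # illustrative prompt texts per component and scaffold level
    "claim": ["Please state your claim.",
              "Complete: 'I think <entity> is ...'"],
    "warrant": ["Which definition of intelligence supports this?",
                "Complete: 'It meets the definition of ...'"],
    "evidence": ["Can you give a concrete example?",
                 "Complete: 'For instance, ...'"],
}

def next_action(response, classifiers, attempts):
    """Decide the agent's next move for one user response.

    classifiers: dict mapping component name to a predicate on the response
                 (stand-ins for the trained claim/warrant/evidence models).
    attempts: mutable dict counting scaffolds already used per component.
    """
    for component in ("claim", "warrant", "evidence"):  # C2 -> W2 -> E2
        if not classifiers[component](response):
            level = attempts.get(component, 0)
            if level < len(PROMPTS[component]):
                attempts[component] = level + 1
                return PROMPTS[component][level]
            return "END"   # last scaffold failed: end gracefully
    return "DONE"          # all core components present
```

A response containing all three components yields "DONE" immediately; a response missing, say, the warrant is met first with a concrete pointer (W3) and then with a sentence starter (W4) before the conversation is ended.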
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>The different states that the agent reaches based on the user&#x2019;s responses regarding the main question of the conversation, &#x201c;<italic>Is &#x3c; an entity &#x3e; intelligent or not? Why?</italic>&#x201d;</p>
</caption>
<graphic xlink:href="frai-04-645516-g002.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="F3">Figures 3</xref>, <xref ref-type="fig" rid="F4">4</xref>, two example conversations are given that showcase coherent conversations. In the conversation shown in <xref ref-type="fig" rid="F3">Figure&#x20;3</xref>, the agent asks the user to argue whether and why a snake is intelligent. The user&#x2019;s response, &#x201c;<italic>A snake is intelligent because it is able to survive, which indicates the ability to adapt to changing circumstances</italic>&#x201d; (a response from dataset 2), passes through all three classifiers (claim, warrant, evidence; see <italic>How Well Can Components of Toulmin&#x2019;s Model of Argument Be Identified in the Given Domain? (RQ2)</italic>). Subsequently, the tutorial conversation ends. In the conversation shown in <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>, the agent asks the user to argue whether and why a sunflower is intelligent. The user&#x2019;s response, &#x201c;<italic>No, the sunflower is not intelligent</italic>,&#x201d; contains only the claim, but no warrant or evidence. In this case, the agent first shows its agreement and then asks for a warrant to complete the argument. From this step onward, it was us as authors who completed the remainder of the tutorial dialogue, just to show how a reasonable dialogue could ensue: The user&#x2019;s utterance, &#x201c;<italic>acting humanly/rationally</italic>,&#x201d; fills the gap of the warrant component. Now, the only missing component is the evidence. In this step, the agent requests the user to add evidence or background knowledge to justify the claim. If the agent cannot identify the missing components, it gives the user a second chance and asks again. As the agent again cannot find the evidence component, it asks the user to elaborate once more. This is the last chance: if the agent still cannot find the missing components, the conversation is ended. In <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>, the agent would need to find a connection between having no brain and thinking or acting in order to consider it as the evidence&#x20;part.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>A coherent conversation when all the core components were mentioned by the&#x20;user.</p>
</caption>
<graphic xlink:href="frai-04-645516-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>A coherent conversation when some of the core components were not mentioned by the&#x20;user.</p>
</caption>
<graphic xlink:href="frai-04-645516-g004.tif"/>
</fig>
<p>The agent&#x2019;s responses and follow-up questions are selected based on the predictions of the classifiers, so the conversations are coherent if the classifiers perform correctly. If they do not, the conversations can still be coherent, as shown in the second example. This is achieved by the agent expressing uncertainty when asking the user (&#x3d;learner) to elaborate (C4, W4, and&#x20;E4)<xref ref-type="fn" rid="fn6">
<sup>5</sup>
</xref>.</p>
<p>The above example conversations show that coherence can be achieved with the example conditional dialogue structure that makes use of classifiers that identify the existence of claims, warrants, and evidence as core components of Toulmin&#x2019;s model of argument.</p>
<p>Note that the example conditional dialogue structure does not show how incoherent user responses can be caught and reacted to; and appropriate responses to wrong answers for the stages of remembering and understanding are not discussed either in this article.</p>
</sec>
<sec id="s5-3">
<title>5.3 Can Toulmin&#x2019;s Model of Argument be Used to Model Different Types of Structural Wrongness Within Conversational Agents in the Given Domain? (RQ1)</title>
<p>In this subsection, we respond to the overarching research question, whether and how Toulmin&#x2019;s model of argument is a suitable basis for modeling different types of the wrongness of arguments for use within conversational agents (RQ1). Answering this research question will also immediately lead over to a broader discussion of our work in <italic>Discussion and Conclusion</italic>.</p>
<p>First, we point out that using Toulmin&#x2019;s model of argument allows us to assess structural characteristics of responses and in this sense structural quality. This means that we can detect the existence of components of a reasonable argument, but we cannot&#x2014;not by using Toulmin&#x2019;s model of arguments&#x2014;say anything more about the content-wise plausibility.</p>
<p>For this purpose, we find that Toulmin&#x2019;s model of argument works very well: With a comparatively small dataset, we were able to develop reasonably accurate classifiers (see <italic>How Well Can Components of Toulmin&#x2019;s Model of Argument Be Identified in the Given Domain? (RQ2)</italic>) that are useful within a conditional dialogue structure to decide between branches (RQ2). The developed classifiers identify the existence of the necessary components (claim, warrant, and evidence). Even though the classifiers model the structural quality of users&#x2019; messages, this assessment is also related to content: the &#x201c;warrant&#x201d; classifier&#x2019;s features draw substantially on the content of the pre-defined definitions, and the &#x201c;evidence&#x201d; classifier&#x2019;s features draw substantially on content-related keywords that relate to how people argue about the intelligence or non-intelligence of entities. This highlights that, with respect to quality, structure and content are interrelated.</p>
<p>Despite this dependence of assessing structural quality on content-related features (&#x223c;quality indicators), we secondly observe that identifying the existence of Toulmin&#x2019;s core argumentative components does not per se allow us to assess the content-wise plausibility of the argument made. For instance, in the response &#x201c;<italic>yes because it acts rationally by providing humans comfort</italic>,&#x201d; in which <italic>it</italic> refers to an office chair, all the core components of Toulmin&#x2019;s model of argument are mentioned, but the content is arguable. However, we could use Toulmin&#x2019;s argument components to model different types of wrongness: For instance, it could be that the evidence per se is not correct (a fictional example would be to say that &#x201c;snakes are regularly observed to talk with each other in sign language&#x201d;); it could also be, however, that the given evidence does not usefully relate to the used definition (&#x201c;sunflowers move their heads with the direction of the Sun, which shows that they learn from experience&#x201d;). More generally speaking, each component of Toulmin&#x2019;s model of argument can have an independent value of &#x201c;correctness&#x201d; (whereby the value, in general, cannot be assumed to be binary), as well as interconnected values of content-wise quality in terms of how well, content-wise, the different parts of the argument align with each&#x20;other.</p>
<p>Following this observation, we ask how such content-wise quality assessment can be implemented. The answer can be found both in the existing literature and in future work: In the existing literature on argument mining, both the identification of arguments similar to the one made by the user and the identification of groups of arguments have been treated (<xref ref-type="bibr" rid="B1">Adamson et&#x20;al., 2014</xref>). Argument similarity can be used when expert statements are available in the sense of a gold standard, and grouping arguments can be used when agreement with a majority opinion is a good marker of argument quality. On the other hand, detecting argument components at a finer granularity, and assessing whether components that are individually correct, or at least sufficiently reasonable, are linked reasonably so as to identify further problems, remains a topic for future research.</p>
</sec>
</sec>
<sec id="s6">
<title>6 Discussion and Conclusion</title>
<p>In summary, our work indicates that Toulmin&#x2019;s model of arguments provides a useful conceptual structure on which to base classifiers that help a conversational agent decide between different branches in a conversation that supports learning.</p>
<p>In the present article, we have shown this for a particular conversation around in what sense a given entity is regarded as intelligent or not. We have shown this based on a dataset of answers to this question that has been collected outside of a conversational agent. Furthermore, we used our results to show a dialogue structure for a conversational agent based on the developed classifiers.</p>
<p>Our work also has several limitations: First, in our data collection task, study participants received a specific explanation of what constitutes a good argument. Our concern was to have sufficient numbers of arguments that contain all components of Toulmin&#x2019;s model. Furthermore, given that our research is on educational technology, it is reasonable to expect that users would receive some such explanation. However, in settings where no a priori explanation is given, it is to be expected that the distribution of classes (which components of Toulmin&#x2019;s model exist in a given user statement) differs from the distribution in our dataset; subsequently, the performance of the developed classifiers will&#x20;vary.</p>
<p>Second, we have shown these results for a particular conversation. The classifiers use domain-specific (i.e.,&#x20;dataset-specific) features, like the limited-length TFIDF vectors, or the thirty terms most highly correlated with the &#x201c;evidence&#x201d; label (see <italic>Feature Selection</italic>). This means that, for different conversation topics, some feature re-engineering would still need to occur. While our <italic>approach</italic> to feature engineering can be assumed to generalize, this is 1) an assumption and 2) will still result in different concrete features. An example of a different conversation that is structurally similar to the one discussed in this article is an ethical dilemma. By definition, ethical dilemmas are situations in which, depending on the underlying prioritization of different obligations, different courses of action would be reasonable. Such conversations could be conceptualized in Toulmin&#x2019;s model of argument as stating which course of action one would choose (claim), laying out which obligation was most highly prioritized in choosing this course of action (warrant), and giving additional reasoning as to why the chosen priority is reasonable (evidence).</p>
<p>Third, as discussed above in <italic>Can Toulmin&#x2019;s Model Of Argument Be Used To Model Different Types Of Structural Wrongness Within Conversational Agents In The Given Domain? (RQ1)</italic>, while we do argue that Toulmin&#x2019;s model of arguments can also be used to structure identifying content-wise&#x20;types of wrongness in arguments by means of argument mining, we have not shown this in the present work. Finally, we&#x20;have not shown the effect of conversing with the agent on&#x20;actual learning in an experimental study with human subjects.</p>
<p>These limitations also point toward interesting future work and represent research challenges that are widely understood to be ambitious and are being addressed in educational technology and conversational agent research at large: transferability of domain-specific classifiers; identifying more complex types of wrongness in arguments (i.e., argumentations where single components may make sense but do not fit together, as discussed toward the end of <italic>Can Toulmin&#x2019;s Model Of Argument Be Used To Model Different Types Of Structural Wrongness Within Conversational Agents In The Given Domain? (RQ1)</italic>); and the effectiveness of conversational agents as intelligent tutors in comparison with other teaching methods.</p>
<p>Knowing these limitations, the contributions that this article makes to the state of the art are 1) to give evidence that reasonably accurate classifiers can be built for the existence of single components of Toulmin&#x2019;s model of arguments in (short) argumentative statements as would be expected in the context of a conversation with an intelligent agent in a given domain, 2) to show by argument that such classifiers are useful within dialogue structures of conversational agents that are designed based on Bloom&#x2019;s taxonomy for learning, and 3) to show by argument how the same conceptual structure of Toulmin&#x2019;s model of argument can be used to further structure the identification of more complex types of faulty argumentation. These contributions complement existing research that has worked on longer argumentative essays (<xref ref-type="bibr" rid="B75">Wambsganss et&#x20;al., 2020</xref>), which has conceptualized argumentation quality differently, in a way that is less suitable for direct feedback within a conversational agent, as well as broader work on argumentation mining for identifying groups of similar arguments (<xref ref-type="bibr" rid="B72">Wachsmuth et&#x20;al., 2017b</xref>) and on conversational agents for factual teaching (<xref ref-type="bibr" rid="B55">Ruan et&#x20;al., 2019</xref>).</p>
</sec>
</body>
<back>
<sec id="s7">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="sec" rid="s12">Supplementary Material</xref>, further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s8">
<title>Author Contributions</title>
<p>BM implemented the described conversational agent, set-up data collection, implemented the described classifiers, and data analysis; researched literature, and wrote substantial parts of the paper. VP-S set up the overall research plan, contributed literature, contributed on the dialogue structures, collaborated on the data collection strategy and materials, discussed and collaborated on the described classifiers and data analysis, and wrote substantial parts of the paper.</p>
</sec>
<sec id="s9">
<title>Funding</title>
<p>This work was supported by the &#x201c;DDAI&#x201d; COMET Module within the COMET&#x2014;Competence Centers for Excellent Technologies Program, funded by the Austrian Federal Ministry (BMK and BMDW), the Austrian Research Promotion Agency (FFG), the province of Styria (SFG) and partners from industry and academia. The COMET Program is managed by&#x20;FFG.</p>
</sec>
<sec sec-type="COI-statement" id="s10">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s12">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frai.2021.645516/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frai.2021.645516/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet1.zip" id="SM1" mimetype="application/zip" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<fn-group>
<fn id="fn2">
<label>1</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://www.cs.cornell.edu/people/tj/svm_light/svm_hmm.html">https://www.cs.cornell.edu/people/tj/svm_light/svm_hmm.html</ext-link>
</p>
</fn>
<fn id="fn3">
<label>2</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://www.mturk.com/">https://www.mturk.com/</ext-link>
</p>
</fn>
<fn id="fn4">
<label>3</label>
<p>The data have been uploaded as <xref ref-type="sec" rid="s12">Supplementary Material</xref>
</p>
</fn>
<fn id="fn5">
<label>4</label>
<p>Version 0.23.2 of Scikit-learn was used in this&#x20;study.</p>
</fn>
<fn id="fn6">
<label>5</label>
<p>All agent&#x2019;s responses and branches are listed in a table (<xref ref-type="sec" rid="s12">Supplementary Table S1</xref>) in the supplementary materials</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Adamson</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Dyke</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Jang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ros&#xe9;</surname>
<given-names>C. P.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Towards an Agile Approach to Adapting Dynamic Collaboration Support to Student Needs</article-title>. <source>Int. J.&#x20;Artif. Intell. Educ.</source> <volume>24</volume> (<issue>1</issue>), <fpage>92</fpage>&#x2013;<lpage>124</lpage>. <pub-id pub-id-type="doi">10.1007/s40593-013-0012-6</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Aguiar</surname>
<given-names>E. V. B.</given-names>
</name>
<name>
<surname>Tarouco</surname>
<given-names>L. M. R.</given-names>
</name>
<name>
<surname>Reategui</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Supporting Problem-Solving in Mathematics with a Conversational Agent Capable of Representing Gifted Students&#x27; Knowledge</article-title>,&#x201d; in <conf-name>2014 47th Hawaii International Conference on System Sciences</conf-name> (<publisher-loc>Waikoloa, HI, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>130</fpage>&#x2013;<lpage>137</lpage>. <pub-id pub-id-type="doi">10.1109/HICSS.2014.24</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Anderson</surname>
<given-names>L. W.</given-names>
</name>
<name>
<surname>Krathwohl</surname>
<given-names>D. R.</given-names>
</name>
<name>
<surname>Bloom</surname>
<given-names>B. S.</given-names>
</name>
</person-group> (<year>2001</year>). <article-title>A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom&#x2019;s Taxonomy of Educational Objectives</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://books.google.com/books?id=JPkXAQAAMAAJ&amp;pgis=1">http://books.google.com/books?id&#x3d;JPkXAQAAMAAJ&#x26;pgis&#x3d;1</ext-link>
</comment>. </citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bloom</surname>
<given-names>B. S.</given-names>
</name>
</person-group>
<collab>others</collab> (<year>1956</year>). <source>Taxonomy of Educational Objectives. Vol. 1: Cognitive Domain</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>McKay</publisher-name>, <fpage>20</fpage>&#x2013;<lpage>24</lpage>. </citation>
</ref>
<ref id="B5">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Boltu&#x17e;i&#x107;</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>&#x160;najder</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Back up Your Stance: Recognizing Arguments in Online Discussions</article-title>,&#x201d; in <conf-name>Proceedings of the First Workshop on Argumentation Mining</conf-name>, <publisher-loc>Baltimore, MD</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>, <fpage>49</fpage>&#x2013;<lpage>58</lpage>. <pub-id pub-id-type="doi">10.3115/v1/w14-2107</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breiman</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2001</year>). <article-title>Random Forests</article-title>. <source>Machine Learn.</source> <volume>45</volume> (<issue>1</issue>), <fpage>5</fpage>&#x2013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1023/a:1010933404324</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cabrio</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Villata</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>A Natural Language Bipolar Argumentation Approach to Support Users in Online Debate Interactions&#x2020;</article-title>. <source>Argument Comput.</source> <volume>4</volume> (<issue>3</issue>), <fpage>209</fpage>&#x2013;<lpage>230</lpage>. <pub-id pub-id-type="doi">10.1080/19462166.2013.862303</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Cabrio</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Villata</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Five Years of Argument Mining: A Data-Driven Analysis</article-title>,&#x201d; in <conf-name>IJCAI International Joint Conference on Artificial Intelligence</conf-name>, <conf-loc>Stockholm</conf-loc>, <conf-date>July, 2018</conf-date>, <fpage>5427</fpage>&#x2013;<lpage>5433</lpage>. <pub-id pub-id-type="doi">10.24963/ijcai.2018/766</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cahn</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Chatbot: Architecture, Design &#x26; Development</source>. <publisher-loc>Pennsylvania</publisher-loc>: <publisher-name>University of Pennsylvania School of Engineering and Applied Science Department of Computer and Information Science</publisher-name>, <fpage>46</fpage>. </citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cai</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Grossman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Z. J.</given-names>
</name>
<name>
<surname>Sheng</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Tian-Zheng Wei</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>J.&#x20;J.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <source>Bandit Algorithms to Personalize Educational Chatbots</source>. </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chalaguine</surname>
<given-names>L. A.</given-names>
</name>
<name>
<surname>Hunter</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A Persuasive Chatbot Using a Crowd-Sourced Argument Graph and Concerns</article-title>. <source>Front. Artif. Intelligence Appl.</source> <volume>326</volume>, <fpage>9</fpage>&#x2013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.3233/FAIA200487</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chalaguine</surname>
<given-names>L. A.</given-names>
</name>
<name>
<surname>Hunter</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Potts</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Hamilton</surname>
<given-names>F. L.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Impact of Argument Type and Concerns in Argumentation with a Chatbot</article-title>,&#x201d; in <conf-name>Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI</conf-name>, <conf-date>November, 2019</conf-date> (<publisher-loc>Portland, OR</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1557</fpage>&#x2013;<lpage>1562</lpage>. <pub-id pub-id-type="doi">10.1109/ICTAI.2019.00224</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chawla</surname>
<given-names>N. V.</given-names>
</name>
<name>
<surname>Bowyer</surname>
<given-names>K. W.</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>L. O.</given-names>
</name>
<name>
<surname>Kegelmeyer</surname>
<given-names>W. P.</given-names>
</name>
</person-group> (<year>2002</year>). <article-title>SMOTE: Synthetic Minority Over-sampling Technique</article-title>. <source>J.&#x20;Artif. Intelligence Res.</source> <volume>16</volume>, <fpage>321</fpage>&#x2013;<lpage>357</lpage>. <pub-id pub-id-type="doi">10.1613/jair.953</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Dusmanu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cabrio</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Villata</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Argument Mining on Twitter: Arguments, Facts and Sources</article-title>,&#x201d; in <conf-name>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</conf-name> (<publisher-loc>Copenhagen, Denmark</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>2317</fpage>&#x2013;<lpage>2322</lpage>. <pub-id pub-id-type="doi">10.18653/v1/d17-1245</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Dzikovska</surname>
<given-names>M. O.</given-names>
</name>
<name>
<surname>Moore</surname>
<given-names>J.&#x20;D.</given-names>
</name>
<name>
<surname>Steinhauser</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Campbell</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Farrow</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Callaway</surname>
<given-names>C. B.</given-names>
</name>
</person-group> (<year>2010</year>). &#x201c;<article-title>Beetle II: A System for Tutoring and Computational Linguistics Experimentation</article-title>,&#x201d; in <conf-name>Proceedings of the ACL 2010 System Demonstrations</conf-name> (<publisher-loc>Uppsala, Sweden</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>13</fpage>&#x2013;<lpage>18</lpage>. </citation>
</ref>
<ref id="B16">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Eckle-Kohler</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kluge</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Gurevych</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>On the Role of Discourse Markers for Discriminating Claims and Premises in Argumentative Discourse</article-title>,&#x201d; in <conf-name>Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</conf-name> (<publisher-loc>Lisbon, Portugal</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>2236</fpage>&#x2013;<lpage>2242</lpage>. <pub-id pub-id-type="doi">10.18653/v1/d15-1267</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Emran</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Shaalan</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>A Survey of Intelligent Language Tutoring Systems</article-title>,&#x201d; in <conf-name>Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2014</conf-name> (<publisher-loc>Delhi, India</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>393</fpage>&#x2013;<lpage>399</lpage>. <pub-id pub-id-type="doi">10.1109/ICACCI.2014.6968503</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Erduran</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Simon</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Osborne</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>TAPping into Argumentation: Developments in the Application of Toulmin&#x27;s Argument Pattern for Studying Science Discourse</article-title>. <source>Sci. Ed.</source> <volume>88</volume> (<issue>6</issue>), <fpage>915</fpage>&#x2013;<lpage>933</lpage>. <pub-id pub-id-type="doi">10.1002/sce.20012</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ferrara</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Montanelli</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Petasis</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Unsupervised Detection of Argumentative Units Though Topic Modeling Techniques</article-title>,&#x201d; in <conf-name>Proceedings of the 4th Workshop on Argument Mining</conf-name>. <publisher-loc>Copenhagen, Denmark</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>, <fpage>97</fpage>&#x2013;<lpage>107</lpage>. </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Frize</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Frasson</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>Decision-Support and Intelligent Tutoring Systems in Medical Education</article-title>. <source>Clin. Invest. Med.</source> <volume>23</volume> (<issue>4</issue>), <fpage>266</fpage>&#x2013;<lpage>269</lpage>. </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Garcia-Mila</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gilabert</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Erduran</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Felton</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>The Effect of Argumentative Task Goal on the Quality of Argumentative Discourse</article-title>. <source>Sci. Ed.</source> <volume>97</volume> (<issue>4</issue>), <fpage>497</fpage>&#x2013;<lpage>523</lpage>. <pub-id pub-id-type="doi">10.1002/sce.21057</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Gertner</surname>
<given-names>A. S.</given-names>
</name>
<name>
<surname>VanLehn</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2000</year>). &#x201c;<article-title>Andes: A Coached Problem Solving Environment for Physics</article-title>,&#x201d; in <conf-name>International Conference on Intelligent Tutoring Systems</conf-name> (<publisher-loc>Montr&#x00E9;al, Canada</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>133</fpage>&#x2013;<lpage>142</lpage>. </citation>
</ref>
<ref id="B23">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Goudas</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Louizos</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Petasis</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Karkaletsis</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Argument Extraction from News, Blogs, and Social Media</article-title>,&#x201d; in <conf-name>Hellenic Conference on Artificial Intelligence</conf-name>, <fpage>287</fpage>&#x2013;<lpage>299</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-07064-3_23</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Graesser</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Suresh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Harter</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Person</surname>
<given-names>N. K.</given-names>
</name>
<name>
<surname>Louwerse</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2001</year>). <article-title>AutoTutor: An Intelligent Tutor and Conversational Tutoring Scaffold</article-title>. <source>Proc. AIED.</source> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Graesser</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Wiemer-Hastings</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Wiemer-Hastings</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Kreuz</surname>
<given-names>R.</given-names>
</name>
</person-group>
<collab>Tutoring Research Group</collab>
<collab>others</collab> (<year>1999</year>). <article-title>AutoTutor: A Simulation of a Human Tutor</article-title>. <source>Cogn. Syst. Res.</source> <volume>1</volume> (<issue>1</issue>), <fpage>35</fpage>&#x2013;<lpage>51</lpage>. <pub-id pub-id-type="doi">10.1016/s1389-0417(99)00005-4</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Graesser</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Wiemer-Hastings</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wiemer-Hastings</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Harter</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Person</surname>
<given-names>N.</given-names>
</name> <collab>Tutoring Research Group</collab>
</person-group> (<year>2000</year>). <article-title>Using Latent Semantic Analysis to&#x20;Evaluate the Contributions of Students in AutoTutor</article-title>. <source>Interactive Learn.&#x20;Environments</source> <volume>8</volume> (<issue>2</issue>), <fpage>129</fpage>&#x2013;<lpage>147</lpage>. <pub-id pub-id-type="doi">10.1076/1049-4820(200008)8:2;1-b;ft129</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Habernal</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Gurevych</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Argumentation Mining in User-Generated Web Discourse</article-title>. <source>Comput. Linguistics</source> <volume>43</volume> (<issue>1</issue>), <fpage>125</fpage>&#x2013;<lpage>179</lpage>. <pub-id pub-id-type="doi">10.1162/coli_a_00276</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hussain</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sianaki</surname>
<given-names>O. A.</given-names>
</name>
<name>
<surname>Ababneh</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2019</year>). <source>A Survey on Conversational Agents/Chatbots Classification and Design Techniques</source>. <publisher-loc>Berlin, Germany</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>. <pub-id pub-id-type="doi">10.1007/978-3-030-15035-8</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Io</surname>
<given-names>H. N.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>C. B.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Chatbots and Conversational Agents: A Bibliometric Analysis</article-title>,&#x201d; in <conf-name>2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM)</conf-name>, <conf-date>December, 2017</conf-date>, <fpage>215</fpage>&#x2013;<lpage>219</lpage>. <pub-id pub-id-type="doi">10.1109/IEEM.2017.8289883</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Joachims</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>1998</year>). &#x201c;<article-title>Text Categorization with Support Vector Machines: Learning with Many Relevant Features</article-title>,&#x201d; in <conf-name>European Conference on Machine Learning</conf-name>, <fpage>137</fpage>&#x2013;<lpage>142</lpage>. </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Joachims</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Finley</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>C.-N. J.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Cutting-Plane Training of Structural SVMs</article-title>. <source>Mach Learn.</source> <volume>77</volume> (<issue>1</issue>), <fpage>27</fpage>&#x2013;<lpage>59</lpage>. <pub-id pub-id-type="doi">10.1007/s10994-009-5108-8</pub-id> </citation>
</ref>
<ref id="B32">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Gweon</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Comparing Data from Chatbot and Web Surveys</article-title>,&#x201d; in <conf-name>Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems</conf-name>, <fpage>1</fpage>&#x2013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1145/3290605.3300316</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koedinger</surname>
<given-names>K. R.</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>J.&#x20;R.</given-names>
</name>
<name>
<surname>Hadley</surname>
<given-names>W. H.</given-names>
</name>
<name>
<surname>Mark</surname>
<given-names>M. A.</given-names>
</name>
</person-group> (<year>1997</year>). <article-title>Intelligent Tutoring Goes to School in the Big City</article-title>. <source>Int. J.&#x20;Artif. Intell. Educ.</source> <volume>8</volume> (<issue>1</issue>), <fpage>30</fpage>&#x2013;<lpage>43</lpage>. </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koedinger</surname>
<given-names>K. R.</given-names>
</name>
<name>
<surname>Brunskill</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>R. S. J.&#x20;d.</given-names>
</name>
<name>
<surname>McLaughlin</surname>
<given-names>E. A.</given-names>
</name>
<name>
<surname>Stamper</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>New Potentials for Data-Driven Intelligent Tutoring System Development and Optimization</article-title>. <source>AIMag</source> <volume>34</volume> (<issue>3</issue>), <fpage>27</fpage>&#x2013;<lpage>41</lpage>. <pub-id pub-id-type="doi">10.1609/aimag.v34i3.2484</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Landis</surname>
<given-names>J.&#x20;R.</given-names>
</name>
<name>
<surname>Koch</surname>
<given-names>G. G.</given-names>
</name>
</person-group> (<year>1977</year>). <article-title>The Measurement of Observer Agreement for Categorical Data</article-title>. <source>Biometrics</source> <volume>33</volume> (<issue>1</issue>), <fpage>159</fpage>&#x2013;<lpage>174</lpage>. <pub-id pub-id-type="doi">10.2307/2529310</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Le</surname>
<given-names>D. T.</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>C.-T.</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>K. A.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Dave the Debater: A Retrieval-Based and Generative Argumentative Dialogue Agent</article-title>,&#x201d; in <conf-name>Proceedings of the 5th Workshop on Argument Mining</conf-name> (<publisher-loc>Brussels, Belgium</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>121</fpage>&#x2013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.18653/v1/w18-5215</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Levy</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Bilu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Hershcovich</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Aharoni</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Slonim</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Context Dependent Claim Detection</article-title>,&#x201d; in <conf-name>Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers</conf-name> (<publisher-loc>Dublin, Ireland</publisher-loc>: <publisher-name>Dublin City University and Association for Computational Linguistics</publisher-name>), <fpage>1489</fpage>&#x2013;<lpage>1500</lpage>. </citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lippi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Torroni</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Argumentation Mining</article-title>. <source>ACM Trans. Internet Technol.</source> <volume>16</volume> (<issue>2</issue>), <fpage>1</fpage>&#x2013;<lpage>25</lpage>. <pub-id pub-id-type="doi">10.1145/2850417</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Long</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Magerko</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>What Is AI Literacy? Competencies and Design Considerations</article-title>,&#x201d; in <conf-name>Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems</conf-name>, <conf-date>April 25&#x2013;30, 2020</conf-date> (<publisher-loc>Honolulu, Hawai&#x02BB;i, United States</publisher-loc>: <publisher-name>ACM SIGCHI</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1145/3313831.3376727</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Martin</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Kirkbride</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Mitrovic</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Holland</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zakharov</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>An Intelligent Tutoring System for Medical Imaging</article-title>,&#x201d; in <conf-name>E-Learn: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education</conf-name>. <publisher-loc>Canada</publisher-loc>: <publisher-name>AACE</publisher-name>, <fpage>502</fpage>&#x2013;<lpage>509</lpage>. </citation>
</ref>
<ref id="B42">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Melis</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Siekmann</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2004</year>). &#x201c;<article-title>ActiveMath: An Intelligent Tutoring System for Mathematics</article-title>,&#x201d; in <conf-name>International Conference on Artificial Intelligence and Soft Computing</conf-name> (<publisher-loc>Poland</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>91</fpage>&#x2013;<lpage>101</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-540-24844-6_12</pub-id> </citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mochales</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Moens</surname>
<given-names>M.-F.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Argumentation Mining</article-title>. <source>Artif. Intell. L.</source> <volume>19</volume> (<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1007/s10506-010-9104-x</pub-id> </citation>
</ref>
<ref id="B44">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Moens</surname>
<given-names>M.-F.</given-names>
</name>
<name>
<surname>Boiy</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Palau</surname>
<given-names>R. M.</given-names>
</name>
<name>
<surname>Reed</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2007</year>). &#x201c;<article-title>Automatic Detection of Arguments in Legal Texts</article-title>,&#x201d; in <conf-name>Proceedings of the 11th International Conference on Artificial Intelligence and Law</conf-name> (<publisher-loc>Portugal</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>225</fpage>&#x2013;<lpage>230</lpage>. <pub-id pub-id-type="doi">10.1145/1276318.1276362</pub-id> </citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Montenegro</surname>
<given-names>J.&#x20;L. Z.</given-names>
</name>
<name>
<surname>da Costa</surname>
<given-names>C. A.</given-names>
</name>
<name>
<surname>Righi</surname>
<given-names>R. d. R.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Survey of Conversational Agents in Health</article-title>. <source>Expert Syst. Appl.</source> <volume>129</volume>, <fpage>56</fpage>&#x2013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2019.03.054</pub-id> </citation>
</ref>
<ref id="B46">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>M&#xfc;ller</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Mattke</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Maier</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Weitzel</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Conversational Agents in Healthcare: Using QCA to Explain Patients&#x2019; Resistance to Chatbots for Medication</article-title>,&#x201d; in <conf-name>International Workshop on Chatbot Research and Design</conf-name> (<publisher-loc>Amsterdam, Netherlands</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>3</fpage>&#x2013;<lpage>18</lpage>. </citation>
</ref>
<ref id="B47">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Palau</surname>
<given-names>R. M.</given-names>
</name>
<name>
<surname>Moens</surname>
<given-names>M.-F.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Argumentation Mining</article-title>,&#x201d; in <conf-name>Proceedings of the 12th International Conference on Artificial Intelligence and Law</conf-name> (<publisher-loc>Spain</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>98</fpage>&#x2013;<lpage>107</lpage>. <pub-id pub-id-type="doi">10.1145/1568234.1568246</pub-id> </citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pedregosa</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Varoquaux</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Gramfort</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Michel</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Bertrand</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Grisel</surname>
<given-names>O.</given-names>
</name>
<etal/>
</person-group> (<year>2011</year>). <article-title>Scikit-Learn: Machine Learning in Python</article-title>. <source>J.&#x20;Machine Learn. Res.</source> <volume>12</volume>, <fpage>2825</fpage>&#x2013;<lpage>2830</lpage>. </citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>P&#xe9;rez-Mar&#xed;n</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Boza</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>A Procedure to Create a Pedagogic Conversational Agent in Secondary Physics and Chemistry Education</article-title>. <source>Int. J.&#x20;Inf. Commun. Technol. Educ.</source> <volume>9</volume> (<issue>4</issue>), <fpage>94</fpage>&#x2013;<lpage>112</lpage>. <pub-id pub-id-type="doi">10.4018/ijicte.2013100107</pub-id> </citation>
</ref>
<ref id="B50">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Persing</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Unsupervised Argumentation Mining in Student Essays</article-title>,&#x201d; in <conf-name>Proceedings of The 12th Language Resources and Evaluation Conference</conf-name> (<publisher-loc>Turkey</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>), <fpage>6795</fpage>&#x2013;<lpage>6803</lpage>. </citation>
</ref>
<ref id="B51">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Poudyal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Goncalves</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Quaresma</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Experiments on Identification of Argumentative Sentences</article-title>,&#x201d; in <conf-name>2016 10th International Conference on Software, Knowledge, Information Management &#x26; Applications (SKIMA)</conf-name> (<publisher-loc>Chengdu, China</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>398</fpage>&#x2013;<lpage>403</lpage>. <pub-id pub-id-type="doi">10.1109/skima.2016.7916254</pub-id> </citation>
</ref>
<ref id="B52">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rakshit</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Bowden</surname>
<given-names>K. K.</given-names>
</name>
<name>
<surname>Reed</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Misra</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Walker</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Debbie, the Debate Bot of the Future</article-title>. <source>Lecture Notes Electr. Eng.</source> <volume>510</volume>, <fpage>45</fpage>&#x2013;<lpage>52</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-92108-2_5</pub-id> </citation>
</ref>
<ref id="B53">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Rinott</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Dankin</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Alzate</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Khapra</surname>
<given-names>M. M.</given-names>
</name>
<name>
<surname>Aharoni</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Slonim</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Show Me Your Evidence-An Automatic Method for Context Dependent Evidence Detection</article-title>,&#x201d; in <conf-name>Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</conf-name> (<publisher-loc>Lisbon, Portugal</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>440</fpage>&#x2013;<lpage>450</lpage>. <pub-id pub-id-type="doi">10.18653/v1/d15-1050</pub-id> </citation>
</ref>
<ref id="B54">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Rooney</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Browne</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>Applying Kernel Methods to Argumentation Mining</article-title>,&#x201d; in <conf-name>FLAIRS Conference</conf-name> (<publisher-loc>Florida</publisher-loc>: <publisher-name>AAAI</publisher-name>), <volume>Vol. 172</volume>. </citation>
</ref>
<ref id="B55">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ruan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tham</surname>
<given-names>B. J.-K.</given-names>
</name>
<name>
<surname>Qiu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>QuizBot: A Dialogue-Based Adaptive Learning System for Factual Knowledge</article-title>. <source>Conf. Hum. Factors Comput. Syst. - Proc.</source> <volume>13</volume>, <fpage>1</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1145/3290605.3300587</pub-id> </citation>
</ref>
<ref id="B56">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Russell</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Norvig</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2002</year>). <source>Artificial Intelligence: A Modern Approach</source>. <publisher-loc>NJ</publisher-loc>: <publisher-name>Prentice-Hall</publisher-name>. </citation>
</ref>
<ref id="B57">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sabo</surname>
<given-names>K. E.</given-names>
</name>
<name>
<surname>Atkinson</surname>
<given-names>R. K.</given-names>
</name>
<name>
<surname>Barrus</surname>
<given-names>A. L.</given-names>
</name>
<name>
<surname>Joseph</surname>
<given-names>S. S.</given-names>
</name>
<name>
<surname>Perez</surname>
<given-names>R. S.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Searching for the Two Sigma Advantage: Evaluating Algebra Intelligent Tutors</article-title>. <source>Comput. Hum. Behav.</source> <volume>29</volume> (<issue>4</issue>), <fpage>1833</fpage>&#x2013;<lpage>1840</lpage>. <pub-id pub-id-type="doi">10.1016/j.chb.2013.03.001</pub-id> </citation>
</ref>
<ref id="B58">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sardianos</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Katakis</surname>
<given-names>I. M.</given-names>
</name>
<name>
<surname>Petasis</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Karkaletsis</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Argument Extraction from News</article-title>,&#x201d; in <conf-name>Proceedings of the 2nd Workshop on Argumentation Mining</conf-name> (<publisher-loc>Denver, Colorado</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>56</fpage>&#x2013;<lpage>66</lpage>. <pub-id pub-id-type="doi">10.3115/v1/w15-0508</pub-id> </citation>
</ref>
<ref id="B59">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Shnarch</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Levy</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Raykar</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Slonim</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>GrASP: Rich Patterns for Argumentation Mining</article-title>,&#x201d; in <conf-name>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</conf-name> (<publisher-loc>Copenhagen, Denmark</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>1345</fpage>&#x2013;<lpage>1350</lpage>. <pub-id pub-id-type="doi">10.18653/v1/d17-1140</pub-id> </citation>
</ref>
<ref id="B60">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simon</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Using Toulmin&#x27;s Argument Pattern in the Evaluation of Argumentation in School Science</article-title>. <source>Int. J.&#x20;Res. Method Educ.</source> <volume>31</volume> (<issue>3</issue>), <fpage>277</fpage>&#x2013;<lpage>289</lpage>. <pub-id pub-id-type="doi">10.1080/17437270802417176</pub-id> </citation>
</ref>
<ref id="B61">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simosi</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2003</year>). <article-title>Using Toulmin&#x2019;s Framework for the Analysis of Everyday Argumentation: Some Methodological Considerations</article-title>. <source>Argumentation</source> <volume>17</volume> (<issue>2</issue>), <fpage>185</fpage>&#x2013;<lpage>202</lpage>. <pub-id pub-id-type="doi">10.1023/A:1024059024337</pub-id> </citation>
</ref>
<ref id="B62">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Stab</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Gurevych</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Identifying Argumentative Discourse Structures in Persuasive Essays</article-title>,&#x201d; in <conf-name>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</conf-name> (<publisher-loc>Doha, Qatar</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>46</fpage>&#x2013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.3115/v1/d14-1006</pub-id> </citation>
</ref>
<ref id="B63">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stab</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Gurevych</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Parsing Argumentation Structures in Persuasive Essays</article-title>. <source>Comput. Linguistics</source> <volume>43</volume> (<issue>3</issue>), <fpage>619</fpage>&#x2013;<lpage>659</lpage>. <pub-id pub-id-type="doi">10.1162/coli_a_00295</pub-id> </citation>
</ref>
<ref id="B64">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stegmann</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Wecker</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Weinberger</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Collaborative Argumentation and Cognitive Elaboration in a Computer-Supported Collaborative Learning Environment</article-title>. <source>Instr. Sci.</source> <volume>40</volume> (<issue>2</issue>), <fpage>297</fpage>&#x2013;<lpage>323</lpage>. <pub-id pub-id-type="doi">10.1007/s11251-011-9174-5</pub-id> </citation>
</ref>
<ref id="B65">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Stemler</surname>
<given-names>S. E.</given-names>
</name>
<name>
<surname>Tsai</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2008</year>). &#x201c;<article-title>Best Practices in Interrater Reliability: Three Common Approaches</article-title>,&#x201d; in <conf-name>Best Practices in Quantitative Methods</conf-name> (<publisher-name>SAGE Publications, Inc.</publisher-name>), <fpage>29</fpage>&#x2013;<lpage>49</lpage>. </citation>
</ref>
<ref id="B66">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Suebnukarn</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Haddawy</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2004</year>). &#x201c;<article-title>A Collaborative Intelligent Tutoring System for Medical Problem-Based Learning</article-title>,&#x201d; in <conf-name>Proceedings of the 9th International Conference on Intelligent User Interfaces</conf-name> (<publisher-loc>Madeira, Portugal</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>14</fpage>&#x2013;<lpage>21</lpage>. </citation>
</ref>
<ref id="B67">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Toniuc</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Groza</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Climebot: An Argumentative Agent for Climate Change</article-title>,&#x201d; in <conf-name>Proceedings - 2017 IEEE 13th International Conference on Intelligent Computer Communication and Processing, ICCP 2017</conf-name> (<publisher-loc>Romania</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>63</fpage>&#x2013;<lpage>70</lpage>. <pub-id pub-id-type="doi">10.1109/ICCP.2017.8116984</pub-id> </citation>
</ref>
<ref id="B68">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Toulmin</surname>
<given-names>S. E.</given-names>
</name>
</person-group> (<year>2003</year>). <source>The Uses of Argument</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>. </citation>
</ref>
<ref id="B69">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>VanLehn</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>P. W.</given-names>
</name>
<name>
<surname>Ros&#xe9;</surname>
<given-names>C. P.</given-names>
</name>
<name>
<surname>Bhembe</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>B&#xf6;ttner</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gaydos</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2002</year>). &#x201c;<article-title>The Architecture of Why2-Atlas: A Coach for Qualitative Physics Essay Writing</article-title>,&#x201d; in <conf-name>International Conference on Intelligent Tutoring Systems</conf-name> (<publisher-loc>Biarritz, France</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>158</fpage>&#x2013;<lpage>167</lpage>. <pub-id pub-id-type="doi">10.1007/3-540-47987-2_20</pub-id> </citation>
</ref>
<ref id="B70">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Veletsianos</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Russell</surname>
<given-names>G. S.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Pedagogical Agents</article-title>,&#x201d; in <conf-name>Handbook of Research on Educational Communications and Technology</conf-name> (<publisher-loc>Berlin, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>759</fpage>&#x2013;<lpage>769</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-4614-3185-5_61</pub-id> </citation>
</ref>
<ref id="B71">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wachsmuth</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Naderi</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Habernal</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Hou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Hirst</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Gurevych</surname>
<given-names>I.</given-names>
</name>
<etal/>
</person-group> (<year>2017a</year>). &#x201c;<article-title>Argumentation Quality Assessment: Theory vs. Practice</article-title>,&#x201d; in <conf-name>ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) 2</conf-name> (<publisher-loc>Vancouver, Canada</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>250</fpage>&#x2013;<lpage>255</lpage>. <pub-id pub-id-type="doi">10.18653/v1/P17-2039</pub-id> </citation>
</ref>
<ref id="B72">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wachsmuth</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Ajjour</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2017b</year>). &#x201c;<article-title>"PageRank" for Argument Relevance</article-title>,&#x201d; in <conf-name>15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of Conference</conf-name> (<publisher-loc>Valencia, Spain</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <volume>Vol. 1</volume>. <pub-id pub-id-type="doi">10.18653/v1/e17-1105</pub-id> </citation>
</ref>
<ref id="B73">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Wallace</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>1995</year>). <article-title>ALICE - Artificial Linguistic Internet Computer Entity - the ALICE AI Foundation</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://www.alicebot.org">http://www.alicebot.org</ext-link></comment>. </citation>
</ref>
<ref id="B75">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wambsganss</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Niklaus</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Cetto</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>S&#xf6;llner</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Handschuh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Leimeister</surname>
<given-names>J.&#x20;M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>AL: An Adaptive Learning Support System for Argumentation Skills</article-title>. <source>Conf. Hum. Factors Comput. Syst. - Proc.</source> <volume>20</volume>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1145/3313831.3376732</pub-id> </citation>
</ref>
<ref id="B76">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhan</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A Problem Solving Oriented Intelligent Tutoring System to Improve Students&#x27; Acquisition of Basic Computer Skills</article-title>. <source>Comput. Educ.</source> <volume>81</volume>, <fpage>102</fpage>&#x2013;<lpage>112</lpage>. <pub-id pub-id-type="doi">10.1016/j.compedu.2014.10.003</pub-id> </citation>
</ref>
<ref id="B77">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Arya</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Novielli</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>J.&#x20;L. C.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>ArguLens: Anatomy of Community Opinions on Usability Issues Using Argumentation Models</article-title>. <source>Conf. Hum. Factors Comput. Syst. - Proc.</source> <volume>20</volume>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1145/3313831.3376218</pub-id> </citation>
</ref>
<ref id="B78">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Weerasinghe</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mitrovic</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2011</year>). &#x201c;<article-title>Facilitating Adaptive Tutorial Dialogues in EER-Tutor</article-title>,&#x201d; in <conf-name>International Conference on Artificial Intelligence in Education</conf-name> (<publisher-loc>New Zealand</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>630</fpage>&#x2013;<lpage>631</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-21869-9_131</pub-id> </citation>
</ref>
<ref id="B79">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wolfbauer</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Pammer-Schindler</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Ros&#xe9;</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Rebo Junior: Analysis of Dialogue Structure Quality for a Reflection Guidance Chatbot</article-title>,&#x201d; in <conf-name>EC-TEL Impact Paper Proceedings 2020: 15th European Conference on Technology Enhanced Learning</conf-name> (<publisher-loc>Bolzano, Italy</publisher-loc>: <publisher-name>Springer</publisher-name>). </citation>
</ref>
<ref id="B80">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Jia</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Evaluating an Intelligent Tutoring System for Personalized Math Teaching</article-title>,&#x201d; in <conf-name>2017 International Symposium on Educational Technology (ISET)</conf-name> (<publisher-loc>China</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>126</fpage>&#x2013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.1109/iset.2017.37</pub-id> </citation>
</ref>
<ref id="B81">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Purao</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Argument Detection in Online Discussion: A Theory Based Approach</article-title>,&#x201d; in <conf-name>AMCIS 2016: Surfing the IT Innovation Wave - 22nd Americas Conference on Information Systems, no. Yates 1996</conf-name> (<publisher-loc>San Diego, California</publisher-loc>: <publisher-name>Association for Information Systems</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>10</lpage>. </citation>
</ref>
</ref-list>
</back>
</article>