<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">590215</article-id>
<article-id pub-id-type="doi">10.3389/frai.2021.590215</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Application of Seq2Seq Models on Code Correction</article-title>
<alt-title alt-title-type="left-running-head">Huang et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">Seq2Seq Models in Code Correction</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Huang</surname>
<given-names>Shan</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="http://loop.frontiersin.org/people/1047564/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhou</surname>
<given-names>Xiao</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="http://loop.frontiersin.org/people/1018260/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chin</surname>
<given-names>Sang</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>
<sup>1</sup>
</label>Department of Physics, Boston University, <addr-line>Boston</addr-line>, <addr-line>MA</addr-line>, <country>United&#x20;States</country>
</aff>
<aff id="aff2">
<label>
<sup>2</sup>
</label>Department of Computer Science, Boston University, <addr-line>Boston</addr-line>, <addr-line>MA</addr-line>, <country>United&#x20;States</country>
</aff>
<aff id="aff3">
<label>
<sup>3</sup>
</label>Department of Brain and Cognitive Science, Massachusetts Institute of Technology, <addr-line>Boston</addr-line>, <addr-line>MA</addr-line>, <country>United&#x20;States</country>
</aff>
<aff id="aff4">
<label>
<sup>4</sup>
</label>Center of Mathematical Sciences and Applications, Harvard University, <addr-line>Boston</addr-line>, <addr-line>MA</addr-line>, <country>United&#x20;States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/845367">Bhavya Kailkhura</ext-link>, United&#x20;States Department of Energy (DOE), United&#x20;States</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1050669">Caiwen Ding</ext-link>, University of Connecticut, United&#x20;States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/142900">Aline Paes</ext-link>, Fluminense Federal University, Brazil</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Shan Huang, <email>sh2015@bu.edu</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>19</day>
<month>03</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>4</volume>
<elocation-id>590215</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>07</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>08</day>
<month>01</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Huang, Zhou and Chin.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Huang, Zhou and Chin</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>We apply various seq2seq models to programming language correction tasks on the Juliet Test Suite for C/C&#x2b;&#x2b; and Java of the Software Assurance Reference Dataset and achieve repair rates of 75% (for C/C&#x2b;&#x2b;) and 56% (for Java) on these tasks. We introduce a pyramid encoder into these seq2seq models, which significantly increases computational and memory efficiency while achieving repair rates similar to those of the nonpyramid counterparts. We successfully carry out an error type classification task on ITC benchmark examples (with only 685 code instances) using transfer learning with models pretrained on the Juliet Test Suite, pointing to a novel way of processing small programming language datasets.</p>
</abstract>
<kwd-group>
<kwd>programming language correction</kwd>
<kwd>seq2seq architecture</kwd>
<kwd>pyramid encoder</kwd>
<kwd>attention mechanism</kwd>
<kwd>transfer learning</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Programming language correction (PLC), which can suggest fixes for debugging, identify potential flaws in a program, and help programmers improve their coding skills, has been an important topic in the Natural Language Processing (NLP) area. Code errors generally fall into two categories: explicit syntax errors, and implicit logic errors that can cause failures during program execution, for example, memory allocation errors or redundant code. The syntax error problem is relatively well studied: most compilers catch syntax errors, and correcting them manually is not difficult even for beginner programmers. The latter problem, however, is much more challenging, for several reasons. First, the error space is vast; for example, Error-Prone, a rule-based Java code error detector developed by Google, identifies 499 bug patterns. Second, recognizing and correcting these bugs requires a higher-level understanding of the code, including identifying the relationships between objects, making connections between blocks, and matching data types. Even experienced programmers make such errors, and correcting them manually can be time consuming. Therefore, this study focuses on the automatic correction of logic errors in code bodies that pass the compiling&#x20;stage.</p>
<p>At present, most work in this field uses rule-based methods [<xref ref-type="bibr" rid="B10">JetBrains (2016)</xref>; <xref ref-type="bibr" rid="B17">Synopsys (2016)</xref>; <xref ref-type="bibr" rid="B4">Google (2016a)</xref>; <xref ref-type="bibr" rid="B5">Google (2016b)</xref>; <xref ref-type="bibr" rid="B15">Singh et&#x20;al., (2013)</xref>], relying on static analyzers, code transformations, or control-flow analysis to identify bug patterns and make corrections. These methods are quite mature, and some are even commercialized, like Resharper. Machine learning methods, by contrast, remain a minority and are relatively new; there is no canonical solution yet, and approaches range from reinforcement learning to recurrent neural networks.</p>
<p>Despite the good performance and wide usage of rule-based PLC methods, they have a major drawback: they are often case specific. Developers must design a specific correction strategy for each bug pattern; for example, the core code body of Error-Prone contains 499 Java scripts, each corresponding to one type of error. Rule-based PLC therefore requires substantial human labor to build, and it suffers from incompleteness and an inability to handle exceptions. In the long run, one could view rule-based PLC vs. machine learning PLC as analogous to rule-based translation vs. statistical machine translation. Machine learning methods have the following advantages. First, they are self-sufficient: they teach themselves, requiring a minimal amount of human development. Second, they can improve themselves and their predictions by collecting data from users. Third, after sufficient training, one can expect them to perform better in coding style and fluency, as machine translation does. One main obstacle that prevents machine code correction from being as successful as machine translation is a general lack of data, which we elaborate on in a later paragraph; this in turn leads to another drawback, insufficient training. Nevertheless, machine code correction has great potential if more studies are carried out and more datasets are produced. This article aims to provide a successful example that may inspire further research on machine code correction.</p>
<p>Despite the merits, discussed above, of replacing hand-designed rule-based PLC methods with machine-learning-based ones, some may express concern about the environmental costs, as ethical AI researchers have (<xref ref-type="bibr" rid="B8">Hao, 2019</xref>). Although we generally do not believe such concerns should overshadow the value of liberating human labor and pursuing potentially much better performance (as happened in machine translation), we leave that judgment to our readers. Since training a machine learning model consumes mostly electricity and storage space, we provide an estimated power consumption and the number of parameters in our models (with the chosen hyperparameters described in <xref ref-type="sec" rid="s3-6">Section 3.6</xref>) in the Appendix: <xref ref-type="sec" rid="s2">Section 2</xref>. Interested readers can refer to that information accordingly.</p>
<p>The machine learning models we choose are seq2seq models. A seq2seq (sequence-to-sequence) model is a neural-network-based model that usually consists of an encoder and a decoder: the encoder takes a sequence as input and produces an encoded representation of it, and the decoder takes this representation and produces an output sequence. Seq2seq models have proven very successful in neural machine translation, natural language correction, text generation, etc. An example of a seq2seq model structure is shown in <xref ref-type="fig" rid="F1">Figure&#x20;1</xref>. Our results show that seq2seq models successfully repair over 70% of the code instances with a beam search size of 1 and over 90% with a beam search size of&#x20;5.</p>
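To make the encoder-decoder pipeline concrete, the following is a minimal sketch of a GRU-based seq2seq model. It is illustrative only: PyTorch is assumed, and the layer sizes are arbitrary placeholders, not the hyperparameters used in this article.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal GRU encoder-decoder: encode a source token sequence,
    then decode an output sequence conditioned on the encoder state."""
    def __init__(self, vocab_size, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src, tgt):
        # Encode: the final hidden state summarizes the input sequence.
        _, h = self.encoder(self.embed(src))
        # Decode: teacher-forced on tgt, initialized from the encoder state.
        dec_out, _ = self.decoder(self.embed(tgt), h)
        return self.out(dec_out)  # (batch, tgt_len, vocab_size)

model = Seq2Seq(vocab_size=1000)
src = torch.randint(0, 1000, (2, 7))   # two buggy token sequences
tgt = torch.randint(0, 1000, (2, 5))   # two target (fixed) sequences
logits = model(src, tgt)               # per-step vocabulary scores
```

At inference time the decoder would instead feed its own predictions back in step by step, typically with beam search over the per-step vocabulary scores.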
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Model structure of a 3-layer seq2seq model with attention. The <inline-formula id="inf1">
<mml:math id="minf1">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mtext>th</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> layer takes the output of the previous layer (<inline-formula id="inf2">
<mml:math id="minf2">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>) as its input. <bold>
<italic>a</italic>
</bold> is the context vector, which can be calculated using different attention mechanisms.</p>
</caption>
<graphic xlink:href="frai-04-590215-g001.tif"/>
</fig>
<p>Rather than using regular seq2seq models alone, we introduce a pyramid encoder structure better suited to the code correction task. The motivation is as follows: in NLC problems, the model works at the sentence level, and the average sentence is a few dozen words long. In PLC problems, however, the model works on a whole code instance, whose average length is usually hundreds of syntax words; this leads to enormous computational cost and memory requirements, especially when combined with attention mechanisms. The pyramid structure reduces these costs by contracting the data flow and discarding redundant information. <xref ref-type="fig" rid="F2">Figure&#x20;2</xref> shows a visual representation of the pyramid encoder; it can be applied to most multilayer seq2seq learning models. In our model comparison set, the pyramid encoder increases the networks&#x2019; computational efficiency by 50%&#x2013;100% and memory efficiency by up to 600%, while retaining a similar repair ability.</p>
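The saving can be estimated with a back-of-envelope calculation: attention stores one score per (target step, source position) pair, so halving the encoder output between layers shrinks that table by a factor of two per layer. The numbers below are illustrative, not measurements from our experiments.

```python
def pyramid_output_len(T, n_layers):
    """Pyramid encoder halves the sequence between consecutive layers,
    giving roughly T / 2**(n_layers - 1) output states."""
    return T // 2 ** (n_layers - 1)

def attention_scores(src_len, tgt_len):
    """One attention score per (target step, source position) pair."""
    return src_len * tgt_len

T = 512  # tokens in a hypothetical code instance
regular = attention_scores(T, T)                         # full-length encoder output
pyramid = attention_scores(pyramid_output_len(T, 3), T)  # 3-layer pyramid encoder
# A 3-layer pyramid encoder shrinks the attention table fourfold.
```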
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Visualization of pyramid encoder in multilayer seq2seq models. Pyramid encoder reduces length of input sequence by half in every encoding layer. <inline-formula id="inf3">
<mml:math id="minf3">
<mml:mrow>
<mml:msup>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> denotes output of <inline-formula id="inf4">
<mml:math id="minf4">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mtext>th</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> encoder layer and <inline-formula id="inf5">
<mml:math id="minf5">
<mml:mrow>
<mml:msup>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> denotes the input of <inline-formula id="inf6">
<mml:math id="minf6">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mtext>th</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> encoder&#x20;layer.</p>
</caption>
<graphic xlink:href="frai-04-590215-g002.tif"/>
</fig>
<p>On the other hand, due to privacy policies, most publicly available datasets are not collected from realistic program errors and fixes but are generated by artificial tools, and the datasets that are collected realistically are usually very small. To handle this issue, we also applied transfer learning, inheriting knowledge learned from previous datasets to boost the network&#x2019;s performance on smaller and noisier datasets. Details of our project are available on GitHub<xref ref-type="fn" rid="FN1">
<sup>1</sup>
</xref>.</p>
</sec>
<sec id="s2">
<title>2 Related Work</title>
<p>Rule-based methods for PLC have a long history and are thus more mature. One of them, proposed by <xref ref-type="bibr" rid="B15">Singh et&#x20;al., (2013)</xref>, is a rule-directed translation strategy that synthesizes a correct program from a sketch; it provides feedback on introductory programming problems and achieved a correction rate of 64% on incorrect submissions. Some of these methods are quite mature. For instance, Google developed Error-Prone (<xref ref-type="bibr" rid="B4">Google, 2016a</xref>) and clang-tidy (<xref ref-type="bibr" rid="B5">Google, 2016b</xref>) as rule-based tools that help programmers identify and correct potential mistakes. Some are even commercialized, like Resharper (<xref ref-type="bibr" rid="B10">JetBrains, 2016</xref>) and the tools of <xref ref-type="bibr" rid="B17">Synopsys (2016)</xref>. As a paid extension of Visual Studio, Resharper provides code analysis, refactoring, and code processing (including code generation and quick fixes for errors) as extra features for programmers.</p>
<p>In 2016, <xref ref-type="bibr" rid="B14">Pu et&#x20;al.&#x2019;s (2016)</xref> study became one of the first attempts to use machine learning methods in PLC tasks. They used a Long Short-Term Memory (LSTM) model to correct MOOC student assignment submissions; however, their dataset is not publicly available, which makes their work difficult to reproduce. Later, in 2017, <xref ref-type="bibr" rid="B7">Gupta et&#x20;al., (2017)</xref> proposed a seq2seq model (Deepfix) for fixing student submissions, also on a private dataset. In a later work (<xref ref-type="bibr" rid="B6">Gupta et&#x20;al., 2018</xref>), they used reinforcement learning based on the input code and the error messages returned by the compiler for the same task, on the same dataset. Our work, also based on seq2seq models, was carried out on a public dataset that contains more error categories.</p>
<p>The pyramid encoder plays an important role in our research. It originated in <xref ref-type="bibr" rid="B19">Xie et&#x20;al., (2016)</xref>; we propose its general form for all seq2seq models and thoroughly study how much it reduces computational resources, aiming to overcome the difficulty posed by code instances being much longer than natural language sentences. These aspects of the pyramid structure were not studied in Xie&#x2019;s work. We compare the pyramid encoder with the regular encoder under different attention mechanisms, showing that the pyramid encoder drastically reduces memory and computational cost in most of the setups we considered.</p>
</sec>
<sec id="s3">
<title>3 Model</title>
<sec id="s3-1">
<title>3.1 Overview</title>
<p>Given a code instance, we wish to identify and correct a potential flaw in it that might lead to a failure in execution after successful compilation. Each bad code instance contains exactly one&#x20;flaw.</p>
<p>Formally speaking, given an input code instance <italic>x</italic>, we wish to map it to an output code instance <italic>y</italic>; we seek to model <inline-formula id="inf7">
<mml:math id="minf7">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. A code instance is &#x201c;repaired&#x201d; if the flaw contained in <italic>x</italic> is fixed in the output <italic>y</italic>. The &#x201c;repair rate&#x201d; is defined as the ratio of the number of code instances fixed to the total number of code instances the model was applied to. We use repair rate as the evaluation metric in our experiments.</p>
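As a small, hedged illustration of the metric (exact string match stands in for &#x201c;the flaw is fixed&#x201d;, and the fixer is a toy stand-in, not our model):

```python
def repair_rate(fix, pairs):
    """Fraction of (buggy, correct) pairs where the fixer's output
    matches the correct code."""
    repaired = sum(1 for bad, good in pairs if fix(bad) == good)
    return repaired / len(pairs)

# Toy fixer that deletes a redundant self-assignment.
fix = lambda code: code.replace("x = x;\n", "")
pairs = [
    ("int x = 1;\nx = x;\n", "int x = 1;\n"),   # repaired by the toy fixer
    ("int y = 2;\n",         "int y = 0;\n"),   # not repaired
]
rate = repair_rate(fix, pairs)  # 1 of 2 instances repaired -> 0.5
```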
<p>For this purpose, we applied two major families of seq2seq models: GRU and Transformer. We use learnable embedding layers, which allow the model to recognize the relationships between different words in the vocabulary. For the encoder, we applied the pyramid encoder, in which a pyramid module is added between the layers of a regular multilayer encoder. To test the generality of the pyramid encoder, we combined it with different attention mechanisms.</p>
</sec>
<sec id="s3-2">
<title>3.2&#x20;Word-Level Reasoning</title>
<p>In language correction, character-level reasoning is the more commonly applied method (<xref ref-type="bibr" rid="B19">Xie et&#x20;al., 2016</xref>). In code correction, however, we apply word-level models. A &#x201c;word&#x201d; here is defined as a code syntax token (e.g., &#x201c;void&#x201d;, &#x201c;{&#x201d;, space, &#x201c;&#x3d;&#x201d;, &#x201c;int&#x201d;, newline, etc.) or a custom variable name, because the basic building blocks of a code instance are its syntax tokens. In the field of programming language processing, out-of-vocabulary (OOV) words are less of a problem than in natural language, owing to the fixed syntax&#x20;pool.</p>
<p>To prevent the model from suffering from the vast variation of variable names, we performed a certain degree of variable renaming, focusing on function names in our dataset while keeping other variables unchanged. This method reduced the vocabulary size to &#x223c;1,000 and proved effective in improving performance.</p>
<p>We describe our preprocessing method for a code instance in the Appendix.</p>
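A minimal sketch of this kind of preprocessing follows; the regular expression and the FUNC_i placeholder scheme are illustrative assumptions, not our exact pipeline.

```python
import re

# Split C-like code into identifiers, multi-character operators,
# single non-space characters, spaces, and newlines.
TOKEN = re.compile(r"[A-Za-z_]\w*|==|!=|<=|>=|\n|\S| ")

def tokenize(code):
    return TOKEN.findall(code)

def rename_functions(tokens, func_names):
    """Map each function name to a canonical placeholder so arbitrary
    user-chosen names do not inflate the vocabulary."""
    table = {name: f"FUNC_{i}" for i, name in enumerate(sorted(func_names))}
    return [table.get(tok, tok) for tok in tokens]

tokens = tokenize("void helperFn ( ) {\n}")
renamed = rename_functions(tokens, {"helperFn"})  # helperFn -> FUNC_0
```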
</sec>
<sec id="s3-3">
<title>3.3 Pyramid Encoder</title>
<p>Given a multilayer seq2seq encoder, its input at <inline-formula id="inf8">
<mml:math id="minf8">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mtext>th</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> layer at step <italic>t</italic> is <inline-formula id="inf9">
<mml:math id="minf9">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and the output is <inline-formula id="inf10">
<mml:math id="minf10">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>:<disp-formula id="e1">
<mml:math id="me1">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mtext>Layer</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>In standard seq2seq models, the output of the <inline-formula id="inf11">
<mml:math id="minf11">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mtext>th</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> layer <inline-formula id="inf12">
<mml:math id="minf12">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is directly used as input of the <inline-formula id="inf13">
<mml:math id="minf13">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mtext>th</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> layer, <inline-formula id="inf14">
<mml:math id="minf14">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>:<disp-formula id="e2">
<mml:math id="me2">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>where the time step <inline-formula id="inf15">
<mml:math id="minf15">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1,2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and the layer number <inline-formula id="inf16">
<mml:math id="minf16">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1,2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. Note that <inline-formula id="inf17">
<mml:math id="minf17">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is the embedded representation of the input instance.</p>
<p>For pyramid encoder, we introduce a pyramid module in between <inline-formula id="inf18">
<mml:math id="minf18">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf19">
<mml:math id="minf19">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> as follows in <xref ref-type="disp-formula" rid="e3">Eq. 3</xref>:<disp-formula id="e3">
<mml:math id="me3">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>tanh</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mtext>pyr</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>y</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>This module reduces the length of its input <inline-formula id="inf20">
<mml:math id="minf20">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> by half each time it is applied, so the length of the final encoder output is <inline-formula id="inf21">
<mml:math id="minf21">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>. One could also take a larger window, such as 3, 4, or 5, depending on one&#x2019;s needs. The hope is that the pyramid structure extracts the important information from each pair of neighboring hidden states while discarding the redundant information, thereby reducing the training cost while preserving correction accuracy. This is conceptually similar to a convolution, but without filters.</p>
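In code, the pyramid module of Eq. 3 concatenates each pair of neighboring hidden states and projects the pair back to the hidden size, halving the sequence length. A minimal PyTorch sketch (illustrative only; the layer size is an arbitrary assumption):

```python
import torch
import torch.nn as nn

class PyramidModule(nn.Module):
    """Pyramid step between encoder layers: x'_t = tanh(W_pyr [h_2t; h_2t+1] + b_pyr)."""
    def __init__(self, hidden_size):
        super().__init__()
        # The linear layer holds both W_pyr and b_pyr.
        self.proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, h):                      # h: (batch, T, hidden)
        B, T, H = h.shape
        h = h[:, : T - T % 2, :]               # drop a trailing odd state
        pairs = h.reshape(B, T // 2, 2 * H)    # concatenate [h_2t ; h_2t+1]
        return torch.tanh(self.proj(pairs))    # (batch, T//2, hidden)

pyr = PyramidModule(hidden_size=32)
x = torch.randn(4, 100, 32)    # 4 sequences of 100 hidden states
y = pyr(x)                     # sequence length halved to 50
```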
<p>For our GRU models, we used a multilayer bidirectional GRU and implemented the pyramid encoder as first described in <xref ref-type="bibr" rid="B19">Xie et&#x20;al., (2016)</xref>:<disp-formula id="e4">
<mml:math id="me4">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">f</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>GRU</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">f</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
<disp-formula id="e5">
<mml:math id="me5">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">b</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>GRU</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">b</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>
<disp-formula id="e6">
<mml:math id="me6">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">f</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">b</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>
<disp-formula id="e7">
<mml:math id="me7">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>tanh</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mtext>pyr</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>y</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>where <inline-formula id="inf22">
<mml:math id="minf22">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> denotes the input to the next layer, <inline-formula id="inf23">
<mml:math id="minf23">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">f</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf24">
<mml:math id="minf24">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">b</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> denote the outputs of a forward and a backward GRU, respectively. A GRU (Gated Recurrent Unit) is an RNN (Recurrent Neural Network) model that includes a gating mechanism, given by the following equations (<xref ref-type="bibr" rid="B2">Cho et&#x20;al., 2014</xref>):<disp-formula id="e8">
<mml:math id="me8">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#x2dc;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:math>
<label>(8)</label>
</disp-formula>
<disp-formula id="e9">
<mml:math id="me9">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#x2dc;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:math>
<label>(9)</label>
</disp-formula>
<disp-formula id="e10">
<mml:math id="me10">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">n</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>tanh</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mtext>&#x2a;</mml:mtext>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#x2dc;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(10)</label>
</disp-formula>
<disp-formula id="e11">
<mml:math id="me11">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#x2dc;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>&#x2a;</mml:mtext>
<mml:msub>
<mml:mi mathvariant="bold-italic">n</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mtext>&#x2a;</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#x2dc;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
<label>(11)</label>
</disp-formula>where <inline-formula id="inf25">
<mml:math id="minf25">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#x2dc;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the hidden state at step <italic>t</italic>, which is denoted by <inline-formula id="inf26">
<mml:math id="minf26">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">f</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> in <xref ref-type="disp-formula" rid="e4">Eq. 4</xref> and <inline-formula id="inf27">
<mml:math id="minf27">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">b</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> in <xref ref-type="disp-formula" rid="e5">Eq. 5</xref>. <inline-formula id="inf28">
<mml:math id="minf28">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf29">
<mml:math id="minf29">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf30">
<mml:math id="minf30">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">n</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are the reset, update, and new gates, respectively. <italic>&#x3c3;</italic> is the sigmoid function.</p>
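For reference, the gating equations (Eqs. 8&#x2013;11) can be sketched in plain Python. This is a minimal illustration with hypothetical weight matrices (represented as lists of lists), parenthesizing Eq. 10 as r &#x2299; (W<sub>hn</sub>h + b<sub>hn</sub>) in the PyTorch convention:

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gru_step(x_t, h_prev, W_ir, b_ir, W_hr, b_hr,
             W_iz, b_iz, W_hz, b_hz, W_in, b_in, W_hn, b_hn):
    """One GRU step: reset gate r (Eq. 8), update gate z (Eq. 9),
    new gate n (Eq. 10), and the hidden-state update (Eq. 11)."""
    add = lambda *vs: [sum(t) for t in zip(*vs)]
    r = sigmoid(add(matvec(W_ir, x_t), b_ir, matvec(W_hr, h_prev), b_hr))  # Eq. 8
    z = sigmoid(add(matvec(W_iz, x_t), b_iz, matvec(W_hz, h_prev), b_hz))  # Eq. 9
    n = [math.tanh(a + ri * b)                                             # Eq. 10
         for a, ri, b in zip(add(matvec(W_in, x_t), b_in), r,
                             add(matvec(W_hn, h_prev), b_hn))]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]   # Eq. 11
```

With all-zero weights, z = 0.5 everywhere and the new state is half the previous state, which gives a quick sanity check of the update rule.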
<p>The Transformer is a family of seq2seq models that works very differently from RNN-type models. In the original Transformer (see <xref ref-type="fig" rid="F3">Figure&#x20;3</xref>), a Feed Forward layer directly takes the output of the Multihead attention layer <inline-formula id="inf31">
<mml:math id="minf31">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, accompanied by a residual connection, shown in<disp-formula id="e12">
<mml:math id="me12">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>MultiHeadAtt</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<label>(12)</label>
</disp-formula>
<disp-formula id="e13">
<mml:math id="me13">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:mtext>FeedForward</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(13)</label>
</disp-formula>In our model, we concatenate the neighboring elements in <inline-formula id="inf32">
<mml:math id="minf32">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> before feeding it into the Feed Forward layer. As a result, the dimensions of the first Linear layer in the Feed Forward layer change from <inline-formula id="inf33">
<mml:math id="minf33">
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mtext>model</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mtext>ff</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> to <inline-formula id="inf34">
<mml:math id="minf34">
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>2</mml:mn>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mtext>model</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mtext>ff</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>. Here we use the same notation as in <xref ref-type="bibr" rid="B18">Vaswani et&#x20;al., (2017)</xref>, where <inline-formula id="inf35">
<mml:math id="minf35">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mtext>model</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the size of input, output, and attention vectors and <inline-formula id="inf36">
<mml:math id="minf36">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mtext>ff</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the number of neurons in the Feed Forward layer. The residual connection also has to change accordingly; we tried two approaches: averaging the neighboring elements (<xref ref-type="disp-formula" rid="e14">Eq. 14</xref>) or concatenating the neighboring elements and passing them through another affine transformation to recover the dimension (<xref ref-type="disp-formula" rid="e15">Eq. 15</xref>). For simplicity, we denote the former method with the subscript &#x201c;ave&#x201d; and the latter with the subscript &#x201c;aff&#x201d;.<disp-formula id="e14">
<mml:math id="me14">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mtext>ave</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:mfrac>
<mml:mo>&#x2b;</mml:mo>
<mml:mtext>FeedForward</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(14)</label>
</disp-formula>
<disp-formula id="e15">
<mml:math id="me15">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">x</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mtext>aff</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>tanh</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mrow>
<mml:mtext>aff</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mtext>aff</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mtext>FeedForward</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">c</mml:mi>
<mml:mrow>
<mml:mtext>att</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(15)</label>
</disp-formula>In our experiments, both methods show comparable performance. Therefore, unless otherwise specified, reported results use the &#x201c;ave&#x201d; version.</p>
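The &#x201c;ave&#x201d; pyramid step (Eq. 14) can be sketched in plain Python. This is an illustrative sketch with hypothetical weights, using a ReLU feed-forward network whose first layer has been widened to accept the concatenated pair:

```python
def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward; with pairing, W1 maps 2*d_model -> d_ff."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def pyramid_ave_layer(c_att, W1, b1, W2, b2):
    """'ave' pyramid step (Eq. 14): concatenate neighboring attention outputs
    for the widened feed-forward layer, and average each pair for the
    residual connection. Halves the sequence length at every layer."""
    out = []
    for t in range(0, len(c_att) - 1, 2):
        pair = c_att[t] + c_att[t + 1]                          # concat neighbors
        ave = [(a + b) / 2 for a, b in zip(c_att[t], c_att[t + 1])]
        ff = feed_forward(pair, W1, b1, W2, b2)
        out.append([a + f for a, f in zip(ave, ff)])
    return out
```

With zero feed-forward weights the layer reduces to pairwise averaging, which makes the residual path easy to verify in isolation.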
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>The implementation of pyramid structure in Transformer&#x2019;s encoder.</p>
</caption>
<graphic xlink:href="frai-04-590215-g003.tif"/>
</fig>
</sec>
<sec id="s3-4">
<title>3.4 Decoder and Attention Mechanisms</title>
<p>For our GRU models, we used a regular multilayer unidirectional GRU as the decoder:<disp-formula id="e16">
<mml:math id="me16">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>GRU</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(16)</label>
</disp-formula>In our experiments, we compared Bahdanau attention (<xref ref-type="disp-formula" rid="e17">Eq. 17</xref>) with different Luong attentions. Bahdanau attention is described by the following set of equations:<disp-formula id="e17">
<mml:math id="me17">
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>M</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi mathvariant="normal">&#x22a4;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(17)</label>
</disp-formula>
<disp-formula id="e18">
<mml:math id="me18">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>j</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(18)</label>
</disp-formula>
<disp-formula id="e19">
<mml:math id="me19">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>j</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>j</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
<label>(19)</label>
</disp-formula>Here, <italic>u</italic> is the alignment score; <italic>h</italic> and <inline-formula id="inf37">
<mml:math id="minf37">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> denote the hidden states in the encoder and decoder, respectively. <italic>M</italic> and <italic>N</italic> are the numbers of layers in the decoder and encoder, respectively. <inline-formula id="inf38">
<mml:math id="minf38">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the context vector, which is concatenated with the decoder hidden state of the last layer to predict the next word <inline-formula id="inf39">
<mml:math id="minf39">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
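Eqs. 17&#x2013;19 can be sketched in plain Python as follows. This is an illustrative sketch with hypothetical weight matrices; note that Eq. 18 as written normalizes by a plain sum, whereas the original Bahdanau formulation applies a softmax:

```python
def bahdanau_context(h_dec, H_enc, W1, b1, W2, b2):
    """Alignment scores between the transformed decoder state and each
    transformed encoder state (Eq. 17), normalized into weights alpha
    (Eq. 18, plain-sum normalization as written in the paper), then used
    to combine the encoder states into the context vector a_t (Eq. 19)."""
    mv = lambda W, x: [sum(w * v for w, v in zip(row, x)) for row in W]
    q = [a + c for a, c in zip(mv(W1, h_dec), b1)]       # W1 h_dec + b1
    u = []
    for h_k in H_enc:                                    # Eq. 17, one score per k
        k = [a + c for a, c in zip(mv(W2, h_k), b2)]
        u.append(sum(a * c for a, c in zip(q, k)))
    total = sum(u)
    alpha = [s / total for s in u]                       # Eq. 18
    return [sum(a * h[d] for a, h in zip(alpha, H_enc))  # Eq. 19
            for d in range(len(H_enc[0]))]
```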
<p>Luong&#x2019;s global attentions are generalizations of Bahdanau attention that use different alignment score calculations. For simplicity, we omit the superscripts <inline-formula id="inf40">
<mml:math id="minf40">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>M</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf41">
<mml:math id="minf41">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>.<disp-formula id="e20">
<mml:math id="me20">
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi mathvariant="normal">&#x22a4;</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mtext>dot</mml:mtext>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi mathvariant="normal">&#x22a4;</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mi mathvariant="bold-italic">a</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mtext>general</mml:mtext>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold-italic">v</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi mathvariant="normal">&#x22a4;</mml:mi>
</mml:msubsup>
<mml:mtext>tanh</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mi mathvariant="bold-italic">a</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mtext>concat</mml:mtext>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(20)</label>
</disp-formula>We also tried one of Luong&#x2019;s local attentions, obtained by imposing a Gaussian window on <xref ref-type="disp-formula" rid="e19">Eq. 19</xref> at a desired attention center <inline-formula id="inf42">
<mml:math id="minf42">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>:<disp-formula id="e21">
<mml:math id="me21">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mi>j</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:msup>
<mml:mi>&#x3c3;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(21)</label>
</disp-formula>
<disp-formula id="e22">
<mml:math id="me22">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mtext>sigmoid</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">W</mml:mi>
<mml:mi mathvariant="bold-italic">p</mml:mi>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(22)</label>
</disp-formula>where <italic>S</italic> denotes the total length of the hidden state from the last encoding layer and <italic>&#x3c3;</italic> is a parameter chosen manually.</p>
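The three score variants of Eq. 20 and the Gaussian reweighting of Eq. 21 can be sketched in plain Python; this is an illustrative sketch with hypothetical weight matrices and vectors:

```python
import math

def luong_score(h_dec, h_enc, method, W_a=None, v_a=None):
    """Alignment score of Eq. 20: 'dot', 'general', or 'concat' variant."""
    mv = lambda W, x: [sum(w * v for w, v in zip(row, x)) for row in W]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    if method == "dot":
        return dot(h_dec, h_enc)
    if method == "general":
        return dot(h_dec, mv(W_a, h_enc))
    if method == "concat":
        return dot(v_a, [math.tanh(s) for s in mv(W_a, h_dec + h_enc)])
    raise ValueError(method)

def local_context(alpha, H_enc, p_t, sigma):
    """Luong local attention (Eq. 21): rescale each weight alpha_tj by a
    Gaussian centered at the attention center p_t before summing."""
    out = [0.0] * len(H_enc[0])
    for j, (a, h) in enumerate(zip(alpha, H_enc)):
        g = a * math.exp(-((j - p_t) ** 2) / (2 * sigma ** 2))
        out = [o + g * v for o, v in zip(out, h)]
    return out
```

As a sanity check, a very large &#x3c3; makes the Gaussian window nearly flat, so the local context reduces to the global one (Eq. 19).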
</sec>
<sec id="s3-5">
<title>3.5 Beam Search</title>
<p>We use beam search in testing and validation wherever text generation is involved. At each time step, we rank candidates by their total negative log probability up to the current decoding time step <inline-formula id="inf43">
<mml:math id="minf43">
<mml:mrow>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mtext>dec</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>:<disp-formula id="e23">
<mml:math id="me23">
<mml:mrow>
<mml:mtext>score</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mtext>dec</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:mtext>log</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(23)</label>
</disp-formula>The search stops when there are five completed candidates.</p>
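One expansion step of this ranking can be sketched in plain Python. This is an illustrative sketch (not our implementation) in which each beam carries its accumulated score from Eq. 23:

```python
def expand_beams(beams, step_log_probs, beam_width=5):
    """One expansion step of beam search, ranking candidates by the total
    negative log probability of Eq. 23 (lower score = more probable).
    `beams` is a list of (tokens, score) pairs; `step_log_probs[i]` gives
    the per-token log-probabilities predicted for beam i at this step."""
    candidates = []
    for (tokens, score), log_probs in zip(beams, step_log_probs):
        for tok, logp in enumerate(log_probs):
            candidates.append((tokens + [tok], score - logp))
    candidates.sort(key=lambda c: c[1])
    return candidates[:beam_width]
```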
</sec>
<sec id="s3-6">
<title>3.6 Model Parameters</title>
<p>In all our experiments, we used a learnable embedding layer which embeds each &#x201c;word&#x201d; into a vector of length&#x20;400.</p>
<p>In our GRU models, we used a 3-layer bidirectional encoder; the size of the hidden states is 400 in all three layers. We used a 3-layer unidirectional decoder; the size of the hidden states is also&#x20;400.</p>
<p>In our Transformer models, following the original study, we used <inline-formula id="inf44">
<mml:math id="minf44">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mtext>model</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>512</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf45">
<mml:math id="minf45">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>2048</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. We used a 3-layer encoder and a 3-layer decoder.</p>
<p>We did a coarse parameter-space search and chose these parameters as roughly optimal. We did not fine-tune them, because (1) we show that the overall performance of seq2seq models on the PLC problem is satisfactory and (2) we are more interested in the comparisons between different attention mechanisms and between the pyramid encoder and the regular encoder.</p>
</sec>
</sec>
<sec id="s4">
<title>4 Datasets</title>
<p>We perform our experiments mainly on the Juliet Test Suite for C/C&#x2b;&#x2b; (v1.2) (created by <xref ref-type="bibr" rid="B11">NSA Center for Assured Software (2013)</xref>). This dataset contains 61,387 test cases; each test case contains one flawed code instance and one to several repaired code instances. The test cases cover more than 100 Common Weakness Enumerations (CWEs), each with hundreds of example code instances. We note that the instances contain a significant amount of dead code; to make the code more realistic, we removed it. We also found that many code instances contain &#x201c;if conditions&#x201d; that execute one branch in the flawed instance and the other branch in the repaired instance. These instances are unrealistic, so we removed them as well. We also performed function renaming. After this preprocessing, we obtained 31,082 pairs of good-bad code instances.</p>
<p>To test the models&#x2019; generality, we also evaluated some of them on the Juliet Test Suite for Java (v1.3) (released by <xref ref-type="bibr" rid="B12">NSA Center for Assured Software (2018)</xref>). After preprocessing similar to that described above, we obtained 23,015 pairs of instances.</p>
<p>We performed 4-fold cross-validation in all of our experiments to obtain statistically reliable results. A table in the <xref ref-type="sec" rid="s12">Supplementary Material</xref> estimates the time and power consumption of our experiments, along with the hardware requirements.</p>
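The 4-fold protocol can be sketched as follows (an illustrative stand-in; `k_fold_splits` is our own name, and the article's actual split code is not shown):

```python
def k_fold_splits(n_items, k=4):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Each of the k folds serves once as the held-out test set; metrics
    reported in the article are averaged over the k runs.
    """
    indices = list(range(n_items))
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        # Last fold absorbs the remainder when n_items % k != 0.
        end = n_items if fold == k - 1 else start + fold_size
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

# 31,082 good-bad pairs, as in the C/C++ dataset described above.
splits = list(k_fold_splits(31082, k=4))
```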
</sec>
<sec id="s5">
<title>5 Results</title>
<sec id="s5-1">
<title>5.1 Repair Rate</title>
<p>We train our models on a GeForce GTX 1080 Ti graphics card. The metric we use for evaluation is the repair rate, the fraction of instances that are repaired after the model&#x2019;s edit. Since we perform beam search with beam width 5, each correction yields five candidates. We therefore measure performance with two metrics: the one-candidate repair rate and the five-candidate repair rate. The former corresponds to code autocorrection, where no human judgment is involved; the latter corresponds to correction suggestion, where the machine identifies an error and provides suggestions for the programmer to judge. The repair rates for the considered models and their pyramid-encoder counterparts are listed in <xref ref-type="table" rid="T1">Table&#x20;1</xref> and <xref ref-type="table" rid="T2">Table&#x20;2</xref>. For comparison, we attempted to test other machine-learning-based PLC tools. <xref ref-type="bibr" rid="B6">Gupta et&#x20;al. (2018)</xref> take compiler error messages as input, but our dataset focuses on logic flaws in programs that have no syntax errors, so this tool is not applicable. <xref ref-type="bibr" rid="B14">Pu et&#x20;al. (2016)</xref> provide neither an open-source repository nor documentation of their code. We successfully trained <xref ref-type="bibr" rid="B7">Gupta et&#x20;al. (2017)</xref> on our C/C&#x2b;&#x2b; dataset and include it in our work for comparison; unfortunately, its preprocessing requires a tokenizer that is provided only for C/C&#x2b;&#x2b;, not&#x20;Java.</p>
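The two metrics can be sketched as follows, where `candidates[i]` holds the beam-search hypotheses (width 5) for instance `i`, best hypothesis first, and `is_repaired` stands in for the evaluation oracle (a hypothetical helper, not part of the original implementation):

```python
def repair_rates(candidates, is_repaired):
    """Compute 1-candidate and 5-candidate repair rates (in %).

    candidates  : list of beam-search output lists, best hypothesis first
    is_repaired : predicate mapping a candidate program to True/False
    """
    n = len(candidates)
    # 1-candidate: only the top beam hypothesis counts (autocorrection).
    one = sum(is_repaired(beams[0]) for beams in candidates)
    # 5-candidate: success if any of the 5 suggestions repairs the code.
    five = sum(any(is_repaired(b) for b in beams[:5]) for beams in candidates)
    return 100.0 * one / n, 100.0 * five / n

# Toy illustration: 2 of 4 top hypotheses repair; 3 of 4 beams contain a repair.
beams = [["ok", "x"], ["x", "ok"], ["ok"], ["x", "y"]]
r1, r5 = repair_rates(beams, lambda c: c == "ok")
```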
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Repair rate of GRU and Transformer on the Juliet Test Suite for C/C&#x2b;&#x2b;, comparing the regular encoder and the pyramid encoder. Results are averaged over 4-fold cross-validation. We calculated the improvement of pyramid encoders over their nonpyramid counterparts. The pyramid encoder evidently does not pair well with Luong&#x2019;s local attention; we therefore exclude that combination from further discussion and from the average improvement calculation.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Model</th>
<th colspan="2" align="center">1-Candidate repair rate (%)</th>
<th colspan="2" align="center">5-Candidate repair rate (%)</th>
</tr>
<tr>
<th align="center">Regular encoder</th>
<th align="center">Pyramid encoder</th>
<th align="center">Regular encoder</th>
<th align="center">Pyramid encoder</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">GRU &#x2b; Bahdanau Att</td>
<td align="center">76.92</td>
<td align="center">76.09 (&#x2212;0.83)</td>
<td align="center">96.19</td>
<td align="center">95.55 (&#x2212;0.64)</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Dot</td>
<td align="center">74.38</td>
<td align="center">73.04 (&#x2212;1.34)</td>
<td align="center">94.27</td>
<td align="center">94.59 (&#x2b;0.32)</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: General</td>
<td align="center">75.79</td>
<td align="center">74.85 (&#x2212;0.94)</td>
<td align="center">94.83</td>
<td align="center">94.92 (&#x2b;0.09)</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Concat</td>
<td align="center">50.34</td>
<td align="center">47.26 (&#x2212;3.08)</td>
<td align="center">86.72</td>
<td align="center">86.14 (&#x2212;0.58)</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Local</td>
<td align="center">65.70</td>
<td align="center">49.18 (&#x2212;15.52)</td>
<td align="center">92.46</td>
<td align="center">86.24 (&#x2212;6.22)</td>
</tr>
<tr>
<td align="left">Transformer</td>
<td align="center">75.48</td>
<td align="center">72.39 (&#x2212;3.09)</td>
<td align="center">97.66</td>
<td align="center">96.78 (&#x2212;0.88)</td>
</tr>
<tr>
<td align="left">Average improvement (%)</td>
<td colspan="2" align="center">&#x2212;1.95</td>
<td colspan="2" align="center">&#x2212;0.34</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Repair rate of GRU and Transformer on the Juliet Test Suite for Java, comparing the regular encoder and the pyramid encoder. We did not include results from DeepFix, because its data tokenizer only supports C/C&#x2b;&#x2b;.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Model</th>
<th colspan="2" align="center">1-Candidate repair rate (%)</th>
<th colspan="2" align="center">5-Candidate repair rate (%)</th>
</tr>
<tr>
<th align="center">Regular encoder</th>
<th align="center">Pyramid encoder</th>
<th align="center">Regular encoder</th>
<th align="center">Pyramid encoder</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">GRU &#x2b; Bahdanau Att</td>
<td align="center">54.65</td>
<td align="center">56.21 (&#x2b;1.56)</td>
<td align="center">84.31</td>
<td align="center">83.98 (&#x2212;0.33)</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Dot</td>
<td align="center">54.30</td>
<td align="center">55.66 (&#x2b;1.36)</td>
<td align="center">82.73</td>
<td align="center">84.86 (&#x2b;2.13)</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: General</td>
<td align="center">53.15</td>
<td align="center">52.54 (&#x2212;0.61)</td>
<td align="center">82.81</td>
<td align="center">82.83 (&#x2b;0.02)</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Concat</td>
<td align="left"/>
<td align="left"/>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Transformer</td>
<td align="center">56.68</td>
<td align="center">57.35 (&#x2b;0.67)</td>
<td align="center">93.11</td>
<td align="center">93.54 (&#x2b;0.43)</td>
</tr>
<tr>
<td align="left">Average improvement (%)</td>
<td colspan="2" align="center">&#x2b;0.74</td>
<td colspan="2" align="center">&#x2b;0.75</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>From these results, we see that the pyramid encoder performs comparably to the regular encoder in most of the models we tested, except with Luong&#x2019;s local attention. The reason is that the pyramid encoder&#x2019;s output is very &#x201c;coarse-grained&#x201d;; each output position now represents information from <inline-formula id="inf46">
<mml:math id="minf46">
<mml:mrow>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> words. This introduces two drawbacks specific to local attention: a much &#x201c;blurrier&#x201d; attention center and a much broader attention window. As a result, the attention is much less targeted, which damages performance. We therefore exclude this attention mechanism from the rest of the article.</p>
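The coarse-graining can be illustrated with a toy sketch in which each pyramid layer merges adjacent positions (here simply averaging, as a stand-in for the actual learned merge), halving the sequence length so that after N layers each output position summarizes 2^(N-1) input words:

```python
def merge_pairs(seq):
    """One pyramid step: merge adjacent positions, halving sequence length."""
    if len(seq) % 2:          # pad odd-length sequences by repeating the last item
        seq = seq + [seq[-1]]
    return [(a + b) / 2.0 for a, b in zip(seq[::2], seq[1::2])]

def receptive_field(n_layers):
    """Input words summarized by one output position after N pyramid layers."""
    return 2 ** (n_layers - 1)

seq = list(range(16))          # 16 "hidden states" out of the first layer
for _ in range(2):             # two merging steps, as in a 3-layer pyramid encoder
    seq = merge_pairs(seq)
```

With a 3-layer pyramid encoder, the sequence shrinks from 16 to 4 positions and each position covers 4 words, which is what blurs the attention center for local attention.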
</sec>
<sec id="s5-2">
<title>5.2 Converging Speed</title>
<p>Since the pyramid encoder reduces the sequence length in higher layers, one can expect a smaller training cost per batch in both GRU and Transformer models. To quantify this effect, for each regular&#x2013;pyramid encoder model pair in <xref ref-type="table" rid="T1">Table&#x20;1</xref>, we set the same batch size and compare the average training speed in words per second, as shown in <xref ref-type="table" rid="T3">Table&#x20;3</xref>. The batch size is chosen to optimize training speed on the given GPU for each model. The table also lists the number of epochs each model needs to converge.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Training speed of GRU and Transformer on Juliet Test Suite for C/C&#x2b;&#x2b;.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Model</th>
<th rowspan="2" align="left">Batch size</th>
<th colspan="2" align="center">Training speed (words/s)</th>
<th colspan="2" align="center">Converge epoch</th>
</tr>
<tr>
<th align="left">Regular</th>
<th align="center">Pyramid</th>
<th align="left">Regular</th>
<th align="left">Pyramid</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">GRU &#x2b; Bahdanau Att</td>
<td align="center">8</td>
<td align="center">754</td>
<td align="center">1,185 (&#x2b;57%)</td>
<td align="center">18</td>
<td align="center">18</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: General</td>
<td align="center">16</td>
<td align="center">441</td>
<td align="center">853 (&#x2b;108%)</td>
<td align="center">23</td>
<td align="center">27</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Dot</td>
<td align="center">128</td>
<td align="center">4,646</td>
<td align="center">10,408 (&#x2b;124%)</td>
<td align="center">36</td>
<td align="center">34</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Concat</td>
<td align="center">6</td>
<td align="center">1,418</td>
<td align="center">2,344 (&#x2b;65%)</td>
<td align="center">23</td>
<td align="center">29</td>
</tr>
<tr>
<td align="left">Transformer</td>
<td align="center">8</td>
<td align="center">1,086</td>
<td align="center">2,181 (&#x2b;101%)</td>
<td align="center">33</td>
<td align="center">34</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The same type of model takes a similar number of epochs to converge with either encoder. However, the pyramid encoder greatly increases training speed, by between 50 and 130%, and can therefore shorten training time two- to four-fold while achieving the same performance. As an example, <xref ref-type="fig" rid="F4">Figure&#x20;4</xref> shows the learning curve of the GRU model with Luong&#x2019;s general attention, comparing the regular and pyramid encoders.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Learning curve of GRU with Luong&#x2019;s general attention, comparing the regular encoder to the pyramid encoder. The pyramid encoder model converges&#x20;faster.</p>
</caption>
<graphic xlink:href="frai-04-590215-g004.tif"/>
</fig>
</sec>
<sec id="s5-3">
<title>5.3 Memory Cost</title>
<p>The last thing we compare is the memory cost of the pyramid encoder and the regular encoder. This measure is crucial when input instances are very long, so that GPU memory can hold only a very small batch. In code correction, this is often the&#x20;case.</p>
<p>The metric we use for comparison is memory cost per instance, <italic>k</italic>, which is defined as<disp-formula id="e24">
<mml:math id="me24">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mtext>&#x394;Memory&#xa0;usage</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext>&#x394;Batch&#xa0;size</mml:mtext>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(24)</label>
</disp-formula>
<xref ref-type="fig" rid="F5">Figure&#x20;5</xref> shows the calculation process of <italic>k</italic>. Define <inline-formula id="inf47">
<mml:math id="minf47">
<mml:mrow>
<mml:mi mathvariant="normal">&#x2130;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> as memory efficiency. We calculated the <italic>k</italic> and <inline-formula id="inf48">
<mml:math id="minf48">
<mml:mi mathvariant="normal">&#x2130;</mml:mi>
</mml:math>
</inline-formula> value for each of the models we applied, shown in <xref ref-type="table" rid="T4">Table&#x20;4</xref>. We also included the number of parameters in each model, from which we see that each pair of models has roughly the same model&#x20;size.</p>
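As in Figure 5, k can be estimated as the slope of a linear fit of memory usage against batch size; the sketch below uses made-up measurements purely for illustration:

```python
def memory_per_instance(batch_sizes, mem_usage_mb):
    """Estimate k = d(memory usage)/d(batch size) as the least-squares slope."""
    n = len(batch_sizes)
    mx = sum(batch_sizes) / n
    my = sum(mem_usage_mb) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(batch_sizes, mem_usage_mb))
    var = sum((x - mx) ** 2 for x in batch_sizes)
    return cov / var

# Hypothetical measurements: 500 Mb fixed model cost + 160 Mb per instance.
batches = [1, 2, 4, 8]
memory = [500 + 160 * b for b in batches]

k = memory_per_instance(batches, memory)
efficiency = 1.0 / k   # the memory efficiency E = 1/k defined in the text
```

Fitting a slope rather than dividing a single measurement by the batch size removes the fixed memory occupied by the model parameters themselves.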
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Memory cost per instance for GRU models with Bahdanau attention; <italic>k</italic> is the slope of the linear fit (black dashed line). The red dashed line represents the maximum memory of a GeForce GTX 1080 Ti graphics&#x20;card.</p>
</caption>
<graphic xlink:href="frai-04-590215-g005.tif"/>
</fig>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Memory cost for considered models, comparing regular encoder and pyramid encoder: pyramid encoder greatly increased the memory efficiency.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Model</th>
<th colspan="2" align="center">k (Mb/instance)</th>
<th colspan="2" align="center">
<inline-formula id="inf49">
<mml:math id="minf49">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x2130;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</th>
<th colspan="2" align="center">Parameters (10<sup>7</sup>)</th>
</tr>
<tr>
<th align="center">Regular</th>
<th align="center">Pyramid</th>
<th align="center">Regular</th>
<th align="center">Pyramid</th>
<th align="center">Regular</th>
<th align="center">Pyramid</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">GRU &#x2b; Bahdanau Att</td>
<td align="center">1,151.71</td>
<td align="center">164.52</td>
<td align="center">0.86</td>
<td align="center">6.08 (&#x2b;600%)</td>
<td align="center">1.24</td>
<td align="center">1.11</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: General</td>
<td align="center">830.71</td>
<td align="center">165.03</td>
<td align="center">1.20</td>
<td align="center">6.05 (&#x2b;403%)</td>
<td align="center">1.22</td>
<td align="center">1.10</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Dot</td>
<td align="center">65.91</td>
<td align="center">52.42</td>
<td align="center">15.17</td>
<td align="center">19.08 (&#x2b;26%)</td>
<td align="center">1.20</td>
<td align="center">1.08</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Concat</td>
<td align="center">1,381.6</td>
<td align="center">431.87</td>
<td align="center">0.72</td>
<td align="center">2.31 (&#x2b;220%)</td>
<td align="center">1.24</td>
<td align="center">1.11</td>
</tr>
<tr>
<td align="left">Transformer</td>
<td align="center">414.67</td>
<td align="center">263.33</td>
<td align="center">0.24</td>
<td align="center">0.38 (&#x2b;57%)</td>
<td align="center">2.35</td>
<td align="center">2.82</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The pyramid encoder increases memory efficiency by 20%&#x2013;600% depending on the attention mechanism, while increasing the memory occupied by the model itself by only around 10%. Note that memory efficiency directly determines the maximum batch size a single GPU can hold, and therefore how fully the GPU is utilized. For example, for the regular GRU with Bahdanau attention, the memory of a GeForce GTX 1080 Ti graphics card can only support a batch size of 8, which does not fully utilize the GPU; with the pyramid encoder, it can support up to 60 instances per batch. In practice, this drastically reduces training time by increasing GPU utilization, in addition to the smaller computational cost of the pyramid encoder addressed in the previous section.</p>
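The link between k and the maximum batch size can be checked with a rough back-of-the-envelope sketch, using roughly 11,000 Mb for the GTX 1080 Ti and the Bahdanau-attention k values from Table 4 (the fixed memory of the model itself is neglected here, so the figures slightly overshoot those quoted above):

```python
def max_batch_size(gpu_memory_mb, k_mb_per_instance):
    """Largest batch that fits: floor(available memory / memory per instance)."""
    return int(gpu_memory_mb // k_mb_per_instance)

GPU_MB = 11_000                              # GTX 1080 Ti, approximate
regular = max_batch_size(GPU_MB, 1151.71)    # regular GRU + Bahdanau (Table 4)
pyramid = max_batch_size(GPU_MB, 164.52)     # pyramid GRU + Bahdanau (Table 4)
```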
</sec>
</sec>
<sec id="s6">
<title>6 Discussion</title>
<sec id="s6-1">
<title>6.1 Length Analyses</title>
<p>
<xref ref-type="fig" rid="F6">Figure&#x20;6</xref> shows the repair rate of the models with respect to input length. We omit the results of the Transformer, Bahdanau&#x2019;s attention, and Luong&#x2019;s general attention, because they are qualitatively similar to the result of Luong&#x2019;s dot attention. Regardless of attention mechanism, these seq2seq models (with either pyramid or regular encoder) are relatively robust to longer inputs. The performance drops at around 250 words and above 500 words likely result from a shortage of samples, as can be seen in <xref ref-type="fig" rid="F7">Figure&#x20;7</xref>, the length histogram of source and target instances. The histogram also shows that the majority of code instances contain several hundred words, whereas natural language sentences are typically no longer than 50 words. This makes PLC problems considerably more computationally demanding than NLC problems, which makes the pyramid structure especially useful.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Length analyses of Luong&#x2019;s general attention and Luong&#x2019;s concat attention. The results from the rest of the models are qualitatively similar to result of Luong&#x2019;s general attention and thus are omitted.</p>
</caption>
<graphic xlink:href="frai-04-590215-g006.tif"/>
</fig>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Length histograms of flawed code (left) and repaired code (right) instances.</p>
</caption>
<graphic xlink:href="frai-04-590215-g007.tif"/>
</fig>
</sec>
<sec id="s6-2">
<title>6.2 Examples of Correction</title>
<p>In this section we give several examples of successful corrections by our Pyramid GRU model on the Juliet Test Suite for C/C&#x2b;&#x2b;, for closer examination of the model and dataset. Red struck-out text denotes the original faulty code, and blue highlighted text denotes the repair made by the model.<list list-type="simple">
<list-item>
<p>Example 1: Memory allocation match</p>
</list-item>
</list>
</p>
<p>The flawed code creates a char variable whose size does not match that of its concatenation destination. The model corrects it so that the sizes match.</p>
<p>
<list list-type="simple">
<list-item>
<p>Example 2: Redundant&#x20;Code</p>
</list-item>
</list>
</p>
<p>This is an example in which the model deletes redundant code where a variable is freed twice.</p>
<fig id="Fx1" position="float">
<graphic xlink:href="frai-04-590215-fx1.tif"/>
</fig>
<p>
<list list-type="simple">
<list-item>
<p>Example 3: Possible Overflow</p>
</list-item>
</list>
</p>
<p>Here we show a slightly questionable correction provided by the dataset. To prevent a potential string overflow originating from an environment variable, the repair suggested by the Juliet Test Suite is to drop the concatenation of the environment string entirely and replace the variable with the arbitrary string &#x201c;&#x2a;.&#x2a;&#x201d;. This &#x201c;correction&#x201d; is easy for the model to learn; however, it changes the original purpose of the program.</p>
<fig id="Fx2" position="float">
<graphic xlink:href="frai-04-590215-fx2.tif"/>
</fig>
<p>
<list list-type="simple">
<list-item>
<p>Example 4: Correction Across Functions</p>
</list-item>
</list>
</p>
<p>In this example, the model demonstrates the ability to make connections across the whole instance, between different functions. Here it prevents a potential overflow in the sink function, caused by a variable passed from the main function, by adding an &#x201c;if condition&#x201d;.</p>
<fig id="Fx3" position="float">
<graphic xlink:href="frai-04-590215-fx3.tif"/>
</fig>
<fig id="Fx4" position="float">
<graphic xlink:href="frai-04-590215-fx4.tif"/>
</fig>
</sec>
<sec id="s6-3">
<title>6.3 Generalizability to Syntax Error-Oriented Dataset</title>
<p>In the spirit of comparative study, we attempted to compare our method to Deepfix (<xref ref-type="bibr" rid="B7">Gupta et&#x20;al., 2017</xref>), to the best of our knowledge the only machine-learning-based PLC method whose code and dataset are publicly available. Unfortunately, applying Deepfix to the Juliet Test Suite failed, because Deepfix aims only to correct syntax errors and uses a compiler as the evaluator, marking any program that passes compilation as &#x201c;correct&#x201d;. This clearly contradicts the goal of identifying logic errors in syntactically correct programs.</p>
<p>The difficulty we face here stems from a more general problem in machine-learning-based PLC: the field is still disorganized, and its works are largely uncoordinated. Each group may use its own dataset and design its system to match that dataset&#x2019;s specific purpose. Comparative work is difficult not only because datasets are hard to obtain due to privacy policies, but also because the issues raised in PLC are varied; each model is designed and optimized to best address the problems in its particular dataset.</p>
<p>For these reasons, we fell back to a weaker comparative study, running our seq2seq models on the dataset from Deepfix. Deepfix uses a generated dataset derived from students&#x2019; submissions to an introductory C course in a web-based tutoring system (<xref ref-type="bibr" rid="B3">Das et&#x20;al., 2016</xref>). For each student submission, up to five syntax errors are injected into the code instance: replacing &#x201c;}; &#x201d; with &#x201c;; }&#x201d;, deleting a semicolon, adding an extra &#x201c;}&#x201d;, replacing a semicolon with a period, or replacing a comma with a semicolon. A program is considered successfully repaired if all of the syntax errors are fixed.</p>
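The listed mutations can be sketched as a simple string-level mutator (illustrative only; the actual Deepfix generation procedure may differ in detail, and names such as `corrupt` are ours):

```python
import random

# The five mutation types described in the text, each applied to the
# first matching site in the program string.
MUTATIONS = [
    lambda s: s.replace("};", ";}", 1),   # replace "};" with ";}"
    lambda s: s.replace(";", "", 1),      # delete a semicolon
    lambda s: s.replace("}", "}}", 1),    # add an extra "}"
    lambda s: s.replace(";", ".", 1),     # replace a semicolon with a period
    lambda s: s.replace(",", ";", 1),     # replace a comma with a semicolon
]

def corrupt(program, n_errors=5, seed=0):
    """Inject up to n_errors random syntax mutations into a program."""
    rng = random.Random(seed)
    for _ in range(rng.randint(1, n_errors)):
        program = rng.choice(MUTATIONS)(program)
    return program

src = "int main() { int a, b; a = 1; return a; }"
bad = corrupt(src)
```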
<p>
<xref ref-type="table" rid="T5">Table&#x20;5</xref> compares the repair rate of our seq2seq models to the method applied by Deepfix. We observe that the pyramid encoder performs worse than the regular encoder on this particular dataset. This is expected from how the dataset was generated: the injected syntax errors are extremely local, and the fix usually involves changing only one token or two neighboring tokens, leaving the rest of the code unchanged. While a pyramid encoder summarizes information from neighboring tokens, it also blurs local information.</p>
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Comparison of our models with Deepfix (<xref ref-type="bibr" rid="B7">Gupta et&#x20;al., 2017</xref>) on the Deepfix dataset. All results are averages of 5-fold cross-validation.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Model</th>
<th colspan="2" align="center">1-Candidate repair rate (%)</th>
<th colspan="2" align="center">5-Candidate repair rate (%)</th>
</tr>
<tr>
<th align="center">Regular encoder</th>
<th align="center">Pyramid encoder</th>
<th align="center">Regular encoder</th>
<th align="center">Pyramid encoder</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Transformer</td>
<td align="center">51.96</td>
<td align="center">43.78</td>
<td align="center">67.16</td>
<td align="center">59.32</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: General</td>
<td align="center">51.86</td>
<td align="center">34.80</td>
<td align="center">66.33</td>
<td align="center">48.44</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Luong Att: Dot</td>
<td align="center">58.63</td>
<td align="center">41.09</td>
<td align="center">72.31</td>
<td align="center">54.47</td>
</tr>
<tr>
<td align="left">GRU &#x2b; Bahdanau Att</td>
<td align="center">27.47</td>
<td align="center">15.21</td>
<td align="center">36.19</td>
<td align="center">22.59</td>
</tr>
<tr>
<td align="left">Deepfix</td>
<td colspan="4" align="center">56</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We also observe that, among Luong&#x2019;s attention variants, dot performs best on this dataset, while Bahdanau&#x2019;s attention performs worst. After examining the dataset carefully, we arrived at the following hypothesis: here, the network only needs to <bold>copy</bold> the original tokens for most of the instance and locally fix one or two tokens. This means that, most of the time, for each decoder hidden state <inline-formula id="inf50">
<mml:math id="minf50">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, the normalized attention score <inline-formula id="inf51">
<mml:math id="minf51">
<mml:mrow>
<mml:mtext>score</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mrow>
<mml:mi mathvariant="bold-italic">t</mml:mi>
<mml:mo>&#x27;</mml:mo>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> needs to be close to one where <inline-formula id="inf52">
<mml:math id="minf52">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x27;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and close to 0 everywhere else. Dot attention, which simply takes an inner product of hidden states, can achieve this easily, because latent vectors are mostly orthogonal to each other in the latent space due to its high dimensionality. On the other hand, Bahdanau attention, which applies an affine transformation to every hidden state <inline-formula id="inf53">
<mml:math id="minf53">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, may overcomplicate the problem and fail to capture the correct attention.</p>
</sec>
<sec id="s6-4">
<title>6.4 Alternative Method for Small Datasets: Transfer Learning</title>
<p>One main difficulty that researchers come across when applying machine learning methods to PLC problems is the availability of suitable datasets. Although many datasets and shared tasks are available on the <xref ref-type="bibr" rid="B16">Software Assurance Reference Dataset (2006)</xref>, most of them include fewer than 1,000 examples, which makes neural-network-based methods nearly impossible to train. To tackle this problem, we adopt the idea of transfer learning from <xref ref-type="bibr" rid="B13">Pan and Yang (2009)</xref>.</p>
<p>Our idea is to take the encoder part of the model trained on the Juliet Test Suite and attach it to an untrained decoder designed for the specific problem. We aim to exploit the fact that code written in the same programming language shares the same syntax library and construction&#x20;rules.</p>
<p>Since many available datasets provide only the faulty code and its corresponding fault category, we give an example of fault classification using transfer learning, applying the model pretrained on the Juliet Test Suite for C/C&#x2b;&#x2b; to the ITC benchmark (<xref ref-type="bibr" rid="B1">Charles (2015)</xref>).</p>
<sec id="s6-4-1">
<title>6.4.1 Model Structure</title>
<p>Given a faulty code instance, our goal is to train a classification model that predicts its error type from a given list of error categories.</p>
<p>We keep the encoder part of the pretrained model and use it directly as the encoder for the classification problem. The exception is the embedding layer: although the syntax is the same, the vocabulary of the new dataset contains variable names that did not occur in the pretrained embedding. In practice, we manually expand the embedding layer to accommodate the new &#x201c;words&#x201d; while keeping the embeddings of the old &#x201c;words&#x201d; unchanged. To add variation relative to the original model, we also reinitialize the weights in the last encoding&#x20;layer.</p>
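The embedding expansion can be sketched as follows: pretrained rows are copied verbatim, and only rows for new vocabulary items are freshly initialized (`expand_embedding` is our illustrative name, not the authors' code):

```python
import random

def expand_embedding(old_matrix, new_vocab_size, dim, seed=0):
    """Grow an embedding matrix, preserving the pretrained rows.

    old_matrix : list of `dim`-dimensional rows (pretrained embeddings)
    Rows for the new words are randomly initialized; old rows are untouched.
    """
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                for _ in range(new_vocab_size - len(old_matrix))]
    return old_matrix + new_rows

pretrained = [[0.1, 0.2], [0.3, 0.4]]        # 2 old words, embedding dim 2
expanded = expand_embedding(pretrained, new_vocab_size=5, dim=2)
```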
<p>For the decoder, instead of generating a sequence, we take the output of the first time step of the reinitialized decoder and pass it to a linear layer that projects it to an <inline-formula id="inf54">
<mml:math id="minf54">
<mml:mrow>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mtext>class</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> dimensional vector. <inline-formula id="inf55">
<mml:math id="minf55">
<mml:mrow>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mtext>class</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the number of error classes. The model was trained to minimize cross-entropy loss with the Adam optimizer.</p>
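The classification head described above can be sketched in a framework-free way: a linear projection of the first-timestep decoder output to n_class logits, followed by softmax cross-entropy (a minimal stand-in for the actual training code; all values below are illustrative):

```python
import math

def classify_logits(first_step_output, weight, bias):
    """Linear projection of the first decoder time step to n_class logits."""
    return [sum(w * x for w, x in zip(row, first_step_output)) + b
            for row, b in zip(weight, bias)]

def cross_entropy(logits, label):
    """Softmax cross-entropy loss for a single example (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

h0 = [1.0, -0.5]                             # first-timestep decoder output
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # projection to n_class = 3
b = [0.0, 0.0, 0.0]

logits = classify_logits(h0, W, b)
loss = cross_entropy(logits, label=0)        # minimized by the Adam optimizer
```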
</sec>
<sec id="s6-4-2">
<title>6.4.2 Results</title>
<p>We extracted 566 C/C&#x2b;&#x2b; code instances from the ITC benchmark. These instances are organized into 44 error categories, with the largest category containing around 30 instances and the smallest only two. The instances are divided into a training set of 485 instances, a validation set of 42 instances, and a test set of 39 instances. For comparison, we also tried a Pyramid GRU and a Pyramid Transformer with the same model structure but no prior knowledge from the Juliet Test Suites. The results are shown in <xref ref-type="table" rid="T6">Table&#x20;6</xref>.</p>
<table-wrap id="T6" position="float">
<label>TABLE 6</label>
<caption>
<p>Comparison of results of transfer learning on the error type classification task. The models without transfer learning demonstrate no predictive power and no improvement during the course of training.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Model</th>
<th align="center">Accuracy (%)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Transfer learning: PyrGRU</td>
<td align="char" char=".">60.5</td>
</tr>
<tr>
<td align="left">Transfer learning: PyrTFM</td>
<td align="char" char=".">69.1</td>
</tr>
<tr>
<td align="left">Fresh pyramid GRU</td>
<td align="char" char=".">16.7</td>
</tr>
<tr>
<td align="left">Fresh pyramid transformer</td>
<td align="char" char=".">7.1</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The fresh GRU and Transformer models show no predictive power, producing a constant prediction for all inputs; the loss did not decrease during training, indicating insufficient gradient on the loss landscape. Transfer learning, on the other hand, demonstrates fair predictive power, correctly classifying over 60% of instances, even though the ITC benchmark is written in a very different style from the Juliet Test Suites and the dataset is 50&#x20;times smaller.</p>
<p>This result shows that neural-network-based methods can be used in code correction problems despite the shortage of data, a common problem in this&#x20;field.</p>
</sec>
</sec>
</sec>
<sec id="s7">
<title>7 Conclusion</title>
<p>In our work, we show that seq2seq models, successful in natural language correction, are also applicable to programming language correction. Our results show that seq2seq models are well suited to suggesting fixes for potential errors and achieve a decent repair rate (above 70% on the C/C&#x2b;&#x2b; dataset and above 50% on the Java dataset) in code auto-correction. Although these results are limited to the Juliet Test Suites, we expect that, given sufficient training data, seq2seq models can also perform well on other PLC problems.</p>
<p>Building on the commonly used encoder-decoder structure, we introduce a general pyramid encoder for seq2seq models. Our results demonstrate that this structure significantly reduces memory and computational costs. This is helpful because PLC is generally more computationally expensive than NLC, due to its longer average instance length.</p>
<p>The publicly available datasets in PLC are mostly small and noisy; most datasets we found contain close to or fewer than 1,000 code instances, far from enough for training seq2seq and many other machine learning models. Our transfer learning results point out a way of handling these small datasets by using a pretrained model as the encoder, which boosts performance by a large margin.</p>
<p>In future work, we will further investigate the influence of different neural-network architectures, for instance, parallel encoders/decoders and Tree2Tree models. We will also modify our model and examine its performance on tasks beyond code correction, such as program generation and code optimization, and we will examine potential differences between artificial and realistic datasets.</p>
</sec>
</body>
<back>
<sec id="s8">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="https://samate.nist.gov/SARD/around.php#juliet_documents">https://samate.nist.gov/SARD/around.php&#x23;juliet_documents</ext-link> <ext-link ext-link-type="uri" xlink:href="https://samate.nist.gov/SARD/view.php?tsID=104">https://samate.nist.gov/SARD/view.php?tsID&#x3d;104</ext-link>.</p>
</sec>
<sec id="s9">
<title>Author Contributions</title>
<p>SH came up with the generalized pyramid encoder, processed the dataset, programmed each of the seq2seq models, conducted the experiments, and gathered the results. He wrote parts 3, 4, 5, and 6 of the manuscript. XZ was responsible for the literature review; he wrote part 2 of the manuscript independently and parts 1 and 6 jointly with SH. He and SH also worked together to find supplementary datasets. SC was the advisor of the project; he advised on the general direction of the research, provided facilities for the experiments, and supervised the research process. He also provided access to the Juliet Test Suite, the main dataset used in the research, and helped proofread the manuscript. All three authors shared ideas, carried out discussions, and developed solutions together over the course of the research.</p>
</sec>
<sec id="s10">
<title>Funding</title>
<p>The study was funded by the National Science Foundation, DMS 1737897.</p>
</sec>
<ack>
<p>This manuscript has been released as a pre-print at arXiv (<xref ref-type="bibr" rid="B9">Huang et&#x20;al., 2020</xref>).</p>
</ack>
<sec sec-type="COI-statement" id="s11">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s12">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frai.2021.590215/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frai.2021.590215/full&#x23;supplementary-material</ext-link>.</p>
<supplementary-material xlink:href="datasheet1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<fn-group>
<fn id="FN1">
<label>1</label>
<p>See <ext-link ext-link-type="uri" xlink:href="https://github.com/b19e93n/PLC-Pyramid">https://github.com/b19e93n/PLC-Pyramid</ext-link>.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Charles</surname>
<given-names>O</given-names>
</name>
</person-group>. (<year>2015</year>). [Dataset] <article-title>ITC-benchmarks</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://samate.nist.gov/SARD/testsuite.php">https://samate.nist.gov/SARD/testsuite.php</ext-link>
</comment> (<comment>Accessed</comment> December 28, 2015). </citation>
</ref>
<ref id="B2">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Cho</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Van Merri&#xeb;nboer</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Gulcehre</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Bahdanau</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Bougares</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Schwenk</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2014</year>). <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>. <comment>arXiv [Preprint]. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1406.1078">arXiv:1406.1078</ext-link>
</comment> (<comment>Accessed</comment> January 3, 2014). </citation>
</ref>
<ref id="B3">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Das</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Ahmed</surname>
<given-names>U. Z.</given-names>
</name>
<name>
<surname>Karkare</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gulwani</surname>
<given-names>S</given-names>
</name>
</person-group>. (<year>2016</year>). <article-title>Prutor: a system for tutoring CS1 and collecting student programs for analysis</article-title>. <comment>arXiv [Preprint]. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1608.03828">arXiv:1608.03828</ext-link>
</comment> (<comment>Accessed</comment> August 12, 2016). </citation>
</ref>
<ref id="B4">
<citation citation-type="web">
<collab>Google</collab> (<year>2016a</year>). <article-title>Clang-tidy</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://clang.llvm.org/extra/clang-tidy/">http://clang.llvm.org/extra/clang-tidy/</ext-link>
</comment> (<comment>Accessed</comment> April 23, 2016). </citation>
</ref>
<ref id="B5">
<citation citation-type="web">
<collab>Google</collab> (<year>2016b</year>). <article-title>Error-prone</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://errorprone.info/">http://errorprone.info/</ext-link>
</comment> (<comment>Accessed</comment> January 25, 2016). </citation>
</ref>
<ref id="B6">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Gupta</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Kanade</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shevade</surname>
<given-names>S</given-names>
</name>
</person-group>. (<year>2018</year>). <article-title>Deep reinforcement learning for programming language correction</article-title>. <comment>arXiv [Preprint]. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1801.10467">arXiv:1801.10467</ext-link>
</comment> (<comment>Accessed</comment> January 31, 2018). </citation>
</ref>
<ref id="B7">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gupta</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Pal</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kanade</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shevade</surname>
<given-names>S</given-names>
</name>
</person-group>. (<year>2017</year>). &#x201c;<article-title>DeepFix: fixing common C language errors by deep learning</article-title>,&#x201d; in <conf-name>Proceedings of the thirty-first AAAI conference on artificial intelligence</conf-name>, <conf-loc>San Francisco, California</conf-loc>, <conf-date>February 4&#x2013;9, 2017</conf-date>, (<publisher-name>AAAI Press</publisher-name>) <fpage>1345</fpage>&#x2013;<lpage>1351</lpage>. </citation>
</ref>
<ref id="B8">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Hao</surname>
<given-names>K</given-names>
</name>
</person-group>. (<year>2019</year>). <article-title>Training a single AI model can emit as much carbon as five cars in their lifetimes</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/">https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/</ext-link>
</comment> (<comment>Accessed</comment> September 28, 2019). </citation>
</ref>
<ref id="B9">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chin</surname>
<given-names>S</given-names>
</name>
</person-group>. (<year>2020</year>). <article-title>A study of pyramid structure for code correction</article-title>. <comment>arXiv [Preprint]. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2001.11367">arXiv:2001.11367</ext-link>
</comment> (<comment>Accessed</comment> January 28, 2020). </citation>
</ref>
<ref id="B10">
<citation citation-type="web">
<collab>JetBrains</collab> (<year>2016</year>). <article-title>ReSharper</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.jetbrains.com/resharper/">https://www.jetbrains.com/resharper/</ext-link>
</comment> (<comment>Accessed</comment> September 12, 2016). </citation>
</ref>
<ref id="B11">
<citation citation-type="web">
<collab>NSA Center for Assured Software</collab> (<year>2013</year>). [Dataset] <article-title>Juliet test suite C/C++</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://samate.nist.gov/SARD/around.php#juliet_documents">https://samate.nist.gov/SARD/around.php&#x23;juliet_documents</ext-link>
</comment> (<comment>Accessed</comment> May 15, 2013). </citation>
</ref>
<ref id="B12">
<citation citation-type="web">
<collab>NSA Center for Assured Software</collab> (<year>2018</year>). [Dataset] <article-title>Juliet test suite java</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://samate.nist.gov/SARD/around.php#juliet_documents">https://samate.nist.gov/SARD/around.php&#x23;juliet_documents</ext-link>
</comment> (<comment>Accessed</comment> November 17, 2018). </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pan</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Q</given-names>
</name>
</person-group>. (<year>2009</year>). <article-title>A survey on transfer learning</article-title>. <source>IEEE Trans. Knowl. Data Eng.</source> <volume>22</volume>, <fpage>1345</fpage>&#x2013;<lpage>1359</lpage>. <pub-id pub-id-type="doi">10.1109/TKDE.2009.191</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Pu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Solar-Lezama</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Barzilay</surname>
<given-names>R</given-names>
</name>
</person-group>. (<year>2016</year>). &#x201c;<article-title>sk_p: a neural program corrector for MOOCs</article-title>,&#x201d; in <conf-name>Companion proceedings of the 2016 ACM SIGPLAN international conference on systems, programming, languages and applications: software for humanity</conf-name> (<publisher-name>SPLASH Companion</publisher-name>), <fpage>39</fpage>&#x2013;<lpage>40</lpage>. </citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Singh</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Gulwani</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Solar-Lezama</surname>
<given-names>A</given-names>
</name>
</person-group>. (<year>2013</year>). &#x201c;<article-title>Automated feedback generation for introductory programming assignments</article-title>,&#x201d; in <conf-name>Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation</conf-name>, <conf-loc>Washington, DC</conf-loc>, <conf-date>June 06, 2013</conf-date> (<publisher-name>ACM</publisher-name>), <fpage>15</fpage>&#x2013;<lpage>26</lpage>. </citation>
</ref>
<ref id="B16">
<citation citation-type="web">
<collab>Software Assurance Reference Dataset</collab> (<year>2006</year>). [Dataset] <article-title>SARD datasets</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://samate.nist.gov/SARD/testsuite.php">https://samate.nist.gov/SARD/testsuite.php</ext-link>
</comment> (<comment>Accessed</comment> January 6, 2006). </citation>
</ref>
<ref id="B17">
<citation citation-type="web">
<collab>Synopsys</collab> (<year>2016</year>). <article-title>Coverity</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://www.coverity.com/">http://www.coverity.com/</ext-link>
</comment> (<comment>Accessed</comment> July 11, 2016). </citation>
</ref>
<ref id="B18">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Vaswani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shazeer</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Parmar</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Uszkoreit</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Gomez</surname>
<given-names>A. N.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>Attention is all you need</article-title>,&#x201d; in <source>Advances in neural information processing systems</source>. <fpage>6000</fpage>&#x2013;<lpage>6010</lpage>. </citation>
</ref>
<ref id="B19">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Xie</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Avati</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Arivazhagan</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Jurafsky</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>A. Y</given-names>
</name>
</person-group>. (<year>2016</year>). <article-title>Neural language correction with character-based attention</article-title>. <comment>arXiv [Preprint]. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1603.09727">arXiv:1603.09727</ext-link>
</comment> (<comment>Accessed</comment> March 31, 2016). </citation>
</ref>
</ref-list>
</back>
</article>