<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="brief-report">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2022.984759</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Brief Research Report</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Creating a list of word alignments from parallel Russian simplification data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Dmitrieva</surname> <given-names>Anna</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1408763/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Laposhina</surname> <given-names>Antonina</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c002"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1410221/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Lebedeva</surname> <given-names>Maria Yuryevna</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c003"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1148109/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Faculty of Arts, University of Helsinki</institution>, <addr-line>Helsinki</addr-line>, <country>Finland</country></aff>
<aff id="aff2"><sup>2</sup><institution>Language and Cognition Laboratory, Pushkin State Russian Language Institute</institution>, <addr-line>Moscow</addr-line>, <country>Russia</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Valery Solovyev, Kazan Federal University, Russia</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Vladimir Ivanov, Innopolis University, Russia; Marina Solnyshkina, Kazan Federal University, Russia</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Anna Dmitrieva <email>annadmitrieva252&#x00040;gmail.com</email></corresp>
<corresp id="c002">Antonina Laposhina <email>antonina.laposhina&#x00040;gmail.com</email></corresp>
<corresp id="c003">Maria Yuryevna Lebedeva <email>m.u.lebedeva&#x00040;gmail.com</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Natural Language Processing, a section of the journal Frontiers in Artificial Intelligence</p></fn></author-notes>
<pub-date pub-type="epub">
<day>12</day>
<month>09</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>984759</elocation-id>
<history>
<date date-type="received">
<day>02</day>
<month>07</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>08</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Dmitrieva, Laposhina and Lebedeva.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Dmitrieva, Laposhina and Lebedeva</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>This work describes the development of a list of monolingual word alignments taken from parallel Russian simplification data. This word list can be used in lexical simplification tasks such as rule-based simplification and lexically constrained decoding for neural machine translation models. Moreover, it constitutes a valuable source of information for developing educational materials for teaching Russian as a second/foreign language. In this work, a word list was compiled automatically and post-edited by human experts. The resulting list contains 1,409 word pairs in which each &#x0201C;complex&#x0201D; word has an equivalent &#x0201C;simpler&#x0201D; (shorter, more frequent, modern, international) synonym. We studied the contents of the word list by comparing the frequencies of the words in the pairs and their levels in the special CEFR-graded vocabulary lists for learners of Russian as a foreign language. The evaluation demonstrated that lexical simplification by means of single-word synonym replacement does not occur often in the adapted texts. The resulting list also illustrates the peculiarities of the lexical simplification task for L2 learners, such as the choice of a less frequent but international word.</p></abstract>
<kwd-group>
<kwd>lexical simplification</kwd>
<kwd>lexical substitution</kwd>
<kwd>vocabulary list</kwd>
<kwd>monolingual word alignment</kwd>
<kwd>Simple Russian</kwd>
<kwd>Russian as a foreign language</kwd>
</kwd-group>
<counts>
<fig-count count="3"/>
<table-count count="1"/>
<equation-count count="0"/>
<ref-count count="26"/>
<page-count count="07"/>
<word-count count="4905"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Lexical simplification is one of the main strategies to make text easier to understand for L2 learners. The key subtask here is to find suitable candidates for replacing a complex word with a simpler synonym. This study explored the potential of monolingual word aligners for the development of a list of possible lexical substitutions. This is a special word list where each &#x0201C;complex&#x0201D; word has an equivalent &#x0201C;simpler&#x0201D; (shorter, more frequent, modern, international) synonym. This list was compiled automatically and post-edited by experts. After that, we analyzed the contents of the list and compared the words in it to the existing standardized lexical minima for different levels of Russian language proficiency and a frequency dictionary. Such parallel lists of words and their simpler alternatives can be used for text simplification purposes, for example, in rule-based simplification tools and in lexically constrained decoding for neural machine translation models. Moreover, such a list is a valuable source of information for creators of curricula and educational materials.</p>
<p>Currently, there is not much data that can be used for Russian lexical simplification. The main source of information about lexical complexity and word usage is CEFR-graded vocabulary lists for learners of Russian as a second language (L2; Andryshina and Kozlova, <xref ref-type="bibr" rid="B4">2012</xref>, <xref ref-type="bibr" rid="B5">2015</xref>; Andryshina, <xref ref-type="bibr" rid="B2">2017a</xref>,<xref ref-type="bibr" rid="B3">b</xref>). There are separate lists for each level of language proficiency from elementary to advanced. However, these lists do not provide information about synonym/hypernym/hyponym relationships between words. Conversely, while there are dictionaries of synonyms, they do not guarantee that a given synonym is simpler than a synonymized word. Therefore, in order to create, for instance, a reliable lexical simplification tool, a specialized word list would be needed.</p>
</sec>
<sec id="s2">
<title>Related work</title>
<p>Most lexical simplification models involve replacing complex words or phrases with simpler ones. The source of data on the complexity of a word here can be formal text characteristics, e.g., word frequency or length (Shardlow, <xref ref-type="bibr" rid="B22">2013</xref>), and the complexity ranking of words by native or non-native speakers (Maddela and Xu, <xref ref-type="bibr" rid="B16">2018</xref>). The other source of data can be parallel corpora of original and simplified versions of a text, which illustrate the natural process of text adaptation. Parallel monolingual simplified corpora are essential both for extracting simplification rules (Horn et al., <xref ref-type="bibr" rid="B13">2014</xref>) and for evaluating the quality of unsupervised models (Qiang et al., <xref ref-type="bibr" rid="B21">2019</xref>). The main difficulty of this method lies in the need to match original text fragments with their simplified versions. The alignment process involves matching corresponding fragments of the original and simplified versions of a text at the paragraph, sentence, or individual word level. Thus, monolingual word alignment aims to align words or phrases with similar meanings in two sentences that are written in the same language (Lan et al., <xref ref-type="bibr" rid="B14">2021</xref>). Historically, word alignments have been used in tasks such as statistical machine translation and annotation transfer (&#x000D6;stling and Tiedemann, <xref ref-type="bibr" rid="B19">2016</xref>), and today monolingual alignments can be useful for improving interpretability in natural language understanding tasks, improving model performance for text-to-text generation tasks, and analyzing human editing operations (Lan et al., <xref ref-type="bibr" rid="B14">2021</xref>). One of the text-to-text generation tasks that utilize lists of monolingual word alignments is lexical simplification.</p>
<p>There are not many tools suitable for monolingual word alignment of regular and simplified texts. However, many word alignment instruments have been created for parallel texts in different languages. Statistical systems such as GIZA&#x0002B;&#x0002B; (Och and Ney, <xref ref-type="bibr" rid="B18">2003</xref>) or fast_align (Dyer et al., <xref ref-type="bibr" rid="B11">2013</xref>) have been widely used for a long time; however, neural tools have also recently gained popularity. Neural network-based instruments can take advantage of large-scale contextualized word embeddings derived from multilingual language models trained on monolingual corpora (Dou and Neubig, <xref ref-type="bibr" rid="B9">2021</xref>). Recently, neural tools specifically for monolingual alignment have started to appear (Lan et al., <xref ref-type="bibr" rid="B14">2021</xref>), but so far no such instruments have been developed for Russian.</p>
<p>Lexical simplification has proved to be one of the main text simplification strategies for Russian second-language learning purposes (Sibirtseva and Karpov, <xref ref-type="bibr" rid="B23">2014</xref>, p. 25; Dmitrieva et al., <xref ref-type="bibr" rid="B8">2021</xref>). It has also been shown that lexical substitution is an effective text adaptation strategy for children with reading disabilities (Zubov and Petrova, <xref ref-type="bibr" rid="B26">2020</xref>). However, for the Russian language, attempts at automated lexical substitution are rare. In one study (Dmitrieva, <xref ref-type="bibr" rid="B7">2016</xref>), lexical simplification was performed on Russian data by means of synonym replacement. The author created a list of synonym pairs for this purpose, where the target words were taken from the CEFR vocabulary lists and the source words were obtained from a dictionary of synonyms. The list is said to contain around 8,000 synonym pairs. However, the word pairs in this list were not taken from real parallel texts, which precludes the possibility of studying them as actual editing operations performed during text adaptation. This study aims to fill this gap and to test the potential of automatically developing a list of candidates for lexical substitution based on a parallel corpus of original and adapted texts in Russian.</p>
</sec>
<sec id="s3">
<title>Data</title>
<p>To create the word list, we used a parallel Russian simplification dataset called RuAdapt<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> (Dmitrieva et al., <xref ref-type="bibr" rid="B8">2021</xref>). It has both paragraph-aligned and sentence-aligned versions; we chose the sentence alignments for better performance of the automatic word alignment software. RuAdapt has three subcorpora with texts of different genres, all of which have been simplified by experts in teaching Russian as a foreign language. For this stage, we chose the adapted fiction books subcorpus, as it is the largest in the dataset. It includes 24,232 pairs of sentences taken from 93 texts. In total, there are 376,432 tokens in the original sentences and 285,190 tokens in their adapted equivalents. Examples of source (1) and target (2) data are shown below.</p>
<list list-type="simple">
<list-item><p>(1) &#x0041A; &#x00443;&#x00442;&#x00440;&#x00443; <italic>A</italic>&#x0043D;&#x0043D;&#x00430; &#x00437;&#x00430;&#x00434;&#x00440;&#x00435;&#x0043C;&#x00430;&#x0043B;&#x00430;, &#x00441;&#x00438;&#x00434;&#x0044F; &#x00432; &#x0043A;&#x00440;&#x00435;&#x00441;&#x0043B;&#x00435;, &#x00438; &#x0043A;&#x0043E;&#x00433;&#x00434;&#x00430; &#x0043F;&#x00440;&#x0043E;&#x00441;&#x0043D;&#x00443;&#x0043B;&#x00430;&#x00441;&#x0044C;, &#x00442;&#x0043E; &#x00443;&#x00436;&#x00435; &#x00431;&#x0044B;&#x0043B;&#x0043E; &#x00431;&#x00435;&#x0043B;&#x0043E;, &#x00441;&#x00432;&#x00435;&#x00442;&#x0043B;&#x0043E;, &#x00438; &#x0043F;&#x0043E;&#x00435;&#x00437;&#x00434; &#x0043F;&#x0043E;&#x00434;&#x00445;&#x0043E;&#x00434;&#x00438;&#x0043B; &#x0043A; &#x0041F;&#x00435;&#x00442;&#x00435;&#x00440;&#x00431;&#x00443;&#x00440;&#x00433;&#x00443;.</p>
<p>/By morning, Anna dozed off, sitting in an armchair, and when she woke up, it was already white, light, and the train was approaching Petersburg./</p>
</list-item>
<list-item><p>(2) &#x0041A; &#x00443;&#x00442;&#x00440;&#x00443; <italic>A</italic>&#x0043D;&#x0043D;&#x00430; &#x0043D;&#x00430;&#x0043A;&#x0043E;&#x0043D;&#x00435;&#x00446; &#x00437;&#x00430;&#x00441;&#x0043D;&#x00443;&#x0043B;&#x00430;, &#x00430; &#x0043A;&#x0043E;&#x00433;&#x00434;&#x00430; &#x0043F;&#x00440;&#x0043E;&#x00441;&#x0043D;&#x00443;&#x0043B;&#x00430;&#x00441;&#x0044C;, &#x00443;&#x00436;&#x00435; &#x00431;&#x0044B;&#x0043B;&#x0043E; &#x00441;&#x00432;&#x00435;&#x00442;&#x0043B;&#x0043E; &#x00438; &#x0043F;&#x0043E;&#x00435;&#x00437;&#x00434; &#x0043F;&#x0043E;&#x00434;&#x00445;&#x0043E;&#x00434;&#x00438;&#x0043B; &#x0043A; &#x0041F;&#x00435;&#x00442;&#x00435;&#x00440;&#x00431;&#x00443;&#x00440;&#x00433;&#x00443;.</p>
<p><italic>/</italic>By morning, Anna finally fell asleep, and when she woke up, it was already light and the train was approaching Petersburg./</p>
</list-item>
</list>
<p>Each sentence pair in RuAdapt has a cosine similarity score that was assigned during automatic alignment by the CATS alignment tool (&#x00160;tajner et al., <xref ref-type="bibr" rid="B25">2017</xref>). For the purposes of this project, we chose 15,156 sentence pairs with a cosine similarity lower than 0.99 but higher than 0.31. These thresholds were chosen empirically: We wanted to omit not only pairs that are too different, since they most likely will not have many correct single-word alignments, but also nearly identical pairs.</p>
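The similarity-based filtering described above can be sketched as follows. This is a minimal illustration: the (source, target, score) triple layout and the helper name are assumptions, not the corpus's actual storage format.

```python
# Hedged sketch of the sentence-pair filtering step: keep only pairs whose
# CATS cosine similarity lies strictly between the empirically chosen
# thresholds (0.31 and 0.99). Data layout is illustrative.

def filter_pairs(pairs, low=0.31, high=0.99):
    """Keep (source, target, score) triples with low < score < high."""
    return [(src, tgt, score) for src, tgt, score in pairs
            if low < score < high]

pairs = [
    ("s1", "t1", 0.995),  # nearly identical pair: dropped
    ("s2", "t2", 0.72),   # kept
    ("s3", "t3", 0.25),   # too dissimilar: dropped
]
kept = filter_pairs(pairs)
```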
</sec>
<sec id="s4">
<title>Alignment</title>
<p>For this project, we used one statistical aligner and one neural aligner. Aligning pairs of regular and simplified sentences can be both easier and harder than aligning translations: On the one hand, the sentence pair is monolingual, but on the other hand, the sentence lengths often do not match and many words might be omitted. Therefore, we decided to use different aligners and compare the results. Before alignment, we did not lemmatize the sentences because, to the best of our knowledge, the impact of lemmatization on monolingual alignment of different sentences has not yet been studied in detail. Also, as will be discussed below, some linguistic phenomena that we are interested in can be lost during lemmatization.</p>
<p>Eflomal (Efficient Low-Memory Aligner) is a system for efficient and accurate word alignment using a Bayesian model with Markov Chain Monte Carlo (MCMC) inference. It is based on the efmaral tool (&#x000D6;stling and Tiedemann, <xref ref-type="bibr" rid="B19">2016</xref>), but has advantages such as lower memory costs. According to the performance comparison on the project&#x00027;s GitHub page, eflomal shows a lower alignment error rate than efmaral and fast_align on language pairs such as English&#x02013;French and English&#x02013;Hindi.</p>
<p>Similarly to other statistical aligners, eflomal requires a substantial amount of parallel data to train on. We decided to use pairs of paraphrases in Russian, since this type of monolingual parallel data is much easier to obtain than simplification data. We obtained an additional dataset of around 2.5 mil. paraphrases from Opusparcus (Creutz, <xref ref-type="bibr" rid="B6">2018</xref>) and ParaPhraserPlus (Gudkov et al., <xref ref-type="bibr" rid="B12">2020</xref>) and used it for training purposes.</p>
<p>Eflomal<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref> outputs Pharaoh-format alignments, where a pair of numbers <italic>i</italic>-<italic>j</italic> indicates that the <italic>i</italic>-th word of the source sentence corresponds to the <italic>j</italic>-th word of the target sentence. In order to obtain word-to-word alignments, a dedicated instrument from the Natural Language Toolkit (NLTK) called phrase_based<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref> was used. This is a phrase extraction algorithm that extracts all consistent phrase pairs from a word-aligned sentence pair, meaning that it is also possible to obtain phrase-to-word and phrase-to-phrase alignments. However, in this study we limited ourselves to single-word pairs.</p>
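To make the Pharaoh format concrete, the following sketch (not the authors' exact pipeline, which used NLTK's phrase_based) parses the <italic>i</italic>-<italic>j</italic> links and keeps only one-to-one alignments between differing word forms, i.e., the kind of single-word pairs the study collects. The example sentences and links are invented.

```python
# Hedged sketch: from a Pharaoh-format alignment string ("i-j" pairs,
# 0-indexed), extract one-to-one alignments of *different* word forms.

def pharaoh_to_word_pairs(src_sent, tgt_sent, pharaoh):
    src, tgt = src_sent.split(), tgt_sent.split()
    links = [tuple(map(int, link.split("-"))) for link in pharaoh.split()]
    pairs = []
    for i, j in links:
        # keep only one-to-one links: i aligned to exactly one j and vice versa
        if sum(1 for a, _ in links if a == i) == 1 and \
           sum(1 for _, b in links if b == j) == 1:
            # identical word forms are not "useful" pairs
            if src[i].lower() != tgt[j].lower():
                pairs.append((src[i], tgt[j]))
    return pairs

pairs = pharaoh_to_word_pairs(
    "she dozed off in the armchair",
    "she fell asleep in the chair",
    "0-0 1-1 1-2 3-3 4-4 5-5")
# "dozed" links to two target words, so only ("armchair", "chair") survives
```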
<p>The other word alignment tool that we used is awesome-align (Aligning Word Embedding Spaces Of Multilingual Encoders), which extracts word alignments from multilingual BERT and allows users to fine-tune mBERT on parallel corpora for better alignment quality (Dou and Neubig, <xref ref-type="bibr" rid="B9">2021</xref>). Although it can be fine-tuned, no large training corpus is required prior to alignment, so we used the aligner as is. Awesome-align shows lower alignment error rates than eflomal on language pairs such as German&#x02013;English and French&#x02013;English.</p>
<p>The initial alignment results are shown in <xref ref-type="table" rid="T1">Table 1</xref>. As can be seen, many single-word pairs were obtained, but most of them were identical words. The percentage of &#x0201C;useful&#x0201D; pairs is in fact rather low, as is also shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. A &#x0201C;useful&#x0201D; pair of words is a pair that can potentially be included in the list of word alignments. In such pairs, the source word and target word are different and there is no noise (punctuation instead of words, etc.).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Alignment statistics.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Statistic</bold></th>
<th valign="top" align="center"><bold>Eflomal</bold></th>
<th valign="top" align="center"><bold>Awesome-align</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">All single word pairs</td>
<td valign="top" align="center">188,706</td>
<td valign="top" align="center">193,778</td>
</tr>
<tr>
<td valign="top" align="left">Pairs consisting of different words, cleaned from noise</td>
<td valign="top" align="center">19,687</td>
<td valign="top" align="center">22,767</td>
</tr>
<tr>
<td valign="top" align="left">Unique pairs</td>
<td valign="top" align="center">14,807</td>
<td valign="top" align="center">15,989</td>
</tr>
<tr>
<td valign="top" align="left">Unique pairs in common</td>
<td valign="top" align="center" colspan="2">8,403</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The percentage of &#x0201C;useful&#x0201D; pairs among all pairs.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-984759-g0001.tif"/>
</fig>
<p>Considering that the two aligners produced 8,403 identical pairs, there were 22,393 pairs in all at the end of the alignment process. However, these pairs still contained noise and non-synonymic pairs, which made it clear that human editing would be needed.</p>
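The total of 22,393 pairs follows from the unique-pair counts in Table 1 by inclusion-exclusion:

```python
# Unique pairs per aligner (Table 1) and their overlap.
eflomal_unique = 14_807
awesome_unique = 15_989
in_common = 8_403

# Union of the two pair sets: count each shared pair once.
total_to_edit = eflomal_unique + awesome_unique - in_common  # 22,393

# Awesome-align pairs not already present in the eflomal output
# (the pairs left for the second editing stage).
awesome_only = awesome_unique - in_common  # 7,586
```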
</sec>
<sec id="s5">
<title>Expert editing</title>
<p>In order to edit the word lists, 18 human editors were asked to check the 22,393 word pairs obtained in the previous step. All of the editors are students and/or specialists in teaching Russian as a foreign language from [institution name removed for anonymity]. Each pair was checked by at least two different editors.</p>
<p>Editors were asked to give each word pair a score of 0, 1, or 2 according to the following instructions:</p>
<list list-type="bullet">
<list-item><p>Score 0 is given to:</p></list-item>
</list>
<list list-type="bullet">
<list-item><p>noisy pairs (non-synonyms);</p></list-item>
<list-item><p>pairs consisting of the same word or different forms of the same word.</p></list-item>
</list>
<list list-type="bullet">
<list-item><p>Score 1 is given to:</p></list-item>
</list>
<list list-type="bullet">
<list-item><p>pairs that can only be considered synonyms in a certain context;</p></list-item>
<list-item><p>different words with the same root (e.g., &#x00437;&#x00432;&#x00430;&#x00442;&#x0044C;/&#x0043F;&#x0043E;&#x00437;&#x00432;&#x00430;&#x00442;&#x0044C;);</p></list-item>
<list-item><p>synonyms that are presented by different parts of speech.</p></list-item>
</list>
<list list-type="bullet">
<list-item><p>Score 2 is given to:</p></list-item>
</list>
<list list-type="bullet">
<list-item><p>pairs that are considered synonyms in most contexts;</p></list-item>
<list-item><p>pairs where the source word is an older form of the target word (e.g., &#x0043A;&#x0043E;&#x00444;&#x00438;&#x00439;/&#x0043A;&#x0043E;&#x00444;&#x00435;).</p></list-item>
</list>
<p>A score of 2 is supposed to indicate that a pair can be included in the word alignment list; a score of 0 indicates the opposite. A score of 1 is given in cases of uncertainty; such pairs may be included in the list or preserved for future studies. It is important to note that we aimed to evaluate the &#x0201C;usefulness&#x0201D; of a pair for the word list, not the alignment quality.</p>
<p>During the first stage of editing, the editors worked with the 14,807 word pairs produced by eflomal. In the second stage, only 7,586 pairs produced by awesome-align were left to post-edit, since the two alignment instruments produced 8,403 identical pairs. In an attempt to further ease the editors&#x00027; work, we tried to eliminate pairs whose words shared at least one root, since such pairs would not receive a score of 2 and most would in fact receive 0 (being different forms of the same word). We used the NeuralMorphemeSegmentation tool<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref> (Sorokin and Kravtsova, <xref ref-type="bibr" rid="B24">2018</xref>) to split the words into roots and affixes. However, not many word pairs were excluded that way: This strategy detected 796 pairs, and after a manual check 775 pairs were excluded, leaving the editors with 6,811 pairs to post-edit.</p>
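The shared-root pre-filter can be sketched as follows. The root dictionary below is an illustrative stand-in for the output of the NeuralMorphemeSegmentation tool, and the segmentations shown are assumptions made for the example, not the tool's actual output.

```python
# Hedged sketch of the shared-root pre-filter: drop candidate pairs whose
# words have a root morpheme in common. ROOTS is a toy stand-in for a real
# morpheme segmenter; its segmentations are illustrative assumptions.

ROOTS = {
    "звать": {"зва"},
    "позвать": {"зва"},      # shares a root with "звать": pair dropped
    "умолять": {"мол"},
    "просить": {"прос"},     # no shared root: pair kept for editing
}

def shares_root(a, b, roots=ROOTS):
    """True if the two words have at least one root morpheme in common."""
    return bool(roots.get(a, set()) & roots.get(b, set()))

candidates = [("звать", "позвать"), ("умолять", "просить")]
kept = [pair for pair in candidates if not shares_root(*pair)]
```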
<p>At the end of the post-editing process, only 2,336 pairs received a score of 2 from at least one editor, and 5,110 pairs received a score of 1 from at least one editor. We used Cohen&#x00027;s kappa score to measure inter-annotator agreement, yielding a score across all documents of 0.42, which is interpreted as indicating a moderate degree of agreement (McHugh, <xref ref-type="bibr" rid="B17">2012</xref>). It is evident that in many cases deciding on a score was difficult even for humans.</p>
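The agreement measure used above can be reproduced with a minimal Cohen's kappa implementation for two annotators scoring the same items on the 0/1/2 scale. The score lists below are toy data, not the study's annotations.

```python
from collections import Counter

# Minimal Cohen's kappa for two annotators over the same items.
def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n               # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators scoring eight word pairs.
ann1 = [0, 0, 2, 1, 0, 2, 1, 0]
ann2 = [0, 1, 2, 1, 0, 1, 1, 0]
kappa = cohens_kappa(ann1, ann2)
```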
<p>Since we are mostly interested in pairs with a score of 2, we arranged a second evaluation for the pairs that received a score of 2 from at least one editor. A third expert evaluated these 2,336 pairs and gave each a final score, resulting in 1,409 unique pairs with a score of 2, 1,755 pairs with a score of 1 (pairs that received 1 from both editors and pairs that received 1 during the expert evaluation), and 14,493 pairs with a score of 0 (pairs that received 0 from both editors).</p>
<p>Out of the 1,409 pairs with a score of 2, 1,349 were obtained from awesome-align alone or from both awesome-align and eflomal, and 1,197 were obtained from eflomal alone or from both aligners. That means that 9.11% of awesome-align alignments and 8.08% of eflomal alignments received a score of 2, which are both rather small percentages. Since the scores do not reflect the alignment quality directly, they do not illustrate the aligners&#x00027; efficiency, but rather give an idea of how many single-word alignments will end up being synonyms.</p>
<p>The resulting list<xref ref-type="fn" rid="fn0005"><sup>5</sup></xref> contains 1,409 pairs of word forms and their simpler synonyms approved by experts, with 1,134 unique source lemmas and 811 unique target lemmas. The choice of classical literature as a data source leaves its mark on the types of lexical substitution: e.g., the list contains examples of replacing an archaic grammatical form of a word with a modern one (e.g., &#x0043F;&#x00440;&#x0043E;&#x00441;&#x00442;&#x0043E;&#x0044E; [simple ADJ&#x0002B;GEN, archaic] &#x02192; &#x0043F;&#x00440;&#x0043E;&#x00441;&#x00442;&#x0043E;&#x00439; [simple ADJ&#x0002B;GEN]), or replacing an obsolete word with a modern synonym (e.g., &#x0043E;&#x00441;&#x0043E;&#x00431;&#x0043B;&#x00438;&#x00432;&#x0044B;&#x00439; [special, archaic] &#x02192; &#x0043E;&#x00442;&#x00434;&#x00435;&#x0043B;&#x0044C;&#x0043D;&#x0044B;&#x00439; [separate]). However, most of the list presents more universal types of lexical simplification, such as replacing a word with a more neutral and frequent analog (e.g., &#x00443;&#x0043C;&#x0043E;&#x0043B;&#x0044F;&#x00442;&#x0044C; [to beg] &#x02192; &#x0043F;&#x00440;&#x0043E;&#x00441;&#x00438;&#x00442;&#x0044C; [to ask]), use of hypernyms (e.g., &#x00441;&#x0043E;&#x0043B;&#x0043E;&#x00432;&#x0044C;&#x00438; [nightingales] &#x02192; &#x0043F;&#x00442;&#x00438;&#x00446;&#x0044B; [birds]), or the removal of subjective evaluation suffixes (e.g., &#x00434;&#x00435;&#x00440;&#x00435;&#x00432;&#x00435;&#x0043D;&#x0044C;&#x0043A;&#x00430; [village&#x0002B;diminutive suffix] &#x02192; &#x00434;&#x00435;&#x00440;&#x00435;&#x00432;&#x0043D;&#x0044F; [village]).</p>
</sec>
<sec id="s6">
<title>Word list evaluation</title>
<p>After obtaining the word list, we examined the vocabulary that it contains, focusing on the pairs that received a score of 2 in the final evaluation. The evaluation included both a comparison of the word pairs against special graded vocabulary lists and a comparison of their general frequencies. The most direct way to check whether the target word is simpler in terms of Russian as a foreign language proficiency is to compare the first occurrences of the source and target words in vocabulary lists graded by the Common European Framework of Reference for Languages (CEFR) levels. Our hypothesis is that the source word should have a higher grade level than the target word (e.g., the C1 word &#x00431;&#x0043E;&#x00440;&#x0043C;&#x0043E;&#x00442;&#x00430;&#x00442;&#x0044C; [to mutter] should be replaced with its A1 synonym &#x00433;&#x0043E;&#x00432;&#x0043E;&#x00440;&#x00438;&#x00442;&#x0044C; [to speak]).</p>
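The CEFR-level check can be sketched as follows. The level entries below are an illustrative stand-in for the graded vocabulary lists, not the lists' actual contents.

```python
# Hedged sketch of the CEFR comparison: for a pair covered by the graded
# lists, is the source word's first-occurrence level higher than the target's?

CEFR_ORDER = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

# Toy stand-in for the graded vocabulary lists (word -> first CEFR level).
LEVELS = {"бормотать": "C1", "говорить": "A1"}

def source_is_harder(src, tgt, levels=LEVELS):
    """True/False if both words are in the lists; None if the pair is not covered."""
    if src not in levels or tgt not in levels:
        return None
    return CEFR_ORDER[levels[src]] > CEFR_ORDER[levels[tgt]]

result = source_is_harder("бормотать", "говорить")
```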
<p>Before comparing the word pairs to the frequency dictionary and the CEFR vocabulary lists, we lemmatized them using the Stanza Python library<xref ref-type="fn" rid="fn0006"><sup>6</sup></xref> (Qi et al., <xref ref-type="bibr" rid="B20">2020</xref>). We used the default model, which was trained on SynTagRus (Droganova et al., <xref ref-type="bibr" rid="B10">2018</xref>). Rather than lemmatizing just the words from the pairs, we lemmatized the corresponding sentences and extracted the necessary lemmas from them, since context can be important for correct lemmatization.</p>
<p>There were 686 pairs where both words could be found in the CEFR vocabulary lists. Of these, in 513 cases the CEFR level of the source word was higher. In a further 545 cases the source word is not present in the CEFR-graded lists while the target word is (see <xref ref-type="fig" rid="F2">Figure 2</xref>). This means that in 75% of the cases the proposed simplified word is considered simpler by foreign language acquisition specialists, which shows that in most cases the source word is indeed more complicated and less often used than the target word. Of particular interest are word pairs where the source and target words have the same CEFR level tags. Most of these cases can be explained as the choice of a word whose derivative appeared on the lists earlier, so the reader is more likely to guess its meaning (e.g., &#x00441;&#x00435;&#x00440;&#x00434;&#x00438;&#x00442;&#x0044C;&#x00441;&#x0044F; [to be grumpy] is replaced by &#x00437;&#x0043B;&#x00438;&#x00442;&#x0044C;&#x00441;&#x0044F; [to be angry]; both verbs are B2 level, but the cognate adjective &#x00437;&#x0043B;&#x0043E;&#x00439; [angry ADJ] appears at the earlier A2 level). In isolated cases where the target word has a higher CEFR level than the source word, the word choice might have been prompted by the desire to use an international synonym (e.g., &#x00440;&#x00430;&#x00441;&#x00441;&#x00442;&#x0043E;&#x0044F;&#x0043D;&#x00438;&#x00435; [spacing, distance] &#x02192; &#x00434;&#x00438;&#x00441;&#x00442;&#x00430;&#x0043D;&#x00446;&#x00438;&#x0044F; [distance]), or might illustrate imperfections in the vocabulary lists or human errors during text adaptation.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Word grade levels statistics.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-984759-g0002.tif"/>
</fig>
<p>As word frequency is commonly used for identifying word complexity in lexical simplification studies (Al-thunayyan and Azmi, <xref ref-type="bibr" rid="B1">2021</xref>), we compared the IPM values (instances per million words) of the word pairs, positing that the IPM of the source word in a pair should be lower than the IPM of the target word. We used a frequency dictionary of the modern Russian language for this purpose (Lyashevskaya and Sharov, <xref ref-type="bibr" rid="B15">2009</xref>).</p>
<p>In terms of frequency, we found out that of the 1,203 pairs where both source and target words were present in the chosen frequency dictionary, in 1,037 pairs the target IPM was higher than the source, in 112 pairs the source IPM was higher, and in 54 cases the IPMs were equal (see <xref ref-type="fig" rid="F3">Figure 3</xref>). IPM is equal mostly in cases where the source word has a non-modern spelling (&#x0043D;&#x00435;&#x00441;&#x00447;&#x00430;&#x00441;&#x00442;&#x00438;&#x00435;/&#x0043D;&#x00435;&#x00441;&#x00447;&#x00430;&#x00441;&#x00442;&#x0044C;&#x00435;), because in such cases source and target are lemmatized the same way.</p>
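The frequency comparison can be sketched as follows. The IPM values below are invented for illustration and are not taken from the frequency dictionary itself.

```python
# Hedged sketch of the IPM comparison: classify each pair found in the
# frequency dictionary by which word is more frequent. Values are invented.

IPM = {
    "умолять": 12.1,
    "просить": 250.7,
    "расстояние": 90.0,
    "дистанция": 30.0,
}

def ipm_outcome(src, tgt, ipm=IPM):
    """Compare IPM of source vs. target; None if the pair is not covered."""
    if src not in ipm or tgt not in ipm:
        return None
    if ipm[tgt] > ipm[src]:
        return "target_more_frequent"   # the expected simplification direction
    if ipm[src] > ipm[tgt]:
        return "source_more_frequent"   # e.g., an international but rarer synonym
    return "equal"

outcomes = [ipm_outcome("умолять", "просить"),
            ipm_outcome("расстояние", "дистанция")]
```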
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Frequency statistics.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-984759-g0003.tif"/>
</fig>
<p>The evaluations show that in most cases the &#x0201C;simpler&#x0201D; candidate from the list indeed appears earlier in a graded vocabulary list for language learners and/or is more frequent in the Russian language than the original, complex word. Conversely, there were cases where the word chosen by the authors of the adaptation turned out to be less frequent (8.1% of words) or to belong to a higher CEFR level (5%). These cases seem to reflect word selection criteria that are relevant to foreign language learners (for example, the internationality of a word or the presence of a frequently used derivative) and are cases for which it is potentially difficult to automate lexical substitution.</p>
</sec>
<sec sec-type="discussion" id="s7">
<title>Discussion</title>
<p>In this paper, we described the creation of a list of word alignments from parallel Russian simplification data. We used two automatic aligners and human post-editing in order to choose word pairs where the source and target words can be considered synonyms in most contexts or where the target word is the modern spelling version of the source word.</p>
<p>During our research, we found that there do not seem to be many cases of actual single-word lexical simplification (i.e., synonym replacement) in adapted readers for Russian L2 learners. Despite using different aligners, in both cases fewer than 10% of all single-word alignments of different words received a score of 2 and were included in the final list. We can hypothesize that in adapted literature there is more lexical simplification at the phrase level than at the word level, or perhaps that such phenomena cannot be fully captured without word aligners specifically designed for parallel simplification data.</p>
<p>The resulting list allows us, first, to explore lexical adaptation strategies that are relevant for L2 learners. Only 75% of the word pairs fit the classical criteria of word complexity, such as word frequency or CEFR level. In the remaining cases, the lexical substitution could be explained as a choice in favor of international words or of derivatives of simple, frequent words. This indicates the need to take these features into account at future stages of automating lexical simplification.</p>
<p>Another application of the resulting list is to improve the quality of the next iterations of the aligning process, since now we can use these word pairs as points where we expect lexical substitution.</p>
<p>In the future, we want to expand the scope of this research to phrase-level simplification and to use other datasets. We hope to gather enough data to create reliable lexical-simplification systems and tools for computer-assisted text adaptation. We also hope that in the future less human editing will be needed during the creation of other word lists, because it will be possible to use our word list to train models for automatic evaluation of word pairs.</p>
</sec>
<sec sec-type="data-availability" id="s8">
<title>Data availability statement</title>
<p>The datasets presented in this study can be found at: <ext-link ext-link-type="uri" xlink:href="https://github.com/Digital-Pushkin-Lab/RuAdapt_Word_Lists">https://github.com/Digital-Pushkin-Lab/RuAdapt_Word_Lists</ext-link>.</p>
</sec>
<sec id="s9">
<title>Author contributions</title>
<p>AD: conception and design of the study, formulation of research goals and aims, data collection, literature review, aligners building and evaluation, and writing the article. AL: literature review, data analysis, interpretation of results, and writing the article. ML: conception and design of the study, coordination of the expert annotation process, interpretation of results, and writing the article. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>The article was prepared in full within the state assignment of Ministry of Education and Science of the Russian Federation for 2020&#x02013;2024 (No. FZNM-2020-0005).</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Al-thunayyan</surname> <given-names>S.</given-names></name> <name><surname>Azmi</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>Automated text simplification: a survey</article-title>. <source>ACM Comput. Surv</source>. 54, 2. <pub-id pub-id-type="doi">10.1145/3442695</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="editor"><name><surname>Andryshina</surname> <given-names>N. P.</given-names></name></person-group> (ed.). (<year>2017a</year>). <source>Lexical Minimum of Russian as a Foreign Language. Level B1. Common language (9th Edn.)</source>. <publisher-loc>St. Petersburg</publisher-loc>: <publisher-name>Zlatoust</publisher-name>.</citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="editor"><name><surname>Andryshina</surname> <given-names>N. P.</given-names></name></person-group> (ed.). (<year>2017b</year>). <source>Lexical Minimum of Russian as a Foreign Language. Level B2. Common language (7th Edn.)</source>. <publisher-loc>St. Petersburg</publisher-loc>: <publisher-name>Zlatoust</publisher-name>.</citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Andryshina</surname> <given-names>N. P.</given-names></name> <name><surname>Kozlova</surname> <given-names>T. V.</given-names></name></person-group> (<year>2012</year>). <source>Lexical Minimum of Russian as a Foreign Language. Level A1. Common Language, 4th Edn</source>. <publisher-loc>St. Petersburg</publisher-loc>: <publisher-name>Zlatoust</publisher-name>.</citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Andryshina</surname> <given-names>N. P.</given-names></name> <name><surname>Kozlova</surname> <given-names>T. V.</given-names></name></person-group> (<year>2015</year>). <source>Lexical Minimum of Russian as a Foreign Language. Level A2. Common Language, 5th Edn</source>. <publisher-loc>St. Petersburg</publisher-loc>: <publisher-name>Zlatoust</publisher-name>.</citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Creutz</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <source>Open Subtitles Paraphrase Corpus for Six Languages. In: Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC 2018)</source>, 7-12 May, <publisher-loc>Miyazaki, Japan.</publisher-loc></citation>
</ref>
<ref id="B7">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Dmitrieva</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <source>Text Simplification for Russian as a Foreign Language</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.hse.ru/en/edu/vkr/182626286">https://www.hse.ru/en/edu/vkr/182626286</ext-link> (accessed August 30, 2022).</citation></ref>
<ref id="B8">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Dmitrieva</surname> <given-names>A.</given-names></name> <name><surname>Laposhina</surname> <given-names>A.</given-names></name> <name><surname>Lebedeva</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;A quantitative study of simplification strategies in adapted texts for L2 learners of Russian,&#x0201D;</article-title> in <source>Proceedings of the International Conference &#x0201C;Dialogue&#x0201D;</source>, 191&#x02013;203). Available online at: <ext-link ext-link-type="uri" xlink:href="http://www.dialog-21.ru/media/5504/dmitrievaapluslaposhinaapluslebedevam099.pdf">http://www.dialog-21.ru/media/5504/dmitrievaapluslaposhinaapluslebedevam099.pdf</ext-link></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dou</surname> <given-names>Z. Y.</given-names></name> <name><surname>Neubig</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Word alignment by fine-tuning embeddings on parallel corpora</article-title>. <source>arXiv preprint</source> arXiv:2101.08231, <fpage>2112</fpage>&#x02013;<lpage>2118</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2021.eacl-main.181</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Droganova</surname> <given-names>K.</given-names></name> <name><surname>Lyashevskaya</surname> <given-names>O.</given-names></name> <name><surname>Zeman</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Data conversion and consistency of monolingual corpora: Russian UD Treebanks,&#x0201D;</article-title> in <source>Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018)</source>, December 13&#x02013;14, 2018. Oslo University, Norway (No. 155) (<publisher-loc>Link&#x000F6;ping, Sweden</publisher-loc>: <publisher-name>Link&#x000F6;ping University Electronic Press</publisher-name>),<fpage>52</fpage>&#x02013;<lpage>65</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dyer</surname> <given-names>C.</given-names></name> <name><surname>Chahuneau</surname> <given-names>V.</given-names></name> <name><surname>Smith</surname> <given-names>N. A.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;A simple, fast, and effective reparameterization of ibm model 2,&#x0201D;</article-title> in <source>Proceedings of the NAACL</source>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gudkov</surname> <given-names>V.</given-names></name> <name><surname>Mitrofanova</surname> <given-names>O.</given-names></name> <name><surname>Filippskikh</surname> <given-names>E.</given-names></name></person-group> (<year>2020</year>). <article-title>Automatically ranked Russian paraphrase corpus for text generation</article-title>. <source>Proc. Fourth Workshop Neural Gener. Transl. ACL</source>, <volume>2020</volume>, <fpage>54</fpage>&#x02013;<lpage>59</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2020.ngt-1.6</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Horn</surname> <given-names>C.</given-names></name> <name><surname>Manduca</surname> <given-names>C.</given-names></name> <name><surname>Kauchak</surname> <given-names>D.</given-names></name></person-group> (<year>2014</year>). &#x0201C;Learning a lexical simplifier using wikipedia,&#x0201D; in: <italic>ACL, Volume 2: Short Papers</italic> (<publisher-loc>Baltimore, Maryland</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>458</fpage>&#x02013;<lpage>63</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Lan</surname> <given-names>W.</given-names></name> <name><surname>Jiang</surname> <given-names>C.</given-names></name> <name><surname>Xu</surname> <given-names>W.</given-names></name></person-group> (<year>2021</year>). <article-title>Neural semi-Markov CRF for monolingual word alignment.</article-title> <source>arXiv preprint</source> 6815&#x02013;28. <pub-id pub-id-type="doi">10.18653/v1/2021.acl-long.531</pub-id> Available online at: <ext-link ext-link-type="uri" xlink:href="https://aclanthology.org/2021.acl-long.531/">https://aclanthology.org/2021.acl-long.531/</ext-link></citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lyashevskaya</surname> <given-names>O. N.</given-names></name> <name><surname>Sharov</surname> <given-names>S. A.</given-names></name></person-group> (<year>2009</year>). <source>Chastotnyj slovar&#x00027; sovremennogo russkogo yazyka (na materialah Nacionalnogo korpusa russkogo yazyka) [Modern Russian frequency dictionary (based on the data from the Russian National Corpus)]</source>. <publisher-loc>Moscow</publisher-loc>: <publisher-name>Azbukovnik</publisher-name>.</citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Maddela</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>W.</given-names></name></person-group> (<year>2018</year>). <article-title>A word-complexity lexicon and a neural readability ranking model for lexical simplification</article-title>. <source>EMNLP</source>. <volume>2018</volume>, <fpage>3749</fpage>&#x02013;<lpage>3760</lpage>. <pub-id pub-id-type="doi">10.18653/v1/D18-1410</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McHugh</surname> <given-names>M. L.</given-names></name></person-group> (<year>2012</year>). <article-title>Interrater reliability: the kappa statistic</article-title>. <source>Biochemia medica</source>, <volume>22</volume>, <fpage>276</fpage>&#x02013;<lpage>282</lpage>. <pub-id pub-id-type="doi">10.11613/BM.2012.031</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Och</surname> <given-names>F. J.</given-names></name> <name><surname>Ney</surname> <given-names>H.</given-names></name></person-group> (<year>2003</year>). <article-title>A systematic comparison of various statistical alignment models</article-title>. <source>Comput. Linguist.</source> <volume>29</volume>, <fpage>19</fpage>&#x02013;<lpage>51</lpage>. <pub-id pub-id-type="doi">10.1162/089120103321337421</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>&#x000D6;stling</surname> <given-names>R.</given-names></name> <name><surname>Tiedemann</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Efficient word alignment with Markov Chain Monte Carlo</article-title>. <source>Prague Bull. Math. Linguist.</source> 106. <pub-id pub-id-type="doi">10.1515/pralin-2016-0013</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qi</surname> <given-names>P.</given-names></name> <name><surname>Zhang</surname> <given-names>Yuhao</given-names></name> <name><surname>Zhang</surname> <given-names>Yuhui</given-names></name> <name><surname>Bolton</surname> <given-names>J.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Stanza: A Python Natural Language Processing Toolkit for Many Human Languages,&#x0201D;</article-title> in <source>Association for Computational Linguistics (ACL) System Demonstrations</source>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qiang</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Zhu</surname> <given-names>Y.</given-names></name> <name><surname>Yuan</surname> <given-names>Y.</given-names></name> <name><surname>Wu</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <article-title>LSBert: A Simple Framework for Lexical Simplification</article-title>. <source>J Latex Class Files</source>. <volume>14</volume>, <fpage>1</fpage>&#x02013;<lpage>11</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shardlow</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;The CW corpus: a new resource for evaluating the identification of complex words,&#x0201D;</article-title> in <source>Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations, Sofia, Bulgaria</source>, 69&#x02013;77.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sibirtseva</surname> <given-names>V. G.</given-names></name> <name><surname>Karpov</surname> <given-names>N. V.</given-names></name></person-group> (<year>2014</year>). <article-title>Automatic adaptation of the texts for electronic textbooks. Problems and perspectives (on an example of Russian). [Avtomaticheskaya adaptaciya tekstov dlya elektronnyh uchebnikov. Problemy i perspektivy (na primere russkogo yazyka)]</article-title>. <source>Nov&#x000E1; rusistika</source>. <volume>VII</volume>, <fpage>19</fpage>&#x02013;<lpage>33</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sorokin</surname> <given-names>A.</given-names></name> <name><surname>Kravtsova</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Deep convolutional networks for supervised morpheme segmentation of Russian language,&#x0201D;</article-title> in <source>Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol 930</source>, eds D. Ustalov, A. Filchenkov, L. Pivovarova, J. &#x0017D;i&#x0017E;ka (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer, Cham</publisher-name>).</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>&#x00160;tajner</surname> <given-names>S.</given-names></name> <name><surname>Franco-Salvador</surname> <given-names>M.</given-names></name> <name><surname>Ponzetto</surname> <given-names>S. P.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name> <name><surname>Stuckenschmidt</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Sentence alignment methods for improving text simplification systems,&#x0201D;</article-title> in <source>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017)</source>, 97&#x02013;102.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zubov</surname> <given-names>V. I.</given-names></name> <name><surname>Petrova</surname> <given-names>T. E.</given-names></name></person-group> (<year>2020</year>). <article-title>Lexically or grammatically adapted texts: what is easier to process for secondary school children?</article-title> <source>Procedia Comput. Sci</source>. <volume>176</volume>, <fpage>2117</fpage>&#x02013;<lpage>2124</lpage>. <pub-id pub-id-type="doi">10.1016/j.procs.2020.09.248</pub-id></citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/Digital-Pushkin-Lab/RuAdapt">https://github.com/Digital-Pushkin-Lab/RuAdapt</ext-link></p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/robertostling/eflomal">https://github.com/robertostling/eflomal</ext-link></p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="https://www.nltk.org/_modules/nltk/translate/phrase_based.html">https://www.nltk.org/_modules/nltk/translate/phrase_based.html</ext-link></p></fn>
<fn id="fn0004"><p><sup>4</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/AlexeySorokin/NeuralMorphemeSegmentation">https://github.com/AlexeySorokin/NeuralMorphemeSegmentation</ext-link></p></fn>
<fn id="fn0005"><p><sup>5</sup>Available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/Digital-Pushkin-Lab/RuAdapt_Word_Lists">https://github.com/Digital-Pushkin-Lab/RuAdapt_Word_Lists</ext-link></p></fn>
<fn id="fn0006"><p><sup>6</sup><ext-link ext-link-type="uri" xlink:href="https://stanfordnlp.github.io/stanza/">https://stanfordnlp.github.io/stanza/</ext-link></p></fn>
</fn-group>
</back>
</article> 