<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="systematic-review">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2020.523053</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Systematic Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Meta-Analysis of Cross-Language Plagiarism and Self-Plagiarism Detection Methods for Russian-English Language Pair</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Tlitova</surname> <given-names>Alina</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/874992/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Toschev</surname> <given-names>Alexander</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/876936/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Talanov</surname> <given-names>Max</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/232237/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Kurnosov</surname> <given-names>Vitaliy</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1115447/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Computer Engineering Department, The Higher Institute of Information Technology and Intelligent Systems, Kazan Federal University</institution>, <addr-line>Kazan</addr-line>, <country>Russia</country></aff>
<aff id="aff2"><sup>2</sup><institution>TPPKM Department, Institute of Polymers, Kazan National Research Technological University</institution>, <addr-line>Kazan</addr-line>, <country>Russia</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Tom Crick, Swansea University, United Kingdom</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Diego R. Amancio, University of S&#x000E3;o Paulo, Brazil; Mohamed Mostafa, Cardiff Metropolitan University, United Kingdom; Imtiaz Khan, Cardiff Metropolitan University, United Kingdom</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Alina Tlitova <email>AlETlitova&#x00040;stud.kpfu.ru</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Digital Education, a section of the journal Frontiers in Computer Science</p></fn></author-notes>
<pub-date pub-type="epub">
<day>29</day>
<month>10</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>2</volume>
<elocation-id>523053</elocation-id>
<history>
<date date-type="received">
<day>26</day>
<month>12</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>20</day>
<month>08</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Tlitova, Toschev, Talanov and Kurnosov.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Tlitova, Toschev, Talanov and Kurnosov</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract><p>Scientists need to publish the results of their work to remain relevant and in demand. The well-known principle of &#x0201C;publish or perish&#x0201D; often forces scientists to pursue quantity rather than quality. Along with problems of authorship, paid research, and fabrication of results, plagiarism and self-plagiarism are among the most common violations. Their impact is more subtle but no less destructive for the scientific community. Statistics show that the reuse of texts across languages is very common in various studies. Identifying translated plagiarism is a complex task, and there are currently almost no tools for this purpose on the Russian market. In this article, we provide an overview of existing methods for identifying cross-language borrowings in scientific articles. We analyzed solutions by studying works on various language pairs, paying particular attention to the Russian-English pair.</p></abstract>
<kwd-group>
<kwd>cross-language plagiarism</kwd>
<kwd>plagiarism detection methods</kwd>
<kwd>self-plagiarism detection</kwd>
<kwd>text borrowings</kwd>
<kwd>multilingual plagiarism</kwd>
</kwd-group>
<contract-sponsor id="cn001">Kazan Federal University<named-content content-type="fundref-id">10.13039/501100012528</named-content></contract-sponsor>
<counts>
<fig-count count="4"/>
<table-count count="5"/>
<equation-count count="1"/>
<ref-count count="25"/>
<page-count count="10"/>
<word-count count="5086"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>The number of publications has a great influence on a scientist&#x00027;s career and standing. To advance in degree and recognition, unscrupulous authors translate an existing scientific work into the required language and pass it off as their own, or republish their own work in different languages without indicating that the material has been published previously (Amancio, <xref ref-type="bibr" rid="B2">2015</xref>). A whole work can be repeated with minor changes, for example, in the title and abstract (duplicate publication), or excerpts from previous works can be reused (salami slicing). Such work cannot be called scientific; a reprinted text is a typical example.</p>
<p>IEEE defines plagiarism as the reuse of someone else&#x00027;s previous results or words without explicit recognition of the original author and source (IEEE-Faq, <xref ref-type="bibr" rid="B16">2019</xref>). Plagiarism in any form is unacceptable and is considered a serious violation of professional behavior with ethical and legal consequences.</p>
<p>According to IEEE policy, there are several basic factors that are taken into account when assessing possible plagiarism (IEEE, <xref ref-type="bibr" rid="B15">2019</xref>):
<list list-type="bullet">
<list-item><p>The amount of copied material (full article, article section, page, paragraph, sentence, or phrase)</p></list-item>
<list-item><p>Use of quotation marks for all copied text</p></list-item>
<list-item><p>Proper citation of the sources of borrowing</p></list-item>
<list-item><p>Improper paraphrasing.</p></list-item>
</list></p>
<p>Duplication of information without reference to sources is unacceptable, even if it is the property of the author, as this calls into question the relevance and scientific novelty of the idea.</p>
<p>Plagiarism detection systems are often used to identify self-plagiarism.</p>
<p>Many programs successfully cope with plagiarism and self-plagiarism within a single language (Tlitova and Toschev, <xref ref-type="bibr" rid="B24">2019</xref>). However, their significant drawback is weak detection of borrowings across languages, the so-called cross-language plagiarism.</p>
<p>In this work we reviewed several groups of state-of-the-art methods for detecting this type of plagiarism.</p>
<p>The purpose of this study is to analyze existing methods for identifying textual cross-language borrowings by several characteristics to then identify their features and applicability for the Russian-English pair.</p>
</sec>
<sec sec-type="methods" id="s2">
<title>2. Methods</title>
<p>In this research we performed a meta-analysis of existing works and articles surveying systems for identifying cross-language plagiarism and their relevance for the Russian-English language pair. It was conducted in adherence to the quality standards for conducting and reporting meta-analyses detailed in the PRISMA statement (Moher et al., <xref ref-type="bibr" rid="B20">2009</xref>).</p>
<p>We employed two methods of literature search.</p>
<p>The first was searching for publications in databases such as Scopus and Web of Science, including all publication types obtainable through the World Wide Web: unpublished dissertations, peer-reviewed journal articles, book chapters, and conference proceedings. We omitted articles published before January 2004, mostly focusing on studies from the last 5 years. We used the following keywords for the search: cross-language plagiarism, plagiarism detection methods, self-plagiarism detection, multilingual plagiarism, and text borrowings.</p>
<p>The second was a cross-referencing process in which included articles were used to identify other appropriate works (Horsley et al., <xref ref-type="bibr" rid="B14">2011</xref>). Using a backward-search process, we read the references at the end of articles to find other research that could potentially be included in the meta-analysis. We then conducted a forward search via Google Scholar (<xref ref-type="bibr" rid="B12">2004</xref>) to identify studies citing them.</p>
<p>A total of 136 studies were identified. After 38 duplicates were removed, 49 studies were excluded on title and abstract review. The full texts of the remaining items were examined in detail. Some of them were not suitable for inclusion in our study for the following reasons:
<list list-type="bullet">
<list-item><p>No analysis of methods in practice</p></list-item>
<list-item><p>No appropriate description of the models and methods.</p></list-item>
</list></p>
<p>After these steps of the literature search, two articles published online that explored Russian-English pair of languages (Kuznecova et al., <xref ref-type="bibr" rid="B18">2018</xref>; Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>) and seven articles that analyzed models of cross-language plagiarism (Barr&#x000F3;n-Cede&#x000F1;o et al., <xref ref-type="bibr" rid="B5">2010</xref>, <xref ref-type="bibr" rid="B4">2013</xref>; Potthast et al., <xref ref-type="bibr" rid="B22">2011</xref>; Franco-Salvador et al., <xref ref-type="bibr" rid="B11">2016b</xref>; Ferrero et al., <xref ref-type="bibr" rid="B9">2017b</xref>; Thompson and Bowerman, <xref ref-type="bibr" rid="B23">2017</xref>; Ehsan et al., <xref ref-type="bibr" rid="B6">2019</xref>) remained and were included in the meta-analysis.</p>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> presents a &#x0201C;PRISMA Flow Diagram&#x0201D; of the study selection, depicting the flow of information through the various stages of the systematic review.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Study selection flow diagram by PRISMA (Moher et al., <xref ref-type="bibr" rid="B20">2009</xref>) template.</p></caption>
<graphic xlink:href="fcomp-02-523053-g0001.tif"/>
</fig>
<p>Detection of plagiarism is known to consist of two stages:
<list list-type="order">
<list-item><p>Search for sources for selection of candidate documents</p></list-item>
<list-item><p>Text alignment to compare a document with each candidate.</p></list-item>
</list></p>
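<p>A minimal sketch of this two-stage pipeline, with the retrieval and alignment steps left as pluggable placeholders (an illustrative skeleton, not any specific system&#x00027;s implementation):</p>
<preformat>
```python
# Schematic two-stage plagiarism detection: candidate retrieval followed
# by pairwise text alignment. `retrieve` and `align` are placeholders for
# whatever concrete models a system plugs in.
def detect_plagiarism(suspicious_doc, collection, retrieve, align):
    # Stage 1: select candidate source documents from the collection.
    candidates = retrieve(suspicious_doc, collection)
    # Stage 2: align the suspicious document against each candidate.
    return {candidate: align(suspicious_doc, candidate)
            for candidate in candidates}
```
</preformat>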
<p>We studied the second-stage models used for the CLPD task, examined existing works aimed at detecting plagiarism in the Russian-English pair, and created tables and bar charts for visual presentation of the results. Several approaches offer a solution to the problem of cross-language plagiarism detection for different language pairs: Arabic-English (Hanane et al., <xref ref-type="bibr" rid="B13">2016</xref>; Alaa et al., <xref ref-type="bibr" rid="B1">2016</xref>), Malay-English (Kent and Salim, <xref ref-type="bibr" rid="B17">2010</xref>), Spanish-English and German-English (Franco-Salvador et al., <xref ref-type="bibr" rid="B11">2016b</xref>), Basque-English (Barr&#x000F3;n-Cede&#x000F1;o et al., <xref ref-type="bibr" rid="B5">2010</xref>), and Russian-English (Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>). In Barr&#x000F3;n-Cede&#x000F1;o et al. (<xref ref-type="bibr" rid="B5">2010</xref>), the authors noted that the effectiveness of plagiarism detection algorithms directly depends on the degree of relatedness between the languages considered. If the languages are not in the same linguistic group, developing an algorithm for identifying borrowings becomes more difficult.</p>
<p>Methods for detecting cross-language plagiarism use models from the following five groups (Ferrero et al., <xref ref-type="bibr" rid="B9">2017b</xref>); their purpose is to determine whether two text blocks are identical in terms of information content:
<list list-type="bullet">
<list-item><p>Syntax-based models (Length model, CL-CnG, and Cognateness)</p></list-item>
<list-item><p>Dictionary-based models (CL-VSM and CL-CTS)</p></list-item>
<list-item><p>Parallel corpora-based models (CL-ASA, CL-LSI, and CL-KCCA)</p></list-item>
<list-item><p>Comparable corpora-based models (CL-KGA and CL-ESA)</p></list-item>
<list-item><p>Machine translation-based models (T &#x0002B; MA).</p></list-item>
</list></p>
<p>The authors of Ferrero et al. (<xref ref-type="bibr" rid="B9">2017b</xref>) and Potthast et al. (<xref ref-type="bibr" rid="B22">2011</xref>) analyzed the following models:
<list list-type="bullet">
<list-item><p>CL-CnG (Cross-Language Character N-Gram) is based on McNamee and Mayfield models (McNamee and Mayfield, <xref ref-type="bibr" rid="B19">2004</xref>) and represents documents with character n-grams.</p></list-item>
<list-item><p>CL-CTS (Cross-Language Conceptual Thesaurus-based Similarity) is aimed at determining semantic similarity using abstract concepts from words in the text.</p></list-item>
<list-item><p>CL-ASA (Cross-Language Alignment-based Similarity Analysis) determines how a text unit is a potential translation of another text unit using a bilingual unigram dictionary that contains pairs of translations (and their probabilities) extracted from parallel corpora.</p></list-item>
<list-item><p>CL-ESA (Cross-Language Explicit Semantic Analysis) is based on the explicit semantic analysis model, which represents the value of a document as a vector based on a dictionary derived from Wikipedia.</p></list-item>
<list-item><p>T &#x0002B; MA (Translation &#x0002B; Monolingual Analysis) consists of translating the suspicious document into the language of the potential source documents so that a monolingual comparison can be made between them.</p></list-item>
<list-item><p>CL-VSM (Cross-Language Vector Space Model) is based on a vector space model.</p></list-item>
<list-item><p>CL-LSI (Cross-Language Latent Semantic Indexing) is based on latent semantic indexing.</p></list-item>
<list-item><p>CL-KCCA (Cross-Language Kernel Canonical Correlation Analysis) is based on kernel canonical correlation analysis.</p></list-item>
</list></p>
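<p>As an illustration of the character n-gram comparison underlying CL-CnG, the similarity between two texts can be computed as the cosine of their character n-gram frequency vectors. The following is a minimal sketch of the general technique, not the implementation evaluated in the cited works:</p>
<preformat>
```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    # Lowercase, remove whitespace, and count character n-grams.
    text = "".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def clcng_similarity(doc1, doc2, n=3):
    # CL-CnG-style score: cosine similarity of character n-gram profiles.
    return cosine(char_ngrams(doc1, n), char_ngrams(doc2, n))
```
</preformat>
<p>Because the score operates on raw characters, the model favors language pairs that share lexical and syntactic similarities.</p>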
<p>CL-LSI and CL-KCCA achieve high retrieval quality, but their runtime is very long, which makes them inapplicable to many practical tasks. CL-VSM requires considerable effort to resolve ambiguities, and the availability of translation dictionaries for this model depends on the frequency of translation between the respective languages. We therefore excluded these models from our comparison. CL-CnG, CL-ESA, and CL-ASA provide good retrieval quality and do not require manual fine-tuning.</p>
<p>The multilingual dataset in Ferrero et al. (<xref ref-type="bibr" rid="B9">2017b</xref>) was specially designed for the evaluation of cross-language textual similarity detection. It is based on parallel and comparable corpora (a mix of Wikipedia, scientific conference papers, Amazon product reviews, Europarl, and JRC) including French, English, and Spanish texts.</p>
<p>The authors of Barr&#x000F3;n-Cede&#x000F1;o et al. (<xref ref-type="bibr" rid="B4">2013</xref>) compared the T&#x0002B;MA model with CL-CNG and CL-ASA using the Spanish-English partition of the PAN&#x00027;11 competition.</p>
<p>We also found a comparative analysis of the CL-CNG, CL-ESA, and CL-ASA models in Potthast et al. (<xref ref-type="bibr" rid="B22">2011</xref>). The authors studied the behavior of the models on 120,000 test documents from the JRC-Acquis parallel corpus and the Wikipedia comparable corpus; for each test document, highly similar texts were available in English, German, Dutch, Spanish, French, and Polish.</p>
<p>We paid attention to the CL-KGA model (Franco-Salvador et al., <xref ref-type="bibr" rid="B11">2016b</xref>), which represents text fragments through knowledge graphs as a language-independent content model and performs at the level of CL-ASA, CL-ESA, and CL-CnG. The authors carried out a comprehensive comparative analysis of CL-CnG, CL-ESA, and CL-ASA against the basic CL-KGA and its variants, using, among other resources, BabelNet (Navigli and Ponzetto, <xref ref-type="bibr" rid="B21">2012</xref>), the largest multilingual semantic network, which combines lexicographic information with extensive encyclopedic knowledge. They selected the datasets used for the CL plagiarism detection competitions PAN-PC-10 and PAN-PC-11, which contain Spanish-English and German-English sections.</p>
<p>In addition, we included in our meta-analysis less well-known models that nevertheless show good results for this task (Thompson and Bowerman, <xref ref-type="bibr" rid="B23">2017</xref>; Ehsan et al., <xref ref-type="bibr" rid="B6">2019</xref>). The model that uses cross-lingual word embeddings (CL-WE) and a multilingual translation model (MTM) (Thompson and Bowerman, <xref ref-type="bibr" rid="B23">2017</xref>) was evaluated on datasets from PAN-PC-11 and PAN-PC-12; PAN-PC-12 was also used by the authors of the model proposed in Ehsan et al. (<xref ref-type="bibr" rid="B6">2019</xref>).</p>
<p>Despite the variety of described models, the majority of authors use a conventional machine translation (MT) model in their methods and algorithms for detecting cross-language borrowings, transforming the task into identifying monolingual plagiarism. The disadvantage of this approach is that machine translation produces varying versions, and authors can change the parts of the text that are reused.</p>
<p>Many works use text comparisons based on monolingual or bilingual word vectors (Franco-Salvador et al., <xref ref-type="bibr" rid="B10">2016a</xref>; Ferrero et al., <xref ref-type="bibr" rid="B8">2017a</xref>). However, in one recent study the authors proposed a method that uses vectors of phrases to detect plagiarism in the Russian-English pair (Kuznecova et al., <xref ref-type="bibr" rid="B18">2018</xref>). The paper describes an algorithm that performs a monolingual analysis of documents: first, the text is completely translated into English, and then the corresponding vectors, rather than the text fragments themselves, are compared in order to reduce sensitivity to translation ambiguities. This is the &#x0201C;proposed&#x0201D; algorithm. Additionally, they took a shingle-based algorithm as the &#x0201C;basic&#x0201D; one, which performs the following steps:
<list list-type="order">
<list-item><p>Translation of the checked document into English</p></list-item>
<list-item><p>Lemmatization of the obtained text and its division into many overlapping 4-grams</p></list-item>
<list-item><p>Sorting words within each 4-gram to account for the possible permutations of words in translation</p></list-item>
<list-item><p>A set of matching sorted 4-grams is the result of comparing documents.</p></list-item>
</list></p>
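<p>Steps 2&#x02013;4 of the basic algorithm above can be sketched as follows (lemmatization is approximated here by simple lowercasing, and the translation step is omitted; this is an illustrative sketch, not the authors&#x00027; code):</p>
<preformat>
```python
def sorted_shingles(text, n=4):
    # Split the text into overlapping word n-grams and sort the words
    # inside each n-gram to tolerate word-order changes in translation.
    words = text.lower().split()
    return {tuple(sorted(words[i:i + n])) for i in range(len(words) - n + 1)}

def matching_shingles(doc, source, n=4):
    # The comparison result is the set of sorted 4-grams shared by the
    # translated document and a candidate source.
    return sorted_shingles(doc, n).intersection(sorted_shingles(source, n))
```
</preformat>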
<p>The work of Zubarev and Sochenkov (<xref ref-type="bibr" rid="B25">2019</xref>) is also devoted to plagiarism analysis for the Russian-English pair. The authors present a dataset for the text alignment task as an alternative to existing datasets. They compare two models for detecting translated plagiarism. One is based on various similarity indicators for texts using word embeddings and neural machine translation; the other is built on top of it using a pre-trained language representation (BERT). They also generated two corpora with varying numbers of negative samples per positive sentence pair (a source sentence and its plagiarized counterpart):
<list list-type="order">
<list-item><p>Negative-1: One negative example is selected randomly from the most similar sentences. The authors use this dataset for training and tuning models.</p></list-item>
<list-item><p>Negative-4: Four negative examples are selected (the most similar sentence for each similarity score used). They use this dataset for testing purposes, to check how the models handle a larger number of negative examples.</p></list-item>
</list></p>
<p>The authors (Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>) conduct computational experiments using various classifiers:
<list list-type="bullet">
<list-item><p>NMT is a neural machine translation classifier that measures similarity on 1-grams, using the OpenNMT-py library to train a machine translation system as an additional criterion for evaluating pairwise similarity between sentences.</p></list-item>
<list-item><p>NMT2 is the same neural machine translation but measures similarity on 2-grams.</p></list-item>
<list-item><p>LR-1 is a logistic regression classifier with L2 regularization using two similarity scores: one based on sentence embeddings and one calculated after the substitution of all words with the most similar ones in the other language.</p></list-item>
<list-item><p>LR-2 is a logistic regression classifier with C = 1.0 using only sentence embeddings similarity scores and the word substitution similarity score.</p></list-item>
<list-item><p>LASER [Language-Agnostic SEntence Representations (Artetxe and Schwenk, <xref ref-type="bibr" rid="B3">2018</xref>)] is a method for obtaining sentence embeddings that provides a BiLSTM encoder trained on 93 languages.</p></list-item>
<list-item><p>BERT [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]. The authors used the multilingual BERT model and placed a simple linear layer for sentence-pair classification on top of BERT&#x00027;s pooled output.</p></list-item>
</list></p>
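<p>The sentence-embedding similarity score that serves as a feature for the LR classifiers above can be illustrated as follows: embed each sentence as the average of its word vectors and take the cosine of the two embeddings. The word vectors below are toy stand-ins for real Word2Vec output; this is a sketch of the general feature, not the authors&#x00027; implementation:</p>
<preformat>
```python
import math

def sentence_embedding(sentence, word_vectors):
    # Average the vectors of the in-vocabulary words of the sentence.
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def embedding_similarity(s1, s2, word_vectors):
    return cosine(sentence_embedding(s1, word_vectors),
                  sentence_embedding(s2, word_vectors))
```
</preformat>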
<p>The authors (Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>) used Word2Vec for the vector representation of words and created a dedicated dataset consisting of 16,000 sentence pairs from the Yandex parallel corpus that did not take part in learning the word embeddings, plus 4,000 sentences written by students, who looked for English sources on the internet and translated them using the Yandex and Google Translate services, adjusting the output to obtain correct Russian text. A small portion of the sentences was translated manually without these tools; these served as positive instances of plagiarized sentence pairs. The authors of the Russian-language work (Kuznecova et al., <xref ref-type="bibr" rid="B18">2018</xref>) used the FastText library for the representation of words and chose 18.5 million parallel sentences from the Opus corpora, which we show in <bold>Table 5</bold>.</p>
</sec>
<sec sec-type="results" id="s3">
<title>3. Results</title>
<p>We chose six commonly used models that could be scaled to work in a real-world setting for CL plagiarism detection (CL-ASA, CL-ESA, CL-CnG, CL-CTS, CL-KGA, and T&#x0002B;MA), as well as the two models from Thompson and Bowerman (<xref ref-type="bibr" rid="B23">2017</xref>) and Ehsan et al. (<xref ref-type="bibr" rid="B6">2019</xref>). The qualitative results of their analysis are presented in <xref ref-type="table" rid="T1">Tables 1</xref>&#x02013;<xref ref-type="table" rid="T3">3</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Comparative table of models T&#x0002B;MA, CL-ESA, CL-ASA.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Parameter</bold></th>
<th valign="top" align="left"><bold>T&#x0002B;MA</bold></th>
<th valign="top" align="left"><bold>CL-ESA</bold></th>
<th valign="top" align="left"><bold>CL-ASA</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Comparison with</td>
<td valign="top" align="left">CL-CNG, CL-ESA, CL-CTS, CL-ASA, CL-KGA</td>
<td valign="top" align="left">CL-CNG, T&#x0002B;MA, CL-CTS, CL-ASA, CL-KGA</td>
<td valign="top" align="left">CL-CNG, CL-ESA, CL-CTS, T&#x0002B;MA, CL-KGA</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on corpora</td>
<td valign="top" align="left">Independent</td>
<td valign="top" align="left">Gives better results on comparable corpora like Wikipedia</td>
<td valign="top" align="left">Gives better results on parallel corpora like JRC or Europarl</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on kinship of languages</td>
<td valign="top" align="left">Can be used on language pairs whose alphabet and syntax are not related</td>
<td valign="top" align="left">Can be used on language pairs whose alphabet and syntax are not related</td>
<td valign="top" align="left">Can be used on language pairs whose alphabet and syntax are not related</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on machine translator</td>
<td valign="top" align="left">Depends on the quality of machine translation</td>
<td valign="top" align="left">No data</td>
<td valign="top" align="left">Uses statistical machine translation, and depends on its quality</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on the length of document</td>
<td valign="top" align="left">More favorable F1 in all cases except long. Poorly detects plagiarism in small documents</td>
<td valign="top" align="left">No data</td>
<td valign="top" align="left">Its formula tends to minimize the number of false positives in short texts</td>
</tr>
<tr>
<td valign="top" align="left">Productivity</td>
<td valign="top" align="left">High recall (erroneous translations at the text normalization stage can cause low precision). More effective at sentence-level granularity than CL-ASA</td>
<td valign="top" align="left">Performance is close to CL-CNG but depends on language pairs. It is based on similarities to a collection of documents and gives a large number of false positives, because it was originally intended for tasks of similarity, not plagiarism</td>
<td valign="top" align="left">High precision and performance on long documents (from a few paragraphs up to entire documents). Shows good results on professional and automatic translations and yields a small number of false positives. Detects human translations better than T&#x0002B;MA</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparative table of models CL-CNG, CL-CTS, CL-KGA.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Parameter</bold></th>
<th valign="top" align="left"><bold>CL-CNG</bold></th>
<th valign="top" align="left"><bold>CL-CTS</bold></th>
<th valign="top" align="left"><bold>CL-KGA</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Comparison with</td>
<td valign="top" align="left">T&#x0002B;MA, CL-ESA, CL-CTS, CL-ASA, CL-KGA</td>
<td valign="top" align="left">CL-CNG, CL-ESA, T&#x0002B;MA, CL-ASA, CL-KGA</td>
<td valign="top" align="left">CL-CNG, CL-ESA, CL-CTS, CL-ASA</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on corpora</td>
<td valign="top" align="left">Gives better results on comparable corpora like Wikipedia</td>
<td valign="top" align="left">No data</td>
<td valign="top" align="left">No data</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on kinship of languages</td>
<td valign="top" align="left">Has low quality for language pairs without lexical and syntactic similarities. More effective than CL-ASA and CL-ESA when the languages are syntactically related</td>
<td valign="top" align="left">No data</td>
<td valign="top" align="left">No data</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on machine translator</td>
<td valign="top" align="left">No data</td>
<td valign="top" align="left">No data</td>
<td valign="top" align="left">No data</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on the length of document</td>
<td valign="top" align="left">No data</td>
<td valign="top" align="left">Its formula tends to minimize the number of false positives in short texts</td>
<td valign="top" align="left">No data</td>
</tr>
<tr>
<td valign="top" align="left">Productivity</td>
<td valign="top" align="left">High recall. Provides acceptable retrieval quality</td>
<td valign="top" align="left">The behavior depends on granularity and the level of detail. More effective at sentence-level granularity than CL-ASA</td>
<td valign="top" align="left">Provides high results on all indicators through the use of knowledge graphs. Offers better performance than CL-ESA thanks to BabelNet&#x00027;s broad concept coverage and high interconnectivity</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Comparative table of model used cross-lingual word embeddings (CL-WE) and multilingual translation model (MTM) (Thompson and Bowerman, <xref ref-type="bibr" rid="B23">2017</xref>) and model proposed in Ehsan et al. (<xref ref-type="bibr" rid="B6">2019</xref>).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Parameter</bold></th>
<th valign="top" align="left"><bold>Model used cross-lingual word embeddings (CL-WE) and multilingual translation model (MTM) (Thompson and Bowerman, <xref ref-type="bibr" rid="B23">2017</xref>)</bold></th>
<th valign="top" align="left"><bold>Model proposed in Ehsan et al. (<xref ref-type="bibr" rid="B6">2019</xref>)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Comparison with</td>
<td valign="top" align="left">T&#x0002B;MA</td>
<td valign="top" align="left">CL-CNG, T&#x0002B;MA</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on corpora</td>
<td valign="top" align="left">Does not require parallel or comparable corpora, but a dictionary must be compiled to train the model. Not limited to bilingual CLPD tasks</td>
<td valign="top" align="left">Uses a simple dictionary (no probability of translation) as the only translation resource</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on kinship of languages</td>
<td valign="top" align="left">Applicable in any pair of languages that have any translation resource</td>
<td valign="top" align="left">Applicable in any pair of languages that have any translation resource</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on machine translator</td>
<td valign="top" align="left">Depends on the quality of translation of the dictionary for model training</td>
<td valign="top" align="left">Does not use a machine translation system and does not depend on the availability or quality of machine translation systems</td>
</tr>
<tr>
<td valign="top" align="left">Dependence on the length of document</td>
<td valign="top" align="left">No data</td>
<td valign="top" align="left">No data</td>
</tr>
<tr>
<td valign="top" align="left">Productivity</td>
<td valign="top" align="left">Preserves the precision of the T &#x0002B; MA model without losing recall</td>
<td valign="top" align="left">More productive than CL-CNG and T&#x0002B;MA</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="T4">Table 4</xref> introduces our comparative results of the approaches used in the articles for the Russian-English pair. Precision, recall, and F1 characteristics are also represented in <xref ref-type="fig" rid="F2">Figures 2</xref>&#x02013;<xref ref-type="fig" rid="F4">4</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>A quantitative analysis of the effectiveness of methods for a Russian-English pair.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Article</bold></th>
<th valign="top" align="left"><bold>Algorithm</bold></th>
<th valign="top" align="center"><bold>Precision</bold></th>
<th valign="top" align="center"><bold>Recall</bold></th>
<th valign="top" align="center"><bold>F1</bold></th>
<th valign="top" align="center"><bold>Computation time (seconds)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Cross-language text alignment for plagiarism detection based on contextual and context-free models (Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>) Negative-1</td>
<td valign="top" align="left">Sentence embeddings</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">2.89</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Words substitution</td>
<td valign="top" align="center">0.84</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">2.63</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">NMT</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">34.15</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">NMT-2</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.64</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">240.13</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">LR-1</td>
<td valign="top" align="center">0.91</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">39.68</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">LR-2</td>
<td valign="top" align="center">0.87</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">5.53</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Laser</td>
<td valign="top" align="center">0.90</td>
<td valign="top" align="center">0.89</td>
<td valign="top" align="center">0.89</td>
<td valign="top" align="center">7.63</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Bert</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">91.95</td>
</tr>
<tr>
<td valign="top" align="left">Cross-language text alignment for plagiarism detection based on contextual and context-free models (Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>) Negative-4</td>
<td valign="top" align="left">Sentence embeddings</td>
<td valign="top" align="center">0.45</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.57</td>
<td valign="top" align="center">4.02</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Words substitution</td>
<td valign="top" align="center">0.60</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.66</td>
<td valign="top" align="center">3.3</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">NMT</td>
<td valign="top" align="center">0.61</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">34.31</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">NMT-2</td>
<td valign="top" align="center">0.54</td>
<td valign="top" align="center">0.64</td>
<td valign="top" align="center">0.58</td>
<td valign="top" align="center">240.29</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">LR-1</td>
<td valign="top" align="center">0.73</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">41.65</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">LR-2</td>
<td valign="top" align="center">0.64</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.71</td>
<td valign="top" align="center">7.34</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Laser</td>
<td valign="top" align="center">0.70</td>
<td valign="top" align="center">0.89</td>
<td valign="top" align="center">0.78</td>
<td valign="top" align="center">11.04</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Bert</td>
<td valign="top" align="center">0.88</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">0.90</td>
<td valign="top" align="center">197.45</td>
</tr>
<tr>
<td valign="top" align="left">Detection of translated borrowings in large arrays of scientific documents (Kuznecova et al., <xref ref-type="bibr" rid="B18">2018</xref>)</td>
<td valign="top" align="left">Basic</td>
<td valign="top" align="center">0.99</td>
<td valign="top" align="center">0.15</td>
<td valign="top" align="center">0.26</td>
<td valign="top" align="center">-</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Proposed</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">-</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Results of methods from article (Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>) with Negative-1.</p></caption>
<graphic xlink:href="fcomp-02-523053-g0002.tif"/>
</fig>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Results of methods from article (Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>) with Negative-4.</p></caption>
<graphic xlink:href="fcomp-02-523053-g0003.tif"/>
</fig>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Results of &#x0201C;basic&#x0201D; and &#x0201C;proposed&#x0201D; methods from article (Kuznecova et al., <xref ref-type="bibr" rid="B18">2018</xref>).</p></caption>
<graphic xlink:href="fcomp-02-523053-g0004.tif"/>
</fig>
<p>F1 is the harmonic mean of precision and recall, balancing the accuracy and completeness of classification, and is calculated as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mi>P</mml:mi><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>R</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where P is precision and R is recall (Kuznecova et al., <xref ref-type="bibr" rid="B18">2018</xref>).</p>
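<p>As a minimal sketch, Equation (1) can be computed directly from precision and recall; here the LR-1 scores on Negative-1 from Table 4 serve as sample inputs:</p>

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, Equation (1)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# LR-1 on Negative-1 (Table 4): P = 0.91, R = 0.80
print(round(f1_score(0.91, 0.80), 2))  # -> 0.85
```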
<p>Among the algorithms that the authors applied to the Russian-English pair, the Bert language model shows high performance; however, it is unsuitable for large-scale checking of borrowings because it exhibited the worst computation time, as can be seen in <xref ref-type="table" rid="T4">Table 4</xref>. It shows good results only after retraining for a specific task. Additionally, Bert is quite complex and requires considerable hardware capacity, which limits its use. The authors of Zubarev and Sochenkov (<xref ref-type="bibr" rid="B25">2019</xref>) therefore proposed a classifier with a reduced feature space for effective filtering, using only sentence embeddings and word-substitution measures. They considered it reasonable to use context-free and contextual models together in modern plagiarism detection systems. Cross-language word embeddings trained on large parallel corpora were used to analyze the similarity of two sentences via two different similarity scores: one based on the sentence embeddings and the other calculated after replacing all words with their most similar counterparts in the other language. The LR-1 classifier showed performance comparable with Bert; tuned to maximize recall, it can greatly reduce the load on the more sophisticated processing downstream and is more than two times faster than Bert. LASER showed &#x0201C;the golden mean&#x0201D; between F1 and computation time, and its speed could be higher still, since the authors did not use pre-trained English embeddings. The NMT-2 method had the worst speed, even slower than Bert, and the lowest recall. In all analyzed approaches, a higher number of negative examples (Negative-4) means lower precision and F1, while recall stays the same; computation time also increases on Negative-4.</p>
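<p>The sentence-embedding similarity score described above can be sketched as follows: word vectors from a shared cross-lingual space are averaged into sentence vectors, which are then compared by cosine similarity. The vectors below are hypothetical toy values, not the actual embeddings used in Zubarev and Sochenkov (2019):</p>

```python
import math

def sentence_embedding(word_vectors):
    """Average word vectors into a single sentence vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy cross-lingual vectors: aligned words receive similar vectors
ru = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]  # hypothetical Russian sentence
en = [[0.8, 0.2, 0.0], [0.1, 0.9, 0.1]]  # hypothetical English paraphrase
print(cosine_similarity(sentence_embedding(ru), sentence_embedding(en)))
```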
<p>With respect to the algorithm proposed in the Russian-language article (Kuznecova et al., <xref ref-type="bibr" rid="B18">2018</xref>), we found that it showed quite high precision compared to the above methods, though slightly lower than that of the basic algorithm. The high precision of the basic algorithm is due to the fact that it considers only near-duplicate texts. Despite this, the proposed algorithm has better recall and F1 indicators than the basic one.</p>
<p>We found that the most effective methods for evaluating the quantity of plagiarism are Bert and the machine-translation-based algorithm proposed in Kuznecova et al. (<xref ref-type="bibr" rid="B18">2018</xref>). As the Russian-language articles do not report computation time, we could not compare them on speed, but the precision, recall, and F1 measures of these approaches showed the best results.</p>
</sec>
<sec sec-type="conclusions" id="s4">
<title>4. Conclusion</title>
<p>We conducted a meta-analysis of approaches used for detection of cross-language plagiarism and studied the methods for identifying cross-language plagiarism during the course of this research. Comparison results are shown in <xref ref-type="table" rid="T1">Tables 1</xref>&#x02013;<xref ref-type="table" rid="T3">3</xref>.</p>
<p>For the Russian-English pair, we performed an in-depth analysis, comparing the models by the following characteristics: precision, recall, F1, computation time, datasets, and vector representation of words. Results are presented in <xref ref-type="table" rid="T4">Tables 4</xref>, <xref ref-type="table" rid="T5">5</xref>, and bar charts in <xref ref-type="fig" rid="F2">Figures 2</xref>&#x02013;<xref ref-type="fig" rid="F4">4</xref>.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Comparative table for the used technologies.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Article</bold></th>
<th valign="top" align="left"><bold>Datasets</bold></th>
<th valign="top" align="left"><bold>Vector representation of words</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Cross-language text alignment for plagiarism detection based on contextual and context-free models (Zubarev and Sochenkov, <xref ref-type="bibr" rid="B25">2019</xref>)</td>
<td valign="top" align="left">44 million sentences (for each language): parallel sentences from Opus corpora &#x0002B; sentences from the Yandex Parallel corpus (English-Russian Parallel Corpora, <xref ref-type="bibr" rid="B7">2015</xref>) (16,000 sentence pairs that were not used for learning word embeddings) &#x0002B; parallel concepts from Wikidata &#x0002B; 4,000 sentences manually written by students</td>
<td valign="top" align="left">Word2Vec</td>
</tr>
<tr>
<td valign="top" align="left">Detection of translated borrowings in large arrays of scientific documents (Kuznecova et al., <xref ref-type="bibr" rid="B18">2018</xref>)</td>
<td valign="top" align="left">18.5 million parallel sentences from Opus corpora &#x0002B; 10 million sentences from the English version of Wikipedia &#x0002B; articles from journals included in the Russian Science Citation Index (RSCI)</td>
<td valign="top" align="left">FastText</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We selected machine translation mainly to reduce the task to the analysis of documents in one language. However, the disadvantage of this approach is that repeated sections of text may go undetected due to the peculiarities of translation and interpretation.</p>
<p>Despite the large number of developed programs for detecting plagiarism, the problem of detecting translated borrowings in weakly related languages is still relevant.</p>
<p>Extending vocabulary can be considered an issue for cross-language plagiarism detection systems. Many available parallel corpora contain common lexis, but borrowing detection should also be applicable and accurate for scientific papers, where a multitude of special terms exist. A possible next step is to create parallel corpora from comparable corpora with the help of a translated plagiarism detection system and to extend the vocabulary with this new parallel data.</p>
<p>Based on early research, it is fair to say that the CL-CnG model is less effective for the Russian-English pair because these languages are not syntactically related, while CL-ESA is better suited to relatedness tasks than to plagiarism detection. As for the other models, not only machine translation but also CL-ASA, CL-CTS, and CL-KGA can be used to develop cross-language borrowing verification systems for the Russian-English pair. However, the models proposed in Thompson and Bowerman (<xref ref-type="bibr" rid="B23">2017</xref>) and Ehsan et al. (<xref ref-type="bibr" rid="B6">2019</xref>) seem most suitable for this language pair due to their independence from resources such as parallel or comparable corpora, network connectivity, and the availability of online translators throughout the entire text comparison process.</p>
<p>While studying the articles and works, we found that the model comparison results do not contradict each other. In general, the difference in model performance is small, and the choice of model most often depends on the required speed, the available resources, corpora, and dictionaries, as well as the analyzed language pair and its relatedness.</p>
<p>In future work, we plan to apply the methods most suitable for the Russian-English pair in practice and assess their precision, recall, and F1 indicators to confirm or refute the applicability of the models.</p>
</sec>
<sec sec-type="data-availability-statement" id="s5">
<title>Data Availability Statement</title>
<p>All datasets generated for this study are included in the article/supplementary material.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>ATl, ATo, MT, and VK contributed conception and design of the study. ATl wrote the first draft of the manuscript. VK reviewed the first draft and suggested improvements. ATl and MT wrote sections of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.</p>
</sec>
<sec id="s7">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<ack><p>The work partly was performed according to the Russian Government Program of Competitive Growth of Kazan Federal University.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alaa</surname> <given-names>Z.</given-names></name> <name><surname>Tiun</surname> <given-names>S.</given-names></name> <name><surname>Abdulameer</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Cross-language plagiarism of Arabic-English documents using linear logistic regression</article-title>. <source>J. Theoret. Appl. Inform. Technol</source>. <volume>83</volume>, <fpage>20</fpage>&#x02013;<lpage>33</lpage>.</citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Amancio</surname> <given-names>D. R.</given-names></name></person-group> (<year>2015</year>). <article-title>Comparing the topological properties of real and artificially generated scientific manuscripts</article-title>. <source>Scientometrics</source> <volume>105</volume>, <fpage>1763</fpage>&#x02013;<lpage>1779</lpage>. <pub-id pub-id-type="doi">10.1007/s11192-015-1637-z</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Artetxe</surname> <given-names>M.</given-names></name> <name><surname>Schwenk</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond</article-title>. <source>CoRR abs/1812.10464</source>. <pub-id pub-id-type="doi">10.1162/tacl/_a/_00288</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barr&#x000F3;n-Cede&#x000F1;o</surname> <given-names>A.</given-names></name> <name><surname>Gupta</surname> <given-names>P.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name></person-group> (<year>2013</year>). <article-title>Methods for cross-language plagiarism detection</article-title>. <source>Knowledge Based Syst</source>. <volume>50</volume>, <fpage>211</fpage>&#x02013;<lpage>217</lpage>. <pub-id pub-id-type="doi">10.1016/j.knosys.2013.06.018</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Barr&#x000F3;n-Cede&#x000F1;o</surname> <given-names>A.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name> <name><surname>Agirre</surname> <given-names>E.</given-names></name> <name><surname>Labaka</surname> <given-names>G.</given-names></name></person-group> (<year>2010</year>). <article-title>Plagiarism detection across distant language pairs</article-title>, in <source>Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference</source> (<publisher-loc>Beijing</publisher-loc>), <fpage>37</fpage>&#x02013;<lpage>45</lpage>.</citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ehsan</surname> <given-names>N.</given-names></name> <name><surname>Shakery</surname> <given-names>A.</given-names></name> <name><surname>Tompa</surname> <given-names>F. W.</given-names></name></person-group> (<year>2019</year>). <article-title>Cross-lingual text alignment for fine-grained plagiarism detection</article-title>. <source>J. Inform. Sci</source>. <volume>45</volume>, <fpage>443</fpage>&#x02013;<lpage>459</lpage>. <pub-id pub-id-type="doi">10.1177/0165551518787696</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><collab>English-Russian parallel corpora</collab></person-group> (<year>2015</year>).</citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ferrero</surname> <given-names>J.</given-names></name> <name><surname>Agnes</surname> <given-names>F.</given-names></name> <name><surname>Besacier</surname> <given-names>L.</given-names></name> <name><surname>Schwab</surname> <given-names>D.</given-names></name></person-group> (<year>2017a</year>). <article-title>Using word embedding for cross-language plagiarism detection</article-title>. <pub-id pub-id-type="doi">10.18653/v1/W17-2502</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ferrero</surname> <given-names>J.</given-names></name> <name><surname>Besacier</surname> <given-names>L.</given-names></name> <name><surname>Schwab</surname> <given-names>D.</given-names></name> <name><surname>Agnes</surname> <given-names>F.</given-names></name></person-group> (<year>2017b</year>). <article-title>Deep investigation of cross-language plagiarism detection methods</article-title>, <fpage>6</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.18653/v1/E17-2066</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Franco-Salvador</surname> <given-names>M.</given-names></name> <name><surname>Gupta</surname> <given-names>P.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name> <name><surname>Banchs</surname> <given-names>R. E.</given-names></name></person-group> (<year>2016a</year>). <article-title>Cross-language plagiarism detection over continuous-space and knowledge graph-based representations of language</article-title>. <source>Knowledge Based Syst</source>. <volume>111</volume>, <fpage>87</fpage>&#x02013;<lpage>99</lpage>. <pub-id pub-id-type="doi">10.1016/j.knosys.2016.08.004</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Franco-Salvador</surname> <given-names>M.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name> <name><surname>y Gamez</surname> <given-names>M. M.</given-names></name></person-group> (<year>2016b</year>). <article-title>A systematic study of knowledge graph analysis for cross-language plagiarism detection</article-title>. <source>Inform. Process. Manage</source>. <volume>52</volume>, <fpage>550</fpage>&#x02013;<lpage>570</lpage>. <pub-id pub-id-type="doi">10.1016/j.ipm.2015.12.004</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><collab>Google Scholar</collab></person-group> (<year>2004</year>).</citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hanane</surname> <given-names>E.</given-names></name> <name><surname>Erritali</surname> <given-names>M.</given-names></name> <name><surname>Oukessou</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Semantic similarity/relatedness for cross language plagiarism detection</article-title>, in <source>2016 13th International Conference on Computer Graphics, Imaging and Visualization (CGiV)</source> (<publisher-loc>Morocco</publisher-loc>). <pub-id pub-id-type="doi">10.1109/CGiV.2016.78</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Horsley</surname> <given-names>T.</given-names></name> <name><surname>Dingwall</surname> <given-names>O.</given-names></name> <name><surname>Sampson</surname> <given-names>M.</given-names></name></person-group> (<year>2011</year>). <article-title>Checking reference lists to find additional studies for systematic reviews</article-title>. <source>Cochrane Datab. System. Rev</source>. <volume>8</volume>:<fpage>MR000026</fpage>. <pub-id pub-id-type="doi">10.1002/14651858.MR000026.pub2</pub-id><pub-id pub-id-type="pmid">21833989</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><collab>IEEE</collab></person-group> (<year>2019</year>). <source>Advanced Technology for Humanity</source>.</citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><collab>IEEE-Faq</collab></person-group> (<year>2019</year>). <source>Advanced Technology for Humanity</source>.</citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kent</surname> <given-names>C. K.</given-names></name> <name><surname>Salim</surname> <given-names>N.</given-names></name></person-group> (<year>2010</year>). <article-title>Web based cross language plagiarism detection</article-title>, in <source>2010 Second International Conference on Computational Intelligence, Modelling and Simulation</source> (<publisher-loc>Tuban</publisher-loc>), <fpage>199</fpage>&#x02013;<lpage>204</lpage>. <pub-id pub-id-type="doi">10.1109/CIMSiM.2010.10</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kuznecova</surname> <given-names>R. V.</given-names></name> <name><surname>Bahteev</surname> <given-names>O. U.</given-names></name> <name><surname>Chekhovich</surname> <given-names>U. V.</given-names></name></person-group> (<year>2018</year>). <article-title>Detection of translated borrowings in large arrays of scientific documents (Detektirovanie perevodnyh zaimstvovanij v bolshih massivah nauchnyh dokumentov)</article-title>.</citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McNamee</surname> <given-names>P.</given-names></name> <name><surname>Mayfield</surname> <given-names>J.</given-names></name></person-group> (<year>2004</year>). <article-title>Character n-gram tokenization for European language text retrieval</article-title>. <source>Inform. Retriev</source>. <volume>7</volume>, <fpage>73</fpage>&#x02013;<lpage>97</lpage>. <pub-id pub-id-type="doi">10.1023/B:INRT.0000009441.78971.be</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Moher</surname> <given-names>D.</given-names></name> <name><surname>Shamseer</surname> <given-names>L.</given-names></name> <name><surname>Clarke</surname> <given-names>M.</given-names></name> <name><surname>Ghersi</surname> <given-names>D.</given-names></name> <name><surname>Liberati</surname> <given-names>A.</given-names></name> <name><surname>Petticrew</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2009</year>). <article-title>Prisma statement</article-title>.</citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Navigli</surname> <given-names>R.</given-names></name> <name><surname>Ponzetto</surname> <given-names>S. P.</given-names></name></person-group> (<year>2012</year>). <article-title>Babelnet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network</article-title>. <source>Artif. Intell</source>. <volume>193</volume>, <fpage>217</fpage>&#x02013;<lpage>250</lpage>. <pub-id pub-id-type="doi">10.1016/j.artint.2012.07.001</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Potthast</surname> <given-names>M.</given-names></name> <name><surname>Barr&#x000F3;n-Cede&#x000F1;o</surname> <given-names>A.</given-names></name> <name><surname>Stein</surname> <given-names>B.</given-names></name> <name><surname>Rosso</surname> <given-names>P.</given-names></name></person-group> (<year>2011</year>). <article-title>Cross-language plagiarism detection</article-title>. <source>Lang. Resour. Eval</source>. <volume>45</volume>, <fpage>45</fpage>&#x02013;<lpage>62</lpage>. <pub-id pub-id-type="doi">10.1007/s10579-009-9114-z</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thompson</surname> <given-names>V.</given-names></name> <name><surname>Bowerman</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Detecting cross-lingual plagiarism using simulated word embeddings</article-title>.</citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tlitova</surname> <given-names>A. E.</given-names></name> <name><surname>Toschev</surname> <given-names>A. S.</given-names></name></person-group> (<year>2019</year>). <article-title>Review of existing tools for detecting plagiarism and self-plagiarism (Obzor sushchestvuyushchih instrumentov vyyavleniya plagiata i samoplagiata)</article-title>. <source>Elektronnye Biblioteki</source> <volume>22</volume>, <fpage>143</fpage>&#x02013;<lpage>159</lpage>. <pub-id pub-id-type="doi">10.26907/1562-5419-2019-22-3-143-159</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zubarev</surname> <given-names>D.</given-names></name> <name><surname>Sochenkov</surname> <given-names>I.</given-names></name></person-group> (<year>2019</year>). <article-title>Cross-language text alignment for plagiarism detection based on contextual and context-free models</article-title>.</citation></ref>
</ref-list>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This work was partly funded by the subsidy allocated to Kazan Federal University for the state assignment in the sphere of scientific activities number 2.8303.2017/8.9. The reported research was partly funded by Russian Foundation for Basic Research and the Methods and models for the formation of the digital infrastructure of the scientific and educational cluster of the Republic of Tatarstan, grant N 18-47-160012. The reported study was partly funded by RFBR according to the research project N 19-29-03057.</p>
</fn>
</fn-group>
</back>
</article>