<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2019.00001</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Methods</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A Pattern-Based Method for Medical Entity Recognition From Chinese Diagnostic Imaging Text</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Liang</surname> <given-names>Zihong</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/670418/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Chen</surname> <given-names>Junjie</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/670415/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Xu</surname> <given-names>Zhaopeng</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/694932/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Chen</surname> <given-names>Yuyang</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/728021/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Hao</surname> <given-names>Tianyong</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/728528/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Computer Science, South China Normal University</institution>, <addr-line>Guangzhou</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Software College, Northeastern University</institution>, <addr-line>Shenyang</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Buzhou Tang, Harbin Institute of Technology, Shenzhen, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Kristina Hettne, Center for Digital Scholarship at the Leiden University Library, Netherlands; Sunyoung Jang, Princeton Radiation Oncology Center, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Tianyong Hao <email>haoty&#x00040;m.scnu.edu.cn</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Medicine and Public Health, a section of the journal Frontiers in Artificial Intelligence</p></fn></author-notes>
<pub-date pub-type="epub">
<day>14</day>
<month>05</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>2</volume>
<elocation-id>1</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>01</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>04</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2019 Liang, Chen, Xu, Chen and Hao.</copyright-statement>
<copyright-year>2019</copyright-year>
<copyright-holder>Liang, Chen, Xu, Chen and Hao</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p><bold>Background:</bold> The identification of medical entities and relations from electronic medical records is a fundamental research issue for medical informatics. However, the task of extracting valuable knowledge from these records is challenging due to their high complexity. The accurate identification of entities and relations is still an open research problem in medical information extraction.</p>
<p><bold>Methods:</bold> A pattern-based method for extracting certain tumor-related entities and attributes from Chinese unstructured diagnostic imaging text is proposed. The method consists of three steps. First, an algorithm based on keyword matching is designed to obtain the primary sites of tumors. Then, a set of regular expressions is applied to identify primary tumor size information. Finally, a set of rules is defined to acquire metastatic sites of tumors.</p>
<p><bold>Results:</bold> Our method achieves a recall of 0.697, a precision of 0.825 and an F1 score of 0.755 using an overall weighted metric. For each of the extraction tasks, the F1 scores are 0.784, 0.822 and 0.740.</p>
<p><bold>Conclusions:</bold> The method proves to be stable and robust with different amounts of testing data. It achieves a comparatively high performance in the CHIP 2018 open challenge, demonstrating its effectiveness in extracting tumor-related entities from Chinese diagnostic imaging text.</p></abstract>
<kwd-group>
<kwd>medical named entity recognition</kwd>
<kwd>pattern-based strategy</kwd>
<kwd>information extraction</kwd>
<kwd>clinical text</kwd>
<kwd>natural language processing</kwd>
</kwd-group>
<counts>
<fig-count count="2"/>
<table-count count="2"/>
<equation-count count="3"/>
<ref-count count="30"/>
<page-count count="8"/>
<word-count count="5505"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Biomedical named entity recognition (NER) is a critical task for extracting patient information from medical diagnoses to support medical research and treatment decision making. The aim of NER is to locate and classify medical named entity mentions, such as treatments and symptoms, in unstructured text (Demner-Fushman et al., <xref ref-type="bibr" rid="B11">2009</xref>). Through NER, information hidden in a diagnosis can be uncovered and used to improve existing medical systems. More importantly, medical information processing systems that rely solely on structured data cannot directly access this kind of hidden information in medical text. Building a practical NER system is not easy because of the complexity of medical text. In addition to the difficulty of accurately extracting named entities from unstructured medical text, another primary difficulty is the precise normalization of extracted named entities by mapping them to concepts (Buzhou Tang, <xref ref-type="bibr" rid="B4">2018b</xref>).</p>
<p>The China Health Information Processing Conference (CHIP) 2018 hosted an open challenge on methods for clinical named entity recognition (Buzhou Tang, <xref ref-type="bibr" rid="B3">2018a</xref>). It offers several subtasks addressing information extraction focused on the organ and size of cancer (Buzhou Tang, <xref ref-type="bibr" rid="B4">2018b</xref>). The first subtask focuses on extracting organ entities of primary cancer and metastatic cancer: models are expected to identify and normalize the organ entities where primary cancer and metastatic cancer exist. The second subtask is to identify the size of the primary cancer mentioned in a diagnosis. The challenge provides a training dataset and a testing dataset. The training dataset contains 600 entries, each consisting of a diagnosis, primary tumor site, primary tumor size and metastatic tumor site, mostly about lung cancer and breast cancer. The testing dataset contains 200 entries and covers broader content than the training dataset.</p>
<p>Targeting the NER challenge on the standard CHIP 2018 dataset, we propose a pattern-based method which exploits background knowledge, discourse knowledge and domain-specific knowledge to extract the required entities. Based on the testing data, our method achieves F1 scores of 0.78, 0.82, and 0.74 on the primary tumor site, primary tumor size and metastatic tumor site identification tasks, respectively.</p>
<p>The rest of this paper is organized as follows: Section 2 introduces related work on biomedical named entity recognition. Section 3 describes the pattern-based named entity recognition method in detail. Section 4 presents the experimental results of our method and Section 5 presents the conclusions.</p></sec>
<sec id="s2">
<title>2. Related Work</title>
<p>Various methods have been developed in clinical natural language processing (NLP) systems. In recent years, machine learning approaches have drawn much attention from the named entity recognition research community. Machine learning approaches usually regard an NER task as a sequence labeling problem: they search for the best label sequence, most often in BIO (Begin, Inside, Outside) format, for a given input sentence. Among them, Hidden Markov Models (HMM), Support Vector Machines (SVM) and Conditional Random Fields (CRF) are the most frequently used methods (Lafferty et al., <xref ref-type="bibr" rid="B18">2001</xref>; Zhou and Su, <xref ref-type="bibr" rid="B30">2002</xref>; R&#x000F6;ssler, <xref ref-type="bibr" rid="B24">2004</xref>). McCallum and Li (<xref ref-type="bibr" rid="B21">2003</xref>) proposed a CRF model that achieved an F1 score of 0.84 on the CoNLL-2003 dataset. Recent developments in neural networks have boosted CRF-based models by around 10%, improving their capability for public service (Huang et al., <xref ref-type="bibr" rid="B14">2015</xref>; Ma and Hovy, <xref ref-type="bibr" rid="B20">2016</xref>; Gridach, <xref ref-type="bibr" rid="B13">2017</xref>). Despite their superior performance, these approaches still sometimes have trouble incorporating prior domain knowledge.</p>
<p>The medical field is a specific domain with significant domain knowledge that can be exploited systematically to provide more informative support to application systems. Domain knowledge usually falls into two categories: (1) discourse knowledge, driven by the phenomenon that diagnoses usually stick to a fixed writing style, and (2) background knowledge, which is captured in medical resources such as UMLS (Bodenreider, <xref ref-type="bibr" rid="B2">2004</xref>) and MeSH (Lipscomb, <xref ref-type="bibr" rid="B19">2000</xref>). External knowledge becomes indispensable when named entities must be linked to specific concepts.</p>
<p>Jindal and Roth (<xref ref-type="bibr" rid="B16">2013</xref>) proposed an Integer Linear Programming approach to incorporate soft global constraints among different named entities. They included the constraint penalty in the training process and improved performance on the 2010 i2b2/VA dataset. The advantage of incorporating discourse knowledge was also reflected in the application of word embedding (Mikolov et al., <xref ref-type="bibr" rid="B22">2013</xref>), the cornerstone of today&#x00027;s natural language processing field. Zhou et al. (<xref ref-type="bibr" rid="B29">2015</xref>) indicated the advantages of introducing continuous word representation to a community question answering system. As emphasized by Wang et al. (<xref ref-type="bibr" rid="B25">2015</xref>), the incorporation of continuous word representation allowed the neural model to outperform others on the Google Snippets dataset. Moreover, character-embedding techniques that followed the same idea pushed the research one step further. Chiu and Nichols (<xref ref-type="bibr" rid="B6">2016</xref>) improved the F1 score by 7% on the CoNLL-2003 dataset by adding character embedding. Many modern NER systems utilized these techniques and notably surpassed their counterparts (Xie et al., <xref ref-type="bibr" rid="B26">2017</xref>; Xu et al., <xref ref-type="bibr" rid="B28">2017</xref>). In addition, the wide application of the technique (Mikolov et al., <xref ref-type="bibr" rid="B22">2013</xref>; Chung et al., <xref ref-type="bibr" rid="B8">2014</xref>; Rajpurkar et al., <xref ref-type="bibr" rid="B23">2016</xref>; Chen et al., <xref ref-type="bibr" rid="B5">2017</xref>) proved the impact and importance of introducing discourse knowledge.</p>
<p>Similar to the role that discourse knowledge plays in discovering concept-related named entities, background knowledge has a critical influence on named entity normalization. A named entity normalization system proposed by Cho et al. (<xref ref-type="bibr" rid="B7">2017</xref>) relied heavily on background knowledge. The key to this system is utilizing a disease/plant name dictionary to augment training data in order to obtain a named entity/concept word representation. In a novel named entity normalization method, Dogan and Lu (<xref ref-type="bibr" rid="B12">2012</xref>) trained an abbreviation resolution dictionary based on the phenomenon that the complete form and the abbreviated form appear together in biomedical abstracts. By utilizing automatically learned background knowledge, their method significantly outperformed other state-of-the-art models, e.g., METAMAP (Aronson, <xref ref-type="bibr" rid="B1">2001</xref>).</p>
<p>Domain adaptation is one of the critical problems in natural language processing. One way to solve this issue is to exploit domain-specific knowledge. Daume and Marcu (<xref ref-type="bibr" rid="B10">2006</xref>) proposed MegaM based on the idea that any data could be regarded as a combination of domain-specific and generic features. The evaluation results showed MegaM&#x00027;s superiority over other baseline methods in terms of performance; applying MegaM could reduce the error rate by up to 50%. Following this idea, Daum&#x000E9; (<xref ref-type="bibr" rid="B9">2009</xref>) later extended the approach from two domains to datasets with an arbitrary number of domains. Kim et al. (<xref ref-type="bibr" rid="B17">2016</xref>) implemented this idea with a neural network method and demonstrated a significant improvement over Daum&#x000E9;&#x00027;s approach.</p></sec>
<sec sec-type="methods" id="s3">
<title>3. Methodology</title>
<p>Based on a standard dataset obtained from the CHIP 2018 open challenge, this medical named entity recognition research targets three sub-tasks: (1) Identification of primary tumor sites, (2) extraction of primary tumor sizes, and (3) recognition of metastatic tumor sites.</p>
<p>Our method utilizes a pattern-based strategy, which is simple but practical given that the volume of data is too limited for training complex machine learning models such as LSTM. The architecture of our method is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. Our method first applies a Chinese text processing tool named Jieba, which allows users to supply their own word segmentation dictionary to produce more accurate word segmentation. Since the generally used version of the tool was trained on a one-million-word corpus of Mandarin Chinese from the newspaper People&#x00027;s Daily, we develop a new dictionary containing human anatomic positions to enhance medical word segmentation performance.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The architecture of our pattern-based biomedical named entity recognition method.</p></caption>
<graphic xlink:href="frai-02-00001-g0001.tif"/>
</fig>
<sec>
<title>3.1. Identification of Primary Tumor Sites</title>
<p>According to the task description (Buzhou Tang, <xref ref-type="bibr" rid="B4">2018b</xref>), a primary tumor site is the first anatomic site that is mentioned in diagnostic imaging text and related to the description of a malignant tumor. In most cases, there is only one primary tumor. Our method thus relies on the common characteristic that cancer entities correlate highly with the appearance of certain indicators that are usually used to describe malignant tumors in Chinese or English, such as &#x0201C;&#x0764C;&#x0201D; (cancer), &#x0201C;&#x06076;&#x06027;&#x0201D; (malignant), &#x0201C;&#x07624;&#x0201D; (tumor), &#x0201C;MT&#x0201D; (abbreviation of malignant tumor), &#x0201C;CA&#x0201D; (abbreviation of cancer), etc. When an anatomic position and its associated indicating words appear in the same text and this anatomic position is not related to cancer metastasis, the anatomic position is regarded as a primary tumor site. We also assume that cancer entities and these indicating words appear in the same sentence context due to their high correlation. Therefore, the method employs a keyword-matching mechanism to filter out irrelevant sentences and reduce computational complexity accordingly. After the filtering process, only sentences which contain target cancer entities are kept. Then, Jieba is applied to split the remaining sentences into segments, which are subsequently matched against our pre-defined organ dictionary to acquire approximate sites of tumors. In certain cases, a cancer entity may not be sufficient to identify the precise site of a primary cancer. For example, &#x0201C;&#x080BA;&#x0201D; (lung) can be extracted from &#x0201C;&#x080BA;&#x0764C;&#x0201D; (lung cancer). However, it is not the desired tumor site, since lung cancer can refer to a cancer originating from any part of the lung, e.g., the superior lobe of the left lung. To obtain the specific site, we search for a more detailed body part entity in other sentences. If the detailed entity is a part of the approximate site, we consider that body part to be where the primary tumor originates.</p></sec>
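<p>The keyword filtering and dictionary lookup used for primary tumor sites can be illustrated with a minimal sketch. It is not the implementation used in this work: it relies on plain substring matching instead of Jieba segmentation, it omits the exclusion of metastasis-related positions, and the organ dictionary <monospace>ORGAN_PARTS</monospace> as well as all function names are hypothetical placeholders.</p>
<preformat>
```python
import re

# Indicator words quoted in the text.
INDICATORS = ("癌", "恶性", "瘤", "MT", "CA")

# Hypothetical toy organ dictionary: approximate site -> more detailed parts.
ORGAN_PARTS = {
    "肺": ["左肺上叶", "左肺下叶", "右肺上叶", "右肺中叶", "右肺下叶"],
    "乳腺": ["左乳腺", "右乳腺"],
}


def split_sentences(text):
    """Split on common Chinese and ASCII sentence delimiters."""
    return [s for s in re.split(r"[。；;！？\n]", text) if s.strip()]


def find_primary_site(text):
    """Return a detailed site if one is found, else the approximate site."""
    sentences = split_sentences(text)
    # Keep only sentences that contain a malignancy indicator.
    candidates = [s for s in sentences if any(k in s for k in INDICATORS)]
    for sentence in candidates:
        for organ, parts in ORGAN_PARTS.items():
            if organ not in sentence:
                continue
            # Prefer a detailed part of this organ mentioned anywhere in the text.
            for other in sentences:
                for part in parts:
                    if part in other:
                        return part
            return organ
    return None
```
</preformat>
<p>In this sketch, a text that mentions a mass in the superior lobe of the left lung in one sentence and lung cancer in another yields the detailed site, whereas a text that only mentions lung cancer falls back to the approximate site.</p>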
<sec>
<title>3.2. Extraction of Primary Tumor Sizes</title>
<p>This sub-task extracts primary tumor size information from diagnostic conclusion text, which usually contains a description of size for each lesion. We design a regular expression-based method for size entity extraction. It is composed of the following steps: (1) Detecting sentences containing size formats, (2) excluding sentences which are not related to primary tumors, and (3) extracting target size mentions with regular expressions.</p>
<p><bold>Step 1: Detecting sentences containing size formats</bold>. The basis of this method is the strong regularity of size entity expressions. Our preprocessing of the size information in the training dataset shows that most size entities are represented in particular formats such as &#x0201C;5x4x3CM&#x0201D;, &#x0201C;2.0X2CM&#x0201D;, &#x0201C;5CM*5MM&#x0201D;, &#x0201C;&#x04E0D;&#x08DB3; (less than) 5CM&#x0201D;, etc. Therefore, a set of regular expressions is employed to filter out irrelevant sentences: only sentences containing one of the target expression formats are kept as candidate sentences.</p>
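<p>As a hedged illustration of Step 1, the sketch below keeps only sentences that contain one of the size formats listed above. The pattern <monospace>SIZE_PATTERN</monospace> is an assumption written for this example to cover those four formats; it is not the set of regular expressions used in this work.</p>
<preformat>
```python
import re

# One measurement: a number, optional decimals, and an optional unit (CM, MM, DM).
NUMBER_WITH_UNIT = r"\d+(?:\.\d+)?\s*(?:[CcDdMm][Mm])?"

# Illustrative pattern (an assumption, not the authors' expressions): two or
# more measurements joined by x, X, * or the multiplication sign, or a single
# measurement with a unit introduced by "不足" (less than).
SIZE_PATTERN = re.compile(
    rf"{NUMBER_WITH_UNIT}(?:\s*[xX*×]\s*{NUMBER_WITH_UNIT})+"
    r"|不足\s*\d+(?:\.\d+)?\s*[CcDdMm][Mm]"
)


def candidate_sentences(sentences):
    """Keep the sentences that contain at least one size expression."""
    return [s for s in sentences if SIZE_PATTERN.search(s)]
```
</preformat>
<p>A sentence with a bare number and no multiplication sign or unit is intentionally not matched, so that counts and ordinal mentions do not become size candidates.</p>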
<p><bold>Step 2: Excluding sentences which are not related to primary tumors</bold>. In most cases, the primary tumor site entities obtained in subsection 3.1 and the primary tumor size expressions appear in the same sentence. In addition, the size expressions correlate highly with the appearance of certain indicating words, e.g., &#x0201C;&#x09AD8;&#x05BC6;&#x05EA6;&#x05F71;&#x0201D; (high-density shadow), &#x0201C;&#x04F4E;&#x05BC6;&#x05EA6;&#x05F71;&#x0201D; (low-density shadow), &#x0201C;&#x04E0D;&#x089C4;&#x05219;&#x056E2;&#x05757;&#x0201D; (irregular conglomeration), etc. Therefore, similar to the entity extraction above, the method first adopts a keyword-matching mechanism to filter out irrelevant candidate sentences obtained in Step 1. Some tumor site expressions share the same meaning, such as &#x0201C;&#x05DE6;&#x080BA;&#x095E8;&#x0201D; (left hilum of lung) and &#x0201C;&#x05DE6;&#x04FA7;&#x080BA;&#x095E8;&#x0201D; (left hilum of lung), as well as &#x0201C;&#x053F3;&#x04E73;&#x0817A;&#x0201D; (right breast) and &#x0201C;&#x053F3;&#x04FA7;&#x04E73;&#x0817A;&#x0201D; (right breast). Therefore, if the primary tumor site obtained in subsection 3.1 is &#x0201C;&#x05DE6;&#x080BA;&#x095E8;&#x0201D; (left hilum of lung), &#x0201C;&#x05DE6;&#x04FA7;&#x080BA;&#x095E8;&#x0201D; (left hilum of lung) should also be regarded as a primary site. The method thus addresses term normalization by expanding specific primary site entities with their variants.</p>
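<p>The variant expansion and keyword filtering of Step 2 might be sketched as follows. The expansion rule (adding or dropping &#x0201C;&#x04FA7;&#x0201D; after a left/right prefix) is inferred from the two examples above, and combining the site check and the indicator check with a logical OR is an assumption of this sketch; the function names are hypothetical.</p>
<preformat>
```python
# Indicating words quoted in the text.
SIZE_INDICATORS = ("高密度影", "低密度影", "不规则团块")


def site_variants(site):
    """Return the site plus variants that add or drop "侧" after 左/右."""
    variants = {site}
    for side in ("左", "右"):
        if site.startswith(side + "侧"):
            variants.add(side + site[2:])
        elif site.startswith(side):
            variants.add(side + "侧" + site[1:])
    return variants


def related_to_primary(sentence, primary_site):
    """Assumed rule: keep a sentence that names the site or an indicator."""
    has_site = any(v in sentence for v in site_variants(primary_site))
    has_indicator = any(k in sentence for k in SIZE_INDICATORS)
    return has_site or has_indicator
```
</preformat>
<p>With this expansion, a candidate sentence written with &#x0201C;&#x05DE6;&#x04FA7;&#x080BA;&#x095E8;&#x0201D; is still linked to a primary site that was extracted as &#x0201C;&#x05DE6;&#x080BA;&#x095E8;&#x0201D;.</p>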
<p><bold>Step 3: Extracting target size mentions with regular expressions</bold>. After Step 1 and Step 2, we can extract primary tumor size expressions from the remaining sentences by directly using regular expressions. For example, consider the sentence &#x0201C;&#x05DE6;&#x080BA;&#x04E0A;&#x053F6;&#x0793A;&#x04E00;&#x04E0D;&#x089C4;&#x05219;&#x08F6F;&#x07EC4;&#x07EC7;&#x05BC6;&#x05EA6;&#x07076;,&#x05927;&#x05C0F;&#x07EA6;1.3CM &#x000D7; 1.7CM,&#x08FB9;&#x07F18;&#x05206;&#x053F6;,&#x053EF;&#x089C1;&#x05F3A;&#x05316;&#x03002;&#x0201D; from a piece of text in the training dataset. The sentence contains a size mention, which can be extracted through a pre-defined regular expression &#x0201C;<italic>\d&#x0002B;(\.\d&#x0002B;){0,1}[CcDdMm]{0,1}[Mm]{0,1}[*</italic>&#x000D7;<italic>X]\d&#x0002B;(\.\d&#x0002B;){0,1}[CcDdMm]{0,1}[Mm]{0,1}</italic>.&#x0201D; Thus, the target size expression &#x0201C;1.3CMx1.7CM&#x0201D; can be extracted by matching the regular expression to the text.</p></sec>
<sec>
<title>3.3. Recognition of Metastatic Sites of Tumors</title>
<p>The purpose of this sub-task is the extraction of metastatic sites of cancer from diagnostic conclusion text. Metastatic sites of tumors are human anatomic positions that are highly correlated with keywords indicating the existence of metastatic cancer, such as &#x0201C;&#x08F6C;&#x079FB;&#x0201D; (metastasis), &#x0201C;&#x08003;&#x08651;&#x08F6C;&#x079FB;&#x0201D; (considering as metastasis) and &#x0201C;&#x0591A;&#x053D1;&#x08F6C;&#x079FB;&#x0201D; (multiple metastasis). Body part descriptions related to such keywords in the same context are considered the metastatic sites of tumors. Based on a set of anatomic positions collected from &#x0201C;Chinese Terms of Human Anatomy&#x0201D; (Human Anatomy and Histology Terminology Committee, <xref ref-type="bibr" rid="B15">2013</xref>) and &#x0201C;Color Atlas of Human Anatomy&#x0201D; (Xingheng Liu, <xref ref-type="bibr" rid="B27">2007</xref>), the identification of metastatic sites of tumors consists of three steps: text preprocessing, key sentence acquisition and missing entity completion.</p>
<p><bold>Step 1: Text preprocessing</bold>. In diagnostic conclusion text, human anatomic positions are difficult to identify since most of them are described in terms of symptoms rather than stated as plain locations. For example, &#x0201C;&#x07EB5;&#x09694;&#x05185;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (mediastinal lymph node) is usually presented as &#x0201C;&#x07EB5;&#x09694;&#x05185;&#x0591A;&#x053D1;&#x080BF;&#x05927;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (mediastinal multiple enlarged lymph node) or &#x0201C;&#x07EB5;&#x09694;&#x05185;&#x05C0F;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (mediastinal small lymph node). Therefore, our method removes this type of descriptive wording from the raw metastatic entity expressions. For instance, the descriptive words &#x0201C;&#x0591A;&#x053D1;&#x080BF;&#x05927;&#x0201D; (multiple enlarged) are removed from the phrase &#x0201C;&#x07EB5;&#x09694;&#x05185;&#x0591A;&#x053D1;&#x080BF;&#x05927;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D;.</p>
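<p>A minimal sketch of this preprocessing is given below. The modifier list <monospace>DESCRIPTIVE_WORDS</monospace> is hypothetical, and restricting removal to modifiers that directly precede a head noun is an assumption made here to avoid damaging anatomical terms; the rules actually used may differ.</p>
<preformat>
```python
import re

# Hypothetical list of descriptive modifiers; the longest entry comes first so
# that "多发肿大" is tried as a whole before its shorter parts.
DESCRIPTIVE_WORDS = ["多发肿大", "多发", "肿大", "小"]


def strip_descriptions(phrase, head="淋巴结"):
    """Remove modifiers, but only when they directly precede the head noun."""
    modifiers = "|".join(DESCRIPTIVE_WORDS)
    # The lookahead keeps the head noun in place while the modifiers are dropped.
    return re.sub(rf"(?:{modifiers})+(?={head})", "", phrase)
```
</preformat>
<p>Because of the lookahead, the character &#x0201C;&#x05C0F;&#x0201D; is removed from &#x0201C;&#x07EB5;&#x09694;&#x05185;&#x05C0F;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; but is left untouched in a term where it belongs to the anatomical name itself.</p>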
<p><bold>Step 2: Key sentence acquisition</bold>. Considering the meaning of metastatic sites of tumor, the metastatic site entities and indicating keywords like &#x0201C;&#x08F6C;&#x079FB;&#x0201D; (metastasis) and &#x0201C;&#x0591A;&#x053D1;&#x08F6C;&#x079FB;&#x0201D; (multiple metastasis) frequently exist in the same sentence. Hence, the method selects relevant sentences based on this co-occurrence phenomenon.</p>
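<p>The co-occurrence filter of Step 2 can be sketched in a few lines. The sentence delimiters and the function name are assumptions of this example, and the sketch does not handle negated mentions of metastasis, which a practical rule set would need to consider.</p>
<preformat>
```python
import re

# Keywords quoted in the text; the longer ones all contain "转移".
METASTASIS_KEYWORDS = ("转移", "考虑转移", "多发转移")


def key_sentences(text):
    """Return the sentences that mention a metastasis keyword."""
    sentences = [s for s in re.split(r"[。；;\n]", text) if s.strip()]
    return [s for s in sentences if any(k in s for k in METASTASIS_KEYWORDS)]
```
</preformat>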
<p><bold>Step 3: Missing entity completion</bold>. In many cases, when multiple entities share the same suffix, the suffix is mentioned only once. For example, both &#x0201C;&#x080BA;&#x095E8;&#x053CA;&#x0524D;&#x07EB5;&#x09694;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (pulmonary lymph node and anterior mediastinal lymph node) and &#x0201C;&#x080BA;&#x095E8;&#x03001;&#x0524D;&#x07EB5;&#x09694;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (pulmonary lymph node, anterior mediastinal lymph node) indicate pulmonary lymph node and anterior mediastinal lymph node. Hence, it is incorrect to simply split entities based on conjunctive symbols. Because of this phenomenon, a set of rules is applied to identify metastatic entities when conjunctive symbols like &#x0201C;&#x03001;&#x0201D; (and), &#x0201C;&#x053CA;&#x0201D; (and) or &#x0201C;&#x04E0E;&#x0201D; (and) appear in key sentences. For example, &#x0201C;&#x080BA;&#x095E8;&#x053CA;&#x0524D;&#x07EB5;&#x09694;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; is completed as two entities &#x0201C;&#x080BA;&#x095E8;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (pulmonary lymph node) and &#x0201C;&#x0524D;&#x07EB5;&#x09694;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (anterior mediastinal lymph node).</p></sec></sec>
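<p>The suffix completion described for metastatic site entities can be illustrated with the following sketch. It implements a single, simplified rule (copying a suffix carried by the last conjunct to the other conjuncts); the suffix list <monospace>SUFFIXES</monospace> and the function name are hypothetical, and the full rule set used in this work is not reproduced here.</p>
<preformat>
```python
import re

# Hypothetical list of suffixes (head nouns) that conjuncts may share.
SUFFIXES = ("淋巴结",)
# Conjunctive symbols quoted in the text.
CONJUNCTIONS = r"[、及与]"


def complete_entities(phrase):
    """Split a coordinated phrase and restore a suffix shared by its parts."""
    parts = [p for p in re.split(CONJUNCTIONS, phrase) if p]
    if not parts:
        return []
    # Only the last conjunct carries the suffix in the examples above.
    shared = next((s for s in SUFFIXES if parts[-1].endswith(s)), None)
    if shared is None:
        return parts
    return [p if p.endswith(shared) else p + shared for p in parts]
```
</preformat>
<p>This single rule would over-generate when an organ that is itself a complete entity is coordinated with a lymph node, which is one reason a dictionary of anatomic positions is needed alongside such rules.</p>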
<sec id="s4">
<title>4. Results and Discussion</title>
<sec>
<title>4.1. Datasets</title>
<p>The standard experimental dataset from the CHIP 2018 challenge consists of two sub-datasets: a training dataset and a testing dataset. The training dataset contains 600 pieces of diagnosis text. For each text, the primary tumor sites, primary tumor sizes and metastatic sites are labeled by human annotators from data providers. The 600 pieces of diagnosis text contain 8,401 sentences (14.00 sentences per text on average) in total. A total of 614 primary sites of tumor entities (1.02 entities per text on average), 360 primary tumor size entities (0.60 entities per text on average) and 1,478 metastatic sites of cancer entities (2.46 entities per text on average) are annotated. In addition, for primary site of tumor, there are 351 entities (57.17%) about &#x0201C;&#x080BA;&#x0201D; (lung) and 225 entities (36.64%) about &#x0201C;&#x04E73;&#x0201D; (breast). There are 236 different entities (15.97%) among metastatic sites of the training dataset and 1,242 entities (84.03%) that appear more than once.</p>
<p>For the testing dataset, 200 pieces of diagnosis text are provided. These text pieces contain 4,934 sentences (24.67 sentences per text on average) in total. A total of 221 primary sites of tumor entities (1.11 entities per text on average), 134 primary tumor size entities (0.67 entities per text on average) and 731 metastatic sites of cancer entities (3.66 entities per text on average) are annotated. In addition, for primary site of tumor, there are 122 entities (55.20%) about &#x0201C;&#x080BA;&#x0201D; (lung) and 76 entities (34.34%) about &#x0201C;&#x04E73;&#x0201D; (breast). There are 286 different entities (39.12%) among metastatic sites of the testing dataset and 445 entities (60.88%) appearing more than once. A summary of the training and testing datasets is reported in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>The data summary of the training and testing datasets.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td valign="top" align="left">Training dataset</td>
<td valign="top" align="left">Language</td>
<td valign="top" align="left">&#x00023;words</td>
<td valign="top" align="left">&#x00023;sentences</td>
<td valign="top" align="left">&#x00023;primary sites</td>
<td valign="top" align="left">&#x00023;pri.tumor sizes</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Chinese</td>
<td valign="top" align="left">210,073</td>
<td valign="top" align="left">8,401</td>
<td valign="top" align="left">614</td>
<td valign="top" align="left">360</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">&#x00023;texts</td>
<td valign="top" align="left">&#x00023;ave.words/text</td>
<td valign="top" align="left">&#x00023;ave.sen./text</td>
<td valign="top" align="left">&#x00023;ave.pri.sites/text</td>
<td valign="top" align="left">&#x00023;ave.pri.tumor sizes/text</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">600</td>
<td valign="top" align="left">350.12</td>
<td valign="top" align="left">14.00</td>
<td valign="top" align="left">1.02</td>
<td valign="top" align="left">0.60</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">&#x00023;metastatic sites</td>
<td valign="top" align="left">&#x00023;pri.sites about lung</td>
<td valign="top" align="left">&#x00023;pri.sites about breast</td>
<td valign="top" align="left">&#x00023;unique meta.sites</td>
<td valign="top" align="left">&#x00023;overlapping meta.sites</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">1,478</td>
<td valign="top" align="left">351</td>
<td valign="top" align="left">225</td>
<td valign="top" align="left">236</td>
<td valign="top" align="left">1,242</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">&#x00023;ave.meta.sites/text</td>
<td valign="top" align="left">%lung/all pri.sites</td>
<td valign="top" align="left">%breast/all pri.sites</td>
<td valign="top" align="left">%uni.meta.sites/all meta.sites</td>
<td valign="top" align="left">%over.meta.sites/all meta.sites</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">2.46</td>
<td valign="top" align="left">57.17%</td>
<td valign="top" align="left">36.64%</td>
<td valign="top" align="left">15.97%</td>
<td valign="top" align="left">84.03%</td>
</tr>
<tr>
<td valign="top" align="left">Testing dataset</td>
<td valign="top" align="left">Language</td>
<td valign="top" align="left">&#x00023;words</td>
<td valign="top" align="left">&#x00023;sentences</td>
<td valign="top" align="left">&#x00023;primary sites</td>
<td valign="top" align="left">&#x00023;pri.tumor sizes</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Chinese</td>
<td valign="top" align="left">129,100</td>
<td valign="top" align="left">4,934</td>
<td valign="top" align="left">221</td>
<td valign="top" align="left">134</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">&#x00023;texts</td>
<td valign="top" align="left">&#x00023;ave.words/text</td>
<td valign="top" align="left">&#x00023;ave.sen./text</td>
<td valign="top" align="left">&#x00023;ave.pri.sites/text</td>
<td valign="top" align="left">&#x00023;ave.pri.tumor sizes/text</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">200</td>
<td valign="top" align="left">645.50</td>
<td valign="top" align="left">24.67</td>
<td valign="top" align="left">1.11</td>
<td valign="top" align="left">0.67</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">&#x00023;metastatic sites</td>
<td valign="top" align="left">&#x00023;pri.sites about lung</td>
<td valign="top" align="left">&#x00023;pri.sites about breast</td>
<td valign="top" align="left">&#x00023;unique meta.sites</td>
<td valign="top" align="left">&#x00023;overlapping meta.sites</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">731</td>
<td valign="top" align="left">122</td>
<td valign="top" align="left">76</td>
<td valign="top" align="left">286</td>
<td valign="top" align="left">445</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">&#x00023;ave.meta.sites/text</td>
<td valign="top" align="left">%lung/all pri.sites</td>
<td valign="top" align="left">%breast/all pri.sites</td>
<td valign="top" align="left">%uni.meta.sites/all meta.sites</td>
<td valign="top" align="left">%over.meta.sites/all meta.sites</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">3.66</td>
<td valign="top" align="left">55.20%</td>
<td valign="top" align="left">34.34%</td>
<td valign="top" align="left">39.12%</td>
<td valign="top" align="left">60.88%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>4.2. Evaluation Metrics</title>
<p>To evaluate the proposed method, we applied three widely used metrics: precision, recall and F1 score. As in any information extraction task, each extraction decision falls into one of four cases: information that should be extracted is correctly extracted (true positive, TP); information that should not be extracted is wrongly extracted (false positive, FP); information that should be extracted is missed (false negative, FN); and information that should not be extracted is not extracted (true negative, TN). Based on these four cases, precision, recall and F1 score are defined as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi></mml:mtd><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mtd><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>F</mml:mi><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>R</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
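<p>As a minimal illustration of Equations (1) to (3), the following Python sketch computes the three metrics from raw counts. The function name prf1 and the example counts are hypothetical and are not taken from the challenge data; reporting 0.0 when a denominator is zero is an assumption of this sketch rather than a rule stated by the task.</p>
<preformat>
```python
def prf1(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts.

    tp, fp and fn are the numbers of true positives, false positives and
    false negatives. True negatives are not used by any of the three metrics.
    """
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) != 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) != 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f = prf1(tp=80, fp=20, fn=40)
print(f"precision={p:.4f} recall={r:.4f} F1={f:.4f}")
# prints: precision=0.8000 recall=0.6667 F1=0.7273
```
</preformat>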
<p>We calculate precision, recall and F1 score for each of the three extraction subtasks separately, as well as overall values for the three subtasks as a whole. Because the F1 score is the harmonic mean of precision and recall, it reflects the effectiveness of the method more fully than either metric alone, and we therefore use it as the major evaluation criterion. In the calculation of the overall weighted metrics, the weight of primary location is 0.2, the weight of lesion size is 0.3, and the weight of metastatic site is 0.5, in accordance with the task description.</p></sec>
<sec>
<title>4.3. Results</title>
<p>The results of our method on the testing dataset are presented in <xref ref-type="table" rid="T2">Table 2</xref>. Our method achieves an F1 score greater than 0.7 on all three subtasks. Primary tumor size extraction performs best, with an F1 score of 0.82, and metastatic site extraction performs worst, with an F1 score of 0.74. The overall unweighted F1 score and the overall weighted F1 score both exceed 0.75.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Results on the testing dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Entity identification tasks</bold></th>
<th valign="top" align="center"><bold>Precision</bold></th>
<th valign="top" align="center"><bold>Recall</bold></th>
<th valign="top" align="center"><bold>F1</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">primary sites of tumor</td>
<td valign="top" align="center">0.8250</td>
<td valign="top" align="center">0.7466</td>
<td valign="top" align="center">0.7838</td>
</tr>
<tr>
<td valign="top" align="left">primary tumor sizes</td>
<td valign="top" align="center">0.8301</td>
<td valign="top" align="center">0.8143</td>
<td valign="top" align="center">0.8221</td>
</tr>
<tr>
<td valign="top" align="left">metastatic sites</td>
<td valign="top" align="center">0.8241</td>
<td valign="top" align="center">0.6712</td>
<td valign="top" align="center">0.7399</td>
</tr>
<tr>
<td valign="top" align="left">overall</td>
<td valign="top" align="center">0.8255</td>
<td valign="top" align="center">0.7113</td>
<td valign="top" align="center">0.7642</td>
</tr>
<tr>
<td valign="top" align="left">overall weighted</td>
<td valign="top" align="center">0.8251</td>
<td valign="top" align="center">0.6973</td>
<td valign="top" align="center">0.7558</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>On the testing dataset, primary site and metastatic site extraction do not perform as well as primary tumor size extraction. The major reason may be that the human anatomic positions in the training dataset differ considerably from those in the testing dataset, and that the human anatomic position dictionary does not have enough coverage. Tumor size extraction is only slightly affected by data coverage. Overall, the final performance of this method remains comparatively high (Top 5) among all 92 participants in this challenge.</p>
<p>To assess robustness, we ran our method on subsets of different sizes randomly sampled from the testing dataset. Ten rounds of testing were conducted, with the number of test texts increasing from 20 to 200 in steps of 20. In each round, the method was run 50 times and the average performance was calculated and recorded. Finally, the average overall weighted precision, recall and F1 score of each round were used as the criteria for assessing the robustness of the method.</p>
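<p>The sampling protocol can be sketched in Python as follows. This is a simplified illustration, not the evaluation script used in the challenge: the names micro_f1 and robustness_curve are hypothetical, the per-text counts are synthetic, sampling without replacement is an assumption, and a single pooled F1 score stands in for the overall weighted precision, recall and F1 score.</p>
<preformat>
```python
import random
from statistics import mean


def micro_f1(counts):
    """F1 score pooled over a list of per-text (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn  # algebraically equivalent to Equation (3)
    return 2 * tp / denom if denom != 0 else 0.0


def robustness_curve(per_text_counts, sizes=range(20, 201, 20), runs=50, seed=0):
    """Average F1 over `runs` random subsets for each subset size."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        # Assumption: each run samples `size` texts without replacement.
        scores = [micro_f1(rng.sample(per_text_counts, size)) for _ in range(runs)]
        curve[size] = mean(scores)
    return curve


# Synthetic per-text counts for 200 texts, purely for illustration.
demo_rng = random.Random(1)
demo_counts = [
    (demo_rng.randint(1, 5), demo_rng.randint(0, 2), demo_rng.randint(0, 2))
    for _ in range(200)
]
curve = robustness_curve(demo_counts)
for size, score in curve.items():
    print(size, round(score, 4))
```
</preformat>
<p>Under these assumptions, every run at size 200 uses the whole set, so the last point of the curve equals the score on the full data, while smaller subsets vary from run to run.</p>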
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, the average overall weighted precision, recall and F1 score of this method fluctuate slightly when the amount of test data is small. As the amount of test data increases, the performance of the method becomes more stable. This indicates that the method is robust across test sets ranging from relatively small to larger sizes.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>The performance of our method using different numbers of testing texts. The average precision, recall and F1 are calculated over 50 runs for each number of texts.</p></caption>
<graphic xlink:href="frai-02-00001-g0002.tif"/>
</fig></sec>
<sec>
<title>4.4. Error Analysis</title>
<p>To understand the causes of the errors in the results, we analyzed all the error cases and summarized them by subtask.</p>
<p>For the primary tumor site identification task, our method fails to identify some primary sites, which may be caused by the following conditions:</p>
<list list-type="order">
<list-item><p>When more than one primary tumor site exists in the same phrase context, the method extracts one of the primary tumor sites and misses the other. For example, &#x0201C;&#x053F3;&#x04E73;&#x0708E;&#x06027;&#x04E73;&#x0764C;&#x0672F;&#x0540E;&#x022EF;&#x022EF;&#x02009;; &#x053F3;&#x080BA;&#x04E2D;&#x053F6;&#x078E8;&#x073BB;&#x07483;&#x07ED3;&#x08282;&#x04F34;&#x08F7B;&#x05EA6;&#x07CD6;&#x04EE3;&#x08C22;&#x0589E;&#x09AD8;,&#x08003;&#x08651;&#x04E3A;&#x0539F;&#x053D1;&#x06027;MT&#x053EF;&#x080FD;&#x0201D; (after surgery for inflammatory breast cancer of the right breast&#x02026;; ground-glass nodules in the middle lobe of the right lung were accompanied by a mild increase in glucose metabolism, considered a possible primary MT). The method extracts &#x0201C;&#x053F3;&#x04E73;&#x0201D; (right breast), the first primary site of the tumor in this text, but fails to extract the second primary site, &#x0201C;&#x053F3;&#x080BA;&#x04E2D;&#x053F6;&#x0201D; (middle lobe of right lung).</p></list-item>
<list-item><p>Long names of human anatomic positions may lead to incomplete extractions of primary tumor sites. In the example &#x0201C;&#x080F8;&#x04E0B;&#x06BB5;&#x098DF;&#x07BA1;&#x07BA1;&#x058C1;&#x0589E;&#x0539A;,&#x07BA1;&#x08154;&#x072ED;&#x07A84;,&#x04EE3;&#x08C22;&#x05F02;&#x05E38;&#x06D3B;&#x08DC3;,&#x07B26;&#x05408;&#x098DF;&#x07BA1;&#x0764C;&#x08868;&#x073B0;&#x0201D; (the wall of the lower thoracic esophagus was thickened, the lumen was narrowed, and the metabolism was abnormally active, consistent with the appearance of esophageal carcinoma), the primary site of tumor should be &#x0201C;&#x080F8;&#x04E0B;&#x06BB5;&#x098DF;&#x07BA1;&#x0201D; (lower thoracic esophagus). However, our method extracts only the partial name &#x0201C;&#x098DF;&#x07BA1;&#x0201D; (esophagus) as the primary site.</p></list-item>
<list-item><p>Unknown names of human anatomic positions may lead to incorrect extraction. Certain human anatomic positions, such as &#x0201C;&#x080C3;&#x04F53;&#x090E8;&#x0201D; (gastric body) and &#x0201C;&#x080F0;&#x05C3E;&#x0201D; (tail of pancreas), do not appear in the training dataset but appear in the testing dataset. Consequently, our method fails to recognize these unknown names without extending our human anatomic position dictionary.</p></list-item>
</list>
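<p>One possible mitigation for incomplete extraction of long position names is longest-match dictionary lookup, sketched below in Python. This is an illustrative sketch and not the matching procedure of the method described in this paper; the function names and the two-entry lexicon are hypothetical, and the sketch assumes that the long name is present in the dictionary.</p>
<preformat>
```python
def longest_match(text, start, lexicon):
    """Return the longest lexicon entry that begins at text[start], or None."""
    max_len = max(map(len, lexicon), default=0)
    # Try the longest candidate first so that a long name wins over its suffix.
    for length in range(min(max_len, len(text) - start), 0, -1):
        candidate = text[start:start + length]
        if candidate in lexicon:
            return candidate
    return None


def find_positions(text, lexicon):
    """Scan left to right, keeping only the longest match at each position."""
    found, i = [], 0
    while i != len(text):
        match = longest_match(text, i, lexicon)
        if match is None:
            i += 1
        else:
            found.append(match)
            i += len(match)  # skip past the match so its parts are not re-matched
    return found


# Hypothetical two-entry lexicon: "esophagus" and "lower thoracic esophagus".
lexicon = {"食管", "胸下段食管"}
report = "胸下段食管管壁增厚,管腔狭窄,代谢异常活跃,符合食管癌表现"
print(find_positions(report, lexicon))
# prints: ['胸下段食管', '食管']
```
</preformat>
<p>In this sketch the first match is the full name of the lower thoracic esophagus, and the shorter entry is matched only later, inside the disease name at the end of the sentence.</p>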
<p>According to the description of the primary tumor size extraction task, the extraction of primary tumor sizes relies on the results of primary tumor site extraction. Therefore, the errors of primary tumor size extraction are mainly caused by the incorrect identification of primary tumor sites in the previous task. In more detail, extraction errors can be divided into the following cases:</p>
<list list-type="order">
<list-item><p>The primary tumor site is wrongly extracted and the text contains no size information for that wrong site, so the primary tumor size expression is incorrectly extracted.</p></list-item>
<list-item><p>The extracted primary tumor size expressions are inconsistent with the facts because the primary tumor sites were wrongly identified.</p></list-item>
<list-item><p>Too few primary tumor sizes are extracted. A diagnostic imaging text may contain more than one primary tumor site, and all of them should be extracted. However, in certain situations, the method extracts only one primary site and misses the size information of the other tumor.</p></list-item>
</list>
<p>For the recognition of metastatic sites, in addition to the incomplete extraction of human anatomic positions, there are several other error cases:</p>
<list list-type="order">
<list-item><p>Compound words that consist of multiple positions of the human body, without indicating words such as &#x0201C;&#x053CA;&#x0201D; (and) or &#x0201C;&#x04E0E;&#x0201D; (and), make it difficult to extract metastatic sites of cancer. The component positions of such compounds should be split apart during the extraction process. For example, two metastatic sites of cancer should be extracted from the phrase &#x0201C;&#x053F3;&#x080BA;&#x095E8;&#x07EB5;&#x09694;&#x0591A;&#x053D1;&#x080BF;&#x05927;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (right hilar mediastinal multiple enlarged lymph node): &#x0201C;&#x053F3;&#x080BA;&#x095E8;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (right hilar lymph node) and &#x0201C;&#x07EB5;&#x09694;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (mediastinal lymph node). However, our method incorrectly extracts them as the single compound &#x0201C;&#x053F3;&#x080BA;&#x095E8;&#x07EB5;&#x09694;&#x06DCB;&#x05DF4;&#x07ED3;&#x0201D; (right hilar mediastinal lymph node).</p></list-item>
<list-item><p>For sentences with double negative expressions, metastatic sites of cancer are not extracted correctly. For example, the double negative &#x0201C;&#x08F6C;&#x079FB;&#x04E0D;&#x09664;&#x05916;&#x0201D; (not excluding metastasis) in the text &#x0201C;&#x05DE6;&#x080BA;&#x0591A;&#x053D1;&#x053EF;&#x07591;&#x05C0F;&#x07ED3;&#x08282;,&#x08F6C;&#x079FB;&#x04E0D;&#x09664;&#x05916;&#x0201D; (multiple suspicious nodules in the left lung, not excluding metastasis) indicates a possible metastasis. Thus, the metastatic site &#x0201C;&#x05DE6;&#x080BA;&#x0201D; (left lung), which appears before &#x0201C;&#x08F6C;&#x079FB;&#x04E0D;&#x09664;&#x05916;&#x0201D; (not excluding metastasis), should be extracted, but our method does not handle this case correctly.</p></list-item>
<list-item><p>Certain human anatomic positions are written as abbreviations, and a number of such abbreviations appear in the text. Without an abbreviation dictionary or a mapping from abbreviations to full names, it is hard for the method to recognize the abbreviations and normalize them to formal concepts correctly. For example, the fourth thoracic vertebra is abbreviated as &#x0201C;T4&#x080F8;&#x0690E;&#x0201D; (T4 thoracic vertebra), but our method fails to recognize it as a whole entity.</p></list-item>
</list>
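<p>Two of the metastatic site error cases can be illustrated with a short Python sketch. It shows one possible handling and is not part of the method evaluated in this paper: the names split_compound, POSITIVE_CUES and suggests_metastasis, the region list and the cue list are all hypothetical, and the sketch assumes that the region names and the shared head noun are already known from a dictionary.</p>
<preformat>
```python
def split_compound(phrase, regions, head):
    """Split a phrase that lists several regions before one shared head noun.

    Returns one "region + head" string per region found at the start of the
    phrase, or an empty list if the phrase does not end with the head noun.
    """
    if not phrase.endswith(head):
        return []
    # Longest region names first; empty strings are dropped to avoid looping.
    ordered = sorted((r for r in regions if r), key=len, reverse=True)
    found, i = [], 0
    while True:
        match = next((r for r in ordered if phrase.startswith(r, i)), None)
        if match is None:
            break
        found.append(match + head)
        i += len(match)
    return found


# Hypothetical cue list: "metastasis not excluded" signals a possible metastasis.
POSITIVE_CUES = ("转移不除外",)


def suggests_metastasis(sentence):
    """Return True if the sentence contains a double negative cue for metastasis."""
    return any(cue in sentence for cue in POSITIVE_CUES)


# Regions: right hilum and mediastinum; shared head noun: lymph node.
print(split_compound("右肺门纵隔多发肿大淋巴结", {"右肺门", "纵隔"}, "淋巴结"))
# prints: ['右肺门淋巴结', '纵隔淋巴结']
print(suggests_metastasis("左肺多发可疑小结节,转移不除外"))
# prints: True
```
</preformat>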
</sec></sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusions</title>
<p>In this paper, we proposed a pattern-based method to extract primary tumor sites, primary tumor sizes and metastatic sites from diagnostic imaging text. The method was tested and evaluated on a standard dataset provided by the CHIP 2018 open challenge. The results demonstrate that the method achieves relatively high performance, with an overall weighted F1 score of 0.7558 on the testing dataset. The error cases were analyzed in detail, and further improvement strategies are being designed. This method contributes to extracting certain entities and expressions from unstructured Chinese electronic medical record text.</p></sec>
<sec id="s6">
<title>Ethics Statement</title>
<p>The study is based on publicly available datasets, and ethics approval was not required for the study as per applicable institutional and national guidelines and regulations.</p></sec>
<sec id="s7">
<title>Author Contributions</title>
<p>ZL, JC, and ZX led the method design and experiment implementation. ZL and ZX performed the statistical analysis. ZL, JC, ZX, and YC wrote sections of the manuscript. TH provided theoretical guidance, result review, and paper revision. All authors read and approved the final manuscript.</p>
<sec>
<title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec></sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aronson</surname> <given-names>A. R.</given-names></name></person-group> (<year>2001</year>). <article-title>Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</article-title>, in <source>Proceedings of the AMIA Symposium</source> (<publisher-loc>Washington, DC: American Medical Informatics Association</publisher-loc>), <fpage>17</fpage>.</citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bodenreider</surname> <given-names>O.</given-names></name></person-group> (<year>2004</year>). <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>. <source>Nucleic Acids Res.</source> <volume>32</volume>(<supplement>Suppl_1</supplement>):<fpage>D267</fpage>&#x02013;<lpage>D270</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkh061</pub-id><pub-id pub-id-type="pmid">14681409</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buzhou Tang</surname></name> <name><surname>Qingcai Chen</surname> <given-names>J. Z. L. W.</given-names></name></person-group> (<year>2018a</year>). <article-title>Brief for CHIP shared task</article-title>.</citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buzhou Tang</surname></name> <name><surname>Qingcai Chen</surname> <given-names>J. Z. L. W.</given-names></name></person-group> (<year>2018b</year>). <article-title>Manual for structuralizing medical imaging examination results</article-title>.</citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>D.</given-names></name> <name><surname>Fisch</surname> <given-names>A.</given-names></name> <name><surname>Weston</surname> <given-names>J.</given-names></name> <name><surname>Bordes</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Reading wikipedia to answer open-domain questions</article-title>, in <source>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</source>, Vol. <volume>1</volume> (<publisher-loc>Vancouver, BC</publisher-loc>). <pub-id pub-id-type="doi">10.18653/v1/P17-1171</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chiu</surname> <given-names>J. P.</given-names></name> <name><surname>Nichols</surname> <given-names>E.</given-names></name></person-group> (<year>2016</year>). <article-title>Named entity recognition with bidirectional LSTM-CNNs</article-title>. <source>Trans. Assoc. Comput. Linguist.</source> <volume>4</volume>, <fpage>357</fpage>&#x02013;<lpage>370</lpage>. <pub-id pub-id-type="doi">10.1162/tacl_a_00104</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cho</surname> <given-names>H.</given-names></name> <name><surname>Choi</surname> <given-names>W.</given-names></name> <name><surname>Lee</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>A method for named entity normalization in biomedical articles: application to diseases and plants</article-title>. <source>BMC Bioinform.</source> <volume>18</volume>:<fpage>451</fpage>. <pub-id pub-id-type="doi">10.1186/s12859-017-1857-8</pub-id><pub-id pub-id-type="pmid">29029598</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chung</surname> <given-names>J.</given-names></name> <name><surname>Gulcehre</surname> <given-names>C.</given-names></name> <name><surname>Cho</surname> <given-names>K.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2014</year>). <article-title>Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling</article-title>, in <source>IEEE International Conference on Rehabilitation Robotics</source>, (<publisher-loc>Aberdeen</publisher-loc>) <fpage>119</fpage>&#x02013;<lpage>124</lpage>.</citation></ref>
<ref id="B9">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Daum&#x000E9;</surname> <given-names>H.</given-names></name></person-group> (<year>2009</year>). <source>Frustratingly easy domain adaptation. <italic>CoRR. arXiv [Preprint]. arXiv</italic>:0907.1815</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/0907.1815">http://arxiv.org/abs/0907.1815</ext-link>.</citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Daum&#x000E9;</surname> <given-names>H.</given-names></name> <name><surname>Marcu</surname> <given-names>D.</given-names></name></person-group> (<year>2006</year>). <article-title>Domain adaptation for statistical classifiers</article-title>. <source>J. Artif. Intell. Res.</source> <volume>26</volume>, <fpage>101</fpage>&#x02013;<lpage>126</lpage>. <pub-id pub-id-type="doi">10.1613/jair.1872</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Demner-Fushman</surname> <given-names>D.</given-names></name> <name><surname>Chapman</surname> <given-names>W. W.</given-names></name> <name><surname>McDonald</surname> <given-names>C. J.</given-names></name></person-group> (<year>2009</year>). <article-title>What can natural language processing do for clinical decision support?</article-title> <source>J. Biomed. Inform.</source> <volume>42</volume>, <fpage>760</fpage>&#x02013;<lpage>772</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbi.2009.08.007</pub-id><pub-id pub-id-type="pmid">19683066</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dogan</surname> <given-names>R. I.</given-names></name> <name><surname>Lu</surname> <given-names>Z.</given-names></name></person-group> (<year>2012</year>). <article-title>An inference method for disease name normalization</article-title>, in <source>AAAI Fall Symposium: Westin Arlington Gateway in Arlington</source> (<publisher-loc>Arlington, VA</publisher-loc>), <fpage>8</fpage>&#x02013;<lpage>13</lpage>.</citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gridach</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Character-level neural network for biomedical named entity recognition</article-title>. <source>J. Biomed. Inform.</source> <volume>70</volume>, <fpage>85</fpage>&#x02013;<lpage>91</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbi.2017.05.002</pub-id><pub-id pub-id-type="pmid">28502909</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>Z.</given-names></name> <name><surname>Xu</surname> <given-names>W.</given-names></name> <name><surname>Yu</surname> <given-names>K.</given-names></name></person-group> (<year>2015</year>). <source>Bidirectional LSTM-CRF models for sequence tagging. <italic>CoRR. arXiv [Preprint]. arXiv</italic>:1508.01991</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1508.01991">http://arxiv.org/abs/1508.01991</ext-link>.</citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><collab>Human Anatomy and Histology Terminology Committee</collab></person-group> (<year>2013</year>). <source>Chinese Terms of Human Anatomy, 2nd edn</source>. &#x079D1;&#x05B66;&#x051FA;&#x07248;&#x0793E; (<publisher-loc>Beijing</publisher-loc>: <publisher-name>Science Press</publisher-name>).</citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jindal</surname> <given-names>P.</given-names></name> <name><surname>Roth</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). <article-title>Using soft constraints in joint inference for clinical concept recognition</article-title>, in <source>Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</source>, (<publisher-loc>Seattle, WA</publisher-loc>) <fpage>1808</fpage>&#x02013;<lpage>1814</lpage>.</citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>Y.-B.</given-names></name> <name><surname>Stratos</surname> <given-names>K.</given-names></name> <name><surname>Sarikaya</surname> <given-names>R.</given-names></name></person-group> (<year>2016</year>). <article-title>Frustratingly easy neural domain adaptation</article-title>, in <source>Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers</source> (<publisher-loc>Osaka, JP</publisher-loc>) <fpage>387</fpage>&#x02013;<lpage>396</lpage>.</citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lafferty</surname> <given-names>J.</given-names></name> <name><surname>McCallum</surname> <given-names>A.</given-names></name> <name><surname>Pereira</surname> <given-names>F. C.</given-names></name></person-group> (<year>2001</year>). <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>, in <source>ICML &#x00027;01 Proceedings of the Eighteenth International Conference on Machine Learning</source> (<publisher-loc>San Francisco, CA: Morgan Kaufmann Publishers Inc.</publisher-loc>), <fpage>282</fpage>&#x02013;<lpage>289</lpage>.</citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lipscomb</surname> <given-names>C. E.</given-names></name></person-group> (<year>2000</year>). <article-title>Medical subject headings (MeSH)</article-title>. <source>Bull. Med. Lib. Assoc.</source> <volume>88</volume>:<fpage>265</fpage>. <pub-id pub-id-type="pmid">10928714</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>X.</given-names></name> <name><surname>Hovy</surname> <given-names>E.</given-names></name></person-group> (<year>2016</year>). <article-title>End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF</article-title>, in <source>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</source> (<publisher-loc>Berlin: Association for Computational Linguistics</publisher-loc>), <fpage>1064</fpage>&#x02013;<lpage>1074</lpage>. <pub-id pub-id-type="doi">10.18653/v1/P16-1101</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McCallum</surname> <given-names>A.</given-names></name> <name><surname>Li</surname> <given-names>W.</given-names></name></person-group> (<year>2003</year>). <article-title>Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons</article-title>, in <source>Seventh Conference on Natural Language Learning</source> (<publisher-loc>Edmonton, AB</publisher-loc>).</citation></ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mikolov</surname> <given-names>T.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Chen</surname> <given-names>K.</given-names></name> <name><surname>Corrado</surname> <given-names>G. S.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2013</year>). <article-title>Distributed representations of words and phrases and their compositionality</article-title>, in <source>Advances in Neural Information Processing Systems</source>, (<publisher-loc>Lake Tahoe</publisher-loc>) <fpage>3111</fpage>&#x02013;<lpage>3119</lpage>.</citation></ref>
<ref id="B23">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Rajpurkar</surname> <given-names>P.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Lopyrev</surname> <given-names>K.</given-names></name> <name><surname>Liang</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <source>SQuAD: 100,000&#x0002B; questions for machine comprehension of text. <italic>CoRR. arXiv [Preprint]. arXiv</italic>:1606.05250</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1606.05250">http://arxiv.org/abs/1606.05250</ext-link>.</citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>R&#x000F6;ssler</surname> <given-names>M.</given-names></name></person-group> (<year>2004</year>). <article-title>Adapting an ner-system for german to the biomedical domain</article-title>, in <source>Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications</source> (<publisher-loc>Association for Computational Linguistics</publisher-loc>), <fpage>92</fpage>&#x02013;<lpage>95</lpage>. <pub-id pub-id-type="doi">10.3115/1567594.1567615</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>P.</given-names></name> <name><surname>Xu</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>B.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>F.</given-names></name> <name><surname>Hao</surname> <given-names>H.</given-names></name></person-group> (<year>2015</year>). <article-title>Semantic clustering and convolutional neural network for short text categorization</article-title>, in <source>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</source>, <fpage>352</fpage>&#x02013;<lpage>357</lpage>. <pub-id pub-id-type="doi">10.3115/v1/P15-2058</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xie</surname> <given-names>W.</given-names></name> <name><surname>Fu</surname> <given-names>S.</given-names></name> <name><surname>Jiang</surname> <given-names>S.</given-names></name> <name><surname>Hao</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <article-title>A CRFs-based approach empowered with word representation features to learning biomedical named entities from medical text</article-title>, in <source>International Symposium on Emerging Technologies for Education</source> (<publisher-loc>Cape Town: Springer</publisher-loc>), <fpage>518</fpage>&#x02013;<lpage>527</lpage>.</citation></ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xingheng Liu</surname> <given-names>T. R.</given-names></name></person-group> (<year>2007</year>). <source>Color Atlas of Human Anatomy</source>. &#x0519B;&#x04E8B;&#x0533B;&#x05B66;&#x051FA;&#x07248;&#x0793E; (<publisher-loc>Beijing</publisher-loc>: <publisher-name>Military Medical Press</publisher-name>).</citation></ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>K.</given-names></name> <name><surname>Zhou</surname> <given-names>Z.</given-names></name> <name><surname>Hao</surname> <given-names>T.</given-names></name> <name><surname>Liu</surname> <given-names>W.</given-names></name></person-group> (<year>2017</year>). <article-title>A bidirectional LSTM and conditional random fields approach to medical named entity recognition</article-title>, in <source>International Conference on Advanced Intelligent Systems and Informatics</source> (<publisher-loc>Cairo: Springer</publisher-loc>), <fpage>355</fpage>&#x02013;<lpage>365</lpage>.</citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>G.</given-names></name> <name><surname>He</surname> <given-names>T.</given-names></name> <name><surname>Zhao</surname> <given-names>J.</given-names></name> <name><surname>Hu</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>). <article-title>Learning continuous word embedding with metadata for question retrieval in community question answering</article-title>, in <source>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source> (<publisher-loc>Beijing</publisher-loc>), <fpage>250</fpage>&#x02013;<lpage>259</lpage>.</citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>G.</given-names></name> <name><surname>Su</surname> <given-names>J.</given-names></name></person-group> (<year>2002</year>). <article-title>Named entity recognition using an HMM-based chunk tagger</article-title>, in <source>Proceedings of the 40th Annual Meeting on Association for Computational Linguistics</source> (<publisher-loc>Philadelphia: Association for Computational Linguistics</publisher-loc>), <fpage>473</fpage>&#x02013;<lpage>480</lpage>.</citation></ref>
</ref-list>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> The work is supported by the National Natural Science Foundation of China (No. 61772146) and the Guangzhou Science Technology and Innovation Commission (No. 201803010063).</p></fn>
</fn-group>
</back>
</article>