A practical application of text mining to literature on cognitive rehabilitation and enhancement through neurostimulation

The exponential growth in publications represents a major challenge for researchers. Many scientific domains, including neuroscience, are not yet fully engaged in exploiting large bodies of publications. In this paper, we promote the idea of partially automating the processing of scientific documents, specifically using text mining (TM), to efficiently review big corpora of publications. The “cognitive advantage” given by TM is mainly related to the automatic extraction of relevant trends from corpora of literature that would otherwise be impossible to analyze in short periods of time. Specifically, the benefits of TM are increased speed, quality and reproducibility of text processing, boosted by rapid updates of the results. First, we selected a set of TM-tools that allow user-friendly approaches to the scientific literature, and which could serve as a guide for researchers willing to incorporate TM in their work. Second, we used these TM-tools to obtain basic insights into the relevant literature on cognitive rehabilitation (CR) and cognitive enhancement (CE) using transcranial magnetic stimulation (TMS). TM readily extracted the diversity of TMS applications in CR and CE from vast corpora of publications, automatically retrieving trends already described in published reviews. TMS emerged as one of the important non-invasive tools that can both improve cognitive and motor functions in numerous neurological diseases and induce modulations/enhancements of many fundamental brain functions. TM also revealed trends in big corpora of publications by extracting the occurrence frequency and relationships of particular subtopics. Moreover, we showed that CR and CE share research topics, both aiming to increase the brain's capacity to process information, thus supporting their integration in a larger perspective. Methodologically, despite the limitations of a simple user-friendly approach, TM served the reviewing process well.


INTRODUCTION
Gathering accurate and reliable information from web repositories has become increasingly complex because of the exponential growth in the number of publications. For example, a PubMed search retrieved 9407 papers including 1172 reviews for TMS in "All Fields," and the ratio became 8186/988 when the filter "[Title/Abstract]" was applied. Reading without some selection criteria becomes challenging. Even selectively focusing on specific topics in a review increases the chances of missing trends shown only by huge bodies of literature. Thus, when processing vast corpora of publications, we are facing challenges that require automated solutions. One of the most promising approaches to alleviate these problems is to assist the human operator with computers running artificial intelligence applications. Here, we selected one of these applications, text mining (TM), and showed that TM enables us to deal efficiently with huge amounts of information from the TMS-related literature.
Our approach was also motivated by the fact that neuroscience has to make efforts to integrate data mining and TM when dealing with huge and diverse experimental datasets (Akil et al., 2011) and text documents. TM is able to capture the complexity of all relevant studies in an efficient manner. Statistical and natural language processing (NLP) procedures to "mine" the literature have been developed to address general big-data problems (Dias et al., 2011). Here, we used a practical approach to promote TM as a tool for the reviewing process. Specifically, we selected a set of TM-tools that allowed user-friendly approaches to reveal relevant outcomes in large corpora of publications. Our intention was to use TM-tools that are not too demanding in terms of programming skills, prior knowledge and training time. Therefore, the example set of TM-tools could serve as an attractive guide map for researchers willing to incorporate TM in their work.
Next, we demonstrate the use of TM-tools in gaining basic insights into the relevant literature on cognitive rehabilitation (CR) and cognitive enhancement (CE) using transcranial magnetic stimulation (TMS). TMS is a valuable non-invasive perturbation method used to address fundamental and clinical neuroscience questions, both in human and animal models. Cognitive rehabilitation (CR-TMS) and cognitive enhancement (CE-TMS) are two important TMS applications. Indeed, TMS is establishing itself as a major tool used in rehabilitation, improving a wide variety of impaired mental functions (Miniussi and Rossini, 2011; Kammer and Spitzer, 2012; Vicario and Nitsche, 2013). Moreover, recent studies reported TMS-induced enhancements of normal brain functions (Brem et al., 2014; Luber and Lisanby, 2014).
The TM application to the CR- and CE-TMS literature was focused on two main aspects. First, we aimed to show that TM could reveal the diversity of TMS applications in CR and CE, automatically retrieving trends already described in published reviews. Second, we specifically aimed to find trends only noticeable in big corpora of publications. The main feature of our findings is given by the statistical power of such analyses. Detailed analyses of the CR- and CE-TMS literature revealed relevant terms in the form of lists, topics and classes of terms associated with specific subtopics. They also showed relations between the relevant terms in the form of co-occurrence maps, groups of relevant terms with high-probability co-occurrences and lists of relevant relational verbs. Moreover, the TM approach revealed conclusive sentences that appeared with a high probability. Finally, a large-scale corpora perspective showed that CR-TMS and CE-TMS share research topics, allowing us to make inferences about their similarities. Although they start from specific states of the brain (impaired for CR and normal for CE), both aim to increase the brain's capacity to process information and to optimize adaptation. This unitary perspective is supported by fields that use TMS in diagnostics (TMS-DIAG) or in clinical and fundamental research (TMS-RES), which show that TMS is effective for changing and studying normal and abnormal brain processes. Accordingly, CR-TMS and CE-TMS also share research topics with these fields, showing that they belong to a larger context, which integrates diagnostic, fundamental research and fMRI studies.

TEXT MINING AS A METHOD TO PARTIALLY AUTOMATE THE REVIEWING OF BIG CORPORA OF PUBLICATIONS
Scholarly journals and data sources are increasingly available in electronic and Open Access form. Nonetheless, availability is not enough to extract specific information, mainly due to the abundance of information. TM addresses this problem by offering automated methods to extract condensed information hidden within huge volumes of publications. TM can be achieved using several complementary approaches (Cohen and Hunter, 2008). Co-occurrence-based methods look for concepts that occur in the same unit of text (sentence/abstract) and posit a relationship between them. Statistical or machine learning systems rely on statistical properties of the text and work by building classifiers that operate on any level, from labeling parts of speech to classifying full sentences or documents. Rule-based systems use knowledge about how language is structured and about how domain-relevant things, facts and their relationships are stated in publications. The main areas/stages of TM are (Lourenco et al., 2009): Information Analysis, which includes Information Retrieval and Information Extraction (e.g., Named Entity Recognition, Relationship Extraction, document classification and summarization); and Information Synthesis, which uses the databases generated by Information Extraction for answering simple questions, discovering new information and generating hypotheses.
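The co-occurrence-based approach described above can be illustrated with a minimal sketch. It assumes a naive sentence splitter and single-word terms; the function name `cooccurrence_counts` and the two example abstracts are hypothetical and are not taken from the tools used in this study.

```python
import itertools
import re
from collections import Counter


def cooccurrence_counts(texts, terms):
    """Count the sentences in which each unordered pair of terms appears together."""
    terms = [t.lower() for t in terms]
    pair_counts = Counter()
    for text in texts:
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
            tokens = set(re.findall(r"[a-z]+", sentence))
            present = sorted(t for t in terms if t in tokens)
            for pair in itertools.combinations(present, 2):
                pair_counts[pair] += 1
    return pair_counts


# Hypothetical abstracts, for illustration only.
abstracts = [
    "rTMS improved depression scores. Treatment was well tolerated.",
    "rTMS treatment of depression is reviewed. Stroke recovery is also discussed.",
]
counts = cooccurrence_counts(abstracts, ["rtms", "depression", "treatment", "stroke"])
for pair, n in counts.most_common():
    print(pair, n)
```

Choosing the sentence as the unit of text gives stricter evidence of a relationship than choosing the whole abstract; switching units only requires skipping the sentence split.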
A wide variety of publications and software tools approach TM at different levels of complexity, hampering the selection of relevant TM-tools. We chose a specific set of TM-tools, aiming to evaluate how they can help the TM non-specialist accelerate the review of big corpora of literature, using the following criteria:
a. Allow a user-friendly approach by selecting TM-tools requiring medium investments in training, programming and specific knowledge.
b. Use free open-source software, selected based on features like: ease of installation; quality of the documentation and support; accessible formats for input and output.
c. Use TM-tools with general functionalities like: support for document corpora; pre-processing of the text; built-in biomedical Named Entity Recognition; queries to a document or corpora; support for ontologies and terminologies.
We used three groups of TM-tools that served different purposes:
I. Basic resources that gave foundation to TM: the MeSH browser; the PubMed repository of publications; repositories of NLP resources (e.g., Neuroscience Information Framework, National Institute of Neurological Disorders and Stroke).
II. TM-tools II (Table 1), which are web-based ready-to-use tools requiring no programming efforts and performing simple TM tasks (Lu, 2011).
III. TM-tools III, which were used in the final stage of the study, optimized for the reviewed topics:
1. Statistical or machine-learning-based approaches: Mallet (McCallum, 2002), Text to Matrix Generator (TMG) (Zeimpekis and Gallopoulos, 2006) and Matlab applications for NLP.
2. TM-tools with predefined NLP processing stream: KH Coder (Higuchi, 2012).
3. Integrated environments for visual programming of NLP: VisualText (Meyers, 2003; Alfred et al., 2014).
4. Biomedical TM rule-based approaches using fully automated stages in text processing: Anote2 (Lourenco et al., 2009) and Biological Research Assistant for TM (BioRAT) (Corney et al., 2004).
We limited the TM application to a few aspects. First, we aimed to show that TM could retrieve the diversity of TMS applications in vast corpora of publications about CR and CE (see Cognitive Rehabilitation and Enhancement Accomplished with TMS), automatically retrieving trends already described in published reviews (see Discussions and Conclusions). Second, we looked for trends noticeable only in big corpora of publications, relying on the statistical power of the analyses and including results like topics' occurrence frequency and relationships, relevant relational verbs, and high-probability conclusive sentences (see Cognitive Rehabilitation and Enhancement Accomplished with TMS). Furthermore, we analyzed large-context relationships between topics, showing how CR- and CE-TMS integrate with diagnostic, fundamental research and fMRI studies (see A General Context). Using TM to efficiently review corpora of publications requires roughly three stages: pre-TM-processing, TM-processing, and post-TM-processing. In this paper, we focused on the TM-processing by showing mainly the "raw" TM results. Accordingly, the seemingly redundant diversity of results is determined by our intention to illustrate a few similar results given by different TM-tools.
We used a multi-stage and multi-tool approach ordered by TM complexity, which was gradually increased, starting with TM-tools II and continuing with TM-tools III. The same analysis was performed with a few TM-tools (Table 2), which can be regarded as alternative solutions for the same problem. This helped us to cope with the limited perspective offered by different TM-tools, to perform comparisons and cross-validations, and to build synthetic results.
The main classes of TM processing on the selected corpora were:
- Statistics about the number of publications, authors, journals; thematic/MeSH headings division of the field; clustering showing the topological maps of the main terms, using their relationships inferred from co-occurrence in the same text units.
- Automatic retrieval of relevant sentences, study of their probability of occurrence, and identification of parts of sentences (e.g., predicates) relevant for deriving conclusions.
- Building a "map of science," which characterizes large-scale relationships between multiple topics. Each topic-topic relationship was evaluated based on common relevant terms retrieved from each corpus, similar thematic clustering of the publications, common authors, journals and publications approaching both topics.
To create a larger context allowing a better understanding of CR- and CE-TMS, we selected topics like TMS, CR, CE, TMS-RES, TMS-DIAG, and TMS-fMRI. Figure 1 presents a "qualitative hypothesis" about the topology of this context and we used TM to test it. Corpora of publications for each topic were retrieved using the following PubMed queries:
- For CR-TMS: ("transcranial magnetic stimulation") AND ("cognitive rehabilitation" OR "rehabilitation" OR "cognitive therapy" OR "therapy" OR "cognitive recovery" OR "recovery" OR "cognitive treatment" OR "treatment" OR "cognitive repair" OR "neurorehabilitation" OR "improvement" OR "decrease").
- For CE-TMS: ("transcranial magnetic stimulation") AND [("cognitive enhancement") OR ("cognitive augmentation") OR ("cognitive improvement") OR ("cognitive enrichment") OR ("cognitive amelioration") OR ("neuroenhancement")].
- For TMS: "transcranial magnetic stimulation".
- Queries for CR/CE include the second operand of the AND operator in the CR-TMS/CE-TMS queries.
All queries were used separately with two filters ("[Title/Abstract]" (TIAB-filter) and "[Title/Abstract] AND Review" (TIABREV-filter)), creating two separate corpora of abstracts: TIAB-corpora (the main target for TM) and TIABREV-corpora (used for comparisons). Empty or less relevant abstracts were removed from the corpora. Corpora were also compared with local databases and missing publications were added manually.
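As an illustration of how such queries could be assembled programmatically, the sketch below builds the CE-TMS query string and a corresponding NCBI E-utilities `esearch` URL without sending any request. How the filters were actually applied in this study is not specified beyond the text above, so attaching `[Title/Abstract]` to every phrase and expressing the review filter as `Review[Publication Type]` are assumptions; the helper names `or_group`, `build_query` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

TMS = ["transcranial magnetic stimulation"]
CE_TERMS = [
    "cognitive enhancement", "cognitive augmentation", "cognitive improvement",
    "cognitive enrichment", "cognitive amelioration", "neuroenhancement",
]


def or_group(phrases, field="Title/Abstract"):
    """Join quoted phrases with OR, tagging each one with a PubMed search field."""
    return "(" + " OR ".join(f'"{p}"[{field}]' for p in phrases) + ")"


def build_query(first, second=None, reviews_only=False):
    """Combine one or two OR-groups with AND; optionally restrict to reviews."""
    query = or_group(first)
    if second:
        query += " AND " + or_group(second)
    if reviews_only:
        # Assumption: the review filter is expressed as a publication-type tag.
        query += " AND Review[Publication Type]"
    return query


def esearch_url(query, retmax=100):
    """Build (but do not fetch) an E-utilities esearch URL for the query."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "pubmed", "term": query, "retmax": retmax})


tiab_query = build_query(TMS, CE_TERMS)
tiabrev_query = build_query(TMS, CE_TERMS, reviews_only=True)
print(tiab_query)
print(esearch_url(tiabrev_query))
```

Keeping the term lists as data makes it straightforward to derive the CR, CE and TMS-only queries from the same building blocks.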
Finally, the TM results were evaluated in a few ways. First of all, we used TM-tools that are already tested and evaluated, building our results on this general basis. Second, we used the post-hoc judgment of the system outputs (Cohen and Hunter, 2008) in a few stages, and compared: results of similar processing (e.g., term extraction) performed with different TM-tools using TIAB-corpora (see A Practical Application of the Text Mining to Literature on Cognitive Rehabilitation and Enhancement Through Neurostimulation); results of similar processing using TIAB-corpora vs. TIABREV-corpora (see A Practical Application of the Text Mining to Literature on Cognitive Rehabilitation and Enhancement Through Neurostimulation); TM results vs. manually curated results (see Discussions and Conclusions).

A GENERAL CONTEXT
We used TM-tools II to determine relationships that put CR- and CE-TMS topics in the same neighborhood on a "map of science." An outline of the main results includes:
2. The number of publications retrieved from PubMed (Table 2) for each topic.
It is noteworthy that a large number of reviews were written for each topic.
3. The number of publications per year evaluating the interest for each topic (Figure 2), which showed the following trends:
- The interest for all topics increased in the last decades;
- TMS generated numerous publications per year;
- CR-TMS has a stronger representation than CE-TMS;
- Across-topics perspectives (
- Fundamental research (e.g., TMS-RES, TMS-DIAG) is less represented than practical applications (e.g., CR-TMS).
4. Co-occurrence matrix for terms used to build the search queries, retrieved with PubMatrix (Figures 3A,B). We made the following observations:
- TMS co-occurred very frequently (publications ∼10^3) with terms like therapy, treatment, brain function, brain physiology, and diagnostic (mainly CR-TMS);
- TMS co-occurred frequently (publications ∼10^2) with rehabilitation, recovery, cognitive treatment, improvement, decrease, brain anatomy, brain performance, brain networks, psychology, brain mapping, MRI, mental disorder, mental disease and psychiatric disorder (mainly CR-TMS and CE-TMS);
7. Finally, comparisons between corpora for different topics showed relevant numbers of common publications (Figure 4), thus emphasizing a unitary context. Consistently, clustering the publications for each topic using Carrot2 also showed different clusters sharing publications.
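The comparison of corpora by common publications can be sketched as a set intersection over publication identifiers. The example below is a minimal illustration, assuming each corpus is available as a set of PubMed IDs; the IDs are invented, and the Jaccard ratio is added here only as one possible normalization, not as a measure reported in this study.

```python
from itertools import combinations


def shared_publications(corpora):
    """For every pair of topics, return (number of common IDs, Jaccard overlap)."""
    result = {}
    for (a, ids_a), (b, ids_b) in combinations(sorted(corpora.items()), 2):
        common = ids_a & ids_b
        union = ids_a | ids_b
        jaccard = len(common) / len(union) if union else 0.0
        result[(a, b)] = (len(common), jaccard)
    return result


# Hypothetical identifiers, for illustration only.
corpora = {
    "CR-TMS": {101, 102, 103, 104},
    "CE-TMS": {103, 104, 105},
    "TMS-fMRI": {104, 106},
}
overlap = shared_publications(corpora)
for pair, (n_common, jaccard) in overlap.items():
    print(pair, n_common, round(jaccard, 2))
```

Raw counts favor large corpora, so a normalized ratio can be easier to compare across topic pairs of very different sizes.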

COGNITIVE REHABILITATION AND ENHANCEMENT ACCOMPLISHED WITH TMS
First, we need to reiterate that the relevance of all the retrieved terms is based on the idea that words co-occurring frequently in the abstracts are related in ways intrinsically constrained by the topic of the abstracts. Moreover, our specific selection of publications guarantees that terms like TMS (all protocols) are present in all abstracts included in the corpora. Thus, TMS is strongly related to all frequent terms retrieved with different TM-tools. For example, if the frequent terms are treatment and depression, this means that TMS is used frequently in depression treatment.
We used TM-tools III to deepen our insight into the CR-TMS and CE-TMS literature, and selected the following groups of results: 1. Statistical or machine-learning-based NLP: a. Topic modeling with Mallet (500 iterations, 7 topics, topics proportion threshold 0.05, with the standard Mallet stop-words removed). Each topic is a set of terms used with high probability by the authors, thus reflecting a specific thinking pattern involving TMS. Without knowing anything about the meaning of the words in a text, topic modeling assumes that any piece of text is written by selecting words from possible topics. Thus, it becomes possible to mathematically decompose a text into the probable topics from which the words originated.
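The decomposition idea can be made concrete with a toy collapsed Gibbs sampler, assuming the latent Dirichlet allocation (LDA) model. This is a didactic sketch, not Mallet's implementation: the function names, the hyperparameters `alpha` and `beta`, and the tokenized example documents are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=500, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens of word v in topic k
    topic_total = [0] * n_topics                           # all tokens in topic k
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            k = rng.randrange(n_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(row)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token, then resample its topic given all other tokens.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return vocab, doc_topic, topic_word


def top_terms(vocab, topic_word, n=3):
    """Return the n most frequent terms of each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Hypothetical pre-tokenized abstracts, for illustration only.
docs = [
    ["rtms", "depression", "treatment", "depression", "antidepressant"],
    ["rtms", "treatment", "depression", "therapy"],
    ["tms", "memory", "performance", "attention", "memory"],
    ["tms", "attention", "performance", "learning"],
]
vocab, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
print(top_terms(vocab, topic_word))
```

The sampler never looks at word meaning: topics emerge only from which words tend to share documents, which is the assumption stated in the paragraph above.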
Example topics separated into CR/CE-related terms (brackets) and TMS-related terms (double brackets): a1. CR-TMS:
b. Term-related queries with the TMG toolbox. Document retrieval often relies on matching terms from documents with those from queries. However, natural languages present some challenges (e.g., polysemy, synonymy) that render term matching inaccurate. TMG uses latent semantic analysis to overcome these problems (Landauer et al., 1998), based on the application of singular value decomposition to a term-by-document matrix (TDM).
We performed the indexing for each corpus, creating new TDMs using the "common-words" stop-list, logarithmic local weights, "GfIdf" global term weights, normalization of terms, and removal of alphanumerics and numbers. For dimensionality reduction and TDM best rank approximation, we selected the dimensionality via the use of profile likelihood (Zhu and Ghodsi, 2006 (Table 3) and CE-TMS (Table 4).
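The weighting scheme named above can be sketched in plain Python. The formulas used here are the textbook variants, stated as assumptions: local weight log2(1 + f_ij), global GfIdf weight gf_i / df_i (total occurrences of term i divided by the number of documents containing it), and cosine normalization of each document column. Whether TMG applies exactly these variants is not stated in the text, and the function name `weighted_tdm` is hypothetical.

```python
import math
from collections import Counter


def weighted_tdm(docs):
    """Build a weighted term-by-document matrix; matrix[i][j] is term i in doc j."""
    counts = [Counter(doc) for doc in docs]
    terms = sorted(set().union(*docs))
    matrix = []
    for term in terms:
        freqs = [c[term] for c in counts]
        gf = sum(freqs)                       # occurrences in the whole collection
        df = sum(1 for f in freqs if f > 0)   # documents containing the term
        global_weight = gf / df               # GfIdf (assumed definition)
        matrix.append([global_weight * math.log2(1 + f) for f in freqs])
    # Cosine-normalize every document column so document length does not dominate.
    for j in range(len(docs)):
        norm = math.sqrt(sum(row[j] ** 2 for row in matrix))
        if norm > 0:
            for row in matrix:
                row[j] /= norm
    return terms, matrix


# Hypothetical pre-tokenized abstracts, for illustration only.
docs = [["tms", "depression", "tms"], ["tms", "stroke"]]
terms, tdm = weighted_tdm(docs)
for term, row in zip(terms, tdm):
    print(term, [round(x, 3) for x in row])
```

A matrix of this form is the input to the singular value decomposition step of latent semantic analysis; the decomposition itself is omitted here to keep the sketch dependency-free.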
The KWIC statistic revealed information about both the relevant terms co-occurring with specific query terms (e.g., TMS) and statistical regularities (e.g., probable word positions) about the way the scientists build their statements.
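A keyword-in-context (KWIC) listing and the word-position statistic derived from it can be sketched as follows. This is a minimal illustration and not the KH Coder procedure; the function names, the window size and the example sentences are assumptions.

```python
import re
from collections import Counter


def kwic(texts, keyword, window=3):
    """Return (left context, keyword, right context) for every keyword occurrence."""
    keyword = keyword.lower()
    rows = []
    for text in texts:
        tokens = re.findall(r"[A-Za-z0-9-]+", text)
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                rows.append((left, tok, right))
    return rows


def position_counts(rows):
    """Count context words by signed offset: -1 is just left, +1 just right."""
    counts = Counter()
    for left, _, right in rows:
        for offset, word in enumerate(reversed(left), start=1):
            counts[(-offset, word.lower())] += 1
        for offset, word in enumerate(right, start=1):
            counts[(offset, word.lower())] += 1
    return counts


# Hypothetical sentences, for illustration only.
texts = [
    "High-frequency rTMS improved working memory in patients.",
    "Low-frequency rTMS reduced seizures.",
]
rows = kwic(texts, "rTMS")
positions = position_counts(rows)
for left, key, right in rows:
    print(" ".join(left), "[" + key + "]", " ".join(right))
```

Aggregating the offsets over a large corpus shows, for instance, which verbs most often follow the query term directly.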
3. Summary of results obtained with VisualText, Anote2, BioRAT and our Matlab applications (MAPP). All TM-tools used similar resources, including: dictionaries (mental disabilities, brain anatomy, cognitive processes, built using Neuroscience Information Framework (NIF) resources; CR-TMS-effects, CE-TMS-effects and TMS, built using our term statistic, similar to KH Coder coding rules); ontologies (NIF gross-anatomy and NIF dysfunction; human disease and neuro-behavior ontology from The Open Biological and Biomedical Ontologies Foundry); lexical words (Mallet stopwords; verbs+ and verbs- for CR-TMS and CE-TMS, similar to KH Coder rules). Each dictionary/ontology could be considered a class of terms (indicated by brackets; e.g., [mental disabilities]).
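Treating each dictionary as a class of terms amounts to a lookup that maps tokens to class labels and counts the hits per class. The sketch below assumes single-word dictionary entries and tiny hand-made dictionaries; it does not reproduce the NIF resources or the processing done by the tools listed above, and the function name `class_counts` is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative dictionaries; the real resources are far larger.
DICTIONARIES = {
    "mental disabilities": {"depression", "stroke", "schizophrenia", "epilepsy"},
    "cognitive processes": {"memory", "learning", "attention", "perception"},
}


def class_counts(texts, dictionaries):
    """Count, per class, how often each dictionary term occurs in the texts."""
    counts = {name: Counter() for name in dictionaries}
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            for name, class_terms in dictionaries.items():
                if token in class_terms:
                    counts[name][token] += 1
    return counts


# Hypothetical sentences, for illustration only.
texts = [
    "rTMS reduced depression after stroke.",
    "TMS improved memory and attention; depression scores fell.",
]
by_class = class_counts(texts, DICTIONARIES)
for name, counter in by_class.items():
    print(f"[{name}]", counter.most_common())
```

Multi-word entries such as "working memory" would need phrase matching instead of single-token lookup.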
The text processing relied on predefined streams of processing or on existing libraries of examples (e.g., TAIParse general text analyzer for VisualText). Our Matlab applications (MAPP) were used to handle the results, to perform relation extraction and to extract sentences with high probability of occurrence (low perplexity coefficients, PP).
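The link between a sentence's probability and its perplexity can be illustrated with a small language model. The model behind the MAPP perplexity coefficients is not described here, so the bigram model with add-one smoothing below is an assumption made purely for illustration, as are the function names and the toy corpus.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context and bigram counts from tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return contexts, bigrams, len(vocab)


def perplexity(sentence, model):
    """Perplexity of one sentence: exp of the negative mean log-probability."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


# Hypothetical tokenized sentences, for illustration only.
corpus = [
    ["rtms", "improved", "memory"],
    ["rtms", "improved", "attention"],
    ["tms", "reduced", "seizures"],
]
model = train_bigram(corpus)
pp_typical = perplexity(["rtms", "improved", "memory"], model)
pp_scrambled = perplexity(["memory", "improved", "rtms"], model)
print(round(pp_typical, 2), round(pp_scrambled, 2))
```

A sentence built from frequent word sequences receives a lower perplexity than a scrambled one, which is why low perplexity can serve as a proxy for high probability of occurrence.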

DISCUSSIONS AND CONCLUSIONS
Here, we used a set of selected TM-tools to obtain basic insights into the relevant literature on CR- and CE-TMS. For obvious reasons, we limited this application to a few simple aspects. First, we showed that TM could retrieve from vast corpora of publications the diversity of TMS applications in CR and CE, automatically extracting trends already described in published reviews. Second, we searched for trends noticeable only in big corpora of publications. Throughout this exercise, we attempted to validate our results in different ways. For example, we compared similar results obtained using different TM-tools applied to different corpora (TIAB-corpora or TIABREV-corpora). Relevant and common aspects, synthesized in unique results per type of analysis and topic, were shown in the paper. Finally, we compared TM to human curation efforts. Accordingly, we selected for human curation 30 of the top-ranked (with Medline Ranker) publications from the TIABREV-corpora, separately for CR-TMS and CE-TMS. We performed a selective manual curation aimed at retrieving relevant terms co-occurring with TMS (all types of protocol) in the abstracts, which belong to the following classes: mental functions, healthy or impaired, modulated by TMS; mental disabilities treated with TMS; rehabilitation or enhancement effects of TMS. We also searched for relationships between classes of relevant terms and conclusive sentences summarizing research results. Very briefly, the manual curation gave the following perspective on the main topics:
a. CR-TMS.
TMS is continuously establishing itself as one of the "tools of the trade" in psychiatric therapeutic practice (Kammer and Spitzer, 2012), improving mental functions in: Parkinson's disease (Pascual-Leone et al., 1994), aphasia (Medina et al., 2012), motor control after stroke (Takeuchi et al., 2005), epilepsy (Nitsche and Paulus, 2009), depression (Lisanby et al., 2009; Conforto et al., 2014), schizophrenia (Levkovitz et al., 2011; Kammer and Spitzer, 2012), autism (Krause et al., 2012), chronic migraine (Conforto et al., 2014), dyslexia (Costanzo et al., 2013), neglect (Fasotti and Van Kessel, 2013), obsessive-compulsive disorder (OCD) (Mantovani et al., 2013), chronic pain (Moreno-Duarte et al., 2014), and social anxiety disorder (Paes et al., 2013). The TMS therapy applied to younger patients (children and adolescents) improves cognitive functions (Vicario and Nitsche, 2013) in: stroke affecting the motor cortex (Kirton et al., 2008), epilepsy (Fregni et al., 2005), ADHD (Weaver et al., 2012), Tourette syndrome (Le et al., 2013), autism (Baruth et al., 2010), treatment-resistant depression (Bloch et al., 2008), and medication-resistant schizophrenia (Jardri et al., 2012).
b. CE-TMS. CE is defined as any augmentation of core information processing systems in the brain underlying perception, attention, conceptualization, memory, reasoning and motor performance (Sandberg and Bostrom, 2006; Luber and Lisanby, 2014).
Studies reported TMS-induced modulations and enhancements of brain functioning and neural processing involved in: language comprehension (Floel et al., 2008), learning and memory, cortical plasticity improving learning (Vallence and Ridding, 2014), motor memory (Butefisch et al., 2004), working memory (Gaudeau-Bosma et al., 2013), memory (Gagnon et al., 2011; Blumenfeld et al., 2014), phonological memory (Kirschen et al., 2006), perception (Hamilton et al., 2013), perceptual discrimination (Luber and Lisanby, 2014), eye movements and visual search (Gerits et al., 2011; Luber and Lisanby, 2014), attention (Cooper et al., 2004; Lee et al., 2013), reward behavior (Stanford et al., 2013), analogic reasoning (Boroojerdi et al., 2001), motor learning (Luber and Lisanby, 2014), consolidation of new skills (Boyd and Linsdell, 2009), visual awareness (Grosbras and Paus, 2003), activity of specific frequencies supporting functions of the brain (Rahnev, 2013), and Pavlovian conditioning (Luber et al., 2007).
CR-TMS and CE-TMS used various TMS paradigms, including single-pulse, theta-burst, paired-pulse, and trains of rTMS at both low and high frequencies (Luber and Lisanby, 2014). Comparisons with the manual curation showed that the TM-tools were also able to extract:
- All the relevant terms for CR-TMS and CE-TMS in the form of: lists; topics; classes of terms associated with specific subtopics (e.g., mental disabilities, cognitive processes).
- Relations between relevant terms in the form of: co-occurrence maps (Figure 3); groups of relevant terms with high-probability co-occurrences; KWIC (Tables 3, 4); lists of relevant relational verbs.
- Conclusive sentences of high probability and relevance (see examples and Table 5).
We also studied structural statistical regularities in both conclusive sentences and abstracts shown by: the relative position in the sentence for groups of relevant terms (Tables 3-5); combinations of relevant terms with high-probability occurrence; the occurrence frequency for conclusive sentences.
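One simple way to quantify a comparison between TM output and manual curation is to treat the curated terms as a reference set and report overlap, precision and recall. No such figures are reported in this study, so the sketch below only illustrates the idea; the function name is hypothetical and the two short term lists are illustrative subsets, not the full results.

```python
def compare_term_sets(tm_terms, curated_terms):
    """Compare TM-extracted terms against manually curated reference terms."""
    tm = {t.lower() for t in tm_terms}
    gold = {t.lower() for t in curated_terms}
    shared = tm & gold
    return {
        "shared": sorted(shared),
        "tm_only": sorted(tm - gold),
        "curated_only": sorted(gold - tm),
        "precision": len(shared) / len(tm) if tm else 0.0,
        "recall": len(shared) / len(gold) if gold else 0.0,
    }


# Illustrative subsets of [mental disabilities] terms, not the full lists.
tm_terms = ["depression", "stroke", "schizophrenia", "tinnitus"]
curated_terms = ["Depression", "stroke", "schizophrenia", "aphasia"]
report = compare_term_sets(tm_terms, curated_terms)
print(report)
```

Terms found only by TM are candidates for trends that a small curated sample misses, while terms found only by curation point to gaps in the dictionaries or in the pre-processing.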
In addition, the TM approach has clear advantages emerging from the statistical properties of big corpora. Accordingly, the triad (terms, term relationships, sentences) gained statistical strength, enabling us to quantify the frequency of a term or the occurrence probabilities for specific relationships between terms or for conclusive sentences. For example, the hierarchy of the top terms for the CR-TMS-corpus includes TMS, treatment, rTMS, brain, therapy, depression, and stroke. We also found hierarchies for classes of terms like [CR-TMS-effects] (e.g., treatment, therapy, recovery, antidepressant, rehabilitation, neurorehabilitation) and [mental disabilities] (e.g., depression, stroke, schizophrenia, tinnitus, major depressive disorder, Parkinson, OCD, epilepsy, seizures, neglect, anxiety, ADHD, stress, Alzheimer's). For the CE-TMS-corpus the top terms are TMS, rTMS, brain, cognitive, performance, and facilitation. We also added hierarchies for classes of terms like [CE-TMS-effects] (e.g., performance enhancement, improvement, facilitation, neuromodulation, neurostimulation, therapy, neuroenhancement, rehabilitation, CE) and [cognitive processes] (e.g., memory, learning, attention, working memory, perception, language skill acquisition, decision, emotion, speech, semantic processing). The relevance of all the retrieved terms and of their relationships is based on the idea that words co-occurring frequently in the abstracts are related in specific ways intrinsically constrained by the (TMS-related) topic of the abstracts. Thus, TMS is strongly related to all frequent terms retrieved with different TM-tools. Although in a relatively crude form, determined by our intention to show "raw" TM results, our study shows that TMS emerged as one of the important non-invasive tools that can both improve cognitive and motor functions in numerous neurological diseases and induce enhancements of many fundamental brain functions.
We were able to characterize topics considering their dynamic relationships, trends in research and the interest shown by the scientific community. For example, CR-TMS and CE-TMS share studies (Figure 4), which is an argument for their similarity. The reviewed topics also share publications with other fields, suggesting that they belong to a larger context, which integrates diagnostic, fundamental research and fMRI studies. TMS can be used both to investigate and to modify brain physiology and performance in healthy and diseased subjects (Vicario and Nitsche, 2013). Methodologically speaking, we conclude that TM was helpful in getting an overall perspective on a huge corpus of literature with some level of detail, intentionally limited to handle complexity. Richer information can be extracted using more complex TM methods focused on narrower topics, but this requires extensive training and knowledge.
A deciding factor in whether to use TM is how profitable and how difficult the tools may be. The study aimed to address these simple issues in a pragmatic way. First and foremost, we argue that TM-tools may become a basic component in the methodological library. Unfortunately, it is equally clear that TM is a difficult task. With this in mind, we aimed to evaluate relatively immediate advantages of a user-friendly TM approach, based on easy-to-use TM-tools applied to CR- and CE-TMS corpora of abstracts. The hierarchical structure of our example set of TM-tools could serve as a guide for researchers aiming to use TM. Accordingly, for a rapid enrichment of the PubMed search, TM-tools II could be used, with special consideration for Carrot2, PubReMiner, Quertle, Medline Ranker, and Textpresso for Neuroscience. All TM-tools III could support a more elaborate TM without a considerable increase in demands on the user. For complex studies combining multiple aspects of the "mining," we recommend systems like Knime, RapidMiner, and Taverna.