Event Abstract

MSL: Mining published scientific literature for the extraction and classification of text and images to support IR capabilities

  • 1 The Jackson Laboratory, Genomic Medicine, United States
  • 2 University of Wuerzburg, Department of Bioinformatics, Germany

In recent decades, the volume of scientific publications has grown enormously (Hunter and Cohen, 2006). Most published scientific literature is available as Portable Document Format (PDF) files, which combine text in different styles (e.g. font, placement, alignment, colour, size), figures in different formats (e.g. PNG, JPEG, TIFF), tables and attached supplementary material (e.g. datasets, software tools, libraries). Alongside the published text, the importance of scientific figures is widely recognized. Millions of figures have been published so far, presenting results from a wide range of scientific experiments (e.g. PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing), diagnostic imaging (e.g. CT/MRI, ultrasound scans), medical imaging (e.g. EEG, MEG, ECG, PET) and other anatomical and pathological images. Analysing published scientific figures can help in drawing hypotheses, exemplifying approaches, describing experimental results and building a better understanding of clinical and scientific problems. Information Retrieval (IR) is well established in the scientific community, where it plays a vital role in analysing published data. Today, one of the major challenges in IR is implementing a system that can efficiently analyse published scientific literature (PDF files) by extracting, classifying and categorizing its valuable content. Unfortunately, PDF is designed primarily for displaying and printing manuscripts, and extracting text and image based information from it requires extensive effort.
Several commercial and freely downloadable open source tools have been developed for physical and logical document structure analysis of PDF files, but they do not provide content (text and images) in a form suitable for complex data analysis, e.g. mining text in reading order from double or multiple column documents, searching marginal text using keywords, removing irrelevant graphics and extracting text embedded in single and multi-panel complex biological images. The goal of our research is to implement a method that helps in analysing published scientific literature by extracting text and figures from PDF files. Furthermore, we aim to classify the extracted text and mine the text embedded in figures to improve search and IR capabilities. To meet these objectives, we have developed and present MSL (Ahmed and Dandekar, 2015), a user friendly, modular, client based system that supports the extraction of text and images from published PDF files with single, double or multiple column layouts (Figure 1A). MSL applies data mining and image processing techniques by integrating different open source and commercial libraries (Figure 1B) for text and image extraction. It implements a modular approach for the marginalization of extracted full text based on coordinates, keywords and file attributes. It also provides features for extracting figures from PDF files and applies Optical Character Recognition (OCR) to extract embedded text from biomedical and biological images. Moreover, to support further data mining, it generates output in different file formats, including text (marginal information), XML (structured information), images (extracted figures) and PDF (images analysed using OCR).
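As an illustration of coordinate and keyword based handling of extracted text, the following minimal Python sketch orders text blocks from a multiple column page into reading order and then searches them by keyword. It is not MSL's implementation (MSL is written in C#); the `TextBlock` structure, the function names, the equal-width column assumption and the sample coordinates are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A run of extracted text with the top-left corner of its bounding box.

    Hypothetical structure for this sketch, not an MSL data type.
    """
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, growing downwards


def reading_order(blocks, page_width, n_columns=2):
    """Sort blocks column by column, top to bottom within each column.

    Assumes equal-width columns; real layouts need proper column detection.
    """
    column_width = page_width / n_columns

    def key(block):
        # Clamp to the last column so a block at the right edge stays in range.
        column = min(int(block.x // column_width), n_columns - 1)
        return (column, block.y, block.x)

    return sorted(blocks, key=key)


def find_keyword(blocks, keyword):
    """Return the blocks whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [b for b in blocks if needle in b.text.lower()]


# Made-up blocks from a two column page, listed out of reading order.
blocks = [
    TextBlock("right column, first line", x=320, y=100),
    TextBlock("left column, second line", x=40, y=130),
    TextBlock("left column, first line", x=40, y=100),
    TextBlock("right column, second line", x=320, y=130),
]

ordered = reading_order(blocks, page_width=600)
for block in ordered:
    print(block.text)
```

Sorting on the tuple (column, y, x) keeps the whole left column ahead of the right one, which is the property that a plain top-to-bottom sort loses on double column pages.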
MSL is a desktop application designed following the principles of the Butterfly paradigm (Ahmed et al., 2014a; Ahmed and Zeeshan, 2014b) and developed in the C# programming language. The most recent available version of MSL has been tested and validated in-house on different datasets (scientific publications from different publishers), e.g. Figure 1A (Ahmed et al., 2014c). MSL is easy to install and use (Figure 1C), but it can only be configured on the Microsoft Windows platform and supporting operating systems. MSL is freely available to download (https://zenodo.org/record/30941#.VuB_M7S5LHM) and use.

Figure 1

Acknowledgements

We thank the German Research Foundation (DFG-TR34/Z1) and The Jackson Laboratory USA for support. We thank all the open source, licensed and commercial library providers for their help in this non-commercial, academic research and software development.

References

1. Ahmed, Z., Dandekar, T. (2015) MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format [version 1; referees: 1 approved with reservations]. F1000Research, 4, 1453.

2. Ahmed, Z., Zeeshan, S., Dandekar, T. (2014a) Developing sustainable software solutions for bioinformatics by the “Butterfly” paradigm. F1000Research, 3, 71.

3. Ahmed, Z., Zeeshan, S. (2014b) Cultivating Software Solutions Development in the Scientific Academia. Rec Pat Comp Sci, 7, 54-63.

4. Ahmed, Z., Zeeshan, S., Huber, C., Hensel, M., Schomburg, D., Munch, R., Eylert, E., Eisenreich, W., Dandekar, T. (2014c) ‘Isotopo’ a database application for facile analysis and management of mass isotopomer data. Database, 2014.

5. Hunter, L., Cohen, K.B. (2006) Biomedical Language Processing: What’s beyond PubMed? Molecular Cell, 21, 589-594.

Keywords: Bioinformatics Data Mining, images, MSL, Publications, Scientific literature, text, OCR

Conference: Neuroinformatics 2016, Reading, United Kingdom, 3 Sep - 4 Sep, 2016.

Presentation Type: Poster

Topic: General neuroinformatics

Citation: AHMED Z, Zeeshan S and Dandekar T (2016). MSL: Mining published scientific literature for the extraction and classification of text and images to support IR capabilities. Front. Neuroinform. Conference Abstract: Neuroinformatics 2016. doi: 10.3389/conf.fninf.2016.20.00021

Copyright: The abstracts in this collection have not been subject to any Frontiers peer review or checks, and are not endorsed by Frontiers. They are made available through the Frontiers publishing platform as a service to conference organizers and presenters.

The copyright in the individual abstracts is owned by the author of each abstract or his/her employer unless otherwise stated.

Each abstract, as well as the collection of abstracts, are published under a Creative Commons CC-BY 4.0 (attribution) licence (https://creativecommons.org/licenses/by/4.0/) and may thus be reproduced, translated, adapted and be the subject of derivative works provided the authors and Frontiers are attributed.

For Frontiers’ terms and conditions please see https://www.frontiersin.org/legal/terms-and-conditions.

Received: 09 Mar 2016; Published Online: 18 Jul 2016.

* Correspondence: Dr. Zeeshan AHMED, The Jackson Laboratory, Genomic Medicine, Farmington, CT, United States, zahmed@ifh.rutgers.edu