Large Scale Subject Category Classification of Scholarly Papers With Deep Attentive Neural Networks

Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves micro-F1 measure of 0.76 with F1 of individual subject categories ranging from 0.50 to 0.95. The results showed the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TFIDF outperforms character and sentence level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.


INTRODUCTION
A recent estimate of the total number of English research articles available online was at least 114 million (Khabsa and Giles, 2014). Studies indicate the number of academic papers doubles every 10-15 years (Larsen and von Ins, 2010). The continued growth of scholarly papers increases the challenges to accurately find relevant research papers, especially when papers in different subject categories (SCs) are mixed in a search engine's collection. Searches based on only keywords may no longer be the most efficient method (Matsuda and Fukushima, 1999) to use. This often happens when the same query terms appear in multiple research areas. For example, querying "neuron" in Google Scholar returns documents in both computer science and neuroscience. Search results can also belong to diverse domains when the query terms contain acronyms. For example, querying "NLP" returns documents in linguistics (meaning "neuro-linguistic programming") and computer science (meaning "natural language processing"). If the SCs of documents are available, the users can narrow search results by specifying an SC, which effectively increases the precision of the query results, assuming SCs are accurately assigned to documents. Also, delineation of scientific domains is a preliminary tasks of many bibliometric studies at the meso-level. Accurate categorization of research articles is a prerequisite for discovering various dimensions of scientific activity in epistemology (Collins, 1992) and sociology (Barnes et al., 1996), as well as the invisible colleges, which are implicit academic networks (Zitt et al., 2019). To build a web-scale knowledge system, it is necessary to organize scientific publications into a hierarchical concept structure, which further requires categorization of research articles by SCs (Shen et al., 2018).
As such, we believe it is useful to build a classification system that assigns SCs to scholarly papers. Such a system could significantly impact scientific search and facilitate bibliometric evaluation. It can also help with Science of Science research (Fortunato et al., 2018), an area of research that uses scholarly big data to study the choice of scientific problems, scientist career trajectories, research trends, research funding, and other research aspects. Also, many have noted that it is difficult to extract SCs using traditional topic models such as Latent Dirichlet Allocation (LDA), since it only extracts words and phrases present in documents (Gerlach et al., 2018). An example is that a paper in computer science is rarely given its SC in the keyword list.
In this work, we pose the SC problem as one of multiclass classifications in which one SC is assigned to each paper. In a preliminary study, we investigated feature-based machine learning methods to classify research papers into six SCs . Here, we extend that study and propose a system that classifies scholarly papers into 104 SCs using only abstracts. The core component is a neural network classifier trained on millions of labeled documents that are part of the WoS database. In comparison with our preliminary work, our data is more heterogeneous (more than 100 SCs as opposed to six), imbalanced, and complicated (data labels may overlap). We compare our system against several baselines applying various text representations, machine learning models, and/or neural network architectures.
SC classification is usually based on a universal schema for a specific domain or for all domains. Many schemas for scientific classification systems are publisher domain specific. For example, ACM has its own hierarchical classification system 1 , NLM has medical subject headings 2 , and MSC has a subject classification for mathematics 3 . The most comprehensive and systematic classification schemas seem to be from WoS 4 and the Library of Congress (LOC) 5 . The latter was created in 1897 and was driven by practical needs of the LOC rather than any epistemological considerations and is most likely out of date.
To the best of our knowledge, our work is the first example of using a neural network to classify scholarly papers into a comprehensive set of SCs. Other work focused on unsupervised methods and most were developed for specific category domains. In contrast, our classifier was trained on a large number of high quality abstracts from WoS and can be applied directly to abstracts without any citation information. We also develop a novel representation of scholarly paper abstracts using ranked tokens and their word embedding representations. This significantly reduces the scale of the classic Bag of Word (BoW) model. We also retrained FastText and GloVe word embedding models using WoS abstracts. The subject category classification was then applied to the CiteSeerX collection of documents. However, it could be applied to any similar collection.

RELATED WORK
Text classification is a fundamental task in natural language processing. Many complicated tasks use it or include it as a necessary step, such as part-of-speech tagging, e.g., Ratnaparkhi (1996), sentiment analysis, e.g., Vo and Zhang (2015), and named entity recognition, e.g., Nadeau and Sekine (2007). Classification can be performed at many levels: word, phrase, sentence, snippet (e.g., tweets, reviews), articles (e.g., news articles), and others. The number of classes usually ranges from a few to nearly 100. Methodologically, a classification model can be supervised, semi-supervised, and unsupervised. An exhaustive survey is beyond the scope of this paper. Here we briefly review short text classification and highlight work that classifies scientific articles.
Bag of words (BoWs) is one of the most commonly used representations for text classification, an example being keyphrase extraction (Caragea et al., 2016;He et al., 2018). BoW represents text as a set of unordered word-level tokens, without considering syntactical and sequential information. For example, Nam et al. (2016) combined BoW with linguistic, grammatical, and structural features to classify sentences in biomedical paper abstracts. In Li et al. (2010), the authors treated the text classification as a sequence tagging problem and proposed a Hidden Markov Model used for the task of classifying sentences into mutually exclusive categories, namely, background, objective, method, result, and conclusions. The task described in García et al. (2012) classifies abstracts in biomedical databases into 23 categories (OHSUMED dataset) or 26 categories (UVigoMED dataset). The author proposed a bagof-concept representation based on Wikipedia and classify abstracts using the SVM model.
Classifying SCs of scientific documents is usually based on metadata, since full text is not available for most papers and processing a large amount of full text is computationally expensive. Most existing methods for SC classification are unsupervised. For example, the Smart Local Moving Algorithm identified topics in PubMed based on text similarity (Boyack and Klavans, 2018) and citation information (van Eck and Waltman, 2017). K-means was used to cluster articles based on semantic similarity (Wang and Koopman, 2017). The memetic algorithm, a type of evolutionary computing (Moscato and Cotta, 2003), was used to classify astrophysical papers into subdomains using their citation networks. A hybrid clustering method was proposed based on a combination of bibliographic coupling and textual similarities using the Louvain algorithm-a greedy method that extracted communities from large networks (Glänzel and Thijs, 2017). Another study constructed a publication-based classification system of science using the WoS dataset (Waltman and van Eck, 2012). The clustering algorithm, described as a modularity-based clustering, is conceptually similar to k-nearest neighbor (kNN). It starts with a small set of seed labeled publications and grows by incrementally absorbing similar articles using co-citation and bibliographic coupling. Many methods mentioned above rely on citation relationships. Although such information can be manually obtained from large search engines such as Google Scholar, it is non-trivial to scale this for millions of papers.
Our model classifies papers based only on abstracts, which are often available. Our end-to-end system is trained on a large number of labeled data with no references to external knowledge bases. When compared with citation-based clustering methods, we believe it to be more scalable and portable.

TEXT REPRESENTATIONS
For this work, we represent each abstract using a BoW model weighted by TF-IDF. However, instead of building a sparse vector for all tokens in the vocabulary, we choose word tokens with the highest TF-IDF values and encode them using WE models. We explore both pre-trained and retrained WE models. We also explore their effect on classification performance based on token order. As evaluation baselines, we compare our best model with off-theshelf text embedding models, such as the Unified Sentence Encoder [USE; Cer et al. (2018)]. We show that our model which uses the traditional and relatively simple BoW representation is computationally less expensive and can be used to classify scholarly papers at scale, such as those in the CiteSeerX repository (Giles et al., 1998;Wu et al., 2014).

Representing Abstracts
First, an abstract is tokenized with white spaces, punctuation, and stop words were removed. Then a list A of word types (unique words) w i is generated after lemmatization which uses the WordNet database (Fellbaum, 2005) for the lemmas. (1) Next the list A f is sorted in descending order by TF-IDF giving A sorted . TF is the term frequency in an abstract and IDF is the inverse document frequency calculated using the number of abstracts containing a token in the entire WoS abstract corpus.
Because abstracts may have different numbers of words, we chose the top d elements from A sorted to represent the abstract. We then re-organize the elements according to their original order in the abstract forming a sequential input. If the number of words is less than d, we pad the feature list with zeros. The final list is a vector built by concatenating all word level vectors v → ′k , k ∈ {1, /, d} into a D WE dimension vector. The final semantic feature vector A f is:

Word Embedding
To investigate how different word embeddings affect classification results, we apply several widely used models. An exhaustive experiment for all possible models is beyond the scope of this paper. We use some of the more popular ones as now discussed.
GloVe captures semantic correlations between words using global word-word co-occurrence, as opposed to local information used in word2vec (Mikolov et al., 2013a). It learns a word-word co-occurrence matrix and predicts co-occurrence ratios of given words in context (Pennington et al., 2014). Glove is a contextindependent model and outperformed other word embedding models such as word2vec in tasks such as word analogy, word similarity, and named entity recognition tasks.
FastText is another context-independent model which uses sub-word (e.g., character n-grams) information to represent words as vectors . It uses log-bilinear models that ignore the morphological forms by assigning distinct vectors for each word. If we consider a word w whose n-grams are denoted by g w , then the vector z g is assigned to each n-gram in g w . Each word is represented by the sum of the vector representations of its character n-grams. This representation is incorporated into a Skip Gram model (Goldberg and Levy, 2014) which improves vector representation for morphologically rich languages.
SciBERT is a variant of BERT, a context-aware WE model that has improved the performance of many NLP tasks such as question answering and inference (Devlin et al., 2019). The bidirectionally trained model seems to learn a deeper sense of language than single directional transformers. The transformer uses an attention mechanism that learns contextual relationships between words. SciBERT uses the same training method as BERT but is trained on research papers from Semantic Scholar. Since the abstracts from WoS articles mostly contain scientific information, we use SciBERT (Beltagy et al., 2019) instead of BERT. Since it is computationally expensive to train BERT (4 days on 4-16 Cloud TPUs as reported by Google), we use the pre-trained SciBERT.

Retrained WE Models
Though pretrained WE models represent richer semantic information compared with traditional one-hot vector methods, when applied to text in scientific articles the classifier does not perform well. This is probably because the text corpus used to train these models are mostly from Wikipedia and Newswire. The majority of words and phrases included in the vocabulary extracted from these articles provides general descriptions of knowledge, which are significantly different from those used in scholarly articles which describe specific domain knowledge. Statistically, the overlap between the vocabulary of pretrained GloVe (six billion tokens) and WoS is only 37% . Nearly all of the WE models can be retrained. Thus, we retrained GloVe and FastText using 6.38 million abstracts in WoS (by imposing a limit of 150k on each SC, see below for more details). There are 1.13 billion word tokens in total. GloVe generated 1 million unique vectors, and FastText generated 1.2 million unique vectors.

Universal Sentence Encoder
For baselines, we compared with Google's Universal Sentence Encoder (USE) and the character-level convolutional network (CCNN). USE uses transfer learning to encode sentences into vectors. The architecture consists of a transformer-based sentence encoding (Vaswani et al., 2017) and a deep averaging network (DAN) (Iyyer et al., 2015). These two variants have trade-offs between accuracy and compute resources. We chose the transformer model because it performs better than the DAN model on various NLP tasks (Cer et al., 2018). CCNN is a combination of character-level features trained on temporal (1D) convolutional networks [ConvNets; Zhang et al. (2015)]. It treats input characters in text as a raw-signal which is then applied to ConvNets. Each character in text is encoded using a one-hot vector such that the maximum length l of a character sequence does not exceed a preset length l 0 .

CLASSIFIER DESIGN
The architecture of our proposed classifier is shown in Figure 1. An abstract representation previously discussed is passed to the neural network for encoding. Then the label of the abstract is determined by the output of the sigmoid function that aggregates all word encodings. Note that this architecture is not applicable for use by CCNN or USE. For comparison, we used these two architectures directly as described from their original publications.
LSTM is known for handling the vanishing gradient that occurs when training recurrent neural networks. A typical LSTM cell consists of three gates: input gate i t , output gate o t and forget gate f t . The input gate updates the cell state; the output gate decides the next hidden state, and the forget gate decides whether to store or erase particular information in the current state h t . We use tanh(·) as the activation function and the sigmoid function σ(·) to map the output values into a probability distribution. The current hidden state h t of LSTM cells can be implemented with the following equations: At a given time step t, x t represents the input vector; c t represents cell state vector or memory cell; z t is a temporary result. W and U are weights for the input gate i, forget gate f, temporary result z, and output gate o.
GRU is similar to LSTM, except that it has only a reset gate r t and an update gate z t . The current hidden state h t at a given timestep t can be calculated with: with the same defined variables. GRU is less computationally expensive than LSTM and achieves comparable or better performance for many tasks. For a given sequence, we train LSTM and GRU in two directions (BiLSTM and BiGRU) to predict the label for the current position using both historical and future data, which has been shown to outperform a single direction model for many tasks. Attention Mechanism The attention mechanism is used to weight word tokens deferentially when aggregating them into a document level representations. In our system (Figure 1), embeddings of words are concatenated into a vector with D WE dimensions. Using the attention mechanism, each word t contributes to the sentence vector, which is characterized by the factor α t such that is the representation of each word after the BiLSTM or BiGRU layers, v t is the context vector that is randomly initialized and learned during the training process, W is the weight, and b is the bias. An abstract vector v is generated by aggregating word vectors using weights learned by the attention mechanism. We then calculate the weighted sum of h t using the attention weights by:

EXPERIMENTS
Our training dataset is from the WoS database for the year 2015. The entire dataset contains approximately 45 million records of academic documents, most having titles and abstracts. They are labeled with 235 SCs at the journal level in three broad categories-Science, Social Science, and Art and Literature. A portion of the SCs have subcategories, such as "Physics, Condensed Matter," "Physics, Nuclear," and "Physics, Applied." Here, we collapse these subcategories, which reduces the total number of SCs to 115. We do this because the minor classes decrease the performance of the model (due to the less availability of that data). Also, we need to have an "others" class to balance the data samples. We also exclude papers labeled with more than one category and papers that are labeled as "Multidisciplinary." Abstracts with less than 10 words are excluded. The final number of singly labeled abstracts is approximately nine million, in 104 SCs. The sample sizes of these SCs range from 15 (Art) to 734k (Physics) with a median about 86k. We randomly select up to 150k abstracts per SC. This upper limit is based on our preliminary study . The ratio between the training and testing corpus is 9:1. The median of word types per abstract is approximately 80-90. As such, we choose the top d 80 elements from A sorted to represent the abstract. If A sorted has less than d elements, we pad the feature list with zeros. The word vector dimensions of GloVe and FastText are set to 50 and 100, respectively. This falls into the reasonable value range (24-256) for WE dimensions (Witt and Seifert, 2017). When training the BiLSM and BiGRU models, each layer contains 128 neurons. We investigate the dependency of classification performance on these hyper-parameters by varying the number of layers and neurons. We varied the number of word types per abstract d and set the dropout rate to 20% to mitigate overfitting or underfitting. Due to their relatively large size, we train the neural networks using mini-batch gradient descent with Adam for gradient optimization and one word cross entropy as the loss function. The learning rate was set to 10 − 3 .

One-Level Classifier
We first classify all abstracts in the testing set into 104 SCs using the retrained GloVe WE model with BiGRU. The model achieves a micro-F 1 score of 0.71. The first panel in Figure 2 shows the SCs that achieve the highest F 1 's; the second panel shows SCs that achieve relatively low F 1 's. The results indicate that the classifier performs poorer on SCs with relatively small sample sizes than SCs with relatively large sample sizes. The data imbalance is likely to contribute to the significantly different performances of SCs.

Two-Level Classifier
To mitigate the data imbalance problems for the one-level classifier, we train a two-level classifier. The first level classifies abstracts into 81 SCs, including 80 major SCs and an "Others" category, which incorporates 24 minor SCs. "Others" contains the categories with training data < 10k abstracts. Abstracts that fall into the "Others" are further classified by a second level classifier, which is trained on abstracts belonging to the 24 minor SCs.

Baseline Methods
For comparison, we trained five supervised machine learning models as baselines. They are Random Forest (RF), Naïve Bayes (NB, Gaussian), Support Vector Machine (SVM, linear and Radial Basis Function kernels), and Logistic Regression (LR). Documents are represented in the same way as for the DANN except that no word embedding is performed. Because it takes an extremely long time to train these models using all data used for training DANN, and the implementation does not support batch processing, we downsize the training corpus to 150k in total and keep training samples in each SC in proportion to those used in DANN. The performance metrics are calculated based on the same testing corpus as the DANN model. We used the CCNN architecture , which contains six convolutional layers each including 1,008 neurons followed by three fully connected layers. Each abstract is represented by a 1,014 dimensional vector. Our architecture for USE (Cer et al., 2018) is an MLP with four layers, each of which contains 1,024 neurons. Each abstract is represented by a 512 dimensional vector.

Results
The performances of DANN in different settings and a comparison between the best DANN models and baseline models are illustrated in Figure 3. The numerical values of performance metrics using the two-level classifier are tabulated in Supplementary Table S1. Below are the observations from results.
(2) Retraining FastText and GloVe significantly boosted the performance. In contrast, the best micro-F 1 achieved by USE is 0.64, which is likely resulted from its relatively low vocabulary overlap. Another reason could be is that the single vector of fixed length only encodes the overall semantics of the abstract. The occurrences of words are better indicators of sentences in specific domains. (3) LSTM and GRU and their bidirectional counterparts exhibit very similar performance, which is consistent with a recent systematic survey (Greff et al., 2017). (4) For FastText + BiGRU + Attn, the F 1 measure varies from 0.50 to 0.95 with a median of 0.76. The distribution of F 1 values for 81 SCs is shown in Figure 4. The F 1 achieved by the first-level classifier with 81 categories (micro-F 1 0.76) is improved compared with the classifier trained on 104 SCs (micro-F 1 0.70) (5) The performance was not improved by increasing the GloVe vector dimension from 50 to 100 (not shown) under the setting of GloVe + BiGRU with 128 neurons on two layers which is consistent with earlier work (Witt and Seifert, 2017). (6) Word-level embedding models in general perform better than the character-level embedding models (i.e., CCNN). CCNN considers the text as a raw-signal, so the word vectors constructed are more appropriate when comparing morphological similarities. However, semantically similar words may not be morphologically similar, e.g., "Neural Networks" and "Deep Learning." (7) SciBERT's performance is 3-5% below FastText and GloVe, indicating that re-trained WE models exhibit an advantage over pre-trained WE models. This is because SciBERT was trained on the PubMed corpus which mostly incorporates papers in biomedical and life sciences. Also, due to their large dimensions, the training time was greater than FastText under the same parameter settings. (8) The best DANN model beats the best machine learning model (LR) by about 10%. We also investigated the dependency of classification performance on key hyper-parameters. The settings of GLoVe + BiGRU with 128 neurons on two layers are considered as the "reference setting." With the setting of GloVe + BiGRU, we increase the neuron number by factor of 10 (1,280 neurons on two layers) and obtained marginally improved performance by 1% compared with the same setting with 128 neurons. We also doubled the number of layers (128 neurons on four layers). Without attention, the model performs worse than the reference setting by 3%. With the attention mechanism, the micro-F 1 0.75 is marginally improved by 1% with respect to the reference setting. We also increase the default number of neurons of USE to 2048 neurons for four layers. The micro-F 1 improves marginally by 1%, reaching only 0.64. The results indicate that adding more neurons and layers seem to have little impact to the performance improvement.
The second-level classifier is trained using the same neural architecture as the first-level on the "Others" corpus. Figure 2 (Right ordinate legend) shows that F 1 's vary from 0.92 to 0.97 with a median of 0.96. The results are significantly improved by classifying minor classes separately from major classes.

Sampling Strategies
The data imbalance problem is ubiquitous in both multi-class and multi-label classification problems (Charte et al., 2015). The imbalance ratio (IR), defined as the ratio of the number of instances in the majority class to the number of samples in the minority class (García et al., 2012), has been commonly used to characterize the level of imbalance. Compared with the imbalance datasets in Table 1 of (Charte et al., 2015), our data has a significantly high level of imbalance. In particular, the highest IR is about 49,000 (#Physics/#Art). One commonly used way to mitigate this problem is data resampling. This method is based on rebalancing SC distributions by either deleting instances of major SCs (undersampling) or supplementing artificially generated instances of the minor SCs (oversampling). We can always undersample major SCs, but this means we have to reduce sample sizes of all SCs down to about 15 (Art; Section 5), which is too small for training robust neural network models. The oversampling strategies such as SMOTE (Chawla et al., 2002) works for problems involving continuous numerical quantities, e.g., SalahEldeen and Nelson (2015). In our case, the synthesized vectors of "abstracts" by SMOTE will not map to any actual words because word representations are very sparsely distributed in the large WE space. Even if we oversample minor SCs using semantically dummy vectors, generating all samples will take a large amount of time given the high dimensionality of abstract vectors and high IR. Therefore, we only use real data.

Category Overlapping
We discuss the potential impact on classification results contributed by categories overlapping in the training data. Our  initial classification schema contains 104 SCs, but they are not all mutually exclusive. Instead, the vocabularies of some categories overlap with the others. For example, papers exclusively labeled as "Materials Science" and "Metallurgy" exhibit significant overlap in their tokens. In the WE vector space, the semantic vectors labeled with either category are overlapped making it hard to differentiate them. Figure 5 shows the confusion matrices of the closely related categories such as "Geology," "Mineralogy," and "Geochemistry Geophysics." Figure 6 is the t-SNE plot of abstracts of closely related SCs. To make the plot less crowded, we randomly select 250 abstracts from each SC as shown in Figure 5. Data points representing "Geology," "Mineralogy," and "Geochemistry Geophysics" tend to spread or are overlapped in such a way that are hard to be visually distinguished.
One way to mitigate this problem is to merge overlapped categories. However, special care should be taken on whether these overlapped SCs are truly strongly related and should be evaluated by domain experts. For example, "Zoology," "PlantSciences," and "Ecology" can be merged into a single SC called "Biology" (Gaff, 2019; private communication). "Geology," "Mineralogy," and "GeoChemistry GeoPhysics" can be merged into a single SC called "Geology." However, "Materials Science" and "Metallurgy" may not be merged (Liu, 2019; private communication) to a single SC. By doing the aforementioned merges, the number of SCs is reduced to 74. As a preliminary study, we classified the merged dataset using our best model (retrained FastText + BiGRU + Attn) and achieved an improvement with an overall micro-F 1 score of 0.78. The classification performance of "Geology" after merging has improved from 0.83 to 0.88.

Limitations
Compared with existing work, our models are trained on a relatively comprehensive, large-scale, and clean dataset from WoS. However, the basic classification of WoS is at the journal level and not at the article level. We are also aware that the classification schema of WoS may change over time. For example, in 2018, WoS introduced three new SCs such as Quantum Science and Technology, reflecting emerging research trends and technologies (Boletta, 2019). To mitigate this effect, we excluded papers with multiple SCs and assume that the SCs of papers studied are stationary and journal level classifications represent the paper level SCs.
Another limitation is the document representation. The BoW model ignores the sequential information. Although we experimented on the cases in which we keep word tokens in the same order as they appear in the original documents, the exclusion of stop words breaks the original sequence, which is the input of the recurrent encoder. We will address this limitation in FIGURE 5 | Normalized Confusion Matrix for closely related classes in which a large fraction of "Geology" and "Mineralogy" papers are classified into "GeoChemistry GeoPhysics" (A), and a large fraction of Zoology papers are classified into "biology" or "ecology" (B), a large fraction of "TeleCommunications," "Mechanics" and "EnergyFuels" papers are classified into "Engineering" (C).

APPLICATION TO CITESEERX
CiteSeerX is a digital library search engine that was the first to use automatic citation indexing (Giles et al., 1998). It is an open source search engine that provides metadata and full-text access for more than 10 million scholarly documents and continues to add new documents (Wu et al., 2019). In the past decade, it has incorporated scholarly documents in diverse SCs, but the distribution of their subject categories is unknown. Using the best neural network model in this work (FreeText + BiGRU + Attn), we classified one million papers randomly selected from CiteSeerX into 104 SCs ( Table 1). The fraction of Computer Science papers (19.2%) is significantly higher than the results in Wu et al. (2018), which was 7.58%. The F 1 for Computer Science was about 0.94 for Computer Science which is higher than this work (about 0.80). Therefore, the fraction may be overestimated here. However Wu et al. (2018), had only six classes and this model classifies abstracts into 104 SCs, so although this compromises the accuracy (by around 7% on average), our work can still be used as a starting point for a systematic SC classification. The classifier classifies one million abstracts in 1,253 s implying that will be scalable on multi-millions of papers.

CONCLUSION
We investigated the problem of systematically classifying a large collection of scholarly papers into 104 SC's using neural network methods based only on abstracts. Our methods appear to scale better than existing clustering-based methods relying on citation networks. For neural network methods, our retrained FastText or GloVe combined with BiGRU or BiLSTM with the attention mechanism gives the best results. Retraining WE models and using an attention mechanism play important roles in improving the classifier performance. A two-level classifier effectively improves our performance when dealing with training data that has extremely imbalanced categories. The median F 1 's under the best settings are 0.75-0.76. One bottleneck of our classifier is the overlapping categories. Merging closely related SCs is a promising solution, but should be under the guidance of domain experts. The TF-IDF representation only considers unigrams. Future work could consider n-grams or concepts (n ≥ 2) and transfer learning to adopt word/sentence embedding models trained on nonscholarly corpora (Arora et al., 2017;Conneau et al., 2017). One could investigate models that also take into account stopwords, e.g., Yang et al. (2016). One could also explore alternative optimizers of neural networks besides Adam, such as the Stochastic Gradient Descent (SGD). Our work falls into the multiclass classification, which classifies research papers into flat SCs. In the future, we will investigate hierarchical multilabel classification that assigns multiple SCs at multiple levels to papers.

DATA AVAILABILITY STATEMENT
The Web of Science (WoS) dataset used for this study is proprietary and can be purchased from Clarivate 6 . The implementation software is open accessible from GitHub 7 . The testing datasets and CiteSeerX classification results are available on figshare 8 .

AUTHOR CONTRIBUTIONS
BK designed the study and implemented the models. He is responsible for analyzing the results and writing the paper. SR is responsible for model selection, experiment design, and reviewing methods and results. JW was responsible for data management, selection, and curation. He reviewed related works and contributed to the introduction. CG is responsible for project management, editing, and supervision.