The growing collection of scientific data in various web repositories is referred to as Scientific Big Data, as it fulfills the four “V’s” of Big Data: volume, variety, velocity, and veracity. This phenomenon has created new opportunities for startups; for instance, the extraction of pertinent research papers from enormous knowledge repositories using innovative methods has become an important task for researchers and entrepreneurs. Traditionally, the content of the papers is compared to list the relevant papers from a repository. This conventional method results in a long list of papers that is often impossible to interpret productively. Therefore, a novel approach that intelligently utilizes the available data is urgently needed. Moreover, the primary element of the scientific knowledge base is the research article, which consists of various logical sections such as the Abstract, Introduction, Related Work, Methodology, Results, and Conclusion. This study utilizes these logical sections of research articles, because they hold significant potential for finding relevant papers. Comprehensive experiments were performed to determine the role of the logical sections-based term indexing method in improving the quality of results (i.e., retrieving relevant papers). To this end, we proposed, implemented, and evaluated a logical sections-based content comparison method and compared it with a standard method of indexing terms. The section-based approach outperformed the standard content-based approach in identifying relevant documents from all classified topics of computer science. Overall, the proposed approach extracted 14% more relevant results from the entire dataset. As the experimental results suggest that employing a finer content similarity technique improves the quality of results, the proposed approach lays a foundation for knowledge-based startups.
The number of scientific research publications on the web has been increasing over the past several years [
This significant amount of data on different web repositories hinders the process of retrieving relevant information in a concrete manner. Millions of generic hits and irrelevant documents are returned by contemporary indexers, posing a challenging task for researchers. Consequently, the problem has grabbed the attention of scholarly communities, and researchers are in the process of developing effective solutions. Subsequently, the solutions to such problems may lead to the foundation of new and emerging startups. Therefore, the community is seeking solutions through various perspectives such as by exploiting citation, metadata, collaborative filtering, and content-based approaches.
Citations are deemed to be a great source of information for recommending relevant papers. Researchers have proposed various citation-based techniques, including bibliographic coupling [
In addition, metadata-based techniques, suggested by [
Another widely used approach to obtain relevant documents is collaborative filtering, which determines relevant documents by utilizing collaborative knowledge. These recommendations are based on user profiles and past preferences of the user’s taste [
In content-based approaches, the content of two individual papers is analyzed to determine their relevance with respect to each other. For example, papers with contents similar to that of focused paper “A” will be considered more relevant for paper “A” [
In context, IMRAD (Introduction, Method, Results, and Discussion) is a common structure for organizing a research article, which was introduced by Louis Pasteur [
As discussed above, each logical section has its own importance and significance; however, standard content-based approaches treat the entire document at the same level of importance. For example, if a term in the “Abstract” section of paper “A” matches a term in the “Conclusion” section of paper “B,” the standard content-based approaches will declare that paper “B” is relevant to paper “A,” although these documents might not actually be relevant. In contrast, if the importance of logical sections is defined and a term from the “Abstract” section of paper “A” matches a term in the “Abstract” section of paper “B,” then the probability of the two documents being relevant may increase.
Similarly, if two researchers have independently proposed two different algorithms to solve the same problem in papers “A” and “B,” respectively, then the methodology sections of both papers are likely to share a higher number of matching terms. However, a survey paper “C” covering the same problem area might have a higher number of matched terms with paper “A” across the entire content of the paper. In this case, standard content-based approaches will consider survey paper “C” to be more relevant to paper “A” than paper “B”; however, papers “A” and “B” are more related in reality. Thus, considering all of the aforementioned issues, we present a study that identifies
Therefore, we performed a section-wise content comparison between research papers to address the objectives of the study. A term vector was formed for each logical section of each scientific paper. The corresponding vectors were then compared in a standard manner using cosine similarity, as in the content-based approach. Consequently, the terms appearing in a logical section were compared only with the terms occurring in the corresponding logical section of the other papers. The proposed approach was comprehensively evaluated using data from each topic under the ACM classification hierarchy. The section-based approach outperformed the content-based approach in identifying relevant documents in all topics of computer science. The gain percentage varied from 36% for Topic-E “Data” to 2% for Topic-D “Software.” Overall, the gain percentage of the proposed approach was 14% for the entire dataset.
The rest of the paper is organized as follows. The literature review is presented in Section 2 for the validation of the framed hypothesis. The proposed methodology of the study is presented in Section 3. The evaluation of the experiment is elaborated in Section 4 with the experimental setup and considerations, where a sample paper from the classified ACM topics was selected. The experimental results and their comparisons are presented in Section 5, and the results are discussed in Section 6. Finally, the contributions of this study are summarized and concluded in Section 7.
In the above section, the extent of research publications and the estimated quantity of scientific documents were discussed along with commonly faced problems. Consequently, various approaches have been proposed in the literature to help the scientific community in this task. These contemporary approaches are divided into four major categories. The first approach identifies and recommends relevant papers by utilizing user collaborations; the second approach uses the metadata of the papers; the third approach uses citations to identify relatedness between documents; and the fourth approach exploits the content of papers to recommend relevant research papers. Certain hybrid systems utilize two or more of these techniques to enhance the recommendations for relevant papers based on different considerations. This section reviews the most important, recent, and classical methods related to these approaches.
Collaborative filtering-based approaches list relevant documents by exploiting user profiles and past preferences of the users’ choices. Recommender systems envisage a user’s choices and preferences based on his/her accessed and rated items. Thus, collaborative filtering can be categorized into two approaches: model-based and memory-based. In model-based approaches, predictions are made based on the model, which contains information on items and users’ interactions with each other; whereas memory-based approaches exploit the user’s existing rating data for predicting their preference to other items. Therefore, collaborative filtering is considered an important approach because of its high performance and simple requirements, as highlighted by [
Currently, search engines used for academic purposes, such as Scienstein, have become influential and prominent hybrid systems recommending research.
Another approach to recommending relevant papers is based on metadata, where the metadata of a paper, such as the paper title, author names, publication date, and venue, are used to extract relevant documents. Thus, the relevance and discovery of documents are characterized using metadata. Moreover, one of the core services provided to users is the creation and provision of metadata that support the functionality of various digital libraries. In particular, objects of interest and relevant information are accessed utilizing a metadata technique. As the documents and metadata are digital, the alternative implementation of the data can be made accessible. However, conventional metadata are less likely to exist in digital libraries. Metadata in digital libraries help in document discovery, network management, visibility, and organizational memory [
Thus, metadata are an important source for creating recommendations of relevant research papers. Moreover, recommendation systems based on metadata work efficiently, mostly because only a few terms need to be analyzed for recommending relevant research papers. However, because the metadata comprise only a small number of terms (generated from authors’ keywords, title terms, and categories), the quality of the recommendations is limited: it is difficult for a recommender system to make concrete decisions by analyzing so few terms. Therefore, several authors have developed hybrid approaches that use metadata as well as collaborative filtering, content, and citations to make accurate recommendations.
Citations form a highly important data source that various techniques exploit for recommending relevant research papers. The two common techniques pertaining to this field are bibliographic coupling [
Most of the citation-based techniques use citation network information. These techniques provide adequate and appropriate recommendations, because the citations are carefully hand-picked by the authors. However, these approaches work well only within certain citation networks, because authors cannot cite every relevant paper in their research. Relevant papers that have not been cited become weak candidates for discovery. Thus, citation-based approaches have a considerably high chance of missing relevant papers.
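As a minimal illustration of one citation-based measure named above, bibliographic coupling is commonly defined as the number of references that two papers share. The sketch below assumes hypothetical paper identifiers and reference lists; it is not the implementation used in any of the cited works.

```python
def bibliographic_coupling(refs_a, refs_b):
    """Return the number of references shared by two papers."""
    return len(set(refs_a) & set(refs_b))


# Hypothetical reference lists (identifiers are placeholders)
paper_a = ["r1", "r2", "r3", "r4"]
paper_b = ["r3", "r4", "r5"]
paper_c = ["r6"]  # may be topically related, but cites none of the same works

print(bibliographic_coupling(paper_a, paper_b))
print(bibliographic_coupling(paper_a, paper_c))
```

The second call illustrates the limitation discussed above: a paper that shares no references with the source receives a coupling strength of zero regardless of how related its content is.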
In content-based approaches, the contents of two papers are analyzed to determine their relevance. For example, papers whose content is more similar to that of focused paper A will be considered more relevant to paper A. Thus, content-based recommendation systems analyze the internal content of the documents to recommend relevant papers [
This section comprehensively delineates the proposed methodology of the current study; the architecture of the proposed methodology is presented in
The dataset for the proposed approach had to cover a vast number of topics and be comprehensive enough to support the conclusions of the research. In addition, the dataset had to provide access to the logical sections of the papers for section-wise term extraction and matching. Based on these requirements, we selected the dataset of the
The text of the PDF files available on the J.UCS server had to be extracted before their content could be used. Thus, the PDF files were converted into XML format using a tool named PDFx with the approach adopted in [
The content-based approach was implemented first, because the proposed technique extends it. The standard implementation of the content-based approach was performed using the Apache Lucene API, which is widely applied to identify content/word similarities [
The first step of implementing the content-based approach involves the acquisition of important terms from a research paper, and inputting the text files to the Apache Lucene API, where the term extractor, TF–IDF, extracts terms from the entire document using
This measure has been extensively used by the scientific community to select important terms from text documents [
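The term-weighting step can be sketched as follows. This is a minimal textbook TF–IDF (raw term count multiplied by log(N/df)) written in plain Python; it is an assumption made for illustration and is not the scoring formula that Apache Lucene applies internally, which uses its own normalization. The toy documents and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Returns one {term: weight} dict per document, with
    tf = raw count in the document and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical, already tokenized documents
docs = [
    "section based term indexing".split(),
    "content based term indexing".split(),
    "citation network analysis".split(),
]
w = tf_idf(docs)
# Terms of the first document, most important first
top_terms = sorted(w[0], key=w[0].get, reverse=True)
print(top_terms)
```

Terms that occur in few documents receive higher weights, which is why such a measure is useful for selecting the important terms of a paper.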
The cosine similarity measure is widely applied to compute content similarity between research documents [
This similarity measure is available in Apache Lucene. The cosine similarities of each document with all other documents of the dataset were computed. Moreover, the ranked list of other similar research documents was retrieved based on the descending scores for each document, as shown on the right-hand side of
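The whole-document comparison and ranking described here can be sketched as below. This is a plain-Python illustration under the assumption that each document is represented as a sparse {term: weight} vector; the study itself used Apache Lucene, whose API is not reproduced here, and the document identifiers and weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank(query_id, vectors):
    """Rank all other documents by descending similarity to the query."""
    q = vectors[query_id]
    scores = [(d, cosine(q, v)) for d, v in vectors.items() if d != query_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical term vectors for three documents
vectors = {
    "A": {"index": 1.0, "section": 2.0},
    "B": {"index": 1.0, "section": 1.0},
    "C": {"citation": 3.0},
}
ranking = rank("A", vectors)
print(ranking)
```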
The same dataset was converted into logical sections to apply a section-wise content-based approach in finding relevant documents. The steps followed in this approach included certain additional tasks, which are shown on the left-hand side of
In the current study, all the text files were converted into six logical sections: “Abstract,” “Introduction,” “Related work,” “Methodology,” “Results,” and “Conclusion.” The section headings appearing in the research papers were converted into these logical sections using the approach proposed in [
The process of section-wise term extraction is similar to that discussed in Section 2.3.1. However, the content of each section (“Abstract,” “Introduction,” etc.) of a paper was separately marked for Apache Lucene, because the proposed approach requires the important terms to be extracted separately from each section of each research paper. The extracted section-wise terms were then indexed in a database for future usage, as illustrated on the left-hand side of
Cosine similarity was applied to the indexed terms to compute the similarity score between the source and target papers, where each paper was represented by six term vectors, one per logical section. The corresponding term vectors of each pair of papers were compared, i.e., six similarity scores were obtained for each paper based on the matching of the corresponding term vectors (“Abstract” with “Abstract,” “Introduction” with “Introduction,” “Related work” with “Related work,” “Methodology” with “Methodology,” “Results” with “Results,” and “Conclusion” with “Conclusion”) of every other paper. The final similarity score of each paper was evaluated by averaging all the scores obtained in each ACM topic with respect to all other papers. In this way, a ranked list of relevant papers was acquired based on the similarity score of each paper arranged in descending order. The similarity score was computed based on the cosine similarity between the papers, as shown on the left-hand side of
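The section-wise comparison can be sketched as follows. The sketch assumes that the final score for a pair of papers is the unweighted mean of the six section-to-section cosine scores and that a section missing from either paper contributes zero; the text states only that scores were averaged, so both choices are assumptions. The paper dictionaries and term weights are hypothetical.

```python
import math

SECTIONS = ["Abstract", "Introduction", "Related work",
            "Methodology", "Results", "Conclusion"]


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def section_similarity(paper_a, paper_b):
    """Mean of the six section-to-section cosine scores.

    Each paper maps a section name to a sparse term vector.
    Assumption: a section missing from either paper scores 0.
    """
    scores = [cosine(paper_a.get(s, {}), paper_b.get(s, {})) for s in SECTIONS]
    return sum(scores) / len(SECTIONS)


# Hypothetical papers: B shares terms with A in the same sections,
# whereas C contains the same terms only in its conclusion.
paper_a = {"Abstract": {"retrieval": 1.0}, "Methodology": {"cosine": 1.0}}
paper_b = {"Abstract": {"retrieval": 1.0}, "Methodology": {"cosine": 1.0}}
paper_c = {"Conclusion": {"retrieval": 1.0, "cosine": 1.0}}

print(section_similarity(paper_a, paper_b))
print(section_similarity(paper_a, paper_c))
```

Under these assumptions, paper C scores zero against paper A even though a whole-document comparison would find two shared terms, because terms are matched only within corresponding sections.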
Based on the similarity scores computed above, two separate ranked lists were formed: one for content-based similarity and another for section-wise content similarity of the papers. These lists are required to be evaluated based on certain benchmarks; however, to the best of our knowledge, there is no standard benchmark that can be employed to evaluate the results of the proposed study. Therefore, we constructed a benchmark to compare and evaluate both approaches, as shown in the lower part of
The development of a gold standard dataset was a crucial task, because there was no available benchmark that could be used for evaluating the proposed approach in the domain of relevant paper recommendations. Normally, authors logically define such standards or prefer user studies. In this study, the documents belonging to the same ACM topic were defined as relevant documents. Moreover, authors manually select suitable ACM topics to represent their research when publishing in J.UCS. Therefore, it could be presumed that the authors chose the best topic(s) representing their papers. Thus, we decided to consider the topic information of every research paper available in J.UCS as the gold standard (benchmark), and both approaches (content and section-wise content) were evaluated against this benchmark. The evaluation process examined the number of top recommendations extracted using both approaches from a list of 200 documents belonging to the topic(s) of the query paper. The complete list of topics developed by ACM, hierarchically classified as topics A to K, is illustrated in
I | II |
---|---|
Topic-A: General literature | Topic-G: Mathematics of computing |
Topic-B: Hardware | Topic-H: Information systems |
Topic-C: Computer systems organization | Topic-I: Computing methodologies |
Topic-D: Software | Topic-J: Computer applications |
Topic-E: Data | Topic-K: Computing milieux |
Topic-F: Theory of computation | |
Comparing the top recommendations of 200 documents with the benchmark would require exhaustive manual effort. Therefore, five papers from each of the topics (“A” to “K”) were selected for evaluation. Moreover, both the approaches were employed to produce results for each of the 200 documents, to comprehensively judge the working of both the approaches. The topics of each selected paper were compared with the topics of top recommendations (top 10, top 15, and top 20) that resulted from using both the techniques. In particular, the topics of paper A (source paper) were compared with the topics of the recommended papers and ranked in a list to find the total number of matched topics. The number of recommended papers, having the same topics as the source paper, was noted, and the detailed results are discussed below.
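The benchmark comparison can be sketched as a count of the top-k recommendations that share at least one ACM topic with the source paper. The paper identifiers, topic assignments, and function names below are hypothetical. The `relative_gain` helper reflects one plausible reading of “gain percentage” (the relative improvement in matched recommendations); the text does not state the exact formula, so this is an assumption.

```python
def topic_matches(source, ranked_ids, topics, k):
    """Count top-k recommended papers sharing at least one topic with the source."""
    source_topics = topics[source]
    return sum(1 for pid in ranked_ids[:k] if topics[pid] & source_topics)


def relative_gain(section_hits, content_hits):
    """Assumed definition: percentage improvement over the content-based count."""
    if content_hits == 0:
        return None  # undefined when the baseline matched nothing
    return 100.0 * (section_hits - content_hits) / content_hits


# Hypothetical topic assignments (a paper may carry several ACM topics)
topics = {
    "p0": {"A"},
    "p1": {"A", "H"},
    "p2": {"D"},
    "p3": {"A"},
    "p4": {"K"},
}
# Hypothetical ranked lists for source paper p0
content_ranking = ["p2", "p1", "p4", "p3"]
section_ranking = ["p1", "p3", "p2", "p4"]

content_hits = topic_matches("p0", content_ranking, topics, k=2)
section_hits = topic_matches("p0", section_ranking, topics, k=2)
print(content_hits, section_hits, relative_gain(section_hits, content_hits))
```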
The topic-wise results of the content and section-based approaches on Topic-A are presented in this section; the rest of the topic results have not been discussed owing to their identical nature.
Topic-A in the ACM topic classification represents papers published under the “General Literature” category. Among the 200 papers present in the employed dataset, 14 papers belonged to Topic-A, whereas 186 papers were related to various other topics, and therefore, considered as noise.
The comparative results for the top-10 recommendations under Topic-A are portrayed in
The section-based approach achieved gain percentages of 11%, 14%, and 33% for the paper
The top-20 recommendations of Topic-A extracted by the two approaches are compared in
Subsequently, the gain/loss percentages of the section-based approach are visualized in
This section presents the comparative results for all 55 papers selected for the study.
In this study, we performed multiple experiments to evaluate the effectiveness of using terms from the individual logical sections rather than extracting terms from the paper as a whole. Thus, a detailed comparison between the section- and content-based approaches was presented. From the ACM classification hierarchy, five papers were considered from each root-level topic (Topics A to K). In addition, there was enough noise (irrelevant papers) under each topic to validate the working efficiency of the section- and content-based approaches. Moreover, topic-wise comparisons and statistics of every paper from the selected list of papers were highlighted in detail. Furthermore, the gain/loss percentages of the section-based approach for the top-10, top-15, and top-20 recommendations were evaluated.
The overall topic-wise gain percentages are shown in the following table. For all topics, except Topic-D in the top-20 recommendations, the section-based approach outperformed the content-based approach. The highest gains of the proposed approach were observed for Topic-E and Topic-I, whereas the gain for Topic-D remained low. The section-based approach remained consistently ahead of the content-based approach for the top-10 and top-15 results; however, the corresponding gain percentages in the top-20 list were not as high as those in the top-10 and top-15 lists, except for Topic-F. Nonetheless, the top-20 recommendations remained in close competition with the top-10 and top-15 results for Topic-A, Topic-E, Topic-F, and Topic-I. The overall gain percentages of the section-based approach were 15% for top-10, 14% for top-15, and 12% for the top-20 recommendations. The average gain percentage across all the topics was 13.72%.
Topic | Top-10 gain (%) | Top-15 gain (%) | Top-20 gain (%) | Overall gain (%) |
---|---|---|---|---|
Topic-A: General literature | 12 | 2 | 9 | 8 |
Topic-B: Hardware | 14 | 26 | 15 | 18 |
Topic-C: Computer systems organization | 3 | 11 | 3 | 6 |
Topic-D: Software | 3 | 4 | −1 | 2 |
Topic-E: Data | 33 | 39 | 35 | 36 |
Topic-F: Theory of computation | 11 | 9 | 19 | 13 |
Topic-G: Mathematics of computing | 10 | 10 | 3 | 8 |
Topic-H: Information systems | 14 | 26 | 15 | 18 |
Topic-I: Computing methodologies | 31 | 29 | 29 | 30 |
Topic-J: Computer applications | 16 | 0 | 3 | 6 |
Topic-K: Computing milieux | 19 | 14 | 7 | 13 |
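The column averages and the best- and worst-performing topics can be recomputed directly from the table as a quick sanity check. The sketch below uses the rounded table entries exactly as given; the reported overall figures were presumably derived from unrounded per-paper values, so the recomputed means need not match them exactly.

```python
# Gain percentages per topic: (top-10, top-15, top-20, overall), copied from the table
table = {
    "A": (12, 2, 9, 8),
    "B": (14, 26, 15, 18),
    "C": (3, 11, 3, 6),
    "D": (3, 4, -1, 2),
    "E": (33, 39, 35, 36),
    "F": (11, 9, 19, 13),
    "G": (10, 10, 3, 8),
    "H": (14, 26, 15, 18),
    "I": (31, 29, 29, 30),
    "J": (16, 0, 3, 6),
    "K": (19, 14, 7, 13),
}
columns = ("Top 10", "Top 15", "Top 20", "Overall")

# Unweighted mean of each column over the eleven topics
means = {
    name: sum(row[i] for row in table.values()) / len(table)
    for i, name in enumerate(columns)
}
for name, value in means.items():
    print(f"{name}: {value:.2f}")

# Topics with the highest and lowest overall gain
best = max(table, key=lambda t: table[t][3])
worst = min(table, key=lambda t: table[t][3])
print(best, worst)
```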
The body of scientific knowledge is rapidly expanding with more than 2 million research papers being published annually. These papers are accessed through various search systems such as general search engines, citation indices, and digital libraries; however, identifying pertinent research papers from these huge repositories is a challenge. On a general query, thousands of papers are returned from these systems, thus making it difficult for end users to find relevant documents. This phenomenon has attracted the attention of researchers to devise state-of-the-art approaches that could assist the scientific community in identifying relevant papers rapidly. These techniques can be categorized into four major categories: (1) content-based, (2) citation-based, (3) metadata-based, and (4) collaborative filtering approaches. Although the content-based approaches have a better recall than other methods, there are limitations that result in a long list of recommendations against user queries. Therefore, researchers have attempted to improve the quality of results by devising more intelligent techniques [ The proposed approach outperformed the traditional content-based approach in identifying relevant documents in all topics of computer science classified under ACM hierarchy. The gain percentage varied from 36% for Topic-E (Data) to 2% for Topic-D (Software). The overall gain percentage of the proposed approach was evaluated at 14% for all topics.
The contribution of our study is related to the vision of applying Scientific Big Data techniques to facilitate the delivery of sophisticated next-generation library services. The retrieval of copious relevant knowledge with the adoption of sophisticated techniques is not just a necessity, but also a decisive action of the global scientific community toward the management of collective wisdom, prosperity, development, and sustainability. The proposed approach holds great potential for further exploration in the future. Although certain previous studies [