Recently, automation is considered vital in most fields since computing methods have a significant role in facilitating work such as automatic text summarization. However, most of the computing methods that are used in real systems are based on graph models, which are characterized by their simplicity and stability. Thus, this paper proposes an improved extractive text summarization algorithm based on both topic and graph models. The methodology of this work consists of two stages. First, the well-known TextRank algorithm is analyzed and its shortcomings are investigated. Then, an improved method is proposed with a new computational model of sentence weights. The experimental results were carried out on standard DUC2004 and DUC2006 datasets and compared to four text summarization methods. Finally, through experiments on the DUC2004 and DUC2006 datasets, our proposed improved graph model algorithm TG-SMR (Topic Graph-Summarizer) is compared to other text summarization systems. The experimental results prove that the proposed TG-SMR algorithm achieves higher ROUGE scores. It is foreseen that the TG-SMR algorithm will open a new horizon that concerns the performance of ROUGE evaluation indicators.
In the field of automatic text summarization, most of the computing methods used in real systems are based on graph models, which are characterized by their simplicity and stability. First of all, the graph model does not require pre-training, and can directly calculate the weight coefficients of text sentences, which effectively avoids the disadvantage of relatively lack of training samples in the field of text summarization. Secondly, the weight contribution strategy of the graph model makes the selected abstract sentences not particularly remote, even if the extracted sentence is not the target abstract sentence. It also expresses relevant content without extracting sentences that are completely unrelated to the topic of the article.
In many areas, graph-based ranking methods such as Kleinberg’s HITS (Hyperlink-Induced Topic Search) and Google’s Pagerank algorithms have had considerable success. A graph model is a concept produced by abstraction of the real-world. Based on graph theory, the modeling process of a graph model is a series of abstract expressions of concrete things. Nodes are abstraction of specific things and can have some inherent properties. Edges represent the relationship between nodes and usually an edge has a weight representing the distance between two nodes. The computation of node weights based on relationships is an important feature of graph models.
In the field of text summarization, various methods have been proposed in the literature. For example, TextRank [
In this context, this paper improves the TextRank algorithm by using the Latent Dirichlet Analysis (LDA) model [
The rest of the paper is structured as follows. Section 2, presents the state of the art in automatic text summarization. Section 3 gives an overview of the TextRank algorithm. An improved graph model-based summarization algorithm is proposed in Section 4. The experimental results are presented in Section 5. Finally, Section 6 summarizes this paper.
The rise of statistics in the 1950 s inspired the emergence of text summarization techniques. Lunh [
With the growth of the Internet in the 1990 s, the volume of documents grew at an exponential rate. Simultaneously, advances in machine learning have helped natural language processing, providing additional impetus to text summarization technologies. Kupiec et al. [
With the development of graph model theory, text summarization methods based on graph model appears. TextRank [
Recently, Widyassari et al. [
The main idea of TextRank comes from the famous page ranking model PageRank algorithm developed by Lary Page, the co-founders of Google. PageRank algorithm believes that the importance of a web page depends on the number and the quality of sites linked to it. Suppose there is a set of
There is an
Let the set of endpoints be
Until convergence or the maximum number of iterations
The basic idea of the PageRank algorithm can be summarized as follows. Treat each web page as a node in a graph, and each web page will have an initial weight. The relationships between web pages are based on hyperlinks to other web pages. If some web pages are endpoints, the damping factor method will be adopted to give the web page an estimated value of 0.85 for linking to other ones. After the graph model is established, each web page will distribute its own weight equally to all the web pages it wants to link to. Finally, a mutual equilibrium is reached. After each iteration of weight sharing, if the weight decreases below a certain threshold, and the algorithm converges, the final weight of each web page can be determined. This can effectively increase the importance of quality web pages.
TextRank algorithm is an important method similar to PageRank algorithm for text summarization or keyword extraction. A very important improvement is that the graph model constructed by the PageRank algorithm is a directed graph, and the edges in the graph have no weights. Mihalcea and Tarau [
After the PageRank algorithm obtains the stationary vector
The weight at each iteration of the TextRank algorithm is calculated as
One difference between
This similarity of sentence A relative to sentence B is measured using the BM25 (Best Matching-25) algorithm [
In the graph model constructed by the TextRank algorithm, text summaries are calculated as follows. The nodes select the shortest sentences separated by commas, periods, semicolons and exclamation marks. The relationship between two nodes is established based on whether the sentences represented by the two nodes have the same word. If the same word appears, an edge is established, and the weight of the edge is calculated. A matrix of edge-to-edge relationships is calculated using the BM25 algorithm, and then iteratively calculated using
The above mainly introduces the basic idea of the sorting algorithm based on the graph model, the construction process of the nodes and edges in the graph model, the introduction of the weight calculation method of the undirected edge, and the iterative contribution weight of the graph model. The general approach when using a graph model for text summarization is given in Algorithm 1.
Usually, the smallest unit that can be analyzed is selected as the node in the graph model. In text summarization, the smallest unit that can be analyzed is a complete sentence, so the node in the constructed graph model is the abstraction of each sentence.
TextRank algorithm compares whether two sentences contain the same word as a basis to establish relationship between nodes. In order to reduce the interference of words that have no actual meaning, such as conjunctions, words such as function words, conjunctions, and modal particles. These words cannot represent the stem of the sentence and are usually removed. But in practice, this still has many drawbacks.
All languages in the world have polysemy. Often a word has a different meaning in one usage context than in another usage context. This leads to the fact that although the word is the same word from the participle level, they have different meanings in different sentences. Therefore, it is obviously not advisable to create systemically an edge between two nodes containing the same word like in TextRank algorithm.
Also due to the diversity of languages, the phenomenon of synonyms and polysemous words is also very common. Basically, synonyms mean the same thing, but use different words. Usually, in order to read fluently and highlight the writing level of the writer, if an article needs to express a meaning repeatedly, different words are usually used to describe the same meaning, that is, the use of synonyms in the article. Rules for edge building, like in TextRank algorithm, cannot recognize that two sentences are related and need to build edges in this case. If an edge is not created for this kind of word, the two nodes of the connection are disconnected.
The weight of an edge is a measure of the distance between two nodes. Mihalcea and Tarau [
However, the method of establishing an edge between two nodes is simply based on the appearance of the same word between two sentences and ignores the semantic-level attributes of the word. An improved method is proposed in this paper and will be described in Section 4. Since, BM25 algorithm is a word-based sentence correlation calculation method, so it is no longer applicable to the improved algorithm model.
Since the method of judging whether the corresponding two nodes in the graph model should establish an edge based on whether the two sentences contain the same keyword has the disadvantage of only considering the application of the word and ignoring the specific meaning of the word. This paper hopes that a new evaluation method can perfectly solve the phenomenon of synonyms and polysemy. Based on these considerations, a text summarization method based on an LDA-based topic model is proposed.
The LDA (Latent Dirichlet Analysis) [
It can be seen that, understanding sentences from the topic level, seeing sentences as composed of topics. Therefore, it is judged whether two sentences are related according to whether they contain the same topic, and the relationship between the two sentences is revealed from the semantic level. In this way, it not only finds interrelated sentences by using words as association judgment objects, but also reveals deeper semantic associations.
The establishment of the edge relationship is based on whether the two sentences contain the same topic. If the weight of the edge is still calculated based on the words on the sentence, it is equivalent to finding a path from sentence A to sentence B. But it is measured by the length of the other path, which is obviously wrong. It is necessary to find a method that can describe the approximation of the topics expressed on the sentences. The relative entropy is used to obtain the degree of similarity of two sentences.
The rationale for this is that each sentence uses its topic probability distribution as a benchmark to measure how similar other sentences are to its own topic probability distribution. You can also use another strategy to find a topic probability distribution that can be used as a benchmark in the full text, calculate the distance from each sentence to this benchmark, and then calculate the distance between the two sentences. Doing so becomes computationally more complex and it is relatively difficult to find a baseline for comparison. Using this asymmetric feature, not only leads to simpler calculation, but also the difference between other sentences and the current sentence can be reflected.
The difference between two probability distributions
According to Shannon’s theory, if the character set’s probability distribution is known, a method for encoding the character set with the fewest number of bits may be devised using this probability distribution. Let a character set
In order to encode characters corresponding to the probability distribution
If the two probability distributions are the same, relative entropy a value of 0, else it is greater than zero. It can be deduced from
Because relative entropy is dependent on the difference between the test and reference distributions, it cannot be used as an absolute distance measurement technique because it lacks symmetry and transitivity.
It is often assumed that the ideas presented in a sentence can more fully contain the document’s subject if the distribution of the topic on this sentence is similar to the distribution on the document than other sentences. We need to calculate the relative entropy of a sentence to the document, without calculating the relative entropy of the document to a sentence. As a result, relative entropy’s asymmetric nature will have no influence on the information that require attention.
Based on the abovementioned, the document’s topic distribution is viewed as a hypothetical probability distribution, while the sentence’s probability distribution is treated as the actual probability distribution. Then, a method is developed for determining sentence similarity and sentence weight, using relative entropy.
The probability distribution of a topic on the sentence
The similarity of two sentences
Before the graph model iterative contribution weight calculation, each sentence TF-IDF: This feature value reflects the particularity of a word in a document. Words that appear more frequently in many documents and appear less frequently in other documents are relatively important, and their weights are larger. The calculation method is as follows: Sentence position: Baxendale [ Sentence length: Word-based calculations are all affected by the sentence length. The longer the sentence, the more words it contains, the more weight it will get. It will influence the result if it is too long or too short. The sentence length is calculated according to the Gaussian distribution function as follows:
where
where
Based on the above three characteristics, a method for synthesizing all weights is given in
The basic unit of computation of the proposed improved method has changed from observable word features to topic features hidden behind words. There are new methods for the establishment of the corresponding edge and the calculation of the weight of the edge, and the iterative contribution method for the transformed model is similar to the TextRank algorithm. An overview of the text summarization method after improving the graphical model is presented in Algorithm 2.
In the above algorithm, the step of calculating the edge weight can be calculated in the sixth step, but it is listed separately in order to highlight the importance of this step. The size of
In order to verify the performance of our proposed algorithm, this paper uses the public dataset of the DUC (Document Understanding Conference) conference for verification, and uses the automatic evaluation metric ROUGE [
The DUC is currently one of the most influential evaluation conferences in the field of text summarization. DUC has a large-scale text summary dataset, which is considered to be the most authoritative datasets that can be compared. At the same time, it uses the objective evaluation criterion ROUGE (Recall-Oriented Understudy for Gisting Evaluation). The purpose is to enable all participants to evaluate on these public open corpora, not only to promote text summarization datasets to become more standard and unified, but also to promote more research scholars to engage in text summarization research.
This paper selects DUC2004 and DUC2006 datasets for experiments (see
Documentation set | DUC2004 | DUC2006 |
---|---|---|
Number of categories | 50 | 50 |
Number of documents under each category | 10 | 10 |
Total number of documents | 500 | 500 |
Data source | TDT | TREC |
Abstract length requirements | 665 bytes | 250 words |
In the modeling and training process of the LDA model, the number of implicit topics
In this paper, a Bayesian statistical standard method is used to obtain the number of topics
In this paper, on the DUC2004 and DUC2006 datasets respectively, the number of topics
After setting the experimental parameters, this paper uses the improved algorithm proposed in this paper, that is, based on the LDA topic model and relative entropy to improve the edge establishment conditions of the graph model and the edge weight calculation method, to perform automatic text summarization on the DUC2004 and DUC2006 datasets.
In order to compare the performance of our proposed method, this paper uses four text summarization algorithms for comparison in terms of ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L, and ROUGE-SU. TextRank: A graph model-based high-quality text summarization algorithm, described in detail in Section 3. It’s widely used in practical project applications. TeamBest: The best performing computational model among competing algorithms on the datasets. Doc-LDA: Sort by topic probability, then select topics from large to small, and then select sentences with a relatively high probability of expressing the topic as summary sentences according to the topic. KL-Based: Use KL divergence to measure the importance of sentences in the way that sentences are thematically similar to articles.
ROUGE-1 | ROUGE-2 | ROUGE-3 | ROUGE-SU | ROUGE-L | ||
---|---|---|---|---|---|---|
With stop-words | TextRank | 0.389 | 0.051 | 0.062 | 0.327 | 0.099 |
TeamBest | 0.457 | 0.103 | 0.073 | 0.425 | 0.139 | |
Doc-LDA | 0.432 | 0.099 | 0.066 | 0.417 | 0.128 | |
KL-Based | 0.402 | 0.101 | 0.068 | 0.422 | 0.128 | |
TG-SMR | ||||||
Without stop-words | TextRank | 0.342 | 0.046 | 0.058 | 0.300 | 0.095 |
TeamBest | 0.432 | 0.094 | 0.071 | 0.390 | 0.124 | |
Doc-LDA | 0.421 | 0.093 | 0.062 | 0.398 | 0.116 | |
KL-Based | 0.392 | 0.097 | 0.059 | 0.419 | 0.111 | |
TG-SMR | 0.063 | 0.379 | 0.123 |
ROUGE-1 | ROUGE-2 | ROUGE-3 | ROUGE-SU | ROUGE-L | ||
---|---|---|---|---|---|---|
With stop-words | TextRank | 0.473 | 0.094 | 0.042 | 0.429 | 0.132 |
TeamBest | 0.512 | 0.124 | 0.072 | 0.478 | 0.152 | |
Doc-LDA | 0.489 | 0.117 | 0.053 | 0.452 | 0.139 | |
KL-Based | 0.494 | 0.122 | 0.053 | 0.431 | 0.148 | |
TG-SMR | 0.068 | |||||
Without stop-words | TextRank | 0.464 | 0.083 | 0.038 | 0.305 | 0.125 |
TeamBest | 0.498 | 0.094 | 0.071 | 0.390 | 0.134 | |
Doc-LDA | 0.474 | 0.105 | 0.048 | 0.323 | 0.125 | |
KL-Based | 0.450 | 0.111 | 0.043 | 0.428 | 0.138 | |
TG-SMR | 0.042 |
From
The ROUGE indicator has a value range of 0 to 1. The automated abstract becomes closer to the expert abstract as it gets closer to 1. On the DUC2004 dataset, the scores of the algorithms ROUGE-1 and ROUGE-2 proposed in this paper are higher than that of TeamBest. On the set, the algorithm proposed in this paper is higher than the scores of TeamBest on ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU. In other cases, our proposed algorithm TG-SMR is not much different from TeamBest. Compared with the other three algorithms, ROUGE scores of TG-SMR have been improved, especially compared to the TextRank algorithm. ROUGE-1 indicates the degree of similarity between automatic and expert summarizing, whereas ROUGE-2 denotes the smoothness of summarization. The scores of TG-SMR algorithm on ROUGE-1 and ROUGE-2 are higher than those of the other four algorithms, reflecting the superiority of text summarization based on semantic level and the effectiveness of TG-SMR algorithm.
After introducing the basic idea and calculation model of the TextRank algorithm, this paper points out its defects in the scenarios of synonyms and polysemy. An improved method is proposed, which relies on the semantic level association to reconstruct the graph model, thus fundamentally solving the defects of the TextRank algorithm in multi-semantic scenarios. Finally, through experiments on the DUC2004 and DUC2006 datasets, our proposed improved graph model algorithm TG-SMR is compared to other text summarization systems. The experimental results prove that our algorithm has the best performance on the majority of ROUGE evaluation indicators.