News feeds are an important source of updates on topics across many domains. These updates need to be collected and organized, because users interested in a specific domain need the important updates in that domain drawn from various sources. In this paper, a news summarization system is proposed for news data streams from RSS feeds and Google news. Since news stream analysis requires live content, the news data are collected continuously for our experiments. The major contributions of this work are domain corpus based news collection, news content extraction, hierarchical clustering of the news, and news summarization. Many existing news summarization systems do not provide dynamic content with a domain-wise representation. The proposed system addresses this by tagging the news feeds with domain corpora and organizing the news streams in a hierarchical structure with a topic-wise representation. The news streams are then summarized for the users with a novel summarization algorithm. The proposed system generates topic-wise summaries for the user and, to our knowledge, no system in the literature has handled news summarization by collecting the data dynamically and organizing the content hierarchically. The proposed system is compared with existing systems and achieves better results in generating news summaries. Online news content editors benefit from this system, since it instantly provides news summaries for their domain of interest.
Knowledge identification from online news articles has received keen attention among news readers, especially from Really Simple Syndication (RSS) feed-based news updates and Google news [
In this work, a news clustering based summarization system is proposed to cluster various categories of news content from multiple news sources and to generate news summaries on topics of interest to the user. The proposed system is distinctive in how it handles news updates, organizing the news content so that it can be retrieved later. Further, an extractive summary of a specific topic is generated from the clustered news contents. The proposed system has been evaluated for news crawling, news content retrieval and news summarization. The evaluation results show that the proposed system performs better in summarizing the news contents for the end users.
The paper is organized as follows. In Section 2, related works on clustering and news summarization mechanisms are discussed. In Section 3, the architecture of the news retrieval system is explained. In Section 4, the experimental results of the proposed system are discussed. In Section 5, the performance evaluation of the parallel crawler, hierarchical clustering and news summarization method is explained. In Section 6, the conclusion of the work is given.
In recent years, many online recommendation systems have become available to assist users with online shopping according to their knowledge level. Here, we discuss various methods related to data collection, domain corpora, hierarchical clustering and summarization. RSS news feeds are important sources of information from different online websites. Users subscribe only to the feed updates they require [
Multi granularity hierarchical representation [
In addition, it is essential to summarize the categorized news contents to the respective users. Extractive summarization [
Authors | Title | Methodology/Algorithm | Merits | Demerits |
---|---|---|---|---|
Taddesse et al. [ | Semantic-based merging of RSS items | Multi granularity hierarchical representation for summarization | Easy access of the fine grain level data | Not evaluated for multiple domains. Higher retrieval time. |
Xu et al. [ | Research on topic discovery technology for web news | Collaborative filtering-based content retrieval for summary generation | Web browsing behaviour of users is used | User categorization performed but content organization not done. |
Diao et al. [ | CRHASum: extractive text summarization with contextualized-representation hierarchical-attention summarization network | Latent semantic analysis-based summary generation | Semantic correlation among words with mapping of high dimensional words | Evaluation done on only limited corpora. |
Katarya et al. [ | Capsmf: A novel product recommender system using deep learning based text analysis model | Text analysis model for summarization | Applies a deep learning mechanism for content recommendation | Only content analysis done, not evaluated for user query collaborated recommendation. |
Balahur et al. [ | Challenges and solutions in the opinion summarization of user-generated content | Extractive summarization | Contextual information is used for summary generation. | Extracted sentences cause contextual overload in summary generation. |
Long et al. [ | A new approach for multi-document update summarization | Semantic based clustering for summary creation | Semantically identified short texts help for better summary generation | Limited semantic relation established among contents. |
This work is motivated by the related works discussed in this section. The proposed system improves on news summarization methods for news data streams, and content retrieval is simplified with hierarchical news content clustering and user collaborative filtering. As a result, the quality of the generated summaries is improved.
Hierarchical clustering is applied in many content retrieval systems. Since a hierarchical structure provides a topic-wise categorical representation, it is widely used in content structuring works. The retrieval time is considerably lower in a hierarchically structured content retrieval system [
The architecture of the proposed system is shown in
The significance of this research work lies in collecting the news data dynamically and organizing it hierarchically. Further, the news contents are summarized based on the user's query, which is processed with the collaborative filtering method.
The dataset used in this work is collected from the news sources using the news crawler program that we implemented as part of the news summarization system. The RSS news feeds and news data streams are monitored and collected from Google news [
In this work, news summaries are generated based on user interest using the news updates received from numerous sources. The first stage of the work is therefore data collection from various sources. A hierarchical structure is created for the various domains. For example, sports news is categorized into types such as cricket, football and basketball. In addition, a region-wise hierarchy is represented so that the location of the news (country, state, district, city, etc.) can be identified easily. The consolidated summary of the news data collection is shown in
News source | Top news topics | News articles count |
---|---|---|
Google news | Hand sanitizer | 36 |
Google news | Damage to the lungs of COVID 19 patients | 27 |
Google news | Coronavirus symptoms | 46 |
Google news | Coronavirus live in patients for 5 weeks | 34 |
Google news | Vegetarian diet prevents a stroke | 17 |
Times of India | Asymptomatic patient | 36 |
Times of India | Children less vulnerable to coronavirus infections | 25 |
Times of India | Low dose aspirin mitigates liver cancer risk | 32 |
Times of India | List of testing centres in India | 28 |
Times of India | Drugs to tackle coronavirus | 38 |
Hindu | West Bengal government closes all educational institutions | 16 |
Hindu | On IPL and coronavirus | 47 |
Hindu | Coronavirus cases increase in the country | 42 |
Hindu | Coronavirus treatment | 22 |
Hindu | Countries winning corona battle | 19 |
Hindu | Youngest corona virus victim | 18 |
Hindu | India and coronavirus | 11 |
Hindu | Coronavirus: who all can't travel to India | 12 |
Hindu | Avian flu: culling of birds begins in Malappuram | 17 |
Domain | 1-day (10.04.2021) news count | 1-week news count (05.04.2021 to 12.04.2021) | 1-month news count (11.03.2021 to 12.04.2021) | 3-month news count (11.01.2021 to 12.04.2021) |
---|---|---|---|---|
Health | 12 | 136 | 542 | 1654 |
Sports | 15 | 147 | 448 | 1428 |
Business | 10 | 124 | 492 | 1574 |
Technology | 18 | 159 | 656 | 1952 |
Entertainment | 22 | 173 | 752 | 2159 |
Science | 26 | 98 | 398 | 1027 |
Around 97000 words are available in political domain corpus [
Cosine similarity is computed to find similar content in the news updates. A hierarchical clustering algorithm is used to detect the hierarchical structures among the news articles. The algorithm is shown as follows.
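As a minimal sketch of this step, the code below combines cosine similarity with bottom-up (agglomerative) hierarchical clustering. It assumes plain bag-of-words term-frequency vectors and average linkage, neither of which is stated in the text; the function names and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def agglomerative_clusters(texts, threshold=0.3):
    """Average-linkage agglomerative clustering (assumed linkage).

    Repeatedly merges the two most similar clusters until no pair has an
    average pairwise cosine similarity of at least `threshold`.
    Returns (clusters, merges): the final clusters as lists of article
    indices, and the merge history, which encodes the hierarchy.
    """
    vectors = [tf_vector(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    merges = []

    def linkage(c1, c2):
        sims = [cosine_similarity(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best_sim:
                    best_sim, best_pair = s, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        merges.append((clusters[x], clusters[y], best_sim))
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(clusters), merges


# Illustrative headlines (made up for this sketch, not taken from the dataset).
headlines = [
    "coronavirus cases rise in india",
    "india reports new coronavirus cases",
    "nasa mars lander sends images",
    "mars lander images released by nasa",
]
clusters, merges = agglomerative_clusters(headlines)
print(clusters)
```

The merge history records which groups were joined at which similarity, so cutting it at different thresholds yields coarser or finer topic groupings. A production system would more likely use TF-IDF weights and a library implementation.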
The domain of the cluster is also identified with the clustering process. The cluster formation from various news articles is tabulated in
Number of feeds | Number of articles | Number of clusters |
---|---|---|
1026 | 3245 | 12 |
942 | 2287 | 16 |
1124 | 3695 | 24 |
1089 | 3578 | 21 |
846 | 1896 | 13 |
Cluster | Number of feeds | Number of news articles |
---|---|---|
Health | 70 | 749 |
Science | 32 | 198 |
Technology | 70 | 387 |
The clustered articles, with their corresponding clusters and domains, are shown in
Category of the cluster | Cluster topics | Number of news feeds | Number of news articles |
---|---|---|---|
Health | Coronavirus | 32 | 356 |
Health | Alzheimer's disease | 12 | 69 |
Health | Menopause | 4 | 98 |
Health | Cereals for kids | 4 | 95 |
Health | WHO | 2 | 32 |
Health | Ebola outbreak | 8 | 48 |
Health | Dry skin | 5 | 35 |
Health | Midlife sex slump | 3 | 16 |
Science | NASA's Mars Lander | 12 | 82 |
Science | Arctic ocean | 10 | 56 |
Science | Cosmic fire | 2 | 14 |
Technology | Windows 10 | 8 | 33 |
Technology | Motorola | 9 | 47 |
Technology | OnePlus 8 | 2 | 14 |
Technology | Xiaomi | 5 | 36 |
Technology | Samsung | 8 | 47 |
Technology | Sony | 4 | 18 |
Technology | Microsoft | 3 | 17 |
The collaborative filtering algorithm is used to filter similar news content among interested users. The similar news content is added to the recommendation set. A collaborative filtering based score is calculated for every similar news item, and the news item with the maximum score is recommended to the user. The collaborative filtering algorithm is shown as follows.
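The scoring idea can be sketched as user-based collaborative filtering. The sketch below assumes that each user is represented by the set of articles they have read and that user similarity is the Jaccard overlap of those sets. Both are assumptions made for illustration, since the text does not state the similarity measure; `jaccard`, `recommend` and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of article ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, reads):
    """Rank unseen articles for `target`.

    `reads` maps each user to the set of article ids they have read.
    Articles read by similar users form the recommendation set, and each
    one is scored by the summed similarity of the users who read it.
    Returns (article, score) pairs, highest score first.
    """
    seen = reads[target]
    scores = {}
    for user, articles in reads.items():
        if user == target:
            continue
        sim = jaccard(seen, articles)
        if sim == 0:
            continue  # users with no shared interest contribute nothing
        for article in articles - seen:
            scores[article] = scores.get(article, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical reading histories.
reads = {
    "u1": {"a1", "a2"},
    "u2": {"a1", "a2", "a3"},
    "u3": {"a2", "a4"},
    "u4": {"a5"},
}
ranked = recommend("u1", reads)
print(ranked[0][0])  # the article with the maximum score
```

Here `a3` ranks first for `u1` because it comes from the user whose history overlaps most with `u1`; `a5` is never considered because `u4` shares no articles with `u1`.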
The results of the collaborative filtering algorithm are shown in
User query | No. of news articles related to query | No. of correctly recommended news articles to the specific user | Collaborative filtering accuracy |
---|---|---|---|
Badminton | 124 | 96 | 77.41% |
NASA | 142 | 107 | 75.35% |
Pandemic | 136 | 103 | 75.73% |
Delhi metro | 118 | 93 | 78.81% |
COVID19 | 374 | 271 | 72.45% |
We have applied an extraction-based summarization algorithm as a baseline method for summarizing the contents of multiple documents. Further, we have computed the probability distribution of the news for summary generation. The sentence with the maximum score is taken for summary generation. The summarization steps are represented as follows.
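One way to read these steps is as frequency-based extractive summarization: estimate a unigram probability distribution over the collected articles, score each sentence by the average probability of its content words, and keep the highest-scoring sentences. The sketch below follows that reading under stated assumptions. The stop-word list, the averaging rule and all function names are hypothetical, and it is not presented as the paper's algorithm.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "and", "for", "has", "in", "is", "of", "on", "the", "to"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased alphanumeric tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in STOPWORDS]


def summarize(documents, max_sentences=2):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    counts = Counter(w for s in sentences for w in content_words(s))
    total = sum(counts.values())
    if total == 0:
        return []
    prob = {w: c / total for w, c in counts.items()}

    def score(sentence):
        words = content_words(sentence)
        return sum(prob[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]


# Made-up articles for illustration.
articles = [
    "India is in lockdown. The lockdown affects millions.",
    "Weather was sunny today. India extends the lockdown.",
]
print(summarize(articles, max_sentences=2))
```

Sentences built from words that recur across the articles score highest, which favours the shared topic over one-off details. A fuller implementation would also penalize near-duplicate sentences drawn from different feeds.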
The information about the user submitted query and the summary generation details from the news feeds is shown in
Query | Feeds count | Article count | Total words present in article contents | Total words present in the summary |
---|---|---|---|---|
Corona virus | 2 | 6 | 192 | 102 |
Avian flu | 3 | 8 | 300 | 158 |
Delhi violence | 2 | 7 | 184 | 97 |
Liver cancer | 5 | 9 | 176 | 95 |
Vegetarian diet | 3 | 5 | 142 | 84 |
Query | Feeds count | Article count | Summary |
---|---|---|---|
Corona virus | 2 | 6 | The Indian government has defended its handling of the coronavirus outbreak after a strict lockdown—introduced with little warning—left millions stranded and without food. India has been put in lockdown to halt the spread of the coronavirus outbreak. India has been criticised for its poor record of testing people in the battle against coronavirus. A 68-year-old woman from Delhi has been confirmed as the second Indian to die from the coronavirus. With India now in a 21-day lockdown to prevent the spread of the coronavirus, there's been plenty of advice shared on how to prevent or cure the disease. | |
Avian flu | 3 | 8 | Highly pathogenic avian influenza has been reported in new regions of Germany and Hungary. Two children have been confirmed with flu infections of avian origin. Highly pathogenic avian influenza has returned to the Philippines, after an absence of two years, as well as other Asian and European countries. They include 464 chicken, 326 and 173 domestic birds, 20 pets and six turkeys found within the one-kilometre radius of the bird flu epicentre. The Philippines has detected an outbreak of avian flu in a northern province after tests showed presence of the highly infectious H5N6 subtype of the influenza A virus at a quail farm, the country's agriculture secretary said on Monday. Germany has confirmed a case of H5N8 avian flu on a small poultry farm in Saxony – a state that borders Poland and Czech Republic. |
Query | Original summary | Summary generated by the summarizer system used in this work |
---|---|---|
Corona virus | The Indian government has defended its handling of the coronavirus outbreak after a strict lockdown, introduced with little warning, left millions stranded and without food. | The Indian government has defended its handling of the coronavirus outbreak after a strict lockdown, introduced with little warning, left millions stranded and without food. India has been put in lockdown to halt the spread of the coronavirus outbreak. India has been criticised for its poor record of testing people in the battle against coronavirus. A 68-year-old woman from Delhi has been confirmed as the second Indian to die from the coronavirus. With India now in a 21-day lockdown to prevent the spread of the coronavirus, there's been plenty of advice shared on how to prevent or cure the disease. |
Corona virus | India has been put in lockdown to halt the spread of the coronavirus outbreak. People have been told to stay indoors, but for many daily-wage earners this is not an option. The BBC's Vikas Pandey finds out how they were coping in the days leading up to Tuesday's announcement. | |
Corona virus | India has been criticised for its poor record of testing people in the battle against coronavirus. That, however, is set to change, thanks in large part to the efforts of one virologist, who delivered on a working test kit, just hours before delivering her baby. | |
Corona virus | A 68-year-old woman from Delhi has been confirmed as the second Indian to die from the coronavirus. | |
Corona virus | With India now in a 21-day lockdown to prevent the spread of the coronavirus, there's been plenty of advice shared on how to prevent or cure the disease. | |
Corona virus | "We have a simple message to all countries: test, test, test," World Health Organisation (WHO) head Tedros Adhanom Ghebreyesus told reporters in Geneva earlier this week. | |
The summary generated for the actual Google news is shown in
Actual Google news (News 1) | Actual Google news (News 2) | News summary generated by the summarizer system used in this work |
---|---|---|
India Business News: Two employees working with IT companies Dell and Mindtree have been tested positive for coronavirus, according to company statements. The total number of novel coronavirus cases in the country touched 60 today, the health ministry said. | Two fresh cases were reported from Delhi and Rajasthan today. An 85-year-old man in Jaipur tested positive for the disease, a state government official said. Talking about the deadly outbreak of coronavirus, Kerala Health Minister KK Shailaja informed that those who are not revealing their travel history of coming from affected areas will be considered a crime. | The total number of novel coronavirus cases in the country touched 60 today, the health ministry said. Talking about the deadly outbreak of coronavirus, Kerala Health Minister KK Shailaja informed that those who are not revealing their travel history of coming from affected areas will be considered a crime. |
The news collection time for different number of URLs using various crawlers is tabulated in
Crawler | News collection time for 100 URLs (s) | News collection time for 200 URLs (s) | News collection time for 300 URLs (s) | News collection time for 400 URLs (s) |
---|---|---|---|---|
NewsTracker (Proposed system) | 10 | 15 | 22 | 28 |
Mercator [ | 12 | 25 | 36 | 48 |
Focused news crawler [ | 14 | 24 | 38 | 54 |
Semantic web crawler [ | 18 | 28 | 41 | 62 |
The news collector is compared with different news crawlers, and the comparison is shown in
Keywords similar and relevant to the user's input are generated, and the retrieval performance is evaluated. The news retrieval performance for direct user queries and relevance keywords is shown in
The user queries are evaluated on pre-processed keyword indexing, non-pre-processed keyword indexing and non-indexed news contents. The query processing time is tabulated in
Queries | Pre-processed keyword indexing (s) | Non-pre-processed keyword indexing (s) | Non-indexing (s) |
---|---|---|---|
Corona virus | 5 | 12 | 16 |
Avian flu | 7 | 13 | 18 |
Delhi violence | 4 | 9 | 15 |
Liver cancer | 8 | 11 | 21 |
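To illustrate why pre-processed keyword indexing can answer queries faster than scanning unindexed content, the sketch below builds a small inverted index over normalized keywords. It assumes lowercasing and stop-word removal as the pre-processing; the paper's actual pre-processing steps are not listed here, and the function names and sample articles are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def preprocess(text):
    """Lowercase, tokenize and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def build_index(articles):
    """Map each keyword to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for token in preprocess(text):
            index[token].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles containing every query keyword."""
    tokens = preprocess(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up articles for illustration.
articles = {
    1: "Avian flu reported in Germany",
    2: "Liver cancer risk and aspirin",
    3: "Avian flu outbreak in the Philippines",
}
index = build_index(articles)
print(search(index, "avian flu"))
```

With the index in place, a query touches only the posting sets of its keywords instead of every article, which is consistent with the lower processing times reported for the indexed configurations.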
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating summarization. It compares a summary against a set of reference summaries generated by humans [
The precision of the automatic summarization is shown in
Original text | Automatic summary generator | Reference summary by human | Precision using reference summary | Precision using system summary | ROUGE-1 | ROUGE-L |
---|---|---|---|---|---|---|
In the wake of novel coronavirus spread in India, the Delhi Metro services will remain completely closed, the Delhi Metro Rail Corporation (DMRC) declared. | Delhi Metro rail service completely closed till 31 March | Delhi Metro rail service closed till 31 | 7/7 = 1 | 7/8 = 0.875 | 8/9 ≈ 0.889 | |
Further, we applied ROUGE-specific metrics to measure the summary generation. The measures are ROUGE-N, ROUGE-S and ROUGE-L. These refer to the granularity of the text units compared between the system summary and the reference summary. ROUGE-1 refers to the overlap of unigrams between the reference and system summaries. ROUGE-2 refers to the overlap of bigrams between the reference and system summaries. ROUGE-1 and ROUGE-2 are ROUGE-N type measures. It is reported in the literature that ROUGE-1 and ROUGE-L are appropriate for extractive summarization [
We have observed from the summarization evaluation that the ROUGE-N and ROUGE-L measures indicate that 88.88% and 77.77% of the actual news content, respectively, is covered by the generated news summary. Since ROUGE-L measures the longest common subsequence shared by the system and reference summaries, the obtained value indicates that the generated summary covers the required sentences. The summarization performance of the proposed system is compared with other methodologies used in the literature for summarizing document contents. The comparison uses the ROUGE-1 metric, which is an appropriate measure for news text summarization. The comparative results are tabulated in
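The ROUGE-N and ROUGE-L definitions can be made concrete with a short sketch. It assumes lowercased whitespace tokenization and a single reference summary, and it reports both recall and precision. The function names and the example sentences are hypothetical and are not the evaluation data used in this work.

```python
from collections import Counter


def tokenize(text):
    """Lowercased whitespace tokens (a simplifying assumption)."""
    return text.lower().split()


def rouge_n(reference, candidate, n=1):
    """Return (recall, precision) of clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    ref = ngrams(tokenize(reference))
    cand = ngrams(tokenize(candidate))
    overlap = sum((ref & cand).values())  # multiset intersection clips counts
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    return recall, precision


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (recall, precision) based on the longest common subsequence."""
    ref, cand = tokenize(reference), tokenize(candidate)
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    return recall, precision


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(reference, candidate, n=1))  # unigram overlap: 5 of 6 tokens
print(rouge_n(reference, candidate, n=2))  # bigram overlap: 3 of 5 bigrams
print(rouge_l(reference, candidate))       # LCS "the cat on the mat" has length 5
```

ROUGE-L rewards words that appear in the same order without requiring them to be adjacent, which is why it suits extractive summaries that keep whole source sentences.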
Summarization model | ROUGE-1 |
---|---|
Variational auto encoder model [ | 0.608 |
Latent semantic analysis based topic summarization model [ | 0.540 |
LexRank based automatic summarization model [ | 0.484 |
NEWS summarization model (Proposed system) | 0.880 |
The news data streams are received and their similarity needs to be estimated. The similarity computation uses a similarity matrix. It requires somewhat more memory than other clustering algorithms, since the matrix values must be stored for all pairs of data elements.
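As a rough illustration of this memory cost, assume that the similarity matrix is symmetric, that only its distinct off-diagonal entries are stored, and that each entry takes $b = 8$ bytes (a double-precision float). These are assumptions, since the storage format is not stated. For $n$ articles the memory is then

$$
M(n) = \frac{n(n-1)}{2}\, b .
$$

For a batch of $n = 3695$ articles, the largest batch in the cluster formation table, this gives

$$
M(3695) = \frac{3695 \times 3694}{2} \times 8 = 6\,824\,665 \times 8 = 54\,597\,320 \text{ bytes} \approx 54.6 \text{ MB},
$$

so the requirement grows quadratically with the number of articles, whereas the article vectors themselves grow only linearly.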
Even though hierarchical clustering takes more space, it is widely used in many content organization systems. Hierarchical clustering algorithms satisfy the reducibility property. With this property, the increased computational time required for generating the clusters is offset by a cluster hierarchy with an exact and unique structure.
This work mainly presents an automatic news summarization system for dynamic news articles within timeframes from Google news. The scope of the proposed collaborative filtering based news retrieval system includes concise information from various news articles. It removes the difficulty of going through a large number of news articles and provides a summary of 20% to 30% of the original news content. The scope is limited to generating the summary for the keyword of interest to the user using the news articles in a time frame. This news retrieval system is helpful for online news content editors who need immediate access to the content of their domain of interest.
In this paper, a hierarchical clustering based news summarization system has been proposed for RSS feed based news and Google news. The news crawler uses thread-based crawling to collect the news articles, and its collection efficiency has been compared with that of various state-of-the-art news crawlers. This work used various recent domain corpora to tag and extract the topic-wise news efficiently. The hierarchical clustering estimated the similarity of the news contents and produced hierarchical clusters for the various domains. The evaluation of the automatic summary against human-generated summaries showed that the system performed best on the hierarchically clustered news article contents. Hence, the proposed news summarization system is suitable and useful for readers who want recent domain-specific news in the form of summaries generated from various news sources.