An Efficient Mechanism for Product Data Extraction from E-Commerce Websites

Malik Javed Akhtar; Zahur Ahmad; Rashid Amin; Sultan H. Almotiri; Mohammed A. Al Ghamdi; Hamza Aldabbas

doi:10.32604/cmc.2020.011485

Open Access

ARTICLE

Malik Javed Akhtar¹, Zahur Ahmad¹, Rashid Amin¹,*, Sultan H. Almotiri², Mohammed A. Al Ghamdi², Hamza Aldabbas³

1 University of Engineering and Technology, Taxila, Pakistan.
2 Computer Science Department, Umm Al-Qura University, Makkah City, Saudi Arabia.
3 Al-Balqa Applied University, Al-Salt, Jordan.

* Corresponding Author: Rashid Amin.

Computers, Materials & Continua 2020, 65(3), 2639-2663. https://doi.org/10.32604/cmc.2020.011485

Received 11 May 2020; Accepted 25 July 2020; Issue published 16 September 2020

Abstract

A large amount of data is available on the web that can be used for purposes such as product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human readers, not for machines, so techniques are required to extract data from web pages and make it machine-readable. Researchers have addressed this problem using two approaches: knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use document structure, visual cues, clustering of data record attributes, and text processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules, which are then used to extract data from unseen web pages. Because the structure of web documents is continuously evolving, new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple, and efficient technique to extract data from web pages using visual styles and document structure. The proposed technique detects the Rich Data Region (RDR) using the query and its correlative words. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and run on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems; the comparison shows encouraging results.
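The evaluation measures named in the abstract can be illustrated with a minimal Java sketch. It uses the standard definitions of precision, recall, and F-measure (the balanced F1 form is an assumption, since the abstract does not say which variant is used). The class name `ExtractionMetrics`, its method names, and the sample record strings are hypothetical and are not taken from the authors' system.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: standard precision, recall, and F-measure
// for a set of extracted data records. Not the authors' code.
public class ExtractionMetrics {

    // Precision: fraction of extracted records that are correct.
    public static double precision(int truePositives, int falsePositives) {
        int extracted = truePositives + falsePositives;
        return extracted == 0 ? 0.0 : (double) truePositives / extracted;
    }

    // Recall: fraction of the expected records that were extracted.
    public static double recall(int truePositives, int falseNegatives) {
        int expected = truePositives + falseNegatives;
        return expected == 0 ? 0.0 : (double) truePositives / expected;
    }

    // F-measure: harmonic mean of precision and recall (F1, assumed).
    public static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Made-up example data: records a page actually contains
        // versus records an extractor returned.
        Set<String> expected = Set.of("phone A", "phone B", "phone C", "phone D");
        Set<String> extracted = Set.of("phone A", "phone B", "phone C", "ad banner");

        // Correctly extracted records are the intersection of the two sets.
        Set<String> correct = new HashSet<>(extracted);
        correct.retainAll(expected);

        int tp = correct.size();
        int fp = extracted.size() - tp; // extracted but not expected (noise)
        int fn = expected.size() - tp;  // expected but missed

        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.printf("precision=%.2f recall=%.2f F=%.2f%n", p, r, fMeasure(p, r));
    }
}
```

In this sketch a noisy element that is wrongly kept as a data record lowers precision, while a product record that is missed lowers recall; the F-measure combines the two into a single score.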

Keywords

Document object model, rich data region, common tag sequence, web data extraction, deep web mining.

Cite This Article

APA Style

Javed Akhtar, M., Ahmad, Z., Amin, R., H. Almotiri, S., A. Al Ghamdi, M. et al. (2020). An Efficient Mechanism for Product Data Extraction from E-Commerce Websites. Computers, Materials & Continua, 65(3), 2639–2663. https://doi.org/10.32604/cmc.2020.011485

Vancouver Style

Javed Akhtar M, Ahmad Z, Amin R, H. Almotiri S, A. Al Ghamdi M, Aldabbas H. An Efficient Mechanism for Product Data Extraction from E-Commerce Websites. Comput Mater Contin. 2020;65(3):2639–2663. https://doi.org/10.32604/cmc.2020.011485

IEEE Style

M. Javed Akhtar, Z. Ahmad, R. Amin, S. H. Almotiri, M. A. Al Ghamdi, and H. Aldabbas, “An Efficient Mechanism for Product Data Extraction from E-Commerce Websites,” Comput. Mater. Contin., vol. 65, no. 3, pp. 2639–2663, 2020. https://doi.org/10.32604/cmc.2020.011485

Copyright © 2020 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
