Baseline Isolated Printed Text Image Database for Pashto Script Recognition

Arfa Siddiqu; Abdul Basit; Waheed Noor; Muhammad Khan; M. Saeed; Azam Khan

doi:10.32604/iasc.2023.036426

Open Access

ARTICLE

Arfa Siddiqu, Abdul Basit^*, Waheed Noor, Muhammad Asfandyar Khan, M. Saeed H. Kakar, Azam Khan

Department of Computer Science & Information Technology, University of Balochistan, Quetta, 87300, Pakistan

* Corresponding Author: Abdul Basit.

Intelligent Automation & Soft Computing 2023, 37(1), 875-885. https://doi.org/10.32604/iasc.2023.036426

Received 29 September 2022; Accepted 13 December 2022; Issue published 29 April 2023

Abstract

Optical character recognition (OCR) for right-to-left, cursive scripts such as Arabic is challenging and has received little attention from researchers compared with Latin-script languages. Moreover, the absence of a standard, publicly available dataset for several low-resource languages, including Pashto, has remained a hurdle to progress in language processing. Recognizing that a clean dataset is a fundamental requirement of character recognition, this research begins with dataset generation and aims ultimately at a system capable of complete language understanding. With fully autonomous recognition of the cursive Pashto script as the long-term goal, the first achievement of this research is a clean, standard dataset of isolated Pashto characters. This paper introduces a database of isolated characters covering the forty-four letters of the Pashto alphabet in various font styles. To overcome the shortage of font styles, the graphics software Inkscape was used to generate sufficient image samples for each character. The dataset has been pre-processed, reduced to 32 × 32 pixels, and converted to binary format with a black background and white text so that it resembles the Modified National Institute of Standards and Technology (MNIST) database. The benchmark database is publicly available for further research on GitHub and Kaggle in both pixel and Comma Separated Values (CSV) formats.
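The pre-processing outlined in the abstract (reduction to 32 × 32 pixels, binarization to white text on a black background, export to CSV) can be illustrated with a minimal sketch. It is not the authors' pipeline: the nearest-neighbour resize, the threshold of 128, the label-first CSV layout, the assumption of dark ink on a light page, and all function names are hypothetical choices made only for illustration.

```python
import csv
import io

SIZE = 32  # target side length stated in the abstract


def resize_nearest(img, size=SIZE):
    """Nearest-neighbour resize of a 2D list of grayscale values to size x size.

    Assumption: the abstract does not say which interpolation was used.
    """
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def binarize_white_on_black(img, threshold=128):
    """Map dark ink (below threshold) to 255 and light paper to 0.

    Assumption: input is dark text on a light page; threshold is illustrative.
    """
    return [[255 if px < threshold else 0 for px in row] for row in img]


def to_csv_row(label, img):
    """Flatten to an MNIST-style row: label first, then row-major pixels.

    Assumption: this column layout is hypothetical, not taken from the dataset.
    """
    return [label] + [px for row in img for px in row]


# Synthetic 64x64 "glyph": a white page with a dark vertical stroke.
page = [[0 if 24 <= c < 40 else 255 for c in range(64)] for _ in range(64)]

small = binarize_white_on_black(resize_nearest(page))
row = to_csv_row(0, small)

# Write the single sample as one CSV line (in memory, for demonstration).
buf = io.StringIO()
csv.writer(buf).writerow(row)
```

Each sample becomes one CSV line of 1 + 32 × 32 = 1025 values, which is the general shape used by common CSV exports of MNIST; whether the published Pashto files follow exactly this layout should be checked against the dataset itself.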

Keywords

Text-image database; optical character recognition (OCR); Pashto isolated characters; visual recognition; autonomous language understanding; deep learning; convolutional neural network (CNN)

Cite This Article

APA Style

Siddiqu, A., Basit, A., Noor, W., Khan, M.A., Saeed H. Kakar, M. et al. (2023). Baseline Isolated Printed Text Image Database for Pashto Script Recognition. Intelligent Automation & Soft Computing, 37(1), 875–885. https://doi.org/10.32604/iasc.2023.036426

Vancouver Style

Siddiqu A, Basit A, Noor W, Khan MA, Saeed H. Kakar M, Khan A. Baseline Isolated Printed Text Image Database for Pashto Script Recognition. Intell Automat Soft Comput. 2023;37(1):875–885. https://doi.org/10.32604/iasc.2023.036426

IEEE Style

A. Siddiqu, A. Basit, W. Noor, M. A. Khan, M. Saeed H. Kakar, and A. Khan, “Baseline Isolated Printed Text Image Database for Pashto Script Recognition,” Intell. Automat. Soft Comput., vol. 37, no. 1, pp. 875–885, 2023. https://doi.org/10.32604/iasc.2023.036426

Copyright © 2023 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
