Open Access

ARTICLE

Integrating Speech-to-Text for Image Generation Using Generative Adversarial Networks

Smita Mahajan1, Shilpa Gite1,2, Biswajeet Pradhan3,*, Abdullah Alamri4, Shaunak Inamdar5, Deva Shriyansh5, Akshat Ashish Shah5, Shruti Agarwal5
1 Artificial Intelligence and Machine Learning Department, Symbiosis Institute of Technology, Pune, 412115, India
2 Symbiosis Centre of Applied AI (SCAAI), Symbiosis Institute of Technology, Pune, 412115, India
3 Centre for Advanced Modelling and Geospatial Information Systems (CAMGIS), School of Civil and Environmental Engineering, University of Technology Sydney, Sydney, NSW 2007, Australia
4 Department of Geology and Geophysics, College of Science, King Saud University, Riyadh, 11451, Saudi Arabia
5 Department of Computer Science and Engineering, Symbiosis Institute of Technology, Pune, 412115, India
* Corresponding Author: Biswajeet Pradhan
(This article belongs to the Special Issue: Advances in AI-Driven Computational Modeling for Image Processing)

Computer Modeling in Engineering & Sciences https://doi.org/10.32604/cmes.2025.058456

Received 12 September 2024; Accepted 26 January 2025; Published online 09 April 2025

Abstract

The development of generative architectures has produced numerous deep-learning models that generate images from text inputs. However, humans naturally use speech to describe what they want to visualize. This paper therefore proposes an architecture that accepts speech prompts as input to an image-generating Generative Adversarial Network (GAN), combining speech-to-text translation with the Contrastive Language-Image Pretraining (CLIP) + Vector Quantized Generative Adversarial Network (VQGAN) model. The proposed method translates speech prompts into text, which the CLIP + VQGAN model then uses to generate images. This paper outlines the steps required to implement such a model and describes in detail the methods used to evaluate it. The model successfully generates artwork from both speech and text prompts. Experimental results on the synthesized images demonstrate that the proposed methodology can produce abstract visuals that incorporate elements of the input prompts. The model achieved a Fréchet Inception Distance (FID) score of 28.75, indicating its capability to produce high-quality and diverse images. Because it generates images from speech and produces output with a distinct abstract artistry, the proposed model has numerous applications in educational, artistic, and design settings. This capability is demonstrated by giving the model novel prompts and generating previously unseen images with plausible, realistic qualities.
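For readers who want a concrete picture of the pipeline, the sketch below strings together the two stages the abstract describes: speech-to-text transcription followed by CLIP-guided image synthesis. It is a minimal illustration, not the authors' implementation: the openai-whisper and CLIP packages, the model sizes, the audio file name, and the hyperparameters are all assumptions, and a directly optimized pixel tensor stands in for the VQGAN latent so the example stays self-contained (the paper decodes latents through a pretrained VQGAN instead).

# Minimal sketch of the speech -> text -> image pipeline, under the
# assumptions stated above (pip install openai-whisper and
# git+https://github.com/openai/CLIP.git). A pixel-space tensor stands
# in for the VQGAN latent that the paper optimizes.
import torch
import clip
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: transcribe the spoken prompt into text.
stt = whisper.load_model("base")               # model size is an assumption
prompt = stt.transcribe("prompt.wav")["text"]  # hypothetical audio file

# Stage 2: encode the prompt with CLIP (kept frozen).
model, _ = clip.load("ViT-B/32", device=device, jit=False)
model = model.float().eval()
for p in model.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# CLIP's expected input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

# Stage 3: optimize the image so its CLIP embedding matches the prompt.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)

for step in range(300):
    img = (image.clamp(0, 1) - mean) / std
    img_feat = model.encode_image(img)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = 1 - (img_feat * text_feat).sum()    # cosine distance to the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()

In the full CLIP + VQGAN setup, the optimized variable is the VQGAN latent code and CLIP scores the decoded image rather than raw pixels, which constrains the output to the VQGAN's learned image manifold and gives the results their characteristic abstract texture.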

Keywords

Generative adversarial networks; speech-to-image translation; visualization; transformers; prompt engineering