Open Access
REVIEW
A Review on Vision-Language-Based Approaches: Challenges and Applications
1 Department of Artificial Intelligence, FPT University, Danang, 550000, Vietnam
2 Department of Business, FPT University, Danang, 550000, Vietnam
3 Faculty of Information Technology, School of Technology, Van Lang University, Ho Chi Minh City, 70000, Vietnam
* Corresponding Author: Tri-Hai Nguyen. Email:
# These authors contributed equally to this work
(This article belongs to the Special Issue: New Trends in Image Processing)
Computers, Materials & Continua 2025, 82(2), 1733-1756. https://doi.org/10.32604/cmc.2025.060363
Received 30 October 2024; Accepted 20 January 2025; Issue published 17 February 2025
Abstract
In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across natural language processing tasks, such as visual question answering, and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability to complex multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess VLMs’ strengths, limitations, and applicability across tasks, while examining challenges such as scalability, data quality, and fine-tuning complexity. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.