A Review on Vision-Language-Based Approaches: Challenges and Applications

Huu-Tuong Ho; Luong Nguyen; Minh-Tien Pham; Quang-Huy Pham; Quang-Duong Tran; Duong Nguyen; Tri-Hai Nguyen

doi:10.32604/cmc.2025.060363

Open Access

REVIEW


Huu-Tuong Ho¹,#, Luong Vuong Nguyen¹,#, Minh-Tien Pham¹, Quang-Huy Pham¹, Quang-Duong Tran¹, Duong Nguyen Minh Huy², Tri-Hai Nguyen³,*

1 Department of Artificial Intelligence, FPT University, Danang, 550000, Vietnam
2 Department of Business, FPT University, Danang, 550000, Vietnam
3 Faculty of Information Technology, School of Technology, Van Lang University, Ho Chi Minh City, 70000, Vietnam

* Corresponding Author: Tri-Hai Nguyen
# These authors contributed equally to this work

(This article belongs to the Special Issue: New Trends in Image Processing)

Computers, Materials & Continua 2025, 82(2), 1733-1756. https://doi.org/10.32604/cmc.2025.060363

Received 30 October 2024; Accepted 20 January 2025; Issue published 17 February 2025

Abstract

In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise both in language-centered tasks such as visual question answering and in computer vision applications such as image captioning and image-text retrieval, which highlights their adaptability to complex multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. We conduct a comparative analysis of the strengths, limitations, and applicability of VLMs across tasks, and examine challenges such as scalability, data quality, and fine-tuning complexity. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.

Keywords

Bootstrapping language-image pre-training (BLIP); multimodal learning; vision-language model (VLM); vision-language pre-training (VLP)

Cite This Article

APA Style

Ho, H., Nguyen, L.V., Pham, M., Pham, Q., Tran, Q. et al. (2025). A review on vision-language-based approaches: challenges and applications. Computers, Materials & Continua, 82(2), 1733–1756. https://doi.org/10.32604/cmc.2025.060363

Vancouver Style

Ho H, Nguyen LV, Pham M, Pham Q, Tran Q, Huy DNM, et al. A review on vision-language-based approaches: challenges and applications. Comput Mater Contin. 2025;82(2):1733–1756. https://doi.org/10.32604/cmc.2025.060363

IEEE Style

H. Ho et al., “A Review on Vision-Language-Based Approaches: Challenges and Applications,” Comput. Mater. Contin., vol. 82, no. 2, pp. 1733–1756, 2025. https://doi.org/10.32604/cmc.2025.060363


Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
