
Open Access

ARTICLE

Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion

Huansha Wang*, Ruiyang Huang*, Qinrang Liu, Xinghao Wang
National Digital Switching System Engineering & Technological R&D Center, Information Engineering University, Zhengzhou, 450001, China
* Corresponding Author: Huansha Wang. Email: email; Ruiyang Huang. Email: email

Computers, Materials & Continua https://doi.org/10.32604/cmc.2025.061902

Received 05 December 2024; Accepted 26 March 2025; Published online 09 April 2025

Abstract

Multi-modal Named Entity Recognition (MNER) aims to identify meaningful textual entities more accurately by integrating information from images. Previous work has focused on extracting visual semantics at a fine-grained level, or on obtaining entity-related external knowledge from knowledge bases or Large Language Models (LLMs). However, these approaches ignore the weak semantic correlation between the visual and textual modalities in MNER datasets and do not explore alternative multi-modal fusion strategies. In this paper, we present MMAVK, a multi-modal named entity recognition model with auxiliary visual knowledge and word-level fusion, which leverages a Multi-modal Large Language Model (MLLM) as an implicit knowledge base and extracts vision-based auxiliary knowledge from the image for more accurate and effective recognition. Specifically, we propose vision-based auxiliary knowledge generation, which uses target-specific prompts to guide the MLLM to extract external knowledge derived exclusively from images to aid entity recognition, thereby avoiding the redundant recognition and cognitive confusion caused by processing image-text pairs simultaneously. Furthermore, we employ a word-level multi-modal fusion mechanism that fuses the extracted external knowledge with each word embedding produced by the transformer-based encoder. Extensive experimental results demonstrate that MMAVK matches or outperforms state-of-the-art methods on two classical MNER datasets, even though the large models it employs have significantly fewer parameters than those used by other baselines.
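The abstract does not specify the exact form of the word-level fusion, so the following is only a minimal illustrative sketch: each word embedding from the text encoder attends over the embeddings of the MLLM-generated auxiliary knowledge via scaled dot-product attention, and a scalar gate blends the resulting knowledge context into the word representation. All function and variable names here are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_level_fusion(word_emb, knowledge_emb):
    """Fuse each word embedding with auxiliary-knowledge token embeddings.

    word_emb:      (n_words, d) token embeddings from the text encoder
    knowledge_emb: (n_know, d)  embeddings of the MLLM-generated knowledge
    Returns fused word embeddings of shape (n_words, d).
    """
    d = word_emb.shape[-1]
    # Per-word attention over the knowledge tokens (scaled dot product).
    scores = word_emb @ knowledge_emb.T / np.sqrt(d)   # (n_words, n_know)
    attn = softmax(scores, axis=-1)
    context = attn @ knowledge_emb                     # (n_words, d)
    # Scalar sigmoid gate per word: how much knowledge context to admit.
    gate = 1.0 / (1.0 + np.exp(-(word_emb * context).sum(-1, keepdims=True)))
    return word_emb + gate * context
```

Because the fusion happens per word rather than per sentence, each token can draw on a different slice of the auxiliary knowledge, which is the property the abstract emphasizes.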

Keywords

Multi-modal named entity recognition; large language model; multi-modal fusion