
MSPT: Mastering Stability-Plasticity in Dynamic Multimodal Knowledge Graphs
From 2305.08698v3.pdf
This paper addresses the significant challenge of "catastrophic forgetting" and the lack of adaptability in current Multimodal Knowledge Graph Construction (MKGC) models when faced with dynamic, continuously evolving data. It introduces the Multimodal Stability-Plasticity Transformer (MSPT) framework, which leverages novel gradient modulation and attention distillation techniques to effectively balance the retention of previously acquired knowledge (stability) with the integration of new information (plasticity). Extensive experiments demonstrate that MSPT consistently outperforms existing state-of-the-art methods in continual MKGC tasks.
Challenges in Continual MKGC
- Catastrophic Forgetting: Existing MKGC models struggle with the real-world dynamism of continuously emerging entities and relations, often losing previously acquired knowledge.
- Static vs. Dynamic: Current MKGC architectures primarily focus on "static" knowledge graphs, lacking adaptability to new entity categories and relations in streaming data.
- Multimodal Specific Issues: Direct transfer of unimodal continual learning strategies to MKGC faces challenges like imbalanced learning dynamics and varying forgetting rates across modalities.
The MSPT Framework and Innovations
- Dual-Stream Transformer Architecture: MSPT uses a dual-stream Transformer structure, pairing a Vision Transformer (ViT) for visual data with BERT for textual data (backbone sketch below).
- Balanced Multimodal Learning: Employs a gradient modulation technique to counter imbalanced learning dynamics across modalities, adaptively rescaling per-modality gradients to enhance plasticity (second sketch below).
- Hand-in-Hand Multimodal Interaction: Introduces an attention generation process built on shared learnable keys and self-queries to promote cross-modal integration and counteract forgetting (third sketch below).
- Attention Distillation for Stability: Refines attention matrices via an asymmetric distance function that penalizes forgetting of prior knowledge while permitting new attention patterns, balancing plasticity and stability (fourth sketch below).
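The following sketches are minimal illustrations of the four components above, not the authors' implementation. First, the dual-stream backbone, assuming the HuggingFace `transformers` classes `ViTModel` and `BertModel` (the checkpoint names are illustrative choices; the paper specifies ViT and BERT but not exact weights):

```python
import torch.nn as nn
from transformers import ViTModel, BertModel

class DualStreamBackbone(nn.Module):
    """Two independent encoders, one per modality; fusion happens in
    later interaction layers (see the shared-key attention sketch)."""
    def __init__(self):
        super().__init__()
        self.visual = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.textual = BertModel.from_pretrained("bert-base-uncased")

    def forward(self, pixel_values, input_ids, attention_mask):
        v = self.visual(pixel_values=pixel_values).last_hidden_state       # (B, Nv, 768)
        t = self.textual(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state  # (B, Nt, 768)
        return v, t
```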
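Second, gradient modulation. The tanh-based scaling below follows the general on-the-fly gradient-modulation recipe and is an assumption, not MSPT's exact rule; `score_visual` and `score_textual` are hypothetical per-batch unimodal confidence scores (e.g. probability mass on the gold labels):

```python
import math

def modulate_gradients(model, score_visual, score_textual, alpha=1.0):
    """Call between loss.backward() and optimizer.step().
    Down-weights the dominant modality's gradients so the weaker
    stream keeps learning (a sketch, not the paper's exact rule)."""
    ratio = score_visual / (score_textual + 1e-8)   # > 1 => visual dominates
    coef_v = 1.0 - math.tanh(alpha * ratio) if ratio > 1.0 else 1.0
    coef_t = 1.0 - math.tanh(alpha / max(ratio, 1e-8)) if ratio < 1.0 else 1.0
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        if name.startswith("visual"):     # prefixes match the backbone sketch above
            param.grad.mul_(coef_v)
        elif name.startswith("textual"):
            param.grad.mul_(coef_t)
```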
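Third, the hand-in-hand interaction. The sketch below realizes "shared learnable keys with self-queries" as a single-head layer whose key (and, as a further assumption of this sketch, value) memory is shared between the two modality streams:

```python
import math
import torch
import torch.nn as nn

class SharedKeyAttention(nn.Module):
    def __init__(self, dim=768, num_keys=64):
        super().__init__()
        # One key memory, shared by both modality streams.
        self.shared_keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.q_proj = nn.Linear(dim, dim)   # self-queries from each stream's own features
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, D) features from one modality stream.
        q = self.q_proj(x)
        attn = (q @ self.shared_keys.t() / math.sqrt(q.size(-1))).softmax(dim=-1)  # (B, N, K)
        v = self.v_proj(self.shared_keys)   # (K, D), read from the shared memory
        return attn @ v, attn               # output and attention (kept for distillation)
```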
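Fourth, attention distillation. A ReLU-based asymmetric distance is one natural reading of "penalize forgetting, allow new patterns": only drops in attention mass relative to the frozen previous-task model are penalized. The paper's exact distance function may differ:

```python
import torch.nn.functional as F

def attention_distillation_loss(attn_new, attn_old):
    """attn_new: attention from the current model (requires grad).
    attn_old: attention from the frozen previous-task model."""
    deficit = F.relu(attn_old.detach() - attn_new)  # positive only where attention shrank
    return deficit.pow(2).mean()
```

In training, this term would be added to the task loss with a weighting coefficient, trading stability (high weight) against plasticity (low weight).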
Experimental Validation and Performance
- Benchmark Datasets: Evaluated on a modified class-incremental version of the Twitter-2017 dataset for incremental Multimodal Named Entity Recognition (IMNER) and a partitioned dataset for incremental Multimodal Relation Extraction (IMRE).
- Superior Performance: MSPT consistently outperforms both traditional static MKGC models and leading unimodal continual learning methods on the IMRE and IMNER benchmarks.
- Robustness: Demonstrated resilience to varying task orders, different numbers of tasks, and even small rehearsal sizes, indicating its strong generalization capabilities.
Impact of Core Components
- Component Contribution: Ablation studies confirm that each proposed component—Multimodal Interaction, Attention Distillation, Gradient Modulation, and Memory (rehearsal)—positively contributes to overall performance.
- Key Drivers: Gradient Modulation and the rehearsal strategy were identified as particularly significant for improving F1 scores and mitigating catastrophic forgetting (a minimal rehearsal-buffer sketch follows this list).
- Enhanced Plasticity: MSPT exhibits superior plasticity, effectively balancing the maintenance of past knowledge with the adaptation to new information, a critical aspect where other methods often fall short.
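A minimal sketch of the rehearsal ("Memory") component: a small buffer of past-task examples replayed alongside new-task batches. Reservoir sampling and the capacity below are illustrative choices, not the paper's exact settings:

```python
import random

class RehearsalBuffer:
    def __init__(self, capacity=200):
        self.capacity = capacity
        self.examples = []
        self.seen = 0

    def add(self, example):
        # Reservoir sampling keeps a uniform sample over the whole stream.
        self.seen += 1
        if len(self.examples) < self.capacity:
            self.examples.append(example)
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.examples[idx] = example

    def sample(self, k):
        # Mix these into each new-task batch to mitigate forgetting.
        return random.sample(self.examples, min(k, len(self.examples)))
```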