Development and research of multimodal neural architectures for heterogeneous unbalanced data in classification tasks

Minukhin S.; Rudoi V.

Please use this identifier to cite or link to this item: https://repository.hneu.edu.ua/handle/123456789/39145

Title:	Development and research of multimodal neural architectures for heterogeneous unbalanced data in classification tasks
Authors:	Minukhin S.; Rudoi V.
Keywords:	multimodal data; cross-modal attention; contrastive learning; knowledge distillation; pruning; quantization; emotion classification; autonomous navigation
Issue Date:	2026
Citation:	Development and research of multimodal neural architectures for heterogeneous unbalanced data in classification tasks / S. Minukhin, V. Rudoi // Computer systems and information technologies : international scientific journal. - 2026. - No. 1. - P. 28-40.
Abstract:	The article presents a comprehensive study of modern multimodal neural architectures for integrating heterogeneous and partially unbalanced data in classification tasks. It considers early and late fusion approaches, hybrid architectures with cross-modal attention, and transformers that form consistent latent spaces of visual, auditory, and textual features. Particular attention is paid to contrastive learning (CLIP-like approaches, multimodal InfoNCE), which ensures semantic consistency of representations and improves classification accuracy when the data distribution is uneven and some classes are rare. A model is proposed that combines early and late fusion with cross-modal attention and contrastive learning to form a coherent joint latent space. The features of each modality are processed by a specialized encoder, and fusion is performed with adaptive weighting, which minimizes the impact of imbalance across heterogeneous data and enables efficient processing of signals of different natures and intensities. Pruning, quantization, and knowledge distillation reduced computational costs without loss of accuracy, ensuring stable model performance in real-world streaming scenarios with limited resources. Applying the proposed model to the BDD100K and CMU-MOSEI datasets confirmed its high efficiency in processing heterogeneous and unbalanced data. For BDD100K, the model achieved Accuracy 0.953, F1-score 0.956, and ROC-AUC 0.947, with Micro F1, Macro F1, and Weighted F1 of 0.953, 0.949, and 0.955, respectively. For CMU-MOSEI, it achieved Accuracy 0.956, F1-score 0.969, and ROC-AUC 0.968, with Micro F1, Macro F1, and Weighted F1 of 0.956, 0.962, and 0.968, respectively.
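The abstract names multimodal InfoNCE as the contrastive objective but gives no formula on this page. The following is a minimal sketch of a symmetric, CLIP-like InfoNCE loss for two modalities, written in plain Python. The function names, the temperature of 0.07, and the toy two-dimensional embeddings are illustrative assumptions, not details taken from the article.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings from two modalities.

    emb_a[i] and emb_b[i] are assumed to describe the same sample (positive
    pair); every other pairing in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the article.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Direction A -> B: each row should put its mass on the diagonal entry.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction B -> A: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_ba = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)


# Toy data: two samples, two modalities, 2-dimensional embeddings.
modality_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]   # positives point the same way
shuffled_b = [[0.1, 0.9], [0.9, 0.1]]  # positives are mismatched

aligned_loss = symmetric_info_nce(modality_a, aligned_b)
shuffled_loss = symmetric_info_nce(modality_a, shuffled_b)
```

With these toy inputs the aligned pairing gives a loss close to zero and the mismatched pairing a much larger one, which is the behaviour the objective relies on to pull matching cross-modal representations together.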
A comparative analysis with classical feature concatenation approaches, recent state-of-the-art (SOTA) multimodal fusion models, and AutoML-based solutions demonstrated that the proposed architecture consistently outperforms existing methods. In particular, the model improves classification accuracy by approximately 2–4% compared to recent SOTA architectures and provides more stable F1-scores for minority classes. A comparison with the AutoML-based framework B-T4SA also confirms the robustness of the proposed approach. These results demonstrate that the developed model ensures higher classification consistency for both frequent and rare classes under heterogeneous and imbalanced data conditions.
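The abstract reports Micro F1, Macro F1, and Weighted F1 side by side. The sketch below shows, under the assumption of single-label classification, how the three averages differ on imbalanced data. The function name and the toy labels are hypothetical and unrelated to the article's experiments.

```python
def f1_averages(y_true, y_pred):
    """Return (micro, macro, weighted) F1 for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class_f1 = {}
    support = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class_f1[c] = 2 * tp / denom if denom else 0.0
        support[c] = tp + fn  # number of true samples of class c
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Micro: pool the counts over all classes, then compute one F1.
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(per_class_f1.values()) / len(labels)
    # Weighted: mean of per-class F1 weighted by class support.
    weighted = sum(per_class_f1[c] * support[c] for c in labels) / len(y_true)
    return micro, macro, weighted


# Toy imbalanced data: 8 "common" samples, 2 "rare", one rare sample missed.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]

micro_f1, macro_f1, weighted_f1 = f1_averages(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In single-label classification Micro F1 equals accuracy, which matches the reported pairs (0.953 and 0.956). Macro F1 gives every class equal weight, so it is the average most sensitive to errors on minority classes.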
URI:	https://repository.hneu.edu.ua/handle/123456789/39145
Appears in Collections:	Articles (IS)

Files in This Item:

File	Description	Size	Format
CSIT-2026-N1+(22)+28-40.pdf		1.18 MB	Adobe PDF


All items in the electronic resource archive are protected by copyright, with all rights reserved.