5 July 2024 Cantonese speech separation based on cross-modal feature fusion
Yuming Huang, Junfeng Wei, Zhenyuan Huang
Proceedings Volume 13184, Third International Conference on Electronic Information Engineering and Data Processing (EIEDP 2024); 131841W (2024) https://doi.org/10.1117/12.3032932
Event: 3rd International Conference on Electronic Information Engineering and Data Processing (EIEDP 2024), 2024, Kuala Lumpur, Malaysia
Abstract
Most audio-visual speech separation models fuse audio and video features by merging them directly, which can leave the interrelationships among cross-modal features underused. To address this, joint attention is applied during audio-visual feature fusion to achieve cross-modal integration. By jointly modeling intra-modal and inter-modal relationships, the features of each modality attend both to the other modality and to themselves, yielding better cross-modal fusion features and improved speech separation performance. To evaluate separation effectiveness, an audio-visual bilingual Cantonese corpus (AVCC) was established, and three speech separation methods (Conv-TasNet, VisualVoice, and VisualVoice with joint cross-attention) were tested using Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Signal-to-Distortion Ratio (SDR). When separating mixed speech from two speakers of different genders, the proposed VisualVoice method with joint cross-attention improves SDR by at least 0.87 dB and by up to 1.99 dB. Using the joint cross-attention model for feature fusion exploits cross-modal correlations more effectively and thereby improves the separation results.
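The SDR gains quoted in the abstract are differences between two decibel values. As a rough illustration of how such a gain is computed, the sketch below uses a simplified energy-ratio definition of SDR, 10 log10(||s||^2 / ||s - s_hat||^2). This is an assumption made for illustration: the abstract does not say which SDR implementation was used, and BSS Eval style toolkits apply additional projections before taking the ratio. The function name `sdr_db` and the toy signals are hypothetical and are not data from the paper.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: a plain energy ratio, without the allowed-distortion
    projections used by BSS Eval style toolkits.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    distortion_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if distortion_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / distortion_energy)


# Hypothetical toy signals, not data from the paper.
reference = [1.0, -1.0, 1.0, -1.0]
baseline_estimate = [0.8, -0.8, 0.8, -0.8]   # larger residual error
improved_estimate = [0.9, -0.9, 0.9, -0.9]   # residual error halved

baseline_sdr = sdr_db(reference, baseline_estimate)
improved_sdr = sdr_db(reference, improved_estimate)
sdr_gain = improved_sdr - baseline_sdr  # gain in dB, as reported in the abstract

print(f"baseline: {baseline_sdr:.2f} dB")
print(f"improved: {improved_sdr:.2f} dB")
print(f"gain:     {sdr_gain:.2f} dB")
```

In this toy case, halving the amplitude of the residual error raises the SDR by about 6 dB, which gives a sense of scale for gains in the 0.87 dB to 1.99 dB range.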
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Yuming Huang, Junfeng Wei, and Zhenyuan Huang "Cantonese speech separation based on cross-modal feature fusion", Proc. SPIE 13184, Third International Conference on Electronic Information Engineering and Data Processing (EIEDP 2024), 131841W (5 July 2024); https://doi.org/10.1117/12.3032932
KEYWORDS
Visualization
Visual process modeling
Feature fusion
Performance modeling
Network architectures
Video
Motion analysis