RVTS: retrieval-based text-to-speech
5 July 2024
Jin Yang, Xuesong Su, Wenke Xu
Proceedings Volume 13184, Third International Conference on Electronic Information Engineering and Data Processing (EIEDP 2024); 131841L (2024) https://doi.org/10.1117/12.3032936
Event: 3rd International Conference on Electronic Information Engineering and Data Processing (EIEDP 2024), 2024, Kuala Lumpur, Malaysia
Abstract
Single-stage text-to-speech models have made significant advances in recent years. Comparing two prominent models, VITS and VITS2, trained on the same single-speaker voice dataset, we find that the voiceprint of speech generated by VITS closely matches the original speaker's voiceprint, but its sentence segmentation and fluency are subpar; VITS2 exhibits the opposite characteristics. We propose Retrieval-based Text-to-Speech (RVTS), which extracts intonation and sentence segmentation from the VITS2 model and applies them during synthesis. Our experimental results show that this approach outperforms both VITS and VITS2, combining the strengths of the two models.
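The abstract does not detail the retrieval mechanism, but retrieval-based voice synthesis is commonly implemented by matching each synthesized feature frame against an indexed bank of the target speaker's features and blending in the retrieved neighbours. The sketch below illustrates that general idea only; the function name, feature dimensions, and blending scheme are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def retrieve_and_blend(query_feats, index_feats, k=4, blend=0.75):
    """Illustrative sketch (not the paper's algorithm): for each query
    frame, find its k nearest frames in the index and replace it with a
    weighted mix of their mean and the original frame."""
    # Pairwise squared Euclidean distances, shape (n_query, n_index)
    d = ((query_feats[:, None, :] - index_feats[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d, axis=1)[:, :k]          # indices of k nearest frames
    retrieved = index_feats[knn].mean(axis=1)   # average of retrieved frames
    return blend * retrieved + (1 - blend) * query_feats

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 8))   # stand-in target-speaker feature bank
query = rng.normal(size=(20, 8))    # stand-in synthesized feature frames
out = retrieve_and_blend(query, index)
print(out.shape)  # (20, 8)
```

With `blend=1.0` the output is drawn entirely from the target speaker's feature bank; lower values trade speaker similarity for fidelity to the synthesized prosody.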
© (2024) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jin Yang, Xuesong Su, and Wenke Xu "RVTS: retrieval-based text-to-speech", Proc. SPIE 13184, Third International Conference on Electronic Information Engineering and Data Processing (EIEDP 2024), 131841L (5 July 2024); https://doi.org/10.1117/12.3032936
KEYWORDS: Education and training, Data modeling, Systems modeling, Molybdenum, Adversarial training, Computer science, Machine learning