A Chinese acoustic model based on convolutional neural network

Qian Zhang; Jun Sang; Mohammad S. Alam; Bin Cai; Li Yang

doi:10.1117/12.2520356

Published: 13 May 2019

Proceedings Volume 10995, Pattern Recognition and Tracking XXX; 109950U (2019) https://doi.org/10.1117/12.2520356
Event: SPIE Defense + Commercial Sensing, 2019, Baltimore, MD, United States

Abstract

Speech recognition has long been a research focus in human-computer communication and interaction. The main purpose of automatic speech recognition (ASR) is to convert speech waveform signals into text. The acoustic model is the main component of ASR; it connects the observed features of the speech signal to the speech modeling units. In recent years, deep learning has become the mainstream technology in speech recognition. In this paper, a convolutional neural network architecture combining VGG with the Connectionist Temporal Classification (CTC) loss function is proposed as the acoustic model for speech recognition. Traditional acoustic model training relies on frame-level labels and the cross-entropy criterion, which requires a tedious label alignment procedure. The CTC loss is adopted to learn the alignment between speech frames and label sequences automatically, so that training is end-to-end. The architecture can exploit the temporal and spectral structure of speech signals simultaneously. Batch normalization (BN) is used to normalize each layer's input and reduce internal covariate shift. To prevent overfitting, dropout is applied during training to improve the network's generalization ability. The speech signal is transformed, through a series of processing steps, into a spectral image that serves as the input to the neural network. The input feature has 200 dimensions, and the output labels of the acoustic model are 415 Chinese syllable pronunciations without tone. The experimental results demonstrate that the proposed model achieves a character error rate (CER) of 17.97% and 23.86% on the public Mandarin speech corpora AISHELL-1 and ST-CMDS-20170001_1, respectively.
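To make the CTC output and the CER metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes greedy (best-path) decoding, a blank symbol at index 0, and CER defined as edit distance divided by reference length; the abstract does not state which decoding or scoring procedure the paper uses, and all function names are hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (not stated in the abstract)


def ctc_greedy_decode(frame_ids, blank=BLANK):
    """Map per-frame label indices to a label sequence.

    Standard CTC collapsing rule: merge consecutive repeats first,
    then drop the blank symbol.
    """
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference symbol
                curr[j - 1] + 1,         # insert a hypothesis symbol
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Eight frames; the blank between the two 3s keeps them as separate labels.
    frames = [0, 3, 3, 0, 3, 5, 5, 0]
    print(ctc_greedy_decode(frames))
    # Toy reference and hypothesis written as toneless syllables.
    print(character_error_rate(["ni", "hao", "ma"], ["ni", "hao"]))
```

In this sketch the frame indices would come from taking the arg-max over the 415 syllable classes plus blank at every output frame. A beam search or language-model rescoring step could replace the greedy decoder without changing the CER computation.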

Citation

Qian Zhang, Jun Sang, Mohammad S. Alam, Bin Cai, and Li Yang "A Chinese acoustic model based on convolutional neural network", Proc. SPIE 10995, Pattern Recognition and Tracking XXX, 109950U (13 May 2019); https://doi.org/10.1117/12.2520356
