Multi-task deep neural network acoustic models with model adaptation using discriminative speaker identity for whisper recognition

Li, Jingjie, McLoughlin, Ian Vince, Liu, Cong, Xue, Shaofei, Wei, Si (2016) Multi-task deep neural network acoustic models with model adaptation using discriminative speaker identity for whisper recognition. In: IEEE International Conference on Acoustics Speech and Signal Processing. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4969-4973. Institute of Electrical and Electronics Engineers (IEEE), South Brisbane, QLD (doi:10.1109/ICASSP.2015.7178916) (KAR id:55019)

PDF (Author's Accepted Manuscript). Language: English. This work is licensed under a Creative Commons Attribution 4.0 International License.
Official URL: http://dx.doi.org/10.1109/ICASSP.2015.7178916

Abstract

This paper presents a study on whispered large vocabulary continuous speech recognition (wLVCSR). wLVCSR makes it possible to use automatic speech recognition (ASR) equipment in public places without disturbing others or leaking private information. However, wLVCSR is much more challenging than normal LVCSR because whispers lack pitch, which not only makes the signal-to-noise ratio (SNR) of whispers much lower than that of normal speech but also leads to flatness and formant shifts in whisper spectra. Furthermore, far less whisper data is available for training than normal speech data. In this paper, multi-task deep neural network (DNN) acoustic models are deployed to address these problems. Moreover, model adaptation based on discriminative speaker identity information is performed on the multi-task DNN to normalize speaker and environmental variability in whispers. On a Mandarin whisper dictation task with 55 hours of whisper data, the proposed speaker-independent (SI) multi-task DNN model achieves a 56.7% character error rate (CER) improvement over a baseline Gaussian Mixture Model (GMM) that is discriminatively trained using only the whisper data. In addition, the CER of the proposed model on normal speech reaches 15.2%, which is close to the performance of a state-of-the-art DNN trained with one thousand hours of speech data. From this baseline, the model-adapted DNN gains a further 10.9% CER reduction over the generic model.
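
The abstract names two ideas: a multi-task DNN acoustic model, and adaptation driven by speaker identity (the keywords mention a "speaker code"). The following is only a minimal sketch of how such a model could be wired, not the authors' implementation. The abstract does not give the architecture, so everything here is an assumption made for illustration: a single shared hidden layer, one output head per task ("whisper" and "normal"), a speaker code concatenated to the input features, and all layer sizes, names, and values.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-a)) for a in values]


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(a - peak) for a in values]
    total = sum(exps)
    return [e / total for e in exps]


class MultiTaskDNN:
    """Hypothetical multi-task model: shared hidden layer, one head per task."""

    def __init__(self, n_feat, n_code, n_hidden, n_states, seed=0):
        rng = random.Random(seed)
        # Assumption: the shared layer sees acoustic features plus a speaker code.
        self.shared = make_layer(n_feat + n_code, n_hidden, rng)
        # Assumption: one task-specific output layer per speaking style.
        self.heads = {
            "whisper": make_layer(n_hidden, n_states, rng),
            "normal": make_layer(n_hidden, n_states, rng),
        }

    def forward(self, features, speaker_code, task):
        """Return state posteriors for one frame from the chosen task head."""
        hidden = sigmoid(dense(self.shared, features + speaker_code))
        return softmax(dense(self.heads[task], hidden))


# Illustrative values only; real features and codes would come from data.
model = MultiTaskDNN(n_feat=4, n_code=2, n_hidden=8, n_states=3)
frame = [0.2, -0.1, 0.4, 0.0]
code = [0.5, -0.5]
posteriors = model.forward(frame, code, task="whisper")
```

The design point the sketch tries to show is that both tasks share the hidden representation, so the larger amount of normal speech can shape the layers that whisper recognition also uses, while a per-speaker code gives the network a handle for adaptation. Training, which the sketch omits, would update the shared layer from both heads.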

Item Type:	Conference or workshop item (Paper)
DOI/Identification number:	10.1109/ICASSP.2015.7178916
Uncontrolled keywords:	Whisper recognition; model adaption; speaker code; multi-task DNN; Silent speech interface
Subjects:	T Technology
Divisions:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Ian McLoughlin
Date Deposited:	19 Apr 2016 09:47 UTC
Last Modified:	09 Dec 2022 05:38 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/55019 (The current URI for this page, for reference purposes)

University of Kent Author Information

McLoughlin, Ian Vince.

Creator's ORCID:	https://orcid.org/0000-0001-7111-2008