A Conditional Generative Model for Speech Enhancement

Li, Zheng-xi, Dai, Li-Rong, Song, Yan, McLoughlin, Ian Vince (2018) A Conditional Generative Model for Speech Enhancement. Circuits, Systems, and Signal Processing, 37, pp. 5005-5022. ISSN 0278-081X. E-ISSN 1531-5878. (doi:10.1007/s00034-018-0798-4) (KAR id:66126)

PDF: Author's Accepted Manuscript (not final). Language: English
Official URL: http://dx.doi.org/10.1007/s00034-018-0798-4
Additional URLs: http://www.springer.com/engineering/elec...

Abstract

Deep-learning-based speech enhancement approaches such as Deep Neural Networks (DNN) and Long Short-Term Memory (LSTM) networks have already been shown to outperform classical methods.

However, these methods do not take full advantage of temporal context information. While DNN and LSTM consider temporal context in the noisy source speech, they do not do so for the estimated clean speech. Both DNN and LSTM also tend to over-smooth spectra, which causes the enhanced speech to sound muffled.

This paper proposes a novel architecture to address both issues, which we term a conditional generative model (CGM). By adopting an adversarial training scheme applied to a generator of deep dilated convolutional layers, CGM is designed to model the joint and symmetric conditions of both noisy and estimated clean spectra. We evaluate CGM against both DNN and LSTM in terms of Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) on TIMIT sentences corrupted by ITU-T P.501 and NOISEX-92 noise in a range of matched and mismatched noise conditions. Results show that both the CGM architecture and the adversarial training mechanism lead to better PESQ and STOI in all tested noise conditions. In addition to yielding significant improvements in PESQ and STOI, CGM and adversarial training both mitigate over-smoothing.
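
The abstract does not give the generator's layer configuration, so the following is only a minimal sketch of what a dilated convolution is and why stacking such layers widens temporal context. The function names, the kernel size of 3, and the dilation schedules are illustrative assumptions, not values taken from the paper.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation form, no padding).

    Kernel taps are spaced `dilation` samples apart, so one output sample
    covers (len(kernel) - 1) * dilation + 1 input samples.
    """
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(kernel[k] * signal[t + k * dilation] for k in range(len(kernel)))
        for t in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Input frames seen by one output frame of a stack of dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    # Hypothetical schedule: kernel size 3, dilation doubling per layer.
    frames = list(range(20))
    out = frames
    for d in (1, 2, 4):
        out = dilated_conv1d(out, [1.0, 1.0, 1.0], d)
    # Each surviving output frame depends on receptive_field(3, [1, 2, 4]) inputs.
    print(len(frames) - len(out) + 1, receptive_field(3, [1, 2, 4]))
```

With dilations that double at each layer, the receptive field grows roughly exponentially with depth while the number of weights grows only linearly, which is the usual motivation for using dilated layers to capture long temporal context.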

Item Type:	Article
DOI/Identification number:	10.1007/s00034-018-0798-4
Uncontrolled keywords:	Deep learning, Speech enhancement, Generative model
Subjects:	T Technology
Institutional Unit:	Schools > School of Computing
Former Institutional Unit:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Ian McLoughlin
Date Deposited:	26 Feb 2018 10:39 UTC
Last Modified:	20 May 2025 10:21 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/66126 (The current URI for this page, for reference purposes)

University of Kent Author Information

McLoughlin, Ian Vince.

Creator's ORCID:	https://orcid.org/0000-0001-7111-2008