An Improved Deep Embedding Learning Method for Short Duration Speaker Verification

Gao, Zhifu, Song, Yan, McLoughlin, Ian Vince, Guo, Wu, Dai, Li-Rong (2018) An Improved Deep Embedding Learning Method for Short Duration Speaker Verification. In: ISCA Conference. International Speech Communication Association (doi:10.21437/Interspeech.2018-1515) (KAR id:67451)

PDF: Author's Accepted Manuscript. Language: English
Official URL: http://dx.doi.org/10.21437/Interspeech.2018-1515

Abstract

This paper presents an improved deep embedding learning method based on convolutional neural networks (CNN) for short-duration speaker verification (SV). Existing deep learning-based SV methods generally extract frontend embeddings from a feed-forward deep neural network, in which long-term speaker characteristics are captured via a pooling operation over the input speech. The extracted embeddings are then scored by a backend model, such as Probabilistic Linear Discriminant Analysis (PLDA).
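
The frontend/backend split described above can be illustrated with a minimal sketch. It assumes frame-level network outputs are already available as a `(T, D)` array, uses plain mean pooling over time as the pooling operation, and swaps in cosine similarity as a simple stand-in for the PLDA backend. The function names `pool_embedding` and `cosine_score` are hypothetical and do not come from the paper.

```python
import numpy as np


def pool_embedding(frames):
    """Collapse frame-level features of shape (T, D) into one (D,) embedding.

    Mean pooling over time is an assumption made for illustration; it maps
    utterances of any duration T to a fixed-length vector.
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (T, D) array")
    return frames.mean(axis=0)


def cosine_score(enroll, test):
    """Cosine similarity between two embeddings (a stand-in for PLDA scoring)."""
    enroll = np.asarray(enroll, dtype=float)
    test = np.asarray(test, dtype=float)
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two utterances of different length yield embeddings of the same size.
    enroll = pool_embedding(rng.normal(size=(1000, 8)))
    test = pool_embedding(rng.normal(size=(500, 8)))
    print(enroll.shape, test.shape, cosine_score(enroll, test))
```

Because the pooled vector has a fixed size regardless of utterance length, the same backend can score trials of any duration, which is what makes this structure usable for short-duration conditions.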

Two improvements are proposed for frontend embedding learning based on the CNN structure: (1) Motivated by WaveNet for speech synthesis, dilated filters are designed to achieve a tradeoff between computational efficiency and receptive-field size; and (2) A novel cross-convolutional-layer pooling method is exploited to capture first-order statistics for modelling long-term speaker characteristics. Specifically, the activations of one convolutional layer are aggregated with the guidance of the feature maps from the successive layer. To evaluate the effectiveness of the proposed methods, extensive experiments are conducted on the modified female portion of the NIST SRE 2010 evaluation, with conditions ranging from 10s-10s to 5s-4s. The proposed system performs well on each evaluation condition, significantly outperforming existing SV systems using i-vector and d-vector embeddings.
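
Both ideas can be sketched in a few lines. The example below is one plausible reading of the abstract, not the authors' implementation. It assumes stride-1 one-dimensional convolutions for the receptive-field calculation, and it assumes that both layers produce time-aligned activations of shapes `(T, D1)` and `(T, D2)`. The normalisation by `T` and the function names `receptive_field` and `cross_layer_pool` are likewise assumptions made for illustration.

```python
import numpy as np


def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked stride-1 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d,
    while its parameter count stays the same as an undilated layer.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


def cross_layer_pool(lower, upper):
    """Aggregate lower-layer activations under guidance of the next layer.

    lower: (T, D1) activations of one convolutional layer.
    upper: (T, D2) feature maps of the successive layer, time-aligned (assumed).
    For every upper channel k, the lower activations are averaged over time
    with weights upper[:, k]; the D2 results are concatenated into a vector
    of length D1 * D2.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if lower.ndim != 2 or upper.ndim != 2 or lower.shape[0] != upper.shape[0]:
        raise ValueError("lower and upper must be 2-D with the same number of frames")
    num_frames = lower.shape[0]
    pooled = lower.T @ upper / num_frames  # shape (D1, D2)
    return pooled.T.reshape(-1)            # one D1-sized block per upper channel


if __name__ == "__main__":
    # Four layers, kernel size 3: dilation widens the field without extra weights.
    print(receptive_field(3, [1, 1, 1, 1]), receptive_field(3, [1, 2, 4, 8]))
    rng = np.random.default_rng(0)
    print(cross_layer_pool(rng.normal(size=(200, 4)), rng.normal(size=(200, 3))).shape)
```

The pooled vector is a weighted mean of activations, that is, a first-order statistic, and its length does not depend on the number of frames.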

Item Type:	Conference proceeding
DOI/Identification number:	10.21437/Interspeech.2018-1515
Uncontrolled keywords:	Speaker verification, convolution neural network, dilated convolution, cross-convolutional-layer pooling
Subjects:	T Technology
Institutional Unit:	Schools > School of Computing
Former Institutional Unit:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Ian McLoughlin
Date Deposited:	29 Jun 2018 09:20 UTC
Last Modified:	28 Apr 2026 08:51 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/67451 (The current URI for this page, for reference purposes)

University of Kent Author Information

McLoughlin, Ian Vince.

Creator's ORCID:	https://orcid.org/0000-0001-7111-2008