Deep bottleneck features for spoken language identification

Jiang, Bing, Song, Yan, Wei, Si, Liu, Jun-Hua, McLoughlin, Ian Vince, Dai, Li-Rong (2014) Deep bottleneck features for spoken language identification. PLoS ONE, 9 (7). Article Number 100795. ISSN 1932-6203. (doi:10.1371/journal.pone.0100795) (The full text of this publication is not currently available from this repository. You may be able to access a copy if URLs are provided) (KAR id:48803)

Official URL: http://dx.doi.org/10.1371/journal.pone.0100795

Abstract

A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short-duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore, they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate their effectiveness, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF-based i-vector representation for each speech utterance. Results on the NIST language recognition evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve equal error rates (EER) of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances, respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system is proposed.
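
The results above are reported as EER, the operating point at which the miss rate and the false-alarm rate of a detector are equal. As a reading aid, a minimal sketch of how such a figure can be computed from detection scores is given below. It is not the evaluation code used in the paper: the function name `equal_error_rate`, the toy scores, and the convention that a higher score means "more likely the target language" are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for two sets of detection scores.

    Assumption: a higher score means "more likely the target language".
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # Miss rate: fraction of target trials scoring below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])

    # False-alarm rate: fraction of non-target trials at or above the threshold.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER lies where the two curves cross; pick the closest candidate
    # and average the two rates in case they do not meet exactly.
    i = int(np.argmin(np.abs(miss - false_alarm)))
    return (miss[i] + false_alarm[i]) / 2.0, thresholds[i]


if __name__ == "__main__":
    # Hypothetical toy scores, not data from the paper.
    eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With the toy scores, one of four target trials falls below the chosen threshold and one of four non-target trials reaches it, so both error rates are 25%. On this scale, the 1.08% reported for 30 s utterances corresponds to roughly one error of each kind per hundred trials.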

Item Type:	Article
DOI/Identification number:	10.1371/journal.pone.0100795
Subjects:	T Technology
Divisions:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Ian McLoughlin
Date Deposited:	25 Aug 2015 09:40 UTC
Last Modified:	17 Aug 2022 10:58 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/48803 (The current URI for this page, for reference purposes)

University of Kent Author Information

McLoughlin, Ian Vince.

Creator's ORCID:	https://orcid.org/0000-0001-7111-2008