A New Time–Frequency Attention Tensor Network for Language Identification

Miao, Xiaoxiao, McLoughlin, Ian Vince, Yan, Yonghong (2020) A New Time–Frequency Attention Tensor Network for Language Identification. Circuits, Systems, and Signal Processing, 39, pp. 2744-2758. ISSN 0278-081X. (doi:10.1007/s00034-019-01286-9) (KAR id:77646)

Official URL
http://dx.doi.org/10.1007/s00034-019-01286-9

Abstract

In this paper, we aim to improve traditional DNN x-vector language identification (LID) performance by employing Wide Residual Networks (WRN) as a powerful feature extractor, which we combine with a novel frequency attention network (F-ATN). Compared with conventional time attention, our method learns discriminative weights for different frequency bands to generate weighted means and standard deviations for utterance-level classification. This mechanism enables the architecture to direct attention to important frequency bands rather than to important time frames, as traditional time attention (T-ATN) methods do. We then introduce a cross-layer frequency attention tensor network (CLF-ATN), which exploits information from different layers to recapture frame-level language characteristics that have been dropped by aggressive frequency pooling in lower layers. This effectively restores fine-grained discriminative language details. Finally, we explore the joint fusion of frame-level and frequency-band attention in a time-frequency attention network (TF-ATN). Experimental results show, firstly, that WRN can significantly outperform a traditional DNN x-vector implementation; secondly, that the proposed frequency attention method is more effective than time attention; and thirdly, that frequency-time score fusion can yield further improvement. Finally, extensive experiments on CLF-ATN demonstrate that it is able to improve discrimination by regaining dropped fine-grained frequency information, particularly for low-dimensional frequency features.
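
The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea it describes: attention weights are learned over one axis of a feature map and used to form a weighted mean and standard deviation. Pooling over frequency bands corresponds loosely to the F-ATN idea and pooling over time frames to T-ATN. The scoring function (a tanh followed by a dot product), the vectors `w_freq` and `w_time`, the toy feature map `fmap`, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pool(items, w):
    """Attention-weighted mean and standard deviation over a list of vectors.

    items: N vectors of equal length D (one per frequency band or time frame).
    w:     hypothetical scoring vector of length D, standing in for learned
           attention parameters (an assumption, not taken from the paper).
    """
    # One scalar score per item: tanh nonlinearity, then a dot product with w.
    scores = [sum(wi * math.tanh(x) for wi, x in zip(w, vec)) for vec in items]
    weights = softmax(scores)
    dim = len(items[0])
    mean = [sum(a * vec[d] for a, vec in zip(weights, items)) for d in range(dim)]
    second = [sum(a * vec[d] ** 2 for a, vec in zip(weights, items)) for d in range(dim)]
    # Clamp at zero to guard against tiny negative values caused by rounding.
    std = [math.sqrt(max(m2 - m * m, 0.0)) for m2, m in zip(second, mean)]
    return weights, mean, std


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


# Toy feature map (assumed values): 3 frequency bands (rows) x 4 time frames (columns).
fmap = [
    [0.1, 0.2, 0.1, 0.3],
    [1.0, 1.2, 0.9, 1.1],
    [0.4, 0.5, 0.3, 0.6],
]

# Frequency attention: one weight per band; each band is described by its frames.
w_freq = [0.5, 0.5, 0.5, 0.5]
band_weights, f_mean, f_std = attentive_stats_pool(fmap, w_freq)

# Time attention: one weight per frame; each frame is described by its bands.
w_time = [0.5, 0.5, 0.5]
frame_weights, t_mean, t_std = attentive_stats_pool(transpose(fmap), w_time)

print("band weights:", [round(a, 3) for a in band_weights])
print("frame weights:", [round(a, 3) for a in frame_weights])
```

With an all-zero scoring vector the attention weights become uniform, so the pooled statistics reduce to the ordinary mean and population standard deviation. In a trained system the weights would instead be learned so that more discriminative bands or frames contribute more to the utterance-level statistics.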

Item Type: Article
DOI/Identification number: 10.1007/s00034-019-01286-9
Uncontrolled keywords: Language Identification, DNN x-vector, time-frequency attention tensor network, cross-layer frequency tensor attention network
Subjects: T Technology
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User: Ian McLoughlin
Date Deposited: 21 Oct 2019 08:04 UTC
Last Modified: 16 Feb 2021 14:08 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/77646 (The current URI for this page, for reference purposes)
McLoughlin, Ian Vince: https://orcid.org/0000-0001-7111-2008