Kent Academic Repository

A New Time–Frequency Attention Tensor Network for Language Identification

Miao, Xiaoxiao, McLoughlin, Ian Vince, Yan, Yonghong (2020) A New Time–Frequency Attention Tensor Network for Language Identification. Circuits, Systems, and Signal Processing, 39, pp. 2744-2758. ISSN 0278-081X. (doi:10.1007/s00034-019-01286-9) (KAR id:77646)

Official URL:
http://dx.doi.org/10.1007/s00034-019-01286-9

Abstract

In this paper, we aim to improve traditional DNN x-vector language identification (LID) performance by employing Wide Residual Networks (WRN) as a powerful feature extractor, which we combine with a novel frequency attention network (F-ATN). Compared with conventional time attention, our method learns discriminative weights for different frequency bands to generate weighted means and standard deviations for utterance-level classification. This mechanism enables the architecture to direct attention to important frequency bands rather than to important time frames, as traditional time attention (T-ATN) methods do. We then introduce a cross-layer frequency attention tensor network (CLF-ATN), which exploits information from different layers to recapture frame-level language characteristics that have been dropped by aggressive frequency pooling in lower layers. This effectively restores fine-grained discriminative language details. Finally, we explore the joint fusion of frame-level and frequency-band attention in a time-frequency attention network (TF-ATN). Experimental results show, firstly, that WRN can significantly outperform a traditional DNN x-vector implementation; secondly, that the proposed frequency attention method is more effective than time attention; and thirdly, that frequency-time score fusion can yield further improvement. Extensive experiments on CLF-ATN also demonstrate that it is able to improve discrimination by regaining dropped fine-grained frequency information, particularly for low-dimensional frequency features.
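The weighted mean and standard deviation pooling mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scoring function (a single tanh layer with hypothetical parameters W, b and v), the feature-map shape (channels, frequency bands, time frames), and the choice to average over time to obtain one descriptor per band are all assumptions made for illustration. The same helper is applied along the frequency axis (in the spirit of F-ATN) and along the time axis (in the spirit of T-ATN) only to show the contrast between the two.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attentive_stats(units, W, b, v):
    """Generic attentive statistics pooling (illustrative, not the paper's code).

    units: (N, D) array, one D-dim descriptor per attended item
           (a frequency band or a time frame).
    W, b, v: hypothetical attention parameters of shapes (D, H), (H,), (H,).
    Returns the attention weights, the weighted mean and the weighted std.
    """
    scores = np.tanh(units @ W + b) @ v          # (N,) one score per item
    alpha = softmax(scores)                      # (N,) weights summing to 1
    mean = alpha @ units                         # (D,) weighted mean
    var = alpha @ (units ** 2) - mean ** 2       # (D,) weighted variance
    std = np.sqrt(np.clip(var, 1e-8, None))      # clip guards against round-off
    return alpha, mean, std

rng = np.random.default_rng(0)
C, F, T, H = 4, 8, 20, 16                        # assumed sizes
feature_map = rng.standard_normal((C, F, T))     # stand-in for extractor output
W = rng.standard_normal((C, H))
b = np.zeros(H)
v = rng.standard_normal(H)

# Frequency attention: one descriptor per band (time averaged out, an assumption).
band_units = feature_map.mean(axis=2).T          # (F, C)
alpha_f, mean_f, std_f = attentive_stats(band_units, W, b, v)

# Time attention: one descriptor per frame (frequency averaged out).
frame_units = feature_map.mean(axis=1).T         # (T, C)
alpha_t, mean_t, std_t = attentive_stats(frame_units, W, b, v)

# Utterance-level vector: concatenated weighted mean and std, length 2 * C.
utterance_vec_f = np.concatenate([mean_f, std_f])
```

With all attention scores equal, the weights become uniform and the pooling reduces to the plain mean and standard deviation, which is a convenient sanity check for a sketch like this.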

Item Type: Article
DOI/Identification number: 10.1007/s00034-019-01286-9
Uncontrolled keywords: Language Identification, DNN x-vector, time-frequency attention tensor network, cross-layer frequency tensor attention network
Subjects: T Technology
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User: Ian McLoughlin
Date Deposited: 21 Oct 2019 08:04 UTC
Last Modified: 04 Jul 2023 13:01 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/77646 (The current URI for this page, for reference purposes)

University of Kent Author Information

Miao, Xiaoxiao.

Creator's ORCID:
CReDIT Contributor Roles:

McLoughlin, Ian Vince.

Creator's ORCID: https://orcid.org/0000-0001-7111-2008
CReDIT Contributor Roles: