Miao, Xiaoxiao, McLoughlin, Ian Vince, Yan, Yonghong (2020) A New Time–Frequency Attention Tensor Network for Language Identification. Circuits, Systems, and Signal Processing, 39. pp. 2744-2758. ISSN 0278-081X. (doi:10.1007/s00034-019-01286-9) (KAR id:77646)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Official URL: http://dx.doi.org/10.1007/s00034-019-01286-9
Abstract
In this paper, we aim to improve traditional DNN x-vector language identification (LID) performance by employing Wide Residual Networks (WRN) as a powerful feature extractor, which we combine with a novel frequency attention network (F-ATN). Compared with conventional time attention, our method learns discriminative weights for different frequency bands to generate weighted means and standard deviations for utterance-level classification. This mechanism enables the architecture to direct attention to important frequency bands rather than to important time frames, as traditional time attention (T-ATN) methods do. We then introduce a cross-layer frequency attention tensor network (CLF-ATN), which exploits information from different layers to recapture frame-level language characteristics that have been dropped by aggressive frequency pooling in lower layers. This effectively restores fine-grained discriminative language details. Finally, we explore the joint fusion of frame-level and frequency-band attention in a time-frequency attention network (TF-ATN). Experimental results show, firstly, that WRN can significantly outperform a traditional DNN x-vector implementation; secondly, that the proposed frequency attention method is more effective than time attention; and thirdly, that frequency-time score fusion can yield further improvement. Finally, extensive experiments on CLF-ATN demonstrate that it is able to improve discrimination by regaining dropped fine-grained frequency information, particularly for low-dimensional frequency features.
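The idea of attention-weighted means and standard deviations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the commonly used attentive statistics pooling form (a one-hidden-layer scoring function followed by a softmax), and the tensor shapes, the parameter names `W`, `b`, `v`, and the step of averaging over the other axis before pooling are all hypothetical choices made only for illustration. The same function is applied once over frequency bands and once over time frames to show the contrast the abstract draws.

```python
import numpy as np

def attentive_stats_pool(x, W, b, v):
    """Attention-weighted mean and standard deviation over the rows of x.

    x: (N, D) array, one D-dimensional descriptor per item
       (items are frequency bands for frequency attention,
        time frames for time attention).
    W: (D, H), b: (H,), v: (H,) are hypothetical scoring parameters.
    Returns (alpha, pooled): weights of shape (N,) and a (2*D,) vector.
    """
    scores = np.tanh(x @ W + b) @ v           # one scalar score per item
    scores = scores - scores.max()            # numerical stability for softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    mean = alpha @ x                          # weighted mean, shape (D,)
    var = alpha @ (x ** 2) - mean ** 2        # weighted variance, shape (D,)
    std = np.sqrt(np.clip(var, 1e-8, None))   # clip guards against tiny negatives
    return alpha, np.concatenate([mean, std])

# Toy feature map standing in for a feature-extractor output (assumed shapes).
rng = np.random.default_rng(0)
F, T, C, H = 8, 50, 16, 32                    # bands, frames, channels, hidden size
feats = rng.standard_normal((F, T, C))
W = rng.standard_normal((C, H)) * 0.1
b = np.zeros(H)
v = rng.standard_normal(H) * 0.1

# Assumption: summarise each band (or frame) by averaging over the other axis.
band_view = feats.mean(axis=1)                # (F, C): one descriptor per band
frame_view = feats.mean(axis=0)               # (T, C): one descriptor per frame

freq_alpha, freq_embedding = attentive_stats_pool(band_view, W, b, v)
time_alpha, time_embedding = attentive_stats_pool(frame_view, W, b, v)
```

In this sketch `freq_alpha` holds one weight per frequency band and `time_alpha` one weight per frame; both pooled vectors have length `2 * C` and could feed an utterance-level classifier. In a trained system the scoring parameters would be learned jointly with the rest of the network rather than drawn at random.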
Item Type: | Article |
---|---|
DOI/Identification number: | 10.1007/s00034-019-01286-9 |
Uncontrolled keywords: | Language Identification, DNN x-vector, time-frequency attention tensor network, cross-layer frequency tensor attention network |
Subjects: | T Technology |
Divisions: | Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing |
Depositing User: | Ian McLoughlin |
Date Deposited: | 21 Oct 2019 08:04 UTC |
Last Modified: | 05 Nov 2024 12:42 UTC |
Resource URI: | https://kar.kent.ac.uk/id/eprint/77646 (The current URI for this page, for reference purposes) |