Learning Representations for Nonspeech Audio Events through Their Similarities to Speech Patterns

Phan, Huy, Hertel, Lars, Maass, Marco, Mazur, Radoslaw, Mertins, Alfred (2016) Learning Representations for Nonspeech Audio Events through Their Similarities to Speech Patterns. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24 (4). pp. 807-822. ISSN 2329-9290. E-ISSN 2329-9304. (doi:10.1109/TASLP.2016.2530401) (KAR id:72684)

PDF: Author's Accepted Manuscript (Language: English)
Official URL: https://doi.org/10.1109/TASLP.2016.2530401

Abstract

The human auditory system is very well matched to both human speech and environmental sounds. Therefore, the question arises whether human speech material may provide useful information for training systems that analyze nonspeech audio signals, for example, in a classification task. To answer this question, we consider speech patterns as basic acoustic concepts which embody and represent the target nonspeech signal. To find out how similar the nonspeech signal is to speech, we classify it with a classifier trained on the speech patterns and use the classification posteriors to represent the closeness to the speech bases. The speech similarities are finally employed as a descriptor to represent the target signal. We further show that a better descriptor can be obtained by learning to organize the speech categories hierarchically with a tree structure. Furthermore, these descriptors are generic: once the speech classifier has been learned, it can be employed as a feature extractor for different datasets without re-training. Lastly, we propose an algorithm to select a sufficient subset of speech patterns whose representation capability approximates that of the entire set of available speech patterns. We conduct experiments for the application of audio event analysis. Phone triplets from the TIMIT dataset were used as speech patterns to learn the descriptors for audio events in three datasets of differing complexity: UPC-TALP, Freiburg-106, and NAR. The experimental results on the event classification task show that good performance can be obtained even with a simple linear classifier. Furthermore, fusion of the learned descriptors as an additional source leads to state-of-the-art performance on all three target datasets.
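
The core idea in the abstract, describing a nonspeech signal by the class posteriors of a classifier trained only on speech patterns, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two-dimensional toy features, the pattern labels `p1` to `p3`, the function names, and the nearest-centroid softmax classifier, which is a simple stand-in and not the classifier used in the paper.

```python
import math


def fit_centroids(features, labels):
    """Train the stand-in speech classifier: one mean vector per speech pattern."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def speech_similarity_descriptor(x, centroids):
    """Return posterior-like similarities of x to each speech pattern.

    Scores are negative squared distances to the class centroids,
    normalized with a softmax so the entries sum to one.
    """
    classes = sorted(centroids)
    scores = [-sum((a - b) ** 2 for a, b in zip(x, centroids[c])) for c in classes]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy "speech pattern" training data (not TIMIT features).
speech_features = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8], [0.0, 1.0], [0.0, 1.2]]
speech_labels = ["p1", "p1", "p2", "p2", "p3", "p3"]
centroids = fit_centroids(speech_features, speech_labels)

# A hypothetical nonspeech event feature vector, described by its speech similarities.
event_frame = [0.9, 0.9]
descriptor = speech_similarity_descriptor(event_frame, centroids)
print(descriptor)
```

The resulting vector has one entry per speech pattern and could then be passed to any downstream event classifier, such as a linear one. Because the speech classifier is fitted once on speech data only, the same `centroids` can be reused to describe events from a different dataset without re-training, which mirrors the generic property claimed in the abstract.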

Item Type:	Article
DOI/Identification number:	10.1109/TASLP.2016.2530401
Uncontrolled keywords:	feature learning, representation, nonspeech audio event, speech patterns, phone triplets
Institutional Unit:	Schools > School of Computing
Former Institutional Unit:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Huy Phan
Date Deposited:	25 Feb 2019 16:23 UTC
Last Modified:	20 May 2025 10:23 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/72684 (The current URI for this page, for reference purposes)

University of Kent Author Information

Phan, Huy.

Creator's ORCID:	https://orcid.org/0000-0003-4096-785X