Improved Audio Scene Classification based on Label-Tree Embeddings and Convolutional Neural Networks

Phan, Huy, Hertel, Lars, Maass, Marco, Koch, Philipp, Mazur, Radoslaw, Mertins, Alfred (2017) Improved Audio Scene Classification based on Label-Tree Embeddings and Convolutional Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25 (6). pp. 1278-1290. ISSN 2329-9290. E-ISSN 2329-9304. (doi:10.1109/TASLP.2017.2690564) (KAR id:72672)

PDF (Author's Accepted Manuscript). Language: English
Official URL: https://doi.org/10.1109/TASLP.2017.2690564

Abstract

We present an efficient approach to audio scene classification. We aim to learn representations for scene examples by exploring the structure of their class labels. A category taxonomy is learned automatically by collectively optimizing a tree-structured clustering of the given labels into multiple meta-classes. A scene recording is then transformed into a label-tree embedding image, whose elements represent the likelihoods that the scene instance belongs to the meta-classes. We investigate classification with label-tree embedding features learned from different low-level features as well as their fusion, and show that combining multiple features is essential to obtain good performance. While averaging label-tree embedding images over time yields good performance, we argue that average pooling possesses an intrinsic shortcoming, and we propose an improved classification scheme to bypass this limitation. From these images, we aim to automatically learn common templates that are useful for the classification task using simple but tailored convolutional neural networks. The trained networks are then employed as feature extractors that match the learned templates across a label-tree embedding image and produce the maximum matching scores as features for classification. Since audio scenes exhibit rich content, template learning and matching on low-level features would be inefficient. With label-tree embedding features, the low-level features are quantized and reduced to the likelihoods of the meta-classes, on which template learning and matching are efficient. We study both training convolutional neural networks on stacked label-tree embedding images and multi-stream networks. Experimental results on the DCASE2016 and LITIS Rouen datasets demonstrate the efficiency of the proposed methods.
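
The contrast the abstract draws between average pooling and maximum template-matching scores can be illustrated with a small sketch. This is not the authors' implementation: the toy image values, the hand-written template (standing in for a learned convolutional filter), and the function names `average_pool` and `max_template_scores` are all assumptions made for illustration. The abstract does not name the shortcoming of average pooling; one plausible reading, assumed here, is that averaging over time discards the temporal order of the meta-class likelihoods.

```python
import numpy as np

def average_pool(image):
    """Average a (meta_classes x time) embedding image over the time axis."""
    return image.mean(axis=1)

def max_template_scores(image, templates):
    """Slide each (meta_classes x width) template along time; keep its best score."""
    n_frames = image.shape[1]
    scores = []
    for template in templates:
        width = template.shape[1]
        # Correlation of the template with every window of matching width.
        responses = [
            float(np.sum(image[:, t:t + width] * template))
            for t in range(n_frames - width + 1)
        ]
        scores.append(max(responses))
    return np.array(scores)

# Toy embedding image: 2 meta-classes, 4 time segments (made-up values).
image_a = np.array([[1.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0]])
# Same columns as image_a, but in a different temporal order.
image_b = image_a[:, [0, 2, 1, 3]]

# Hypothetical template: meta-class 0 active in two consecutive segments.
template = np.array([[1.0, 1.0],
                     [0.0, 0.0]])

avg_a, avg_b = average_pool(image_a), average_pool(image_b)
score_a = max_template_scores(image_a, [template])
score_b = max_template_scores(image_b, [template])

print(avg_a, avg_b)      # identical averages for both images
print(score_a, score_b)  # the template score separates them
```

In this toy case both images average to the same vector, while the maximum matching score is 2.0 for the first image and 1.0 for the second, so only the template-based feature distinguishes the two temporal patterns.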

Item Type:	Article
DOI/Identification number:	10.1109/TASLP.2017.2690564
Uncontrolled keywords:	audio scene classification, label tree embedding, convolutional neural network, multi-stream, template matching
Institutional Unit:	Schools > School of Computing
Former Institutional Unit:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Huy Phan
Date Deposited:	25 Feb 2019 15:30 UTC
Last Modified:	28 Apr 2026 08:58 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/72672 (The current URI for this page, for reference purposes)

University of Kent Author Information

Phan, Huy.

Creator's ORCID:	https://orcid.org/0000-0003-4096-785X
