Representation learning for audio data

3 September 2020

Applying classical machine learning methods to complex data formats, such as audio of human speech, typically necessitates extensive feature engineering. This requires significant domain knowledge to extract the key components of the data.

Deep learning allows models to learn their own data representations, obviating the need for manual feature engineering. However, since the quality of the learned representations strongly influences performance on downstream tasks, how can we ensure that these representations are appropriate?

This talk explores the subject of representation learning and its application to speaker classification. We provide an overview of representation learning and variational autoencoders before discussing an architecture that employs labelled data to learn representations well-suited to speaker classification tasks.