Scott Stevenson

Representation learning for audio data

Classical machine learning often cannot be applied to modern, complex datasets, such as audio recordings of human speech, without extensive feature engineering. Traditionally, feature engineering requires deep domain knowledge to extract the key components of the data.
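As a concrete illustration of such an engineered feature, mel-frequency cepstral coefficients (MFCCs) encode substantial domain knowledge about human hearing into a compact representation of speech. A minimal sketch using librosa follows; the file path, sample rate, and coefficient count are placeholder choices, not values from the talk.

```python
# Hand-engineered audio features: MFCCs via librosa.
# The file path and parameter values below are illustrative assumptions.
import librosa

# Load a speech recording at a 16 kHz sample rate (placeholder path).
y, sr = librosa.load("speech.wav", sr=16000)

# 13 coefficients per frame is a conventional choice for speech tasks.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, n_frames)
```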

The development of deep learning means we can now forgo feature engineering and instead train models to learn their own representations of the data. However, since these learned representations significantly affect performance on downstream tasks, how can we ensure that they are appropriate?

In this talk, I discuss representation learning and its application to speaker classification. We cover an introduction to representation learning and variational autoencoders, and explore an architecture that uses labelled data to create representations well suited to speaker classification tasks.
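To make the variational autoencoder idea concrete, here is a minimal sketch in PyTorch. The layer sizes, the latent dimension, and the assumption of fixed-length input frames (e.g. mel-spectrogram slices) are all illustrative, not the architecture presented in the talk.

```python
# Minimal variational autoencoder sketch; all sizes and names are
# illustrative assumptions, not the talk's actual architecture.
import torch
import torch.nn as nn

class SpeechVAE(nn.Module):
    def __init__(self, input_dim=80, latent_dim=16):
        super().__init__()
        # Encoder maps an input frame to the parameters of a Gaussian
        # distribution over the latent space.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.log_var = nn.Linear(128, latent_dim)
        # Decoder reconstructs the input frame from a latent sample.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterisation trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return self.decoder(z), mu, log_var

def vae_loss(x, recon, mu, log_var):
    # Reconstruction term plus KL divergence to the unit Gaussian prior.
    recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl
```

Once trained, the mean vector mu provides a fixed-length representation of each frame that a downstream speaker classifier could consume; incorporating labels during training, as the talk describes, shapes that representation to better separate speakers.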