Music auto-tagging, classification, and embedding extraction

In this tutorial, we use Essentia’s TensorFlow integration to perform auto-tagging, classification, and embedding extraction.


First of all, notice that the default Essentia’s pip package does not include TensorFlow support. Currently, this is only available through pip for Linux with Python >=3.5 and <=3.7.

!pip install -q essentia-tensorflow

Alternatively, you can follow the instructions in this blog post to build Essentia with TensorFlow support. This approach also works in Mac, for Python >=3.8 or with TensorFlow >=2.0.

After this step, we can import the required packages.

import json

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN, TensorflowPredictVGGish
import numpy as np
import matplotlib.pyplot as plt

The Essentia models’ site contains models for diverse purposes. Each model comes with a .json metadata file with information such as the classes it can predict, the metrics achieved on training, or the model’s version.

Let’s download the MusiCNN auto-tagging model trained on the Million Song Dataset.

!wget -q
!wget -q

We can open the metadata file and check all the available keys.

with open('msd-musicnn-1.json', 'r') as json_file:
    metadata = json.load(json_file)

dict_keys(['name', 'type', 'link', 'version', 'description', 'author', 'email', 'release_date', 'framework', 'framework_version', 'classes', 'model_types', 'dataset', 'schema', 'citation'])

Finally, we load an audio file to use as an example.

audio_file = '../../../test/audio/recorded/techno_loop.wav'
audio = MonoLoader(sampleRate=16000, filename=audio_file)()


Now we have all the ingredients to perform auto-tagging with MusiCNN. TensorflowPredictMusiCNN is our dedicated algorithm to make predictions with models expecting MusiCNN’s mel-spectrogram signature. We have to configure the graphFilename parameter with the path to the model and feed the algorithm with audio sampled at 16kHz. The output is a two-dimensional matrix [time, activations].

activations = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb')(audio)

Finally, we use matplotlib to plot the activations.

ig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.matshow(activations.T, aspect='auto')

ax.set_xlabel('patch number')
plt.title('Tag activations')

Classification with the transfer learning classifiers

Essentia is shipped with a collection of single-label classifiers based on transfer learning that rely on our auto-tagging models or other deep embedding extractors. You can read our blog post for a complete list of the available classifiers.

In this example, we use the danceability classifier based on the VGGish model.

!wget -q
!wget -q

We use the TensorflowPredictVGGish algorithm because it generates the required mel-spectrogram signature for this case, but if we wanted to use the danceability classifier based on MusiCNN we should use TensorflowPredictMusiCNN again.

with open('danceability-vggish-audioset-1.json', 'r') as json_file:
    metadata = json.load(json_file)

activations = TensorflowPredictVGGish(graphFilename='danceability-vggish-audioset-1.pb')(audio)

Finally, we can compute the global accuracy as the mean of the activations along the temporal axis.

for label, probability in zip(metadata['classes'], activations.mean(axis=0)):
    print(f'{label}: {100 * probability:.1f}%')
danceable: 99.8%
not_danceable: 0.2%

Embedding extraction

A usual transfer learning approach consists of extracting low-dimensional feature maps from one of the last layers of a pre-trained model as embeddings for a new downstream task. The metadata includes a field schema with the names of the most relevent layers of each model, including layers that are suitable for embedding extraction.

with open('msd-musicnn-1.json', 'r') as json_file:
    metadata = json.load(json_file)

{'inputs': [{'name': 'model/Placeholder', 'type': 'float', 'shape': [187, 96]}], 'outputs': [{'name': 'model/Sigmoid', 'type': 'float', 'shape': [1, 50], 'op': 'Sigmoid'}, {'name': 'model/dense_1/BiasAdd', 'type': 'float', 'shape': [1, 50], 'op': 'fully connected', 'description': 'logits'}, {'name': 'model/dense/BiasAdd', 'type': 'float', 'shape': [1, 200], 'op': 'fully connected', 'description': 'embeddings'}]}

From this, we see that model/dense_1/BiasAdd is the recommended layer for embeddings. Additionally, an external tool such as Netron can be used to inspect the model and get the information about the whole architecture.

TensorflowPredictMusiCNN offers an output parameter that can be configured to select the layer of the model to retrieve.

embeddings = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb', output='model/dense_1/BiasAdd')(audio)
ig, ax = plt.subplots(1, 1, figsize=(10, 4))
ax.matshow(embeddings.T, aspect='auto')
ax.set_xlabel('patch number')