Real-time music auto-tagging

In this tutorial, we use Essentia’s TensorFlow integration to perform auto-tagging in real time. It also serves as an example of TensorFlow inference in streaming mode and can easily be adapted to work offline.
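For reference, the same model can also be run offline with the standard-mode wrapper covered in the previous tutorial. A minimal sketch, assuming the msd-musicnn-1.pb model downloaded in the Setup section below and a local file audio.mp3 (a hypothetical path):

from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

# MusiCNN models expect 16 kHz mono audio.
audio = MonoLoader(filename='audio.mp3', sampleRate=16000)()

# One vector of tag activations per analyzed patch.
predictions = TensorflowPredictMusiCNN(graphFilename='msd-musicnn-1.pb')(audio)
print(predictions.shape)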

Setup

To install Essentia with TensorFlow support, refer to the Setup section of our previous Music auto-tagging, classification, and embedding extraction tutorial for instructions.

Additionally, we rely on the SoundCard package to capture the system’s audio loopback and feed Essentia in real time. This way we can easily test our models with any music coming from our local player or browser.

!pip -q install soundcard
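To verify that a loopback device is available on your system, you can list the capture devices exposed by SoundCard:

import soundcard as sc

# Loopback devices mirror whatever is currently playing through the speakers.
print(sc.all_microphones(include_loopback=True))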

Let’s download MusiCNN, one of our auto-tagging models. This and other models are available on the Essentia models site.

!wget -q https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb
!wget -q https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.json

Then we import the required packages and the Essentia algorithms. In this case, we use the TensorFlow-related algorithms in streaming mode.

import json

from essentia.streaming import (
    VectorInput,
    FrameCutter,
    TensorflowInputMusiCNN,
    VectorRealToTensor,
    TensorToPool,
    TensorflowPredict,
    PoolToTensor,
    TensorToVectorReal
)
from essentia import Pool, run, reset
from IPython import display
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import softmax
import soundcard as sc

%matplotlib nbagg

Define the analysis parameters. To make this demo work in real time, we tweaked some of MusiCNN’s analysis parameters. While the model was trained on patches of 187 frames (~3 seconds), we set patch_size to 64 (~1 second) to increase the prediction rate. You can experiment with the patch_size and display_size parameters to adjust the prediction rate to your taste.

with open('msd-musicnn-1.json', 'r') as json_file:
    metadata = json.load(json_file)

model_file = 'msd-musicnn-1.pb'
input_layer = metadata['schema']['inputs'][0]['name']
output_layer = metadata['schema']['outputs'][0]['name']
classes = metadata['classes']
n_classes = len(classes)

# Analysis parameters.
sample_rate = 16000
frame_size = 512
hop_size = 256
n_bands = 96
patch_size = 64
display_size = 10

buffer_size = patch_size * hop_size
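As a quick sanity check, the duration of each analysis patch (and thus the prediction rate) follows directly from these parameters; an illustrative computation:

# Each prediction consumes patch_size mel frames, each advancing hop_size samples.
patch_duration = patch_size * hop_size / sample_rate
print(f'New prediction roughly every {patch_duration:.2f} s')  # 64 * 256 / 16000 = 1.02 s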

Instantiate the algorithms. With this, we create a network similar to the one used inside TensorflowPredictMusiCNN, the wrapper algorithm presented in the previous tutorial. However, by instantiating the algorithms separately, we gain the additional control required for real-time usage.

buffer = np.zeros(buffer_size, dtype='float32')
vimp = VectorInput(buffer)
fc = FrameCutter(frameSize=frame_size, hopSize=hop_size)
tim = TensorflowInputMusiCNN()
vtt = VectorRealToTensor(shape=[1, 1, patch_size, n_bands],
                         lastPatchMode='discard')
ttp = TensorToPool(namespace=input_layer)
tfp = TensorflowPredict(graphFilename=model_file,
                        inputs=[input_layer],
                        outputs=[output_layer])
ptt = PoolToTensor(namespace=output_layer)
ttv = TensorToVectorReal()
pool = Pool()

Connect the algorithms. We also store the mel-spectrograms in the Pool for visualization purposes.

vimp.data   >> fc.signal
fc.frame    >> tim.frame
tim.bands   >> vtt.frame
tim.bands   >> (pool, 'melbands')
vtt.tensor  >> ttp.tensor
ttp.pool    >> tfp.poolIn
tfp.poolOut >> ptt.pool
ptt.tensor  >> ttv.tensor
ttv.frame   >> (pool, output_layer)

Create a callback function that will be called every time the audio buffer is ready to process.

def callback(data):
    buffer[:] = data.flatten()

    # Generate predictions.
    reset(vimp)
    run(vimp)

    # Update the mel-spectrogram and activation buffers.
    mel_buffer[:] = np.roll(mel_buffer, -patch_size, axis=1)
    mel_buffer[:, -patch_size:] = pool['melbands'][-patch_size:, :].T
    img_mel.set_data(mel_buffer)

    act_buffer[:] = np.roll(act_buffer, -1, axis=1)
    act_buffer[:, -1] = softmax(20 * pool[output_layer][-1, :])
    img_act.set_data(act_buffer)

    f.canvas.draw()
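The factor of 20 inside the softmax is only there to sharpen the activations so that the dominant tags stand out in the plot. A small illustration with made-up values, using the numpy and softmax imports from above:

# Made-up activation values to illustrate the effect of the scaling factor.
acts = np.array([0.1, 0.3, 0.6])
print(softmax(acts))       # fairly flat: ~[0.26, 0.32, 0.43]
print(softmax(20 * acts))  # sharply peaked: ~[0.00, 0.00, 1.00]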

Initialize the plots and start processing the loopback stream.

mel_buffer = np.zeros([n_bands, patch_size * display_size])
act_buffer = np.zeros([n_classes, display_size])

pool.clear()

f, ax = plt.subplots(1, 2, figsize=[9.6, 7])
f.canvas.draw()

ax[0].set_title('Mel Spectrogram')
img_mel = ax[0].imshow(mel_buffer, aspect='auto',
                       origin='lower', vmin=0, vmax=6)
ax[0].set_xticks([])

ax[1].set_title('Activations')
img_act = ax[1].matshow(act_buffer, aspect=0.5, vmin=0, vmax=1)
ax[1].set_xticks([])
ax[1].yaxis.set_ticks_position('right')
plt.yticks(np.arange(n_classes), classes, fontsize=6)

# Capture and process the speakers' loopback.
with sc.all_microphones(include_loopback=True)[0].recorder(samplerate=sample_rate) as mic:
    while True:
        callback(mic.record(numframes=buffer_size).mean(axis=1))
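Note that the loop above runs until the kernel is interrupted. If the first loopback device is not the one playing your music, you can select a device by name instead. A minimal sketch, assuming an output device whose name contains 'Speakers' (a hypothetical name, adjust it to the list printed earlier), with a graceful stop on keyboard interrupt:

# 'Speakers' is a placeholder device name; check sc.all_microphones(include_loopback=True).
loopback = sc.get_microphone('Speakers', include_loopback=True)

with loopback.recorder(samplerate=sample_rate) as mic:
    try:
        while True:
            callback(mic.record(numframes=buffer_size).mean(axis=1))
    except KeyboardInterrupt:
        print('Stopped.')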