Freesound Extractor

essentia_streaming_extractor_freesound is a configurable command-line feature extractor for sound analysis. It computes a large set of common sound descriptors, including various spectral, time-domain, rhythm, and tonal characteristics, with no programming required. It is designed for batch computation on large sound collections and is used by Freesound. Prebuilt static binaries of this extractor are available on the Essentia website.

It is possible to customize the parameters of audio analysis, frame summarization, high-level classifier models, and output format using a yaml profile file (see below). By writing your own profile file you can specify:

  • output format (json or yaml)

  • whether to store all frame values

  • an audio segment to analyze using time positions in seconds

  • analysis sample rate (audio will be converted to it before analysis; the recommended and default value is 44100.0)

  • frame parameters for different groups of descriptors: frame/hop size, zero padding, window type (see FrameCutter algorithm)

  • statistics to compute over frames: mean, var, median, min, max, dmean, dmean2, dvar, dvar2 (see PoolAggregator algorithm)
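As a sketch of what these per-frame summary statistics look like, here is a plain-Python aggregation over a list of frame values. This is illustrative only: Essentia's PoolAggregator is the reference implementation, and its exact definitions of the derivative statistics may differ from the simple successive differences used here.

```python
# Sketch of PoolAggregator-style summary statistics over frame values.
# Illustrative only; not Essentia's implementation.
import statistics

def aggregate(values):
    d1 = [b - a for a, b in zip(values, values[1:])]   # first derivative
    d2 = [b - a for a, b in zip(d1, d1[1:])]           # second derivative
    n = len(values)
    mean = sum(values) / n
    return {
        "mean": mean,
        "var": sum((v - mean) ** 2 for v in values) / n,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "dmean": sum(d1) / len(d1),    # mean of the first derivative
        "dmean2": sum(d2) / len(d2),   # mean of the second derivative
    }

print(aggregate([1.0, 2.0, 4.0, 7.0]))
```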

Sound Descriptors

Below is a detailed description of the audio descriptors computed by the extractor. All descriptors are computed on a signal resampled to a 44100 Hz sample rate (by default) and downmixed to mono. Frame-wise descriptors are summarized by their statistical distribution, but it is also possible to output the raw frame values (disabled by default).

low-level.*

For implementation details, see the code of the extractor.

By default all frame-based features are computed with frame/hop sizes equal to 2048/1024 samples unless stated otherwise.
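The frame/hop slicing can be sketched as follows in plain Python, assuming zero-padding of the last frame. This is a simplification; Essentia's FrameCutter additionally handles options such as start-from-zero behavior and valid-frame thresholds.

```python
# Sketch of frame slicing with frameSize=2048, hopSize=1024.
# Illustrative only; Essentia's FrameCutter has more options.

def frames(signal, frame_size=2048, hop_size=1024):
    out = []
    for start in range(0, len(signal), hop_size):
        frame = signal[start:start + frame_size]
        if len(frame) < frame_size:
            # zero-pad the final partial frame
            frame = frame + [0.0] * (frame_size - len(frame))
        out.append(frame)
        if start + frame_size >= len(signal):
            break
    return out

# 1 second of silence at 44.1 kHz yields half-overlapping 2048-sample frames
print(len(frames([0.0] * 44100)))
```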

  • sound_start_frame, sound_stop_frame: indices of frames at which sound begins and ends (all frames before and after are silent). Algorithms: StartStopSilence

  • loudness_ebu128: EBU R128 loudness descriptors. Algorithms: LoudnessEBUR128

  • average_loudness: dynamic range descriptor. It computes the average loudness across frames, rescaled to the [0,1] interval. A value of 0 corresponds to signals with a large dynamic range, and 1 corresponds to signals with little dynamic range. Algorithms: Loudness

  • dynamic_complexity: dynamic complexity computed on 2-second windows with 1-second overlap. Algorithms: DynamicComplexity

  • silence_rate_20dB, silence_rate_30dB, silence_rate_60dB, silence_rate_90dB: rate of silent frames in a signal for thresholds of 20, 30, 60, and 90 dB. Algorithms: SilenceRate

  • spectral_rms: spectral RMS. Algorithms: RMS

  • spectral_flux: spectral flux of a signal computed using L2-norm. Algorithms: Flux

  • spectral_centroid, spectral_kurtosis, spectral_spread, spectral_skewness: centroid and central moments statistics describing the spectral shape. Algorithms: Centroid, CentralMoments

  • spectral_rolloff: the roll-off frequency of a spectrum. Algorithms: RollOff

  • spectral_decrease: spectral decrease. Algorithms: Decrease

  • spectral_crest: spectral crest. Algorithms: Crest

  • spectral_flatness_db: spectral flatness in dB. Algorithms: FlatnessDB

  • hfc: high frequency content descriptor as proposed by Masri. Algorithms: HFC

  • spectral_strongpeak: the Strong Peak of a signal’s spectrum. Algorithms: StrongPeak

  • zerocrossingrate: zero-crossing rate. Algorithms: ZeroCrossingRate

  • spectral_energy: spectral energy. Algorithms: Energy

  • spectral_energyband_low, spectral_energyband_middle_low, spectral_energyband_middle_high, spectral_energyband_high: spectral energy in frequency bands [20Hz, 150Hz], [150Hz, 800Hz], [800Hz, 4kHz], and [4kHz, 20kHz]. Algorithms: EnergyBand

  • barkbands: spectral energy in 27 Bark bands. Algorithms: BarkBands

  • melbands: spectral energy in 40 mel bands. Algorithms: MFCC

  • melbands96: spectral energy in 96 mel bands. Algorithms: MelBands

  • erbbands: spectral energy in 40 ERB bands. Algorithms: ERBBands

  • mfcc: the first 13 mel frequency cepstrum coefficients. Algorithms: MFCC

  • gfcc: the first 13 gammatone feature cepstrum coefficients. Algorithms: GFCC

  • barkbands_crest, barkbands_flatness_db: crest and flatness computed over energies in Bark bands. Algorithms: Crest, FlatnessDB

  • barkbands_kurtosis, barkbands_skewness, barkbands_spread: central moments statistics over energies in Bark bands. Algorithms: CentralMoments

  • melbands_crest, melbands_flatness_db: crest and flatness computed over energies in mel bands. Algorithms: Crest, FlatnessDB

  • melbands_kurtosis, melbands_skewness, melbands_spread: central moments statistics over energies in mel bands. Algorithms: CentralMoments

  • erbbands_crest, erbbands_flatness_db: crest and flatness computed over energies in ERB bands. Algorithms: Crest, FlatnessDB

  • erbbands_kurtosis, erbbands_skewness, erbbands_spread: central moments statistics over energies in ERB bands. Algorithms: CentralMoments

  • dissonance: sensory dissonance of a spectrum. Algorithms: Dissonance

  • spectral_entropy: Shannon entropy of a spectrum. Algorithms: Entropy

  • pitch_salience: pitch salience of a spectrum. Algorithms: PitchSalience

  • pitch, pitch_instantaneous_confidence: pitch estimation and its confidence (useful for monophonic sounds). Algorithms: PitchYinFFT

  • spectral_complexity: spectral complexity. Algorithms: SpectralComplexity

  • spectral_contrast_coeffs, spectral_contrast_valleys: spectral contrast features. Algorithms: SpectralContrast
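To illustrate one of the spectral-shape descriptors above, here is a toy spectral centroid computation on a single small frame, using a plain DFT from the standard library. This is only a sketch: Essentia computes the magnitude spectrum with its Spectrum algorithm and the centroid with Centroid, on full-size frames.

```python
# Toy spectral centroid: magnitude spectrum via a naive DFT, then the
# magnitude-weighted mean of bin frequencies. Illustrative only.
import cmath
import math

def magnitude_spectrum(frame):
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_centroid(spectrum, sample_rate):
    # bin k corresponds to frequency k * sample_rate / frame_size
    freqs = [k * sample_rate / (2 * (len(spectrum) - 1))
             for k in range(len(spectrum))]
    total = sum(spectrum)
    return sum(f * m for f, m in zip(freqs, spectrum)) / total if total else 0.0

sr = 64
tone = [math.sin(2 * math.pi * 8 * t / sr) for t in range(sr)]  # pure 8 Hz tone
print(round(spectral_centroid(magnitude_spectrum(tone), sr)))  # 8
```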

rhythm.*

For implementation details, see the code of the extractor.

  • beats_position: time positions [sec] of detected beats, using the beat tracking algorithm by Degara et al., 2012. Algorithms: RhythmExtractor2013, BeatTrackerDegara

  • bpm_intervals: time durations between consecutive beats

  • beats_count: number of detected beats

  • bpm, bpm_confidence: BPM value according to detected beats, and a confidence value for the estimate. Note that the confidence value ranges from 0 to 5.32. Algorithms: RhythmExtractor2013

  • bpm_loop, bpm_loop_confidence: BPM value according to detected beats using an algorithm specialized in loops, and a confidence value for the estimate. Note that the confidence value ranges from 0 to 1. A confidence close to 1 can be used to determine whether the sound is a loop. Algorithms: LoopBpmEstimator, LoopBpmConfidence

  • bpm_histogram: BPM histogram. Algorithms: BpmHistogramDescriptors

  • bpm_histogram_first_peak_bpm, bpm_histogram_first_peak_spread, bpm_histogram_first_peak_weight, bpm_histogram_second_peak_bpm, bpm_histogram_second_peak_spread, bpm_histogram_second_peak_weight: descriptors characterizing the highest and second-highest peaks of the BPM histogram. Algorithms: BpmHistogramDescriptors

  • beats_loudness, beats_loudness_band_ratio: spectral energy computed on beat segments of audio across the whole spectrum, and ratios of energy in 6 frequency bands. Algorithms: BeatsLoudness, SingleBeatLoudness

  • onset_times, onset_count, onset_rate: time positions [sec] of detected onsets, their total number, and rate per second. Algorithms: OnsetRate
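The relation between beats_position, bpm_intervals, and a BPM estimate can be sketched in a few lines. This is an illustration of the arithmetic only; the actual beat tracking is done by RhythmExtractor2013.

```python
# Sketch: derive inter-beat intervals and a BPM estimate from beat positions.
# Illustrative only; RhythmExtractor2013 performs the real beat tracking.
import statistics

def bpm_from_beats(beat_positions):
    intervals = [b - a for a, b in zip(beat_positions, beat_positions[1:])]
    return 60.0 / statistics.median(intervals), intervals

beats = [0.5, 1.0, 1.5, 2.0, 2.5]  # a beat every 0.5 s
bpm, intervals = bpm_from_beats(beats)
print(bpm)  # 120.0
```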

tonal.*

For implementation details, see the code of the extractor. By default, all features are computed with frame/hop sizes equal to 4096/2048 samples.

  • tuning_frequency: estimated tuning frequency [Hz]. Algorithms: TuningFrequency

  • hpcp: 32-dimensional harmonic pitch class profile (HPCP). Algorithms: HPCP

  • hpcp_peak_count: number of peaks detected in the mean of HPCPs (number of active pitch classes). Algorithms: PeakDetection

  • hpcp_entropy: Shannon entropy of an HPCP vector. Algorithms: Entropy

  • hpcp_crest: crest of the HPCP vector. Algorithms: Crest

  • key, scale, strength: key estimate, its scale, and its strength using the default HPCP key profile. Algorithms: Key

  • tuning_diatonic_strength: key strength estimated from a high-resolution HPCP (120 dimensions) using a diatonic profile. Algorithms: Key

  • tuning_equal_tempered_deviation, tuning_nontempered_energy_ratio: equal-temperament deviation and non-tempered energy ratio estimated from high-resolution HPCP (120 dimensions). Algorithms: HighResolutionFeatures
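The idea behind equal-temperament deviation can be illustrated with a small calculation: measure how far a frequency lies from the nearest equal-tempered semitone, in cents, relative to a tuning reference. This is only a sketch of the concept; Essentia's HighResolutionFeatures operates on a 120-bin HPCP rather than on raw frequencies.

```python
# Sketch: deviation of a frequency from 12-tone equal temperament, in cents.
# Illustrative only; not Essentia's HighResolutionFeatures implementation.
import math

def cents_from_equal_tempered(freq, tuning=440.0):
    cents = 1200.0 * math.log2(freq / tuning)       # distance from reference
    return cents - round(cents / 100.0) * 100       # offset from nearest semitone

print(cents_from_equal_tempered(440.0))  # 0.0 (exactly in tune)
```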

sfx.*

For implementation details, see the code of the extractor.

Total and perceived sound duration:

  • duration: total duration of an audio signal. Algorithms: Duration.

  • effective_duration: effective duration of the signal, discarding silence (signal below 10% of the envelope maximum). Algorithms: EffectiveDuration.
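The effective-duration idea can be sketched as counting the time the envelope stays above 10% of its maximum. This is a simplified illustration, assuming a precomputed rectified amplitude envelope; Essentia's EffectiveDuration is the reference.

```python
# Sketch: effective duration = time the envelope exceeds 10% of its maximum.
# Illustrative only; assumes `envelope` is a rectified amplitude envelope.

def effective_duration(envelope, sample_rate):
    threshold = 0.1 * max(envelope)
    active = sum(1 for v in envelope if v >= threshold)
    return active / sample_rate

# 10 silent samples, 100 active samples, 10 near-silent samples at 100 Hz
env = [0.0] * 10 + [1.0] * 100 + [0.05] * 10
print(effective_duration(env, 100.0))  # 1.0
```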

Sound envelope descriptors:

  • temporal_centroid: ratio of the envelope centroid to total length. Algorithms: Centroid.

  • temporal_kurtosis, temporal_spread, temporal_skewness: central moments statistics describing the signal envelope shape. Algorithms: CentralMoments

  • temporal_decrease: signal envelope decrease. Algorithms: Decrease

  • tc_to_total: ratio of the envelope centroid to total length. Algorithms: TCToTotal.

  • flatness: the flatness coefficient of a signal envelope. Algorithms: FlatnessSFX.

  • logattacktime: the log10 of the attack time. Algorithms: LogAttackTime.

  • max_to_total: ratio of the maximum amplitude position to the total envelope length. Algorithms: MaxToTotal.

  • strongdecay: the Strong Decay. Algorithms: StrongDecay.

  • der_av_after_max: the average value of the envelope’s derivative after the maximum amplitude position, weighted by its amplitude (the smaller the value, the more impulsive the sound). Algorithms: DerivativeSFX.

  • max_der_before_max: the maximum value of the envelope’s derivative before the maximum amplitude position (sounds with a smooth attack phase will have lower values). Algorithms: DerivativeSFX.

Pitch envelope descriptors:

  • pitch_centroid: pitch envelope centroid. Algorithms: Centroid.

  • pitch_max_to_total: ratio of the position of the maximum pitch value to total length. Algorithms: MaxToTotal.

  • pitch_min_to_total: ratio of the position of the minimum pitch value to total length. Algorithms: MinToTotal.

  • pitch_after_max_to_before_max_energy_ratio: ratio of pitch envelope energy after the pitch maximum to pitch energy before the pitch maximum. Algorithms: AfterMaxToBeforeMaxEnergyRatio.
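Several of the envelope descriptors above are simple position ratios. As an illustration (not Essentia's implementation), here is how max_to_total and tc_to_total can be sketched over a discrete envelope:

```python
# Sketch of envelope position-ratio descriptors.
# Illustrative only; Essentia's MaxToTotal and TCToTotal are the reference.

def max_to_total(envelope):
    # position of the envelope maximum relative to total length
    return envelope.index(max(envelope)) / len(envelope)

def tc_to_total(envelope):
    # temporal centroid (amplitude-weighted mean position) over total length
    total = sum(envelope)
    centroid = sum(i * v for i, v in enumerate(envelope)) / total
    return centroid / len(envelope)

env = [0.0, 1.0, 0.5, 0.25, 0.0]  # fast attack, slower decay
print(max_to_total(env))  # 0.2
```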

Configuration

It is possible to customize the parameters of audio analysis, frame summarization, high-level classifier models, and output format using a yaml profile file. By writing your own profile file you can:

Specify output format (json or yaml)

outputFormat: json

Specify whether to store all frame values (0 or 1)

outputFrames: 1

Specify an audio segment to analyze using time positions in seconds

startTime: 0
endTime: 10

Specify analysis sample rate (audio will be converted to it before analysis; the recommended and default value is 44100.0)

analysisSampleRate: 44100.0

Specify frame parameters for different groups of descriptors: frame/hop size, zero padding, window type (see FrameCutter algorithm). Specify statistics to compute over frames: mean, var, median, min, max, dmean, dmean2, dvar, dvar2 (see PoolAggregator algorithm)

lowlevel:
    frameSize: 2048
    hopSize: 1024
    zeroPadding: 0
    windowType: blackmanharris62
    silentFrames: noise
    stats: ["mean", "var", "median"]

rhythm:
    method: degara
    minTempo: 40
    maxTempo: 208
    stats: ["mean", "var", "median", "min", "max"]

tonal:
    frameSize: 4096
    hopSize: 2048
    zeroPadding: 0
    windowType: blackmanharris62
    silentFrames: noise
    stats: ["mean", "var", "median", "min", "max"]

Specify whether to compute high-level descriptors based on classifier models at the given file paths (currently no models are provided out of the box for sound classification; see the Essentia documentation for how to train your own models)

highlevel:
    compute: 1
    svm_models: ['<path_to_gaia_svm_model1.history>', '<path_to_gaia_svm_model2.history>' ]

In the profile example below, the extractor is set to analyze only the first 30 seconds of audio and output statistical summaries of frame values (set outputFrames: 1 to also store the raw frame values).

startTime: 0
endTime: 30
outputFrames: 0
outputFormat: json
requireMbid: false
indent: 4

lowlevel:
    frameSize: 2048
    hopSize: 1024
    zeroPadding: 0
    windowType: blackmanharris62
    silentFrames: noise
    stats: ["mean", "var", "median", "min", "max", "dmean", "dmean2", "dvar", "dvar2"]

rhythm:
    method: degara
    minTempo: 40
    maxTempo: 208
    stats: ["mean", "var", "median", "min", "max", "dmean", "dmean2", "dvar", "dvar2"]

tonal:
    frameSize: 4096
    hopSize: 2048
    zeroPadding: 0
    windowType: blackmanharris62
    silentFrames: noise
    stats: ["mean", "var", "median", "min", "max", "dmean", "dmean2", "dvar", "dvar2"]