Freesound Extractor

essentia_streaming_extractor_freesound is a configurable command-line feature extractor for sound analysis. It computes a large set of common sound descriptors, including various spectral, time-domain, rhythm, and tonal characteristics, with no programming required. It is designed for batch computation on large sound collections and is used by Freesound. Prebuilt static binaries of this extractor are available on the Essentia website.

It is possible to customize the parameters of audio analysis, frame summarization, high-level classifier models, and output format using a yaml profile file (see below). By writing your own profile file you can specify:

  • output format (json or yaml)

  • whether to store all frame values

  • an audio segment to analyze using time positions in seconds

  • analysis sample rate (audio will be converted to it before analysis; the recommended and default value is 44100.0)

  • frame parameters for different groups of descriptors: frame/hop size, zero padding, window type (see FrameCutter algorithm)

  • statistics to compute over frames: mean, var, median, min, max, dmean, dmean2, dvar, dvar2 (see PoolAggregator algorithm)
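As a sketch of what these per-frame summary statistics look like, here is a plain-Python aggregation over a list of frame values. This is illustrative only: Essentia's PoolAggregator is the reference implementation, and its exact definitions of the derivative statistics may differ from the simple successive differences used here.

```python
# Sketch of PoolAggregator-style summary statistics over frame values.
# Illustrative only; not Essentia's implementation.
import statistics

def aggregate(values):
    d1 = [b - a for a, b in zip(values, values[1:])]   # first derivative
    d2 = [b - a for a, b in zip(d1, d1[1:])]           # second derivative
    n = len(values)
    mean = sum(values) / n
    return {
        "mean": mean,
        "var": sum((v - mean) ** 2 for v in values) / n,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "dmean": sum(d1) / len(d1),    # mean of the first derivative
        "dmean2": sum(d2) / len(d2),   # mean of the second derivative
    }

print(aggregate([1.0, 2.0, 4.0, 7.0]))
```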

Sound Descriptors

Below is a detailed description of the audio descriptors computed by the extractor. All descriptors are computed on a signal resampled to a 44100 Hz sample rate (by default) and downmixed to mono. Frame-wise descriptors are summarized by their statistical distribution, but it is also possible to output the raw frame values (disabled by default).

low-level.*

For implementation details, see the code of the extractor.

By default all frame-based features are computed with frame/hop sizes equal to 2048/1024 samples unless stated otherwise.
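The frame/hop slicing can be sketched as follows in plain Python, assuming zero-padding of the last frame. This is a simplification; Essentia's FrameCutter additionally handles options such as start-from-zero behavior and valid-frame thresholds.

```python
# Sketch of frame slicing with frameSize=2048, hopSize=1024.
# Illustrative only; Essentia's FrameCutter has more options.

def frames(signal, frame_size=2048, hop_size=1024):
    out = []
    for start in range(0, len(signal), hop_size):
        frame = signal[start:start + frame_size]
        if len(frame) < frame_size:
            # zero-pad the final partial frame
            frame = frame + [0.0] * (frame_size - len(frame))
        out.append(frame)
        if start + frame_size >= len(signal):
            break
    return out

# 1 second of silence at 44.1 kHz yields half-overlapping 2048-sample frames
print(len(frames([0.0] * 44100)))
```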

  • sound_start_frame, sound_stop_frame: indices of frames at which sound begins and ends (all frames before and after are silent). Algorithms: StartStopSilence

  • loudness_ebu128: EBU R128 loudness descriptors. Algorithms: LoudnessEBUR128

  • average_loudness: dynamic range descriptor. It computes the average loudness across frames, rescaled to the [0,1] interval. A value of 0 corresponds to signals with a large dynamic range, and 1 corresponds to signals with little dynamic range. Algorithms: Loudness

  • dynamic_complexity: dynamic complexity computed on 2-second windows with 1-second overlap. Algorithms: DynamicComplexity

  • silence_rate_20dB, silence_rate_30dB, silence_rate_60dB, silence_rate_90dB: rate of silent frames in a signal for thresholds of 20, 30, 60, and 90 dB. Algorithms: SilenceRate

  • spectral_rms: spectral RMS. Algorithms: RMS

  • spectral_flux: spectral flux of a signal computed using L2-norm. Algorithms: Flux

  • spectral_centroid, spectral_kurtosis, spectral_spread, spectral_skewness: centroid and central moments statistics describing the spectral shape. Algorithms: Centroid, CentralMoments

  • spectral_rolloff: the roll-off frequency of a spectrum. Algorithms: RollOff

  • spectral_decrease: spectral decrease. Algorithms: Decrease

  • spectral_crest: spectral crest. Algorithms: Crest

  • spectral_flatness_db: spectral flatness in dB. Algorithms: FlatnessDB

  • hfc: high frequency content descriptor as proposed by Masri. Algorithms: HFC

  • spectral_strongpeak: the Strong Peak of a signal’s spectrum. Algorithms: StrongPeak

  • zerocrossingrate: zero-crossing rate. Algorithms: ZeroCrossingRate

  • spectral_energy: spectral energy. Algorithms: Energy

  • spectral_energyband_low, spectral_energyband_middle_low, spectral_energyband_middle_high, spectral_energyband_high: spectral energy in frequency bands [20Hz, 150Hz], [150Hz, 800Hz], [800Hz, 4kHz], and [4kHz, 20kHz]. Algorithms: EnergyBand

  • barkbands: spectral energy in 27 Bark bands. Algorithms: BarkBands

  • melbands: spectral energy in 40 mel bands. Algorithms: MFCC

  • melbands96: spectral energy in 96 mel bands. Algorithms: MelBands

  • erbbands: spectral energy in 40 ERB bands. Algorithms: ERBBands

  • mfcc: the first 13 mel frequency cepstrum coefficients. Algorithms: MFCC

  • gfcc: the first 13 gammatone feature cepstrum coefficients. Algorithms: GFCC

  • barkbands_crest, barkbands_flatness_db: crest and flatness computed over energies in Bark bands. Algorithms: Crest, FlatnessDB

  • barkbands_kurtosis, barkbands_skewness, barkbands_spread: central moments statistics over energies in Bark bands. Algorithms: CentralMoments

  • melbands_crest, melbands_flatness_db: crest and flatness computed over energies in mel bands. Algorithms: Crest, FlatnessDB

  • melbands_kurtosis, melbands_skewness, melbands_spread: central moments statistics over energies in mel bands. Algorithms: CentralMoments

  • erbbands_crest, erbbands_flatness_db: crest and flatness computed over energies in ERB bands. Algorithms: Crest, FlatnessDB

  • erbbands_kurtosis, erbbands_skewness, erbbands_spread: central moments statistics over energies in ERB bands. Algorithms: CentralMoments

  • dissonance: sensory dissonance of a spectrum. Algorithms: Dissonance

  • spectral_entropy: Shannon entropy of a spectrum. Algorithms: Entropy

  • pitch_salience: pitch salience of a spectrum. Algorithms: PitchSalience

  • pitch, pitch_instantaneous_confidence: pitch estimation and its confidence (useful for monophonic sounds). Algorithms: PitchYinFFT

  • spectral_complexity: spectral complexity. Algorithms: SpectralComplexity

  • spectral_contrast_coeffs, spectral_contrast_valleys: spectral contrast features. Algorithms: SpectralContrast
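To illustrate one of the spectral-shape descriptors above, here is a toy spectral centroid computation on a single small frame, using a plain DFT from the standard library. This is only a sketch: Essentia computes the magnitude spectrum with its Spectrum algorithm and the centroid with Centroid, on full-size frames.

```python
# Toy spectral centroid: magnitude spectrum via a naive DFT, then the
# magnitude-weighted mean of bin frequencies. Illustrative only.
import cmath
import math

def magnitude_spectrum(frame):
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_centroid(spectrum, sample_rate):
    # bin k corresponds to frequency k * sample_rate / frame_size
    freqs = [k * sample_rate / (2 * (len(spectrum) - 1))
             for k in range(len(spectrum))]
    total = sum(spectrum)
    return sum(f * m for f, m in zip(freqs, spectrum)) / total if total else 0.0

sr = 64
tone = [math.sin(2 * math.pi * 8 * t / sr) for t in range(sr)]  # pure 8 Hz tone
print(round(spectral_centroid(magnitude_spectrum(tone), sr)))  # 8
```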

rhythm.*

For implementation details, see the code of the extractor.

  • beats_position: time positions [sec] of detected beats, using the beat tracking algorithm by Degara et al., 2012. Algorithms: RhythmExtractor2013, BeatTrackerDegara

  • bpm_intervals: time durations between consecutive beats

  • beats_count: number of detected beats

  • bpm, bpm_confidence: BPM value according to detected beats, and a confidence value for the estimate. Note that the confidence value ranges from 0 to 5.32. Algorithms: RhythmExtractor2013

  • bpm_loop, bpm_loop_confidence: BPM value according to detected beats using an algorithm specialized in loops, and a confidence value for the estimate. Note that the confidence value ranges from 0 to 1. A confidence close to 1 can be used to determine whether the sound is a loop. Algorithms: LoopBpmEstimator, LoopBpmConfidence

  • bpm_histogram: BPM histogram. Algorithms: BpmHistogramDescriptors

  • bpm_histogram_first_peak_bpm, bpm_histogram_first_peak_spread, bpm_histogram_first_peak_weight, bpm_histogram_second_peak_bpm, bpm_histogram_second_peak_spread, bpm_histogram_second_peak_weight: descriptors characterizing the highest and second-highest peaks of the BPM histogram. Algorithms: BpmHistogramDescriptors

  • beats_loudness, beats_loudness_band_ratio: spectral energy computed on beat segments of audio across the whole spectrum, and ratios of energy in 6 frequency bands. Algorithms: BeatsLoudness, SingleBeatLoudness

  • onset_times, onset_count, onset_rate: time positions [sec] of detected onsets, their total number, and rate per second. Algorithms: OnsetRate
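The relation between beats_position, bpm_intervals, and a BPM estimate can be sketched in a few lines. This is an illustration of the arithmetic only; the actual beat tracking is done by RhythmExtractor2013.

```python
# Sketch: derive inter-beat intervals and a BPM estimate from beat positions.
# Illustrative only; RhythmExtractor2013 performs the real beat tracking.
import statistics

def bpm_from_beats(beat_positions):
    intervals = [b - a for a, b in zip(beat_positions, beat_positions[1:])]
    return 60.0 / statistics.median(intervals), intervals

beats = [0.5, 1.0, 1.5, 2.0, 2.5]  # a beat every 0.5 s
bpm, intervals = bpm_from_beats(beats)
print(bpm)  # 120.0
```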

tonal.*

For implementation details, see the code of the extractor. By default, all features are computed with frame/hop sizes equal to 4096/2048 samples.

  • tuning_frequency: estimated tuning frequency [Hz]. Algorithms: TuningFrequency

  • hpcp: 32-dimensional harmonic pitch class profile (HPCP). Algorithms: HPCP

  • hpcp_peak_count: number of peaks detected in the mean of HPCPs (number of active pitch classes). Algorithms: PeakDetection

  • hpcp_entropy: Shannon entropy of an HPCP vector. Algorithms: Entropy

  • hpcp_crest: crest of the HPCP vector. Algorithms: Crest

  • key, scale, strength: key estimate, its scale, and its strength using the default HPCP key profile. Algorithms: Key

  • tuning_diatonic_strength: key strength estimated from a high-resolution HPCP (120 dimensions) using a diatonic profile. Algorithms: Key

  • tuning_equal_tempered_deviation, tuning_nontempered_energy_ratio: equal-temperament deviation and non-tempered energy ratio estimated from high-resolution HPCP (120 dimensions). Algorithms: HighResolutionFeatures
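The idea behind equal-temperament deviation can be illustrated with a small calculation: measure how far a frequency lies from the nearest equal-tempered semitone, in cents, relative to a tuning reference. This is only a sketch of the concept; Essentia's HighResolutionFeatures operates on a 120-bin HPCP rather than on raw frequencies.

```python
# Sketch: deviation of a frequency from 12-tone equal temperament, in cents.
# Illustrative only; not Essentia's HighResolutionFeatures implementation.
import math

def cents_from_equal_tempered(freq, tuning=440.0):
    cents = 1200.0 * math.log2(freq / tuning)       # distance from reference
    return cents - round(cents / 100.0) * 100       # offset from nearest semitone

print(cents_from_equal_tempered(440.0))  # 0.0 (exactly in tune)
```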

sfx.*

For implementation details, see the code of the extractor.

Total and perceived sound duration:

  • duration: total duration of an audio signal. Algorithms: Duration.

  • effective_duration: effective duration of the signal, discarding silence (signal below 10% of the envelope maximum). Algorithms: EffectiveDuration.
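The effective-duration idea can be sketched as counting the time the envelope stays above 10% of its maximum. This is a simplified illustration, assuming a precomputed rectified amplitude envelope; Essentia's EffectiveDuration is the reference.

```python
# Sketch: effective duration = time the envelope exceeds 10% of its maximum.
# Illustrative only; assumes `envelope` is a rectified amplitude envelope.

def effective_duration(envelope, sample_rate):
    threshold = 0.1 * max(envelope)
    active = sum(1 for v in envelope if v >= threshold)
    return active / sample_rate

# 10 silent samples, 100 active samples, 10 near-silent samples at 100 Hz
env = [0.0] * 10 + [1.0] * 100 + [0.05] * 10
print(effective_duration(env, 100.0))  # 1.0
```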

Sound envelope descriptors:

  • temporal_centroid: ratio of the envelope centroid to total length. Algorithms: Centroid.

  • temporal_kurtosis, temporal_spread, temporal_skewness: central moments statistics describing the signal envelope shape. Algorithms: CentralMoments

  • temporal_decrease: signal envelope decrease. Algorithms: Decrease

  • tc_to_total: ratio of the envelope centroid to total length. Algorithms: TCToTotal.

  • flatness: the flatness coefficient of a signal envelope. Algorithms: FlatnessSFX.

  • logattacktime: the log10 of the attack time. Algorithms: LogAttackTime.

  • max_to_total: ratio of the maximum amplitude position to the total envelope length. Algorithms: MaxToTotal.

  • strongdecay: the Strong Decay. Algorithms: StrongDecay.

  • der_av_after_max: the average value of the envelope’s derivative after the maximum amplitude position, weighted by its amplitude (the smaller the value, the more impulsive the sound). Algorithms: DerivativeSFX.

  • max_der_before_max: the maximum value of the envelope’s derivative before the maximum amplitude position (sounds with a smooth attack phase will have lower values). Algorithms: DerivativeSFX.

Pitch envelope descriptors:

  • pitch_centroid: pitch envelope centroid. Algorithms: Centroid.

  • pitch_max_to_total: ratio of the position of the maximum pitch value to total length. Algorithms: MaxToTotal.

  • pitch_min_to_total: ratio of the position of the minimum pitch value to total length. Algorithms: MinToTotal.

  • pitch_after_max_to_before_max_energy_ratio: ratio of pitch envelope energy after the pitch maximum to pitch energy before the pitch maximum. Algorithms: AfterMaxToBeforeMaxEnergyRatio.
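Several of the envelope descriptors above are simple position ratios. As an illustration (not Essentia's implementation), here is how max_to_total and tc_to_total can be sketched over a discrete envelope:

```python
# Sketch of envelope position-ratio descriptors.
# Illustrative only; Essentia's MaxToTotal and TCToTotal are the reference.

def max_to_total(envelope):
    # position of the envelope maximum relative to total length
    return envelope.index(max(envelope)) / len(envelope)

def tc_to_total(envelope):
    # temporal centroid (amplitude-weighted mean position) over total length
    total = sum(envelope)
    centroid = sum(i * v for i, v in enumerate(envelope)) / total
    return centroid / len(envelope)

env = [0.0, 1.0, 0.5, 0.25, 0.0]  # fast attack, slower decay
print(max_to_total(env))  # 0.2
```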

Configuration

It is possible to customize the parameters of audio analysis, frame summarization, high-level classifier models, and output format using a yaml profile file. By writing your own profile file you can:

Specify output format (json or yaml)

outputFormat: json

Specify whether to store all frame values (0 or 1)

outputFrames: 1

Specify an audio segment to analyze using time positions in seconds

startTime: 0
endTime: 10

Specify analysis sample rate (audio will be converted to it before analysis; the recommended and default value is 44100.0)

analysisSampleRate: 44100.0

Specify frame parameters for different groups of descriptors: frame/hop size, zero padding, window type (see FrameCutter algorithm). Specify statistics to compute over frames: mean, var, median, min, max, dmean, dmean2, dvar, dvar2 (see PoolAggregator algorithm)

lowlevel:
    frameSize: 2048
    hopSize: 1024
    zeroPadding: 0
    windowType: blackmanharris62
    silentFrames: noise
    stats: ["mean", "var", "median"]

rhythm:
    method: degara
    minTempo: 40
    maxTempo: 208
    stats: ["mean", "var", "median", "min", "max"]

tonal:
    frameSize: 4096
    hopSize: 2048
    zeroPadding: 0
    windowType: blackmanharris62
    silentFrames: noise
    stats: ["mean", "var", "median", "min", "max"]

Specify whether to compute high-level descriptors based on classifier models at the given file paths (currently no models are provided out of the box for sound classification; see the Essentia documentation for how to train your own models)

highlevel:
    compute: 1
    svm_models: ['<path_to_gaia_svm_model1.history>', '<path_to_gaia_svm_model2.history>' ]

In the profile example below, the extractor is set to analyze only the first 30 seconds of audio and output statistical summaries of frame values (set outputFrames: 1 to also store the raw frame values).

startTime: 0
endTime: 30
outputFrames: 0
outputFormat: json
requireMbid: false
indent: 4

lowlevel:
    frameSize: 2048
    hopSize: 1024
    zeroPadding: 0
    windowType: blackmanharris62
    silentFrames: noise
    stats: ["mean", "var", "median", "min", "max", "dmean", "dmean2", "dvar", "dvar2"]

rhythm:
    method: degara
    minTempo: 40
    maxTempo: 208
    stats: ["mean", "var", "median", "min", "max", "dmean", "dmean2", "dvar", "dvar2"]

tonal:
    frameSize: 4096
    hopSize: 2048
    zeroPadding: 0
    windowType: blackmanharris62
    silentFrames: noise
    stats: ["mean", "var", "median", "min", "max", "dmean", "dmean2", "dvar", "dvar2"]