Bob 2.0 extraction of cepstral features (MFCC or LFCC) from audio
Algorithms have at least one input and one output. All algorithm endpoints are organized in groups, which the platform uses to indicate which inputs and outputs are synchronized with each other. The first group is automatically synchronized with the channel defined by the block in which the algorithm is deployed. A short sketch after the table below illustrates how synchronized inputs reach the algorithm.
Endpoint Name | Data Format | Nature |
---|---|---|
speech | system/array_1d_floats/1 | Input |
vad | system/array_1d_integers/1 | Input |
features | system/array_2d_floats/1 | Output |
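As a minimal illustration of this synchronization, the sketch below uses hypothetical stand-ins (`MockData` and `MockInput` are not platform classes) for the objects the platform hands to `process()`; only the attribute layout, `inputs["name"].data.value`, mirrors the real algorithm code further down.

```python
import numpy

# Hypothetical stand-ins for the platform's input wrappers; only the
# inputs["name"].data.value layout mirrors what process() uses below.
class MockData:
    def __init__(self, value):
        self.value = value

class MockInput:
    def __init__(self, value):
        self.data = MockData(value)

# Both endpoints belong to the first group, so the platform delivers them
# synchronized: each speech sample arrives with its matching VAD labels.
inputs = {
    "speech": MockInput(numpy.zeros(16000)),              # one second at 16 kHz
    "vad": MockInput(numpy.zeros(98, dtype=numpy.int8)),  # one label per analysis frame
}

speech = inputs["speech"].data.value.astype("float64")
labels = inputs["vad"].data.value
```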
Parameters allow users to change the configuration of an algorithm when scheduling an experiment. A short sketch after the table below shows how these values reach the algorithm's setup() method.
Name | Description | Type | Default | Range/Choices |
---|---|---|---|---|
f_max | Max frequency of the range used in bandpass filtering | float64 | 8000.0 | |
delta_win | Window size used in delta and delta-delta computation | uint32 | 2 | |
withDelta | Compute deltas (with window size specified by delta_win) | bool | True | |
pre_emphasis_coef | Pre-emphasis coefficient | float64 | 0.95 | |
win_shift_ms | The shift between neighboring analysis windows, typically half of the window length | float64 | 10.0 | |
win_length_ms | The length of the sliding processing window, typically about 20 ms | float64 | 20.0 | |
dct_norm | Use normalized DCT | bool | False | |
normalizeFeatures | Normalize computed Cepstral features (shift by mean and divide by std) | bool | True | |
filter_frames | Filter frames based on the VAD labels: either trim silent heads/tails, keep only speech frames, or keep only silent frames | string | trim_silence | trim_silence, silence_only, speech_only |
rate | Sampling rate of the speech signal | float64 | 16000.0 | |
n_filters | Number of filter bands | uint32 | 24 | |
f_min | Min frequency of the range used in bandpass filtering | float64 | 0.0 | |
withDeltaDelta | Compute delta-deltas (with window size specified by delta_win) | bool | True | |
withEnergy | Use the power of the FFT magnitude; otherwise, just the absolute value of the magnitude | bool | True | |
mel_scale | Set to true to use a Mel-scaled triangular filter bank; otherwise, a linearly-scaled one is used | bool | True | |
n_ceps | Number of cepstral coefficients | uint32 | 19 | |
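As a quick illustration, setup() (defined in the Algorithm class in the code below) receives these parameters as a plain dictionary; anything omitted keeps its default. This sketch assumes the bob.ap dependency is installed, since setup() instantiates the bob.ap.Ceps extractor.

```python
# Sketch: override a few of the defaults listed above; unspecified
# parameters keep their default values.
algorithm = Algorithm()
algorithm.setup({
    "rate": 16000.0,
    "n_ceps": 19,
    "filter_frames": "trim_silence",  # or "silence_only" / "speech_only"
})
print(algorithm.features_len)  # 57 = 19 coefficients + 19 deltas + 19 delta-deltas
```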
The code for this algorithm, in Python:

```python
import numpy
import bob.ap


def vad_filter_features(rate, wavsample, vad_labels, features, filter_frames="trim_silence"):
    """
    @param: filter_frames: either 'speech_only' (keep only the speech frames),
    'silence_only' (keep only the silent frames), or 'trim_silence' (trim
    silent heads and tails)
    """
    if not wavsample.size:
        raise ValueError("vad_filter_features(): data sample is empty, no features extraction is possible")
    vad_labels = numpy.asarray(vad_labels, dtype=numpy.int8)
    features = numpy.asarray(features, dtype=numpy.float64)
    features = numpy.reshape(features, (vad_labels.shape[0], -1))
    # first, take the whole thing, in case there are problems later
    filtered_features = features
    # if VAD detection worked on this sample
    if vad_labels is not None:
        # make sure the sizes of the VAD labels and the spectrogram match
        if len(vad_labels) == len(features):
            # take only speech frames, as in VAD speech frames are 1 and silence are 0
            speech, = numpy.nonzero(vad_labels)
            silences = None
            if filter_frames == "silence_only":
                # take only silent frames - those for which VAD gave zeros
                silences, = numpy.nonzero(vad_labels == 0)
            if len(speech):
                nzstart = speech[0]   # index of the first non-zero
                nzend = speech[-1]    # index of the last non-zero
                if filter_frames == "silence_only":  # extract only silent frames
                    # take only silent frames in-between the speech
                    silences = silences[silences > nzstart]
                    silences = silences[silences < nzend]
                    filtered_features = features[silences, :]
                elif filter_frames == "speech_only":
                    filtered_features = features[speech, :]
                else:  # when we take all
                    filtered_features = features[nzstart:nzend, :]
        else:
            print("Warning: vad_filter_features(): VAD labels should be the same length as energy bands")
    # print("vad_filter_features(): filtered_features shape: %s" % str(filtered_features.shape))
    return filtered_features


class Algorithm:

    def __init__(self):
        self.rate = 16000
        self.win_length_ms = 20
        self.win_shift_ms = 10
        self.n_filters = 24
        self.n_ceps = 19
        self.f_min = 0
        self.f_max = 4000
        self.delta_win = 2
        self.pre_emphasis_coef = 0.95
        self.dct_norm = False
        self.mel_scale = True
        self.withEnergy = True
        self.withDelta = True
        self.withDeltaDelta = True
        self.normalizeFeatures = True
        self.filter_frames = 'speech_only'
        self.features_len = 19

    def setup(self, parameters):
        self.rate = float(parameters.get('rate', self.rate))
        self.win_length_ms = float(parameters.get('win_length_ms', self.win_length_ms))
        self.win_shift_ms = float(parameters.get('win_shift_ms', self.win_shift_ms))
        self.f_min = float(parameters.get('f_min', self.f_min))
        self.f_max = float(parameters.get('f_max', self.f_max))
        self.n_ceps = parameters.get('n_ceps', self.n_ceps)
        self.n_filters = parameters.get('n_filters', self.n_filters)
        self.delta_win = parameters.get('delta_win', self.delta_win)
        self.pre_emphasis_coef = float(parameters.get('pre_emphasis_coef', self.pre_emphasis_coef))
        self.mel_scale = parameters.get('mel_scale', self.mel_scale)
        self.dct_norm = parameters.get('dct_norm', self.dct_norm)
        self.withEnergy = parameters.get('withEnergy', self.withEnergy)
        self.withDelta = parameters.get('withDelta', self.withDelta)
        self.withDeltaDelta = parameters.get('withDeltaDelta', self.withDeltaDelta)
        self.normalizeFeatures = parameters.get('normalizeFeatures', self.normalizeFeatures)
        self.filter_frames = parameters.get('filter_frames', self.filter_frames)

        wl = self.win_length_ms
        ws = self.win_shift_ms
        nf = self.n_filters
        nc = self.n_ceps
        f_min = self.f_min
        f_max = self.f_max
        dw = self.delta_win
        pre = self.pre_emphasis_coef

        self.extractor = bob.ap.Ceps(self.rate, wl, ws, nf, nc, f_min, f_max, dw, pre)
        self.extractor.dct_norm = self.dct_norm
        self.extractor.mel_scale = self.mel_scale
        self.extractor.with_energy = self.withEnergy
        self.extractor.with_delta = self.withDelta
        self.extractor.with_delta_delta = self.withDeltaDelta

        # compute the size of the feature vector
        self.features_len = nc
        if self.withDelta:
            self.features_len += nc
        if self.withDeltaDelta:
            self.features_len += nc
        return True

    def normalize_features(self, features):
        mean = numpy.mean(features, axis=0)
        std = numpy.std(features, axis=0)
        features = numpy.divide(features - mean, std)
        return features

    def process(self, inputs, outputs):
        float_wav = inputs["speech"].data.value.astype('float64')
        labels = inputs["vad"].data.value
        cepstral_features = self.extractor(float_wav)
        filtered_features = vad_filter_features(self.rate, float_wav, labels, cepstral_features, filter_frames=self.filter_frames)
        if self.normalizeFeatures:
            normalized_features = self.normalize_features(filtered_features)
        else:
            normalized_features = filtered_features
        if normalized_features.shape[0] == 0:
            # if no frames survived filtering, do not keep the output empty;
            # a single all-zeros vector avoids errors in the next steps
            normalized_features = numpy.array([numpy.zeros(self.features_len)])
        outputs["features"].write({
            'value': numpy.vstack(normalized_features)
        })
        return True
```
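As a small sanity check of the three filter_frames modes, the toy example below (made-up random features and VAD labels marking frames 2-3 and 5-6 as speech; assumes the listing above has been executed) calls vad_filter_features() directly:

```python
import numpy

numpy.random.seed(0)
feats = numpy.random.rand(10, 3)                          # 10 frames, 3 coefficients
labels = numpy.array([0, 0, 1, 1, 0, 1, 1, 0, 0, 0])     # speech at frames 2, 3, 5, 6
wav = numpy.ones(16000)                                   # dummy non-empty waveform

trimmed = vad_filter_features(16000, wav, labels, feats, "trim_silence")
speech = vad_filter_features(16000, wav, labels, feats, "speech_only")
silence = vad_filter_features(16000, wav, labels, feats, "silence_only")

print(trimmed.shape)  # (4, 3): frames 2..5, silent head and tail trimmed
print(speech.shape)   # (4, 3): frames 2, 3, 5, 6
print(silence.shape)  # (1, 3): frame 4, the only silence strictly between speech
```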
Experiments that use this algorithm:
Name | Databases/Protocols | Analyzers |
---|---|---|
pkorshunov/pkorshunov/isv-asv-pad-fusion-complete/1/asv_isv-pad_lbp_hist_ratios_lr-fusion_lr-pa_aligned | avspoof/2@physicalaccess_verification, avspoof/2@physicalaccess_verify_train, avspoof/2@physicalaccess_verify_train_spoof, avspoof/2@physicalaccess_antispoofing, avspoof/2@physicalaccess_verification_spoof | pkorshunov/spoof-score-fusion-roc_hist/1 |
pkorshunov/pkorshunov/isv-asv-pad-fusion-complete/1/asv_isv-pad_gmm-fusion_lr-pa | avspoof/2@physicalaccess_verification, avspoof/2@physicalaccess_verify_train, avspoof/2@physicalaccess_verify_train_spoof, avspoof/2@physicalaccess_antispoofing, avspoof/2@physicalaccess_verification_spoof | pkorshunov/spoof-score-fusion-roc_hist/1 |
pkorshunov/pkorshunov/speech-pad-simple/1/speech-pad_gmm-pa | avspoof/2@physicalaccess_antispoofing | pkorshunov/simple_antispoofing_analyzer/4 |
pkorshunov/pkorshunov/isv-speaker-verification-spoof/1/isv-speaker-verification-spoof-pa | avspoof/2@physicalaccess_verification, avspoof/2@physicalaccess_verification_spoof | pkorshunov/eerhter_postperf_iso_spoof/1 |
pkorshunov/pkorshunov/isv-speaker-verification/1/isv-speaker-verification-licit | avspoof/2@physicalaccess_verification | pkorshunov/eerhter_postperf_iso/1 |