Python API to bob.kaldi
This section documents the Python API of bob.kaldi.
Functions
-
bob.kaldi.cepstral(data, cepstral_type, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, delta_order=2, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)
Computes the cepstral (MFCC or PLP) features for the given speech samples.
Parameters:
- data (numpy.ndarray) – A 1D numpy ndarray object containing 64-bit float numbers with the audio signal to calculate the cepstral features from. The input needs to be normalized between [-1, 1].
- cepstral_type (str) – The type of cepstral features: mfcc or plp.
- rate (float) – The sampling rate of the input signal in data.
- preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis.
- raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing.
- delta_order (int, optional) – Order of deltas to add to the raw MFCC or PLP features.
- frame_length (int, optional) – Frame length in milliseconds.
- frame_shift (int, optional) – Frame shift in milliseconds.
- num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0).
- num_mel_bins (int, optional) – Number of triangular mel-frequency bins.
- cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs.
- low_freq (int, optional) – Low cutoff frequency for mel bins.
- high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist).
- dither (float, optional) – Dithering constant (0.0 means no dither).
- snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame length. If false, the number of frames depends only on the frame shift, and the data is reflected at the ends.
- normalization (bool, optional) – If true, the input samples in data are normalized to [-1, 1].
Returns: The cepstral features calculated for the input signal (2D array of 32-bit floats).
Return type:
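The interaction between frame_length, frame_shift and snip_edges can be made concrete with a small frame-count estimate. This is a minimal sketch, assuming the counting rule follows Kaldi's usual framing convention; the helper name num_frames is hypothetical and is not part of bob.kaldi.

```python
def num_frames(num_samples, rate=8000, frame_length=25, frame_shift=10,
               snip_edges=True):
    """Estimate how many feature frames a signal of num_samples yields.

    Assumption: the rule mirrors Kaldi's framing convention; it was not
    taken from the bob.kaldi source.
    """
    window = int(rate * frame_length / 1000)  # samples per frame
    shift = int(rate * frame_shift / 1000)    # samples between frame starts
    if snip_edges:
        # Only frames that fit completely inside the signal are kept.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // shift
    # Otherwise the count depends only on the shift.
    return (num_samples + shift // 2) // shift


one_second = 8000  # samples at the default rate of 8000 Hz
print(num_frames(one_second, snip_edges=True))   # 98
print(num_frames(one_second, snip_edges=False))  # 100
```

With the default 25 ms window and 10 ms shift, one second of audio at 8000 Hz gives 98 rows in the returned feature matrix when snip_edges is true and 100 when it is false, under the stated assumption.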
-
bob.kaldi.compute_dnn_vad(samples, rate, silence_threshold=0.9, posterior=0)
Performs Voice Activity Detection on a Kaldi feature matrix.
Parameters:
- samples (numpy.ndarray) – A 2-D numpy array, with log-energy being in the first component of each feature vector.
- rate (float) – The sampling rate of the input signal in samples.
- silence_threshold (float, optional) – Silence threshold to be used for silence posterior evaluation.
- posterior (int, optional) – Index of the posterior feature to be used for detection. Useful values are 0, 1 and 2, for silence, laughter and noise, respectively.
Returns: The labels [1/0] of voiced features (1D array of floats).
Return type:
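How silence_threshold and posterior might combine into frame labels can be pictured as follows. This is only a sketch: the decision rule (a frame is unvoiced when the selected posterior exceeds the threshold) is an assumption, and labels_from_posteriors is a hypothetical helper, not a bob.kaldi function.

```python
def labels_from_posteriors(posteriors, silence_threshold=0.9, posterior=0):
    """Return 1.0 for voiced frames and 0.0 otherwise.

    Assumption: a frame is rejected when the posterior at index
    `posterior` is strictly greater than `silence_threshold`.
    """
    return [0.0 if frame[posterior] > silence_threshold else 1.0
            for frame in posteriors]


# Hypothetical per-frame posteriors ordered as (silence, laughter, noise).
frames = [[0.95, 0.03, 0.02], [0.10, 0.05, 0.85], [0.40, 0.50, 0.10]]
print(labels_from_posteriors(frames))
print(labels_from_posteriors(frames, silence_threshold=0.8, posterior=2))
```

The first call rejects only the frame dominated by silence; the second rejects only the frame dominated by noise.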
-
bob.kaldi.compute_vad(samples, rate, vad_energy_mean_scale=0.5, vad_energy_th=5, vad_frames_context=0, vad_proportion_th=0.6)
Performs Voice Activity Detection on a Kaldi feature matrix.
Parameters:
- samples (numpy.ndarray) – A 2-D numpy array, with log-energy being in the first component of each feature vector.
- rate (float) – The sampling rate of the input signal in samples.
- vad_energy_mean_scale (float, optional) – If this is set to s, the actual threshold is s*m + vad_energy_th, where m is the mean log-energy of the file.
- vad_energy_th (float, optional) – Constant term in the energy threshold for MFCC0 for VAD.
- vad_frames_context (int, optional) – Number of frames of context on each side of the central frame, in the window for which energy is monitored.
- vad_proportion_th (float, optional) – Parameter controlling the proportion of frames within the window that need to have more energy than the threshold.
Returns: The labels [1/0] of voiced features (1D array of floats).
Return type:
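The four VAD parameters above can be read as one decision rule per frame. The sketch below works on a plain list of per-frame log-energies and assumes the rule matches Kaldi's energy-based VAD; energy_vad is a hypothetical helper used for illustration only.

```python
def energy_vad(log_energy, vad_energy_mean_scale=0.5, vad_energy_th=5.0,
               vad_frames_context=0, vad_proportion_th=0.6):
    """Label each frame 1.0 (voiced) or 0.0 from its log-energy."""
    if not log_energy:
        return []
    mean = sum(log_energy) / len(log_energy)
    # Threshold: s*m + vad_energy_th, as described for vad_energy_mean_scale.
    threshold = vad_energy_mean_scale * mean + vad_energy_th
    labels = []
    for t in range(len(log_energy)):
        # Window of context frames on each side, clipped at the signal edges.
        start = max(0, t - vad_frames_context)
        stop = min(len(log_energy), t + vad_frames_context + 1)
        window = log_energy[start:stop]
        above = sum(1 for energy in window if energy > threshold)
        voiced = above >= len(window) * vad_proportion_th
        labels.append(1.0 if voiced else 0.0)
    return labels


# Mean log-energy is 7, so the threshold is 0.5 * 7 + 5 = 8.5.
print(energy_vad([2.0, 2.0, 12.0, 12.0, 12.0, 2.0]))
# [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
```

Raising vad_frames_context smooths the decision, because a frame is then judged by the share of its neighbours that exceed the threshold rather than by its own energy alone.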
-
bob.kaldi.gmm_score(feats, spkubm, ubm)
Print out per-frame log-likelihoods for input utterance.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
- spkubm (str) – A text formatted Kaldi adapted global DiagGMM.
- ubm (str) – A text formatted Kaldi global DiagGMM.
Returns: The average of per-frame log-likelihoods.
Return type:
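To show what "the average of per-frame log-likelihoods" means for a diagonal-covariance GMM, here is a minimal sketch in plain Python. It does not parse Kaldi's text format: the weights, means and variances are passed as ordinary lists, which is an assumption made for illustration, and how gmm_score combines the speaker model with the UBM is not covered.

```python
import math


def frame_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM."""
    component_logs = []
    for weight, mean, var in zip(weights, means, variances):
        log_density = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(frame, mean, var))
        component_logs.append(math.log(weight) + log_density)
    # Log-sum-exp over components, shifted by the maximum for stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def average_log_likelihood(feats, weights, means, variances):
    """Average the per-frame log-likelihoods over all frames."""
    total = sum(frame_log_likelihood(f, weights, means, variances)
                for f in feats)
    return total / len(feats)


# One standard-normal component, two frames located at the mean.
score = average_log_likelihood([[0.0], [0.0]], [1.0], [[0.0]], [[1.0]])
print(round(score, 4))  # -0.9189
```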
-
bob.kaldi.ivector_extract(feats, fubm, ivector_extractor, num_gselect=20, min_post=0.025, posterior_scale=1.0)
Implements Kaldi egs/sre10/v1/extract_ivectors.sh.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
- fubm (str) – A full-covariance UBM.
- ivector_extractor (str) – An iVector extractor model.
- num_gselect (int, optional) – Number of Gaussians to keep per frame.
- min_post (float, optional) – If nonzero, posteriors below this threshold will be pruned away and the rest will be renormalized to sum to one.
- posterior_scale (float, optional) – A global scale applied to the posteriors.
Returns: The iVectors calculated for the input signal.
Return type:
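The effect of min_post on a single frame can be sketched as follows. prune_posteriors is a hypothetical helper, and the fallback for the case where no posterior survives is an assumption of this sketch rather than documented behaviour.

```python
def prune_posteriors(posteriors, min_post=0.025):
    """Zero out posteriors below min_post and renormalize the rest."""
    kept = [p if p >= min_post else 0.0 for p in posteriors]
    total = sum(kept)
    if total == 0.0:
        # Assumption: if nothing survives, leave the frame unchanged.
        return list(posteriors)
    return [p / total for p in kept]


pruned = prune_posteriors([0.6, 0.38, 0.02])
print(pruned[2])  # 0.0
```

The two surviving posteriors keep their ratio and are rescaled so that the frame sums to one again.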
-
bob.kaldi.ivector_train(feats, fubm, ivector_extractor, num_gselect=20, ivector_dim=600, use_weights=False, num_iters=5, min_post=0.025, num_samples_for_weights=3, posterior_scale=1.0)
Implements Kaldi egs/sre10/v1/train_ivector_extractor.sh.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
- fubm (str) – A full-covariance UBM.
- ivector_extractor (str) – A path for the iVector extractor.
- num_gselect (int, optional) – Number of Gaussians to keep per frame.
- ivector_dim (int, optional) – Dimension of the iVector.
- use_weights (bool, optional) – If true, regress the log-weights on the iVector.
- num_iters (int, optional) – Number of iterations of training.
- min_post (float, optional) – If nonzero, posteriors below this threshold will be pruned away and the rest will be renormalized to sum to one.
- num_samples_for_weights (int, optional) – Number of samples from the iVector distribution to use for accumulating stats for the weight update. Must be > 1.
- posterior_scale (float, optional) – A global scale applied to the posteriors.
Returns: A text formatted trained Kaldi IvectorExtractor.
Return type:
-
bob.kaldi.mfcc(data, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)
Computes the MFCCs for the given speech samples.
Parameters:
- data (numpy.ndarray) – A 1D numpy ndarray object containing 64-bit float numbers with the audio signal to calculate the MFCCs from. The input needs to be normalized between [-1, 1].
- rate (float) – The sampling rate of the input signal in data.
- preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis.
- raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing.
- frame_length (int, optional) – Frame length in milliseconds.
- frame_shift (int, optional) – Frame shift in milliseconds.
- num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0).
- num_mel_bins (int, optional) – Number of triangular mel-frequency bins.
- cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs.
- low_freq (int, optional) – Low cutoff frequency for mel bins.
- high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist).
- dither (float, optional) – Dithering constant (0.0 means no dither).
- snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame length. If false, the number of frames depends only on the frame shift, and the data is reflected at the ends.
- normalization (bool, optional) – If true, the input samples in data are normalized to [-1, 1].
Returns: The MFCCs calculated for the input signal (2D array of 32-bit floats).
Return type:
-
bob.kaldi.mfcc_from_path(filename, channel=0, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True)
Computes the MFCCs for a given input signal recorded into a file.
Parameters:
- filename (str) – A path to a valid WAV or NIST Sphere file to read data from.
- channel (int) – The audio channel to read from inside the file.
- preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis.
- raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing.
- frame_length (int, optional) – Frame length in milliseconds.
- frame_shift (int, optional) – Frame shift in milliseconds.
- num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0).
- num_mel_bins (int, optional) – Number of triangular mel-frequency bins.
- cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs.
- low_freq (int, optional) – Low cutoff frequency for mel bins.
- high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist).
- dither (float, optional) – Dithering constant (0.0 means no dither).
- snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame length. If false, the number of frames depends only on the frame shift, and the data is reflected at the ends.
Returns: The MFCCs calculated for the input signal (2D array of 32-bit floats).
Return type:
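Before passing a file and a channel index to mfcc_from_path, it can help to check what the file actually contains. The sketch below uses Python's standard wave module, so it covers WAV files only and not NIST Sphere; describe_wav and valid_channel are hypothetical helpers, not part of bob.kaldi.

```python
import wave


def describe_wav(path):
    """Read the sampling rate, channel count and length of a WAV file."""
    with wave.open(path, "rb") as handle:
        return {
            "rate": handle.getframerate(),
            "channels": handle.getnchannels(),
            "frames": handle.getnframes(),
        }


def valid_channel(path, channel=0):
    """Check that a zero-based channel index exists in the file."""
    return 0 <= channel < describe_wav(path)["channels"]
```

For a stereo recording, valid_channel(path, 1) is true and valid_channel(path, 2) is false, assuming channel is a zero-based index as the default of 0 suggests.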
-
bob.kaldi.nnet_forward(feats, nnet, feats_transform='', apply_log=False, no_softmax=False, prior_floor=1e-10, prior_scale=1, use_gpu=False)
Computes the forward pass for given features.
Parameters:
- feats (numpy.ndarray) – The input cepstral features (2D array of 32-bit floats).
- nnet (str) – The neural network.
- feats_transform (str, optional) – The input feature transform for feats.
- apply_log (bool, optional) – Transform NN output by log().
- no_softmax (bool, optional) – Removes the last component with Softmax.
- prior_floor (float, optional) – Flooring constant for prior probability.
- prior_scale (float, optional) – Scaling factor to be applied on pdf-log-priors.
- use_gpu (bool, optional) – Compute forward pass on GPU.
Returns: The posterior features.
Return type:
-
bob.kaldi.plda_enroll(feats, pldamean)
Implements Kaldi egs/sre10/v1/plda_scoring.sh.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors (of a single speaker).
- pldamean (str) – A path to the global PLDA mean file.
Returns: A path to enrolled PLDA model (average iVectors).
Return type:
-
bob.kaldi.plda_score(feats, model, plda, globalmean, smoothing=0)
Implements Kaldi egs/sre10/v1/plda_scoring.sh.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors.
- model (str) – A speaker model (average iVectors).
- plda (str) – A PLDA model.
- globalmean (str) – A global PLDA mean.
- smoothing (float) – Factor used in smoothing the within-class covariance (this factor times the between-class covariance is added).
Returns: A PLDA score.
Return type:
-
bob.kaldi.plda_train(feats, plda_file, mean_file)
Implements Kaldi egs/sre10/v1/plda_scoring.sh.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
- plda_file (str) – A path to the trained PLDA model.
- mean_file (str) – A path to the global PLDA mean file.
Returns: Trained PLDA model and global mean (2D str array)
Return type:
-
bob.kaldi.ubm_enroll(feats, ubm)
Performs MAP adaptation of a GMM-UBM model.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
- ubm (str) – A text formatted Kaldi global DiagGMM.
Returns: A text formatted Kaldi enrolled DiagGMM.
Return type:
-
bob.kaldi.ubm_full_train(feats, dubm, fubmfile, num_gselect=20, num_iters=4, min_gaussian_weight=0.0001)
Implements Kaldi egs/sre10/v1/train_full_ubm.sh.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
- dubm (str) – A text formatted trained Kaldi global DiagGMM model.
- fubmfile (str) – A path to the full covariance UBM model.
- num_gselect (int, optional) – Number of Gaussians to keep per frame.
- num_iters (int, optional) – Number of iterations of training.
- min_gaussian_weight (float, optional) – Kaldi MleDiagGmmOptions: Min Gaussian weight before we remove it.
Returns: A path to the full covariance UBM model.
Return type:
-
bob.kaldi.ubm_train(feats, ubmname, num_threads=4, num_frames=500000, min_gaussian_weight=0.0001, num_gauss=2048, num_gauss_init=0, num_gselect=30, num_iters_init=20, num_iters=4, remove_low_count_gaussians=True)
Implements Kaldi egs/sre10/v1/train_diag_ubm.sh.
Parameters:
- feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
- ubmname (str) – A path to the UBM model.
- num_threads (int, optional) – Number of threads used for statistics accumulation.
- num_frames (int, optional) – Number of feature vectors to store in memory and train on (randomly chosen from the input features).
- min_gaussian_weight (float, optional) – Kaldi MleDiagGmmOptions: Min Gaussian weight before we remove it.
- num_gauss (int, optional) – Number of Gaussians in the model.
- num_gauss_init (int, optional) – Number of Gaussians in the model initially (if nonzero and less than num_gauss, mixture splitting is performed).
- num_gselect (int, optional) – Number of Gaussians to keep per frame.
- num_iters_init (int, optional) – Number of iterations of training for initialization of the single diagonal GMM.
- num_iters (int, optional) – Number of iterations of training.
- remove_low_count_gaussians (bool, optional) – Kaldi MleDiagGmmOptions: If true, remove Gaussians that fall below the floors.
Returns: A text formatted trained Kaldi global DiagGMM model.
Return type:
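The training and extraction functions documented in this section appear designed to be chained. The sketch below shows one plausible order, using only the positional parameters listed in the signatures above. It assumes that each return value is accepted as the matching argument of the next call, which the parameter descriptions suggest but do not guarantee. The module is passed in as an argument (kaldi) so that the sketch does not depend on bob.kaldi being installed; train_and_extract and the three path arguments are hypothetical.

```python
def train_and_extract(kaldi, feats, ubm_path, full_ubm_path, extractor_path):
    """Train a diagonal UBM, a full UBM and an extractor, then get iVectors.

    kaldi: an object exposing the functions above, e.g. the imported
    bob.kaldi module.
    feats: a 2D array of MFCCs, used here for both training and extraction.
    """
    dubm = kaldi.ubm_train(feats, ubm_path)
    fubm = kaldi.ubm_full_train(feats, dubm, full_ubm_path)
    extractor = kaldi.ivector_train(feats, fubm, extractor_path)
    return kaldi.ivector_extract(feats, fubm, extractor)
```

With the package installed, a call would look like train_and_extract(bob.kaldi, feats, ubm_path, full_ubm_path, extractor_path) after import bob.kaldi. Real setups would normally train on a large pool of features and extract from a separate utterance.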