Bob 2.0 implementation of Voice Activity Detection (VAD) based on 4Hz energy filtering

This is a legacy algorithm: the platform API has changed since it was implemented, so new versions and forks will need to be updated.
This algorithm is splittable.
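The idea behind the detector: speech energy is strongly modulated at the syllabic rate of roughly 4 Hz, so a frame is labelled as speech when both its energy and the variance of its band-pass-filtered (around 4 Hz) log energy are high. The snippet below is a minimal NumPy/SciPy sketch of that idea only, not the deployed implementation (the actual code, further down, works on bob.ap filter-bank energies and adds smoothing and fallback thresholds); every name and threshold in the sketch is illustrative.

import numpy
import scipy.signal

def sketch_4hz_vad(signal, rate, win_ms=20.0, shift_ms=10.0):
    """Illustrative 4 Hz energy-modulation VAD (simplified, not the BEAT algorithm)."""
    signal = numpy.asarray(signal, dtype=float)
    win = int(rate * win_ms / 1000)        # samples per analysis window
    shift = int(rate * shift_ms / 1000)    # hop between consecutive windows
    n_frames = 1 + (len(signal) - win) // shift

    # per-frame energy, normalised by its mean
    energy = numpy.array([numpy.sum(signal[i * shift:i * shift + win] ** 2)
                          for i in range(n_frames)]) + 1e-10
    energy /= energy.mean()

    # band-pass the energy trajectory around 4 Hz (frame rate = 1000 / shift_ms)
    frame_rate = 1000.0 / shift_ms
    taps = scipy.signal.firwin(9, [3.5, 4.5], fs=frame_rate, pass_zero=False)
    filtered = numpy.abs(scipy.signal.lfilter(taps, 1.0, energy)) + 1e-10

    # modulation measure: variance of the log filtered energy over about 1 s
    one_sec = int(frame_rate)
    mod = numpy.zeros(n_frames)
    for w in range(max(n_frames - one_sec, 0)):
        mod[w] = numpy.var(numpy.log(filtered[w:w + one_sec]))

    # speech = high energy (within log(1/0.1**2) of the maximum, mirroring
    # ratio_threshold = 0.1 below) and high 4 Hz modulation (illustrative threshold)
    log_e = numpy.log(energy)
    return ((log_e > log_e.max() - numpy.log(100.0)) & (mod > mod.mean())).astype(int)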

Algorithms have at least one input and one output. All algorithm endpoints are organized in groups. Groups are used by the platform to indicate which inputs and outputs are synchronized together. The first group is automatically synchronized with the channel defined by the block in which the algorithm is deployed.

Group: main

Endpoint Name | Data Format | Nature
speech | system/array_1d_floats/1 | Input
labels | system/array_1d_integers/1 | Output
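In concrete terms, the speech input carries a one-dimensional array of raw audio samples and the labels output carries a one-dimensional integer array with one speech/non-speech decision per analysis frame. A small sketch of the expected shapes, assuming the default 16 kHz rate and 20 ms / 10 ms windowing (values and dtypes here are only illustrative):

import numpy

# "speech" (system/array_1d_floats/1): raw samples, here 2 s of silence at 16 kHz
speech = numpy.zeros(32000, dtype=numpy.float64)

# "labels" (system/array_1d_integers/1): one 0/1 decision per analysis frame,
# i.e. 1 + (32000 - 320) // 160 = 199 frames for this input
labels = numpy.zeros(199, dtype=numpy.int32)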

Parameters allow users to change the configuration of an algorithm when scheduling an experiment.

Name | Description | Type | Default | Range/Choices
rate | Sampling rate of the speech signal, in Hz | float64 | 16000.0 | [2000.0, 256000.0]
win_length_ms | Length of the sliding analysis window, in milliseconds (typically about 20 ms) | float64 | 20.0 |
win_shift_ms | Shift between neighbouring windows, in milliseconds; typically half the window length | float64 | 10.0 |
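As a rough sketch of how these parameters are consumed (the dictionary literal below only stands in for whatever the platform actually passes to Algorithm.setup(), shown in the code further down; all remaining Mod_4Hz parameters keep their defaults):

# hypothetical parameter values, matching the defaults in the table above
parameters = {
    'rate': 16000.0,        # sampling rate of the speech signal, in Hz
    'win_length_ms': 20.0,  # analysis window length, in milliseconds
    'win_shift_ms': 10.0,   # shift between consecutive windows, in milliseconds
}

algorithm = Algorithm()      # class defined in the code below
algorithm.setup(parameters)  # builds the Mod_4Hz preprocessor with these values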
The code for this algorithm, in Python:

###############################################################################
#                                                                             #
# Copyright (c) 2016 Idiap Research Institute, http://www.idiap.ch/           #
# Contact: beat.support@idiap.ch                                              #
#                                                                             #
# This file is part of the beat.core module of the BEAT platform.             #
#                                                                             #
# Commercial License Usage                                                    #
# Licensees holding valid commercial BEAT licenses may use this file in       #
# accordance with the terms contained in a written agreement between you      #
# and Idiap. For further information contact tto@idiap.ch                     #
#                                                                             #
# Alternatively, this file may be used under the terms of the GNU Affero      #
# Public License version 3 as published by the Free Software and appearing    #
# in the file LICENSE.AGPL included in the packaging of this file.            #
# The BEAT platform is distributed in the hope that it will be useful, but    #
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY  #
# or FITNESS FOR A PARTICULAR PURPOSE.                                        #
#                                                                             #
# You should have received a copy of the GNU Affero Public License along      #
# with the BEAT platform. If not, see http://www.gnu.org/licenses/.           #
#                                                                             #
###############################################################################

import numpy
import math
import scipy.signal
import bob.ap

class Mod_4Hz():
    """VAD based on the modulation of the energy around 4 Hz and the energy"""

    def __init__(
            self,
            max_iterations = 10,          # 10 iterations for the
            convergence_threshold = 0.0005,
            variance_threshold = 0.0005,
            win_length_ms = 20.,          # 20 ms
            win_shift_ms = 10.,           # 10 ms
            smoothing_window = 10,        # 10 frames (i.e. 100 ms)
            n_filters = 40,
            f_min = 0.0,                  # 0 Hz
            f_max = 4000,                 # 4 kHz
            pre_emphasis_coef = 1.0,
            ratio_threshold = 0.1,        # 0.1 of the maximum energy
            **kwargs
    ):
        # copy parameters
        self.max_iterations = max_iterations
        self.convergence_threshold = convergence_threshold
        self.variance_threshold = variance_threshold
        self.win_length_ms = win_length_ms
        self.win_shift_ms = win_shift_ms
        self.smoothing_window = smoothing_window
        self.n_filters = n_filters
        self.f_min = f_min
        self.f_max = f_max
        self.pre_emphasis_coef = pre_emphasis_coef
        self.ratio_threshold = ratio_threshold

    def _voice_activity_detection(self, energy, mod_4hz):

        n_samples = len(energy)
        threshold = numpy.max(energy) - numpy.log((1./self.ratio_threshold) * (1./self.ratio_threshold))
        labels = numpy.array(numpy.zeros(n_samples), dtype=numpy.int16)

        for i in range(n_samples):
            if ( energy[i] > threshold and mod_4hz[i] > 0.9 ):
                labels[i]=1

        # If the speech part is less than 10 seconds and less than half of the segment duration, try to find speech with more risk
        if  numpy.sum(labels) < 2000 and float(numpy.sum(labels)) / float(len(labels)) < 0.5:
            # TRY WITH MORE RISK 1...
            for i in range(n_samples):
                if ( energy[i] > threshold and mod_4hz[i] > 0.5 ):
                    labels[i]=1

        if  numpy.sum(labels) < 2000 and float(numpy.sum(labels)) / float(len(labels)) < 0.5:
            # TRY WITH MORE RISK 2...
            for i in range(n_samples):
                if ( energy[i] > threshold and mod_4hz[i] > 0.2 ):
                    labels[i]=1

        if  numpy.sum(labels) < 2000 and float(numpy.sum(labels)) / float(len(labels)) < 0.5: # This is special for short segments (less than 2s)...
            # TRY WITH MORE RISK 3...
            if (len(energy) < 200 ) or (numpy.sum(labels) == 0) or (numpy.mean(labels)<0.025):
                for i in range(n_samples):
                    if ( energy[i] > threshold ):
                        labels[i]=1
        return labels

    def averaging(self, list_1s_shift):
        """Smooths the 4 Hz modulation trajectory with a running average of up to 100 frames (about 1 s)."""
        len_list=len(list_1s_shift)
        sample_level_value = numpy.array(numpy.zeros(len_list, dtype=float))
        sample_level_value[0]=numpy.array(list_1s_shift[0])
        for j in range(2, numpy.min([len_list, 100])):
            sample_level_value[j-1]=((j-1.0)/j)*sample_level_value[j-2] +(1.0/j)*numpy.array(list_1s_shift[j-1])
        for j in range(numpy.min([len_list, 100]), len_list-100 +1):
            sample_level_value[j-1]=numpy.array(numpy.mean(list_1s_shift[j-100:j]))
        sample_level_value[len_list-1] = list_1s_shift[len_list -1]
        for j in range(2, numpy.min([len_list, 100]) + 1):
            sample_level_value[len_list-j]=((j-1.0)/j)*sample_level_value[len_list+1-j] +(1.0/j)*numpy.array(list_1s_shift[len_list-j])
        return sample_level_value

    def bandpass_firwin(self, ntaps, lowcut, highcut, fs, window='hamming'):
        # designs a linear-phase FIR band-pass filter between lowcut and highcut (window method)
        taps = scipy.signal.firwin(ntaps, [lowcut, highcut], fs=fs, pass_zero=False,
                                   window=window, scale=True)
        return taps


    def pass_band_filtering(self, energy_bands, fs):
        # filters each energy-band trajectory with a band-pass centred at 4 Hz
        energy_bands = energy_bands.T
        order = 8
        Wo = 4.
        num_taps = self.bandpass_firwin(order+1, (Wo - 0.5), (Wo + 0.5), fs)
        res = scipy.signal.lfilter(num_taps, 1.0, energy_bands)
        return res

    def modulation_4hz(self, filtering_res, rate_wavsample):
        # 4 Hz modulation measure: variance of the log band-passed energy over a sliding range of frames
        fs = rate_wavsample[0]
        win_length = int (fs * self.win_length_ms / 1000)
        win_shift = int (fs * self.win_shift_ms / 1000)
        Energy = filtering_res.sum(axis=0)
        mean_Energy = numpy.mean(Energy)
        Energy = Energy/mean_Energy

#        win_size = int (2.0 ** math.ceil(math.log(win_length) / math.log(2)))
        n_frames = 1 + (rate_wavsample[1].shape[0] - win_length) // win_shift
        range_modulation = int(fs/win_length) # This corresponds to 1 sec
        res = numpy.zeros(n_frames)
        if n_frames < range_modulation:
            return res
        for w in range(0,n_frames-range_modulation):
            E_range=Energy[w:w+range_modulation] # computes the modulation every 10 ms
            if (E_range<=0.).any():
                res[w] = 0
            else:
                res[w] = numpy.var(numpy.log(E_range))
        res[n_frames-range_modulation:n_frames] = res[n_frames-range_modulation-1]
        return res

    def smoothing(self, labels, smoothing_window):
        """ Applies a smoothing on VAD"""

        if numpy.sum(labels)< smoothing_window:
            return labels
        segments = []
        for k in range(1,len(labels)-1):
            if labels[k]==0 and labels[k-1]==1 and labels[k+1]==1 :
                labels[k]=1
        for k in range(1,len(labels)-1):
            if labels[k]==1 and labels[k-1]==0 and labels[k+1]==0 :
                labels[k]=0

        seg = numpy.array([0,0,labels[0]])
        for k in range(1,len(labels)):
            if labels[k] != labels[k-1]:
                seg[1]=k-1
                segments.append(seg)
                seg = numpy.array([k,k,labels[k]])

        seg[1]=len(labels)-1
        segments.append(seg)

        if len(segments) < 2:
            return labels

        curr = segments[0]
        next = segments[1]

        # Look at the first segment. If it's short enough, just change its labels
        if (curr[1]-curr[0]+1) < smoothing_window and (next[1]-next[0]+1) > smoothing_window:
            if curr[2]==1:
                labels[curr[0] : (curr[1]+1)] = numpy.zeros(curr[1] - curr[0] + 1)
                curr[2]=0
            else: #curr[2]==0
                labels[curr[0] : (curr[1]+1)] = numpy.ones(curr[1] - curr[0] + 1)
                curr[2]=1

        for k in range(1,len(segments)-1):
            prev = segments[k-1]
            curr = segments[k]
            next = segments[k+1]

            if (curr[1]-curr[0]+1) < smoothing_window and (prev[1]-prev[0]+1) > smoothing_window and (next[1]-next[0]+1) > smoothing_window:
                if curr[2]==1:
                    labels[curr[0] : (curr[1]+1)] = numpy.zeros(curr[1] - curr[0] + 1)
                    curr[2]=0
                else: #curr[2]==0
                    labels[curr[0] : (curr[1]+1)] = numpy.ones(curr[1] - curr[0] + 1)
                    curr[2]=1

        prev = segments[-2]
        curr = segments[-1]

        if (curr[1]-curr[0]+1) < smoothing_window and (prev[1]-prev[0]+1) > smoothing_window:
            if curr[2]==1:
                labels[curr[0] : (curr[1]+1)] = numpy.zeros(curr[1] - curr[0] + 1)
                curr[2]=0
            else: #if curr[2]==0
                labels[curr[0] : (curr[1]+1)] = numpy.ones(curr[1] - curr[0] + 1)
                curr[2]=1

        return labels

    def mod_4hz(self, rate_wavsample):
        """Computes and returns the 4Hz modulation energy features for the given input wave file"""

        # Set parameters
        wl = float(self.win_length_ms)
        ws = float(self.win_shift_ms)
        nf = self.n_filters
        f_min = float(self.f_min)
        f_max = float(self.f_max)
        pre = float(self.pre_emphasis_coef)

        c = bob.ap.Spectrogram(float(rate_wavsample[0]), float(wl), float(ws), nf, float(f_min), float(f_max), float(pre))
        c.energy_filter=True
        c.log_filter=False
        c.energy_bands=True

        sig = rate_wavsample[1]
        energy_bands = c(sig)
        filtering_res = self.pass_band_filtering(energy_bands, rate_wavsample[0])
        mod_4hz = self.modulation_4hz(filtering_res, rate_wavsample)
        mod_4hz = self.averaging(mod_4hz)
        e = bob.ap.Energy(float(rate_wavsample[0]), float(wl), float(ws))
        energy_array = e(rate_wavsample[1])
        labels = self._voice_activity_detection(energy_array, mod_4hz)
        labels = self.smoothing(labels,self.smoothing_window) # discard isolated speech less than 100ms
        return labels, energy_array, mod_4hz

    def __call__(self, input_signal, annotations=None):
        """Labels speech (1) and non-speech (0) parts of the given input wave file using 4Hz modulation energy and energy
            Input parameters:
               * input_signal[0] --> rate
               * input_signal[1] --> signal
        """
        [labels, energy_array, mod_4hz] = self.mod_4hz(input_signal)
        rate = input_signal[0]
        data = input_signal[1]
        return rate, data, labels


class Algorithm:

    def __init__(self):
        self.win_length_ms = 20
        self.win_shift_ms = 10
        self.rate = 16000

    def setup(self, parameters):
        self.rate = parameters.get('rate', self.rate)
        self.win_length_ms = parameters.get('win_length_ms', self.win_length_ms)
        self.win_shift_ms = parameters.get('win_shift_ms', self.win_shift_ms)
        self.preprocessor = Mod_4Hz(win_length_ms=self.win_length_ms, win_shift_ms=self.win_shift_ms)
        return True

    def process(self, inputs, outputs):
        data = inputs["speech"].data.value

        if data is None or not data.size:
            # no samples received: output a dummy all-zero (non-speech) label vector
            labels = numpy.zeros(2, dtype=numpy.int8)
        else:
            float_wav = data.astype('float64')
            [labels, energies, mod_4hz] = self.preprocessor.mod_4hz([self.rate, float_wav])

        outputs["labels"].write({
            'value': labels
        })

        return True


Voice activity detection (VAD) based on the 4 Hz modulation of energy, with carefully tuned thresholds.
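For local testing outside the platform, the Mod_4Hz class can also be driven directly; a minimal sketch, assuming bob.ap (Bob 2.x) is installed and 'sample.wav' is a placeholder name for a mono 16 kHz recording:

import numpy
import scipy.io.wavfile

rate, samples = scipy.io.wavfile.read('sample.wav')   # placeholder file name
signal = samples.astype('float64')

vad = Mod_4Hz(win_length_ms=20., win_shift_ms=10.)
labels, energy, mod_4hz = vad.mod_4hz((rate, signal))

n_speech = int(numpy.sum(labels))
print('%d of %d frames labelled as speech (%.1f%%)'
      % (n_speech, len(labels), 100.0 * n_speech / len(labels)))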

Experiments

Name | Databases/Protocols | Analyzers
pkorshunov/pkorshunov/isv-asv-pad-fusion-complete/1/asv_isv-pad_lbp_hist_ratios_lr-fusion_lr-pa_aligned | avspoof/2@physicalaccess_verification, avspoof/2@physicalaccess_verify_train, avspoof/2@physicalaccess_verify_train_spoof, avspoof/2@physicalaccess_antispoofing, avspoof/2@physicalaccess_verification_spoof | pkorshunov/spoof-score-fusion-roc_hist/1
pkorshunov/pkorshunov/speech-pad-simple/1/speech-pad_lbp_hist_ratios_lr-pa_aligned | avspoof/2@physicalaccess_antispoofing | pkorshunov/simple_antispoofing_analyzer/4
pkorshunov/pkorshunov/isv-asv-pad-fusion-complete/1/asv_isv-pad_gmm-fusion_lr-pa | avspoof/2@physicalaccess_verification, avspoof/2@physicalaccess_verify_train, avspoof/2@physicalaccess_verify_train_spoof, avspoof/2@physicalaccess_antispoofing, avspoof/2@physicalaccess_verification_spoof | pkorshunov/spoof-score-fusion-roc_hist/1
pkorshunov/pkorshunov/speech-pad-simple/1/speech-pad_gmm-pa | avspoof/2@physicalaccess_antispoofing | pkorshunov/simple_antispoofing_analyzer/4
pkorshunov/pkorshunov/isv-speaker-verification-spoof/1/isv-speaker-verification-spoof-pa | avspoof/2@physicalaccess_verification, avspoof/2@physicalaccess_verification_spoof | pkorshunov/eerhter_postperf_iso_spoof/1
pkorshunov/pkorshunov/isv-speaker-verification/1/isv-speaker-verification-licit | avspoof/2@physicalaccess_verification | pkorshunov/eerhter_postperf_iso/1
pkorshunov/pkorshunov/speech-antispoofing-baseline/1/btas2016-baseline-pa | avspoof/1@physicalaccess_antispoofing | pkorshunov/simple_antispoofing_analyzer/2
Version history: pkorshunov/vad_4hz/1, created 8 March 2016.

