Content
The data collection followed a simple recording protocol that was replicated for all sessions. For each session there was a corresponding script describing the whole content to be recorded, as follows:
Text reading:
- a pre-defined text (extracted from the database's consent form);
- four phonetically rich sentences (randomly selected among 562 options);
- passphrase: three repetitions of a single sentence (the same sentence for all participants in all sessions).
Spontaneous speech:
- answers to generic questions (all participants answered all 15 questions, selected from a fixed set and distributed across the 5 sessions in random order);
- a fake name;
- a fake address;
- a fake date of birth;
- a fake ID number;
- a fake phone number;
- two command words (all participants spoke 10 words across the 5 sessions, in random order).
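The random distribution of questions and command words across sessions can be illustrated with a minimal sketch. It is not the tool used to prepare the recording scripts. The function name `distribute_across_sessions`, the placeholder labels, and the seed are hypothetical, and the even split of 15 questions into 3 per session is an assumption (the text only states two command words per session).

```python
import random


def distribute_across_sessions(items, n_sessions, rng):
    """Shuffle the items and deal them evenly into n_sessions groups."""
    if len(items) % n_sessions != 0:
        raise ValueError("items must divide evenly among the sessions")
    shuffled = list(items)
    rng.shuffle(shuffled)
    per_session = len(shuffled) // n_sessions
    return [
        shuffled[i * per_session:(i + 1) * per_session]
        for i in range(n_sessions)
    ]


rng = random.Random(0)  # arbitrary seed, used only to make the sketch repeatable

# Placeholder labels: the actual questions and command words are not listed here.
questions = [f"question_{i:02d}" for i in range(1, 16)]
command_words = [f"command_{i:02d}" for i in range(1, 11)]

question_plan = distribute_across_sessions(questions, 5, rng)  # assumed 3 per session
command_plan = distribute_across_sessions(command_words, 5, rng)  # 2 per session
```

Each participant would get an independent shuffle, so every item is covered exactly once per participant while the session in which it appears varies.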
Numbers, digits, time values and alphanumeric strings:
- a monetary amount between 10 and 10 000, randomly generated;
- a number between 10 and 1000, randomly generated;
- a number between 1000 and 10 million, randomly generated;
- three repetitions of a random digit sequence (the first read at a slow pace and the others read naturally);
- a fake credit card number;
- an alphanumeric string composed of 6 characters, randomly generated;
- a time value, selected from a predefined set of 181 samples, equally distributed among participants.
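As a rough illustration of how the randomly generated items in this group could be drawn, the sketch below produces one set of prompts per session. Several details are assumptions not stated in the protocol: integer values with inclusive bounds, a digit sequence of 8 digits, and an alphanumeric alphabet of uppercase letters and digits. The function name `numeric_prompts` and its keys are hypothetical. The credit card number and the time value are omitted because their generation rules and the 181-sample set are not described here.

```python
import random
import string


def numeric_prompts(rng, digit_sequence_length=8):
    """Draw one set of randomly generated prompts for a session.

    Assumptions: integer values, inclusive bounds, and an 8-digit
    sequence by default; the protocol does not specify these details.
    """
    alphabet = string.ascii_uppercase + string.digits  # assumed alphabet
    return {
        "monetary_amount": rng.randint(10, 10_000),
        "small_number": rng.randint(10, 1000),
        "large_number": rng.randint(1000, 10_000_000),
        "digit_sequence": "".join(
            rng.choice(string.digits) for _ in range(digit_sequence_length)
        ),
        "alphanumeric": "".join(rng.choice(alphabet) for _ in range(6)),
    }


prompts = numeric_prompts(random.Random(1))  # arbitrary seed
```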
It is important to note that all content was recorded in Brazilian Portuguese.
The BioCPqD Phase I database provides two unbiased biometric verification protocols, one for male and one for female participants, based on the MOBIO database protocols. These protocols partition the database into three different groups:
- a Training set: used to train the parameters of the algorithm to be tested, e.g., to create the projection matrix, Universal Background Models, etc.;
- a Development set: used to evaluate hyper-parameters of the tested algorithms;
- a Test set: used to evaluate the generalization performance of the tested algorithms with previously unseen data.
Both the development and test sets are further split into an enrollment subset (used to enroll participants' models) and a probe subset (whose files are tested against all participants' models).
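The enrollment/probe split implies that, within the development set or the test set, every probe file is compared against every enrolled model. The following minimal sketch builds such a trial list for one group. The function name `build_trials`, the participant identifiers, and the file names are hypothetical and do not reflect the database's actual file naming or its official protocol files.

```python
from itertools import product


def build_trials(enrollment, probes):
    """List every (model, probe) comparison for one group (development or test).

    enrollment: dict mapping a participant id to that participant's enrollment files
    probes: list of (probe_file, true_participant_id) pairs
    Returns tuples (model_id, probe_file, is_genuine).
    """
    return [
        (model_id, probe_file, model_id == true_id)
        for (probe_file, true_id), model_id in product(probes, sorted(enrollment))
    ]


# Hypothetical participants and files, for illustration only.
enrollment = {
    "p01": ["p01_s1_a.wav"],
    "p02": ["p02_s1_a.wav"],
}
probes = [
    ("p01_s2_b.wav", "p01"),
    ("p02_s2_b.wav", "p02"),
    ("p01_s3_c.wav", "p01"),
]

trials = build_trials(enrollment, probes)
genuine = [t for t in trials if t[2]]
impostor = [t for t in trials if not t[2]]
```

With 2 enrolled models and 3 probe files this yields 6 trials, 3 genuine and 3 impostor. The Training set has no such split, since it is used only to estimate model parameters.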