Advanced features
There are several extra features that we did not discuss. In this section we’ll explain the database interface, checkpointing of experiments, and multitasking with Dask.
Database interface
All PAD databases must inherit from the `bob.pad.base.pipelines.Database` class and implement the following methods:
`database.fit_samples`
: returns the samples (or delayed samples) used to train the classifier

`database.predict_samples`
: returns the samples that will be used for evaluating the system. This is where the group (dev or eval) is specified.
The returned samples must have the following attributes:
`data`
: the data of the sample

`key`
: a unique identifier for the sample; must be a string

`attack_type`
: the attack type of the sample; must be `None` for bonafide samples. For attack samples, this indicates the presentation attack instrument (PAI) and is used to report error rates per PAI.

`subject_id`
: the identity of the subject. This might not be available for all databases.
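As an illustration, a hand-written database could look like the sketch below. This is only a sketch under assumptions: whether `Database` can be subclassed without constructor arguments and whether `predict_samples` takes a `group` keyword may differ in your version, and the toy arrays are placeholders for real data loading.

import numpy as np
from bob.pipelines import Sample
from bob.pad.base.pipelines import Database

class ToyPadDatabase(Database):
    # Illustrative only: hard-coded samples instead of a real dataset.
    def fit_samples(self):
        # Samples used to train the classifier.
        return [
            Sample(np.zeros((8, 8)), key="train_1", attack_type=None, subject_id="s1"),
            Sample(np.ones((8, 8)), key="train_2", attack_type="print", subject_id="s2"),
        ]

    def predict_samples(self, group="dev"):
        # Samples of the requested group (dev or eval) used for scoring.
        return [
            Sample(np.zeros((8, 8)), key=f"{group}_1", attack_type=None, subject_id="s3"),
        ]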
File list interface
A class with those methods returning the corresponding data can be implemented for each dataset, but an easier way is to use the file list interface. It allows the creation of multiple protocols and various groups by editing some CSV files. The `bob.pad.base.database.FileListPadDatabase` class, which builds on File List Databases (CSV), implements this interface.
The dataset configuration file can be as simple as:
from bob.pad.base.database import FileListPadDatabase

# dataset root folder and protocol (sub-folder) name
database = FileListPadDatabase("path/to/my_dataset", "my_protocol")
The files must follow this structure and naming:
my_dataset
|
+-- my_protocol
|
+-- train.csv
+-- dev.csv
+-- eval.csv
The `dev.csv` file is the main file here and is used for scoring samples of the development group. The content of the `train.csv` file is used when a protocol contains data for training the classifier. The `eval.csv` file is optional and is used when a protocol contains data for evaluation.
Each CSV file must contain at least the path to the raw data, an identifier for the subject in the image (the subject field), and an attack type. The structure of each CSV file should be as below:
filename,subject,attack_type
path_1,subject_1,
path_2,subject_2,
path_3,subject_1,attack_1
path_4,subject_2,attack_1
...
The `attack_type` field is used to differentiate bonafide presentations from attacks. An empty field indicates a bonafide sample; otherwise, different attack types can be used (e.g. `print`, `replay`, etc.) and can be analyzed separately during evaluation.
Metadata can be shipped within the Samples (e.g. gender, age, session, …) by adding a column to the CSV file for each metadata field:
filename,subject,attack_type,gender,age
path_1,subject_1,,M,25
path_2,subject_2,,F,24
path_3,subject_1,attack_1,M,25
path_4,subject_2,attack_1,F,24
...
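Once the database is loaded, each extra CSV column typically surfaces as an attribute on the corresponding sample. A minimal usage sketch follows, assuming the `group` keyword and that attribute names mirror the CSV headers above:

from bob.pad.base.database import FileListPadDatabase

database = FileListPadDatabase("path/to/my_dataset", "my_protocol")

# Each CSV column becomes a sample attribute; attack_type is None
# for bonafide rows (empty field in the CSV).
for sample in database.predict_samples(group="dev"):
    print(sample.key, sample.attack_type, sample.gender, sample.age)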
Checkpoints and Dask
By default, the `bob pad run-pipeline` command will save the features of each step of the pipeline and the fitted estimators in the output folder. To avoid this, use the `--memory` option.
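For example, to run the same pipeline without writing checkpoints to disk:

$ bob pad run-pipeline --output output_dir --memory ...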
The Dask integration can also be used by giving a client configuration to the `--dask-client` option. Basic Idiap SGE configurations are defined by bob.pipelines: `sge` and `sge-gpu`:
$ bob pad run-pipeline --output output_dir --dask-client sge ...
Note
For more in-depth information, you may also want to read the Dask section in PipelineSimple: Advanced features.