Advanced features

There are several extra features that we did not discuss. In this section we’ll explain the database interface, checkpointing of experiments, and distributing computation with Dask.

Database interface

All PAD databases must inherit from the bob.pad.base.pipelines.Database class and implement the following methods:

  • `database.fit_samples` returns the samples (or delayed samples) used to train the classifier;

  • `database.predict_samples` returns the samples that will be used for evaluating the system. This is where the group (dev or eval) is specified.

The returned samples must have the following attributes (a minimal sketch follows this list):

  • data: the data of the sample

  • key: a unique identifier for the sample; must be a string.

  • attack_type: the attack type of the sample; must be None for bonafide samples. For attack samples, it indicates the presentation attack instrument (PAI) and is used to report error rates per PAI.

  • subject_id: The identity of the subject. This might not be available for all databases.
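
As an illustration, a custom database could look like the sketch below. The paths, the _load helper, and the MyDatabase class are hypothetical, and we assume the base class can be subclassed by implementing only the two methods above:

import numpy as np

from bob.pad.base.pipelines import Database
from bob.pipelines import Sample

def _load(path):
    # Hypothetical loader; replace with your actual data reading code.
    return np.zeros((64, 64))

class MyDatabase(Database):
    def fit_samples(self):
        # Training samples: attack_type is None for bonafide samples
        # and names the PAI (e.g. "print") for attacks.
        return [
            Sample(_load("train/real_1.png"), key="train/real_1",
                   attack_type=None, subject_id="subject_1"),
            Sample(_load("train/print_1.png"), key="train/print_1",
                   attack_type="print", subject_id="subject_1"),
        ]

    def predict_samples(self, group="dev"):
        # Samples scored for the requested group ("dev" or "eval").
        return [
            Sample(_load("dev/real_2.png"), key="dev/real_2",
                   attack_type=None, subject_id="subject_2"),
        ]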

File list interface

A class with those methods returning the corresponding data can be implemented for each dataset, but an easier way is to use the file list interface, which allows the creation of multiple protocols and groups by editing a few CSV files. This interface is implemented by the bob.pad.base.database.FileListPadDatabase class, which builds on File List Databases (CSV).

The dataset configuration file can then be as simple as:

from bob.pad.base.database import FileListPadDatabase

# First argument: the folder containing the protocol definitions;
# second argument: the name of the protocol to load.
database = FileListPadDatabase("path/to/my_dataset", "my_protocol")

The CSV files must follow this structure and naming (the protocol folder matches the protocol name given to FileListPadDatabase):

my_dataset
|
+-- my_protocol
    |
    +-- train.csv
    +-- dev.csv
    +-- eval.csv

The dev.csv file is the main file: it lists the samples that are scored for the development group. The train.csv file is used when a protocol contains data for training the classifier. The eval.csv file is optional and is used when a protocol contains data for evaluation.

These CSV files should contain at least the path to the raw data (the filename field), an identifier for the subject's identity (the subject field), and an attack type (the attack_type field). The structure of each CSV file should be as follows:

filename,subject,attack_type
path_1,subject_1,
path_2,subject_2,
path_3,subject_1,attack_1
path_4,subject_2,attack_1
...

The attack_type field is used to differentiate bonafide presentations from attacks. An empty field indicates a bonafide sample. Otherwise different attack types can be used (e.g. print, replay, etc.), and can be analyzed separately during evaluation.

Metadata can be shipped within the Samples (e.g. gender, age, session, …) by adding one column per metadata field to the CSV file (see the sketch after this example):

filename,subject,attack_type,gender,age
path_1,subject_1,,M,25
path_2,subject_2,,F,24
path_3,subject_1,attack_1,M,25
path_4,subject_2,attack_1,F,24
...
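
The extra columns then become attributes of the loaded Samples. A small sketch (assuming predict_samples takes the group as described above; values read from the CSV are strings):

from bob.pad.base.database import FileListPadDatabase

database = FileListPadDatabase("path/to/my_dataset", "my_protocol")
sample = database.predict_samples(group="dev")[0]
# gender and age come straight from the extra CSV columns.
print(sample.key, sample.gender, sample.age)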

Checkpoints and Dask

By default, the bob pad run-pipeline command will save the features of each step of the pipeline and the fitted estimators in the output folder. To avoid this, use the --memory option.
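
For instance, to keep all intermediate data in memory instead of writing checkpoints:

$ bob pad run-pipeline --memory --output output_dir ...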

The Dask integration can be enabled by giving a client configuration to the --dask-client option. Two basic Idiap SGE configurations, sge and sge-gpu, are defined in bob.pipelines:

$ bob pad run-pipeline --output output_dir --dask-client sge ...
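
Any other Dask client can be provided through a configuration file. The sketch below assumes (as with other bob configuration options) that the file defines a variable named after the option, dask_client; here a plain local cluster is used:

# dask_client_config.py (hypothetical file name)
from dask.distributed import Client, LocalCluster

# A local cluster with four worker processes; adjust to your machine.
cluster = LocalCluster(n_workers=4)
dask_client = Client(cluster)

$ bob pad run-pipeline --output output_dir --dask-client dask_client_config.py ...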

Note

You may want to read the Dask section in PipelineSimple: Advanced features as well for more in-depth information.