Samples, a way to enhance scikit pipelines with metadata
Some tasks in pattern recognition demand the use of metadata to support the processing (e.g. face cropping, audio segmentation). To support scikit-learn based estimators with such requirements, this package provides two mechanisms:

- A Sample layer that wraps your input data and allows you to append metadata to it.
- A wrapper class (SampleWrapper) that interplays between Sample objects and your estimator.
What is a Sample?

A Sample is a simple container that wraps a data-point. The example below shows how it can be used to wrap a numpy.array.
>>> # by convention, we import bob.pipelines as mario, because mario works with pipes ;)
>>> import bob.pipelines as mario
>>> import numpy as np
>>> data = np.array([1, 3])
>>> sample = mario.Sample(data)
>>> sample
Sample(data=array([1, 3]))
>>> sample.data is data
True
Sample and metadata

Metadata can be added as keyword arguments to Sample, like:
>>> sample = mario.Sample(data, gender="Male")
>>> sample
Sample(data=array([1, 3]), gender='Male')
>>> sample.gender
'Male'
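Any number of metadata attributes can be attached in one call. In the hypothetical snippet below, subject_id is an arbitrary name chosen for illustration, not an attribute the library treats specially:

>>> sample = mario.Sample(data, gender="Male", subject_id=17)
>>> sample.subject_id
17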
Transforming Samples
Imagine that we have the following transformer that requires some metadata to actually work:
>>> from sklearn.base import TransformerMixin, BaseEstimator
>>>
>>> class MyTransformer(TransformerMixin, BaseEstimator):
...     def transform(self, X, sample_specific_offsets):
...         # add a per-sample offset to each data point
...         return np.array(X) + np.array(sample_specific_offsets)
...
...     def fit(self, X):
...         # stateless transformer: nothing to learn; return self
...         # per scikit-learn convention
...         return self
...
...     def _more_tags(self):
...         return {"requires_fit": False}
>>>
>>>
>>> # Creating X: 3 samples, 2 features
>>> X = np.zeros((3, 2))
>>> # 3 offsets: one for each sample
>>> offsets = np.arange(3).reshape((3, 1))
>>> transformer = MyTransformer()
>>>
>>> transformer.transform(X, offsets)
array([[0., 0.],
       [1., 1.],
       [2., 2.]])
While this transformer works well by itself, it can’t be used by sklearn.pipeline.Pipeline:
>>> from sklearn.pipeline import make_pipeline
>>> pipeline = make_pipeline(transformer)
>>> pipeline.transform(X, offsets)
Traceback (most recent call last):
...
TypeError: _transform() takes 2 positional arguments but 3 were given
To address this issue, SampleWrapper can be used. This class wraps other estimators; it accepts samples as input and passes the data, together with the metadata inside the samples, to the wrapped estimator:
>>> # construct a list of samples from the data we had before
>>> samples = [mario.Sample(x, offset=o) for x, o in zip(X, offsets)]
>>> samples[1]
Sample(data=array([0., 0.]), offset=array([1]))
Now we need to tell SampleWrapper to pass the offset inside each sample as an extra argument to our transformer, under the name sample_specific_offsets. This is accommodated by the transform_extra_arguments parameter, which accepts a list of tuples that map sample metadata to arguments of the transformer:
>>> transform_extra_arguments=[("sample_specific_offsets", "offset")]
>>> sample_transformer = mario.SampleWrapper(transformer, transform_extra_arguments)
>>> transformed_samples = sample_transformer.transform(samples)
>>> # transformed values will be stored in sample.data
>>> np.array([s.data for s in transformed_samples])
array([[0., 0.],
       [1., 1.],
       [2., 2.]])
Note that wrapped estimators accept samples as input and return samples. Also, they keep the sample’s metadata around in transformed samples.
>>> transformed_samples[1].data
array([1., 1.])
>>> transformed_samples[1].offset # the `offset` metadata is available here too.
array([1])
Now that our transformer is wrapped, we can also use it inside a pipeline:
>>> sample_pipeline = make_pipeline(sample_transformer)
>>> np.array([s.data for s in sample_pipeline.transform(samples)])
array([[0., 0.],
       [1., 1.],
       [2., 2.]])
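Wrapped transformers chain like any other scikit-learn estimator. As a small sketch (assuming, as shown above, that the offset metadata travels along with each sample), reusing the same wrapped transformer twice applies the offsets twice:

>>> two_step = make_pipeline(sample_transformer, sample_transformer)
>>> np.array([s.data for s in two_step.transform(samples)])
array([[0., 0.],
       [2., 2.],
       [4., 4.]])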
Delayed Sample
Sometimes, keeping several samples in memory and transferring them over the network can be very memory- and bandwidth-demanding. For these cases, there is DelayedSample.

A DelayedSample acts like a Sample, but its data attribute is implemented as a function that loads the respective data from its permanent storage representation. To create a DelayedSample, you pass a load() function that, when called without any parameters, loads and returns the required data.

Below is an example of how to use DelayedSample.
>>> def load():
...     # load data (usually from disk) and return it
...     print("Loading data from disk!")
...     return np.zeros((2,))
>>> delayed_sample = mario.DelayedSample(load, metadata=1)
>>> delayed_sample
DelayedSample(metadata=1)
As soon as you access the .data attribute, the data is loaded and returned:
>>> delayed_sample.data
Loading data from disk!
array([0., 0.])
DelayedSample can be used instead of Sample transparently:
>>> from functools import partial
>>> def load_ith_data(i):
...     return np.zeros((2,)) + i
>>>
>>> delayed_samples = [mario.DelayedSample(partial(load_ith_data, i), offset=[i]) for i in range(3)]
>>> np.array([s.data for s in sample_pipeline.transform(delayed_samples)])
array([[0., 0.],
       [2., 2.],
       [4., 4.]])
Note

Actually, SampleWrapper always returns DelayedSamples. This becomes useful when the returned data is not used. We will see that happening in Checkpointing.
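One way to check this (assuming the note applies to the results computed earlier) is to inspect the type of the transformed samples from the previous section:

>>> isinstance(transformed_samples[1], mario.DelayedSample)
True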
Sample Set
A SampleSet, as the name suggests, represents a set of samples. Such a set can, for example, hold the samples that belong to a class.

Below is a snippet showing how to use SampleSet.
>>> sample_sets = [
...     mario.SampleSet(samples, class_name="A"),
...     mario.SampleSet(delayed_samples, class_name="B"),
... ]
>>> sample_sets[0]
SampleSet(samples=[Sample(data=array([0., 0.]), offset=array([0])), Sample(data=array([0., 0.]), offset=array([1])), Sample(data=array([0., 0.]), offset=array([2]))], class_name='A')
SampleWrapper works transparently with SampleSets as well. It will transform each sample inside and return the same SampleSets with new data.
>>> transformed_sample_sets = sample_pipeline.transform(sample_sets)
>>> transformed_sample_sets[0].samples[1]
DelayedSample(offset=array([1]))
>>> transformed_sample_sets[0].samples[1].data
array([1., 1.])
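Since the SampleSets themselves are carried through, metadata attached to the set (class_name here) should remain accessible on the transformed result as well:

>>> transformed_sample_sets[0].class_name
'A'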