Unveiling Synthetic Faces:
How Synthetic Datasets Can Expose Real Identities

¹Idiap Research Institute, ²EPFL, ³UNIL

NeurIPS 2024 New Frontiers in Adversarial Machine Learning Workshop

Sample face images from the training data of generative models (first row) and the corresponding leaked images found in different state-of-the-art synthetic face recognition datasets (second row).

Summary

Synthetic data generation is gaining increasing popularity in different computer vision applications. Existing state-of-the-art face recognition models are trained using large-scale face datasets, which are crawled from the Internet and raise privacy and ethical concerns. To address such concerns, several works have proposed generating synthetic face datasets to train face recognition models. However, these methods depend on generative models, which are trained on real face images. In this work, we design a simple yet effective membership inference attack to systematically study if any of the existing synthetic face recognition datasets leak any information from the real data used to train the generator model. We provide an extensive study on 6 state-of-the-art synthetic face recognition datasets, and show that in all these synthetic datasets, several samples from the original real dataset are leaked. To our knowledge, this paper is the first work which shows the leakage from training data of generator models into the generated synthetic face recognition datasets. Our study demonstrates privacy pitfalls in synthetic face recognition datasets and paves the way for future studies on generating responsible synthetic face datasets.

MIS: Membership Inference Attack against Synthetic Datasets

Synthetic face recognition datasets are often generated using a generator model. An important question is therefore whether any image in a generated synthetic face recognition dataset contains important information from the training dataset that was used to train the face generator model in the first place.


Schematic diagram of data leakage from the generator’s training data into the generated synthetic face recognition dataset.


We take an exhaustive search approach and compare every possible pair of images drawn from the synthetic dataset and the training dataset of the generator model. To this end, we use an off-the-shelf face recognition model to extract an embedding from each face image, and then compare the embeddings of every cross-dataset pair. Finally, we sort the pairs by embedding similarity and take the top-k pairs for visual comparison of the images.
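The search described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes the embeddings have already been extracted by some face recognition model and are given as lists of non-zero float vectors, and it assumes cosine similarity as the similarity measure (the text does not name the metric). The function names `l2_normalize` and `top_k_similar_pairs` are hypothetical.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_similar_pairs(synthetic_embeddings, training_embeddings, k=10):
    """Exhaustively score every (synthetic, training) pair and keep the k most similar.

    Returns a list of (synthetic_index, training_index, cosine_similarity)
    tuples, sorted from most to least similar.
    """
    synth = [l2_normalize(v) for v in synthetic_embeddings]
    train = [l2_normalize(v) for v in training_embeddings]

    # Generator over all cross-dataset pairs; the score comes first so that
    # heapq.nlargest ranks by similarity.
    scored = (
        (sum(a * b for a, b in zip(s, t)), i, j)
        for i, s in enumerate(synth)
        for j, t in enumerate(train)
    )
    best = heapq.nlargest(k, scored)
    return [(i, j, score) for score, i, j in best]


# Toy data (assumed, for illustration only): the third synthetic embedding
# points in the same direction as the second training embedding.
synthetic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
training = [[0.0, 0.0, 1.0], [3.0, 4.0, 0.0]]

for synth_idx, train_idx, score in top_k_similar_pairs(synthetic, training, k=2):
    print(f"synthetic[{synth_idx}] vs training[{train_idx}]: {score:.3f}")
```

The pure-Python loop keeps the sketch dependency-free. At the scale of real datasets, the same computation would more plausibly be done as a batched matrix product over normalized embedding matrices. The top-k pairs returned here are candidates only and still require the visual comparison described above.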


Sample Leaked Images

Reproducibility: Source Code and Meta-Data

The source code of our experiments as well as meta-data for sample leaked images will be released soon.

BibTeX


  @inproceedings{neurips2024unveiling,
    author    = {Hatef Otroshi Shahreza and S{\'e}bastien Marcel},
    title     = {Unveiling Synthetic Faces: How Synthetic Datasets Can Expose Real Identities},
    booktitle = {NeurIPS Workshop on New Frontiers in Adversarial Machine Learning},
    year      = {2024}
  }