Improving Biometric Privacy with Generative AI

Face recognition is widely used today, from unlocking phones to enhancing surveillance systems. However, this technology often relies on large datasets of internet photos collected without consent, raising privacy and ethical concerns. To address this, researchers at Idiap have developed a synthetic (fake but realistic) face dataset, offering a privacy-friendly alternative without compromising model performance.

Face recognition models, typically powered by deep neural networks, depend on learning from vast online image datasets. The use of such datasets introduces ethical challenges, as they include personal information without explicit consent. To mitigate this issue, Hatef Otroshi Shahreza and Sébastien Marcel have created a dataset based on highly realistic synthetic faces generated using AI. This approach avoids the ethical and privacy concerns associated with real individuals' identities. The method is designed to be comprehensive, providing a wide variety of facial variations, both across different individuals and within images of the same synthetic person.

To build the dataset, the researchers first used an optimization technique to generate synthetic face images representing clearly distinct individuals, one image per synthetic identity. They then employed a foundation model as a face generator to introduce random variations, such as different poses, lighting conditions, or expressions, producing multiple images for each synthetic identity.
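Conceptually, the first step resembles a point-packing problem on the face-embedding hypersphere: identity embeddings are pushed apart so that no two synthetic people look alike. The sketch below is a minimal illustration of that idea only, assuming a simple repulsion loss; the dimensions, loss, and optimizer settings are placeholders and not the optimization actually used in HyperFace.

```python
# Illustrative sketch (not the authors' implementation): spread a set of unit-norm
# "identity" embeddings as far apart as possible on a hypersphere, so that each
# point can later be rendered as a visually distinct synthetic person by an
# embedding-conditioned face generator. All sizes and the loss are assumptions.
import torch
import torch.nn.functional as F

num_identities = 1000   # hypothetical number of synthetic identities
embedding_dim = 512     # typical face-embedding dimensionality (an assumption)

# Optimize the embedding directions directly.
points = torch.randn(num_identities, embedding_dim, requires_grad=True)
optimizer = torch.optim.Adam([points], lr=0.01)
off_diagonal = ~torch.eye(num_identities, dtype=torch.bool)

for step in range(2000):
    optimizer.zero_grad()
    normed = F.normalize(points, dim=1)      # project onto the unit hypersphere
    cosine = normed @ normed.T               # pairwise cosine similarities
    # Softly penalize the most similar pairs so no two identities collapse together.
    loss = torch.logsumexp(cosine[off_diagonal], dim=0)
    loss.backward()
    optimizer.step()

identity_embeddings = F.normalize(points.detach(), dim=1)
# Each identity point would then be rendered into many images with random pose,
# lighting and expression changes by the generative foundation model.
```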

Because the faces are synthetic, the dataset eliminates the risk of exposing real individuals' identities. Beyond privacy protection, it has also proven highly effective for training face recognition models: compared with models trained on other synthetic face datasets, those trained on this dataset achieved state-of-the-art accuracy, matching or surpassing existing benchmarks. This underscores the dataset's effectiveness in providing realistic, high-quality training data.

Moreover, the methodology (named HyperFace) is highly adaptable. Researchers can adjust parameters such as the number of identities or the number of images per identity, creating datasets tailored to specific research needs. Rigorous verification confirms that the dataset contains exclusively synthetic images, ensuring that no real human faces, for instance ones inadvertently drawn from other datasets, are present.
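As a rough illustration of how such size parameters might be exposed, the sketch below uses a hypothetical generate_face function standing in for the embedding-conditioned generator; neither the function nor its arguments are part of the HyperFace code, and the configuration values are arbitrary examples.

```python
# Hypothetical sketch of a parameterized generation loop; generate_face and its
# augmentation arguments are placeholders, not the HyperFace interface.
from dataclasses import dataclass
import random

@dataclass
class DatasetConfig:
    num_identities: int = 10_000      # how many distinct synthetic people
    images_per_identity: int = 50     # how many varied images per person

def build_dataset(config: DatasetConfig, identity_embeddings, generate_face):
    """Produce (identity_id, image) pairs from pre-optimized identity embeddings."""
    dataset = []
    for identity_id in range(config.num_identities):
        embedding = identity_embeddings[identity_id]
        for _ in range(config.images_per_identity):
            # Random variation parameters stand in for pose/lighting/expression changes.
            image = generate_face(
                embedding,
                pose=random.uniform(-30, 30),
                lighting=random.random(),
                expression=random.choice(["neutral", "smile"]),
            )
            dataset.append((identity_id, image))
    return dataset
```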

The HyperFace method and its resulting dataset offer a strong answer to privacy concerns in face recognition training: they enable effective model training without depending on real people's photos, advancing both technological progress and ethical standards while safeguarding identities against potential data breaches.

This study will be presented at the 13th International Conference on Learning Representations (ICLR) at the end of April.

 

Reference:

Shahreza, H. O., & Marcel, S. (2025). HyperFace: Generating synthetic face recognition datasets by exploring face embedding hypersphere. 13th International Conference on Learning Representations (ICLR).