HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere

Summary

Face recognition datasets are often collected by crawling the Internet without individuals' consent, raising ethical and privacy concerns. Generating synthetic datasets for training face recognition models has emerged as a promising alternative. However, the generation of synthetic datasets remains challenging as it requires adequate inter-class and intra-class variations. While advances in generative models have made it easier to increase intra-class variations in face datasets (such as pose, illumination, etc.), generating sufficient inter-class variation is still a difficult task. In this paper, we formulate the dataset generation as a packing problem on the embedding space (represented on a hypersphere) of a face recognition model and propose a new synthetic dataset generation approach, called HyperFace. We formalize our packing problem as an optimization problem and solve it with a gradient descent-based approach. Then, we use a conditional face generator model to synthesize face images from the optimized embeddings. We use our generated datasets to train face recognition models and evaluate the trained models on several real benchmark datasets. Our experimental results show that models trained with HyperFace achieve state-of-the-art performance in training face recognition using synthetic datasets.

HyperFace Dataset Generation

By representing a synthetic dataset on the identity hypersphere as a set of reference embeddings, we can ask: “How should reference embeddings cover the identity hypersphere?” To answer this question, recall that the distances between reference embeddings indicate the inter-class variation in the synthetic face recognition dataset. Since we would like high inter-class variation in the generated dataset, we need to maximize the distances between reference embeddings. We solve this optimization problem with an iterative approach based on gradient descent and then use a face generator model to generate the HyperFace synthetic dataset.
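One way to write this packing objective down, offered as a sketch rather than the paper's exact formulation: let x_1, ..., x_n be the reference embeddings on the unit hypersphere S^{d-1}, and let d(., .) be a distance between embeddings. The symbols, the max-min form, and the choice of cosine distance are illustrative assumptions that are not stated in the text above.

```latex
\max_{x_1, \dots, x_n \in \mathbb{S}^{d-1}} \; \min_{i \neq j} \; d(x_i, x_j),
\qquad d(x_i, x_j) = 1 - x_i^{\top} x_j .
```

Under this reading, raising the smallest pairwise distance increases the separation of the two closest synthetic identities, which is the pair that limits inter-class variation.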

Block diagram of HyperFace Dataset Generation: We start from randomly synthesized face images and extract their embeddings using a pretrained face recognition model. The extracted embeddings are normalized and used as initial points in our HyperFace optimization. The HyperFace optimization tries to increase the inter-class variation for synthetic identities on the manifold of the face recognition model over the hypersphere, using a regularization term. The resulting points are then used by a face generator model, which can generate synthetic face images from the embeddings.
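The normalization and optimization stages of this pipeline can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: random vectors stand in for embeddings extracted by the face recognition model, the update is a simple heuristic (each point steps away from its nearest neighbour and is re-projected onto the sphere), and the regularizer is assumed to pull each point toward its nearest embedding in a gallery set. The names `optimize_references`, `gallery`, `lr`, and `reg` are hypothetical.

```python
import numpy as np

def normalize(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def min_cosine_distance(x):
    """Smallest cosine distance between any two distinct rows of x."""
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
    return float(1.0 - sim.max())

def optimize_references(init, gallery, steps=300, lr=0.05, reg=0.1):
    """Heuristic projected update loop (illustrative, not the paper's solver).

    init    : (n, d) initial embeddings, e.g. from randomly synthesized faces
    gallery : (m, d) embeddings assumed to lie on the model's manifold
    reg     : weight of the assumed regularizer pulling points toward the gallery
    """
    x = normalize(init)
    g = normalize(gallery)
    for _ in range(steps):
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)
        nearest_ref = x[sim.argmax(axis=1)]        # closest other reference point
        nearest_gal = g[(x @ g.T).argmax(axis=1)]  # closest gallery embedding
        # Step away from the nearest reference and slightly toward the nearest
        # gallery point, then re-project onto the hypersphere.
        x = normalize(x - lr * nearest_ref + lr * reg * nearest_gal)
    return x

rng = np.random.default_rng(0)
# Stand-ins for embeddings from a pretrained face recognition model (assumption).
center = rng.normal(size=(1, 8))
gallery = rng.normal(size=(200, 8))
init = center + 0.1 * rng.normal(size=(32, 8))  # tightly clustered start

before = min_cosine_distance(normalize(init))
refs = optimize_references(init, gallery)
after = min_cosine_distance(refs)
print(f"min cosine distance: {before:.4f} -> {after:.4f}")
```

In the pipeline described above, the rows of `refs` would then be passed to the face generator model to synthesize images. That step is omitted here because it depends on the specific generator.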

Comparison with Previous Synthetic Datasets

In the following table, we compare the performance of face recognition models trained with our generated datasets against models trained with all publicly available versions (particularly the larger-scale ones) of synthetic datasets in the literature. As the results in this table show, our generated datasets achieve performance competitive with large-scale synthetic datasets in the literature.

Comparison with previous synthetic datasets

Comparison of recognition performance of face recognition models trained with the largest available versions of different synthetic datasets as well as a real dataset (i.e., CASIA-WebFace).

Reproducibility: Source Code, Datasets, and Models

The source code of our experiments as well as different versions of the HyperFace dataset are publicly available:

BibTeX


  @inproceedings{shahreza2025hyperface,
    title={HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere},
    author={Hatef Otroshi Shahreza and S{\'e}bastien Marcel},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025}
  }