Face Reconstruction from Face Embeddings using Adapter to a Face Foundation Model

Summary

Face recognition systems extract embedding vectors from face images and use these embeddings to verify or identify individuals. A face reconstruction attack (also known as template inversion) reconstructs face images from face embeddings and uses the reconstructed images to gain access to a face recognition system. In this paper, we propose to use a face foundation model to reconstruct face images from the embeddings of a black-box face recognition model. The foundation model is trained on 42M images to generate face images from the facial embeddings of a fixed face recognition model. We propose an adapter that translates target embeddings into the embedding space of the foundation model. The generated images are evaluated on different face recognition models and different datasets, demonstrating the effectiveness of our method in translating embeddings of different face recognition models. We also evaluate the transferability of reconstructed face images when attacking different face recognition models. Our experimental results show that our method outperforms previous reconstruction attacks against face recognition models.

Face Reconstruction Attack using Foundation Models

We focus on face reconstruction attacks and propose a new method to reconstruct face images from the embeddings of different face recognition models using a foundation model. We use Arc2Face, a recently proposed face foundation model capable of generating identity-consistent images given an embedding from a specific face recognition model, F_FM. The Arc2Face pipeline includes a CLIP model modified to accept face embeddings from F_FM and a Stable Diffusion UNet decoder. This pipeline is trained end-to-end on an upscaled version of WebFace42M, resulting in high-quality, identity-consistent images. Because it is trained specifically on those embeddings, this model can be readily used to invert the embeddings of F_FM. The face recognition model F_FM used to train Arc2Face has a ResNet100 backbone and is trained on WebFace42M.

To attack any other face recognition system, one would need to redo the entire process of adapting a large model on a large-scale dataset, which is impractical. In this work, we propose a simple approach to adapt this model for reconstructing images from the embeddings of any face recognition system (F_victim). The main idea is to develop an adapter module that transforms the leaked embeddings from the embedding space of F_victim into the embedding space of F_FM. The figure below depicts our face reconstruction attack.

Block diagram for our face reconstruction attack.

Essentially, we propose to learn a mapping M that maps the leaked embeddings to the embedding space of F_FM. We implement this mapping function, M, as a simple linear layer. The parameters of this adapter layer can be learned from a set of embedding pairs extracted from the same face images by F_victim and F_FM, with Mean Squared Error (MSE) as the loss function.
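
Written out, one plausible form of this objective is given below. The notation is an assumption made here for illustration: x_i denotes the i-th training face image and N the number of embedding pairs; neither symbol appears in the text above, and the exact normalization used by the authors may differ.

```latex
\mathcal{L}_{\mathrm{MSE}}
  = \frac{1}{N} \sum_{i=1}^{N}
    \left\lVert M\!\left(F_{\mathrm{victim}}(x_i)\right) - F_{\mathrm{FM}}(x_i) \right\rVert_2^2
```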

Training process of the adapter module.
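
To make the adapter training concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes synthetic random vectors in place of real F_victim and F_FM embeddings, uses illustrative embedding dimensions (64 and 128) that are not taken from the source, and fits the linear layer with closed-form least squares, which minimizes the MSE of a linear map exactly, rather than with whatever optimizer the experiments used. The function names are hypothetical.

```python
import numpy as np


def fit_linear_adapter(victim_emb, fm_emb):
    """Fit M(e) = e @ W + b by minimising the MSE between M(victim_emb) and fm_emb."""
    n = victim_emb.shape[0]
    # Append a column of ones so the bias is learned jointly with the weights.
    X = np.hstack([victim_emb, np.ones((n, 1))])
    # Least squares is the exact minimiser of the squared error for a linear map.
    solution, *_ = np.linalg.lstsq(X, fm_emb, rcond=None)
    W, b = solution[:-1], solution[-1]
    return W, b


def apply_adapter(emb, W, b):
    """Map embeddings from the victim space into the foundation-model space."""
    return emb @ W + b


def mse(a, b):
    return float(np.mean((a - b) ** 2))


# Synthetic stand-ins for paired embeddings of the same images (assumption).
rng = np.random.default_rng(0)
n_pairs, d_victim, d_fm = 2000, 64, 128  # illustrative sizes only
victim_emb = rng.normal(size=(n_pairs, d_victim))
true_map = rng.normal(size=(d_victim, d_fm)) / np.sqrt(d_victim)
fm_emb = victim_emb @ true_map + 0.01 * rng.normal(size=(n_pairs, d_fm))

# Fit on the first 1500 pairs, evaluate on the remaining 500.
W, b = fit_linear_adapter(victim_emb[:1500], fm_emb[:1500])
held_out_mse = mse(apply_adapter(victim_emb[1500:], W, b), fm_emb[1500:])
print(f"held-out MSE: {held_out_mse:.6f}")
```

In an actual attack, the output of `apply_adapter` for a leaked embedding would be passed to the foundation model in place of an F_FM embedding; that generation step is omitted here.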

Sample Reconstructed Images

Below, you can see sample reconstructed face images from embeddings of different face recognition models using our adapter module:

Sample face images from the dataset (first row) and their corresponding reconstructed face images using various methods. The values show the cosine similarity between embeddings of original and reconstructed face images.
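
The cosine similarity reported under each image can be computed as in the following sketch. The two vectors are made-up toy values, not real embeddings, and the helper name is hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-ins for the embeddings of an original and a reconstructed image.
original = np.array([0.2, -0.5, 0.1, 0.7])
reconstructed = np.array([0.25, -0.4, 0.0, 0.65])
score = cosine_similarity(original, reconstructed)
print(f"cosine similarity: {score:.3f}")
```

A value close to 1 means the face recognition model places the reconstructed image near the original identity in its embedding space.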

Reproducibility: Source Code

The source code of our experiments will be available soon.

BibTeX


        @article{face_adapter,
          author    = {Hatef Otroshi Shahreza and Anjith George and S{\'e}bastien Marcel},
          title     = {Face Reconstruction from Face Embeddings using Adapter to a Face Foundation Model},
          journal   = {arXiv preprint arXiv:2411.03960},
          year      = {2024}
        }

Sample face images from the LFW dataset (first row) and their reconstructed versions by our attack (second row) based on ArcFace embeddings. The values are cosine similarity of embeddings of the original and reconstructed face image.