Does this man look familiar? This is Robert Williams, who was misidentified by a police facial recognition system and had to spend a day under arrest. As this incident gets thrown around in the media, it’s important to remember that it’s easy to criticize a technology while having limited or nonexistent information about how it works. Countless media sources have criticized every component of the technology, while the actual algorithm has remained enigmatic.
In this blog post, I will go through a state-of-the-art face recognition algorithm in a way that caters to both experienced professionals and readers with no background in the field. I hope that this post helps you understand the algorithm that is being criticized all over the news right now and helps you bring much-needed information into discussions about this controversy.
What is Face Recognition?
When I say the words “Face Recognition”, a variety of visuals should come to mind, many of which you may remember from James Bond or Mission Impossible movies where the protagonist’s team has to change the face database to allow the protagonist into the secret bunker. Or maybe you think of a country like China or North Korea using face recognition technology to violate people’s privacy.
The official definition of face recognition strips all of the pop culture away. It is simply the detection and classification of a person’s face. This implies that a facial recognition system should have two components: first detecting a face in an image, then finding the identity of that face.
- Face detection is very similar to object detection, except that the entities of interest are people’s faces rather than everyday objects.
- Face identification is the problem of matching a detected face to an identification image in a pre-existing database. This is the same database that the hackers change in every spy movie.
Face Detection
To understand how face detection works, let’s go through the state-of-the-art algorithm, RetinaFace. Casual readers, don’t run away at the mention of a paper. In this blog, I’ll do my best to make the algorithm as intuitive as possible, while also refraining from the oversimplification that has plagued media coverage.
The RetinaFace algorithm is what the lingo calls an end-to-end, or single-stage, detector. If you’re familiar with object detection strategies, it is similar to the SSD or YOLO architectures.
Output Details
The RetinaFace algorithm outputs three pieces of information about the face detected:
- The bounding box of the face, denoted by the bottom left corner of the box and its width and height.
- Five facial landmarks denoting the locations of the eyes, nose, and mouth.
- A dense 3D mapping of points, very similar to the one your cell phone uses to recognize you for a feature like Face ID on iPhones.
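To make the three outputs concrete, here is a minimal sketch of how one detection could be stored. The `FaceDetection` class and its field names are hypothetical and are not RetinaFace's actual API. The split of the five landmarks into two eyes, a nose, and two mouth corners is my reading of the list above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


@dataclass
class FaceDetection:
    """Hypothetical container for one detected face (illustration only)."""
    # Bounding box as (x, y, width, height), where (x, y) is one corner.
    box: Tuple[float, float, float, float]
    # Five (x, y) landmarks: assumed to be two eyes, nose, two mouth corners.
    landmarks: List[Point2D]
    # Dense 3D points describing the face surface; may be empty.
    dense_points: List[Point3D] = field(default_factory=list)

    def corners(self) -> Tuple[float, float, float, float]:
        """Convert the box to (x_min, y_min, x_max, y_max)."""
        x, y, w, h = self.box
        return (x, y, x + w, y + h)


detection = FaceDetection(
    box=(10.0, 20.0, 30.0, 40.0),
    landmarks=[(18.0, 30.0), (32.0, 30.0), (25.0, 40.0), (19.0, 50.0), (31.0, 50.0)],
)
print(detection.corners())
```

The corner form is handy later on, since cropping the face out of the image needs the box's extent rather than its width and height.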
Feature Extraction

Like most modern computer vision algorithms, RetinaFace uses deep neural networks as feature extractors. More specifically, RetinaFace uses the ResNet architecture along with a Feature Pyramid Network (FPN) to produce a rich feature representation of the image.
Intuitively, you can imagine these features capturing different levels of abstraction in the image. In the face detection realm, this is equivalent to early features encoding edges, mid-level features encoding facial features such as eyes, mouths, and noses, and high-level features encoding the faces themselves. The FPN simply allows the model to make use of both the high-level and low-level features, which greatly aids in detecting smaller faces in images.
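The idea of combining high-level and low-level features can be sketched in a few lines. This is a toy illustration, not the real FPN: an actual FPN also applies learned convolutions to each map before and after merging, which I leave out here. The function names `upsample2x` and `merge_top_down` are made up for this example.

```python
from typing import List

Grid = List[List[float]]


def upsample2x(fmap: Grid) -> Grid:
    """Nearest-neighbour upsampling: every cell becomes a 2x2 block."""
    out: Grid = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_top_down(coarse: Grid, fine: Grid) -> Grid:
    """Add the upsampled coarse (high-level) map to the fine (low-level) map."""
    up = upsample2x(coarse)
    return [
        [u + f for u, f in zip(up_row, fine_row)]
        for up_row, fine_row in zip(up, fine)
    ]


coarse = [[10.0]]                  # 1x1 map: "there is a face somewhere here"
fine = [[1.0, 2.0], [3.0, 4.0]]    # 2x2 map: finer spatial detail
print(merge_top_down(coarse, fine))
```

Each cell of the merged map carries both the coarse "what" signal and the fine "where" signal, which is why small faces become easier to pick out.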
Training
Training is the process by which a randomly initialized network is taught to perform its task. The process is similar to teaching a child to do well on a test. The child is given information about the topic and then is given some sort of evaluation test to see how well they did. The training of deep neural networks is similar, except that the network is given labeled data (in this case, images where the faces are labeled) and the evaluation is done using a loss function. For a more detailed understanding of deep neural networks, see my blog post.
Often, the training process for deep learning models is the most important part. Entire papers have been written about the huge improvement a new loss function provides. The RetinaFace algorithm is no different. Let’s examine the loss function used to train RetinaFace.
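For reference, the multi-task loss can be written roughly as follows. This is the form as I recall it from the RetinaFace paper, so treat the exact notation as an assumption. Here $p_i$ is the predicted probability that anchor $i$ contains a face, $p_i^*$ is 1 for a labeled face and 0 otherwise, $t_i$ and $l_i$ are the predicted box and landmark coordinates (starred versions are the labels), and the $\lambda$ values are weights balancing the terms.

```latex
L = L_{cls}(p_i, p_i^*)
  + \lambda_1 \, p_i^* \, L_{box}(t_i, t_i^*)
  + \lambda_2 \, p_i^* \, L_{pts}(l_i, l_i^*)
  + \lambda_3 \, p_i^* \, L_{pixel}
```

Note that the last three terms are multiplied by $p_i^*$, so they only count for locations where a face is actually labeled.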

Let’s break this function down term by term.
- The first component, face classification, simply penalizes the model for saying that there is a face at a location where no face exists in the image, or for missing a face that is there.
- “Face box regression” is a fancy term for the distance between the bounding box coordinates of the predicted face and the coordinates of the labeled face. Specifically, this distance is calculated using what’s called a smooth L1 loss, which is quadratic for small errors and linear for large ones; you can see its graph below.
- Facial landmark regression is similar to the box regression loss, except that instead of finding the distance between bounding boxes, it finds the distance between the five predicted facial landmarks and the labeled ones.
- The final loss, for the dense 3D points, is a bit more complex and is beyond the scope of this blog post. Since these features are not labeled, we need some way to help the model learn them. To do so, the model’s output features are used to recreate the face, and then the recreated face is compared with the face in the image.
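The smooth L1 loss used for the box and landmark terms is easy to write down. Below is a minimal sketch. The `beta` parameter, which sets where the loss switches from quadratic to linear, follows a common convention and is an assumption on my part, as is summing the per-coordinate losses in `box_loss`.

```python
from typing import Sequence


def smooth_l1(error: float, beta: float = 1.0) -> float:
    """Quadratic for |error| < beta, linear beyond it."""
    magnitude = abs(error)
    if magnitude < beta:
        return 0.5 * magnitude * magnitude / beta
    return magnitude - 0.5 * beta


def box_loss(predicted: Sequence[float], labeled: Sequence[float]) -> float:
    """Sum of smooth L1 losses over the box coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, labeled))


# A box that is off by 2 pixels in height only.
print(box_loss((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 12.0)))
```

The quadratic part keeps gradients small when the prediction is already close, while the linear part stops one badly wrong box from dominating the training signal.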
Face Identification
Now that we have detailed information about the faces in an image, the next task in a face recognition system is identifying the face against a database of ID images.
Once again, we’ll look at a state-of-the-art paper, ArcFace. Before going into the details of this algorithm, we need to go over the fundamentals of face identification. Face identification uses a class of networks called Siamese Networks. Here is an excerpt from my previous blog post about them:
The core intuition behind Siamese Networks is to try and learn a representation of the face. This representation is similar to the information humans store in their minds about the characteristics of a face, such as the size of the facial features, color of skin, eye color, etc. A human understands that if they see another face with similar features, then there is a high chance that the new face belongs to the same person. On the other hand, if the new face does not match any of the faces they have previously seen, then the human forms a representation of the new face to store in memory.
This is exactly how Siamese Networks function. A function transforms the input image of the face into a vector which contains a representation of the face’s features. We then want this vector to be similar to the vector of another image of the same face, and very different from the vector of a different face.
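What "similar" means for two vectors can be made concrete with cosine similarity, one common way to compare embeddings. This is a toy sketch: the three short vectors below are made-up numbers standing in for real embeddings, and the variable names are only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-number "embeddings"; real ones have hundreds of entries.
me_photo_1 = [0.9, 0.1, 0.2]
me_photo_2 = [0.8, 0.2, 0.1]
someone_else = [0.1, 0.9, 0.3]

print(cosine_similarity(me_photo_1, me_photo_2))    # high: same person
print(cosine_similarity(me_photo_1, someone_else))  # lower: different person
```

A well-trained model is one where this score is reliably high for two photos of the same person and low otherwise.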
In a nutshell, the model learns how to extract the important features of a face which allows it to be distinguished from other faces. Once the feature mapping is obtained, it can be compared to the feature mappings of other faces in a database to be matched.
That blog post goes much deeper into the technical details of face identification, and includes the code for a full implementation.
Once again, the training process of Siamese networks is where the magic happens. Siamese networks are first trained as a full image classification model on cropped faces, where the model learns to directly classify a person’s face from the image, without any identification image. This requires a pre-defined list of identities in the dataset used, which is common in most face identification datasets.
Output Details
A face identification model outputs a feature vector that encodes the facial features as a list of numbers, usually 256 or 512 of them. Note that this vector is different from the dense features that the RetinaFace algorithm outputs, as these features are used specifically for comparing two faces.
Feature Extraction
Face identification models use standard, state-of-the-art image classification models. The ArcFace algorithm uses the ResNet architecture.
Training
Like RetinaFace, the crux of the ArcFace algorithm comes from the way it’s trained. As I mentioned earlier, the network is first trained like a normal classification network and is then fine-tuned to output encodings.

Normally, classification networks use Cross-Entropy Loss to output a vector of class probabilities. An interpretation of this class vector is that it outputs the possible classifications for an image, along with how confident the model is about each class. While this is useful for classification tasks, the designers of the ArcFace algorithm point out that there shouldn’t be considerations of uncertainty in face identification, since one face cannot belong to multiple people. To mitigate this, they devise a cosine-based loss with an angular margin, which forces the output class probabilities to be clustered around one class. As can be seen in the image above, this loss uses a margin to “push” the output probability vector closer to a single class, which proves to bring immense improvements in face identification.
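Here is a minimal sketch of how such a margin can work, following the additive angular margin idea as I understand it from the ArcFace paper. The function names are mine, the defaults `s=64.0` and `m=0.5` are the values I recall from the paper, and real implementations add safeguards (for example when the angle plus margin exceeds pi) that I omit.

```python
import math
from typing import List, Sequence


def arcface_logits(cosines: Sequence[float], target: int,
                   s: float = 64.0, m: float = 0.5) -> List[float]:
    """Add an angular margin m to the true class, then scale by s.

    cosines[j] is the cosine between the face embedding and the
    weight vector of class j (both assumed to be unit length).
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))       # guard acos against rounding
        if j == target:
            theta = math.acos(c)
            logits.append(s * math.cos(theta + m))
        else:
            logits.append(s * c)
    return logits


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Standard softmax cross-entropy, computed stably."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.2]               # embedding is closest to class 0
plain = cross_entropy(arcface_logits(cosines, 0, s=10.0, m=0.0), 0)
margin = cross_entropy(arcface_logits(cosines, 0, s=10.0, m=0.5), 0)
print(plain, margin)
```

With the margin, the same embedding gets a higher loss, so the network has to pull it even closer to its own class before training is satisfied. That is what produces the tight clusters.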

Here is a visualization of the feature vectors outputted by the ArcFace architecture on the MNIST dataset, a dataset of handwritten digits. As you can see, the vectors outputted for images of the same digit are clustered closely together, something that is very useful in face identification.
Obtaining Results
The typical flow of a face recognition system is to first obtain the location and features of each face in an image. Then the cropped faces are fed into the face identification model to get a feature vector for each. This vector is then compared to the vectors in an identification database using the Euclidean distance function. The identification vector that is “closest” to the face’s vector tells us the identity of the person in the image.
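The matching step can be sketched as a nearest-neighbour search. This is a minimal illustration under a few assumptions: the database is a plain dictionary of made-up 2-number vectors, and the `max_distance` cutoff for "no match" is something I added for illustration, since a practical system needs a way to say that a face is not in the database at all.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(query: Sequence[float],
             database: Dict[str, Sequence[float]],
             max_distance: float = 1.0) -> Tuple[Optional[str], float]:
    """Return the closest identity, or None if nothing is close enough."""
    best_name: Optional[str] = None
    best_distance = math.inf
    for name, reference in database.items():
        distance = euclidean(query, reference)
        if distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance > max_distance:
        return None, best_distance
    return best_name, best_distance


database = {"alice": [0.0, 0.0], "bob": [3.0, 4.0]}
print(identify([0.1, 0.0], database))    # close to alice
print(identify([10.0, 10.0], database))  # far from everyone
```

Where that cutoff sits matters a great deal in practice: set it too loosely and the system will confidently "match" people who are not in the database.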
For a more concrete example, consider the visualization of the vectors of the MNIST dataset. Let’s assume that the dark blue dots correspond to the vectors for the number 4, and the cyan dots correspond to the number 3. This is analogous to images of my face and images of your face. If I obtain a feature vector from another image, and it lies very close to the dark blue dots, I can be confident that the image is of a 4. Similarly, there would be no doubt in my mind that the image is not a 3, since the vectors for the number 3 lie so far away from the feature vector.
So How Can the Algorithm be Racist?
As you’ve now seen, the training process of any deep learning model is the key to its performance and to the biases it inherits. The issue in the news was related to the algorithm’s low performance when recognizing an African American man, and other articles criticize the algorithms for struggling to recognize women.
The issue is not some “built-in racial bias” as claimed by this article in The Guardian. The issue is simply with the dataset used to train the face recognition models. Think of it this way: if a person has never seen an apple before, or has seen one only once in their life, that person will have trouble recognizing apples in the future, because they haven’t seen enough apples to know what they look like. The solution, when it comes to the person, is simply to show them more apples. This is exactly the case with face recognition models.
All we need to do is come up with more inclusive and balanced datasets for training the models, and all of the bias will disappear.
The Future and Last Remarks
Facial recognition algorithms can be a key tool in many facets of security and surveillance. They can be a valuable addition to any automated surveillance system and can save companies millions of dollars in human labor costs.
For those of you interested in what it takes to implement these algorithms, check out the TensorFlow 2 implementations of RetinaFace and ArcFace.
The legitimate problem with the use of face recognition is the risk of privacy violations. But that is a topic for another day :)
Thanks for reading my blog post! I hope you learned something and will be more informed when talking about face recognition and its flaws. I’d love to discuss this issue in the comments below!