FaceNet paper link: facenet;   Paper walkthrough download (PDF version): paper walkthrough

FaceNet: A Unified Embedding for Face Recognition and Clustering 


    Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

    Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face.

    On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.


    We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible with each other and allow for direct comparison between each other.

1. Introduction

    In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.

    Once this embedding has been produced, the aforementioned tasks become straightforward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
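As a rough illustration of how simple these tasks become once embeddings exist, here is a minimal sketch in plain Python. The helper names (`squared_l2`, `same_person`, `knn_identify`), the toy 2-D embeddings, and `k=3` are assumptions made for this example; the default threshold of 1.1 is borrowed from the Figure 1 caption and is only illustrative.

```python
from collections import Counter


def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def same_person(emb_a, emb_b, threshold=1.1):
    """Verification: accept the pair if its distance is within the threshold."""
    return squared_l2(emb_a, emb_b) <= threshold


def knn_identify(query, gallery, k=3):
    """Recognition: majority vote among the k nearest labelled embeddings.

    `gallery` is a list of (embedding, label) pairs.
    """
    nearest = sorted(gallery, key=lambda item: squared_l2(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings; real FaceNet embeddings are 128-D.
gallery = [([1.0, 0.0], "alice"), ([0.9, 0.1], "alice"), ([0.0, 1.0], "bob")]
print(same_person([1.0, 0.0], [0.9, 0.1]))  # True
print(knn_identify([0.95, 0.05], gallery))  # alice
```

In a real system the threshold would be tuned on held-out data rather than fixed by hand.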

    Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.

    In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed.

    Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.

    As an illustration of the incredible variability that our method can handle see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.

Figure 1. Illumination and Pose invariance. Pose and illumination have been a long standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.


    An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in sections 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.

2. Related Work

    Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

    In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.

    There is a vast corpus of face verification and recognition works. Reviewing it is out of the scope of this paper, so we will only briefly discuss the most relevant recent work.


    The works of [15, 17, 23] all employ a complex system of multiple stages that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.

    Zhenyao et al. [23] employ a deep network to “warp” faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

    Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so-called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.

    Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2] that effectively correspond to a linear transform in the embedding space are employed. Their method does not require explicit 2D/3D alignment. The networks are trained by using a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.

    A similar loss to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.

3. Method

    FaceNet uses a deep convolutional network. We discuss two different core architectures: the Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in section 3.3.


    Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding $f(x)$, from an image $x$ into a feature space $\mathbb{R}^d$, such that the squared distance between all faces, independent of imaging conditions, of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.

Figure 2. Model structure. Our network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.

    Although we did not directly compare to other losses, e.g. the one using pairs of positives and negatives, as used in [14] Eq. (2), we believe that the triplet loss is more suitable for face verification. The motivation is that the loss from [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.

    The following section describes this triplet loss and how it can be learned efficiently at scale.

3.1. Triplet Loss

    The embedding is represented by $f(x) \in \mathbb{R}^d$. It embeds an image $x$ into a d-dimensional Euclidean space. Additionally, we constrain this embedding to live on the d-dimensional hypersphere, i.e. $\|f(x)\|_2 = 1$. This loss is motivated in [19] in the context of nearest-neighbor classification. Here we want to ensure that an image $x_i^a$ (anchor) of a specific person is closer to all other images $x_i^p$ (positive) of the same person than it is to any image $x_i^n$ (negative) of any other person. This is visualized in Figure 3.
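The hypersphere constraint also explains why the distances quoted in the Figure 1 caption range from 0.0 to 4.0: two unit vectors pointing in opposite directions are 2 apart, so their squared distance is 4. A minimal sketch, where `l2_normalize`, the `eps` guard, and the toy 2-D vectors are assumptions made for illustration:

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


e1 = l2_normalize([3.0, 4.0])    # roughly [0.6, 0.8]
e2 = l2_normalize([-3.0, -4.0])  # the antipodal point on the unit circle
print(squared_l2(e1, e1))        # identical embeddings: 0.0
print(squared_l2(e1, e2))        # opposite embeddings: about 4.0
```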

Figure 3. The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

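Thus we want,

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad \forall \left(f(x_i^a), f(x_i^p), f(x_i^n)\right) \in \mathcal{T}, \qquad (1)$$

where $\alpha$ is a margin that is enforced between positive and negative pairs. $\mathcal{T}$ is the set of all possible triplets in the training set and has cardinality $N$.

The loss that is being minimized is then

$$L = \sum_{i}^{N} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+ .$$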
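The triplet loss can be sketched in a few lines of plain Python: each triplet contributes the amount by which it violates the margin constraint, and nothing if the constraint already holds. The function names and toy 2-D embeddings are assumptions for illustration; the margin default of 0.2 is the value reported in section 3.3.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(triplets, alpha=0.2):
    """Sum of hinge terms [d(a, p) - d(a, n) + alpha]_+ over (a, p, n) triplets."""
    total = 0.0
    for anchor, positive, negative in triplets:
        violation = squared_l2(anchor, positive) - squared_l2(anchor, negative) + alpha
        total += max(violation, 0.0)
    return total


# d(a, p) = 0 and d(a, n) = 4: the constraint holds, so this triplet adds nothing.
easy = ([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
# d(a, p) = 2 and d(a, n) = 0: the constraint is violated by about 2.2.
hard = ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(triplet_loss([easy]))
print(triplet_loss([easy, hard]))
```

The `easy` triplet shows why most randomly generated triplets contribute no gradient.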


    Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets that are active and can therefore contribute to improving the model. The following section talks about the different approaches we use for the triplet selection.

3.2. Triplet Selection

    In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1). This means that, given $x_i^a$, we want to select an $x_i^p$ (hard positive) such that $\arg\max_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2$ and similarly $x_i^n$ (hard negative) such that $\arg\min_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2$.

    It is infeasible to compute the argmin and argmax across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:

•   Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.

•   Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

  Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch.
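Restricting the argmax and argmin to a mini-batch can be sketched as follows. This is an illustrative reading rather than the authors' implementation: `hardest_in_batch`, the toy 1-D-like embeddings, and the string labels are assumptions for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hardest_in_batch(embeddings, labels, anchor_idx):
    """Return (hard_positive_idx, hard_negative_idx) for one anchor.

    Hard positive: same label, largest distance to the anchor (argmax).
    Hard negative: different label, smallest distance to the anchor (argmin).
    Either index is None if the batch holds no such exemplar.
    """
    anchor = embeddings[anchor_idx]
    same = labels[anchor_idx]
    positives = [i for i, lab in enumerate(labels) if lab == same and i != anchor_idx]
    negatives = [i for i, lab in enumerate(labels) if lab != same]

    def dist(i):
        return squared_l2(anchor, embeddings[i])

    hard_pos = max(positives, key=dist) if positives else None
    hard_neg = min(negatives, key=dist) if negatives else None
    return hard_pos, hard_neg


batch = [[0.0, 0.0], [0.1, 0.0], [0.5, 0.0], [0.3, 0.0], [2.0, 0.0]]
labels = ["a", "a", "a", "b", "b"]
print(hardest_in_batch(batch, labels, anchor_idx=0))  # (2, 3)
```

The cost is quadratic in the batch size, which is why the search is confined to a mini-batch instead of the full training set.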


    To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.

    Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We don’t have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.

    We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.


    Selecting the hardest negatives can in practice lead to bad local minima early on in training; specifically, it can result in a collapsed model (i.e. $f(x) = 0$). In order to mitigate this, it helps to select $x_i^n$ such that

$$\|f(x_i^a) - f(x_i^p)\|_2^2 < \|f(x_i^a) - f(x_i^n)\|_2^2 .$$

We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin $\alpha$.
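A small sketch of semi-hard selection follows. The lower bound is the condition stated above; the upper bound `d_ap + alpha` is one way to encode the remark that these negatives lie inside the margin, and is an interpretation rather than a quoted formula. The function name and toy embeddings are assumptions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def semi_hard_negatives(anchor, positive, negatives, alpha=0.2):
    """Indices of negatives farther than the positive but still inside the margin."""
    d_ap = squared_l2(anchor, positive)
    picked = []
    for idx, negative in enumerate(negatives):
        d_an = squared_l2(anchor, negative)
        if d_ap < d_an < d_ap + alpha:
            picked.append(idx)
    return picked


anchor, positive = [0.0, 0.0], [0.5, 0.0]  # d(a, p) = 0.25
negatives = [
    [0.3, 0.0],  # closer than the positive: a hardest negative, skipped
    [0.6, 0.0],  # 0.25 < 0.36 < 0.45: semi-hard
    [1.0, 0.0],  # already beyond the margin: contributes no loss
]
print(semi_hard_negatives(anchor, positive, negatives))  # [1]
```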


    As mentioned before, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20]. On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches. In most experiments we use a batch size of around 1,800 exemplars.

3.3. Deep Convolutional Networks

    In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500h of training, but additional training can still significantly improve performance. The margin $\alpha$ is set to 0.2.
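For readers unfamiliar with AdaGrad, the update rule can be sketched on a toy problem. Only the starting learning rate of 0.05 comes from the text; the `eps` constant, the function name, and the quadratic objective are assumptions for illustration.

```python
import math


def adagrad_step(params, grads, accum, lr=0.05, eps=1e-8):
    """One in-place AdaGrad update.

    `accum` holds the running sum of squared gradients per parameter, so
    parameters with a history of large gradients take smaller steps.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params


# Minimise f(w) = w^2 (gradient 2w), starting from w = 1.0.
w, acc = [1.0], [0.0]
for _ in range(100):
    adagrad_step(w, [2.0 * w[0]], acc)
print(w[0])  # smaller than 1.0 and still positive
```

Because the accumulated sum only grows, the effective step size shrinks over time.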


   We used two types of architectures and explore their trade-offs in more detail in the experimental section. Their practical differences lie in the difference of parameters and FLOPS. The best model may be different depending on the application. E.g. a model running in a data center can have many parameters and require a large number of FLOPS, whereas a model running on a mobile phone needs to have few parameters, so that it can fit into memory. All our models use rectified linear units as the non-linear activation function.


    The first category, shown in Table 1, adds 1×1×d convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.


Table 1. NN1. Zeiler&Fergus [22] based model with 1×1 convolutions inspired by [9]. The input and output sizes are described in rows × cols × #filters. The kernel is specified as rows × cols, stride and the maxout [6] pooling size as p = 2.

    The second category we use is based on GoogLeNet style Inception models [16]. These models have 20× fewer parameters (around 6.6M-7.5M) and up to 5× fewer FLOPS (between 500M-1.6B). Some of these models are dramatically reduced in size (both depth and number of filters), so that they can be run on a mobile phone. One, NNS1, has 26M parameters and only requires 220M FLOPS per image. The other, NNS2, has 4.3M parameters and 20M FLOPS. Table 2 describes NN2, our largest network, in detail. NN3 is identical in architecture but has a reduced input size of 160×160. NN4 has an input size of only 96×96, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2). In addition to the reduced input size it does not use 5×5 convolutions in the higher layers as the receptive field is already too small by then. Generally we found that the 5×5 convolutions can be removed throughout with only a minor drop in accuracy. Figure 4 compares all our models.

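Figure 4. FLOPS vs. Accuracy trade-off. Shown is the trade-off between FLOPS and accuracy for a wide range of different model sizes and architectures. Highlighted are the four models that we focus on in our experiments.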

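Table 2. NN2. Details of the NN2 Inception incarnation. This model is almost identical to the one described in [16]. The two major differences are the use of L2 pooling instead of max pooling (m), where specified. I.e. instead of taking the spatial max the L2 norm is computed. The pooling is always 3×3 (aside from the final average pooling) and in parallel to the convolutional modules inside each Inception module. If there is a dimensionality reduction after the pooling it is denoted with p. 1×1, 3×3, and 5×5 pooling are then concatenated to get the final output.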

4. Datasets and Evaluation

    We evaluate our method on four datasets and with the exception of Labelled Faces in the Wild and YouTube Faces we evaluate our method on the face verification task. I.e. given a pair of two face images a squared L2 distance threshold $D(x_i, x_j)$ is used to determine the classification of same and different. All face pairs $(i, j)$ of the same identity are denoted with $P_{same}$, whereas all pairs of different identities are denoted with $P_{diff}$.

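We define the set of all true accepts as

$$TA(d) = \{(i, j) \in P_{same}, \text{ with } D(x_i, x_j) \le d\}.$$

These are the face pairs $(i, j)$ that were correctly classified as same at threshold $d$.

Similarly,

$$FA(d) = \{(i, j) \in P_{diff}, \text{ with } D(x_i, x_j) \le d\}$$

is the set of all pairs that were incorrectly classified as same (false accept).

The validation rate VAL(d) and the false accept rate FAR(d) for a given face distance $d$ are then defined as

$$VAL(d) = \frac{|TA(d)|}{|P_{same}|}, \qquad FAR(d) = \frac{|FA(d)|}{|P_{diff}|}.$$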
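Both rates are simple counts over labelled pairs. A minimal sketch, where the function name and the six toy distances are made up for illustration and the code assumes at least one same-identity and one different-identity pair:

```python
def val_far(distances, same_flags, d):
    """Return (VAL(d), FAR(d)).

    distances[k] is D(x_i, x_j) for pair k; same_flags[k] is True when
    the two faces in pair k share an identity.
    """
    pairs = list(zip(distances, same_flags))
    true_accepts = sum(1 for dist, same in pairs if same and dist <= d)
    false_accepts = sum(1 for dist, same in pairs if not same and dist <= d)
    n_same = sum(1 for same in same_flags if same)
    n_diff = len(same_flags) - n_same
    return true_accepts / n_same, false_accepts / n_diff


distances = [0.3, 0.9, 1.4, 0.8, 2.1, 3.0]
same_flags = [True, True, True, False, False, False]
val, far = val_far(distances, same_flags, d=1.0)
print(val, far)  # two of three same pairs accepted, one of three different pairs accepted
```

Raising `d` increases both rates, which is the trade-off an ROC curve traces out.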


4.1. Hold-out Test Set

    We keep a hold-out set of around one million images that has the same distribution as our training set, but disjoint identities. For evaluation we split it into five disjoint sets of 200k images each. The FAR and VAL rate are then computed on 100k × 100k image pairs. Standard error is reported across the five splits.

4.2. Personal Photos

    This is a test set with similar distribution to our training set, but has been manually verified to have very clean labels. It consists of three personal photo collections with a total of around 12k images. We compute the FAR and VAL rate across all 12k squared pairs of images.

4.3. Academic Datasets

    Labeled Faces in the Wild (LFW) is the de-facto academic test set for face verification [7]. We follow the standard protocol for unrestricted, labeled outside data and report the mean classification accuracy as well as the standard error of the mean.
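Reporting a mean with its standard error over splits amounts to the following computation. The function name and the fold accuracies below are made-up placeholders, not results from the paper.

```python
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean (sample std. dev. / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical per-split accuracies, for illustration only.
fold_accuracies = [0.990, 1.000, 0.995, 0.990, 1.000]
mean, sem = mean_and_sem(fold_accuracies)
print(f"{mean:.4f} +/- {sem:.4f}")
```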


    YouTube Faces DB [21] is a new dataset that has gained popularity in the face recognition community [17, 15]. The setup is similar to LFW, but instead of verifying pairs of images, pairs of videos are used.

5. Experiments

    If not mentioned otherwise we use between 100M-200M training face thumbnails consisting of about 8M different identities. A face detector is run on each image and a tight bounding box around each face is generated. These face thumbnails are resized to the input size of the respective network. Input sizes range from 96×96 pixels to 224×224 pixels in our experiments.

5.1. Computation Accuracy Trade-off

    Before diving into the details of more specific experiments we will discuss the trade-off of accuracy versus number of FLOPS that a particular model requires. Figure 4 shows the FLOPS on the x-axis and the accuracy at 0.001 false accept rate (FAR) on our user labelled test-data set from section 4.2. It is interesting to see the strong correlation between the computation a model requires and the accuracy it achieves. The figure highlights the five models (NN1, NN2, NN3, NNS1, NNS2) that we discuss in more detail in our experiments.

    We also looked into the accuracy trade-off with regards to the number of model parameters. However, the picture is not as clear in that case. For example, the Inception based model NN2 achieves a comparable performance to NN1, but only has a 20th of the parameters. The number of FLOPS is comparable, though. Obviously at some point the performance is expected to decrease, if the number of parameters is reduced further. Other model architectures may allow further reductions without loss of accuracy, just like Inception [16] did in this case.

5.2. Effect of CNN Model

    We now discuss the performance of our four selected models in more detail. On the one hand we have our traditional Zeiler&Fergus based architecture with 1×1 convolutions [22, 9] (see Table 1). On the other hand we have Inception [16] based models that dramatically reduce the model size. Overall, in the final performance the top models of both architectures perform comparably. However, some of our Inception based models, such as NN3, still achieve good performance while significantly reducing both the FLOPS and the model size.

    The detailed evaluation on our personal photos test set is shown in Figure 5. While the largest model achieves a dramatic improvement in accuracy compared to the tiny NNS2, the latter can be run at 30 ms per image on a mobile phone and is still accurate enough to be used in face clustering. The sharp drop in the ROC for FAR < $10^{-4}$ indicates noisy labels in the test data groundtruth. At extremely low false accept rates a single mislabeled image can have a significant impact on the curve.

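Figure 5. Network Architectures. This plot shows the complete ROC for the four different models on our personal photos test set from section 4.2. The sharp drop at 10E-4 FAR can be explained by noise in the groundtruth labels. The models in order of performance are: NN2: 224×224 input Inception based model; NN1: Zeiler&Fergus based network with 1×1 convolutions; NNS1: small Inception style model with only 220M FLOPS; NNS2: tiny Inception model with only 20M FLOPS.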

5.3. Sensitivity to Image Quality

    Table 4 shows the robustness of our model across a wide range of image sizes. The network is surprisingly robust with respect to JPEG compression and performs very well down to a JPEG quality of 20. The performance drop is very small for face thumbnails down to a size of 120×120 pixels and even at 80×80 pixels it shows acceptable performance. This is notable, because the network was trained on 220×220 input images. Training with lower resolution faces could improve this range further.




5.4. Embedding Dimensionality

    We explored various embedding dimensionalities and selected 128 for all experiments other than the comparison reported in Table 5. One would expect the larger embeddings to perform at least as well as the smaller ones; however, it is possible that they require more training to achieve the same accuracy. That said, the differences in the performance reported in Table 5 are statistically insignificant. It should be noted that during training a 128 dimensional float vector is used, but it can be quantized to 128 bytes without loss of accuracy. Thus each face is compactly represented by a 128 dimensional byte vector, which is ideal for large scale clustering and recognition. Smaller embeddings are possible at a minor loss of accuracy and could be employed on mobile devices.
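The text does not say how the float vector is quantized. One plausible scheme, shown here purely as an assumption, is uniform linear quantization: every component of a unit-norm embedding lies in [-1, 1], so it can be mapped to one unsigned byte. The names `quantize` and `dequantize` are hypothetical.

```python
def quantize(embedding):
    """Map each component from [-1, 1] to one unsigned byte (0..255).

    Assumed scheme: uniform linear quantization with clamping.
    """
    return bytes(min(255, max(0, round((v + 1.0) * 127.5))) for v in embedding)


def dequantize(packed):
    """Approximate inverse of `quantize`."""
    return [b / 127.5 - 1.0 for b in packed]


values = [-1.0, -0.5, 0.0, 0.25, 1.0]
packed = quantize(values)
restored = dequantize(packed)
print(len(quantize([0.0] * 128)))  # 128 bytes for a 128-D embedding
print(max(abs(a - b) for a, b in zip(values, restored)))  # below 0.004
```

Under this scheme the worst-case rounding error per component is half a quantization step, roughly 0.004.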




5.5. Amount of Training Data

    Table 6 shows the impact of large amounts of training data. Due to time constraints this evaluation was run on a smaller model; the effect may be even larger on larger models. It is clear that using tens of millions of exemplars results in a clear boost of accuracy on our personal photo test set from section 4.2. Compared to only millions of images the relative reduction in error is 60%. Using another order of magnitude more images (hundreds of millions) still gives a small boost, but the improvement tapers off.

Table 6. Training Data Size. This table compares the performance after 700h of training for a smaller model with 96×96 pixel inputs. The model architecture is similar to NN2, but without the 5×5 convolutions in the Inception modules.

5.6. Performance on LFW

    We evaluate our model on LFW using the standard protocol for unrestricted, labeled outside data. Nine training splits are used to select the L2-distance threshold. Classification (same or different) is then performed on the tenth test split. The selected optimal threshold is 1.242 for all test splits except the eighth split (1.256).
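Selecting the threshold on the training splits and applying it to the held-out split can be sketched as below. Searching over the observed distances as candidate thresholds is an assumption of this sketch (the text does not describe the search), and all distances shown are made-up toy values.

```python
def accuracy_at(threshold, distances, same_flags):
    """Fraction of pairs classified correctly when 'same' means distance <= threshold."""
    correct = sum(1 for d, same in zip(distances, same_flags) if (d <= threshold) == same)
    return correct / len(distances)


def best_threshold(distances, same_flags):
    """Pick the observed distance that maximises accuracy on the training pairs."""
    candidates = sorted(set(distances))
    return max(candidates, key=lambda t: accuracy_at(t, distances, same_flags))


# Toy stand-ins for the pooled training splits and the held-out test split.
train_d = [0.4, 0.9, 1.1, 1.5, 1.8, 2.5]
train_same = [True, True, True, False, False, False]
test_d = [0.7, 1.3, 1.0, 2.0]
test_same = [True, True, False, False]

threshold = best_threshold(train_d, train_same)
print(threshold)                                  # 1.1
print(accuracy_at(threshold, test_d, test_same))  # 0.5
```

In the ten-fold protocol this would be repeated with each split held out in turn.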

Our model is evaluated in two modes:

1. Fixed center crop of the LFW provided thumbnail.

2. A proprietary face detector (similar to Picasa [3]) is run on the provided LFW thumbnails. If it fails to align the face (this happens for two images), the LFW alignment is used.


    Figure 6 gives an overview of all failure cases. It shows false accepts at the top and false rejects at the bottom. We achieve a classification accuracy of 98.87%±0.15 when using the fixed center crop described in (1) and the record-breaking 99.63%±0.09 (standard error of the mean) when using the extra face alignment (2). This reduces the error reported for DeepFace in [17] by more than a factor of 7 and the previous state of the art reported for DeepId2 in [15] by 30%. This is the performance of model NN1, but even the much smaller NN3 achieves performance that is not statistically significantly different.
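
To make the "±" figures concrete, here is a small sketch of how a mean accuracy and its standard error can be computed over the ten test splits. The per-split accuracies below are made-up placeholders, not the values behind the reported numbers.

```python
import math
import statistics

def mean_and_sem(values):
    """Return the mean and the standard error of the mean (sample std / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical per-split accuracies for ten test splits.
fold_accuracies = [0.995, 0.997, 0.998, 0.993, 0.997,
                   0.995, 0.998, 0.996, 0.997, 0.997]
mean, sem = mean_and_sem(fold_accuracies)
print(f"{100 * mean:.2f}% ± {100 * sem:.2f}")
```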


5.7. Performance on YouTube Faces DB

   We use the average similarity of all pairs of the first one hundred frames that our face detector detects in each video. This gives us a classification accuracy of 95.12%±0.39. Using the first one thousand frames results in 95.18%. Compared to [17], which reports 91.4% and also evaluates one hundred frames per video, we reduce the error rate by almost half. DeepId2 [15] achieved 93.2% and our method reduces this error by 30%, comparable to our improvement on LFW.
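
The video comparison described above can be sketched like this. The text speaks of an "average similarity" without defining the similarity measure; the sketch assumes the mean squared L2 distance between per-frame embeddings (smaller means more similar). The function names and the `max_frames` parameter are illustrative.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def video_pair_distance(frames_a, frames_b, max_frames=100):
    """Average distance over all cross pairs of the first max_frames embeddings per video."""
    a = frames_a[:max_frames]
    b = frames_b[:max_frames]
    total = sum(squared_l2(x, y) for x in a for y in b)
    return total / (len(a) * len(b))

# Toy 2-D "embeddings" for two short videos (hypothetical values).
video_1 = [[0.0, 0.0], [1.0, 0.0]]
video_2 = [[0.0, 0.0]]
print(video_pair_distance(video_1, video_2))  # averages the two cross-pair distances
```

The resulting score would then be compared against a threshold in the same way as a single image pair.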

5.8. Face Clustering

   Our compact embedding lends itself to clustering a user's personal photos into groups of people with the same identity. The constraints in assignment imposed by clustering faces, compared to the pure verification task, lead to truly amazing results. Figure 7 shows one cluster in a user's personal photo collection, generated using agglomerative clustering. It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age.
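
As a rough illustration of agglomerative clustering on embeddings, the sketch below starts with every face in its own cluster and repeatedly merges the two closest clusters until the closest pair is farther apart than a cutoff. The linkage criterion (average linkage) and the `max_distance` cutoff are assumptions; the text does not state which variant was used.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters (lists of indices)."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative_cluster(points, max_distance):
    """Merge the closest clusters until the closest pair is farther than max_distance."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_distance:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy 2-D "embeddings": three faces of one person, two of another (hypothetical values).
faces = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0]]
groups = agglomerative_cluster(faces, max_distance=0.5)
print(groups)
```

This naive version recomputes all cluster distances on every merge, which is fine for a toy example but too slow for a large photo collection.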


6. Summary

    We provide a method to directly learn an embedding into a Euclidean space for face verification. This sets it apart from other methods [15, 17] that use the CNN bottleneck layer, or require additional post-processing such as concatenation of multiple models and PCA, as well as SVM classification. Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.



   Another strength of our model is that it only requires minimal alignment (a tight crop around the face area). [17], for example, performs a complex 3D alignment. We also experimented with a similarity transform alignment and noticed that this can actually improve performance slightly. It is not clear if it is worth the extra complexity.


 Future work will focus on better understanding of the error cases, further improving the model, and also reducing model size and CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes and offline as well as online positive and negative mining.


7. Appendix: Harmonic Embedding

   In this section we introduce the concept of harmonic embeddings. By this we denote a set of embeddings that are generated by different models v1 and v2 but are compatible in the sense that they can be compared to each other.

   This compatibility greatly simplifies upgrade paths. E.g. in a scenario where embedding v1 was computed across a large set of images and a new embedding model v2 is being rolled out, this compatibility ensures a smooth transition without the need to worry about version incompatibilities. Figure 8 shows results on our 3G dataset. It can be seen that the improved model NN2 significantly outperforms NN1, while the comparison of NN2 embeddings to NN1 embeddings performs at an intermediate level.

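Figure 8. Harmonic Embedding Compatibility. These ROC curves show the compatibility of the harmonic embeddings of NN2 with the embeddings of NN1. NN2 is an improved model that performs better than NN1. Comparing the embeddings generated by NN1 with those generated by NN2 shows the compatibility between the two. In fact, the mixed-mode performance is still better than that of NN1 by itself.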

7.1. Harmonic Triplet Loss

   In order to learn the harmonic embedding we mix embeddings of v1 together with the embeddings of v2 that are being learned. This is done inside the triplet loss and results in additionally generated triplets that encourage compatibility between the different embedding versions. Figure 9 visualizes the different combinations of triplets that contribute to the triplet loss.

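Figure 9. Learning the Harmonic Embedding. In order to learn a harmonic embedding, we generate triplets that mix the v1 embeddings with the v2 embeddings that are being trained. The semi-hard negatives are selected from the whole set of both v1 and v2 embeddings.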
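
The idea of mixing embedding versions inside the triplet loss can be sketched as below. Several things here are assumptions rather than the paper's exact recipe: the figure with the actual triplet combinations is not reproduced in this post, so the sketch simply enumerates every anchor/positive/negative combination that involves at least one v2 embedding; the margin value is a placeholder; and semi-hard negative mining is left out. All names are illustrative.

```python
from itertools import product

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin):
    """Standard hinge-style triplet loss on squared L2 distances."""
    return max(0.0, squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin)

def harmonic_triplet_loss(triplet_v1, triplet_v2, margin=0.2):
    """Sum the triplet loss over mixed v1/v2 combinations (assumed set of combinations).

    triplet_v1 and triplet_v2 hold the (anchor, positive, negative) embeddings of the
    same three images under model v1 and model v2. margin=0.2 is an assumed value.
    """
    versions = (triplet_v1, triplet_v2)
    total = 0.0
    for va, vp, vn in product(range(2), repeat=3):
        if va == vp == vn == 0:
            continue  # pure-v1 triplet: contains nothing that is being learned
        total += triplet_loss(versions[va][0], versions[vp][1], versions[vn][2], margin)
    return total

# Toy 1-D example: both models already separate the negative by more than the margin.
v1 = ([0.0], [0.1], [1.0])
v2 = ([0.0], [0.0], [1.2])
print(harmonic_triplet_loss(v1, v2))
```

In a real training setup the v1 embeddings would be treated as constants and only the v2 network would receive gradients.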

    We initialized the v2 embedding from an independently trained NN2 and retrained the last layer (the embedding layer) from random initialization with the compatibility-encouraging triplet loss. First only the last layer is retrained, then we continue training the whole v2 network with the harmonic loss.

   Figure 10 shows a possible interpretation of how this compatibility may work in practice. The vast majority of v2 embeddings may be embedded near the corresponding v1 embedding; however, incorrectly placed v1 embeddings can be perturbed slightly such that their new location in embedding space improves verification accuracy.

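Figure 10. Harmonic Embedding Space. This figure sketches a possible interpretation of how harmonic embeddings can improve verification accuracy while maintaining compatibility with less accurate embeddings. In this scenario there is one misclassified face, whose embedding is perturbed to the "correct" location in v2.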

7.2. Summary

   These are very interesting findings and it is somewhat surprising that it works so well. Future work can explore how far this idea can be extended. Presumably there is a limit as to how much the v2 embedding can improve over v1 while still being compatible. Additionally, it would be interesting to train small networks that can run on a mobile phone and are compatible with a larger server-side model.

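 I am a beginner in face recognition and my English is poor, so some of the translation rests on my own shallow understanding; if anything is translated incorrectly, please point it out. Possibly because of the blog editor, some figures may be missing or covered here; for the complete version, please refer to the PDF version of my analysis here. Thank you.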