ICCV 2017 S3FD: Single Shot Scale-invariant Face Detector

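Problem the paper addresses: the performance of anchor-based detectors drops sharply when faces are very small.
Four highlights:
1 Like SSD, several feature maps predict faces at different scales, but unlike FPN there are no connections between upper and lower feature maps. The anchor design relies on the effective receptive field and on a newly proposed equal-proportion interval principle.
2 A scale compensation anchor matching strategy improves the recall of small faces. The authors argue that anchor scales are discrete, so this strategy raises recall for small faces and for faces whose size falls between two discrete anchor scales.
3 A max-out background label reduces the false positive rate on small faces. It is applied only to the lowest detection layer, conv3_3.
4 State-of-the-art results on all four datasets (AFW, PASCAL face, FDDB and WIDER FACE), running in real time at 36 FPS on a Titan X.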
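The highlight the authors consider most important:
Proposing a scale-equitable face detection framework with a wide range of anchor-associated layers and a series of reasonable anchor scales so as to handle different scales of faces well.
In other words: as in SSD, anchors with evenly distributed scales are tiled over several feature maps, and the anchor scales are chosen carefully so that faces of different scales can all be handled.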
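Four problems:
1 When a face is too small, very few of its features survive on the high-level feature maps.
2 The actual face region, the receptive field and the anchor size do not match.
3 The preset anchor scales are discrete and do not fit real faces well, so small faces and faces outside the anchor scale range are matched poorly.
4 Small anchors produce too many non-face false alarms on the background.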
A brief introduction to anchor-based detectors:
Anchor-based object detection methods detect objects by classifying and regressing a series of pre-set anchors, which are generated by regularly tiling a collection of boxes with different scales and aspect ratios on the image. The anchor-associated layers are convolved to classify and align the corresponding anchors. Drawback: the smaller the object (and the fewer features it leaves), the worse the detection performance.
Designing suitable anchor strides and sizes:
Stride: We tile anchors on a wide range of layers whose stride size vary from 4 to 128 pixels, which guarantees that various scales of faces have enough features for detection.
Size: we design anchors with scales from 16 to 512 pixels over different layers according to the effective receptive field and a new equal-proportion interval principle, which ensures that anchors at different layers match their corresponding effective receptive field and different scales of anchors evenly distribute on the image.
Anchor matching strategy:
propose a scale compensation anchor matching strategy with two stages. The first stage follows current anchor matching method but adjusts a more reasonable threshold. The second stage ensures that every scale of faces match enough anchors through scale compensation.
Inspired by Faster R-CNN and SSD:
In this paper, inspired by the RPN in Faster RCNN and the multi-scale mechanism in SSD, we develop a state-of-the-art face detector with real-time speed.
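Architecture diagram:
Points worth noting:
1 Detection is performed on multiple layers, as in SSD.
2 There is a normalization layer. Following ParseNet (ICLR 2016), the authors observe that the activations of the conv3-3, conv4-3 and conv5-3 feature maps have different scales, so an element-wise L2 normalization is applied to these feature maps, which helps training converge. Note: a re-scale step follows, with a learnable per-channel scale similar to the scale parameter (gamma) in BN.
3 A regular prediction conv layer outputs 1*(2+4) values per anchor: 4 are the coordinate offsets relative to the anchor and 2 are the face/non-face classification scores.
4 conv3_3 outputs 1*(Ns+4) values: 4 are again the coordinate offsets, and Ns = Nm + 1, where 1 is the face score and Nm is the number of max-out background scores on conv3-3, used mainly to suppress false detections of small objects on conv3-3.
5 fc6 and fc7 are fully connected layers in the original VGG16; here they are converted to conv layers, and conv6-2 and conv7-2 downsample with stride 2.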
Section 3 covers four topics:
3.1 Scale equitable framework
The same point is made again: "develop a network architecture with a wide range of anchor-associated layers, whose stride size gradually double from 4 to 128 pixels", i.e. the stride doubles from one detection layer to the next. This "ensures that different scales of faces have adequate features for detection at corresponding anchor associated layers": faces of different scales have enough features for detection on the corresponding feature map.
After determining the location of anchors, we design the scales of anchors from 16 to 512 pixels based on the effective receptive field and our equal proportion interval principle. That is, the anchor scales are designed from the effective receptive field and the equal-proportion interval principle.
The former (effective receptive field) guarantees that each scale of anchors matches the corresponding effective receptive field well, and the latter (equal-proportion interval principle) makes different scales of anchors have the same density on the image.
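Normalization layer: follows ParseNet (ICLR 2016), a network for semantic segmentation.
Two contributions of ParseNet:
1 Global pooling, which is simply global average pooling. Unpooling restores the 1*1*C feature to a W*H*C feature map in a simple way: the 1*1 feature is replicated W*H times.
2 L2 normalization. The motivation is simple and is given in fig. 3: the activations of feature maps from different levels (e.g. conv3-3, conv4-3, conv5-3) have different scales, possibly differing by several orders of magnitude. If they are simply concatenated as in fig. e, the feature maps with small activations carry too little weight in the concatenated vector, so L2 normalization is applied first.
The normalization itself is simple. In ParseNet, the C-dimensional feature vector x at each spatial position of the W*H*C map is divided by its L2 norm: x_hat = x / ||x||_2, where ||x||_2 = sqrt(x_1^2 + ... + x_C^2).
A BN-like step follows: to avoid over-normalizing, a scale parameter is learned for each channel, y_i = gamma_i * x_hat_i, so C parameters are learned in total. Like the scale parameter (gamma) in BN, they are learned during training.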
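The normalization and re-scale steps can be sketched in plain Python as follows. This is a minimal illustration, not the authors' layer: it assumes the ParseNet/SSD convention of normalizing across channels at each spatial position, and the function name `l2_normalize`, the `[C][H][W]` nested-list layout and the scale value 10.0 are all illustrative choices.

```python
import math


def l2_normalize(feature_map, scales, eps=1e-10):
    """L2-normalize the channel vector at every spatial position, then re-scale.

    feature_map: nested list indexed as [channel][row][col] (assumed layout).
    scales: one scale per channel (learned during training in a real network).
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for row in range(height):
        for col in range(width):
            # L2 norm of the C-dimensional vector at this position.
            norm = math.sqrt(
                sum(feature_map[c][row][col] ** 2 for c in range(channels))
            ) + eps
            for c in range(channels):
                out[c][row][col] = scales[c] * feature_map[c][row][col] / norm
    return out


# Two channels, one spatial position: the vector (3, 4) has norm 5.
fmap = [[[3.0]], [[4.0]]]
normalized = l2_normalize(fmap, scales=[10.0, 10.0])
```

After normalization every position has the same norm regardless of which layer it came from, and the per-channel scale lets the network choose how large that norm should be.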
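Prediction conv layer: each detection layer is followed by a p×3×3×q conv, where p is the number of input channels and q is (2+4) or (Ns+4). The 4 are the coordinate offsets relative to the anchor, the 2 are face/non-face, and Ns = Nm + 1 covers the max-out background label on conv3-3.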
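As a small worked example of the output width q, assuming one anchor per feature-map location and the Nm = 3 setting mentioned in the training notes (the helper name `prediction_channels` is made up for this sketch):

```python
def prediction_channels(num_background_scores=1):
    """Output channels q per anchor: class scores plus 4 box offsets."""
    num_class_scores = num_background_scores + 1  # background score(s) + 1 face score
    return num_class_scores + 4


regular_q = prediction_channels()    # 2 + 4 for the regular detection layers
conv3_3_q = prediction_channels(3)   # Ns + 4 with Ns = Nm + 1 and Nm = 3
```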
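Choosing suitable anchor sizes
Three things stand out in the table above:
1 The anchor aspect ratio is 1:1, since faces are roughly square.
2 The stride and RF of each layer are fixed; the stride is 1/4 of the anchor size, and the anchor is considerably smaller than the RF.
3 S3FD sets exactly one anchor scale on each detection layer's feature map.
Theoretical receptive field (TRF): for a point on a feature map, the region computed from the kernel sizes and strides of the conv layers. It is usually large, but only the input pixels in a Gaussian-like region around the TRF center actually matter, and their contribution falls off like a 2D Gaussian. It is the black rectangle in figure (a) below.
Effective receptive field (ERF): the part of the TRF (the Gaussian-like region) that actually contributes to the value of a point on the feature map. It is the white circular region in figure (a) below.
Based on this, the authors argue: "the anchor should be significantly smaller than theoretical receptive field in order to match the effective receptive field". That is, the anchor size should match the ERF rather than the TRF.
Equal-proportion interval principle: The stride size of a detection layer determines the interval of its anchor on the input image. In other words, the stride of a feature map determines the spacing at which its anchors are sampled on the input image. The stride of every detection layer is set to 1/4 of its anchor size, "which guarantees that different scales of anchor have the same density on the image, so that various scales face can approximately match the same number of anchors." The benefit: across feature maps, anchors of different scales have the same sampling density, so faces of different scales match approximately the same number of anchors.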
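The stride and anchor settings described above can be written down and checked with a short sketch. The stride values (4 to 128, doubling) and anchor sizes (16 to 512) come from the text; the layer names, the helper names and the 640-pixel square input are assumptions made for illustration.

```python
# (layer name, stride in pixels, anchor size in pixels); names are illustrative.
DETECTION_LAYERS = [
    ("conv3_3", 4, 16),
    ("conv4_3", 8, 32),
    ("conv5_3", 16, 64),
    ("conv_fc7", 32, 128),
    ("conv6_2", 64, 256),
    ("conv7_2", 128, 512),
]


def tile_anchors(image_size, stride, anchor_size):
    """Square 1:1 anchors (cx, cy, w, h), one per feature-map cell."""
    cells = image_size // stride
    anchors = []
    for row in range(cells):
        for col in range(cells):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            anchors.append((cx, cy, anchor_size, anchor_size))
    return anchors


def anchor_density(stride, anchor_size):
    """Number of anchor centers falling inside one anchor-sized window."""
    return (anchor_size / stride) ** 2


densities = [anchor_density(stride, size) for _, stride, size in DETECTION_LAYERS]
counts = {name: len(tile_anchors(640, stride, size))
          for name, stride, size in DETECTION_LAYERS}
```

Because the anchor size is always four times the stride, `densities` holds the same value for every layer, which is the "same density" property the principle is after. `counts` shows the other side of the design: the lowest layer holds by far the most anchors.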
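3.2. Scale compensation anchor matching strategy
The current anchor matching strategy, as in Faster R-CNN, "firstly matches each face to the anchors with the best jaccard overlap and then matches anchors to any face with jaccard overlap higher than a threshold" (e.g. 0.5).
The problem with this: "anchor scales are discrete while face scales are continuous". This leads to:
1) the average number of matched anchors is about 3 which is not enough to recall faces with high scores; each ground-truth face box matches only about 3 anchors on average, which is too few.
2) the number of matched anchors is highly related to the anchor scales. The faces away from anchor scales tend to be ignored, leading to their low recall rate. The number of matched anchors depends strongly on how close the face size is to an anchor scale, so faces whose size lies far from every discrete anchor scale are easily missed.
The proposed scale compensation anchor matching strategy:
step 1: use the usual jaccard-overlap matching between anchors and gt boxes, but lower the threshold to 0.35, which increases the number of anchors matched to each gt box.
step 2: for faces that still match too few anchors, lower the jaccard-overlap threshold to 0.1, sort the candidate anchors by overlap in descending order and take the top N as the matches of that gt box, where N is the average number of matched anchors per face from step 1. Note: the extra anchors picked up here are those with overlap between 0.1 and 0.35, taken in descending order.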
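The two stages can be sketched as below. This is a simplified illustration under stated assumptions rather than the authors' code: boxes are `(x1, y1, x2, y2)` tuples, the usual "best anchor for each face" step is left out, an anchor is allowed to match more than one face, and N is taken as the rounded average number of stage-one matches. The function names and the toy boxes are made up for the example.

```python
def iou(box_a, box_b):
    """Jaccard overlap of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_anchors(faces, anchors, stage1_thresh=0.35, stage2_thresh=0.1):
    """Return {face index: list of matched anchor indices}."""
    overlaps = [[iou(face, anchor) for anchor in anchors] for face in faces]
    # Stage 1: ordinary threshold matching with the lowered threshold.
    matches = {
        f: [a for a, o in enumerate(row) if o > stage1_thresh]
        for f, row in enumerate(overlaps)
    }
    if not faces:
        return matches
    # N: average number of anchors matched per face in stage 1.
    top_n = round(sum(len(m) for m in matches.values()) / len(faces))
    # Stage 2: compensate faces that matched fewer than N anchors.
    for f, row in enumerate(overlaps):
        if len(matches[f]) >= top_n:
            continue
        candidates = [a for a, o in enumerate(row) if o > stage2_thresh]
        candidates.sort(key=lambda a: row[a], reverse=True)
        matches[f] = candidates[:top_n]
    return matches


faces = [(0, 0, 10, 10), (100, 100, 110, 110)]
anchors = [
    (0, 0, 10, 10), (0, 0, 10, 8), (0, 0, 10, 5), (0, 0, 10, 4),  # fit face 0
    (100, 100, 110, 103), (100, 100, 110, 102), (100, 100, 110, 100.5),
]
matches = match_anchors(faces, anchors)
```

In this toy setup the second face has no anchor above 0.35, so stage one leaves it unmatched; stage two then assigns it its two best anchors above 0.1.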
3.3 Max-out background label
Face detection faces a dilemma here. conv3-3 can detect small objects, but finding the large number of small faces requires small, densely placed anchors ("we have to densely tile plenty of small anchors on the image to detect small faces"; "tile" is an apt word, like the closely packed tiles on a roof). The catch: "These smallest anchors contribute most to the false positive faces." Small anchors bring in many false positives: conv3-3 accounts for the largest share of anchors but also produces many false positive faces.
We apply the max-out background label for the conv3-3 detection layer. For each of the smallest anchors, we predict Nm scores for background label and then choose the highest as its final score. For the background label, the largest of the Nm outputs is used as the background score, so in Ns = Nm + 1, Nm is the number of background (non-face) scores and 1 is the face score.
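A minimal sketch of the max-out step for a single conv3-3 anchor follows. Taking the maximum of the Nm background scores comes from the text; turning the resulting two scores into a probability with a softmax is an assumption of this sketch, as is the function name.

```python
import math


def maxout_face_probability(background_scores, face_score):
    """Collapse Nm background scores to their maximum, then softmax against the face score."""
    background = max(background_scores)
    peak = max(background, face_score)  # subtracted for numerical stability
    exp_background = math.exp(background - peak)
    exp_face = math.exp(face_score - peak)
    return exp_face / (exp_background + exp_face)


# With Nm = 3, one strong background vote is enough to pull the face probability down.
with_maxout = maxout_face_probability([0.0, 2.0, 1.0], face_score=2.0)
single_background = maxout_face_probability([0.0], face_score=2.0)
```

Having several background scores gives the layer more ways to say "not a face", which is how it cuts down false positives from the smallest anchors.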
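3.4 Training
Training dataset: the 12,880 training images of WIDER FACE, augmented with color distortion, horizontal flipping and random cropping.
Random crop: because faces in WIDER FACE are relatively small, a zoom operation is used. Five patches are cropped from each image: one at the original scale, and four whose size is a ratio in [0.3, 1.0] of the image's short side.
Loss function: defined exactly as in Faster R-CNN (the multi-task loss of the RPN).
OHEM: after anchor matching, most anchors turn out to be unmatched and therefore negative, which makes positives and negatives heavily imbalanced. Hard example mining is used: the negatives are sorted by loss in descending order and the top ones are kept so that the ratio of positives to negatives is 1:3. With this in place, Nm = 3 for the background label and lambda = 4 in the loss function.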
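The sample selection can be sketched as follows, assuming per-anchor classification losses have already been computed; the function name and the toy loss values are illustrative only.

```python
def select_training_anchors(losses, is_positive, neg_pos_ratio=3):
    """Keep every positive anchor and the highest-loss negatives, at most neg_pos_ratio per positive."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: the ones the classifier currently gets most wrong.
    negatives.sort(key=lambda i: losses[i], reverse=True)
    kept_negatives = negatives[: neg_pos_ratio * len(positives)]
    return positives, kept_negatives


losses = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.05]
is_positive = [True, False, False, False, False, False, False]
positives, kept_negatives = select_training_anchors(losses, is_positive)
```

Only the kept anchors would contribute to the classification loss, so one positive is balanced against its three hardest negatives.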
Experiments:
RPN-face: same as Faster R-CNN, but the anchor aspect ratio is 1:1 and more scales are set on conv5: 16, 32, 64, 128, 256, 512. This gives 1*6 = 6 anchors per location, slightly different from the 3*3 = 9 anchors of Faster R-CNN. "RPN-face has the same choice of anchors as ours but only tiles on the last convolutional layer of VGG16." Drawback: "Not only its stride size (16 pixels) is too large for small faces, but also different scales of anchors have the same receptive field." Because Faster R-CNN predicts on a single feature map, it has to place anchors of several scales there.
Ablation study:
F: scale-equitable framework
S: scale compensation anchor matching strategy
M: max-out background label
Summary:
Max-out background label is promising: It deals with the massive small negative anchors (i.e., background) from the conv3-3 detection layer which is designed to detect small faces. Table 3 also shows that the harder the subset, the larger the mAP gain, which suggests the max-out background label helps most on difficult images.
Experiments on benchmark datasets:
AFW dataset. It contains 205 images with 473 labeled faces.
PASCAL face dataset. It has 1,335 labeled faces in 851 images with large face appearance and pose variations. It is collected from PASCAL person layout test subset.
FDDB dataset. It contains 5,171 faces in 2,845 images.
1) FDDB adopts the bounding ellipse while our S3FD outputs rectangle bounding box, so the authors train an elliptical regressor to transform the predicted bounding boxes to bounding ellipses.
2) FDDB has lots of unlabelled faces, which results in many false positive faces with high scores.
WIDER FACE dataset. It has 32,203 images and labels 393,703 faces with a high degree of variability in scale, pose and occlusion.
The images and annotations of training and validation set are available online, while the annotations of testing set are not released and the results are sent to the database server for receiving the precision-recall curves.
Inference time:
Speed-up: "we first filter out most boxes by a confidence threshold of 0.05 and keep the top 400 boxes before applying NMS, then we perform NMS with jaccard overlap of 0.3 and keep the top 200 boxes." About 80% of the time is spent in the VGG16 backbone, so a lighter backbone would likely give a more noticeable speed-up.
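The post-processing quoted above can be sketched in plain Python. The thresholds (0.05, top 400, NMS at 0.3, top 200) come from the quote; the greedy NMS loop, the `(x1, y1, x2, y2)` box format and the function names are assumptions of this sketch.

```python
def iou(box_a, box_b):
    """Jaccard overlap of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, conf_thresh=0.05, pre_nms_top=400,
                nms_thresh=0.3, keep_top=200):
    """Return indices of the boxes kept after score filtering and greedy NMS."""
    # 1) Drop low-confidence boxes and keep the best pre_nms_top of the rest.
    order = [i for i in range(len(boxes)) if scores[i] > conf_thresh]
    order.sort(key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top]
    # 2) Greedy NMS: keep a box only if it overlaps no higher-scoring kept box too much.
    kept = []
    for i in order:
        if len(kept) >= keep_top:
            break
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.01]
kept = postprocess(boxes, scores)
```

The early score filter is what keeps NMS cheap: its pairwise overlap checks only ever run on a few hundred boxes.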
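Summary: the paper tackles the sharp drop in detection performance when faces are very small. Three highlights:
1 scale-equitable framework
2 scale compensation anchor matching strategy
3 max-out background label
Code is available: sfzhang15/SFD
The demo runs; a training recipe is provided, but parts of the code need to be modified.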
Paper reference
1 S3FD: Single Shot Scale-invariant Face Detector, ICCV 2017