A Secured, Multilevel Face Recognition based on Head Pose Estimation, MTCNN and FaceNet

Artificial Intelligence (AI) and IoT have attracted sustained attention from scholars and researchers because of their high applicability, which makes them typical technologies of the Fourth Industrial Revolution. The hallmark of AI is its self-learning ability, which enables computers to predict and analyze complex data in tasks such as biometric recognition (fingerprints, irises, and faces), voice recognition, and text processing. Among those applications, face recognition is under intense research due to the demand for user identification. This paper proposes a new, secured, two-step solution for an identification system that uses MTCNN and FaceNet networks enhanced with head pose estimation of the users. The model's accuracy ranges from 92% to 95%, which makes it competitive with recent research and demonstrates the system's usability.


I. INTRODUCTION
In the era of the 4th Industrial Revolution, the rapid development of sophisticated new hardware has enabled many solutions that can supply the high computing power required by artificial intelligence (AI) models [1]. At the same time, scientific research has produced many interesting and powerful theoretical tools for automatic knowledge discovery and AI computing [2,3]. Over the last few years, computer vision (CV) has been a fast-growing and widely applied field, developed and implemented in many smart devices and systems. The newest technological and algorithmic achievements play an increasingly important role in supporting systems based on CV, such as object recognition and detection, image processing, and information extraction [4][5][6][7]. One important product of computer vision software development is face recognition, which is widely used in industrial and civil applications.
Other user identification solutions exist, such as RFID-based tags, which are quite convenient and accurate, but they have disadvantages. For example, standard ID tags can be easily copied, while the newest anti-theft tags are more expensive. Tags can also be stolen or deliberately and unlawfully lent to others [8][9][10][11]. Biometric detectors for fingerprint or iris patterns are quite expensive and hard to operate and maintain, and these devices pose a potential problem when the original information is unlawfully accessed [8,9]. For example, the fingerprint database in the device can be stolen [10,11]. Because of these issues, in this paper we propose facial recognition with a deep learning neural network as a method of user identification that offers high accuracy and security. Many popular models are currently available, for example ArcFace, DeepFace and FaceNet [12][13][14][15][16][17][18][19][20][21][22][23]. Although ArcFace and DeepFace provide high accuracy, they require larger computational resources than FaceNet, making them unsuitable for large-scale deployment. Therefore, in this paper, the FaceNet model was selected as the base model for our solution.
The general schematic of the system is shown in Fig. 1. First, in order to guard against fake images or masks, the user is asked to present 5 different poses of the face: the standard straight pose, and the head turned left, right, up and down. The head pose estimation algorithm checks for the presence of all 5 poses (if one is missing, the user is asked to retake the required image with the system's camera). Once all the poses have been captured and recognized correctly, the straight pose image is rotated and the face is extracted using Multi-task Cascaded Convolutional Networks (MTCNN) [13]. Only the extracted face is passed to FaceNet for the final recognition of the user.
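The two-step flow described above can be summarized in a short Python sketch. It only illustrates the control flow under stated assumptions: the names `identify_user`, `pose_of`, `align_and_crop` and `recognize`, as well as the pose labels, are hypothetical placeholders for the head pose estimator, the rotation and MTCNN cropping stage, and the FaceNet-based recognizer, none of which are implemented here.

```python
from typing import Callable, Dict, Optional

# Hypothetical labels for the five poses requested from the user.
REQUIRED_POSES = ("straight", "left", "right", "up", "down")


def identify_user(
    captures: Dict[str, object],
    pose_of: Callable[[object], str],
    align_and_crop: Callable[[object], object],
    recognize: Callable[[object], Optional[str]],
) -> Optional[str]:
    """Return the recognized user, or None if the five-pose check fails."""
    # Step 1: every required pose must be present and confirmed
    # by the (assumed) head pose estimator.
    for pose in REQUIRED_POSES:
        image = captures.get(pose)
        if image is None or pose_of(image) != pose:
            return None  # the caller should ask the user to retake this pose
    # Step 2: rotate and crop the straight-pose image, then recognize
    # the user from the extracted face only.
    face = align_and_crop(captures["straight"])
    return recognize(face)
```

In this sketch a failed pose check stops the process before any recognition is attempted, which mirrors the role of the pose check as the first security level.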
The numerical simulation has shown that, owing to the face rotation algorithm and the extraction of the face area only, the accuracy of the proposed model increased significantly to about 90-95%, compared with about 85% when no such preprocessing was applied. The proposed system shown in Fig. 2 uses two neural networks: MTCNN and FaceNet. Brief descriptions of both are presented in the following section.

A. The Multi-task Cascaded Convolutional Networks
The Multi-task Cascaded Convolutional Networks (MTCNN) model is a popular choice among Deep Convolutional Neural Networks (DCNN) for face detection and related facial feature recognition for the following reasons [24]:
1) Multi-tasking: MTCNN is a multi-task system capable of performing several tasks simultaneously, such as face detection, facial landmark (eyes, nose, mouth) localization, and bounding box detection. This allows MTCNN to provide detailed information about the face and its components, facilitating tasks such as face recognition, classification, and expression analysis.
2) High reliability: MTCNN uses convolutional neural networks (CNNs) for face detection and feature localization. The CNN is a proven, simple but powerful network for image processing tasks. This allows MTCNN to achieve high reliability in detection and robustness against noise, scaling, and lighting changes.
Face detection, landmark localization, and accurate bounding box regression are the three steps that make up the MTCNN model. These steps are designed so that faces can be found and identified in photographs of varied complexity, such as those with changes in size, position, and lighting. MTCNN has shown promising results in face identification and feature recognition in both still images and video. In this paper, MTCNN was used to detect the face in the image and crop it for user recognition.
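The cropping step mentioned above can be illustrated with a minimal Python sketch. It assumes that the detector returns a bounding box as `(x, y, width, height)` in pixel coordinates and that the image is a plain list of pixel rows; both the box layout and the helper name `crop_face` are assumptions made for illustration, not the interface of a specific MTCNN implementation.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def crop_face(image: Image, box: Tuple[int, int, int, int], margin: int = 0) -> Image:
    """Crop the (x, y, width, height) box from the image, clamped to its borders."""
    height = len(image)
    width = len(image[0]) if height else 0
    x, y, w, h = box
    # Clamp the (optionally enlarged) box so it never leaves the image.
    left = max(0, x - margin)
    top = max(0, y - margin)
    right = min(width, x + w + margin)
    bottom = min(height, y + h + margin)
    return [row[left:right] for row in image[top:bottom]]
```

Clamping matters in practice because a detected box for a face near the image border may extend partly outside the frame.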

B. FaceNet Network Model
FaceNet is a deep learning neural network proposed for extracting features from face images. FaceNet [24][25][26][27][28][29][30][31] was announced in 2015. It takes face images as input and outputs a vector of 128 features representing the face. With this feature vector, subsequent computation blocks run faster than they would on the whole image. In this paper, the FaceNet network was trained following Steps 1 to 4, which lead to the distance constraint of equation (1), where the parameter α is a margin that ensures the distance between the positive and negative samples is large enough.
An example of one training step of FaceNet is demonstrated in Fig. 3, with images from two different persons. Once the samples are randomly selected, the positive sample is moved closer to the anchor and the negative sample is moved further from the anchor. This increases the accuracy of the classification system and reduces the dependence on unnecessary features. The loss function is then given by equation (2).
The process continues until all images of the same face are close together and images of different faces are far apart. To achieve high training efficiency, the network must be adapted so that the anchor and negative images are as far apart as possible while the anchor and positive images are as close together as possible, as shown in equation (3).
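The training objective described above can be sketched in Python, assuming that the loss takes the standard triplet form known from the FaceNet literature: the squared anchor-positive distance minus the squared anchor-negative distance plus the margin α, clipped at zero. The function names and the default `alpha=0.2` are assumptions for illustration and are not taken from this paper.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    alpha: float = 0.2,  # assumed margin value, for illustration only
) -> float:
    """Hinge-style triplet loss for a single (anchor, positive, negative) triplet."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    # The loss is zero once the negative is at least alpha further away
    # (in squared distance) than the positive.
    return max(0.0, gap + alpha)
```

A triplet that already satisfies the margin contributes zero loss, so only triplets that violate the constraint drive the parameter updates.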

C. Head Pose Estimation
Head pose refers to the direction or position of a person's head relative to a frame of reference [35][36][37][38]. It is usually described by the rotation of the head about three axes: pitch, yaw, and roll. These three degrees of freedom (DoF) can be calculated using a spherical coordinate system to estimate the facial surface normal. Based on a set of fixed distance ratios from multiple individuals, the tilt angle is calculated for any viewing direction. In this case, the projected length of the surface normal is used in both coordinate axes. In addition, it was assumed that the main viewpoint was fixed and that the spherical coordinate system would be avoided if possible [39][40][41][42][43][44].
For the two images in Fig. 5, we start with the estimation of the head position using the 4-point method illustrated there. The first two points are the midpoints between the eyes and between the corners of the mouth. The third point lies on the axis perpendicular to the plane normal. The fourth point is calculated using equation (4), with the associated parameter chosen at approximately 0.45. The corresponding value in [7] was calculated from four measures of the data samples, as in equation (5).
The general relative nose length ratio can be found by equation (6).
The rotations about the corresponding two axes are calculated from the relative observation distance divided by the maximum nose height, as in equation (7).
Thai-Viet Dang, A Secured, Multilevel Face Recognition based on Head Pose Estimation, MTCNN and FaceNet
Based on Fig. 5 and Eqs. (2) to (7), the rotational angles of the human face can be represented as a function of head positions [45][46][47][48][49][50][51]. This is used in two ways in this paper. First, the formulas are used to check whether the user has presented all required head poses, which reduces the risk of fake input images being shown to the camera. Second, the straight-face image is rotated so that the face lines up with the horizontal and vertical axes. This image preprocessing improves the accuracy of the subsequent face detection and recognition [52][53][54][55][56][57][58].
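The first use, checking that all five poses were presented, can be sketched in Python once yaw and pitch angles are available from the head pose estimation. The sketch below takes these angles as given inputs. The 15-degree threshold, the sign conventions (negative yaw for left, positive pitch for up) and the function names are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

POSES = ("straight", "left", "right", "up", "down")


def classify_pose(yaw_deg: float, pitch_deg: float, threshold_deg: float = 15.0) -> str:
    """Map a (yaw, pitch) pair to one of five pose labels (assumed conventions)."""
    if abs(yaw_deg) < threshold_deg and abs(pitch_deg) < threshold_deg:
        return "straight"
    # The dominant rotation decides between the horizontal and vertical poses.
    if abs(yaw_deg) >= abs(pitch_deg):
        return "left" if yaw_deg < 0 else "right"
    return "up" if pitch_deg > 0 else "down"


def missing_poses(
    angles: Iterable[Tuple[float, float]], threshold_deg: float = 15.0
) -> List[str]:
    """List the required poses not covered by the captured (yaw, pitch) pairs."""
    seen = {classify_pose(yaw, pitch, threshold_deg) for yaw, pitch in angles}
    return [pose for pose in POSES if pose not in seen]
```

An empty result from `missing_poses` corresponds to a passed check; otherwise the returned labels indicate which images the user would be asked to retake.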

III. THE EXPERIMENTAL SYSTEM
Fig. 6 presents the hardware realization of the proposed face recognition solution used in this research. The system is based on an embedded computer (an 8GB Jetson Nano). It takes images from an IMX-219 camera and processes them for head pose estimation and for running the two neural networks (in inference/testing mode). The user is asked to allow the camera to capture 5 images: one with a straight face and four with the head turned left, right, up and down. A demonstration of the input images and the positive check of all poses is shown in Fig. 7. Once all the poses have been checked, the straight-face image is rotated to line up the face and input into MTCNN. The MTCNN was trained to detect the face area using databases of rotated images generated with head rotation estimation. This preprocessing of the images helps to improve the accuracy of MTCNN to 100% correct face detection on the testing images.
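The rotation that lines up the straight-face image can be illustrated with a small geometric sketch in Python. It is not necessarily the procedure used in this system: it assumes that the two eye centers are known in pixel coordinates and estimates the in-plane (roll) angle from the line joining them, which is one common way to level a face. The function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def roll_angle_deg(left_eye: Point, right_eye: Point) -> float:
    """In-plane tilt of the eye line relative to the horizontal axis, in degrees."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point: Point, center: Point, angle_deg: float) -> Point:
    """Rotate a point about a center by the given angle using the 2D rotation matrix."""
    theta = math.radians(angle_deg)
    x = point[0] - center[0]
    y = point[1] - center[1]
    return (
        center[0] + x * math.cos(theta) - y * math.sin(theta),
        center[1] + x * math.sin(theta) + y * math.cos(theta),
    )
```

Rotating the image content by the negative of `roll_angle_deg` brings both eyes to the same vertical coordinate, so that the face is aligned with the horizontal and vertical axes before detection.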
Next, FaceNet was trained with the cropped face-only images. Fig. 8 shows an example of an image cropped from Fig. 7(a) as the output of MTCNN, which is then used for training FaceNet.
One advantage of FaceNet is that it reduces the input image to a vector of 128 components used to recognize the users. For the purpose of this paper, a simple k-nearest neighbor (KNN) algorithm is sufficient to classify the actual feature vector against those from the learning database. With those simplifications, when using the Jetson Nano board as in the proposed system, the average processing throughput was 21 to 25 frames per second (fps). The accuracy and the fps of the proposed solution and some comparisons are given in Table I.
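The classification step can be sketched as a plain k-nearest neighbor vote over stored feature vectors. The sketch below is a minimal illustration in pure Python: the name `knn_classify`, the gallery layout as `(vector, label)` pairs, the Euclidean metric and the default `k=3` are assumptions, and in the real system the vectors would have 128 components.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def knn_classify(
    query: Sequence[float],
    gallery: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> str:
    """Return the majority label among the k gallery vectors closest to the query."""
    # Sort the stored vectors by squared Euclidean distance to the query.
    ranked = sorted(
        gallery,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(query, item[0])),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because each comparison involves only a short vector instead of a full image, a classifier of this kind adds little cost on top of the network inference.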

IV. CONCLUSION
This paper presented a new approach to face recognition that combines head pose estimation with the MTCNN and FaceNet networks to increase recognition efficiency and to guard against fake input images. The head pose estimation successfully detected the 5 main poses of the head (straight face, and head turned left, right, up and down), providing an additional check on the user's identity. The head pose information was also used to line up the straight-face image so that MTCNN could correctly identify the area containing the face. The final user recognition was done by FaceNet trained on the images generated by MTCNN. The proposed model is simple enough to run on a Jetson Nano board. When tested with an IoT system on the board, the accuracy achieved was 90-95%, with an image processing speed of 21-25 fps.

Fig. 1. The proposed schematic of the solution

II. THE PROPOSED SOLUTION FOR FACE RECOGNITION

3) High efficiency: MTCNN is designed to work fast and efficiently. With its sequential feature localization and detection stages, enriched with techniques such as convolutional networks and local awareness convergence, MTCNN can process images and videos of varying complexity at high speed.
4) Diverse applications: MTCNN can be applied to many different tasks, including facial recognition, age and gender classification, facial expression analysis, and security monitoring systems. The integration of MTCNN into DCNN opens up new opportunities for research and development of artificial intelligence applications related to image processing and face recognition.

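Fig. 2. The MTCNN structure with 3 layers: P-Net, R-Net and O-Net [24]

Step 1. Randomly select an image as the so-called anchor image.
Step 2. Randomly select an image of the same person and mark it as a positive image.
Step 3. Randomly select an image of another person and mark it as a negative image.
Step 4. Adjust the parameters of FaceNet such that the positive image is closer to the anchor image than the negative image; in terms of a distance measure, this is expressed by equation (1) [32].

$$\forall \left( f(x_i^a),\, f(x_i^p),\, f(x_i^n) \right) \in \mathcal{T} :\quad \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \tag{1}$$

Here $f(x)$ is the feature vector produced by FaceNet for image $x$, the superscripts $a$, $p$ and $n$ denote the anchor, positive and negative images, and $\mathcal{T}$ is the set of training triplets.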
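Steps 1 to 3 can be sketched in Python as a random triplet sampler. This is a minimal illustration under stated assumptions: the training set is represented as a dictionary from a person identifier to a list of image names, and the name `sample_triplet` is hypothetical. Step 4, the parameter update itself, is not covered here.

```python
import random
from typing import Dict, List, Tuple


def sample_triplet(
    images_by_person: Dict[str, List[str]], rng: random.Random
) -> Tuple[str, str, str]:
    """Randomly pick an (anchor, positive, negative) triplet of image names."""
    # The anchor's person needs at least two images so a distinct positive exists.
    eligible = [p for p, imgs in images_by_person.items() if len(imgs) >= 2]
    others_exist = sum(1 for imgs in images_by_person.values() if imgs) >= 2
    if not eligible or not others_exist:
        raise ValueError("need one person with 2+ images and another person with 1+")
    person = rng.choice(eligible)
    # Steps 1 and 2: anchor and positive are two different images of one person.
    anchor, positive = rng.sample(images_by_person[person], 2)
    # Step 3: the negative is an image of any other person.
    other = rng.choice(
        [p for p, imgs in images_by_person.items() if p != person and imgs]
    )
    negative = rng.choice(images_by_person[other])
    return anchor, positive, negative
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when comparing training runs.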

Fig. 3. A demonstration of FaceNet training on input images

Fig. 4. (a) Example of moving one positive sample closer to the anchor and one negative sample further from the anchor and (b) example of final results after many repeated training epochs

Fig. 6. The proposed hardware system for face recognition

Fig. 7. An example of checking the different poses of a user: (a) straight face, (b) head turned left, (c) right, (d) up and (e) down

Fig. 8.

TABLE I. COMPARATIVE RESULTS OF ACCURACY AND FPS FOR SELECTED METHODS