Smart Home Management System with Face Recognition Based on ArcFace Model in Deep Convolutional Neural Network

In recent years, artificial intelligence has proved its potential in many fields, especially computer vision. Facial recognition is one of the most essential tasks in computer vision, with prospective applications ranging from academic research to intelligence services. In this paper, we propose an efficient deep learning approach to facial recognition. Our approach uses the ArcFace model with a MobileNetV2 backbone in a deep convolutional neural network (DCNN), together with assistive techniques that make the extracted facial features more discriminative. With facial authentication combined with hand gesture recognition, users can monitor and control their home through a mobile phone, tablet, or PC, and can exchange data with smart devices through IoT technology. The proposed model achieves an accuracy of 97% at a processing speed of 25 FPS. The smart home interface demonstrates successful real-time operation.


INTRODUCTION
In the Fourth Industrial Revolution, information technology and data science have strongly supported automated production. Applications of this breakthrough have produced a myriad of achievements, including smart monitoring systems [1], advanced transportation infrastructure [2], automated financial systems [3], and industrial assembly robot manipulators [4]. Data science includes digital transformation, data collection, and data processing. Furthermore, applications of artificial intelligence (AI) and big data processing are gradually becoming popular. The main applications of AI are natural language processing, computer vision [61,62], machine learning [63], data processing [64], and optimal decision making.
Following this trend, we exploit a powerful deep learning approach to facial and hand gesture recognition for IoT control management systems [27]-[39]. A local database of identified people supports both security and privacy. Face recognition through surveillance cameras requires high accuracy and good processing speed [5,10,18,22-26]. A typical example is the MTCNN model, which uses three networks (P-Net, R-Net, and O-Net) for face detection with very high accuracy (99.6%) [6]. However, its processing speed is low because of the many network layers: tested on a Jetson Nano (quad-core ARM Cortex-A57 CPU, 128-core Maxwell™ GPU), it reaches only 4 FPS on 720x1280-pixel input images. This makes MTCNN challenging to integrate into the system, particularly on low-resource embedded devices. Therefore, we suggest an ArcFace model based on a MobileNetV2 backbone to optimize face recognition. This technique reduces the load on the convolutional layers by replacing the Conv2D layer with a depthwise separable convolution layer. Next, we use the output of MobileNetV2 as the input to the SSD (Single Shot MultiBox Detector) network for face detection. The results show that this model is well suited to low-resource mobile and embedded devices (score: 96%, 22 FPS on a Jetson Nano with a quad-core ARM Cortex-A57 CPU and a 128-core Maxwell™ GPU).
By comparing a captured face with the pre-selected facial features in the image database, the system can verify whether the person's face matches the dataset. One approach, LBP (Local Binary Pattern), converts the input image to a binary image, divides the face into blocks, and computes a histogram per block to obtain the histogram feature [7]. However, feature extraction from the histogram can be affected by external factors such as input image quality and lighting. Another approach uses the Dlib algorithm combined with HOG and SVM [8]. Its considerable drawback is that accuracy may drop when the angle of the face changes. In addition, FaceNet [9], a well-known face recognition algorithm, uses the triplet loss function to calculate the distance between face vectors. Nevertheless, this algorithm has two disadvantages: the number of operations the computer must perform grows combinatorially as the volume of input data increases, and the features of different faces can overlap. To minimize these effects, we apply the ArcFace model, which, in addition to calculating the distance between face vectors, introduces an angle θ and an additive angular margin m that separate the features of face vectors. ArcFace improves on FaceNet in that this separation helps to avoid misidentification when the original image resembles a photo taken from a frontal angle. Therefore, we use the ArcFace architecture with a MobileNetV2 backbone in a deep convolutional neural network (DCNN). The model operates on the input image as follows: a camera captures the image of a person in front of the door, and the ArcFace model then recognizes the person in real time.
Besides, the Jetson Nano embedded computer has been extensively applied to facial recognition because of its outstanding performance and hardware support [11]. Among embedded computers in this segment [43]-[50], the Jetson Nano is the superior choice in terms of processing speed and support for many machine learning frameworks such as TensorFlow, Keras, PyTorch, and MediaPipe.
The facial recognition results used for security are integrated with hand gestures for controlling the smart home. Real-time monitoring data is displayed through the smart home interface. The system model includes an IoT interface based on the Tkinter GUI library [12]. In practical tests, face images are detected with an accuracy of 92% to 97% and a fast inference speed of 25 FPS. With small datasets and limited training resources, the model is more efficient and faster than state-of-the-art models that require larger datasets for training and processing. The contributions of this research are as follows: based on the proposed MobileNetV2-based facial recognition [40]-[42], the performance of the smart home management system is ensured and its security is enhanced. Moreover, the results can be applied to intelligent IoT systems and smart services [35]-[39].
The rest of the paper is organized as follows. Section II describes related work on facial recognition with FaceNet and with the ArcFace model based on the MobileNetV2 backbone. Section III briefly introduces the proposed experimental system. Section IV presents practical experiments and compares them with recent state-of-the-art methods. Finally, Section V presents the conclusion.

A. MobileNetV2
In this section, we introduce the core features of MobileNetV2, the optimization of the loss function, and the model architecture improved from the FaceNet model. MobileNetV2 uses depthwise separable convolutions and additionally introduces linear bottlenecks and the inverted residual block (shortcut connections between bottlenecks) [13]. The MobileNetV2 residual block is the opposite of the traditional residual architecture, which has a larger number of channels at the input and output of a block than in the intermediate layers.
Within the inverted residual block, we also use depthwise separable convolutions to minimize the number of model parameters, which reduces the size of the MobileNet model. The detailed structure of a depthwise separable convolution bottleneck with residuals is shown in Fig. 2. In the era of mobile networks, the need for lightweight, real-time networks is growing. However, many identification networks cannot meet real-time requirements because they have too many parameters and computations. To solve this problem, the proposed method uses the MobileNetV2 backbone, which achieves superior performance compared with other modern methods on databases of facial expressions and features. The block applies a depthwise convolution for feature extraction and a linear pointwise convolution to combine the output features while reducing the network size. After the size reduction, ReLU6 is replaced with a linear function so that the output channel size matches the input. MobileNetV2 is very useful when applied to SSD to reduce latency and improve processing speed. The SSD model [14] is a fast and efficient object detector with a shorter processing time than YOLO [15][16] and Faster R-CNN [17]. Therefore, the SSD uses the MobileNetV2 network to extract the feature map and adds extra layers after MobileNetV2 to predict the object. Finally, this paper proposes to use MobileNetV2 and SSD for face detection and recognition in security systems.
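To illustrate why depthwise separable convolutions reduce model size, the following minimal Python sketch counts the weights of a standard convolution, a depthwise separable convolution, and an inverted residual block. It is not taken from the implementation used in this work: the function names are hypothetical, biases and batch normalization are ignored, and the expansion factor of 6 is an assumed default.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes the channels
    return depthwise + pointwise


def inverted_residual_params(c_in: int, c_out: int,
                             expansion: int = 6, kernel: int = 3) -> int:
    """Weights in a 1x1 expansion, a depthwise convolution, and a 1x1 linear projection.

    The expansion factor is an assumption made for this illustration.
    """
    hidden = c_in * expansion
    expand = c_in * hidden                # 1x1 convolution that widens the block
    depthwise = kernel * kernel * hidden  # depthwise filtering in the wide space
    project = hidden * c_out              # linear bottleneck (no activation)
    return expand + depthwise + project


if __name__ == "__main__":
    standard = standard_conv_params(3, 32, 64)
    separable = depthwise_separable_params(3, 32, 64)
    print(standard, separable, round(standard / separable, 2))
```

For a 3x3 layer with 32 input and 64 output channels, the separable form needs roughly one eighth of the weights of the standard convolution, which is the saving that the text above relies on.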

B. FaceNet Embeddings
FaceNet uses Inception modules in blocks to reduce the number of trainable parameters [9]. The model takes a 160×160 RGB image and generates a 128-d embedding vector for each image. FaceNet is used here for feature extraction in face recognition: it decomposes facial features into vectors and then uses the triplet loss function to calculate the distance between face vectors, as shown in Fig. 4. Instead of conventional loss functions that only compare the network output with the ground truth of the data, triplet loss uses three inputs: the anchor (the network output for a reference photo), the positive (a photo of the same person as the anchor), and the negative (a photo of a different person from the anchor).
$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \tag{1}$$
where $\alpha$ is the margin (a distance) between the positive and negative pairs, that is, the minimum necessary deviation between the two values, and $f(x)$ is the embedding of image $x$. The formula shows that the distance between the two embeddings $f(x_i^a)$ and $f(x_i^p)$ has to be at least $\alpha$ less than the distance between $f(x_i^a)$ and $f(x_i^n)$. In other words, after training, the difference between the two sides of the formula is as large as possible (meaning that $x_i^a$ gets close to $x_i^p$).
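The triplet constraint is usually trained as a hinge loss that is zero once the constraint holds. The following Python sketch shows this for a single triplet of embeddings. It is a minimal illustration under stated assumptions rather than the training code of this work: the function names are hypothetical and the margin value of 0.2 is only an example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge form of the triplet constraint for one (anchor, positive, negative) triplet.

    The loss is zero when the anchor-positive distance plus the margin alpha
    is already smaller than the anchor-negative distance.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(d_pos - d_neg + alpha, 0.0)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    positive = [0.1, 0.0]
    easy_negative = [1.0, 0.0]  # far from the anchor: constraint satisfied
    hard_negative = [0.2, 0.0]  # close to the anchor: constraint violated
    print(triplet_loss(anchor, positive, easy_negative))
    print(triplet_loss(anchor, positive, hard_negative))
```

Only triplets that violate the margin contribute a non-zero loss, which is why the choice of triplets matters so much when training with this objective.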

C. ArcFace model
Deep convolutional neural network (DCNN) models have proven to be outstanding and are a popular choice for extracting facial features. To make recognition from facial feature vectors more accurate, there are two main directions for building a classification model: using the triplet loss function or using the softmax loss function.
In Fig. 5, FaceNet is an encoding process in which a convolutional neural network encodes the image into a 128-dimensional vector. These vectors are then fed to the triplet loss function to evaluate the distances between them. However, the triplet loss function has its own shortcomings. Because it compares three samples at a time, the number of triplets grows combinatorially as the amount of data increases, which significantly increases the number of training iterations. Furthermore, the supposedly optimal strategy for triplet loss, semi-hard sample mining, is quite difficult to apply effectively. The softmax loss function is normally used for face recognition problems [18]. It combines the cross-entropy loss function with the softmax activation function [19]. With the softmax loss, the size of the linear transformation matrix increases in proportion to the number of classes to classify, as in equation (2).
$$L_{\text{softmax}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}} \tag{2}$$
where $x_i \in \mathbb{R}^d$ represents the deep feature of the $i$-th sample, which belongs to class $y_i$. The embedded feature size $d$ is set to 512.
$W_j \in \mathbb{R}^d$ represents the $j$-th column of the weight matrix $W \in \mathbb{R}^{d \times n}$ and $b_j \in \mathbb{R}$ is the bias. The batch size and the number of classes are $N$ and $n$, respectively. The method first fixes the bias $b_j = 0$. It then transforms the logit to $W_j^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j$, where $\theta_j$ is the angle between the weight vector $W_j$ and the feature vector $x_i$. Next, the algorithm fixes the individual weight norm $\|W_j\| = 1$ by $\ell_2$ normalization. The embedding norm $\|x_i\|$ is likewise fixed by $\ell_2$ normalization and re-scaled to $s$. The feature and weight normalization step makes the predictions depend only on the angle between the feature and the weight. The features are thus distributed over a hypersphere with a radius of $s$.
Since the embedded features are distributed around each feature center on the hypersphere, an additive angular margin penalty $m$ between $x_i$ and $W_{y_i}$ is added to enhance intra-class compactness and inter-class discrepancy at the same time. Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty on the normalized hypersphere, the method is called ArcFace:
$$L_{\text{ArcFace}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\, j\neq y_i}^{n} e^{s\cos\theta_j}}$$
The dot product between the feature vector from the DCNN model and the final fully connected layer is equal to the cosine distance of the normalized feature $x_i$ and weight $W$, which gives $\cos\theta_j$ (the logit) for each class as $W_j^{T}x_i$. We compute $\arccos(\cos\theta_{y_i})$ to get the angle between the feature $x_i$ and the ground truth weight $W_{y_i}$. Next, the angular margin penalty $m$ is added to the angle $\theta_{y_i}$. Then we compute $\cos(\theta_{y_i} + m)$ and multiply all logits by the feature scale $s$. The logits are then fed into softmax to derive the probability distribution over the labels. Finally, the ground truth vector (the one-hot label) and the probability contribute to the cross-entropy loss.
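The steps described above (normalize, take the arccosine, add the margin, rescale, apply softmax cross-entropy) can be sketched for a single sample in plain Python. This is an illustrative sketch under stated assumptions, not the training code of this work: the function names are hypothetical, the default values s = 64 and m = 0.5 are assumed for illustration only, and the sketch does not handle the case where the angle plus the margin exceeds π.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def arcface_logits(feature, class_weights, label, s=64.0, m=0.5):
    """Scaled logits with the angular margin m added to the ground truth class only."""
    x = l2_normalize(feature)
    logits = []
    for j, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(x, l2_normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if j == label:
            theta = math.acos(cos_theta)      # angle to the ground truth weight
            cos_theta = math.cos(theta + m)   # additive angular margin penalty
        logits.append(s * cos_theta)          # rescale to the hypersphere radius s
    return logits


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]  # two class centers (illustrative)
    feature = [1.0, 1.0]                # equally close to both centers
    with_margin = cross_entropy(arcface_logits(feature, weights, 0), 0)
    without_margin = cross_entropy(arcface_logits(feature, weights, 0, m=0.0), 0)
    print(with_margin, without_margin)
```

With the margin enabled, a feature that lies between two class centers is penalized more heavily than without it, which pushes features of the same class closer to their own center during training.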
To demonstrate the effectiveness of the ArcFace loss compared with the traditional softmax loss, Fig. 7 shows the role of the margin m and the angle θ in separating features. The simulated output of the softmax function is shown in Fig. 7(a), whereas the classes trained with the ArcFace loss show no overlap and clear gaps between them in Fig. 7. The ArcFace model has been evaluated through extensive experiments on many public datasets [20,21]. The obtained results show better performance than existing approaches [22,23]. Therefore, the ArcFace model is very useful and effective when applied to face recognition in security systems.

B. Software
• Use PyQt5, a toolkit for designing graphical user interfaces (GUIs). It is a Python binding for Qt, one of the most popular and influential cross-platform GUI libraries.
• Use deep learning frameworks and libraries such as Keras, TensorFlow, and MediaPipe, which are powerful and easy-to-use Python libraries.

C. Facial recognition process
The face recognition process usually consists of four main steps: face detection, face normalization, face feature extraction, and matching, as shown in the block diagram in Fig. 9.
The face recognition system uses a Jetson Nano, which is capable of handling multiple video streams [24]. The Jetson Nano is connected to the camera module and stores the acquired camera images. An observation device such as a TV or LCD monitor is connected to the Jetson Nano. The Jetson Nano also supports Internet access for smart devices [35,37] such as mobile phones, laptops, and desktop computers (Fig. 10). MobileNetV2 is used to produce the feature maps on which detection from the input images is based. These feature maps at different scales are then enhanced by the FPN (Feature Pyramid Network) module at the output of the network to improve the performance of the back-end detection network. Facial recognition uses an SSD detector based on the MobileNetV2 backbone and the FPN technique, as shown in Fig. 11.
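The final matching step of the process can be illustrated with a short Python sketch that compares a query embedding with stored reference embeddings by cosine similarity. It is a minimal sketch under stated assumptions and not the matching code of this system: the function names, the gallery layout (a dictionary from user name to embedding), and the threshold of 0.5 are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_face(embedding, gallery, threshold=0.5):
    """Return (name, score) of the best match, or (None, score) if below the threshold.

    gallery maps a user name to a reference embedding; the threshold is an
    illustrative value that would need tuning on real data.
    """
    best_name, best_score = None, -1.0
    for name, reference in gallery.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score  # unknown person: authentication is rejected
    return best_name, best_score


if __name__ == "__main__":
    gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
    print(match_face([0.9, 0.1, 0.0], gallery))
    print(match_face([0.0, 0.0, 1.0], gallery))
```

Rejecting matches below a threshold is what allows the system to treat an unknown face as unauthorized instead of assigning it to the nearest enrolled user.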

D. Hand gestures recognition
The MediaPipe library supports the following solutions: face detection, face mesh, hand detection, and human pose estimation. Hand detection provides a hand skeleton with landmarks 0 to 20, as shown in Fig. 12. Based on hand gestures, the computer converts the detected signals into control signals for managing the smart home, as shown in Fig. 13.
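As an illustration of how the 21 hand landmarks could be turned into control signals, the following Python sketch counts the raised fingers and maps the count to a command. It is only a sketch under stated assumptions: it takes the landmarks as a plain list of (x, y) points in image coordinates (y grows downward) instead of calling MediaPipe, it assumes an upright hand, it ignores the thumb, and the command names are hypothetical and are not the gestures used in this system.

```python
# Landmark indices of the fingertips and of the joints below them
# (index, middle, ring, little finger) in the 21-point hand skeleton.
FINGER_TIPS = (8, 12, 16, 20)
FINGER_PIPS = (6, 10, 14, 18)

# Hypothetical mapping from the number of raised fingers to a command.
GESTURE_COMMANDS = {
    0: "all_off",
    1: "light_on",
    2: "fan_on",
    3: "door_open",
    4: "all_on",
}


def count_raised_fingers(landmarks):
    """Count fingers whose tip lies above its middle joint in the image."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    return sum(
        1
        for tip, pip in zip(FINGER_TIPS, FINGER_PIPS)
        if landmarks[tip][1] < landmarks[pip][1]  # smaller y means higher in the image
    )


def gesture_to_command(landmarks):
    """Translate a hand pose into one of the illustrative commands."""
    return GESTURE_COMMANDS[count_raised_fingers(landmarks)]


if __name__ == "__main__":
    hand = [(0.5, 0.5)] * 21  # closed fist: no tip is above its joint
    hand[8] = (0.45, 0.2)     # raise the index finger
    hand[12] = (0.50, 0.2)    # raise the middle finger
    print(gesture_to_command(hand))
```

A rule of this kind needs no training, whereas a learned gesture classifier, as trained in the control procedure of this paper, can support more symbols.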

E. System interface
After face recognition is completed and combined with hand gesture recognition, automatic smart home management is launched, as shown in Fig. 14.
The system takes face input data as shown in Fig. 9. Initially, users enter their personal information (name, ID, …) into the system. The camera detects and scans the face, recording 6 photos that cover the frontal angle, the left and right angles of 10 degrees, and the upturned chin angle. Then the facial recognition system starts the process, records the features, and saves the face parameters into the database in the form of a hypersphere in which each point is a recognized face. This process takes 30 seconds to 1 minute. For each face added, the algorithm trains the model to recalculate the added features for the corresponding node on the hypersphere.
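One simple way to picture a face stored as a point on a hypersphere is to average the embeddings of the enrollment photos and rescale the result to unit length. The Python sketch below does this. Note that it is a simplification chosen for illustration and swaps in template averaging for the retraining step described above; the function names and the database layout (a dictionary from user ID to template) are hypothetical.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def enroll_user(database, user_id, embeddings):
    """Store one unit-length template per user, built from several photo embeddings.

    Averaging is an assumption made for this sketch; the system described
    in the text recalculates the features by training the model.
    """
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    mean = [sum(e[k] for e in embeddings) / len(embeddings) for k in range(dim)]
    database[user_id] = l2_normalize(mean)
    return database[user_id]


if __name__ == "__main__":
    database = {}
    photos = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for the enrollment photo embeddings
    print(enroll_user(database, "ID001", photos))
```

Keeping every template at unit length means that a later comparison depends only on the angle between embeddings, in line with the angular formulation of ArcFace.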
The face recognition algorithm is used to train the ArcFace model. Furthermore, hand gesture recognition supports automatic control of the smart home. The database of user faces can be updated. When a face ID is successfully detected, the control system executes the function commands.

EXPERIMENTAL RESULTS

A. Practical face recognition
The training dataset for the facial authentication model consists of 60 images of 10 people, with 6 images per person taken from different face angles. Testing on the same dataset with 4 frames per class, we evaluate the ArcFace model in comparison with LBP [7] and DLIB [8]. ArcFace shows high accuracy and a high frame rate, specifically 95-97% and 21-25 FPS, as shown in Fig. 15 and Table 1. A typical study of the DLIB algorithm [8] implemented on the Jetson Nano achieves a face recognition speed of 8.9 FPS, as listed in Table 2. However, this algorithm has the disadvantage that changing the viewing angle of the face reduces the efficiency of the recognition model.

Table 2. Face recognition speed of the DLIB algorithm [8].
Embedded system    Face recognition (FPS)
Jetson Nano        8.9

We evaluate the ArcFace face recognition algorithm on nine faces with different face angles. The model achieves an accuracy of 92% to 94%, as shown in Fig. 16. These experimental results are clearly better than those of the DLIB model. In the experimental system, the ArcFace model based on the MobileNetV2 backbone gives the best performance when deployed on the Jetson Nano embedded computer. This model has an accuracy of about 92% to 97% and a frame rate of 25 FPS, much better than previous facial recognition models.
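For readers who want to reproduce this kind of measurement, the following Python sketch shows one way to compute recognition accuracy and frames per second. It is not the evaluation script used for the results above: the function names are hypothetical, and the processing function passed in the example is a placeholder for the real recognition call.

```python
import time


def measure_fps(process_frame, frames, clock=time.perf_counter):
    """Average number of frames processed per second over a list of frames."""
    start = clock()
    for frame in frames:
        process_frame(frame)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(frames) / elapsed


def accuracy(predicted, expected):
    """Fraction of predictions that equal the expected labels."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


if __name__ == "__main__":
    frames = list(range(100))  # stand-ins for camera frames
    fps = measure_fps(lambda frame: sum(range(1000)), frames)
    print(f"{fps:.1f} FPS")
    print(accuracy(["a", "b", "c", "d"], ["a", "b", "c", "x"]))
```

The clock is passed in as a parameter so that the timing logic can be checked with a fake clock, independently of the speed of the machine.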

B. Control management procedure
The system operates step by step in real time.
• Step 1: Log in for the first time with an account and password (Fig. 17).
• Step 2: Update the user's face information in the main interface (Fig. 18). All the facial recognition steps in Fig. 9 are performed here.
• Step 3: Train the hand gesture recognition model (Fig. 19). We add more symbols related to smart home functions.
• Step 4: Use the hand gesture dataset to control the smart home automatically (Fig. 20). All IoT smart devices in room 1 and room 2 are monitored and controlled.

V. CONCLUSIONS
In this paper, a facial recognition model has been proposed for user authentication. Moreover, hand gesture recognition supports home control management through multiple automated and remote functions. ArcFace improves on the FaceNet model by increasing facial recognition accuracy. Smart home security is ensured by the high accuracy and fast processing time of the facial recognition model. We use the ArcFace architecture with a MobileNetV2 backbone and an additive angular margin to achieve significant facial recognition results. In practical tests, face images are detected with an accuracy of 92% to 97% and an inference speed of 25 FPS. With small datasets and limited training resources, the ArcFace model is more efficient and faster than previous state-of-the-art models that require larger datasets for training and processing. In future work, we will improve the ArcFace backbone to recognize human faces at different angles as well as real-time facial expressions. We will also strengthen face anti-spoofing methods to increase system security. Finally, the results can be applied to intelligent IoT systems and smart services.

CONFLICT OF INTEREST
The author declares no conflict of interest.