Leveraging a Two-Level Attention Mechanism for Deep Face Recognition with Siamese One-Shot Learning

Abstract: Discriminative feature embedding underpins large-scale face recognition. Many image-based face recognition networks use CNNs such as ResNets and VGG-nets as backbones. Humans give different parts of a face different importance, but standard CNNs treat all parts of a face image equally. In NLP and computer vision, attention mechanisms are used to learn the most important parts of an input signal. In this study, inter-channel and inter-spatial attention mechanisms are used to assess the significance of the components of a face image. Existing channel attention for face recognition computes one scalar per channel using Global Average Pooling (GAP). A recent study showed that GAP encodes only the lowest-frequency information of a channel. We therefore compress channels with the discrete cosine transform (DCT) instead of GAP, so that the channel attention mechanism can also use information at frequencies other than the lowest. Spatial attention is then applied, and the resulting feature map is passed to the subsequent layers. Together, channel and spatial attention improve the feature extraction of CNNs for face recognition. We study five arrangements of the attention blocks: channel-only, spatial-only, parallel, sequential with channel attention first, and sequential with channel attention after spatial attention. On a public dataset (Labeled Faces in the Wild), the proposed module outperforms current attention approaches for face recognition.


I. INTRODUCTION
In recent years, Convolutional Neural Networks (CNNs) have demonstrated their efficiency in various tasks by advancing the state of the art on them [1]-[5]. In particular, the performance of face recognition has been boosted to an unprecedented level by novel discriminative learning methods [6]-[11] and advanced deep learning architectures [1], [4], [5], [12]. Face recognition consists of two tasks [13], [14]:
• Verification: given two images, determine whether they belong to the same person or to two different persons, which is a binary classification problem.
• Identification: determine to which member of a gallery an image sample belongs.
Face recognition is a crucial tool in real-world scenarios. Compared to other biometric systems (such as fingerprint and iris recognition), it has the advantage of being non-intrusive and can be performed at a distance [15]-[17]. Face recognition can be used in law enforcement, for example to help the police quickly narrow down possible suspects by automatically retrieving suspect images from mug-shot databases.
The main objective of face recognition is to find discriminative features that meet the following criterion: under a suitably chosen metric space, the minimal inter-class distance is greater than the maximal intra-class distance. Several research works have been carried out in this context to obtain discriminative features. In particular, Sun et al. [18] proposed a Deep face IDentification-verification approach that uses a contrastive loss to obtain efficient and effective feature representations that enlarge inter-personal variations while reducing intra-personal differences. In [19], Wen et al. proposed a center loss to improve the discriminative power of the deeply learned features. The center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. The authors demonstrated that, by jointly using the softmax loss and their proposed center loss, inter-class dispersion and intra-class compactness can be achieved as much as possible. Schroff et al. [20] used a triplet loss for face recognition with the goal of enlarging the inter-class distance while reducing the intra-class distance. Moreover, angular margin-based loss functions have recently demonstrated their power and achieved state-of-the-art results in face recognition. For instance, Liu et al. [8], aiming to learn angularly discriminative features, proposed an angular softmax (A-Softmax) loss function for CNNs. In [6], the authors demonstrated that the softmax loss lacks the power of discrimination in face recognition and proposed a large margin cosine loss (LMCL) to achieve minimum intra-class variance and maximum inter-class variance by virtue of normalization and cosine decision margin maximization. Deng et al. [10] investigated the issues of the above angular loss functions and proposed an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition.
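The discriminative criterion stated above can be checked directly on a set of embeddings. The following is a minimal sketch, assuming Euclidean distance as the metric; the function names and the toy embeddings are hypothetical and only illustrate the criterion.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def is_discriminative(embeddings, labels):
    """Return True if the minimal inter-class distance exceeds the maximal intra-class distance."""
    intra, inter = [], []
    for (x, lx), (y, ly) in combinations(zip(embeddings, labels), 2):
        (intra if lx == ly else inter).append(euclidean(x, y))
    if not intra or not inter:
        raise ValueError("need at least one same-class pair and one different-class pair")
    return min(inter) > max(intra)


# Two well-separated identities satisfy the criterion.
separated = is_discriminative([[0, 0], [0, 1], [5, 5], [5, 6]], ["a", "a", "b", "b"])
# An identity whose samples are farther apart than its nearest impostor does not.
overlapping = is_discriminative([[0, 0], [0, 4], [1, 0]], ["a", "a", "b"])
```

The loss functions surveyed above can all be read as different ways of pushing a network toward embeddings for which this check holds.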
In addition to the loss functions mentioned above, attention mechanisms have been an active research area in recent years for improving deep learning models, from NLP to computer vision, and face recognition models have also benefited from them. In face recognition, most deep learning models use CNN architectures such as ResNet and VGG as their backbones. For instance, Liu et al. [8] and Wang et al. [6] adopted a residual-style architecture with 64 layers as the feature extractor. In [10], Deng et al. adopted ResNet-50 and ResNet-101 [1] as the backbone of their architecture. According to human cognition, different parts of a face image have different importance in face recognition, whereas standard CNNs treat all parts of a face image equally. In addition, the feature map generated by standard CNNs has many channels, possibly leading to information redundancy among channels. This redundancy can also put the features at risk of over-fitting even if regularization methods such as dropout or l1 and l2 norms are used. The most important parts of an input signal can be learned through attention mechanisms, which are widely used in different areas of deep learning and have achieved great success in NLP [22]-[25], image classification [2], [26]-[29], image segmentation [30], [31], saliency detection [32], [33], etc. Researchers have also used attention mechanisms to enhance network performance and address the above-mentioned challenges in face recognition. More specifically, Ling et al. [34] adopted channel attention and spatial attention in their face recognition model. Additionally, Rao et al. [35] proposed an attention-aware deep reinforcement learning (ADRL) method for video face recognition. Their method discards misleading and confounding frames and generates a compact, refined feature for face recognition in face videos.
For channel attention, the current mechanisms used in face recognition [34]-[37] employ a scalar to represent each channel through Global Average Pooling (GAP). However, GAP struggles to capture complex information for various inputs because of its simplicity. In contrast to previous works, we consider the scalar representation of a channel as a compression problem and use the discrete cosine transform (DCT) to compress the channels for the channel attention mechanism. Then, spatial attention is applied, and the resulting feature map can be fed into subsequent layers. This two-level (channel and spatial) attention module can therefore make standard CNNs (ResNets, VGGNets, etc.) more powerful in extracting discriminative features for face recognition tasks.
The goal of this paper is to train a deep neural network that takes an image and returns its identity. As mentioned previously, this involves face verification, where the model determines whether two given images belong to the same person or to two different persons, and face identification, where the model returns the identity of the person by using the gallery. We assume that the given image contains one and only one face.
Deep learning (DL) methods often require many labeled samples per class to perform well on a task. However, in real-world scenarios, acquiring such a tremendous amount of training data is challenging. Therefore, a model that can rely on one or a few images per person for training is preferred. One-shot learning was proposed to deal with this type of task: in one-shot learning, the training data consists of one or a few images per class. The main contributions of this paper are summarized as follows.
• We exploit the advantage of one-shot learning with siamese CNNs and propose a face recognition method.
• In our approach, two-level attention CNNs are used to extract rich features from the given images, and those features are fed to the classification network for one-shot learning.
The rest of this paper is organized as follows. Section II discusses the related work, and Section III presents the preliminaries. Section IV provides the methodology and the proposed model. Section V gives our experimental results, and the paper is concluded in Section VI.

II. RELATED WORK
This section provides recent works that have been carried out in the context of face recognition and approaches that use attention mechanisms in computer vision.

A. Face Recognition Methods
The accuracy of face recognition started improving in the early 2000s through engineered features, more specifically with local binary patterns [38], [39]. However, when applied to unconstrained face databases, the performance of such feature engineering approaches is low. Since 2012, when AlexNet won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC 2012) [4], deep learning approaches have been employed in several computer vision tasks such as image classification, image recognition, and so forth. Afterwards, more advanced deep learning architectures such as Inception [5], ResNet [1], VGG-16 and VGG-19 [40], and others [41] were employed for image classification tasks. The architecture of these deep learning approaches consists of two main parts: the feature extraction component, which extracts rich features from the raw inputs, and the classification or recognition network, which uses the features extracted by the first component to perform the task at hand.
When it comes to face recognition using deep learning, Hu et al. [42] provided a good introduction to the convolutional [...]. Sun et al. [18] demonstrated that by using both face identification and verification signals as supervision, intra-personal variations could be reduced while enlarging inter-personal distances. In their proposal, the identification task enlarges the inter-class distances, while the verification task reduces the intra-class variations by pulling together the features extracted from the same identity.
The authors of [44] derived a face representation by employing explicit 3D face modeling to apply a piecewise affine transformation, improving the face recognition model's performance.
In [45], aiming to address the class imbalance issue, Guo and Zhang proposed a one-shot face recognition model by aligning the norms of weight vectors of underrepresented classes and normal classes, thus giving the one-shot classes an equal weight.
To deal with deficient training samples in face recognition, Wang et al. [46] proposed a framework that equips CNNs with a balancing regularizer and shifting center regeneration, which bring the norms of the weight vectors to the same scale and adjust the cluster centers.
Ding et al. [47] proposed a generative learning approach to synthesize meaningful data for one-shot classes by adapting the data variances from other normal classes. Jadhav et al. [48] employed a deep attribute representation of faces to address the problem of one-shot unconstrained face recognition. They used specific attributes of human faces, like hair and gender, to fine-tune a deep CNN for face recognition.
In [49], Wu et al. proposed a hybrid method for face recognition in which a CNN and a nearest neighbor (NN) model are combined.
In addition, Hong et al. [50] proposed a domain adaptation network for one-shot tasks. They employed domain-adversarial training by generating synthetic images in various poses using a 3D face model.
Cheng et al. [51], aiming to produce a better representation for low-shot learning, proposed an Enforced Softmax optimization approach built upon CNNs. The proposed model leverages optimal dropout, selective attenuation, normalization, and model-level optimization.
Cao et al. [52] proposed a dataset called VGGface2 and used ResNet-50 as the network backbone structure to train the face recognition model using the proposed dataset.In the same way, [6], [8], and [53] used ResNets as the backbone structure of their network.
When it comes to the siamese network, it was initially proposed by Bromley et al. [54] for the signature verification task.Further, it was used for image recognition by [55].
Song et al. [56] proposed a mask learning technique which learns how to discard occlusion features and only uses the non-occluded facial areas for face recognition with the siamese network.

B. Attention-based Methods
The attention mechanism, often combined with gating functions and sequential approaches, was first proposed for NLP tasks to learn the most informative parts of an input signal. In recent years, the attention mechanism has been widely used for image classification and other areas of DL. More specifically, Wang et al. [28] proposed a residual attention network that is built by stacking attention modules to produce attention-aware features. The proposed attention module was incorporated into a deep residual network, which performs well on noisy input signals.
Hu et al. [2] proposed a squeeze-and-excitation module that learns the inter-channel relationships of convolutional features through global average pooling (GAP). This module can be used in CNNs to improve the feature extraction process.
Woo et al. [57] proposed a Convolutional Block Attention Module (CBAM) for feed-forward CNNs. The proposed module sequentially infers channel and spatial attention maps, which are then multiplied with the input feature map to adaptively refine it. Our proposed model follows a similar pattern, except that instead of using GAP for channel attention, we adopt the DCT, which is widely used in image compression.
Wang et al. [58] proposed a non-local neural network that leverages the self-attention model of [59] for the video classification task.
Zhang et al. [60] proposed a self-attention-based module for better image generation which finds global dependencies within internal representations.
Even though current deep learning approaches achieve high performance in face recognition, the standard CNNs used as backbones consider all parts of face images equally. In addition, as architectures become deeper, the huge number of channels generated in the convolution layers introduces information redundancy. We believe that the performance of face recognition models can still be improved with a minor change in the network architecture, at the cost of only a tiny increase in the number of learnable parameters and the computational cost. We achieve the performance improvement of face recognition through channel and spatial attention employed sequentially. In contrast to [34], we consider channel attention as a compression task and employ DCT instead of GAP to compress each channel into a scalar.
Arkan Mahmood Albayati, Leveraging a Two-Level Attention Mechanism for Deep Face Recognition with Siamese One-Shot Learning, Journal of Robotics and Control (JRC), ISSN: 2715-5072

III. PRELIMINARIES
This section provides an overview of the main components we rely on for our deep face model.

A. Siamese Network
The siamese network is a deep neural network architecture that consists of two identical sub-networks sharing the same weights. The sub-networks take two different raw inputs, and their encoded features are joined by a similarity or comparison function. Fig. 1 illustrates the Siamese network. More specifically, the two sub-networks are used to calculate the similarity between the two raw input faces. Because the sub-networks are identical and share the same weights, the architecture is called "Siamese". In this paper, the convolutional Siamese

Loss(x
where $i$ denotes the $i$-th index of the current batch, $Y(x_1^i, x_2^i)$ is a vector of true labels and $Y_{pre}(x_1^i, x_2^i)$ is a vector of predicted labels.
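As an illustration of the quantities defined above, the following sketch assumes that the verification loss is a plain binary cross-entropy between the true labels $Y$ and the predicted labels $Y_{pre}$. This is an assumption made for illustration only; the function name `pair_loss` and the clipping constant `eps` are hypothetical.

```python
import math


def pair_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch of image pairs.

    y_true[i] is 1 if pair i shows the same person and 0 otherwise;
    y_pred[i] is the predicted probability that pair i shows the same person.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


uninformative = pair_loss([1, 0], [0.5, 0.5])  # equals log(2)
confident_right = pair_loss([1, 0], [0.9, 0.1])
confident_wrong = pair_loss([1, 0], [0.1, 0.9])
```

Under this assumption, confident correct predictions give a small loss and confident wrong predictions a large one, which is the behaviour a verification loss needs.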
Then, it can be written as given in Equation ( 3).

C. Neural Architecture Search
Neural Architecture Search (NAS) is the process of automating the design of a neural network's topology in order to achieve the best performance on a specific task. The goal is to design the architecture using limited resources and with minimal human intervention. It is a search algorithm with the following features:
• It operates on the search space of possible network topologies, consisting of predefined operations (e.g., convolutional, recurrent, pooling, and fully connected layers) and their connections.
• A controller then chooses a list of possible candidate architectures from the search space.
• The candidate architectures are trained and ranked based on their performance on the validation set.
• The ranking is used to readjust the search and obtain new candidates.
• The process iterates until it reaches a certain condition and provides the optimal architecture.
• The optimal architecture is evaluated on the test set.
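The steps listed above can be sketched as a toy search loop. This is a minimal illustration under stated assumptions, not the search procedure used in this paper: the controller is a simple "keep the top half, resample the rest" rule, the search space is a small grid of frequency index pairs, and `evaluate` is a hypothetical stand-in for training a candidate and measuring its validation accuracy.

```python
import random


def search(search_space, evaluate, rounds=5, population=4, seed=0):
    """Toy architecture search: sample, rank on a validation score, readjust, repeat."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    # Controller step: choose an initial list of candidates from the search space.
    candidates = rng.sample(search_space, population)
    for _ in range(rounds):
        # Rank the candidates by their (stand-in) validation performance.
        ranked = sorted(candidates, key=evaluate, reverse=True)
        top_score = evaluate(ranked[0])
        if top_score > best_score:
            best, best_score = ranked[0], top_score
        # Readjust the search: keep the top half and draw fresh candidates.
        keep = ranked[: population // 2]
        candidates = keep + rng.sample(search_space, population - len(keep))
    return best, best_score


# Hypothetical search space of 2D frequency indices (u, v) on a 4 x 4 grid.
space = [(u, v) for u in range(4) for v in range(4)]


def evaluate(candidate):
    """Stand-in validation score that peaks at (1, 2); a real run would train a model."""
    return -(abs(candidate[0] - 1) + abs(candidate[1] - 2))


best, best_score = search(space, evaluate)
exhaustive_best, exhaustive_score = search(space, evaluate, rounds=1, population=len(space))
```

The final evaluation of the selected candidate on a held-out test set is left out of the sketch.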
For more details about NAS, readers can refer to [62], [63] and [64]. In this paper, we use NAS to search for the best frequency component for channels in channel attention.

IV. PROPOSED MODEL
The details of our proposed architecture are presented in this section. First, the general framework is discussed; then, the channel and spatial attention blocks are described, and the mechanism that aggregates both blocks is detailed.

A. Architecture Overview
We propose to employ both channel and spatial attention to learn the most important parts of face images in face recognition and to mitigate the information redundancy caused by the huge number of channels generated by standard CNNs. When combined, these attention modules enable our network to adaptively learn the inter-channel and inter-spatial relationships. Therefore, the global feature relationships of input face images are learned, and a more discriminative feature is obtained for face recognition. Fig. 2 presents the channel and spatial attention blocks that we integrated into a ResNet-50 to obtain refined features for face recognition. More specifically, the channel attention block and the spatial attention block sequentially learn the channel relationship matrix and the spatial relationship matrix. Then, the refined feature is obtained through matrix multiplication. We detail these attention blocks in the next subsections.

B. Channel Attention
This subsection details the channel attention block. This block produces a channel attention relationship matrix that models the inter-channel relationships of a given feature map. It focuses on "what" is meaningful given an input face image. Channel attention is widely used in computer vision tasks. More specifically, channel attention uses a scalar to represent and measure the importance of each channel of the feature map. Let $F \in \mathbb{R}^{C \times H \times W}$ be the feature map of a convolution layer, where $C$ denotes the number of channels or feature detectors, $H$ the height of each channel and $W$ the width of each channel. Since the widely used scalar for channel attention represents the entire channel, computing it can be seen simply as a compression problem. Therefore, channel attention can be written as given in Equation (5).
$$att = \mathrm{sigmoid}\big(f_c(\mathrm{compress}(F))\big) \tag{5}$$
where $att \in \mathbb{R}^{C}$ is the attention vector, sigmoid and $f_c$ are, respectively, the sigmoid activation function and a mapping function such as a fully-connected layer or a 1 × 1 convolution layer, and $\mathrm{compress}: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{C}$ is a compression technique. The current channel attention used in face recognition, such as [34], uses GAP as the compression technique. Global max pooling [57] and global standard deviation pooling [65] are other compression methods. After obtaining $att$, each channel of $F$ is scaled by the corresponding attention value as given in Equation (6).
$$\tilde{F}_{:,i,:,:} = att_i \, F_{:,i,:,:} \tag{6}$$
where $i \in \{0, 1, ..., C-1\}$, $\tilde{F}$ denotes the output of the attention mechanism, $att_i$ is the $i$-th element of the attention vector and $F_{:,i,:,:}$ is the $i$-th channel of the feature map $F$.
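The two steps of Equations (5) and (6) can be sketched in a few lines. The example below is a minimal illustration that uses GAP as the compression technique and a single linear map as $f_c$; the weight matrix `weights` and the toy feature map are hypothetical, and the batch dimension is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gap_compress(feature):
    """Compress a C x H x W feature map to C scalars by global average pooling."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def channel_attention(feature, weights, compress=gap_compress):
    """att = sigmoid(f_c(compress(F))), with f_c a single C x C linear map (hypothetical)."""
    z = compress(feature)
    mapped = [sum(w * v for w, v in zip(row, z)) for row in weights]
    return [sigmoid(m) for m in mapped]


def scale_channels(feature, att):
    """Multiply every value of channel i by att[i]."""
    return [[[a * v for v in row] for row in ch] for a, ch in zip(att, feature)]


feature = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0, mean 4.0
    [[0.0, 0.0], [0.0, 0.0]],  # channel 1, mean 0.0
]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned f_c weights
att = channel_attention(feature, identity)
refined = scale_channels(feature, att)
```

Passing a different `compress` function is all that is needed to swap GAP for another compression technique, which is the design point the rest of this subsection builds on.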
In this paper, instead of using GAP or the other above-mentioned methods as the compression technique, we use the DCT. In [66], Qin et al. demonstrated that GAP is a weighted sum of inputs and proved that GAP is a special case of the 2D DCT. Specifically, replacing h and w by 0 in Equation (3) yields Equation (7), in which GAP means global average pooling and $f^{2d}_{0,0}$ is the lowest frequency component of the 2D DCT, which corresponds to GAP. It follows that current face recognition approaches that employ channel attention with GAP preserve only the lowest-frequency information. The information at other frequencies is lost, although it also encodes useful information for representing the channels and therefore needs to be taken into account. Accordingly, by using the 2D DCT, more information can be preserved by considering other frequencies in addition to the lowest frequency component (i.e., GAP).
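To make the relation between GAP and the lowest frequency component concrete, the following short derivation assumes the unnormalized 2D DCT of an $H \times W$ channel $x^{2d}$. The symbol $x^{2d}$ and the omission of normalization constants are assumptions made here; constant factors do not affect the argument.

```latex
\begin{align}
f^{2d}_{h,w} &= \sum_{i=0}^{H-1}\sum_{j=0}^{W-1} x^{2d}_{i,j}\,
  \cos\!\left(\frac{\pi h}{H}\left(i+\tfrac{1}{2}\right)\right)
  \cos\!\left(\frac{\pi w}{W}\left(j+\tfrac{1}{2}\right)\right), \\
f^{2d}_{0,0} &= \sum_{i=0}^{H-1}\sum_{j=0}^{W-1} x^{2d}_{i,j}
  = H\,W \cdot \mathrm{GAP}\!\left(x^{2d}\right).
\end{align}
```

Both cosine factors equal 1 when $h = w = 0$, so the lowest frequency component is the plain sum of the channel, that is, GAP up to the constant factor $HW$.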
The DCT used in this paper for channel attention is described as follows. Given an input feature map $F$, we split it along the channel dimension $C$ into $n$ parts $F^{(0)}, F^{(1)}, ..., F^{(n-1)}$, where $C$ is divisible by $n$. We assign to each part $F^{(i)}$ its corresponding 2D DCT frequency component, and the obtained results can be considered as the compression results of channel attention. More specifically, the equations of such a transformation are expressed in Equation (8), where $Freq^{i} \in \mathbb{R}^{C'}$ is the resulting compression vector with dimension $C'$, $i \in \{0, 1, ..., n-1\}$, and $[u_i, v_i]$ are the 2D indices of the frequency component corresponding to $F^{(i)}$. The entire compression vector is given in Equation (9) by concatenation.
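The split-and-compress procedure described above can be sketched as follows. This is a minimal illustration, assuming an unnormalized 2D DCT basis and equal-sized parts of $C' = C/n$ channels; the function names and the toy feature map are hypothetical.

```python
import math


def dct_component(channel, u, v):
    """Unnormalized 2D DCT coefficient (u, v) of one H x W channel."""
    H, W = len(channel), len(channel[0])
    total = 0.0
    for h in range(H):
        for w in range(W):
            basis = (math.cos(math.pi * u * (h + 0.5) / H)
                     * math.cos(math.pi * v * (w + 0.5) / W))
            total += channel[h][w] * basis
    return total


def multi_spectral_compress(feature, freq_indices):
    """Split the C channels into n equal parts and compress part i with its own (u_i, v_i).

    The per-part results are concatenated into one vector of length C.
    """
    C, n = len(feature), len(freq_indices)
    if C % n != 0:
        raise ValueError("C must be divisible by n")
    part_size = C // n
    compressed = []
    for i, (u, v) in enumerate(freq_indices):
        for channel in feature[i * part_size:(i + 1) * part_size]:
            compressed.append(dct_component(channel, u, v))
    return compressed


feature = [
    [[1.0, 2.0], [3.0, 4.0]],  # part 0, compressed with (0, 0)
    [[1.0, 2.0], [3.0, 4.0]],  # part 1, compressed with (0, 1)
]
freq = multi_spectral_compress(feature, [(0, 0), (0, 1)])
```

With the index pair (0, 0), the first entry equals the sum of the channel, i.e. $HW$ times its GAP value, while the second entry captures horizontal variation that GAP discards.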
From the last two equations, it can be seen that the DCT is a generalization of GAP that uses more frequencies and therefore provides a richer representation. However, having several candidate frequency components for each part raises the question of how to choose $[u_i, v_i]$, the frequency component indices of each part, effectively. To address this, we used Neural Architecture Search (NAS) to search for the best frequency components. More specifically, a set of continuous variables $\alpha = \{\alpha_{(u,v)}\}$ is assigned to each part $F^{(i)}$, as written in Equation (11),
where $O$ contains all 2D DCT frequency component indices. Once the search is performed, the frequency component of each part $F^{(i)}$ is expressed in Equation (12).
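One common way to set up such a search is a softmax relaxation over the continuous variables, in the style of differentiable architecture search. The block below is a sketch under that assumption and may differ from the exact form of Equations (11) and (12); the names $Freq^{i}_{nas}$ and $\mathrm{2DDCT}^{u,v}$ are introduced here for illustration.

```latex
\begin{align}
Freq^{i}_{nas} &= \sum_{(u,v)\in O}
  \frac{\exp\left(\alpha_{(u,v)}\right)}
       {\sum_{(u',v')\in O}\exp\left(\alpha_{(u',v')}\right)}\,
  \mathrm{2DDCT}^{u,v}\!\left(F^{(i)}\right), \\
(u_i^{*}, v_i^{*}) &= \arg\max_{(u,v)\in O}\ \alpha_{(u,v)}.
\end{align}
```

During the search, every candidate frequency contributes to the compression in proportion to its softmax weight; after the search, each part keeps only the index pair with the largest $\alpha$.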
The selected 2D DCT frequency components are summed, and the resulting vector $F_c$ is multiplied by the input feature map $F$ to obtain $F'$, which is passed to the spatial attention block. The channel attention block is summarized in Fig. 3.

C. Spatial Attention
While channel attention focuses on "what", spatial attention focuses on "where" the meaningful information is located given an input face image, which complements the channel attention. It generates a spatial attention matrix by using the inter-spatial relationships between features. In [67], Zagoruyko et al. demonstrated that pooling operations along the channel axis effectively highlight informative regions. Therefore, to generate the spatial attention map, we first apply both average pooling and max pooling in parallel. Then, an efficient feature descriptor is obtained by concatenating the outputs of these operations. Finally, we apply a convolution layer to produce the spatial attention matrix, which encodes where to suppress or emphasize. This attention block is summarized in Equation (13).
The visual detail of the spatial attention is presented in Fig. 4.
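The pooling, concatenation and convolution steps of the spatial attention block can be sketched as follows. The kernel size of the convolution is not fixed by the description above, so this minimal example assumes a 1 × 1 convolution followed by a sigmoid; the weights `w_avg`, `w_max` and `bias` are hypothetical.

```python
import math


def spatial_attention(feature, w_avg, w_max, bias=0.0):
    """Spatial attention map (H x W) for a C x H x W feature map.

    Average and max pooling are taken along the channel axis, the two pooled
    maps are stacked, and a 1 x 1 convolution plus sigmoid gives the attention map.
    """
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])
    attention = []
    for h in range(H):
        row = []
        for w in range(W):
            values = [feature[c][h][w] for c in range(C)]
            avg_pool = sum(values) / C
            max_pool = max(values)
            z = w_avg * avg_pool + w_max * max_pool + bias  # 1 x 1 convolution
            row.append(1.0 / (1.0 + math.exp(-z)))
        attention.append(row)
    return attention


feature = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[3.0, 0.0], [0.0, -2.0]],
]
att_map = spatial_attention(feature, w_avg=1.0, w_max=1.0)
```

Positions with strong activations across channels receive values close to 1, and positions with weak or negative activations are suppressed.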

D. Face Feature Embedding with Attention Mechanism
To understand our proposed attention blocks intuitively, we arrange them sequentially and plot them in Fig. 5. More specifically, for an intermediate feature X, a channel attention matrix is first generated through the channel attention module, and matrix multiplication is applied to get the weighted feature. Then, element-wise summation is applied to get a channel-refined feature. Next, the spatial attention module generates the spatial attention matrix, and matrix multiplication followed by element-wise summation leads to a spatial-refined feature. We applied batch normalization to enable the network to converge fast, and the residual shortcut leads to the feature map X_R for the subsequent layers.
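The sequential arrangement can be sketched as below. This is a simplified reading of the description above, assuming that each multiplication broadcasts the attention values over the feature map and that each summation adds the block's own input back; batch normalization is omitted, and the attention values are passed in directly as hypothetical inputs.

```python
def refine(feature, channel_att, spatial_att):
    """Apply channel attention, then spatial attention, each followed by a residual sum.

    feature: C x H x W nested lists; channel_att: C values; spatial_att: H x W values.
    """
    # Channel step: weight every channel, then add the input back.
    channel_refined = [
        [[v * a + v for v in row] for row in channel]
        for a, channel in zip(channel_att, feature)
    ]
    # Spatial step: weight every position, then add the channel-refined feature back.
    spatial_refined = [
        [[v * s + v for v, s in zip(row, att_row)]
         for row, att_row in zip(channel, spatial_att)]
        for channel in channel_refined
    ]
    return spatial_refined


x = [[[1.0, 2.0]], [[4.0, 8.0]]]  # C = 2, H = 1, W = 2
out = refine(x, channel_att=[0.5, 0.0], spatial_att=[[1.0, 0.0]])
```

Because of the residual sums, an attention value of 0 leaves the corresponding values unchanged instead of zeroing them, so the attention blocks can only re-weight features on top of what the backbone already provides.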

V. EXPERIMENTS
In this section, we describe our experimental setup. More specifically, we detail the dataset used for training, the experiment parameters, and the obtained results. We also perform ablation studies to demonstrate the efficiency of our proposed attention blocks.

A. Dataset
In this study, because of the limited computing resources available, we used only one public dataset, "Labeled Faces in the Wild" (LFW) [68], for training and validation. We chose the LFW dataset because it is a public dataset widely used in the research community for benchmarking face recognition methods. It consists of 13,000 images of faces collected from the web. It is a labeled dataset where each image is associated with a specific person. Some people have more than one image: 1680 people have two or more distinct photos in the LFW dataset. The dataset contains images with different facial positions in order to maintain consistency and ensure robustness. Images with slight variations, such as facial hair, and images with obstructions, such as headgear and eyewear, are also included. We discarded classes with fewer than 15 images and ended up with 96 classes in total. We used 91 classes for training and 5 classes for testing. We split the training classes according to an 80-20% split for training and validation, resulting in 72 classes for training and 19 for validation.

B. Experiment Settings
We employed a deep funneling method [69] to align face images.All images were resized to have a size of 112 × 112.We made small augmentations by horizontally flipping images with a probability of 0.5.
Our model was trained with an initial learning rate η of 0.00005 and a step-based decay at a uniform rate of 1%. We set an early stopping condition that stops the training when the validation accuracy does not improve after 5 iterations. The maximum number of iterations was set to 500. We initialized the momentum for each layer to 0.5, evolving linearly until reaching a final value of 0.9. The batch size was set to 8. We initialized the biases with the default setting of zeros in all layers, while the Glorot uniform initializer was used to initialize the weights of all layers of the network. This initializer draws samples from the uniform distribution over $[-g, g]$, where $g$ is expressed as follows:
$$g = \sqrt{\frac{6}{fan_{in} + fan_{out}}}$$
where $fan_{in}$ and $fan_{out}$ represent, respectively, the number of input units and the number of output units in the weight tensor.
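The initializer and the learning rate schedule described above can be sketched as follows. The Glorot limit follows the standard definition; the decay function assumes that the learning rate shrinks by 1% at every decay step, which is one reading of the schedule, and all function names are hypothetical.

```python
import math
import random


def glorot_limit(fan_in, fan_out):
    """Bound g of the Glorot uniform distribution U[-g, g]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix from U[-g, g]."""
    g = glorot_limit(fan_in, fan_out)
    return [[rng.uniform(-g, g) for _ in range(fan_out)] for _ in range(fan_in)]


def decayed_learning_rate(step, initial=0.00005, rate=0.01):
    """Assumed schedule: the learning rate is multiplied by (1 - rate) at every decay step."""
    return initial * (1.0 - rate) ** step


weights = glorot_uniform(3, 3, random.Random(0))
lr_after_10_steps = decayed_learning_rate(10)
```

For a layer with 3 input and 3 output units the bound is exactly 1, so every sampled weight lies in [-1, 1].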
The ResNet-50 with an attention block on top of each residual block is trained with the triplet loss, expressed as follows:
$$L(a, p, n) = \max\big(d(a, p) - d(a, n) + margin,\ 0\big)$$
where $a$, $p$, and $n$ denote the anchor, positive and negative images, respectively, and $d(\cdot, \cdot)$ is the distance between the embeddings of two images. The margin defines the minimum gap required between the anchor-negative distance and the anchor-positive distance.
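A minimal sketch of the triplet loss on embedding vectors is given below. It assumes squared Euclidean distance between embeddings, and the margin value used in the example is hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is farther from the anchor than the positive by at least the margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


# The positive is close and the negative is far: the triplet is already satisfied.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=0.5)
# Roles swapped: the "positive" is far and the "negative" is close, so the loss is large.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.5)
```

Only triplets that violate the margin contribute to the gradient, which is why the choice of triplets matters in practice.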
N-way one-shot learning is used for evaluation, where N denotes the number of support classes. We experiment with 3 values of N, namely, 5, 10, and 20.
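The N-way one-shot evaluation protocol can be sketched as follows. In this minimal illustration, each trial compares one query embedding against N support embeddings (one per class) and predicts the closest one; Euclidean distance stands in for the similarity score of the siamese network, and the toy trials are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def n_way_predict(query, support):
    """Return the index of the support embedding closest to the query."""
    distances = [euclidean(query, s) for s in support]
    return distances.index(min(distances))


def n_way_accuracy(trials):
    """Percentage of trials whose nearest support example is the true class.

    Each trial is a tuple (query, support, true_index) with len(support) == N.
    """
    correct = sum(1 for q, s, t in trials if n_way_predict(q, s) == t)
    return 100.0 * correct / len(trials)


support = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # a 3-way support set
trials = [
    ([0.1, 0.0], support, 0),  # correct
    ([1.9, 2.0], support, 2),  # correct
    ([0.9, 1.0], support, 0),  # wrong: closest to class 1
    ([1.0, 1.2], support, 1),  # correct
]
accuracy = n_way_accuracy(trials)
```

Larger values of N make each trial harder, since the query must be matched against more candidate classes.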

C. Models Architecture
We adopt the ResNet-50 [1] as the feature extractor. We used siamese one-shot learning for face verification and KNN for face identification. The structure of the residual bottleneck is BN-Conv-BN-PReLU-Conv-BN. A kernel size of 3 × 3 is used in all convolution blocks. The numbers of channels in the different blocks are 64, 128, 256, and 512, respectively, as presented in Table I. Our proposed model adds an attention module, consisting of channel and spatial attention, on top of each residual block, as shown in Table I. The last layer of the architecture is a fully-connected layer with an output dimension of 512, which represents the feature vector of the face image.

D. Experiment Results and Performance Comparison
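A set of experiments is carried out in this part to demonstrate the performance of our proposed model. We also design a set of ablation use cases: (1) using only the channel attention, (2) using only the spatial attention, (3) using both in parallel, (4) using both sequentially, with the channel attention followed by the spatial attention, and (5) using both sequentially, with the channel attention after the spatial attention.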
In every ablation study, the ResNet-50 is used as the feature extractor. The squeeze-and-excitation network is a kind of attention that was proposed in [2]. In [2], Hu et al. used Global Average Pooling for channel attention. However, as mentioned previously, GAP keeps only the lowest-frequency information for channel attention, and the information at other frequencies is lost although it also encodes useful information for representing the channels. Therefore, in the performance comparison part, we implemented a ResNet-50 with a squeeze-and-excitation block on top of every residual block.
In addition, we chose to compare our proposal with [70] because, to the best of our knowledge, it is a recent face recognition work that uses the one-shot learning concept. Reference [34] employed both channel and spatial attention for face recognition using the LFW dataset, although its authors used GAP in their channel attention block, in contrast to our proposal, where we use DCT to include the information of other frequencies in addition to the lowest-frequency information. However, the authors of [34] did not use One-Shot learning.
In this paper, all models are evaluated in terms of one-shot face recognition accuracy. The performance of the different models and the comparison of our proposed attention module with the state of the art are presented in Table II. In Table II, (1), (2), (3), (4), and (5) at the bottom of the table denote the designed ablation use cases.
First, we trained our One-Shot siamese network using ResNet-50 as the backbone network without attention. We achieved recognition accuracies of 97.08, 97.34, and 97.27 on 5-way, 10-way and 20-way One-Shot learning, respectively. Including only the channel attention increases the recognition accuracy to 98.45, 98.18, and 98.04 on 5-way, 10-way and 20-way One-Shot learning, respectively. The use of only spatial attention achieves 97.88, 98.15 and 98.02, respectively, on 5-way, 10-way and 20-way, indicating that spatial attention is less important than channel attention, since channel-only attention obtained better results than spatial-only attention. Then, we applied both channel and spatial attention in parallel and achieved recognition accuracies of 98.38, 98.42 and 98.25 on 5-way, 10-way and 20-way One-Shot learning, respectively. Using channel and spatial attention together increased the recognition accuracy over the model without attention, meaning both are important in our face recognition system. We then arranged the attention blocks sequentially. Placing the channel attention block before the spatial attention block achieved the highest results among all our implemented models: 98.58, 98.89 and 98.74 on 5-way, 10-way and 20-way One-Shot learning. Using spatial attention before channel attention is the second-best model, with recognition accuracies of 98.50, 98.80 and 98.70 on 5-way, 10-way and 20-way One-Shot learning, respectively.
Finally, we compared the performance of our proposed model with two baselines, namely [70] and [2]. In [70], no attention mechanism was used. The recognition accuracies of the proposal of Chanda et al. [70] on 5-way, 10-way and 20-way are 97.00, 97.50 and 95.5, respectively, which is confirmed by our model without attention mechanism in Table II. However, the use of squeeze and excitation (attention with GAP) of the second baseline [2] achieved recognition accuracies of 97.63, 97.90 and 96.60 on 5-way, 10-way and 20-way, respectively.
Based on the above performance comparison, our best model outperforms the two baselines, demonstrating the advantage of DCT over GAP. This is because, instead of using only the lowest-frequency information as GAP does, DCT also includes the meaningful information of other frequencies, boosting the recognition accuracy of our face recognition model.

VI. CONCLUSION AND FUTURE WORK
In this paper, we have investigated the use of channel and spatial attention blocks to boost deep face recognition with siamese One-Shot learning. The proposed attention module learns the global feature relationships of aligned face images with the aim of reducing the information redundancy among channels and focusing on the most relevant parts of face feature maps. Current face recognition models use GAP in the channel attention block. Instead, we consider the scalar representation of a channel as a compression problem and use the discrete cosine transform (DCT) to compress the channels for the channel attention mechanism, so that the information of other frequencies is considered in addition to that of the lowest frequency. Our attention module employs both channel and spatial attention arranged sequentially, with channel attention before spatial attention. On a public dataset, we demonstrated that our attention module, when integrated into ResNet-50, achieves better results than the existing attention mechanisms used in deep face recognition.

Fig. 1. Architecture of the Siamese network.

Fig. 2. Our proposed attention modules with ResNet. The figure shows the exact position of our channel and spatial attention blocks in a ResNet architecture.


TABLE I

TABLE II. EXPERIMENT RESULTS AND PERFORMANCE COMPARISON