GAN-Generated Face Detection Based on Multiple Attention Mechanism and Relational Embedding


The rapid development of the Generative Adversarial Network (GAN) has made generated face images increasingly indistinguishable to the eye, and the detection performance of previous methods degrades seriously when the testing samples come from out-of-sample datasets or have been post-processed. To address these problems, we propose a new relational embedding network that learns "what to observe" and "where to attend" from a relational perspective for the task of generated face detection. In addition, we design two attention modules to effectively exploit global and local features. Specifically, the dual-self attention module selectively enhances the representation of local features along both the spatial and channel dimensions. The cross-correlation attention module computes the similarity between images to capture global information. We conducted extensive experiments to validate our method; the proposed algorithm effectively extracts the correlations between features and achieves satisfactory generalization and robustness in generated face detection.

Introduction
The Generative Adversarial Network (GAN) [18] has gradually been applied to many fields since it was proposed [34,39,57]. With its rapid development, generated images are becoming more and more realistic, as shown in Figure 1 for face images generated by GANs. Such GAN-generated face images are difficult to distinguish with the human eye and can be easily produced by ordinary people. If they are used for malicious purposes, they may adversely affect individuals' reputations and even social security and ethics. Therefore, the detection of GAN-generated face images has become increasingly necessary.
In recent years, some researchers [27,50,51] have verified the authenticity of images by actively embedding watermarks into them. Besides, many passive forensic methods have been proposed. The works in [1,7,16,19,23,45,49,59] detect natural and generated faces by exploring differences in the image formation process, applying traditional forensic methods to generated face detection. However, when confronted with fake face images in which only local areas are generated, searching for feature differences directly over the entire face image may lead to detection failure. Therefore, [4,5,8,38] also combine local information such as artifacts to assist detection. Although the methods described above achieve relatively high detection accuracy, they suffer from poor generalization and a lack of interpretability [21]. [21,22,26,64] try to make the results interpretable by looking for inconsistencies with physiology-based methods. However, with the continuous innovation of GANs, the difference between generated and natural images in the spatial domain becomes increasingly difficult to detect [68]. As a consequence, [6,9,13,14,17,37,43,68] turned their attention to the frequency domain, improving generalization by fusing features from the spatial and frequency domains. But these methods cannot adaptively capture the most discriminative features.
To address the above challenges, we propose a method that is inspired by relational networks [56] and combines dual-self attention (DSA) and cross-correlation attention (CCA) to learn "what to observe" and "where to attend" from an image-relations perspective. We achieve this goal by exploiting relational patterns within and between images through the Relational Embedding Network (RENet). The DSA module learns a feature's internal associations in order to enrich semantic information. The CCA module calculates the correlations between images so that features carry global information. Our contribution can be summarized as follows:

Figure 1
Some samples in the experimental datasets. From left to right, the columns are fake faces generated by ProGAN [30], StyleGAN [31], StyleGAN2 [32], StarGAN [11], BEGAN [2], LSGAN [46], WGAN-GP [20], and RelGAN [42], respectively. The generated faces are indistinguishable by the human eye.

Related Work
Fake face detection. With the development of GAN and fake-face technology, researchers have proposed many methods to detect fake faces. Hu et al. [26] and Guo et al. [21] found that the highlights in the two eyes of GAN-generated faces are inconsistent. Guo et al. [22] found that the pupil shape of real faces is close to circular or elliptical, while the pupils of generated faces show irregular and inconsistent shapes. Although physiology-based inconsistency detection is interpretable, accuracy is greatly reduced if the inconsistencies are occluded or the viewing angle is biased. Nataraj et al. [49] extracted co-occurrence matrices from the RGB channels of the image and fed them into a neural network for classification. Barni et al. [1] reported a significant impact of the correlation between color channels on detection effectiveness. Besides individual RGB channels, they also calculated co-occurrence matrices from pairwise combinations of channels. Their experiments showed that multiple channels can further improve detection robustness compared with single-channel color information. However, the accuracy and robustness of detection in the RGB domain alone are far from satisfactory. Chen et al. [7] combined dual color spaces and designed an improved Xception network to increase detection robustness. Guo et al. [23] used adaptive convolution to predict manipulation traces in an image, and then maximized manipulation artifacts by updating weights through backpropagation. Liu et al. [45] analyzed that detecting fake faces by texture is more robust, so they proposed GramNet to capture long-range texture information and improve the robustness and generalization of the model. Despite the commendable detection performance of previous methods, relying solely on the spatial domain still presents limitations [14]. Moreover, the frequency domain of images has been widely applied in various fields [10,54,69,71,72]. Frank et al. [14] found that there are significant differences between the DCT spectra of real and fake images, and that the DCT spectrum is more robust for detecting image manipulation than the RGB representation. Liu et al. [43] argued that upsampling is a necessary step in most face-forgery techniques, which leads to significant changes in the frequency domain, particularly the phase spectrum. Thus, they captured upsampling artifacts in face forgery by combining spatial images with phase spectra, achieving good detection results. Luo et al. [67] discovered that noise in face regions is continuously distributed in real images, while in manipulated images it appears smoother or sharper. Therefore, they employed the high-pass SRM filter to extract high-frequency noise for detecting face forgery. Le et al. [35] utilized frequency-domain knowledge distillation to retrieve the removed high-frequency components in the student network, enhancing the detection accuracy of low-resolution images. Although analyzing other domains can improve the accuracy and robustness of detection, these methods focus on global features of the image and typically struggle to detect subtle local tampering. Chai et al. [4] proposed the patchCNN network, which truncates the network to focus on local artifacts; their experiments showed that local texture information can enhance the model's generalization ability. Jia et al. [28] designed a dual-branch network for predicting image-level and pixel-level fake labels based on inter-image and intra-image inconsistency, processed by stable wavelet decomposition. Ju et al. [5] introduced FPN modules into Xception and reduced the number of convolutional layers to detect locally generated faces; the model performs well when detecting faces with small generated regions. However, these methods overemphasize local features and ignore the relationship between global and local features. In contrast, we integrate both global and local features, making the framework more robust and giving it better generalization ability.
Attention model. The human visual system receives more data than it can process, so the brain must weigh the inputs and focus only on necessary information [24]. For this reason, researchers have adopted a similar concept in their models. Vaswani et al. [58] first proposed the self-attention mechanism and applied it to machine translation; it reveals image structure through similarities within a domain. Recent work [40,41,48,52,62] has shown that self-attention can effectively capture contextual relationships and improve the intra-class compactness of images. In addition to connections within an image, relationships between images form a central part of various problems in computer vision: computing the relationship between two images has been applied to video action recognition, few-shot classification, semantic segmentation, medical image segmentation, style transfer, and more. Recently, some GAN-generated face detection algorithms [5,7] have adopted self-attention to enhance semantic information, but they do not account for connections between images. Inspired by Wu et al. [63], we introduce not only self-attention within each image but also cross-attention between images to enhance the ability to classify relevant regions.
Relation network. The relational network [56] is a metric-based network structure whose core idea is to map images through a learnable embedding function to extract features of interest and then distinguish classes by measuring the similarity of features between samples. Attention can thus be used to correct and strengthen the feature regions of interest. In this paper, multiple attention mechanisms are employed after the embedding function to rectify the network's focus and boost the expression of pertinent regions. This approach improves the network's generalization and robustness.

Approach
In this section, we introduce the Relational Embedding Network (RENet), which is designed to address the poor generalization and robustness of GAN-generated face detection. As shown in Figure 2, the overall architecture consists of three modules: a basic representation module, a feature augmentation module, and a representation comparison module; the network parameters of the latter two are listed in Table 2. The basic representation module and the representation comparison module are similar to those of the relational network. On this basis, we add a feature augmentation module composed of dual-self attention (DSA) and cross-correlation attention (CCA). We provide a brief overview of RENet in Section 3.1, and then introduce the implementation details of DSA and CCA in Sections 3.2 and 3.3, respectively.

Architecture Overview
In this paper, we treat real images from the same dataset, or fake images generated by the same type of GAN, as one domain, which is similar to the setting of other related tasks [19,58]. The training set for each domain is denoted D_train, and the test set is denoted D_test. Both D_train and D_test are divided into multiple episodes for training, each of which contains a query set Q = {(I^q, y^q)} and a support set S = {(I^s_l, y^s_l)}_{l=1}^{NK} with N categories and K images per category. As shown in Figure 2, given a pair of support and query images {I^s_l, I^q}, each of size C×H×W, they pass through a shared embedding network to generate the corresponding features f^s_1, ..., f^s_NK and f^q.
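The episodic setup above can be sketched as follows. This is a minimal illustration of sampling one N-way, K-shot episode with a query split; the label names, `domain_images` structure, and helper name are our own assumptions, not the paper's code.

```python
import random

def sample_episode(domain_images, n_way=2, k_shot=1, n_query=1):
    """Build one training episode from a domain.

    domain_images maps a class label (e.g. 'real' or 'fake') to a list of
    image ids. Returns (support, query) lists of (image_id, label) pairs.
    """
    classes = random.sample(sorted(domain_images), n_way)
    support, query = [], []
    for label in classes:
        # draw k_shot + n_query distinct images, then split them
        picks = random.sample(domain_images[label], k_shot + n_query)
        support += [(img, label) for img in picks[:k_shot]]
        query += [(img, label) for img in picks[k_shot:]]
    return support, query
```

Each episode thus yields NK support images and a disjoint set of query images from the same domain, mirroring the Q and S sets defined above.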

Figure 2
Overall architecture of RENet. We feed f^s and f^q from the shared embedding network into the DSA module to obtain locally enhanced features f^s_Dual and f^q_Dual. Following CCA, we obtain F^s and F^q, both of which contain global information. Finally, they are concatenated along the channel dimension, and the category with the highest score determines the classification result.

The support set features f^s_1, ..., f^s_NK from the same domain are denoted f^s via element-wise summation. The DSA module first applies self-attention over f^s and f^q to generate self-attentive features {A^s_Dual, A^q_Dual} ∈ ℝ^{C×H×W}, and multiplies them by their corresponding weights before adding them to the input features to obtain f^s_Dual and f^q_Dual, respectively. The resulting features are then processed by the CCA module to generate {A^s_Cross, A^q_Cross} ∈ ℝ^{C×H×W}, which operates similarly to the DSA module. From there, the output features F^s and F^q are computed for the classification score by representation comparison, with F^s calculated as

F^s = f^s + λ1 · A^s_Dual + θ1 · A^s_Cross,    (1)

where λ1 and θ1 are learnable parameters for assigning weights, which gradually learn a weight starting from 0. The output features carry their own local enhancements as well as correlation information between images. F^q is calculated in a similar way to F^s.
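The weighted residual combination of a feature with its attention outputs can be sketched as below. This is a minimal illustration under our reading that the attention branches are added to the input feature with learnable scalar weights initialized to 0; the function name is ours.

```python
import numpy as np

def fuse_features(f, a_dual, a_cross, lam=0.0, theta=0.0):
    """Residual fusion of attention outputs with an input feature.

    lam and theta are learnable scalars in the paper, starting at 0 so the
    attention branches are phased in gradually during training.
    """
    return f + lam * a_dual + theta * a_cross
```

With both weights at 0 the fusion is the identity, so early training behaves like the plain embedding network.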

Dual-Self Attention (DSA)
Self-Position Attention (SPA). Global texture features help to detect GAN-generated faces and can improve detection ability by capturing long-range texture information [15]. However, [16,17] show that generated face detection relies more on subtle details in the hair or background when performing cross-dataset detection, so focusing on local detailed features can effectively enhance the generalization ability of detection. To pay more attention to local detailed features, we introduce the self-position attention module, which lets local features establish links through contextual relationships, thereby enhancing the network's expression of local features.
As shown in the upper part of Figure 3, given a feature f ∈ ℝ^{C×H×W}, it is first fed into a 1×1 convolutional layer to generate f1 ∈ ℝ^{C'×H×W}, where C' < C. We then reshape f1 and f into f1' ∈ ℝ^{C'×N} and f2' ∈ ℝ^{C×N}, where N = H×W in the SPA module. Next, f1' is transposed to f1'^T ∈ ℝ^{N×C'}, and matrix multiplication is performed between f1'^T and f1' to obtain their relationship, denoted g(f1', f1'^T). After a softmax layer, we obtain the position attention map S ∈ ℝ^{N×N}. From spatial positions i of f1' and j of f1'^T, we get two spatial points {f1i', f1j'^T} ∈ ℝ^{C'}, where i ∈ {1, ..., N} and j ∈ {1, ..., N}, and denote the pointwise calculation of g(f1', f1'^T) as g_ij(f1i', f1j'^T), giving

s_ij = exp(g_ij(f1i', f1j'^T)) / Σ_{i=1}^{N} exp(g_ij(f1i', f1j'^T)),    (2)

where s_ij ∈ ℝ^{1×1} represents the impact of the feature at position i on position j; the more correlated the features at the two positions are, the larger s_ij is. Then, after multiplying with f2i' ∈ ℝ^{1×N}, we sum to obtain the correlation of all features in column i, E_j ∈ ℝ^{1×N}:

E_j = Σ_{i=1}^{N} s_ij · f2i'.    (3)

Finally, we aggregate the features of all points and reshape them to get the final output E_p ∈ ℝ^{C×H×W}:

E_p = reshape([E_1, ..., E_N]),    (4)

Figure 3
The details of DSA. Features are fed to the self-position attention (SPA, upper) and self-channel attention (SCA, lower) modules, which output E_p and E_c, respectively. They are then fused into the feature A_Dual, carrying both positional and channel attention. DSA aims to make the network better understand "what to observe".
where E_p selectively aggregates context based on the position attention map. Similar semantic features are correlated with each other, thereby improving intra-class compactness and semantic consistency.
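For concreteness, the SPA computation can be sketched in NumPy for a single image. The paper's implementation is in PyTorch; this is a minimal illustration with our own names, and the scalar `lam` stands in for the learnable residual weight described above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_position_attention(f, w1, lam=0.0):
    """Sketch of SPA. f: (C, H, W) feature map; w1: (C', C) 1x1-conv weight."""
    C, H, W = f.shape
    n = H * W
    f2 = f.reshape(C, n)              # f2' in R^{C x N}
    f1 = w1 @ f2                      # f1' in R^{C' x N}, with C' < C
    s = softmax(f1.T @ f1, axis=0)    # position attention map S in R^{N x N}
    e = f2 @ s                        # each column j aggregates all positions i
    return (f2 + lam * e).reshape(C, H, W)
```

With `lam=0` the module reduces to the identity, matching the "gradually learn a weight from 0" initialization.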
Self-Channel Attention (SCA). Each channel of a high-level semantic feature can be considered the response to a particular class, and the different semantic responses are interrelated [15]. As a consequence, we add a self-channel attention module to build dependencies between channels.
As shown in the lower half of Figure 3, unlike the SPA module, before computing the relationship between channels we directly reshape the input feature to obtain f3' ∈ ℝ^{C×N} and f2' ∈ ℝ^{C×N} instead of feeding them into a convolutional layer, as this maintains the relationship between the different channel maps. f3' is multiplied by its transpose to produce the channel attention map X ∈ ℝ^{C×C}. The subsequent calculation is similar to equations (2), (3), and (4); the final output E_c ∈ ℝ^{C×H×W} is obtained by matrix multiplication of X and f2':

E_c = reshape(X · f2').    (5)

Finally, in order to fully fuse the local detail features across position and channel, we sum the features from the two attention modules to obtain the fused feature A_Dual. The DSA module does not add many parameters to the network but effectively enhances the local representation of features, allowing the network to better understand "what to observe."
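The SCA computation can be sketched the same way. Again this is a single-image NumPy illustration with our own names; the scalar `beta` stands in for the learnable residual weight, and the attention-map normalization is our assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_channel_attention(f, beta=0.0):
    """Sketch of SCA. f: (C, H, W). No 1x1 conv is applied, so the original
    channel maps are preserved before computing channel-to-channel attention."""
    C, H, W = f.shape
    f3 = f.reshape(C, H * W)          # f3' = f2' in R^{C x N}
    x = softmax(f3 @ f3.T, axis=0)    # channel attention map X in R^{C x C}
    e_c = x @ f3                      # each channel aggregates all channels
    return (f3 + beta * e_c).reshape(C, H, W)
```

Summing the SPA and SCA outputs then yields the fused feature A_Dual described above.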

Cross-Correlation Attention (CCA)
The GAN model concentrates on learning local features and uses an upsampling process to produce visually convincing images. However, this local-to-global construction deviates fundamentally from the global generation pattern of natural images, so GAN-generated images frequently lack global coherence. Such images may appear indistinguishable to the human eye, but in reality they are merely a composite of uncorrelated local features.
After applying the DSA module, a pair of features that aggregate local information is obtained. However, since there is no correlation between these two features, neglecting global information could affect the detection results. Consequently, we integrate CCA into the pipeline after the DSA module to establish inter-image links, facilitating detection and classification.
As shown in Figure 4, the features f^s_Dual and f^q_Dual are fed into a 1×1 convolutional layer to adjust the channel size. The outputs are then reshaped into f1' ∈ ℝ^{C'×N1} and f2' ∈ ℝ^{C'×N2} (the two branches share the same computation; we write them separately to better illustrate the process). Denoting the relationship between f1' and f2' as g(f1', f2'), we compute their cross-attention map A ∈ ℝ^{N1×N2}. As in the SPA module, from position i of f1' and j of f2' we get two spatial points {f1i', f2j'} ∈ ℝ^{C'}, where i ∈ {1, ..., N1} and j ∈ {1, ..., N2}, and denote the pointwise calculation of g(f1', f2') as g_ij(f1i', f2j'). We choose the cosine similarity function to calculate the relationship between features:

g_ij(f1i', f2j') = sim(f1i', f2j') = (f1i' · f2j') / (∥f1i'∥ ∥f2j'∥),    (6)

where sim(·,·) denotes the cosine similarity between two features. We perform l2-normalization over f1' and f2' along their channel dimension, so equation (6) can be rewritten as

a_ij = f1i' · f2j'.    (7)

The map A contains the cross-correlation between each pair of positions. After obtaining the cross-attention map, it is multiplied by the corresponding matrices of f1' and f2', respectively:

A^s_Cross = f2' · A^T,   A^q_Cross = f1' · A.    (8)

The output feature A^s_Cross thus contains global information about f^q at every pixel, and likewise for A^q_Cross. CCA promotes the generation of more discriminative features for semantically similar regions between support and query images, allowing the network to adjust its "focus" on the images during testing. Finally, the channels of the features are adjusted to produce the output features A^s_Cross ∈ ℝ^{C×H1×W1} and A^q_Cross ∈ ℝ^{C×H2×W2}.
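The cosine-similarity cross-attention can be sketched in NumPy as follows. This is a minimal illustration with our own names, operating on features that are assumed to be already conv-reduced and flattened; the exact aggregation in the paper may differ.

```python
import numpy as np

def cross_correlation_attention(fs, fq, eps=1e-8):
    """Sketch of CCA. fs: (C', N1) support feature, fq: (C', N2) query feature.

    Returns a pair of cross-attended features, each carrying global
    information about the other image."""
    # l2-normalize along channels so dot products equal cosine similarities
    fs_n = fs / (np.linalg.norm(fs, axis=0, keepdims=True) + eps)
    fq_n = fq / (np.linalg.norm(fq, axis=0, keepdims=True) + eps)
    a = fs_n.T @ fq_n                 # cross-attention map A in R^{N1 x N2}
    a_s = fq @ a.T                    # support output gathers query positions
    a_q = fs @ a                      # query output gathers support positions
    return a_s, a_q
```

Because each output position mixes every position of the other image, semantically similar regions in the support and query features reinforce each other.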



Figure 4
The details of CCA. CCA aims to adjust the network's "focus" on the given images during testing.

Experiments
In this section, we first introduce the details of the dataset and experiments. Then we conduct a series of comparison and ablation experiments to analyze the effectiveness of the proposed model and modules. Finally, we discuss how to design the RENet structure to achieve the best performance. Additionally, we explore detection effectiveness across other generated image categories, including food, animals, and landscapes.

Datasets
In the experiments, the real faces of CelebA-HQ [44] and FFHQ [31] are used as positive samples, and the fake faces generated by PGGAN [30] and StyleGAN [31] are used as negative samples (the number of images per class is 256). Data augmentation has proved effective in mitigating overfitting in deep learning models and enhancing their generalization ability [60,70]. Consequently, during training we exclusively augment query images to simulate real-world scenarios.

To assess generalization ability, we randomly select 2,000 images from commonly used out-of-sample datasets as test sets, including StyleGAN2 [33], StarGAN [11], BEGAN [2], LSGAN [46], WGAN-GP [20], and RelGAN [42]; examples of the corresponding generated faces are shown in Figure 1.

Implementation details. In the base representation, we discard the final average pooling layer and fully connected layer of ResNet50 [25] and use the remaining layers as the shared embedding network of RENet. The structures and parameters of the remaining two segments are outlined in Table 2. In Figure 2, an image I ∈ ℝ^{C×224×224} is input into the shared embedding network to obtain the feature f ∈ ℝ^{1024×H×W}. In the score network, the features are first fed into two identical convolutional groups, each containing a 3×3 convolutional layer with 64 filters followed by a BN layer, a ReLU layer, and a max-pooling layer. The final output is mapped to the range 0 to 1 using two fully connected layers and a sigmoid function; the class with the highest score is the final classification result. In the SPA and CCA modules, we set N1 = N2 = 256.

Training and testing details. All experiments are implemented in PyTorch on a GeForce RTX 3090 GPU (24 GB) and a Silver 4214R CPU. The optimizer is Adam [33] with an initial learning rate of 1.0e-5 decaying to 1.0e-6, using a cosine scheduler with warm start. Training stops when the learning rate decays to its minimum.
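The learning-rate schedule just described can be sketched as a pure function. Only the 1.0e-5 initial rate and 1.0e-6 floor come from the text; the warm-up length and exact curve shape are our assumptions.

```python
import math

def lr_at(step, total_steps, base_lr=1.0e-5, min_lr=1.0e-6, warmup=500):
    """Learning rate at a given step: linear warm start, then cosine decay
    from base_lr down to min_lr over the remaining steps."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

In a PyTorch implementation the same shape is typically obtained with `torch.optim.Adam` plus a cosine annealing scheduler.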

To assess generalization ability, we randomly select 2,000 images from commonly used out-of-sample datasets as test sets including StyleGAN2 [33], Star-GAN [11], BeGAN [2], LsGAN [46], WgGANGP [20], RelGAN [42] The details of the corresponding generated faces are provided in In the Base Representation, we discard the final average pooling layer and fully connected layer of ResNet50 [25] and use the remaining layers as the shared embedding network in the RENet.The structures and parameters of the remaining two segments will be thoroughly outlined in Table 2.In Figure 2, image  ∈ ℝ � ×��� × ��� is input into the shared embedding network to obtain the feature  ∈ ℝ ����×� × � .In the score network, the features are first fed into two identical convolutional groups.Each group contains a 3x3 convolutional layer with 64 filters, followed by a BN layer, ReLU layer, and maxpool layer.The final output is transformed to a range of 0 to 1 using two fully connected layers and a sigmoid function.The class with the highest score is the final classification result.In the SPA and CCA modules, we set  � =   = 256.
Training and testing details.All experiments are implemented on Pytorch and a GeForce GTX 24GB 3090 GPU, Silver 4214R CPU.The optimizer is Adam [33].The initial learning rate of 1.0e-5 and a decay learning rate of 1.0e-6.Besides a cosine scheduler for warm start.Training stops when the learning The feature map contains the cross-correlation between each position in A and B. After obtaining the cross-feature map, it is multiplied by the corresponding matrices of f 1 ' and f 2 ',respectively: image given during the network test.
function to calculate the relationship between features: where (,) means the cosine similarity between two features and ∥ .We perform l2-normalization over  � � and  � � , along their channel dimension, then equation ( 6) can be rewritten as: The feature map contains the cross-correlation between each position in A and B. After obtaining the cross-feature map, it is multiplied by the corresponding matrices of  � � and  � � ,respectively: It can be seen that the output feature  � ����� contains global information of  � for each pixel, and  � ����� is the same.CCA promotes the generation of more discriminative features for semantically similar regions between support and query images, allowing the network to adjust its "focus" on the images during testing.Finally, the channels of the features are adjusted to output features  � ����� ∈ ℝ � × � � × � � and  � ����� ∈ ℝ � × � � × � � .

Experiments
In this section, we first introduce the details of the dataset and experiments.Then we conduct a series of contrast experiments and ablation experiments to analyze the effectiveness of the pro-posed model and modules.Finally, we discuss how to design the RENet model structure to achieve the best performance of the network.Additionally, we explore the detection effectiveness across various generated image categories, including food, animals, landscapes, and more.

Datasets
In the experiment, the real faces CelebA-HQ [44] and FFHQ [31] are used as positive samples, and the fake faces generated by PGGAN [30] and StyleGAN [31] are used as negative samples, i.e. the number of sets in a ratio of 7:1:2.Each image is resized to 256 × 256.Data augmentation proves to be effective in mitigating the overfitting problem in deep learning models and enhancing their generalization ability [60,70].Consequently, during training, we exclusively augment query images to simulate real-world scenarios.
To assess generalization ability, we randomly select 2,000 images from commonly used out-of-sample datasets as test sets including StyleGAN2 [33], Star-GAN [11], BeGAN [2], LsGAN [46], WgGANGP [20], RelGAN [42] The details of the corresponding generated faces are provided in Table 1 and each image resized to 256 × 256.In the Base Representation, we discard the final average pooling layer and fully connected layer of ResNet50 [25] and use the remaining layers as the shared embedding network in the RENet.The structures and parameters of the remaining two segments will be thoroughly outlined in Table 2.In Figure 2, image  ∈ ℝ � ×��� × ��� is input into the shared embedding network to obtain the feature  ∈ ℝ ����×� × � .In the score network, the features are first fed into two identical convolutional groups.Each group contains a 3x3 convolutional layer with 64 filters, followed by a BN layer, ReLU layer, and maxpool layer.The final output is transformed to a range of 0 to 1 using two fully connected layers and a sigmoid function.The class with the highest score is the final classification result.In the SPA and CCA modules, we set  � =   = 256.
Training and testing details.All experiments are implemented on Pytorch and a GeForce GTX 24GB 3090 GPU, Silver 4214R CPU.The optimizer is Adam [33].The initial learning rate of 1.0e-5 and a decay learning rate of 1.0e-6.Besides a cosine scheduler for warm start.Training stops when the learning It can be seen that the output feature A 1 Cross contains global information of f 1 for each pixel, and A 2 Cross is the same.CCA promotes the generation of more discriminative features for semantically similar regions between support and query images, allowing the network to adjust its "focus" on the images during testing.Finally, the channels of the features are adjusted to output features A 1 Cross ∈ ℝ C×H1×W1 and A 2 Cross ∈ ℝ C×H2×W2 .

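To make the cross-correlation step concrete, the following is a minimal NumPy sketch of a CCA-style cross-feature map. It is a sketch under assumptions, not the authors' implementation: the aggregation here uses a softmax over the similarity map, which the text does not specify, and the function and variable names (`cross_correlation_attention`, `f1`, `f2`) are illustrative.

```python
import numpy as np

def cross_correlation_attention(f1, f2):
    """Sketch of a CCA-style cross-feature map (aggregation scheme assumed).

    f1: (C, H1, W1) and f2: (C, H2, W2) feature maps.
    Returns the cross-feature map R and attention outputs A1, A2
    with the same spatial shapes as f1 and f2.
    """
    C, H1, W1 = f1.shape
    _, H2, W2 = f2.shape
    a = f1.reshape(C, H1 * W1)  # columns are per-position feature vectors
    b = f2.reshape(C, H2 * W2)
    # l2-normalize along the channel dimension so the inner product
    # equals cosine similarity (equation (6) rewritten as (7))
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + 1e-8)
    R = a.T @ b  # (H1*W1, H2*W2) cross-correlation between all position pairs

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    # Each position aggregates the other image's features, weighted by
    # its row/column of the cross-feature map (an assumed design).
    A1 = (b @ softmax(R, axis=1).T).reshape(C, H1, W1)
    A2 = (a @ softmax(R.T, axis=1).T).reshape(C, H2, W2)
    return R, A1, A2
```

After l2-normalization the entries of R are cosine similarities, so semantically similar regions of the support and query images receive the largest weights.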
Experiments
In this section, we first introduce the details of the dataset and experiments. Then we conduct a series of contrast experiments and ablation experiments to analyze the effectiveness of the proposed model and modules. Finally, we discuss how to design the RENet model structure to achieve the best performance of the network. Additionally, we explore the detection effectiveness across various generated image categories, including food, animals, landscapes, and more.

Datasets
In the experiment, the real faces from CelebA-HQ [44] and FFHQ [31] are used as positive samples, and the fake faces generated by PGGAN [30] and StyleGAN [31] are used as negative samples, i.e., the number of categories N = 4. Following [4], we randomly select 10K images from each of the four datasets and then divide the images into training, validation, and testing sets in a ratio of 7:1:2. Each image is resized to 256 × 256. Data augmentation proves to be effective in mitigating the overfitting problem in deep learning models and enhancing their generalization ability [60,70]. Consequently, during training, we exclusively augment query images to simulate real-world scenarios.
To assess generalization ability, we randomly select 2,000 images from commonly used out-of-sample datasets as test sets, including StyleGAN2 [33], StarGAN [11], BEGAN [2], LSGAN [46], WGAN-GP [20] and RelGAN [42]. The details of the corresponding generated faces are provided in Table 1, and each image is resized to 256 × 256.
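As a sketch, the 7:1:2 split described above can be reproduced as follows. The function name and the fixed seed are illustrative, not the authors' code; any list of image paths can stand in for the placeholder IDs.

```python
import random

def split_dataset(items, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle a list of image paths and split it into
    train/val/test subsets with a 7:1:2 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# e.g. the 10K images randomly selected from one of the four datasets
train, val, test = split_dataset(range(10_000))
```

With 10,000 items this yields 7,000 training, 1,000 validation, and 2,000 testing images per dataset.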

Implementation Details
Network architectures. The complete network is partitioned into three segments: Base Representation, Feature Augmentation, and Representation Comparison. In the Base Representation, we discard the final average pooling layer and fully connected layer of ResNet50 [25] and use the remaining layers as the shared embedding network in RENet.
The structures and parameters of the remaining two segments are outlined in Table 2. In Figure 2, an input image of size 3 × 256 × 256 is fed into the shared embedding network to obtain the feature f ∈ ℝ^(2048×8×8). In the score network, the features are first fed into two identical convolutional groups. Each group contains a 3×3 convolutional layer with 64 filters, followed by a BN layer, a ReLU layer, and a max-pooling layer. The final output is transformed to a range of 0 to 1 using two fully connected layers and a sigmoid function. The class with the highest score is the final classification result. In the SPA and CCA modules, we set C′ = C″ = 256.
Training and testing details. All experiments are implemented on PyTorch with a GeForce RTX 3090 24GB GPU and a Xeon Silver 4214R CPU. The optimizer is Adam [33], with an initial learning rate of 1.0e-5, a learning rate decay of 1.0e-6, and a cosine scheduler for warm start.
Training stops when the learning rate is less than 1.0e-7, with a patience of 7. We use the mean square error (MSE) loss during training, where positive sample labels are 0 and negative sample labels are 1. For each episode, we compute with 8 images from each category, i.e., K = 8. Therefore, the experiment is 4-way 8-shot, and a total of 32 images (4 × 8) are used as a batch for training. In addition, to test robustness, we perform different post-processing operations on the testing set. Details and experimental results are given in Section 4.3.
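A hedged sketch of how a 4-way 8-shot episode could be assembled from the four training categories. The category pools and IDs are placeholders, and the paper's actual sampler may differ; only the label convention (positive 0, negative 1) and the 4 × 8 = 32 batch size come from the text.

```python
import random

# The four training categories: two real sources (label 0) and two
# GAN sources (label 1). The image pools below are placeholder IDs.
CATEGORIES = {
    "CelebA-HQ": 0, "FFHQ": 0,   # real faces  -> positive label 0
    "PGGAN": 1, "StyleGAN": 1,   # fake faces  -> negative label 1
}

def sample_episode(pools, k=8, seed=None):
    """Draw a 4-way k-shot episode: k images from each category,
    giving a batch of 4*k (image, label) pairs for the MSE loss."""
    rng = random.Random(seed)
    batch = []
    for name, label in CATEGORIES.items():
        for img in rng.sample(pools[name], k):
            batch.append((img, float(label)))  # 0/1 regression targets
    return batch

# Hypothetical pools of 100 image IDs per category.
pools = {name: [f"{name}_{i}" for i in range(100)] for name in CATEGORIES}
batch = sample_episode(pools, k=8, seed=0)
```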
Performance metric. We use accuracy (ACC) to evaluate the experimental results, calculated as ACC = (TP + TN) / (P + N). Here, TP and TN are the numbers of correctly classified positive and negative samples, respectively; P and N are the total numbers of positive and negative samples.
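The ACC formula above is a one-liner; the example below restates it directly, with hypothetical counts in the usage comment.

```python
def accuracy(tp, tn, p, n):
    """ACC = (TP + TN) / (P + N): the fraction of correctly
    classified positive and negative samples."""
    return (tp + tn) / (p + n)

# e.g. 980 of 1,000 real and 960 of 1,000 fake images classified correctly
acc = accuracy(980, 960, 1000, 1000)  # 0.97
```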

Comparisons with State-of-the-Art Works
In this section, several advanced methods are selected for comparison. They are summarized below.
— Li et al. [38] detected authenticity by estimating artifact similarity. The optimizer is Adam [33]; the initial learning rate is set to 10e-5, and the model is trained for 200 epochs. (GaseNet)
— Chen et al. [7] proposed a method based on dual-color space to detect differences. The optimizer is Adam [33], with an initial learning rate of 1.0e-5 and learning rate decay.
Furthermore, to better assess the performance of the models, we compared the inference times of each algorithm. As shown in Table 4, while FGLNet performs well in generalization, its inference time is increased due to the necessity of adaptively selecting local images for fusion, resulting in an excessive number of parameters. CGNet, on the other hand, sacrifices some generalization performance to improve overall efficiency. In contrast, RENet maintains excellent generalization performance while also preserving good inference times, making it more favorable for practical applications on real devices.
Results on robustness against post-processing operations. In reality, some criminals use GAN-generated images after post-processing them, so we measured the detection accuracy (%) of different methods on in-sample and out-of-sample datasets under various post-processing operations. While other methods became ineffective when attacked, our method maintained ideal detection performance.
This could be attributed to the fact that, even when post-processing causes degradation, the extracted features maintain a similarity to the prototype from the same category, so reliable detection results can be achieved through the feature relation. Additionally, we visualized the attention points of RENet when processing post-processed images. As shown in Figure 8, which presents activation maps from the proposed model after the post-processing operations (from left to right: JPEG compression, Gaussian blur, resizing, Gaussian noise, and gamma transform), when the image is disturbed RENet not only focuses on the intended areas but also pays attention to some background edges to assist in distinguishing authenticity, aligning with what has been suggested in [19]. Regarding the suboptimal performance of resizing on some datasets, we speculate that this is due to pixel loss caused by compressing all images uniformly to the same resolution before training.
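For illustration, a few of the post-processing perturbations named above can be sketched with NumPy. The parameter values (`sigma`, `gamma`, `scale`) are arbitrary here, not those used in the paper, and the resize is a crude nearest-neighbour stand-in for a proper interpolating resize.

```python
import numpy as np

def gaussian_noise(img, sigma=0.05, seed=0):
    """Additive Gaussian noise on an image with values in [0, 1]."""
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def gamma_transform(img, gamma=1.5):
    """Gamma transform on an image with values in [0, 1]."""
    return np.clip(img, 0.0, 1.0) ** gamma

def resize_nearest(img, scale=0.5):
    """Crude nearest-neighbour downscaling by index striding."""
    step = int(round(1.0 / scale))
    return img[::step, ::step]
```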

Ablation Study and Selection of RENet Structure
In this section, we investigate four questions: (1) the impact of the DSA and CCA modules on the effectiveness of RENet detection; (2) how RENet distinguishes fake faces in out-of-sample datasets; (3) how to design the network architecture to optimize the performance of RENet; (4) what impact batch size and learning rate have on the model.
For the first question, we have conducted ablation experiments in Table 5. The experimental results show that the DSA and CCA modules can improve the network's detection generalization ability, thereby verifying the effectiveness of the proposed modules. The visualization in Figure 9 demonstrates that the features extracted by the network become more precise and rational with the addition of the DSA module and the CCA module separately. Especially when combining the two modules, the network's attention accuracy further improves. Besides, during the training process, as shown in Figure 10, the network converges more rapidly after integrating DSA and CCA. This indicates their successful role in enabling the network to perceive relevant semantic features at different positions, facilitating an easier learning process for comparison.
For the second question, Figure 11 visualizes the primary focus of RENet on out-of-sample datasets. It can be observed that, compared to detecting images within the dataset, when detecting images from out-of-sample datasets the network tends to focus on artifacts at the edges of faces and background hair. This may be due to differences in the strategies of different generation models, leading to variations in the correlation of facial texture features.
To explore the above directions, we have made some modifications to RENet, as shown in Figure 12 (the complete RENet network is on the left, the modified networks are within the dashed lines on the right, and the red font indicates the modified parameters). (1) In RENet-1, instead of element-wise sum, we use element-wise avg before feeding the support set features into the Feature Augmentation. (2) In RENet-2 and RENet-3, we modify the two convolutional kernels in the score network to 32 and 128, respectively. (3) In RENet-4, we add a convolutional layer to the score network, while in RENet-5, we remove a convolutional layer; these two variants do not modify the parameters but the number of convolutional layers. To make a fair comparison, we keep the other structures the same as the original RENet except for the corresponding modifications.
As shown in Table 6, we can draw the following conclusions. First, if we change element-wise sum to element-wise avg, the detection accuracy of the network on the in-sample and out-of-sample datasets decreases by 0.34% and 1.55%, respectively. This is attributed to the fact that element-wise sum effectively amplifies the distinctions and commonalities in the extracted features, facilitating the network in better contrasting the feature associations between the support set and the query set. On the contrary, element-wise avg diminishes these differences, impacting the network's ability to discern their associations. Second, a convolutional kernel of 64 is more suitable for RENet; a larger (128) or smaller (32) kernel only brings negative improvements. In particular, when the convolutional kernel is 128, the generalization accuracy decreases by 3.85%, indicating that an appropriately sized convolutional kernel can enhance the feature augmentation results of DSA. Third, using two convolutional layers before score mapping is more stable than using one or three convolutional layers. Adding a convolutional layer not only reduces the detection accuracy but also increases the network parameters, while removing a layer cannot fully integrate the extracted features; the impact on generalization accuracy is the greatest.
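A toy numeric illustration (not the paper's experiment) of the intuition above: element-wise sum keeps the absolute contrasts between positions twice as large as element-wise avg, which simply halves them.

```python
import numpy as np

# Two hypothetical support features: a shared pattern plus
# distinct per-position deviations.
a = np.array([1.0, 4.0, 1.0])
b = np.array([1.0, 0.0, 3.0])

summed = a + b          # fused feature via element-wise sum
averaged = (a + b) / 2  # fused feature via element-wise avg

# The spread (max - min) across positions is twice as large for sum,
# so position-wise distinctions survive fusion better.
assert np.ptp(summed) == 2 * np.ptp(averaged)
```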
For the fourth question, we conduct two experiments. The first trains the model with batch sizes of 1, 2, 4, 8 (in this paper), and 16, observing the accuracy on in-sample and out-of-sample datasets when the epoch equals 20. The second sets the initial learning rate to 1.0e-4, 1.0e-5 (in this paper), and 1.0e-6, monitoring the accuracy on the validation set at the same step. As shown in Figure 13, appropriate batch sizes and learning rates can enhance detection accuracy, while excessively small or large batch sizes may impact generalization ability.

Generated Image Detection of Various Category
With the continuous evolution and application of datasets [5,61], as well as the ongoing iterations of GAN models [3,53,55,73,74], the range of generated images has extended beyond faces. Training solely on CelebA-HQ and FFHQ may not comprehensively test the performance of the proposed method. To thoroughly analyze RENet's performance across different categories, we further expand the testing scope when evaluating our proposed method. We employ the experimental settings outlined by Wang et al. [60], which are crafted to show the generality of a generated-image detector trained on one GAN model in identifying other GAN models (not restricted to faces alone). Detailed information regarding the generated images is presented in Figure 14 and Table 7. We choose LSUN [66] as the source of real images and select one category (horse) among the 20 classes generated by ProGAN as the fake images. Each category comprises 18,000 fake images and an equal number of real images. The training settings are the same as those mentioned in Section 4. In addition to Accuracy (ACC.), we also use the Average Precision (A.P.) as an evaluation metric. The A.P. is calculated using an alternative measurement method in [60], which approximates the area under the precision-recall curve with the use of a few thresholds.
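The A.P. metric can be sketched as the usual step-wise approximation of the area under the precision-recall curve: sort by score, then accumulate precision at each positive hit weighted by the recall step. The exact thresholding scheme used in [60] may differ from this sketch.

```python
def average_precision(scores, labels):
    """Approximate A.P.: area under the precision-recall curve,
    summing precision at each recall step (labels: 1 fake, 0 real)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = fp = 0
    ap = 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
            ap += tp / (tp + fp) * (1.0 / total_pos)  # precision * delta-recall
        else:
            fp += 1
    return ap
```

A perfect ranking (all fakes scored above all reals) yields an A.P. of 1.0.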
We first assessed the detection performance of each method across various categories within the in-sample datasets. As illustrated in Figure 15, even with training limited to a single category, our proposed RENet surpasses other methods in detecting unseen categories. Notably, in categories like Bus and Motorbike, where alternative methods show lower accuracy, RENet maintains an accuracy exceeding 90%. This suggests that the proposed network is capable of effectively finding feature connections and semantic information within the same GAN model. Additionally, Figure 16 visualizes the attention of the proposed model on different types of images generated by the same GAN. We observed that, in this scenario, the model primarily focuses on discerning authenticity by analyzing the content within the objects themselves.
Table 8 shows the comparison results between our proposed RENet and other methods. It can be observed that RENet demonstrates better generalization capabilities across most GAN models compared to the other algorithms. This is attributed to the enhanced relational network's ability to perceive semantic features and conduct comparisons. However, the generalization performance on GauGAN and Deepfake is not satisfactory. This happens because GauGAN and ProGAN have similar semantic features, leading to overfitting problems in previous methods. In addition, the poor performance on the Deepfake model is due to the fact that it is not a GAN model and uses MSE loss and SSIM loss for training, resulting in a detection accuracy of only 69.3%.

Conclusion
In this work, we propose a relational embedding network called RENet for detecting GAN-generated faces. It combines dual-self attention and cross-correlation attention, enhancing both the relevant local features within an image and the global feature relationships between images. In addition, we observe that RENet generalizes better to unknown datasets by learning the structural correlations among features, bringing performance improvements to the network for detecting GAN-generated images. In future work, we will explore the effect of a relational network combined with an attention framework on different image forensics tasks.

Figure 5
Figure 5 ROC curves for different methods on StyleGAN2

Figure 6
Figure 6 The detection accuracy (%) of different methods on in-sample datasets for various post-processing operations

Figure 9 Figure 10
Figure 9 The first column on the left is the input query set, and the red boxes represent the clearly fake areas. (a), (b), (c) and (d) are the feature attention maps without any module, with the DSA module, with the CCA module, and with the DSA+CCA module, respectively

Figure 11
Figure 11 The visualization results of feature maps for out-of-sample datasets

Figure 13
Figure 13 Above are batch size ablation studies, and below are learning rate ablation studies on the RENet

Figure 14
Figure 14 Some samples in the experimental datasets. The first row represents real images, and the second row corresponds to images generated by GAN

Table 1
Details of datasets.


Table 2
Network parameters of feature augmentation module and representation comparison

Table 3
Detection accuracy of the different methods (%), the best results of the experiment in bold

Table 4
Average inference time (seconds) comparison of six methods for a picture with 256 × 256 resolution

Table 5
Ablation experiments of the DSA and CCA modules (%)

Table 6
The influence of RENet structure selection on accuracy (%)

Table 7
Details of datasets. Generated images are not limited to faces alone