Saliency Detection Algorithm for Foggy Images Based on Deep Learning


The detection of salient objects in foggy scenes is an important research component in many practical applications such as action recognition, target tracking and pedestrian re-identification. To facilitate saliency detection in foggy scenes, this paper explores two issues: the construction of a dataset for foggy weather conditions, and an implementation scheme for foggy weather saliency detection. Firstly, a foggy image synthesis method is designed based on the atmospheric scattering model, and a saliency detection dataset applicable to foggy scenes is constructed. Secondly, we compare current classification networks and adopt ResNet50, which has the highest classification accuracy, as the backbone network of the classification module, and classify the foggy images into three levels, namely fogless, light fog and dense fog, according to different fog concentrations. Then, the Residual Refinement Network (R2Net) is selected to train and test the classified images. Horizontal and vertical flipping and image cropping are used to augment the training set to relieve over-fitting, and the accuracy of the network model is improved by using Adam as the optimizer. Experimental results show that for the detection of fogless images, our method is almost on par with the state of the art, and it performs well for both light and dense fog images. Our method has good adaptability, accuracy and robustness.

KEYWORDS: Foggy images, Saliency detection, Image classification, Deep learning.

Introduction
Saliency detection, as one of the popular research directions in the field of computer vision, has a wide range of applications in video surveillance [24], image thumbnailing [23], and semantic segmentation [27]. Saliency detection under foggy conditions, as one of its branches, has also attracted the attention of researchers. However, early saliency models such as Frequency Tuned (FT) [1], Histogram-based Contrast (HC) [3], Itti (IT) [8] and Luminance Contrast (LC) [28] mainly rely on features such as the color, contrast and contours of the image. With the development of deep learning theory, more and more network models have been proposed. The deep contrast network proposed by Li et al. [11] solves the problem of blurred saliency maps in saliency detection. The Amulet network proposed by Zhang et al. [29] utilizes convolutional features from multiple layers as saliency cues for salient object detection.
Most of these studies are aimed at targets in natural environments, which are characterized by contrasting colors and clear outlines. Although many saliency detection models achieve good results on existing datasets, they often fail to achieve ideal results when actually applied to foggy environments. A foggy scene is an environment with uncertain factors such as suspended smoke particles and automobile exhaust. These factors make the captured images subject to blur, occlusion and abnormal lighting, resulting in loss of image detail, low contrast and color distortion. Accurate saliency detection nevertheless plays an important role in related applications.
Existing saliency detection networks are not designed for foggy images and are therefore not suitable for saliency detection under foggy conditions; there is also a lack of foggy detection datasets. Therefore, this paper first simulates the distorted images affected by fog through the atmospheric scattering model, and generates synthetic fog images based on the DUTS [21] dataset. In order not to lose detection quality on fogless images while retaining good detection ability on foggy images of different concentrations, a detection network based on R2Net [6] is trained separately according to the classification results, with the parameters optimized for each fog concentration.

Related Works
As of now, human-annotated datasets captured in real foggy conditions are very rare, and such images annotated for saliency detection are even rarer. Therefore, this paper synthesizes hazy images with the atmospheric scattering model and uses the generated dataset for training and testing the model.

The atmospheric scattering model of Koschmieder can simulate the effect of fog well, and we use it to generate simulated fog images. The fog image I(x) can be expressed as:

I(x) = J(x) t(x) + A (1 − t(x))   (1)

where J(x) is the haze-free image and A is the atmospheric light value; the model usually assumes that the atmospheric light value is globally constant. t(x) is the transmittance, which represents the fraction of scene radiance transmitted to the camera. In a homogeneous medium, the transmittance depends on the distance l(x) from the scene to the camera:

t(x) = e^(−β l(x))   (2)

Here β is called the attenuation coefficient, which adjusts the density of fog in the generated image: a larger β means denser fog, and conversely a smaller β means thinner fog. For a haze-free image, the transmittance is first calculated using Equation (2), and the atmospheric light value A is calculated by substituting it into Equation (1); light fog and dense fog images are then generated according to Equation (1).
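The synthesis procedure can be sketched as follows. This is a minimal NumPy version; the depth map and the A and β values below are illustrative choices, since the paper derives A from Equation (1) rather than fixing it:

```python
import numpy as np

def synthesize_fog(J, depth, A=0.9, beta=0.12):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t),
    with transmittance t = exp(-beta * depth) (Equations (1)-(2))."""
    t = np.exp(-beta * depth)      # transmittance, one value per pixel
    t = t[..., np.newaxis]         # broadcast over the RGB channels
    return J * t + A * (1.0 - t)

# Toy example: a 4x4 "image" with a left-to-right depth ramp.
J = np.full((4, 4, 3), 0.5)                         # haze-free image in [0, 1]
depth = np.tile(np.linspace(1.0, 50.0, 4), (4, 1))  # hypothetical scene depth
I = synthesize_fog(J, depth, A=0.9, beta=0.12)
```

Pixels with larger depth have lower transmittance and are pulled toward the atmospheric light A, which is exactly the low-contrast whitening effect described above.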
In recent years, convolutional neural networks have become one of the research hotspots in many disciplines, and using them to process and analyze data has become a popular trend.
Classical network models such as AlexNet [10], VGG-Net [19], GoogLeNet [20] and ResNet [15] have been proposed one after another. To solve the foggy image classification task with small samples, this paper analyzes the currently popular classification network models and determines the basis of the foggy image classification module from the perspectives of network structure, floating-point operations and parameter count. Finally, this paper adopts ResNet50 [7] as the basic model to perform the three-way classification.
The residual network proposed in ResNet improves the structure of the convolutional neural network so that it can maintain its feature expression ability while increasing the depth of the network, and effectively solves the gradient vanishing or gradient explosion caused by deepening the number of layers. The introduction of the residual module is a crucial step in the development of convolutional neural networks. The structure of this module is shown in Figure 1.
The residual structure constitutes two mapping paths, identity mapping and residual mapping, in the form of cross-layer links. By adding the identity mapping of x in the process of module connection, the network can effectively control the layer parameters and the computational complexity while alleviating the problem of gradient vanishing. The residual structural unit can be expressed as:

x_{j+1} = x_j + F(x_j, W_j)   (3)

where x_j and x_{j+1} respectively represent the input and output of this layer of the network, and W_j represents the parameters to be learned in this layer. Performing a recursive operation on Equation (3), the feature representation of any deeper unit J can be obtained:

x_J = x_j + Σ_{i=j}^{J−1} F(x_i, W_i)   (4)

R2Net [6] proposes a new residual learning strategy: unlike previous multi-scale-based models, it gradually generates and refines coarse predictions. By introducing new Attention Residual Modules (ARMs), the matching process between coarse predictions and ground-truth (GT) maps is guided. ARMs focus on edge details while guiding the refinement process, making the saliency map more discriminative. The R2Net network structure is shown in Figure 2.
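Equations (3) and (4) can be checked numerically with a toy residual function standing in for the learned layers (the function F and the weights below are illustrative, not the paper's):

```python
import numpy as np

def F(x, w):
    """Toy residual branch standing in for the learned layers F(x_j, W_j)."""
    return w * np.tanh(x)

weights = [0.5, -0.2, 0.3]          # stand-ins for W_j of three stacked units
x = np.array([1.0, -2.0])

# Step-by-step application of Equation (3): x_{j+1} = x_j + F(x_j, W_j)
x_step = x.copy()
residuals = []
for w in weights:
    r = F(x_step, w)
    residuals.append(r)
    x_step = x_step + r

# Equation (4): the deep feature is the input plus the sum of all residuals
x_recursive = x + sum(residuals)
```

The two computations agree exactly, which is the telescoping property that lets gradients flow directly from any deep unit back to x_j.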
An important topic in deep learning is the generalization ability of the model. Over-fitting is often encountered in applications, and the correct use of regularization techniques can reduce it. Zheng et al. [32] proposed a two-stage training method to improve the generalization ability of the network: in the pre-training stage, the network model is trained to extract image representations for anomaly detection; in the implicit regularization training stage, the network is retrained so that the feature boundary converges based on the anomaly detection results. This approach effectively keeps over-fitting low. Jin et al. [9] proposed computer-aided facial diagnosis for various diseases using deep transfer learning for face recognition; the overall top-1 accuracy reaches more than 90%, outperforming traditional machine learning methods and clinicians in experiments. Zheng et al. [33] proposed a spectrum interference-based two-level data augmentation method for automatic modulation classification, the first time that frequency domain information has been used to augment radio signals for modulation classification.
When deep neural networks process larger-scale data, excessive computation affects the learning and inference speed of the model and cannot meet the demands of practical applications; improving computational speed therefore has important application value. A new, faster Mean-shift algorithm was proposed by Zhao et al. [30]: by introducing a novel online seed optimization policy (OSOP), the minimum number of seeds is determined adaptively to speed up the computation and optimize GPU memory. You et al. [26] extended and improved the Mean-shift algorithm with a novel seed selection and early stopping method, which greatly improves computing speed and reduces GPU memory consumption.
In summary, when applying saliency algorithms to foggy scenes, the generalization ability of the algorithm, computing speed and detection accuracy are undoubtedly among the main goals. To address the shortcomings of current saliency network models, a method combining a ResNet50-based classification module is proposed, and fog images with different concentrations are trained and tested using R2Net. Data augmentation [31] and the Adam optimizer are used to alleviate the over-fitting problem and improve the generalization ability of the model, and training and inference are accelerated through efficient use of computing resources. Finally, the detection method of this paper is tested on foggy images with different concentrations by comparing it with traditional algorithms and current models.
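The flip-and-crop augmentation mentioned above can be sketched with plain array operations (the crop size below is an illustrative choice, not the paper's setting):

```python
import numpy as np

def augment(image, rng, crop=200):
    """Randomly flip the image horizontally/vertically and take a random crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1]          # horizontal flip (reverse columns)
    if rng.random() < 0.5:
        image = image[::-1, :]          # vertical flip (reverse rows)
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop]

rng = np.random.default_rng(0)
img = np.zeros((224, 224, 3))           # placeholder training image
out = augment(img, rng)
```

Each call yields a differently transformed view of the same image, which multiplies the effective size of the training set and helps relieve over-fitting.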

Proposed Method
The fog density is not stable under real foggy conditions, and existing image dehazing algorithms are mostly aimed at single images, with no adaptive ability for fog image detection under different concentrations. Therefore, a fog classification step is added before saliency detection, and the detection network is trained and optimized according to the classification results, thereby improving detection accuracy. Directly predicting the best result under the influence of dense fog particles makes the saliency detection task very challenging. R2Net's residual learning strategy can gradually refine the coarse predictions: the residuals are predicted to compensate for the errors between the coarse saliency map and the ground truth masks. It generates coarse predictions through the DCPP module and guides the residual learning process through the ARMs. Even if the target profile is not detected at first, the finest saliency map can be closely approximated. The method structure is shown in Figure 3.

Classify Module
We use ResNet [7] as the basic network of the classification module. ResNet consists of several residual blocks; the principle of the residual block is to skip over the preceding layers and feed their input directly into the input of a later layer. If F(x) represents a two-layer network without skip connections, then the residual block can be expressed as H(x) = F(x) + x, introducing richer reference information for x so that the network can learn more plentiful content.
The ResNet50 [7] structure is shown in Figure 4. The residual network consists of several residual blocks, and a structure with multiple residual blocks is called a layer. The initial layer is an ordinary convolution structure; layer1 contains 3 residual blocks, layer2 contains 4, layer3 contains 6 and layer4 contains 3, followed by a fully connected layer. As image data with a size of 224 × 224 pixels passes through these layers, the residual network extracts features for learning and training, finally reducing the spatial size to 7 × 7. After the residual stages, the feature maps are input to the average pooling layer and averaged, and finally the image category is decided by the Softmax function of the fully connected layer.
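The 224 → 7 reduction follows from five stride-2 stages. As a quick sanity check, assuming the standard ResNet-50 stride layout (a stride-2 initial convolution, a stride-2 max-pool, and one stride-2 downsampling in each of layer2 to layer4, with layer1 keeping the resolution):

```python
def output_size(size, strides):
    """Spatial size after a sequence of stride-s downsampling stages."""
    for s in strides:
        size = size // s
    return size

# conv1 (stride 2), max-pool (stride 2), layer1 (stride 1),
# then layer2-layer4 each downsample by 2.
strides = [2, 2, 1, 2, 2, 2]
final = output_size(224, strides)   # overall factor of 32: 224 / 32 = 7
```

The overall downsampling factor is 2^5 = 32, so a 224 × 224 input leaves a 7 × 7 feature map for the average pooling layer.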

Detection Module
We use R2Net [6] as the basic network of the detection module. R2Net is a novel residual-structure-based saliency detection network. Unlike existing methods, the network progressively corrects the error between the prediction map and the saliency mask until it best matches the ground truth. R2Net mainly includes the R-VGG module, the DCPP module and the ARM module. The R-VGG module is modified from the VGG16 [19] network.
The DCPP module employs four dilated convolutional layers, which are used to generate the coarse saliency map. The resulting rough saliency map is fed into the bottom residual learning branch of the residual module. Except for their different rate parameters, the four dilated convolution layers are all implemented using atrous convolution. The purpose of using atrous convolution is to enlarge the receptive field without losing spatial resolution. Atrous convolution has another advantage: by setting different dilation rates, different receptive fields are obtained. Information at different scales can be obtained from these different receptive fields, which plays an important role in vision tasks.
When the image is converted into a two-dimensional matrix x[i, j] and convolved with a filter w with a kernel size of K, the output y[i, j] can be expressed as:

y[i, j] = Σ_{m=1}^{K} Σ_{n=1}^{K} x[i + r·m, j + r·n] · w[m, n]   (5)

The parameter r marks the difference between atrous convolution and classical convolution. The dilation rate r controls the distance between adjacent elements in the convolution kernel, and changing it controls the size of the receptive field F of the convolution kernel:

F = K + (K − 1)(r − 1)   (6)

without increasing the number of parameters or the computation.
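A naive NumPy version of the dilated convolution above (an educational sketch that loops over the kernel, assuming 'valid' padding so the output shrinks by the dilated kernel extent):

```python
import numpy as np

def atrous_conv2d(x, w, r=1):
    """'Valid' 2D atrous convolution:
    y[i, j] = sum_{m, n} x[i + r*m, j + r*n] * w[m, n]."""
    K = w.shape[0]
    span = r * (K - 1)                 # extent covered by the dilated kernel
    H = x.shape[0] - span
    W = x.shape[1] - span
    y = np.zeros((H, W))
    for m in range(K):
        for n in range(K):
            y += w[m, n] * x[r * m : r * m + H, r * n : r * n + W]
    return y

x = np.arange(49, dtype=float).reshape(7, 7)
w = np.ones((3, 3))
y1 = atrous_conv2d(x, w, r=1)   # classical convolution, 5x5 output
y2 = atrous_conv2d(x, w, r=2)   # same 3x3 kernel, wider receptive field
```

With r = 2 the same nine weights sample a 5 × 5 neighborhood instead of a 3 × 3 one, enlarging the receptive field at no extra parameter cost.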
R2Net adopts four dilated convolutional layers to form a dilated convolutional pyramid pooling module for predicting coarse global saliency maps. The network uses four dilated convolutional layers with kernel size k = 3 but different rate parameters. In order to extract a global view and multi-scale features, the rates of the four dilated convolutional layers are set to r = 1, 5, 9, 13, respectively, with 16 output channels each. In this paper, in order to improve the saliency detection of small-scale objects under foggy conditions, the rates are set to r = 1, 2, 5, 9 following [22], with the number of output channels unchanged. The network can then still accurately extract local and global features.
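The growth of the receptive field with the dilation rate can be tabulated using the standard effective-kernel formula k + (k − 1)(r − 1), with k = 3 as above:

```python
def effective_kernel(k, r):
    """Effective spatial extent of a k x k kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

paper_rates = [1, 5, 9, 13]     # rates used by the original R2Net DCPP
our_rates = [1, 2, 5, 9]        # rates adopted in this paper, following [22]
extents = {r: effective_kernel(3, r) for r in our_rates}
```

The adopted rates give extents of 3, 5, 11 and 19 pixels: a denser ladder of small-to-medium receptive fields, which is consistent with the goal of preserving small-scale objects while still covering a global view.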

Loss Function
In this paper, binary cross-entropy loss (L_BCE), which is often used in classification problems, is used as the loss function of the classification module. R2Net adopts the standard cross-entropy loss to calculate the per-pixel loss, which ignores the global structure of the image. To remedy this deficiency, this paper uses the IoU loss [17] (L_IOU) to focus on the global structure, thereby imposing global constraints on the network. For our optimized network, the loss L_BCE is obtained in the classification module, then the classified foggy image is sent to the detection network for training, and finally the detection result of the target is obtained. To sum up, the loss function of the model in this paper is defined as:

L = L_BCE + λ · L_IOU

where L is the overall loss function, L_BCE is the loss function of the classification network module, L_IOU is the loss function of the detection network module, and λ is the balance factor.
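A minimal sketch of the combined loss on dense prediction maps (λ and the toy maps below are illustrative choices; the paper does not specify λ here):

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over the map."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(gt * np.log(pred) + (1 - gt) * np.log(1 - pred))))

def iou_loss(pred, gt, eps=1e-7):
    """Soft IoU loss: 1 - intersection / union, a global structural constraint."""
    inter = np.sum(pred * gt)
    union = np.sum(pred) + np.sum(gt) - inter
    return float(1.0 - (inter + eps) / (union + eps))

def total_loss(pred, gt, lam=1.0):
    """Combined objective L = L_BCE + lambda * L_IOU."""
    return bce_loss(pred, gt) + lam * iou_loss(pred, gt)

gt = np.zeros((8, 8))
gt[2:6, 2:6] = 1.0                              # toy ground-truth mask
perfect = total_loss(gt, gt)                    # near-zero loss
poor = total_loss(np.full((8, 8), 0.5), gt)     # uniform, uncertain prediction
```

Unlike the per-pixel BCE term, the IoU term penalizes the prediction as a whole region, so a map that is confidently wrong about the object's shape is punished even when its average pixel error is moderate.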

Datasets
Sakaridis et al. [18] constructed Foggy-Cityscapes from synthetic foggy images, with three different fog concentrations; different concentrations correspond to different β values in the atmospheric scattering model. In this paper, 0.04 ≤ β ≤ 0.08 corresponds to light fog and 0.12 ≤ β ≤ 0.16 corresponds to dense fog, while the fog-free images are taken from DUTS [21].

This paper uses DUTS-TRAIN [21] and the generated light fog and dense fog datasets as training sets, and uses DUTS-TEST [21], ECSSD [25], HKU-IS [12] and PASCAL-S [14] as test sets.

Evaluation Metrics
We evaluate the proposed method by adopting Maximum F-measure [1], S-measure [4], E-measure [5] and Mean Absolute Error (MAE). Among them, Maximum F-measure and MAE are calculated in a pixel-by-pixel manner, which cannot fully capture the structural information of the prediction map. Therefore, S-measure is supplemented to compute structural similarity and E-measure to evaluate image-level properties.
This paper adopts the Mean Absolute Error (MAE) to measure the pixel-wise mean absolute error between the saliency map S and the ground truth map G:

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − G(x, y)|   (7)

where W and H represent the width and height of the saliency map, respectively, and the MAE value is normalized to the [0, 1] interval. A lower MAE indicates a higher similarity between the saliency map and the ground truth map.
Maximum F-measure is a commonly used evaluation method that considers both precision and recall, using the parameter β to trade them off:

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)   (8)

Following [1], β² is set to 0.3, and we report the maximum value over all precision-recall pairs.

Structure-measure [4] considers both region-oriented and object-oriented structural similarity measures. To capture the importance of structural information in the image, S evaluates the structural similarity between region-awareness (S_r) and object-awareness (S_o):

S = α × S_o + (1 − α) × S_r   (9)

where α ∈ [0, 1] is the balance parameter.
The Enhanced-alignment measure [5] is a recently proposed method that considers both pixel-level and image-level properties, providing an effective and efficient way to evaluate saliency maps. E is based on cognitive vision research and combines image-level statistical information with local pixel matching information:

E = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} φ_FM(x, y)   (10)

where W and H are the width and height of the map, respectively, and φ_FM represents the enhanced alignment matrix.
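MAE and the maximum F-measure, the two pixel-wise metrics above, can be sketched as follows (β² = 0.3 as in [1]; the granularity of the threshold sweep is an illustrative choice):

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and its ground-truth mask."""
    return float(np.mean(np.abs(sal - gt)))

def max_f_measure(sal, gt, beta2=0.3, eps=1e-7):
    """Maximum F-beta over a sweep of binarization thresholds."""
    best = 0.0
    for thr in np.linspace(0.0, 1.0, 21):
        pred = sal >= thr
        tp = np.sum(pred & (gt > 0.5))
        precision = tp / (np.sum(pred) + eps)
        recall = tp / (np.sum(gt > 0.5) + eps)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
        best = max(best, float(f))
    return best

gt = np.zeros((8, 8))
gt[2:6, 2:6] = 1.0                  # toy ground-truth mask
good = gt * 0.9                     # confident, correctly placed prediction
m = mae(good, gt)                   # 16 pixels off by 0.1 over 64 -> 0.025
f = max_f_measure(good, gt)         # near 1.0 at a suitable threshold
```

Because the F-measure is taken at the best threshold while MAE averages the raw values, a well-localized but under-confident map can score a near-perfect F-measure while still carrying a non-zero MAE, which is why the paper reports both.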


Table 1
Performance comparison of classification network models.

Classification Experiment
In order to better assess the performance of deep learning on foggy image classification, this paper evaluates three classic convolutional neural network models (AlexNet [10], VGG16 [19] and ResNet50 [7]) in classification experiments and comparisons.
In order to verify the effectiveness of the proposed scheme in this paper, training and testing are carried out through the DUTS [21] dataset.In the experiment, the three networks use the same parameters, batch size is set to 32, the loss function use Cross Entropy Loss, and the optimizer uses the Adam.
In the early days of CNN, researchers focused on improving the classification accuracy of the network.While CNN has developed so far, in order to reduce the time cost and hardware limitations of training and testing, it has high image classification accuracy and a small amount of parameters.Therefore, this paper adopts the Resnet50 [7] network as the basic model of the foggy image classification module.

Experimental Results and Analysis
The experimental environment is Windows 10; the CPU is an Intel(R) Core i9-9900K @ 3.6 GHz, and an RTX 2080 Ti GPU is used for training. The Python version is 3.7 and the Torch version is 1.2.0. During training, the batch size is set to 8. We set the momentum parameter to 0.9, the weight decay to 0.001, and the learning rate to 5e-5. Adam is selected to train our networks.
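The training configuration above can be sketched as follows. Note that for Adam the reported "momentum" of 0.9 corresponds to the first-moment coefficient β₁; the model below is a hypothetical placeholder, not the paper's network.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the detection network described in the text.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Hyperparameters from the paper: lr 5e-5, "momentum" 0.9 (Adam's
# beta1), weight decay 0.001; the batch size of 8 is set in the loader.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-5,
    betas=(0.9, 0.999),   # beta1 = 0.9 plays the role of momentum
    weight_decay=0.001,
)
```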

Comparative Analysis
The experimental training set adopts DUTS-TRAIN [21], which contains 5019 fogless images, 5019 simulated light fog images, and 5019 simulated dense fog images. The effectiveness of the classification module is verified by comparing the detection performance before and after its introduction; the experimental results are shown in Table 2. Under the same parameter settings, it can be seen from Table 1 that: (1) AlexNet [10], as the most classic convolutional neural network, still achieves an accuracy of 89.54% in the three-class experiment, but its 0.7 GFLOPs cannot meet the network requirements; (2) for VGG16 [19], due to its deep network layers, although the accuracy is as high as 93.52%, the complexity and size of the model are not conducive to the optimization of the overall network. It can be concluded from Table 2 that: (1) under the fogless datasets, the four evaluation indicators on DUTS-TEST [21] differ from those of R²Net by at most about 0.14%; under HKU-IS [12], the E-measure and S-measure are close to those of R²Net, trailing by 0.0007; under ECSSD [25] and PASCAL-S [14], the indicators are similar to R²Net. (2) On the light fog datasets, the evaluation indicators of the improved network are slightly higher than those of R²Net on each dataset. (3) Under the dense fog datasets, the improved network has significant advantages over R²Net in all evaluation indicators on every dataset.
It can be seen from Figure 6 that: (1) in the fogless case, the network both before and after the improvement can accurately identify the outline of the bird, even the beak; (2) in light fog, our method successfully detects small-scale objects; (3) in dense fog, our method can still clearly detect the outline of the cat, and the performance gap between the network before and after the improvement is clearly visible. This shows that the robustness of R²Net to images with different degrees of foggy degradation is not strong, and that that network is better suited to the situation where both training and test images are clear.

Comparison with Other Methods
This paper compares the improved network with other methods, including four deep learning methods (PoolNet [15], U²Net [16], PurNet [13], CSNet [2]) and four traditional algorithms (FT [1], HC [3], IT [8], LC [28]). For a fair comparison, the other deep learning networks are trained and tested in the same environment, with the same datasets used for training and testing.
As shown in Table 3, this paper presents the objective evaluation results of the saliency maps and of saliency segmentation. MAE and S-measure are used to evaluate the non-binary saliency maps, and Max F-measure and E-measure are used to evaluate the binary saliency segmentation. It can be seen from Figure 7 and Table 3 that: (1) from a subjective visual perspective, for multiple targets (first row), we accurately detect two targets with different scales; for large objects (second row), we pinpoint the location of the object; our method is also robust to objects against complex backgrounds (fourth row). (2) In the cases of light fog and dense fog, the MAE and S-measure indicators of the improved network are better than those of the other methods on all four datasets, which shows that our saliency maps are similar to the ground-truth maps and have good region-aware and object-aware structural similarity. (3) Max F-measure and E-measure show that our saliency maps have consistently high confidence in the target region, so the method can efficiently detect the location of the most prominent target and segment it.

Conclusion
We propose a foggy saliency detection network based on R²Net. In terms of network structure, R²Net is selected as the base network, and a fog concentration classification module is added, so that the network can judge the fog concentration in an image and select the subsequent processing module accordingly.
Through the atmospheric scattering model, the foggy degradation process was simulated, and two types of simulated foggy images, «light fog» and «dense fog», were generated to expand the datasets. Compared with the original R²Net, the algorithm effectively improves the accuracy of saliency detection in foggy weather, improves the robustness and generalization ability of the network, and provides a new idea for saliency detection in foggy images. In addition, the universality of the fog density classification module with respect to other network models still needs to be improved, and a lightweight architecture for the fog density classification module is also worthy of further research.

Figure 1
Residual structure of ResNet

The residual structure constitutes two mapping paths, identity mapping and residual mapping, in the form of cross-layer links. By adding the identity map x in the process of common module connection, the network can effectively control the network layer parameters and the computational complexity while alleviating the problem of gradient disappearance. The residual structural unit can be expressed as:

x_{j+1} = x_j + F(x_j, W_j),

where x_j and x_{j+1} represent the input and output information of the network at this layer, and W_j represents the parameters to be learned in this layer. Performing a recursive operation on Equation (3), the feature representation of any deep unit J can be obtained:

x_J = x_j + Σ_{i=j}^{J-1} F(x_i, W_i).
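As a sketch of the identity-plus-residual form x_{j+1} = x_j + F(x_j, W_j), a minimal PyTorch residual unit might look like the following; this is a simplified two-convolution block for illustration, not the exact bottleneck block used in ResNet50.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual unit: output = ReLU(F(x) + x), where F is the
    learned residual mapping and x passes through unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)  # identity path + residual path
```

If the residual branch F is zeroed out, the block reduces to the identity for non-negative inputs, which is precisely what makes very deep stacks trainable.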

R²Net

R²Net [6] proposes a new residual learning strategy that, unlike previous multi-scale models, gradually generates prediction maps at each scale. This strategy arranges the prediction maps at each scale from small to large until the result matches the best ground-truth map. R²Net employs a Dilated Convolutional Pyramid Pooling (DCPP) module to generate coarse predictions based on global contextual information, which can locate the general position of target objects. The DCPP module consists of dilated convolutions at different rates to capture local and global information, and has relatively few parameters compared with using fully connected layers.
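A hedged sketch of a DCPP-style module: parallel 3×3 dilated convolutions at a few rates whose outputs are concatenated and fused into one coarse map. The specific rates, channel widths, and fusion step here are illustrative assumptions; R²Net's exact configuration may differ.

```python
import torch
import torch.nn as nn

class DCPPSketch(nn.Module):
    """Parallel dilated convolutions at several rates; padding equals
    the dilation rate, so every branch preserves spatial size."""
    def __init__(self, in_ch, branch_ch=16, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r)
            for r in rates
        )
        # 1x1 fusion down to a single coarse prediction map
        self.fuse = nn.Conv2d(branch_ch * len(rates), 1, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```

Because every branch keeps the input resolution, the module mixes local (small-rate) and global (large-rate) context without any fully connected layer.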

Figure 2
Residual refinement network model


Figure 3
Saliency detection network in foggy weather

The residual network consists of several residual blocks, and a structure with multiple residual blocks is called a layer. The initial layer is an ordinary convolution structure; layer1 contains 3 residual blocks, layer2 contains 4, layer3 contains 6, and layer4 contains 3, followed finally by a fully connected layer. After an image of size 224 × 224 pixels is passed through layer1 to layer4, the residual network extracts features for learning and training, finally reducing the feature map size to 7 × 7 pixels. After the residual network, the features are passed to the average pooling layer and averaged, and finally the image category is decided by the Softmax function of the fully connected layer.

Figure 4
ResNet50 network structure

The purpose of using atrous (dilated) convolution is to enlarge the receptive field without losing spatial resolution. Atrous convolution has another advantage: by setting different dilation rates, different receptive fields are obtained. Information at different scales can then be gathered from the different receptive fields, which plays an important role in vision tasks.
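For a k × k kernel with dilation rate d, the effective receptive field along one axis is k + (k − 1)(d − 1). A tiny sketch illustrating how different rates grow the receptive field while the parameter count (nine weights for a 3 × 3 kernel) stays fixed:

```python
def dilated_receptive_field(kernel_size: int, dilation: int) -> int:
    """Effective 1-D extent of a dilated kernel:
    k + (k - 1) * (d - 1), i.e. gaps of (d - 1) between taps."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# A 3x3 kernel keeps 9 weights at every rate, but covers a wider area:
fields = {d: dilated_receptive_field(3, d) for d in (1, 2, 4, 8)}
```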

3.3. Loss Function

In this paper, the binary cross-entropy loss (L_BCE), which is often used in classification problems, is used as the loss function of the classification module. R²Net adopts the standard cross-entropy loss to calculate the per-pixel loss, which ignores the global structure of the image. To remedy this deficiency, this paper uses the IoU loss [17] (L_IOU) to focus on the global structure, thereby forming a global constraint on the network. For our optimized network, the loss (L_BCE) obtained in the classification module is used first; the classified foggy image is then sent to the detection network for training, and finally the detection result of the target is obtained. To sum up, the loss function of the model in this paper is

L = L_BCE + λ · L_IOU,

where L_BCE is the loss function of the classification network module, L_IOU is the loss function of the detection network module, and λ is the balance factor.
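A hedged sketch of this combined objective. The soft-IoU form below is one standard formulation and may differ in detail from [17]; the classification term is written with PyTorch's standard cross-entropy, and the balance factor λ here is a placeholder value.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1.0):
    """Soft IoU loss on probability maps: 1 - |P ∩ G| / |P ∪ G|."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return 1.0 - (inter + eps) / (union + eps)

def total_loss(class_logits, class_target, sal_pred, sal_gt, lam=1.0):
    """L = L_BCE (classification module) + lambda * L_IOU (detection)."""
    l_bce = F.cross_entropy(class_logits, class_target)
    l_iou = iou_loss(torch.sigmoid(sal_pred), sal_gt)
    return l_bce + lam * l_iou
```

Unlike the per-pixel cross-entropy, the IoU term couples all pixels through the shared intersection and union sums, which is what gives the global structural constraint described above.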

Figure 5 is an example of the dataset used in the text.

Figure 5
The dataset: (a) fogless dataset; (b) light fog dataset; (c) dense fog dataset


Figure 6
Subjective visual comparison

Figure 7
Visually detected results at different concentrations

Table 1
Performance comparison of classification networks

Table 3
Detection results under the fogless dataset. The best fogless results are shown in bold.

Table 4
Detection results under the light fog dataset. The best light fog results are shown in bold. D(DUTS); E(ECSSD); H(HKU-IS); P(PASCAL-S)


Table 5
Detection results under the dense fog dataset. The best dense fog results are shown in bold. D(DUTS); E(ECSSD); H(HKU-IS); P(PASCAL-S)