Bio-driven visual saliency detection with color factor

Wang, Yan; Li, Teng; Wu, Jun; Ding, Chris H. Q.

doi:10.3389/fbioe.2022.946084

ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 04 August 2022

Sec. Bionics and Biomimetics

Volume 10 - 2022 | https://doi.org/10.3389/fbioe.2022.946084

Bio-driven visual saliency detection with color factor

1. School of Computer Science and Technology, Anhui University, Hefei, China
2. School of artificial intelligence, Anhui University, Hefei, China
3. Guangdong CAS Cogniser Information Technology Co., Ltd., Guangzhou, China
4. School of Data Science, The Chinese University of Hong Kong, Shenzhen, China

Article metrics

View details

Citations

2,6k

Views

658

Downloads

Abstract

Most visual saliency computing methods build models based on the content of an image without considering the colorized effects. Biologically, human attention can be significantly influenced by color. This study firstly investigates the sole contribution of colors in visual saliency and then proposes a bio-driven saliency detection method with a color factor. To study the color saliency despite the contents, an eye-tracking dataset containing color images and gray-scale images of the same content is proposed, collected from 18 subjects. The CIELab color space was selected to conduct extensive analysis to identify the contribution of colors in guiding visual attention. Based on the observations that some particular colors and combinations of color blocks can attract much attention than others, the influence of colors on visual saliency is represented computationally. Incorporating the color factor, a novel saliency detection model is proposed to model the human color perception prioritization, and a deep neural network model is proposed for eye fixation prediction. Experiments validate that the proposed bio-driven saliency detection models make substantial improvements in finding informative content, and they benefit the detection of salient objects which are close to human visual attention in natural scenes.

1 Introduction

When viewing a visual scene, the human visual system can quickly focus on some unique vision areas. An understanding of human biological mechanisms in visual saliency detection is essential to many applications, including video segmentation (Ren et al., 2021), target detection Chong et al. (2020), image enhancement Sun et al. (2022), and activity recognition Jiang et al. (2019), Chen et al. (2021).

Saliency in a visual scene can arise from a spectrum of stimuli, both low-level image properties and semantic-level information Rosenholtz et al. (2011). In human visual system, color, besides contrast, intensity, and motion, is considered one of the primary features in computing bottom-up saliency. As we can see from the example in Figure 1, the guiding powers of color stimuli and grayscale stimuli are vastly different for visual attention. By comparing the eye fixation maps in both color images and gray-scale images of the same content, it is clear that color has its sole contribution to visual saliency. However, many existing attention models usually neglect the colorized effects and predict the same results for images containing the same content. The saliency based solely on color has not been well studied. One reason is the lack of eye tracking datasets including color images and gray-scale images of the same content.

FIGURE 1

However, previous visual saliency models considering color attributes address a problem that is relatively ill-posed. Classical saliency models Gelasca et al. (2005), Choi and Suk (2015) evaluate which colors attract more attention based on a subjective experiment, considering only a few colors, and the results of these studies cannot be extended to natural visual scenes. Moreover, these studies did not consider the effects of content information (e.g., position/order) rather than color. Researchers have not yet attempted to computationally model the relationship between visual attention and color, despite its contents.

In this paper, we claim that the computational model should consider the saliency of the image separately in color and gray-scale scenes and propose a new Color-Gray eye-tracking dataset (CGed) focusing on image color and human attention. Statistical analysis on CGed is conducted to investigate how colors influence human attention when viewing natural scene, how much the colors attract human attention, and how to model the factor of colors in visual saliency computing. Analytical results indicate that certain colors attract human visual attention strongly, and the color component b of the CIELab color space is closely related to visual attention more closely than the others—which we refer to as the color prioritization effect.

Based on these discoveries, we propose a salient object detection model RNCw (Region contrast based on Non-uniform quantification and Channel-weighted color distance) by incorporating the color prioritization effect into the previous method proposed by

Cheng et al. (2015)

. We further apply our discovery in eye fixation prediction and design a color weighted DNN (APNet—Not Adaptive color weighting priori attention weighting Network) model. Experiments demonstrate the superior performance of the models we proposed, especially when color-eliciting objects stand out in a scene. Our contributions can be summarized as follows:

1) We propose a new image dataset (CGed) featuring visual attention. To the best of our knowledge, this is the first dataset that contains both color images and grayscale images of the same content. It is designed for research on visual saliency, especially with regard to the effect of image color on saliency despite image content.
2) We evaluate how colors attract human attention computationally. We observe that some particular colors attract our attention more than others, and that certain combinations of color blocks can enhance attention.
3) We proposed two novel saliency computational models incorporating the color factor: RNCw model, which is compliant with human perception to improve the performance in detecting salient objects; and APNet model that encodes relative importance of objects in an image to achieve predictions more coinciding with human visual attention. The proposed models achieve state-of-the-art performance on benchmark datasets.

2 Related works

2.1 Color and visual saliency model

Since color information plays an important role in visual attention biologically, it has been used in saliency computation in previous works. Osberger Pappas (2001) suggested that some particular colors (e.g., red) attract our attention more than others, or induce a higher amount of masking. However, saliency researchers have not yet investigated what color attracts human attention despite its content when viewing natural scenes. One major reason could be the lack of a proper dataset with both color images and gray-scale images of the same content.

Achanta et al. (2008), Achanta and Süsstrunk (2010) use the color and luminance features to detect salient objects. They calculated the contrast between the local image region and its surroundings. The saliency map can be obtained by calculating the average color vector difference. Borji and Itti (2012) propose a prediction model to reflect the saliency discrimination towards eye tracking data. The model measures the scarcity of each block in both RGB and LAB color space, and then combines the local and global saliency of each color space to generate the saliency map. As stated above, many saliency models compute image saliency primarily by measuring the color feature. However, these models did not clearly consider the sole contribution of colors, excluding the image content factor.

In the past decade, substantial research has been done on visual attention computational models to predict saliency. Traditional attention models mainly rely on various cues to detect salient objects, including local contrast Klein and Frintrop (2011), global contrast Cheng et al. (2011) and background prior Yang et al. (2013). Subsequent behavioral and computational studies started to predict fixation with saliency maps to understand human visual attention and verify saliency models. A large gain in saliency prediction has resulted from the recent resurgence of convolutional neural networks (CNNs). Specifically, several methods such as Liu et al. (2015) used CNN to extract features from multiple images region with varying contexts, and then combined these contextual features to infer saliency. Some other models, such as Li and Yu (2016), adopt fully convolutional networks (FCNs) for feature representation at each image location and generate saliency maps in a convolutional way. Recent developed visual representation models such as the visual transformer have also been applied to salient object detection Liu et al. (2021), and they achieved high performance on previous datasets.

Along with these advances, attention models can effectively extract visual features and compute feature maps to quantify saliency. However, existing methods did not consider the unique influence of colors in saliency computing models, while human visual attention order is sensitive to different colors in a natural scene. By weighting the contribution of color to attention, our work effectively addresses the color prioritization effect on attention allocation in an image.

2.2 Eye-tracking datasets

Several datasets have been introduced to further challenge the eye fixation prediction model. Two widely-used image datasets are the MIT dataset Judd et al. (2010), which contains 1,003 natural images free-viewed by 15 subjects each, and the NUSEF dataset Ramanathan et al. (2010), which includes 758 (emotion evoking) images free-viewed by 25 subjects each. There are other datasets focusing on specific domains: OSIE Xu et al. (2014) features multiple dominant objects in each image, and CAT2000’s training set contains 2,000 images of diverse scenes, such as affective images and cartoons. However, there are few eye-tracking datasets suitable for research regarding the saliency of color despite its content. In this paper, we present the first eye-tracking dataset to include both color images and grayscale images of the same content.

3 Construction of the CGed dataset

The saliency of images considering content has been extensively explored, but the saliency based solely on color has not received much attention, probably due to the lack of eye-tracking datasets including color images and gray-scale images of the same content. To address the problem, we constructed CGed, a new dataset containing both color images and gray-scale images of the same content, with eye-tracking data. It is designed for research on visual saliency, especially with regard to the saliency of color, despite its content.

3.1 Image collection

CGed images were collected partially from the MIT1003 dataset and partially from an online image search engine. It contains a total of 500 brightly colored images with various semantic concepts ranging from rural to urban environments. These 500 images of natural scenes are rich in color. All color images were then converted to grayscale, so that the CGed dataset includes a total of 1,000 images, with 500 grayscale images. We collected the images to make the dataset more diverse regarding how observers’ attention is attracted.

3.2 Eye tracking

Eighteen subjects freely observe all CGed images on a 22-inch LCD monitor for 5 s. Nine of the subjects are male and the others are female, and their ages are distributed in the range of 22 to 29. These subjects can focus their attention on given images and yield precise annotation. The screen resolution of an LCD monitor is 1,680, ×, 1,050. The visual angle of the stimuli is about 42.48° × 27.31°. Eye movements of the subjects are recorded using the SensoMotoric Instruments (SMI) iView X RED system. Eye position is recorded with an eye tracker operating at a 250 Hz sample rate.

4 Computational studies of color factor on visual saliency

In this section, we study the contribution of colors despite content to attention when viewing natural scenes. We first explain our analytical methods and then report observations with supporting analyses.

A necessary prerequisite for showing an influence the color on attention is the difference in attention score between the color image and the corresponding grayscale image. Comparing attention scores between color images and gray-scale images of the same content reveals a general influence of color on visual attention since all other image features remain the same. We first study the contribution of color to attention. Based on the finding of salient colors, we study the influence of combination of color blocks to attention.

4.1 Definitions and methods

For the study on CGed, we used a common method in saliency research Le meur and Baccino (2013). Specifically, for each image, we compute a fixation map by placing at each fixation location a Gaussian distribution with sigma equal to one degree of visual angle, and then normalizing the map to maximum 1. We define the attention score of an image pixel as the fixation-map value at this pixel. The attention score of each pixel thus ranges between 0 and 1. In order to study the effect of color on attention, we compute the difference in attention score (DAS) by subtracting the attention score of the color image from the attention score of the gray-scale image.

We use the Maximal Information Coefficient (MIC) David et al. (2011) to analyze the correlation between color and attention. MIC is a correlation statistic that measures the association strength of linear and non-linear relationships between paired variables.n_x ⋅ n_y < B(n), where B(n) = n^0.6 is the search-grid size. In calculating MIC for vectors x and y, n is the number of data points, and n_x, n_y is the number of bin of partition of the x − and y − axis, respectively. H(x) and H(y) represent the entropy associated with x and y, respectively. And the join entropy of a pair of random variables x and y is represented as H (x, y).

4.2 Statistical results

Which color space is closely related to visual saliency: The color spaces, including RGB, CIELab, and HSI, have been widely adopted by previous studies; the color space has a significant influence on the algorithm performance. In our study, we determined the color space used for data analysis by comparing the correlation between color and saliency in different color spaces.

In our study, only the colors corresponding to a DAS greater than 0.1 are considered salient colors, and the relationship between these colors and attention is studied. We first get a series of colors on the color stimuli, which corresponds to a DAS greater than 0.1. We decomposed the colors in the RGB, CIELab, and HSI color spaces separately into three components for statistical analysis. Since the range of values of color components in the RGB, CIELab, and HSI color spaces are different, we normalized each component. The statistical results are reported in Figure 2. The larger the value of the MIC, the stronger the correlation. The larger total MIC of three color components in CIELab color space over the total MIC of three color components in other color spaces (1.89 vs. 1.35 vs. 1.16) suggests that saliency is more relevant to CIELab color space than both RGB and HSI color spaces. Subsequent analysis of what color attracts attention, was performed in CIELab color space.

FIGURE 2

How colors contribute to attention: We focus on the role of colors in visual attention and try to understand what colors influence more visual attention. We uniformly quantize the value of each color component in the CIELab color space to a range of 0–15, for 16 × 16 × 16 = 4096. The quantized CIELab color space is called the L’a’b’ color space. To see what colors attract attention more intuitively, we encode each color in the L’a’b’ color space. We can see a series of colors and their corresponding attention scores in Figure 3. We also counted ten of the most attractive colors in the CIELAB space. It is noticed that some specific colors have much more saliency. By calculating the difference in the attention score of the color block combination corresponding to the color image and the same content gray-scale image, we also discovered that certain color block combinations often appear with high DAS.

FIGURE 3

5 Proposed color-aware saliency models

In this section, we design our saliency computational models guided by the psychophysics findings in the previous section. On one hand, we propose a salient object detection model RNCw (Region contrast based on Non-uniform quantification and Channel-weighted color distance) to detect salient object/regions. This salient object detection method is an improvement on the RC method Cheng et al. (2015). On the other hand, we designed a color-weighted DNN (APNet—Not Adaptive color weighting priori attention weighting network) for fixation prediction.

5.1 RNCw method

The proposed RNCw considers the visual sensitivity of human eyes to different colors in a natural scene, which is a variant of the RC method. The RC method mainly consists of two stages. In the RC method, the input image is first segmented into regions using a graphics-based image segmentation method Felzenszwalb and Huttenlocher (2004), then the color contrast is computed at the region level. The saliency for each region is defined as the weighted sum of the region’s contrast to all other regions in the image. Unlike the RC method, we consider the psychopysics findings in Section 4 in both phases.

In the first phase, different from the RC method, we use the weighted color distance to measure the similarity between pixels in CIELab color space and obtain a segmentation graph. The weighted color distance between the pixels i and j in an image can be computed as follows:where L_i, a_i, and b_i respectively represent the value of pixel I_i in CIELab channels, and L_j, a_j, and b_j respectively represent the value of pixel I_j in CIELab channels. The W_L, W_a, and W_b denote the weights of L, a, and b channels in CIELab color space, respectively. The weights W_L, W_a, and W_b are determined by correlation coefficient values between attention with the L component, the a component, and the b component in CIELab color space, respectively. Guided by our psychophysics findings, we set W_L = 0.61/1.89 = 0.32, W_a = 0.52/1.89 = 0.28, and W_b = 0.76/1.89 = 0.40. There is no difference in other steps. For more details, refer to Felzenszwalb and Huttenlocher (2004).

In the second phase, we incorporate channel-weighted color distance into the contrast to compute saliency. For a region r_x, we compute its saliency value aswhere is the weight of region r_x, and the meaning is the number of pixels in the r_x region. is the channel-weighted color distance metric between two regions and can be expressed aswhere represents the occurrence frequency of the i^th color in region r_x, and is defined aswhere L_i,j, a_i,j, and b_i,j denote the value in L, a, and b channels of the j^th in region r_i, respectively. The W_L, W_a, and W_b denote the weights of L, a, and b channels in CIELab color space, respectively, whose values are equal to the values in the first phase.

Similar to the approach suggested in the RC method to increase the effects of closer regions and decrease the effects of farther regions, we also incorporate the spatial weighting in terms of Rutishauser et al. (2004). Thus, for any region r_x, the saliency iswhere denotes the weight of region r_y which is the number of pixels in region r_y, and denotes the spatial distance between the regions r_x and r_y. The σ_s controls the strength of spatial distance weighting. As with the RC method, we use with pixel coordinates normalized to [0,1].

5.2 APNet model

We proposed a DNN architecture (APNet—Adaptive color weighting and priori attention weighting Network) is shown in Figure 4. To address color prioritization, we designed a channel weighting subnetwork and a priori attention weighting subnetwork. The channel weighting subnetwork (the red dashed rectangle) encodes contextual information, enabling the network to highlight color-eliciting objects from the surroundings. The priori attention weighting subnetwork (the blue dashed rectangle) directly weights the prediction saliency map output by the DNN, which weights the combines the human eye’s perception sensitivity to color, and can achieve a saliency prediction closer to human visual attention. Also, since selective attention may happen at different resolutions, we incorporate information at multiple-scales. The two-stream network design is based on SALICON Huang et al. (2015) and is used to extract deep features from coarse-scale images and fine-scale images.

FIGURE 4

We briefly introduce our method for using DNN in fixation prediction. We feed fine-scale images of 1,000 × 750 × 3 pixels to its first stream for extracting relatively high-resolution deep features, while feeding coarser-scale images of 500 × 375 × 3 pixels to its second stream for extracting relatively low-resolution deep features. The outputs of the two network streams are rescaled to the same spatial resolution and stacked together to form a multi-scale depth feature of size 32 × 24 × 1024. After applying a 2 × 2 max polling on 1,024 channels of concatenated feature maps to reduce their dimensionality and spatial variance, the channel weighting subnetwork computes a set of 1024-dimensional feature weights for each image. And the weights are applied to the input feature by a channel-wise multiplication. We then performed a convolutional layer after a new subnet with a 1 × 1 kernel to reduce the 1,024 channel 2D image to a single channel 2D saliency map of size 32 × 24 pixels. Lastly, we rescale the saliency map back to the dimension of the original image. In the test phase, each pixel of the output saliency map is multiplied by a weight. Specifically, each pixel in the output saliency map is multiplied by the attention score of each pixel in the original image in the L’a’b color space, and then the saliency map is normalized to a maximum value that is equal to the maximum gray value of the unweighted saliency map. Since the “saliency color” and its attention score in the fourth section are under the condition that the DAS is greater than 0.1, the weight of the “non-saliency color” in the original image is set to 0.1.

6 Experiments

We evaluate our salient object detection models on the ECSSD dataset and PASCAL-S within images from a wide-variety of scenarios and resolutions. To demonstrate the effectiveness of the proposed eye fixation prediction model in predicting eye fixations, we evaluated it in CGed, CAT2000, and OSIE datasets.

6.1 Salient object detection

6.1.1 Datasets

We test the RNCw model on the ECSSD dataset and the PASCAL-S dataset. ECSSD contains 1,000 structurally complex images acquired from the Internet, and the groundtruth masks were annotated by five labelers. PASCAL-S contains 850 natural images with both saliency segmentation groundtruth and eye fixation groundtruth. Saliency groundtruth masks of PASCAL-S were labeled by 12 subjects.

6.1.2 Evaluation metrics

There are a plethora of metrics available that are used to evaluate saliency models. We use three universally agreed-upon and standard measures for evaluating salient object detection models in salient object detection datasets. They are MAE Borji et al. (2015), S-measure Fan et al. (2017), and Precion-recall (PR) curve Tong et al. (2014).

6.1.3 Result

Improved model performance. Our approach makes use of the color prioritization effect in attention in order to comply with human perceptual characteristics, which improves the performance of the salient object detection model. We report results for our improved salient object detection model (RNCw) with its base models (RC). Table 1 shows the MAE and S-measure scores on two datasets. We can see that the RNCw model can predict salient regions in complex scenes and distinguish the non-salient regions in the scene. That is, compared with the RC model, the salient regions of our saliency maps are more prominent, and the non-salient regions are rarely mistakenly predicted. On the ECSSD and PASCAL datasets, we obtained the same analysis results. Although our improved algorithm is targeted at salient object detection, its performance on human fixed prediction benchmarks has also improved (see Table 2). In addition, a large number of experimental results show that the salient object detection model (RNCw) based on priori color perception designed in this paper can improve the performance and efficiency of saliency detection to some extent.

TABLE 1

Metric	Dataset	RC	RNCw
MAE	ECSSD	0.188	0.173
MAE	PASCAL-S	0.300	0.289
S-measure	ECSSD	0.651	0.669
S-measure	PASCAL-S	0.584	0.589

The S-measure and MAE on two salient object detection datasets.

TABLE 2

Method	CGed	CAT2000	OSIE
RC	0.7131	0.7233	0.7310
RNCw	0.7251	0.7294	0.7408

The ROC areas on three eye-tracking datasets.

6.2 Fixation prediction

6.2.1 Datasets

In addition to the CGed dataset, we tested APNet on two other eye-tracking datasets. One is the CAT2000 training set, which contains 2,000 images, and another is the OSIE dataset, which contains 700 images.

6.2.2 DNN parameters

We train our APNet by first initializing the weights and biases from the VGG-16 model on ImageNet. We use a momentum of 0.9 and a weight decay of 0.0005. The learning rate is 0.00005 and the batch size is 32. The entire training procedure takes about 1 day to complete on a single NVIDIA V100 GPU using the caffe deep learning framework Sharma et al. (2015).

6.2.3 Evaluation metrics

We use nine metrics for comprehensive evaluation. The Area Under the Curve (AUC) Green and Swets (1966) is the area under a curve of true positive rate versus false positive rate for different thresholds on the saliency map. We use two variants of AUC: AUC-Judd and AUC-Borji Bylinskii et al. (2017), and shuffled-AUC (sAUC) Tatler et al. (2005), which alleviates the effects of center bias. The Normalized Scanpath Saliency (NSS) Peters et al. (2005) computes the average value at all fixations in a normalized saliency map. Similarity (SIM) Judd et al. (2012) calculates the sum of minimum values of saliency distribution and fixation distribution at each point. The saliency map can be compared with the human fixation map with the Linear Correlation Coefficient (CC) Le Meur et al. (2007) and the Kullback-Leibler divergence (KL) Tatler et al. (2005). The Earth Movers Distance (EMD) Rubner et al. (2000) considers the ground-truth and predicted saliency maps to be two probability distributions and measure the cost of transferring one distribution to the other. Information Gain (IG) Bylinskii et al. (2017) as an information theoretic metric that measures saliency model performance beyond systematic bias (e.g., a center prior baseline).

6.2.4 Result

Tables 3–5 give the quantitative results of comparison with state-of-the arts models on OSIE dataset, CAT2000 dataset and CGed dataset respectively, and Figure 5 shows visual comparisons of results generated by our saliency model with previous methods. We report results for our model both with the subnetwork for context saliency prediction (APNet) and without the subnetwork (N-APNet—Not Adaptive color weighting and priori attention weighting Network). We compared saliency prediction models with seven others. There are state-of-the-art DNN-based models: CASNet Fan et al. (2018), SALICON Thomas (2016) (We use the code of OpenSALICON Thomas (2016) which is a publicly available implementation of SALICON), SalGAN Pan et al. (2017), and ML-Net Cornia et al. (2016). Four are non-DNN models with top performance in the non-DNN model category: BMS), SROD, and GBVS. These models achieved state of the art performance in their experiments on benchmark.

TABLE 3

Metric	APNet	N-APNet	CASNet	SALICON	SalGAN	ML-Net	BMS	SROD	GBVS
AUC-Judd	0.83	0.82	0.83	0.81	0.82	0.80	0.77	0.75	0.71
AUC-Borji	0.82	0.81	0.80	0.79	0.77	0.77	0.76	0.74	0.70
sAUC	0.74	0.73	0.73	0.72	0.73	0.71	0.72	0.70	0.66
NSS	1.76	1.72	1.75	1.71	1.73	1.72	1.44	1.35	1.11
IG	6.41	6.37	6.39	6.31	6.40	6.37	6.24	6.31	6.01
CC	0.74	0.72	0.74	0.73	0.73	0.72	0.54	0.52	0.40
SIM	0.62	0.58	0.62	0.57	0.58	0.58	0.53	0.52	0.44
EMD	4.12	4.17	4.15	4.12	4.17	4.23	6.31	6.33	6.57
KL	5.99	6.09	6.05	6.13	6.21	6.24	6.34	5.37	6.44

Quantitative results on CGed dataset (color images). The best performance in each metric is highlighted in bold. For all evaluation metrics larger values indicate performance, except smaller is better for EMD and KL.

TABLE 4

Metric	APNet	N-APNet	CASNet	SALICON	SalGAN	ML-Net	BMS	SROD	GBVS
AUC-Judd	0.82	0.81	0.82	0.80	0.81	0.79	0.78	0.77	0.80
AUC-Borji	0.80	0.77	0.79	0.78	0.80	0.73	0.77	0.76	0.79
sAUC	0.79	0.74	0.76	0.77	0.77	0.75	0.73	0.72	0.75
NSS	1.51	1.36	1.50	1.45	1.45	1.35	1.15	1.07	1.25
IG	0.37	0.25	0.37	0.09	0.08	0.27	-0.17	-0.21	-0.25
CC	0.59	0.52	0.58	0.56	0.56	0.52	0.44	0.41	0.49
SIM	0.55	0.53	0.57	0.53	0.53	0.52	0.49	0.48	0.51
EMD	2.86	2.89	2.42	3.21	3.21	2.86	3.12	3.31	3.12
KL	5.77	5.84	5.82	6.03	6.08	6.08	6.21	6.06	6.29

Quantitative results on CAT2000 dataset. The best performance in each metric is highlighted in bold.

TABLE 5

Metric	APNet	N-APNet	CASNet	SALICON	SalGAN	ML-Net	BMS	SROD	GBVS
AUC-Judd	0.88	0.87	0.88	0.87	0.87	0.86	0.84	0.81	0.78
AUC-Borji	0.86	0.83	0.85	0.84	0.84	0.78	0.81	0.80	0.73
sAUC	0.84	0.74	0.85	0.82	0.82	0.77	0.79	0.78	0.73
NSS	2.38	2.37	2.36	2.31	2.26	2.37	1.54	1.33	0.37
IG	2.99	2.91	2.89	2.93	2.84	2.77	2.43	2.18	2.34
CC	0.72	0.69	0.75	0.69	0.72	0.72	0.48	0.43	0.44
SIM	0.57	0.54	0.59	0.53	0.60	0.61	0.43	0.40	0.42
EMD	2.99	3.23	2.97	3.21	2.94	2.78	4.10	4.33	4.42
KL	5.80	5.83	5.82	5.90	5.91	5.91	6.30	6.48	6.44

Quantitative results on OSIE dataset. The best performance in each metric is highlighted in bold.

FIGURE 5

The comparison method ensures fairness. We exclude DNN models that use or learn center bias (e.g., SAM-ResNet Cornia et al. (2018)). All DNN-based models are trained on the SALICON dataset to achieve their best possible performance, and all models, including ours, are directly tested on the three benchmark datasets without training/fine-tuning on them.

As shown in the quantitative results, the proposed APNet model with the contextual saliency subnetwork has the best overall performance across dataset, without additional center bias mechanism. APNet consistently outperforms N-APNet on all datasets, indicating the effectiveness of learning the relative weights of salient regions inside an image through the proposed subnetwork. On AUC-judd, NSS, IG, and KL, APNet consistently outperforms. For other metrics, APNet is not always the best, but it is close to the best. As we all know, NSS and IG consider the relative importance of salient regions and are therefore the best evaluation measures for contextual saliency. APNet beats the other models on these two metrics across all three datasets, demonstrating its advantage in contextual saliency. In Figure 6, we analyze the effectiveness of APNet in learning the relative importance of contextual information for brightly colored objects.

FIGURE 6

7 Conclusion

Studies conducted in the paper focus on understanding how humans perceive and prioritize colors in scenes and how it is related to visual attention. Based on the statistical results on a collected CGed dataset, we proposed a novel salient object detection model and a novel fixation prediction model considering the factor of color computationally. Our current work is still limited in dataset size, and the study of colorized effects on visual saliency still needs further exploration. For future work, we will collect more data to cover different scenes and different subjects. Previous work showed that fusing multi-scale features Huang et al. (2022) or multi-modal information such as depth Jiang et al. (2021) is useful, which could be another future direction. Moreover, considering that the fixation prediction models are constructed originally to understand human visual attention and eye movement prediction, while the saliency detection models were driven by the requirements of saliency-based applications, another future work could focus on integrating the two tasks of eye fixation prediction and salient object detection, to enhance the performance of both types of models. The proposed color-aware saliency computing methods can also be extended to benefit other related areas such as object proposal generation and segmentation.

Statements

Data availability statement

The raw data supporting the conclusion of this article will be made available by the authors without undue reservation, upon request.

Author contributions

YW was responsible for experiments and manuscript preparation. TL worked as a supervisor for all procedures. JW was responsible for programming and data processing. CD participated in discussions and revisions. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the National Natural Science Foundation of China (No. 11975312) and the Anhui Provincial Natural Science Foundation of China (No. 1908085J25).

Conflict of interest

Author JW was employed by the company Guangdong CAS Cogniser Information Technology Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1
AchantaR.EstradaF.WilsP.SüsstrunkS. (2008). “Salient region detection and segmentation,” in International conference on computer vision systems, Santorini, Greece, May 12‐15, 2008 (Springer), 66–75.
- Google Scholar
2
AchantaR.SüsstrunkS. (2010). “Saliency detection using maximum symmetric surround,” in 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26-29 September 2010 (IEEE), 2653–2656.
- Google Scholar
3
BorjiA.ChengM. M.JiangH.LiJ. (2015). Salient object detection: A benchmark. IEEE Trans. Image Process.24, 5706–5722. 10.1109/tip.2015.2487833
- CrossRef
- Google Scholar
4
BorjiA.IttiL. (2012). “Exploiting local and global patch rarities for saliency detection,” in 2012 IEEE conference on computer vision and pattern recognition, Providence, RI, USA, 16-21 June 2012 (IEEE), 478–485.
- Google Scholar
5
BylinskiiZ.JuddT.OlivaA.TorralbaA.DurandF. (2017). “What do different evaluation metrics tell us about saliency models?,” in IEEE transactions on pattern analysis and machine intelligence, 1.
- Google Scholar
6
ChenT.YinX.YangJ.CongG.LiG. (2021). Modeling multi-dimensional public opinion process based on complex network dynamics model in the context of derived topics. Axioms10, 270. 10.3390/axioms10040270
- CrossRef
- Google Scholar
7
ChengM.-M.MitraN. J.HuangX.TorrP. H.HuS.-M. (2015). Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell.37, 569–582. 10.1109/tpami.2014.2345401
- CrossRef
- Google Scholar
8
ChengM. M.ZhangG. X.MitraN. J.HuangX.HuS. M. (2011). “Global contrast based salient region detection,” in Computer vision and pattern recognition, Colorado Springs, CO, June 20–25, 2011, 409–416.
- Google Scholar
9
ChoiK.SukH. (2015). Optimal employment of color attributes to achieve saliency in icon matrix designs. Color Res. Appl.40, 429–436. 10.1002/col.21922
- CrossRef
- Google Scholar
10
ChongE.WangY.RuizN.RehgM. J. (2020). “Detecting attended visual targets in video,” in IEEE/CVF conference on computer vision and pattern recognition (CVPR), Seattle, WA, June 13–19, 2020 (IEEE), 5396–5406.
- Google Scholar
11
CorniaM.BaraldiL.SerraG.CucchiaraR. (2016). “A deep multi-level network for saliency prediction,” in Pattern recognition (ICPR), 2016 23rd international conference on (IEEE), 3488–3493.
- Google Scholar
12
CorniaM.BaraldiL.SerraG.CucchiaraR. (2018). Predicting human eye fixations via an lstm-based saliency attentive model. IEEE Trans. Image Process.27, 5142–5154. 10.1109/tip.2018.2851672
- CrossRef
- Google Scholar
13
DavidN. R.YakirA. R.HilaryK. F.SharonR. G.GileanM.PeterJ. T.et al (2011). Detecting novel associations in large data sets. Science334, 1518–1524. 10.1126/science.1205438
- CrossRef
- Google Scholar
14
FanD.-P.ChengM.-M.LiuY.LiT.BorjiA. (2017). “Structure-measure: A new way to evaluate foreground maps,” in Proceedings of the IEEE international conference on computer vision, Venice, Italy, October 22–29, 2017, 4548–4557.
- Google Scholar
15
FanS.ShenZ.JiangM.KoenigB. L.XuJ.KankanhalliM. S.et al (2018). “Emotional attention: A study of image sentiment and visual attention,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, June 18–23, 2018, 7521–7531.
- Google Scholar
16
FelzenszwalbP. F.HuttenlocherD. P. (2004). Efficient graph-based image segmentation. Int. J. Comput. Vis.59, 167–181. 10.1023/b:visi.0000022288.19776.77
- CrossRef
- Google Scholar
17
GelascaE. D.TomasicD.EbrahimiT. (2005). “Which colors best catch your eyes: A subjective study of color saliency,” in First International Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, Arizona, USA, January 1, 2005.
- Google Scholar
18
GreenD. M.SwetsJ. A. (1966). Signal detection theory and psychophysics, 1. New York: Wiley.
- Google Scholar
19
HuangL.ChenC.YunJ.SunY.TianJ.HaoZ.et al (2022). Multi-scale feature fusion convolutional neural network for indoor small target detection. Front. Neurorobot.16, 881021. 10.3389/fnbot.2022.881021
- CrossRef
- Google Scholar
20
HuangX.ShenC.BoixX.ZhaoQ. (2015). “Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks,” in Proceedings of the IEEE international conference on computer vision, Santiago, Chile, December 7–13, 2015, 262–270.
- Google Scholar
21
JiangD.LiG.SunY.HuJ.YunJ.LiuY. (2021). Manipulator grabbing position detection with information fusion of color image and depth image using deep learning. J. Ambient. Intell. Humaniz. Comput.12, 10809–10822. 10.1007/s12652-020-02843-w
- CrossRef
- Google Scholar
22
JiangD.ZhengZ.LiG.SunY.KongJ.JiangG.et al (2019). Gesture recognition based on binocular vision. Clust. Comput.22, 13261–13271. 10.1007/s10586-018-1844-5
- CrossRef
- Google Scholar
23
JuddT.DurandF.TorralbaA. (2012). A benchmark of computational models of saliency to predict human fixations. MIT-CSAIL-TR-2012-001.
- Google Scholar
24
JuddT.EhingerK.DurandF.TorralbaA. (2010). “Learning to predict where humans look,” in IEEE 12th international conference on computer vision, iccv, Kyoto, Japan, September 27–October 4, 2009, 2106–2113.
- Google Scholar
25
KleinD. A.FrintropS. (2011). “Center-surround divergence of feature statistics for salient object detection,” in IEEE international conference on computer vision, Barcelona, Spain, November 6–13, 2011, 2214–2219.
- Google Scholar
26
Le MeurO.Le CalletP.BarbaD. (2007). Predicting visual fixations on video based on low-level visual features. Vis. Res.47, 2483–2498. 10.1016/j.visres.2007.06.015
- CrossRef
- Google Scholar
27
Le meurT.BaccinoT. (2013). Methods for comparing scanpaths and saliency maps: Strengths and;weaknesses. Behav. Res. Methods45, 251–266. 10.3758/s13428-012-0226-9
- CrossRef
- Google Scholar
28
LiG.YuY. (2016). “Deep contrast learning for salient object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, June 27–30, 2016, 478–487.
- Google Scholar
29
LiuN.HanJ.ZhangD.WenS.LiuT. (2015). “Predicting eye fixations using convolutional neural networks,” in Computer vision and pattern recognition, Boston, MA, June 7–12, 2015, 362–370.
- Google Scholar
30
LiuN.ZhangN.WanK.ShaoL.HanJ. (2021). “Visual saliency transformer,” in IEEE/CVF international conference on computer vision (ICCV), Montreal, QC, October 10–17, 2021 (IEEE), 4702–4712.
- Google Scholar
31
PanJ.FerrerC. C.McGuinnessK.O’ConnorN. E.TorresJ.SayrolE.et al (2017). Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint [arXiv:1701.01081].
- Google Scholar
32
PappasT. N. (2001). “Automatic detection of regions of interest in complex video sequences,” in Human vision and electronic imaging VI, 361–372.
- Google Scholar
33
PetersR. J.IyerA.IttiL.KochC. (2005). Components of bottom-up gaze allocation in natural images. Vis. Res.45, 2397–2416. 10.1016/j.visres.2005.03.019
- CrossRef
- Google Scholar
34
RamanathanS.KattiH.SebeN.KankanhalliM.ChuaT.-S. (2010). “An eye fixation database for saliency detection in images,” in Computer vision – eccv 2010 (Berlin, Heidelberg: Springer), 6314, 30–43. 10.1007/978-3-642-15561-1_3
- CrossRef
- Google Scholar
35
RenZ.ZhangQ.ChengJ.HaoF.GaoX. (2021). Segment spatial-temporal representation and cooperative learning of convolution neural networks for multimodal-based action recognition. Neurocomputing433, 142–153. 10.1016/j.neucom.2020.12.020
- CrossRef
- Google Scholar
36
RosenholtzR.DoraiA.FreemanR. (2011). Do predictions of visual perception aid design?ACM Trans. Appl. Percept.8, 1–20. 10.1145/1870076.1870080
- CrossRef
- Google Scholar
37
RubnerY.TomasiC.GuibasL. J. (2000). The Earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis.40, 99–121. 10.1023/a:1026543900054
- CrossRef
- Google Scholar
38
RutishauserU.WaltherD.KochC.PeronaP. (2004). “Is bottom-up attention useful for object recognition?,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004, Washington, DC, USA, 27 June 2004 - 02 July 2004 (IEEE), II.
- Google Scholar
39
SharmaA.TuzelO.JacobsD. W. (2015). “Deep hierarchical parsing for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, June 7–12, 2015, 530–538.
- Google Scholar
40
SunY.ZhaoZ.JiangD.TongX.TaoB.JiangG.et al (2022). Low-illumination image enhancement algorithm based on improved multi-scale retinex and abc algorithm optimization. Front. Bioeng. Biotechnol.10, 865820. 10.3389/fbioe.2022.865820
- CrossRef
- Google Scholar
41
TatlerB. W.BaddeleyR. J.GilchristI. D. (2005). Visual correlates of fixation selection: Effects of scale and time. Vis. Res.45, 643–659. 10.1016/j.visres.2004.09.017
- CrossRef
- Google Scholar
42
ThomasC. (2016). Opensalicon: An open source implementation of the salicon saliency model. arXiv preprint arXiv:1606.00110.
- Google Scholar
43
TongN.LuH.ZhangL.RuanX. (2014). Saliency detection with multi-scale superpixels. IEEE Signal Process. Lett.21, 1035–1039. 10.1109/LSP.2014.2323407
- CrossRef
- Google Scholar
44
XuJ.JiangM.WangS.KankanhalliM. S.ZhaoQ. (2014). Predicting human gaze beyond pixels. J. Vis.14, 28. 10.1167/14.1.28
- CrossRef
- Google Scholar
45
YangC.ZhangL.LuH.XiangR.YangM. H. (2013). “Saliency detection via graph-based manifold ranking,” in IEEE conference on computer vision and pattern recognition, Portland, OR, June 23–28, 2013, 3166–3173.
- Google Scholar

Summary

Keywords

bio-driven, color space, saliency detection, fixation prediction, human attention

Citation

Wang Y, Li T, Wu J and Ding CHQ (2022) Bio-driven visual saliency detection with color factor. Front. Bioeng. Biotechnol. 10:946084. doi: 10.3389/fbioe.2022.946084

Received

17 May 2022

Accepted

27 June 2022

Published

04 August 2022

Volume

10 - 2022

Edited by

Gongfa Li, Wuhan University of Science and Technology, China

Reviewed by

Qieshi Zhang, Shenzhen Institutes of Advanced Technology (CAS), China

Mengran Zhou, Anhui University of Science and Technology, China

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Teng Li, liteng@ahu.edu.cn

This article was submitted to Bionics and Biomimetics, a section of the journal Frontiers in Bioengineering and Biotechnology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Bionics and Biomimetics

ORIGINAL RESEARCH article

Bio-driven visual saliency detection with color factor

Abstract

1 Introduction