Peer-reviewed Article

Neural saliency algorithm guide bi‐directional visual perception style transfer

2019; Institution of Engineering and Technology; Volume: 5; Issue: 1; Pages: 1-8; Language: English
Research Article (Open Access), first published 08 January 2020

10.1049/trit.2019.0034

ISSN

2468-6557

Authors

Chunbiao Zhu, Wei Yan, Xing Cai, Shan Liu, Thomas H. Li, Ge Li (corresponding author, geli@ece.pku.edu.cn)

Affiliations: Chunbiao Zhu (orcid.org/0000-0002-6525-6686), Wei Yan, Xing Cai and Ge Li are with the Shenzhen Graduate School, Peking University, Shenzhen, People's Republic of China. Shan Liu is with Tencent America, San Francisco, CA, USA. Thomas H. Li is with the Shenzhen Graduate School, Peking University, Shenzhen, and the Advanced Institute of Information Technology, Peking University, Hangzhou, People's Republic of China.

Topic(s)

Image Enhancement Techniques

Abstract

The artistic style transfer of images aims to synthesise novel images by combining the content of one image with the style of another. It is a long-standing research topic that has already been widely applied in the real world. However, defining the aesthetic perception of the human visual system is a challenging problem. In this study, the authors propose a novel method for automatic visual perception style transfer. First, they present a novel saliency detection algorithm that automatically perceives the visual attention of an image.
Then, unlike conventional style transfer algorithms, in which the style is applied uniformly across all image regions, the authors use the saliency algorithm to guide the style transfer process, so that different types of style transfer can be applied in different regions. Extensive experiments show that the proposed saliency detection algorithm and style transfer algorithm are superior in both performance and efficiency.

1 Introduction

For hundreds of years, people have been attracted to the art of painting through many fantastic artworks, such as Starry Night by Van Gogh. In the past, only a well-trained artist was capable of reproducing an image in a particular style, which was time-consuming. Since the mid-1990s, the art theories behind such artworks have attracted the attention of both artists and computer vision researchers. There are plenty of studies exploring how to automatically transform images into synthetic artworks and perform aesthetic inference [[1]–[4]], such that anyone can be an artist. Among these studies, non-photorealistic rendering (NPR) is an advanced method whose progress has been inspiring, and it is now a firmly established field. However, NPR algorithms are usually highly dependent on the specific artistic styles they simulate (e.g. oil paintings, animations) and cannot easily be extended to produce stylised results for other artistic styles.

Recently, with the help of convolutional neural networks (CNNs), Gatys et al. [[5]] began to reproduce famous painting styles on natural images. They found that the content loss and the style loss can be minimised separately, yielding the final image representations from CNNs. Based on this discovery, Gatys et al. [[5]] proposed a neural style transfer algorithm that successfully produces fantastic stylised images with the appearance of a given artwork. Inspired by its strong performance, many works [[6]–[8]] concentrate on the problem of style transfer using deep neural networks. In these methods, style transfer is formulated as an optimisation problem: the final stylised image should have neural activations similar to the content image and feature correlations similar to the style image.

Despite achieving great results for certain styles, current methods have a limitation: the style transfer can only be applied to the image as a whole. This may cause the salient part to be obscured in the style-transferred image. These methods ignore meaningful spatial control over the transfer process, resulting in unsatisfying results. Fig. 1 shows one of the results in [[9]], where a sketch style is transferred onto a cat's photograph. The resulting picture is a fine sketch, but the cat is totally merged into the background and most people cannot recognise the cat immediately. In fact, excellent painters are good at focusing; their sketches are eye-catching and clear, rather than fuzzy and vague. So, how to introduce semantic information into style transfer and how to keep the theme of the resulting picture clear are meaningful questions.

Fig. 1: Sketch style transfer to a cat

Recently, the community has resorted to introducing semantic information into the system. Some methods [[5], [9]] have been proposed to solve this semantic style transfer problem, but they are still based on time-consuming backward propagation.
In [[5]], the authors present a possible solution to this computational bottleneck by extending their spatial control to fast neural style transfer [[9]]. Nevertheless, as the authors discuss, further factorisation must be enforced during network training. Furthermore, all of these works rely on a human-labelled mask to convey the semantic information. Mask-based transfer has two drawbacks. First, producing a human-labelled mask is time-consuming. Second, the edges of a labelled mask are not smooth. Therefore, it is necessary to find a new solution that automatically perceives the visual attention of an image.

We aim to extend current style transfer algorithms, which apply the style evenly across the whole image, to a discriminative manner that distinguishes between the important object and the unimportant background. We then apply different style transfers to the different parts to maintain the focus of the original image; in the example above, the cat is then less blended into the background.

In this paper, we propose a saliency-guided style transfer approach. The framework of the proposed method is shown in Fig. 2. First, we utilise a saliency detection network to obtain the salient part of an input image. The saliency map produced by the saliency detection network defines the importance of different regions of the input image. Then, we can perform style transfer on just the salient region (such as the object), on just the non-salient region (such as the background), with the same style on both regions, or with different styles on the salient and non-salient regions. Finally, the stylised salient region and non-salient region are combined and the final visual perception transfer result is obtained.

Fig. 2: The framework of the proposed method

Our contributions are three-fold: (i) we propose a novel saliency detection algorithm that automatically perceives the visual information of an image and guides the style transfer to express the aesthetic perception of the human visual system; (ii) we propose a bi-directional visual perception style transfer, which transfers an image with the same or a different style using differential transfer methods; (iii) we demonstrate that our framework yields significant improvements over existing methods for both saliency detection and style transfer.
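Before turning to related work, the pipeline of Fig. 2 can be previewed as a short sketch. The callables below are hypothetical placeholders for the saliency network of Section 3.1 and the guided transfer of Section 3.2 (they are not part of the paper's code), and the four branches mirror the transfer options listed above.

```python
# High-level sketch of the saliency-guided pipeline in Fig. 2. `saliency_net` and
# `guided_style_transfer` are hypothetical stand-ins for the components described
# in Sections 3.1 and 3.2; only the dispatch logic is illustrated here.
def visual_perception_transfer(content, style_a, style_b=None, mode='both_styles',
                               saliency_net=None, guided_style_transfer=None):
    """Dispatch the four transfer modes described above, guided by a saliency map."""
    S = saliency_net(content)                     # soft saliency map in [0, 1]
    if mode == 'salient_only':                    # stylise only the object
        return guided_style_transfer(content, fg_style=style_a, bg_style=None, mask=S)
    if mode == 'non_salient_only':                # stylise only the background
        return guided_style_transfer(content, fg_style=None, bg_style=style_a, mask=S)
    if mode == 'same_style':                      # same style, weighted by saliency
        return guided_style_transfer(content, fg_style=style_a, bg_style=style_a, mask=S)
    return guided_style_transfer(content, fg_style=style_a, bg_style=style_b, mask=S)
```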
2 Related work

Our system involves two parts: saliency detection on the content image to produce a mask, followed by neural style transfer. We review both parts separately.

2.1 Saliency detection

Early works on computing saliency aimed to model and predict human gaze on images. Itti et al. [[10]] combine multiscale low-level features to create a saliency map. Recently, saliency detection has expanded to include the segmentation of entire salient regions or objects. The majority of these methods are based on low-level hand-crafted features, e.g. image contrast [[11]] and colour [[12]]. A complete survey of these methods is beyond the scope of this paper, and we refer readers to a recent survey [[13]] for details.

2.2 Neural style

Current neural style transfer methods can be divided into two categories: descriptive neural methods based on image iteration, and generative neural methods based on model iteration. The first category transfers the style by directly updating the pixels of the image through gradient descent, while the second category first optimises a generative model and then produces the styled image through a single forward pass. A complete survey of these methods can be found in a recent survey paper [[14]]. In this paper, we focus on descriptive neural methods based on image iteration.

3 Proposed method

We propose a saliency-guided style transfer algorithm. The proposed architecture mainly contains two parts: the saliency detection network and the style transfer network. The framework of the proposed method is shown in Fig. 2. Given an input content image, we first use a saliency detection network to obtain its salient part. The saliency map produced by the saliency detection network defines the importance of different regions of the input content image. Then, we can perform style transfer on just the salient region (such as the object), on just the non-salient region (such as the background), with the same style on both regions, or with different styles on the salient and non-salient regions. Finally, the stylised salient region and non-salient region are combined and the final visual perception transfer result is obtained. This section is organised as follows: first, we introduce the proposed saliency detection algorithm; then, we show the details of the style transfer method used in our framework and the way the bi-directional style transfer is guided by the saliency map.

3.1 Saliency detection algorithm

The master network is based on an encoder–decoder architecture, with VGG [[15]] used as the encoder of the proposed model. We also employ copy–crop and multi-feature concatenation techniques to exploit hierarchical features effectively. The details of the proposed saliency detection network are illustrated in Fig. 3. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network: it consists of the repeated application of two (unpadded) convolutions, each followed by a rectified linear unit (ReLU), and a max-pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled. Every step in the expansive path consists of an upsampling of the feature map followed by an up-convolution that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two convolutions, each followed by a ReLU. The cropping is necessary because border pixels are lost in every convolution. At the final layer, a convolution maps each 64-component feature vector to the desired number of classes. In total, the network has 23 convolutional layers.

Fig. 3: The details of the proposed saliency detection network
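As a concrete illustration of this encoder–decoder, the sketch below builds a small network with copy–crop style skip connections and a sigmoid saliency output using the Keras API. The input size, channel widths, block count and use of 'same' padding (which removes the need for cropping) are illustrative assumptions, not the authors' exact VGG-based configuration.

```python
# Minimal encoder-decoder sketch with skip connections and a sigmoid saliency output.
# All sizes are placeholders; the paper's network uses a VGG encoder and unpadded
# convolutions with cropping, whereas 'same' padding is used here for brevity.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_saliency_net(input_shape=(224, 224, 3)):
    inputs = layers.Input(shape=input_shape)

    # Contracting path: two convs + ReLU per block, max-pooling with stride 2,
    # doubling the number of feature channels at each downsampling step.
    skips, x, ch = [], inputs, 64
    for _ in range(4):
        x = layers.Conv2D(ch, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(ch, 3, padding='same', activation='relu')(x)
        skips.append(x)                      # feature map reused by the copy-crop connection
        x = layers.MaxPooling2D(2)(x)
        ch *= 2

    x = layers.Conv2D(ch, 3, padding='same', activation='relu')(x)

    # Expansive path: upsample, halve the channels, concatenate the corresponding
    # feature map from the contracting path, then two convs + ReLU.
    for skip in reversed(skips):
        ch //= 2
        x = layers.Conv2DTranspose(ch, 2, strides=2, padding='same')(x)
        x = layers.Concatenate()([x, skip])  # copy-crop style connection
        x = layers.Conv2D(ch, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(ch, 3, padding='same', activation='relu')(x)

    # Single-channel saliency map with sigmoid activation, trained with the
    # pixel-wise binary cross-entropy loss given in (1) below.
    saliency = layers.Conv2D(1, 1, activation='sigmoid')(x)
    return Model(inputs, saliency)

model = build_saliency_net()
model.compile(optimizer='adam', loss='binary_crossentropy')
```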
The copy–crop technique [[16]] is used to bring in more low-level features from the early stages, improving the fine details of the saliency map during the upsampling stage. The multi-feature concatenation technique is based on a loss-fusion pattern and is used to combine low-level and high-level features for accurate saliency detection and loss fusion. The features from the different blocks of the decoder each pass through a single convolution kernel with a linear activation function, yielding a pyramid of outputs. These outputs are concatenated and fed to a final convolutional layer, to which a sigmoid activation function is applied. Then, the pixel-wise binary cross-entropy between the predicted saliency map S and the ground-truth saliency mask G is computed by

L_{bce} = -\sum_{(x, y)} \left[ G(x, y) \log S(x, y) + \left(1 - G(x, y)\right) \log\left(1 - S(x, y)\right) \right]    (1)

where (x, y) are the pixel locations in the image.

Given an input image I, the salient object detection network produces a saliency map Sm from a set of weights W. Salient object detection is posed as a regression problem, and the saliency value of each pixel (x, y) in Sm can be described as

Sm(x, y) = f\left( R(x, y);\, W \right)    (2)

where R(x, y) is the receptive field of location (x, y) in Sm; it is related to the size of each convolutional kernel and to the architecture of the network. Once the network is trained, W is fixed and used to detect salient objects in any input image.

3.2 Bi-directional style transfer

The original style transfer method synthesises a new image x by combining the content of an image p and the style of an image a [[5]]. Here p is called the 'content image', a is called the 'style image', and the new image x is called the 'stylised image'. Synthesising the stylised image is treated as an energy minimisation problem that minimises a content loss and a style loss. There are two main insights behind the algorithm. First, the content information of an input image lies in the features of the convolutional layers. Second, the style information of an input image lies in the correlations between the features of the convolutional layers.

However, a drawback of the original style transfer algorithm is that the style is applied evenly to the entire image, regardless of the semantics of the content image. When the algorithm is applied to portraits, the human subjects and the background merge. Excellent photographers and painters, on the other hand, are good at focusing, and their works place emphasis on certain aspects, such as the models in the foreground. To address this problem, we leverage image saliency to treat different image parts discriminatively. By controlling the Gram matrices (see (4)), the style transfer can be guided by the semantics of the content image. One way to control the Gram matrices is to use a semantic segmentation of the content image as a mask that modulates the transfer of the style (Fig. 4).

Fig. 4: Weighted style transfer: (a) content image, (b) stylised image, (c) stylised image

3.2.1 Perceptual loss function

To obtain a good transferred image, the network should optimise both a content loss function and a style loss function.

Content representation: The content of the synthesised image is generated by updating the pixels of a Gaussian noise image through gradient descent. The gradient is computed from the squared-error loss between the two feature representations. Our content loss function is

L_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}    (3)

where p and x are the content image and the synthesised image, respectively, and P^l and F^l are their respective feature representations in layer l.

Style representation: On top of the CNN responses in each layer of the network, we build a style representation that computes the correlations between the different filter responses, where the expectation is taken over the spatial extent of the input image. These feature correlations are given by the Gram matrix G^l ∈ R^{N_l × N_l}, where N_l is the number of feature maps in layer l and G^l_{ij} is the inner product between the vectorised feature maps i and j in layer l:

G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}    (4)
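The content loss (3) and the Gram matrix (4) can be written compactly as in the minimal NumPy sketch below; the random arrays merely stand in for CNN activations of shape (N_l, H_l, W_l), and the way features are extracted from the network is left abstract.

```python
# Minimal NumPy sketch of the content loss (3) and the Gram matrix (4).
import numpy as np

def content_loss(P, F):
    """Squared-error loss between content features P and synthesised features F, eq. (3)."""
    return 0.5 * np.sum((F - P) ** 2)

def gram_matrix(F):
    """Gram matrix of feature maps F with shape (N_l, H_l, W_l), eq. (4)."""
    n, h, w = F.shape
    A = F.reshape(n, h * w)      # vectorise each feature map
    return A @ A.T               # inner products between feature maps i and j

# toy usage with random activations standing in for CNN features
P = np.random.rand(64, 56, 56)
F = np.random.rand(64, 56, 56)
print(content_loss(P, F), gram_matrix(F).shape)
```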
To generate a texture that matches the style of a given image (Fig. 2, style reconstructions), we use gradient descent from a white noise image to find another image that matches the style representation of the original image. This is done by minimising the mean-squared distance between the entries of the Gram matrix of the style image and the Gram matrix of the image being generated. Let a and x be the original style image and the generated image, and let A^l and G^l be their respective style representations in layer l. The contribution of that layer to the total loss is

E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}    (5)

and the total style loss is

L_{style}(a, x) = \sum_{l} w_{l} E_{l}    (6)

where w_l is the weighting factor of layer l and M_l is the number of elements in each feature map of layer l.

Style and content combination: To produce the desired image that matches the content of image p and the style of image a, the total loss function should contain both the content loss and the style loss. Thus, the final total loss function is

L_{total}(p, a, x) = \alpha L_{content}(p, x) + \beta L_{style}(a, x)    (7)

where α and β are the weighting factors for content and style reconstruction, respectively, and the content and style terms are evaluated over their respective sets of layers. As (7) shows, the layers chosen for the content loss function and the style loss function are of significance to the quality of the synthesised image. For the style features, we use a set of convolutional layers in accordance with [[5]]. Once we extract the activation volume at a particular style layer of the style image, we compute its Gram matrix, which represents the style features at that scale. Assuming n filters at a particular layer l, we can unroll the activation volume into a matrix A with n rows, one vectorised feature map per row, and the Gram matrix is obtained as A A^T. The Gram matrices computed across all the style layers give us all the style features. To produce the desired image, we begin with an image that is essentially white noise, run it through the VGG-19 model, and compute its content features in the same way as for the content image p, as well as its style features. Then the content loss and style loss are computed by (3) and (6), and the final loss is computed by (7).

3.2.2 Bi-directional style transfer

To achieve the focus effect of photography, a mask function is defined that performs an element-wise multiplication between the scaled and normalised content mask and each feature map in a layer l. A simple way to generate a mask is semantic segmentation; Fig. 4 is an example of using semantic segmentation to obtain the mask of a content image. Suppose we have the foreground and the background of a content image. We can then use the mask to weight the transfer of the style. Assume that the activation volume of a layer l has N_l feature maps, each of size H_l × W_l. We expand the mask T along the depth to the dimensions of the activation volume and then perform an element-wise multiplication. The mask function is defined as

M(F^{l}, T) = F^{l} \odot T^{l}    (8)

where ⊙ denotes element-wise multiplication of tensors and T^l is the mask expanded to the dimensions of F^l. In our method, the mask is obtained by the proposed saliency detection algorithm. Once the content mask is obtained, the style transfer can be weighted by it, and the style loss function becomes

L_{style}(a, x, T) = \sum_{l} w_{l} \hat{E}_{l}    (9)

where \hat{E}_l is the layer loss of (5) computed from the Gram matrices of the masked feature maps M(F^l(a), T) and M(F^l(x), T), a is the style image, x is the stylised image, T is the content mask, and the sum runs over the set of layers used in the masked style representation.
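A minimal sketch of the masked style representation in (8) and (9) for a single layer is given below. It assumes the content mask has already been scaled to the layer's spatial resolution and normalised to [0, 1]; feature extraction from the VGG network is abstracted away, and the random arrays are placeholders.

```python
# Minimal NumPy sketch of the masked style loss (8)-(9) for a single layer.
import numpy as np

def masked_gram(F, mask):
    """Gram matrix of feature maps F (N_l, H_l, W_l) weighted by mask (H_l, W_l), eq. (8)."""
    Fm = F * mask[None, :, :]          # expand the mask along the channel axis, element-wise product
    n, h, w = Fm.shape
    A = Fm.reshape(n, h * w)
    return A @ A.T

def masked_style_loss(F_style, F_gen, mask):
    """Single-layer contribution to the masked style loss, following (5) and (9)."""
    n, h, w = F_gen.shape
    G = masked_gram(F_gen, mask)
    A = masked_gram(F_style, mask)
    return np.sum((G - A) ** 2) / (4.0 * n**2 * (h * w)**2)

# toy usage: a soft saliency mask weighting a 64-channel activation volume
F_style = np.random.rand(64, 56, 56)
F_gen   = np.random.rand(64, 56, 56)
mask    = np.random.rand(56, 56)       # stands in for the saliency map at this scale
print(masked_style_loss(F_style, F_gen, mask))
```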
The style transfer results presented in the main framework are generated on the basis of the VGG network, a CNN that rivals human performance on a common visual object recognition benchmark task. We use the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer VGG network. For image synthesis, we found that replacing the max-pooling operation with average pooling improves the gradient flow and produces slightly more appealing results. Our results show qualitative improvements over those of state-of-the-art methods. We apply the masks to the Gram matrices in multiple convolutional layers instead of a single convolutional layer, and the masks are scaled and down-sampled using bicubic interpolation rather than a pooling layer. By using multiple masks, different styles can be transferred in different regions within an image. The masking is not done as a post-processing step: we are not simply masking two stylised images with a threshold. Since the mask is inserted into each style layer of the CNN, the masking does not produce sharp edges around the boundaries. Instead, the boundaries between the background and foreground blend smoothly, because the masking is performed during the synthesis of the image.

Fig. 5 shows a weighted style transfer using one style image and two content masks. Weighted by the saliency map, our result is clearer and still possesses the style of the style image. The soft blending between the foreground and background can be seen at the top of the image, where the dog's head flows into the background. Since the blending is performed during synthesis, the edges around the boundary are smooth and more natural than simply masking two images after synthesis.

Fig. 5: Masking the Gram matrix: (a) content image, (b) style image, (c) saliency map, (d) Gatys result, (e) our result
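The bi-directional weighting of Fig. 5 can be sketched as follows: the saliency map and its complement serve as two content masks, each paired with its own style image, and the masked layer losses are accumulated over the style layers after resizing the mask with bicubic interpolation. The helper names, layer weights and use of PIL for the resizing are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: the saliency map S and its complement (1 - S) act as two content
# masks, one per style image, and their masked layer losses are summed over the style
# layers; the mask is resized to each layer's resolution with bicubic interpolation.
import numpy as np
from PIL import Image

def resize_mask(S, hw):
    """Bicubic resize of a [0, 1] saliency map S to spatial size hw = (H_l, W_l)."""
    img = Image.fromarray((S * 255).astype(np.uint8))
    out = img.resize((hw[1], hw[0]), resample=Image.BICUBIC)
    return np.asarray(out, dtype=np.float64) / 255.0

def masked_layer_loss(F_style, F_gen, mask):
    """Single-layer masked style loss, following (5), (8) and (9)."""
    def gram(F):
        Fm = F * mask[None]            # eq. (8): weight each feature map by the mask
        return Fm.reshape(F.shape[0], -1) @ Fm.reshape(F.shape[0], -1).T
    n, h, w = F_gen.shape
    return np.sum((gram(F_gen) - gram(F_style)) ** 2) / (4.0 * n**2 * (h * w)**2)

def bidirectional_style_loss(feats_gen, feats_style_fg, feats_style_bg, S, weights):
    """Foreground style on the salient region, background style on the rest."""
    total = 0.0
    for Fg, Ffg, Fbg, w in zip(feats_gen, feats_style_fg, feats_style_bg, weights):
        m = resize_mask(S, Fg.shape[1:])                      # mask at this layer's scale
        total += w * (masked_layer_loss(Ffg, Fg, m)
                      + masked_layer_loss(Fbg, Fg, 1.0 - m))  # complementary mask
    return total
```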
4 Experimental results

In this section, we first introduce the data sets used for saliency detection and style transfer and the corresponding implementation details. Then, we describe the evaluation metrics for saliency detection and present objective and subjective comparisons with state-of-the-art methods. Finally, the results of the bi-directional style transfer are presented.

4.1 Data sets

For the saliency detection evaluation, we adopt six public large-scale saliency detection data sets:

DUT-OMRON [[17]]: This data set has 5168 high-quality images. Its images contain one or more salient objects and relatively complex backgrounds.
DUTS-TE [[18]]: This data set contains 5019 test images with high-quality pixel-wise annotations.
ECSSD [[19]]: This data set contains 1000 natural images, which include many semantically meaningful and complex structures in their ground-truth segmentations.
HKU-IS [[20]]: This data set has 4447 images with high-quality pixel-wise annotations. Its images are chosen to include multiple disconnected salient objects or objects touching the image boundary.
PASCAL-S [[21]]: This data set is generated from the PASCAL VOC data set [[22]] and contains 850 natural images.
SOD [[23]]: This data set has 300 images and was originally designed for image segmentation; pixel-wise annotations of salient objects were generated by [[24]].

For the style transfer evaluation, we adopt a collection of images that have become de facto standard test images, publicly available at https://github.com/cysmith/neural-style-tf.

4.2 Implementation details and complexity

We implement our model on the Python 2.7 platform with TensorFlow 1.2 and run it on a server with an E5-2643 CPU (64 GB memory) and an NVIDIA Tesla K80 GPU (12 GB memory). The parameters of the multilevel feature extraction layers are initialised with a truncated normal distribution. We use the Adam [[25]] optimiser to train the network with a learning rate decaying from 0.001 to 0.0001. For the encoder part of the saliency detection network, we use the VGG-16 architecture; the decoder part and the encoder part are symmetric, so the depth information can be deduced from Fig. 3. The details of the saliency detection network are listed in Tables 1 and 2. We train the model for 12 epochs; the training process takes almost 30 h (VGG-16). The batch size is eight, limited by our GPU memory. For the style transfer algorithm, we adopt the method in [[5]], whose source code is available at https://github.com/cysmith/neural-style-tf.

Table 1. Details of the encoder part of the saliency detection network (module sequence): Conv1_1, Conv1_2, Maxpool1_1, Conv2_1, Conv2_2, Maxpool2_1, Conv3_1, Conv3_2, Conv3_3, Maxpool3_1, Conv4_1, Conv4_2, Conv4_3, Maxpool4_1, Conv5_1, Conv5_2, Conv5_3

Table 2. Details of the decoder part of the saliency detection network (module sequence): Deconv5_1, Deconv5_2, Deconv5_3, Upsampling_4, Deconv4_3+Conv4_1, Deconv4_2, Deconv4_3, Upsampling_3, Deconv3_3+Conv3_1, Deconv3_2, Deconv3_3, Upsampling_2, Deconv2_2+Conv2_1, Deconv2_2, Upsampling_1, Deconv1_2+Conv1_1, Deconv1_2

At test time, the proposed salient object detection algorithm runs at about 10 fps (VGG-16) at the rescaled input resolution, and the model size is about 491 MB. The output of our model is scaled to the range [0, 255], and the saliency maps are resized directly to the original resolution. Regarding the complexity of the style transfer, since our method is a descriptive neural method that directly updates the pixels of the image through gradient descent, the speed of the synthesis procedure depends on the content and style images; it takes up to an hour on an NVIDIA K80 GPU (depending on the exact image size and the stopping criterion of the gradient descent). The model size of the style transfer network is about 510 MB.

4.3 Saliency detection evaluation

4.3.1 Evaluation metrics

Three widely used evaluation metrics are adopted to evaluate the performance of the different saliency algorithms: precision-recall (PR) curves, the F-measure and the mean absolute error (MAE). The F-measure is a weighted harmonic mean of precision and recall and is calculated as

F_{\beta} = \frac{(1 + \beta^{2}) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}}

Here, as suggested by previous works, we set β² to 0.3 to emphasise precision over recall. The MAE evaluates the saliency detection as

\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|

where G is the binary ground-truth mask and W and H are the width and height of the saliency map S, respectively.
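For reference, the two metrics can be computed as in the short NumPy sketch below; the fixed binarisation threshold used for precision and recall is a placeholder (PR curves are obtained by sweeping it or using an adaptive threshold).

```python
# Minimal NumPy sketch of the F-measure and MAE metrics defined above.
import numpy as np

def f_measure(sal, gt, threshold=0.5, beta2=0.3):
    """F-measure with beta^2 = 0.3 between a [0, 1] saliency map and a binary ground truth."""
    pred = sal >= threshold
    tp = np.sum(pred & (gt > 0))
    precision = tp / max(np.sum(pred), 1)
    recall = tp / max(np.sum(gt > 0), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(sal, gt):
    """Mean absolute error between a [0, 1] saliency map and a binary ground truth."""
    return np.mean(np.abs(sal - gt))

# toy usage with a random map and mask
sal = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
print(f_measure(sal, gt), mae(sal, gt))
```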
4.3.2 Evaluation

For training, we utilise the MSRA10K [[26]] and DUTS-TR [[27]] data sets. MSRA10K includes 10,000 images with high-quality pixel-wise annotations. DUTS is currently the largest saliency detection benchmark and contains 10,553 training images (DUTS-TR) and 5019 test images (DUTS-TE); we use DUTS-TR only. Most of the images in these data sets contain only one salient object, and both training sets contain very challenging scenarios for saliency detection. Before feeding the training images into our model, each image and its ground truth are rescaled to the same size [224, 224].

Comparison with state-of-the-art methods: We compare our algorithm with 13 state-of-the-art methods, including eight deep learning-based algorithms (Amulet [[28]], DCL [[29]], DHS [[30]], DS [[31]], ELD [[32]], LEGS [[33]], MDF [[27]], RFCN [[34]]) and four conventional algorithms (BL [[35]], BSCA [[11]], DRFI [[24]], DSR [[12]]). For fairness, we use either the implementations with the recommended parameter settings or the saliency maps provided by the authors. As shown in Table 3, our model outperforms these methods on almost all data sets in terms of almost all evaluation metrics. From the PR curves (Fig. 6), our approach achieves a better PR curve on all four data sets. Owing to the refinement effect of the low-level features, the precision of our saliency maps is higher, resulting in a higher PR curve.

Visual comparison: Fig. 7 provides a visual comparison of our approach with some of the aforementioned approaches (owing to space limitations). Our method recovers the finest detail and highlights the most accurate salient regions, thanks to the aggregation of multi-layer outputs.

Table 3. Quantitative comparison of F-measure and MAE scores on large-scale RGB saliency detection data sets (each cell gives F-measure / MAE)

Method          ECSSD            DUT-OMRON        HKU-IS           DUTS-TE          PASCAL-S         SOD
proposed work   0.8807 / 0.0572  0.7259 / 0.0802  0.8610 / 0.0501  0.7526 / 0.0790  0.7891 / 0.0815  0.7964 / 0.1272
Amulet          0.8684 / 0.0587  0.6471 / 0.0976  0.8542 / 0.0521  0.7365 / 0.0852  0.7632 / 0.0982  0.7547 / 0.1399
DCL             0.8293 / 0.1495  0.6842 / 0.1572  0.8533 / 0.1358  0.7141 / 0.1492  0.7141 / 0.1807  0.7413 / 0.1938
DHS             0.8675 / 0.0594  — / —            0.8541 / 0.0531  0.7301 / 0.0658  0.7741 / 0.0942  0.7746 / 0.1284
DS              0.8255 / 0.1215  0.6028 / 0.

Reference(s)