Article · Open access · Peer reviewed

ApesNet: a pixel‐wise efficient segmentation network for embedded devices

2016; Institution of Engineering and Technology; Volume: 1; Issue: 1; Language: English

10.1049/iet-cps.2016.0027

ISSN

2398-3396

Authors

Chunpeng Wu, Hsin-Pai Cheng, Sicheng Li, Hai Li, Yiran Chen

Topic(s)

Autonomous Vehicle Technology and Safety


Chunpeng Wu, Hsin-Pai Cheng, Sicheng Li, Hai (Helen) Li and Yiran Chen (corresponding author, yiran.chen@pitt.edu) are with the Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15232, USA. First published: 01 December 2016, IET Cyber-Physical Systems: Theory & Applications, Volume 1, Issue 1, pp. 78-85.

Abstract

Road scene understanding and semantic segmentation are ongoing challenges in computer vision. A precise segmentation can help a machine learning model understand the real world more accurately. In addition, a well-designed, efficient model can be used on resource-limited devices. The authors aim to implement an efficient, high-level scene-understanding model on an embedded device with limited power and resources. Toward this goal, the authors propose ApesNet, an efficient pixel-wise segmentation network which understands road scenes in near real-time and achieves promising accuracy. The key findings of the authors' experiments are a significantly lower classification time and a higher accuracy compared with other conventional segmentation methods. The model is characterised by efficient training and sufficiently fast testing. Experimentally, the authors use two road scene benchmarks, CamVid and Cityscapes, to show the advantages of ApesNet.
The authors compare the proposed architecture's accuracy and time performance with SegNet-Basic, a deep convolutional encoder-decoder architecture. ApesNet is 37% smaller than SegNet-Basic in terms of model size. With this advantage, ApesNet encodes and decodes each image 2.5 times faster than SegNet-Basic.

1 Introduction

Road scene understanding algorithms can effectively reduce traffic congestion and road accidents. Moreover, for safety, a machine learning model should be capable of detecting pedestrians, bicyclists and road signs correctly with high confidence. Information about dangerous situations can then be sent to the car and the driver for accident prevention. Common road scene segmentation approaches use machine learning models. Among them, deep neural networks (DNNs) and convolutional neural networks (CNNs) [1] have been adopted and achieved significant results. CNNs gained significant success in visual and audio recognition problems such as handwritten digit recognition and large-scale image and audio classification [2, 3]. Motivated by this milestone, deep CNNs were adopted to solve semantic segmentation as well as action recognition [4]. Recent studies show that the network depth and width have crucial influences on classification accuracy. For example, numerous state-of-the-art results on the ImageNet Challenge credit their achievements to very deep DNN models [5-7].

Whole-image classification has been done successfully. The next step is pixel-wise semantic segmentation; in other words, we are moving from image-wise labelling to pixel-wise labelling. However, the main obstacle is the information overload during training: the network will be inefficient if we directly use a conventional deep learning architecture. Recent studies show that there is still room for improvement in accuracy and segmentation time [8]. Until now, semantic segmentation remains an ongoing, active topic. Several image segmentation methods have been proposed; however, due to their model sizes, they are impractical to implement on an embedded device for real-time segmentation. For example, Kendall et al. proposed SegNet, a deep convolutional encoder-decoder architecture [9]. Although SegNet has competitive class accuracy, it does not achieve real-time performance.

Our motivation for segmentation comes from this remaining issue, the processing time. Therefore, we implement a segmentation neural network on an inexpensive embedded system with minimal hardware and software requirements. Our model should be small, computationally efficient and real-time. We experimentally analyse the processing time of each layer and find that it is not necessary to symmetrically align the encoder and decoder. Inspired by ResNet [10], we insert shortcut connections into the encoder. ApesNet was tested on the CamVid dataset [11], and we comprehensively compare the processing time of ApesNet with that of SegNet-Basic. In addition, for safety considerations, the network distinguishes smaller objects such as road signs, pedestrians and bicyclists with better accuracy. For example, while SegNet-Basic achieves only 16.4 and 36.2% accuracy on the segmentation of bicyclists and signs, ApesNet achieves 46.1 and 52.3%, respectively.

The remainder of the paper is structured as follows. In Section 2, the related work is reviewed. In Sections 3 and 4, we describe the design analysis and the network architecture. In Section 5, we explain the experimental setup, results, discussion and evaluation. In Section 6, we present the conclusion and future work.
2 Related work

An accurate semantic understanding is important for segmentation tasks and target object searching. Many works have explored the potential of separating different objects precisely. Starting with CNNs, they have been widely used for visual and audio recognition problems. A typical CNN structure includes convolutional layers followed by normalisation, max-pooling and fully-connected layers. With this approach, a machine learning model can achieve high accuracy on classification benchmarks such as MNIST, CIFAR and ImageNet. Many breakthrough approaches which combine other algorithms with CNNs outperform conventional algorithms. For instance, conditional random fields (CRFs) [12] were often used for labelling and parsing sequential data. CRFs effectively assigned appropriate class labels and constructed semantic-level regions. Hierarchical features were extracted to explain the meaning in the image. It is worth mentioning that Farabet et al. used a multiscale convolutional network for pixel-wise labelling by combining CNNs with a CRF model on many datasets [13]. However, an unavoidable disadvantage was that the information of mid-level features was lost by using spatial pooling.

Other techniques use an encoder-decoder network architecture, such as SegNet. SegNet uses an encoder architecture which is topologically similar to the VGG16 network. The decoder upsamples the feature maps and maps them to a higher resolution. The architecture can perform pixel-wise segmentation and has promising results on several datasets such as Pascal VOC12 salient object segmentation and the SUN RGB-D indoor scene understanding challenge. However, SegNet is not applicable to embedded devices because the model is designed for multi-class classification and has an enormous number of parameters. Even though the results are outstanding, the class accuracy and segmentation time are not satisfactory.

3 Design analysis

As our goal is to accelerate network testing without losing accuracy, our main idea is to reduce the operations which are relatively time consuming but contribute less to the segmentation accuracy, and then compensate for the loss in accuracy by adding efficient operations [14, 15]. We ran time profiling for a popular network, AlexNet [1], which includes convolution, max-pooling, ReLU, local response normalisation (LRN), dropout and fully-connected layers. We added batch-normalisation [16] between convolution and dropout. The recently proposed shortcut module [10] has also been studied, and we found that its time consumption mainly depends on the convolution branch. Therefore, we discuss the convolution instead of the shortcut module in this section. Our GPU device is the NVIDIA TITAN X and cuDNN v5 [17] is adopted. The time spent on ReLU, LRN, batch-normalisation and dropout is trivial, shorter than 2 ms per operation in our experiment. The running time of max-pooling and fully-connected layers increases with the size and number of feature maps, but is still less than 4 ms per operation. As more recent CNNs [18, 19] go deeper, the ratio of these two layer types to the total number of layers becomes lower. Therefore, we do not focus on these layers. Actually, feature maps can be rescaled by setting the stride of a convolutional layer larger than 1 instead of using pooling layers, as suggested by Simard et al. [20]. We tried to remove the pooling layers in this manner; however, the final segmentation accuracy was significantly degraded in our experiment. A possible explanation is that average-pooling can be replaced by convolution with a stride larger than 1 as in [20], but the biologically inspired max-pooling cannot.
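To make this trade-off concrete, the short PyTorch sketch below (our illustration, not the authors' code) shows that a stride-2 convolution downsamples a feature map to the same resolution as 2 × 2 max-pooling while introducing learnable parameters; the channel count, kernel size and input resolution are assumptions chosen only for the example:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 90, 120)  # one 64-channel feature map (illustrative size)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)                      # no parameters
strided_conv = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)  # learnable

print(max_pool(x).shape)      # torch.Size([1, 64, 45, 60])
print(strided_conv(x).shape)  # torch.Size([1, 64, 45, 60]) -- same downsampling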
For AlexNet, the convolution occupies most of the running time. Specifically, the first convolutional layer takes about 10 ms, while the last one takes only 4 ms, i.e. convolutional layers with larger feature maps are the most time consuming. Another design issue of networks used for pixel-wise applications (e.g. image segmentation) is how to make the resolution of the output prediction map the same as that of the input image. Inspired by the auto-encoder architecture [21], the popular CNNs for segmentation, e.g. fully convolutional networks [22], SegNet-Basic [9, 23] and EDeconvNet [24], adopt a symmetric encoder and decoder to gradually restore the shrunk feature maps to the original size of the input. Their differences lie in the implementation of upsampling. However, more recent methods [25, 26] keep the encoder (e.g. VGG) unchanged while using a shallow decoder or a fully-connected layer to obtain the output prediction vector corresponding to each pixel in the input image. We tried to replace the decoder of SegNet-Basic and EDeconvNet with that of [25, 26]. The testing time decreased significantly, by around 11 ms, at a cost of only a 1% drop in average segmentation accuracy. Note that the total testing time of SegNet-Basic is 63 ms using a TITAN X with cuDNN v5. We further tried to gradually shrink the encoder by removing convolutional layers while keeping the decoder unchanged. The obtained results showed more than a 10% decrease in accuracy, indicating that a deep encoder is probably necessary for extracting expressive visual features to distinguish semantic classes of objects.

4 Architecture

Based on the above analysis of time profiling and the encoder-decoder architecture, we adopt the following two strategies. First, the number of large feature maps in convolutional layers is decreased for acceleration compared with previous methods [22-26]. Here, 'large feature maps' refers to the maps in the first ConvBlock and the last ConvBlock of our network, as shown in Fig. 1; 16 and 8 feature maps are used for these two ConvBlock, respectively. For simplicity, we do not tune the feature map numbers of the other convolutional layers. Second, we use an asymmetric network structure with a deep encoder and a relatively shallow decoder. The deep encoder lies in adding convolutional layers with relatively small feature maps, i.e. the two ApesBlock shown in Fig. 1, in order to improve the accuracy. The convolution kernel size of these two ApesBlock is 5 × 5, to extract features at a finer scale compared with the kernel size of 7 used by all other modules in our network. Ablation studies in Section 5.3 will show the contributions of the above two strategies in detail.

Fig. 1 shows the architecture of ApesNet and its two basic modules (ConvBlock and ApesBlock). Our network consists of an encoder and a decoder. The ConvBlock(M, k, r) is adopted for both the encoder and decoder; it includes a convolutional layer with a k × k kernel and M feature maps, batch-normalisation, ReLU activation and dropout with ratio r. The ApesBlock(M, k, r) is used only in the encoder and is inspired by ResNet [10]. Two branches, i.e. the shortcut connection and two convolutional layers with batch-normalisation and ReLU, are combined by element-wise addition, which is followed by ReLU and dropout with ratio r. The shortcut connection in the ApesBlock serves as an identity mapping.
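The two modules can be written down compactly. The following PyTorch sketch is our reading of the description above, not the authors' released implementation; the 'same' padding, the exact placement of ReLU inside the convolutional branch, and the use of 2D dropout are assumptions:

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """ConvBlock(M, k, r): k x k convolution to M maps -> batch-norm -> ReLU -> dropout(r)."""
    def __init__(self, in_ch, M, k, r):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, M, kernel_size=k, padding=k // 2),  # 'same' padding assumed
            nn.BatchNorm2d(M),
            nn.ReLU(inplace=True),
            nn.Dropout2d(r),
        )

    def forward(self, x):
        return self.block(x)

class ApesBlock(nn.Module):
    """ApesBlock(M, k, r): identity shortcut plus a two-convolution branch,
    merged by element-wise addition, then ReLU and dropout(r)."""
    def __init__(self, M, k, r):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(M, M, kernel_size=k, padding=k // 2),
            nn.BatchNorm2d(M),
            nn.ReLU(inplace=True),
            nn.Conv2d(M, M, kernel_size=k, padding=k // 2),
            nn.BatchNorm2d(M),
        )
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout2d(r)

    def forward(self, x):
        # shortcut (identity) + convolutional branch, then ReLU and dropout
        return self.drop(self.relu(x + self.conv_branch(x)))

With 'same' padding both modules preserve spatial resolution, which matches the fixed 64-map width used through most of the encoder described next.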
As shown in Fig. 1, an input image is passed through three pairs of ConvBlock and max-pooling and then two repeated ApesBlock in the encoder. The kernel size of the first three ConvBlock is 7 × 7, while that of the latter two ApesBlock is smaller, 5 × 5. The max operator is applied over a 2 × 2 region with stride 2. In addition, all the modules in the encoder have 64 feature maps except the first ConvBlock, which has fewer (i.e. 16) maps. The max-pooling indices in the encoder are used to upsample the feature maps of the corresponding un-pooling layer in the decoder. Therefore, the upsampling operator is also applied over a 2 × 2 region with stride 2. The decoder comprises three pairs of un-pooling and ConvBlock and a final 1 × 1 convolutional layer as a classifier. The kernel size of all three ConvBlock in the decoder is 7 × 7; however, their feature map numbers vary: 64, 16 and 8. The output map size is the same as the input, and each pixel in the input image corresponds to a vector of length C, where C is the number of semantic classes. Cross-entropy is used as the objective function for training. As the pixel number of each semantic class is not balanced, the loss of each class is weighted according to median frequency balancing [27].

Figure 1. Proposed encoder-decoder architecture of ApesNet and the inner structure of ConvBlock and ApesBlock.
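Putting the pieces together, the sketch below assembles an ApesNet-like network from the ConvBlock and ApesBlock sketched in the previous section, with the max-pooling indices driving the un-pooling step as described. It is a hedged reconstruction, not the authors' code; the dropout ratio and the inputs of the median-frequency helper are assumptions.

import torch
import torch.nn as nn

class ApesNetSketch(nn.Module):
    """Encoder: 3 x (ConvBlock + 2x2 max-pool) + 2 x ApesBlock.
    Decoder: 3 x (un-pool + ConvBlock) + 1x1 classifier. Reuses the ConvBlock and
    ApesBlock classes sketched above; r = 0.1 is an assumed dropout ratio."""
    def __init__(self, num_classes, r=0.1):
        super().__init__()
        self.enc1 = ConvBlock(3, 16, k=7, r=r)     # first ConvBlock: 16 maps
        self.enc2 = ConvBlock(16, 64, k=7, r=r)
        self.enc3 = ConvBlock(64, 64, k=7, r=r)
        self.apes1 = ApesBlock(64, k=5, r=r)       # finer 5x5 kernels
        self.apes2 = ApesBlock(64, k=5, r=r)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)  # driven by the stored indices
        self.dec1 = ConvBlock(64, 64, k=7, r=r)
        self.dec2 = ConvBlock(64, 16, k=7, r=r)
        self.dec3 = ConvBlock(16, 8, k=7, r=r)     # last ConvBlock: 8 maps
        self.classifier = nn.Conv2d(8, num_classes, kernel_size=1)

    def forward(self, x):
        x, i1 = self.pool(self.enc1(x))            # 1/2 resolution, 16 maps
        x, i2 = self.pool(self.enc2(x))            # 1/4 resolution, 64 maps
        x, i3 = self.pool(self.enc3(x))            # 1/8 resolution, 64 maps
        x = self.apes2(self.apes1(x))
        x = self.dec1(self.unpool(x, i3))          # indices restore the pooled positions
        x = self.dec2(self.unpool(x, i2))
        x = self.dec3(self.unpool(x, i1))
        return self.classifier(x)                  # per-pixel scores of length C

def median_frequency_weights(class_pixels, pixels_in_images_with_class):
    """Median frequency balancing [27]: weight_c = median(freq) / freq_c, where
    freq_c = pixels of class c / total pixels of images in which c appears."""
    freq = class_pixels.float() / pixels_in_images_with_class.float()
    return freq.median() / freq

logits = ApesNetSketch(num_classes=11)(torch.randn(1, 3, 360, 480))
print(logits.shape)  # torch.Size([1, 11, 360, 480])
# Training would then use nn.CrossEntropyLoss(weight=median_frequency_weights(...)).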
5 Experiments

We evaluate the accuracy and testing speed of our method using two popular image segmentation datasets. ApesNet is compared with SegNet-Basic [9], which is a smaller model among popular ones [22, 24, 25] while achieving competitive accuracy. Ablation studies of our improvements on accuracy and speed, as stated in Section 3, are also presented. In consideration of the physical constraints of embedded hardware, a fixed-point version of our network and the corresponding performance are provided as well.

5.1 ApesNet on CamVid

CamVid is a road scene segmentation dataset consisting of 367 training and 233 testing RGB images (day and dusk scenes) with 11 semantic classes of manually labelled objects, e.g. car, tree, road, fence, pole, building etc. [11]. The original image resolution is 720 × 960, while we downsampled all images to 360 × 480. Two popular measures of segmentation performance are used: class average accuracy (Class Avg.), which is the mean of the predictive accuracy over all classes, and mean intersection over union (Mean IoU), as used in the Pascal VOC12 Challenge [28]. Our method is compared with SegNet-Basic [9] as its model scale is similar to ours. Both SegNet-Basic and our model are initialised using the method of He et al. [29] and trained with stochastic gradient descent with a fixed learning rate of 0.1 and a momentum of 0.9. The quantitative comparisons on accuracy and testing speed are shown in Tables 1 and 2, respectively. Our method achieves better accuracy (Class Avg. and Mean IoU) and faster speed with a smaller model size. As shown in Table 1, our method obtains higher per-class accuracy on eight of the 11 object classes. In addition, the traditional SegNet-Basic is severely biased against certain classes of small-scale objects, e.g. bicyclist, while our method keeps a better balance of accuracy among object classes. An explanation is that two ApesBlock with a smaller kernel size (5 × 5) are used in our method; therefore, small objects benefit from a convolutional feature extractor at a finer scale. Six examples of visualised segmentation results are shown in Fig. 2. Images from the first to the fourth column are, respectively, the input image, SegNet-Basic, our method and the ground truth, which further validates our advantage. On the testing speed listed in Table 2, the performance gap between the mobile GTX 760M and the PC-based GTX 1080 for SegNet-Basic is 133 ms, whereas the gap is only 40 ms for our method. This indicates that our acceleration strategy does not heavily rely on the advance of the GPU hardware itself compared with SegNet-Basic. Therefore, our method can be further sped up using NVIDIA techniques such as dynamic parallelism and Hyper-Q [30].

Figure 2. Examples of segmentation results on CamVid. From the first to the fourth column: input image, SegNet-Basic, ApesNet and ground truth. (a)-(c) CamVid examples 1-3.

Table 1. Segmentation results on the CamVid testing set (per-class accuracy, %)

Method         Building  Tree  Sky   Car   Sign  Road  Pedestrian  Fence  Pole  Sidewalk  Bicyclist
SegNet-Basic   75.1      83.1  88.3  80.2  36.2  91.4  56.2        46.1   44.1  74.8      16.4
our method     76.0      80.2  95.7  84.2  52.3  93.9  59.9        43.8   42.6  87.6      46.1

Table 2. Comparison of accuracy, model size and testing speed on CamVid and Cityscapes

Dataset     Method        Class avg., %  Mean IoU, %  Model size, MB  GTX 760M, ms  GTX 860M, ms  TITAN X, ms  Tesla K40, ms  GTX 1080, ms
CamVid      SegNet-Basic  62.9           46.2         5.40            181           170           63           58             48
CamVid      ApesNet       69.3           48.0         3.40            73            70            40           39             33
Cityscapes  SegNet-Basic  58.4           42.0         5.59            180           168           62           57             46
Cityscapes  ApesNet       61.2           44.5         3.57            71            67            39           38             31

5.2 ApesNet on Cityscapes

Cityscapes [31] is a large-scale urban scene dataset with high-resolution annotations of 34 object classes, consisting of 2975 training samples, 500 validation samples and 1525 testing samples. The scale of each image is 256 × 512, while each image in CamVid is 360 × 480. Cityscapes is a more challenging dataset than CamVid because of its highly varying road scenes, e.g. pedestrians and cyclists with different characteristics. Following Cityscapes' official evaluation scripts [31], classes that are too rare are excluded, leaving 19 classes as our benchmark. Class Avg. and Mean IoU, the same measures used for CamVid, are adopted to evaluate the segmentation accuracy. The model size and running speed on PC-based and mobile GPU cards are listed in Table 2. The model size is slightly larger than the CamVid model because the final output layer has more feature maps for the eight additional object classes on Cityscapes, though we adopt the same network architecture. The segmentation results are shown in Table 2, and visualised examples are shown in Fig. 3. Our method achieves slightly better accuracy compared with SegNet-Basic.

Figure 3. Examples of segmentation results on Cityscapes. From the first to the fourth column: input image, SegNet-Basic, ApesNet and ground truth. (a)-(d) Cityscapes examples 1-4.
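Both benchmarks above are scored with Class Avg. and Mean IoU. The sketch below shows one common way to compute the two measures from a pixel-level confusion matrix; it is our illustration rather than the official evaluation code, and the ignore-label handling is an assumption:

import numpy as np

def confusion_matrix(gt, pred, num_classes, ignore_index=255):
    """Accumulate a confusion matrix from integer label maps, skipping ignored pixels."""
    mask = gt != ignore_index
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_scores(conf):
    """conf[i, j] = number of pixels of ground-truth class i predicted as class j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    per_class_acc = tp / conf.sum(axis=1)                  # recall of each class
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)  # intersection over union
    return per_class_acc.mean(), iou.mean()                # Class Avg., Mean IoU

Because Class Avg. weights every class equally regardless of its pixel count, small classes such as sign and bicyclist influence it strongly, which is why the per-class gains in Table 1 carry over to the Class Avg. gap in Table 2.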
5.3 Ablation studies

5.3.1 ApesBlock and training methodology

As shown in Fig. 1, our asymmetric encoder-decoder architecture lies in the two ApesBlock at the end of the encoder (before the first upsampling layer). As stated in Section 3, adopting ApesBlock improves the segmentation accuracy without significantly increasing the testing time. On the other hand, we currently adopt the traditional training method, i.e. the encoder and decoder are trained as a whole. An alternative is separate training, where the encoder is trained first and the encoder and decoder are then trained together, as adopted in previous research [25]. Table 3 lists the ablation study on ApesBlock and training methodology on CamVid and Cityscapes. With the traditional training, two ApesBlock improve the Class Avg. and Mean IoU by 7.2 and 4.2%, respectively, on CamVid compared with the first row where no ApesBlock is used, while the running time only increases by 4 ms. Three examples of visualised segmentation results are shown in Fig. 4. Images from the first to the fifth column are: input image, no ApesBlock, one ApesBlock, two ApesBlock and ground truth. The segmentation results in the fourth column (two ApesBlock) are smoother than those in the second column (no ApesBlock). Regarding the training methodology, our observation from Table 3 is that a deeper neural network can benefit more from separate training. ApesNet with two ApesBlock achieves better segmentation accuracy using separate training on both CamVid and Cityscapes, while degraded accuracy is obtained for one ApesBlock and zero ApesBlock.

Figure 4. Comparison of segmentation results with no ApesBlock (second column), one ApesBlock (third column) and two ApesBlock (fourth column). The first column is the original image and the fifth column is the ground truth. (a)-(c) CamVid examples 1-3.

Table 3. Ablation study on ApesBlock and training methodology

Configuration                Training     CamVid (Class avg. / Mean IoU / Speed)   Cityscapes (Class avg. / Mean IoU / Speed)
ApesNet without ApesBlock    traditional  62.1% / 43.8% / 29 ms                    57.3% / 39.6% / 27 ms
ApesNet without ApesBlock    separate     61.8% / 43.3% / 29 ms                    56.7% / 39.3% / 27 ms
ApesNet with one ApesBlock   traditional  66.4% / 46.3% / 31 ms                    59.8% / 42.3% / 29 ms
ApesNet with one ApesBlock   separate     66.1% / 46.4% / 31 ms                    59.7% / 42.3% / 29 ms
ApesNet with two ApesBlock   traditional  69.3% / 48.0% / 33 ms                    61.2% / 44.5% / 31 ms
ApesNet with two ApesBlock   separate     69.5% / 48.1% / 33 ms                    61.4% / 44.7% / 31 ms

5.3.2 ConvBlock with large feature maps

The large feature maps are those in the ConvBlock at the start of the encoder (ConvBlock_1) and the ConvBlock at the end of the decoder (ConvBlock_6) in Fig. 1. As stated in Section 3, decreasing the number of large feature maps improves the testing speed without significantly degrading the segmentation accuracy. Table 4 lists the ablation study on the decreased number of large feature maps on CamVid and Cityscapes. 'ConvBlock_1: 64', for instance, indicates that there are 64 feature maps in ConvBlock_1. The comparison of the first and the fourth rows shows that the running time reduces significantly, by 12 ms, while the Class Avg. and Mean IoU only decrease by 0.8 and 0.7%, respectively.

Table 4. Ablation study on the decreased number of large feature maps on CamVid and Cityscapes

Configuration                               CamVid (Class avg. / Mean IoU / Speed)   Cityscapes (Class avg. / Mean IoU / Speed)
ApesNet (ConvBlock_1: 64, ConvBlock_6: 64)  70.1% / 48.7% / 45 ms                    61.9% / 45.0% / 40 ms
ApesNet (ConvBlock_1: 16, ConvBlock_6: 64)  69.8% / 48.5% / 41 ms                    61.7% / 44.8% / 37 ms
ApesNet (ConvBlock_1: 64, ConvBlock_6: 8)   69.5% / 48.1% / 37 ms                    61.3% / 44.6% / 34 ms
ApesNet (ConvBlock_1: 16, ConvBlock_6: 8)   69.3% / 48.0% / 33 ms                    61.2% / 44.5% / 31 ms

5.3.3 Testing speed

Fig. 5 shows the testing time comparison when increasing the block numbers of ApesBlock (red) and ConvBlock_2 (blue), respectively, on Cityscapes. The GPU device used is an NVIDIA GTX 1080. There are two ApesBlock and one ConvBlock_2 in our ApesNet as shown in Fig. 1; therefore, the running time of the ConvBlock_2 curve is larger than 31 ms when the block number is 2.
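Per-block timings of this kind are sensitive to GPU asynchrony, so they need explicit synchronisation. The following small harness is an assumed measurement methodology, not the authors' own, using CUDA events to average the forward time of a single block in milliseconds:

import torch

def time_block(block, x, iters=100, warmup=10):
    """Average forward time of one module in milliseconds on a CUDA device."""
    block, x = block.cuda().eval(), x.cuda()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):      # warm-up passes exclude one-off setup costs
            block(x)
        start.record()
        for _ in range(iters):
            block(x)
        end.record()
    torch.cuda.synchronize()         # wait until all queued kernels have finished
    return start.elapsed_time(end) / iters  # elapsed_time() reports milliseconds

# e.g. time_block(ApesBlock(64, k=5, r=0.1), torch.randn(1, 64, 45, 60)), where
# 45 x 60 is the 1/8-resolution map of a 360 x 480 input (an assumed probe size).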
The time increase from adding two ApesBlock is around 1 ms smaller than that from adding two ConvBlock_2, though both curves in Fig. 5 are almost linear.

Figure 5. Testing time comparison when increasing the block numbers of ApesBlock (red) and ConvBlock_2 (blue) on Cityscapes.

5.3.4 Upsampling

The upsampling adopted in our ApesNet is unpooling, which directly uses the max-pooling indices to upsample the feature map without learning. The output of this unpooling method is a sparse feature map. An alternative is learning-based deconvolution [24], which associates one input activation with multiple outputs. The output of the deconvolutional layer is an enlarged and dense activation map. The unpooling layer has no parameters, while a deconvolutional layer increases the number of model parameters. We compare the segmentation accuracy using unpooling and deconvolution in our ApesNet on CamVid and Cityscapes, as shown in Table 5. Deconvolution achieves slightly better Mean IoU and lower Class Avg. compared with unpooling.

Table 5. Comparison of segmentation results using unpooling and deconvolution

Method         CamVid (Class avg. / Mean IoU)   Cityscapes (Class avg. / Mean IoU)
deconvolution  69.0% / 48.2%                    61.1% / 44.6%
unpooling      69.3% / 48.0%                    61.2% / 44.5%

5.3.5 Class weighting

The class weighting method currently used is median frequency balancing [27]. Another, more straightforward method is the reciprocal of frequency. Table 6 compares the segmentation results using these two weighting methods with the naive method, i.e. without weighting. The adopted median frequency balancing achieves better accuracy, especially on the Class Avg., which indicates more balanced accuracy across classes with significantly different pixel numbers.

Table 6. Segmentation results using different class weighting methods

Method                      CamVid (Class avg. / Mean IoU)   Cityscapes (Class avg. / Mean IoU)
no weighting                66.1% / 45.8%                    58.3% / 42.5%
reciprocal of frequency     67.4% / 46.6%                    59.2% / 43.2%
median frequency balancing  69.3% / 48.0%                    61.2% / 44.5%

5.4 Fixed point

Typical embedded devices prefer algorithms with limited numerical precision. Therefore, we truncate our 32-bit floating-point network to 16-bit and 8-bit fixed-point networks. The quantisation is executed after the floating-point network has been trained. Table 7 lists the segmentation accuracy of the 16-bit and 8-bit fixed-point networks on CamVid and Cityscapes. Compared with our floating-point network on CamVid as listed in Table 2, the 16-bit network (integer: 8 bits, fraction: 8 bits) only loses 3.1% Class Avg. and 1.1% Mean IoU. The accuracy losses are 1.4 and 1.5%, respectively, on Cityscapes. No significant accuracy gap can be observed among the three 16-bit versions.
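A minimal sketch of this kind of post-training truncation is given below. It is an assumption about the procedure rather than the authors' implementation: only the weights are quantised here, the sign bit is assumed to be counted within the integer bits, and rounding to the nearest grid point is used.

import torch

def to_fixed_point(t, int_bits, frac_bits):
    """Round a tensor to a signed fixed-point grid with int_bits integer bits
    (sign included, an assumed convention) and frac_bits fractional bits."""
    step = 2.0 ** -frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - step
    return torch.clamp(torch.round(t / step) * step, lo, hi)

def quantise_weights(model, int_bits=8, frac_bits=8):
    """Post-training truncation of a trained floating-point network's parameters."""
    with torch.no_grad():
        for p in model.parameters():
            p.copy_(to_fixed_point(p, int_bits, frac_bits))
    return model

# quantise_weights(model, 8, 8) corresponds to the 16-bit (Int.: 8, Frac.: 8) row of
# Table 7, and quantise_weights(model, 4, 4) to the 8-bit (Int.: 4, Frac.: 4) row.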
Six examples of visualised segmentation results on CamVid are shown in Fig. 6. Images from the first to the fourth column are: input image, our floating-point network, our 16-bit fixed-point network and ground truth. The second column (floating-point) is slightly better than the third column (fixed-point). Fig. 7 shows the visualised fixed-point results on Cityscapes. An interesting observation is that more objects, e.g. the white truck in the first image, are predicted as void areas in the fixed-point version, as shown in Fig. 7, which is more likely to become a problem for self-driving safety.

Figure 6. Comparison between our floating-point and fixed-point networks on CamVid. From the first to the fourth column: input image, floating-point network, 16-bit fixed-point network and ground truth. (a)-(e) CamVid examples 1-5.

Figure 7. Comparison between our floating-point and fixed-point networks on Cityscapes. From the first to the fourth column: input image, floating-point network, 16-bit fixed-point network and ground truth. (a)-(c) Cityscapes examples 1-3.

Table 7. Segmentation results of our fixed-point networks on CamVid and Cityscapes

Configuration                        CamVid (Class avg. / Mean IoU)   Cityscapes (Class avg. / Mean IoU)
16-bit ApesNet (Int.: 8, Frac.: 8)   66.2% / 46.9%                    59.8% / 43.0%
16-bit ApesNet (Int.: 6, Frac.: 10)  66.0% / 47.0%                    59.7% / 42.9%
16-bit ApesNet (Int.: 10, Frac.: 6)  66.1% / 46.9%                    59.8% / 42.9%
8-bit ApesNet (Int.: 4, Frac.: 4)    15.2% / 5.06%                    16.3% / 4.78%

6 Conclusion

We proposed a semantic segmentation network model, ApesNet, which can process images in near real-time. We expect this network to have a wide range of applications on embedded GPUs and other resource-limited embedded devices. ApesNet has been tested on six different GPUs. All the results show that our network has significant advantages in class accuracy and time consumption compared with other semantic segmentation methods such as SegNet. For road scene understanding and autonomous driving, ApesNet provides a feasible way to efficiently distinguish different objects, especially smaller objects such as road signs, bicyclists and pedestrians, which is essential to overall safety. By testing on automotive data and measuring the run time, we show that our network is a possible solution for real-time embedded systems.

7 Acknowledgments

This work was in part supported by NSF CCF-1615475, XPS-1337198, and AFRL FA8750-15-1-0176. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of grant agencies or their contractors.

8 References

1 Krizhevsky, A., Sutskever, I., Hinton, G.E.: 'ImageNet classification with deep convolutional neural networks'. Advances in Neural Information Processing Systems (NIPS), 2012
2 Lee, H., Pham, P., Largman, Y., et al.: 'Unsupervised feature learning for audio classification using convolutional deep belief networks', 2009
3 Li, S., Wu, C., Li, H., et al.: 'FPGA acceleration of recurrent neural network based language model'. IEEE Int. Symp. on Field-Programmable Custom Computing Machines, 2015, pp. 111-118
4 Yang, S.J.M., Yu, K., Xu, W.: '3D convolutional neural networks for human action recognition', IEEE Trans. Pattern Anal. Mach. Intell., 2013, 35, pp. 221-231 (doi: https://doi.org/10.1109/TPAMI.2012.59)
5 Szegedy, C., Liu, W., Jia, Y., et al.: 'Going deeper with convolutions'. IEEE Conf. on Computer Vision and Pattern Recognition, 2015
6 Simonyan, K., Zisserman, A.: 'Very deep convolutional networks for large-scale image recognition'. Int. Conf. on Learning Representations, 2015
7 He, K., Zhang, X., Ren, S., et al.: 'Delving deep into rectifiers: surpassing human-level performance on ImageNet classification'. IEEE Conf. on Computer Vision, 2015, http://arxiv.org/abs/1502.01852
8 Cheng, H., Wen, W., Song, C., et al.: 'Exploring the optimal learning technique for IBM TrueNorth platform to overcome quantization loss'. IEEE/ACM Int. Symp. on Nanoscale Architectures, 2016
9 Badrinarayanan, V., Kendall, A., Cipolla, R.: 'SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labeling', arXiv, 2015
10 He, K., Zhang, X., Ren, S., et al.: 'Deep residual learning for image recognition', arXiv, 2015
11 Brostow, G.J., Shotton, J., Fauqueur, J., et al.: 'Segmentation and recognition using structure from motion point clouds'. European Conf. on Computer Vision (ECCV), 2008
12 Verbeek, J., Triggs, W.: 'Scene segmentation with CRFs learned from partially labeled images'. NIPS, 2007
13 Couprie, C., Najman, L., LeCun, Y.: 'Learning hierarchical features for scene labeling', IEEE Trans. Pattern Anal. Mach. Intell., 2013, 35, pp. 1915-1929 (doi: https://doi.org/10.1109/TPAMI.2012.231)
14 Wen, W., Wu, C., Wang, Y., et al.: 'Learning structured sparsity in deep neural networks', 2016
15 Li, S., Liu, X., Mao, M., et al.: 'Heterogeneous systems with reconfigurable neuromorphic computing accelerators', 2016
16 Ioffe, S., Szegedy, C.: 'Batch normalization: accelerating deep network training by reducing internal covariate shift', J. Mach. Learn. Res. (JMLR), 2015, 37, pp. 1-9
17 https://developer.nvidia.com/cudnn
18 Simonyan, K., Zisserman, A.: 'Very deep convolutional networks for large-scale image recognition'. Int. Conf. on Learning Representations (ICLR), 2015
19 Szegedy, C., Liu, W., Jia, Y., et al.: 'Going deeper with convolutions'. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015
20 Simard, P.Y., Steinkraus, D., Platt, J.C.: 'Best practices for convolutional neural networks applied to visual document analysis'. Int. Conf. on Document Analysis and Recognition (ICDAR), 2003
21 Vincent, P., Larochelle, H., Lajoie, I., et al.: 'Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion', J. Mach. Learn. Res. (JMLR), 2010, 11, pp. 3371-3408
22 Long, J., Shelhamer, E., Darrell, T.: 'Fully convolutional networks for semantic segmentation'. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2015
23 Badrinarayanan, V., Kendall, A., Cipolla, R.: 'SegNet: a deep convolutional encoder-decoder architecture for image segmentation', arXiv, 2015
24 Noh, H., Hong, S., Han, B.: 'Learning deconvolution network for semantic segmentation'. Int. Conf. on Computer Vision (ICCV), 2015
25 Yu, F., Koltun, V.: 'Multi-scale context aggregation by dilated convolutions'. Int. Conf. on Learning Representations (ICLR), 2016
26 Lin, G., Shen, C., van den Hengel, A., et al.: 'Efficient piecewise training of deep structured models for semantic segmentation'. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2016
27 Eigen, D., Fergus, R.: 'Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture', arXiv, 2014
28 Everingham, M., Eslami, S.A., Gool, L.V., et al.: 'The Pascal visual object classes challenge: a retrospective', Int. J. Comput. Vis. (IJCV), 2014, 111, pp. 98-136 (doi: https://doi.org/10.1007/s11263-014-0733-5)
29 He, K., Zhang, X., Ren, S., et al.: 'Delving deep into rectifiers: surpassing human-level performance on ImageNet classification', arXiv, 2015
30 https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
31 Cordts, M., Omran, M., Ramos, S., et al.: 'The Cityscapes dataset for semantic urban scene understanding'. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2016
