ABiFN: Attention‐based bi‐modal fusion network for object detection at night time
2020; Institution of Engineering and Technology; Volume: 56; Issue: 24; Language: English
10.1049/el.2020.1952
ISSN: 1350-911X
Authors: Sai Charan Addanki, M. Jitesh, Md Intisar Chowdhury, H. Venkataraman
Topic(s): Infrared Target Detection Methodologies
Electronics Letters, Volume 56, Issue 24, pp. 1309-1311. First published: 29 October 2020. https://doi.org/10.1049/el.2020.1952
A. Sai Charan (corresponding author, charan.a@iiits.in), M. Jitesh, H. Venkataraman: Indian Institute of Information Technology Sri City, Sri City, India. M. Chowdhury: Karvy Analytics, Hyderabad, India.

Abstract
Camera-based object detection in low-light/night-time conditions is a fundamental problem because of insufficient lighting. So far, mid-level fusion of RGB and thermal images has been used so that the two modalities complement each other's features. In this work, an attention-based bi-modal fusion network is proposed for better object detection in the thermal domain by integrating a channel-wise attention module. The experimental results show that the proposed framework improves the mAP by 4.13 points on the FLIR dataset.

Introduction
Object detection has been a fundamental computer vision problem for decades and has been adopted in various domains. In the automotive domain, level 2 and level 3 autonomy does not support accurate detection at night time owing to the lack of thermal imaging. With the advent of deep learning models and the availability of high computational resources, detection and classification of on-road objects have attracted a lot of attention from both industry and academia. Many effective object detection methods based on deep neural networks have been proposed in recent years [1]. In general, object detection methods fall into two categories: two-stage detection, in which a sparse set of object proposals is generated first, and one-stage detection, which is proposal free. Among the many object detection methods proposed, region-based convolutional neural networks (R-CNN) [2-4] have become predominant owing to their effective performance. This line of work evolved from the R-CNN [2], where region proposals are extracted from the image and each region of interest (ROI) is then classified independently. To minimise redundant computation, Fast R-CNN [3] and SPP-Net [5] were introduced to share the convolutional features among all ROIs.
The Faster R-CNN [4] combines Fast R-CNN with a region proposal network (RPN) that produces object proposals for further detection. Two-stage detectors achieve better detection than one-stage detectors at the cost of inference speed [6]. Most object detection algorithms are designed to work on RGB images captured at daytime by visible-light cameras. However, unlike the widely available generic/specific object detection datasets (Pascal, COCO, KITTI), very few thermal IR datasets (FLIR, Multi-spectral) are publicly available. Detection performance drops under challenging lighting conditions, and most algorithms fail in darkness because the structure and colour features of objects change remarkably. For instance, reflections from oncoming headlights impose challenges for detection in RGB images. Thermal IR sensors perform well in such conditions, as they are illumination invariant. They are also economical, non-intrusive and small in size. The fusion of RGB and thermal images lets the two modalities complement each other's features for robust detection. For a given task, the optimal way to fuse multi-sensor data is often unintuitive, and no single approach is ideal in every circumstance. In this work, an 'Attention-based Bi-modal Fusion Network' (ABiFN) is proposed for object detection in the thermal domain. The experimental results show that our model improves the performance on the FLIR dataset. Faster R-CNN is chosen as the base detector and the ResNet-101 [7] architecture for feature extraction. The mid-fusion is performed by stacking the feature vectors of the RGB and thermal images obtained through the attention module. The resultant fused feature vector is fed into the RPN, followed by classification.

Related work
Detection of objects in low-light images using thermal imagery has been an active research topic in computer vision [8-10], particularly in surveillance. Most deep learning-based object detection networks are currently designed for single-modal sensory data, typically RGB/visible images. Almost any matter above absolute zero temperature can be seen with thermal cameras [11]. The thermal radiation spectrum ranges from 0.1 to 100 μm, while that of visible light ranges from 0.4 to 0.76 μm. Thermal images can therefore be useful in detecting objects when the lighting conditions are inadequate. It is to be noted that LIDARs can also be used for detection in unsatisfactory lighting conditions, but with some limitations. There are three major advantages of using thermal imaging cameras: (1) they are more expensive than visible-light cameras, but cost less than LIDAR; (2) thermal images are greyscale visual images in nature, so advanced computer vision techniques can directly support applications for thermal imaging; (3) they provide dense images in real time, similar to a visible camera, whereas LIDAR point clouds are a different type of data: sparse point lists rather than dense arrays [12, 13]. For example, the FLIR automotive thermal cameras can stream dense thermal images at 60 Hz, while LIDAR point clouds are far sparser and the frame rates are also slower. There are several works on the data fusion of different sensors. Typically, the fusion approaches can be categorised into three groups based on the level of data abstraction: early, mid and late fusion. In our work, we use the mid-fusion method, where the extracted features of each raw data stream are fused.
One of the recent works in this category is DenseFuse [14], where the RGB and thermal images are fused to preserve deep features for better classification. An anomaly-based obstacle detection approach using a train-mounted thermal camera is proposed in [15, 16]. In [17], a pedestrian classifier and a fusion-based tracker are proposed using background subtraction. In this line, the work closest to ours is [18], in which the authors proposed multi-modal object detection based on translation methods, primarily to increase the detection accuracy in the thermal domain. In our work, however, an attention-based feature fusion is used to make the model learn better-represented features. The attention mechanism mimics the human visual system: human eyes do not take in an entire scene at once, but focus on and gain a field of view selectively [19, 20]. For instance, Pumarola et al. [21] proposed a generative adversarial network (GAN) based on a facial action coding system for face generation. This approach introduced attention modules into the network to concentrate exclusively on facial regions, rendering the network resilient to background noise and changing lighting conditions. In [22], the authors introduced a raindrop-removal GAN with an attention mechanism that can selectively identify areas of raindrops. A residual attention network [23] is implemented consisting of stacked attention modules to learn various kinds of attention mechanisms. However, there are still limitations in the selection of a fusion method for a specific application, and the feature-map attention details vary with the architecture. Owing to the variation in appearance in visual images, our work presents a network structure that facilitates the learning of reliable visual features during training.

Methodology
The primary idea of our method is to borrow knowledge from data-rich domains such as the visual (RGB) domain by using an attention mechanism to learn the features of the ROI in an image. The workflow proposed for our multi-modal fusion framework is shown in Fig. 1.
Fig. 1: Block diagram of the proposed ABiFN framework

Attention module
The attention module in this work is inspired by the multi-domain attentive detection network of [24] for the task of object detection. To use data from multiple sensors effectively, it is important to prioritise data by giving more weight to specific data. For this, we modify the existing attention module and use it in our detection framework. We consider that an object class is visible in one of the domains, as the RGB and thermal images carry complementary details. The architecture of the attention module is illustrated in Fig. 2. It has three fully connected (FC) layers, with max-pooling (max-pool) and average-pooling (avg-pool) applied to the features f horizontally and vertically, respectively. The pooling transformations reduce the feature f ∈ R^(C×W×H) to channel-wise descriptors, where C, W and H denote the channels, width and height, respectively. In the ResNet-101 architecture used for feature extraction, the attention module is inserted in the first ResNet block after the ReLU and the first batch normalisation layer; likewise, an attention module is inserted in every alternate ResNet block. The outputs of the two pooling layers are concatenated and an element-wise product is performed. The module produces channel-wise attention feature maps from the RGB-network and thermal-network branches, respectively.
Fig. 2: Attention module architecture
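For concreteness, the sketch below illustrates one way such a channel-wise attention block and the subsequent stacking of the two attended feature maps could be written in PyTorch, the framework used later for training. The layer widths, the reduction ratio, the sigmoid gating and the names ChannelAttention and fuse_features are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of channel-wise attention plus mid-fusion (assumptions noted in text).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weights the C channels of a feature map f of shape (B, C, H, W)
    using avg-pooled and max-pooled descriptors passed through three FC layers."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        avg = f.mean(dim=(2, 3))                              # (B, C) average-pooled
        mx = f.flatten(2).max(dim=2).values                   # (B, C) max-pooled
        w = torch.sigmoid(self.mlp(torch.cat([avg, mx], 1)))  # (B, C) channel weights
        return f * w.view(b, c, 1, 1)                         # element-wise product


def fuse_features(att_rgb: nn.Module, att_th: nn.Module,
                  f_rgb: torch.Tensor, f_th: torch.Tensor) -> torch.Tensor:
    """Mid-fusion: stack the attended RGB and thermal maps channel-wise;
    the result is what would be passed on to the RPN of the Faster R-CNN."""
    return torch.cat([att_rgb(f_rgb), att_th(f_th)], dim=1)


if __name__ == "__main__":
    att_rgb, att_th = ChannelAttention(256), ChannelAttention(256)
    f_rgb, f_th = torch.randn(1, 256, 40, 50), torch.randn(1, 256, 40, 50)
    print(fuse_features(att_rgb, att_th, f_rgb, f_th).shape)  # (1, 512, 40, 50)
```

In ABiFN, one such block would sit inside each branch's backbone at the positions described above; here the fusion step simply concatenates the two attended maps channel-wise before the RPN.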
Mid-fusion
In the mid-fusion method, the two attended feature vectors are stacked together into a single feature vector and then fed into the RPN of the Faster R-CNN. To reduce complexity, the number of region proposals for bounding boxes is set to 200 while maintaining a trade-off with recall. The proposals are sent to the box classifier through an FC layer to generate the bounding boxes, where each box carries a class label with a score. We also tried a late-fusion method, i.e. fusion of the ROI-pooled features of the thermal and RGB images: the pooled ROI features of the thermal image and of the RGB image were obtained from their respective region proposals and concatenated. However, the results obtained using mid-level fusion were better.

Dataset
The FLIR dataset [25] is used for the experiments. The images are captured using a FLIR Tau2 camera. About 60% of the training images are collected during daytime and the remaining 40% during the night. The dataset provides images in both the RGB and thermal domains, although they are not paired. Some example images from the dataset are shown in Fig. 3; the first and second rows show thermal, RGB and fused images column-wise. For all experiments, we use the training and test splits provided in the dataset benchmark, which contains the car (41,260 instances), person (22,372 instances) and bicycle (3986 instances) categories.
Fig. 3: Image examples from the FLIR dataset (rows 1 and 2)

Training details
The model in this work is implemented using PyTorch 1.4.0 with the CUDA 10.1 and cuDNN 9.0 libraries. The network is trained on a PC with an Intel i7 CPU and an NVIDIA RTX 2080 Ti graphics card with 8 GB of graphics memory. The weight decay and momentum are set to 0.0005 and 0.9, respectively. For training, we use the stochastic gradient descent optimisation solver and the cross-entropy loss. The initial learning rate is set to 0.001 and is decayed by a factor of 0.3 after every 7 epochs. The network is trained until there is no further decrease in the loss.
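As a rough illustration of this training configuration, the PyTorch snippet below reproduces the reported optimiser and learning-rate schedule. The torchvision Faster R-CNN is only a stand-in for the bi-modal detector, and the 50-epoch loop bound is a placeholder, since the letter trains until the loss stops decreasing.

```python
# Sketch of the optimisation schedule (stand-in detector; see text).
import torch
import torchvision

# Stand-in detector: the letter's model instead uses a ResNet-101 backbone
# with attention modules and fused RGB + thermal features, initialised from
# MS-COCO weights.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn()

# SGD with the reported momentum, weight decay and initial learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)

# Decay the learning rate by a factor of 0.3 after every 7 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.3)

for epoch in range(50):  # placeholder; train until the loss stops decreasing
    # ... one pass over the FLIR training split goes here; the detector's
    # classification head is trained with a cross-entropy loss ...
    scheduler.step()
```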
Baseline
Faster R-CNN is used as the baseline detector. We follow the original paper [4] for all hyper-parameters in the model, unless specified otherwise. The weights obtained from training on the MS-COCO dataset are used to initialise the ResNet-101 network. Average precision (AP) and mean AP (mAP) are used as the metrics in this experiment. Fig. 4 depicts the detection on a thermal test image. In particular, the FLIR dataset provides a benchmark mAP (at an IoU of 0.5) of 58.0, and the work in [18] reports an mAP of 61.54. We show that our ABiFN model beats this benchmark.
Fig. 4: Multi-object detection: (a) test image, (b) detection results

Results
Table 1 shows the results obtained for different training schemes on the FLIR thermal test set. First, the model is trained only on the thermal data and obtains an mAP of 54.05. We then use both the RGB and thermal images provided in the dataset for training, with feature fusion using the mid-fusion scheme; the results show a significant improvement, with the mAP increasing to 60.47. The proposed method of fusing the RGB and thermal features extracted through the attention module improves the mAP to 62.13 and outperforms the FLIR benchmark mAP by 4.13 points.

Table 1. Performance comparison of the proposed method for each class

| Train                             | Test | Car (AP) | Person (AP) | Bicycle (AP) | mAP   |
|-----------------------------------|------|----------|-------------|--------------|-------|
| Thermal (Th)                      | Th   | 70.12    | 57.83       | 34.21        | 54.05 |
| RGB + Th (mid-fusion)             | Th   | 70.30    | 64.34       | 46.70        | 60.47 |
| RGB + Th + attention (mid-fusion) | Th   | 71.83    | 66.08       | 48.49        | 62.13 |

Conclusion
An attention-based object detection framework is proposed to improve detection in the thermal domain by learning and combining the better features from each domain. For various training settings of our model, we show that our framework surpasses the benchmark of the FLIR dataset. The resolution of the thermal images is one significant factor that makes it difficult to detect objects located far from the camera. In addition, objects close to each other, for example pedestrians, are sometimes detected as a single object instead of two. Our future work will analyse the failure cases in the detection of smaller objects.

References
1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: 'ImageNet classification with deep convolutional neural networks'. Neural Information Processing Systems, Lake Tahoe, NV, USA, 2012, pp. 1097-1105
2. Girshick, R., Donahue, J., Darrell, T., et al.: 'Rich feature hierarchies for accurate object detection and semantic segmentation'. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 580-587
3. Girshick, R.: 'Fast R-CNN'. Proc. IEEE Int. Conf. on Computer Vision, Santiago, Chile, 2015, pp. 1440-1448
4. Ren, S., He, K., Girshick, R., et al.: 'Faster R-CNN: towards real-time object detection with region proposal networks'. Advances in Neural Information Processing Systems, Montreal, Canada, 2015, pp. 91-99
5. He, K., Zhang, X., Ren, S., et al.: 'Spatial pyramid pooling in deep convolutional networks for visual recognition'. European Conf. on Computer Vision, Zurich, Switzerland, 2014, pp. 346-361
6. Redmon, J., Farhadi, A.: 'YOLO9000: better, faster, stronger', arXiv preprint, 2017
7. He, K., Zhang, X., Ren, S., et al.: 'Deep residual learning for image recognition'. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778
8. Hwang, S., Park, J., Kim, N., et al.: 'Multispectral pedestrian detection: benchmark dataset and baseline'. IEEE Conf. on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 1037-1045
9. Mangale, S., Khambete, M.: 'Moving object detection using visible spectrum imaging and thermal imaging'. Int. Conf. on Industrial Instrumentation and Control, Pune, India, 2015, pp. 590-593
10. Zin, T.T., Takahashi, H., Hama, H.: 'Robust person detection using far infrared camera for image fusion'. Int. Conf. on Innovative Computing, Information and Control, Kumamoto, Japan, 2007, p. 310
11. Vollmer, M., Mollmann, K.P.: 'Infrared thermal imaging: fundamentals, research and applications' (Wiley, Germany, 2017)
12. Sun, X., Ma, H., Sun, Y., et al.: 'A novel point cloud compression algorithm based on clustering', IEEE Robot. Autom. Lett., 2019, 4, (2), pp. 2132-2139, doi: 10.1109/LRA.2019.2900747
13. Yun, P., Tai, L., Wang, Y., et al.: 'Focal loss in 3D object detection', IEEE Robot. Autom. Lett., 2019, 4, (2), pp. 1263-1270, doi: 10.1109/LRA.2019.2894858
14. Li, H., Wu, X.J.: 'DenseFuse: a fusion approach to infrared and visible images', IEEE Trans. Image Process., 2018, 28, (5), pp. 2614-2623, doi: 10.1109/TIP.2018.2887342
15. Berg, A.: 'Detection and tracking in thermal infrared imagery', Linköping Studies in Science and Technology, Thesis No. 1744, Linköping University, Sweden, 2016
16. Berg, A., Ofjall, K., Ahlberg, J., et al.: 'Detecting rails and obstacles using a train-mounted thermal camera', in Paulsen, R.R., Pedersen, K.S. (Eds.): 'Image Analysis' (Springer, Switzerland, 2015), pp. 492-503
17. Leykin, A., Ran, Y., Hammoud, R.: 'Thermal-visible video fusion for moving target tracking and pedestrian classification', 2007
18. Chaitanya, D., Akolekar, N., Sharma, M.M., et al.: 'Borrow from anywhere: pseudo multi-modal object detection in thermal imagery'. Proc. IEEE Conf. on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 2019
19. Itti, L., Koch, C., Niebur, E.: 'A model of saliency-based visual attention for rapid scene analysis', IEEE Trans. Pattern Anal. Mach. Intell., 1998, 20, (11), pp. 1254-1259, doi: 10.1109/34.730558
20. Corbetta, M., Shulman, G.: 'Control of goal-directed and stimulus-driven attention in the brain', Nature Rev. Neurosci., 2002, 3, (3), p. 201, doi: 10.1038/nrn755
21. Pumarola, A., Agudo, A., Martinez, A.M., et al.: 'GANimation: anatomically-aware facial animation from a single image'. ECCV, Munich, Germany, 2018
22. Qian, R., Tan, R.T., Yang, W., et al.: 'Attentive generative adversarial network for raindrop removal from a single image'. CVPR, Salt Lake City, UT, USA, 2018
23. Wang, F., Jiang, M., Qian, C., et al.: 'Residual attention network for image classification', arXiv preprint arXiv:1704.06904, 2017
24. Sungmin, C., Choi, B., Kim, D.-H., et al.: 'Multi-domain attentive detection network'. IEEE Int. Conf. on Image Processing, Taipei, Taiwan, 2019
25. https://www.flir.in/oem/adas/adas-dataset-form, last accessed 10 June 2020