GPU parallel implementation and optimisation of SAR target recognition method
2019; Institution of Engineering and Technology; Volume: 2019; Issue: 21; Language: English
10.1049/joe.2019.0669
ISSN 2051-3305
Authors: Haopeng Quan, Zongyong Cui, R. Wang, Zongjie Cao
Topic(s): Radar Systems and Signal Processing
The Journal of Engineering, Volume 2019, Issue 21, pp. 8129-8133. IET International Radar Conference (IRC 2018). Open Access.
H. Quan, Z. Cui, R. Wang, Zongjie Cao (corresponding author, zjcao@uestc.edu.cn)
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, People's Republic of China
First published: 08 October 2019
https://doi.org/10.1049/joe.2019.0669

Abstract

A SAR target recognition method based on an optimised GPU parallel algorithm is proposed here. With the rapid growth in the dimension and volume of SAR image data, traditional CPU-based target recognition algorithms cannot meet the requirements of real-time processing. Here, a target recognition algorithm comprising feature extraction and classification is investigated, then decomposed for parallel execution and optimised. First, the algorithms are investigated and decomposed for parallel execution, including the principal component analysis, linear discriminant analysis, and non-negative matrix factorisation feature extraction techniques and the support vector machine classifier. Then, the three feature extraction methods and the sequential minimal optimisation algorithm are implemented.
Finally, the factors that limit the running speed of compute unified device architecture (CUDA) programs in the target recognition algorithm are analysed in depth, and the algorithm is optimised in three respects: communication, memory access, and instruction flow. According to the experiments, the optimised GPU-based parallel implementation of the target recognition algorithm achieves a speed-up of about 25–30 times.

1 Introduction

SAR automatic target recognition (ATR) technology refers to detecting and locating targets in a large scene without manual assistance, and determining the target model, attributes, and equipment status [1]. In the last few years, many algorithms have been applied to SAR target recognition. In these algorithms, target recognition consists of two steps: SAR feature extraction, and a classification decision based on the extracted features [2]. However, compared with these rich theoretical results, research on the practical application of SAR target recognition has progressed relatively slowly, and practical SAR target detection and identification systems are still few or immature. One of the major reasons is that the increase in SAR image resolution leads to a sharp increase in the amount of data acquired, making it difficult for traditional CPU processing to handle the data in real time [3]. In the military field, a SAR image processing software system is expected to detect and classify specific targets in SAR images in real time so that a precise strike can be made; otherwise its practical value is greatly reduced. How to improve the efficiency of the algorithm implementation and achieve real-time target recognition software while maintaining recognition accuracy is a challenge for researchers. In recent years, GPU general-purpose computing has developed rapidly and is a powerful tool for improving the speed of target recognition algorithms.
This paper is organised as follows. In Section 2, the high-resolution SAR target recognition method is analysed and decomposed to design the compute unified device architecture (CUDA) [4] parallel implementation scheme. In Section 3, the implementation of the algorithm is tested with the public MSTAR database. Conclusions are presented in Section 4.

2 GPU parallel implementation and optimisation

In this part, we consider the principal component analysis (PCA), linear discriminant analysis (LDA), and non-negative matrix factorisation (NMF) feature extraction algorithms and the support vector machine (SVM) classification algorithm, and briefly introduce the CUDA program optimisation strategies.

2.1 Realisation of PCA based on CUDA

PCA is a common method for generating features and reducing dimension, that is, describing most of the information in a multi-dimensional space with a small number of linearly independent variables [5]. The main steps of the PCA feature extraction method can be found in [6]. The flow chart of the PCA algorithm is shown in Fig. 1.

Fig. 1. Flow chart of the PCA algorithm

According to the principle above, the covariance matrix Q is calculated as the product of two matrices. Matrix multiplication is a basic building block of scientific computing. Fig. 2 is a schematic diagram of matrix multiplication with CUDA in a block.

Fig. 2. Diagram of matrix multiplication based on CUDA

To calculate the eigenvalues, the Jacobi iterative eigendecomposition algorithm proposed by Kogbetliantz and Golub et al. in 1986 can be computed in parallel [7]. The basic idea of the Jacobi iterative method is presented in [8]. The parallel GPU implementation of the Jacobi iteration method is shown as Algorithm 1 (see Fig. 3).

Fig. 3. Algorithm 1: Jacobi iteration method

Before performing a Jacobi iteration, it is necessary to check whether the matrix has already been transformed into a diagonal matrix. It is sufficient to find the off-diagonal element with the largest absolute value and check that it is zero to within the allowed tolerance. After setting the diagonal elements of the matrix to zero, it is easy to determine whether the matrix is diagonal by looking for the largest element in the entire matrix. In CUDA parallel computation, the parallel reduction method is usually used to find the maximum value. The diagram of parallel reduction is shown in Fig. 4. The time complexity of the processing is logarithmic in the number of elements.

Fig. 4. Diagram of parallel reduction

2.2 Realisation of LDA based on CUDA

The basic idea of LDA [9] is presented in [6]. The flow chart of the LDA algorithm is shown in Fig. 5.

Fig. 5. Flow chart of the LDA algorithm

According to the flow chart above, calculating the within-class and between-class scatter matrices involves computing means, vector multiplications, and matrix summations, all of which are suitable for GPU parallel operation. The next step, multiplying the inverse of the within-class scatter matrix by the between-class scatter matrix, involves matrix inversion and matrix multiplication. In 2015, Tian Ning et al. ported matrix inversion to CUDA using the Gauss-Jordan algorithm with pivoting [10]. The parallel computation of the eigenvalues and eigenvectors of a matrix was analysed in detail, with a concrete CUDA implementation, in the previous section, so it is not repeated here. The implementation of the LDA feature extraction method based on CUDA is shown in Fig. 6.

Fig. 6. Flow chart of LDA based on CUDA

2.3 Realisation of NMF based on CUDA

The NMF algorithm was first proposed by Lee, D. and Seung, H.
[11]. Following [6], the flow chart of NMF is shown in Fig. 7.

Fig. 7. Flow chart of the NMF algorithm

Each step of the above flow chart is analysed for parallelism as follows. First, since the initialisation process handles little data and lacks parallelism, it is done on the CPU side. Then, the iterative updating of the matrices W and H involves a large number of matrix operations, such as matrix transposition, matrix multiplication, and element-wise (dot) division. The algorithmic complexity is [...], and the matrix operations are suitable for GPU parallel computing. Therefore, the iterative updating of the non-negative matrices W and H is implemented in parallel on the GPU side. Finally, after each iteration, it is necessary to determine whether the Frobenius norm of the error matrix is less than a threshold or whether the number of iterations exceeds a preset value. Calculating the Frobenius norm involves matrix subtraction and summation over all elements of the matrix, where the summation can be done using the parallel reduction method introduced earlier. The flow chart of the parallel calculation of NMF is given in Fig. 8.

Fig. 8. Flow chart of NMF based on CUDA

2.4 Realisation of SVM based on CUDA

SVM [12] is a machine learning method developed from statistical learning theory. In the linearly separable case, the SVM algorithm separates the two classes correctly by finding a hyperplane that maximises the margin between the two classes of samples. To use the GPU for parallel acceleration, the sequential minimal optimisation (SMO) algorithm proposed by Platt [13, 14] is adopted. The flow chart of the SMO algorithm is shown in Fig. 9.

Fig. 9. Flow chart of SMO based on CUDA

According to the flow chart, the gradient [...] needs to be calculated first.
Here, H is the kernel function matrix, which can be obtained by the formula [...]. The calculation can be decomposed into two steps: first calculate the inner products, then evaluate the Gaussian function. The large number of inner product calculations is easy to parallelise on the GPU. Then, to reduce data transfer between the CPU side and the GPU side, the maximum can be found on the GPU side using parallel reduction. Finally, solving [...] and [...] can be regarded as computing the inner product of two vectors, which parallelises well.

2.5 Optimisation strategies

To achieve the maximum efficiency of the algorithm, factors such as the division between parallel and serial tasks, communication between the host and the device, memory access, and instruction flow should be considered. Shared memory can be accessed by all threads in the same block. In the absence of access conflicts between threads (bank conflicts), the speed of accessing shared memory is comparable to that of on-chip registers. Therefore, shared memory is the best way to achieve communication between threads in a block. Most graphics cards are connected to the host via the PCI-E bus. The theoretical bandwidth of the PCI-E channel is [...], much smaller than the bandwidth between video memory and GPU on-chip memory. If a relatively large amount of data is transferred between the host and the device, the transfer can easily become the bottleneck that limits program performance. More detailed information can be found in [15].

3 Experimental part

In this section, based on the target recognition theory and optimisation methods described above, GPU parallel versions of the three feature extraction algorithms and the SVM classification algorithm are implemented, and their speed is then further improved.
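The parallel reduction that Section 2 relies on to find a maximum (in the Jacobi convergence check and in the SMO step) can be illustrated with a minimal Python sketch that emulates the pairwise reduction pattern sequentially on the CPU. This is not the authors' CUDA kernel: the function names `tree_reduce_max` and `max_off_diagonal` are hypothetical, the loop over `tid` stands in for threads that would run concurrently within a block, and the list `buf` plays the role of shared memory.

```python
import math


def tree_reduce_max(values):
    """Return (maximum, number_of_steps) using a pairwise tree reduction.

    Sequential emulation of a GPU block reduction: in each step, "thread"
    tid combines element tid with element tid + stride, and the stride is
    halved until a single value remains.
    """
    if not values:
        raise ValueError("values must not be empty")

    # Pad to a power of two so that every step halves the active range.
    size = 1
    while size < len(values):
        size *= 2
    buf = list(values) + [-math.inf] * (size - len(values))

    steps = 0
    stride = size // 2
    while stride > 0:
        # On a GPU these iterations are independent and would run in
        # parallel, followed by a barrier before the next step.
        for tid in range(stride):
            buf[tid] = max(buf[tid], buf[tid + stride])
        stride //= 2
        steps += 1
    return buf[0], steps


def max_off_diagonal(matrix):
    """Largest absolute off-diagonal entry of a square matrix (list of rows)."""
    n = len(matrix)
    # Diagonal entries are replaced by zero, as in the convergence check.
    flat = [abs(matrix[i][j]) if i != j else 0.0
            for i in range(n) for j in range(n)]
    return tree_reduce_max(flat)[0]


if __name__ == "__main__":
    value, steps = tree_reduce_max([3.0, -7.0, 2.0, 9.0, 4.0])
    print(value, steps)
    # A matrix counts as diagonal when this maximum is below the tolerance.
    print(max_off_diagonal([[5.0, 1e-9], [1e-9, 2.0]]) < 1e-6)
```

The number of steps equals the base-2 logarithm of the padded length, which is why the pattern suits a GPU: each step's comparisons are independent and can be assigned to separate threads.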
To verify the acceleration achieved by CUDA parallel processing of the high-resolution SAR target recognition algorithm, the same hardware environment and development platform are used to compare running times. In these experiments, an NVIDIA GeForce GTX 1080 Ti is used. The GPU card is mounted in a desktop computer with an Intel Core i7-7700 3.60 GHz 64-bit CPU as the host, 16 GB RAM, and the Windows 10 64-bit OS, and the algorithms are implemented in C/C++ in the Visual Studio 2015 development environment. The CUDA driver and runtime version is 9.0.

The moving and stationary target acquisition and recognition (MSTAR) [16] program was jointly sponsored by the Defense Advanced Research Projects Agency and the Air Force Research Laboratory. Hundreds of thousands of SAR images of ground targets were collected in MSTAR, covering different target types, aspect angles, depression angles, serial numbers, and articulations. The datasets used in this paper come from the public MSTAR database. This paper selects three of these targets: BTR-70, BMP-2, and T-72. Their optical images and SAR images are shown in Fig. 10.

Fig. 10. Optical images and SAR images of three targets: (a) BTR-70, (b) BMP-2, (c) T-72, (d) BTR-70-SAR, (e) BMP-2-SAR, (f) T-72-SAR

The datasets were divided into two parts: training subsets and testing subsets. The datasets with a pitch angle of [...] were used as training subsets, and the datasets with a pitch angle of [...] were used as testing subsets.

3.1 Feature extraction experiment

Because PCA feature extraction parallelises better, the CUDA-based PCA feature extraction algorithm is verified in this experiment; the recognition rate is not considered for now. The data dimension is [...]. The number of samples in this experiment is gradually increased from 400 to 6400.
In this experiment, to reach the required number of samples, blur of degree [...] and [...] is artificially added to the images to expand the database. The results of the experiment are given in Table 1.

Table 1. PCA feature extraction experiment results

Samples   Training time, ms          Speed-up
          CPU          GPU
400       5432         737           8
800       20,263       1608          12
1600      134,674      6457          21
3200      943,396      33,636        28
6400      8,575,242    267,925       32

3.2 SVM classification experiment

This experiment mainly verifies the acceleration of the CUDA-based parallel SMO algorithm, so only the SVM model is trained, and the speed-up is obtained by comparing the training times. The results of the experiment are given in Table 2.

Table 2. SVM classification experiment results

Samples   Training time, ms          Speed-up
          CPU-SVM      GPU-SVM
400       8530         1821          4.7
800       31,702       3785          8.2
1600      173,641      13,926        12.5
3200      1,092,332    57,789        18.9
6400      11,860,250   470,634       25.2

3.3 SAR target recognition experiment

In this experiment, the three feature extraction algorithms are each combined with the SVM-based classification algorithm and run as CPU serial and as CPU-GPU heterogeneous parallel computations, respectively, to realise high-resolution SAR target recognition. The recognition times of the two implementations are recorded and compared to verify the feasibility of real-time high-resolution SAR target recognition software. For each of the three targets, 200 SAR images with a pitch angle of [...] and 200 SAR images with a pitch angle of [...] were selected, giving 600 training samples and 600 test samples, from which the experimental data of Table 3 were obtained.

Table 3. Target recognition experiment results

Algorithm    Samples   Total time, ms        Speed-up   Recognition rate, %
                       CPU        GPU
PCA + SVM    600       95,355     6373       15.0       92.46
LDA + SVM    600       112,125    8639       12.8       84.75
NMF + SVM    600       92,634     6992       13.2       93.04

3.4 Advanced optimisation

The previous part achieved the parallel version of the target recognition algorithm and its port to CUDA.
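The speed-up column in Table 3 is read here as the CPU total time divided by the GPU total time. That reading is an assumption, since the text does not define the ratio explicitly. Under it, a minimal Python sketch can recompute the ratios from the reported times; the helper name `speed_up` and the dictionary `TABLE_3` are hypothetical and introduced only for illustration.

```python
def speed_up(cpu_ms, gpu_ms):
    """Ratio of CPU time to GPU time, both given in milliseconds."""
    if gpu_ms <= 0:
        raise ValueError("gpu_ms must be positive")
    return cpu_ms / gpu_ms


# Total times in ms as reported in Table 3: (CPU, GPU).
TABLE_3 = {
    "PCA + SVM": (95355, 6373),
    "LDA + SVM": (112125, 8639),
    "NMF + SVM": (92634, 6992),
}

if __name__ == "__main__":
    for name, (cpu_ms, gpu_ms) in TABLE_3.items():
        # Recomputed from the published times; the printed column in the
        # table remains the reported value.
        print(f"{name}: {speed_up(cpu_ms, gpu_ms):.2f}x")
```

The same helper applies to Tables 1 and 2 by passing the CPU and GPU training times of a row.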
Based on the optimisation strategies described above, the speed of the algorithm can be further improved. The results are given in Table 4.

Table 4. Optimised target recognition experiment results

Algorithm    Total time, ms                    Speed-up
             Before optimisation   After       Before optimisation   After
PCA + SVM    6373                  2910        15.0                  32.8
LDA + SVM    8639                  4192        12.8                  26.7
NMF + SVM    6992                  3582        13.2                  25.9

Since the target detection process involves a large number of highly parallel matrix operations, and the algorithm has been optimised, GPU parallel computing obtains a very high speed-up compared with CPU serial computing.

4 Conclusion

In this paper, the real-time problem of SAR target recognition is studied, and the PCA, LDA, and NMF feature extraction algorithms and the SVM classifier are selected as the specific algorithms for high-resolution SAR target recognition. The three feature extraction methods and the SVM algorithm are decomposed for parallel execution and implemented. To verify the real-time performance and correctness of the algorithm, the feature extraction algorithms and the SVM are combined to obtain a complete target recognition method. In addition, this paper gives several optimisation strategies for CUDA programs and applies a series of optimisations to the target recognition algorithm. The results show that the algorithm achieves a speed-up of 25–30 times while maintaining the recognition rate. Achieving real-time data processing in target recognition software is of great significance.

5 Acknowledgments

This study was supported by the Key Technology R&D Program of Sichuan Province (2015GZ0109) and the National Natural Science Foundation of China under Grant no. 61271287 and Grant U1433113.

6 References

[1] Hu L.: 'Research on target recognition technology of synthetic aperture radar'. Master's thesis, Xidian University, 2009
[2] Sun J.: 'Modern pattern recognition' (Higher Education Press, Beijing, China, 2008)
[3] Fredj H., Ltaif M., Ammar A. et al.: 'Parallel implementation of Sobel filter using CUDA'. Int. Conf. on Control, Automation and Diagnosis, Marrakesh, Morocco, July 2017, pp. 209–212
[4] Singh S., Paul A., Arun M.: 'Parallelization of digit recognition system using deep convolutional neural network on CUDA'. Int. Conf. on Sensing, Limerick, Ireland, 2017, pp. 379–383
[5] Javier L., Dora H., Francisco A. et al.: 'CUDA multiclass change detection for remote sensing hyperspectral images using extended morphological profiles'. 2017 9th IEEE Int. Conf. on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), Bucharest, Romania, 2017, pp. 404–409
[6] Cui Z., Cao Z., Yang J. et al.: 'A hierarchical propelled fusion strategy for SAR automatic target recognition', EURASIP J. Wirel. Commun. Netw., 2013, 2013, (1), pp. 39–46
[7] Golub G., Vorst H.: 'Eigenvalue computation in the 20th century', J. Comput. Appl. Math., 2000, 123, pp. 35–65
[8] Ryoo S., Rodrigues C., Baghsorkhi S. et al.: 'Optimization principles and application performance evaluation of a multithreaded GPU using CUDA'. ACM Sigplan Symp. on Principles and Practice of Parallel Programming, PPOPP 2008, Salt Lake City, UT, USA, 2008, pp. 73–82
[9] Ding B., Wen G., Ma C. et al.: 'Decision fusion based on physically relevant features for SAR ATR', IET Radar Sonar Navig., 2016, 11, (5), pp. 682–690
[10] Tian N.: 'Research on matrix computation of GPU acceleration'. Master's thesis, Heilongjiang University, 2015
[11] Cao Z., Min R., Pi Y. et al.: 'The feasibility analysis of applying NMF in SAR target recognition'. 2015 IEEE Int. Conf. on Digital Signal Processing (DSP), Singapore, 2015, pp. 721–725
[12] Yekkehkhany B., Safari A., Homayouni S. et al.: 'A comparison study of different kernel functions for SVM-based classification of multi-temporal polarimetry SAR data', ISPRS - Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., 2014, XL-2/W3, (1), pp. 281–285
[13] Demirhan M., Salor O.: 'Classification of targets in SAR images using SVM and k-NN techniques'. Signal Processing and Communication Application Conf., Zonguldak, Turkey, May 2016, pp. 1581–1584
[14] Shao X., Wu K., Liao B.: 'Single directional SMO algorithm for least squares support vector machines', Comput. Intell. Neurosci., 2013, 2013, (5), pp. 1–7
[15] Duane S., Mete Y.: 'CUDA for engineers: an introduction to high-performance parallel computing' (Addison-Wesley Professional, Boston, MA, USA, 2015)
[16] Wang H., Chen S., Xu F. et al.: 'Application of deep-learning algorithms to MSTAR data'. 2015 IEEE Int. Geoscience and Remote Sensing Symp., Milan, Italy, 2015, pp. 3743–3745

Volume 2019, Issue 21, November 2019, pages 8129-8133