Article · Open access · Peer reviewed

Deploying tactical communication node vehicles with AlphaZero algorithm

2019; Institution of Engineering and Technology; Volume: 14; Issue: 9; Language: English

10.1049/iet-com.2019.0349

ISSN

1751-8636

Authors

Xiaofei Zou, Ruopeng Yang, Changsheng Yin, Zongzhe Nie, Huitao Wang

Topic(s)

Modular Robots and Swarm Intelligence

Abstract

The construction of mobile ad-hoc networks on the battlefield is mainly planned by staff or automatically planned with the help of network topology planning models in network planning software. Most of these algorithms rely, to a greater or lesser extent, on human knowledge or human ways of thinking to model network entities, environments, and rules, and the accuracy and rationality of the models are not strong enough to match a changing battlefield. The AlphaZero algorithm, which generalises across chess and other games, provides a new intelligent method for solving complex problems in the military field.
Based on the AlphaZero algorithm, this study proposes a method for the intelligent deployment of mobile ad-hoc networks built from tactical communication node vehicles. Making an analogy between deploying tactical communication node vehicles and playing Go, the authors construct a deep reinforcement learning model for the deployment of communication node vehicles. Starting from random play, given no domain knowledge beyond the judgment of the network structure, and training the designed policy-value deep neural network by self-play reinforcement learning, they successfully deploy communication node vehicles on a tabula rasa map and construct battlefield mobile ad-hoc networks with deep reinforcement learning and Monte Carlo tree search.

1 Introduction

The mobile ad-hoc networks (MANETs) on the battlefield are mainly constructed by communication node vehicles (CNVs), which usually accompany and are assigned to combat units (CUs). The MANET of units below regiment size is planned manually by staff or automatically with the help of network topology planning models in network planning software [1]. In manual planning, staff select locations for the CNVs and draw the connection relationships between them on digital maps according to the communication support requirements; the positions selected by hand are only approximate and the accuracy is not high. In automatic planning, the network topology is generated by a planning model built into the network planning software from the input conditions [2]. These models and algorithms are of three kinds. The first is multi-objective optimisation algorithms focusing on specific communication requirements, such as the communication requirements of CUs, network coverage area, deployment cost, network service quality, and other indicators [3]. The second uses graph theory for analysis and design, focusing on the characteristics of the network topology; it abstracts the communication nodes and the connection relationships between them into points and lines by setting approximate assumptions and constraints [4]. The third uses artificial intelligence algorithms such as the genetic algorithm [5] and ant colony optimisation [6], which mainly simulate the process of natural evolution to search for optimal results. In general, most of these algorithms are based, to a greater or lesser extent, on human knowledge or human ways of thinking to model network entities, environments, and rules. The accuracy and rationality of the models are not strong enough to match the changing battlefield, and models that perfectly match changing battlefield circumstances are relatively rare. In actual application, it is generally necessary to rely on the experience of the commander to intervene and adjust the scheme produced by the model. The generalisation of the AlphaZero algorithm to chess and shogi also provides a new artificial intelligence method for solving various complex military problems [7]. As is known, the deployment of CNVs in a regiment's MANET is carried out on grid or quasi-grid digital maps. Because a grid or quasi-grid digital map is just like a chessboard, and CNVs and CUs are just like white and black pieces, we can think of the process of deploying CNVs on grid or quasi-grid maps as playing Go [8-10] and consider learning from the AlphaZero algorithm to explore its application to setting up a MANET on the battlefield [11]. Without manual intervention, the automatic planning of the battlefield MANET can then be realised flexibly and quickly according to the positions and status of the CUs and the requirements of battlefield communication support.
2 Problems for solving

Although the process of deploying CNVs is broadly similar to playing Go, there are detailed differences between them, such as the kinds of CNVs and the number of CUs in the mapping, the methods for producing diverse samples, and the rules for judging a deployment, which need to be handled effectively. The AlphaZero algorithm for board games therefore cannot simply be copied for the intelligent deployment of CNVs.

2.1 Abstraction and mapping

2.1.1 Battle area mapping

The communication transmission equipment carried by CNVs mainly uses wireless transmission modes such as very-high frequency (VHF) and shortwave (SW). Unlike the blank background of the Go board, the combat area includes a great deal of geographic information, such as mountains, lakes, and forests, which influences the effect of VHF. In the process of deploying CNVs, the influence of this geographical information should be considered. In this study we select a specific application scenario with a line-of-sight geographical environment; research in complex geographic environments is left as the next step. Since wireless transmission is almost unaffected in a line-of-sight environment, connections between CNVs in VHF mode are easy to establish, and we can treat the battle area map as Go's blank board.

2.1.2 Element mapping

CNVs include many kinds of loading vehicles on the battlefield. If CNVs were abstracted by the type of loading vehicle, we would obviously need chess pieces of additional colours. Considering that the transmission mode of the MANET is mainly SW or VHF, we can instead divide CNVs according to the SW or VHF transmission mode. Since the difference between SW and VHF is mainly the communication distance, it can be handled within the judgment of the network structure. CNVs and CUs can therefore be abstracted as the white and black pieces of Go.

2.1.3 Deployment mapping

Since the positions of the CUs are known in the initial phase, planning the MANET only involves selecting the positions of the CNVs, so the deployment process is not like the alternation of black and white pieces in Go [9]. The alternating-move rule of Go has to be changed for the MANET: we take the given positions of the CUs as a single step and place them all at once, and then complete the deployment of the CNVs step by step.

2.2 Diversity of data production

The map of the battle area is similar in shape to the boards of chess and shogi, and it does not have the mirror and rotational equivalence of Go positions [12]. Therefore, the number of data samples cannot be expanded by reflecting and rotating the board as in Go. A further difference is that sample data can only be collected when the CNV side wins, that is, when the MANET is successfully constructed, so there are fewer data samples than in board games. To improve the diversity of data types and numbers, one method is to improve the diversity of CNV placements in the current situation, in other words, the diversity of network topologies when the MANET is successfully deployed. The second method is to maximise the width of the exploration space under the current situation. As in the AlphaZero algorithm, to ensure exploration we add noise [10, 13] at the root to expand the width of the Monte Carlo tree search (MCTS) [14] and search for more possible CNV locations, as sketched below.
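The paper does not state which noise distribution or parameters were used; as a hedged sketch in the AlphaZero style, root-prior noise could be mixed in as follows (the Dirichlet form and the values of epsilon and alpha are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def add_exploration_noise(prior_probs, epsilon=0.25, alpha=0.3, rng=None):
    """Mix Dirichlet noise into the MCTS root prior to widen the search.

    prior_probs : 1-D array of prior probabilities over candidate CNV cells.
    epsilon, alpha : assumed AlphaZero-style settings; the paper does not
    state which values were actually used.
    """
    rng = np.random.default_rng() if rng is None else rng
    prior_probs = np.asarray(prior_probs, dtype=float)
    noise = rng.dirichlet([alpha] * prior_probs.size)
    return (1.0 - epsilon) * prior_probs + epsilon * noise

# Example: perturb a uniform prior over the 36 cells of a 6 x 6 map.
priors = np.full(36, 1.0 / 36)
noisy = add_exploration_noise(priors)
print(round(noisy.sum(), 6))  # 1.0 (still a probability distribution)
```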
2.3 Judgment of network structure

The framework of the AlphaZero algorithm is built on zero-sum board games, so there are precise game rules in Go, chess, shogi, and other games [7], and the model is easy to converge and realise. Since constructing a MANET involves command technique and art, like other military operations, there are many evaluation indicators; it is a multi-objective optimisation problem, and the evaluation rules are not easy to determine accurately. In addition, the result of a deployment is not simply a loss or a win; if it is a win, it also matters how good the result is, so the problem is a non-zero-sum game. If the AlphaZero algorithm is to be applied, the convergence of the model must be ensured, so it is necessary to establish constraints that transform the non-zero-sum game into a zero-sum game. Since the neural network in deep reinforcement learning mimics human thinking [15-17], the game rules are initially designed to increase circuitous links and improve the utilisation rate of equipment on the basis of guaranteeing network connectivity.

3 Algorithm design

3.1 Algorithm framework

Learning from the principle of the AlphaZero algorithm, which combines deep reinforcement learning [17-19] with the MCTS algorithm [20], we construct a deep neural network model with parameters θ for the deployment of battlefield CNVs. With the judgment of the network structure set, the input is the state s of the current CU positions, and the output is the probability distribution p over CNV deployments and the evaluation v of the MANET, i.e. $(p, v) = f_\theta(s)$. Combining the policy-value neural network with MCTS, the network is trained by self-play reinforcement learning. First, the neural network is used to guide the search space of MCTS, and then MCTS is used to update the weights of the neural network; through continuous iteration and updating in this interaction, we can optimise the generation of high-quality sample data and obtain a well trained model. Using the trained model, we can then predict the most probable locations in specific application scenarios.

3.2 Symbolic definition

As is known, the solution complexity of Go on a 19 × 19 board is about 10^17 [7]. As the coverage of the combat area increases, choosing a finer granularity for rasterising the digital topographic map directly affects the size of the solution space, the training time, and the quality of the results. In this study, the application scenario is an open area under line-of-sight conditions, with no obstruction or strong electromagnetic interference; the combat area coverage is 9, 16, or 25 km², with the size of each grid cell set to 0.25 km² (a grid size that meets the requirements of battlefield positioning accuracy), so the corresponding boards are rasterised to sizes of 6 × 6, 8 × 8, and 10 × 10, respectively.

3.2.1 Input

Four binary feature planes of size L × H: the first plane is the position state of the battlefield CUs, the second plane is the current deployment state of the CNVs, the third plane is the location of the last CNV (only one grid cell is 1, the others are 0), and the fourth plane is the state of the positions (cells where a CNV cannot be placed are 1, cells where it can be placed are 0).
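To make the encoding concrete, the following is a minimal sketch of how the four binary planes could be built from lists of grid coordinates; the function name, the coordinate convention, and the assumption that occupied cells count as "cannot be placed" are illustrative choices, not taken from the paper:

```python
import numpy as np

def encode_state(cu_positions, cnv_positions, last_cnv=None, L=6, H=6):
    """Build the four binary feature planes of Section 3.2.1 as a (4, L, H) array.

    cu_positions  : list of (row, col) cells occupied by combat units (CUs).
    cnv_positions : list of (row, col) cells already holding a CNV.
    last_cnv      : (row, col) of the most recently placed CNV, or None.
    """
    planes = np.zeros((4, L, H), dtype=np.float32)
    for r, c in cu_positions:            # plane 1: CU positions
        planes[0, r, c] = 1.0
    for r, c in cnv_positions:           # plane 2: current CNV deployment
        planes[1, r, c] = 1.0
    if last_cnv is not None:             # plane 3: location of the last CNV
        planes[2, last_cnv[0], last_cnv[1]] = 1.0
    # Plane 4: cells where no CNV can be placed. Here we assume occupied
    # cells are unavailable; the paper does not spell out the rule.
    planes[3] = np.clip(planes[0] + planes[1], 0.0, 1.0)
    return planes

state = encode_state([(1, 1), (4, 4)], [(2, 2)], last_cnv=(2, 2))
print(state.shape)  # (4, 6, 6)
```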
3.2.2 Output

Two outputs: the policy output gives the current deployment location of the CNV, selected with maximum probability; the value output gives the judgment of the MANET (−1 for a loss, 0 for a draw, +1 for a win).

3.2.3 Neural networks

The neural networks designed for the AlphaZero algorithm have roughly 40 to 80 layers, with 17 input planes for Go, 119 for chess, and 362 for shogi; training requires substantial hardware (64 tensor processing units were used to train the neural networks) and about 24 h [7]. In our study, because the number of grid cells is much smaller than in Go, the search space is much smaller, so we design fewer input planes and a shallower network than the AlphaZero algorithm uses for Go. As shown in Fig. 1, the neural network uses only 5-6 layers, and the input consists of four planes.

Fig. 1: Model of deploying CNVs. L × H sizes are 6 × 6, 8 × 8 and 10 × 10.

3.2.4 Self-play reinforcement learning

Learning from the formulation of the AlphaZero algorithm [7], we likewise combine the neural network with MCTS. As shown in (1), the algorithm maximises the similarity between the policy vector p and the MCTS probabilities π, updates the parameters θ of the neural network by gradient descent over repeated iterations [21-23], evaluates the strategy by minimising the difference between the predicted placement value v and the placement judgment z, and enhances the quality of the MANET by self-play reinforcement learning. To keep the MCTS search close to the predictions of the neural network, the parameter c controlling the level of L2 weight regularisation is set to 1/2 in this study:

$l = (z - v)^2 - \pi^{\mathrm{T}} \log p + c \lVert \theta \rVert^2$   (1)

3.3 Evaluating indicator

To ensure the security of the CUs, each CU should be supported by a CNV deployed within a certain distance of the CU, $D_{\min} \le d_{ij} \le D_{\max}$, where $d_{ij}$ is the distance between a CNV and a CU, $D_{\min}$ is the minimum safe distance between a CNV and a CU, and $D_{\max}$ is the maximum communication distance between CNVs. Since the communication distance of VHF and SW is several kilometres, the deviation caused by measuring distances in grid cells can be neglected. The CNVs must be able to connect into a MANET, and circuitous links should be maximised on this basis; that is, the MANET design is judged successful when the nearest pairs of CNVs are able to connect. Circuitous links are realised with the rectilinear Steiner minimum tree in this study [24].
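A minimal sketch of how the judgment in Section 3.3 might be checked is given below; the parameter names (d_min, d_max_cu, d_max_cnv), the use of Euclidean distance in grid units, and the reading of "nearest pairs of CNVs can connect" are assumptions for illustration, and the circuitous-link (Steiner tree) part is omitted:

```python
import numpy as np

def grid_distance(a, b):
    """Euclidean distance between two grid cells, measured in grid units."""
    return float(np.hypot(a[0] - b[0], a[1] - b[1]))

def deployment_ok(cu_positions, cnv_positions, d_min, d_max_cu, d_max_cnv):
    """Check the judgment of Section 3.3 (names and distance metric assumed).

    Rule 1: every CU has at least one CNV with d_min <= distance <= d_max_cu.
    Rule 2: every CNV can reach its nearest neighbouring CNV within d_max_cnv,
            so the nearest pairs of CNVs in the MANET can connect.
    """
    for cu in cu_positions:
        if not any(d_min <= grid_distance(cu, v) <= d_max_cu for v in cnv_positions):
            return False
    for i, v in enumerate(cnv_positions):
        others = [grid_distance(v, w) for j, w in enumerate(cnv_positions) if j != i]
        if others and min(others) > d_max_cnv:
            return False
    return True

# Example on a 6 x 6 grid: CU-CNV range of 1 cell, CNV-CNV range of 4 cells.
print(deployment_ok([(1, 1), (4, 4)], [(1, 2), (4, 3)],
                    d_min=0.0, d_max_cu=1.0, d_max_cnv=4.0))  # True
```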
4 Test and analysis

The self-play reinforcement learning in this test runs on a personal computer (2.3 GHz CPU). To facilitate analysis and comparison, the number of batches for all three grids (6 × 6, 8 × 8 and 10 × 10) is set to 1000, and the number of MCTS playouts is set to 400 per step. Whereas the AlphaZero algorithm sets the learning rate to 0.2 for each game and drops it three times (to 0.02, 0.002, and 0.0002) [7], in this test the learning rate starts at 0.005 because the designed model is simpler; weighting values are added to the learning rate, and the conditions under which these weights change are set so that the learning rate is adjusted dynamically. According to the theory of battlefield communication support, the judgment is set as follows: in the 6 × 6 grid, the effective communication distance between a CNV and a CU is one grid cell (in this region, the distance between any two CNVs is already within 1-4 grid cells, so they connect easily); in the 8 × 8 and 10 × 10 grids, the effective communication distance between a CNV and a CU is 1-2 grid cells and the effective communication distance between CNVs is 1-4 grid cells.

The test begins by selecting positions at random and trains by continued self-play reinforcement learning. Fig. 2 shows the distribution of the number of successful network deployments in the three grids; we run 20 self-play games per batch and collect statistics on the number of successful deployments over the 1000 batches. Comparing Figs. 2a-c, the average numbers of successful deployments in the three grids are around 17/20, 12/20, and 10/20, respectively. This also shows that as the number of grid cells increases, the search space and the sparsity of solutions increase, and the number of successful self-play games decreases. We can conclude that, through self-play reinforcement learning, the model, which initially chose deployment locations at random, gradually learned how to build a network successfully.

Fig. 2: Number of successful network deployments with the designed model, counted over every 20 self-play games; the model learns how to deploy. (a) Successful deployments in the 6 × 6 grid, averaging 17 per 20 self-play games. (b) Successful deployments in the 8 × 8 grid, averaging 12 per 20 self-play games. (c) Successful deployments in the 10 × 10 grid, averaging 10 per 20 self-play games.

Fig. 3 shows how the loss function l (formula (1)), which sums the mean-squared error and the cross-entropy, changes over 1000 rounds of training with self-play reinforcement learning in the three grids; the training times are about 24, 60, and 72 h for the 6 × 6, 8 × 8, and 10 × 10 grids, respectively. Comparing Figs. 3a-c, the loss function fluctuates as training proceeds, but the general trend is a gradual decrease, stabilising around a specific value. In the 6 × 6 grid, the loss function stays at about 1.5 after 500 rounds of training; in the 8 × 8 and 10 × 10 grids, the loss functions stay at about 2.0 after 525 and 625 rounds, respectively. This indicates that the designed model is stable after this amount of self-play reinforcement learning. Moreover, the cross-entropy of the probability distribution over deployable locations gradually decreases, which shows that the neural network develops preferences during self-play reinforcement learning: the probability of selecting locations changes from an approximately uniform random distribution to an uneven one, and the increased probability of some locations means they are more likely to be selected, indicating that the model can learn how to deploy by self-play reinforcement learning.

Fig. 3: Loss function, summing the mean-squared error and cross-entropy, over 1000 rounds of training with self-play reinforcement learning in the three grids. (a) Loss function and cross-entropy over 1000 rounds of training in the 6 × 6 grid. (b) Loss function and cross-entropy over 1000 rounds of training in the 8 × 8 grid. (c) Loss function and cross-entropy over 1000 rounds of training in the 10 × 10 grid.
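For reference, the loss plotted in Fig. 3 is the quantity in (1); below is a minimal numerical sketch of how it could be computed, with illustrative names and no claim to match the authors' implementation:

```python
import numpy as np

def loss_from_eq1(z, v, pi, p, weights, c=0.5):
    """Loss of (1): (z - v)^2 - pi^T log p + c * ||theta||^2.

    z       : self-play outcome of the deployment (-1, 0 or +1).
    v       : value predicted by the network for the same state.
    pi      : MCTS visit-count distribution over candidate cells.
    p       : policy distribution predicted by the network.
    weights : iterable of network weight arrays (theta); c = 1/2 as in the text.
    """
    value_error = (z - v) ** 2                                        # mean-squared error term
    cross_entropy = -np.sum(np.asarray(pi) * np.log(np.asarray(p) + 1e-12))
    l2_penalty = c * sum(float(np.sum(w ** 2)) for w in weights)
    return value_error + cross_entropy + l2_penalty

pi = np.array([0.7, 0.2, 0.1])   # MCTS probabilities for three candidate cells
p = np.array([0.6, 0.3, 0.1])    # network policy for the same cells
print(loss_from_eq1(z=1.0, v=0.8, pi=pi, p=p, weights=[np.zeros((2, 2))]))
```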
Fig. 4 shows the results of practical application: the layouts of CNVs generated by the trained model in the three grids with 4 or 5 randomly placed CUs. We can conclude that the model achieves intelligent deployment; in particular, when two CUs are close to each other, the model can use a single CNV to serve both at the same time, which shows that the model has acquired certain 'intelligent' characteristics. It also demonstrates the successful generalisation of the AlphaZero algorithm to the intelligent deployment of CNVs. In addition, comparing Figs. 4a-c, the deployment results on the 6 × 6 and 8 × 8 grids are much better than those on the 10 × 10 grid. We think the reason may be that the solution spaces of the 6 × 6 and 8 × 8 grids are smaller than that of the 10 × 10 grid, so feasible solutions are easier to find, which promotes model training and improves the training effect.

Fig. 4: Layouts of CNVs generated by the trained model in the three grids. (a) Results of practical application in the 6 × 6 grid. (b) Results of practical application in the 8 × 8 grid. (c) Results of practical application in the 10 × 10 grid.

5 Conclusion

The success of the model in the three grids shows the feasibility of generalising the AlphaZero algorithm to the intelligent deployment of CNVs and provides a basis for exploring and applying the model in more complex future research. However, it also raises problems: as the search space grows, the sparsity of model solutions increases gradually, and the time cost and difficulty of model training also increase. In addition, future research should consider various battlefield environment factors and conditions; the evaluation criteria will become more and more complex, model convergence and self-play reinforcement learning will become more difficult, and the model and algorithm will need further optimisation and adjustment.

References

1 Santi P.: 'Topology control in wireless ad hoc and sensor networks', ACM Comput. Surv., 2005, 37, (2), pp. 164-194
2 Amaldi E., Capone A., Cesana M., et al.: 'Optimization models and methods for planning wireless mesh networks', Comput. Netw., 2008, 52, (11), pp. 2159-2171
3 Noack A., Bok P.B., Kruck S.: 'Evaluating the impact of transmission power on QoS in wireless mesh networks'. Proc. 20th Int. Conf. on Computer Communications and Networks, Maui, HI, USA, 2011, pp. 1-6
4 Kim H., Park E.C., Noh S.K., et al.: 'Angular MST based topology control for multi-hop wireless ad hoc networks', ETRI J., 2008, 30, (2), pp. 341-343
5 Sakamoto S., Kulla E.T., et al.: 'A comparison study of simulated annealing and genetic algorithm for node placement problem in wireless mesh networks', J. Mob. Multimed., 2013, 9, (2), pp. 101-110
6 Le D.N., Nguyen N.G., Dinh N.H., et al.: 'Optimizing gateway placement in wireless mesh networks based on ACO algorithm', Int. J. Comput. Commun. Eng., 2013, 2, (2), pp. 45-53
7 Silver D., Hubert T., Schrittwieser J., et al.: 'Mastering chess and shogi by self-play with a general reinforcement learning algorithm', arXiv:1712.01815v1 [cs.AI], 5 December 2017
8 Clark C., Storkey A.J.: 'Training deep convolutional neural networks to play Go'. Proc. 32nd Int. Conf. on Machine Learning, Lille, France, 2015, vol. 37, pp. 1766-1774
9 Silver D., Huang A., Maddison C.J., et al.: 'Mastering the game of Go with deep neural networks and tree search', Nature, 2016, 529, pp. 484-489
10 Silver D., Schrittwieser J., Simonyan K., et al.: 'Mastering the game of Go without human knowledge', Nature, 2017, 550, pp. 354-359
11 'Training neural networks with very little data - a draft'. Available at http://arXiv.org/pdf/1708.04347.pdf, 2017
12 David O.E., Netanyahu N.S.: 'End-to-end deep neural network for automatic learning in chess'. Int. Conf. on Artificial Neural Networks, Barcelona, Spain, 2016, pp. 88-96
13 Doerr A., Ratliff N., Bohg J., et al.: 'Direct loss minimization inverse optimal control', Mol. Ecol., 2015, 23, (10), pp. 2602-2618
14 Coulom R.: 'Efficient selectivity and backup operators in Monte-Carlo tree search'. 5th Int. Conf. on Computers and Games, Leiden, The Netherlands, 2006, pp. 72-83
15 Mnih V., Kavukcuoglu K., Silver D., et al.: 'Human-level control through deep reinforcement learning', Nature, 2015, 518, pp. 529-533
16 Silver D., Newnham L., Barker D., et al.: 'Concurrent reinforcement learning from customer interactions'. Proc. 30th Int. Conf. on Machine Learning, Atlanta, GA, USA, 2013, vol. 28, pp. 924-932
17 He K., Zhang X., Ren S., et al.: 'Deep residual learning for image recognition'. Proc. 29th IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778
18 LeCun Y., Bengio Y., Hinton G.: 'Deep learning', Nature, 2015, 521, pp. 436-444
19 'A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models'. Available at https://arxiv.org/pdf/1611.03852.pdf, accessed February 2017
20 Perez D., Rohlfshagen P., Lucas S.M.: 'Monte-Carlo tree search for the physical travelling salesman problem', Appl. Evol. Comput., Malaga, Spain, 2012, pp. 255-264
21 Mnih V., Badia A., Mirza M., et al.: 'Asynchronous methods for deep reinforcement learning'. Proc. 33rd Int. Conf. on Machine Learning, 2016, vol. 48, pp. 1928-1937
22 Ioffe S., Szegedy C.: 'Batch normalization: accelerating deep network training by reducing internal covariate shift'. Proc. 32nd Int. Conf. on Machine Learning, Lille, France, 2015, vol. 37, pp. 448-456
23 Foerster J.N., Nardelli N., Farquhar G., et al.: 'Stabilising experience replay for deep multi-agent reinforcement learning'. Proc. 34th Int. Conf. on Machine Learning, Sydney, NSW, Australia, 2017, vol. 70, pp. 1146-1155
24 Liu C.-H., Kuo S.-Y., Lee D.T., et al.: 'Obstacle-avoiding rectilinear Steiner tree construction: a Steiner-point-based algorithm', IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2012, 31, (7), pp. 1050-1060
