Latency‐aware reinforced routing for opportunistic networks
2020; Institution of Engineering and Technology; Volume: 14; Issue: 17; Language: English
10.1049/iet-com.2020.0149
ISSN: 1751-8636
Authors: Deepak Kumar Sharma, Sarthak Gupta, Shubham Malik, Rohit Kumar
Topic(s): Mobile Ad Hoc Networks
IET Communications, Volume 14, Issue 17, pp. 2981-2989. Research Article, Free Access.
First published: 12 October 2020. https://doi.org/10.1049/iet-com.2020.0149
Corresponding author: Deepak Kumar Sharma, dk.sharma1982@yahoo.com, orcid.org/0000-0001-6117-3464. All authors: Department of Information Technology, Netaji Subhas University of Technology, New Delhi, Delhi, 110078, India.

Abstract

In opportunistic networks, the path connecting two nodes is not continuous at any time instant. In such an environment, routing is an extremely taxing task owing to the ever-changing nature of the network and the random connections between nodes. Routing in such networks is done by a store-carry-forward mechanism, in which local information is used to make opportunistic routing decisions. In this study, the authors present a novel dynamic and intelligent self-learning routing protocol that is an improvement of the history-based routing protocol for opportunistic (HiBOp) networks. The proposed method presents a novel solution for the estimation of average latency between any two nodes, which is used along with reinforcement learning to dynamically learn the nodes' interactions. Simulation results on a real mobility trace (INFOCOM 2006) show that latency-aware reinforced routing for opportunistic networks applied to HiBOp outperforms the original HiBOp protocol by 14.4% in terms of delivery probability, 15% in terms of average latency and 34.7% in terms of overhead ratio.

1 Introduction

With continuous ongoing research and technological development, the cost of devices capable of communicating among themselves is decreasing. With this advent of communicating devices, proper, secure and efficient integrating systems are required.
Opportunistic networks (Oppnets) [1] were thus proposed to enable the integration of these diverse computing and storage devices. They bring with them new and complex problems of their own, besides those of the underlying ad hoc networks. A few attributes of Oppnets are [2]: (i) an end-to-end communication route connecting the sender and the receiver rarely exists; (ii) contact intervals and contact opportunities are very few owing to the high mobility of nodes; (iii) many disconnections and re-connections are possible because of unforeseen node breakdowns, energy-conservation efforts by nodes etc.; (iv) the intermediate nodes usually have large buffers and adopt a store, carry and forward mechanism for transferring messages; and (v) due to the unstable behaviour and inconsistent nature of the wireless communication channels, channel efficiency and performance fluctuate considerably. Oppnets use all the communication methods provided by wired or wireless devices, showing self-configuring capabilities. They can be used in various fields like emergency applications, home/office Oppnet applications, benevolent and malevolent applications etc. [1]. They have an ever-changing network topology owing to the ad hoc entry or exit of nodes.

Oppnets can be regarded as the most recent evolution of the mobile ad hoc network (MANET) [3] paradigm, although they differ in the challenges and issues they tackle. A MANET creates an end-to-end communication route connecting the sender and receiver ahead of the message transfer phase. Once established, this route is assumed to stay intact during the message transfer phase. Such a static network is not possible in Oppnets, where the connections change dynamically and last only for a short duration because of the continuous mobility of the nodes; MANET routing would fail to deliver messages in this dynamic scenario. Oppnets make very few assumptions about the radio range and connectivity of the nodes. Forwarding and routing in Oppnets are based on contact opportunities and the store-carry-forward paradigm [4]. The sender, while moving, comes in contact with other nodes and transfers the message to a neighbour if the neighbour's delivery probability is higher. It may also carry the message along for a longer time, thus requiring adequate buffer space to prevent the message from being dropped. Messages in Oppnets may experience unusually high delays due to unforeseen buffering at intermediate nodes for uncertain intervals of time. Hence, Oppnets with higher message delays can in some cases be classified as a type of delay-tolerant network [5, 6].

The forwarding and routing protocols in Oppnets may be classified into two different types, namely infrastructure-based and infrastructure-less protocols [7]. Infrastructure-based protocols depend on a certain type of infrastructure, which may be stationary or mobile [7], for making the forwarding and routing decisions. Access points and base stations are generally involved in forwarding and routing the messages. Notable instances of this type of protocol are Infostations [8], Data MULEs [9] etc. The infrastructure-less protocols, on the other hand, do not depend on any special nodes to help in message forwarding. All the nodes have equal priority and only contact opportunities arising while moving are used for forwarding.
Some examples belonging to this type of protocol are the Probabilistic ROuting Protocol using History of Encounters and Transitivity (PROPHET) [10], Epidemic routing [11], the history-based routing protocol for opportunistic (HiBOp) networks [12] etc.

The rest of this paper is organised as follows. Section 2 reviews related research on routing protocols for infrastructure-less Oppnets. Section 3 covers the notions and techniques used in the protocol. Section 4 explains the proposed protocol. Section 5 describes the simulation setup, the movement model and the various simulation parameters governing the proposed protocol. In Section 6, the results of the simulation process are shown and latency-aware reinforced routing for opportunistic (LR-ROP) networks is compared against PROPHET, Epidemic, HiBOp and the cognitive routing protocol for opportunistic (CRPO) networks. Finally, Section 7 concludes the work.

2 Related work

The Epidemic routing protocol [11] is the most elementary protocol, in which unconditional flooding of packets is used for message delivery. Although this usually produces a high delivery ratio, it often comes at the cost of extremely high overhead and high average latency. Hence, protocols have been developed that focus on keeping a high delivery ratio while minimising the overhead and the average latency of delivering the messages. Some routing algorithms exploit the social relationships between different nodes, using the fact that mobile nodes change locations because they are carried by humans, who naturally have social relationships. In [13], the Bubble Rap algorithm was developed, which utilises social metrics like centrality and community for routing. Some other routing protocols that use social characteristics of networks for routing have been proposed in [14–16]. Context is an important factor for routing in mobile opportunistic networks; HiBOp [12] uses the nodes' past interactions to predict the best possible next hop. History-based prediction for routing [17] uses Markov prediction to predict the locations of nodes based on previous location data. Some protocols use machine learning methods to improve routing performance. In [18], a routing algorithm has been proposed that takes next-hop decisions based on the A* algorithm. The work in [19] uses Gaussian mixture model-based clustering for message forwarding. The use of multi-layer perceptrons and probabilistic neural networks was proposed in [20] to find the mean duration of contacts between nodes. In recent years, owing to accelerated research in the machine learning domain, notable works have used Gaussian mixture models [19], cascade machine learning [21] and neural networks [22] to make routing decisions. Recent research in Oppnets [23, 24] has also explored the essential and critical domain of security issues [25, 26] and the design of trust-based mechanisms to thwart selfish nodes. A significant amount of recent work has also been done in other fields in which reinforcement learning methods have been combined with opportunistic algorithms. In the intelligent transportation system domain, reinforcement learning and supervised learning approaches have been combined for efficient vehicle-to-cloud opportunistic data transfer [27]. In [28], the authors have used reinforcement learning and game theory concepts for secure data caching at edge devices, which has applications in mobile social networks.
The authors of [29] have proposed an opportunistic routing method using the Q-learning algorithm to achieve energy efficiency in underwater wireless sensor networks.

The protocol proposed in this paper has been compared with the Epidemic routing protocol [11], PROPHET [10], CRPO [30], the K-nearest neighbour classification based routing (KNNR) protocol [31] and HiBOp [12]. These protocols are described briefly in this section.

2.1 Epidemic routing protocol

The Epidemic routing protocol achieves eventual delivery of the message while making minimal assumptions about the topology of the underlying network. It only requires regular pair-wise interactions between nodes to exchange messages for successful delivery. The protocol works as follows. It depends upon the transitive exchange of messages for the eventual delivery to the destination node. Each node keeps a buffer filled with the messages that it has either generated itself or is buffering on behalf of other hosts. Since the number of messages may be large, each node stores them in a hash table, which assigns a key to each message, for fast access. Besides the hash table, every node maintains a bit vector known as the summary vector, which has a bit set for each message the node already contains. Whenever two hosts come in contact, the host with the comparatively smaller identifier commences the anti-entropy session (the term is borrowed from the literature [32]). To avoid recurring sessions, each node keeps a cache storing recent interactions and does not interact with a contacted remote host again within a configurable time interval. During anti-entropy, the two nodes exchange their summary vectors to find the messages they do not contain and request the missing messages from each other. The receiving node has complete authority to deny a particular message if it does not match the criteria set by it.

2.2 PROPHET

It is common to use a random mobility model for the hosts, but in reality users' movement patterns are not random. Based on this notion, PROPHET [10] was proposed. It defines a probabilistic metric P(b, c) for every node b and known destination c. This parameter indicates how likely node b is to deliver a message to that particular destination node. The delivery predictability calculation involves three steps. First, when two nodes meet, the metric is updated so that frequently meeting nodes have a higher delivery predictability. This is shown in (1), where P_{init} is an initialisation constant:

P(b, c) = P(b, c)_{old} + (1 - P(b, c)_{old}) \times P_{init}  (1)

Second, when two nodes do not meet each other for a while, their delivery predictability must be aged. Equation (2) shows the update, where k is the number of time intervals elapsed since the metric was last aged and \gamma is the aging constant, which depends on the application type and the network:

P(b, c) = P(b, c)_{old} \times \gamma^{k}  (2)

Third, the metric also follows the transitive property, i.e. if node B usually meets node C and node C usually meets node D, then node D is a good choice for carrying a message intended for node B. Equation (3) describes this relationship, where \beta is a scaling constant that controls the importance of transitivity in the total probability metric value:

P(d, b) = P(d, b)_{old} + (1 - P(d, b)_{old}) \times P(d, c) \times P(c, b) \times \beta  (3)

Message forwarding decisions are usually taken by selecting a threshold value: if the delivery predictability is higher than the threshold, the message is forwarded.
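To make the three updates concrete, the following is a minimal Python sketch of how a node's delivery-predictability table could be maintained. The class name, the dictionary-based storage and the default constants (P_init = 0.75, gamma = 0.98, beta = 0.25, commonly cited defaults from the PROPHET literature) are illustrative assumptions, not details taken from this paper.

```python
# Hypothetical sketch of PROPHET delivery-predictability bookkeeping (not the
# authors' implementation). Equation numbers refer to (1)-(3) in the text.
class ProphetTable:
    def __init__(self, p_init=0.75, gamma=0.98, beta=0.25):
        self.p = {}            # delivery predictability P(self, destination)
        self.p_init = p_init   # initialisation constant in (1)
        self.gamma = gamma     # aging constant in (2)
        self.beta = beta       # transitivity scaling constant in (3)

    def on_encounter(self, b):
        """Direct contact with node b -- equation (1)."""
        old = self.p.get(b, 0.0)
        self.p[b] = old + (1.0 - old) * self.p_init

    def age(self, k):
        """k time intervals elapsed since the last aging -- equation (2)."""
        for dest, old in self.p.items():
            self.p[dest] = old * (self.gamma ** k)

    def transitive_update(self, b, p_b):
        """Merge encountered node b's table p_b (dest -> predictability) -- equation (3)."""
        for dest, p_bc in p_b.items():
            old = self.p.get(dest, 0.0)
            self.p[dest] = old + (1.0 - old) * self.p.get(b, 0.0) * p_bc * self.beta
```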
2.3 Cognitive routing protocol for Oppnet

CRPO [30] uses neural networks because of their ability to acquire knowledge from past experiences and formulate decisions based on them and on input from the environment. CRPO uses a feed-forward neural network [33], which involves three layers, namely the input layer, the hidden layer and the output layer. The CRPO protocol has four nodes in the input layer for taking input, two nodes in the hidden layer for generating a more complex interpretation of the inputs and one node in the output layer for evaluation. The carrier node supplies the four parameters, which are used to generate the output on which the forwarding decision is based. Thus, the protocol consists of two stages: a training stage and an evaluation stage.

2.4 K-nearest neighbour classification based routing protocol

The KNNR protocol for Oppnets [31] models the routing process as a binary classification problem, i.e. whether to forward a message or not. The K-nearest neighbour algorithm has been used as the classification algorithm. KNNR utilises six parameters to calculate the distance metric, namely available buffer space, hop count, time-out ratio, interaction probability, the neighbour node's distance from the destination and the speed of the neighbour. The data set for training the algorithm has been generated by using the interaction probability parameter.

2.5 History-based routing protocol for Oppnets

HiBOp [12] uses the current context (CC) to deliver the messages. Context consists of two components, namely (i) the CC of the user and (ii) the legacy of the context evolution over time.

2.5.1 Current context

The CC has the following components:
(i) Identity table (IT): information about the node itself (e.g. home address, work address etc.).
(ii) CC table: information regarding the current neighbours of the node, obtained by exchanging ITs during pair-wise contacts.
Thus, HiBOp uses the CC to decide whether the node should be chosen to act as a forwarder at a particular time instant. However, taking decisions based only on instantaneous information would be a limiting factor, since it does not represent the user's past experiences.

2.5.2 Legacy of context evolution

Context evolution over time has the following components:
(i) History table (HT): stores the attributes observed in the past in the ITs of interacted nodes.
(ii) Repository table (RT): an intermediate data structure used to update the HT.
(iii) Signalling interval: the update period of the RT.
(iv) Flushing interval: after every flushing interval, the HT is updated by merging the RT into it.

2.5.3 Routing in HiBOp

The sender includes (any subset of) the destination's IT in the message. Delivery probabilities are assessed on the basis of the match between this data and the content stored at the encountered node. HiBOp also controls message replication: only the sender (creator) of the message is authorised to make more than one copy of the original message.
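To illustrate how the context structures described above might be represented, here is a small, hypothetical Python sketch. The field names, the count-based history and the merge policy are illustrative assumptions and are not taken from the HiBOp paper.

```python
# Hypothetical sketch of HiBOp-style context bookkeeping (illustrative only).
from dataclasses import dataclass, field

@dataclass
class IdentityTable:
    """IT: attributes describing the node itself."""
    attributes: dict = field(default_factory=dict)   # e.g. {"home": "...", "work": "..."}

@dataclass
class NodeContext:
    it: IdentityTable
    cc: dict = field(default_factory=dict)   # CC table: neighbour id -> neighbour IT
    rt: dict = field(default_factory=dict)   # repository table: attribute -> count this interval
    ht: dict = field(default_factory=dict)   # history table: attribute -> long-term count

    def on_contact(self, neighbour_id, neighbour_it):
        """Exchange ITs on a pair-wise contact (updates CC and RT)."""
        self.cc[neighbour_id] = neighbour_it
        for attr, value in neighbour_it.attributes.items():
            self.rt[(attr, value)] = self.rt.get((attr, value), 0) + 1

    def flush(self):
        """At every flushing interval, merge the RT into the HT and reset the RT."""
        for key, count in self.rt.items():
            self.ht[key] = self.ht.get(key, 0) + count
        self.rt.clear()
```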
3 Background

This section provides the mathematical and algorithmic background for the actual research work. It gives a quick synopsis of theoretical notions such as Markov decision processes (MDPs), reinforcement learning and Q-learning.

3.1 Markov chains

In the theory of probability and statistics, the Markov property [34] is the memory-less property of a stochastic process. It is named after its pioneer, the Russian mathematician Andrei Andreyevich Markov. A stochastic process [34], also known as a random process, is a mathematical object commonly defined as a collection of random variables. A stochastic process exhibits the Markov property if the conditional probability of the upcoming states of the process (conditional on both the past and the present states) depends only on the present state, and not on the chain of events that occurred before it. A process with this property is known as a Markov process.

When a system is controlled over a period of time, a strategy is required to ascertain what action to take, depending on what is already known about the system, i.e. its state a. The state can encode knowledge about the history of actions but is still referred to as the state at the time of the decision. If decisions are made at constant intervals of time, and if at some decision stage the state is a, action c is taken and the state b at the next decision stage occurs with a probability p(b | a, c) that depends only on a and c, the system is said to represent an MDP [35].

A Markov chain is defined as a sequence of random variables X_1, X_2, X_3, ... exhibiting the Markov property, such that the probability of moving to the following state is determined by the present state alone. It can be depicted mathematically as

P(X_{n+1} = x | X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_{n+1} = x | X_n = x_n)  (4)

If both conditional probabilities are well defined, i.e. if P(X_1 = x_1, ..., X_n = x_n) > 0, then the feasible values of X_i form a countable set A known as the state-space of the chain. Markov chains are frequently described through a series of directed graphs, in which the edges of the graph represent the probabilities of moving from one state at time n to the other states at time n + 1, that is, P(X_{n+1} = x | X_n = x_n).
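As a small illustration of (4), the sketch below samples a trajectory from a toy two-state Markov chain; the state names and transition probabilities are made up for the example and carry no meaning for the protocol.

```python
# Toy Markov chain: the next state depends only on the current state (equation (4)).
import random

# Hypothetical transition probabilities between two illustrative states.
TRANSITIONS = {
    "contact":    {"contact": 0.3, "no_contact": 0.7},
    "no_contact": {"contact": 0.1, "no_contact": 0.9},
}

def next_state(current):
    """Sample X_{n+1} given only X_n = current."""
    states, probs = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=probs, k=1)[0]

def sample_chain(start, steps):
    chain = [start]
    for _ in range(steps):
        chain.append(next_state(chain[-1]))
    return chain

print(sample_chain("no_contact", 10))
```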
3.2 Reinforcement learning

Reinforcement learning [36], which emerged as a consequence of ongoing work in the fields of statistical mathematics, neuroscience, computer science and psychology, is a domain of machine learning concerned with how a software agent should act in its environment so as to maximise the cumulative reward. Reinforcement is about progressing in the correct direction so as to maximise the reward under a specific circumstance. The difference between supervised learning and reinforcement learning is that in the former the training data contains the answer key, that is, the appropriate input-output pairs. Hence, in supervised learning the model is trained with the right answer, while in reinforcement learning there is no correct answer and the reinforcement agent decides what to do in order to perform the given job. Learning is done with the help of a critique, the reinforcement signal, which is fed back into the training process. In a typical reinforcement learning paradigm [36], an agent is connected to its environment by means of perception and action, as depicted in Fig. 1. At every step of interaction, the agent receives an input which is a combination of the current state S and the reinforcement signal R; the agent then selects an action a to produce as output. The action alters the state of the environment, and the value of this state transition is communicated to the agent through the reinforcement signal [36]. The agent should select actions that result in an increase in the long-term sum of the values of the reinforcement signal.

Fig. 1: Typical reinforcement learning paradigm

A reinforcement learning agent interacts with its environment in discrete time steps. At every time step t, the agent receives an observation o_t, which commonly incorporates the reward r_t. An action a_t is then selected by the agent from the set of available actions. The environment then moves to a new state s_{t+1}, and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined. The objective of a reinforcement learning agent is to collect as much reward as possible. The agent can select (possibly randomly) any action as a function of the past.

3.3 Markov decision processes and reinforcement learning

There must exist a way to represent the states when solving reinforcement learning problems. A Markov state is a bundle of information that stores both the data about the present state of the environment and all the useful information from the past. Whenever decisions are made on the basis of a Markov state, past events do not impact the decision-making process; any changes resulting from previous events are already contained within the Markov state. However, this describes only a single state of the environment. A Markov process, on the other hand, contains the collection of all attainable states and the associated transition probabilities for each of them. Building on this, a Markov reward process contains the same data and additionally has a reward associated with each state. The reward is a simple number used to judge whether it is advisable to take an action from a certain state or to stay in it instead. This forms the basis of the distinction between return and reward: rewards are instantaneous for every state, whereas the return associated with a particular state is the sum of all the rewards collected when continuing from that state. Often, future rewards are discounted, to account for possible variations and irregularities and to prevent infinite returns. The return of a state is specified by the value function, which combines the discounted rewards of the future states with the current state's immediate reward. In the case of numerous possible next states, future rewards are weighted by their respective probabilities. Lastly, an MDP [37] combines all these pieces: it includes everything a Markov reward process does, together with the set of actions available from each state, hence allowing a policy that maps states to actions to be defined. A policy is a strategy that specifies which action is taken from each state. The aim is to find the largest action-value function over all policies; once found, the agent can always perform the action with the greatest possible value, leading it to theoretically the maximum reward.

3.4 Q-learning

Q-learning [38], a type of reinforcement learning, is used in the field of machine learning to learn a policy that guides an agent on what action to take in given surroundings. The notion is to learn a Q-function [39], which is a prediction of the return associated with every action a. Q-learning requires no model of the environment and can handle problems with stochastic transitions and rewards. The technique can be used to identify an optimal action-selection policy for a given finite MDP, provided unlimited exploration time and a partly-random policy. Q here stands for the function that returns the reward used to provide the reinforcement.

3.4.1 Q-table

The Q-table is a simple look-up table with which we can determine, for each state, the maximum expected future reward obtainable by performing each action; in other words, the table tells us, for each state, which action is the most favourable. In the Q-table, columns represent actions and rows represent states. The Q-table needs to be updated at each iteration, which makes Q-learning an iterative process.
3.4.2 Q-function

The Q-function uses the Bellman equation and essentially takes two inputs: a state s and an action a. The iterative updating of Q-values is shown in (5). At a given point of time t, there is a Q-value corresponding to every state-action pair; here s_t denotes the state at time t and a_t denotes the action taken at that state:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)]  (5)

For a better understanding of (5), it can be rewritten as follows:

Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha [r_t + \gamma \max_{a} Q(s_{t+1}, a)]  (6)

Here, \alpha is called the learning rate, and it determines how much relative weight is given to the current Q-value and to the reward while performing the update. If action a_t is taken at state s_t, then we move to the next state s_{t+1}. Apart from the immediate reward r_t associated with this transition, we also consider the maximum expected future reward at the new state s_{t+1}. This is done by considering all actions a that can be taken at state s_{t+1} and taking the maximum Q-value among them. Since the immediate reward should be given more weight than the expected future reward, the future reward is multiplied by a discount factor \gamma. Using the above function, the values of the Q-function can be obtained for all cells in the table. Initially, all the values in the Q-table are zeros. The process of updating the values is iterative in nature, that is, the Q-function gives increasingly accurate estimates by repeatedly revising the Q-values in the table.

3.4.3 Q-learning algorithm

Q-learning is an off-policy approach to temporal-difference learning. It can be proved that, given sufficient training under any epsilon-soft policy, the algorithm converges with probability 1 to a close approximation of the action-value function for an arbitrary target policy. Q-learning learns the optimal policy even when actions are chosen according to a more exploratory or even a random policy. The algorithm is defined precisely in the following steps:
(i) Initialise the values in the Q-table, Q(s, a); the initial value of every entry in the table is set to 0.
(ii) Observe the present state s.
(iii) Select an action a for that state using one of the action-selection policies.
(iv) After performing the action, record the reward r along with the new state s'.
(v) Update the entry in the Q-table for the state using the observed reward and the largest possible reward for the next state, in accordance with the formula and guidelines discussed above.
(vi) Set this new state as the current state.
Repeat the above steps until the terminal state is reached.
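Steps (i)-(vi) together with update rule (5)/(6) can be summarised in a short, generic Python sketch. The epsilon-greedy action selection, the environment interface (reset, actions, step) and all numeric constants are illustrative assumptions rather than details of the LR-ROP protocol.

```python
# Generic tabular Q-learning loop following steps (i)-(vi); illustrative only.
import random
from collections import defaultdict

def q_learning(env, episodes=100, alpha=0.1, gamma=0.9, epsilon=0.1):
    # (i) initialise Q(s, a) to 0 for every state-action pair
    q = defaultdict(lambda: defaultdict(float))

    for _ in range(episodes):
        s = env.reset()                        # (ii) observe the present state
        done = False
        while not done:
            # (iii) epsilon-greedy action selection
            if random.random() < epsilon or not q[s]:
                a = random.choice(env.actions(s))
            else:
                a = max(q[s], key=q[s].get)

            s_next, r, done = env.step(a)      # (iv) record reward and new state

            # (v) update per equation (5): current estimate plus TD error
            best_next = max(q[s_next].values(), default=0.0)
            q[s][a] += alpha * (r + gamma * best_next - q[s][a])

            s = s_next                         # (vi) set the new state as current
    return q
```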
4 Proposed work

This section presents the proposed work and is based on the theoretical foundations laid out in Section 3. Section 4.1 argues why reinforcement learning is a suitable solution to the routing problem in Oppnets. To reduce the delay in message propagation from the source to the destination, the average latency is estimated based on the interactions of nodes. The details of the algorithm to estimate the average latency are given in Section 4.2. Lower values of average latency contribute to a bigger reward in the reinforcement learning process, thus driving the average latency of the entire network down. This also leads to better selection of the next hop, and hence would theoretically lead to a better delivery probability as well. We have run simulations to confirm this hypothesis, and the results are presented in Section 6.

4.1 Motivation

Oppnets satisfy the Markov property, i.e. the future behaviour does not depend on the past but only on the present. The probability distribution of the upcoming state of a message relies exclusively on the present state, and not on any of the states before it. Here, the state of a message is the node presently storing the message together with its neighbouring nodes. Hence, routing in Oppnets can naturally be modelled as an MDP. We have used Q-learning, a reinforcement learning algorithm, to solve this MDP. Supervised learning algorithms, like those used in CRPO and KNNR, suffer from a severe limitation: the absence of an optimal data set for message forwarding. In both of these algorithms, the authors generated the data set according to some metrics, a process that does not guarantee optimal routes from source to destination. Hence, this work proposes the use of reinforcement learning to overcome such problems. LR-ROP, as the name implies, has two main components. First, the average latency is estimated as described in Section 4.2. Then, the Q-learning algorithm is used to update the Q-tables. The working of the LR-ROP protocol is described in Section 4.3.

4.2 Average latency estimation

The average latency of messages travelling from the source node to the destination node is an extremely important metric for the evaluation of opportunistic routing protocols. We propose a method to estimate the average latency between two nodes; the calculation is presented below. All nodes maintain two maps for the calculation of average latency, one for keeping track of the number of encounters and the other for storing the latency values. Consider a node A. Whenever node A comes in direct contact with another node, say B, the number of encounters is increased by 1, i.e. Enc_A[B] = Enc_A[B] + 1, where Enc_A is the encounter map of A and Enc_A[B] represents the number of encounters of A with B.
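Since the source text breaks off at this point, the following Python sketch only illustrates the bookkeeping described so far: per-node encounter and latency maps updated on direct contact. The running-average update for latency is a hypothetical placeholder, not the authors' exact estimation formula, and names such as Enc_A and Lat_A are introduced only for illustration.

```python
# Hypothetical per-node bookkeeping for average-latency estimation.
# The running-average update is an illustrative assumption, not the paper's formula.
from collections import defaultdict

class LatencyEstimator:
    def __init__(self):
        self.encounters = defaultdict(int)     # Enc_A[B]: number of direct contacts with B
        self.avg_latency = defaultdict(float)  # Lat_A[B]: estimated average latency to B

    def on_contact(self, other, observed_latency):
        """Called when this node meets node `other` and observes a latency sample."""
        self.encounters[other] += 1
        n = self.encounters[other]
        # Incremental running average over observed samples (placeholder rule).
        self.avg_latency[other] += (observed_latency - self.avg_latency[other]) / n
```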