ROBOTIC NAVIGATION WITH SIMULTANEOUS LOCAL PATH PLANNING AND LEARNING
20240319735 · 2024-09-26
Assignee
Inventors
- Arup Kumar Sadhu (Kolkata, IN)
- Lokesh Kumar (Kolkata, IN)
- Ranjan Dasgupta (Kolkata, IN)
- Mohit Ludhiyani (Kolkata, IN)
- Titas Bera (Kolkata, IN)
CPC classification
G05D1/242
PHYSICS
G05D1/644
PHYSICS
G05D1/246
PHYSICS
International classification
Abstract
In conventional robot navigation techniques, learning and planning algorithms act independently without guiding each other. A method and system for robotic navigation with simultaneous local path planning and learning is disclosed. The method discloses an approach in which learning and planning assist each other simultaneously and improve overall system performance. The planner acts as an actuator and helps to balance exploration and exploitation in the learning algorithm. The synergy between the dynamic window approach (DWA) as the planning algorithm and a disclosed Next best Q-learning (NBQ) as the learning algorithm offers an efficient local planning algorithm. Unlike traditional Q-learning, the dimension of the Q-tree in the NBQ is dynamic and does not need to be defined a priori.
Claims
1. A method for robot navigation, the method further comprising: performing by a robotic agent executed by one or more hardware processors, a global path planning to obtain a plurality of way points to reach a goal position based on a current position, the goal position and two-dimensional floor plan of an environment the robotic agent is deployed into, wherein the current position of the robotic agent represents a current way point; and sequentially navigating by the robotic agent, through each of the plurality of way points to reach the goal position by simultaneously applying a) a Dynamic Window Approach (DWA) for a local path planning, and b) a Next best Q-learning (NBQ) approach that enables real-time learning while balancing between an exploitation approach and an exploration approach, wherein sequentially navigating through each of the plurality of way points to reach the goal position comprises iteratively performing a plurality of steps until the plurality of way points are covered, the plurality of steps further comprising: a) computing an optimal velocity vector for a local goal evaluated for the current way point at a current state among a plurality of states visited by the robotic agent; b) employing by the robotic agent, one of an exploration approach and an exploitation approach for the local path planning based on the optimal velocity vector, wherein i) the exploration approach is followed if the optimal velocity vector is empty, wherein value of a scalar parameter, required to tune a number of linear velocity samples and a number of angular velocity samples, is set to zero during the exploration approach, and ii) the exploitation approach is followed if the optimal velocity vector is not-empty, wherein value of the tuning scalar parameter is set to be greater than zero and less than one during the exploitation approach; c) computing the number of linear velocity samples and the number of angular velocity samples at each of the plurality of states 
based on value set for the scalar parameter; d) obtaining the optimal velocity vector and a score value for each velocity sample offered by the DWA based on the current state, the local goal, the number of linear velocity samples, and the number of angular velocity samples; e) evaluating a reward using a predefined reward function and updating a Q-value of a Q-tree; and f) recomputing the optimal velocity vector at the current state with the updated Q-tree and executing the optimal velocity vector to update the current waypoint by current position of the robotic agent.
2. The method of claim 1, wherein each of the plurality of states is a tuple further comprising a sector within sensing range of a sensor attached to the robotic agent, a current linear velocity of the robotic agent, and a current angular velocity of the robotic agent.
3. The method of claim 1, wherein the Q-value adapts based on one or more uncertainties in the environment, and wherein the Q-value is summation of immediate reward and discounted next best Q-value.
4. The method of claim 1, wherein dimension of the Q-tree in the NBQ is dynamically obtained without the need to define it a priori.
5. A robotic agent for robot navigation, the system further comprising: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: perform a global path planning to obtain a plurality of way points to reach a goal position based on a current position, the goal position and two-dimensional floor plan of an environment the robotic agent is deployed into, wherein the current position of the robotic agent represents a current way point; and sequentially navigate through each of the plurality of way points to reach the goal position by simultaneously applying a) a Dynamic Window Approach (DWA) for a local path planning, and b) a Next best Q-learning (NBQ) approach that enables real-time learning while balancing between an exploitation approach and an exploration approach, wherein sequentially navigating through each of the plurality of way points to reach the goal position comprises iteratively performing a plurality of steps until the plurality of way points are covered, the plurality of steps further comprising: a) computing an optimal velocity vector for a local goal evaluated for the current way point at a current state among a plurality of states visited by the robotic agent; b) employing, by the robotic agent, one of an exploration approach and an exploitation approach for the local path planning based on the optimal velocity vector, wherein i) the exploration approach is followed if the optimal velocity vector is empty, wherein value of a scalar parameter, required to tune a number of linear velocity samples and a number of angular velocity samples, is set to zero during the exploration approach, and ii) the exploitation approach is followed if the optimal velocity vector is not-empty, wherein value of the tuning scalar parameter is set to be greater than zero and less than 
one during the exploitation approach; c) computing the number of linear velocity samples and the number of angular velocity samples at each of the plurality of states based on value set for the scalar parameter; d) obtaining the optimal velocity vector and a score value for each velocity sample offered by the DWA based on the current state, the local goal, the number of linear velocity samples, and the number of angular velocity samples; e) evaluating a reward using a predefined reward function and updating a Q-value of a Q-tree; and f) recomputing the optimal velocity vector at the current state with the updated Q-tree and executing the optimal velocity vector to update the current waypoint by current position of the robotic agent.
6. The robotic agent of claim 5, wherein state is a tuple further comprising a sector within sensing range of a sensor attached to the robotic agent, a current linear velocity of the robotic agent, and a current angular velocity of the robotic agent.
7. The robotic agent of claim 5, wherein the Q-value adapts based on one or more uncertainties in the environment, and wherein the Q-value is summation of immediate reward and discounted next best Q-value.
8. The robotic agent of claim 5, wherein dimension of the Q-tree in the NBQ is dynamically obtained without the need to define it a priori.
9. One or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause: performing by a robotic agent executed by one or more hardware processors, a global path planning to obtain a plurality of way points to reach a goal position based on a current position, the goal position and two-dimensional floor plan of an environment the robotic agent is deployed into, wherein the current position of the robotic agent represents a current way point; and sequentially navigating by the robotic agent, through each of the plurality of way points to reach the goal position by simultaneously applying a) a Dynamic Window Approach (DWA) for a local path planning, and b) a Next best Q-learning (NBQ) approach that enables real-time learning while balancing between an exploitation approach and an exploration approach, wherein sequentially navigating through each of the plurality of way points to reach the goal position comprises iteratively performing a plurality of steps until the plurality of way points are covered, the plurality of steps further comprising: g) computing an optimal velocity vector for a local goal evaluated for the current way point at a current state among a plurality of states visited by the robotic agent; h) employing by the robotic agent, one of an exploration approach and an exploitation approach for the local path planning based on the optimal velocity vector, wherein iii) the exploration approach is followed if the optimal velocity vector is empty, wherein value of a scalar parameter, required to tune a number of linear velocity samples and a number of angular velocity samples, is set to zero during the exploration approach, and iv) the exploitation approach is followed if the optimal velocity vector is not-empty, wherein value of the tuning scalar parameter is set to be greater than zero and less than one during the exploitation approach; i) computing the 
number of linear velocity samples and the number of angular velocity samples at each of the plurality of states based on value set for the scalar parameter; j) obtaining the optimal velocity vector and a score value for each velocity sample offered by the DWA based on the current state, the local goal, the number of linear velocity samples, and the number of angular velocity samples; k) evaluating a reward using a predefined reward function and updating a Q-value of a Q-tree; and l) recomputing the optimal velocity vector at the current state with the updated Q-tree and executing the optimal velocity vector to update the current waypoint by current position of the robotic agent.
10. The one or more non-transitory machine-readable information storage mediums of claim 9, wherein each of the plurality of states is a tuple further comprising a sector within sensing range of a sensor attached to the robotic agent, a current linear velocity of the robotic agent, and a current angular velocity of the robotic agent.
11. The one or more non-transitory machine-readable information storage mediums of claim 9, wherein the Q-value adapts based on one or more uncertainties in the environment, and wherein the Q-value is summation of immediate reward and discounted next best Q-value.
12. The one or more non-transitory machine-readable information storage mediums of claim 9, wherein dimension of the Q-tree in the NBQ is dynamically obtained without the need to define it a priori.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[0016] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems and devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[0017] Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[0018] Path planning for a mobile robot is the process of finding a sequence of valid collision-free configurations to transport the mobile robot from one position to another. A global path planning algorithm offers a path between start and goal on a given world map, computed offline. To deal with environmental uncertainties (e.g., the addition or removal of a stationary object/obstacle in the world), a global planner requires frequent replanning. Hence, global planners are computationally expensive. A local planner, on the other hand, works on the local environment, which is created locally within the sensing range, and does not include any global information. Hence, the robot may get stuck in local minima, as shown in the state of the art. However, the computational cost of a local path planning algorithm is lower than that of a global path planning algorithm. Naturally, the local path planning algorithms in the art are capable of dealing with frequent changes in the surroundings. One of the initial local path planning approaches was introduced as the curvature velocity method. The basic idea of the curvature velocity method is to maximize an objective function by choosing one suitable velocity sample (satisfying the necessary constraints) from a velocity space. Based on the curvature velocity method, the concept of the dynamic window approach (DWA) was derived in another existing work. The dynamic window is defined on the basis of the kinematic model and the current velocity of the robot. A score is computed for each velocity sample selected from the dynamic window as a function of the robot's goal heading, velocity, and distance from the nearest obstacle. The velocity sample with the maximum score value is selected for execution. The DWA has been improved by another existing method for better navigation capabilities in partially unknown environments among obstacles. Some works further proposed a Global DWA in order to avoid trapping in local minima.
Besides these several improvements of the DWA, the following technical limitations of the DWA are noted in the literature. First, the evaluation functions are not sufficient to identify a potential velocity sample in the dynamic window; hence, a potential velocity sample may be ignored. Second, the score function is a weighted sum of the evaluation functions, and the performance of the DWA highly depends on the choice of weight values. These hindrances have been circumvented by employing reinforcement learning and deep reinforcement learning in recent works in the art. However, these recent approaches require offline learning or training with a priori training data. Additionally, the dimension of the Q-table is defined a priori in the works in the literature. To circumvent these bottlenecks (i.e., offline learning, a priori training data for learning, and a predefined Q-table dimension) of learning algorithms, a simultaneous learning and planning algorithm (SLPA) has been proposed in the art. However, the SLPA works for a fixed start and goal pair and needs to be reinitialized if the start and/or goal is altered. In local planning, on the other hand, the start and local goal keep changing, so local planning by employing the SLPA is not feasible.
[0019] Embodiments herein disclose a method and system for robotic navigation with simultaneous local path planning and learning by a robotic agent. The method discloses an approach that enables the robotic agent, also referred to as a mobile robot or robot interchangeably, to learn and plan simultaneously, based on an SLPA in sensing range (SLPA-SR) approach, wherein learning and planning techniques are synergistically combined to assist each other and improve the overall navigational performance of the robot. The planner acts as an actuator and helps to balance exploration and exploitation in the learning algorithm. The synergy between the dynamic window approach (DWA) as the planning technique and the disclosed Next best Q-learning (NBQ) as the learning technique offers an efficient local planning approach. Further, unlike traditional Q-learning, the dimension of the Q-tree in the NBQ is dynamic and does not need to be defined a priori.
[0020] Referring now to the drawings, and more particularly to
[0022] Referring to the components of system 100, in an embodiment, the processor(s) 104, can be one or more hardware processors 104. In an embodiment, the one or more hardware processors 104 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 104 are configured to fetch and execute computer-readable instructions stored in the memory 102. In an embodiment, the system 100 can be implemented in a variety of computing systems including laptop computers, notebooks, hand-held devices such as mobile phones, workstations, mainframe computers, servers, and the like.
[0023] The I/O interface(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular and the like. In an embodiment, the I/O interface (s) 106 can include one or more ports for connecting to a number of external devices or to another server or devices.
[0024] The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0025] In an embodiment, the memory 102 includes a plurality of modules 110. The plurality of modules 110 include programs or coded instructions that supplement applications or functions performed by the system 100 for executing the different steps involved in the process of robotic navigation with simultaneous local path planning and learning being performed by the system 100. The plurality of modules 110, amongst other things, can include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The plurality of modules 110 may also be implemented as signal processor(s), node machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 110 can be implemented in hardware, by computer-readable instructions executed by the one or more hardware processors 104, or by a combination thereof. The plurality of modules 110 can include various sub-modules (not shown) such as modules executing the DWA and the NBQ as depicted in
[0028] In an embodiment, the system 100 comprises one or more data storage devices or the memory 102 operatively coupled to the processor(s) 104 and is configured to store instructions for execution of steps of the method 200 by the processor(s) or one or more hardware processors 104. The steps of the method 200 of the present disclosure will now be explained with reference to the components or blocks of the system 100 as depicted in
[0029] Referring to the steps of the method 200, at step 202 of the method 200, the robotic agent 100, executed by the one or more hardware processors 104, performs a global path planning to obtain a plurality of way points to reach a goal position based on a current position, the goal position and a two-dimensional (2D) floor plan of an environment the robotic agent is deployed into, wherein the current position of the robotic agent represents a current way point.
[0030] At step 204 of the method 200, the robotic agent 100, executed by the one or more hardware processors 104, sequentially navigates through each of the plurality of way points to reach the goal position by simultaneously applying a) a Dynamic Window Approach (DWA) for a local path planning, and b) a Next best Q-learning (NBQ) approach that enables real-time learning while balancing between an exploitation approach and an exploration approach. Sequentially navigating through each of the plurality of way points to reach the goal position comprises iteratively performing a plurality of steps (204a through 204f, as listed below) until the plurality of way points are covered. [0031] a) Computing an optimal velocity vector for a local goal evaluated for the current way point at a current state among a plurality of states visited by the robotic agent (204a). Each of the plurality of states is a tuple comprising a sector within the sensing range of the sensor (for example, a LIDAR) attached to the robotic agent, a current linear velocity of the robotic agent, and a current angular velocity of the robotic agent. [0032] b) Employing, by the robotic agent, one of an exploration approach and an exploitation approach for the local path planning based on the optimal velocity vector (204b). Further, [0033] i) the exploration approach is followed if the optimal velocity vector is empty. The value of a scalar parameter, required to tune a number of linear velocity samples and a number of angular velocity samples, is set to zero during the exploration approach. [0034] ii) The exploitation approach is followed if the optimal velocity vector is not empty. The value of the scalar parameter is set to be greater than zero and less than one during the exploitation approach. [0035] c) Computing the number of linear velocity samples and the number of angular velocity samples at each of the plurality of states based on the value set for the scalar parameter (204c).
[0036] d) Obtaining the optimal velocity vector and a score value for each velocity sample offered by the DWA based on the current state, the local goal, the number of linear velocity samples, and the number of angular velocity samples (204d). [0037] e) Evaluating a reward using a predefined reward function and updating a Q-value of a Q-tree (204e). The Q-value adapts based on one or more uncertainties in the environment and is the summation of the immediate reward and the discounted next best Q-value, with a learning rate alpha balancing the old and new Q-values. The dimension of the Q-tree in the NBQ is dynamically obtained and does not need to be defined a priori. [0038] f) Recomputing the optimal velocity vector at the current state with the updated Q-tree and executing the optimal velocity vector to update the current waypoint by the current position of the robotic agent (204f).
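The control flow of steps 204a through 204f can be illustrated with a deliberately simplified, self-contained sketch. The example below collapses the planner to a one-dimensional toy (no learning, no obstacles) purely to show the per-waypoint iterate-until-covered structure; all function and variable names are illustrative and not part of the disclosure.

```python
def navigate_1d(start, waypoints, v_max=1.0, dt=0.5, eps=1e-6):
    """Toy rendering of the waypoint loop in steps 204a-204f on a line.

    The "planner" picks, from a small velocity window, the sample that best
    reduces the distance to the current waypoint; the learning component
    (steps 204b and 204e) is omitted for brevity.
    """
    p = start
    trace = [p]
    for wp in waypoints:                      # sequential waypoint navigation
        while abs(p - wp) > eps:              # iterate until the waypoint is covered
            window = [-v_max, -0.5 * v_max, 0.0, 0.5 * v_max, v_max]
            v = min(window, key=lambda u: abs(p + u * dt - wp))  # best "score"
            if v == 0.0:                      # closer than one discrete step:
                v = max(-v_max, min(v_max, (wp - p) / dt))       # reach exactly
            p += v * dt                       # execute the selected velocity
            trace.append(p)
    return trace
```

For instance, `navigate_1d(0.0, [1.0, 2.0])` steps through 0.5, 1.0, 1.5 and 2.0, covering both waypoints in order.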
[0039] Thus, the method 200 employs a synergistic combination of the DWA for planning and the disclosed NBQ for learning. In NBQ, a state is the tuple consisting of the sector within the sensing range of the sensor attached to the robotic agent (robot) and the current velocity vector of the robotic agent. An action is a velocity sample chosen from the dynamic window. Unlike the traditional DWA, the numbers of linear and angular velocity samples are computed by the disclosed action selection strategy. For each velocity sample one score value is computed, ignoring the robot's distance from the nearest obstacle, which is instead captured by the rewarding mechanism in the NBQ. The computed Q-values in the NBQ can adapt dynamically based on environmental uncertainties. Over the iterations, the requirement of the DWA is reduced and the robot becomes more dependent on the learned NBQ-values for optimal velocity sample selection at the current state.
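As a concrete illustration of the state tuple, the sector index can be obtained by bucketing the bearing of the local goal into equal partitions of the horizontal sensing range and pairing it with the current velocities. The partition count, range, and names below are illustrative assumptions, not values fixed by the disclosure.

```python
import math

def sector_index(bearing_rad, theta_h=math.pi, n_sectors=8):
    """Map a bearing in [-theta_h/2, theta_h/2] to a sector q_k in 1..n_sectors."""
    b = min(max(bearing_rad, -theta_h / 2), theta_h / 2)  # clamp to sensing range
    width = theta_h / n_sectors                           # equal angular slices
    k = int((b + theta_h / 2) / width) + 1
    return min(k, n_sectors)  # the upper boundary folds into the last sector

def make_state(bearing_rad, v_a, w_a):
    """State tuple s = (q_k, v_a, w_a): sector plus current velocity vector."""
    return (sector_index(bearing_rad), v_a, w_a)
```

Because the state is just a tuple, it can index directly into a dictionary-backed Q-tree, which is what lets the Q-tree grow only as states are actually visited.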
[0040] The method 200 is now explained with reference to
[0041] PRELIMINARIES: Considering the dynamic window approach (DWA) as the planning algorithm and the disclosed NBQ-learning as the learning algorithm, this section briefly explains the DWA to improve legibility. The DWA generates a linear velocity (v) and an angular velocity (ω) to control a robot for a finite sampling time, say Δt. The selection of the optimal velocity vector (v*, ω*) from a set of velocity vectors is twofold. The first fold is about generating the set of feasible velocity vectors, V_r. The second fold is about the selection of (v*, ω*) from V_r. [0042] 1) Generating feasible velocity space: The feasible velocity space is denoted by V_r = V_s ∩ V_d ∩ V_ad, where V_s, V_d and V_ad are the velocity spaces satisfying condition 1, condition 2 and condition 3, respectively. [0043] Condition 1: After pruning the complete velocity space by v_min ≤ v ≤ v_max and ω_min ≤ ω ≤ ω_max, the remaining velocity vectors are referred to as V_s. Here, max and min in the suffix of the nomenclature denote the corresponding maximum and minimum limits, respectively. [0044] Condition 2: The set of velocity vectors V_d reachable from the robot's current velocity vector (v_a, ω_a) within the time horizon Δt is defined by v_a − a_max·Δt ≤ v ≤ v_a + a_max·Δt and ω_a − α_max·Δt ≤ ω ≤ ω_a + α_max·Δt, and is also named the dynamic window. Here, a_max and α_max are the maximum possible linear and angular accelerations of the robot, respectively. [0045] Condition 3: The admissible velocity space V_ad is the collection of collision-free velocity vectors satisfying v ≤ √(2·dist(v,ω)·v̇_b) and ω ≤ √(2·dist(v,ω)·ω̇_b). Here, dist(v,ω) represents the distance between the nearest obstacle and the robot's predicted path corresponding to (v,ω), and v̇_b and ω̇_b are the braking accelerations.
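The three conditions can be sketched as interval intersections. The sketch below treats dist(v, ω) as a single precomputed clearance scalar for brevity (in the DWA proper it depends on each sample's predicted path), and all keys in the `limits` dictionary are illustrative names.

```python
import math

def dynamic_window(v_a, w_a, limits, dt, dist_to_obstacle):
    """Sketch of V_r = V_s ∩ V_d ∩ V_ad as interval bounds (v_lo, v_hi, w_lo, w_hi).

    `limits` holds v_min/v_max/w_min/w_max (condition 1), a_max/al_max
    (linear/angular acceleration, condition 2) and vb/wb (braking
    accelerations, condition 3) -- assumed key names.
    """
    # Condition 1: hardware velocity limits (V_s)
    v_lo, v_hi = limits["v_min"], limits["v_max"]
    w_lo, w_hi = limits["w_min"], limits["w_max"]
    # Condition 2: velocities reachable within dt from (v_a, w_a) (V_d)
    v_lo = max(v_lo, v_a - limits["a_max"] * dt)
    v_hi = min(v_hi, v_a + limits["a_max"] * dt)
    w_lo = max(w_lo, w_a - limits["al_max"] * dt)
    w_hi = min(w_hi, w_a + limits["al_max"] * dt)
    # Condition 3: admissible, i.e. stoppable before the nearest obstacle (V_ad)
    v_hi = min(v_hi, math.sqrt(2 * dist_to_obstacle * limits["vb"]))
    w_hi = min(w_hi, math.sqrt(2 * dist_to_obstacle * limits["wb"]))
    return v_lo, v_hi, w_lo, w_hi
```

Velocity samples for scoring are then drawn from this pruned window rather than from the full velocity space.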
[0046] 2) Selecting the optimal velocity vector from V_r: Let V_r consist of N_vr linear and N_ωr angular velocity samples. The total velocity samples N_r can be represented as an N_vr × N_ωr window. For each velocity sample (v,ω), a score function J(v,ω) is evaluated by equation (1) below.
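Equation (1) is not reproduced in this text; in the classic DWA the score is a weighted sum of a goal-heading term, an obstacle-clearance term, and a velocity term. The sketch below uses that well-known form with illustrative weights and assumes each term is already normalized to [0, 1].

```python
def dwa_score(heading, dist, vel, alpha=0.8, beta=0.1, gamma=0.1):
    """Weighted-sum score in the spirit of the classic DWA objective.

    heading: alignment with the local goal, dist: clearance to the nearest
    obstacle, vel: forward velocity. Weights alpha/beta/gamma are illustrative.
    """
    return alpha * heading + beta * dist + gamma * vel

def best_velocity(candidates):
    """candidates: ((v, w), heading, dist, vel) tuples drawn from the
    N_vr x N_wr window; returns the sample with the maximum weighted score."""
    return max(candidates, key=lambda c: dwa_score(*c[1:]))[0]
```

The sensitivity of this selection to the weight values is exactly the limitation the disclosed NBQ rewarding mechanism is meant to address.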
[0047] PROBLEM FORMULATION: Online planning or offline-learning-based planning is the key to any successful navigation from a given position to another. Online planning suffers from repeated planning for minor positional modifications. Learning circumvents this repeated planning by learning the action (e.g., the velocity vector) for minor positional modifications. Unfortunately, existing learning algorithms either work offline or require a priori training data. The disclosed method 200 synergistically combines the planning (here, DWA) and learning (here, the disclosed NBQ) algorithms.
[0048] The method 200: as depicted in
[0049] Local Goal Computation: Consider a sensor with a sensing range of r_sensor. A local goal, denoted by l_g, is computed for two situations. In situation 1, the next waypoint, say P_1, is within r_sensor; in situation 2, P_1 is outside r_sensor. For situation 1, P_1 itself is the local goal, as expressed by equation (2) below. For situation 2, a vector OP_1 is formed between the robot's current position, say O, and P_1. The unit vector corresponding to OP_1 is OP_1/||OP_1||. To obtain l_g on the periphery of the sensor, r_sensor is multiplied by this unit vector, as expressed in equation (2). The computed l_g is fed into the DWA for local planning. The DWA selects, among multiple velocity vectors, the one corresponding to the maximum score for execution.
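The two situations reduce to a short computation, sketched below with illustrative names: return the waypoint itself when it lies inside the sensing radius, otherwise project it onto the sensor periphery along the unit vector toward it.

```python
import math

def local_goal(robot_pos, waypoint, r_sensor):
    """Local goal l_g per the two situations of equation (2).

    Situation 1: the waypoint is within r_sensor, so it is the local goal.
    Situation 2: scale the unit vector OP1/||OP1|| by r_sensor to place l_g
    on the sensor periphery.
    """
    ox, oy = robot_pos
    dx, dy = waypoint[0] - ox, waypoint[1] - oy
    d = math.hypot(dx, dy)
    if d <= r_sensor:                       # situation 1
        return waypoint
    return (ox + r_sensor * dx / d,         # situation 2
            oy + r_sensor * dy / d)
```

For example, with a 2 m sensing radius, a waypoint 4 m straight ahead yields a local goal 2 m ahead on the sensor boundary.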
[0050] Next Best Q-learning (NBQ): In the learning algorithm NBQ, the Q-value at a state due to an action is the summation of the immediate reward and the Q-value corresponding to the next best action at that state. This process continues recursively to update Q-values at each state for various actions. The immediate reward in NBQ is adapted based on the pruned score function, i.e., the score computed without the term that accounts for static objects/obstacles in the world (the dist(v,ω) term), as shown in equation (3) below. The motivation for pruning the score function is to let NBQ deal with environmental amendments and act in real time. The computed Q-values are recorded for future reference.
[0051] As in Q-learning, in the disclosed NBQ the selection of state and action and the design of the reward function are very important. Hence, these parameters are discussed in the subsequent sections. [0052] 1) State: A state is denoted by s = <q_k, v_a, ω_a>, where q_k is the sector within the sensing range and (v_a, ω_a) is the robot's current velocity vector. Each (v_a, ω_a) falls within one among multiple discrete ranges defined over V_r (the feasible velocity space). To define q_k, the entire sensing range is equally partitioned into a number of regions, where each region is one sector. Say the horizontal sensing range θ_h is equally partitioned into n sectors denoted by q_k, k ∈ [1, . . . , n], each lying between two bounding angles. [0053] 2) Action: An action a = <v,ω> is a velocity sample chosen from the dynamic window, so the score J(s = <q_k, v_a, ω_a>, a = <v,ω>) is equivalent to J(a = <v,ω>) at s. Let J(s, a*) be the score at state s because of the optimal action a*. [0054] 3) Reward: ℝ+ denotes the set of all positive real numbers. The reward r(s,a), as expressed by equation (7), is employed to update the Q-value in the disclosed NBQ. [0055] 4) Next Best Q-Learning (NBQ)-value: Let the Q-value at s because of an action a be denoted by Q(s,a). In the disclosed NBQ, Q(s,a) adapts according to the variation of r(s,a). The value of r(s,a) varies with the variation of J(s,a), which is equivalent to the score offered by equation (3). Motivated by the traditional Q-learning rule, the disclosed NBQ update rule is given below in equation (8). [0056] The next best Q-value at (s,a) is obtained as follows: 1) Obtain the score J(s,a) from the DWA at (s,a). [0057] 2) Find the next best score, say J(s,a′), such that J(s,a) > J(s,a′). [0058] 3) Get the action corresponding to J(s,a′), i.e., a′. [0059] 4) Obtain the Q-value at (s,a′) from the Q-tree, i.e., Q(s,a′). [0060] 5) If Q(s,a′) > 0, then NBQ(s,a) = Q(s,a′). The NBQ-value is not updated if Q(s,a′) ≤ 0, because this would lead the robot to an obstruction. Thus, the method 200 using the SLPA-SR approach reduces the frequency of invoking the DWA as learning progresses, which is indeed beneficial for real-time planning. [0061] For planning at s, the optimal action a* is computed by equation (9) below. However, if no feasible action is found, then the DWA is invoked for exploration.
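Steps 1 through 5 above can be sketched with a dictionary-backed Q-tree, which also shows why no dimension needs fixing a priori: entries appear only as state-action pairs are visited. The exact form of equation (8) is not reproduced in the source, so the update below, blending the old value with the immediate reward plus the discounted next best Q-value, is an assumed Q-learning-style rendering; all names are illustrative.

```python
def next_best_q(qtree, scores, s, a):
    """Steps 1-5: among actions whose DWA score at s is strictly below the
    score of `a`, take the best-scoring one (the "next best" action) and
    return its recorded Q-value when positive, else None."""
    worse = {b: j for b, j in scores[s].items() if j < scores[s][a]}
    if not worse:
        return None
    a_nb = max(worse, key=worse.get)          # action with the next best score
    q_nb = qtree.get(s, {}).get(a_nb, 0.0)    # dynamic Q-tree: dict of dicts
    return q_nb if q_nb > 0 else None

def update_q(qtree, scores, s, a, reward, alpha=0.5, gamma=0.9):
    """Assumed rendering of equation (8): blend the old Q-value with the
    immediate reward plus the discounted next best Q-value (zero when the
    next best Q-value is unavailable or non-positive)."""
    nbq = next_best_q(qtree, scores, s, a) or 0.0
    old = qtree.setdefault(s, {}).get(a, 0.0)
    qtree[s][a] = (1 - alpha) * old + alpha * (reward + gamma * nbq)
    return qtree[s][a]
```

Because `qtree` is an ordinary nested dictionary, visiting a new sector or velocity range simply inserts a new key, mirroring the dynamically growing Q-tree of the disclosure.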
[0062] where c ∈ [0,1) is the scalar parameter required to tune n_vr(s) and n_ωr(s). The value of c is responsible for maintaining a balance between exploration and exploitation: in the absence of any feasible action, the value of c is set to 0 for exploration, and 0 < c < 1 for exploitation. It is apparent from the above discussion that in the initial phase of learning the robot explores more and exploits less. As learning progresses, the robot learns with less exploration and higher exploitation. This offers a balanced exploration and exploitation for the disclosed NBQ. [0063] 6) Convergence of NBQ: A few theorems with proofs are explained below. [0064] i. Theorem 1 (deals with the convergence of the disclosed NBQ): The disclosed NBQ converges at each state-action pair as time t → ∞. [0065] Proof: Let Q_t(s,a) be the actual Q-value at (s,a) after t iterations; after infinite iterations, Q_t(s,a) attains the true Q-value, i.e., Q(s,a). The error in the Q-value at (s,a) after t iterations is given by ε_t(s,a). Assumption 1 is made to establish convergence of the disclosed NBQ. [0066] Assumption 1 of Theorem 1: The actual and true values of r(s,a) at an iteration for any environmental condition are identical.
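Equations (10) and (11) are not reproduced in the source; the stand-in below only illustrates the stated behaviour of c, namely that c = 0 yields the full sampling window for exploration while 0 < c < 1 shrinks the numbers of linear and angular samples for exploitation. The shrink rule and the default window sizes are assumptions for illustration.

```python
import math

def sample_counts(c, N_vr=10, N_wr=20):
    """Illustrative stand-in for equations (10)-(11): fewer velocity samples
    as the tuning scalar c grows, so the DWA is invoked more cheaply once
    the robot can exploit its learned Q-tree."""
    assert 0 <= c < 1, "c must lie in [0, 1) per the disclosure"
    n_vr = max(1, math.ceil((1 - c) * N_vr))
    n_wr = max(1, math.ceil((1 - c) * N_wr))
    return n_vr, n_wr
```

Under this assumed rule, c = 0 returns the full 10 x 20 window, while c = 0.5 halves both sample counts.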
[0068] The method 200: As depicted in the flow of the SLPA-SR, J(s,a), r(s,a) and Q(s,a) are updated, and finally a* is recomputed for execution at s. The updated position p is compared with the waypoint offered by A*, and this whole process is repeated. Pseudocode 1 for the method 200 is provided below.
TABLE-US-00001
Pseudocode 1, implementing the SLPA-SR:
Input: 2D floor plan, f.sub.v, f.sub.ω, v.sub.init, ω.sub.init, constant c ∈ [0,1), ε ← tolerance, s.sub.g ← goal position;
Output: Optimal velocity vector, a*;
Initialize: p ← current position; {pt.sub.1, ..., pt.sub.n} ← A*(p, s.sub.g, ·) [from equation (4)];
for waypoint ∈ {pt.sub.1, ..., pt.sub.n} do
  while ||p − waypoint|| ≥ ε do
    Evaluate l.sub.g for the waypoint by equation (2);
    Determine sector q.sub.k based on l.sub.g;
    Form current state, s = &lt;q.sub.k, v.sub.a, ω.sub.a&gt;; /* (v.sub.a, ω.sub.a) is the current velocity vector */
    Compute a* by equation (9) at s; /* a* = (v, ω) is the velocity vector satisfying conditions 1, 2 and 3 */
    if a* == ∅ then
      Set c ← 0; // DWA for exploration
    end
    if a* ≠ ∅ then
      Set 0 &lt; c &lt; 1; // DWA for exploitation
    end
    Compute n.sub.vr(s) by equation (10) and n.sub.ωr(s) by equation (11);
    [a, (s, ·)] = DWA(s, l.sub.g, n.sub.vr(s), n.sub.ωr(s), ·) [equation (15)];
    Evaluate r(s,a) by equation (7);
    Update Q(s,a) by equation (8);
    Recompute a* by equation (9) at s and execute;
    Update p ← current position;
  end
end
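The control flow of Pseudocode 1 can be sketched in runnable form as below. This is a skeleton only: `a_star` and `dwa` are toy stand-ins (straight-line waypoints and a unit step toward the local goal), and the reward of equation (7) and Q-update of equation (8) are replaced by a trivial bookkeeping update, since those equations are not reproduced in this excerpt.

```python
import math

# Hedged skeleton of Pseudocode 1 (SLPA-SR): only the waypoint/while-loop
# control flow mirrors the pseudocode; planners and learning are stubbed.

def a_star(p, goal, steps=4):
    # Stub global planner: evenly spaced straight-line waypoints to the goal.
    return [(p[0] + (goal[0] - p[0]) * k / steps,
             p[1] + (goal[1] - p[1]) * k / steps) for k in range(1, steps + 1)]

def dwa(p, local_goal, c):
    # Stub local planner: step at most 0.5 m toward the local goal.
    # (c would tune the velocity sample counts in the real algorithm.)
    dx, dy = local_goal[0] - p[0], local_goal[1] - p[1]
    d = math.hypot(dx, dy) or 1.0
    step = min(0.5, d)
    return (p[0] + dx / d * step, p[1] + dy / d * step)

def slpa_sr(start, goal, tol=0.05, max_iters=1000):
    p, q_tree = start, {}
    for waypoint in a_star(start, goal):
        it = 0
        while math.hypot(p[0] - waypoint[0], p[1] - waypoint[1]) > tol:
            s = (round(p[0], 1), round(p[1], 1))   # coarse state, an assumption
            a_opt = q_tree.get(s)                  # stand-in for a* via eq. (9)
            c = 0.0 if a_opt is None else 0.5      # explore vs exploit
            p = dwa(p, waypoint, c)                # act via the local planner
            q_tree[s] = waypoint                   # toy stand-in for eq. (8)
            it += 1
            if it > max_iters:
                raise RuntimeError("did not reach waypoint")
    return p, q_tree
```

Running `slpa_sr((0.0, 0.0), (2.0, 0.0))` walks the robot through four waypoints to the goal while populating the toy Q-tree along the way.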
[0069] ANALYSIS: The SLPA-SR implemented by the method 200 is analyzed in terms of computational cost by considering DWA, as proposed in D. Fox, W. Burgard, and S. Thrun, "The dynamic window approach to collision avoidance," IEEE Robotics &amp; Automation Magazine, vol. 4, no. 1, pp. 23-33, 1997, as the contender algorithm. The computational cost for one optimal velocity vector generation at state (s) by the SLPA-SR comprises the computation cost for the planner (DWA) and the computation cost for the learner (NBQ). The cited DWA involves n.sub.r(s) score computations (pruned score offered by equation (3)) and compares (n.sub.r(s)−1) scores to evaluate the best score. Here, n.sub.r(s) is the number of velocity samples chosen from the dynamic window at state (s). In the disclosed NBQ-learning, the robot computes NBQ-values with a maximum of (N.sub.r(s)−1) score comparisons and finally selects the optimal velocity vector by performing at most (N.sub.r(s)−1) Q-value comparisons. Here, N.sub.r(s) is the maximum value of n.sub.r(s). Let t.sub.l(s) be the computational cost for one-time optimal velocity vector generation by the SLPA-SR at s; based on the above discussion, t.sub.l(s) is expressed below.
where t.sub.s is the one-time score computation cost by DWA following equation (3) at state (s). On the other hand, the computation cost for generating the optimal velocity vector by DWA with N.sub.r(s) velocity vectors at s is given by:
where t′.sub.s&gt;t.sub.s is the one-time score computation cost at state (s) by DWA following equation (1). It is apparent from equation (10) and equation (11) that at s, at t=0 with x(s)=0, n.sub.vr(s)=(f.sub.v+v.sub.init)=N.sub.vr(s) and n.sub.ωr(s)=(f.sub.ω+ω.sub.init)=N.sub.ωr(s). Hence, n.sub.r(s)=N.sub.vr(s)·N.sub.ωr(s)=N.sub.r(s). Now, by equation (14),
Again referring to equations (10) and (11), as t→∞, with the increase in x(s) the value of n.sub.r(s) converges to f.sub.v·f.sub.ω&lt;&lt;N.sub.r. Again, by equation (14),
Hence, it can be concluded that t.sub.l(s)<t.sub.p(s) at s.
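The cost comparison can be illustrated numerically. The cost model below is an assumed reading of the discussion above (not a reproduction of equation (14)): SLPA-SR pays n.sub.r pruned score computations plus at most 2(N.sub.r−1) comparisons, while plain DWA pays N.sub.r full score computations plus (N.sub.r−1) comparisons; all timing constants are illustrative.

```python
# Hedged cost model for one optimal-velocity-vector generation at a state.
# t_score_pruned < t_score per the text (equation (3) vs equation (1));
# t_cmp is the cost of one comparison. All constants are assumptions.

def slpa_sr_cost(n_r, N_r, t_score_pruned, t_cmp):
    """Assumed SLPA-SR cost: n_r pruned score computations plus at most
    (N_r - 1) score and (N_r - 1) Q-value comparisons."""
    return n_r * t_score_pruned + 2 * (N_r - 1) * t_cmp

def dwa_cost(N_r, t_score, t_cmp):
    """Assumed plain-DWA cost: N_r full score computations plus
    (N_r - 1) score comparisons."""
    return N_r * t_score + (N_r - 1) * t_cmp
```

With, say, n.sub.r converged to f.sub.v·f.sub.ω=9 samples against N.sub.r=400, and score computation dominating comparison cost, `slpa_sr_cost` comes out well below `dwa_cost`, consistent with the conclusion t.sub.l(s)&lt;t.sub.p(s).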
[0070] SIMULATION AND EXPERIMENT: Simulation and experimental results are demonstrated in three steps. The first step simulates the SLPA-SR and the contender (DWA) algorithm in a zigzag environment (world 1) of dimension 16 m×13 m with a lane width of 3.5 m using TurtleBot3 Waffle Pi™. The simulation is performed using TurtleBot3: Robot simulation made easy (last accessed 16 Feb. 2023). [Online]. Available: https://www.turtlebot.com/about/. The second step simulates the disclosed method 200 and DWA in a typical warehouse environment (world 2) of dimension 30 m×20 m using TurtleBot3 Waffle Pi™. Finally, one real experiment is conducted using TurtleBot3 Waffle Pi™ in a zigzag environment (world 3) of dimension 3.6 m×2.4 m with a lane width of 1.2 m by exploiting the learned Q-tree from world 2. The Python implementation of DWA is taken from A. Sakai, D. Ingram, J. Dinius, K. Chawla, A. Raffin, and A. Paques, "PythonRobotics: a Python code collection of robotics algorithms," arXiv preprint arXiv:1808.10703, 2018. The performance metrics include average run-time, linear velocity, angular velocity, % of exploitation, average reward, and state-action pair count. The performance metrics linear velocity, angular velocity, and average run-time are employed to confirm the superiority of the disclosed SLPA-SR over the contender algorithm (DWA). The remaining performance metrics are employed to establish the efficacy of the SLPA-SR.
Setup:
[0071] 1) Simulation: Simulation of Pseudocode 1 is conducted on a workstation with an Intel® Core™ i7-9700K CPU @ 3.60 GHz × 8 processor and 16 GB of RAM, with Ubuntu 20.04.4 LTS™ as the operating system. Pseudocode 1 is implemented using Python 3.8 and simulated on Robot Operating System (ROS) Noetic™ with Gazebo™ for TurtleBot3 Waffle Pi™. The TurtleBot3 Waffle Pi™ is equipped with a 360° 2D LiDAR to perceive real-time uncertainties from the world. Over the iterations, uncertain obstacles are introduced randomly in (a) world 1 (zigzag) and (b) world 2 (typical warehouse).
Procedure:
[0073] 1) Simulation: Descriptions of the performance metrics are given below. The average run-time for Pseudocode 1 {algo∈[DWA, SLPA-SR]} is computed by equation (18) below. To compute the average run-time, start and goal locations in worlds 1 and 2 are separately fixed and the simulations are repeated.
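A minimal sketch of the metric, assuming equation (18) (not reproduced in this excerpt) is a plain mean of per-epoch run-times for a fixed start-goal pair:

```python
# Hedged sketch: average run-time over repeated epochs for one start-goal
# pair and one algorithm (DWA or SLPA-SR). Assumes eq. (18) is a plain mean.

def average_run_time(run_times):
    """Return the mean of per-epoch run-times in seconds."""
    if not run_times:
        raise ValueError("need at least one epoch")
    return sum(run_times) / len(run_times)
```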
TABLE-US-00002
TABLE I: Average run-time analysis

World | Start     | Goal      | Randomly placed obstacles | Epochs | Algo                 | Average run-time (sec)
1     | (0, 0)    | (3, 6)    | 0                         | 60     | DWA                  | 32.9907
      |           |           |                           |        | method 200 (SLPA-SR) | 31.5132
2     | (0, 0)    | (10, 6)   | 1                         | 10     | DWA                  | 50.7337
      |           |           |                           |        | method 200           | 48.5605
2     | (0, 0)    | (10, −6)  | 1                         | 10     | DWA                  | 54.4768
      |           |           |                           |        | method 200           | 50.7555
2     | (0, 0)    | (−10, 6)  | 2                         | 10     | DWA                  | 53.8211
      |           |           |                           |        | method 200           | 52.3395
2     | (−10, 6)  | (−10, −4) | 3                         | 10     | DWA                  | 82.3827
      |           |           |                           |        | method 200           | 80.1633
2     | (−10, −4) | (10, −6)  | 3                         | 10     | DWA                  | 79.6911
      |           |           |                           |        | method 200           | 77.2029
2     | random    | random    | 7                         | 10     | DWA                  | 55.1072
      |           |           |                           |        | method 200           | 54.1522
Experimental Results for TurtleBot3 Waffle Pi™.
[0076] 1) Simulation: Table I above lists the average run-time for simulations conducted with the TurtleBot3 Waffle Pi in worlds 1 and 2. It is apparent from Table I that the disclosed method 200 (SLPA-SR) outperforms the contender algorithm (i.e., DWA) in terms of average run-time for both worlds 1 and 2. In world 1, the method 200 is tested with one start-goal pair without any obstacle for 60 epochs. In world 2, the method 200 is tested with six sets of start-goal pairs. For each start-goal pair, after 5 epochs a static object/obstacle is placed randomly on the global path. The remaining five epochs are tested with the randomly placed static object/obstacle, as shown in Table I. Table I also supports equations (16) and (17).
[0078] Unlike existing approaches, which have the technical limitation of being unable to perform planning and learning simultaneously in real-time, the method and system disclosed herein provide the dynamic nature of the NBQ disclosed herein, i.e., the number of state-action pairs in the Q-tree is dynamic. Further, they provide balanced exploration-exploitation in the NBQ with the ability to deal with environmental uncertainties.
[0079] The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[0080] It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[0081] The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[0082] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words comprising, having, containing, and including, and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms a, an, and the include plural references unless the context clearly dictates otherwise.
[0083] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term computer-readable medium should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[0084] It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.