SYSTEM AND METHOD FOR OPTIMIZING PATH EXPLORATION PARAMETERS BASED ON DEEP REINFORCEMENT LEARNING

20260093264 · 2026-04-02

Abstract

The present invention relates to the technical field of path planning, and provides a deep reinforcement learning-based path exploration parameter optimization system. The system comprises: a variable parameter path planning module, configured to perform node exploration based on a deep reinforcement learning network, conduct collision detection on child nodes in a child node set, calculate cost values for all child nodes, and finally generate a loading and parking path using a Reeds-Shepp curve; an environmental state space modeling module, configured to perform regional division of obstacles around a current node and conduct environmental state space modeling; and a deep learning parameter optimization module, configured to construct a deep learning network to compute an optimal step size and an optimal steering angle, build a reward function to optimize the deep learning network, and simultaneously execute a training process of the deep learning network.

Claims

1. A deep reinforcement learning-based path exploration parameter optimization method, comprising a non-transitory computer readable medium operable on a computer with memory for the deep reinforcement learning-based path exploration parameter optimization method, and comprising program instructions for executing the following steps of: S1: generating an optimal step length and an optimal steering angle based on a deep reinforcement learning network according to a current node and environmental information, and constructing a fixed steering angle set; performing node exploration by combining the optimal step length with the fixed steering angle set to generate a child node set, and combining the optimal step length with the optimal steering angle to generate a steering angle-optimized child node which is added to the child node set, and S2: performing collision detection on the child nodes in the child node set and calculating cost values of all the child nodes, and S3: obtaining the child node with the lowest cost value in each iteration round of the search process as the final selected next node of the current node, and S4: when the distance from the current node to the target point is less than a set threshold, generating a loading and parking path using a Reeds-Shepp curve, and generating a planned path through node backtracking; and S5: operating a mine with reduced costs and improved performance efficiency based on results of the deep reinforcement learning-based path exploration parameter optimization method.

2. A deep reinforcement learning-based path exploration parameter optimization system based on the deep reinforcement learning-based path exploration parameter optimization method of claim 1, characterized by comprising: a variable parameter path planning module, configured to generate an optimal step length and an optimal steering angle based on a deep reinforcement learning network according to a current node and environmental information, construct a fixed steering angle set, perform node exploration by combining the optimal step length with the fixed steering angle set to generate a child node set and by combining the optimal step length with the optimal steering angle to generate a steering angle-optimized child node which is added to the child node set, perform collision detection on child nodes in the child node set and calculate cost values of all child nodes, and finally generate a loading and parking path using a Reeds-Shepp curve; and assuming the current node is N_c(x_c, y_c, θ_c), where x_c, y_c are the coordinates and θ_c is the heading angle, the following is established based on the motion characteristics of the unmanned mining truck:

$$\begin{cases} x_s = x_c + d \cdot l \cos\theta_c \\ y_s = y_c + d \cdot l \sin\theta_c \\ \theta_s = \theta_c + \dfrac{l \tan\varphi}{L_w} \end{cases} \tag{1}$$

where N_s(x_s, y_s, θ_s) is the next child node explored from the current node, x_s, y_s are the position coordinates, θ_s is the heading angle, d ∈ {−1, 1} represents the expansion direction of the current node, i.e., backward or forward, φ and l represent the steering angle and step length of node expansion respectively, and L_w is the wheelbase of the unmanned mining truck; the optimal step length L_best and the optimal steering angle φ_best are generated by the deep reinforcement learning network for the current node and the environmental information; the fixed steering angle set Φ_1 = {φ_1, . . . , φ_N3} is constructed by uniform sampling, and for a fixed steering angle φ_i (where i = 1, 2, . . . , N_3), the calculation method is as follows:

$$\varphi_i = -\varphi_{\max} + (i-1) \cdot \frac{2\varphi_{\max}}{N_3 - 1} \tag{2}$$

where φ_max is the maximum steering angle that the mining truck can execute, and N_3 is the number of steering angles constructed; node exploration is performed through a two-step process comprising step size optimization and steering angle optimization: in the first step, the optimal step length L_best and all sampled fixed steering angles in the fixed steering angle set Φ_1 are substituted into Formula (1), thereby generating the child node set N for fixed steering angle exploration, and the number of child nodes in the child node set N equals the number of sampled angles in the fixed steering angle set Φ_1, which is N_3; in the second step, the optimal step length L_best and the optimal steering angle φ_best are substituted into Formula (1) to generate the steering angle-optimized child node N_best, which is then added to the child node set N; an environmental state space modeling module, configured to perform regional division of obstacles surrounding the current node and conduct environmental state space modeling; and a deep learning parameter optimization module, configured to construct a deep learning network to calculate the optimal step length and the optimal steering angle, build a reward function to optimize the deep learning network, and simultaneously execute a training process of the deep learning network.

3. The deep reinforcement learning-based path exploration parameter optimization system according to claim 2, characterized in that, in the variable parameter path planning module, performing collision detection on the child nodes in the child node set and calculating cost values of all the child nodes specifically comprises: performing collision detection on all child nodes in the child node set N by covering the mining truck with two enveloping circles, sampling along the path from the current node to the explored child nodes, and determining whether the distance to any obstacle grid is smaller than the radius of the enveloping circles; if so, the child node is considered infeasible and is removed from the child node set N; and the cost value of all explored child nodes is calculated using f(N_s) = g(N_s) + w_h·h(N_s), where g(N_s) represents the actual cost consumed during the movement of the mining truck from the starting point to the explored child node, h(N_s) represents the estimated cost from the explored child node to the target point, and w_h is the weight of the estimated cost; and wherein the actual consumption cost g(N_s) is:

$$g(N_s) = g(N_c) + w_1 g_{dis}(N_s) + w_2 g_{back}(N_s) + w_3 g_{switch}(N_s) + w_4 g_{steer}(N_s) + w_5 g_{change}(N_s) \tag{3}$$

in the above formula, g(N_s) incorporates five metrics based on the cost of the current node g(N_c): g_dis(N_s) denotes the distance from the current node N_c to the child node N_s in the iterative search; g_back(N_s) represents the reversing cost; g_switch(N_s) indicates the mode switch cost; g_steer(N_s) denotes the steering cost; g_change(N_s) represents the steering change cost; and w_i, where i = 1, . . . , 5, are the weight coefficients.

4. The deep reinforcement learning-based path exploration parameter optimization system according to claim 2, characterized in that, in the variable parameter path planning module, generating the loading and parking path using a Reeds-Shepp curve specifically comprises: when the distance between the current node N_c(x_c, y_c, θ_c) and the target point N_g is less than a threshold L_t, a plurality of candidate loading and parking path curves from the current node N_c(x_c, y_c, θ_c) to the target point N_g are generated using the Reeds-Shepp curve, and the node costs along the curves are calculated by Formula (3), and the curves are sorted based on their costs, and the path with the minimum cost is selected, and the global path is obtained through backtracking, and if all candidate loading and parking path curves are in collision, the process proceeds to the node exploration step.

5. The deep reinforcement learning-based path exploration parameter optimization system according to claim 2, characterized in that, in the environmental state space modeling module, performing regional division of obstacles surrounding the current node specifically comprises: the space surrounding the current node N_c(x_c, y_c, θ_c) is divided into 8 sectors D = {D_1, . . . , D_8} by angular divisions, and within each sector, let d_obs,i, where i = 1, 2, . . . , 8, represent the minimum distance between obstacles and the mining truck in the i-th sector.

6. The deep reinforcement learning-based path exploration parameter optimization system according to claim 5, characterized in that, in the environmental state space modeling module, conducting environmental state space modeling specifically comprises: the state space S is defined as follows:

$$S = (d_{start}, \theta_{start}, d_{goal}, \theta_{goal}, \psi_{goal}, S_{position}, N_{obs}, d_{obs,i}) \tag{4}$$

wherein S_position represents the coordinates of the current node, d_start denotes the distance from the starting point relative to the current node, θ_start indicates the relative angular orientation of the starting point in a coordinate system with the current node as the origin and the heading direction as the x-axis, d_goal denotes the distance from the target point relative to the current node, θ_goal indicates the relative angular orientation of the target point in a coordinate system with the current node as the origin and the heading direction as the x-axis, ψ_goal represents the direction of the target loading position in a coordinate system with the current node as the origin and the heading direction as the x-axis, N_obs denotes the number of obstacles within a given range of the current node, and d_obs,i, where i = 1, 2, . . . , 8, represents the minimum distance between obstacles and the mining truck in the i-th sector.

7. The deep reinforcement learning-based path exploration parameter optimization system according to claim 2, characterized in that, in the deep learning parameter optimization module, constructing the deep learning network to calculate the optimal step length and the optimal steering angle specifically comprises: the DQN algorithm is employed to train the deep learning network, with the action space consisting of combinations of candidate optimal steering angles φ_rl and candidate optimal step lengths l_rl during expansion, i.e., the action space comprises all possible combinations of (φ_rl, l_rl), and the DQN algorithm utilizes two networks with identical structures but different parameters for training: a training network Q_ω used to compute the Q-value for policy selection and iteratively update the Q-values; and a target network Q_ω′ used to compute the Q-value of the next state in the temporal difference target (TD Target), and the loss function Loss of the DQN algorithm is designed as follows:

$$Loss = \frac{1}{N} \sum_i \left( r_i + \gamma \max_{a'} Q_{\omega'}(s'_i, a') - Q_\omega(s_i, a_i) \right)^2 \tag{5}$$

wherein (s_i, a_i, r_i, s′_i) represents a set of state transition data obtained during training, including the current state s_i, the current action a_i, the reward r_i obtained after taking the action a_i, and the next state s′_i obtained after interacting with the environment by taking the action; γ is an adjustable discount factor, and both the target network Q_ω′ and the training network Q_ω are constructed using three fully connected layers, each containing 32 neurons, and the outputs of the first two fully connected layers are fed into an activation function before being passed to the next fully connected layer, with the PReLU activation function being employed, and the final fully connected layer directly outputs the Q-value for each action, including steering angles and step lengths, and the steering angle and step length with the highest Q-value are ultimately selected as the optimized exploration parameters, comprising the optimal step length and the optimal steering angle.

8. The deep reinforcement learning-based path exploration parameter optimization system according to claim 2, characterized in that, in the deep learning parameter optimization module, constructing a reward function to optimize the deep learning network specifically comprises: the reward function involves a target approach reward r_g, an obstacle avoidance reward r_o, an exploration cost r_t, and a smoothness reward r_s; the target approach reward r_g is defined as follows:

$$r_g = \begin{cases} r_{success}, & \text{Reeds-Shepp connection succeeds} \\ w_g (l_c - l_b), & \text{Reeds-Shepp connection fails} \end{cases} \tag{6}$$

wherein w_g is an adjustable weight, l_c is the Euclidean distance from the current node N_c to the target point N_g in the current iteration round, l_b is the Euclidean distance from the steering angle-optimized child node N_best to the target point N_g, and r_success is a fixed reward given when successfully connected to the destination, indicating that the mining truck has reached the target point; the obstacle avoidance reward r_o is defined as follows:

$$r_{o,i} = \begin{cases} r_{collision}, & d_{obs,i} \le d_c \\ \dfrac{w_1}{d_{obs,i}}, & d_c < d_{obs,i} \le 2d_c \\ \dfrac{w_2}{(d_{obs,i})^4}, & 2d_c < d_{obs,i} \le 10d_c \\ 0, & \text{else} \end{cases} \tag{7}$$

wherein r_o,i represents the obstacle avoidance reward in the i-th sector, w_1 and w_2 are adjustable weight coefficients respectively, and a distance threshold d_c is designed, where d_obs,i ≤ d_c is considered a collision, returning a large penalty constant r_collision; when d_c < d_obs,i ≤ 2d_c, it is considered a dangerous situation, returning a relatively large penalty function; when 2d_c < d_obs,i ≤ 10d_c, it is considered risky, returning a relatively small penalty function; when d_obs,i > 10d_c, it is considered safe, and no penalty is returned, and the overall obstacle avoidance reward r_o satisfies the following formula:

$$r_o = \sum_{i=1}^{8} -r_{o,i} \tag{8}$$

the exploration cost r_t is defined as follows:

$$r_t = -TimeConstant \tag{9}$$

wherein TimeConstant is a fixed penalty cost constant set for each step, guiding the mining truck to approach the destination more rapidly and preventing meaningless exploration, with the cost set as a negative value; and the smoothness reward r_s is defined as follows:

$$r_s = -w_3 \left| \varphi_{rl} \right| - w_4 e^{-\frac{1}{l_{rl}}} \left| \varphi_{rl} - \varphi_c \right| \tag{10}$$

wherein φ_c represents the steering angle corresponding to the current node N_c generated in the current search iteration round, φ_rl corresponds to the optimal steering angle generated by the deep reinforcement learning network in the current search iteration round, l_rl represents the optimal step length generated by the deep reinforcement learning network in the current search iteration round, and w_3 and w_4 are adjustable coefficients respectively; and the final reward function is as follows:

$$R = r_g + r_o + r_t + r_s \tag{12}$$

9. The deep reinforcement learning-based path exploration parameter optimization system according to claim 7, characterized in that, in the deep learning parameter optimization module, executing the training process of the deep learning network specifically comprises: first, randomly selecting appropriate starting and target points on the map based on actual production data and performing path planning; during planning, optimizing path planning parameters through reinforcement learning, thereby forming multiple sets of state transition sample data and adding them to a replay buffer; during the training process, randomly selecting batches of data from the replay buffer and updating the parameters of the training network Q_ω according to the loss function; and after a certain number of iterations, copying the parameters of the training network Q_ω to the target network Q_ω′, thereby completing one learning process.

Description

DESCRIPTION OF DRAWINGS

[0065] FIG. 1 is an overall structure diagram of the deep reinforcement learning-based path exploration parameter optimization system according to the present invention;

[0066] FIG. 2 is a flowchart of node exploration rules with variable parameters according to the present invention;

[0067] FIG. 3 is a schematic diagram of regional division of obstacles around a mining truck according to the present invention;

[0068] FIG. 4 is a training flowchart of the DQN network according to the present invention;

[0069] FIG. 5 is an overall flowchart of the deep reinforcement learning-based path exploration parameter optimization method according to the present invention;

[0070] FIG. 6 is a flowchart of the deep reinforcement learning-based path exploration parameter optimization algorithm according to the present invention.

DETAILED DESCRIPTION

[0071] To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of them. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without making creative efforts shall fall within the scope of protection of the present application.

[0072] Those skilled in the art will appreciate that unless specifically stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include plural forms. It should be further understood that the term "comprising" used in the description of the present invention indicates the presence of the stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[0073] To achieve efficient unmanned mining truck path planning, the present invention proposes a deep reinforcement learning-based path exploration parameter optimization system and method. Firstly, a Hybrid A* path planning framework with variable exploration parameters is constructed, an environment representation model considering obstacle regional division is established, and on this basis, a deep reinforcement learning-based exploration parameter optimization strategy is developed. The specifics are as follows:

1. Path Planning Framework with Variable Exploration Parameters
1.1 Node Exploration Rules with Variable Parameters

[0074] By analyzing the kinematic characteristics of the unmanned mining truck, iterative node exploration rules for path nodes are established. Key exploration parameters are extracted to construct the node exploration process with variable parameters.

1.2 Node Evaluation Method

[0075] Based on the child nodes generated through node exploration in section 1.1, the validity of the child nodes is analyzed via collision detection. A targeted evaluation method is established to assign a cost value to each child node, thereby obtaining the child node set.

1.3 Loading and Parking Path Generation Method Based on Reeds-Shepp Curve

[0076] During the iterative exploration process, terminal constraints must be considered. Multiple candidate parking paths are generated based on the Reeds-Shepp curve, screened and sorted using an evaluation function, and finally an appropriate loading and parking path is selected to conclude the search.

2. Environment Representation Model Considering Obstacle Regional Division

2.1 Obstacle Regional Division

[0077] Based on the information of the current node, the surrounding space is divided according to the mining truck model to determine the occupancy status of obstacles, thereby modeling the distribution characteristics of the obstacles.

2.2 Environmental State Space Modeling

[0078] Based on the regional division in section 2.1, an environmental state space for deep reinforcement learning is constructed. This state space is used to represent the state obtained by the agent from the environment and is input into the deep reinforcement learning neural network.

3. Exploration Parameter Optimization Method Based on Deep Reinforcement Learning

3.1 Deep Learning Network Construction

[0079] Based on the state space designed in section 2.2 and the exploration parameter rules in section 1.1, a neural network is constructed to achieve the mapping from the state space to the exploration parameters.

3.2 Reward Function Construction

[0080] Necessary indicators for path planning are analyzed, and a reward function for the agent during iterative training is constructed to guide the training of the deep reinforcement learning strategy.

3.3 Deep Reinforcement Learning Training Process

[0081] Based on the design of the above deep reinforcement learning modules, an offline training process is constructed to optimize the exploration parameter network.

[0082] The following is explained through specific embodiments:

First Embodiment

[0083] As shown in FIG. 1, this embodiment provides a deep reinforcement learning-based path exploration parameter optimization system, comprising: establishing a path planning framework with variable exploration parameters; analyzing obstacle distribution characteristics based on this framework to build an environmental state space; and finally constructing a deep reinforcement learning network to achieve adaptive optimization of path exploration parameters.

I. Variable Parameter Path Planning Module 1 for the Path Planning Framework with Variable Exploration Parameters

[0084] A variable parameter path planning module 1, configured to: generate an optimal step length and an optimal steering angle based on a deep reinforcement learning network according to a current node and environmental information; construct a fixed steering angle set; perform node exploration by generating a child node set by combining the optimal step length with the fixed steering angle set and generating a steering angle-optimized child node by combining the optimal step length with the optimal steering angle and adding it to the child node set; perform collision detection on the child nodes in the child node set and calculate cost values of all the child nodes; and finally generate a loading and parking path using a Reeds-Shepp curve.

[0085] In this embodiment, the variable parameter path planning module 1 is specifically as follows:

(1) Variable Parameter Exploration Rules

[0086] Node exploration must satisfy the motion characteristics constraints of the unmanned mining truck; otherwise, the mining truck cannot track the generated path, leading to significant risks. Therefore, it is first necessary to model the mining truck's motion characteristics. In the working scenario of the mining truck in this project, since the mining truck typically operates at low speeds, a two-degree-of-freedom vehicle kinematics model can be used to characterize the motion characteristics of the unmanned mining truck.

[0087] Specifically, the vehicle pose state at any given time can be represented as q = (x, y, θ), where the coordinate origin is located at the center of the rear axle and the coordinate axes are parallel to the vehicle body. v denotes the vehicle speed, θ denotes the vehicle heading angle, φ denotes the vehicle steering angle, and L_w denotes the wheelbase of the vehicle. The kinematic model of the vehicle can be expressed as follows:

[00015] $$\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\theta} \end{bmatrix} = \begin{bmatrix} \cos\theta \\ \sin\theta \\ \dfrac{\tan\varphi}{L_w} \end{bmatrix} v$$

[0088] Assuming the current node is N_c(x_c, y_c, θ_c), where x_c, y_c are the coordinates and θ_c is the heading angle, the following is established based on the motion characteristics of the unmanned mining truck:

[00016] $$\begin{cases} x_s = x_c + d \cdot l \cos\theta_c \\ y_s = y_c + d \cdot l \sin\theta_c \\ \theta_s = \theta_c + \dfrac{l \tan\varphi}{L_w} \end{cases} \tag{1}$$

[0089] Where N_s(x_s, y_s, θ_s) is the next child node explored from the current node, x_s, y_s are the position coordinates, θ_s is the heading angle, d ∈ {−1, 1} represents the expansion direction of the current node, i.e., backward or forward, φ and l represent the steering angle and step length of node expansion respectively, and L_w is the wheelbase of the unmanned mining truck.

[0090] It can be observed that the position and orientation of the child nodes are determined by the expansion direction, steering angle, and step length. Conventional algorithms often employ fixed steering angles and step lengths, which makes it difficult for them to adapt to complex and dynamic mining operating environments.

[0091] To achieve variable steering angles and step lengths, one can sample steering angles and step lengths to form corresponding sets Φ = {φ_1, . . . , φ_N1} and L = {l_1, . . . , l_N2}, where N_1 and N_2 represent the number of samples for each parameter. However, since the step lengths and steering angles can be combined to form N_1 × N_2 possible combinations, this would lead to a significant increase in computational time. Therefore, the present invention optimizes these parameters through deep reinforcement learning and establishes exploration rules.

[0092] Specifically, as shown in FIG. 2, the optimal step length L_best and the optimal steering angle φ_best for the current node and environmental information are generated by the deep reinforcement learning network. Since the left and right steering capabilities of the mining truck are symmetric, the fixed steering angle set Φ_1 = {φ_1, . . . , φ_N3} is constructed using uniform sampling, where N_3 is the number of steering angle samples in the fixed steering angle exploration set (with N_3 being smaller than N_1 to reduce computational load and save time). For a fixed steering angle φ_i (where i = 1, 2, . . . , N_3), it is calculated as follows:

[00017] $$\varphi_i = -\varphi_{\max} + (i-1) \cdot \frac{2\varphi_{\max}}{N_3 - 1} \tag{2}$$

[0093] Where φ_max is the maximum steering angle that the mining truck can execute, and N_3 is the number of steering angles constructed.

[0094] Based on the above parameters, node exploration is performed through a two-step process comprising step size optimization and steering angle optimization. In the first step, the optimal step length L_best and all sampled fixed steering angles in the fixed steering angle set Φ_1 are substituted into Formula (1), thereby generating the child node set N for fixed steering angle exploration. The number of child nodes in the child node set N equals the number of sampled angles in the fixed steering angle set Φ_1, which is N_3. In the second step, the optimal step length L_best and the optimal steering angle φ_best are substituted into Formula (1) to generate the steering angle-optimized child node N_best, which is then added to the child node set N.
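The two-step expansion rule above can be illustrated with a short Python sketch. The tuple representation of nodes, the helper names, and treating the expansion direction d as a fixed argument are assumptions of this illustration; the sketch simply applies Formulas (1) and (2).

```python
import math

def expand_node(node, d, steer, step, wheelbase):
    """Expand one child from node = (x, y, theta) per Formula (1).

    d: +1 forward / -1 backward; steer: steering angle phi (rad);
    step: step length l; wheelbase: L_w.
    """
    x, y, theta = node
    x_s = x + d * step * math.cos(theta)
    y_s = y + d * step * math.sin(theta)
    theta_s = theta + step * math.tan(steer) / wheelbase
    return (x_s, y_s, theta_s)

def fixed_steering_set(phi_max, n3):
    """Uniformly sample N3 angles in [-phi_max, phi_max] per Formula (2)."""
    return [-phi_max + (i - 1) * 2 * phi_max / (n3 - 1) for i in range(1, n3 + 1)]

def explore(node, d, l_best, phi_best, phi_max, n3, wheelbase):
    """Two-step exploration: N3 fixed-angle children plus the optimized child N_best."""
    children = [expand_node(node, d, phi, l_best, wheelbase)
                for phi in fixed_steering_set(phi_max, n3)]
    children.append(expand_node(node, d, phi_best, l_best, wheelbase))
    return children
```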

(2) Node Evaluation Method

[0095] After obtaining the child node set N, the child nodes within the set need to be evaluated. Specifically, collision detection is first performed on all child nodes in the child node set N by covering the mining truck with two enveloping circles and sampling along the path from the current node to the explored child node (a smooth circular arc generated by the turning radius corresponding to the steering angle). It is determined whether the distance from the enveloping circles to any obstacle grid is smaller than the radius of the enveloping circles; if so, the child node is considered infeasible and is removed from the child node set N.

[0096] On this basis, the cost value of all explored child nodes is calculated using f(N_s) = g(N_s) + w_h·h(N_s), where g(N_s) represents the actual consumption cost of the mining truck moving from the starting point to the explored child node N_s, h(N_s) denotes the predicted cost from the explored child node to the target point, and w_h is the weight of the predicted cost. When designing the cost function g(N_s), consideration is given to operations such as reversing and direction changes during the movement of the mining truck, which typically consume more time and energy. Therefore, the present invention comprehensively incorporates factors such as a reversing penalty, a direction-switching penalty, and path length into the cost function to evaluate the quality of the nodes.

[0097] Wherein the actual consumption cost g(N_s) is:

[00018] $$g(N_s) = g(N_c) + w_1 g_{dis}(N_s) + w_2 g_{back}(N_s) + w_3 g_{switch}(N_s) + w_4 g_{steer}(N_s) + w_5 g_{change}(N_s) \tag{3}$$

[0098] In the above formula, g(N_s) incorporates five metrics based on the cost of the current node g(N_c): g_dis(N_s) denotes the distance from the current node N_c to the child node N_s in the iterative search; g_back(N_s) represents the reversing cost; g_switch(N_s) indicates the mode switch cost; g_steer(N_s) denotes the steering cost; g_change(N_s) represents the steering change cost; and w_i, where i = 1, . . . , 5, are the weight coefficients.

[0099] If the child node is obtained through vehicle reversing exploration, a reversing cost g_back(N_s) is added to the cost function, typically as a relatively large constant cost; when the vehicle's movement direction is opposite to that in the previous search round, a mode switch cost g_switch(N_s) is added to the cost function, generally as a relatively large constant; if the steering angle used in the current exploration is non-zero, a steering cost g_steer(N_s) is added, the magnitude of which is proportional to the absolute value of the steering angle applied; when the steering angle used in the current search differs from that in the previous round, a steering change cost g_change(N_s) is added, the magnitude of which is proportional to the absolute value of the change in the steering angle. The heuristic function h(N_s) is the estimated cost from the current node to the destination. The present invention uses a heuristic function that considers obstacles, specifically employing the A* method to compute the distance between the current node and the destination.
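A minimal sketch of this node evaluation follows. The weight values, the boolean flags for reversing and mode switching, and passing the A*-based heuristic in as a precomputed value h_astar are assumptions of this illustration, not values taken from the patent.

```python
def child_cost(g_parent, dist, reversing, switched, steer, steer_prev,
               h_astar, weights=(1.0, 5.0, 5.0, 0.5, 0.5), w_h=1.0):
    """f(N_s) = g(N_s) + w_h * h(N_s), with g(N_s) per Formula (3).

    g_parent: g(N_c); dist: arc length of this expansion (g_dis);
    h_astar: obstacle-aware A* estimate h(N_s). Weights are placeholders.
    """
    w1, w2, w3, w4, w5 = weights
    g = (g_parent
         + w1 * dist                           # g_dis
         + w2 * (1.0 if reversing else 0.0)    # g_back: constant reversing penalty
         + w3 * (1.0 if switched else 0.0)     # g_switch: direction-switch penalty
         + w4 * abs(steer)                     # g_steer: proportional to |phi|
         + w5 * abs(steer - steer_prev))       # g_change: proportional to |delta phi|
    return g + w_h * h_astar
```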

(3) Loading and Parking Path Generation Method Based on Reeds-Shepp Curve

[0100] When the distance between the current node N_c(x_c, y_c, θ_c) and the target point N_g is less than a threshold L_t, a plurality of candidate loading and parking path curves from the current node N_c(x_c, y_c, θ_c) to the target point N_g are generated using the Reeds-Shepp curve. The node costs along the curves are calculated using Formula (3), and the curves are sorted based on their costs. The path with the minimum cost is selected, and the global path is obtained through backtracking. If all candidate loading and parking path curves result in collisions, the process returns to the node exploration step.
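The termination logic of this step can be sketched as follows, assuming hypothetical caller-supplied helpers rs_candidates (candidate Reeds-Shepp curve generation), collides (the enveloping-circle collision test), and curve_cost (Formula (3) accumulated along a curve); none of these names come from the patent.

```python
import math

def try_reeds_shepp_finish(current, goal, threshold,
                           rs_candidates, collides, curve_cost):
    """If within L_t of the goal, return the cheapest collision-free
    Reeds-Shepp candidate; return None to resume node exploration."""
    if math.hypot(goal[0] - current[0], goal[1] - current[1]) >= threshold:
        return None                       # not close enough to the target yet
    feasible = [c for c in rs_candidates(current, goal) if not collides(c)]
    if not feasible:
        return None                       # all candidates in collision
    return min(feasible, key=curve_cost)  # minimum Formula (3) cost
```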

II. Environmental State Space Modeling Module 2 for Environment State Space Modeling Considering Obstacle Regional Division

[0101] The environmental state space modeling module 2 is configured to perform regional division of obstacles surrounding the current node and conduct environmental state space modeling.

[0102] In this embodiment, the environmental state space modeling module 2 is specifically as follows:

Obstacle Regional Division Method

[0103] As shown in FIG. 3, to characterize the impact of environmental obstacles on planning, the present invention divides the space surrounding the current node N_c(x_c, y_c, θ_c) into 8 sectors D = {D_1, . . . , D_8} by angular divisions. Within each sector, let d_obs,i (where i = 1, 2, . . . , 8) represent the minimum distance between obstacles and the mining truck in the i-th sector.
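A possible implementation of the eight-sector minimum-distance computation is sketched below. Representing obstacles as point grid centers, aligning sector 0 with the vehicle heading, and the d_far sentinel for empty sectors are assumptions of this sketch.

```python
import math

def sector_min_distances(node, obstacles, n_sectors=8, d_far=1e9):
    """Minimum obstacle distance d_obs,i per angular sector around the node.

    node: (x, y, theta); obstacles: iterable of (x, y) grid centers.
    Angles are measured in the vehicle frame; d_far marks an empty sector.
    """
    x, y, theta = node
    d_obs = [d_far] * n_sectors
    width = 2 * math.pi / n_sectors
    for ox, oy in obstacles:
        ang = (math.atan2(oy - y, ox - x) - theta) % (2 * math.pi)
        i = min(int(ang / width), n_sectors - 1)   # sector index 0 .. 7
        d_obs[i] = min(d_obs[i], math.hypot(ox - x, oy - y))
    return d_obs
```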

Environmental State Space Modeling

[0104] Deep reinforcement learning determines the optimal action based on the input state space. Therefore, to enhance the generalization capability of the deep reinforcement learning model, it is necessary to consider the distance information between the current node and surrounding obstacles, as well as the relative positional information between the current node, the starting point, and the target point. Specifically, the state space S is designed as follows:

[00019] $$S = (d_{start}, \theta_{start}, d_{goal}, \theta_{goal}, \psi_{goal}, S_{position}, N_{obs}, d_{obs,i}) \tag{4}$$

[0105] Wherein S_position represents the coordinates of the current node, d_start denotes the distance from the starting point relative to the current node, θ_start indicates the relative angular orientation of the starting point in a coordinate system with the current node as the origin and the heading direction as the x-axis, d_goal denotes the distance from the target point relative to the current node, θ_goal indicates the relative angular orientation of the target point in a coordinate system with the current node as the origin and the heading direction as the x-axis, ψ_goal represents the direction of the target loading position in a coordinate system with the current node as the origin and the heading direction as the x-axis, N_obs denotes the number of obstacles within a given range of the current node, and d_obs,i, where i = 1, 2, . . . , 8, represents the minimum distance between obstacles and the mining truck in the i-th sector.
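The state vector of Formula (4) could be assembled as follows. The ordering of components, the default obstacle-counting range, and the reuse of the sector_min_distances sketch above are illustrative assumptions rather than specifics from the patent.

```python
import math

def build_state(node, start, goal, theta_load, obstacles, n_obs_range=20.0):
    """Assemble the state vector S of Formula (4); component ordering is illustrative."""
    x, y, theta = node

    def rel(p):
        # distance and bearing of point p in the node-centered, heading-aligned frame
        dx, dy = p[0] - x, p[1] - y
        return math.hypot(dx, dy), (math.atan2(dy, dx) - theta) % (2 * math.pi)

    d_start, a_start = rel(start)
    d_goal, a_goal = rel(goal)
    d_obs = sector_min_distances(node, obstacles)    # from the earlier sketch
    n_obs = sum(1 for ox, oy in obstacles
                if math.hypot(ox - x, oy - y) <= n_obs_range)
    return [d_start, a_start, d_goal, a_goal,
            (theta_load - theta) % (2 * math.pi),    # loading direction, vehicle frame
            x, y, float(n_obs), *d_obs]
```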

III. Deep Learning Parameter Optimization Module 3 for Deep Reinforcement Learning-Based Optimization

[0106] The deep learning parameter optimization module 3 is configured to construct a deep learning network to calculate the optimal step length and the optimal steering angle, build a reward function to optimize the deep learning network, and simultaneously execute the training process.

[0107] In this embodiment, the deep learning parameter optimization module 3 is specifically as follows:

(1) Deep Learning Network Construction

[0108] The DQN algorithm is employed to train the deep learning network. The action space consists of combinations of candidate optimal steering angles φ_rl and candidate optimal step lengths l_rl during expansion, i.e., the action space comprises all possible combinations of (φ_rl, l_rl).

[0109] For example,

[00020] $$\varphi_{rl} \in \{\pm 0.9\varphi_{\max}, \pm 0.8\varphi_{\max}, \pm 0.7\varphi_{\max}, \pm 0.6\varphi_{\max}, \pm 0.4\varphi_{\max}, \pm 0.3\varphi_{\max}, \pm 0.2\varphi_{\max}, \pm 0.1\varphi_{\max}, 0\}, \quad l_{rl} \in \{l_{\min}, \tfrac{1}{3}(2 l_{\min} + l_{\max}), \tfrac{1}{3}(l_{\min} + 2 l_{\max}), l_{\max}\}$$

[0110] Where l_min and l_max are the minimum and maximum exploration step lengths, respectively, and can be adjusted. The action space consists of all possible combinations of φ_rl and l_rl, resulting in a total of 17 × 4 = 68 possible actions.
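The enumeration of this 68-element action space can be written directly; the ± expansion of the listed fractions (which yields the 17 steering angles) is inferred from the stated count and the truck's symmetric steering.

```python
def build_action_space(phi_max, l_min, l_max):
    """Enumerate the 17 x 4 = 68 (steering angle, step length) actions."""
    fracs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]      # as listed (no 0.5 entry)
    steers = [s * f * phi_max for f in fracs for s in (1, -1)] + [0.0]  # 17 angles
    steps = [l_min, (2 * l_min + l_max) / 3, (l_min + 2 * l_max) / 3, l_max]
    return [(phi, l) for phi in steers for l in steps]    # 68 combinations
```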

[0111] The DQN algorithm utilizes two networks with identical structures but different parameters for training: a training network Q_ω used to compute the Q-value for policy selection and iteratively update the Q-values; and a target network Q_ω′ used to compute the Q-value of the next state in the temporal difference target (TD Target). The loss function Loss of the DQN algorithm is designed as follows:

[00021] $$Loss = \frac{1}{N} \sum_i \left( r_i + \gamma \max_{a'} Q_{\omega'}(s'_i, a') - Q_\omega(s_i, a_i) \right)^2 \tag{5}$$

[0112] Wherein (s_i, a_i, r_i, s′_i) represents a set of state transition data obtained during training, including the current state s_i, the current action a_i, the reward r_i obtained after taking the action a_i, and the next state s′_i obtained after interacting with the environment by taking the action; γ is an adjustable discount factor.

[0113] Both the target network Q_ω′ and the training network Q_ω are constructed using three fully connected layers, each containing 32 neurons. The outputs of the first two fully connected layers are fed into an activation function before being passed to the next fully connected layer, with the PReLU activation function being employed. The final fully connected layer directly outputs the Q-value for each action, including steering angles and step lengths. The steering angle and step length with the highest Q-value are ultimately selected as the optimized exploration parameters, comprising the optimal step length and the optimal steering angle.
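As one possible realization, the described network could be written in PyTorch as below. The state dimension, the action count of 68, and leaving the final layer without an activation are assumptions consistent with the description rather than a verbatim reproduction of the patented network.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Three fully connected layers with PReLU activations after the first two;
    the last layer emits one Q-value per (steering angle, step length) action."""

    def __init__(self, state_dim: int, n_actions: int = 68):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32), nn.PReLU(),
            nn.Linear(32, 32), nn.PReLU(),
            nn.Linear(32, n_actions),        # raw Q-values, no final activation
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

# Greedy selection of the optimized exploration parameters:
# action_idx = q_network(state_tensor).argmax(dim=-1)
```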

(2) Reward Function Construction

[0114] To train and optimize the deep reinforcement learning network, it is essential to design a reasonable reward function and refine the strategy through rewards. Specifically, since path planning is an iterative search process, the deep reinforcement learning network outputs an action and receives a corresponding reward during each exploration round. The design of this reward function primarily considers guiding the mining truck to reach the target point quickly, reducing the number of iteration rounds, and maintaining a safe distance from obstacles. The designed reward function includes a target approach reward r_g, an obstacle avoidance reward r_o, an exploration cost r_t, and a smoothness reward r_s.

[0115] r_g is the target approach reward, designed to guide the mining truck toward the destination. Accordingly, a positive reward is given when the mining truck moves closer to the target point, while a penalty is imposed when it moves farther away. Furthermore, when the Reeds-Shepp curve in the current round successfully connects to the destination, the mining truck is considered to have reached the target point, and a fixed arrival reward r_success is granted.

[0116] The target approach reward r_g is defined as follows:

[00024] $$r_g = \begin{cases} r_{success}, & \text{Reeds-Shepp connection succeeds} \\ w_g (l_c - l_b), & \text{Reeds-Shepp connection fails} \end{cases} \tag{6}$$

[0117] Wherein w_g is an adjustable weight, l_c is the Euclidean distance from the current node N_c to the target point N_g in the current iteration round, l_b is the Euclidean distance from the steering angle-optimized child node N_best to the target point N_g, and r_success is a fixed reward granted when the Reeds-Shepp curve in the current round successfully connects to the destination, indicating that the mining truck has reached the target point. Triggering the Reeds-Shepp curve signifies successful path generation, thus resulting in a relatively large reward. Conversely, if the Reeds-Shepp curve fails to trigger, further exploration is still required. During node exploration, it is desirable for the nodes generated by the optimized exploration parameters to be as close as possible to the target point. Therefore, a reward component is introduced based on the difference between the distance from the current node to the target point l_c and the distance from the child node to the target point l_b. If the child node generated by the optimized exploration parameters moves farther from the target point, this reward component is negative; otherwise, it is positive.

[0118] r_o is the obstacle avoidance reward, designed to prevent the mining truck from getting too close to surrounding obstacles and causing collisions. When designing the obstacle avoidance reward function, the safety status of the mining truck is classified into four conditions based on the distance d_obs,i between the obstacles and the generated steering angle-optimized child node N_best: collision, danger, risk, and safe. Furthermore, to better guide the policy training, a potential field function is employed to ensure the continuity of reward outputs at different distances.

[0119] The obstacle avoidance reward r_o is defined as follows:

[00025] $$r_{o,i} = \begin{cases} r_{collision}, & d_{obs,i} \le d_c \\ \dfrac{w_1}{d_{obs,i}}, & d_c < d_{obs,i} \le 2d_c \\ \dfrac{w_2}{(d_{obs,i})^4}, & 2d_c < d_{obs,i} \le 10d_c \\ 0, & \text{else} \end{cases} \tag{7}$$

[0120] Wherein r_o,i represents the obstacle avoidance reward in the i-th sector, and w_1 and w_2 are adjustable weight coefficients respectively. A distance threshold d_c is designed, where d_obs,i ≤ d_c is considered a collision, returning a large penalty constant r_collision; when d_c < d_obs,i ≤ 2d_c, it is considered a dangerous situation, returning a relatively large penalty function; when 2d_c < d_obs,i ≤ 10d_c, it is considered risky, returning a relatively small penalty function; when d_obs,i > 10d_c, it is considered safe, and no penalty is returned. The overall obstacle avoidance reward r_o satisfies the following formula:

[00026] $$r_o = \sum_{i=1}^{8} -r_{o,i} \tag{8}$$

[0121] The exploration cost r_t is defined as follows:

[00027] $$r_t = -TimeConstant \tag{9}$$

[0122] Wherein TimeConstant is a fixed penalty cost constant set for each step, guiding the mining truck to approach the destination more rapidly and preventing meaningless exploration, with the cost set as a negative value;

[0123] Since steering changes in the mining truck incur additional travel costs, a smoothness reward r_s is set in this invention to encourage minimizing steering wheel adjustments. The smoothness reward r_s is defined as follows:

[00028] $$r_s = -w_3 \left| \varphi_{rl} \right| - w_4 e^{-\frac{1}{l_{rl}}} \left| \varphi_{rl} - \varphi_c \right| \tag{10}$$

[0124] Wherein φ_c represents the steering angle corresponding to the current node N_c generated in the current search iteration round, φ_rl corresponds to the optimal steering angle generated by the deep reinforcement learning network in the current search iteration round, l_rl represents the optimal step length generated by the deep reinforcement learning network in the current search iteration round, and w_3 and w_4 are adjustable coefficients respectively.

[0125] The final reward function is as follows:

[00029] $$R = r_g + r_o + r_t + r_s \tag{12}$$
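Putting Formulas (6) through (10) and (12) together, a hedged Python sketch of the per-step reward might look as follows; every numeric constant and weight is a placeholder, and the e^(−1/l_rl) factor follows the reconstruction of Formula (10) above.

```python
import math

def obstacle_penalty(d, d_c, w1, w2, r_collision):
    """Per-sector penalty r_o,i of Formula (7); r_collision is a large positive constant."""
    if d <= d_c:
        return r_collision          # collision
    if d <= 2 * d_c:
        return w1 / d               # danger: relatively large penalty
    if d <= 10 * d_c:
        return w2 / d ** 4          # risk: relatively small penalty
    return 0.0                      # safe

def total_reward(rs_connected, l_c, l_b, d_obs, phi_rl, phi_c, l_rl,
                 r_success=100.0, r_collision=50.0, d_c=1.0, w_g=1.0,
                 w1=1.0, w2=1.0, w3=0.1, w4=0.1, time_constant=0.5):
    """R = r_g + r_o + r_t + r_s (Formula (12)); all constants are placeholders."""
    r_g = r_success if rs_connected else w_g * (l_c - l_b)          # Formula (6)
    r_o = -sum(obstacle_penalty(d, d_c, w1, w2, r_collision)
               for d in d_obs)                                      # Formulas (7)-(8)
    r_t = -time_constant                                            # Formula (9)
    r_s = (-w3 * abs(phi_rl)
           - w4 * math.exp(-1.0 / l_rl) * abs(phi_rl - phi_c))      # Formula (10)
    return r_g + r_o + r_t + r_s
```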

(3) Deep Reinforcement Learning Training Process

[0126] As shown in FIG. 4, for the training of deep reinforcement learning, appropriate starting and target points are first randomly selected on the map based on actual production data, and path planning is performed. During planning, path planning parameters are optimized through reinforcement learning, thereby forming multiple sets of state transition samples, which are added to a replay buffer. During training, batches of data are randomly selected from the replay buffer, and the parameters of the training network Q_ω are updated according to the loss function. After a certain number of iterations, the parameters of the training network Q_ω are copied to the target network Q_ω′, thereby completing one learning process. Using two networks for training reduces the correlation between the current Q-value and the target Q-value to some extent, improving algorithm stability. During training, the starting and target points can be randomly perturbed, thereby enriching the training data and further improving generalization. The pseudo-code of the DQN algorithm used for deep reinforcement learning training is shown in Table 1 below.

TABLE 1 DQN Algorithm

1. Initialize the training network Q_ω with random parameters ω.
2. Initialize the target network Q_ω′ by copying the same parameters (ω′ ← ω).
3. Initialize the experience replay buffer.
4. for episode e = 1 to E do
5.   Obtain the initial environment state s_1.
6.   for time step t = 1 to T do
7.     Select an action a_t using the ε-greedy policy based on the current network Q_ω.
8.     Execute the action a_t; observe the reward r_t and the next state s_{t+1}.
9.     Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer.
10.    If the buffer contains enough samples, randomly sample a batch of transitions (s_i, a_i, r_i, s_{i+1}).
11.    For each sampled transition, calculate the target value y_i = r_i + γ max_{a′} Q_ω′(s_{i+1}, a′).
12.    Update the parameters of Q_ω to minimize the loss between Q_ω(s_i, a_i) and y_i.
13.    Periodically update the target network: ω′ ← ω.
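A compact PyTorch rendition of Table 1 is sketched below. The env wrapper (reset/step over the planner), the hyperparameter values, and the reuse of the QNetwork sketch above are assumptions of this illustration; the TD target mirrors Formula (5).

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

def train_dqn(env, q_train, q_target, episodes=500, horizon=200,
              gamma=0.95, eps=0.1, batch_size=64, sync_every=100,
              lr=1e-3, buffer_size=10_000):
    """Minimal DQN loop mirroring Table 1; env is a hypothetical planner wrapper
    exposing reset() -> state and step(action_idx) -> (next_state, reward, done)."""
    n_actions = q_train.net[-1].out_features        # from the QNetwork sketch
    opt = torch.optim.Adam(q_train.parameters(), lr=lr)
    buf = deque(maxlen=buffer_size)
    q_target.load_state_dict(q_train.state_dict())  # omega' <- omega
    step_count = 0
    for _ in range(episodes):
        s = env.reset()
        for _ in range(horizon):
            if random.random() < eps:               # epsilon-greedy selection
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    q = q_train(torch.as_tensor(s, dtype=torch.float32))
                    a = int(q.argmax())
            s2, r, done = env.step(a)
            buf.append((s, a, r, s2, done))
            s = s2
            if len(buf) >= batch_size:
                ss, aa, rr, ss2, dd = zip(*random.sample(buf, batch_size))
                ss = torch.tensor(ss, dtype=torch.float32)
                ss2 = torch.tensor(ss2, dtype=torch.float32)
                aa = torch.tensor(aa).unsqueeze(1)
                rr = torch.tensor(rr, dtype=torch.float32)
                dd = torch.tensor(dd, dtype=torch.float32)
                with torch.no_grad():               # TD target of Formula (5)
                    y = rr + gamma * (1 - dd) * q_target(ss2).max(dim=1).values
                q_sa = q_train(ss).gather(1, aa).squeeze(1)
                loss = F.mse_loss(q_sa, y)          # Loss of Formula (5)
                opt.zero_grad()
                loss.backward()
                opt.step()
            step_count += 1
            if step_count % sync_every == 0:        # periodic target sync
                q_target.load_state_dict(q_train.state_dict())
            if done:
                break
```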

Second Embodiment

[0127] As shown in FIGS. 5 and 6, this embodiment provides a deep reinforcement learning-based path exploration parameter optimization method executed by the path exploration parameter optimization system described in the first embodiment. The method comprises the following steps: [0128] S1: Generating an optimal step length and an optimal steering angle based on a deep reinforcement learning network according to a current node and environmental information, and constructing a fixed steering angle set; performing node exploration by combining the optimal step length with the fixed steering angle set to generate a child node set, and combining the optimal step length with the optimal steering angle to generate a steering angle-optimized child node which is added to the child node set; [0129] S2: Performing collision detection on the child nodes in the child node set and calculating cost values of all the child nodes; [0130] S3: Obtaining the child node with the lowest cost value in each iteration round of the search process as the final selected next node of the current node; [0131] S4: When the distance from the current node to the target point is less than a set threshold, generating a loading and parking path using a Reeds-Shepp curve, and generating a planned path through node backtracking.

[0132] A computer-readable storage medium storing computer code, wherein when the computer code is executed, the method as described above is performed. Those of ordinary skill in the art may understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical discs, etc.

[0133] The foregoing descriptions are merely preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments. All technical solutions under the concept of the present invention shall fall within the scope of protection of the present invention. It should be noted that for those skilled in the art, several improvements and modifications made without departing from the principles of the present invention should also be considered as within the scope of protection of the present invention.

[0134] The technical features of the embodiments described above can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments have been described. However, as long as there is no contradiction in the combination of these technical features, they should be considered as falling within the scope of this specification.

[0135] It should be noted that the above embodiments can be freely combined as needed.