AGENT TRAINING METHOD, APPARATUS, AND COMPUTER-READABLE STORAGE MEDIUM

20220366266 · 2022-11-17

Abstract

An agent training method includes: obtaining environment information of a first agent and environment information of a second agent; generating first information based on the environment information of the first agent and the environment information of the second agent; and training the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information. The neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.

Claims

1. An agent training method, comprising: obtaining environment information of a first agent and environment information of a second agent; generating first information based on the environment information of the first agent and the environment information of the second agent; and training the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information, wherein the neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.

2. The method according to claim 1, wherein generating first information based on the environment information of the first agent and the environment information of the second agent comprises: generating second information h.sub.i of the first agent based on the environment information of the first agent; generating second information h.sub.j of the second agent based on the environment information of the second agent; and generating the first information based on h.sub.i and h.sub.j.

3. The method according to claim 2, wherein generating the first information based on h.sub.i and h.sub.j comprises: determining a first result based on a product of h.sub.i and a first matrix; determining a second result based on a product of h.sub.j and a second matrix; and generating the first information based on the first result and the second result.

4. The method according to claim 1, wherein the method further comprises: obtaining the neighborhood cognition information Ĉ.sub.j of the second agent; and training a neural network generating the neighborhood cognition information Ĉ.sub.i of the first agent based on the neighborhood cognition information Ĉ.sub.j of the second agent, so that Ĉ.sub.j is consistent with Ĉ.sub.i.

5. The method according to claim 4, wherein training a neural network generating the neighborhood cognition information Ĉ.sub.i of the first agent based on the neighborhood cognition information Ĉ.sub.j of the second agent comprises: training the neural network generating Ĉ.sub.i based on a loss function comprising Ĉ.sub.j and Ĉ.sub.i.

6. The method according to claim 1, wherein training the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information comprises: determining the neighborhood cognition information Ĉ.sub.i of the first agent based on the first information and a variational autoencoder.

7. The method according to claim 1, wherein the method further comprises: determining an estimate ô.sub.i of the environment information of the first agent based on the neighborhood cognition information Ĉ.sub.i of the first agent; and training the neural network generating Ĉ.sub.i based on a loss function comprising o.sub.i and ô.sub.i.

8. The method according to claim 1, wherein the method further comprises: determining a Q value of the first agent based on the individual cognition information and the neighborhood cognition information of the first agent; and training the first agent based on the Q value of the first agent.

9. The method according to claim 8, wherein training the first agent based on the Q value of the first agent comprises: determining Q values Q.sub.total of a plurality of agents based on the Q value of the first agent and a Q value of the second agent; and training the first agent based on Q.sub.total.

10. The method according to claim 1, wherein the method further comprises: generating an instruction based on the target individual cognition information and the target neighborhood cognition information of the first agent.

11. An agent training apparatus, comprising a communication circuit and a processing circuit, wherein: the communication circuit is configured to obtain environment information of a first agent and environment information of a second agent; and the processing circuit is configured to: generate first information based on the environment information of the first agent and the environment information of the second agent; and train the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information, wherein the neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.

12. The apparatus according to claim 11, wherein the processing circuit is configured to: generate second information h.sub.i of the first agent based on the environment information of the first agent; generate second information h.sub.j of the second agent based on the environment information of the second agent; and generate the first information based on h.sub.i and h.sub.j.

13. The apparatus according to claim 12, wherein the processing circuit is configured to: determine a first result based on a product of h.sub.i and a first matrix; determine a second result based on a product of h.sub.j and a second matrix; and generate the first information based on the first result and the second result.

14. The apparatus according to claim 11, wherein: the communication circuit is further configured to obtain the neighborhood cognition information Ĉ.sub.j of the second agent; and the processing circuit is further configured to train a neural network generating the neighborhood cognition information Ĉ.sub.i of the first agent based on the neighborhood cognition information Ĉ.sub.j of the second agent, so that Ĉ.sub.j is consistent with Ĉ.sub.i.

15. The apparatus according to claim 14, wherein the processing circuit is configured to: train the neural network generating Ĉ.sub.i based on a loss function comprising Ĉ.sub.j and Ĉ.sub.i.

16. The apparatus according to claim 11, wherein the processing circuit is configured to: determine the neighborhood cognition information Ĉ.sub.i of the first agent based on the first information and a variational autoencoder.

17. The apparatus according to claim 11, wherein: the communication circuit is further configured to determine an estimate ô.sub.i of the environment information of the first agent based on the neighborhood cognition information Ĉ.sub.i of the first agent; and the processing circuit is further configured to train the neural network generating Ĉ.sub.i based on a loss function comprising o.sub.i and ô.sub.i.

18. The apparatus according to claim 11, wherein the processing circuit is further configured to: determine a Q value of the first agent based on the individual cognition information and the neighborhood cognition information of the first agent; and train the first agent based on the Q value of the first agent.

19. The apparatus according to claim 18, wherein the processing circuit is configured to: determine Q values Q.sub.total of a plurality of agents based on the Q value of the first agent and a Q value of the second agent; and train the first agent based on Q.sub.total.

20. A non-transitory computer-readable storage medium, wherein the computer-readable storage medium stores a computer program; and when the computer program is executed by a processor, the processor is enabled to perform: obtaining environment information of a first agent and environment information of a second agent; generating first information based on the environment information of the first agent and the environment information of the second agent; and training the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information, wherein the neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.

Description

BRIEF DESCRIPTION OF DRAWINGS

[0066] FIG. 1 is a schematic diagram of a multi-agent system according to some embodiments;

[0067] FIG. 2 is a schematic diagram of an agent training method according to some embodiments;

[0068] FIG. 3 is a schematic diagram of a method for generating neighborhood cognition information based on a variational autoencoder according to some embodiments;

[0069] FIG. 4 is a schematic diagram of another agent training method according to some embodiments;

[0070] FIG. 5 is a schematic diagram of an agent training method using a plurality of Q values according to some embodiments;

[0071] FIG. 6 is a schematic diagram of an agent-based action generation method according to some embodiments;

[0072] FIG. 7 is a schematic diagram of an agent training apparatus according to some embodiments;

[0073] FIG. 8 is a schematic diagram of an agent-based action generation apparatus according to some embodiments; and

[0074] FIG. 9 is a schematic diagram of an electronic device according to some embodiments.

DESCRIPTION OF EMBODIMENTS

[0075] The following describes the technical solutions of this application with reference to the accompanying drawings.

[0076] FIG. 1 is a schematic diagram of a multi-agent system applicable to some embodiments.

[0077] In FIG. 1, A to F represent six routers, and a neural network is deployed on each router. One router is therefore equivalent to one agent, and training the agent trains the neural network deployed on it. Lines between routers indicate communication links. A to D are four border routers. Traffic between border routers is referred to as an aggregation flow. For example, traffic from A to C is one aggregation flow, and traffic from C to A is another aggregation flow.

[0078] The quantity of aggregation flows between a plurality of routers may be determined as N.sub.B(N.sub.B−1), where N.sub.B is the quantity of border routers in the plurality of routers. In the system shown in FIG. 1, there are four border routers. Therefore, there are 4×3=12 aggregation flows in total in these embodiments of the system.

[0079] For each aggregation flow, a multipath routing algorithm gives the available paths. The router may determine an available path based on a routing entry (S, D, Nexthop1, rate1%, Nexthop2, rate2%, Nexthop3, rate3%, . . . ), where S represents the start router, D represents the target router, Nexthop1, Nexthop2, and Nexthop3 represent different next hops, rate1%, rate2%, and rate3% represent the proportions of forwarded traffic corresponding to the different next hops in the total forwarded traffic, and the sum of the rates is equal to 100%.
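
As an illustrative sketch (not part of the claimed method), the proportional split encoded by such a routing entry can be expressed as follows; the entry layout and all names here are assumptions for illustration only:

```python
# Hypothetical sketch: split an aggregation flow across next hops
# according to a routing entry (S, D, [(nexthop, rate%), ...]).
# The entry layout and names are illustrative, not the claimed format.
def split_traffic(total_bytes, entry):
    src, dst, hops = entry
    assert sum(rate for _, rate in hops) == 100  # rates must sum to 100%
    return {hop: total_bytes * rate / 100 for hop, rate in hops}

# Example: 1000 bytes from border router A to C, split over three next hops.
plan = split_traffic(1000, ("A", "C", [("E", 50), ("D", 30), ("B", 20)]))
# plan == {"E": 500.0, "D": 300.0, "B": 200.0}
```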

[0080] A task of the foregoing system is to determine a traffic forwarding policy of any one of the routers A to F.

[0081] A method for completing the foregoing task is to regard any router in A to F as one agent, and train the agent so that the agent can make a proper traffic forwarding policy.

[0082] The following describes in detail an agent training method according to some embodiments.

[0083] FIG. 2 shows a schematic diagram of an agent training method according to some embodiments. The method 200 may be executed by an agent, or may be executed by a dedicated neural network accelerator, a general-purpose processor, or another apparatus. The following description of the method 200 by using the agent as an execution body is an example, and should not be understood as a limitation on the execution body of the method 200. The method 200 includes the following steps.

[0084] S210: Obtain environment information of a first agent and environment information of a second agent.

[0085] The first agent may be any router in A to F, and the second agent may be any agent in A to F other than the first agent. In the following, the first agent is referred to as a target agent, and the second agent is referred to as a neighborhood agent. The neighborhood agent of the target agent may be a router that has a direct communication connection with the target agent.

[0086] For example, the target agent is the router E, and routers that have direct communication connections with the router E are the router A, the router B, and the router F. Therefore, the three routers may be used as neighborhood agents of the target agent.

[0087] Optionally, the neighborhood agent of the target agent may be further determined based on a distance between agents. A method for determining the neighborhood agent of the target agent is not limited in this application.

[0088] For ease of description, an agent i is used to represent the target agent, o.sub.i is used to represent environment information of the target agent, an agent j is used to represent the neighborhood agent of the target agent, and o.sub.j is used to represent environment information of the neighborhood agent of the target agent.

[0089] For example, o.sub.i or o.sub.j is information such as a cache size of a router, traffic in the cache, load of a direct link in different statistical periods, average load of the direct link in a previous decision period, or a historical decision of the router. Specific content of the environment information is not limited in this application.

[0090] After obtaining o.sub.i and o.sub.j, the agent i may perform the following steps.

[0091] S220: Generate first information based on the environment information of the first agent and the environment information of the second agent.

[0092] The agent i may convert o.sub.i into the first information by using a deep neural network. The first information includes abstracted content of o.sub.i and o.sub.j, and contains richer content than the original environment information (o.sub.i and o.sub.j). This improves the accuracy of decision making by a neural network.

[0093] In this application, terms such as “first” and “second” are used to describe different individuals in objects of a same type. For example, “first information” and “second information” described below represent two different pieces of information. There is no other limitation.

[0094] The first information may be generated by the agent i, or may be received by the agent i from another device. For example, after sensing o.sub.i, the agent i may generate the first information based on o.sub.i, or may send o.sub.i to another device; after that device generates the first information based on o.sub.i, the agent i receives the first information from it.

[0095] After obtaining the first information, the agent i may perform the following steps.

[0096] S230: Train the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information, where the neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.

[0097] The individual cognition information of the target agent may be represented by A.sub.i, and the neighborhood cognition information of the target agent may be represented by Ĉ.sub.i. A.sub.i reflects cognition of the agent i on its own condition, and Ĉ.sub.i reflects cognition of the agent i on the surrounding environment. It is assumed that the environment information o.sub.i collected by the agent i is complete. Information in o.sub.i that is the same as or similar to the environment information of the neighborhood agent is neighborhood cognition information, and information in o.sub.i that differs from the environment information of the neighborhood agent is individual cognition information. This is because, generally, the environments of agents in one neighborhood are the same or similar, while the individual conditions of different agents differ.

[0098] The agent i may input the first information into a cognition neural network to obtain A.sub.i and Ĉ.sub.i. The following describes in detail how to obtain Ĉ.sub.i that is the same as or similar to Ĉ.sub.j (e.g., the neighborhood cognition information of the neighborhood agent).

[0099] Optionally, other methods may also be used for generating Ĉ.sub.i.

[0100] FIG. 3 shows a Ĉ.sub.i generation method using a variational autoencoder (VAE) according to some embodiments.

[0101] First, o.sub.i is input into a fully connected network of the variational autoencoder, o.sub.i is converted into h.sub.i by using the fully connected network, and h.sub.i and h.sub.j are further converted into the first information H.sub.i, where h.sub.j is a result obtained after the environment information o.sub.j of the neighborhood agent is abstracted.

[0102] Then, a distribution average value Ĉ.sub.i.sup.μ and a distribution variance Ĉ.sub.i.sup.σ of the neighborhood cognition information of the agent i are determined based on the first information; a random value ε is obtained by sampling from a unit Gaussian distribution; and Ĉ.sub.i is determined based on Ĉ.sub.i.sup.μ, Ĉ.sub.i.sup.σ, and ε, where Ĉ.sub.i=Ĉ.sub.i.sup.μ+Ĉ.sub.i.sup.σ⊙ε.
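
The sampling step Ĉ.sub.i=Ĉ.sub.i.sup.μ+Ĉ.sub.i.sup.σ⊙ε (the reparameterization trick) can be sketched as follows; the vector dimension is an illustrative assumption:

```python
import numpy as np

# Sketch of the reparameterized sampling step described above:
# C_i = mu + sigma ⊙ eps, with eps drawn from a unit Gaussian.
# The vector dimension (4) is an illustrative assumption.
def sample_cognition(mu, sigma, rng):
    eps = rng.standard_normal(mu.shape)  # eps ~ N(0, I)
    return mu + sigma * eps              # elementwise (⊙) product

rng = np.random.default_rng(0)
c_i = sample_cognition(np.zeros(4), np.ones(4), rng)  # one sampled C_i
```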

[0103] Because Ĉ.sub.i is generated based on the random value ε, in this Ĉ.sub.i generation method, a value of Ĉ.sub.i can be diversified, and a neural network obtained by training based on Ĉ.sub.i may be more robust.

[0104] In FIG. 3, Ĥ.sub.i represents a predicted value of H.sub.i determined based on Ĉ.sub.i, ĥ.sub.i represents a predicted value of h.sub.i determined based on Ĥ.sub.i, and ô.sub.i represents a predicted value of o.sub.i determined based on ĥ.sub.i. By minimizing a loss function (for example, L2) of o.sub.i and ô.sub.i, a neural network generating Ĉ.sub.i based on o.sub.i can be trained, so that Ĉ.sub.i is correct cognition of a neighborhood environment. A reason for this advantageous effect is described in detail below.

[0105] In addition, in FIG. 3, C represents a true value of the neighborhood cognition information of the agent i. By minimizing a loss function (for example, KL) of C and Ĉ.sub.i, the neural network generating Ĉ.sub.i based on o.sub.i can be trained, to keep Ĉ.sub.i consistent with the neighborhood cognition information (for example, Ĉ.sub.j) of the neighborhood agent. This process is shown by a dashed arrow between C and o.sub.i. A reason for the advantageous effect is described in detail below.

[0106] The foregoing describes in detail a method for determining the individual cognition information A.sub.i and the neighborhood cognition information Ĉ.sub.i of the target agent based on the first information H.sub.i. Generally, a plurality of agents located in one neighborhood have the same or a similar environment. Therefore, the cognition of the neighborhood environment by a plurality of agents located in one neighborhood should also be the same or similar. According to this principle, the neighborhood cognition information Ĉ.sub.j of the neighborhood agent may be used to train the neural network generating the neighborhood cognition information Ĉ.sub.i of the target agent, so that Ĉ.sub.j and Ĉ.sub.i are the same or similar.

[0107] Optionally, the neural network generating Ĉ.sub.i may be trained based on a loss function including Ĉ.sub.j and Ĉ.sub.i. For example, the loss function is KL(q(Ĉ.sub.i|o.sub.i; w.sub.i)∥q(Ĉ.sub.j|o.sub.j;w.sub.j)). KL represents KL divergence (Kullback-Leibler divergence), q represents a probability distribution, w.sub.i represents a weight of the neural network generating Ĉ.sub.i based on o.sub.i, and w.sub.j represents a weight of the neural network generating Ĉ.sub.j based on o.sub.j. The KL divergence is also referred to as relative entropy, and is used to describe a difference between two probability distributions. Therefore, the KL divergence may be used as the loss function of Ĉ.sub.j and Ĉ.sub.i.
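
Because both cognition distributions in this example are modeled as diagonal Gaussians, the KL-divergence loss has a closed form. The sketch below is illustrative; the parameterization by per-dimension mean and standard deviation is an assumption:

```python
import numpy as np

# Closed-form KL divergence KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
# between two diagonal Gaussians, summed over dimensions. Illustrative
# stand-in for the loss KL(q(C_i|o_i; w_i) || q(C_j|o_j; w_j)).
def kl_gaussians(mu1, sigma1, mu2, sigma2):
    return np.sum(np.log(sigma2 / sigma1)
                  + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
                  - 0.5)

mu, sigma = np.array([0.1, -0.2]), np.array([1.0, 0.5])
kl_gaussians(mu, sigma, mu, sigma)        # 0.0 when the distributions match
kl_gaussians(mu, sigma, mu + 1.0, sigma)  # > 0 when they differ
```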

[0108] The KL divergence is used here to measure the difference between Ĉ.sub.j and Ĉ.sub.i, but other methods can also measure this difference. For example, Ĉ.sub.j and Ĉ.sub.i are essentially two vectors, so the difference between them may be measured by using a mathematical distance such as the L1 distance or the L2 distance, and the difference between Ĉ.sub.j and Ĉ.sub.i is reduced by updating the neural network generating Ĉ.sub.j or Ĉ.sub.i. The L1 distance may be referred to as the Manhattan distance or the L1 norm (L1-Norm), and the L2 distance may be referred to as the Euclidean distance or the L2 norm (L2-Norm). In the machine learning field, the L1 distance may also be referred to as L1 regularization, and the L2 distance as L2 regularization.

[0109] As described above, an objective of training the neural network generating Ĉ.sub.i based on the loss function including Ĉ.sub.j and Ĉ.sub.i is to enable a plurality of agents located in one neighborhood to have the same or similar cognition of the neighborhood environment. If each agent's predicted value of the neighborhood cognition information is the same as or similar to the true value, the cognition of the neighborhood environment by the plurality of agents located in one neighborhood is necessarily the same or similar.

[0110] Therefore, a neural network generating a predicted value Ĉ.sub.i may be trained based on a true value C of the neighborhood cognition information of the agent i, so that Ĉ.sub.i and C are the same or similar.

[0111] For example, it may be assumed that C is a standard normal distribution whose average value is μ=0 and variance is σ=1, and the neural network generating Ĉ.sub.i is trained by minimizing KL(p(C|μ=0,σ=1)∥q(Ĉ.sub.i|o.sub.i;w.sub.i)), so that Ĉ.sub.i and C are the same or similar, where p represents a prior probability and q represents a posterior probability.
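
With a standard-normal prior for C, this KL term also has a well-known closed form; the sketch below uses the common log-variance parameterization and the argument order KL(q∥p), both of which are assumptions for illustration:

```python
import numpy as np

# Closed-form KL between a diagonal Gaussian q = N(mu, sigma^2) and the
# standard normal prior p = N(0, I):
#   0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2).
# Parameterization and argument order are illustrative assumptions.
def kl_to_standard_normal(mu, log_var):
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)

kl_to_standard_normal(np.zeros(3), np.zeros(3))  # 0.0: q already equals the prior
```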

[0112] When the neighborhood agent (for example, the agent j) also trains the neural network generating Ĉ.sub.j based on the method shown in the foregoing example, Ĉ.sub.j generated by the obtained neural network is the same as or similar to C, so that Ĉ.sub.j and Ĉ.sub.i are the same or similar, that is, consistency between Ĉ.sub.i and the neighborhood cognition information (for example, Ĉ.sub.j) of the neighborhood agent may be enhanced. This is also the principle of the advantageous effect of training a neural network by minimizing the loss function of C and Ĉ.sub.i shown in FIG. 3.

[0113] FIG. 3 also shows training the neural network generating Ĉ.sub.i based on o.sub.i by minimizing the loss function (for example, L2) of o.sub.i and ô.sub.i. For example, the loss function including o.sub.i and ô.sub.i is L2(o.sub.i,ô.sub.i;w.sub.i), where o.sub.i is a true value of the environment information, and ô.sub.i is a predicted value of the environment information. A specific form of the loss function including o.sub.i and ô.sub.i is not limited in this application. Training the neural network generating Ĉ.sub.i based on the loss function including o.sub.i and ô.sub.i can make o.sub.i and ô.sub.i the same or similar. When o.sub.i and ô.sub.i are the same or similar, it indicates that the environment information o.sub.i can be restored from the predicted value Ĉ.sub.i of the neighborhood cognition information, that is, Ĉ.sub.i is correct cognition of the neighborhood environment.
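
A minimal sketch of such an L2 reconstruction loss (the observation values here are illustrative assumptions):

```python
import numpy as np

# L2 (squared-error) loss between the true environment information o_i
# and its reconstruction o_hat_i decoded from C_i. Values are illustrative.
def l2_loss(o, o_hat):
    return np.sum((o - o_hat) ** 2)

o_i = np.array([0.2, 0.5, 0.3])
o_hat_i = np.array([0.25, 0.45, 0.3])
loss = l2_loss(o_i, o_hat_i)  # 0.005 = (-0.05)^2 + 0.05^2 + 0
```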

[0114] After generating the individual cognition information A.sub.i and the neighborhood cognition information Ĉ.sub.i, the target agent may be trained based on the neighborhood cognition information of the target agent.

[0115] Optionally, the target agent may be trained by using a Q value training method. A person skilled in the art can appreciate that, with the development of technologies, other methods that can train the target agent by using the neighborhood cognition information are also applicable to this application.

[0116] The target agent may first perform a bitwise addition operation on A.sub.i and Ĉ.sub.i. The bitwise addition operation refers to performing an addition operation on elements at corresponding locations in different vectors. For example, A.sub.i is a 3-dimensional vector [0.25, 0.1, 0.3], Ĉ.sub.i is a 3-dimensional vector [0.1, 0.2, 0.15], and a result of performing the bitwise addition operation on A.sub.i and Ĉ.sub.i is [0.35, 0.3, 0.45].

[0117] A Q value Q.sub.i of the target agent may be generated by using a Q value neural network based on the result obtained after the bitwise addition operation is performed on A.sub.i and Ĉ.sub.i. For example, Q.sub.i=f(X*W), where X is the result obtained after the bitwise addition operation is performed on A.sub.i and Ĉ.sub.i, for example, the 3-dimensional vector [0.35, 0.3, 0.45]; W is a weight matrix of the Q value neural network, for example, a 3*K-dimensional weight matrix; K is the dimension of Q.sub.i (that is, the quantity of elements in the action set of the agent i); and f(*) is a function that performs a non-linear operation on *. Compared with a linear function, the non-linear function can enhance the expression capability of the neural network. Common choices of f include the sigmoid function and the rectified linear unit (ReLU) function.
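
The computation Q.sub.i=f(X*W) described above can be sketched as follows; the weight values and the action-set size K=4 are illustrative assumptions:

```python
import numpy as np

# Sketch of Q_i = f(X @ W), where X is the bitwise (elementwise) sum of
# A_i and C_i, W is a 3 x K weight matrix, and f is ReLU. The weight
# values and K = 4 are illustrative assumptions.
def relu(x):
    return np.maximum(x, 0.0)

A_i = np.array([0.25, 0.1, 0.3])
C_i = np.array([0.1, 0.2, 0.15])
X = A_i + C_i                  # bitwise addition -> [0.35, 0.3, 0.45]
K = 4                          # number of actions in the action set
W = np.full((3, K), 0.5)       # illustrative 3 x K weight matrix
Q_i = relu(X @ W)              # one Q value per action
```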

[0118] Optionally, Q.sub.i may be directly generated by combining A.sub.i and Ĉ.sub.i. A specific manner of generating Q.sub.i is not limited in this application.

[0119] Then, the target agent may be trained by using the Q value.

[0120] The Q value is used to evaluate the quality of an action. The target agent can determine a final output action based on the Q values corresponding to different actions. After the target agent performs the finally output action, feedback on the action is obtained from the environment, and the neural network generating the action, that is, the target agent, is trained based on the feedback.

[0121] For example, a Q value of the agent i is Q.sub.i, and the agent i may generate an action based on Q.sub.i, where the action is, for example, a traffic scheduling instruction a.sub.i*, and a.sub.i*=arg max.sub.a.sub.iQ.sub.i(o.sub.i,a.sub.i). For example, a.sub.i* is a traffic proportion (rate1%, rate2%, rate3%, . . . ) of an aggregation flow, passing through the router i, on an egress port set, and indicates an amount of traffic sent to nodes in (Nexthop1, Nexthop2, Nexthop3, . . . ). a.sub.i indicates a specific action. For example, currently there are four actions (that is, there are four values of a.sub.i), and each action corresponds to one Q value: Q(o,↑), Q(o,↓), Q(o,←), and Q(o,→). The agent i may select the action (for example, a.sub.i*) with the maximum Q value from the actions for execution. Then, the agent i may minimize a temporal difference (TD) loss function based on feedback of a.sub.i* to train the neural network generating the action.
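
A minimal sketch of the greedy action selection and a one-step TD loss; the Q values, reward, and discount factor below are illustrative assumptions:

```python
import numpy as np

# Greedy action selection a* = argmax_a Q_i(o_i, a), followed by a
# one-step TD loss. Q values, reward, and discount are illustrative.
Q_i = np.array([0.2, 0.8, 0.5, 0.1])     # Q(o,up), Q(o,down), Q(o,left), Q(o,right)
a_star = int(np.argmax(Q_i))             # index of the max-Q action

reward, gamma = 1.0, 0.9                 # feedback from the environment
Q_next = np.array([0.3, 0.6, 0.4, 0.2])  # Q values at the next observation
td_target = reward + gamma * Q_next.max()
td_loss = (td_target - Q_i[a_star]) ** 2  # minimized to train the network
```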

[0122] Because the Q value of the target agent is generated based on A.sub.i and Ĉ.sub.i, the target agent can enhance consistency between Ĉ.sub.i and the neighborhood cognition information (for example, Ĉ.sub.j) of the neighborhood agent by training the neural network generating Ĉ.sub.i. In addition, the target agent can improve a degree of correct cognition of the target agent on the neighborhood environment by training the neural network generating Ĉ.sub.i, thereby improving accuracy of the Q value. Compared with a neural network training method in which Q is directly generated based on the first information, an action generated by a neural network obtained through training according to the method 200 can improve collaboration between a plurality of agents.

[0123] Refer to FIG. 4. The following further describes an agent training method according to this application. The method shown in FIG. 4 may be performed by a router i. The router i is an example of the agent i described above, and may be any one of the six routers shown in FIG. 1. A router j is one neighborhood router of the router i. The router i may perform the following steps.

[0124] Step 1: The router i senses environment information o.sub.i.

[0125] Step 2: The router i processes o.sub.i into h.sub.i by using a fully connected (FC) network. h.sub.i may be referred to as second information of the router i, and represents information obtained based on o.sub.i after abstraction.

[0126] Step 3: The router i obtains second information of all neighborhood routers. A neighborhood router of the router i may be represented as j∈N(i), where N(i) is the set of all the neighborhood routers of the router i, and j is one element of the set, that is, the router j. Environment information of the router j is o.sub.j, and the router j may process o.sub.j into h.sub.j by using the FC network of the router j. h.sub.j is second information of the router j.

[0127] The router i may process h.sub.i and the second information of the neighborhood routers into first information H.sub.i of the router i by using a graph convolutional network (GCN), for example, by performing a weighted sum operation on h.sub.i and the second information of all the neighborhood routers of the router i to obtain H.sub.i. All the neighborhood routers of the router i may be represented as N(i), and the first information of the router i may be determined according to the following formula:

[00001] H.sub.i=σ(W·Σ.sub.j∈N(i)∪{i} h.sub.j/√(|N(j)|·|N(i)|))  (1)

[0128] σ represents a non-linear function, and is used to improve an expression capability of a neural network; W represents a weight of the GCN; ∪ is a union set symbol; {i} represents the router i; |N(j)| represents a quantity of all neighborhood routers of the router j; and |N(i)| represents a quantity of all the neighborhood routers of the router i.
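
Formula (1) can be sketched as follows; the graph, feature values, weight matrix, and the choice of tanh as the non-linearity σ are illustrative assumptions:

```python
import numpy as np

# Sketch of formula (1): H_i = sigma(W times the sum, over j in
# N(i) ∪ {i}, of h_j normalized by the degrees of i and j).
# The graph and all values are illustrative assumptions.
def gcn_aggregate(i, neighbors, h, W):
    group = neighbors[i] | {i}
    s = sum(h[j] / np.sqrt(len(neighbors[j]) * len(neighbors[i])) for j in group)
    return np.tanh(W @ s)  # tanh as an example non-linearity sigma

# Router E with neighborhood routers A, B, F (as in FIG. 1).
neighbors = {"E": {"A", "B", "F"}, "A": {"E"}, "B": {"E"}, "F": {"E"}}
h = {n: np.ones(2) for n in neighbors}  # illustrative second information h_j
W = np.eye(2)                           # illustrative GCN weight
H_E = gcn_aggregate("E", neighbors, h, W)
```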

[0129] There are two optional methods in a process of generating H.sub.i based on h.sub.i and h.sub.j.

[0130] In a first method, h.sub.i and h.sub.j are first processed (for example, combined or a weighted sum operation is performed) to obtain a larger matrix, and then a matrix multiplication operation is performed on the matrix to obtain H.sub.i.

[0131] In a second method, a multiplication operation is performed on h.sub.i and a first matrix to obtain a first result, a multiplication operation is performed on h.sub.j and a second matrix to obtain a second result, and then H.sub.i is generated based on the first result and the second result. For example, a weighted sum operation is performed on the first result and the second result or they are combined, to obtain H.sub.i.

[0132] Because h.sub.i and h.sub.j are two small-sized matrices, compared with the first method, the second method can reduce an amount of computation required for generating H.sub.i. In addition, the first matrix and the second matrix may be a same matrix, or may be different matrices. When the first matrix is the same as the second matrix, h.sub.i and h.sub.j share a same set of parameters, which helps a GCN learn more content.
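
The two options can be sketched as follows. When the first matrix and the second matrix are the same and the combination is linear (for example, a weighted sum), both orders give the same result, which is why the second method can reduce computation without changing the output; the sizes, weights, and 0.5/0.5 weighted sum below are illustrative assumptions:

```python
import numpy as np

# Contrast of the two H_i generation orders described above.
# Sizes, weights, and the 0.5/0.5 weighted sum are illustrative.
rng = np.random.default_rng(1)
h_i, h_j = rng.random((1, 4)), rng.random((1, 4))
M = rng.random((4, 3))  # shared matrix (first matrix == second matrix case)

# First method: combine h_i and h_j first, then one matrix product.
H_first = (0.5 * h_i + 0.5 * h_j) @ M

# Second method: multiply each small matrix separately, then combine.
H_second = 0.5 * (h_i @ M) + 0.5 * (h_j @ M)

# By linearity, both orders agree when the matrix is shared.
assert np.allclose(H_first, H_second)
```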

[0133] Step 4: The router i processes H.sub.i into A.sub.i and Ĉ.sub.i by using a cognition (cognition) network.

[0134] Step 5: The router i generates ô.sub.i based on Ĉ.sub.i. Ĥ.sub.i represents a predicted value of H.sub.i determined based on Ĉ.sub.i, ĥ.sub.i represents a predicted value of h.sub.i determined based on Ĥ.sub.i, and ô.sub.i represents a predicted value of o.sub.i determined based on ĥ.sub.i. By minimizing a loss function (for example, L2) of o.sub.i and ô.sub.i, a neural network generating Ĉ.sub.i based on o.sub.i can be trained, so that Ĉ.sub.i is correct cognition of a neighborhood environment. The neural network generating Ĉ.sub.i based on o.sub.i is, for example, one or more of the FC network, the GCN, and the cognition network shown in FIG. 4.

[0135] Step 6: The router i obtains neighborhood cognition information of all the neighborhood routers, and minimizes a loss function including Ĉ.sub.i and the neighborhood cognition information of all the neighborhood routers, so that Ĉ.sub.i is consistent with the neighborhood cognition information of all the neighborhood routers.

[0136] For example, after obtaining neighborhood cognition information Ĉ.sub.j of the router j, the router i may minimize KL(q(Ĉ.sub.i|o.sub.i; w.sub.i)∥q(Ĉ.sub.j|o.sub.j;w.sub.j)) to make Ĉ.sub.i and Ĉ.sub.j consistent (the same or similar). w.sub.i represents a weight of the neural network generating Ĉ.sub.i based on o.sub.i, and w.sub.j represents a weight of the neural network generating Ĉ.sub.j based on o.sub.j. The neural network generating Ĉ.sub.i based on o.sub.i is, for example, one or more of the FC network, the GCN, and the cognition network shown in FIG. 4.
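When the posteriors q(Ĉ|o; w) are diagonal Gaussians (as in the variational formulation used later in this document), the KL divergence minimized above has a closed form. A minimal numpy sketch, with illustrative means and variances:

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

# Hypothetical posteriors q(C_hat_i | o_i; w_i) and q(C_hat_j | o_j; w_j).
mu_i, var_i = np.array([0.2, -0.1]), np.array([1.0, 0.5])
mu_j, var_j = np.array([0.0, 0.0]), np.array([1.0, 1.0])

kl_ij = kl_diag_gauss(mu_i, var_i, mu_j, var_j)
```

The divergence is zero exactly when the two posteriors coincide, which is the consistency condition the training drives toward.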

[0137] It should be noted that, for brevity, a neural network of the router i and a neural network of the router j are not distinguished in FIG. 4. Actually, an FC network, a GCN, a cognition network, and a Q value network are separately deployed on the router i and the router j. In addition, because environment information of the router i and the router j are usually not completely the same, training results of the neural networks separately deployed on the router i and the router j are typically different.

[0138] Step 7: The router i performs a bitwise addition operation on A.sub.i and Ĉ.sub.i by using the Q value network, to obtain a Q value Q.sub.i.

[0139] Step 8: The router i generates an action based on Q.sub.i, where the action is, for example, a traffic scheduling instruction a.sub.i*, and a.sub.i*=arg max.sub.a.sub.iQ.sub.i(o.sub.i,a.sub.i). For example, a.sub.i* is a traffic proportion (rate1%, rate2%, rate3%, . . . ) of an aggregation flow, passing through the router i, on an egress port set, and indicates an amount of traffic sent to nodes in (Nexthop1, Nexthop2, Nexthop3, . . . ).
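Greedy action selection over a finite candidate set can be sketched as follows; the candidate traffic splits and Q values below are hypothetical:

```python
import numpy as np

# Hypothetical candidate actions: traffic proportions over three next hops
# (Nexthop1, Nexthop2, Nexthop3), each summing to 1.
actions = [(0.7, 0.2, 0.1), (0.5, 0.3, 0.2), (0.2, 0.4, 0.4)]

# Hypothetical Q values of router i for each candidate action under o_i.
q_values = np.array([1.3, 2.1, 0.4])

best = int(np.argmax(q_values))  # index of the highest-valued action
a_i_star = actions[best]         # the traffic scheduling instruction a_i*
```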

[0140] Step 9: The router i may obtain feedback r.sub.i of a.sub.i* from an environment, minimize a TD loss function based on r.sub.i, and backpropagate a gradient generated by minimizing the TD loss function to train the agent i, to obtain an accurate Q.sub.i or a.sub.i*. A neural network generating the action is, for example, one or more of the FC network, the GCN, the cognition network, and the Q value network shown in FIG. 4.

[0141] Each agent i may be trained according to formula (2).


L.sup.total(w)=L.sup.td(w)+αΣ.sub.i=1.sup.NL.sub.i.sup.cd(w)  (2)

[0142] L.sup.total(w) is a weighted sum of the TD loss function L.sup.td(w) and a cognition-dissonance (CD) loss function L.sub.i.sup.cd(w). L.sub.i.sup.cd(w) is used to reduce a cognition-dissonance loss, that is, to make cognition of a plurality of agents consistent; α is a real number, and represents a weight coefficient of L.sub.i.sup.cd(w); w represents a set of parameters of all agents (the parameter w.sub.i of the agent i is a part of the set); and N represents that there are a total of N agents in a multi-agent system. The N agents share one TD loss function, and each of the N agents has its own CD loss function.
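Formula (2) is a plain weighted sum: one shared TD loss plus an α-weighted sum of the per-agent CD losses. A one-function sketch, with hypothetical loss values for N = 3 agents:

```python
def total_loss(l_td, l_cds, alpha):
    """Formula (2): shared TD loss plus alpha times the sum of per-agent CD losses."""
    return l_td + alpha * sum(l_cds)

# Hypothetical values: one shared TD loss, one CD loss per agent.
L_total = total_loss(l_td=0.5, l_cds=[0.1, 0.2, 0.3], alpha=0.5)
```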

[0143] L.sup.td(w) may be determined according to formula (3).


L.sup.td(w)=E.sub.({right arrow over (o)},{right arrow over (a)},r,{right arrow over (o)}′)[(y.sub.total−Q.sub.total({right arrow over (o)},{right arrow over (a)};w)).sup.2]  (3)

[0144] E.sub.({right arrow over (o)},{right arrow over (a)},r,{right arrow over (o)}′) [expression] represents performing a sampling operation on ({right arrow over (o)},{right arrow over (a)},r,{right arrow over (o)}′), and then calculating an expected value of expression based on all samples ({right arrow over (o)},{right arrow over (a)},r,{right arrow over (o)}′); {right arrow over (o)} represents joint observation of all the agents, that is, {right arrow over (o)}=<o.sub.1, o.sub.2, . . . , o.sub.N>; {right arrow over (a)} represents a joint action of all the agents, that is, {right arrow over (a)}=<a.sub.1, a.sub.2, . . . , a.sub.N>; r represents a reward value fed back by the environment to all the agents after all the agents perform the joint action {right arrow over (a)} with the joint observation {right arrow over (o)}; {right arrow over (o)}′ represents new joint observation fed back by the environment to all the agents after all the agents perform the joint action {right arrow over (a)} with the joint observation {right arrow over (o)}; Q.sub.total represents Q values of the plurality of agents; and y.sub.total may be determined according to formula (4).

y.sub.total=r+γ max.sub.{right arrow over (a)}′Q.sub.total({right arrow over (o)}′,{right arrow over (a)}′;w.sup.−)  (4)

[0145] γ represents a real number; {right arrow over (a)}′ represents a joint action performed by all of the agents under the new joint observation {right arrow over (o)}′; and w.sup.− represents a parameter of a target neural network, which is identical to w before training starts. There are two update manners in a training process: (1) No update is performed in S training steps, and after the S training steps end, a value of w is assigned to w.sup.−. (2) An update is performed in each training step, in the manner w.sup.−=βw.sup.−+(1−β)w, where β is a real number used to control an update rate of w.sup.− (it should be noted that w is updated in each training step regardless of the update manner of w.sup.−, by minimizing the total loss function L.sup.total(w) defined in formula (2)).
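The TD target of formula (4), the squared TD error of formula (3) (shown here for a single sample rather than an expectation), and the two target-parameter update manners can be sketched together. All numeric values below are hypothetical:

```python
import numpy as np

gamma = 0.9
r = 1.0                              # reward fed back for the joint action
q_next = np.array([0.5, 1.5, 1.0])   # Q_total(o', a'; w-) over candidate joint actions a'

# Formula (4): TD target computed with the target parameters w-.
y_total = r + gamma * np.max(q_next)

# Formula (3), single-sample version: squared error against the current estimate.
q_total = 2.0                        # Q_total(o, a; w)
td_loss = (y_total - q_total) ** 2

w = np.array([1.0, 2.0])             # online parameters, updated every step

# Update manner (1): hard copy, assign w to w- after S training steps.
w_minus_hard = w.copy()

# Update manner (2): soft update each step, w- = beta * w- + (1 - beta) * w.
beta = 0.9
w_minus = np.array([0.0, 0.0])
w_minus_soft = beta * w_minus + (1.0 - beta) * w
```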

[0146] L.sub.i.sup.cd(w) in formula (2) may be determined according to formula (5).

L.sub.i.sup.cd(w)=E.sub.o.sub.i[L2(o.sub.i,ô.sub.i;w)+KL(q(Ĉ.sub.i|o.sub.i;w)∥p(C))]≈E.sub.o.sub.i[L2(o.sub.i,ô.sub.i;w)+(1/|N(i)|)Σ.sub.j∈N(i)KL(q(Ĉ.sub.i|o.sub.i;w)∥q(Ĉ.sub.j|o.sub.j;w))]  (5)

[0147] It should be noted that w in formula (5) represents the set of parameters of all the agents; therefore, the parameter w.sub.i of the agent i is not further distinguished from the set. N(i) represents the neighborhood agent set of the agent i, and |N(i)| represents a quantity of agents in the set.
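A single-sample sketch of formula (5): the L2 reconstruction term plus the average KL divergence between agent i's posterior and those of its neighbors, with the expectation over o.sub.i dropped. All posteriors below are hypothetical diagonal Gaussians:

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

# Hypothetical observation and its reconstruction for agent i.
o_i = np.array([0.5, -0.2])
o_hat_i = np.array([0.4, -0.1])
l2_term = float(np.sum((o_i - o_hat_i) ** 2))

# Hypothetical posteriors q(C_hat | o; w): (mean, variance) per agent,
# for agent i and its neighborhood N(i) = {j, k}.
posteriors = {
    "i": (np.array([0.1, 0.0]), np.array([1.0, 1.0])),
    "j": (np.array([0.2, 0.1]), np.array([0.8, 1.2])),
    "k": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
}
neighbors = ["j", "k"]

# Average KL between agent i's posterior and each neighbor's posterior.
kl_term = float(np.mean(
    [kl_diag_gauss(*posteriors["i"], *posteriors[n]) for n in neighbors]
))

cd_loss = l2_term + kl_term  # single-sample cognition-dissonance loss
```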

[0148] Formula (2) to formula (5) are examples of formulas used when the neural network generating Ĉ.sub.i and the agent i are synchronously trained. Optionally, the router i may first complete training of the neural network generating Ĉ.sub.i, then generate Q.sub.i based on Ĉ.sub.i generated by the neural network, and train the agent i based on Q.sub.i.

[0149] In addition to training the agent by using Q.sub.i, the router i may also use Q.sub.i and another Q value to train the agent.

[0150] FIG. 5 shows an agent training method using a plurality of Q values according to some embodiments.

[0151] Compared with FIG. 4, one more Q value hybrid network is deployed for the router i in FIG. 5. The network is used to process Q values of a plurality of routers into Q.sub.total. The plurality of routers may be routers belonging to one neighborhood, or may be routers belonging to a plurality of neighborhoods. For example, the Q value hybrid network may perform a weighted sum operation on Q.sub.i and Q.sub.j (a Q value of the router j). In this way, Q.sub.total can better reflect a proportion of a task undertaken by a single router to tasks undertaken by the plurality of routers, and an action generated based on Q.sub.total can enhance global coordination.
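The weighted-sum example above can be sketched in a few lines; the weights are hypothetical fixed values standing in for the learned Q value hybrid network, which in practice would produce them from the routers' states:

```python
import numpy as np

q_i, q_j = 1.2, 0.8  # hypothetical Q values of routers i and j

# Hypothetical mixing weights reflecting each router's share of the
# neighborhood's tasks; a trained hybrid network would output these.
weights = np.array([0.7, 0.3])

q_total = float(weights @ np.array([q_i, q_j]))  # mixed Q value Q_total
```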

[0152] The foregoing describes in detail the agent training method provided in this application. After agent training converges, an agent may generate an action according to the method shown in FIG. 6. The method 600 may include the following steps.

[0153] S610: An agent i senses environment information.

[0154] S620: The agent i processes the environment information into second information by using an FC network.

[0155] S630: The agent i obtains second information of all neighborhood agents, and processes all the second information into first information by using a GCN.

[0156] S640: The agent i processes the first information by using a cognition network, and generates individual cognition information and neighborhood cognition information.

[0157] S650: The agent i performs a bitwise addition operation on the individual cognition information and the neighborhood cognition information by using a Q value network, and generates a Q value based on a result of the operation.

[0158] S660: The agent i generates an action (for example, a flow scheduling instruction) based on the Q value, and applies the action to an environment.
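Steps S610 to S660 can be sketched end to end as one forward pass. All weight matrices and dimensions below are hypothetical fixed stand-ins for the trained FC network, GCN, cognition network, and Q value network:

```python
import numpy as np

rng = np.random.default_rng(2)
d_o, d_h, d_c, n_actions = 4, 3, 2, 3

# Hypothetical post-training weights.
W_fc = rng.standard_normal((d_o, d_h))        # FC network
W_gcn_self = rng.standard_normal((d_h, d_c))  # GCN: own branch
W_gcn_nbr = rng.standard_normal((d_h, d_c))   # GCN: neighbor branch
W_a = rng.standard_normal((d_c, d_c))         # cognition network: individual head
W_c = rng.standard_normal((d_c, d_c))         # cognition network: neighborhood head
W_q = rng.standard_normal((d_c, n_actions))   # Q value network

o_i = rng.standard_normal(d_o)  # S610: agent i senses environment information
o_j = rng.standard_normal(d_o)  # neighborhood agent's environment information

h_i = o_i @ W_fc                # S620: second information via the FC network
h_j = o_j @ W_fc                # S630: neighborhood second information
H_i = h_i @ W_gcn_self + h_j @ W_gcn_nbr  # S630: first information via the GCN

A_i = H_i @ W_a                 # S640: individual cognition information
C_i = H_i @ W_c                 # S640: neighborhood cognition information

q_i = (A_i + C_i) @ W_q         # S650: bitwise addition, then Q values
action = int(np.argmax(q_i))    # S660: generate the action from the Q value
```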

[0159] Compared with the method 200, the method 600 does not need to update a parameter of the agent. In addition, an environment in which the agent i in the method 600 is located may change compared with an environment in which the agent i in the method 200 is located. Therefore, all information in the method 600 may be different from all information in the method 200. The information in the method 600 may be referred to as target information, and the information in the method 200 may be referred to as training information. For example, the environment information, the first information, the second information, the individual cognition information, and the neighborhood cognition information in the method 600 may be respectively referred to as target environment information, target first information, target second information, target individual cognition information, and target neighborhood cognition information; and the environment information, the first information, the second information, the individual cognition information, and the neighborhood cognition information in the method 200 may be respectively referred to as training environment information, first training information, second training information, training individual cognition information, and training neighborhood cognition information.

[0160] An agent obtained by training according to the method 200 may have a high degree of correct cognition on a neighborhood environment, and cognition of the agent obtained by training according to the method 200 on the neighborhood environment is consistent with cognition of another agent in a neighborhood on the neighborhood environment. Therefore, the action generated by the agent in the method 600 can improve collaboration between the plurality of agents.

[0161] The foregoing describes in detail examples of the agent training method and the agent-based action generation method that are provided in this application. It can be understood that, to implement the foregoing functions, a corresponding apparatus includes a corresponding hardware structure and/or software module for executing each function. A person skilled in the art should be easily aware that, with reference to units, circuits, and algorithm steps in the examples described in embodiments disclosed in this specification, this application can be implemented in a form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or hardware driven by computer software depends on particular applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.

[0162] In this application, an agent training apparatus and an agent-based action generation apparatus may be divided into functional units according to the foregoing method, for example, each functional unit may be obtained through division based on each corresponding function, or two or more functions may be integrated into one processing module. The integrated unit may be implemented in a form of hardware (e.g., circuits), or may be implemented in a form of a software functional unit. It should be noted that, in this application, division into the units is an example, and is merely a logical function division. During actual implementation, another division manner may be implemented.

[0163] FIG. 7 is a schematic diagram of a structure of an agent training apparatus according to some embodiments. The apparatus 700 includes a processing unit (e.g., a processing circuit) 710 and a communication unit (e.g., a communication circuit) 720. The communication unit 720 can perform a sending step and/or a receiving step under control of the processing unit 710.

[0164] The communication unit 720 is configured to obtain environment information of a first agent and environment information of a second agent.

[0165] The processing unit 710 is configured to: generate first information based on the environment information of the first agent and the environment information of the second agent; and train the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information. The neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.

[0166] Optionally, the processing unit 710 is specifically configured to: generate second information h.sub.i of the first agent based on the environment information of the first agent; generate second information h.sub.j of the second agent based on the environment information of the second agent; and generate the first information based on h.sub.i and h.sub.j.

[0167] Optionally, the processing unit 710 is specifically configured to: determine a first result based on a product of h.sub.i and a first matrix; determine a second result based on a product of h.sub.j and a second matrix; and generate the first information based on the first result and the second result.

[0168] Optionally, the communication unit 720 is further configured to obtain the neighborhood cognition information Ĉ.sub.j of the second agent; and the processing unit 710 is further configured to train a neural network generating the neighborhood cognition information Ĉ.sub.i of the first agent based on the neighborhood cognition information Ĉ.sub.j of the second agent, so that Ĉ.sub.j is consistent with Ĉ.sub.i.

[0169] Optionally, the processing unit 710 is specifically configured to train the neural network generating Ĉ.sub.i based on a loss function including Ĉ.sub.j and Ĉ.sub.i.

[0170] Optionally, the loss function including Ĉ.sub.j and Ĉ.sub.i is KL(q(Ĉ.sub.i|o.sub.i; w.sub.i)∥q(Ĉ.sub.j|o.sub.j;w.sub.j)). KL represents KL divergence, q represents a probability distribution, o.sub.i represents the environment information of the first agent, w.sub.i represents a weight of the neural network generating Ĉ.sub.i based on o.sub.i, o.sub.j represents the environment information of the second agent, and w.sub.j represents a weight of the neural network generating Ĉ.sub.j based on o.sub.j.

[0171] Optionally, the processing unit 710 is configured to determine the neighborhood cognition information Ĉ.sub.i of the first agent based on the first information and a variational autoencoder.

[0172] Optionally, the processing unit 710 is configured to: determine a distribution average value Ĉ.sub.i.sup.μ and a distribution variance Ĉ.sub.i.sup.σ of the neighborhood cognition information of the first agent based on the first information; obtain a random value ε by sampling from a unit Gaussian distribution; and determine Ĉ.sub.i based on Ĉ.sub.i.sup.μ, Ĉ.sub.i.sup.σ, and ε, where Ĉ.sub.i=Ĉ.sub.i.sup.μ+Ĉ.sub.i.sup.σ⊙ε, and ⊙ represents an element-wise product.
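This is the reparameterization trick used in variational autoencoders: the sample is expressed as a deterministic function of the distribution parameters and unit-Gaussian noise. A minimal numpy sketch, with hypothetical parameter values and Ĉ.sub.i.sup.σ treated as a per-dimension scale:

```python
import numpy as np

mu = np.array([0.5, -1.0])    # distribution average value (C_i^mu)
sigma = np.array([0.3, 0.2])  # per-dimension scale (C_i^sigma)

rng = np.random.default_rng(3)
eps = rng.standard_normal(mu.shape)  # sample from a unit Gaussian

C_hat_i = mu + sigma * eps  # element-wise product: reparameterized sample
```

Because the randomness is isolated in `eps`, gradients can flow through `mu` and `sigma` during training.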

[0173] Optionally, the processing unit 710 is further configured to: determine an estimate ô.sub.i of the environment information of the first agent based on the neighborhood cognition information Ĉ.sub.i of the first agent; and train the neural network generating Ĉ.sub.i based on a loss function including o.sub.i and ô.sub.i.

[0174] Optionally, the loss function including o.sub.i and ô.sub.i is L2(o.sub.i,ô.sub.i;w.sub.i), L2 represents L2 regularization, and w.sub.i represents the weight of the neural network generating Ĉ.sub.i based on o.sub.i.

[0175] Optionally, the processing unit 710 is further configured to: determine a Q value of the first agent based on the individual cognition information and the neighborhood cognition information of the first agent; and train the first agent based on the Q value of the first agent.

[0176] Optionally, the processing unit 710 is configured to: determine Q values Q.sub.total of a plurality of agents based on the Q value of the first agent and a Q value of the second agent; and train the first agent based on Q.sub.total.

[0177] For a manner in which the apparatus 700 performs the agent training method and an advantageous effect generated by the method, refer to related descriptions in the method embodiments.

[0178] FIG. 8 is a schematic diagram of a structure of an agent-based instruction generation apparatus according to some embodiments. The apparatus 800 includes a processing unit (e.g., processing circuit) 810 and a communication unit (e.g., communication circuit) 820. The communication unit 820 can perform a sending step and/or a receiving step under control of the processing unit 810.

[0179] The communication unit 820 is configured to obtain target environment information of a first agent and target environment information of a second agent.

[0180] The processing unit 810 is configured to: generate target first information based on the target environment information of the first agent and the target environment information of the second agent; output target individual cognition information and target neighborhood cognition information of the first agent based on the target first information, where the target neighborhood cognition information of the first agent is consistent with target neighborhood cognition information of the second agent; and generate an instruction based on the target individual cognition information and the target neighborhood cognition information of the first agent.

[0181] Optionally, the processing unit 810 is configured to: generate target second information of the first agent based on the target environment information of the first agent; generate target second information of the second agent based on the target environment information of the second agent; and generate the target first information based on the target second information of the first agent and the target second information of the second agent.

[0182] Optionally, the processing unit 810 is configured to: generate a target Q value based on the target individual cognition information and the target neighborhood cognition information of the first agent; and generate the instruction based on the target Q value.

[0183] Optionally, the communication unit 820 is further configured to obtain training environment information of the first agent and training environment information of the second agent; and the processing unit 810 is further configured to: generate first training information based on the training environment information of the first agent and the training environment information of the second agent; and train the first agent by using the first training information, so that the first agent outputs training individual cognition information and training neighborhood cognition information, where the training neighborhood cognition information of the first agent is consistent with training neighborhood cognition information of the second agent.

[0184] Optionally, the processing unit 810 is configured to: generate second training information h.sub.i of the first agent based on the training environment information of the first agent; generate second training information h.sub.j of the second agent based on the training environment information of the second agent; and generate the first training information based on h.sub.i and h.sub.j.

[0185] Optionally, the processing unit 810 is configured to: determine a first result based on a product of h.sub.i and a first matrix; determine a second result based on a product of h.sub.j and a second matrix; and generate the first training information based on the first result and the second result.

[0186] Optionally, the communication unit 820 is further configured to obtain the training neighborhood cognition information Ĉ.sub.j of the second agent; and the processing unit 810 is further configured to train a neural network generating the training neighborhood cognition information Ĉ.sub.i of the first agent based on the neighborhood cognition information Ĉ.sub.j of the second agent, so that Ĉ.sub.j is consistent with Ĉ.sub.i.

[0187] Optionally, the processing unit 810 is configured to train the neural network generating Ĉ.sub.i based on a loss function including Ĉ.sub.j and Ĉ.sub.i.

[0188] Optionally, the loss function including Ĉ.sub.j and Ĉ.sub.i is KL(q(Ĉ.sub.i|o.sub.i; w.sub.i)∥q(Ĉ.sub.j|o.sub.j;w.sub.j)). KL represents KL divergence, q represents a probability distribution, o.sub.i represents the training environment information of the first agent, w.sub.i represents a weight of the neural network generating Ĉ.sub.i based on o.sub.i, o.sub.j represents the training environment information of the second agent, and w.sub.j represents a weight of the neural network generating Ĉ.sub.j based on o.sub.j.

[0189] Optionally, the processing unit 810 is configured to determine the training neighborhood cognition information Ĉ.sub.i of the first agent based on the first training information and a variational autoencoder.

[0190] Optionally, the processing unit 810 is configured to: determine a distribution average value Ĉ.sub.i.sup.μ and a distribution variance Ĉ.sub.i.sup.σ of the training neighborhood cognition information of the first agent based on the first training information; obtain a random value ε by sampling from a unit Gaussian distribution; and determine Ĉ.sub.i based on Ĉ.sub.i.sup.μ, Ĉ.sub.i.sup.σ, and ε, where Ĉ.sub.i=Ĉ.sub.i.sup.μ+Ĉ.sub.i.sup.σ⊙ε, and ⊙ represents an element-wise product.

[0191] Optionally, the processing unit 810 is further configured to: determine an estimate ô.sub.i of the training environment information of the first agent based on the training neighborhood cognition information Ĉ.sub.i of the first agent; and train the neural network generating Ĉ.sub.i based on a loss function including o.sub.i and ô.sub.i.

[0192] Optionally, the loss function including o.sub.i and ô.sub.i is L2(o.sub.i,ô.sub.i;w.sub.i), L2 represents L2 regularization, and w.sub.i represents the weight of the neural network generating Ĉ.sub.i based on o.sub.i.

[0193] Optionally, the processing unit 810 is further configured to: determine a training Q value of the first agent based on the training individual cognition information and the training neighborhood cognition information of the first agent; and train the first agent based on the training Q value of the first agent.

[0194] Optionally, the processing unit 810 is configured to: determine training Q values Q.sub.total of a plurality of agents based on the training Q value of the first agent and a training Q value of the second agent; and train the first agent based on Q.sub.total.

[0195] For a manner in which the apparatus 800 performs the agent training method and an advantageous effect generated by the method, refer to related descriptions in the method embodiments.

[0196] Optionally, the apparatus 800 and the apparatus 700 are a same apparatus.

[0197] FIG. 9 shows a schematic diagram of a structure of an electronic device according to some embodiments. A dashed line in FIG. 9 indicates that the unit or the module is optional. A device 900 may be configured to implement the method described in the foregoing method embodiments. The device 900 may be a terminal device, a server, or a chip.

[0198] The device 900 includes one or more processors 901. The one or more processors 901 may support the device 900 in implementing the methods in the method embodiments corresponding to FIG. 2 to FIG. 6. The processor 901 may be a general-purpose processor or a dedicated processor. The processor 901 may be a central processing unit (central processing unit, CPU). The CPU may be configured to control the device 900, execute a software program, and process data of the software program. The device 900 may further include a communication unit 905, configured to input (receive) and output (send) a signal.

[0199] For example, the device 900 may be a chip, and the communication unit 905 may be an input circuit and/or an output circuit of the chip, or the communication unit 905 may be a communication interface of the chip. The chip may be used as a component of a terminal device, a network device, or another electronic device.

[0200] For another example, the device 900 may be a terminal device or a server, and the communication unit 905 may be a transceiver of the terminal device or the server, or the communication unit 905 may be a transceiver circuit of the terminal device or the server.

[0201] The device 900 may include one or more memories 902. The memory 902 stores a program 904, and the program 904 may be run by the processor 901 to generate an instruction 903, so that the processor 901 performs, based on the instruction 903, the methods described in the foregoing method embodiments. Optionally, the memory 902 may further store data. Optionally, the processor 901 may further read the data stored in the memory 902. The data and the program 904 may be stored in a same storage address, or the data and the program 904 may be stored in different storage addresses.

[0202] The processor 901 and the memory 902 may be separately disposed, or may be integrated together, for example, may be integrated on a system on chip (system on chip, SOC) of a terminal device.

[0203] The device 900 may further include an antenna 906. The communication unit 905 is configured to implement a receiving and sending function of the device 900 by using the antenna 906.

[0204] For a manner in which the processor 901 performs the agent training method, refer to related descriptions in the method embodiment.

[0205] It should be understood that the steps in the foregoing method embodiments may be implemented by using a logic circuit in a form of hardware or an instruction in a form of software in the processor 901. The processor 901 may be a CPU, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application-specific integrated circuit, ASIC), or a field programmable gate array (field programmable gate array, FPGA) or another programmable logic device such as a discrete gate, a transistor logic device, or a discrete hardware component.

[0206] This application further provides a computer program product. When the computer program product is executed by the processor 901, the method according to any method embodiment of this application is implemented.

[0207] The computer program product such as the program 904 may be stored in the memory 902. After being preprocessed, compiled, assembled, linked, and the like, the program 904 is finally converted into an executable target file that can be executed by the processor 901.

[0208] This application further provides a computer-readable storage medium, which stores a computer program. When the computer program is executed by a computer, the method according to any method embodiment of this application is implemented. The computer program may be a high-level language program, or may be an executable target program.

[0209] The computer-readable storage medium is, for example, the memory 902. The memory 902 may be a volatile memory or a nonvolatile memory, or the memory 902 may include both a volatile memory and a nonvolatile memory. The nonvolatile memory may be a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM) and is used as an external high-speed cache. Through example but not limitative description, many forms of RAMs may be used, for example, a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus random access memory (direct rambus RAM, DR RAM).

[0210] It may be clearly understood by a person skilled in the art that, for ease and brevity of description, for a specific working process and a generated technical effect of the foregoing apparatus and device, refer to a corresponding process and technical effect in the foregoing method embodiments, and details are not described herein again.

[0211] In the several embodiments provided in this application, the disclosed system, apparatus and method may be implemented in other manners. For example, some features of the method embodiments described above may be ignored or not performed. The described apparatus embodiments are merely examples. Division into the units is merely logical function division and may be other division in actual implementation. A plurality of units or components may be combined or integrated into another system. In addition, coupling between the units or coupling between the components may be direct coupling or indirect coupling, and the coupling may include an electrical connection, a mechanical connection, or another form of connection.

[0212] It should be understood that sequence indexes of the foregoing processes do not mean execution sequences in embodiments of this application. The execution sequences of the processes should be determined based on functions and internal logic of the processes, and should not be construed as any limitation on the implementation processes of embodiments of this application.

[0213] In addition, the terms “system” and “network” are usually used interchangeably in this specification. The term “and/or” in this specification describes only an association relationship for describing associated objects and represents that three relationships may exist. For example, A and/or B may represent the following three cases: Only A exists, both A and B exist, and only B exists. In addition, the character “/” in this specification generally represents an “or” relationship between the associated objects.

[0214] In summary, what is described above is merely example embodiments of the technical solutions of this application, but is not intended to limit the protection scope of this application. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of this application shall fall within the protection scope of this application.