PARAMETER CALCULATING DEVICE, PARAMETER CALCULATING METHOD, AND RECORDING MEDIUM HAVING PARAMETER CALCULATING PROGRAM RECORDED THEREON
20210065056 · 2021-03-04
Cpc classification
G06N7/01
PHYSICS
G05B2219/32334
PHYSICS
Abstract
Provided is a parameter calculating device that takes human prior knowledge into account. The parameter calculating device according to the present invention is provided with: an identifying means that identifies intermediate states from a certain state to a target state and rewards concerning the intermediate states on the basis of a plurality of states concerning a target system, associated information by which two states among the plurality of states are associated with each other, rewards concerning at least some of the states, model information including parameters representing the states of the target system, and given ranges concerning the parameters; and a parameter calculating means that calculates the values of the parameters in the case where the identified rewards and the degrees of the differences between the values of the parameters and the given ranges satisfy predetermined conditions.
Claims
1. A parameter calculating device, comprising: an identifying unit configured to identify an intermediate state from a certain state to a target state and a reward concerned with the intermediate state based on a plurality of states concerned with a target system, associated information in which two states among the plurality of states are associated with each other, a reward concerned with at least some of the states, model information including a parameter indicative of a state of the target system, and a given range concerned with the parameter; and a parameter calculating unit configured to calculate a value of the parameter in a case where the identified reward and a degree of a difference between the value of the parameter and the given range satisfy a predetermined condition.
2. The parameter calculating device as claimed in claim 1, comprising a conversion unit configured to calculate the intermediate state or numeric information indicative of the intermediate state based on association information indicative of association between the states and numeric information indicative of the states.
3. The parameter calculating device as claimed in claim 2, comprising a low-level planner configured to prepare control information for controlling the target system based on a difference between the numeric information indicative of the intermediate state and observation information observed with respect to the target system.
4. The parameter calculating device as claimed in claim 1, comprising an updating means configured to update the associated information based on the calculated value of the parameter.
5. The parameter calculating device as claimed in claim 2, wherein the association information includes a first symbol grounding function for associating the numeric information with the state.
6. The parameter calculating device as claimed in claim 2, wherein the association information includes a second symbol grounding function for associating the state with the numeric information.
7. A parameter calculating method in an information processing device, the method comprising: identifying an intermediate state from a certain state to a target state and a reward concerned with the intermediate state based on a plurality of states concerned with a target system, associated information in which two states among the plurality of states are associated with each other, a reward concerned with at least some of the states, model information including a parameter indicative of a state of the target system, and a given range concerned with the parameter; and calculating a value of the parameter in a case where the identified reward and a degree of a difference between the value of the parameter and the given range satisfy a predetermined condition.
8. The parameter calculating method as claimed in claim 7, the method comprising: calculating the intermediate state or numeric information indicative of the intermediate state based on association information indicative of association between the states and numeric information indicative of the states.
9. The parameter calculating method as claimed in claim 8, the method comprising: preparing control information for controlling the target system based on a difference between the numeric information indicative of the intermediate state and observation information observed with respect to the target system.
10. A non-transitory recording medium recording a parameter calculating program causing a computer to execute: an identifying step of identifying an intermediate state from a certain state to a target state and a reward concerned with the intermediate state based on a plurality of states concerned with a target system, associated information in which two states among the plurality of states are associated with each other, a reward concerned with at least some of the states, model information including a parameter indicative of a state of the target system, and a given range concerned with the parameter; and a parameter calculating step of calculating a value of the parameter in a case where the identified reward and a degree of a difference between the value of the parameter and the given range satisfy a predetermined condition.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION OF EMBODIMENTS
Related Art
[0036] In order to facilitate an understanding of the present invention, a related art will be described first.
[0037]
[0038] The hierarchical planner 10 comprises a high-level planner 12, a first conversion unit 14, a second conversion unit 16, and a low-level planner 18.
[0039]
[0040] The control system of the related art having such a configuration operates as follows.
[0041] The environment 50 receives an action a, and produces numeric state information s belonging to a state set S and a reward r. Herein, the numeric state information s is a continuous quantity representing a state of the environment 50 with a numeric representation.
[0042] The first conversion unit 14 receives the numeric state information s, the reward r, and first symbol grounding parameters, and produces, based on a first symbol grounding function, a state symbol s.sub.h belonging to a state symbol set S.sub.h and the reward r. Herein, the state symbol s.sub.h is a symbol represented by a symbolic representation in knowledge. The first conversion unit 14 is also called a low-level/high-level conversion unit.
[0043] The high-level planner 12 receives the state symbol s.sub.h, the reward r, and high-level planner parameters, and produces a subgoal symbol g.sub.h belonging to the state symbol set S.sub.h. Herein, the subgoal symbol g.sub.h is a symbol indicative of an intermediate state represented by the symbolic representation in the knowledge. In this specification, the subgoal symbol g.sub.h may simply be also called an intermediate state. In addition, a starting state, an objective state (target state), and the intermediate state may simply be called states collectively.
[0044] The second conversion unit 16 receives the subgoal symbol g.sub.h and second symbol grounding parameters, and produces, based on a second symbol grounding function, a subgoal g belonging to the state set S. Herein, the subgoal g comprises numeric information indicative of the intermediate state. The second conversion unit 16 may also be called a high-level/low-level conversion unit.
[0045] In the related art, as the first symbol grounding function and the second symbol grounding function, functions that are manually and carefully designed beforehand are used.
[0046] The low-level planner 18 receives the numeric state information s, the subgoal g, and low-level planner parameters, and produces the action a belonging to an action set A.
[0047] It is assumed that a series of these steps is one process. Then, the history recording medium 40 receives, for every one process, the numeric state information s, the reward r, the subgoal symbol g.sub.h, the subgoal g, and the action a, and records them as the interaction history.
[0048] The parameter calculation circuitry 20 receives, from the history recording medium 40, the numeric state information s, the reward r, the subgoal symbol g.sub.h, the subgoal g, and the action a, which are saved as the interaction history, and updates parameters for the hierarchical planner 10 to produce updated parameters.
[0049] The parameter storage unit 30 receives the updated parameters from the parameter calculation circuitry 20, saves them as the hierarchical planner parameters, and outputs the saved hierarchical planner parameters in response to a readout request.
[0050] The problem with the above-mentioned related art is that human beings cannot easily understand the operations, after optimization, of the respective modules of the hierarchical planner 10 that perform the symbol grounding (i.e. the first conversion unit 14, the high-level planner 12, the second conversion unit 16, and the low-level planner 18). This is because, in the related art, the hierarchical planner parameters are optimized based only on the interaction history.
Example Embodiment
[0051] An example embodiment of the present invention will hereinafter be described in detail with reference to the drawings.
[0052] [Explanation of Configuration]
[0053]
[0054] The hierarchical planner 10A comprises a high-level planner 12A, a first conversion unit 14A, a second conversion unit 16A, and the low-level planner 18.
[0055]
[0056] The parameter calculation circuitry 20A comprises an identifying unit 22A, a parameter calculation unit 24A, a first symbol grounding function parameter updating unit 26A, and a second symbol grounding function parameter updating unit 28A.
[0057] Referring to
[0058] Referring to
[0059] These means operate as follows, respectively.
[0060] The environment 50 receives an action a, and produces numeric state information s belonging to a state set S and a reward r.
[0061] The first conversion unit 14A receives the numeric state information s, the reward r, and first symbol grounding function parameters with prior knowledge which will later be described, and produces, based on a first symbol grounding function, a state symbol s.sub.h belonging to the state symbol set S.sub.h and the reward r. Herein, the first symbol grounding function is first association information indicative of association between the numeric state information and a state corresponding to the numeric state information. Accordingly, the first conversion unit 14A calculates, based on the first association information, the state corresponding to the numeric state information.
[0062] The high-level planner 12A receives the state symbol s.sub.h, the reward r, and high-level planner parameters with prior knowledge, and produces a subgoal symbol g.sub.h belonging to the state symbol set S.sub.h.
[0063] The second conversion unit 16A receives the subgoal symbol g.sub.h and the first symbol grounding function parameters with prior knowledge which will later be described, and produces, based on a second symbol grounding function, a subgoal g belonging to the state set S. Herein, the second symbol grounding function is second association information indicative of association between the state and the numeric information corresponding to the state. Accordingly, the second conversion unit 16A calculates, based on the second association information, numeric information indicative of the above-mentioned intermediate state.
[0064] The low-level planner 18 receives the numeric state information s, the subgoal g, and low-level planner parameters with prior knowledge, and produces the action a belonging to an action set A. In other words, the low-level planner 18 prepares, based on a difference between the numeric information indicative of the intermediate state and observation information which is observed with respect to the target system 50, control information for controlling the target system 50. Specifically, the low-level planner 18 may be, for example, a controller for carrying out PID (proportional-integral-derivative) control.
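The difference-based control described for the low-level planner 18 can be illustrated with a minimal PID sketch. The class name, the gains, and the scalar state are assumptions made for illustration only; the specification does not give a concrete controller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Minimal PID controller (hypothetical stand-in for the low-level planner)."""
    kp: float  # proportional gain (assumed value supplied by the caller)
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_error: Optional[float] = None

    def action(self, subgoal: float, observation: float, dt: float = 1.0) -> float:
        # Error between the numeric subgoal g and the observed state s.
        error = subgoal - observation
        self.integral += error * dt
        # No derivative contribution on the first call, since no previous error exists.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

Here `subgoal` plays the role of the subgoal g and `observation` the role of the numeric state information s; the returned value corresponds to the action a.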
[0065] It is assumed that a series of these steps is one process. Then, the history recording medium 40 receives, for every one process, the numeric state information s, the reward r, the subgoal symbol g.sub.h, the subgoal g, and the action a, and records them as an interaction history.
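As a rough sketch of "one process", the data flow through the four modules and the recording of the interaction history might look as follows. The function and field names are hypothetical, and the four callables are placeholders for the first conversion unit 14A, the high-level planner 12A, the second conversion unit 16A, and the low-level planner 18.

```python
from typing import Callable, List, NamedTuple


class Record(NamedTuple):
    s: float   # numeric state information
    r: float   # reward
    g_h: str   # subgoal symbol
    g: float   # numeric subgoal
    a: float   # action


def one_process(
    s: float,
    r: float,
    to_symbol: Callable[[float], str],        # first symbol grounding (s -> s_h)
    plan_high: Callable[[str, float], str],   # high-level planner (s_h, r -> g_h)
    to_numeric: Callable[[str], float],       # second symbol grounding (g_h -> g)
    plan_low: Callable[[float, float], float],  # low-level planner (s, g -> a)
    history: List[Record],
) -> float:
    s_h = to_symbol(s)
    g_h = plan_high(s_h, r)
    g = to_numeric(g_h)
    a = plan_low(s, g)
    # One entry of the interaction history per process.
    history.append(Record(s, r, g_h, g, a))
    return a
```

The returned action would then be passed to the environment 50, which yields the next s and r.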
[0066] The parameter calculation circuitry 20A receives prior knowledge from the knowledge recording medium 60, receives, from the history recording medium 40, the numeric state information s, the reward r, the subgoal symbol g.sub.h, the subgoal g, and the action a, which are saved as the interaction history, and updates parameters for the hierarchical planner 10A to produce updated hierarchical planner parameters.
[0067] The identifying unit 22A identifies an intermediate state (subgoal symbol) from a certain state to a target state (final objective) and a reward concerned with the intermediate state. It does so based on a plurality of states concerned with the target system 50, associated information in which two states among the plurality of states are associated with each other, a reward concerned with at least some of the states, model information including a parameter indicative of a state of the target system 50, and a given range concerned with the parameter. Herein, the associated information in which the two states among the plurality of states are associated with each other is high-level planner symbol knowledge. The model information including the parameter is, for example, a normal distribution.
[0068] The parameter calculation unit 24A calculates a value of the parameter in a case where the identified reward and a degree of a difference between the value of the parameter and the above-mentioned given range satisfy a predetermined condition. Herein, the predetermined condition is supposed to be, for example, a condition that a differential value is the largest in a case where a steepest descent is adopted as an optimization method.
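One way to picture the "degree of a difference between the value of the parameter and the given range" is as a distance to an interval that is traded off against the reward. The sketch below is only an assumption: the squared penalty, the weight, and both function names are illustrative and are not taken from the specification.

```python
def range_violation(value: float, low: float, high: float) -> float:
    """Distance from value to the interval [low, high]; zero when inside it."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def penalized_objective(reward: float, value: float, low: float, high: float,
                        weight: float = 1.0) -> float:
    """Reward reduced by a squared penalty for leaving the given range (assumed form)."""
    return reward - weight * range_violation(value, low, high) ** 2
```

An optimizer such as steepest descent would then search for the parameter value that maximizes this kind of combined quantity.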
[0069] As shown in
[0070] As shown in
[0071] As described above, each of the first symbol grounding function parameter updating unit 26A and the second symbol grounding function parameter updating unit 28A updates the association information (symbol grounding function) based on the values of the calculated parameters. In other words, the first symbol grounding function parameter updating unit 26A and the second symbol grounding function parameter updating unit 28A update the first and the second association information (first and second symbol grounding functions) by using the above-mentioned calculated parameters as parameters of the first and the second association information (first and second grounding functions), respectively.
[0072] The parameter storage unit 30 receives the parameters with prior knowledge from the parameter calculation circuitry 20A and saves them as the hierarchical planner parameters.
[0073] These means mutually operate so as to repeat 1) accumulation of the interaction history using the hierarchical planner 10A and 2) parameter updating using the accumulated interaction history and the prior knowledge. It is therefore possible to obtain an effect that the hierarchical planner 10A can be optimized in consideration of both the prior knowledge and the interaction history.
[0074] [Explanation of Operation]
[0075] Next, referring to a flow chart of
[0076] First, the control system carries out interaction between the hierarchical planner 10A and the environment 50 to accumulate the interaction history (Step S101). The interaction history is recorded in the history recording medium 40.
[0077] Next, the parameter calculation circuitry 20A updates the hierarchical planner parameters by referring to the prior knowledge recorded in the knowledge recording medium 60 and the interaction history recorded in the history recording medium 40 (Step S102). The updated hierarchical planner parameters are stored in the parameter storage unit 30.
[0078] The control system repeats these steps a designated number of times (Step S103).
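Steps S101 to S103 amount to a simple alternating loop, sketched below. The function name and the `collect` and `update` callables are hypothetical placeholders for the interaction with the environment 50 and for the parameter calculation circuitry 20A, respectively.

```python
from typing import Any, Callable, List, Tuple


def optimize(
    params: Any,
    prior_knowledge: Any,
    collect: Callable[[Any], List[Any]],
    update: Callable[[Any, List[Any], Any], Any],
    num_iterations: int,
) -> Tuple[Any, List[Any]]:
    history: List[Any] = []  # stands in for the history recording medium 40
    for _ in range(num_iterations):                        # Step S103: repeat
        history.extend(collect(params))                    # Step S101: accumulate history
        params = update(params, history, prior_knowledge)  # Step S102: update parameters
    return params, history
```

The updated `params` returned at the end correspond to what would be held in the parameter storage unit 30.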
[0079] [Explanation of Effect]
[0080] Next, an effect of the example embodiment will be described.
[0081] The example embodiment is configured to repeat 1) accumulation of the interaction history between the hierarchical planner 10A and the environment 50 and 2) parameter updating using the accumulated interaction history and the prior knowledge. It is therefore possible to optimize the hierarchical planner parameters in consideration of both the prior knowledge and the interaction history.
[0082] Each part of the hierarchical planner 10A may be implemented by a combination of hardware and software. In a form in which the hardware and the software are combined, the respective parts are implemented as various kinds of means by loading a parameter calculating program into a RAM (random access memory) and making hardware such as a control unit (CPU (central processing unit)) operate based on the parameter calculating program. The parameter calculating program may be recorded in a recording medium to be distributed. The parameter calculating program recorded in the recording medium is read into a memory via a wired connection, wirelessly, or via the recording medium itself to operate the control unit and so on. By way of example, the recording medium may be an optical disc, a magnetic disk, a semiconductor memory device, a hard disk, or the like.
[0083] Expressed differently, the above-mentioned example embodiment can be implemented by causing a computer, which is to be operated as the hierarchical planner 10A, to act as the parameter calculation circuitry 20A (the identifying unit 22A, the parameter calculation unit 24A, the first symbol grounding function parameter updating unit 26A, and the second symbol grounding function parameter updating unit 28A) according to the parameter calculating program loaded into the RAM.
Example
[0084] Next, description will proceed to an operation of the mode for embodying the present invention using a specific example.
[0085] This example supposes semi-Markov decision processes (SMDPs) described in Non-Patent Literature 4.
[0086] This example supposes a Mountain Car task. In the Mountain Car task, a torque is applied to a car to make the car arrive at a goal on a hill. In this task, the reward r is 100 if the car arrives at the goal, and is 1 otherwise. The state set S includes a velocity of the car and a position of the car. Accordingly, the numeric state information s and the subgoal g belong to the state set S. The action set A includes the torque of the car. The action a belongs to the action set A. The state symbol set S.sub.h is {Bottom_of_hills, On_right_side_hill, On_left_side_hill, At_top_of_right_side_hill}. The state symbol s.sub.h and the subgoal symbol g.sub.h belong to the state symbol set S.sub.h. In this example, [Bottom_of_hills] indicates the starting state. [At_top_of_right_side_hill] indicates the objective state (target state). [On_right_side_hill] and [On_left_side_hill] indicate the intermediate states. In this example, the environment 50 comprises an operating simulator of the car on the hill. In addition, in this example, the hierarchical planner 10A plans how to apply the torque to the car based on the position and the velocity of the car. In
[0087] The high-level planner 12A in this example is a Strips-style planner based on symbol knowledge.
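To give a feel for planning over the state symbols, the sketch below picks the next subgoal symbol by breadth-first search over a small transition table. This is a simplified stand-in, not a Strips planner, and the `TRANSITIONS` table is hypothetical symbol knowledge invented for illustration; only the four symbol names come from the example.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical symbol knowledge: which symbolic state can directly follow which.
TRANSITIONS: Dict[str, List[str]] = {
    "Bottom_of_hills": ["On_left_side_hill", "On_right_side_hill"],
    "On_left_side_hill": ["Bottom_of_hills", "On_right_side_hill"],
    "On_right_side_hill": ["Bottom_of_hills", "At_top_of_right_side_hill"],
    "At_top_of_right_side_hill": [],
}


def next_subgoal(state: str, target: str,
                 transitions: Dict[str, List[str]] = TRANSITIONS) -> Optional[str]:
    """Return the first step of a shortest symbolic path from state to target."""
    if state == target:
        return target
    queue = deque([(state, None)])  # (node, first step taken from the start)
    visited = {state}
    while queue:
        node, first = queue.popleft()
        for nxt in transitions.get(node, []):
            if nxt in visited:
                continue
            step = nxt if first is None else first
            if nxt == target:
                return step
            visited.add(nxt)
            queue.append((nxt, step))
    return None  # target not reachable under the given knowledge
```

With this table, `next_subgoal("Bottom_of_hills", "At_top_of_right_side_hill")` yields an intermediate state as the subgoal symbol g.sub.h.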
[0088] Furthermore, in this example, the prior knowledge recorded in the knowledge recording medium 60 is constructed based on the symbol grounding functions which are prepared manually.
[0089] In
[0090] Next, description will proceed to a method of learning the symbol grounding functions using the reinforcement learning with constraints according to this example.
[0091] In the reinforcement learning with constraints, as illustrated in the following numerical expression:
the parameter in the policy $\pi(g, g_h, s_h \mid s)$ of the high-level planning, which includes the symbol grounding functions with prior knowledge, is learned so that $E_{\pi}\left[\sum_{t=0} r_t\right]$ becomes the maximum. The policy $\pi(g, g_h, s_h \mid s)$ is represented by the following numerical expression:
$\pi(g, g_h, s_h \mid s) := \pi_{s_h \to s}(g \mid g_h)\, P(g_h \mid s_h)\, \pi_{s \to s_h}(s_h \mid s)$ [Math. 2]
where $P(\cdot)$ represents the prior knowledge. In the expression of Math. 2, the first symbol grounding function is represented by:
$\pi_{s \to s_h}(s_h \mid s)$ [Math. 3]
[0092] The second symbol grounding function is represented by:
$\pi_{s_h \to s}(g \mid g_h)$ [Math. 4]
The high-level planner 12A is represented by $P(g_h \mid s_h)$.
[0093] Non-Patent Literature 5 proposes REINFORCE Algorithms as illustrated in
[0094] In comparison with this, this example proposes a parameter updating method for the hierarchical planner 10A as illustrated in
[0095] In this example, as illustrated in
[0096] Accordingly, in this example, the parameters in the first symbol grounding function and the second symbol grounding function are calculated in accordance with the common parameter through optimization.
[0097] As illustrated in
$N(s \mid \mu_{s_h}, \sigma_{s_h})$ [Math. 5]
The average $\mu_{s_h}$ [Math. 6] and the standard deviation $\sigma_{s_h}$ [Math. 7] are used as the parameters to be optimized.
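A minimal sketch of how one shared set of normal-distribution parameters could serve both symbol grounding functions is given below. It uses the car position only, and it replaces the stochastic policies with deterministic simplifications (most likely symbol, and mean of the Gaussian). Only the (0.6, 0.1) entry for At_top_of_right_side_hill is taken from this example; every other number in `PARAMS` is a made-up placeholder.

```python
import math
from typing import Dict, Tuple

# (mean, standard deviation) of the position per state symbol.
# Only the At_top_of_right_side_hill entry comes from the text; the rest are assumptions.
PARAMS: Dict[str, Tuple[float, float]] = {
    "Bottom_of_hills": (-0.5, 0.2),
    "On_left_side_hill": (-0.9, 0.2),
    "On_right_side_hill": (0.1, 0.2),
    "At_top_of_right_side_hill": (0.6, 0.1),
}


def normal_pdf(x: float, mean: float, std: float) -> float:
    """Density of the normal distribution N(x | mean, std)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def first_grounding(s: float, params: Dict[str, Tuple[float, float]] = PARAMS) -> str:
    """Numeric state -> state symbol with the highest density (simplification)."""
    return max(params, key=lambda sym: normal_pdf(s, *params[sym]))


def second_grounding(g_h: str, params: Dict[str, Tuple[float, float]] = PARAMS) -> float:
    """Subgoal symbol -> numeric subgoal, here simply the mean of its Gaussian."""
    return params[g_h][0]
```

Because both functions read the same `PARAMS` table, changing a mean or a standard deviation changes both groundings at once, which mirrors the idea of a common parameter.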
[0098]
[0099] In this example, the parameter calculation circuitry 20A carries out optimization by referring to the prior knowledge concerned with these parameters. For instance, the parameter calculation circuitry 20A refers to the prior knowledge that, corresponding to:
$s_h = \text{At\_top\_of\_right\_side\_hill}$ [Math. 8]
the average and the standard deviation
$\mu_{s_h}, \sigma_{s_h}$ [Math. 9]
are 0.6 and 0.1, respectively.
[0100] In this example, the interaction history-based first symbol grounding function parameter updating unit 264A uses modifications of the REINFORCE Algorithms disclosed in the above-mentioned Non-Patent Literature 5 (see, the first term of the right side in the expression in
[0101] In this example, the prior knowledge-based first symbol grounding function parameter updating unit 262A and the prior knowledge-based second symbol grounding function parameter updating unit 282A update the parameter so as to bring the parameter closer to that defined by the prior knowledge (see, the second term of the right side in the expression in
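The combination of a history-based term and a prior-knowledge-based term can be sketched for the mean of a single Gaussian grounding function as follows. This is an assumption-laden illustration: the function name, the learning rate, the prior weight, and the simple linear pull toward the prior value are all hypothetical, and the exact expression used in this example is not reproduced here.

```python
from typing import Sequence


def update_mean(
    mean: float,
    std: float,
    samples: Sequence[float],   # observed numeric states assigned to this symbol
    returns: Sequence[float],   # return obtained after each sample
    prior_mean: float,          # value of the mean given by the prior knowledge
    lr: float = 0.01,
    prior_weight: float = 0.1,
) -> float:
    """One gradient-ascent step on the mean of a Gaussian grounding function."""
    # History-based term (REINFORCE-style): return-weighted score function,
    # using d/d(mean) log N(s | mean, std) = (s - mean) / std**2.
    grad_history = sum(
        ret * (s - mean) / std ** 2 for s, ret in zip(samples, returns)
    ) / len(samples)
    # Prior-based term: pulls the mean toward the value defined by the prior knowledge.
    grad_prior = prior_mean - mean
    return mean + lr * (grad_history + prior_weight * grad_prior)
```

Setting `prior_weight` to zero reduces the step to the history-only case, which corresponds to optimizing without the prior knowledge.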
[0102] Using these methods, the present inventor experimentally evaluated how easily human beings can actually interpret the operations of the respective modules when the parameters are optimized in consideration of the prior knowledge (Proposed), in comparison with the case where the prior knowledge is not considered (Baseline).
[0103]
[0104] In the Baseline, the average of Bottom_of_hills is -0.5 whereas the average of On_right_side_hill is -0.73. This suggests that the right-side hill lies to the left of the bottom between the left-side and right-side hills, a result which is incomprehensible for human beings. In the Proposed, on the other hand, no such problem occurs.
[0105] A specific configuration of the present invention is not limited to the afore-mentioned example embodiment. Alterations that do not depart from the gist of the present invention are included in the present invention.
[0106] While the invention has been particularly shown and described with reference to the example embodiment (example) thereof, the invention is not limited to the above-mentioned example embodiment (example). It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.
INDUSTRIAL APPLICABILITY
[0107] The present invention is applicable to uses such as a plant operation support system. In addition, the present invention is also applicable to uses such as an infrastructure operating support system.
REFERENCE SIGNS LIST
[0108] 50 environment (target system)
[0109] 10, 10A hierarchical planner
[0110] 14, 14A first conversion unit
[0111] 12, 12A high-level planner
[0112] 16, 16A second conversion unit
[0113] 18 low-level planner
[0114] 20, 20A parameter calculation circuitry
[0115] 22A identifying unit
[0116] 24A parameter calculation unit
[0117] 26A first symbol grounding function parameter updating unit
[0118] 28A second symbol grounding function parameter updating unit
[0119] 262A prior knowledge-based first symbol grounding function parameter updating unit
[0120] 264A interaction history-based first symbol grounding function parameter updating unit
[0121] 266A parameter updating combining unit
[0122] 282A prior knowledge-based second symbol grounding function parameter updating unit
[0123] 284A interaction history-based second symbol grounding function parameter updating unit
[0124] 286A parameter updating combining unit
[0125] 40 history recording medium
[0126] 60 knowledge recording medium
[0127] 30 parameter storage unit