ESTIMATION METHOD, ESTIMATION APPARATUS AND PROGRAM

20230083842 · 2023-03-16

Assignee

Inventors

Cpc classification

International classification

Abstract

An estimation apparatus according to one embodiment is an estimation method for estimating a parameter of a model for obtaining a state transition probability used in model-based reinforcement learning and causes a computer to perform: an input procedure in which first data indicating a state transition history in a situation where an action of the model-based reinforcement learning is not performed and second data indicating, when an action prompting a transition to a predetermined state is performed, a degree of accepting the transition to the predetermined state are input; and an estimation procedure in which a parameter of the model is estimated by using the first data and the second data.

Claims

1. A computer implemented method for estimating a parameter of a model for obtaining a state transition probability used in model-based reinforcement learning, comprising: receiving as an input: first data indicating a state transition history when a first action of the model-based reinforcement learning is not performed, and second data indicating, when a second action prompting a transition to a predetermined state is performed, a degree of accepting the transition to the predetermined state are input; and estimating a parameter of the model using the first data and the second data.

2. The computer implemented estimation method according to claim 1, wherein the second data is represented by a tuple of at least one of a state and time, the action prompting a transition to the predetermined state, and a probability indicating the degree of accepting a transition to the predetermined state.

3. The computer implemented method according to claim 2, wherein the parameter of the model is θ={u,v}, and wherein the model includes: a first model in which, when an action of the model-based reinforcement learning is not performed, a probability of transitioning to a state is defined by a parameter u, a second model in which, when an action of the model-based reinforcement learning is performed, a probability of transitioning to a state of a transition destination prompted by the action is defined by parameters u and v, and a third model in which, when an action of the model-based reinforcement learning is performed, a probability of transitioning to a state other than a state of a transition destination prompted by the action is defined by parameters u and v.

4. The computer implemented method according to claim 3, wherein the estimating further comprises estimating a parameter of the model by optimizing an objective function including a first generation probability of the first data and a second generation probability of the second data, and wherein the first generation probability of the first data is calculated by the first model and the second generation probability of the second model is calculated by the second model and the third model.

5. An estimation apparatus that estimates a parameter of a model for obtaining a state transition probability used in model-based reinforcement learning, the estimation apparatus comprising a processor configured to execute a method comprising: receiving as input first data indicating a state transition history in a situation where a first action of the model-based reinforcement learning is not performed and second data indicating, when a second action prompting a transition to a predetermined state is performed, a degree of accepting the transition to the predetermined state; and estimating a parameter of the model by using the first data and the second data.

6. A computer-readable non-transitory recording medium storing computer-executable program instructions that when executed by a processor cause for causing a computer to execute a method for estimating a parameter of a model for obtaining a state transition probability used in model-based reinforcement learning, comprising: receiving as an input: first data indicating a state transition history when a first action of the model-based reinforcement learning is not performed, and second data indicating, when a second action prompting a transition to a predetermined state is performed, a degree of accepting the transition to the predetermined state are input; and estimating a parameter of the model using the first data and the second data.

7. The computer implemented method according to claim 1, wherein the model-based reinforcement learning corresponds to learning the model associated with adaptive control of traffic lights.

8. The computer implemented method according to claim 1, wherein the model-based reinforcement learning corresponds to learning the model associated with a healthcare application notifying an activity indicating the transition to a predetermined state associated with a health.

9. The estimation apparatus according to claim 5, wherein the second data is represented by a tuple of at least one of a state and time, the action prompting a transition to the predetermined state, and a probability indicating the degree of accepting a transition to the predetermined state.

10. The estimation apparatus according to claim 9, wherein the parameter of the model is θ={u,v}, and wherein the model includes: a first model in which, when an action of the model-based reinforcement learning is not performed, a probability of transitioning to a state is defined by a parameter u, a second model in which, when an action of the model-based reinforcement learning is performed, a probability of transitioning to a state of a transition destination prompted by the action is defined by parameters u and v, and a third model in which, when an action of the model-based reinforcement learning is performed, a probability of transitioning to a state other than a state of a transition destination prompted by the action is defined by parameters u and v.

11. The estimation apparatus according to claim 10, wherein the estimating further comprises estimating a parameter of the model by optimizing an objective function including a first generation probability of the first data and a second generation probability of the second data; and wherein the first generation probability of the first data is calculated by the first model and the second generation probability of the second model is calculated by the second model and the third model.

12. The estimation apparatus according to claim 5, wherein the model-based reinforcement learning corresponds to learning the model associated with adaptive control of traffic lights.

13. The estimation apparatus according to claim 5, wherein the model-based reinforcement learning corresponds to learning the model associated with a healthcare application notifying an activity indicating the transition to a predetermined state associated with a health.

14. The computer-readable non-transitory recording medium according to claim 6, wherein the second data is represented by a tuple of at least one of a state and time, the action prompting a transition to the predetermined state, and a probability indicating the degree of accepting a transition to the predetermined state.

15. The computer-readable non-transitory recording medium according to claim 14, wherein the parameter of the model is θ={u,v}, and wherein the model includes: a first model in which, when an action of the model-based reinforcement learning is not performed, a probability of transitioning to a state is defined by a parameter u, a second model in which, when an action of the model-based reinforcement learning is performed, a probability of transitioning to a state of a transition destination prompted by the action is defined by parameters u and v, and a third model in which, when an action of the model-based reinforcement learning is performed, a probability of transitioning to a state other than a state of a transition destination prompted by the action is defined by parameters u and v.

16. The computer-readable non-transitory recording medium according to claim 15, wherein the estimating further comprises estimating a parameter of the model by optimizing an objective function including a first generation probability of the first data and a second generation probability of the second data; and wherein the first generation probability of the first data is calculated by the first model and the second generation probability of the second model is calculated by the second model and the third model.

17. The computer-readable non-transitory recording medium according to claim 6, wherein the model-based reinforcement learning corresponds to learning the model associated with adaptive control of traffic lights.

18. The computer-readable non-transitory recording medium according to claim 6, wherein the model-based reinforcement learning corresponds to learning the model associated with a healthcare application notifying an activity indicating the transition to a predetermined state associated with a health.

Description

BRIEF DESCRIPTION OF DRAWINGS

[0018] FIG. 1 illustrates an example of a functional configuration of an estimation apparatus according to the present embodiment.

[0019] FIG. 2 is a flowchart illustrating an example of estimation processing according to the present embodiment.

[0020] FIG. 3 illustrates an example of a hardware configuration of the estimation apparatus according to the present embodiment.

DESCRIPTION OF EMBODIMENTS

[0021] Hereinafter, one embodiment of the present invention will be described. In the present embodiment, an estimation apparatus 10 capable of estimating a state transition probability (hereinafter, simply referred to as a “transition probability”) used in model-based RL by using data (non-intervention transition data) collected in a situation where some kind of system such as a recommender system or a healthcare application does not intervene in a user will be described. When estimating the transition probability, the estimation apparatus 10 according to the present embodiment uses not only the non-intervention transition data but also transition acceptance data. The transition acceptance data is data indicating a degree to which the user can accept the intervention of the system (for example, a probability of accepting the intervention of the system). In other words, the transition acceptance data is a degree indicating whether the user prompted to transition to a certain state by a certain action (that is, the intervention of the system) accepts the transition to the state. Such transition acceptance data may be collected, for example, by questionnaires to users.

[0022] For example, in the case of the recommender system, the transition acceptance data is data indicating a degree to which, in response to an action of the system that “presents item 1 and item 2 as recommended items”, the user accepts the action and allows the state to transition to a state of “viewing “the page of item 1”” or “viewing “the page of item 2””. Further, for example, in the case of the healthcare application, the transition acceptance data is data indicating a degree to which, in response to an action of the system. that “notifies “why don't you go to work?””, the user accepts the action and allows the state to transition to a state of “going to work”.

[0023] <Preparation>

[0024] First, concepts, terms, etc., used in the present embodiment will be described.

[0025] <<Reinforcement Learning (RL)>>

[0026] Reinforcement learning is a method in which a learner (agent) estimates an optimal action rule (policy) through interaction with an environment. In reinforcement learning, a Markov decision process (MDP) is often used for setting the environment. In the present embodiment as well, the environment is set by a Markov decision process.

[0027] A Markov decision process is defined by a 4-tuple (S, A, P, R). S is called state space, and A is called action space. Respective elements, s∈S and a∈A, are called states and actions, respectively. P:S×A×S.fwdarw.[0,1] is called a state transition function and determines a transition probability that an action a performed in a state s leads to a next state s′.

[0028] Further, the following represents a reward function.


R:S×A.fwdarw.custom-character  [Math. 1]

[0029] The reward function defines a reward obtained when the action a is performed in the state s. The agent performs the action such that the sum of rewards obtained in the future in the above environment is maximized. The determined probability that the agent selects the action a in each state s is called a policy π:S×A.fwdarw.[0,1]. When a time-inhomogeneous Markov decision process in which the transition probability and the reward function vary at each time t is considered, the state transition function and the reward function may be defined as {P.sub.t}.sub.t, {R.sub.t}.sub.t at each time.

[0030] <<Value Function>>

[0031] Once one policy is defined, the agent can perform interaction with the environment. The agent in a state s.sub.t determines (selects) an action a.sub.t at each time t in accordance with a policy π(.Math.|s.sub.t). Next, in accordance with a state transition function and a reward function, a state s.sub.t+1 to P(.Math.|s.sub.t,a.sub.t) of the agent and a reward r.sub.t=R(s.sub.t,a.sub.t) at the next time are determined. By repeating this determination, a history of the states and actions of the agent is obtained. Hereinafter, the history of the states and actions (s.sub.0.Math.a.sub.0, s.sub.1.Math.a.sub.1, . . . , s.sub.T.Math.a.sub.T) obtained by repeating the transition T times from time t=0 to t=T is denoted as d.sub.T, which is called an episode.

[0032] Here, a value function is defined as a function serving to represent how good the policy is. The value function is defined as the average of returns obtained when the action a is selected in the state s and then continues to be performed in accordance with the policy π. When a finite time period (finite horizon) is considered, the sum total of rewards is used as the return, and when an infinite time period (infinite horizon) is considered, the sum of discounted rewards is used as the return. Evaluation functions are expressed by mathematical formula (1) and mathematical formula (2) below.

[00001] [ Math . 2 ] finite horizon : Q π ( s , a ) = 𝔼 d T π [ .Math. k = 0 T R ( s k , a k ) .Math. "\[LeftBracketingBar]" s 0 = s , a 0 = a ] ( 1 ) infinite horizon : Q π ( s , a ) = lim T .fwdarw. 𝔼 d T π [ .Math. k = 0 T γ k R ( s k , a k ) .Math. "\[LeftBracketingBar]" s 0 = s , a 0 = a ] ( 2 )

[0033] A discount rate is represented by γ∈[0,1), and an average operation related to how the episode appears by the policy π is represented as below.


custom-character.sub.d.sub.T.sup.π[ ]  [Math. 3]

[0034] Hereinafter, for simplicity, the case of infinite horizon will be considered.

[0035] When certain policies π and π′ satisfy Q.sup.π(s,a)≥Q.sup.π′(s,a) with random s∈S, a∈A, it can be expected that the policy π provides the agent with more rewards than the policy π′. This is denoted as π≥π′. An object of reinforcement learning is to obtain an optimal policy π* that satisfies π*≥π for a random policy π.

[0036] The optimal policy π* can be obtained by setting π*(a|s)=δ(a−argmax.sub.a′Q*(s,a′)) using a value function Q* thereof. The above value function is called an optimal value function. Note that δ(.Math.) is a delta function, and when δ(0), 1 is obtained, and otherwise, 0 is obtained.

[0037] It is known that an optimal value function Q* in the case of infinite horizon satisfies an optimal Bellman equation indicated by the following mathematical formula (3).

[00002] [ Math . 4 ] Q * ( s , a ) = 𝔼 s [ R ( s , a ) + γmax a Q * ( s , a ) ] ( 3 )

[0038] Therefore, if the environment (that is, the transition probability and reward function) is known, a value of the optimal value function Q* can be obtained by value iteration using the optimal Bellman equation indicated in the above mathematical formula (3). More commonly, if the environment is known, any desired method in which an optimal policy is obtained by the Markov decision process such as policy iteration can be used. The same applies to the case of finite horizon. While the common reinforcement learning has been described herein, the transition probability estimated by the estimation apparatus 10 according to the present embodiment can also be used in entropy-regularized RL or the like.

[0039] <Theoretical Configuration>

[0040] Next, a theoretical configuration of the method in which the estimation apparatus 10 according to the present embodiment estimates a transition probability will be described. Hereinafter, a case of estimating a transition probability in a time-inhomogeneous Markov decision process in which the transition probability varies depending on time will be described. However, the transition probability can also be estimated in a common-used time-homogeneous Markov decision process by using a similar framework.

[0041] <<Prior Knowledge about Action>>

[0042] In the present embodiment, it is assumed that prior knowledge about a state to which each action prompts the transition is already obtained. Such prior knowledge is available in the case of the recommender system and the healthcare application described above. For example, in the case of the recommender system, the action of the system that “presents item 1 and item 2 as recommended items” can be interpreted as an action to prompt the user to transition to a state of “viewing “the page of item 1”” or “viewing “the page of item 2””. Likewise, for example, in the case of the healthcare application, the action of the system that “notifies “why don't you go to work?”” can be interpreted as an action to prompt the user to transition to a state of “going to work”. Hereinafter, a set of destination states to which the action a prompts the user to transition is denoted as U.sub.a. Using this prior knowledge can reduce the number of parameters of a model (probability estimation model), which will be described below, so that accurate estimation can be performed. In a case where the number of states and the number of actions are small or in a case where a large amount of data (non-intervention transition data and transition acceptance data) is obtained, the estimation can be performed without this prior knowledge.

[0043] In the following description, for convenience, the transition probability is estimated assuming that the Markov decision process includes an action of “no intervention”. If a Markov decision process which does not include an action of “no intervention” is considered, the estimation result of the transition probability corresponding to such an action may not be used.

[0044] <<Data Used for Estimating Transition Probability>>

[0045] Non-intervention transition data is denoted as B.sub.tr, and transition acceptance data is denoted as B.sub.apt. The non-intervention transition data B.sub.tr indicates a state transition history obtained when no action is performed and is defined as B.sub.tr=N.sub.tij indicates the number of times that a state i has transitioned to a state j at time t. For example, in the case of the recommender system, the non-intervention transition data B.sub.tr indicates a state transition history of the user (or information obtained by aggregating such a history, for example) obtained when the function of presenting a recommended item to the user is not yet available. Likewise, for example, in the case of the healthcare application, the non-intervention transition data B.sub.t, indicates a state transition history of the user (or information obtained by aggregating such a history, for example) obtained when the system has no function of providing notification to the user.

[0046] The transition acceptance data B.sub.apt indicates a degree to which, when a certain action (that is, intervention of the system) prompts the user to transition to a certain state, the user accepts the transition to the state. As described above, the transition acceptance data B.sub.apt may be collected through a questionnaire or the like, and the questionnaire is conducted in any one of the following (format 1) to (format 3) based on the collection method.

[0047] (Format 1)

[0048] A case where the user is asked whether a specific action can be accepted when in a certain state: this format corresponds to, for example, a case where, while viewing the page of a certain item, the user is asked whether to accept a suggestion about transitioning to another page of a specific item.

[0049] In this case, the transition acceptance data B.sub.apt can be expressed as below.


B.sub.apt={(s.sub.d,a.sub.d,β.sub.d)}.sub.d=1.sup.D

[0050] D represents a transition acceptance degree included in B.sub.apt, and each (s.sub.d,a.sub.d,β.sub.d) represents transition acceptance.

[0051] Each (s.sub.d,a.sub.d,β.sub.d) indicates that, when in the state s.sub.d, the user accepts the transition to any one of the states that belong to a set indicated in Math. 6 by the action a.sub.d at the probability β.sub.d.


U.sub.a.sub.d  [Math. 6]

[0052] Note that β.sub.d is 0≤β.sub.d≤1, and this probability β.sub.d may be a subjective view (or a value based on the subjective view) of the user collected by the questionnaire or the like.

[0053] (Format 2)

[0054] A case where the user is asked whether a specific action can be accepted at certain time: this format corresponds to, for example, a case where the user is asked whether to accept a suggestion about transitioning to the page of a specific item at certain time.

[0055] In this case, the transition acceptance data B.sub.apt can be expressed as below.


B.sub.apt={(t.sub.d,a.sub.d,β.sub.d)}.sub.d=1.sup.D[Math. 7]

[0056] Each (t.sub.d,a.sub.d,β.sub.d) represents transition acceptance and indicates that the user accepts the transition to any one of the states that belong to a set indicated in Math. 8 at the time t.sub.d by the action a.sub.d at the probability β.sub.d.


U.sub.a.sub.d[Math. 8]

[0057] (Format 3)

[0058] A case where the user is asked whether a specific action can be accepted when in a certain state at certain time: this format corresponds to, for example, a case where, while viewing the page of a certain item at certain time, the user is asked whether to accept a suggestion about transitioning to another page of a specific item.

[0059] In this case, the transition acceptance data B.sub.apt is expressed as below.


B.sub.apt={(t.sub.d,s.sub.d,β.sub.d)}.sub.d=1.sup.D  [Math. 9]

[0060] Each (t.sub.d, s.sub.d, a.sub.d, β.sub.d) represents transition acceptance and indicates that, when in the state s.sub.d at the time t.sub.d, the user accepts the transition to any one of the states that belong to a set indicated in Math. 10 by the action a.sub.d at the probability β.sub.d.


U.sub.a.sub.d  [Math. 10]

[0061] Hereinafter, for simplicity, the transition acceptance data B.sub.apt described above in (Format 3) is assumed to be given. However, the present embodiment is also applicable in a similar manner to the case where the transition acceptance data B.sub.apt described above in (Format 1) or (Format 2) is given.

[0062] Next, statistics M.sub.tik and G.sub.tik are defined by the following mathematical formulas using the transition acceptance data B.sub.apt.


M.sub.tik=Σ.sub.{d|t.sub.d.sub.=t,s.sub.d.sub.=i,a.sub.d.sub.=k}β.sub.d


G.sub.tik=Σ.sub.d=1.sup.D1(t.sub.d=t,s.sub.d=i,a.sub.d=k)  [Math. 11]

[0063] Note that 1(.Math.) is an indicator function, and when a condition X is true, 1(X)=1, and otherwise, 1(X)=0.

[0064] The above statistic M.sub.tik indicates the sum of probabilities β.sub.d where the time t.sub.d=t, the state s.sub.d=i, and the action a.sub.d=a. The statistic G.sub.tik indicates the transition acceptance degree where the time t.sub.d=t, the state s.sub.d=i, and the action a.sub.d=a.

[0065] In addition, the non-intervention transition data B.sub.tr and the transition acceptance data B.sub.apt are collectively denoted as B. That is, B=B.sub.tr∪B.sub.apt

[0066] <<Model and Algorithm>>

[0067] Any model can be used as a model (hereinafter, referred to as a “probability estimation model”) for estimating a transition probability.

[0068] A parameter (hereinafter, referred to as a “model parameter”) of the probability estimation model is denoted as θ={u,v}, and the probability estimation model is represented as below to clarify dependency on the model parameter θ.


{P.sub.t.sup.θ}  [Math. 12]

[0069] In the present embodiment, a model based on a log-linear model is constructed as the probability estimation model.

[0070] As modeling policies, a transition probability of not performing any action (that is, performing the action of “no intervention”) is expressed by using a parameter v, and an impact of each action on the transition probability of not performing any action is expressed by using a parameter u. With these parameters, for example, probability estimation models described in (a) to (c) below can be obtained.

[0071] (a) When the effect of the action depends only on the current state: by using parameters v={v.sub.tig}, u={u.sub.ikj}, the probability estimation model is defined as below. The effect of the action refers to how much the action affects the transition probability (in other words, the degree of contribution of the action to the transition probability).

[00003] [ Math . 13 ] P t θ ( s t + 1 = j .Math. "\[LeftBracketingBar]" s t = i , a = k ) = { exp ( v tij ) .Math. j exp ( v tij ) ( if k = a noitv : no intervention ) exp ( v tij + u ikj ) .Math. j U k exp ( v tij + u ikj ) + .Math. j .Math. U k exp ( v tij ) ( if j U k , k a noitv ) exp ( v tij ) .Math. j U k exp ( v tij + u ikj ) + .Math. j .Math. U k exp ( v tij ) ( if j .Math. U k , k a noitv )

[0072] Here, a.sub.noitv represents an action of “no intervention”.

[0073] (b) When the effect of the action depends only on the current time: by using parameters v={v.sub.tij}, u={u.sub.tkj}, the probability estimation model is defined as below.

[00004] [ Math . 14 ] P t θ ( s t + 1 = j .Math. "\[LeftBracketingBar]" s t = i , a = k ) = { exp ( v tij ) .Math. j exp ( v tij ) ( if k = a noitv : no intervention ) exp ( v tij + u tkj ) .Math. j U k exp ( v tij + u tkj ) + .Math. j .Math. U k exp ( v tij ) ( if j U k , k a noitv ) exp ( v tij ) ( .Math. j U k exp ( v tij + u tkj ) + .Math. j .Math. U k exp ( v tij ) ( if j .Math. U k , k a noitv )

[0074] (c) When the effect of the action depends on the current state and the current time: by using parameters v=u={v.sub.tij}, u={u.sub.tikj}, the probability estimation model is defined as below.

[00005] [ Math . 15 ] P t θ ( s t + 1 = j .Math. "\[LeftBracketingBar]" s t = i , a = k ) = { exp ( v tij ) .Math. j exp ( v tij ) ( if k = a noitv : no intervention ) exp ( v tij + u tikj ) .Math. j U k exp ( v tij + u tikj ) + .Math. j .Math. U k exp ( v tij ) ( if j U k , k a noitv ) exp ( v tij ) .Math. j U k exp ( v tij + u tikj ) + .Math. j .Math. U k exp ( v tij ) ( if j .Math. U k , k a noitv )

[0075] While the present embodiment is applicable to a probability estimation model other than the probability estimation model defined in the above (a) to (c), the following description will be made by using the probability estimation model defined in any one of the above (a) to (c).

[0076] The model parameter θ can be estimated by optimizing an objective function. Here, if non-intervention transition data is regarded as intervention transition data obtained when the action a.sub.noitv, which indicates “no intervention”, is performed, a generation probability of the non-intervention transition data is given by the following mathematical formula.


p(B.sub.tr|θ)=Π.sub.t=1.sup.TΠ.sub.i,j∈S(P.sub.t.sup.θ(s.sub.t+1=j|s.sub.t=i,a=a.sub.noitv)).sup.N.sup.tij  [Math. 16]

[0077] Further, by using the transition acceptance (t.sub.d, s.sub.d, a.sub.d, β.sub.d), which is regarded as indicating the number of times that the state s.sub.d has transitioned to the state indicated in Math. 17 below for β.sub.d times by the action a.sub.d at the time t.sub.d, the generation probability of the transition acceptance data is given by the following mathematical formula indicated in Math. 18 below.


j∈U.sub.a.sub.d  [Math. 17]

[00006] [ Math . 18 ] p ( B apt .Math. "\[LeftBracketingBar]" θ ) = .Math. d = 1 D { ( .Math. j U k P t d θ ( s t + 1 = j .Math. "\[LeftBracketingBar]" s t = s d , a = a d ) ) β d ( 1 - .Math. j U k P t d θ ( s t + 1 = j .Math. "\[LeftBracketingBar]" s t = s d , a = a d ) ) ( 1 - β d ) } = .Math. t = 1 T .Math. i S .Math. k A { ( .Math. j U k P t θ ( s t + 1 = j .Math. "\[LeftBracketingBar]" s t = i , a = k ) ) M tik ( 1 - .Math. j U k P t d θ ( s t + 1 = j .Math. "\[LeftBracketingBar]" s t = i , a = k ) ) G tik - M tik }

[0078] In this way, a negative log-likelihood function represented by the sum of encoded and judged logarithms obtained from each of the non-intervention transition data generation probability p(B.sub.tr|θ) and the intervention transition data generation probability p(B.sub.apt|θ) is obtained, and this negative log-likelihood function can serve as an objective function. That is, for example, L(θ)=−log(p(B.sub.tr|θ))−ν log(p(B.sub.apt|θ))+λΩ(θ) can be an objective function. Note that a regularization term Ω(θ) is added in the above objective function to prevent over training. For example, as the regularization term, any regularization term such as an L.sub.2 norm can be used. Further, ν and λ are hyperparameters.

[0079] The model parameter θ is estimated by minimizing the above objective function L(θ).

[0080] That is, the model parameter is estimated by the following equation.


{circumflex over (θ)}=argmin.sub.θL(θ)  [Math. 19]

[0081] For convenience, in the text of the description, the model parameter obtained as the estimation result is denoted as “{circumflex over ( )}θ”. Further, a desired optimization method such as a gradient method, a Newton method, an auxiliary function method, or an L-BFGS method may be used to minimize (optimize) the objective function L(θ). In this way, the transition probability can be estimated by using the transition probability model using the model parameter {circumflex over ( )}θ.

[0082] <Functional Configuration>

[0083] Next, a functional configuration of the estimation apparatus 10 according to the present embodiment will be described with reference to FIG. 1. FIG. 1 illustrates an example of a functional configuration of the estimation apparatus 10 according to the present embodiment.

[0084] As illustrated in FIG. 1, the estimation apparatus 10 according to the present embodiment includes a learning data storing unit 101, a setting parameter storing unit 102, a model parameter estimation unit 103, a transition probability estimation unit 104, a learning data storage unit 105, a setting parameter storage unit 106, and a model parameter storage unit 107.

[0085] The learning data storing unit 101 stores the given non-intervention transition data B.sub.tr and transition acceptance data B.sub.apt in the learning data storage unit 105 as learning data B=B.sub.tr∪B.sub.apt. For example, the non-intervention transition data B.sub.tr and the transition acceptance data B.sub.apt may be acquired from a server device or the like connected to the estimation apparatus 10 via a communication network and given to the learning data storing unit 101.

[0086] The setting parameter storing unit 102 stores the given setting parameters (for example, the parameter representing the model used as a probability estimation model, the hyperparameters ν and λ, etc.) in the setting parameter storage unit 106. For example, the setting parameters may be specified by the user and given to the setting parameter storing unit 102.

[0087] The model parameter estimation unit 103 estimates a model parameter θ of the probability estimation model by using the learning data B and the setting parameters. Next, the model parameter estimation unit 103 stores the estimated model parameter {circumflex over ( )}θ in the model parameter storage unit 107.

[0088] The transition probability estimation unit 104 estimates a state transition probability by the probability estimation model using the model parameter {circumflex over ( )}θ.

[0089] FIG. 1 illustrates the functional configuration example in which the same apparatus estimates the model parameter of the probability estimation model and the transition probability. Alternatively, for example, the estimation of the model parameter of the probability estimation model and the estimation of the transition probability may be performed by different apparatuses. In that case, the model parameter estimation unit 103 and the transition probability estimation unit 104 may be arranged in different apparatuses.

[0090] <Estimation Processing>

[0091] Next, the processing in which the estimation apparatus 10 according to the present embodiment estimates a model parameter {circumflex over ( )}θ and then estimates a transition probability by using the model parameter {circumflex over ( )}θ will be described with reference to FIG. 2. FIG. 2 is a flowchart illustrating an example of the estimation processing according to the present embodiment.

[0092] First, the model parameter estimation unit 103 inputs the learning data B stored in the learning data storage unit 105 and the setting parameters stored in the setting parameter storage unit 106 (step S101).

[0093] Next, the model parameter estimation unit 103 estimates a model parameter θ of the probability estimation model by using the learning data B and the setting parameters and stores the estimated model parameter {circumflex over ( )}θ in the model parameter storage unit 107 (step S102). In this step, for example, the model parameter estimation unit 103 may use any one of the probability estimation models defined in the above (a) to (c) and estimates the model parameter {circumflex over ( )}θ by minimizing the above-described objective function L(θ) using any desired optimization method.

[0094] Next, the transition probability estimation unit 104 estimates a state transition probability by the probability estimation model using the model parameter {circumflex over ( )}θ stored in the model parameter storage unit 107 (step S103). In this way, the state transition probability used in the model-based RL is estimated.

[0095] The model parameter {circumflex over ( )}θ estimated in the above step S102 and the state transition probability estimated in the above step S103 may be output to any desired output destination. For example, when the apparatus estimating the model parameter and the apparatus estimating the state transition probability are different apparatuses, the model parameter estimation unit 103 may output (transmit) the model parameter {circumflex over ( )}θ to the apparatus estimating the state transition probability. In addition, for example, when the apparatus estimating the state transition probability and the apparatus estimating a value function of the model-based RL are different apparatuses, the transition probability estimation unit 104 may output (transmit) the state transition probability to the apparatus estimating the value function.

[0096] As described above, when the intervention transition data is not available, the estimation apparatus 10 according to the present embodiment can estimate a state transition probability of the Markov decision process by using the non-intervention transition data and the transition acceptance data. In this way, even in a situation where, for example, in construction of a recommender system, the only data available is the state transition history of the user obtained when the function of presenting a recommended item to the user is not yet available or a situation where, in a healthcare application, the only data available is the state transition history of the user obtained when the user notification function is not yet available, the estimation apparatus 10 according to the present embodiment is capable of estimating a state transition probability by collecting the transition acceptance data.

[0097] <Hardware Configuration>

[0098] Finally, a hardware configuration of the estimation apparatus 10 of the present embodiment will be described with reference to FIG. 3. FIG. 3 illustrates an example of the hardware configuration of the estimation apparatus 10 according to the present embodiment.

[0099] As illustrated in FIG. 3, the estimation apparatus 10 according to the present embodiment is a common computer or computer system and includes an input device 201, a display device 202, an external I/F 203, a communication I/F 204, a processor 205, and a memory device 206. These hardware components are connected via a bus 207 so as to be able to communicate with each other.

[0100] The input device 201 is, for example, a keyboard, a mouse, a touch panel, or the like. The display device 202 is, for example, a display or the like. The estimation apparatus 10 may not include at least one of the input device 201 and the display device 202.

[0101] The external I/F 203 is an interface to an external device. The external device includes a recording medium 203a or the like. The estimation apparatus 10 can read from and write to the recording medium 203a via the external I/F 203, for example. For example, the recording medium 203a may store at least one program that implements each of the functional units (the learning data storing unit 101, the setting parameter storing unit 102, the model parameter estimation unit 103, and the transition probability estimation unit 104) included in the estimation apparatus 10.

[0102] Examples of the recording medium 203a include a CD (Compact Disc), a DVD (Digital Versatile Disc), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.

[0103] The communication I/F 204 is an interface for connecting the estimation apparatus 10 to a communication network. At least one program that implements the functional units included in the estimation apparatus 10 may be acquired (downloaded) from a predetermined server device or the like via the communication I/F 204.

[0104] Examples of the processor 205 include various computing devices such as a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). Each functional unit included in the estimation apparatus 10 is implemented by the processor 205 executing at least one program stored in the memory device 206.

[0105] Examples of the memory device 206 include various storage devices such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), a RAM (Random Access Memory), a ROM (Read Only Memory), and a flash memory. Each of the storage units (the learning data storage unit 105, the setting parameter storage unit 106, and the model parameter storage unit 107) included in the estimation apparatus 10 can be implemented by using the memory device 206. In addition, at least one storage unit among each of the storage units included in the estimation apparatus 10 may be implemented by a storage device (for example, database server or the like) which is connected to the estimation apparatus 10 via the communication network.

[0106] The estimation apparatus 10 according to the present embodiment can implement the estimation processing described above by having the hardware configuration illustrated in FIG. 3. The hardware configuration illustrated in FIG. 3 is merely an example, and the estimation apparatus 10 may have a different hardware configuration. For example, the estimation apparatus 10 may have a plurality of processors 205 and may have a plurality of memory devices 206.

[0107] The present invention is not limited to the embodiment specifically disclosed above, and various modifications, changes, combinations with known techniques, and the like can be made without departing from the scope of the claims.

REFERENCE SIGNS LIST

[0108] 10 Estimation apparatus [0109] 101 Learning data storing unit [0110] 102 Setting parameter storing unit [0111] 103 Model parameter estimation unit [0112] 104 Transition probability estimation unit [0113] 105 Learning data storage unit [0114] 106 Setting parameter storage unit [0115] 107 Model parameter storage unit [0116] 201 Input device [0117] 202 Display device [0118] 203 External I/F [0119] 203a Recording medium [0120] 204 Communication I/F [0121] 205 Processor [0122] 206 Memory device [0123] 207 Bus