INFORMATION PROCESSING DEVICE

Abstract

An estimation device includes an input unit that generates supervised data including causal variables, process types, and outcome variable for each of multiple processes, and a training unit that uses the supervised data to generate a learning model by learning the outcome variables from the causal variables and the process types for each of the processes.

Claims

1. An information processing device comprising: a processor to execute a program; and a memory to store the program which, when executed by the processor, performs processes of, generating supervised data including causal variables, processing types, and outcome variables for each of a plurality of processes; and generating a learning model by using the supervised data to learn the outcome variables from the causal variables and the process types for each of the processes; wherein the causal variables are at least one of attributes and purchase history of railway users, a factor affecting a sales value of railway services, action history of railway users, fare price fare increases, price reductions, and coupon amounts to promote railway use, the process types are fare price increases, fare price reductions, campaigns to promote railway use, and distribution of coupons to promote railway use, and the outcome variables are sales values of high-priced railway services.

2. The information processing device according to claim 1, wherein, the processor estimates the outcome variables by inputting the causal variables and the process types into the learning model.

3. The information processing device according to claim 2, wherein the processor specifies an optimal combination of the causal variables and the process types by the estimated outcome variables.

4. An information processing device comprising: a processor to execute a program; and a memory to store the program which, when executed by the processor, performs processes of, generating, for each of the processes, supervised data including first causal variables that change with time, second causal variables that do not change with time, process types, and history information indicating a history of changes with time of the first causal variables; and generating a learning model by using the supervised data to learn, for each of the processes, a change with time of the first causal variables at a second time from the first causal variables, the second causal variables, and the process types at a first time in accordance with the history information.

5. The information processing device according to claim 4, wherein, the processor uses the learning model to estimate a change with time of the first causal variables obtained by a combination of two or more processes selected from the processes at a first period.

6. The information processing device according to claim 5, wherein the processor specifies an optimal combination of the two or more processes based on with the estimated change with time.

7. The information processing device according to claim 4, wherein, the first causal variables are fare price increases, fare price reductions, and amounts of coupons to promote railway use, the second causal variables are at least one of attributes and purchase history of railway users, factors affecting the sales value of railway services, and action history of railway users, the process types are fare price increases, fare price reductions, campaigns to promote railway use, and distribution of coupons to promote railway use, and the history information is a history of fare price increases, a history of fare price reductions, and a history of distribution of coupons to promote railways use.

8. An information processing device comprising: a processor to execute a program; and a memory to store the program which, when executed by the processor, performs processes of, estimating outcome variables by using supervised data including causal variables, process types, and the outcome variables for each of a plurality of processes and inputting the causal variables and the process types into a learning model generated by learning the outcome variables from the causal variables and the process types for each of the processes; and outputting a result of the estimation; wherein, the causal variables are at least one of attributes and purchase history of railway users, a factor affecting a sales value of railway services, action history of railway users, fare price increases, fare price reductions, and coupon amounts to promote railway use, the process types are fare price increases, fare price reductions, campaigns to promote railway use, and distribution of coupons to promote railway use, and the outcome variables are sales values of high-priced railway services.

9. An information processing device comprising: a processor to execute a program; and a memory to store the program which, when executed by the processor, performs processes of, estimating a change with time of first causal variables obtained from a combination of two or more processes selected from a plurality of processes during a certain time period, by using a learning model generated by using supervised data including history information indicating a history of change with time of the first causal variables that change with time, second causal variables that do not change with time, process types, and the first causal variables, for each of the processes, and by learning, for each of the processes, the change with time of the first causal variables at a second time from the first causal variables, the second causal variables, and the process types at a first time in accordance with the history information; and outputting a result of the estimation.

10. The information processing device according to claim 9, wherein, the first causal variables are fare price increase amounts, fare price reduction amounts, and amounts of coupons to promote railway use, the second causal variables are at least one of attributes and purchase history of railway users, factors affecting the sales value of railway services, and action history of railway users, the process types are fare price increases, fare price reductions, campaigns to promote railway use, and distribution of coupons to promote railway use, and the history information is a history of fare price increases, a history of fare price reductions, and a history of distribution of coupons to promote railways use.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:

[0013] FIG. 1 is a graph indicating the difference in infection rate between a case in which a vaccine is administered and a case in which the vaccine is not administered, as an example of a treatment effect;

[0014] FIG. 2 is a block diagram schematically illustrating a configuration of an estimation device according to a first embodiment;

[0015] FIG. 3 is a table providing an example of a categorical variable process for treatment variables;

[0016] FIG. 4 is a schematic diagram for describing an overview of a learning model according to the first embodiment;

[0017] FIG. 5 is a schematic diagram for explaining an overview of a multitask Gaussian process;

[0018] FIG. 6 is a table providing an example of a list of parameters, initial values, and the number of parameters according to the first embodiment;

[0019] FIG. 7 is a block diagram illustrating a hardware configuration example;

[0020] FIG. 8 is a flowchart illustrating the process operation of the estimation device according to the first embodiment;

[0021] FIG. 9 is a graph illustrating the difference in infection rate between a case in which multiple doses of a vaccine are administered and a case in which the vaccine is not administered, as an example of treatment effects at multiple time points;

[0022] FIG. 10 is a block diagram schematically illustrating a configuration of an estimation device according to a second embodiment;

[0023] FIGS. 11A to 11D are schematic diagrams illustrating the characteristics of the estimation device according to the second embodiment;

[0024] FIGS. 12A and 12B are schematic diagrams for describing an overview of a learning model according to the second embodiment;

[0025] FIG. 13 is a schematic diagram for explaining an overview of a deep multitask Gaussian process;

[0026] FIG. 14 is a schematic diagram for explaining input and output of a deep multitask Gaussian process unit;

[0027] FIG. 15 is a flowchart illustrating the process operation of an encoder unit in estimation of individual treatment effects at multiple time points in the second embodiment; and

[0028] FIG. 16 is a flowchart illustrating the processing operation of a decoder unit in estimation of individual treatment effects at multiple time points in the second embodiment.

DETAILED DESCRIPTION OF THE INVENTION

First Embodiment

[0029] The first embodiment describes an example of estimating an individual process effect at a single time point by using a multitask Gaussian process.

[0030] Here, as an example of an individual process effect, an individual treatment effect (ITE) at a single time point is estimated.

[0031] In the following, treatment is an example of a process, and treatment dosage is an example of process quantity.

[0032] The individual treatment effect is defined as the difference between the result of treating an individual and the result of not treating the individual.

[0033] The magnitude of the treatment effect changes depending on the type and dosage of the treatment, and the attributes of the individual to be treated.

[0034] For example, as illustrated in FIG. 1, the difference in infection rate between a case in which a vaccine is administered and a case in which the vaccine is not administered represents a treatment effect. This treatment effect is thought to differ depending on target attributes (e.g., gender, age, and chronic illness) of the individual to be vaccinated, the type of vaccine, the vaccine dose, and the environment at the time of vaccination.

[0035] The disclosure is not limited to individual treatment effects, but by aggregating individual target attributes to group attributes, it is also possible to calculate the average treatment effect (ATE) for each group.

[0036] Similarly, the disclosure may be applied to conditional average treatment effect (CATE) or a local average treatment effect (LATE).

[0037] Furthermore, the term individual in individual treatment effect can also be referred to as target-specific or individualized, and the term treatment can be alternatively expressed as intervention, therapy, action, or exposure.

[0038] Furthermore, the term treatment effect may be expressed as causal effect.

[0039] FIG. 2 is a block diagram schematically illustrating a configuration of an estimation device 100 serving as an information processing device according to the first embodiment.

[0040] The estimation device 100 includes an input unit 110, a training unit 120, an estimating unit 140, and an output unit 150.

[0041] The input unit 110 inputs supervised data to the training unit 120.

[0042] For example, in multiple processes, the input unit 110 generates supervised data including causal variables, processing types, and outcome variables for each of the processes.

[0043] The input unit 110 includes a data base (DB) 111 serving as a data storage unit, an input executing unit 112, and a preprocessing unit 113.

[0044] The DB 111 stores data necessary for a process in the estimation device 100. Here, the DB 111 stores at least causal variables X#, treatment types W#, and outcome variables Y#.

[0045] The input executing unit 112 acquires the causal variables X#, the treatment types W#, and the outcome variables Y# from the DB 111 and gives these to the preprocessing unit 113.

[0046] The causal variables X#, the treatment types W#, and the outcome variables Y# are data observed for the respective individuals i and are expressed by formula (1) below.

[00001] $\begin{matrix} {X #, W #, Y #} = {X^{i}, W^{i}, Y^{i}}_{i = 1}^{N} & (1) \end{matrix}$

[0047] X.sup.i represents a set J1 of causal variables, as expressed by formula (2) below. Q indicates the number of causal variables.

[00002] $\begin{matrix} J 1 = {x_{1}^{i}, x_{2}^{i}, x_{3}^{i}, .Math., x_{Q}^{i}} & (2) \end{matrix}$

[0048] As expressed in formula (3) below, each element of a causal variable may be multidimensional information, and the dimensional number of each element may be different.

[00003] $\begin{matrix} x_{q}^{i}^{d_{q}} & (3) \end{matrix}$

[0049] A causal variable x.sub.q is expressed by formulas (4) to (7) below.

[00004] $\begin{matrix} Features of treatment target = x_{q}^{c o variate} & (4) \end{matrix}$ $\begin{matrix} Treatment dosage = x_{q}^{dosage} & (5) \end{matrix}$ $\begin{matrix} Treatment history information = x_{q}^{history} & (6) \end{matrix}$ $\begin{matrix} Environmental information at time of treatment = x_{q}^{e n v i r o n m e t} & (7) \end{matrix}$

[0050] Here, the features of a treatment target are, for example, gender, age, body weight, etc.

[0051] The treatment dosage is, for example, a vaccine dose. Treatment history information refers to, for example, the type and number of vaccines administered in the past, and their respective effects.

[0052] The environmental information at the time of treatment is factors affecting the treatment effect. The environmental information at the time of treatment is, for example, weather, region, economy, etc.

[0053] The treatment type W# is a categorical variable representing the type of treatment, as expressed by formula (8) below. For example, a categorical variable is the type of vaccine.

[00005] $\begin{matrix} W #^{D} & (8) \end{matrix}$

[0054] The outcome variable Y# represents the treatment effect. For example, the treatment effect is a reduction in infection rates due to vaccination.

[0055] The preprocessing unit 113 performs preprocessing described later on the causal variable X#, the treatment type W#, and the outcome variable Y# from the input executing unit 112 and gives a causal variable X, a treatment type W, and an outcome variable Y resulting from the preprocessing to the training unit 120.

[0056] When there is a missing value in at least one of the treatment type W# and the outcome variable Y#, the preprocessing unit 113 removes the missing data {X.sup.i, W.sup.i, Y.sup.i}.

[0057] As illustrated in FIG. 3, the preprocessing unit 113, with regard to the treatment type W#, assigns integer values (1, . . . , D) to the respective patterns when there are multiple observed combination patterns related to the treatment. here, D is the number of patterns.

[0058] When there is a missing value in the element x.sub.q of the causal variable X#, the preprocessing unit 113 fills the missing value with the average value of each dimension of x.sub.q.

[0059] The preprocessing unit 113 normalizes the element x.sub.q of the causal variable X# so that each dimension has an average of 0 and a variance of 1.

[0060] The training unit 120 trains a learning model by using supervised data given from the input unit 110.

[0061] For example, the training unit 120 uses the supervised data from the input unit 110 to generate a learning model by learning outcome variables from causal variables and processing types for each of multiple processes. The generated learning model is stored in a storage unit not illustrated.

[0062] The training unit 120 includes a calculating unit 121 and an optimizing unit 122.

[0063] The calculating unit 121 initializes the parameter of a Gaussian process, inputs the causal variable X, calculates the variance covariance matrix K(X, X) of the Gaussian process, and outputs the variance covariance matrix K to the optimizing unit 122.

[0064] FIG. 4 is a schematic diagram for describing an overview of the learning model trained by the training unit 120.

[0065] A correlation between series of observed variables in a learning model can be expressed by linear correlation of a known linear model of coregionalization (LMC) kernel of a multitask Gaussian process (MTGP).

[0066] MTGP is described in detail in Reference 1 below.

[0067] A method of calculating the variance covariance matrix of an LMC kernel in the present embodiment will now be explained.

[0068] The treatment effect Y expressed by formula (9) below is expressed by formulas (10) and (11) using the LMC.

[00006] $\begin{matrix} Y = {y_{d}}_{d = 1}^{D} & (9) \end{matrix}$ $\begin{matrix} f_{d} (X) = {.Math.}_{q = 1}^{Q} a_{d, q} g_{q} (x_{q}) & (10) \end{matrix}$ $\begin{matrix} y_{d} (f_{d} (X),_{d}^{2}) & (11) \end{matrix}$ [0069] where, a.sub.d,q is a linear correlation coefficient from a state function g.sub.q to an observation function f.sub.d, and .sub.d.sup.2 is the variance of observation error.

[0070] In formula (10), g.sub.q(x.sub.q) represents a function of the Gaussian process gp((x.sub.q), k.sub.q(x.sub.q,x.sub.q)) generated from the element x.sub.q of the causal variable.

[0071] Although (x.sub.q) represents an average matrix of x.sub.q, since x.sub.q is normalized by the preprocessing unit 113, (x.sub.q)=0.

[0072] Here, k.sub.q represents a positive definite kernel. As the positive definite kernel to be used, one suitable for data such as a radial basis function (RBF) kernel can be selected. For example, an RBF kernel is expressed by formula (12) below.

[0073] Here, l.sup.1.sub.q and l.sup.2.sub.q are RBF kernel parameters.

[00007] $\begin{matrix} k_{q} (x_{q}, x_{q}^{}) = l_{q}^{1} \exp {- \frac{{.Math. x_{q} - x_{q}^{} .Math.}^{2}}{2 l_{q}^{2^{2}}}} & (12) \end{matrix}$

[0074] When the function f expressed by formula (13) below follows a Gaussian process, the function f is represented by a multidimensional Gaussian distribution like formula (14) below.

[00008] $\begin{matrix} f = {f_{d}}_{d = 1}^{D} & (13) \end{matrix}$ $\begin{matrix} f (X) ((X), K (X, X^{})) = (0, K (X, X^{})) & (14) \end{matrix}$ [0075] where K(X,X) and (K(X,X)).sub.d,d are expressed by formulas (15) and (16) below.

[00009] $\begin{matrix} K (X, X^{})^{DN DN} & (15) \end{matrix}$ $\begin{matrix} {(K (X, X^{}))}_{d, d^{}} = cov [f_{d} (X), f_{d^{}} (X^{})] & (16) \end{matrix}$

[0076] K(X,X) in formula (14) is a variance covariance matrix or a Gram matrix and is a matrix representing the degree of similarity between the causal variables X and X.

[0077] The value (X) represents the average matrix of the causal variable X, but since the causal variable X is normalized by the preprocessing unit 113, (X)=0.

[0078] Each component (K(X,X)).sub.d,d of the variance covariance matrix K(X,X) can be calculated as expressed in formula (17) below by using f.sub.d(X) of formula (10).

[0079] Note that the transformation from the second line to the third line in formula (17) takes advantage of the fact that the different causal variables (qq) are independent between them, as expressed in formula (18) below.

[00010] $\begin{matrix} {(K (X, X^{}))}_{d, d^{}} = cov [f_{d} (X), f_{d^{}} (X^{})] = {.Math.}_{q = 1}^{Q} {.Math.}_{q^{} = 1}^{Q} a_{d, q} a_{d^{}, q^{}} cov [g_{q} (x_{q}), g_{q^{}} (x_{q^{}}^{})] = {.Math.}_{q = 1}^{Q} a_{d, q} a_{d^{}, q} cov [g_{q} (x_{q}), g_{q} (x_{q^{}}^{})] = {.Math.}_{q = 1}^{Q} a_{d, q} a_{d^{}, q} k_{q} (x_{q}, x_{q}^{}) & (17) \end{matrix}$ $\begin{matrix} cov [g_{q} (x_{q}), g_{q^{}} (x_{q^{}}^{})] = 0 & (18) \end{matrix}$

[0080] K(X,X) is expressed as formulas (19) and (20) below.

[00011] $\begin{matrix} K (X, X^{}) = {.Math.}_{q = 1}^{Q} B_{q} k_{q} (x_{q}, x_{q}^{})^{DN DN} & (19) \end{matrix}$ $\begin{matrix} B_{q} = [\begin{matrix} a_{1, q}^{2} & a_{1, q} a_{2, q} & .Math. & a_{1, q} a_{D, q} \\ a_{1, q} a_{2, q} & a_{2, q}^{2} & .Math. & a_{2, q} a_{D, q} \\ .Math. & .Math. & .Math. \\ a_{1, q} a_{D, q} & a_{2, q} a_{D, q} & .Math. & a_{D, q}^{2} \end{matrix}] & (20) \end{matrix}$ [0081] where B.sub.q satisfies formula (21) below. B.sub.q is a coregionalization matrix and represents a linear transformation from the state function g.sub.q to the observation function f.sub.d.

[00012] $\begin{matrix} B_{q}^{D D} & (21) \end{matrix}$

[0082] The outcome variable Y is defined as expressed in formulas (22), (23), and (24) below by using the observation function f.

[00013] $\begin{matrix} Y ~ (f (X), .Math. I) = (0, K (X, X^{}) + .Math. I) = (0, K^{} (X, X^{}) & (22) \end{matrix}$ $\begin{matrix} K^{} (X, X^{}) = K (X, X^{}) + .Math. I & (23) \end{matrix}$ $\begin{matrix} .Math. = [\begin{matrix} _{1}^{2} & 0 & .Math. & 0 \\ 0 & _{2}^{2} & .Math. & 0 \\ .Math. & .Math. & 0 \\ 0 & 0 & 0 & _{d}^{2} \end{matrix}]^{D D} & (24) \end{matrix}$

[0083] K(X,X) can be calculated by using the causal variable X and the parameter expressed by formula (25) below. The initial value of the parameter is illustrated in FIG. 6. These parameters are optimized by the optimizing unit 122 of the training unit 120.

[00014] $\begin{matrix} = {l_{q}^{n}, a_{d, q},_{d}} & (25) \end{matrix}$

[0084] The optimizing unit 122 receives the variance covariance matrix K(X,X) calculated by the calculating unit 121, calculates peripheral likelihood, and optimizes the parameters so that peripheral likelihood is minimized.

[0085] The peripheral likelihood can be obtained through the following calculation.

[0086] The probability p (Y|x,) that the outcome variable Y is observed can be obtained by formula (26).

[00015] $\begin{matrix} p (Y | X,) = N (Y | 0, K_{}^{} (X, X^{})) = \frac{1}{(2^{N / 2})} \frac{1}{{.Math. K_{}^{} (X, X^{}) .Math.}^{1 / 2}} \exp (- \frac{1}{2} Y^{T} {K_{}^{} (X, X^{})}^{- 1} Y) & (26) \end{matrix}$ [0087] where K.sub.(X,X) represents K(X,X) calculated with the parameter .

[0088] The logarithm of both sides of formula (26) can be taken to calculate the peripheral likelihood log.sub.p(Y|X,) by formula (27) below.

[00016] $\begin{matrix} \log p (Y | X,) = - \frac{1}{2} Y^{T} {K^{} (X, X^{})}^{- 1} Y - \frac{1}{2} \log .Math. K^{} (X, X^{}) .Math. - \frac{ND}{2} \log 2 & (27) \end{matrix}$ [0089] where N is the length of vector X, and D is the output dimensional number as the number of tasks in Y.

[0090] In order to optimize the parameter , it is sufficient to maximize the peripheral likelihood log.sub.p(Y| X, ). For example, in order to match a general optimization problem, it is sufficient to minimize the optimization function E as expressed in formula (28) below with both sides multiplied by minus.

[00017] $\begin{matrix} E = - \log p (Y | X,) = \frac{1}{2} {Y^{T} (K^{} (X, X^{}))}^{- 1} Y + \frac{1}{2} \log .Math. K^{} (X, X^{}) .Math. + \frac{ND}{2} \log 2 & (28) \end{matrix}$

[0091] The optimizing unit 122 updates the optimization function E of formula (28); in other words, is updated so that the peripheral likelihood log.sub.p(Y|X,) is minimized. When is updated at the time of optimization, an update of K.sub.(X,X) is also necessary, so the calculating unit 121 calculates K.sub.(X,X).

[0092] The optimization function E can be optimized using a known technique such as stochastic gradient descent.

[0093] As described above, the optimizing unit 122 can optimize the parameter .

[0094] The estimating unit 140 calculates the treatment effect of a new input X* by using the variance covariance matrix K.sub.(X,X) calculated with the parameter optimized by the calculating unit 121.

[0095] The output y* of the new input X* expressed by formula (29) below is expressed by formula (30) below.

[00018] $\begin{matrix} X *= {x_{q}^{*}}_{q = 1}^{Q} & (29) \end{matrix}$ $\begin{matrix} Y *= {y_{d}^{*}}_{d = 1}^{D} & (30) \end{matrix}$

[0096] The output y* can be calculated by using formula (31) below with the variance covariance matrix K.sub.(X,X) and observation points {X, Y} through, for example, Gaussian process regression, which is a known technique.

[00019] $\begin{matrix} y_{d}^{*} (x *) = K_{*}^{T} {K_{}^{} (X, X^{})}^{- 1} Y & (31) \end{matrix}$ [0097] where K* is expressed by formula (32) below, and B.sub.q[:,d] is a vector in column d of B.sub.q.

[00020] $\begin{matrix} K_{*} = {.Math.}_{q = 1}^{Q} B_{q} [:, d] k_{q} (x_{q}, x_{q}^{*})^{DN} & (32) \end{matrix}$

[0098] Here, the treatment effect ITE when treatment is not performed (W=1) and when treatment is performed (W=T) can be calculated by formula (33) below.

[00021] $\begin{matrix} ITE = y_{t}^{*} - y_{1}^{*} & (33) \end{matrix}$

[0099] The estimating unit 140 estimates outcome variables by inputting causal variables and processing types into the learning model. With the estimated outcome variables, the estimating unit 140 can specify an optimal combination of a causal variable and a processing type.

[0100] For example, the estimating unit 140 calculates the treatment effect ITE for all combinations of treatment and treatment dosage and specifies the combination of a treatment T having the highest treatment effect and treatment dosage.

[0101] The output unit 150 outputs the result of the estimation performed by the estimating unit 140.

[0102] For example, the output unit 150 outputs a treatment T having the highest treatment effect for a target X*, its treatment dosage, and its treatment effect ITE. For example, the output unit 150 displays them.

[0103] The estimation device 100 described above can be implemented by, for example, a personal computer (PC) 10, as illustrated in FIG. 7.

[0104] The PC 10 includes an auxiliary storage device 11 such as hard disk drive (HDD) or solid state drive (SSD), a memory 12, a processor 13 such as a central processing unit (CPU), a communication interface (I/F) 14 such as a network interface card (NIC), an input I/F 15 such as a keyboard and mouse, and a display 16.

[0105] For example, the DB 111 can be implemented by storage such as the auxiliary storage device 11.

[0106] The input executing unit 112, the preprocessing unit 113, the calculating unit 121, the optimizing unit 122, and the estimating unit 140 can be implemented by the processor 13 loading a program stored in storage such as the auxiliary storage device 11 onto the memory 12 and executing the program.

[0107] The output unit 150 can be implemented by the display 16.

[0108] The program may be downloaded to the auxiliary storage device 11 from a recording medium (not illustrated) via a reader/writer (not illustrated) or through a network via a communication I/F 14, then loaded onto the memory 12, and executed by the processor 13. Alternatively, the program may be directly loaded onto the memory 12 from a recording medium via a reader/writer or through a network via the communication I/F 14, and executed by the processor 13.

[0109] In other words, a program may be provided as a program product such as a recording medium.

[0110] A storage unit (not illustrated) that stores a learning model can also be implemented by storage such as the auxiliary storage device 11.

[0111] FIG. 8 is a flowchart illustrating operation of the estimation device 100 according to the first embodiment.

[0112] First, the input executing unit 112 acquires the causal variable X#, the treatment type W#, and the outcome variable Y# from the DB 111 (step S10).

[0113] Next, the preprocessing unit 113 deletes missing data from the treatment type W# and the outcome variable Y# (step S11).

[0114] Next, the preprocessing unit 113 performs preprocessing of the treatment type W# (step S12).

[0115] Next, the preprocessing unit 113 processes the missing value of the causal variable X# (step S13).

[0116] Next, the preprocessing unit 113 normalizes the causal variable X# (step S14).

[0117] According to the above steps S11 to S14, preprocessing is performed on the causal variable X#, the treatment type W#, and the outcome variable Y#, to obtain the causal variable X, the treatment type W, and the outcome variable Y, respectively.

[0118] Next, the calculating unit 121 sets the parameter expressed by formula (25) above to an initial value (step S15).

[0119] Next, the calculating unit 121 uses X and 0 to calculate K.sub.(X,X) from the formula (19) above (step S16).

[0120] Next, the optimizing unit 122 calculates the optimization function E expressed by formula (28) above (step S17).

[0121] Next, the optimizing unit 122 optimizes so that the optimization function E is minimized (step S18).

[0122] Next, the calculating unit 121 determines whether or not the optimization function E has converged (step S19). The condition for the optimization function E to converge is whether the value of the peripheral likelihood log.sub.p(Y| X, ) is minimized, or the process of steps S16 to S18 reaches a predetermined number of iterations.

[0123] When the optimization function E has not converged (NO in step S19), the process returns to step S16, and when the optimization function E converges (YES in step S19), the process proceeds to step S20.

[0124] In step S20, the estimating unit 140 determines the treatment effect ITE of the new input X*.

[0125] Next, the estimating unit 140 determines a combination of the treatment with the highest treatment effect ITE and the treatment dosage (step S21).

[0126] The approach described in NPL 1 can determine only the treatment effect of with or without one type of treatment, but as described above, the estimation device 100 of the first embodiment can estimate the treatment effect of multiple types of treatment.

[0127] The approach of NPL 1 takes into account only features of a treatment target, but the estimation device 100 of the first embodiment uses treatment history information and environmental information at the time of treatment as separate independent causal variables (latent variables) in addition to the features of the treatment target. By inputting each independent causal variable separately, it is better to calculate the variance covariance matrix for each as a separate latent variable, so that no spurious correlation occurs and the accuracy of the estimation is improved, rather than calculating the variance covariance matrix of a Gaussian process as one feature obtained by combining, for example, the target and treatment features, which should be independent. Here, a spurious correlation means that the correlation coefficient between variables that are not inherently correlated does not become zero even when calculated.

[0128] The individual treatment effect estimation at a single time point in the present embodiment makes it possible to determine the appropriate treatment (action) and its extent on the basis of the features of the target. For example, it becomes clear which vaccine and how much of it should be administered to reduce the infection rate the most depending on the recipient.

[0129] Also, in the first embodiment, measurement of the effect of vaccine treatment is described as an example of a treatment effect, but it is not limited to this. For example, by setting the variables as described below, the present embodiment can be applied to cancer therapy, promotion of high-priced services such as reserved train seats or high-grade hotel rooms, or boosting store sales.

[0130] The variables in the case of cancer therapy are as follows.

[0131] The feature of the processing target can be at least one of the attributes (e.g., gender, age, presence or absence of chronic illness) and physical condition information of an individual receiving the cancer therapy.

[0132] Environmental information at during the process can be a factor affecting the effect of the cancer therapy. For example, a factor affecting the effect of the cancer therapy can be information on at least one of the region and hospital in which the therapy is provided.

[0133] Processing history information can be used as treatment history for past cancer therapy. For example, therapy history for past cancer therapy is the type and number of sessions of cancer therapy received in the past, and the effect at that time.

[0134] The process type can be radiation therapy and chemotherapy.

[0135] The process quantity can be radiation intensity and anticancer drug dosage.

[0136] The process effect can be the amount of reduction in the size of the cancer.

[0137] The variables in the case of store sales promotion are as follows.

[0138] The feature of a process target can be at least one of attributes (e.g., gender, age, annual income, and preferences) and the purchase history of the purchaser of a product.

[0139] Environmental information during the process can be a factor affecting store sales. For example, a factor affecting the purchase of a product can be at least one of commodity trends, climate, and economy.

[0140] Process history information can be the purchase history during a past process. For example, the purchase history during a past process is the type, number of times, date and time of purchase, or the like of the product purchased in the past.

[0141] The process type can be a discount ticket, an internet advertisement, a television advertisement, or a travel application ticket.

[0142] The process quantity can be a discount amount, advertisement frequency, and winning probability.

[0143] The process effect can be store sales.

[0144] The variables in the case of sales promotion of a high-priced (high-value) service are as follows.

[0145] Here, a high-priced service is to provide reserved seats on railways or buses, etc., high-class seats on airplanes, etc., high-class cabins on ships, etc., reserved seats for events such as sports or entertainment, priority tickets for tourist facilities, or high-class guest rooms at hotels.

[0146] In the case of a seat or ticket, the service includes a seat guarantee, the ability to ride an attraction without queuing on the day of the visit, and priority rights such as time selection. In particular, high-priced services in railway services include, in addition to regular reserved seats, reserved seats with good views, luxurious sleeper cars, and reserved seats on planned trains.

[0147] The feature of the process target can be at least one of the attributes (e.g., gender, age, annual income, work style, place of residence, place of work, preferences, and family structure) and purchase history of the user of the service.

[0148] Environmental information during a process can be a factor affecting the sales value of a service. For example, a factor affecting the sales value of a service can be at least one of the following factors: reservation rate, congestion rate, weather, accident, sales plan, and infection status. In particular, in railway services, weather also includes disasters such as typhoons, earthquakes, and heavy rain. Accidents include vehicle breakdowns, overhead line trouble, personal injury accidents, and railway line fires. Sales plans include planned suspensions and temporary increases in the number operations due to power outages, construction, strikes, consecutive holidays, and the like.

[0149] The process history information can be action history during a past process. The action history is a history of actions taken by users of the service. For example, the action history also includes history of actions (i.e., purchase history) of users purchasing tickets, history of users applying for campaigns, history of users exercising rights, history of users using coupons, and the like.

[0150] The process type can be price increases, price reductions, campaigns, and coupon distribution.

[0151] The process quantity can be a price increase amount, a price reduction amount, and a coupon amount. The price increase amount and price reduction amount may be the difference in price increase (price increase rate) and the difference in price reduction (price reduction rate), respectively.

[0152] The process effect is the sales value of high-priced (high-value) services.

[0153] Accordingly, the following describes a case in which the first embodiment is applied to a railway service.

[0154] The causal variables are at least one of attributes and purchase history of railway users, factors affecting the sales value of the railway service, action history of the railway users, fare price increases, fare price reductions, and coupon amounts to promote railway use.

[0155] The process types are fare price increases, fare price reductions, campaigns to promote railway use, and distribution of coupons to promote railway use.

[0156] The outcome variable is the sales value of a high-priced railway service.

[0157] On the basis of the obtained process effect, the first to third operations can be performed, for example, as follows.

[0158] First operation: To increase the reservation rate of reserved train seats, the price of reserved seats can be reduced during busy times for users who ride the train for extended periods of time on the basis of their travel time, and conversely, during less busy times, the price of reserved seats can be reduced for users who ride the train for short periods of time.

[0159] Second operation: A campaign can be conducted to upgrade the grade of train or airplane seats for free, to encourage customers to book higher-grade seats in the future.

[0160] Third operation: Services can be recommended on the basis of the price range that matches the characteristics of the user.

[0161] The distribution of coupons and other methods can be used to encourage the purchase of higher-priced services.

Second Embodiment

[0162] In the second embodiment, individual process effects at multiple time points are estimated by using a deep multitask Gaussian process. Specifically, in the second embodiment, a case in which individual process effects are estimated for when processes are performed at multiple time points.

[0163] The second embodiment explains, for example, a case in which four vaccinations are planned to lower the infection rate of a virus, as illustrated in FIG. 9. Specifically, in the second embodiment, it is estimated which vaccines should be administered two more times to achieve the highest reduction in infection rate, given that the vaccine has already been administered two times and two additional administrations are planned. In a state where the vaccine has not been administered even once, the treatment effect of the next four administrations may be estimated.

[0164] FIG. 10 is a block diagram schematically illustrating a configuration of an estimation device 200 serving as an information processing device according to the second embodiment.

[0165] The estimation device 200 includes an input unit 210, an encoder training unit 220, a decoder training unit 230, an estimating unit 240, and an output unit 250.

[0166] The input unit 210 inputs supervised data to the encoder training unit 220.

[0167] For each of the multiple processes, for example, the input unit 210 generates supervised data including a first causal variable that changes with time, a second causal variable that does not change with time, a process type, and history information indicating a history of changes in the first causal variable over time.

[0168] The input unit 110 includes a DB 211 serving as a data storage unit, an input executing unit 212, and a preprocessing unit 213.

[0169] The DB 211 stores data necessary for a process by the estimation device 200. Here, the DB 211 stores at least a feature X#, a feature V#, and a treatment type W#.

[0170] Here, the feature X# is a feature of the treatment target that fluctuates depending on the treatment, and the feature V# is a feature of the treatment target that does not fluctuate depending on the treatment.

[0171] The input executing unit 212 acquires the feature X#, the feature V#, and the treatment type W# from the DB 211 and gives these to the preprocessing unit 213.

[0172] The feature X#, the feature V#, and the treatment type W# are data observed for each individual i and are expressed by formula (34) below.

[00022] $\begin{matrix} {X, V, W} = {{X_{s}^{(i)}, W_{s}^{(i)}}_{s = 1}^{s_{\max}}, V^{(i)}}_{i = 1}^{N} & (34) \end{matrix}$ [0173] where N is the number of observed data items, and s.sub.max represents the time length of the observed time-series data.

[0174] The preprocessing unit 213 performs preprocessing described later on the feature X#, the feature V#, and the treatment type W# from the input executing unit 212 and gives the preprocessed feature X, feature V, and treatment type W to the encoder training unit 220.

[0175] When there are missing values in the feature X#, the feature V#, and the treatment type W# of an individual i, the preprocessing unit 213 removes data J2 of the individual i expressed by the formula (35) below.

[00023] $\begin{matrix} J 2 = {{X_{s}^{(i)}, W_{s}^{(i)}}_{s = 1}^{s_{\max}}, V^{(i)}} & (35) \end{matrix}$

[0176] The preprocessing unit 213 preprocesses the treatment type W# through the same procedure as that used by the preprocessing unit 113 in the first embodiment.

[0177] Moreover, the preprocessing unit 213 normalizes the feature X# and the feature V# so that the average is zero and the variance is one.

[0178] The encoder training unit 220 and the decoder training unit 230 function as a training unit that trains the learning model in the second embodiment.

[0179] For example, the training unit according to the second embodiment uses the supervised data from the input unit 210 to generate a learning model by learning the change with time of the first causal variable at a next time from the first causal variable, the second causal variable, and the processing type for each of the multiple processes in accordance with the history information. The trained learning model is stored in a storage unit (not illustrated).

[0180] FIGS. 11A to 11D are schematic diagrams illustrating the operation concept of the learning model according to the second embodiment.

[0181] FIG. 11 illustrate causal variables X that change with time, causal variables V that do not change with time, treatment types W, and history information H.

[0182] X is a factor that changes depending on a treatment, such as infection rate, and is both a causal variable and a target variable. For example, a treatment W.sub.S is performed during step s, and when X.sub.s changes to X.sub.S+1, X.sub.S is the causal variable, and X.sub.S+1 is the target variable.

[0183] X.sub.S, which affects both the treatment W.sub.S and the result X.sub.S+1 after the treatment, as indicated by the one-point chain line in FIG. 11, is referred to as a confounding factor.

[0184] When there is a confounding factor, the volume of observation data of a treatment having a high treatment probability increases due to the effect of X.sub.S on W.sub.S. Therefore, when training is performed with data as it is, a bias occurs in which the model is optimized for a treatment with a high probability, and when the effect of a treatment having low treatment probability is to be estimated, the prediction accuracy deteriorates; thus, it becomes difficult to estimate the correct treatment effect.

[0185] Accordingly, in the second embodiment, the effect of a confounding factor is reduced by using expression learning described in the fourth characteristic below.

[0186] First, the learning model is described.

[0187] FIGS. 12A and 12B are schematic diagrams for describing the learning model according to the second embodiment.

[0188] The learning model according to the second embodiment is a combination of an encoder-decoder (or Seq2Seq) of a known technique and a deep multitask Gaussian process of a known technique.

[0189] First, the characteristics of the learning model according to the second embodiment are described.

[0190] First characteristic: When a treatment provided at multiple time points is considered, the question is how far past effects are to be considered. As illustrated in FIG. 11A, the more items of past information are included, the more complicated the causal graph becomes, and the more factors are to be considered. Therefore, in the second embodiment, for example, Markov characteristics as illustrated in FIG. 11B are introduced to simplify the learning model by using information up to the previous time as history information H.sub.S1.

[0191] Second characteristic: As illustrated in FIGS. 11B and 13, for example, intermediate representations (hidden variables) calculated in the deep multitask Gaussian process of a known technique can be used to calculate history H.sub.s.

[0192] Third characteristic: As illustrated in FIGS. 11C and 13, the second layer of the deep multitask Gaussian process is branched, where one branch predicts the causal variable J3 expressed by the formula (36) below from H.sub.S in the multitask Gaussian process, and the other branch predicts the treatment type J4 expressed by the formula (37) below from H.sub.S with a Gaussian process classification model of a known technique.

[00024] $\begin{matrix} J 3 = {\overset{}{X}}_{s + 1} & (36) \end{matrix}$ $\begin{matrix} J 4 = {\overset{}{W}}_{s} & (37) \end{matrix}$

[0193] Fourth characteristic: The influence of a confounding factor on H.sub.S to W.sub.S can be reduced by improving the prediction accuracy of the causal variable J3 expressed by formula (36) from H.sub.S through the application of an expression learning approach of a known technique, and learning f.sub.h to calculate H.sub.S, which makes it difficult to predict the treatment type J4 expressed by formula (37) from H.sub.S.

[0194] Fifth feature: As illustrated in FIGS. 11D and 12, long-term prediction is possible by recursively transmitting the intermediate representation H.sub.S of the deep multitask Gaussian process to the next step (s+1), and the treatment effect in any multiple treatment plans W.sub.t:t+ can be estimated.

[0195] Sixth characteristic: Since the estimation device 200 according to the second embodiment estimates the process effect of multiple processes through a deep multitask Gaussian process as described above, in addition to the effects described in the first embodiment, the reliability of the estimated process effect can also be determined.

[0196] The deep multitask Gaussian process of the second embodiment and the learning model of an encoder-decoder are explained below.

[0197] The deep multitask Gaussian process will now be explained.

[0198] Formulas (38) to (43) below express the relationships in a deep multitask Gaussian process.

[0199] Formula (38) is a Gaussian process expression in a first layer MTGP, and formula (39) is a likelihood function in the first layer MTGP.

[0200] Formula (40) is a Gaussian process expression in a second layer MTGP, and formula (41) is a likelihood function in the second layer MTGP.

[0201] Formula (42) is a Gaussian process expression in a GP classification layer, and formula (43) is a likelihood function in the GP classification layer.

[00025] $\begin{matrix} f_{h} (X_{s}^{}) ~ gp ((X_{s}^{}), K_{_{h}} (X_{s}^{}, X_{s}^{})) & (38) \end{matrix}$

[00026] $\begin{matrix} {\overset{}{H}}_{s} = mean (f_{h} (X_{s}^{})) & (39) \end{matrix}$ $\begin{matrix} f_{x} ({\overset{}{H}}_{s}) gp (({\overset{}{H}}_{s}), K_{_{x}} ({\overset{}{H}}_{s}, {\overset{}{H}}_{s})) & (40) \end{matrix}$ $\begin{matrix} {\overset{}{X}}_{s + 1} (f_{x} ({\overset{}{H}}_{s}) .Math. I_{tasks} (W_{s}),^{2} I) & (41) \end{matrix}$ $\begin{matrix} f_{a} ({\overset{}{H}}_{s}) gp (({\overset{}{H}}_{s}), K_{_{a}} ({\overset{}{H}}_{s}, {\overset{}{H}}_{s})) & (42) \end{matrix}$ $\begin{matrix} {\overset{}{W}}_{s} = softmax (f_{a} ({\overset{}{H}}_{s})) & (43) \end{matrix}$

[0202] Here, X.sub.s is expressed by formula (44) below.

[00027] $\begin{matrix} X_{s}^{} = {X_{s}, V, H_{s - 1}} & (44) \end{matrix}$

[0203] Due to the Gaussian process, in formulas (40) and (41), or formulas (42) and (43), the respective standard deviations J7 and J8 expressed by formulas (47) and (48) can be obtained in addition to the estimated mean values J5 and J6 expressed by formulas (45) and (46).

[00028] $\begin{matrix} J 5 = {\overset{}{X}}_{s + 1} & (45) \end{matrix}$ $\begin{matrix} J 6 = {\overset{}{W}}_{s} & (46) \end{matrix}$ $\begin{matrix} J 7 = {\overset{}{X}}_{s + 1} & (47) \end{matrix}$ $\begin{matrix} J 8 = {\overset{}{W}}_{s} & (48) \end{matrix}$

[0204] In formulas (38) to (43) above, f is a function of a Gaussian process, and , sm, and N are mean functions representing a softmax function and a Gaussian distribution. Here, denotes a Gaussian process parameter, and denotes a dispersion value of the Gaussian distribution. None of the parameters have time dependency and are the same at all times.

[0205] Each causal variable constituting X.sub.s may also be an independent causal variable as in the first embodiment, or each causal variable (latent variable) may be divided into elements that do not change with time and elements that change with time, such as elements Xs and V expressed by formulas (49) and (50) below.

[00029] $\begin{matrix} X_{s} = {x_{s}^{covariate}, x_{s}^{dosage}, x_{s}^{history}, x_{s}^{environment}} & (49) \end{matrix}$ $\begin{matrix} V = {v^{covariate}, v^{dosage}, v^{history}, v^{environment}} & (50) \end{matrix}$

[0206] The prediction target J9 included in formula (41) above and expressed by formula (51) below may predict only a factor (for example, infection rate) that changes depending on the treatment. For example, only the factor J10 expressed by formula (52) below may be a prediction target.

[00030] $\begin{matrix} J 9 = {\overset{}{X}}_{s + 1} & (51) \end{matrix}$ $\begin{matrix} J 10 = x_{s}^{covariate} & (52) \end{matrix}$

[0207] A Gaussian process will now be explained.

[0208] The matrix K1 expressed by formula (53) below represents a variance covariance matrix of a normal Gaussian process having one output.

[0209] The matrices K2 and K3 expressed by formulas (54) and (55) represent variance covariance matrices of a multitask Gaussian process.

[00031] $\begin{matrix} K 1 = K_{_{a}} & (53) \end{matrix}$ $\begin{matrix} K 2 = K_{_{h}} & (54) \end{matrix}$ $\begin{matrix} K 3 = K_{_{x}} & (55) \end{matrix}$

[0210] Here, as illustrated in FIG. 14, the number of output dimensional number DH of the variance covariance matrix K2 of a multitask Gaussian process expressed by formula (54) above is set to a dimensional number of one or more, and the output dimensional number N.sub.task of the variance covariance matrix K3 expressed by formula (55) is set to the category number of W.

[0211] The kernel for calculating the variance covariance matrices K2 and K3 expressed by formulas (54) and (55) above may be either an independent multitask kernel having no correlation between output dimensions, or a dependent multitask kernel having correlation between output dimensions. In the dependent multitask kernel, for example, an LMC kernel having linear correlation between the output dimensions may be used.

[0212] Since a computational load of O(N.sup.3) is applied when the Gaussian process is directly calculated, the computational load may be reduced by using stochastic variational inference for Gauss process (SVI-GP) using inducing points of a known technique.

[0213] A likelihood function will now be described.

[0214] The predictive distribution f.sub.h(X.sub.s) of output from the first layer MTGP cannot be directly used as input to the second layer Gaussian process; therefore, the intermediate representation J11 expressed by formula (56) below may be calculated by determining the mean of the predictive distribution f.sub.h(X.sub.t) of formula (39) used in a known technique of doubly stochastic variational inference for deep Gaussian processes (DSVI-DGP), or the intermediate representation J11 expressed by formula (56) below may be acquired by sampling the predictive distribution f.sub.h(X.sub.s).

[00032] $\begin{matrix} J 11 = {\overset{}{H}}_{s} & (56) \end{matrix}$

[0215] Since only one task (e.g., process W.sub.s) is observed for each input data item in the second layer MTGP, only the type treated by multiplying the output J12 of the MTGP expressed by formula (57) with an index matrix I.sub.tasks(W.sub.s) in which W.sub.s has a one-hot format are output, as illustrated in FIG. 14.

[00033] $\begin{matrix} J 12 = f_{x} ({\overset{}{H}}_{s}) & (57) \end{matrix}$

[0216] The second classification layer makes it possible to predict the treatment W.sub.S for categorical variables by making the likelihood function a softmax function, as expressed in formula (48) above.

[0217] Next, the encoder-decoder model will be described.

[0218] An intermediate representation H.sub.s of the deep multitask Gaussian process is transmitted to the next step by using an encoder-decoder of a known technique. As illustrated in FIG. 12, the encoder unit can perform one-step ahead prediction, while the decoder unit can perform multiple-step ahead prediction. The encoder-decoder model of the second embodiment is not limited to the example illustrated in FIG. 12; for example, the encoder-decoder may incorporate an attention mechanism or employ a transformer.

[0219] The structure of the deep multitask Gaussian processes of the encode unit and the decoder unit are the same, but the parameters are individually refined.

[0220] Details of the operation of each unit are described on the basis of the deep multitask Gaussian process and the encoder-decoder model which are described above.

[0221] The encoder training unit 220 in FIG. 10 trains the encoder side in the encoder-decoder model.

[0222] The encoder training unit 220 in FIG. 10 includes an initializing unit 221, a predicting unit 222, a classification optimizing unit 223, and a prediction optimizing unit 224.

[0223] The initializing unit 221 sets the dimensional number DR of H illustrated in FIG. 14 to a dimensional number of one or more.

[0224] The initializing unit 221 also initializes the parameter J13 of the deep multitask Gaussian process of the encoder expressed by formula (58) below. The value to be initialized may be, for example, one, or sampled from any Gaussian distribution.

[00034] $\begin{matrix} J 13 = {_{h}^{e},_{w}^{e},_{x}^{e},^{e}} & (58) \end{matrix}$

[0225] As illustrated in FIG. 12A, the predicting unit 222 first inputs H.sub.0, which is a zero vector, and {X.sub.1, V, W.sub.1} in the first step (for example, s=1) in the encoder and calculates the data J14 expressed by formula (59) below. Thereafter, the predicting unit 222 calculates the data J17 expressed by formula (62) below by recursively inputting the data J15 expressed by formula (60) below and the data J16 expressed by formula (61) below until step t (t is an integer of two or more).

[00035] $\begin{matrix} J 14 = {{\overset{}{H}}_{1}^{e}, {\overset{}{W}}_{1}^{e}, {\overset{}{X}}_{2}^{e}} & (59) \end{matrix}$ $\begin{matrix} J 15 = {X_{s}, V, W_{s}}_{s = 1}^{t} & (60) \end{matrix}$ $\begin{matrix} J 16 = {\overset{}{H}}_{s}^{e} & (61) \end{matrix}$ $\begin{matrix} J 17 = {{\overset{}{H}}_{s}^{e}, {\overset{}{W}}_{s}^{e}, {\overset{}{X}}_{s + 1}^{e}}_{s = 1}^{t} & (62) \end{matrix}$

[0226] Then, the predicting unit 222 calculates the data J18 expressed by formula (63) below for all individuals (i) and gives the data J18 to the classification optimizing unit 223.

[00036] $\begin{matrix} J 18 = {{{\overset{}{H}}_{s}^{e (i)}, {\overset{}{W}}_{s}^{e (i)}, {\overset{}{X}}_{s + 1}^{e (i)}}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (63) \end{matrix}$

[0227] The classification optimizing unit 223 uses, as input, the data J19 expressed by formula (64) below and calculated by the predicting unit 222 and calculates the data J20 expressed by formula (65) below from formulas (42) and (43) above.

[00037] $\begin{matrix} J 19 = {{{\overset{}{H}}_{s}^{e (i)}}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (64) \end{matrix}$ $\begin{matrix} J 20 = {{{\overset{}{W}}_{s}^{e (i)}}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (65) \end{matrix}$

[0228] Then, the classification optimizing unit 223 optimizes the parameter J21 expressed by formula (67) below by minimizing the loss function expressed by formula (66) below by using an SVI-GP of a known technique.

[00038] $\begin{matrix} _{w}^{e} = {{-_{ELBO} (W_{s}^{(i)}, {\overset{}{W}}_{s}^{e (i)})}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (66) \end{matrix}$ $\begin{matrix} J 21 =_{w}^{e} & (67) \end{matrix}$ [0229] where, L.sub.ELBO represents the evidence lower bound (ELBO) of the SVI-GP loss function.

[0230] The prediction optimizing unit 224 calculates the data J24 expressed by formula (70) below by using formulas (38) and (39) above while using, as input, the data J22 expressed by formula (68) below and calculated by the predicting unit 222, and the observation data J23 expressed by formula (69) below.

[00039] $\begin{matrix} J 22 = {{{\hat{H}}_{s}^{e (i)}}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (68) \end{matrix}$ $\begin{matrix} J 23 = {{X_{s}, V, W_{s}}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (69) \end{matrix}$ $\begin{matrix} J 24 = {{{\hat{H}}_{s - 1}^{e^{} (i)}}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (70) \end{matrix}$

[0231] The prediction optimizing unit 224 calculates the data J26 expressed by formula (72) below from formulas (40) to (43) above by using the data J25 expressed by formula (71) below as input.

[00040] $\begin{matrix} J 25 = {{{\hat{H}}_{s - 1}^{e^{} (i)}}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (71) \end{matrix}$ $\begin{matrix} J 26 = {{{\hat{W}}_{s}^{e (i)}, {\hat{X}}_{s + 1}^{e (i)}}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (72) \end{matrix}$

[0232] To reduce the influence of a confounding factor, the prediction optimizing unit 224 uses the DSVI-DGP of a known technique with the loss function expressed by formula (76) below to optimize the parameter J30 expressed by formula (77) below; this is done to increase the prediction accuracy of the treatment effect J29 expressed by formula (75) below from the information J27 expressed by formula (73) below, while ensuring that the treatment J28 expressed by formula (74) below cannot be predicted from the information J27. However, the parameter J31 expressed by formula (78) below is fixed.

[00041] $\begin{matrix} J 27 = {\hat{H}}_{s} & (73) \end{matrix}$ $\begin{matrix} J 28 = {\hat{W}}_{s} & (74) \end{matrix}$ $\begin{matrix} J 29 = {\hat{X}}_{s + 1} & (75) \end{matrix}$ $\begin{matrix} _{h, x}^{e} = {{-_{ELBO} (X_{s + 1}^{(i)}, {\hat{X}}_{s + 1}^{e (i)})}_{s = 1}^{s_{\max}}}_{i = 1}^{N} -^{e} {{-_{ELBO} (W_{s}^{(i)}, {\hat{W}}_{s}^{e (i)})}_{s = 1}^{s_{\max}}}_{i = 1}^{N} & (76) \end{matrix}$ $\begin{matrix} J 30 =_{h}^{e},_{x}^{e},^{e} & (77) \end{matrix}$ $\begin{matrix} J 31 =_{w}^{e} & (78) \end{matrix}$ [0233] where .sup.e is a ratio that takes into account the term J32 in formula (79) below.

[00042] $\begin{matrix} J 32 =_{ELBO} (W_{s}^{(i)}, {\hat{W}}_{s}^{e (i)}) & (79) \end{matrix}$

[0234] The prediction optimizing unit 224 returns to the process by the predicting unit 222 until the value J33 expressed by formula (80) below and the value J34 expressed by formula (81) below converge, and repeats the processes by the classification optimizing unit 223 and the prediction optimizing unit 224 to optimize the parameter J35 expressed by formula (82) below.

[00043] $\begin{matrix} J 33 =_{w}^{e} & (80) \end{matrix}$ $\begin{matrix} J 34 =_{h, x}^{e} & (81) \end{matrix}$ $\begin{matrix} J 35 = {_{h}^{e},_{w}^{e},_{x}^{e},^{e}} & (82) \end{matrix}$

[0235] When the value J33 expressed in formula (80) and the value J34 expressed in formula (81) converge, the prediction optimizing unit 224 uses the optimized parameter J35 expressed by formula (82) and the parameter optimized by the predicting unit 222 and gives the data J18 calculated by formula (63) to the decoder training unit 230.

[0236] The decoder training unit 230 in FIG. 10 trains the decoder side in the encoder-decoder model.

[0237] The decoder training unit 230 in FIG. 10 includes an initializing unit 231, a predicting unit 232, a classification optimizing unit 233, and a prediction optimizing unit 234.

[0238] The initializing unit 231 initializes the parameter J36 of the decoder unit expressed by formula (83) below with the parameter J35 expressed by formula (82) and optimized by the encoder unit.

[00044] $\begin{matrix} J 36 = {_{w}^{d},_{h}^{d},_{x}^{d},^{d}} & (83) \end{matrix}$

[0239] The predicting unit 232 executes the following process to predict (step ahead prediction) the variable J37 at a step that is steps ahead of step t and expressed by formula (84) below.

[00045] $\begin{matrix} J 37 = {\hat{X}}_{s +} & (84) \end{matrix}$

[0240] As illustrated in FIG. 12B, in the predicting unit 232, the decoder receives, as input, the data J38 expressed by formula (85) below and last predicted by the encoder, and observation data {V, W.sub.(t+1)}, and the data J39 expressed formula (86) below is calculated.

[00046] $\begin{matrix} J 38 = {{\hat{H}}_{t}^{e}, {\hat{X}}_{t + 1}^{e}} & (85) \end{matrix}$ $\begin{matrix} J 39 = {{\hat{H}}_{t + 1}^{d}, {\hat{W}}_{t + 1}^{d}, {\hat{X}}_{t + 2}^{d}} & (86) \end{matrix}$

[0241] Next, the predicting unit 232 calculates the data J42 expressed by formula (89) below by recursively inputting the data J40 expressed by formula (87) below and the predictive data J41 expressed by formula (88) below up to step t+1, and obtains the data J43 expressed by formula (90) below.

[00047] $\begin{matrix} J 40 = {V, W_{s}}_{s = t + 2}^{t + - 1} & (87) \end{matrix}$ $\begin{matrix} J 41 = {{\hat{H}}_{s - 1}^{d}, {\hat{X}}_{s}^{d}} & (88) \end{matrix}$ $\begin{matrix} J 42 = {{\hat{H}}_{s}^{d}, {\hat{W}}_{s}^{d}, {\hat{X}}_{s + 1}^{d}}_{s = t + 2}^{t + - 1} & (89) \end{matrix}$ $\begin{matrix} J 43 = {\hat{X}}_{t +}^{d} & (90) \end{matrix}$

[0242] The predicting unit 232 calculates the data J44 expressed by formula (91) below for every individual (i) and gives the data J44 to the classification optimizing unit 233. Here, .sub.max represents the maximum number of steps to the prediction destination.

[00048] $\begin{matrix} J 44 = {{{{\hat{H}}_{s + - 1}^{d (i)}, {\hat{W}}_{s + - 1}^{d}, {\hat{X}}_{s +}^{d}}_{= 1}^{_{\max}}}_{s = 2}^{s_{\max} -_{\max}}}_{i = 1}^{N} & (91) \end{matrix}$

[0243] The classification optimizing unit 233 calculates the data J46 expressed by formula (93) below from formulas (42) and (43) above by using, as input, the data J45 expressed by formula (92) below and calculated by the predicting unit 232.

[00049] $\begin{matrix} J 45 = {{{{\hat{H}}_{s + - 1}^{d (i)}}_{= 1}^{_{\max}}}_{s = 2}^{s_{\max} -_{\max}}}_{i = 1}^{N} & (92) \end{matrix}$ $\begin{matrix} J 46 = {{{{\hat{W}}_{s + - 1}^{d (i)}}_{= 1}^{_{\max}}}_{s = 2}^{s_{\max} -_{\max}}}_{i = 1}^{N} & (93) \end{matrix}$

[0244] The classification optimizing unit 233 optimizes the parameter J47 expressed by formula (95) below by minimizing the loss function expressed by formula (94) below by using an SVI-GP of a known technique. During optimization, optimization is performed by calculating the sum of the losses of =1 to =.sub.max and performing back propagation.

[00050] $\begin{matrix} _{w}^{d} = {{{-_{ELBO} (W_{s + - 1}^{(i)}, {\hat{W}}_{s + - 1}^{d (i)})}_{= 1}^{_{\max}}}_{s = 2}^{s_{\max} -_{\max}}}_{i = 1}^{N} & (94) \end{matrix}$ $\begin{matrix} J 47 =_{w}^{d} & (95) \end{matrix}$

[0245] The prediction optimizing unit 234 uses the data J48 expressed by formula (96) below as an initial input, and recursively inputs the data J49 expressed by formula (97) below up to s=t+1, as illustrated in FIG. 12B, to calculate the data J50 expressed by formula (98).

[00051] $\begin{matrix} J 48 = {{{\hat{H}}_{s - 1}^{e (i)}, X_{s}, V, W_{s}}_{s = 2}^{s_{\max} -_{\max}}}_{i = 1}^{N} & (96) \end{matrix}$ $\begin{matrix} J 49 = {{\hat{H}}_{s - 1}^{d (i)}, {\hat{X}}_{s}^{d (i)}} & (97) \end{matrix}$ $\begin{matrix} J 50 = {{{{\hat{W}}_{s + - 1}^{d (i)}, {\hat{X}}_{s +}^{d (i)}}_{= 1}^{_{\max}}}_{s = 2}^{s_{\max} -_{\max}}}_{i = 1}^{N} & (98) \end{matrix}$

[0246] Then, the prediction optimizing unit 234 uses a DSVI-DGP of a known technique to optimizes the parameter J51 expressed by formula (100) below with the loss function expressed by formula (99) below. The parameter J52 expressed by formula (101) below is fixed.

[00052] $\begin{matrix} _{h, x}^{d} = {{-_{ELBO} ({X_{s +}^{d (i)}}_{= 1}^{_{\max}}, {{\hat{X}}_{s +}^{d (i)}}_{= 1}^{_{\max}}) +^{d}_{ELBO} ({W_{s +}^{d (i)}}_{= 1}^{_{\max}}, {{\hat{W}}_{s +}^{d (i)}}_{= 1}^{_{\max}})}_{s = 2}^{s_{\max} -_{\max}}}_{i = 1}^{N} & (99) \end{matrix}$ $\begin{matrix} J 51 =_{h}^{d},_{x}^{d},^{d} & (100) \end{matrix}$ $\begin{matrix} J 52 =_{w}^{d} & (101) \end{matrix}$ [0247] where .sup.d is a ratio that takes into account the term J53 expressed by formula (102) below.

[00053] $\begin{matrix} J 53 =_{ELBO} ({W_{s +}^{d (i)}}_{= 1}^{_{\max}}, {{\hat{W}}_{s +}^{d (i)}}_{= 1}^{_{\max}}) & (102) \end{matrix}$

[0248] The prediction optimizing unit 234 returns to the process by the predicting unit 232 until the value J54 expressed by formula (103) below and the value J55 expressed by formula (104) below converge, and repeats the processes by the classification optimizing unit 233 and the prediction optimizing unit 234 to optimize the parameter J36 expressed by formula (83) above.

[00054] $\begin{matrix} J 54 =_{w}^{d} & (103) \end{matrix}$ $\begin{matrix} J 55 =_{h, x}^{d} & (104) \end{matrix}$

[0249] When the value J54 expressed by formula (103) above and the value J55 expressed by formula (104) above converge, the prediction optimizing unit 234 gives the optimized parameter J36 expressed by formula (83) above and the optimized parameter J35 of the encoder expressed by formula (82) above to the estimating unit 240.

[0250] Approaches other than those explained above may be taken to optimize the parameter J35 of the encoder training unit 220 expressed by formula (82) and the parameter J36 of the decoder training unit 230 expressed by formula (83) above.

[0251] For example, training may be performed using adversarial training of a known technique. In this case, the discriminator (f.sub.a) may perform training to predict the treatment J57 expressed by formula (106) below from the intermediate representation J56 expressed by formula (105) below; and the Generator (f.sub.h) may perform training so that the discriminator (f.sub.a) cannot predict the treatment J57 expressed by formula (106) below from the intermediate representation J56 expressed by formula (105) below while the prediction accuracy of the treatment effect J58 expressed by formula (107) below is increased.

[00055] $\begin{matrix} J 56 = {\hat{H}}_{s} & (105) \end{matrix}$ $\begin{matrix} J 57 = {\hat{W}}_{s} & (106) \end{matrix}$ $\begin{matrix} J 58 = {\hat{X}}_{s + 1} & (107) \end{matrix}$

[0252] The estimating unit 240 uses a learning model to estimate the change with time of the first causal variable obtained by combining two or more processes selected from multiples processes over a certain period of time.

[0253] In this way, the estimating unit 240 can specify an optimal combination of two or more processes in accordance with the estimated change with time.

[0254] For example, the estimating unit 240 calculates the treatment effect J60 of a treatment plan J59 expressed by formulas (108) and (109) below for a time point t ahead of any time t and a dispersion value J61 expressed by formula (110) below, from the parameter J35 of the optimized encoder expressed by formula (82) and the parameter J36 of the optimized decoder expressed by formula (83).

[00056] $\begin{matrix} J 59 = {W_{s}}_{s = t}^{s = t + - 1} & (108) \end{matrix}$ $\begin{matrix} J 60 = {{\hat{X}}_{s + 1}^{d}}_{s = t}^{s = t + - 1} & (109) \end{matrix}$ $\begin{matrix} J 61 = {{\hat{X}}_{s + 1}^{d}}_{s = t}^{s = t + - 1} & (110) \end{matrix}$

[0255] The estimating unit 240 selects the treatment plan J59 expressed by for (108) above where the final treatment effect J62 expressed by formula (111) below is the highest.

[00057] $\begin{matrix} J 62 = {\hat{X}}_{t +}^{d} & (111) \end{matrix}$

[0256] Then, the estimating unit 240 gives the output unit 250 the optimal treatment plan J59 and its treatment effect J62 expressed by formulas (108), (109), and (110) above, and the dispersion value J63 expressed by formula (112) below.

[00058] $\begin{matrix} J 63 = {\hat{X}}_{t +}^{d} & (112) \end{matrix}$

[0257] When there are many combinations of the treatment plan J59 expressed by formula (108) above, or when T is for a long period of time, search of treatment plans may be performed by Monte Carlo sampling or reinforcement learning to learn combinations that have the highest final treatment effect.

[0258] The optimal treatment plan may be the one with a large final treatment effect J62 in the treatment plans J59 expressed by formulas (108) and (111) above, or the one with a large cumulative treatment effect J64 expressed by formula (113) below.

[00059] $\begin{matrix} J 64 = {.Math.}_{s = t}^{s = t + - 1} ({\hat{X}}_{s + 1}^{d} - {\hat{X}}_{s}^{d}) & (113) \end{matrix}$

[0259] The estimating unit 240 may prepare a treatment plan that can obtain the target treatment effect the fastest, a treatment plan that improves the situation the most on the last day of the time period, or a continuous treatment plan that maintains the current status so as not to exceed an index situation.

[0260] The output unit 250 outputs the result estimated by the estimating unit 240.

[0261] For example, the output unit 250 outputs the treatment plan, the treatment effect, and the dispersion value from the estimating unit 240. Specifically, the output unit 250 displays the treatment plan, the treatment effect, and the dispersion value.

[0262] The estimation device 200 described above can be implemented by, for example, the PC 10 illustrated in FIG. 7.

[0263] For example, the DB 211 can be implemented by storage such as the auxiliary storage device 11.

[0264] The input executing unit 112, the preprocessing unit 113, the encoder training unit 220, the decoder training unit 230, and the estimating unit 140 can be implemented by the processor 13 loading programs stored in storage such as the auxiliary storage device 11 to the memory 12 and executing the programs.

[0265] The output unit 250 can be implemented by the display 16.

[0266] FIGS. 15 and 16 are flowcharts illustrating operation of the estimation device 200 according to the second embodiment described above.

[0267] First, the input executing unit 212 acquires {X#, V#, W#} from the DB 211 and gives them to the preprocessing unit 213 (step S30).

[0268] Next, the preprocessing unit 213 deletes data having missing values in X#, V#, and W# (step S31).

[0269] Next, the preprocessing unit 213 preprocesses W# (step S32).

[0270] Next, the preprocessing unit 213 normalizes X# (step S33).

[0271] Next, the preprocessing unit 213 gives {X, V, W} subjected to the above process to the encoder training unit 220 (step S34).

[0272] Next, the initializing unit 221 of the encoder training unit 220 sets the dimensional number DH of H (step S35).

[0273] Next, the initializing unit 221 initializes the parameter J13 expressed by formula (58) above (step S36). Next, the predicting unit 222 calculates the data J65 expressed by formula (114) below (step S37).

[00060] $\begin{matrix} J 65 = {\hat{H}}_{s}^{e (i)} & (114) \end{matrix}$

[0274] Next, the classification optimizing unit 223 calculates the data J66 expressed by formula (115) below (step S38).

[00061] $\begin{matrix} J 66 = {\hat{W}}_{s}^{e (i)} & (115) \end{matrix}$

[0275] Next, the classification optimizing unit 223 optimizes the parameter J67 expressed by formula (116) below (step S39).

[00062] $\begin{matrix} J 67 =_{w}^{e} & (116) \end{matrix}$

[0276] Next, the prediction optimizing unit 224 calculates the data J68 expressed by formula (117) below (step S40).

[00063] $\begin{matrix} J 68 = {\hat{H}}_{s}^{e^{} (i)} & (117) \end{matrix}$

[0277] Next, the prediction optimizing unit 224 calculates the data J69 and the data J70 expressed by formulas (118) and (119), respectively, below (step S41).

[00064] $\begin{matrix} J 69 = {\hat{W}}_{s}^{e (i)} & (118) \end{matrix}$ $\begin{matrix} J 70 = {\hat{X}}_{s}^{e (i)} & (119) \end{matrix}$

[0278] Next, the prediction optimizing unit 234 optimizes the parameter J30 expressed by formula (77) above (step S42).

[0279] Here, the prediction optimizing unit 224 returns the process to step S37 to repeat the process until the value J33, which is a loss function and expressed by formula (80) below, and the value J34 expressed by formula (81) are minimized, or until a predetermined number of repetitions is reached.

[0280] Next, the prediction optimizing unit 234 gives the optimized parameter J35 expressed by formula (82) above to the decoder training unit 230 (step S43). The process then proceeds to step S44 in FIG. 16.

[0281] In step S44 of FIG. 16, the initializing unit 231 of the decoder training unit 230 initializes the parameter J36 of the decoder unit expressed by formula (83) above with the parameter J35 expressed by formula (82) above and optimized by the encoder unit.

[0282] Next, the predicting unit 232 calculates the data J71 expressed by formula (120) below (step S45).

[00065] $\begin{matrix} J 71 = {\hat{H}}_{s}^{d (i)} & (120) \end{matrix}$

[0283] Next, the classification optimizing unit 233 calculates the data J72 expressed by formula (121) below (step S46).

[00066] $\begin{matrix} J 72 = {\hat{W}}_{s}^{d (i)} & (121) \end{matrix}$

[0284] Next, the classification optimizing unit 233 optimizes the parameter J73 expressed by formula (122) below (step S47).

[00067] $\begin{matrix} J 73 =_{w}^{d} & (122) \end{matrix}$

[0285] Next, the prediction optimizing unit 234 calculates the data J74 and the data J75 expressed by formulas (123) and (124) below (step S48).

[00068] $\begin{matrix} J 74 = {\hat{W}}_{s}^{d (i)} & (123) \end{matrix}$ $\begin{matrix} J 75 = {\hat{X}}_{s}^{d (i)} & (124) \end{matrix}$

[0286] Next, the prediction optimizing unit 234 optimizes the parameter J51 expressed by formula (100) above (step S49).

[0287] Here, the prediction optimizing unit 234 returns the process to step S45 until the value J54 expressed by formula (103) above and the value J55 expressed by formula (104) above, which are loss functions, are minimized, or until a predetermined number of repetitions is reached.

[0288] Next, the prediction optimizing unit 234 gives the estimating unit 240 the parameter J35 expressed by formula (82) above and optimized by the encoder training unit 220, and the parameter J36 expressed by formula (83) above and optimized by the decoder training unit 230 (step S50).

[0289] Next, the estimating unit 240 calculates the treatment effect J60 of a treatment plan J59 expressed by formulas (108) and (109) above for a time point ahead of any time point t and the dispersion value J61 expressed by formula (110) above, from the parameter J35 of the optimized encoder expressed by formula (82) and the parameter J36 of the optimized decoder expressed by formula (83) (step S51).

[0290] Next, the estimating unit 240 selects the treatment plan J59 expressed by formula (108) above where the final treatment effect J62 expressed by formula (111) above is the highest (step S52).

[0291] Next, the estimating unit 240 gives the output unit 250 the optimal treatment plan J59 expressed by formulas (108), (109), and (110), the treatment effect J62 thereof, and the dispersion value J61 thereof.

[0292] Next, the output unit 250 displays the treatment plan, the treatment effect, and the dispersion value from the estimating unit 240 (step S54).

[0293] Here, the following describes a case in which the second embodiment is applied to a railway service.

[0294] The first causal variable is the fare price increase amount, the fare price reduction amount, and the amount of coupon to promote railway use.

[0295] The second causal variables are at least one of attributes and purchase history of railway users, factors affecting the sales value of railway services, and action history of railway users.

[0296] The process types are fare price increases, fare price reductions, campaigns to promote railway use, and distribution of coupons to promote railway use.

[0297] The history information is a history of fare price increases, a history of fare price reductions, and a history of distribution of coupons to promote railways use.

[0298] As described above, according to the second embodiment, it becomes clear which action plan should be prepared at what time point in conjunction with the characteristics of the target. The confidence level of the estimated value of the treatment effect estimated through the prepared action plan can also be determined.

[0299] For example, as illustrated in FIG. 9, when four doses of vaccine are to be administered, an administration plan can be prepared for the combination of vaccines to be administered (multiple combinations of treatments), the order (sequence of treatments), the vaccination timing (timing of treatment), and the timing to stop administration (where to stop treatment).

[0300] For example, with dynamic pricing of reserved seats in railroads or rooms in hotels, for example, it will be possible to know when to raise or lower the price of the seats or rooms, the number or combination of price increases and price reductions, and the order in which to raise or lower prices, in order to achieve higher final sales. [0301] Reference 1: Bica, I., Alaa, A. M., Jordon, J., and van der Schaar, M., Estimating Counterfactual Treatment Outcomes Over Time Through Adversarially Balanced Representations, International Conference on Learning Representations, 2020.

INFORMATION PROCESSING DEVICE

Assignee

Inventors

Cpc classification

Classification Explorer

G06N20/00

PHYSICS

Classification Explorer

G06Q50/40

PHYSICS

Classification Explorer

G06Q30/02011

PHYSICS

Classification Explorer

G06Q30/02043

PHYSICS

International classification

Classification Explorer

G06Q50/40

PHYSICS

Classification Explorer

G06N20/00

PHYSICS

Abstract

Claims

Description