SYSTEM AND METHOD FOR OPTIMIZING TRIAL DESIGN FOR CLINICAL TRIALS
20230238087 · 2023-07-27
Assignee
Inventors
- Nitish Jain (Pune, IN)
- Vipul Vinod Patni (Shrirampur, IN)
- Nishant Singhania (Jamshedpur, IN)
- Vismay Bansal (Mount Abu, IN)
- Bheru Mali (Bhilwara, IN)
Cpc classification
G16H50/70
PHYSICS
International classification
G16H50/20
PHYSICS
Abstract
A system and method for optimizing trial design for clinical trials. The system includes a computer system and a processor communicably coupled to a memory. The processor processes and structures raw trial data into a format suitable for input to train a machine learning model. The processor further identifies a plurality of independent features of the raw trial data and screens actionable features. Further, the processor computes cut-off range values for each of the actionable features and forms a plurality of sub-groups of patients. The processor simulates the patient response of each of the plurality of sub-groups of patients and identifies, based upon population percentage and delta response, a sub-group of patients that shows optimal clinical trial results in the simulated patient response.
Claims
1. A system for optimizing trial design in a clinical trial, wherein the system includes a computer system comprising a processor communicably coupled to a memory, the processor being configured to: process and structure raw trial data to a format suitable for input to train a machine learning model, wherein the raw trial data is patient data; identify plurality of independent features of the raw trial data, wherein the identification of the plurality of independent features is performed using the trained machine learning model; screen actionable features from the plurality of independent features using the trained machine learning model, wherein actionable features show opposite impact between treatment arm patients and control arm patient of the clinical trial; compute cut off range values for each of the actionable features, and form a plurality of sub-groups of patients using different combinations of cut-off range values, wherein the cut off range values define an upper limit and a lower limit for values of the actionable features; simulate patient response of each of the plurality of sub-groups of patients; and identify a sub-group of patients from the plurality of sub-groups, based upon population percentage and delta response obtained from the simulations, that shows optimal clinical trial results in the simulated patient response.
2. The system of claim 1, wherein the plurality of independent features have missing values in the raw trial data that are imputed using a plurality of imputation techniques, wherein the plurality of imputation techniques employ statistical extrapolation.
3. The system of claim 1, wherein the machine learning model is XGBoost regressor, and wherein the XGBoost regressor is trained using grid search.
4. The system of claim 3, wherein the XGBoost regressor identifies the independent features that do not impact efficacy of treatment used in the clinical trial.
5. The system of claim 1, wherein the plurality of independent features comprise at least one of: genetic features, baseline indexes, vital signs, underlying conditions, medical history, and demographics such as age, gender, height, weight, BMI, nationality, race.
6. The system of claim 1, wherein opposite impact between treatment arm patients and control arm patients is measured as improvement in the treatment arm patients and decrease in efficacy in the control arm patients.
7. A method for optimizing trial design in a clinical trial, wherein the method comprises: processing and structuring raw trial data to a format suitable for input to train a machine learning model using a processor, wherein the raw trial data is patient data; identifying plurality of independent features of the raw trial data, wherein the identification of the plurality of independent features is performed using the trained machine learning model; screening actionable features from the plurality of independent features using the trained machine learning model, wherein actionable features show opposite impact between treatment arm patients and control arm patient of the clinical trial; computing cut off range values for each of the actionable features, and form a plurality of sub-groups of patients using different combinations of cut-off range values, wherein the cut off range values define an upper limit and a lower limit for values of the actionable features; simulating patient response of each of the plurality of sub-groups of patients; and identifying a sub-group of patients from the plurality of sub-groups, based upon population percentage and delta response obtained from the simulations, that shows optimal clinical trial results in the simulated patient response.
8. The method of claim 7, wherein the method comprises imputing the missing values of the plurality of independent features in the trial data using a plurality of imputation techniques, wherein the plurality of imputation techniques employ statistical extrapolation.
9. The method of claim 7, wherein the method comprises training XGBoost regressor using grid search, wherein the machine learning model is XGBoost regressor.
10. The method of claim 9, wherein the method comprises identifying the independent features that do not impact efficacy of treatment used in the clinical trial using the XGBoost regressor.
11. The method of claim 1, wherein the method comprises the plurality of independent features to be at least one of: genetic features, baseline indexes, vital signs, underlying conditions, medical history, and demographics such as age, gender, height, weight, BMI, nationality, race.
12. The method of claim 1, wherein the method comprises measuring opposite impact between treatment arm patients and control arm patients as improvement in the treatment arm patients and decrease in efficacy in the control arm patients.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those skilled in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
[0025] Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
[0026]
[0027]
[0028]
[0029] In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
[0030] The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practising the present disclosure are also possible.
[0031] In one aspect, the present disclosure provides a system for optimizing trial design for clinical trials, wherein the system includes a computer system comprising a processor communicably coupled to a memory, the processor operable to:
[0032] process and structure raw trial data to a format suitable for input to train a machine learning model, wherein the raw trial data is patient data;
[0033] identify plurality of independent features of the raw trial data, wherein the identification of the plurality of independent features is performed using the trained machine learning model;
[0034] screen actionable features from the plurality of independent features using the trained machine learning model, wherein actionable features show opposite impact between treatment arm patients and control arm patient of the clinical trial;
[0035] compute cut off range values for each of the actionable features, and form a plurality of sub-groups of patients using different combinations of cut-off range values, wherein the cut off range values define an upper limit and a lower limit for values of the actionable features;
[0036] simulate patient response of each of the plurality of sub-groups of patients; and
[0037] identify a sub-group of patients from the plurality of sub-groups, based upon population percentage and delta response obtained from the simulations, that shows optimal clinical trial results in the simulated patient response.
[0038] In another aspect, the present disclosure provides a method for optimizing trial design for clinical trials, wherein the method comprises:
[0039] processing and structuring raw trial data to a format suitable for input to train a machine learning model using a processor, wherein the raw trial data is patient data;
[0040] identifying plurality of independent features of the raw trial data, wherein the identification of the plurality of independent features is performed using the trained machine learning model;
[0041] screening actionable features from the plurality of independent features using the trained machine learning model, wherein actionable features show opposite impact between treatment arm patients and control arm patient of the clinical trial;
[0042] computing cut off range values for each of the actionable features, and form a plurality of sub-groups of patients using different combinations of cut-off range values, wherein the cut off range values define an upper limit and a lower limit for values of the actionable features;
[0043] simulating patient response of each of the plurality of sub-groups of patients; and
[0044] identifying a sub-group of patients from the plurality of sub-groups, based upon population percentage and delta response obtained from the simulations, that shows optimal clinical trial results in the simulated patient response.
[0045] The system and method of the present disclosure aim to optimize trial design in a clinical trial. Notably, the present disclosure reduces the time and cost involved in selecting the right group of patients for the clinical trial. Consequently, the system reduces delay in the overall process of drug discovery. Furthermore, the present disclosure increases the chance of a successful clinical trial by selecting the right group of patients.
[0046] Pursuant to embodiments of the present disclosure, the system and the method provided herein are for optimizing trial design for clinical trials. Herein, “clinical trial” refers to a research study performed in people that is aimed at evaluating a medical, surgical, or behavioral intervention. Additionally, clinical trials are the primary way that researchers find out whether a new treatment, such as a new drug, diet, or medical device (for example, a pacemaker), is safe and effective in people. Moreover, a clinical trial is often used to learn whether a new treatment is more effective and/or has fewer harmful side effects than the standard treatment. Furthermore, clinical trials are conducted using a process that may be divided into categories or phases. Typically, the clinical trial process can extend over a period of time ranging from months to years. Notably, every clinical trial requires retrieving, analyzing, and managing the collaboratively obtained clinical trial data from various clinical trial organizations collected during the clinical trial process before an investigational new drug (IND) can be submitted to the FDA.
[0047] The system includes a computer system comprising a processor communicably coupled to a memory. Herein, a “computer system” relates to at least one computing unit comprising a central storage system, processing units and various peripheral devices. Optionally, the computer system relates to an arrangement of interconnected computing units, wherein each computing unit in the computer system operates independently and may communicate with other external devices and other computing units in the computer system.
[0048] Throughout the present disclosure, the term “processor” relates to a computational element that is operable to respond to and process instructions that carry out the method. Optionally, the processor includes, but is not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, or any other type of processing circuit. Furthermore, the term “processor” may refer to one or more individual processors, processing devices and various elements associated with a processing device that may be shared by other processing devices.
[0049] The processor is operable to process and structure raw trial data to a format suitable for input to train a machine learning model, wherein the raw trial data is patient data. Herein, “raw trial data” refers to unprocessed patient data for a clinical trial that is in its original form, in contrast to derived data. Additionally, raw trial data may not be part of the documentation accompanying an application to a regulatory authority but must be kept in records. Moreover, raw trial data may include patient medical charts, hospital records, X-rays, attending physician’s notes, and so forth. Herein, “machine learning model” refers to the output that is saved after running a machine learning algorithm on training data and represents the rules, numbers, and any other algorithm-specific data structures required to make predictions. Notably, raw trial data requires processing and structuring into a format that is a valid input to the machine learning model. Additionally, the raw trial data is input to the machine learning model after the processor has processed and structured it, and the processed and structured data acts as the training data for the machine learning model.
[0050] Optionally, the machine learning model is an XGBoost regressor, and the XGBoost regressor is trained using grid search. Herein, “XGBoost” (extreme gradient boosting) refers to an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm, and “XGBoost regressor” refers to a regression model built with that library. Herein, gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems. Additionally, the ensembles are constructed from decision tree models. Moreover, trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. Notably, this type of ensemble machine learning model is referred to as boosting. Furthermore, models are fit using any arbitrary differentiable loss function and a gradient descent optimization algorithm. Consequently, this gives the technique its name, “gradient boosting,” as the loss is minimized by following its gradient as the model is fit, much like a neural network. Herein, the XGBoost regressor is trained using grid search for tuning the hyperparameters of the said model. Herein, “hyperparameter” refers to a parameter whose value is used to control the learning process of the machine learning model. By contrast, the values of other parameters are derived via training. Additionally, a hyperparameter is a characteristic that is external to the model and whose value cannot be estimated from data. Moreover, the value of the hyperparameter has to be set before the learning process begins. Herein, “grid search” refers to a process that searches exhaustively through a manually specified subset of the hyperparameter space of the targeted algorithm. Furthermore, grid search is used to find the hyperparameter values, among those searched, that result in the most accurate predictions.
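By way of a hedged illustration only, the boosting and grid search described above may be sketched as follows. The sketch uses a small hand-written ensemble of single-split regression trees as a stand-in for the XGBoost regressor, and the function names, the toy data and the two hyperparameters searched (`n_estimators`, `learning_rate`) are assumptions made for this example rather than details of the disclosure.

```python
from itertools import product


def fit_stump(xs, residuals):
    """Find the single-split tree (threshold, left mean, right mean) with least squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # thresholds that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]


def fit_boosted(xs, ys, n_estimators, learning_rate):
    """Add trees one at a time, each fit to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    base, lr, stumps = model
    return base + sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


def grid_search(train, valid, grid):
    """Exhaustively try every hyperparameter combination; keep the lowest validation error."""
    best_params, best_score = None, float("inf")
    names = sorted(grid)
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = mse(fit_boosted(*train, **params), *valid)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy data (hypothetical): one feature, a step-shaped response with small noise.
train = ([1, 2, 3, 4, 5, 6, 7, 8], [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1])
valid = ([1.5, 2.5, 3.5, 5.5, 6.5, 7.5], [1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
grid = {"n_estimators": [1, 5, 20], "learning_rate": [0.1, 0.5, 1.0]}
best_params, best_score = grid_search(train, valid, grid)
print(best_params, best_score)
```

In practice the same exhaustive loop would be applied to the XGBoost regressor itself, typically with cross-validation rather than a single validation split.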
[0051] The processor is operable to identify a plurality of independent features of the raw trial data, wherein the identification of the plurality of independent features is performed using the trained machine learning model. Notably, after processing the data, the next step is identification of the important independent factors that primarily affect the outcome of the clinical trial. Optionally, the plurality of independent features comprises at least one of: genetic features, baseline indexes, vital signs, underlying conditions, medical history, and demographics such as age, gender, height, weight, BMI, nationality, race. Furthermore, the machine learning model is run separately for treatment arm patients and control arm patients. Herein, “treatment arm” refers to a group or subgroup of participants in a clinical trial that receives a specific intervention, such as a study drug dose, according to the study protocol. Herein, “control arm” refers to a group or subgroup of participants that does not receive the new medication, device or treatment that is under study, and thereby provides a comparison to see how the innovation compares against no treatment. Additionally, members of the control group may receive a placebo, that is, an inactive treatment such as a pill that makes the group think they are receiving the new treatment.
[0052] Optionally, the plurality of independent features have missing values that are imputed using a plurality of imputation techniques, wherein the plurality of imputation techniques employ statistical extrapolation. Herein, “imputation” refers to assigning an assumed value to an item when the actual value is not known or available. Additionally, an imputed value is a logical or implicit value for an item or time set, wherein a true value is yet to be ascertained. Notably, the imputation techniques are used to determine the missing values of the independent features, if any. Moreover, the imputation techniques used are mean, median, mode and so forth. Furthermore, the imputation technique is chosen based on the type of feature, for example, mean for continuous features, mode for categorical features, and so forth.
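As a minimal sketch of the type-dependent imputation described above, assuming that missing values are represented by `None` and that each feature is labelled as continuous or categorical in advance, the following example fills continuous features with the mean and categorical features with the mode. The feature names and patient values are hypothetical.

```python
from statistics import mean, mode


def impute_column(values, kind):
    """Replace None with the column mean (continuous) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical raw trial data, one list per independent feature.
patients = {
    "age": [54, None, 61, 47],
    "gender": ["F", "M", None, "F"],
}
kinds = {"age": "continuous", "gender": "categorical"}

imputed = {name: impute_column(col, kinds[name]) for name, col in patients.items()}
print(imputed)
```

A median-based variant would follow the same pattern by substituting `statistics.median` for `mean`.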
[0053] Optionally, the XGBoost regressor identifies the independent features that do not impact efficacy of treatment used in the clinical trial.
[0054] The processor is operable to screen actionable features from the plurality of independent features using the trained machine learning model, wherein actionable features show opposite impact between treatment arm patients and control arm patients of the clinical trial. Notably, the actionable features chosen by the processor help to compare the impact of the new drug between treatment arm patients and control arm patients of the clinical trial. Herein, actionable features refer to the independent features that can be controlled and on the basis of which an action relating to selection of patients in the clinical trial can be taken.
[0055] Optionally, opposite impact between treatment arm patients and control arm patients is measured as improvement in the treatment arm patients and decrease in efficacy in the control arm patients. Notably, opposite impact between the treatment arm that receives the new drug and control arm that doesn’t receive the new drug means an improvement in the treatment arm patients and decrease in efficacy in the control arm patients. Additionally, the decrease in efficacy in the control arm patients clearly indicates that the new drug tested on the treatment arm patients is working.
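The screening for opposite impact may be illustrated, under stated assumptions, as follows. The disclosure derives feature impact from the trained machine learning model run separately on each arm; this sketch substitutes a simple least-squares slope of the outcome against each feature as a stand-in for that impact, and the field names (`baseline`, `age`, `improvement`) and data are hypothetical.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs (stand-in for a model-derived feature impact)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def screen_actionable(treatment, control, features, outcome="improvement"):
    """Keep features whose impact has opposite sign in the treatment and control arms."""
    actionable = []
    for f in features:
        s_t = slope([p[f] for p in treatment], [p[outcome] for p in treatment])
        s_c = slope([p[f] for p in control], [p[outcome] for p in control])
        if s_t * s_c < 0:  # opposite direction of impact between the two arms
            actionable.append(f)
    return actionable


# Hypothetical patients: a higher baseline helps in the treatment arm but hurts in control,
# whereas age acts in the same direction in both arms.
treatment = [
    {"baseline": 10, "age": 40, "improvement": 1.0},
    {"baseline": 20, "age": 50, "improvement": 2.0},
    {"baseline": 30, "age": 60, "improvement": 3.5},
]
control = [
    {"baseline": 10, "age": 60, "improvement": 2.0},
    {"baseline": 20, "age": 50, "improvement": 1.0},
    {"baseline": 30, "age": 40, "improvement": 0.5},
]

actionable = screen_actionable(treatment, control, ["baseline", "age"])
print(actionable)
```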
[0056] The processor is operable to compute cut-off range values for each of the actionable features and form a plurality of sub-groups of patients using different combinations of cut-off range values, wherein the cut-off range values define an upper limit and a lower limit for values of the actionable features. Notably, the processor applies cut-off ranges at all levels of a particular independent factor, which separates the respective population segments. Additionally, the raw trial data is thereby segregated into a plurality of sub-groups. Moreover, the number of sub-groups depends on the different combinations of cut-off range values.
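A minimal sketch of forming sub-groups from combinations of cut-off ranges is given below. It assumes that each actionable feature carries a list of candidate (lower limit, upper limit) ranges, with `None` standing for an unbounded side; the helper names, feature names and values are hypothetical.

```python
from itertools import product


def in_range(value, bounds):
    """True if value lies within (lower, upper); None means that side is unbounded."""
    lower, upper = bounds
    return (lower is None or value >= lower) and (upper is None or value <= upper)


def form_subgroups(patients, cutoffs):
    """Build one sub-group per combination of cut-off ranges across all features."""
    features = sorted(cutoffs)
    subgroups = {}
    for ranges in product(*(cutoffs[f] for f in features)):
        rule = dict(zip(features, ranges))
        members = [p for p in patients if all(in_range(p[f], r) for f, r in rule.items())]
        subgroups[tuple(rule.items())] = members
    return subgroups


# Hypothetical patients and candidate cut-off ranges for two actionable features.
patients = [
    {"id": 1, "age": 45, "baseline": 25},
    {"id": 2, "age": 65, "baseline": 30},
    {"id": 3, "age": 50, "baseline": 15},
    {"id": 4, "age": 70, "baseline": 10},
]
cutoffs = {
    "age": [(None, None), (None, 60)],
    "baseline": [(None, None), (20, None)],
}

subgroups = form_subgroups(patients, cutoffs)
for rule, members in subgroups.items():
    print(rule, [p["id"] for p in members])
```

Here two candidate ranges for each of two features yield four sub-groups, which shows how the number of sub-groups grows with the number of cut-off combinations.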
[0057] The processor is operable to simulate patient response of each of the plurality of sub-groups of patients. Notably, the processor performs a simulation with the independent factors selected by the machine learning model and applies a combination of cutoffs at all levels of these independent factors. Additionally, a delta response is determined for patients meeting the cutoff criteria by simulating the range of important independent variables.
[0058] The processor is operable to identify a sub-group of patients from the plurality of sub-groups, based upon population percentage and delta response obtained from the simulations, that shows optimal clinical trial results in the simulated patient response. Notably, each sub-group obtained by the cut-off range values contains patients from the treatment arm as well as from the control arm. Additionally, the average difference in improvement between the two arms is compared statistically and the respective p-value is calculated. Moreover, the population percentage remaining after filtering out patients, and the delta change in endpoint score for those patients, are noted. Herein, the delta change in endpoint score is calculated by subtracting the control arm average from the treatment arm average. Furthermore, the sub-group with the best impact of the intervention in the treatment arm compared to the control arm is selected. Herein, the population percentage refers to the number of patients in each of the plurality of sub-groups expressed as a percentage of the total number of patients.
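The quantities described above may be sketched as follows. The disclosure does not name the statistical test used to obtain the p-value, so the exact two-sided permutation test in this example is an assumption chosen because it needs no external library; the function names and endpoint values are likewise hypothetical.

```python
from itertools import combinations


def delta_response(treatment, control):
    """Treatment arm average minus control arm average of the endpoint improvement."""
    return sum(treatment) / len(treatment) - sum(control) / len(control)


def permutation_p_value(treatment, control):
    """Exact two-sided permutation test on the difference in arm averages (assumed test)."""
    observed = abs(delta_response(treatment, control))
    pooled = treatment + control
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(treatment)):
        chosen = set(idx)
        t = [pooled[i] for i in idx]
        c = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(delta_response(t, c)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


def population_percentage(subgroup_size, total_patients):
    """Share of all trial patients that falls inside the sub-group, in percent."""
    return 100.0 * subgroup_size / total_patients


# Hypothetical endpoint improvements for one sub-group of 8 patients out of 20 in the trial.
treatment = [4.0, 5.0, 6.0, 5.5]
control = [1.0, 2.0, 1.5, 2.5]

delta = delta_response(treatment, control)
p_value = permutation_p_value(treatment, control)
pct = population_percentage(len(treatment) + len(control), 20)
print(delta, p_value, pct)
```

The exhaustive enumeration is practical only for small sub-groups; for larger ones a sampled permutation test or a parametric test would normally be substituted.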
[0059] Optionally, the simulation data, population percentage and delta response are plotted, and the best point is chosen as the point that combines a good delta response with a high population percentage. This choice involves a trade-off between reducing the target population and proving efficacy. Furthermore, a population sub-group is identified that shows significantly better improvement in the treatment arm compared to the control arm with a population percentage greater than 50 percent, so that recruitment needs can be met easily.
[0060] In an exemplary implementation, the simulation results for different combinations of four independent features for the plurality of sub-groups may be tabulated as follows:
TABLE-US-00001
Baseline Score 1 | Baseline Score 2 | Time since disease commencement (Years) | Age (Years) | Overall Count | Delta Test Score Change (Treatment - Control) | Population Percentage
-        | -        | -        | -        | 145.00 | 0.02 | 100%
-        | >= X_21  | -        | -        | 81.00  | 1.00 | 56%
-        | -        | >= X_31  | -        | 75.00  | 0.74 | 52%
>= X_12  | -        | -        | -        | 74.00  | 1.78 | 51%
-        | -        | -        | <= X_41  | 72.00  | 1.06 | 50%
>= X_13  | -        | >= X_31  | -        | 45.00  | 1.56 | 31%
>= X_11  | -        | -        | <= X_42  | 42.00  | 2.69 | 29%
-        | >= X_22  | >= X_32  | -        | 36.00  | 0.37 | 25%
-        | -        | >= X_32  | <= X_43  | 35.00  | 1.76 | 24%
>= X_14  | >= X_23  | -        | -        | 32.00  | 5.11 | 22%
-        | >= X_24  | -        | <= X_41  | 31.00  | 1.89 | 21%
>= X_11  | -        | >= X_33  | <= X_41  | 27.00  | 2.58 | 19%
>= X_11  | >= X_25  | >= X_34  | -        | 17.00  | 3.98 | 12%
>= X_11  | >= X_21  | -        | <= X_44  | 14.00  | 4.64 | 10%
-        | >= X_24  | >= X_35  | <= X_41  | 12.00  | 2.25 | 8%
>= X_15  | >= X_23  | >= X_35  | <= X_45  | 9.00   | 4.50 | 6%
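For illustration only, the selection criterion of a population percentage greater than 50 percent combined with the largest delta response can be applied to the counts and delta values tabulated above. The selection function, the short rule labels and the reading of the "best point" as the maximum delta among eligible sub-groups are assumptions of this sketch; the outcome it prints is computed from the table and is not a result stated in the disclosure.

```python
TOTAL_PATIENTS = 145  # overall count of the unfiltered population in the table

# (rule label, overall count, delta test score change) taken from the table above.
rows = [
    ("no cut-off", 145, 0.02),
    ("B2 >= X_21", 81, 1.00),
    ("T >= X_31", 75, 0.74),
    ("B1 >= X_12", 74, 1.78),
    ("Age <= X_41", 72, 1.06),
    ("B1 >= X_13, T >= X_31", 45, 1.56),
    ("B1 >= X_11, Age <= X_42", 42, 2.69),
    ("B2 >= X_22, T >= X_32", 36, 0.37),
    ("T >= X_32, Age <= X_43", 35, 1.76),
    ("B1 >= X_14, B2 >= X_23", 32, 5.11),
    ("B2 >= X_24, Age <= X_41", 31, 1.89),
    ("B1 >= X_11, T >= X_33, Age <= X_41", 27, 2.58),
    ("B1 >= X_11, B2 >= X_25, T >= X_34", 17, 3.98),
    ("B1 >= X_11, B2 >= X_21, Age <= X_44", 14, 4.64),
    ("B2 >= X_24, T >= X_35, Age <= X_41", 12, 2.25),
    ("B1 >= X_15, B2 >= X_23, T >= X_35, Age <= X_45", 9, 4.50),
]


def select_subgroup(rows, total, min_pct=50.0):
    """Among sub-groups covering more than min_pct of patients, pick the largest delta."""
    eligible = [r for r in rows if 100.0 * r[1] / total > min_pct]
    return max(eligible, key=lambda r: r[2])


# Rounded percentages recomputed from the counts, for comparison with the table.
percentages = [round(100 * count / TOTAL_PATIENTS) for _, count, _ in rows]
best = select_subgroup(rows, TOTAL_PATIENTS)
print(percentages)
print(best)
```

Lowering `min_pct` shifts the choice towards smaller sub-groups with larger delta responses, which is the trade-off between target population and efficacy noted earlier.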
[0061] The corresponding plot between percentage population and delta response is illustrated in
[0062] The present disclosure also relates to the method as described above. Various embodiments and variants disclosed above apply mutatis mutandis to the method.
[0063] Optionally, the method comprises imputing the missing values of the plurality of independent features using a plurality of imputation techniques, wherein the plurality of imputation techniques employ statistical extrapolation.
[0064] Optionally, the method comprises training XGBoost regressor using grid search, wherein the machine learning model is XGBoost regressor.
[0065] Optionally, the method comprises identifying the independent features that do not impact efficacy of treatment used in the clinical trial using the XGBoost regressor.
[0066] Optionally, in the method, the plurality of independent features comprises at least one of: genetic features, baseline indexes, vital signs, underlying conditions, medical history, and demographics such as age, gender, height, weight, BMI, nationality, race.
[0067] Optionally, the method comprises measuring opposite impact between treatment arm patients and control arm patients as improvement in the treatment arm patients and decrease in efficacy in the control arm patients.
DETAILED DESCRIPTION OF THE DRAWINGS
[0068] Referring to
[0069] Referring to
[0070] Referring to
[0071] Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as “including”, “comprising”, “incorporating”, “have”, “is” used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural.