Automated prediction of biological response of chemical compounds based on chemical information
11651838 · 2023-05-16
Assignee
Inventors
Cpc classification
G16H20/30
PHYSICS
G16C20/30
PHYSICS
G16H50/20
PHYSICS
G16H20/10
PHYSICS
G16C20/90
PHYSICS
G16H50/30
PHYSICS
International classification
G16C20/30
PHYSICS
G16C20/20
PHYSICS
Abstract
Lack of safety and lack of efficacy are the two major unwanted biological responses that act as critical bottlenecks for the success of drug candidates in drug discovery and development. Conventional systems and methods explore and use the chemical information space ineffectively and may therefore fail to address safety and efficacy issues. Embodiments of the present disclosure provide a solution to these bottlenecks by exploring/searching the chemical information space using statistical techniques that yield meaningful chemical information comprising relevant descriptors, fingerprints, fragments, an optimized set of structural images, and the like. Further, the embodiments provide robust predictive models for a biological response, for example renal toxicity, built in an automated manner from the selected chemical information for given experimental data, together with alerts/rules that can be employed to address failures of drug candidates during discovery and development.
Claims
1. A processor implemented method, comprising: receiving biological data pertaining to chemical structure of a chemical compound (302); generating a plurality of chemical information for the chemical compound using associated molecular structure, wherein the plurality of chemical information comprise a plurality of physico-chemical and structural descriptors, a plurality of Molecular Fingerprints (MFs), a plurality of molecular fragments, and a plurality of 2D and 3D structural images (304); applying one or more statistical analysis techniques on the plurality of chemical information to obtain filtered chemical information (306), wherein the step of applying one or more statistical analysis techniques on the plurality of chemical information to obtain filtered chemical information comprises: obtaining a filtered set of descriptors using the plurality of physico-chemical and structural descriptors (306 a); generating a plurality of fingerprint categories based on the plurality of molecular fingerprints, wherein a first fingerprint category comprises a first set of fingerprints that is selected based on an occurrence threshold, wherein a second fingerprint category comprises a second set of fingerprints that is selected by applying at least one of a chi-squared test and a Fisher's exact test on the plurality of molecular fingerprints, wherein a third fingerprint category comprises a third set of fingerprints that is selected by applying an information gain statistical test on the plurality of molecular fingerprints (306 b); generating a fourth fingerprint category comprising a fourth set of fingerprints that is selected based on a combination of the plurality of Molecular fingerprints and the plurality of molecular fragments and the occurrence threshold (306 c); and performing one or more transformation techniques on the plurality of 2D and 3D structural images to obtain an optimized set of structural images (306 d); automatically generating a plurality of models 
based on the filtered set of descriptors, the first set of fingerprints, the second set of fingerprints, the third set of fingerprints, the fourth set of fingerprints and the optimized set of structural images respectively (308); automatically selecting and recommending a best model from the plurality of models based on the biological data and the plurality of chemical information (310); and automatically predicting biological response of the chemical compound based on at least one of the best model and one or more user selected models from the plurality of models (312).
2. The processor implemented method of claim 1, wherein the first set of fingerprints, the second set of fingerprints, the third set of fingerprints, and the fourth set of fingerprints comprise one or more CDK fingerprints, one or more CDK Extended fingerprints, one or more Estate fingerprints, one or more CDK Graph only fingerprints, one or more MACCS fingerprints, one or more Pubchem fingerprints, one or more Substructure fingerprints, one or more Klekota-Roth fingerprints, 2D Atom Pair fingerprints, one or more molecular fragments or combinations thereof.
3. The processor implemented method of claim 2, wherein a second model and a fourth model generated amongst the plurality of models are based on the first and fourth set of fingerprints respectively and the occurrence of each type of first and fourth set of fingerprints in a chemical compound.
4. The processor implemented method of claim 2, wherein a third model amongst the plurality of models is generated based on the probability of at least one of an activity, a biological response or an adverse event levels in the second set of fingerprints and the third set of fingerprints.
5. The processor implemented method of claim 1, wherein the first and fourth set of fingerprints comprises at least one of a Type I fingerprint, a Type II fingerprint, a Type III fingerprint and a Type IV fingerprint.
6. The processor implemented method of claim 5, wherein a presence of Type I fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type I fingerprint to one of a biological response, an adverse event or an activity of the chemical compound.
7. The processor implemented method of claim 5, wherein an absence of a Type II fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type II fingerprint to one of a biological response, an adverse event or an activity of the chemical compound.
8. The processor implemented method of claim 5, wherein a presence of a Type III fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type III fingerprint in one of no activity, no adverse event, or non-toxicity of the chemical compound.
9. The processor implemented method of claim 5, wherein an absence of a Type IV fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type IV fingerprint in one of no activity, no adverse event, or non-toxicity of the chemical compound.
10. The processor implemented method of claim 1, wherein the second set of fingerprints comprises a Type A fingerprint and a Type B fingerprint, and wherein the third set of fingerprints comprises a Type C fingerprint.
11. The processor implemented method of claim 1, wherein the step of applying one or more statistical analysis techniques on the plurality of physico-chemical and structural descriptors to obtain a filtered set of statistically significant descriptors from data specific to the plurality of physico-chemical and structural descriptors.
12. A system (100), comprising: a memory (102) storing instructions; one or more communication interfaces (106); and one or more hardware processors (104) coupled to the memory (102) via the one or more communication interfaces (106), wherein the one or more hardware processors (104) are configured by the instructions to: receive biological data pertaining to chemical structure of a chemical compound; generate a plurality of chemical information for the chemical compound using associated molecular structure, wherein the plurality of chemical information comprise a plurality of physico-chemical and structural descriptors, a plurality of molecular fingerprints, a plurality of molecular fragments, and a plurality of 2D and 3D structural images; apply one or more statistical analysis techniques on the plurality of chemical information to obtain filtered chemical information, wherein the step of applying one or more statistical analysis techniques on the plurality of chemical information to obtain filtered chemical information comprises: obtaining a filtered set of descriptors using the plurality of physico-chemical and structural descriptors; generating a plurality of fingerprint categories based on the plurality of Molecular fingerprints, wherein a first fingerprint category comprises a first set of fingerprints that is selected based on an occurrence threshold, wherein a second fingerprint category comprises a second set of fingerprints that is selected by applying at least one of a chi-squared test and a Fisher's exact test on the plurality of Molecular fingerprints, wherein a third fingerprint category comprises a third set of fingerprints that is selected by applying an information gain statistical test on the plurality of molecular fingerprints; generating a fourth fingerprint category comprising a fourth set of fingerprints that is selected based on a combination of the plurality of molecular fingerprints and the plurality of molecular fragments and the 
occurrence threshold; and performing one or more transformation techniques on the plurality of 2D and 3D structural images to obtain an optimized set of structural images; automatically generate a plurality of models based on the filtered set of descriptors, the first set of fingerprints, the second set of fingerprints, the third set of fingerprints, the fourth set of fingerprints and the optimized set of structural images respectively; automatically select and recommend a best model from the plurality of models based on the biological data and the plurality of chemical information; and automatically predict biological response of the chemical compound based on at least one of the best model and one or more user selected models from the plurality of models.
13. The system of claim 12, wherein the first set of fingerprints, the second set of fingerprints, the third set of fingerprints, and the fourth set of fingerprints comprise one or more CDK fingerprints, one or more CDK Extended fingerprints, one or more Estate fingerprints, one or more CDK Graph only fingerprints, one or more MACCS fingerprints, one or more Pubchem fingerprints, one or more Substructure fingerprints, one or more Klekota-Roth fingerprints, 2D Atom Pair fingerprints, one or more molecular fragments or combinations thereof.
14. The system of claim 13, wherein a second model and a fourth model generated amongst the plurality of models are based on the first and fourth set of fingerprints respectively and the occurrence of each type of first and fourth set of fingerprints in a chemical compound, and wherein a third model amongst the plurality of models is generated based on the probability of at least one of an activity, a biological response or an adverse event levels in the second set of fingerprints and the third set of fingerprints.
15. The system of claim 12, wherein the first and fourth set of fingerprints comprises at least one of a Type I fingerprint, a Type II fingerprint, a Type III fingerprint and a Type IV fingerprint.
16. The system of claim 15, wherein a presence of Type I fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type I fingerprint to one of a biological response, an adverse event or an activity of the chemical compound.
17. The system of claim 15, wherein an absence of a Type II fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type II fingerprint to one of a biological response, an adverse event or an activity of the chemical compound.
18. The system of claim 15, wherein a presence of a Type III fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type III fingerprint in one of no activity, no adverse event, or non-toxicity of the chemical compound, and wherein an absence of a Type IV fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type IV fingerprint in no activity, no adverse event, or non-toxicity of the chemical compound.
19. The system of claim 12, wherein the second set of fingerprints comprises a Type A fingerprint and a Type B fingerprint, and wherein the third set of fingerprints comprises a Type C fingerprint.
20. The system of claim 12, wherein the step of applying one or more statistical analysis techniques on the plurality of physico-chemical and structural descriptors to obtain a filtered set of statistically significant descriptors from data specific to the plurality of physico-chemical and structural descriptors.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
(2)
(3)
(4)
(5)
(6)
(7)
(8)
DETAILED DESCRIPTION
(9) Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims.
(10) Referring now to the drawings, and more particularly to
(11)
(12) The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
(13) The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 can be stored in the memory 102, wherein the database 108 may comprise, but is not limited to, information pertaining to chemical compounds, chemical information, biological responses, rules or alerts, various models that are generated and executed for prediction of biological response, various fingerprints, images, occurrence threshold values, and configuration details of the system during the training phase and the test/validation phase to perform the methodology described herein.
(14)
(15) In short, input data for the toxicity prediction can come from various sources such as internal database(s), external database(s), and information extracted from published articles and archived data repositories using natural language processing, data mining techniques, etc.
(16) The processed data, referred to as input data, can be used for modeling purposes. Subsequently, to model a biological response/activity, for example cardiotoxicity, renal toxicity, etc., chemical information of the compounds is generated and used along with the biological response of the compounds. For example, if the specific end point to be modeled is renal toxicity, relevant data such as chemical structure, assay conditions, biological response data, etc. pertaining to this end point need to be extracted from the processed database (e.g., also referred to as database 108 of
(17) Referring back to
(18) In an embodiment of the present disclosure, at step 306, the one or more hardware processors 104 apply one or more statistical analysis techniques on the plurality of chemical information to obtain filtered chemical information. More specifically,
(19) In an embodiment of the present disclosure, the first set of fingerprints, the second set of fingerprints, the third set of fingerprints, and the fourth set of fingerprints comprise one or more CDK fingerprints, one or more CDK Extended fingerprints, one or more Estate fingerprints, one or more CDK Graph only fingerprints, one or more MACCS fingerprints, one or more Pubchem fingerprints, one or more Substructure fingerprints, one or more Klekota-Roth fingerprints, 2D Atom Pair fingerprints, one or more molecular fragments or combinations thereof.
(20) For better understanding of the above steps 306a-306d, the steps 306a-306d are described by way of examples below:
(21) Statistical analysis technique(s) is/are applied on the chemical information (e.g., the physico-chemical and structural descriptors) to obtain a filtered set of descriptors: zero or low variance columns are removed first, and the remaining descriptors are then selected using various statistical measures (or feature selection technique(s)) such as one-way analysis of variance (ANOVA), Welch's t-test, and the like, as depicted in
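The two-stage descriptor filter described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the toy descriptor values and the F-statistic cut-off of 4.0 are assumptions made for the example. The disclosure selects on a p-value, which would additionally require the F distribution and is omitted here.

```python
from statistics import mean, pvariance

def drop_low_variance(columns, min_variance=1e-8):
    """Keep descriptor columns whose population variance exceeds min_variance."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) > min_variance}

def anova_f(values, labels):
    """One-way ANOVA F statistic of one descriptor across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy data (hypothetical): 1 = toxic, 0 = non-toxic.
labels = [1, 1, 1, 0, 0, 0]
descriptors = {
    "mol_weight": [310.0, 295.0, 320.0, 180.0, 175.0, 190.0],
    "constant": [1.0] * 6,
    "noise": [0.2, 0.9, 0.5, 0.4, 0.8, 0.3],
}

kept = drop_low_variance(descriptors)
f_scores = {name: anova_f(vals, labels) for name, vals in kept.items()}
# Assumption: an illustrative F cut-off stands in for the p-value criterion.
selected = [name for name, f in f_scores.items() if f > 4.0]
```

In this toy data the constant column is removed by the variance filter, and only the descriptor that separates the two classes passes the F cut-off.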
(22) In the present disclosure, molecular fingerprints and molecular fragments are merged together as they describe similar information or attributes of chemical compounds. Subsequently, the generated molecular fingerprints and fragments are divided into broad categories based on the information they contain as follows:
(23) A first fingerprint category comprises a first set of fingerprints that is (or are) selected based on an Occurrence Threshold (also referred as OT) as depicted in
(24) Presence of a Type I fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type I fingerprint to one of a biological response, an adverse event or an activity of the chemical compound (for example, toxicity). Similarly, absence of a Type II fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type II fingerprint to one of a biological response, an adverse event or an activity of the chemical compound (for example, toxicity). Presence of a Type III fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type III fingerprint to no activity (for example, non-toxicity) of the chemical compound. Likewise, absence of a Type IV fingerprint in at least one of the first set of fingerprints and the fourth set of fingerprints indicates contribution of the Type IV fingerprint to no activity (for example, non-toxicity) of the chemical compound. The various fingerprint types are depicted in
(25) A second fingerprint category comprises a second set of fingerprints that is (or are) selected by applying at least one of a Chi-squared test and a Fisher's exact test on the one or more Molecular Fingerprints (MF). The second set of fingerprints comprises a Type A fingerprint and a Type B fingerprint (based on Fisher's exact test) as depicted in
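One plausible reading of the four occurrence-based types is sketched below. The exact selection rule is not spelled out in this text, so the conditions used here are assumptions: a fingerprint state (present or absent) must be observed in at least OT compounds of one class and in no compound of the other class. The function name and the order in which the types are checked are likewise illustrative.

```python
def classify_occurrence_type(bits, labels, ot=1):
    """Return 'Type I'..'Type IV', or None, for one fingerprint column.

    bits[i] is 1 if compound i contains the fingerprint; labels[i] is 1 for
    an active (e.g. toxic) compound and 0 for an inactive one.
    Assumption: a type is assigned when a state occurs at least `ot` times in
    one class and never in the other class.
    """
    present_active = sum(1 for b, y in zip(bits, labels) if b and y)
    present_inactive = sum(1 for b, y in zip(bits, labels) if b and not y)
    absent_active = sum(1 for b, y in zip(bits, labels) if not b and y)
    absent_inactive = sum(1 for b, y in zip(bits, labels) if not b and not y)

    if present_active >= ot and present_inactive == 0:
        return "Type I"    # presence seen only in active compounds
    if absent_active >= ot and absent_inactive == 0:
        return "Type II"   # absence seen only in active compounds
    if present_inactive >= ot and present_active == 0:
        return "Type III"  # presence seen only in inactive compounds
    if absent_inactive >= ot and absent_active == 0:
        return "Type IV"   # absence seen only in inactive compounds
    return None            # left for the second or third category tests

labels = [1, 1, 0, 0]
types = [classify_occurrence_type(col, labels)
         for col in ([1, 0, 0, 0], [0, 1, 1, 1], [0, 0, 1, 0],
                     [1, 1, 0, 1], [1, 0, 1, 0])]
```

A column can satisfy more than one condition (for example, a bit set in every active compound and in no inactive compound); this sketch simply returns the first match.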
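The two tests named above can be computed from a 2x2 contingency table of fingerprint state against class. The sketch below uses only the standard library; the function names and the example counts are assumptions, and the two-sided Fisher p-value follows the common convention of summing all tables that are no more probable than the observed one.

```python
from math import comb

def contingency(bits, labels):
    """Return (a, b, c, d): present/active, present/inactive, absent/active, absent/inactive."""
    a = sum(1 for x, y in zip(bits, labels) if x and y)
    b = sum(1 for x, y in zip(bits, labels) if x and not y)
    c = sum(1 for x, y in zip(bits, labels) if not x and y)
    d = sum(1 for x, y in zip(bits, labels) if not x and not y)
    return a, b, c, d

def chi_squared(a, b, c, d):
    """Pearson chi-squared statistic (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value from the hypergeometric distribution."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: fingerprint present in 9 of 10 actives and 1 of 10 inactives.
chi2 = chi_squared(9, 1, 1, 9)
p_fisher = fisher_exact_two_sided(9, 1, 1, 9)
```

A statistic such as `chi2` can then be compared with a pre-defined cut-off; 6.635, the value used in the worked example later in this description, is the 1% critical value of the chi-squared distribution with one degree of freedom.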
(26) A third fingerprint category comprises a third set of fingerprints that is (or are) selected by calculating an information gain value for one or more Molecular Fingerprints (MF). The third set of fingerprints comprises a Type C fingerprint as depicted in
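Information gain for a binary fingerprint can be computed as the reduction in class entropy after splitting the compounds on that fingerprint. The sketch below is a generic illustration of that calculation; the function names and toy vectors are assumptions, and no cut-off for selecting Type C fingerprints is implied.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * log2(p)
    return h

def information_gain(bits, labels):
    """Entropy of the labels minus the entropy remaining after splitting on the bit."""
    present = [y for b, y in zip(bits, labels) if b]
    absent = [y for b, y in zip(bits, labels) if not b]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

labels = [1, 1, 0, 0]
ig_perfect = information_gain([1, 1, 0, 0], labels)  # bit separates the classes
ig_useless = information_gain([1, 0, 1, 0], labels)  # bit is independent of class
```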
(27) In a nutshell, if a fingerprint does not fall into any of the types of the first fingerprint category, it is classified using the second or third fingerprint category tests. All the classified categories or types of fingerprints are selected based on various statistical tests and convey statistically significant information about the end point that is to be modeled.
(28) As discussed above, the substructures generated from all the chemical compounds of the training set are merged to generate a set of unique molecular fragments that are not already represented in the previously generated fingerprints. For example, one of the molecular substructures generated from the training set can be 4-Iodoaniline or Bromobenzene, whose structures are given below. These substructures are also represented in the Klekota-Roth fingerprints and therefore capture the same properties of a chemical compound. Thus, these two generated molecular substructures can be removed, as they are already captured by other fingerprints. The remaining substructures then represent an additional set of fingerprints and are classified, similarly to fingerprints, into the sub-classes of the first, second and third fingerprint categories.
(29) ##STR00001##
(30) A fourth fingerprint category comprises a fourth set of fingerprints that is (or are) selected based on a combination of the one or more molecular fingerprints and the molecular fragments and the occurrence threshold, wherein the MFs and fragments are combined using the ‘&&’ (AND) operator. The fourth set of fingerprints comprises at least one of a Type I fingerprint, a Type II fingerprint, a Type III fingerprint and a Type IV fingerprint as depicted in
(31) ##STR00002##
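The AND combination of fingerprints and fragments can be sketched as an element-wise AND over their bit columns. Combining features pairwise, and the naming of the combined columns, are assumptions made for this illustration; the text does not state how many features are joined at once.

```python
from itertools import combinations

def combine_and(fingerprints):
    """Build combined columns: bit is 1 only where both source bits are 1.

    Assumption: features are combined in pairs.
    """
    combined = {}
    for (name_a, bits_a), (name_b, bits_b) in combinations(fingerprints.items(), 2):
        combined[f"{name_a}&&{name_b}"] = [a & b for a, b in zip(bits_a, bits_b)]
    return combined

# Hypothetical bit columns over four compounds (two fingerprints, one fragment).
features = {
    "FP1": [1, 1, 0, 1],
    "FP2": [1, 0, 0, 1],
    "FRAG1": [0, 1, 0, 1],
}
combined = combine_and(features)
```

The combined columns can then be screened with the occurrence threshold in the same way as the original fingerprints.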
(32) In an embodiment, the occurrence threshold for the combined fingerprints may be user configurable and, in another embodiment, may differ from that of the original fingerprints. In yet another embodiment, the occurrence threshold may vary based on the training set available in the system 100. For instance, if the training set contains toxic and non-toxic compounds in the ratio of 100:10 or 10:1, the occurrence threshold is set to 10 for the Type I fingerprint and to 1 for the others, as per the ratio. Deriving the ratio from the input data distribution ensures that the model is not biased towards the larger class of compounds. In other words, the ratio presents a solution to the problem of data imbalance as discussed in the introduction and observed in various biological response datasets. Additionally, the occurrence threshold may be dynamically changed as per the training set, the learning pattern of the system 100 and the like. In an example embodiment, the system 100 may learn that the model performance improves by 5% if the Type I occurrence threshold is set fifteen times larger than the Type III occurrence threshold. The system 100 may then set the Type I occurrence threshold to 15 and the Type III occurrence threshold to one. Similarly, the system 100 can derive/learn rules for dynamically updating the occurrence thresholds of Type I, II, III and IV fingerprints. Consequently, the system 100 also validates its rules across each new biological response prediction model it creates. Thus, the system 100 learns these rules a) by observing performances across various biological response or adverse event prediction models, b) by varying the value of the occurrence thresholds and c) from user inputs.
(33) In an embodiment of the present disclosure, the first set of fingerprints, the second set of fingerprints, the third set of fingerprints, and the fourth set of fingerprints comprise one or more CDK fingerprints, one or more CDK Extended fingerprints, one or more Estate fingerprints, one or more CDK Graph only fingerprints, one or more MACCS fingerprints, one or more Pubchem fingerprints, one or more Substructure fingerprints, one or more Klekota-Roth fingerprints, 2D Atom Pair fingerprints, one or more molecular fragments or combinations thereof.
(34) In an embodiment of the present disclosure, the system 100 generates structural images of chemical compounds in two and/or three dimensions. These images are color coded so that an element, a type of bond, the size of the molecule, etc. is represented with a particular color, uniformly across all compounds. As the size and orientation of similar bonds and cyclical structures vary across compounds depending on the number of atoms, the system 100 can perform various transformations on the structural images, be they 2D or 3D, of the chemical compounds. For example, as shown below, in the structures of compounds x to y, the orientation and size of the benzene ring vary across three different drug like molecules. The transformations on the structural images of chemical compounds can be rotation of the 2D image by various angles, up or down scaling of the original image to various sizes, and the like, thereby generating additional images and addressing some of the issues discussed above.
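The ratio-based choice of occurrence thresholds in the 100:10 example can be sketched as below. The function name is hypothetical, and the sketch covers only the case described in the text, where the toxic class is the larger one; how thresholds are set when the non-toxic class dominates is not stated and is not modelled here.

```python
def ratio_based_thresholds(n_toxic, n_non_toxic):
    """Occurrence thresholds derived from the class ratio of the training set.

    Assumption: the toxic class is the larger class, so only the Type I
    threshold is scaled and the remaining types keep a threshold of 1.
    """
    if n_toxic <= 0 or n_non_toxic <= 0:
        raise ValueError("both classes must contain at least one compound")
    ratio = max(1, round(n_toxic / n_non_toxic))
    return {"Type I": ratio, "Type II": 1, "Type III": 1, "Type IV": 1}

thresholds = ratio_based_thresholds(100, 10)
```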
(35) ##STR00003##
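The rotation and scaling transformations can be illustrated on a small pixel grid. This sketch uses nested lists so that it runs without an imaging library; the function names, the restriction to 90 degree rotations and the integer scale factor are assumptions made for brevity.

```python
def rotate_90(image):
    """Rotate a 2D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def upscale(image, factor):
    """Nearest-neighbour upscaling of a 2D pixel grid by an integer factor."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]

def augment(image):
    """Return the original image, its three 90 degree rotations and a 2x upscaled copy."""
    out = [image]
    for _ in range(3):
        out.append(rotate_90(out[-1]))
    out.append(upscale(image, 2))
    return out

# Hypothetical 2x2 image; integers stand in for color codes.
image = [[1, 2],
         [3, 4]]
variants = augment(image)
```

Each source image yields several training images, which is one way to make a model less sensitive to how a ring or bond happens to be drawn.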
(36) Referring back to
(37) In an embodiment, a second model (model II) from the plurality of models is generated based on the first set of fingerprints (first fingerprint category) and the occurrence of each type of the first set of fingerprints in a chemical compound. Here, the system 100 uses the first set of fingerprints, generated using the original fingerprints and fragments, to predict the activity of the compounds. The activities or biological responses of chemical compounds for an end point are divided into various classes based on their value, for example, toxicity as one class and non-toxicity as another class. Further, the system 100 computes class scores for each compound by verifying the presence or absence of each Type I, II, III, and IV fingerprint. Depending on these scores, the system 100 assigns a class to, or predicts the biological response of, a new compound.
(38) In another embodiment, a third model (model III) from the plurality of models is generated based on the second set of fingerprints, the third set of fingerprints, or combinations thereof. For each fingerprint of the second and third sets, a set of probability values is computed. These probabilities represent the various scenarios that can occur in a dataset. For example, in a two class classification model the set of probabilities for each fingerprint can be pr(active/present), pr(active/absent), pr(inactive/present) and pr(inactive/absent). If a fingerprint (FP1) is present, the probability of the compound being active or toxic (probability of the compound being active given that the fingerprint is present: pr(active/present) or probability(active/present)) is calculated from the training set values as below:
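A class-score computation of the kind described for model II can be sketched as follows. The scoring rule (one point per matching fingerprint), the handling of ties and all names are assumptions for illustration; the text states only that scores are computed from the presence or absence of Type I to IV fingerprints.

```python
def model_two_predict(compound_bits, typed_fingerprints):
    """Score a compound from typed fingerprints and return (class, scores).

    Assumption: each matching fingerprint adds one point to its class, and a
    tie yields no prediction (None).
    """
    scores = {"active": 0, "inactive": 0}
    for name, fp_type in typed_fingerprints.items():
        present = compound_bits.get(name, 0) == 1
        if fp_type == "Type I" and present:
            scores["active"] += 1
        elif fp_type == "Type II" and not present:
            scores["active"] += 1
        elif fp_type == "Type III" and present:
            scores["inactive"] += 1
        elif fp_type == "Type IV" and not present:
            scores["inactive"] += 1
    if scores["active"] == scores["inactive"]:
        return None, scores
    return max(scores, key=scores.get), scores

# Hypothetical typed fingerprints and two query compounds.
typed = {"FP1": "Type I", "FP2": "Type II", "FP3": "Type III", "FP4": "Type IV"}
pred_a, scores_a = model_two_predict({"FP1": 1, "FP2": 0, "FP3": 0, "FP4": 1}, typed)
pred_b, scores_b = model_two_predict({"FP1": 0, "FP2": 1, "FP3": 1, "FP4": 0}, typed)
```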
(39)
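One natural way to estimate such a probability set from the training set is by relative frequency: among the training compounds with the fingerprint in a given state, count the fraction that belong to each class. The sketch below takes that reading as an assumption; the function name and toy vectors are illustrative.

```python
def probability_set(bits, labels):
    """Estimate Pr(class | fingerprint state) from training-set counts.

    Keys are (class, state) with state 1 = present and 0 = absent. A state
    that never occurs in the training set gets None.
    """
    probs = {}
    for state in (1, 0):
        subset = [y for b, y in zip(bits, labels) if b == state]
        n = len(subset)
        n_active = sum(subset)
        probs[("active", state)] = n_active / n if n else None
        probs[("inactive", state)] = (n - n_active) / n if n else None
    return probs

# Hypothetical training data for one fingerprint (1 = active/toxic).
bits = [1, 1, 1, 1, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
probs = probability_set(bits, labels)
```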
(40) Further, each fingerprint of the second and third sets is used for building model III only if the calculated probability scores, for each scenario as depicted above, lie outside the unpredictability range. This range indicates the level of confidence the system 100 needs in order to avoid incorrect classification, in view of the training set configuration.
(41) In an example embodiment, for a two class/level classification, the unpredictable range can be calculated as follows. Let:
nBias: be the number of compounds in the class that has the larger number of compounds in the training set
nComp: be the total number of compounds in the training set
threshold: be a user defined cut-off.
(42) The system 100 calculates or defines
distortion=(nBias/nComp)−0.5;
(43) and then, the critical/unpredictable range is defined as (LB-UB), where
Lower bound (LB): Minimum (threshold+distortion,threshold)
Upper bound (UB): Maximum (1−threshold+distortion,1−threshold)
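The distortion and the two bounds translate directly into code. In the sketch below the function name is hypothetical; the counts 548 and 847 are the toxic and total compound counts of the training data in the example embodiment later in this description, and the threshold of 0.25 is an assumed value chosen for illustration.

```python
def unpredictability_range(n_bias, n_comp, threshold):
    """Return (LB, UB) of the unpredictable range for a two-class problem."""
    distortion = (n_bias / n_comp) - 0.5
    lower = min(threshold + distortion, threshold)
    upper = max(1 - threshold + distortion, 1 - threshold)
    return lower, upper

# Imbalanced training set: 548 of 847 compounds are in the larger class.
lb, ub = unpredictability_range(548, 847, 0.25)

# Balanced training set: the distortion is zero and the range is symmetric.
lb_bal, ub_bal = unpredictability_range(50, 100, 0.4)
```

With the imbalanced counts the range is roughly (0.25, 0.897): because nBias is the larger class, the distortion is never negative, so the lower bound stays at the threshold while the upper bound shifts upward with the imbalance.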
(44) Further, each second and third category fingerprint whose probability scores lie outside the unpredictability range is used to calculate scores for each class or activity of a chemical compound. A class is then assigned to the compound based on the comparison of all the class scores. For example, let FP1 be a fingerprint, which can be represented structurally as in the figure below and which has the following probability distribution.
(45) ##STR00004##
Probability set of FP1      Value
Pr (active|FP1=1)           0.92
Pr (active|FP1=0)           0.64
Pr (inactive|FP1=1)         0.08
Pr (inactive|FP1=0)         0.36
If the unpredictability range for the above example is (0.25-0.89), FP1 will be used for predicting activity of a compound given that the compound contains FP1 (i.e., FP1=1) in model III by system 100. If a compound does not contain FP1 (i.e., FP1=0), the probability set of FP1, and therefore FP1, will not be used in model III by system 100 as Pr (active|FP1=0) is within the unpredictable range. In another instance, if unpredictability range is (0.4, 0.6), FP1 will be used for modelling in both the presence and absence of FP1 fingerprint.
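The FP1 example can be reproduced with a short check that tests, for each fingerprint state, whether its probabilities fall outside the unpredictable range. Requiring both class probabilities of a state to lie outside the range, and treating the bounds themselves as outside, are assumptions of this sketch; the function names are illustrative.

```python
# Probability set of FP1 from the example; keys are (class, FP1 state).
FP1 = {("active", 1): 0.92, ("active", 0): 0.64,
       ("inactive", 1): 0.08, ("inactive", 0): 0.36}

def outside(p, lower, upper):
    """True if p is not strictly inside the unpredictable range."""
    return p <= lower or p >= upper

def usable_states(prob_set, lower, upper):
    """Return the fingerprint states (1 = present, 0 = absent) usable for modelling."""
    states = []
    for state in (1, 0):
        if all(outside(prob_set[(cls, state)], lower, upper)
               for cls in ("active", "inactive")):
            states.append(state)
    return states

wide = usable_states(FP1, 0.25, 0.89)   # only the presence of FP1 is informative
narrow = usable_states(FP1, 0.4, 0.6)   # presence and absence are both informative
```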
(46) In an embodiment, subsequently, system 100 builds model III using all the second and third set of fingerprints filtered using unpredictability range. Using the probability set of each second and third set of fingerprints, the system 100 computes class scores for a chemical compound using a) presence or absence of the fingerprint FPX in the compound, b) probability score set of fingerprint FPX and c) summation and comparison of computed class scores.
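Steps a) to c) can be sketched as a loop that looks up each fingerprint's state in the compound, adds the corresponding probabilities to the class scores when they lie outside the unpredictable range, and compares the totals. Summing raw probabilities, returning None on a tie, and the FP2 values are assumptions made for illustration; only the FP1 values come from the example above.

```python
def model_three_predict(compound_bits, prob_sets, lower, upper):
    """Sum class probabilities over usable fingerprints and return (class, scores)."""
    scores = {"active": 0.0, "inactive": 0.0}
    for name, probs in prob_sets.items():
        state = compound_bits.get(name, 0)          # a) presence or absence
        p_active = probs[("active", state)]         # b) probability score set
        p_inactive = probs[("inactive", state)]
        # Skip states whose probabilities fall inside the unpredictable range.
        if all(p <= lower or p >= upper for p in (p_active, p_inactive)):
            scores["active"] += p_active            # c) summation
            scores["inactive"] += p_inactive
    if scores["active"] == scores["inactive"]:
        return None, scores
    return max(scores, key=scores.get), scores      # c) comparison

prob_sets = {
    "FP1": {("active", 1): 0.92, ("active", 0): 0.64,
            ("inactive", 1): 0.08, ("inactive", 0): 0.36},
    # FP2 is a hypothetical fingerprint added for the example.
    "FP2": {("active", 1): 0.10, ("active", 0): 0.95,
            ("inactive", 1): 0.90, ("inactive", 0): 0.05},
}
pred_1, scores_1 = model_three_predict({"FP1": 1, "FP2": 0}, prob_sets, 0.25, 0.89)
pred_2, scores_2 = model_three_predict({"FP1": 0, "FP2": 1}, prob_sets, 0.25, 0.89)
```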
(47) In yet another embodiment, a fourth model (model IV) from the plurality of models is generated based on the fourth set of fingerprints, the occurrence of each type of the fourth set of fingerprints in a chemical compound, or combinations thereof. In other words, combined fingerprints are used to assign class scores similar to model II.
(48) In a further embodiment, a fifth model (model V) from the plurality of models is generated based on an analysis performed on the optimized set of structural images in a deep neural network. This model is generated using the images of chemical structures as input for a convolutional deep neural network, in one example embodiment. The various models generated are depicted in
(49) In an embodiment of the present disclosure, at step 310, the one or more hardware processors 104 automatically select and recommend a best model from the plurality of models based on the biological data and the plurality of chemical information, and at step 312, the one or more hardware processors 104 automatically predict biological response of the chemical compound based on at least one of the best model and one or more user selected models from the plurality of models. The term “biological response” can refer to toxicity of chemicals, potency of drug candidates against a biological target in an in vitro assay or in a cell based assay, etc. It can be defined as the response exhibited by a biological system in in vitro, ex vivo, or in vivo conditions on exposure to a chemical, drug candidate, etc. In an embodiment, the biological response of the chemical compound is predicted using the system recommended best model (and/or user selected models) as depicted in
(50)
(51) Application of the above method(s) of the present disclosure (as depicted in
(52) In the above example embodiment, the system 100 collected side effects data from SIDER 4.1 version and adverse drug reaction terms classification data from ADRECS website http://bioinf.xmu.edu.cn/ADReCS/index.isp. Both these data (raw data) are used to construct biological response profiles (processed data) for various drug and drug like compounds (chemical compounds/structures) by performing various data processing techniques depicted in
(53) In the above example embodiment, for each chemical compound/structure, smiles are extracted using PubChem ID and are used to generate various chemical information as explained below:
(54) Two types of variables were generated: 1. Fingerprints using Padel software version 2.21: these are binary variables taking values ‘1’ or ‘0’, indicating the presence or absence of a structural feature or substructure. a. CDK fingerprints: 1024 fingerprints for a various Atom Containers b. CDK Extended fingerprints: 1024 extended fingerprints for various Atom Containers that extends the CDK with additional bits describing ring features c. Estate fingerprints: 79 bit fingerprints using the E-State fragments. The E-State fragments are those described in [Hall, L. H. and Kier, L.B., Electro topological State Indices for Atom Types: A Novel Combination of Electronic, Topological, and Valence State Information, Journal of Chemical Information and Computer Science, 1995, 35:1039-1045]. d. CDK Graph only fingerprints: 1024 specialized version of the CDK Fingerprints which does not take bond orders into account e. MACCS fingerprints: generates 166 bit MACCS keys whose SMARTS patterns were taken from RDKit f. Pubchem fingerprints: 881 fingerprints for a molecule g. Substructure fingerprints: Checks the presence of 307 SMARTS Patterns for Functional Group Classification by Christian Laggner h. Klekota-Roth fingerprints: 4860 SMARTS based substructure fingerprint based on Chemical substructures that enrich for biological activity [Klekota, Justin and Roth, Frederick P., Chemical substructures that enrich for biological activity, Bioinformatics, 2008, 24:2518-2525]. i. 2D Atom Pair fingerprints: 780 fingerprints that check the presence of a set of atom pairs at various topological distances 2. Topological, geometrical, constitutional, and physicochemical descriptors using in house tool.
(55) In addition to the above variables, log P and log S values of the compounds, sourced from ALOGPS 2.1 (http://www.vcclab.org/lab/alogps/), were also included in the analysis.
(56) Further, the system 100 filters the generated chemical information using various criteria, for example:
1. Statistically significant structural descriptors were selected using the p-value calculated from a one-way analysis of variance (ANOVA) test, which is applied to continuous data used for predicting a categorical variable, here toxicity. For a p-value of less than 0.15, the system 100 may select only 83 of the 352 generated descriptors.
2. Category 1 fingerprints (Type I, Type II, Type III and Type IV) are selected by setting an occurrence threshold (OT) value. The minimum OT value can be 1. Using an OT of 1, the system 100 selects 475 Type I fingerprints, 12 Type II fingerprints, 191 Type III fingerprints and no Type IV fingerprints.
3. After removing non-zero columns from all the remaining fingerprint data, the system categorizes the fingerprints with a chi-square value greater than a pre-defined threshold, for example 6.635, as Type A fingerprints and the remaining fingerprints as Type B fingerprints. In total, 424 Type A and 119 Type B fingerprints were considered.
The above processing results in the selection of 1221 out of 10,145 generated fingerprints and 83 out of 352 generated descriptors.
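The chi-square split into Type A and Type B fingerprints can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: it assumes a Pearson chi-square statistic on a 2x2 table (fingerprint present/absent against toxic/non-toxic) without continuity correction, and the names `chi_square_2x2` and `split_type_a_b` as well as the toy data are hypothetical. The default cut-off is the example value 6.635 given above, which equals the 0.01 critical value of the chi-square distribution with one degree of freedom.

```python
def chi_square_2x2(bits, labels):
    """Pearson chi-square of one binary fingerprint against a binary class.

    bits   -- 0/1 per compound (fingerprint absent/present)
    labels -- 0/1 per compound (non-toxic/toxic)
    """
    a = b = c = d = 0
    for bit, label in zip(bits, labels):
        if bit and label:
            a += 1  # present, toxic
        elif label:
            b += 1  # absent, toxic
        elif bit:
            c += 1  # present, non-toxic
        else:
            d += 1  # absent, non-toxic
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no association can be measured
    return n * (a * d - b * c) ** 2 / denom


def split_type_a_b(fingerprints, labels, threshold=6.635):
    """Assign each fingerprint to Type A (above threshold) or Type B."""
    type_a, type_b = [], []
    for name, bits in fingerprints.items():
        if chi_square_2x2(bits, labels) > threshold:
            type_a.append(name)
        else:
            type_b.append(name)
    return type_a, type_b


# Toy data (hypothetical): 40 toxic compounds followed by 40 non-toxic ones.
labels = [1] * 40 + [0] * 40
fingerprints = {
    "fp_enriched": [1] * 30 + [0] * 10 + [1] * 5 + [0] * 35,
    "fp_uninformative": [1] * 20 + [0] * 20 + [1] * 19 + [0] * 21,
}
type_a, type_b = split_type_a_b(fingerprints, labels)
print(type_a, type_b)
```

In this toy case the fingerprint that is strongly enriched in the toxic class lands in Type A, while the one that is evenly distributed across both classes lands in Type B.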
(57) In an example embodiment, the system 100 can divide the processed data, consisting of 1114 (715 toxic and 399 non-toxic) compounds with 1221 fingerprint data and 1049 compounds with 83 descriptor data, into training and test data for model building and validation based on bitwise similarity. In an example scenario, the final datasets can be represented as follows:
(58) Train Data: 847 compounds with 548 Toxic, and 299 Non-Toxic
(59) Test Set: 267 compounds with 167 Toxic and 100 Non-Toxic
(60) In the above example, the system 100 kept the ratio of toxic to non-toxic compounds approximately the same across all the data sets.
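The class balance stated above can be checked directly from the reported counts. The short sketch below computes the toxic fraction for the full, training and test sets; the dictionary layout and variable names are illustrative only.

```python
# Compound counts reported for the example case study: (toxic, non-toxic).
datasets = {
    "all": (715, 399),
    "train": (548, 299),
    "test": (167, 100),
}

# Fraction of toxic compounds in each set.
toxic_fraction = {
    name: toxic / (toxic + non_toxic)
    for name, (toxic, non_toxic) in datasets.items()
}

for name, fraction in toxic_fraction.items():
    print(f"{name}: {fraction:.3f}")
```

The three fractions differ by only a few percentage points, consistent with the statement that the ratio is approximately preserved.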
(61) In the example case study considered above, dividing the processed data into training data and test data sets followed by fingerprint selection resulted in the below set of chemical information that can be used for model building
(62) a) Type I: 475 fingerprints
(63) b) Type II: 12 fingerprints
(64) c) Type III: 191 fingerprints
(65) d) Type IV: 0 fingerprints
(66) e) Type A: 424 fingerprints
(67) f) Type B: 119 fingerprints
(68) g) 83 Descriptors
(69) Each model built may be evaluated based on a number of metrics such as accuracy, sensitivity, specificity and percentage predicted. They are described in detail below: a) Accuracy: the fraction of correct predictions. It can be mathematically defined as
(70)
(71)
(72)
(73)
(74) The final model was built hierarchically from four models: Model I, Model II, Model III and Model IV. For example, if a compound is not predicted using Model I, it is passed on to the next model(s), Models II-IV. Various combinations of Model I, Model II, Model III, and Model IV (in the use case scenario) can also generate a final model. The final model combination for biological response prediction can be selected based on the highest percentage predicted together with good sensitivity, specificity and accuracy in the test set.
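The hierarchical hand-off between models can be sketched as a simple cascade. The sketch below assumes each model is a callable that returns a class label or `None` when it cannot predict a compound; the function name `predict_hierarchically` and the two toy models are hypothetical stand-ins, not the models of the disclosure.

```python
from typing import Callable, Optional, Sequence, Tuple

Prediction = Optional[str]  # "toxic", "non-toxic", or None when a model abstains


def predict_hierarchically(
    compound: dict,
    models: Sequence[Callable[[dict], Prediction]],
) -> Tuple[Prediction, Optional[int]]:
    """Try the models in order and return the first non-abstaining answer.

    Returns (label, index of the deciding model), or (None, None) when
    every model abstains, i.e. the compound stays non-predicted.
    """
    for index, model in enumerate(models):
        label = model(compound)
        if label is not None:
            return label, index
    return None, None


# Toy stand-in models (hypothetical).
def alert_model(compound: dict) -> Prediction:
    # Predicts only when enough toxicity alerts are present.
    return "toxic" if compound.get("alerts", 0) >= 3 else None


def fallback_model(compound: dict) -> Prediction:
    # Always answers; stands in for a model with full coverage.
    return "non-toxic"


print(predict_hierarchically({"alerts": 3}, [alert_model, fallback_model]))
print(predict_hierarchically({"alerts": 0}, [alert_model, fallback_model]))
```

Keeping the index of the deciding model makes it possible to report, per compound, which stage of the hierarchy produced the prediction.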
(75) Results for Model I, using Random Forest as the classifier, are depicted below in Table 1. The Predicted column in the table represents the total number of compounds predicted from a given set. The Non-Predicted column represents the total number of compounds that are not classified into any of the classes by the model. The relation between predicted and non-predicted can be defined as:
Set Size=Predicted Compounds+Non-Predicted Compounds
Similarly, the column accurate contains number of compounds that are correctly predicted by the model and inaccurate column contains the number of compounds that are wrongly or inaccurately classified by the model. Some of the other relations between the columns of the table are as follows:
(76)
A model with higher percentage of predicted compounds, with excellent sensitivity and specificity is preferred over models with lower percentage predicted, and relatively poor sensitivity or specificity.
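The relations among the table columns can be sketched as below. The definitions of accuracy (correct predictions over predicted compounds) and percentage predicted (predicted compounds over set size) are inferred from the tabulated numbers, for example 218/248 and 248/267 for the test row of Table 1; they are assumptions to that extent. The sensitivity and specificity helpers use the standard confusion-matrix definitions with toxic as the positive class, which is also an assumption. All function names are hypothetical.

```python
def coverage_metrics(set_size, predicted, accurate):
    """Derive the remaining table columns from three reported counts."""
    non_predicted = set_size - predicted   # Set Size = Predicted + Non-Predicted
    inaccurate = predicted - accurate      # Predicted = Accurate + Inaccurate
    return {
        "non_predicted": non_predicted,
        "inaccurate": inaccurate,
        "accuracy_pct": 100.0 * accurate / predicted if predicted else 0.0,
        "predicted_pct": 100.0 * predicted / set_size,
    }


def sensitivity(true_pos, false_neg):
    """Fraction of toxic compounds that are predicted toxic."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-toxic compounds that are predicted non-toxic."""
    return true_neg / (true_neg + false_pos)


# Test row of Table 1: set size 267, 248 predicted, 218 accurate.
test_row = coverage_metrics(set_size=267, predicted=248, accurate=218)
print(test_row)
```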
(77) TABLE 1

| Occurrence | Parameters | Set | Size | Predicted | Non-Predicted | Accurate | Inaccurate | Accuracy (%) | Percentage Predicted | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Tree count = 20, No. of attributes = 18 | Train | 847 | 801 | 46 | 800 | 1 | 99.87 | 94.56 | 1 | 1 |
| 1 | Tree count = 20, No. of attributes = 18 | Test | 267 | 248 | 19 | 218 | 30 | 87.9 | 92.88 | 0.91 | 0.88 |
(78) In this example embodiment, let a compound CX be Vidarabine (9-β-D-arabinofuranosyladenine) with its structure depicted in the figure below. This compound may contain substructures FPK1 and FPK2, which are Klekota-Roth Type I fingerprints. FPK1 and FPK2 can structurally be represented as given in the figure below. The presence of these Type I fingerprints, FPK1 and FPK2, in the compound CX can indicate toxic characteristics of the compound. In a similar way, the system 100 checks for the presence of all of the first set of fingerprints to calculate toxic and non-toxic class scores for each compound. The class scores can be computed by counting the presence of each type of the first set of fingerprints in a compound. In this example scenario, CX may be assigned a toxic class score of 3 and a non-toxic class score of 0, i.e., CX contains three Type I fingerprints of the first set, which indicate toxicity. Thus, CX can be classified as toxic by Model II.
(79) ##STR00005##
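The counting-based class scores described for Model II can be sketched as follows. The sketch assumes that each fingerprint of the first set has already been tagged as toxicity-indicating or non-toxicity-indicating, and that a compound is assigned to the class with the higher count and left non-predicted on a tie; this decision rule, the function names, and the fingerprint identifiers other than FPK1 and FPK2 are assumptions made for illustration.

```python
def class_scores(present, toxic_alerts, non_toxic_alerts):
    """Count how many alerts of each class are present in a compound.

    present          -- set of fingerprint names found in the compound
    toxic_alerts     -- set of fingerprint names that indicate toxicity
    non_toxic_alerts -- set of fingerprint names that indicate non-toxicity
    """
    toxic_score = len(present & toxic_alerts)
    non_toxic_score = len(present & non_toxic_alerts)
    return toxic_score, non_toxic_score


def classify_by_alerts(present, toxic_alerts, non_toxic_alerts):
    """Return 'toxic', 'non-toxic', or None when the scores are tied."""
    toxic_score, non_toxic_score = class_scores(
        present, toxic_alerts, non_toxic_alerts
    )
    if toxic_score > non_toxic_score:
        return "toxic"
    if non_toxic_score > toxic_score:
        return "non-toxic"
    return None  # tie (including 0 vs 0): the compound stays non-predicted


# FPK3, FPN1 and FPX are hypothetical names added for the example.
toxic_alerts = {"FPK1", "FPK2", "FPK3"}
non_toxic_alerts = {"FPN1"}
cx = {"FPK1", "FPK2", "FPK3", "FPX"}

print(class_scores(cx, toxic_alerts, non_toxic_alerts))
print(classify_by_alerts(cx, toxic_alerts, non_toxic_alerts))
```

With three toxicity-indicating fingerprints and none indicating non-toxicity, the toy compound receives scores of 3 and 0, mirroring the CX example above.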
(80) Results for Model II for all compounds are depicted below in illustrated table (Table 2):
(81) TABLE 2

| Occurrence Threshold | Set | Size | Predicted | Non-Predicted | Accurate | Inaccurate | Accuracy (%) | Percentage Predicted | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Train | 847 | 484 | 363 | 484 | 0 | 100 | 57.14 | 1 | 1 |
| 1 | Test | 267 | 111 | 156 | 109 | 2 | 98.29 | 41.57 | 0.98 | 0.97 |
(82) In this case study, the system 100 computes the following values for filtering fingerprints for building Model III.
(83) nBias=548
(84) nComp=847
(85) distortion=(548/847)−0.5=0.147
(86) Using the above values, and a user defined or system defined threshold, system 100 computes the unpredictable range. For example for
(87) threshold=0.15
(88) Lower bound (LB)=Minimum (0.197, 0.15)=0.15
(89) Upper bound (UB)=Maximum (0.997, 0.85)=0.99
(90) Therefore, the unpredictable range is (0.15-0.99). In another scenario, if threshold=0.25, the unpredictable range can be computed as (0.25-0.89). Subsequently, the system 100 filters the second and third sets of fingerprints using one of the unpredictable ranges and builds Model III using probability class scores for each compound.
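The arithmetic behind the unpredictable range can be sketched as below. The distortion follows the values given above (nBias/nComp minus 0.5). The shift rule for the bounds, adding the distortion to the threshold and to one minus the threshold before taking the minimum and maximum, is an assumption inferred from the quoted upper bounds 0.997 and 0.897; it is not stated in this section, and it does not reproduce the intermediate value 0.197 quoted for the lower bound. The function name `unpredictable_range` is hypothetical.

```python
def unpredictable_range(n_bias, n_comp, threshold):
    """Return (distortion, lower bound, upper bound) of the unpredictable range.

    n_bias    -- number of compounds in the majority (toxic) class
    n_comp    -- total number of compounds in the training set
    threshold -- user-defined or system-defined threshold, e.g. 0.15
    """
    distortion = n_bias / n_comp - 0.5
    # Assumed rule: shift both threshold-derived bounds by the distortion,
    # then keep the wider interval.
    lower = min(threshold + distortion, threshold)
    upper = max(1.0 - threshold + distortion, 1.0 - threshold)
    return distortion, lower, upper


for threshold in (0.15, 0.25):
    distortion, lower, upper = unpredictable_range(548, 847, threshold)
    print(f"threshold={threshold}: distortion={distortion:.3f}, "
          f"range=({lower:.3f}, {upper:.3f})")
```

The reported ranges (0.15-0.99) and (0.25-0.89) correspond to these upper bounds with the third decimal dropped.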
(91) Results for Model III for two different unpredictable ranges are depicted below in illustrated table (Table 3):
(92) TABLE 3

| Occurrence Threshold | Unpredictable Range | Set | Size | Predicted | Non-Predicted | Accurate | Inaccurate | Accuracy (%) | Percentage Predicted | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.15, 0.99 | Train | 847 | 32 | 815 | 28 | 4 | 87.5 | 3.78 | 0 | 1 |
| 1 | 0.15, 0.99 | Test | 267 | 6 | 261 | 5 | 1 | 83.33 | 2.25 | 0 | 1 |
| 1 | 0.25-0.89 | Train | 847 | 310 | 537 | 275 | 35 | 88.71 | 36.6 | 0.95 | 0.7 |
| 1 | 0.25-0.89 | Test | 267 | 73 | 194 | 69 | 4 | 94.5 | 27.35 | 1 | 0.84 |
(93) In this example embodiment, let a compound CX be N-(1-Ethoxy-1-oxo-4-phenyl-2-butanyl) alanylproline with its structure depicted in the figure below. This compound may contain a combined fingerprint CFPX that checks for the presence of both FPP, a PubChem fingerprint, and FPK, a Klekota-Roth fingerprint. FPP and FPK can structurally be represented as given in the figure below. The system 100 checks for the presence of all of the fourth set of fingerprints (the combined fingerprints) to calculate toxic and non-toxic class scores for each compound. The class scores can be the probabilities of a compound being toxic and non-toxic. They can also be computed by counting the presence of each type of the fourth set of fingerprints. In this example scenario, CX may be assigned a toxic class score of 19 and a non-toxic class score of 0, i.e., CX satisfies 19 different Type I fingerprints of the fourth set, which indicate toxicity. Thus, CX can be classified as toxic by Model IV.
(94) ##STR00006##
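A combined fingerprint such as CFPX can be sketched as the logical AND of two fingerprint bits, retained only if it occurs in enough compounds. The pairwise AND construction and the occurrence filter applied here are assumptions for illustration; the function name `combined_fingerprints`, the column `FPM`, and the toy bit vectors are hypothetical.

```python
from itertools import combinations


def combined_fingerprints(columns, occurrence_threshold=1):
    """Build pairwise combined fingerprints and filter them by occurrence.

    columns -- dict mapping a fingerprint name to a list of 0/1 values,
               one value per compound
    Returns a dict of combined columns that are set in at least
    `occurrence_threshold` compounds.
    """
    combined = {}
    for (name_a, bits_a), (name_b, bits_b) in combinations(columns.items(), 2):
        # The combined bit is set only when both fingerprints are present.
        bits = [a & b for a, b in zip(bits_a, bits_b)]
        if sum(bits) >= occurrence_threshold:
            combined[f"{name_a}&{name_b}"] = bits
    return combined


# Toy data (hypothetical): three fingerprints over four compounds.
columns = {
    "FPP": [1, 1, 0, 1],
    "FPK": [1, 0, 0, 1],
    "FPM": [0, 0, 1, 0],
}
print(combined_fingerprints(columns, occurrence_threshold=2))
```

Only the FPP and FPK pair co-occurs in at least two compounds here, so it is the only combined fingerprint that survives the filter.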
(95) Results for Model IV for all compounds are depicted below in illustrated table (Table 4):
(96) TABLE 4

| Type I Occurrence | Type III | Set | Size | Predicted | Non-Predicted | Accurate | Inaccurate | Accuracy (%) | Percentage Predicted | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 10 | Train | 847 | 354 | 493 | 354 | 0 | 100 | 41.79 | 1 | 1 |
| 10 | 10 | Test | 267 | 116 | 151 | 105 | 11 | 90.5 | 43.44 | 1 | 0 |
(97) Results for Combined models—Model II and III are depicted below in illustrated table (Table 5): Model I: OT=1, Model II: Unpredictable Range (0.15, 0.99)
(98) TABLE 5

| Set | Size | Predicted | Non-Predicted | Accurate | Inaccurate | Accuracy (%) | Percentage Predicted | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|
| Train | 847 | 498 | 349 | 497 | 1 | 99.79 | 58.79 | 0.99 | 1 |
| Test | 267 | 115 | 152 | 113 | 2 | 98.26 | 43.07 | 0.98 | 0.97 |
(99) Results for combined models—Model II+Model III+Model IV are depicted below in illustrated table (Table 6): Model 1 OT=1, Model II: Unpredictable Range (0.15, 0.99), Combination Thresholds: Type I=10, and Type III=10.
(100) TABLE 6

| Set | Size | Predicted | Non-Predicted | Accurate | Inaccurate | Accuracy (%) | Percentage Predicted | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|
| Train | 847 | 584 | 263 | 583 | 1 | 99.83 | 68.94 | 0.99 | 1 |
| Test | 267 | 167 | 100 | 156 | 11 | 93.41 | 62.54 | 1 | 0.75 |
(101) Best Results for combined models—Model I+Model II+Model III+Model IV are depicted below in illustrated table (Table 7): Model 1 OT=1; Unpredictable range (0.15, 0.99); Combination Thresholds: Type I=10, and Type III=10; Random Forest: Tree=20, No. of Attributes=18
(102) TABLE 7

| Set | Size | Predicted | Non-Predicted | Accurate | Inaccurate | Accuracy (%) | Percentage Predicted | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|
| Train | 847 | 847 | 0 | 846 | 1 | 99.88 | 100 | 0.99 | 1 |
| Test | 267 | 267 | 0 | 241 | 26 | 90.26 | 100 | 0.95 | 0.81 |
(103) Best Result for Support Vector Machine (SVM) and Random Forest models, widely used classification techniques, with Fingerprints using Information Gain for feature selection are depicted below in illustrated table (Table 8):
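(104) TABLE 8

| Model | Set | Size | Accurate | Inaccurate | Accuracy (%) | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| SVM: Kernel = RBF, C = 10, Gamma = 0.125 | Train | 847 | 836 | 11 | 98.7 | 0.99 | 0.96 |
| SVM: Kernel = RBF, C = 10, Gamma = 0.125 | Test | 267 | 241 | 26 | 90.26 | 0.91 | 0.88 |
| Random Forest: Tree count = 21, No. of Attributes = 4 | Train | 847 | 815 | 32 | 96.22 | 0.95 | 0.96 |
| Random Forest: Tree count = 21, No. of Attributes = 4 | Test | 267 | 240 | 27 | 89.88 | 0.88 | 0.92 |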
(105) Best Result for SVM and Random Forest models with Descriptors using ANOVA test for feature selection are depicted below in illustrated table (Table 9):
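(106) TABLE 9

| Model | Set | Size | Accurate | Inaccurate | Accuracy (%) | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| SVM: Kernel = RBF, C = 2.0, Gamma = 0.125 | Train | 801 | 801 | 0 | 100 | 1 | 1 |
| SVM: Kernel = RBF, C = 2.0, Gamma = 0.125 | Test | 248 | 153 | 95 | 61.69 | 1 | 0.01 |
| Random Forest: Tree count = 21, No. of Attributes = 4 | Train | 801 | 798 | 3 | 99.62 | 0.998 | 0.993 |
| Random Forest: Tree count = 21, No. of Attributes = 4 | Test | 248 | 184 | 64 | 74.19 | 0.855 | 0.56 |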
(107) Best Result for SVM and Random Forest models with both Descriptors and Fingerprints are depicted below in illustrated table (Table 10). Feature selection done using ANOVA for Descriptors and Information Gain for Fingerprints:
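(108) TABLE 10

| Model | Set | Size | Accurate | Inaccurate | Accuracy (%) | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| SVM: Kernel = RBF, C = 8, Gamma = 0.125 | Train | 801 | 801 | 0 | 100 | 1 | 1 |
| SVM: Kernel = RBF, C = 8, Gamma = 0.125 | Test | 248 | 153 | 95 | 61.69 | 1 | 0.01 |
| Random Forest: Tree count = 25, No. of Attributes = 4 | Train | 801 | 800 | 1 | 99.87 | 1 | 0.99 |
| Random Forest: Tree count = 25, No. of Attributes = 4 | Test | 248 | 202 | 46 | 81.45 | 0.93 | 0.625 |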
(109) Prediction results using SARpy v1.0, Occurrence Threshold=1, Range of no. of atoms=(2, 18) are depicted below in illustrated table (Table 11).
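(110) TABLE 11

| Model | Set | Size | Predicted | Non-Predicted | Accurate | Inaccurate | Accuracy (%) | Percentage Predicted | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|
| Minimum Precision (Minimum likelihood ratio = 1) | Train | 847 | 846 | 1 | 666 | 180 | 78.72 | 99.88 | 0.87 | 0.62 |
| Minimum Precision (Minimum likelihood ratio = 1) | Test | 267 | 264 | 3 | 196 | 68 | 74.24 | 98.97 | 0.84 | 0.56 |
| Maximum Precision (Minimum likelihood ratio = Infinity) | Train | 847 | 756 | 91 | 756 | 0 | 100 | 89.25 | 1 | 1 |
| Maximum Precision (Minimum likelihood ratio = Infinity) | Test | 267 | 165 | 102 | 144 | 21 | 87.27 | 61.79 | 0.95 | 0.66 |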
(111) The test-set accuracy of 90.26% obtained by the current disclosure, as depicted in Table 7, in comparison with the results of other modelling techniques presented in Tables 8-11, supports the technical advantage of the current disclosure, which can be observed in terms of the previously defined statistical metrics. In addition, some of the insights that may be drawn by the system 100 for the above example embodiment to predict renal toxicity are: a) the presence of one or more chemical sub-structures/structures depicted below (e.g., Chemical structure 1) may result in a toxic nature of a chemical compound; b) the presence of one or more chemical structures depicted below (e.g., Chemical structure 2) may result in a non-toxic nature of a chemical compound.
(112) ##STR00007##
Chemical structure 1: Substructures that may result in renal toxicity of a chemical compound
(113) ##STR00008##
Chemical structure 2: Substructures that may not result in or may inhibit renal toxicity of a chemical compound
(114) The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that are relevant or occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
(115) It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
(116) The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
(117) The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
(118) Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
(119) It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.