METHOD OF IDENTIFICATION OF A RELATIONSHIP BETWEEN BIOLOGICAL ELEMENTS
20170154151 ยท 2017-06-01
Inventors
- Anne-Claire BRUNET (TOULOUSE, FR)
- Jean-Michel LOUBES (TOULOUSE, FR)
- Jean-Marc AZAIS (TOULOUSE, FR)
- Michael COURTNEY (LYON, FR)
Cpc classification
G06F17/16
PHYSICS
International classification
Abstract
The present invention relates to a method for identifying a relationship between biological elements, said elements optionally having a measurable activity, the method comprising the following steps: defining candidate graphs, each candidate graph being a graph associated with one of the thresholding values from the plurality of thresholding values, for each thresholding value, obtaining a distribution associated by optimization of the distribution into classes of the apices of the graph associated with the relevant thresholding value, the optimization starting with an initial distribution in which with each core is associated a class for obtaining a final distribution in which each apex of a class shares more links with the other apices of the same class than with the apices of another class, selecting an optimum graph from among the plurality of candidate graphs according to at least one criterion.
Claims
1. A method for identifying a relationship between biological elements, said biological elements optionally having a measurable activity, the method being applied by a computer and comprising the following steps: providing data from biological samples of a plurality of biological individuals, the data comprising a representative quantity of the biological elements or of their activity for the plurality of biological individuals, estimating the covariance matrix between the different representative quantities of the biological elements or of their activity from provided data, associating a graph with a thresholding value, the associated graph comprising representative apices of the biological elements and links between the apices when the value of the covariance between the relevant apices is greater than the relevant thresholding value, obtaining cores by analyzing the time-dependent change of the graphs by using a plurality of thresholding values, a core being a set of apices of a graph such that the number of apices is greater than or equal to a set number, such that a thresholding value exists for which the core is a connected component of the graph associated with the thresholding value and such that no other connected components exist of a graph for which the number of apices is greater than or equal to the set number and which is included in the core, defining candidate graphs, each candidate graph being a graph associated with one of the thresholding values of the plurality of thresholding values, for each thresholding value of the plurality of threshold values, obtaining a distribution associated by optimization by the distribution into classes of the apices of the graph associated with the relevant thresholding value, the optimization starting from an initial distribution in which with each core is associated a class for obtaining a final distribution in which each apex of a class shares more links with the other apices of the same class than with the apices of another class, selecting an optimum graph from among the plurality of candidate graphs according to at least one criterion.
2. The method according to claim 1, wherein in the step for obtaining the cores, the values of the plurality of the thresholding values are used in an increasing way.
3. The method according to claim 1 wherein in the step for obtaining an associated distribution, the values of the plurality of thresholding values are used in a decreasing way.
4. The method according to claim 1 wherein the step for estimating the covariance matrix includes a sub-step for computing the empirical covariance matrix, a regularization sub-step and a normalization sub-step.
5. The method according to claim 1 wherein the step for obtaining cores applies an in-depth course algorithm.
6. The method according to claim 1 wherein the final distribution includes less classes than the number of obtained cores.
7. The method for identifying a relationship according to claim 1 wherein the number of biological elements is greater than or equal to 1000, preferentially greater than or equal to 3000, still more preferentially greater than or equal to 5000.
8. The method for identifying a relationship according to claim 1 wherein the ratio between the number of biological elements and the number of biological individuals is greater than or equal to 10, preferentially greater than or equal to 30, still more preferentially greater than or equal to 50.
9. The method for identifying a relationship according to claim 1 wherein the biological elements are genes, RNAs, proteins or metabolites.
10. The method for identifying a relationship according to claim 1 wherein the biological individuals are animals, preferentially mammals, still more preferentially humans.
11. The method according to claim 1, further comprising identifying a therapeutic target for preventing and/or treating a pathology using the following steps: applying the method for identifying a relationship according to claim 1 wherein the plurality of individuals is a plurality of biological individuals suffering from said pathology and the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals, in order to obtain a first distribution in which each first class is associated on a one-to-one basis with a first value of the representative quantity, applying the method for identifying a relationship according to claim 1 wherein the plurality of individuals is a plurality of biological individuals not suffering from said pathology and the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals, in order to obtain a second distribution in which each second class is associated on a one-to-one basis with a second value of the representative quantity, comparing the first distribution and the second distribution, and selecting as a therapeutic target the gene, or a product of the expression of the gene, if the representative apices of said gene belongs to a first class and to a second class for which the first value and the second value significantly differ.
12. The method according to claim 1, further comprising identifying a diagnostic, susceptibility, prognostic biomarker of a pathology or predictive of a response to a treatment of a pathology using the following steps: applying the method for identifying a relationship according to claim 1 wherein the plurality of individuals is a plurality of biological individuals suffering from said pathology and the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals, in order to obtain a first distribution in which each first class is associated on a one-to-one basis with a first value of the representative quantity, applying the method according to claim 1 wherein the plurality of individuals is a plurality of biological individuals not suffering from said pathology and the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals, in order to obtain a second distribution in which each second class is associated on a one-to-one basis with a second value of the representative quantity, comparing the first distribution and the second distribution, and selecting as a biomarker the gene, or an expression of the gene, if the representative apices of said gene belong to a first class and to a second class, for which the first value and the second value differ significantly.
13. The method according to claim 1, further comprising screening a compound, useful as a drug, having an effect on a known therapeutic target, for preventing and/or treating a pathology using the following steps: applying the method for identifying a relationship according to claim 1 wherein the plurality of individuals is a plurality of biological individuals suffering from said pathology and having received said compound, the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals, and the data comprising the representative quantity of the therapeutic target, in order to obtain a first distribution in which each first class is associated on a one-to-one basis with a first value of the representative quantity, applying the method for identifying a relationship according to claim 1 wherein the plurality of individuals is a plurality of biological individuals suffering from said pathology and not having received said compound, the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals, and the data comprising the representative quantity of the therapeutic target, in order to obtain a second distribution in which each second class is associated on a one-to-one basis with a second value of the representative quantity, comparing the first distribution and the second distribution, and selecting the compound if the representative apices of the known therapeutic target belong to a first class and to a second class for which the first value and the second value differ significantly.
14. (canceled)
15. A non-transitory computer-usable storage medium having computer readable instructions stored thereon for execution by a processor to perform a method according to claim 1.
Description
[0045] Other features and advantages of the invention will become apparent upon reading the description which follows of embodiments of the invention, only given as an example and with reference to the drawings which are:
[0046]
[0047]
[0048]
[0049]
[0050]
[0051]
[0052] A system 10 and a computer program product 12 are illustrated in
[0053] The system 10 is a computer.
[0054] More generally, the system 10 is an electronic computer able to handle and/or transform data represented as electronic or physical quantities in registers of the system 10 and/or as memories in other similar data corresponding to physical data in memories, registers or other types of display, transmission or memory-storage devices.
[0055] The system 10 includes a processor 14 comprising a data processing unit 16, memories 18 and an information medium reader 20. The system 10 also comprises a keyboard 22 and a display unit 24.
[0056] The computer program product 12 includes a readable information medium 20.
[0057] A readable information medium 20 is a medium readable by the system 10, usually through the data processing unit 14. The readable information medium 20 is a medium adapted for storing in memory electronic instructions and capable of being coupled with a bus of a computer system.
[0058] As an example, the readable information medium 20 is a diskette or floppy disk, an optical disk, a CD-ROM, a magneto-optical disk, a ROM memory, a RAM memory, an EPROM memory, an EEPROM memory, a magnetic card or an optical card.
[0059] On the readable information medium 20 is stored in memory a computer program comprising program instructions.
[0060] The computer program is loadable on the data processing unit 14 and is adapted for causing the application of a method for identifying a relationship between physical elements when the computer program is applied on the data processing unit 14.
[0061] The operation of the system 10 in interaction with the computer program product 12 is now described with reference to
[0062] An element is a physical element when the element belongs to reality.
[0063] For example, atoms are physical elements. The statistical study of the spin states of a set of atoms is of interest both for spintronics and for material condensation problems.
[0064] According to another example, stars are physical elements. The emitted amount of a particular particle for different stars may notably be compared.
[0065] According to another example, the particles emitted by a star are physical elements. The study of particles emitted by a star gives the possibility of determining a piece of information on the state of the star considered statistically.
[0066] In the remainder of the description, examples of physical elements belonging to the field of biology are more specifically considered, without these examples being a limitation of the present method.
[0067] Notably, according to a preferred embodiment, the physical elements are biological elements. For example, the physical elements may be genes, RNAs, in particular Messenger RNAs, proteins or metabolites.
[0068] The method for identifying a relationship is all the more advantageous since the number of relevant physical elements is significant so that the physical elements are preferably sets of large dimensions.
[0069] For example, the number of physical elements is greater than or equal to 1000, preferably greater than or equal to 2000, preferably greater than or equal to 3000, preferably greater than or equal to 4000, preferably greater than or equal to 5000, preferably greater than or equal to 6000, preferably greater than or equal to 7000, preferably greater than or equal to 8000, preferably greater than or equal to 9000, preferably greater than or equal to 10000.
[0070] By the term of relationship is meant a link or a connection existing between two elements.
[0071] The method for identifying a relationship includes a step 50 for providing data relative to a plurality of individuals. The data for a particular individual comprise a quantity representative of each of the physical elements.
[0072] As a particular example, the representative quantity of a physical element may be the amount of the physical element. For example, the representative quantity of a protein in a given sample may be the amount of this protein in this sample. Thus, in such a particular case, as an illustration, a first protein will have a weight of 15 kiloDaltons, a second protein would have a weight of 10 kiloDaltons, a third protein would have a weight of 12 kiloDaltons.
[0073] Through the proposed particular example, it appears that by representative quantity of a physical element, is meant any type of measurable quantity which characterize the physical element. A representative quantity of a physical element may therefore be expressed as an amount.
[0074] According to a particular embodiment, the relevant quantity is representative of the activity of a physical element.
[0075] In particular, for the previous example of the atom, the spin is a representative quantity.
[0076] According to another example, for the case when the particles emitted by a star are the physical elements, the amount of emitted particles is a representative quantity. Similarly, for the example of stars, the amount of the emitted particular particle by each of the stars is a representative quantity.
[0077] The activity of a physical element represents the whole of the effects produced by the relevant physical element. Notably, when the physical element is a gene, the activity of the physical element may refer to the expression of said gene. The expression of a gene may in particular be quantified by measuring the amount of messenger RNA produced by the transcription process from said gene, or by measuring the amount of protein produced by the transcription and translation processes from said gene.
[0078] The representative quantity of the activity of a physical element may be the amount of a product resulting from the activity of the physical element. For example, the representative quantity of the activity of a gene may be the amount of messenger RNAs produced by the transcription process from said gene. According to another example, the representative quantity of the activity of a messenger RNA may be the amount of proteins produced by the translation process from said messenger RNA.
[0079] By the term of individual is meant a statistical element of a wider set called a <<population>>, and for which the value of the representative quantity of each of the physical elements, or of their activity, is provided in the provision step 50.
[0080] In the case of the example of atoms, the plurality of individuals is a plurality of atoms.
[0081] For the example of particles emitted by a same star, the plurality of individuals may be emissions at distinct time instants.
[0082] For the case when a plurality of stars is considered, the plurality of individuals is preferably the plurality of stars.
[0083] According to a particular embodiment, the individual may be a biological individual such as for example an animal. Preferably, the individual is a mammal. Still more preferentially, the individual is a human.
[0084] The method for identifying a relationship is all the more advantageous since the ratio between the number of physical elements and the number of individuals is greater than or equal to 10, preferably greater than or equal to 20, preferably greater than or equal to 30, preferably greater than or equal to 40, preferably greater than or equal to 50, preferably greater than or equal to 60, preferably greater than or equal to 70, preferably greater than or equal to 80, preferably greater than or equal to 90, preferably greater than or equal to 100, preferably greater than or equal to 200.
[0085] Alternatively or additionally, the number of individuals may be less than or equal to 200, preferably less than or equal to 100.
[0086] The data thus comprise, for a plurality of individuals, the different values of a representative quantity selected for each physical element. As explained earlier, according to a particular embodiment, the number of provided representative quantities is greater than or equal to 1000 for each relevant individual.
[0087] The data provided in the provision step 50 may be obtained by any means. In particular, the data may be obtained by an analysis of the omic type, for example by a genomic, transcriptomic, proteomic, or metabolomic analysis. The techniques giving the possibility of obtaining data of the omic type are well known to one skilled in the art and for example comprise those of DNA chips, of quantitative PCR or of systematic sequencing of DNA, RNA or complementary DNA.
[0088] In a particular embodiment, the data provided in the provision step 50 are obtained from a biological sample of the individual, such as one or several organs, tissues, cells or cell fragments of the individual.
[0089] At the end of the provision step 50, data comprising a representative quantity of the physical elements for a plurality of individuals have been provided.
[0090] From a mathematical point of view, the provided data correspond to the case of n models (n individuals) of p random variables X.sub.1, . . . , X.sub.p (p representative quantities). In this context, n and p are two integers.
[0091] For the following, in a sake of simplifying the matter, as an illustration, it is assumed that the random variables X.sub.1, . . . , X.sub.p are centered.
[0092] The method includes a step 52 for representing the data provided in matrix form in order to obtain a data matrix noted as X, for which the element of line i and of column j is the value of the i-th representative quantity X.sub.i for the j-th model.
[0093] The method includes a step 54 for estimating the covariance matrix between the different representative quantities from the data matrix.
[0094] In probability and statistical theory, the variance-covariance matrix or more simply the covariance matrix of a series of p real random variables X.sub.1, . . . , X.sub.p is the square matrix for which the element of line i and of column j is the covariance of the variables X.sub.i and X.sub.j. Such a matrix gives the possibility of quantifying the variation of each variable relatively to each of the other ones.
[0095] According to an embodiment, the estimation step 54 includes a computation sub-step.
[0096] As an example, in the computation sub-step, the empirical covariance matrix S is computed. By definition, S is the product of the reciprocal of the integer n by the matrix product of the data matrix X by the transposed data matrix X. This is written mathematically as:
[0097] wherein: [0098] <<.Math.>> refers to the mathematical operation of multiplication by a scalar, [0099] <<*>> refers to the mathematical matrix multiplication operation, and [0100] X.sup.t designates the transposed data matrix X.
[0101] According to another example, in the computation sub-step, the correlation matrix of Spearman is computed.
[0102] According to another embodiment, the estimation step 54 includes a regularization sub-step.
[0103] The regularization sub-step gives the possibility of forcing values of the covariance matrix to be zero in order to obtain a hollow matrix (i.e. a matrix comprising many zeros).
[0104] For example, the regularization sub-step is applied to the empirical covariance matrix S computed in the computation sub-step, in order to obtain a regularized covariance matrix S.sub.regularized.
[0105] According to a particular case, the regularization sub-step is applied by using a thresholding value , the thresholding value being positive or zero. More specifically, in order to obtain the empirical regularized covariance matrix S.sub.regularized, all the values of the empirical covariance matrix S for which the absolute value is strictly less than the thresholding value are set to 0.
[0106] As the thresholding value is a variable, the regularized empirical covariance matrix S.sub.regularized is a function of the thresholding value . Notably, when the thresholding value is zero, the regularized empirical covariance matrix S.sub.regularized is the empirical covariance matrix S. On the contrary, when the thresholding value tends to infinity, the regularized empirical covariance matrix S.sub.regularized tends towards the zero matrix, i.e. a matrix for which all the terms are zero.
[0107] Such a regularization sub-step is particularly advantageous when the integer p is large or when the integer p is greater than the integer n. Indeed, in such cases, the regularized empirical covariance matrix S.sub.regularized is an estimator of better quality than the empirical covariance matrix S, the function of the thresholding value giving the possibility of removing too small non-significant values. This notably stems from the fact that there may exist noise in the provided data and that there exist a risk of existing one or several positive false values.
[0108] Optionally, the estimation step 54 also includes a normalization sub-step in order to obtain a normalized matrix.
[0109] For example, the normalization sub-step is applied to the empirical covariance matrix S.
[0110] According to a preferred embodiment, the normalization sub-step is applied by computing the following matrix product:
[0111] wherein: [0112] R refers to the normalized matrix, and
refers to the diagonal matrix of the standard-deviations. By definition, the diagonal matrix of the standard-deviations
is a diagonal matrix for which the i-th term of the diagonal is equal to the reciprocal of the standard-deviation of the i-th variable X.sub.i, i being an integer varying between 1 and the integer p.
[0113] In statistics, the correlation of two variables A and B is equal to the ratio between the covariance between said two variables A and B on the one hand and, the standard-deviation product of the first variable A by the standard-deviation of the second variable B on the other hand. The result of this is that the normalized matrix R corresponds to the matrix of the empirical correlations.
[0114] According to these cases, the estimation step 54 thus includes a computation sub-step, or the combination of a computation sub-step and of a regularization sub-step or the combination of a computation sub-step and a normalization sub-step or a combination of the computation, regularization and normalization sub-steps.
[0115] In the case when the three sub-steps are applied, the order for applying regularization and normalization sub-steps is irrelevant. Further a regularized matrix of empirical correlations R.sub.regularized is obtained and the thresholding value is comprised between 0 and 1. In the following description, a value Y is comprised between two values a and b when, on the one hand, the value Y is greater than or equal to the value a and, on the other hand, the value Y is less than or equal to the value b.
[0116] Like for the case of the regularized empirical covariance matrix S.sub.regularized, as the thresholding value is a variable, the regularized matrix of empirical correlations R.sub.regularized is a function of the thresholding value . Notably, when the thresholding value has the value 0, the regularized matrix of empirical correlations R.sub.regularized is equal to the matrix of empirical correlations R. On the contrary, when the thresholding value has the value 1, the regularized matrix of empirical correlations R.sub.regularized tends to the zero matrix, i.e. a matrix for which all the terms are zero.
[0117] At the end of the estimation step 54, an estimated covariance matrix {circumflex over ()} is obtained grouping the estimated covariance values between the different representative quantities of the physical elements or of their activity. Alternatively, a Spearman correlation matrix is obtained when the dependency among the variables is non-linear.
[0118] As an example, for the following, it is assumed that the estimated covariance matrix {circumflex over ()} is the regularized matrix of the empirical correlations R.sub.regularized, i.e. that {circumflex over ()}=R.sub.regularized.
[0119] The method for identifying a relationship also includes a step 56 for associating a graph G.sub. with a thresholding value .
[0120] By definition, a graph G.sub. is associated with a thresholding value when the graph G.sub. comprises representative apices of the physical elements, and links between the apices when the estimated value of the covariance between the relevant apices is greater than or equal to the relevant thresholding value .
[0121] A graph G.sub. is a graphic representation of the estimated value of the covariance relatively to a given thresholding value . This means that the only links visible on a graph G.sub. are links having a relatively large value of the estimated covariance.
[0122] In the particular case of
[0123] Thus, when the thresholding value has the value 0, the graph G.sub.0 is a graph for which all the apices are connected to all the other apices. On the contrary, when the thresholding value has the value 1, the graph G.sub.1 is a graph for which all the apices are isolated, i.e. there is no link between the apices.
[0124] More specifically, it appears that the function which associates with the thresholding value the number of links to be generated in the graph G.sub. associated with the thresholding value is a decreasing function from the value of the number of links in the graph G.sub.0 down to 0.
[0125] As an illustration,
[0126]
[0127] In the first graph G.sub.1, there exist sixteen links between the thirteen apices S1 to S13. Thus, the first apex S1 is connected to the fifth apex S5 via a first link l.sub.1-5. The second apex S2 is connected to the fifth apex S5 via a second link l.sub.2-5. The third apex S3 is connected to the fourth apex S4 via a third link l.sub.3-4 and to the seventh apex S7 via a fourth link l.sub.3-7. The fourth apex S4 is connected to the third apex S3 via the third link l.sub.3-4, to the fifth apex S5 via a fifth link l.sub.4-5, to the seventh apex S7 via a sixth link l.sub.4-7 and to the eighth apex S8 via a seventh link l.sub.4-8. The fifth apex S5 is connected to the fourth apex S4 via the fifth link l.sub.4-5, to the eighth apex S8 via an eighth link l.sub.5-8 and to the ninth apex S9 via a ninth link l.sub.5-9. The sixth apex S6 is connected to the seventh apex S7 via a tenth link l.sub.6-7. The seventh apex S7 is connected to the third apex S3 via the fourth link l.sub.3-7, to the fourth apex S4 via the third link l.sub.3-4, to the eighth apex S8 via an eleventh link l.sub.7-8, to the sixth apex S6 via the tenth link l.sub.6-7 and to the eleventh apex S11 via a twelfth link l.sub.7-12. The eighth apex S8 is connected to the fourth apex S4 via the seventh link l.sub.4-8, to the fifth apex S5 via the eighth link l.sub.5-8, to the seventh apex S7 via the eleventh link l.sub.7-8, to the ninth apex S9 via a thirteenth link l.sub.8-9 and to the twelfth apex S12 via a fourteenth link l.sub.8-12. The ninth apex S9 is connected to the fifth apex S5 via the ninth link l.sub.5-9, to the eighth apex S8 via the thirteenth link l.sub.5-9, to the tenth apex S10 via a fifteenth link l.sub.9-10 and to the thirteenth apex S13 via a sixteenth link l.sub.9-16. The tenth apex S10 is connected to the ninth apex S9 via the fifteenth link l.sub.9-10. The eleventh apex S11 is connected to the seventh apex S7 via the twelfth link l.sub.7-12. The twelfth apex S12 is connected to the eighth apex S8 via the fourteenth link l.sub.8-12. The thirteenth apex S13 is connected to the ninth apex S9 via the sixteenth link l.sub.9-16.
[0128] This means that the first link l.sub.1-5, the second link l.sub.2-5, the third link l.sub.3-4, the fourth link l.sub.3-7, the fifth link l.sub.4-5, the sixth link l.sub.4-7, the seventh link l.sub.4-5, the eighth link l.sub.5-5, the ninth link l.sub.5-9, the tenth link l.sub.6-7, the eleventh link l.sub.7-5, the twelfth link l.sub.7-12, the thirteenth link l.sub.5-9, the fourteenth link l.sub.5-12, the fifteenth link l.sub.9-10 and the sixteenth link l.sub.9-16 each correspond to values of estimated covariance between the relevant apices which are strictly greater than the first thresholding value .sub.1.
[0129]
[0130] The second thresholding value .sub.2 is greater than the first thresholding value .sub.1. Further, the second graph G.sub.2 includes no more than eleven links since the third link l.sub.3-4, the fifth link l.sub.4-5, the sixth link l.sub.4-7, the ninth link l.sub.5-9 and the sixteenth link l.sub.9-16 have disappeared.
[0131] This shows that the third link l.sub.3-4, the fifth link l.sub.4-5, the sixth link l.sub.4-7, the ninth link l.sub.5-9 and the sixteenth link l.sub.9-16 each correspond to values of estimated covariance between the relevant apices which are strictly greater than the first thresholding value .sub.1 but also strictly less than the second thresholding value .sub.2. On the contrary, the first link l.sub.1-5, the second link l.sub.2-5, the fourth link l.sub.3-7, the seventh link l.sub.4-5, the eighth link l.sub.5-5, the tenth link l.sub.6-7, the eleventh link l.sub.7-8, the twelfth link l.sub.7-12, the thirteenth link l.sub.5-9, the fourteenth link l.sub.8-12 and the fifteenth link l.sub.9-10 each correspond to the values of estimated covariance between the relevant apices which are strictly greater than the second thresholding value .sub.2.
[0132]
[0133] The third thresholding value .sub.3 is greater than the second thresholding value .sub.2. Further, the third graph G.sub.3 does not include more than seven links since the first link l.sub.1-5, the fourth link l.sub.3-7, the tenth link l.sub.6-7 and the fourteenth link l.sub.8-12 have disappeared.
[0134] This shows that the first link l.sub.1-5, the fourth link l.sub.3-7, the tenth link l.sub.6-7 and the fourteenth link l.sub.8-12 each correspond to values of covariance estimated between the relevant apices which are strictly greater than the second thresholding value .sub.2 but also strictly less than the third thresholding value .sub.3. On the contrary, the second link l.sub.2-5, the seventh link l.sub.4-5, the eighth link l.sub.5-5, the eleventh link l.sub.7-5, the twelfth link l.sub.7-12, the thirteenth link l.sub.5-9, and the fifteenth link l.sub.9-10 each correspond to values of covariance estimated among the relevant apices which are strictly greater than the third thresholding value .sub.3.
[0135]
[0136] The fourth thresholding value .sub.4 is greater than the third thresholding value .sub.3. Further, the fourth graph G.sub.4 does not include more than three links since the second link l.sub.2-5, the seventh link l.sub.4-5, the twelfth link l.sub.7-12 and the fifteenth link l.sub.9-10 have disappeared.
[0137] This shows that the second link l.sub.2-5, the seventh link l.sub.4-8, the twelfth link l.sub.7-12 and the fifteenth link l.sub.8-18 each correspond to the values of estimated covariance among the relevant apices which are strictly greater than the third thresholding value .sub.3 but also strictly less than the fourth thresholding value .sub.4. On the contrary, the eighth link l.sub.5-8, the eleventh link l.sub.7-8, and the thirteenth link l.sub.8-9 each correspond to the values of estimated covariance among the relevant apices which are strictly greater than the fourth thresholding value .sub.4.
[0138]
[0139] According to another embodiment, the links on the graph are weighted with the intensity of the correlations. The weighting matrix or matrix of the weights of the links is the matrix grouping the absolute values of the matrix obtained at the end of the application of the estimation step 54.
[0140] The method for identifying a relationship comprises a step 58 for obtaining cores.
[0141] By definition, a core is a set of apices of a graph verifying three properties: the first property P1, the second property P2 and the third property P3.
[0142] According to the first property P1, the number of apices of the core is greater than or equal to a fixed number .
[0143] Preferably, the fixed number is greater than or equal to 3, preferentially greater than or equal to 5.
[0144] Preferably the fixed number is greater than or equal to 15, preferentially greater than or equal to 10.
[0145] According to the second property P2, there exists a thresholding value for which the core is a connected component of the graph G.sub. associated with the thresholding value .
[0146] In graph theory, a non-oriented graph is said to be connected if regardless of the relevant apices, there exists a sequence of links from the first apex to the second apex. A maximum connected sub-graph of any non-oriented graph is a connected component of this graph.
[0147] According to the third property P3, no other connected components of a graph exist for which the size is greater than or equal to the fixed number and which is included in the core.
[0148] In another wording, the existence of connected components having less apices than the fixed number is allowed and included in the core. It is also allowed that connected components having more or as many apices as the fixed number exist but each of these connected components has either to be included in the core or not share any apex with the core. Such a property P3 should be verified for all the thresholding values .
[0149] According to another way for presenting such a notion, a class core is a set of apices, of a set minimum size, which may all be connected through reliable paths involving weighted links (covariance) which are sufficiently significant. These paths, which form the link between the apices of a core, are stable on the graphs when the thresholding parameter is increased and this up to a quite high level. The apices not belonging to a core are on the contrary more rapidly isolated (no link with the other apices) on the graph gradually as the thresholding parameter is increased.
[0150] The step 58 for obtaining cores is applied by analyzing the time-dependent change in the graphs according to the variation of the thresholding value.
[0151] For this, a plurality of thresholding values is used. According to the example proposed with reference to
[0152] Preferably, the first plurality of thresholding values is used in an increasing way, i.e. by first considering the smallest value, and then the smallest value of the remaining values until consideration of the largest value.
[0153] Preferentially, the step 58 for obtaining cores is applied with an in-depth course algorithm.
[0154] For example, the minimum number of apices a of a core is set, a minimum thresholding value .sub.min and a parameter P are set for incrementing the thresholding value.
[0155] One begins by extracting the N connected components of the graph G.sub.min for which the number of apices is greater than the fixed number . N is an integer. The extraction of the connected components is obtained by applying an in-depth course algorithm.
[0156] As long as the integer N is different from 0, the following steps are repeated: [0157] 1) Increment the thresholding value of the preceding iteration by adding the parameter P in order to obtain a computational threshold value .sub.computed, [0158] 2) extracting the N connected components of the graph G.sub.computed for which the number of apices is greater than the fixed number . [0159] 3) defining the cores, a core being a connected component of the graph G.sub.computed-pitch (the graph associated with the thresholding value of the preceding iteration which is, by definition of the thresholding value for computation .sub.computed, the difference between the thresholding value of the computation .sub.computed and parameter P) the intersection of which with each of the connected components extracted in the extraction step 2 is zero.
[0160] The whole of the threshold values used forms a plurality of thresholding values.
[0161] The method for identifying a relationship includes a step 60 for defining candidate graphs.
[0162] Each candidate graph is a graph associated with one of the thresholding values from the plurality of thresholding values.
[0163] According to the proposed example, the candidate graphs are the first graph G.sub.1, the second graph G.sub.2, the third graph G.sub.3 and the fourth graph G.sub.4.
[0164] The method for identifying a relationship also includes a step 62 for obtaining the distributions associated with each thresholding value from the plurality of thresholding values.
[0165] By the term of distribution associated with a thresholding value is meant a partitioning into one or several classes of the apices of the graph G.sub. associated with the relevant thresholding value . A class is a set of apices. In the following, such a distribution is noted as R.sub..
[0166] Depending on the relevant example, four distributions R.sub.1, R.sub.2, R.sub.3 and R.sub.4 are therefore to be obtained.
[0167] Preferably, in step 62 for obtaining the distributions, the plurality of thresholding values is used in a decreasing way, i.e. by first considering the largest value, and then the largest value of the remaining values until the smallest value is considered.
[0168] Each of the distributions are obtained by a distinct optimization operation.
[0169] The optimization starts from an initial distribution in which with each core is associated a class for obtaining a final distribution in which each apex of a class shares more links with the other apices of the same class than with the apices of another class.
[0170] Many ways for implementing the optimization exist. Notably, two ways are more specifically described in the following description, being aware that other ways are accessible to one skilled in the art.
[0171] According to a first method, for a given thresholding parameter , the graph G.sub. is partitioned in order to obtain a distribution in which each class comprises a single core and minimizing the cost or weight of the section, defined by the sum of the weights of the links between the classes. By definition, the sum of the weights of the links between the classes is defined by the sum of the absolute value of the links existing between an apex of a class and an apex of the other one. The set of apices and cores taken into account for the distribution depends on the thresholding parameter. We are not interested in the isolated apices and the connected components of too small sizes. We note as V*(), the set of the apices contained in connected components of the graph G.sub. for which the number of apices is greater than equal to the fixed number . Such connected components comprise at least one core.
[0172] For a fixed thresholding value , if V*() contains K cores (K being a positive integer), Q.sub.1, . . . , Q.sub.K, a partition of V*() into K classes, C.sub.1, . . . , C.sub.K, is sought, such that each class Q.sub.k is the union of a core Q.sub.k and of a set of apices S.sub.k at the periphery of this core (which may be empty): C.sub.k=Q.sub.kS.sub.k.
[0173] If the set V*() is empty, i.e. V*()=, all the apices of V are isolated or contained in connected components of a too small size (strictly less than the fixed number ) and the question of the partitioning of the graph is not posed.
[0174] If the set V*() contains a single core, the partitioning of the graph is trivial, a single class groups all the apices of V*().
[0175] When the set V*() contains several cores, the apices S.sub.k around these cores are selected so as to have a minimum weight section. The weight matrix of the links of the graph G.sub. is noted as W() and S refers to the whole of the portions of A=V*()\{Q.sub.1, . . . , Q.sub.K}. S.sub.1, . . . , S.sub.K are the solution to the following optimization problem:
[0176] The first partitioning method described earlier guarantees the fact that an apex which is not in a core is more strongly connected with the class which is assigned to it, than with any other class (by assuming that there cannot be any equality).
[0177] According to a second more elaborated method, the optimization comprises a step for determining the cores for which one apex shares more links with the apices of another class than with the apices of its class. In such a case, the determined cores are no longer considered as cores but as a set of isolated apices which may each belong to a different class. This gives the possibility of avoiding classification errors.
[0178] In another wording, as it is supposed that the core of the class is the most stable and the most central portion of the class (the furthest away from the other classes), if a core contains at least one apex better connected to another class, the core is declassified by considering the apices of this core as being simples peripheral apices and we carry out new partitioning of the graph.
[0179] From a mathematical point of view, it is possible to implement the second method by coming back to the formulation of the first method. Indeed, if in a core Q it is possible to find an apex q, less strongly connected with its class C.sub.i, than with another class C.sub.p, a partition of V*() into K1 classes is sought by no longer considering Q.sub.i as a core (A=AQ.sub.1) in the optimization problem posed within the scope of the first method. This is repeated until the whole of the apices are more strongly connected to the class which is assigned to them than any other class.
[0180] According to the example of
[0181] The method for identifying a relationship also includes a step 64 for selecting an optimum graph from among the plurality of candidate graphs according to at least one criterion.
[0182] The criteria(on) selected give the possibility of selecting a candidate graph corresponding to a good compromise in terms of density. Indeed, the denser a candidate graph and the more the relevant candidate graph takes into account the information. On the contrary, the less dense the candidate graph and the more the relevant candidate graph shows sets of clearly identifiable apices.
[0183] Preferably, in the selection step 64, at least two criteria are used, the first criterion dealing with the graph and the second criterion being relative to the distribution associated with the graph.
[0184] For this, according to an example of a first criterion, the selected candidate graph is the graph for which the difference between the distribution of the connectivity degrees and a distribution according to a power law is a minimum.
[0185] The connectivity degree of an apex is for example computed by forming the sum of the weights associated with the links of the relevant apex.
[0186] The distribution according to a power law is, according to a particular example, a Pareto law.
[0187] The distribution according to a power law is, according to another particular example, a scale-invariant network law.
[0188] The difference is, as an illustration, a Euclidean distance.
[0189] According to an example, the second criterion is modularity. The modularity is a criterion comparing the proportion of links of a class of a graph with the proportion obtained for links randomly placed on the relevant graph. The distributions for which the modularity is large will be promoted.
[0190] According to another example, the second criterion is the number of classes. The distributions for which the number of classes is maximum will be promoted.
[0191] According to another example, the second criterion is the stability of the number of classes with the variation of the thresholding value . The distributions for which the number of classes is the most stable will be promoted.
[0192] The method for identifying a relationship therefore gives the possibility of obtaining an optimum graph and an optimum distribution of the physical elements. Belonging to a same class indicates that there exists a relationship between the studied physical elements.
[0193] In order to obtain such a piece of information, the identification method allows better determination of the graph and of the distribution than the methods of the state of the art in so far that such methods do not carry out optimization on the graph during the partitioning into classes of the graph.
[0194] The method for identifying a relationship therefore allows identification of sets of physical elements having a relationship between them on the basis of the relevant representative quantity.
[0195] In particular, the method for identifying a relationship may give the possibility of identifying sets of genes having a relationship between them on the basis of their expression levels in the relevant samples, or having similar expression profiles. Genes for which the expression profiles are similar (co-expressed genes) may for example have identical regulation mechanisms or be part of a same regulation route, i.e. be co-regulated.
[0196] The regulation of the expression of a gene refers to the whole of the regulation mechanisms applied during the process for synthesizing a product of a functional gene (RNA or protein) from the genetic piece of information contained in a DNA sequence. The regulation refers to modulation, in particular an increase or a reduction in the amount of products of the expression of a gene (RNA or protein). All the steps ranging from the DNA sequence to the final product of the expression of a gene may be regulated, whether this be the transcription, the ripening of the messenger RNAs, the translation of the messenger RNAs or the stability of the messenger RNAs or of the proteins.
[0197] For example, the method for identifying a relationship may give the possibility of identifying a relationship between genes or proteins which are all strongly expressed, or strongly over-expressed relatively to a control, or between genes or proteins which are all not very expressed, or strongly under-expressed relatively to a control.
[0198] In a preferred embodiment, the method for identifying a relationship advantageously gives the possibility of organizing the genes, the RNAs or the proteins, for which the expression profiles are identical, into groups or sets, according to a hierarchical grouping.
[0199] According to a particular embodiment, the method for identifying a relationship advantageously gives the possibility of identifying interactions between genes.
[0200] According to another embodiment, the method for identifying a relationship advantageously gives the possibility of identifying sets of genes which are co-expressed and/or co-regulated. This may give the possibility of identifying regulation routes which are not yet known. Moreover, a gene, the function of which is unknown and which is part of a set containing a large number of genes involved in a particular cell function or a particular cell process, has a strong probability of being itself also involved in this function or in this process. Thus, starting from the assumption that co-expressed and/or co-regulated genes may be functionally related, the method may give the possibility of identifying the putative function of certain genes.
[0201] According to a preferred embodiment, the method for identifying a relationship also includes a step in which the classes obtained in the optimum distribution are ordered.
[0202] For this, each class of the optimum distribution is associated on a one-to-one basis with a value of the representative quantity. Therefore, such a value is a synthetic value which summarizes the relevant class.
[0203] Such an association is obtained by different methods.
[0204] For example, the most significant variable in the class is selected according to a criterion, such a criterion may be the centrality or the connectivity degree to the other apices.
[0205] According to another example, the use of a method for reducing the dimensionality of the class in order to infer therefrom a synthetic value is proposed. The analysis in main components is an example of such a method for reducing the dimensionality of the class.
[0206] Again according to another example, the synthetic value is a function of the representative quantities of each variable of the class.
[0207] For example, each class of the optimum distribution is associated with the average value of the whole of the representative quantities of the apices which the relevant class includes. The average value is for example an arithmetic mean value, a geometrical mean value or a mean value weighted by coefficients related to the intensity of the correlations between the relevant apices. Preferably, the function is a linear function.
[0208] According to another embodiment, it is also possible to apply regression in order to model the representative quantity from classes of variables themselves and for selecting the classes or the most significant variables in the model.
[0209] This gives the possibility of facilitating the utilization of the optimal distribution and of the optimum graph obtained at the end of the application of the method for identifying a relationship.
[0210] Further, this also makes possible the method for identifying a relationship, which may be utilized for applying other methods illustrated in reference to the flowcharts of
[0211] Such methods may also be applied by means of the system 10 proposed in
[0212] From among the proposed methods, with reference to
[0213] By therapeutical target of a pathology, is meant any biological elements on which it is possible to act for preventing and/or treating this pathology. The therapeutic target may in particular be a gene or a product of the expression of a gene. For example, the product of the expression of a gene is an RNA, in particular a messenger RNA or a protein.
[0214] The method for identifying a therapeutic target includes a first step 100 for applying the method for identifying a relationship as described earlier for the cases when the physical elements are genes, the plurality of individuals is a plurality of biological individuals suffering from the pathology and the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals. Such a first step 100 for applying the method for identifying a relationship notably gives the possibility of obtaining an optimum distribution, a so called first distribution R1, including first classes C1.sub.i, i being an integer varying between 1 and the number of classes of the first distribution R1, in which are distributed the apices representative of the genes.
[0215] The first step 100 for applying the method for identifying a target includes a sub-step in which the first classes C1.sub.i obtained in the first distribution R1 are ordered, in order to obtain a first distribution R1 in which each first class C1.sub.i is associated on a one-to-one basis to a first value Z1.sub.i of the representative quantity.
[0216] The method for identifying a therapeutic target also includes a second step 110 for applying the method for identifying a relationship as described earlier for the case when the physical elements are genes, the plurality of individuals is a plurality of biological individuals not suffering from the pathology and the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals. Such a second step 110 for applying the method for identifying a relationship notably gives the possibility of obtaining an optimum distribution, a so called second distribution R2, including second classes C2.sub.j, j being an integer varying between 1 and the number of classes of the second distribution R2, in which are distributed the representative apices of the genes.
[0217] The second step 110 for applying the method for identifying a target includes a sub-step in which the second classes C2.sub.j obtained in the second distribution R2 are ordered, in order to obtain a second distribution R2 in which each second class C2.sub.j is associated on a one-to-one basis with a second value Z2.sub.j of the representative quantity.
[0218] Preferably, the first and second steps 100 and 110 for applying the method for identifying a relationship are applied simultaneously for reducing the time for applying the method for identifying a therapeutic target. This is indicated in
[0219] The method for identifying a therapeutic target also includes a step 120 for comparing the first distribution R1 and the second distribution R2.
[0220] The method for identifying a therapeutic target also includes a step 130 for selecting as a therapeutic target, a gene or a product of the expression of the gene. The gene or the product of the expression of the gene is selected when a condition is verified. The representative apex of the gene in the first distribution R1 belongs to a first class C1.sub.i0 wherein i0 refers to the number of the class. Said first class C1.sub.i0 is associated with a first value Z1.sub.i0. The representative apex of the gene in the second distribution R1 belongs to a second class C2.sub.j0 wherein j0 refers to the number of the class. Said second class C2.sub.j0 is associated with a second value Z2.sub.j0. The condition for selecting the gene or the product of the expression of the gene is verified when the first value Z1.sub.i0 significantly differs from the second value Z2.sub.j0.
[0221] By the expression <<significantly different>> is meant that the second value Z2.sub.j0 differs from the first value Z1.sub.i0 by more than 1% of the first value Z1.sub.i0, preferably more than 5% of the first value Z1.sub.i0 and preferentially more than 10% of the first value Z1.sub.i0.
[0222] The method for identifying a therapeutic target may notably give the possibility of determining a target efficiently.
[0223] From among the proposed methods, with reference to
[0224] The method for identifying a biomarker includes a first step 200 for applying the method for identifying a relationship as described earlier for the case when the physical elements are genes, the plurality of individuals is a plurality of biological individuals suffering from the pathology and the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals. Such a first step 200 for applying the method for identifying a relationship notably gives the possibility of obtaining an optimum distribution, a so called first distribution R1, including first classes C1.sub.i, i being an integer varying between 1 and the number of classes of the first distribution R1, in which are distributed the representative apices of the genes.
[0225] The first step 200 for applying the method for identifying a biomarker includes a sub-step in which the first classes C1.sub.i obtained in the first distribution R1 are ordered, in order to obtain a first distribution R1 in which each first class C1.sub.i is associated in a one-to-one basis with a first value Z1.sub.i of the representative quantity.
[0226] The method for identifying a biomarker also includes a second step 210 for applying the method for identifying a relationship as described earlier for the case when the physical elements are genes, the plurality of individuals is a plurality of biological individuals not suffering from the pathology and the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals. Such a second step 210 for applying the method for identifying a relationship notably gives the possibility of obtaining an optimum distribution, a so called second distribution R2, including second classes C2.sub.j, j being an integer varying between 1 and the number of classes of the second distribution R2, in which are distributed the representative apices of the genes.
[0227] The second step 210 for applying the method for identifying a relationship includes a sub-step in which the second classes C2.sub.j obtained in the second distribution R2 are ordered, in order to obtain a second distribution R2 in which each second class C2.sub.j is associated on a one-to-one basis to a second value Z2.sub.j of the representative quantity.
[0228] Preferably, the first and second steps 200 and 210 for applying the method for identifying a relationship are applied simultaneously in order to reduce the time for applying the method for identifying a biomarker. This is indicated in
[0229] The method for identification a biomarker also includes a step 220 for comparing the first distribution R1 and the second distribution R2.
[0230] The method for identifying a biomarker also includes a step 230 for selecting as a biomarker a gene or a product of the expression of the gene. The gene or the product of the expression of the gene is selected when a condition is verified. The representative apex of the gene in the first distribution R1 belongs to a first class C1.sub.i0 wherein i0 refers to the number of the class. Said first class C1.sub.i0 is associated with a first value Z1.sub.i0. The representative apex of the gene in the second distribution R1 belongs to a second class C2.sub.j0 wherein j0 refers to the number of the class. Said second class C2.sub.j0 is associated with a second value Z2.sub.j0. The condition for selecting the gene or the product of the expression of the gene is verified when the first value Z1.sub.i0 significantly differs from the second value Z2.sub.j0.
[0231] By the expression <<significantly different>> is meant that the second value Z2.sub.j0 differs from the first value Z1.sub.i0 by more than 1% of the first value Z1.sub.i0, preferably by more than 5% of the first value Z1.sub.i0 and preferentially more than 10% of the first value Z1.sub.i0.
[0232] The method for identifying a biomarker notably gives the possibility of determining a biomarker efficiently.
[0233] From among the proposed methods, with reference to
[0234] The method for identifying the screening includes a first step 300 for applying the method for identifying a relationship as described earlier for the case when the plurality of individuals is a plurality of biological individuals suffering from the pathology and having received the compound, the representative quantity is the quantification on the expression of at least one gene of the plurality of individuals and the data comprising the representative quantity of the known therapeutic target. Depending on the cases, the therapeutic target may be a gene or a product of the expression of a gene. When the therapeutic target is a gene, the physical elements are genes. When the therapeutic target is the product of the expression of a gene, the physical elements are the same product of the expression of a gene. As an example, when the therapeutic target is an RNA, the physical elements are RNAs. According to another example, when the therapeutic target is a protein, the physical elements are proteins.
[0235] Such a first step 300 for applying the method for identifying a relationship notably gives the possibility of obtaining an optimum distribution, a so called first distribution R1, including first classes C1.sub.i, i being an integer varying between 1 and the number of classes of the first distribution R1, in which are distributed the representative apices of the genes.
[0236] The first step 300 for applying the method for identifying a relationship includes a sub-step in which the first classes C1.sub.i obtained in the first distribution R1 are ordered, in order to obtain a first distribution R1 in which each first class C1.sub.i is associated on a one-to-one basis with a first value Z1.sub.i of the representative quantity.
[0237] The screening method also includes a second step 310 for applying the method for identifying a relationship as described earlier for the case when the plurality of individuals is a plurality of biological individuals suffering from said pathology and not having received said compound, the representative quantity is the quantification of the expression of at least one gene of the plurality of individuals and the data comprise the representative quantity of the known therapeutic target. Depending on the cases, the therapeutic target may be a gene or a product of the expression of a gene. When the therapeutic target is a gene, the physical elements are genes. When the therapeutic target is the product of the expression of a gene, the physical elements are the same product of the expression of a gene. As an example, when the therapeutic target is an RNA, the physical elements are RNAs. According to another example, when the therapeutic target is a protein, the physical elements are proteins.
[0238] Such a second step 310 for applying the method for identifying a relationship notably gives the possibility of obtaining an optimum distribution, a so called second distribution R2, including second classes C2.sub.j, j being an integer varying between 1 and the number of classes of the second distribution R2, in which are distributed the representative apices of the genes.
[0239] The second step 310 for applying the method for identifying a relationship includes a sub-step in which the second classes C2.sub.j obtained in the second distribution R2 are ordered, in order to obtain a second distribution R2 in which each second class C2.sub.j is associated on a one-to-one basis with a second value Z2.sub.j of the representative quantity.
[0240] Preferably, the first and second steps 300 and 310 for applying the method for identifying a relationship are applied simultaneously for reducing the time for applying the screening method. This is indicated in
[0241] The screening method also includes a step 320 for comparing the first distribution R1 and the second distribution R2.
[0242] The screening method also includes a step 230 for selecting a compound which may be used as a drug. The compound is selected when a condition is verified. The representative apex of the known therapeutic target in the first distribution R1 belongs to a first class C1.sub.i0 wherein i0 refers to the number of the class. Said first class C1.sub.i0 is associated with a first value Z1.sub.i0. The representative apex of the known therapeutic target in the second distribution R1 belongs to a second class C2.sub.j0 wherein j0 refers to the number of the class. Said second class C2.sub.j0 is associated with a second value Z2.sub.j0. The condition for selecting the compound is verified when the first value Z1.sub.i0 significantly differs from the second value Z2.sub.j0.
[0243] By the expression <<significantly differ>> is meant that the second value Z2.sub.j0 differs from the first value Z1.sub.i0 by more than 1% of the first value Z1.sub.i0, preferably by more than 5% of the first value Z1.sub.i0 and preferentially by more than 10% of the first value Z1.sub.i0.
[0244] The screening method notably gives the possibility of efficiently screening a compound which may be used as a drug.
[0245] Each of the methods proposed may be applied by means of any computer or any other type of device. Multiple systems may be used with programs applying the previous methods but it may also be contemplated to use apparatuses dedicated to the application of the previous methods, the latter being able to be inserted into the devices specific for measuring the provided data. Further, the proposed embodiments are not connected to a particular programming language. Incidentally, this implies that many programming languages may be used for applying one of the methods detailed earlier.
[0246] The methods and embodiments described above are able to be combined with each other, either totally or partly, in order to give rise to other embodiments of the invention.