METHOD FOR ESTIMATING THE PROBE-TARGET AFFINITY OF A DNA CHIP AND METHOD FOR MANUFACTURING A DNA CHIP
20170270242 · 2017-09-21
Assignee
Inventors
Cpc classification
G16B25/00
PHYSICS
International classification
Abstract
A method for estimating the affinity φ of a first DNA strand, or “probe”, to be hybridized with a second DNA strand, or “target”, to form a hybrid of length L.sub.bp, the method comprising: in each division of a set of M divisions of the hybrid, counting the number of times in which each hybrid of a set of P DNA strand hybrids is present in the division, the hybrids being of length k less than the length L.sub.bp, or “k-hybrids”; for each combination of mismatches of a set of L combinations of mismatches in a hybrid of length L.sub.bp, determining whether the combination of mismatches is present in the hybrid; and calculating the affinity φ according to the relation:
$$\varphi=\sum_{m=1}^{M}\sum_{p=1}^{P}\hat{\beta}_{m,p}\,x_{m,p}+\sum_{l=1}^{L}\hat{\delta}_{l}\,y_{l}$$
Claims
1. A method for estimating the affinity φ of a first DNA strand, or “probe”, to be hybridized with a second DNA strand, or “target”, to form a hybrid of length L.sub.bp, the method comprising: in each division of a set of M divisions of the hybrid, counting the number of times in which each hybrid of a set of P DNA strand hybrids is present in the division, the hybrids being of length k less than the length L.sub.bp, or “k-hybrids”; for each combination of mismatches of a set of L combinations of mismatches in a hybrid of length L.sub.bp, determining whether the combination of mismatches is present in the hybrid; and calculating the affinity φ according to the relation:
$$\varphi=\sum_{m=1}^{M}\sum_{p=1}^{P}\hat{\beta}_{m,p}\,x_{m,p}+\sum_{l=1}^{L}\hat{\delta}_{l}\,y_{l}$$
2. The method as claimed in claim 1, in which
3. The method as claimed in claim 2, comprising: for each pair of a set of N learning pairs each comprising a first and a second DNA strand capable of together forming a hybrid of length L.sub.bp, bringing together a quantity of the first DNA strand of the pair with a quantity of the second DNA strand of the pair, and measuring an intensity I.sub.n representative of the quantity of DNA strand hybrids formed following this bringing together, the hybrids of the calibration pairs comprising at least once each k-hybrid of the set of P k-hybrids; and calculating a vector {circumflex over (B)}∈ℝ.sup.P.M, a vector {circumflex over (Θ)}∈ℝ.sup.N and a vector {circumflex over (Δ)}∈ℝ.sup.L minimizing a distance D between a vector I=(I.sub.1 . . . I.sub.n . . . I.sub.N).sup.T∈ℝ.sup.N of the measured intensities and a vector M=(M.sub.1 . . . M.sub.n . . . M.sub.N).sup.T∈ℝ.sup.N of prediction of the vector I of the measured intensities, the calculation being performed by solving an optimization problem according to the relations:
$$(\hat{B},\hat{\Theta},\hat{\Delta})=\underset{B,\Theta,\Delta}{\arg\min}\;D\big(I,\,M(B,\Theta,\Delta)\big)$$
$$\forall n\in[1,N],\quad M_{n}=\theta_{n}\cdot\left(X_{n}\cdot B+Y_{n}\cdot\Delta\right)$$
expressions in which: Θ=(θ.sub.1 . . . θ.sub.n . . . θ.sub.N).sup.T is a vector of ℝ.sup.N, in which ∀n∈[1,N], θ.sub.n is a scalar coding a quantity of the first and/or of the second DNA strands brought together for the n.sup.th calibration pair; ∀n∈[1,N], X.sub.n=(X.sub.n,1 . . . X.sub.n,m . . . X.sub.n,M) is a row matrix of predetermined design of ℝ.sup.P.M, in which ∀m∈[1,M], X.sub.n,m=(x.sub.n,m,1 . . . x.sub.n,m,p . . . x.sub.n,m,P) is a row matrix of ℝ.sup.P and ∀p∈[1,P], x.sub.n,m,p is the number of times in which the p.sup.th k-hybrid is present in the m.sup.th area of the division for the hybrid formed by the first and second DNA strands of the n.sup.th calibration pair; B=(B.sub.1 . . . B.sub.m . . . B.sub.M).sup.T is a vector of ℝ.sup.P.M, in which ∀m∈[1,M], B.sub.m=(β.sub.m,1 . . . β.sub.m,p . . . β.sub.m,P).sup.T is a vector of ℝ.sup.P, in which ∀p∈[1,P], β.sub.m,p is a scalar quantifying the contribution to the affinity of a hybrid of length L.sub.bp of the p.sup.th k-hybrid of the set of P k-hybrids when this p.sup.th k-hybrid is present in the m.sup.th area of the division; ∀n∈[1,N], Y.sub.n=(y.sub.n,1 . . . y.sub.n,l . . . y.sub.n,L) is a row matrix of predetermined design of ℝ.sup.L, in which ∀l∈[1,L], y.sub.n,l=1 if the l.sup.th pair of mismatches is present in the hybrid formed by the first and second DNA strands of the n.sup.th calibration pair; and Δ=(δ.sub.1 . . . δ.sub.l . . . δ.sub.L).sup.T is a vector of ℝ.sup.L, in which ∀l∈[1,L], δ.sub.l is a scalar quantifying the contribution to the affinity of a hybrid of length L.sub.bp of the l.sup.th pair of mismatches.
4. The method as claimed in claim 1, wherein: the k-hybrids have a length k of between 2 and 7; and the number M of areas of the division is between 2 and 25−k.
5. The method as claimed in claim 4, wherein the number M of areas is between 3 and 15.
6. The method as claimed in claim 4, wherein the k-hybrids have a length k of between 3 and 5.
7. The method as claimed in claim 2, wherein the optimization problem is solved subject to the additional constraint according to the relation:
8. The method as claimed in claim 2, wherein the optimization problem is solved iteratively: by setting, on the iteration i, the vectors B, Δ to their values calculated on the preceding iteration i−1 and by solving the optimization problem according to the relations:
9. The method as claimed in claim 8, wherein the first iteration is performed by setting ∀n∈[1,N], X.sub.n·B(1)+Y.sub.n·Δ(1)=1.
10. A method for estimating the contributions {circumflex over (β)}.sub.m,p of hybrids of a set of P DNA strand hybrids of length k, or “k-hybrids”, to the affinity of a DNA strand hybrid of length L.sub.bp, comprising: for each pair of a set of N learning pairs each comprising a first and a second DNA strand capable of together forming a hybrid of length L.sub.bp, bringing together a quantity of the first DNA strand of the pair with a quantity of the second DNA strand of the pair, and measuring an intensity I.sub.n representative of the quantity of DNA strand hybrids formed following this bringing together, the hybrids of the calibration pairs comprising at least once each k-hybrid of the set of P k-hybrids; and calculating a vector {circumflex over (B)}∈ℝ.sup.P.M, a vector {circumflex over (Θ)}∈ℝ.sup.N and a vector {circumflex over (Δ)}∈ℝ.sup.L minimizing a distance D between a vector I=(I.sub.1 . . . I.sub.n . . . I.sub.N).sup.T∈ℝ.sup.N of the measured intensities and a vector M=(M.sub.1 . . . M.sub.n . . . M.sub.N).sup.T∈ℝ.sup.N of prediction of the vector I of the measured intensities, the calculation being performed by solving an optimization problem according to the relations:
$$(\hat{B},\hat{\Theta},\hat{\Delta})=\underset{B,\Theta,\Delta}{\arg\min}\;D\big(I,\,M(B,\Theta,\Delta)\big)$$
$$\forall n\in[1,N],\quad M_{n}=\theta_{n}\cdot\left(X_{n}\cdot B+Y_{n}\cdot\Delta\right)$$
expressions in which: Θ=(θ.sub.1 . . . θ.sub.n . . . θ.sub.N).sup.T is a vector of ℝ.sup.N, in which ∀n∈[1,N], θ.sub.n is a scalar coding a quantity of first and/or of second DNA strands brought together for the n.sup.th calibration pair; ∀n∈[1,N], X.sub.n=(X.sub.n,1 . . . X.sub.n,m . . . X.sub.n,M) is a row matrix of predetermined design of ℝ.sup.P.M, in which ∀m∈[1,M], X.sub.n,m=(x.sub.n,m,1 . . . x.sub.n,m,p . . . x.sub.n,m,P) is a row matrix of ℝ.sup.P and ∀p∈[1,P], x.sub.n,m,p is the number of times in which the p.sup.th k-hybrid is present in the m.sup.th area of the division for the hybrid formed by the first and second DNA strands of the n.sup.th calibration pair; B=(B.sub.1 . . . B.sub.m . . . B.sub.M).sup.T is a vector of ℝ.sup.P.M, in which ∀m∈[1,M], B.sub.m=(β.sub.m,1 . . . β.sub.m,p . . . β.sub.m,P).sup.T is a vector of ℝ.sup.P, in which ∀p∈[1,P], β.sub.m,p is a scalar quantifying the contribution to the affinity of a hybrid of length L.sub.bp of the p.sup.th k-hybrid of the set of P k-hybrids when this p.sup.th k-hybrid is present in the m.sup.th area of the division; ∀n∈[1,N], Y.sub.n=(y.sub.n,1 . . . y.sub.n,l . . . y.sub.n,L) is a row matrix of predetermined design of ℝ.sup.L, in which ∀l∈[1,L], y.sub.n,l=1 if the l.sup.th pair of mismatches is present in the hybrid formed by the first and second DNA strands of the n.sup.th calibration pair; and Δ=(δ.sub.1 . . . δ.sub.l . . . δ.sub.L).sup.T is a vector of ℝ.sup.L, in which ∀l∈[1,L], δ.sub.l is a scalar quantifying the contribution to the affinity of a hybrid of length L.sub.bp of the l.sup.th pair of mismatches.
11. A computer program product stored on a computer-readable computing medium comprising instructions for the execution of a method as claimed in claim 1.
12. A method for fabricating a DNA chip comprising copies of a DNA strand, or probe, capable of forming a hybrid of length L.sub.bp with a target strand of nucleic acid of length greater than L.sub.bp without mismatch, the method consisting in: identifying a set of portions of length L.sub.bp on the target DNA strand; for each identified portion of the target DNA strand, or “candidate target”: determining the complementary DNA strand, or “candidate probe”, and calculating a first affinity φ of the candidate probe and target by implementing a method as claimed in claim 1; calculating a second affinity φ of the candidate probe with each element of a set of nucleic strands not comprising the candidate target by implementing a method as claimed in claim 1; selecting, from the determined candidate probes, at least one probe for which: the first affinity φ is above a predetermined first threshold S.sub.1; and each of the second affinities φ is below a second threshold S.sub.2 strictly lower than the first threshold S.sub.1; fabricating the DNA chip with each selected candidate probe.
13. The method for fabricating a DNA chip as claimed in claim 12, consisting in selecting, from determined candidate probes, at least one probe for which at most N calculated affinities are above a predetermined threshold and for which the other second calculated affinities are below a second threshold strictly lower than the first threshold.
Description
BRIEF DESCRIPTION OF THE FIGURES
[0082] The invention will be better understood on reading the following description, given purely by way of example, and in relation to the attached figures.
DETAILED DESCRIPTION OF THE INVENTION
A) System for Estimating Affinity and Selecting Probes of a DNA Chip
[0096]
[0097] As an example, the DNA chip is designed to detect an endogenous retrovirus transcript present in the human genome, or “HERV”, which stands for “human endogenous retroviruses”, and an LTR retrotransposon transcript, an ancestor of the infectious retroviruses, or “MalR”, which stands for “Mammalian-Apparent Long-Terminal Repeat Retrotransposon”. The HERV/MalR elements represent up to 8% of the human genome, or approximately 400 000 elements or loci that can each produce 0, 1 or several transcripts of a length that can range up to 10 000 nitrogenous bases. By convention, these elements are referred to as “HERV/MalR”. Designing a DNA chip targeting a particular HERV/MalR transcript is known to be very difficult because the HERV/MalR elements share a very large number of DNA sequences called “repeats”, that is to say sequences that are identical or phylogenetically very close to one another and present at very many points in the human genome.
[0098] The computing unit 10 comprises:
[0099] a first memory block 12 storing a numeric bank of HERV/MalR loci, or a set of more than 400 000 numerical sequences of nitrogenous bases corresponding to the potential HERV/MalR transcripts, e.g. previously sequenced in a manner known per se;
[0100] a second memory block 14 storing the numeric sequence coding the target HERV/MalR transcript, for example entered by the designer of the DNA chip;
[0101] a third memory block 16 storing a set {{circumflex over (β)}.sub.m,p, {circumflex over (δ)}.sub.l} of coefficients {circumflex over (β)}.sub.m,p and {circumflex over (δ)}.sub.l quantifying contributions to the affinity of k-hybrids and of pairs of mismatches, as will be explained in more detail herein below;
[0102] a fourth memory block 18 storing thresholds S.sub.1, S.sub.2 parameterizing DNA chip probe selection rules; and
[0103] memory blocks 20, 22, 24, 26, 28 storing intermediate probe selection results.
[0104] The unit 10 also comprises computation blocks, for example software modules implemented on a computer, in particular:
[0105] a block 29 for generating a set of numerical sequences of nitrogenous bases coding the non-specific transcripts. The block 29 creates, from the bank of transcripts 12, a new set of numerical sequences of nitrogenous bases, by removing from it the target transcript stored in the block 14. This new set therefore codes the non-specific transcripts for which probes of weak affinity are sought, and is stored in the memory block 20. There are many ways of generating the set of non-specific transcripts. For example, the block 29 may be omitted when the bank 12 contains only these transcripts. To lighten the notation, reference is made herein below interchangeably to the transcripts and to the sequences corresponding to them;
[0106] a block 30 for generating candidate probes for the target transcript. Preferably, the block 30 identifies each subsequence of length L.sub.bp of the target transcript at each position thereof, then determines, for each of these subsequences, the strictly complementary sequence in terms of nitrogenous bases. These complementary subsequences form the “candidate” probes for the DNA chip and are stored in the memory block 22. By analogy, the target transcript portion corresponding to a candidate probe is referred to by the expression “candidate target”. Many other candidate probe selection rules can of course be implemented. For example, some portions of the target transcript can be disregarded if it is known beforehand that they cannot give appropriate probes for the DNA chip;
[0107] an alignment block 32 forming hybrids between each candidate probe of the memory block 22 and the non-specific transcripts of the memory block 20. More particularly, the block 32 identifies the hybrids comprising at most two mismatches. To do this, the block 32 identifies the hybrids having a maximum number of pairs of matched bases, by introducing, as necessary, a mismatch of gap type. The hybrids thus identified are stored in the memory block 26. Limiting the number of defects makes it possible to speed up the method according to the invention and to limit the number of coefficients {{circumflex over (β)}.sub.m,p, {circumflex over (δ)}.sub.l} necessary to the computation of the affinity. The inventors have in fact noted that, from three mismatches upward, the affinity of a probe with a transcript drops, the intensity of a DNA chip corresponding to the probe/transcript hybrid being moreover buried in the background noise. This observation is corroborated by the study “Custom human endogenous retroviruses dedicated microarray identifies self-induced HERV-W family elements reactivated in testicular cancer upon methylation control” by Gimenez et al., Nucleic Acids Research, April 2010, vol. 38(7): 2229-2246. For example, the block 32 implements the “BWA” alignment software described in the document “Fast and accurate long-read alignment with Burrows-Wheeler transform”, by Li H. et al., Bioinformatics, vol. 26(5): 589-595, and that can be downloaded at the address http://bio-bwa.sourceforge.net/;
[0108] a block 34 for modeling each hybrid stored in the memory block 26 and each hybrid formed from a candidate probe and its target transcript using “k-hybrids” and pairs of mismatches, in a manner described herein below. This modeling produces a set {x.sub.m,p, y.sub.l} of variables x.sub.m,p and y.sub.l for each hybrid, which are stored in the memory block 24;
[0109] a computation block 36 which computes, for each set {x.sub.m,p, y.sub.l} stored in the memory block 24, an affinity φ of the corresponding hybrid as a function of the coefficients {circumflex over (β)}.sub.m,p and {circumflex over (δ)}.sub.l stored in the memory block 16, in a manner explained herein below. The computed affinities φ are then stored in the memory block 28. Note that, for each candidate probe, several affinities φ are computed: an affinity φ.sub.1 for the target transcript of the probe and a plurality of affinities φ.sub.2 for the non-specific transcripts; and
[0110] a selection block 38 which selects at least one probe for which the computed affinities φ, stored in the memory block 28, satisfy the selection rules parameterized by the thresholds S.sub.1, S.sub.2 stored in the memory block 18, in a manner described herein below.
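The generation of candidate probes by the block 30 can be illustrated by the following minimal sketch. The function name, the toy sequence and the choice to write each probe as the reverse complement of its candidate target (so that it reads 5′ to 3′) are assumptions made for illustration only; the description only requires the strictly complementary sequence of each subsequence of length L.sub.bp.

```python
# Hypothetical sketch of block 30: slide a window of length L_bp over the target
# transcript and take the complementary strand of each window as a candidate probe.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def candidate_probes(target, l_bp=25):
    """Return (position, candidate target, candidate probe) for every window."""
    out = []
    for pos in range(len(target) - l_bp + 1):
        window = target[pos:pos + l_bp]
        # Assumption: the probe is written 5'->3', hence the reversal.
        probe = "".join(COMPLEMENT[base] for base in reversed(window))
        out.append((pos, window, probe))
    return out

# Toy transcript of 8 bases and a toy probe length of 5 (the description uses 25).
probes = candidate_probes("ACGTTGCA", l_bp=5)
for pos, window, probe in probes:
    print(pos, window, probe)
```

A transcript of length T yields T − L.sub.bp + 1 candidate probes, which is why additional rules for discarding portions of the transcript beforehand can be useful.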
B) Estimating the Affinity of a Probe with a Transcript
[0111] The selection of probes for a DNA chip implemented by the unit 10 being partly defined by the modeling of the affinity φ according to the invention, the latter is first of all detailed in relation to
[0112] For the estimation of the affinity φ of the probe 40 with the portion of transcript 46, the set of portions k−H.sub.1, k−H.sub.2, k−H.sub.3, . . . , k−H.sub.25−k+1 of the hybrid, each of length k=5 bases, is identified, these portions of length k being designated by the expression “k-hybrids”. For a hybrid of length L.sub.bp, a total of L.sub.bp−k+1 “k-hybrids” is therefore identified, that is to say 21 k-hybrids for a hybrid of 25 bases and k=5. The model of the affinity φ according to the invention computes the affinity φ as a function of the contribution of each identified k-hybrid, the contribution of a k-hybrid also depending on the position thereof in the hybrid.
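As an illustration of this enumeration, the sketch below lists the k-hybrids of an aligned hybrid. The representation of a hybrid as two equal-length strings aligned base by base (with "-" standing for a gap) and the function name are assumptions, not part of the description.

```python
# Assumed representation: probe and target are aligned, equal-length strings;
# position i of one strand faces position i of the other.
def k_hybrids(probe, target, k=5):
    """Return the L_bp - k + 1 k-hybrids as (probe k-mer, target k-mer) pairs."""
    if len(probe) != len(target):
        raise ValueError("aligned strands must have the same length")
    return [(probe[i:i + k], target[i:i + k]) for i in range(len(probe) - k + 1)]

probe = "ACGT" * 6 + "A"     # toy probe of 25 bases
target = "TGCA" * 6 + "T"    # its base-by-base complement (perfect hybrid)
windows = k_hybrids(probe, target, k=5)
print(len(windows))          # 25 - 5 + 1 k-hybrids
```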
[0113] The position of a k-hybrid can be the precise position in the hybrid, for example determined by the position of the pair of matched bases of the k-hybrid leftmost in the hybrid. This so-called “any position” model therefore leads to considering L.sub.bp−k+1 different positions. However, the number of positions influences the number of parameters of the model, and therefore influences the computing resources necessary to the implementation thereof, as well as the quantity of learning data needed.
[0114] Advantageously, the number of positions of a k-hybrid in the hybrid is reduced by dividing the hybrid into a limited number M of areas. For example, by referring to
[0115] The contribution to the affinity φ of a k-hybrid in an area of the hybrid is moreover computed beforehand, in a way that will be explained in more detail herein below, and stored in the coefficients {circumflex over (β)}.sub.m,p of the memory block 16. More particularly, having an alphabet of 5 elements (A, C, T, G, gap) for a length k, there are P different configurations k−H.sup.1, k−H.sup.2, k−H.sup.3, . . . , k−H.sup.p, . . . , k−H.sup.P for a k-hybrid. For each of these configurations k−H.sup.p, a contribution {circumflex over (β)}.sub.3′,p for the first area “3′”, a contribution {circumflex over (β)}.sub.Middle,p for the second area “Middle” and a contribution {circumflex over (β)}.sub.5′,p for the third area “5′” are computed beforehand.
[0116] A first variant of the estimation of the affinity φ according to the invention then consists:
[0117] a. for each k-hybrid configuration k−H.sup.p, in counting:
[0118] the number of times x.sub.3′,p in which this configuration appears in the set {k−H.sub.1, k−H.sub.2, . . . , k−H.sub.7}.sub.3′ of the first area “3′”;
[0119] the number of times x.sub.Middle,p in which this configuration appears in the set {k−H.sub.8, k−H.sub.9, . . . , k−H.sub.14}.sub.Middle of the second area “Middle”;
[0120] the number of times x.sub.5′,p in which this configuration appears in the set {k−H.sub.15, k−H.sub.16, . . . , k−H.sub.21}.sub.5′ of the third area “5′”;
[0121] b. then in computing the affinity φ according to the relation:
$$\varphi=\sum_{p=1}^{P}\left(\hat{\beta}_{3',p}\,x_{3',p}+\hat{\beta}_{\mathrm{Middle},p}\,x_{\mathrm{Middle},p}+\hat{\beta}_{5',p}\,x_{5',p}\right)\qquad(1)$$
[0122] As can be seen, by explicitly taking into account the structure of a hybrid, the possible mismatches are therefore explicitly taken into account since they are involved in the P different configurations k−H.sup.1, k−H.sup.2, k−H.sup.3, . . . k−H.sup.p, . . . , k−H.sup.P.
[0123] For any number M of areas of the hybrid, including an any-position model, the above equation is easily generalized to the equation:
$$\varphi=\sum_{m=1}^{M}\sum_{p=1}^{P}\hat{\beta}_{m,p}\,x_{m,p}\qquad(2)$$
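The counting per area and the resulting sum, over the areas m and the configurations p, of the products of each contribution {circumflex over (β)}.sub.m,p by the corresponding count x.sub.m,p can be sketched as follows. The rule assigning the i-th k-hybrid to a contiguous area of near-equal size, the treatment of configurations absent from the coefficient table as contributing zero, and all names and numerical values are assumptions made for illustration.

```python
from collections import Counter

def area_counts(k_hybrid_list, n_areas=3):
    """counts[m][p]: number of times configuration p occurs in area m."""
    n = len(k_hybrid_list)
    counts = [Counter() for _ in range(n_areas)]
    for i, config in enumerate(k_hybrid_list):
        m = i * n_areas // n   # assumption: contiguous areas of near-equal size
        counts[m][config] += 1
    return counts

def affinity(counts, beta):
    """Sum over areas m and configurations p of beta[m][p] * x[m][p]."""
    return sum(beta[m].get(p, 0.0) * x      # assumption: unknown configuration -> 0
               for m, area in enumerate(counts)
               for p, x in area.items())

# Toy hybrid of six k-hybrids, with configurations abbreviated to single letters.
counts = area_counts(["a", "b", "a", "a", "b", "c"], n_areas=3)
beta = [{"a": 1.0, "b": 0.5}, {"a": 2.0}, {"b": 0.25}]   # hypothetical coefficients
phi = affinity(counts, beta)
print(phi)
```

With 21 k-hybrids and three areas, this assignment rule puts k-hybrids 1 to 7 in the first area, 8 to 14 in the second and 15 to 21 in the third, as in the example of the description.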
[0124] Moreover, there is a synergy effect between the mismatches present in a hybrid. This synergy effect, also called “interaction”, is naturally taken into account in the coefficients {circumflex over (β)}.sub.m,p when the mismatches belong to a same k-hybrid. However, when the mismatches are not included together in a single k-hybrid, and are therefore separated by more than k bases, the affinity model according to the relation (2) does not make it possible to take account thereof. For example, by referring to
[0125] Advantageously, the model of the affinity described previously is complemented by a term taking into account the synergy effect between the mismatches. More particularly, for the given lengths L.sub.bp and k, there are L configurations C.sub.1, C.sub.2, . . . , C.sub.l, . . . , C.sub.L of two mismatches separated by more than k bases, and, for each of these configurations C.sub.l, a contribution {circumflex over (δ)}.sub.l to the affinity φ is computed beforehand, this contribution being stored in the memory block 16.
[0126] A second variant of the estimation of the affinity φ therefore also consists in identifying, in the hybrid, the pairs of mismatches separated by more than k bases and:
[0127] c. for each configuration C.sub.l of mismatches, in determining whether this configuration is present among the pairs identified. If such is the case, a variable y.sub.l is set equal to 1, and to 0 otherwise;
[0128] d. then in computing the affinity φ according to the relation:
$$\varphi=\sum_{m=1}^{M}\sum_{p=1}^{P}\hat{\beta}_{m,p}\,x_{m,p}+\sum_{l=1}^{L}\hat{\delta}_{l}\,y_{l}\qquad(3)$$
[0129] It will thus be noted that the defects and their precise positions in the hybrid are also taken into account for the computation of the affinity.
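The interaction term, namely the sum of the contributions {circumflex over (δ)}.sub.l of the configurations of mismatches present in the hybrid, can be sketched as below. Identifying a configuration by the pair of positions of its two mismatches, reading “separated by more than k bases” as a position difference strictly greater than k, and the coefficient values are all assumptions made for illustration.

```python
from itertools import combinations

WATSON_CRICK = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

def mismatch_positions(probe, target):
    """Positions at which the facing bases are not complementary (gaps included)."""
    return [i for i, pair in enumerate(zip(probe, target)) if pair not in WATSON_CRICK]

def distant_pairs(probe, target, k=5):
    """Pairs of mismatch positions whose distance is strictly greater than k."""
    positions = mismatch_positions(probe, target)
    return {(i, j) for i, j in combinations(positions, 2) if j - i > k}

def interaction_term(pairs, delta):
    """Sum of delta_l * y_l, with y_l = 1 when configuration l is present."""
    return sum(d for config, d in delta.items() if config in pairs)

probe = "ACGTACGTACGT"
target = "TACATGCATTCA"                      # mismatches at positions 1 and 9
pairs = distant_pairs(probe, target, k=5)
delta = {(1, 9): -0.3, (2, 10): -0.1}        # hypothetical contributions
term = interaction_term(pairs, delta)
print(pairs, term)
```

Adding this term to the sum of the k-hybrid contributions gives the affinity of the second variant.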
C) Method for Selecting DNA Chip Probes
[0130] The method for selecting probes for the DNA chip is now described in more detail in relation to
[0131] Referring to
[0132] Referring to
[0133] The module 36 then computes the affinity of each hybrid as a function of the coefficients x.sub.m,p and y.sub.l stored in the memory block 24, of the contributions {circumflex over (β)}.sub.m,p of k-hybrids and of the contributions {circumflex over (δ)}.sub.l of pairs of mismatches stored in the memory block 16, this computation being performed on the basis of the relation (3). The affinities thus computed are then stored in the memory block 28. For each candidate probe SC.sub.s generated from the target transcript there are therefore computed:
[0134] a first affinity φ.sub.1 of the candidate probe with its target transcript, forming with the latter a perfect hybrid;
[0135] second affinities φ.sub.2 of the candidate probe with the non-specific transcripts, forming with the latter hybrids that are imperfect or not.
[0136] Finally, the selection block 38 selects, as a function of the computed affinities φ.sub.1 and φ.sub.2 and of the selection parameters S.sub.1 and S.sub.2 stored in the memory block 18, at least the candidate probe or probes for which: [0137] a. the first affinity φ.sub.1 is above a first threshold S.sub.1>0; [0138] b. the second affinities φ.sub.2 are below a second threshold S.sub.2>0, strictly lower than the first threshold S.sub.1.
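The selection rule of the block 38 amounts to a simple filter, sketched below under the assumption that the affinities of each candidate probe are held in a dictionary; the data layout, names and values are hypothetical.

```python
def select_probes(affinities, s1, s2):
    """Keep the probes whose first affinity is above S1 and whose second
    affinities are all below S2, with 0 < S2 < S1.

    affinities: {probe name: (phi_1, [phi_2, ...])}  (assumed layout)
    """
    if not 0 < s2 < s1:
        raise ValueError("expected 0 < S2 < S1")
    return [name for name, (phi1, phi2s) in affinities.items()
            if phi1 > s1 and all(phi2 < s2 for phi2 in phi2s)]

affinities = {
    "p1": (0.9, [0.1, 0.2]),   # affine and specific
    "p2": (0.9, [0.1, 0.6]),   # affine but cross-reacts with one transcript
    "p3": (0.4, [0.1]),        # specific but not affine enough
}
selected = select_probes(affinities, s1=0.8, s2=0.3)
print(selected)
```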
[0139] In a variant, a single threshold S.sub.1 can be used. The first affinity φ.sub.1 must then be above or equal to the threshold S.sub.1 and the second affinities φ.sub.2 strictly below the threshold S.sub.1.
[0140] The probe or probes thus selected are those which are specific and affine with respect to the target transcript. These probes are then used for the fabrication of the DNA chip whose aim is to measure the level of expression of the target transcript.
[0141] Additional selection rules can also be implemented. Notably, in a variant, the selection block 38 also selects the probe or probes for which: [0142] a. the first affinity φ.sub.1 is above the first threshold S.sub.1; [0143] b. at most N second affinities φ.sub.2 are above the second threshold S.sub.2, N preferably being equal to 1 or 2.
[0144] The additional probes selected do not have the specific character of the first probe, and can therefore be hybridized stably with a non-specific transcript. By contrast, there are DNA chips for which the construction and the analysis of the measurements makes it possible to distinguish between a hybridization with a target transcript and a hybridization with a non-specific transcript, or cross-reaction. Similarly, a second rank probe can be retained for the fabrication of the DNA chip when it is known that the target transcript and the non-specific transcript with which it is hybridized have a low or zero probability of being present together in the biological sample that is the subject of the measurement by the DNA chip. By also using these probes in the chip, the sensitivity of the DNA chip is therefore enhanced while retaining a specific character for this chip.
[0145] According to the invention, to check the specificity of a probe, a specificity score Spec equal to the difference between the first affinity φ.sub.1 and the greatest of the second affinities φ.sub.2 is computed for each probe, that is to say a score according to the relation:
$$\mathrm{Spec}=\varphi_{1}-\max(\varphi_{2})$$
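The specificity score and the relaxed rule retaining second rank probes (at most N second affinities above the second threshold) can be sketched as follows; the function names and the numerical values are assumptions made for illustration.

```python
def specificity(phi1, phi2s):
    """Spec: first affinity minus the greatest of the second affinities."""
    return phi1 - max(phi2s)

def is_second_rank(phi1, phi2s, s1, s2, n_max=1):
    """Relaxed rule: first affinity above S1 and at most n_max second
    affinities above S2."""
    return phi1 > s1 and sum(phi2 > s2 for phi2 in phi2s) <= n_max

spec = specificity(0.9, [0.1, 0.6])
kept = is_second_rank(0.9, [0.1, 0.6], s1=0.8, s2=0.3, n_max=1)
rejected = is_second_rank(0.9, [0.5, 0.6], s1=0.8, s2=0.3, n_max=1)
print(spec, kept, rejected)
```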
D) Learning the Contributions {circumflex over (β)}.sub.m,p and {circumflex over (δ)}.sub.l
[0146]
[0147] This learning begins with the construction, in 70, of experimental learning data on the basis of which to identify the values of the coefficients {circumflex over (β)}.sub.m,p and {circumflex over (δ)}.sub.l. More particularly, the only easily accessible experimental data is “the intensity” of a DNA chip probe or of an analogous device. The experimental data therefore consists of a set {I.sub.n} of intensities of probes forming, with transcripts, hybrids which comprise the k-hybrids and the pairs of mismatches corresponding to the parameters {circumflex over (β)}.sub.m,p and {circumflex over (δ)}.sub.l sought.
[0148] However, unless particular measures are taken, the starting biological sample, the object of the measurement by a DNA chip, comprises several transcripts. Each stable hybrid between a probe and a transcript thus contributes to the intensity of the probe, without it being possible to easily separate each contribution. The first step 72 of construction of the experimental data advantageously consists in selecting probes which are known to be specific to, and affine with, only the target transcripts from which they have been designed. Notably, the step 72 consists in selecting a first set {SA.sub.PM} of learning probes derived from conventional cellular genes (or “protein coding genes”). These probes in effect exhibit little or no cross-reaction. This means most particularly that the intensity of such a probe corresponds substantially to the intensity of the probe with its target transcript, with which it forms a perfect hybrid.
[0149] In a next step 74, a second set {SA.sub.MM} of learning probes is designed from the first set {SA.sub.PM} by modifying one or two bases of the probes thereof. Because of the very great specificity of a probe of the first set with its target transcript, the inventors have noted that degenerating such a probe, by changing one or two of its bases, leads also to a probe which is very specific with the target transcript. Thus, the intensity of a degenerated probe also corresponds substantially to the intensity of the hybrid that it forms with the target transcript, hybrid which therefore exhibits one or two mismatches. Moreover, as described below, a filtering is implemented to eliminate any cross-reactions which could occur following the degeneration of the probes of the first set {SA.sub.PM}. The first set {SA.sub.PM} and the second set {SA.sub.MM} are therefore selected for them both to comprise the P possible configurations of k-hybrids and the L configurations of pairs of mismatches. Preferably, for the robustness of the identification of the coefficients β.sub.m,p and {circumflex over (δ)}.sub.l, these sets are chosen to include each of these configurations several times, and preferably at least 20 times.
[0150] Once the learning probes {SA.sub.PM} and {SA.sub.MM} are selected, DNA chips are constructed, in 76, from the latter, then the chips are used, in 78, to measure the level of expression of the target transcripts from which the probes {SA.sub.PM} were designed. A set {I}′ of probe intensities is therefore obtained. Optionally, a filtering is implemented, in 80, to eliminate the intensities originating from the cross-reactions. Such a filtering is for example described in the document “Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection”, by Li et al., Proceedings of the National Academy of Sciences, vol. 98(1):31-36, January 2001. A set {I} of intensities I is then retained. Each intensity I retained therefore has as its single cause a single hybrid, namely that formed from a known probe and a known transcript.
[0151] The method for identifying the coefficients β.sub.m,p and {circumflex over (δ)}.sub.l then continues with the computation thereof as a function of the intensities {I} in a step 82.
[0152] More particularly, by using the standard notations in the DNA chip field, because of the nature of the probes, and possibly of the filtering of the cross-reactions applied, the intensity I.sub.ij of a probe “j” can be modeled according to the relation:
$$I_{ij}=\theta_{i}\times\varphi_{j}\qquad(4)$$
[0153] in which θ.sub.i is the quantity of RNA obtained by amplification of the transcript i targeted by the probe “j” and φ.sub.j is the affinity between the jth probe and its target transcript.
[0154] By combining the relations (3) and (4), the intensity I.sub.ij of a probe is therefore rewritten formally:
$$I_{ij}=\theta_{i}\times\left(\sum_{m=1}^{M}\sum_{p=1}^{P}\hat{\beta}_{m,p}\,x_{m,p}+\sum_{l=1}^{L}\hat{\delta}_{l}\,y_{l}\right)\qquad(5)$$
[0155] in which x.sub.m,p and y.sub.l correspond here to the modeling, as k-hybrids and pairs of mismatches, of the hybrid associated with the intensity I.sub.ij, and {circumflex over (β)}.sub.m,p and {circumflex over (δ)}.sub.l are the coefficients to be identified.
[0156] By adopting a matrix expression, it is shown that the relation (5) is rewritten:
$$I_{ij}=\theta_{i}\cdot\left(X\cdot\hat{B}+Y\cdot\hat{\Delta}\right)_{j}\qquad(6)$$
expression in which:
$$X^{T}=(X_{1}\;\dots\;X_{m}\;\dots\;X_{M})^{T}\in\mathbb{R}^{P\cdot M}\qquad(7)$$
$$\forall m\in[1,M],\quad X_{m}^{T}=(x_{m,1}\;\dots\;x_{m,p}\;\dots\;x_{m,P})^{T}\in\mathbb{R}^{P}\qquad(8)$$
$$\hat{B}=(B_{1}\;\dots\;B_{m}\;\dots\;B_{M})^{T}\in\mathbb{R}^{P\cdot M}\qquad(9)$$
$$\forall m\in[1,M],\quad B_{m}=(\beta_{m,1}\;\dots\;\beta_{m,p}\;\dots\;\beta_{m,P})^{T}\in\mathbb{R}^{P}\qquad(10)$$
$$Y^{T}=(y_{1}\;\dots\;y_{l}\;\dots\;y_{L})^{T}\in\mathbb{R}^{L}\qquad(11)$$
$$\hat{\Delta}=(\delta_{1}\;\dots\;\delta_{l}\;\dots\;\delta_{L})^{T}\in\mathbb{R}^{L}\qquad(12)$$
[0157] in which T is the symbol of the transpose and the notation “V∈ℝ.sup.a” designates a real column vector of ℝ.sup.a, that is to say a column vector comprising a real components.
[0158] Note that the right hand term of the relation (6) is nonlinear since it is equal to a product. By contrast, note that the term X.Math.{circumflex over (B)}+Y.Math.{circumflex over (Δ)} is linear in the terms {circumflex over (B)} and {circumflex over (Δ)} and that the matrices X and Y are known since the hybrid corresponding to the intensity I.sub.ij is known.
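The structure of the model (an affinity that is linear in the coefficients, multiplied by a quantity of RNA) can be made concrete with the short sketch below. The flattening of the counts area by area, the function names and all numerical values are assumptions chosen for illustration.

```python
def dot(u, v):
    """Scalar product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def predicted_intensity(theta, x_row, b, y_row, delta):
    """theta * (X.B + Y.Delta): linear in (B, Delta) for a fixed theta, but a
    product of unknowns, hence nonlinear, when theta must also be identified."""
    return theta * (dot(x_row, b) + dot(y_row, delta))

# Toy case: M = 2 areas, P = 2 configurations, L = 1 pair of mismatches.
x_row = [1, 0, 0, 2]         # (x_{1,1}, x_{1,2}, x_{2,1}, x_{2,2}), area by area
b = [0.5, 0.1, 0.2, 0.3]     # (beta_{1,1}, beta_{1,2}, beta_{2,1}, beta_{2,2})
y_row = [1]                  # the single pair of mismatches is present
delta = [-0.4]
m_n = predicted_intensity(10.0, x_row, b, y_row, delta)
print(m_n)
```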
[0159] In a variant of the invention, the quantities of RNA are monitored and known a priori, such that the relation (6) becomes linear. The term {circumflex over (Θ)} of the optimization problem described below is therefore also set and known such that the problem is convex and can therefore be solved more simply. However, monitoring the quantity of RNA is a complex and costly technique. According to a variant described below, a conventional DNA chip measurement technique is implemented, technique that does not make it possible to know a priori the quantities of RNA. These quantities are therefore also identified.
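When the quantities of RNA are known, the identification reduces to an ordinary linear least-squares problem. The sketch below is one possible way to solve it, assuming the Euclidean distance, a design whose columns are linearly independent, and a plain normal-equation solver written without external libraries; it is not presented as the solver used by the inventors, and all names and values are hypothetical.

```python
def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: coefficients not identifiable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_known_theta(intensities, thetas, design):
    """Minimize sum_n (I_n - theta_n * design[n].c)^2 over the coefficients c.
    design[n] is the concatenated row (X_n, Y_n) of the n-th hybrid."""
    rows = [[t * v for v in row] for t, row in zip(thetas, design)]
    cols = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(cols)] for i in range(cols)]
    ati = [sum(r[i] * y for r, y in zip(rows, intensities)) for i in range(cols)]
    return solve(ata, ati)

# Noise-free toy data generated with coefficients (0.5, 0.2, -0.3).
design = [[1, 0, 0], [0, 1, 0], [1, 1, 1], [2, 1, 0]]
thetas = [10.0, 20.0, 10.0, 5.0]
intensities = [5.0, 4.0, 4.0, 6.0]
coeffs = fit_known_theta(intensities, thetas, design)
print(coeffs)
```

The first entries of the result play the role of the contributions of the k-hybrids and the last ones that of the contributions of the pairs of mismatches.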
[0160] For the record, in the conventional DNA chips, a transcript is targeted by several probes, each forming a perfect hybrid with the transcript. Furthermore, cross-reactions can also take place. This explains why the transcripts and the probes are not usually referenced with the same indices, as is described in the relations (4)-(6). However, because of the nature of the learning probes and of the filtering of the cross-reactions, the intensity amounts to, or is assumed as such, the hybrid formed by a probe and its target transcript such that the notation can be reduced without risk of confusion to a single index “n”, a notation which will herein below be employed in order to lighten the relations.
[0161] As is conventional in the field of identification, the computation step 82 comprises, in 84, the separation into two sets of the set of intensities {I}, namely into a first learning set {I.sub.n} and into a second validation set {I.sub.q}. The way in which the experimental data are subdivided, the size of each of these sets and the validation methods are known per se and will not therefore be detailed. For example, the set {I.sub.n} comprises ⅔ of the set {I} and the set {I.sub.q} the other ⅓, or the validation is implemented according to the “10-fold cross-validation” technique. It will be assumed that the learning set {I.sub.n} comprises N intensities, indexed by convention by the integer n∈[1,N]. According to the same convention, the set {SA.sub.n} of the learning probes and the set of the quantities of RNA {θ.sub.n} associated with the learning set {I.sub.n} are likewise indexed by the integer n.
[0162] The computation step 82 also comprises a step 86 of modeling of each of the hybrids associated with the intensities I retained, the modeling being identical to that described previously and producing a row matrix X of ℝ.sup.P.M and a row matrix Y of ℝ.sup.L as described in relation to (7)-(12). In particular, for each intensity I.sub.n of the learning set {I.sub.n}, a matrix X.sub.n and a matrix Y.sub.n are obtained.
[0163] In a subsequent step 88, an identification algorithm is implemented to minimize a distance D between the vector of the learning intensities I=(I.sub.1 . . . I.sub.n . . . I.sub.N).sup.T∈ℝ.sup.N and the intensities predicted by the model M=(M.sub.1 . . . M.sub.n . . . M.sub.N).sup.T∈ℝ.sup.N, namely the solving of the optimization problem:
[0164] The problem of optimization of the relations (14)-(15) is conventional. Any distance D, also called “cost function”, is appropriate, for example the Euclidean norm. Similarly, any type of estimator is appropriate, for example an estimator by nonlinear regression. As can be noted, the problem of the relations (14)-(15) is not convex and may therefore have several solutions. In a variant, the algorithm seeks several thereof, the one finally retained being for example that exhibiting the lowest estimation error upon the validation with the validation set {I.sub.q} or that minimizing a criterion of AIC (“Akaike Information Criterion”) or BIC (“Bayesian Information Criterion”) type.
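The relations (14)-(15) are not legible in the present text. Purely as a reading aid, and assuming that the predicted intensity follows the model I.sub.n=θ.sub.n×φ.sub.n with φ.sub.n=X.sub.n.Math.B+Y.sub.n.Math.Δ used elsewhere in this description, the problem can be sketched as follows. The exact form and numbering of the original relations are not confirmed here, and the Euclidean cost on the last line is only one of the admissible choices for D.

```latex
\begin{aligned}
(\hat{B},\hat{\Delta},\hat{\Theta}) &= \arg\min_{B,\Delta,\Theta} \; D(I, M), \\
M_n &= \theta_n \times \left( X_n \cdot B + Y_n \cdot \Delta \right), \qquad n \in [1, N], \\
D(I, M) &= \sum_{n=1}^{N} \left( I_n - M_n \right)^2 \quad \text{(Euclidean choice)}.
\end{aligned}
```

The product of θ.sub.n with the affinity term is what makes the problem non-convex when all three vectors are unknown; it becomes a linear least-squares problem as soon as either Θ or the pair (B, Δ) is held fixed.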
[0165] In a preferred variant, the search space is restricted by adding the constraint:
in which I is the number of different RNAs deposited on the chip, with a for example equal to I.
[0166] The inventors have noted that the problem of optimization of the relations (14), (15) and (16) has a single solution and, in light of the tests carried out, it is probable that this solution is the global optimum, or at the very least a local optimum close to the global optimum.
[0167] According to a preferred variant, an iterative solving of the problem of the relations (14), (15) and (16) is implemented: [0168] by setting, on the iteration i, the vectors B, Δ to their values calculated on the preceding iteration i−1 and by solving the optimization problem according to the relations:
[0170] Each of these problems is convex and therefore easily solved. The first iteration is for example performed by setting the affinity of each probe to 1, that is to say ∀n∈[1,N], X.sub.n.Math.B(1)+Y.sub.n.Math.Δ(1)=1 and therefore by computing a first initial value {circumflex over (Θ)}(1) of the vector {circumflex over (Θ)}. In a variant, the first iteration is performed by setting
and by computing first values B(1) and Δ(1) for the vectors {circumflex over (B)} and {circumflex over (Δ)}. The iterative solving of the problem is then stopped when the distance D no longer changes, or changes insignificantly, as is known per se.
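By way of non-limiting illustration, the alternating scheme described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function name `fit_alternating`, the stacking of X.sub.n and Y.sub.n into a single row `Z[n]`, the `groups` array mapping each probe to the RNA whose quantity it shares, the Euclidean cost, and the normalization of the sum of squared quantities to the number of RNAs are all choices made for the example.

```python
import numpy as np

def fit_alternating(I, Z, groups, n_iter=50, tol=1e-10):
    """Alternating least squares for I[n] ~ theta[groups[n]] * (Z[n] @ w).

    Z[n] is assumed to be the concatenation of the rows X_n and Y_n, and
    w the concatenation of B and Delta (hypothetical layout).
    """
    I = np.asarray(I, dtype=float)
    Z = np.asarray(Z, dtype=float)
    groups = np.asarray(groups)
    G = int(groups.max()) + 1
    phi = np.ones(len(I))            # first iteration: every affinity set to 1
    costs = []
    for _ in range(n_iter):
        # Step 1: affinities fixed, closed-form least squares for each quantity.
        num = np.bincount(groups, weights=I * phi, minlength=G)
        den = np.bincount(groups, weights=phi * phi, minlength=G)
        theta = num / np.maximum(den, 1e-12)
        # Step 2: quantities fixed, linear least squares for the contributions.
        A = theta[groups][:, None] * Z
        w, *_ = np.linalg.lstsq(A, I, rcond=None)
        # Remove the scale ambiguity of the product (assumed normalization).
        scale = np.sqrt(np.sum(theta ** 2) / G)
        if scale > 0:
            theta, w = theta / scale, w * scale
        phi = Z @ w
        costs.append(float(np.sum((I - theta[groups] * phi) ** 2)))
        # Stop when the distance no longer changes significantly.
        if len(costs) > 1 and costs[-2] - costs[-1] <= tol * max(costs[-2], 1.0):
            break
    return theta, w, costs
```

Each of the two steps is an exact minimization over one block of unknowns, so the recorded cost cannot increase from one iteration to the next; this is the property on which the stopping rule relies.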
[0171] Advantageously, the problem of optimization of the relations (20)-(21) is solved by implementing a LASSO shrinkage optimization (“Lasso shrinkage method”) which consists in adding the constraint according to the relation:
∥B∥.sub.1+∥Δ∥.sub.1≦λ (22)
in which ∥•∥.sub.1 is the L.sub.1 norm and λ is a parameter determined by cross-validation, in a manner known per se. This approach makes it possible to reduce the variance of the estimator.
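By way of non-limiting illustration, an L.sub.1-constrained least-squares step can be sketched as follows. The sketch solves the penalized form of the problem, which is equivalent to the constrained form of the relation (22) for a matching pair of parameters; the solver (proximal gradient, or ISTA), the names `lasso_ista` and `alpha`, and the fixed iteration count are assumptions made for the example. In the present context the matrix `A` would hold the rows θ.sub.n×(X.sub.n, Y.sub.n) and `w` the stacked vectors B and Δ; the selection of the parameter by cross-validation is not shown.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (shrinks every entry toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, alpha, n_iter=2000):
    """Minimize 0.5 * ||A @ w - y||^2 + alpha * ||w||_1 by proximal gradient."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step size 1/L, where L is the squared spectral norm of A.
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)
        w = soft_threshold(w - step * grad, step * alpha)
    return w
```

A larger `alpha` corresponds to a smaller bound λ in the relation (22) and drives more contributions exactly to zero, which is how the shrinkage reduces the variance of the estimator.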
[0172] At the end of the step 88, there are therefore obtained a vector {circumflex over (B)} and a vector {circumflex over (Δ)}, that is to say values {circumflex over (β)}.sub.m,p and {circumflex over (δ)}.sub.l quantifying the contribution of the k-hybrids and of the pairs of mismatches to the affinity φ.
[0173] The method then ends, in 90, with the validation of the computed coefficients in order to judge the quality thereof. In particular, the preceding computation step 88 is implemented on the set {I.sub.q} of the validation intensities, which makes it possible to identify the corresponding quantities of RNA {θ.sub.q}. Each intensity I.sub.q of the set {I.sub.q} is then estimated by using the contributions to the affinity computed on the learning intensities {I.sub.n}. The intensity I.sub.q is thus estimated according to the relation:
Î.sub.q={circumflex over (θ)}.sub.q×(X.sub.q.Math.{circumflex over (B)}+Y.sub.q.Math.{circumflex over (Δ)})
in which Î.sub.q is the estimation of the intensity I.sub.q, and X.sub.q and Y.sub.q are the model of the hybrid associated with the intensity I.sub.q. A step of validation by affinity comparison can also be implemented, as described below in relation to
[0174] Standard statistical analyses are then implemented on the estimation error I.sub.q−Î.sub.q in a manner that is known per se.
E) Preferred Parameterizations of the Affinity and of the Selection of the Probes
[0175] Logically, the affinity model according to the invention gains in accuracy as the length k and/or the number M of areas increase. However, increasing these parameters poses a number of problems, including the need for increasingly significant computer resources because of the increase in the number of parameters of the model, and the need to design a set of learning probes that contains several copies of long k-hybrids, a design process which is lengthy and costly.
[0176] The inventors carried out tests on the influence of the parameters k and M on the accuracy of the affinity model. Referring to
F) Results
F.1) Hardware and Construction of the Data
[0179] The four examples presented below are based on two DNA chips developed by the applicant. The probes have a length equal to 25 nitrogenous bases.
[0180] The first chip, called chip “V2”, comprises a first “HERV” compartment developed to measure the HERV transcriptome. This compartment contains 6 multicopy retroviral families corresponding to a little less than 6000 HERV transcripts and is described in the document by Perot et al. “Microarray-based sketches of the HERV transcriptome landscape”, PLoS One, 2012; 7(6): e40194, June 2012.
[0181] In a second “genes” compartment, in the same format as the preceding one, 513 probe sets are introduced that originate from the DNA chip from the company Affymetrix marketed under the reference “HG_U133_Plus2”. The chip HG_U133_Plus2 targets conventional cellular genes and is described in the technical documentation “Design and Performance of the GeneChip® Human Genome U133_Plus 2.0 and Human Genome U133A 2.0 Array” accessible on the website of the company Affymetrix.
[0182] A third “learning set” compartment is, for its part, designed in order to learn the influence of the mismatches causing cross-reactions between HERV transcripts of a same family. The learning set stems from 20 probe sets of the HG_U133_Plus2 chip, intended by definition to form perfect hybrids with the transcripts that they target. For each probe of these 20 probe sets, 185 degenerated probes, the sequence of which varies by one or two mismatches with the probe, and does so at different positions, have been designed. The learning set therefore contains a set of 37 200 probes.
[0183] The chip V2 is therefore a tool for learning affinity prediction models (second compartment) and a tool for validating models learned on a known DNA chip (first compartment).
[0184] The second DNA chip, called “V3”, is a DNA chip designed according to the methodology presented above, namely on the basis of the affinity model of the relation (3) and the probe selection method described in relation to
[0185] The second chip contains approximately 400 000 HERV/MalR elements, organized into several tens of families. The chip V3 is made up of several compartments (probe set) that differ from one another either by the particular elements of the human genome that they target, or by the method of designing the probes that they contain.
[0186] The chip V3 notably comprises three compartments “HERV-MalR”, “U133_HTA” and “OPTI” which correspond to two types of elements of the human genome and two distinct probe design methods: [0187] the compartments U133_HTA and OPTI target the same 1560 genes, whereas the compartment HERV-MalR targets approximately 400 000 different HERV and MaLR elements of the genes targeted by the compartments U133_HTA and OPTI; [0188] the probes of the compartments HERV-MalR and OPTI are designed according to the methodology presented above, whereas the probes of the compartment U133_HTA originate from two Affymetrix DNA chips, namely the “HG_U133_Plus2” chip (herein below “U133”) and the chip marketed under the reference “HTA” respectively, and are therefore designed according to the methodology specific to the company Affymetrix. The compartment U133_HTA is therefore in reality two distinct probe sets originating from two Affymetrix chips targeting the same 1560 genes.
[0189] More particularly, for the design of the compartments HERV-MalR and OPTI, the length k of the k-hybrids is chosen to be equal to 5 and the number of areas M is chosen to be equal to 3. Only the probes for which the first affinity φ.sub.1 is above or equal to the threshold S.sub.1 and for which the second affinities φ.sub.2 are strictly below the threshold S.sub.1 are retained. The threshold S.sub.1 is chosen to be equal to 4.4.
[0190] The compartment HERV-MalR of the chip V3, the largest, therefore constitutes an embodiment of the present invention. The other two compartments of the chip V3 (OPTI and U133_HTA) for their part allow for a comparison of the invention with probe sets designed according to the prior art methods. Each of these compartments therefore contains probes forming perfect hybrids with their target transcripts.
F.2.) Accuracy and Choice of the Affinity Prediction Model
[0191] The validation of an affinity prediction model according to the invention relies on the protocol illustrated in
[0192] The production of the measured intensities comprises a conventional step of production of a solution 100 from targeted transcripts known through a DNA chip 102 for which the probes are known, the deposition of the solution on the chip, washing and measurement of the intensities {I.sub.n} of the probes of the chip. Usually, the solution deposited on the DNA chip is homogeneous, such that the quantity of RNA of a transcript is identical for each of the wells of the chip. A filtering 104 of the intensities produced is then implemented to eliminate the intensities resulting from, or assumed to result from, cross-reactions, or else to correct the intensities as a function of the cross-reactions, in order to obtain probe intensities {I.sub.n} each corresponding to the hybrid formed by the probe with its target transcript, and therefore each modelable according to the relation I.sub.n=θ.sub.n×φ.sub.n, as is described above.
[0193] The “validation by affinities” branch, for its part, consists in: [0194] predicting (in 106) the affinity φ.sub.n of each of the probes associated with the intensities {I.sub.n} by using a model according to the invention φ.sub.n=X.sub.n.Math.{circumflex over (B)}+Y.sub.n.Math.{circumflex over (Δ)}; [0195] estimating (in 108) an affinity value {circumflex over (φ)}.sub.n for each of the probes as a function of the intensities {I.sub.n}. This computation is the one described in the article by Li and Wong “Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection”, Proceedings of the National Academy of Sciences, 98(1): 31-36, 2001. This computation consists in particular in minimizing a cost function dependent on the differences (I.sub.n−θ.sub.n×φ.sub.n) subject to the constraint Σθ.sub.n.sup.2=N, the solution of this optimization problem being the affinity values {circumflex over (φ)}.sub.n and the predictions {circumflex over (θ)}.sub.n of the quantities of RNA θ.sub.n; and [0196] comparing (in 110) the values φ.sub.n and {circumflex over (φ)}.sub.n.
[0197] The “validation by intensities” branch, for its part, consists in: [0198] predicting (in 112) the affinity φ.sub.n of each of the probes associated with the intensities {I.sub.n} by using a model according to the invention φ.sub.n=X.sub.n.Math.{circumflex over (B)}+Y.sub.n.Math.{circumflex over (Δ)}; [0199] dividing (in 114) the set of the intensities of the probes {I.sub.n} into two subsets {I.sub.n}.sub.1 and {I.sub.n}.sub.2 within each probe set, and, correspondingly, dividing the set of the predicted affinities {φ.sub.n} into two subsets {φ.sub.n}.sub.1 and {φ.sub.n}.sub.2, in ⅔ and ⅓ proportions; [0200] predicting (in 116) the quantities of RNA θ.sub.n as a function of the sets {I.sub.n}.sub.1 and {φ.sub.n}.sub.1. In particular, this prediction consists of a linear regression between the set {I.sub.n}.sub.1 and the set {θ.sub.n×φ.sub.n}, since the values of φ.sub.n are already computed. A predicted value {circumflex over (θ)}.sub.n is thus obtained for each quantity of RNA θ.sub.n poured into the wells of the DNA chip; [0201] predicting (in 118) the intensities of the subset {I.sub.n}.sub.2 according to the relation Î.sub.n={circumflex over (θ)}.sub.n×φ.sub.n; [0202] comparing the intensities of the subset {I.sub.n}.sub.2 with their corresponding predictions {Î.sub.n}.sub.2.
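By way of non-limiting illustration, the steps of the “validation by intensities” branch can be sketched for a single probe set as follows. The function name `validate_by_intensity`, the random ⅔-⅓ split, the regression through the origin (no intercept) and the use of the Pearson correlation for the comparison are assumptions made for the example; the affinities `phi` are taken as already predicted by the model.

```python
import numpy as np

def validate_by_intensity(I, phi, seed=0):
    """Fit the RNA quantity on 2/3 of the probes, predict the remaining 1/3."""
    I = np.asarray(I, dtype=float)
    phi = np.asarray(phi, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(I))
    n_fit = (2 * len(I)) // 3
    fit, held_out = idx[:n_fit], idx[n_fit:]
    # Least-squares slope of I on phi through the origin, since I = theta * phi.
    theta_hat = float(I[fit] @ phi[fit] / (phi[fit] @ phi[fit]))
    # Predicted intensities of the held-out subset.
    I_pred = theta_hat * phi[held_out]
    # Probe-by-probe comparison of observed and predicted intensities.
    r = float(np.corrcoef(I[held_out], I_pred)[0, 1])
    return theta_hat, I_pred, I[held_out], r
```

Because the quantity of RNA is common to the probes of a probe set, the sketch is meant to be applied probe set by probe set, each call yielding one estimated quantity.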
[0203] Thus, the performance levels of the model are evaluated (i) at the affinities level, by correlating the affinities predicted by the model with the affinities estimated by the model of Li & Wong (2001) and (ii) at the intensity level, by correlating those predicted by the model with the observed intensities, these comparisons being performed probe-by-probe. In the first case, the correlations are computed within each probe set of the DNA chip because of the constraint Σθ.sub.n.sup.2=N imposed by the Li & Wong model. In other words, instead of correlating the affinities predicted by the model with those of Li & Wong globally over the set of the probes, the computation of the correlations is made probe-by-probe for each probe set.
[0204] The aim of the present example is to illustrate the accuracy of our affinity model according to the relation (3), that is to say its ability to finely predict the affinity of the probes. In this example, a validation by affinity is implemented.
[0205] Nine affinity prediction models according to the relation φ=Σ.sub.m=1.sup.MΣ.sub.p=1.sup.P x.sub.m,p.Math.{circumflex over (β)}.sub.m,p+Σ.sub.l=1.sup.L y.sub.l.Math.{circumflex over (δ)}.sub.l are tested. The two variables evaluated are the size of the k-hybrids (k varying from 3 to 5) and the inclusion of the spatial information according to three different scenarios: a probe is divided into 1, 3 or 25−k divisions (the last case is called “any position”). Each model is therefore associated with its own structure of the matrices X, Y, {circumflex over (B)} and {circumflex over (Δ)} and with its own values of the matrices {circumflex over (B)} and {circumflex over (Δ)}. The learning of the models is performed in the way described in relation to the steps 82 to 88 of
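By way of non-limiting illustration, the construction of the row X for a perfect-match hybrid, for a given length k and a given number M of divisions, can be sketched as follows. The sketch assumes that, for a perfect hybrid, each k-hybrid is fully determined by the corresponding k-mer of the probe, that P equals 4.sup.k, and that a k-hybrid is assigned to a division according to its start position; the function name `khybrid_counts` and these conventions are hypothetical and are not taken from the description.

```python
from itertools import product

import numpy as np

def khybrid_counts(probe, k=3, M=3, alphabet="ACGT"):
    """Count every k-mer of the probe in each of M divisions.

    Returns a flat vector of length M * 4**k, division-major.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    column = {kmer: j for j, kmer in enumerate(kmers)}
    n_starts = len(probe) - k + 1
    X = np.zeros((M, len(kmers)), dtype=int)
    for s in range(n_starts):
        m = s * M // n_starts          # division holding the start position
        X[m, column[probe[s:s + k]]] += 1
    return X.reshape(-1)
```

With a probe of 25 bases, the nine models discussed above correspond to k in {3, 4, 5} combined with the three division scenarios, the length of the resulting vector growing as M×4.sup.k.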
[0206] For the validation of the nine models according to the intensities, the probes used are those of the probe set “CD59” of the “genes” compartment of the chip V2, hybridized with six cell lines (RWPE1 and five lines derived therefrom). These cell lines are homogeneous populations of cells originating from human samples (prostate epithelial cells) which have been transformed to augment their longevity. The protocols for hybridization (amplification, fragmentation, labeling, hybridization on the chip) and for bioinformatics processing of the measurements derived therefrom are described in the document by Perot et al. “Microarray-based sketches of the HERV transcriptome landscape”. In particular, the raw intensities measured on the chips follow three bioinformatics preprocessing steps usually followed in this type of analysis and detailed in the document by Irizarry et al. “Exploration, normalization, and summaries of high density oligonucleotide array probe level data”, Biostatistics, 4(2): 249-64, April 2003. These three steps are the correction of the background noise, the inter-chip normalization, and the summarizing which provides, for each probe set, an estimation of the quantity of hybridized RNA from the intensities of the probes which make up this probe set. This last step is performed by considering that the intensity of each probe is the sum of a target-probe affinity effect specific to the probe and of an RNA quantity effect common to all the probes of a subset. Each of these effects is estimated robustly using the so-called “median polish” method (see Irizarry et al. 2003).
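By way of non-limiting illustration, the “median polish” summarization can be sketched as follows. The sketch fits an additive model to a table whose rows are assumed to be the chips and whose columns are assumed to be the probes of one probe set, the intensities being assumed to be on a logarithmic scale so that the two effects add; the function name `median_polish` and the stopping rule are choices made for the example.

```python
import numpy as np

def median_polish(table, n_iter=10, tol=1e-8):
    """Fit table[i, j] = overall + row[i] + col[j] + resid[i, j] with medians."""
    resid = np.array(table, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])   # chip effects (RNA quantity)
    col = np.zeros(resid.shape[1])   # probe effects (affinity)
    for _ in range(n_iter):
        # Sweep the row medians out of the residuals.
        rmed = np.median(resid, axis=1)
        resid -= rmed[:, None]
        row += rmed
        delta = np.median(col)
        col -= delta
        overall += delta
        # Sweep the column medians out of the residuals.
        cmed = np.median(resid, axis=0)
        resid -= cmed[None, :]
        col += cmed
        delta = np.median(row)
        row -= delta
        overall += delta
        if np.abs(rmed).max() < tol and np.abs(cmed).max() < tol:
            break
    return overall, row, col, resid
```

The sum of `overall` and `row[i]` then serves as the summarized expression value of the probe set on chip i, the use of medians making the estimate robust to outlying probes.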
[0207] Sixteen tests were carried out, corresponding to a measurement on the probe sets CD59 of 16 chips V2 in order to demonstrate the accuracy of the models even faced with a strong variability of the measurements, notably because of the quantity of RNA deposited on the chips V2 which is not accurately controlled. The result of these tests is illustrated in
F.3.) Validity of the Affinity Prediction Model on Another Platform
[0208] The aim of the present example is to illustrate the performance levels of the affinity prediction model according to the invention on the 513 probe sets of the “genes” compartment of the chip V2 and to demonstrate the validity of the model for another DNA chip format, namely on the 3120 probe sets of the “U133_HTA” compartment of the chip V3. In effect, while the wells of the chip V2, of dimensions equal to those of the chip HERV-V2, measure 11 μm of side, those of the chip V3 measure only 5 μm.
[0209] To this end, a validation by affinities and a validation by intensity are implemented on an affinity prediction model characterized by a length of the k-hybrids equal to 5 (k=5) and by a division of the probes into 3 areas (M=3). As described previously, the inventors noted the good performance of this model, even with a reduced length k and a reduced number M of areas. The matrices {circumflex over (B)} and {circumflex over (Δ)} of this model are learned on the learning set of the first DNA chip.
[0210] The biological samples used in this example are four different cell lines of the applicant hybridized simultaneously in triplicate on 12 chips V2 and 12 chips V3 (4 lines×3 replicates=12 chips). The hybridization and bioinformatics processing protocols used in this example are those described in the article by Perot et al. “Microarray-based sketches of the HERV transcriptome landscape”.
[0211]
[0212]
F.4.) Validity of the DNA Chip Design Method and Measurement Accuracy
[0213] A DNA chip can be seen as a measurement instrument whose aim is to maximize the biological variability and minimize the technical variability introduced by the tool. The technical variability, or error, is commonly decomposed as the resultant of a systematic error (or “bias”) and a random error.
[0214] The present example studies the technical variability of a DNA chip obtained according to the design method according to the invention. The objective of the results presented in this example is to demonstrate that the probes designed with the probe selection methodology according to the invention, described in relation to
[0215] The technical variability is studied using two criteria put forward by the “MicroArray Quality Control” (or “MAQC”) consortium to judge the quality of a DNA chip: the repeatability (i.e. the variation of a measurement when it is repeated by an operator in the same conditions; this variation reflects the random error) and the monotonic titration (a quantity close to the sensitivity of a DNA chip that makes it possible to measure the consistency between the intensities measured on a chip and the hybridized RNA concentrations). These criteria are assessed hereinbelow.
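By way of non-limiting illustration, repeatability of this kind is commonly summarized by the coefficient of variation between replicate measurements, which can be sketched as follows. The function name `coefficient_of_variation`, the layout of the input (one row per probe or probe set, one column per replicate) and the use of the sample standard deviation are assumptions made for the example.

```python
import numpy as np

def coefficient_of_variation(replicates):
    """Standard deviation divided by mean, computed across the replicates."""
    x = np.asarray(replicates, dtype=float)
    mean = x.mean(axis=1)
    sd = x.std(axis=1, ddof=1)       # sample standard deviation
    return sd / mean
```

A lower coefficient indicates a smaller random error relative to the signal; the values are meaningful only for strictly positive mean intensities.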
[0216] The samples used for this assessment are those used by the MAQC consortium, as described in the document “The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements”, Nature Biotechnology, 24(9):1151-61, September 2006.
[0217] These samples originate from a sample of brain RNA (A) and a sample of reference RNA known as “universal human reference RNA” (B), the latter corresponding to a mixture of 10 cell lines. These two samples are mixed in proportions 3:1 (C=0.75×A+0.25×B) and 1:3 (D=0.25×A+0.75×B) to generate two additional samples C and D. Each of these samples is hybridized in triplicate on the chip V3. The hybridization and bioinformatics processing protocols used in this example are described in the article by Perot et al. (“Microarray-based sketches of the HERV transcriptome landscape”).
[0218] F.4.1.) Study of the Repeatability
[0219] In order to establish the relevance of the comparison between a DNA chip designed according to the methodology according to the invention and an Affymetrix chip, a study is first of all conducted to ensure that no confounding effect skews this comparison.
[0220] The results of this study are represented in
[0221]
[0222]
[0223] In reading these figures, it can be seen that the distributions of the intensities of the MAQC samples and those of the number of probes per probe set show that there is a great uniformity between the three compartments of the chip V3, making it possible to stratify the results by intensity and by probe set size. The measurement usually used to measure the repeatability is the coefficient of variation between the replicas; this computation is performed at the probe level (
[0224] F.4.2) Monotonic Titration
[0225]
[0226] Thus, if, for a probe set i, we have the relation A_i>B_i then A_i>C_i>D_i>B_i. When the probe set percentage observing this hierarchy is represented as a function of their ratio A/B and B/A, the expected form of a graph representing the monotonic titration, as represented in
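By way of non-limiting illustration, the proportion of probe sets observing the expected hierarchy can be computed as sketched below. The function name `monotonic_titration_fraction` is hypothetical, the four inputs are assumed to be the summarized intensities of the same probe sets in the samples A, B, C and D, the symmetric ordering is assumed when A_i<B_i, and probe sets with A_i=B_i are counted as not observing the hierarchy.

```python
import numpy as np

def monotonic_titration_fraction(A, B, C, D):
    """Fraction of probe sets with A>C>D>B when A>B, or A<C<D<B when A<B."""
    A, B, C, D = (np.asarray(v, dtype=float) for v in (A, B, C, D))
    descending = (A > C) & (C > D) & (D > B)   # expected when A > B
    ascending = (A < C) & (C < D) & (D < B)    # expected when A < B
    return float(np.mean(np.where(A > B, descending, ascending)))
```

Applying the function to subsets of probe sets binned by their ratio between A and B gives the points of the titration curve discussed above.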
[0227] In the same way as in the repeatability study, the three compartments are compared at the probe and probe set levels, by correcting the effect linked to the size of the probe sets in the second case. At the probe level, the compartment OPTI gives better performance levels than the other two compartments (
[0228] F.4.3) Differentially Expressed Genes
[0229] Finally,
[0230] The aim of this example is to show that the 100 genes having the strongest expression differential between the two samples A and B of the MAQC are comparable in the compartments OPTI and U133_HTA (U133 and HTA) of the chip V3. The differentially expressed genes are identified using the SAM method described in the document by Tusher V G, Tibshirani R, Chu G, “Significance analysis of microarrays applied to the ionizing radiation response”, Proceedings of the National Academy of Sciences of the USA, 98(9):5116-21, 24 April 2001; then, for each of the three compartments of the chip V3, the 100 genes with the lowest p-value are retained. The intersections between these three compartments are represented in the Venn diagram of
F.5) Specificity of the Measurements
[0231] The aim of the present example is to demonstrate that the hybridization model according to the invention serves not only to compute the target-probe affinity, but that it can also be used to measure the specificity of the probes. The objective of the compartment HERV-MalR of the chip V3 is to specifically characterize the level of expression of the HERVs, organized in some forty multicopy families in the human genome. The repeated nature of these elements renders the individual measurement thereof difficult.
[0232] To check the specificity of the probes, a specificity score Spec=φ.sub.1−max(φ.sub.2) is computed. In other words, for a given probe, this score measures the affinity difference between the specific hybrid and the most stable non-specific hybrid, i.e. the one which exhibits the greatest risk of cross-reaction. To test the validity of this specificity score, two types of experiment can be implemented: [0233] for a given probe, creating “spike-in” RNAs (i.e. RNAs artificially synthesized in a laboratory) complementing mismatches and checking that the decrease in intensities is linked to the increase in the specificity score, the latter being computed as the affinity difference between the specific target, absent from the reaction mixture, and the hybridized non-specific target. This type of approach offers the advantage of accurately knowing which RNAs are present in the reaction mixture; [0234] hybridizing a same biological sample on the chips V2 and V3 and correlating the intensities of the HERV/MaLR loci common to both chips. More specifically, the specificity score as described above is computed for all of the probes, then, in calculating the correlation, only the probe sets for which the probes exceed a given specificity threshold are taken into account. If the specificity score is valid, the correlation between the intensities of V2 and V3 should increase with the specificity level. This approach, which is more global than the preceding one, is the one chosen in this example.
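By way of non-limiting illustration, the specificity score and the threshold-based retention of probes used in the second type of experiment can be sketched as follows. The function names `specificity_score` and `select_specific`, the data layout (one specific affinity and one non-empty list of non-specific affinities per probe) and the inclusive comparison with the threshold are assumptions made for the example.

```python
import numpy as np

def specificity_score(phi_specific, phi_nonspecific):
    """Spec = phi_1 - max(phi_2): gap to the most stable non-specific hybrid."""
    return float(phi_specific) - float(np.max(phi_nonspecific))

def select_specific(phi1, phi2_lists, threshold):
    """Indices of the probes whose score reaches the threshold, and all scores."""
    scores = np.array([specificity_score(p1, p2)
                       for p1, p2 in zip(phi1, phi2_lists)])
    return np.flatnonzero(scores >= threshold), scores
```

Repeating the selection for increasing thresholds and recomputing the correlation between the two chips on the retained probes gives the trend that is expected to rise if the score is valid.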
[0235] The biological samples used in this example originate from the same four cell lines as those presented in the example F. The hybridization and bioinformatics processing protocols used in this example are described in the article by Perot et al. (“Microarray-based sketches of the HERV transcriptome landscape”) and comprise the usual steps of amplification, of fragmentation, of labeling, of hybridization on the chip, followed by steps of background correction, of normalization and of summarization.
[0236]
[0237] In
G) Extension of the Teaching of the Embodiment Detailed
[0238] k-hybrids have been described whose length is strictly equal to k. Obviously, the invention also covers a subdivision of the hybrids into k-hybrids whose length is less than or equal to k, that is to say into hybrid portions of length strictly equal to k, into hybrid portions of length strictly equal to k−1, etc. The mathematical framework described above continues to apply, the design matrices X and Y and the contribution vectors {circumflex over (B)} and {circumflex over (Δ)} being simply increased in size to take account of the additional configurations of k-hybrids.
[0239] A subdivision of the hybrids into areas of equal length has been described. The invention applies equally to areas of different length, which makes it possible to more accurately take account of the influence of each area.
[0240] A DNA chip probe selection method has been described based on a particular inventive modeling of affinity. The selection method according to the invention can however be based on other types of affinity modeling, the final threshold-based selection rules remaining identical.
[0241] Similarly, particular mathematical equations have been described. As is known per se, there can be, for each equation, several possible equivalent mathematical expressions, these different expressions lying also within the scope of the invention.