Characterization And Reproduction Of An Expert Judgement For A Binary Classification

20170307508 · 2017-10-26

Abstract

A method for analyzing sample cells reacting with at least one specific marker, includes providing a reference sample and an active sample and providing a set (E.sup.+) of cells declared positive from among the active sample cells. The method further includes determining a vector coefficient (θ) from the active sample and from the set (E.sup.+) and determining at least one set of positive cells in the reference sample as a function of the vector coefficient (θ). A rate of false positives (α) is calculated in the reference sample from the number of positive cells of the reference sample.

Claims

1. A method for analysing cells of a sample reacting with at least one specific marker, the method comprising: providing a reference sample and an active sample; providing a set (E.sup.+) of cells declared positive by an expert from among the cells of the active sample; determining a vector coefficient (θ) from the active sample and from the set (E.sup.+); determining at least one set of positive cells in the reference sample as a function of the vector coefficient (θ); and calculating a rate of false positives in the reference sample (α) from the number of positive cells of the reference sample.

2. The method according to claim 1, wherein determining the vector coefficient (θ) comprises a minimization of a quantity of false positives and a minimization of a quantity of false negatives in the active sample.

3. The method according to claim 1, wherein determining the vector coefficient (θ) comprises: defining, for each of the markers j, an s.sub.j-quantile y.sub.j.sup.s, quantile of a cumulative distribution function P.sub.j.sup.test associated with a smoothed probability distribution function p.sub.j.sup.test determined by smoothing a marginal distribution of the j.sup.th marker in the active sample; defining the set (S.sup.+) of cells declared positive with respect to the s.sub.j-quantile y.sub.j.sup.s for each marker j in the active sample; defining and determining a cardinal of the symmetrical difference between S.sup.+ and E.sup.+; and determining each of the largest values of the vector coefficient (θ) of each of the markers by minimization of the cardinal with respect to the value s.sub.j of each marker j in the interval [0,1], for all the markers.

4. The method according to claim 1 further comprising: providing the rate of false positives (α) in the reference sample; determining a vector coefficient (θ) based on the reference sample and the rate of false positives (α); and determining at least one set (S.sup.+) of positive cells in the active sample as a function of the vector coefficient (θ).

5. A method for analysing cells of a sample reacting with at least one specific marker, comprising: providing a reference sample and an active sample; providing a rate of false positives (α) in the reference sample; determining a vector coefficient (θ) based on the reference sample and the rate of false positives (α); and determining at least one set (S.sup.+) of positive cells in the active sample as a function of the vector coefficient (θ).

6. The method according to claim 5, wherein determining the vector coefficient (θ) comprises a maximization of a quantity of positive cells in the active sample respecting the given rate of false positives (α).

7. The method according to claim 5, wherein determining a vector coefficient (θ) comprises: defining, for each of the markers j, an s.sub.j-quantile y.sub.j.sup.s, quantile of a cumulative distribution function P.sub.j.sup.ref associated with a smoothed probability distribution function p.sub.j.sup.ref determined by smoothing a marginal distribution of the j.sup.th marker in the reference sample; defining a function F(s) representing a rate of negative cells in the reference sample, increasing from [0,1] to [0,1], by F(s)=card(VN.sup.s)/n, where VN.sup.s is defined by VN.sup.s={i=1, . . . , n such that y.sub.ij.sup.ref<y.sub.j.sup.s for each j=1, . . . , d}, the set of the cells of the reference sample a measured value of which is under the value of the vector coefficient (θ) of the marker corresponding to all the markers; determining the smallest value of s.sub.j such that F(s)>1−α; and determining the values of the vector coefficient (θ).

8. The method according to claim 1, wherein the vector coefficient (θ) is a set of threshold values of the expression of each of the markers above each of which a cell is declared positive.

9. The method according to claim 1 further comprising an analysis in which at least one marker to which at least one cell reacts positively is identified.

10. The method according to claim 1 further comprising verification by evaluation of a confusion matrix.

11. The method according to claim 5, wherein the vector coefficient (θ) is a set of threshold values of the expression of each of the markers above each of which a cell is declared positive.

12. The method according to claim 5 further comprising an analysis in which at least one marker to which at least one cell reacts positively is identified.

13. The method according to claim 5 further comprising verification by evaluation of a confusion matrix.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0063] The invention, according to an example of implementation, will be better understood and its advantages will be more apparent on reading the detailed description which follows, given by way of example and in no way limitative, with reference to the attached drawings in which:

[0064] FIG. 1 diagrammatically shows a mechanism for the production of molecules (Mo) by a cell (Ce) excited by an antigen (αy), each molecule being detectable using an antibody (αC) coupled to a fluorescent probe (fP),

[0065] FIG. 2 shows a representation in two dimensions, representing a first marker (Mj) and a second marker (Mj′), of a distribution of the cells of a reference sample and of a sample to be analysed,

[0066] FIG. 3 shows an example of smoothed probability density obtained for a marker (j) as a function of the measurements carried out in a reference sample and in a sample to be analysed, and

[0067] FIG. 4 shows an example of cumulative distribution functions obtained for a marker (j) as a function of the measurements carried out in a reference sample and in a sample to be analysed.

DETAILED DESCRIPTION

[0068] The present description refers by way of example to Intracellular Cytokine Staining (ICS) assays. Of course, the analysis method described within the scope of the present application is applicable to any type of analysis of cells, or even to any problem of multidimensional classification.

[0069] An ICS assay is usually carried out on blood samples incubated with antigens (αy) derived from viruses, bacteria or cancerous cells. As shown in FIG. 1, after this incubation, the cells (Ce) capable of recognizing the antigens (αy) start to produce different molecules (Mo) (usually cytokines) which are detected using antibodies (αC). Each antibody (αC) is specific to a given molecule (Mo) and is coupled to a given fluorescent probe (fP). Thus, the analysis of the fluorescence associated with a cell makes it possible to identify which molecules have been produced by this cell.

[0070] In ICS a cell (Ce) is declared positive if it has produced in a “detectable” quantity, i.e. in a quantity greater than a predetermined threshold, at least one molecule (Mo) of interest. The methods currently used for identifying the cells that are “positive”, thus reacting with at least one of the markers, rely on the visual judgement of an expert, or user.

[0071] The data set of a sample to be analysed can in fact be represented in the form of a scatter diagram, in a multidimensional space, of dimension given by the number of markers. Each point corresponds to a cell and is composed of expressions of all the markers for this cell.

[0072] As shown in FIG. 2, the user, i.e. generally the expert, visualizes two-dimensional sections of one of the markers (Mj) with respect to another (Mj′) in this multidimensional space and refers to a sample called “reference” (i.e. a sample of known negatives), before incubation, in which all the cells are negative.

[0073] The expert then manually draws the selection intervals around which he judges there to be positive cells, i.e. which are distinguished visually from the scatter diagram along one or other of the two axes, and therefore one or other of the two markers represented. This is for example represented by the dotted outline in FIG. 2.

[0074] A drawback of this procedure is that it is subjective and makes the results from different users or laboratories difficult to compare. Moreover, it is very difficult to reproduce.

[0075] In order to at least partially resolve the aforementioned drawbacks, the method, according to an example of implementation of the present invention, analyses two samples: the reference sample, made up of known negative cells, and the sample to be analysed, made up of unknown cells. It identifies the positive cells in the sample to be analysed. In other words, the input data of the method are constituted by two samples:

[0076] The reference sample, which is for example represented by a matrix containing the measurements (of fluorescence) of a sample of n negative cells (in which no marker is expressed, as the cells have not been affected), “n” thus being the number of cells, i.e. the number of points. For each cell, a number d of markers (identified for example as Mj, with j=1 . . . d) are measured, “d” therefore being the dimension of the measurement space.

[0077] The reference sample is for example denoted X.sup.ref, a matrix of size n×d, where X.sup.ref=[x.sup.ref.sub.ij] (with i=1, . . . , n and j=1, . . . , d), x.sup.ref.sub.ij corresponding to the measurement (of fluorescence) of the j.sup.th marker for the i.sup.th cell.

[0078] The sample to be analysed, which is for example represented by a matrix containing the measurements (of fluorescence) of a sample of m cells comprising both positive and negative cells (certain markers are expressed: the cells having been affected, some of them have reacted). For each cell, the same d (fluorescent) markers are measured.

[0079] The sample to be analysed is for example denoted X.sup.test, a matrix of size m×d, where X.sup.test=[x.sup.test.sub.kj] (with k=1, . . . , m and j=1, . . . , d), x.sup.test.sub.kj corresponding to the measurement (of fluorescence) of the j.sup.th marker for the k.sup.th cell.

[0080] The main output data of the method are the set of cells of the sample to be analysed which are declared as being positive. A cell of the sample to be analysed is declared positive if the normalized expression of one of the markers, i.e. of at least one of the markers, is greater than the corresponding threshold value estimated in the third step, detailed later on.

[0081] First Step: Preparation of the Data

[0082] During a first step, which is optional, the expressions of the markers (measured fluorescence values) for the reference sample and for the sample to be analysed are for example firstly normalized then expanded. In other words, the step of preparation comprises for example a step of normalization and a step of expansion of the data. This makes it possible to render the measurements independent of the scale and of the calibration of the measurement tool. Such conditioning of the problem makes it possible moreover to simplify the method while making it possible for the classification to be carried out correctly.

[0083] For example, the previously defined matrices X.sup.ref and X.sup.test, once normalized, are denoted Y.sup.ref=[y.sup.ref.sub.ij] and Y.sup.test=[y.sup.test.sub.kj], where y.sup.ref.sub.ij and y.sup.test.sub.kj are the normalized values of the expressions of the markers (measurements of fluorescence) x.sup.ref.sub.ij and x.sup.test.sub.kj. In order to do this, the measurements are mapped to the unit interval [0,1] and then expressed on a logarithmic scale.

[0084] For example, for each marker j in {1, . . . , d}, the step of preparation of the data comprises the following steps:

[0085] a step of determining a minimum x.sub.{j,min} and a maximum x.sub.{j,max} of the measured expressions of the marker considered in the reference sample and in the sample to be analysed;

[0086] a step of normalization and expansion of the data of the reference sample and the sample to be analysed, which is carried out as follows:


y.sup.ref.sub.ij=f.sub.j(x.sup.ref.sub.ij); i=1, . . . , n; j=1, . . . , d


y.sup.test.sub.kj=f.sub.j(x.sup.test.sub.kj); k=1, . . . , m; j=1, . . . , d

where f.sub.j(x) is for example the following expansion function:


f.sub.j(x)=log.sub.10((x−x.sub.{j,min})/(x.sub.{j,max}−x.sub.{j,min})+ε)

in which (x−x.sub.{j,min})/(x.sub.{j,max}−x.sub.{j,min}) corresponds, strictly speaking, to the normalization and where ε is the expansion parameter, with j in {1, . . . , d}. The parameter ε is for example between 10.sup.−6 and 10.sup.−3 and can be adapted; it is for example 10.sup.−6.
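By way of illustration only, the preparation step for a single marker can be sketched in Python as follows. The function name `expand` and the sample values are assumptions introduced for the example; the minimum and maximum are taken over the reference sample and the sample to be analysed pooled together, as described above.

```python
import math

def expand(values_ref, values_test, eps=1e-6):
    """Normalize one marker to [0, 1] with the pooled min/max, then apply log10(. + eps)."""
    pooled = list(values_ref) + list(values_test)
    x_min, x_max = min(pooled), max(pooled)
    if x_max == x_min:
        # The normalization divides by (x_max - x_min), so a constant marker is rejected.
        raise ValueError("marker is constant; normalization is undefined")

    def f(x):
        return math.log10((x - x_min) / (x_max - x_min) + eps)

    return [f(x) for x in values_ref], [f(x) for x in values_test]

# Hypothetical fluorescence values for one marker.
y_ref, y_test = expand([10.0, 20.0, 30.0], [10.0, 110.0])
```

With ε = 10.sup.−6, the smallest measurement maps to approximately −6 and the largest to approximately 0, so the expanded values lie in an interval of about six decades.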

[0087] Second Step: Smoothing of the Distribution of the Values Obtained for a Sample

[0088] This step aims to smooth the probability densities of the markers of the sample considered, for example the reference sample for the example detailed here, normalized, so that they become continuous and independent of the effects of discretization. In other words, this makes it possible to have a continuous probability density function based on the discrete values that are the results of measurements. It is for example possible to use the Parzen-Rosenblatt method, also called “kernel estimator”.

[0089] The unidimensional probability densities (i.e. for one marker at a time) are for example obtained using the kernel estimation method with a Gaussian kernel and the Silverman rule for the width of the kernel, called smoothing parameter. For example, this is applied to the normalized data of the reference sample determined in step 1, i.e. y.sup.ref.sub.ij.

[0090] For each marker j in {1, . . . , d} the smoothing step of the method comprises for example the following steps:

[0091] a step of selecting a kernel K, for example Gaussian;

[0092] a step of determining the smoothing parameter h, which corresponds to the width of the smoothing kernel, by using for example the Silverman rule:

$$h_j = \left(\frac{3}{4n}\right)^{1/5} \min(\sigma_j, \mathrm{irq}_j)$$

where σ.sub.j and irq.sub.j are respectively the empirical standard deviation and the interquartile range of the set {y.sup.ref.sub.ij, i=1, . . . , n};

[0093] a step of defining the probability density function of the marginal distribution of the j.sup.th marker of the reference sample by:

$$p_j^{\mathrm{ref}}(x) = \frac{1}{n h_j} \sum_{i=1}^{n} K\!\left(\frac{x - y_{ij}^{\mathrm{ref}}}{h_j}\right)$$

where K is a kernel, for example the Gaussian kernel defined by

$$K(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right).$$
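As a non-limiting sketch, the smoothing of one marker with a Gaussian kernel can be written in Python as follows. The function names and the sample values are assumptions introduced for the example. The width follows the expression given above, with the leading constant exposed as a parameter because the commonly cited form of Silverman's rule uses (4/(3n)).sup.1/5 instead. The choice of the sample standard deviation and of the default quartile convention of the standard library is also an assumption.

```python
import math
import statistics

def silverman_width(sample, constant=3.0 / 4.0):
    """Kernel width h = (constant / n)^(1/5) * min(sigma, interquartile range)."""
    n = len(sample)
    sigma = statistics.stdev(sample)                  # empirical standard deviation
    q1, _, q3 = statistics.quantiles(sample, n=4)     # quartile cut points
    return (constant / n) ** 0.2 * min(sigma, q3 - q1)

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(sample, h):
    """Return the smoothed density p(x) = 1/(n h) * sum_i K((x - y_i) / h)."""
    n = len(sample)

    def p(x):
        return sum(gaussian_kernel((x - y) / h) for y in sample) / (n * h)

    return p

# Hypothetical normalized values of one marker in the reference sample.
sample = [-1.0, -0.5, 0.0, 0.4, 1.2]
h = silverman_width(sample)
p_ref = kde(sample, h)
```

The width is zero when the standard deviation or the interquartile range is zero, in which case the density is undefined; a practical implementation would need to guard against that case.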

[0094] At this stage, the results of normalized measurements for the sample to be analysed and a probability density of the result for each marker for the reference sample are thus known.

[0095] These probability densities are for example shown in FIG. 3 for a marker j.

[0096] Then, the method comprises a step of defining an estimation of the multivariate densities, which correspond to the product of the univariate kernels, for example as follows:

$$p^{\mathrm{ref}}(x) = \frac{1}{n\, h_1 \cdots h_d} \sum_{i=1}^{n} \prod_{j=1}^{d} K_j\!\left(\frac{x_j - y_{ij}^{\mathrm{ref}}}{h_j}\right)$$

[0097] It is moreover possible to simplify this expression by considering that K.sub.j=K, or even h.sub.j=h for all the dimensions.
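For illustration, the product-kernel estimate can be sketched in Python as follows, under the simplifying assumption that the same Gaussian kernel is used for every marker (K.sub.j=K). The function names are assumptions introduced for the example.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def product_kde(sample, widths):
    """Multivariate density built as a product of univariate Gaussian kernels.

    sample: list of n rows, each with d normalized marker values.
    widths: list of d kernel widths h_1, ..., h_d.
    """
    n = len(sample)
    norm = n * math.prod(widths)

    def p(x):
        total = 0.0
        for row in sample:
            term = 1.0
            for x_j, y_ij, h_j in zip(x, row, widths):
                term *= gaussian_kernel((x_j - y_ij) / h_j)
            total += term
        return total / norm

    return p

# One cell at the origin, two markers, unit widths: the density at the origin is 1 / (2 * pi).
p2 = product_kde([[0.0, 0.0]], [1.0, 1.0])
value_at_origin = p2([0.0, 0.0])
```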

[0098] Depending on which version of the method, defined hereafter, is implemented, the smoothing step is carried out on at least the sample to be analysed instead of the reference sample.

[0099] Third Step: Estimation of the Thresholds

[0100] The following step, here the third step, aims to determine the values of the thresholds for the expressions of the markers above which a cell is declared positive.

[0101] In order to determine the threshold associated with each marker, two cases are envisaged here.

[0102] In a first case, called version 1, an auxiliary input comprises a sub-set E.sup.+ of cells of the sample to be analysed that the user judges positive. The method then produces an auxiliary output which is the rate α of false positives corresponding to the judgement of the user.

[0103] In a second case, called version 2, the auxiliary input is the acceptable rate α of false positives, which corresponds to the proportion of cells which are detected as positive by the method when it is applied to a sample of negative cells, for example the reference sample.

[0104] By default, if no auxiliary input is provided, the method carries out version 2 with the imposed value α=0, which corresponds to minimizing the values of the thresholds, subject to the algorithm declaring all the cells of the reference sample negative. This is the version of the method called “without bias”.

[0105] In other words, the method comprises a step of providing an additional parameter which is either the set E.sup.+, or the rate of false positives α, knowing that if no additional parameter is specified, the step of providing an additional parameter consists of considering α=0.

[0106] In other words, in both cases, the principles of the calculations are the same. In the first case, these are applied in the sample to be analysed for predicting in the reference sample, while in the second case, it is the reverse.

[0107] Third Step—Version 1

[0108] In version 1, the user firstly carries out sorting from among the cells of a sample to be analysed. The cells judged positive by the user form the set called E.sup.+, comprising between 0 and m cells of the sample to be analysed.

[0109] In this version, the thresholds are estimated so as best to reproduce the judgement of the user on the sample to be analysed.

[0110] In other words, the third step according to version 1 comprises for example the following steps:

[0111] For a value s.sub.j (thus corresponding to a probability), a step of defining an s.sub.j-quantile y.sub.j.sup.s, quantile of the cumulative distribution function P.sub.j.sup.test associated with the smoothed probability distribution function p.sub.j.sup.test determined in step 2 for each marker j:

$$s_j = \int_{-\infty}^{y_j^s} p_j^{\mathrm{test}}(x)\,dx = P_j^{\mathrm{test}}(y_j^s)$$

[0112] This is for example shown in FIG. 4 for a marker j.
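As a hedged illustration, when the smoothed density is a Gaussian kernel estimate, its cumulative distribution function is the average of Gaussian cumulative distribution functions centred on the measurements, and the s.sub.j-quantile can be found numerically. The sketch below assumes a Gaussian kernel and a search by bisection. The function names, the bracketing interval and the tolerance are assumptions introduced for the example.

```python
import math

def smoothed_cdf(sample, h):
    """CDF of a Gaussian kernel estimate: P(y) = 1/n * sum_i Phi((y - y_i) / h)."""
    n = len(sample)

    def P(y):
        return sum(0.5 * (1.0 + math.erf((y - yi) / (h * math.sqrt(2.0))))
                   for yi in sample) / n

    return P

def smoothed_quantile(sample, h, s, tol=1e-9):
    """Smallest y (up to tol) such that P(y) >= s, found by bisection."""
    P = smoothed_cdf(sample, h)
    lo, hi = min(sample) - 10.0 * h, max(sample) + 10.0 * h   # assumed bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if P(mid) < s:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical normalized values of one marker, with an assumed kernel width.
values, width = [-1.0, 1.0], 0.5
median = smoothed_quantile(values, width, 0.5)
q90 = smoothed_quantile(values, width, 0.9)
```

For s equal to 0 or 1 the search simply returns a value near the corresponding end of the bracket, which lies well outside the range of the measurements.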

[0113] The s.sub.j-quantile y.sub.j.sup.s therefore corresponds here to a threshold value of normalized expression for a considered marker j: above this value, a cell is considered positive for this marker; below it, the cell is considered negative for this marker.

[0114] a step of defining the set of cells declared positive with respect to the s.sub.j-quantile y.sub.j.sup.s, for each marker in the sample to be analysed: by taking the union of these d sets, the set S.sup.+={k=1, . . . , m such that y.sub.kj.sup.test≧y.sub.j.sup.s for at least one j=1, . . . , d} is thus obtained. This means that the set S.sup.+ comprises the set of the cells of the sample to be analysed considered as positives, i.e. the analysed cells for which the expression of a marker (measured fluorescence value) is greater than or equal to y.sub.j.sup.s for at least one given marker. In other words, the set S.sup.+ comprises all the cells which have a normalized expression of a marker at or above the threshold for at least one marker.

[0115] Therefore at this stage there are two defined sets: E.sup.+ the set of cells judged positive by the user, and S.sup.+ the set of cells defined positive by the method. If E.sup.+ is known, S.sup.+ remains to be determined as it depends on the values of the thresholds of each marker, which are to be determined. This determination of S.sup.+ is carried out according to the following steps:

[0116] A step of defining and determining a cardinal of the symmetrical difference between S.sup.+ and E.sup.+. This means determining the sum of the number of cells which belong to E.sup.+ but not to S.sup.+ and the number of cells which belong to S.sup.+ but not to E.sup.+, i.e. which do not simultaneously belong to both sets S.sup.+ and E.sup.+.

[0117] Then, the method comprises a step of minimizing this cardinal with respect to the value s.sub.j of each marker j in the interval [0,1], for all the markers. This means determining the largest threshold value of each of the markers from among the values minimizing the cardinal. In other words, this step consists of determining a threshold value y.sub.j.sup.s for each of the markers such that a maximum number of cells belong both to E.sup.+ and S.sup.+. For example, in a “perfect” case E.sup.+ and S.sup.+ would be superimposed, identical.

[0118] The value s.sub.j and the s.sub.j-quantile y.sub.j.sup.s for each marker j is thus known.
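The minimization of the cardinal of the symmetrical difference can be sketched, purely by way of example, as an exhaustive search over candidate thresholds. The sketch below assumes that a finite list of candidate thresholds is supplied for each marker (for instance the quantiles y.sub.j.sup.s evaluated on a grid of values s.sub.j). The function names, the grid and the lexicographic tie-break in favour of larger thresholds are assumptions, and the search cost grows exponentially with the number of markers.

```python
from itertools import product

def positives(y_test, thresholds):
    """Set S+ of cell indices with at least one marker at or above its threshold."""
    return {k for k, row in enumerate(y_test)
            if any(v >= t for v, t in zip(row, thresholds))}

def fit_thresholds(y_test, expert_positive, candidates):
    """Pick thresholds minimizing card(S+ symmetric-difference E+).

    Ties are broken in favour of larger thresholds (lexicographic order, an assumption).
    """
    best_key, best_thresholds = None, None
    for thresholds in product(*candidates):
        cost = len(positives(y_test, thresholds) ^ expert_positive)
        key = (cost, tuple(-t for t in thresholds))
        if best_key is None or key < best_key:
            best_key, best_thresholds = key, thresholds
    return best_thresholds, best_key[0]

# Hypothetical normalized data: four cells, two markers; the expert selects cells 1 and 2.
y_test = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]
expert = {1, 2}
grid = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]
thresholds, mismatch = fit_thresholds(y_test, expert, grid)
```

In this small example the thresholds (0.5, 0.5) reproduce the expert selection exactly, so the cardinal of the symmetrical difference is zero.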

[0119] A simplification consists for example of considering that all the values s.sub.j are identical, equal for example to a single value s; it is then a question of determining the quantiles y.sub.j.sup.s corresponding to each of the markers.

[0120] Another step comprises for example then defining the function F (increasing from [0,1] to [0,1]) by

$$F(s) = \frac{\operatorname{card}(VN^s)}{n},$$

where VN.sup.s is defined by VN.sup.s={i=1, . . . , n such that y.sub.ij.sup.ref<y.sub.j.sup.s for each j=1, . . . , d}, i.e. the set of the cells of the reference sample for which the normalized expression of every marker is under the corresponding threshold (i.e. all the cells of the reference set in an ideal case). Determining the cardinal of this set makes it possible to count the cells which are declared negative. Dividing this cardinal by n, the total number of cells of the reference sample, then gives the rate of negative cells in the reference sample.

[0121] Finally, the method comprises a step of calculating the rate of false positives α according to the formula α=1−F(s).
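As an illustrative sketch, once the thresholds are known, the rate of negative cells F and the rate of false positives α=1−F can be computed on the reference sample as follows. The function names and the data are assumptions introduced for the example.

```python
def rate_negative(y_ref, thresholds):
    """F = card(VN) / n, where VN holds the reference cells under every threshold."""
    n = len(y_ref)
    vn = [i for i, row in enumerate(y_ref)
          if all(v < t for v, t in zip(row, thresholds))]
    return len(vn) / n

def false_positive_rate(y_ref, thresholds):
    """alpha = 1 - F."""
    return 1.0 - rate_negative(y_ref, thresholds)

# Hypothetical normalized reference sample: four cells, two markers.
y_ref = [[0.1, 0.1], [0.2, 0.6], [0.4, 0.3], [0.1, 0.2]]
alpha = false_positive_rate(y_ref, (0.5, 0.5))
```

Here one reference cell out of four exceeds a threshold, so the rate of false positives is 0.25.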

[0122] In an alternative to defining and determining the function F, it is also possible to determine the confusion matrix, as detailed previously, in order to determine the rate of false positives.

[0123] In this version, the rate α is therefore determined based on the set S.sup.+ and the method returns as output the determined set S.sup.+ as well as the rate α.

[0124] In other words, in this version, the set S.sup.+ is constructed from arbitrary, coherent values of s.sub.j, then from an optimization procedure so as to find the thresholds y.sub.j.sup.s for each marker which will make it possible to classify the points.

[0125] Third Step—Version 2

[0126] In version 2, the rate α of false positives that the user judges acceptable is imposed as input value (also called here additional parameter). The rate α corresponds to the rate of cells detected as positives by the algorithm when it is applied to a sample of negative cells, for example the reference sample. As mentioned previously, by default, the algorithm carries out version 2 with α=0, which means that the algorithm minimizes the thresholds to ensure that all the cells of the reference sample are declared negative.

[0127] The third step for version 2 comprises for example the following steps:

[0128] a step of defining y.sub.j.sup.s, the s.sub.j-quantile of the cumulative distribution function P.sub.j.sup.ref associated with the smoothed probability distribution function p.sub.j.sup.ref introduced in step 2, for each marker j;

[0129] a step of defining the function F (increasing from [0,1] to [0,1]) by

$$F(s) = \frac{\operatorname{card}(VN^s)}{n},$$

where VN.sup.s is defined by VN.sup.s={i=1, . . . , n such that y.sub.ij.sup.ref<y.sub.j.sup.s for each j=1, . . . , d};

[0130] a step of determining, for example by dichotomy (bisection), the smallest value of s.sub.j such that F(s)>1−α.

[0131] Knowing the values s.sub.j, it is therefore then possible to determine the associated thresholds for each of the markers.
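The search by dichotomy can be sketched in Python as follows, under several stated assumptions: a single value s common to all the markers, a Gaussian kernel for the smoothed distribution, and kernel widths supplied by the caller. The text states the strict condition F(s)>1−α; since F never exceeds 1, the sketch assumes the non-strict form F(s)≥1−α so that the default α=0 remains satisfiable. All function names are assumptions introduced for the example.

```python
import math

def smoothed_quantile(sample, h, s, tol=1e-9):
    """Quantile of the Gaussian-kernel smoothed CDF of one marker, by bisection."""
    def P(y):
        return sum(0.5 * (1.0 + math.erf((y - yi) / (h * math.sqrt(2.0))))
                   for yi in sample) / len(sample)
    lo, hi = min(sample) - 10.0 * h, max(sample) + 10.0 * h   # assumed bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if P(mid) < s:
            lo = mid
        else:
            hi = mid
    return hi

def rate_negative(y_ref, thresholds):
    """F = proportion of reference cells under every threshold."""
    return sum(1 for row in y_ref
               if all(v < t for v, t in zip(row, thresholds))) / len(y_ref)

def smallest_s(y_ref, widths, alpha=0.0, tol=1e-6):
    """Smallest common s (up to tol) with F(s) >= 1 - alpha, and its thresholds."""
    d = len(y_ref[0])
    columns = [[row[j] for row in y_ref] for j in range(d)]

    def thresholds_at(s):
        return [smoothed_quantile(columns[j], widths[j], s) for j in range(d)]

    lo, hi = 0.0, 1.0          # F(1) = 1 because the thresholds then exceed every value
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rate_negative(y_ref, thresholds_at(mid)) >= 1.0 - alpha:
            hi = mid
        else:
            lo = mid
    return hi, thresholds_at(hi)

# Hypothetical reference sample with one marker and an assumed kernel width of 0.5.
ref = [[0.0], [1.0], [2.0], [3.0]]
s_quarter, thr_quarter = smallest_s(ref, [0.5], alpha=0.25)
s_zero, thr_zero = smallest_s(ref, [0.5])
```

With α=0.25 the threshold settles just above the third-largest reference value, and with the default α=0 it settles just above the largest one, so that every reference cell is declared negative.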

[0132] Thus, in this version 2, having fixed an α that is tolerable or equal to 0, the smallest threshold value corresponding to each of the markers is sought.

[0133] By applying the determined threshold values to the sample to be analysed, the method can therefore then determine the set S.sup.+ of positive cells, as detailed in a fourth step described hereafter.

[0134] Thus, in this version, the set S.sup.+ is determined from the rate α.

[0135] Whatever the version (1 or 2), at the end of step 3 described previously, it is known how many cells, and which, are considered positives in the sample to be analysed, and what the rate of false positives (α) is in the reference sample and hence, the values s.sub.j and the s.sub.j-quantiles y.sub.j.sup.s to be considered for each of the markers.

[0136] Fourth Step: Classification of the Sample to be Analysed

[0137] Then, a fourth step aims to classify the cells of the sample to be analysed into a set of positive cells on the one hand, and negative cells on the other hand.

[0138] A cell of the sample to be analysed is declared positive if the normalized expression of one of the markers, i.e. of at least one of the d markers, is greater than the value of the corresponding threshold estimated in the third step.

[0139] The fourth step comprises for example a step of defining and determining a set of cells declared negative in the sample to be analysed by


S.sup.−={k=1, . . . , m such that y.sup.test.sub.kj<y.sup.s.sub.j for each j=1, . . . , d}.

[0140] The set S.sup.− of the cells declared negative is thus defined, i.e. those of which all the normalized expressions of the markers are under the corresponding thresholds of the markers. The set S.sup.+ of cells declared positive is thus the complement of S.sup.−.

[0141] Thus, the step mentioned previously is for example particularly useful with reference to version 2 of the third step, whereas in version 1 it is for example possible to determine the set S.sup.− directly by taking the complement of the set S.sup.+, which has already been determined based on the set E.sup.+ in order to calculate α.
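For illustration only, the classification of the sample to be analysed can be sketched as follows. The function name and the data are assumptions introduced for the example.

```python
def classify(y_test, thresholds):
    """Split cell indices into (S+, S-) given one threshold per marker."""
    negative = {k for k, row in enumerate(y_test)
                if all(v < t for v, t in zip(row, thresholds))}
    positive = set(range(len(y_test))) - negative   # S+ is the complement of S-
    return positive, negative

# Hypothetical normalized data: four cells, two markers.
cells = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]
s_plus, s_minus = classify(cells, (0.5, 0.5))
```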

[0142] Fifth Step: Analysis of the Positive Cells

[0143] For each cell detected as positive in the sample to be analysed, the method can indicate at least one marker the expression of which is greater than the corresponding threshold.

[0144] To do this, a first step aims to define a set X.sup.+ such that X.sup.+={(k,j), k in S.sup.+ and j=1, . . . , d such that y.sup.test.sub.kj≧y.sup.s.sub.j}. Thus, X.sup.+ represents the set of the (cell, marker) pairs, where cell is a cell declared positive in the sample to be analysed and marker is a marker the normalized value of which is greater than the corresponding threshold for the cell. As a result, for the set of the cells having been defined as positive, by considering one marker in particular, certain cells have a marker the normalized expression of which is greater than the corresponding threshold, whereas others can have an expression lower than the corresponding threshold, the latter having then been declared positive due to the expression above the threshold of another marker.

[0145] Thus, from among the cells declared positive, it is for example possible to count how many times a marker is expressed. In order to do this, a step comprises determining, for each marker j, the value of Z.sub.j=card (k in S.sup.+ such that (k,j) is in X.sup.+), which is also equal to Z.sub.j=card (k in S.sup.+ such that y.sub.kj.sup.test≧y.sub.j.sup.s). In other words, the method comprises for example a step of counting the occurrences of a marker.

[0146] Knowing the occurrence of each marker for example, it is thus possible to grade them, for example by order of importance, the most important (most frequent) being then given by the expression

$$\arg\max_j (Z_j).$$

The method then comprises for example a step of grading the markers according to their occurrence, i.e. according to the number of times that it is expressed by a cell.
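A minimal sketch of this analysis, given the set S.sup.+ and the thresholds, could read as follows. The function name and the data are assumptions introduced for the example, and markers with equal counts keep their original order.

```python
def marker_analysis(y_test, thresholds, positive):
    """Return the (cell, marker) pairs X+, the counts Z_j and the markers graded by count."""
    d = len(thresholds)
    pairs = {(k, j)
             for k in positive
             for j, (v, t) in enumerate(zip(y_test[k], thresholds))
             if v >= t}
    counts = [sum(1 for (_, j2) in pairs if j2 == j) for j in range(d)]
    ranking = sorted(range(d), key=lambda j: -counts[j])   # most frequent marker first
    return pairs, counts, ranking

# Hypothetical normalized data: cells 1, 2 and 3 were declared positive.
data = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
x_plus, z, graded = marker_analysis(data, (0.5, 0.5), {1, 2, 3})
```

In this example the first marker is expressed by two positive cells and the second by one, so the first marker is graded highest.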

[0147] Thus, for example, a postprocessor can then provide a statistical analysis of the output set X.sup.+, for example a grading of the markers.