DEVICE FOR AND METHOD OF DETERMINING CLUSTERS
20190044533 · 2019-02-07
CPC classification
G06F17/18 (PHYSICS)
International classification
H03M7/30 (ELECTRICITY)
H03M7/40 (ELECTRICITY)
Abstract
A device (100) for and a method of determining clusters of sequences of instances of a first type of data, for compacting a data set comprising sequences of instances of the first type of data, are provided. Also provided are a method of compacting a data set, a method of transmitting compacted data and a computer program product. In a sequence clustering unit (110) of the device, sequences of a first set of data are clustered on the basis of conditional probabilities. Each unique sequence of the first set of data is associated with one or more conditional probabilities that a corresponding instance of a second set of data has a specific value given the unique sequence. In the clustering, a significant part of the mutual information between the first set of data and the second set of data is maintained.
Claims
1-14. (canceled)
15. A data reduction arrangement for compacting sequences of instances of a first type of data, the data reduction arrangement comprising: a device for determining clusters of sequences of instances of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data based on a characteristic of a physical entity, the device comprising: a first data set unit for obtaining a first set of data comprising sequences of instances of the first type of data; a second data set unit for obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; a sequence clustering unit for assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; and a compaction unit configured for compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set.
16. A computer-implemented method for determining clusters of sequences of instances of data of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data being based on a characteristic of a physical entity, the method comprising: obtaining a first set of data comprising sequences of instances of the first type of data; obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; and compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set.
17. The method according to claim 16, wherein the assigning of the sequences of the first set of data to clusters comprises forming the clusters on the basis of estimated conditional probabilities of the leaf nodes of the context tree, wherein, if a specific leaf node is related to a specific cluster, then all sequences of the first set of data equal to the unique sequence ending in the specific leaf node are assigned to the specific cluster, and wherein the estimated conditional probability of a respective leaf node is a Krichevsky and Trofimov estimator that is calculated on the basis of the counts of the respective leaf nodes.
18. The method according to claim 16, wherein the assigning of the sequences of the first set of data to clusters uses a k-means algorithm to form the clusters and assigns sequences of the first set to clusters, and wherein the k-means algorithm uses the estimated conditional probabilities of the leaf nodes of the context tree to form the clusters.
19. The method according to claim 18, wherein, in the assigning of the sequence of the first set of data to clusters, sequences of the first set of data ending in leaf nodes having a total count that is smaller than a minimum number of observations are assigned to two additional clusters, sequences ending in leaf nodes having an estimated conditional probability smaller than 0.5 and having a total count that is smaller than the minimum number of observations are assigned to a first one of the two additional clusters, sequences ending in leaf nodes having an estimated conditional probability larger than 0.5 and having a total count that is smaller than the minimum number of observations are assigned to a second one of the two additional clusters, sequences ending in leaf nodes having an estimated conditional probability that is equal to 0.5 and having a total count smaller than the minimum number of observations are assigned to either the first one of the additional clusters or the second one of the additional clusters.
20. The method according to claim 16, wherein the clusters of the sequences of the first set of data are further optimized by an iterative optimization method to minimize an optimization function that is based on a conditional entropy of the second set of data given the data of the clusters.
21. The method according to claim 20, wherein the iterative optimization method comprises simulated annealing.
22. The method according to claim 16, wherein the sequences of instances of the first type of data of the first set of data comprise time series of sensor data, each time series comprises results of measurements of one specific sensor at consecutive moments in time and the specific sensors are of an equal type.
23. The method according to claim 16, wherein the instances of the second set of data are binary data instances.
24. The method according to claim 23, wherein the assigning based on conditional probabilities is based on the conditional probabilities that the data of the second set of data is one given a unique sequence of the first set of data.
25. A method of transmitting compacted data comprising at least one sequence of instances of the first type of data, the at least one sequence is a sequence to be transmitted, the method comprising: obtaining the at least one sequence; determining, via a computer-implemented method, clusters of sequences of instances of data of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data being based on a characteristic of a physical entity, the computer-implemented method comprising: obtaining a first set of data comprising sequences of instances of the first type of data; obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set; selecting one of the clusters as the cluster that best matches with the at least one sequence; and transmitting an identification data of the selected cluster instead of the at least one sequence, the identification data of a specific cluster uniquely identifies the specific cluster and can be stored by a fewer number of bits than the sequence.
26. A non-transitory computer-readable medium having one or more executable instructions stored thereon, which when executed by a processor, cause the processor to perform a computer-implemented method for determining clusters of sequences of instances of data of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data being based on a characteristic of a physical entity, the method comprising: obtaining a first set of data comprising sequences of instances of the first type of data; obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; and compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] These and other aspects of the invention will be apparent from and elucidated further with reference to the embodiments described by way of example in the following description and with reference to the accompanying drawings.
[0036] The figures are purely diagrammatic and not drawn to scale. In the Figures, elements which correspond to elements already described may have the same reference numerals.
DETAILED DESCRIPTION OF EMBODIMENTS
[0038] The instances of the first type of data may also be obtained from the sensor at consecutive moments in time, such that a sequence of data is obtained that contains time evolutionary aspects. In particular the time evolutionary aspects of the instances of the first type of data may be interesting for predicting instances of the second type of data. For example, when the temperature of a motor is continuously rising in a particular interval of time, this may be a sign that the motor is going to overheat and may become defective within a certain interval of time. In the remainder of this document the focus is on sequences of instances of the first type of data. These sequences may represent a time series of instances of the first type of data, which means that the instances are obtained/determined at consecutive moments in time. It must be noted that the embodiments are not necessarily limited to such time series of instances of the first type of data.
[0039] It is to be noted that the instances of data of the first type are not necessarily obtained from the sensors 302, 352. The instances of the first type of data may also be data that represents characteristics of a physical entity that are not sensed by a sensor but are known as a state of the physical entity. For example, a controller may switch the motors 304, 354 on and off, and the controlled on and off state may also form the instances of the data of the first type. Also, if one wants to predict when the motors 304, 354 are going to break down, the controlled on and off state may be useful data that can be used in the prediction of a possible break down of the motors 304, 354.
[0040] The embodiments of this invention are also discussed with reference to the accompanying drawings.
[0042] The device 100 comprises a first data set unit 102 for obtaining a first set of data that comprises sequences of instances of the first type of data. Optionally, the instances of the first type of data are based on sensor data, for example, on the measurements of the sensors 302, 352 discussed above.
[0043] In an embodiment, the first set of data is represented by a matrix X:

$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\ \vdots & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,m} \end{pmatrix}$$

It is to be noted that the rows of the matrix X are the sequences x_i = (x_{i,1}, . . . , x_{i,m}).
[0044] In an example that is discussed further in this application, the first data set is a data set X_example in which the instances of data of the first type have a binary value, the sequences have a length of m=3 and there are n=7 sequences.
[0045] Embodiments are not limited to the use of the above matrix X. The use of a matrix with the sequences as rows is just an implementation detail and a skilled person knows that matrices can be transposed or that an ordering between the columns or between the rows may be changed as well. In embodiments one may also work with matrices to which a scaling factor has been applied.
[0046] The device 100 also comprises a second data set unit 104 for obtaining a second set of data comprising instances of the second type of data. Optionally, the instances of the second type of data are based on a characteristic of a physical entity. The characteristic is, for example, whether one of the motors 304, 354 discussed above is going to break down.
[0047] Every instance y_i corresponds to, belongs to, a sequence x_i of the first set of data X. Thus, there are n instances y_i in the second set of data Y. Corresponding to or belonging to means in this context that the data of the sequence x_i is in the physical world related to the characteristic y_i of the physical entity; thus, the instances of the sequence x_i comprise information that has resulted in the instance y_i. In the context of the above example, a sequence x_i may comprise the measurements of one of the sensors 302, 352 and the corresponding instance y_i may indicate whether the respective motor 304, 354 breaks down.
[0048] In the example that is discussed further in this application, the second data set is a data set Y_example that comprises one binary instance y_i for each of the n=7 sequences of X_example.
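The concrete values of X_example and Y_example are not reproduced in this text. Purely as an illustration, the following sketch constructs a hypothetical pair of data sets that is consistent with the properties stated in this example (n=7 binary sequences of length m=3, with x_2 equal to x_5 and x_3 equal to x_6, and one binary instance y_i per sequence); the values are assumptions, not the application's actual example data.

```python
import numpy as np

# Hypothetical stand-in for X_example: n=7 binary sequences of length m=3,
# chosen such that x_2 == x_5 and x_3 == x_6, as stated in the text.
X_example = np.array([
    [0, 1, 0],  # x_1
    [0, 0, 1],  # x_2
    [1, 0, 0],  # x_3
    [1, 1, 0],  # x_4
    [0, 0, 1],  # x_5 (equal to x_2)
    [1, 0, 0],  # x_6 (equal to x_3)
    [0, 1, 1],  # x_7
])

# Hypothetical stand-in for Y_example: one binary instance y_i per sequence,
# chosen to be consistent with the conditional probabilities and leaf counts
# that are discussed below.
Y_example = np.array([0, 1, 1, 0, 1, 0, 0])
```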
[0049] The device 100 also comprises a sequence clustering unit 110 that is configured to assign the sequences x_i of the first set of data X to clusters. The whole set of clusters is indicated with C and one specific cluster is indicated with c_p. The assigning of a specific sequence x_i to a specific cluster c_p is based on a conditional probability that belongs to the specific sequence. Each sequence x_i has a conditional probability that an instance of the second set of data is equal to a specific value (for example, 0 or 1) given the sequence, for example P(y=0|x_i). Sequences x_i having values for the conditional probability P(y=0|x_i) close to each other may end up in the same cluster. If the first set of data X and the second set of data Y are known, the conditional probabilities can be calculated in the well-known ways known from probability theory. The most basic methods to calculate the conditional probabilities are based on counting the number of occurrences of certain instances of the first set of data X and counting the number of occurrences of certain instances of the second set of data Y. The number of clusters is indicated with k and may be chosen by a user. The number k may, for example, be chosen on the basis of the variability of the conditional probabilities or may be based on the total number n of sequences x_i in the first set of data X. For example, the number of clusters is set to √n. Choosing √n clusters results in maintaining quite a lot of mutual information between the first set of data and the second set of data. The clustering unit 110 is coupled to the first data set unit 102 and the second data set unit 104 for receiving the first set of data X and the second set of data Y.
[0050] In the above discussed example of X_example and Y_example, the conditional probabilities P(y=0|x_i) have to be determined. They can be calculated by

$$P(y=0 \mid x_i) = \frac{\text{number of occurrences of } x_i \text{ with corresponding instance } y = 0}{\text{number of occurrences of } x_i}$$

As an example, based on counting certain combinations in the data sets X_example and Y_example: the sequence x_1 occurs once and its corresponding instance y_1 is equal to 0, such that P(y=0|x_1)=1.

[0051] The other conditional probabilities can be calculated in the same way and their values are: [0052] P(y=0|x_2)=0 [0053] P(y=0|x_3)=1/2 [0054] P(y=0|x_4)=1 [0055] P(y=0|x_5)=P(y=0|x_2)=0 [0056] P(y=0|x_6)=P(y=0|x_3)=1/2 [0057] P(y=0|x_7)=1
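As a minimal sketch of this counting approach, using the hypothetical X_example and Y_example introduced above:

```python
from collections import defaultdict

def conditional_probabilities(X, Y):
    """Estimate P(y=0 | x_i) for every unique sequence by counting occurrences."""
    totals = defaultdict(int)  # occurrences of each unique sequence
    zeros = defaultdict(int)   # occurrences with corresponding y == 0
    for x, y in zip(X, Y):
        key = tuple(x)
        totals[key] += 1
        if y == 0:
            zeros[key] += 1
    return {key: zeros[key] / totals[key] for key in totals}

probs = conditional_probabilities(X_example, Y_example)
for i, x in enumerate(X_example, start=1):
    print(f"P(y=0 | x_{i}) = {probs[tuple(x)]}")  # 1.0, 0.0, 0.5, 1.0, 0.0, 0.5, 1.0
```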
[0058] Because there are n=7 sequences in the first data set X, it seems logical to build three clusters (√7≈2.6). Based on conditional probabilities that are close to each other or equal to each other, one may conclude that the subsequent three clusters must be formed, namely: [0059] C_1={x_1, x_4, x_7} [0060] C_2={x_2, x_5} [0061] C_3={x_3, x_6}
[0062] In the above clustering quite a lot of mutual information between the first set of data X and the second set of data Y is maintained. It follows from information theory that, if most of the mutual information between the instances of the first type of data and the instances of the second type of data must be maintained during the clustering, the conditional entropy of instances of the second type of data given the clusters, H(Y|C), must be kept as low as possible, e.g. may be minimized. The inventors have derived the following:

$$H(Y|C) = -\sum_{c \in C} \sum_{y} P(y,c) \log_2 P(y|c) = -\sum_{c \in C} \Big( \sum_{x \in c} P(x) \Big) \sum_{y} P(y|x \in c) \log_2 P(y|x \in c)$$

[0063] Based on this derivation the inventors have concluded that, if the conditional entropy of instances of the second type of data given the clusters H(Y|C) must be minimized, one has to create clusters/groups such that each cluster/group has a low value for −Σ_y P(y|x∈c) log_2 P(y|x∈c) and the clusters/groups have similar values for Σ_{x∈c} P(x). If it is assumed that the values of the instances of the second set of data Y are binary values, a low value of −Σ_y P(y|x∈c) log_2 P(y|x∈c) is obtained for a specific group if there is a large difference between P(y=0|x∈c) and P(y=1|x∈c) for the specific group. The inventors came to the insight that one has to group sequences x_i into one cluster c_p if the sequences x_i have a similar value for P(y=0|x_i), because if the values of P(y=0|x_i) for one specific cluster c_p would vary quite a lot, the value of P(y=0|x∈c_p) would converge to 0.5 (as the expected mean of randomly drawn numbers from the interval [0,1]) and then the value of P(y=1|x∈c_p) would also converge to 0.5. Thus, by basing the clustering on P(y=0|x_i), or on P(y=1|x_i), for the sequences x_i, one obtains a clustering that results in a relatively low value for the conditional entropy of the second set of data given the data of the clusters H(Y|C) and, thus, a relatively large amount of the mutual information I(X;Y) between the first set of data and the second set of data is maintained in the clustered data.
[0064] The device 100 also comprises an output unit 112 that provides the clusters C to, for example, a data reduction arrangement 150 for compacting the sequences x_i of instances of the first type of data of, for example, the first set of data X. The output unit 112 is coupled to the sequence clustering unit 110 for receiving the clusters C. It is to be noted that the provided clusters C may also comprise additional information that may help to map the sequences x_i to the clusters c_p. For example, a map may be provided that maps each sequence x_i to a specific cluster c_p. Alternatively, for each cluster c_p a representative instance x_p from the first set of data is provided, such that each instance x_i of the first set of data X belongs to the specific cluster that has a representative instance x_p that is closest to the respective instance x_i. It is to be noted that the device 100 may also provide additional information together with the clusters, for example, the probability P(c_p) that a sequence x_i ends up in a cluster. The device 100 may also provide additional information that is useful in the context of using the clusters to predict instances y_i of the second type of data, such as, for example, P(y|c_p).
[0065] Thus, optionally, the device 100 is part of a data reduction arrangement 150. This data reduction arrangement 150 may comprise a compaction unit 152. The compaction unit 152 may be coupled to the output unit 112 and receives from the output unit 112 the clusters C and data that relates the instances x_i to the clusters C. Subsequently, the compaction unit 152 may generate identification data id_p that can be used to uniquely identify the different clusters c_p. The identification data id_p for each cluster c_p is shorter than the length of the sequences x_i of the first set of data X, such that less data is required to be stored (and, if the first set of data X must be transmitted, less bandwidth is required to transmit the first set). Thereafter the compaction unit 152 replaces each sequence x_i with the identification data of the cluster to which the respective sequence is assigned. Thereby the amount of data present in the first set of data X is reduced and, thus, the first set of data X is compacted.
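A minimal sketch of this compaction step, assuming the clusters are available as a mapping from unique sequences to cluster indices (the mapping below reuses the hypothetical example data and the clusters C_1, C_2, C_3 found above):

```python
import math

def compact(X, cluster_of):
    """Replace every sequence by the identification data of its cluster.

    cluster_of maps each unique sequence (as a tuple) to a cluster index.
    With k clusters, an identification fits in ceil(log2(k)) bits, which is
    fewer bits than a full sequence of m instances.
    """
    k = len(set(cluster_of.values()))
    bits_per_id = max(1, math.ceil(math.log2(k)))
    ids = [cluster_of[tuple(x)] for x in X]
    return ids, bits_per_id

cluster_of = {(0, 1, 0): 0, (1, 1, 0): 0, (0, 1, 1): 0,  # C_1
              (0, 0, 1): 1,                              # C_2
              (1, 0, 0): 2}                              # C_3
ids, bits = compact(X_example, cluster_of)
print(ids, bits)  # [0, 1, 2, 0, 1, 2, 0], 2 bits per identification vs. m=3 bits
```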
[0066] If the device 100 for determining clusters is embedded in a device that has to transmit sequences of instances of the first type of data, the output of the output unit 112 can be used to replace a sequence x_i to be transmitted with identification data id_p, such that less data has to be transmitted. It may be that, before sequences are transmitted, information about the clusters C and possible characteristics of sequences of the clusters is transmitted to the receiver, such that the receiver is able to interpret the received identification data id_p.
[0068] The method 200 of determining clusters has similar embodiments, effects and advantages as the device 100 that has been discussed above.
[0069] In an embodiment, the assigning of the sequences x_i of the first set of data X to clusters C comprises applying 208 the Context Tree Weighting method to the first set of data X and the second set of data Y to obtain a context tree. In the Context Tree Weighting method every unique sequence x_i of the first set of data is represented by a path in the context tree from a root node to a specific leaf node, and counts stored in nodes of the context tree are based on the corresponding instances y_i of the second set of data.
[0070] The Context Tree Weighting method is well known in the art of information theory and (text) compression and has, for example, been described by Willems et al. in the papers i) "The Context-Tree Weighting Method: Basic Properties", Willems et al., IEEE Transactions on Information Theory, Vol. 42, No. 3, pp. 653-664; ii) "Reflections on The Context-Tree Weighting Method: Basic Properties", Willems et al., Newsletter of the IEEE Information Theory Society, 1997; and iii) "1996 IT Society Paper Award, Reflection on the Prize Paper: The Context-Tree Weighting Method: Basic Properties", Willems et al., IEEE Information Theory Society Newsletter, Vol. 47, No. 1, March 1997, pp. 19-27. The above mentioned papers i), ii) and iii) are hereby incorporated by reference.
[0071] For example, if three sequences x_1, x_2 and x_5 of the first set of data X are equal to each other, then they are all three represented by a path in the context tree that starts with an edge from the root node of the context tree to a subsequent node in the context tree, and this edge is marked with the value of instance x_{1,m} of the sequence x_1. The last edge of the path is from a node at depth m−1 to a specific leaf node at depth m. This last edge is marked with the value of instance x_{1,1} of the sequence x_1. In particular, if the sequences x_i are time series, the most recent instance of the first type of data of each unique sequence x_i marks the first edge from the root node, and the oldest instance of the first type of data of each unique sequence x_i marks the last edge of the path that ends in the leaf node. It is also to be noted that, if the instances of the first type are stored in a different order in the sequences x_i, the marking of the edges of the paths in the context tree may be performed differently, in accordance with the different order. In the specific leaf node counts are stored that count the values of the corresponding instances y_1, y_2 and y_5. If the instances of the second set of data Y are binary values, the counts of the nodes are, for example, indicated with a for the number of corresponding instances that have the value 0 and b for the number of corresponding instances that have the value 1. For example, if y_1=1, y_2=0 and y_5=0, then a=2 and b=1. This forming of the context tree with the Context Tree Weighting (CTW) method has been discussed, for example, on pages 24 to 26 of the above discussed paper "1996 IT Society Paper Award, Reflection on the Prize Paper: The Context-Tree Weighting Method: Basic Properties". The discussion of the CTW method starts at the paragraph "CTW algorithm" and ends with the paragraph "complexity". For the sake of completeness, the paragraphs "tree sources" and "context tree" of the paper are also considered to be part of the description of the discussion of the CTW method.
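The complete CTW method also maintains counts and weighted probabilities in the internal nodes; for the clustering discussed here only the leaf counts (a, b) are needed. A simplified sketch under that assumption, which indexes the leaves directly by the unique sequences instead of walking the tree edge by edge:

```python
from collections import defaultdict

def leaf_counts(X, Y):
    """Per unique sequence (i.e. per leaf of a context tree of depth m),
    count how many corresponding instances y are 0 (a) and 1 (b)."""
    counts = defaultdict(lambda: [0, 0])  # leaf -> [a, b]
    for x, y in zip(X, Y):
        counts[tuple(x)][int(y)] += 1
    return counts

counts = leaf_counts(X_example, Y_example)
# For the hypothetical example data: the leaf of x_2 == x_5 gets a=0, b=2 and
# the leaf of x_3 == x_6 gets a=1, b=1.
```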
[0072] Initially, the Context-Tree Weighting method has been described for binary data and for how it may be used to compact text data. Today the practitioner in the art has several implementations at his disposal for other forms of non-binary data. Thus, if the discrete data in X or Y has more than two different possible values, one of such implementations must be used to build the first context tree. For example, the article "Context Tree Weighting: Multi-alphabet Sources" of Tjalkens et al., Proceedings Fourteenth Symposium on Information Theory in the Benelux, Veldhoven, The Netherlands, May 17-18, 1993, pp. 128-135 describes how the Context-Tree Weighting method may be applied to non-binary data. The document "Context Tree Weighting: Multi-alphabet Sources" is herewith incorporated by reference.
[0073] Optionally, in the method 200, the assigning 206 of the sequences x_i of the first set of data X to clusters C comprises clustering 210 the sequences x_i of the first set of data X according to an estimated conditional probability that is based on the counts of the specific leaf node of the context tree in which the respective sequence x_i ends. In the Context Tree Weighting method, counts are registered for each leaf node and one can use these counts to calculate the so-termed Krichevsky and Trofimov estimator (KT estimator). If the counts relate to binary data, then the estimated conditional probability P_ec is defined by:

$$P_{ec}(a, b) = \frac{b + \tfrac{1}{2}}{a + b + 1}$$

wherein a is the number of counted 0's and b is the number of counted 1's. The KT estimator for binary data is also discussed in the above provided references of the CTW algorithm.
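A direct transcription of this estimator; the printed values correspond to the leaf counts of the example discussed below:

```python
def kt_estimator(a, b):
    """Krichevsky-Trofimov estimate of P(y=1) from a counted 0's and b counted 1's."""
    return (b + 0.5) / (a + b + 1.0)

print(kt_estimator(1, 0))  # 0.25      -> leaves of x_1, x_4 and x_7
print(kt_estimator(0, 2))  # 0.8333... -> leaf of x_2 and x_5 (5/6)
print(kt_estimator(1, 1))  # 0.5       -> leaf of x_3 and x_6
```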
[0074] The value of the estimated conditional probability P_ec of a specific leaf node is an estimate of the conditional probability P(y|x_i) of the data of the second set of data given the specific unique sequence x_i of which the path ends in that specific leaf node. More specifically, if the instances of the second set of data are binary data instances, the above given estimated conditional probability P_ec of a specific leaf node is an estimate of the conditional probability P(y=1|x_i), wherein x_i is the unique sequence of which the path in the context tree ends in the specific leaf node. Thus, the estimated conditional probabilities P_ec of the leaf nodes of the context tree can be used to cluster the sequences x_i of the first set of data X. Thus, one may conclude that clusters of leaf nodes are formed and that all sequences x_i of the first set of data X that correspond to the unique sequence that ends in a specific leaf node are assigned to the cluster to which the specific leaf node is assigned.
[0075] The above discussed data sets X_example and Y_example are also the data sets that are used in the above cited papers to explain how a Context Tree Weighting (CTW) tree has to be built and how the estimated probabilities of the nodes of the tree must be calculated. The CTW tree that is built from these data sets on the basis of the Context Tree Weighting method is presented in the above cited papers.
[0076] The estimated conditional probabilities are then for the subsequent sequences x_i: [0077] x_1: a=1, b=0: P_ec=1/4 [0078] x_2, x_5: a=0, b=2: P_ec=5/6 [0079] x_3, x_6: a=1, b=1: P_ec=1/2 [0080] x_4: a=1, b=0: P_ec=1/4 [0081] x_7: a=1, b=0: P_ec=1/4
[0082] It can be seen that the estimated conditional probabilities P_ec are not exactly the same as the conditional probabilities provided above; however, one can also see that there is a clear correlation between the estimated probabilities P_ec and the conditional probabilities.
[0083] Optionally, in the method 200, the assigning of the sequences x_i of the first set of data X to clusters C uses a k-means algorithm to form the clusters C. If the assigning uses the Context Tree Weighting method according to the above discussed embodiments, the k-means algorithm uses the estimated conditional probabilities P_ec of the leaf nodes of the context tree to cluster the sequences x_i of the first set of data X. Please note that the k-means algorithm may also use the earlier discussed conditional probabilities in the same way as is discussed here for the estimated conditional probabilities P_ec. In the k-means algorithm one has to select a number of clusters to be formed. In an embodiment, the number is set to 5, 6 or 7. As the result of the k-means algorithm, sequences that have an estimated conditional probability P_ec close to a specific mean value associated with a specific cluster are put in the specific cluster. One may also select a number of clusters that is equal to √n, because it is known that at such a number of clusters most of the mutual information between the first set of data X and the second set of data Y is maintained in the clusters. The k-means algorithm is, for example, described in the textbook "An Introduction to Statistical Learning" of G. James et al., published by Springer, 2013, New York. This book can also be obtained via the weblink: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf.
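A minimal sketch of this step, clustering the one-dimensional P_ec values of the example with k-means; the use of scikit-learn is an assumption, any k-means implementation would do:

```python
import numpy as np
from sklearn.cluster import KMeans

# P_ec per sequence x_1..x_7, taken from the leaf counts discussed above.
p_ec = np.array([1/4, 5/6, 1/2, 1/4, 5/6, 1/2, 1/4]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(p_ec)
print(kmeans.labels_)
# Sequences with equal P_ec end up in the same cluster, reproducing
# C_1={x_1,x_4,x_7}, C_2={x_2,x_5} and C_3={x_3,x_6}.
```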
[0084] If, for the above discussed example, three clusters are formed on the basis of the estimated conditional probabilities P_ec of the leaf nodes, the sequences x_1, x_4 and x_7 (P_ec=1/4) are assigned to a first cluster, the sequences x_2 and x_5 (P_ec=5/6) are assigned to a second cluster, and the sequences x_3 and x_6 (P_ec=1/2) are assigned to a third cluster.
[0088] As can be seen, this is exactly the same clustering as the clustering that was obtained directly from the conditional probabilities. Therefore it is concluded that using the estimated conditional probabilities of the leaf nodes is a good way of obtaining the same clustering as in the situation in which the conditional probabilities are used directly.
[0089] It has to be noted that other suitable clustering algorithms may also be used and that embodiments are not limited to the use of the k-means clustering algorithm. It may be that the length of the sequences x_i of the first set of data X is relatively long compared to the number n of sequences x_i in this first set of data X. In that case it may be that in some leaf nodes the counts are relatively low, for example smaller than a defined minimum number of observations minObs, which makes the estimated conditional probability of these leaf nodes relatively inaccurate.
[0090] In an embodiment, all leaf nodes of which the total count is smaller than the minimum number of observations minObs are excluded from the clustering on the basis of the estimated conditional probability P_ec, but are assigned to two additional clusters. All sequences x_i that end in a specific leaf node with a total count smaller than the minimum number of observations minObs, where that specific leaf node has an estimated conditional probability P_ec smaller than 0.5, are assigned to a first one of the two additional clusters. All sequences x_i that end in a specific leaf node with a total count smaller than the minimum number of observations minObs, where that specific leaf node has an estimated conditional probability P_ec larger than 0.5, are assigned to a second one of the two additional clusters. All sequences x_i that end in a specific leaf node with a total count smaller than the minimum number of observations minObs, where that specific leaf node has an estimated conditional probability P_ec equal to 0.5, are assigned either to the first one or to the second one of the two additional clusters.
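This assignment rule can be sketched as follows (kt_estimator as defined above; min_obs and the tie-breaking choice for P_ec equal to 0.5 are parameters):

```python
def assign_low_count_leaves(counts, min_obs, tie_to_first=True):
    """Assign leaves with fewer than min_obs observations to two additional
    clusters, based on whether their KT estimate is below or above 0.5."""
    first, second = [], []
    for leaf, (a, b) in counts.items():
        if a + b >= min_obs:
            continue  # enough observations: clustered on P_ec as usual
        p = kt_estimator(a, b)
        if p < 0.5 or (p == 0.5 and tie_to_first):
            first.append(leaf)
        else:
            second.append(leaf)
    return first, second
```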
[0091] Above, the method stages of applying 208 the Context Tree Weighting method and clustering 210 the sequences x_i of the first set of data X are discussed. It is to be noted that the device 100 discussed above, and in particular its sequence clustering unit 110, may be configured to perform these stages as well.
[0092] Optionally, after assigning the sequences x_i of the first set of data X to clusters C, the method may comprise an additional optimization stage 212 wherein the clusters C of the sequences x_i of the first set of data X are further optimized by an iterative optimization method to minimize an optimization function that is derived from, may be based on, or may be equal to a conditional entropy H(Y|C) of the second set of data Y given the data of the clusters C. The calculation of this conditional entropy is as follows:

$$H(Y|C) = -\sum_{c \in C} \sum_{y} P(y,c) \log_2 \frac{P(y,c)}{P(c)}$$

wherein P(c) is the probability that an instance of the first set of data X ends up in the specific cluster c, and P(y,c) is the probability that an instance of the second set of data Y has the value y and is in the specific cluster c. For example, P(c=cluster_1) can be calculated by dividing the number of sequences of the first set of data that are assigned to cluster_1 by the total number of instances n in the first set of data X. For example, if the instances of the second set of data have a binary value: for calculating P(y=0, c=cluster_1) one has to count, for all sequences x_i of the first set of data X that end up in cluster_1, how often their corresponding instance y_i of the second set of data Y is equal to 0; if this count is indicated by N, then P(y=0, c=cluster_1)=N/n, wherein n is the total number of instances in the first set of data X and in the second set of data Y.
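A minimal sketch of this calculation, given the cluster labels of the example clustering found above:

```python
import math
from collections import defaultdict

def conditional_entropy(Y, labels):
    """H(Y|C) = -sum over c and y of P(y,c) * log2(P(y,c) / P(c))."""
    n = len(Y)
    p_c = defaultdict(float)   # P(c)
    p_yc = defaultdict(float)  # P(y, c)
    for y, c in zip(Y, labels):
        p_c[c] += 1.0 / n
        p_yc[(int(y), c)] += 1.0 / n
    return -sum(p * math.log2(p / p_c[c]) for (y, c), p in p_yc.items())

labels = [0, 1, 2, 0, 1, 2, 0]  # C_1, C_2, C_3 of the example
print(conditional_entropy(Y_example, labels))  # 2/7 ≈ 0.286 bit
```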
[0093] In an example, in the additional optimization stage 212 wherein the clusters C of the sequences x_i of the first set of data X are further optimized, in each iteration one or more specific sequences x_i of the first set of data X are moved to another cluster and it is checked whether the conditional entropy H(Y|C) of the second set of data Y given the data of the clusters C becomes smaller; if so, a better solution has been found. In an embodiment, the used iterative optimization method comprises the simulated annealing optimization method. One may, for example, use the approach discussed in the article "Convergence Theorems for a Class of Simulated Annealing Algorithms on R^d" of Claude J. P. Bélisle, Journal of Applied Probability, Vol. 29, No. 4 (December 1992), pp. 885-895. The article of C. J. P. Bélisle is herewith incorporated by reference. In simulated annealing an imaginary temperature function and a probability function are important, and this article provides useful proposals for such functions.
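A schematic sketch of such an iterative optimization; the cooling schedule and acceptance rule below are simple textbook choices, not the specific functions proposed in the cited article (conditional_entropy as defined above):

```python
import math
import random

def anneal_clusters(Y, labels, k, steps=10000, t0=1.0, seed=42):
    """Move single sequences between clusters; accept a worse clustering with a
    temperature-dependent probability, keeping the best H(Y|C) seen so far."""
    rng = random.Random(seed)
    current, best = list(labels), list(labels)
    e_cur = e_best = conditional_entropy(Y, current)
    for step in range(1, steps + 1):
        t = t0 / step  # simple cooling schedule (an assumption)
        i = rng.randrange(len(current))
        old = current[i]
        current[i] = rng.randrange(k)  # propose moving sequence i to a cluster
        e_new = conditional_entropy(Y, current)
        if e_new <= e_cur or rng.random() < math.exp((e_cur - e_new) / t):
            e_cur = e_new
            if e_new < e_best:
                e_best, best = e_new, list(current)
        else:
            current[i] = old  # revert the rejected move
    return best, e_best
```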
[0094] In the above context, it has been described that the first set of data X comprises sequences x_i of instances of the first type of data. The term sequence is used to indicate that the instances of one sequence originate from one specific source, for example, one specific sensor. Each sequence has its own source and all sources are of the same type and have the same function. For example, each sequence originates from one type of sensor that has the same function in machines of the same type. For example, in the context of the machines discussed above, each sequence comprises the measurement results of one of the sensors 302, 352.
[0095] Optionally, one of the above discussed methods of determining clusters may be part of a method of compacting a data set that comprises sequences of instances of the first type of data, wherein the data is based on, for example derived from, sensor data. The method comprises obtaining the data set with the sequences of instances of the first type of data. The obtained data set fulfills the function of the first set of data X in the included method of determining clusters. The output of the included method of determining clusters is subsequently used to replace sequences of instances of the first type of data of the data set by identification data of the clusters to which the respective sequences are assigned in the included method of determining clusters. Identification data is generated for the clusters. The identification data comprises an identification for each cluster that uniquely identifies the respective cluster. The identifications are at least smaller than the sequences of the data set, which means that they can be stored by a fewer number of bits than each sequence of the data set. This method results in a smaller set of data and storage capacity is saved. In an embodiment, the average size of the identifications is minimized given the condition that each cluster must be uniquely identifiable.
[0096] Optionally, one of the above discussed methods of determining clusters may be part of a method of transmitting compacted data comprising at least one sequence of instances of the first type of data. The at least one sequence is a sequence to be transmitted. The method comprises obtaining the at least one sequence. After determining the clusters, one of the clusters is selected as the cluster that best matches the at least one sequence. It may be that the first set of data X that is being used in the method of determining clusters already comprises a sequence that is equal to the at least one sequence, and then the cluster is selected to which that sequence was assigned. It may also be that the first set of data X does not comprise a sequence that is equal to the at least one sequence, and then a cluster is selected that best matches the at least one sequence. For example, each cluster may be represented by a representative sequence (which is, for example, a central sequence in the space that is represented by the cluster) and the cluster is selected of which the representative sequence is at the shortest distance to the at least one sequence. For the clusters identification data is generated. The identification data comprises an identification for each cluster that uniquely identifies the respective cluster. The identifications are at least smaller than the sequences of the data set, which means that they can be stored by a fewer number of bits than each sequence of the data set. Subsequently, the identification data of the selected cluster is transmitted instead of the at least one sequence. Thereby a smaller number of bits is transmitted compared to the situation in which the whole at least one sequence had to be transmitted, and transmission bandwidth and transmission power are saved.
[0098] The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk. Further the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such cable or other device or means. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant method.
[0099] The computer program 480 may be a computer program for a distributed processor system and may comprise computer code which causes a first processor system to perform a subset of the steps of the above discussed method and which causes a second processor system to perform another subset of the steps of the above discussed method. The subset of steps and the other subset of steps may be mutually exclusive.
[0100] In summary, this document provides a device for and a method of determining clusters of sequences of instances of a first type of data for compacting a data set comprising sequences of instances of the first type of data. Also a method of compacting a data set, a method of transmitting compacted data and a computer program product are provided. In a sequence clustering unit of the device, sequences of a first set of data are clustered on the basis of conditional probabilities. Each unique sequence of the first set of data is associated with one or more conditional probabilities that a corresponding instance of a second set of data has a specific value given the unique sequence. In the clustering, a significant part of the mutual information between the first set of data and the second set of data is maintained.
[0101] It is to be noted that the invention may be implemented in hardware and/or software, using programmable components.
[0102] It will be appreciated that the above description for clarity has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units or processors may be used without deviating from the invention. For example, functionality illustrated to be performed by separate units, processors or controllers may be performed by the same processor or controllers. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality rather than indicative of a strict logical or physical structure or organization. The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these.
[0103] It is noted, that in this document the word comprising does not exclude the presence of other elements or steps than those listed and the word a or an preceding an element does not exclude the presence of a plurality of such elements, that any reference signs do not limit the scope of the claims, that the invention may be implemented by means of both hardware and software, and that several means or units may be represented by the same item of hardware or software, and a processor may fulfill the function of one or more units, possibly in cooperation with hardware elements. Further, the invention is not limited to the embodiments, and the invention lies in each and every novel feature or combination of features described above or recited in mutually different dependent claims.