DEVICE FOR AND METHOD OF DETERMINING CLUSTERS
20190044533 · 2019-02-07
CPC classification
G06F17/18 (PHYSICS)
International classification
H03M7/30 (ELECTRICITY)
H03M7/40 (ELECTRICITY)
Abstract
A device (100) for and a method of determining clusters of sequences of instances of a first type of data, for compacting a data set comprising sequences of instances of the first type of data, are provided. Also provided are a method of compacting a data set, a method of transmitting compacted data and a computer program product. In a sequence clustering unit (110) of the device, sequences of a first set of data are clustered on the basis of conditional probabilities. Each unique sequence of the first set of data is associated with one or more conditional probabilities that a corresponding instance of a second set of data has a specific value given the unique sequence. In the clustering, a significant part of the mutual information between the first set of data and the second set of data is maintained.
Claims
1-14. (canceled)
15. A data reduction arrangement for compacting sequences of instances of a first type of data, the data reduction arrangement comprising: a device for determining clusters of sequences of instances of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data based on a characteristic of a physical entity, the device comprising: a first data set unit for obtaining a first set of data comprising sequences of instances of the first type of data; a second data set unit for obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; a sequence clustering unit for assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; and a compaction unit configured for compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set.
16. A computer-implemented method for determining clusters of sequences of instances of data of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data being based on a characteristic of a physical entity, the method comprising: obtaining a first set of data comprising sequences of instances of the first type of data; obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; and compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set.
17. The method according to claim 16, wherein the assigning of the sequences of the first set of data to clusters comprises forming the clusters on the basis of estimated conditional probabilities of the leaf nodes of the context tree, wherein, if a specific leaf node is related to a specific cluster, then all sequences of the first set of data equal to the unique sequence ending in the specific leaf node are assigned to the specific cluster, and wherein the estimated conditional probability of a respective leaf node is a Krichevsky and Trofimov estimator that is calculated on the basis of the counts of the respective leaf nodes.
18. The method according to claim 16, wherein the assigning of the sequences of the first set of data to clusters uses a k-means algorithm to form the clusters and assigns sequences of the first set to clusters, and wherein the k-means algorithm uses the estimated conditional probabilities of the leaf nodes of the context tree to form the clusters.
19. The method according to claim 18, wherein, in the assigning of the sequence of the first set of data to clusters, sequences of the first set of data ending in leaf nodes having a total count that is smaller than a minimum number of observations are assigned to two additional clusters, sequences ending in leaf nodes having an estimated conditional probability smaller than 0.5 and having a total count that is smaller than the minimum number of observations are assigned to a first one of the two additional clusters, sequences ending in leaf nodes having an estimated conditional probability larger than 0.5 and having a total count that is smaller than the minimum number of observations are assigned to a second one of the two additional clusters, sequences ending in leaf nodes having an estimated conditional probability that is equal to 0.5 and having a total count smaller than the minimum number of observations are assigned to either the first one of the additional clusters or the second one of the additional clusters.
20. The method according to claim 16, wherein the clusters of the sequences of the first set of data are further optimized by an iterative optimization method to minimize an optimization function that is based on a conditional entropy of the second set of data given the data of the clusters.
21. The method according to claim 20, wherein the iterative optimization method comprises simulated annealing.
22. The method according to claim 16, wherein the sequences of instances of the first type of data of the first set of data comprise time series of sensor data, each time series comprises results of measurements of one specific sensor at consecutive moments in time and the specific sensors are of an equal type.
23. The method according to claim 16, wherein the instances of the second set of data are binary data instances.
24. The method according to claim 23, wherein the assigning based on conditional probabilities is based on the conditional probabilities that the data of the second set of data is one given a unique sequence of the first set of data.
25. A method of transmitting compacted data comprising at least one sequence of instances of the first type of data, the at least one sequence is a sequence to be transmitted, the method comprising: obtaining the at least one sequence; determining, via a computer-implemented method, clusters of sequences of instances of data of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data being based on a characteristic of a physical entity, the computer-implemented method comprising: obtaining a first set of data comprising sequences of instances of the first type of data; obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set; selecting one of the clusters as the cluster that best matches with the at least one sequence; and transmitting an identification data of the selected cluster instead of the at least one sequence, the identification data of a specific cluster uniquely identifies the specific cluster and can be stored by a fewer number of bits than the sequence.
26. A non-transitory computer-readable medium having one or more executable instructions stored thereon, which when executed by a processor, cause the processor to perform a computer-implemented method for determining clusters of sequences of instances of data of a first type of data for compacting a data set comprising sequences of instances of the first type of data, the instances of the first type of data comprising information for predicting instances of a second type of data, the instances of the second type of data comprising data being based on a characteristic of a physical entity, the method comprising: obtaining a first set of data comprising sequences of instances of the first type of data; obtaining a second set of data comprising instances of the second type of data, each instance of the second set of data corresponds to a sequence in the first set of data; assigning the sequences of the first set of data to clusters, the assigning is based on conditional probabilities of data of the second type given a sequence of the first set of data, wherein each unique sequence of the first set of data is associated with one or more conditional probabilities that an instance of the second set of data has a specific value given the unique sequence, wherein the assigning of the sequences of the first set of data to clusters comprises applying the Context Tree Weighting method to the first set of data and the second set of data to obtain a context tree, in the Context Tree Weighting method every unique sequence of the first set of data is represented by a path in the context tree from a root node to a specific leaf node and counts stored in at least the leaf nodes of the context tree are based on the corresponding instances of the second set of data, wherein the estimated conditional probability of a respective leaf node is calculated on the basis of the counts of the respective leaf nodes; and compacting the first data set by replacing the sequences of instances of the first type of data of the data set by an identification data of the cluster to which the sequence is assigned, wherein the identification data of a specific cluster uniquely identifies the specific cluster and is stored by a fewer number of bits than each sequence of the data set.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] These and other aspects of the invention will be apparent from and elucidated further with reference to the embodiments described by way of example in the following description and with reference to the accompanying drawings.
[0036] The figures are purely diagrammatic and not drawn to scale. In the Figures, elements which correspond to elements already described may have the same reference numerals.
DETAILED DESCRIPTION OF EMBODIMENTS
[0038] The instances of the first type of data may also be obtained from the sensor at consecutive moments in time, such that a sequence of data is obtained that contains time evolutionary aspects. In particular the time evolutionary aspects of the instances of the first type of data may be interesting for predicting instances of the second type of data. For example, when the temperature of a motor is continuously rising in a particular interval of time, this may be a sign that the motor is going to overheat and may become defective within a certain interval of time. In the remainder of this document the focus is on sequences of instances of the first type of data. These sequences may represent a time series of instances of the first type of data, which means that the instances are obtained/determined at consecutive moments in time. It must be noted that the embodiments are not necessarily limited to such time series of instances of the first type of data.
[0039] It is to be noted that the instances of data of the first type are not necessarily obtained from the sensors 302, 352. The instances of the first type of data may also be data that represents characteristics of a physical entity that are not sensed by a sensor but are known as a state of the physical entity. For example, a controller may switch the motors 304, 354 on and off, and the controlled on and off state may also form the instances of the data of the first type. Also, if one wants to predict when the motors 304, 354 are going to break down, the controlled on and off state may be useful data that can be used in the prediction of a possible break down of the motors 304, 354.
[0040] The embodiments of this invention are also discussed with reference to the accompanying drawings.
[0042] The device 100 comprises a first data set unit 102 for obtaining a first set of data that comprises sequences of instances of the first type of data. Optionally, the instances of the first type of data are based on sensor data, for example, on the measurements of the sensors 302, 352 discussed above.
[0043] In an embodiment, the first set of data is represented by a matrix X:

$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\ \vdots & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,m} \end{pmatrix}$$

It is to be noted that the rows of the matrix X are the sequences x_i = (x_{i,1}, . . . , x_{i,m}).
[0044] In an example that is discussed further in this application, the first data set is a data set X_example in which the instances of data of the first type have a binary value, the sequences have a length of m=3 and there are n=7 sequences.
[0045] Embodiments are not limited to the use of the above matrix X. The use of a matrix with the sequences as rows is just an implementation detail and a skilled person knows that matrices can be transposed or that an ordering between the columns or between the rows may be changed as well. In embodiments one may also work with matrices to which a scaling factor has been applied.
[0046] The device 100 also comprises a second data set unit 104 for obtaining a second set of data comprising instances of the second type of data. Optionally, the instances of the second type of data are based on a characteristic of a physical entity. The characteristic is, for example, whether one of the motors 304, 354 discussed above is going to break down.
[0047] Every instance y_i corresponds to, belongs to, a sequence x_i of the first set of data X. Thus, there are n instances y_i in the second set of data Y. Corresponding to or belonging to means in this context that the data of the sequence x_i is in the physical world related to the characteristic y_i of the physical entity; thus, the instances of the sequence x_i comprise information that has resulted in the instance y_i. In the context of the above example, a sequence x_i may comprise the measurements of one of the sensors 302, 352 and the corresponding instance y_i may indicate whether the respective motor 304, 354 breaks down.
[0048] In the example that is discussed further in this application, the second data set is a data set Y_example that comprises one binary instance y_i for each of the n=7 sequences of X_example.
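The concrete values of X_example and Y_example are not reproduced in this text. Purely as an illustration, the following sketch constructs a hypothetical pair of data sets that is consistent with the properties stated in this example (n=7 binary sequences of length m=3, with x_2 equal to x_5 and x_3 equal to x_6, and one binary instance y_i per sequence); the values are assumptions, not the application's actual example data.

```python
import numpy as np

# Hypothetical stand-in for X_example: n=7 binary sequences of length m=3,
# chosen such that x_2 == x_5 and x_3 == x_6, as stated in the text.
X_example = np.array([
    [0, 1, 0],  # x_1
    [0, 0, 1],  # x_2
    [1, 0, 0],  # x_3
    [1, 1, 0],  # x_4
    [0, 0, 1],  # x_5 (equal to x_2)
    [1, 0, 0],  # x_6 (equal to x_3)
    [0, 1, 1],  # x_7
])

# Hypothetical stand-in for Y_example: one binary instance y_i per sequence,
# chosen to be consistent with the conditional probabilities and leaf counts
# that are discussed below.
Y_example = np.array([0, 1, 1, 0, 1, 0, 0])
```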
[0049] The device 100 also comprises a sequence clustering unit 110 that is configured to assign the sequences x_i of the first set of data X to clusters. The whole set of clusters is indicated with C and one specific cluster is indicated with c_p. The assigning of a specific sequence x_i to a specific cluster c_p is based on a conditional probability that belongs to the specific sequence. Each sequence x_i has a conditional probability that an instance of the second set of data is equal to a specific value (for example, 0 or 1) given the sequence, for example P(y=0|x_i). Sequences x_i having values for the conditional probability P(y=0|x_i) close to each other may end up in the same cluster. If the first set of data X and the second set of data Y are known, the conditional probabilities can be calculated in the well-known ways known from probability theory. The most basic methods to calculate the conditional probabilities are based on counting the number of occurrences of certain instances of the first set of data X and counting the number of occurrences of certain instances of the second set of data Y. The number of clusters is indicated with k and may be chosen by a user. The number k may, for example, be chosen on the basis of the variability of the conditional probabilities or may be based on the total number n of sequences x_i in the first set of data X. For example, the number of clusters is set to √n. Choosing √n clusters results in maintaining quite a lot of mutual information between the first set of data and the second set of data. The clustering unit 110 is coupled to the first data set unit 102 and the second data set unit 104 for receiving the first set of data X and the second set of data Y.
[0050] In the above discussed example of X_example and Y_example, the conditional probabilities P(y=0|x_i) have to be determined. They can be calculated by

$$P(y=0 \mid x_i) = \frac{\text{number of occurrences of } x_i \text{ with corresponding instance } y = 0}{\text{number of occurrences of } x_i}$$

As an example, based on counting certain combinations in the data sets X_example and Y_example: the sequence x_1 occurs once and its corresponding instance y_1 is equal to 0, such that P(y=0|x_1)=1.

[0051] The other conditional probabilities can be calculated in the same way and their values are: [0052] P(y=0|x_2)=0 [0053] P(y=0|x_3)=1/2 [0054] P(y=0|x_4)=1 [0055] P(y=0|x_5)=P(y=0|x_2)=0 [0056] P(y=0|x_6)=P(y=0|x_3)=1/2 [0057] P(y=0|x_7)=1
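As a minimal sketch of this counting approach, using the hypothetical X_example and Y_example introduced above:

```python
from collections import defaultdict

def conditional_probabilities(X, Y):
    """Estimate P(y=0 | x_i) for every unique sequence by counting occurrences."""
    totals = defaultdict(int)  # occurrences of each unique sequence
    zeros = defaultdict(int)   # occurrences with corresponding y == 0
    for x, y in zip(X, Y):
        key = tuple(x)
        totals[key] += 1
        if y == 0:
            zeros[key] += 1
    return {key: zeros[key] / totals[key] for key in totals}

probs = conditional_probabilities(X_example, Y_example)
for i, x in enumerate(X_example, start=1):
    print(f"P(y=0 | x_{i}) = {probs[tuple(x)]}")  # 1.0, 0.0, 0.5, 1.0, 0.0, 0.5, 1.0
```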
[0058] Because there are n=7 sequences in the first data set X, it seems logical to build three clusters (√7≈2.6). Based on conditional probabilities that are close to each other or equal to each other, one may conclude that the subsequent three clusters must be formed, namely: [0059] C_1={x_1, x_4, x_7} [0060] C_2={x_2, x_5} [0061] C_3={x_3, x_6}
[0062] In the above clustering quite a lot of mutual information between the first set of data X and the second set of data Y is maintained. It follows from information theory that, if most of the mutual information between the instances of the first type of data and the instances of the second type of data must be maintained during the clustering, the conditional entropy of instances of the second type of data given the clusters, H(Y|C), must be kept as low as possible, e.g. may be minimized. The inventors have derived the following:

$$H(Y|C) = -\sum_{c \in C} \sum_{y} P(y,c) \log_2 P(y|c) = -\sum_{c \in C} \Big( \sum_{x \in c} P(x) \Big) \sum_{y} P(y|x \in c) \log_2 P(y|x \in c)$$

[0063] Based on this derivation the inventors have concluded that, if the conditional entropy of instances of the second type of data given the clusters H(Y|C) must be minimized, one has to create clusters/groups such that each cluster/group has a low value for −Σ_y P(y|x∈c) log_2 P(y|x∈c) and the clusters/groups have similar values for Σ_{x∈c} P(x). If it is assumed that the values of the instances of the second set of data Y are binary values, a low value of −Σ_y P(y|x∈c) log_2 P(y|x∈c) is obtained for a specific group if there is a large difference between P(y=0|x∈c) and P(y=1|x∈c) for the specific group. The inventors came to the insight that one has to group sequences x_i into one cluster c_p if the sequences x_i have a similar value for P(y=0|x_i), because if the values of P(y=0|x_i) for one specific cluster c_p would vary quite a lot, the value of P(y=0|x∈c_p) would converge to 0.5 (as the expected mean of randomly drawn numbers from the interval [0,1]) and then the value of P(y=1|x∈c_p) would also converge to 0.5. Thus, by basing the clustering on P(y=0|x_i), or on P(y=1|x_i), for the sequences x_i, one obtains a clustering that results in a relatively low value for the conditional entropy of the second set of data given the data of the clusters H(Y|C) and, thus, a relatively large amount of the mutual information I(X;Y) between the first set of data and the second set of data is maintained in the clustered data.
[0064] The device 100 also comprises an output unit 112 that provides the clusters C to, for example, a data reduction arrangement 150 for compacting the sequences x_i of instances of the first type of data of, for example, the first set of data X. The output unit 112 is coupled to the sequence clustering unit 110 for receiving the clusters C. It is to be noted that the provided clusters C may also comprise additional information that may help to map the sequences x_i to the clusters c_p. For example, a map may be provided that maps each sequence x_i to a specific cluster c_p. Alternatively, for each cluster c_p a representative instance x_p from the first set of data is provided, such that each instance x_i of the first set of data X belongs to the specific cluster that has a representative instance x_p that is closest to the respective instance x_i. It is to be noted that the device 100 may also provide additional information together with the clusters, for example, the probability P(c_p) that a sequence x_i ends up in a cluster. The device 100 may also provide additional information that is useful in the context of using the clusters to predict instances y_i of the second type of data, such as, for example, P(y|c_p).
[0065] Thus, optionally, the device 100 is part of a data reduction arrangement 150. This data reduction arrangement 150 may comprise a compaction unit 152. The compaction unit 152 may be coupled to the output unit 112 and receives from the output unit 112 the clusters C and data that relates the instances x_i to the clusters C. Subsequently, the compaction unit 152 may generate identification data id_p that can be used to uniquely identify the different clusters c_p. The identification data id_p for each cluster c_p is shorter than the length of the sequences x_i of the first set of data X, such that less data is required to be stored (and, if the first set of data X must be transmitted, less bandwidth is required to transmit the first set). Thereafter the compaction unit 152 replaces each sequence x_i with the identification data of the cluster to which the respective sequence is assigned. Thereby the amount of data present in the first set of data X is reduced and, thus, the first set of data X is compacted.
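A minimal sketch of this compaction step, assuming the clusters are available as a mapping from unique sequences to cluster indices (the mapping below reuses the hypothetical example data and the clusters C_1, C_2, C_3 found above):

```python
import math

def compact(X, cluster_of):
    """Replace every sequence by the identification data of its cluster.

    cluster_of maps each unique sequence (as a tuple) to a cluster index.
    With k clusters, an identification fits in ceil(log2(k)) bits, which is
    fewer bits than a full sequence of m instances.
    """
    k = len(set(cluster_of.values()))
    bits_per_id = max(1, math.ceil(math.log2(k)))
    ids = [cluster_of[tuple(x)] for x in X]
    return ids, bits_per_id

cluster_of = {(0, 1, 0): 0, (1, 1, 0): 0, (0, 1, 1): 0,  # C_1
              (0, 0, 1): 1,                              # C_2
              (1, 0, 0): 2}                              # C_3
ids, bits = compact(X_example, cluster_of)
print(ids, bits)  # [0, 1, 2, 0, 1, 2, 0], 2 bits per identification vs. m=3 bits
```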
[0066] If the device 100 for determining clusters is embedded in a device that has to transmit sequences of instances of the first type of data, the output of the output unit 112 can be used to replace a sequence x_i to be transmitted with identification data id_p, such that less data has to be transmitted. It may be that, before sequences are transmitted, information about the clusters C and possible characteristics of sequences of the clusters is transmitted to the receiver, such that the receiver is able to interpret the received identification data id_p.
[0068] The method 200 of determining clusters has similar embodiments, effects and advantages as the device 100 that has been discussed above.
[0069] In an embodiment, the assigning of the sequences x_i of the first set of data X to clusters C comprises applying 208 the Context Tree Weighting method to the first set of data X and the second set of data Y to obtain a context tree. In the Context Tree Weighting method every unique sequence x_i of the first set of data is represented by a path in the context tree from a root node to a specific leaf node, and counts stored in nodes of the context tree are based on the corresponding instances y_i of the second set of data.
[0070] The Context Tree Weighting method is well known in the art of information theory and (text) compression and has, for example, been described by Willems et al. in the papers i) "The Context-Tree Weighting Method: Basic Properties", Willems et al., IEEE Transactions on Information Theory, Vol. 42, No. 3, pp. 653-664; ii) "Reflections on The Context-Tree Weighting Method: Basic Properties", Willems et al., Newsletter of the IEEE Information Theory Society, 1997; and iii) "1996 IT Society Paper Award, Reflection on the Prize Paper: The Context-Tree Weighting Method: Basic Properties", Willems et al., IEEE Information Theory Society Newsletter, Vol. 47, No. 1, March 1997, pp. 19-27. The above mentioned papers i), ii) and iii) are hereby incorporated by reference.
[0071] For example, if three sequences x_1, x_2 and x_5 of the first set of data X are equal to each other, then they are all three represented by a path in the context tree that starts with an edge from the root node of the context tree to a subsequent node in the context tree, and this edge is marked with the value of instance x_{1,m} of the sequence x_1. The last edge of the path is from a node at depth m−1 to a specific leaf node at depth m. This last edge is marked with the value of instance x_{1,1} of the sequence x_1. In particular, if the sequences x_i are time series, the most recent instance of the first type of data of each unique sequence x_i marks the first edge from the root node, and the oldest instance of the first type of data of each unique sequence x_i marks the last edge of the path that ends in the leaf node. It is also to be noted that, if the instances of the first type are stored in a different order in the sequences x_i, the marking of the edges of the paths in the context tree may be performed differently, in accordance with the different order. In the specific leaf node counts are stored that count the values of the corresponding instances y_1, y_2 and y_5. If the instances of the second set of data Y are binary values, the counts of the nodes are, for example, indicated with a for the number of corresponding instances that have the value 0 and b for the number of corresponding instances that have the value 1. For example, if y_1=1, y_2=0 and y_5=0, then a=2 and b=1. This forming of the context tree with the Context Tree Weighting (CTW) method has been discussed, for example, on pages 24 to 26 of the above discussed paper "1996 IT Society Paper Award, Reflection on the Prize Paper: The Context-Tree Weighting Method: Basic Properties". The discussion of the CTW method starts at the paragraph "CTW algorithm" and ends with the paragraph "complexity". For the sake of completeness, the paragraphs "tree sources" and "context tree" of the paper are also considered to be part of the description of the discussion of the CTW method.
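The complete CTW method also maintains counts and weighted probabilities in the internal nodes; for the clustering discussed here only the leaf counts (a, b) are needed. A simplified sketch under that assumption, which indexes the leaves directly by the unique sequences instead of walking the tree edge by edge:

```python
from collections import defaultdict

def leaf_counts(X, Y):
    """Per unique sequence (i.e. per leaf of a context tree of depth m),
    count how many corresponding instances y are 0 (a) and 1 (b)."""
    counts = defaultdict(lambda: [0, 0])  # leaf -> [a, b]
    for x, y in zip(X, Y):
        counts[tuple(x)][int(y)] += 1
    return counts

counts = leaf_counts(X_example, Y_example)
# For the hypothetical example data: the leaf of x_2 == x_5 gets a=0, b=2 and
# the leaf of x_3 == x_6 gets a=1, b=1.
```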
[0072] Initially, the Context-Tree Weighting method has been described for binary data and for how it may be used to compact text data. Today the practitioner in the art has several implementations at his disposal for other forms of non-binary data. Thus, if the discrete data in X or Y has more than two different possible values, one of such implementations must be used to build the first context tree. For example, the article "Context Tree Weighting: Multi-alphabet Sources" of Tjalkens et al., Proceedings Fourteenth Symposium on Information Theory in the Benelux, Veldhoven, The Netherlands, May 17-18, 1993, pp. 128-135 describes how the Context-Tree Weighting method may be applied to non-binary data. The document "Context Tree Weighting: Multi-alphabet Sources" is herewith incorporated by reference.
[0073] Optionally, in the method 200, the assigning 206 of the sequences x_i of the first set of data X to clusters C comprises clustering 210 the sequences x_i of the first set of data X according to an estimated conditional probability that is based on the counts of the specific leaf node of the context tree in which the respective sequence x_i ends. In the Context Tree Weighting method, counts are registered for each leaf node and one can use these counts to calculate the so-termed Krichevsky and Trofimov estimator (KT estimator). If the counts relate to binary data, then the estimated conditional probability P_ec is defined by:

$$P_{ec}(a, b) = \frac{b + \tfrac{1}{2}}{a + b + 1}$$

wherein a is the number of counted 0's and b is the number of counted 1's. The KT estimator for binary data is also discussed in the above provided references of the CTW algorithm.
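A direct transcription of this estimator; the printed values correspond to the leaf counts of the example discussed below:

```python
def kt_estimator(a, b):
    """Krichevsky-Trofimov estimate of P(y=1) from a counted 0's and b counted 1's."""
    return (b + 0.5) / (a + b + 1.0)

print(kt_estimator(1, 0))  # 0.25      -> leaves of x_1, x_4 and x_7
print(kt_estimator(0, 2))  # 0.8333... -> leaf of x_2 and x_5 (5/6)
print(kt_estimator(1, 1))  # 0.5       -> leaf of x_3 and x_6
```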
[0074] The value of the estimated conditional probability P_ec of a specific leaf node is an estimate of the conditional probability P(y|x_i) of the data of the second set of data given the specific unique sequence x_i of which the path ends in that specific leaf node. More specifically, if the instances of the second set of data are binary data instances, the above given estimated conditional probability P_ec of a specific leaf node is an estimate of the conditional probability P(y=1|x_i), wherein x_i is the unique sequence of which the path in the context tree ends in the specific leaf node. Thus, the estimated conditional probabilities P_ec of the leaf nodes of the context tree can be used to cluster the sequences x_i of the first set of data X. Thus, one may conclude that clusters of leaf nodes are formed and that all sequences x_i of the first set of data X that correspond to the unique sequence that ends in a specific leaf node are assigned to the cluster to which the specific leaf node is assigned.
[0075] The above discussed data sets X_example and Y_example are also the data sets that are used in the above cited papers to explain how a Context Tree Weighting (CTW) tree has to be built and how the estimated probabilities of the nodes of the tree must be calculated. The CTW tree that is built from these data sets on the basis of the Context Tree Weighting method is presented in the above cited papers.
[0076] The estimated conditional probabilities are then for the subsequent sequences x_i: [0077] x_1: a=1, b=0: P_ec=1/4 [0078] x_2, x_5: a=0, b=2: P_ec=5/6 [0079] x_3, x_6: a=1, b=1: P_ec=1/2 [0080] x_4: a=1, b=0: P_ec=1/4 [0081] x_7: a=1, b=0: P_ec=1/4
[0082] It can be seen that the estimated conditional probabilities P_ec are not exactly the same as the conditional probabilities provided above; however, one can also see that there is a clear correlation between the estimated probabilities P_ec and the conditional probabilities.
[0083] Optionally, in the method 200, the assigning of the sequences x_i of the first set of data X to clusters C uses a k-means algorithm to form the clusters C. If the assigning uses the Context Tree Weighting method according to the above discussed embodiments, the k-means algorithm uses the estimated conditional probabilities P_ec of the leaf nodes of the context tree to cluster the sequences x_i of the first set of data X. Please note that the k-means algorithm may also use the earlier discussed conditional probabilities in the same way as is discussed here for the estimated conditional probabilities P_ec. In the k-means algorithm one has to select a number of clusters to be formed. In an embodiment, the number is set to 5, 6 or 7. As the result of the k-means algorithm, sequences that have an estimated conditional probability P_ec close to a specific mean value associated with a specific cluster are put in the specific cluster. One may also select a number of clusters that is equal to √n, because it is known that at such a number of clusters most of the mutual information between the first set of data X and the second set of data Y is maintained in the clusters. The k-means algorithm is, for example, described in the textbook "An Introduction to Statistical Learning" of G. James et al., published by Springer, 2013, New York. This book can also be obtained via the weblink: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf.
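A minimal sketch of this step, clustering the one-dimensional P_ec values of the example with k-means; the use of scikit-learn is an assumption, any k-means implementation would do:

```python
import numpy as np
from sklearn.cluster import KMeans

# P_ec per sequence x_1..x_7, taken from the leaf counts discussed above.
p_ec = np.array([1/4, 5/6, 1/2, 1/4, 5/6, 1/2, 1/4]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(p_ec)
print(kmeans.labels_)
# Sequences with equal P_ec end up in the same cluster, reproducing
# C_1={x_1,x_4,x_7}, C_2={x_2,x_5} and C_3={x_3,x_6}.
```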
[0084] If, for the above discussed example, three clusters are formed on the basis of the estimated conditional probabilities P_ec of the leaf nodes, the sequences x_1, x_4 and x_7 (P_ec=1/4) are assigned to a first cluster, the sequences x_2 and x_5 (P_ec=5/6) are assigned to a second cluster, and the sequences x_3 and x_6 (P_ec=1/2) are assigned to a third cluster.
[0088] As can be seen, this is exactly the same clustering as the clustering that was obtained directly from the conditional probabilities. Therefore it is concluded that using the estimated conditional probabilities of the leaf nodes is a good way of obtaining the same clustering as in the situation in which the conditional probabilities are used directly.
[0089] It has to be noted that other suitable clustering algorithms may also be used and that embodiments are not limited to the use of the k-means clustering algorithm. It may be that the length of the sequences x_i of the first set of data X is relatively long compared to the number n of sequences x_i in this first set of data X. In that case it may be that in some leaf nodes the counts are relatively low, for example smaller than a defined minimum number of observations minObs, which makes the estimated conditional probability of these leaf nodes relatively inaccurate.
[0090] In an embodiment, all leaf nodes of which the total count is smaller than the minimum number of observations minObs are excluded from the clustering on the basis of the estimated conditional probability P_ec, but are assigned to two additional clusters. All sequences x_i that end in a specific leaf node with a total count smaller than the minimum number of observations minObs, where that specific leaf node has an estimated conditional probability P_ec smaller than 0.5, are assigned to a first one of the two additional clusters. All sequences x_i that end in a specific leaf node with a total count smaller than the minimum number of observations minObs, where that specific leaf node has an estimated conditional probability P_ec larger than 0.5, are assigned to a second one of the two additional clusters. All sequences x_i that end in a specific leaf node with a total count smaller than the minimum number of observations minObs, where that specific leaf node has an estimated conditional probability P_ec equal to 0.5, are assigned either to the first one or to the second one of the two additional clusters.
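This assignment rule can be sketched as follows (kt_estimator as defined above; min_obs and the tie-breaking choice for P_ec equal to 0.5 are parameters):

```python
def assign_low_count_leaves(counts, min_obs, tie_to_first=True):
    """Assign leaves with fewer than min_obs observations to two additional
    clusters, based on whether their KT estimate is below or above 0.5."""
    first, second = [], []
    for leaf, (a, b) in counts.items():
        if a + b >= min_obs:
            continue  # enough observations: clustered on P_ec as usual
        p = kt_estimator(a, b)
        if p < 0.5 or (p == 0.5 and tie_to_first):
            first.append(leaf)
        else:
            second.append(leaf)
    return first, second
```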
[0091] Above, the method stages of applying 208 the Context Tree Weighting method and clustering 210 the sequences x_i of the first set of data X are discussed. It is to be noted that the device 100 discussed above, and in particular its sequence clustering unit 110, may be configured to perform these stages as well.
[0092] Optionally, after assigning the sequences x_i of the first set of data X to clusters C, the method may comprise an additional optimization stage 212 wherein the clusters C of the sequences x_i of the first set of data X are further optimized by an iterative optimization method to minimize an optimization function that is derived from, may be based on, or may be equal to a conditional entropy H(Y|C) of the second set of data Y given the data of the clusters C. The calculation of this conditional entropy is as follows:

$$H(Y|C) = -\sum_{c \in C} \sum_{y} P(y,c) \log_2 \frac{P(y,c)}{P(c)}$$

wherein P(c) is the probability that an instance of the first set of data X ends up in the specific cluster c, and P(y,c) is the probability that an instance of the second set of data Y has the value y and is in the specific cluster c. For example, P(c=cluster_1) can be calculated by dividing the number of sequences of the first set of data that are assigned to cluster_1 by the total number of instances n in the first set of data X. For example, if the instances of the second set of data have a binary value: for calculating P(y=0, c=cluster_1) one has to count, for all sequences x_i of the first set of data X that end up in cluster_1, how often their corresponding instance y_i of the second set of data Y is equal to 0; if this count is indicated by N, then P(y=0, c=cluster_1)=N/n, wherein n is the total number of instances in the first set of data X and in the second set of data Y.
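A minimal sketch of this calculation, given the cluster labels of the example clustering found above:

```python
import math
from collections import defaultdict

def conditional_entropy(Y, labels):
    """H(Y|C) = -sum over c and y of P(y,c) * log2(P(y,c) / P(c))."""
    n = len(Y)
    p_c = defaultdict(float)   # P(c)
    p_yc = defaultdict(float)  # P(y, c)
    for y, c in zip(Y, labels):
        p_c[c] += 1.0 / n
        p_yc[(int(y), c)] += 1.0 / n
    return -sum(p * math.log2(p / p_c[c]) for (y, c), p in p_yc.items())

labels = [0, 1, 2, 0, 1, 2, 0]  # C_1, C_2, C_3 of the example
print(conditional_entropy(Y_example, labels))  # 2/7 ≈ 0.286 bit
```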
[0093] In an example, in the additional optimization stage 212 wherein the clusters C of the sequences x_i of the first set of data X are further optimized, in each iteration one or more specific sequences x_i of the first set of data X are moved to another cluster and it is checked whether the conditional entropy H(Y|C) of the second set of data Y given the data of the clusters C becomes smaller; if so, a better solution has been found. In an embodiment, the used iterative optimization method comprises the simulated annealing optimization method. One may, for example, use the approach discussed in the article "Convergence Theorems for a Class of Simulated Annealing Algorithms on R^d" of Claude J. P. Bélisle, Journal of Applied Probability, Vol. 29, No. 4 (December 1992), pp. 885-895. The article of C. J. P. Bélisle is herewith incorporated by reference. In simulated annealing an imaginary temperature function and a probability function are important, and this article provides useful proposals for such functions.
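A schematic sketch of such an iterative optimization; the cooling schedule and acceptance rule below are simple textbook choices, not the specific functions proposed in the cited article (conditional_entropy as defined above):

```python
import math
import random

def anneal_clusters(Y, labels, k, steps=10000, t0=1.0, seed=42):
    """Move single sequences between clusters; accept a worse clustering with a
    temperature-dependent probability, keeping the best H(Y|C) seen so far."""
    rng = random.Random(seed)
    current, best = list(labels), list(labels)
    e_cur = e_best = conditional_entropy(Y, current)
    for step in range(1, steps + 1):
        t = t0 / step  # simple cooling schedule (an assumption)
        i = rng.randrange(len(current))
        old = current[i]
        current[i] = rng.randrange(k)  # propose moving sequence i to a cluster
        e_new = conditional_entropy(Y, current)
        if e_new <= e_cur or rng.random() < math.exp((e_cur - e_new) / t):
            e_cur = e_new
            if e_new < e_best:
                e_best, best = e_new, list(current)
        else:
            current[i] = old  # revert the rejected move
    return best, e_best
```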
[0094] In the above context, it has been described that the first set of data X comprises sequences x_i of instances of the first type of data. The term sequence is used to indicate that the instances of one sequence originate from one specific source, for example, one specific sensor. Each sequence has its own source and all sources are of the same type and have the same function. For example, each sequence originates from one type of sensor that has the same function in machines of the same type. For example, in the context of the machines discussed above, each sequence comprises the measurement results of one of the sensors 302, 352.
[0095] Optionally, one of the above discussed methods of determining clusters may be part of a method of compacting a data set that comprises sequences of instances of the first type of data, wherein the data is based on, for example derived from, sensor data. The method comprises obtaining the data set with the sequences of instances of the first type of data. The obtained data set fulfills the function of the first set of data X in the included method of determining clusters. The output of the included method of determining clusters is subsequently used to replace sequences of instances of the first type of data of the data set by identification data of the clusters to which the respective sequences are assigned in the included method of determining clusters. Identification data is generated for the clusters. The identification data comprises an identification for each cluster that uniquely identifies the respective cluster. The identifications are at least smaller than the sequences of the data set, which means that they can be stored by a fewer number of bits than each sequence of the data set. This method results in a smaller set of data and storage capacity is saved. In an embodiment, the average size of the identifications is minimized given the condition that each cluster must be uniquely identifiable.
[0096] Optionally, one of the above discussed methods of determining clusters may be part of a method of transmitting compacted data comprising at least one sequence of instances of the first type of data. The at least one sequence is a sequence to be transmitted. The method comprises obtaining the at least one sequence. After determining the clusters, one of the clusters is selected as the cluster that best matches the at least one sequence. It may be that the first set of data X that is being used in the method of determining clusters already comprises a sequence that is equal to the at least one sequence, and then the cluster is selected to which that sequence was assigned. It may also be that the first set of data X does not comprise a sequence that is equal to the at least one sequence, and then a cluster is selected that best matches the at least one sequence. For example, each cluster may be represented by a representative sequence (which is, for example, a central sequence in the space that is represented by the cluster) and the cluster is selected of which the representative sequence is at the shortest distance to the at least one sequence. For the clusters identification data is generated. The identification data comprises an identification for each cluster that uniquely identifies the respective cluster. The identifications are at least smaller than the sequences of the data set, which means that they can be stored by a fewer number of bits than each sequence of the data set. Subsequently, the identification data of the selected cluster is transmitted instead of the at least one sequence. Thereby a smaller number of bits is transmitted compared to the situation in which the whole at least one sequence had to be transmitted, and transmission bandwidth and transmission power are saved.
[0098] The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk. Further the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such cable or other device or means. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant method.
[0099] The computer program 480 may be a computer program for a distributed processor system and may comprise computer code which causes a first processor system to perform a subset of the steps of the above discussed method and which causes a second processor system to perform another subset of the steps of the above discussed method. The subset of steps and the other subset of steps may be mutually exclusive.
[0100] In summary, this document provides a device for and a method of determining clusters of sequences of instances of a first type of data for compacting a data set comprising sequences of instances of the first type of data. Also a method of compacting a data set, a method of transmitting compacted data and a computer program product are provided. In a sequence clustering unit of the device, sequences of a first set of data are clustered on the basis of conditional probabilities. Each unique sequence of the first set of data is associated with one or more conditional probabilities that a corresponding instance of a second set of data has a specific value given the unique sequence. In the clustering, a significant part of the mutual information between the first set of data and the second set of data is maintained.
[0101] It is to be noted that the invention may be implemented in hardware and/or software, using programmable components.
[0102] It will be appreciated that the above description for clarity has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units or processors may be used without deviating from the invention. For example, functionality illustrated to be performed by separate units, processors or controllers may be performed by the same processor or controllers. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality rather than indicative of a strict logical or physical structure or organization. The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these.
[0103] It is noted, that in this document the word comprising does not exclude the presence of other elements or steps than those listed and the word a or an preceding an element does not exclude the presence of a plurality of such elements, that any reference signs do not limit the scope of the claims, that the invention may be implemented by means of both hardware and software, and that several means or units may be represented by the same item of hardware or software, and a processor may fulfill the function of one or more units, possibly in cooperation with hardware elements. Further, the invention is not limited to the embodiments, and the invention lies in each and every novel feature or combination of features described above or recited in mutually different dependent claims.