METHOD FOR SEGMENTING AND INDEXING FEATURES FROM MULTIDIMENSIONAL DATA

20180143979 · 2018-05-24


    Abstract

    The invention relates to a method for segmenting and indexing features from multidimensional data, said method comprising the steps of inputting (201) a sequence of tuples in a transformation process, first mapping (204) the tuple sequence to a sequence of hash sums with a rolling hash function, grouping (205) the sequence of tuple hash sums into a sequence of overlapping and contiguous sequences of n tuple hash sums, second mapping (206) the resulting sequence of n tuple hash sums to a sequence of n-gram hash sums, and segmenting (207) the sequence of n-gram hash sums into chunks of tuples using a segmentation method selected from the group comprising at least one of Content-Defined Chunking and Winnowing.

    Claims

    1. A computer-implemented method for segmenting and indexing features from multidimensional data, said method comprising the steps of: inputting (201) a sequence of tuples in a transformation process, mapping (204) the tuple sequence to a sequence of hash sums with a rolling hash function, grouping (205) the sequence of tuple hash sums into a sequence of overlapping and contiguous sequences of n tuple hash sums, mapping (206) the resulting sequence of n tuple hash sums to a sequence of n-gram hash sums, and segmenting (207) the sequence of n-gram hash sums into chunks of tuples using a segmentation method selected from the group comprising at least one of Content-Defined Chunking and Winnowing.

    2. The computer-implemented method according to claim 1, wherein said hash function produces 32-bit or 64-bit integers.

    3. The computer-implemented method according to claim 1, wherein said rolling hash function slides over hashed tuples in order to produce fingerprints.

    4. The computer-implemented method according to claim 1, wherein said segmenting method is configured to divide the sequence of tuples into overlapping or non-overlapping chunks of fixed or variable sizes.

    5. The computer-implemented method according to claim 1, wherein said chunks are larger than or equal to a minimal chunk size (min) and smaller than or equal to a maximal chunk size (max).

    6. The computer-implemented method according to claim 1, wherein a predetermined condition for placing a chunk boundary is that the result (h) of said hash function is a multiple of a predetermined divisor (d) (h mod d=0).

    7. The computer-implemented method according to claim 1, comprising a preliminary step of normalization of said sequence of tuples before the first mapping step (204).

    8. The computer-implemented method according to claim 7, wherein said step of normalization comprises replacing each tuple of said sequence of tuples by a normalized tuple representative of a subspace comprising said tuple.

    9. The computer-implemented method according to claim 1, comprising a preliminary step of extracting meaningful data only from a sequence of multidimensional data in order to build said sequence of tuples.

    10. The computer-implemented method according to claim 1, wherein said sequence of tuples is a GPS trajectory and said tuples are GPS locations.

    11. The computer-implemented method according to claim 10, wherein longitude and latitude coordinates of each GPS location are replaced by longitude and latitude coordinates of a center location of a user defined square in which said GPS location lies.

    12. The computer-implemented method according to claim 11, wherein the longitude and latitude coordinates of each GPS location are replaced by coordinates of a road network by means of map matching.

    Description

    [0030] The invention will be better understood by reading the following description, illustrated by the figures below, wherein:

    [0031] FIG. 1 represents an exemplary architecture for a client-server system which indexes sequences of multidimensional data.

    [0032] FIG. 2 represents the transformation process according to a preferred embodiment of the present invention.

    [0033] FIG. 3 represents an inverted index according to an embodiment of the present invention.

    [0034] The method of the invention can be used with any kind of sequential data but preferably applies to sequences of tuples, i.e. to sequences of multidimensional pieces of information, in order to extract features from these sequences.

    [0035] In the embodiments, the tuples preferably belong to an infinite alphabet, i.e. to a very large or infinite multidimensional space. This is for example the case when the tuples comprise elements that may be characterized by any real number, such as the coordinates produced by a GPS tracker, and/or an element that belongs to an infinite dimension, such as time, for example.

    [0036] For the sake of clarity, the method of the invention is illustrated in the description below in relation with GPS trajectories. It can however be applied to any sequence of multidimensional data, such as for example other spatial and/or temporal data.

    [0037] FIG. 1 depicts an example architecture for a client-server system which indexes sequences of multidimensional data. Emitters 101 include any kind of devices, applications or systems which are able to produce sequences of multidimensional data. Emitters transmit data, which consist of sequences of tuples 111, to the server 103. These sequences of tuples can be stored in a database 113 before going through the transformation process 114 and being added to the inverted index 115. Searchers 102 include any kind of devices, applications or systems which can send a query to the server 103. Queries 112 also consist of sequences of tuples. However, instead of being stored, these sequences go directly through the transformation process 114, and the resulting segments are used to search for similar sequences of tuples in the inverted index 115.

    [0038] In the context of a GPS tracker, emitters would correspond to sensors embedded in phones, cars or dedicated devices. At regular intervals, the GPS coordinates recorded by the trackers are sent to a server which stores them in a database. The sequence of saved coordinates corresponds to the trajectory followed by the tracker. The saved trajectories then go through the transformation process. The resulting segments correspond to characteristic sub-trajectories which belong to the life-long trajectory of the tracker. These segments, together with the identifier of the tracker, are stored in the inverted index.

    [0039] FIG. 2 depicts the transformation process in more detail. The input of the transformation process is a sequence of tuples 201. According to embodiments, and depending on the nature and/or the source of the data, the sequences of multidimensional data to be processed according to the method of the invention may require some pre-processing, such as data extraction and/or splitting into series of tuples. Data extraction for example includes extracting only the meaningful data from the sequences of multidimensional data and discarding and/or deleting unnecessary data. The extracted data is then for example split into tuples, i.e. into series of meaningful multidimensional pieces of information, thereby resulting in sequences of tuples. In the context of a GPS tracker, the sequence of tuples may correspond to pairs of longitude/latitude coordinates extracted from a binary trace file.
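
    The pre-processing step above can be sketched as follows; the record format and the field names lon and lat are assumptions made for this illustration, not part of the invention:

```python
def extract_trajectory(records):
    """Keep only the longitude/latitude pair of each record and discard
    the rest. The record format and field names are assumptions made
    for this example."""
    return [(r["lon"], r["lat"]) for r in records if "lon" in r and "lat" in r]

raw = [
    {"lon": 6.14, "lat": 46.20, "battery": 0.81},
    {"status": "boot"},                            # no coordinates: discarded
    {"lon": 6.15, "lat": 46.21, "battery": 0.80},
]
trajectory = extract_trajectory(raw)
```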

    [0040] The sequence of tuples then goes through a normalization phase 202. Normalization aims at removing superficial differences, so that a match can occur on similar tuples even though they are not identical. Tuple normalization for example comprises assigning a normalized value to each tuple, for example by rounding up or down the value of one or more elements of the tuple. Normalization can be viewed as a subdivision of the tuples' multidimensional space into subspaces, for example subspaces of equal sizes, and the assignment of a normalized tuple to each subspace. The normalized tuple for example corresponds to the coordinates of the center of the subspace. Each tuple of the sequence of tuples is normalized in that it is replaced by the normalized tuple representative of the subspace to which the tuple belongs. In the context of a GPS tracker, two trackers following the same route may record completely different sequences of coordinates. A normalization step relying solely on the process described above may not be sufficient. Consequently, in addition to space discretization, normalization may include a step that maps the recorded GPS coordinates to the coordinates of an existing road network using map matching techniques.
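
    The space discretization described above can be sketched as follows, assuming a user-defined square grid; the cell_size parameter and its value are choices of the illustration:

```python
import math

def normalize_location(lon, lat, cell_size=0.001):
    """Replace a GPS location by the center of the user-defined square it
    lies in. cell_size, in degrees, is an assumed parameter of this sketch
    (0.001 degrees is roughly 100 m at the equator)."""
    col = math.floor(lon / cell_size)
    row = math.floor(lat / cell_size)
    # Center of the enclosing square becomes the normalized tuple.
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)

# Two nearby readings of the same place normalize to the same tuple:
a = normalize_location(6.14231, 46.20412)
b = normalize_location(6.14219, 46.20435)
```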

    [0041] The resulting sequence of normalized tuples then goes through the specialized fingerprinting phase 203, which produces hash sums. In contrast with bytes and characters, tuples may belong to an infinite alphabet, and it is not possible to pre-compute random irreducible polynomials for a vocabulary of an unknown size, i.e. for all possible chunks of tuples. The hash sums produced by hashing the tuples must therefore be random enough to avoid collisions, i.e. such that a same hash sum h may not correspond to two or more different series of tuples.

    [0042] In order to better understand the underlying mechanisms of phase 203, we decompose it into three actions 204, 205, 206. The sequence of tuples is first mapped 204 to a sequence of hash sums. Tuple hash sums consist of integers, and the mapping can be achieved with any kind of hash function which spreads integer sums evenly enough. Furthermore, the hash sums must be random enough to avoid as many collisions as possible, i.e. such that a same hash sum may not correspond to two or more different tuples. In most cases, a fast non-cryptographic hash function such as MurmurHash, which produces 32-bit integers, will give satisfactory results. However, the use of another strong hash function is possible within the frame of the invention. Furthermore, in data-intensive cases, fast custom hash functions, which assume some knowledge about the handled tuples, may be devised. The sequence of tuple hash sums is then grouped 205 into a sequence of overlapping and contiguous sequences of n tuple hash sums, also called n-grams in this context. The resulting sequence of n-grams is then mapped 206 to a sequence of n-gram hash sums, which also consist of integers. In that case, a specialized hash function which hashes sets of integers into evenly distributed hash sums is required.
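
    The three actions 204, 205, 206 can be sketched as follows; using blake2b as a stand-in for MurmurHash and a polynomial combination as the specialized n-gram hash are assumptions of this illustration:

```python
import hashlib

def tuple_hash(t):
    """Action 204: map one tuple to a 32-bit integer hash sum (blake2b
    stands in for a fast non-cryptographic hash such as MurmurHash)."""
    return int.from_bytes(hashlib.blake2b(repr(t).encode(), digest_size=4).digest(), "big")

def ngram_hash(sums, a=31):
    """Action 206: combine the hash sums of one n-gram into a single
    integer; a polynomial combination is an assumption of this sketch."""
    h = 0
    for s in sums:
        h = (a * h + s) % 2 ** 32
    return h

def fingerprint(tuples, n=3):
    tuple_sums = [tuple_hash(t) for t in tuples]                            # 204
    ngrams = [tuple_sums[i:i + n] for i in range(len(tuple_sums) - n + 1)]  # 205
    return [ngram_hash(g) for g in ngrams]                                  # 206

trajectory = [(6.1425, 46.2045), (6.1435, 46.2045), (6.1445, 46.2055), (6.1455, 46.2065)]
fp = fingerprint(trajectory)  # 4 tuples, n = 3: two overlapping n-grams
```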

    [0043] In consequence, and since computing hash sums for all the overlapping n-grams in a sequence can be computationally expensive, Algorithm 1 below depicts an efficient specialized fingerprinting method which slides over a sequence of tuples with a window of size n and produces a sequence of n-gram hash sums.

    TABLE-US-00001
    Algorithm 1 Token-based rolling hash function
      initialize(s):
        a ← 31
        b ← a^s
        hash ← a murmur hash function
        window ← an array of size s filled with 0
        position ← 0
        h ← 0
      slide(token):
        in ← hash.digest(token)
        out ← window(position)
        window(position) ← in
        position ← (position + 1) mod s
        h ← a · h + in − b · out
        return h

    [0044] Except for the first hash sum, the computation of the hash sum at each position does not require going through all the tuples contained in the window. The hash sum can be efficiently computed by performing simple operations on the previous hash sum. These operations involve the previous hash sum, the tuple which goes out of the window and the tuple which comes into the window. Traditionally, rolling hash functions consume tuples which belong to an alphabet of a finite size, such as bytes or encoded characters. Predefined sets of random irreducible polynomials, each of which corresponds to a tuple of the alphabet, are used to produce evenly distributed hash sums. The method we propose relies on the hash sums of the tuples instead of predefined random irreducible polynomials and can therefore be used with tuples which belong to an infinite alphabet. As described in Algorithm 1, the tuple hash sums may be computed inside the sliding window algorithm, but other variants are possible, since hashing may occur at an earlier stage of the transformation process.
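
    The structure of Algorithm 1 may be sketched as follows; working modulo 2**32 with a = 31 and using blake2b as a stand-in tuple hash are choices of this illustration, not prescribed by the invention:

```python
import hashlib

def tuple_digest(t):
    """Map a tuple to a 32-bit integer hash sum (blake2b stands in for a
    fast non-cryptographic hash such as MurmurHash)."""
    return int.from_bytes(hashlib.blake2b(repr(t).encode(), digest_size=4).digest(), "big")

class RollingHash:
    """Minimal sketch of the token-based rolling hash of Algorithm 1."""

    MOD = 2 ** 32

    def __init__(self, s, digest):
        self.a = 31
        self.b = pow(31, s, self.MOD)  # a**s, pre-computed once
        self.digest = digest
        self.window = [0] * s          # last s tuple hash sums
        self.position = 0
        self.s = s
        self.h = 0

    def slide(self, token):
        incoming = self.digest(token)            # hash sum entering the window
        outgoing = self.window[self.position]    # hash sum leaving the window
        self.window[self.position] = incoming
        self.position = (self.position + 1) % self.s
        # Shift the previous sum, add the incoming term, and drop the
        # outgoing term scaled by a**s.
        self.h = (self.a * self.h + incoming - self.b * outgoing) % self.MOD
        return self.h

rh = RollingHash(3, tuple_digest)
tokens = [(float(i), float(i + 1)) for i in range(6)]
rolled = [rh.slide(t) for t in tokens]

# The final rolling sum equals a direct polynomial hash of the last window,
# computed from scratch.
direct = 0
for t in tokens[3:]:
    direct = (31 * direct + tuple_digest(t)) % 2 ** 32
```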

    [0045] The sequences of n-gram hash sums which result from the fingerprinting phase 203 then go through a segmenting phase 207. Such a segmenting phase can be performed without restriction using techniques such as content-defined chunking or winnowing. Regardless of the segmentation technique used, configuration parameters are generally set to produce segments, also called chunks, larger than or equal to a minimal size (min) and smaller than or equal to a maximal size (max). In addition, content-defined chunking would typically use a h mod d=0 condition, where h is a hash sum coming from the fingerprinting phase and d is a predefined constant divisor, in order to find segment boundaries. Winnowing would typically rely on an additional grouping phase to find segment boundaries. The resulting sequences of segments are then hashed 208 with a strong hash function such as SHA in order to create the terms that will be stored in the inverted index.
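
    The content-defined chunking variant of phase 207 can be sketched as follows; all parameter values are illustrative:

```python
def content_defined_chunks(hash_sums, d=64, min_size=4, max_size=16):
    """Split a sequence of n-gram hash sums into chunks. A boundary is
    placed after position i when hash_sums[i] mod d == 0, subject to the
    min/max chunk sizes; all parameter values here are illustrative."""
    chunks, start = [], 0
    for i, h in enumerate(hash_sums):
        size = i - start + 1
        if (size >= min_size and h % d == 0) or size >= max_size:
            chunks.append(hash_sums[start:i + 1])
            start = i + 1
    if start < len(hash_sums):
        chunks.append(hash_sums[start:])  # trailing chunk, possibly short
    return chunks

# With d = 10, boundaries fall on hash sums that are multiples of 10:
cs = content_defined_chunks(list(range(1, 100)), d=10, min_size=3, max_size=20)
```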

    [0046] FIG. 3 depicts an inverted index 304 in its simplest form. In this example, six sequences of tuples S1, S2, S3, S4, S5 and S6 have been indexed. Three segment hash sums H1, H2, H3 have been extracted by the transformation process from these six sequences of tuples. The inverted index highlights which segment hash sums belong to which sequence of tuples. When a query is performed 301, the sequence of tuples which forms the query Sq first goes through the transformation process 302. Here, two segment hash sums H1, H3 are extracted from the query. When processing queries 303, the system first searches for the segments H1, H3 in the terms of the inverted index. Then, the related postings lists, which contain sequence identifiers, are retrieved and duplicates are removed 305 in order to form the result set 306. The sequences of the result set S1, S2, S3 and S4 are guaranteed to share some common segments with the sequence of the query Sq. The more segments are shared, the more similar the sequences are assumed to be. In addition, simple similarity measures such as the Jaccard similarity, possibly estimated with MinHash, can be used to assess how similar two sequences are.
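
    The inverted index and the query processing described above can be sketched as follows; the segment values are illustrative stand-ins for the segments behind H1, H2, H3:

```python
import hashlib
from collections import defaultdict

def segment_term(segment):
    """Hash a segment with a strong hash function (SHA-256 here) to
    obtain an index term."""
    return hashlib.sha256(repr(segment).encode()).hexdigest()

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of sequence identifiers

    def add(self, seq_id, segments):
        for seg in segments:
            self.postings[segment_term(seg)].add(seq_id)

    def query(self, segments):
        """Union of the postings lists of the query's segments; using a
        set removes duplicate sequence identifiers."""
        result = set()
        for seg in segments:
            result |= self.postings.get(segment_term(seg), set())
        return result

index = InvertedIndex()
index.add("S1", [(1, 2, 3), (4, 5, 6)])  # illustrative segments
index.add("S2", [(4, 5, 6), (7, 8, 9)])
hits = index.query([(1, 2, 3), (7, 8, 9)])  # query shares a segment with S1 and S2
```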

    [0047] In the context of a GPS tracker, several trackers are supposed to emit and transmit coordinates to the server. The resulting inverted index contains sub-trajectories, each of which points to a list of tracker identifiers. On the basis of this inverted index, it becomes possible to formulate queries which correspond to specific trajectories. It is then possible to find all the trackers which emitted sequences of coordinates that share some common sub-trajectories with the trajectory specified in the query.