METHOD FOR SEGMENTING AND INDEXING FEATURES FROM MULTIDIMENSIONAL DATA
20180143979 · 2018-05-24
CPC classification
F41B5/123
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
F41B5/126
F41B5/12
Abstract
The invention relates to a method for segmenting and indexing features from multidimensional data, said method comprising the steps of inputting (201) a sequence of tuples in a transformation process, first mapping (204) the tuple sequence to a sequence of hash sums with a rolling hash function, grouping (205) the sequence of tuple hash sums into a sequence of overlapping and contiguous sequences of n tuple hash sums, second mapping (206) the resulting sequence of n tuple hash sums to a sequence of n-gram hash sums, and segmenting (207) the sequence of n-gram hash sums into chunks of tuples using a segmentation method selected from the group comprising at least one of Content-Defined Chunking and Winnowing.
Claims
1. A computer-implemented method for segmenting and indexing features from multidimensional data, said method comprising the steps of: inputting (201) a sequence of tuples in a transformation process, mapping (204) the tuple sequence to a sequence of hash sums with a rolling hash function, grouping (205) the sequence of tuple hash sums into a sequence of overlapping and contiguous sequences of n tuple hash sums, mapping (206) the resulting sequence of n tuple hash sums to a sequence of n-gram hash sums, segmenting (207) the sequence of n-gram hash sums into chunks of tuples using a segmentation method selected from the group comprising at least one of Content-Defined Chunking and Winnowing.
2. The computer-implemented method according to claim 1, wherein said hash function produces 32 bit or 64 bit integers.
3. The computer-implemented method according to claim 1, wherein said rolling hash function slides over hashed tuples in order to produce fingerprints.
4. The computer-implemented method according to claim 1, wherein said segmenting method is configured to divide the sequence of tuples into overlapping or non-overlapping chunks of fixed or variable sizes.
5. The computer-implemented method according to claim 1, wherein said chunks are larger than or equal to a minimal chunk size (min) and smaller than or equal to a maximal chunk size (max).
6. The computer-implemented method according to claim 1, wherein said predetermined condition is that said result of said first hash function (h) is a multiple of a predetermined divisor (d) (h mod d=0).
7. The computer-implemented method according to claim 1, comprising a preliminary step of normalization of said sequence of tuples before the first mapping step (204).
8. The computer-implemented method according to claim 7, wherein said step of normalizing comprises replacing each tuple of said sequence of tuples by a normalized tuple representative of a subspace comprising said tuple.
9. The computer-implemented method according to claim 1, comprising a preliminary step of extracting meaningful data only from a sequence of multidimensional data in order to build said sequence of tuples.
10. The computer-implemented method according to claim 1, wherein said sequence of tuples is a GPS trajectory and said tuples are GPS locations.
11. The computer-implemented method according to claim 10, wherein longitude and latitude coordinates of each GPS location are replaced by longitude and latitude coordinates of a center location of a user defined square in which said GPS location lies.
12. The computer-implemented method according to claim 11, wherein the longitude and latitude coordinates of each GPS location are replaced by coordinates of a road network by means of map matching.
Description
[0030] The invention will be better understood by reading the following description, illustrated by the figures.
[0034] The method of the invention can be used with any kind of sequential data but preferably applies to sequences of tuples, i.e. to sequences of multidimensional pieces of information, in order to extract features from these sequences.
[0035] In the embodiments, the tuples preferably belong to an infinite alphabet, i.e. to very large or infinite multidimensional space. This is for example the case when the tuples comprise elements that may be characterized by any real number, such as for example the coordinates produced by a GPS tracker, and/or an element that belongs to an infinite dimension, such as time, for example.
[0036] For the sake of clarity, the method of the invention is illustrated in the description below in relation with GPS trajectories. It can however be applied to any sequence of multidimensional data, such as for example other spatial and/or temporal data.
[0038] In the context of a GPS tracker, emitters would correspond to sensors embedded in phones, cars or dedicated devices. At regular intervals, the GPS coordinates recorded by the trackers are sent to a server which stores them in a database. The sequence of saved coordinates corresponds to the trajectory followed by the tracker. The saved trajectories then go through the transformation process. The resulting segments correspond to characteristic sub-trajectories which belong to the life-long trajectory of the tracker. These segments, as well as the identifier of the tracker, are stored in the inverted index.
[0040] The sequence of tuples then goes through a normalization phase 202. Normalization aims at removing superficial differences, so that a match can occur on similar tuples even though they are not identical. Tuple normalization for example comprises assigning a normalized value to each tuple, for example by rounding the value of one or more elements of the tuple up or down. Normalization can be viewed as a subdivision of the tuples' multidimensional space into subspaces, for example subspaces of equal sizes, and the assignment of a normalized tuple to each subspace. The normalized tuple for example corresponds to the coordinates of the center of the subspace. Each tuple of the sequence of tuples is then normalized in that it is replaced by the normalized tuple representative of the subspace to which it belongs. In the context of a GPS tracker, two trackers following the same route may record completely different sequences of coordinates. A normalization step relying solely on the process described above may not be sufficient. Consequently, in addition to space discretization, the normalization phase may include a step that maps the recorded GPS coordinates to the coordinates of an existing road network using map matching techniques.
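The space discretization described above can be sketched in a few lines of Python. This is an illustrative example only: the function name, the cell size of 0.001 degree, and the sample coordinates are assumptions, not values from the invention.

```python
import math

def normalize(point, cell=0.001):
    """Replace (lat, lon) by the center of the grid square in which it lies.

    Each square has a side of `cell` degrees; two nearby readings that fall
    in the same square map to the same normalized tuple.
    """
    lat, lon = point
    return (math.floor(lat / cell) * cell + cell / 2,
            math.floor(lon / cell) * cell + cell / 2)

# Two slightly different readings of the same location normalize identically:
a = normalize((46.51937, 6.56612))
b = normalize((46.51941, 6.56655))
```

Here the normalized tuple is the center of the square, as in the paragraph above; a real deployment would choose the cell size according to the expected GPS noise.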
[0041] The resulting sequence of normalized tuples then goes through the specialized fingerprinting phase 203, which produces hash sums. In contrast with bytes and characters, tuples may belong to an infinite alphabet, and it is not possible to pre-compute random irreducible polynomials for a vocabulary of unknown size, i.e. for all possible chunks of tuples. The hash sums produced by hashing the tuples must therefore be random enough to avoid collisions, i.e. such that a same hash sum h may not correspond to two or more different series of tuples.
[0042] In order to better understand the underlying mechanisms of phase 203, we decompose it into three actions 204, 205, 206. The sequence of tuples is first mapped 204 to a sequence of hash sums. Tuple hash sums consist of integers, and the mapping can be achieved with any kind of hash function which spreads integer sums evenly enough. Furthermore, the hash sums must be random enough to avoid as many collisions as possible, i.e. such that a same hash sum may not correspond to two or more different tuples. In most cases, a fast non-cryptographic hash function such as MurmurHash, which produces 32 bit integers, will give satisfactory results. However, the use of another strong hash function is possible within the frame of the invention. Furthermore, in data intensive cases, fast custom hash functions, which assume some knowledge about the handled tuples, may be devised. The sequence of tuple hash sums is then grouped 205 into a sequence of overlapping and contiguous sequences of n tuple hash sums, also called n-grams in this context. The resulting sequence of n-grams is then mapped 206 to a sequence of n-gram hash sums, which also consist of integers. In that case, a specialized hash function which hashes sets of integers into evenly distributed hash sums is required.
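The three actions 204, 205 and 206 can be illustrated as follows. This is a minimal sketch, with Python's built-in `hash` standing in for MurmurHash and for the specialized n-gram hash function (an assumption made for brevity; the sample coordinates are also invented):

```python
def tuple_hashes(tuples):
    # Action 204: map each tuple to a 32 bit integer hash sum.
    return [hash(t) & 0xFFFFFFFF for t in tuples]

def ngrams(hashes, n):
    # Action 205: group into overlapping, contiguous sequences of n hash sums.
    return [tuple(hashes[i:i + n]) for i in range(len(hashes) - n + 1)]

def ngram_hashes(grams):
    # Action 206: map each n-gram to a single integer hash sum.
    return [hash(g) & 0xFFFFFFFF for g in grams]

seq = [(46.5195, 6.5665), (46.5195, 6.5675),
       (46.5205, 6.5675), (46.5205, 6.5685)]
hs = tuple_hashes(seq)
grams = ngrams(hs, n=2)          # 4 tuples yield 3 overlapping 2-grams
fingerprints = ngram_hashes(grams)
```

In the invention this naive recomputation per window is replaced by the rolling hash of Algorithm 1, which updates each fingerprint from the previous one.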
[0043] In consequence, and since computing hash sums for all the overlapping n-grams in a sequence can be computationally expensive, Algorithm 1 below depicts an efficient specialized fingerprinting method which slides over a sequence of tuples with a window of size n and produces a sequence of n-gram hash sums.
Algorithm 1: Token-based rolling hash function

  initialize(s):
      a <- 31
      b <- a^s
      hash <- a Murmur hash function
      window <- an array of size s filled with 0
      position <- 0
      h <- 0

  slide(token):
      in <- hash.digest(token)
      out <- window[position]
      window[position] <- in
      position <- (position + 1) mod s
      h <- a * h + in - b * out
      return h
[0044] Except for the first hash sum, the computation of the hash sum at each position does not require going through all the tuples contained in the window. The hash sum can be efficiently computed by performing simple operations on the previous hash sum. These operations involve the previous hash sum, the tuple which goes out of the window and the tuple which comes into the window. Traditionally, rolling hash functions consume tuples which belong to an alphabet of finite size, such as bytes or encoded characters. Predefined sets of random irreducible polynomials, each of which corresponds to a tuple of the alphabet, are used to produce evenly distributed hash sums. The method we propose relies on the hash sums of the tuples instead of predefined random irreducible polynomials and can therefore be used with tuples which belong to an infinite alphabet. As described in Algorithm 1, the tuple hash sums may be computed inside the sliding window algorithm, but other variants are possible, since hashing may occur at an earlier stage of the transformation process.
[0045] The sequences of n-gram hash sums which result from the fingerprinting phase 203 then go through a segmenting phase 207. This segmenting phase can be performed using techniques such as content-defined chunking or winnowing. Regardless of the segmentation technique used, configuration parameters are generally set to produce segments, also called chunks, larger than or equal to a minimal size (min) and smaller than or equal to a maximal size (max). In addition, content-defined chunking would typically use a h mod d=0 condition, where h is a hash sum coming from the fingerprinting phase and d is a predefined constant divisor, in order to find segment boundaries. Winnowing would typically rely on an additional grouping phase to find segment boundaries. The resulting sequences of segments are then hashed 208 with a strong hash function such as SHA in order to create the terms that will be stored in the inverted index.
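The content-defined chunking boundary rule described above can be sketched as follows. The divisor d, the min/max sizes and the fingerprint values are illustrative assumptions; only the h mod d = 0 cut condition and the min/max constraints come from the description:

```python
def cdc_boundaries(fingerprints, d=4, min_size=2, max_size=8):
    """Split a fingerprint sequence into chunk lengths.

    A boundary is placed where a fingerprint h satisfies h % d == 0,
    provided the chunk has reached min_size; a cut is forced at max_size.
    """
    chunks, start = [], 0
    for i, h in enumerate(fingerprints):
        size = i - start + 1
        if (size >= min_size and h % d == 0) or size >= max_size:
            chunks.append(size)
            start = i + 1
    if start < len(fingerprints):          # trailing partial chunk
        chunks.append(len(fingerprints) - start)
    return chunks

sizes = cdc_boundaries([7, 12, 5, 9, 16, 3, 11, 8, 2])
# Cuts fall after the fingerprints 12, 16 and 8 (multiples of 4),
# giving chunk sizes [2, 3, 3, 1].
```

Because boundaries depend only on local fingerprint values, an insertion in the middle of a trajectory shifts at most the chunks around it, which is the property that makes content-defined chunking robust for indexing.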
[0047] In the context of a GPS tracker, several trackers are supposed to emit and transmit coordinates to the server. The resulting inverted index contains sub-trajectories, each of which points to a list of tracker identifiers. On the basis of this inverted index, it becomes possible to formulate queries which correspond to specific trajectories. It is then possible to find all the trackers that emitted sequences of coordinates sharing some common sub-trajectories with the trajectory specified in the query.
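A minimal sketch of this inverted index and its query follows. The tracker identifiers and the string segment terms are hypothetical placeholders; in the invention the terms would be the SHA hash sums produced in step 208:

```python
from collections import defaultdict

# Term (segment hash) -> set of tracker identifiers containing that segment.
index = defaultdict(set)

def index_segments(tracker_id, segment_terms):
    """Store a tracker's segment terms in the inverted index."""
    for term in segment_terms:
        index[term].add(tracker_id)

def query(segment_terms):
    """Trackers sharing at least one sub-trajectory with the query."""
    return set().union(*(index[t] for t in segment_terms if t in index))

index_segments("tracker-A", ["seg1", "seg2"])
index_segments("tracker-B", ["seg2", "seg3"])
matches = query(["seg2"])   # both trackers traversed this sub-trajectory
```

A production system would rank matches, for example by the number of shared segment terms, rather than returning the plain union.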