Systems and Methods for Extracting and Matching Descriptors from Data Structures Describing an Image Sequence
20170316269 · 2017-11-02
CPC classification: G06V20/46 (PHYSICS); G06V10/462 (PHYSICS); G06V20/49 (PHYSICS)
Abstract
A compact image sequence descriptor (101), used for describing an image sequence, comprises a segment global descriptor (113) for at least one segment within the sequence, which includes global descriptor information for respective images, relating to interest points within the video content of the images. The segment global descriptor (113) includes a base descriptor (121), which is a global descriptor associated with a representative frame (120) of the image sequence, and a number of relative descriptors (125). The relative descriptors contain information of a respective global descriptor relative to the base descriptor, allowing reconstruction of an exact or approximated global descriptor associated with a respective image of the image sequence. The image sequence descriptor (101) may further include a segment local descriptor (114) for a segment, comprising a set of encoded local feature descriptors.
Claims
1. A data structure for describing an image sequence containing a plurality of images in a predetermined order, said data structure including an image sequence descriptor comprising: a base descriptor, said base descriptor representing a global descriptor associated with a specific image of the image sequence, referred to as representative frame; and a number of relative descriptors relating to global descriptors associated with images in the image sequence, each of said number of relative descriptors containing information of a respective global descriptor relative to the base descriptor allowing for reconstruction of a global descriptor associated with a respective image of the image sequence from the relative descriptor and the base descriptor, wherein each of said global descriptors is associated with a respective image of the image sequence and contains aggregated information relating to interest points within video content of the respective image.
2. The data structure of claim 1, wherein the relative descriptors contain an encoded difference between the respective global descriptor and the base global descriptor, wherein the difference is determined using a predefined difference function.
3. The data structure of claim 1, wherein the image sequence descriptor further comprises, for each of a number of segments within the image sequence, a segment local descriptor, said segment local descriptor comprising a set of encoded local feature descriptors.
4. The data structure of claim 1, wherein the image sequence descriptor further comprises data selected from the group consisting of data indicating relative temporal positions of the images with which the global descriptors are associated; data indicating relative temporal positions of images with which local descriptors are associated; data indicating spatial positions of features in images to which local descriptors refer; and data representing relevance information of global descriptors and/or local descriptors.
5. The data structure of claim 2, wherein the global descriptors are descriptors coded according to a method selected from the group consisting of Fisher Vectors, SCFV, CDVS, VLAD, VLAT, and features obtained from layers of trained Deep Convolutional Neural Networks.
6. The data structure of claim 3, wherein the local feature descriptors are local descriptors coded according to a method selected from the group consisting of CDVS, SIFT, SURF, ORB, and features obtained from layers of trained Deep Convolutional Neural Networks.
7. A method for describing an image sequence, said image sequence containing a plurality of images in a predetermined order, the method comprising: detecting interest points in each image; extracting local features from each image, said local features relating to the interest points detected; and aggregating said local features in each image to form a global descriptor of each image, wherein the following steps are performed for at least one segment of the image sequence: selecting a representative frame, choosing the global descriptor associated with the representative frame as a base descriptor for the segment; determining relative descriptors from global descriptors associated with images in the segment, each of said relative descriptors containing information of a respective global descriptor relative to the base descriptor; and generating an image sequence descriptor by encoding the base descriptor and relative descriptors.
8. The method of claim 7, further comprising the following step performed for at least one segment of the image sequence: generating a segment local descriptor and encoding it into the image sequence descriptor, said segment local descriptor comprising a set of encoded local feature descriptors.
9. The method of claim 7, further comprising segmenting the image sequence by dividing the image sequence into a number of mutually disjoint segments based on the global descriptors of the images, each segment comprising a number of consecutive images from the image sequence.
10. The method of claim 7, wherein in the step of selecting a representative frame, the representative frame is chosen as a medoid frame among the images of the respective segment based on a predefined distance function on global descriptors of images.
11. The method of claim 7, wherein in the step of determining relative descriptors, the relative descriptors are determined by encoding the difference between the respective global descriptor and the base global descriptor, wherein the difference is determined using a predefined difference function.
12. The method of claim 7, wherein during determining relative descriptors, descriptors that correspond to a difference smaller than a predetermined threshold value (θ.sub.g) are omitted, and the remaining relative descriptors are encoded using an entropy coding method.
13. The method of claim 12, wherein a maximum size is predefined and the threshold value is controlled so as to adjust the size of the resulting image sequence descriptor to fall below the maximum size.
14. The method of claim 7, further comprising the step of applying filtering, aggregation and compression of local features to obtain a set of local feature descriptors, wherein during the step of applying filtering, aggregation and compression of local features, the set of local feature descriptors is filtered to exclude all local feature descriptors that are more similar to any of the local descriptors already encoded, with regard to a predetermined similarity function and a predetermined threshold (θ.sub.l) of similarity, and for each of the remaining local feature descriptors, the difference to the most similar of the local feature descriptors already encoded is determined and the difference thus obtained is encoded using an entropy coding method.
15. The method of claim 14, wherein a maximum size is predefined and said threshold is controlled so as to adjust the size of the resulting image sequence descriptor to fall below the maximum size.
16. The method of claim 7, further comprising sampling a subset of images from the image sequence, wherein said subset of images is used as input in place of the images of the image sequence, and wherein the images in the input are processed in temporal order.
17. The method of claim 7, further comprising sampling a subset of images from the image sequence, wherein said subset of images is used as input in place of the images of the image sequence, and wherein the images in the input are processed in the order of a value yielded by a function of a counter of the images in the input.
18. The method of claim 7, wherein the resulting image sequence descriptor is serialized and transferred to a bitstream, file, or database.
19. A method for matching two image sequence descriptors, the method comprising: determining a scalar distance value between the two image sequence descriptors by performing a distance calculation between base descriptors of the image sequence descriptors and distance calculations between global descriptors of either image sequence descriptors.
20. The method of claim 19, wherein the distance calculation is performed from coarse to fine temporal resolution for efficiency of the calculation, wherein said global descriptors are reconstructed until a number of global descriptors is reached which is precalculated from the length of the image sequences underlying the image sequence descriptors.
21. A method for retrieving from a set of image sequences using a reference image sequence, the method comprising: obtaining an image sequence descriptor relating to the reference image sequence, matching said image sequence descriptor with image sequence descriptors relating to image sequences from said set using a matching function, and evaluating results thus obtained to obtain at least one of a retrieval measure, an image sequence descriptor that represents a best match within said set, and data identifying an image sequence from the set representing a best match.
22. The method of claim 21, further comprising matching two image sequence descriptors by determining a scalar distance value between the two image sequence descriptors by performing a distance calculation between base descriptors of the image sequence descriptors and distance calculations between global descriptors of either image sequence descriptors.
23. The method of claim 22, wherein the distance calculation is performed from coarse to fine temporal resolution for efficiency of the calculation, wherein said global descriptors are reconstructed until a number of global descriptors is reached which is precalculated from the length of the image sequences underlying the image sequence descriptors.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0057] These and other features and advantages of the present invention will be made evident by the following description of some exemplary and non-limitative embodiments thereof, to be read in conjunction with the attached drawings, wherein:
[0058]
[0059]
[0060]
[0061]
[0062]
[0063]
[0064]
[0065]
[0066]
DETAILED DESCRIPTION OF THE INVENTION
[0067] In the following, a descriptor according to a preferred embodiment of the invention is discussed. First, the format of the descriptor is discussed, and then methods for extracting and matching descriptors are described. Herein, a “video file” or “video sequence” is understood as data (usually stored on a data carrier such as a hard disk, DVD or digital magnetic tape) including a sequence of images; in the following, no difference is made between a video sequence and the corresponding image sequence, unless expressly noted otherwise. Further, “to extract” or “extracting” a descriptor (or other information) from initial data (such as a video sequence) is understood as referring to the actions for determining/calculating the descriptor (or other information) from the initial data without affecting the latter, and the descriptor thus extracted may contain data elements copied from the initial data and/or data elements generated based on the initial data. Further, “to match” or “matching” descriptors is understood as referring to the action of comparing the descriptors so as to derive a measure (such as score value) describing the similarity of the descriptors or underlying initial data. With regard to an image sequence, the terms “image” and “frame” are used herein interchangeably. A “segment” of a sequence of images (or video sequence) is understood, except where denoted explicitly otherwise, as the entire sequence or a part thereof, with a segment representing a set of frames which spans the interval between the first and last image of the segment, without any other segment or segment portion occurring within this interval; often, a segment is additionally required to represent a temporally continuous sequence of frames between the specified first and last image of the segment. 
It will be evident to the person skilled in the art to freely combine several or all of the embodiments discussed here and/or several or all of the appended claims as deemed suitable for a specific application of the invention. Throughout this disclosure, terms like “advantageous”, “exemplary” or “preferred” indicate elements or dimensions which are particularly suitable (but not essential) to the invention or an embodiment thereof, and may be modified wherever deemed suitable by the skilled person, except where expressly required.
Descriptor Notations
[0068] In the description that follows, the following general abbreviations and notations are used. A video sequence (or more generally, an image sequence) is given as ={I.sub.1, . . . , I.sub.N}, the sequence of images in the video. In case the video is segmented,
={S.sub.1, . . . , S.sub.K} is the set of segments of the video, with S.sub.k={I.sub.1.sup.k, . . . , I.sub.M.sub.
=S.sub.1.
[0069] In the following, the index m of an image I.sub.m.sup.k (within the respective segment S.sub.k) may be used as a shorthand for the image itself. In an image I.sub.m.sup.k, a set of interest points P.sup.m={p.sub.1.sup.m, . . . , p.sub.n.sup.m} is detected, for instance using a known detection method such as DoG (Difference of Gaussian), ALP or Hessian Affine. Further, D.sup.m={d.sub.1.sup.m, . . . , d.sub.n.sup.m} denotes a corresponding set of descriptors of the surrounding region of the interest point, called “local descriptors”; such local descriptors are extracted using a known method for feature detection such as SIFT, SURF or ORB (see E. Rublee, V. Rabaud, K. Konolige and G. Bradski, “ORB: An efficient alternative to SIFT or SURF”, 2011 International Conference on Computer Vision, Barcelona, 2011, pp. 2564-2571). A “global descriptor” of a frame of index m is denoted as G.sup.m; and the global descriptor is obtained from aggregating the local descriptors in D.sup.m using a known aggregating method. Suitable methods for aggregating descriptors include Fisher Vectors (FV), SCFV, VLAD or VLAT. Derivation of global and/or local descriptors may be also achieved by using layers of trained Deep Convolutional Neural Networks. Furthermore, G.sub.0.sup.m denotes an encoded version of G.sup.m, such as after dimension reduction. For instance, G.sub.0.sup.m may be formed such that it only contains the values for the non-zero components of the descriptor and starts with an index indicating the components being present. If the method chosen for descriptor aggregation already yields a binary descriptor, then it may be sufficient to have G.sub.0.sup.m=G.sup.m. The notation
Image Sequence Descriptor
[0070] The invention offers a method for extracting a single descriptor from a temporal segment, i.e., a set of consecutive and related frames (e.g., a shot) of an image sequence. This type of descriptor is created from an aggregation of sets of local descriptors from each of the images in the segment, and contains an aggregation of global descriptors and, optionally, a set of the extracted local descriptors, together with their time and location.
[0071] .sub.S.sub.
.sub.S.sub.
.sub.S.sub.
.sub.S.sub.
[0072] The segment global descriptor 113, illustrated in .sub.S.sub.
[0073] Furthermore, referring to .sub.S.sub.
[0074] As illustrated in
[0075] Summarising, a compact image sequence descriptor according to the invention, which can be used for describing an image sequence, comprises at least a segment global descriptor 113 for a segment within the sequence, which includes global descriptor information for respective images, relating to interest points within the video content of the images. The segment global descriptor 113 includes a base descriptor 121, which is a global descriptor associated with a representative frame 120 of the image sequence, and a number of relative descriptors 125. The relative descriptors contain information of a respective global descriptor relative to the base descriptor, allowing reconstruction of an exact or approximated global descriptor associated with a respective image of the image sequence. The image sequence descriptor 101 may further include a segment local descriptor 114 for the segment, comprising a set of encoded local feature descriptors. The data structure will comprise multiple image sequence descriptors 101 in the case that the image sequence is segmented into multiple segments.
Segment Descriptor Extraction
[0076]
[0077] A first stage is temporal segmentation of the video in visually homogeneous segments, in steps 201-206. For every frame I.sup.m of the input sequence starting from an initial frame î (step 201), interest points P.sup.m are detected (step 202), local descriptors D.sup.m are extracted (step 203) and aggregated to a global descriptor G.sup.m (step 204).
[0078] In step 205, (optional) temporal segmentation is performed, based on the similarity obtained by matching the global descriptors of the current and previous images. The segmentation is made by, e.g., defining a segment as starting from frame î according to
$$ S_k = \{\, I_i \mid \delta_g(G_i, G_{i-1}) \le \theta_g \,\wedge\, I_{i-1} \in S_k,\; i = \hat{\imath}, \ldots, \infty \,\}, $$
[0079] where δ.sub.g is an appropriate distance function for global descriptors (e.g., an L1 or L2 norm defined on the vector representation, or a Hamming distance), and θ.sub.g is a threshold chosen for the desired temporal segmentation properties. Thus, the segment will include all frames starting from frame î until the “dissimilarity” (as measured by δ.sub.g) between two subsequent frames exceeds the threshold (step 206). The next segment will then start with the frame at which the threshold is exceeded, and so on. The choice of θ.sub.g depends on the type of global descriptors employed; for example, for SCFV with 512 elements, values of θ.sub.g in the range of 480-500 were found to yield good results. Smaller values will yield more homogeneous segments (in terms of visual variations) with shorter duration, but more compact descriptors for these segments.
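By way of non-limiting illustration, the threshold-based segmentation of steps 205 and 206 may be sketched as follows. This is a minimal sketch under stated assumptions: global descriptors are taken to be equal-length binary vectors compared with a Hamming distance, and the names `hamming` and `segment_by_threshold` are hypothetical helpers that do not appear in the original disclosure.

```python
from typing import Callable, List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Hamming distance between two equal-length binary descriptors."""
    return sum(x != y for x, y in zip(a, b))


def segment_by_threshold(
    descriptors: List[Sequence[int]],
    theta_g: float,
    dist: Callable[[Sequence[int], Sequence[int]], float] = hamming,
) -> List[List[int]]:
    """Split frame indices into segments of consecutive frames.

    A frame stays in the current segment while its distance to the
    previous frame does not exceed theta_g; the first frame that exceeds
    the threshold starts the next segment.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    for i, g in enumerate(descriptors):
        if current and dist(g, descriptors[i - 1]) > theta_g:
            segments.append(current)
            current = []
        current.append(i)
    if current:
        segments.append(current)
    return segments


# Toy input: two visually distinct groups of frames (assumed data).
frames = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]]
print(segment_by_threshold(frames, theta_g=1))
```

With a lower threshold the same input is cut into more, shorter segments, in line with the trade-off described above.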
[0080] Once segments are identified, the descriptor for a segment is encoded by aggregating global descriptors (steps 207-209) and (optionally) coding local descriptors (steps 210-214) of the segments, in subsequent stages described in the following two subsections.
Segment Global Descriptor
[0081] From the set of global descriptors G.sup.m, m∈S.sub.k (as defined in step 204), in step 207 the pairwise distances δ.sub.g(G.sup.m, G.sup.n) are determined for all index pairs m,n, and the medoid frame 120 is selected as a representative frame, for instance according to
[0082] This frame is the one which is “overall most similar” to all frames of the segment. The corresponding descriptor 121 is denoted G.sub.0.sup.m̃. For the other sampled frames i≠m̃, i∈S.sub.k, in step 209 a relative descriptor is determined, for instance by differential coding and arithmetic encoding of global descriptors: The relative quantities
[0083] Before step 209, it is possible to insert filtering 208 of the descriptors, based on the descriptor size S.sub.max mentioned above, which is accepted in this step 208 as a parameter 230 describing the bit budget for global descriptors. Depending on the choice of S.sub.max, all or only a subset of the descriptors is included in the descriptor for the segment. In case descriptors need to be removed, they are removed by ascending values of δ.sub.g(G.sup.i, G.sup.m̃), i.e. descriptors more similar to the medoid descriptor are removed first, until their encoded size is sufficiently small to meet (or fall below) the target size S.sub.max. The remaining number of difference descriptors is denoted K.sub.g. In the minimum case that K.sub.g=0, the resulting global descriptor consists only of the medoid descriptor. For segments with average visual variability (i.e., neither static nor very dynamic), there are typically 3-7 remaining descriptors. The encoded descriptors may be written in the resulting segment global descriptor in any order that is preferred; suitably, they are output in the following order which will facilitate matching of image sequence descriptors:
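As a hedged illustration of steps 207 and 208 (and not of the output order, which is not reproduced here), the selection of the medoid frame and the budget-driven removal of relative descriptors may be sketched as follows. Several points are assumptions made for the example only: the medoid is taken under the common definition as the frame with the smallest summed distance to all other frames, the budget is applied to the relative descriptors alone, and `encoded_size` is a hypothetical stand-in for the size of an entropy-coded difference (here approximated by the number of differing bits).

```python
from typing import Callable, List, Sequence

Descriptor = Sequence[int]


def hamming(a: Descriptor, b: Descriptor) -> int:
    """Hamming distance between two equal-length binary descriptors."""
    return sum(x != y for x, y in zip(a, b))


def medoid_index(descs: List[Descriptor],
                 dist: Callable[[Descriptor, Descriptor], float]) -> int:
    """Index of the descriptor with the smallest summed distance to all others."""
    return min(range(len(descs)),
               key=lambda m: sum(dist(descs[m], g) for g in descs))


def filter_relative(descs: List[Descriptor], medoid: int,
                    dist: Callable[[Descriptor, Descriptor], float],
                    encoded_size: Callable[[Descriptor, Descriptor], int],
                    s_max: int) -> List[int]:
    """Return indices of frames whose relative descriptors are kept.

    Frames most similar to the medoid are dropped first until the summed
    (assumed) encoded size no longer exceeds s_max.
    """
    base = descs[medoid]
    others = [i for i in range(len(descs)) if i != medoid]
    others.sort(key=lambda i: dist(descs[i], base))  # most similar first
    total = sum(encoded_size(descs[i], base) for i in others)
    while others and total > s_max:
        removed = others.pop(0)
        total -= encoded_size(descs[removed], base)
    return others


descs = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]]
m = medoid_index(descs, hamming)
kept = filter_relative(descs, m, hamming, hamming, s_max=4)
print(m, kept)
```

A budget of zero leaves no relative descriptors, which corresponds to the minimum case K.sub.g=0 in which only the medoid descriptor remains.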
Segment Local Descriptor
[0084] The image sequence descriptor according to the invention may also include local feature descriptors, coded in a segment local descriptor. The construction of the segment local descriptor of the embodiment illustrated in
[0085] Starting with step 210, local descriptors are determined. For each of the frames feature selection is performed as defined in the encoding process for
[0086] Each local descriptor has been extracted around an interest point (x,y). Some local descriptor extraction methods also provide a selection priority π, expressing the confidence in the interest point (higher values corresponding to higher confidence). Each of these selected descriptors d.sub.i.sup.m={x,y,π,f} is thus a tuple of interest point location, selection priority (optional, set to 0 for all points if not extracted) and the local feature descriptor f. Pairwise distances of the local descriptors are calculated in step 211, and in step 212 the local descriptors are filtered and approximated. Starting from the medoid frame 120 (index m̃), the sufficiently dissimilar local descriptors are collected according to:
$$ L = \{\, d_i^m \mid d_1(d_i^m, d_j^n) \ge \theta_l;\ \forall i, j;\ m = \tilde{m}, \tilde{m}-1, \tilde{m}+1, \ldots;\ n = \tilde{m}-1, \tilde{m}+1, \ldots \,\}, $$
[0087] where θ.sub.l is a threshold, which is entered as a parameter 231 describing the bit budget for local descriptors. The value of this parameter θ.sub.l is chosen depending on the intended descriptor size (e.g., up to 5 for CDVS). The notation {m̃−1, m̃+1, . . . } is used to denote an order of the index which comprises alternatingly decreasing and increasing indices, starting from the medoid frame m̃. The selection is based on pairwise distances d.sub.1(•) determined in step 211 by an appropriate distance function for the type of local descriptor, such as the L1 or L2 norm. Processing local descriptors starting from the medoid frame has the advantage that, in most cases, the more similar descriptors are processed first.
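The alternating index order and the collection of sufficiently dissimilar local descriptors may, purely as an illustrative sketch, be written as follows. The sketch assumes that each frame is given as a list of plain feature vectors, that an L1 distance is used, and that a candidate is compared against every descriptor already kept (including those of the same frame); the names `alternating_order`, `select_dissimilar` and `l1` are hypothetical. For each omitted descriptor, the most similar kept descriptor is recorded as a reference.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Feature = Sequence[float]


def l1(a: Feature, b: Feature) -> float:
    """L1 (city-block) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def alternating_order(medoid: int, num_frames: int) -> List[int]:
    """Frame indices m, m-1, m+1, m-2, m+2, ... restricted to valid frames."""
    order = [medoid]
    step = 1
    while len(order) < num_frames:
        for idx in (medoid - step, medoid + step):
            if 0 <= idx < num_frames:
                order.append(idx)
        step += 1
    return order


def select_dissimilar(
    frames: List[List[Feature]], medoid: int, theta_l: float,
    dist: Callable[[Feature, Feature], float] = l1,
) -> Tuple[List[Tuple[int, int]], Dict[Tuple[int, int], Tuple[int, int]]]:
    """Keep descriptors at distance >= theta_l from all kept ones.

    Returns the kept (frame, index) pairs and, for every omitted
    descriptor, a reference to the most similar kept descriptor.
    """
    kept: List[Tuple[int, int, Feature]] = []
    refs: Dict[Tuple[int, int], Tuple[int, int]] = {}
    for m in alternating_order(medoid, len(frames)):
        for i, f in enumerate(frames[m]):
            nearest = min(kept, key=lambda k: dist(f, k[2]), default=None)
            if nearest is None or dist(f, nearest[2]) >= theta_l:
                kept.append((m, i, f))
            else:
                refs[(m, i)] = (nearest[0], nearest[1])
    return [(m, i) for m, i, _ in kept], refs


# Three toy frames; frame 1 is assumed to be the medoid frame.
frames = [[[0, 0]], [[0, 1], [5, 5]], [[0, 0], [9, 9]]]
print(alternating_order(2, 5))
print(select_dissimilar(frames, medoid=1, theta_l=2))
```

A larger θ.sub.l discards more descriptors as near-duplicates and therefore yields a smaller segment local descriptor.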
[0088] For descriptors omitted due to high similarity, a reference to the most similar descriptor is kept. This results in a set F.sub.L of local descriptors, referred to as the feature descriptor. For each lεF.sub.L, the set l.sub.i.sup.T of frames m.sub.i in which this (or a very similar) descriptor appears, as well as the interest point location are described as:
l.sub.i=(f.sub.i,l.sub.i.sup.T)
l.sub.i.sup.T={(t(m.sub.i),x.sub.m.sub.
[0089] The frames are identified by time points t(m.sub.i) relative to the start time of the segment.
[0090] In step 213, differential coding and arithmetic encoding of local descriptors is made. For the set of descriptors in F.sub.L, the most similar descriptor in F.sub.L is determined, and the feature descriptor is determined as the difference of the encoded descriptors, i.e.
[0091] Adaptive binary arithmetic encoding is applied to the difference descriptors
[0092] Thus, the differential part of the segment local descriptor is obtained, as
[0093] with j being the index of the descriptor used as basis for difference calculation.
[0094] The encoding of interest point locations is preferably performed using the function locenc( ). The known function locenc( ) encodes the (approximate) locations of the interest points of the encoded descriptors; it may, for example, be implemented using the histogram-based location encoding methods described in ISO/IEC 15938-13 or in WO 2013/102574 A1.
[0095] The local part of the segment descriptor is composed of the set of the time map 140 (
(T,f.sub.{tilde over (m)},
[0096] where
[0097] The global and local segment descriptors thus obtained are combined into a segment descriptor (step 215) and, if required, serialised into a linear sequence of data. During the process shown in
[0098] The segment descriptors are combined into an image sequence descriptor 101, which describes the segmented image sequence, which is serialised and transferred to output. Alternatively, if preferred, it is possible to output the segment descriptors as separate image sequence descriptors. This extraction process of the invention, of which an embodiment is illustrated in
Segment Descriptor Matching
[0099]
Global Medoid Descriptor
[0100] In step 303, the global medoid descriptors are matched. This is done, for instance, by determining the similarity σ.sub.g of the medoid descriptors G.sub.0.sup.A and G.sub.0.sup.B of the two frames, using a distance function as mentioned above. In step 304, using a threshold θ.sub.m, a check for very similar data structures may be done: if the similarity σ.sub.g is below the threshold (σ.sub.g<θ.sub.m), σ is set to 0 and matching terminates. The value of the threshold depends on the type of local descriptor; for example, for SCFV suitable values are between 3 and 7.
Global Descriptor Matching
[0101] Otherwise, the matching process continues with step 306, iterative decoding and matching of global descriptors. The similarity σ.sub.g is compared against a second threshold θ.sub.γ, with θ.sub.γ>θ.sub.m (e.g., a suitable θ.sub.γ can be 5-10 for SCFV), to determine the match count
[0102] and score σ.sub.0=σ.sub.gc.sup.G. The process proceeds to incrementally decode global descriptors G.sub.1.sup.A . . . G.sub.K.sup.A and G.sub.1.sup.B . . . G.sub.K.sup.B, and match them against all global descriptors decoded so far, yielding similarities δ.sub.1 . . . δ.sub.KK′/2; the match count c.sup.G is increased by one for every δ.sub.k>θ.sub.γ, and σ.sub.k is calculated as
[0103] A minimum number of min(2+└max(|A|,|B|)s.sub.min┘, |A|,|B|) descriptors are matched (loop of steps 306, 307), with s.sub.min being a predefined constant ≦1, typically in the range 0.05-0.20. The constant factor two ensures, in correspondence with the order of relative descriptors as mentioned above, that at least the most dissimilar global descriptors to the medoid global descriptor are matched (if they were encoded in the descriptor). In step 308, it is checked whether the similarity score decreases: As additional global descriptors are more similar to the medoid descriptor, decoding and matching further global descriptors from either of the segment descriptors will stop after having matched the minimum number of frames when it is found that σ.sub.k would decrease. If this is the case for both segment descriptors, global matching terminates (branch to step 310). Global matching also terminates (through step 309) if all descriptors of all frames present in the segment descriptor have been matched.
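The minimum number of descriptors to be matched can be computed directly from the expression given above. The following is a small sketch in which |A| and |B| are assumed to denote the number of frames described in the two segment descriptors; the function name `min_descriptors_to_match` is hypothetical.

```python
import math


def min_descriptors_to_match(len_a: int, len_b: int, s_min: float = 0.1) -> int:
    """min(2 + floor(max(|A|, |B|) * s_min), |A|, |B|)."""
    return min(2 + math.floor(max(len_a, len_b) * s_min), len_a, len_b)


# 40 and 20 frames with s_min = 0.25.
print(min_descriptors_to_match(40, 20, s_min=0.25))
```

For very short segments the bound is capped by the shorter of the two descriptors, so no more descriptors are required than are actually present.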
[0104] If only global matching is to be performed (step 310), matching terminates; otherwise, the process continues with local matching (steps 311-316).
[0105] The score σ.sup.G of the global descriptor matching is calculated as follows. If the number of matching frames exceeds n.sub.min=┌m.sub.minmin(|A|,|B|)┐, with a scaling parameter m.sub.min (0<m.sub.min≦1, preferably m.sub.min is chosen in the range 0.05-0.2), then σ.sup.G is calculated as median of the n.sub.min highest pairwise similarities (preferably, this value is additionally normalised by the maximum similarity for the respective similarity function used); otherwise σ.sup.G is set to 0.
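The calculation of the global score described in this paragraph may be sketched as follows. The sketch is non-authoritative and assumes that the pairwise similarities and the number of matching frames have already been collected during global matching; the function name `global_score` and the optional `max_similarity` normalisation argument are illustrative choices.

```python
import math
from statistics import median
from typing import Optional, Sequence


def global_score(similarities: Sequence[float], num_matching: int,
                 len_a: int, len_b: int, m_min: float = 0.1,
                 max_similarity: Optional[float] = None) -> float:
    """Median of the n_min highest similarities, or 0 if too few frames match."""
    n_min = math.ceil(m_min * min(len_a, len_b))
    if num_matching <= n_min:
        return 0.0
    top = sorted(similarities, reverse=True)[:n_min]
    if not top:
        return 0.0
    score = median(top)
    if max_similarity:
        score /= max_similarity  # optional normalisation
    return score


sims = [0.9, 0.2, 0.8, 0.7, 0.1]
print(global_score(sims, num_matching=4, len_a=8, len_b=12, m_min=0.25))
```

In this example n.sub.min = 2, so the score is the median of the two highest similarities; with only two matching frames the condition is not met and the score is 0.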
Local Descriptor Matching
[0106] For matching of the local descriptors (steps 311-316), the process proceeds to decode the temporal index, the local descriptors and (if encoded) their locations, and perform matching of the local descriptors of the frames corresponding to the two medoid local descriptors (step 311), yielding a set of similarities σ.sub.0.sup.L={σ.sub.0,0.sup.L, . . . , σ.sub.P.sub.
[0107] Step 312 is iterative decoding and matching of local descriptors for frames in the segment. Each of the similarities σ.sub.p,q.sup.L of the medoid descriptors is compared against a threshold θ.sub.λ (which is a predetermined parameter chosen, e.g., around 2.0 for CDVS), and the matching descriptor pairs are counted. A local match count is initialised, c.sup.L=0. If a minimum number of matching descriptor pairs (typically 4-8 are required) is found (and confirmed by spatial verification, if performed), then the local match count c.sup.L is increased by 1 for each such pair of frames.
[0108] The matching of the local descriptors is suitably done in the same sequence as for global descriptors (and with the same minimum number of frames to be matched, which is checked in step 313), and for the corresponding frames, calculating new distances or reusing the already calculated ones. In the same way as for global descriptors, the average similarity is updated from the matching frames, and matching terminates when it is found that the matching score decreases (step 314) or all descriptors of all frames present in the segment descriptor have been matched (step 315). As for the local descriptors of the medoid frame, the local match count is increased if a minimum number of matching descriptor pairs is found.
[0109] If the local match count c.sup.L exceeds n.sub.min (as determined above for global descriptor matching), the local matching score σ.sup.L is calculated as median of the n.sub.min highest pairwise similarities.
[0110] In step 316, the global matching score σ.sup.G and the local matching score σ.sup.L are combined into a total matching score σ, which is returned in step 305. The total matching score σ may be determined according to any suitable method, preferably as a weighted sum (e.g., assigning equal weight to both) of the scores σ.sup.G and σ.sup.L, or as the maximum value, max(σ.sup.G, σ.sup.L).
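The two combination rules named above may be illustrated by the following minimal sketch. The function name `combine_scores` and its `mode` and `w_g` parameters are assumptions introduced for the example; equal weighting is used as the default, as suggested in the text.

```python
def combine_scores(sigma_g: float, sigma_l: float,
                   mode: str = "weighted", w_g: float = 0.5) -> float:
    """Combine global and local matching scores into a total score."""
    if mode == "weighted":
        return w_g * sigma_g + (1.0 - w_g) * sigma_l
    if mode == "max":
        return max(sigma_g, sigma_l)
    raise ValueError(f"unknown mode: {mode}")


print(combine_scores(0.5, 1.0))              # equal weights
print(combine_scores(0.5, 1.0, mode="max"))  # maximum of both scores
```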
Retrieval
[0111] The matching method for descriptors can be used in retrieval of image sequences. For instance, a typical retrieval task is finding, in a set of videos or a video database, the video segment which is the most similar to a given reference image sequence. For the reference, an image sequence descriptor is obtained, e.g. by reading/loading the descriptor from an input such as a storage device; alternatively, the descriptor is extracted directly from the image sequence. This image sequence descriptor is compared (matched) with image sequence descriptors relating to the image sequences of the set (again, these descriptors may be obtained by reading/loading them from suitable input or storage, such as a database, or calculated from the image sequences). This will give a set of matching results (each representing the similarity between the reference image sequence and one video segment), of which typically the highest value can be used to identify the most similar video segment.
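A retrieval loop of this kind may be sketched as follows. The sketch assumes that a matching function returning a similarity score is available and passes it in as a parameter; `retrieve_best` and the toy `bit_agreement` similarity are hypothetical and merely stand in for the matching method described above.

```python
from typing import Callable, Dict, Hashable, Optional, Sequence, Tuple


def bit_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Toy similarity: fraction of positions at which two descriptors agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def retrieve_best(
    reference: Sequence[int],
    candidates: Dict[Hashable, Sequence[int]],
    match: Callable[[Sequence[int], Sequence[int]], float],
) -> Tuple[Optional[Hashable], Optional[float], Dict[Hashable, float]]:
    """Match the reference against every candidate and return the best one.

    Returns (best id, best score, all scores); the first two are None
    when the candidate set is empty.
    """
    scores = {cid: match(reference, desc) for cid, desc in candidates.items()}
    if not scores:
        return None, None, scores
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id], scores


database = {"clip_a": [0, 0, 1, 1], "clip_b": [1, 1, 1, 0], "clip_c": [0, 1, 1, 1]}
print(retrieve_best([0, 1, 1, 1], database, bit_agreement))
```

Returning the full score table alongside the best match also allows a ranked result list or a retrieval measure to be derived from the same pass.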
System for Processing Descriptors
[0112]
[0113]
[0114] A retrieval of descriptors is done as follows: Descriptors 101 are extracted by a subsystem as described with reference to system 400 and
[0115] For an illustration of the above methods and data structures,
REFERENCE SIGNS LIST
[0116] 101 descriptor bitstream
[0117] 102 first descriptor bitstream to be matched
[0118] 103 second descriptor bitstream to be matched
[0119] 110 header structure
[0120] 111, 112 segment start time and segment end time
[0121] 113 segment global descriptor
[0122] 114 segment local descriptor
[0123] 120 medoid frame number (as reference frame number)
[0124] 121 medoid global descriptor (as reference global descriptor)
[0125] 122 number of frames described
[0126] 123 relative temporal positions of the frames described w.r.t. the start of the segment
[0127] 124 size of coded global descriptor block
[0128] 125 coded global descriptor block
[0129] 130 number of local descriptors in segment
[0130] 131 size of coded local descriptor block
[0131] 132 coded local descriptor block
[0132] 133 size of coded keypoint location block
[0133] 134 coded keypoint location block
[0134] 140 descriptor time map
[0135] 141 local feature descriptors
[0136] 142 local feature relevances (optional)
[0137] 200 descriptor extraction module
[0138] 201 input next frame
[0139] 202 interest point detection
[0140] 203 local descriptor extraction
[0141] 204 local descriptor aggregation
[0142] 205 matching global descriptors of current and previous image
[0143] 206 continue current segment
[0144] 207 determine frame of global reference descriptor
[0145] 208 filter global descriptors
[0146] 209 differential coding and arithmetic encoding of global descriptors
[0147] 210 determine set of local descriptors
[0148] 211 determine pairwise distances of local descriptors
[0149] 212 filter and approximate local descriptors
[0150] 213 differential coding and arithmetic encoding of local descriptors
[0151] 214 location and time encoding
[0152] 215 serialisation
[0153] 216 clear segment store
[0154] 217 frame store for current segment
[0155] 220 input image sequence
[0156] 230, 231 bit budgets for global descriptors/local descriptors
[0157] 300 descriptor matching module
[0158] 301 read descriptor A
[0159] 302 read descriptor B
[0160] 303 match global (medoid) reference descriptors
[0161] 304 similarity exceeds threshold
[0162] 305 return score
[0163] 306 iterative decoding and matching of global descriptors
[0164] 307 minimal number of global descriptors matched
[0165] 308 similarity score decreases
[0166] 309 all global descriptors matched
[0167] 310 perform local matching
[0168] 311 match local descriptors of (medoid) reference frames
[0169] 312 iterative decoding and matching of local descriptors for frames in the segment
[0170] 313 minimal number of global descriptors matched
[0171] 314 similarity score decreases
[0172] 315 all global descriptors matched
[0173] 316 combine global and local scores
[0174] 320 matching score
[0175] 400 system for extraction of descriptors
[0176] 401 image sequence input module
[0177] 402 storage output module
[0178] 403 storage input module
[0179] 404 memory
[0180] 405 reporting module
[0181] 500 system for matching and/or retrieval of descriptors
[0182] 601, 602 first and second image sequences
[0183] 603 frame from second image sequence with local descriptors
[0184] 604 aggregated global descriptor for the frame
[0185] 610, 620 extracting segment descriptors for image sequences 601, 602
[0186] 611, 621 image sequence descriptors
[0187] 613, 623 segment global descriptors
[0188] 614, 624 segment local descriptors
[0189] 630 matching the image sequence descriptors