Ultrafast, robust and efficient depth estimation for structured-light based 3D camera system
11551367 · 2023-01-10
CPC classification: G06V10/751 (PHYSICS); G06T7/521 (PHYSICS); H04N2013/0081 (ELECTRICITY); H04N13/254 (ELECTRICITY); G01B11/2513 (PHYSICS); H04N13/271 (ELECTRICITY)
International classification: G06T7/521 (PHYSICS); H04N13/254 (ELECTRICITY); H04N13/271 (ELECTRICITY); G06V10/75 (PHYSICS); G01B11/25 (PHYSICS)
Abstract
A system and a method are disclosed for a structured-light system to estimate depth in an image. An image is received in which the image is of a scene onto which a reference light pattern has been projected. The projection of the reference light pattern includes a predetermined number of particular sub-patterns. A patch of the received image and a sub-pattern of the reference light pattern are matched based on either a hardcode template matching technique or a probability that the patch corresponds to the sub-pattern. If a lookup table is used, the table may be a probability matrix, may contain precomputed correlation scores, or may contain precomputed class IDs. An estimate of depth of the patch is determined based on a disparity between the patch and the sub-pattern.
Claims
1. A method for a structured-light system to estimate depth in an image, the method comprising: receiving the image, the image being of a scene onto which a reference light pattern has been projected, the image including a projection of the reference light pattern, and the reference light pattern comprising a predetermined number of particular sub-patterns; binarizing at least one patch of the image, the patch comprising a predetermined number of pixels; matching the at least one patch to a sub-pattern of the reference light pattern by minimizing an error function E_k for the patch based on a first number of ones in the binarized patch and a count of white patches in the reference light pattern and a square of a count of black patches in the reference light pattern; and determining an estimate of depth for at least one patch of the image based on a disparity between the patch and the sub-pattern.
2. The method of claim 1, wherein the second number of ones in each respective binarized sub-pattern is determined by incrementing the second number of ones for a first binarized sub-pattern by 2 to obtain the second number of ones for a subsequent binarized sub-pattern.
3. The method of claim 1, wherein the error function E_k is further based on a white pixel subset of the reference light pattern and a black pixel subset of the reference light pattern.
4. A method for a structured-light system to estimate depth in an image, the method comprising: receiving the image, the image being of a scene onto which a reference light pattern has been projected, the image including a projection of the reference light pattern, and the reference light pattern comprising a predetermined number of particular sub-patterns; binarizing at least one patch of the image, the patch comprising a predetermined number of pixels; matching the at least one patch to a sub-pattern of the reference light pattern by minimizing an error function E_k for the patch based on a first number of ones in the binarized patch and a count of black patches divided by a count of white patches in a binary reference patch; and determining an estimate of depth for at least one patch of the image based on a disparity between the patch and the sub-pattern.
5. The method of claim 4, wherein the second number of ones in each respective binarized sub-pattern is determined by incrementing the second number of ones for a first binarized sub-pattern by 2 to obtain the second number of ones for a subsequent binarized sub-pattern.
6. The method of claim 4, wherein the error function E_k is further based on a white pixel subset of the reference light pattern and a black pixel subset of the reference light pattern.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) In the following section, the aspects of the subject matter disclosed herein will be described with reference to exemplary embodiments illustrated in the figures, in which:
(2)-(14) [Descriptions of the individual figures are not reproduced in this text.]
DETAILED DESCRIPTION
(15) In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. It will be understood, however, by those skilled in the art that the disclosed aspects may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the subject matter disclosed herein.
(16) Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment disclosed herein. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” or “according to one embodiment” (or other phrases having similar import) in various places throughout this specification may not be necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In this regard, as used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not to be construed as necessarily preferred or advantageous over other embodiments. Also, depending on the context of discussion herein, a singular term may include the corresponding plural forms and a plural term may include the corresponding singular form. It is further noted that various figures (including component diagrams) shown and discussed herein are for illustrative purpose only, and are not drawn to scale. Similarly, various waveforms and timing diagrams are shown for illustrative purpose only. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, if considered appropriate, reference numerals have been repeated among the figures to indicate corresponding and/or analogous elements.
(17) The terminology used herein is for the purpose of describing particular exemplary embodiments only and is not intended to be limiting of the claimed subject matter. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms “first,” “second,” etc., as used herein, are used as labels for nouns that they precede, and do not imply any type of ordering (e.g., spatial, temporal, logical, etc.) unless explicitly defined as such. Furthermore, the same reference numerals may be used across two or more figures to refer to parts, components, blocks, circuits, units, or modules having the same or similar functionality. Such usage is, however, for simplicity of illustration and ease of discussion only; it does not imply that the construction or architectural details of such components or units are the same across all embodiments or such commonly-referenced parts/modules are the only way to implement the teachings of particular embodiments disclosed herein.
(18) Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this subject matter belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
(19) Embodiments disclosed herein provide rapid depth estimations for a structured-light system. In one embodiment, depth estimations are provided based on hardcode template matching of image patches to reference patches. In another embodiment, image patches are matched to reference patches by correlation based on, for example, Bayes' rule. Still another embodiment matches image patches to reference patches using a lookup table to provide extremely fast depth estimation. All of the embodiments disclosed herein provide a dramatically reduced computational burden and reduced memory/hardware resource demands in comparison to other approaches, while also reducing noise, blur and distortion that may accompany the other approaches.
(20) Embodiments disclosed herein that use a lookup table provide a constant-time depth estimation. Moreover, the lookup table may be learned based on a training dataset that enhances depth prediction. The lookup table may be more robust than other approaches, while also achieving high accuracy.
(21) In one embodiment, a structured-light system 100 may include a projector 101, a camera 102 and a processing device 103. The projector 101 may project a reference light pattern 104 onto a scene, the camera 102 may capture an image 106 of the scene as illuminated by the reference light pattern 104, and the processing device 103 may process the image 106 to generate a depth map 107 of the scene.
(22) The processing device 103 may be a microprocessor or a personal computer programmed via software instructions, a dedicated integrated circuit, or a combination of both. In one embodiment, the processing provided by the processing device 103 may be implemented completely via software, via software accelerated by a graphics processing unit (GPU) or a multicore system, or by dedicated hardware able to implement the processing operations. Both hardware and software configurations may provide different degrees of parallelism. One implementation of the structured-light system 100 may be part of a handheld device, such as, but not limited to, a smartphone, a cellphone or a digital camera.
(23) In one embodiment, the projector 101 and the camera 102 may be matched in the visible region or in the infrared light spectrum, which may not be visible to human eyes. The projected reference light pattern may be within the spectrum range of both the projector 101 and the camera 102. Additionally, the resolutions of the projector 101 and the camera 102 may be different. For example, the projector 101 may project the reference light pattern 104 in a video graphics array (VGA) resolution (e.g., 640×480 pixels), while the camera 102 may have a higher resolution (e.g., 1280×720 pixels). In such a configuration, the image 106 may be down-sampled, and/or only the area illuminated by the projector 101 may be analyzed, in order to generate the depth map 107.
(26) In one embodiment disclosed herein, the processing device 103 may generate the estimated depth information for the depth map 107 by using a hardcode template matching technique to match image patches to patches of the reference light pattern 104, in which the complexity of the matching technique is O(P) and P is the size of the patch being matched. In another embodiment disclosed herein, the processing device 103 may generate the estimated depth information by matching image patches to patches of the reference light pattern 104 based on a probability that an image patch matches a patch of the reference light pattern 104, in which the complexity of the matching technique is O(P). In still another embodiment disclosed herein, the processing device 103 may generate the estimated depth information by referring to a lookup table (LUT) that may contain probability information that an image patch matches a patch of the reference light pattern 104, in which the complexity of the matching technique may be represented by O(1).
(27) 1. Hardcode Template Matching.
(28) Matching an image patch to a patch of the reference light pattern 104 may be performed by direct calculation using a hardcode template matching technique according to the subject matter disclosed herein. For computational purposes, the reference light pattern 104 may be represented by patterns of 1s and 0s, which greatly simplifies the computations for the patch comparisons.
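By way of illustration, the binary representation and the binarization step might look like the following minimal Python/NumPy sketch. The example pattern values and the mean-thresholding rule are illustrative assumptions; the patent text does not fix a particular threshold.

    import numpy as np

    # Hypothetical 4x4 reference sub-pattern: 1s mark projected (white) dots,
    # 0s mark dark pixels, per the binary representation described above.
    Q_example = np.array([[1, 0, 1, 0],
                          [0, 1, 0, 1],
                          [1, 0, 0, 1],
                          [0, 1, 1, 0]], dtype=np.uint8)

    def binarize_patch(patch):
        # Threshold at the patch mean -- one plausible binarization rule;
        # the patent text does not specify the threshold.
        patch = np.asarray(patch, dtype=np.float64)
        return (patch >= patch.mean()).astype(np.uint8)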
(29) One of three different computational techniques may be used for matching an image patch to a patch of the reference light pattern. A first computational technique may be based on a Sum of Absolute Difference (SAD) approach in which a matching score is determined based on the sum of the pixel-wise absolute difference between an image patch and a reference patch. A second computational technique may be based on a Sum of Squared Difference (SSD) approach. A third computational technique may be based on a Normalized Cross-Correlation (NCC) approach.
(30) To illustrate the advantages of the different direct-calculation approaches provided by the embodiments disclosed herein, consider a 4×4 image input patch P that is to be matched against the 192 unique 4×4 sub-patterns Q_k of the reference light pattern 104.
(32) A typical SAD matching calculation that may be used to generate a matching score for the input patches P and Q_k may be to minimize an error function E_k, such as

$$E_k = \sum_{i,j=0}^{3} \left| P(i,j) - Q_k(i,j) \right|, \tag{1}$$

in which (i, j) is a pixel location within a patch and k is a patch identification ID ∈ [1, 192] corresponding to a patch of the reference light pattern 104, which has 192 unique patterns.
(33) For the SAD approach of Eq. (1), the total computational burden to determine the error function E_k for a single image input patch P across all 192 reference patches Q_k involves 4×4×2×192 = 6144 addition operations.
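For reference, the brute-force evaluation of Eq. (1) over all candidate patches might look like the following sketch; the patterns array is a hypothetical stand-in for the 192 reference sub-patterns.

    import numpy as np

    def sad_match(P, patterns):
        # Eq. (1): E_k = sum over (i, j) of |P(i,j) - Q_k(i,j)|.
        # patterns: hypothetical (192, 4, 4) array of reference patches Q_k.
        errors = np.abs(patterns.astype(np.int64)
                        - P.astype(np.int64)).sum(axis=(1, 2))
        return int(np.argmin(errors))  # index of the best-matching patch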
(34) In contrast to the approach of Eq. (1), the computational burden may be dramatically reduced by binarizing the image input patch P and exploiting the binary structure of the reference light pattern patches, as described below.
(35) Using binary patterns, minimizing the error function may be reformulated into only summation operations over the pixels that are 1s in the reference patterns. According to one embodiment disclosed herein, a simplified SAD matching calculation that may be used to generate a matching score for the image input patch P with respect to a reference light pattern patch may be to minimize an error function E_k as
(36)

$$E_k = \sum_{(i,j)\in B_k} \bigl(1 - P(i,j)\bigr) + \sum_{(i,j)\notin B_k} P(i,j) \tag{2}$$
$$= \|B_k\| - \sum_{(i,j)\in B_k} P(i,j) + \Bigl(P_{sum} - \sum_{(i,j)\in B_k} P(i,j)\Bigr) \tag{3}$$
$$= P_{sum} + \|B_k\| - 2\sum_{(i,j)\in B_k} P(i,j), \tag{4}$$

in which (i, j) is a pixel location within the input patch P, k is a patch identification ID ∈ [1, 192] corresponding to a patch of the reference light pattern 104, B_k is the set of pixels having a value of 1 in the reference patch Q_k, ‖B_k‖ is the count of 1s in the reference patch Q_k, and P_sum is the sum of all pixel values in patch P. As ‖B_k‖ is known for each binary reference patch and P_sum may be precomputed (the average number of 1s in a reference pixel pattern is 8), the number of additions required to do a single pattern-to-pattern comparison is reduced from 32 to approximately 8.
(37) Thus, for the SAD approach according to Eq. (4), the total computational burden to determine the error function E_k for a single image input patch P across all 192 reference patches involves 8×192 addition operations for an average ‖B_k‖ of 8. To further reduce the number of computation operations, P_sum may be precomputed.
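A sketch of the simplified computation of Eq. (4), assuming the binary masks B_k and counts ‖B_k‖ have been prepared ahead of time (argument names and shapes are illustrative):

    import numpy as np

    def sad_match_binary(P, B_masks, B_counts):
        # Eq. (4): E_k = P_sum + ||B_k|| - 2 * (sum of P over B_k).
        # B_masks: hypothetical (192, 4, 4) boolean masks marking the 1
        # pixels of each reference patch; B_counts[k] = ||B_k|| (precomputed).
        P = P.astype(np.int64)
        P_sum = P.sum()                        # precomputable once per patch
        S = (B_masks * P).sum(axis=(1, 2))     # ~||B_k|| additions per patch
        E = P_sum + B_counts - 2 * S
        return int(np.argmin(E))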
(38) A typical SSD matching calculation that may be used to generate a matching score for the input patches P and Q_k may be to minimize an error function E_k, such as

$$E_k = \sum_{i,j=0}^{3} \left| P(i,j) - Q_k(i,j) \right|^2, \tag{5}$$

in which (i, j) is a pixel location within a patch and k is a patch identification ID ∈ [1, 192] corresponding to a patch of the reference light pattern 104.
(39) For the typical SSD approach of Eq. (5), the total computational burden to determine the error function E_k for a single image input patch P across all 192 reference patches involves 4×4×2×192 = 6144 addition operations.
(40) Using binary reference patches, the SSD calculation may be simplified in the same manner. According to one embodiment disclosed herein, a simplified SSD matching calculation that may be used to generate a matching score for the image input patch P with respect to a reference light pattern patch may be to minimize an error function E_k as
(41)

$$E_k = \sum_{i,j=0}^{3} P(i,j)^2 - 2\sum_{i,j=0}^{3} P(i,j)\,Q_k(i,j) + \sum_{i,j=0}^{3} Q_k(i,j)^2 \tag{6}$$
$$= \sum_{i,j=0}^{3} P(i,j)^2 - 2\sum_{(i,j)\in B_k} P(i,j) + \sum_{(i,j)\in B_k} 1 \tag{7}$$
$$= P^2_{sum} - 2\sum_{(i,j)\in B_k} P(i,j) + \|B_k\|, \tag{8}$$

in which (i, j) is a pixel location within the input patch P, k is a patch identification ID ∈ [1, 192] corresponding to a patch of the reference light pattern 104, B_k is the set of pixels having a value of 1 in the binary reference patch Q_k, ‖B_k‖ is the count of 1s in the binary reference patch Q_k, and P²_sum is the sum of the squared pixel values in patch P.
(42) For the simplified SSD approach according to Eq. (8), the total computational burden to determine the error function E_k for a single image input patch P across all 192 reference patches involves approximately 8×192 addition operations for an average ‖B_k‖ of 8. To further reduce the number of computation operations, both ‖B_k‖ and P²_sum may be precomputed.
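The corresponding sketch for Eq. (8); relative to the SAD sketch above, only the precomputed patch statistic changes (same assumed argument shapes):

    import numpy as np

    def ssd_match_binary(P, B_masks, B_counts):
        # Eq. (8): E_k = P^2_sum - 2 * (sum of P over B_k) + ||B_k||.
        P = P.astype(np.int64)
        P2_sum = (P ** 2).sum()                # precomputable once per patch
        S = (B_masks * P).sum(axis=(1, 2))
        E = P2_sum - 2 * S + B_counts
        return int(np.argmin(E))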
(43) A typical NCC matching calculation that may be used to generate a matching score for the input patches P and Q_k may be to compute a normalized correlation score E_k, such as

$$E_k = \frac{\sum_{i,j=0}^{3} P(i,j)\,Q_k(i,j)}{\sqrt{\sum_{i,j=0}^{3} P(i,j)^2}\,\sqrt{\sum_{i,j=0}^{3} Q_k(i,j)^2}}, \tag{9}$$

in which (i, j) is a pixel location within a patch and k is a patch identification ID ∈ [1, 192] corresponding to a patch of the reference light pattern 104.
(45) For the typical NCC approach of Eq. (9), the total computational burden to determine the score E_k for a single image input patch P across all 192 reference patches involves 4×4×192 multiplication operations plus 4×4×192 addition operations, which equals 6144 operations.
(46) Using binary reference patches, the NCC calculation may also be simplified. According to one embodiment disclosed herein, a simplified NCC matching calculation may generate a matching score E_k for the image input patch P as
(47)

$$E_k = \frac{1}{\|B_k\|}\sum_{i,j=0}^{3} P(i,j)\,Q_k(i,j) \tag{10}$$
$$= \frac{1}{\|B_k\|}\sum_{(i,j)\in B_k} P(i,j), \tag{11}$$

in which (i, j) is a pixel location within the input patch P, k is a patch identification ID ∈ [1, 192] corresponding to a patch of the reference light pattern 104, and ‖B_k‖ is the count of white pixels (1s) in the binary reference patch Q_k.
(48) It should be noted that the simplified NCC technique disclosed herein generally uses only one division operation for normalization. As ‖B_k‖ may take five different integer values (specifically, 6-10), the division operation may be delayed until matching scores are compared. Accordingly, the 192 matching scores may be divided into five groups based on their ‖B_k‖ values, and the highest matching score may first be found within each group. Only when the highest scores of the five groups are compared with one another does the division need to be performed, so it needs to be done only five times. Thus, for the NCC approach according to Eq. (11), the total computational burden to determine the score E_k for a single image input patch P across all 192 reference patches involves 5 multiplication operations plus 2×192 addition operations, for a total of 389 operations. Similar to the SAD and the SSD approaches disclosed herein, P²_sum may be precomputed.
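A sketch of Eq. (11) with the delayed division described above: the numerators are grouped by their ‖B_k‖ value and only the five group winners are divided (argument shapes follow the earlier sketches and are assumptions):

    import numpy as np

    def ncc_match_binary(P, B_masks, B_counts):
        # Numerators: sum of P over B_k; division by ||B_k|| is delayed.
        S = (B_masks * P.astype(np.float64)).sum(axis=(1, 2))
        best_k, best_score = -1, -np.inf
        for b in np.unique(B_counts):          # five values (6-10) per the text
            idx = np.flatnonzero(B_counts == b)
            j = idx[np.argmax(S[idx])]         # group winner: no division yet
            score = S[j] / b                   # one division per group
            if score > best_score:
                best_k, best_score = j, score
        return best_k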
(50) The number of operations for each of the three simplified direct-computation matching techniques disclosed herein may be further reduced by incrementally computing the term Σ_{(i,j)∈B_k} P(i,j) from one reference patch to the next.
(51) In particular, the reference patch 401 includes six 1s (i.e., six white pixels). The reference patch 402 includes eight 1s (i.e., eight white pixels). The difference in the number of 1s between the reference patch 401 and the reference patch 402 is two, so the value for the reference patch 402 may be obtained by adding only two terms to the value for the reference patch 401. When the reference patch 403 is considered, no additional addition operations are needed because both the reference patch 402 and the reference patch 403 include eight 1s. On average, the incremental number of addition operations is 2. Thus, using this incremental approach, the total number of addition operations needed to match all unique patterns is reduced to 2×192, which for the simplified SAD technique disclosed herein is 16 times faster than the SAD technique of Eq. (1).
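One way to realize this incremental update is sketched below, under the assumption that the 192 reference masks have been ordered so that consecutive masks differ in only a couple of pixels (as with patches 401-403); only the changed pixels are added or subtracted at each step:

    import numpy as np

    def incremental_sums(P, ordered_masks):
        # ordered_masks: hypothetical (192, 16) boolean masks, flattened and
        # ordered so that neighboring masks differ by ~2 pixels.
        flatP = P.ravel().astype(np.int64)
        S = np.empty(len(ordered_masks), dtype=np.int64)
        prev = ordered_masks[0]
        S[0] = flatP[prev].sum()
        for k in range(1, len(ordered_masks)):
            cur = ordered_masks[k]
            # Add newly covered pixels, subtract pixels no longer covered.
            S[k] = S[k - 1] + flatP[cur & ~prev].sum() - flatP[prev & ~cur].sum()
            prev = cur
        return S  # S[k] = sum of P over B_k, usable in Eqs. (4), (8), (11)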
(52) The disparity between an image input patch and a matching reference patch determined based on any of Eqs. (4), (8) or (11) may be used by the processing device 103 to generate depth information for a depth map 107.
(53) 2. Pattern Correlation based on Probability.
(54) To generate estimated depth information based on a probability that an image input patch matches a reference light pattern patch, such as a patch of the reference light pattern 104, a pattern correlation based on Bayes' rule may be used. That is, Bayes' rule may be used to determine the probability that an image input patch belongs to a particular class c of reference light pattern patches. Equation (12) below provides a simplified way to estimate the probability P that a 4×4 tile T (or patch) belongs to a class c:

$$\log P(c \mid T) = \log \prod_{t} P(t \mid c) = \sum_{t} \log P(t \mid c), \tag{12}$$

in which t ranges over the pixels of T having a value of 1.
(55) Rather than performing multiplications, as indicated by the middle term of Eq. (12), the probability that an image input patch belongs to a particular class c of reference light pattern patches may be determined using only addition operations, as indicated by the rightmost term of Eq. (12). Thus, the probability P(c|T) may be represented by a sum of log probabilities instead of a product of probabilities. For 192 unique patterns of size 4×4 pixels, t may take a value in [0, 15] and c may take a value in [1, 192]. A 16×192 matrix M may be formed in which each entry represents log P(t|c). When an image input patch is to be classified, it may be correlated with each column of the matrix to obtain log P(c|T) for each class. The class having the highest probability corresponds to the final matched class. The entries of the matrix M may be learned from a dataset formed from structured-light images in which the depth value of each reference pixel is known. Alternatively, the matrix M may be formed by a linear optimization technique or by a neural network. The performance of the pattern-correlation approach depends on how well the matrix M may be learned.
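A minimal sketch of this classification step, assuming a learned 16×192 matrix logM whose entry [t, c] holds log P(t|c) (the variable names are illustrative):

    import numpy as np

    def classify_patch(T, logM):
        # T: binarized 4x4 tile; logM[t, c] = log P(t|c), learned offline.
        ones = np.flatnonzero(T.ravel())   # pixel positions t with value 1
        scores = logM[ones].sum(axis=0)    # Eq. (12): sum of log-probs per class
        return int(np.argmax(scores))      # class with the highest probability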
(57) The disparity between an image input patch and a matching reference patch determined by using this pattern-correlation approach may likewise be used by the processing device 103 to generate depth information for a depth map 107.
(58) 3. Pattern Classification by Lookup Table.
(59) The estimated depth information generated by the processing device 103 may also be generated by using a lookup table (LUT) to classify an image input patch as belonging to a particular class c. That is, an LUT may be generated that contains probability information that an image patch belongs to a particular class c of patches of a reference light pattern.
(60) In one embodiment, an LUT may have 2^16 keys to account for all possible 4×4 binarized input patterns. One technique for generating a value corresponding to each key is based on the probability that an image input patch belongs to a class c, as described in connection with the pattern-correlation approach above.
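A sketch of such a table, assuming the 2^16 keys are formed by packing the binarized 4×4 patch into a 16-bit integer and the stored values are class IDs derived from the log-probability matrix of the previous section (all names are illustrative assumptions):

    import numpy as np

    def patch_key(T):
        # Pack a binarized 4x4 patch into a 16-bit integer key.
        bits = T.ravel().astype(np.uint32)
        return int(bits @ (1 << np.arange(16, dtype=np.uint32)))

    def build_class_lut(logM):
        # Precompute the best class for every possible 16-bit key, so that
        # classifying a patch at run time is a single O(1) array access.
        lut = np.empty(2 ** 16, dtype=np.uint8)  # 192 classes fit in one byte
        for key in range(2 ** 16):
            T = (key >> np.arange(16)) & 1       # unpack the key to 16 bits
            lut[key] = logM[np.flatnonzero(T)].sum(axis=0).argmax()
        return lut

    # Run-time use: predicted_class = lut[patch_key(binarized_patch)]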
(62) In an embodiment in which an image input patch is large, an LUT corresponding to the LUT 604 that is indexed by the entire patch may become impractically large. Instead, the input patch may be divided into 4×4 sub-patches that each index the LUT, and the individual sub-patch results may be combined, for example by voting, to classify the patch.
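A sketch of one such scheme, assuming the 4×4-keyed LUT and the patch_key helper from the previous sketch; the whole-patch class is taken as the majority vote of its non-overlapping 4×4 sub-patches (an illustrative reading of the "LUT + Voting" row of Table 1 below):

    import numpy as np

    def classify_large_patch(patch, lut):
        # Look up every non-overlapping 4x4 sub-patch, then majority-vote.
        h, w = patch.shape
        votes = [lut[patch_key(patch[r:r + 4, c:c + 4])]
                 for r in range(0, h - 3, 4)
                 for c in range(0, w - 3, 4)]
        return int(np.bincount(votes).argmax())  # most-voted class ID wins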
(64) The overall size of the LUT may be further reduced by replacing the LUT 604 (or the LUT 704) with an LUT that contains precomputed class identifications. For example, a binarized input pattern such as 0, 0, . . . , 0, 1, 0 may be used directly as a key, and the entry stored for that key is the predicted class ID (indicated at 806). In such an embodiment, each lookup returns the matched class directly, so no probability vector needs to be stored or searched.
(67) Once an image patch has been matched to a reference light pattern patch, the disparity between the image patch and the matching patch may be determined at 1005. At 1006, depth information for each image patch may be determined from its disparity. At 1007, the process ends.
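The final disparity-to-depth step is not spelled out in this text; for a rectified projector-camera pair it is commonly the standard triangulation relation sketched below (parameter names are illustrative, and this is not necessarily the patent's exact formulation):

    def disparity_to_depth(disparity_px, focal_px, baseline_m):
        # depth = focal_length * baseline / disparity, the standard relation
        # for a rectified pair.
        if disparity_px <= 0:
            return float('inf')   # zero disparity corresponds to infinite depth
        return focal_px * baseline_m / disparity_px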
(68) Table 1 sets forth a few quantitative comparisons between a typical stereo-matching approach and the matching approaches disclosed herein. The computational complexity of a typical stereo-matching approach may be represented by O(P*S), in which P is the patch size and S is the search size. The speed of a typical stereo-matching approach is taken as a baseline of 1×, and the amount of memory it needs is 2 MB.
(69) TABLE 1. Quantitative Comparisons

    Approach                   Complexity   Speed     Memory
    Typical stereo-matching    O(P*S)       1×        2 MB
    Hardcoding                 O(P)         16×       0
    Correlation                O(P)         4×        3 kB
    LUT                        O(P)         32×       12 MB
    LUT + Voting               O(1)         >1000×    64 KB
(70) The matching approaches disclosed herein are computationally simpler and much faster than a typical stereo-matching approach. Depending on which approach is used, the amount of memory they require may also be significantly smaller than the amount of memory a typical matching approach uses.
(71) As will be recognized by those skilled in the art, the innovative concepts described herein can be modified and varied over a wide range of applications. Accordingly, the scope of claimed subject matter should not be limited to any of the specific exemplary teachings discussed above, but is instead defined by the following claims.