Multi-sensor array including an IR camera as part of an automated kitchen assistant system for recognizing and preparing food and related methods
11618155 · 2023-04-04
Assignee
Inventors
- Ryan W. Sinnet (Pasadena, CA, US)
- Robert Anderson (Pasadena, CA, US)
- Zachary Zweig Vinegar (Los Angeles, CA, US)
- William Werst (Pasadena, CA, US)
- David Zito (Pasadena, CA, US)
- Sean Olson (Pacific Palisades, CA, US)
Cpc classification
G06F18/2414
PHYSICS
G06Q10/06311
PHYSICS
G06V20/52
PHYSICS
G05B19/42
PHYSICS
A23V2002/00
HUMAN NECESSITIES
A47J36/32
HUMAN NECESSITIES
A23L5/15
HUMAN NECESSITIES
G06Q20/202
PHYSICS
A23L5/10
HUMAN NECESSITIES
International classification
A23L5/10
HUMAN NECESSITIES
A47J36/32
HUMAN NECESSITIES
B25J9/00
PERFORMING OPERATIONS; TRANSPORTING
G05B19/42
PHYSICS
G06Q10/0631
PHYSICS
Abstract
An automated kitchen assistant system inspects a food preparation area in the kitchen environment using a novel sensor combination. The combination of sensors includes an Infrared (IR) camera that generates IR image data and at least one secondary sensor that generates secondary image data. The IR image data and secondary image data are processed to obtain combined image data. A trained convolutional neural network is employed to automatically compute an output based on the combined image data. The output includes information about the identity and the location of the food item. The output may further be utilized to command a robotic arm, kitchen worker, or otherwise assist in food preparation. Related methods are also described.
Claims
1. An automated food preparation system for preparing at least one food item in a working area of a kitchen, the system comprising: a first camera for generating first image data from a first view of the working area; a second camera for generating second image data from a second view of the working area; a display; a computer comprising: a kitchen scene understanding module operable to: (a) transform each of the first image data and the second image data into a single frame of reference; (b) compute identity and location information of the at least one food item based on the transformed first image data and the transformed second image data; (c) continuously update the location information of the at least one food item based on: (i) the computed location, (ii) prior information about the at least one food item, and optionally (iii) human input about the at least one food item; a food preparation supervisory module operable to: (a) continuously evaluate the updated location information of the at least one food item in view of recipe data for the at least one food item, (b) generate a command to prepare the at least one food item based on the evaluating step, and (c) send the command to the display for a human worker or robotic arm to execute.
2. The system of claim 1, wherein the food preparation supervisory module generates the command based on at least one input selected from the group consisting of: recipe data, an inventory of kitchen implements, information on food items, information on food preparation items, and orders from a restaurant's point of sale (POS) system.
3. The system of claim 1, wherein the display is an interactive tablet.
4. The system of claim 1, wherein the kitchen scene understanding module comprises a CNN trained to recognize and locate each food item from only one frame of reference.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE INVENTION
(10) Before the present invention is described in detail, it is to be understood that this invention is not limited to particular variations set forth herein as various changes or modifications may be made to the invention described and equivalents may be substituted without departing from the spirit and scope of the invention. As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process act(s) or step(s) to the objective(s), spirit or scope of the present invention. All such modifications are intended to be within the scope of the claims made herein.
(11) Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as the recited order of events. Furthermore, where a range of values is provided, it is understood that every intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the invention. Also, it is contemplated that any optional feature of the inventive variations described may be set forth and claimed independently, or in combination with any one or more of the features described herein.
(12) All existing subject matter mentioned herein (e.g., publications, patents, patent applications and hardware) is incorporated by reference herein in its entirety except insofar as the subject matter may conflict with that of the present invention (in which case what is present herein shall prevail).
(13) Reference to a singular item includes the possibility that there are plural of the same items present. More specifically, as used herein and in the appended claims, the singular forms “a,” “an,” “said” and “the” include plural referents unless the context clearly dictates otherwise. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation. Last, it is to be appreciated that unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
(14) Apparatus Overview
(16) System 100 is shown having a plurality of sensors 110, a robotic arm 120, and an enclosure 130 for housing a processor and other hardware which are operable, as described further herein, to receive data from the sensors 110, process the data, and to recognize and locate the food 140, 142. Although food 140, 142 are shown as a bun and a burger, respectively, it is to be understood that the types of food contemplated herein may vary widely. Examples of food items include, without limitation, meat, burgers, vegetables, potatoes, fries, pizza, seasonings, sauces, frostings, fruits, starches, water, oils and other ingredients or combinations thereof.
(17) Additionally, in embodiments, the system 100 is operable to automatically control the robotic arm 120 to carry out one or more steps in preparing the food.
(18) Additionally, the motion and configuration of the robotic arm may vary widely. Examples of robotic arms, motion, training, and systems are shown and described in Provisional Patent Application No. 62/467,743, filed Mar. 6, 2017, entitled “Robotic System for Preparing Food Items in a Commercial Kitchen”; US Patent Publication No. 2017/0252922 to Levine et al.; and U.S. Pat. No. 9,785,911 to Galluzzo et al., each of which is incorporated by reference in its entirety.
(20) The number and types of sensors 110 may vary widely. In embodiments, the plurality of sensors includes a visible spectrum camera (e.g., a black and white, or RGB camera), a depth sensor, and an infrared (IR) camera.
(21) The infrared or IR camera generates IR image data by measuring the intensity of infrared waves and providing data representing such measurements over the observed area. In embodiments, the focal length of the camera lens and the orientation of the optics have been set such that the area imaged includes the work area. Preferably, the IR camera is adapted to measure the intensity of IR waves (typically in the range of 7.2 to 13 microns, although other IR wavelengths may be used) over an area and to generate IR image data. An exemplary IR sensor is the CompactPro high resolution thermal imaging camera manufactured by Seek Thermal Corporation (Santa Barbara, Calif.), which can provide an image of size 320×240 with each value a 16-bit unsigned integer representing measured IR intensity.
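As an aside, a 16-bit IR intensity frame of the kind described above is commonly rescaled to 8-bit values before being combined with other image channels (8-bit conversion is mentioned in the pre-processing described later). The following sketch shows one simple linear rescaling; the function name and the auto-contrast defaults are illustrative assumptions, not part of the patent.

```python
def ir_to_8bit(ir_frame, lo=None, hi=None):
    """Linearly rescale a 16-bit IR frame (list of rows) to 8-bit values.

    lo/hi default to the frame's own min/max (auto-contrast); both the
    clamping and the linear mapping are illustrative choices.
    """
    flat = [v for row in ir_frame for v in row]
    lo = min(flat) if lo is None else lo
    hi = max(flat) if hi is None else hi
    span = max(hi - lo, 1)  # avoid division by zero on a flat frame
    return [[min(255, max(0, (v - lo) * 255 // span)) for v in row]
            for row in ir_frame]
```

A 320×240 CompactPro frame would be rescaled row by row in the same way.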
(22) In embodiments, the visible spectrum camera is an RGB camera that generates image data. The RGB image comprises a 960 by 540 grid with intensity data for the red, green, and blue portions of the spectrum for each pixel in the form of 8-bit unsigned integers. In embodiments, the focal length of the camera lens and the orientation of the optics have been set such that the area imaged includes the work surface. An exemplary visible spectrum sensor is the Kinect One sensor manufactured by Microsoft Corporation (Redmond, Wash.). In embodiments, a black and white visible spectrum camera is used.
(23) A depth sensor incorporates a time of flight (TOF) camera to generate data on the distance of each point in the field of view from the camera. The TOF camera is a range imaging camera system that resolves distance based on the known speed of light, measuring the time of flight of a light signal between the camera and the subject for each point of the image. In embodiments, the image comprises a 960 by 540 grid with a value of the distance from the sensor for each point in the form of a 16-bit unsigned integer. An exemplary depth sensor is the Kinect One sensor manufactured by Microsoft Corporation (Redmond, Wash.). In embodiments, other types of depth sensors are employed, such as devices using texturing (typically performed with an IR or near-IR projector and two sensors) and stereo reconstruction, lidar, and stereoscopic cameras.
(24) Without intending to be bound to theory, we have discovered that IR camera sensors providing IR image data have the potential to mitigate or overcome the above-mentioned shortcomings associated with conventional automated cooking equipment. Due to the temperature differences typically present when an uncooked food is placed on a hot grill or other high temperature cooking surface, or when a kitchen worker or a kitchen worker's appendage is imaged against a predominantly room temperature background, IR camera sensors are able to provide high contrast, high signal-to-noise image data that is an important starting point for determining the identity and location of kitchen objects, including food items, food preparation items, and human workers. By contrast, the signal-to-noise ratio is significantly lower when using only traditional RGB images than when using IR images, because some kitchen backgrounds, work surfaces, and cooking surfaces can be similar to food items in color, while their temperatures are generally significantly different. Based on the foregoing, embodiments of the invention include IR camera sensors in combination with other types of sensors as described herein.
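The contrast claim above can be quantified with any standard contrast measure. As a minimal sketch (the choice of Michelson contrast over mean intensities is an assumption for illustration, not the patent's metric), a hot burger against a cooler grill yields a much higher IR contrast than its RGB contrast against a similarly colored surface:

```python
def contrast(foreground, background):
    """Michelson-style contrast between the mean foreground and mean
    background intensities: |f - b| / (f + b), in [0, 1]."""
    f = sum(foreground) / len(foreground)
    b = sum(background) / len(background)
    return abs(f - b) / (f + b)
```

For example, IR readings of 300 against a background of 100 give a contrast of 0.5, whereas a brown patty on a brown grill may give near-zero contrast in RGB.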
(26) Step 202 states to provide a sensor assembly. The sensor assembly may include a plurality of sensors, at least one of which is an IR camera as described herein.
(27) Step 204 states to inspect the food preparation work area to obtain sensor image data. As described further herein, in embodiments, the sensors generate data in the form of image data of an area.
(28) Step 206 states to process the image data from the sensors. As described further herein, the image data is input to a customized software program, engine, or module. In embodiments, the image data is input to a Kitchen Scene Understanding Engine, which may include a trained convolutional neural network or another means for processing and object recognition.
(29) Step 208 states to compute identity and location information of the food item or food preparation item. In embodiments, a probability of the identity and area within which the food item or food preparation item is located is computed by a Kitchen Scene Understanding Engine.
(30) It is to be understood that in addition to identifying and locating food, step 208 is equally applicable to identifying and locating kitchen implements and other objects detected by the sensors such as, without limitation, the kitchen worker or a part of the kitchen worker, such as a hand. Herein, the kitchen worker or a portion of the kitchen worker, the robot or a portion of the robot, and kitchen implements including appliances, dishware, and tools used in the preparation of food are collectively referred to as “food preparation items.” Additionally, by “kitchen object” it is meant either a food item or a food preparation item.
(31) Optionally, and as discussed further herein, the identity and location information may be used to control a robotic arm or instruct a kitchen worker, or otherwise carry out a desired food preparation step, such as for example, turning on an appliance.
(32) Optionally, the control of the robotic arm is done autonomously or automatically, namely, without human instruction to carry out particular movements.
(34) The computer 212 is shown connected to sensors 220, restaurant's point of sale (POS) system 222, human input device 224, display 250, controller 230 for the robotic arm 232, and data log 240.
(35) In embodiments, one or more of the components are remote and connected to the other components of the robotic kitchen assistant system via the Internet or other type of network.
(37) In embodiments, the Kitchen Scene Understanding Engine 310 serves to track all relevant objects in the work area, including but not limited to food items, kitchen implements, and human workers or parts thereof. Data on these objects, including but not limited to their identity and location, are provided to the Food Preparation Supervisory System 320, which generates the instructions for preparing the food item. These instructions are provided to either or both the Robotic Food Preparation System 350 and the human worker by display 340. In some embodiments, the Food Preparation Supervisory System 320 detects the presence of new food preparation items and automatically begins the food preparation process. In some embodiments, the Food Preparation Supervisory System 320 is operable to signal the Robotic Food Preparation System 350 to control the robot arm, or to instruct a human worker, to retrieve raw ingredients from nearby cold or dry storage based on an order received from the restaurant's POS system.
(38) In embodiments, once the appropriate food preparation item is recognized by the Kitchen Scene Understanding Engine 310, the Food Preparation Supervisory System 320 begins the food preparation process for that item. For example, in embodiments, the processor is operable to use recipe data to select actions and send appropriate signals to the system's controller to generate motion by the robot arm that manipulates the food on the work surface and/or signals the human worker to perform a task by displaying information on the display.
(41) The combined image data serves as the input layer 450 to a trained convolutional neural network (CNN) 460.
(42) As shown with reference to step 460, a CNN processes the image input data to produce the CNN output layer 470. In embodiments, the CNN has been trained to identify food items and food preparation items, kitchen items, and other objects as may be necessary for the preparation of food items. Such items include but are not limited to human workers, kitchen implements, and food.
(43) For each set of combined image data provided as an input layer to the CNN, the CNN outputs a CNN output layer 470 containing location in the image data and associated confidence levels for objects the CNN has been trained to recognize. In embodiments, the location data contained in the output layer 470 is in the form of a “bounding box” in the image data defined by two corners of a rectangle.
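The output layer described above pairs each recognized object with a confidence level and a two-corner bounding box. A minimal sketch of consuming such an output is below; the tuple layout and the confidence threshold are illustrative assumptions, since the patent does not specify an encoding.

```python
# Hypothetical detection record: (label, confidence, (x1, y1), (x2, y2)),
# where the two points are opposite corners of the bounding box.
def filter_detections(output_layer, min_conf=0.5):
    """Keep detections above a confidence threshold and report the
    bounding-box center as a simple stand-in for object location."""
    kept = []
    for label, conf, (x1, y1), (x2, y2) in output_layer:
        if conf >= min_conf:
            center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
            kept.append((label, conf, center))
    return kept
```

Downstream modules could then treat the surviving centers as candidate object locations for tracking.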
(44) As described further herein, one embodiment of the CNN 460 is a combination of a region proposal network and a CNN. An example of a region proposal network and CNN is described in Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Issue 6, June 2017, which is hereby incorporated by reference in its entirety. Examples of other types of convolutional neural networks are described in Patent Publication Nos. US 2017/0169315, entitled “Deeply learned convolutional neural networks (CNNs) for object localization and classification”; US 2017/0206431, entitled “Object detection and classification in images”; and U.S. Pat. No. 9,542,621, entitled “Spatial pyramid pooling networks for image processing,” each of which is herein incorporated by reference in its entirety.
(45) Optionally, the accuracy of the object's location within the image may be further computed. In some embodiments, for example, image data from at least one sensor are further processed using known transformations and machine vision techniques to more accurately determine an object's location. In some embodiments, for example, IR image data measured within the area defined by the bounding box taken from the CNN output layer is further processed to more accurately determine an object's location. Techniques to do so include various computer vision and segmentation algorithms known in the art, such as Ohta, Yu-Ichi, Takeo Kanade, and Toshiyuki Sakai, “Color information for region segmentation,” Computer Graphics and Image Processing 13.3 (1980): 222-241; and Beucher, Serge, and Fernand Meyer, “The morphological approach to segmentation: the watershed transformation,” Optical Engineering 34 (1992): 433.
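One crude way to refine a location from IR data inside a bounding box, far simpler than the cited segmentation algorithms but illustrative of the idea, is to threshold the IR patch and take the centroid of the hot pixels. The threshold-and-centroid approach below is an assumption for illustration, not the patent's stated method.

```python
def ir_centroid(ir_patch, threshold):
    """Centroid (x, y) of pixels hotter than `threshold` within a
    bounding-box patch of IR data; returns None if no pixel qualifies."""
    xs, ys, n = 0.0, 0.0, 0
    for y, row in enumerate(ir_patch):
        for x, v in enumerate(row):
            if v > threshold:
                xs += x
                ys += y
                n += 1
    if n == 0:
        return None
    return (xs / n, ys / n)
```

The centroid, expressed in patch coordinates, would then be offset by the bounding-box corner to obtain an image-frame location.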
(46) In some embodiments, determining location information includes determining information on orientation including angular position, angle, or attitude.
(47) It is to be appreciated that the direct incorporation of the IR image data into the image data that, along with the RGB and depth data, makes up the input layer 450 to the CNN 460 improves the performance of the system. Although determining exactly why the inclusion of a given sensor improves the capabilities of a CNN is challenging because of the nature of CNNs, we conjecture, and without intending to be bound to theory, that the IR data offer higher signal-to-noise ratios for certain objects of a given temperature in a kitchen environment where such objects are often placed on work surfaces or imaged against backgrounds with significantly different temperatures. In cases where the CNN is used to recognize foods by the extent to which they are cooked, the IR data provides helpful information to the CNN on the thermal state of the food item and work surface, which can be a cooking surface.
(49) In some embodiments, the location data given in the CNN output layer 470 is further processed by operating exclusively on the IR image data to more accurately identify the location of objects identified by the CNN in a three dimensional coordinate frame which may be the world coordinate frame using standard computer vision algorithms as referenced herein.
(50) The resulting vector shown in
(51) The Kitchen Bayesian Belief Engine 492, described further herein, receives the object output vector 490 and aggregates the real-time continuous stream of these vectors into a set of beliefs that represents the state of all recognized food and kitchen implements in the kitchen area.
(54) Step 520 states to evaluate vector 510 to assess whether the recognized objects represent new objects as yet unidentified or are existing objects that have been previously recognized.
(55) The resulting information is then processed by a belief update law 530 which evaluates the observations in the context of the system's prior beliefs 540 as well as any human input 550 that may have been supplied.
(56) The output of the belief update rules or law is a final set of beliefs 560 on the state of the system. The state includes identity and location of all known objects in the observation area. In a sense, the output of the engine 500 is an atlas or aggregated set of information on the types of food, kitchen implements, and workers within the work space. An example of a final set of beliefs is represented as a list of objects that are believed to exist with associated classification confidences and location estimates.
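The new-versus-existing decision and the belief update law described in the preceding paragraphs can be sketched as follows. The nearest-neighbor matching radius, the blending gain, and the dictionary record layout are all illustrative assumptions; the patent does not specify the update rule.

```python
def update_beliefs(beliefs, observation, match_radius=50.0, gain=0.5):
    """One step of a belief-update rule: if an observed object of the
    same class lies within match_radius of an existing belief, blend
    the location toward the observation and keep the higher confidence;
    otherwise treat it as a newly seen object."""
    label, conf, (x, y) = observation
    for b in beliefs:
        if b["label"] == label:
            bx, by = b["loc"]
            if ((x - bx) ** 2 + (y - by) ** 2) ** 0.5 <= match_radius:
                b["loc"] = (bx + gain * (x - bx), by + gain * (y - by))
                b["conf"] = max(b["conf"], conf)
                return beliefs
    beliefs.append({"label": label, "conf": conf, "loc": (x, y)})
    return beliefs
```

Human input (e.g., a worker correcting a misidentified item) could be folded in as an observation with confidence 1.0.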
(57) As stated above, in embodiments, the data from multiple sensors is pre-processed prior to being fed to the CNN.
(58) Step 610 states to create multi-sensor point cloud. Image data from RGB and depth sensors are combined into a point cloud as is known in the art. In embodiments, the resulting point cloud is a size of m by n with X, Y, Z, and RGB at each point (herein we refer to the combined RGB and depth image point cloud as “the RGBD point cloud”). In embodiments, the size of the RGBD point cloud is 960 by 540.
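Building a point cloud from a depth image uses the standard pinhole back-projection; the sketch below shows the geometry for the depth channel alone (the RGB values would simply be carried along per point). The intrinsic parameters fx, fy, cx, cy are placeholders that would come from calibration.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (rows of Z values) into 3-D points
    using a pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Zero-depth pixels (no return) are skipped."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

Applied to a 960 by 540 depth grid, this yields the m by n point cloud described above, with RGB attached at each point to form the RGBD point cloud.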
(59) Step 620 states to transform the multi-sensor point cloud to the IR sensor coordinates. The process of transforming an image from one frame to another is commonly referred to as registration (see, e.g., Lucas, Bruce D., and Takeo Kanade. “An iterative image registration technique with an application to stereo vision.” (1981): 674-679). Particularly, in embodiments, the RGBD point cloud is transformed into the frame of the IR camera using extrinsic transformations and re-projection. In embodiments, because the field of view of the RGB and depth sensors is larger than the field of view of the IR sensor, a portion of the RGB and depth data is cropped during registration and the resulting RGBD point cloud becomes 720 by 540.
(60) Step 630 states to register the multi-sensor point cloud to the IR sensor data and coordinates. The transformed RGBD point cloud is registered into the IR frame by projecting the RGBD data into the IR image frame. In embodiments, the resulting combined sensor image input data is 720 by 540 RGBD, and IR data for each point. In embodiments, values are converted to 8-bit unsigned integers. In other embodiments, the registration process is reversed and the IR image is projected into the RGBD frame.
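The transform-and-project operation in steps 620-630 can be sketched as below: each 3-D point is moved into the IR camera's frame by an extrinsic rotation and translation, then projected through the IR camera's intrinsics to a pixel coordinate. The row-major rotation matrix and the specific intrinsic values are illustrative assumptions.

```python
def project_to_ir(points, R, t, fx, fy, cx, cy):
    """Project 3-D points into IR pixel coordinates: q = R p + t, then
    (u, v) = (fx qx / qz + cx, fy qy / qz + cy). Points behind the
    camera (qz <= 0) are dropped."""
    pixels = []
    for p in points:
        # rotate and translate the point into the IR camera frame
        q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        if q[2] > 0:
            pixels.append((fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy))
    return pixels
```

With an identity rotation and zero translation, this reduces to a plain pinhole projection, which is a useful sanity check when validating a registration pipeline.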
(61) In embodiments with multiple sensors, including an IR camera, the registration of the data from the various sensors simplifies the training of the CNN. Registering the IR data and the RGB and depth data in the same frame of reference converts the input (namely, the image input data 450) into a single, spatially consistent representation.
(62) Following step 630, the registered multi-sensor image data is fed into the CNN.
(64) In embodiments, the output layer of the CNN is the prediction vector which gives the objects recognized by the CNN, along with a confidence level (e.g., from zero to one), and their location in the two dimensional image data. In embodiments, the location is characterized using a bounding box and denoting two corner points of the bounding box in the image plane.
(65) The length of the output vector is equal to the number of objects that the CNN has been trained to identify. In embodiments, the length of the output vector ranges from 1 to 500, preferably from 50 to 200, and most preferably from 75 to 125.
(66) Training the CNN
(68) First, sensors, including an IR sensor, are set up and trained on the work area 810.
(69) Second, with reference to step 820, the correct extrinsic and intrinsic calibration data are calculated and applied.
(70) Third, with reference to step 830, relevant objects are placed in the work area and image input data is generated which comprises an image of multiple channels representing the intensity of light at various wavelengths (e.g., red, green, blue, IR) and depth.
(71) Fourth, with reference to step 840, the image data or a portion of the image data is presented to a human user who identifies relevant objects in the image and creates bounding boxes for the images. The data from the human user is then recorded into the form of the output layer that the CNN should create when presented with the input image data.
(72) Fifth, with reference to step 850, the input images and output layer are presented and the parameters of the CNN are adjusted. Exemplary techniques to tune the weights of the CNN include without limitation backpropagation and gradient descent. The process is repeated multiple times for each image that the CNN is being trained to identify. With each iteration, the weighting factors of the CNN are modified.
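The weight-adjustment idea in step 850 can be illustrated at toy scale. The sketch below applies the same gradient-descent rule to a single weight with a squared-error loss; a real CNN applies this rule to millions of weights via backpropagation, and the learning rate here is an arbitrary illustrative value.

```python
def gradient_descent_step(w, x, target, lr=0.1):
    """One update of a single weight: prediction w * x, squared-error
    loss (w * x - target)^2, and a step down the loss gradient."""
    pred = w * x
    grad = 2 * (pred - target) * x  # d/dw of (w * x - target)^2
    return w - lr * grad

# Repeated iterations drive the weight toward the value that fits
# the training pair (x = 1.0, target = 3.0).
w = 0.0
for _ in range(100):
    w = gradient_descent_step(w, 1.0, 3.0)
```

Each pass over the training images in step 850 plays the same role as one iteration of this loop.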
(73) In embodiments, the output vector comprises multiple instances of known food items that are differentiated by the degree that they are cooked (namely, “degree of doneness”). In embodiments, the measure of cooking is the internal temperature of the object, such as a steak cooked to medium rare corresponding to an internal temperature of 130 to 135 degrees Fahrenheit. In embodiments, the CNN is trained to detect not just individual objects and their location, but the internal temperature of the objects. Measurements of the internal temperature of the food item can be taken with temperature sensors and used in the output vector for the training of the CNN. In some embodiments, these temperature measurements are taken dynamically by a thermocouple that is inserted into the food item.
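A degree-of-doneness label derived from internal temperature can be a simple banded lookup. In the sketch below, only the medium rare band (130 to 135 degrees Fahrenheit) comes from the text above; the other bands are common culinary values supplied as illustrative assumptions.

```python
# Upper bounds (degrees F, exclusive) for each doneness band for beef.
BANDS = [(130, "rare"), (136, "medium rare"), (145, "medium"),
         (156, "medium well")]

def doneness(temp_f):
    """Map an internal temperature to a doneness label."""
    for upper, label in BANDS:
        if temp_f < upper:
            return label
    return "well done"
```

In training, the thermocouple-measured internal temperature would be converted to such a label to form the doneness classes in the output vector.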
(74) In embodiments, an alternate or additional thermal model is used to track the estimated internal temperature of various food items to determine when they are cooked to the appropriate level. In these cases, data can be provided by the Kitchen Scene Understanding Engine on how long the various items have been cooked and their current surface temperature and or temperature history as measured by the IR camera.
(75) Calibration
(76) Preferably, each sensor is calibrated with a calibration target capable of obtaining known, high signal-to-noise ratio observations in a known coordinate frame which may be translated into a 3D or world coordinate frame.
(77) The calibration target or tool 900 is shown having a spatula-shaped body 910 that is attached to the end of the robotic arm 920. The calibration target may be comprised of a metal sheet 922 featuring a pattern of circles 924. The circles and planar surface, or backplane, have been engineered to provide high signal-to-noise ratio signals in both the RGB and IR spectrum. In addition, the surface of the calibration target is smooth, increasing the strength of the signal for the depth sensor.
(78) In embodiments, the calibration target is comprised of a 4 by 5 pattern of equally-spaced black dots 924 on a surface with a white background. However, the size, number, spacing, and pattern may vary and include other patterns and shapes including symbols of symmetrical and asymmetrical nature.
(79) The high contrast between the black dots and white background when measured in the visible spectrum provides a high-quality signal for the RGB camera. Additionally, the black dots are comprised of a high thermal emissivity material and the background is comprised of an insulating or low thermal emissivity material, resulting in a high contrast reading when imaged with an IR camera.
(80) In embodiments, the tool 900 is manufactured by creating the disc-shaped holes 924, and subsequently filling the holes with a material having a color and emissivity different than that of the background 922.
(82) To prevent non-uniformities from being generated by the resistive heating element 950, the calibration target can be warmed for a period using the heating element, after which the power to the heating element is shut off. The calibration process can then be performed while the calibration target cools, thereby minimizing potential non-uniformities in the IR image data caused by non-uniformities in the heating supplied by the resistive heating element and/or the fact that the resistive heating element may not uniformly cover the back surface of the backplane.
(83) A method for performing calibration is described herein. Initially the calibration target 910 is mounted on a fixture that enables it to be attached as the end effector 916 of the robot arm 920.
(84) Next, the calibration target is heated by applying power to the embedded resistive heating element. After that, the power to the heating element is turned off. The robotic arm then moves the calibration target around the workspace, capturing image data at multiple locations as measured in the coordinate frame of the robot and the various sensor images. At locations in the workspace where the calibration target is seen by all three sensors, calibration data is generated comprising image data from the sensors as measured in their respective imaging coordinate system and the measured XYZ position of the calibration target as measured by the robot arm. The location of the calibration target in the image data is determined as is known in the art using, for example, computer vision algorithms. The location along with the depth measured by the depth sensor at that point is then correlated to the measured XYZ position of the end effector. In this way, the three-dimensional position of the calibration target is registered to the two-dimensional information of the RGB and IR cameras and the measured depth information from the depth sensor.
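The correlation step above, relating target positions measured by the robot arm to positions measured by a sensor, can be reduced to estimating the transform between the two frames. As a deliberately simplified sketch (full calibration also solves for rotation, e.g., via a Kabsch-style fit, which the patent does not detail), the following estimates only the translation between rotationally aligned frames from matched point pairs:

```python
def estimate_translation(robot_pts, sensor_pts):
    """Average offset between matched 3-D calibration-target positions
    measured in the robot frame and in a sensor frame, assuming the two
    frames are already rotationally aligned (a simplification)."""
    n = len(robot_pts)
    return tuple(sum(r[i] - s[i] for r, s in zip(robot_pts, sensor_pts)) / n
                 for i in range(3))
```

Averaging over many target poses, as the robot sweeps the workspace, reduces the effect of per-observation sensor noise.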
(85) The calibration method may vary. In some embodiments, for example, the tool attachment could be automated, such as through the use of an automatic end effector changing system such as the QC 11 pneumatic tool changing system 916.
(86) The calibration tool 900 serves to provide known and overlapping, high signal-to-noise ratio observations suitable for the RGB, depth, and IR sensors. The known and often overlapping nature of these images enables one to compute the position of each sensor's data relative to the other sensors' data.
(87) Other modifications and variations can be made to the disclosed embodiments without departing from the subject invention.