ROBOTIC DRAWING
20220032468 · 2022-02-03
Inventors
CPC classification
B25J11/00
PERFORMING OPERATIONS; TRANSPORTING
B25J15/0019
PERFORMING OPERATIONS; TRANSPORTING
B25J13/08
PERFORMING OPERATIONS; TRANSPORTING
International classification
B25J11/00
PERFORMING OPERATIONS; TRANSPORTING
B25J13/08
PERFORMING OPERATIONS; TRANSPORTING
B25J15/00
PERFORMING OPERATIONS; TRANSPORTING
Abstract
A method includes providing a robot, providing an image of drawn handwritten characters to the robot, enabling the robot to capture a bitmapped image of the image of drawn handwritten characters, enabling the robot to infer a plan to replicate the image with a writing utensil, and enabling the robot to reproduce the image.
Claims
1-3. (canceled)
4. A method comprising: providing a robot; providing an image of drawn handwritten characters to the robot; enabling the robot to capture a bitmapped image of the image of drawn handwritten characters; enabling the robot to infer a plan to replicate the image with a writing utensil; and enabling the robot to reproduce the image; wherein enabling the robot to reproduce the image comprises enabling the robot to draw each target stroke in one continuous drawing motion to write from a dataset of demonstrations; wherein enabling the robot to reproduce the image comprises providing the robot commands to execute predicted by a model in real time; and wherein the robot commands comprise: commands to make the robot follow each stroke from its start to end; and commands to predict a starting location of a next stroke at an end of a current stroke.
5. The method of claim 4 wherein the commands to make the robot follow each stroke from its start to end are derived from a local model.
6. The method of claim 5 wherein the local model predicts where to move itself next in its 5×5 pixel environment.
7. The method of claim 6 wherein the commands to predict the starting location of the next stroke at the end of the current stroke are derived from a global model.
8. The method of claim 7 wherein the global model predicts the next starting point of the new stroke in a full-scale image plane.
9-13. (canceled)
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] These and other features, aspects, and advantages of the present invention will become better understood with reference to the following description, appended claims, and accompanying drawings where:
DETAILED DESCRIPTION
[0016] The subject innovation is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It may be evident, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the present invention.
[0017] In FIG. 1, a robot manipulator 10 is shown.
[0018] Manipulators such as robot manipulator 10 are composed of an assembly of links and joints. Links are defined as the rigid sections that make up the mechanism, and joints are defined as the connections between two links. A device attached to the manipulator which interacts with its environment to perform tasks is called an end-effector (i.e., link 6 in FIG. 1). Other components, such as a camera, processor and memory, may also be included in the robot manipulator 10.
[0019] Given an image of handwritten characters, robots should draw each target stroke in one continuous drawing motion. Existing methods for robots that write with a utensil are unable to look at a bitmapped image and directly produce a drawing policy. Instead, they require external information about the stroke order for each character, such as human gestures or predefined paths for each letter. This extra information makes it challenging for novice users to teach the robot how to draw new characters, because the stroke order information must be provided. A more recent reinforcement-learning-based approach successfully learns to draw the target image, yet its model still struggles to draw each target stroke in one continuous drawing motion, and frequently draws the same parts repeatedly to replicate the target image.
[0020] Methods of the present invention, in contrast, take as input an image to draw, then generate commands for robots to replicate the image with a writing utensil. The method divides the drawing problem into two scales: 1) the local scale, consisting of a 5×5 pixel window, and 2) the global scale, consisting of the whole image. The method trains two separate networks for the different scales. Unlike other approaches, the present method does not require any predefined handcrafted rules, and learns drawing from stroke-order demonstrations provided only during its training phase.
[0021] The methods of the present invention can look at a bitmapped image of a character that a robot has not previously seen and accurately reproduce the character. In almost all instances, our method also predicts the correct stroke order and direction for the character. In addition, methods of the present invention enable two different robots to draw characters on paper and on a white board in at least ten different languages, including English, Chinese, French, Greek, Hindi, Japanese, Korean, Tamil, Urdu and Yiddish, as well as stroke-based drawings.
[0023] In the present invention, given the target image of a handwritten character, X.sup.target, the goal is to generate a sequence of actions, A={a.sub.1, a.sub.2, . . . , a.sub.L}, for a robot to reproduce X.sup.target. Here we define X.sup.target as a 100×100 binary image, and a command at timestep t as a.sub.t=(Δx, Δy, touch) where Δx and Δy are shifts in x, y coordinates that range between −100 and +100. The variable touch is a boolean value which controls the touch/untouch status of a writing utensil with respect to a canvas.
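The command format defined in paragraph [0023] can be illustrated with a short sketch; the names `Action` and `clamp_shift` are ours, chosen for illustration, and are not part of the claimed method:

```python
from typing import NamedTuple

class Action(NamedTuple):
    """One drawing command a_t = (Δx, Δy, touch)."""
    dx: int      # shift in x, constrained to [-100, +100]
    dy: int      # shift in y, constrained to [-100, +100]
    touch: bool  # True while the writing utensil touches the canvas

def clamp_shift(v: int) -> int:
    # Keep a predicted shift inside the stated ±100 range.
    return max(-100, min(100, v))
```

The boolean `touch` component is what lets a single action sequence encode both drawing strokes and pen lifts between strokes.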
[0024] One aim is to train a parametrized function approximator ƒ.sub.θ such that A=ƒ.sub.θ(X.sup.target). While it is possible to directly estimate θ, dividing the problem into two sub-problems and separately training two specialized distinct models achieves better performance. The first sub-problem is to make the agent follow each stroke from its start to end. A Local Model is designed with parametrized weights θ.sub.L for this task. The second sub-problem is to predict the starting location of the next stroke at the end of the current stroke. A Global Model with weights θ.sub.G is designed for it. The local model predicts where to move itself next in its 5×5 pixel environment. Once it reaches an end, the global model predicts the next starting point of the new stroke. This process is repeated iteratively until the entire target image is visited, yielding the full action sequence A={a.sub.1.sup.G, a.sub.1.sup.L, a.sub.2.sup.L, . . . , a.sub.n.sup.L, a.sub.m.sup.G, a.sub.n+1.sup.L, . . . }.
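The two-scale loop of paragraph [0024] can be sketched as follows; `local_step` and `next_start` are toy stubs standing in for the trained local and global networks, and the pixel-set representation is ours for illustration:

```python
def draw(target, local_step, next_start, max_steps=10000):
    """Alternate the two models until the whole target is visited.

    `target` is a set of stroke pixels. Returns the action sequence A
    (as (dx, dy, touch) tuples) and the visited pixel set.
    """
    visited, actions = set(), []
    pos = next_start(target, visited)          # a_1^G: first stroke start
    for _ in range(max_steps):
        if pos is None:                        # everything visited: done
            break
        visited.add(pos)
        nxt, touch = local_step(target, visited, pos)  # a_t^L
        actions.append((nxt[0] - pos[0], nxt[1] - pos[1], touch))
        pos = nxt
        if not touch:                          # stroke ended: ask global model
            pos = next_start(target, visited)  # a_m^G
    return actions, visited

def local_step(target, visited, pos):
    # Toy local policy: keep moving right along the stroke if possible.
    x, y = pos
    if (x + 1, y) in target and (x + 1, y) not in visited:
        return (x + 1, y), True
    return pos, False                          # lift the pen: stroke done

def next_start(target, visited):
    # Toy global policy: start at the smallest unvisited target pixel.
    remaining = sorted(target - visited)
    return remaining[0] if remaining else None
```

The alternation itself, not the stub policies, is the point: the local model runs until it lifts the pen, and the global model is consulted only at stroke boundaries.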
[0026] Given a starting point, a goal of the local model is to follow a stroke until it reaches an end. A local state at timestep t, s.sub.t.sup.L, is a set of three images.
[0027] 1) X.sub.t.sup.Lenv: region already visited by the local model ((b) in the drawings);
[0028] 2) X.sub.t.sup.Lcon: target region continuously connected with the current location of the local model ((c) in the drawings); and
[0029] 3) X.sub.t.sup.Ldif: difference image between the target image X.sup.target and X.sub.t.sup.Lenv, which is the unvisited region of the target image ((d) in the drawings).
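The three local-state planes of paragraphs [0027]-[0029] can be sketched with pixel sets standing in for binary image planes; the use of 8-connectivity for the connected region is our assumption:

```python
def local_state(target, visited, pos):
    """Build the three local-state planes as pixel sets."""
    env = set(visited)                  # X_t^Lenv: already visited
    dif = target - visited              # X_t^Ldif: unvisited target region
    # X_t^Lcon: target pixels connected to the current location,
    # found by a simple flood fill (8-connectivity assumed).
    con, frontier = set(), [pos]
    while frontier:
        p = frontier.pop()
        if p in con or p not in target:
            continue
        con.add(p)
        x, y = p
        frontier += [(x + dx, y + dy)
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0)]
    return env, con, dif
```

Note that `con` deliberately excludes target pixels belonging to strokes that do not touch the current stroke, which is what distinguishes it from `dif`.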
[0030] A unique characteristic of the local model design is that an extraction procedure is applied to the encoded tensor of shape (100, 100, 64) to extract the (5, 5, 64) tensor centered at the current location of the agent. The reasons why local information is extracted are:
[0031] 1) Generalization of knowledge: Every image of handwritten characters is different, and in order to gain general knowledge of drawing, it is crucial to work at a smaller scale where an agent will encounter similar situations more frequently.
[0032] 2) Computational expense: Feeding large images directly into an RNN (recurrent neural network) to predict a sequence of actions is computationally expensive. By extracting a small region, the size of the input tensors to the RNN cells is drastically reduced, which lowers computational cost and speeds up training.
[0033] 3) Information selection: While the agent draws a stroke, the most important region to focus on is usually the one around the current position. In a broad view, the local network can be seen as a structural attention mechanism where we force our model to attend to the 5×5 local region around the current location.
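The extraction procedure of paragraph [0030] amounts to a centered crop of the encoded tensor; a sketch over nested lists, where zero-padding beyond the image border is our assumption:

```python
def extract_window(tensor, cx, cy, k=5):
    """Crop the k x k patch of an (H, W, C) tensor (nested lists)
    centered at (cx, cy), zero-padding outside the border."""
    H, W = len(tensor), len(tensor[0])
    C = len(tensor[0][0])
    r = k // 2
    zero = [0.0] * C
    return [[tensor[x][y] if 0 <= x < H and 0 <= y < W else zero
             for y in range(cy - r, cy + r + 1)]
            for x in range(cx - r, cx + r + 1)]
```

Applied to the encoded tensor of shape (100, 100, 64) at the agent's location, this yields the (5, 5, 64) tensor the text describes.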
[0034] In order to preserve continuity in drawing actions, a Long Short-Term Memory (LSTM) is used in the local network. As a simple example, when the agent reaches the intersection of two lines, it has choices of going either North, South, East, or West. If we know that the agent came from the North, we can make a reasonable guess that it should go South in order not to disrupt the continuity of the drawing motion. All past actions matter to predict the next action, and we use LSTM to capture this context.
[0035] Now, we formally define how our local network predicts the next action a.sub.t.sup.L. Given a local state at timestep t as s.sub.t.sup.L and current location as (x.sub.t, y.sub.t), our local model first encodes the input tensor, s.sub.t.sup.L using residual networks:
e.sub.t.sup.L=ƒ.sub.θLResidual(s.sub.t.sup.L) (1)
[0036] The residual networks include four residual blocks, each of which contains two sub-blocks of 1) batch normalization layer, 2) rectified linear unit, and 3) two-dimensional convolutional layer. Convolution layers in these four blocks have channels of [[16, 16], [16, 16], [16, 32], [32, 64]], stride of 1, width of 3 and height of 3. After the residual networks layer, we have an encoded tensor e.sub.t.sup.L of shape (100, 100, 64), and we then apply the extraction procedure to e.sub.t.sup.L centered at (x.sub.t, y.sub.t) and receive a new tensor, e.sub.t.sup.L* with shape (5, 5, 64).
[0037] To feed e.sub.t.sup.L* into the LSTM, we reshape it into a vector v.sub.t.sup.L of length 5×5×64=1600:
v.sub.t.sup.L=reshape(e.sub.t.sup.L*) (2)
[0038] We feed v.sub.t.sup.L to the LSTM and receive context vector c.sub.t.sup.L and hidden state representation h.sub.t.sup.L as:
c.sub.t.sup.L,h.sub.t.sup.L=ƒ.sub.θLLSTM([v.sub.t.sup.L,h.sub.t-1.sup.L]) (3)
[0039] Two components of local action a.sub.t.sup.L, the local touch action a.sub.t.sup.Ltouch and the location action a.sub.t.sup.Lloc are calculated from the context vector c.sub.t.sup.L:
a.sub.t.sup.Ltouch=σ(ƒ.sub.θLFC1(c.sub.t.sup.L))
a.sub.t.sup.Lloc=argmax ƒ.sub.θLFC2(c.sub.t.sup.L) (4)
[0040] where σ is a sigmoid function. Finally, the loss function of the local model is given as:
L.sup.Local=−1/NΣ.sub.t.sup.N log ƒ.sub.θL(s.sub.t.sup.L)[a.sub.t.sup.L*] (5)
[0041] where a.sub.t.sup.L* is the true target action provided during training.
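Equation (5) is a standard negative log-likelihood over the demonstrated actions; in stdlib Python, with predicted probabilities encoded as dictionaries keyed by candidate action (an encoding we chose for illustration, not the patent's):

```python
import math

def local_loss(pred_probs, true_actions):
    """Average negative log-likelihood of the demonstrated actions.

    pred_probs[t] maps each candidate action to the probability the
    model assigns it; true_actions[t] is the demonstrated action a_t^L*.
    """
    n = len(true_actions)
    return -sum(math.log(pred_probs[t][true_actions[t]])
                for t in range(n)) / n
```

A perfect prediction (probability 1 on the demonstrated action at every step) gives zero loss; any probability mass placed elsewhere increases it.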
[0042] A goal of the global model is to predict a starting point of the next stroke in a full-scale image. When a.sub.t.sup.Ltouch=0, the global model observes a current state s.sub.t.sup.G, which is a set of four images.
[0043] 1) s.sub.t.sup.Gloc: current location of the local model ((e) in the drawings);
[0044] 2) s.sub.t.sup.Genv: region already visited by the local model ((f) in the drawings);
[0045] 3) s.sub.t.sup.Glast: region recently visited by the local model since the last global prediction ((g) in the drawings); and
[0046] 4) s.sub.t.sup.Gdif: difference image between the target image X.sup.target and s.sub.t.sup.Genv, which is the unvisited region of the target image.
[0047] The global network also has the residual network to encode state images, and it shares all weights with the one in the local model, except for the very first initialization layer. To adjust the channel size of input tensors, the initialization layer in our residual network maps a tensor of shape (N, N, M) to (N, N, 16). Due to the discrepancy in shapes between local and global states, the size of the layer is different. We obtain the global action a.sub.t.sup.G as:
e.sub.t.sup.G=ƒ.sub.θGResidual(s.sub.t.sup.G)
c.sub.t.sup.G=ƒ.sub.θGFC(e.sub.t.sup.G)
a.sub.t.sup.G=argmax.sub.(x,y) c.sub.t.sup.G (6)
[0048] and the loss function for the global model is:
L.sup.Global=−1/MΣ.sub.t.sup.M log ƒ.sub.θG(s.sub.t.sup.G)[a.sub.t.sup.G*] (7)
[0049] where a.sub.t.sup.G* is the target next start location which is provided during training.
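The argmax over (x, y) in equation (6) simply selects the highest-scoring cell of the full-scale output map; a sketch, with a nested-list heat map standing in for the network output:

```python
def global_action(heatmap):
    """Return the (x, y) cell with the highest score in a 2-D map.

    `heatmap` is a nested list indexed [x][y]; in the method this would
    be the global network's full-scale output.
    """
    best, best_xy = float("-inf"), None
    for x, row in enumerate(heatmap):
        for y, v in enumerate(row):
            if v > best:
                best, best_xy = v, (x, y)
    return best_xy
```

Because the argmax ranges over the whole image plane, the global model can start the next stroke anywhere, unlike the local model's 5×5 window.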
[0050] To illustrate that the system of the present invention works in various robotic environments, it was tested with two robots. We directly applied our trained model to the real robotic environment, which required reprocessing the original target image to match the image format of our training data: the line width must be 1 pixel, the image must be 100×100, and so on. If our model sees a vertically-long one-stroke drawing, for example, it is likely to divide the stroke region into square windows, individually solve the drawing for each window, and combine the results together once all are completed. To adjust the line width, we used a skeletonization technique which extracts the center line of a stroke-based drawing.
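The resizing-and-binarizing portion of this reprocessing might look like the following sketch; nearest-neighbor resampling is our assumption, and the skeletonization (center-line extraction) step is omitted:

```python
def preprocess(image, size=100, thresh=0.5):
    """Resize a grayscale image (nested lists, values in [0, 1]) to
    size x size with nearest-neighbor sampling, then binarize it to
    match the 100 x 100 binary training format described above."""
    H, W = len(image), len(image[0])
    return [[1 if image[x * H // size][y * W // size] >= thresh else 0
             for y in range(size)]
            for x in range(size)]
```

In practice, a proper skeletonization pass (e.g., morphological thinning) would follow, so that every stroke in the binary image is one pixel wide.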
First Robot (Herein Referred to as “Baxter”)
[0051] As shown in the drawings, the method was tested with the Baxter robot.
[0052] In summary, (A) shows the target image that Baxter tried to replicate, (B) shows the image drawn by Baxter, and (C) shows Baxter in motion.
[0053] Second Robot (Herein Referred to as “MOVO”)
[0054] We tested our model on a MOVO robot, using a Kinova Jaco arm and the Kinect 2 as a sensor. With its precise movement capabilities, MOVO reproduces the target image very accurately. Overall, the robot demonstration produces a policy for drawing recognizable characters, including languages such as Greek, Hindi and Tamil, which were not previously seen during training. Photographs of drawn and handwritten examples appear in the drawings.
[0055] More specifically, the word “Hello” is shown in different languages: from the top, English cursive, Urdu, Greek, Japanese, Korean, Chinese, Tamil, French, Hindi and Yiddish, together with a sketch of the Mona Lisa. Strokes on the left are hand-drawn on a white board; strokes on the right are drawn by the robot on the same white board after viewing the input image on the left.
[0056] The accuracy of our model's ability to reproduce English cursive is shown in the drawings.
[0057] It would be appreciated by those skilled in the art that various changes and modifications can be made to the illustrated embodiments without departing from the spirit of the present invention. All such modifications and changes are intended to be within the scope of the present invention except as limited by the scope of the appended claims.