Systems and methods of instant-messaging bot for robotic process automation and robotic textual-content extraction from images
11539643 · 2022-12-27
Assignee
Inventors
- Ping-Yuan Tseng (Taoyuan, TW)
- Chiou-Shann Fuh (New Taipei, TW)
- Richard Li-Cheng Sheng (Davis, CA, US)
- Hui Hsiung (Pasadena, CA, US)
Cpc classification
- H04L51/02 (ELECTRICITY)
- G06V30/18057 (PHYSICS)
- G06V30/15 (PHYSICS)
- H04L51/04 (ELECTRICITY)
- G06V30/414 (PHYSICS)
- G06Q30/0281 (PHYSICS)
International classification
- H04L51/04 (ELECTRICITY)
- H04L51/02 (ELECTRICITY)
Abstract
Systems and methods of instant-messaging bot for robotic process automation (RPA) and robotic textual-content extraction from digital images include a chatbot application, a software RPA manager, and an instant-messaging (IM) platform, all built for an enterprise. The enterprise IM platform is connected to one or more public IM platforms over the Internet. The RPA manager contains multiple modules of enterprise workflows and receives instructions from the enterprise chatbot for executing individual workflows. The system allows enterprise users connected to the enterprise IM platform, and external users connected to the public IM platforms, to use instant messaging to initiate enterprise workflows that are automated with the help of the enterprise chatbot and delivered via instant messaging. Furthermore, textual-content extraction from digital images is incorporated in the RPA manager as an enterprise workflow, and provides improved convolutional neural network (CNN) methods for textual-content extraction.
Claims
1. A system of instant-messaging bot for robotic process automation (RPA) comprising: a chatbot application for an enterprise (or organization) including software for receiving, processing, analyzing and responding to human-generated messages in a human-like manner; a software RPA manager for said enterprise containing multiple workflow modules and receiving instructions from said chatbot for executing said workflows; an instant-messaging (IM) platform for said enterprise connected to said chatbot application and separately connected to one or more public IM platforms via the Internet; and said enterprise IM platform including software for managing traffic of IM messages exchanged among said chatbot, users of said enterprise connected to said enterprise IM platform, and external users connected to said public IM platforms; wherein: said enterprise IM platform automatically attaches visible identification (ID) labels to messages passing through it to identify the receiver and the sender of every message whenever necessary to avoid ambiguity, and hence allowing said chatbot and said human agents to share a single messaging channel over said public IM platform for communicating with said user; said enterprise IM platform, upon receiving any incoming message from said user without a receiver ID, directs the message to an enterprise receiver according to the following rule: (1) sending the message to said chatbot or a said human agent who is the current participant of ongoing IM communication with said user; (2) sending the message to said chatbot or a said human agent who is designated as the default receiver if the message is to initiate a new IM communication; and said enterprise IM platform supports multi-channel messaging for any said human agent to engage in an IM communication with said chatbot or another human agent in parallel with said human agent's ongoing IM communication with said user.
2. The system of instant-messaging bot for robotic process automation of claim 1, wherein a said workflow module is for meeting arrangement in said enterprise.
3. The system of instant-messaging bot for robotic process automation of claim 1, wherein a said workflow module is for leave application and approval in said enterprise.
4. The system of instant-messaging bot for robotic process automation of claim 1, wherein a said workflow module is for textual-content extraction from pixelated digital images provided by said enterprise users via said enterprise IM platform, or by said external users via said public IM platforms.
5. The system of instant-messaging bot for robotic process automation of claim 1, wherein a said workflow module is for return-merchandise-authorization (RMA) processes.
6. A system of instant-messaging bot for robotic process automation comprising: a chatbot application for an enterprise (or organization) including software for receiving, processing, analyzing and responding to human-generated messages in a human-like manner; a software RPA manager for said enterprise containing multiple workflow modules and receiving instructions from said chatbot for executing said workflows; an instant-messaging (IM) platform for said enterprise connected to said chatbot application and separately connected to one or more public IM platforms via the Internet; and said enterprise IM platform including software for managing traffic of IM messages exchanged among said chatbot, users of said enterprise connected to said enterprise IM platform, and external users connected to said public IM platforms; wherein: said workflow module is for textual-content extraction from pixelated digital images provided by said enterprise users via said enterprise IM platform, or by said external users via said public IM platforms; said textual-content extraction comprises: a text-detection step for determining locations and sizes of text lines in said images, followed by a text-recognition step for recognizing textual content of each detected text line, and said text-detection step and text-recognition steps being accomplished by convolutional neural network (CNN) methods; said text-recognition method comprises: a convolutional neural network (CNN) backbone for extracting spatial features of a said text-line image; a decoding path for extracting textual content from the text-line image; said CNN backbone consisting of sequential convolutional layers composed of convolutional, rectified-linear-unit (ReLU), and pooling operations for generating a series of multi-channel feature maps from the text-line image; each said convolutional layer containing feature maps having the same spatial resolution and the same number of channels, while its subsequent convolutional layer containing feature maps having an equal or lower spatial resolution and a larger number of channels; said decoding path consisting of one or more one-dimensional convolutional layers, a softmax operation, and a Connectionist Temporal Classification (CTC) loss function, providing textual content of the text-line image as output.
7. A method for automatic detection of text lines in pixelated digital images, comprising: a convolutional neural network (CNN) backbone for extracting spatial features of a said image; a text-slice detection layer for detecting text-slice candidates; a filtering layer for selecting most likely text-slice candidates; and a text-construction layer for associating text-slice candidates into text lines; said CNN backbone consisting of sequential convolutional layers composed of convolutional, rectified-linear-unit (ReLU), and pooling operations for generating a series of multi-channel feature maps from the image; each said convolutional layer containing feature maps having the same spatial resolution and the same number of channels, while its subsequent convolutional layer containing feature maps having an equal or lower spatial resolution and a larger number of channels; said text-slice detection layer probing one of said feature maps by applying multiple text-slice detectors onto every cell of the probed feature map, with said text-slice detectors having a fixed width equal to the cell width and multiple heights corresponding to aspect ratios in the range of 1 through 8; said filtering layer using Non-Maximum Suppression for selecting the best text-slice candidates among those detected by said text-slice detection; and said text-construction layer grouping adjacent text-slice candidates from said filtering layer having a horizontal distance less than a preset value and a vertical overlap exceeding a preset value into the same text line.
8. A method for automatic detection of text lines in pixelated digital images, comprising: a convolutional neural network (CNN) consisting of a convolutional path followed by a transposed-convolutional path for generating intermediate, pixelated text-distribution maps from a said image; and a tuning layer for providing an optimal pixelated text-distribution map as output; said convolutional path consisting of sequential convolutional layers composed of convolutional, rectified-linear-unit (ReLU), and pooling operations for generating a series of multi-channel feature maps from the image; each said convolutional layer containing feature maps having the same spatial resolution and the same number of channels, while its subsequent convolutional layer containing feature maps having an equal or lower spatial resolution and a larger number of channels; said transposed-convolutional path consisting of sequential transposed-convolutional layers composed of transposed-convolutional, concatenation, convolutional, and ReLU operations for generating a series of multi-channel feature maps; each said transposed-convolutional layer containing feature maps having the same spatial resolution, with its first feature map generated by a transposed-convolutional operation on the last feature map of a lower resolution from its preceding layer, followed by a concatenation operation with a feature map of the same resolution from a said convolutional layer; said CNN generating from the image two intermediate, pixelated text-distribution maps, one of original text and the other of eroded text; and said tuning layer applying Watershed algorithm on the two intermediate text-distribution maps, yielding, as output, an optimal pixelated text-distribution map free of overlapping text.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(16) The scope of the present invention is defined by the claims appended to the following detailed description, while the embodiments described herein serve as illustrations, not limitations, of the claims.
(18) The chatbot application 102 includes software for receiving, processing, analyzing and responding to human-generated messages in a human-like manner. It consists of three major parts: (1) a natural language processing and understanding (NLP/NLU) module for analyzing intent of an incoming message either from an enterprise user 112 or from an external user 114, (2) a dialog management (DM) module for interpreting the output (intent) from the NLP/NLU module, analyzing context of the ongoing human-chatbot communication, including preceding messages or other information related to the communication, and providing a responding instruction as output, and (3) a natural language generating (NLG) module for receiving the responding instruction from the DM module and generating a reply message either to the enterprise user 112 or to the external user 114.
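The three-part structure described above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the implementation of the chatbot application 102: the keyword table stands in for a trained NLP/NLU model, and the names `understand`, `manage_dialog`, `generate_reply`, `Conversation`, and the intent labels are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical keyword table standing in for a trained NLP/NLU model.
INTENT_KEYWORDS = {
    "arrange_meeting": {"meeting", "schedule"},
    "apply_leave": {"leave", "vacation"},
    "request_rma": {"rma", "return", "damaged"},
}


@dataclass
class Conversation:
    """Context kept by the dialog manager for one ongoing conversation."""
    history: list = field(default_factory=list)


def understand(message: str) -> str:
    """NLP/NLU stand-in: map an incoming message to an intent label."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"


def manage_dialog(intent: str, conv: Conversation) -> str:
    """DM stand-in: combine intent and context into a responding instruction."""
    if intent == "unknown" and conv.history:
        # Fall back on the most recent recognized intent of this conversation.
        intent = conv.history[-1]
    if intent == "unknown":
        return "ask_clarification"
    conv.history.append(intent)
    return f"start_workflow:{intent}"


def generate_reply(instruction: str) -> str:
    """NLG stand-in: turn the responding instruction into a reply message."""
    if instruction == "ask_clarification":
        return "Sorry, could you rephrase your request?"
    workflow = instruction.split(":", 1)[1]
    return f"Starting the {workflow.replace('_', ' ')} workflow."


def chatbot_respond(message: str, conv: Conversation) -> str:
    """Run one message through the NLU -> DM -> NLG chain."""
    return generate_reply(manage_dialog(understand(message), conv))
```

In this sketch the dialog manager is the only stateful part, which mirrors the description's point that context (preceding messages) is analyzed in the DM module rather than in language understanding or generation.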
(19) The enterprise IM platform 104 includes software for managing traffic of IM messages exchanged among the enterprise users 112, the external user 114, and the enterprise chatbot 102. In
(20) The RPA manager 106 includes software for configuring and executing built-in workflow modules 108-108n. Optionally, one or more RPA workflow modules 108tp supplied by third-party developers may also be connected to and controlled by the RPA manager 106 via application programming interface (API). To initiate a robotic workflow, internal or customer-facing, either an enterprise user 112 or an external user 114 can send a message to the enterprise chatbot 102 expressing such an intent. In response, the enterprise chatbot 102 instructs the RPA manager 106 to execute the intended workflow.
(21) Certain workflows are end-to-end, i.e., an input is received and then an output is generated directly. Other workflows are interactive, with the user 112 or 114, the chatbot 102, and the RPA manager engaging in back-and-forth interactions during the process. In some cases, the enterprise chatbot 102 and the RPA manager 106 are linked to an enterprise database 116 in order to access additional information related to an ongoing instant-messaging conversation or a robotic workflow.
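A minimal sketch of how an RPA manager might hold workflow modules and execute one on instruction from the chatbot is given below. The class `RPAManager`, its `register` and `execute` methods, and the toy `leave_application` module with its five-day limit are assumptions made for illustration; the description does not specify such an interface.

```python
from typing import Callable, Dict

# A workflow module is modeled as a callable taking and returning a dict.
Workflow = Callable[[dict], dict]


class RPAManager:
    """Hypothetical registry and dispatcher for workflow modules."""

    def __init__(self) -> None:
        self._modules: Dict[str, Workflow] = {}

    def register(self, name: str, module: Workflow) -> None:
        # Built-in and third-party modules are registered the same way here.
        self._modules[name] = module

    def execute(self, name: str, payload: dict) -> dict:
        # Called by the chatbot once it has identified the user's intent.
        if name not in self._modules:
            return {"status": "error", "reason": f"unknown workflow '{name}'"}
        return self._modules[name](payload)


def leave_application(payload: dict) -> dict:
    """Toy end-to-end workflow: one input, one output, no interaction."""
    limit = payload.get("max_days", 5)  # hypothetical leave regulation
    return {"status": "approved" if payload["days"] <= limit else "needs_review"}


manager = RPAManager()
manager.register("apply_leave", leave_application)
result = manager.execute("apply_leave", {"days": 3})
```

An interactive workflow could follow the same pattern by returning a status such as a request for more input, which the chatbot would relay to the user before calling `execute` again.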
(22) It is not uncommon for an enterprise chatbot to provide an incorrect answer, or no answer at all, to a user's inquiry, either because the chatbot cannot understand the nuances of human language or simply because it lacks sufficient information. Such failures result in a poor user experience and are detrimental to the enterprise. The present invention allows the enterprise users 112 (e.g., customer-care representatives) to intervene in an ongoing IM conversation between the external user 114 (e.g., customer) and the enterprise chatbot 102, so that any friction or frustration encountered by the external user during the user-chatbot conversation can be remedied on the spot, resulting in a better user experience.
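For a human agent and the chatbot to share one channel with the external user, the enterprise IM platform has to decide where each incoming message goes. Claim 1 recites the rule: a message without a receiver ID goes to the current participant of the ongoing conversation, or to a default receiver if it starts a new conversation, and visible ID labels are attached to avoid ambiguity. The sketch below illustrates that rule only; the class name `EnterpriseIMRouter`, the `hand_over` method, and the bracketed label format are hypothetical.

```python
from typing import Dict, Optional, Tuple


class EnterpriseIMRouter:
    """Hypothetical router applying the receiver rule recited in claim 1."""

    def __init__(self, default_receiver: str = "chatbot") -> None:
        self.default_receiver = default_receiver
        # Maps each external user to the current enterprise-side participant.
        self.current: Dict[str, str] = {}

    def route(self, sender: str, text: str,
              receiver: Optional[str] = None) -> Tuple[str, str]:
        if receiver is None:
            # Rule (1): ongoing conversation -> its current participant.
            # Rule (2): new conversation -> the designated default receiver.
            receiver = self.current.get(sender, self.default_receiver)
        self.current[sender] = receiver
        # Visible ID label identifying sender and receiver (format assumed).
        return receiver, f"[{sender} -> {receiver}] {text}"

    def hand_over(self, user: str, agent: str) -> None:
        """A human agent takes over the conversation with the given user."""
        self.current[user] = agent


router = EnterpriseIMRouter()
first = router.route("customer-1", "My order arrived damaged")
router.hand_over("customer-1", "agent-7")
second = router.route("customer-1", "It is still not resolved")
```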
(23) The present invention provides robotic workflows including daily office processes. An example is robotic meeting arrangement inside an enterprise. The meeting organizer, an enterprise user 112, only needs to send a meeting request to the enterprise chatbot 102, including the meeting subject, participant list, and desired time and venue. The enterprise chatbot 102, with the help of the RPA manager 106, first checks behind the scenes the availability of each participant and of the venue, then makes recommendations to, and receives confirmation from, the meeting organizer. Once confirmed, the enterprise chatbot 102 sends meeting invitations as well as meeting reminders to the participants (also enterprise users) before the meeting.
(24) Another example of office workflow is robotic leave application and approval for enterprise users. In this case, an enterprise user 112 submits a leave application via the enterprise chatbot 102 to his/her tiers of supervisors (also enterprise users) for approval. The enterprise chatbot 102, with the help of the RPA manager 106, guides the approval process such that the enterprise's leave regulations are strictly followed, and it sends reminders to the supervisors to ensure that the approval process is timely.
(25) The present invention further provides robotic workflows involving textual-content extraction from digital images. The simplest of such workflows are end-to-end, wherein the user 112 or 114 sends a digital image to the enterprise chatbot 102, the latter forwards the image to the RPA manager 106 for textual-content extraction, and the result is sent back to the user 112 or 114 via the chatbot 102.
(26) Interactive robotic workflows involving textual-content extraction are illustrated with the following robotic return-merchandise-authorization (RMA) process. RMA is essential to after-sale services for enterprises. While the majority of RMA claims are routine and repetitive, they occupy precious time of customer-care representatives, and hence the need for robotic RMA is strong.
(27) The following sequence of robotic RMA represents an embodiment of the present invention:
(28) (1) An external user 114 (customer) sends a message over the public IM platform 110 to the enterprise chatbot 102, requesting RMA for damaged merchandise; a photograph of the product label is attached to the claim message;
(29) (2) with its NLP capability, the enterprise chatbot 102 can understand the external user's intent; it forwards the product-label image to the RPA manager 106, which in turn executes the text-extraction module 108 and sends the result back to the enterprise chatbot 102;
(3) the enterprise chatbot 102 compares the product-label content (e.g., model and serial numbers) with sales records in the enterprise database 116 and determines whether the merchandise is under warranty;
(4) depending on its finding, the enterprise chatbot 102 either issues an RMA number or sends a rejection message to the external user 114; and
(5) if the external user 114 responds to the enterprise chatbot 102 with a message expressing a negative sentiment, the chatbot 102 escalates the issue to an enterprise user 112 (customer-care representative), and the latter then engages in the ensuing conversation with the external user 114.
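Steps (3) through (5) of the sequence above can be sketched as follows. The text-extraction result is taken as given (a plain string), and everything else is an assumption for illustration: the record layout of `SALES_RECORDS`, the `S/N` label pattern, the `RMA-` number format, and the word list used as a crude substitute for sentiment analysis.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Hypothetical sales records keyed by serial number.
SALES_RECORDS = {
    "SN123": {"model": "X1", "sold": date(2022, 1, 10), "warranty_days": 365},
}

# Crude stand-in for sentiment analysis of the customer's reply.
NEGATIVE_WORDS = {"unacceptable", "angry", "terrible", "complaint"}


def parse_serial(label_text: str) -> Optional[str]:
    """Pull a serial number out of the extracted label text (pattern assumed)."""
    match = re.search(r"S/N[:\s]*([A-Z0-9]+)", label_text)
    return match.group(1) if match else None


def under_warranty(serial: str, today: date) -> bool:
    """Step (3): compare the label content with the sales record."""
    record = SALES_RECORDS.get(serial)
    if record is None:
        return False
    return today <= record["sold"] + timedelta(days=record["warranty_days"])


def process_rma(label_text: str, today: date) -> str:
    """Step (4): issue an RMA number or reject the claim."""
    serial = parse_serial(label_text)
    if serial is not None and under_warranty(serial, today):
        return f"RMA-{serial}"
    return "REJECTED"


def needs_escalation(reply: str) -> bool:
    """Step (5): escalate to a human agent on negative sentiment."""
    words = set(re.findall(r"[a-z]+", reply.lower()))
    return bool(words & NEGATIVE_WORDS)
```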
(30) Textual-Content Extraction Methods
(31) The present invention provides textual-content extraction methods comprising two steps: (1) text detection for an input image, providing the locations and sizes of text lines in the image, and (2) text recognition for each text line detected in step (1). Architectures of two independent text-detection methods and one text-recognition method are presented in
(32) Basic Convolutional Neural Network (CNN)
(33) CNNs for automatic detection, classification and recognition of objects in digital images have seen widespread applications in recent years. For illustrating the present invention, the architectures of two well-known CNN models, VGG-16 (K. Simonyan, A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv: 1409.1556 (2014)) and ResNet-18 (K. He, X. Zhang, S. Ren, J. Sun, “Deep residual learning for image recognition”, arXiv: 1512.03385 (2015)), are shown in
(34) A normal convolutional backbone comprises multiple stacks of convolutional, pooling and rectified-linear-unit (ReLU) operations arranged in a specific order. As shown in
(35) In a CNN, each convolutional operation or fully connected layer is weighted with trainable parameters. A CNN model of practical use may contain tens of millions of trainable parameters (e.g., 138.4 million for VGG-16 and 11.5 million for ResNet-18). Therefore, fully training a CNN model may require a large number (e.g., hundreds of thousands to millions) of labeled training images. This is not practical for most applications, where only a limited number (e.g., hundreds to tens of thousands) of training images are available. Fortunately, a CNN model for applications in a specific domain can still be made effective if it properly adopts a pre-trained CNN model either in its entirety or in part. In such cases, the CNN model or its derivatives can be trained successfully with a smaller set of images relevant to the specific domain. Some embodiments of the present invention have adopted a pre-trained convolutional backbone of VGG-16, ResNet-18, or their respective variants. In general, the ResNet variants have fewer trainable parameters and are easier to train than their VGG counterparts.
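The 138.4 million figure quoted for VGG-16 can be reproduced by simple counting. The sketch below assumes the standard VGG-16 configuration from the cited Simonyan and Zisserman paper (thirteen 3×3 convolutions followed by three fully connected layers, a 224×224 RGB input, and 1000 output classes); these details are not stated in this description and are taken from that reference.

```python
# Output channels of the thirteen 3x3 convolutions in standard VGG-16.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                       512, 512, 512, 512, 512, 512]
# Fully connected sizes: a 7x7x512 feature map flattened, then 4096, 4096, 1000.
VGG16_FC_SIZES = [512 * 7 * 7, 4096, 4096, 1000]


def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus biases of one convolution with a square kernel."""
    return kernel * kernel * c_in * c_out + c_out


def count_backbone_params() -> int:
    """Trainable parameters in the convolutional backbone only."""
    total, c_in = 0, 3  # RGB input
    for c_out in VGG16_CONV_CHANNELS:
        total += conv_params(c_in, c_out)
        c_in = c_out
    return total


def count_fc_params() -> int:
    """Trainable parameters in the fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(VGG16_FC_SIZES, VGG16_FC_SIZES[1:]))


backbone = count_backbone_params()   # 14,714,688
total = backbone + count_fc_params() # 138,357,544, i.e. about 138.4 million
```

Under these assumptions the convolutional backbone accounts for only about 14.7 million of the parameters; the rest sit in the fully connected layers, which is one reason adopting just a pre-trained backbone is comparatively lightweight.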
(36) Single Shot Multi-Box Detector (SSD)
(37) SSD is a CNN for detecting generic objects from digital images (W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, A. C. Berg, “SSD: Single Shot Multi-Box Detector”, arXiv: 1512.02325 (2016)). It is among the more efficient object-detection models, providing both accuracy and speed. The SSD architecture is depicted in
(38) The detection layer of SSD takes, as its input, the feature-cell values of the Conv4 feature map and those of the smaller feature maps generated in the appended convolutional layers (as indicated with thick solid arrows in
(39) The detection-layer output provides the category score, coordinates and size of the bounding box for each detected object. The object bounding box does not have to coincide with the detector box that detects the object. Normally there are offsets in coordinates, height and width between the two. Furthermore, the same object could be detected by multiple detector boxes, resulting in multiple candidates for the object bounding box. The final step of SSD, Non-Maximum Suppression, is a rule-based method for selecting the best bounding box for each object detected.
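Non-Maximum Suppression is commonly implemented as a greedy procedure: candidates are visited in order of decreasing score, and a candidate is kept only if it does not overlap too much with one already kept. The sketch below shows that common form; the `(x1, y1, x2, y2)` box layout and the 0.5 overlap threshold are assumptions, not values taken from this description.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes: Sequence[Box], scores: Sequence[float],
                            iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a candidate only if it does not overlap a better one too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = non_maximum_suppression(candidates, [0.9, 0.8, 0.7])
```

In the usage above, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box, which overlaps neither, is kept.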
(40) Connectionist Text Proposal Network (CTPN)
(41) Detection of text and detection of generic objects differ in two major aspects. First, unlike a generic object that usually has a well-defined closed boundary, text is an assembly of separated elements (e.g., English letters, Chinese characters, punctuation marks, spaces), and hence the boundary of text is not well-defined. Second, text detection normally requires higher accuracy than generic object detection, because a partially detected text line can result in substantial errors in subsequent text recognition. Consequently, methods for detecting generic objects, such as SSD, are not effective for detecting text.
(42) Connectionist Text Proposal Network (CTPN) was designed specifically for text detection (Z. Tian, W. Huang, T. He, P. He, Y. Qiao, “Detecting text in natural image with Connectionist Text Proposal Network”, arXiv: 1609.03605 (2016)). The CTPN architecture is depicted in
(43) Furthermore, CTPN provides a vertical anchor mechanism that simultaneously predicts the vertical location, height, and text/non-text score of each text-slice proposal. In
(44) Although CTPN is accurate in detecting text, its speed (e.g., 7 frames per second) is only a fraction of the speed of SSD (e.g., 59 frames per second). Therefore, a text-detection method providing both accuracy and speed is needed.
(45) Text-Detection Method 1 of the Present Invention: Connectionist Text Single Shot Multi-Slice Detector (CT-SSD)
(46) The present invention provides a text-detection method that is a hybrid of CTPN and SSD, and it is named Connectionist Text Single Shot Multi-Slice Detector (CT-SSD). The objective of CT-SSD is to achieve both the text-detection accuracy of CTPN and the speed of SSD. The CT-SSD architecture 200 is depicted in
(47) CT-SSD adopts the multi-box detection mechanism of SSD, but it samples only a single convolutional layer 206 (e.g., Conv4 of ResNet-18) for detecting fine text slices, as CTPN does. Notice that from
(48) The text-slice detection output 208 provides preliminary text-slice candidates with their coordinates, heights, and text/non-text scores. The preliminary text-slice candidates are further filtered with Non-Maximum Suppression 210 for selecting the most likely text-slice candidates. An example of text-slice detection using CT-SSD is shown in
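Once the most likely text slices have been selected, they still have to be assembled into text lines. Claim 7 recites grouping adjacent slices whose horizontal distance is below a preset value and whose vertical overlap exceeds a preset value. The following is a minimal sketch of such grouping, assuming slices given as `(x, y, width, height)` and hypothetical thresholds (`max_gap` in pixels, `min_overlap` as a fraction of the shorter slice); it is not the text-construction layer of CT-SSD itself.

```python
from typing import List, Tuple

Slice = Tuple[int, int, int, int]  # (x, y, width, height)


def vertical_overlap(a: Slice, b: Slice) -> float:
    """Shared vertical extent as a fraction of the shorter slice."""
    top = max(a[1], b[1])
    bottom = min(a[1] + a[3], b[1] + b[3])
    return max(0, bottom - top) / min(a[3], b[3])


def bounding_box(line: List[Slice]) -> Slice:
    """Smallest box enclosing all slices of one text line."""
    x1 = min(s[0] for s in line)
    y1 = min(s[1] for s in line)
    x2 = max(s[0] + s[2] for s in line)
    y2 = max(s[1] + s[3] for s in line)
    return (x1, y1, x2 - x1, y2 - y1)


def build_text_lines(slices: List[Slice], max_gap: int = 20,
                     min_overlap: float = 0.7) -> List[Slice]:
    """Greedily chain slices left to right into text lines."""
    lines: List[List[Slice]] = []
    for s in sorted(slices, key=lambda item: item[0]):
        for line in lines:
            last = line[-1]
            gap = s[0] - (last[0] + last[2])
            if gap < max_gap and vertical_overlap(last, s) > min_overlap:
                line.append(s)
                break
        else:
            lines.append([s])  # no compatible line found: start a new one
    return [bounding_box(line) for line in lines]


text_lines = build_text_lines([
    (0, 10, 16, 32), (16, 12, 16, 30), (32, 10, 16, 32),  # one line
    (0, 100, 16, 32),                                     # a line further down
    (200, 10, 16, 32),                                    # too far to the right
])
```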
(49) U-Net
(50) An alternative approach for detecting objects in pixelated images is semantic image segmentation. In this approach, every pixel of the image is classified according to categories of the objects being detected. U-Net is a CNN designed for semantic image segmentation (O. Ronneberger, P. Fischer, T. Brox, “U-Net: Convolutional networks for biomedical image segmentation”, arXiv: 1505.04597 (2015)). The present invention provides a text-detection method utilizing the semantic segmentation approach based on U-Net.
(51) The U-Net architecture is depicted in
(52) Text-Detection Method 2 of the Present Invention: Watershed U-Net Segmentation
(53) Since the boundary of text is not well-defined, text detection using segmentation alone may not be able to resolve possible ambiguities at a text boundary, particularly if the spacing between adjacent text lines is small. To solve this problem, the present invention provides a text-detection method that combines U-Net segmentation with the Watershed algorithm, which is known for segmenting mutually touching objects. This new method is named Watershed U-Net Segmentation.
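The idea of separating touching text lines from two maps, one of original text and one of eroded text as recited in claim 8, can be illustrated with a simplified stand-in for the Watershed algorithm. In the sketch below, connected regions of the eroded map serve as markers and are grown breadth-first inside the original-text mask, so each text pixel is assigned to the marker that reaches it first. A full Watershed implementation floods according to pixel intensity; this flat-landscape version, the binary list-of-lists mask format, and the function names are assumptions for illustration only.

```python
from collections import deque
from typing import List, Tuple

Mask = List[List[int]]
NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def label_markers(eroded: Mask) -> Tuple[Mask, int]:
    """Label 4-connected regions of the eroded-text mask (0 is background)."""
    rows, cols = len(eroded), len(eroded[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if eroded[r][c] and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in NEIGHBORS:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and eroded[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def grow_markers(text_mask: Mask, labels: Mask) -> Mask:
    """Grow all markers simultaneously, staying inside the original-text mask."""
    rows, cols = len(text_mask), len(text_mask[0])
    out = [row[:] for row in labels]
    queue = deque((r, c) for r in range(rows) for c in range(cols) if out[r][c])
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if (0 <= ny < rows and 0 <= nx < cols
                    and text_mask[ny][nx] and out[ny][nx] == 0):
                out[ny][nx] = out[y][x]
                queue.append((ny, nx))
    return out


# Two text lines that touch in the original map but are separate when eroded.
original = [[1, 1, 1, 1]] * 4
eroded = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
markers, n_lines = label_markers(eroded)
separated = grow_markers(original, markers)
```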
(54) The architecture of Watershed U-Net Segmentation 300 is depicted in
(55) For practical applications, the resolutions of input images may vary substantially. To facilitate fully automated text detection, an overlapping-tiles method is carried out prior to the aforementioned text-detection step. This is illustrated in
(56) Text-Recognition Method of the Present Invention: Connectionist Temporal Classification CNN (CTC-CNN)
(57) The two text-detection methods of the present invention provide locations and sizes (w×h) of detected text lines within the input image. This is followed by a text-recognition process for recognizing the textual content of each text line, which is a sequence-to-sequence process that can be effectively addressed with a CNN using Connectionist Temporal Classification (CTC) as the loss function (F. Borisyuk, A. Gordo, V. Sivakumar, “Rosetta: Large scale system for text detection and recognition in images”, arXiv: 1910.05085 (2019)).
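A network trained with the CTC loss emits, for each horizontal position of the text-line image, a probability distribution over the characters plus a special blank symbol. One common way to read out the text is greedy decoding: take the most probable class at each position, merge consecutive repeats, and drop blanks. The sketch below shows that procedure; placing the blank at index 0 and the two-letter alphabet are assumptions, and the description does not state which decoding procedure CTC-CNN uses at inference.

```python
from typing import List, Sequence

BLANK = 0  # index of the CTC blank symbol (assumed to be class 0)


def ctc_greedy_decode(probabilities: Sequence[Sequence[float]],
                      alphabet: str) -> str:
    """Collapse the per-position best classes into a character string."""
    # Most probable class at each position along the text line.
    best_path: List[int] = [
        max(range(len(step)), key=step.__getitem__) for step in probabilities
    ]
    decoded = []
    previous = BLANK
    for index in best_path:
        # Emit a character only when the class changes and is not blank.
        if index != previous and index != BLANK:
            decoded.append(alphabet[index - 1])
        previous = index
    return "".join(decoded)


# Classes: 0 = blank, 1 = "a", 2 = "b"; best path is a, a, blank, a, b, b.
softmax_output = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]
text = ctc_greedy_decode(softmax_output, "ab")
```

The blank between the two runs of "a" is what allows a doubled letter to survive the merging of repeats, so the example decodes to "aab" rather than "ab".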
(58) The architecture of CTC-CNN 400 is depicted in
(59) The present invention discloses systems and methods of instant-messaging bot for robotic process automation (RPA) and robotic textual-content extraction from digital images. Such a system includes a chatbot application, a software RPA manager, and an instant-messaging (IM) platform, all built for an enterprise. The RPA manager contains multiple modules of enterprise workflows and receives instructions from the enterprise chatbot for executing individual workflows. The enterprise IM platform is further connected to one or more public IM platforms over the Internet.
(60) The system allows enterprise users connected to the enterprise IM platform, and external users connected to the public IM platforms, to use instant messaging to communicate with the enterprise chatbot and with one another either one-to-one or as a group. It further enables the enterprise users and the external users to initiate enterprise workflows, either internal or customer-facing, that are automated with the help of the enterprise chatbot and delivered via instant messaging.
(61) Furthermore, the present invention incorporates textual-content extraction from digital images in the RPA manager as an enterprise workflow, and it provides improved convolutional neural network (CNN) methods for textual-content extraction.
(62) Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.