SYSTEM AND METHOD FOR MULTI-MODAL TRANSFORMER-BASED CATEGORIZATION
20230044152 · 2023-02-09
Assignee
Inventors
- Lei Chen (Boston, MA, US)
- Houwei Chou (Boston, MA, US)
- Yandi Xia (Boston, MA, US)
- Hirokazu Miyake (Boston, MA, US)
CPC Classification
International Classification
Abstract
A transformer categorization architecture is applied to image and text data sets to determine a taxonomy for items in a large database of products. Aggregating recommendations from a multi-modal categorization process achieves more accurate product classification with potentially less training. The system is implemented to support an e-commerce portal and user-facilitated access to products for online purchase.
Claims
1. A system directed to item categorization comprising: digital data storage with data input and output capable of outputting a set of item data that includes image data and text data associated individually with at least one item; a program controlled digital processor connected to and in communication with said digital data storage wherein said programmed processor is capable of item categorization of said stored data, said programmed processor including: a text-based transformer processor for identifying features of one or more items based on said text data and generating a digital output; an image-based transformer processor for identifying features of one or more items based on said stored image data and generating a digital output; and a fusion processor including a multi-layer perceptron head for combining text and image transformer processor outputs to generate an item classification prediction.
2. The system of claim 1 wherein said fusion processor includes a multi-layer perceptron head for combining text and image transformer processor outputs in a cross modal attention module to form a multi-modal representation wherein said multi-layer perceptron head outputs an item classification prediction.
3. The system of claim 1 wherein said fusion processor multi-layer perceptron head receives transformer engine outputs directly and generates text and image based classification predictions that are combined to generate a weight-based item classification prediction.
4. The system of claim 1 wherein said fusion processor includes a multi-layer perceptron head for combining text and image transformer processor outputs by using tokens that are concatenated for input into the multi-layer perceptron head.
5. The system of claim 1 wherein said text-based transformer applies a BERT model.
6. The system of claim 1 wherein said image-based transformer applies a ViT model.
7. The system of claim 1 wherein said text-based model is fine-tuned for product title data.
8. The system of claim 6 wherein said image-based model is fine-tuned from the ViT L-16 variant.
9. The system of claim 1 wherein one or more models are pre-trained on a pre-set dataset and implemented by a GPU for training.
10. A data processing system for implementing an e-commerce portal offering goods online for purchase, said system comprising: a search engine for receiving inquiries from users seeking information regarding products for online purchase; a storage connected to said portal for storing retrieval data regarding one or more products responsive to said user search request; a transformer processor for categorization of products based on image and text data associated with said products, wherein said transformer processor implements categorization using a text based transformer processor and an image based transformer processor with categorization clues generated by each; a taxonomy data set comprising a categorization for products determined by said transformer processor, used for configuring a response to said search request to reflect product categorization.
11. The system of claim 10 wherein said transformer processor implements a BERT text transformer model and a ViT image transformer model with resulting clues outputted to a fusion step to achieve a single recommendation for a given product regarding its categorization.
12. The system of claim 10 wherein said transformer processors are trained to facilitate model accuracy in proper categorization of selected products.
13. The system of claim 10 wherein a grouping of products within a single category predicted by the transformer processor is provided in response to a user search request.
14. The system of claim 11 wherein said transformer processor applies a cross attention fusion process to aggregate clues from each transformer processor.
15. A data processing method to classify a large diverse data set of individual items many of which are associated with image and text data corresponding to product type and class, the method comprising: inputting text data for a product into a first transformer processor to ascertain clues regarding what class the product fits; inputting image data for said product into a second transformer processor to ascertain clues regarding what class the product fits; aggregating clues from said first and second transformer processors into a final prediction regarding a class for that product, wherein said aggregating step includes a cross attention fusion process; and outputting said final prediction in association with said product into a taxonomy data set stored for digital access.
16. The method of claim 15 wherein the text transformer processor uses BERT processing and the image transformer processor uses ViT processing.
17. The method of claim 16 wherein the transformer processors are encoded for operation on GPU based processors.
18. The method of claim 15 wherein the transformer processors are trained against a data set of products having known classifications.
19. The method of claim 18 wherein the taxonomy set is used to facilitate responses to user queries made online to an ecommerce portal.
20. The method of claim 19 wherein the number of products having text and image data that are processed exceeds one million which are classified in the taxonomy set into at least four categories.
21. A computer implemented method of training a computerized classification system, comprising: a. storing in a first computer memory a pre-determined set of training data comprising text data associated with items within a known category; b. storing in a second computer memory a pre-determined set of training data comprising image data associated with items within a known category; c. processing said text data in a text based transformer model to characterize values within the model that optimize matching items to known categories; d. processing said image data in an image based transformer model to characterize values within the model that optimize matching items to known categories; and e. storing said characterized model values for use against data that has not been classified.
22. The method of claim 21 wherein said item classification system processes text and image data with transformer processors and an early fusion processor.
23. The method of claim 21 wherein said item classification system further includes a cross modal attention module and a multi-layer perceptron head to form a classification prediction.
Description
FIGURES
DETAILED DESCRIPTION
[0024] Briefly in overview, the present invention facilitates item categorization in support of e-commerce operations. A vast and diverse collection of products is typically offered at an e-commerce web portal. Customers enter requests and are presented with responsive web pages including images and descriptions of products triggered by the user request. Groupings of products are pulled and presented together based on the request and the portal taxonomy used to classify products into these groupings.
[0025] The process of creating the classified dataset of products is dynamic, as every day new products are added to the dataset with the classification delineated by the taxonomy. The data lake supporting the portal is populated with taxonomy data by the classification engine. The engine is formed by two processing paths. Text data is taken for each product and processed by a transformer-based algorithm and processor; image data is likewise taken for each product and processed by a transformer-based image algorithm and processor. For text data, a BERT transformer processes the text and outputs clues to support classification. Image data is processed by a transformer approach such as the ViT method. In both paths, the transformer engine is a digital data processor programmed to implement the selected transformer model, including implementations that use a properly programmed GPU. A fusion step combines the two transformer engine outputs to form a prediction for item (product) classification. Prior to working against actual product data, the engine is trained and the algorithms are modified to enhance accuracy.
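The two-path pipeline described above can be sketched schematically. The following is a minimal illustration only, not the patented implementation: the encoders are random stand-ins for the BERT and ViT models, and the dimension, class count, and function names are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D, NUM_CLASSES = 16, 4  # feature size and category count (illustrative only)

def text_encoder(title: str) -> np.ndarray:
    # Stand-in for the BERT text transformer path: a D-dim text feature vector.
    local = np.random.default_rng(abs(hash(title)) % (2**32))
    return local.standard_normal(D)

def image_encoder(image: np.ndarray) -> np.ndarray:
    # Stand-in for the ViT image transformer path: a D-dim image feature vector.
    return np.resize(image.flatten(), D)

# Single-layer stand-in for the multi-layer perceptron head.
W = rng.standard_normal((NUM_CLASSES, 2 * D))

def classify(title: str, image: np.ndarray) -> int:
    t = text_encoder(title)                 # clues from the text path
    v = image_encoder(image)                # clues from the image path
    fused = np.concatenate([t, v])          # fusion of the two transformer outputs
    logits = W @ fused                      # head maps fused features to class scores
    return int(np.argmax(logits))

pred = classify("wireless headphones", rng.standard_normal((8, 8, 3)))
```

In a real deployment the two encoders would be pre-trained transformer models and the head would be trained against products with known classifications, as described below.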
[0026] Turning now to
[0027] Continuing in
[0028] Continuing with
[0029] A generalized system architecture is provided in
[0032] Now turning to
[0033] The system of
[0034] Continuing with
[0035] Continuing further with
[0036] A preferred implementation will provide a single fusion operation. A particularly preferred implementation will use early fusion of the bi-modal vectors from the transformer engines fed to the cross modal attention module, using visual gate control to optimally combine bi-modal signals. The fusion output passes to the MLP for generating predictions. In training, the label prediction is compared to the ground truth label and the difference is used to train the entire MIC model via back-propagation.
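The gated early-fusion step can be sketched as follows. This is a hedged illustration: the sigmoid gate form, the weight names, and the two-layer head are assumptions chosen to show the idea of a visual gate scaling the image signal, not the specific module claimed above.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # feature dimension (illustrative)

t = rng.standard_normal(D)  # text features from the BERT path
v = rng.standard_normal(D)  # visual features from the ViT path

# Visual gate: a learned projection of both modalities squashed to (0, 1).
Wg = rng.standard_normal((D, 2 * D))
g = 1.0 / (1.0 + np.exp(-(Wg @ np.concatenate([t, v]))))  # sigmoid gate

# The gate controls how much visual signal enters the fused representation.
fused = t + g * v

# MLP head mapping the fused representation to class probabilities.
W1 = rng.standard_normal((16, D))
W2 = rng.standard_normal((4, 16))
logits = W2 @ np.maximum(W1 @ fused, 0.0)  # ReLU hidden layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax over 4 illustrative classes
```

In training, the difference between `probs` and the ground-truth label would drive back-propagation through the gate, the head, and the transformer engines.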
[0037] Processing prior to fusion is further depicted in
[0038] Supervised learning on a massive image data set is used to pre-train the ViT model, where larger training sets increase system performance. The pre-trained ViT model encodes the product image by converting the image into a grid of P×P patches. These patches are processed into tokens and combined with a special [CLS] visual token that represents the entire image, and the resulting sequence of length M+1=P×P+1 is fed into the model. The encoded output is the sequence v=(v0, v1, v2, . . . , vM), where M=P×P and v0 corresponds to the [CLS] token. For this arrangement, ViT L-16 is preferred.
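The patching step can be illustrated with array shapes alone. The 16-pixel patch size below matches the "L-16" naming; the 64×64 image size is an assumed stand-in, and no learned embedding is applied here:

```python
import numpy as np

H = W = 64      # illustrative image side; ViT L-16 uses 16x16-pixel patches
PATCH = 16
P = H // PATCH  # patches per side, so the grid is P x P

image = np.zeros((H, W, 3))  # placeholder product image

# Split the image into P*P non-overlapping patches, each flattened to a vector.
patches = [
    image[i * PATCH:(i + 1) * PATCH, j * PATCH:(j + 1) * PATCH].reshape(-1)
    for i in range(P) for j in range(P)
]
tokens = np.stack(patches)  # shape: (P*P, PATCH*PATCH*3)

# Prepend the special [CLS] token that will come to represent the whole image.
cls = np.zeros((1, tokens.shape[1]))
sequence = np.concatenate([cls, tokens])  # length P*P + 1, as described above
```

The encoder's output for the [CLS] position (v0 above) is what the fusion step consumes as the image-side representation.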
[0039] Testing of the above system provides enhanced categorization. A product catalog including over one million products was processed using four root level genre categories. Predictions of leaf-level product categories from image and text data for the products in the catalog were made and scored. Model performance varied and is summarized in Tables 1 & 2 of Section 5 of the Provisional Application Ser. No. 63/229,624 titled: "Multimodal Item Classification Based on Transformers" filed on Aug. 5, 2021. (The contents previously incorporated by reference.)
[0040] Variations of this arrangement can be applied as dictated by the application. For e-commerce, search results will be processed by the taxonomy data set to locate and present a grouping of products within a category corresponding to the search request. Other taxonomy driven results are facilitated by the inventive modeling, which is adjusted to meet the goals of the application.
[0041] This written description uses examples to disclose certain implementations of the disclosed technology, including the best mode, and also to enable any person skilled in the art to practice certain implementations of the disclosed technology, including making and using any devices or systems and performing any incorporated methods. The patentable scope of certain implementations of the disclosed technology is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.