METHODS AND SYSTEMS FOR INTELLIGENT TEXT CLASSIFICATION WITH LIMITED OR NO TRAINING DATA

20230074189 · 2023-03-09

    Abstract

    Methods and apparatuses are described for intelligent text classification with limited or no training data. A server computing device receives one or more of structured text or unstructured text corresponding to compliance text data from a database. The server computing device executes a trained few-shot natural language inference (NLI) classification model on one or more sentences in the received compliance text data to identify whether the one or more sentences comprise a compliance violation. The server computing device transmits the results of the model execution to a remote computing device.

    Claims

    1. A system for intelligent text classification with limited or no training data, the system comprising a server computing device comprising a memory for storing computer-executable instructions and a processor that executes the computer-executable instructions to: receive one or more of structured text or unstructured text corresponding to compliance text data from a database; execute a trained few-shot natural language inference (NLI) classification model on sentences from the received compliance text data to identify whether the received text includes a compliance violation; and transmit output from the model execution to a remote computing device.

    2. The system of claim 1, wherein the trained few-shot NLI classification model comprises a plurality of instances of a same neural network with shared parameters.

    3. The system of claim 2, wherein each neural network instance of the trained few-shot NLI classification model receives a different text sample from the received text.

    4. The system of claim 3, wherein a first neural network instance receives a positive text sample, a second neural network instance receives an anchor text sample, and a third neural network instance receives a negative text sample.

    5. The system of claim 4, wherein the anchor text sample and the positive text sample correspond to a first class and the negative text sample corresponds to a second class.

    6. The system of claim 5, wherein each neural network instance comprises an encoder layer, a perceptron layer comprising a first fully connected layer and a rectified linear activation function (ReLU) layer, and a second fully connected layer.

    7. The system of claim 6, wherein the trained few-shot NLI classification model generates a first output comprising (i) a first distance between the positive text sample processed by the first neural network instance and the anchor text sample processed by the second neural network instance and (ii) a second distance between the anchor text sample processed by the second neural network instance and the negative text sample processed by the third neural network instance.

    8. The system of claim 7, wherein the first distance comprises a Euclidean distance and the second distance comprises a Euclidean distance.

    9. The system of claim 1, wherein the server computing device applies a triplet loss function to the first distance and the second distance to retrain the few-shot natural language inference (NLI) classification model.

    10. The system of claim 1, wherein the server computing device classifies output from the trained few-shot NLI classification model using a support vector machine (SVM) with radial basis function (RBF) kernel.

    11. The system of claim 10, wherein when the SVM with RBF kernel classifies the output from the trained few-shot NLI classification model as comprising a compliance violation, the remote computing device transmits an alert message to a client computing device for remediation of the compliance violation.

    12. A computerized method of intelligent text classification with limited or no training data, the method comprising: receiving, by a server computing device, one or more of structured text or unstructured text corresponding to compliance text data from a database; executing, by the server computing device, a trained few-shot natural language inference (NLI) classification model on one or more sentences in the received compliance text data to identify whether the one or more sentences comprise a compliance violation; and transmitting, by the server computing device, output from the model execution to a remote computing device.

    13. The method of claim 12, wherein the trained few-shot NLI classification model comprises a plurality of instances of a same neural network with shared parameters.

    14. The method of claim 13, wherein each neural network instance of the trained few-shot NLI classification model receives a different text sample from the received text.

    15. The method of claim 14, wherein a first neural network instance receives a positive text sample, a second neural network instance receives an anchor text sample, and a third neural network instance receives a negative text sample.

    16. The method of claim 15, wherein the anchor text sample and the positive text sample correspond to a first class and the negative text sample corresponds to a second class.

    17. The method of claim 16, wherein each neural network instance comprises an encoder layer, a perceptron layer comprising a first fully connected layer and a rectified linear activation function (ReLU) layer, and a second fully connected layer.

    18. The method of claim 17, wherein the trained few-shot NLI classification model generates a first output comprising (i) a first distance between the positive text sample processed by the first neural network instance and the anchor text sample processed by the second neural network instance and (ii) a second distance between the anchor text sample processed by the second neural network instance and the negative text sample processed by the third neural network instance.

    19. The method of claim 18, wherein the first distance comprises a Euclidean distance and the second distance comprises a Euclidean distance.

    20. The method of claim 12, wherein the server computing device applies a triplet loss function to the first distance and the second distance to retrain the few-shot natural language inference (NLI) classification model.

    21. The method of claim 12, wherein the server computing device classifies output from the trained few-shot NLI classification model using a support vector machine (SVM) with radial basis function (RBF) kernel.

    22. The method of claim 21, wherein when the SVM with RBF kernel classifies the output from the trained few-shot NLI classification model as comprising a compliance violation, the remote computing device transmits an alert message to a client computing device for remediation of the compliance violation.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0019] The advantages of the invention described above, together with further advantages, may be better understood by referring to the following description taken in conjunction with the accompanying drawings. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention.

    [0020] FIG. 1 is a block diagram of a system for intelligent text classification with limited or no training data.

    [0021] FIG. 2 is a block diagram of a triple capsule network architecture for intelligent text classification.

    [0022] FIG. 3 is a flow diagram of a computerized method of intelligent text classification with limited or no training data.

    [0023] FIG. 4 is a diagram of an exemplary text classification generated by the system.

    DETAILED DESCRIPTION

    [0024] FIG. 1 is a block diagram of a system 100 for intelligent text classification with limited or no training data. The system 100 includes a client computing device 102, a communications network 104, a server computing device 106 that includes a text classification module 108, and a database 114 that includes text data.

    [0025] The client computing device 102 connects to the communications network 104 in order to communicate with the server computing device 106 to provide input and receive output relating to the process of intelligent text classification with limited or no training data as described herein. Exemplary client computing devices 102 include but are not limited to computing devices such as smartphones, tablets, laptops, desktops, or other similar devices. It should be appreciated that other types of devices that are capable of connecting to the components of the system 100 can be used without departing from the scope of invention.

    [0026] The communications network 104 enables the client computing device 102 to communicate with the server computing device 106. The network 104 is typically a wide area network, such as the Internet and/or a cellular network. In some embodiments, the network 104 is comprised of several discrete networks and/or sub-networks (e.g., cellular to Internet, PSTN to Internet, PSTN to cellular, etc.).

    [0027] The server computing device 106 is a device including specialized hardware and/or software modules that execute on a processor and interact with memory modules of the server computing device 106, to receive data from other components of the system 100, transmit data to other components of the system 100, and perform functions for intelligent text classification with limited or no training data as described herein. The server computing device 106 includes a text classification module 108 that executes on one or more processors of the server computing device 106. In some embodiments, the module 108 is a specialized set of computer software instructions programmed onto one or more dedicated processors in the server computing device 106 and can include specifically-designated memory locations and/or registers for executing the specialized computer software instructions.

    [0028] It should be appreciated that any number of computing devices, arranged in a variety of architectures, resources, and configurations (e.g., cluster computing, virtual computing, cloud computing) can be used without departing from the scope of the invention. The exemplary functionality of the text classification module 108 is described in detail throughout this specification.

    [0029] In some embodiments, the text classification module 108 can comprise a software program that receives text data (e.g., compliance-related text data or documents in the form of structured or unstructured text) from database 114 and processes the text data as described herein to classify the text (e.g., according to compliance violation parameters) and provide the classified text to a remote user.

    [0030] The database 114 is a computing device (or in some embodiments, a set of computing devices) coupled to the server computing device 106 and is configured to receive, generate, and store specific segments of data relating to the process of intelligent text classification with limited or no training data as described herein. In some embodiments, all or a portion of the database 114 can be integrated with the server computing device 106 or be located on a separate computing device or devices. The database 114 can comprise one or more databases configured to store portions of data used by the other components of the system 100, as will be described in greater detail below.

    [0031] FIG. 2 is a block diagram of a triple capsule network architecture 200 for intelligent text classification, used by the text classification module 108 of server computing device 106 of FIG. 1. As shown in FIG. 2, the triple capsule network 200 comprises three instances (202a, 202b, 202c) of the same neural network with shared parameters. The network takes three examples as input in each sample: the anchor 204a (s), the positive example 204b (s+), and the negative example 204c (s-). The anchor 204a and the positive example 204b belong to the same class, while the negative example 204c belongs to a different class. The network 200 outputs two values: the distance 206a between the anchor and the positive example, and the distance 206b between the anchor and the negative example.
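
    The shared-parameter arrangement described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of network 200: `toy_encode` is a hypothetical stand-in for the shared network instance, and all function names are introduced here for illustration. It shows that a single encoder is applied to all three inputs, so there is only one set of parameters, and that the output is the pair of distances.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_distances(
    encode: Callable[[str], Vector],
    anchor: str,
    positive: str,
    negative: str,
) -> Tuple[float, float]:
    """Apply the same encoder to all three inputs and return
    (distance anchor-positive, distance anchor-negative)."""
    e_a, e_p, e_n = encode(anchor), encode(positive), encode(negative)
    return euclidean(e_a, e_p), euclidean(e_a, e_n)


def toy_encode(sentence: str) -> List[float]:
    """Hypothetical stand-in encoder: word count and number of cue words."""
    cues = {"never", "always", "guaranteed"}
    words = sentence.lower().split()
    return [float(len(words)), float(sum(w in cues for w in words))]


d_pos, d_neg = triplet_distances(
    toy_encode,
    anchor="returns are always guaranteed",
    positive="you will never lose",
    negative="markets can go down",
)
```

    Because `encode` is passed once and reused, updating its parameters would change all three branches together, which is the sense in which the instances share parameters.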

    [0032] FIG. 3 is a flow diagram of a computerized method 300 of intelligent text classification with limited or no training data, using system 100 of FIG. 1. The text classification module 108 receives (step 302) a corpus of structured and/or unstructured text from the database 114 for classification. The text classification module 108 executes (step 304) a trained few-shot NLI classification model on sentences from the received compliance text data to identify whether the received text includes a compliance violation. During execution of the model, the network 200 in text classification module 108 encodes each incoming sentence using a Sentence-Bert (S-Bert) Encoder (e.g., 208) (as described in Reimers and Gurevych, 2019, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, Nov. 3-7, 2019, pages 3980-3990, Association for Computational Linguistics (incorporated herein by reference)). The Sentence-Bert Encoder captures the contextual information in a sentence in a fixed size vector representation. The contextual sentence embedding is then fed to a two-layer perceptron: a fully connected (FC) layer with a rectified linear activation function (ReLU) layer (layer 210), and another FC layer (layer 212). The hidden layer 210 has ReLU activation for introducing non-linearity in the perceptron.

    [0033] Exemplary algorithms used by the layers of the neural network are below:


    $$e_s^{1} = \text{S-BERT}(s) \tag{1}$$


    $$e_s^{2} = \mathrm{ReLU}(W_{\theta,1}\, e_s^{1}) \tag{2}$$


    $$e_s^{3} = W_{\theta,2}\, e_s^{2} \tag{3}$$

    where $W_{\theta,1}$ and $W_{\theta,2}$ are the parameter matrices to be learned during training. Triplet loss (as described in Hoffer and Ailon, 2015, Deep metric learning using triplet network. In Similarity-Based Pattern Recognition-Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, Oct. 12-14, 2015, Proceedings, volume 9370 of Lecture Notes in Computer Science, pages 84-92, Springer (incorporated herein by reference)) has been used in few-shot classification methods. Although introduced for images, it has been successfully adapted to natural language processing. Triplet loss ($\mathcal{L}$) enables the network to distinguish between positive and negative examples of a class. It is defined in the equation below:


    $$\mathcal{L} = \sum \max\bigl(d(a,p) - d(a,n) + \alpha,\; 0\bigr)$$

    where $a$ is the anchor sentence, $p$ is a sentence drawn from the same class as $a$, and $n$ is a sentence drawn from a class different from that of $a$. The function $d$ computes the distance between two sentences and $\alpha$ is the margin enforced between the positive and negative examples.

    [0034] The function d is defined below:


    $$d(s, s^{+}) = \lVert e_s^{3} - e_{s^{+}}^{3} \rVert_2$$


    $$d(s, s^{-}) = \lVert e_s^{3} - e_{s^{-}}^{3} \rVert_2$$

    where $\lVert \cdot \rVert_2$ denotes the $\ell_2$ norm. The triplet loss is leveraged to train the network model.
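
    The perceptron of equations (2) and (3), the distance function d, and the triplet loss can be sketched as follows. This is a minimal sketch under stated assumptions rather than the trained model: the S-BERT encoder of equation (1) is replaced by vectors supplied directly, the matrices are tiny hand-picked values, and each matrix is assumed to be stored with one row per output dimension. All function names are introduced here for illustration.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(W: Matrix, v: Vector) -> List[float]:
    """Matrix-vector product; W holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v: Vector) -> List[float]:
    """Element-wise rectified linear activation."""
    return [max(0.0, x) for x in v]


def project(e1: Vector, W1: Matrix, W2: Matrix) -> List[float]:
    """Equations (2) and (3): e3 = W2 * ReLU(W1 * e1)."""
    return matvec(W2, relu(matvec(W1, e1)))


def distance(u: Vector, v: Vector) -> float:
    """The function d: L2 norm of the difference of two representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    triplets: Iterable[Tuple[Vector, Vector, Vector]], margin: float = 1.0
) -> float:
    """Sum over (anchor, positive, negative) of max(d(a,p) - d(a,n) + margin, 0)."""
    return sum(
        max(distance(a, p) - distance(a, n) + margin, 0.0) for a, p, n in triplets
    )


# Toy parameters (assumed values, for illustration only).
W1 = [[1.0, 0.0], [0.0, -1.0]]
W2 = [[1.0, 1.0]]
e3 = project([2.0, 3.0], W1, W2)

# First triplet: negative is far away, so it contributes 0.
# Second triplet: negative is inside the margin, so it contributes 0.5.
loss = triplet_loss(
    [
        ([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]),
        ([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]),
    ],
    margin=1.0,
)
```

    A triplet contributes to the loss only while the negative is not at least the margin farther from the anchor than the positive is, which is what pushes same-class sentences together during training.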

    [0035] The network learns sentence representations where examples of the same class are close together. The closeness of two sentences can be measured by calculating the Euclidean distance between their representations. For the final classification, the system uses a Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel. An exemplary SVM with RBF kernel is described in K. Thurnhofer-Hemsi et al., “Radial basis function kernel optimization for Support Vector Machine classifiers,” arXiv:2007.08233 [cs.LG], 17 Jul. 2020, which is incorporated herein by reference. The system uses an SVM for classification because it learns by minimizing the hinge loss, which is similar to the loss used for training the triplet network.
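
    To make the link between the two losses concrete, the following sketch writes out the standard RBF kernel and the standard hinge loss. It is illustrative only and does not train an SVM; the function names and the label convention (+1 or -1) are assumptions introduced here.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def hinge_loss(label: int, score: float) -> float:
    """Hinge loss for a label in {-1, +1} and a real-valued decision score."""
    return max(0.0, 1.0 - label * score)


# Identical representations have kernel value 1; it decays with distance.
same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.1)
near = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.1)
```

    Both the hinge loss and the triplet loss have the form max(0, margin minus a separation term), and the RBF kernel depends only on the Euclidean distance that the triplet network is trained to shape.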

    [0036] As can be appreciated, the systems and methods described herein can be applied to structured or unstructured text in any of a variety of different subject matter areas or domains.

    [0037] Exemplary domains include but are not limited to financial services, compliance, governmental regulation, pharmaceutical, and legal. In one example use case, the systems and methods described herein can be applied for regulatory compliance in the financial domain under, e.g., the U.S. regulation FINRA Rule 2210 (described at www.finra.org/rules-guidance/rulebooks/finra-rules/2210, which states that “no member may make any false, exaggerated, unwarranted, promissory or misleading statement or claim in any communication.”). An exemplary text classification generated by the system 100 under this regulation is provided in FIG. 4. As shown in FIG. 4, the first example 402 displays a contradiction in that the hypothesis statement (“77% of Americans anxious over financial situation”) contradicts the premise (“Stop worrying, the best returns are yet to come.”) and would not be labeled as a promissory compliance violation. The second example 404 displays entailment in that the hypothesis statement (“You'll never have to worry, this will take the worry out of your retirement.”) confirms the premise (“Stop worrying, the best is yet to come.”) and thus the hypothesis statement is labeled as a promissory violation.

    [0038] In some embodiments, when the text classification module 108 classifies the received text as either containing a compliance violation or not containing a compliance violation, server computing device 106 transmits (step 306) output from the module 108 to a remote computing device (e.g., client computing device 102). For example, server computing device 106 can transmit one or more data packets to client computing device 102 that include data (e.g., a flag, a text string, etc.) that indicates whether the received text comprises a compliance violation or not.

    [0039] Upon receiving the output, client computing device 102 can take one or more actions based upon the content of the output. In one example, client computing device 102 can transmit an alert message to one or more other computing devices that indicates the text includes a compliance violation and requests that the violation is remediated. In some embodiments, the alert message can include the specific text (e.g., one or more sentences) that were analyzed by text classification module 108 and determined to contain a violation along with a reference to a location of the text (e.g., document name, document number, version, etc.).
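
    The output and alert steps could look like the sketch below. The field names (`violation`, `sentences`, `location`, and so on) and both function names are hypothetical; the specification only says that the output carries a flag or text string and that the alert can include the analyzed text and a reference to its location.

```python
import json
from typing import Dict, List, Optional


def build_result(sentences: List[str], flags: List[bool]) -> str:
    """Server side: serialize which sentences were classified as violations."""
    violating = [s for s, flagged in zip(sentences, flags) if flagged]
    return json.dumps({"violation": bool(violating), "sentences": violating})


def build_alert(
    result_json: str, document_name: str, version: str
) -> Optional[Dict[str, object]]:
    """Client side: build an alert only when the result reports a violation."""
    result = json.loads(result_json)
    if not result["violation"]:
        return None
    return {
        "message": "Compliance violation detected; remediation requested.",
        "sentences": result["sentences"],
        "location": {"document_name": document_name, "version": version},
    }


result = build_result(
    ["Past performance does not guarantee future results.",
     "You'll never have to worry."],
    [False, True],
)
alert = build_alert(result, document_name="brochure.txt", version="1")
```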

    [0040] The systems and methods were trained and tested in three settings. The first is a traditional data-heavy supervised setting, where a large number of existing examples have been classified. The second is a zero-shot setting, where an expert provides only rough guidelines for what is not compliant with the legal code. The third combines these in a few-shot setting where, with comparatively little training data, the system achieves performance that is equivalent to the data-heavy supervised setting. This enables text classification systems for regulatory compliance to be constructed quickly and with little effort, allowing them to cover a wide range of industries and national regulatory frameworks.

    [0041] In the experimental setup, the dataset was split into training, development, and test datasets. These datasets comprise varying numbers of promissory and non-promissory sentences. For the few-shot learning model, the system samples 40 promissory and 190 non-promissory example sentences from the training set and trains the model on this subset.
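
    The sampling step above, 40 promissory and 190 non-promissory sentences drawn from the training set, can be sketched as follows. The data layout (a list of `(sentence, label)` pairs with label 1 for promissory), the fixed seed, and the function name are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (sentence, label); label 1 = promissory, 0 = not


def sample_subset(
    train: List[Example], n_pos: int = 40, n_neg: int = 190, seed: int = 0
) -> List[Example]:
    """Draw a fixed number of examples per class, without replacement.

    Raises ValueError if a class has fewer examples than requested.
    """
    rng = random.Random(seed)
    positives = [ex for ex in train if ex[1] == 1]
    negatives = [ex for ex in train if ex[1] == 0]
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)


# Synthetic placeholder data, only to exercise the function.
train_set = [(f"promissory {i}", 1) for i in range(100)]
train_set += [(f"neutral {i}", 0) for i in range(500)]
subset = sample_subset(train_set)
```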

    [0042] The classification performance of the few-shot learning model described herein was compared against existing supervised learning methods:

    [0043] Naive Bayes: We train a Naive Bayes classification model using tf-idf scores of the tokens in the sentence.

    [0044] Multi Layer Perceptron (MLP): We train a two-layer perceptron with ReLU activation in the hidden layer using the tf-idf scores of the sentence tokens as input features to the model.

    [0045] SVM: Similar to the MLP model, we train an SVM model for the classification task. We set the regularization parameter C and gamma to 1.0 and 0.1 respectively.

    [0046] Sentence-Bert: This setting is similar to our proposed approach. We encode each sentence into a fixed-size vector using its Sentence-Bert embedding. The sentence embedding is then fed into a 3-layer fully connected neural network with ReLU activation in the first two layers. The model is trained by minimizing the cross-entropy loss of classification using the Adam optimizer.

    [0047] Laser: In this setting, we encode each sentence using its Laser embeddings. The remaining architecture is the same as that used in the Sentence-Bert model.

    [0048] In addition to the supervised approaches, we compare the few-shot learning approach against a zero-shot learning approach. Yin et al., 2019, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, Nov. 3-7, 2019, pages 3912-3921, Association for Computational Linguistics (incorporated herein by reference) suggested a method for using pre-trained natural language inference models as sequence classifiers. To this end, the text classification module 108 uses the BART model (described in Lewis et al., 2020, BART: denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension, In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, Jul. 5-10, 2020, pages 7871-7880, Association for Computational Linguistics (incorporated herein by reference)) as the zero-shot learning model. The text classification module 108 considers the sentences tagged as ‘promissory’ as hypotheses. The probability of a sentence being the premise for each of these tagged sentences is calculated using the BART model. The module 108 then considers the maximum of those scores, and if the maximum score is greater than 0.7, the module 108 classifies the sentence as a promissory sentence.
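
    The decision rule at the end of the zero-shot procedure can be written as a short function. This sketch does not call BART or any other model: it assumes the entailment probabilities against each tagged promissory sentence have already been computed elsewhere, and the function name is introduced here for illustration.

```python
from typing import Sequence


def is_promissory(entailment_scores: Sequence[float], threshold: float = 0.7) -> bool:
    """Classify as promissory when the best score strictly exceeds the threshold.

    `entailment_scores` holds one probability per tagged promissory sentence.
    An empty sequence is treated as not promissory.
    """
    return bool(entailment_scores) and max(entailment_scores) > threshold


flagged = is_promissory([0.20, 0.71, 0.05])
borderline = is_promissory([0.70])
```

    Taking the maximum means a single strong match against any tagged sentence is enough to flag the input, which is consistent with the high recall and lower precision reported for the zero-shot model.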

    [0049] For the task, the module 108 uses the Sentence-Bert base model. It encodes a sentence into a fixed size vector of length 768. The module 108 sets $d_{e1}$, $d_{e2}$ and $d_{e3}$ to 768, 300 and 10 respectively. For every positive sentence, the module 108 supplies three negative sentences for the anchor sentence. The value of $\alpha$ is set to 1.0. The batch size is set to 16 for the triplet network, which is trained with the Adam optimizer (as described in Kingma and Ba, 2015, “Adam: A method for stochastic optimization,” In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, Calif., USA, May 7-9, 2015, Conference Track Proceedings (incorporated herein by reference)) with a learning rate of 1e-5 for 10 epochs. The module 108 sets the cost parameter C and gamma of the SVM to 0.03 and 0.1 respectively.
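
    The settings above can be collected in one place, together with one possible way of forming triplets. The sketch reads "three negative sentences for every positive sentence" as: for each ordered (anchor, positive) pair from the promissory class, draw three negatives at random. That reading, the dictionary keys, and the function name are assumptions, not details confirmed by the specification.

```python
import random
from typing import Dict, List, Tuple

# Values as stated in the text; key names are illustrative.
HYPERPARAMS: Dict[str, float] = {
    "d_e1": 768,
    "d_e2": 300,
    "d_e3": 10,
    "margin": 1.0,
    "batch_size": 16,
    "learning_rate": 1e-5,
    "epochs": 10,
    "negatives_per_positive": 3,
    "svm_C": 0.03,
    "svm_gamma": 0.1,
}


def build_triplets(
    positives: List[str], negatives: List[str], k: int = 3, seed: int = 0
) -> List[Tuple[str, str, str]]:
    """Pair every anchor with every other same-class sentence and k sampled negatives."""
    rng = random.Random(seed)
    triplets = []
    for i, anchor in enumerate(positives):
        for j, positive in enumerate(positives):
            if i == j:
                continue  # an anchor is not its own positive
            for negative in rng.sample(negatives, k):
                triplets.append((anchor, positive, negative))
    return triplets


pos = ["p1", "p2", "p3"]
neg = ["n1", "n2", "n3", "n4", "n5"]
triplets = build_triplets(pos, neg, k=int(HYPERPARAMS["negatives_per_positive"]))
```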

    [0050] As shown in Table 1 below, the few-shot method with very limited sentences provides solid results for precision, recall, and accuracy in comparison to the other supervised learning models:

    TABLE 1

    Model            Precision  Recall  F1    Accuracy
    Naive Bayes      0.78       0.48    0.60  0.75
    MLP              0.66       0.70    0.68  0.75
    SVM              0.76       0.67    0.71  0.79
    S-Bert           0.72       0.69    0.70  0.78
    Laser            0.75       0.68    0.71  0.79
    Zero-Shot        0.48       0.75    0.59  0.60
    Few-shot (ours)  0.64       0.66    0.65  0.73
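
    The F1 column is the harmonic mean of precision and recall. The short function below shows the standard formula; the function name is an illustration. Because the published precision and recall are already rounded to two decimals, recomputing F1 from them can differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute two rows of Table 1 from their rounded precision and recall.
mlp_f1 = round(f1_score(0.66, 0.70), 2)
few_shot_f1 = round(f1_score(0.64, 0.66), 2)
```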

    [0051] Appendix A attached hereto provides further experimental test results that show the benefits of the few-shot text classification architecture described herein.

    [0052] The above-described techniques can be implemented in digital and/or analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The implementation can be as a computer program product, i.e., a computer program tangibly embodied in a machine-readable storage device, for execution by, or to control the operation of, a data processing apparatus, e.g., a programmable processor, a computer, and/or multiple computers. A computer program can be written in any form of computer or programming language, including source code, compiled code, interpreted code and/or machine code, and the computer program can be deployed in any form, including as a stand-alone program or as a subroutine, element, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one or more sites. The computer program can be deployed in a cloud computing environment (e.g., Amazon® AWS, Microsoft® Azure, IBM®).

    [0053] Method steps can be performed by one or more processors executing a computer program to perform functions of the invention by operating on input data and/or generating output data. Method steps can also be performed by, and an apparatus can be implemented as, special purpose logic circuitry, e.g., a FPGA (field programmable gate array), a FPAA (field-programmable analog array), a CPLD (complex programmable logic device), a PSoC (Programmable System-on-Chip), ASIP (application-specific instruction-set processor), or an ASIC (application-specific integrated circuit), or the like. Subroutines can refer to portions of the stored computer program and/or the processor, and/or the special circuitry that implement one or more functions.

    [0054] Processors suitable for the execution of a computer program include, by way of example, special purpose microprocessors specifically programmed with instructions executable to perform the methods described herein, and any one or more processors of any kind of digital or analog computer. Generally, a processor receives instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and/or data. Memory devices, such as a cache, can be used to temporarily store data. Memory devices can also be used for long-term data storage. Generally, a computer also includes, or is operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. A computer can also be operatively coupled to a communications network in order to receive instructions and/or data from the network and/or to transfer instructions and/or data to the network. Computer-readable storage mediums suitable for embodying computer program instructions and data include all forms of volatile and non-volatile memory, including by way of example semiconductor memory devices, e.g., DRAM, SRAM, EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and optical disks, e.g., CD, DVD, HD-DVD, and Blu-ray disks. The processor and the memory can be supplemented by and/or incorporated in special purpose logic circuitry.

    [0055] To provide for interaction with a user, the above described techniques can be implemented on a computing device in communication with a display device, e.g., a CRT (cathode ray tube), plasma, or LCD (liquid crystal display) monitor, a mobile device display or screen, a holographic device and/or projector, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse, a trackball, a touchpad, or a motion sensor, by which the user can provide input to the computer (e.g., interact with a user interface element).

    [0056] Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, and/or tactile input.

    [0057] The above-described techniques can be implemented in a distributed computing system that includes a back-end component. The back-end component can, for example, be a data server, a middleware component, and/or an application server. The above described techniques can be implemented in a distributed computing system that includes a front-end component. The front-end component can, for example, be a client computer having a graphical user interface, a Web browser through which a user can interact with an example implementation, and/or other graphical user interfaces for a transmitting device. The above described techniques can be implemented in a distributed computing system that includes any combination of such back-end, middleware, or front-end components.

    [0058] The components of the computing system can be interconnected by transmission medium, which can include any form or medium of digital or analog data communication (e.g., a communication network). Transmission medium can include one or more packet-based networks and/or one or more circuit-based networks in any configuration. Packet-based networks can include, for example, the Internet, a carrier internet protocol (IP) network (e.g., local area network (LAN), wide area network (WAN), campus area network (CAN), metropolitan area network (MAN), home area network (HAN)), a private IP network, an IP private branch exchange (IPBX), a wireless network (e.g., radio access network (RAN), Bluetooth, near field communications (NFC) network, Wi-Fi, WiMAX, general packet radio service (GPRS) network, HiperLAN), and/or other packet-based networks. Circuit-based networks can include, for example, the public switched telephone network (PSTN), a legacy private branch exchange (PBX), a wireless network (e.g., RAN, code-division multiple access (CDMA) network, time division multiple access (TDMA) network, global system for mobile communications (GSM) network), and/or other circuit-based networks.

    [0059] Information transfer over transmission medium can be based on one or more communication protocols. Communication protocols can include, for example, Ethernet protocol, Internet Protocol (IP), Voice over IP (VOIP), a Peer-to-Peer (P2P) protocol, Hypertext Transfer Protocol (HTTP), Session Initiation Protocol (SIP), H.323, Media Gateway Control Protocol (MGCP), Signaling System #7 (SS7), a Global System for Mobile Communications (GSM) protocol, a Push-to-Talk (PTT) protocol, a PTT over Cellular (POC) protocol, Universal Mobile Telecommunications System (UMTS), 3GPP Long Term Evolution (LTE) and/or other communication protocols.

    [0060] Devices of the computing system can include, for example, a computer, a computer with a browser device, a telephone, an IP phone, a mobile device (e.g., cellular phone, personal digital assistant (PDA) device, smart phone, tablet, laptop computer, electronic mail device), and/or other communication devices. The browser device includes, for example, a computer (e.g., desktop computer and/or laptop computer) with a World Wide Web browser (e.g., Chrome™ from Google, Inc., Microsoft® Internet Explorer® available from Microsoft Corporation, and/or Mozilla® Firefox available from Mozilla Corporation). Mobile computing devices include, for example, a Blackberry® from Research in Motion, an iPhone® from Apple Corporation, and/or an Android™-based device. IP phones include, for example, a Cisco® Unified IP Phone 7985G and/or a Cisco® Unified Wireless Phone 7920 available from Cisco Systems, Inc.

    [0061] Comprise, include, and/or plural forms of each are open ended and include the listed parts and can include additional parts that are not listed. And/or is open ended and includes one or more of the listed parts and combinations of the listed parts.

    [0062] One skilled in the art will realize the subject matter may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The foregoing embodiments are therefore to be considered in all respects illustrative rather than limiting of the subject matter described herein.

    APPENDIX A

    [0063] Baseline Systems:

    [0064] TF-IDF vectorization with different baseline models.
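    As an illustration of the TF-IDF vectorization step, the following is a minimal sketch in plain Python. The specification does not state which TF-IDF variant or library was used; the smoothed IDF formula, the whitespace tokenization, the function names `fit_idf` and `tfidf_vector`, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Return a smoothed inverse document frequency for every term seen."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumed smoothing variant: ln((1 + N) / (1 + df)) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Weight raw term counts by IDF; terms unseen during fitting are skipped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


# Hypothetical toy corpus, tokenized on whitespace.
corpus = [
    "we guarantee a fixed return".split(),
    "past performance is no guarantee".split(),
    "please review the attached statement".split(),
]
idf = fit_idf(corpus)
vec = tfidf_vector("we guarantee a guarantee".split(), idf)
```

    The resulting sparse vectors can then be passed to any of the baseline classifiers reported below (Naive Bayes, MLP, SVM).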

    [0065] Naive Bayes:

    TABLE-US-00002 Classification Report
                    Precision  Recall  F1-Score  Support
    compliant       0.83       1.00    0.90      1524
    noncompliant    0.99       0.57    0.72      738
    accuracy                           0.86      2262
    macro avg       0.91       0.78    0.81      2262
    weighted avg    0.88       0.86    0.84      2262

    TABLE-US-00003 Confusion Matrix
    True            Non-Compliant  Compliant
    Non-Compliant   1518           6
    Compliant       320            418
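    The relationship between a confusion matrix and the precision, recall, and F1 values in a classification report can be sketched as follows. This is a hedged illustration only: it assumes rows are true classes and columns are predicted classes, and it indexes classes by position rather than by the printed labels. The function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix):
    """Compute per-class metrics from matrix[i][j] = true class i, predicted class j."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_as_k = sum(matrix[i][k] for i in range(n))  # column sum
        actually_k = sum(matrix[k])                           # row sum (support)
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": actually_k}
        )
    return results


# Counts from the Naive Bayes confusion matrix above.
naive_bayes = [[1518, 6], [320, 418]]
report = per_class_metrics(naive_bayes)
total = sum(sum(row) for row in naive_bayes)
accuracy = sum(naive_bayes[k][k] for k in range(2)) / total
```

    Under these assumptions, the first row (support 1524) reproduces the 0.83 / 1.00 / 0.90 line of the Naive Bayes report and the second row (support 738) reproduces the 0.99 / 0.57 / 0.72 line.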

    [0066] ROC AUC=0.956

    [0067] Phi Coef=0.676
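    For reference, the phi coefficient of a 2x2 confusion matrix can be computed with the standard formula, as in the sketch below. The function name `phi_coefficient` is hypothetical, and the specification does not say how its values were actually produced.

```python
import math


def phi_coefficient(matrix):
    """Phi coefficient of a 2x2 confusion matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = matrix
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return 0.0  # degenerate case: an empty row or column
    return (a * d - b * c) / denominator


# Counts from the Naive Bayes confusion matrix.
naive_bayes_phi = phi_coefficient([[1518, 6], [320, 418]])
```

    With the Naive Bayes counts, this evaluates to approximately 0.676, consistent with the value reported above.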

    TABLE-US-00004 MLP (2 Layer)
                    Precision  Recall  F1-Score  Support
    compliant       0.93       0.95    0.94      1524
    noncompliant    0.89       0.85    0.87      738
    accuracy                           0.92      2262
    macro avg       0.91       0.90    0.90      2262
    weighted avg    0.91       0.92    0.91      2262

    TABLE-US-00005 Confusion Matrix
    True            Non-Compliant  Compliant
    Non-Compliant   1444           80
    Compliant       112            626

    [0068] ROC AUC=0.965

    [0069] Phi-coef=0.806

    TABLE-US-00006 SVM
                    Precision  Recall  F1-Score  Support
    compliant       0.90       0.98    0.94      1524
    noncompliant    0.95       0.78    0.86      738
    accuracy                           0.92      2262
    macro avg       0.93       0.88    0.90      2262
    weighted avg    0.92       0.92    0.91      2262

    TABLE-US-00007 Confusion Matrix
    True            Non-Compliant  Compliant
    Non-Compliant   1495           29
    Compliant       163            575

    [0070] ROC AUC=0.973

    [0071] Phi Coef=0.806

    [0072] Deep Learning Experiments Results:

    [0073] Zero-shot learning using Huggingface:

    TABLE-US-00008
                    Precision  Recall  F1-Score  Support
    compliant       0.27       0.84    0.41      1486
    noncompliant    0.40       0.05    0.08      3485
    accuracy                           0.28      4971
    macro avg       0.34       0.44    0.25      4971
    weighted avg    0.36       0.28    0.18      4971

    [0074] Classification Using SBert Embeddings:

    TABLE-US-00009
                    Precision  Recall  F1-Score  Support
    noncompliant    0.93       0.94    0.94      1524
    compliant       0.88       0.86    0.87      738
    accuracy                           0.91      2262
    macro avg       0.91       0.90    0.90      2262
    weighted avg    0.91       0.91    0.91      2262

    TABLE-US-00010 Confusion Matrix
    True            Non-Compliant  Compliant
    Non-Compliant   1438           86
    Compliant       107            631

    [0075] ROC AUC=0.969

    [0076] Phi Coef=0.82

    [0077] Classification using LASER embeddings:

    TABLE-US-00011
                    Precision  Recall  F1-Score  Support
    noncompliant    0.88       0.96    0.92      1524
    compliant       0.89       0.73    0.80      738
    accuracy                           0.88      2262
    macro avg       0.89       0.84    0.86      2262
    weighted avg    0.88       0.88    0.88      2262

    TABLE-US-00012 Confusion Matrix
    True            Non-Compliant  Compliant
    Non-Compliant   1458           66
    Compliant       197            541

    [0078] ROC AUC=0.945

    [0079] Phi Coef=0.73

    [0080] AUC Curve

    [0081] Classification Result on Label 2

    [0082] For label 2, we consider the two classes to be “promissory” and “rest”.

    TABLE-US-00013 Naive Bayes Classification
                    Precision  Recall  F1-Score  Support
    Rest            0.74       0.92    0.82      1402
    Promissory      0.78       0.48    0.60      860
    accuracy                           0.75      2262
    macro avg       0.76       0.70    0.71      2262
    weighted avg    0.76       0.75    0.74      2262

    TABLE-US-00014 Confusion Matrix
    True            Rest   Promissory
    Rest            1287   115
    Promissory      446    414

    [0083] Phi Coef=0.458

    [0084] ROC AUC=0.838

    TABLE-US-00015 Multi-Layer Perceptron
                    Precision  Recall  F1-Score  Support
    Rest            0.81       0.78    0.79      1402
    Promissory      0.66       0.70    0.68      860
    accuracy                           0.75      2262
    macro avg       0.73       0.74    0.74      2262
    weighted avg    0.75       0.75    0.75      2262

    TABLE-US-00016 Confusion Matrix
    True            Rest   Promissory
    Rest            1091   311
    Promissory      259    601

    [0085] Phi Coef=0.472

    [0086] ROC AUC=0.824

    TABLE-US-00017 SVM Classification
                    Precision  Recall  F1-Score  Support
    Rest            0.81       0.87    0.84      1402
    Promissory      0.76       0.67    0.71      860
    accuracy                           0.79      2262
    macro avg       0.78       0.77    0.78      2262
    weighted avg    0.79       0.79    0.79      2262

    TABLE-US-00018 Confusion Matrix
    True            Rest   Promissory
    Rest            1219   183
    Promissory      284    576

    [0087] Phi Coef=0.554

    [0088] ROC AUC=0.855

    TABLE-US-00019 SBert Classification
                    Precision  Recall  F1-Score  Support
    Rest            0.81       0.85    0.83      1402
    Promissory      0.72       0.69    0.70      860
    accuracy                           0.78      2262
    macro avg       0.77       0.75    0.76      2262
    weighted avg    0.77       0.78    0.77      2262

    TABLE-US-00020 Confusion Matrix
    True            Rest   Promissory
    Rest            1196   206
    Promissory      300    560

    [0089] ROC AUC=0.851

    [0090] Phi Coef: 0.524

    TABLE-US-00021 LASER Classification
                    Precision  Recall  F1-Score  Support
    Rest            0.81       0.86    0.84      1402
    Promissory      0.75       0.68    0.71      860
    accuracy                           0.79      2262
    macro avg       0.78       0.77    0.77      2262
    weighted avg    0.79       0.79    0.79      2262

    TABLE-US-00022 Confusion Matrix
    True            Rest   Promissory
    Rest            1207   195
    Promissory      277    583

    [0091] ROC AUC=0.858

    [0092] Phi Coef: 0.550

    [0093] Zero Shot Learning

    [0094] Huggingface Pipeline

    [0095] Model: facebook/bart-large-mnli

    [0096] Method: The 40 sentences are treated as classes and the probability of a sentence lying in those classes is calculated. We take the max of those scores, and if the max is greater than 0.7, we classify it as promissory.
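    The decision rule described in this method can be sketched as follows. This is a minimal illustration that assumes the per-class scores have already been produced by the zero-shot model; the function name `classify_by_max_score` and the example scores are hypothetical, and only the 0.7 threshold and the use of the maximum score come from the description above.

```python
def classify_by_max_score(class_scores, threshold=0.7):
    """Label a sentence 'promissory' if its best class score exceeds the threshold."""
    if not class_scores:
        raise ValueError("class_scores must contain at least one score")
    best = max(class_scores)
    # Strictly greater than the threshold, per the description above.
    return "promissory" if best > threshold else "rest"


# Hypothetical scores for one sentence against a few candidate classes.
label_high = classify_by_max_score([0.12, 0.83, 0.40])
label_low = classify_by_max_score([0.12, 0.55, 0.40])
```

    Because only the maximum score is compared against the threshold, a single strong match against any one of the candidate classes is sufficient for a promissory label.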

    TABLE-US-00023
                    Precision  Recall  F1-Score  Support
    Rest            0.77       0.50    0.61      1402
    Promissory      0.48       0.75    0.59      860
    accuracy                           0.60      2262
    macro avg       0.63       0.63    0.60      2262
    weighted avg    0.66       0.60    0.60      2262

    TABLE-US-00024 Confusion Matrix
    True            Rest   Promissory
    Rest            706    696
    Promissory      211    649

    [0097] ROC AUC=0.669

    [0098] Phi Coef: 0.25

    [0099] Few Shot Siamese Network

    [0100] Threshold=0.02

    TABLE-US-00025
                    Precision  Recall  F1-Score  Support
    Rest            0.71       0.90    0.79      1402
    Promissory      0.71       0.39    0.5       860
    accuracy                           0.71      2262
    macro avg       0.71       0.64    0.65      2262
    weighted avg    0.71       0.71    0.68      2262

    TABLE-US-00026 Confusion Matrix
    True            Rest   Promissory
    Rest            1263   139
    Promissory      526    334

    [0101] Phi Coef=0.34

    [0102] Model 2: Triplet Loss.

    [0103] We sample 40 examples from the promissory cases and 190 examples from the non-promissory cases. We then learn a compact representation of the sentences using SBert and triplet loss. For final classification we use SVM since triplet loss draws a margin between examples.
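    The triplet loss objective referenced here can be illustrated with the standard formulation, in which the anchor is pulled toward the positive example and pushed away from the negative example by at least a margin. The sketch below is an assumption-laden example: the specification does not state the distance function or margin value, so Euclidean distance, a margin of 1.0, the function name `triplet_loss`, and the toy embeddings are all hypothetical.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    d_pos = math.dist(anchor, positive)  # Euclidean distance (assumed metric)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Hypothetical 2-D embeddings standing in for sentence vectors.
anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

satisfied = triplet_loss(anchor, positive=near, negative=far)  # margin already met
violated = triplet_loss(anchor, positive=far, negative=near)   # margin violated
```

    A loss of zero means the negative example is already farther from the anchor than the positive example by at least the margin, which is the separation a downstream margin-based classifier such as an SVM can exploit.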

    TABLE-US-00027
                    Precision  Recall  F1-Score  Support
    Rest            0.79       0.77    0.78      1402
    Promissory      0.64       0.66    0.65      860
    accuracy                           0.73      2262
    macro avg       0.71       0.72    0.71      2262
    weighted avg    0.73       0.73    0.73      2262

    TABLE-US-00028 Confusion Matrix
    True            Rest   Promissory
    Rest            1085   317
    Promissory      294    566