G06F16/325

Compact entity identifier embeddings

The disclosed embodiments provide a system for processing data. During operation, the system applies a first set of hash functions to a first entity identifier (ID) for a first entity to generate a first set of hash values. Next, the system produces a first set of intermediate vectors from the first set of hash values and a first set of lookup tables by matching each hash value in the first set of hash values to an entry in a corresponding lookup table in the first set of lookup tables. The system then performs an element-wise aggregation of the first set of intermediate vectors to produce a first embedding. Finally, the system outputs the first embedding for use by a machine learning model.
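The hash-then-lookup-then-aggregate flow described in this abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the salted SHA-256 hash family, the constants `NUM_HASHES`, `TABLE_SIZE`, and `DIM`, the randomly initialized tables (a real system would presumably learn them), and the choice of summation as the element-wise aggregation.

```python
import hashlib
import random

NUM_HASHES = 3     # hypothetical: number of hash functions and lookup tables
TABLE_SIZE = 1000  # hypothetical: entries per lookup table
DIM = 8            # hypothetical: width of each intermediate vector

_rng = random.Random(0)
# One lookup table per hash function. Random values stand in for learned weights.
TABLES = [
    [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(TABLE_SIZE)]
    for _ in range(NUM_HASHES)
]


def hash_values(entity_id):
    """Apply NUM_HASHES salted hash functions to one entity ID."""
    values = []
    for k in range(NUM_HASHES):
        digest = hashlib.sha256(f"{k}:{entity_id}".encode("utf-8")).digest()
        values.append(int.from_bytes(digest[:8], "big") % TABLE_SIZE)
    return values


def embed(entity_id):
    """Look up one intermediate vector per hash value and sum them element-wise."""
    rows = [TABLES[k][h] for k, h in enumerate(hash_values(entity_id))]
    return [sum(column) for column in zip(*rows)]
```

With these assumed sizes, the tables hold 3 x 1000 vectors no matter how many entity IDs exist, which is the sense in which such an embedding is compact.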

File fingerprint generation

A string of characters within a file may be obtained. A first sequence may be selected from the string of characters. A first hash may be generated based on the first sequence. A second sequence may be selected from the string of characters based on the first sequence. The second sequence may be shifted from the first sequence. A second hash may be generated based on the second sequence. A fingerprint for the file may be generated based on the first hash and the second hash.
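As a rough sketch of the shifted-sequence idea, the following hashes a first window of characters, then a second window offset from the first, and joins the two hashes. The window length, shift distance, SHA-256 truncation, and the `-` separator are all illustrative assumptions, not details taken from the abstract.

```python
import hashlib


def sequence_hash(sequence):
    """Hash one character sequence (SHA-256, truncated, is an assumed choice)."""
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()[:16]


def file_fingerprint(text, window=8, shift=4):
    """Combine the hash of a first sequence with the hash of a shifted second sequence."""
    first = text[0:window]                # first sequence selected from the string
    second = text[shift:shift + window]   # second sequence, shifted from the first
    return sequence_hash(first) + "-" + sequence_hash(second)
```

Two files with the same leading characters produce the same fingerprint under this sketch, while a change inside either window changes at least one of the two hashes.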

SYSTEM FOR MACHINE-LEARNING BASED IDENTIFICATION AND FILTERING OF ELECTRONIC NETWORK COMMUNICATION

A system is provided for machine-learning based identification and filtering of electronic network communication. In particular, the system may continuously monitor and pull electronic communications data from one or more networked computing systems in an enterprise environment. Based on the electronic communications data, the system may use machine learning algorithms to generate a database of associations between one or more users and one or more topics of interest. The system may then output one or more recommendations to one or more users for transmitting communications associated with the one or more topics of interest. In this way, the system may improve the efficiency of communications received and transmitted within a network.

System and method for fingerprinting-based conversation threading

Systems, methods, and computer readable media for staging a corpus of electronic communication documents for analysis, such as, for example, via a content analysis platform. The staging may include a staging platform accessing the corpus of electronic communication documents. For each electronic communication document within the corpus, the staging platform may generate a fingerprint based upon the output of a hash function executed upon a set of characteristics corresponding to each segment within the electronic communication document. The staging platform may analyze the generated fingerprints to generate a plurality of threaded conversations that do not include electronic communication documents that fail to convey any new information. The systems and methods may also include detecting and flagging any segments within an electronic communication document that may have been mutated by its author.
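One way to picture the threading step is sketched below. It assumes each document has already been split into segments (its own message plus any quoted earlier messages), that a segment's "characteristics" are its sender and whitespace-normalized body, and that a document conveys no new information when its segment fingerprints are a strict subset of another document's. All of these are assumptions made for illustration.

```python
import hashlib


def segment_fingerprint(sender, body):
    """Hash the assumed characteristics of one segment: sender and normalized body."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(f"{sender}|{normalized}".encode("utf-8")).hexdigest()


def document_fingerprints(segments):
    """Fingerprint every (sender, body) segment in one document."""
    return frozenset(segment_fingerprint(s, b) for s, b in segments)


def drop_redundant(docs):
    """Keep only documents whose segments are not strictly contained in another document."""
    prints = {name: document_fingerprints(segs) for name, segs in docs.items()}
    kept = []
    for name, fp in prints.items():
        contained = any(fp < other for o, other in prints.items() if o != name)
        if not contained:
            kept.append(name)
    return kept
```

For example, a reply that quotes the original message in full makes the original redundant, so only the reply would be kept for the threaded conversation.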

Computing a private set intersection

Systems and methods for computing a private set intersection are disclosed. A method includes storing, at a sender device, a first set of values. The method includes receiving, from a receiver device, a homomorphic encryption of a receiver device value. The method includes computing a homomorphically encrypted number based on a difference between the homomorphic encryption of the receiver device value and each value in the first set of values, and based on a hash function of the encryption of the receiver device value. The method includes transmitting the homomorphically encrypted number to the receiver device for determination, at the receiver device, whether the receiver device value is in the first set of values.
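The exchange can be sketched with a toy additively homomorphic scheme. The sketch below uses textbook Paillier encryption with tiny primes, which is an assumption: the abstract does not name a scheme. It also departs from the abstract in one labeled way: the sender returns one masked ciphertext per set element rather than a single number, because plain Paillier cannot multiply the differences together. The mask is derived from a SHA-256 hash of the received ciphertext, loosely mirroring the "hash function of the encryption" in the abstract. This is not secure and is meant only to show the data flow.

```python
import hashlib
import math
import secrets


def keygen(p=1789, q=1861):
    """Toy Paillier key pair (real deployments use primes of 1024 bits or more)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Enc(m) = (1 + m*n) * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def sender_respond(n, sender_set, c):
    """For each stored value s, return Enc(mask * (x - s)) without learning x."""
    n2 = n * n
    responses = []
    for s in sender_set:
        diff = c * (1 + ((-s) % n) * n) % n2  # homomorphic subtraction: Enc(x - s)
        digest = hashlib.sha256(f"{c}:{s}".encode("utf-8")).digest()
        mask = int.from_bytes(digest, "big") % (n - 1) + 1  # nonzero mask below n
        responses.append(pow(diff, mask, n2))  # homomorphic scalar multiplication
    secrets.SystemRandom().shuffle(responses)  # hide which element matched
    return responses


def receiver_in_set(n, priv, responses):
    """The receiver's value is in the sender's set iff some response decrypts to 0."""
    return any(decrypt(n, priv, c) == 0 for c in responses)
```

With these toy primes the values compared should stay small (differences below 1789) so that a nonzero difference can never be masked to zero.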

KEY PACKING FOR FLASH KEY VALUE STORE OPERATIONS
20230281130 · 2023-09-07 ·

A key value (KV) store, a method thereof, and a storage system are provided herein. The KV store may include a key logger; and a processor configured to receive a first command for storing a first KV in the KV store, write a first value of the first KV to a first NAND page, generate an extent map for identifying the first memory page including the first value, write the extent map to a second memory page, append an entry for storing the first KV to the key logger, and update a device hashmap of the KV store to include a first key of the first KV, upon a threshold being met within the key logger.
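The write path in this abstract can be mimicked with an in-memory toy. The class name `ToyKVStore`, the list standing in for memory pages, the dictionary-shaped extent map, and the entry-count flush threshold are all hypothetical; the abstract does not specify what the threshold measures or how pages are laid out.

```python
class ToyKVStore:
    """In-memory stand-in for the described write path (not a real flash layout)."""

    def __init__(self, flush_threshold=2):
        self.pages = []            # simulated memory pages
        self.key_logger = []       # pending (key, extent_map_page) entries
        self.device_hashmap = {}   # key -> page index of its extent map
        self.flush_threshold = flush_threshold  # assumed: number of logged entries

    def _write_page(self, payload):
        self.pages.append(payload)
        return len(self.pages) - 1

    def put(self, key, value):
        value_page = self._write_page(value)                           # value -> first page
        extent_page = self._write_page({"value_pages": [value_page]})  # extent map -> second page
        self.key_logger.append((key, extent_page))                     # append entry to the key logger
        if len(self.key_logger) >= self.flush_threshold:
            # Threshold met: fold all logged keys into the device hashmap at once.
            for logged_key, page in self.key_logger:
                self.device_hashmap[logged_key] = page
            self.key_logger.clear()

    def get(self, key):
        extent_page = None
        for logged_key, page in reversed(self.key_logger):  # newest logged entry wins
            if logged_key == key:
                extent_page = page
                break
        if extent_page is None:
            extent_page = self.device_hashmap.get(key)
        if extent_page is None:
            return None
        extent = self.pages[extent_page]
        return b"".join(self.pages[i] for i in extent["value_pages"])
```

Batching keys in the logger before touching the hashmap is the point of the sketch: several small key updates are packed into one hashmap update.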

SPARSE EMBEDDING INDEX FOR SEARCH

A search system facilitates efficient and fast near neighbor search given item vector representations of items, regardless of item type or corpus size. To index an item, the search system expands an item vector for the item to generate an expanded item vector and selects elements of the expanded item vector. The item is indexed by storing an identifier of the item in posting lists of an index corresponding to the position of each selected element in the expanded item vector. When a query is received, a query vector representing the query is expanded to generate an expanded query vector, and elements of the expanded query vector are selected. Candidate items are identified based on posting lists corresponding to the position of each selected element in the expanded query vector. The candidate items may be ranked, and a result set is returned as a response to the query.
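A small sketch of the expand, select, and post flow follows. It assumes the expansion is a fixed random linear projection to a higher dimension, that selection keeps the positions of the `TOP_K` largest elements, and that candidates are ranked by dot product with the query. None of these choices comes from the abstract; they are placeholders that make the flow concrete.

```python
import random
from collections import defaultdict

DIM = 4            # hypothetical: input vector width
EXPANDED_DIM = 32  # hypothetical: expanded vector width
TOP_K = 4          # hypothetical: number of selected positions per vector

_rng = random.Random(0)
PROJECTION = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(EXPANDED_DIM)]

posting_lists = defaultdict(set)  # position in expanded vector -> item identifiers


def expand(vec):
    """Expand a vector with an assumed fixed random projection."""
    return [sum(w * x for w, x in zip(row, vec)) for row in PROJECTION]


def select_positions(expanded):
    """Select the positions of the TOP_K largest elements."""
    order = sorted(range(len(expanded)), key=lambda i: expanded[i], reverse=True)
    return order[:TOP_K]


def index_item(item_id, vec):
    for pos in select_positions(expand(vec)):
        posting_lists[pos].add(item_id)


def search(query_vec, item_vectors):
    """Gather candidates from matching posting lists, then rank by dot product."""
    candidates = set()
    for pos in select_positions(expand(query_vec)):
        candidates |= posting_lists.get(pos, set())

    def score(item_id):
        return sum(a * b for a, b in zip(item_vectors[item_id], query_vec))

    return sorted(candidates, key=score, reverse=True)
```

Only the posting lists for the query's selected positions are read, so the number of items scored depends on list lengths rather than on the size of the whole corpus.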

SYSTEM AND METHOD FOR FINGERPRINTING-BASED CONVERSATION THREADING
20230024532 · 2023-01-26 ·

Systems, methods, and computer readable media for staging a corpus of electronic communication documents for analysis, such as, for example, via a content analysis platform. The staging may include a staging platform accessing the corpus of electronic communication documents. For each electronic communication document within the corpus, the staging platform may generate a fingerprint based upon the output of a hash function executed upon a set of characteristics corresponding to each segment within the electronic communication document. The staging platform may analyze the generated fingerprints to generate a plurality of threaded conversations that do not include electronic communication documents that fail to convey any new information. The systems and methods may also include detecting and flagging any segments within an electronic communication document that may have been mutated by its author.

TECHNOLOGIES FOR FILE SHARING
20220405246 · 2022-12-22 ·

This disclosure enables various computing technologies for sharing various files securely and selectively between various predefined user groups based on various predefined workflows. For each of the predefined workflows, the files are shared based on a data structure storing various document identifiers and various metadata tags, with the document identifiers mapping onto the metadata tags.

IDENTIFYING SIMILAR DOCUMENTS IN A FILE REPOSITORY USING UNIQUE DOCUMENT SIGNATURES
20230376542 · 2023-11-23 ·

Methods, systems, and non-transitory computer readable storage media are disclosed for determining clusters of similar digital documents using unique document signatures. Specifically, the disclosed system processes digital text in a digital document to tokenize character strings (e.g., words) in the digital document by combining a subset of character values and string lengths in the character strings. Additionally, the disclosed system generates a document signature for the digital document by combining subsets of tokens generated for the digital document into a token sequence indicative of the digital text in the digital document. The disclosed system determines a cluster of similar digital documents including the digital document by comparing the document signature of the digital document to document signatures corresponding to a plurality of digital documents.
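The tokenize, sign, and compare steps can be sketched as follows. The specific token rule (first character, last character, and length of each word), the every-other-token subset used for the signature, the `difflib` similarity measure, and the greedy clustering with a 0.7 threshold are all assumptions chosen for illustration.

```python
import difflib


def tokenize(word):
    """Combine a subset of character values with the string length (assumed rule)."""
    return f"{word[0]}{word[-1]}{len(word)}"


def signature(text, stride=2):
    """Build a token sequence from a subset of the document's tokens."""
    tokens = [tokenize(w) for w in text.lower().split()]
    return tokens[::stride]


def similarity(sig_a, sig_b):
    """Compare two signatures as sequences; 1.0 means identical."""
    return difflib.SequenceMatcher(None, sig_a, sig_b).ratio()


def cluster(docs, threshold=0.7):
    """Greedily group documents whose signature is close to a cluster's first signature."""
    clusters = []
    for name, text in docs.items():
        sig = signature(text)
        for group in clusters:
            if similarity(sig, group["sig"]) >= threshold:
                group["members"].append(name)
                break
        else:
            clusters.append({"sig": sig, "members": [name]})
    return [group["members"] for group in clusters]
```

Because each token keeps only two characters and a length, near-duplicate documents that differ in a few words still yield mostly matching signatures.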