G06F16/355

STRING ENTROPY IN A DATA PIPELINE
20230040648 · 2023-02-09

Various embodiments comprise systems and methods to determine entropy in strings generated by a data pipeline. In some examples, data monitoring circuitry monitors a data pipeline that ingests input data, processes the input data, and responsively generates and transfers a data string that comprises character groups. The data monitoring circuitry receives the data string, identifies character groups in the data string, identifies group types for the character groups, and assigns numbers to the character groups based on the group types. The data monitoring circuitry determines a probability distribution for the numbers, calculates entropy for the data string based on the probability distribution, and generates an entropy histogram based on the entropy. The data monitoring circuitry compares the entropy histogram of the data string to another entropy histogram for another data string, determines a change in entropy, and reports the change in entropy.
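The grouping and entropy steps described above can be sketched as follows. This is a minimal illustration, not the patented method: the three group types (letters, digits, other), the numbers assigned to them, and the function names are assumptions, and the histogram and comparison steps are left out.

```python
import math
import re
from collections import Counter

# Hypothetical group types and the numbers assigned to them.
GROUP_TYPE_NUMBERS = {"alpha": 0, "digit": 1, "other": 2}

# One named alternative per assumed group type; whitespace separates groups.
GROUP_PATTERN = re.compile(
    r"(?P<alpha>[A-Za-z]+)|(?P<digit>[0-9]+)|(?P<other>[^A-Za-z0-9\s]+)"
)

def group_numbers(data_string):
    """Identify character groups and map each group to its type number."""
    return [
        GROUP_TYPE_NUMBERS[match.lastgroup]
        for match in GROUP_PATTERN.finditer(data_string)
    ]

def shannon_entropy(numbers):
    """Shannon entropy (in bits) of the empirical distribution of the numbers."""
    total = len(numbers)
    if total == 0:
        return 0.0
    counts = Counter(numbers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_change(string_a, string_b):
    """Absolute difference in entropy between two data strings."""
    return abs(
        shannon_entropy(group_numbers(string_a))
        - shannon_entropy(group_numbers(string_b))
    )
```

For instance, `entropy_change("order 1001", "order-1001-A")` compares a string with two group types against one with three, so a monitor could flag the shift in structure.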

Cloud data attack detection based on cloud security posture and resource network path tracing

The technology disclosed relates to streamlined analysis of security posture of a cloud environment. In particular, the disclosed technology relates to accessing permissions data and access control data for pairs of compute resources and storage resources in the cloud environment, tracing network communication paths between the pairs of the compute resources and the storage resources based on the permissions data and the access control data, accessing sensitivity classification data for objects in the storage resources, qualifying a subset of the pairs of the compute resources and the storage resources as vulnerable to breach attack based on an evaluation of the permissions data, the access control data, and the sensitivity classification data against a set risk criterion, and generating a representation of propagation of the breach attack along the network communication paths, the representation identifying relationships between the subset of the pairs of the compute resources and the storage resources.

Representing documents using document keys

Embodiments are directed to representing documents using document keys. Documents that include one or more clauses may be provided. Each clause type for the one or more clauses in documents may be determined based on one or more classification models. One or more clause identifiers may be associated with the one or more clauses based on one or more clause types of each clause. A document key may be generated for each document based on an ordered collection of the one or more clauses included in each document such that each clause identifier may be positioned in the document key based on an order of its location in a corresponding clause of a document. The documents may be analyzed based on evaluations of one or more document keys corresponding to the documents. One or more reports may be generated based on one or more results of the analysis.
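As a rough sketch of the document key idea, the snippet below assigns one identifier per clause and concatenates the identifiers in clause order. The keyword lookup is only a stand-in for the classification models mentioned in the abstract, and the clause types, identifiers, and function names are hypothetical.

```python
# Hypothetical clause types with a keyword that stands in for a trained classifier.
CLAUSE_KEYWORDS = [
    ("payment", "pay"),
    ("termination", "terminat"),
    ("confidentiality", "confidential"),
]

# Hypothetical one-letter clause identifiers.
CLAUSE_IDS = {"payment": "P", "termination": "T", "confidentiality": "C", "unknown": "U"}

def classify_clause(clause):
    """Return the first clause type whose keyword appears in the clause text."""
    lowered = clause.lower()
    for clause_type, keyword in CLAUSE_KEYWORDS:
        if keyword in lowered:
            return clause_type
    return "unknown"

def document_key(clauses):
    """Build a key whose identifier positions follow the order of the clauses."""
    return "".join(CLAUSE_IDS[classify_clause(clause)] for clause in clauses)
```

Two documents with the same key then share the same clause sequence, which makes comparisons between documents a matter of comparing short strings.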

Identifying similar documents in a file repository using unique document signatures
11593439 · 2023-02-28

Methods, systems, and non-transitory computer readable storage media are disclosed for determining clusters of similar digital documents using unique document signatures. Specifically, the disclosed system processes digital text in a digital document to tokenize character strings (e.g., words) in the digital document by combining a subset of character values and string lengths in the character strings. Additionally, the disclosed system generates a document signature for the digital document by combining subsets of tokens generated for the digital document into a token sequence indicative of the digital text in the digital document. The disclosed system determines a cluster of similar digital documents including the digital document by comparing the document signature of the digital document to document signatures corresponding to a plurality of digital documents.
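One way to picture the tokenization and signature steps is sketched below. The specific choices are assumptions made for illustration: each word becomes its first character, last character, and length; the signature keeps every second token; and documents are clustered by exact signature match rather than by whatever comparison the disclosed system uses.

```python
from collections import defaultdict

def tokenize_word(word):
    """Combine a subset of character values (first, last) with the string length."""
    return f"{word[0]}{word[-1]}{len(word)}"

def document_signature(text, step=2):
    """Combine a subset of the word tokens (every `step`-th) into a token sequence."""
    tokens = [tokenize_word(word) for word in text.lower().split()]
    return tuple(tokens[::step])

def cluster_by_signature(documents):
    """Group document indices that share an identical signature."""
    clusters = defaultdict(list)
    for index, text in enumerate(documents):
        clusters[document_signature(text)].append(index)
    return dict(clusters)
```

Because only a subset of tokens enters the signature, small edits such as "quick" versus "quiet" can leave the signature unchanged, so near-duplicates land in the same cluster.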

Enterprise knowledge graph

Examples described herein generally relate to a computer system for generating a knowledge graph storing a plurality of entities and to displaying a topic page for an entity in the knowledge graph. The computer system performs a mining of source documents within an enterprise intranet to determine a plurality of entity names. The computer system generates an entity record within the knowledge graph for a mined entity name based on an entity schema and the source documents. The entity record includes attributes aggregated from the source documents. The computer system receives a curation action on the entity record from a first user. The computer system updates the entity record based on the curation action. The computer system displays an entity page including at least a portion of the attributes to a second user based on permissions of the second user to view the source documents.

SEGMENTATION BASED ON CLUSTERING ENGINES APPLIED TO SUMMARIES
20180011920 · 2018-01-11

Examples disclosed herein relate to segmentation based on clustering engines applied to summaries. In one implementation, a processor segments text based on a comparison of the output of multiple clustering engines applied to multiple summarizations of documents associated with the text. The processor outputs information related to the contents of the segments.

Revealing content reuse using coarse analysis

Systems and methods for managing content provenance are provided. A network system accesses a plurality of documents. The plurality of documents is then hashed to identify one or more content features within each of the documents. In one embodiment, the hash is a MinHash. The network system compares the content features of each of the plurality of documents to determine a similarity score between each of the plurality of documents. In one embodiment, the similarity score is a Jaccard score. The network system then clusters the plurality of documents into one or more clusters based on the similarity score of each of the plurality of documents. In one embodiment, the clustering is performed using DBSCAN. DBSCAN can be iteratively performed with decreasing epsilon values to derive clusters of related but relatively dissimilar documents. The clustering information associated with the clusters is stored for use during runtime.
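The hashing and similarity steps can be illustrated with a small pure-Python sketch. It assumes word shingles as the content features and seeded SHA-1 digests as the hash family; both are illustrative choices, and the DBSCAN clustering stage is not shown.

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles; a short text yields a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Exact Jaccard score: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash(seed, item):
    """One member of the assumed hash family: seeded SHA-1, first 8 bytes."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=64):
    """MinHash signature: the minimum hash per seed (items must be non-empty)."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard score."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Pairwise distances of `1 - estimated_jaccard(...)` could then be handed to a density-based clusterer, with the epsilon threshold lowered on each pass as the abstract describes.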

Mapping of unlabeled data onto a target schema via semantic type detection

Automatically mapping unlabeled input data onto a target schema via semantic type detection is described. The input data includes data elements that are structured as 2D table rows and columns forming cells. Each data element is included in a cell. The target schema includes a set of fields. Schema mapping includes mapping each column to one or more fields. More particularly, the fields are clustered into field clusters, where each field cluster includes one or more of the fields. Each column is automatically mapped to one of the field clusters of the set of field clusters. The mapping between schema fields and data columns is automatically performed based on appropriate pairings of the detected semantic types, where the semantic types are encoded in vector representations of the fields, the field clusters, and the data elements.

Confidential information identification based upon communication recipient

One embodiment provides a method, including: receiving an indication of an addition of a new participant in a textual communication between at least two existing participants; identifying at least one confidential topic contained within the textual communication by (i) parsing the textual communication and (ii) identifying at least one topic contained within the textual communication; the identifying comprising (i) accessing a confidentiality graph comprising (a) nodes representing participants and (b) edges representing confidential concepts that are acceptable discussion topics between participants connected by a corresponding edge and (ii) determining that an edge corresponding to the at least one confidential topic does not connect the new participant with both of the existing participants; and alerting one of the existing participants that the at least one confidential topic is included in the textual communication to be sent to the new participant.
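The graph check in this abstract can be sketched as below, assuming the topics have already been extracted from the communication. The graph representation (a dictionary keyed by unordered participant pairs, with a set of acceptable topics per edge) and all names are hypothetical.

```python
def topics_to_flag(graph, existing_participants, new_participant, topics):
    """Return topics not cleared between the new participant and every existing one.

    `graph` maps frozenset({participant_a, participant_b}) to the set of
    confidential topics those two participants may discuss (an assumed layout).
    """
    flagged = []
    for topic in topics:
        cleared_with_all = all(
            topic in graph.get(frozenset((new_participant, person)), set())
            for person in existing_participants
        )
        if not cleared_with_all:
            flagged.append(topic)
    return flagged
```

A non-empty result is the condition under which an existing participant would be alerted before the communication reaches the new participant.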

SYSTEMS AND METHODS FOR ANALYZING INFORMATION CONTENT
20230237115 · 2023-07-27

A system may determine information in relation to a link in an interlinked set of information content items. A memory may store a set of machine-readable instructions operable, when executed by a processor, to receive a link context associated with a link in an information content item of the interlinked set. The link context may include information from the information content item providing context for the link. Instructions may also be operable to identify one or more additional links present in the link context and determine a link density as the proportion of the link context which includes the or each additional link relative to the size of the link context, and/or determine a text density as the proportion of the link context which does not include the or each additional link relative to the size of the link context and/or count the or each additional link.
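The link density, text density, and link count can be computed along these lines. The sketch assumes the link context is a plain string, measures size in characters, and takes the additional links as non-overlapping `(start, end)` character spans; none of these representation details come from the abstract.

```python
def link_metrics(link_context, link_spans):
    """Return (link_density, text_density, link_count) for a link context.

    `link_spans` is an assumed list of non-overlapping (start, end) character
    offsets, end exclusive, marking the additional links in the context.
    """
    size = len(link_context)
    if size == 0:
        return 0.0, 0.0, 0
    linked = sum(end - start for start, end in link_spans)
    link_density = linked / size
    text_density = (size - linked) / size
    return link_density, text_density, len(link_spans)
```

A context dominated by other links (high link density) is more likely a navigation list than descriptive text about the link of interest.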