SYSTEMS AND METHODS FOR IDENTIFYING DUPLICATE DOCUMENTS AND DETECTING MISREPRESENTATION
20260120501 · 2026-04-30
Assignee
Inventors
CPC classification
G06V30/19093
PHYSICS
International classification
Abstract
Systems, methods, and non-transitory computer readable media configured for identifying duplicate and misrepresented documents are provided. At least one processor may retrieve, from a first source, a first document, and may retrieve, from a second source, a second document. The processor may process each document. The processor may determine a cosine similarity between a first set of numbers and second set of numbers, and whether the cosine similarity exceeds a first threshold. The processor may determine a number of words in common between the two documents, and whether the number of words in common exceeds a second threshold. The processor may determine a number of sentences in common between the two documents, and whether that number exceeds a third threshold. Responsive to a determination that the first threshold, second threshold, or third threshold are exceeded, the processor may set a flag indicating that the second document is a duplicate.
Claims
1-15. (canceled)
16. A system comprising: a memory storing instructions; and at least one processor configured to execute the stored instructions to: retrieve, from a first source, a record of accredited universities; retrieve, from a second source, a record of suspect universities; retrieve, from a third source, a resume, wherein the resume recites one or more universities; process the resume, wherein processing includes cleaning, tokenizing, and vectorizing the resume; determine whether the one or more universities recited on the resume matches one or more universities on the record of accredited universities; determine whether the one or more universities recited on the resume matches one or more universities on the record of suspect universities; and responsive to a determination that the one or more universities recited on the resume does not match one or more universities on the record of accredited universities, or that the one or more universities recited on the resume matches one or more universities on the record of suspect universities: set a flag indicating that the resume contains one or more misrepresentations.
17. The system of claim 16, wherein the at least one processor is further configured to: iterate the processing, determining, and flag setting steps for each of a plurality of resumes retrieved from the third source, until the third source no longer contains any resumes to process.
18-24. (canceled)
25. The system of claim 16, wherein the at least one processor is further configured to periodically update the first source containing the record of accredited universities.
26. The system of claim 16, wherein the at least one processor is further configured to periodically update the second source containing the record of suspect universities.
27. The system of claim 16, wherein determining whether the one or more universities recited on the resume matches one or more universities on the record of accredited universities includes extracting and analyzing text data associated with the resume using at least one of: an artificial neural network (ANN) algorithm; a k-nearest neighbors (KNN) algorithm; optical character recognition; or natural language processing.
28. The system of claim 27, wherein the at least one processor is further configured to stop analyzing the resume and move on to a subsequent resume when an initial determination indicates that the one or more universities recited on the resume matches one or more universities on the record of accredited universities.
29. The system of claim 16, wherein determining whether the one or more universities recited on the resume matches one or more universities on the record of suspect universities includes extracting and analyzing text data associated with the resume using at least one of: an ANN algorithm; a KNN algorithm; optical character recognition; or natural language processing.
30. The system of claim 29, wherein the at least one processor is further configured to preemptively set a flag when an initial determination indicates that the one or more universities recited on the resume matches one or more universities on the record of suspect universities.
31. The system of claim 16, wherein the at least one processor is further configured to: send the set flag for display on a graphical user interface of a user device.
32. The system of claim 16, wherein the at least one processor is further configured to: store a number of set flags from a first predetermined time period in memory; and predict, based on the number of set flags from the first predetermined time period, a number of potential flags for a second predetermined time period.
33. The system of claim 16, wherein cleaning further includes at least one of: removing malicious scripts; removing metadata; or removing malware from the resume.
34. The system of claim 16, wherein tokenizing the resume further includes substituting a sensitive data element with a non-sensitive data element using at least one of: word tokenization, character tokenization, or subword tokenization.
35. The system of claim 34, wherein the sensitive data element includes personal identifying information.
36. The system of claim 16, wherein the at least one processor is further configured to vectorize the resume using at least one of: a bag-of-words model; a term frequency-inverse document frequency model; a paragraph vector model; or one-hot encoding.
37. The system of claim 16, wherein at least one of the first source, second source, or third source is a cloud-based server.
38. The system of claim 16, wherein the third source is configured to continually store newly submitted resumes.
39. A method comprising: retrieving, from a first source, a record of accredited universities; retrieving, from a second source, a record of suspect universities; retrieving, from a third source, a resume, wherein the resume recites one or more universities; processing the resume, wherein processing includes cleaning, tokenizing, and vectorizing the resume; determining whether the one or more universities recited on the resume matches one or more universities on the record of accredited universities; determining whether the one or more universities recited on the resume matches one or more universities on the record of suspect universities; and responsive to a determination that the one or more universities recited on the resume does not match one or more universities on the record of accredited universities, or that the one or more universities recited on the resume matches one or more universities on the record of suspect universities: setting a flag indicating that the resume contains one or more misrepresentations.
40. The method of claim 39, further comprising: iterating the processing, determining, and flag setting for each of a plurality of resumes retrieved from the third source, until the third source no longer contains any resumes to process.
41. A non-transitory computer readable medium having stored instructions, which when executed, cause at least one processor to perform operations comprising: retrieving, from a first source, a record of accredited universities; retrieving, from a second source, a record of suspect universities; retrieving, from a third source, a resume, wherein the resume recites one or more universities; processing the resume, wherein processing includes cleaning, tokenizing, and vectorizing the resume; determining whether the one or more universities recited on the resume matches one or more universities on the record of accredited universities; determining whether the one or more universities recited on the resume matches one or more universities on the record of suspect universities; and responsive to a determination that the one or more universities recited on the resume does not match one or more universities on the record of accredited universities, or that the one or more universities recited on the resume matches one or more universities on the record of suspect universities: setting a flag indicating that the resume contains one or more misrepresentations.
42. The non-transitory computer readable medium of claim 41, wherein the operations further comprise: iterating the processing, determining, and flag setting for each of a plurality of resumes retrieved from the third source, until the third source no longer contains any resumes to process.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate disclosed embodiments and, together with the description, serve to explain the disclosed embodiments.
DETAILED DESCRIPTION
[0023] In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the disclosed example embodiments. However, it will be understood by those skilled in the art that the principles of the example embodiments may be practiced without every specific detail. Well-known methods, procedures, and components have not been described in detail so as not to obscure the principles of the example embodiments. Unless explicitly stated, the example methods and processes described herein are not constrained to a particular order or sequence or constrained to a particular system configuration. Additionally, some of the described embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently. Reference will now be made in detail to the disclosed embodiments, examples of which are illustrated in the accompanying drawings.
[0027] Disclosed embodiments may involve systems, methods, and non-transitory computer readable media configured to analyze data retrieved from a plurality of documents. The computer readable medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
[0028] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable medium within the respective computing/processing device.
[0029] Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the C programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
[0030] Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[0031] These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via at least one processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[0032] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0033] Such communications may take place across various types of networks, such as the Internet, a wired Wide Area Network (WAN), a wired Local Area Network (LAN), a wireless WAN (e.g., WiMAX), a wireless LAN (e.g., IEEE 802.11, etc.), a mesh network, a mobile/cellular network, an enterprise or private data network, a storage area network, a virtual private network using a public network, a near-field communications technique (e.g., Bluetooth, infrared, etc.), or various other types of network communications. In some embodiments, the communications take place across two or more of these forms of networks and protocols. It is understood that in some embodiments, one or more aspects of the disclosed systems and methods may also be used in a localized system, with one or more of the components communicating directly with each other.
[0034] In some embodiments, a system is disclosed. In some embodiments, the system comprises a memory storing instructions.
[0035] In some embodiments, the system comprises at least one processor 402 configured to execute instructions. At least one processor 402 may include any physical device or group of devices having circuitry configured to perform one or more logic operations on an input or inputs. For example, at least one processor 402 may include one or more integrated circuits (IC), including application-specific integrated circuit (ASIC), microchips, microcontrollers, microprocessors, all or part of a central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), field-programmable gate array (FPGA), or other circuits suitable for executing instructions or performing logic operations. At least one processor 402 may take the form of, but is not limited to, a microprocessor, embedded processor, or the like, or may be integrated in a system on a chip (SoC). Furthermore, according to some embodiments, processor 402 may include one or more of the family of processors manufactured by Intel, AMD, Qualcomm, Apple, NVIDIA, or the like. At least one processor 402 may also be based on the ARM architecture, a mobile processor, or a graphics processing unit, etc. The disclosed embodiments are not limited to any type of processor configured in the server. Computing device 400, containing at least one processor 402 and at least one memory 404, may be connected to a network 406, such as the Internet, a local area network, a wide area network and/or a wireless network.
[0036] Computing device 400 may comprise a memory 404, a processor 402, and/or other specialized hardware that is configured to execute one or more methods of the disclosed embodiments. Memory 404 may include one or more storage devices configured to store instructions used by at least one processor 402 to perform functions related to a server. The disclosed embodiments are not limited to particular software programs or devices configured to perform dedicated tasks. For example, the memory 404 may store a single program, such as a user-level application, that performs the functions associated with the disclosed embodiments, or may comprise multiple software programs. Additionally, at least one processor 402, in some embodiments, executes one or more programs (or portions thereof) remotely located from one or more servers. Furthermore, the memory 404 may include one or more storage devices configured to store data for use by the programs. The memory 404 may include, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a hard drive, a solid state drive, an optical disk, other permanent, fixed, or volatile memory, a CD-ROM drive, a peripheral storage device (e.g., an external hard drive, a USB drive, etc.), a network drive, a cloud storage device, or any other mechanism capable of storing instructions. In some embodiments, each processor has a similar construction or the processors may be of differing constructions that are electrically connected or disconnected from each other. For example, the processors may be separate circuits or integrated in a single circuit. When more than one processor is used, the processors may be configured to operate independently or collaboratively, and may be co-located or located remotely from each other. The processors may be coupled electrically, magnetically, optically, or by any other way that permits them to interact with each other.
[0037] In some embodiments, memory 404 includes a data repository. The data repository may be a database. The data repository may be coupled to a server. The data repository may be included on a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible or non-transitory computer-readable medium. The data repository may also be part of the server or separate from the server. When the data repository is not part of the server, the server may exchange data with the data repository via a communication link. The data repository may include one or more memory devices that store data and instructions used to perform one or more features of the disclosed embodiments. The data repository may include any suitable data repositories, ranging from small data repositories hosted on a workstation to large data repositories distributed among data centers. The data repository may also include any combination of one or more databases controlled by memory controller devices (e.g., server(s), etc.) or software.
[0038] For example, the data repository may include document management systems, Microsoft SQL databases, SharePoint databases, Oracle databases, Sybase databases, other relational databases, or non-relational databases, such as MongoDB and others. In some embodiments, the server includes one or more input/output devices, communications devices, displays, and/or other interfaces (e.g., server-to-server, database-to-database, or other network connections). The data repository may store account information, audit information, transaction information, asset identifier information, asset type information, user information, user history information, transaction history information, and other data.
[0039] In some embodiments, at least one processor 402 is configured to retrieve, from a first source, a first document. Retrieve may refer to at least one processor 402 performing a look-up and returning a document to perform additional tasks related to the document. A first source may refer to a data repository, remote physical server, cloud-based server, and/or any other storage medium. A first document may refer to a piece of written, printed, or electronic matter that includes certain information. Non-limiting examples of a first document may be a resume, a performance evaluation, and/or an email.
[0040] In some embodiments, the first source comprises a data repository of previously submitted documents. For example, previously submitted resumes may refer to resumes that one or more applicants have submitted over a previous 3-, 5-, or 10-year period that do or do not include resumes submitted within a most recent 24-hour period.
[0041] Computing device 400 may be connected to the first source 408 via network 406. First source 408 may be configured to store information, and may be a remote physical server, data repository, cloud server, and/or other storage medium. In this example, first source 408 contains a memory. In another example, first source 408 may be configured to communicate with a cloud server. First source 408 may be configured to store first documents 410. In this example, processor 402 may be configured to retrieve, from a first source, a first document. The first source may contain resumes that one or more applicants have submitted over a previous 3-, 5-, or 10-year period, that do or do not include resumes submitted within a most recent 24-hour period. In this example, the first document may be an older resume, for example, a resume an applicant submitted two years earlier.
[0042] In some embodiments, at least one processor 402 is configured to retrieve, from a second source, a second document. A second source may refer to a data repository, remote physical server, cloud-based server, and/or any other storage medium. In some embodiments, the second source comprises a data repository of newly submitted documents, including those that may have been submitted within the most recent 24-hour, one-day, or one-week period.
[0043] A second document may refer to a piece of written, printed, or electronic matter that includes certain information. Non-limiting examples of a second document may be a resume, a performance evaluation, and/or an email. In one example, the second source may be a data repository, and the second document may be a newly submitted resume. In some embodiments, a newly submitted resume refers to a resume that has been submitted within the past 24 hours. A newly submitted resume may also refer to a resume that has been submitted within the past seven days, or another period. The second source 412 may be a data repository storing newly submitted resumes.
[0044] Computing device 400 may be connected to the second source 412 via network 406. Second source 412 may be configured to store information, and may be a remote physical server, data repository, cloud server, and/or other storage medium. In this example, second source 412 contains a memory, similar to memory 404. Second source 412 may also communicate with a cloud server via network 406. Second source 412 may be configured to store second documents 414. In this example, processor 402 may be configured to retrieve, from second source 412, a second document 414. The second source may contain documents submitted within the past 24 hours, seven days, or one month. The second document may be a newly submitted resume or recently submitted performance evaluation. A recently submitted performance evaluation may refer to a performance evaluation that a manager submitted within the most recent evaluation cycle.
[0045] In some embodiments, at least one processor 402 is configured to process the first and second documents. Here, processing may refer to performing multiple operations on a document so that its information can be fed into a computer program.
[0046] Consistent with disclosed embodiments, processing comprises cleaning, tokenizing, and vectorizing each of the first and second documents. Cleaning may refer to removing, scrubbing, and/or extracting metadata and/or other hidden content from a document, such as personally identifiable information (PII), the document creation date, document modification date, and file size. Examples of PII may include a person's name, address, social security number, telephone number, email address, passport number, etc. In one example, all extracted metadata, including PII, may be stored in a data repository or database, such as HADOOP. In another example, non-PII metadata may be stored in one data repository, and PII metadata may be stored in another data repository. Hidden content may include hazardous code such as malicious scripts or malware that may be inadvertently associated with each of the first and second documents, which may present a privacy or security risk. Consistent with disclosed embodiments, processor 402 may be configured to use the extracted metadata to determine whether a second document potentially contains one or more misrepresentations. In a non-limiting example, processor 402 may be configured to extract text data from each of the first and second documents using natural language processing, optical character recognition, a KNN algorithm, and/or an ANN algorithm. As described herein, a KNN algorithm may refer to a k-nearest neighbors algorithm, which is a non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. Here, the individual data points may be the characters in one or more documents that are processed. As described herein, an ANN algorithm refers to an Artificial Neural Network. ANNs are based on the principles of biological neural networks, and are made up of artificial neurons that work together to solve a problem. Here, an ANN may be used to analyze text information in a first and/or second document.
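By way of a non-limiting illustration, the cleaning step described above might be sketched as follows. The regular expressions, function names, and placeholder strings below are illustrative assumptions rather than part of any claimed embodiment; a production system would use vetted PII-detection tooling rather than these simple patterns.

```python
import re

# Illustrative patterns for PII-like fields (assumptions, not claimed subject matter).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
# Hidden script content that may present a security risk.
SCRIPT_PATTERN = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)

def clean_document(text):
    """Remove embedded scripts and extract PII-like strings from document text."""
    text = SCRIPT_PATTERN.sub("", text)  # strip hazardous hidden content
    extracted = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            # Retain the scrubbed metadata separately, e.g., for a PII repository.
            extracted[label] = matches
            text = pattern.sub("[%s REMOVED]" % label.upper(), text)
    return text, extracted

cleaned, meta = clean_document(
    "Jane Doe jane@example.com 555-123-4567 <script>alert(1)</script> Resume text"
)
```

The extracted metadata dictionary could then be routed to a separate repository, consistent with the two-repository example described above.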
[0047] Tokenizing may refer to the process of substituting a sensitive data element, such as the applicant's name and address, with a non-sensitive equivalent, referred to as a token, that has no intrinsic or exploitable meaning or value. Tokenizing documents ensures that no PII is inadvertently associated with any document, and presents an advantage over encryption because tokenization does not rely on keys to modify the original data. The tokenized documents may be retained for later use without inadvertently exposing sensitive information. Processor 402 may be configured to tokenize each of the first and second documents to comply with relevant data privacy rules. Tokenization also further reduces the risk of a data breach. In one example, processor 402 may be configured to tokenize information by substituting individual characters or words, i.e., sensitive data elements, with non-sensitive equivalents, the tokens. In a non-limiting example, processor 402 may tokenize each of first and second documents using word tokenization, character tokenization, and/or subword tokenization.
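A minimal sketch of the substitution-based tokenization described in this paragraph is shown below. The vault class and token format are hypothetical; the point illustrated is that the token itself has no intrinsic or exploitable meaning, and the reversible mapping is held in a separate store rather than derived from a key.

```python
import secrets

class TokenVault:
    """Illustrative tokenization: replace a sensitive data element with an
    opaque token and keep the reversible mapping in a separate store."""

    def __init__(self):
        self._forward = {}   # sensitive value -> token
        self._reverse = {}   # token -> sensitive value

    def tokenize(self, value):
        # Reuse the same token for a repeated value so documents stay consistent.
        if value not in self._forward:
            token = "TOK_" + secrets.token_hex(8)  # no intrinsic meaning
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        return self._reverse[token]

vault = TokenVault()
text = "Jane Doe, 123 Main St"
# Word-level tokenization of the sensitive elements only (here, the name).
tokenized = " ".join(
    vault.tokenize(w) if w in {"Jane", "Doe,"} else w for w in text.split()
)
```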
[0048] Vectorizing may refer to the process of representing the unique characteristics of a document, such as document text, numerically, such that processor 402 may handle the unstructured text data. In one example, processor 402 may be configured to implement one or more techniques for vectorizing text, including but not limited to using a bag-of-words (BoW) model, a term frequency-inverse document frequency (TF-IDF) model, a paragraph vector model, and/or using one-hot encoding. In another example, processor 402 may be configured to retrieve compiled document data from a database such as HADOOP or ELASTICSEARCH, and convert the data to JSON in order to more easily handle the previously unstructured, complex document data. Processing may make it easier for the processor 402 to determine whether the first and second documents 410, 414 are duplicates or contain duplicate information, consistent with disclosed embodiments.
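By way of a non-limiting illustration, the BoW and TF-IDF models named above can be sketched in a few lines; the helper names and the whitespace tokenization are simplifying assumptions for the sketch.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Build a shared vocabulary and represent each document as a count vector."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for word, count in Counter(doc.split()).items():
            vec[index[word]] = count
        vectors.append(vec)
    return vocab, vectors

def tf_idf(docs):
    """Weight each term by its frequency within the document and its rarity
    across the corpus; terms appearing in every document get weight zero."""
    vocab, counts = bag_of_words(docs)
    n = len(docs)
    df = [sum(1 for vec in counts if vec[i] > 0) for i in range(len(vocab))]
    weighted = []
    for vec in counts:
        total = sum(vec) or 1
        weighted.append(
            [(c / total) * math.log(n / df[i]) for i, c in enumerate(vec)]
        )
    return vocab, weighted
```

The resulting number sequences are what the cosine similarity comparison described later in this disclosure operates on.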
[0049] In one example, at least one processor 402 may clean first document 410. In this example, the first document may be a prior applicant's resume, submitted within the past five years. Processor 402 may remove personally identifying metadata from the resume, such as the resume's author name and address, as well as hidden data associated with the resume that may present a security risk. At least one processor 402 may, after cleaning the resume, extract the scrubbed metadata for future use. At least one processor 402 may also tokenize the resume. Tokenizing the resume may include converting a sequence of sensitive text, such as the applicant's name and/or address, into a non-sensitive equivalent, such as a string of numbers. Processor 402 may tokenize the resume using word tokenization, character tokenization, and/or subword tokenization. At least one processor 402 may be configured to vectorize the resume. Vectorizing may include converting the text of the resume into a set of numbers to be interpreted by at least one processor 402. Processor 402 may vectorize the resume using one or more techniques described herein, such as using a BoW model, a TF-IDF model, a paragraph vector, and/or one-hot encoding.
[0050] At least one processor 402 may be configured to load one or more batches of processed first documents 410, i.e., previously submitted documents, into vector repository 416. Vector repository 416 may be a database configured to accommodate a plurality of batches of processed and vectorized first documents 410, wherein each batch may comprise 100, 200, 500, or 1000 documents.
[0051] Computing device 400 may contain a graphical user interface (GUI) 418. In one example, at least one processor 402 may set 4 flags out of a batch of 100 newly submitted resumes. At least one processor 402 may be configured to provide the number of set flags for display on GUI 418. Processor 402 may flag the newly submitted resumes by implementing a Python script to present the newly submitted resumes in a tabular format on GUI 418.
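A minimal stand-in for the tabular presentation described above might look like the following; the record contents and column names are hypothetical.

```python
def to_table(rows, headers):
    """Render flagged-resume records as a plain-text table for display."""
    # Column width = widest cell in that column, headers included.
    widths = [
        max(len(str(row[i])) for row in [headers] + rows)
        for i in range(len(headers))
    ]

    def fmt(row):
        return " | ".join(str(v).ljust(w) for v, w in zip(row, widths))

    return "\n".join([fmt(headers)] + [fmt(row) for row in rows])

table = to_table(
    [["resume_017", "suspect university"],
     ["resume_042", "duplicate"]],
    ["document", "flag reason"],
)
```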
[0052] By way of example,
[0053] At least one processor may process the first document at step 506, and may process the second document at step 508. In this example, the first document is a previously submitted resume and the second document is a newly submitted resume. In one example, each of the first and second documents may be loaded into a database such as HADOOP or ELASTICSEARCH for further processing.
[0054] At step 510, at least one processor may clean a first document, i.e., remove or scrub personally identifying metadata or hazardous hidden data from the first document. At step 512, the processor may tokenize the first document. At step 514, the processor may vectorize the first document, i.e., the processor may convert the remaining text of the first document into a set of numbers to be more easily interpreted by the processor. At step 516, the processor may clean a second document. At step 518, the processor may tokenize the second document. At step 520, the processor may vectorize the second document. Consistent with disclosed embodiments, the processor may be configured to simultaneously perform steps 510 through 514 and steps 516 through 520.
[0055] At step 522, at least one processor may load a batch of processed first documents (such as, for example, first documents 410) from a first source (such as, for example, first source 408) to a vector repository (such as, for example, vector repository 416). The first documents may be a plurality of resumes. A batch of processed resumes may include 100, 200, 500, or 1000 previously submitted resumes, wherein the size of the batch may be configured by the employer.
[0056] At step 524, at least one processor may determine the similarity between the first document and the second document, as described elsewhere in this disclosure. At step 526, the processor may return copies for manual inspection, as described elsewhere in this disclosure.
[0057] By way of example,
[0058] Cosine similarity may refer to a mathematical formula for measuring the similarity between two sequences of numbers, wherein the numbers are in a particular order. The general formula for cosine similarity is shown below as Formula 1.
[0059] Formula 1: General formula for cosine similarity, where A and B are the two sequences of numbers:
cos(θ) = (A · B)/(∥A∥ ∥B∥) = (Σᵢ AᵢBᵢ)/(√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))
[0060] For sequences of non-negative numbers, such as the word-count vectors used here, the cosine similarity between two sequences of numbers may be between 0 and 1. If the two sequences of numbers are exactly the same, then the cosine similarity is 1. Conversely, if the two sequences of numbers share no similarities, the cosine similarity is 0. If the two sequences of numbers have some numbers in common, the cosine similarity is between 0 and 1.
[0061] For a cosine similarity calculation of specific words, words that are identical between the first and second documents are assigned a cosine similarity of 1, whereas words that differ between the first and second documents are assigned a cosine similarity of 0. In another example, sentences that are identical between the first and second documents are assigned a cosine similarity of 1, whereas sentences that differ between the first and second documents are assigned a cosine similarity of 0. A sentence that contains some words in common between the first and second documents is assigned a cosine similarity between 0 and 1.
[0062] For example, a sentence in the first document may read, I like apples and oranges. A sentence in the second document may read, I like strawberries and oranges. In this example, the two sentences are nearly identical, except that the first sentence contains the word apples, whereas the second sentence contains the word strawberries. Aggregating all six unique words between the two sentences, the cosine similarity would be 0.8, as shown in the table below.
TABLE 1

                     I    like    apples    and    oranges    strawberries
Text 1               1      1        1        1        1            0
Text 2               1      1        0        1        1            1

Cosine Similarity = 0.8
[0063] In some embodiments, and referring to
[0064] In the example in Table 1, each sentence contains five words, four of which are identical between the first sentence and the second sentence. Therefore, as explained in Table 1, the first and second sequences of numbers have a cosine similarity of 0.8.
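The Table 1 calculation can be reproduced directly from the two count vectors; a minimal sketch (the helper function is illustrative):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two count vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Count vectors over the six unique words from Table 1:
# [I, like, apples, and, oranges, strawberries]
text1 = [1, 1, 1, 1, 1, 0]  # "I like apples and oranges"
text2 = [1, 1, 0, 1, 1, 1]  # "I like strawberries and oranges"
similarity = cosine_similarity(text1, text2)  # 4 / (sqrt(5) * sqrt(5)) = 0.8
```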
[0065] In some embodiments, the first threshold is between 0.5 and 1. In some embodiments, the first threshold is 0.85. At least one processor (such as, for example, processor 402 described in reference to
[0066] In some embodiments, and referring to
[0067] In some embodiments, and referring to
[0068] At least one processor (such as, for example, processor 402) may flag the second document if the number of words in common between the first and second documents is exactly the second threshold. For example, a processor may set the second threshold at 100 words in common. In this example, the first and second document may contain 100 words in common. Here, the processor may flag the second document for additional review. In some embodiments, the second threshold is between 50 and 150. In some embodiments, the second threshold is 100. In one example, the processor may be configured to set and/or update the second threshold based on user input. For example, an employer may initially set the second threshold at 50, but find that too many documents are flagged for review. The employer may then raise the second threshold to 100 words in common to reduce the number of flagged documents and the time necessary to manually review them, while still capturing documents that potentially contain one or more misrepresentations.
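The words-in-common comparison and the meets-or-exceeds flag can be sketched as follows; the short sample texts and the threshold of 7 are illustrative only (the disclosure contemplates thresholds of roughly 50 to 150 for full resumes):

```python
def words_in_common(first_tokens, second_tokens):
    # Count distinct words that appear in both documents.
    return len(set(first_tokens) & set(second_tokens))

first = "managed a team of five engineers building data pipelines".split()
second = "managed a team of five analysts building dashboards".split()

second_threshold = 7  # illustrative; configurable by the employer
common = words_in_common(first, second)  # 6 shared distinct words
flag = common >= second_threshold  # a count exactly at the threshold also sets the flag
```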
[0069] In some embodiments, and referring to
[0070] In some embodiments, and referring to
[0071] The lower the third threshold, the more documents that the processor may flag as a duplicate, and vice versa. The third threshold may be a range of numbers or the third threshold may be a single number. In some embodiments, the third threshold is between 5 and 10. Specifically, the third threshold may be 6. In this example, the processor may determine that the first and second documents share 4 sentences in common, and that the number of sentences in common does not exceed the third threshold. In another example, the processor may determine that the first and second documents share 7 sentences in common. Here, the processor may determine that the number of sentences in common exceeds the third threshold. The processor may also set a flag if the number of sentences in common is exactly the third threshold. In this example, the processor may determine that the first and second documents share 6 sentences in common. Here, the processor may determine that the number of sentences in common exactly meets the third threshold.
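A minimal sketch of the sentences-in-common count against the third threshold, assuming sentences are delimited by periods and must match exactly (the sample documents are hypothetical):

```python
def sentences_in_common(first_doc, second_doc):
    # Split on periods and count sentences that match exactly after trimming.
    first = {s.strip() for s in first_doc.split(".") if s.strip()}
    second = {s.strip() for s in second_doc.split(".") if s.strip()}
    return len(first & second)

doc_a = "Led a migration. Built dashboards. Mentored interns."
doc_b = "Led a migration. Built dashboards. Wrote test suites."

third_threshold = 2  # illustrative; the disclosure suggests values between 5 and 10
shared = sentences_in_common(doc_a, doc_b)  # 2 sentences match exactly
flag = shared >= third_threshold  # a count exactly at the threshold also sets the flag
```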
[0072] The first and second documents (such as, for example, first document 410 and second document 414) may each contain a plurality of sentences, each comprising a plurality of words. For example, a first resume may list an applicant's work experience in a plurality of bullets. A second resume may similarly list an applicant's work experience in a plurality of bullets. A processor (such as, for example, processor 402) may be configured to analyze and vectorize each bullet point to determine the cosine similarity between the two documents with respect to each applicant's work experience. The processor may analyze each document using, for example, natural language processing, optical character recognition, an ANN algorithm, and/or a KNN algorithm. In another example, each of the first and second documents is a resume, containing each applicant's name, address information, work experience, education, and/or other information. Consistent with disclosed embodiments, the processor may be configured to analyze all text in each document and determine a cosine similarity between the first and second documents. In one example, the processor may determine that a first and second document have a cosine similarity of 0.4. A cosine similarity of 0.4 may indicate that the two documents share some elements, but not enough elements to indicate that the two may be duplicates. In another embodiment, the processor may determine that the first and second documents have a cosine similarity of 0.9. A cosine similarity of 0.9 may indicate that the first and second documents are nearly identical, and the processor may flag the second document for manual inspection by an employer, manager, or both.
[0073] Consistent with disclosed embodiments, at least one processor (such as, for example, processor 402 as described in reference to
[0074] In one example, each of the first and second documents (such as, for example, first document 410 and second document 414) are resumes. Here, at least one processor (such as, for example, processor 402) may be configured to detect whether the first and second resumes are from the same applicant applying to different jobs. The processor may detect whether applicants are applying to different jobs by identifying and locating the relevant information in the resume or in the metadata associated with the resume. In one example, the processor may be configured to extract this data from the resume. The metadata associated with the resume may be the candidate's name, candidate ID, and/or the job ID. A candidate ID and a job ID are identifiers that refer to a potential applicant and a job opening without using personally identifying information. The candidate ID and job ID may be generated by the employer. The processor may determine that the first and second resumes belong to the same person applying to different jobs based on the candidate ID and job ID associated with each resume.
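The candidate-ID/job-ID check can be sketched as a simple metadata comparison; the field names and ID formats here are illustrative assumptions:

```python
def same_applicant_different_jobs(meta_a, meta_b):
    # A matching candidate ID with differing job IDs suggests one applicant
    # submitting the same resume to multiple openings.
    return (meta_a["candidate_id"] == meta_b["candidate_id"]
            and meta_a["job_id"] != meta_b["job_id"])

resume_a = {"candidate_id": "C-1001", "job_id": "J-2001"}
resume_b = {"candidate_id": "C-1001", "job_id": "J-2002"}
```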
[0075] In another example, at least one processor (such as, for example, processor 402 as described in reference to
[0076] In some embodiments, and referring to
[0077] In some embodiments, at least one processor (such as, for example, processor 402 as described in reference to
[0078] Table 2 contains exemplary results from a processor (such as, for example, processor 402) iterating the above processing, determining, and flag setting steps.
TABLE 2

Cosine Similarity    No Misrepresentations    Contains Misrepresentations
0.80                            9
0.81                            9
0.82                            7
0.83                            7
0.84                            2
0.85                          413                         21
0.86                          663                         17
0.87                          511                         37
0.88                          371                         48
0.89                          298                         53
0.90                          164                         63
0.91                           96                         24
0.92                           32                         29
0.93                           26                         25
0.94                           10                         28
0.95                            7                         16
0.96                                                       2
0.97                                                      11
0.98                                                       9
0.99                                                      63
1.00                                                       7
Grand Total                  2626                        453
[0079] Table 2 breaks down the documents that potentially contain one or more misrepresentations based on the determined cosine similarity. The above example shows that as the cosine similarity increases, the chance that a second document contains misrepresentations and/or is a duplicate increases. In this example, the processor flagged each document for further review when the cosine similarity was above 0.8. In some embodiments, the number of set flags may be displayed on a GUI, such as, for example, GUI 418 as described in reference to
[0080] In one example, at least one processor (such as, for example, processor 402) may be configured to filter out the number of set flags based on data associated with the newly submitted resumes. In this example, the processor may only flag newly submitted resumes with a certain candidate ID, job ID, and/or timestamp.
[0081] By way of example,
[0082] In some embodiments, and referring to
[0083] In some embodiments, and referring to
[0084] In some embodiments, and referring to
[0085] In some embodiments, and referring to
[0086] In some embodiments, and referring to
[0087] In some embodiments, and referring to
[0088] In some embodiments, and referring to
[0089] In some embodiments, at least one processor is configured to iterate the processing, determining, and flag setting steps for each of a plurality of second performance evaluations retrieved from the second source, until the second source no longer contains any second performance evaluations to process. For example, the processor may detect that the second source (such as, for example, second source 412) contains 10 newly submitted performance evaluations, that is, 10 performance evaluations that one or more managers submitted over the most recent review period, wherein the review period may be 3 months, 6 months, or a year. Here, the processor may be configured to perform the above processing, determining, and flag setting steps for each of the 10 recently submitted performance evaluations. In this example, the processor may flag two recently submitted performance evaluations for further inspection.
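The iterate-until-empty behavior described above can be sketched as a simple drain loop; the in-memory list standing in for the second source, and the `suspect` field used by the sample flag function, are illustrative assumptions:

```python
def process_batch(second_source, flag_fn):
    # Iterate until the source no longer contains documents to process,
    # collecting those whose flag is set for further inspection.
    flagged = []
    while second_source:
        evaluation = second_source.pop(0)
        if flag_fn(evaluation):
            flagged.append(evaluation)
    return flagged

# 10 newly submitted performance evaluations, 2 of which will be flagged.
submitted = [{"id": i, "suspect": i in (3, 7)} for i in range(10)]
flagged = process_batch(submitted, lambda e: e["suspect"])
```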
[0090] In one example, at least one processor may be configured to provide for display, on a graphical user interface, a number of set flags. Referring to
[0091] By way of example,
[0092] In some embodiments, and referring to
[0093] In some embodiments, and referring to
[0094] In some embodiments, and referring to
[0095] In some embodiments, and referring to
[0096] In some embodiments, and referring to
[0097] In one example, at least one processor may determine that an applicant recites the University of Maryland on their resume. The at least one processor may determine that the University of Maryland is on the record of accredited universities and may therefore not flag the applicant's resume for further review. In another example, an applicant may recite Shaftesbury University on their resume. The processor, after performing the look up, may determine that Shaftesbury University is not on the record of accredited universities, and may flag that applicant's resume for further processing.
[0098] The processor may set a flag as described elsewhere in this disclosure.
[0099] In some embodiments, and referring to
[0100] In some embodiments, and referring to
[0101] In one example, an applicant may recite the University of Maryland on their resume. In this example, the processor may determine, using a look up table, comparing the analyzed text data from the resume to the record of accredited universities, or any other method described in this disclosure, that the University of Maryland matches one or more universities on the record of accredited universities, and will not set a flag. In another example, the applicant may recite Suffield University on their resume. Here, the processor may determine that Suffield University matches one or more universities on the record of suspect universities and may set a flag indicating that the resume potentially contains one or more misrepresentations.
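The two-record lookup in this example can be sketched with set membership tests; the records shown are tiny illustrative stand-ins for the accredited and suspect university records:

```python
def misrepresentation_flag(universities_on_resume, accredited, suspect):
    # Flag when any recited university is either missing from the accredited
    # record or present on the suspect record.
    for name in universities_on_resume:
        if name not in accredited or name in suspect:
            return True
    return False

accredited = {"University of Maryland", "Ohio State University"}
suspect = {"Suffield University", "Shaftesbury University"}

clean_resume = misrepresentation_flag(["University of Maryland"], accredited, suspect)  # not flagged
flagged = misrepresentation_flag(["Suffield University"], accredited, suspect)          # flagged
```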
[0102] In some embodiments, at least one processor (such as, for example, processor 402 as described in reference to
[0103] For example, at least one processor may detect that the third source contains 50 resumes. Here, the processor may be configured to perform the above processing, determining, and flag setting steps for each of the 50 resumes. In this example, the processor may flag 4 resumes for further manual inspection, based on one or more universities the applicant recites on their resume.
[0104] In one example, at least one processor may be configured to provide for display, on a graphical user interface, a number of set flags. Referring to
[0105] By way of example,
[0106] In some embodiments, and referring to
[0107] In some embodiments, and referring to
[0108] In some embodiments, and referring to
[0109] In some embodiments, and referring to
[0110] In some embodiments, and referring to
[0111] In some embodiments, and referring to
[0112] In one example, an employer may set a threshold at two days. In this example, at least one processor may be configured to flag any resume that is submitted within two days of the first resume. The processor may determine that an applicant submitted a first resume to a job portal on May 1, 2023, at 1:00 PM. The processor may determine that an applicant submitted a second resume to the job portal on May 5, 2023, at 2:00 PM. In this example, the gap between the two submissions exceeds the two-day threshold, so the second resume would not be flagged on timing alone. However, the processor may still flag the second resume for further review if, for example, the IP addresses between the first resume and the second resume are the same.
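The combined timestamp-gap and IP-address check in this example can be sketched as follows (the dictionary layout and IP values are illustrative assumptions):

```python
from datetime import datetime, timedelta

def should_flag(first, second, gap_threshold=timedelta(days=2)):
    # Flag when the submissions are close together in time, or when both
    # came from the same IP address regardless of the gap.
    gap = abs(second["timestamp"] - first["timestamp"])
    return gap <= gap_threshold or first["ip"] == second["ip"]

first = {"timestamp": datetime(2023, 5, 1, 13, 0), "ip": "203.0.113.7"}
second = {"timestamp": datetime(2023, 5, 5, 14, 0), "ip": "203.0.113.7"}

# The 4-day gap exceeds the two-day threshold, but the matching IP
# addresses still cause the second resume to be flagged.
flag = should_flag(first, second)
```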
[0113] In some embodiments, and referring to
[0114] In some embodiments, at least one processor is configured to iterate the processing, extracting, determining, and setting steps for each of a plurality of second resumes retrieved from the second source, until the second source no longer contains any second resumes to process. For example, at least one processor may detect that the second source contains 50 newly submitted resumes, that is, 50 resumes that applicants submitted over a most recent 24-hour period. Here, the processor may be configured to perform the above processing, extracting, determining, and flag setting steps for each of the 50 newly submitted resumes. In this example, the processor may flag 4 newly submitted resumes for further review, responsive to a determination that the IP addresses between the first and second resumes are the same or the gap between the first timestamp and the second timestamp is below the threshold.
[0115] By way of example,
[0116] In some embodiments, and referring to
[0117] In some embodiments, and referring to
[0118] In some embodiments, and referring to
[0119] In some embodiments, and referring to
[0120] In some embodiments, and referring to
[0121] In some embodiments, and referring to
[0122] In some embodiments, and referring to
[0123] In some embodiments, and referring to
[0124] In some embodiments, and referring to
[0125] In some embodiments, and referring to
[0126] By way of example,
[0127] In some embodiments, and referring to
[0128] In some embodiments, and referring to
[0129] In some embodiments, and referring to
[0130] In some embodiments, and referring to
[0131] In some embodiments, and referring to
[0132] In some embodiments, and referring to
[0133] In some embodiments, and referring to
[0134] In some embodiments, and referring to
[0135] In some embodiments, and referring to
[0136] In some embodiments, and referring to
[0137] In some embodiments, the processor is further configured to iterate the processing, extracting, determining, and flag setting steps for each of a plurality of second resumes retrieved from the second source, until the second source no longer contains any second resumes to process. For example, at least one processor may detect that the second source contains 50 newly submitted resumes, that is, 50 resumes that applicants submitted over a most recent 24-hour period. Here, the processor may be configured to perform the above processing, determining, and flag setting steps for each of the 50 newly submitted resumes. In this example, the processor may flag 4 newly submitted resumes for further review, based on any one of first, second, third, or fourth thresholds, as well as a determination that the IP address cosine similarity is 1. Consistent with disclosed embodiments, the processor may be configured to provide the number of set flags for display on a GUI, such as, for example, GUI 418.
[0138] The descriptions of the various embodiments of the present invention have been presented for purposes of illustration but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
[0139] It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
[0140] Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.