Patent classifications
G06V30/155
Automated batch de-identification of unstructured healthcare documents
Batch de-identification of unstructured health care documents includes performing optical character recognition (OCR) upon a form-based document so as to produce an initial set of terms. Among the initial set of terms, initial specific terms that contain protected information are identified. Each of the identified initial specific terms is then replaced with a synthetically generated corresponding term. Subsequently, additional OCR is performed upon the form-based document so as to produce a new set of terms, and new specific terms determined to contain protected information are identified among the new set of terms. Finally, the new specific terms are compared to the initial specific terms, and the form-based document is added to a repository of de-identified documents only if none of the new specific terms is equivalent to a corresponding one of the initial specific terms. Otherwise, the form-based document is flagged as being in error.
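The replace-then-verify loop described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the OCR passes are stood in for by plain lists of terms and a caller-supplied `ocr_again` callable, and the `PROTECTED` pattern, `find_protected`, `synthesize`, and `deidentify` are hypothetical names that do not come from the patent.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical detector: SSN-like and date-like tokens count as protected.
PROTECTED = re.compile(r"\d{3}-\d{2}-\d{4}|\d{2}/\d{2}/\d{4}")


def find_protected(terms: List[str]) -> List[str]:
    """Return the terms that the (assumed) detector marks as protected."""
    return [t for t in terms if PROTECTED.fullmatch(t)]


def synthesize(term: str) -> str:
    """Placeholder synthetic generator: replace every digit with '0'."""
    return re.sub(r"\d", "0", term)


def deidentify(
    terms: List[str],
    ocr_again: Callable[[List[str]], List[str]],
) -> Tuple[List[str], bool]:
    """Replace protected terms, re-run OCR, and verify nothing leaked.

    Returns the modified terms and True if the document may be added to
    the repository, or False if it should be flagged as being in error.
    """
    initial = set(find_protected(terms))
    replaced = [synthesize(t) if t in initial else t for t in terms]

    # Second OCR pass over the modified document (simulated by the caller).
    new_terms = ocr_again(replaced)
    new_specific = set(find_protected(new_terms))

    # Accept only if no newly found term equals an original protected term.
    accepted = not (new_specific & initial)
    return replaced, accepted


terms = ["Name", "123-45-6789", "DOB", "01/02/1990"]
document, accepted = deidentify(terms, lambda doc: doc)
print(document, accepted)
```

Passing an `ocr_again` that returns the original terms simulates a failed replacement, in which case `accepted` comes back as `False`.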
Automating text and graphics coverage analysis of a website page
A method, a system, and a non-transitory processor-readable storage medium for a website page density and readability system are provided herein. An example method includes capturing an image of a website page rendered in a web browser. The website page density and readability system determines a text density associated with text content in the image, and then removes the text content from the image. The website page density and readability system determines a graphic density associated with graphic content in the image, and determines a website page density associated with the website page using the text density and graphic density.
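As a rough illustration of the density computation, the sketch below treats the captured image as a grid of integers (0 for background) and assumes a boolean text mask is already available from some text detector. The abstract does not say how the two densities are combined; summing them is an assumption made here purely for the example, and `page_density` is a hypothetical name.

```python
from typing import List, Tuple


def page_density(
    image: List[List[int]],
    text_mask: List[List[bool]],
) -> Tuple[float, float, float]:
    """Return (text_density, graphic_density, page_density).

    image:     rows of pixel values, 0 meaning background.
    text_mask: same shape as image, True where a pixel belongs to text
               (assumed to come from an external text detector).
    """
    total = sum(len(row) for row in image)
    if total == 0:
        return 0.0, 0.0, 0.0

    # Text density: fraction of pixels covered by text.
    text_pixels = sum(1 for row in text_mask for flag in row if flag)
    text_density = text_pixels / total

    # Remove the text content by resetting those pixels to background.
    no_text = [
        [0 if text_mask[y][x] else px for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]

    # Graphic density: fraction of remaining non-background pixels.
    graphic_pixels = sum(1 for row in no_text for px in row if px != 0)
    graphic_density = graphic_pixels / total

    # Assumption: the page density is the sum of the two densities.
    return text_density, graphic_density, text_density + graphic_density


image = [[1, 1, 0, 0],
         [0, 0, 5, 5]]
text_mask = [[True, True, False, False],
             [False, False, False, False]]
print(page_density(image, text_mask))
```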
SCORING METHOD AND SYSTEM
A method to score a round of golf using a golf scoring system. The method includes capturing a photograph of a physical scorecard including handwritten text using a camera of a user device, wherein the physical scorecard includes handwritten characters disposed at least partially within rectilinear boxes. The method also includes accessing the photograph in an application of the user device. The method also includes identifying the rectilinear boxes by using a color contrast between the rectilinear boxes of the physical scorecard and a background color of the physical scorecard. The method also includes removing the rectilinear boxes. The method also includes, after removing the rectilinear boxes, extracting at least the handwritten characters from the physical scorecard. The method also includes, after extracting at least the handwritten characters, calculating a score of the round of golf in the application using the extracted handwritten characters.
Text extraction using optical character recognition
Provided herein are systems and methods for extracting text from a document. Different optical character recognition (OCR) tools are used to extract different versions of the text in the document. Metrics evaluating the quality of the extracted text are compared to identify and select higher quality extracted text. A selected portion of text is compared to a threshold to ensure minimal quality. The selected portion of text is then saved. Error correction can be applied to the selected portion of text based on errors specific to the OCR tools or the document contents.
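The select-by-quality step can be illustrated with a small Python sketch. The outputs of the different OCR tools are represented as plain strings keyed by a tool name, and the quality metric used here (the share of words found in a known vocabulary) is an assumption chosen for simplicity; the abstract does not name a specific metric. `quality` and `select_text` are hypothetical helpers.

```python
from typing import Dict, Optional, Set, Tuple


def quality(text: str, vocabulary: Set[str]) -> float:
    """Assumed metric: fraction of whitespace-separated words in vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in vocabulary) / len(words)


def select_text(
    candidates: Dict[str, str],
    vocabulary: Set[str],
    threshold: float = 0.5,
) -> Optional[Tuple[str, str, float]]:
    """Pick the highest-scoring OCR output, or None if it is below threshold."""
    if not candidates:
        return None
    tool, text = max(
        candidates.items(), key=lambda item: quality(item[1], vocabulary)
    )
    score = quality(text, vocabulary)
    # The threshold guards against saving text of insufficient quality.
    return (tool, text, score) if score >= threshold else None


vocabulary = {"the", "quick", "brown", "fox"}
candidates = {
    "tool_a": "the qu1ck brown f0x",   # simulated output with OCR errors
    "tool_b": "the quick brown fox",   # simulated clean output
}
print(select_text(candidates, vocabulary))
```

Tool-specific error correction (for example, mapping `1` back to `i` for a tool known to confuse them) would be applied to the selected text afterwards.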
DIGITAL STAMP LOCALIZATION AND OVERLAPPING TEXT REMOVAL METHOD AND APPARATUS
In a form recognition system, a deep learning system may be trained to perform stamp localization for stamp removal to facilitate form recognition. In embodiments, a stamp mask identifies locations of stamps or seals on forms, and a line mask identifies pixels of the stamps. Where a stamp or seal overlaps with underlying text on a form, and a color or grayscale of the stamp or seal is sufficiently similar to that of the underlying text, a combination of the stamp mask and the line mask may enable removal of the stamp or seal without degrading the underlying text in the form, and facilitate form recognition.
METHOD, APPARATUS, AND PROGRAM FOR EVALUATING DOCUMENTS, AND DOCUMENT EVALUATION SYSTEM
A document evaluation apparatus includes: a document data acquisition unit that acquires document data including at least text objects; a preliminary processing unit that performs a preliminary process on the document data acquired by the document data acquisition unit; an information volume evaluation unit that evaluates an amount of information in the document data based on the preliminarily processed document data; a character evaluation unit that evaluates the legibility of text based on the preliminarily processed document data; and a color evaluation unit that evaluates the relationships among adjacent colors in the document data based on the preliminarily processed document data.
METHOD AND SYSTEM FOR EFFICIENTLY TRANSMITTING SOME INFORMATION LOCATED IN A SCENE
A method for efficiently transmitting information located in a scene, suitable in particular for identifying the place of the scene, the method including: detecting, using a first mobile terminal, information of interest within a photo of the scene, where the information of interest comprises a logo; converting, using the first mobile terminal, the information of interest into a string of characters, where the string of characters includes at least one character that geometrically represents the logo; and inserting the string of characters into a text message to be sent from the first mobile terminal to a second mobile terminal.
Line removal from an image
In some implementations, a device may process an image to identify one or more first lines of the image that extend in a first dimension. The device may process the image to identify one or more second lines of the image that extend in a second dimension orthogonal to the first dimension. The device may identify portions of the one or more first lines that do not intersect with the one or more second lines. The device may process the image to obtain a version of the image in which the portions of the one or more first lines are removed.
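A minimal sketch of this idea on a binary image follows, assuming the image is a rectangular grid of 0/1 values and that a "line" is any straight run of at least `min_len` ink pixels. Both the run-length criterion and the function names are assumptions for illustration; the abstract does not specify how lines are detected.

```python
from typing import List

Grid = List[List[int]]


def runs_mask(image: Grid, min_len: int) -> List[List[bool]]:
    """Mark pixels that lie in horizontal ink runs of at least min_len."""
    mask = [[False] * len(row) for row in image]
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        mask[y][i] = True
            else:
                x += 1
    return mask


def transpose(grid):
    """Swap rows and columns of a rectangular grid."""
    return [list(col) for col in zip(*grid)]


def remove_lines(image: Grid, min_len: int = 3) -> Grid:
    """Erase horizontal-line pixels that do not intersect a vertical line."""
    horizontal = runs_mask(image, min_len)
    # Vertical lines are horizontal runs of the transposed image.
    vertical = transpose(runs_mask(transpose(image), min_len))
    return [
        [0 if horizontal[y][x] and not vertical[y][x] else px
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


cross = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
for row in remove_lines(cross):
    print(row)
```

In the example, the horizontal stroke is erased except at the pixel where it crosses the vertical stroke, so the vertical stroke stays intact.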
Document optical character recognition
Vehicles and other items often have corresponding documentation, such as registration cards, that includes a significant amount of informative textual information that can be used in identifying the item. Traditional OCR may be unsuccessful when dealing with non-cooperative images. Accordingly, features such as dewarping, text alignment, and line identification and removal may aid in OCR of non-cooperative images. Dewarping involves determining curvature of a document depicted in an image and processing the image to dewarp the image of the document to make it more accurately conform to the ideal of a cooperative image. Text alignment involves determining an actual alignment of depicted text, even when the depicted text is not aligned with depicted visual cues. Line identification and removal involves identifying portions of the image that depict lines and removing those lines prior to OCR processing of the image.
INFORMATION PROCESSING SYSTEM, INFORMATION PROCESSING METHOD, AND RECORDING MEDIUM IN WHICH INFORMATION PROCESSING PROGRAM IS RECORDED
An image processing apparatus includes an extraction processing unit that extracts a target object that is at least one of a text object and a drawing object from document data, a correction processing unit (a substitution processing unit, a deletion processing unit, and a change processing unit) that executes predetermined correction processing on the target object extracted by the extraction processing unit when the target object matches an object to be processed that is registered in advance, and a rendering processing unit that generates image data for character recognition by executing render processing on the document data in which the target object has been corrected by the correction processing unit.