Method and system for searching words in documents written in a source language as transcript of words in an origin language

10042843 · 2018-08-07

Abstract

The invention relates to a method used by computers for searching words in documents written in a source language, which are not in the vocabulary of said source language, but are transcript of meaningful words in an origin language. The method is comprised of a preparation process and a search process. During the preparation process a database of unrecognized words in the source language is maintained, which contains, among other data, normalized phonetic conversion of the unrecognized word, as well as a corpus of all words of the documents in the search domain and indexes for efficient search. During search, a phonetic conversion and normalization is done for the search word, and the distance to similar phonetics words in the corpus is calculated. The found words in the corpus are arranged in ascending order, and the relevant documents are displayed.

Claims

1. Computer implemented method for searching words in documents written in a source language, words which are not meaningful in said source language, but are transcript of meaningful words in an origin language, the method is comprised of two processes: a) preparation process executed for each new document, the preparation process is comprised of the following steps: i) reading the document; ii) extracting unrecognized words in the source language; iii) updating search indexes in the corpus for all document words; iv) for each new unrecognized word in the source language: 1) removing prefixes and suffixes; 2) performing phonetic conversion; 3) checking frequency of unrecognized word spelling in System Hebraized Medical Lexicon (SHML); 4) defining the most frequent spelling of the unrecognized word as the central term and connect it to other allowable and close spellings of that term; 5) updating System Hebraized Medical Lexicon (SHML); b) search process which is comprised of the following steps: i) reading the search request and perform auto-complete for terms from the System Hebraized Medical Lexicon (SHML); ii) generating phonetic conversion for all words in query; iii) for each word: 1) searching for similar phonetics in the corpus and find central terms; 2) calculating the distance to the found similar words and order them in ascending order; and 3) displaying relevant documents according to the distance.

2. The method according to claim 1, where the source language is Hebrew, the origin language is either English or Latin.

3. The method according to claim 1, where the unknown words in the source language are medical terms in the origin language.

4. The method according to claim 1, where the words in the input query goes through an autocomplete procedure from the System Hebraized Medical Lexicon (SHML).

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a top level flow chart of the search preparation process.

(2) FIG. 2 presents a flow chart of the processing of a Hebrew spelled word within the preparation process.

(3) FIG. 3 shows a flow chart of the processing of a Hebraized medical term within the preparation process.

(4) FIGS. 3A and 3B present examples of Hebraized medical terms which include prefixes and suffixes, respectively.

(5) FIG. 4 shows a flowchart of the search process.

(6) FIG. 4A is an example of different spelled words which have the same meaning.

DETAILED DESCRIPTION

(7) The invention will be described more fully hereinafter, with reference to the accompanying drawings, in which a preferred embodiment of the invention is shown. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiment set forth herein; rather this embodiment is provided so that the disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

(8) In the following description Hebrew is used as a source language and English is the origin language. This is done for the purpose of the explanation and it does not limit the scope of the invention.

(9) The top level flow chart of the preparation process is shown in FIG. 1. Each new medical document in the search domain, written in the source language 160, goes through the preparation process. After the new document is read in step 100, it undergoes tokenization in step 102: the words in the document are separated and put into a single list 170. Each word in the list is then processed on its own, as described below.
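
The tokenization of step 102 can be sketched as follows. This is a minimal illustration, assuming that a word is any run of letters or digits; the function name `tokenize` and the regular expression are assumptions, since the source does not specify how words are separated.

```python
import re


def tokenize(text):
    """Split a document into a flat list of words (the list 170 of FIG. 1).

    In Python 3, \\w matches Unicode letters and digits, so Hebrew and
    English words are both kept; punctuation and whitespace are dropped.
    """
    return re.findall(r"\w+", text)


words = tokenize("חולה קיבל אספירין 100 mg.")
```

A real tokenizer would probably need extra rules, for example for abbreviations written with quotation marks, which this sketch splits apart.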

(10) Before describing the processing that each word goes through, it is important to explain the corpus of the system. The corpus is a database that stores information on each document and each word ever entered into the system; these documents constitute the search domain. For each word, the corpus keeps a list of all documents, and the locations within each document, where that word occurs; these lists are referred to as the search indexes. The corpus also contains a phonetic representation of each word as well as statistical information on the word.
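
One possible in-memory layout for such a corpus is sketched below. The class and field names (`Corpus`, `WordEntry`, `locations`, `count`) are assumptions made for illustration; the source describes the stored information but not its structure, and the phonetic field is left empty here because the conversion is described separately.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class WordEntry:
    """What the corpus keeps for one word (field names are assumptions)."""
    phonetic: str = ""   # phonetic representation, filled in by a converter
    count: int = 0       # statistical data: total number of occurrences
    # search index: document id -> token positions of the word in that document
    locations: dict = field(default_factory=lambda: defaultdict(list))


class Corpus:
    def __init__(self):
        self.words = {}       # word -> WordEntry
        self.documents = {}   # document id -> original text

    def add_document(self, doc_id, text):
        """Register a document and update the entry of every word in it."""
        self.documents[doc_id] = text
        for position, word in enumerate(text.split()):
            entry = self.words.setdefault(word, WordEntry())
            entry.count += 1
            entry.locations[doc_id].append(position)


corpus = Corpus()
corpus.add_document("doc1", "aspirin daily aspirin")
corpus.add_document("doc2", "aspirin")
```

Keeping positions per document lets a search return both the matching documents and the places in each document where the word appears.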

(11) In step 104 the next word to be processed is fetched from the list 170. In step 106 the corpus is searched to find out whether the word is already known. If the word is new, as checked in step 108, then the new word and its phonetic representation are added to the corpus 180 in step 110. The statistical data and the search indexes for each word are added to the corpus 180 in steps 112 and 114, respectively. Following this, the script of the word is checked in step 116. Different processes handle a word written in the source language (Hebrew) and a word written in the origin language (English). The process 120 that handles English words is beyond the scope of this invention and is not a part of it. The process 118 that handles words written in Hebrew script is further described in FIG. 2.
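
The script check of step 116 could be approximated by looking at the Unicode range of the letters in a word. The sketch below is an assumption about how such a check might be done, not the method of the source; `detect_script` and its return labels are hypothetical names.

```python
def detect_script(word):
    """Classify a word by the script of its letters (a heuristic sketch)."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters:
        return "other"    # digits, punctuation, empty strings
    # U+0590..U+05FF is the Unicode Hebrew block.
    if all("\u0590" <= ch <= "\u05ff" for ch in letters):
        return "hebrew"   # would be routed to process 118
    if all(ch.isascii() for ch in letters):
        return "latin"    # would be routed to process 120
    return "mixed"
```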

(12) FIG. 2 describes the process that each word written in Hebrew script undergoes. In step 200 the word is run through dictionaries and a spell checker 210 to find out whether or not it is a valid Hebrew word. If the word is a valid Hebrew word, as tested in step 202, then it is further processed in step 206 as a valid Hebrew word. This process is outside the scope of the present invention. It is worth noting that Hebrew spelling rules are somewhat relaxed; a word can have several different, valid spellings (changes in specific letters). This is true for all Hebrew words, and especially for Hebraized words (English words written in Hebrew script). If the word is found to be a non-valid Hebrew word, it is processed in step 204 as an unknown Hebrew word. This process is further detailed in FIG. 3.

(13) For the benefit of the explanation, in the described example, the System Transliterated Lexicon (STL) is called System Hebraized Medical Lexicon (SHML).

(14) An unknown Hebrew word may result either from a misspelled valid Hebrew word or from a Hebraized medical term. It is assumed that misspelled Hebrew words do not occur often in the scanned documents, whereas Hebraized medical terms appear frequently. Thus statistical analysis is used to determine the word type, i.e. a typo or a Hebraized term. In step 300 of FIG. 3 the frequency of the incoming unknown word in the corpus is tested. If the incoming unknown word is not a typo, i.e. its frequency is above a predefined threshold as tested in step 302, it goes through step 304, where the SHML is searched to find whether or not the incoming unknown word or its normalized form is already in the SHML. If the incoming unknown word is in the SHML, then the processing terminates. If the incoming unknown word is not in the SHML, as tested in step 306, then step 308 is executed, where the normalized form of the incoming unknown word is checked. If the normalized form of the incoming unknown word is in the SHML, then the processing terminates. However, if the normalized form is not found in the SHML, then step 310 is executed, where a new branch for the normalized input word is added to the SHML.
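
The decision flow of FIG. 3 can be sketched as a single function. Several details here are assumptions made for illustration: the SHML is modeled as a dictionary from a normalized form to the set of spellings recorded for it, the threshold value is arbitrary, and `toy_normalize` is a stand-in for the real normalization, which the source does not spell out.

```python
def process_unknown_word(word, frequency, shml, normalize, threshold=5):
    """Follow the branches of FIG. 3 for one unknown Hebrew word.

    shml maps a normalized form to the set of spellings recorded for it
    (a hypothetical layout). Returns a label naming the branch taken.
    """
    if frequency <= threshold:          # step 302: too rare, treat as a typo
        return "typo"
    # steps 304/306: is the word itself already in the lexicon?
    if word in shml or any(word in spellings for spellings in shml.values()):
        return "known"
    normalized = normalize(word)
    if normalized in shml:              # step 308: normalized form is known
        return "known_normalized"
    shml[normalized] = {word}           # step 310: add a new branch
    return "added"


def toy_normalize(word):
    # Toy stand-in (an assumption): drop the letter yod, one of the letters
    # that may be present or absent in different spellings of the same word.
    return word.replace("י", "")


shml = {}
first = process_unknown_word("אספירין", 12, shml, toy_normalize)
second = process_unknown_word("אספרין", 9, shml, toy_normalize)
```

In this run the first spelling creates a new branch, and the second spelling is recognized through its normalized form, so no second branch is created.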

(15) FIG. 3A and FIG. 3B present examples of combinations of prefixes and suffixes that can accompany a basic medical term. These prefixes and suffixes are removed in step 110 of FIG. 1.
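
Prefix and suffix removal could look like the sketch below. The affix lists are assumptions (common one-letter Hebrew prefixes and two plural endings), not the combinations shown in FIG. 3A and FIG. 3B, and the function only strips an affix when the remainder is an already known term, to avoid cutting a letter that belongs to the word itself.

```python
# Assumed affix lists, for illustration only.
PREFIXES = ("ה", "ו", "ב", "כ", "ל", "מ", "ש")
SUFFIXES = ("ים", "ות")


def strip_affixes(word, known_terms):
    """Return the known base term inside word, or word itself if none is found."""
    candidates = [word]
    # Try removing one prefix letter.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            candidates.append(word[len(prefix):])
    # Try removing one suffix from every candidate collected so far.
    for candidate in list(candidates):
        for suffix in SUFFIXES:
            if candidate.endswith(suffix) and len(candidate) > len(suffix):
                candidates.append(candidate[:-len(suffix)])
    for candidate in candidates:
        if candidate in known_terms:
            return candidate
    return word


known_terms = {"אספירין"}
base = strip_affixes("האספירין", known_terms)
```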

(16) The search process is described in FIG. 4. The user types a search query which contains Hebraized medical terms. For each such term the system produces a list of Hebraized terms which have the same meaning as the one in the query, but are differently spelled. The corpus is searched for all the terms in the said list.

(17) The search text entered by the user goes through an autocomplete process, as shown in step 400 of FIG. 4. The autocomplete options are extracted from the System Hebraized Medical Lexicon (SHML). Upon completion of the query, as detected in step 402, phonetic conversion is done for each of the words in the query in step 404. The system then searches the corpus for similar phonetics in step 406 and calculates, in step 408, the distance between the phonetics of the typed search word and the phonetics found in the corpus. The resulting set of words is ordered by distance from the search word, in ascending order.
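
Steps 404 to 408 can be illustrated with the sketch below. The source names neither the phonetic conversion nor the distance measure, so both are assumptions here: a toy letter table that drops vowel-carrying letters and merges similar-sounding letters, and the Levenshtein edit distance. The names `PHONETIC`, `to_phonetic`, `search` and the `max_distance` cutoff are hypothetical.

```python
# Toy phonetic table (an assumption, not the conversion of the source).
PHONETIC = {
    "א": "", "ה": "", "ו": "", "י": "", "ע": "",
    "ב": "b", "ג": "g", "ד": "d", "ז": "z", "ח": "h",
    "ט": "t", "ת": "t", "כ": "k", "ך": "k", "ק": "k",
    "ל": "l", "מ": "m", "ם": "m", "נ": "n", "ן": "n",
    "ס": "s", "ש": "s", "פ": "p", "ף": "p", "צ": "c", "ץ": "c",
    "ר": "r",
}


def to_phonetic(word):
    """Step 404: convert a word to its (toy) normalized phonetic key."""
    return "".join(PHONETIC.get(ch, ch) for ch in word)


def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def search(query_word, corpus_words, max_distance=2):
    """Steps 406/408: score corpus words and return them closest first."""
    key = to_phonetic(query_word)
    scored = [(levenshtein(key, to_phonetic(w)), w) for w in corpus_words]
    return sorted((d, w) for d, w in scored if d <= max_distance)


results = search("אספירין", ["אקמול", "אספרין", "אספירין"])
```

With this toy table both spellings of the same term map to the key `sprn` and come back at distance 0, while the unrelated word falls outside the cutoff and is dropped.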

(18) This method allows the system to generate accurate search results, taking into account the fact that Hebraized words can be spelled differently, but still have the same meaning. An example of an ordered list for a search word is shown in FIG. 4A.

(19) During search, a phonetic conversion and normalization is done for the search word, and the distance to similar phonetics words in the corpus is calculated. The found words in the corpus are arranged in ascending order, and the relevant documents are displayed.