Cross Lingual Search using Multi-Language Ontology for Text Based Communication
20170357642 · 2017-12-14
Assignee
Inventors
- Jeffrey Chapman (Lansdowne, VA, US)
- Shon Myatt (Starkville, MS, US)
- James B. Haynie (New Orleans, LA, US)
Cpc classification
G06F40/58
PHYSICS
G06F40/129
PHYSICS
G06F16/3337
PHYSICS
G06F9/454
PHYSICS
International classification
Abstract
A method for conducting cross-lingual searching that utilizes an ontology reference process to ensure thoroughness. When a query is entered, an ontology database is accessed to identify all representations of the parent entity of interest within specified languages. These representations are used to form a search set that results in more thorough collection from the data sources. Thus, the disclosed method accommodates situations where languages do not follow the same construct (e.g., English compared to Chinese) and where direct translation does not adequately represent the intent of the user's inquiry.
Claims
1. A method of cross lingual searching, comprising the steps of: storing on a non-transient computer readable storage a plurality of equivalent representations to a WORD in a plurality of languages, wherein the equivalent representations include at least one non-direct-translation equivalent representation; receiving a query having the WORD; retrieving from the storage medium the equivalent representations of the WORD and forming a search set; and conducting a search of at least one data source according to the search set.
2. The method of claim 1 further comprising the step of: storing the results of the search.
3. The method of claim 2 further comprising the step of: indexing the results of the search to the WORD.
4. The method of claim 1 wherein the non-direct-translation equivalent representation is one of a derivation, dialect and semantic equivalent term or phrase.
5. The method of claim 1 wherein at least one of the languages is a pictographic language.
6. The method of claim 1 wherein the data source is a network.
7. A method of cross-lingual searching, comprising the steps of: providing non-transient computer-readable storage; for each of a plurality of languages, storing an ontology mapping of a WORD to equivalent representations; receiving a parent entity containing the WORD; retrieving from storage the equivalent representation ontology matches for the WORD from each of the languages; combining the equivalent representation ontology matches from each of the languages to form a search set; searching at least one data source and identifying documents containing at least one of the equivalent representation ontology matches; and storing the identified documents; and indexing the identified documents to the parent entity.
8. The method of claim 7 wherein one of the equivalent representations is one of a derivation, dialect and semantic equivalent term or phrase.
9. The method of claim 7 wherein the parent entity contains a plurality of keywords and the search set includes equivalent representation ontology matches for each of the keywords.
10. A system for cross-lingual searching, comprising: an amount of non-transient computer-readable storage medium; wherein the storage medium has stored thereon an ontology mapping of a search term to equivalent representations for each of a plurality of languages; a processor configured to: receive a parent entity containing the search term; retrieve from the storage medium the equivalent representation ontology matches for the search term from each of the languages; combine the equivalent representation ontology matches from each of the languages to form a search set; search at least one data source and identify documents containing at least one of the equivalent representation ontology matches; and store the identified documents; and index the identified documents to the parent entity.
11. The system of claim 10 wherein one of the equivalent representations is one of a derivation, dialect and semantic equivalent term or phrase.
12. The system of claim 10 wherein the parent entity contains a plurality of keywords and the search set includes equivalent representation ontology matches for each of the keywords.
13. The system of claim 10 wherein the data source is a network.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
DETAILED DESCRIPTION
[0017] Disclosed is a method for conducting cross lingual searches of electronic text based media for WORDs that accounts for the semantic and contextual differences across vernaculars. Embodiments utilize a multi-language ontology to establish a search set that contains multiple forms of, and word relationships to, the parent entity in the respective languages prior to conducting a search process. The end result is a set of documents, each containing one or more entries from the search set, indexed to the parent entity.
[0018] In an embodiment and with reference to
[0019] This ontology becomes the search set, which is composed of all the associated WORDs collected from the individual language ontologies. The search set is thus a list of searchable terms used to process text-based media.
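The formation of the search set from the individual language ontologies can be sketched as follows. This is a minimal illustration only, not the implementation of any embodiment. The name `LANGUAGE_ONTOLOGIES`, the function `build_search_set`, the language codes, and the sample entries are all assumptions introduced here and are not taken from the drawings.

```python
from typing import Dict, List

# Hypothetical per-language ontologies: each maps a WORD to its equivalent
# representations in that language. The entries are illustrative
# placeholders, not the contents of the disclosure's figures.
LANGUAGE_ONTOLOGIES: Dict[str, Dict[str, List[str]]] = {
    "en": {"ISIS": ["ISIS", "Islamic State of Iraq and Syria"]},
    "ru": {"ISIS": ["ISIS", "ИГИЛ"]},
    "ar": {"ISIS": ["داعش"]},
}


def build_search_set(word: str, languages: List[str]) -> List[str]:
    """Collect every stored representation of `word` across `languages`."""
    search_set: List[str] = []
    for lang in languages:
        for term in LANGUAGE_ONTOLOGIES.get(lang, {}).get(word, []):
            if term not in search_set:  # preserve order, drop duplicates
                search_set.append(term)
    return search_set


search_set = build_search_set("ISIS", ["en", "ru", "ar"])
```

Duplicates are dropped because the same spelling (here the Latin-alphabet form) may be stored in more than one language ontology.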
[0020] The process uses the search set to filter for ontology matches in steps 104 and 105 and then store the matching documents and index them to the parent entity in step 106. This indexing of results is depicted in
[0021] After indexing, the documents are directly correlated to the parent entity. This process is represented in
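One way the filtering, storing, and indexing described for steps 104 through 106 might look in code is sketched below. The function name `filter_and_index`, the in-memory dictionaries standing in for storage, and the sample documents are assumptions made for illustration. Matching is done by simple case-sensitive substring comparison, which is a simplification that a practical system would likely replace with language-aware tokenization.

```python
from typing import Dict, Iterable, List, Tuple


def filter_and_index(
    parent_entity: str,
    search_set: List[str],
    documents: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Keep documents containing any search-set term and index them.

    Returns (stored, index): `stored` maps document id to text, and
    `index` maps the parent entity to the ids of matching documents.
    """
    stored: Dict[str, str] = {}
    index: Dict[str, List[str]] = {parent_entity: []}
    for doc_id, text in documents:
        # Case-sensitive substring match (an assumption of this sketch).
        if any(term in text for term in search_set):
            stored[doc_id] = text                 # store the matching document
            index[parent_entity].append(doc_id)   # index it to the parent entity
    return stored, index


# Hypothetical documents used only to exercise the sketch.
docs = [
    ("d1", "Отчет об ИГИЛ"),
    ("d2", "Local weather report"),
    ("d3", "Statement attributed to ISIS"),
]
stored, index = filter_and_index("ISIS", ["ISIS", "ИГИЛ"], docs)
```

Here document `d1` is indexed to the parent entity "ISIS" even though it never contains the Latin-alphabet string.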
[0022] Now with reference to
Example
[0023] To improve the comprehension of the process described above, the following example provides an exemplary use case of an embodiment.
[0024] At the time of the present disclosure, the Islamic State of Iraq and Syria (ISIS) is a mainstream concern for the United States and other nations. Searching for the term ISIS across languages presents challenges due to its differing representations in different cultures and the inability of traditional translation methods to capture these variants. Additionally, the term is an acronym but is also recognized as a proper noun. If a user were to enter the term “ISIS” into an engine performing searches across languages, the term would still be represented as “ISIS.” Even when converting to the primary alphabet of other languages (e.g., Cyrillic or Arabic), the response is still a single word.
[0025] For example, GOOGLE TRANSLATE and SYSTRAN form the backbone for the majority of translation tools easily available to consumers. The translation of the entity “ISIS” into Russian and Croatian yields in both cases simply “ISIS.”
[0026] Using these translated forms of the entity will produce results, but only when “ISIS” appears in a document. The drawback is that the term can be represented quite differently in other languages, and without proper correlation of those representations a large amount of data will go unobserved. Overcoming this problem is one advantage of the disclosed method.
[0027] Embodiments use an ontology to capture the representations that a WORD may have within other languages. This ensures that an exhaustive search of available sources will return the greatest number of relevant documents.
[0028] Croatians typically use the phonetic spelling of ISIS in their own dialect but also the spelling in Cyrillic. In previous systems, translation tools would have overlooked documents containing this subtle difference. The disclosed method would identify these items as possessing the same usage as the searched entity because a comprehensive ontology mapping of equivalents is developed for use in searching. Specifically, a plurality of language sets are stored on at least one computer readable storage medium. In each language set, a WORD from another language is associated with (indexed to) its equivalents in that language. When a processor receives a query containing a parent entity, it retrieves the indexed equivalents from each language set and combines those equivalents into an ontology mapping. The processor then searches another database for results based on the ontology mapping.
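The difference between searching on a direct translation alone and searching on the combined ontology mapping can be illustrated with a short sketch. The helper `matches`, the two term lists, and the sample document are hypothetical. The Russian and Arabic terms shown are commonly used acronyms for the same organization and are supplied here as examples, not as the contents of the language sets in the drawings.

```python
from typing import List


def matches(text: str, terms: List[str]) -> bool:
    """Return True if any term occurs in the text (plain substring match)."""
    return any(term in text for term in terms)


# A direct translation tool returns the same single string for the entity.
direct_translation = ["ISIS"]

# Equivalents retrieved from several language sets and combined
# (illustrative values only).
ontology_mapping = ["ISIS", "ИГИЛ", "داعش"]

# Hypothetical Russian-language document that never uses the Latin spelling.
document = "Новости: ИГИЛ"

found_by_translation = matches(document, direct_translation)
found_by_ontology = matches(document, ontology_mapping)
```

In this sketch the document is missed when only the direct translation is used and is found when the combined equivalents are used.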
[0029]
[0030] The Russian ontology contains many representations for ISIS in its primary alphabet, Cyrillic. Therefore, in this instance, while the translation tools would search for a single translation of the entity, the proposed method would search for five different versions of the term: one Latin alphabet spelling (the same as the other tools) plus the four Cyrillic versions.
[0031]
[0032] Using direct translation tools, the translation of ISIS into the Arabic abjad does not account for the many manifestations of “ISIS” found in Arabic communications. The disclosed method would, however, identify those representations and use them in searching for relevant documents.
[0033] is now associated with the parent entity for “ISIS” (index 1) even though the document does not contain the actual base word “ISIS.” Thereafter, the document is available for review of materials related to the search query.
[0034] Although the disclosed subject matter has been described and illustrated with respect to embodiments thereof, it should be understood by those skilled in the art that features of the disclosed embodiments can be combined, rearranged, etc., to produce additional embodiments within the scope of the invention, and that various other changes, omissions, and additions may be made therein and thereto, without departing from the spirit and scope of the present invention.