SENTENCE CREATION SYSTEM
20170286408 · 2017-10-05
Assignee
Inventors
- Kohsuke YANAI (Tokyo, JP)
- Toshinori MIYOSHI (Tokyo, JP)
- Toshihiko YANASE (Tokyo, JP)
- Misa SATO (Tokyo, JP)
Cpc classification
G06F40/289
PHYSICS
International classification
Abstract
A sentence creation system, which outputs an opinion sentence on an agenda, includes: an input unit into which the agenda is input; an agenda analyzing unit analyzing the agenda and judging the polarity of the agenda and a keyword used for searching; a searching unit searching for articles using the keyword and a disputed point word showing a disputed point in the discussion; a disputed point determining unit for determining the disputed point used for creating the opinion sentence; a sentence extracting unit for extracting sentences in which the disputed point is described among the articles output by the searching unit; a sentence sorting unit for creating sentences by sorting the extracted sentences; an evaluating unit for evaluating the sentences; a paraphrasing unit for inserting appropriate conjunctions into the sentences; and an output unit for outputting the most highly evaluated sentence among the plural sentences as the opinion sentence.
Claims
1. A sentence creation system for outputting an opinion sentence on an agenda, comprising: an input unit into which the agenda is input; an agenda analyzing unit for analyzing the agenda and judging the polarity of the agenda and a keyword used for searching; a searching unit for searching for articles using the keyword and a disputed point word showing a disputed point in the discussion; a disputed point determining unit for determining the disputed point used for creating the opinion sentence; a sentence extracting unit for extracting sentences in which the disputed point is described among the articles output by the searching unit; a sentence sorting unit for creating sentences by sorting the extracted sentences; an evaluating unit for evaluating the sentences; a paraphrasing unit for inserting appropriate conjunctions into the sentences; and an output unit for outputting the most highly evaluated sentence among the plurality of sentences as the opinion sentence.
2. The sentence creation system according to claim 1, wherein the disputed point determining unit determines a disputed point for each article by classifying the articles output by the searching unit.
3. The sentence creation system according to claim further comprising: a storage unit in which stored are the text data of the articles searched for by the searching unit, annotation data attached to the text data, searching indexes created from the text data and the annotation data, and a disputed point ontology that associates the disputed point, suppression words having meanings for suppressing the disputed point, and promotion words having meanings for promoting the disputed point; and an interface unit used for communicating data to and from the searching unit, the disputed point determining unit, the sentence extracting unit, the sentence sorting unit, the evaluating unit, the paraphrasing unit.
4. The sentence creation system according to claim 3, wherein the agenda analyzing unit determines which word should be used as the key word, the suppression word or the promotion word, by judging the polarity of the agenda.
5. The sentence creation system according to claim 3, wherein an evaluation model is further stored in the storage unit, and the evaluating unit calculates the likelihoods of the plurality of sentences in comparison with the evaluation model respectively, and outputs a sentence having the highest likelihood as the opinion sentence.
6. A sentence creation method for outputting an opinion sentence on an agenda, comprising: a first step for inputting the agenda; a second step for analyzing the agenda and judging the polarity of the agenda and a keyword used for searching; a third step for searching for articles using the keyword and a disputed point word showing a disputed point in the discussion; a fourth step for determining the disputed point used for creating the opinion sentence; a fifth step for extracting sentences in which the disputed point is described among the articles output at the third step; a sixth step for creating sentences by sorting the extracted sentences; a seventh step for evaluating the sentences; an eighth step for inserting appropriate conjunctions into the sentences; and a ninth step for outputting the most highly evaluated sentence among the plurality of sentences as the opinion sentence.
7. The sentence creation method according to claim 6, wherein a disputed point for each article is determined at the fourth step by classifying the articles output at the third step.
8. The sentence creation method according to claim 6, wherein, at the third step, searched is a storage unit in which stored are: the text data of articles searched for; annotation data attached to the text data; searching indexes created from the text data and the annotation data; and a disputed point ontology that associates the disputed point, suppression words having meanings for suppressing the disputed point, and promotion words having meanings for promoting the disputed point.
9. The sentence creation method according to claim 8, wherein at the second step, determines which word should be used as the key word, the suppression word or the promotion word by judging the polarity of the agenda.
10. The sentence creation method according to claim 8, wherein an evaluation model is further stored in the storage unit, and the likelihoods of the plurality of sentences in comparison with the evaluation model are respectively calculated, and a sentence having the highest likelihood is output as the opinion sentence at the seventh step.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
DESCRIPTION OF EMBODIMENTS
[0023] An embodiment of the present invention will be explained with reference to the accompanying drawings.
First Embodiment
[0024] Hereinafter, a sentence creation system according to a first embodiment of the present invention will be explained. The sentence creation system according to the first embodiment of the present invention is a system including a creation system composed of a combination of nine modules and a data management system. A concrete hardware configuration example is as shown in
[0025]
[0026] In the system 100, the nine modules are executed sequentially. First, the input unit 102 receives an agenda input by a user. The input unit 102 can also receive an input indicating whether a positive opinion or a negative opinion on the agenda is to be created. Clarifying the user's position on the sentence to be created in this way makes it possible to use this system in a debate-style discussion.
[0027] Next, an agenda analyzing unit 103 analyzes the agenda, and judges the polarity of the agenda and a keyword used for searching. Next, a searching unit 104 searches for articles using the keyword and a disputed point word showing a disputed point in the debate. If the agenda is, for example, “A casino should be shut down”, “casino”, which is a noun clause, is considered to be a keyword. In addition, it can be determined whether a positive disputed point word should be used for “casino” or a negative disputed point word should be used through judging the polarity. Here, a disputed point word refers to any one of all the words in a disputed point ontology shown in
[0028] If it is desired that a positive opinion is output on the above agenda, searching is executed by selecting “casino” as a keyword and a “suppression word” which suppresses a casino as a disputed point word. In this case, since the agenda is negative for “casino”, processing, in which the “suppression word” is used as a disputed point word, is executed. Plural suppression words are listed in
[0029] Next, a disputed point determining unit 105 classifies the output articles, and determines a disputed point used for creating opinions. Next, a sentence extracting unit 106 extracts sentences in which the disputed point is described among the output articles. Next, a sentence sorting unit 107 creates sentences by sorting the extracted sentences. Next, an evaluating unit 108 evaluates the created sentences. Next, a paraphrasing unit 109 inserts appropriate conjunctions, and deletes unnecessary expressions. Next, an output unit 110 outputs the most highly evaluated sentence as an essay describing opinions.
[0030] The data management system 101 includes four databases and an interface/structuralizing unit 111. The interface unit 111 provides a means of access to the data managed by the databases. The text data DB 112 stores text data including news articles and the like, and the text annotation data DB 113 stores data attached to the text data DB 112. A searching index DB 114 is an index that enables the text data DB 112 and the text annotation data DB 113 to be searched. A disputed point ontology DB 115 is a database in which disputed points, which are often discussed in debates, are associated with the related words.
[0031] Next, after the data management system 101 is explained, the respective units of the system 100 will be explained.
[0032] Data stored in the text data DB 112 are text data including news articles and the like. Sentences appropriate for composing opinion sentences are extracted from these text data, and the extracted sentences are arranged to create an essay. Therefore, the text data DB 112 is a data source for sentences composing an output essay. The text data DB 112 is composed of English and Japanese news articles crawled from the Internet. For example, a doc_id is attached to each item of data as an identifier for managing it.
[0033] The text annotation data DB 113 is a database that stores data attached to the text data DB 112.
[0034] How to attach text annotation data will be explained taking the text data “Experts said that casinos dramatically increase the number of crimes.” as an example. This sentence mentions a demerit brought about by casinos, which is useful for creating an essay about casinos. Since it can be understood from the word “increase” that “casinos” promotes “the number of crimes”, an annotation “promote” is attached to the word “increase”. Here, because the word “increase” starts at the 40th character and ends at the 47th character of the text data “Experts said that casinos dramatically increase the number of crimes.”, “begin”=40 and “end”=47 are obtained. In addition, because the promotion actor is “casinos”, another annotation “promote_arg0” is attached to “casinos”. Assume that the id of the “promote_arg0” annotation attached to “casinos” is 125123. The “id” of an annotation is automatically given by the system so that the “id” is unique to the annotation. In this case, in order to make the relationship between “increase” and “casinos” understandable, a link is provided from the “promote” annotation of “increase” to the “promote_arg0” annotation of “casinos”. This is what [“arg0”: [“125123”]] in
[0035] There are eight kinds of annotations, that is to say, “positive”, “negative”, “promote”, “promote_arg0”, “promote_arg1”, “suppress”, “suppress_arg0”, and “suppress_arg1”. “positive” denotes an affair having a positive value, and includes natural-language expressions such as “benefit”, “ethic”, and “health”. “negative” denotes an affair having a negative value, and includes natural-language expressions such as “disease”, “crime”, and “risk”. “promote” is an expression representing promotion, and includes, for example, “increase”, “invoke”, and “improve”. “promote_arg0” is a promotion actor and “promote_arg1” is a promoted event; these annotations are attached after being identified from the surrounding syntactic information when promote annotations are attached as described above. In a similar way, “suppress” is an expression representing suppression, and includes, for example, “decrease”, “stop”, and “worsen”. “suppress_arg0” is a suppression actor and “suppress_arg1” is a suppressed event; they are attached after being identified from the surrounding syntactic information when suppress annotations are attached as described above.
[0036] These annotations can be created by applying rules, which are made in advance, to the result of syntax analysis of the text data as described above. Alternatively, these annotations can be created by a machine learning method referred to as sequence labeling, using a tool such as CRF++ or the like.
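As an illustration only, the annotation of “increase” and its link to “casinos” in the example sentence can be sketched in Python as follows. The field names “id”, “begin”, “end”, and “arg0” follow the description; the “type” field, the helper char_span, and the id 125124 are hypothetical and are introduced only for this sketch. Character positions are counted from 1 and the end position is inclusive, which is an assumption consistent with “begin”=40 and “end”=47.

```python
TEXT = "Experts said that casinos dramatically increase the number of crimes."


def char_span(text: str, word: str) -> tuple[int, int]:
    """Return (begin, end) of the first occurrence of word, 1-based and inclusive."""
    start = text.index(word)  # 0-based offset of the first character
    return start + 1, start + len(word)


# Annotation for the promotion actor "casinos" (id taken from the description).
arg0_begin, arg0_end = char_span(TEXT, "casinos")
promote_arg0 = {
    "id": "125123",
    "type": "promote_arg0",  # hypothetical field name
    "begin": arg0_begin,
    "end": arg0_end,
}

# Annotation for "increase", linked to the actor through "arg0".
begin, end = char_span(TEXT, "increase")
promote = {
    "id": "125124",  # hypothetical id, assumed to be unique
    "type": "promote",
    "begin": begin,
    "end": end,
    "arg0": [promote_arg0["id"]],
}
```

Storing the link as a list of ids rather than as a nested record keeps each annotation independent, so that one actor can be referenced from several annotations.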
[0037] The searching index DB 114 is index data that enables the text data DB 112 and the text annotation data DB 113 to be searched. As index data for keyword searching, statistics of the characteristic words in each item of text data are calculated using, for example, TF-IDF, and the vectors of these statistics are stored as indexes for similarity searching. Alternatively, searching indexes can be created automatically by inputting the text data or the text annotation data into the index-creation API of software such as Solr.
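A minimal sketch of how TF-IDF vectors could be computed for such an index is given below. It assumes whitespace tokenization and one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the function name build_tfidf_index and the sample documents are hypothetical, and the weighting actually used by the system is not specified here.

```python
import math
from collections import Counter


def build_tfidf_index(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Map each doc_id to a sparse vector {word: TF-IDF weight}."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each word appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    index = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        index[doc_id] = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
    return index


index = build_tfidf_index({
    "d1": "casino crime casino",
    "d2": "casino tourism",
})
```

In this variant a word that appears in every document, such as “casino” in the sample, receives a weight of zero, so only words that distinguish a document contribute to similarity searching.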
[0038] The disputed point ontology DB 115 is a database in which disputed points, which are often discussed in debates, are associated with the related words.
[0039] The interface unit 111 is an interface that provides a means of access to the text data DB 112, the text annotation data DB 113, the searching index DB 114, and the disputed point ontology DB 115, and the interface unit 111 is implemented using a technology such as REST.
[0040] Next, the respective units of the system 100 will be explained.
[0041] The input unit 102 receives an agenda from a user. The agenda is input from a GUI such as a Web browser or the like. An example of the agenda is “We should ban smoking in train stations” or the like. In addition, it is also conceivable that the setting of the number of candidates for the after-mentioned output sentence and the like are input into the input unit 102.
[0042]
[0043] At step S402, the polarity of the agenda is judged with reference to a dictionary. In the dictionary, verbs taking a positive standpoint toward a subject, such as “accept” and “agree”, and verbs taking a negative standpoint toward a subject, such as “ban” and “abandon”, are stored separately. In the above example, “ban” is judged to be a verb taking a negative standpoint with reference to the dictionary. By combining this judgment with the previously extracted result of whether or not there is a negative expression, the polarity of the theme of the agenda is finally judged. In this example, the polarity is judged to be negative. On the other hand, in the case of an agenda “We should not ban smoking”, there is a negative expression and “ban” is a verb taking a negative standpoint, so the polarity of this agenda is judged to be positive. The polarity judged here means the polarity toward the noun clause extracted at the next step S403.
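The combination of the verb standpoint and the negative expression at step S402 can be sketched as follows. This is only an illustration: the two verb sets contain just the verbs named above, the function name judge_polarity is hypothetical, and the flag indicating a negative expression is assumed to have been obtained beforehand.

```python
POSITIVE_VERBS = {"accept", "agree"}
NEGATIVE_VERBS = {"ban", "abandon"}


def judge_polarity(main_verb: str, has_negation: bool) -> str:
    """Judge the polarity of the agenda toward its theme."""
    verb = main_verb.lower()
    if verb in POSITIVE_VERBS:
        standpoint = 1
    elif verb in NEGATIVE_VERBS:
        standpoint = -1
    else:
        raise KeyError(f"verb not found in the dictionary: {main_verb}")

    # A negative expression such as "not" reverses the standpoint of the verb.
    if has_negation:
        standpoint = -standpoint
    return "positive" if standpoint > 0 else "negative"


# "We should ban smoking in train stations."
polarity_ban = judge_polarity("ban", has_negation=False)
# "We should not ban smoking"
polarity_not_ban = judge_polarity("ban", has_negation=True)
```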
[0044] Next, a noun clause that is the theme of the agenda is extracted at step S403. Only subtrees that have syntax tags of “ROOT”, “S”, “NP”, “VP”, or “SBAR” of the syntax analysis tree of the agenda are tracked starting from “ROOT”, and noun clauses that appear on the way are extracted. For example, in the case of the agenda “We should ban smoking in train stations.”, “smoking” is extracted. Next, contextual information is extracted at step S404. Among words included in the agenda, words whose POS tags are “CC”, “FW”, “JJ”, “JJR”, “JJS”, “NN”, “NNP”, “NNPS”, “NNS”, “RP”, “VB”, “VBD”, “VBG”, “VBN”, “VBP”, or “VBZ”, and that are not extracted at step S401 and step S403 are extracted as contextual information. For example, in the case of the agenda “We should ban smoking in train stations”, “train” and “stations” are extracted.
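The filtering at step S404 can be sketched as follows, assuming that the agenda has already been tokenized and given POS tags. The tags in the sample are assigned by hand for illustration, and the function name extract_context is hypothetical; the words “ban” and “smoking” are treated as already extracted at steps S401 and S403.

```python
# POS tags whose words are treated as contextual information (step S404).
CONTEXT_POS_TAGS = {
    "CC", "FW", "JJ", "JJR", "JJS", "NN", "NNP", "NNPS", "NNS",
    "RP", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
}


def extract_context(tagged_words, already_extracted):
    """Keep words with a listed POS tag that were not extracted at earlier steps."""
    excluded = {word.lower() for word in already_extracted}
    return [
        word
        for word, tag in tagged_words
        if tag in CONTEXT_POS_TAGS and word.lower() not in excluded
    ]


# Hand-tagged agenda "We should ban smoking in train stations".
tagged_agenda = [
    ("We", "PRP"), ("should", "MD"), ("ban", "VB"), ("smoking", "NN"),
    ("in", "IN"), ("train", "NN"), ("stations", "NNS"),
]
context = extract_context(tagged_agenda, already_extracted=["ban", "smoking"])
```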
[0045] Next, synonym expansion is executed at step S405. The synonyms of the words extracted at steps S401, S403, and S404 are derived using a dictionary. As the dictionary, for example, WordNet may be used. For example, in the case of the agenda “We should ban smoking in train stations.”, “smoking” is extracted as a noun clause, and “smoke” and “fume” are derived as the synonyms of “smoking”. In a similar way, synonyms of the verb extracted at step S401 and synonyms of the words that express the contextual information and are extracted at step S404 are also derived. As mentioned above, in the agenda analyzing unit 103, a main verb, the polarity of an agenda, a noun clause that is the theme of the agenda, contextual information, and synonyms relevant to the above words are extracted from the agenda. These words are used in the later stages.
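The synonym expansion at step S405 can be sketched as a dictionary lookup. The small in-memory dictionary below is a hypothetical stand-in for a resource such as WordNet and contains only the example given above; the function name expand_synonyms is likewise an assumption.

```python
# Toy stand-in for a synonym dictionary such as WordNet (illustrative only).
SYNONYM_DICTIONARY = {
    "smoking": ["smoke", "fume"],
}


def expand_synonyms(words, dictionary=SYNONYM_DICTIONARY):
    """Return {word: synonyms}; a word missing from the dictionary gets an empty list."""
    return {word: list(dictionary.get(word, [])) for word in words}


expanded = expand_synonyms(["smoking", "train"])
```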
[0046]
[0047] Next, each of the 3000 articles extracted at step S503 is given a score using the following expression.
score=(the number of times noun clauses extracted from the agenda appear) +(the number of times words in the disputed point ontology appear) −(the antiquity of the article)
[0048] Assuming that the latest year is 2014, the antiquity of an article published in 2014 is 0, the antiquity of an article published in 2013 is 1, and the antiquity of an article published in 2012 is 2. Next, at step S504, 100 articles among the above articles are output in descending order of score. As described above, by giving a higher score to an article in which the words appear a larger number of times, an article having a high relevance to an agenda or a disputed point can be found. In addition, by reflecting the antiquity of an article in the score, an article in which newer data are reflected can be found, which can increase the persuasive power of a finally output sentence.
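The scoring and the selection at step S504 can be sketched as follows. The sketch assumes that the noun clauses and the disputed point ontology words are single words counted after a simple lower-case tokenization; the function names and the article fields “text” and “year” are hypothetical.

```python
import re

LATEST_YEAR = 2014


def antiquity(pub_year: int, latest_year: int = LATEST_YEAR) -> int:
    """Number of years between the latest year and the publication year."""
    return latest_year - pub_year


def score_article(text, noun_clauses, ontology_words, pub_year) -> int:
    """score = noun clause occurrences + ontology word occurrences - antiquity."""
    tokens = re.findall(r"[a-z]+", text.lower())
    noun_hits = sum(tokens.count(word.lower()) for word in noun_clauses)
    ontology_hits = sum(tokens.count(word.lower()) for word in ontology_words)
    return noun_hits + ontology_hits - antiquity(pub_year)


def top_articles(articles, noun_clauses, ontology_words, n=100):
    """Return up to n articles in descending order of score."""
    return sorted(
        articles,
        key=lambda a: score_article(a["text"], noun_clauses, ontology_words, a["year"]),
        reverse=True,
    )[:n]


articles = [
    {"text": "Casino crime rises as casino opens", "year": 2013},
    {"text": "Casino crime rises as casino opens", "year": 2014},
]
ranked = top_articles(articles, ["casino"], ["crime"], n=100)
```

With identical word counts, the article published in 2014 is ranked above the article published in 2013 because its antiquity is smaller.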
[0049]
[0050]
[0051]
[0052] Next, at step S903, the sentences are placed in templates and arranged to create an essay. For example, in a case of a template in which “assertion,” “reason”, “example”, “assertion”, “reason”, and “example” appear in this order, that is to say, “assertion”, “reason”, and “example” appear twice, first a sentence having the highest score, which is calculated by the sentence extracting unit 106, among sentences labeled with “assertion” in each group is selected. In a similar way, a sentence having the highest score among sentences labeled with “reason,” “example”, “assertion”, “reason”, or “example” is selected to be placed in the template in this order. At step S904, the loop processing is finished.
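The placement of sentences in a template at step S903 can be sketched as follows for one group. The sketch assumes that each sentence carries a label and a score, that a sentence already placed is not used again for a later slot with the same label, and that a slot with no remaining candidate is left out; these points, the function name fill_template, and the sample data are assumptions.

```python
TEMPLATE = ["assertion", "reason", "example", "assertion", "reason", "example"]


def fill_template(sentences, template=TEMPLATE):
    """sentences: list of (label, score, text) tuples belonging to one group."""
    # Sort once so that the first match for a label is the highest-scored one.
    remaining = sorted(sentences, key=lambda s: s[1], reverse=True)
    essay = []
    for slot in template:
        for i, (label, _score, text) in enumerate(remaining):
            if label == slot:
                essay.append(text)
                del remaining[i]  # do not reuse a sentence in a later slot
                break
    return essay


group = [
    ("assertion", 0.9, "A1"), ("assertion", 0.5, "A2"),
    ("reason", 0.8, "R1"), ("reason", 0.3, "R2"),
    ("example", 0.7, "E1"), ("example", 0.2, "E2"),
]
essay = fill_template(group)
```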
[0053] In this way, the sentence sorting unit 107 creates essays regarding plural disputed points. Subsequently, the plural essays created by the sentence sorting unit 107 are evaluated by the next evaluating unit 108, and it is at that point that the disputed point of the final output sentence, that is to say, the standpoint or value concept of an essay according to this system, is determined for the first time. By creating each essay using only sentences extracted from articles regarding the same disputed point, a sentence that argues about the disputed point from a consistent standpoint can be created.
[0054]
[0055] Essays, which have been grouped by the sentence sorting unit 107 into groups corresponding to the respective disputed points, are input into the evaluating unit 108. At step S1001, three essays are finally output in descending order of evaluation value. In this embodiment, the present system is assumed to output three sentences so that a user can grasp the contents of the sentences in a short time, but the number of sentences to be output can be changed by a user setting input into the input unit 102. With such a configuration, this system can be used in accordance with the knowledge level of the user.
[0056]
[0057] For example, if there is a sentence “Expert said that casino dramatically increase the number of crimes in Kokubunji-shi.”, this sentence is unnatural as a sentence that provides an abstract assertion of an essay because it includes a proper noun. Therefore, the clause “in Kokubunji-shi” is deleted, and “Expert said that casino dramatically increase the number of crimes.” is output. In this way, by complementing sentences with conjunctions and by equalizing the abstractness of plural sorted sentences whose anaphoric relations are corrected, a sentence that is comprehensible as an opinion sentence regarding a debate can be output.
[0058] The output unit 110 provides a user with an essay that is a final output from the system using a means such as a display. It goes without saying that the user can be provided with the essay using a synthesized speech other than a display. In an actual debate, the pros and cons state their own opinions respectively in conversation, therefore outputting the essay using the speech gives a higher feeling of presence to the user.
[0059] From the above description, it will be clearly understandable that the sentence creation system according to this embodiment is a sentence creation system for outputting an opinion sentence on an agenda, and this sentence creation system includes: an input unit into which the agenda is input; an agenda analyzing unit for analyzing the agenda and judging the polarity of the agenda and a keyword used for searching; a searching unit for searching for articles using the keyword and a disputed point word showing a disputed point in the discussion; a disputed point determining unit for determining the disputed point used for creating the opinion sentence; a sentence extracting unit for extracting sentences in which the disputed point is described among the articles output by the searching unit; a sentence sorting unit for creating sentences by sorting the extracted sentences; an evaluating unit for evaluating the sentences; a paraphrasing unit for inserting appropriate conjunctions into the sentences; and an output unit for outputting the most highly evaluated sentence among the plural sentences as the opinion sentence.
[0060] Furthermore, the sentence creation method according to this embodiment is a sentence creation method for outputting an opinion sentence on an agenda, and this sentence creation method includes: a first step for inputting the agenda; a second step for analyzing the agenda and judging the polarity of the agenda and a keyword used for searching; a third step for searching for articles using the keyword and a disputed point word showing a disputed point in the discussion; a fourth step for determining the disputed point used for creating the opinion sentence; a fifth step for extracting sentences in which the disputed point is described among the articles output at the third step; a sixth step for creating sentences by sorting the extracted sentences; a seventh step for evaluating the sentences; an eighth step for inserting appropriate conjunctions into the sentences; and a ninth step for outputting the most highly evaluated sentence among the plural sentences as the opinion sentence.
[0061] As above, according to this embodiment of the present invention, a sentence in which opinions about a disputed point are described can be created by classifying articles, extracting sentences, and sorting sentences on the basis of the disputed point that is a pillar of the opinion sentence, which can bring about consistency to the opinion sentence. In addition, unlike in a case where information is collected about a predetermined disputed point when people express their opinions in debate, after sentences are created by searching for information on all disputed points, plural disputed points are uniformly evaluated, so that an opinion sentence with considerable persuasive power can be created regardless of the disputed points.
List of Reference Signs
[0062] 100: Creation System
[0063] 101: Data Management System
[0064] 102: Input Unit
[0065] 103: Agenda Analyzing Unit
[0066] 104: Searching Unit
[0067] 105: Disputed Point Determining Unit
[0068] 106: Sentence Extracting Unit
[0069] 107: Sentence Sorting Unit
[0070] 108: Evaluating Unit
[0071] 109: Paraphrasing Unit
[0072] 110: Output Unit
[0073] 111: Interface
[0074] 112: Text Data DB
[0075] 113: Text Annotation Data DB
[0076] 114: Searching Index DB
[0077] 115: Disputed Point Ontology DB