DIALOG TEXT SUMMARIZATION DEVICE AND METHOD
20170169822 · 2017-06-15
Assignee
Inventors
CPC classification
G10L15/22
PHYSICS
G10L15/19
PHYSICS
International classification
Abstract
Provided is a summarization technology for correcting a dialog text on a word-by-word basis for readability using a dialog structure. A dialog text summarization device includes: a recognition result acquisition unit that acquires, from a database, a word recognized from a dialog form text, time-series information of the word, and identification information identifying a speaker of the word; and a text summarization unit that corrects the word based on the word, the time-series information of the word, the identification information, and a summarization model, and that outputs a correction result to the database.
Claims
1. A dialog text summarization device comprising: a recognition result acquisition unit that acquires, from a first database, a word recognized from a dialog form text, time-series information of the word, and identification information identifying a speaker of the word; and a text summarization unit that corrects the word based on the word, the time-series information of the word, the identification information, and a summarization model, and that outputs a correction result to the first database.
2. The dialog text summarization device according to claim 1, wherein the text summarization unit deletes a word determined to be not important by a determination using the summarization model.
3. The dialog text summarization device according to claim 1, wherein the text summarization unit deletes a word determined to be a recognition error by a determination using the summarization model.
4. The dialog text summarization device according to claim 1, wherein the text summarization unit corrects the word using a recurrent neural network in the summarization model.
5. The dialog text summarization device according to claim 1, further comprising a result display unit that displays the dialog form text including the correction result in such a manner that a corrected portion and/or a corrected content can be confirmed.
6. The dialog text summarization device according to claim 1, further comprising a result display unit that displays the dialog form text reflecting the correction result and the dialog form text including the correction result side by side.
7. The dialog text summarization device according to claim 1, further comprising a recognition unit that executes, as a recognition process, a process of recognizing a word included in the dialog form text, a process of managing the time-series information for each of the recognized words, and a process of managing the identification information identifying the speaker of the word.
8. The dialog text summarization device according to claim 7, wherein: the recognition unit, after receiving a query designating the dialog form text from an external terminal, acquires the dialog form text designated by the query from a second database and executes the recognition process, and further stores a process result in the first database; and the recognition result acquisition unit, after a recognition result is obtained from the recognition unit, outputs the word concerning the dialog form text designated by the query, the time-series information of the word, and the identification information to the text summarization unit.
9. The dialog text summarization device according to claim 7, wherein the recognition result acquisition unit, after receiving the query designating the dialog form text from the external terminal, acquires the word concerning the dialog form text designated by the query, the time-series information of the word, and the identification information from the first database.
10. A dialog text summarization method comprising: a process of a recognition result acquisition unit acquiring, from a first database, a word recognized from a dialog form text, time-series information of the word, and identification information identifying a speaker of the word; and a process of a text summarization unit correcting the word based on the word, the time-series information of the word, the identification information, and a summarization model, and outputting a correction result to the first database.
11. The dialog text summarization method according to claim 10, wherein the text summarization unit deletes a word determined to be not important by a determination using the summarization model.
12. The dialog text summarization method according to claim 10, wherein the text summarization unit deletes a word determined to be a recognition error by a determination using the summarization model.
13. The dialog text summarization method according to claim 10, wherein the text summarization unit corrects the word using a recurrent neural network in the summarization model.
14. The dialog text summarization method according to claim 10, wherein the text summarization unit displays the dialog form text including the correction result in such a way that a corrected portion and/or corrected content can be confirmed.
15. The dialog text summarization method according to claim 10, wherein a recognition unit executes a process of recognizing a word included in the dialog form text, a process of managing the time-series information for each of the recognized words, and a process of managing the identification information identifying the speaker of the word.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0028] In the following, embodiments of the present invention will be described with reference to the drawings. It should be noted that the mode of the present invention is not limited to the embodiments that will be described below, and that various modifications may be made within the technical scope of the invention.
(1) First Embodiment
(1-1) System Configuration
[0030] The call recording/recognition/summarization device 300 provides a function for automatically converting voice information exchanged between the operator and the customer into text, a function for automatically creating a summary of the dialog text produced by the conversion into text, and a function for providing the summary of the dialog text in accordance with a request. In many cases, the call recording/recognition/summarization device 300 is implemented as a server. For example, of the constituent elements of the call recording/recognition/summarization device 300, the functional units other than the databases are implemented by programs executed on a computer (including, e.g., a CPU, a RAM, and a ROM).
[0031] The call recording visualization terminal device 400 is a terminal used to visualize a summarized dialog text. The call recording visualization terminal device 400 may be any terminal that includes a monitor; examples are a desktop computer, a laptop computer, and a smartphone.
[0032] In the present embodiment, the operator telephone 200, the call recording/recognition/summarization device 300, and the call recording visualization terminal device 400 are disposed in a single call center. However, these constituent elements need not all be present in a single call center; in other embodiments, they may be distributed over a plurality of locations or among a plurality of business operators.
[0033] The call recording/recognition/summarization device 300 is provided with a call recording unit 11; a speaker identification unit 12; a call recording DB 13; a call recording acquisition unit 14; a voice recognition unit 15; a call recognition result DB 16; a call recognition result acquisition unit 17; a text summarization unit 18; a summarization model 19; a query reception unit 22; a call search unit 23; and a result transmission unit 24.
[0034] The call recording unit 11 acquires the voices (calls) transmitted and received between the customer telephone 100 and the operator telephone 200, and creates a voice file for each call. The call recording unit 11 implements this function using a known recording system based on, e.g., IP telephony. The call recording unit 11 manages the individual voice files by associating them with recording times, extension numbers, the telephone numbers of the other party, and the like. The speaker identification unit 12 identifies the speaker of a voice (whether the speaker is the sender or the recipient) by utilizing this association information; that is, the speaker identification unit 12 identifies whether the speaker is an operator or a customer. The call recording unit 11 and the speaker identification unit 12 create a sender-side voice file and a receiver-side voice file from one call, and save the files in the call recording database (DB) 13. The call recording DB 13 is a large-capacity storage device or system with a recording medium such as a hard disk, an optical disk, or a magnetic tape. The call recording DB 13 may be configured as direct-attached storage (DAS), network-attached storage (NAS), or a storage area network (SAN), for example.
[0035] The call recording acquisition unit 14 reads the voice files (the sender voice file and the receiver voice file) from the call recording DB 13 for each call, and feeds the files to the voice recognition unit 15. The reading of the voice files is executed during a call (in real time) or at an arbitrary timing after the end of a call; in the present embodiment, it is assumed to be executed during a call (in real time). The voice recognition unit 15 subjects the contents of the two voice files to voice recognition for conversion into text information. For voice recognition, a known technology may be used. However, in light of the summarization process executed in a later stage, a voice recognition technology capable of outputting the text information on a word-by-word basis and chronologically may be desirable. The result of voice recognition is registered in the call recognition result DB 16. The call recognition result DB 16 is also a large-capacity storage device or system, implemented on a medium or in a form similar to the call recording DB 13. The call recording DB 13 and the call recognition result DB 16 may be managed as different storage regions of the same storage device or system.
[0036] The call recognition result acquisition unit 17 acquires, from the call recognition result DB 16, the call recognition results associated with a recording ID, and sorts the results in the chronological order of appearance of the words. By this sorting, a time-series of speaker-ID-tagged words is obtained for one recording ID. The text summarization unit 18, given the time-series of words created by the call recognition result acquisition unit 17 as input, summarizes the text on a word-by-word basis by applying the summarization model 19. In the present embodiment, a recurrent neural network is used as the summarization model 19. The summarization by the text summarization unit 18 involves a word-by-word correction process, and the word-by-word correction information is fed back from the text summarization unit 18 to the call recognition result DB 16. As a result, in the call recognition result DB 16, the aforementioned time-series of speaker-ID-tagged words for one recording ID is stored in association with the word-by-word correction information.
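The flow in this paragraph can be sketched as follows; the record fields and the sample values are illustrative assumptions, not taken from the source, and stand in for the word-level entries of the call recognition result DB 16.

```python
from dataclasses import dataclass

# Hypothetical word-level record as it might be stored in the call
# recognition result DB 16; the field names are illustrative assumptions.
@dataclass
class WordRecord:
    recording_id: str
    speaker_id: str    # e.g. "O" for operator, "C" for customer
    start_time: float  # seconds from the start of the recording
    word: str

def to_time_series(records, recording_id):
    """Collect the words of one recording and sort them in the
    chronological order of appearance, keeping each speaker ID."""
    selected = [r for r in records if r.recording_id == recording_id]
    selected.sort(key=lambda r: r.start_time)
    return [(r.speaker_id, r.word) for r in selected]

records = [
    WordRecord("rec1", "O", 2.0, "help"),
    WordRecord("rec1", "C", 0.5, "hello"),
    WordRecord("rec1", "O", 1.2, "may"),
]
series = to_time_series(records, "rec1")
# series == [("C", "hello"), ("O", "may"), ("O", "help")]
```

The resulting speaker-tagged time-series is the form of input the text summarization unit 18 would receive.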
[0037] The query reception unit 22 executes a process of receiving a query from the call recording visualization terminal device 400. The query may include, for example, the presence or absence of summary display, in addition to a recording ID. Based on the recording ID identified by the query, the call search unit 23 reads the time-series of words for each speaker from the call recognition result DB 16. The result transmission unit 24 transmits the time-series of words for each speaker that has been read to the call recording visualization terminal device 400.
[0038] The call recording visualization terminal device 400 includes a query transmission unit 21 that receives the input of a query, and a result display unit 25 that visualizes the dialog text. The call recording visualization terminal device 400 includes a monitor, and the input of a query and the displaying of a dialog text are executed via an interface screen displayed on the monitor.
(1-2) Text Summarization Operation
[0041] Referring back to
[0043] Referring back to
[0045] As shown in
[0046] In the present embodiment, the summarization model 19 uses a recurrent neural network. An output s(i) of the hidden layer for the i-th word is computed from the vector x(i) representing the word, the vector d(i) representing its speaker, the output s(i−1) of the hidden layer for the preceding word, and an input weight matrix U:
s(i)=sigmoid(U[x(i) d(i) s(i−1)]) (Expression 1)
[0047] An output y(i) of the output layer is expressed by the following expression using the output s(i) of the hidden layer, the output weight matrix V, and a softmax function softmax:
y(i)=softmax(Vs(i)) (Expression 2)
[0048] The output y(i) thus computed is regarded as the vector representing the i-th word after correction. The input weight matrix U and the output weight matrix V are determined by training in advance. Such training can be implemented using, for example, back propagation through time, given a number of correct input/output pairs. By creating the correct input/output pairs from a word sequence output by voice recognition and a word sequence resulting from its human summarization, an appropriate summarization model can be created. In practice, such correct pairs may include deletion of redundant words, correction of misrecognized words, deletion of unwanted sentences, and the like in light of context. In a summarization model based on a recurrent neural network, all of these operations can be handled in the same framework.
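As a concreteness check, one step of Expressions 1 and 2 can be sketched in pure Python. The dimensions, the random weights, and the use of a sigmoid as the hidden-layer nonlinearity are illustrative assumptions, not values from the source.

```python
import math
import random

def sigmoid(z):
    # element-wise logistic function
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    # numerically stable softmax over a vector
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Illustrative sizes: word vector x(i), speaker vector d(i), hidden
# state s(i-1), and a small output vocabulary of corrected words.
random.seed(0)
dim_x, dim_d, dim_s, vocab = 5, 2, 4, 7
U = [[random.gauss(0, 1) for _ in range(dim_x + dim_d + dim_s)]
     for _ in range(dim_s)]
V = [[random.gauss(0, 1) for _ in range(dim_s)] for _ in range(vocab)]

def rnn_step(x_i, d_i, s_prev):
    # Expression 1: hidden state from concatenated [x(i) d(i) s(i-1)]
    s_i = sigmoid(matvec(U, x_i + d_i + s_prev))
    # Expression 2: distribution over corrected words
    y_i = softmax(matvec(V, s_i))
    return s_i, y_i

x = [random.gauss(0, 1) for _ in range(dim_x)]
d = [1.0, 0.0]          # e.g. a one-hot speaker vector: operator
s1, y1 = rnn_step(x, d, [0.0] * dim_s)
```

The argmax of y(i) would then name the corrected word (or a special DELETE symbol) for the i-th input word; in practice U and V would be trained as described above rather than drawn at random.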
[0049] For the summarization model 19, mechanisms other than the above-described recurrent neural network may also be adopted. For example, a rule-based mechanism may be adopted in which correction or deletion is designated when a word of concern, the words appearing before and after it, and their respective speaker IDs match a predetermined condition. The summarization model 19 need not be based on a method that takes a time-series history into consideration, as the recurrent neural network does. For example, for determining whether a word is to be deleted, a discriminative model such as a conditional random field, based on feature quantities composed of the preceding and following words and the speaker IDs, may be used.
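A minimal sketch of the rule-based alternative described above follows; the two concrete rules (drop filler words, drop a word immediately repeated by the same speaker) are invented examples of conditions over a word, its neighbors, and their speaker IDs, not rules taken from the source.

```python
# Toy rule-based summarizer over a speaker-tagged time-series of words.
# The rule set below is an illustrative assumption.
FILLER_WORDS = {"uh", "um", "er"}

def apply_rules(series):
    """series: list of (speaker_id, word); returns a KEEP/DELETE
    action for each word in order."""
    actions = []
    for i, (spk, word) in enumerate(series):
        prev = series[i - 1] if i > 0 else None
        if word.lower() in FILLER_WORDS:
            actions.append("DELETE")   # rule 1: filler word
        elif prev == (spk, word):
            actions.append("DELETE")   # rule 2: same speaker repeats word
        else:
            actions.append("KEEP")
    return actions

series = [("C", "um"), ("C", "hello"), ("C", "hello"), ("O", "hi")]
# apply_rules(series) == ["DELETE", "KEEP", "DELETE", "KEEP"]
```

Unlike the recurrent model, such rules inspect only a fixed local window, which is why the paragraph above notes that no time-series history is required.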
(1-3) Call Visualization Operation
[0051] The query reception unit 22 receives the query transmitted from the query transmission unit 21 and feeds it to the call search unit 23 (step S702). The call search unit 23, based on the recording ID included in the query received by the query reception unit 22, searches the call recognition result DB 16 to access the corresponding voice interval information and recognition result information (step S703). In this case, the voice interval table 401 and the call recognition result table 402 are both output to the result transmission unit 24 as search results. The result transmission unit 24 transmits the search results output from the call search unit 23 to the call recording visualization terminal device 400 (step S704). The result display unit 25 displays the received search results on the monitor (step S705).
[0053] The result display unit 25, based on the search result, initially arranges a rectangle indicating each voice interval of the customer (speaker ID: C) on the left side, and arranges a rectangle indicating each voice interval of the operator (speaker ID: O) on the right side. In each rectangle, the words uttered in the same voice interval are arranged in order. When the words are arranged in a rectangle, if the word after correction is DELETE, the result display unit 25 does not display the corresponding word; if the word after correction is non-blank, the result display unit 25 displays the word after correction instead of the recognized word; and if the word after correction is blank, the recognized word is displayed as-is.
[0054] If no word remains in a voice interval after correction and the interval is entirely included in the counterpart's voice interval, the utterance can be considered a chiming-in; accordingly, the result display unit 25 deletes the rectangle itself. If such an interval is not included in the counterpart's voice interval, the deletion can be considered the removal of a recognition error; accordingly, the result display unit 25 substitutes a display, such as "...", meaning that there was an utterance which could not be recognized. The rectangles are displayed at different heights (rows) in time order. In this way, a summary is presented on a word-by-word basis, whereby an easy-to-read display can be obtained. The presence of a correction may be indicated by, for example, highlighting the corresponding text, changing the font size, changing the font color, or adding other modifications. The display content of the result display screen 801 or its layout may be created by the result transmission unit 24 and transmitted to the result display unit 25.
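The display decisions in the two paragraphs above can be sketched as follows; the function and its arguments are hypothetical, assuming each voice interval carries (recognized, corrected) word pairs and a flag saying whether it lies inside the counterpart's voice interval.

```python
def render_interval(words, overlaps_counterpart):
    """words: list of (recognized, corrected) pairs for one voice
    interval.  corrected is "DELETE" to drop a word, "" (blank) to keep
    the recognized word, or a replacement word.  Returns the display
    text, "..." for a deleted recognition error, or None when the
    rectangle itself should be removed (treated as a chiming-in)."""
    shown = [(c if c else r) for r, c in words if c != "DELETE"]
    if shown:
        return " ".join(shown)
    if overlaps_counterpart:
        return None    # chiming-in: drop the whole rectangle
    return "..."       # deleted recognition error: placeholder display

# Example: a misrecognized word corrected, a filler deleted.
text = render_interval([("helo", "hello"), ("uh", "DELETE")], False)
# text == "hello"
```

The caller would then draw customer rectangles on the left and operator rectangles on the right, skipping any interval for which None is returned.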
(1-4) Effects of Embodiment
[0056] As described above, in the call recording/recognition/summarization system according to the present embodiment, it is possible, after a dialog text has been divided to the word level, to create a summary in which the text is corrected on a word-by-word basis by utilizing the dialog structure of the call recording (specifically, the information identifying the speaker of each word and the time-series information of the words). Accordingly, a dialog text summary that is easier to read than one created by conventional methods can be created.
[0057] For example, text of a chiming-in made while the counterpart is talking, or text containing a recognition error, can be deleted. On the other hand, utterances having a high degree of importance, such as a chiming-in or a reply in response to the counterpart's utterance, or the operator's utterance immediately before the customer's utterance "I see", can be actively retained. As a result, an easy-to-read summary can be created while words with a high degree of importance are retained. In addition, the present embodiment makes it possible to select whether a summary is to be displayed, so that the summarized content can be confirmed as needed.
(2) Second Embodiment
[0058] The first embodiment has been described with reference to the case where the voice recognition and summarization processes are executed simultaneously with the recording of a call within a single device. In the present embodiment, a call recording/recognition/summarization system will be described in which the voice recognition and summarization processes for a call recording are executed on demand in accordance with a request from the user, and the result is visualized.
[0061] In the present embodiment, the voice recognition operation S1101 is executed not for all of the recording IDs but only for the recording ID included in the query received in the call visualization operation. The same applies to the summarization operation S1102, which is executed after the end of the voice recognition operation. This configuration makes it possible to perform voice recognition only on the recording that the user has designated for summarization and visualization. Accordingly, computing resources can be utilized effectively.
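The on-demand behavior can be sketched as follows; `recognize` and `summarize` are hypothetical placeholders for the voice recognition operation S1101 and the summarization operation S1102, and the cache stands in for storing results in the call recognition result DB 16.

```python
# Sketch of the second embodiment's flow: recognition and summarization
# run only for the recording ID named in the query, and the result is
# kept so that repeated queries reuse earlier work.
recognition_cache = {}

def recognize(recording_id):
    # placeholder for voice recognition operation S1101
    return f"recognized({recording_id})"

def summarize(text):
    # placeholder for summarization operation S1102
    return f"summary({text})"

def handle_query(recording_id):
    """Run S1101 and S1102 for this recording ID only if no stored
    result exists yet, then return the (possibly cached) summary."""
    if recording_id not in recognition_cache:
        recognition_cache[recording_id] = summarize(recognize(recording_id))
    return recognition_cache[recording_id]
```

Recordings that are never queried are never recognized, which is the source of the computing-resource savings described above.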
[0062] In the present embodiment, the voice recognition operation and the summarization operation are executed as part of the call visualization operation. However, only the summarization operation may be executed as part of the call visualization operation. In this case, the voice recognition operation may be executed, as in the first embodiment, at the time of recording of a call between the customer and the operator, or at least before the start of the call visualization operation. Adopting such an operation technique also makes it possible to utilize computing resources effectively.
(3) Other Embodiments
[0063] The present invention is not limited to the above-described embodiments and may include various modifications. For example, while the embodiments presented systems for visualizing the voices of a call, the present invention is not limited to voice and may be widely applied to data including dialog. For example, similar summarization can be performed for text chat and the like, based on the text content and the message transmission time sequence. The object of the present invention is not limited to a dialog between two persons, and may include the speaker IDs of three or more persons. Accordingly, the present invention can be applied to a dialog among three or more persons, such as in a teleconference system.
[0064] The present invention is not necessarily required to be equipped with all of the configurations described with reference to the embodiments. Part of the configuration of one embodiment may be substituted by the configuration of another embodiment, or the configuration of the other embodiment may be incorporated into the configuration of the one embodiment. Other constituent elements may be incorporated into the respective embodiments, or some constituent elements of one embodiment may be replaced with other constituent elements.
[0065] The configurations, functions, processing units, processing means and the like described above may be partly or entirely implemented by hardware, for example by designing them as integrated circuits. For example, the various functions for recording, recognition, and summarization of a call that are implemented by a program executed on the CPU of a server may be partly or entirely implemented by hardware using electronic components, such as integrated circuits.
[0066] The information of the programs, tables, files and the like for implementing the respective functions may be stored in a storage device, such as a memory, a hard disk, or a solid state drive (SSD), or in a storage medium, such as an IC card, an SD card, or a DVD. The illustrated control lines and information lines are only those considered necessary for the purpose of description, and do not represent all of the control lines and information lines that may be required in a product. In practice, almost all of the configurations may be considered to be mutually connected.
DESCRIPTION OF SYMBOLS
[0067] 11 Call recording unit
[0068] 12 Speaker identification unit
[0069] 13 Call recording DB
[0070] 14 Call recording acquisition unit
[0071] 15 Voice recognition unit
[0072] 16 Call recognition result DB
[0073] 17 Call recognition result acquisition unit
[0074] 18 Text summarization unit
[0075] 19 Summarization model
[0076] 21 Query transmission unit
[0077] 22 Query reception unit
[0078] 23 Call search unit
[0079] 24 Result transmission unit
[0080] 25 Result display unit
[0081] 100 Customer telephone
[0082] 200 Operator telephone
[0083] 300 Call recording/recognition/summarization device
[0084] 301 Call recording device
[0085] 302 Call recognition device
[0086] 303 Call summarization device
[0087] 400 Call recording visualization terminal device