DIALOG SUPPORT APPARATUS, DIALOG SUPPORT METHOD AND PROGRAM
20230121148 · 2023-04-20
Cpc classification
G10L15/22
PHYSICS
Abstract
A dialog assistance apparatus facilitates a dialog by including a first estimation unit that estimates, with respect to a field related to speech content of a first speaker, a knowledge level of a second speaker having a dialog with the first speaker, an acquisition unit that acquires, from a storage unit that stores a question in association with a keyword and the knowledge level, a question corresponding to a keyword included in the speech content and corresponding to the knowledge level of the second speaker, and an output unit that outputs the acquired question to the first speaker.
Claims
1. A dialog assistance apparatus comprising a processor configured to execute a method comprising: estimating, with respect to a field related to speech content of a first speaker, a knowledge level of a second speaker having a dialog with the first speaker; acquiring a question corresponding to a keyword included in the speech content and corresponding to the knowledge level of the second speaker; and outputting the acquired question to the first speaker.
2. The dialog assistance apparatus according to claim 1, further comprising: estimating an understanding level of the second speaker with respect to the speech content of the first speaker, wherein the outputting further comprises outputting the question to the first speaker according to the understanding level.
3. A computer implemented method for dialog assistance, the method comprising: estimating, with respect to a field related to speech content of a first speaker, a knowledge level of a second speaker having a dialog with the first speaker; acquiring a question corresponding to a keyword included in the speech content and corresponding to the knowledge level of the second speaker; and outputting the acquired question to the first speaker.
4. A computer-readable non-transitory recording medium storing computer-executable program instructions that when executed by a processor cause a computer to execute a method comprising: estimating, with respect to a field related to speech content of a first speaker, a knowledge level of a second speaker having a dialog with the first speaker; acquiring a question corresponding to a keyword included in the speech content and corresponding to the knowledge level of the second speaker; and outputting the acquired question to the first speaker.
5. The dialog assistance apparatus according to claim 1, wherein the knowledge level is associated with a label, and the label includes one of: high, medium, or low.
6. The dialog assistance apparatus according to claim 1, wherein the knowledge level is based on a history of dialogs between the first speaker and the second speaker.
7. The dialog assistance apparatus according to claim 1, wherein the knowledge level of the second speaker is associated with an expression of the second speaker determined in captured image data.
8. The dialog assistance apparatus according to claim 1, the processor further configured to execute a method comprising: outputting, based on a threshold associated with a number of speakers in dialog with the second speaker, the acquired question to the first speaker.
9. The computer implemented method according to claim 3, further comprising: estimating an understanding level of the second speaker with respect to the speech content of the first speaker, wherein the outputting further comprises outputting the question to the first speaker according to the understanding level.
10. The computer implemented method according to claim 3, wherein the knowledge level is associated with a label, and the label includes one of: high, medium, or low.
11. The computer implemented method according to claim 3, wherein the knowledge level is based on a history of dialogs between the first speaker and the second speaker.
12. The computer implemented method according to claim 3, wherein the knowledge level of the second speaker is associated with an expression of the second speaker determined in captured image data.
13. The computer implemented method according to claim 3, further comprising: outputting, based on a threshold associated with a number of speakers in dialog with the second speaker, the acquired question to the first speaker.
14. The computer-readable non-transitory recording medium according to claim 4, the computer-executable program instructions when executed further causing the computer to execute a method comprising: estimating an understanding level of the second speaker with respect to the speech content of the first speaker, wherein the outputting further comprises outputting the question to the first speaker according to the understanding level.
15. The computer-readable non-transitory recording medium according to claim 4, wherein the knowledge level is associated with a label, and the label includes one of: high, medium, or low.
16. The computer-readable non-transitory recording medium according to claim 4, wherein the knowledge level is based on a history of dialogs between the first speaker and the second speaker.
17. The computer-readable non-transitory recording medium according to claim 4, wherein the knowledge level of the second speaker is associated with an expression of the second speaker determined in captured image data.
18. The computer-readable non-transitory recording medium according to claim 4, the computer-executable program instructions when executed further causing the computer to execute a method comprising: outputting, based on a threshold associated with a number of speakers in dialog with the second speaker, the acquired question to the first speaker.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
DESCRIPTION OF EMBODIMENTS
[0018] Hereinafter, embodiments of the present disclosure will be described with reference to the drawings. The present embodiment assumes a situation in which a speaker A with high literacy (high knowledge level) and a speaker B with relatively low literacy (low knowledge level) in a certain field (for example, Information and Communication Technology (ICT)) have a dialog. For example, the speaker A may be a person who is in charge at the counter of a certain store, and the speaker B may be a person who consults the speaker A over the counter. This scenario is presented to facilitate understanding of the present embodiment and is not intended to imply that the present embodiment is effective only in the above situation.
[0019] A dialog assistance apparatus 10 is placed where the speaker A and the speaker B have a dialog, to assist the dialog. The dialog assistance apparatus 10 may be shaped like a robot. Alternatively, a device such as a personal computer (PC), a smart phone, or the like may be utilized as the dialog assistance apparatus 10.
[0020]
[0021] A program for implementing processing performed by the dialog assistance apparatus 10 is provided as a recording medium 101 such as a compact disc read-only memory (CD-ROM). When the recording medium 101 storing the program is set in the drive device 100, the program is installed on the auxiliary storage device 102 from the recording medium 101 via the drive device 100. However, the program does not necessarily have to be installed from the recording medium 101 and may be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.
[0022] The memory device 103 reads the program from the auxiliary storage device 102 and stores it when the program is instructed to start. The CPU 104 implements functions relevant to the dialog assistance apparatus 10 in accordance with the program stored in the memory device 103. The microphone 105 is used to input the voice of a dialog (in particular, speech content of the speaker A). The display device 106 is, for example, a liquid crystal display and is used to output (display) a question to the speaker A when the speaker B is unable to understand the speech content of the speaker A, as will be described later. The display device 106 may be shaped like a window which is disposed, for example, between the speaker A and the speaker B. The camera 107 is, for example, a digital camera and is used to input an image of the face (hereinafter referred to as a “face image”) of the speaker B. The microphone 105, the display device 106, and the camera 107 need not be built into the dialog assistance apparatus 10 and may instead be connected to the dialog assistance apparatus 10, for example, wirelessly or by wire.
[0023]
[0024] Hereinafter, processing executed by the dialog assistance apparatus 10 will be described.
[0025] When the speaker A starts speaking, the keyword extraction unit 11 receives the spoken voice of the speaker A via the microphone 105 (S101). For example, when the speech ends, the keyword extraction unit 11 applies speech recognition to the spoken voice input for that speech and extracts at least one keyword from the text data obtained as a result of the speech recognition (S102). For example, “tethering” may be extracted as a keyword when the spoken voice is “do you use tethering?”.
[0026] Such keyword extraction can be performed using known techniques. For example, the keyword extraction may be performed using the method described in “Keyword Recognition and Extraction for Speech-Driven Web Retrieval Task”, Masahiko Matsushita, Hiromitsu Nishizaki, Takehito Utsuro, and Seiichi Nakagawa, Information Processing Society of Japan, Research Report, Speech language information processing (SLP), 2003 (104 (2003-SLP-048)), 21-28. Alternatively, only the keywords registered in the knowledge level DB 122, which will be described later, may be extracted.
[0027] Subsequently, the keyword extraction unit 11 records the extracted keyword in the keyword storage unit 121 (S103) and waits for the next speech of the speaker A (S101). In the keyword storage unit 121, the keywords are recorded in such a manner that the order in which they were extracted (the order of the speeches) can be identified.
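Steps S102 and S103 can be illustrated with a minimal sketch that assumes the alternative noted above, in which only keywords already registered in the knowledge level DB 122 are extracted. Speech recognition is taken as already done, so the input is plain text. The names `KeywordStore` and `extract_keywords` and the registered keyword list are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordStore:
    """Hypothetical stand-in for the keyword storage unit 121."""
    # (speech index, keyword) pairs, appended in extraction order
    entries: list = field(default_factory=list)

    def record(self, speech_index, keyword):
        self.entries.append((speech_index, keyword))


def extract_keywords(text, registered_keywords):
    """Return the registered keywords found in the text, in order of appearance."""
    lowered = text.lower()
    hits = [(lowered.find(k.lower()), k) for k in registered_keywords]
    return [k for position, k in sorted(hits) if position >= 0]


store = KeywordStore()
registered = ["tethering", "wireless LAN"]  # assumed contents of the knowledge level DB 122
utterances = [
    "Do you use wireless LAN at home?",
    "Do you use tethering when you are out?",
]
for index, utterance in enumerate(utterances, start=1):
    for keyword in extract_keywords(utterance, registered):
        store.record(index, keyword)  # S103: record while preserving order

print(store.entries)
```

Keeping the speech index next to each keyword is one simple way to make the order of speeches identifiable later; any ordered container would serve the same purpose.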
[0028]
[0029] The understanding level estimation unit 12 receives the face image of the speaker B, which is continuously captured by the camera 107 (S201), and estimates (calculates), based on the face image, the understanding level of the speaker B with respect to the speech content of the speaker A (S202). Specifically, the expression of the speaker B is likely to change when the speech content of the speaker A is difficult to understand. The understanding level estimation unit 12 therefore estimates the understanding level based on the expression of the speaker B. Such estimation of the understanding level may be performed using the technique described in, for example, “Understanding Presumption System from Facial Images”, Jun Mimura and Masafumi Hagiwara, IEEJ Journal of Industry Applications, C, 120 (2), 2000, 273-278. In that case, the understanding level is estimated in five levels (0 to 4) ranging from no understanding at all to a complete understanding. Although the understanding level is estimated using the input of the face image in the present embodiment, other understanding level estimation methods may be used. For example, the speech content of the speaker A or the speaker B may be input to estimate the understanding level using an existing speech recognition technique or text analysis technique.
[0030] Subsequently, the understanding level estimation unit 12 determines whether the understanding level of the speaker B is smaller than a threshold (S203). In the present embodiment, a lower value of the understanding level indicates a lower degree of understanding. That is, step S203 determines whether the speaker B has a low understanding level.
[0031] If the understanding level of the speaker B is equal to or greater than the threshold (No in S203), it is estimated that the speaker B is able to understand the speech content of the speaker A, and there is no need to assist the speaker B, so that the process returns to step S201. If the understanding level of the speaker B is smaller than the threshold (Yes in S203), the knowledge level estimation unit 13 estimates the knowledge level of the speaker B for the field (for example, ICT) related to the speech content of the speaker A in accordance with at least one keyword stored in the keyword storage unit 121 and the knowledge level DB 122 (S204). That is, how much knowledge the speaker B has for the field is estimated.
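The branch at step S203 can be sketched as follows. The sketch assumes the five-level scale (0 to 4) mentioned above; the threshold value of 2 is an assumption chosen only for illustration, since no numeric threshold is given here, and the function name `needs_assistance` is hypothetical.

```python
UNDERSTANDING_THRESHOLD = 2  # assumed value; no numeric threshold is specified in the text


def needs_assistance(understanding_level, threshold=UNDERSTANDING_THRESHOLD):
    """Step S203: return True ("Yes") when the level is smaller than the threshold."""
    if not 0 <= understanding_level <= 4:
        raise ValueError("understanding level is expected to be in the range 0 to 4")
    return understanding_level < threshold


for level in range(5):
    if needs_assistance(level):
        branch = "proceed to S204 (estimate knowledge level)"
    else:
        branch = "return to S201 (keep observing)"
    print(level, "->", branch)
```

A level equal to the threshold takes the "No" branch, matching the "equal to or greater than" wording of the text.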
[0032]
[0033] When a plurality of keywords are included in the target keyword group, the knowledge level estimation unit 13 may acquire, for example, the knowledge level from the knowledge level DB 122 for each target keyword, and estimate the lowest value of the acquired knowledge levels to be the knowledge level of the speaker B. Alternatively, the knowledge level estimation unit 13 may estimate the highest value of the knowledge levels corresponding to any target keyword that was recorded in the keyword storage unit 121 before the understanding level was estimated to be smaller than the threshold, to be the knowledge level of the speaker B. This is because the speaker B is more likely to have understood the keywords that were recorded before the understanding level was estimated to be smaller than the threshold.
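The two estimation strategies described above can be sketched as follows. The sketch assumes that the knowledge level DB 122 maps each keyword to a numeric knowledge level; the table contents, the numeric scale, and the function names are hypothetical.

```python
# Hypothetical contents of the knowledge level DB 122: keyword -> knowledge level.
KNOWLEDGE_LEVEL_DB = {"wireless LAN": 1, "tethering": 3, "VPN": 4}


def lowest_level(target_keywords, db=KNOWLEDGE_LEVEL_DB):
    """First strategy: the lowest knowledge level among the target keywords."""
    levels = [db[k] for k in target_keywords if k in db]
    return min(levels) if levels else None


def highest_level_before_drop(keywords_before_drop, db=KNOWLEDGE_LEVEL_DB):
    """Second strategy: the highest level among keywords recorded before the
    understanding level fell below the threshold."""
    levels = [db[k] for k in keywords_before_drop if k in db]
    return max(levels) if levels else None


print(lowest_level(["wireless LAN", "tethering"]))
print(highest_level_before_drop(["wireless LAN"]))
```

Both functions return `None` when no keyword is found in the table, which leaves the fallback behavior to the caller; this is a design choice of the sketch, not something stated in the text.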
[0034] In addition to the above, the technique disclosed in JP 2013-167765 A may also be used. In that case, the history of dialogs between the speaker A and the speaker B is recorded, and the knowledge level estimation unit 13 may estimate the knowledge level (knowledge amount) of the speaker B with reference to the history. Alternatively, the technique disclosed in JP 2019-28604 A may be used to estimate the knowledge level of the speaker B.
[0035] Subsequently, the question acquisition unit 14 acquires, from the question DB 123, the question to be output to the speaker A in accordance with the target keyword group and the knowledge level estimated for the speaker B (S205).
[0036]
[0037] Accordingly, in step S205, the question acquisition unit 14 acquires the “question” from each record whose “keyword” matches any keyword included in the target keyword group and whose “required knowledge level” is equal to or smaller than the knowledge level of the speaker B. When there are a plurality of “questions”, the questions may be sorted, for example, in descending order of the “number of outputs”.
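The lookup in step S205 can be sketched as a filter followed by a sort. The field names follow the “keyword”, “required knowledge level”, “question”, and “number of outputs” items named above; the record values, the numeric levels, and all questions other than the tethering example that appears in this description are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QuestionRecord:
    keyword: str
    required_knowledge_level: int
    question: str
    number_of_outputs: int


# Hypothetical contents of the question DB 123.
QUESTION_DB = [
    QuestionRecord("tethering", 1,
                   "By tethering, can I use the Internet on my laptop computer?", 12),
    QuestionRecord("tethering", 1,
                   "Is tethering something I set up on my phone?", 20),
    QuestionRecord("tethering", 3,
                   "Does tethering count against my mobile data allowance?", 30),
    QuestionRecord("wireless LAN", 1,
                   "Is wireless LAN the same as Wi-Fi?", 7),
]


def acquire_questions(target_keywords, knowledge_level, db=QUESTION_DB):
    """Step S205: select matching records, then sort by number of outputs (descending)."""
    matches = [
        record for record in db
        if record.keyword in target_keywords
        and record.required_knowledge_level <= knowledge_level
    ]
    matches.sort(key=lambda record: record.number_of_outputs, reverse=True)
    return [record.question for record in matches]


for question in acquire_questions(["tethering"], knowledge_level=1):
    print(question)
```

With a knowledge level of 1, the record requiring level 3 is excluded even though it has the largest number of outputs, so the filter on the required knowledge level takes precedence over popularity.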
[0038] Subsequently, the question output unit 15 outputs (displays) the question acquired by the question acquisition unit 14 to the display device 106 (S206). The display device 106 is disposed so as to be visually recognizable by both the speaker A and the speaker B.
[0039] Then, the speaker A speaks the answer to the question. In accordance with the question and the answer, it can be expected that the speaker B is able to understand the speech content of the speaker A, which the speaker B could not understand before.
[0040] The following is a specific example of the dialog between the speaker A and the speaker B and the questions output by the dialog assistance apparatus 10.
A(1): “Do you use wireless LAN at home?”
B(1): “Yes.”
[0041] A(2): “Do you use tethering when you are out?”
B(2): “Well . . . .”
[0042] Dialog assistance apparatus 10: “By tethering, can I use the Internet on my laptop computer?”
A(3): “Yes.”
[0043] B(3): “I do not use my laptop computer outside, so I do not think I use tethering.”
In the above, A(m) (m=1 to 3) represents a speech uttered by the speaker A, and B(m) (m=1 to 3) represents a speech uttered by the speaker B. In this dialog, step S202 and the subsequent steps are performed according to the facial expression of the speaker B when he/she says “Well . . . ”. In step S206, performed as a result, the dialog assistance apparatus 10 outputs the question “By tethering, can I use the Internet on my laptop computer?” to the speaker A on behalf of the speaker B. In response, the speaker A answers “Yes.” (speech A(3)). This answer allows the speaker B to respond to the speech A(2) with the speech B(3) even if the speaker B does not fully understand the meaning of “tethering”, thus facilitating the dialog between the two. In other words, the two remain engaged in the dialog and a breakdown of the dialog is avoided.
[0044]
[0045] In this case, as in the specific example described above, the question output unit 15 outputs the question “By tethering, can I use the Internet on my laptop computer?” on behalf of the speaker B. Although the present embodiment has described the example in which the question is output by display, the question output unit 15 may also output the question by voice. In that case, the dialog assistance apparatus 10 needs to include a loudspeaker.
[0046] Another case is assumable, as illustrated in
[0047] Even when a plurality of speakers B are present, there may be no need to limit the output of the question based on the threshold. In this case, the understanding level estimation unit 12 may estimate the understanding level of each speaker B (in parallel), and the knowledge level estimation unit 13 may estimate the knowledge level of each speaker B (in parallel). The question acquisition unit 14 may acquire, from the question DB 123, the question to be output to the speaker A based on the lowest knowledge level among a plurality of the estimated knowledge levels (in parallel). In this manner, the question may be output according to the speaker B having the lowest knowledge level.
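The handling of a plurality of speakers B can be sketched as follows: the knowledge level of each speaker B is estimated concurrently and the lowest value is used for question acquisition. The function name `lowest_knowledge_level`, the speaker identifiers, and the stub estimates are hypothetical; a thread pool is used here only as one possible way to run the estimates in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def lowest_knowledge_level(speaker_ids, estimate):
    """Estimate each speaker B's knowledge level concurrently and return the lowest."""
    with ThreadPoolExecutor() as pool:
        levels = list(pool.map(estimate, speaker_ids))
    return min(levels)


# Hypothetical per-speaker results standing in for the knowledge level estimation.
stub_levels = {"B1": 3, "B2": 1, "B3": 2}

level_for_question = lowest_knowledge_level(list(stub_levels), stub_levels.get)
print(level_for_question)
```

Using the minimum means that the acquired question is one that every speaker B present is expected to be able to follow.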
[0048] In accordance with the present embodiment, as described above, when the speaker B cannot understand the speech content of the speaker A (the content of the dialog with the speaker A), the dialog assistance apparatus 10 outputs (gives notice of) the question to the speaker A according to the knowledge level of the speaker B on behalf of the speaker B. As the speaker A answers the question, the speaker B can respond to the speech content based on the answer without fully understanding the speech content. This makes it possible to assist in facilitating the dialog.
[0049] In the present embodiment, the knowledge level estimation unit 13 is an example of a first estimation unit. The question acquisition unit 14 is an example of an acquisition unit. The question output unit 15 is an example of an output unit. The understanding level estimation unit 12 is an example of a second estimation unit. The speaker A is an example of a first speaker. The speaker B is an example of a second speaker.
[0050] Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to such specific embodiments, and various modifications and changes can be made within the scope of the gist of the present disclosure described in the claims.
REFERENCE SIGNS LIST
[0051] 10 Dialog assistance apparatus
[0052] 11 Keyword extraction unit
[0053] 12 Understanding level estimation unit
[0054] 13 Knowledge level estimation unit
[0055] 14 Question acquisition unit
[0056] 15 Question output unit
[0057] 100 Drive device
[0058] 101 Recording medium
[0059] 102 Auxiliary storage device
[0060] 103 Memory device
[0061] 104 CPU
[0062] 105 Microphone
[0063] 106 Display device
[0064] 107 Camera
[0065] 121 Keyword storage unit
[0066] 122 Knowledge level DB
[0067] 123 Question DB
[0068] B Bus