SPEECH RECOGNITION APPARATUS, ACOUSTIC MODEL LEARNING APPARATUS, SPEECH RECOGNITION METHOD, AND COMPUTER-READABLE RECORDING MEDIUM
20230064137 · 2023-03-02
Assignee
Inventors
CPC classification
G10L15/22
PHYSICS
International classification
G10L15/06
PHYSICS
G10L15/22
PHYSICS
Abstract
A speech recognition apparatus 20 includes: a data acquisition unit 21 that acquires speech data and sensor data to be recognized; and a speech recognition unit 22 that converts the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
Claims
1. A speech recognition apparatus comprising: at least one memory storing instructions; and at least one processor configured to execute the instructions to: acquire speech data and sensor data to be recognized, and convert the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
2. The speech recognition apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: generate the embedded vector from the acquired sensor data, and convert the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
3. The speech recognition apparatus according to claim 1, wherein the at least one processor is further configured to execute the instructions to: construct the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
4. The speech recognition apparatus according to claim 3, wherein the at least one processor is further configured to execute the instructions to: input the sensor data related to the training data to a model that outputs data related to the sensor data as the sensor data is input, generate the embedded vector using the data output from the model, and construct the acoustic model using the generated embedded vector.
5. The speech recognition apparatus according to claim 1, wherein the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
6.-8. (canceled)
9. A speech recognition method comprising: acquiring speech data and sensor data to be recognized, converting the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
10. The speech recognition method according to claim 9, wherein the converting includes generating the embedded vector from the acquired sensor data and converting the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
11. The speech recognition method according to claim 9, further comprising: constructing the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
12. The speech recognition method according to claim 11, wherein the constructing includes: inputting the sensor data related to the training data to a model that outputs data related to the sensor data as the sensor data is input; generating the embedded vector using the data output from the model; and constructing the acoustic model using the generated embedded vector.
13. The speech recognition method according to claim 9, wherein the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
14. A non-transitory computer-readable recording medium that includes a program, the program including instructions that cause a computer to carry out: acquiring speech data and sensor data to be recognized, converting the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
15. The non-transitory computer-readable recording medium according to claim 14, wherein the converting includes generating the embedded vector from the acquired sensor data and converting the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
16. The non-transitory computer-readable recording medium according to claim 14, the program further including instructions that cause the computer to carry out: constructing the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
17. The non-transitory computer-readable recording medium according to claim 16, wherein the constructing includes: inputting the sensor data related to the training data to a model that outputs data related to the sensor data as the sensor data is input; generating the embedded vector using the data output from the model; and constructing the acoustic model using the generated embedded vector.
18. The non-transitory computer-readable recording medium according to claim 14, wherein the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
EXAMPLE EMBODIMENT
First Example Embodiment
[0032] Hereinafter, in the first example embodiment, an acoustic model learning apparatus, an acoustic model learning method, and a program for realizing these will be described with reference to
[0033] [Apparatus Configuration]
[0034] First, a configuration of the acoustic model learning apparatus according to the first example embodiment will be described using
[0035] The acoustic model learning apparatus 10 according to the first example embodiment shown in
[0036] In this configuration, the data acquisition unit 11 acquires speech data to be training data, teacher data to be the training data, and sensor data related to the training data. The acoustic model construction unit 12 constructs an acoustic model by machine learning using an embedded vector in addition to the speech data to be the training data and the teacher data to be the training data. The embedded vector is generated from the sensor data related to the training data acquired by the data acquisition unit 11.
[0037] As described above, in the first example embodiment, the acoustic model learning apparatus 10 can construct the acoustic model using the embedded vector generated without using speech recognition.
[0038] Subsequently, the configuration and function of the acoustic model learning apparatus 10 according to the first example embodiment will be described more specifically.
[0039] First, in the first example embodiment, the data acquisition unit 11 acquires speech data and teacher data to be training data from an external terminal device or the like connected by a network or the like. The teacher data is text data obtained by transcribing the utterance of the speech data.
[0040] In the first example embodiment, the acoustic model construction unit 12 first generates an embedded vector using the sensor data related to the training data. Specifically, the acoustic model construction unit 12 inputs the sensor data related to the training data into a model that, when the sensor data is input, outputs data related to the sensor data, and generates the embedded vector from the data output from the model. Examples of the sensor data include image data, temperature data, location data, time data, illuminance data, and the like. In the first example embodiment, any one of these is used.
[0041] An example of the embedded vector will be described below with reference to
[0042] In the example of
[0043] For example, in
[0044] In the example of
[0045] In the example of
[0046] In the example of
[0047] Next, the acoustic model construction unit 12 applies the acquired words to each dimension (leftmost column) of the preset vector to generate the embedded vector, setting each dimension that matches a word to “1” and each dimension that does not match to “0”.
[0048] In the example of
[0049] Further, in the example of
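The one-hot construction described in paragraph [0047] can be sketched as follows. This is a minimal illustration only: the vocabulary (the preset vector's dimensions) and the words obtained from the sensor data are hypothetical examples, since the document does not fix a concrete vocabulary.

```python
# Hypothetical preset vector: one dimension per vocabulary word.
VOCABULARY = ["kitchen", "car", "outdoor", "night", "meeting"]

def make_embedded_vector(words, vocabulary=VOCABULARY):
    """Set each dimension to 1 if its word was obtained from the
    sensor data, and to 0 otherwise (paragraph [0047])."""
    word_set = set(words)
    return [1 if w in word_set else 0 for w in vocabulary]

# e.g. an image-recognition model returned the labels "kitchen" and "night"
vec = make_embedded_vector(["kitchen", "night"])
print(vec)  # [1, 0, 0, 1, 0]
```

The same routine applies to any sensor type, as long as the data output from the model can be mapped to vocabulary words.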
[0050] [Apparatus Operation]
[0051] Next, the operation of the acoustic model learning apparatus 10 according to the first example embodiment will be described with reference to
[0052] As shown in
[0053] Next, the acoustic model construction unit 12 generates an embedded vector using the sensor data acquired in step A1 (step A2). Specifically, for example, when the sensor data is image data, the acoustic model construction unit 12 generates the embedded vector by the method shown in
[0054] Next, the acoustic model construction unit 12 constructs the acoustic model by adding the embedded vector generated in step A2 to the training data acquired in step A1 and executing machine learning (step A3). Specifically, the acoustic model construction unit 12 updates the parameters of the acoustic model by inputting the training data and the embedded vector into the existing acoustic model, for example.
[0055] Steps A1 to A3 are executed each time training data is acquired. Further, by repeatedly executing steps A1 to A3, the accuracy of the acoustic model is also improved.
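Steps A1 to A3 can be sketched as below, assuming the acoustic model accepts the speech features concatenated with the embedded vector. The `AcousticModel` class, its update rule, and the feature shapes are placeholder assumptions for illustration; a real system would update parameters by backpropagating a loss against the teacher data.

```python
class AcousticModel:
    """Stand-in for an existing acoustic model (not the patent's design)."""
    def __init__(self, input_dim):
        self.weights = [0.0] * input_dim  # placeholder parameters

    def update(self, features, teacher_text):
        # Placeholder parameter update; a real implementation would
        # compute a loss against the teacher transcription.
        for i, x in enumerate(features):
            self.weights[i] += 0.01 * x

def train_step(model, speech_features, embedded_vector, teacher_text):
    # Step A3: add the embedded vector to the training input and learn.
    features = list(speech_features) + list(embedded_vector)
    model.update(features, teacher_text)

# Steps A1-A2 yielded speech features, an embedded vector, and teacher data.
model = AcousticModel(input_dim=3 + 5)
train_step(model, [0.2, 0.5, 0.1], [1, 0, 0, 1, 0], "turn on the light")
```

Repeating this step for each acquired training sample corresponds to the iteration described in paragraph [0055].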
[0056] As described above, according to the first example embodiment, it is possible to construct the acoustic model using the embedded vector generated without using speech recognition. Therefore, according to this acoustic model, it is possible to perform speech recognition using an embedded vector generated without using speech recognition.
[0057] [Modified example]
[0058] In the first example embodiment described above, the sensor data is only one of the image data, the temperature data, the location data, the time data, and the illuminance data, but the first example embodiment is not limited to this aspect. In the first example embodiment, the sensor data may be a combination of two or more among image data, temperature data, location data, time data, and illuminance data. Further, in this case, the acoustic model construction unit 12 generates the embedded vector for each of the combined sensor data and executes machine learning using the generated embedded vector for each combined sensor data.
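The combined-sensor case above can be sketched as generating one embedded vector per sensor type. The per-type vocabularies below are hypothetical; the document only requires that each combined sensor data yields its own embedded vector.

```python
# Hypothetical per-sensor-type vocabularies.
VOCABULARIES = {
    "image": ["kitchen", "car", "outdoor"],
    "time": ["morning", "daytime", "night"],
}

def embed(sensor_type, words):
    """One-hot embedded vector for a single sensor type."""
    vocab = VOCABULARIES[sensor_type]
    word_set = set(words)
    return [1 if w in word_set else 0 for w in vocab]

def embed_all(sensor_words):
    """One embedded vector for each of the combined sensor data."""
    return {t: embed(t, ws) for t, ws in sensor_words.items()}

vectors = embed_all({"image": ["kitchen"], "time": ["night"]})
print(vectors)  # {'image': [1, 0, 0], 'time': [0, 0, 1]}
```

Machine learning is then executed using each of these generated vectors, as described in the modified example.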
[0059] [Program]
[0060] It is sufficient that the program according to the first example embodiment be a program that causes a computer to execute steps A1 to A3 illustrated in
[0061] Also, the program according to the first example embodiment may be executed by a computer system constructed by a plurality of computers. In this case, for example, each computer may function as one of the data acquisition unit 11 and the acoustic model construction unit 12.
Second Example Embodiment
[0062] Next, in the second example embodiment, a speech recognition apparatus, a speech recognition method, and a program for realizing these will be described with reference to
[0063] [Apparatus Configuration]
[0064] First, a configuration of the speech recognition apparatus according to the second example embodiment will be described with reference to
[0065] The speech recognition apparatus 20 according to the second example embodiment shown in
[0066] In this configuration, the data acquisition unit 21 acquires speech data and sensor data to be recognized. The speech recognition unit 22 converts the acquired speech data into text data by applying the acquired speech data and sensor data to the acoustic model.
[0067] Further, in the second example embodiment, the acoustic model is constructed by machine learning using an embedded vector generated from sensor data related to the training data in addition to the speech data to be the training data and teacher data to be the training data.
[0068] Therefore, according to the speech recognition apparatus 20 in the second example embodiment, the speech recognition can be executed by using the embedded vector generated without using the speech recognition.
[0069] Subsequently, the configuration and function of the speech recognition apparatus 20 according to the second example embodiment will be described more specifically.
[0070] First, in the second example embodiment, the data acquisition unit 21 acquires speech data and sensor data to be recognized from an external terminal device or the like connected by a network or the like. Examples of the sensor data include image data, temperature data, location data, time data, illuminance data, and the like, as in the first example embodiment.
[0071] Further, the acoustic model used in the second example embodiment is constructed by the acoustic model learning apparatus 10 according to the first example embodiment using the embedded vector. Therefore, in the second example embodiment, the speech recognition unit 22 first generates the embedded vector from the sensor data acquired by the data acquisition unit 21. Specifically, the speech recognition unit 22 generates the embedded vector by the same method as the acoustic model construction unit 12 according to the first example embodiment.
[0072] For example, when the sensor data is image data, the speech recognition unit 22 generates the embedded vector by the method shown in
[0073] Then, the speech recognition unit 22 converts the speech data into text data by applying the speech data and the generated embedded vector to the acoustic model.
[0074] [Apparatus Operation]
[0075] Next, the operation of the speech recognition apparatus 20 according to the second example embodiment will be described with reference to
[0076] As shown in
[0077] Next, the speech recognition unit 22 generates the embedded vector using the sensor data acquired in step B1 (step B2). Specifically, for example, when the sensor data is image data, the speech recognition unit 22 generates the embedded vector by the method shown in
[0078] Next, the speech recognition unit 22 converts the speech data into text data by applying the speech data acquired in step B1 and the embedded vector generated in step B2 to the acoustic model (step B3). Further, the acoustic model used in step B3 is constructed by executing steps A1 to A3 shown in
[0079] Steps B1 to B3 are executed each time the speech data to be recognized and the sensor data are acquired. Further, the speech data is accurately recognized by steps B1 to B3.
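Steps B1 to B3 can be sketched as follows, assuming the trained acoustic model is callable on the concatenated speech features and embedded vector. The `recognize` function, the vocabulary, and the stand-in model are placeholder assumptions for illustration, not an interface defined by this document.

```python
def make_embedded_vector(words, vocabulary):
    """Step B2: one-hot embedded vector from sensor-derived words."""
    word_set = set(words)
    return [1 if w in word_set else 0 for w in vocabulary]

def recognize(acoustic_model, speech_features, sensor_words, vocabulary):
    # Step B2: generate the embedded vector from the sensor data.
    embedded = make_embedded_vector(sensor_words, vocabulary)
    # Step B3: apply the speech data and the embedded vector to the model.
    return acoustic_model(list(speech_features) + embedded)

# Stand-in "trained model": here it simply returns a fixed transcription.
dummy_model = lambda features: "hello world"
text = recognize(dummy_model, [0.1, 0.3], ["kitchen"], ["kitchen", "car"])
print(text)  # hello world
```

In practice, the model called here would be the one constructed by steps A1 to A3 of the first example embodiment.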
[0080] As described above, according to the second example embodiment, it is possible to execute speech recognition by using the embedded vector generated without using speech recognition.
Modified Example
[0081] As described in the modified example of the first example embodiment, also in the second example embodiment the sensor data may be a combination of two or more among the image data, the temperature data, the location data, the time data, and the illuminance data. In this case, the data acquisition unit 21 acquires all the combined sensor data. Further, the speech recognition unit 22 generates the embedded vector for each of the combined sensor data. Then, the speech recognition unit 22 applies the embedded vector generated for each sensor data to the acoustic model to convert the speech data into text data.
[0082] [Program]
[0083] It is sufficient that the program according to the second example embodiment be a program that causes a computer to execute steps B1 to B3 illustrated in
[0084] Also, the program according to the second example embodiment may be executed by a computer system constructed by a plurality of computers. In this case, for example, each computer may function as one of the data acquisition unit 21 and the speech recognition unit 22.
[0085] [Modified example]
[0086] Subsequently, a modified example of the speech recognition apparatus according to the second example embodiment will be described with reference to FIG. 10.
[0087] As shown in
[0088] With such the configuration, in this modified example, the speech recognition apparatus 20 can have a function as the acoustic model learning apparatus. In this modified example, it is possible to construct the acoustic model and perform speech recognition with one apparatus.
[0089] (Physical Configuration)
[0090] Here, a computer that realizes the acoustic model learning apparatus 10 by executing the program according to the first example embodiment, and a computer that realizes the speech recognition apparatus 20 by executing the program according to the second example embodiment will be described with reference to
[0091] As illustrated in
[0092] Note that the computer 110 may include a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) in addition to the CPU 111 or in place of the CPU 111.
[0093] The CPU 111 carries out various types of computation by deploying the program (codes) according to the example embodiment stored in the storage device 113 to the main memory 112, and executing the codes in a predetermined order. The main memory 112 is typically a volatile storage device, such as a DRAM (Dynamic Random-Access Memory). Also, the program according to the first and second example embodiments is provided in a state where it is stored in a computer readable recording medium 120. Note that the program according to the present example embodiment may also be distributed over the Internet connected via the communication interface 117.
[0094] Furthermore, specific examples of the storage device 113 include a hard disk drive, and also a semiconductor storage device, such as a flash memory. The input interface 114 mediates data transmission between the CPU 111 and an input device 118, such as a keyboard and a mouse. The display controller 115 is connected to a display device 119, and controls displays on the display device 119.
[0095] The data reader/writer 116 mediates data transmission between the CPU 111 and the recording medium 120, and executes readout of the program from the recording medium 120, as well as writing of the result of processing in the computer 110 to the recording medium 120. The communication interface 117 mediates data transmission between the CPU 111 and another computer.
[0096] Also, specific examples of the recording medium 120 include: a general-purpose semiconductor storage device, such as CF (Compact Flash®) and SD (Secure Digital); a magnetic recording medium, such as Flexible Disk; and an optical recording medium, such as CD-ROM (Compact Disk Read Only Memory).
[0097] Note that the acoustic model learning apparatus 10 and the speech recognition apparatus 20 according to the example embodiments can also be realized by using items of hardware corresponding to respective components, rather than by using the computer with the program installed therein. Furthermore, a part of the acoustic model learning apparatus 10 and the speech recognition apparatus 20 may be realized by the program, and the remaining part of these apparatus may be realized by hardware.
[0098] A part or all of the aforementioned example embodiment can be described as, but is not limited to, the following (Supplementary note 1) to (Supplementary note 24).
[0099] (Supplementary Note 1)
[0100] A speech recognition apparatus comprising:
[0101] a data acquisition unit that acquires speech data and sensor data to be recognized,
[0102] a speech recognition unit that converts the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
[0103] (Supplementary Note 2)
[0104] The speech recognition apparatus according to Supplementary note 1, wherein the speech recognition unit generates the embedded vector from the acquired sensor data and converts the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
[0105] (Supplementary Note 3)
[0106] The speech recognition apparatus according to Supplementary note 1 or 2, further comprising:
[0107] an acoustic model construction unit that constructs the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
[0108] (Supplementary Note 4)
[0109] The speech recognition apparatus according to Supplementary note 3, wherein the acoustic model construction unit inputs the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input,
[0110] the acoustic model construction unit generates the embedded vector using the data output from the model and constructs the acoustic model using the generated embedded vector.
[0111] (Supplementary Note 5)
[0112] The speech recognition apparatus according to any one of Supplementary notes 1 to 4,
[0113] the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
[0114] (Supplementary Note 6)
[0115] An acoustic model learning apparatus comprising:
[0116] a data acquisition unit that acquires speech data to be training data, teacher data to be the training data, and sensor data related to the training data,
[0117] an acoustic model construction unit that constructs an acoustic model by machine learning using an embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
[0118] (Supplementary Note 7)
[0119] The acoustic model learning apparatus according to Supplementary note 6,
[0120] wherein the acoustic model construction unit inputs the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input,
[0121] the acoustic model construction unit generates the embedded vector using the data output from the model and constructs the acoustic model using the generated embedded vector.
[0122] (Supplementary Note 8)
[0123] The acoustic model learning apparatus according to Supplementary note 6 or 7, the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
[0124] (Supplementary Note 9)
[0125] A speech recognition method comprising:
[0126] a data acquisition step of acquiring speech data and sensor data to be recognized,
[0127] a speech recognition step of converting the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
[0128] (Supplementary Note 10)
[0129] The speech recognition method according to Supplementary note 9,
[0130] wherein, in the speech recognition step, generating the embedded vector from the acquired sensor data and converting the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
[0131] (Supplementary Note 11)
[0132] The speech recognition method according to Supplementary note 9 or 10, further comprising:
[0133] an acoustic model construction step of constructing the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
[0134] (Supplementary Note 12)
[0135] The speech recognition method according to Supplementary note 11,
[0136] wherein, in the acoustic model construction step,
[0137] inputting the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input,
[0138] generating the embedded vector using the data output from the model, and
[0139] constructing the acoustic model using the generated embedded vector.
[0140] (Supplementary Note 13)
[0141] The speech recognition method according to any one of Supplementary notes 9 to 12,
[0142] the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
[0143] (Supplementary Note 14)
[0144] An acoustic model construction method comprising:
[0145] a data acquisition step of acquiring speech data to be training data, teacher data to be the training data, and sensor data related to the training data,
[0146] an acoustic model construction step of constructing an acoustic model by machine learning using an embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
[0147] (Supplementary Note 15)
[0148] The acoustic model construction method according to Supplementary note 14,
[0149] wherein, in the acoustic model construction step, inputting the sensor data related to the training data to a model that outputs data related to the sensor data as the sensor data is input, and
[0150] generating the embedded vector using the data output from the model and constructing the acoustic model using the generated embedded vector.
[0151] (Supplementary Note 16)
[0152] The acoustic model construction method according to Supplementary note 14 or 15,
[0153] the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
[0154] (Supplementary Note 17)
[0155] A computer-readable recording medium that includes a program, the program including instructions that cause a computer to carry out:
[0156] a data acquisition step of acquiring speech data and sensor data to be recognized,
[0157] a speech recognition step of converting the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
[0158] (Supplementary Note 18)
[0159] The computer-readable recording medium according to Supplementary note 17,
[0160] wherein, in the speech recognition step, generating the embedded vector from the acquired sensor data and converting the acquired speech data into text data by applying the acquired speech data and the generated embedded vector to the acoustic model.
[0161] (Supplementary Note 19)
[0162] The computer-readable recording medium according to Supplementary note 17 or 18, the program further including instruction that cause the computer to carry out:
[0163] an acoustic model construction step of constructing the acoustic model by machine learning using the embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
[0164] (Supplementary Note 20)
[0165] The computer-readable recording medium according to Supplementary note 19,
[0166] wherein, in the acoustic model construction step,
[0167] inputting the sensor data related to the training data, to a model that outputs data related to the sensor data as the sensor data is input,
[0168] generating the embedded vector using the data output from the model, and
[0169] constructing the acoustic model using the generated embedded vector.
[0170] (Supplementary Note 21)
[0171] The computer-readable recording medium according to any one of Supplementary notes 17 to 20,
[0172] the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
[0173] (Supplementary Note 22)
[0174] A computer-readable recording medium that includes a program, the program including instructions that cause a computer to carry out:
[0175] a data acquisition step of acquiring speech data to be training data, teacher data to be the training data, and sensor data related to the training data,
[0176] an acoustic model construction step of constructing an acoustic model by machine learning using an embedded vector generated from the sensor data related to the training data in addition to the speech data to be the training data and the teacher data to be the training data.
[0177] (Supplementary Note 23)
[0178] The computer-readable recording medium according to Supplementary note 22,
[0179] wherein, in the acoustic model construction step, inputting the sensor data related to the training data to a model that outputs data related to the sensor data as the sensor data is input, and
[0180] generating the embedded vector using the data output from the model and constructing the acoustic model using the generated embedded vector.
[0181] (Supplementary Note 24)
[0182] The computer-readable recording medium according to Supplementary note 22 or 23,
[0183] the sensor data is any one of image data, temperature data, location data, time data, and illuminance data, or a combination of two or more of them.
[0184] The invention has been described with reference to an example embodiment above, but the invention is not limited to the above-described example embodiment. Within the scope of the invention, various changes that could be understood by a person skilled in the art could be applied to the configurations and details of the invention.
INDUSTRIAL APPLICABILITY
[0185] As described above, according to the present invention, it is possible to perform speech recognition using the embedded vector generated without using speech recognition. The present invention is effective for various systems in which speech recognition is performed.
LIST OF REFERENCE SIGNS
[0186] 10 Acoustic model learning apparatus [0187] 11 Data acquisition unit [0188] 12 Acoustic model construction unit [0189] 20 Speech recognition apparatus [0190] 21 Data acquisition unit [0191] 22 Speech recognition unit [0192] 23 Acoustic model construction unit [0193] 110 Computer [0194] 111 CPU [0195] 112 Main memory [0196] 113 Storage device [0197] 114 Input interface [0198] 115 Display controller [0199] 116 Data reader/writer [0200] 117 Communication interface [0201] 118 Input device [0202] 119 Display device [0203] 120 Recording medium [0204] 121 Bus