Patent classifications
G10L15/1815
ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF
An electronic apparatus includes: a memory storing one or more instructions; and a processor connected to the memory and configured to control the electronic apparatus, wherein the processor is configured, by executing the one or more instructions, to: identify a first intention word and a first target word from first speech, acquire second speech received after the first speech based on at least one of the identified first intention word or the identified first target word not matching a word stored in the memory, acquire a similarity between the first speech and the second speech, and acquire response information based on the first speech and the second speech based on the similarity being a threshold value or more.
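The flow claimed above can be illustrated with a minimal Python sketch: identify an intention/target word pair, fall back to a follow-up utterance when a word is unrecognized, and combine the two utterances only if they are similar enough. The word lists, the two-word identification rule, and the Jaccard similarity measure are assumptions for illustration, not taken from the patent.

```python
# Toy vocabulary of words "stored in the memory"; illustrative only.
KNOWN_WORDS = {"turn", "on", "off", "lamp", "tv", "volume", "up"}
THRESHOLD = 0.3  # similarity threshold (assumed value)

def identify(speech: str):
    """Treat the first two words as the (intention, target) pair."""
    words = speech.lower().split()
    return (words[0], words[1]) if len(words) >= 2 else (words[0], "")

def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two utterances."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def respond(first: str, second: str):
    intention, target = identify(first)
    # If both words are recognized, the first speech alone suffices.
    if intention in KNOWN_WORDS and target in KNOWN_WORDS:
        return f"response to: {first}"
    # Otherwise consult the second speech, but only if it is similar enough.
    if similarity(first, second) >= THRESHOLD:
        return f"response to: {first} + {second}"
    return None
```

The threshold keeps an unrelated follow-up utterance from contaminating the response to a misrecognized first utterance.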
DIGITAL ASSISTANT REFERENCE RESOLUTION
Systems and processes for operating a digital assistant are provided. An example process for performing a task includes, at an electronic device having one or more processors and memory, receiving a spoken input including a request, receiving an image input including a plurality of objects, selecting a reference resolution module of a plurality of reference resolution modules based on the request and the image input, determining, with the selected reference resolution module, whether the request references a first object of the plurality of objects based on at least the spoken input, and in accordance with a determination that the request references the first object of the plurality of objects, determining a response to the request including information about the first object.
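A hedged sketch of the module-selection step described above: route a spoken request to one of several reference resolution modules based on what the request mentions, then check whether it references a detected object. The two modules, the routing rule, and the object attributes are invented for the example.

```python
def color_module(request, objects):
    # Resolve references like "the red one" by matching color attributes.
    return next((o for o in objects if o["color"] in request), None)

def label_module(request, objects):
    # Resolve references by object category ("the cup", "the book").
    return next((o for o in objects if o["label"] in request), None)

def select_module(request, objects):
    # Route color-describing requests to the color module, else by label.
    colors = {o["color"] for o in objects}
    return color_module if any(c in request for c in colors) else label_module

def respond(request, objects):
    module = select_module(request, objects)
    obj = module(request, objects)
    return f"That is a {obj['color']} {obj['label']}." if obj else "No match."
```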
MULTIMODAL SPEECH RECOGNITION METHOD AND SYSTEM, AND COMPUTER-READABLE STORAGE MEDIUM
The disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium. The method includes calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when a target millimeter-wave signal and a target audio signal both contain speech information corresponding to a target user; inputting the first and the second logarithmic mel-frequency spectral coefficients into a fusion network to determine a target fusion feature, where the fusion network includes at least a calibration module and a mapping module, the calibration module is configured to perform mutual feature calibration between the target audio signal and the target millimeter-wave signal, and the mapping module is configured to fuse a calibrated millimeter-wave feature and a calibrated audio feature; and inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user. The disclosure can implement high-accuracy speech recognition.
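The calibration-then-mapping step can be sketched in a few lines: each modality's feature is rescaled by a gate derived from the other modality (a stand-in for "mutual feature calibration"), and the mapping step fuses the two by concatenation. The gating arithmetic is invented for the example; the abstract does not specify the modules' internals.

```python
def calibrate(feat_audio, feat_mmwave):
    """Scale each feature by a summary of the other modality (mutual calibration)."""
    gate_audio = sum(feat_mmwave) / len(feat_mmwave)  # gate from mmWave feature
    gate_mmwave = sum(feat_audio) / len(feat_audio)   # gate from audio feature
    cal_audio = [x * gate_audio for x in feat_audio]
    cal_mmwave = [x * gate_mmwave for x in feat_mmwave]
    return cal_audio, cal_mmwave

def fuse(feat_audio, feat_mmwave):
    """Calibration module followed by a mapping (here: concatenation) module."""
    cal_audio, cal_mmwave = calibrate(feat_audio, feat_mmwave)
    return cal_audio + cal_mmwave  # fused feature vector for the semantic network
```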
Tracking specialized concepts, topics, and activities in conversations
Embodiments are directed to organizing conversation information. A tracker vocabulary may be provided to a universal model to predict a generalized vocabulary associated with the tracker vocabulary. A tracker model may be generated based on the portions of the universal model activated by the tracker vocabulary such that a remainder of the universal model may be excluded from the tracker model. Portions of a conversation stream may be provided to the tracker model. A match score may be generated based on the tracker model and the portions of the conversation stream such that the match score predicts if the portions of the conversation stream may be in the generalized vocabulary predicted for the tracker vocabulary. Tracker metrics may be collected based on the portions of the conversation and the match scores such that the tracker metrics may be included in reports or notifications.
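The vocabulary-generalization and scoring steps can be illustrated with a toy version in Python: a hand-written synonym table stands in for the universal model's predictions, and the match score is the fraction of a fragment's words that land in the generalized vocabulary. Both the table and the scoring rule are assumptions for the example.

```python
# Stand-in for the universal model: maps a tracker word to a generalized set.
SYNONYMS = {
    "price": {"price", "cost", "pricing", "quote"},
    "discount": {"discount", "rebate", "deal"},
}

def generalize(tracker_vocabulary):
    """Predict a generalized vocabulary from the tracker vocabulary."""
    generalized = set()
    for word in tracker_vocabulary:
        generalized |= SYNONYMS.get(word, {word})
    return generalized

def match_score(fragment, generalized):
    """Fraction of fragment words that fall in the generalized vocabulary."""
    words = fragment.lower().split()
    hits = sum(w in generalized for w in words)
    return hits / len(words) if words else 0.0
```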
Electronic apparatus and control method thereof
An electronic apparatus is provided. The electronic apparatus includes a microphone, a memory configured to store a plurality of keyword recognition models, and a processor, which is coupled with the microphone and the memory, configured to control the electronic apparatus, wherein the processor is configured to selectively execute at least one keyword recognition model among the plurality of keyword recognition models based on operating state information of the electronic apparatus, based on a first user voice being input through the microphone, identify whether at least one keyword corresponding to the executed keyword recognition model is included in the first user voice by using the executed keyword recognition model, and based on at least one keyword identified as being included in the first user voice, perform an operation of the electronic apparatus corresponding to the at least one keyword.
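A minimal sketch of the selective-execution idea above: the device's operating state selects which keyword recognition model (here, a keyword set) is active, and an utterance triggers an operation only if it contains one of that model's keywords. The states, keywords, and operations are illustrative assumptions.

```python
# One "keyword recognition model" per operating state; contents are invented.
KEYWORD_MODELS = {
    "playing_music": {"pause", "next", "stop"},
    "idle": {"hello", "wake"},
}

ACTIONS = {"pause": "pause playback", "next": "skip track",
           "stop": "stop playback", "hello": "greet", "wake": "wake up"}

def handle_voice(state, utterance):
    # Selectively "execute" the model matching the operating state.
    keywords = KEYWORD_MODELS.get(state, set())
    for word in utterance.lower().split():
        if word in keywords:
            return ACTIONS[word]  # operation corresponding to the keyword
    return None
```

Restricting the active keyword set to the current state keeps, for example, "pause" from firing while no media is playing.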
Automated clinical documentation system and method
A method, computer program product, and computing system for proactive encounter scanning is executed on a computing device and includes obtaining encounter information of a patient encounter. The encounter information is proactively processed to determine if the encounter information is indicative of one or more medical conditions and to generate one or more result sets. The one or more result sets are provided to the user.
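The proactive-processing step might be sketched as scanning encounter text against symptom patterns and emitting a result set per matched condition. The condition table and the two-hit threshold are made up for illustration and are not clinical guidance.

```python
# Invented symptom patterns; purely illustrative.
CONDITION_PATTERNS = {
    "migraine": {"headache", "nausea", "aura"},
    "influenza": {"fever", "cough", "aches"},
}

def scan_encounter(encounter_text, min_hits=2):
    """Return one result set per condition the encounter text is indicative of."""
    words = set(encounter_text.lower().split())
    results = []
    for condition, symptoms in CONDITION_PATTERNS.items():
        hits = words & symptoms
        if len(hits) >= min_hits:  # enough overlap to flag the condition
            results.append({"condition": condition, "evidence": sorted(hits)})
    return results
```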
Machine learning for interpretation of subvocalizations
Provided is an in-ear device and associated computational support system that leverages machine learning to interpret sensor data descriptive of one or more in-ear phenomena during subvocalization by the user. An electronic device can receive sensor data generated by at least one sensor at least partially positioned within an ear of a user, wherein the sensor data was generated by the at least one sensor concurrently with the user subvocalizing a subvocalized utterance. The electronic device can then process the sensor data with a machine-learned subvocalization interpretation model to generate an interpretation of the subvocalized utterance as an output of the machine-learned subvocalization interpretation model.
Method for exiting a voice skill, apparatus, device and storage medium
A method for exiting a voice skill, an apparatus, a device, and a storage medium are provided by embodiments of the present disclosure, wherein a user voice instruction is received; a target exit intention corresponding to the user voice instruction is identified according to the user voice instruction and a grammar rule of a preset exit intention; and a corresponding operation is executed on a current voice skill of a device according to the target exit intention. The embodiments of the present disclosure refine and expand the user's exit intention. After the target exit intention to which the user voice instruction belongs is identified, the corresponding operation is executed according to the target exit intention, so as to meet users' different exit requirements for voice skills, enhance the fluency and convenience of user interaction with the device, and improve the user's experience when exiting voice skills.
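The grammar-rule matching above can be sketched as a table of trigger phrases per exit intention: the instruction is matched against each rule, and the matched intention selects the operation to execute on the current skill. The intentions, phrases, and operations are example assumptions.

```python
# Grammar rules as trigger phrases per preset exit intention (illustrative).
EXIT_RULES = {
    "exit_and_close": ("quit completely", "close the skill"),
    "exit_to_background": ("go to background", "minimize"),
    "exit_current_page": ("go back", "previous page"),
}

OPERATIONS = {
    "exit_and_close": "terminate skill",
    "exit_to_background": "suspend skill",
    "exit_current_page": "pop page",
}

def handle_exit(instruction):
    text = instruction.lower()
    for intention, phrases in EXIT_RULES.items():
        if any(p in text for p in phrases):
            return OPERATIONS[intention]  # operation for the matched intention
    return None  # instruction carries no exit intention
```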
Content generation framework
Techniques for outputting additional content associated with but nonresponsive to an input command are described. A system receives input data from a device. The system determines an intent representing the input data and receives first output data responsive to the input data. The system determines, based on context data, that additional content associated with the first output data but nonresponsive to the input data should be output. The system receives second output data associated with but nonresponsive to the input data thereafter. The system then presents first content corresponding to the first output data and second content corresponding to the second output data.
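A toy rendering of the decision above: after producing the responsive content, consult context data to decide whether to append related-but-nonresponsive content. The tip table and the "already played" context flag are invented for the example.

```python
# Nonresponsive content keyed by intent; illustrative only.
TIPS = {"weather": "Tip: you can ask for a weekly forecast."}

def build_output(intent, direct_response, context):
    outputs = [direct_response]  # first content: responsive to the input
    # Context decides whether the extra, nonresponsive content is appropriate.
    if intent in TIPS and not context.get("tip_played", False):
        outputs.append(TIPS[intent])  # second content: associated, nonresponsive
    return outputs
```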
Generating input alternatives
Exemplary embodiments relate to a system for recovering a conversation between a user and the system when the system is unable to properly respond to a user's input. The system may process the user input and determine an error condition exists. The system may query one or more storage systems to identify candidate text data based on their semantic similarity to the user input. The storage systems may store data related to past frequently entered inputs and/or user-generated inputs. Alternative text data is selected from the candidate text data, and presented to the user for confirmation.
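The recovery flow above can be sketched as follows: on an error condition, rank stored past inputs by a crude semantic similarity (word-overlap Jaccard, standing in for whatever measure the system uses) and offer the best candidate for confirmation. The stored inputs and the acceptance threshold are toy assumptions.

```python
# Stand-in for the storage system of past frequently entered inputs.
PAST_INPUTS = ["play some jazz music", "set a timer", "what is the weather"]

def jaccard(a, b):
    """Word-overlap similarity; a placeholder for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def suggest_alternative(failed_input, min_score=0.2):
    """Pick the most similar past input and offer it for user confirmation."""
    scored = [(jaccard(failed_input, p), p) for p in PAST_INPUTS]
    score, best = max(scored)
    if score >= min_score:
        return f"Did you mean: '{best}'?"
    return None  # nothing close enough to present
```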