Patent classifications
G10L2015/223
Speech and Computer Vision-Based Control
The present disclosure relates to a method for controlling a digital photography system. The method includes obtaining, by a device, image data and audio data. The method also includes identifying one or more objects in the image data and obtaining a transcription of the audio data. The method also includes controlling a future operation of the device based at least on the one or more objects identified in the image data, and the transcription of the audio data.
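The control step described here — choosing a future device operation from detected objects plus a voice transcription — can be sketched as a simple mapping. Everything below (function name, object labels, phrase rules, action strings) is an illustrative assumption, not taken from the patent:

```python
# Hypothetical sketch: pick a camera action from detected objects and a
# spoken transcription. Labels and rules are invented for illustration.

def choose_action(objects: list, transcription: str) -> str:
    """Map (objects identified in image data, transcription) to an action."""
    text = transcription.lower()
    if "portrait" in text and "person" in objects:
        return "enable_portrait_mode"  # vision and speech cues agree
    if "closer" in text or "zoom" in text:
        return "zoom_in"
    if "capture" in text or "photo" in text:
        return "capture_image"
    return "no_op"  # nothing actionable in this turn
```

A real system would replace the keyword rules with a trained intent model, but the shape of the decision — joint conditioning on vision and speech — is the same.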
STATE MACHINE BASED CONTEXT-SENSITIVE SYSTEM FOR MANAGING MULTI-ROUND DIALOG
The present invention discloses a state machine based context-sensitive multi-round dialog management system, comprising: an input module, for receiving multi-modal input information from a user; an intention identification engine module, for identifying intention information in the multi-modal input information; an intention module, for bringing the multiple pieces of intention information identified by the intention identification engine module into one-to-one correspondence with multiple back-end intention sub-modules; a state machine module, comprising a plurality of state machines for managing a relevant context in the dialog management system and providing support for an output result; an instruction parsing engine module, comprising a plurality of instruction parsing engine sub-modules for parsing the corresponding intention information and acquiring the parsed pieces of intention information; and an output module, for acquiring policy information according to the results from the instruction parsing engine module and the intention identification engine module, and transmitting the policy information to the state machine module.
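The state machine module at the core of this abstract can be sketched as a transition table over dialog states. The state and event names below are invented stand-ins, not the patent's:

```python
# Minimal multi-round dialog state machine, in the spirit of the abstract.
# States and events are hypothetical examples.

class DialogStateMachine:
    TRANSITIONS = {
        ("idle", "intent_detected"): "collecting_slots",
        ("collecting_slots", "slot_filled"): "collecting_slots",
        ("collecting_slots", "slots_complete"): "confirming",
        ("confirming", "confirmed"): "done",
        ("confirming", "rejected"): "collecting_slots",
    }

    def __init__(self):
        self.state = "idle"

    def handle(self, event: str) -> str:
        # Unknown (state, event) pairs leave the state unchanged,
        # which is how context survives irrelevant turns.
        self.state = self.TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

One such machine per intent sub-module would keep each intent's context isolated across rounds, matching the one-to-one correspondence the abstract describes.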
POLICY AUTHORING FOR TASK STATE TRACKING DURING DIALOGUE
Conversational understanding systems allow users to conversationally interface with a computing device. In examples, a query may be received that includes a request for execution of a task. A data exchange task definition may be accessed. The data exchange task definition assists a conversational understanding system in managing task state tracking for information needed for task execution. Using the data exchange task definition, a per-turn policy for interacting with the user computing device is generated based on the state of a dialogue with a computing device and an evaluation of a process flow chart provided by a task owner resource. The task owner resource may be independent from the conversational understanding system. A response to the query may be generated and output based on the per-turn policy. In examples, the per-turn policy is used to generate one or more responses during a dialogue with a user via a computing device.
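The per-turn policy this abstract describes — inspect the dialogue state against a task definition, then either prompt for missing information or execute — can be sketched as follows. The task definition shape and field names are assumptions for illustration:

```python
# Hedged sketch of a per-turn policy driven by a data exchange task
# definition: prompt for the next missing field, execute once complete.

def per_turn_policy(task_definition: dict, dialogue_state: dict) -> dict:
    """Return the action for this turn given collected dialogue state."""
    for field in task_definition["required_fields"]:
        if field not in dialogue_state:
            return {"action": "prompt", "field": field}
    return {"action": "execute", "task": task_definition["name"]}
```

In the patent's framing, the required fields and ordering would come from the task owner's process flow chart rather than a hand-written list, keeping the task owner resource independent from the conversational understanding system.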
PERFORMING TASKS AND RETURNING AUDIO AND VISUAL ANSWERS BASED ON VOICE COMMAND
An artificial intelligence voice interactive system may provide various services to a user in response to a voice command by providing an interface between the system and a legacy system, enabling existing services to be provided in response to user speech without modifying the systems for those services. Such a system includes a central server, which may perform operations of: registering a plurality of service servers at the central server and storing registration information of each service server; analyzing voice command data from the user device and determining at least one task and corresponding service servers based on the analysis results; generating an instruction message based on the voice command data, the determined at least one task, and the registration information of the selected service servers; transmitting the generated instruction message to the selected service servers; and receiving task results including audio and video data from the selected service servers and outputting the task results through at least one device associated with the user device.
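The central server's register-analyze-route loop can be sketched as a small registry plus a dispatcher. The task detection here is a placeholder keyword check; in the abstract it would be a full voice-command analysis, and all names are invented:

```python
# Illustrative central-server sketch: register service servers, then
# route a voice command to the matching one as an instruction message.

class CentralServer:
    def __init__(self):
        self.registry = {}  # task name -> service server registration info

    def register(self, task: str, server_info: dict):
        self.registry[task] = server_info

    def route(self, voice_command: str) -> dict:
        # Placeholder analysis: keyword match stands in for real NLU.
        task = "weather" if "weather" in voice_command.lower() else "unknown"
        server = self.registry.get(task)
        if server is None:
            return {"error": "no service server registered for task"}
        return {"task": task, "server": server["name"],
                "payload": voice_command}
```

The registration info stored per server is what lets the instruction message be tailored to each legacy service without modifying that service.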
SOFTWARE APPLICATIONS AND INFORMATION APPARATUS FOR PRINTING OVER AIR OR FOR PRINTING OVER A NETWORK
Information apparatus and application software supporting printing over air or network are herein disclosed and enabled. The information apparatus may include one or more software components that include (1) a discovery component to discover a supported printer in a local area network (LAN) and to receive device information related to the printer (e.g., capability, language or format supported, identification) from the printer, and (2) a printing component to generate or obtain print data based on the device information received and to transmit the print data to the discovered printer. After establishing the connection to the LAN, application software (e.g., Internet browser, email, photos, documents) in the information apparatus may print digital content by using the discovery component to discover the printer in the LAN, and may use the printing component to obtain and transmit print data in a form that is acceptable to the printer for printing the digital content.
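The two software components named in this abstract — discovery and printing — can be sketched as a pair of functions. The device-record fields below (`type`, `id`, `formats`) are invented stand-ins for the device information the discovery component would receive:

```python
# Hypothetical sketch of the abstract's two components: discover printers
# on the LAN, then build print data in a format the printer accepts.

def discover_printers(lan_devices: list) -> list:
    """Discovery component: filter LAN devices down to supported printers."""
    return [d for d in lan_devices if d.get("type") == "printer"]

def build_print_job(content: str, printer: dict) -> dict:
    """Printing component: package content in a printer-acceptable format."""
    fmt = printer["formats"][0]  # pick a format the printer reports
    return {"target": printer["id"], "format": fmt, "data": content}
```

The key point the sketch preserves is ordering: print data is generated only after device information arrives, so it can match the printer's supported language or format.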
SPEAKER VERIFICATION USING CO-LOCATION INFORMATION
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying a user in a multi-user environment. One of the methods includes receiving, by a first user device, an audio signal encoding an utterance, obtaining, by the first user device, a first speaker model for a first user of the first user device, obtaining, by the first user device for a second user of a second user device that is co-located with the first user device, a second speaker model for the second user or a second score that indicates a respective likelihood that the utterance was spoken by the second user, and determining, by the first user device, that the utterance was spoken by the first user using (i) the first speaker model and the second speaker model or (ii) the first speaker model and the second score.
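The comparison step in this abstract — the first device scoring the utterance under its own speaker model and weighing it against the co-located second user's score — can be sketched as below. The "model" here is a toy word-weight dictionary, nothing like a real speaker model; every name is an illustrative assumption:

```python
# Toy sketch of the co-location comparison: score the utterance under the
# first user's model and attribute it to them only if that score wins.

def score(model: dict, utterance: str) -> float:
    """Stand-in likelihood that the utterance matches a speaker model."""
    words = utterance.lower().split()
    return sum(model.get(w, 0.0) for w in words) / max(len(words), 1)

def spoken_by_first_user(first_model: dict, second_score: float,
                         utterance: str) -> bool:
    """Variant (ii) of the abstract: first model vs. second user's score."""
    return score(first_model, utterance) > second_score
```

Variant (i) of the abstract would score against the second speaker model directly instead of receiving a precomputed second score.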
REDUCING THE NEED FOR MANUAL START/END-POINTING AND TRIGGER PHRASES
Systems and processes for selectively processing and responding to a spoken user input are provided. In one example, audio input containing a spoken user input can be received at a user device. The spoken user input can be identified from the audio input by identifying start and end-points of the spoken user input. It can be determined whether or not the spoken user input was intended for a virtual assistant based on contextual information. The determination can be made using a rule-based system or a probabilistic system. If it is determined that the spoken user input was intended for the virtual assistant, the spoken user input can be processed and an appropriate response can be generated. If it is instead determined that the spoken user input was not intended for the virtual assistant, the spoken user input can be ignored and/or no response can be generated.
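The rule-based variant of the intent-for-assistant determination can be sketched as a disjunction of contextual cues. The specific rules and context keys below are invented examples, not the patent's:

```python
# Illustrative rule-based gate: was this spoken input meant for the
# virtual assistant? Any single contextual cue suffices in this sketch.

def intended_for_assistant(utterance: str, context: dict) -> bool:
    rules = [
        context.get("assistant_recently_spoke", False),  # mid-conversation
        context.get("user_facing_device", False),        # gaze/orientation cue
        utterance.lower().startswith(("what", "how", "set", "play")),
    ]
    return any(rules)
```

A probabilistic system, the abstract's other option, would replace `any(rules)` with a weighted score against a threshold; inputs failing the gate are simply ignored, which is what removes the need for a trigger phrase.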
Contextual assistant using mouse pointing or touch cues
A method for a contextual assistant to use mouse pointing or touch cues includes receiving audio data corresponding to a query spoken by a user, receiving, in a graphical user interface displayed on a screen, a user input indication indicating a spatial input applied at a first location on the screen, and processing the audio data to determine a transcription of the query. The method also includes performing query interpretation on the transcription to determine that the query is referring to an object displayed on the screen without uniquely identifying the object, and requesting information about the object. The method further includes disambiguating, using the user input indication indicating the spatial input applied at the first location on the screen, the query to uniquely identify the object that the query is referring to, obtaining the information about the object requested by the query, and providing a response to the query.
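The disambiguation step here — resolving "this" or "that" in a query by hit-testing the pointer or touch location against on-screen objects — can be sketched as a bounding-box lookup. Object records and the `bbox` field are assumptions for illustration:

```python
# Sketch of spatial disambiguation: find which displayed object contains
# the location where the user pointed or touched.

def disambiguate(objects: list, point: tuple):
    """Return the object whose bounding box contains the point, if any."""
    x, y = point
    for obj in objects:
        left, top, right, bottom = obj["bbox"]
        if left <= x <= right and top <= y <= bottom:
            return obj
    return None  # query remains ambiguous without a spatial hit
```

The returned object is what lets the assistant turn a query like "what is this?" into a request about one uniquely identified item.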
HEAD-MOUNTED DISPLAY SYSTEM AND OPERATING METHOD FOR HEAD-MOUNTED DISPLAY DEVICE
Operability of head-mounted display systems is enhanced by incorporating the following: a microphone which receives an utterance input by a person and outputs voice information; a character string generation unit which generates an uttered character string by converting the voice information into a character string; a specific utterance information storage unit which stores specific utterance information that associates at least one program to be started or stopped and/or at least one operating mode to be started or stopped, with specific utterances for starting or stopping each of the programs and/or operating modes; a specific utterance extraction unit which extracts a specific utterance included in the uttered character string with reference to the specific utterance information, and generates an extracted specific utterance signal indicating the extraction result; and a control unit which starts or stops a program or an operating mode with reference to the extracted specific utterance signal.
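The specific utterance storage and extraction units can be sketched as a phrase table scanned against the uttered character string. The phrases and program names below are invented examples:

```python
# Illustrative sketch: stored specific utterances mapped to the program or
# operating mode they start/stop, extracted from the uttered string.

SPECIFIC_UTTERANCES = {
    "start camera": ("camera_app", "start"),
    "stop camera": ("camera_app", "stop"),
    "start navigation": ("nav_mode", "start"),
}

def extract_specific_utterance(uttered_string: str):
    """Return (target, action) for the first stored phrase found, else None."""
    lowered = uttered_string.lower()
    for phrase, action in SPECIFIC_UTTERANCES.items():
        if phrase in lowered:
            return action
    return None
```

The returned signal corresponds to the abstract's extracted specific utterance signal, which the control unit consumes to actually start or stop the program or mode.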
Natural assistant interaction
Systems and processes for operating a virtual assistant to provide natural assistant interaction are provided. In accordance with one or more examples, a method includes, at an electronic device with one or more processors and memory: receiving a first audio stream including one or more utterances; determining whether the first audio stream includes a lexical trigger; generating one or more candidate text representations of the one or more utterances; and determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant. If at least one candidate text representation is to be disregarded, one or more candidate intents are generated based on the remaining candidate text representations, i.e., those other than the at least one candidate text representation to be disregarded.
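The filtering step at the end of this abstract — generating intents only from candidate text representations not marked for disregard — can be sketched as below. The candidate record shape and the keyword-based intent mapping are invented for illustration:

```python
# Sketch of candidate filtering before intent generation: drop flagged
# transcription candidates, derive intents from the rest.

def candidate_intents(candidates: list) -> list:
    """Generate intents only from candidates not marked to be disregarded."""
    kept = [c["text"] for c in candidates if not c.get("disregard", False)]
    return [
        {"text": text,
         "intent": "play_music" if "play" in text.lower() else "unknown"}
        for text in kept
    ]
```

Disregarding mis-heard or assistant-directed-at-nobody candidates before intent generation is what keeps the interaction natural: a bad transcription never reaches the intent stage.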