G06F16/63

Voice Query QoS based on Client-Computed Content Metadata
20220262367 · 2022-08-18

A method includes receiving an automated speech recognition (ASR) request from a user device that includes a speech input captured by the user device and content metadata associated with the speech input. The content metadata is generated by the user device. The method also includes determining a priority score for the ASR request based on the content metadata associated with the speech input and caching the ASR request in a pre-processing backlog of pending ASR requests each having a corresponding priority score. The pending ASR requests in the pre-processing backlog are ranked in order of the priority scores. The method also includes providing, from the pre-processing backlog, one or more of the pending ASR requests to a backend-side ASR module, wherein pending ASR requests associated with higher priority scores are processed before pending ASR requests associated with lower priority scores.
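
As a rough illustration of the ranked pre-processing backlog described in this abstract, the following Python sketch keeps pending requests in a heap ordered by priority score. The metadata fields (`hotword_confidence`, `speaker_verified`, `audio_quality`), the scoring weights, and the `Backlog` class are hypothetical names chosen for the example; the abstract does not specify how the score is computed.

```python
import heapq
import itertools


def priority_score(metadata):
    """Hypothetical scoring rule over client-computed content metadata.

    The field names and weights are assumptions for illustration only.
    """
    score = 0.0
    if metadata.get("hotword_confidence", 0.0) >= 0.8:
        score += 1.0
    if metadata.get("speaker_verified", False):
        score += 1.0
    # Missing quality information is treated as good quality (assumption).
    if metadata.get("audio_quality", 1.0) < 0.3:
        score -= 1.0
    return score


class Backlog:
    """Pending ASR requests ranked by priority score, highest first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: arrival order

    def add(self, request_id, metadata):
        score = priority_score(metadata)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._counter), request_id))
        return score

    def pop_batch(self, n):
        """Return up to n request ids to hand to the backend ASR module."""
        batch = []
        while self._heap and len(batch) < n:
            _, _, request_id = heapq.heappop(self._heap)
            batch.append(request_id)
        return batch


backlog = Backlog()
backlog.add("a", {"audio_quality": 0.1})
backlog.add("b", {"hotword_confidence": 0.9, "speaker_verified": True})
backlog.add("c", {"hotword_confidence": 0.9, "audio_quality": 0.5})
backlog.add("d", {"speaker_verified": True, "audio_quality": 0.5})
first_batch = backlog.pop_batch(3)
```

Requests with equal scores are released in arrival order here; that tie-breaking rule is a design choice of the sketch, not something stated in the abstract.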

METHODS AND APPARATUS TO IDENTIFY MEDIA

Methods, apparatus, systems and articles of manufacture are disclosed to identify media. An example method includes: in response to a query, generating an adjusted sample media fingerprint by applying an adjustment to a sample media fingerprint; comparing the adjusted sample media fingerprint to a reference media fingerprint; and in response to the adjusted sample media fingerprint matching the reference media fingerprint, transmitting information associated with the reference media fingerprint and the adjustment.
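
The adjust-then-compare flow in this abstract could look roughly like the Python sketch below. It assumes, purely for illustration, that a fingerprint is a set of `(time_index, frequency_bin)` peaks, that an adjustment is a `(bin_shift, time_scale)` pair, and that a match means a Jaccard similarity above a threshold. None of these representations come from the abstract.

```python
def apply_adjustment(fingerprint, bin_shift, time_scale):
    """Shift frequency bins and rescale time indices of a sample fingerprint."""
    return {(round(t * time_scale), f + bin_shift) for t, f in fingerprint}


def similarity(a, b):
    """Jaccard similarity between two peak sets (an assumed match metric)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def identify(sample, references, adjustments, threshold=0.8):
    """Try each candidate adjustment; return (reference id, adjustment) on a match."""
    for adjustment in adjustments:
        adjusted = apply_adjustment(sample, *adjustment)
        for ref_id, ref_fingerprint in references.items():
            if similarity(adjusted, ref_fingerprint) >= threshold:
                return ref_id, adjustment
    return None


# Hypothetical reference, and a sample that was pitched up 2 bins and played 2x faster.
references = {"song-1": {(0, 10), (2, 12), (4, 15), (6, 11)}}
sample = {(0, 12), (1, 14), (2, 17), (3, 13)}
candidate_adjustments = [(0, 1.0), (-2, 1.0), (-2, 2.0)]
result = identify(sample, references, candidate_adjustments)
```

Returning the adjustment alongside the identifier mirrors the abstract's step of transmitting both the reference information and the adjustment that produced the match.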

Synchronizing Playback by Media Playback Devices

Example systems, apparatus, and methods receive audio information including a plurality of frames from a source device, wherein each frame of the plurality of frames includes one or more audio samples and a time stamp indicating when to play the one or more audio samples of the respective frame. In an example, the time stamp is updated for each of the plurality of frames using a time differential value determined between clock information received from the source device and clock information associated with the device. The updated time stamp is stored for each of the plurality of frames, and the audio information is output based on the plurality of frames and associated updated time stamps. The number of samples per frame to be output is adjusted based on a comparison between the updated time stamp for the frame and a predicted time value for playback of the frame.
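
A minimal Python sketch of the two steps in this abstract, re-basing frame time stamps onto the local clock and then nudging the per-frame sample count, is shown below. The `Frame` type, the one-sample adjustment limit, and the direction of the correction (output an extra sample when the frame is due later than predicted, drop one when it is due earlier) are assumptions made for the example, not details taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    samples: list
    timestamp: float  # seconds, when the samples should be played


def update_timestamps(frames, source_clock, local_clock):
    """Re-base each frame's time stamp using the clock differential."""
    delta = local_clock - source_clock
    return [Frame(f.samples, f.timestamp + delta) for f in frames]


def samples_to_output(frame, predicted_time, sample_rate, max_adjust=1):
    """Choose how many samples to output for a frame.

    If the frame is due later than the predicted playback time, output extra
    samples (stretch); if it is due earlier, output fewer (compress). The
    correction is clamped to +/- max_adjust samples per frame.
    """
    drift = round((frame.timestamp - predicted_time) * sample_rate)
    adjust = max(-max_adjust, min(max_adjust, drift))
    return max(0, len(frame.samples) + adjust)


frames = [Frame([0.0, 0.1, 0.2, 0.3], timestamp=10.0)]
updated = update_timestamps(frames, source_clock=100.0, local_clock=100.5)
on_time = samples_to_output(updated[0], predicted_time=10.5, sample_rate=44100)
running_ahead = samples_to_output(updated[0], predicted_time=10.4, sample_rate=44100)
running_behind = samples_to_output(updated[0], predicted_time=10.6, sample_rate=44100)
```

Clamping the correction to a single sample per frame keeps each adjustment small; a real device would likely also interpolate rather than simply add or drop samples.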

Methods and apparatus to identify media that has been pitch shifted, time shifted, and/or resampled

Methods, apparatus, systems and articles of manufacture are disclosed to identify media that has been pitch shifted, time shifted, and/or resampled. An example method includes: generating, by executing an instruction with a processor, a fingerprint from an audio signal; transmitting the fingerprint and adjusting instructions to a central facility to facilitate a query, the adjusting instructions identifying at least one of a pitch shift or a time shift; receiving a response including an identifier for the audio signal and information corresponding to how the audio signal was adjusted; and storing information indicative of the identifier and the adjustment information in a database.

AI-assisted sound effect generation for silent video

Sound effect recommendations for visual input are generated by training machine learning models that learn coarse-grained and fine-grained audio-visual correlations from a reference visual, a positive audio signal, and a negative audio signal. A trained Sound Recommendation Network is configured to output an audio embedding and a visual embedding and use the audio embedding and visual embedding to compute a correlation distance between an image frame or video segment and one or more audio segments retrieved from a database. The correlation distances for the one or more audio segments in the database are sorted and one or more audio segments with the closest correlation distance from the sorted audio correlation distances are determined. The audio segment with the closest audio correlation distance is applied to the input image frame or video segment.
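
The retrieval step in this abstract (compute a correlation distance per candidate audio segment, sort, take the closest) can be sketched in Python as follows. The embeddings here are hand-written stand-ins for the outputs of a trained network, and Euclidean distance is used as an assumed substitute for the "correlation distance", which the abstract does not define.

```python
import math


def correlation_distance(visual_embedding, audio_embedding):
    """Euclidean distance, used as an assumed stand-in for correlation distance."""
    return math.dist(visual_embedding, audio_embedding)


def recommend(visual_embedding, audio_db, top_k=1):
    """Rank audio segments by distance to the visual embedding, closest first."""
    scored = [
        (name, correlation_distance(visual_embedding, embedding))
        for name, embedding in audio_db.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Hypothetical embeddings; a real system would obtain these from the trained model.
visual = [1.0, 0.0]
audio_db = {
    "footsteps": [0.9, 0.1],
    "rain": [0.0, 1.0],
    "door": [0.5, 0.5],
}
ranked = recommend(visual, audio_db, top_k=2)
best_name = ranked[0][0]
```

The first entry of the ranked list corresponds to the segment that would be applied to the input frame or video segment.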

ELECTRONIC APPARATUS AND METHOD FOR CONTROLLING THEREOF

An electronic apparatus is provided. The electronic apparatus includes a communication interface, a memory storing at least one instruction, and a processor, wherein the processor is configured to: acquire a user command from a user and control the communication interface to transmit the user command to a plurality of external devices; receive, from the plurality of external devices, information on a first question generated based on the user command and information on a first response to the generated first question acquired from users of each of the plurality of external devices; identify whether a conflict occurs between the first responses by analyzing the received information on the first response; based on identifying that the conflict occurs, acquire information on a subject to be re-questioned based on the generated information on the first question and the received information on the first response; control the communication interface to transmit information on the conflict to at least one external device identified based on the information on the subject to be re-questioned; receive, from the identified at least one external device, information on a second response to a second question generated based on the information on the conflict; and acquire a final response based on the information on the first response and the information on the second response and output the final response.