Patent classifications
G06F16/63
Voice Query QoS Based On Client-Computed Content Metadata
A method includes receiving, from a user device, an automated speech recognition (ASR) request that includes a speech input captured by the user device and content metadata associated with the speech input. The content metadata is generated by the user device. The method also includes determining a priority score for the ASR request based on the content metadata and caching the ASR request in a pre-processing backlog of pending ASR requests, each having a corresponding priority score. The pending ASR requests in the pre-processing backlog are ranked in order of their priority scores. The method also includes providing one or more of the pending ASR requests from the pre-processing backlog to a backend-side ASR module, such that pending ASR requests with higher priority scores are processed before those with lower priority scores.
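The ranked backlog described above behaves like a priority queue. The sketch below is illustrative only: the metadata keys (`query_type`, `speech_confidence`) and the weighting in `priority_score` are hypothetical, since the abstract does not say how content metadata maps to a score. It uses a max-priority heap with a FIFO tie-breaker so that equal scores are served in arrival order.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class ASRRequest:
    request_id: str
    audio: bytes
    metadata: dict  # content metadata computed on the user device


# Hypothetical weights: the abstract does not specify the scoring scheme.
QUERY_TYPE_WEIGHTS = {"assistant_command": 3.0, "dictation": 2.0, "background": 0.5}


def priority_score(metadata: dict) -> float:
    """Map client-computed content metadata to a priority score (assumed scheme)."""
    base = QUERY_TYPE_WEIGHTS.get(metadata.get("query_type"), 1.0)
    # Assumption: scale by a client-reported speech confidence in [0, 1].
    return base * metadata.get("speech_confidence", 1.0)


class PreprocessingBacklog:
    """Pending ASR requests, served highest priority score first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-breaker for equal scores

    def add(self, request: ASRRequest) -> float:
        score = priority_score(request.metadata)
        # heapq is a min-heap, so push the negated score to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._counter), request))
        return score

    def pop_batch(self, n: int) -> list:
        """Hand up to n requests to the backend ASR module, highest score first."""
        batch = []
        while self._heap and len(batch) < n:
            _, _, request = heapq.heappop(self._heap)
            batch.append(request)
        return batch

    def __len__(self) -> int:
        return len(self._heap)


backlog = PreprocessingBacklog()
backlog.add(ASRRequest("a", b"", {"query_type": "background"}))
backlog.add(ASRRequest("b", b"", {"query_type": "assistant_command", "speech_confidence": 0.9}))
backlog.add(ASRRequest("c", b"", {"query_type": "dictation"}))
print([r.request_id for r in backlog.pop_batch(3)])  # ['b', 'c', 'a']
```

The counter in each heap entry also guarantees that two requests are never compared directly, so `ASRRequest` does not need to define an ordering.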
Systems and methods for aligning lyrics using a neural network
An electronic device receives audio data for a media item. The electronic device generates, from the audio data, a plurality of samples, each sample having a predefined maximum length. The electronic device, using a neural network trained to predict character probabilities, generates a probability matrix of characters for a first portion of a first sample of the plurality of samples. The probability matrix includes character information, timing information, and respective probabilities of respective characters at respective times. The electronic device identifies, for the first portion of the first sample, a first sequence of characters based on the generated probability matrix.
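Two of the steps above can be illustrated concretely: cutting audio into samples of a predefined maximum length, and reading a character sequence with timing out of a probability matrix. The abstract does not name a decoding method. The sketch below assumes a CTC-style output with a blank symbol (`"_"`, hypothetical) and uses greedy best-path decoding as a stand-in. The probability matrix would come from the trained network; here it is supplied by the caller.

```python
from typing import List, Sequence, Tuple

BLANK = "_"  # assumption: CTC-style blank symbol in the network's alphabet


def split_into_samples(audio: Sequence[float], max_len: int) -> List[Sequence[float]]:
    """Cut audio into consecutive samples, each at most max_len values long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [audio[i:i + max_len] for i in range(0, len(audio), max_len)]


def greedy_decode(
    prob_matrix: Sequence[Sequence[float]],
    alphabet: Sequence[str],
    frame_duration_s: float,
) -> List[Tuple[str, float]]:
    """Best-path decoding: pick the most likely character per time step,
    merge consecutive repeats, drop blanks, and record each character's start time."""
    result = []
    prev = None
    for t, row in enumerate(prob_matrix):
        idx = max(range(len(row)), key=row.__getitem__)
        ch = alphabet[idx]
        if ch != prev and ch != BLANK:
            result.append((ch, t * frame_duration_s))
        prev = ch
    return result
```

For example, with alphabet `["_", "h", "i"]` and rows whose argmax is `h, h, _, i` at 0.5 s per frame, the decoder returns `[("h", 0.0), ("i", 1.5)]`. A production aligner would more likely use beam search or forced alignment against known lyrics.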
Media player system
Example systems, apparatus, and methods receive audio information including a plurality of frames from a source device, wherein each frame of the plurality of frames includes one or more audio samples and a time stamp indicating when to play the one or more audio samples of the respective frame. In an example, the time stamp is updated for each of the plurality of frames using a time differential value determined between clock information received from the source device and the receiving device's own clock information. The updated time stamp is stored for each of the plurality of frames, and the audio information is output based on the plurality of frames and their associated updated time stamps. The number of samples output per frame is adjusted based on a comparison between the updated time stamp for the frame and a predicted time value for playback of the frame.
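The timestamp translation and per-frame sample adjustment can be sketched as follows. This is a minimal illustration under stated assumptions: the tolerance value and the policy of dropping or repeating exactly one sample are hypothetical, since the abstract only says the sample count is adjusted based on the comparison.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    samples: List[int]
    timestamp: float  # time (seconds) at which to play this frame's samples


def clock_offset(source_clock: float, device_clock: float) -> float:
    """Time differential between the device clock and the source clock."""
    return device_clock - source_clock


def update_timestamps(frames: List[Frame], offset: float) -> List[Frame]:
    """Translate each frame's timestamp into the device's clock domain (in place)."""
    for frame in frames:
        frame.timestamp += offset
    return frames


def samples_to_output(frame: Frame, predicted_play_time: float,
                      tolerance_s: float = 0.001) -> List[int]:
    """Hypothetical drift correction: if the frame is predicted to play later than
    its updated timestamp, drop one sample to catch up; if earlier, repeat one."""
    drift = predicted_play_time - frame.timestamp
    if drift > tolerance_s and len(frame.samples) > 1:
        return frame.samples[:-1]  # running late: output one fewer sample
    if drift < -tolerance_s and frame.samples:
        return frame.samples + frame.samples[-1:]  # running early: pad with last sample
    return frame.samples
```

Adjusting by a single sample per frame keeps the correction inaudible in most cases; a real player might instead resample smoothly across several frames.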
RECEIVING A NATURAL LANGUAGE REQUEST AND RETRIEVING A PERSONAL VOICE MEMO
A computer-implemented method is provided. The method includes: receiving commands to store memos; identifying subjects related to the memos; storing the memos, their related subjects, and associated time information in a database; receiving a natural language request to retrieve a memo, the request having query information; identifying a subject related to the request; responsive to the request, querying the database for memos related to that subject; identifying multiple memos in response to the database query; identifying, from among the multiple identified memos, the memo that has the most recent associated time information; and providing a response based on the identified memo.
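The store-then-retrieve flow above maps naturally onto a small table keyed by subject. In this sketch the keyword-based `identify_subject` and its `SUBJECT_KEYWORDS` vocabulary are hypothetical stand-ins for whatever natural language understanding the method actually uses; the database is an in-memory SQLite table.

```python
import re
import sqlite3
from typing import Optional

# Hypothetical subject vocabulary; the abstract does not say how subjects are identified.
SUBJECT_KEYWORDS = {
    "parking": {"park", "parked", "parking", "car"},
    "groceries": {"buy", "grocery", "groceries", "milk"},
}


def identify_subject(text: str) -> Optional[str]:
    """Return the first subject whose keywords appear in the text, if any."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    for subject, keywords in SUBJECT_KEYWORDS.items():
        if words & keywords:
            return subject
    return None


class MemoStore:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE memos (text TEXT, subject TEXT, created_at REAL)")

    def store(self, text: str, created_at: float) -> None:
        """Store a memo together with its identified subject and time information."""
        self.db.execute(
            "INSERT INTO memos VALUES (?, ?, ?)",
            (text, identify_subject(text), created_at),
        )

    def retrieve(self, request: str) -> Optional[str]:
        """Answer a natural language request with the most recent matching memo."""
        subject = identify_subject(request)
        if subject is None:
            return None
        rows = self.db.execute(
            "SELECT text, created_at FROM memos WHERE subject = ?", (subject,)
        ).fetchall()
        if not rows:
            return None
        text, _ = max(rows, key=lambda row: row[1])  # most recent time wins
        return f"Your latest memo: {text}"
```

Selecting the newest memo in Python mirrors the abstract's two steps (find multiple memos, then pick the most recent); an `ORDER BY created_at DESC LIMIT 1` query would do the same in one step.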
METHOD AND DEVICE FOR DISPLAYING MUSIC SCORE IN TARGET MUSIC VIDEO
The present application provides techniques for displaying music score segments in target music videos. The techniques comprise determining a digital music score corresponding to a piece of music in a target music video; determining, based at least in part on the playback progress of the target music video, the segment of the digital music score that corresponds to the current playback progress; generating, based on a predetermined condition, an image of the music score segment corresponding to that segment of the digital music score; and presenting the image on the interface on which the target music video is being played.
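Finding the score segment for the current playback progress is a sorted-interval lookup. The sketch assumes each segment carries a hypothetical `start_s` playback time (how the score is aligned to the video is not described in the abstract) and uses binary search; image generation is out of scope here.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoreSegment:
    start_s: float  # playback time at which this segment begins (assumed alignment)
    notation: str   # placeholder for the segment's score content


def segment_for_progress(segments: List[ScoreSegment],
                         progress_s: float) -> Optional[ScoreSegment]:
    """Return the segment whose time span contains progress_s.
    Segments must be sorted by start_s; returns None before the first segment."""
    starts = [seg.start_s for seg in segments]
    i = bisect.bisect_right(starts, progress_s) - 1
    return segments[i] if i >= 0 else None
```

`bisect_right` makes a progress value exactly on a boundary belong to the segment that starts there, which matches the intuition that a new segment is shown as soon as it begins.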
AI-ASSISTED SOUND EFFECT GENERATION FOR SILENT VIDEO
Sound effect recommendations for visual input are generated by training machine learning models that learn coarse-grained and fine-grained audio-visual correlations from a reference visual, a positive audio signal, and a negative audio signal. A trained Sound Recommendation Network is configured to output an audio embedding and a visual embedding and to use them to compute a correlation distance between an image frame or video segment and each of one or more audio segments retrieved from a database. The correlation distances for the audio segments are sorted, and the one or more audio segments with the closest correlation distances are determined. The audio segment with the closest correlation distance is applied to the input image frame or video segment.
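Training on a reference visual with a positive and a negative audio signal suggests a triplet-style objective, and recommendation reduces to ranking database audio by embedding distance. Both are assumptions in this sketch: the abstract does not name the loss, and plain Euclidean distance (`math.dist`) stands in for the network's "correlation distance". Embeddings are given as plain tuples rather than produced by a model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def triplet_loss(visual: Sequence[float], positive_audio: Sequence[float],
                 negative_audio: Sequence[float], margin: float = 0.2) -> float:
    """Assumed training objective: the matching audio should be closer to the
    visual embedding than the non-matching audio by at least `margin`."""
    d_pos = math.dist(visual, positive_audio)
    d_neg = math.dist(visual, negative_audio)
    return max(0.0, d_pos - d_neg + margin)


def rank_audio(visual_emb: Sequence[float],
               audio_db: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Sort database audio segments by distance to the visual embedding, closest first."""
    distances = [(seg_id, math.dist(visual_emb, emb)) for seg_id, emb in audio_db.items()]
    return sorted(distances, key=lambda pair: pair[1])
```

The first entry of `rank_audio(...)` is the segment that would be applied to the frame; for large databases, an approximate nearest-neighbour index would replace the linear scan.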
Music Discovery
Examples described herein relate to music discovery. In one aspect, a method is provided that involves (a) receiving, by a computing device, an indication of a search tool from among a plurality of search tools, where each search tool of the plurality of search tools is associated with at least one respective media service; (b) receiving, by the computing device, an indication of a media characteristic via the indicated search tool; (c) selecting, by the computing device, one or more of the respective media services that maintain media associated with the indicated media characteristic; and (d) sending, by the computing device, an indication of the selected one or more media services.
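Steps (a) through (d) amount to filtering the services behind a chosen search tool by the media they maintain. The catalog, tool names, and service names below (`SERVICE_CATALOG`, `SEARCH_TOOLS`, `ServiceA` and so on) are hypothetical placeholders; the returned list stands in for the indication sent in step (d).

```python
from typing import Dict, List, Set

# Hypothetical catalog: media characteristics (e.g. genres) each service maintains.
SERVICE_CATALOG: Dict[str, Set[str]] = {
    "ServiceA": {"jazz", "classical"},
    "ServiceB": {"jazz", "hip-hop"},
    "ServiceC": {"podcasts"},
}

# Hypothetical mapping of each search tool to its associated media services.
SEARCH_TOOLS: Dict[str, List[str]] = {
    "genre_search": ["ServiceA", "ServiceB"],
    "podcast_search": ["ServiceC"],
}


def discover(search_tool: str, characteristic: str) -> List[str]:
    """Given the indicated search tool and a media characteristic entered through it,
    select the tool's associated services that maintain matching media."""
    services = SEARCH_TOOLS.get(search_tool, [])
    return [s for s in services if characteristic in SERVICE_CATALOG.get(s, set())]
```

An unknown tool or an unmatched characteristic yields an empty list, so the caller can tell the user that no service covers the request.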