Patent classifications
G10L25/54
Method for searching for contents having same voice as voice of target speaker, and apparatus for executing same
A method for searching, among a plurality of contents, for content having the same voice as a target speaker includes: extracting a feature vector corresponding to the voice of the target speaker; repeatedly selecting an arbitrary subset of speakers from a training dataset, a predetermined number of times; generating a linear discriminant analysis (LDA) transformation matrix from each selected subset of speakers; projecting the extracted speaker feature vector onto each selected subset of speakers using the corresponding LDA transformation matrix; assigning, to each projection region of the extracted speaker feature vector, a value corresponding to the nearest speaker class in the corresponding subset of speakers; generating a hash value for the extracted feature vector based on the assigned values; and searching the contents for content whose hash value is similar to the generated hash value.
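The steps in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`fit_lda`, `build_tables`, `hash_vector`, `hamming_similarity`, `search`), the use of class centroids to pick the nearest speaker class, and the match-count similarity are all hypothetical choices made for the example.

```python
import numpy as np

def fit_lda(X, y, n_components):
    """Return a (dim, n_components) LDA projection matrix for samples X, labels y."""
    dim = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((dim, dim))  # within-class scatter
    Sb = np.zeros((dim, dim))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)
    Sw += 1e-6 * np.eye(dim)  # small ridge so Sw is invertible (assumption)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)[:n_components]
    return eigvecs[:, order].real

def build_tables(X, y, n_tables, subset_size, seed=0):
    """Repeat n_tables times: pick a random speaker subset and fit LDA on it."""
    rng = np.random.default_rng(seed)
    speakers = np.unique(y)
    tables = []
    for _ in range(n_tables):
        subset = rng.choice(speakers, size=subset_size, replace=False)
        mask = np.isin(y, subset)
        W = fit_lda(X[mask], y[mask], n_components=subset_size - 1)
        # One centroid per speaker of the subset, in the projected space.
        centroids = np.stack([(X[y == s] @ W).mean(axis=0) for s in subset])
        tables.append((W, centroids))
    return tables

def hash_vector(x, tables):
    """One value per table: index of the nearest speaker class after projection."""
    return tuple(
        int(np.argmin(np.linalg.norm(x @ W - centroids, axis=1)))
        for W, centroids in tables
    )

def hamming_similarity(h1, h2):
    """Number of positions at which two hash values agree."""
    return sum(a == b for a, b in zip(h1, h2))

def search(query_hash, content_hashes):
    """Index of the content whose hash agrees most with the query hash."""
    return max(range(len(content_hashes)),
               key=lambda i: hamming_similarity(query_hash, content_hashes[i]))
```

Each table contributes one symbol to the hash, so two voices that fall near the same speaker class in most of the random subsets end up with similar hash values.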
LOCALIZATION OF NARRATIONS IN IMAGE DATA
Methods, systems, and computer storage media are provided for multi-modal localization. Input data comprising two modalities, such as image data and corresponding text or audio data, may be received. A phrase may be extracted from the text or audio data, and a neural network system may be utilized to spatially and temporally localize the phrase within the image data. The neural network system may include a plurality of cross-modal attention layers that each compare features across the first and second modalities without comparing features of the same modality. Using the cross-modal attention layers, a region or subset of pixels within one or more frames of the image data may be identified as corresponding to the phrase, and a localization indicator may be presented for display with the image data. Embodiments may also include unsupervised training of the neural network system.
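A single cross-modal attention step of the kind described here might look like the sketch below. It assumes scaled dot-product attention with phrase features as queries and image-region features as keys and values; the abstract does not specify the attention formula, and the names `cross_modal_attention` and `localize` are hypothetical.

```python
import numpy as np

def cross_modal_attention(phrase_feats, region_feats):
    """Attend from phrase features (T, d) to image-region features (R, d).

    Only phrase-to-region scores are computed; no phrase-to-phrase or
    region-to-region comparison takes place.
    """
    d = phrase_feats.shape[1]
    scores = phrase_feats @ region_feats.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over regions
    attended = weights @ region_feats
    return weights, attended

def localize(phrase_feats, region_feats):
    """Index of the region receiving the most attention, averaged over tokens."""
    weights, _ = cross_modal_attention(phrase_feats, region_feats)
    return int(weights.mean(axis=0).argmax())
```

In a full system the returned region index would drive the localization indicator; here it is simply the most-attended region.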
Recommending results in multiple languages for search queries based on user profile
Systems and methods are provided for a media guidance application that generates results in multiple languages for search queries. In particular, the media guidance application resolves multiple language barriers by taking automatic and manual user language settings and applying those settings to a variety of potential search results.
LEARNING DEVICE, LEARNING METHOD, AND LEARNING PROGRAM
A learning device calculates a feature of each data included in a pair of datasets in which two modalities among a plurality of modalities are combined, using a model that receives data on a corresponding modality among the modalities and outputs a feature obtained by mapping the received data into an embedding space. The learning device then selects similar data similar to each target data that is data on a first modality in a first dataset of the datasets, from data on a second modality included in a second dataset of the datasets. The learning device further updates a parameter of the model such that the features of the data in the pair included in the first and the second datasets are similar to one another, and the feature of data paired with the target data is similar to the feature of data paired with the similar data.
Media content identification and playback
Systems, devices, apparatuses, components, methods, and techniques for identifying and playing media content are provided. An example media-playback device for identifying and playing media content for a user traveling in a vehicle includes an audio identification engine and a media playback engine. Audio content is recorded and identified by comparison to media content databases. The audio content is identified and immediately played on the same device. Additional media content is selected for playback based on user listening preferences.
Method and apparatus for extracting video clip
The present disclosure discloses a method and apparatus for extracting a video clip, and relates to the field of artificial intelligence technology such as video processing, audio processing, and cloud computing. The method includes: acquiring a video and extracting an audio stream from the video; determining, for each preset period in the audio stream, a confidence that the audio data in that period comprises a preset feature; and extracting a target video clip corresponding to the location of a target audio clip in the video, wherein the target audio clip is an audio clip spanning continuous preset periods whose confidence of including the preset feature is larger than a preset confidence threshold. This method may improve the accuracy of extracting a video clip.
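The thresholding step can be illustrated with a short sketch. It assumes the per-period confidences have already been computed by some audio classifier, and it reads "continuous preset period" as a run of consecutive periods above the threshold; the function name `find_target_clips` and its parameters are hypothetical.

```python
def find_target_clips(confidences, period_s, threshold):
    """Return (start_s, end_s) spans for runs of consecutive periods whose
    confidence is strictly larger than the threshold."""
    clips = []
    start = None  # index of the first period of the current run, if any
    for i, conf in enumerate(confidences):
        if conf > threshold:
            if start is None:
                start = i
        elif start is not None:
            clips.append((start * period_s, i * period_s))
            start = None
    if start is not None:  # a run that extends to the end of the stream
        clips.append((start * period_s, len(confidences) * period_s))
    return clips

# Five 2-second periods; periods 1-2 and 4 exceed the 0.5 threshold.
print(find_target_clips([0.1, 0.9, 0.8, 0.2, 0.95], period_s=2.0, threshold=0.5))
# [(2.0, 6.0), (8.0, 10.0)]
```

The returned time spans can then be used to cut the corresponding clips from the video, since the audio stream and the video share a timeline.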
METHOD OF AND SYSTEM FOR REAL TIME FEEDBACK IN AN INCREMENTAL SPEECH INPUT INTERFACE
The present disclosure provides systems and methods for selecting and presenting content items based on user input. The method includes receiving first input intended to identify a desired content item among content items associated with metadata, determining that an input portion has an importance measure exceeding a threshold, and providing feedback identifying the input portion. The method further includes receiving second input, and inferring user intent to alter or supplement the first input with the second input. The method further includes, upon inferring intent to alter the first input, determining an alternative query by modifying the first input based on the second input, and, upon inferring intent to supplement the first input, determining an alternative query by combining the first input and the second input. The method further includes selecting and presenting a subset of content items based on comparing the alternative query and metadata associated with the subset.
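The alter-versus-supplement logic can be sketched as below. The abstract does not say how intent is inferred, so this example uses a deliberately simple assumption: inputs are already parsed into attribute-value pairs, a second input that names an attribute already present in the first is treated as altering it, and anything else is treated as supplementing it. All names (`infer_intent`, `alternative_query`, `select_items`) are hypothetical.

```python
def infer_intent(first, second):
    """Assumed heuristic: overlapping attributes mean the user is correcting
    the first input; otherwise the user is adding to it."""
    return "alter" if any(key in first for key in second) else "supplement"

def alternative_query(first, second):
    """Build the alternative query according to the inferred intent."""
    intent = infer_intent(first, second)
    query = dict(first)
    if intent == "alter":
        # Modify the first input: overlapping attributes take the new value.
        for key, value in second.items():
            query[key] = value
    else:
        # Combine both inputs: attributes are disjoint, so nothing is lost.
        query.update(second)
    return intent, query

def select_items(items, query):
    """Keep the items whose metadata matches every attribute in the query."""
    return [item for item in items
            if all(item["metadata"].get(k) == v for k, v in query.items())]
```

For example, a first input of `{"genre": "comedy", "year": "1990"}` followed by `{"year": "1995"}` would be treated as an alteration, while a follow-up of `{"language": "fr"}` would be treated as a supplement.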