Patent classifications
G10L25/54
MULTI-FORMAT CONTENT REPOSITORY SEARCH
An audio file format of an audio portion of natural language content is determined. Using a trained audio language identification model, a human language included in the audio portion is identified. Using an audio-to-text model trained on that human language, the audio portion is converted to a corresponding set of text data. The set of text data is indexed. A search result is generated from the indexed set of text data in response to a search query, the search query specifying a search that includes a non-textual portion of the natural language content.
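The transcribe-then-index flow described above can be sketched as follows. This is a minimal illustration, not the patented implementation: `transcribe` is a hypothetical stand-in for the trained audio-to-text model, and the index is a plain inverted index from tokens to item identifiers.

```python
def transcribe(audio_item, language):
    # Stand-in for the trained audio-to-text model for `language`;
    # here the transcript is assumed to be precomputed.
    return audio_item["transcript"]

def build_index(items):
    # Inverted index: token -> set of content-item ids.
    index = {}
    for item_id, item in items.items():
        text = transcribe(item, item.get("language", "en"))
        for token in text.lower().split():
            index.setdefault(token, set()).add(item_id)
    return index

def search(index, query):
    # Return the ids of items containing every query token.
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results
```

A query over the index thus reaches the non-textual (audio) portion of the content through its indexed transcript.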
Audio recognition method, device and server
An audio recognition method, including: acquiring an audio file to be recognized (S100); extracting audio feature information of the audio file, the audio feature information including audio fingerprints (S200); and searching, in a fingerprint index database, for audio attribute information matching the audio feature information, the fingerprint index database including an audio fingerprint set from which invalid audio fingerprints of the audio sample data have been removed (S300). Because invalid audio fingerprints have been removed from the audio fingerprint set, the storage space occupied by audio fingerprints in the fingerprint index database is reduced and audio recognition efficiency is improved. An audio recognition device and a server are also provided.
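One plausible reading of "invalid fingerprint removal" is pruning fingerprints that occur across too many samples to be discriminative. The sketch below assumes that interpretation and represents fingerprints as simple hash values; the actual fingerprinting and matching in the patent are not specified here.

```python
from collections import Counter

def prune_invalid(fingerprints, max_frequency=3):
    # Drop fingerprints shared by more than `max_frequency` tracks;
    # such near-universal hashes carry little identifying power.
    counts = Counter(fp for fps in fingerprints.values() for fp in set(fps))
    return {track: [fp for fp in fps if counts[fp] <= max_frequency]
            for track, fps in fingerprints.items()}

def build_fingerprint_index(fingerprints):
    # Fingerprint -> set of tracks containing it.
    index = {}
    for track, fps in fingerprints.items():
        for fp in fps:
            index.setdefault(fp, set()).add(track)
    return index

def recognize(index, query_fps):
    # Vote for the track sharing the most fingerprints with the query.
    votes = Counter()
    for fp in query_fps:
        for track in index.get(fp, ()):
            votes[track] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Pruning shrinks the index and also shortens lookups, which is the storage and efficiency gain the abstract claims.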
Dynamically assigning wake words
A method and apparatus for determining a unique wake word for devices within an incident. One system includes an electronic computing device comprising a transceiver and an electronic processor communicatively coupled to the transceiver. The electronic processor is configured to receive a notification indicative of an occurrence of an incident and one or more communication devices present at the incident, determine contextual information associated with the incident and the one or more communication devices, and identify one or more wake words based on the contextual information. The electronic processor is further configured to determine a phonetic distance for each pair of wake words included in the one or more wake words, and select a unique wake word from the one or more wake words for each communication device of the one or more communication devices based on the determined phonetic distance.
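The selection step above can be sketched with edit distance standing in for the phonetic distance the patent describes (the true metric would operate on phoneme sequences, which is an assumption not detailed here). Each device greedily receives a candidate word sufficiently distant from every word already assigned.

```python
def edit_distance(a, b):
    # Levenshtein distance, a simple stand-in for phonetic distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def assign_wake_words(devices, candidates, min_distance=2):
    # Greedily give each device a wake word at least `min_distance`
    # away from every word already assigned, so devices at the same
    # incident do not trigger on each other's wake words.
    assigned = {}
    for device in devices:
        for word in candidates:
            if word not in assigned.values() and all(
                    edit_distance(word, w) >= min_distance
                    for w in assigned.values()):
                assigned[device] = word
                break
    return assigned
```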
Method and system for tagging and navigating through performers and other information on time-synchronized content
In one embodiment, a computer-implemented method for editing navigation of a content item is disclosed. The method may include presenting, via a user interface at a client computing device, time-synchronized text pertaining to the content item; receiving an input of a tag for the time-synchronized text of the content item, wherein the tag corresponds to a performer that performs at least a portion of the content item at a timestamp in the time-synchronized text; storing the tag associated with the portion of the content item at the timestamp in the time-synchronized text of the content item; and responsive to receiving a request to play the content item: playing the content item via a media player presented in the user interface, and concurrently presenting the time-synchronized text and the tag in the user interface, wherein the tag is presented as a graphical user element in the user interface.
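The tag-storage and concurrent-display steps can be sketched as a timestamp-keyed map. All names here are illustrative; the patent does not specify a data structure, and the display window size is an assumption.

```python
def add_tag(tags, timestamp, performer):
    # Associate a performer tag with a timestamp in the
    # time-synchronized text of the content item.
    tags.setdefault(timestamp, []).append(performer)
    return tags

def tags_at(tags, playhead, window=2.0):
    # Tags whose timestamp falls inside the current playback window,
    # for presentation alongside the time-synchronized text.
    return [p for t, ps in sorted(tags.items())
            if abs(t - playhead) <= window for p in ps]
```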
User identification using voice characteristics
Embodiments of methods, systems, and storage medium associated with providing user records associated with characteristics that may be used to identify the user are disclosed herein. In one instance, the method may include obtaining features of an individual, determining identifying characteristics associated with the obtained features, initiating a search for a record associated with the individual based in part on the identifying characteristics associated with the obtained features, and, based on a result of the search, initiating a verification of the record associated with the individual. The method may further include receiving at least a portion of the record associated with the individual, based at least in part on a result of the verification. The verification may be based in part on a ranking associated with the record. Other embodiments may be described and/or claimed.
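The search-then-verify flow can be sketched as ranking candidate records by overlap of identifying characteristics and verifying the top candidate against a threshold. The scoring function and threshold are hypothetical; the abstract only states that verification is based on a ranking.

```python
def score(record_features, query_features):
    # Similarity as the count of shared identifying characteristics.
    return len(set(record_features) & set(query_features))

def search_records(records, query_features, min_rank_score=2):
    # Rank candidate records by feature overlap; "verify" the top
    # match only if its ranking score clears the threshold.
    ranked = sorted(records.items(),
                    key=lambda kv: score(kv[1], query_features),
                    reverse=True)
    best_id, best_feats = ranked[0]
    if score(best_feats, query_features) >= min_rank_score:
        return best_id
    return None
```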
SYSTEM AND METHOD FOR TRAINING A TRANSFORMER-IN-TRANSFORMER-BASED NEURAL NETWORK MODEL FOR AUDIO DATA
Devices, systems and methods related to causing an apparatus to generate music information of audio data using a transformer-based neural network model with a multilevel transformer for audio analysis, using a spectral and a temporal transformer, are disclosed herein. The processor generates a time-frequency representation of obtained audio data to be applied as input for a transformer-based neural network model; determines spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation of the audio data; determines each vector of a second frequency class token (FCT) by passing each vector of the first FCT in the spectral embeddings through the spectral transformer; determines second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings; determines third temporal embeddings by passing the second temporal embeddings through the temporal transformer; and generates music information based on the third temporal embeddings.
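The data flow through the multilevel transformer can be traced in terms of tensor shapes. In this sketch a residual mean-mixing function stands in for both transformer blocks (real attention is omitted), and the dimensions, projection, and FCT initialization are all assumptions used only to show how the spectral FCT feeds the temporal stream.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, D = 8, 16, 32          # time frames, frequency bins, embedding dim

def mix(x, axis):
    # Stand-in for a transformer block: residual mean-mixing on `axis`.
    return x + x.mean(axis=axis, keepdims=True)

spec_emb = rng.standard_normal((T, F, D))    # spectral embeddings
fct = np.zeros((T, 1, D))                    # first FCT, one per frame
temporal_emb = rng.standard_normal((T, D))   # first temporal embeddings

# Spectral transformer: each frame's FCT is mixed with that frame's bins.
out = mix(np.concatenate([fct, spec_emb], axis=1), axis=1)
fct2 = out[:, 0, :]                          # second FCT, shape (T, D)

W = rng.standard_normal((D, D)) / np.sqrt(D) # linear projection of FCT
temporal_emb2 = temporal_emb + fct2 @ W      # second temporal embeddings

temporal_emb3 = mix(temporal_emb2, axis=0)   # temporal transformer
```

The per-frame frequency summary (the FCT) is projected and added to the temporal stream, which is the coupling between the two transformer levels that the abstract describes.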
INTERACTIVE PRONUNCIATION LEARNING SYSTEM
Systems and methods for generating audible pronunciation of a closed captioning word in a content item. For example, a system generates for output on a first device a content item comprising dialogue. The system generates for display on the first device a closed captioning word corresponding to the dialogue where the closed captioning word is selectable via a user interface of the first device. The system receives a selection of the closed captioning word via the user interface of the first device. In response to receiving the selection of the closed captioning word, the system generates for playback on the first device at least a portion of the dialogue corresponding to the selected closed captioning word.
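The selection-to-playback mapping can be sketched by keying each displayed caption word to its dialogue timing. The padding parameter is an assumption added so the replayed clip is audible in context; the patent abstract does not specify one.

```python
def build_caption_index(captions):
    # captions: list of (word, start_sec, end_sec) in display order;
    # the list position serves as the selectable word's id.
    return {i: (w, s, e) for i, (w, s, e) in enumerate(captions)}

def on_word_selected(index, word_id, padding=0.2):
    # Return the playback window for the selected closed-captioning
    # word, padded slightly on both sides (hypothetical behavior).
    word, start, end = index[word_id]
    return word, max(0.0, start - padding), end + padding
```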
Methods, apparatuses and computer programs relating to spatial audio
An apparatus is disclosed, configured to receive, from first and second spatial audio capture apparatuses, respective first and second composite audio signals comprising components derived from one or more sound sources in a capture space. The apparatus is further configured to identify a position of a user device corresponding to one of first and second areas respectively associated with the positions of the first and second spatial audio capture apparatuses, and to render audio representing the one or more sound sources to the user device, the rendering being performed differently dependent on, for the spatial audio capture apparatus associated with the identified first or second area, whether or not individual audio signals from each of the one or more sound sources can be successfully separated from its composite signal.
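The rendering decision can be sketched as: locate the capture area containing the user, then branch on whether that apparatus's sources separate cleanly from its composite signal. Positions are reduced to one dimension and the mode names are illustrative, not taken from the patent.

```python
def choose_render_mode(user_pos, areas, separable):
    # areas: {apparatus_id: (center, radius)} on a 1-D axis.
    # separable: {apparatus_id: bool} - whether individual source
    # signals can be separated from that apparatus's composite signal.
    for apparatus, (center, radius) in areas.items():
        if abs(user_pos - center) <= radius:
            mode = "object" if separable[apparatus] else "ambient"
            return apparatus, mode
    return None, "ambient"
```

Per-source ("object") rendering is used only when separation succeeds; otherwise the composite signal is rendered as-is, which matches the conditional rendering the abstract describes.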