G06F40/129

Language-agnostic multilingual modeling using effective script normalization

A method includes obtaining a plurality of training data sets, each associated with a respective native language and including a plurality of respective training data samples. For each respective training data sample of each training data set in the respective native language, the method includes transliterating the corresponding transcription in the respective native script into corresponding transliterated text representing the respective native language of the corresponding audio in a target script and associating the corresponding transliterated text in the target script with the corresponding audio in the respective native language to generate a respective normalized training data sample. The method also includes training, using the normalized training data samples, a multilingual end-to-end speech recognition model to predict speech recognition results in the target script for corresponding speech utterances spoken in any of the different native languages associated with the plurality of training data sets.
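The normalization step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented method: the `TrainingSample` type, the `audio_id` field, the language codes, and the tiny Cyrillic-to-Latin table are all hypothetical, and a real system would use a full transliteration model rather than a character table.

```python
from dataclasses import dataclass

# Toy character table (assumption): Cyrillic -> Latin for a handful of letters.
CYRILLIC_TO_LATIN = {"п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t"}


@dataclass(frozen=True)
class TrainingSample:
    audio_id: str       # stands in for the audio of the utterance
    transcription: str  # text paired with the audio
    language: str       # native language of the audio


def transliterate(text: str, table: dict[str, str]) -> str:
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def normalize(sample: TrainingSample, tables: dict[str, dict[str, str]]) -> TrainingSample:
    """Pair the original audio with its transcription rewritten in the target script."""
    table = tables.get(sample.language, {})
    return TrainingSample(
        sample.audio_id,
        transliterate(sample.transcription, table),
        sample.language,
    )


tables = {"ru": CYRILLIC_TO_LATIN}
samples = [
    TrainingSample("utt-001", "привет", "ru"),
    TrainingSample("utt-002", "hello", "en"),  # already in the target script
]
normalized = [normalize(s, tables) for s in samples]
```

The audio reference is left untouched and only the text side changes, so every normalized sample shares one output script regardless of its native language.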

GENERATION OF CUSTOM COMPOSITE EMOJI IMAGES BASED ON USER-SELECTED INPUT FEED TYPES ASSOCIATED WITH INTERNET OF THINGS (IOT) DEVICE INPUT FEEDS

Composite emoji images may be generated based on user-selected input feed types associated with various Internet of Things (IoT) device input feeds. A plurality of input feed type indicators corresponding to a plurality of input feed types may be displayed for user selection. The plurality of input feed types may be associated with a plurality of IoT device input feeds. A user selection of at least some of the plurality of input feed types may be received. A composite emoji image may be generated based on a composite of a base template emoji and individual emoji image layer portions that are generated according to the at least some of the plurality of input feed types of the user selection. For each real-time IoT device input feed, a current emoji image layer portion associated with the feed may be regularly updated for display to better enable the user selection.

Method and apparatus for rendering lyrics

A method for rendering lyrics is provided, including: acquiring the pronunciation of a polyphonic word to be rendered in target lyrics, and acquiring playback time information of the pronunciation in the process of rendering the target lyrics; determining a first number of furigana characters contained in the pronunciation; and simultaneously rendering, word by word and according to the first number and the playback time information of the pronunciation, the polyphonic word to be rendered and each furigana character in its pronunciation, wherein the pronunciation is adjacent to and parallel to the polyphonic word to be rendered.
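One way to picture the timing logic is the following Python sketch. It assumes the playback interval of the word is divided evenly among its furigana characters; the abstract does not state how the time is divided, so the even split, the function names, and the millisecond units are assumptions made for illustration.

```python
def furigana_schedule(
    furigana: str, start_ms: float, duration_ms: float
) -> list[tuple[str, float, float]]:
    """Give each furigana character an equal slice of the word's playback time."""
    count = len(furigana)  # the "first number" of furigana characters
    if count == 0:
        return []
    slice_ms = duration_ms / count
    return [
        (kana, start_ms + i * slice_ms, start_ms + (i + 1) * slice_ms)
        for i, kana in enumerate(furigana)
    ]


def rendered_fraction(now_ms: float, start_ms: float, duration_ms: float) -> float:
    """Fraction of the base word to highlight at time now_ms, clamped to [0, 1]."""
    if duration_ms <= 0:
        return 1.0
    return min(1.0, max(0.0, (now_ms - start_ms) / duration_ms))


# A word read as "あめ" that plays from 1000 ms for 600 ms.
schedule = furigana_schedule("あめ", 1000, 600)
halfway = rendered_fraction(1300, 1000, 600)
```

Because the word and its furigana share one timeline, the highlight on the word reaches the halfway point exactly when the first of the two furigana characters finishes.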

Alternate character set domain name suggestion and registration using translation and/or transliteration
11637806 · 2023-04-25

Some embodiments provide domain name suggestions based on a user-provided ASCII phrase translated and/or transliterated into any of a number of supported non-English language character sets. To suggest non-English-language domain names, some embodiments parse, translate, and transliterate the user-provided ASCII names into domain names that include at least one non-English language character. Moreover, some embodiments determine the DNS registration status (e.g., as a second-level domain) of the Punycode (in ASCII) corresponding to these non-English domain names and provide the user with the ability to register any that are unregistered.
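The Punycode step can be illustrated with Python's built-in `idna` codec. This is a sketch only: the candidate names and the `REGISTERED` set are hypothetical stand-ins for the suggestion engine and the registry lookup, and no real DNS or registry query is made.

```python
def to_ascii_domain(unicode_domain: str) -> str:
    """Convert a domain with non-ASCII labels to its ASCII (Punycode) form.

    Uses Python's built-in "idna" codec, which implements the older
    IDNA 2003 rules (RFC 3490).
    """
    return unicode_domain.encode("idna").decode("ascii")


# Hypothetical registry snapshot; a real system would query the registry.
REGISTERED = {"xn--bcher-kva.example"}


def suggest(candidates: list[str]) -> list[tuple[str, str, bool]]:
    """Return (unicode name, ASCII form, available?) for each candidate."""
    results = []
    for name in candidates:
        ascii_name = to_ascii_domain(name)
        results.append((name, ascii_name, ascii_name not in REGISTERED))
    return results


suggestions = suggest(["bücher.example", "münchen.example"])
```

Only names whose ASCII form is absent from the registry would be offered for registration.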

TEXT MINING METHOD BASED ON ARTIFICIAL INTELLIGENCE, RELATED APPARATUS AND DEVICE
20230111582 · 2023-04-13

This application discloses a text mining method based on artificial intelligence, performed by a computer device. The method includes: obtaining domain candidate term features corresponding to domain candidate terms; obtaining term quality scores corresponding to the domain candidate terms according to the domain candidate term features; determining a new term from the domain candidate terms according to the term quality scores corresponding to the domain candidate terms; obtaining an associated text according to the new term; and determining a domain seed term as a domain new term in response to determining, according to the associated text, that the domain seed term satisfies a domain new term mining condition. With this method, new terms can be selected automatically from domain candidate terms based on a machine learning algorithm, thereby reducing manpower costs and adapting well to the rapid emergence of special new terms in the Internet era.

PARALLEL UNICODE TOKENIZATION IN A DISTRIBUTED NETWORK ENVIRONMENT

Unicode data can be protected in a distributed tokenization environment. Data to be tokenized can be accessed or received by a security server, which instantiates a number of tokenization pipelines for parallel tokenization of the data. Unicode token tables are accessed by the security server, and each tokenization pipeline uses the accessed token tables to tokenize a portion of the data. Each tokenization pipeline performs a set of encoding or tokenization operations in parallel and based at least in part on a value received from another tokenization pipeline. The outputs of the tokenization pipelines are combined, producing tokenized data, which can be provided to a remote computing system for storage or processing.
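The data flow described above (split, tokenize in parallel, exchange a value between pipelines, combine) might look roughly like the Python sketch below. Everything specific here is an assumption: the Greek-letter alphabet, the seeded shuffle used as a token table, the two-pass structure, and the choice of a checksum as the value passed between pipelines. It shows the shape of the computation only and is neither a secure nor a reversible tokenization scheme.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical alphabet: Greek lowercase letters U+03B1..U+03C9.
ALPHABET = [chr(c) for c in range(0x03B1, 0x03CA)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def build_token_table(seed: int) -> dict[str, str]:
    """Hypothetical token table: a seeded permutation of the alphabet."""
    shuffled = ALPHABET[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))


def first_pass(chunk: str, table: dict[str, str]) -> str:
    """Table lookup; characters outside the alphabet pass through."""
    return "".join(table.get(ch, ch) for ch in chunk)


def second_pass(intermediate: str, neighbor_value: int, table: dict[str, str]) -> str:
    """Shift each character by the neighbor's value, then look it up again."""
    out = []
    for ch in intermediate:
        if ch in INDEX:
            shifted = ALPHABET[(INDEX[ch] + neighbor_value) % len(ALPHABET)]
            out.append(table[shifted])
        else:
            out.append(ch)
    return "".join(out)


def tokenize(data: str, n_pipelines: int, table: dict[str, str]) -> str:
    size = max(1, -(-len(data) // n_pipelines))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_pipelines) as pool:
        firsts = list(pool.map(lambda c: first_pass(c, table), chunks))
        # Value each pipeline publishes: a checksum of its first-pass output.
        values = [sum(map(ord, f)) % len(ALPHABET) for f in firsts]
        # Each pipeline receives the value of the previous one (wrapping around).
        received = [values[i - 1] for i in range(len(chunks))]
        seconds = list(
            pool.map(lambda a: second_pass(a[0], a[1], table), zip(firsts, received))
        )
    return "".join(seconds)  # combine the pipeline outputs


table = build_token_table(seed=7)
token = tokenize("αβγδεζηθ", n_pipelines=2, table=table)
```

Threads are used here for brevity; separate worker processes or machines would follow the same pattern of a parallel first pass, a value exchange, and a parallel second pass.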

PHONETICS-BASED COMPUTER TRANSLITERATION TECHNIQUES
20170371850 · 2017-12-28

Computer-implemented techniques can include obtaining, by a computer server having one or more processors, a phonetics-based character mapping between a source script and a different target script, the phonetics-based character mapping relating characters in the source and target scripts that have similar sounds or pronunciations. The techniques can include encoding, by the computer server, each character of the phonetics-based character mapping using an encoding scheme to obtain an encoded character mapping, wherein the encoding scheme is common to both the source and target scripts. The techniques can include generating, by the computer server, a mapping function that directly maps encoded source script characters to encoded target script characters in the encoded character mapping. The techniques can also include in response to a transliteration request, utilizing, by the computer server, the mapping function to transliterate a text from the source script to the target script.
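A compact way to see the three steps (phonetic mapping, common encoding, direct mapping function) is the Python sketch below. It assumes Unicode code points as the encoding scheme common to both scripts and uses a small, hypothetical Greek-to-Latin table; neither choice is taken from the abstract.

```python
# Hypothetical phonetics-based mapping between a source and a target script.
PHONETIC_MAPPING = {"μ": "m", "α": "a", "τ": "t", "ι": "i", "κ": "k", "ο": "o"}


def encode_mapping(mapping: dict[str, str]) -> dict[int, int]:
    """Encode both sides with one scheme (assumed here: Unicode code points)."""
    return {ord(src): ord(dst) for src, dst in mapping.items()}


def make_transliterator(encoded: dict[int, int]):
    """Build a function that maps encoded source characters directly to targets."""
    def transliterate(text: str) -> str:
        # str.translate accepts a dict keyed by code point.
        return text.translate(encoded)
    return transliterate


encoded = encode_mapping(PHONETIC_MAPPING)
transliterate = make_transliterator(encoded)
result = transliterate("ματι")
```

Characters missing from the table pass through unchanged, so mixed-script input still produces output.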

Precise Encoding and Direct Keyboard Entry of Chinese as Extension of Pinyin
20170364486 · 2017-12-21

Encoding Chinese systematically in a one-to-one correspondence between a linear code and a character or word has been a century-old challenge. Based on the official standards for Pinyin and for the stroke order of characters, with which all Chinese users are familiar, this invention comprises: (1) encoding all characters and words of a predetermined set or dictionary into distinct codes in an electronic system such as a computer; (2) retrieving a character or word by decoding the user's keyboard input, and then entering the corresponding character or word into the system. Denoted inside [ ], the proposed Pinyin+X coding format is [Pinyin+X]=[Pinyin]+[3-Stroke]+[Extra], where [3-Stroke] consists of three consonant letters coding for the first, second, and last strokes of the written form of the character or word, and [Extra] is one or more system-generated consonant letters that ensure the uniqueness of the entire [Pinyin+X] code. The Pinyin+X keyboard entry process for Chinese can therefore be designed to be direct, so that every keystroke counts and none is extra.
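The [Pinyin]+[3-Stroke]+[Extra] composition can be sketched as follows. The stroke-to-letter table, the consonant ordering, the rule that [Extra] is empty when the code is already unique, and the placeholder entries are all assumptions for illustration; the abstract does not specify any of them, and the entries are not real dictionary data.

```python
from itertools import product

# Hypothetical letters for five stroke categories.
STROKE_LETTER = {"heng": "h", "shu": "s", "pie": "p", "dian": "d", "zhe": "z"}
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def base_code(pinyin: str, first: str, second: str, last: str) -> str:
    """[Pinyin] followed by three letters for the first, second, and last strokes."""
    return pinyin + "".join(STROKE_LETTER[s] for s in (first, second, last))


def extra_suffixes():
    """Yield '', then 'b', 'c', ..., then two-letter suffixes, and so on."""
    yield ""
    length = 1
    while True:
        for combo in product(CONSONANTS, repeat=length):
            yield "".join(combo)
        length += 1


def assign_codes(entries: dict[str, tuple[str, str, str, str]]) -> dict[str, str]:
    """Give every entry a distinct code, adding [Extra] only on collision."""
    codes: dict[str, str] = {}
    used: set[str] = set()
    for name, (pinyin, first, second, last) in entries.items():
        base = base_code(pinyin, first, second, last)
        for extra in extra_suffixes():
            if base + extra not in used:
                used.add(base + extra)
                codes[name] = base + extra
                break
    return codes


# Placeholder entries: two share Pinyin and strokes, so one needs an [Extra] letter.
entries = {
    "entry_a": ("shi", "heng", "shu", "heng"),
    "entry_b": ("shi", "heng", "shu", "heng"),
    "entry_c": ("tu", "heng", "shu", "heng"),
}
codes = assign_codes(entries)
```

Decoding keyboard input then reduces to a reverse lookup from code to entry, since every code is distinct.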

SAMPLE GENERATION METHOD, MODEL TRAINING METHOD, TRAJECTORY RECOGNITION METHOD, DEVICE, AND MEDIUM
20230195998 · 2023-06-22

Disclosed are a sample generation method, a model training method, a trajectory recognition method, a device, and a medium. The sample generation method includes: determining a code result of a training Chinese character according to a preset code library, where the preset code library is generated based on code characters in a five-stroke code corpus; taking the code result as a training label of the training Chinese character; and generating a training sample according to both a writing trajectory and the training label of the training Chinese character. This enriches the amount of information carried in the training sample.
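A minimal Python sketch of the sample generation steps follows. It reads the "preset code library" as an index over the code letters found in a five-stroke corpus, which is one plausible interpretation rather than something the abstract states. The two corpus entries, the trajectory points, and all names are illustrative and should be checked against a real five-stroke table before use.

```python
from dataclasses import dataclass

# Illustrative five-stroke corpus entries (assumption for this sketch).
FIVE_STROKE_CORPUS = {"人": "wwww", "中": "khk"}


def build_code_library(corpus: dict[str, str]) -> dict[str, int]:
    """Index every code letter that appears in the corpus."""
    letters = sorted({letter for code in corpus.values() for letter in code})
    return {letter: idx for idx, letter in enumerate(letters)}


@dataclass
class TrainingSample:
    trajectory: list[tuple[float, float]]  # pen positions of the handwriting
    label: list[int]                       # code result used as training label


def generate_sample(
    char: str,
    trajectory: list[tuple[float, float]],
    corpus: dict[str, str],
    library: dict[str, int],
) -> TrainingSample:
    """Look up the character's code, encode it, and pair it with the trajectory."""
    label = [library[letter] for letter in corpus[char]]
    return TrainingSample(trajectory, label)


library = build_code_library(FIVE_STROKE_CORPUS)
sample = generate_sample("中", [(0.0, 0.0), (1.0, 0.0)], FIVE_STROKE_CORPUS, library)
```

The label is a sequence of code indices rather than a single class id, which is how the sample carries more structural information about the character.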