G10L15/1807

Systems and methods for automating validation and quantification of interview question responses

In an illustrative embodiment, systems and methods for automating candidate video assessments include receiving, from a candidate for an available position, a submission that includes baseline response video segments and question response video segments. The system can determine, from detected nonverbal features within the baseline response video segments, nonverbal baseline scores. For each of the interview questions, candidate response attributes can be detected including a response direction, a response speed, and nonverbal features. A nonverbal reaction score is calculated from the detected nonverbal features and the baseline scores. A response score can be calculated from the response direction and response speed, and a trustworthiness score is determined based on a correspondence between the response score and the nonverbal reaction score. A next interview question can be determined in real time from a benchmarked version of the response score. Overall scores reflecting candidate trustworthiness can be presented within a user interface screen.

Detecting non-verbal, audible communication conveying meaning

Various embodiments of the invention provide methods, systems, and computer-program products for analyzing audio to capture semantic and non-semantic characteristics of the audio and the relationships between them. In particular embodiments, the audio is segmented into a set of utterance segments, in which a party on the audio is speaking, and a set of noise segments, in which the party is not speaking. The semantic and non-semantic characteristics are then captured for each of the utterance segments. Specifically, speech analytics is performed on each segment to identify the words spoken by the party in the segment as semantic characteristics. Further, laughter, emotion, and sentence boundary detection are performed on each segment to identify occurrences of each in the segment as non-semantic characteristics. Once these characteristics are identified for each segment, various embodiments of the invention involve constructing a transcript based on the identified semantic and non-semantic characteristics.
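The transcript-construction step described in this abstract can be pictured with a minimal Python sketch. Everything named here is hypothetical: the `Segment` record, the `build_transcript` function, and the bracketed-tag output format are illustrative assumptions, not taken from the patent, and the detection steps themselves (speech analytics, laughter, emotion, and sentence boundary detection) are assumed to have already run.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One audio segment after analysis (hypothetical structure)."""
    kind: str                                          # "utterance" or "noise"
    words: List[str] = field(default_factory=list)     # semantic characteristics
    events: List[str] = field(default_factory=list)    # non-semantic, e.g. "laughter"
    ends_sentence: bool = False                        # sentence boundary detected


def build_transcript(segments: List[Segment]) -> str:
    """Join utterance segments into text, annotating non-semantic events."""
    parts = []
    for seg in segments:
        if seg.kind != "utterance":
            continue  # noise segments contribute nothing to the transcript
        text = " ".join(seg.words)
        if seg.ends_sentence and text:
            text += "."
        tags = "".join(f" [{event}]" for event in seg.events)
        parts.append((text + tags).strip())
    return " ".join(p for p in parts if p)


segments = [
    Segment("utterance", ["thanks", "for", "calling"], ["laughter"], True),
    Segment("noise"),
    Segment("utterance", ["how", "can", "i", "help"], [], True),
]
print(build_transcript(segments))
```

Keeping the semantic words and the non-semantic events on the same record is one simple way to preserve the relationship between them that the abstract mentions.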

SYSTEMS AND METHODS FOR CLASSIFICATION AND RATING OF CALLS BASED ON VOICE AND TEXT ANALYSIS
20230335153 · 2023-10-19

Methods and systems include sending recording data of a call to a first server and a second server, wherein the recording data includes a first voice of a first participant of the call and a second voice of a second participant of the call; receiving, from the first server, a first emotion score representing a degree of a first emotion associated with the first voice, and a second emotion score representing a degree of a second emotion associated with the first voice; receiving, from the second server, a first sentiment score, a second sentiment score, and a third sentiment score; determining a quality score and classification data for the recording data based on the first emotion score, the second emotion score, the first sentiment score, the second sentiment score, and the third sentiment score; and outputting the quality score and the classification data for visualization of the recording data.

CONTENT OUTPUT MANAGEMENT BASED ON SPEECH QUALITY

Techniques for ensuring content output to a user conforms to a quality of the user's speech, even when a speechlet or skill ignores the speech's quality, are described. When a system receives speech, the system determines an indicator of the speech's quality (e.g., whispered, shouted, fast, or slow) and persists the indicator in memory. When the system receives output content from a speechlet or skill, the system checks whether the output content conforms to the speech quality indicator. If the content conforms to the indicator, the system may cause the content to be output to the user without further manipulation. If the content does not conform, the system may manipulate the content to bring it into conformity with the speech quality indicator and output the manipulated content to the user.
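The persist-then-check flow in this abstract can be sketched in a few lines of Python. The `Session` and `OutputContent` names, and the representation of a speech quality indicator as a plain string, are assumptions made for illustration; the abstract does not say how the indicator is stored or how content is manipulated.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OutputContent:
    """Content returned by a speechlet or skill (hypothetical structure)."""
    text: str
    quality: str  # e.g. "normal", "whispered", "shouted"


class Session:
    """Holds the persisted speech quality indicator for one user interaction."""

    def __init__(self) -> None:
        self.speech_quality = "normal"

    def on_speech(self, detected_quality: str) -> None:
        # Persist the indicator determined from the user's speech.
        self.speech_quality = detected_quality

    def on_skill_output(self, content: OutputContent) -> OutputContent:
        # Conforming content passes through without further manipulation.
        if content.quality == self.speech_quality:
            return content
        # Otherwise return a copy rendered in the user's speech quality.
        return replace(content, quality=self.speech_quality)


session = Session()
session.on_speech("whispered")
reply = session.on_skill_output(OutputContent("The time is 9 pm.", "normal"))
print(reply.quality)
```

Because the check sits between the skill and the output stage, a skill that ignores speech quality needs no changes of its own.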

SYSTEMS AND METHODS FOR CORRECTING A VOICE QUERY BASED ON A SUBSEQUENT VOICE QUERY WITH A LOWER PRONUNCIATION RATE
20230029107 · 2023-01-26

Systems and methods for correcting a voice query based on a subsequent voice query with a lower pronunciation rate are described. In some aspects, the systems and methods calculate first and second pronunciation rates of first and second voice queries. The systems and methods determine that the second pronunciation rate is lower than the first pronunciation rate and determine a first candidate pronunciation time for a first candidate word from the first voice query. The systems and methods determine a second candidate pronunciation time, adjusted to the first pronunciation rate, for a second candidate word from the second voice query. The systems and methods determine that the first candidate pronunciation time matches the second candidate pronunciation time and generate a third voice query based on the first voice query by replacing the first candidate word with the second candidate word.
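A minimal Python sketch of the matching logic in this abstract follows, under several assumptions that the abstract does not state: "pronunciation rate" is taken as words per second, "pronunciation time" as a word's start offset from the beginning of its query, and two times "match" when they differ by at most a small tolerance. The `Word` record and both function names are hypothetical, and each query is assumed to be non-empty with a non-zero duration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def pronunciation_rate(words: List[Word]) -> float:
    """Words per second over the spoken span of the query."""
    return len(words) / (words[-1].end - words[0].start)


def correct_query(first: List[Word], second: List[Word],
                  tolerance: float = 0.05) -> Optional[List[str]]:
    """Return the first query with one word replaced, or None if no fix applies."""
    rate1 = pronunciation_rate(first)
    rate2 = pronunciation_rate(second)
    if rate2 >= rate1:
        return None  # the second query was not spoken more slowly
    # Scale factor below 1 maps slow-query offsets onto the first query's time base.
    scale = rate2 / rate1
    for i, w1 in enumerate(first):
        t1 = w1.start - first[0].start
        for w2 in second:
            t2_adjusted = (w2.start - second[0].start) * scale
            if abs(t1 - t2_adjusted) <= tolerance and w1.text != w2.text:
                texts = [w.text for w in first]
                texts[i] = w2.text  # replace the first candidate word
                return texts
    return None


# The second query repeats the first at half speed, and "weather" is now recognized.
first = [Word("show", 0.0, 0.4), Word("me", 0.4, 0.6),
         Word("the", 0.6, 0.8), Word("whether", 0.8, 1.2)]
second = [Word("show", 0.0, 0.8), Word("me", 0.8, 1.2),
          Word("the", 1.2, 1.6), Word("weather", 1.6, 2.4)]
print(correct_query(first, second))
```

Rescaling by the ratio of the two rates is what lets a word at 1.6 s in the slow query line up with the word at 0.8 s in the fast one.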

SYSTEM AND METHOD FOR CONVERSATIONAL AGENT VIA ADAPTIVE CACHING OF DIALOGUE TREE
20230018473 · 2023-01-19

The present teaching relates to a method, system, medium, and implementations for managing a user-machine dialogue. Sensor data is received at a device, including an utterance representing a speech of a user engaged in a dialogue with the device. The speech of the user is determined based on the utterance, and a local dialogue manager residing on the device searches a sub-dialogue tree stored on the device for a response to the user. The response, if identified from the sub-dialogue tree, is rendered to the user in response to the speech. If the response is not available in the sub-dialogue tree, a request for the response is sent to a server.
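The local-lookup-with-server-fallback flow can be illustrated with a short Python sketch. It makes simplifying assumptions: the sub-dialogue tree is flattened to a dictionary keyed by normalized speech, the server is any callable, and storing the server's answer locally is an illustrative guess at the "adaptive caching" in the title rather than something the abstract specifies. `LocalDialogueManager` and `fetch_from_server` are hypothetical names.

```python
from typing import Callable, Dict


class LocalDialogueManager:
    """On-device dialogue manager backed by a cached sub-dialogue tree."""

    def __init__(self, sub_tree: Dict[str, str],
                 fetch_from_server: Callable[[str], str]) -> None:
        self.sub_tree = dict(sub_tree)            # flattened stand-in for a tree
        self.fetch_from_server = fetch_from_server

    def respond(self, speech: str) -> str:
        key = speech.strip().lower()
        response = self.sub_tree.get(key)
        if response is None:
            # Not available locally: ask the server for the response.
            response = self.fetch_from_server(key)
            # Assumption: keep the answer so the next lookup stays on-device.
            self.sub_tree[key] = response
        return response


server_calls = []


def fake_server(key: str) -> str:
    """Stand-in for the remote dialogue server; records each request."""
    server_calls.append(key)
    return "server:" + key


manager = LocalDialogueManager({"hello": "Hi there!"}, fake_server)
print(manager.respond("Hello "))
print(manager.respond("what time is it"))
```

Answering from the device first avoids a network round trip for the parts of the dialogue that are already stored locally.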

Outcome-based skill qualification in cognitive interfaces for text-based and media-based interaction

One or more communication capabilities and a plurality of versions of communication terms of a cognitive interface are identified. A probability of a particular user reaction for each communication term version is determined. A desired outcome of an interaction between a user and the cognitive interface is determined. A first communication term version is selected from the plurality of communication term versions based on the determined probabilities of the communication term versions and the desired outcome. An interaction between the user and the cognitive interface is created using the selected first communication term version. The interaction is sent to a communication device associated with the user.
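One simple reading of the selection step is to pick the communication term version whose probability of producing the desired user reaction is highest. The Python sketch below assumes exactly that; the abstract does not state the selection rule, and the function name, the nested-dictionary layout, and the example phrasings are all hypothetical.

```python
from typing import Dict


def select_term_version(reaction_probs: Dict[str, Dict[str, float]],
                        desired_reaction: str) -> str:
    """Pick the version most likely to produce the desired reaction.

    reaction_probs maps each term version to its estimated probability of
    each user reaction; a missing reaction is treated as probability 0.
    """
    if not reaction_probs:
        raise ValueError("no communication term versions to choose from")
    return max(
        reaction_probs,
        key=lambda version: reaction_probs[version].get(desired_reaction, 0.0),
    )


# Hypothetical probabilities for three phrasings of the same prompt.
probs = {
    "Would you like to upgrade?": {"accept": 0.20, "decline": 0.80},
    "Upgrade now and save 10%.": {"accept": 0.35, "decline": 0.65},
    "Most users choose to upgrade.": {"accept": 0.30, "decline": 0.70},
}
print(select_term_version(probs, "accept"))
```

A real system could weigh several outcomes at once; the single-reaction argmax is used here only to keep the illustration small.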

TRANSFORMING VOICE SIGNALS TO COMPENSATE FOR EFFECTS FROM A FACIAL COVERING

In one example embodiment, audio characteristics of audio signals are adjusted by a first machine learning model to reduce effects of a facial covering and produce adjusted audio signals. The audio signals correspond to resulting voice signals produced from the facial covering affecting original voice signals. Speech characteristics are predicted for the adjusted audio signals by a second machine learning model. Transformed audio signals corresponding to the original voice signals are produced based on the adjusted audio signals and predicted speech characteristics.

Context-aware prosody correction of edited speech

Methods are performed by one or more processing devices for correcting prosody in audio data. A method includes operations for accessing subject audio data in an audio edit region of the audio data. The subject audio data in the audio edit region potentially lacks prosodic continuity with unedited audio data in an unedited audio portion of the audio data. The operations further include predicting, based on a context of the unedited audio data, phoneme durations including a respective phoneme duration of each phoneme in the unedited audio data. The operations further include predicting, based on the context of the unedited audio data, a pitch contour comprising at least one respective pitch value of each phoneme in the unedited audio data. Additionally, the operations include correcting prosody of the subject audio data in the audio edit region by applying the phoneme durations and the pitch contour to the subject audio data.

Handsfree Communication System and Method
20230031071 · 2023-02-02

A method, computer program product, and computing system for monitoring the diction of a patient within a hospital room; processing at least a portion of the diction to identify at least one communication request within the hospital room; and if at least one communication request is detected, establishing communication between the hospital room and a remote location within the hospital.