ASR-enhanced speech compression

11398239 · 2022-07-26

Abstract

A process for compressing an audio speech signal utilizes ASR processing to generate a corresponding text representation and, depending on confidence in the corresponding text representation, selectively applies more, less, or no compression to the audio signal. The result is a compressed audio signal, with corresponding text, that is compact and well suited for searching, analytics, or additional ASR processing.

Claims

1. A computer-implemented process for digitally compressing an audio signal that includes human speech utterances, the process comprising at least the following computer-implemented acts: (1) receiving a digitally encoded audio signal that includes human speech utterances in a first, uncompressed format; (2) identifying portions of the digitally encoded audio signal that correspond to speech utterances and, for each speech utterance, forming a corresponding uncompressed audio utterance; (3) performing automatic speech recognition (ASR) processing to produce, for each speech utterance, at least a corresponding (i) text representation and (ii) ASR confidence that represents a likelihood that the text representation accurately captures all spoken words contained in its corresponding uncompressed audio utterance; (4) for each speech utterance, if its ASR confidence exceeds a predetermined threshold value, then forming a corresponding compressed audio utterance in a highly compressed format; (5) forming an output stream that includes, for each speech utterance, at least: (i) its corresponding text representation; (ii) its corresponding ASR confidence; and (iii) either (a) its corresponding uncompressed audio utterance or (b) its corresponding compressed audio utterance, but not both (a) and (b), wherein the output stream contains (a) if the utterance's corresponding ASR confidence is less than or equal to the predetermined threshold value and (b) if the utterance's corresponding ASR confidence exceeds the predetermined threshold value.

2. A process, as defined in claim 1, wherein for each speech utterance, the output stream further includes metadata computed from the corresponding uncompressed audio utterance.

3. A process, as defined in claim 2, wherein the metadata includes one or more of: identity of the speaker, gender, approximate age, and/or emotion.

4. A process, as defined in claim 1, wherein the ASR confidence values are derived from normalized likelihood scores.

5. A process, as defined in claim 1, wherein the ASR confidence values are computed using an N-best homogeneity analysis.

6. A process, as defined in claim 1, wherein the ASR confidence values are computed using an acoustic stability analysis.

7. A process, as defined in claim 1, wherein the ASR confidence values are computed using a word graph hypothesis density analysis.

8. A process, as defined in claim 1, wherein the ASR confidence values are derived from associated state, phoneme, or word durations.

9. A process, as defined in claim 1, wherein the ASR confidence values are derived from language model (LM) scores or LM back-off behaviors.

10. A process, as defined in claim 1, wherein the ASR confidence values are computed using a posterior probability analysis.

11. A process, as defined in claim 1, wherein the ASR confidence values are computed using a log-likelihood-ratio analysis.

12. A process, as defined in claim 1, wherein the ASR confidence values are computed using a neural net that includes word identity and aggregated words as predictors.

13. A computer-implemented process for digitally compressing an audio signal that includes human speech utterances, the process comprising at least the following computer-implemented acts: (1) receiving a digitally encoded audio signal that includes human speech utterances in a first, lightly compressed format; (2) identifying portions of the digitally encoded audio signal that correspond to speech utterances and, for each speech utterance, forming a corresponding lightly compressed audio utterance; (3) performing automatic speech recognition (ASR) processing to produce, for each speech utterance, at least a corresponding (i) text representation and (ii) ASR confidence that represents a likelihood that the text representation accurately captures all spoken words contained in its corresponding lightly compressed audio utterance; (4) for each speech utterance, if its ASR confidence exceeds a predetermined threshold value, then forming a corresponding heavily compressed audio utterance in a highly compressed format; (5) forming an output stream that includes, for each speech utterance, at least: (i) its corresponding text representation; (ii) its corresponding ASR confidence; and (iii) either (a) its corresponding lightly compressed audio utterance or (b) its corresponding heavily compressed audio utterance, but not both (a) and (b), wherein the output stream contains (a) if the utterance's corresponding ASR confidence is less than or equal to the predetermined threshold value and (b) if the utterance's corresponding ASR confidence exceeds the predetermined threshold value.

14. A computer-implemented process for digitally compressing an audio signal that includes human speech utterances, the process comprising at least the following computer-implemented acts: (1) receiving a digitally encoded audio signal that includes human speech utterances in a first, lightly compressed format; (2) identifying portions of the digitally encoded audio signal that correspond to speech utterances and, for each speech utterance, forming a corresponding lightly compressed audio utterance; (3) performing automatic speech recognition (ASR) processing to produce, for each speech utterance, at least a corresponding (i) text representation and (ii) ASR confidence that represents a likelihood that the text representation accurately captures all spoken words contained in its corresponding lightly compressed audio utterance; (4) for each speech utterance, if its ASR confidence exceeds a predetermined threshold value, then forming a corresponding heavily compressed audio utterance in a highly compressed format; (5) forming an output stream that, for each speech utterance, consists essentially of: (i) its corresponding text representation; (ii) its corresponding ASR confidence; and (iii) either (a) its corresponding lightly compressed audio utterance or (b) its corresponding heavily compressed audio utterance, but not both (a) and (b), wherein the output stream contains (a) if the utterance's corresponding ASR confidence is less than or equal to the predetermined threshold value and (b) if the utterance's corresponding ASR confidence exceeds the predetermined threshold value.

15. A process, as defined in claim 14, wherein for each speech utterance, the output stream further consists essentially of metadata computed from the corresponding lightly compressed audio utterance.

16. A process, as defined in claim 15, wherein the metadata includes one or more of: identity of the speaker, gender, approximate age, and/or emotion.

Description

BRIEF DESCRIPTION OF THE FIGURES

(1) Aspects, features, and advantages of the present invention, and its numerous exemplary embodiments, can be further appreciated with reference to the accompanying set of figures, in which:

(2) FIG. 1 is a high-level block diagram, showing a Front End, Processing Tier, and Storage Tier;

(3) FIG. 2 depicts an exemplary embodiment in which the Front End, Processing Tier, and Storage Tier are all provisioned in one or more Data Processing Cloud(s);

(4) FIG. 3 depicts an exemplary embodiment in which the Processing and Storage Tiers are provisioned in Data Processing Cloud(s), and the Front End is provisioned On Premises;

(5) FIG. 4 depicts an exemplary embodiment in which the Front End is provisioned in a Data Processing Cloud, and the Processing and Storage Tiers are provisioned On Premises;

(6) FIG. 5 depicts an exemplary embodiment in which the Processing Tier is provisioned in a Data Processing Cloud, and the Front End and Storage Tier are provisioned On Premises;

(7) FIG. 6 depicts an exemplary embodiment in which the Front End and Storage Tier are provisioned in Data Processing Cloud(s), and the Processing Tier is provisioned On Premises;

(8) FIG. 7 depicts an exemplary embodiment in which the Storage Tier is provisioned in a Data Processing Cloud, and the Front End and Processing Tier are provisioned On Premises;

(9) FIG. 8 depicts an exemplary embodiment in which the Front End and Processing Tier are provisioned in Data Processing Cloud(s), and the Storage Tier is provisioned On Premises;

(10) FIG. 9 depicts an exemplary embodiment in which the Front End, Processing Tier, and Storage Tier are provisioned On Premises;

(11) FIG. 10 depicts an exemplary embodiment in which the Front End, Processing Tier, and Storage Tier are provisioned partly in Data Processing Cloud(s) and partly On Premises;

(12) FIG. 11 depicts an exemplary embodiment in which the Front End and Processing Tier are provisioned partly in Data Processing Cloud(s) and partly On Premises, and the Storage Tier is provisioned On Premises;

(13) FIG. 12 depicts an exemplary embodiment in which the Front End and Processing Tier are provisioned partly in Data Processing Cloud(s) and partly On Premises, and the Storage Tier is provisioned in a Data Processing Cloud;

(14) FIG. 13 depicts an exemplary embodiment in which the Front End and Storage Tier are provisioned partly in Data Processing Cloud(s) and partly On Premises, and the Processing Tier is provisioned On Premises;

(15) FIG. 14 depicts an exemplary embodiment in which the Front End and Storage Tier are provisioned partly in Data Processing Cloud(s) and partly On Premises, and the Processing Tier is provisioned in a Data Processing Cloud;

(16) FIG. 15 depicts an exemplary embodiment in which the Processing and Storage Tiers are provisioned partly in Data Processing Cloud(s) and partly On Premises, and the Front End is provisioned On Premises;

(17) FIG. 16 depicts an exemplary embodiment in which the Processing and Storage Tiers are provisioned partly in Data Processing Cloud(s) and partly On Premises, and the Front End is provisioned in a Data Processing Cloud;

(18) FIG. 17 depicts an exemplary embodiment in which the Storage Tier is provisioned partly in a Data Processing Cloud and partly On Premises, and the Front End and Processing Tier are provisioned On Premises;

(19) FIG. 18 depicts an exemplary embodiment in which the Storage Tier is provisioned partly in a Data Processing Cloud and partly On Premises, the Front End is provisioned On Premises, and the Processing Tier is provisioned in a Data Processing Cloud;

(20) FIG. 19 depicts an exemplary embodiment in which the Storage Tier is provisioned partly in a Data Processing Cloud and partly On Premises, the Processing Tier is provisioned On Premises, and the Front End is provisioned in a Data Processing Cloud;

(21) FIG. 20 depicts an exemplary embodiment in which the Storage Tier is provisioned partly in a Data Processing Cloud and partly On Premises, and the Front End and Processing Tier are provisioned in Data Processing Cloud(s);

(22) FIG. 21 depicts an exemplary embodiment in which the Processing Tier is provisioned partly in a Data Processing Cloud and partly On Premises, and the Front End and Storage Tier are provisioned On Premises;

(23) FIG. 22 depicts an exemplary embodiment in which the Processing Tier is provisioned partly in a Data Processing Cloud and partly On Premises, the Front End is provisioned On Premises, and the Storage Tier is provisioned in a Data Processing Cloud;

(24) FIG. 23 depicts an exemplary embodiment in which the Processing Tier is provisioned partly in a Data Processing Cloud and partly On Premises, the Storage Tier is provisioned On Premises, and the Front End is provisioned in a Data Processing Cloud;

(25) FIG. 24 depicts an exemplary embodiment in which the Processing Tier is provisioned partly in a Data Processing Cloud and partly On Premises, and the Front End and Storage Tier are provisioned in Data Processing Cloud(s);

(26) FIG. 25 depicts an exemplary embodiment in which the Front End is provisioned partly in a Data Processing Cloud and partly On Premises, and the Processing and Storage Tiers are provisioned On Premises;

(27) FIG. 26 depicts an exemplary embodiment in which the Front End is provisioned partly in a Data Processing Cloud and partly On Premises, the Processing Tier is provisioned On Premises, and the Storage Tier is provisioned in a Data Processing Cloud;

(28) FIG. 27 depicts an exemplary embodiment in which the Front End is provisioned partly in a Data Processing Cloud and partly On Premises, the Storage Tier is provisioned On Premises, and the Processing Tier is provisioned in a Data Processing Cloud;

(29) FIG. 28 depicts an exemplary embodiment in which the Front End is provisioned partly in a Data Processing Cloud and partly On Premises, and the Processing and Storage Tiers are provisioned in Data Processing Cloud(s);

(30) FIGS. 29-31 show three examples of possible data flows between the Front End, Processing Tier, and Storage Tier, in accordance with the various exemplary embodiments herein;

(31) FIG. 32 depicts a first ASR-enhanced speech compression flow, in accordance with some of the exemplary embodiments herein;

(32) FIG. 33 depicts a second ASR-enhanced speech compression flow, in accordance with another of the exemplary embodiments herein;

(33) FIG. 34 depicts a first exemplary assignment of the ASR-enhanced speech compression flow elements (of FIGS. 32-33) between the Front End, Processing, and Storage Tiers, in accordance with certain exemplary embodiments herein;

(34) FIG. 35 depicts a second exemplary assignment of the ASR-enhanced speech compression flow elements (of FIGS. 32-33) between the Front End, Processing, and Storage Tiers, in accordance with certain exemplary embodiments herein;

(35) FIG. 36 depicts a third exemplary assignment of the ASR-enhanced speech compression flow elements (of FIGS. 32-33) between the Front End, Processing, and Storage Tiers, in accordance with certain exemplary embodiments herein; and,

(36) FIG. 37 depicts a fourth exemplary assignment of the ASR-enhanced speech compression flow elements (of FIGS. 32-33) between the Front End, Processing, and Storage Tiers, in accordance with certain exemplary embodiments herein.

FURTHER DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

(37) Referring to FIGS. 10-28, the depicted data link(s) are each preferably provisioned as one or more secure data tunnels, using, for example, a secure shell (SSH) protocol. See https://en.wikipedia.org/wiki/Tunneling_protocol (incorporated by reference herein). Indeed, such SSH-provisioned data link(s) may be used to support any, or all, of the data communication links between functional or structural components of the various embodiments herein.

(38) Reference is now made to FIGS. 29-31, which show three examples of possible data flows between the Front End, Processing Tier, and Storage Tier, in accordance with the various exemplary embodiments herein. While the arrows depict the directionality of “data” transfers (such as audio data, text, meta-data, etc.), it will be understood that there will or may exist additional signaling and control flows that may simultaneously operate in other directions or between other components. For example, in FIG. 29, data flow from the Front End to the Storage Tier is indicated as one way; however, those skilled in the art will appreciate that there will likely be some signaling channel or pathway that, for example, permits the Storage Tier to signal to the Front End its readiness to receive data.

(39) FIG. 29 depicts an exemplary embodiment in which data is collected by or originates in the Front End and is then “sent” or “routed” to the Storage Tier by, for example, a push protocol (see, e.g., https://en.wikipedia.org/wiki/Push_technology, incorporated by reference herein) or a pull/get protocol (see, e.g., https://en.wikipedia.org/wiki/Pull_technology, incorporated by reference herein). Data is then sent from the Storage Tier to the Processing Tier for processing, with the processed data sent back to the Storage Tier for storage/archiving. This embodiment also permits data that already exists in the Storage Tier, or that is sent there through other network connections, to be routed to the Processing Tier for processing and sent back for storage/archiving.

(40) FIG. 30 depicts an exemplary embodiment in which data is collected by or originates in the Front End, is then sent directly to the Processing Tier for processing, and then sent to the Storage Tier for storage/archiving. Such direct data transfer from the Front End to the Processing Tier reduces latency, which is important in the case of systems that have “real time” monitoring or alerting aspects. This embodiment also permits data that already exists in the Storage Tier, or is sent there through other network connections, to be routed to the Processing Tier for processing and sent back for storage/archiving. Additionally, though not depicted in FIG. 30, “real time” systems may interact directly with the Processing Tier to receive processed data without the additional latency associated with the Storage Tier. (A preferred form of “real time” data acquisition utilizes the assignee's direct-to-transcript technology, as disclosed in co-pending U.S. patent application Ser. No. 16/371,011, “On-The-Fly Transcription/Redaction Of Voice-Over-IP Calls,” filed Mar. 31, 2019, incorporated by reference herein.)

(41) FIG. 31 depicts a “hybrid” embodiment, in which data is collected by or originates in the Front End, some or all of which may be then sent directly to the Processing Tier for processing, then sent to the Storage Tier for storage/archiving, and some or all of which may also be sent to the Storage Tier, from which it is then sent to the Processing Tier for processing, with the processed data sent back to the Storage Tier for storage/archiving. This permits use of the direct data routing approach for “real time” audio feeds, and lower cost “batch mode” processing for other data feeds, which can be processed during time(s) when power and cloud resources are cheaper, for example.

(42) FIGS. 32-33 depict exemplary flow diagrams for improved speech compression flows that employ ASR confidence (and/or other metadata) to determine whether to additionally compress the audio speech signal. Further detail regarding the depicted elements is as follows:

(43) Receive audio: Audio may be received or obtained from any source, whether a “live” feed (such as CTI, VOIP tap, PBX) or a recorded source (such as on-prem storage, cloud storage, or a combination thereof). A preferred source utilizes the assignee's DtT technology, as described in the commonly owned, co-pending application Ser. No. 16/371,011.

(44) VAD: Voice activity detection is an optional step. Its main function is to eliminate dead space, thereby improving the utilization efficiency of more compute-intensive resources, such as the ASR engine, and of storage resources. VAD algorithms are well known in the art. See https://en.wikipedia.org/wiki/Voice_activity_detection (incorporated by reference herein).
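By way of illustration only, a minimal energy-based VAD may be sketched as follows; the frame length and energy threshold shown are hypothetical parameters, not values prescribed by the embodiments herein:

```python
def frame_energy(samples, frame_len=160):
    """Mean-square energy of each complete frame of the sample sequence."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [sum(s * s for s in f) / len(f) for f in frames]

def vad(samples, frame_len=160, threshold=0.01):
    """Per-frame speech mask: True where frame energy exceeds the threshold."""
    return [e > threshold for e in frame_energy(samples, frame_len)]
```

Production VAD implementations typically add hangover smoothing and adaptive noise-floor estimation; the sketch above omits both for brevity.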

(45) Segregate: Segregation of the speech input into words or utterances (preferred) is performed as an initial step to ASR decoding. Though depicted as a distinct step, it may be performed as part of the VAD or ASR processes.
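Under the assumption that a per-frame speech/non-speech mask (e.g., from the VAD step) is available, the segregation step can be illustrated as extraction of contiguous speech runs; this is a hypothetical helper, not a required implementation:

```python
def segregate(mask):
    """Given a per-frame speech mask, return (start, end) frame index pairs
    for each contiguous run of speech frames (end index is exclusive)."""
    utterances, start = [], None
    for i, is_speech in enumerate(mask):
        if is_speech and start is None:
            start = i                        # speech run begins
        elif not is_speech and start is not None:
            utterances.append((start, i))    # speech run ends
            start = None
    if start is not None:                    # run extends to end of input
        utterances.append((start, len(mask)))
    return utterances
```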

(46) Confidence: Confidence may be determined either by the ASR engine (preferred) or using a separate confidence classifier. The confidence classifier may operate from the same input stream as the ASR, or may utilize both the input and output of the ASR in its computation.
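One of the confidence measures recited in the claims (normalized likelihood scores over an N-best list) can be sketched as follows; the softmax normalization shown is an illustrative assumption, not the only way to normalize engine scores:

```python
import math

def nbest_confidence(log_likelihoods):
    """Confidence in [0, 1]: posterior mass of the best hypothesis after
    softmax-normalizing the N-best log-likelihood scores."""
    m = max(log_likelihoods)                      # shift for numerical stability
    weights = [math.exp(s - m) for s in log_likelihoods]
    return max(weights) / sum(weights)
```

When one hypothesis dominates the N-best list, the confidence approaches 1; when the list is homogeneous (many near-tied hypotheses), it falls toward 1/N, echoing the N-best homogeneity analysis of claim 5.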

(47) Low ASR confidence: If ASR confidence dips below a “threshold” value, then the word, phrase, or utterance in question is passed uncompressed (or only lightly compressed) to the output stream. In some embodiments, the “threshold” is preset; in other embodiments, it may vary dynamically, based, for example, on a moving average of the confidence values observed by the system.
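The thresholding logic, including the dynamic variant based on a moving average of recently observed confidence values, can be sketched as follows; the window length and default threshold are hypothetical choices:

```python
from collections import deque

class ConfidenceRouter:
    """Route each utterance to heavy compression or pass-through based on
    its ASR confidence versus a preset or dynamically adapted threshold."""

    def __init__(self, preset=0.8, window=100, dynamic=False):
        self.preset = preset
        self.dynamic = dynamic
        self.recent = deque(maxlen=window)   # moving window of confidences

    def threshold(self):
        # Dynamic mode: moving average of recently observed confidences.
        if self.dynamic and self.recent:
            return sum(self.recent) / len(self.recent)
        return self.preset

    def route(self, confidence):
        """Return 'compress' if confidence exceeds the threshold, else 'pass'."""
        decision = 'compress' if confidence > self.threshold() else 'pass'
        self.recent.append(confidence)
        return decision
```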

(48) FIGS. 34-37 depict how these functions/modules can be assigned amongst the Front End, Processing Tier, and Storage Tier, in accordance with the invention herein. These, however, are merely illustrative, and not meant to be in any way limiting. Furthermore, FIGS. 2-28 illustrate how the Front End, Processing Tier, and Storage Tier functions may be provisioned in cloud(s), on premises, or in embodiments that bridge the on-prem/cloud boundary with secure data links.
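For concreteness, the overall flow of FIGS. 32-33 can be tied together in a single sketch. The stubbed ASR engine and the use of zlib as the “heavy” codec are illustrative assumptions only; any recognizer and any speech codec could fill these roles:

```python
import zlib

def stub_asr(audio_bytes):
    """Hypothetical stand-in for a real ASR engine: returns (text, confidence)."""
    return ("hello world", 0.95)

def compress_stream(utterances, threshold=0.8, asr=stub_asr):
    """Emit, per utterance, its text, its ASR confidence, and either the
    original audio (low confidence) or heavily compressed audio (high
    confidence), but never both."""
    stream = []
    for audio in utterances:
        text, conf = asr(audio)
        payload = zlib.compress(audio) if conf > threshold else audio
        stream.append({"text": text, "confidence": conf, "audio": payload})
    return stream
```

Note that each output record carries the text representation and confidence alongside the selected audio variant, yielding the compact, search-ready stream described in the Abstract.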