SPEECH SEPARATION AND RECOGNITION METHOD FOR CALL CENTERS
20230008613 · 2023-01-12
Assignee
Inventors
- Van Hai Do (Ha Noi City, VN)
- Nhat Minh Le (Ha Noi City, VN)
- Tung Lam Nguyen (Ha Noi City, VN)
- Quang Trung Le (Vinh Tuong District, VN)
- Tien Thanh Nguyen (Tien Du District, VN)
- Dang Linh Le (Ha Noi City, VN)
- Dinh Son Dang (Ha Noi City, VN)
- Thi Ngoc Anh Nguyen (Vinh City, VN)
- Minh Khang Pham (Ha Noi City, VN)
- Ngoc Dung Nguyen (Ha Noi City, VN)
- Manh Quan Tran (Ha Noi City, VN)
- Manh Quy Nguyen (Ha Noi City, VN)
Cpc classification
International classification
Abstract
The present invention provides a method for speech separation and recognition. It overcomes the disadvantages of existing techniques by providing automatic speech separation and recognition that lets managers see what their service agents and customers are saying, and thus quickly and objectively learn the wishes and concerns of customers as well as whether their service agents give accurate and correct advice. In addition, the system is continuously updated through a semi-supervised training mechanism, meaning that it can learn from actual data during operation, thereby improving its accuracy.
Claims
1. A speech separation and recognition method, comprising:
step 1: collect speech data of customer service telephone calls for analysis by retrieving audio files, each file corresponding to one customer service telephone call;
step 2: separate and label text for speech files; at this step, the audio files retrieved in step 1 are provided to a labeling system for transcribers to listen, separate and label a transcription for a service agent's and a customer's speech; the output of this step is speech sets that have been classified and labeled separately into service agent's speech set files and customer's speech set files;
step 3: create training and test sets; when the speech sets labeled into the service agent's speech set and the customer's speech set in step 2 both reach ≥H.sub.label_min data hours, in which H.sub.label_min≥10 hours to ensure a data set that is large enough, an administrator selects some of the speech set files labeled in step 2 to create a training set; the remaining files are used to create a test set, with the requirement that the test set size be larger than H.sub.test_min data hours, where H.sub.test_min≥2 hours to ensure that the test set is large enough and reliable;
step 4: build two language models, a first language model LM.sub.a for agents and a second language model LM.sub.b for customers, based on the training data sets created in step 3, to store spoken language features including phrases frequently spoken by the service agents and the customers, in order to distinguish the statements of the service agents from those of the customers in following steps, wherein the language models can be n-grams or neural network-based models;
step 5: collect speech data files of telephone calls that need processing for automatic speech separation and recognition, each file corresponding to one customer service telephone call;
step 6: automatically cut speech files into small segments; for each speech file obtained in step 5, the speech is automatically cut into segments based on signal characteristics;
step 7: extract speaker feature vectors; all speech segments obtained in step 6 are processed by a pre-trained feature extraction network to obtain speaker feature vectors, wherein each speech segment obtains a corresponding speaker feature vector;
step 8: cluster speech segments; for each speech file, cluster the speech segments of step 6 into two clusters C.sub.1 and C.sub.2 based on the speaker feature vectors extracted in step 7;
step 9: convert speech to text, converting all speech segments of step 6 to text using a speech recognition system, with each speech segment obtaining a corresponding text and a recognition confidence score CS with a value ranging from 0 to 1;
step 10: select the speech segments satisfying the conditions as a basis for classification; for each speech file, select the speech segments of step 9 that satisfy the condition of having confidence score CS>α, where 0.5≤α≤0.95, to eliminate speech segments with too low confidence; if no satisfactory speech segment is selected, skip the current file and move to a new speech file.
2. The method according to claim 1, wherein in step 7, the pre-trained feature extraction network comprises a deep learning neural network (DNN).
3. The method according to claim 1, where the audio files are retrieved directly from storage devices comprising hard drives.
4. The method according to claim 1, where the audio files are retrieved directly from storage devices comprising magnetic tapes.
5. The method according to claim 1, where the audio files are retrieved through data network connections.
6. The method according to claim 1, where the audio files are obtained directly on a user's storage device.
7. The method according to claim 1, where the audio files are obtained using file transfer protocols such as FTP to obtain the speech signals.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0006] The present invention aims to provide a method for speech separation and recognition of service agents and customers, with semi-supervised training, in telephone call centers, to automate the monitoring of customer service telephone calls.
[0007] Specifically, the present invention provides a method including:
[0008] Step 1: collect speech data of customer service telephone calls for analysis. This step can be done by different methods, such as retrieving audio files directly from storage devices (hard drives, magnetic tapes, etc.) or through data network connections; each file corresponds to one customer service telephone call. Speech files can be obtained directly from the user's storage device or through file transfer protocols such as FTP;
[0009] Step 2: separate and label text for speech files; at this step, the files from step 1 are provided to the labeling system for transcribers to listen, separate, and label the transcription for the service agent's and the customer's speech; the output of this step is the speech sets that have been classified and labeled separately into the service agent's speech set and the customer's speech set;
[0010] Step 3: create training and test sets; when the speech data labeled into the service agent's speech set and the customer's speech set in step 2 both reach ≥H.sub.label_min data hours, in which H.sub.label_min≥10 hours to ensure the data set is large enough, the administrator selects some of the speech files labeled in step 2 to create the training set; the remaining files are used to create the test set, with the requirement that the test set size be larger than H.sub.test_min data hours, where H.sub.test_min≥2 hours to ensure that the test set is large enough and reliable;
[0011] Step 4: build two language models, LM.sub.a for agents and LM.sub.b for customers, based on the training data set created in step 3, to store spoken language features such as phrases frequently spoken by the service agent and the customer, in order to distinguish the statements of the service agent from those of the customer in the following steps; the language models can be n-grams or neural network-based models;
[0012] Step 5: collect speech data of telephone calls that need processing for automatic speech separation and recognition. This step can be done by different methods, such as retrieving audio files directly from storage devices (hard drives, magnetic tapes, etc.) or through data network connections; each file corresponds to one customer service telephone call. Speech files can be obtained directly from the user's storage device or through file transfer protocols such as FTP;
[0013] Step 6: automatically cut speech files into small segments; for each speech file obtained in step 5, the speech is automatically cut into segments based on signal characteristics; popular methods can be used, such as segmentation based on the average energy of the signal, or segmentation based on a speech recognition system;
[0014] Step 7: extract speaker feature vectors; all speech segments obtained in step 6 are processed by a pre-trained feature extraction network, such as a deep learning neural network (DNN), to obtain speaker feature vectors, wherein each speech segment obtains a corresponding speaker feature vector;
[0015] Step 8: cluster speech segments; for each speech file, cluster the speech segments in step 6 into two clusters C.sub.1 and C.sub.2 based on the speaker feature vectors extracted in step 7;
[0016] Step 9: convert speech to text; all speech segments in step 6 are converted to text using a speech recognition system, with each speech segment obtaining a corresponding text and a recognition confidence score CS ranging from 0 to 1;
[0017] Step 10: select the speech segments satisfying the conditions as a basis for classification; for each speech file, select the speech segments of step 9 that satisfy the condition of having confidence score CS≥α, where 0.5≤α≤0.95, to eliminate speech segments with too low confidence, which are often segments of too poor quality or from too noisy an environment, affecting the quality of the classification system; if no satisfactory speech segment is selected, skip the current file and move to a new speech file;
Step 11: classify speech segments of service agents and customers;
with the speech segments selected in step 10 divided into the two clusters of step 8, compute the ratio w, where PPL.sub.a1, PPL.sub.a2, PPL.sub.b1, PPL.sub.b2 are the perplexities given by the language models LM.sub.a, LM.sub.b of step 4, computed on the text of the speech segments selected in step 10; PPL.sub.a1, PPL.sub.b1 are computed for the segments in cluster C.sub.1; PPL.sub.a2, PPL.sub.b2 for the segments in cluster C.sub.2. If w≤θ, all speech segments in cluster C.sub.1 are identified as the service agent's and all speech segments in cluster C.sub.2 as the customer's; conversely, if w>θ, all speech segments in cluster C.sub.2 are identified as the service agent's and all speech segments in cluster C.sub.1 as the customer's. Threshold θ has a value in the range from 0.5 to 2.0. After this step, speech separation and recognition for the service agent and the customer is complete; if semi-supervised training is needed to improve the quality of the system, proceed to step 12; otherwise, stop;
[0018] Step 12: select speech segments satisfying the conditions to be included in the semi-supervised training set; select the speech segments of step 9 meeting the requirement of having confidence score CS>β, in which 0.8≤β≤0.99, to select speech segments with a high recognition confidence score for the semi-supervised training dataset; each such speech segment has been labeled as service agent or customer in step 11;
[0019] Step 13: choose the time to update the language models; the update is performed when the training data in the semi-supervised set exceeds a threshold of H.sub.semi_min data hours and upon the administrator's decision, where H.sub.semi_min≥10 hours so that the semi-supervised training data is large enough and reliable;
[0020] Step 14: build language models based on semi-supervised data; at this step, use the data in the semi-supervised set to build two language models, LM.sub.a_semi with service agent data and LM.sub.b_semi with customer data; then combine them with the two language models LM.sub.a, LM.sub.b of step 4 to create two language models LM.sub.a′, LM.sub.b′, with association coefficient k, where 0.1≤k≤0.8;
[0021] Step 15: update the language models; compute w.sub.0, where PPL.sub.a1, PPL.sub.a2, PPL.sub.b1, PPL.sub.b2 are the perplexities given by the language models LM.sub.a, LM.sub.b of step 4, computed on the text data of the test sets of step 3; PPL.sub.a1, PPL.sub.b1 are computed on the test set consisting of the service agent's speech segments; PPL.sub.a2, PPL.sub.b2 are computed on the test set of the customer's speech segments; then compute w.sub.1 in the same way as w.sub.0, replacing the two language models of step 4 with LM.sub.a′ and LM.sub.b′ of step 14; if w.sub.0>q*w.sub.1, update LM.sub.a with LM.sub.a′ and LM.sub.b with LM.sub.b′, where q≥1.0.
Detailed Description of the Invention
[0022] The invention is detailed below; specifically, a method for speech separation and recognition of service agents and customers, with semi-supervised training, in a customer service call center, comprising the following steps:
[0023] Step 1: collect speech data of customer service telephone calls for analysis;
[0024] Step 2: separate and label text for speech files;
[0025] Step 3: create training and test sets;
[0026] Step 4: build two language models;
[0027] Step 5: collect speech data of telephone calls that need processing for automatic speech separation and recognition;
[0028] Step 6: automatically cut speech files into small segments;
[0029] Step 7: extract speaker feature vectors;
[0030] Step 8: cluster speech segments;
[0031] Step 9: convert speech to text;
[0032] Step 10: select the speech segments satisfying the conditions as a basis for classification;
[0033] Step 11: classify speech segments of service agents and customers;
[0034] Step 12: select speech segments satisfying the conditions to be included in the semi-supervised training set;
[0035] Step 13: choose the time to update the language models;
[0036] Step 14: build language models based on semi-supervised data;
[0037] Step 15: update the language models.
[0038] The details of these steps are as follows:
[0039] Step 1: collect speech data of customer service telephone calls for analysis. This step can be done by different methods, such as retrieving audio files directly from storage devices (hard drives, magnetic tapes, etc.) or through data network connections; each file corresponds to one customer service telephone call. Speech files can be obtained directly from the user's storage device or through file transfer protocols such as FTP.
[0040] Step 2: separate and label text for speech files; at this step, the files from step 1 are provided to the labeling system for transcribers to listen, separate, and label the transcription for the service agent's and the customer's speech; the output of this step is the speech sets that have been classified and labeled separately into the service agent's speech set and the customer's speech set.
[0041] Step 3: create training and test sets; when the speech data labeled into the service agent's speech set and the customer's speech set in step 2 both reach ≥H.sub.label_min data hours, in which H.sub.label_min≥10 hours to ensure the data set is large enough, the administrator selects some of the speech files labeled in step 2 to create the training set; the remaining files are used to create the test set, with the requirement that the test set size be larger than H.sub.test_min data hours, where H.sub.test_min≥2 hours to ensure that the test set is large enough and reliable.
[0042] Step 4: build two language models, LM.sub.a for agents and LM.sub.b for customers, based on the training data set created in step 3, to store spoken language features such as phrases frequently spoken by the service agent and the customer, in order to distinguish the statements of the service agent from those of the customer in the following steps; the language models can be n-grams or neural network-based models.
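The patent leaves the language-model implementation open (n-gram or neural network). As a minimal sketch of the n-gram option, an add-alpha-smoothed bigram model with a perplexity function could look like the following; the sample utterances and the vocabulary size used for smoothing are purely illustrative and not taken from the patent:

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def perplexity(lm, sentence, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram perplexity of one sentence."""
    uni, bi = lm
    toks = ["<s>"] + sentence.split() + ["</s>"]
    log_p, n = 0.0, 0
    for a, b in zip(toks, toks[1:]):
        p = (bi[(a, b)] + alpha) / (uni[a] + alpha * vocab_size)
        log_p += math.log(p)
        n += 1
    return math.exp(-log_p / n)

# Hypothetical training text: agent turns are scripted, customer turns free-form.
lm_a = train_bigram_lm(["hello how can i help you", "thank you for calling"])
lm_b = train_bigram_lm(["my internet is down", "i want to cancel my plan"])
v = 50  # assumed vocabulary size for smoothing
utt = "how can i help you today"
ppl_agent = perplexity(lm_a, utt, v)
ppl_customer = perplexity(lm_b, utt, v)
```

An agent-like utterance should receive lower perplexity under LM.sub.a than under LM.sub.b, which is exactly the property step 11 relies on.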
[0043] Step 5: collect speech data of telephone calls that need processing for automatic speech separation and recognition. This step can be done by different methods, such as retrieving audio files directly from storage devices (hard drives, magnetic tapes, etc.) or through data network connections; each file corresponds to one customer service telephone call. Speech files can be obtained directly from the user's storage device or through file transfer protocols such as FTP.
[0044] Step 6: automatically cut speech files into small segments; for each speech file obtained in step 5, the speech is automatically cut into segments based on signal characteristics; popular methods can be used, such as segmentation based on the average energy of the signal, or segmentation based on a speech recognition system.
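The energy-based option mentioned for this step can be sketched as follows; the frame length, hop, threshold ratio, and minimum segment length are assumed values chosen for illustration, not parameters from the patent:

```python
import numpy as np

def segment_by_energy(signal, sr, frame_ms=25, hop_ms=10, ratio=0.5, min_frames=5):
    """Split a waveform into voiced segments wherever the short-time frame
    energy exceeds a fraction of the mean frame energy."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    energies = np.array([
        np.mean(signal[i:i + frame] ** 2)
        for i in range(0, len(signal) - frame, hop)
    ])
    voiced = energies > ratio * energies.mean()
    segments, start = [], None
    for idx, v in enumerate(voiced):
        if v and start is None:
            start = idx
        elif not v and start is not None:
            if idx - start >= min_frames:
                segments.append((start * hop, idx * hop + frame))
            start = None
    if start is not None and len(voiced) - start >= min_frames:
        segments.append((start * hop, len(signal)))
    return segments  # list of (start_sample, end_sample)

# Synthetic check: 1 s of loud tone, 1 s of near-silence, 1 s of loud tone.
sr = 8000
t = np.arange(sr) / sr
loud = np.sin(2 * np.pi * 300 * t)
quiet = 0.001 * np.random.randn(sr)
sig = np.concatenate([loud, quiet, loud])
segs = segment_by_energy(sig, sr)  # expect the two loud regions to be found
```

Production systems would add smoothing and hangover logic, but the thresholding idea is the same.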
[0045] Step 7: extract speaker feature vectors; all speech segments obtained in step 6 are processed by a pre-trained feature extraction network, such as a deep learning neural network (DNN), to obtain speaker feature vectors, with each speech segment obtaining a corresponding speaker feature vector.
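The patent assumes a pre-trained DNN here (x-vector-style embedders are a common choice in diarization, though the patent does not name one). The stand-in below merely pools frame statistics so the pipeline shape, one fixed-size vector per segment of any length, is visible; it is not a trained speaker embedder:

```python
import numpy as np

def embed_segment(frames):
    """Stand-in for a pre-trained speaker-embedding network: pool per-frame
    acoustic features into one fixed-size vector (mean and std pooling).
    A real system would run the frames through a trained DNN instead."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# A hypothetical segment: 40 frames of 13 acoustic features each.
seg = np.random.randn(40, 13)
vec = embed_segment(seg)  # one fixed-size speaker vector per segment
```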
[0046] Step 8: cluster speech segments; for each speech file, cluster the speech segments in step 6 into two clusters C.sub.1 and C.sub.2 based on the speaker feature vectors extracted in step 7.
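The patent does not name a clustering algorithm; a minimal two-cluster k-means over the speaker vectors, with a deterministic farthest-point initialization, is one plausible realization:

```python
import numpy as np

def two_means(vectors, iters=20):
    """Minimal k-means with k=2 over speaker embedding vectors;
    returns a 0/1 cluster label per segment."""
    X = np.asarray(vectors, dtype=float)
    # Seed centers with the two most distant points for a stable start.
    d0 = np.linalg.norm(X - X[0], axis=1)
    centers = np.stack([X[0], X[d0.argmax()]]).copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Two synthetic "speakers": embeddings drawn around two well-separated means.
rng = np.random.default_rng(1)
spk1 = rng.normal(0.0, 0.1, size=(10, 8))
spk2 = rng.normal(1.0, 0.1, size=(10, 8))
labels = two_means(np.vstack([spk1, spk2]))
```

Segments from the same speaker should end up sharing a cluster label, giving the clusters C.sub.1 and C.sub.2 that step 11 then identifies as agent or customer.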
[0047] Step 9: convert speech to text; all speech segments in step 6 are converted to text using a speech recognition system, with each speech segment obtaining a corresponding text and a recognition confidence score CS ranging from 0 to 1.
[0048] Step 10: select the speech segments satisfying the conditions as a basis for classification; for each speech file, select the speech segments of step 9 that satisfy the condition of having confidence score CS≥α, where 0.5≤α≤0.95, to eliminate speech segments with too low confidence, which are often segments of too poor quality or from too noisy an environment, affecting the quality of the classification system; if no satisfactory speech segment is selected, skip the current file and move to a new speech file.
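The filter of step 10 is a simple threshold. A sketch with hypothetical recognizer output; the α value is one choice within the claimed range 0.5≤α≤0.95:

```python
# Hypothetical recognizer output: (segment_id, text, confidence CS in [0, 1]).
results = [
    (0, "hello how can i help", 0.91),
    (1, "uh", 0.32),  # noisy segment, low confidence
    (2, "my internet is down", 0.88),
]
alpha = 0.7  # assumed threshold within the claimed range
selected = [r for r in results if r[2] >= alpha]
# Only the confidently recognized segments 0 and 2 survive the filter.
```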
Step 11: classify speech segments of service agents and customers;
with the speech segments selected in step 10 divided into the two clusters of step 8, compute the ratio w, where PPL.sub.a1, PPL.sub.a2, PPL.sub.b1, PPL.sub.b2 are the perplexities given by the language models LM.sub.a, LM.sub.b of step 4, computed on the text of the speech segments selected in step 10; PPL.sub.a1, PPL.sub.b1 are computed for the segments in cluster C.sub.1; PPL.sub.a2, PPL.sub.b2 for the segments in cluster C.sub.2. If w≤θ, all speech segments in cluster C.sub.1 are identified as the service agent's and all speech segments in cluster C.sub.2 as the customer's; conversely, if w>θ, all speech segments in cluster C.sub.2 are identified as the service agent's and all speech segments in cluster C.sub.1 as the customer's. Threshold θ has a value in the range from 0.5 to 2.0. After this step, speech separation and recognition for the service agent and the customer is complete; if semi-supervised training is needed to improve the quality of the system, proceed to step 12; otherwise, stop.
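The exact formula for w appears as an image in the original filing and is not reproduced in this text. The sketch below therefore assumes one plausible perplexity-ratio form, w = (PPL.sub.a1·PPL.sub.b2)/(PPL.sub.b1·PPL.sub.a2), where small w means cluster C.sub.1 fits the agent model better; both the formula and the θ value are illustrative assumptions:

```python
def classify_clusters(ppl_a1, ppl_b1, ppl_a2, ppl_b2, theta=1.0):
    """Decide which cluster is the agent. The patent's exact formula for w
    is an image and not reproduced here; a perplexity-ratio form is assumed:
    w small  -> C1 fits the agent LM better -> C1 = agent."""
    w = (ppl_a1 * ppl_b2) / (ppl_b1 * ppl_a2)
    if w <= theta:
        return {"C1": "agent", "C2": "customer"}
    return {"C1": "customer", "C2": "agent"}

# C1 has low perplexity under the agent LM and high under the customer LM,
# so it is labeled as the agent cluster; the reversed inputs flip the labels.
case1 = classify_clusters(ppl_a1=50, ppl_b1=200, ppl_a2=180, ppl_b2=60)
case2 = classify_clusters(ppl_a1=200, ppl_b1=50, ppl_a2=60, ppl_b2=180)
```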
[0049] Step 12: select speech segments satisfying the conditions to be included in the semi-supervised training set; select the speech segments of step 9 meeting the requirement of having confidence score CS≥β, in which 0.8≤β≤0.99, to select speech segments with a high recognition confidence score for the semi-supervised training dataset; each such speech segment has been labeled as service agent or customer in step 11.
[0050] Step 13: choose the time to update the language models; the update is performed when the training data in the semi-supervised set exceeds a threshold of H.sub.semi_min data hours and upon the administrator's decision, where H.sub.semi_min≥10 hours so that the semi-supervised training data is large enough and reliable.
[0051] Step 14: build language models based on semi-supervised data; at this step, use the data in the semi-supervised set to build two language models, LM.sub.a_semi with service agent data and LM.sub.b_semi with customer data; then combine them with the two language models LM.sub.a, LM.sub.b of step 4 to create two language models LM.sub.a′, LM.sub.b′, with association coefficient k, where 0.1≤k≤0.8.
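The patent does not specify how LM.sub.a_semi is combined with LM.sub.a using coefficient k; linear interpolation of word probabilities is a common model-combination technique and is assumed here purely as a sketch:

```python
def interpolate(p_base, p_semi, k=0.3):
    """Linear interpolation of two word-probability tables with association
    coefficient k (the patent claims 0.1 <= k <= 0.8); this realization of
    the combination step is an assumption, not taken from the patent."""
    vocab = set(p_base) | set(p_semi)
    return {w: k * p_semi.get(w, 0.0) + (1 - k) * p_base.get(w, 0.0)
            for w in vocab}

# Toy unigram tables standing in for LM_a and LM_a_semi.
p_base = {"hello": 0.6, "cancel": 0.4}
p_semi = {"hello": 0.2, "cancel": 0.5, "refund": 0.3}
p_new = interpolate(p_base, p_semi, k=0.5)
```

A convenient property of linear interpolation is that the result remains a probability distribution whenever both inputs are.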
[0052] Step 15: update the language models; compute w.sub.0, where PPL.sub.a1, PPL.sub.a2, PPL.sub.b1, PPL.sub.b2 are the perplexities given by the language models LM.sub.a, LM.sub.b of step 4, computed on the text data of the test sets of step 3; PPL.sub.a1, PPL.sub.b1 are computed on the test set consisting of the service agent's speech segments; PPL.sub.a2, PPL.sub.b2 are computed on the test set of the customer's speech segments; then compute w.sub.1 in the same way as w.sub.0, replacing the two language models of step 4 with LM.sub.a′ and LM.sub.b′ of step 14; if w.sub.0>q*w.sub.1, update LM.sub.a with LM.sub.a′ and LM.sub.b with LM.sub.b′, where q≥1.0.
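The update decision of step 15 then reduces to a single comparison; choosing q slightly above 1.0 demands a measurable improvement before the production models are replaced. The specific q used below is an illustrative choice within the claimed constraint q≥1.0:

```python
def should_update(w0, w1, q=1.05):
    """Adopt the semi-supervised models LM_a', LM_b' only when the old
    models' score w0 exceeds q times the new models' score w1 (q >= 1.0),
    i.e. the new models separate the test sets measurably better."""
    return w0 > q * w1

# New models clearly better -> update; identical scores -> keep the old ones.
decision_better = should_update(w0=1.4, w1=1.1)
decision_equal = should_update(w0=1.0, w1=1.0)
```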
Examples of Invention
[0053] The solution has been applied to build a method for separating and recognizing the speech of service agents and customers, with semi-supervised training, in Viettel's customer service call centers.
[0054] At Viettel customer service call centers, this method is used to separate the speech of service agents and customers and recognize it into text. From there, it is possible to monitor and compile statistics on the content of customer service telephone calls automatically and quickly. In addition, the thoughts and frustrations of customers can be learned, as well as whether the service agent's responses are correct. The system is continuously updated through the semi-supervised training mechanism, thereby improving its accuracy.
Effect of Invention
[0055] A special advantage of the present invention is that it provides a method for speech separation and recognition of service agents and customers, with semi-supervised training, in call centers. This method lets managers see what their service agents and customers say, and thus quickly and objectively learn the wishes and concerns of customers as well as whether their service agents give accurate and correct advice. In addition, the system is continuously updated through the semi-supervised training mechanism, meaning that it can learn from actual data during operation, thereby improving the system's accuracy.
[0056] Although the above descriptions contain many specifics, they are not intended to limit the scope of the invention but only to illustrate some preferred embodiments.