Method for Speaker Diarization

20170323643 · 2017-11-09

Abstract

Disclosed is a speaker diarization process for determining which speaker is speaking at what time during the course of a conversation. The process can be described in five main parts: segmentation, where speech/non-speech decisions are made; frame feature extraction, where useful information is obtained from the frames; segment modeling, where the information from frame feature extraction is combined with segment start and end time information to create segment-specific features; speaker decisions, where the segments are clustered to create speaker models; and corrections, where frame-level corrections are applied to the extracted information.

Claims

1. A method for speaker diarization which, when implemented on a computer system, causes the computer system to perform the following steps: acquiring audio data; segmenting the audio data into a plurality of segments; segmenting the plurality of segments into a plurality of frames; extracting information from the frames and labeling the frames with a plurality of frame features; prewhitening the frame features; calculating a plurality of segment models; determining a plurality of speaker decisions; and correcting the speaker labels of the segments.

2. The method according to claim 1, wherein after determining the speaker decisions, a plurality of speaker labels are attached to the segments.

3. The method according to claim 1, wherein after determining the speaker decisions, a plurality of speaker labels are attached to the frames.

4. The method according to claim 3, wherein during the extracting of information from the frames, mel frequency cepstral coefficient calculations are used to calculate a plurality of mel frequency cepstral coefficients; wherein the mel frequency cepstral coefficients are used in vector quantization to determine a codebook and a plurality of codewords; wherein Euclidean distances are calculated between the mel frequency cepstral coefficients and the codewords; wherein each frame is labeled with the index of the codeword that resulted in the closest distance.

5. The method according to claim 1, wherein during the extracting of information from the frames, duration of the frames, and a plurality of fundamental frequencies are extracted.

6. The method according to claim 1, wherein during the calculation of the segment models, the information extracted from the frames is combined with segment specific information and clustered together into the segment models.

7. The method according to claim 1, wherein during the calculation of the segment models, a plurality of probability mass functions are calculated and used as the basis for segment modeling and comparisons between the segments.

8. The method according to claim 1, wherein during the step of determining a plurality of speaker decisions, the segment models are compared using a distance metric.

9. The method according to claim 7, wherein during the step of determining a plurality of speaker decisions, the segment models are compared using a distance metric; wherein a priority is assigned to the segments and a segment pool is made and comparisons are made among the probability mass functions.

10. The method according to claim 2, wherein the segment labels contain both biometric and phonetic features of speech.

11. The method according to claim 3, wherein the frame labels contain both biometric and phonetic features of speech.

12. The method according to claim 5, wherein during the step of calculating a plurality of segment models, the pitch information and the fundamental frequencies are used to calculate a weighted mean and differences in fundamental frequencies.

13. The method according to claim 12, wherein during the step of determining a plurality of speaker decisions, segments having longer durations and lower fundamental frequency difference are given priority in assigning the speaker decisions.

14. The method according to claim 13, wherein during the step of determining a plurality of speaker decisions, a normalization is applied to the segments.

15. The method according to claim 3, wherein during the step of correcting the speaker labels of the segments, an inspection window is shifted over adjacent frames looking for a speaker label that is inconsistent with neighboring frames; wherein when an inconsistent label is found, the inconsistent label is changed to match the speaker label having the highest percentage of labels within the inspection window.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0030] The present invention will be apparent to those skilled in the art by reading the following description with reference to the attached drawings. Unless indicated otherwise, the drawings provide exemplary embodiments or aspects of the disclosure and do not limit its scope. In the drawings provided:

[0031] FIG. 1 illustrates the basic structure of the main corresponding structural portions of speaker diarization.

[0032] FIG. 2 is a flowchart of the main steps involved in the method of speaker diarization.

[0033] FIG. 3 illustrates the frame feature extraction process.

[0034] FIG. 4 illustrates the segment modeling step.

[0035] FIG. 5 illustrates frame label corrections regarding three successive neighboring frames.

[0036] FIG. 6 illustrates a sample PMF calculation.

DETAILED DESCRIPTION

[0037] With reference to the drawings and in particular FIG. 1, FIG. 1 illustrates the basic structure of the main corresponding structural portions of speaker diarization. The speaker region 100 is a portion of an audio signal, belonging to a specific speaker, containing a plurality of segments 110, 120, 130, 140. The segments themselves are any non-silent parts of the audio signal within the speaker region. There can be any number of segments in a speaker region, and the illustration of only four segments is not meant to limit the disclosure in any way. Within each segment there are a plurality of equal-sized frames (111, 112, 113, 121, 122, 123, 131, 132, 133, 141, 142, 143); each frame is a narrow audio part of fixed duration. As with the segments 110, 120, 130, 140, the number of frames shown in each segment is three, but this is not meant to limit the disclosure in any way. The number of frames or segments can be adjusted based on the implementation of the method on different data.

[0038] FIG. 2 shows a flowchart of the main steps involved in the method of speaker diarization. Audio data 201 is received and then segmentation 210 is performed, followed by frame feature extraction 220. The segmentation process divides the audio data 201 into portions of speech and non-speech, including speech segments 211 as described in FIG. 1 above, and into further fragmented frames (212) (111, 112, 113, 121, 122, 123, 131, 132, 133, 141, 142, 143). In other words, segmentation is used as the decision mechanism for silence/non-silence regions of an audio signal or conversation. The determination of silence or non-silence depends on the energy changes of the audio during speech transitions. A dynamically determined threshold allows non-speech portions to be labeled as silence during the segmentation process. The energy threshold is obtained dynamically from the audio data: first, the energy levels are sorted, and then the threshold is set as the minimum speech energy based on the 20th percentile of the sorted energy values. The speech sections are then split into equal-length frames with overlap between the frames, such that each frame overlaps with the previous frame, the subsequent frame, or both.
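The segmentation step above can be illustrated with a short, non-limiting Python sketch (not the claimed implementation). The function names and the sum-of-squares energy measure are assumptions for illustration; the dynamic 20th-percentile threshold follows the description in the preceding paragraph.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into equal-length, overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def speech_mask(frames, percentile=20.0):
    """Label frames as speech/non-speech with a dynamic energy threshold.

    The threshold is taken at the 20th percentile of the sorted
    per-frame energies, as described in the text; frames above the
    threshold are treated as speech.
    """
    energies = np.sum(frames.astype(float) ** 2, axis=1)
    threshold = np.percentile(energies, percentile)
    return energies > threshold
```

With a half-silent, half-loud signal, the frames that fall entirely in the silent region are labeled non-speech while the louder frames are kept as speech.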

[0039] Also, a speech-to-text ("STT") module (205) is run on the audio stream and used in segmentation and segment modeling. Features extracted from the STT module (205) are integrated with prosodic and spectral parameters for step-by-step composition of speaker segments. The STT module outputs are utilized as features for diarization. All the information derived from STT is collected, such as words, word boundaries, and confidences.

[0040] Frame feature extraction 220 is then used to extract important information from the frames, generally labeled as frame features 221. Such information can include start and end times, the Mel Frequency Cepstral Coefficients (MFCC), logarithmic energies, and fundamental frequencies; Vector Quantization (VQ) is also applied. For decorrelation and uniform separation of the feature dimensions, a pre-whitening step is utilized as follows:


x_w = \Sigma^{-1/2}(x - \mu_x)

[0041] Where x is the feature vector, \mu_x and \Sigma are the mean vector and full covariance matrix of x, respectively, and x_w is the whitened feature vector.
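As a non-limiting illustration of this pre-whitening step, the following Python sketch estimates the mean and full covariance from the feature vectors themselves and applies the inverse matrix square root; the eigendecomposition route to \Sigma^{-1/2} is one common choice, assumed here for clarity.

```python
import numpy as np

def prewhiten(features):
    """Whiten feature vectors: x_w = Sigma^(-1/2) (x - mu).

    `features` is an (n_frames, dim) array; the mean vector and full
    covariance matrix are estimated from the data itself.
    """
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    # inverse matrix square root via eigendecomposition of the covariance
    vals, vecs = np.linalg.eigh(sigma)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return (features - mu) @ inv_sqrt
```

After whitening, the features have zero mean and an (approximately) identity covariance, giving the decorrelation and uniform separation of dimensions described above.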

[0042] Then, using these transformed feature vectors, vector quantization is applied, and a codebook is generated as a model. This model is used as a background model of the current audio, containing information from the whole conversation. The codebook size is determined as proportional to the speech duration. For example, a size of 128 can be selected for a three-minute audio. Then, for every frame, Euclidean distances between MFCC vectors and codewords are calculated. Each frame is labeled with the index of the codeword that resulted in the closest distance. The Euclidean distance d is calculated as follows for two vectors V.sub.1 and V.sub.2:


d = \sqrt{\sum_{i=1}^{N} (V_1(i) - V_2(i))^2}

where N is the vector size.
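The frame-labeling step, computing the Euclidean distance between each MFCC vector and every codeword and keeping the index of the closest codeword, can be sketched as follows. This is an illustrative, non-limiting example; the function name and array layout are assumptions.

```python
import numpy as np

def label_frames(mfcc, codebook):
    """Label each frame with the index of its closest codeword.

    `mfcc` is (n_frames, dim); `codebook` is (n_codewords, dim).
    The Euclidean distance d = sqrt(sum_i (V1(i) - V2(i))^2) is
    computed between every frame vector and every codeword.
    """
    # (n_frames, n_codewords, dim) pairwise differences via broadcasting
    diffs = mfcc[:, None, :] - codebook[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.argmin(axis=1)
```

For example, with a three-codeword codebook, each frame receives the index of the codeword it lies nearest to.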

[0043] Another feature extracted is pitch information. Extracting pitch information (physically, fundamental frequencies) is known in the field of the art. Therefore, fundamental frequencies are calculated and kept for each frame for further processing.

[0044] In segment modeling 230, the information extracted from the frames (frame features (221) and the pre-whitening step (222)) is combined with segment-specific information and clustered together into segment models 231. The following example is offered for clarity regarding segment modeling 230 but is not meant to limit the scope of the disclosure: if each segment contains simple label populations such as segment 1 -> [3 5 65 4 89 . . . 78 121 4] and segment 2 -> [8 8 53 100 . . . 44 9], then these populations are transformed into characteristic representations for every segment. These representations are PMFs (Probability Mass Functions). Generating a PMF involves counting the occurrences of each label and dividing these counts by the total label count. Label populations can be used to represent speaker characteristics; thus, PMFs can be used as a basis for modeling and comparisons for each segment. Also, an energy-weighted fundamental frequency variance calculation (described below at 440) is done for each segment. Therefore, segment modeling is achieved by transforming a label population into a PMF and calculating the energy-weighted fundamental frequency variance as explained in the equations of 440 below.
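The PMF generation described above, counting label occurrences and dividing by the total label count, can be shown in a brief non-limiting Python sketch (the function name is an assumption):

```python
import numpy as np

def segment_pmf(labels, codebook_size):
    """Turn a segment's codeword-label population into a PMF.

    Counts the occurrences of each codeword label and divides by the
    total label count, yielding one probability per codebook entry.
    """
    counts = np.bincount(labels, minlength=codebook_size)
    return counts / counts.sum()
```

For a toy population [0, 1, 1, 3] over a size-4 codebook, the resulting PMF is [0.25, 0.5, 0.0, 0.25], and its entries sum to 1 as required of a probability mass function.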

[0045] Once the segment models 231 have been created, they are used in making a speaker decision 240 as to who is speaking and what they are saying. The speaker decisions are generally made by comparing models via a distance metric and combining similar models, which is known in the art. Specifically, a unique method in this disclosure involves assigning priority among segments according to their lengths and fundamental frequency variances. Priority is given to longer-duration segments, since those segments provide more information about the speaker. A lower fundamental frequency variance gives a lower probability of having multiple speakers within a segment. Then, according to the selection priority order, a segment pool is made and comparisons are made among the PMFs of the segments. The closest segments are merged, and for the next speaker, used segments (i.e., those segments that have already been assigned a speaker) are discarded.

[0046] Once the speaker is determined, speaker labels 241 are assigned to the segments and a correction is performed. Corrections 250 are done on a frame-by-frame basis to ensure that some frames have not been erroneously given a speaker label 241 that does not match the speaker. Details describing the corrections 250 are below. One way of accomplishing such a correction is to check one frame to see if other frames immediately adjacent to the said frame have inconsistent speaker labels 241.

[0047] FIG. 3 shows a flowchart of the frame feature extraction. As part of the splitting of the streams detailed in FIG. 1, each segment is divided into equal-length frames. Once the frames come in, Mel Frequency Cepstral Coefficients (MFCC) 331 are calculated from the frames using an MFCC calculation 330, which is well known in the art. The mel frequency cepstral coefficients (MFCC) are part of the feature vectors, and using the MFCCs, a vector quantization (340) is completed. Both vector quantization and mel frequency cepstral coefficients are known in the field. Vector quantization is used to create a codebook (341). The codebook size is determined using the frame count and feature vector dimensions. Once the codebook (341) is estimated, it is used as a background model of the audio stream that contains information for the entire conversation. Then, for every frame, Euclidean distances between the frame MFCCs and the codewords are calculated for frame labeling (350). These codewords are then used for labeling corresponding frames with frame-to-codeword labels 351. Another feature that is extracted during frame feature extraction 300 is pitch information. Extracting pitch information (physically, fundamental frequencies) is known in the field of the art. Thus, fundamental frequencies (f.sub.0) (321) are calculated and kept for each frame by f.sub.0 estimation (320) for further processing in segment modeling. Actual frame energies (311) are also utilized by calculating the energy using logarithmic energy (310) with the following formula:


le(i)=log (1+e(i))

where e is the energy and le is log of energy.

[0048] FIG. 4 shows the segment modeling 230 process. Segment modeling 230 combines segment start-end time information with the frame features (410) extracted during frame feature extraction. Labels (411), energies (412), and f.sub.0 (413), as well as the segments (420), are taken and combined so that a key representation of each segment can be created. Within each segment, the corresponding frame labels (411) are gathered and used to generate a codeword ID histogram for each segment. PMFs are created by normalizing the histogram such that the sum of the bin values is 1. Since the labels (411) are gathered using mel frequency cepstral coefficients (MFCC), it logically follows and is expected that the labels (411) contain both biometric and phonetic features of speech.

[0049] Incoming frame features (410) also contain fundamental frequencies f.sub.0 (413) and energy information (412). Using this information, an energy-weighted f.sub.0 variance (441) can be calculated using an energy-weighted f.sub.0 variance calculation (440) within each segment, using the frames that are within the boundaries of each segment. The weighted mean and variances in (440) are calculated using the following formulas:

\mu_{f_0} = \frac{\sum_{i=1}^{N} le(i)\, f_0(i)}{\sum_{i=1}^{N} le(i)} \quad \text{and} \quad \sigma_{f_0}^2 = \frac{\sum_{i=1}^{N} le(i)\,\bigl(f_0(i) - \mu_{f_0}\bigr)^2}{\sum_{i=1}^{N} le(i)}

where N is the total number of frames within the current segment, le and f_0 are the log energy and fundamental frequency of the frame, respectively, and \mu_{f_0} and \sigma_{f_0}^2 are the resulting energy-weighted mean and variance of the segment f_0 values.
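The energy-weighted mean and variance formulas above translate directly into a short, non-limiting Python sketch (function name assumed for illustration):

```python
import numpy as np

def weighted_f0_stats(f0, energy):
    """Energy-weighted mean and variance of frame fundamental frequencies.

    `energy` holds the log energies le(i) and `f0` the per-frame
    fundamental frequencies, both taken over the frames of one segment.
    """
    le = np.asarray(energy, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    mu = np.sum(le * f0) / np.sum(le)
    var = np.sum(le * (f0 - mu) ** 2) / np.sum(le)
    return mu, var
```

With equal weights the result reduces to the ordinary mean and population variance; unequal weights pull the statistics toward the higher-energy frames, which is the intended noise suppression.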

[0050] After segmentation, frame feature extraction, and segment modeling, speaker decisions (shown as 240 in FIG. 2) are made. Every segment has its speaker ID after this stage. Up to this level, a PMF and a σ.sub.f.sub.0.sup.2 value are assigned to each segment. The PMF values are used for modeling the segments. The segment lengths and σ.sub.f.sub.0.sup.2 are used for the selection priority among segments. Regarding segment priority, segments are ordered for use in speaker assignment.

[0051] As a starting point, it is desired to have reliable models. Segments having longer duration and lower σ.sub.f.sub.0.sup.2 are given priority in selection. Segments with longer duration will contain phonetically more balanced speech and better represent biometric features. However, longer duration brings the possibility of having more than one speaker in that segment. To compensate for the possibility of multiple speakers, σ.sub.f.sub.0.sup.2 is used as an additional data point. If speakers have distinct pitch characteristics, σ.sub.f.sub.0.sup.2 can be used for identifying segments that may represent two separate speakers. Also, noise and background discussions in segments may cause unexpected f.sub.0 calculations. To suppress the effect of noisy parts, σ.sub.f.sub.0.sup.2 is calculated as energy weighted. Additionally, a threshold, tuned experimentally, is determined for the voiced-frame ratio in a segment. In other words, if a segment contains mostly frames without a fundamental frequency, the segment may contain only small amounts of speech or unwanted audio. If this audio has stable pitch behavior, it may bias the speaker selection and lead to an incorrect identification of the speaker; thus, segments with a non-zero f.sub.0 ratio below the threshold are discarded and assigned the lowest priority. A normalization is applied to the lengths and σ.sub.f.sub.0.sup.2 of all segments, mapping them into a [0, 1] range. The normalization formula is below.

\hat{l}_i = \frac{l_i - l_{\min}}{l_{\max} - l_{\min}}, \quad i = 1, 2, \ldots, S

[0052] Where S is the number of segments, l_i is the length, and \hat{l}_i is the normalized length of the i-th segment. Likewise, the normalization is also applied to the σ.sub.f.sub.0.sup.2 values.

[0053] Weights (0.5, −0.5) are given to the lengths and variances, respectively. With both parameters and their corresponding weights, a total priority score is calculated for each segment using the equation below:

s_p^i = 0.5\,\hat{l}_i - 0.5\,\hat{\sigma}_{f_0,i}^2, \quad i = 1, 2, \ldots, S

[0054] Where s_p^i denotes the i-th segment's priority score. The segments are then sorted according to these priorities, and the top portion is separated for the initial speaker decisions.
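The normalization and priority scoring can be combined in one non-limiting Python sketch; the function name and the default weight arguments are assumptions, with the (0.5, −0.5) weighting taken from the text:

```python
import numpy as np

def priority_scores(lengths, f0_vars, w_len=0.5, w_var=0.5):
    """Min-max normalize segment lengths and f0 variances to [0, 1],
    then combine them into per-segment priority scores:

        s_p = 0.5 * l_hat - 0.5 * var_hat

    Longer segments and lower f0 variance receive higher priority.
    """
    l = np.asarray(lengths, dtype=float)
    v = np.asarray(f0_vars, dtype=float)
    l_hat = (l - l.min()) / (l.max() - l.min())
    v_hat = (v - v.min()) / (v.max() - v.min())
    return w_len * l_hat - w_var * v_hat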

[0055] In other words, the segments are sorted such that the highest-prioritized segments have the highest probability of correctly identifying a speaker.

[0056] Speaker assignments are then given to each segment to determine who is speaking in each segment. Speaker assignments are made using segment similarities, such that the segments forming a speaker should be similar to the other segments of that speaker and dissimilar to the segments of other speakers. During processing, all the values for the segments are stored in memory and are used to make comparisons on the PMFs of the segments. PMF similarity is one criterion considered in determining similarity. The L.sub.1 distance is used as the distance metric, which is as follows for two PMFs P.sup.j and P.sup.k:

d_{j,k} = \sum_{i=1}^{M} \left| P_i^j - P_i^k \right|

[0057] The lower the distance between two segments, the more similar they are.
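The L.sub.1 distance between two segment PMFs is a one-liner; the sketch below is illustrative only, and the function name is an assumption:

```python
import numpy as np

def pmf_distance(p, q):
    """L1 distance between two segment PMFs: d = sum_i |P_i^j - P_i^k|.

    The value is 0 for identical PMFs and reaches 2 for PMFs with
    disjoint support, since each PMF sums to 1.
    """
    return np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum()
```

This bounded range (0 to 2) makes the metric convenient for thresholding segment similarity.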

[0058] Speaker initializations are done for each of the speakers. For each speaker initialization, a certain number of segments with the highest similarity are chosen. This similarity measure is calculated as the difference between intra-similarity and inter-similarity. Intra-similarity is the comparison of new speaker candidate segments with other non-assigned new speaker candidate segments, and inter-similarity is the comparison between new candidate segments and the segments of already determined speakers. Positive decisions are made toward higher intra-similarity and lower inter-similarity, and these segments are used for new speaker creation. When intra-similarity is high and inter-similarity is low, it indicates that the speaker in these segments is a new speaker and results in the initialization of a new speaker label. After initialization of all speakers, any remaining unassigned segments are processed. For each of them, the similarity to all speakers is calculated, and the segment is then assigned the speaker label whose segments most closely match the unassigned segment. This assignment continues until all segments are assigned a speaker label. Segment similarities are calculated using PMF similarities, since the segments are represented with PMF values. The sum of absolute differences is used as the measure of distance between segments. After every speaker decision, the present PMFs are updated with the new segment PMFs. This update operation is a segment-length-weighted sum of PMFs, resulting in a new PMF. Updates are done as the element-wise summation of two PMFs, with weights derived from the segment lengths. This can be expressed as follows for the i-th element in the update of PMF P.sub.1 with P.sub.2, resulting in P.sub.3:

P_3(i) = \frac{P_1(i)\, l_1 + P_2(i)\, l_2}{l_1 + l_2}, \quad i = 1, 2, \ldots, M

where M is the total element count of a PMF, and l_1 and l_2 are the lengths of the two segments.
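The length-weighted PMF update above can be sketched as follows (non-limiting; the function name is an assumption):

```python
import numpy as np

def update_pmf(p1, l1, p2, l2):
    """Length-weighted element-wise merge of two PMFs:

        P3(i) = (P1(i) * l1 + P2(i) * l2) / (l1 + l2)

    Longer segments therefore contribute more to the merged model.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return (p1 * l1 + p2 * l2) / (l1 + l2)
```

Because both inputs sum to 1 and the weights are a convex combination, the merged vector is itself a valid PMF.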

[0059] Therefore, updating the PMFs of segments is done in order to give more emphasis to longer segments. Thus, the process for assigning a speaker label (i.e., speaker assignment) takes segments as input, outputs a speaker label, and involves the following steps: 1) finding the segments that are most similar to each other and to no other existing label; 2) creating a new speaker model; 3) repeating steps 1 and 2 until all speakers have been initialized; 4) for any unassigned segment, comparing the segment with the speaker models and assigning the closest speaker to that segment; 5) updating the model with the new values for the segments; and 6) ensuring no other speaker labels need to be created.

[0060] No process can be perfect, and to compensate for possible errors, a correction phase is included in the method. The aim is mainly to find and split segments that may contain speaker labels for multiple speakers. During speaker corrections, the algorithm returns to frame-level processing. Each frame was previously assigned a label based on the closest codeword ID. A search is made over every frame using the speaker PMFs to identify which speaker is most likely associated with each frame, and the frames are accordingly given speaker labels. In order to find incorrectly labeled frames, the neighboring frames are checked to see whether one speaker label has been assigned repeatedly to successive neighboring frames. For a frame k, the k−N and k+N neighbors are gathered and the speaker label with the highest count is assigned to the tested frame. Using this correction technique, erroneous label fluctuations can be smoothed out. In order to find possible speaker changes inside segments, a resegmentation algorithm is applied using the speaker labels on the frames. This algorithm shifts an inspection window over the frames looking for a group of another speaker's labels.

[0061] The algorithm determines the percentage of frames for each speaker within the window. It picks the speaker with the highest percentage as the correct speaker label. If the picked speaker and the segment's speaker have an f.sub.0 difference larger than a threshold, a further comparison is made: the f.sub.0 of the new candidate part is compared against the speaker labels assigned to the frames in the segment (i.e., if labels A, B, and C were assigned within a segment, the frame would be compared to all three candidate labels). After this comparison, a score is given; a score is also given for the frame count percentage. These two scores are combined with their corresponding weights. If the resulting score is above a predetermined threshold, a decision for a speaker change is made.

[0062] FIG. 5 illustrates the correction process, showing a portion of a segment before correction (501) and a portion of the same segment after correction (505) using the method of correcting explained above. In FIG. 5, frame k is shown as 515 and the N value has been set to 3. Therefore, the correction looks to the three most adjacent frames on either side of frame k (515), which are the preceding frames (512, 513, and 514) and the following frames (516, 517, and 518). Frames 512, 513, 514, 516, 517, and 518 are highlighted to help show where the k−N to k+N range falls. Looking at the k+3 and k−3 frames, the system determines that speaker label "A" outnumbers speaker label "B" six to one. Because the majority is used to determine the most likely correct speaker label, frame 515 becomes frame 555 and is relabeled as "A" in the corrected stream 505; all other frames 550 through 560 remain unchanged in this illustrated example.
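The ±N-neighbor majority-vote correction illustrated in FIG. 5 can be sketched in a few lines of non-limiting Python (the function name is an assumption; boundaries simply use a truncated window):

```python
from collections import Counter

def correct_frame_labels(labels, n=3):
    """Smooth frame speaker labels with a +/- n neighbor majority vote.

    For each frame k, the labels of frames k-n .. k+n are gathered
    from the original sequence and the most frequent label in that
    window replaces frame k's label.
    """
    corrected = list(labels)
    for k in range(len(labels)):
        lo, hi = max(0, k - n), min(len(labels), k + n + 1)
        window = labels[lo:hi]
        corrected[k] = Counter(window).most_common(1)[0][0]
    return corrected
```

Applied to the FIG. 5 example sequence A A A B A A A with n = 3, the lone "B" is outvoted 6 to 1 and relabeled "A", while all other frames are unchanged.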

[0063] FIG. 6 illustrates a sample PMF calculation. P.sub.i shows the probability for value i. The PMF is generated from M bins; in this case, M is equal to the codebook size (the number of codewords within the codebook). Namely, such a segment PMF contains information from the i-th codeword with probability P.sub.i.