Method for Speaker Diarization
20170323643 · 2017-11-09
Assignee
Inventors
- Mustafa Levent Arslan (Istanbul, TR)
- Mustafa Erden (Istanbul, TR)
- Sedat Demirbag (Istanbul, TR)
- Gökçe Sarar (Istanbul, TR)
Cpc classification
G10L19/018
PHYSICS
International classification
G10L19/018
PHYSICS
Abstract
Disclosed is a speaker diarization process for determining which speaker is speaking at what time during the course of a conversation. The entire process can be most easily described in five main parts: Segmentation where speech/non-speech decisions are made; frame feature extraction where useful information is obtained from the frames; segment modeling where the information from the frame feature extraction is combined with segment start and end time information to create segment specific features; speaker decisions when the segments are clustered to create speaker models; and corrections where frame level corrections are applied to the information extracted.
Claims
1. A method for speaker diarization, when implemented on a computer system, causes the computer system to perform the following steps: acquiring an audio data; segmenting the audio data into a plurality of segments; segmenting the plurality of segments into a plurality of frames; extracting information from the frames and labeling the frames with a plurality of frame features; prewhitening the frame features; calculating a plurality of segment models; determining a plurality of speaker decisions; and correcting the speaker labels of the segments.
2. The method according to claim 1, wherein after determining the speaker decisions, a plurality of speaker labels are attached to the segments.
3. The method according to claim 1, wherein after determining the speaker decisions, a plurality of speaker labels are attached to the frames.
4. The method according to claim 3, wherein during the extracting of information from the frames, mel frequency cepstral coefficient calculations are used to calculate a plurality of mel frequency cepstral coefficients; wherein the mel frequency cepstral coefficients are used in vector quantization to determine a codebook and a plurality of codewords; wherein Euclidean distances are calculated between mel frequency cepstral coefficients and codewords; wherein each frame is labeled with the index of the codeword that resulted.
5. The method according to claim 1, wherein during the extracting of information from the frames, duration of the frames, and a plurality of fundamental frequencies are extracted.
6. The method according to claim 1, wherein during the calculation of the segment models, the information extracted from the frames is combined with segment specific information and clustered together into the segment models.
7. The method according to claim 1, wherein during the calculation of the segment models, a plurality of probability mass functions are calculated and used as the basis for segment modeling and comparisons between the segments.
8. The method according to claim 1, wherein during the step of determining a plurality of speaker decisions, the segment models are compared using a distance metric.
9. The method according to claim 7, wherein during the step of determining a plurality of speaker decisions, the segment models are compared using a distance metric; wherein a priority is assigned to the segments and a segment pool is made and comparisons are made among the probability mass functions.
10. The method according to claim 2, wherein the segment labels contain both biometric and phonetic features of speech.
11. The method according to claim 3, wherein the frame labels contain both biometric and phonetic features of speech.
12. The method according to claim 5, wherein during the step of calculating a plurality of segment models, the pitch information and the fundamental frequencies are used to calculate a weighted mean and differences in fundamental frequencies.
13. The method according to claim 12, wherein during the step of determining a plurality of speaker decisions, segments having longer durations and lower fundamental frequency difference are given priority in assigning the speaker decisions.
14. The method according to claim 13, wherein during the step of determining a plurality of speaker decisions, a normalization is applied to the segments.
15. The method according to claim 3, wherein during the step of correcting the speaker labels of the segments, an inspection window is shifted over adjacent frames looking for a speaker label that is inconsistent with neighboring frames; wherein when an inconsistent label is found, the inconsistent label is changed to match the speaker label of the highest percentage of labels within the inspection window.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] The present invention will be apparent to those skilled in the art by reading the following description with reference to the attached drawings. Unless indicated otherwise, the drawings provide exemplary embodiments or aspects of the disclosure and do not limit the scope of the disclosure. In the drawings provided:
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
DETAILED DESCRIPTION
[0037] With reference to the drawings and in particular
[0038]
[0039] Also, a speech-to-text (STT) module (205) is run on the audio stream, and its output is used in segmentation and segment modeling. Features extracted from the STT module (205) are integrated with prosodic and spectral parameters for step-by-step composition of speaker segments. The STT module outputs are utilized as features for diarization. All the information derived from STT, such as words, word boundaries, and confidences, is collected.
[0040] Frame feature extraction 220 is then used to extract important information from the frames, which is generally labeled as frame features 221. Such information can include start and end times, Mel Frequency Cepstral Coefficients (MFCC), logarithmic energies, fundamental frequencies, and the results of Vector Quantization (VQ). For decorrelation and uniform separation of the feature dimensions, a pre-whitening step is utilized as follows:
$$x_w = \Sigma^{-1/2}\,(x - \mu_x)$$
[0041] Where x is the feature vector, μ.sub.x and Σ are the mean vector and full covariance matrix of x, respectively, and x.sub.w is the whitened feature vector.
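The pre-whitening step above can be illustrated with a short sketch. This is not the implementation of the disclosure; it assumes NumPy is available, that the frame features are stacked as rows of a matrix with at least two dimensions, and that the inverse square root of the covariance matrix is taken by eigendecomposition. The function name `prewhiten` and the small eigenvalue floor `eps` are hypothetical choices made for this example.

```python
import numpy as np

def prewhiten(features):
    """Whiten row-wise feature vectors: x_w = Sigma^(-1/2) (x - mu_x)."""
    x = np.asarray(features, dtype=float)
    mu = x.mean(axis=0)                  # mean vector mu_x
    sigma = np.cov(x, rowvar=False)      # full covariance matrix Sigma
    # Inverse square root of the symmetric matrix Sigma via eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    eps = 1e-10                          # assumed floor to avoid division by zero
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, eps))) @ eigvecs.T
    # inv_sqrt is symmetric, so right-multiplying the centered rows applies it
    # to every feature vector.
    return (x - mu) @ inv_sqrt
```

After this transform, the feature dimensions have zero mean, unit variance, and are mutually uncorrelated over the data that was used to estimate the mean and covariance.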
[0042] Then, using these transformed feature vectors, vector quantization is applied, and a codebook is generated as a model. This model is used as a background model of the current audio, containing information from the whole conversation. The codebook size is determined as proportional to the speech duration. For example, a size of 128 can be selected for a three-minute audio. Then, for every frame, Euclidean distances between MFCC vectors and codewords are calculated. Each frame is labeled with the index of the codeword that resulted in the closest distance. The Euclidean distance d is calculated as follows for two vectors V.sub.1 and V.sub.2:
$$d = \sqrt{\sum_{i=1}^{N} \left( V_1(i) - V_2(i) \right)^2}$$
where N is the vector size.
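The frame labeling described above can be sketched as follows. The sketch assumes a codebook has already been produced by some vector quantization training step, which is not shown; the function names `euclidean` and `label_frames` are hypothetical and chosen only for this example.

```python
import math

def euclidean(v1, v2):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def label_frames(frames, codebook):
    """Label each frame with the index of its closest codeword."""
    labels = []
    for frame in frames:
        distances = [euclidean(frame, codeword) for codeword in codebook]
        labels.append(distances.index(min(distances)))
    return labels
```

For a codebook of two codewords `[[0, 0], [10, 10]]`, the frames `[[1, 1], [9, 8]]` receive the labels 0 and 1, respectively.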
[0043] Another feature extracted is pitch information. Extracting pitch information (physically, fundamental frequencies) is known in the art. Therefore, fundamental frequencies are calculated and kept for each frame for further processing.
[0044] In segment modeling 230, the information extracted from the frames (frame features (221) and pre-whitening step (222)) is combined with information that is segment-specific and clustered together into segment models 231. The following example is offered for clarity regarding segment modeling 230 but is not meant to limit the scope of the disclosure: if each segment contains a simple label population, such as segment 1 −> [3 5 65 4 89 . . . 78 121 4] and segment 2 −> [8 8 53 100 . . . 44 9], then these populations are transformed into a characteristic representation for every segment. These representations are PMFs (Probability Mass Functions). Generating a PMF involves counting the occurrences of each label and dividing these counts by the total label count. Label populations can be used to represent speaker characteristics; thus, PMFs can be used as a basis for modeling and comparisons for each segment. Also, an energy weighted fundamental frequency variance calculation (440, described below) is done for each segment. Therefore, segment modeling can be achieved by transforming a label population into a PMF and calculating the energy weighted fundamental frequency variance as explained in the equations of 440 below.
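The transformation of a label population into a PMF can be sketched as below. The function name `labels_to_pmf` and the `codebook_size` parameter are hypothetical; the sketch assumes labels are integer codeword indices in the range 0 to `codebook_size - 1`.

```python
from collections import Counter

def labels_to_pmf(labels, codebook_size):
    """Count the occurrences of each label and divide by the total label count."""
    if not labels:
        raise ValueError("a segment needs at least one labeled frame")
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(codebook_size)]
```

For example, the label population `[0, 1, 1, 3]` with a codebook of size 4 yields the PMF `[0.25, 0.5, 0.0, 0.25]`.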
[0045] Once the segment models 231 have been created, they are used in making a speaker decision 240 as to who is speaking, and what they are saying. The speaker decisions generally are made by comparing models via a distance metric and combining similar models, as is known in the art. Specifically, a unique method in this disclosure involves assigning priority among segments according to their lengths and fundamental frequency variances. Priority is first given to longer duration segments, since those segments provide more information about the speaker. A lower fundamental frequency variance indicates a lower probability of having multiple speakers within a segment. Then, according to the selection priority order, a segment pool is made and comparisons are made among the PMFs of the segments. The closest segments are merged, and when moving on to the next speaker, used segments (i.e., those segments that have already been assigned a speaker) are discarded.
[0046] Once the speaker is determined, speaker labels 241 are assigned to the segments and a correction is performed. Corrections 250 are done on a frame-by-frame basis to ensure that some frames have not been erroneously given a speaker label 241 that does not match the speaker. Details describing the corrections 250 are below. One way of accomplishing such a correction is to check one frame to see if other frames immediately adjacent to the said frame have inconsistent speaker labels 241.
[0047]
$$\mathrm{le}(i) = \log\left(1 + e(i)\right)$$
where e is the energy and le is the log energy.
[0048]
[0049] Incoming frame features (410) also contain fundamental frequencies f.sub.0 (413) and energy information (412). Using this information, an energy weighted f.sub.0 variance (441) can be calculated using an energy weighted f.sub.0 variance calculation (440) within each segment, using the frames that are within the boundaries of each segment. The weighted mean and variances in (440) are calculated using the following formulas:
where N is the total number of frames within the current segment, and le and f.sub.0 are the log energy and fundamental frequency of the frame, respectively. μ.sub.f.sub.
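The formulas for the weighted mean and variance are not reproduced in this text. The sketch below therefore assumes the standard form of a weighted mean and weighted variance, with the log energy le(i) = log(1 + e(i)) of each frame used as its weight. The exact normalization used in the disclosure may differ, and the function names are hypothetical.

```python
import math

def log_energy(e):
    """Log energy of a frame: le = log(1 + e)."""
    return math.log(1.0 + e)

def energy_weighted_f0_stats(f0, energy):
    """Return the (weighted mean, weighted variance) of f0 over one segment.

    Assumption: each frame is weighted by its log energy, and both sums are
    divided by the total weight.
    """
    weights = [log_energy(e) for e in energy]
    total = sum(weights)
    if total == 0:
        raise ValueError("segment has no frames with nonzero energy")
    mean = sum(w * f for w, f in zip(weights, f0)) / total
    variance = sum(w * (f - mean) ** 2 for w, f in zip(weights, f0)) / total
    return mean, variance
```

Under this assumption, frames with little energy, where pitch estimates tend to be unreliable, contribute little to the segment statistics.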
[0050] After segmentation, frame feature extraction, and segment modeling, speaker decisions (as seen as 240 in
[0051] As a starting point it is desired to have reliable models. Segments having longer duration and lower σ.sub.f.sub.
[0052] Where S is the number of segments, l.sub.i is the length, and l̂.sub.i is the normalized length of the i.sup.th segment. Likewise, the normalization is also applied on σ.sub.f.sub.
[0053] Weights of 0.5 and −0.5 are given to the lengths and variances, respectively. With both parameters and their corresponding weights, a total priority score is calculated for each segment using the equation below:
$$s_p^i = 0.5\,\hat{l}_i - 0.5\,\hat{\sigma}_{f_0}^{i}$$
[0054] Where s.sub.p.sup.i denotes the i.sup.th segment priority score. Segments are then sorted according to these priorities and the top portion of these are separated for the initial speaker decisions.
[0055] Therefore, both parameters and their corresponding weights are used to calculate a total priority weight for each segment. Segments are sorted according to these priorities, and a top portion of them is separated for initial speaker decisions. In other words, the information is sorted such that the highest prioritized segments have the highest probability of correctly identifying a speaker.
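The priority scoring and sorting described above can be sketched as follows. The normalization formula is not reproduced in this text, so the sketch assumes each value is divided by the sum over all segments; this is an assumption, not the formula of the disclosure. The weights 0.5 and −0.5 come from the description, while the function names are hypothetical.

```python
def normalize(values):
    """Assumed normalization: divide each value by the sum over all segments."""
    total = sum(values)
    if total == 0:
        return [0.0 for _ in values]
    return [v / total for v in values]

def priority_scores(lengths, f0_variances, w_len=0.5, w_var=-0.5):
    """Priority score per segment: longer and lower-variance segments score higher."""
    norm_len = normalize(lengths)
    norm_var = normalize(f0_variances)
    return [w_len * l + w_var * v for l, v in zip(norm_len, norm_var)]

def rank_segments(lengths, f0_variances):
    """Return segment indices sorted from highest to lowest priority."""
    scores = priority_scores(lengths, f0_variances)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

With lengths `[10, 2, 5]` and f.sub.0 variances `[1, 5, 4]`, the long, low-variance first segment ranks first and the short, high-variance second segment ranks last.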
[0056] Speaker assignments are then given to each segment to determine who is speaking in each segment. Speaker assignments are done using segment similarities, such that the segments forming a speaker should be similar to other segments of that speaker and dissimilar to segments of other speakers. During processing, all the values about segments are stored in memory and are used to make comparisons on the PMFs of segments. PMF similarity is one criterion that is considered in determining similarity. The L.sub.1 distance is used as the distance metric, which is as follows for two PMFs P.sup.j and P.sup.k, each with M elements:
$$d(P^j, P^k) = \sum_{i=1}^{M} \left| P^j(i) - P^k(i) \right|$$
[0057] The lower the distance between two segments, the more similar they are.
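The L.sub.1 distance between two PMFs is the sum of the absolute differences of their elements. A minimal sketch, with the hypothetical function name `l1_distance`, is:

```python
def l1_distance(p, q):
    """L1 distance (sum of absolute differences) between two PMFs."""
    if len(p) != len(q):
        raise ValueError("PMFs must have the same number of elements")
    return sum(abs(a - b) for a, b in zip(p, q))
```

Identical PMFs have a distance of 0, and two PMFs with no labels in common have the maximum distance of 2.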
[0058] Speaker initializations are done for each of the speakers. For each speaker initialization, a certain number of segments that have the highest similarity are chosen. This similarity measure is calculated as the difference between intra-similarity and inter-similarity. Intra-similarity is the comparison of new speaker candidate segments with other non-assigned new speaker candidate segments, and inter-similarity is the comparison between new candidate segments and segments of already determined speakers. Positive decisions are made towards higher intra-similarity and lower inter-similarity, and these segments are used for creating a new speaker. When intra-similarity is high and inter-similarity is low, it indicates that the speaker in these segments is a new speaker, which results in the initialization of a new speaker label. After initialization of all speakers, any remaining unassigned segments are processed. For each of them, the similarity to all speakers is calculated, and the segment is then assigned the speaker label whose other segments most closely match the unassigned segment. This assignment continues until all segments are assigned a speaker label. Segment similarities are calculated using PMF similarities, since the segments are represented with PMF values. The sum of absolute differences is used as the measure of distance between segments. After every speaker decision, the present PMFs are updated with the new segment PMFs. This update operation is a segment length weighted sum of PMFs, resulting in a new PMF. Updates are done as the element-wise summation between two PMFs with a weight coming from the segment lengths. It can be expressed as follows for the i.sup.th element in the update of PMF P.sub.1 with P.sub.2, resulting in P.sub.3:
where M is the total element count for a PMF, and l.sub.1 and l.sub.2 are the lengths of the segments.
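The update formula itself is not reproduced in this text. The sketch below assumes that the segment length weighted sum is divided by the total length, so that the result P.sub.3 still sums to one; the exact expression in the disclosure may differ, and the function name `update_pmf` is hypothetical.

```python
def update_pmf(p1, l1, p2, l2):
    """Length-weighted element-wise combination of two PMFs.

    Assumption: P3(i) = (l1 * P1(i) + l2 * P2(i)) / (l1 + l2) for i = 1..M.
    """
    if len(p1) != len(p2):
        raise ValueError("PMFs must have the same number of elements")
    total = l1 + l2
    if total <= 0:
        raise ValueError("segment lengths must be positive")
    return [(l1 * a + l2 * b) / total for a, b in zip(p1, p2)]
```

Combining `[1.0, 0.0]` of length 3 with `[0.0, 1.0]` of length 1 gives `[0.75, 0.25]`, so the longer segment dominates the updated model.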
[0059] Therefore, updating the PMFs of segments is done in order to give more emphasis to longer segments. Thus, the process for assigning a speaker label (i.e., speaker assignment) takes segments as input, produces a speaker label as output, and involves the following steps: 1) finding segments that are most similar to each other and to no other existing label; 2) creating a new speaker model; 3) repeating steps 1 and 2 until all speakers have been initialized; 4) for any unassigned segments, comparing the segment with the speaker models and assigning the closest speaker to that segment; 5) updating the model with new values for the segments; and 6) ensuring no other speaker labels need to be created.
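Steps 4 and 5 of the speaker assignment can be sketched as below. The sketch assumes the speaker models have already been initialized (steps 1 to 3 are not shown), that each model is stored as a PMF together with its accumulated length, and that the model update is a length-weighted average. The data layout and the function names are hypothetical.

```python
def l1_distance(p, q):
    """Sum of absolute differences between two PMFs."""
    return sum(abs(a - b) for a, b in zip(p, q))

def assign_remaining(segments, speakers):
    """Assign each unassigned segment to its closest speaker model.

    segments: list of (pmf, length) tuples.
    speakers: dict mapping a speaker label to (pmf, accumulated_length);
              it is updated in place after every assignment.
    """
    assignments = []
    for pmf, length in segments:
        # Step 4: pick the speaker whose model PMF is closest to the segment.
        closest = min(speakers, key=lambda name: l1_distance(pmf, speakers[name][0]))
        # Step 5: update that model with a length-weighted average (assumed form).
        model, model_length = speakers[closest]
        total = model_length + length
        updated = [(model_length * a + length * b) / total for a, b in zip(model, pmf)]
        speakers[closest] = (updated, total)
        assignments.append(closest)
    return assignments
```

Because the models change after every assignment, the order in which the remaining segments are processed can affect the result.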
[0060] No process can be perfect, and to compensate for possible errors, a correction phase is included in the method. The aim is mainly to find and split segments that contain more than one speaker, since each segment may contain speaker labels for multiple speakers. During speaker corrections, the algorithm returns to frame-level processing. Each frame was previously assigned a label based on the closest codeword ID. A search is made of every frame using the speaker PMFs to identify which speaker is most likely associated with each frame, and each frame is given a speaker label accordingly. In order to find incorrectly labeled frames, the neighboring frames are checked to see if one speaker label has been assigned repeatedly to successive neighboring frames. For a frame k, the neighbors from k−N to k+N are gathered, and the speaker label with the highest count is assigned to the tested frame. Using this correction technique, erroneous label fluctuations can be smoothed out. In order to find possible speaker changes inside segments, a resegmentation algorithm is applied using the speaker labels on the frames. This algorithm shifts an inspection window over the frames, looking for a group of another speaker's labels.
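The neighbor-based smoothing of frame labels can be sketched as a majority vote. The sketch assumes the window covers frames k−N through k+N including frame k itself, that the window is truncated at the start and end of the audio, and that the vote uses the original labels rather than already smoothed ones. These details, and the function name `smooth_labels`, are assumptions.

```python
from collections import Counter

def smooth_labels(labels, n):
    """Replace each frame label with the most frequent label among its neighbors."""
    smoothed = []
    for k in range(len(labels)):
        window = labels[max(0, k - n): k + n + 1]
        # On a tie, Counter.most_common keeps the label encountered first.
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed
```

With N = 1, the isolated label in `A A A B A A A B B B B` is replaced by `A`, while the genuine change to `B` at the end is preserved.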
[0061] The algorithm determines the percentage of frames for each speaker within the window. It picks the speaker with the highest percentage to be the correct speaker label. If the picked speaker and the segment speaker have an f.sub.0 difference larger than a threshold, a further comparison is made. The f.sub.0 of the new candidate part is compared with that of each speaker label assigned to the frames in a segment (e.g., if labels A, B, and C were assigned within a segment, the candidate part would be compared to all three candidate labels). After the comparison, a score is given. A score is also given for the frame count percentage. These two scores are combined with their corresponding weights. If the resulting score is above a predetermined threshold, a decision for a speaker change is made.
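The frame count percentage part of the resegmentation can be sketched as below. Only the sliding inspection window and the per-window speaker percentage are modeled; the f.sub.0 comparison, the score weights, and the final threshold are not specified in this text and are left out. The window size of 5 frames, the minimum fraction of 0.8, and the function name are hypothetical values chosen for illustration.

```python
from collections import Counter

def find_candidate_changes(frame_labels, segment_speaker, window=5, min_fraction=0.8):
    """Slide an inspection window over the frames of one segment.

    Returns (start_index, speaker, fraction) for every window whose dominant
    speaker differs from the segment speaker and covers at least min_fraction
    of the window.
    """
    candidates = []
    for start in range(len(frame_labels) - window + 1):
        counts = Counter(frame_labels[start:start + window])
        speaker, count = counts.most_common(1)[0]
        fraction = count / window
        if speaker != segment_speaker and fraction >= min_fraction:
            candidates.append((start, speaker, fraction))
    return candidates
```

Each returned candidate would still need to pass the f.sub.0 comparison and the combined score threshold before a speaker change is declared.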
[0062]
[0063]