Method and device for transforming feature vector for user recognition
10410638 · 2019-09-10
Assignee
- Samsung Electronics Co., Ltd. (Suwon-Si, Gyeonggi-do, KR)
- Korea University Research And Business Foundation (Seoul, KR)
Inventors
- Hanseok Ko (Seoul, KR)
- Sung-Soo Kim (Bucheon-si, KR)
- Jinsang Rho (Anyang-si, KR)
- Suwon Shon (Seoul, KR)
- Jae-won Lee (Seoul, KR)
CPC classification
International classification
- G10L17/02 (PHYSICS)
Abstract
A method of converting a feature vector includes extracting a feature sequence from an audio signal including an utterance of a user; extracting a feature vector from the feature sequence; acquiring a conversion matrix for reducing a dimension of the feature vector, based on a probability value acquired based on different covariance values; and converting the feature vector by using the conversion matrix.
Claims
1. A method of identifying a user from an audio signal in a device, the method comprising: receiving the audio signal including an utterance of the user via a microphone; extracting a feature sequence from the audio signal including the utterance of the user; extracting a feature vector from the feature sequence; acquiring a conversion matrix for reducing a dimension of the feature vector, based on a probability value acquired based on different covariance values; converting the feature vector by using the conversion matrix; and identifying the user from the audio signal by using the converted feature vector, wherein the acquiring of the conversion matrix comprises acquiring a dimension p as a useful dimension p of the conversion matrix based on whether an energy accumulated up to the dimension p of a variance matrix for an intra-class covariance matrix of each speaker is more than an energy of a predetermined ratio of an entire energy for an entire dimension of the variance matrix, and a dimension of the feature vector is converted into the useful dimension p.
2. The method of claim 1, wherein the conversion matrix is a heteroscedastic linear discriminant analysis (HLDA) conversion matrix.
3. The method of claim 1, wherein the feature vector is an i-vector that is acquirable by joint factor analysis.
4. The method of claim 1, further comprising: performing scoring on a feature vector resulting from the conversion and a feature vector of each state, at least once, wherein the user is identified based on a result of the scoring.
5. A non-transitory computer-readable recording medium having recorded thereon a program for executing the method of claim 1.
6. A device for identifying a user from an audio signal, the device comprising: a receiver which receives the audio signal including an utterance of the user via a microphone; and a controller which extracts a feature sequence from the audio signal, extracts a feature vector from the feature sequence, acquires a conversion matrix for reducing a dimension of the feature vector, based on a probability value acquired based on different covariance values, converts the feature vector by using the conversion matrix, and identifies the user from the audio signal, by using the converted feature vector, wherein the controller acquires a dimension p as a useful dimension p of the conversion matrix based on whether an energy accumulated up to the dimension p of a variance matrix for an intra-class covariance matrix of each speaker is more than an energy of a predetermined ratio of an entire energy for an entire dimension of the variance matrix, and a dimension of the feature vector is converted into the useful dimension p.
7. The device of claim 6, wherein the conversion matrix is a heteroscedastic linear discriminant analysis (HLDA) conversion matrix.
8. The device of claim 6, wherein the feature vector is an i-vector that is acquirable by joint factor analysis.
9. The device of claim 6, wherein the controller performs scoring on a feature vector resulting from the conversion and a feature vector of each state, at least once, and identifies the user, based on a result of the scoring.
Description
DESCRIPTION OF THE DRAWINGS
BEST MODE
(7) A method of converting a feature vector includes extracting a feature sequence from an audio signal including an utterance of a user; extracting a feature vector from the feature sequence; acquiring a conversion matrix for reducing a dimension of the feature vector, based on a probability value acquired based on different covariance values; and converting the feature vector by using the conversion matrix.
(8) The conversion matrix is a heteroscedastic linear discriminant analysis (HLDA) conversion matrix.
(9) The acquiring of the conversion matrix includes acquiring a useful dimension p of the conversion matrix, based on accumulated energy for each dimension of a variance matrix for an intra-class covariance matrix of each speaker.
(10) The feature vector is an i-vector that is acquirable by joint factor analysis.
(11) The method further includes performing scoring on a feature vector resulting from the conversion and a feature vector of each state, at least once; and identifying the user, based on a result of the scoring.
(12) A device for converting a feature vector includes a receiver which receives an audio signal including an utterance of a user; and a controller which extracts a feature sequence from the audio signal, extracts a feature vector from the feature sequence, acquires a conversion matrix for reducing a dimension of the feature vector, based on a probability value acquired based on different covariance values, and converts the feature vector by using the conversion matrix.
MODE OF THE INVENTION
(13) Embodiments will now be described more fully with reference to the accompanying drawings. However, in order to clarify the spirit of the invention, descriptions of well known functions or constructions may be omitted. In the drawings, like numbers refer to like elements throughout.
(14) Terms or words used in the present specification and claims should not be interpreted as being limited to typical or dictionary meanings, but should be interpreted as having meanings and concepts that comply with the technical spirit of the present invention, based on the principle that an inventor may appropriately define the concept of a term to describe his or her own invention in the best manner. Therefore, the configurations illustrated in the embodiments and drawings described in the present specification are merely preferred embodiments of the present invention and do not represent all of the technical spirit of the present invention, and it is to be understood that various equivalents and modifications that may replace these configurations are possible at the time of filing the present application.
(15) Some elements are exaggerated, omitted, or schematically illustrated in the drawings. As such, actual sizes of respective elements are not necessarily represented in the drawings. The present invention is not limited by relative sizes and/or intervals in the accompanying drawings.
(16) The terms "comprises" and/or "comprising," or "includes" and/or "including," when used in this specification, specify the presence of stated elements but do not preclude the presence or addition of one or more other elements. Also, the term "unit" in the embodiments of the present invention means a software component or a hardware component, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC), that performs a specific function. However, the term "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium or to be executed by one or more processors. Thus, for example, a "unit" may include components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. A function provided by such components and "units" may be combined into a smaller number of components and "units" or further divided into additional components and "units."
(17) Embodiments of the present invention are described in detail herein with reference to the accompanying drawings so that this disclosure may be easily carried out by one of ordinary skill in the art to which the present invention pertains. The invention may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. In the drawings, parts irrelevant to the description are omitted for simplicity of explanation, and like numbers refer to like elements throughout.
(20) The device 100 may be a terminal that can be used by a user. For example, the device 100 may be a smart TV, an ultra high definition (UHD) TV, a monitor, a personal computer (PC), a notebook computer, a mobile phone, a tablet PC, a navigation device, a smartphone, a personal digital assistant (PDA), a portable multimedia player (PMP), or a digital broadcasting receiver.
(21) The device 100 may reduce the dimension of the feature vector by taking into account the fact that classes of the feature vector may have different variance values. According to an embodiment, a class denotes a group into which a plurality of data values may be classified, as in linear discriminant analysis (LDA) or heteroscedastic LDA (HLDA). The device 100 may acquire a dimension-reduced feature vector by applying a conversion matrix to the extracted feature vector. The conversion matrix may be determined based on a matrix acquired via HLDA. When a conversion matrix determined via HLDA is used to convert the feature vector, the constraint that every class has the same variance value, which an LDA conversion matrix imposes, may be removed.
(22) The feature vector extracted by the device 100 may be stored as a target feature vector for identifying a user or may be compared with a target feature vector as a test feature vector, and thus may be used to identify a user. The device 100 may identify a user by performing scoring that uses an extracted feature vector and a pre-stored feature vector.
(23) Referring to
(24) The feature detector 110 may detect a feature value of the audio signal of the user by using a mel-frequency cepstral coefficient (MFCC) method. In the MFCC method, a spectrum-based useful feature value is detected using the non-linear frequency characteristics of the human ear. The feature value of the audio signal may also be detected using various methods other than the MFCC method. For example, the feature value of the audio signal may be detected as a feature parameter value for a frequency sequence extracted from the audio signal.
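As an illustrative sketch only (not part of the claimed embodiments), the MFCC pipeline described above can be expressed in code. All parameter values (sample rate, frame length, filter count, coefficient count) are assumed defaults, not values taken from this patent:

```python
# Minimal MFCC sketch: frame the signal, take the power spectrum, apply a mel
# filterbank, take the log, then a DCT. Parameter values are assumptions.
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_ceps=13):
    """Return an (n_frames, n_ceps) matrix of MFCC feature values."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    energies = np.maximum(power @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10)
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]

sig = np.random.randn(16000)   # one second of noise as a stand-in utterance
feats = mfcc(sig)
print(feats.shape)             # (n_frames, 13)
```

With the assumed 25 ms frames and 10 ms hop, one second of 16 kHz audio yields 98 frames of 13 coefficients each.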
(25) The feature vector extractor 120 may extract a feature vector for the audio signal, based on the feature value detected by the feature detector 110.
(26) The feature vector extractor 120 may classify the audio signal by using an ergodic hidden Markov model (HMM). The feature vector extractor 120 may classify the audio signal into a state corresponding to a phonetic category by using the ergodic HMM. A phonetic category may be classified according to features of a voice, for example, phonetic features such as frequency or magnitude. The feature vector extractor 120 may classify the audio signal by using various methods other than the ergodic HMM.
(27) According to a joint factor analysis method, a speaker utterance may be represented as a super vector comprised of a sub-space of a speaker and a sub-space of a channel. In a total variability space, however, a speaker utterance may be represented as a super vector defined in a single space, as expressed in Equation 1:
M=m+Tω [Equation 1]
(28) where the super vector M represents an utterance of a speaker, m indicates a super vector that is independent of the speaker and the channel, T indicates a total variability matrix, and ω indicates a total variability factor, namely, an i-vector. The values of M, m, and T may be acquired according to the joint factor analysis method.
(29) The i-vector may be determined via Baum-Welch statistics, as expressed in Equations 2-4.
(30) N_c=Σ_{t=1}^{L} P(c|y_t, λ) [Equation 2]
(31) where N_c indicates a diagonal block of the CF×CF-dimensional matrix N, and y_t indicates one element of a feature sequence y={y_1, y_2, . . . , y_L} whose length, as detectable by the feature detector 110, is L frames.
(32) In Equation 2, λ is a parameter of a Gaussian mixture model-universal background model (GMM-UBM), and λ={w_c, m_c, Σ_c} (c=1, . . . , C). In Equation 2, Σ included in λ is a diagonal covariance matrix of CF×CF dimension, and C is the number of components of the GMM. GMM-UBM is a method of classifying the distribution characteristics of pieces of data when classifying patterns. In GMM-UBM, a model of the data distribution may be determined according to the parameter λ.
(33) F̃_c=Σ_{t=1}^{L} P(c|y_t, λ)(y_t−m_c) [Equation 3]
(34) where F indicates a dimension of a feature space. The feature space denotes an n-dimensional space in which a feature vector may be defined.
(35) In Equations 2 and 3, the N and F̃ values may be acquired based on a probability value for each parameter of the Baum-Welch statistics.
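The zeroth- and first-order Baum-Welch statistics of Equations 2 and 3 can be sketched as follows. This is an illustrative example, not the patent's implementation; the C-component diagonal-covariance GMM-UBM parameters λ={w_c, m_c, Σ_c} and the feature sequence are all synthetic:

```python
# Baum-Welch statistics sketch for a C-component diagonal-covariance GMM-UBM
# with parameters lambda = {w_c, m_c, Sigma_c}; all values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
C, F, L = 4, 3, 50                      # components, feature dim, frames
w = np.full(C, 1.0 / C)                 # mixture weights w_c
m = rng.standard_normal((C, F))         # component means m_c
var = np.ones((C, F))                   # diagonal covariances Sigma_c
y = rng.standard_normal((L, F))         # feature sequence y = {y_1, ..., y_L}

def log_gauss_diag(y, m, var):
    """Per-component diagonal-Gaussian log densities, shape (L, C)."""
    d = y[:, None, :] - m[None, :, :]
    return -0.5 * (np.sum(d * d / var, axis=2)
                   + np.sum(np.log(2.0 * np.pi * var), axis=1))

logp = np.log(w) + log_gauss_diag(y, m, var)
post = np.exp(logp - logp.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)     # P(c | y_t, lambda)

N_c = post.sum(axis=0)                      # Equation 2: N_c = sum_t P(c|y_t, lambda)
F_c = post.T @ y - N_c[:, None] * m         # Equation 3: first order, centered on m_c

print(N_c.sum())                            # posteriors sum to L over all frames
```

The posteriors P(c|y_t, λ) sum to one per frame, so the zeroth-order statistics N_c sum to the frame count L.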
ω=(I+TᵀΣ⁻¹NT)⁻¹TᵀΣ⁻¹F̃ [Equation 4]
(36) As expressed in Equation 4, the i-vector ω may be determined based on the T, Σ, N, and F̃ values.
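The closed-form i-vector computation of Equation 4 can be sketched directly. All dimensions and matrix values below are synthetic stand-ins for a trained total variability model:

```python
# i-vector per Equation 4: omega = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F~.
# T, Sigma, N, and F~ are random stand-ins for trained/accumulated quantities.
import numpy as np

rng = np.random.default_rng(1)
CF, R = 12, 4                            # supervector dim (C*F), i-vector dim
T = rng.standard_normal((CF, R))         # total variability matrix
Sigma_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, CF))  # inverse UBM covariance
N = np.diag(rng.uniform(1.0, 5.0, CF))   # diagonal zeroth-order statistics
F_tilde = rng.standard_normal(CF)        # centered first-order statistics

precision = np.eye(R) + T.T @ Sigma_inv @ N @ T
omega = np.linalg.solve(precision, T.T @ Sigma_inv @ F_tilde)
print(omega.shape)                       # (4,)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is the usual numerically preferable choice.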
(37) In each state, a GMM parameter exists. Each GMM parameter denotes an individual model that represents a speaker. The GMM parameter may be expressed as in Equation 5.
λ_s={w_c^s, m_c^s, Σ_c^s} [Equation 5]
(38) Equation 1 may be expressed as Equation 6, by including a parameter for each state:
M=m_s+T_sω_s [Equation 6]
(39) where m_s indicates a super vector that is independent of the speaker and the channel but dependent on a phonetic category s, T_s indicates a total variability matrix for the phonetic category s, and ω_s indicates an i-vector for the phonetic category s.
(40) The feature vector extractor 120 may determine the i-vector ω_s for each state according to Equation 4.
(41) The feature vector converter 130 may apply an HLDA conversion matrix A to a q-dimensional i-vector ω_s, as shown in Equation 7, in order to reduce the dimension of the i-vector ω_s for each state determined by the feature vector extractor 120. As the dimension of the i-vector ω_s resulting from the conversion by the feature vector converter 130 is reduced, the number of calculations involving the i-vector ω_s may be reduced.
(42) ω̂_s=Aω_s [Equation 7]
(43) where A indicates an M×N matrix that includes A_[p], comprising the useful dimensions from the first row to the p-th row, and A_[N−p], comprising the remaining (N−p) rows. The (N−p) dimensions are treated as nuisance dimensions, and thus A_[N−p] may be treated as nuisance information while A_[p], up to the p-th dimension, may be used for the useful values. Thus, the dimension of the converted i-vector ω̂ may be reduced to a dimension p that is lower than N.
(44) In LDA, the covariance matrices of all classes are assumed to be identical. However, this assumption does not hold for actual data. Thus, the feature vector converter 130 may apply to the i-vector an HLDA conversion matrix, which reflects the fact that classes have different covariance matrices, instead of an LDA conversion matrix.
(45) By converting the i-vector ω_s by using the HLDA conversion matrix A, the feature vector converter 130 may reduce the number of calculations performed using the i-vector ω_s and may discard an assumption that differs from actual data, thereby increasing the variability between speakers and decreasing the variability within an identical speaker.
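The projection of Equation 7, keeping only the p useful rows of A, is a single matrix-vector product. In this sketch A is a random stand-in for a trained HLDA matrix, and the dimensions are assumptions:

```python
# Dimension reduction per Equation 7: keep the first p rows A_[p] of the
# conversion matrix A and drop the (N - p) nuisance rows.
import numpy as np

rng = np.random.default_rng(2)
N_dim, p = 10, 4
A = rng.standard_normal((N_dim, N_dim))  # stand-in for a trained HLDA matrix
omega_s = rng.standard_normal(N_dim)     # per-state i-vector

omega_hat = A[:p] @ omega_s              # A_[p] rows give the useful dimensions
print(omega_hat.shape)                   # (4,)
```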
(46) The feature vector converter 130 may convert the i-vector ω_s by using an M×q-dimensional unified HLDA (UHLDA) conversion matrix C, which is a combination of the LDA and HLDA conversion matrices, instead of using the HLDA conversion matrix A, as shown in Equation 8:
(47) ω̂_s=Cω_s=[A_{q/2}; W_{q/2}]ω_s [Equation 8]
(48) where W indicates an M×N LDA conversion matrix, and A_{q/2} and W_{q/2} are, respectively, a sub-space of q/2 rows of the HLDA conversion matrix A and a sub-space of q/2 rows of the LDA conversion matrix W. The UHLDA conversion matrix C is not limited to the A_{q/2} and W_{q/2} of Equation 8 and may be comprised of other sub-spaces of the HLDA conversion matrix A and the LDA conversion matrix W.
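A minimal sketch of the UHLDA combination of Equation 8: stack q/2 rows of the HLDA matrix with q/2 rows of the LDA matrix and project. Both matrices here are random stand-ins, not trained transforms:

```python
# UHLDA sketch per Equation 8: C = [A_{q/2}; W_{q/2}], then omega_hat = C omega.
import numpy as np

rng = np.random.default_rng(3)
N_dim, q = 10, 6
A = rng.standard_normal((N_dim, N_dim))  # stand-in HLDA conversion matrix
W = rng.standard_normal((N_dim, N_dim))  # stand-in LDA conversion matrix

C = np.vstack([A[:q // 2], W[:q // 2]])  # q/2 rows from each transform
omega_hat = C @ rng.standard_normal(N_dim)
print(C.shape, omega_hat.shape)          # (6, 10) (6,)
```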
(49) The LDA conversion matrix W and the HLDA conversion matrix A may be acquired by LDA and HLDA, respectively, but embodiments of the present invention are not limited thereto; they may be acquired according to various other methods.
(50) For example, the HLDA conversion matrix A may be determined based on a maximum likelihood (ML) estimation and expectation maximization (EM) algorithm or smooth HLDA (SHLDA), which is another HLDA method.
(51) The feature vector converter 130 may acquire an HLDA conversion matrix via ML estimation, according to a probability value for the case where classes have identical averages and identical covariances and a probability value for the case where classes have different averages and different covariances. The feature vector converter 130 may assume that classes have different averages and different covariances up to the p-th dimension, and that classes have identical averages and identical covariances in the remaining (N−p) dimensions, from the (p+1)-th dimension to the N-th dimension.
(52) The useful dimension p of Equation 7 may be determined using Equations 9-11 below:
(53) Σ^(j)=(1/n_j)Σ_{i=1}^{n_j}(ω_i^(j)−ω̄^(j))(ω_i^(j)−ω̄^(j))ᵀ [Equation 9]
(54) where Σ^(j) indicates the covariance matrix of the i-vectors within the class of a speaker j, ω_i^(j) indicates the i-th i-vector of the speaker j, ω̄^(j) indicates the mean of those i-vectors, and n_j indicates their number.
(55) S_w=(1/J)Σ_{j=1}^{J}Σ^(j) [Equation 10]
(56) where J indicates the number of speakers, and S_w indicates an intra-class covariance matrix acquired under the assumption that the covariance matrices of the i-vectors of the classes of speakers are homoscedastic, as in the LDA method.
(57) Φ_Sw=(1/J)Σ_{j=1}^{J}(Σ^(j)−S_w)(Σ^(j)−S_w)ᵀ [Equation 11]
(58) where Φ_Sw indicates a variance matrix for the intra-class covariance matrix of each speaker. Eigenvalues may be acquired from the variance matrix Φ_Sw via eigenvalue decomposition. The feature vector converter 130 may obtain accumulated energy for each dimension from the eigenvalues of the variance matrix arranged in descending order, and may thus determine, as the useful dimension p, the number of dimensions whose accumulated energy is equal to or greater than a predetermined energy.
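The eigenvalue-energy criterion described above can be sketched as follows. The variance matrix here is a synthetic stand-in (a sample covariance of anisotropic data), and the 90% threshold is the example ratio used later in this description:

```python
# Choosing the useful dimension p: eigendecompose the variance matrix, sort
# eigenvalues in descending order, and take the smallest p whose accumulated
# energy reaches a preset ratio (90% here) of the total energy.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 8)) * np.arange(8, 0, -1)  # anisotropic samples
var_matrix = np.cov(X, rowvar=False)                     # stand-in variance matrix

eigvals = np.linalg.eigvalsh(var_matrix)[::-1]           # descending eigenvalues
energy = np.cumsum(eigvals) / eigvals.sum()              # accumulated energy ratio
p = int(np.searchsorted(energy, 0.90) + 1)               # first dim reaching 90%
print(p, energy[p - 1] >= 0.90)
```

Note that `np.linalg.eigvalsh` returns eigenvalues in ascending order, hence the reversal before accumulating.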
(59) The useful dimension p determined via HLDA is not limited to the above-described embodiment, and may be obtained using any of various other methods.
(60) In addition, the feature vector converter 130 may remove noise data from the converted feature vector by using probabilistic LDA (PLDA).
(61) A method of determining the useful dimension p will now be described in more detail with reference to
(63) The graph of
(64) Assuming that a reference value of accumulated energy for determining a useful dimension is 90% of the entire energy, the feature vector converter 130 may determine an eigenvalue whose accumulated energy is at least 90% of the entire energy. In the graph of
(66) Referring to
(67) In operation S303, the device 100 may extract a feature vector based on the feature value detected in operation S301. The feature vector may be an i-vector obtained via joint factor analysis and may be acquired via Baum-Welch statistics.
(68) In operation S305, the device 100 may acquire a conversion matrix that assumes that classes have different covariance matrices. For example, the device 100 may acquire the conversion matrix, based on a probability value that is based on different covariance values of classes, via ML estimation.
(69) A conversion matrix that may be acquired in operation S305 may be an HLDA conversion matrix. The HLDA conversion matrix may be acquired based on different covariance matrices of classes, in contrast with an LDA conversion matrix. Accordingly, the device 100 may convert an i-vector by reflecting the covariance matrix of actual data, rather than using an LDA conversion matrix that assumes that classes have identical covariance matrices.
(70) In operation S307, the device 100 may convert the feature vector by using the conversion matrix acquired in operation S305.
(71) A method of identifying a user based on a feature vector will now be described in more detail with reference to
(73) Referring to
(74) In operation 420, the device 100 may acquire a super vector m of state 1. In operation 430, the device 100 may acquire necessary parameters according to baum-welch statistics, based on the feature sequence y. In operation 440, an i-vector may be acquired based on the parameters acquired in operation 430 and a total variability matrix T.
(75) The i-vector that is acquirable in operation 470 may be acquired via operations 450-470 according to the same method as the method of acquiring the i-vector in operation 440. The i-vector in operation 440 is acquired from the audio signal including the currently input speaker utterance, whereas the i-vector acquirable in operation 470 may be a feature vector previously acquired for user identification.
(76) In operation 480, the device 100 may perform scoring by using the i-vector acquired from the currently input audio signal and an i-vector that is to be compared for user identification. The scoring may be performed as expressed in Equation 12:
(77) score(ω_target, ω_test)=(ω_target·ω_test)/(‖ω_target‖‖ω_test‖) [Equation 12]
(78) where ω_target indicates a pre-acquired i-vector and ω_test indicates an i-vector acquired from the currently input audio signal.
(79) Equation 12 follows a cosine distance scoring (CDS) method, but embodiments of the present invention are not limited thereto; scoring may be performed according to any of various methods. The device 100 may identify the speaker of the currently input audio signal according to a scoring value acquired according to Equation 12. The device 100 may identify the speaker of the currently input audio signal by performing scoring between the i-vector value acquired based on the currently input audio signal and an i-vector value for each state.
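Cosine distance scoring as in Equation 12 can be sketched in a few lines; the vectors below are made-up examples:

```python
# Cosine distance scoring (Equation 12): the score between a stored target
# i-vector and a test i-vector is their normalized inner product.
import numpy as np

def cds_score(omega_target, omega_test):
    """Cosine similarity between two i-vectors."""
    return float(np.dot(omega_target, omega_test)
                 / (np.linalg.norm(omega_target) * np.linalg.norm(omega_test)))

target = np.array([1.0, 2.0, 3.0])
print(round(cds_score(target, 2.0 * target), 6))                # 1.0 (same direction)
print(round(cds_score(target, np.array([3.0, -1.5, 0.0])), 6))  # 0.0 (orthogonal)
```

A score near 1 indicates the same speaker direction in the total variability space; a score near 0 indicates unrelated i-vectors.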
(81) Referring to
(82) In operation S503, the device 100 may acquire a feature vector for at least one state. The device 100 may acquire a feature vector previously stored for user identification. The device 100 may acquire at least one feature vector for each state.
(83) In operation S505, the device 100 may perform user identification by performing scoring on the feature vector acquired in operation S501 and the at least one feature vector acquired in operation S503. The device 100 may determine a state corresponding to the feature vector of the input audio signal by comparing a scoring value acquired based on the feature vector acquired in operation S501 with a scoring value acquired based on the feature vectors of states acquired in operation S503. The device 100 may identify the user of the currently input audio signal, based on the determined state.
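The identification step described above, comparing scoring values across states, can be sketched as a simple argmax over per-state scores. The stored i-vectors and names here are made-up illustrations:

```python
# User identification sketch: score the input i-vector against the stored
# i-vector of each state and pick the best-scoring state.
import numpy as np

def cosine(a, b):
    """Cosine similarity used as the scoring function."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

stored = {"alice": np.array([1.0, 0.0]),   # hypothetical enrolled i-vectors
          "bob": np.array([0.0, 1.0])}
test = np.array([0.9, 0.1])                # i-vector from the input audio signal

best = max(stored, key=lambda name: cosine(stored[name], test))
print(best)                                # "alice" scores higher
```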
(84) The internal components of a device will now be described in detail with reference to
(86) Referring to
(87) The receiver 610 may receive an audio signal including utterance of a user. For example, the receiver 610 may receive an audio signal including utterance of a user, via a microphone.
(88) The controller 620 may extract a feature vector, based on the audio signal received by the receiver 610. The controller 620 may extract an i-vector by joint factor analysis and reduce the dimension of the i-vector by using an HLDA conversion matrix. The controller 620 may identify a speaker corresponding to a currently input audio signal by performing scoring on a feature vector corresponding to the currently input audio signal and a feature vector for each state.
(89) According to an embodiment, the performance of speaker recognition may be increased by reducing the dimension of an i-vector by using an HLDA conversion matrix which takes into account the fact that classes have different covariance matrices.
(90) Methods according to some embodiments may be embodied as program commands executable by various computer means and may be recorded on a computer-readable recording medium. The computer-readable recording medium may include program commands, data files, data structures, and the like, separately or in combination. The program commands recorded on the computer-readable recording medium may be specially designed and configured for embodiments of the present invention or may be well known to and usable by one of ordinary skill in the art of computer software. Examples of the computer-readable recording medium include a magnetic medium such as a hard disk, a floppy disk, or a magnetic tape; an optical medium such as a compact disc read-only memory (CD-ROM) or a digital versatile disc (DVD); a magneto-optical medium such as a floptical disk; and a hardware device specially configured to store and execute program commands, such as ROM, random-access memory (RAM), or flash memory. Examples of the program commands include machine language code produced by a compiler and high-level language code that can be executed by a computer by using an interpreter or the like.
(91) The exemplary embodiments should be considered in a descriptive sense only and not for purposes of limitation. Descriptions of features or aspects within each embodiment should typically be considered as available for other similar features or aspects in other embodiments.
(92) While the inventive concept has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood that various changes in form and details may be made therein without departing from the spirit and scope of the following claims.