EAR CANAL DEFORMATION BASED CONTINUOUS USER IDENTIFICATION SYSTEM USING EAR WEARABLES
20230020631 · 2023-01-19
Inventors
CPC classification
G01P13/00
PHYSICS
H04R1/028
ELECTRICITY
H04R1/1041
ELECTRICITY
G06F21/32
PHYSICS
H04R2420/07
ELECTRICITY
International classification
G06F21/32
PHYSICS
G01P13/00
PHYSICS
H04R1/10
ELECTRICITY
Abstract
Disclosed herein is a system and methods for ear canal deformation based user authentication using in-ear wearables. This system provides continuous and passive user authentication and is transparent to users. It leverages ear canal deformation that combines the unique static geometry and dynamic motions of the ear canal when the user is speaking for authentication. It utilizes an acoustic sensing approach to capture the ear canal deformation with the built-in microphone and speaker of the in-ear wearable. Specifically, it first emits well-designed inaudible beep signals and records the reflected signals from the ear canal. It then analyzes the reflected signals and extracts fine-grained acoustic features that correspond to the ear canal deformation for user authentication.
Claims
1. A system for acoustic authentication of a user, the system comprising: a microphone configured to face into the ear canal of a user, wherein said microphone comprises a transmitter for transmitting a probe signal, and a receiver for receiving and recording signal reflections from the ear canal of the user, wherein said signal reflections are based on ear canal dynamic deformation and ear canal geometry information; an authenticator, configured to determine a current user acoustic signature from the receiver and to compare the current user acoustic signature with a predefined user acoustic signature and to authenticate the user based on the comparison of the current user acoustic signature with the predefined user acoustic signature; and housing, wherein said microphone and authenticator are housed inside said housing.
2. The system of claim 1, wherein the system is an in-ear wearable.
3. The system of claim 2, wherein said in-ear wearable is designed to fit into the ear of the user.
4. The system of claim 2, wherein the in-ear wearables include earbuds, earpieces and headsets.
5. The system of claim 4, wherein said headset is configured to connect to a device both wirelessly and by wire.
6. The system of claim 5, wherein the device plays both inaudible and audible transmissions which are received by the headset.
7. The system of claim 1, wherein said recording can be accomplished on-demand or continuously.
8. The system of claim 1, wherein the dynamic deformations are due to jaw motion (for instance, while speaking) and head motion.
9. The system of claim 1, wherein the probe signal is a non-audible stimulus.
10. The system of claim 1, wherein a high pass filter is used in the receiver to separate inaudible and audible signals.
11. The system of claim 1, wherein the system further comprises a motion sensor.
12. The system of claim 11, wherein the motion sensor detects motion of the head.
13. A method of acoustic authentication of a user using at least one headset, which headset comprises: a microphone configured to face into the ear canal of the user, wherein said microphone comprises a transmitter for transmitting a probe signal, and a receiver for receiving and recording signal reflections from the ear canal of the user, wherein said signal reflections are based on ear canal dynamic deformation and ear canal geometry information; an authenticator, configured to determine a current user acoustic signature from the receiver and to compare the current user acoustic signature with a predefined user acoustic signature and to authenticate the user based on the comparison of the current user acoustic signature with the predefined user acoustic signature; and housing, wherein said microphone and authenticator are housed inside said housing; wherein the headset authenticates the user based on ear canal dynamic deformation.
14. The method of claim 13, wherein the device plays both inaudible and audible transmissions which are received by the headset.
15. The method of claim 13, wherein said recording can be accomplished on-demand or continuously.
16. The method of claim 13, wherein the dynamic deformations are due to jaw motion (for instance, while speaking) and head motion.
17. The method of claim 13, wherein the probe signal is a non-audible stimulus.
18. The method of claim 13, wherein a high pass filter is used in the receiver to separate inaudible and audible signals.
19. The method of claim 13, wherein the headset further comprises a motion sensor.
20. The method of claim 19, wherein the motion sensor detects motion of the head.
Description
BRIEF DESCRIPTION OF FIGURES
DETAILED DESCRIPTION
[0032] The system disclosed herein, referred to as EarDynamic, is a continuous user authentication system that leverages the ear canal deformation sensed by the in-ear wearable. The ear canal deformation reflects the ear canal dynamic motion caused by jaw joint or articulation activities, for example, when the user is speaking. Thus, the ear canal deformation not only contains the static geometry of the ear canal that represents the physiological characteristics of the user but also includes the geometry changes that characterize the behavioral properties of the user while speaking. Recent prior work shows that the static geometry of the ear canal is unique for every individual [15]. It is disclosed herein that the ear canal deformation due to articulation activities includes more dynamic information and can provide better and more secure user authentication while the user is speaking.
[0033] In particular, the system disclosed herein utilizes an acoustic sensing approach to capture the ear canal deformation with the embedded microphone and speaker of the in-ear wearable. It first emits inaudible beep signals from the in-ear wearable to probe the user's ear canal. It then records the reflected inaudible signals together with the user's audible sounds. The reflected inaudible signals thus contain information about the ear canal dynamic motions, i.e., the changes in the geometry of the ear canal, due to the movements of the jaw joint and other articulators. More specifically, human speech relies on the motions of multiple articulators (e.g., jaw, tongue, mouth) to pronounce various phonemes. When the jaw moves, the temporomandibular joint (TMJ) also moves, which causes either expansion or compression of the ear canal wall. Such a phenomenon caused by TMJ movements is known as Ear Canal Dynamic Motion (ECDM) [12]. Such motions result in different effects across people [32]. For example, during the speaking process, each person's ear canal will be either expanded or compressed at various degrees and speeds. It was found that the ear canal deformation is consistent for the same individual but varies across individuals depending on anatomy and behavior. Thus, by utilizing the ear canal deformation extracted from the reflected acoustic signals, our system can distinguish different users.
[0034] To better leverage the ear canal deformation and improve the usability of our system, various dynamic motions were categorized into different groups based on the phoneme pronunciations. In particular, although ear canal dynamic motions cannot be directly measured, one can infer such motions from the articulatory movements. However, measuring the articulatory movements requires specialized sensors attached to the articulators, which is impractical. This challenge was solved by looking at the phonemes in the user's speech. Specifically, each phoneme pronunciation corresponds to unique and consistent articulatory movements. The ear canal dynamic motions thus can be identified by recognizing each phoneme and the corresponding articulatory movements. Consequently, the ear canal dynamic motions for phonemes that are invoked by similar jaw and tongue movements share high similarity and can be categorized into the same group. Such a categorization also reduces the computational complexity and shortens the authentication time.
[0035] To perform user authentication, the system disclosed herein extracts fine-grained acoustic features that correspond to the ear canal deformation and compares these features against the user's enrolled profile. To evaluate EarDynamic, experiments were conducted with 24 participants in various noisy environments (i.e., home, office, grocery store, vehicle, and parks). The system was also evaluated under different daily activities during the user authentication process (i.e., maintaining different postures, performing different gestures). The results show EarDynamic achieves high accuracy and maintains comparable performance in different noisy environments and under various daily activities. The contributions of this work are summarized as follows:
[0036] 1. It is shown that the dynamic deformation information of the ear canal is unique for each individual and can be utilized for user authentication. Ear canal dynamic motions are further categorized into various groups based on phoneme pronunciation to facilitate user authentication.
[0037] 2. EarDynamic is proposed, a continuous and user-transparent authentication system utilizing in-ear wearable devices. It leverages ear canal deformation that combines the unique static geometry and dynamic motion of the ear canal when the user is speaking for authentication. A prototype of EarDynamic was built with off-the-shelf accessories by embedding an inward-facing microphone inside an earbud.
[0038] 3. Extensive experiments were conducted to evaluate the performance of the proposed EarDynamic. Experimental results show that our system achieves a recall of 97.38% and an F1 score of 96.84%. Results also show that EarDynamic works well in different noisy environments under various daily activities.
[0039] System and Attack Model
[0040] The system disclosed herein utilizes dynamic deformation of the ear canal captured by the in-ear wearable for user authentication. It requires the in-ear wearable equipped with one microphone and one speaker. The authentication can be continuous, which means the microphone keeps sending inaudible probe signals for continuous authentication. Or the authentication can be triggered on-demand based on the requirements of the mobile applications.
[0041] The authentication process is also transparent to the users as it does not require any user cooperation. During authentication, the user is free to speak, conduct activities, or remain silent. If the user is speaking or conducting activities, the dynamic ear canal deformation is extracted for authentication. In this work, two types of attack models are considered: mimic attacks and advanced attacks. In a mimic attack, an adversary attempts to compromise the system by spoofing the ear canal deformation of the legitimate user. In particular, an adversary wears the victim's in-ear device and tries to issue voice commands for user authentication. Moreover, the attacker might mimic the jaw or head motions of the legitimate user to bypass the user authentication.
[0042] The second type of attack is an advanced attack. Although the ear canal is hidden within the human skull, it is still possible that such information could be leaked, for example, through 3D ear canal scanning when the victim is fitted for a hearing aid or treated for an ear disease. For this type of attack, it is assumed that an adversary acquires the user's ear canal geometry information and can rebuild it precisely, such as by using 3D-printing technology. Such attacks may easily bypass a system that only utilizes the static geometry of the ear canal for user authentication. However, it is extremely hard, if not impossible, for an adversary to reproduce the dynamic motions of the ear canal to bypass ear canal deformation based authentication.
[0043] Geometry of Ear Canal
[0044] The ear is the human auditory organ, which consists of three parts: the outer ear, the middle ear, and the inner ear, as shown in
[0045] Firstly, the interspace of the ear canal varies from one individual to another [34, 34, 36]. As shown in
[0046] Moreover, both the curvature and the cross-section of the canal are unique for different people. The curvature is usually measured along the center axis of the ear canal. The ear canal is relatively straight in certain individuals but bent in others. One study involving 185 adults suggests that for about 30% of the subjects studied, the entire eardrum could be seen from a viewpoint near the pinna. On the other hand, for about 9% of the subjects the eardrum was invisible from the same viewpoint, which indicates that those ear canals are relatively narrow and curved [25]. Furthermore, the cross-sectional area changes over the entire course of the ear canal due to its complex structure. For example, in the middle portion of the canal, the cross-sectional area can range between 25 and 70 mm² for different subjects [43].
[0047] Lastly, the ear canal wall is composed of skin, cartilage, and bone, where the proportions of cartilage and bone are different for each person. As shown in
[0048] Dynamic Deformation of Ear Canal
[0049] Because human speech relies on the motions of articulators including the jaw, tongue, and lips, it also causes dynamic deformation (i.e., expansion or compression) of the ear canal, known as Ear Canal Dynamic Motion (ECDM). Such motions mainly consist of mouth motions and head motions. Specifically, mouth motion includes both jaw and tongue movement, and the geometry of the ear canal changes when people are speaking or chewing. As shown in
[0050] One important observation is that the dynamic deformation of the ear canal also has diversity. Research [32] shows that with the same motion of opening the mouth (e.g., dropping the jaw), the volume changes vary from person to person. The ear canal volume is compressed by as much as 10 mm³ for about 25% of the testing subjects and expanded by as much as 25 mm³ for 67% of the subjects. Meanwhile, the volume of the remaining 8% of the subjects' ear canals stays the same even when the joints move. The ear canal diameters also change differently for different individuals. An experiment involving 1488 ears suggests that with a similar joint motion, about 20% of the subjects' ear canal diameters decrease while the remaining subjects' increase by up to 2.5 mm. Some studies [17] also show that the ear canal moves anteriorly, posteriorly, or with a combination of both for different people. In this work, it was confirmed that the dynamics of the ear canal have individual differences. Thus, the ear canal deformation caused by articulation activities can be leveraged for user authentication.
[0051] Ear Canal Deformation Categorization
[0052] To better leverage the ear canal deformation and improve the usability of our system, various ear canal dynamic motions were categorized into different groups. As the ear canal deformation is invoked by the movements of the articulators (e.g., jaw, tongue, and mouth) when the user is speaking, the deformation could in principle be categorized by measuring various articulator gestures. But such measurements usually require multiple specialized devices attached to the articulators, which is impractical and cumbersome. To solve this technical challenge, phoneme pronunciation is relied upon. In particular, each phoneme pronunciation corresponds to unique and consistent articulatory gestures. By grouping the phonemes with similar articulatory gestures, the corresponding ear canal deformations can be grouped as well. Thus, phonemes with a similar scale of jaw and tongue movements are categorized into the same groups, and the corresponding ear canal deformations belong to the same group because they are caused by similar jaw and tongue movements. This grouping is then refined by experimenting on different subjects and re-categorizing the phonemes/deformations with high similarity. Specifically, phonemes are the smallest distinctive units of sound of a language and can be divided into two major categories: vowels and consonants. Based on the position of the jaw and tongue when pronouncing them, phonemes can be categorized into different groups.
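The grouping described above amounts to a simple lookup from phoneme to deformation category. The sketch below is illustrative only: the phoneme symbols of Table 1 are partly illegible in the filing, so both the keys and the group labels here are assumed placeholders rather than the system's actual mapping.

```python
# Hypothetical phoneme-to-deformation-category map. The phoneme symbols and
# group assignments are illustrative placeholders; Table 1 of the filing is
# partly illegible, so this is NOT the actual categorization.
DEFORMATION_GROUPS = {
    "i": "tongue forward, jaw open slightly",
    "ai": "tongue lower, jaw open widely",
    "ts": "tongue raised and fricative, jaw open widely",
    "s": "tongue raised, jaw open slightly",
    "th": "tongue fricative, jaw open slightly",
}

def group_of(phoneme):
    """Return the deformation group for a phoneme, or None for phonemes
    (such as [p]) that cause almost no ear canal deformation."""
    return DEFORMATION_GROUPS.get(phoneme)
```

Phonemes that map to the same group are treated as producing interchangeable deformation evidence, which is what reduces the number of per-phoneme models the system must enroll and evaluate.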
[0053] Ear Canal Deformation Sensing
[0054] Measuring the geometry of the ear canal directly is challenging due to its structural complexity. Specifically, instead of being a straight tube, the ear canal is an "S"-shaped cavity, and its cross-section keeps changing along the entire canal. Moreover, due to the complex articulatory gestures, the dynamic deformation caused by these gestures is also difficult to capture directly. For example, jaw motion can result in various deformations of the ear canal in different directions. The measuring process becomes even more difficult when combining the static geometry with the dynamic deformation of the ear canal.
[0055] In this work, an acoustic sensing approach is used to capture the ear canal deformation indirectly. It is done by emitting inaudible acoustic signals to probe the ear canal and analyzing the signal reflections that are affected by various geometric characteristics of the ear canal (e.g., the interspace, the curvature, the diameter of the ear canal, and the canal wall composition). A prototype is built using an off-the-shelf earbud equipped with an inward-facing microphone that enables both transmitting probe signals and recording the signal reflections from the ear canal. Moreover, we leverage the channel response to measure the ear canal dynamic deformation and its geometry information. Specifically, the channel response of the ear canal is the ratio of the reflected signal to the incident probe signal. The channel response depicts how the ear canal deformation (i.e., the wireless channel) reflects the original probe signals. By analyzing the channel response, different users can be distinguished based on their unique ear canal deformation.
[0056] System Overview
[0057] The key idea underlying the user authentication system is to leverage the advanced acoustic sensing capabilities of the in-ear wearable device to sense the dynamic deformation of the user's ear canal. As illustrated in
[0058] This system has the ability to work under both static and dynamic scenarios of the ear canal. When no head movements or articulatory motions are detected, the user is in a static scenario, where the ear canal geometry remains the same throughout the authentication process. Thus, the captured signal reflections represent the physiological characteristics of the user's ear canal. The extracted features that correspond to the static geometry of the ear canal are then compared against the user's enrolled profiles to determine whether the current wearer is the legitimate user.
[0059] Different from the static scenario, dynamic deformation represents the combination of both the physiological and behavioral characteristics of the ear canal. It can be extracted under dynamic scenarios, where the user is speaking or moving the head. To better leverage the ear canal deformation, various deformation motions are categorized into different groups based on phoneme pronunciation, such that each group shares similar jaw and tongue movements. Such an approach has the benefit of improving system usability by simplifying the profile enrollment process. For example, the disclosed system only requires the user to speak a few sentences that involve multidimensional motions of the jaw and tongue to generate an individual profile for later authentication. To identify head movement, the present system relies on the embedded motion sensor of the wearable. In particular, five head postures that lead to ear canal deformation are considered: turning right, turning left, facing down, facing up, and facing forward.
[0060] The disclosed system can perform authentication while users wear in-ear devices according to their natural habits. By utilizing an embedded inward-facing microphone and motion sensor, which have been increasingly adopted in wearable devices, EarDynamic is highly practical and compatible. Compared with traditional biometric authentication modalities, such as fingerprint and face, our system can achieve continuous authentication and is transparent to the user, requiring no user cooperation.
[0061] System Flow
[0062] The disclosed system consists of four major components: in-ear wearable sampling, ear canal and motion sensing, dynamic feature extraction, and user authentication, as shown in
[0063] For feature extraction, the captured signals that consist of the reflected inaudible signals and the audible signals are processed. A high pass filter is first applied to separate the inaudible and audible signals. The inaudible signals contain the acoustic properties of the ear canal, whereas the audible signals include the speech content of the user. Next, the system segments the separated audible signals into a sequence of phonemes and maps them to the corresponding inaudible components for capturing the ear canal deformation. The phoneme segmentation is done by utilizing the Munich Automatic Segmentation system [39] on both the audible signal components and inaudible signal components. For each segment, appropriate features are extracted based on different scenarios, as shown in
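The band-separation step described above can be sketched with standard DSP tools. This is a minimal illustration, assuming a 48 kHz sampling rate and a 16 kHz crossover (both assumptions; the filing does not specify the recording rate or filter design):

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 48_000      # assumed sampling rate (Hz)
CUTOFF = 16_000  # probe band starts at 16 kHz

def split_bands(recording, fs=FS, cutoff=CUTOFF, order=8):
    """Separate the recording into the inaudible probe reflections
    (above the cutoff) and the audible speech (below the cutoff)."""
    sos_hi = butter(order, cutoff, btype="highpass", fs=fs, output="sos")
    sos_lo = butter(order, cutoff, btype="lowpass", fs=fs, output="sos")
    inaudible = sosfilt(sos_hi, recording)  # ear canal reflections
    audible = sosfilt(sos_lo, recording)    # speech content for phoneme labeling
    return inaudible, audible
```

The high-passed component would feed the channel-response analysis, while the low-passed component would go to phoneme segmentation.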
[0064] Lastly, the system can authenticate the user based on the information extracted in the previous steps. A sequence of phoneme-based classifiers is combined into one stronger classifier to improve classification accuracy. If a positive decision is given, the user is considered legitimate. Otherwise, the system deems the current user an unauthorized user.
[0065] Ear Canal Sensing
[0066] Once the system is triggered, the speaker of the in-ear wearable sends out acoustic signals to probe the ear canal. The probe signals are designed as a chirp signal with frequency ranges from 16 kHz to 23 kHz. The reason for such a design is twofold: first, the frequency range from 16 kHz to 23 kHz is inaudible to most human ears, which makes the authentication process transparent to the user; second, the chosen frequency range is sensitive to the subtle motions, which can improve our system's ability to capture the ear canal deformation. During the process of ear canal sensing, the inward-facing microphone keeps recording the reflected signals bounced from the ear canal. These reflections can be analyzed to further extract the acoustic properties of the ear canal.
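A probe of the shape described above can be generated directly. The sketch below assumes a 48 kHz sampling rate and a 10 ms probe duration, neither of which is specified in the filing:

```python
import numpy as np
from scipy.signal import chirp

FS = 48_000  # assumed sampling rate (Hz); not specified in the filing
DUR = 0.01   # assumed 10 ms probe duration

def make_probe(fs=FS, dur=DUR, f0=16_000, f1=23_000):
    """Generate the inaudible 16-23 kHz linear chirp probe signal."""
    t = np.linspace(0, dur, int(fs * dur), endpoint=False)
    sig = chirp(t, f0=f0, t1=dur, f1=f1, method="linear")
    return sig * np.hanning(len(sig))  # window to reduce spectral leakage
```

The windowing keeps the probe's energy confined to the 16-23 kHz band, so it stays inaudible and does not bleed into the speech band that is later separated off.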
[0067] Signal Processing
[0068] Ear Canal Deformation Categorization. The underlying principle of ear canal deformation categorization is that similar articulatory gestures, i.e., jaw and tongue motions, have a similar impact on the geometry of the ear canal. In particular, each phoneme is produced by a sequence of coordinated movements of several articulators. In this work, the focus is on the two articulators (i.e., jaw and tongue) that contribute the most to the ear canal deformation. For example, the phoneme sounds [:] and [a] are both produced with a lower and backward tongue position and an open jaw. Thus, these two phonemes have a similar impact on the ear canal deformation and are categorized into the same group. Several phonemes were eliminated because they involve minimal usage of the articulators, which leads to almost no impact on the ear canal deformation. For instance, when the user pronounced the phoneme [p], no ear canal deformation was detected. The categorization results for commonly used vowels and consonants are summarized in Table 1.
TABLE 1: Deformation Categories Based on Phoneme

Deformation (Articulator) Category: Phonemes
Tongue Forward and Jaw Open Slightly: [i], [ ], [ ], [ ], [ ], [ ], [ ]
Tongue Lower and Jaw Open Widely: [ ], [ai], [ ], [ ], [ ], [ ]
Tongue Back and Raised and Jaw Open Slightly: [ ], [ ], [ ]
Tongue Back and Jaw Open Moderately: [ ], [ ], [ ], [ ]
Tongue Raised and Fricative and Jaw Open Widely: [t ], [tr], [ts], [d ], [dr], [dz]
Tongue Raised and Jaw Open Slightly: [f], [s], [ ], [h], [ ], [z], [ ], [r]
Tongue Fricative and Jaw Open Slightly: [θ], [ ], [ ]

([ ] indicates data missing or illegible when filed)
[0069] As each phoneme contains unique formants (i.e., characteristic frequencies), one can segment and identify each phoneme by analyzing the spectrogram of the audible signal. In particular, an automatic speech recognition protocol is leveraged to identify each word in the sample speech [35]. Then, MAUS is utilized as the primary means of phoneme segmentation and labeling [24]. This is done by transforming the samples into an expected pronunciation and searching for the highest-probability alignment with a Hidden Markov Model [23]. The segmented and labeled phonemes are categorized according to Table 1 for further analysis.
[0070] Feature Extraction. The next step is to extract features from the categorized signal segments. The captured signal reflection from the ear canal contains the acoustic characteristics of the user's ear canal. However, due to the dynamic nature of the ear canal geometry during the speaking process, the channel response of the received signals is also time-varying: because of the deformation of the ear canal, the channel response c.sub.1(t) at one time instant is not equal to the channel response c.sub.2(t) at a later instant. Therefore, the received signal r(t) can be represented as r(t)=c(t, τ)*s(t)=∫.sub.−∞.sup.∞s(t−τ)c(t, τ)dτ, where s(t) is the emitted probe signal, c(t, τ) is the time-varying channel response, and τ is the propagation delay. For each segmented phoneme r(t.sub.s), the channel response relates the reflected spectrum to the incident probe spectrum at frequency ω. Thus the channel response is C(ω)=R(ω)/S(ω), where R(ω) and S(ω) are the spectra of the received signal and the probe signal, respectively.
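The channel response, i.e., the ratio of the reflected spectrum to the incident probe spectrum, can be estimated per phoneme segment with an FFT division. This is a minimal sketch; the FFT length and the small regularization term are assumptions:

```python
import numpy as np

def channel_response(probe, reflection, n_fft=1024, eps=1e-8):
    """Estimate C(w) = R(w) / S(w), the ratio of the reflected spectrum
    to the incident probe spectrum, for one segment."""
    S = np.fft.rfft(probe, n_fft)       # incident probe spectrum
    R = np.fft.rfft(reflection, n_fft)  # reflected signal spectrum
    return R / (S + eps)                # eps guards near-zero probe bins
```

Features for a segment would then be derived from the magnitude and phase of C(ω) restricted to the 16-23 kHz probe band.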
[0071] By capturing the channel response under different segments of phonemes, we could extract the acoustics characteristics that represent both the static geometry and dynamic motions of the ear canal at a specific time point. As shown in
[0072] Classifier Boosting
[0073] After obtaining the features extracted from the received reflected signals, the system proceeds to the authentication process. Such a process can be viewed as finding an optimal solution to a classification problem that distinguishes between legitimate users and attackers. To achieve better performance, adaptive boosting, an ensemble learning algorithm [37], is adopted. Such an algorithm is commonly used for classification or regression to further improve the distinguishing ability. Specifically, the authentication problem can be formulated as one boosted classifier: F(x)=Σ.sub.t=1.sup.Tf.sub.t(x),
[0074] where each segmented phoneme is used by a weak learner that takes one channel response c as input and returns a value indicating the class of the object. Each weak learner is one classifier that is only slightly correlated with the final classifier, which has one output o(x.sub.i) for each input from the training set. By iteratively combining all the weak learners, one can obtain a boosted learner that is highly correlated with the final learner. During each iteration of training t, a weak classifier is selected and assigned a coefficient c.sub.t to minimize the training error: E.sub.t=Σ.sub.iE[F.sub.t−1(x.sub.i)+c.sub.to(x.sub.i)],
[0075] where f.sub.t(x)=c.sub.to(x) is the classifier that is boosted into the final classifier, while F.sub.t−1(x) is the boosted classifier from the previous iteration. A weight vector containing the update information for each sample is also maintained; it is created to focus on the samples that have relatively larger errors in the previous stage.
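The boosting step can be sketched with an off-the-shelf AdaBoost implementation standing in for the per-phoneme weak learners described above. The feature vectors here are synthetic stand-ins (assumed toy distributions), not real channel-response features:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-phoneme channel-response feature vectors:
# class 1 = legitimate user, class 0 = attacker (assumed toy distributions).
X_legit = rng.normal(0.0, 1.0, size=(200, 8))
X_attack = rng.normal(3.0, 1.0, size=(200, 8))
X = np.vstack([X_legit, X_attack])
y = np.array([1] * 200 + [0] * 200)

# Each boosting round adds one weak learner with a coefficient chosen to
# reduce the remaining training error, mirroring the update described above.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
```

In the system itself, each weak learner would correspond to one phoneme segment's classifier, and the boosted ensemble gives the final accept/reject decision.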
[0076] In the context of this application, the term “headset” refers to all types of headsets, headphones, and other head worn audio playback devices, such as for example circum-aural and supra-aural headphones, ear buds, in ear headphones, and other types of earphones. The headset may be of mono, stereo, or multichannel setup. A dedicated microphone for recording the user's voice may or may not be provided as part of a headset in the context of this explanation. The headset in some embodiments may comprise an audio processor. The audio processor may be of any suitable type to provide output audio from an input audio signal. For example, the audio processor may be a digital sound processor (DSP).
[0077] The term “ear” in the preceding definition is understood as to refer to any part of the ear of the user, and in particular the outer ear comprising concha, tragus, helix, antihelix, scapha, navicular fossa, pinna, etc. In one embodiment, the influence stems primarily from the pinna of the user.
[0078] In one or more embodiments, authenticating the user may include comparing a current signature to a stored or predefined signature. Further, based on the filter transfer function currently generated, a current signature may be generated. The current signature may be generated by applying a transformation and/or function to the current transfer function generated. After computing the current signature, the current signature may be compared with the stored signature. In one or more embodiments, the current signature may be compared with the stored signature by computing the mean square error (E) between the current signature and the stored signature. The computed mean square error E may indicate the confidence of a match between the stored signature and the current signature. Moreover, E may be compared to a predetermined threshold. In one or more embodiments, if E is less than the predetermined threshold, then the current signature is considered to match the stored signature. However, if E is greater than the predetermined threshold, then the current signature is not a match to the stored signature. Accordingly, if E is less than the predetermined threshold, then the user is authenticated and provided access to the contents of the headset's memory and/or a host device that is paired with the headset.
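The threshold comparison can be sketched as follows. The threshold value is an assumption, since the text says only that it is predetermined:

```python
import numpy as np

THRESHOLD = 0.05  # assumed value; the text says only "predetermined threshold"

def authenticate(current_sig, stored_sig, threshold=THRESHOLD):
    """Match if the mean square error E between the current and stored
    signatures is below the predetermined threshold."""
    current = np.asarray(current_sig, dtype=float)
    stored = np.asarray(stored_sig, dtype=float)
    e = float(np.mean((current - stored) ** 2))
    return e < threshold
```

A lower E indicates higher confidence of a match; the threshold trades off false accepts against false rejects and would be tuned per deployment.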
[0079] The authentication procedure may in some embodiments be provided during an initial power up process of the headset and once succeeded be valid for a given time interval or until the user powers off the headset.
[0080] Alternatively or additionally and in further embodiments, the authentication procedure may be conducted on a continuous or quasi-continuous basis, e.g., in the latter case in regular or irregular timeout intervals. In these embodiments, the authentication procedure simultaneously serves as a so-called “don/doff detection”, i.e., a detection of whether the headset is currently worn, as the authentication would be revoked once the user's ear biometrics would not be determinable anymore from the signals/signature. In some embodiments, at least one further current user acoustic signature is determined and the authentication of the user is revoked based on a comparison of the further current user acoustic signature with the predefined user acoustic signature. For example, in case a further current user acoustic signature is determined from a corresponding further filter transfer function every 5 seconds, it is determinable whether the user still wears the headset. Certainly, the interval may be adapted based on the application and the required security level.
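The quasi-continuous re-authentication with don/doff detection described above can be sketched as a polling loop. The 5-second interval follows the example in the text, while the callback names are assumptions:

```python
import time

def continuous_auth(get_signature, stored_sig, authenticate, interval=5.0):
    """Re-check the wearer at a regular interval; revoke on mismatch or doff.

    get_signature() is assumed to return the current acoustic signature,
    or None when no ear-canal reflection is detectable (headset removed).
    Returns False when authentication is revoked.
    """
    while True:
        sig = get_signature()
        if sig is None or not authenticate(sig, stored_sig):
            return False  # revoke: headset removed or a different wearer
        time.sleep(interval)
```

Because the signature becomes undeterminable as soon as the headset leaves the ear, the same loop serves as both re-authentication and don/doff detection.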
Example
[0081] Experimental Setup:
[0082] Environments and Hardware. The authentication process can happen in various environments under everyday use scenarios. For example, the user might command the voice assistant through the headset in an office environment. Additionally, the user could make payments at a grocery store using electronic payment or send a message through a voice command in a vehicle. Thus, to evaluate our system's performance in real-world environments, various locations including home, office, grocery store, vehicle, and parks were chosen for the experiments. Moreover, to better simulate everyday usage of EarDynamic, participants were asked to wear the system according to their natural habits, as they would wear in-ear devices on a daily basis. The participants were allowed to maintain various postures (sitting, standing, walking) or perform different gestures (e.g., waving arms and hands, moving the head) during the experiments.
[0083] There are several in-ear earbuds on the market that are equipped with inward-facing microphones (e.g., Apple AirPods Pro® [6], Amazon Echo Buds® [4]). However, those devices are less desirable due to firmware restrictions, and the raw data cannot be accessed for feature extraction. In this work, a prototype system was built utilizing only off-the-shelf hardware to demonstrate its practicability and compatibility. A regular in-ear earbud on the market costs less than 7 dollars and includes a 12 mm speaker, a 3.5 mm audio jack, and a microphone chip with a sensitivity of −28±3 dB. The total cost of this prototype is very low, making it more affordable to a wider range of customers compared to the abovementioned earbuds. As shown in
[0084] Data Collection. 24 participants were recruited for the experiments, including 12 females and 12 males with an age range from 20 to 40. The participants were informed about the goal of the experiments and asked to talk in their natural way of speaking. For the enrollment, each participant was asked to sit in a classroom environment and wear the prototype at his/her habitual position. Then, they were required to repeat five passphrases three times while the system emitted inaudible signals. The passphrases were designed to include dynamic deformation motions from all the categories. The features were then extracted from the captured signal reflections along with the audible signals, and the extracted features were used to establish each user's template. After enrollment, each participant was asked to speak 10 sentences with lengths varying from 2 to 20 words. The sentences included commonly used voice commands such as "Hey Google" and "Alexa, play some music" as well as other short daily conversation pieces. Each participant was asked to repeat each sentence at least 10 times for the experiment. In total, 2880 sentences from different users were collected and used for the overall evaluation. Among those sentences, 1080, 700, 300, 300, 300, and 200 sentences were collected at home, in the classroom, in the office, in the vehicle, at the grocery store, and at the park, respectively.
[0085] Metrics. To evaluate the authentication performance of the system, four different metrics were introduced: accuracy, recall, precision, and F1 score. They are defined as follows:
Accuracy=(TP+TN)/(TP+TN+FP+FN), Recall=TP/(TP+FN), Precision=TP/(TP+FP), F1 Score=2×Precision×Recall/(Precision+Recall),
where TP, TN, FP, and FN are the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively. The receiver operating characteristic (ROC) curve is also leveraged, which represents the relationship between the True Accept Rate (i.e., the probability of identifying the valid user) and the False Accept Rate (i.e., the probability of incorrectly accepting an attacker) as the threshold varies.
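The four metrics can be computed directly from the confusion-matrix counts. The following sketch applies the standard definitions; the function name is chosen for illustration only.

```python
def auth_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and F1 score from the
    confusion-matrix counts TP, TN, FP, and FN."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # also the True Accept Rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1
```

For example, a perfect authenticator (no false accepts or rejects) scores 1.0 on all four metrics, while a balanced confusion matrix of equal counts scores 0.5 on all four.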
[0086] Performance
[0087] The system's overall performance was evaluated against the mimic attack. To launch a mimic attack, the adversary wears the in-ear device and issues the same voice command while mimicking the victim's way of speaking. In such an attack, the adversary tries to spoof the system by performing articulator gestures similar to the victim's. Table 2 summarizes the overall mean and median of the accuracy, recall, precision, and F1 score. It was observed that EarDynamic can achieve an overall accuracy of 93.04%, recall of 97.38%, precision of 95.02%, and F1 score of 96.84% across different environments and participants. Furthermore, the median accuracy, recall, precision, and F1 score are 93.97%, 98.78%, 95.40%, and 96.85%, respectively.
TABLE-US-00002
TABLE 2. Authentication accuracy
             Mean    Median  Standard Deviation
  Accuracy   0.9304  0.9397  0.0395
  Recall     0.9738  0.9878  0.0381
  Precision  0.9502  0.9540  0.0202
  F1 Score   0.9684  0.9685  0.0055
[0088] When applying classifier boosting techniques, the system utilizes multiple phonemes to boost classification accuracy. Next, the system's performance is studied using various numbers of phonemes as classifiers; the results are shown in
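One simple way to combine multiple per-phoneme classifiers, sketched below, is majority voting over their individual accept/reject decisions. The source does not specify the exact combination rule, so the voting scheme, the score representation, and the function name here are illustrative assumptions.

```python
def boosted_decision(phoneme_scores, threshold=0.5):
    """Combine per-phoneme classifier outputs into one decision.

    phoneme_scores: iterable of per-phoneme match scores in [0, 1];
    each score above `threshold` counts as one accept vote (assumed
    rule), and the user is accepted if a majority of phonemes vote
    to accept.
    """
    votes = [score > threshold for score in phoneme_scores]
    return sum(votes) > len(votes) / 2
```

Using more phonemes gives the combiner more votes, which is consistent with the intuition that additional phonemes boost classification accuracy.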
[0089] Moreover, the system's performance is studied when templates for authentication are generated using the dynamic deformation categorization based approach of EarDynamic versus a static geometry based approach. The static geometry based approach generates the template while the user is in a static scenario (i.e., not speaking). In this case, the generated template only represents the physiological characteristics of the ear canal. Using only the static geometry of the ear canal provides very good accuracy when the user is not speaking or moving the head. However, the user's ear canal becomes dynamic when he/she is issuing voice commands or talking over the phone. Such dynamics cause ear canal deformation and degrade authentication performance. Thus, an authentication system that only utilizes the static information of the ear canal may not be sufficient under dynamic scenarios.
[0090] As shown in
[0091] Moreover, as shown in
[0092] Performance Under Advanced Attack
[0093] Next, the performance of the system was evaluated under the advanced attack. To launch such an attack, the adversary leverages leaked static geometry information of the victim's ear canal and rebuilds the ear canal model. In this type of attack, the attacker can at best replicate the static geometry of the victim's ear canal and misses the dynamic deformation motion information produced when the user is speaking. Thus, to simulate the advanced attack, the participants were asked to wear the in-ear devices and replay their voice commands from other devices while keeping silent during the authentication process. As shown in
[0094] Impact of Different Environments
[0095] As people may wear in-ear devices at various locations, the system's performance is studied here in different environments. Six typical environments were chosen: living room, classroom, office, vehicle, grocery store, and park.
[0096] Impact of Head Posture
[0097] Different head postures also impact the ear canal deformation; thus, how these head postures impact the system is studied. The head postures considered are facing left, up, right, and down, with facing forward as the baseline. A motion sensor attached to the earbud is used to detect the head posture and decide which template to match. As shown in
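The posture-to-template lookup can be sketched as below. The source only states that a motion sensor detects the posture and selects the matching template; the yaw/pitch representation, the ±20° thresholds, and the function name are illustrative assumptions.

```python
def select_template(templates, yaw_deg, pitch_deg):
    """Map a motion-sensor reading to a head-posture label and return
    the corresponding enrolled template.

    templates: dict keyed by posture label ("left", "right", "up",
    "down", "forward"); yaw/pitch in degrees relative to facing
    forward (thresholds here are illustrative, not from the source).
    """
    if yaw_deg < -20:
        posture = "left"
    elif yaw_deg > 20:
        posture = "right"
    elif pitch_deg > 20:
        posture = "up"
    elif pitch_deg < -20:
        posture = "down"
    else:
        posture = "forward"  # baseline posture
    return templates[posture]
```

Enrolling one template per posture and selecting among them at authentication time keeps the comparison consistent with the deformation state the ear canal is actually in.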
[0098] Impact of Wearing Positions
[0099] People usually wear earbuds in their habitual position, but the relative position of the earbud with respect to the ear could be slightly different from time to time. Next, the impact of the in-ear device's wearing position on system performance was evaluated. Three wearing angles with respect to the normal wearing position were evaluated: 0°, 15°, and 30°. An additional 160 sentences were collected for the 15° and 30° cases, and 240 data points were used in total. The measured angle is toward the earlobe with respect to the original position, and 0° is used to establish the baseline performance. The results are shown in
[0100] Impact of Time
[0101] To evaluate the robustness of the system over time, some participants were asked to use the system over various time periods after the initial enrollment. The time periods considered are 1 day, 10 days, 30 days, and 120 days. An additional 140 sentences were sampled for the 30-day and 120-day periods, and 300 sentences were analyzed in total. As shown in
[0102] Impact of Body Motion
[0103] As different gestures are known to impact the ear canal, how different body motions affect the performance of the system is studied here. The motions chosen are daily activities including sitting, standing, walking, jogging, and squatting. For walking, jogging, and squatting, an additional 360 sentences were collected. As shown in
[0104] Study of Left and Right Ear
[0105] Next, the system's performance was studied across different ears. The participants were asked to enroll in the system using either the left or the right ear and then use the corresponding ear for authentication. The results are shown in
[0106] A user study was conducted based on the dataset to examine different factors including gender, accent, and age.
[0107] Gender. The impact of gender was studied. The 24 recruited participants included 12 females and 12 males. As shown in
[0108] Accent. Both native and non-native English speakers were recruited for the accent experiments. In particular, there are 4 native English speakers and 20 non-native English speakers from different countries. According to the
[0109] Indeed, compared with native English speakers, non-native speakers' pronunciations are less stable. For example, non-native speakers are prone to mispronouncing phonemes. Such inconsistency might impact the authentication and lower the system's performance.
[0110] Age. Can age affect system performance? The participants were divided into four age categories: 20-25, 25-30, 30-35, and 35-40.
[0111] It will be apparent to those skilled in the art that various modifications and variations can be made in the present disclosure without departing from the scope or spirit of the invention. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the methods disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
REFERENCES
[1] 2017. HUAWEI FreeBuds 3. https://consumer.huawei.com/en/audio/freebuds3/
[2] 2020. Wireless Bluetooth Sports Headphones. https://www.jabra.com/sports-headphones
[3] Aditya Abhyankar and Stephanie Schuckers. 2009. Integrating a wavelet based perspiration liveness check with fingerprint recognition. Pattern Recognition 42, 3 (2009), 452-464.
[4] Amazon. 2020. Amazon Echo Buds. https://www.amazon.com/Echo-Buds/dp/B07F6VM1S3/
[5] Takashi Amesaka, Hiroki Watanabe, and Masanori Sugimoto. 2019. Facial expression recognition using ear canal transfer function. In Proceedings of the 23rd International Symposium on Wearable Computers. 1-9.
[6] Apple. 2020. Apple AirPods Pro. https://www.apple.com/airpods-pro/
[7] Takayuki Arakawa, Takafumi Koshinaka, Shohei Yano, Hideki Irisawa, Ryoji Miyahara, and Hitoshi Imaoka. 2016. Fast and accurate personal authentication using ear acoustics. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 1-4.
[8] Hind Baqeel and Saqib Saeed. 2019. Face Detection Authentication on Smartphones: End Users Usability Assessment Experiences. In 2019 International Conference on Computer and Information Sciences (ICCIS). IEEE, 1-6.
[9] Abdelkareem Bedri, David Byrd, Peter Presti, Himanshu Sahni, Zehua Gue, and Thad Starner. 2015. Stick it in your ear: Building an in-ear jaw movement sensor. In Adjunct Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2015 ACM International Symposium on Wearable Computers. 1333-1338.
[10] Johan Carioli, Aidin Delnavaz, Ricardo J Zednik, and Jérémie Voix. 2017. Piezoelectric earcanal bending sensor. IEEE Sensors Journal 18, 5 (2017), 2060-2067.
[11] Phillip L De Leon, Michael Pucher, Junichi Yamagishi, Inma Hernaez, and Ibon Saratxaga. 2012. Evaluation of speaker verification security and detection of HMM-based synthetic speech. IEEE Transactions on Audio, Speech, and Language Processing 20, 8 (2012), 2280-2290.
[12] Aidin Delnavaz and Jérémie Voix. 2013. Energy harvesting for in-ear devices using ear canal dynamic motion. IEEE Transactions on Industrial Electronics 61, 1 (2013), 583-590.
[13] Jianjiang Feng, Anil K Jain, and Arun Ross. 2009. Fingerprint alteration. Submitted to IEEE TIFS (2009).
[14] Matteo Ferrara, Annalisa Franco, and Davide Maltoni. 2016. On the effects of image alterations on face recognition accuracy. In Face recognition across the imaging spectrum. Springer, 195-222.
[15] Yang Gao, Wei Wang, Vir V Phoha, Wei Sun, and Zhanpeng Jin. 2019. EarEcho: Using Ear Canal Echo for Wearable Authentication. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 3 (2019), 1-24.
[16] Valentin Goverdovsky, David Looney, Preben Kidmose, and Danilo P Mandic. 2015. In-ear EEG from viscoelastic generic earpieces: Robust and unobtrusive 24/7 monitoring. IEEE Sensors Journal 16, 1 (2015), 271-277.
[17] Malcolm J Grenness, Jon Osborn, and W Lee Weller. 2002. Mapping ear canal movement using area-based surface matching. The Journal of the Acoustical Society of America 111, 2 (2002), 960-971.
[18] Rosa González Hautamäki, Tomi Kinnunen, Ville Hautamäki, Timo Leino, and Anne-Maria Laukkanen. 2013. I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry. In Interspeech. 930-934.
[19] Mandieh Izadpanahkakhk, Seyyed Mohammad Razavi, Mehran Taghipour-Gorjikolaie, Seyyed Hamid Zahiri, and Aurelio Uncini. 2018. Deep region of interest and feature extraction models for palmprint verification using convolutional neural networks transfer learning. Applied Sciences 8, 7 (2018), 1210.
[20] Amin Jalali, Rommohan Mallipeddi, and Minho Lee. 2015. Deformation invariant and contactless palmprint recognition using convolutional neural network. In Proceedings of the 3rd International Conference on Human-Agent Interaction. 209-212.
[21] Artur Janicki, Federico Alegre, and Nicholas Evans. 2016. An assessment of automatic speaker verification vulnerabilities to replay spoofing attacks. Security and Communication Networks 9, 15 (2016), 3030-3044.
[22] Tomi Kinnunen, Md Sahidullah, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, Junichi Yamagishi, and Kong Aik Lee. 2017. The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection. (2017).
[23] Andreas Kipp, M-B Wesenick, and Florian Schiel. 1996. Automatic detection and segmentation of pronunciation variants in German speech corpora. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), Vol. 1. IEEE, 106-109.
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Vol. 5, No. 1, Article 39. Publication date: March 2021. EarDynamic: An Ear Canal Deformation Based Continuous User Authentication.
[24] Thomas Kisler, Florian Schiel, and Han Sloetjes. 2012. Signal processing via web services: the use case WebMAUS. In Digital Humanities Conference 2012.
[25] Rajamani Santhosh Kumar, K R Jothi Kumar, D Saravana Bhavan, and A Anandaraj. [n.d.]. Variations in the External Auditory Canal of 185 Adult Individuals: A Clinico-Morphological Study. International Journal of Scientific and Research Publications ([n.d.]), 305.
[26] Nianfeng Liu, Man Zhang, Haiqing Li, Zhenan Sun, and Tieniu Tan. 2016. DeepIris: Learning pairwise filter bank for heterogeneous iris verification. Pattern Recognition Letters 82 (2016), 154-161.
[27] Davide Maltoni, Dario Maio, Anil K Jain, and Salil Prabhakar. 2009. Handbook of fingerprint recognition. Springer Science & Business Media.
[28] Emanuela Marasco and Arun Ross. 2014. A survey on antispoofing schemes for fingerprint recognition systems. ACM Computing Surveys (CSUR) 47, 2 (2014), 1-36.
[29] Gian Luca Marcialis, Fabio Roli, and Alessandra Tidu. 2010. Analysis of fingerprint pores for vitality detection. In 2010 20th International Conference on Pattern Recognition. IEEE, 1289-1292.
[30] Amir Mohammadi, Sushil Bhattacharjee, and Sébastien Marcel. 2017. Deeply vulnerable: a study of the robustness of face recognition to presentation attacks. IET Biometrics 7, 1 (2017), 15-26.
[31] Jang-Ho Park, Dae-Geun Jang, Jung Wook Park, and Se-Kyoung Youm. 2015. Wearable sensing of in-ear pressure for heart rate monitoring with a piezoelectric sensor. Sensors 15, 9 (2015), 23402-23417.
[32] Chester Pirzanski and Brenda Berge. 2005. Ear canal dynamics: Facts versus perception. The Hearing Journal 58, 10 (2005), 50-52.
[33] Kiran B Raja, Ramachandra Raghavendra, Vinay Krishna Vemuri, and Christoph Busch. 2015. Smartphone based visible iris recognition using deep sparse filtering. Pattern Recognition Letters 57 (2015), 33-42.
[34] Daniel M Rasetshwane and Stephen T Neely. 2011. Inverse solution of ear-canal area function from reflectance. The Journal of the Acoustical Society of America 130, 6 (2011), 3873-3881.
[35] Mosur K Ravishankar. 1996. Efficient Algorithms for Speech Recognition. Technical Report. Carnegie Mellon University, Pittsburgh, PA, Dept. of Computer Science.
[36] U Rosenhall. 1996. The Human Ear Canal: Theoretical Considerations and Clinical Applications including Cerumen Management by B B Ballachandra. Journal of Audiological Medicine 5 (1996), 176-177.
[37] Robert E Schapire and Yoav Freund. 2013. Boosting: Foundations and algorithms. Kybernetes (2013).
[38] Ulrich Scherhag, Christian Rathgeb, Johannes Merkle, Ralph Breithaupt, and Christoph Busch. 2019. Face recognition systems under morphing attacks: A survey. IEEE Access 7 (2019), 23012-23026.
[39] Florian Schiel, Christoph Draxler, and Jonathan Harrington. 2011. Phonemic segmentation and labelling using the MAUS technique. (2011).
[40] Stephanie A C Schuckers. 2002. Spoofing and anti-spoofing measures. Information Security Technical Report 7, 4 (2002), 56-62.
[41] Chao Shen, Yuanxun Li, Yufei Chen, Xiaohong Guan, and Roy A Maxion. 2017. Performance analysis of multi-motion sensor behavior for active smartphone authentication. IEEE Transactions on Information Forensics and Security 13, 1 (2017), 48-62.
[42] Kaavya Sriskandaraja, Vidhyasaharan Sethu, Eliathamby Ambikairajah, and Haizhou Li. 2016. Front-end for antispoofing countermeasures in speaker verification: Scattering spectral decomposition. IEEE Journal of Selected Topics in Signal Processing 11, 4 (2016), 632-643.
[43] Michael R Stinson and B W Lawton. 1989. Specification of the geometry of the human ear canal for the prediction of sound-pressure level distribution. The Journal of the Acoustical Society of America 85, 6 (1989), 2492-2503.
[44] Boudewijn Venema, Johannes Schiefer, Vladimir Blazek, Nikolai Blanik, and Steffen Leonhardt. 2013. Evaluating innovative in-ear pulse oximetry for unobtrusive cardiovascular and pulmonary monitoring during sleep. IEEE Journal of Translational Engineering in Health and Medicine 1 (2013), 2700208.
[45] Rudolf M Verdaasdonk and Niels Liberton. 2019. The iPhone X as 3D scanner for quantitative photography of faces for diagnosis and treatment follow-up (Conference Presentation). In Optics and Biophotonics in Low-Resource Settings V, Vol. 10869. International Society for Optics and Photonics, 1086902.
[46] Jérémie Voix. 2017. The ear beyond hearing: From smart earplug to in-ear brain computer interfaces. In Proc. 24th Int. Congr. Sound Vibrat. (ICSV). 1-11.
[47] Chen Wang, Junping Zhang, Jian Pu, Xiaoru Yuan, and Liang Wang. 2010. Chrono-gait image: A novel temporal template for gait recognition. In European Conference on Computer Vision. Springer, 257-270.
[48] Mao-Che Wang, Chia-Yu Liu, An-Suey Shiao, and Tyrone Wang. 2005. Ear problems in swimmers. Journal of the Chinese Medical Association 68, 8 (2005), 347-352.
[49] Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilçi, Md Sahidullah, and Aleksandr Sizov. 2015. ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge. In Sixteenth Annual Conference of the International Speech Communication Association.