Wearable electronic device for emitting a masking signal

20210104222 · 2021-04-08

    Abstract

    A signal processing method and a wearable electronic device, such as a headphone or an earphone, comprising: a microphone arranged to pick up an acoustic signal and convert the acoustic signal to a microphone signal (x); a loudspeaker arranged in an earpiece; and a processor configured to control the volume of a masking signal (m) and supply the masking signal (m) to the loudspeaker. The processor is further configured to detect voice activity and generate a voice activity signal (y) which is, concurrently with the microphone signal, sequentially indicative of one or more of: voice activity and voice in-activity; and to control the volume of the masking signal (m) in response to the voice activity signal (y) by supplying the masking signal (m) to the loudspeaker at a first volume at times when the voice activity signal (y) is indicative of voice activity and at a second volume at times when the voice activity signal (y) is indicative of voice in-activity.

    Claims

    1. A wearable electronic device comprising: an electro-acoustic input transducer arranged to pick up an acoustic signal and convert the acoustic signal to a microphone signal (x); a loudspeaker; and a processor configured to: control the volume of a masking signal (m); and supply the masking signal (m) to the loudspeaker; wherein the processor is further configured to: based on processing at least the microphone signal (x), detect voice activity and generate a voice activity signal (y) which is, concurrently with the microphone signal, sequentially indicative of one or more of: voice activity and voice in-activity; and control the volume of the masking signal (m) in response to the voice activity signal (y) in accordance with supplying the masking signal (m) to the loudspeaker at a first volume at times when the voice activity signal (y) is indicative of voice activity and at a second volume at times when the voice activity signal (y) is indicative of voice in-activity.

    2. A wearable device according to claim 1, wherein the processor is configured with one or both of: an audio player to generate the masking signal by playing an audio track; and an audio synthesizer to generate the masking signal using one or more signal generators.

    3. A wearable device according to claim 1, wherein the processor is configured to include a machine learning component to generate the voice activity signal (y); wherein the machine learning component is configured to indicate periods of time in which the microphone signal (x) comprises: signal components representing voice activity, or signal components representing voice activity and signal components representing noise, which is different from voice activity.

    4. A wearable device according to claim 1, wherein a machine learning component is configured to detect the voice activity based on processing time-domain waveforms of the microphone signal (x).

    5. A wearable device according to claim 1, wherein the processor is configured to: concurrently with reception of the microphone signal: generate frames comprising a frequency-time representation (X) of waveforms of the microphone signal (x); wherein the frames comprise values arranged in frequency bins; and wherein the processor comprises a machine learning component configured to detect the voice activity based on processing the frames including the frequency-time representation of waveforms of the microphone signal (x).

    6. A wearable device according to claim 4, wherein the machine learning component is configured to generate the voice activity signal (y) in accordance with a frequency-time representation comprising values arranged in frequency bins in a frame; wherein the processor controls the masking signal (m) in accordance with a time and frequency distribution of the envelope of the masking signal substantially matching the voice activity signal or the envelope of the voice activity signal, which is in accordance with the frequency-time representation.

    7. A wearable device according to claim 1, wherein the processor is configured to: gradually increase the volume of the masking signal (m) over time in response to detecting an increasing frequency or density of voice activity.

    8. A wearable device according to claim 1, wherein the processor is configured with: a mixer to generate the masking signal from one or more selected intermediate masking signals from multiple intermediate masking signals; wherein selection of the one or more selected intermediate masking signals is performed in accordance with a criterion based on one or both of: the microphone signal and the voice activity signal.

    9. A wearable device according to claim 1, wherein the processor is configured with: a gain stage, configured with a trigger for attack amplitude modulation of an intermediate masking signal and a trigger for decay amplitude modulation of the intermediate masking signal; wherein the gain stage is triggered to perform attack amplitude modulation of the intermediate masking signal in response to detecting a transition from voice in-activity to voice activity and to perform decay amplitude modulation of the intermediate masking signal in response to detecting a transition from voice activity to voice in-activity.

    10. A wearable device according to claim 1, wherein the processor is configured with: an active noise cancellation unit to process the microphone signal (x) and supply an active noise cancellation signal (q) to the loudspeaker; and a mixer to mix the active noise cancellation signal (q) and the masking signal (m) into a signal for the loudspeaker.

    11. A wearable device according to claim 1, wherein the processor is configured to selectively operate in a first mode or a second mode; wherein, in the first mode, the processor controls the volume of the masking signal (m) supplied to the loudspeaker; and wherein, in the second mode, the processor: forgoes supplying the masking signal (m) to the loudspeaker at the first volume irrespective of the voice activity signal (y) being indicative of voice activity.

    12. A wearable device according to claim 1, wherein the electro-acoustic input transducer is a first microphone outputting a first microphone signal (x); and wherein the wearable device comprises: a second microphone outputting a second microphone signal (x′); and a beam-former coupled to receive the first microphone signal (x) or a third microphone signal from a third microphone and the second microphone signal (x′) and to generate a beam-formed signal.

    13. A signal processing method at a wearable electronic device comprising: an electro-acoustic input transducer arranged to pick up an acoustic signal and convert the acoustic signal to a microphone signal (x); a loudspeaker; and a processor performing: controlling the volume of a masking signal (m); and supplying the masking signal (m) to the loudspeaker; detecting voice activity, based on processing at least the microphone signal (x), and generating a voice activity signal (y) which is, concurrently with the microphone signal, sequentially indicative of one or more of: voice activity and voice in-activity; and controlling the volume of the masking signal (m) in response to the voice activity signal (y) in accordance with supplying the masking signal (m) to the loudspeaker at a first volume at times when the voice activity signal (y) is indicative of voice activity and at a second volume at times when the voice activity signal (y) is indicative of voice in-activity.

    14. A signal processing module for a headphone or earphone configured to perform the method according to claim 13.

    15. A computer-readable medium comprising instructions for performing the method according to claim 13 when run by a processor at a wearable electronic device comprising: an electro-acoustic input transducer arranged to pick up an acoustic signal and convert the acoustic signal to a microphone signal (x); and a loudspeaker.

    Description

    BRIEF DESCRIPTION OF THE FIGURES

    [0135] A more detailed description follows below with reference to the drawings, in which:

    [0136] FIG. 1 shows a wearable electronic device embodied as a headphone or as a pair of earphones and a block diagram of the wearable device;

    [0137] FIG. 2 shows a module, for generating a masking signal, comprising an audio player;

    [0138] FIG. 3 shows a module, for generating a masking signal, comprising an audio synthesizer;

    [0139] FIG. 4 shows a spectrogram of a microphone signal and a spectrogram of a corresponding voice activity signal;

    [0140] FIG. 5 shows a gain stage, configured with a trigger for amplitude modulation of a masking signal; and

    [0141] FIG. 6 shows a block diagram of a wearable device with a headphone mode and a headset mode.

    DETAILED DESCRIPTION

    [0142] FIG. 1 shows a wearable electronic device embodied as a headphone or as a pair of earphones and a block diagram of the wearable device.

    [0143] The headphone 101 comprises a headband 104 carrying a left earpiece 102 and a right earpiece 103 which may also be designated earcups. The pair of earphones 116 comprises a left earpiece 115 and a right earpiece 117.

    [0144] The earpieces comprise at least one loudspeaker 105, e.g. a loudspeaker in each earpiece. The headphone 101 also comprises at least one microphone 106 in an earpiece. As described further below, the headphone or pair of earphones may include a processor configured with a selectable headset mode in which masking is disabled or significantly reduced.

    [0145] The block diagram of the wearable device shows an electro-acoustic input transducer in the form of a microphone 106 arranged to pick up an acoustic signal and convert the acoustic signal to a microphone signal x, a loudspeaker 105, and a processor 107. The microphone signal may be a digital signal or may be converted into a digital signal by the processor. The loudspeaker 105 and the microphone 106 are collectively designated electro-acoustic transducer elements 114. The electro-acoustic transducer elements 114 of the wearable electronic device may comprise at least one loudspeaker in a left hand side earpiece and at least one loudspeaker in a right hand side earpiece. The electro-acoustic transducer elements 114 may also comprise one or more microphones arranged in one or both of the left hand side earpiece and the right hand side earpiece. Microphones may be arranged differently in the right hand side earpiece than in the left hand side earpiece.

    [0146] The processor 107 comprises a voice activity detector VAD, 108, outputting a voice activity signal, y, which may be a time-domain voice activity signal or a frequency-time domain voice activity signal. The voice activity signal, y, is received by a gain stage G, 110, which sets a gain factor in response to the voice activity signal. The gain stage may have two or more gain factors that are selectively set in response to the voice activity signal. The gain stage G, 110 may also be controlled in response to the microphone signal, e.g. via a filter or a circuit enabling adaptive gain control of the masking signal in a feed-forward or feedback configuration. The masking signal, m, may be generated by a masking signal generator 109, which may also be controlled by the voice activity signal, y. The masking signal, m, may be supplied to the loudspeaker 105 via a mixer 113, which mixes the masking signal, m, and a noise reduction signal, q. The noise reduction signal is provided by a noise reduction unit ANC, 112. The noise reduction unit ANC, 112 may receive the microphone signal, x, from the microphone 106 and/or another microphone signal from another microphone arranged at a different position in the headphone or earphone than the microphone 106. The masking signal generator 109, the voice activity detector 108 and the gain stage 110 may be included in a signal processing module 111.
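
The FIG. 1 signal flow can be illustrated with a minimal per-block sketch in Python. It assumes block-based processing and substitutes a simple energy-threshold detector for the voice activity detector 108 (the description also contemplates a machine learning detector). The function names, the threshold value and the sign-inverting placeholder for the ANC unit 112 are hypothetical; they only show where each signal (x, y, m, q) enters the chain.

```python
from typing import Callable, List

Block = List[float]


def energy_vad(x: Block, threshold: float = 1e-3) -> bool:
    """Hypothetical stand-in for VAD 108: True when the block's mean-square
    energy exceeds a fixed threshold."""
    return sum(s * s for s in x) / len(x) > threshold


def placeholder_anc(x: Block) -> Block:
    """Hypothetical stand-in for ANC unit 112: simply inverts the microphone
    block. A practical ANC unit uses adaptive filtering instead."""
    return [-s for s in x]


def process_block(
    x: Block,
    masking_generator: Callable[[int], Block],
    anc_unit: Callable[[Block], Block] = placeholder_anc,
    gain_active: float = 1.0,
    gain_inactive: float = 0.0,
) -> Block:
    """One pass through the FIG. 1 chain for a single microphone block x."""
    y = energy_vad(x)                             # voice activity signal y (VAD 108)
    g = gain_active if y else gain_inactive       # gain stage G, 110
    m = masking_generator(len(x))                 # masking signal generator 109
    q = anc_unit(x)                               # noise reduction signal q (ANC 112)
    return [g * mi + qi for mi, qi in zip(m, q)]  # mixer 113
```

With `gain_inactive=0.0` the masker is not supplied at all during voice in-activity; a small non-zero value would instead give a reduced second volume.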

    [0147] Thus, the processor 107 is configured to detect voice activity in the microphone signal and generate a voice activity signal, y, which is sequentially indicative of one or more of: voice activity and voice in-activity. Further, the processor 107 is configured to control the volume of the masking signal, m, in response to the voice activity signal, y, in accordance with supplying the masking signal, m, to the loudspeaker 105 at a first volume at times when the voice activity signal, y, is indicative of voice activity and at a second volume at times when the voice activity signal, y, is indicative of voice in-activity. The first volume may be controlled in response to the energy level or envelope of the microphone signal or the energy level or envelope of the voice activity signal. The second volume may be obtained by not supplying the masking signal to the loudspeaker at all, or by controlling the volume to be about 10 dB or more below the level of the microphone signal.
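
The two volume settings can be expressed as level offsets relative to the microphone signal. The sketch below is a minimal illustration, assuming RMS level as the energy measure and a hypothetical 0 dB offset for the first volume; the −10 dB default for the second volume follows the "about 10 dB below the microphone signal" figure above, and muting is offered as the alternative. A 10 dB reduction corresponds to an amplitude factor of 10^(−10/20) ≈ 0.316.

```python
import math
from typing import Sequence


def rms(block: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in block) / len(block))


def db_to_amplitude(db: float) -> float:
    """Convert a level difference in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def masking_gain(
    mic_block: Sequence[float],
    masker_block: Sequence[float],
    voice_active: bool,
    first_offset_db: float = 0.0,      # hypothetical: first volume tracks the mic level
    second_offset_db: float = -10.0,   # "about 10 dB below the microphone signal"
    mute_when_inactive: bool = False,  # alternative: do not supply the masker at all
) -> float:
    """Gain that places the masker's RMS level at the chosen offset from the mic RMS."""
    if not voice_active and mute_when_inactive:
        return 0.0
    offset_db = first_offset_db if voice_active else second_offset_db
    masker_level = rms(masker_block)
    if masker_level == 0.0:
        return 0.0  # a silent masker cannot be scaled to any target level
    return rms(mic_block) * db_to_amplitude(offset_db) / masker_level
```

For a microphone block at RMS 0.1 and a masker block at RMS 0.2, the gain is 0.5 during voice activity and about 0.158 during voice in-activity.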

    [0148] There is also shown a chart 118 illustrating that the gain factor of the gain stage G, 110 is relatively high when the voice activity signal is indicative of voice activity (va) and relatively low when the voice activity signal is indicative of voice in-activity (vi-a). The gain factor may be controlled in two or more steps.
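
Controlling the gain factor in two or more steps can be sketched as a lookup from a multilevel voice activity value into a small table of gain steps. This assumes, hypothetically, that the detector output is a confidence value in [0, 1]; the thresholds and gain values below are illustrative and are not taken from the description.

```python
from bisect import bisect_right
from typing import Sequence


def stepped_gain(
    vad_level: float,
    thresholds: Sequence[float] = (0.3, 0.6),  # hypothetical step boundaries
    gains: Sequence[float] = (0.0, 0.5, 1.0),  # hypothetical linear gain per step
) -> float:
    """Map a voice activity value to one of len(thresholds) + 1 gain steps.

    Values below thresholds[0] select gains[0]; a value equal to a threshold
    falls into the higher step.
    """
    if len(gains) != len(thresholds) + 1:
        raise ValueError("need exactly one more gain step than thresholds")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    return gains[bisect_right(thresholds, vad_level)]
```

With a single threshold, e.g. `thresholds=(0.5,)` and `gains=(0.0, 1.0)`, this reduces to the two-level va/vi-a behaviour shown in chart 118.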

    [0149] FIG. 2 shows a module, for generating a masking signal, comprising an audio player. The module 111 comprises the voice activity detector 108, an audio player 201 and the gain stage G, 110. The audio player 201 is configured to play an embedded audio track 202 or an external audio track 203. The audio tracks 202 or 203 may comprise encoded audio samples, and the player may be configured with a decoder for generating an audio signal from the encoded audio samples. An advantage of the embedded audio track 202 is that the wearable device may be configured with the audio track once, or in response to predefined events. The embedded audio track may then be played without requiring a wired or wireless connection to remote servers or other electronic devices; this, in turn, may save battery power in battery-operated wearable devices. An advantage of an external audio track 203 is that the content of the track may be changed in accordance with preferences or predefined events. The voice activity detector 108 may send a signal y′ to the player 201. The signal y′ may communicate a ‘play’ command upon detection of voice activity and a ‘stop’ or ‘pause’ command upon detection of voice inactivity.
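
The play/pause behaviour of signal y′ can be modelled as a small edge detector that issues a command only when voice activity starts or stops, so that the player 201 is not sent repeated commands while the state is unchanged. The class and command names are hypothetical. Choosing ‘pause’ rather than ‘stop’ on voice inactivity is an illustrative design choice that lets the track resume where it left off.

```python
from enum import Enum
from typing import Optional


class Command(Enum):
    PLAY = "play"
    PAUSE = "pause"


class PlayerControl:
    """Hypothetical model of signal y' from VAD 108 to audio player 201."""

    def __init__(self) -> None:
        self._voice_active = False

    def update(self, voice_active: bool) -> Optional[Command]:
        """Return a command on a voice activity transition, otherwise None."""
        if voice_active == self._voice_active:
            return None
        self._voice_active = voice_active
        return Command.PLAY if voice_active else Command.PAUSE
```

Feeding the per-frame detector decisions False, True, True, False yields None, PLAY, None, PAUSE.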

    [0150] FIG. 3 shows a module, for generating a masking signal, comprising an audio synthesizer. The module 111 comprises the voice activity detector 108, an audio synthesizer 301 and the gain stage G, 110. The synthesizer 301 may generate the masking signal in accordance with parameters 302. The parameters 302 may be defined by hardware or software and may in some embodiments be selected in accordance with the voice activity signal, y. The synthesizer 301 comprises one or more tone generators 305, 306, each generating one or more tones, coupled to respective modulators 303, 304, which may modulate the dynamics of the signals from the tone generators 305, 306. The modulators 303, 304 may operate in accordance with the parameters 302. The modulators 303, 304 output intermediate masking signals, m″ and m′″, which are input to a mixer 307, which mixes the intermediate masking signals to provide the masking signal, m′, to the gain stage 110. Modulation of the dynamics of the signals from the tone generators 305, 306 may change the envelope of these signals.
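
A minimal sketch of the FIG. 3 synthesizer, assuming sine-wave tone generators, raised-cosine amplitude modulation as the modulators and a weighted sum as the mixer 307. The `ToneParams` fields stand in for the parameters 302 and are hypothetical, as is the 16 kHz sampling rate.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ToneParams:
    """Hypothetical stand-in for parameters 302, one set per generator/modulator pair."""
    tone_hz: float     # frequency of the tone generator (305 or 306)
    mod_hz: float      # rate of the modulator's envelope (303 or 304)
    mod_depth: float   # 0.0 = constant envelope, 1.0 = envelope dips to zero
    mix_weight: float  # weight of this intermediate masking signal in mixer 307


def synthesize(params: List[ToneParams], n: int, fs: float = 16000.0) -> List[float]:
    """Return n samples of the mixed masking signal m'."""
    out = [0.0] * n
    for p in params:
        for i in range(n):
            t = i / fs
            tone = math.sin(2.0 * math.pi * p.tone_hz * t)
            # Raised-cosine envelope: starts at 1.0 and swings down to 1.0 - mod_depth.
            env = 1.0 - p.mod_depth * 0.5 * (1.0 - math.cos(2.0 * math.pi * p.mod_hz * t))
            out[i] += p.mix_weight * env * tone  # intermediate signal summed in the mixer
    return out
```

Because each envelope stays within [1 − mod_depth, 1], the peak of the mixed signal cannot exceed the sum of the mix weights, which keeps the level handed to the gain stage 110 predictable.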

    [0151] Although volume control is described with respect to the gain stage G, 110, it should be noted that volume control may be achieved in other ways, e.g. by controlling the modulation or generation of the content of the masking signal itself.

    [0152] FIG. 4 shows a spectrogram of a microphone signal and a spectrogram of a corresponding voice activity signal. Generally, a spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. The spectrograms are shown along a time axis (horizontal) and a frequency axis (vertical). The spectrograms, shown as illustrative examples, span a frequency range of about 0 to 8000 Hz and a time period of about 0 to 10 seconds.
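
The frame representation behind such spectrograms (frames of values arranged in frequency bins) can be sketched as a short-time Fourier transform. This plain-Python sketch assumes a Hann window, 50 % overlap and a naive DFT for readability; a practical implementation would use an FFT. A display range of 0 to 8000 Hz is consistent with a 16 kHz sampling rate (8000 Hz being the Nyquist frequency), but that rate is an assumption and is not stated in the description.

```python
import cmath
import math
from typing import List


def stft_db(x: List[float], frame_len: int = 256, hop: int = 128) -> List[List[float]]:
    """Return one list of frame_len // 2 + 1 log-magnitude bins (in dB) per frame."""
    # Periodic Hann window to taper each frame before the transform.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # bins from 0 Hz up to fs / 2
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len))
            bins.append(20.0 * math.log10(abs(acc) + 1e-12))  # small floor avoids log10(0)
        frames.append(bins)
    return frames


def bin_frequency(k: int, frame_len: int, fs: float) -> float:
    """Centre frequency in Hz of bin k."""
    return k * fs / frame_len
```

With fs = 16000 Hz and frame_len = 256, each bin spans 62.5 Hz and each hop of 128 samples advances the spectrogram by 8 ms.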

    [0153] The spectrogram 401 (left hand side panel) of the microphone signal comprises a first area 403 in which signal energy is distributed across a broad range of frequencies and occurs at about 2-3 seconds. This signal energy is in a range up to 0 dB and originates mainly from keypresses on a keyboard.

    [0154] A second area 404 contains signal energy in a range below about −20 dB, distributed across a broad range of frequencies and occurring at about 4-6 seconds. This signal energy originates mainly from indistinguishable noise sources, sometimes denoted background noise.

    [0155] A third area represents the presence of speech in the microphone signal and comprises a first portion 407, which represents the most dominant portion of the speech at lower frequencies, and a second portion 405, which represents less dominant portions of the speech spread across a broader range of higher frequencies. The speech occurs at about 7-8 seconds.

    [0156] Output of a voice activity detector (e.g. voice activity detector 108) is shown in the spectrogram 402 (right hand side panel). It can be seen that the output of the voice activity detector is likewise located at about 7-8 seconds. The level of the output of the voice activity detector corresponds to the energy level of the speech signal, with a more dominant portion 408 at lower frequencies and a less dominant portion 406 spread across a broader range of higher frequencies.

    [0157] Output of a voice activity detector is thus shown as a spectrogram in accordance with a corresponding frame representation. The output of the voice activity detector is used to control the volume of the masking signal and, optionally, to generate the content of the masking signal in accordance with a desired spectral distribution. The output of a voice activity detector may also be reduced to a one-dimensional binary or multilevel time-domain signal without a spectral decomposition.
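
Reducing a frequency-time voice activity output to a one-dimensional time-domain signal can be done frame by frame. The sketch below assumes per-bin voice activity values in [0, 1]. The binary variant flags a frame when any bin reaches the threshold, since speech energy may be concentrated in a few low-frequency bins; the multilevel variant quantizes the fraction of bins at or above the threshold. Both reduction rules and the threshold are hypothetical choices.

```python
from typing import List, Sequence


def reduce_binary(vad_tf: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """1 for frames where any frequency bin reaches the threshold, else 0."""
    return [1 if max(frame) >= threshold else 0 for frame in vad_tf]


def reduce_multilevel(
    vad_tf: Sequence[Sequence[float]], threshold: float = 0.5, levels: int = 4
) -> List[int]:
    """Quantize the fraction of active bins per frame into levels 0..levels-1."""
    out = []
    for frame in vad_tf:
        fraction = sum(1 for v in frame if v >= threshold) / len(frame)
        out.append(min(levels - 1, int(fraction * levels)))
    return out
```

Either output can then drive a per-frame gain, e.g. through a stepped gain table, in place of the full frequency-time representation.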

    [0158] FIG. 5 shows a gain stage 501, configured with a trigger for amplitude modulation of a masking signal. This embodiment is an example of how to enable adapting the masking signal to obtain a desired fade-in and/or fade-out of the masking signal, m, based on the voice activity signal, y.

    [0159] A first trigger unit 505 detects commencement of voice activity, e.g. by a threshold, and activates a fade-in modulation characteristic 503. The modulator 502 applies the fade-in modulation characteristic 503 for modulation of the intermediate masking signal m″ to generate another intermediate masking signal, m′, which is supplied to the gain stage G, 110.

    [0160] A second trigger unit 506 detects termination or abatement of a period of voice activity, e.g. by a threshold, and activates a fade-out modulation characteristic 504. The modulator 502 applies the fade-out modulation characteristic 504 for modulation of the intermediate masking signal m″ to generate another intermediate masking signal, m′, which is supplied to the gain stage G, 110.

    [0161] Thereby, artefacts in the masking signal may be reduced.
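
The trigger-and-fade behaviour of FIG. 5 can be sketched as a per-sample envelope applied to the intermediate masking signal m″. This assumes a scalar voice activity level per sample, threshold triggers with hysteresis (the on threshold above the off threshold, so that small fluctuations do not retrigger) and linear fade characteristics; all thresholds and ramp lengths are hypothetical. Ramping the gain rather than switching it avoids abrupt level steps, which can be audible as clicks.

```python
from typing import List, Sequence


def fade_envelope(
    vad_level: Sequence[float],
    on_threshold: float = 0.6,   # hypothetical trigger level for unit 505
    off_threshold: float = 0.4,  # hypothetical trigger level for unit 506
    attack_samples: int = 4,     # length of the fade-in characteristic 503
    decay_samples: int = 8,      # length of the fade-out characteristic 504
) -> List[float]:
    """Return a gain envelope in [0, 1], one value per input sample."""
    if attack_samples <= 0 or decay_samples <= 0:
        raise ValueError("ramp lengths must be positive")
    up, down = 1.0 / attack_samples, 1.0 / decay_samples
    env, active, out = 0.0, False, []
    for v in vad_level:
        if not active and v >= on_threshold:  # first trigger: voice activity commences
            active = True
        elif active and v <= off_threshold:   # second trigger: voice activity ends
            active = False
        env = min(1.0, env + up) if active else max(0.0, env - down)
        out.append(env)
    return out


def apply_fades(m2: Sequence[float], envelope: Sequence[float]) -> List[float]:
    """Modulator 502: m' = envelope * m''."""
    return [e * s for e, s in zip(envelope, m2)]
```

A shorter attack than decay, as in the defaults, brings the masker in quickly when speech starts and lets it fade out more gently afterwards.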

    [0162] FIG. 6 shows a block diagram of a wearable device with a headphone mode and a headset mode. The block diagram corresponds in some aspects to the block diagram described above, but further includes elements comprised by headset block 601 related to enabling a headset mode. Further, there is provided a selector 605 for selectively enabling the headset mode or the headphone mode. The selector 605 may route either the masking signal, m, or a headset signal, f, to the loudspeaker 105. The selector may engage or disengage other elements of the processor. The headset block 601 may comprise a beamformer 602 which receives the microphone signal, x, from the microphone 106 and another microphone signal, x′, from another microphone 106′. The beamformer may be a broadside beamformer, an endfire beamformer or an adaptive beamformer. A beamformed signal is output from the beamformer and provided to a transceiver 604 providing wired or wireless communication with an electronic communications device 606 such as a mobile telephone or a computer.
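
As an illustration of the beamformer 602, a two-microphone delay-and-sum beamformer can be sketched in a few lines. It assumes, hypothetically, that microphone 106 (signal x) is closer to the user's mouth than microphone 106′ (signal x′) along the microphone axis, and it uses a whole-sample delay. A delay of zero gives a broadside beam; a delay matching the acoustic travel time between the microphones steers an endfire beam towards the mouth. The 343 m/s speed of sound is a nominal room-temperature value, and the microphone spacing is a free parameter.

```python
from typing import List, Sequence


def endfire_delay(spacing_m: float, fs: float, c: float = 343.0) -> int:
    """Travel time between the microphones, rounded to whole samples."""
    return round(spacing_m / c * fs)


def delay_and_sum(x: Sequence[float], x2: Sequence[float], delay_samples: int) -> List[float]:
    """Delay the front microphone signal x and average it with x2.

    Sound from the front reaches x first, so delaying x by the travel time
    aligns both copies and they add coherently; sound from behind stays
    misaligned and is attenuated at most frequencies.
    """
    if delay_samples < 0:
        raise ValueError("delay must be non-negative")
    if len(x) != len(x2):
        raise ValueError("microphone blocks must have equal length")
    out = []
    for n in range(len(x2)):
        front = x[n - delay_samples] if n >= delay_samples else 0.0
        out.append(0.5 * (front + x2[n]))
    return out
```

An impulse arriving from the front passes at full amplitude, while the same impulse arriving from behind is split into two half-amplitude copies.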

    [0163] Generally, it should be noted that the headphone or earphone may include elements for playing back music as is known in the art. In this connection, music playback for the purpose of listening to the music may be implemented by selecting a mode which disables the voice activity controlled masking described above.
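
The mode handling of FIG. 6, together with a music playback mode, can be sketched as a selector that decides which signal reaches the loudspeaker and whether voice activity controlled masking runs at all. The mode names and the three-way split are hypothetical; the description only requires that the headset mode and the music mode disable, or significantly reduce, the masking.

```python
from enum import Enum, auto
from typing import List, Sequence


class Mode(Enum):
    HEADPHONE = auto()  # voice activity controlled masking signal m
    HEADSET = auto()    # headset signal f from the communications device
    MUSIC = auto()      # music playback; masking disabled


def masking_enabled(mode: Mode) -> bool:
    """Only the headphone mode runs the voice activity controlled masking."""
    return mode is Mode.HEADPHONE


def select_output(
    mode: Mode,
    masking: Sequence[float],
    headset: Sequence[float],
    music: Sequence[float],
) -> List[float]:
    """Hypothetical stand-in for selector 605, extended with a music mode."""
    sources = {Mode.HEADPHONE: masking, Mode.HEADSET: headset, Mode.MUSIC: music}
    return list(sources[mode])
```

Checking `masking_enabled` before running the detector also lets the processor skip voice activity detection entirely outside the headphone mode.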

    [0164] Generally, it should be appreciated that the person skilled in the art may perform experiments, surveys and measurements to obtain appropriate volume levels for the masking signal. Also, experiments, surveys and measurements may be needed to avoid introducing audible or disturbing artefacts from (non-linear) signal processing associated with the masking signal.