Audio analysis and processing system
11601764 · 2023-03-07
Assignee
Inventors
- Benjamin D. Benattar (Cranbury, NJ, US)
- Alexander Khusidman (Jenkintown, PA, US)
- Christopher A. Magner (Lancaster, PA, US)
- Oya Gumustop Yuksel (Lower Gwynedd, PA, US)
CPC classification (all in Section H: ELECTRICITY)
- H04R2225/43
- H04W4/80
- H04R25/407
- H04R25/554
- H04R25/43
International classification
Abstract
An audio analysis and processing system with a processor configured with an audio array input thread connected to a plurality of audio input channels, each corresponding to an audio input sensor. An audio input sensor may be positionally related to the positions of the other audio input sensors, and a source input thread may be configured to be connected to a microphone audio input channel. An audio output thread may be configured to be connected to a speaker output channel, and a beamformer thread may be responsive to the audio array input thread. A beam analysis and selection thread may be connected to an output of the beamformer thread. A mixer thread may have a first input connected to an output of the source input thread, a second input connected to an output of the beam analysis and selection thread, and an output connected to the audio output thread. The audio input channel may be connected to the personal communication device, as may the microphone audio input channel. The processor may include a line output thread configured to connect to an audio output channel. An audio information interface may be provided to connect signals representing audio to the processor.
Claims
1. An audio analysis and processing system comprising a processor configured with: a. an audio array input thread configured to be connected to one or more of audio input channels each corresponding to an audio input sensor; b. an audio input sensor positionally related to a position of other audio input sensors; c. a source audio input thread configured to be connected to a source audio input channel; d. an audio output thread configured to be connected to a speaker output channel; e. a beam former, direction of arrival, and orientation thread responsive to said audio array input thread; f. an audio analysis and beam selection thread connected to an output of a user microphone thread and an output of said beam former, direction of arrival, and orientation thread and connected to a dwell time counter, and wherein said audio analysis thread includes one or more of speaker recognition, voice activity detection and noise reduction algorithms activated to select a beam upon detection of a beam selection criteria detected by said audio analysis and beam selection thread and active for a period concluding upon expiration of a dwell time counter and wherein said dwell time counter is initiated upon detection of said beam selection criteria; and g. a mixer thread having a first input connected to an output of said source audio input thread and a second input connected to an output of said audio analysis and beam selection thread and having an output connected to said audio output thread and wherein said mixer thread processes audio in accordance with an output of said audio analysis and beam selection thread and is responsive to a dwell time counter.
2. The audio analysis and processing system according to claim 1 further comprising a communications interface connected to said processor.
3. The audio analysis and processing system according to claim 2 wherein said communications interface further comprises a low-power wireless personal area network interface.
4. The audio analysis and processing system according to claim 3 wherein said low power wireless personal area network is a Bluetooth Low Energy (BLE) interface.
5. The audio analysis and processing system according to claim 4 wherein said BLE interface further comprises a BLE daemon responsive to a user interface thread of said processor and an HCI driver responsive to said BLE daemon.
6. The audio analysis and processing system according to claim 2 further comprising a user control interface linked to said processor.
7. The audio analysis and processing system according to claim 6 wherein said user control interface further comprises an application program operating on a personal communication device.
8. The audio analysis and processing system according to claim 7 wherein said audio input channel is connected to said personal communication device.
9. The audio analysis and processing system according to claim 8 wherein said microphone audio input channel is connected to said personal communication device.
10. The audio analysis and processing system according to claim 2 further comprising an audio information interface connecting signals representing audio to said processor.
11. The audio analysis and processing system according to claim 1 wherein said processor further comprises a line output thread configured to connect to an audio output channel.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
(8) Before the present invention is described in further detail, it is to be understood that the invention is not limited to the particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.
(9) Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range, and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and each such smaller range is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
(10) Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, a limited number of the exemplary methods and materials are described herein.
(11) It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.
(12) All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates, which may need to be independently confirmed.
(13) The invention relates to a device that facilitates control over a personal audio environment. Conventional personal speakers (headphones and earphones) provide a barrier between the ambient audio environment and the audio that a user is exposed to. The isolating effect of personal speakers is disruptive and may be dangerous. Conventional personal speakers often must be removed by a user in order to hear ambient audio. The isolating effect of personal speakers is widely recognized. Some states have enacted laws prohibiting personal speakers from being worn while driving. The organizers of many sporting events, such as running and bicycle races, have prohibited competitors from using personal speakers in competition because the audio isolation can be dangerous.
(14) Noise-canceling headphones increase a user's audio isolation from the environment. This brute force approach to noise reduction is not ideal and comes at the expense of blocking ambient audio that may be desirable for a user to hear. A user's audio experience may be enhanced by selectively controlling the ambient audio delivered to a user.
(15) The system described herein allows a user to control an audio environment by selectively admitting portions of ambient audio. The system may include personal speakers, a user interface, and an audio processing platform. A microphone array including audio sensing microphones may be utilized to detect the acoustic energy in the environment. A beamforming unit may segment the audio environment into distinct zones. The zones may be overlapping. An audio gateway can determine the zone or zones which include desirable audio and transmit signals representing audio from one or more of those zones to a personal speaker system. The gateway can be controlled in one or more modes through a user interface. The user interface may be implemented with a touchscreen on a personal communications device running an application program.
(16) The gateway may include a mixer to blend one or more audio zones with electronic source audio signals. The electronic source audio may be a personal music player, a dedicated microphone, or broadcast audio information.
(17) The gateway may be in a fixed arc and/or fixed direction mode. In such modes, beamforming techniques may admit audio from a direction or range of directions. This may be done independent of the presence of audio originating from the direction or range of directions.
(18) Another mode of operation may rely on keyword spotting. When a keyword spotting algorithm detects a keyword, the system selects the beam in which the keyword was detected for transmission to the personal speaker. The system may use constrained or unconstrained keyword spotting. Keyword spotting may use a sliding window and garbage model, k-best hypotheses, iterative Viterbi decoding, dynamic time warping, or other methods. In addition, keywords may include phrases consisting of multiple words. See https://en.wikipedia.org/wiki/keyword_spotting.
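A sliding-window approach using dynamic time warping, one of the methods named above, can be sketched as follows. This is a minimal illustration rather than the system's implementation; the feature sequences, the `spot_keyword` name, and the threshold value are assumptions for the example.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two feature sequences,
    normalized by the combined sequence length."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def spot_keyword(stream, template, threshold):
    """Slide a window the length of the keyword template across the feature
    stream and report offsets whose DTW distance falls below threshold."""
    hits = []
    w = len(template)
    for start in range(0, len(stream) - w + 1):
        if dtw_distance(stream[start:start + w], template) < threshold:
            hits.append(start)
    return hits
```

In the full system a detection at some offset would mark the corresponding beam for selection.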
(19) Another mode of operation may rely on speaker recognition. When an algorithm detects the presence of speech along with sufficient acoustical detail to match the audio or speech with a locally stored or available profile, the system may select the beam in which the audio exhibits characteristics sufficiently close to that profile. The profile may relate to a speaker of interest.
(20) Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected. Various VAD algorithms may be used, providing varying features and compromises among latency, sensitivity, accuracy, and computational cost. Some VAD algorithms also provide further analysis, for example whether the speech is voiced, unvoiced, or sustained.
(21) The VAD algorithm may include a noise reduction stage, e.g., via spectral subtraction. Some features or quantities may then be calculated from a section of the input signal, and a classification rule may be applied to classify the section as speech or non-speech; often the classification rule tests whether a value exceeds a threshold.
(22) There may be some feedback in this sequence, in which the VAD decision is used to improve the noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a lot).
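The feedback loop described above can be illustrated with an energy-based sketch. This is a generic example, not the patent's specific algorithm; the `margin` multiplier and smoothing factor `alpha` are assumed values.

```python
import numpy as np

def vad(frames, init_noise=1e-3, margin=3.0, alpha=0.95):
    """Frame-by-frame energy VAD with noise-estimate feedback: frames judged
    as non-speech update the running noise estimate, which in turn adapts
    the decision threshold in non-stationary noise."""
    noise = init_noise
    decisions = []
    for frame in frames:
        energy = float(np.mean(frame ** 2))
        is_speech = energy > margin * noise
        if not is_speech:
            # feedback: refine the noise estimate from non-speech frames
            noise = alpha * noise + (1 - alpha) * energy
        decisions.append(is_speech)
    return decisions
```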
(23) Published VAD methods formulate the decision rule on a frame-by-frame basis using instantaneous measures of the divergence distance between speech and noise. See Ramirez J, Segura J C, Benitez C, de la Torre A, Rubio A: A new voice activity detector using subband order-statistics filters for robust speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), 2004 1: 1849-1852. Different measures may be used in the VAD, including spectral slope, correlation coefficients, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.
(24) Voice activity detection may be configured to allow audio information from the zone corresponding to the direction of origin of the voice activity.
(25) Another mode of operation may be a speaker recognition mode. Speaker recognition is the identification of a person from characteristics of voices (voice biometrics). It is also called voice recognition. There is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said). These two terms are frequently confused. Recognizing the speaker can simplify the task of allowing a user to hear a speaker in a system that has been trained on a specific person's voice.
(26) Speaker recognition uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy (e.g., size and shape of the throat and mouth) and learned behavioral patterns (e.g., voice pitch, speaking style).
(27) Each speaker recognition system may have two phases: Enrollment and verification. During enrollment, the speaker's voice may be recorded and/or modeled on one or more features of the speaker's voice which are extracted to form a voice print, template, or model. In the verification phase, a speech sample or “utterance” may be compared against a previously created voice print. The utterance may be compared against multiple voice prints in order to determine the best match having an acceptable score. Acoustics and speech analysis techniques may be used.
(28) Speaker recognition is a pattern recognition problem. Various techniques may be used to process and store voice prints including frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, Vector Quantization and decision trees. The system may also use “anti-speaker” techniques, such as cohort models, and world models. Spectral features are predominantly used in representing speaker characteristics.
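The enrollment and verification phases described above can be sketched with a simple similarity score. Here feature vectors stand in for the spectral features, and cosine scoring is used as one common choice; the names, vector dimensions, and acceptance threshold are assumptions for the example, not the patent's method.

```python
import numpy as np

def enroll(samples):
    """Enrollment: average the feature vectors of several utterances
    into a single voice print (template)."""
    return np.mean(np.asarray(samples, dtype=float), axis=0)

def verify(utterance, voice_prints, accept=0.8):
    """Verification: compare an utterance against multiple voice prints and
    return the best match if its score clears the acceptance threshold."""
    best_id, best_score = None, -1.0
    for speaker_id, print_vec in voice_prints.items():
        score = float(np.dot(utterance, print_vec) /
                      (np.linalg.norm(utterance) * np.linalg.norm(print_vec)))
        if score > best_score:
            best_id, best_score = speaker_id, score
    return (best_id, best_score) if best_score >= accept else (None, best_score)
```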
(29) Ambient noise levels can impede collection of both the initial and subsequent voice samples. Noise reduction algorithms may be employed to improve accuracy.
(31) The audio analysis and processing system may include an audio input/output (“I/O”) subsystem 103, described further below.
(32) The audio analysis and processing system may include a Bluetooth low energy (“BLE”) adapter 104 and the control interface 105. The BLE adapter 104 may be provided to set up communications with a control interface 105 which may operate on a personal communication device, such as an iOS or Android-based cellphone, tablet, or other device. The control interface may be implemented as an app. The microphone array 102 and audio I/O subsystem 103 may be connected to a USB driver 121, which in turn may be connected to audio drivers 106a, 106b, and 106c. The microphone array 102 may be provided with one audio driver 106a for use in connection with the microphone array 102. An audio driver 106b may be dedicated to the input communications from the audio I/O subsystem 103, and a third driver 106c may be dedicated for use in connection with the output functions of the audio I/O subsystem 103. A Host Control Interface (“HCI”) driver 107 may be connected to interface with the BLE adapter 104. A BLE daemon 108 may be provided for communications with the HCI driver 107. The components 105-107 may be conventional components implemented using a Linux operating system environment.
(33) The main processor may run a plurality of processes or software threads. A software thread is a process that is part of a larger process or program. An array input thread may be an audio input thread 109 which may be connected through a USB driver 121 and audio driver 106a to the microphone array 102. The audio input thread 109 may serve to unpack a data transmission from the microphone array 102. The unpacked data may be provided to a pre-analysis processing thread, shown as the beamformer, direction of arrival, and orientation thread 115, in order to implement a beamformer, a direction of arrival process, and an orientation process that derive usable direction, orientation, and separated audio source signals from the input signals. The beamformer 115 may take signals representing audio from a plurality of microphones in the microphone array 102, for example, eight (8) signals representing audio detected at eight microphones. The beamformer 115 may process the signals to generate a plurality of directional beams. The beams, for example, may originate at the array and may have overlapping zones, each with 50% intensity over a 360 degree range, or may be a non-spatialized representation of the microphone array signals.
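The beamformer is described above only functionally. As a generic illustration of how directional beams can be formed from array signals, the following is a minimal delay-and-sum sketch; the microphone positions, sample rate, and integer-sample steering delays are simplifying assumptions, not the patent's beamformer design.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, angle_deg, fs=16000, c=343.0):
    """Steer a beam toward angle_deg: delay each microphone signal so that a
    plane wave arriving from that direction adds coherently, then average."""
    d = np.array([np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))])
    out = np.zeros_like(signals[0], dtype=float)
    for sig, pos in zip(signals, mic_positions):
        delay = int(round(fs * np.dot(pos, d) / c))  # per-mic steering delay
        out += np.roll(sig, -delay)
    return out / len(signals)
```

A bank of such beams at 45° increments would yield the eight overlapping zones described.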
(34) A source input thread 110 may be responsive to the control interface 105 and is provided to process audio signals from the audio I/O subsystem 103 through the USB driver 121 and audio driver 106b in order to extract audio input obtained through the audio I/O subsystem 103. The source input thread 110 may provide audio to the mixer thread 119. The source input thread 110 may be implemented with the ALSA (Advanced Linux Sound Architecture) kernel and library APIs to initialize the source input hardware and the capture gain of the source input audio. In part this is done using the snd_pcm_open( ) and snd_ctl_open( ) ALSA functions. Then the ALSA snd_pcm_readi( ) function may be called to request additional samples when its buffer is not full. When a complete buffer is available, it may be en-queued and a buffer-available signal may be sent to the mixer thread 119.
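The capture-and-enqueue handshake described above can be sketched as below. The real thread would call the ALSA functions named in the text; here a stubbed `device.read` stands in for `snd_pcm_readi( )` so only the buffer flow is shown, and all class and parameter names are illustrative.

```python
import queue
import threading

class SourceInputThread(threading.Thread):
    """Sketch of the source input capture loop: read a buffer of samples
    from the (stubbed) capture device, then en-queue it for the mixer,
    which acts as the 'buffer available' signal."""

    def __init__(self, device, mixer_queue, frames_per_buffer=256, n_buffers=4):
        super().__init__(daemon=True)
        self.device = device            # stands in for the opened PCM handle
        self.mixer_queue = mixer_queue  # consumed by the mixer thread
        self.frames = frames_per_buffer
        self.n_buffers = n_buffers

    def run(self):
        for _ in range(self.n_buffers):
            buf = self.device.read(self.frames)  # would be snd_pcm_readi()
            self.mixer_queue.put(buf)            # buffer-available signal
```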
(35) A user microphone input thread 112 is provided to process audio from a personal microphone 213 associated with personal speakers 212.
(36) Line output thread 114 may be controlled by the control interface 105 through the BLE daemon 108. The line output thread 114 may receive a signal representing audio from the analysis and beam selection thread 111 and pass audio information through the host control interface driver 107 to the control interface 105. The line output thread algorithm may use the ALSA (Advanced Linux Sound Architecture) kernel and library APIs to initialize the audio output. This may be done using the snd_pcm_open( ) and snd_ctl_open( ) ALSA functions. When it receives a new buffer of audio output samples, it may use the ALSA snd_pcm_writei( ) function to send those samples to the host interface driver.
(37) An analysis and beam selection thread 111 may be provided for specialized processing of the input audio beams. For example, the analysis and beam selection thread 111 may be capable of receiving multiple beams from the beamformer, direction of arrival, and orientation thread 115 and processing one or more audio beams through a series of analysis threads. Examples of analysis threads are shown in the drawings.
(38) When the analysis and beam selection thread 111 identifies a condition in the analysis threads, the audio may be provided to a mixer thread 119 which processes the audio signal for transmission back through the audio I/O subsystem 103 to a personal speaker 212.
(39) In order to track the relative position of a user, a user position sensor 123 and a microphone array position sensor 124 may provide input to the beamformer, direction of arrival, and orientation thread 115. The position sensors may include one or more of a magnetometer, an accelerometer, and a gyroscope. In a special case where the microphone array 102 is in a fixed orientation relative to a user, only one position sensor may be needed. U.S. patent application Ser. No. 15/355,766, now U.S. Pat. No. 9,980,075, the disclosure of which is expressly incorporated herein by reference, describes the apparatus and process for stabilizing audio output to compensate for changes in position of a user, a microphone array, and an audio source.
(40) The main processor 101 may also include a user interface thread 120 which permits the control interface 105 to control the processing performed by the main processor 101.
(42) The microcontroller 201 may also include a USB interface 204. The USB interface 204 may be implemented as a standard USB, a single high-speed USB, or a dual-standard USB having USB interfaces 205 and 206. In the implementation with dual USB interfaces, they may be connected to a USB hub 207 and then to a USB connector 208 and operate at 480 Mbps. The audio analysis system may also include a system clock 209, which may reside on the audio input/output subsystem 103. The system clock 209 may also serve as the clock for the microphone array/audio position capture system.
(44) The output of the band-pass filter 304 may be connected to a beamforming filter 305. The beamforming filter may be an 8-channel second order differential beamformer. The outputs of beamforming filter 305 may be frequency domain outputs connected to a domain conversion stage 306. The domain conversion stage 306 may apply a 512 point Inverse Fast Fourier Transform (“IFFT”) with 50% overlap to convert the frequency domain outputs of the beamforming filter 305 to time domain signals. The time domain output of the domain conversion stage 306 may be eight channels connected to an output register 307. The output register 307 may have eight (8) audio channels at 16 kHz. Each of the eight (8) audio channel outputs may provide a directional output having a central lobe separated from its neighbors by approximately 45°. The directional processing system 300 may include a cross-correlation stage 308 connected to an output of the band-pass filter 304 and may apply a cross-correlation having 360°/255° directional steps. The output of the cross-correlation stage 308 may be connected to a histogram analysis stage 309 which advantageously identifies the directions of arrival of the most dominant directional steps. Advantageously, the four (4) most dominant steps as determined by the histogram analysis may be mapped onto one to four of the 8-channel directional outputs of the output register 307. The output register 307 may include a representation of which one or more of the 8 channels correspond to the most dominant steps.
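The cross-correlation and histogram stages can be illustrated generically: estimate the inter-microphone lag of each frame by cross-correlation, then take the most populated histogram bin as the dominant arrival. This is a two-signal sketch under assumed names, not the 255-step implementation described above.

```python
import numpy as np

def estimate_lag(x, y):
    """Estimate the sample lag of y relative to x via full cross-correlation;
    the lag relates to direction of arrival through the array geometry."""
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr)) - (len(x) - 1)

def dominant_direction(lags, n_bins=8):
    """Histogram analysis: bin the per-frame lag estimates and return the
    center of the most populated bin as the dominant arrival."""
    hist, edges = np.histogram(lags, bins=n_bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])
```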
(45) A position sensor 310 may provide output data to an axis translation stage. The position sensor 310 may be a 9-axis sensor which generates output data representing a gyroscope in 3 axes, an accelerometer in 3 axes, and a magnetometer in 3 axes. The sensor may be fixed to the microphone array. The axis translation stage 311 may convert the position sensor data to data representing roll, yaw, and pitch. The position sensor data may be provided every 16 milliseconds. The output of the axis translation stage 311 may be connected to the output register 307, which may include a representation of the orientation.
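A conventional way to derive roll and pitch from the accelerometer's gravity reading is sketched below; yaw additionally requires the magnetometer and is omitted. These are the standard tilt equations, shown as an assumption about how an axis translation stage could work, not as the patent's formulas.

```python
import math

def roll_pitch_from_accel(ax, ay, az):
    """Standard tilt equations: translate a 3-axis accelerometer reading
    (the gravity vector at rest) into roll and pitch angles in degrees."""
    roll = math.degrees(math.atan2(ay, az))
    pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    return roll, pitch
```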
(47) A system clock 209 may be connected to connector 406. The same clock used for the audio input and output may be used to facilitate synchronous data handling. The microcontroller 401 may operate to output simultaneous signals 409 to the sensors 408. It may be advantageous to equalize the lengths of the traces 409 to each sensor 408; equalized trace lengths facilitate near-simultaneous capture from all microphones. The sensors 408 may be connected in pairs to serial ports 402 of the microcontroller 401. The serial ports 402 may be I²S ports. A position sensor 407 associated with the sensors 408 may be connected to the serial port 403 of the microcontroller 401. The microcontroller 401 may have a strobe/enable line 410 connected to the sensors 408. The microcontroller 401 collects data from the sensors 408 over data lines 411. The data is packaged into frames 501 shown in the drawings.
(48) The microcontroller 401 is configured to collect synchronous data from the sensors 408 of a sensor array. The microcontroller may package the data into frames acting as a multiplexer.
(49) The sensors 408 may be arranged in fixed relationship to the position sensor 407. The microphones 408 may have a known relative position, and may advantageously be arranged in a “circular” pattern.
(50) The microcontroller 401 may be configured as a multiplexer in order to read in and consolidate the data into the frame format shown in the drawings.
(53) The loop start point is designated 601. Decision 602 determines whether there is any active beam. If the response to 602 is yes, decision 603 determines if the beam position is locked. The beam position may be locked by a user command or operation or may be locked pursuant to condition analysis (not shown). If the determination at decision 603 is yes, decision 604 determines if the dwell time counter is greater than zero (0). The dwell time represents the period of time a beam is active. The period of time may be set according to a user command or be a fixed time period. The fixed time period may be set for a duration suitable for the application.
(54) If the dwell time counter is greater than zero, the step 605 decreases the dwell time counter. Step 606 represents allowing the beam output to continue. The process at 607 returns to start loop 601.
(55) If the determination at 602 is that there is no active beam, determination 611 tests whether a detection condition is active. A detection condition is any condition that the analysis process is monitoring. Audio conditions may include voice activity detection, keyword detection, speaker detection, and direction of arrival detection. Other conditions, both audio and non-audio, may also be monitored. For example, condition detection may draw on location services; noise profiles; audio profiles, such as alarm detection; proximity detection; detection of beacon signals, like iBeacon; detection of ultrasonic signals; matching of audio content to a reference; or other audio or non-audio sensed conditions.
(56) If a detection condition is active, step 612 selects the appropriate beam or beams. The selection may choose the beam or beams carrying the strongest portion of an active detection condition. The dwell time counter may be initialized at step 615. Step 615 may be performed after the detection condition active decision 611 or after the select appropriate beam step 612.
(57) The next step may be to decrement the dwell counter at 605 or to continue the beam output 606.
(58) If a beam is locked to a particular direction, the system will continuously ensure that such direction and orientation are known, so that any subsequent change of the user and/or microphone array orientation can result in an offsetting adjustment to such beam in order to preserve its originally identified direction and orientation.
(59) If decision 603 determines that the beam position is not locked, then step 608 may operate to change the beam selection. The beam selection is changed to correspond to the direction from which a sound matching the user's established selection criteria is emanating.
(60) If step 604 determines that the dwell time counter is not greater than zero, all beams are deselected at step 609. The deselection step includes changing the beam status to inactive. After 609, start loop 610 takes the process flow to start loop 601.
(61) If the detection condition active decision 611 is no, then the process goes to deselect beams at 613, which may be the same as deselect step 609, and start loop 614 passes back to start loop 601.
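The loop of steps 601 through 615 can be condensed into a small state machine. This sketch (the class name and per-tick granularity are assumptions) shows beam selection on a detection condition, dwell-time countdown, deselection at expiry, and re-steering of an unlocked beam:

```python
class BeamSelector:
    """Condensed form of the flowchart loop: a detection condition selects
    a beam and initializes the dwell-time counter (steps 612/615); while a
    beam is active the counter is decremented each pass (604/605) until the
    beam is deselected at zero (609); an unlocked beam may be re-steered
    toward a new detection (608)."""

    def __init__(self, dwell_time):
        self.dwell_time = dwell_time
        self.active_beam = None
        self.counter = 0

    def tick(self, detection=False, beam=None, locked=False):
        if self.active_beam is not None:
            if detection and beam is not None and not locked:
                self.active_beam = beam        # step 608: follow the condition
            if self.counter > 0:
                self.counter -= 1              # steps 604/605: dwell countdown
            else:
                self.active_beam = None        # step 609: dwell expired
        elif detection:
            self.active_beam = beam            # step 612: select beam
            self.counter = self.dwell_time     # step 615: initialize dwell
        return self.active_beam
```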
(63) If decision 704 determines that the beam is selected and the determination “dwell time not over” is negative, then the system will determine voice activity at 708. Decision 709 decides whether voice activity detection is configured (turned on). If so, decision 710 determines whether there is voice activity. This may be done for each of the eight beams. If decision 710 determines there is voice activity, then step 711 will set a timer to start dwell time for voice activity. If voice activity is not configured at decision 709 or not detected at decision 710, or after starting dwell time, the process proceeds to a keyword configuration decision at 712. This may be done for each of the eight beams. If yes, keyword processing occurs at step 713 and then a keyword detection decision is made at 714. If the keyword detection decision is yes, step 715 starts dwell time and deconfigures keyword detection. After step 715, after no keyword is detected at decision 714, or after no keyword configuration at decision 712, the process proceeds to a speaker configuration decision at 716.
(64) Decision 716 determines whether speaker profile detection is activated. If activated, the system carries out speaker processing at 717. After the speaker processing, decision 718 determines whether the speaker has been detected. This may be done by matching a reference voice profile to a profile generated from a beam. The speaker profile advantageously may be a preconfigured speaker profile. If decision 718 determines that the speaker profile is matched, the system may start dwell time and deconfigure speaker detection at 719. After deconfiguration of speaker detection at 719, after a decision 716 that speaker configuration is off, or after a decision 718 that no speaker is detected, the process is passed to direction of arrival processing 720.
(65) Decision 720 determines whether direction of arrival processing is configured. If yes, direction processing is performed at 721. After direction processing is performed at 721, decision 722 checks the direction of arrival.
(66) The decisions 710, 714, 718, and 722 are stored for use at decision 723, where the detected criteria are checked against the configured criteria. If the detected criteria match the configured criteria, then the beam with the most power is selected at step 724. If the detected criteria do not match any configured criteria, then step 726 deselects all beams. After the selection at 724, the dwell time is incremented at step 725. Processing then returns to step 701 for the next 16-millisecond interval. The process may be continuously repeated on a 16-millisecond cycle.
(67) The user may select the overall volume of the system and may select the relative volume of the prerecorded content against the injected audio. The system may be configured to maintain the same overall output level regardless of whether injected audio is being mixed with prerecorded content or prerecorded content plays alone.
(68) Alternative audio processing may include a sound level monitor so that the actual levels of injected sound are determined and the overall volume and/or relative volumes are adjusted in order to maintain a consistent output sound level and/or ratio.
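One way to realize the consistent-output-level behavior described above is to normalize the blended signal to a target RMS level. This is a simplified sketch; the `inject_gain` and `target_rms` parameters are assumed values, not system settings.

```python
import numpy as np

def mix(source, injected, inject_gain=0.5, target_rms=0.1):
    """Blend source audio with injected ambient audio, then scale the mix
    so the overall RMS level stays constant whether or not any audio is
    being injected."""
    mixed = source + inject_gain * injected
    rms = np.sqrt(np.mean(mixed ** 2))
    if rms > 0:
        mixed = mixed * (target_rms / rms)  # normalize to the target level
    return mixed
```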
(69) The mixer may also inject audio signals indicative of detection of configured audio variables.
(70) The techniques, processes and apparatus described may be utilized to control operation of any device and conserve use of resources based on conditions detected or applicable to the device.
(71) The invention is described in detail with respect to preferred embodiments, and it will now be apparent from the foregoing to those skilled in the art that changes and modifications may be made without departing from the invention in its broader aspects, and the invention, therefore, as defined in the claims, is intended to cover all such changes and modifications that fall within the true spirit of the invention.
(72) Thus, specific apparatus for and methods of an audio analysis and processing system have been disclosed. It should be apparent, however, to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the spirit of the disclosure. Moreover, in interpreting the disclosure, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced.