Method and apparatus for using a test audio pattern to generate an audio signal transform for use in performing acoustic echo cancellation
11521636 · 2022-12-06
CPC classification: G10H2240/175; G10H1/0058 (Physics)
Abstract
A test audio pattern is sent to the speaker of the participant computer for outputting by the speaker. A computer receives a microphone input signal from the participant computer that includes the test audio pattern outputted by the speaker of the participant computer, and any ambient noise picked up by the microphone of the participant computer. Ambient noise suppression is performed to cancel out any ambient noise in the microphone input signal picked up by the microphone of the participant computer. The test audio pattern sent to the speaker of the participant computer is compared with the noise-suppressed microphone input signal, which includes the test audio pattern outputted by the speaker of the participant computer. An audio signal transform is generated from the comparison. The generated audio signal transform is subsequently used for performing acoustic echo cancellation of streaming audio received from the microphone input signal when the participant computer receives streaming audio and the participants engage in remote audio communications with each other.
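The pipeline summarized above — play a known test pattern through the speaker, capture it at the microphone, and derive a transform from the comparison — can be sketched as a frequency-domain transfer-function estimate. This is only a minimal illustration under assumed conventions, not the patent's specified method; the function names and the regularization constant are assumptions:

```python
import numpy as np

def estimate_transform(test_pattern, mic_capture, eps=1e-8):
    """Estimate a frequency-domain transform H(f) relating the known test
    pattern (as sent to the speaker) to the noise-suppressed microphone
    capture of that same pattern: H(f) = Y(f) / X(f), regularized so that
    spectral bins where the test pattern has no energy do not blow up."""
    n = len(test_pattern)
    X = np.fft.rfft(test_pattern, n)
    Y = np.fft.rfft(mic_capture[:n], n)
    # Regularized division: Y X* / (|X|^2 + eps)
    return (Y * np.conj(X)) / (np.abs(X) ** 2 + eps)

def apply_transform(H, signal):
    """Predict what the microphone would capture for `signal` by applying
    the stored transform in the frequency domain."""
    n = 2 * (len(H) - 1)
    S = np.fft.rfft(signal, n)
    return np.fft.irfft(S * H, n)[:len(signal)]
```

In a deployed system the estimate would be averaged over repeated test patterns and the transform stored per participant computer, since each speaker/room/microphone path differs.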
Claims
1. An automated method of audio signal processing for use with a plurality of participant computers associated with respective participants, each of the participant computers receiving streaming audio from a host server, wherein the plurality of participants engage in remote audio communications with each other via their respective participant computers and the host server while simultaneously receiving the streaming audio, each participant computer including (i) a speaker having audio settings that are preadjusted to settings that the participant intends to use for outputting the streaming audio, and (ii) a microphone, wherein prior to the participant computer receiving streaming audio and prior to the participants engaging in remote audio communications with each other, the method comprising for each participant computer: (a) sending, by a computer, a test audio pattern to the speaker of the participant computer for outputting by the speaker, (b) receiving, by the computer, a microphone input signal from the participant computer, the microphone input signal including: (i) the test audio pattern outputted by the speaker of the participant computer, and (ii) any ambient noise picked up by the microphone of the participant computer; (c) performing ambient noise suppression, by the computer, to cancel out any ambient noise in the microphone input signal picked up by the microphone of the participant computer; (d) performing audio signal processing, by the computer, for comparing the test audio pattern sent to the speaker of the participant computer with the noise-suppressed microphone input signal, the noise-suppressed microphone input signal including the test audio pattern outputted by the speaker of the participant computer; and (e) generating, by the computer, an audio signal transform from the comparison performed in step (d), the audio signal transform being a transformation function, wherein the generated audio signal transform is subsequently used for performing acoustic echo cancellation of the streaming audio received from the microphone input signal when the participant computer receives streaming audio and the participants engage in remote audio communications with each other, and wherein steps (a)-(e) are performed for each participant computer, and thus a different audio signal transform is generated for each participant computer.
2. The method of claim 1 further comprising: (f) storing in memory the audio signal transform; and (g) applying, by the computer, the transformation function of the audio signal transform to the streaming audio that is subsequently received from the microphone input signal, wherein steps (f) and (g) are also performed for each participant computer, and thus a different audio signal transform is generated, stored, and applied for each participant computer.
3. The method of claim 1 wherein the computer is the participant computer.
4. The method of claim 1 wherein the computer is the host server.
5. A computer-implemented apparatus for audio signal processing for use with a plurality of participant computers associated with respective participants, each of the participant computers receiving streaming audio from a host server, wherein the plurality of participants engage in remote audio communications with each other via their respective participant computers and the host server while simultaneously receiving the streaming audio, each participant computer including (i) a speaker having audio settings that are preadjusted to settings that the participant intends to use for outputting the streaming audio, and (ii) a microphone, wherein prior to the participant computer receiving streaming audio and prior to the participants engaging in remote audio communications with each other, the apparatus comprising a computer that includes software code executed on a processor of the computer which is configured to perform the following steps for each participant computer: (a) send a test audio pattern to the speaker of the participant computer for outputting by the speaker, (b) receive a microphone input signal from the participant computer, the microphone input signal including: (i) the test audio pattern outputted by the speaker of the participant computer, and (ii) any ambient noise picked up by the microphone of the participant computer; (c) perform ambient noise suppression to cancel out any ambient noise in the microphone input signal picked up by the microphone of the participant computer; (d) perform audio signal processing for comparing the test audio pattern sent to the speaker of the participant computer with the noise-suppressed microphone input signal, the noise-suppressed microphone input signal including the test audio pattern outputted by the speaker of the participant computer; and (e) generate an audio signal transform from the comparison performed in step (d), the audio signal transform being a transformation function, wherein the generated audio signal transform is subsequently used for performing acoustic echo cancellation of the streaming audio received from the microphone input signal when the participant computer receives streaming audio and the participants engage in remote audio communications with each other, and wherein steps (a)-(e) are performed by the software code for each participant computer, and thus a different audio signal transform is generated for each participant computer.
6. The apparatus of claim 5 wherein the software code executed on the processor of the computer is further configured to perform the following steps for each participant computer: (f) store in memory the audio signal transform, and (g) apply the transformation function of the audio signal transform to the streaming audio that is subsequently received from the microphone input signal, wherein steps (f) and (g) are also performed for each participant computer, and thus a different audio signal transform is generated, stored, and applied for each participant computer.
7. The apparatus of claim 5 wherein the computer is the participant computer.
8. The apparatus of claim 5 wherein the computer is the host server.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments which are presently preferred. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown. In the drawings:
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
(15) Certain terminology is used herein for convenience only and is not to be taken as a limitation on the present invention. The words “a” and “an”, as used in the claims and in the corresponding portions of the specification, mean “at least one.”
I. TERMINOLOGY AND DEFINITIONS
(16) The following terminology and definitions are provided to promote understanding of the present invention. The terminology and definitions of the prior art are not necessarily consistent with the terminology and definitions of the present invention. Where there is conflict, the following terminology and definitions apply.
(17) streaming audio—Streaming audio is a one-way audio transmission over a data network. The streaming audio referred to herein may come from any number of different sources such as the audio portion of a videoconference session (e.g., host of a Zoom® session), audio of a live streaming event with audio and video, or an audio only source. More specifically, the streaming audio is a one-way transmission, such as from a host (source of the streaming audio) to participants (recipients of the streaming audio), wherein the participants cannot, or are not intended to, send audio back to the host. Feedback or echo may occur if a participant sends audio back to the host.
(18) The streaming audio may be the audio portion of streaming video that is intended to be viewed by the participants. Another example of streaming audio may be the audio of a recorded event, such as a movie, sports, or news event. Certain embodiments of the present invention may be used for “watch parties” wherein participants can watch videos, live or recorded, and interact with one another around them in the same moment. In these embodiments, the streaming audio is the audio of the live or recorded videos. In one embodiment, the watch party may be the watching of a Twitch session, wherein the Twitch session livestreams the game sounds and any simultaneously occurring comments made by the game player(s). In this example, the streaming audio is the audio of the livestream.
(19) Streaming audio is also interchangeably referred to herein as “host-sent audio.”
(20) A participant may also receive other sources of streaming audio that are not host-sent. For example, the participant may be listening to Spotify or Pandora on his or her participant computer as background music. As discussed below, the “audio signal copy” is not intended to capture these other sources of streaming audio, and is not intended to be used to cancel out these other sources of streaming audio.
(21) host—The host is a provider of the streaming audio. The host may have a co-host. However, the host and the co-host are collectively referred to herein as the “host.” The host need not be the person or entity which initiated the streaming event, and need not be the person or entity on whose server account the streaming event is being held.
(22) participant—The participants are recipients of the streaming audio, and also simultaneously engage in remote audio communications with each other. Each participant thus receives at least two sources of audio, namely, (i) the streaming audio (each participant receives the same streaming audio), and (ii) audio communications inputted by one or more other participants that are in audio communication with each other. Consider, for example, an embodiment with three participants, P1, P2, and P3. P1 receives the streaming audio and any audio communications inputted by P2 and P3. P2 receives the streaming audio and any audio communications inputted by P1 and P3. P3 receives the streaming audio and any audio communications inputted by P1 and P2. In the broadest embodiment, there may be only two participants. There is no technical limitation on the maximum number of participants, but there are practical limitations on the number of participants because each participant is free to communicate with other participants, and thus there may be uncontrollable chatter among the participants unless strict order is maintained on how many participants can speak or be unmuted at one time.
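The routing in the three-participant example above can be sketched as a simple mixing rule. This is an illustrative sketch only (the function name and data layout are assumptions, not any platform's API):

```python
import numpy as np

def mix_for_participant(recipient, participant_streams, streaming_audio):
    """Build the audio heard by `recipient`: the host-sent streaming audio
    plus the audio communications of every *other* participant. Every
    participant receives the same streaming audio, and a participant
    never receives his or her own audio back."""
    mix = np.array(streaming_audio, dtype=float)
    for participant, stream in participant_streams.items():
        if participant != recipient:
            mix += np.asarray(stream, dtype=float)
    return mix
```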
(23) participant computer—Each participant has a participant computer which allows the respective participant to engage in sessions initiated by the host, and to communicate with other participant computers as part of the host-initiated session. Each participant computer includes a microphone that receives any audio communications inputted by the respective participant, such as utterances of the participant. In some instances, the participant must be invited by the host, via their respective computers, to participate in the host-initiated session. In other instances, the host-initiated session is open to any participants who wish to join a host-initiated session. In some instances, the host controls which participants may communicate with each other, whereas in other instances, the participants may independently select which other participants they wish to communicate with.
(24) The participant computer is also interchangeably referred to herein as a “computing device.” The participant computer (computing device) may have many different form factors, such as a smartphone or a tablet.
(25) host computer—A host computer provides the source of the streaming audio. Similar to the co-host described above, there may be a co-host computer. However, the host computer and the co-host computer are collectively referred to herein as the “host computer.” The co-host, via the co-host computer, may be the actual source of the streaming audio. In this embodiment, the host computer designates the co-host computer to be the source of the streaming audio. Consider, for example, a musician or a disc jockey (DJ) who is the designated co-host for providing the streaming audio. The musician may play live music, whereas the DJ may play recorded music. In the case of a co-host providing the streaming audio, a host server may receive the streaming audio either directly from the co-host computer (along with instructions from the host computer to use the streaming audio of the co-host computer as the source of the streaming audio), or the host server may receive the streaming audio from the co-host computer via the host computer which acts in a relay capacity for the streaming audio. The host computer (also referred to herein as a “host computing device”) may have many different form factors, such as a smartphone or a tablet.
(26) host server—A host server functions as a coordinating entity between the host computer and the plurality of participant computers. The host server is in wired or wireless electronic communication with the host computer and the plurality of participant computers via an electronic network, such as the internet. The exact functions of the host server depend upon the system architecture. In one type of system architecture, referred to herein as a “hub and spoke” configuration, all or most of the communications occur through the host server, including the communications between the host computer and the plurality of participant computers, as well as any communications between the plurality of participant computers. Thus, in the example above, any communications between the host computer and P1, P2, and P3 occur through the host server. Likewise, any communications between P1 and P2, or P1 and P3, or P2 and P3 occur through the host server. In addition, acoustic echo cancellation of signals received from P1, P2, and P3 also occurs at the host server.
(27) In another type of system architecture, referred to herein as a “peer-to-peer” configuration, the participant computers electronically communicate directly with each other for certain functions, particularly for conveying “processed audio communication signals” therebetween. In the peer-to-peer configuration, each of the participant computers still receives the streaming audio from the host computer, and acoustic echo cancellation of signals occurs at each of the respective participant computers. In the example above, this means that P1 performs its own acoustic echo cancellation before transmitting its “processed audio communication signals” directly to P2 and P3. Likewise, P2 performs its own acoustic echo cancellation before transmitting its “processed audio communication signals” directly to P1 and P3, and P3 performs its own acoustic echo cancellation before transmitting its “processed audio communication signals” directly to P1 and P2.
(28) Another type of system architecture, referred to herein as a “hybrid” configuration, performs acoustic echo cancellation in a similar manner as the “peer-to-peer” configuration, but communications between the participant computers occur via the host server in the same manner as described above with respect to the “hub and spoke” configuration.
(29) In certain system configurations, the host computer may be part of the host server. Consider, for example, a server farm that is used to implement a “watch party” to a large number of participants. In such a configuration, the host computer may simply be part of the host server, wherein the host server is the portion of the server farm system that comprises the host. This is in contrast to a host which functions using a single server entity, such as when the host is videoconferencing from their computer via Zoom.
(30) audio signal copy—An “audio signal copy” is a digital copy of the frequency waveform of the host-sent streaming audio, and is created directly from the host-sent streaming audio. Since the streaming audio is a continuous waveform, the audio signal copy may thus also be a continuous waveform. While the streaming audio is simply meant to be played on speakers at the respective participant computers, the audio signal copy is intended to be stored in memory for subsequent use in performing acoustic echo cancellation. Timestamps embedded in the audio signal copy are used to ensure that the two waveforms can be synchronized with each other when performing the acoustic echo cancellation. In the “hub and spoke” configuration, the audio signal copy is created in the host server and is stored in memory of the host server, whereas in the “peer to peer” or hybrid configuration, the audio signal copy is created in each of the participant computers and is stored in memory of the respective participant computers.
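The timestamp-based synchronization described above can be sketched as follows. This is a minimal sketch under assumed conventions (the function name, a seconds-based timestamp, and the sample rate are illustrative assumptions):

```python
import numpy as np

def align_copy(copy_signal, copy_ts, mic_signal, mic_ts, sample_rate=48000):
    """Align the stored audio signal copy with a microphone capture using
    their embedded timestamps (in seconds), so the two waveforms line up
    sample-for-sample before acoustic echo cancellation is performed."""
    # Convert the timestamp difference to a whole-sample offset.
    offset = int(round((mic_ts - copy_ts) * sample_rate))
    if offset >= 0:
        aligned = copy_signal[offset:offset + len(mic_signal)]
    else:
        # Mic window starts before the copy: pad the front with silence.
        aligned = np.concatenate([np.zeros(-offset), copy_signal])[:len(mic_signal)]
    # Pad the tail if the copy is shorter than the capture window.
    if len(aligned) < len(mic_signal):
        aligned = np.pad(aligned, (0, len(mic_signal) - len(aligned)))
    return aligned
```

A real system would additionally refine this coarse timestamp alignment (e.g., by cross-correlation) to account for device clock drift, but the timestamp gets the two waveforms into the same neighborhood.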
(31) microphone input signals—The microphone input signals for each participant computer include the host-sent streaming audio outputted by the speaker of the respective participant computer, and any audio communications inputted by the respective participant, such as utterances of the respective participants. Thus, it is inherent in the present invention that the microphone is in close proximity to the participant. The microphone input signals may also include ambient noise picked up from sources in proximity to the participant computer. In addition, the microphone input signals may also include audio communications from other participants that are outputted by the speaker of a respective participant. Consider again the example of participants P1, P2, and P3. The speaker of P1's computer outputs any audio communications from the participants P2 and P3, which are then included in the microphone input signal of P1's computer. Similar scenarios occur for P2 and P3.
(32) speaker (audio speaker or loudspeaker)—A “speaker,” as referred to herein, is an output hardware device that connects to a computer to generate sound. The signal used to produce the sound that comes from a computer speaker is created by the computer's sound card. More specifically, the “speaker” referred to herein outputs sound to the ambient environment, as opposed to an “earphone speaker” or “earphone” or speaker portion of a headphone/headset, all of which are designed to deliver the sound directly to a person's eardrums, while minimizing sound dispersion to the ambient environment. Thus, in the present invention, the microphone input signal ideally would not pick up any sound that might be delivered to an earphone. Of course, if the volume of the earphone is very high, human-perceptible sound will travel into the ambient environment, and could be picked up by a microphone. However, this is not the intended use of an earphone, and such an embodiment is excluded by the present invention. Stated another way, the speaker described herein may be characterized as an ambient speaker, or an ambient audio speaker, or a non-earphone-type ambient speaker. In the preferred embodiments of the present invention, the speaker is used to output the host-sent streaming audio at each of the participant computers, so that the respective participants can hear the streaming audio, while simultaneously engaging in remote audio communications with one or more other participants.
(33) The speaker also outputs audio generated by other participants and the microphone input signal for the participant computer will also capture this audio. However, as discussed elsewhere, this audio is not intended to be canceled out by the system components that are dedicated to perform cancellation of the host-sent streaming audio. Also, this audio generated by other participants may be canceled out, filtered, suppressed, or attenuated by other audio processing components of the videoconferencing platform as noted below as part of “signal conditioning,” and may be processed differently by different platforms depending upon the capabilities and settings of the respective platforms.
(34) audio communications—“Audio communications” refers to audio inputted by participants into their respective participant computer microphones. Audio communications are thus distinct from the streaming audio which is provided from the host computer.
(35) processed audio communication signals—“Processed audio communication signals” represent the audio signals that are outputted from the microphone input signals of the respective participant computers after the signals have undergone audio signal processing including both (i) acoustic echo cancellation to remove the streaming audio that is received at the respective participant computers, outputted by the speakers of the respective participant computers, and then picked up by the microphones of the respective participant computers, and (ii) signal conditioning to perform one or more of ambient noise suppression, and echo suppression of audio communications from one or more of the participants. The signal conditioning is conventional signal processing performed by existing videoconferencing platforms and is distinct from the acoustic echo cancellation that removes the host-sent streaming audio received at the respective participant computers. The host server or the participant computers may also perform other forms of signal conditioning, other than the two types discussed above.
(36) After removal of the streaming audio, the signals retain any audio communications inputted by the respective participants. The processed audio communication signals for each participant are provided to the remaining participants who are engaged in remote communications with each other. In this manner, the participants may engage in remote audio communications without receiving acoustic echoes of the streaming audio received by other participant computers, while still being able to individually hear the streaming audio outputted from their respective speakers.
(37) In the hub and spoke configuration, the host server performs the audio signal processing, whereas in the peer-to-peer and hybrid configurations, the participant computers perform the audio signal processing.
(38) acoustic echo cancellation (AEC)—AEC, as described herein, refers to a process that removes an originally transmitted audio signal that re-appears, with some delay, in a received audio signal. The re-appearing audio signal is also referred to in the art as “acoustic echoes.” The originally transmitted audio signal is removed via a subtraction or deletion process (also, interchangeably referred to as “phase inversion” or “polarity inversion”) using an “audio signal copy” of the originally transmitted audio signal. This process is generally implemented digitally using a digital signal processor or software, although it can be implemented in analog circuits as well. This process is also referred to herein as “echo suppression.” The originally transmitted audio signal is the host-sent streaming audio described above. The audio signal copy of the originally transmitted audio signal is interchangeably referred to herein as a “copy signal.”
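The subtraction/phase-inversion step described above can be sketched as follows. This is a deliberately minimal sketch: it assumes the audio signal copy has already been time-aligned and level-matched to the echo (the role of the timestamps described earlier), and the function name is illustrative:

```python
import numpy as np

def cancel_echo(mic_signal, copy_signal):
    """Remove the re-appearing streaming audio from a microphone signal
    using the audio signal copy. Adding the polarity-inverted copy is
    equivalent to subtracting it, leaving only the participant's own
    audio (plus any residual not matched by the copy)."""
    n = min(len(mic_signal), len(copy_signal))
    inverted = -np.asarray(copy_signal[:n], dtype=float)  # phase/polarity inversion
    return np.asarray(mic_signal[:n], dtype=float) + inverted
```

In practice the copy would first be passed through the per-participant audio signal transform so it matches what the speaker/room/microphone path actually did to the streaming audio, rather than being subtracted raw.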
(39) Conventional videoconferencing technology automatically removes certain types of background noise, as well as participant-generated audio that is picked up by respective participant computer microphones, when distributing audio signals to participants in a videoconference. These noise cancellation and conventional echo suppression techniques, which are interchangeably referred to herein as “signal conditioning,” may be implemented in parallel with the AEC process described herein, which is specifically directed to performing AEC on host-sent streaming audio. The description below does not include details of this conventional videoconferencing technology, but the present invention is fully compatible with such technology to provide an enhanced videoconferencing experience.
II. DETAILED DESCRIPTION
(40) Consider first
(41) The term “non-smart” phone refers to phones which do not have CPUs (
(42) Of course, more than one person can participate in the conference using the same device, such as two people, 425 and 427, sitting in front of the same laptop computer, 415. This can be the situation for other devices shown (including even a telephone, 413, when it has a speaker instead of or in addition to an earphone). In some cases, specially designed physical rooms (not shown) are outfitted with large screen monitors, cameras, speakers, and computing equipment, which perform the same audio-visual display and input-transmission functions as the devices shown. For example, the Zoom company calls rooms outfitted with their equipment “Zoom rooms”. A Zoom room permits more than one person—often several people around a table, or even a larger crowd in an auditorium—to join a conference occurring in several locations at the same time.
(43) Nonetheless, even with more than one person in front of and using the same interfacing device, the interfacing device captures only one audio-video stream, so all the people using that device will be referred to collectively, in the description below, as a single Participant or a single Musician.
(44) The interfacing devices are connected (433) to a transmission system (401), through which they are virtually connected to each other. The transmission system, 401, includes, but is not limited to, the Internet and other networks, telephone systems including land line systems, cell phone systems, VOIP (voice over internet protocol) systems, satellite and other radio transmission systems, as well as other wireless transmission systems such as (but not limited to) WiFi and Bluetooth. The interfacing devices may be connected (433) to the transmission system (401) in various ways, including, but not limited to, wire, coaxial cable, Ethernet cable, fiber-optics, WiFi, Bluetooth, and radio transmissions.
(45) Many video conferencing systems also include one or more computer servers “in the cloud”, such as 403, 405, and 407, which are connected (431) to the transmission system, 401. These computer servers may perform all of the video and audio processing for the video conferencing system (a central processing system) or only some of the video and audio processing (a system with a mixture of local processing and central processing). (Some peer-to-peer video conferencing systems might not include any such computer servers.) The servers may be multi-purpose, or might have specific capabilities such as data processing or video processing. They may be database servers, web servers, or video streaming servers.
(46) Consider now
(47) The transmission system shown in
(48) The video conferencing system includes a variety of local devices, with representative examples shown (515, 517, and 519). In particular, consider local device, 515, a personal computer, 543, such as, but not limited to, 415 or 417 in
(49) In general, input to the local device, 515, via keyboard, 539, microphone, 537, or webcam, 541, is processed in the CPU, 523, then converted by a modem, 521, to signals transmissible through the transmission system, 511.
(50) Local device 517, a handheld device, 545, such as a smart phone (
(51) The signals are transmitted through the transmission system, 511, to other local devices, such as 517, or are transmitted to a remote computer server (
(52) In contrast, when the local device, 519, is a telephone, 547, the user of the device can only experience the audio portion of the video conference through the device. Sound from the virtual conference can be heard through the speaker, 535, and input is obtained through the microphone, 537. As mentioned, when both the speaker, 535, and microphone, 537, are in a handset, the separation of these two functions may permit full duplex connections. Consequently, the telephone user might not have need for the present invention, even if other participants in the virtual conference do. When receiving input from the microphone, 537, the audio signal is sent to the circuit board, 525, which converts it for transmission via wire or radio wave to the telephone system, 513, which transmits it to a remote computer server, 501, via a data exchange process, 509 and a phone signal converter, 507. After that, the remote data source might process the digital signal in its CPU, 503, possibly storing some of that information in memory, 502. The CPU may also send a processed signal to the phone signal converter 507, then through the telephone system, 513, to a local device, 519, which is a telephone 547. The circuit board, 525, converts the signal to an audio signal and plays it on the speaker, 535.
(53) Current breakout room systems employ a half-duplex approach, which generally allows only one person to speak. With the filters and algorithms employed, when one person is speaking other speech tends to be suppressed. In some cases, the second speaker's voice is made dominant while the first is suppressed. If no one is speaking, and two people start speaking at once, the system usually suppresses both, with a loud “squawk”. For the most part this approach permits an acceptable level of intelligible conversation, and participants understand that only one person can talk at a time.
(54) As mentioned above, when everyone in the videoconference uses headphones, many of the noise suppression issues disappear. However, videoconference platforms are most useful when headphones are not required.
(55) Consider now
(56) In the hub-and-spoke architecture shown in
(57) Now consider
(58) To send an audio stream to all participants, the host directs a host computer (623) to send the audio stream (Host-sent Audio) to the host server (621), where the stream is copied (627), stored (625) at the host server (621), and forwarded to the participant computers (605), where it is played aloud by the speakers (611) on P1 (607) and P2 (609). If participant computers are receiving other audio, whether other participant audio streams, rock music from a YouTube video, or streaming music from a platform such as Pandora or Spotify, the sound card of each participant computer (605) mixes the Host-sent Audio received by that computer with other audio streams received by that computer and plays the audio aloud through the computer speakers (611). (Note that each individual participant may be playing different background or auxiliary audio on his or her computer.)
(59) When each or any participant speaks, the sound of the participant's voice is captured by the microphone (613) of the participant's computer. The captured sound includes Host-sent Audio that had been played aloud through the speaker (611) of that computer, as well as other background sound within the participant's environment. When a participant does not speak, the sound that is captured by the microphone still includes Host-sent Audio that is played aloud through the speaker (611) of that computer, as well as other background sound within the participant's environment. This sound input is digitized in
(60) However, in the embodiment described in
(61) All individual Participant Audio streams are then processed and distributed (603) in the identical manner as in
(62) The above discussion of
(63) To remedy this, this embodiment uses and accesses the property in today's computers which recognizes the sound device (and the kind of sound device) which is attached to an application and plays the audio produced by the application. If a particular participant, say P1, is using a headset on his or her computer (607), the embodiment has that computer send headset information back to the host server (621). (Because of the effect of headsets on employing and adjusting noise and echo suppression, some video conferencing platforms may already be doing this.) In any event, the host server (621) uses headset information to turn off the acoustic echo cancellation (629) for any individual participant audio stream associated with a participant using a headset. This ensures that the AEC does not introduce extraneous sound to the streams of headset wearers.
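The headset gating described above can be sketched in a few lines. This is a hedged illustration only: the device-kind strings, the helper names, and the list-of-samples signal representation are assumptions for the example, not an actual platform API.

```python
# Hypothetical sketch: the host server turns off acoustic echo cancellation
# (AEC) for any participant whose reported output device is a headset, since
# Host-sent Audio played into a headset cannot reach the microphone.

def should_apply_aec(output_device_kind: str) -> bool:
    """Apply AEC only when audio plays through loudspeakers the mic can hear."""
    return output_device_kind not in ("headset", "headphones", "earbuds")

def process_stream(mic_samples, audio_signal_copy, output_device_kind):
    """Pass headset wearers' audio through untouched; otherwise subtract
    the stored copy of the Host-sent Audio from the microphone input."""
    if not should_apply_aec(output_device_kind):
        return mic_samples
    return [m - c for m, c in zip(mic_samples, audio_signal_copy)]
```

In this sketch the headset information sent back to the host server reduces to a single string check, which is the essential design point: the decision to skip AEC is made per participant, per stream.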
(64) Consider now
(65) Again, there may be many participants in a particular videoconference or breakout room, but without loss of generality,
(66) In the hub-and-spoke audio distribution architecture shown in
(67) Steps 703 and 705 are used to identify any type of signal processing and distribution that may happen in the host server (701) or the participant computers (713 and 715). This processing and distributing includes signal conditioning such as performing ambient noise suppression, and echo suppression of audio communications from one or more of the participants. The signal conditioning is conventional signal processing such as performed by existing videoconferencing platforms and is distinct from the acoustic echo cancellation of the present invention that removes the host-sent streaming audio received at the respective participant computers. Notice that
(68) Now consider
(69) Now, to send an audio stream to all participants in the system shown in
(70) If participant computers are receiving other audio, the sound card of each participant computer (731) mixes the Host-sent Audio received by that computer with other audio streams received by that computer and plays the audio aloud through the computer speakers (707).
(71) When each or any participant speaks, the sound of the participant's voice is captured by the microphone (709) of the participant's computer. The captured sound includes Host-sent Audio that had been played aloud through the speaker (707) of that computer, as well as other background sound within the participant's environment. When a participant does not speak, the sound that is captured by the microphone still includes Host-sent Audio that is played aloud through the speaker (707) of that computer, as well as other background sound within the participant's environment. This sound input is digitized in
(72) The scrubbed Participant Audio is then processed in part locally (705) in individual participant computers (731) and then sent to the host server (721). At the host server, it is processed and distributed (703) to the individual computers (731), where it is turned into soundwaves by the speakers (707) and played aloud to the participants.
(73) Note that the acoustic echo cancellation (739) incorporated in this embodiment is different and distinct from audio treatments otherwise performed in the system at 705 and 703, even though these other audio treatments might also include noise or echo suppression elements. Note also that the Participant Audio stream as it enters 705 and then 703 in
(74) Again, the discussion of
(75) Again, to remedy this, this embodiment uses and accesses the property in today's computers which recognizes the sound device which is attached to an application and plays the audio produced by the application. If a particular participant, say P1, is using a headset on his or her computer (733), the acoustic echo cancellation (AEC) component (739) will recognize this and not implement acoustic echo cancellation. This ensures that the AEC does not introduce extraneous sound to the streams of headset wearers.
(76) Consider now
(77) Again, there may be many participants in a particular videoconference or breakout room, but without loss of generality,
(78) Only two participant computers are shown, labeled P1 (813) the computer of participant P1, and P2 (815) the computer of participant P2. A host computer (803) is also shown as if the host is set up like any other participant so that the host computer (803) also serves as a participant computer. Each participant or host computer in
(79) In the peer-to-peer audio distribution architecture shown in
(80) Step 805 is used to identify any type of signal processing and distribution that may happen in the host computer (803) or the participant computers (813 and 815). This processing and distributing includes signal conditioning such as performing ambient noise suppression, and echo suppression of audio communications from one or more of the participants. The signal conditioning is conventional signal processing such as performed by existing videoconferencing platforms and is distinct from the acoustic echo cancellation of the present invention that removes the host-sent streaming audio received at the respective participant computers.
(81) Now consider
(82) Now, to send an audio stream to all participants in the system shown in
(83) If participant computers are receiving other audio, the sound card of each participant or host computer mixes the Host-sent Audio received by that computer with other audio streams received by that computer and plays the audio aloud through the computer speakers (807).
(84) When each or any participant speaks, the sound of the participant's voice is captured by the microphone (809) of the participant's computer. The captured sound includes Host-sent Audio that had been played aloud through the speaker (807) of that computer, as well as other background sound within the participant's environment. When a participant does not speak, the sound that is captured by the microphone still includes Host-sent Audio that is played aloud through the speaker (807) of that computer, as well as other background sound within the participant's environment. This sound input is then digitized. The Host-sent Audio is removed from each participant audio stream using the Audio Signal Copy via acoustic echo cancellation (829), using methods well known to those skilled in the art, including but not limited to phase inversion.
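The removal of Host-sent Audio via the Audio Signal Copy can be sketched as simple phase-inverted subtraction. This is a minimal illustration under strong assumptions (the stored copy is sample-aligned with the microphone input and at matching amplitude); the names and list-of-floats representation are invented for the example.

```python
# Minimal sketch of acoustic echo cancellation by phase inversion: the
# Audio Signal Copy (the stored Host-sent Audio) is inverted and added to
# the digitized microphone input, leaving the participant's voice.

def cancel_host_audio(mic_samples, audio_signal_copy):
    assert len(mic_samples) == len(audio_signal_copy)
    return [m + (-c) for m, c in zip(mic_samples, audio_signal_copy)]

# Illustrative data: mic input = participant voice + Host-sent Audio
voice = [0.1, -0.2, 0.3]
host = [0.5, 0.5, -0.5]
mic = [v + h for v, h in zip(voice, host)]
scrubbed = cancel_host_audio(mic, host)  # approximately recovers `voice`
```

In practice the copy must first be adjusted for the transformations introduced by the speaker, the room, and the microphone, which is exactly what the Audio Signal Transform described later provides.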
(85) The scrubbed Participant Audio is then processed locally (805) in host and individual participant computers (821 and 823) and then distributed to the other peers in the peer-to-peer network, where it is turned into soundwaves by the speakers (807) and played aloud to the participants.
(86) Again, the discussion of
(87) Consider now
(88) After the Start (101) of a silence, the question (109) is, Does Participant 1 speak? If not, the microphone (
(89) However, if at step 109, Participant 1 does speak, the microphone (
(90) Because Participant 1 is speaking, Participant 2 does not speak, 119, or equivalently the system treats Participant 2 as not speaking, muting Participant 2. Likewise, because Participant 1 is speaking, Participant 3 does not speak, 133, or equivalently the system treats Participant 3 as not speaking, muting Participant 3. The same applies to additional participants in a breakout room.
(91) Because Participant 2 does not speak, 119, his or her microphone (
(92) In contrast consider
(93) A flow chart similar to
(94) In contrast to
(95) After the Start (201) the host has the system transmit audio to all the participants of the breakout room, 203. When the electronic encoded version of this audio is received by a participant's computer, the participant's computer's CPU, 523, creates a copy of it, which it stores in memory cache, 527, as in 205 for Participant 1, 225 for Participant 2, and 235 for Participant 3. Then the audio is played on each of the participants' computers' speakers, 533: as in 207 for Participant 1, 117 for Participant 2, and 131 for Participant 3.
(96) Now, the question (109) is, Does Participant 1 speak? If not, the microphone, 537, acting as input to the videoconferencing system (that is, the networked equipment and software, including computers) detects only the Host-sent Audio, 217, being played on the computer speakers, 207 (and 529). The computer's CPU, 523, retrieves the Audio Signal Copy from memory cache, 527, and uses the Audio Signal Copy to subtract the Host-sent Audio from the audio stream, 219. (As noted above, this Audio Signal Copy was made in step 205.) If there is no more Host-sent Audio (221), the process stops, 223. Of course, the networked equipment, software and platform are ready to start again at 201.
(97) However, if at step 109, Participant 1 does speak, the microphone, 537, detects Participant 1's voice and the Host-sent Audio, 211, and processes it in the CPU, 523, by using the Audio Signal Copy (retrieved from memory cache 527) to subtract the Host-sent Audio from the audio stream, 213, so that the audio stream includes Participant 1's voice without the Host-sent Audio. Then Participant 1's voice is transmitted to the system, 115 (and 511). This audio file or waveform is then conveyed through a network (not shown) to the equipment (such as a computer) of Participant 2 and Participant 3.
(98) There the audio stream of Participant 1's voice, along with the continuation of the Host-sent Audio, is played by each of the other participant's computers' speakers 429: at 117 for Participant 2, and 131 for Participant 3.
(99) More particularly, the audio is converted to sound waves by a CPU (523) and played on Participant 2's computer (or similar equipment), 117, as well as Participant 3's computer, 131. If there were additional participants, the audio would be transmitted to their equipment, combined with the Host-sent Audio, converted to sound waves, and played in the same manner.
(100) Because Participant 1 is speaking, Participant 2 does not speak, 119, or equivalently, in a step not shown in
(101) Because Participant 2 does not speak, 119, his or her microphone, (537), detects only the sound of the Host-sent Audio, 231. The computer CPU (523) retrieves the Audio Signal Copy from memory cache (527), and uses the Audio Signal Copy to subtract the Host-sent Audio from the audio stream, 233. (As noted above, this Audio Signal Copy was made in step 225.) Likewise, because Participant 3 does not speak, 133, his or her microphone (537) detects only the sound of the Host-sent Audio, 241. The computer CPU (523) retrieves the Audio Signal Copy from memory cache (527), and uses the Audio Signal Copy to subtract the Host-sent Audio from the audio stream, 243. (As noted above, this Audio Signal Copy was made in step 235.)
(102) Once Participant 1 stops speaking, if there is no more Host-sent Audio (221) the process stops for both Participant 2 and Participant 3, 223. Of course, the networked equipment, software and platform are ready to start again at 201.
(103) Otherwise, if the Host-sent Audio continues when Participant 1 stops speaking (221), the process returns to step 203.
(104) Consider now the case when one or more participants use devices such as a telephone (547) without a CPU (523) or memory cache (527). As mentioned above and known to practitioners of the art, these devices have handsets containing both the speaker, 535, and microphone, 537. Telephone handsets separate the speaker (held near the ear) from the microphone (held near the mouth) in a way that permits a lower volume of sound and nearly eliminates interference and echo. With such telephones, a person can listen and speak at the same time (full duplex). In other words, in such a telephone device, 547, even as Host-sent Audio is played over the speaker, 535, at a step analogous to 211 the microphone, 537, would detect only a voice. It would not detect the Host-sent Audio, so step 213 can be eliminated, and Participant 1's voice is transmitted to the system, 513, in a step analogous to 115. In this way
(105) Note: a speakerphone is a telephone with a loudspeaker and microphone, allowing it to be used without picking up the handset. This adds utility to the telephone, but reduces it from a full-duplex device to a half-duplex one.
(106) In an alternate embodiment, a telephone,
(107)
(108) In yet another alternative embodiment, the subtractive process would be performed by a combination of central servers and individual participants' computers.
(109) In a preferred embodiment, the Host-sent Audio stream is sent to all rooms (and all participants in every room), including all breakout rooms at the same volume setting. In an alternative embodiment, the Host-sent Audio stream is sent to some virtual rooms, but not all such rooms. In an alternative embodiment, the Host-sent Audio stream is sent to different rooms at different volume settings, so that participants in different rooms hear the Host-sent Audio at different volumes. In an alternative embodiment, different Host-sent Audio streams are sent to different rooms and at the same or different volumes.
(110) In an alternative embodiment, each Participant can adjust the volume of the Host-sent Audio stream. In this way the Host-sent Audio stream will be like a whole house music system that is played at different volumes in each room. However, because the Participant's computer knows (or can sense) the volume setting, the Host-sent Audio can be filtered out of the returning audio stream. Notice that every participant hears the Host-sent Audio but at a volume he or she prefers, and consistent with the participant's ability to hear the voices and words of other participants.
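Because the participant's computer knows (or can sense) the volume setting, the stored copy can be scaled before subtraction. The sketch below is illustrative only: it assumes a purely linear gain model (no tone shaping), and all names are invented for the example.

```python
# Hedged sketch: when each participant plays the Host-sent Audio at a
# preferred volume, the known volume gain is applied to the stored copy
# before it is subtracted from the microphone input.

def scrub_with_volume(mic_samples, host_copy, volume_gain):
    """Subtract the Host-sent Audio as actually played (copy x gain)."""
    return [m - volume_gain * c for m, c in zip(mic_samples, host_copy)]
```

The design point is that the subtraction must track what the loudspeaker actually emitted, not the raw stream as received, which is why the volume setting has to be known to the filtering step.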
(111) In an alternative embodiment, the teachings of this invention can be used to enable a live performance combining together in real-time the music and song of multiple musicians who are located in physically separate locations, but linked through a teleconferencing platform and system such as Zoom. In a preferred embodiment, the musicians need to be interfacing with the videoconferencing system using computing devices with CPUs (523) and memory caches (527).
(112) To do this, each musician receives a Host-sent Audio, which includes separable tempo and tone reference, such as a metronomic beat on a particular pitch. The Host-sent Audio might also include a reference orchestral or choral recording of the work to help the musicians harmonize and blend. Each musician hears the Host-sent Audio, and adds his voice or instrument to the group. The invention then uses the Audio Signal Copy to send this combined audio stream back to the host server, after subtracting everything except (a) the musician's addition (voice or instrument) and (b) the tempo and tone reference. The server then uses software and hardware such as those of Vocalign to automatically synchronize the tempo and tone of the various musicians in real time, after which the tempo and tone reference is subtracted from the audio stream. The resulting multi-voice orchestral and/or choral live performance is then transmitted with a slight delay (due to processing) to the non-musicians participating in the videoconference.
(113) In this embodiment, the non-musicians who hear the performance are not situated in the same room (or breakout room) as any of the musicians. Each musician can perform his or her own part, but cannot hear any other musician. In this sense, each musician might be considered in his or her own breakout room.
(114) For a detailed description, consider
(115) Again, without loss of generality, more than one person might be listening and performing at any one computer (for example, in
(116) Consider now the next 8 steps in the process that occur after the Host-sent Audio is transmitted to each of the musicians (steps 305, 307, 309, 311, 313, 315, 317, and 319). The steps are essentially identical for each musician, so without loss of generality, consider those with respect to Musician 1 at his or her computing device.
(117) An Audio Signal Copy (with a separate copy of the tempo and tone reference) is created at Musician 1's computer, 305, by the CPU (523) and stored in memory cache (527). Then the Host-sent Audio is played on the speaker (533) of Musician 1's computer, 307. The computer queries whether Musician 1 is producing audio (singing, playing an instrument, speaking dialogue, or otherwise performing), 309. If not, then Musician 1's microphone (537) detects only the Host-sent Audio, 317. The computer CPU (523) retrieves the Audio Signal Copy from the memory cache (527) and uses the Audio Signal Copy to subtract from the audio stream that portion of the Host-sent Audio which is not the tempo and tone reference, 319. That is, when leaving step 319, the audio consists only of the tempo and tone reference channel.
(118) On the other hand, if Musician 1 is performing (309), the microphone (537) detects a voice and/or music plus the Host-sent Audio played in the background, 311. The computer CPU (523) uses the Audio Signal Copy (retrieved from memory cache, 527) to subtract from the audio stream that portion of the Host-sent Audio which is not the tempo and tone reference, 313. After that, Musician 1's voice and/or music is transmitted to the system, 511, and the host server (501), along with a tempo and tone reference channel, 315. As Musician 1 alternates between performing and not performing, a continuous electronic audio stream issues out of 315 and 319 (alternately), being transmitted from Musician 1's computer through a network (such as the Internet) to the videoconferencing host server, 353 (and 501).
(119) At approximately the same time as steps 305, 307, 309, 311, 313, 315, 317, and 319 are being executed at Musician 1's computer, the same processes are being executed at Musician 2's computer. The eight steps: 305, 307, 309, 311, 313, 315, 317, and 319 accomplished in Musician 1's computer are replicated for Musician 2 (and with “Musician 2” substituted for “Musician 1” as necessary) in Musician 2's computer; and re-labeled as steps 325, 327, 329, 341, 343, 345, 347 and 349 respectively (351). The resulting continuous electronic audio stream from Musician 2's computer is transmitted through a network, 511, (such as the Internet) to the videoconferencing host server, 353 (and 501).
(120) The process of transmitting the Host-sent Audio to different musicians, and then receiving the return audio with added musical stylings, may have induced some time lags, so that the host server does not receive the audio streams at the same time. In addition, the audio reproduction and recording equipment (speakers and microphones) at each musician's computer may have induced variations in tempo and tone. At step 353, the host server CPU (503) uses processes such as those in Vocalign by Synchro Arts to synchronize the tempos and tones of the various audio streams received from the various musicians. While
(121) After that, the host server CPU (503) subtracts the tempo and tone reference from the audio stream, 355, and sends the stream to all other participants in the video conference through the transmission system (511 and 513), who may be in various rooms (main, breakout, or otherwise), 357. If there is no more music to be performed (359) the process stops, 361. Otherwise, the process continues as more Host-sent Audio (with separable tempo and tone reference channel) is transmitted to the musicians, 303.
(122) The acoustic echo cancellation as taught above is most effective if the microphone input is very sensitive and omnidirectional, registering all of the sound produced by the computer loudspeakers. An example might be the microphone array used for a speakerphone and placed on a table (or in the ceiling) of a conference room, where it is designed to capture the sound produced by anyone, anywhere in the room.
(123) However, this is not always the case. What the microphone registers is frequently different from what the speakers produce (and our ears hear) for many reasons, and what sounds the speakers produce can be different than the raw electronic waveforms they are intended to turn into sound waves. Some reasons for this include:
a. Directional microphones (also known as cardioid mics) are designed to pick up sound from a person speaking directly into the mic (or a musical instrument playing directly into it). That is, the mic is designed to pick up sound waves that are produced close to the mic (not farther away) and sent directly into it (so the mic must be held in one particular orientation to the sound source). As an example, a group of musicians playing together may use directional microphones so that the audio engineer can separate the sound each produces and mix their sounds and volumes for a more pleasing musical blend.
b. Speakers and sound systems have tone and volume settings, whereby the user can change the frequency envelope of the waveform to his or her own liking, departing to some extent from the original waveform. When the speakers and sound system are incorporated in the computer (e.g., a laptop), the volume and tone may be adjusted through the computer, so that the computer may "know" how the speaker output differs from the initial waveform. However, computer speakers are often accessory devices attached to the computer, with their own volume and tone controls separate from (and not in communication with) the computer.
c. Speakers, especially ones produced for high-fidelity sound systems, have different acoustic characteristics which affect the quality and richness of the sound, and thereby the waveform produced.
d. Room acoustics differ, depending upon size and surface treatments, which affects what the microphone picks up. Room surfaces (including clothing) will absorb more of some frequencies than others and reflect more of some frequencies than others (creating vibrant bounce and echo), and this is different for different surfaces and different rooms.
e. Placement of the speakers and microphone differs for different participants, especially if the speakers and microphone are accessory equipment. This will affect microphone input differently, especially if the mics are directional.
f. The audio processing circuitry of participant computers differs, which may cause differences in outputting waveforms through the speakers and in processing the microphone input signal.
(124) The disclosure above discussed how a participant who is using headphones does not need the acoustic echo cancellation taught by the present invention, because the Host-sent Audio cannot enter the computer microphone to create noise or feedback. To similar effect, the disclosure also mentioned that when the volume of the music played through the computer speaker is sufficiently low, even though present and audible to those in the vicinity of the computer microphone, it may not create noise, feedback, or otherwise register as sound input to the microphone. This is often due in part to the directionality and near-field sensitivity of the microphone used—that is, many microphones are designed to pick up the vocalizations and sounds produced by a user speaking into the front of the microphone from near the microphone, but not pick up sounds from other sound sources (such as speakers placed behind or in the plane of the microphone) or otherwise attenuated sounds.
(125) Nonetheless, as the user turns up the volume at which the computer plays the Host-sent Audio, the user may encounter feedback noise.
(126) The effectiveness of the acoustic echo cancellation methods described above can be enhanced if the computer knows (and can measure and calculate) the difference between the waveform of any Host-sent Audio signal received by a participant's computer, and the waveform of the Host-sent Audio detected (or recorded) by the microphone input to that participant's computer, after the Host-sent Audio is first processed by that computer and played on that participant's loudspeakers. Then the Audio Signal Copy used by the acoustic echo cancellation can be adjusted to reflect that difference in accordance with the measurements and calculations made of that difference.
(127) The present invention teaches a calibration method of measuring and calculating that difference. However, because of the many factors mentioned above which contribute to the difference between the waveform of the Host-sent Audio received by the computer from the videoconferencing platform, and the waveform of the Host-sent Audio as played on the participant's loudspeakers, and registered by the participant's microphone, the calibration works best if it is implemented after the participant has made the volume and tone settings for the loudspeakers.
(128) Theoretically, the computer could play every musical note (waveform frequency) through the speaker at a specific volume (waveform amplitude) and compare the microphone response, record the difference, do it again repeatedly at different volumes, and construct a matrix of the differences to be used in the frequency transformation calculations which create the Audio Signal Copy from the Host-sent Audio.
(129) More practically, this matrix can be approximated by creating a Test Audio Pattern that includes sample frequencies (say, some frequencies from every octave on the piano), repeated at a half dozen different volumes; the differences between the waveform of the Test Audio Pattern and the waveform registered by the microphone are calculated and saved as a transformation matrix. The differences for other frequencies and amplitudes are estimated by linear interpolation. The transformation matrix and interpolation calculations comprise the Audio Signal Transform.
(130) (As known to those skilled in the art, a non-linear interpolation may be more accurate, but requires constructing a model of the acoustic environment of the equipment as used in the participant's room, as discussed below.)
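The transformation matrix and linear interpolation just described can be sketched as follows. This is a hedged illustration: the sampled frequencies, volumes, and gain values are invented for the example, and a real implementation would store phase and delay information as well as gain.

```python
# Sketch of the Audio Signal Transform as a gain matrix sampled at test
# frequencies and volumes, with piecewise-linear (bilinear) interpolation
# for intermediate frequency/volume pairs.

def interpolate_1d(x, xs, ys):
    """Piecewise-linear interpolation; clamps outside the sampled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])

class AudioSignalTransform:
    def __init__(self, freqs, volumes, gain_matrix):
        self.freqs = freqs        # sampled test frequencies (Hz)
        self.volumes = volumes    # sampled test amplitudes
        self.gains = gain_matrix  # gains[i][j]: measured gain at freqs[i], volumes[j]

    def gain(self, freq, volume):
        # interpolate along frequency at each sampled volume, then along volume
        col = [interpolate_1d(freq, self.freqs, [row[j] for row in self.gains])
               for j in range(len(self.volumes))]
        return interpolate_1d(volume, self.volumes, col)

t = AudioSignalTransform([100.0, 1000.0], [0.2, 0.8],
                         [[1.0, 0.8],   # gains at 100 Hz
                          [0.6, 0.4]])  # gains at 1000 Hz
```

Usage: `t.gain(550.0, 0.5)` interpolates halfway along both axes of the sampled grid, returning the estimated speaker-to-microphone gain for a tone that was never directly tested.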
(131) To further refine the Audio Signal Transform, the Test Audio Pattern is played, but with a shift in frequency and volume. The Audio Signal Copy is created using the Audio Signal Transform and acoustic echo cancellation is performed. For frequencies that have not been suppressed (or cancelled), the transformation matrix is adjusted. This can be repeated for further refinement if necessary.
(132) More specifically, in an alternative embodiment, the acoustic echo cancellation effectiveness is enhanced through methods of artificial intelligence that employ digital sampling, filtering, computational analysis, and computer pattern recognition. This is done by comparing a sample Host-sent Audio that is received by (or stored in) the computer and sent to the computer's speakers, with what the computer “hears” of that audio via its microphone input, then creating a model of that audio transformation to adjust the acoustic echo cancellation process. The sample Host-sent Audio is referred to as the Test Audio Pattern.
(133) More particularly, the computer plays the Test Audio Pattern (a specifically chosen sample Host-sent Audio) that is used to measure and calculate the amount and kind of transformation that occurs—from the waveform of the sample Host-sent Audio to the sounds received by the microphone input. The Test Audio Pattern uses a variety of different pitches with different tones and at different volumes to create a full range model. As known by those skilled in the art, this kind of sampling is used to create a frequency and volume model of the transformation, and is referred to herein as the Audio Signal Transform.
(134) Then, rather than using a copy of the Host-sent Audio itself for the acoustic echo cancellation, this copy is adjusted via the Audio Signal Transform before waveform subtraction and other acoustic treatments are performed by the present invention. This is more fully explained in
(135) Consider
(136) As is known to those skilled in the art, noise suppression algorithms are based on such models, often adjusted in real time. See for example the methods and models presented in U.S. Pat. No. 11,100,941 (Sargsyan et al.) assigned to Krisp Technologies, Inc. a leading producer of sound suppression applications. This patent is incorporated herein by reference.
(137) Then, with the computer audio output to the computer speakers already adjusted by the participant with respect to volume and tone (i.e., bass versus treble), 909, the computer plays the Test Audio Pattern using the computer's speakers as adjusted by the participant, 911, and records the digitized sound input at the computer microphone, 913. This input, called the Raw Input Signal Baseline, will include both background noise, as well as some portions of the Test Audio Pattern.
(138) The computer then creates a Filtered Input Signal Baseline, 915, by filtering out (subtracting) the background noise previously modeled in 907.
(139) Using methods known to those skilled in the art, the computer then calculates the transformation function by which particular tones, pitches and amplitudes of sound in the Test Audio Pattern are transformed into the Filtered Input Signal Baseline. The calculated transformation function is then saved as the Audio Signal Transform, 919, and the process stops, 921.
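One simple way the transformation function could be estimated, shown here purely as a hedged sketch, is a per-tone amplitude ratio between the Test Audio Pattern and the Filtered Input Signal Baseline. It assumes each test tone has already been isolated and aligned; a real implementation would also estimate phase and delay, typically in the frequency domain.

```python
import math

# Sketch: estimate, for each test-tone frequency, the gain by which the
# played Test Audio Pattern was transformed into the Filtered Input
# Signal Baseline (the microphone recording with background noise removed).

def tone_amplitude(samples):
    """Peak-amplitude estimate of one isolated sinusoidal tone via its RMS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms * math.sqrt(2.0)

def calibrate(test_tones, registered_tones):
    """Return {frequency: gain}, saved as the Audio Signal Transform."""
    return {f: tone_amplitude(registered_tones[f]) / tone_amplitude(played)
            for f, played in test_tones.items()}

# Illustrative data: a unit-amplitude test tone (5 full cycles over 100
# samples) that the microphone registers at half amplitude.
played = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
transform = calibrate({440: played}, {440: [0.5 * s for s in played]})
```

Run once per test frequency and volume, the resulting dictionary of gains corresponds to one slice of the transformation matrix discussed earlier.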
(140) As mentioned above, the Audio Signal Transform is refined by playing the Test Audio Pattern with a pitch and volume shift through the speakers, recording the microphone input, performing acoustic echo cancellation using an Audio Signal Copy based on the Audio Signal Transform of the Test Audio Pattern, and adjusting the Audio Signal Transform accordingly, but this is not shown in
(141) As is known to those skilled in the art, more (and more varied) sampling, using methods of and analogous to those referenced in the Krisp Technologies patent, including but not limited to creating models of the sound transformation using computer pattern recognition, artificial intelligence, and neural networks, improves the calibration and calculation of the Audio Signal Transform.
(142) As mentioned above, in some embodiments, the Test Audio Pattern is sent from a cloud-based server in the videoconferencing platform, every time the user requests calibration. Alternatively, the Test Audio Pattern is downloaded to a participant computer only once, as part of the videoconferencing application, stored on the participant computer, and run locally. Likewise, depending upon whether the videoconferencing platform is hub-and-spoke, or more distributed, in some embodiments the calibration calculation is made by a system server in the cloud, whereas in others it is made locally.
(143) In an alternative embodiment, this calibration is accomplished in the background using the Host-sent Audio, and methods of sampling, and computer pattern recognition known to those skilled in the art.
(144) The Audio Signal Transform is used to enhance the acoustic echo cancellation process at a particular participant's computer by creating an Audio Signal Copy that is more suited to the particular audio settings, equipment, and acoustic environs of that computer. Creating this customized Audio Signal Copy is detailed in the flow chart of
(145) As the process starts (1001), the computer receives Host-sent Audio (1003). The computer makes a copy of the Host-sent Audio (1005).
(146) The computer does two different things with this Host-sent Audio. It applies the Audio Signal Transform to the copy of the Host-sent Audio (1007). This modified copy of the Host-sent Audio is saved as the Audio Signal Copy (1009) and this part of the process stops (1017).
(147) The computer also sends the raw Host-sent Audio toward the computer speaker or sound system (1011), where it is modified in accordance with the participant-adjusted volume and tone settings as well as the specifications and characteristics of the amplifiers, mixers, speakers, and other audio equipment (1013). The modified Host-sent Audio continues on its way to the speaker (1015), and this part of the process stops (1017).
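The two branches of the flow chart described above can be sketched as follows. This is an illustrative sketch only; the function name `process_host_audio` and the overlap-add framing are assumptions, and the step numbers in the comments refer to the flow chart steps recited in the text.

```python
import numpy as np

def process_host_audio(host_audio: np.ndarray,
                       transform: np.ndarray,
                       n_fft: int = 1024):
    """Sketch of the two branches: one copy of the Host-sent Audio is
    shaped by the Audio Signal Transform to become the Audio Signal Copy
    (steps 1005-1009); the raw audio is passed on toward the speaker
    chain unchanged (step 1011)."""
    copy = host_audio.copy()                       # step 1005
    # Apply the transform frame-by-frame via overlap-add (step 1007).
    hop = n_fft // 2
    window = np.hanning(n_fft)
    out = np.zeros(len(copy))
    for start in range(0, len(copy) - n_fft + 1, hop):
        seg = window * copy[start:start + n_fft]
        shaped = np.fft.irfft(np.fft.rfft(seg) * transform, n_fft)
        out[start:start + n_fft] += shaped
    audio_signal_copy = out                        # step 1009
    to_speaker = host_audio                        # step 1011 (raw path)
    return audio_signal_copy, to_speaker
```

The point of the design is that only the internal copy is shaped by the transform; the audio actually played is still modified by the real speaker chain, and the shaped copy approximates what the microphone will hear of it.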
(148) In this alternative embodiment,
(149) For similar reasons,
(150) In some embodiments, the Audio Signal Transform process as described in
(151) In this way, the alternative embodiment uses computer learning and pattern recognition, often referred to as artificial intelligence, to enhance the acoustic echo cancellation of the present invention.
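Once the Audio Signal Copy exists, the echo cancellation itself can be sketched with a standard normalized least-mean-squares (NLMS) adaptive filter that uses the Audio Signal Copy as its far-end reference. This is a conventional textbook technique offered as a sketch of how the transformed reference could be consumed, not the patent's claimed method; the function name `nlms_echo_cancel` and its parameters are assumptions.

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray,
                     reference: np.ndarray,
                     taps: int = 64,
                     mu: float = 0.5,
                     eps: float = 1e-6) -> np.ndarray:
    """Subtract the echo of `reference` (e.g., the Audio Signal Copy)
    from the microphone signal using an NLMS adaptive filter."""
    w = np.zeros(taps)          # adaptive filter coefficients
    buf = np.zeros(taps)        # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        echo_hat = w @ buf      # estimated echo component
        e = mic[n] - echo_hat   # residual = mic minus estimated echo
        out[n] = e
        # Normalized LMS coefficient update
        w += (mu / (eps + buf @ buf)) * e * buf
    return out
```

Because the Audio Signal Copy already reflects the speaker chain and room, the adaptive filter has less mismatch to model, which is the enhancement the paragraph above describes.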
(152) The processing functions performed by the host server, host computer, and respective participant computers are preferably implemented in software code which is executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers, within the respective host server, host computer, and participant computer. While not shown in the figures, each of the host server, host computer, and respective participant computers includes such processors.
(153) The software code can be included in an article of manufacture (e.g., one or more tangible computer program products) having, for instance, non-transitory computer readable storage media. The storage media has computer readable program code stored therein that is encoded with instructions for execution by a processor for providing and facilitating the mechanisms of the present invention.
(154) The storage media can be any known media, such as computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium. The storage media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present invention as discussed above.
(155) It should be appreciated by those skilled in the art that various modifications and variations may be made to the present invention without departing from the scope and spirit of the invention. It is intended that the present invention include such modifications and variations as come within the scope of the present invention.