METHOD AND SYSTEM FOR VOICE CONFERENCING WITH CONTINUOUS DOUBLE-TALK
20220303386 · 2022-09-22
CPC Classification
H04M3/002 (ELECTRICITY)
Abstract
A method and system for improving communications conferencing systems that experience continuous double-talk where the communication includes an intended continuous or intermittent soundtrack or other intended continuous sound. The technology as disclosed and claimed herein uses several techniques to mask the residual echo and make it less audible.
Claims
1-24. (canceled)
25. A method of audio conferencing, comprising: receiving a soundtrack signal; receiving a far-end audio signal from a far end; combining the soundtrack signal with the far-end audio signal to generate a far-end reference signal; playing back the far-end reference signal through a near-end speaker; generating a near-end audio signal with a near-end microphone; generating a near-end transmit speech signal by performing acoustic echo cancellation and residual echo suppression on the near-end audio signal, wherein a level of the residual echo suppression that is performed depends on the level of the soundtrack signal; and transmitting the near-end transmit speech signal to the far end.
26. The method of audio conferencing of claim 25, wherein the soundtrack signal is a near-end soundtrack signal, the method comprising: combining the near-end soundtrack signal with the far-end audio signal to generate the far-end reference signal; and generating the near-end transmit speech signal by performing acoustic echo cancellation and residual echo suppression on the near-end audio signal, wherein a level of the residual echo suppression that is performed depends on the level of the near-end soundtrack signal.
27. The method of audio conferencing of claim 26, further comprising: receiving the near-end transmit speech signal at the far end; receiving a far-end soundtrack signal; combining the far-end soundtrack signal with the near-end transmit speech signal thereby generating a near-end reference signal; and playing back the near-end reference signal through a far-end speaker.
28. The method of audio conferencing of claim 27, further comprising: generating a far-end audio signal with a far-end microphone; performing acoustic echo cancellation and residual echo suppression on the far-end audio signal to generate a far-end transmit speech signal, wherein a level of the residual echo suppression that is performed is responsive to the level of the far-end soundtrack signal; and transmitting the far-end transmit speech signal to the near end.
29. The method of audio conferencing of claim 25, wherein generating the near-end transmit speech signal further comprises: adding comfort noise to the near-end transmit speech signal.
30. The method of audio conferencing of claim 29, wherein a level of the comfort noise added to the near-end transmit speech signal depends on the level of the soundtrack signal.
31. The method of audio conferencing of claim 25, wherein the soundtrack signal is a far-end soundtrack signal, the method comprising: combining the far-end soundtrack signal with the far-end audio signal to generate a far-end reference signal; and generating the near-end transmit speech signal by performing acoustic echo cancellation and residual echo suppression on the near-end audio signal, wherein the level of the residual echo suppression that is performed depends on the level of the far-end soundtrack signal.
32. The method of audio conferencing of claim 31, further comprising: receiving the near-end transmit speech signal at the far end; combining the far-end soundtrack signal with the near-end transmit speech signal thereby generating a near-end reference signal; and playing back the near-end reference signal through a far-end speaker.
33. The method of audio conferencing of claim 31, wherein generating the near-end transmit speech signal further comprises: adding comfort noise to the near-end transmit speech signal.
34. The method of audio conferencing of claim 33, wherein a level of the comfort noise added to the near-end transmit speech signal depends on the level of the far-end soundtrack signal.
35. A non-transitory computer-readable medium, the computer-readable medium including instructions that when executed by a computer, cause the computer to perform operations for providing audio conferencing, comprising: receiving a soundtrack signal; receiving a far-end audio signal from a far end; combining the soundtrack signal with the far-end audio signal to generate a far-end reference signal; playing back the far-end reference signal through a near-end speaker; generating a near-end audio signal with a near-end microphone; generating a near-end transmit speech signal by performing acoustic echo cancellation and residual echo suppression on the near-end audio signal, wherein a level of the residual echo suppression that is performed depends on the level of the soundtrack signal; and transmitting the near-end transmit speech signal to the far end.
36. The non-transitory computer-readable medium of claim 35, wherein the operation of generating the near-end transmit speech signal further comprises: adding comfort noise to the near-end transmit speech signal.
37. The non-transitory computer-readable medium of claim 36, wherein a level of the comfort noise added to the near-end transmit speech signal depends on the level of the soundtrack signal.
38. The non-transitory computer-readable medium of claim 35, wherein the soundtrack signal is a near-end soundtrack signal, and the level of the residual echo suppression that is performed depends on the level of the near-end soundtrack signal.
39. The non-transitory computer-readable medium of claim 35, wherein the soundtrack signal is a far-end soundtrack signal, and the level of the residual echo suppression that is performed depends on the level of the far-end soundtrack signal.
40. An audio conferencing system that provides audio conferencing based at least in part on a soundtrack signal and a far-end audio signal received from a far end, the system comprising: a module to combine the soundtrack signal with the far-end audio signal to generate a far-end reference signal; an output for playing the far-end reference signal back through a near-end speaker; an input for receiving a near-end audio signal from a near-end microphone; an acoustic echo cancellation and residual echo suppression module to generate a near-end transmit speech signal by performing acoustic echo cancellation and residual echo suppression on the near-end audio signal, wherein a level of the residual echo suppression that is performed depends on the level of the soundtrack signal.
41. The audio conferencing system of claim 40 wherein the soundtrack signal is a near-end soundtrack signal, and the level of the residual echo suppression that is performed depends on the level of the near-end soundtrack signal.
42. The audio conferencing system of claim 40 wherein the soundtrack signal is a far-end soundtrack signal, and the level of the residual echo suppression that is performed depends on the level of the far-end soundtrack signal.
43. The audio conferencing system of claim 40 further comprising: a comfort noise generating module to add comfort noise to the near-end transmit speech signal, the level of the comfort noise depending on the level of the soundtrack signal.
44. The audio conferencing system of claim 40, further comprising: a near-end sound-bar including the near-end speaker and the near-end microphone, wherein the acoustic echo cancellation and residual echo suppression module are implemented in one or more digital signal processors in the near-end sound-bar.
Description
BRIEF DESCRIPTION OF THE DRAWING
[0016] For a better understanding of the present technology as disclosed, reference may be made to the accompanying drawings.
[0023] While the technology as disclosed is susceptible to various modifications and alternative forms, specific implementations thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description presented herein are not intended to limit the disclosure to the particular implementations as disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the present technology as disclosed and as defined by the appended claims.
DESCRIPTION
[0024] According to the implementation(s) of the present technology as disclosed, various views are illustrated in
[0025] One implementation of the present technology as disclosed comprising a conferencing system teaches a novel system and method for a conferencing system experiencing continuous or intermittent double-talk. The technology as disclosed and claimed provides a solution to this problem by masking the residual echo and making it less audible. The main approaches include: mixing the added sound from the soundtrack into the Tx voice signal, which naturally masks the residual echo; controlling the aggressiveness of the RES based on the level of the extra sound, such that when the extra sound is low the RES is applied as in a standard voice call, and when the extra sound is loud less RES is applied since the echo is naturally masked by the extra sound; and adjusting the level of comfort noise based on how loud the extra sound is.
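The three approaches above can be illustrated with a minimal per-frame sketch. The following Python fragment is not part of the disclosure: the FIR filter, the thresholds, and the gain values are hypothetical, and a real RES would act selectively in time and frequency rather than applying a single broadband gain.

```python
import numpy as np

def process_frame(far_end, soundtrack, mic, aec_filter):
    """One frame of the near-end chain: mix, cancel echo, suppress, add comfort noise."""
    # Reference played through the near-end speaker: far-end speech plus soundtrack.
    reference = far_end + soundtrack

    # AEC: subtract a linear echo estimate (aec_filter is a hypothetical
    # fixed FIR standing in for an adaptive filter).
    echo_est = np.convolve(reference, aec_filter)[: len(mic)]
    error = mic - echo_est

    # Soundtrack (TAS) level in dB steers the RES depth and the comfort-noise level.
    tas_level_db = 10 * np.log10(np.mean(soundtrack ** 2) + 1e-12)
    t = np.clip((tas_level_db + 60.0) / 40.0, 0.0, 1.0)  # 0 = quiet TAS, 1 = loud TAS

    res_gain = 10 ** (-(30.0 - 24.0 * t) / 20.0)  # less suppression when the TAS is loud
    comfort = np.random.randn(len(error)) * (1.0 - t) * 1e-3  # CNG fades out as TAS rises

    return error * res_gain + comfort
```

The returned frame stands in for the Tx voice signal; the soundtrack masks whatever residual echo survives the gain stage.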
[0026] The details of the technology as disclosed and various implementations can be better understood by referring to the figures of the drawing. Referring to
[0027] Modulating the Residual Echo Suppression (RES): For one implementation of the technology as disclosed and claimed, the aggressiveness of the RES 120 is controlled based on the level of the TAS 134. If the TAS is low, the RES is aggressive; if the TAS is high, the RES can be gentle. The TAS is fed back 130 to the RES as a control parameter. A masking technique can also be utilized: for one implementation, the technology performs a spectral analysis of the TAS and of the residual echo and determines the aggressiveness of the RES based on how well the TAS masks the residual echo.
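The two control strategies in this paragraph, the level-based rule and the spectral-masking rule, can be sketched as follows. All thresholds, suppression depths, and the 6 dB masking margin are illustrative assumptions, not values taken from the disclosure.

```python
import numpy as np

def res_depth_db(tas_level_db, low_db=-60.0, high_db=-20.0,
                 max_depth_db=30.0, min_depth_db=6.0):
    # Level-based rule: low TAS -> aggressive RES, loud TAS -> gentle RES.
    t = np.clip((tas_level_db - low_db) / (high_db - low_db), 0.0, 1.0)
    return max_depth_db - t * (max_depth_db - min_depth_db)

def band_res_depth_db(tas_power, echo_power, margin_db=6.0, max_depth_db=30.0):
    # Masking rule: in each spectral band where the TAS power exceeds the
    # estimated residual-echo power by margin_db, the echo is treated as
    # masked and no suppression is applied in that band.
    tas_db = 10 * np.log10(np.asarray(tas_power) + 1e-12)
    echo_db = 10 * np.log10(np.asarray(echo_power) + 1e-12)
    return np.where(tas_db >= echo_db + margin_db, 0.0, max_depth_db)
```

In this sketch the returned depth in dB would set the floor of the suppression gain applied by the RES 120, either broadband or per band.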
[0028] Modulating the Comfort Noise Generator (CNG): The purpose of the comfort noise generator 128 is to create shaped random noise that matches the background noise level in the room. Comfort noise is required because the RES also attenuates the room noise received by the microphone 114; without comfort noise, the far-end person would potentially hear the noise in the room constantly change when the RES 120 is active. The technology as disclosed and claimed herein uses the TAS 134 to determine how much comfort noise to add. The TAS 134 is fed back 132 to the CNG 128 as a control parameter. When the TAS 134 is high, the room noise is masked and no comfort noise is required. When the TAS is low, the system uses comfort noise processing 128. For one implementation, separate audio inputs into the AEC 118 and RES 120 are utilized for the far-end speech 102 and the TAS 134 added sound.
[0029] One application of the technology as disclosed and claimed is that of a sound-bar 142 used with gaming application systems 148. For one implementation, sound is projected from one or more speakers 144 in the sound-bar unit 142 and sound is received by a microphone 146 integrated in the sound-bar unit 142. The technology as disclosed and claimed provides a voice chat feature so that a user has the ability to talk naturally with their teammates, with functionality similar to headset-based voice chat but without having to wear a headset with speakers and a microphone.
[0030] One implementation of the technology as disclosed and claimed is a conferencing system 140 for transmission of voice and background sounds. The system includes a conferencing application 150 operating on a server 148 or other computing device coupled to a network 148, thereby establishing a conferencing link between a near-end conferencing application generated user interface and a far-end conferencing application 162 generated user interface. For one implementation, the near-end user interface is interactive with various input devices, such as a mouse, keyboard, joystick, or other input device that communicates with the server 148, and the user interface is displayed on a monitor 154. The near-end user interface has a near-end speaker 144 and a near-end microphone 146 communicably coupled 156 with a near-end computing device 148 that processes said near-end user interface with a processor 152. The far-end conferencing application 162 generated user interface has a far-end speaker 158 and a far-end microphone 160 coupled with a far-end computing device 164 that processes said far-end user interface with a processor 168.
[0031] For one implementation of the technology as disclosed and claimed, the conferencing application 150, processing with a processor 152 on the computing device 148, generates one or more of intermittent and continuous soundtrack signals. The near-end conferencing application 150 generated user interface and said far-end conferencing application 162 generated user interface receive and project voice sound signals with the microphones 146 and 160, and receive and project the one or more of the intermittent and continuous soundtrack signals produced by the conferencing applications 150 and 162 processing with the processors 152 and 168 on the computing devices 148 and 164. For one implementation of the technology as disclosed and claimed, the conferencing application 150 has a near-end digital signal processor function processing on the processor 152 that combines one or more of the intermittent and continuous soundtrack with an AEC and RES processed near-end speech signal, thereby generating and outputting a Tx voice signal 126. For one implementation of the technology as disclosed and claimed, the near-end digital signal processor function adjusts a level of a residual echo suppression 120 responsive to the level and frequency contents of the one or more intermittent and continuous soundtrack signals.
[0032] For one implementation of the conferencing system as disclosed and claimed, the conferencing application has a far-end digital signal processor function, processed by the processor 168, that combines one or more of the intermittent and continuous soundtrack with a far-end AEC and RES processed far-end speech signal, thereby generating and outputting the far-end speech signal 102. For one implementation, the conferencing application 150 has the near-end digital signal processor function processing on the processor 152 that combines one or more of the intermittent and continuous soundtrack 134 with the comfort noise generator 128 processed near-end speech output, to thereby generate and output the Tx voice signal 126. For one implementation, the near-end digital signal processor function adjusts a level of a comfort noise generator 128 responsive to the level and frequency contents of the one or more intermittent or continuous soundtrack signal 134. For one implementation of the technology as disclosed and claimed, the conferencing application is a gaming application, where the gaming application generates the one or more of intermittent and continuous soundtrack signals. For one implementation, the near-end digital signal processor function is integrated with a sound-bar 142, where the near-end speaker 144 and the near-end microphone 146 are part of the sound-bar 142 and are integrally coupled with the near-end digital signal processor function.
[0033] One implementation of the technology as disclosed and claimed is a method of conferencing for transmitting voice and background sound. The method includes operating a conferencing application 150 with a processor 152 on a server or other computing device 148 coupled to a network 148, such as a Wide Area Network (WAN) including an Internet Service Provider (ISP), thereby establishing a conferencing link between a near-end conferencing application generated user interface and a far-end conferencing application generated user interface. The near-end user interface has a near-end speaker 144 and a near-end microphone 146 coupled with a near-end computing device 148, which processes said near-end user interface with the processor 152 and displays the user interface on a near-end monitor 154. The method includes the far-end conferencing application generating a far-end user interface having a far-end speaker and a far-end microphone coupled with a far-end computing device, which processes said far-end user interface with a far-end processor 168. One implementation of the method includes generating one or more of intermittent and continuous soundtrack signals with said conferencing applications, and receiving and projecting voice sound signals and the one or more of the intermittent and continuous soundtrack signals at said near-end conferencing application generated user interface and at said far-end conferencing application generated user interface.
The method includes combining one or more of the intermittent and continuous soundtrack with an AEC and RES processed near-end speech signal by said conferencing application having a near-end digital signal processor function, thereby generating and outputting a Tx voice signal, where the near-end digital signal processor function adjusts a level of a residual echo suppression responsive to the level and frequency contents of the one or more intermittent and continuous soundtrack signals.
[0034] One implementation of the method of conferencing as disclosed and claimed herein includes combining one or more of the intermittent and continuous soundtrack with an AEC and RES processed far-end speech signal, thereby generating and outputting a Tx voice signal, with said conferencing application having a far-end digital signal processor function. One implementation of the method of conferencing includes combining one or more of the intermittent and continuous soundtrack with the comfort noise generator processed near-end signal, to thereby generate and output the Tx voice signal, with said conferencing application having the near-end digital signal processor function. For one implementation of the method of conferencing, the near-end digital signal processor function adjusts a level of a comfort noise generator responsive to the level and frequency contents of the one or more intermittent and continuous soundtrack signals.
[0035] For one implementation of the technology as disclosed and claimed, a non-transitory computer-readable medium stores a conferencing application including instructions that, when executed by a computing processor, cause a conferencing link to be established through user interfaces. The instructions cause the conferencing application to operate on a server coupled on a network, thereby establishing a conferencing link between a near-end conferencing application generated user interface and a far-end conferencing application generated user interface. The near-end user interface has a near-end speaker and a near-end microphone coupled with a near-end computing device that processes said near-end user interface; the far-end user interface has a far-end speaker and a far-end microphone coupled with a far-end computing device that processes said far-end user interface. The instructions further cause the generation of one or more of intermittent and continuous soundtrack signals with said conferencing applications, and cause voice sound signals and the one or more of the intermittent and continuous soundtrack signals to be received and projected at said near-end and far-end conferencing application generated user interfaces. For one implementation, the instructions cause the combining of one or more of the intermittent and continuous soundtrack with an AEC and RES processed near-end speech signal by said conferencing application having a near-end digital signal processor function, thereby generating and outputting a Tx voice signal, where the near-end digital signal processor function adjusts a level of a residual echo suppression responsive to the level and frequency contents of the one or more intermittent and continuous soundtrack signals.
[0036] Referring to
[0037] This implementation is similar to the one shown earlier in
[0038] The implementation in
[0039] Referring to
[0040] The various implementations and examples shown above illustrate a method and system for a conferencing system experiencing continuous or intermittent double-talk. The technology as disclosed and claimed provides a solution to this problem by masking the residual echo and making it less audible. The main approaches include: mixing the added sound from the soundtrack into the Tx voice signal, which naturally masks the residual echo; controlling the aggressiveness of the RES based on the level of the extra sound, such that when the extra sound is low the RES is applied as in a standard voice call, and when the extra sound is loud less RES is applied since the echo is naturally masked by the extra sound; and adjusting the level of comfort noise based on how loud the extra sound is. A user of the present method and system may choose any of the above implementations, or an equivalent thereof, depending upon the desired application. In this regard, it is recognized that various forms of the subject conferencing method and system could be utilized without departing from the scope of the present technology and various implementations as disclosed.
[0041] As is evident from the foregoing description, certain aspects of the present implementation are not limited by the particular details of the examples illustrated herein, and it is therefore contemplated that other modifications and applications, or equivalents thereof, will occur to those skilled in the art. It is accordingly intended that the claims shall cover all such modifications and applications that do not depart from the scope of the present implementation(s). Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
[0042] Certain systems, apparatus, applications or processes are described herein as including a number of modules or components. A module may be a unit of distinct functionality that may be implemented in software, hardware, or combinations thereof. For example, a module can include the acoustic echo cancellation (AEC), the Residual Echo Suppression (RES) and the Comfort Noise Generator (CNG). When the functionality of a module is performed in any part through software, the module includes a computer-readable medium. The modules may be regarded as being communicatively coupled with other modules; for example, the AEC, RES and the CNG are communicably coupled. The inventive subject matter may be represented in a variety of different implementations of which there are many possible permutations.
[0043] The methods described herein do not have to be executed in the order described, or in any particular order. Moreover, various activities described with respect to the methods identified herein can be executed in serial or parallel fashion. In the foregoing Detailed Description, it can be seen that various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter may lie in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment.
[0044] In an example implementation, the machine operates as a standalone device or may be connected (e.g., networked) to other machines, such as far-end and near-end systems connected over a WAN. In a networked deployment, the machine may operate in the capacity of a server or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine or computing device. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
[0045] The example computer system and client computers can include a processor (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory and a static memory, which communicate with each other via a bus. The computer system may further include a video/graphical display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system and client computing devices can also include an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse), a drive unit, a signal generation device (e.g., a speaker) and a network interface device.
[0046] The drive unit includes a computer-readable medium on which is stored one or more sets of instructions (e.g., software) embodying any one or more of the methodologies or systems described herein. The software may also reside, completely or at least partially, within the main memory and/or within the processor during execution thereof by the computer system, the main memory and the processor also constituting computer-readable media. The software may further be transmitted or received over a network via the network interface device.
[0047] The term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present implementation. The term “computer-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
[0048] The various implementations and examples shown above illustrate a conferencing system that addresses continuous double-talk. A user of the present technology as disclosed may choose any of the above implementations, or an equivalent thereof, depending upon the desired application. In this regard, it is recognized that various forms of the subject conferencing application could be utilized without departing from the scope of the present invention.
[0049] As is evident from the foregoing description, certain aspects of the present technology as disclosed are not limited by the particular details of the examples illustrated herein, and it is therefore contemplated that other modifications and applications, or equivalents thereof, will occur to those skilled in the art. It is accordingly intended that the claims shall cover all such modifications and applications that do not depart from the scope of the present technology as disclosed and claimed.
[0050] Other aspects, objects and advantages of the present technology as disclosed can be obtained from a study of the drawings, the disclosure and the appended claims.