CONFERENCE TERMINAL AND ECHO CANCELLATION METHOD FOR CONFERENCE

20230058981 · 2023-02-23

Abstract

A conference terminal and an echo cancellation method for a conference are provided. In the echo cancellation method, a synthetic speech signal is received. The synthetic speech signal includes a user speech signal of a speaking party corresponding to a first conference terminal of multiple conference terminals and an audio watermark signal corresponding to the first conference terminal. One or more delay times corresponding to the audio watermark signal are detected in a received audio signal. The received audio signal is recorded through a sound receiver of a second conference terminal of the conference terminals. An echo in the received audio signal is canceled according to the one or more delay times.

Claims

1. An echo cancellation method for a conference, adapted to a plurality of conference terminals each comprising a sound receiver and a loudspeaker, the echo cancellation method comprising: receiving a synthetic speech signal, wherein the synthetic speech signal comprises a user speech signal of a speaking party corresponding to a first conference terminal of the plurality of conference terminals and an audio watermark signal corresponding to the first conference terminal; detecting at least one delay time corresponding to the audio watermark signal in a received audio signal, wherein the received audio signal is recorded through the sound receiver of a second conference terminal of the plurality of conference terminals; and canceling an echo in the received audio signal according to the at least one delay time.

2. The echo cancellation method for a conference according to claim 1, wherein detecting the at least one delay time corresponding to the audio watermark signal in the received audio signal comprises: determining at least one initial delay time according to a correlation between the received audio signal and the audio watermark signal, wherein the at least one initial delay time corresponds to a relatively high degree of the correlation.

3. The echo cancellation method for a conference according to claim 2, wherein detecting the at least one delay time corresponding to the audio watermark signal in the received audio signal comprises: generating at least one initial delay signal corresponding to the user speech signal according to the at least one initial delay time, wherein a delay time of the at least one initial delay signal relative to the user speech signal is the at least one initial delay time; and estimating an echo path according to the at least one initial delay signal, wherein the audio watermark signal is delayed for the at least one delay time after passing through the echo path, and the echo path is a channel between the sound receiver and the loudspeaker.

4. The echo cancellation method for a conference according to claim 1, wherein the synthetic speech signal further comprises a second user speech signal of the speaking party corresponding to a third conference terminal of the plurality of conference terminals, and a second audio watermark signal corresponding to the third conference terminal, and the echo cancellation method further comprises: detecting at least one delay time corresponding to the second audio watermark signal in the received audio signal.

5. The echo cancellation method for a conference according to claim 1, wherein the audio watermark signal has a frequency higher than 16 kilohertz (kHz).

6. The echo cancellation method for a conference according to claim 1, wherein canceling the echo in the received audio signal comprises: generating at least one second synthetic speech signal which is the synthetic speech signal with the at least one delay time; and canceling the at least one second synthetic speech signal from the received audio signal.

7. The echo cancellation method for a conference according to claim 3, wherein estimating the echo path comprises: estimating an impulse response of the echo path by applying the at least one initial delay signal to an adaptive filter.

8. The echo cancellation method for a conference according to claim 1, further comprising: playing, through the loudspeaker, the synthetic speech signal received via a network.

9. A conference terminal comprising: a sound receiver, configured to perform recording and obtain a received audio signal of a speaking party corresponding thereto; a loudspeaker, configured to play a sound; a communication transceiver, configured to transmit or receive data; and a processor, coupled to the sound receiver, the loudspeaker and the communication transceiver, and configured to: receive a synthetic speech signal through the communication transceiver, wherein the synthetic speech signal comprises a user speech signal of the speaking party corresponding to a second conference terminal and an audio watermark signal corresponding to the second conference terminal; detect at least one delay time corresponding to the audio watermark signal in the received audio signal; and cancel an echo in the received audio signal according to the at least one delay time.

10. The conference terminal according to claim 9, wherein the processor is further configured to: determine at least one initial delay time according to a correlation between the received audio signal and the audio watermark signal, wherein the at least one initial delay time corresponds to a relatively high degree of the correlation.

11. The conference terminal according to claim 10, wherein the processor is further configured to: generate at least one initial delay signal corresponding to the user speech signal according to the at least one initial delay time, wherein a delay time of the at least one initial delay signal relative to the user speech signal is the at least one initial delay time; and estimate an echo path according to the at least one initial delay signal, wherein the audio watermark signal is delayed for the at least one delay time after passing through the echo path, and the echo path is a channel between the sound receiver and the loudspeaker.

12. The conference terminal according to claim 9, wherein the synthetic speech signal further comprises a second user speech signal of the speaking party corresponding to a third conference terminal, and a second audio watermark signal corresponding to the third conference terminal, and the processor is further configured to: detect at least one delay time corresponding to the second audio watermark signal in the received audio signal.

13. The conference terminal according to claim 9, wherein the audio watermark signal has a frequency higher than 16 kHz.

14. The conference terminal according to claim 9, wherein the processor is further configured to: generate at least one second synthetic speech signal which is the synthetic speech signal with the at least one delay time; and cancel the at least one second synthetic speech signal from the received audio signal.

15. The conference terminal according to claim 11, wherein the processor is further configured to: estimate an impulse response of the echo path by applying the at least one initial delay signal to an adaptive filter.

16. The conference terminal according to claim 9, wherein the processor is further configured to: play, through the loudspeaker, the synthetic speech signal received via a network.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] FIG. 1 is a schematic diagram of a conference system according to an embodiment of the disclosure.

[0010] FIG. 2 is a flowchart of an echo cancellation method for a conference according to an embodiment of the disclosure.

[0011] FIG. 3 is a schematic diagram illustrating generation of a synthetic speech signal according to an embodiment of the disclosure.

[0012] FIG. 4 is a schematic diagram of a conference system according to an embodiment of the disclosure.

[0013] FIG. 5 is a flowchart of an echo cancellation method for a conference according to an embodiment of the disclosure.

DESCRIPTION OF THE EMBODIMENTS

[0014] FIG. 1 is a schematic diagram of a conference system 1 according to an embodiment of the disclosure. Referring to FIG. 1, the conference system 1 includes (but not limited to) multiple conference terminals 10a and 10c, multiple local signal management devices 30, and an allocation server 50.

[0015] Each of the conference terminals 10a and 10c may be a corded phone, a mobile phone, a tablet computer, a desktop computer, a notebook computer, or a smart speaker. Each of the conference terminals 10a and 10c includes (but not limited to) a sound receiver 11, a loudspeaker 13, a communication transceiver 15, a memory 17, and a processor 19.

[0016] The sound receiver 11 may be a microphone of a dynamic type, a condenser type, or an electret condenser type. The sound receiver 11 may also be a combination of an analog-to-digital converter, a filter, an audio processor, and other electronic components capable of receiving sound waves (for example, human voice, environmental sound, and machine operation sound) and converting them into audio signals. In one embodiment, the sound receiver 11 is configured to receive/record a sound from a speaking party and obtain a received audio signal. The received audio signal may include voice of the speaking party, sound emitted by the loudspeaker 13, and/or other environmental sounds.

[0017] The loudspeaker 13 may be a speaker or a megaphone. In one embodiment, the loudspeaker 13 is configured to play a sound.

[0018] The communication transceiver 15 is, for example, a transceiver supporting a wired network such as Ethernet, a fiber optic network, or a cable network (in which the transceiver may include components such as (but not limited to) a connection interface, a signal converter, and a communication protocol processing chip). Alternatively, the communication transceiver 15 may be a transceiver supporting a wireless network such as Wi-Fi, a fourth generation (4G), fifth generation (5G) or later generation mobile network (in which the transceiver may include components such as (but not limited to) an antenna, a digital-to-analog/analog-to-digital converter, and a communication protocol processing chip). In one embodiment, the communication transceiver 15 is configured to transmit or receive data.

[0019] The memory 17 may be any type of fixed or portable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid-state drive (SSD) or similar component. In one embodiment, the memory 17 is configured to record a program code, a software module, a configuration arrangement, data (such as an audio signal or a delay time), or a file.

[0020] The processor 19 is coupled to the sound receiver 11, the loudspeaker 13, the communication transceiver 15 and the memory 17. The processor 19 may be a central processing unit (CPU), a graphics processing unit (GPU), or other programmable general purpose or special purpose microprocessor, a digital signal processor (DSP), a programmable controller, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC) or other similar component or a combination of the foregoing. In one embodiment, the processor 19 is configured to perform all or some of operations of the conference terminal 10a or 10c to which the processor 19 belongs, and may load and execute various software modules, files and data recorded in the memory 17.

[0021] The local signal management device 30 is connected to the conference terminal 10a or the conference terminal 10c via a network. The local signal management device 30 may be a computer system, a server, or a signal processing device. In one embodiment, the conference terminal 10a or the conference terminal 10c may serve as the local signal management device 30. In another embodiment, the local signal management device 30 may serve as an independent relay device different from the conference terminals 10a and 10c. In some embodiments, the local signal management device 30 includes (but not limited to) the communication transceiver 15, the memory 17 and the processor 19 that are identical or similar to those mentioned above, and the implementation modes and functions of these components will not be repeated.

[0022] In addition, in one embodiment, it is assumed that the conference terminals connected to the same local signal management device 30 are located in the same region (for example, specific space, area, compartment, or floor in a building). The conference terminals 10a and 10c in FIG. 1 are respectively located in different regions. However, the number of conference terminals connected to any local signal management device 30 is not limited to one.

[0023] The allocation server 50 is connected to the local signal management device 30 via a network. The allocation server 50 may be a computer system, a server, or a signal processing device. In one embodiment, the conference terminal 10a or the conference terminal 10c or the local signal management device 30 may serve as the allocation server 50. In another embodiment, the allocation server 50 may serve as an independent cloud server different from the conference terminals 10a and 10c or the local signal management device 30. In some embodiments, the allocation server 50 includes (but not limited to) the communication transceiver 15, the memory 17 and the processor 19 that are identical or similar to those mentioned above, and the implementation modes and functions of these components will not be repeated.

[0024] In the following, the method according to an embodiment of the disclosure will be described with reference to the devices, components, and modules in the conference system 1. The steps in this method may be adjusted according to actual situations and are not limited to those described herein.

[0025] It should be noted that, for the convenience of description, the same components may implement the same or similar operations, and description thereof will not be repeated. For example, since the conference terminals 10a and 10c may serve as the local signal management device 30 or the allocation server 50, and the local signal management device 30 may also serve as the allocation server 50, in some embodiments, the processor 19 of each of the conference terminals 10a and 10c, the local signal management device 30 and the allocation server 50 may implement the method identical or similar to that according to an embodiment of the disclosure.

[0026] FIG. 2 is a flowchart of an echo cancellation method for a conference according to an embodiment of the disclosure. Referring to FIG. 1 and FIG. 2, it is assumed that a voice conference is established between the conference terminals 10a and 10c. For example, when a conference is established through video software, voice communication software, or a phone call, a speaking party may start talking. The processor 19 of the conference terminal 10a may receive a synthetic speech signal C.sup.W through the communication transceiver 15 (step S210). Specifically, the synthetic speech signal C.sup.W includes a user speech signal C′ of the speaking party corresponding to the conference terminal 10c and an audio watermark signal M.sup.C corresponding to the conference terminal 10c.

[0027] For example, FIG. 3 is a schematic diagram illustrating generation of the synthetic speech signal C.sup.W according to an embodiment of the disclosure. Referring to FIG. 3, the user speech signal C′ is generated by the conference terminal 10c through recording using the sound receiver 11 of the conference terminal 10c. The user speech signal C′ may include voice of the speaking party, sound played by the loudspeaker 13, and/or other environmental sounds. The allocation server 50 may add the audio watermark signal M.sup.C to the user speech signal C′ of the speaking party corresponding to the conference terminal 10c by spread spectrum, echo hiding, phase encoding or the like in a time domain, thereby forming the synthetic speech signal C.sup.W. Alternatively, the allocation server 50 may add the audio watermark signal M.sup.C to the user speech signal C′ of the speaking party corresponding to the conference terminal 10c by carrier wave modulation, frequency band subtraction or the like in a frequency domain, thereby forming the synthetic speech signal C.sup.W. It should be noted that the embodiment of the disclosure does not limit the algorithm of watermark embedding.
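As an illustration of the embedding step described in paragraph [0027] (not the disclosure's specified algorithm, which is expressly unlimited), one minimal carrier-modulation sketch in Python follows. The function name, carrier frequency, amplitude, and bit-spreading scheme are assumptions for the sketch only:

```python
import numpy as np

def embed_watermark(speech, terminal_id_bits, fs=48000, carrier_hz=18000, amp=0.01):
    """Sketch: add a high-frequency audio watermark to a user speech signal.

    The terminal's identification bits are BPSK-modulated onto a carrier
    above 16 kHz and mixed into the speech at low amplitude, forming a
    synthetic speech signal (speech + watermark).
    """
    n = np.arange(len(speech))
    carrier = np.sin(2 * np.pi * carrier_hz * n / fs)
    # Spread each ID bit (+1/-1) over an equal-length chunk of the signal.
    symbols = np.where(np.array(terminal_id_bits) > 0, 1.0, -1.0)
    chunk = int(np.ceil(len(speech) / len(symbols)))
    bits = np.repeat(symbols, chunk)[:len(speech)]
    watermark = amp * bits * carrier
    return speech + watermark, watermark
```

A terminal that knows the bit pattern and carrier can later regenerate the watermark for correlation against its received audio.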

[0028] In one embodiment, the audio watermark signal M.sup.C has a frequency higher than 16 kilohertz (kHz), so as to be inaudible to humans. In another embodiment, the audio watermark signal M.sup.C may have a frequency lower than 16 kHz.

[0029] In one embodiment, the audio watermark signal M.sup.C is used to identify the conference terminal 10c. For example, the audio watermark signal M.sup.C is a sound, an image, or a code that records an identification code of the conference terminal 10c. However, in some embodiments, the content of the audio watermark signal M.sup.C is not limited. In addition, generation of an audio watermark signal M.sup.A, a synthetic speech signal A.sup.W, and other audio watermark signals and synthetic speech signals of other conference devices can be understood with reference to the foregoing description and will be omitted.

[0030] The allocation server 50 transmits the synthetic speech signal C.sup.W to the local signal management device 30. The local signal management device 30 takes the synthetic speech signal C.sup.W as an output audio signal A″ expected to be played by the conference terminal 10a, and accordingly transmits the output audio signal A″ to the conference terminal 10a, such that the conference terminal 10a receives the synthetic speech signal C.sup.W.

[0031] The processor 19 of the conference terminal 10a may play the output audio signal A″ (the synthetic speech signal C.sup.W in the present embodiment) through the loudspeaker 13. The processor 19 of the conference terminal 10a may perform recording or sound collection through the sound receiver 11 and obtain a received audio signal A.

[0032] The processor 19 of the conference terminal 10a may detect one or more delay times corresponding to the audio watermark signal M.sup.C in the received audio signal A (step S230). Specifically, it is assumed that an audio watermark signal corresponding to another conference terminal (for example, the conference terminal 10c) is known to the conference terminal 10a. It is worth noting that the processor 19 of the conference terminal 10a may, according to the output audio signal A″ played by the loudspeaker 13 of all or some of the conference terminals (for example, the conference terminal 10a in the present embodiment) in the region where the conference terminal 10a is located, cancel an echo in the received audio signal A received by the sound receiver 11 of the conference terminal 10a.

[0033] The output audio signal A″ includes the synthetic speech signal C.sup.W. In one embodiment, if it is desired to detect a delay time corresponding to the synthetic speech signal C.sup.W in the received audio signal A, the processor 19 of the conference terminal 10a may determine initial delay times τ.sub.1.sup.CA and τ.sub.2.sup.CA (two initial delay times are assumed here; however, the disclosure is not limited thereto) according to a correlation between the received audio signal A and the audio watermark signal M.sup.C. The initial delay times τ.sub.1.sup.CA and τ.sub.2.sup.CA correspond to a relatively high degree of correlation. For example, the processor 19 may estimate an initial delay time for the audio watermark signal M.sup.C to be transmitted from the loudspeaker 13 to the sound receiver 11 according to a peak value (that is, the point having the highest degree of correlation) in the cross-correlation between the received audio signal A and the audio watermark signal M.sup.C. Since there may be more than one peak value, there may be more than one initial delay time (for example, τ.sub.1.sup.CA and τ.sub.2.sup.CA). It should be noted that there are many algorithms for estimating the delay time, and the embodiment of the disclosure is not limited thereto.
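The cross-correlation peak-picking described in paragraph [0033] can be sketched as follows. This is an illustrative reading only; the function name, the number of peaks, and the minimum peak separation are assumptions:

```python
import numpy as np

def detect_initial_delays(received, watermark, num_peaks=2, min_separation=32):
    """Sketch: estimate candidate delays of a known watermark in received audio.

    Computes the cross-correlation between the received audio signal and
    the known audio watermark signal, then returns the non-negative lags
    of the strongest peaks as initial delay times.
    """
    # 'full' correlation; index i corresponds to lag i - (len(watermark) - 1).
    corr = np.correlate(received, watermark, mode="full")
    lags = np.arange(-len(watermark) + 1, len(received))
    order = np.argsort(np.abs(corr))[::-1]  # strongest correlation first
    delays = []
    for idx in order:
        lag = lags[idx]
        if lag < 0:
            continue  # an echo cannot arrive before the watermark is played
        if all(abs(lag - d) >= min_separation for d in delays):
            delays.append(lag)
        if len(delays) == num_peaks:
            break
    return delays
```

Multiple returned lags model multiple acoustic paths (direct sound plus reflections), matching the more-than-one-peak case noted above.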

[0034] In one embodiment, according to the initial delay times τ.sub.1.sup.CA and τ.sub.2.sup.CA, the processor 19 may generate one or more initial delay signals C.sup.W(n−τ.sub.1.sup.CA) and C.sup.W(n−τ.sub.2.sup.CA) corresponding to the user speech signal C′. The delay times of the initial delay signals C.sup.W(n−τ.sub.1.sup.CA) and C.sup.W(n−τ.sub.2.sup.CA) relative to the user speech signal C′ are the initial delay times τ.sub.1.sup.CA and τ.sub.2.sup.CA, respectively. It is worth noting that in a time-variant system, the delay time of the entire delivery system varies with changes in space. Therefore, the processor 19 may define the delay time of the synthetic speech signal C.sup.W or the audio watermark signal M.sup.C as an unknown delay time Δt.sup.C. The received audio signal A includes an audio signal a(n) of the speaking party and a synthetic speech signal C.sup.W(n−Δt.sup.C) belonging to the conference terminal 10c. The purpose of echo cancellation is to find the correct delay time Δt.sup.C and accordingly cancel the redundant sound (for example, the synthetic speech signal C.sup.W(n−Δt.sup.C)), so that only the audio signal a(n) of the speaking party remains in the user speech signal A′.

[0035] The processor 19 may estimate an echo path according to the initial delay signals C.sup.W(n−τ.sub.1.sup.CA) and C.sup.W(n−τ.sub.2.sup.CA). Specifically, the audio watermark signal M.sup.C is delayed by a converged delay time after passing through the echo path, and the echo path is a channel between the sound receiver 11 and the loudspeaker 13. The processor 19 may apply the initial delay signals C.sup.W(n−τ.sub.1.sup.CA) and C.sup.W(n−τ.sub.2.sup.CA) to various types of adaptive filters (for example, a least mean squares (LMS) filter, a subband adaptive filter (SAF), or a normalized least mean squares (NLMS) filter), and accordingly estimate an impulse response of the echo path and cause the filter to converge. When the filter converges to a steady state, the processor 19 estimates, using a filter coefficient in the steady state, the synthetic speech signal C.sup.W(n−Δt.sup.C) delayed by passing through the echo path, and accordingly obtains the delay time Δt.sup.C.
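The adaptive-filter estimation in paragraph [0035] names NLMS as one admissible filter type; a minimal NLMS sketch follows. The function name, tap count, and step size are assumptions, and real implementations would operate block-wise on streaming audio:

```python
import numpy as np

def nlms_echo_path(far_end, mic, num_taps=64, mu=0.5, eps=1e-8):
    """Sketch: estimate an echo-path impulse response with an NLMS filter.

    `far_end` is the (delay-compensated) synthetic speech fed to the
    loudspeaker; `mic` is the signal recorded by the sound receiver.
    Returns the converged filter taps; the dominant tap indicates the
    residual delay through the echo path.
    """
    w = np.zeros(num_taps)       # adaptive filter coefficients
    x_buf = np.zeros(num_taps)   # most recent far-end samples, newest first
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        y = w @ x_buf            # filter output: estimated echo
        e = mic[n] - y           # error between mic signal and estimate
        w += mu * e * x_buf / (x_buf @ x_buf + eps)  # normalized update
    return w
```

Initializing the filter near the correlation-derived initial delays shortens the search range, which is the convergence-time benefit the embodiment describes.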

[0036] According to the delay time Δt.sup.C, the processor 19 of the conference terminal 10a may cancel an echo in the received audio signal A (step S250). Specifically, it is assumed that the echo in the received audio signal A is the synthetic speech signal C.sup.W(n−Δt.sup.C). Since the synthetic speech signal C.sup.W and the delay time Δt.sup.C are both known, the processor 19 may generate the synthetic speech signal C.sup.W(n−Δt.sup.C) and subtract it from the received audio signal A, thereby achieving echo cancellation.
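The subtraction in step S250 can be sketched as follows, under the idealized assumption of paragraph [0036] that the echo is exactly the synthetic speech signal shifted by Δt samples (no gain or filtering along the path); the function name is an assumption:

```python
import numpy as np

def cancel_echo(received, synthetic, delay):
    """Sketch: cancel a delayed copy of the known synthetic speech signal
    from the received audio, leaving the local speaking party's voice.

    `delay` is the estimated echo delay Δt in samples.
    """
    echo = np.zeros_like(received)
    end = min(len(received), delay + len(synthetic))
    if delay < len(received):
        # Place the synthetic signal at the estimated delay offset.
        echo[delay:end] = synthetic[:end - delay]
    return received - echo
```

In practice the echo is shaped by the room impulse response, so the subtracted reference would be the adaptive filter's output rather than the raw shifted signal.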

[0037] It should be noted that the embodiment of the disclosure is not limited to a one-to-one conference as shown in FIG. 1. Another embodiment is described in the following.

[0038] FIG. 4 is a schematic diagram of a conference system 1′ according to an embodiment of the disclosure. Referring to FIG. 4, the conference system 1′ includes (but not limited to) multiple conference terminals 10a to 10e, multiple local signal management devices 30, and the allocation server 50.

[0039] The implementation modes and functions of the conference terminals 10b to 10e, the local signal management device 30, and the allocation server 50 can be understood with reference to the description of FIG. 1 to FIG. 3 concerning the conference terminal 10a, the local signal management device 30 and the allocation server 50, and will be omitted.

[0040] In the present embodiment, different regions are defined according to the local signal management device 30. The conference terminals 10a and 10b are in a first region, the conference terminal 10c is in a second region, and the conference terminals 10d and 10e are in a third region. The allocation server 50 may add audio watermark signals M.sup.A to M.sup.E respectively to user speech signals A′ to E′ of the speaking parties corresponding to the conference terminals 10a to 10e, thereby forming synthetic speech signals A.sup.W to E.sup.W. The allocation server 50 transmits the synthetic speech signals C.sup.W to E.sup.W from the second region and the third region to the local signal management device 30 in the first region, transmits the synthetic speech signals A.sup.W, B.sup.W, D.sup.W and E.sup.W from the first region and the third region to the local signal management device 30 in the second region, and transmits the synthetic speech signals A.sup.W to C.sup.W from the first region and the second region to the local signal management device 30 in the third region.
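The regional forwarding rule of paragraph [0040] can be restated compactly: each region receives every synthetic speech signal originating in a different region. The following sketch expresses that rule; the function and data-structure names are assumptions, not part of the disclosure:

```python
def signals_for_region(region_of, synthetic):
    """Sketch: for each region, collect the synthetic speech signals the
    allocation server forwards there, namely every signal whose source
    terminal lies in a *different* region.

    region_of: dict mapping terminal id -> region id
    synthetic: dict mapping terminal id -> its synthetic speech signal
    """
    regions = set(region_of.values())
    return {r: {t: s for t, s in synthetic.items() if region_of[t] != r}
            for r in regions}
```

Applying this to the layout of FIG. 4 (10a/10b in region 1, 10c in region 2, 10d/10e in region 3) reproduces the three transmissions listed above.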

[0041] It is worth noting that a difference from FIG. 1 is that the output audio signal A″ of the conference terminal 10a in FIG. 4 may include the synthetic speech signals C.sup.W to E.sup.W. Therefore, the processor 19 of the conference terminal 10a further detects one or more delay times corresponding to the audio watermark signals M.sup.D and M.sup.E in addition to the audio watermark signal M.sup.C in the received audio signal A.

[0042] Specifically, FIG. 5 is a flowchart of an echo cancellation method for a conference according to an embodiment of the disclosure. Referring to FIG. 5, the processor 19 of the conference terminal 10a obtains the audio watermark signals M.sup.C to M.sup.E (step S510). The audio watermark signals M.sup.C to M.sup.E may be pre-stored, entered by the user, or downloaded from the Internet. The processor 19 detects the initial delay times τ.sub.1.sup.CA, τ.sub.2.sup.CA, τ.sub.1.sup.DA, τ.sub.2.sup.DA, τ.sub.1.sup.EA, and τ.sub.2.sup.EA of the audio watermark signals M.sup.C to M.sup.E in the received audio signal A recorded by the sound receiver 11 (step S530) (assuming that each audio watermark signal corresponds to two delay times). According to the initial delay times τ.sub.1.sup.CA, τ.sub.2.sup.CA, τ.sub.1.sup.DA, τ.sub.2.sup.DA, τ.sub.1.sup.EA and τ.sub.2.sup.EA, the processor 19 determines the initial delay signals C.sup.W(n−τ.sub.1.sup.CA), C.sup.W(n−τ.sub.2.sup.CA), D.sup.W(n−τ.sub.1.sup.DA), D.sup.W(n−τ.sub.2.sup.DA), E.sup.W(n−τ.sub.1.sup.EA), and E.sup.W(n−τ.sub.2.sup.EA) of the audio watermark signals M.sup.C to M.sup.E (step S550). The processor 19 cancels, in the received audio signal A, the initial delay signals C.sup.W(n−τ.sub.1.sup.CA), C.sup.W(n−τ.sub.2.sup.CA), D.sup.W(n−τ.sub.1.sup.DA), D.sup.W(n−τ.sub.2.sup.DA), E.sup.W(n−τ.sub.1.sup.EA) and E.sup.W(n−τ.sub.2.sup.EA), so as to reduce a convergence time of echo cancellation, and further cancels the components in the received audio signal A that belong to the synthetic speech signals C.sup.W to E.sup.W (step S570).

[0043] In summary, in the conference terminal and the echo cancellation method for a conference according to an embodiment of the disclosure, a delay time of a synthetic speech signal to be canceled is estimated using a known audio watermark signal, and the synthetic speech signals of other conference devices are canceled accordingly. In an embodiment of the disclosure, an initial delay time corresponding to the audio watermark signal is first obtained, which reduces the convergence time of echo cancellation. Even if the positional relationship between conference devices changes constantly, the expected convergence effect can be achieved.

[0044] It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the disclosed embodiments without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the disclosure cover modifications and variations of this disclosure provided they fall within the scope of the following claims and their equivalents.