Conference system and process for voice activation in the conference system
09866700 · 2018-01-09
Assignee
Inventors
- Hans Van Der Schaar (Breda, NL)
- Marc Smaak (Bergen op Zoom, NL)
- Jochem Bonarius (Eindhoven, NL)
- Rene Derkx (Eindhoven, NL)
- Kees Janse (Eindhoven, NL)
CPC classification
H04M3/568
ELECTRICITY
International classification
Abstract
Conference systems are used for example in discussions and usually comprise a plurality of delegate units with microphones, whereby in a discussion each participant uses his own delegate unit. Usually the delegate units have a switch or the like that allows the participant in front of the delegate unit to request that his microphone is activated, so that the speech of the participant is input into the conference system and amplified by it. A conference system (1) comprising a plurality of delegate units (2), each delegate unit (2) having a microphone (5) for receiving an audio signal from the surroundings, and a central service module (3) handling a plurality of contribution channels, whereby the audio output of the contribution channels contributes to an amplified audio output of the conference system (1), is proposed, whereby each delegate unit (2) is adapted to send a request for a contribution channel commit to the central service module (3), whereby the central service module (3) is adapted to grant the request and to allocate a contribution channel to the requesting delegate unit (2), thus setting the requesting delegate unit (i) in an active state (A), whereby the delegate unit (2) is adapted to trigger the request by voice activation, whereby the request is triggered in case at least a first trigger condition is fulfilled, defining that the audio signal level of one of the delegate units (2) as a possible requesting delegate unit (i) is higher than an individual test value for each other delegate unit (2) in the active state (A), whereby the individual test value is an estimated audio signal level of the possible requesting unit (i) resulting from an audio or speech signal provided to the other active delegate units (2).
Claims
1. A conference system (1) comprising a plurality of delegate units (2), each delegate unit (2) having a microphone (5) for receiving an audio signal from the surroundings, and a central service module (3) handling a plurality of contribution channels, whereby the audio output of the contribution channels contributes to an amplified audio output of the conference system (1), whereby each delegate unit (2) is configured to send a request for a contribution channel commit to the central service module (3), whereby the central service module (3) is configured to grant the request and to allocate a contribution channel to the requesting delegate unit (2), thus setting the requesting delegate unit (i) in an active state (A), and wherein the delegate unit (2) is configured to trigger the request by voice activation, whereby the request is triggered in case at least a first trigger condition is fulfilled, defining that the audio signal level of one of the delegate units (2) as a possible requesting delegate unit (i) is higher than an individual test value for each other delegate unit (2) in the active state (A), whereby the individual test value is an estimated audio signal level of the possible requesting unit (i) resulting from an audio or speech signal provided to the other active delegate units (2).
2. The conference system (1) according to claim 1, characterized in that the individual test value for a test delegate unit (p) is derived by multiplying an individual acoustical coupling factor between the possible requesting delegate unit (i) and the test delegate unit (p) with the audio signal level of the test delegate unit (p) during the test period and optionally with a threshold factor.
3. The conference system (1) according to claim 1, characterized in that the individual test value for a test delegate unit (p) is derived by multiplying an individual acoustical coupling factor between the possible requesting delegate unit (i) and the test delegate unit (p) with the maximum of the audio signal level of the test delegate unit (p) during the last few test periods and optionally with a threshold factor.
4. The conference system (1) according to claim 2, characterized in that each delegate unit (2) comprises a factor table (ACF) containing individual IDs of the other delegate units and the individual coupling factors and an audio signal level table (ALT) containing individual IDs of the other active delegate units (2) and the audio signal level during the test period.
5. The conference system (1) according to claim 4, characterized in that the factor table (ACF) is managed by and/or stored in the delegate units (2) and that the audio signal level table (ALT) is provided by the central service module (3).
6. The conference system (1) according to claim 2, characterized in that the delegate unit (2) is adapted to estimate the individual acoustical coupling factors for each of the other delegate units (2) in an iterative manner, whereby in each iteration step a start value of the individual acoustical coupling factor is improved.
7. The conference system (1) according to claim 2, characterized in that the delegate unit (2), which requested a contribution channel on basis of the data of a first test period (KB) and which was dedicated to a contribution channel by allocating the contribution channel to the delegate unit (2), is configured to review the request and thus the allocation by re-testing at least the first condition on basis of the data of a second test period ((k+1)B).
8. The conference system (1) according to claim 2, characterized in that the delegate unit (i) is configured to trigger the request in case at least the first trigger condition and a second trigger condition are fulfilled, whereby the second trigger condition requires that the audio signal level of the possible requesting delegate unit (i) is higher than a reference noise level (N) during the test period.
9. The conference system (1) according to claim 2, characterized in that the central service module (3) is configured to grant only one request during a pre-selected dead-time.
10. The conference system (1) according to claim 1, characterized in that the delegate unit (2) comprises a speaker indication device (6) for indicating a speaker status of the delegate unit (2), whereby the speaker indication device (6) is activated in case a first indication condition is fulfilled, requiring that the delegate unit (2) is in an active state, and a second indication condition is fulfilled, requiring that a voice is detected.
11. A process for voice activation in the conference system (1) according to claim 1, characterized in that the delegate unit (2) triggers the request by voice activation, in case at least a first trigger condition is fulfilled defining that the audio signal level of one of the delegate units (2) as a possible requesting delegate unit (i) is higher than an individual test value for each other delegate unit (2) in the active state (A), whereby the individual test value is an estimated audio signal level of the possible requesting unit (i) resulting from an audio or speech signal provided to the other active delegate units (2).
12. The conference system (1) according to claim 3, characterized in that each delegate unit (2) comprises a factor table (ACF) containing individual IDs of the other delegate units and the individual coupling factors and an audio signal level table (ALT) containing individual IDs of the other active delegate units (2) and the audio signal level during the test period.
13. The conference system (1) according to claim 1, characterized in that the delegate unit (2) comprises a speaker indication device (6) for indicating a speaker status of the delegate unit (2), whereby the speaker indication device (6) is activated in case a first indication condition is fulfilled, requiring that the delegate unit (2) is in an active state, and a second indication condition is fulfilled, requiring that a voice pitch is detected.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Further features, advantages and details of the invention will become apparent from the description of an embodiment of the invention. The figures show:
(2)
DETAILED DESCRIPTION
(3)
(4) Each of the delegate units 2 comprises a microphone 5 for receiving a speech signal from a speaker or a participant of a discussion. The central service module 3 organizes a plurality of contribution channels, whereby the contribution channels are amplifier channels, so that an audio signal which is sent from the delegate unit 2 to one of the contribution channels will be amplified and emitted as an amplified audio signal to the surroundings.
(5) The conference system can for example be installed in a plenary hall, whereby each plenary seat is equipped with one of the delegate units 2. During a discussion in the plenary hall a participant of the discussion using one of the delegate units can speak into the microphone 5 of the delegate unit 2 so that an audio signal is received by the delegate unit 2. The audio signal is transmitted to the contribution channel, amplified and emitted in the plenary hall, so that the other participants can hear the audio signal.
(6) In order to have a well regulated discussion, some of the delegate units 2 are in an active state A and allowing the participant to speak in the discussion and some of the delegate units 2 are in a passive state P, whereby the audio signal is not amplified and emitted in the floor. In case the delegate units 2 are in the active state A one of the contribution channel is allocated from the central service module to the delegate unit 2.
(7) During operation the delegate units 2 are switched from the passive state P to the active state A by a voice activation method as explained below:
(8) Each delegate unit 2 in a passive state P requests a contribution channel commit when a first and optionally a second trigger condition are fulfilled:
(9) The first trigger condition is a directional noise condition: The input audio signal level of the possible requesting delegate unit 2 is well above the estimated coupled-in audio level, i.e. is well above the estimated audio signal level resulting from receiving a speech signal from a speaker using another delegate unit 2.
(10) The second trigger condition is a diffuse noise condition: The input audio signal level of the possible requesting delegate unit 2 is well above a reference level (e.g. the floor background noise level).
(11) The data, which will also be referred to as audio metadata, required as external information for each delegate unit 2 to determine the above two conditions is (1) a table ALT of all delegate units 2 in the active state with their unique identification IDs and their audio signal levels during a test period T and (2) the reference (background noise) level N. The table ALT may for example have the structure:

  p2      level X.sub.p2(T)
  p(n-1)  level X.sub.p(n-1)(T)
  pn      level X.sub.pn(T)
  Noise level N

whereby p2, p(n-1) and pn represent the IDs and X.sub.p#(T) the audio signal level during the test period T. The noise level N will be explained later.
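For illustration, the distributed audio metadata described above could be held in a small container like the following (a minimal sketch; the class and field names are assumptions, not taken from the described system):

```python
from dataclasses import dataclass, field

@dataclass
class AudioMetadata:
    """Illustrative container for the distributed audio metadata:
    the ALT table of active-unit levels plus the noise level N."""
    levels: dict = field(default_factory=dict)  # unit ID -> audio level X_p(T)
    noise_level: float = 0.0                    # reference background noise N
```

Each delegate unit would keep the latest received instance and read the levels of the active units from it when evaluating the trigger conditions.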
(12) In a possible, non-limiting implementation the level is a value within the range [0, 1] described by a 16-bit unsigned integer. For calculating the level, the audio levels are determined during a block of samples, for example during 1024 samples. For each sub-block of 32 samples the root mean square is calculated and the result is put into an exponential averaging filter. For the noise level, the level is calculated using an algorithm (for example spectral noise density) on the floor audio (which is a mix of all contribution channels).
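A minimal Python sketch of this level computation for one 1024-sample block (the smoothing factor beta is an assumed value; the text above does not specify it):

```python
import math

def block_level(samples, beta=0.9, sub_block=32):
    """Level of one audio block: root mean square per 32-sample
    sub-block, fed into an exponential averaging filter.
    Samples are assumed normalized to [-1.0, 1.0], so the result
    lies in [0, 1] and could be stored as a 16-bit unsigned integer."""
    level = 0.0
    for start in range(0, len(samples), sub_block):
        chunk = samples[start:start + sub_block]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        level = beta * level + (1.0 - beta) * rms  # exponential averaging
    return level
```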
(13) The audio metadata is collected and distributed by the central service module 3. From a practical point of view it is sufficient to distribute the audio metadata only periodically, for instance every 1024 samples, to save communication bandwidth. The audio metadata can be distributed efficiently by using broadcast or multicast distribution methods.
(14) When a contribution channel request is received, the central service module 3 grants assignment of a contribution channel if one is available. If all contribution channels are occupied, it replies with a deny response. If a delegate unit 2 in the active state A no longer fulfils either of the conditions, it requests release of the contribution channel after a time-out period. The time-out period prevents a release being requested due to a short pause in the speech. A contribution channel commit or release always has to be requested from the central service module 3, because requirements could exist which would result in a denial, e.g. the requirement that at least one delegate unit 2 should always remain active.
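The commit/release handling described above can be sketched as follows (a non-authoritative sketch; the class and method names are invented for illustration, and the "at least one delegate unit remains active" rule is included as the example requirement mentioned above):

```python
class CentralServiceModule:
    """Sketch of the central service module's channel handling."""

    def __init__(self, n_channels):
        self.free = list(range(n_channels))  # available contribution channels
        self.assigned = {}                   # delegate unit ID -> channel

    def request_commit(self, unit_id):
        # Grant a contribution channel if one is available, else deny.
        if unit_id in self.assigned:
            return True
        if not self.free:
            return False
        self.assigned[unit_id] = self.free.pop()
        return True

    def request_release(self, unit_id):
        # Example requirement: at least one delegate unit stays active.
        if unit_id not in self.assigned or len(self.assigned) == 1:
            return False
        self.free.append(self.assigned.pop(unit_id))
        return True
```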
(15) The first trigger condition can be seen in the following equation:
(16) X.sub.i(k)>γ.sub.dir·max.sub.p∈P{W.sub.p,i(KB)·X.sub.p,max(KB)}
with:
k discrete time;
B block with a block length of a plurality of samples, for example 1024 samples, defining the length of the time or test period;
K discrete time-frame index for the B block periods;
X.sub.p,max(KB) the maximum audio signal level of the delegate unit p for the last few, for example the last 3 to 10, time periods before the time period k;
X.sub.i(k) the audio level of the possible requesting delegate unit i during the time period k;
γ.sub.dir the threshold factor for this condition;
P the collection of delegate units 2 in the active state A during the time period k;
W.sub.p,i(KB) the acoustical coupling factor between the delegate unit p and the delegate unit i during the time period k.
(17) The first trigger condition therefore tests whether the audio signal level of the delegate unit i, as the possible requesting delegate unit, is higher than a reference test value of each of the other active delegate units p multiplied with the threshold factor. The function max serves as a pre-selection, because it extracts the highest reference test value. The reference test value is therefore the product of the maximum audio signal level of the delegate unit p during the time period k and the coupling factor between the test delegate unit p and the possible requesting delegate unit i during the time period k.
(18) The individual acoustical coupling factor W.sub.p,i(KB) describes the ratio between the audio signal level X.sub.i of the possible requesting delegate unit i and the audio signal level X.sub.p of the test delegate unit p in case a speech signal is generated by a speaker using the test delegate unit p. So the individual acoustical coupling factors can differ for each test delegate unit p. The first trigger condition will be fulfilled if the speech signal is provided by a speaker being in front of the microphone 5 of the possible requesting delegate unit i and not in front of the test delegate unit p.
(19) In a possible implementation, the block length B is the 1024-sample interval and k is the discrete time, dependent on the sampling frequency. At least the first, preferably both trigger conditions are actually evaluated every sample period: first the audio level X is updated using the latest sample, preferably using exponential averaging, then the comparison is made. Otherwise a worst-case delay of, for example, 1024 samples could cause the system to miss the first letters of the speaker's sentence. The test values W.sub.p,i(KB)·X.sub.p,max(KB) and the noise value N are only updated when new audio metadata is received, which occurs every block B.
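The per-update evaluation of the two trigger conditions can be sketched as follows (the threshold factor values and all names are illustrative assumptions; the unknown coupling factor defaults to the conservative start value of 1.0, i.e. 0 dB):

```python
def should_request(level_i, active_levels, coupling,
                   noise_level=0.0, gamma_dir=2.0, gamma_dif=2.0):
    """Evaluate both trigger conditions for a candidate unit i.
    level_i: current smoothed audio level X_i of the candidate unit;
    active_levels: {unit ID: X_p,max} from the latest audio metadata (ALT);
    coupling: {unit ID: W_p,i} estimated acoustical coupling factors."""
    # First (directional) condition: X_i must exceed the largest
    # estimated coupled-in level from any active unit p.
    test = max((coupling.get(p, 1.0) * lvl
                for p, lvl in active_levels.items()), default=0.0)
    directional = level_i > gamma_dir * test
    # Second (diffuse) condition: X_i must exceed the background noise.
    diffuse = level_i > gamma_dif * noise_level
    return directional and diffuse
```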
(20) The individual acoustical coupling factors W.sub.p,i(KB) are estimated using a standard normalized least mean squares (NLMS) algorithm. Its target is to converge the filter coefficients quickly so as to minimize the error (=residual level). Again the time period KB is used for the description.
(21) In a first step, a residual level R.sub.p,i is determined from the delegate unit p to the delegate unit i, whereby the delegate unit p is the only delegate unit 2 in the active state A. In the situation where only a single delegate unit p is active, all other delegate units 2 dynamically adjust their acoustical coupling factor estimation to the active delegate unit p, using the audio signal level of their microphone 5 input and the audio signal level of the single active delegate unit p, whereby the audio signal level of the single active delegate unit p is distributed to all delegate units 2 by the central service module 3.
R.sub.p,i(KB)=X.sub.i(KB)−W.sub.p,i(KB)·X.sub.p,max(KB)
(22) In a next step, the acoustical coupling factor is updated:
(23) W.sub.p,i(KB)=W.sub.p,i((K−1)B)+μ·X.sub.p,max(KB)·R.sub.p,i(KB)/max({X.sub.p,max(KB).sup.2}+{R.sub.p,i(KB).sup.2}, thr)
with:
W.sub.p,i(KB) the updated acoustical coupling factor from the delegate unit p to the delegate unit i;
μ the converging-rate time constant;
{ } an exponential averaging function;
thr a bottom threshold to prevent spikes during initialization.
(24) As a start value, all acoustical coupling factors W could be set to the value 1.0 (=0 dB).
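The estimation loop described above can be sketched per block as follows (a sketch under assumptions: the beta, mu and thr values are invented, and the normalized update is one standard NLMS reading of the description, not necessarily the exact patented form):

```python
def update_coupling(w, x_p_max, x_i, pxx, prr, beta=0.9, mu=0.05, thr=1e-3):
    """One block update of the coupling factor estimate W_p,i while
    unit p is the only active unit. Returns updated (w, pxx, prr)."""
    residual = x_i - w * x_p_max                     # residual level R_p,i
    # Exponential averaging of the squared levels (average powers).
    pxx = beta * pxx + (1.0 - beta) * x_p_max ** 2   # P_xx,p
    prr = beta * prr + (1.0 - beta) * residual ** 2  # P_rr,p,i
    # Normalized update: a large residual power slows the adaptation,
    # and thr is a bottom threshold preventing spikes at initialization.
    w += mu * x_p_max * residual / max(pxx + prr, thr)
    return w, pxx, prr
```

Starting from the 0 dB initial value, repeated updates drive w toward the true level ratio between the receiving and the active unit.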
(25) A possible implementation for the exponential averaging function is defined as:
X.sub.i(k)=β·X.sub.i(k−1)+(1−β)·|x.sub.i(k)|
where smoothing-factor beta is determined using:
(26) β=exp(−1/(T.sub.exp·F.sub.s))
With: T.sub.exp the exponential time constant F.sub.s the sampling frequency.
(27) Other known implementations could be used.
(28) The exponential (moving) average function is described above, but for the { }-functions (determining the average power level) it is performed on the square of the input and the update rate is once per block period KB. Let's refer to {X.sub.p,max(KB).sup.2} as P.sub.xx,p(KB). Then:
P.sub.xx,p(KB)=β·P.sub.xx,p((K−1)B)+(1−β)·X.sub.p,max.sup.2(KB)
(29) As the input for this function is the maximum level of the past, for example, 5 blocks, and this level is itself determined by the exponential averaging function of the audio, this seems like double work, but this smoothing is preferred for the NLMS algorithm to converge quickly. The other value P.sub.rr,p,i(KB)={R.sub.p,i(KB).sup.2} is advantageous to react to external disruptions of the algorithm. E.g.: in a system where only one delegate unit is active, the coupling factors to that delegate unit are being updated. If the speaker behind a non-active delegate unit started speaking, that delegate unit would send a request. However, it will take up to tens of milliseconds for the system to grant this request; in the meantime the speech of the speaker can cause the coupling factors to be updated using incorrect input. However, due to the large error/residual signal, P.sub.rr,p,i(KB) will quickly rise, preventing a fast update of the coupling factors.
(30) As a result, each delegate unit 2 keeps a table containing the acoustical coupling factor estimations to each other delegate unit 2. The acoustical coupling factor tables are indicated in
(31) The second trigger condition, i.e. the diffuse noise condition can be seen in the next equation:
X.sub.i(k)>γ.sub.dif·N(KB)
with:
γ.sub.dif the threshold factor for this condition;
N the reference (background noise) level (from the ALT).
(32) Because the conference system 1 is a distributed system, delays and latencies in communication can occur, which may be handled as follows.
(33) Before a speaker's delegate unit 2 is granted a contribution channel, other delegate units 2 may also request a channel commit due to acoustical coupling. Therefore only the first commit request is granted, whereafter all commit requests are denied for a certain amount of time (called dead time). This dead time should be long enough that the distributed metadata contains information on the speaker's delegate unit 2.
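The dead-time rule can be sketched as a small gate at the central service module (the dead_time_s value is an assumption; the description does not fix it):

```python
import time

def make_grant_gate(dead_time_s=0.1):
    """Returns a function that grants the first commit request and
    denies all further requests until dead_time_s seconds have passed."""
    last_grant = [float("-inf")]  # time of the last granted request

    def try_grant(now=None):
        now = time.monotonic() if now is None else now
        if now - last_grant[0] < dead_time_s:
            return False  # still within the dead time: deny
        last_grant[0] = now
        return True

    return try_grant
```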
(34) To prevent a flood of re-requests, a delegate unit 2 must wait for a period of time after a request has been denied before sending a new request.
(35) Because the metadata is only sent once every x samples, the last known information may lag behind. In that case, it cannot be prevented that an onset in the speaker's voice triggers a commit at one or more delegate units 2 (this occurs more often when the coupling estimations have converged to their final value). To solve this, a delegate unit 2 waits for a metadata update directly after its channel commit request has been granted: if it is clear from the new metadata that the commit request was triggered by acoustical coupling, the delegate unit 2 immediately requests a channel release (i.e. without time-out period).
(36) Voice Detection/Identification:
(37) In the conference system 1 a delegate unit 2 could still request and receive a contribution channel due to a disturbance (pen click, cough, etc.). This is acceptable, because the channel is quickly released. The delegate unit 2 comprises an indication device 6, indicating with a light or LED the active or passive state of the delegate unit 2. However, for a discussion it would be preferred to indicate only a real speaker to the public by activating the indication device on the delegate unit 2. In a possible embodiment it is proposed to separate the indication from the channel assignment.
(38) The indication device 6 on a delegate unit 2 with a contribution channel assigned, i.e. in the active state A, is activated as soon as voice (pitch) is detected on its audio signal. To limit processing requirements, it is sufficient to perform the voice (pitch) detection on the loudest contribution channel only. It is also preferred to perform the voice (pitch) detection at the central service module 3, to reduce hardware requirements on the delegate units.
(39) Possible advantages of the discussion system 1 are that the acoustic coupling between the delegate units 2 is dynamically determined. It requires only limited information to be distributed, with which the delegate units 2 can determine whether they have a real speaker as audio input. It can handle communication delays in information exchange. These improvements allow the conference system 1 to be flexible and scalable. The conference system 1, especially the voice activation, is very robust, because neighboring delegate units 2 do not activate, or are activated only very briefly, due to acoustical coupling when a speaker starts to speak. The conference system 1, especially the delegate units 2, is self-learning, whereby after a short period it is easy for other speakers to participate in the discussion, even at the neighboring devices. The conference system 1 is scalable, because it works in small and in very large setups, without requiring manual configuration. The communication overhead is low because of the use of periodic metadata, for instance distributed using broadcast or multicast. In summary, the conference system 1 improves robustness and flexibility by determining acoustical coupling and optionally speech conditions at the delegate units 2.