LOW LATENCY AUDIO PACKET LOSS CONCEALMENT

Abstract

The invention provides a method for real-time concealing errors in audio data packets. A Long Short-Term Memory (LSTM) neural network with a plurality of nodes is provided and pre-trained with audio data. A sequence of packets is received, each packet comprising a set of modified discrete cosine transform (MDCT) coefficients associated with a frame comprising time-domain samples of the audio signal. These MDCT coefficient data are applied to the LSTM neural network, and in case it is identified that a received packet is an erroneous packet, an output from the LSTM neural network is used to generate estimated MDCT co-efficients to provide a concealment packet to replace the erroneous packet. Preferably, the MDCT coefficients are normalized prior to applying to the LSTM neural network. This method can be performed in real-time. A low latency can be obtained and still with a high audio quality.

Claims

1. A method for concealing errors in packets of data representing an audio signal, the method comprising: providing (P_L_NN) a Long Short-Term Memory (LSTM) neural network with a plurality of nodes, wherein the LSTM neural network has been pre-trained with audio data, receiving (R_P) a sequence of packets each comprising a set of modified discrete cosine transform (MDCT) coefficients associated with a frame comprising time-domain samples of the audio signal, applying (A_P_L_NN) the sequence of packets to the LSTM neural network, identifying (I_E_P) in the received sequence of packets a packet to be an erroneous packet, generating (G_CFF) estimated MDCT coefficients to replace the set of MDCT coefficients of the erroneous packet in response to an output from the LSTM neural network, wherein each packet optionally represents 10-200 MDCT coefficients, wherein the LSTM neural network optionally comprises 50-500 LSTM nodes, generating (G_CP) a concealment packet based on the estimated MDCT coefficients, and replacing (R_E_P) the erroneous packet with the concealment packet.

2. The method according to claim 1, comprising performing a normalizing procedure on the MDCT coefficients of each packet to arrive at a normalized set of MDCT coefficient for each packet, and applying the normalized set of MDCT coefficients to the LSTM neural network.

3. (canceled)

4. The method according to claim 1, wherein the LSTM neural network is pre-trained with audio data in the form of normalized sets of MDCT coefficients.

5. The method according to claim 1, comprising applying a post-processing of the estimated MDCT coefficients to modify the estimated MDCT coefficients prior to generating the concealment packet.

6. (canceled)

7. (canceled)

8. The method according to claim 1, wherein the sequence of packets are overlapping packets each representing samples of audio, the sequence of packets optionally overlapping by 20-70%.

9. (canceled)

10. (canceled)

11. (canceled)

12. (canceled)

13. (canceled)

14. (canceled)

15. The method according to claim 1, wherein the LSTM neural network comprises a plurality of LSTM layers, wherein each LSTM layer comprises a plurality of LSTM nodes in parallel.

16. The method according to claim 1, wherein the LSTM neural network has one single LSTM layer of nodes, and optionally wherein outputs from nodes of the single LSTM layer are combined to provide the output of the LSTM neural network with a desired number of elements.

17. (canceled)

18. The method according to claim 1, comprising (a) training the LSTM neural network by a predetermined loss function using an audio input, (b) training the LSTM neural network using a specific audio input, or both (a) and (b).

19. (canceled)

20. (canceled)

21. The method according to claim 1, further comprising training a plurality of different LSTM neural network configurations or LSTM neural network data sets using respective different specific audio inputs and, optionally, classifying audio in response to received packets, and selecting one of the different LSTM neural network configurations or LSTM neural network data sets to be used for generating estimated MDCT coefficients accordingly.

22. (canceled)

23. The method according to claim 21, wherein said classifying of audio comprises applying an acoustic scene classification algorithm on the sequence of packets to classify received audio with respect to input type-loading of pre-trained set of data for LSTM neural network.

24. (canceled)

25. (canceled)

26. (canceled)

27. (canceled)

28. The method according to claim 1, wherein transformation of audio samples into MDCT packets is performed prior to transmitting packets over a wireless transmission channel.

29. The method according to claim 1, wherein transformation of audio samples into MDCT packets is performed after receiving packets via a wireless transmission channel.

30. A computer program product comprising, a non-transitory computer-readable medium, with instructions for performing the method of claim 1 when executed on a processor system.

31. A device, comprising: a wireless receiver configured to receive a wireless signal representing a sequence of packets representing an audio signal, and a packet loss concealment system arranged to receive the sequence of packets, and to perform the method according to claim 1 on a processor to arrive at a modified sequence of packets comprising at least one concealment packet.

32. The device according to claim 31, comprising an audio decoder arranged to receive the modified sequence of packet and to apply an MDCT based audio decoding algorithm so as to decode the modified sequence of packets into a sequence of decoded frames of audio.

33. (canceled)

34. The device according to claim 31, the device being one of: a live performance base station, a wireless microphone, a wireless headset, a wireless intercom device, a teleconference system, a wireless audio monitor, and a virtual reality device.

35. The device according to claim 31, the device being arranged to generate an output audio signal in response to the received sequence of packets.

36. The device according to claim 31, wherein the sequence of packets represents samples of a time signal at a sample frequency of at least 8 kHz.

37. A system, comprising: an audio device comprising a wireless transmitter, wherein the audio device is arranged to generate a sequence of packets representing an audio signal and to transmit a wireless signal by means of the wireless transmitter representing the sequence of packets, and a device according to claim 31, the device configured to receive the sequence of packets from the audio device.

38. The method of claim 1, wherein the method is performed for one-way or two-way streaming of audio.

39. (canceled)

Description

BRIEF DESCRIPTION OF THE FIGURES

[0053] The invention will now be described in more detail with regard to the accompanying figures of which

[0054] FIG. 1 illustrates a simple block diagram of an embodiment of the invention with audio packet loss concealment based on a pre-trained LSTM neural network operating in the MDCT domain,

[0055] FIGS. 2a and 2b illustrate two basically different implementations with respect to MDCT transformation before and after wireless communication, the concept of packet loss concealment based on an LSTM neural network,

[0056] FIG. 3 illustrates the concept of MDCT transformation of an audio input, the LSTM neural network to operate as packet concealment generator, and inverse MDCT transformation to arrive at an audio output,

[0057] FIG. 4 illustrates nodes of an LSTM neural network,

[0058] FIG. 5 illustrates one possible implementation of a soft-threshold function to be used for post-processing of prediction output of the LSTM neural network, and

[0059] FIG. 6 illustrates steps of a method embodiment.

[0060] The figures illustrate specific ways of implementing the present invention and are not to be construed as being limiting to other possible embodiments falling within the scope of the attached claim set.

DETAILED DESCRIPTION OF THE INVENTION

[0061] FIG. 1 illustrates a simple block diagram of an embodiment with a packet loss concealment unit PLC. An input audio signal A_I is transformed into packets in the MDCT domain and applied to an LSTM neural network L_NN which has been pre-trained with a training audio input A_T, also in the MDCT domain. The LSTM neural network L_NN generates a concealment C_P packet based on the audio input A_I and its pre-training properties. It has been found that such concealment packet C_P provides a good estimate in case a packet of audio data is lost or is found to be erroneous, and thus results in a minimum of audible artefacts. The lost or erroneous packet is simply replaced by the concealment packet C_P.

[0062] Preferably, the MDCT packets are normalized prior to being applied to the LSTM neural network L_NN, such as using a 2-norm normalization. This has been found to help to improve learning efficiency and thus results in improved audio quality.

[0063] Further, the immediate estimated concealment output from the LSTM neural network L_NN is preferably applied to a post processing function, which in turn outputs the concealment packet C_P. Especially, it has been found that a soft-threshold function applied to the output from the LSTM neural network can improve audio quality. See FIG. 5 for an illustration of an example of such soft-threshold function to be used in the post-processing.

[0064] The packet loss concealment is useful due to the combination of a high audio quality, a low latency, and still with a limited complexity to allow implementation on device with low processing power capability, e.g. mobile devices.

[0065] FIGS. 2a and 2b illustrate two basically different implementations with respect to MDCT transformation before and after wireless transmission. In both cases, audio data packets D are transmitted represented in an RF signal, e.g. according to a DECT protocol or other known RF protocol, from a device comprising an RF transmitter RF1 to a device comprising an RF receiver RF2.

[0066] Apart from the pre-training, the LSTM neural network L_NN may further be configured for being online trained by the MDCT transformed audio input A_I, thus allowing the network L_NN to adapt to the actual audio input A_I in order to provide the best possible concealment estimate.

[0067] In FIG. 2a, an audio input signal A_I is transformed into the MDCT domain prior to being transmitted by transmitter RF1 as a wireless RF representation MDCT data packet D, i.e. at the transmitter side. At the receiver side, the MDCT data packets are received by receiver RF2 and applied to the LSTM neural network L_NN which generates concealment packets, in case packets are lost or damaged in the RF transmission. An audio output A_O including such possible concealment packets is then generated after inverse MDCT IMDCT to arrive at a digital or analog audio output signal A_O.

[0068] In FIG. 2b, the transmitter RF1 transmits data packets D representing the audio input A_I in a non-MDCT format. After receipt by the receiver RF2 in the receiver device DV, a MDCT transform is applied to arrive at the MDCT representation which is then applied to the LSTM neural network L_NN which performed concealment as explained above to arrive at the final audio output A_O after inverse MDCT IMDCT.

[0069] It is to be understood that the concept can be used for one-way audio streaming, e.g. in a wireless stage microphone or wireless musical instrument transmitter. The concept can also be used for two-way intercom or teleconference system.

[0070] To be used for speech, the concealment performance can be in case the LSTM neural network has been trained with the sound of a specific person's voice, e.g. a singer or speaker, in case of a digital stage microphone. Alternatively, for a teleconference system, it is possible to store a set of LSTM neural network pre-train configurations for a number of specific person's voices. If the speaking person is identified during a teleconference call, the active LSTM neural network configuration matching the speaker person can then be selected for optimal concealment performance.

[0071] FIG. 3 illustrates one possibility for transforming an audio input signal A_I into overlapping packets PCK which are transformed into MDCT domain representation, e.g. each with 50-200 MDCT coefficients. After transmission over a transmission channel with packet loss or packet damage, the MDCT data are applied to a LSTM neural network L_NN which generates a concealment packet output being inverse MDCT transformed IMDCT into concealment audio packets that can be used to form a resulting output signal A_O similar to the audio input signal A_I.

[0072] FIG. 4 illustrates nodes of a possible implementation of an LSTM neural network layer. The input is x.sup.t, whereas the output is h.sup.t. Input gate I_G, forget gate F_G and output gate O_G all include a sigmoid activation function a which squashes the output to a value between 0 and 1. Further, tanh is a hyperbolic tangent activation function. The subscripts denote: input i, input gate g, forget gate f, and output gate k. Input weights associated with each subscript is denoted U, whereas recurrence weights associated with each subscript is denoted W. Bias associated with each subscript is denoted b.

[0073] The three gates I_G, F_G, O_G control an internal state which is used to learn long term dependencies. An LSTM layer includes several LSTM nodes in parallel each producing an output. Therefore, the Hadamard product is used (indicated in

[0074] FIG. 4 by a dot in a circle). The trainable parameters are the biases b and weight matrices U, W. The dimensions of these trainable parameter are: N for parameter b, N×M for parameter U, and N×N for parameter W, where N is the number of nodes in the LSTM layer, and M is the input size.

[0075] The update equations for a single pass in the LSTM layer are:

i.sup.t=tanh(U.sub.ix.sup.t+W.sub.ih.sup.t-1+b.sub.i)

g.sup.t=σ(U.sub.gx.sup.t+W.sub.gh.sup.t-1+b.sub.g)

f.sup.t=σ(U.sub.fx.sup.t+W.sub.fh.sup.t-1+b.sub.f)

s.sup.t=i.sup.t⊙g.sup.t+s.sup.t-1⊙f.sup.t

k.sup.t=σ(U.sub.kx.sup.t+W.sub.kh.sup.t-1+b.sub.k)

h.sup.t=tanh(s.sup.t)⊙k.sup.t

[0076] The network is preferably trained using the back propagation through time algorithm, and here a loss function is used as a measure to optimize the network. A loss function that scores the performance of the network is based on the output of the network and the expected output, e.g. a mean squared error (MSE) may be used as a loss function.

[0077] Test have shown that LSTM neural network is most preferably pre-trained with the same specific type of audio which is expected as audio input when concealment is to be performed. In case of music, e.g. music segmented representing the expected music genre can be selected for pre-training. In case of specific musical instruments, sound from such musical instrument can be used for pre-training, and in case of speech, e.g. the specific person's voice may be used for optimal performance, otherwise a distinction between female and male voice for pre-training. Especially, sets of pre-trained parameters b, U, W may be stored for a set of different audio classes. According to an audio classification algorithm applied to the audio input, the audio input can be classified, and the pre-trained parameter b, U, W corresponding to the best matching audio class can be applied to the LSTM neural network for optimal concealment performance.

[0078] Tests have shown that rather short segments of audio is sufficient for training the LSTM neural network. However, longer segments of audio may improve performance without increasing complexity. Still, possible training instability may occur using long segments, and to avoid this, any sign of training instability should preferably be monitored during training.

[0079] Listening tests have been performed to confirm that a good concealment performance can be obtained with an LSTM neural network having one single layer only and with 200 LSTM nodes. However, with one single layer, an acceptable performance may be obtained by 50-200 LSTM nodes, e.g. around 80-120 LSTM nodes, only. This may be found to be a good compromise between learning capacity and computational complexity.

[0080] Specifically, audible tests have been performed with an LSTM neural network with one single layer and 100 LSTM nodes, and with overlapping packets of 48 audio samples (corresponding to 1 ms at 48 kHz sample frequency), and with correspondingly 24 MDCT coefficients per packet which are transmitted over a channel with 5% loss. However, a more stable training was obtained with 96 MDCT coefficients per packet, but resulting in a 4 ms delay compared to 1 ms with 24 MDCT coefficients per packet.

[0081] For further information regarding such LSTM neural network, reference is made to: http://www.deeplearningbook.org/ and “LSTM: A search space odyssey”, IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222-2232, October 2017.

[0082] FIG. 5 illustrates one possible implementation of a post-processing function to be used for post-processing of prediction output of the LSTM neural network in order to modify the estimated concealment coefficients output by the LSTM neural network. FIG. 5 illustrates an example of a threshold function, namely a soft-threshold function with a threshold value of 1, merely for illustration of the principle, where input amplitude is the horizontal axis, and output amplitude is the vertical axis. The function t_soft as a function of input x, where T is the threshold, is generally:

[0083] t_soft=[x−T for xT, x+T for x<−T, and 0 otherwise]

[0084] Other post-processing functions can be used, e.g. a hard-threshold function, but listening tests have confirmed a good audio quality with the soft-threshold functions, especially it has been shown to suppress audible high frequency artefacts.

[0085] FIG. 6 illustrates steps of a method embodiment, i.e. a method for concealing errors in packets of data representing an audio signal. First step is providing P_L_NN a Long Short-Term Memory (LSTM) neural network with a plurality of nodes, and pre-training the LSTM neural network with audio data, preferably matching the type of audio which the network is expected to be used for. The LSTM neural network is preferably implemented on a processor system capable of running in real time, as known by the skilled person. Next, receiving R_P a sequence of packets each comprising a set of modified discrete cosine transform (MDCT) coefficients associated with a frame comprising time-domain samples of the audio signal. Next, applying A_P_L_NN the sequence of packets to the LSTM neural network. Further, identifying I_E_P in the received sequence of packets a packet to be an erroneous packet, and identification this can be obtained as known by the skilled person. Next step is generating G_CFF estimated MDCT coefficients to replace the set of MDCT coefficients of the erroneous packet in response to an output from the LSTM neural network, e.g. further applying a post-processing, prior to generating G_CP a concealment packet based on the estimated MDCT coefficients. Finally, replacing R_E_P the erroneous packet with the concealment packet, thus having completed the concealment of the lost or damaged audio content. The method may comprise the step of inverse MDCT transforming the resulting sequence of MDCT packets to arrive at an audio signal.

[0086] To sum up, the invention provides a method for real-time concealing errors in audio data packets. A Long Short-Term Memory (LSTM) neural network with a plurality of nodes is provided and pre-trained with audio data, e.g. audio representing specific voice(s) or musical instruments(s) or music genres(s). A sequence of packets is received, each packet comprising a set of modified discrete cosine transform (MDCT) coefficients associated with a frame comprising time-domain samples of the audio signal. These MDCT coefficient data are applied to the LSTM neural network, and in case it is identified that a received packet is an erroneous packet, an output from the LSTM neural network is used to generate estimated MDCT coefficients to provide a concealment packet to replace the erroneous packet. Preferably, the MDCT coefficients are normalized prior to applying to the LSTM neural network. This method can be performed in real-time even on a small wireless portable device receiving an audio stream via e.g. an RF transmission channel with loss. A low latency can be obtained, such as down to 1-3 ms, and still with a high audio quality. An even higher concealment quality can be obtained, if the LSTM neural network is trained for a specific audio content, e.g. a specific voice or musical instrument, e.g. obtained by online training of the LSTM neural network to allow adaptation to the specific audio content streamed.

[0087] Although the present invention has been described in connection with the specified embodiments, it should not be construed as being in any way limited to the presented examples. The scope of the present invention is to be interpreted in the light of the accompanying claim set. In the context of the claims, the terms “comprising” or “comprises” do not exclude other possible elements or steps. Also, the mentioning of references such as “a” or “an” etc. should not be construed as excluding a plurality. The use of reference signs in the claims with respect to elements indicated in the figures shall also not be construed as limiting the scope of the invention. Furthermore, individual features mentioned in different claims, may possibly be advantageously combined, and the mentioning of these features in different claims does not exclude that a combination of features is not possible and advantageous.

LOW LATENCY AUDIO PACKET LOSS CONCEALMENT

Inventors

Cpc classification

Classification Explorer

G10L25/30

PHYSICS

Classification Explorer

G10L19/02

PHYSICS

Classification Explorer

G10L19/0212

PHYSICS

Classification Explorer

G10L25/18

PHYSICS

Classification Explorer

G10L19/005

PHYSICS

International classification

Classification Explorer

G10L19/02

PHYSICS

Classification Explorer

G10L19/005

PHYSICS

Classification Explorer

G10L25/30

PHYSICS

Abstract

Claims

Description