Audio encoders, audio decoders, systems, methods and computer programs using an increased temporal resolution in temporal proximity of onsets or offsets of fricatives or affricates
11205434 · 2021-12-21
Assignee
Inventors
- Sascha Disch (Fuerth, DE)
- Christian Helmrich (Erlangen, DE)
- Markus Multrus (Nuremberg, DE)
- Markus Schnell (Nuremberg, DE)
- Arthur Tritthart (Erlangen, DE)
Cpc classification
- G10L19/00 (PHYSICS)
- G10L19/025 (PHYSICS)
International classification
- G10L19/00 (PHYSICS)
Abstract
An audio encoder for providing an encoded audio information on the basis of an input audio information has a bandwidth extension information provider configured to provide bandwidth extension information using a variable temporal resolution and a detector configured to detect an onset of a fricative or affricate. The audio encoder is configured to adjust a temporal resolution used by the bandwidth extension information provider such that bandwidth extension information is provided with an increased temporal resolution at least for a predetermined period of time before a time at which an onset of a fricative or affricate is detected and for a predetermined period of time following the time at which the onset of the fricative or affricate is detected. Alternatively or in addition, the bandwidth extension information is provided with an increased temporal resolution in response to a detection of an offset of a fricative or affricate. Audio decoders and methods use a corresponding concept.
Claims
1. An audio encoder for providing an encoded audio information on the basis of an input audio information, the audio encoder comprising: a bandwidth extension information provider configured to provide bandwidth extension information using a variable temporal resolution; a detector configured to detect an offset of a fricative or of an affricate; wherein the audio encoder is configured to adjust a temporal resolution used by the bandwidth extension information provider such that bandwidth extension information is provided with an increased temporal resolution in response to a detection of an offset of a fricative or of an affricate, wherein the audio encoder is configured to adjust a temporal resolution used by the bandwidth extension information provider such that bandwidth extension information is provided with an increased temporal resolution at least for a predetermined period of time before a time at which an offset of a fricative or of an affricate is detected and for a predetermined period of time following the time at which the offset of the fricative or of the affricate is detected.
2. An audio decoder for providing a decoded audio information on the basis of an encoded audio information, wherein the audio decoder is configured to perform a bandwidth extension on the basis of a bandwidth extension information provided by an audio encoder, such that the bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an offset of a fricative or affricate is detected and for a predetermined period of time following the time at which the offset of the fricative or affricate is detected, and wherein the audio decoder is implemented using a hardware apparatus, or using a computer, or using a combination of a hardware and a computer.
3. A system, comprising: an audio encoder according to claim 1; and an audio decoder configured to receive the encoded audio information provided by the audio encoder, and to provide, on the basis thereof, a decoded audio information, wherein the audio decoder is configured to perform a bandwidth extension on the basis of the bandwidth extension information provided by the audio encoder, such that the bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an onset of a fricative or affricate is detected and for a predetermined period of time following the time at which the onset of the fricative or affricate is detected, or such that the bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an offset of a fricative or affricate is detected and for a predetermined period of time following the time at which the offset of the fricative or affricate is detected.
4. A method for providing an encoded audio information on the basis of an input audio information, the method comprising: providing bandwidth extension information using a variable temporal resolution; and detecting an offset of a fricative or of an affricate; wherein a temporal resolution used for providing the bandwidth extension information is adjusted such that bandwidth extension information is provided with an increased temporal resolution in response to a detection of an offset of a fricative or of an affricate, such that bandwidth extension information is provided with an increased temporal resolution at least for a predetermined period of time before a time at which an offset of a fricative or of an affricate is detected and for a predetermined period of time following the time at which the offset of the fricative or of the affricate is detected.
5. A non-transitory digital storage medium having stored thereon a computer program for performing a method according to claim 4 when the computer program runs on a computer.
6. A method for providing a decoded audio information on the basis of an encoded audio information, wherein the method comprises performing a bandwidth extension on the basis of a bandwidth extension information provided by an audio encoder, such that the bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an offset of a fricative or of an affricate is detected and for a predetermined period of time following the time at which the offset of the fricative or of the affricate is detected.
7. A non-transitory digital storage medium having stored thereon a computer program for performing a method according to claim 6 when the computer program runs on a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments according to the present invention will subsequently be described taking reference to the enclosed figures in which:
DETAILED DESCRIPTION OF THE INVENTION
(15) 1. Audio Encoder According to
(17) The audio encoder 100 is configured to receive an input audio information 110 and to provide, on the basis thereof, an encoded audio information 112.
(18) The audio encoder 100 comprises a detector 120, which may, for example, receive the input audio information 110. The detector 120 is configured to detect an onset of a fricative or affricate, for example, on the basis of the input audio information 110. The detector 120 may provide a temporal resolution adjustment information 122.
(19) The audio encoder 100 also comprises a bandwidth extension information provider 130, which is configured to provide a bandwidth extension information 132 using a variable temporal resolution. For example, the bandwidth extension information provider 130 may be configured to receive the input audio information (and possibly additional preprocessed audio information). Moreover, the bandwidth extension information provider 130 may also be configured to receive the temporal resolution adjustment information 122 from the detector 120.
(20) The audio encoder 100 may further comprise a low frequency encoding 140, which may, for example, encode a low frequency portion of an audio content represented by the input audio information 110, to thereby provide an encoded representation 142 of a low frequency portion of the audio content represented by the input audio information 110. Accordingly, the encoded audio information 112 may comprise the bandwidth extension information 132 and the encoded representation 142 of the low frequency portion of the audio content. However, details regarding the low frequency encoding are not essential for the present invention.
(21) In the following, the functionality of the audio encoder 100 will be described in more detail.
(22) The low frequency encoding 140 may encode a low frequency portion of the audio content represented by the input audio information 110. For example, a portion of the audio content having frequencies below approximately 6 kHz or below approximately 7 kHz (or below any other predetermined frequency limit) may be encoded using the low frequency encoding 140. The low frequency encoding 140 may, for example, use any of the well-known audio encoding techniques, like transform-domain encoding or linear-prediction-domain encoding. In other words, the low frequency encoding 140 may, for example, use an audio encoding concept which may be based on the well-known “advanced audio coding” (AAC) or on the well-known “linear-prediction coding”. For example, the low frequency encoding 140 may comprise (or use) a modified “advanced audio coding” as described in the International Standard ISO/IEC 23003-3. Alternatively, or in addition, the low frequency encoding 140 may comprise (or use) a linear-prediction coding as described, for example, in the International Standard ISO/IEC 23003-3. The low frequency encoding 140 may also switch between a (modified or unmodified) “advanced audio coding” and a linear-prediction domain audio coding. In principle, however, any concept known for the encoding of an audio signal may be used in the low frequency encoding 140 to provide the encoded representation 142 of the low frequency portion of the audio content represented by the input audio information.
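To make the crossover frequency concrete: if the bandwidth extension operates on a uniform 64-band QMF representation (QMF subbands are mentioned further below, but the band count and the 32 kHz sampling rate used here are assumptions, not values from the text), a crossover of about 6 kHz or 7 kHz maps to a subband border as sketched below. The helper name `crossover_subband` is hypothetical.

```python
def crossover_subband(f_cross_hz: float, fs_hz: float, num_bands: int = 64) -> int:
    """Return the QMF subband border closest to the crossover frequency.

    Hypothetical helper: assumes num_bands uniform subbands spanning 0 .. fs/2.
    """
    band_width_hz = fs_hz / (2 * num_bands)  # e.g. 32 kHz / 128 = 250 Hz per band
    return round(f_cross_hz / band_width_hz)


# Crossover candidates mentioned in the text (about 6 kHz or about 7 kHz).
for fc in (6000.0, 7000.0):
    print(f"{fc:.0f} Hz -> first high-band subband {crossover_subband(fc, 32000.0)}")
```

With 250 Hz wide subbands, 6 kHz and 7 kHz correspond to subband borders 24 and 28; everything above that border would be left to the bandwidth extension.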
(23) The bandwidth extension information provider 130, in contrast, may provide bandwidth extension information (for example, in the form of bandwidth extension parameters) which allows for a reconstruction of a high frequency portion of the audio content represented by the input audio information 110, which high frequency portion is not represented by the encoded representation 142 provided by the low frequency encoding 140. For example, the bandwidth extension information provider 130 may be configured to provide some or all of the spectral band replication parameters which are described in the International Standard ISO/IEC 14496-3 (or any other standards referring to ISO/IEC 14496-3).
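A minimal sketch of what "coarsely describing a spectral envelope" can mean: average the QMF power of the high band over rectangular time/frequency tiles. The function `spectral_envelope`, its argument layout and the toy input are illustrative assumptions; this is not the SBR envelope estimator of ISO/IEC 14496-3. The toy example also shows why temporal resolution matters: with a single time segment, a rising (onset-like) power is averaged over the whole frame.

```python
from typing import List, Sequence


def spectral_envelope(
    qmf_power: Sequence[Sequence[float]],  # qmf_power[time_slot][subband]
    band_edges: Sequence[int],             # subband borders of the envelope bands
    time_borders: Sequence[int],           # time-slot borders of the envelope segments
) -> List[List[float]]:
    """Mean power per (time segment, frequency band): a coarse high-band envelope."""
    envelope = []
    for t0, t1 in zip(time_borders[:-1], time_borders[1:]):
        row = []
        for k0, k1 in zip(band_edges[:-1], band_edges[1:]):
            cells = [qmf_power[t][k] for t in range(t0, t1) for k in range(k0, k1)]
            row.append(sum(cells) / len(cells))
        envelope.append(row)
    return envelope


# Toy input: 8 time slots x 8 subbands, power rising over time (an "onset").
power = [[float(t)] * 8 for t in range(8)]
high_bands = [4, 6, 8]  # two envelope bands above an assumed crossover at subband 4
print(spectral_envelope(power, high_bands, [0, 8]))     # [[3.5, 3.5]]
print(spectral_envelope(power, high_bands, [0, 4, 8]))  # [[1.5, 1.5], [5.5, 5.5]]
```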
(24) For example, the bandwidth extension information provider may be configured to provide some or all of the parameters described in a section “SBR tool” and/or “low delay SBR” of the International Standard ISO/IEC 14496-3. For example, the bandwidth extension information provider 130 may be configured to provide some or all of the parameters of the syntax elements “sbr_extension_data( )”, “sbr_header( )”, “sbr_data( )”, “sbr_single_channel_element( )”, “sbr_channel_pair_element( )” or any of the other bitstream elements referenced therein, as defined, for example, in the International Standard ISO/IEC 14496-3. In other words, the bandwidth extension information provider 130 may provide spectral band replication parameters, which may, for example, coarsely describe a spectral envelope of a high frequency portion of the audio content represented by the input audio information 110. The bandwidth extension information 132 may further comprise parameters describing noise in the high frequency portion of the audio content represented by the input audio information 110, and/or parameters describing one or more sinusoidal signals included in that high frequency portion. In addition, the bandwidth extension information provider 130 may, for example, provide a number of configuration parameters, as also described in the International Standard ISO/IEC 14496-3 with respect to the spectral band replication tool. For example, the bandwidth extension information provider 130 may provide one or more parameters representing a temporal resolution which is used for the provision of sets of bandwidth extension information, for example a temporal resolution with which updated sets of parameters representing a spectral envelope of the high frequency portion of the audio content represented by the input audio information are provided.
For example, the bandwidth extension information provider 130 may provide a control parameter which indicates whether one or four sets of spectral envelope parameters are provided per audio frame. For example, the control parameters provided by the bandwidth extension information provider 130 may be similar to, or even equal to, the parameters provided for the case “FIXFIX” in the syntax element “sbr_grid( )”, as described in the International Standard ISO/IEC 14496-3.
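The choice between one and four sets of envelope parameters per frame can be pictured as an equally spaced grid of envelope borders within the frame, similar in spirit to the "FIXFIX" case. The sketch below assumes a frame of 16 time slots; both that length and the name `fixfix_borders` are assumptions for illustration.

```python
from typing import List


def fixfix_borders(num_env: int, slots_per_frame: int = 16) -> List[int]:
    """Equally spaced envelope borders (in time slots) for one frame.

    FIXFIX-like grid: the frame is split into num_env segments of equal length.
    slots_per_frame = 16 is an assumed value, not taken from the text.
    """
    if num_env not in (1, 2, 4):
        raise ValueError("this sketch only supports 1, 2 or 4 envelopes per frame")
    if slots_per_frame % num_env:
        raise ValueError("frame length must be divisible by the number of envelopes")
    step = slots_per_frame // num_env
    return [i * step for i in range(num_env + 1)]


print(fixfix_borders(1))  # normal resolution, one set: [0, 16]
print(fixfix_borders(4))  # increased resolution, four sets: [0, 4, 8, 12, 16]
```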
(25) Alternatively, the bandwidth extension information provider 130 may be configured to provide control information which is similar to, or even equal to, the control information included in the bitstream element “sbr_ld_grid( )”, which is described, for example, in section 4.6.19.3.2 of the International Standard ISO/IEC 14496-3.
(26) For example, a 2-bit value may be used to encode how many sets of envelope shape parameters are provided by the bandwidth extension information provider 130 per audio frame (cf. the bitstream element “bs_num_env” as described in section 4.6.19.3.2 of ISO/IEC 14496-3).
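If, as in the "FIXFIX" case, the 2-bit value is read as the base-2 logarithm of the number of envelope sets, encoding and decoding reduce to a few bit operations. This interpretation is an assumption of the sketch and should be checked against section 4.6.19.3.2 of ISO/IEC 14496-3; the function names are hypothetical. A 2-bit logarithmic code can express 1, 2, 4 or 8 sets, and the sketch does not check which of these the standard actually permits.

```python
def encode_num_env(num_env: int) -> str:
    """Encode 1, 2, 4 or 8 envelope sets as a 2-bit string holding log2(num_env)."""
    if num_env < 1 or num_env & (num_env - 1) or num_env > 8:
        raise ValueError("num_env must be 1, 2, 4 or 8")
    return format(num_env.bit_length() - 1, "02b")


def decode_num_env(bits: str) -> int:
    """Inverse of encode_num_env: the 2-bit value v signals 2**v envelope sets."""
    if len(bits) != 2 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly two bits")
    return 1 << int(bits, 2)


print(encode_num_env(4), decode_num_env("10"))  # 10 4
```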
(27) Advantageously, the signaling may be performed as indicated for the case “FIXFIX”, which is described in section 4.6.19 “low delay SBR” of ISO/IEC 14496-3.
(28) To conclude, the bandwidth extension information provider 130 provides bandwidth extension information 132, wherein the temporal resolution (for example, the period of time between updates of parameters representing a spectral envelope of a high frequency portion of the audio content represented by the input audio information 110) is adjusted in dependence on the temporal resolution adjustment information 122, which is provided by the detector 120. Thus, the temporal resolution used by the bandwidth extension information provider 130 (for example, for providing updated sets of parameters describing a spectral envelope of a high frequency portion of an audio content represented by the input audio information 110) is adapted to the input audio information 110.
(29) For example, the audio encoder 100 is configured such that the temporal resolution used by the bandwidth extension information provider 130 is increased (when compared to a normal temporal resolution) in response to a detection of an onset of a fricative or affricate by the detector 120. More precisely, the temporal resolution is increased such that the bandwidth extension information (for example, the spectral envelope parameters thereof) is provided with an increased temporal resolution at least for a predetermined period of time before a time at which an onset of a fricative or affricate is detected and for a predetermined period of time following the time at which the onset of the fricative or affricate is detected. Accordingly, an “entire” onset of a fricative or affricate (or at least a sufficiently large portion of it) is encoded with an increased temporal resolution of the bandwidth extension information. Consequently, onsets of fricatives or affricates can be encoded (and decoded) with sufficient accuracy, such that audible artifacts and a degradation of the audio quality are avoided.
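The rule in paragraph (29) can be sketched as an interval test: every frame that overlaps the window from a predetermined period before the detection time to a predetermined period after it is flagged for increased temporal resolution. Times are given in samples; the frame length and the two periods in the example are assumed values, and `high_res_frames` is a hypothetical name.

```python
import math
from typing import List, Sequence


def high_res_frames(
    detections: Sequence[int],  # detection times t_f (in samples)
    num_frames: int,
    frame_len: int,
    pre: int,                   # predetermined period before t_f (samples)
    post: int,                  # predetermined period after t_f (samples)
) -> List[bool]:
    """Flag every frame [i*L, (i+1)*L) that overlaps [t_f - pre, t_f + post)."""
    flags = [False] * num_frames
    for t_f in detections:
        first = max(0, math.floor((t_f - pre) / frame_len))
        last = min(num_frames - 1, math.ceil((t_f + post) / frame_len) - 1)
        for i in range(first, last + 1):
            flags[i] = True
    return flags


# Detection at sample 4700 with 1024-sample frames: frames 4 to 6 get the finer grid.
print(high_res_frames([4700], num_frames=10, frame_len=1024, pre=600, post=2000))
# [False, False, False, False, True, True, True, False, False, False]
```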
(30) Consequently, the encoded audio information 112, which comprises the bandwidth extension information 132 and which typically also comprises the encoded representation 142 of the low frequency portion of the audio content represented by the input audio information 110, allows for a decoding of the audio content represented by the input audio information 110 with good quality while the required bitrate can be kept reasonably small.
(31) Moreover, it should be noted that any of the other features and functionalities described herein can be implemented into the audio encoder 100 as well. In particular, the audio encoder 100 may additionally be configured to adjust the temporal resolution used by the bandwidth extension information provider such that bandwidth extension information is provided with an increased temporal resolution in response to a detection of an offset of a fricative or affricate (wherein the detector 120 may also be configured to detect an offset of a fricative or affricate).
(32) In the following, some additional details regarding the functionality of the audio encoder 100 will be described taking reference to
(34) An abscissa 210 describes a time (in terms of time blocks) and an ordinate 212 designates QMF subbands. Accordingly, the representation 200 according to
(35) As can be seen, magenta dashed vertical lines designate temporal borders 220a, 220b, . . . of a conventional bandwidth extension framing. Moreover, black dashed vertical lines designate detected fricative or affricate borders 230a, 230b, 230c, 230d, . . . The detected fricative or affricate borders 230a, 230b, 230c, 230d, . . . may be detected using a tilt-based detector. As can be seen, time intervals of equal length, which may be considered as bandwidth extension frames or generally as frames, are defined by the borders 220a, . . . , 220u of the (conventional) bandwidth extension framing. In other words, in the conventional concept according to document D1, bandwidth extension information may be associated with temporally regular time intervals (separated by the borders of the conventional bandwidth extension framing) of equal temporal length.
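The text states only that a tilt-based detector may be used. One common way to measure spectral tilt is the normalised lag-1 autocorrelation r1/r0 of a signal block: it is close to +1 for low-pass content such as voiced speech and becomes negative for high-pass, noise-like content such as sibilants. The sketch below uses that measure with a fixed threshold of 0; the measure, the threshold, the block length and the sinusoidal stand-in signals are all assumptions, not the detector of the original work.

```python
import math
from typing import List, Sequence


def block_tilt(x: Sequence[float]) -> float:
    """Normalised lag-1 autocorrelation r1/r0 of one block.

    Close to +1 for low-pass (e.g. voiced) blocks, negative for high-pass,
    noise-like blocks such as sibilant fricatives.
    """
    r0 = sum(v * v for v in x)
    if r0 == 0.0:
        return 0.0  # silent block: treat as "no tilt"
    r1 = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    return r1 / r0


def fricative_flags(signal: Sequence[float], block_len: int, threshold: float = 0.0) -> List[bool]:
    """Mark each complete block whose tilt falls below the (assumed) threshold."""
    return [
        block_tilt(signal[s:s + block_len]) < threshold
        for s in range(0, len(signal) - block_len + 1, block_len)
    ]


fs = 16000
vowel_like = [math.sin(2 * math.pi * 100 * n / fs) for n in range(512)]      # stand-in
sibilant_like = [math.sin(2 * math.pi * 7000 * n / fs) for n in range(512)]  # stand-in
print(fricative_flags(vowel_like + sibilant_like, block_len=256))  # [False, False, True, True]
```

The two 100 Hz blocks are classified as non-fricative and the two 7 kHz blocks as fricative-like; the block borders at which the decision changes would then serve as detected borders.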
(36) As can be seen, the detected fricative or affricate borders may lie somewhere within a time interval defined by two subsequent borders of the conventional bandwidth extension framing.
(37) However, the conventional bandwidth extension frame scheme as shown in
(39) However, between frame borders 330h and 330p, a “normal” temporal resolution (rather than an “increased” temporal resolution) is used. Moreover, an increased temporal resolution is used for the provision of the bandwidth extension information for frames between frame borders 330p and 330s, in response to a detection of an onset of a fricative or affricate in a frame (or time interval) bounded by frame borders 330p and 330q.
(40) Similarly, an increased temporal resolution is used for the provision of bandwidth extension information for frames (or time intervals) between frame borders 330t and 330w in response to a detection of an offset of a fricative or affricate in a frame (or time interval) between frame borders 330t and 330u.
(41) To conclude, a uniform (basic) framing is used to provide bandwidth extension information in the audio encoder 100, wherein the bandwidth extension information is associated with temporally regular frames (time intervals) of equal temporal length.
(42) However, the bandwidth extension information provider is configured to provide a single set of bandwidth extension information for a frame (i.e., a time interval of a given temporal length) if a first (“normal”) temporal resolution is used. For example, a single set of bandwidth extension information is provided for a frame between frame borders 330a and 330b, and a single set of bandwidth extension information is provided for each of the eight frames between time borders 330h and 330p. However, the bandwidth extension information provider is also configured to provide a plurality of sets of bandwidth extension information associated with time sub-intervals for a frame (time interval) of the given temporal length if a second (increased) temporal resolution is used. For example, four sets of bandwidth extension information are provided for each of the six frames between frame border 330b and frame border 330h, for each of the three frames between frame borders 330p and 330s, and for each of the three frames between frame borders 330t and 330w. As can be seen, each of the frames for which the bandwidth extension information is provided with high temporal resolution is subdivided into four sub-frames (or time sub-intervals) (for example, time sub-intervals 340a to 340d) of equal length, wherein one set of bandwidth extension parameters is provided for each of the time sub-intervals. Moreover, it should be noted that there is typically at least one time sub-frame, for which a set of bandwidth extension parameters is provided, immediately before a time sub-frame during which an onset of a fricative or affricate is detected or before a time sub-frame during which an offset of a fricative or affricate is detected. 
For example, if it is assumed that an onset (or offset) of a fricative or affricate is detected in a second half of the frame between frame borders 330b and 330c, there are at least two time sub-frames (which lie in a first half of the frame between frame borders 330b and 330c) immediately preceding the time sub-frame during which the onset (or offset) is detected. Accordingly, an increased temporal resolution is used for the provision of the bandwidth extension parameters even before the time at which the onset (or offset) of the fricative or affricate is actually detected. Accordingly, a “full” onset or a “full” offset of a fricative or affricate can be processed with high temporal resolution (in that the bandwidth extension parameters are provided with high temporal resolution). Consequently, a good reproduction is possible at the side of an audio decoder which receives the encoded audio information provided by the audio encoder 100.
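The framing described in paragraph (42) can be sketched as a schedule that emits one time segment (and thus one set of bandwidth extension parameters) per normal frame and four equal sub-intervals per frame with increased temporal resolution. The frame length of 16 time slots, the frame pattern and the name `envelope_schedule` are assumptions; the pattern only loosely mirrors the example with one normal frame, six refined frames and eight normal frames.

```python
from typing import List, Sequence, Tuple


def envelope_schedule(
    high_res: Sequence[bool], frame_len: int = 16, sets_high: int = 4
) -> List[Tuple[int, int]]:
    """(start, end) time-slot segments, one per set of bandwidth extension parameters.

    Assumes frame_len is divisible by sets_high.
    """
    segments = []
    for i, high in enumerate(high_res):
        n = sets_high if high else 1
        step = frame_len // n
        start = i * frame_len
        segments.extend((start + j * step, start + (j + 1) * step) for j in range(n))
    return segments


pattern = [False] + [True] * 6 + [False] * 8
print(len(envelope_schedule(pattern)))   # 33 parameter sets
print(envelope_schedule([False, True]))  # [(0, 16), (16, 20), (20, 24), (24, 28), (28, 32)]
```

For that pattern the sketch emits 1 + 24 + 8 = 33 parameter sets instead of 15 with a uniformly normal resolution, which illustrates that the additional bitrate is confined to the neighbourhood of detected onsets and offsets.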
(43) Taking reference now to
(45) A first ellipse 430 describes a pre-echo which would be caused by a conventional bandwidth extension framing. Moreover, the conventional bandwidth extension framing has the effect that the onset shown in the ellipse 430 is perceived as a very hard onset.
(46) Moreover, a second ellipse 440 points out a post echo, which would also be caused by a conventional bandwidth extension framing. Moreover, the offset in the region indicated by the ellipse 440 would typically be perceived as a very hard offset, which would sound unnatural.
(47) An ellipse 450 shows a vowel leakage from a base band, which would also be caused by a conventional bandwidth extension framing.
(48) Accordingly, it can be seen that a number of artifacts arise from the conventional bandwidth extension framing (for example, the bandwidth extension framing shown in
(50) Moreover, the inventive usage of an increased temporal resolution also helps to avoid a vowel leakage from a base band, as shown at ellipse 450 in
(51) In the following, some details regarding the provision of the bandwidth extension information will be explained taking reference to
(53) A time axis is designated with 610. As can be seen, the time (represented by the time axis 610) is divided into time intervals 620a, 620b, 620c, 620d, 620e, 620f, which may, for example, have equal length. The time intervals may be considered as frames. Moreover, a time at which an onset (or offset) of a fricative or affricate is detected is designated with t_f. The time t_f lies within the time interval (or frame) 620e. It should be noted that the time at which the onset (or offset) of the fricative or affricate is detected may, for example, be determined by the detector 120, and that this detection time typically lies somewhat after the actual beginning of the onset (or of the offset) of the fricative or affricate.
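Locating the detection time t_f in the framing is a simple integer division. The sketch assumes frames of 1024 samples, each split into four sub-intervals, and labels the frames 620a to 620f by index; these values and the name `locate_detection` are assumptions chosen only so that the example lands in frame 620e.

```python
from typing import Tuple


def locate_detection(t_f: int, frame_len: int = 1024, num_sub: int = 4) -> Tuple[int, int]:
    """Return (frame index, sub-interval index) of a detection time t_f given in samples."""
    if t_f < 0:
        raise ValueError("t_f must be non-negative")
    frame, offset = divmod(t_f, frame_len)
    return frame, offset // (frame_len // num_sub)


frame, sub = locate_detection(4400)
print("620" + "abcdef"[frame], "sub-interval", sub + 1)  # 620e sub-interval 2
```

For t_f = 4400 the detection falls into the second quarter of frame 620e.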
(54) As can be seen in
(56) For example, one individual set of bandwidth extension parameters may be provided for each time sub-interval of the time intervals 720d and 720e.
(57) It should be noted that the increased temporal resolution is also used for the time interval 720d, which precedes (immediately precedes) the time interval 720e in which the time of detection of the onset (or offset) of the fricative or affricate lies. Since it is desired, according to the present invention, that at least one further time interval (or time sub-interval) preceding (or immediately preceding) the time interval (or time sub-interval) in which the onset (or offset) of the fricative or affricate is detected is encoded with an increased temporal resolution, the audio encoder 100 chooses the increased temporal resolution for the provision (and encoding) of the bandwidth extension information of the time interval 720d. Thus, since the time at which the onset of the fricative or affricate is detected lies within a first time sub-interval of the time interval 720e, the audio encoder decides that the (preceding) time interval 720d should also be processed with high temporal resolution, such that the high temporal resolution is already applied in a time interval (or time sub-interval) before the time sub-interval in which the onset (or offset) of the fricative or affricate is detected.
(58) In contrast, if the onset (or offset) of the fricative or affricate were only detected in a second sub-interval of the time interval 720e, the audio encoder would (possibly) select a low temporal resolution for the provision of the bandwidth extension information for the time interval 720d (which is the situation shown in
(59) Accordingly, even a beginning of an onset of a fricative or affricate is processed with high temporal resolution, wherein the beginning of the onset of the fricative or affricate typically lies before a time at which the onset of a fricative or affricate is actually detected by the detector 120. Consequently, audio reproduction with good perceptual quality without major artifacts can be achieved.
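The decision described in paragraphs (57) and (58) can be sketched as follows: the frame containing t_f is always refined, and the preceding frame is refined as well if fewer than a minimum number of refined sub-intervals would otherwise precede the detection. The symmetric check for the following frame, the minimum of one sub-interval on each side, the frame length of 1024 samples and the name `frames_to_refine` are assumptions of this sketch.

```python
from typing import List


def frames_to_refine(
    t_f: int, frame_len: int = 1024, num_sub: int = 4,
    min_pre_subs: int = 1, min_post_subs: int = 1,
) -> List[int]:
    """Frame indices that get the increased temporal resolution for one detection at t_f."""
    frame, offset = divmod(t_f, frame_len)
    sub = offset // (frame_len // num_sub)
    frames = {frame}
    if sub < min_pre_subs and frame > 0:
        frames.add(frame - 1)  # too few refined sub-intervals before t_f in its own frame
    if num_sub - 1 - sub < min_post_subs:
        frames.add(frame + 1)  # too few refined sub-intervals after t_f in its own frame
    return sorted(frames)


print(frames_to_refine(4200))  # first sub-interval: previous frame refined too -> [3, 4]
print(frames_to_refine(4400))  # second sub-interval: previous frame may stay normal -> [4]
```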
(60) To summarize,
(61) It should be noted that
(62) For example, one set of bandwidth extension parameters may be provided for each of the frames 620a to 620d and 620f. Moreover, one set of bandwidth extension information may be provided for each of the frames 720a, 720b, 720c, 720f. However, sets of bandwidth extension parameters may be provided with an increased temporal resolution at least for a predetermined period of time before a time at which an onset of a fricative or affricate is detected and for a predetermined period of time following the time at which the onset of the fricative or affricate is detected. For example, sets of bandwidth extension parameters are provided with increased temporal resolution for the frame 620e. For example, a total of four sets of bandwidth extension parameters may be provided for the frame 620e such that the temporal resolution is increased in the sub-frame 630a preceding the sub-frame 630b in which the onset or offset of the fricative or affricate is detected. Moreover, two more sets of bandwidth extension parameters may be provided for sub-frames 630c and 630d.
(63) A similar concept is apparent from
(64) To conclude, bandwidth extension parameters may be provided with an increased temporal resolution at least for a predetermined period of time before a time at which an onset of a fricative or affricate is detected and for a predetermined period of time following the time at which the onset of the fricative or affricate is detected. Moreover, the bandwidth extension parameters may also be provided with increased temporal resolution for a portion of the audio content in which an offset of a fricative or affricate is detected.
(65) 2. Audio Encoder According to
(67) The audio encoder 800 is configured to receive an input audio information 810 and to provide, on the basis thereof, an encoded audio information 812.
(68) The audio encoder 800 comprises a detector 820 configured to detect an offset of a fricative or affricate. The detector 820 provides, for example, a temporal resolution adjustment information 822. Moreover, the audio encoder 800 comprises a bandwidth extension information provider 830 which is configured to provide bandwidth extension information 832 using a variable temporal resolution. The audio encoder is configured to adjust the temporal resolution used by the bandwidth extension information provider 830 such that the bandwidth extension information 832 is provided with an increased temporal resolution (when compared to a “normal” temporal resolution) in response to a detection of an offset of a fricative or affricate. In other words, the temporal resolution which is used by the bandwidth extension information provider 830 is increased if the detector 820 detects an offset of a fricative or affricate, such that the offset of the fricative or affricate is encoded with comparatively high (higher than normal) temporal resolution of the bandwidth extension information (or bandwidth extension parameters) 832. Moreover, the audio encoder 800 comprises a low frequency encoding 840 which may provide an encoded representation 842 of a low frequency portion of an audio content represented by the input audio information 810.
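For the audio encoder 800, only offsets trigger the increased temporal resolution. Given a per-block decision whether a fricative or affricate is active (for example from a tilt measure), onsets and offsets are the rising and falling edges of that decision, as sketched below; the block-wise representation and the name `onsets_and_offsets` are assumptions.

```python
from typing import List, Sequence, Tuple


def onsets_and_offsets(active: Sequence[bool]) -> Tuple[List[int], List[int]]:
    """Block indices where fricative activity starts (onset) and stops (offset)."""
    onsets, offsets = [], []
    previous = False
    for i, now in enumerate(active):
        if now and not previous:
            onsets.append(i)   # rising edge
        elif previous and not now:
            offsets.append(i)  # falling edge: first inactive block
        previous = now
    return onsets, offsets


onsets, offsets = onsets_and_offsets([False, True, True, False, False, True])
print(onsets, offsets)  # [1, 5] [3]
```

An encoder following paragraph (68) would react to the second list (offsets) only, whereas the audio encoder 100 reacts to onsets and, optionally, to offsets as well.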
(69) Moreover, it should be noted that the detector 820 may be similar to the detector 120 described above, and that the bandwidth extension information provider 830 may be similar (or even equal) to the bandwidth extension information provider 130 described above. Moreover, the low frequency encoding 840 may be similar, or even equal, to the low frequency encoding 140 described above.
(70) Moreover, the audio encoder 800 is configured to adjust the temporal resolution used by the bandwidth extension information provider 830 such that the bandwidth extension information 832 is provided with an increased temporal resolution in response to a detection of an offset of a fricative or affricate. Accordingly, an offset of a fricative or affricate is encoded with high temporal resolution (at least of the bandwidth extension information), which helps to avoid artifacts and results in a natural listening impression.
(71) However, it should be noted that the audio encoder 800 may, optionally, be provided with any of the other features described above with respect to the audio encoder 100, and also with respect to
(72) Moreover, it should be noted that the concepts according to
(73) 3. Audio Decoder According to
(75) However, the audio decoder 900 may be configured to perform the bandwidth extension with an increased temporal resolution at least for a predetermined period of time before a time at which an onset of a fricative or affricate is detected and for a predetermined period of time following the time at which the onset of the fricative or affricate is detected. Accordingly, a good audio quality may be achieved even for the onset of a fricative or affricate or for the offset of a fricative or affricate.
(76) It should be noted that the temporal resolution, which is used for the bandwidth extension, may be signaled using a side information which is included in the bandwidth extension information 932. For example, the signaling may be performed as described in Section 4.6.19 of International Standard ISO/IEC 14496-3. In particular, the signaling of the temporal resolution may be performed as described in Section 4.6.19.3.2 of ISO/IEC 14496-3, subpart 4. Thus, the bandwidth extension 930 may evaluate said signaling to decide which temporal resolution should be used for the bandwidth extension.
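On the decoder side, the signalled temporal resolution determines how finely the replicated high band is rescaled over time. The sketch below is a strongly simplified envelope adjustment with a single high band: for each signalled time segment it computes one gain that scales the patch to the transmitted energy. The argument layout, the single-band simplification and the name `envelope_gains` are assumptions; this is not the SBR high-frequency adjustment defined in ISO/IEC 14496-3.

```python
import math
from typing import List, Sequence


def envelope_gains(
    patch_power: Sequence[float],    # power of the replicated high band per time slot
    time_borders: Sequence[int],     # segment borders given by the signalled resolution
    target_energy: Sequence[float],  # transmitted envelope energy per segment
    eps: float = 1e-12,
) -> List[float]:
    """Per-slot gains that scale the patch to the transmitted mean energy of each segment."""
    gains = [0.0] * len(patch_power)
    segments = zip(time_borders[:-1], time_borders[1:])
    for (t0, t1), energy in zip(segments, target_energy):
        mean_power = sum(patch_power[t0:t1]) / (t1 - t0)
        gain = math.sqrt(energy / (mean_power + eps))
        for t in range(t0, t1):
            gains[t] = gain
    return gains


patch = [1.0] * 8
print(envelope_gains(patch, [0, 4, 8], [0.0, 4.0]))  # fine grid: silent, then onset
print(envelope_gains(patch, [0, 8], [2.0]))          # coarse grid: energy smeared over frame
```

With the finer grid the first half of the frame stays silent and the gain only rises at the onset; with the coarse grid a gain of about 1.41 is applied over the whole frame, spreading energy ahead of the onset, which corresponds to the pre-echo discussed for the conventional framing.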
(77) Alternatively, however, the audio decoder may be configured to detect an onset of a fricative or affricate or an offset of a fricative or affricate on the basis of the decoded low frequency portion of the audio content, which may be provided by the low frequency decoding 920. Accordingly, the audio decoder 900 may decide about the temporal resolution to be used for the bandwidth extension in a similar manner as the audio encoder described above. In such a case, no additional side information may be necessary for signaling the temporal resolution to be used for the bandwidth extension, which helps to reduce the bit rate.
(78) Regarding the functionality of the audio decoder 900, it should be noted that the functionality corresponds to the functionality of the audio encoder 100 according to
(79) 4. Audio Decoder According to
(81) The audio decoder 1000 is configured to receive an encoded audio information 1010 and to provide, on the basis thereof, a decoded audio information 1012. The audio decoder comprises a low frequency decoding 1020, which may be substantially equal to the low frequency decoding 920 described above. Moreover, the audio decoder 1000 comprises a bandwidth extension 1030, which may be substantially equal to the bandwidth extension 930 described above. However, the audio decoder 1000 is configured to perform the bandwidth extension on the basis of a bandwidth extension information 1032 provided by an audio encoder, such that the bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an offset of a fricative or affricate is detected and for a predetermined period of time following the time at which the offset of the fricative or affricate is detected. Accordingly, the audio decoder 1000 provides a decoded audio information in which offsets of fricatives or affricates are represented with good accuracy, so that artifacts are avoided.
(82) Moreover, it should be noted that the explanations provided above with respect to the audio decoder 900 also apply to the audio decoder 1000. In addition, it should be noted that the audio decoder 1000 can be supplemented by any of the features and functionalities described with respect to the audio decoder 900. Moreover, the audio decoder 1000 (as well as the audio decoder 900) can be supplemented by any of the features and functionalities described herein with respect to the audio encoder, since the audio decoding corresponds to the audio encoding described above.
(83) 5. System According to
(85) However, it should be noted that the audio encoder 1120 may be equal to the audio encoder 100 described with respect to
(86) Accordingly, the audio decoder may be configured to receive the encoded audio information provided by the audio encoder, and to provide, on the basis thereof, the decoded audio information 1150, such that the bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an onset of a fricative or affricate is detected and for a predetermined period of time following the time at which the onset of the fricative or affricate is detected and/or such that the bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an offset of a fricative or affricate is detected and for a predetermined period of time following the time at which the offset of the fricative or affricate is detected. Accordingly, a good quality reproduction of fricatives or affricates can be achieved.
(87) It should be noted that the system can be supplemented by any of the features and functionalities described above with respect to the audio encoders and audio decoders.
(88) 6. Method for Providing an Encoded Audio Information on the Basis of an Input Audio Information According to
(90) The method 1200 according to
(91) 7. Method for Providing a Decoded Audio Information According to
(93) The method 1300 further comprises performing 1320 a bandwidth extension on the basis of a bandwidth extension information provided by an audio encoder, such that a bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an onset of a fricative or affricate is detected and for a predetermined period of time following the time at which the onset of the fricative or affricate is detected and/or such that the bandwidth extension is performed with an increased temporal resolution at least for a predetermined period of time before a time at which an offset of a fricative or affricate is detected and for a predetermined period of time following the time at which the offset of the fricative or affricate is detected.
(94) The method 1300 is based on the same considerations as the above described audio encoder and the above described audio decoder. Moreover, it should be noted that the method 1300 can be supplemented by any of the features and functionalities described herein with respect to the audio decoder. Moreover, the method 1300 can also be supplemented by any of the features and functionalities described with respect to the audio encoder, taking into consideration that the decoding process is substantially an inverse of the encoding process.
(95) 8. Conclusions
(96) To conclude the above explanations, it should be noted that embodiments according to the invention relate to speech coding and particularly to speech coding using bandwidth extension (BWE) techniques. Embodiments according to the invention aim to enhance the perceptual quality of the decoded signal by detecting fricatives or affricates within the speech signal and adapting the temporal resolution of the bandwidth extension parameter driven post-processing accordingly (for example, by adapting a temporal resolution which is used for providing sets of bandwidth extension information). Embodiments according to the invention comprise detecting onsets and offsets of fricative or affricate signal portions of a speech signal and providing a temporally fine-grain bandwidth extension post-processing during the entire onset and offset period of these fricative or affricate signal portions (wherein the bandwidth extension processing may, for example, comprise a provision of said bandwidth extension information at the side of an audio encoder and may comprise performing a bandwidth extension at the side of the audio decoder). In this way, the occurrence of pre- and post-echo artifacts is reduced, and a sufficiently gentle onset and offset of fricative or affricate signal portions can be modeled by the fine-grain bandwidth extension parameters. Consequently, unpleasant auditory sharpness of fricatives or affricates and annoying pre- and post-echoes within the coded signal are avoided.
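The benefit of a finer temporal resolution at the onset of a fricative can be made concrete with a simplified numerical example. The sketch below assumes that the high band of one frame is described by 16 time slots of target energy, that a fricative starts at slot 10, and that the decoder spreads each transmitted envelope energy uniformly over its time slots. These numbers and this crude energy model are assumptions for illustration and are not the actual bandwidth extension synthesis. Energy placed before slot 10 corresponds to a pre-echo.

```python
def apply_envelope(target_energy, num_envelopes):
    """Average the per-slot target energy over equally long envelopes and
    spread each average uniformly over its slots (simplified model)."""
    n = len(target_energy)
    if n % num_envelopes:
        raise ValueError("number of slots must be divisible by num_envelopes")
    size = n // num_envelopes
    shaped = []
    for e in range(num_envelopes):
        segment = target_energy[e * size:(e + 1) * size]
        shaped.extend([sum(segment) / size] * size)
    return shaped


onset_slot = 10
original = [0.0] * onset_slot + [1.0] * 6   # silence, then a fricative
coarse = apply_envelope(original, 1)        # one envelope per frame
fine = apply_envelope(original, 4)          # four times higher resolution

print(sum(coarse[:onset_slot]))  # 3.75: pre-echo energy with the coarse envelope
print(sum(fine[:onset_slot]))    # 1.0: pre-echo energy with the fine envelopes
```

Both variants preserve the total energy of the frame (6.0); only its temporal distribution differs. The remaining pre-echo in the fine variant stems from the one envelope that straddles the onset.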
(97) Embodiments according to the invention outperform conventional solutions. For example, in [1] it is proposed to align the start time instant of a bandwidth extension parameter frame with the point in time of a spectral tilt change. A spectral tilt change might denote an onset or a sudden offset of a fricative or affricate signal portion. The alignment technique proposed in [1] prevents the occurrence of pre-echoes of fricatives or affricates within bandwidth extension methods. However, only fricative or affricate onsets are detected, and offsets are missed. Additionally, the above-mentioned technique does not account for fine-grain modeling of the spectro-temporal characteristics of the onsets and offsets of the individual fricatives or affricates. Hence, these can sound harsh and much too sharp.
(98) In the following, some embodiments and aspects according to the invention will be described.
(99) For example, an inventive bandwidth extension encoder comprises a fricatives or affricates detector and a bandwidth extension spectro-temporal resolution switcher.
(100) The fricatives or affricates detector is advantageously capable of detecting both onsets and offsets of fricatives or affricates. A suitable realization of such a detector with low computational complexity can, for example, be based on the evaluation of a zero crossing rate (ZCR) and an energy ratio (for details, see, for example, references [2] and [3]). The detector may additionally be connected to a speech/music discriminator in order to restrict the subsequent inventive processing to speech signals only.
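A low-complexity detector of this kind might look like the following Python sketch. The zero crossing rate is computed as the fraction of adjacent sample pairs with differing signs. As the energy ratio, the sketch uses the energy of the first-order difference signal (a simple high-pass filter) relative to the frame energy. This particular ratio, the thresholds and the function names are assumptions for illustration and are not taken from references [2] or [3]. The is_speech flag stands in for the output of a speech/music discriminator.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def energy_ratio(frame):
    """Energy of the first-order difference (a crude high-pass) divided by the
    frame energy; roughly 2 * (1 - cos(w)) for a sinusoid at radian frequency w."""
    total = sum(x * x for x in frame)
    if total == 0.0:
        return 0.0
    return sum((b - a) ** 2 for a, b in zip(frame, frame[1:])) / total


def is_fricative(frame, is_speech=True, zcr_threshold=0.3, ratio_threshold=1.0):
    """Hypothetical per-frame decision: high ZCR and high-frequency-dominated."""
    if not is_speech:  # gate by a speech/music discriminator
        return False
    return (zero_crossing_rate(frame) > zcr_threshold
            and energy_ratio(frame) > ratio_threshold)


low_tone = [math.sin(2 * math.pi * 2 * n / 64) for n in range(64)]
nyquist_tone = [(-1.0) ** n for n in range(64)]
print(is_fricative(low_tone), is_fricative(nyquist_tone))  # False True
```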
(101) In some embodiments, a certain temporal look-ahead of the detector is desirable or even necessary in order to switch the bandwidth extension resolution in time, such that fine-grain temporal resolution is employed within the bandwidth extension parameter estimation/synthesis during the entire length of the onset and offset signal portions. The duration of the onset or offset signal portions can either be measured signal-adaptively or be assumed to be fixed to an empirically determined value. For example, the number of time intervals or time sub-intervals which are processed with high temporal resolution in response to a detection of a fricative or affricate onset or of a fricative or affricate offset can be predetermined or adjusted in dependence on signal characteristics. For example, a detected fricative or affricate might activate a four times higher temporal resolution during a group of several consecutive signal frames (e.g., two or three frames) that fully encompass the detected fricative or affricate onset or offset. Advantageously, but not necessarily, the group of high temporal resolution signal frames is approximately centered with respect to the detected fricative or affricate onset or offset, thereby covering the entire duration of the onset or offset. In case of a transient-adaptive bandwidth extension framing, the activation of a higher temporal resolution during an entire group of signal frames triggered by the fricative or affricate detection supersedes the transient-adaptive framing.
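The frame-group rule might be implemented as in the following sketch. It assumes one envelope per frame as the default resolution, a four times higher resolution inside the group, a group of three frames centered on the frame containing the detected onset or offset, and a dictionary of envelope counts already chosen by a transient-adaptive framing. These names and values are illustrative assumptions. Because the group starts before the event frame, the encoder needs a detector look-ahead of group_size // 2 frames, which reflects the look-ahead requirement mentioned above.

```python
def plan_envelopes(num_frames, event_frames, transient_envelopes,
                   base_envelopes=1, factor=4, group_size=3):
    """Per-frame envelope counts (hypothetical encoder-side resolution switcher).

    transient_envelopes: {frame index: envelope count} from a transient-adaptive
    framing; frames not listed use base_envelopes.
    event_frames: frames containing a detected fricative/affricate on- or offset.
    """
    plan = [transient_envelopes.get(f, base_envelopes) for f in range(num_frames)]
    half = group_size // 2
    for e in event_frames:
        first = e - half  # approximately center the group on the event frame
        for f in range(max(0, first), min(num_frames, first + group_size)):
            plan[f] = base_envelopes * factor  # supersedes the transient framing
    return plan


def lookahead_frames(group_size=3):
    """Frames of detector look-ahead needed before the event frame."""
    return group_size // 2


print(plan_envelopes(8, [4], {1: 2, 4: 2}))  # [1, 2, 1, 4, 4, 4, 1, 1]
print(lookahead_frames(3))                   # 1
```

In the usage example, frame 4 had been assigned two envelopes by the transient-adaptive framing, but the fricative group overrides it, while frame 1 outside the group keeps its transient-driven setting.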
(102) In the following, some details regarding figures will be discussed.
(107) To conclude, the spectrograms discussed here indicate that an audio quality can be substantially improved by applying the concept according to the present invention.
(108) To further conclude, embodiments according to the invention create an audio encoder or a method of audio encoding or a related computer program, as described above.
(109) Further embodiments according to the invention create an audio decoder or a method of audio decoding or a related computer program as described above.
(110) Moreover, embodiments according to the invention create an encoded audio signal or storage medium having stored the encoded audio signal as described above.
(111) 9. Implementation Alternatives
(112) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
(113) The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(114) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
(115) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(116) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(117) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(118) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(119) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
(120) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(121) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(122) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(123) A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
(124) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
(125) The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
(126) The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
(127) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
REFERENCES
(128) [1] United States patent application publication US 20110099018, “Apparatus and Method for Calculating Bandwidth Extension Data Using a Spectral Tilt Controlled Framing.”
(129) [2] D. Ruinskiy, N. Dadush, and Y. Lavner, “Spectral and textural feature-based system for automatic detection of fricatives and affricates,” IEEE 26th Convention of Electrical and Electronics Engineers in Israel (IEEEI), pp. 771-775, 2010.
(130) [3] H. Fujihara and M. Goto, “Three techniques for improving automatic synchronization between music and lyrics: Fricative detection, filler model, and novel feature vectors for vocal activity detection,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, USA, 2008.