DETERMINING A LIGHT EFFECT BASED ON A DEGREE OF SPEECH IN MEDIA CONTENT
20220053618 · 2022-02-17
Inventors
- Tobias Borra (Rijswijk, NL)
- Dzmitry Viktorovich Aliakseyeu (Eindhoven, NL)
- Antonie Leonardus Johannes Kamp (San Francisco, CA, US)
CPC Classification
International Classification
Abstract
A method comprises obtaining (101) media content information and obtaining (103, 109) information indicating a degree of speech in the audio portion. The media content information comprises the media content and/or information determined by analyzing the media content and the degree of speech is determined based on an analysis of an audio portion of the media content. The method further comprises determining (107, 113) an extent to which the audio portion should be used to determine one or more light effects to be rendered while the media content is being rendered and determining (117) these light effects. The extent is determined based on the degree of speech and the light effects are determined based on an analysis (115) of the audio portion in dependence on the extent and based on an analysis of a video portion of the media content.
Claims
1. A system for determining one or more light effects to be rendered while media content is being rendered, said one or more light effects being determined based on an analysis of said media content, said system comprising: at least one input interface; at least one output interface; and at least one processor configured to: use said at least one input interface to obtain media content, determine one or more light effects to be rendered on one or more light sources while said media content is being rendered, said one or more light effects being determined based on: an analysis of an audio portion of said media content, and an analysis of a video portion of said media content, and use said at least one output interface to control said one or more light sources to render said one or more light effects, wherein the processor is further configured to: obtain information indicating a degree of speech in said audio portion, said degree of speech being determined based on said analysis of said audio portion; determine an extent to which said audio portion should be used to determine said one or more light effects, said extent being determined based on said determined degree of speech; and determine a brightness and/or chromaticity of said one or more light effects based on an intensity and/or a loudness of said audio portion in dependence upon the determined extent to which said audio portion should be used to determine said one or more light effects.
2. A system as claimed in claim 1, wherein said degree of speech in said audio portion is determined by determining an amount of speech in said audio portion and classifying said audio portion as predominantly speech or predominantly non-speech based on said amount of speech.
3. A system as claimed in claim 2, wherein said at least one processor is configured to determine a first extent as said extent in dependence on said audio portion being classified as predominantly speech and determine a second extent as said extent in dependence on said audio portion being classified as predominantly non-speech, said second extent indicating that a brightness and/or chromaticity of said one or more light effects should be determined based on an intensity and/or loudness of said audio portion and said first extent indicating that a brightness and/or chromaticity of said one or more light effects should not be determined based on an intensity and/or loudness of said audio portion.
4. A system as claimed in claim 2, wherein said at least one processor is configured to determine said one or more light effects using a first brightness and/or chromaticity range in dependence on said audio portion being classified as predominantly speech and using a second brightness and/or chromaticity range in dependence on said audio portion being classified as predominantly non-speech, said first brightness and/or chromaticity range having a lower average brightness and/or chromaticity than said second brightness and/or chromaticity range.
5. A system as claimed in claim 1, wherein said degree of speech in said audio portion is determined by classifying said audio portion as a class of a plurality of classes, said plurality of classes comprising at least two of: conversation, whispering, screaming, narration, singing, diegetic speech, and non-diegetic speech.
6. A system as claimed in claim 5, wherein said at least one processor is configured to determine a first extent as said extent in dependence on said audio portion being classified as conversation and determine a second extent as said extent in dependence on said audio portion being classified as singing, said second extent indicating that a brightness and/or chromaticity of said one or more light effects should be determined based on an intensity and/or loudness of said audio portion and said first extent indicating that a brightness and/or chromaticity of said one or more light effects should not be determined based on an intensity and/or loudness of said audio portion.
7. A system as claimed in claim 5, wherein said one or more light effects comprise a plurality of light effects and said at least one processor is configured to determine a speed of transitions between said plurality of light effects in dependence on said class.
8. A system as claimed in claim 5, wherein said audio portion is classified by analyzing a spectral composition of said audio portion.
9. A system as claimed in claim 1, wherein said one or more light effects comprise a plurality of light effects and said at least one processor is configured to determine whether an amount of speech in said audio portion exceeds a threshold and determine a speed of transitions between said plurality of light effects in dependence on said amount of speech exceeding said threshold.
10. A system as claimed in claim 1, wherein said at least one processor is configured to determine words spoken in said audio portion by recognizing said spoken words in said audio portion and/or obtaining said spoken words from subtitles associated with said media content.
11. A system as claimed in claim 1, wherein said at least one processor is configured to determine said degree of speech by using subtitles associated with said media content and/or by focusing on a center channel in or obtained from said audio portion.
12. A lighting system comprising the system of claim 1 and one or more light sources.
13. A method of determining one or more light effects to be rendered while media content is being rendered, said one or more light effects being determined based on an analysis of said media content, said method comprising: obtaining media content; determining one or more light effects to be rendered on one or more light sources while said media content is being rendered, said one or more light effects being determined based on an analysis of an audio portion of said media content and an analysis of a video portion of said media content; and controlling said one or more light sources to render said one or more light effects, wherein the method further comprises: obtaining information indicating a degree of speech in said audio portion, said degree of speech being determined based on an analysis of said audio portion; determining an extent to which said audio portion should be used to determine one or more light effects, said extent being determined based on said determined degree of speech; and wherein a brightness and/or chromaticity of said one or more light effects is based on an intensity and/or a loudness of said audio portion in dependence upon the determined extent to which said audio portion should be used to determine said one or more light effects.
14. A non-transitory computer readable medium comprising at least one software code portion or a computer program product storing at least one software code portion, the software code portion, when run on a computer system, being configured for enabling the method of claim 13 to be performed.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0043] These and other aspects of the invention are apparent from and will be further elucidated, by way of example, with reference to the drawings, in which:
[0054] Corresponding elements in the drawings are denoted by the same reference numeral.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0055]
[0056] A TV 27 is also connected to the wireless LAN access point 23. Media content may be rendered by the mobile device 1 or by the TV 27, for example. The wireless LAN access point 23 is connected to the Internet 24. An Internet server 25 is also connected to the Internet 24. The mobile device 1 may be a mobile phone or a tablet, for example. The mobile device 1 may run the Philips Hue Sync app, for example. The mobile device 1 comprises a processor 5, a receiver 3, a transmitter 4, a memory 7, and a display 9. In the embodiment of
[0057] In the embodiment of
[0058] The processor 5 is further configured to determine one or more light effects to be rendered on one or more light sources, e.g. one or more of light sources 13-17 or not yet identified light sources, while media content is being rendered. The one or more light effects are determined based on an analysis of the audio portion in dependence on the extent and determined at least based on an analysis of a video portion of the media content. The processor 5 is further configured to use the transmitter 4 to control one or more of light sources 13-17 to render the one or more light effects and/or use an internal interface (not shown) to output a light script specifying the one or more light effects to memory 7.
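The patent leaves the format of the light script written to memory 7 open. The following is a hypothetical minimal representation of the output of the light-effect determination; all field names are invented for illustration:

```python
# Hypothetical light-script sketch; the patent does not specify this format.
from dataclasses import dataclass, asdict
import json

@dataclass
class LightEffect:
    start_ms: int      # position in the media content at which the effect starts
    light_id: str      # which of the light sources 13-17 renders the effect
    brightness: float  # 0.0-1.0
    rgb: tuple         # chromaticity as an RGB triple (illustrative encoding)

def to_light_script(effects) -> str:
    """Serialize the determined light effects so they can be stored, e.g. in memory 7."""
    return json.dumps([asdict(e) for e in effects])
```

A script produced this way could later be replayed in synchrony with the media content instead of analyzing the content in real time.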
[0059] The extent may indicate whether a brightness and/or chromaticity of the one or more light effects should be determined based on an intensity and/or a loudness of the audio portion, for example. Depending on the algorithm used for light effects creation, different ways of applying the speech classification could be envisioned:
[0060] Transition speed. If colors for light effects creation are extracted from predefined analysis areas within the on-screen content (as is done in HueSync, for example), the speech classification can be used to influence the speed of the transitions between the light effects that render the extracted colors.
[0061] Chromaticity. Colors extracted from the screen may, when translated to light effects, be desaturated to more pastel colors or saturated to more vibrant colors.
[0062] Brightness. Like the above, but instead of saturation, brightness may be adapted.
[0063] Extraction algorithm. Instead of modifying colors extracted from the on-screen content, the speech classification could control which algorithm is used to select colors, which colors are selected, and from which analysis areas.
[0064] Audio input. Typically, the intensity and chromaticity of the light are selected based on the intensity and chromaticity of the video signal. On top of that, additional intensity (i.e. brightness) modulation is often added based on the audio intensity and/or loudness. This makes certain effects, such as explosions, extra dramatic by intensifying the effect, or provides an effect at all when the event is detectable in the audio but not in the video. With speech, however, such intensity variation based on the audio signal is very much unwanted. This audio input is therefore enabled/disabled depending on whether speech is detected.
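The audio-input gating described above can be sketched as follows. This is an illustrative assumption about the combining step, not a disclosed implementation; the names and the modulation gain are hypothetical:

```python
# Hypothetical sketch: gate audio-based brightness modulation on speech detection.
def effect_brightness(video_intensity: float, audio_level: float,
                      speech_detected: bool, modulation_gain: float = 0.3) -> float:
    """Base brightness follows the video signal; audio loudness adds extra
    modulation only when no speech is detected in the audio portion."""
    brightness = video_intensity
    if not speech_detected:
        # e.g. an explosion that is loud in the audio boosts the effect
        brightness += modulation_gain * audio_level
    # clamp to the light source's usable range
    return max(0.0, min(1.0, brightness))
```

With speech detected, the light effect simply tracks the video intensity; without speech, a loud audio event raises the brightness up to the clamp.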
[0065] In the embodiment of the mobile device 1 shown in
[0066] The receiver 3 and the transmitter 4 may use one or more wireless communication technologies such as Wi-Fi (IEEE 802.11) to communicate with the wireless LAN access point 23, for example. In an alternative embodiment, multiple receivers and/or multiple transmitters are used instead of a single receiver and a single transmitter. In the embodiment shown in
[0067] In the embodiment of
[0068] In the embodiment of
[0069] In the embodiment of
[0070] A first embodiment of the method is shown in
[0071] Steps 103 and 109 comprise obtaining information indicating a degree of speech in the audio portion. The degree of speech is determined based on an analysis of an audio portion of the media content. Steps 107 and 113 comprise determining an extent to which the audio portion should be used to determine one or more light effects. The extent is determined based on the degree of speech determined in steps 103 and 109.
[0072] In the embodiment of
[0073] Step 143 comprises classifying the audio portion as predominantly speech or predominantly non-speech based on the amount of speech by determining whether there is speech in more than 50% of the audio portion. Next, a step 105 is performed. Step 105 comprises determining whether the audio portion has been classified as predominantly speech or as predominantly non-speech. If the audio portion has been classified as predominantly speech, step 151 is performed. If the audio portion has been classified as predominantly non-speech, step 153 is performed. Steps 151 and 153 are sub steps of step 107.
[0074] Step 151 comprises determining a first extent. The first extent indicates that a brightness and/or chromaticity of the one or more light effects should not be determined based on an intensity and/or loudness of the audio portion and that the one or more light effects should use a first brightness and/or chromaticity range. Step 109 is performed after step 151. Step 153 comprises determining a second extent. The second extent indicates that a brightness and/or chromaticity of the one or more light effects should be determined based on an intensity and/or loudness of the audio portion and that the one or more light effects should use a second brightness and/or chromaticity range. The first brightness and/or chromaticity range has a lower average brightness and/or chromaticity than the second brightness and/or chromaticity range. Step 115 is performed after step 153.
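Steps 141-153 can be sketched as follows. The 50% threshold and the two extents follow the description; the `Extent` container and the numeric brightness ranges are hypothetical:

```python
# Illustrative sketch of steps 141-153; the Extent structure and the numeric
# ranges are assumptions, not defined in the patent.
from dataclasses import dataclass

@dataclass
class Extent:
    use_audio_intensity: bool  # modulate brightness by audio loudness?
    brightness_range: tuple    # (min, max) brightness for the light effects

def determine_extent(speech_fraction: float) -> Extent:
    """Classify the audio portion as predominantly speech (speech in more than
    50% of the portion) or predominantly non-speech, and pick the extent."""
    if speech_fraction > 0.5:                        # step 143: predominantly speech
        return Extent(use_audio_intensity=False,     # step 151: first extent
                      brightness_range=(0.1, 0.5))   # dimmer, calmer range
    return Extent(use_audio_intensity=True,          # step 153: second extent
                  brightness_range=(0.3, 1.0))       # brighter, audio-driven range
```

The first range has the lower average brightness, matching the requirement that dialogue-heavy scenes produce subdued light effects.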
[0075] Step 109 comprises classifying the audio portion as a class of a plurality of classes. The plurality of classes comprises at least two of: conversation, whispering, screaming, narration and singing. In the embodiment of
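Claim 8 states that the audio portion is classified by analyzing its spectral composition, but the concrete features are not disclosed. The sketch below uses RMS energy and zero-crossing rate as stand-in features; the thresholds and the mapping to classes are assumptions for illustration only:

```python
# Hedged sketch of step 109: classify an audio frame into a speech class.
# Features and thresholds are illustrative assumptions, not the patent's method.
import math

def classify_speech_frame(samples: list) -> str:
    """Rough heuristic: whispering is quiet and noise-like (many zero
    crossings), screaming is loud; everything else is treated as conversation."""
    energy = math.sqrt(sum(s * s for s in samples) / len(samples))  # RMS level
    zero_crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    zcr = zero_crossings / len(samples)
    if energy > 0.5:
        return "screaming"
    if energy < 0.05 and zcr > 0.3:
        return "whispering"
    return "conversation"
```

A production classifier would more plausibly use full spectral features (e.g. a magnitude spectrum or MFCCs) over many frames, but the control flow that consumes the resulting class is the same.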
[0076] Next, a step 111 comprises determining in which class said audio portion has been classified, and steps 161 and 163 comprise determining a speed of transitions between the plurality of light effects in dependence on this class. Step 161 is performed if the audio portion is classified as conversation or whispering (group 1). Step 163 is performed if the audio portion is classified as screaming (group 2). The extent determined in step 151 is not modified if the audio portion is classified differently (group 3); in this case, step 115 is performed after step 111. A scene comprising a lot of conversation or a mother whispering to her baby is rendered using low dynamics, as indicated by the extent determined in step 161, whereas the same scene with a lot of screaming, e.g. a couple having a shouting argument, is rendered at higher dynamics, as indicated by the extent determined in step 163, even though the audio portion of both scenes may have an identical intensity and/or loudness.
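The mapping from speech class to transition speed can be sketched as a small lookup table. The class groupings follow the description; the numeric speeds are hypothetical:

```python
# Sketch of steps 111-163: map the speech class to a light-transition speed.
# Numeric speeds are illustrative placeholders.
LOW, DEFAULT, HIGH = 0.2, 0.5, 0.9   # transitions per second, illustrative units

SPEECH_CLASS_SPEEDS = {
    "conversation": LOW,   # step 161: low dynamics
    "whispering":   LOW,   # step 161: low dynamics
    "screaming":    HIGH,  # step 163: high dynamics
}

def transition_speed(speech_class: str) -> float:
    # Classes outside the table (e.g. narration, singing) leave the extent
    # from step 151 unmodified, here represented by a default speed.
    return SPEECH_CLASS_SPEEDS.get(speech_class, DEFAULT)
```

This realizes the behavior above: a whispering scene and a screaming scene with identical loudness still receive different dynamics, because the class rather than the intensity drives the speed.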
[0077] After the extent has been determined, i.e. one of steps 151 and 153 has been performed and, where applicable, one of steps 161 and 163 has been performed, step 115 is performed. Step 115 comprises analyzing the video portion of the media content, e.g. by performing color extraction, and analyzing the audio portion of the media content if step 153 has been performed.
[0078] Thus, the outcome of step 143 is that either 1) the audio is predominantly speech, or 2) the audio is predominantly non-speech. Based on this classification, the first level of light effect dynamics adjustment is made in steps 151 and 153. In general, scenes which focus on dialogue should result in lower-intensity light effects than scenes with a focus on visual aspects (otherwise the light effects may actually distract from the dialogue). Moreover, the dynamics of the audio signal should not be considered as an input for modulating the light effect intensity for speech, whereas for non-speech this may well be appropriate. If it is determined in step 105 that the audio portion has been classified as predominantly speech, the spectral content is further analyzed and classified into multiple categories in step 109, e.g. conversation, whispering and screaming. Based on this classification, the dynamics of the system are further adjusted in steps 161 and 163.
[0079] A step 117 comprises determining one or more light effects to be rendered on one or more light sources while the media content is being rendered. The one or more light effects are determined based on the analysis of the audio portion performed in step 115 if step 153 has been performed, but they are at least determined based on the analysis of the video portion performed in step 115. A step 119 comprises controlling the one or more light sources to render the one or more light effects. A step 121 comprises outputting a light script specifying the one or more light effects.
[0080] In this way, the method optimizes the behavior of the dynamic lighting system based on spectral analysis of the audio content. Low-level spectral analysis allows speech characteristics to be identified, such as ‘regular’ conversation, whispering, screaming etc. The system then applies this information to adaptively alter the dynamics of the lights to correspond with the scene content. Thus, the system enhances the media content by adjusting the lights in a meaningful manner, corresponding to the semantics of the content.
[0081] A second embodiment of the method is shown in
[0082] In the embodiment of
[0083] A third embodiment of the method is shown in
[0084] A fourth embodiment of the method is shown in
[0085] Step 403 comprises determining whether the amount of speech determined in step 141 exceeds a threshold. This threshold may be a percentage, for example. If this threshold is set to 50%, this results in a determination of whether the audio portion comprises predominantly speech or predominantly non-speech. However, the threshold may beneficially be set to a percentage lower or higher than 50%.
[0086] Step 405 is performed after step 403. Step 405 comprises sub steps 407 and 409. Step 407 is performed if it is determined in step 403 that the threshold has been exceeded. Step 409 is performed if it is determined in step 403 that the threshold has not been exceeded. Step 407 comprises determining a first extent. Step 409 comprises determining a second extent.
[0087] The first extent indicates a first speed of transitions between the plurality of light effects (i.e. a first dynamicity). The second extent indicates a second speed of transitions between the plurality of light effects. The second speed of transitions is higher than the first speed of transitions. Thus, light effects accompanying scenes containing more than a certain amount of speech are rendered using low dynamics, whereas light effects accompanying the same scene with less than this certain amount of speech, even though the audio portion of this scene may have an identical intensity and/or loudness, are rendered with higher dynamics.
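Steps 403-409 reduce to a single threshold comparison. The default threshold of 50% follows the description; the two speeds are hypothetical values:

```python
# Sketch of steps 403-409: compare the amount of speech against a threshold
# and select one of two transition speeds. Numeric values are illustrative.
def select_transition_speed(speech_percentage: float,
                            threshold: float = 50.0,
                            first_speed: float = 0.2,
                            second_speed: float = 0.8) -> float:
    """More speech than the threshold -> the first, lower speed (step 407);
    otherwise the second, higher speed (step 409)."""
    return first_speed if speech_percentage > threshold else second_speed
```

Lowering or raising `threshold` shifts how much speech a scene may contain before its light effects are slowed down.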
[0088] A fifth embodiment of the method is shown in
[0089] In a step 427, the mood of the scene is determined from the spoken words determined in step 421. In a step 429, it is determined whether the mood of the scene is emotionally charged or not. If the mood of the scene is emotionally charged, a higher speed of transitions between the plurality of light effects is selected as the extent in a step 433. If the mood of the scene is not emotionally charged, a lower speed of transitions between the plurality of light effects is selected as the extent in a step 435. Steps 433 and 435 are sub steps of step 431.
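The patent does not specify how the mood is derived from the spoken words; a simple keyword lookup stands in for the mood classifier below. The word list, fraction threshold, and speeds are all hypothetical:

```python
# Hypothetical sketch of steps 427-435: decide the transition speed from the
# emotional charge of the spoken words. The keyword set is a placeholder for
# a real sentiment or mood classifier.
EMOTIONALLY_CHARGED = {"help", "fire", "no", "run", "love", "hate"}

def mood_transition_speed(spoken_words: list,
                          charged_fraction: float = 0.1,
                          high_speed: float = 0.8,
                          low_speed: float = 0.3) -> float:
    words = [w.lower() for w in spoken_words]
    charged = sum(1 for w in words if w in EMOTIONALLY_CHARGED)
    if words and charged / len(words) >= charged_fraction:
        return high_speed   # step 433: emotionally charged scene
    return low_speed        # step 435: calm scene
```

The spoken words themselves could come from speech recognition or from the subtitles, as described for step 421.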
[0090] A sixth embodiment of the method is shown in
[0091]
[0092] While in the example of
[0093]
[0094] As shown in
[0095] The memory elements 504 may include one or more physical memory devices such as, for example, local memory 508 and one or more bulk storage devices 510. The local memory may refer to random access memory or other non-persistent memory device(s) generally used during actual execution of the program code. A bulk storage device may be implemented as a hard drive or other persistent data storage device. The processing system 500 may also include one or more cache memories (not shown) that provide temporary storage of at least some program code in order to reduce the number of times program code must be retrieved from the bulk storage device 510 during execution. The processing system 500 may also be able to use memory elements of another processing system, e.g. if the processing system 500 is part of a cloud-computing platform.
[0096] Input/output (I/O) devices depicted as an input device 512 and an output device 514 optionally can be coupled to the data processing system. Examples of input devices may include, but are not limited to, a keyboard, a pointing device such as a mouse, a microphone (e.g. for voice and/or speech recognition), or the like. Examples of output devices may include, but are not limited to, a monitor or a display, speakers, or the like. Input and/or output devices may be coupled to the data processing system either directly or through intervening I/O controllers.
[0097] In an embodiment, the input and the output devices may be implemented as a combined input/output device (illustrated in
[0098] A network adapter 516 may also be coupled to the data processing system to enable it to become coupled to other systems, computer systems, remote network devices, and/or remote storage devices through intervening private or public networks. The network adapter may comprise a data receiver for receiving data that is transmitted by said systems, devices and/or networks to the data processing system 500, and a data transmitter for transmitting data from the data processing system 500 to said systems, devices and/or networks. Modems, cable modems, and Ethernet cards are examples of different types of network adapters that may be used with the data processing system 500.
[0099] As pictured in
[0100] Various embodiments of the invention may be implemented as a program product for use with a computer system, where the program(s) of the program product define functions of the embodiments (including the methods described herein). In one embodiment, the program(s) can be contained on a variety of non-transitory computer-readable storage media, where, as used herein, the expression “non-transitory computer readable storage media” comprises all computer-readable media, with the sole exception being a transitory, propagating signal. In another embodiment, the program(s) can be contained on a variety of transitory computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., flash memory, floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored. The computer program may be run on the processor 502 described herein.
[0101] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0102] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of embodiments of the present invention has been presented for purposes of illustration, but is not intended to be exhaustive or limited to the implementations in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present invention. The embodiments were chosen and described in order to best explain the principles and some practical applications of the present invention, and to enable others of ordinary skill in the art to understand the present invention for various embodiments with various modifications as are suited to the particular use contemplated.