Device, system and method for identifying a scene based on an ordered sequence of sounds captured in an environment
11521626 · 2022-12-06
Assignee
Inventors
CPC classification
G10L19/008
PHYSICS
H04S3/008
ELECTRICITY
International classification
G10L19/008
PHYSICS
Abstract
An identification device, method and system for identifying a scene in an environment. The environment includes at least one sound capture device. The identification device is configured to identify the scene based on at least two sounds captured in the environment. Each of the at least two sounds are associated respectively with at least one sound class. The scene is identified by taking account of a chronological order in which the at least two sounds were captured.
Claims
1. An identification device for identifying a scene in an environment, said environment comprising at least one sound capture device, said identification device comprising: a processor; and a non-transitory computer-readable medium comprising instructions stored thereon which when executed by the processor configure the identification device to identify said scene among a group of predefined scenes, by: obtaining at least two sounds of a chronological series of sounds generated by movement and interactions between elements in the environment when the scene occurs, wherein the at least two sounds were captured in said environment by the at least one sound capture device at different time instances; associating each of said at least two sounds respectively with at least one sound class selected from among a plurality of sound classes; and identifying the scene from among the group of predefined scenes, wherein each predefined scene of the group is associated with a predetermined number of marker sounds arranged in chronological order, the identifying comprising: comparing the sound classes associated with the at least two sounds and a chronological order at which the at least two sounds were captured, with the predetermined number of marker sounds and the chronological order of the predetermined number of marker sounds of at least one of the predefined scenes.
2. The identification device for identifying the scene according to claim 1, wherein the instructions configure the processor to receive at least one piece of complementary data provided by a connected device from said environment and associate a label with the sound class of at least one of the captured sounds or with said identified scene.
3. The identification device for identifying the scene according to claim 2, wherein the instructions configure the processor to, in response to at least one of the captured sounds being associated with several sound classes, determine a sound class of the several sound classes for the at least one captured sound using said at least one piece of complementary data received.
4. The identification device for identifying the scene according to claim 1, wherein the instructions configure the processor to trigger at least one action to be performed following the identification of said scene.
5. The identification device for identifying the scene according to claim 1, wherein the instructions configure the processor to transmit to an enrichment device at least one part of the following data: a piece of information indicating the scene identified, and at least two sound classes and a chronological order associated with the identified scene, at least one part of audio files corresponding to the captured sounds associated respectively with a sound class, at least one sound class associated with a label.
6. An identification system for identifying a scene in an environment, said environment comprising at least one sound capture device, wherein said identification system comprises: a classification device configured to receive sounds captured by the at least one sound capture device in said environment, and determine, for each of the sounds received, at least one sound class selected from among a plurality of sound classes; and an identification device configured to identify said scene among a group of predefined scenes, by: obtaining at least two sounds of a chronological series of sounds generated by movement and interactions between elements in the environment when the scene occurs, wherein the at least two sounds were captured by the classification device at different time instances, identifying the scene from among the group of predefined scenes, wherein each predefined scene of the group is associated with a predetermined number of marker sounds arranged in chronological order, the identifying comprising: comparing the sound classes associated with the at least two sounds and a chronological order at which the at least two sounds were captured, with the predetermined number of marker sounds and the chronological order of the predetermined number of marker sounds of at least one of the predefined scenes.
7. The identification system for identifying the scene according to claim 6, further comprising an enrichment device, wherein: the identification device is configured to transmit to the enrichment device at least one part of the following data: a piece of information indicating the scene identified, and at least two sound classes and the chronological order associated with the identified scene, at least one part of audio files corresponding to the captured sounds associated respectively with a sound class, at least one sound class associated with a label; and the enrichment device is configured to update at least one database with at least one part of the data transmitted by the identification device.
8. An identification method for identifying a scene in an environment, said environment comprising at least one sound capture device, said method being performed by an identification device and comprising: identifying a scene among a group of predefined scenes, by: obtaining at least two sounds of a chronological series of sounds generated by movement and interactions between elements in the environment when the scene occurs, wherein the at least two sounds were captured in said environment by the at least one sound capture device at different time instances; associating each of said at least two sounds respectively with at least one sound class selected from among a plurality of sound classes; and identifying the scene from among the group of predefined scenes, wherein each predefined scene of the group is associated with a predetermined number of marker sounds arranged in chronological order, the identifying comprising: comparing the sound classes associated with the at least two sounds and a chronological order at which the at least two sounds were captured, with the predetermined number of marker sounds and the chronological order of the predetermined number of marker sounds of at least one of the predefined scenes.
9. The identification method according to claim 8, further comprising updating at least one database using at least one part of the following data: a piece of information indicating the scene identified, and at least two sound classes and the chronological order associated with the scene identified, at least one part of audio files corresponding to the sounds captured associated respectively with a sound class, at least one sound class associated with a label.
10. A non-transitory computer-readable medium comprising instructions stored thereon which when executed by a processor of an identification device configure the identification device to identify a scene among a group of predefined scenes, by: obtaining at least two sounds of a chronological series of sounds generated by movement and interactions between elements in an environment when the scene occurs, wherein the at least two sounds were captured in the environment by at least one sound capture device at different time instances; associating each of said at least two sounds respectively with at least one sound class selected from among a plurality of sound classes; and identifying the scene from among the group of predefined scenes, wherein each predefined scene of the group comprises a predetermined number of marker sounds arranged in chronological order, the identifying comprising: comparing the sound classes associated with the at least two sounds and a chronological order at which the at least two sounds were captured, with the predetermined number of marker sounds and the chronological order of the predetermined number of marker sounds of at least one of the predefined scenes.
Description
4. LIST OF FIGURES
(1) Other characteristics and advantages of the invention will appear more clearly from the following description of particular embodiments, given by way of simple illustratory and non-exhaustive examples, and from the appended drawings of which:
5. DESCRIPTION OF AN EMBODIMENT OF THE INVENTION
(7) The invention proposes, through the successive identification of sounds captured in an environment, the establishment of a use case associated with those sounds.
(8) By “use case”, we mean here the combination of a context and an event. The context is defined by elements in the environment, such as the location, the stakeholders involved, the time of day (day/night), etc.
(9) The event is singular, occasional and transient. The event marks a transition or a breach in a situation encountered. For example, in a situation where a person is busy in a kitchen and is performing tasks to prepare a meal, an event could correspond to the moment when this person cuts his/her hand with a knife. According to this example, a use case is therefore defined by the context comprising the person present, the kitchen, and by the cutting accident event.
(10) Another example of a use case is for example a scene where an occupant is departing from their home. According to this example, the context comprises the occupant of the home, the location (home entrance), elements with which the occupant is likely to interact during this use case (closet, keys, shoes, clothes, etc.), and the event is the departure from the home.
(11) The invention identifies such use cases, defined by a context and an event, that occur in an environment. Such use cases are characterized by a chronological series of sounds generated by the movement of and interactions between the elements/persons in the environment when the use case occurs. These may be sounds that are specific to the context or to the event of the use case. It is through the successive identification of these sounds, according to the chronological order in which they are captured, that the use case can be determined.
(12) From here on, the terms “situation”, “use case” and “scene” will be used interchangeably.
(13) Described hereafter is
(14) The environment illustrated in
(15) A network of sound capture means is located in the environment. Such sound capture means (C1, C2, C3) are for example microphones embedded into the various pieces of equipment situated in the environment. For example, in the case where the environment corresponds to a home, this could be microphones embedded into mobile terminals when the user who owns the terminal is at home, microphones embedded into terminals such as a computer, tablets, etc., and microphones embedded into all types of connected devices such as connected radio, connected television, personal assistant, terminals embedding microphone systems dedicated to sound recognition, etc.
(16) Described here is the method according to the invention using three microphones. However, the method according to the invention can also be implemented with a single microphone.
(17) Generally, the network of sound capture means can comprise all types of microphones embedded into computer or multimedia equipment already in place in the environment or specifically placed for sound recognition. The system according to the invention can use microphones already located in the environment for other uses. It is therefore not always necessary to specifically place microphones in the environment.
(18) In the particular embodiment described here, the environment also comprises IoT connected devices, for example a personal assistant, a connected television or a tablet, home automation equipment, etc.
(19) The system SYS to collect and analyze sounds communicates with the capture means and possibly with the IoT connected devices via a local network RES, for example a WiFi network of a home gateway (not represented).
(20) The invention is not limited to this type of communication mode. Other communication modes are also possible. For example, the system SYS to collect and analyze sounds can communicate with the capture means and/or the IoT connected devices through Bluetooth or via a wired network.
(21) According to one variant, the local network RES is connected to a larger data network INT, for example the Internet via the home gateway.
(22) According to the invention, the system SYS to collect and analyze sounds identifies, from sounds captured in the environment, a scene or a use case.
(23) In the particular embodiment described here, the system SYS to collect and analyze sounds comprises in particular: a classification module CLASS, an interpretation module INTRP, an audio file database BSND.sub.loc, a sound class database BCLSND.sub.loc, a label database BLBL.sub.loc, a use case database BSC.sub.loc.
(24) The classification module CLASS receives (step E20) audio flows originating from capture means. For this, a specific application can be installed in the equipment in the environment that includes microphones, so that this equipment transmits the audio flow from the sound it captures. Such a transmission can be carried out continuously, or at regular intervals, or when a sound of a certain amplitude is detected.
(25) Following the reception of an audio flow, the classification module CLASS analyzes the audio flow received to determine (step E21) the sound class or classes corresponding to the sound received, via one or several prediction models derived from machine learning. The sounds from the audio file database BSND.sub.loc are matched with sound classes memorized in the sound class database BCLSND.sub.loc. The classification module determines the sound class or classes corresponding to the sound received by selecting the sound class or classes associated with a sound from the database that is close to the sound received. The classification module therefore provides as output at least one class CL.sub.i of sounds associated with the sound received, with a probability rate P.sub.i. The sound classes selected for an analyzed sound must satisfy an acceptable, predetermined probability threshold. In other words, the only sound classes selected are those for which the probability rate that the sound received corresponds to a sound associated with the sound class is higher than a predetermined threshold.
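The thresholding step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names, scores, and the threshold value of 0.6 are assumptions chosen for the example.

```python
# Hypothetical sketch of step E21's output filtering: keep only the
# sound classes CL_i whose probability rate P_i exceeds a predetermined
# threshold, most probable first. Names and values are illustrative.

PROBABILITY_THRESHOLD = 0.6  # assumed acceptable threshold

def select_sound_classes(predictions, threshold=PROBABILITY_THRESHOLD):
    """predictions: dict mapping sound class -> probability rate P_i.
    Returns (class, probability) pairs above the threshold,
    ordered by decreasing probability."""
    selected = [(cls, p) for cls, p in predictions.items() if p > threshold]
    return sorted(selected, key=lambda cp: cp[1], reverse=True)

# Example: a captured sound scored against three candidate classes.
scores = {"door_opening": 0.82, "closet_opening": 0.65, "rain": 0.10}
print(select_sound_classes(scores))
# -> [('door_opening', 0.82), ('closet_opening', 0.65)]
```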
(26) The sound classes and their associated probability are then transmitted to the interpretation module INTRP in order for it to identify the scene that is occurring. For this, the interpretation module relies on a set of use cases stored in the use case database BSC.sub.loc.
(27) A use case is defined in the form of N marker sounds, with N being a positive integer greater than or equal to 2.
(28) The use cases are predefined in an experimental manner and built using a succession of sounds characterizing each step of the scene. For example, in the case of a scene of a departure from home, the following succession of sounds was built: sound of a closet opening, sound of a coat being put on, sound of a closet closing, sound of footsteps, sound of a door opening, sound of a door closing, sound of a door being locked. Each scene construction was submitted to visually impaired persons to determine the relevance of the sounds/steps chosen and to determine the marker sounds making it possible to identify the scene.
(29) The experiment made it possible to identify that a number of three marker sounds is sufficient to identify a scene and to identify, for each scene, the marker sounds that characterize it, among the sounds in the succession of sounds built during the experiment.
(30) In the particular embodiment of the invention described here, we therefore consider that N=3. Other values are possible however. The number of marker sounds can depend on the complexity of the scene to be identified. In other variants, only two marker sounds can be used, or additional marker sounds (N>3) can be added in order to define a scene or distinguish scenes that are acoustically too close. The number of marker sounds used to identify a scene can also vary in relation to the scene to be identified. For example, certain scenes could be defined by two marker sounds, other scenes by three marker sounds, etc. In this variant, the number of marker sounds is not fixed.
(31) The use case database BSC.sub.loc was then filled with the defined scenes, each scene being characterized by three marker sounds according to a chronological order.
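One possible in-memory representation of such a use case database is sketched below. The scene names and sound classes are illustrative assumptions; only the structure (each scene mapped to N = 3 marker sounds in chronological order) follows the description above.

```python
# Illustrative representation of the use case database BSC_loc: each
# predefined scene is stored with its three marker sounds in
# chronological order. Scene and class names are assumptions.

BSC_loc = {
    "departure_from_home": ("closet_opening", "door_closing", "door_locking"),
    "meal_preparation": ("fridge_opening", "chopping", "pan_sizzling"),
}

# In this embodiment, every scene is characterized by N = 3 ordered
# marker sounds; in variants, N may differ per scene.
assert all(len(markers) == 3 for markers in BSC_loc.values())
```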
(32) According to one particular embodiment of the invention, the scenes defined in the use case database BSC.sub.loc can come from a larger use case database BSC, for example predefined by a service provider according to the experiment described here above or any other method. The scenes memorized in the use case database BSC.sub.loc may have been pre-selected by the user, for example during an initialization phase. This variant makes it possible to adapt the possible use cases to be identified for a user in relation to their habits or their environment.
(33) In order to identify a scene in progress, the interpretation module INTRP therefore relies on a succession of sounds received and analyzed by the classification module CLASS. For each sound received by the classification module CLASS, the latter transmits to the interpretation module INTRP at least one class associated with the sound received and an associated probability.
(34) The interpretation module compares (step E22) the succession of sound classes recognized by the classification module, in the chronological order of capture of the corresponding sounds, with the marker sounds characterizing each scene from the use case database BSC.sub.loc.
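The comparison of step E22 amounts to checking whether a scene's marker sounds occur, in order, within the chronological series of recognized sound classes. A minimal sketch of this ordered-subsequence test follows; the scene and class names are assumptions, and a real implementation would also weigh the probability rates.

```python
# Hedged sketch of step E22: a scene matches when its marker sounds
# appear as an ordered subsequence of the recognized sound classes
# (other sounds may intervene between markers).

def occurs_in_order(markers, recognized):
    """True if every marker sound appears in `recognized`,
    in the same chronological order."""
    it = iter(recognized)
    return all(marker in it for marker in markers)

def identify_scene(recognized, use_cases):
    """Return the names of predefined scenes whose marker sounds
    match the chronological series of recognized classes."""
    return [name for name, markers in use_cases.items()
            if occurs_in_order(markers, recognized)]

use_cases = {
    "departure_from_home": ["closet_opening", "door_closing", "door_locking"],
}
recognized = ["closet_opening", "footsteps", "door_closing", "door_locking"]
print(identify_scene(recognized, use_cases))  # -> ['departure_from_home']
```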
(35) According to one particular embodiment of the invention, the interpretation module INTRP also takes account of complementary data transmitted (step E23) to it by connected devices (IoT) placed in the environment. Such complementary data can for example be information on the location of the captured sound, temporal information (time, day/night), temperature, or service-type information: for example, home automation information indicating that a light is switched on or a window is open, weather information provided by a server, etc. According to the particular embodiment of the invention described here, labels or qualifiers are predefined and stored in the label database BLBL.sub.loc. These labels depend on the type and value of the complementary data likely to be received. For example, labels of the type day/night are defined for complementary data corresponding to a schedule, labels of the type hot/cold/moderate are defined for complementary data corresponding to temperature values, and labels representing location can be defined for complementary data corresponding to the location of the captured sound.
(36) In certain cases, the complementary data can also correspond directly to a label. For example, when the sound received by the classification module was transmitted by a connected device, the connected device can transmit, with the audio flow, a location label corresponding to its location.
(37) The complementary data make it possible to qualify (i.e. describe semantically) a sound class or an identified scene. For example, for a captured sound corresponding to flowing water, information on the location of the captured sound will make it possible to qualify the sound class using a label associated with the location (for example: shower, kitchen, etc.). According to this example, the interpretation module INTRP can then qualify the sound class associated with a sound received.
(38) According to another example, for a captured sound associated with two sound classes that are acoustically close, therefore with relatively close probability rates, information on the location of the captured sound will make it possible to determine the most likely sound class. For example, a label associated with location will help differentiate a sound class corresponding to water flowing from a faucet from a sound class corresponding to rain.
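The faucet-versus-rain example above can be sketched as a tie-breaking rule: when two candidate classes have close probability rates, a location label decides. The hint table, margin value, and class names below are illustrative assumptions.

```python
# Hedged sketch of disambiguating acoustically close sound classes
# using a location label (complementary data of step E23). The hint
# mapping and the 0.1 tie margin are assumptions for the example.

LOCATION_HINTS = {
    "bathroom": "water_from_faucet",
    "kitchen": "water_from_faucet",
    "outdoor": "rain",
}

def disambiguate(candidates, location, margin=0.1):
    """candidates: (sound_class, probability) pairs, most probable
    first. When the top two probabilities are within `margin`, prefer
    the class hinted at by the location label, if present."""
    if len(candidates) >= 2 and candidates[0][1] - candidates[1][1] < margin:
        hinted = LOCATION_HINTS.get(location)
        for cls, _p in candidates:
            if cls == hinted:
                return cls
    return candidates[0][0]

print(disambiguate([("rain", 0.55), ("water_from_faucet", 0.52)], "kitchen"))
# -> water_from_faucet
```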
(39) As output, the interpretation module provides the identified scene and an associated probability rate. Indeed, as for the identification of a sound class corresponding to a captured sound, the identification of a scene is performed by comparing captured sounds with the marker sounds characterizing a use case. The captured sounds are not identical to the marker sounds, as the marker sounds may have been generated by elements other than those of the environment. In addition, the ambient noise of the environment can also impact the sound analysis.
(40) The interpretation module also provides as output, for each sound class identified by the classification module, complementary data such as the identified scene, the data provided by the connected devices, and the files of the captured sounds.
(41) According to one particular embodiment of the invention, when a scene has been identified, the interpretation module INTRP transmits (step E24) the identification of the scene to a system of actuators ACT connected to the system SYS via the local network RES, or via the data network INT when the system of actuators is not located in the environment. The system of actuators makes it possible to act in relation to the identified scene, by performing the actions associated with the scene. For example, this may concern triggering an alarm upon identification of an intrusion, notifying an emergency service upon identification of an accident, or quite simply arming the alarm upon identification of a departure from the home.
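The scene-to-action mapping of step E24 can be sketched as a simple dispatch table. The scene names and action identifiers below are illustrative assumptions, not part of the patented system.

```python
# Illustrative sketch of step E24: each identified scene maps to the
# action(s) the actuator system ACT should perform. Names are assumed.

ACTIONS = {
    "intrusion": ["trigger_alarm"],
    "accident": ["notify_emergency_service"],
    "departure_from_home": ["arm_alarm"],
}

def on_scene_identified(scene, dispatch):
    """Invoke `dispatch` for every action associated with the scene;
    unknown scenes trigger nothing."""
    for action in ACTIONS.get(scene, []):
        dispatch(action)

performed = []
on_scene_identified("departure_from_home", performed.append)
print(performed)  # -> ['arm_alarm']
```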
(42) According to one particular embodiment of the invention, the system SYS to collect and analyze sounds also comprises an enrichment module ENRCH. The enrichment module ENRCH updates (step E25) the sound database BSND.sub.loc, the sound class database BCLSND.sub.loc, the use case database BSC.sub.loc, and the label database BLBL.sub.loc using information provided at output by the interpretation module (INTRP).
(43) The enricher can therefore help to enrich the databases using sound files of captured sounds, making it possible to improve the analysis of subsequent sounds performed by the classification module and to improve the identification of a scene, by increasing the number of sounds associated with a sound class. The enricher also makes it possible to enrich the databases using the labels obtained, for example by associating, with a captured sound memorized in the sound database BSND.sub.loc, the label obtained for this sound, which is memorized in the label database.
(44) The enrichment module makes it possible to enrich in a dynamic manner the data necessary for learning by the system SYS to improve the performance of this system.
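The enrichment step E25 can be sketched as follows: after a scene is identified, the captured audio file is filed under its sound class and any obtained label is recorded. The database field names and file name are illustrative assumptions.

```python
# Minimal sketch of step E25: feed a captured sound, its class, and
# any obtained label back into the local databases so that subsequent
# classifications benefit. Field names are illustrative.

def enrich(databases, audio_file, sound_class, label=None):
    """Append the captured sound to BSND_loc under its sound class,
    and record the label for this sound in BLBL_loc, if any."""
    databases.setdefault("BSND_loc", {}).setdefault(sound_class, []).append(audio_file)
    if label is not None:
        databases.setdefault("BLBL_loc", {})[audio_file] = label

dbs = {}
enrich(dbs, "capture_001.wav", "water_from_faucet", label="kitchen")
print(dbs["BSND_loc"]["water_from_faucet"])  # -> ['capture_001.wav']
```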
(45) In the example described here, the sound database BSND.sub.loc, the sound class database BCLSND.sub.loc, the use case database BSC.sub.loc and the label database BLBL.sub.loc are local. They are for example stored in the memory of the classification module or the interpretation module, or in a memory connected to these modules.
(46) In other particular embodiments of the invention, the sound database BSND.sub.loc, the sound class database BCLSND.sub.loc, the use case database BSC.sub.loc and the label database BLBL.sub.loc can be remote. The system SYS to collect and analyze sounds accesses these databases, for example via the data network INT.
(47) The sound database BSND.sub.loc, the sound class database BCLSND.sub.loc, the use case database BSC.sub.loc and the label database BLBL.sub.loc can comprise all or part of larger remote databases BSND, BCLSND, BSC and BLBL, for example existing databases or provided by a service provider.
(48) These remote databases can be used to initialize the local databases of the system SYS and be updated using information collected by the system SYS on identification of a scene. In this way, the system SYS to collect and analyze sounds makes it possible to enrich the sound database, the sound class database, the use case database and the label database for other users.
(49) According to the particular embodiment described here above, the classification, interpretation and enrichment modules have been described as separate entities. However, all or part of these modules can be embedded into one or several devices as will be seen here below in relation to
(51) According to one particular embodiment of the invention, the device DISP has the classic architecture of a computer, and comprises in particular a memory MEM and a processing unit UT, equipped for example with a processor PROC and driven by the computer program PG stored in the memory MEM. The computer program PG comprises instructions to implement the steps of the method for identifying a scene as described previously, when the program is executed by the processor PROC. At initialization, the code instructions of the computer program PG are for example loaded into a memory before being executed by the processor PROC. The processor PROC of the processing unit UT implements, in particular, the steps of the method for identifying a scene according to one of the particular embodiments described in relation to
(52) The device DISP is configured for identifying a scene based on at least two sounds captured in said environment, each of said at least two sounds being associated respectively with at least one sound class, said scene being identified by taking account of the chronological order in which said at least two sounds were captured. For example, the device DISP corresponds to the interpretation module described in relation to
(53) According to one particular embodiment of the invention, the device DISP comprises a memory BDDLOC comprising a sound database, a sound class database, a use case database and a label database.
(54) The device DISP is configured for communicating with a classification module configured for analyzing sounds received and transmitting one or more sound classes associated with a sound received, and possibly with an enrichment module configured for enriching databases such as sound databases, sound class databases, use case databases and label databases.
(55) According to one particular embodiment of the invention, the device DISP is also configured for receiving at least one piece of complementary data provided by a connected device in the environment and associating a label with a sound class of a captured sound or with said identified scene.