IMPROVED SYNCHRONIZATION OF A PRE-RECORDED MUSIC ACCOMPANIMENT ON A USER'S MUSIC PLAYING

20230082086 · 2023-03-16

Abstract

A method for synchronizing a pre-recorded music accompaniment to a music playing of a user. The music playing is captured by at least one microphone, which delivers an input acoustic signal feeding a processing unit, which includes a memory for storing data of the music accompaniment and provides an output acoustic signal based on the music accompaniment data to feed at least a loudspeaker playing the music accompaniment. The processing unit analyses the input acoustic signal to detect musical events and music tempo, compares the detected musical events to the music accompaniment data to determine at least a lag between the timings of the detected musical events and the musical events of the music accompaniment, and adapts a timing of the output acoustic signal based on the lag and a synchronization function calculated from a temporal variable, the user's music tempo, and the duration of compensation of the lag.

Claims

1-12. (canceled)

13. A method for synchronizing a pre-recorded music accompaniment to a music playing of a user, said user's music playing being captured by at least one microphone delivering an input acoustic signal feeding a processing unit, said processing unit comprising a memory for storing data of the pre-recorded music accompaniment and providing an output acoustic signal based on said pre-recorded music accompaniment data to feed at least one loudspeaker playing the music accompaniment for said user, wherein said processing unit: analyses the input acoustic signal to detect musical events in the input acoustic signal so as to determine a tempo in said user's music playing, compares the detected musical events to the pre-recorded music accompaniment data to determine at least a lag diff between a timing of the detected musical events and a timing of musical events of the played music accompaniment, said lag diff being to be compensated, adapts a timing of the output acoustic signal on the basis of: said lag diff and a synchronization function F given by:
F(x) = x²/w² + ($tempo - 2/w)*x + 1, if diff > 0
F(x) = -x²/w² + ($tempo + 2/w)*x - 1, if diff < 0,
where x is a temporal variable, $tempo is the determined tempo in said user's music playing, and w is a duration of compensation of said lag diff.

14. The method according to claim 13, wherein said music accompaniment data defines a music score and wherein variable x is a temporal value corresponding to a duration of a variable number of beats of said music score.

15. The method according to claim 13, wherein w has a duration of at least one beat on a music score defined by said music accompaniment data.

16. The method according to claim 13, wherein the duration w is chosen.

17. The method according to claim 13, wherein, said accompaniment data defining a music score, a position pos of the musician playing on said score is forecast by a linear relation defined as pos(x)=$tempo*x, where x is a number of music beats counted on said music score, and if a lag diff is detected, said synchronisation function F(x) is used so as to define a number of beats x.sub.diff corresponding to said lag time diff such that:
F(x.sub.diff)−pos(x.sub.diff)=diff.

18. The method according to claim 17, wherein a prediction is determined on the basis of said synchronisation function F(x), until a next beat x.sub.diff+w by applying a transformation function A(t), given by:
A(t)=F(t−t.sub.0+x.sub.diff)+p, where p is a current position of the musician playing on the music score at current time t.sub.0.

19. The method according to claim 13, wherein, said accompaniment data defining a music score, the processing unit further estimates a future position of the musician playing on said music score at a future synchronization time t.sub.sync, and determines a tempo (e2) of the music accompaniment to apply to the output acoustic signal until said future synchronization time t.sub.sync.

20. The method of claim 19, wherein a prediction is determined on the basis of said synchronisation function F(x), until a next beat x.sub.diff+w by applying a transformation function A(t), given by:
A(t)=F(t−t.sub.0+x.sub.diff)+p, where p is a current position of the musician playing on the music score at current time t.sub.0, and wherein said tempo of the music accompaniment to apply to the output acoustic signal, noted ctempo, is determined as the derivative of A(t) at current time t.sub.0 such that:
ctempo=A′(t.sub.0)=F′(x.sub.diff).

21. The method according to claim 13, wherein the determination of said musical events in said input acoustic signal comprises: extracting acoustic features from said input acoustic signal, using said stored data of the pre-recorded music accompaniment to determine musical events at least in the accompaniment, and assigning musical events to said input acoustic features, on the basis of the musical events determined from said stored data.

22. A device for synchronizing a pre-recorded music accompaniment to a music playing of a user, comprising a processing unit to perform the method as claimed in claim 13.

23. A computer-readable medium comprising instructions which, when executed by a processing unit, cause the processing unit to carry out the method according to claim 13.

Description

[0050] More details and advantages of embodiments are given in the detailed specification hereafter and appear in the annexed drawings where:

[0051] FIG. 1 shows an example of embodiment of a device to perform the aforesaid method,

[0052] FIG. 2 is an example of algorithm comprising steps of the aforesaid method according to an embodiment,

[0053] FIGS. 3a and 3b show an example of a synchronization Time-Map using the synchronization function F(x) and the corresponding musician time-map.

[0054] The present disclosure proposes to solve the problem of synchronizing a pre-recorded accompaniment to a musician in real-time. To this aim, a device DIS (as shown in the example of FIG. 1 which is described hereafter) is used.

[0055] The device DIS comprises, in an embodiment, at least:
[0056] an input interface INP,
[0057] a processing unit PU, including a storage memory MEM and a processor PROC cooperating with the memory MEM, and
[0058] an output interface OUT.

[0059] The memory MEM can store, inter alia, instruction data of a computer program according to the present disclosure.

[0060] Furthermore, music accompaniment data are stored in the processing unit (for example in the memory MEM). The music accompaniment data are then read by the processor PROC so as to drive the output interface OUT to feed at least one loudspeaker SPK (a speaker enclosure or an earphone) with an output acoustic signal based on the pre-recorded music accompaniment data.

[0061] The device DIS further comprises a Machine Listening Module MLM, which can be implemented as independent hardware (as shown with dashed lines in FIG. 1), or alternatively can share hardware with the processing unit PU (i.e. the same processor and possibly the same memory unit).

[0062] A user US can hear the accompaniment music played by the loudspeaker SPK and can play a music instrument along with the accompaniment music, thus emitting a sound captured by a microphone MIC connected to the input interface INP. The microphone MIC can be incorporated in the user's instrument (such as in an electric guitar) or separate (for recording voice or acoustic instruments). The captured sound data are then processed by the machine listening module MLM and more generally by the processing unit PU.

[0063] More particularly, the captured sound data are processed so as to identify a delay or an advance of the music played by the user compared to the accompaniment music, and then to adapt the playing speed of the accompaniment music to the user's playing. For example, the tempo of the accompaniment music can be adapted accordingly. The time difference detected by the module MLM between the accompaniment music and the music played by the user is hereafter called the "lag" at current time t and is noted diff.
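As an illustration of the sign convention assumed in the sketches below (the text only states that a positive diff corresponds to the accompaniment being diff beats ahead), the lag can be read as a simple difference of score positions:

```python
def compute_lag(accompaniment_pos_beats: float, musician_pos_beats: float) -> float:
    """Lag diff in beats at current time t (hypothetical helper):
    diff > 0 when the accompaniment is ahead of the musician,
    diff < 0 when it is behind, 0 when both are synchronized."""
    return accompaniment_pos_beats - musician_pos_beats
```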

[0064] More particularly, musical events can be detected in real-time by the machine listening module MLM, which then outputs tuples of musical events and tempo data pertaining to the real-time detection of such events from a music score. This embodiment can be similar for example to the one disclosed in Cont (2010). In the embodiment where the machine listening module MLM has hardware separate from the processing unit PU, the module MLM is thus exchangeable and can be any module that provides "events" and, optionally, the tempo, in real-time, on a given music score, by listening to a musician playing.

[0065] As indicated above, the machine listening module MLM operates preferably "in real-time", ideally with a lag of less than 15 milliseconds, which corresponds to a perceptual threshold (the ability to react to an event) for most current listening algorithms.

[0066] Thanks to the pre-recorded accompaniment music data on the one hand, and to tempo recognition in the musician's playing on the other hand, the processing unit PU performs a dynamic synchronization. At each real-time instance t, it (the PU) takes as input its own previous predictions at a previous time t−ε, and the incoming event and tempo from the machine listening module. The resulting output is an accompaniment time-map that contains predictions at time t.
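For illustration only, a Time-Map as discussed here can be sketched as an anchor point relating real time to score position, extended by a constant-tempo prediction; the names below are illustrative and not taken from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class TimeMapPoint:
    time_s: float           # real time of the anchor point, in seconds
    position_beats: float   # score position at that time, in beats
    tempo_bps: float        # local tempo, in beats per second

def predict_position(point: TimeMapPoint, t: float) -> float:
    """Constant-tempo (linear) prediction of the score position at real time t."""
    return point.position_beats + point.tempo_bps * (t - point.time_s)

# Example: at t = 10 s the musician is at beat 16, playing 2 beats per second.
anchor = TimeMapPoint(time_s=10.0, position_beats=16.0, tempo_bps=2.0)
print(predict_position(anchor, 11.5))  # -> 19.0
```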

[0067] The synchronization is dynamic and adaptive thanks to prediction outputs at time t, based on a dynamically computed lag-dependent window (hereafter noted w). A dynamic synchronization strategy is introduced whose output is mathematically guaranteed to converge at a later time t_sync. The synchronization anticipation horizon t_sync itself depends on the lag computed at time t with regard to the previous instance and on feedback from the environment.

[0068] The results of the adaptive synchronization strategy are to be consistent (same setup leads to same synchronization prediction). The adaptive synchronization strategy should also adapt to an interactive context.

[0069] The device DIS takes as live input the musician's events and tempo, and outputs predictions for a pre-recorded accompaniment, having both the pre-recorded accompaniment and the music score at its disposal prior to launch. The role of the device DIS is to use the musician's Time-Map (resulting from the live input) and construct a corresponding Synchronization Time-Map dynamically.

[0070] Instead of relying on a constant window length (as in the state of the art), the parameter w is interpreted here as a stiffness parameter. Typically, w can correspond to a fixed number of beats of the score (for example one beat, corresponding to a quarter note of a 4/4 measure). Its current time value tv can be given at the real tempo of the accompaniment (tv=w*real tempo), which however does not necessarily correspond to the current musician tempo. The prediction window length w is determined dynamically (as detailed below with reference to FIG. 3) as a function of the current lag diff at time t and ensures convergence by a later synchronization time t_sync.

[0071] In an embodiment, a synchronization function F is introduced, whose role is to help construct the synchronization time-map and to compensate the lag diff in an ideal setup where the tempo is supposed to be, over a short time-frame, a constant value. Given the musician's position p (on a music score) and the musician's tempo, noted hereafter "$tempo", at time t, F is a quadratic function that joins the Time-Map points (0, 1) and (w, w*$tempo) and whose derivative equals the parameter $tempo at the end of the window. The lag at time t between the musician's real-time musical position on the music score and that of the accompaniment track on the same score (both in beats) is denoted as diff. Therefore, the parameter diff reflects exactly the difference between the position on the music score (in beats) of the detected musician's event in real-time and the position on the music score (in beats) of the accompaniment music that is to be synchronized.

[0072] It is shown here that the synchronization function F can be expressed as follows:

[00002]
F(x) = x²/w² + ($tempo - 2/w)*x + 1, if diff > 0
F(x) = -x²/w² + ($tempo + 2/w)*x - 1, if diff < 0

[0073] and, if diff=0, F(x) simply becomes F(x)=$tempo*x, where $tempo is the real tempo value provided by the module MLM and w is a prediction window corresponding to the time taken to compensate the lag diff until the next adjustment of the music accompaniment to the musician's playing.
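A minimal sketch of this function, directly transcribing the piecewise expression above (parameter names are illustrative; $tempo is written tempo):

```python
def F(x: float, w: float, tempo: float, diff: float) -> float:
    """Synchronization function from the expression above (sketch).
    x: temporal variable in beats; w: compensation window in beats;
    tempo: the musician's tempo $tempo; diff: current lag in beats."""
    if diff > 0:
        return x**2 / w**2 + (tempo - 2.0 / w) * x + 1.0
    if diff < 0:
        return -(x**2) / w**2 + (tempo + 2.0 / w) * x - 1.0
    return tempo * x  # diff == 0: no correction needed

# The quadratic joins (0, ±1) to (w, w*tempo) and its derivative at x = w equals tempo:
w, tempo = 1.0, 2.0
print(F(0.0, w, tempo, 0.5), F(w, w, tempo, 0.5))  # -> 1.0, 2.0 (= w*tempo)
```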

[0074] It is shown furthermore that, for any event detected at time t with the accompaniment lag diff beats ahead, there is a single solution x.sub.diff of the equation F(x)−$tempo*x=diff. This unique solution defines the adaptive context on which predictions are computed and re-defines the portion of the accompaniment map from x.sub.diff as:


A(t)=F(t−t.sub.0+x.sub.diff)+p
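The preceding paragraph states that F(x)−$tempo*x=diff has a single solution x.sub.diff. Substituting the quadratic expression of F given above, the $tempo terms cancel and the equation reduces to (x/w−1)²=|diff|, which gives a closed form; the sketch below assumes |diff| ≤ 1 so that the solution lies within the window [0, w]:

```python
import math

def solve_x_diff(w: float, diff: float) -> float:
    """Solution of F(x) - tempo*x = diff for the quadratic F above (sketch).
    The equation reduces to (x/w - 1)**2 = abs(diff); the root inside [0, w]
    is w*(1 - sqrt(abs(diff))), assuming abs(diff) <= 1."""
    return w * (1.0 - math.sqrt(abs(diff)))

print(solve_x_diff(w=1.0, diff=0.25))  # -> 0.5 beats
```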

[0075] A detailed explanation of the adaptation function A(t) is given hereafter.

[0076] By construction, the synchronizing accompaniment Time-Map converges in position and tempo at time t_sync=t+w−x.sub.diff to the musician Time-Map. This mathematical construction ensures continuity of tempo until a synchronization time t_sync.

[0077] FIG. 3 shows the adaptive dynamic synchronization for updating the accompaniment Time-Map, at a time t where an event is detected and the initial lag of the accompaniment is diff beats ahead (FIG. 3a). The accompaniment map from t is defined as a translated portion of the function F. The synchronization Time-Map constructed by F(x) is depicted in FIG. 3a and its translation to the Musician Time-Map in FIG. 3b. Position and tempo converge at time t_sync, assuming the musician tempo remains constant in that interval. This Time-Map is constantly re-evaluated at each interaction of the system with the human musician. The continuity of tempo until time t_sync can be noticed.

[0078] A simple explanation of FIG. 3 can be given as follows. From the previous prediction, a forecast position pos that the musician playing should have (counted in beats x) is determined by a linear relation such as pos(x)=$tempo*x. This corresponds to the oblique dashed line of FIG. 3a. However, a lag diff is detected between the position p of the musician playing and the forecast position pos. The synchronization function F(x) is calculated as defined above and x.sub.diff is calculated such that F(x.sub.diff)−pos(x.sub.diff)=diff. A prediction can then be determined, on the basis of F(x), until the next beat x.sub.diff+w. This corresponds to the dashed-line rectangle of FIG. 3a. This "rectangle" of FIG. 3a is then imported into the musician time-map of FIG. 3b and translated by applying the transformation function A(t), given by:


A(t)=F(t−t.sub.0+x.sub.diff)+p,

[0079] where p is the current position of the musician playing on the score at current time t.sub.0. Then A(t) can be computed to give the right position that the musician playing should have at a future time t.sub.sync. Until at least this synchronization time t.sub.sync, the tempo of the accompaniment is adapted. It corresponds to a new slope e.sub.2 (oblique dashed line of FIG. 3b), to be compared with the previous slope e.sub.1. The corrected tempo ctempo can thus be given as the derivative of A(t) at current time t.sub.0, or:


ctempo=A′(t.sub.0)=F′(x.sub.diff)

[0080] which is known analytically.
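Putting the previous elements together, a self-contained sketch of A(t) and of the analytical corrected tempo ctempo = F′(x.sub.diff) (written x_diff below) could look as follows; names and the numeric example are illustrative:

```python
import math

def F(x: float, w: float, tempo: float, diff: float) -> float:
    """Synchronization function (same quadratic as above, repeated to keep this sketch self-contained)."""
    if diff > 0:
        return x**2 / w**2 + (tempo - 2.0 / w) * x + 1.0
    if diff < 0:
        return -(x**2) / w**2 + (tempo + 2.0 / w) * x - 1.0
    return tempo * x

def A(t: float, t0: float, x_diff: float, p: float, w: float, tempo: float, diff: float) -> float:
    """Accompaniment prediction in the musician time-map: A(t) = F(t - t0 + x_diff) + p."""
    return F(t - t0 + x_diff, w, tempo, diff) + p

def ctempo(x_diff: float, w: float, tempo: float, diff: float) -> float:
    """Corrected accompaniment tempo, known analytically as F'(x_diff)."""
    sign = 1.0 if diff > 0 else (-1.0 if diff < 0 else 0.0)
    return sign * 2.0 * x_diff / w**2 + tempo - sign * 2.0 / w

# Illustrative values: accompaniment 0.25 beat ahead, musician tempo 2 beats/s, w = 1 beat.
w, tempo, diff, p, t0 = 1.0, 2.0, 0.25, 10.0, 0.0
x_diff = w * (1.0 - math.sqrt(abs(diff)))        # solution of F(x) - tempo*x = diff
t_sync = t0 + w - x_diff                         # synchronization time from the text
print(ctempo(x_diff, w, tempo, diff))            # -> 1.0 (slower, since the accompaniment is ahead)
print(A(t_sync, t0, x_diff, p, w, tempo, diff))  # predicted position at t_sync -> 12.0
```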

[0081] Referring now to FIG. 2, step S1 starts with receiving the input signal related to the musician playing. In step S2, acoustic features are extracted from the input signal so as to identify musical events in the musician's playing which are related to events in the music score defined in the pre-recorded music accompaniment data. In step S3, the timing of the latest detected event is compared to the timing of the corresponding event in the score, and the time lag diff corresponding to the timing difference is determined.

[0082] On the basis of that time lag and a chosen duration w (typically the duration of a chosen number of beats in the music score), the synchronization function F(x) can be determined in step S4. Then, in step S5, x.sub.diff is determined as the sole solution of F(x.sub.diff)−$tempo*x.sub.diff=diff.

[0083] The determination of x.sub.diff then makes it possible to use the transformation function A(t), which is determined in step S6, so as to shift from the synchronization map to the musician time-map as explained above with reference to FIGS. 3a and 3b. In the musician time-map, in step S7, the tempo of the output signal played on the basis of the pre-recorded accompaniment data can be corrected (from slope e1 to slope e2 of FIG. 3b) so as to smoothly adjust the position on the music score of the output signal to the position of the input signal at the next synchronization time t.sub.sync, as shown in FIG. 3b. Once that synchronization time t.sub.sync is reached in step S8 (arrow Y from test S8), the process can be carried out again by extracting new features from the input signal.
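A condensed sketch of one cycle of steps S3 to S7 (assuming the machine listening of steps S1-S2 has already produced the lag diff and the musician tempo; all names and example values are illustrative):

```python
import math

def synchronization_cycle(tempo: float, diff: float, w: float, t0: float):
    """Returns (ctempo, t_sync): the corrected accompaniment tempo to apply from
    t0 (step S7) and the next synchronization time (test S8)."""
    if diff == 0.0:
        return tempo, t0                               # already synchronized
    # Step S5: unique solution of F(x) - $tempo*x = diff (assumes abs(diff) <= 1)
    x_diff = w * (1.0 - math.sqrt(abs(diff)))
    # Steps S6-S7: corrected tempo ctempo = A'(t0) = F'(x_diff)
    sign = 1.0 if diff > 0 else -1.0
    ctempo = sign * 2.0 * x_diff / w**2 + tempo - sign * 2.0 / w
    # Time at which the accompaniment map re-converges with the musician time-map
    t_sync = t0 + w - x_diff
    return ctempo, t_sync

# Accompaniment 0.25 beat ahead, musician tempo 2 beats/s, window w = 1 beat:
print(synchronization_cycle(tempo=2.0, diff=0.25, w=1.0, t0=0.0))  # -> (1.0, 0.5)
```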

[0084] Qualitatively, this embodiment contributes to reaching the following advantages:
[0085] It resolves the consistency issue of the state of the art. It adapts to initial lags automatically and adapts its horizon based on context. The mathematical formalism is bijective with the solution. This means that identical musician Time-Maps lead to the same synchronization trajectories, whereas with a traditional constant window this value would differ based on context and parameters.
[0086] The method ensures tempo continuity at time t_sync, whereas the state of the art demonstrates discontinuities in all available methods.
[0087] The adaptive strategy provides a compromise between the two extremes described above as tight and loose, within a single framework. The tight strategy corresponds to low values of the stiffness parameter w whereas the loose strategy corresponds to higher values of w.
[0088] The strategy is computationally efficient: as long as the prediction time-map does not change, accompaniment synchronization is computed only once using the accompaniment time-map. The state of the art requires computations and predictions at every stage of interaction regardless of change.

[0089] Moreover, high-level musical knowledge can be integrated into the synchronization mechanism in the form of Time-Maps. To this end, predictions are extended to non-linear curves on Time-Maps. This extension allows formalisms for integrating musical expressivity such as accelerandi and fermatas (i.e. with an adaptive tempo) and other common expressive musical specifications of a performer's timing. This addition also enables the possibility of automatically learning such parameters from existing data.
[0090] It enables the addition of high-level musical knowledge, if available, into the existing framework using a mathematical formalism with proof of convergence, overcoming the hand-engineered methods of the usual prior art.
[0091] It extends the "constant tempo" approximation of the usual prior art, which leads to piece-wise linear predictions, to more realistic non-linear tempo predictions.
[0092] It enables the possibility of automatically learning prediction time-maps, either from the musician or from pre-recorded accompaniments, to leverage expressivity.

[0093] Additional latencies are usually imposed by hardware implementations and network communications. Compensating this latency in an interactive setup cannot be reduced to a simple translation of the reading head (as done in over-the-air audio/video streaming synchronization). The value of such latency can vary from 100 milliseconds to 1 second, which is far beyond acceptable psychoacoustic limits of the human ear. The synchronization strategy optionally takes this value as input, and anticipates all output predictions based on the interactive context. As a result, and for relatively small values of latency (in the mid-range of 300 ms, corresponding to most Bluetooth and AirMedia streaming formats), it is not necessary for the user to adjust the lag prior to performance. The general approach, expressed here in "musical time" as opposed to "physical time", allows automatic adjustment of such a parameter.
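As a purely illustrative sketch (not the full strategy described here), anticipating the output by a measured latency expressed through the prediction map, rather than as a fixed offset of the reading head, could look as follows; the prediction map A is assumed to be available, for example from the sketches above:

```python
def anticipated_position(A, t: float, latency_s: float) -> float:
    """Schedule at real time t the position that the prediction map A assigns to
    t + latency_s, so the compensation follows the (possibly non-linear) map."""
    return A(t + latency_s)

# Toy example with a linear map at 2 beats per second and 300 ms of latency:
print(anticipated_position(lambda t: 2.0 * t, t=1.0, latency_s=0.3))  # -> 2.6
```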

[0094] More generally, this disclosure is not limited to the detailed features presented above as examples of embodiments; it encompasses further embodiments.

[0095] Typically, the wordings related to "playing the accompaniment" on a "loudspeaker" and the notion of "pre-recorded music accompaniment" are to be interpreted broadly. In fact, the method applies to any "continuous" media, including for example audio and video. Indeed, video-plus-audio content can be synchronized as well using the same method as presented above. Typically, the aforesaid "loudspeakers" can be replaced by an audio-video projection, and video frames can thus be interpolated as presented above, simply based on the position output of the synchronization prediction.