System and method for generation of musical notation from audio signal

11749237 · 2023-09-05


    Abstract

    A system for generation of a musical notation from an audio signal, the system comprising at least one processor configured to: obtain the audio signal from an audio source or a data repository; process the audio signal using first machine learning (ML) model(s) to generate a recognition result, wherein the recognition result is indicative of a pitch and a duration of a plurality of notes in the audio signal and their corresponding confidence scores; generate a preliminary musical notation using the recognition result; process the preliminary musical notation using second ML model(s) to determine whether the preliminary musical notation includes one or more errors; and when it is determined that the preliminary musical notation includes one or more errors, modify the preliminary musical notation to generate the musical notation that is error-free or has fewer errors than the preliminary musical notation.

    Claims

    1. A system for generation of a musical notation from an audio signal, the system comprising at least one processor configured to: generate a first training dataset that is to be employed to train at least one first machine learning (ML) model, wherein the first training dataset comprises at least one of: audio signals generated by at least one musical instrument, metadata of the audio signals generated by the at least one musical instrument; train the at least one first ML model using the first training dataset and at least one ML algorithm; obtain the audio signal from an audio source; process the audio signal using the trained at least one first ML model to generate a recognition result, wherein the recognition result is indicative of a pitch and a duration of a plurality of notes in the audio signal and their corresponding confidence scores, wherein the pitch refers to a frequency of a note and wherein the duration refers to a length of time for which the note is played; generate a preliminary musical notation using the recognition result; process the preliminary musical notation using at least one second ML model to determine whether the preliminary musical notation includes one or more errors, wherein when processing the preliminary musical notation using the at least one second ML model, the at least one processor is configured to: identify at least one phrase in the audio signal, based on a plurality of phrases in a plurality of audio signals belonging to a second training dataset using which the at least one second ML model is trained, wherein the at least one phrase comprises a sequence of notes that occurs between two rests; determine whether a pitch and/or a duration of the sequence of notes in the at least one phrase mismatches a pitch and/or a duration of notes in one or more of the plurality of phrases; and determine that the preliminary musical notation includes the one or more errors, when it is determined that the pitch and/or the duration of the sequence of notes in the at least one phrase mismatches the pitch and/or the duration of notes in one or more of the plurality of phrases belonging to the second training dataset; and when it is determined that the preliminary musical notation includes the one or more errors, modify the preliminary musical notation using the at least one second ML model, to generate the musical notation that is error-free or has fewer errors than the preliminary musical notation.

    2. The system according to claim 1, wherein when modifying the preliminary musical notation to generate the musical notation that is error-free or has fewer errors than the preliminary musical notation, the at least one processor is configured to: determine a required correction in the pitch and/or the duration of the sequence of notes in the at least one phrase, based on an extent of mismatch between the pitch and/or the duration of the sequence of notes in the at least one phrase and the pitch and/or the duration of notes in one or more of the plurality of phrases; and apply the required correction to the pitch and/or the duration of the sequence of notes in the at least one phrase.

    3. The system according to claim 1, wherein when it is determined that the pitch and/or the duration of the sequence of notes in the at least one phrase match with the pitch and/or the duration of notes in one or more of the plurality of phrases, the at least one processor is configured to: determine whether confidence scores associated with the pitch and/or the duration of the sequence of notes in the at least one phrase lie below a confidence threshold; and when it is determined that the confidence scores associated with the pitch and/or the duration of the sequence of notes in the at least one phrase lie below the confidence threshold, update the confidence scores to be greater than the confidence threshold.

    4. The system according to claim 1, wherein the at least one processor is further configured to detect a change in at least one of: a time signature of the preliminary musical notation, a key signature of the preliminary musical notation, a tempo marking of the preliminary musical notation, a type of the audio source, wherein upon detection of the change, the at least one processor triggers the processing of the preliminary musical notation using the at least one second ML model.

    5. The system according to claim 1, wherein the at least one processor is further configured to: generate a preliminary audio waveform of the audio signal using the recognition result; and modify the preliminary audio waveform to generate an audio waveform that is error-free or has fewer errors than the preliminary audio waveform.

    6. The system according to claim 1, wherein when obtaining the audio signal from the audio source, the at least one processor is configured to record the audio signal when the audio signal is played by the audio source or import a pre-recorded audio file from a data repository.

    7. The system according to claim 1, wherein prior to processing the audio signal using the at least one first ML model, the at least one processor is further configured to convert the audio signal into a plurality of spectrograms having a plurality of time windows.

    8. The system according to claim 1, wherein the at least one first ML model comprises a plurality of first ML models and the first training dataset comprises a plurality of subsets, each subset comprising at least one of: audio signals generated by one musical instrument, metadata of the audio signals generated by the one musical instrument, wherein each first ML model is trained using a corresponding subset.

    9. A method for generating a musical notation from an audio signal, the method comprising: generating a first training dataset that is employed for training at least one first machine learning (ML) model, wherein the first training dataset comprises at least one of: audio signals generated by at least one musical instrument, metadata of the audio signals generated by the at least one musical instrument; training the at least one first ML model using the first training dataset and at least one ML algorithm; obtaining the audio signal from an audio source; processing the audio signal using the trained at least one first ML model for generating a recognition result, wherein the recognition result is indicative of a pitch and a duration of a plurality of notes in the audio signal and their corresponding confidence scores, wherein the pitch refers to a frequency of a note and wherein the duration refers to a length of time for which the note is played; generating a preliminary musical notation using the recognition result; processing the preliminary musical notation using at least one second ML model to determine whether the preliminary musical notation includes one or more errors, wherein the step of processing the preliminary musical notation using the at least one second ML model comprises: identifying at least one phrase in the audio signal, based on a plurality of phrases in a plurality of audio signals belonging to a second training dataset using which the at least one second ML model is trained, wherein the at least one phrase comprises a sequence of notes that occurs between two rests; determining whether a pitch and/or a duration of the sequence of notes in the at least one phrase mismatches a pitch and/or a duration of notes in one or more of the plurality of phrases; and determining that the preliminary musical notation includes the one or more errors, when it is determined that the pitch and/or the duration of the sequence of notes in the at least one phrase mismatches the pitch and/or the duration of notes in one or more of the plurality of phrases belonging to the second training dataset; and upon determining that the preliminary musical notation includes the one or more errors, modifying the preliminary musical notation using the at least one second ML model for generating the musical notation that is error-free or has fewer errors than the preliminary musical notation.

    10. The method according to claim 9, wherein the step of modifying the preliminary musical notation for generating the musical notation that is error-free or has fewer errors than the preliminary musical notation comprises: determining a required correction in the pitch and/or the duration of the sequence of notes in the at least one phrase, based on an extent of mismatch between the pitch and/or the duration of the sequence of notes in the at least one phrase and the pitch and/or the duration of notes in one or more of the plurality of phrases; and applying the required correction to the pitch and/or the duration of the sequence of notes in the at least one phrase.

    11. The method according to claim 9, wherein the method further comprises detecting a change in at least one of: a time signature of the preliminary musical notation, a key signature of the preliminary musical notation, a tempo marking of the preliminary musical notation, a type of the audio source, wherein, upon detecting the change, the processing of the preliminary musical notation using the at least one second ML model is triggered.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    (1) One or more embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:

    (2) FIG. 1 illustrates a network environment in which a system for generation of a musical notation from an audio signal can be implemented, in accordance with an embodiment of the present disclosure;

    (3) FIG. 2 is a block diagram representing a system for generation of a musical notation from an audio signal, in accordance with an embodiment of the present disclosure;

    (4) FIG. 3 is an exemplary detailed process flow for generation of a musical notation from an audio signal, in accordance with an embodiment of the present disclosure; and

    (5) FIG. 4 is a flowchart listing steps of a method for generation of a musical notation from an audio signal, in accordance with an embodiment of the present disclosure.

    DETAILED DESCRIPTION

    (6) Referring to FIG. 1, illustrated is a network environment in which a system 100 for generation of a musical notation from an audio signal can be implemented, in accordance with an embodiment of the present disclosure. The network environment comprises the system 100, an audio source 102 and a data repository 104. The system 100 is communicatively coupled to the audio source 102 and the data repository 104.

    (7) Referring to FIG. 2, illustrated is a block diagram representing a system 200 for generation of a musical notation from an audio signal, in accordance with an embodiment of the present disclosure. The system 200 comprises at least one processor (depicted as a processor 202), which is configured to generate the musical notation from the audio signal.

    (8) FIGS. 1 and 2 are merely examples, which should not unduly limit the scope of the claims herein. A person skilled in the art will recognize many variations, alternatives, and modifications of embodiments of the present disclosure.

    (9) Referring to FIG. 3, illustrated is an exemplary detailed process flow for generation of a musical notation from an audio signal, in accordance with an embodiment of the present disclosure. At step 302, the audio signal is obtained from an audio source or a data repository. At step 304, the audio signal is processed using at least one first machine learning (ML) model to generate a recognition result, wherein the recognition result is indicative of a pitch and a duration of a plurality of notes in the audio signal and their corresponding confidence scores. At step 306, the audio signal is converted into a plurality of spectrograms having a plurality of time windows. At step 308, a preliminary musical notation is generated using the recognition result. At step 310, the preliminary musical notation is processed using at least one second ML model to determine whether the preliminary musical notation includes one or more errors, and when it is determined that the preliminary musical notation includes one or more errors, the preliminary musical notation is modified to generate the musical notation that is error-free or has fewer errors than the preliminary musical notation. At step 312, it is determined whether the confidence scores associated with the pitch and/or the duration of the sequence of notes in the at least one phrase lie below a confidence threshold, and when it is determined that the confidence scores associated with the pitch and/or the duration of the sequence of notes in the at least one phrase lie below the confidence threshold, the confidence scores are updated to be greater than the confidence threshold. At step 314, the musical notation of the audio signal is generated. At step 316, an audio waveform of the audio signal is generated.
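    The conversion of the audio signal into spectrograms over a plurality of time windows (step 306) can be illustrated by the following sketch. This is an editorial example, not the patent's implementation: it uses a plain NumPy short-time Fourier transform, and the window size, hop size, and sample rate are assumptions chosen for illustration.

```python
import numpy as np

def audio_to_spectrograms(signal, window_size=1024, hop_size=512):
    """Split the signal into overlapping time windows and return the
    magnitude spectrum of each window (one row per time window)."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop_size):
        frame = signal[start:start + window_size] * window
        # Magnitude of the real-input FFT: frequency content of this window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames)

# Usage: one second of a 440 Hz sine (concert A) sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = audio_to_spectrograms(np.sin(2 * np.pi * 440 * t))
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 1024  # FFT bin index -> frequency in Hz
```

    A downstream note-recognition model would then estimate pitch and duration from such time-windowed spectra; here the spectral peak lands within one frequency bin (15.625 Hz) of the true 440 Hz pitch.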

    (10) The aforementioned steps are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.

    (11) Referring to FIG. 4, illustrated is a flowchart listing steps of a method for generation of a musical notation from an audio signal, in accordance with an embodiment of the present disclosure. At step 402, the audio signal is obtained from an audio source or a data repository. At step 404, the audio signal is processed using at least one first machine learning (ML) model for generating a recognition result, wherein the recognition result is indicative of a pitch and a duration of a plurality of notes in the audio signal and their corresponding confidence scores. At step 406, a preliminary musical notation is generated using the recognition result. At step 408, the preliminary musical notation is processed using at least one second ML model to determine whether the preliminary musical notation includes one or more errors. At step 410, upon determining that the preliminary musical notation includes one or more errors, the preliminary musical notation is modified for generating the musical notation that is error-free or has fewer errors than the preliminary musical notation.
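    The phrase-based error check of steps 408-410 can be sketched as follows. This is only an illustrative stand-in, not the claimed system: the "at least one second ML model" is replaced by a nearest-phrase comparison against a hypothetical list of reference phrases, and a note is represented as a (pitch in Hz, duration in beats) pair with `None` marking a rest.

```python
REST = None  # marker for a rest in the recognized note stream

def split_into_phrases(notes):
    """Split a note stream into phrases, i.e. sequences of notes
    occurring between two rests."""
    phrases, current = [], []
    for note in notes:
        if note is REST:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(note)
    if current:
        phrases.append(current)
    return phrases

def correct_phrase(phrase, references):
    """Compare a phrase against same-length reference phrases and
    return the closest one; mismatched pitches/durations are thereby
    replaced (a crude stand-in for the second ML model)."""
    candidates = [r for r in references if len(r) == len(phrase)]
    if not candidates:
        return phrase  # no comparable reference; leave unchanged
    # Pick the reference with the fewest mismatching notes
    return min(candidates,
               key=lambda r: sum(a != b for a, b in zip(phrase, r)))

# Usage: one mis-recognized pitch (466 Hz instead of 494 Hz) is corrected
refs = [[(440.0, 1.0), (494.0, 1.0), (523.0, 2.0)]]
recognized = [(440.0, 1.0), (466.0, 1.0), (523.0, 2.0), REST]
fixed = [correct_phrase(p, refs) for p in split_into_phrases(recognized)]
```

    In the claimed system the reference phrases correspond to the second training dataset, and the correction in claim 10 is scaled to the extent of the mismatch rather than being a wholesale substitution as in this sketch.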

    (12) The aforementioned steps are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.

    (13) Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as “including”, “comprising”, “incorporating”, “have”, and “is”, which are used to describe and claim the present disclosure, are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural.