System and method for generation of musical notation from audio signal
11749237 · 2023-09-05
Inventors
CPC classification
G10H2210/061 (PHYSICS)
G10H2210/081 (PHYSICS)
G10H2210/066 (PHYSICS)
G10H2210/086 (PHYSICS)
International classification
Abstract
A system for generation of a musical notation from an audio signal, the system comprising at least one processor configured to: obtain the audio signal from an audio source or a data repository; process the audio signal using first machine learning (ML) model(s) to generate a recognition result, wherein the recognition result is indicative of a pitch and a duration of a plurality of notes in the audio signal and their corresponding confidence scores; generate a preliminary musical notation using the recognition result; process the preliminary musical notation using second ML model(s) to determine whether the preliminary musical notation includes one or more errors; and when it is determined that the preliminary musical notation includes one or more errors, modify the preliminary musical notation to generate the musical notation that is error-free or has fewer errors than the preliminary musical notation.
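By way of example only, the two-stage data flow described in the abstract (first ML model produces notes with confidence scores, second ML model checks and corrects the preliminary notation) can be sketched as below. The function names are illustrative and both "models" are stubs; only the shape of the pipeline follows the text.

```python
# Hypothetical sketch of the abstract's pipeline; the model internals
# are stand-ins, not the patented implementation.

def recognize(audio):
    """Stage 1 (first ML model, stubbed): map an audio signal to a
    recognition result of (pitch_hz, duration_s, confidence) per note."""
    # A real model would infer these from the signal; fixed values here.
    return [(440.0, 0.5, 0.95), (494.0, 0.25, 0.40)]

def to_preliminary_notation(recognition):
    """Build a preliminary notation keeping pitch, duration, confidence."""
    return [{"pitch": p, "duration": d, "confidence": c}
            for p, d, c in recognition]

def correct(notation, confidence_floor=0.5):
    """Stage 2 (second ML model, stubbed): treat low-confidence notes as
    suspect errors and, in this toy version, simply drop them."""
    return [n for n in notation if n["confidence"] >= confidence_floor]

notation = correct(to_preliminary_notation(recognize(None)))
```

A production system would replace both stubs with trained models; the sketch only fixes the interfaces between the stages.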
Claims
1. A system for generation of a musical notation from an audio signal, the system comprising at least one processor configured to: generate a first training dataset that is to be employed to train at least one first machine learning (ML) model, wherein the first training dataset comprises at least one of: audio signals generated by at least one musical instrument, metadata of the audio signals generated by the at least one musical instrument; train the at least one first ML model using the first training dataset and at least one ML algorithm; obtain the audio signal from an audio source; process the audio signal using the trained at least one first ML model to generate a recognition result, wherein the recognition result is indicative of a pitch and a duration of a plurality of notes in the audio signal and their corresponding confidence scores, wherein the pitch refers to a frequency of a note and wherein the duration refers to a length of time that the note is played; generate a preliminary musical notation using the recognition result; process the preliminary musical notation using at least one second ML model to determine whether the preliminary musical notation includes one or more errors, wherein when processing the preliminary musical notation using the at least one second ML model, the at least one processor is configured to: identify at least one phrase in the audio signal, based on a plurality of phrases in a plurality of audio signals belonging to a second training dataset using which the at least one second ML model is trained, wherein the at least one phrase comprises a sequence of notes that occurs between two rests; determine whether a pitch and/or a duration of the sequence of notes in the at least one phrase mismatch with a pitch and/or a duration of notes in one or more of the plurality of phrases; and determine that the preliminary musical notation includes the one or more errors, when it is determined that the pitch and/or the duration of the sequence of notes in the at least one phrase mismatch with the pitch and/or the duration of notes in one or more of the plurality of phrases belonging to the second training dataset; and when it is determined that the preliminary musical notation includes one or more errors, modify the preliminary musical notation using the at least one second ML model, to generate the musical notation that is error-free or has fewer errors than the preliminary musical notation.
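By way of illustration only, the phrase step recited in claim 1 (a phrase is a sequence of notes between two rests; a phrase is flagged when its pitches/durations mismatch known phrases) can be sketched as below. The exact-match comparison is an assumption standing in for the trained second ML model, and all names are hypothetical.

```python
# Toy sketch of claim 1's phrase identification and mismatch check.

REST = None  # a rest is represented as None in a note sequence

def split_into_phrases(notes):
    """Split a note sequence into phrases: runs of notes between rests."""
    phrases, current = [], []
    for note in notes:
        if note is REST:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(note)
    if current:
        phrases.append(current)
    return phrases

def has_error(phrase, known_phrases):
    """Flag a phrase when no known phrase matches its
    (pitch, duration) sequence exactly (stand-in for the ML model)."""
    return phrase not in known_phrases

# Notes as (MIDI pitch, duration in beats); one rest splits two phrases.
signal = [(60, 1.0), (62, 0.5), REST, (64, 0.5), (65, 0.5)]
known = [[(60, 1.0), (62, 0.5)]]          # phrases from the training set
phrases = split_into_phrases(signal)
errors = [has_error(p, known) for p in phrases]
```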
2. The system according to claim 1, wherein when modifying the preliminary musical notation to generate the musical notation that is error-free or has fewer errors than the preliminary musical notation, the at least one processor is configured to: determine a required correction in the pitch and/or the duration of the sequence of notes in the at least one phrase, based on an extent of mismatch between the pitch and/or the duration of the sequence of notes in the at least one phrase and the pitch and/or the duration of notes in one or more of the plurality of phrases; and apply the required correction to the pitch and/or the duration of the sequence of notes in the at least one phrase.
3. The system according to claim 1, wherein when it is determined that the pitch and/or the duration of the sequence of notes in the at least one phrase match with the pitch and/or the duration of notes in one or more of the plurality of phrases, the at least one processor is configured to: determine whether confidence scores associated with the pitch and/or the duration of the sequence of notes in the at least one phrase lie below a confidence threshold; and when it is determined that the confidence scores associated with the pitch and/or the duration of the sequence of notes in the at least one phrase lie below the confidence threshold, update the confidence scores to be greater than the confidence threshold.
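By way of example only, claim 3's confidence update can be sketched as below. The specific update rule (clamping sub-threshold scores to just above the threshold) and the threshold value are assumptions; the claim only requires the scores to end up greater than the threshold.

```python
# Toy sketch of claim 3: raise sub-threshold confidence scores for a
# phrase that matched known phrases; the margin rule is an assumption.

def update_confidences(scores, threshold=0.8, margin=0.01):
    """Raise scores below the threshold to threshold + margin;
    leave already-confident scores unchanged."""
    return [s if s >= threshold else threshold + margin for s in scores]

updated = update_confidences([0.95, 0.40, 0.81])
```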
4. The system according to claim 1, wherein the at least one processor is further configured to detect a change in at least one of: a time signature of the preliminary musical notation, a key signature of the preliminary musical notation, a tempo marking of the preliminary musical notation, a type of the audio source, wherein upon detection of the change, the at least one processor triggers the processing of the preliminary musical notation using the at least one second ML model.
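The trigger condition of claim 4 amounts to comparing a handful of notation/source attributes between two states. The sketch below illustrates this; the field names and dictionary representation are assumptions made for the example.

```python
# Illustrative sketch of claim 4's change detection: a change in the
# time signature, key signature, tempo marking, or audio-source type
# triggers reprocessing with the second ML model.

def detect_change(prev, curr):
    """Return True when any monitored attribute differs."""
    fields = ("time_signature", "key_signature", "tempo", "source_type")
    return any(prev[f] != curr[f] for f in fields)

prev = {"time_signature": "4/4", "key_signature": "C",
        "tempo": 120, "source_type": "piano"}
curr = dict(prev, tempo=90)      # tempo marking changed
trigger = detect_change(prev, curr)
```

When `trigger` is true, the system would re-run the second-model processing of the preliminary notation.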
5. The system according to claim 1, wherein the at least one processor is further configured to: generate a preliminary audio waveform of the audio signal using the recognition result; and modify the preliminary audio waveform to generate an audio waveform that is error-free or has fewer errors than the preliminary audio waveform.
6. The system according to claim 1, wherein when obtaining the audio signal from the audio source, the at least one processor is configured to record the audio signal when the audio signal is played by the audio source or import a pre-recorded audio file from a data repository.
7. The system according to claim 1, wherein prior to processing the audio signal using the at least one first ML model, the at least one processor is further configured to convert the audio signal into a plurality of spectrograms having a plurality of time windows.
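The conversion in claim 7 is a short-time spectral analysis: the signal is cut into overlapping time windows and each window is transformed to a spectrum. A plain NumPy sketch follows; the window size, hop size, and window function are arbitrary choices for illustration, not values taken from the patent.

```python
# Illustrative claim-7 conversion of an audio signal into spectrogram
# frames over a plurality of time windows, via a short-time FFT.
import numpy as np

def spectrogram_frames(signal, window_size=256, hop=128):
    """Return magnitude spectra, one row per time window."""
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size] * np.hanning(window_size)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

t = np.arange(1024) / 8000.0                 # 1024 samples at 8 kHz
tone = np.sin(2 * np.pi * 440.0 * t)         # a 440 Hz test tone
spec = spectrogram_frames(tone)
```

Each row of `spec` is one time window; with these parameters the 440 Hz tone peaks near bin 14 (8000 / 256 = 31.25 Hz per bin), which is the kind of feature a first ML model would consume.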
8. The system according to claim 1, wherein the at least one first ML model comprises a plurality of first ML models and the first training dataset comprises a plurality of subsets, each subset comprising at least one of: audio signals generated by one musical instrument, metadata of the audio signals generated by the one musical instrument, wherein each first ML model is trained using a corresponding subset.
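Claim 8's arrangement, one first ML model per instrument-specific subset of the training data, can be sketched as below. The "model" here is deliberately trivial (the set of pitches seen for that instrument); only the partition-by-instrument structure follows the claim.

```python
# Toy sketch of claim 8: split training samples into per-instrument
# subsets and "train" one stub model per subset.

def train_per_instrument(samples):
    """samples: iterable of (instrument, pitch) pairs.
    Returns a mapping from instrument to its stub model (pitch set)."""
    models = {}
    for instrument, pitch in samples:
        models.setdefault(instrument, set()).add(pitch)
    return models

data = [("piano", 60), ("piano", 64), ("violin", 67)]
models = train_per_instrument(data)
```

At inference time the system would route an audio signal to the model matching its instrument.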
9. A method for generating a musical notation from an audio signal, the method comprising: generating a first training dataset that is employed for training at least one first machine learning (ML) model, wherein the first training dataset comprises at least one of: audio signals generated by at least one musical instrument, metadata of the audio signals generated by the at least one musical instrument; training the at least one first ML model using the first training dataset and at least one ML algorithm; obtaining the audio signal from an audio source; processing the audio signal using the trained at least one first ML model for generating a recognition result, wherein the recognition result is indicative of a pitch and a duration of a plurality of notes in the audio signal and their corresponding confidence scores, wherein the pitch refers to a frequency of a note and wherein the duration refers to a length of time that the note is played; generating a preliminary musical notation using the recognition result; processing the preliminary musical notation using at least one second ML model to determine whether the preliminary musical notation includes one or more errors, wherein the step of processing the preliminary musical notation using the at least one second ML model comprises: identifying at least one phrase in the audio signal, based on a plurality of phrases in a plurality of audio signals belonging to a second training dataset using which the at least one second ML model is trained, wherein the at least one phrase comprises a sequence of notes that occurs between two rests; determining whether a pitch and/or a duration of the sequence of notes in the at least one phrase mismatch with a pitch and/or a duration of notes in one or more of the plurality of phrases; and determining that the preliminary musical notation includes the one or more errors, when it is determined that the pitch and/or the duration of the sequence of notes in the at least one phrase mismatch with the pitch and/or the duration of notes in one or more of the plurality of phrases belonging to the second training dataset; and upon determining that the preliminary musical notation includes one or more errors, modifying the preliminary musical notation using the at least one second ML model for generating the musical notation that is error-free or has fewer errors than the preliminary musical notation.
10. The method according to claim 9, wherein the step of modifying the preliminary musical notation for generating the musical notation that is error-free or has fewer errors than the preliminary musical notation comprises: determining a required correction in the pitch and/or the duration of the sequence of notes in the at least one phrase, based on an extent of mismatch between the pitch and/or the duration of the sequence of notes in the at least one phrase and the pitch and/or the duration of notes in one or more of the plurality of phrases; and applying the required correction to the pitch and/or the duration of the sequence of notes in the at least one phrase.
11. The method according to claim 9, wherein the method further comprises detecting a change in at least one of: a time signature of the preliminary musical notation, a key signature of the preliminary musical notation, a tempo marking of the preliminary musical notation, a type of the audio source, wherein upon detecting the change, the processing of the preliminary musical notation using the at least one second ML model is triggered.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) One or more embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
DETAILED DESCRIPTION
(10) The aforementioned steps are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
(13) Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as “including”, “comprising”, “incorporating”, “have”, “is” used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural.