Method for Processing Nuclear Magnetic Resonance (NMR) Spectroscopic Data
20200124689 ยท 2020-04-23
Inventors
Cpc classification
G01R33/4608
PHYSICS
G01R33/4625
PHYSICS
International classification
Abstract
The invention provides methods for processing nuclear magnetic resonance (NMR) spectroscopic data to assign resonance peaks in an NMR spectrum of a molecule to atomic nuclei in said molecule, based on graph-theoretical principles. In particular, the methods of the invention may be employed in the assignment of data obtained from methyl-TROSY spectroscopy of proteins.
Claims
1. A method for processing nuclear magnetic resonance (NMR) spectroscopic data to assign resonance peaks in an NMR spectrum of a molecule to atomic nuclei in said molecule, comprising: (i) obtaining from a source a nuclear magnetic resonance dataset D.sub.1 comprising a first set of data points ij representing resonance peaks arising from heteronuclear J-couplings between nuclei i and nuclei j, and a second set of data points jj representing through-space proximity correlations between individual nuclei j associated with the first set of data points ij; (ii) obtaining from a source a molecular structure dataset D.sub.2 comprising data points representing a three-dimensional spatial arrangement of atoms in the molecule; (iii) defining a magnetic resonance data graph G.sub.1=(V.sub.1, E.sub.1) and a structure graph G.sub.2=(V.sub.2, E.sub.2) wherein: the set of vertices V.sub.1 of G.sub.1 comprises individual vertices V.sub.1.sup.ij such that each individual vertex V.sub.1.sup.ij represents an individual data point ij selected from the first set of data points; the set of edges E.sub.1 of G.sub.1 comprises individual edges E.sub.1.sup.jj such that each individual edge E.sub.1.sup.jj represents a data point jj selected from the second set of data points and wherein each individual edge E.sub.1.sup.jj connects two individual vertices V.sub.1.sup.ij if nuclei j associated with the data points represented by those two vertices interact via a through-space proximity correlation; the set of vertices V.sub.2 of G.sub.2 comprises individual vertices V.sub.2.sup.i such that each individual vertex V.sub.2.sup.i represents an atom in the molecular structure represented by dataset D.sub.2, said atom having a nucleus i and said being bound to at least one atom having a nucleus j; the set of edges E.sub.2 of G.sub.2 comprises individual edges E.sub.2.sup.ii such that each individual edge E.sub.2.sup.ii connects two individual vertices V.sub.2.sup.i to one another if the atoms represented by said vertices are within a predetermined distance r relative to one another in the molecular structure represented by dataset D.sub.2; (iv) comparing G.sub.1 and G.sub.2 to identify at least one maximal common edge subgraph (MCES); (iv) generating all possible mappings of edges E.sub.1.sup.jj to edges E.sub.2.sup.ii and of vertices V.sub.1.sup.ij to vertices V.sub.2.sup.i which are consistent with said at least one MCES; (vi) for each possible mapping of a vertex V.sub.1.sup.ij to a vertex V.sub.2.sup.ii obtained in step (v), generating a peak assignment, wherein generating said peak assignment comprises attributing a resonance represented by said vertex V.sub.1.sup.ij to an atom having a nucleus i which is represented by said vertex V.sub.2.sup.i; and (vii) generating an output comprising all possible assignments from step (vi) of resonance peaks in the NMR spectrum to atomic nuclei in the molecule.
2. A method according to claim 1, wherein the through-space proximity correlations are nuclear Overhauser effect (NOE) interactions.
3. A method according to claim 1, wherein the magnetic resonance data graph G.sub.1 is a graph G.sub.1=(V.sub.1, E.sub.1,.sub.1) and wherein the structure graph G.sub.2 is a graph G.sub.2=(V.sub.2, E.sub.2,.sub.2), wherein at is a labelling function which labels the vertices V.sub.1 and .sub.2 is a labelling function which labels the vertices V.sub.2.
4. A method according to claim 1, wherein nuclei i and j are independently selected from among .sup.13C, .sup.1H and .sup.15N, with the proviso that nuclei i and nuclei j are different types of nucleus to one another.
5. A method according to claim 1 wherein nuclei i are .sup.13C nuclei and nuclei j are .sup.1H nuclei.
6. A method according to claim 1 wherein nuclei i are .sup.13C nuclei and nuclei j are .sup.1H nuclei and further wherein said nuclei j are each individually bound to a nucleus i.
7. A method according to claim 6 wherein nuclei i and j are nuclei in .sup.13C-labelled methyl groups.
8. A method according to claim 7 wherein said .sup.13C-labelled methyl groups are .sup.13C-labelled isoleucine , valine and/or leucine methyl groups.
9. A method according to claim 1 wherein datapoints ij represent 2D .sup.13C-.sup.1H HMQC resonance peaks.
10. A method according to claim 1 wherein datapoints jj represent .sup.1H-.sup.1H nuclear Overhauser effect interactions.
11. A method according to claim 1 wherein datapoints ij individually represent 2D .sup.13C-.sup.1H HMQC resonance peaks arising from separate .sup.13C-labelled methyl groups and datapoints jj represent .sup.1H-.sup.1H nuclear Overhauser effect interactions between .sup.1H nuclei associated with said separate .sup.13C-labelled methyl groups.
12. A method according to claim 1 wherein said NMR spectrum is a methyl-TROSY spectrum.
13. A method according to claim 1 wherein datapoints ij represent resonance peaks arising from 2D .sup.13C-.sup.1H HMQC interactions between .sup.13C and .sup.1H nuclei in individual .sup.13C-labelled methyl groups of alanine, valine, leucine and/or isoleucine residues in a biomolecule, and datapoints jj represent homonuclear NOE interactions between .sup.1H nuclei in separate such .sup.13C-labelled alanine, valine, leucine and/or isoleucine residues.
14. A method according to claim 1 wherein the molecule represented by D.sub.2 is a protein or enzyme.
15. A method according to claim 1 wherein D.sub.2 is derived from a crystal structure.
16. A method according to claim 15 wherein said crystal structure is a crystal structure available in the Protein Databank (PDB).
17. A method according to claim 1 wherein D.sub.2 is derived from homology modelling.
18. A method according to claim 1 wherein, prior to step (iii), D.sub.1 is filtered.
19. A method according to claim 1 wherein, in step (iii), r is 5-20 .
20. A method according to claim 1 wherein, following step (iii) and prior to step (iv), G.sub.1 is decomposed into non-intersecting connected subgraphs.
21. (canceled)
22. A method according to claim 1 wherein step (iv) comprises determining an optimal matching order of the vertices of G1 to G2 and/or an optimal order of traversal of vertices of G.sub.1.
23. A method according to claim 1 wherein step (iv) is repeated as often as necessary to generate all possible mappings of G1 to G2 consistent with subgraph isomorphism and/or MCES.
24. A method according to claim 1 wherein the output of step (vii) is displayed on an electronic display screen.
25. (canceled)
26. A computer readable storage medium storing computer software code which when executing on a data processing system performs a method as claimed in claim 1.
27. A method according to claim 1 comprising, between steps (iii) and (iv), performing a step (iiiA) of comparing G.sub.1 and G.sub.2 to identify subgraph isomorphism from G.sub.1 to G.sub.2 and generating all possible mappings of edges E.sub.1.sup.jj to edges E.sub.2.sup.ii and of vertices V.sub.1.sup.ij to vertices V.sub.2.sup.i which are consistent with said subgraph isomorphism.
28. A method according to claim 27 wherein, following step (iii) and prior to step (iiiA), G.sub.1 is decomposed into non-intersecting connected subgraphs.
29. A method according to claim 27 wherein step (iiiA) is performed using the VF2 algorithm.
30. A method according to claim 1 wherein nuclear magnetic resonance dataset D.sub.1 is obtained by conducting an NMR experiment on a sample comprising a molecule of interest and the dataset D.sub.1 thus generated is in the form of an NMR spectrum of the molecule of interest.
31. A method according to claim 1 wherein steps (iii)-(vii) are performed by executing code in a computer.
32. A method of processing nuclear magnetic resonance (NMR) spectroscopic data, wherein a molecule of interest is subject to NMR and a nuclear magnetic resonance dataset D.sub.1 is obtained in the form of an NMR spectrum of the molecule of interest, to assign resonance peaks in the NMR spectrum of the molecule of interest to atomic nuclei in said molecule of interest, comprising: (i) obtaining said nuclear magnetic resonance dataset D.sub.1 comprising a first set of data points ij representing resonance peaks arising from heteronuclear J-couplings between nuclei i and nuclei j, and a second set of data points jj representing through-space proximity correlations between individual nuclei j associated with the first set of data points ij; (ii) obtaining a molecular structure dataset D.sub.2 comprising data points representing a three-dimensional spatial arrangement of atoms in the molecule; (iii) defining a magnetic resonance data graph G.sub.1=(V.sub.1, E.sub.1) and a structure graph G.sub.2=(V.sub.2, E.sub.2) wherein: the set of vertices V.sub.1 of G.sub.1 comprises individual vertices V.sub.1.sup.ij such that each individual vertex V.sub.1.sup.ij represents an individual data point ij selected from the first set of data points; the set of edges E.sub.1 of G.sub.1 comprises individual edges E.sub.1.sup.jj such that each individual edge E.sub.1.sup.jj represents a data point jj selected from the second set of data points and wherein each individual edge E.sub.1.sup.jj connects two individual vertices V.sub.1.sup.ij if nuclei j associated with the data points represented by those two vertices interact via a through-space proximity correlation; the set of vertices V.sub.2 of G.sub.2 comprises individual vertices V.sub.2.sup.i such that each individual vertex V.sub.2.sup.i represents an atom in the molecular structure represented by dataset D.sub.2, said atom having a nucleus i and said being bound to at least one atom having a nucleus j; the set of edges E.sub.2 of G.sub.2 comprises individual edges E.sub.2.sup.ii such that each individual edge E.sub.2.sup.ii connects two individual vertices V.sub.2.sup.i to one another if the atoms represented by said vertices are within a predetermined distance r relative to one another in the molecular structure represented by dataset D.sub.2; (iv) optionally comparing G.sub.1 and G.sub.2 to identify subgraph isomorphism from G.sub.1 to G.sub.2 and generating all possible mappings of edges E.sub.1.sup.jj to edges E.sub.2.sup.ii and of vertices V.sub.1.sup.ij to vertices V.sub.2.sup.i which are consistent with said subgraph isomorphism; (v) comparing G.sub.1 and G.sub.2 to identify at least one maximal common edge subgraph (MCES); (vi) generating all possible mappings of edges E.sub.1.sup.jj to edges E.sub.2.sup.ii and of vertices V.sub.1.sup.ij to vertices V.sub.2.sup.i which are consistent with said at least one MCES; (vii) for each possible mapping of a vertex V.sub.1.sup.ij to a vertex V.sub.2.sup.ii obtained in steps (iv) and (vi), generating a peak assignment, wherein generating said peak assignment comprises attributing a resonance represented by said vertex V.sub.1.sup.ij to an atom having a nucleus i which is represented by said vertex V.sub.2.sup.i; (viii) generating an output comprising all possible assignments from step (vii) of resonance peaks in the NMR spectrum to atomic nuclei in the molecule.
33. A computer program product embodied on a computer readable storage medium for processing nuclear magnetic resonance (NMR) spectroscopic data of a molecule to assign resonance peaks in said NMR spectroscopic data to atomic nuclei in said molecule, said computer program product comprising: computer code for defining a magnetic resonance data graph G.sub.1=(V.sub.1, E.sub.1) from a nuclear magnetic resonance dataset D.sub.1, wherein nuclear magnetic resonance dataset D.sub.1 comprises a first set of data points ij representing resonance peaks arising from heteronuclear J-couplings between nuclei i and nuclei j in said molecule, and a second set of data points jj representing through-space proximity correlations between individual nuclei j in said molecule associated with the first set of data points ij, and wherein the set of vertices V.sub.1 of G.sub.1 comprises individual vertices V.sub.1.sup.ij such that each individual vertex V.sub.1.sup.ij represents an individual data point ij selected from the first set of data points; and wherein the set of edges E.sub.1 of G.sub.1 comprises individual edges E.sub.1.sup.jj such that each individual edge E.sub.1.sup.jj represents a data point jj selected from the second set of data points and wherein each individual edge E.sub.1.sup.jj connects two individual vertices V.sub.1.sup.ij if nuclei j associated with the data points represented by those two vertices interact via a through-space proximity correlation; (ii) computer code for defining a structure graph G.sub.2=(V.sub.2, E.sub.2) from a molecular structure dataset D.sub.2, wherein molecular structure dataset D.sub.2 comprises data points representing a three-dimensional spatial arrangement of atoms in said molecule, and wherein the set of vertices V.sub.2 of G.sub.2 comprises individual vertices V.sub.2.sup.i such that each individual vertex V.sub.2.sup.i represents an atom in the molecular structure represented by dataset D.sub.2, said atom having a nucleus i and said being bound to at least one atom having a nucleus j; and wherein the set of edges E.sub.2 of G.sub.2 comprises individual edges E.sub.2.sup.ii such that each individual edge E.sub.2.sup.ii connects two individual vertices V.sub.2.sup.i to one another if the atoms represented by said vertices are within a predetermined distance r relative to one another in the molecular structure represented by dataset D.sub.2; (iii) optionally, computer code for comparing G.sub.1 and G.sub.2 to identify subgraph isomorphism from G.sub.1 to G.sub.2 and generating all possible mappings of edges E.sub.1.sup.jj to edges E.sub.2.sup.ii and of vertices V.sub.1.sup.ij to vertices V.sub.2.sup.i which are consistent with said subgraph isomorphism; (iv) computer code for comparing G.sub.1 and G.sub.2 to identify at least one maximal common edge subgraph (MCES); (v) computer code for generating all possible mappings of edges E.sub.1.sup.ij to edges E.sub.2.sup.ii and of vertices V.sub.1.sup.ij to vertices V.sub.2.sup.i which are consistent with said at least one MCES; (vi) computer code for generating a peak assignment for each possible mapping of a vertex V.sub.1.sup.ij to a vertex V.sub.2.sup.ii obtained in steps (iv) and (vi), wherein generating said peak assignment comprises attributing a resonance represented by said vertex V.sub.1.sup.ij to an atom having a nucleus i which is represented by said vertex V.sub.2.sup.i; (vii) computer code for generating an output comprising all possible peak assignments of resonance peaks in the NMR spectrum to atomic nuclei in the molecule.
Description
[0234] The invention will now be described in further detail with reference to the following non-limiting Examples and the Figures, in which:
[0235]
[0236]
[0237]
[0238]
[0239]
[0240]
[0241]
[0242]
[0243]
[0244]
[0245]
[0246]
[0247] score provided was followed as a function of distance threshold. The recommended distance is that where the curve starts to reach a plateau. The calculation is then repeated 100 times at this optimal distance in order to obtain a final result. The final distances for each protein are specified in the legend. (b) MAP-XSII requires that the program is run at a number of distance thresholds, and the distance that provides the most confident assignments is taken as the final value. Results are shown here for the benchmark, indicating the distance reported with an asterisk. The number of incorrect assignments at each distance threshold is indicated.
[0248]
[0249]
[0250]
EXAMPLE 1
[0251] With reference to
[0252]
[0253] Sample 101 is subjected to NMR testing 102 and an NMR dataset D.sub.1 103 is thereby obtained. In
[0254] In the illustrative dataset 103 shown in
[0255] However, other NMR experiments and pulse sequences, other molecules of interest, and other combinations of NMR-active nuclei, may be employed as described elsewhere in the present disclosure. The data presented in visual form as 103a are an example of a first set of data points ij representing resonance peaks arising from heteronuclear J-couplings between nuclei i and nuclei j. The data presented in visual form as 103b are an example of a second set of data points jj representing through-space proximity correlations (here: nuclear Overhauser interactions) between nuclei j associated with the first set of data points ij.
[0256] Molecular structure dataset D.sub.2 105 is obtained, representing the three-dimensional spatial arrangement of atoms in a molecule 104 (for illustrative purposes, a structural formula for ATCase is shown for consistency with the data presented as 103). This may be obtained by homology modeling or by acquiring a predetermined crystal structure from a database such as the PDB, for example. Where homology modeling is performed this may be done in computer 106 or separately.
[0257] Code executed in computer 106 defines magnetic resonance data graph G.sub.1 on the basis of NMR dataset 103 and structure graph G.sub.2 on the basis of molecular structure dataset 105 employing a graph-matching algorithm as described elsewhere herein which compares G.sub.1 and G.sub.2 to generate all possible mappings of edges and vertices in G.sub.1 to edges and vertices in G.sub.2 which are consistent with subgraph isomorphism and/or at least one MCES (where only one MCES exists, all such mappings for such MCES are generated; where more than one MCES exists, all such mappings for each such MCES are generated). For each possible mapping, the code then generates a peak assignment such that resonances in molecular structure dataset D.sub.1 103 are attributed to nuclei of atoms in molecular structure dataset D.sub.2 105.
[0258] Once all possible mappings and assignments have been generated, an output 107 is generated which indicates all possible assignments.
[0259] The code executed in computer 106 in order to generate mappings, assignments, and output 107, will be configured to perform a graph-matching method as described herein. An algorithm such as MAGMA, which is described in further detail in subsequent Example 2, may be employed and thus the code executed in computer 106 may be written to perform such an algorithm when executed.
[0260]
[0261] In step 201, a nuclear magnetic resonance dataset D.sub.1 (such as dataset 103 shown in
[0262] Steps 203 to 207 are then performed to process the NMR data and generate peak assignments. Such steps may be performed according to any of the embodiments as described herein, and may for example be performed in a computer (e.g. computer 106 in
[0263] The steps 203 to 206 and optionally 207 may if desired be performed in a computer programmed with code which, when executed, processes the datasets accordingly. An algorithm named MAGMA has been devised by the inventors which performs such steps and therefore a computer may be employed to implement the MAGMA algorithm. This is explored in further detail in Example 2.
EXAMPLE 2: MAGMA
[0264] MAGMA is an algorithmic method of processing NMR data in accordance with the methods of the invention. The flowchart for MAGMA is shown in
[0265] In this illustrative and non-limiting Example, as input, MAGMA takes a set of side-chain atoms from a protein structure (corresponding to a molecular structure dataset D.sub.2 as described herein) to create an input structure graph (
[0266] The core of MAGMA consists of two exact and deterministic graph matching algorithms: the VF2 subgraph isomorphism algorithm (this is an illustrative implementation of optional step (iv) of the methods of the invention) and a maximal common subgraph algorithm (this is an illustrative implementation of step (v) of the methods of the invention) adapted from the backtracking algorithm described by McGregor. The implementation of the VF2 algorithm was accessed via a Python interface to the igraph high performance graph library python-igraph0.765. Using Python, we designed and implemented an adapted MCES algorithm based on that of McGregor, as described above and illustrated in
[0267] MAGMA functions as follows, described with reference to the numerals in the flowchart of
[0268] Structure Graph Generation (Flowchart Reference 801):
[0269] This corresponds to an illustrative implementation of defining a structure graph G.sub.2 in accordance with step (iii) of the methods of the invention.
[0270] In the following Examples herein, unless indicated differently, pseudo-atoms were created by averaging the coordinates of the two methyl carbon atoms for both Leu and Val groups to create a pseudo-atom. The matrix of all Euclidean distances between Ile-.sub.1, Leu-pseudo , Val-pseudo , and when applicable, alanine , threonine and methionine , was computed from protein coordinates (provided in a PDB file). If more than a single distance was measured for any of the methyl-methyl pairs, the matrix was reduced to the set of shortest distances between any two carbon atoms. Next, the distance heuristic was applied and only the distances smaller or equal to the defined distance threshold were kept. This final set of distances defined a set of non-weighted edges for the structure graph, in which the methyl carbon atoms were used to define a set of vertices. Each vertex was assigned a label according to its amino acid type. In calculations in which discrimination between Leu and Val was not assumed (
[0271] Preprocessing of NOESY Data (Flowchart Reference 802):
[0272] This embodies one illustrative example of the optional step of filtering the magnetic resonance dataset D.sub.1 prior to defining graph G.sub.1. Such filtering steps are described in further detail elsewhere in the present disclosure.
[0273] Processed 3D- and 4D-HMQC NOESY spectra were analyzed using Sparky for the r2 dimer of ATCase and the N-terminal domain of EIN, respectively. For other proteins in the benchmark, the spectral analyses were performed by others and the resultant NOE peak lists were provided. The following rules were applied to obtain confident methyl-methyl NOE assignments:
[0274] 1) NOE reciprocity, i.e. mutual cross-peaks existed between two auto-peaks
[0275] 2) Similar normalized cross-peak intensity (90%) and signal-to-noise ratio
[0276] Normalized cross-peak intensity and signal-to-noise ratio were calculated in accordance with the detailed description of the invention.
[0277] Data heights and noise threshold were determined by Sparky software for analysis of NMR spectra. In the obtained NOE lists for which data heights (or calculated peak intensities) were missing, solely the NOE reciprocity rule was applied to filter methyl-methyl NOEs. The .sup.1H-.sup.13C frequencies from methyl-TROSY HMQC spectra and the filtered NOE lists were used to define the set of vertices and the set of edges of the methyl-methyl NOE graph.
[0278] NOE Graph Generation (Flowchart Reference 802):
[0279] This corresponds to an illustrative implementation of defining a magnetic resonance data graph G.sub.1 in accordance with step (iii) of the methods of the invention.
[0280] Within MAGMA, .sup.1H-.sup.13C frequencies from methyl-TROSY HMQC spectra and the filtered methyl-methyl NOEs were converted to an adjacency matrix of an NOE graph. The vertices of the graph were labelled according to the amino acid type. If the assignments for the two methyl groups (pro-R and pro-S) within a given Leu or Val residue were known, then a single Leu or Val residue was used to join the methyl-methyl NOEs from both methyl peaks, thereby creating a single vertex in NOE graph. The residue types were added as being known a priori, a requirement that can be accomplished through inspection of the chemical shifts (e.g. discriminating Ile resonances from Leu resonances), or by preparing specifically labeled samples (e.g. to discriminate Leu from Val resonances). When the resonances of Leu and Val types cannot be distinguished, MAGMA supports giving the same label for Leu/Val .sup.1H-.sup.13C resonances, albeit with higher ambiguity from the collapse of two labels into one.
[0281] NOE Graph Analysis (Flowchart Reference 803):
[0282] This corresponds to an illustrative implementation of the optional feature of decomposing G.sub.1 into non-intersecting connected subgraphs prior to steps (iv)/(v) of the method of the invention, as described in further detail elsewhere herein.
[0283] MAGMA analysed connectivity of every NOE graph to check if the graph was disconnected. Disconnected NOE graphs were decomposed into their connected components (connected NOE subgraphs) and their cardinality (number of vertices) was determined. The connected NOE subgraphs were ordered according to their cardinality (from the highest to the lowest) and were henceforth treated separately.
[0284] Subgraph Isomorphism Testing (Flowchart References 804, 809):
[0285] This corresponds to an illustrative implementation of optional step (iv) of the methods of the invention.
[0286] Subsequently to the steps described above (structure graph generation, preprocessing of NOESY data, NOE graph generation, and NOE graph analysis), the SI test was iteratively performed on each of the connected NOE subgraphs according to the aforementioned order.
[0287] If a subgraph passed the SI test, it was then matched to a structure graph by the VF2 algorithm, which returns all mapping solutions that result in SI. Those subgraphs that did not pass the SI test are consequently collected for graph matching by the adapted MCES algorithm (see below).
[0288] Before the exact MCES search was executed, each of the subgraphs that did not result in SI (or, if no subgraphs were present, the entire NOE graph) was passed through a protocol that determined: [0289] 1. Optimal matching order (vertices of an NOE graph to vertices of a structure graph) [0290] 2. Optimal order of the traversal of vertices in an NOE (sub)graph
[0291] Determining the optimal orders for 1. and 2. is critical for obtaining a high starting scores in an exact MCES search. Moreover, optimal ordering results in significant reduction of the search space of the exact search, as the high initial MCES scores eliminate completion of any partial assignment that would lead to lower scores. Together, the optimal matching and traversing account for efficient pruning of the search tree and faster convergence of the exact MCES search.
[0292] Sorting Matching Options Prior to MCES Search (Flowchart References 804,805) and Ordering the Visits of Nodes in NOE (Sub)Graphs and Iterative Protocol for the Improvement of the Initial MCES Score (Flowchart References 805, 806, 807, 808):
[0293] These are preliminary stages of one illustrative implementation of step (v) of the methods of the invention.
[0294] Optimal matching order refers to ordering a set of vertices of structure graph associated with each vertex of an NOE sub(graph). The set comprised all possible assignment options for a particular NOE vertex, i.e. every assignment option for a particular .sup.1H-.sup.13C resonance in a methyl-TROSY spectrum. The adapted MCES algorithm of the invention supported associating the ordered list of assignment options to each vertex of an NOE (sub)graph. The order of the vertices in the set was the order in which different assignment options should be tried. In MAGMA, sorting of the sets was determined by a variation of the Hungarian method for the assignment problem (Munkres algorithm), combined with a Jaccard similarity coefficient, as described in the detailed description of the invention.
[0295] The exact MCES algorithm constructed the search tree starting from a vertex in an NOE (sub)graph. From a given starting point (root), the order of traversal of vertices of an NOE graph during the search was found to significantly impact the starting scores of the exact MCES search. In order to determine the most optimal starting point (vertex of an NOE graph) a single iteration of the MCES search was performed from every vertex of an NOE (sub)graph, and the obtained scores (size of first found MCES) were used to determine the best starting vertex. The highest scoring, first found MCES comprised the optimal starting point for MAGMA. The order of the rest of the vertices from the root was determined through breadth-first tree search.
[0296] After determining the matching order and NOE graph traversal order, MAGMA performed an incomplete, short MCES run prior to execution of the exact MCES run. The goal of this short search was to collect a number of high scoring MCES solutions that are used to update the initially determined matching orders. The highest scoring solutions (max (|MCES|)) of the short MCES search were inserted to the front of the set of the assignment options for every NOE vertex. The initially determined matching order (given by Munkres (Hungarian) algorithm and Jaccard coefficients) came after the new order established by the short MCES runs. We determined that the optimal number of high scoring MCES solutions to collect for the update of matching orders was 10.
[0297] It should be noted that the final result of the exact MCES algorithm was not affected by the described heuristics for determining best starting point for the search and the best matching orders. However, these considerations were important determinants of the speed of convergence to the final set of solutions.
[0298] Exact MCES Search (Flowchart References 806, 807, 808):
[0299] This is a further stage in the illustrative implementation of step (v) of the methods of the invention. Once the order of the matching priorities and optimal traversal of NOE graph had been determined as described above, the exact MCES algorithm was applied.
[0300] MAGMA Scoring (Flowchart References 809, 810, 806, 807):
[0301] This corresponds to an illustrative implementation of steps (vi) and (vii) of the methods of the invention.
[0302] At the end of its protocol, MAGMA joined all assignment solutions that comprised subgraph isomorphisms (in the case of VF2) and/or MCES (adapted McGregor), and thereby associated each methyl (vertex in the methyl-methyl NOE graph) to a set of atoms (vertices in the structure graph) to which it was uniquely assigned over a set of all solutions. For each peak, the set of all possible structure graph (PDB) assignments were then found. If multiple assignment options were found, the score given to each peak was defined as the reciprocal of the total number of possible assignments for that peak (
EXAMPLE 3: A SIMPLIFIED PROBLEM
[0303] To illustrate how graph matching solves the assignment problem, all solutions were obtained to a relatively simple problem with nine methyl-labelled residues as a function of the number of common edges (i.e. explained NOEs,
EXAMPLE 4: PROOF-OF-PRINCIPLE
[0304] To demonstrate the effectiveness of MAGMA, a series of proof-of-principle calculations using simulated data were performed to examine how different levels of inter-methyl NOE sparsity and amino acid labelling schemes influenced the accuracy and ambiguity of methyl assignments (
[0305] The 34 kDa human Cyclin-dependent kinase 2 (CDK2) (PDB:2C60)(
[0306] Different methyl-methyl NOE sparsity levels were simulated by removal of a subset of edges from the CDK2 structure graph. Here, 0% sparsity corresponds to the case where all distances below the specified threshold were included in the calculation (i.e. the maximum number of edges). The ratio of the number of NOEs removed to the total possible number of NOEs then defines the sparsity. Edges were removed by random sampling according to the following distribution: the probability of removing an edge was given by:
[0307] where r.sub.i is the distance of any given edge (i). Hence, the edges corresponding to longer distances are removed with a higher probability. As the precise result from MAGMA depends on exactly which NOEs have been inputted, for each amino acid labelling scheme and the NOE sparsity percentage, MAGMA was run on three different subsets of edges, generated by changing the seed of the random number generator. This enabled calculation of the mean and the standard deviation of the number of confidently assigned methyls for a given NOE sparsity (
[0308] NOE data with varying sparsity levels were simulated such that shorter-distance NOEs were retained with a greater probability than long range NOEs. The results show that the correct assignment solution was always sampled within the set of solutions generated by MAGMA (
[0309] As expected, an increase in NOE sparsity resulted in an increase in the number of methyls that have multiple assignment possibilities (
EXAMPLE 5: COMPARISON TO OTHER APPROACHES
[0310] We compared the performance of MAGMA to two other freely available automated assignment approaches, FLAMEnGO2.014 and MAP-XSII15 (
EXAMPLE 6: APPLICATION TO AMBIGUOUS ASSIGNMENTS
[0311] For a subset of the methyls that have ambiguous assignments, the ambiguity arises from structural equality of vertices in the NOE graph, where two nodes can be exchanged but the number of satisfied NOEs does not change. Nevertheless, the ability of MAGMA to reveal ambiguity in the assignment of some methyls is valuable for an NMR spectroscopist, as it highlights the protein residues/regions for which more experimental information is required, for example, by targeted SDM (site-directed mutagenesis).
[0312] Where an assignment provided by MAGMA reveals multiple possibilities, vertices can be exchanged and still explain the maximum number of inter-methyl distance restraints. This is valuable information that highlights regions within the protein where additional data should be sought. To illustrate how this information can be used, we show that in a 50% sparse ILV dataset, three single point mutations of the most ambiguous residues provides five new confident assignments (
EXAMPLE 7: TESTING OF MAGMA ON EXPERIMENTAL NOE DATA
[0313] MAGMA was tested against a benchmark of data obtained from 3D or 4D NOESY spectra of methyl-labelled proteins, where methyl-TROSY assignments had been previously determined and cross-validated with other techniques (
[0314] This benchmark was, to the best of the inventors' knowledge, the largest assembled benchmark for testing automatic methyl assignment programs. The benchmark proteins display diversity in molecular mass, shape, amino acid labelling, and data sparsity.
[0315] The distribution of NOEs as a function of distance is shown with the total distances possible within the molecule (
[0316] The number of confident assignments obtained varied with the details of the specific experimental dataset, but coarsely scaled with the number of NOEs in the dataset. On average over the whole benchmark, five methyl-methyl NOEs were needed for each confident assignment (
[0317] An interesting case is the comparison of the isolated r2ATCase dimer (34 kDa) and the r2ATCase dimer in the context of the holoenzyme. The former was produced as it could be assigned using traditional backbone experiments. Fewer NOE peaks were observed in spectra from the dimer over the holoenzyme. Moreover, crystal structures of the holoenzyme exist, whereas to make progress with the dimer, the structure has to be assumed to be equivalent. The known assignment predicts some unexpected very long range NOEs in the isolated dimer, which might suggest that the structure of the dimer is not the same. Remarkably, 16 confident assignments were obtained from the holoenzyme despite sparser NOE data (
[0318] Another interesting case was EIN. The assignment strategy based primarily on PRE data, with the NOE data being used only for cross validation used this protein as benchmark. Here, it was shown that 35% of the methyl assignments can be obtained using NOE data alone.
[0319] In general, MAGMA will provide more confident assignments when there are more NOEs, and the reference structure is the closest possible match with what is present in solution.
TABLE-US-00005 TABLE 1 Methyl-methyl NOE benchmark data Methyl Total Assignable M.sub.w PDB Labelling Methyl containing Data inter- Data resonances (kDa) code scheme groups residues type methyl Subgraphs (i) Ubiquitin 8.6 1ubq I, L, V 33 20 NOE 25 3 18 Ubiquitin 8.6 1ubq L, V 26 13 DREAM 18 1 10 MsrB 16.6 3e0o I, L, V 39 22 NOE 37 2 21 EIN 27 1eza I, L, V, A 146 100 NOE 145 6 84 ATCase R.sub.2 30 1d09 I, L, V 66 39 NOE 91 1 34 MBP 43.4 1ez9 I, L, V 123 73 NOE 144 4 70 MSG 81.4 1y8b I, L, V 274 159 NOE 230 12 141 .sub.7.sub.7 360 1yau I, L, V 96 56 NOE 154 1 53 Mean Distance Confident Assigned Time to Total vertex threshold Sparsity assignments fraction converge time degree (ii) (A) (iii) (iv) (v) (vi (s) (vii) (s) (viii) Ubiquitin 2.89 6.5 0.56 16 0.80 2 10.sup.2 2 10.sup.2 Ubiquitin 3.6 7.5 0.78 10 0.77 2 10.sup.2 2 10.sup.2 MsrB 3.52 10 0.62 17 0.77 4 10.sup.3 4 10.sup.3 EIN 3.45 10 0.37 39 0.35 4 10.sup.3 5 10.sup.3 ATCase R.sub.2 5.35 10 0.58 12 0.31 7 10.sup.4 2 10.sup.5 MBP 4.11 10 0.56 47 0.64 7 10.sup.4 2 10.sup.5 MSG 3.26 10 0.47 60 0.38 7 10.sup.4 4 10.sup.6 .sub.7.sub.7 5.81 10 0.80 52 0.93 2 10.sup.5
EXAMPLE 8: PERFORMANCE COMPARISON
[0320] The performance of MAGMA was compared to that of the two alternative automatic methyl assignment programs (FLAMEnGO2.0 and MAP-XSII) on the protein benchmark described in Example 6 (
EXAMPLE 9: USE OF MAGMA IN STRUCTURE DETERMINATION
[0321] The ATCase R2 dimer is known to exist in R and T conformations. MSG adopts a different structure when bound to certain ligands. When the different structures were provided as input for MAGMA, different sets of confident assignments are returned (see
EXAMPLE 10: CHEMICAL SHIFT FILTERING
[0322] We assessed if chemical shift scoring could be used to reduce ambiguity in assignment solutions generated by MAGMA. Both SHIFTX2 and CH3Shift were used to predict methyl .sup.1H and .sup.13C chemical shifts from a structure, with each giving similar results in the final assignment. For each resonance where MAGMA returned an ambiguous assignment, we calculated a score as S=|.sub.calc.sub.data|/ where gives the chemical shift and , the prediction accuracy of the specific chemical shift under consideration as defined by the relevant programs. A combined .sup.1H-.sup.13C chemical shift score was obtained from the individual .sup.1H and .sup.13C scores from (S.sub.H.sup.2+S.sub.C.sup.2).sup.1/2. In the case where two stereospecific methyl groups are possible, chemical shift differences that correspond to all 4 possible combinations of stereo-specific assignment were computed, and the lowest score was taken as being most likely to be correct.
[0323] The score S was calculated for each possibility of an ambiguous assignment. A new assignment was considered confident if exactly one of the options had a score that was less than or equal to 1. Otherwise, the restraint was not applied (
EXAMPLE 11: USING MAGMA FOR LIGAND STRUCTURE DETERMINATION
[0324] In addition to using protein-protein distance restraints, NOEs from ligands can be included with the calculation. By providing MAGMA with different ligand/protein complexes, it becomes possible to determine where the ligand is bound. The example shown is Hsp90. Without ligand, the protein adopts a closed conformation. With ligand, the protein adopts an open conformation, and the binding site of the ligand is determined as shown in