GENERATION OF CODES FOR CHEMICAL STRUCTURES FROM NMR SPECTROSCOPY DATA
20220189587 · 2022-06-16
Assignee
Inventors
CPC classification
G01R33/4625
PHYSICS
G16C20/20
PHYSICS
G01N24/087
PHYSICS
International classification
G16C20/20
PHYSICS
G06F30/27
PHYSICS
Abstract
A method of generating codes for chemical structures from NMR spectroscopy data comprises receiving spectroscopic data of a chemical compound, inputting the spectroscopic data into a first artificial neural network trained to generate molecular descriptors, receiving a molecular descriptor of the chemical compound from the first artificial neural network, inputting the molecular descriptor into a second artificial neural network, which is the decoder of an autoencoder trained to convert structure data of chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data, and receiving structure data of the chemical compound from the second artificial neural network.
Claims
1. A computer-implemented method, the method comprising: receiving spectroscopic data of a chemical compound, inputting the spectroscopic data into a first artificial neural network, wherein the first artificial neural network has been trained, in a supervised learning method using spectroscopic data of a multitude of chemical reference compounds, to generate molecular descriptors of the chemical reference compounds on the basis of the spectroscopic data of the chemical reference compounds, receiving a molecular descriptor of the chemical compound from the first artificial neural network, inputting the molecular descriptor received into a second artificial neural network, wherein the second artificial neural network is a decoder of an autoencoder, wherein the autoencoder has been trained, in an unsupervised learning method using a multitude of chemical reference compounds, to convert structure data of the chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data, receiving structure data of the chemical compound from the second artificial neural network, and outputting and/or storing the structure data and/or information derived from the structure data.
2. The method of claim 1, wherein the spectroscopic data are data from a nuclear resonance spectrum or multiple nuclear resonance spectra of the chemical compound.
3. The method of claim 1, wherein the spectroscopic data are a peak list from a ¹³C NMR spectrum and/or a ¹H NMR spectrum.
4. The method of claim 1, wherein the molecular descriptor is an n-dimensional vector.
5. The method of claim 1, 4, wherein the molecular descriptor is a continuous and data-driven molecular descriptor.
6. The method of claim 1, wherein the structure data are a chemical structure code.
7. The method of claim 1, wherein the structure data are a SMILES, InChI, CML or WLN code.
8. The method of claim 1, further comprising: calculating spectroscopic data from the structure data received, comparing the spectroscopic data calculated with the spectroscopic data received, identifying the deviations between the spectroscopic data calculated and the spectroscopic data received, and outputting and/or storing the deviations.
9. A system comprising a computer configured to: prompt the receipt of spectroscopic data of a chemical compound, input the spectroscopic data into a first artificial neural network, wherein the first artificial neural network has been trained, in a supervised learning method using spectroscopic data of a multitude of chemical reference compounds, to generate molecular descriptors of the chemical reference compounds on the basis of the spectroscopic data of the chemical reference compounds, receive a molecular descriptor for the chemical compound from the first artificial neural network, input the molecular descriptor received into a second artificial neural network, wherein the second artificial neural network is a decoder of an autoencoder, wherein the autoencoder has been trained, in an unsupervised learning method using a multitude of chemical reference compounds, to convert structure data of the chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data, receive structure data of the chemical compound from the second artificial neural network, and prompt the output and/or storage of the structure data received and/or information derived from the structure data.
10. A non-transitory computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to: receive spectroscopic data of a chemical compound, input the spectroscopic data into a first artificial neural network, wherein the first artificial neural network has been trained, in a supervised learning method using spectroscopic data of a multitude of chemical reference compounds, to generate molecular descriptors of the chemical reference compounds on the basis of the spectroscopic data of the chemical reference compounds, receive a molecular descriptor of the chemical compound from the first artificial neural network, input the molecular descriptor received into a second artificial neural network, wherein the second artificial neural network is a decoder of an autoencoder, wherein the autoencoder has been trained, in an unsupervised learning method using a multitude of chemical reference compounds, to convert structure data of the chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data, receive structure data of the chemical compound from the second artificial neural network, and output and/or store the structure data and/or information derived from the structure data.
11. The non-transitory computer readable storage medium of claim 10, wherein the instructions prompt the processor to execute one or more of: calculating spectroscopic data from the structure data received; comparing the spectroscopic data calculated with the spectroscopic data received; identifying the deviations between the spectroscopic data calculated and the spectroscopic data received; and outputting and/or storing the deviations.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
[0085] In some embodiments, the computer system (10) may be configured to receive spectroscopic data of chemical compounds, to use the spectroscopic data received to generate structure data for the chemical compounds, and to output and/or to store the structure data and/or information derived therefrom.
[0086] In some embodiments, the control and computation unit (12) may serve to control the input unit (11) and the output unit (13), to coordinate the flows of data and signals between the different units, to process spectroscopic data and further data, and to create structure data based on the spectroscopic data by means of a first and second artificial neural network. In some embodiments, the first and second artificial neural network may be loaded, for example, in a memory of the computer system that may be part of the control and computation unit (12).
[0087] In some embodiments, the input unit (11) may serve to receive spectroscopic data of chemical compounds. In some embodiments, the spectroscopic data may be provided/transmitted, for example, via a network (not shown in
[0088] In some embodiments, spectroscopic data can be transmitted via a network connection or a direct connection. Spectroscopic data can be transmitted via radio communication (WLAN, Bluetooth, mobile communications and/or the like) and/or via a cable. It is conceivable that multiple input units are present.
[0089] In some embodiments, the input unit (11) may transmit the spectroscopic data and any further data to the control and computation unit (12). In some embodiments, the control and computation unit (12) may be configured to generate structure data using the data received.
[0090] In some embodiments, the output unit (13) can display the structure data and/or information derived therefrom (for example on a monitor), output them (for example via a printer) or store them in a data storage medium.
[0091] It is conceivable that multiple output units are present. It is likewise possible for there to be multiple input units and/or control and/or computation units.
[0093] The SMILES code for 2-acetoxybenzoic acid is: CC(=O)OC1=CC=CC=C1C(=O)O.
[0094] The InChI code for 2-acetoxybenzoic acid is: 1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12).
[0095] The SMILES code and the InChI code are examples of chemical structure codes.
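The two codes above describe the same compound, so they must agree on its non-hydrogen atoms. The following minimal sketch checks this with the Python standard library only. The function names are hypothetical, the SMILES tokenizer covers only bracket-free organic-subset atoms, and the InChI parser reads only the formula layer of a single-component InChI; it is an illustration under these assumptions, not a full parser for either notation.

```python
import re
from collections import Counter

# Organic-subset atoms that SMILES allows without brackets.
# Two-letter symbols are listed first so "Cl" is not read as "C" + "l".
_SMILES_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")
# One element symbol followed by an optional count, e.g. "C9" or "O".
_FORMULA_TERM = re.compile(r"([A-Z][a-z]?)(\d*)")


def heavy_atoms_from_smiles(smiles: str) -> Counter:
    """Count non-hydrogen atoms in a bracket-free SMILES string."""
    if "[" in smiles:
        raise ValueError("bracket atoms are outside the scope of this sketch")
    # Aromatic atoms are written in lower case; capitalize() maps "c" to "C".
    return Counter(token.capitalize() for token in _SMILES_ATOM.findall(smiles))


def heavy_atoms_from_inchi(inchi: str) -> Counter:
    """Count non-hydrogen atoms from the formula layer of an InChI string."""
    formula = inchi.split("/")[1]  # the layer after the version prefix "1S"
    counts = Counter()
    for symbol, number in _FORMULA_TERM.findall(formula):
        if symbol != "H":
            counts[symbol] += int(number) if number else 1
    return counts


smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
inchi = "1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
consistent = heavy_atoms_from_smiles(smiles) == heavy_atoms_from_inchi(inchi)
print("heavy atoms agree:", consistent)
```

A check of this kind also catches transcription slips: if the ring-closure digit `1` in the SMILES code is misread as a lowercase `l`, the tokenizer reports chlorine atoms that the InChI formula layer does not contain.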
[0097] One example of a peak list for a calculated ¹³C NMR spectrum (in this case for 2-acetoxybenzoic acid) is:
[0098] δ (ppm)
[0099] 20.77529
[0100] 120.91842
[0101] 122.26
[0102] 124.01
[0103] 131.33934
[0104] 133.14333
[0105] 151.28
[0106] 166.94
[0107] 169.22
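A peak list such as the one above has a variable number of entries, whereas a neural network input is usually a vector of fixed length. One common way to bridge this, shown here as a minimal sketch and not as the encoding used in the embodiments, is to count the peaks that fall into each bin of a fixed chemical-shift grid. The function name `bin_peak_list`, the 0 to 220 ppm range and the 1 ppm bin width are illustrative assumptions.

```python
def bin_peak_list(shifts_ppm, min_ppm=0.0, max_ppm=220.0, bin_width=1.0):
    """Turn a list of chemical shifts into a fixed-length vector of peak counts.

    The grid limits and the bin width are assumptions chosen for illustration.
    """
    n_bins = int(round((max_ppm - min_ppm) / bin_width))
    vector = [0] * n_bins
    for shift in shifts_ppm:
        if not (min_ppm <= shift < max_ppm):
            raise ValueError(f"shift {shift} ppm lies outside the binning grid")
        # Index of the bin whose interval contains the shift.
        vector[int((shift - min_ppm) // bin_width)] += 1
    return vector


# Peak list of the calculated 13C NMR spectrum of 2-acetoxybenzoic acid.
peaks = [20.77529, 120.91842, 122.26, 124.01, 131.33934,
         133.14333, 151.28, 166.94, 169.22]
vector = bin_peak_list(peaks)
print(len(vector), sum(vector))
```

A coarser grid makes the vector shorter but merges neighbouring peaks into one bin; a finer grid keeps them apart at the cost of a longer, sparser input.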
[0109] In some embodiments, the process (100) comprises the steps of:
[0110] (110) receiving spectroscopic data of a chemical compound,
[0111] (120) inputting the spectroscopic data into a first artificial neural network, wherein the first artificial neural network has been trained, in a supervised learning method using spectroscopic data of a multitude of chemical reference compounds, to generate molecular descriptors of the chemical reference compounds on the basis of the spectroscopic data of the chemical reference compounds,
[0112] (130) receiving a molecular descriptor of the chemical compound from the first artificial neural network,
[0113] (140) inputting the molecular descriptor received into a second artificial neural network, wherein the second artificial neural network is a decoder of an autoencoder, wherein the autoencoder has been trained, in an unsupervised learning method using a multitude of chemical reference compounds, to convert structure data of the chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data,
[0114] (150) receiving structure data of the chemical compound from the second artificial neural network,
[0115] (160) outputting and/or storing the structure data and/or information derived from the structure data.
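The data flow of steps (110) to (160) can be sketched as two chained function calls. In the following minimal sketch, the function name `generate_structure_code` and its parameters are hypothetical, and the two "networks" in the usage part are untrained stand-ins that only make the example runnable; in an actual embodiment they would be the trained first artificial neural network and the trained decoder of the autoencoder.

```python
from typing import Callable, Sequence

Descriptor = Sequence[float]


def generate_structure_code(
    spectroscopic_data: Sequence[float],
    first_network: Callable[[Sequence[float]], Descriptor],
    second_network: Callable[[Descriptor], str],
    store: Callable[[str], None] = print,
) -> str:
    """Chain steps (110) to (160) for one chemical compound."""
    # (110) The spectroscopic data are received as the first argument.
    # (120) Input the data into the first network and
    # (130) receive the molecular descriptor of the compound.
    descriptor = first_network(spectroscopic_data)
    # (140) Input the descriptor into the decoder and
    # (150) receive the structure data, e.g. a SMILES code.
    structure_code = second_network(descriptor)
    # (160) Output and/or store the structure data.
    store(structure_code)
    return structure_code


# Untrained stand-ins, for illustration only.
def toy_first_network(peaks: Sequence[float]) -> Descriptor:
    # Placeholder descriptor: peak count and largest shift, crudely scaled.
    return [len(peaks) / 10.0, max(peaks) / 220.0]


def toy_decoder(descriptor: Descriptor) -> str:
    # Placeholder decoder: always returns the SMILES code of 2-acetoxybenzoic acid.
    return "CC(=O)OC1=CC=CC=C1C(=O)O"


stored = []
result = generate_structure_code(
    [20.77529, 120.91842, 169.22], toy_first_network, toy_decoder, stored.append
)
print(result)
```

Passing the networks in as callables keeps the pipeline independent of any particular framework, and the `store` callback stands for whichever output unit or data storage medium is used.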
[0119] Table 1 below shows images of a series of chemical structures of chemical compounds in the form of their skeletal formulae (middle column). The left-hand column of Table 1 gives a peak list of a ¹³C NMR spectrum for each of the chemical compounds. The right-hand column of Table 1 lists the SMILES codes predicted in accordance with the invention for the chemical compounds on the basis of the peak lists. For the prediction of the SMILES codes, a multilayer feedforward neural network with fully connected layers was used as the first neural network, and the decoder of an autoencoder based on the architecture described by Winter et al. was used as the second neural network (Chem. Sci., 2019, 10, 1692-1701; Chem. Sci., 2020, 11, 10378-10389).
TABLE 1
14.3; 0.0Q; 9 | 14.3; 0.0Q; 12 | 24.8; 0.0Q; 5 | 24.8; 0.0Q; 6 | 61.4; 0.0T; 8 | 61.4; 0.0T; 11 | 123.1; 0.0S; 1 | 123.1; 0.0S; 4 | 141.0; 0.0D; 0 | 162.1; 0.0S; 2 | 162.1; 0.0S; 3 | 165.7; 0.0S; 7 | 165.7; 0.0S; 10
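The peak list in Table 1 is a sequence of entries separated by `|`, each consisting of three fields separated by `;`. The following minimal sketch parses such a string. The interpretation of the fields (chemical shift in ppm; an intensity value fused with a multiplicity letter; an atom index) is an assumption based on the layout of the entries and is not stated in the text, and the names `Peak` and `parse_peak_list` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Peak(NamedTuple):
    shift_ppm: float    # assumed: chemical shift in ppm
    intensity: float    # assumed: intensity value (0.0 throughout Table 1)
    multiplicity: str   # assumed: multiplicity letter such as S, D, T or Q
    atom_index: int     # assumed: index of the carbon atom the peak belongs to


# One entry, e.g. "14.3; 0.0Q; 9": shift; intensity plus letter; atom index.
_ENTRY = re.compile(r"^\s*([\d.]+)\s*;\s*([\d.]+)([A-Za-z]*)\s*;\s*(\d+)\s*$")


def parse_peak_list(text: str) -> List[Peak]:
    """Parse a '|'-separated peak list into a list of Peak records."""
    peaks = []
    for entry in text.split("|"):
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unrecognised peak entry: {entry!r}")
        shift, intensity, multiplicity, atom = match.groups()
        peaks.append(Peak(float(shift), float(intensity), multiplicity, int(atom)))
    return peaks


table_1_entry = (
    "14.3; 0.0Q; 9|14.3; 0.0Q; 12|24.8; 0.0Q; 5|24.8; 0.0Q; 6|"
    "61.4; 0.0T; 8|61.4; 0.0T; 11|123.1; 0.0S; 1|123.1; 0.0S; 4|"
    "141.0; 0.0D; 0|162.1; 0.0S; 2|162.1; 0.0S; 3|165.7; 0.0S; 7|165.7; 0.0S; 10"
)
peaks = parse_peak_list(table_1_entry)
unique_shifts = sorted({peak.shift_ppm for peak in peaks})
print(len(peaks), "entries,", len(unique_shifts), "distinct shifts")
```

Under this reading, several atom indices share the same shift value, so the number of distinct shifts is smaller than the number of entries.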