GENERATION OF CODES FOR CHEMICAL STRUCTURES FROM NMR SPECTROSCOPY DATA

20220189587 · 2022-06-16

Abstract

A method of generating codes for chemical structures from NMR spectroscopy data comprises receiving spectroscopic data of a chemical compound, inputting the spectroscopic data into a first artificial neural network trained to generate molecular descriptors, receiving a molecular descriptor of the chemical compound from the first artificial neural network, inputting the molecular descriptor into a second artificial neural network, which is the decoder of an autoencoder trained to convert structure data of chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data, and receiving structure data of the chemical compound from the second artificial neural network.

Claims

1. A computer-implemented method, the method comprising: receiving spectroscopic data of a chemical compound, inputting the spectroscopic data into a first artificial neural network, wherein the first artificial neural network has been trained, in a supervised learning method using spectroscopic data of a multitude of chemical reference compounds, to generate molecular descriptors of the chemical reference compounds on the basis of the spectroscopic data of the chemical reference compounds, receiving a molecular descriptor of the chemical compound from the first artificial neural network, inputting the molecular descriptor received into a second artificial neural network, wherein the second artificial neural network is a decoder of an autoencoder, wherein the autoencoder has been trained, in an unsupervised learning method using a multitude of chemical reference compounds, to convert structure data of the chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data, receiving structure data of the chemical compound from the second artificial neural network, and outputting and/or storing the structure data and/or information derived from the structure data.

2. The method of claim 1, wherein the spectroscopic data are data from a nuclear resonance spectrum or multiple nuclear resonance spectra of the chemical compound.

3. The method of claim 1, wherein the spectroscopic data are a peak list from a .sup.13C NMR spectrum and/or a .sup.1H NMR spectrum.

4. The method of claim 1, wherein the molecular descriptor is an n-dimensional vector.

5. The method of claim 1, wherein the molecular descriptor is a continuous and data-driven molecular descriptor.

6. The method of claim 1, wherein the structure data are a chemical structure code.

7. The method of claim 1, wherein the structure data are a SMILES, InChI, CML or WLN code.

8. The method of claim 1, further comprising: calculating spectroscopic data from the structure data received, comparing the spectroscopic data calculated with the spectroscopic data received, identifying the deviations between the spectroscopic data calculated and the spectroscopic data received, and outputting and/or storing the deviations.

9. A system comprising a computer configured to prompt the receipt of spectroscopic data of a chemical compound, input the spectroscopic data into a first artificial neural network, wherein the first artificial neural network has been trained, in a supervised learning method using spectroscopic data of a multitude of chemical reference compounds, to generate molecular descriptors of the chemical reference compounds on the basis of the spectroscopic data of the chemical reference compounds, receive a molecular descriptor for the chemical compound from the first artificial neural network, input the molecular descriptor received into a second artificial neural network, wherein the second artificial neural network is a decoder of an autoencoder, wherein the autoencoder has been trained, in an unsupervised learning method using a multitude of chemical reference compounds, to convert structure data of the chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data, receive structure data of the chemical compound from the second artificial neural network, and prompt the output and/or storage of the structure data received and/or information derived from the structure data.

10. A non-transitory computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to: receive spectroscopic data of a chemical compound, input the spectroscopic data into a first artificial neural network, wherein the first artificial neural network has been trained, in a supervised learning method using spectroscopic data of a multitude of chemical reference compounds, to generate molecular descriptors of the chemical reference compounds on the basis of the spectroscopic data of the chemical reference compounds, receive a molecular descriptor of the chemical compound from the first artificial neural network, input the molecular descriptor received into a second artificial neural network, wherein the second artificial neural network is a decoder of an autoencoder, wherein the autoencoder has been trained, in an unsupervised learning method using a multitude of chemical reference compounds, to convert structure data of the chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data, receive structure data of the chemical compound from the second artificial neural network, output and/or store the structure data and/or information derived from the structure data.

11. The non-transitory computer readable storage medium of claim 10, wherein the instructions prompt the processor to execute one or more of: calculating spectroscopic data from the structure data received; comparing the spectroscopic data calculated with the spectroscopic data received; identifying the deviations between the spectroscopic data calculated and the spectroscopic data received; and outputting and/or storing the deviations.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0077] FIG. 1 shows a computer system, according to some embodiments of the present disclosure.

[0078] FIG. 2 shows an example of an image representation of the chemical structure of a chemical compound.

[0079] FIG. 3 shows a .sup.13C NMR spectrum of the chemical compound shown in FIG. 2.

[0080] FIG. 4 shows a method for generating codes for chemical structures, according to some embodiments.

[0081] FIG. 5 shows the mode of function of the first artificial neural network, according to some embodiments.

[0082] FIG. 6 shows the mode of function of an autoencoder, according to some embodiments.

[0083] FIG. 7 shows the interplay of the first and second artificial neural networks in the generation of structure data, according to some embodiments.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

[0084] FIG. 1 shows, in schematic form, a computer system according to some embodiments of the present disclosure. The computer system (10) comprises an input unit (11), a control and computation unit (12) and an output unit (13).

[0085] In some embodiments, the computer system (10) may be configured to receive spectroscopic data of chemical compounds, to use the spectroscopic data received to generate structure data for the chemical compounds, and to output and/or to store the structure data and/or information derived therefrom.

[0086] In some embodiments, the control and computation unit (12) may serve to control the input unit (11) and the output unit (13), to coordinate the flows of data and signals between the different units, to process spectroscopic data and further data, and to create structure data based on the spectroscopic data by means of a first and second artificial neural network. In some embodiments, the first and second artificial neural network may be loaded, for example, in a memory of the computer system that may be part of the control and computation unit (12).

[0087] In some embodiments, the input unit (11) may serve to receive spectroscopic data of chemical compounds. In some embodiments, the spectroscopic data may be provided/transmitted, for example, via a network (not shown in FIG. 1) by another computer system and/or read out from a database that may be part of the computer system according to the invention or may be connected thereto via a network.

[0088] In some embodiments, spectroscopic data can be transmitted via a network connection or a direct connection. Spectroscopic data can be transmitted via radio communication (WLAN, Bluetooth, mobile communications and/or the like) and/or via a cable. It is conceivable that multiple input units are present.

[0089] In some embodiments, the input unit (11) may transmit the spectroscopic data and any further data to the control and computation unit (12). In some embodiments, the control and computation unit (12) may be configured to generate structure data using the data received.

[0090] In some embodiments, the output unit (13) can display the structure data and/or information derived therefrom (for example on a monitor), output them (for example via a printer) or store them in a data storage medium.

[0091] It is conceivable that multiple output units are present. It is likewise possible for there to be multiple input units and/or control and/or computation units.

[0092] FIG. 2 shows an example of an image representation of the chemical structure of a chemical compound. The compound is acetylsalicylic acid; the IUPAC name of the compound is 2-acetoxybenzoic acid. The structure represented as an image is a skeletal formula: the carbon atoms and the hydrogen atoms bonded to carbon atoms are not shown explicitly; all that are shown explicitly are the oxygen atoms (O) and the hydrogen atom (H) bonded to an oxygen atom by their element symbols (O, H).

[0093] The SMILES code for 2-acetoxybenzoic acid is: CC(=O)OC1=CC=CC=C1C(=O)O.

[0094] The InChI code for 2-acetoxybenzoic acid is: 1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12).

[0095] The SMILES code and the InChI code are examples of chemical structure codes.
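By way of a hedged illustration only, the following minimal sketch shows how a program might read such a chemical structure code. It splits the InChI code of paragraph [0094] into its layers and counts the atoms in the formula layer. The helper names `split_inchi_layers` and `parse_formula` are hypothetical and not part of the disclosure, and the sketch covers only the simple layer layout of this one example, not the full InChI specification.

```python
import re
from typing import Dict


def split_inchi_layers(inchi: str) -> Dict[str, str]:
    """Split an InChI body such as '1S/C9H8O4/c.../h...' into named layers."""
    # Tolerate an optional 'InChI=' prefix in front of the version layer.
    body = inchi[len("InChI="):] if inchi.startswith("InChI=") else inchi
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        # Later layers carry a one-letter prefix (here: c = connections, h = hydrogens).
        layers[part[0]] = part[1:]
    return layers


def parse_formula(formula: str) -> Dict[str, int]:
    """Count atoms per element in a simple formula layer such as 'C9H8O4'."""
    counts: Dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


# InChI code of 2-acetoxybenzoic acid as given in paragraph [0094].
ASPIRIN_INCHI = "1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"

layers = split_inchi_layers(ASPIRIN_INCHI)
atom_counts = parse_formula(layers["formula"])
print(layers["version"], atom_counts)
```

The formula layer yields nine carbon, eight hydrogen and four oxygen atoms, which agrees with the molecular formula of 2-acetoxybenzoic acid.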

[0096] FIG. 3 shows, by way of example, a .sup.13C NMR spectrum of the chemical compound shown in FIG. 2 (2-acetoxybenzoic acid). The spectrum was recorded in deuterochloroform as solvent (0.039 g of 2-acetoxybenzoic acid in 0.5 ml of CDCl.sub.3) at 50.18 MHz. The chemical shift δ in ppm is given on the abscissa.

[0097] One example of a peak list for a calculated .sup.13C NMR spectrum (in this case for 2-acetoxybenzoic acid) is:
[0098] δ (ppm)
[0099] 20.77529
[0100] 120.91842
[0101] 122.26
[0102] 124.01
[0103] 131.33934
[0104] 133.14333
[0105] 151.28
[0106] 166.94
[0107] 169.22
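The disclosure does not state how a peak list is encoded before it is input into the first artificial neural network. Purely as an assumption for illustration, the following sketch bins the chemical shifts of the peak list above into a fixed-length count vector, which is one common way of presenting variable-length peak lists to a feedforward network. The function name `bin_peaks`, the 0 to 220 ppm range and the 1 ppm bin width are all hypothetical choices.

```python
from typing import List

# Calculated 13C chemical shifts (ppm) of 2-acetoxybenzoic acid from the peak list above.
PEAKS_PPM = [20.77529, 120.91842, 122.26, 124.01, 131.33934,
             133.14333, 151.28, 166.94, 169.22]


def bin_peaks(peaks: List[float], max_ppm: float = 220.0,
              bin_width: float = 1.0) -> List[int]:
    """Count peaks per bin; range and bin width are assumed values, not from the source."""
    n_bins = int(max_ppm / bin_width)
    vector = [0] * n_bins
    for shift in peaks:
        index = int(shift // bin_width)
        # Peaks outside the assumed range are ignored rather than clipped.
        if 0 <= index < n_bins:
            vector[index] += 1
    return vector


features = bin_peaks(PEAKS_PPM)
occupied = [i for i, count in enumerate(features) if count]
print(len(features), occupied)
```

Every peak list then maps to a vector of the same length regardless of the number of peaks, at the cost of losing shift resolution below the bin width.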

[0108] FIG. 4 shows a method according to some embodiments of the present disclosure in the form of a flow chart.

[0109] In some embodiments, the process (100) comprises the steps of:
[0110] (110) receiving spectroscopic data of a chemical compound,
[0111] (120) inputting the spectroscopic data into a first artificial neural network, wherein the first artificial neural network has been trained, in a supervised learning method using spectroscopic data of a multitude of chemical reference compounds, to generate molecular descriptors of the chemical reference compounds on the basis of the spectroscopic data of the chemical reference compounds,
[0112] (130) receiving a molecular descriptor of the chemical compound from the first artificial neural network,
[0113] (140) inputting the molecular descriptor received into a second artificial neural network, wherein the second artificial neural network is a decoder of an autoencoder, wherein the autoencoder has been trained, in an unsupervised learning method using a multitude of chemical reference compounds, to convert structure data of the chemical reference compounds to molecular descriptors and to convert the molecular descriptors back to the structure data,
[0114] (150) receiving structure data of the chemical compound from the second artificial neural network,
[0115] (160) outputting and/or storing the structure data and/or information derived from the structure data.
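Steps (110) to (160) can be sketched as a simple composition of two callables. The following is a minimal sketch under stated assumptions, not the implementation of the disclosure: the function `generate_structure_code` and the stand-ins `toy_first_network` and `toy_decoder` are hypothetical, and the stand-ins are fixed toy functions rather than trained networks.

```python
from typing import Callable, List, Sequence

Descriptor = List[float]


def generate_structure_code(
    spectroscopic_data: Sequence[float],
    first_network: Callable[[Sequence[float]], Descriptor],
    decoder: Callable[[Descriptor], str],
    store: Callable[[str], None],
) -> str:
    """Chain the two networks as in steps (110) to (160)."""
    # (110) The spectroscopic data are received as the first argument.
    # (120)/(130) Input the data into the first network and receive a descriptor.
    descriptor = first_network(spectroscopic_data)
    # (140)/(150) Input the descriptor into the decoder and receive structure data.
    structure_code = decoder(descriptor)
    # (160) Output and/or store the structure data.
    store(structure_code)
    return structure_code


# Hypothetical stand-ins for illustration only; these are NOT trained networks.
def toy_first_network(peaks: Sequence[float]) -> Descriptor:
    return [len(peaks) / 10.0, max(peaks) / 220.0]


def toy_decoder(descriptor: Descriptor) -> str:
    # Always returns the SMILES code of 2-acetoxybenzoic acid.
    return "CC(=O)OC1=CC=CC=C1C(=O)O"


stored: List[str] = []
result = generate_structure_code([20.8, 169.2], toy_first_network,
                                 toy_decoder, stored.append)
print(result)
```

Passing the networks in as callables keeps the sketch independent of any particular machine learning framework. A real system would substitute the trained first network and the trained decoder.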

[0116] FIG. 5 shows the mode of function of the first artificial neural network, according to some embodiments. In some embodiments, the first neural network (NN) may be trained to use spectroscopic data (SD) as input to generate a molecular descriptor (MD). In some embodiments, it may use a peak list of a .sup.13C NMR spectrum to generate a continuous and data-driven molecular descriptor (cddd).

[0117] FIG. 6 shows the mode of function of an autoencoder, according to some embodiments. In some embodiments, the autoencoder (AC) may consist of an encoder (EC) and a decoder (DC). In some embodiments, the autoencoder (AC) may be trained to use structure data, in the present case from a chemical structure code of a chemical compound, to generate a molecular descriptor (MD) (encoding), and to use the molecular descriptor (MD) to reconstruct the chemical structure code (decoding). In the present case, the encoder (EC) of the autoencoder (AC) uses the SMILES code of 2-acetoxybenzoic acid (CC(=O)OC1=CC=CC=C1C(=O)O) to generate a continuous and data-driven molecular descriptor (cddd), and the decoder (DC) of the autoencoder (AC) uses the continuous and data-driven molecular descriptor (cddd) to generate the SMILES code of 2-acetoxybenzoic acid (CC(=O)OC1=CC=CC=C1C(=O)O).
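A sequence autoencoder of this kind typically consumes and emits a structure code as a sequence of tokens. The following sketch illustrates only that data representation and its lossless round trip, assuming a simple regular-expression tokenizer and one-hot encoding. It does not implement the trained autoencoder or the architecture of Winter et al., and the names `tokenize`, `one_hot_encode` and `one_hot_decode` are hypothetical.

```python
import re
from typing import List

# Two-letter halogens, bracket atoms and two-digit ring closures stay together as one token.
TOKEN_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|%\d{2}|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES code into tokens (simplified; an assumption of this sketch)."""
    return TOKEN_PATTERN.findall(smiles)


def one_hot_encode(tokens: List[str], vocabulary: List[str]) -> List[List[int]]:
    """Represent each token as a row with a single 1 at its vocabulary index."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [[1 if index[token] == col else 0 for col in range(len(vocabulary))]
            for token in tokens]


def one_hot_decode(matrix: List[List[int]], vocabulary: List[str]) -> str:
    """Map each one-hot row back to its token and join the tokens."""
    return "".join(vocabulary[row.index(1)] for row in matrix)


SMILES = "CC(=O)OC1=CC=CC=C1C(=O)O"  # 2-acetoxybenzoic acid
tokens = tokenize(SMILES)
vocabulary = sorted(set(tokens))
encoded = one_hot_encode(tokens, vocabulary)
reconstructed = one_hot_decode(encoded, vocabulary)
print(reconstructed == SMILES)
```

In an actual autoencoder, the encoder would compress such a token matrix into the molecular descriptor and the decoder would predict the token sequence from the descriptor. The round trip shown here only confirms that the token representation itself loses no information.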

[0118] FIG. 7 shows the interplay of the first and second artificial neural networks in the generation of structure data of a chemical structure of a chemical compound from spectroscopic data of the chemical compound, according to some embodiments. The first artificial neural network (NN) shown in FIG. 7 is the network shown in FIG. 5. The second artificial neural network (DC) shown in FIG. 7 is the decoder of the autoencoder (AC) shown in FIG. 6. In some embodiments, the first artificial neural network (NN) may receive spectroscopic data of a chemical compound (in the present case of benzoic acid). In some embodiments, the first artificial neural network (NN) may use the spectroscopic data to generate a molecular descriptor (MD), in the present example a continuous and data-driven molecular descriptor (cddd). This continuous and data-driven molecular descriptor (cddd) is sent to the second artificial neural network (DC). In some embodiments, the second artificial neural network (DC) may use the molecular descriptor to generate the SMILES code of benzoic acid (C1=CC=C(C=C1)C(=O)O).

[0119] Table 1 below shows images of a series of chemical structures of chemical compounds in the form of their skeletal formulae (middle column). The left-hand column of Table 1 gives a peak list of a .sup.13C NMR spectrum for each of the chemical compounds. The right-hand column of Table 1 lists the SMILES codes predicted in accordance with the invention for the chemical compounds on the basis of the peak lists. For the prediction of the SMILES codes, a multilayer feedforward neural network with fully connected layers was used as the first neural network, and the decoder of an autoencoder based on the architecture described by Winter et al. was used as the second neural network (Chem. Sci., 2019, 10, 1692-1701; Chem. Sci., 2020, 11, 10378-10389).

TABLE 1

Compound 1
Peak list (.sup.13C NMR): 14.3; 0.0Q; 9|14.3; 0.0Q; 12|24.8; 0.0Q; 5|24.8; 0.0Q; 6|61.4; 0.0T; 8|61.4; 0.0T; 11|123.1; 0.0S; 1|123.1; 0.0S; 4|141.0; 0.0D; 0|162.1; 0.0S; 2|162.1; 0.0S; 3|165.7; 0.0S; 7|165.7; 0.0S; 10
Chemical structure: [00001] embedded image
Predicted SMILES code: CCOC(=O)c1cc(C(=O)OCC)c(C)nc1C

Compound 2
Peak list (.sup.13C NMR): 15.1; 0.0Q; 10|15.1; 0.0Q; 12|60.9; 0.0T; 9|60.9; 0.0T; 11|84.4; 0.0S; 6|85.1; 0.0S; 7|91.8; 0.0D; 8|121.9; 0.0S; 3|128.2; 0.0D; 1|128.2; 0.0D; 5|128.7; 0.0D; 0|131.8; 0.0D; 2|131.8; 0.0D; 4
Chemical structure: [00002] embedded image
Predicted SMILES code: CCOC(C#Cc1ccccc1)OCC

Compound 3
Peak list (.sup.13C NMR): 14.6; 0.0Q; 7|14.6; 0.0Q; 9|62.2; 0.0T; 6|62.2; 0.0T; 8|101.9; 0.0S; 0|102.3; 0.0D; 5|109.8; 0.0D; 2|113.3; 0.0S; 4|121.2; 0.0D; 1|123.3; 0.0D; 3
Chemical structure: [00003] embedded image
Predicted SMILES code: CCOC(OCC)n1cccc1C#N

Compound 4
Peak list (.sup.13C NMR): 52.32; 0.0Q; 9|69.56; 0.0T; 6|69.56; 0.0T; 7|106.91; 0.0D; 3|107.87; 0.0D; 1|107.87; 0.0D; 5|127.7; 0.0D; 11|127.7; 0.0D; 15|127.7; 0.0D; 17|127.7; 0.0D; 21|127.92; 0.0D; 13|127.92; 0.0D; 19|128.46; 0.0D; 12|128.46; 0.0D; 14|128.46; 0.0D; 18|128.46; 0.0D; 20|131.61; 0.0S; 0|136.64; 0.0S; 10|136.64; 0.0S; 16|159.51; 0.0S; 2|159.51; 0.0S; 4|165.83; 0.0S; 8
Chemical structure: [00004] embedded image
Predicted SMILES code: COC(=O)c1cc(OCc2ccccc2)cc(OCc2ccccc2)c1

Compound 5
Peak list (.sup.13C NMR): 26.8; 0.0Q; 9|124.4; 0.0D; 0|124.4; 0.0D; 3|131.0; 0.0S; 4|131.0; 0.0S; 5|135.5; 0.0D; 1|135.5; 0.0D; 2|165.3; 0.0S; 6|165.3; 0.0S; 7|168.8; 0.0S; 8
Chemical structure: [00005] embedded image
Predicted SMILES code: CC(=O)N1C(=O)c2ccccc2C1=O
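The peak lists in Table 1 use a compact notation with entries separated by "|" and fields separated by ";". The source does not define this notation, so the following sketch rests on an assumed reading: the first field is the chemical shift in ppm, the middle field is an intensity followed by a one-letter multiplicity (S, D, T or Q), and the last field is an atom index. The names `Peak` and `parse_peak_list` are hypothetical.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    shift_ppm: float
    intensity: float      # assumed meaning of the number in the middle field
    multiplicity: str     # assumed: S, D, T or Q
    atom_index: int       # assumed meaning of the last field


def parse_peak_list(text: str) -> List[Peak]:
    """Parse entries of the form 'shift; intensity+letter; atom' separated by '|'."""
    peaks = []
    for entry in text.split("|"):
        shift, middle, atom = (field.strip() for field in entry.split(";"))
        # The middle field looks like '0.0Q': a number followed by one letter.
        peaks.append(Peak(float(shift), float(middle[:-1]), middle[-1], int(atom)))
    return peaks


# Peak list of the fifth compound in Table 1.
ROW_5 = ("26.8; 0.0Q; 9|124.4; 0.0D; 0|124.4; 0.0D; 3|131.0; 0.0S; 4|"
         "131.0; 0.0S; 5|135.5; 0.0D; 1|135.5; 0.0D; 2|165.3; 0.0S; 6|"
         "165.3; 0.0S; 7|168.8; 0.0S; 8")

peaks = parse_peak_list(ROW_5)
unique_shifts = sorted({peak.shift_ppm for peak in peaks})
print(len(peaks), unique_shifts)
```

Collapsing the entries to unique shifts gives a plain list of chemical shifts like the one in paragraph [0097], which could then be encoded for the first neural network.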