Method for harvesting 3D chemical structures from file formats

Abstract

A method and system for harvesting molecular structures from non-editable documents are disclosed herein. A non-editable storage document is fed by a feeder and received by a receiver. The molecular and non-molecular data contained in the non-editable storage document are recognized. The three-dimensional coordinates of the molecular data are separated using pattern recognition. The molecular coordinates are encoded by a pattern sequence. Bond matrix data is generated from the encoded data. Subsequently, the bond matrix data is verified for accuracy by comparison with standardized data stored in a library.

Claims

1. A method for harvesting computable molecular data from a non-editable storage document and converting into recognizable data comprises feeding a non-editable storage document by a feeder; receiving the non-editable storage document by a receiver; recognizing and separating molecular and non-molecular data contained in the non-editable storage document by an analyzer; computing three-dimensional coordinates of said molecular data using a pattern recognition by the analyzer; encoding molecular coordinates by a pattern sequence by the analyzer, wherein the pattern sequence for encoding coordinates comprises two characters, followed by a space, an addition or subtraction symbol, a number, decimal and eight digits succeeding the decimal; generating cartesian coordinates bond matrix data from said encoded data by the analyzer; ensuring reusability of the data by the analyzer; and verifying the cartesian coordinates bond matrix data for accuracy by the analyzer with a stored standardized data into a library; wherein the method enables large scale conversion of molecular information from supplementary data available in the non-editable storage document and avoids computational duplication.

2. The method according to claim 1, wherein recognizing and separating molecular and non-molecular data is executed by parsing method.

3. The method according to claim 1, wherein generation of cartesian coordinates bond matrix comprises computation of bond angles, bond lengths and dihedral angles, interatomic distances, sequence pattern of amino acids in proteins and the reusability of the data is ensured by calculation of single point energy.

4. The method according to claim 1, wherein the conversion of data is output into a standard interoperability document.

5. A system for harvesting computable molecular data from a non-editable storage document and converting into recognizable data comprises: a feeder to feed a non-editable storage document, a receiver to receive a non-editable storage document, a library having standardized data stored, an analyzer wherein said analyzer recognizes and separates molecular and non-molecular data contained in said non-editable storage document, computes three-dimensional coordinates of the molecular data using a pattern recognition, encodes molecular coordinates by a pattern sequence, wherein said pattern sequence for encoding of coordinates by the analyzer comprises two characters, followed by a space, an addition or subtraction symbol, a number, decimal and eight digits succeeding the decimal, generates cartesian coordinates, bond matrix data from the encoded data, ensures reusability of the data, verifies the cartesian coordinates, bond matrix data for accuracy with a stored standardized data into a library.

6. The system according to claim 5, wherein the analyzer recognizes and separates molecular and non-molecular data by executing parsing method.

7. The system according to claim 5, wherein the analyzer generates cartesian coordinates bond matrix with computation of bond angles, bond lengths and dihedral angles, interatomic distances, sequence pattern of amino acids in proteins and further the analyzer ensures reusability of the data by calculation of single point energy.

8. The system according to claim 5, wherein the conversion of data is output into a standard interoperability document.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic representation of the inventive concept described in the present invention.

(2) FIG. 2 illustrates the workflow of computational steps for extracting re-computable molecular structures from PDF articles as implemented in the present invention.

(3) FIG. 3 illustrates the logic implemented in the present invention for bond matrix creation.

(4) FIG. 4 illustrates the standardized (optimized geometry) conformation of a torsional rotational transition state reproduced using the method of the present invention by harvesting the 3D coordinate structural data from the textual patterns generated for the PDF file containing the supporting material.

(5) FIG. 5 highlights the challenges posed by the diverse coordinate formats present in the supplementary tables of journal articles.

(6) FIG. 6 depicts the bond recognition process implemented in the present invention. a, b, c represent three scenarios between two interacting atoms A1 and A2, wherein a bond was considered to be present.

(7) FIG. 7 illustrates interatomic bond distances of Dimethyl sulfide (Mol ID 29) reproduced using the method of the present invention.

(8) FIG. 8 illustrates a comparative plot of single point energies of molecules extracted from coordinate data of Example 2. The values are in agreement with the original computed data.

(9) FIG. 9 illustrates the application of the present method in an embedded system.

DETAILED DESCRIPTION OF THE INVENTION

(10) The invention will now be described in detail in connection with certain preferred and optional embodiments, so that various aspects thereof may be more fully understood and appreciated.

(11) FIG. 1 provides a schematic representation of the method of the present invention. An electronic image document or a non-editable document such as a PDF file is provided as an input. Besides the molecular coordinates, such a document also contains brief descriptions of molecules, computed data, plots, page numbers, document information, manuscript bibliographic details, etc., all in a single document. Harvesting the molecular data from such an electronic document is difficult because this non-molecular data has to be excluded while parsing the document. The method of the present invention recognizes the molecular data from the coordinates in text format and separates it from the remaining relevant but non-molecular text.

(12) FIG. 2 illustrates the complete sequence of steps employed in the present invention. First, the data in the non-editable document, such as a PDF file, is converted into textual data using a simple PDF parser. The textual data retrieved is then analyzed using a pattern recognition method to separate the 3D coordinates from the non-molecular text and to identify the atomic coordinates and atom information. All the X, Y, Z coordinates are encoded by a general pattern sequence consisting of 2 characters, followed by a space, an addition or subtraction symbol, a number, a decimal and eight digits succeeding the decimal.
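The pattern-recognition step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `COORD_LINE`, `to_float` and `extract_atoms` are hypothetical, the atom label is allowed to be one or two letters (the coordinate listings in Example 2 use one-letter symbols), the sign is treated as optional, and the typographic minus sign that PDF-to-text conversion may produce is accepted alongside the ASCII hyphen.

```python
import re

# One signed decimal with exactly eight digits after the decimal point.
# The sign is optional here, and the Unicode minus (U+2212) is accepted
# because text extracted from PDFs may contain it (an assumption).
_NUM = r"[+\-\u2212]?\d+\.\d{8}"

# Hypothetical pattern: atom label, then X, Y and Z separated by whitespace.
COORD_LINE = re.compile(
    rf"^\s*(?P<atom>[A-Za-z]{{1,2}})\s+(?P<x>{_NUM})\s+(?P<y>{_NUM})\s+(?P<z>{_NUM})\s*$"
)


def to_float(token):
    """Convert a coordinate token to float, normalizing the Unicode minus."""
    return float(token.replace("\u2212", "-"))


def extract_atoms(text):
    """Return (symbol, x, y, z) tuples for every line that looks like a coordinate."""
    atoms = []
    for line in text.splitlines():
        match = COORD_LINE.match(line)
        if match:
            atoms.append(
                (match["atom"], to_float(match["x"]), to_float(match["y"]), to_float(match["z"]))
            )
    return atoms


# Two coordinate lines from the Example 2 listing, surrounded by made-up
# non-molecular text of the kind the method has to skip.
sample = """Supporting Information, page 3
MOL_0
C \u22120.04781100 1.16216400 0.00000000
H \u22121.09556300 1.46309200 0.00000000
Figure S2. Optimized geometry
"""

for atom in extract_atoms(sample):
    print(atom)
```

Anchoring the pattern at both ends of the line keeps headers, captions and page furniture out of the result, at the cost of rejecting coordinate lines that carry trailing annotations.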

(13) Once the coordinate file is created, the bond matrix is computed to provide the interconnectivity information for reconstructing the original molecules reported in the supplementary material of the research article. The computation of the bond matrix is illustrated in FIG. 3. Important parameters such as bond angles, bond lengths and dihedral angles are verified and checked for consistency in the recreated molecule, which is then saved in the original file format, for instance gjf. The coordinate data and bond matrix information are used to create molecules in standard interoperability formats such as .sdf or .mol as ready-to-compute molecules for the convenience of the user. This process avoids unnecessary generation of molecular data and laborious recomputation of already published work. The molecules can be subjected to further simulations such as descriptor calculation, energy profiling, docking, etc.
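The geometric parameters mentioned above (bond lengths, bond angles and dihedral angles) follow directly from the Cartesian coordinates. The sketch below uses standard vector formulas; the function names are hypothetical, and the sign convention of the dihedral angle is a choice made for this example rather than something stated in the text.

```python
import math


def _sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _norm(a):
    return math.sqrt(_dot(a, a))


def bond_length(a, b):
    """Distance between two atoms, in the units of the input (angstroms)."""
    return _norm(_sub(a, b))


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b as the central atom."""
    u, v = _sub(a, b), _sub(c, b)
    cos_theta = _dot(u, v) / (_norm(u) * _norm(v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def dihedral(a, b, c, d):
    """Torsion angle a-b-c-d in degrees, in the range [-180, 180]."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    length = _norm(b2)
    m1 = _cross(n1, tuple(x / length for x in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))


# Three atoms taken from the MOL_0 listing in Example 2: C, S and the H on S.
carbon = (-0.04781100, 1.16216400, 0.00000000)
sulfur = (-0.04781100, -0.66970400, 0.00000000)
hydrogen = (1.28575000, -0.83557700, 0.00000000)

print("C-S length:", round(bond_length(carbon, sulfur), 4))
print("C-S-H angle:", round(bond_angle(carbon, sulfur, hydrogen), 2))
```

Comparing such values between the harvested structure and the published one is one way to carry out the consistency check described in the paragraph.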

(14) The method of the present invention facilitates the generation of a bond matrix from the coordinate and atom type information. The interatomic distances of all the elements in the periodic table are taken into account to annotate the bond order between two atoms. The cut-off distance between two vicinal atoms involved in covalent bond formation was calculated as the sum of their atomic radii plus a scaling factor of 0.35 Å; any distance greater than this was considered a non-bonding interaction by the program. Likewise, the interatomic distances of all other atom pairs were computed to generate the bond matrix of a molecule.
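The cut-off rule above can be sketched as a small connectivity routine. The text does not say which table of atomic radii is used, so the values in `COVALENT_RADII` are an assumption (commonly tabulated covalent radii for a handful of elements), and the function name `bond_matrix` is hypothetical. The sketch records only whether a bond is present; it does not attempt to assign bond orders.

```python
import math
from itertools import combinations

# Assumed covalent radii in angstroms; the source does not specify its table.
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
SCALING_FACTOR = 0.35  # angstroms, as stated in the text


def bond_matrix(atoms, radii=COVALENT_RADII, scaling=SCALING_FACTOR):
    """Return a symmetric 0/1 matrix marking atom pairs within the cut-off.

    atoms is a list of (symbol, x, y, z) tuples.
    """
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        symbol_i, *position_i = atoms[i]
        symbol_j, *position_j = atoms[j]
        cutoff = radii[symbol_i] + radii[symbol_j] + scaling
        # Distances above the cut-off are treated as non-bonding.
        if math.dist(position_i, position_j) <= cutoff:
            matrix[i][j] = matrix[j][i] = 1
    return matrix


# Coordinates of MOL_0 from the Example 2 listing.
mol_0 = [
    ("C", -0.04781100, 1.16216400, 0.00000000),
    ("H", -1.09556300, 1.46309200, 0.00000000),
    ("H", 0.43082600, 1.55738100, 0.89506100),
    ("H", 0.43082600, 1.55738100, -0.89506100),
    ("S", -0.04781100, -0.66970400, 0.00000000),
    ("H", 1.28575000, -0.83557700, 0.00000000),
]

matrix = bond_matrix(mol_0)
bonds = [(i, j) for i in range(len(mol_0)) for j in range(i + 1, len(mol_0)) if matrix[i][j]]
print(bonds)
```

With these assumed radii the routine links the carbon to its three hydrogens and to the sulfur, and the sulfur to the remaining hydrogen. A different radii table would shift the cut-offs, so borderline contacts should be checked against the published structure.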

(15) In another aspect of the invention, the molecules are converted from doc or txt document format.

(16) The following examples, which include preferred embodiments, will serve to illustrate the practice of this invention, it being understood that the particulars shown are by way of example and for purpose of illustrative discussion of preferred embodiments of the invention.

Example 1

(17) A supporting material file relating to a reaction modeling research paper describing the mechanistic investigation of epoxide formation from sulfur ylides and aldehydes was considered. The PDF file was processed to directly extract a .txt file, from which patterns were discerned to generate the coordinate data. An important constraint for generating ready-to-compute molecules was the non-availability of bond order information in the published coordinate data. To address this problem, the method of the present invention creates a bond matrix, that is, the inter-atomic connectivity of a given cluster of atoms. The method accurately retained the original conformations of all the optimized molecules when the extracted atomic coordinates were supplied back to the original program, which is illustrated in FIG. 5.

(18) To validate the accuracy of the proposed method, the bond matrix for the atoms of all the molecules (n=29) deposited in the research article was computed and compared with the ones generated by the original software, Gaussian. The values were identical in both cases. The coordinate data and the computed connectivity information, that is, the bond matrix, could be used to generate molecules in the SDF and MOL formats. The bond matrix output generated by the present invention for the first molecule is given below.

(19) Mol_1 1 C1 S2 1.7019797266712668 1.55 0.1519797266712668

(20) Mol_1 2 C1 H5 1.0905715244769594 1.2000000000000002 −0.1094284755230408

(21) Mol_1 3 C1 H6 1.0926613153214495 1.2000000000000002 −0.10733868467855068

(22) Mol_1 4 S2 C3 1.829628926859214 1.55 0.27962892685921403

(23) Mol_1 5 S2 C4 1.851840751792659 1.55 0.3018407517926589

(24) Mol_1 6 C3 H7 1.095048094834195 1.2000000000000002 −0.10495190516580521

(25) Mol_1 7 C3 H8 1.0929989249765986 1.2000000000000002 −0.10700107502340162

(26) Mol_1 8 C3 H9 1.0943367945929627 1.2000000000000002 −0.10566320540703744

(27) Mol_1 9 C4 H10 1.0946906229615743 1.2000000000000002 −0.10530937703842591

(28) Mol_1 10 C4 H11 1.0934381646897096 1.2000000000000002 −0.10656183531029062

(29) Mol_1 11 C4 H12 1.0947090115642606 1.2000000000000002 −0.10529098843573959

(30) Understanding atomic (electronic) movements and distances is of paramount importance in transition state modeling studies of organic reactions. Typically, the cut-off distance for the presence of a bond is computed as the sum of the covalent radii of the two atoms.

(31) For the same purpose, the interatomic distances of all the elements in the periodic table are taken into account to annotate the bond order between two atoms. The creation of a bond matrix between two atoms A1 and A2 in a molecule according to the present invention is schematically represented in FIG. 6. The cut-off distance between two vicinal atoms involved in covalent bond formation was calculated as the sum of their atomic radii plus a scaling factor of 0.35 Å; any distance greater than this was considered a non-bonding interaction by the program. Likewise, the interatomic distances of all other atom pairs were computed to generate the bond matrix of a molecule.

(32) To validate the method of the present invention, the bond matrix for the atoms of all the molecules (n=29) deposited in the supplementary information of the research article was computed and compared with the ones generated by the original software (Gaussian). The values were identical in both cases. The bond matrix conformation of a representative molecule from this set is shown in FIG. 7. The coordinate data and the computed connectivity information could be used to generate molecules in the SDF and MOL formats.
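As an illustration of how coordinates and connectivity might be written out in the MOL and SDF formats, the sketch below emits a V2000-style connection table. It is a hedged example: the function names `to_molblock` and `to_sdf` are hypothetical, the fixed-width layout follows the V2000 molfile convention as commonly documented and has not been validated here against a cheminformatics toolkit, and every bond is written with order 1 because the connectivity information alone does not supply bond orders.

```python
def to_molblock(name, atoms, bonds):
    """Build a V2000-style MOL block.

    atoms: list of (symbol, x, y, z); bonds: list of (i, j, order) with
    1-based atom indices, as the MOL format expects.
    """
    lines = [name, "  sketch", ""]  # three header lines: title, program, comment
    # Counts line: number of atoms, number of bonds, then fixed filler fields.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for symbol, x, y, z in atoms:
        # Coordinates are written with four decimals, so precision beyond
        # that is lost in this representation.
        lines.append(
            f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0"
        )
    for i, j, order in bonds:
        lines.append(f"{i:3d}{j:3d}{order:3d}  0")
    lines.append("M  END")
    return "\n".join(lines) + "\n"


def to_sdf(records):
    """Concatenate MOL blocks into an SDF string, one record per molecule."""
    return "".join(to_molblock(*record) + "$$$$\n" for record in records)


# MOL_1 from the Example 2 listing, with bonds assumed to be single.
mol_1_atoms = [
    ("C", -1.11122700, 0.00005600, -0.00880200),
    ("H", -1.42403800, -0.00270000, 1.04234300),
    ("H", -1.51094200, 0.90050300, -0.47689500),
    ("H", -1.51064400, -0.89830400, -0.48120000),
    ("S", 0.69456200, 0.00001000, -0.00196500),
]
mol_1_bonds = [(1, 2, 1), (1, 3, 1), (1, 4, 1), (1, 5, 1)]

print(to_sdf([("MOL_1", mol_1_atoms, mol_1_bonds)]))
```

In practice an established toolkit would normally be used for this step, since it also handles charges, radicals and bond-order perception that this sketch leaves out.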

Example 2

(33) The input is a well-cited paper in which computational studies were performed on a range of alkenes to gain insights into the mechanistic processes involved in the thiolene reactions typically classified under click chemistry. In Example 1 the approach was straightforward: an open source PDF reader was employed to convert the supporting information PDF file to text. In the present case, by contrast, the PDF file was first saved in a plain text format externally and then fed as an input to the method of the present invention for extracting the coordinates. The inadvertent errors in file conversion are related to compatibility issues associated with the various PDF maker programs available on the web. The method of the present invention successfully generated the Cartesian coordinates, bond matrix and non-molecular data of all the reported molecules. Due to a pagination problem in the original PDF document, a few structures partially failed; that is to say, a few atoms carried forward to the next molecule. This pagination issue was later addressed by a molecular block identifier.

(34) MOL_0

(35) C −0.04781100 1.16216400 0.00000000

(36) H −1.09556300 1.46309200 0.00000000

(37) H 0.43082600 1.55738100 0.89506100

(38) H 0.43082600 1.55738100 −0.89506100

(39) S −0.04781100 −0.66970400 0.00000000

(40) H 1.28575000 −0.83557700 0.00000000

(41) MOL_1

(42) C −1.11122700 0.00005600 −0.00880200

(43) H −1.42403800 −0.00270000 1.04234300

(44) H −1.51094200 0.90050300 −0.47689500

(45) H −1.51064400 −0.89830400 −0.48120000

(46) S 0.69456200 0.00001000 −0.00196500

(47) MOL_2

(48) C −1.28038600 0.22044600 −0.00000100

(49) H −1.30140400 1.30644800 −0.00003900

(50) H −2.23896200 −0.28606800 0.00010900

(51) C −0.13464400 −0.45374900 −0.00003700

(52) H −0.16675400 −1.54212600 0.00001400

(53) C 1.23345600 0.16237100 0.00000400

(54) H 1.80706400 −0.15250500 0.87891300

(55) H 1.80774100 −0.15382800 −0.87798100

(56) H 1.18176300 1.25366800 −0.00081300
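The listing above can be split back into individual molecules by treating each `MOL_n` line as the start of a new block. The text does not describe how the molecular block identifier works internally, so the following is only one plausible sketch: the header pattern `MOL_` followed by digits is taken from the listing, the function name `split_blocks` is hypothetical, and lines that do not have exactly four fields are skipped.

```python
import re

# Header lines as they appear in the listing: MOL_0, MOL_1, ...
HEADER = re.compile(r"^MOL_\d+$")


def split_blocks(text):
    """Group coordinate lines under the most recent MOL_n header.

    Returns a dict mapping each header to a list of (symbol, x, y, z).
    """
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if HEADER.match(line):
            current = line
            blocks[current] = []
            continue
        # Normalize the typographic minus sign before parsing numbers.
        fields = line.replace("\u2212", "-").split()
        if current is None or len(fields) != 4:
            continue  # text outside a block, or not a coordinate line
        symbol, x, y, z = fields
        try:
            blocks[current].append((symbol, float(x), float(y), float(z)))
        except ValueError:
            continue  # four fields, but not numeric coordinates
    return blocks


listing = """MOL_0
C \u22120.04781100 1.16216400 0.00000000
H \u22121.09556300 1.46309200 0.00000000
MOL_1
C \u22121.11122700 0.00005600 \u22120.00880200
"""

for header, atoms in split_blocks(listing).items():
    print(header, len(atoms))
```

Because atoms are attached only to the most recent header, a page break in the middle of a molecule does not push its remaining atoms into the next block as long as the next header is detected.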

(57) Table 1 summarizes the results of the examples, representing the diversity of molecular coordinate data in supplementary material handled by the present invention.

(58) TABLE 1

Entry | Case Study | N = molecules | Regular Expression pattern | Format & Delimiter
1 | Epoxide formation from sulfur ylides and aldehydes | 29 | ^[A-Za-z0-9]{1,2}\\s+-{0,1}.{1,2}[0-9]{1,8}\\s+-{0,1}.{1,2}[0-9]{1,8}.{1,} | PDF, Space
2 | Thiolene click chemistry | 115 | ^[A-Za-z0-9]{1,2}\\s+-{0,1}.{1,2}[0-9]{1,8}\\s+-{0,1}.{1,2}[0-9]{1,8}.{1,} | Text, Space
3 | Design of tetra(arenediyl)bis(allyl) derivatives for Cope rearrangement transition states | 55 | ^[A-Za-z0-9]{1,2}\\,[0]{0,1}[\\,]{0,1}-{0,1}.{1,2}[0-9]{1,10}\\,-{0,1}.{1,2}[0-9]{1,10}.{1,} | PDF, Comma
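The two regular expressions in Table 1 can be tried out in Python as shown below. This is a sketch under assumptions: the patterns are transcribed with the circumflex read as `^` and the typographic minus read as an ASCII hyphen, the doubled backslashes are assumed to be string escaping and are written as single backslashes in Python raw strings, and the names `PATTERNS` and `is_coordinate_line` are hypothetical. The comma-delimited sample line is made up for illustration.

```python
import re

# Patterns transcribed from Table 1, keyed by delimiter.
PATTERNS = {
    "space": re.compile(
        r"^[A-Za-z0-9]{1,2}\s+-{0,1}.{1,2}[0-9]{1,8}\s+-{0,1}.{1,2}[0-9]{1,8}.{1,}"
    ),
    "comma": re.compile(
        r"^[A-Za-z0-9]{1,2}\,[0]{0,1}[\,]{0,1}-{0,1}.{1,2}[0-9]{1,10}\,-{0,1}.{1,2}[0-9]{1,10}.{1,}"
    ),
}


def is_coordinate_line(line, delimiter):
    """Return True if the line matches the Table 1 pattern for this delimiter."""
    # Normalize the typographic minus sign that text extraction may produce.
    normalized = line.replace("\u2212", "-")
    return PATTERNS[delimiter].match(normalized) is not None


print(is_coordinate_line("C \u22120.04781100 1.16216400 0.00000000", "space"))
print(is_coordinate_line("C,0,-1.2345678901,0.1234567890,2.0000000000", "comma"))
print(is_coordinate_line("Table S1 Cartesian coordinates", "space"))
```

These patterns are permissive; for example, `.{1,2}` accepts any one or two characters ahead of the digits, so a line that passes this filter still needs to be parsed and validated before it is treated as a coordinate.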

(59) A comparative plot of single point energies of molecules extracted from coordinate data related to Example 2 is illustrated in FIG. 8. The values are in agreement with the original computed data. CBS = Complete Basis Set. RHF = Restricted Hartree-Fock.

Example 3

(60) In order to handle several hundred PDF files and harvest truly computable molecular data buried in them, the method can harvest atomic coordinate data mixed with images, for example spectral data, barcode images, experimental data, molecular descriptions and other computed data. The molecules are processed and transformed into SDF format, which is compatible with commercial packages, thus saving time and computational effort. This step helps readers access the original input files even after the passage of time. It is pertinent to mention here that the biological sciences and bioinformatics community follows a standard representation of molecular coordinates in the PDB file format, which is a database-compliant format, instead of the PDF format, thus securing easy access and exchange of information. Extracting the coordinates of a protein molecule from a PDF file is a challenging task, assuming an average protein size of over 200,000 atoms. However, with the aid of ChemEngine customized with additional atomic coordinate pattern recognition modules, it is now possible to harvest any molecular data from the PDF format. With the advent of 3D structure repositories and several free academic sites, data storage is no longer a major issue; the ready-to-compute molecules can be deposited and maintained to avoid duplication of computational effort.

Advantages of Invention

(61) Easy extraction of data from PDF files.

Conversion of data to a reusable format.

Employing the method of the present invention avoids unnecessary generation of molecular data and laborious recomputation of already published work.

CPC classification: G16C20/40, G16C20/80, G16C20/70 (PHYSICS)

International classification: G16C20/80, G16C20/40, G16C20/70 (PHYSICS)
