SYSTEM AND METHOD FOR IDENTIFYING PEPTIDE SEQUENCES
20170327891 · 2017-11-16
Inventors
- Erich J. Baker (Waco, TX, US)
- Christopher Kearney (Woodway, TX, US)
- S. M. Ashiqul Islam (Waco, TX, US)
- Tanvir Sajed (Edmonton, CA)
Cpc classification
G16Z99/00
PHYSICS
C12Q1/6883
CHEMISTRY; METALLURGY
International classification
Abstract
A system and method for searching published genomes utilizes a robust and sensitive model to identify peptides that may serve as protein toxins. The protein toxins include unique cysteine stabilized structures and may be referred to as sequential tri-disulfide peptides (STPs) or as non-sequential tri-disulfide peptides (NTPs). While the sequence variability of STPs is so great that there are severe limitations to searching using traditional sequence-based methods, the present system and method efficiently and accurately identifies STPs as well as NTPs from published genome databases, or in any peptide sequence, including artificial sequences.
Claims
1. A method for identifying peptide sequences including sequential tri-disulfide peptide (STP) structures using a Support Vector Machine (SVM)-based model, comprising: obtaining a set of training peptide sequences, wherein the training peptide sequences are identified as containing STP structures or lacking STP structures; identifying a numerical order of six cysteines in each training peptide sequence which is C1-C2-C3-C4-C5-C6; extracting a set of features from the training peptide sequences, wherein the set of features comprises three Normalized Bonding Distance (NBD) values, a presence of double consecutive cysteines in a C4-C5 loop, a presence of double consecutive cysteines in a C5-C6 loop, a least loop length to total length ratio, a total number of amino acid residues in the sequence, an aggregate number of occurrences of cysteine, serine, arginine, histidine, lysine (C,S,R,H,K), an aggregate number of occurrences of hydrophobic (F,Y,L,I,A,M,C,W,V) amino acids, an aggregate number of occurrences of hydrophilic (R,K,N,D,A,P) amino acids, and an aggregate number of occurrences of neutral (G,H,S,T,Q) amino acids; compiling the features into a feature matrix used for training the SVM-based model to predict presence of STP structures; obtaining an unknown peptide sequence; identifying a numerical order of six cysteines in the unknown peptide sequence which is C1-C2-C3-C4-C5-C6; extracting the set of features from the unknown peptide sequence; and using the SVM-based model to analyze the features of the unknown peptide sequence in relation to the feature matrix and to identify whether the unknown peptide sequence includes an STP structure.
2. The method of claim 1, wherein the three Normalized Bonding Distance (NBD) values are extracted by using the following equations:
$\mathrm{NBD}_1 = 100 / (|P_1 - \bar{x}_{P_1}| + \ldots)$
$\mathrm{NBD}_2 = 100 / (|P_2 - \bar{x}_{P_2}| + \ldots)$
$\mathrm{NBD}_3 = 100 / (|P_3 - \bar{x}_{P_3}| + \ldots)$
3. The method of claim 1, wherein the least loop length to total length ratio is extracted by calculating min(ΔC.sub.i,i+1) divided by the total length of the sequence, and wherein if min(ΔC.sub.i,i+1) is more than 3, then the value for the feature is 0.
4. The method of claim 1, wherein the unknown peptide sequence is obtained by searching a genome.
5. The method of claim 1, wherein the unknown peptide sequence is an artificial sequence.
6. A method for identifying peptide sequences including compact stabilized tri-disulfide peptide structures using a Support Vector Machine (SVM)-based model, comprising: obtaining a set of training peptide sequences, wherein the training peptide sequences are identified as containing compact stabilized tri-disulfide peptide structures or lacking compact stabilized tri-disulfide peptide structures; identifying a numerical order of six cysteines in each training peptide sequence which is C1-C2-C3-C4-C5-C6; extracting a set of features from the training peptide sequences, wherein the set of features comprises three Normalized Bonding Distance (NBD) values, a least loop length to total length ratio, a total number of amino acid residues in the sequence, a total number of occurrences of each amino acid in the sequence, an aggregate number of occurrences of hydrophobic (F,Y,L,I,A,M,C,W,V) amino acids, an aggregate number of occurrences of hydrophilic (R,K,N,D,A,P) amino acids, and an aggregate number of occurrences of neutral (G,H,S,T,Q) amino acids; compiling the features into a feature matrix used for training the SVM-based model to predict presence of compact stabilized tri-disulfide peptide structures; obtaining an unknown peptide sequence; identifying a numerical order of six cysteines in the unknown peptide sequence which is C1-C2-C3-C4-C5-C6; extracting the set of features from the unknown peptide sequence; and using the SVM-based model to analyze the features of the unknown peptide sequence in relation to the feature matrix and to predict whether the unknown peptide sequence includes a compact stabilized tri-disulfide peptide structure.
7. The method of claim 6, wherein the three Normalized Bonding Distance (NBD) values are extracted by using the following equations:
$\mathrm{NBD}_1 = 100 / (|P_1 - \bar{x}_{P_1}| + \ldots)$
$\mathrm{NBD}_2 = 100 / (|P_2 - \bar{x}_{P_2}| + \ldots)$
$\mathrm{NBD}_3 = 100 / (|P_3 - \bar{x}_{P_3}| + \ldots)$
8. The method of claim 6, wherein the least loop length to total length ratio is extracted by calculating min(ΔC.sub.i,i+1) divided by the total length of the sequence, and wherein if min(ΔC.sub.i,i+1) is more than 3, then the value for the feature is 0.
9. The method of claim 1, wherein the unknown peptide sequence is obtained by searching a genome.
10. The method of claim 1, wherein the unknown peptide sequence is an artificial sequence.
Description
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0029] The present disclosure relates to a system and method for locating and identifying sequential tri-disulfide peptide (STP) toxins and nonsequential tri-disulfide peptide (NTP) toxins. The system and method may be used for searching published genomes, or for analyzing synthetic or artificial sequences.
[0030] The present system and method utilize machine learning algorithms. It is imperative that successful machine learning algorithms select proper training sets and features.
[0031] First, for the known STP sequence collection, sequences of ICKs and cyclotides (knotted STPs) were collected from an available Knottin database (knottin.cbs.cnrs.fr/) and 167 sequences with solved 3D structures were obtained from this source. An additional 37 sequences of nonknotted STPs with known 3D structures were collected from Protein Data Bank (PDB) with 90% sequence identity (www.rcsb.org, June, 2013). The total set of 204 candidate sequences (167 from the knottin database and 37 from PDB) was further reduced to remove redundant sequences, defined as sequences sharing ≥90% sequence identity using CD-HIT (Huang et al., 2010 and Li, 2006). A total of 108 sequences were retained from the knottin database set and 36 sequences were retained from the PDB set, leaving 144 canonical STPs. The mean, standard deviation and range of the number of residues in the positive training set were 42.20, 15.70 and 23-143, respectively, with an average number of 6 cysteines per chain.
[0032] For the control negative sequence collection, sequences classified as negative control were collected from PDB using a criterion that was species agnostic and stipulated the exclusion of STPs through positive matches to PDB small proteins. This was defined as greater than 95% matches to the following PDB Query: “Experimental method is SOLUTION NMR; SCOP is small proteins; chain type: there is a protein chain but not any DNA or RNA or hybrid; stoichiometry in biological assembly: stoichiometry is MONOMER and TAXONOMY is Eukaryota (eucaryotes); released between 2000 and 2010.” Thus, the negative training set was constructed with a collection of small proteins verified from the NMR subset deposited in PDB between 2000 and 2010. They contain a similar number of total residues as STPs, and a number have tri-disulfide bonds (NTPs) in their 3D structure. 393 sequences were classified as non-STP sequences for the purposes of this disclosure. The mean, standard deviation and range of the number of residues in the chains of the negative training set are 63.16, 25.92 and 9-160, respectively, with an average number of 6 cysteines per chain.
[0033] A next step in the process followed to develop and evaluate the present system and method, as shown in
[0034] Likewise, if the min(ΔC.sub.i,i+1) was less than or equal to three and was located between C1 and C2 or between C2 and C3, the motif was disregarded, because such motifs are often found within electron transport-like proteins such as ferredoxin, rubredoxin, and iron-sulfur proteins. Otherwise, the min(ΔC.sub.i,i+1) was defined to exist between cysteines C3 and C4. This default pair of cysteines was shifted to a higher pair of cysteines if there were fewer than two additional C-terminal cysteines. For example, if, after the default C3 and C4 cysteines were identified, there was only one C-terminal cysteine, then the min(ΔC.sub.i,i+1) was defined as lying between cysteines C4 and C5. Also, if the smallest distance was between the first two or the second two cysteines in the primary sequence, then the STP motif was disregarded. If the STP motif was not invalidated, its cysteines were numbered according to their order in the motif. There may be more than six cysteines in the peptide, but the order was defined only for the cysteines participating in the putative STP motif. Other cysteines, which were not participating in the STP motif, were not given an order-based identity (e.g., C1, C2, C3, C4, C5, C6).
[0035] After putative STP motifs were identified and the order of each cysteine was defined in the STP motif, the Normalized Bonding Distances (NBD) were calculated. A set of three proximity lengths was calculated: P.sub.1=ΔC.sub.1,4; P.sub.2=ΔC.sub.2,5; P.sub.3=ΔC.sub.3,6. Motifs with fewer than six cysteines, or motifs defined as invalid by the utilized criteria, were assigned P.sub.1=P.sub.2=P.sub.3=0.
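The proximity-length step can be illustrated with a minimal Python sketch. It assumes, hypothetically, that ΔC.sub.i,j is the difference between the zero-based sequence indices of cysteines i and j, and that the first six cysteines in the chain form the putative motif; the motif-validation rules of the disclosure are not reproduced. The function names are illustrative only.

```python
def cysteine_positions(seq):
    """Return the zero-based index of every cysteine (C) in the sequence."""
    return [i for i, aa in enumerate(seq.upper()) if aa == "C"]


def proximity_lengths(seq):
    """Return (P1, P2, P3) = (dC1,4, dC2,5, dC3,6), or zeros if no motif."""
    pos = cysteine_positions(seq)
    if len(pos) < 6:
        # Fewer than six cysteines: P1 = P2 = P3 = 0, as stated in the text.
        return (0, 0, 0)
    c = pos[:6]  # assumption: the first six cysteines are C1..C6
    return (c[3] - c[0], c[4] - c[1], c[5] - c[2])
```

For a toy sequence with cysteines at indices 0, 2, 5, 9, 14 and 20, the sketch yields proximity lengths of 9, 12 and 15.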
[0036] A Normalized Proximity Length (NP) was then assigned for each proximity length, P, resulting in three new values: NP.sub.1, NP.sub.2, and NP.sub.3. The NP identifies the distance from the observed mean proximity lengths of known STPs to the corresponding bonded cysteines involved in STP cysteine loops in the training set. For example, the average P for all STP sequences in the training set is subtracted from the calculated P value associated with its corresponding proximity length and normalized as shown in the equation below, where xPj is the average of the proximity lengths of known STPs derived from the training set.
$NP_j = 1000 / (|P_j - \bar{x}_{P_j}| + \ldots), \quad j \in \{1, 2, 3\}$
[0037] Here, if the query peptide does not have a true STP motif, all three P values will be zero and the NPj values will be less than 10. On the other hand, if the three P values exactly equal the corresponding average P values for all the known STPs, then the NP values will be exactly 10, which is the maximum value. This step therefore generates three features providing the NP values for C1-C4, C2-C5 and C3-C6.
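The behavior described above can be sketched in Python under a loudly stated assumption: the printed equation is cut off after the subtraction, so the sketch assumes the denominator adds a constant offset equal to one tenth of the numerator, which is the simplest form that gives the stated maximum of exactly 10 when P equals the training-set mean. The function name, the `scale` parameter and the example mean of 15 are hypothetical.

```python
def normalized_proximity(p, mean_p, scale=1000.0):
    """Score how close a proximity length p lies to the training-set mean.

    Assumption: denominator = |p - mean_p| + scale / 10, so that the score
    is exactly 10 when p == mean_p and smaller otherwise. The source
    equation is truncated and does not confirm this offset.
    """
    offset = scale / 10.0
    return scale / (abs(p - mean_p) + offset)
```

With a hypothetical mean proximity length of 15, a query with P = 15 scores 10, and a query with P = 0 (no valid motif) scores below 10, in line with the paragraph above.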
[0038] Another feature utilized in the feature sets involved detecting the least loop length ratio. The least loop length is defined as the min(ΔC.sub.i,i+1) divided by the total length of the peptide. This feature is used as part of feature sets 5 and 6, as shown below.
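A minimal Python sketch of the least loop length ratio follows. It assumes, hypothetically, that ΔC.sub.i,i+1 is the index difference between adjacent cysteines and that all cysteines in the chain are considered; the rule that the feature is 0 when the smallest loop exceeds 3 is taken from the wording of the claims. The function name and the `max_loop` parameter are illustrative.

```python
def least_loop_ratio(seq, max_loop=3):
    """Return min(dC_i,i+1) / len(seq), or 0.0 if the smallest loop > max_loop."""
    pos = [i for i, aa in enumerate(seq.upper()) if aa == "C"]
    if len(pos) < 2:
        return 0.0  # no adjacent cysteine pair to measure
    # Index difference between each pair of neighbouring cysteines.
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    smallest = min(gaps)
    if smallest > max_loop:
        return 0.0  # per the claims: feature is 0 when the minimum exceeds 3
    return smallest / len(seq)
```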
[0039] Another feature related to detecting the presence of amino acids between C4-C5 and C5-C6. Data published describing loop lengths of ICKs and cyclotides, which comprise a large subset of STPs, motivated a Boolean feature for the presence of interloop amino acids. A result of “true” was returned if there was a presence of a minimum of one amino acid in both of the last two loops (C4-C5 and C5-C6) in a putative STP motif.
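The Boolean inter-loop feature can be sketched as follows. The sketch assumes the six motif cysteine positions (C1 to C6) have already been determined and are passed in as zero-based indices; the function name is hypothetical.

```python
def has_interloop_residues(c_positions):
    """True if at least one residue lies in both the C4-C5 and C5-C6 loops."""
    if len(c_positions) != 6:
        raise ValueError("expected the six motif cysteine positions C1..C6")
    _, _, _, c4, c5, c6 = c_positions
    # An index difference greater than 1 means a residue sits between the pair.
    return (c5 - c4) > 1 and (c6 - c5) > 1
```

A motif in which C4 and C5 are directly adjacent (for example positions 9 and 10) returns false, because the C4-C5 loop is empty.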
[0040] Additional features were also considered in each of the tested feature sets, as shown below in Table 1. Overall, six unique sets of features were used in the machine learning protocol. The first feature set was derived from a multiple sequence alignment (MSA) using MUSCLE (Li, 2006) in MEGA 5.10. Here, each column was considered an independent feature, providing 318 unique features. Feature sets 2-6 were derived from a variety of sequence metadata, including composition and frequency of different amino acids, hydrophobicity, hydrophilicity, neutrality, bonding proximity score (defined below), total length of a chain and least loop to total length ratio (defined below), creating sets of 3, 23, 23, 28 and 28 features, respectively.
TABLE 1. Feature Sets

| Feature Set | No. of Features | Features |
|---|---|---|
| 1 | 23 | Derived from calculating the frequency of occurrence of each amino acid plus the frequency of occurrence of aggregate hydrophobic (F, Y, L, I, A, M, C, W, V), hydrophilic (R, K, N, D, A, P) and neutral (G, H, S, T, Q) amino acids |
| 2 | 23 | Derived from calculating the number of occurrences of each amino acid plus the aggregate number of occurrences of hydrophobic (F, Y, L, I, A, M, C, W, V), hydrophilic (R, K, N, D, A, P), and neutral (G, H, S, T, Q) amino acids |
| 3 | 3 | Derived from the Normalized Bonding Distance (NBD) between C1-C4, C2-C5 and C3-C6 |
| 4 | 7 | Derived from the Normalized Bonding Distance (NBD) between C1-C4, C2-C5 and C3-C6, presence of amino acid between C4-C5 and C5-C6, presence of double consecutive cysteines in the sequence in the 4th and 5th loop, total peptide length and the least loop length ratio. The latter was calculated by dividing the length of the shortest ΔC.sub.i,i+1 by the total length of the peptide |
| 5 | 11 | Derived from Feature Set 4, plus calculating the frequency of occurrences of cysteine, serine, arginine, histidine, lysine (C, S, R, H, K) plus the frequency of occurrences of hydrophobic (F, Y, L, I, A, M, C, W, V), hydrophilic (R, K, N, D, A, P), and neutral (G, H, S, T, Q) amino acids |
| 6 | 11 | Derived from Feature Set 4, plus calculating the composition of cysteine, serine, arginine, histidine, lysine (C, S, R, H, K) plus the aggregate number of occurrences of hydrophobic (F, Y, L, I, A, M, C, W, V), hydrophilic (R, K, N, D, A, P), and neutral (G, H, S, T, Q) amino acids |
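The 23 count-based features of Feature Set 2 in Table 1 (one count per standard amino acid plus three aggregate counts) might be computed as in the Python sketch below. The residue groups are copied from Table 1 as printed, so alanine (A) is counted in both the hydrophobic and the hydrophilic aggregate; the function and constant names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
HYDROPHOBIC = set("FYLIAMCWV")          # groups exactly as listed in Table 1
HYDROPHILIC = set("RKNDAP")
NEUTRAL = set("GHSTQ")


def composition_features(seq):
    """Return 20 per-residue counts followed by 3 aggregate group counts."""
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    aggregates = [
        sum(1 for aa in seq if aa in group)
        for group in (HYDROPHOBIC, HYDROPHILIC, NEUTRAL)
    ]
    return counts + aggregates
```

Dividing each entry by the sequence length would give frequency-based features of the kind Table 1 lists for Feature Set 1.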
[0041] The next step in the process was building support vector machine classification models based on the different feature sets. A Support Vector Machine (SVM) classifier/predictor implementation was used to elucidate STP toxins. The SVM was implemented using the e1071 library in R (2.15.1).
[0042] The training data set of 144 STP and 393 non-STP chains was evaluated by drawing 100 STP and 300 non-STP chains at random in each of 200 iterations to determine the optimal feature sets. All six feature sets were examined. Feature sets were assigned as described in Table 1, and sensitivity, specificity, precision and accuracy were determined after tenfold cross validation. Initial gamma and cost were set to 0.1 and 0.1, respectively, with the best output at 0.0587. A confusion matrix was created to perform the cross validation test. True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN) were determined from the confusion matrix. Sensitivity [TP/(TP+FN)], specificity [TN/(TN+FP)], precision [TP/(TP+FP)], accuracy [(TP+TN)/(TP+FN+TN+FP)] and Matthews Correlation Coefficient (MCC) [(TP×TN−FP×FN)/sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}] were calculated to evaluate the performance of the algorithm.
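The evaluation measures named above follow directly from the four confusion-matrix counts. The Python sketch below is an illustration rather than the R code used in the disclosure; the function name and the dictionary layout are hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, precision, accuracy and MCC."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        # MCC is undefined when any marginal total is zero; return 0.0 then.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

Applied to the NewNMR751 counts reported in Table 7 (TP = 21, TN = 726, FP = 2, FN = 2), the sketch gives a sensitivity and a precision of 21/23, or about 91.30%.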
[0043] The sensitivity, specificity, precision, accuracy and MCC scores were calculated for each of the feature sets, and results are shown in
[0045] The first portion of the PredSTP model is illustrated in
[0046] As shown in
$\mathrm{NBD}_1 = 100 / (|P_1 - \bar{x}_{P_1}| + \ldots)$
$\mathrm{NBD}_2 = 100 / (|P_2 - \bar{x}_{P_2}| + \ldots)$
$\mathrm{NBD}_3 = 100 / (|P_3 - \bar{x}_{P_3}| + \ldots)$
[0047] The SVM implementation will consider the three calculated NBD values for all of the peptides in the training set and will determine criteria by which an unknown sequence will be determined to have an STP structure, based on an analysis of NBD values calculated for the unknown sequence as part of the model. The NBD values will be included in a feature matrix used to make the determination about the unknown sequence along with the other features calculated using the training data set.
[0052] In preferred embodiments, a Support Vector Machine (SVM) classifier/predictor implementation is used to elucidate STP toxins. The SVM may preferably be implemented using the e1071 library in R (2.15.1). Sensitivity, specificity, precision and accuracy using the SVM may preferably be determined after ten-fold cross validation. Initial gamma and cost may be set to 0.1 and 0.1, respectively, and examples have demonstrated a best output at 0.0587. Given 144 STP and 393 non-STP chains used in a preferred embodiment, 100 and 300 random samples may be chosen, respectively, for a training set over 200 iterations.
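The repeated random sampling described above (100 of the 144 STP chains and 300 of the 393 non-STP chains in each of 200 iterations) can be sketched in Python as a generator. Training and scoring the SVM on each draw is left to the caller; the function name, parameter names and fixed seed are hypothetical.

```python
import random


def subsample_iterations(positives, negatives, n_pos=100, n_neg=300,
                         iterations=200, seed=0):
    """Yield (positive_sample, negative_sample) pairs, one per iteration."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    for _ in range(iterations):
        # Sampling is without replacement within each iteration.
        yield rng.sample(positives, n_pos), rng.sample(negatives, n_neg)
```

Over 200 iterations this produces 20,000 positive and 60,000 negative evaluations, which agrees with the training-set totals in Table 7 (18959 + 1041 and 56537 + 3463, respectively).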
[0054] Again, the first portion of the PredSTP model in this embodiment is illustrated in
[0055] As shown in
$\mathrm{NBD}_1 = 100 / (|P_1 - \bar{x}_{P_1}| + \ldots)$
$\mathrm{NBD}_2 = 100 / (|P_2 - \bar{x}_{P_2}| + \ldots)$
$\mathrm{NBD}_3 = 100 / (|P_3 - \bar{x}_{P_3}| + \ldots)$
[0056] The SVM implementation will consider the three calculated NBD values for all of the peptides in the training set and will determine criteria by which an unknown sequence will be determined to have an STP or compact NTP structure, based on an analysis of NBD values calculated for the unknown sequence as part of the model. The NBD values will be included in a feature matrix used to make the determination about the unknown sequence along with the other features calculated using the training data set.
[0060] The present system and method are particularly useful for searching genomes to locate potential STP and NTP structures within actual published sequences, but it should be noted that the system and method are equally applicable to analyzing synthetic or artificial sequences that may be created from scratch specifically to generate STP or NTP structures. Thus the PredSTP model is also directly applicable to the de novo in silico design of specialized proteins such as antimicrobial peptides (AMPs) and insecticides. Similarly, the algorithm is useful to ensure that, in the use of an STP (or NTP) as scaffold for drug design, any drugs built as variants of the scaffold would indeed stay within the structural boundaries of a canonical STP (or NTP).
EXAMPLES
[0061] To evaluate the PredSTP model, independent test sequences were collected and considered. Table 2 below shows the eight independent sets of sequences that were collected to verify the robustness of the model.
TABLE 2

| Independent test sample | Query parameters (PDB^a) | Number of proteins | Number of chains |
|---|---|---|---|
| Small protein 92 | SCOP: Small Proteins; Experimental Method: X-RAY; Resolution: 1.499 or less | 92 | 163 |
| Only Eukaryote | TAXONOMY: Eukaryota | 45751 | 102748 |
| Only Bacteria | TAXONOMY: Bacteria (eubacteria) | 31664 | 80664 |
| Only Archaea | TAXONOMY: Archaea | 3127 | 8366 |
| Only Virus | TAXONOMY: Viruses | 4629 | 18642 |
| Unassigned | TAXONOMY: Unassigned | 479 | 980 |
| Others | TAXONOMY: Other sequences or unclassified sequences | 457 | 1418 |
| Recently deposited proteins solved by NMR in PDB | Experimental Method: solution NMR (July 2012 to Mar. 25 2014) | 657 | 751 |

^a PDB date August 2013 unless otherwise noted. Protein chain types only.
[0062] Among these were sets classified according to Protein Data Bank (PDB, July 2013) criteria as Eukaryote, Bacteria, Archaea, Virus and Unassigned. In addition, a set of proteins whose structures were recently solved by NMR and deposited in PDB (Jul. 4, 2012 to Mar. 25, 2014) (“NewNMR751”) and the Structural Classification of Proteins (SCOP) PDB subset (“Smallprotein163”) were used. Small protein sequences were retrieved with the following parameters: (a) resolution <1.5 Å, (b) protein chain but not DNA/RNA/Hybrid, and (c) limited to small disulfide-rich proteins that are similar in size, number of disulfide bonds, cystine number and cysteine arrangement in their primary structure. The result included STPs, rubredoxins, BPTI-like, snake toxin-like, crambin-like, insulin-like, and high potential iron proteins, among others.
Example 1. Predicting STP Sequences
[0063] STP sequences were predicted from certain of the test sets shown in Table 2 using feature set 6. Due to the limited throughput of the Knoter1D interface, only the predictions made using the “NewNMR751” and “Smallprotein163” sets defined above were compared against Knoter1D predictions (knottin.cbs.cnrs.fr/Tools_1D.php) and validated with Jmol by analyzing the disulfide connectivity using the corresponding PDB files. Results from only the eukaryotic test sets were filtered to remove sequences with ≥30% chain identity and compared against Jmol analysis. Chains exhibiting canonical STP connectivity (C1-C4, C2-C5, C3-C6) were initially considered as true positives. True positives were further cross-matched with their PDB annotations to make the final confirmation.
[0064] The Smallprotein163 data subset from PDB was analyzed to determine potential automated STP classification. The median residue number of the chains in the Smallprotein163 subset is 54, which is similar to the number of residues in STP chains. In addition, 94 out of the 163 chains contain at least 6 cysteines in their primary sequences. From this subset, PredSTP identified 21 of the 163 potential chains as STP-containing. These putative STP structures were verified by examining their disulfide bonding patterns in Jmol. Of the 21 chains identified by PredSTP, 14 were confirmed as true positives, as shown in Table 3 below.
TABLE 3

| Total PredSTP positive chains | True positive | Knoter 1D positive |
|---|---|---|
| 21 | 14/21 | 1/21 |
[0065] An analysis of the 142 negative STP chains predicted by PredSTP demonstrated only one false negative. The sensitivity, specificity, precision and accuracy for this particular dataset were 93.33%, 99.29%, 66.66% and 95.09%, respectively, as shown in Table 8 below. Table 4 below shows the list and description of the 21 proteins in the “Smallprotein163” subset positively predicted using PredSTP.
TABLE 4

| PDB ID^1 | Domain stabilized by tri-disulfide bonds | Disulfide connectivity | Knoter1D | Function/Class |
|---|---|---|---|---|
| *1AHO | Yes | (C1-C4, C2-C5, C3-C6) | No | Scorpion Neurotoxin |
| 1BX7 | Yes | Array is not compact or absent | No | Serine Protease Inhibitor |
| 1BX8 | Yes | Array is not compact or absent | No | Serine Protease Inhibitor |
| *1DJT: A | Yes | (C1-C4, C2-C5, C3-C6) | No | Alpha-like Neurotoxin |
| *1DJT: B | Yes | (C1-C4, C2-C5, C3-C6) | No | Alpha-like Neurotoxin |
| *1KV0: A | Yes | (C1-C4, C2-C5, C3-C6) | No | Alpha-like Toxin |
| *1KV0: A | Yes | (C1-C4, C2-C5, C3-C6) | No | Alpha-like Toxin |
| *1LU0: A | Yes | (C1-C4, C2-C5, C3-C6) | Yes | Hydrolase Inhibitor |
| *1LU0: B | Yes | (C1-C4, C2-C5, C3-C6) | Yes | Hydrolase Inhibitor |
| *1NPI | Yes | (C1-C4, C2-C5, C3-C6) | No | Neurotoxin |
| 1P9G | Yes | C1-C4, C2-C6, C3-C5 | No | Antifungal Protein |
| *1PTX | Yes | (C1-C4, C2-C5, C3-C6) | No | Scorpion Toxin |
| 1R0R | Yes | Array is not compact or absent | No | Serine Protease |
| *1SEG | Yes | (C1-C4, C2-C5, C3-C6) | No | Scorpion Alpha Toxin |
| 1SGP | No | Array is not compact or absent | No | Serine Protease/Inhibitor |
| *1SN4 | Yes | (C1-C4, C2-C5, C3-C6) | No | Scorpion Neurotoxin |
| *1T7E | Yes | (C1-C4, C2-C5, C3-C6) | No | Alpha-Like Neurotoxin |
| *2ASC | Yes | (C1-C4, C2-C5, C3-C6) | No | Scorpion Toxin |
| 2GKR | Yes | Array is not compact or absent | No | Hydrolase Inhibitor |
| *2SN3 | Yes | (C1-C4, C2-C5, C3-C6) | No | Scorpion Neurotoxin |
| 2UUY | Yes | C1-C3, C2-C6, C4-C5 | No | Tryptase Inhibitor |

^1 * = True positives
[0066] PredSTP was also tested against protein sequences with less than 90% sequence identity and recently solved (Jul. 4, 2012 to Mar. 25, 2014) by NMR. This set of 751 amino acid chains is denoted as NewNMR751 and has a median number of 82 residues with 118 chains containing more than six cysteines. The model detected 23 chains from 23 different proteins. Analyzing the disulfide connectivity of the positive hits by Jmol, 21 chains were confirmed as true positive. Based on the number of the predicted outcomes, the sensitivity, specificity, precision and accuracy for this particular dataset were 91.30%, 99.72%, 91.30% and 99.46%, respectively, as shown in Table 8 below.
[0067] The true positive chains were further classified into 9 ICKs, 5 cyclotides and 7 nonknotted STPs. PDB identifications and functions for positive predictions are shown in Table 5 below.
TABLE 5

| PDB id | Functional classification | Disulfide connectivity in the putative motif | True positive | Predicted by Knoter1D |
|---|---|---|---|---|
| *2LIX | Potassium channel toxin | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2LJ7 | Antimicrobial Peptide | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2LJS | Cyclotide | C1-C4, C2-C5, C3-C6 | Yes | Yes |
| *2LL1 | Spider toxin | C1-C4, C2-C5, C3-C6 | Yes | Yes |
| *2LN4 | Antimicrobial Peptide | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2LT8 | Antimicrobial Peptide | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2LU9 | Potassium channel toxin | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2LUR | Cyclotide | C1-C4, C2-C5, C3-C6 | Yes | Yes |
| *2LY5 | Defensin-like | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2LZX | New ICK toxin from sponge | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2M2Q | New ICK toxin from bitter melon | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2M2R | New ICK toxin from bitter melon | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2M36 | Spider toxin | C1-C4, C2-C5, C3-C6 | Yes | Yes |
| 2M3H | Apoptotic protein | **Array is not compact or absent | No | No |
| *2M3J | New ICK toxin from sponge | C1-C4, C2-C5, C3-C6 | Yes | No |
| *2M4Z | Spider toxin | C1-C4, C2-C5, C3-C6 | Yes | Yes |
| *2M86 | Cyclotide | C1-C4, C2-C5, C3-C6 | Yes | Yes |
| *2M9O | Cyclotide | C1-C4, C2-C5, C3-C6 | Yes | Yes |
| 2MD7 | Transcription | **Array is not compact or absent | No | No |
| *2MH1 | Cyclotide | C1-C4, C2-C5, C3-C6 | No | Yes |
| *4B2U | New ICK toxin sicarius spiders | C1-C4, C2-C5, C3-C6 | Yes | No |
| *4B2V | New ICK toxin sicarius spiders | C1-C4, C2-C5, C3-C6 | Yes | No |
| *4BMF | Hydrolase | C1-C4, C2-C5, C3-C6 | Yes | No |
[0068] This set was also analyzed by PSI BLAST (Altschul et al., 1997) and Knoter1D. For PSI BLAST, the BLAST suite (blast-2.2.29+) was installed on a local machine along with the appropriate dataset. The dataset was the NewNMR751 dataset. The selected threshold e-values for PSI BLAST were 0.01, 0.1 and 0.5. The number of iterations for PSI BLAST was 5. All other parameters were set as default.
[0069] Knoter1D detected 5 cyclotides, 3 of the 9 ICKs and none of the nonknotted STPs. PSI BLAST (e-value 0.01) detected 12 chains comprising 1 ICK, 5 cyclotides, 5 nonknotted STPs and 1 false positive. PSI BLAST (e-value 0.1) detected 21 chains comprising 5 ICKs, 5 cyclotides, 7 nonknotted STPs and 4 false positives. PSI BLAST (e-value 0.5) detected 52 chains comprising 5 ICKs, 5 cyclotides, 7 nonknotted STPs and 35 false positives.
[0070] Table 6 below shows a comparison of the number of hits detected by the different tested methods in the NewNMR751 test set. Sensitivity for PredSTP and PSI BLAST was calculated based on the total experimentally positive STPs (23 chains) in the NewNMR751 subset from PDB, while sensitivity for Knoter1D was calculated only for Knottins (knotted STPs).
TABLE 6

| Method | Positive hits | True positive hits | False positive hits | Calculated sensitivity (%) for STPs | Calculated precision (%) for STPs |
|---|---|---|---|---|---|
| PredSTP | 23 | 21 | 2 | 91.30 | 91.30 |
| PSI BLAST with e-value 0.01 | 13 | 12 | 1 | 52.17 | 92.30 |
| PSI BLAST with e-value 0.1 | 21 | 17 | 4 | 73.90 | 80.95 |
| PSI BLAST with e-value 0.5 | 52 | 17 | 35 | 73.90 | 32.69 |
| Knoter1D | 8 | 8 | 0 | 57.14 | 100 |
[0071] As shown in
[0072] The confusion matrices generated by PredSTP for the training set and for the Smallprotein163 and NewNMR751 subsets from PDB are shown below in Table 7.
TABLE 7

| Source of data | True positive | True negative | False positive | False negative |
|---|---|---|---|---|
| Training set over 200 iterations | 18959 | 56537 | 3463 | 1041 |
| Smallprotein163 | 14 | 141 | 7 | 1 |
| NewNMR751 | 21 | 726 | 2 | 2 |
[0073] Table 8 below compares the evaluation metrics generated by PredSTP for the training set and for the Smallprotein163 and NewNMR751 subsets from PDB.
TABLE 8

| Source of data | Sensitivity | Specificity | Precision | Accuracy |
|---|---|---|---|---|
| Training set over 200 iterations | 94.86 | 94.11 | 84.31 | 94.30 |
| Smallprotein163 | 93.33 | 99.29 | 66.66 | 95.09 |
| NewNMR751 | 91.30 | 99.72 | 91.30 | 99.46 |
[0074] As shown above in Table 8, PredSTP showed a better accuracy (95.09%) for Smallprotein163 than it did for the training set (94.30%), while the precision was comparatively low (66.66%). The only STP not detected (PDB id 2C4B) was a heterogeneous fusion protein of an STP and a catalytically inactive variant of RNase barnase. On the other hand, a test of performance of PredSTP on the NewNMR751 subset showed an excellent accuracy (99.46%) with a better precision (91.30%) than it showed on the training set. These results indicate that PredSTP retained its performance when distinguishing STPs from out-of-sample cysteine-rich small proteins.
Example 2. Broader Prediction of STP Sequences
[0075] After testing the performance of PredSTP against chains from the “Smallprotein163” and “NewNMR751” subsets, which consist of sequences of similar size to the training set, PredSTP was tested against a set based on diverse taxonomy. “Eukaryota”, “Bacteria”, “Viruses”, “Archaea” and “Unassigned” subsets of proteins (see Table 2) were analyzed from the PDB. A higher proportion of STPs in eukaryotes was anticipated with respect to the total number of cysteine chains with a maximum of 75 residues and a minimum of six cysteines. There is a known paucity of disulfide bonding in bacteria and archaea compared to eukaryotes. The threshold of 75 was chosen because it is well below the length of the longest chain (86 residues long) detected as STP by PredSTP among taxonomy subsets. Table 9 below shows the discovery of STPs across major domains using the PDB protein sequence data and PredSTP.
TABLE 9

| PDB subset | Total # of proteins analyzed | Total # of chains | Positive chains predicted by PredSTP | Number of proteins containing positive chains | Percentage of positive chains |
|---|---|---|---|---|---|
| Eukaryotes | 45751 | 102748 | 636 | 139^a | 0.61 |
| Eubacteria | 31664 | 80664 | 3 | 2 | 0.003 |
| Archaea | 3127 | 8366 | 0 | 0 | 0 |
| Viruses | 4629 | 18642 | 4 | 3 | 0.02 |
| Unassigned | 479 | 980 | 10 | 10 | 1.02 |

^a For eukaryotes, 139 chains were obtained after screening 636 chains and removing those with greater than or equal to 30% sequence identity.
[0076] The percentage of positive chains in “Eukaryote” (0.61) is higher than the percentage of predicted positive chains in the other three major superkingdoms. In “Eukaryotes”, 636 chains were predicted as STP positive. This number was reduced to 139 chains when chains sharing >30% sequence similarity were removed. The first 100 chains (based on PDB id) were manually cross-matched with Jmol analysis to determine true positives, which resulted in an 82% precision rate. In the “Eubacteria”, “Viruses” and “Unassigned” subsets, the precisions were 50%, 33.33% and 90%, respectively, as shown in Table 10 below.
TABLE 10

| PDB subset | PredSTP positive hits | True (structurally) positives | Percent of true positives (precision) |
|---|---|---|---|
| Eukaryotes | 139 | 82 (100)^a | 82 |
| Bacteria | 2 | 1 | 50 |
| Archaea | 0 | 0 | NA |
| Viruses | 3 | 1 | 33 |
| Unassigned | 10 | 9 | 90 |
| Total | 115^a | 93 | 80.86 |

^a For eukaryotes, 100 of the 139 proteins were analyzed in Jmol to find true positives.
[0077] In the “Archaea” subset, PredSTP did not predict any potential STP toxins, so no precision could be calculated. In total, 115 positive hits were analyzed from the “Taxonomy” subset and 93 chains were found to be true positives, for an overall precision of 80.86%. Individual precision rates for bacteria and viruses were low; this is potentially an artifact of their small sample sizes. In addition, some bacteria may contain iron-sulfur like transport proteins that mimic STPs by primary structure but are functionally distinct. The number of protein chains containing a minimum of six cysteines and consisting of a maximum of 75 residues was also calculated for the same taxonomy subsets from PDB, and the percentages of predicted STPs were 30.08, 6.66, 0, 14.81 and 47.61 for eukaryotes, bacteria, archaea, virus and unassigned, respectively, as shown in Table 11 below.
TABLE 11

| PDB subset | Total # of chains | PredSTP | Type 1 chain | Type 2 chain | Percent of predicted STPs in type 1 chains | Percent of predicted STPs in type 2 chains |
|---|---|---|---|---|---|---|
| Eukaryotes | 102748 | 636 | 2114 | 32348 | 30.08 | 1.96 |
| Bacteria | 90664 | 3 | 45 | 9294 | 6.66 | 0.03 |
| Archaea | 8366 | 0 | 6 | 663 | 0.00 | 0 |
| Virus | 18642 | 4 | 27 | 3477 | 14.81 | 0.11 |
| Unassigned | 980 | 10 | 21 | 43 | 47.61 | 23.25 |
[0078] After testing protein chains from different organismal taxonomy subsets in PDB, it was observed that only 6.66% and 0% of chains possessing a minimum of six cysteines and a maximum of 75 residues were predicted as STPs in bacteria and archaea, respectively, as shown in Table 11. In contrast, 30% of the small cysteine-containing chains were predicted as STPs in eukaryotes.
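The two percentage columns of Table 11 appear to be the PredSTP hit count divided by the number of type 1 and type 2 chains, respectively. The following sketch, offered only as a reading aid, recomputes them under that assumption and with truncation to two decimals. The text identifies type 1 chains as those with at least six cysteines and at most 75 residues; the definition of type 2 chains is not restated here, so the code treats it as an opaque count. TABLE_11 and percent_truncated are hypothetical names.

```python
# Counts transcribed from Table 11:
# subset -> (PredSTP hits, type 1 chains, type 2 chains).
TABLE_11 = {
    "Eukaryotes": (636, 2114, 32348),
    "Bacteria": (3, 45, 9294),
    "Archaea": (0, 6, 663),
    "Virus": (4, 27, 3477),
    "Unassigned": (10, 21, 43),
}


def percent_truncated(part, whole):
    """Return 100 * part / whole, truncated (not rounded) to two decimals."""
    return (part * 10000 // whole) / 100


percentages = {
    name: (percent_truncated(hits, type1), percent_truncated(hits, type2))
    for name, (hits, type1, type2) in TABLE_11.items()
}

for name, (pct_type1, pct_type2) in percentages.items():
    print(f"{name}: {pct_type1:.2f}% of type 1, {pct_type2:.2f}% of type 2")
```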
Example 3. Predicting STP and NTP Sequences
[0079] Methods used for the prediction of compact NTP sequences in addition to STP sequences were similar to those set forth in Examples 1-2 above. For the training data set of known STP sequences, a total of 108 sequences were retained from the knottin database set and 36 sequences from the PDB set, giving 144 canonical STPs. For the negative control sequence collection, sequences were collected from PDB using a criterion that was species agnostic and stipulated a solved X-ray crystallography structure with a diffraction resolution of less than 1.5 Å and a sequence of fewer than 150 residues. Sequences containing tri-disulfide bonds with an average distance less than 1 nm were omitted. The resulting set of candidate sequences was reduced by removing redundant sequences sharing ≥40% sequence identity using CD-HIT. The remaining 442 sequences were classified as non-STP sequences for the purposes of this example.
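The negative-control selection criteria described above can be sketched as a simple filter. This is a minimal illustration, not the procedure actually used: the PdbEntry record, its field names, and the function names are hypothetical, the "average distance" is assumed to be the mean distance across the three disulfide bonds of a chain, and the CD-HIT redundancy step is an external tool that is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PdbEntry:
    """Hypothetical record; field names are illustrative, not a PDB schema."""
    entry_id: str
    method: str                  # e.g. "X-RAY DIFFRACTION"
    resolution_angstrom: float
    sequence: str
    # ASSUMPTION: mean distance (nm) across the three disulfide bonds,
    # or None when the chain has no tri-disulfide arrangement.
    mean_disulfide_distance_nm: Optional[float] = None


def is_negative_control(entry: PdbEntry) -> bool:
    """Apply the stated criteria: X-ray structure, < 1.5 A, < 150 residues,
    and no tri-disulfide arrangement with an average distance below 1 nm."""
    if entry.method != "X-RAY DIFFRACTION":
        return False
    if entry.resolution_angstrom >= 1.5:
        return False
    if len(entry.sequence) >= 150:
        return False
    distance = entry.mean_disulfide_distance_nm
    if distance is not None and distance < 1.0:
        return False
    return True


def select_negative_controls(entries: List[PdbEntry]) -> List[PdbEntry]:
    """Keep candidate negatives; redundancy removal (CD-HIT) would follow."""
    return [entry for entry in entries if is_negative_control(entry)]


candidates = [
    PdbEntry("hypo1", "X-RAY DIFFRACTION", 1.2, "A" * 120),
    PdbEntry("hypo2", "X-RAY DIFFRACTION", 2.0, "A" * 120),        # resolution too coarse
    PdbEntry("hypo3", "SOLUTION NMR", 1.0, "A" * 120),             # not X-ray
    PdbEntry("hypo4", "X-RAY DIFFRACTION", 1.2, "A" * 150),        # too long
    PdbEntry("hypo5", "X-RAY DIFFRACTION", 1.2, "A" * 60, 0.6),    # compact tri-disulfide
    PdbEntry("hypo6", "X-RAY DIFFRACTION", 1.2, "A" * 60, 1.8),
]
kept = [entry.entry_id for entry in select_negative_controls(candidates)]
print(kept)
```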
[0080] Different feature sets were considered, similar to those set out in Table 1 above but eliminating those relating to Table 1's Feature Set 4, which included the presence of amino acids between C4-C5 and C5-C6 and the presence of double consecutive cysteines in the 4th and 5th loops of the sequence. Again, a Support Vector Machine (SVM) classifier/predictor implementation was used to elucidate STP toxins. The SVM was implemented using the e1071 library in R (2.15.1). Sensitivity, specificity, precision and accuracy were determined after ten-fold cross validation for the feature sets. Initial gamma and cost were set to 0.01 and 0.1, respectively, with the best output at 0.0087. Given 145 knottin and 445 nonknottin chains, 50 and 150 random samples were chosen, respectively, for a test set over 10 iterations. Feature sets were prioritized based on accuracy.
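The repeated random test-set selection described above can be sketched as follows. This is only an illustration of drawing 50 positive and 150 negative chains for each of 10 iterations; it does not reproduce the SVM training itself, which the text says was done with the e1071 library in R. The function name holdout_splits, the index-based representation of chains, and the fixed seed are assumptions made for the example.

```python
import random


def holdout_splits(n_pos, n_neg, test_pos, test_neg, iterations, seed=0):
    """Yield (train_pos, train_neg, test_pos, test_neg) index lists per iteration."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    for _ in range(iterations):
        # Sample test indices without replacement within each class.
        pos_test = set(rng.sample(range(n_pos), test_pos))
        neg_test = set(rng.sample(range(n_neg), test_neg))
        # Everything not drawn for testing remains available for training.
        pos_train = [i for i in range(n_pos) if i not in pos_test]
        neg_train = [i for i in range(n_neg) if i not in neg_test]
        yield pos_train, neg_train, sorted(pos_test), sorted(neg_test)


# Class sizes and sample sizes as stated in the text.
splits = list(holdout_splits(145, 445, 50, 150, iterations=10))
for pos_train, neg_train, pos_test, neg_test in splits[:2]:
    print(len(pos_train), len(neg_train), len(pos_test), len(neg_test))
```

Sampling each class separately keeps the positive-to-negative ratio of every test set fixed at 1:3, which is close to the ratio of the full collection.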
[0081] STP sequences were predicted from the test sets described previously (Table 1) using the feature set that included the Normalized Bonding Distance (NBD) between C1-C4, C2-C5 and C3-C6; the number of occurrences of each of the amino acids; the aggregate number of occurrences of hydrophobic (F,Y,L,I,A,M,C,W,V), hydrophilic (R,K,N,D,A,P), and neutral (G,H,S,T,Q) amino acids; the total peptide length; and the least loop length ratio. Ultimately, the model with this feature set was used as the basis for the remainder of the example because (i) it had slightly higher accuracy and (ii) it had higher feature dimensionality, therefore offering potentially higher discrimination.
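A minimal sketch of how such a feature vector might be computed from a single sequence is given below. Several details are assumptions made for illustration and are labeled in the code: the first six cysteines are taken as C1 to C6, NBD is taken to be the sequence separation of a bonded cysteine pair divided by the peptide length, and a loop is taken to be the residues between consecutive cysteines. The residue groups are copied from the text as written. The function name extract_features and the example sequence are hypothetical.

```python
# Residue groups copied verbatim from the text; note that alanine (A)
# appears in both the hydrophobic and the hydrophilic group there.
HYDROPHOBIC = set("FYLIAMCWV")
HYDROPHILIC = set("RKNDAP")
NEUTRAL = set("GHSTQ")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def extract_features(sequence):
    """Return a dict of sequence-derived features for one peptide."""
    seq = sequence.upper()
    length = len(seq)
    cys = [i for i, aa in enumerate(seq) if aa == "C"]
    if len(cys) < 6:
        raise ValueError("at least six cysteines are required")
    c = cys[:6]  # ASSUMPTION: the first six cysteines are taken as C1..C6

    # ASSUMPTION: NBD = sequence separation of a bonded pair / peptide length,
    # for the canonical pairs C1-C4, C2-C5 and C3-C6.
    nbd = [(c[j] - c[i]) / length for i, j in ((0, 3), (1, 4), (2, 5))]

    # ASSUMPTION: a loop is the run of residues between consecutive cysteines.
    loops = [c[k + 1] - c[k] - 1 for k in range(5)]

    features = {
        "nbd_c1_c4": nbd[0],
        "nbd_c2_c5": nbd[1],
        "nbd_c3_c6": nbd[2],
        "least_loop_ratio": min(loops) / length,
        "length": length,
        "hydrophobic": sum(aa in HYDROPHOBIC for aa in seq),
        "hydrophilic": sum(aa in HYDROPHILIC for aa in seq),
        "neutral": sum(aa in NEUTRAL for aa in seq),
    }
    # One count per standard amino acid.
    for aa in AMINO_ACIDS:
        features["count_" + aa] = seq.count(aa)
    return features


# Illustrative 28-residue sequence with six cysteines.
example = extract_features("GCPRILMRCKQDSDCLAGCVCGPNGFCG")
print(example["length"], example["nbd_c1_c4"], example["least_loop_ratio"])
```

Each sequence yields one row of this form; stacking the rows for all training sequences would give the feature matrix passed to the classifier.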
[0082] Due to the limited throughput of the Knoter1D interface, only the same “NewNMR751” and “SmallProtein163” subsets were used as in Examples 1-2 above. Results from only the eukaryotic test sets were filtered to remove sequences with ≥30% chain identity and compared against Jmol analysis. Chains with three disulfide bonds in close proximity and exhibiting canonical STP connectivity (C1-C4, C2-C5, C3-C6) were initially considered as true positives. True positives were further cross-matched with their PDB annotations to make the final confirmation.
[0083] The SmallProtein163 data subset from PDB was analyzed to determine potential automated knottin classification. From this subset, the PredSTP model was able to identify 43 of the 163 potential chains, representing 31 of the 92 proteins, as STP-containing. These putative STP structures were verified by examining their disulfide bonding patterns in Jmol. Of the 31 identified proteins, PDB classified 12 as STPs. An analysis of the 119 chains predicted as negative by the algorithm, representing 62 proteins, revealed no false negatives. Overall, 12 of 12 STPs were correctly identified from this diverse set of protein classes and families. The sensitivity, specificity, precision and accuracy for this particular dataset were 100%, 76.22%, 38.70% and 88.2%, respectively (Table 4). Among the 19 false positives, 11 contain a compact tri-disulfide fold (compact NTP category) possessing functional activity like that of STPs. Thus, the algorithm also successfully predicted compact NTP sequences. When the algorithm is used to predict both STP and compact NTP sequences, the precision rises to 74.19%.
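For reference, the four reported measures follow the standard confusion-matrix definitions, sketched below. Only the precision and sensitivity calculations are evaluated, using the protein-level counts given in the text (12 true positives, 19 false positives, no false negatives, and 11 of the false positives reclassified as correct when compact NTPs are counted). The function names are hypothetical, and the results are rounded here, so the first precision value differs in the last digit from the 38.70% reported above, which is consistent with truncation.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Fraction of predicted positives that are actual positives."""
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# SmallProtein163, protein level: 31 predicted positives, 12 confirmed STPs.
tp, fp, fn = 12, 19, 0
stp_only_precision = round(100 * precision(tp, fp), 2)
stp_sensitivity = round(100 * sensitivity(tp, fn), 2)

# Counting the 11 compact NTP hits as correct predictions as well.
ntp_hits = 11
stp_or_ntp_precision = round(100 * precision(tp + ntp_hits, fp - ntp_hits), 2)

print(stp_only_precision, stp_sensitivity, stp_or_ntp_precision)
```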
[0084] The NewNMR751 subset was also tested by PredSTP. The model detected 41 chains from 41 different proteins. Analysis of the disulfide connectivity of the positive hits in Jmol confirmed 23 chains as true positives. Based on the number of true positives among all positive hits, the calculated precision was 56.09%. The true positive chains were further classified into 9 ICKs, 5 cyclotides and 9 nonknotted STPs. This same set was also analyzed by PSI-BLAST and Knoter1D, which detected 20 and 8 positive hits, respectively. More specifically, Knoter1D detected all 5 cyclotides, 3 of the 9 ICKs and none of the nonknotted STPs. PSI-BLAST was applied to find STP sequences with lower similarity and detected 4 ICKs, 5 cyclotides and 6 nonknotted STPs, along with 5 false positives. Functionally, 10 of the 41 predicted chains were antimicrobial peptides and 18 were other toxins. The remainder demonstrated diverse functions, often associated with STP toxins. An analysis of the false positives again found that 12 of 17 false positives possessed the NTP tri-disulfide array and exhibited the same typical functional properties as STPs. Using the algorithm to predict both STPs and compact NTPs increases the precision to 87.50%.
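The per-class comparison above can be tabulated with a short script. This is a reading aid only. It assumes that the ICKs, cyclotides and nonknotted STPs reported for Knoter1D and PSI-BLAST are drawn from the same 23 chains confirmed for PredSTP, which the text implies but does not state outright. The dictionary and function names are hypothetical.

```python
# Jmol-confirmed PredSTP true positives in NewNMR751, by class.
CONFIRMED = {"ICK": 9, "cyclotide": 5, "nonknotted STP": 9}

# True hits per class reported for the two comparison tools.
DETECTED = {
    "Knoter1D": {"ICK": 3, "cyclotide": 5, "nonknotted STP": 0},
    "PSI-BLAST": {"ICK": 4, "cyclotide": 5, "nonknotted STP": 6},
}


def recovered(tool):
    """Return (true hits found by the tool, confirmed PredSTP true positives)."""
    found = sum(DETECTED[tool].values())
    return found, sum(CONFIRMED.values())


for tool in DETECTED:
    found, total = recovered(tool)
    per_class = ", ".join(
        f"{DETECTED[tool][cls]}/{CONFIRMED[cls]} {cls}" for cls in CONFIRMED
    )
    print(f"{tool}: {found}/{total} overall ({per_class})")

# PredSTP precision on this subset: 23 true positives among 41 hits,
# truncated to two decimals.
predstp_precision = (sum(CONFIRMED.values()) * 10000 // 41) / 100
print(predstp_precision)
```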
REFERENCES
[0085] The following publications are hereby incorporated by reference.
[0086] Matsumura M, Signor G, Matthews B W. Substantial increase of protein stability by multiple disulphide bonds. Nature. 1989; 342:291-3.
[0087] Gracy J, Le-Nguyen D, Gelly J-C, Kaas Q, Heitz A, Chiche L. KNOTTIN: the knottin or inhibitor cystine knot scaffold in 2007. Nucleic Acids Res. 2008; 36(Database issue):D314-319.
[0088] Conibear A C, Rosengren K J, Daly N L, Henriques S T, Craik D J. The cyclic cystine ladder in θ-defensins is important for structure and stability, but not antibacterial activity. J Biol Chem. 2013; 288:10830-40.
[0089] Gelly J-C, Gracy J, Kaas Q, Le-Nguyen D, Heitz A, Chiche L. The KNOTTIN website and database: a new information system dedicated to the knottin scaffold. Nucleic Acids Res. 2004; 32(Database issue):D156-159.
[0090] Kedarisetti P, Mizianty M J, Kaas Q, Craik D J, Kurgan L. Prediction and characterization of cyclic proteins from sequences in three domains of life. Biochim Biophys Acta. 2014; 1844(1 Pt B):181-90.
[0091] Mulvenna J P, Wang C, Craik D J. CyBase: a database of cyclic protein sequence and structure. Nucleic Acids Res. 2006; 34(Database issue):D192-194.
[0092] Wang C K L, Kaas Q, Chiche L, Craik D J. CyBase: a database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Res. 2008; 36(Database issue):D206-210.
[0093] Kaas Q, Yu R, Jin A-H, Dutertre S, Craik D J. ConoServer: updated content, knowledge, and discovery tools in the conopeptide database. Nucleic Acids Res. 2012; 40(Database issue):D325-330.
[0094] Herzig V, Wood D L A, Newell F, Chaumeil P-A, Kaas Q, Binford G J, Nicholson G M, Gorse D, King G F. ArachnoServer 2.0, an updated online resource for spider toxin sequences and structures. Nucleic Acids Res. 2011; 39(Database issue):D653-657.
[0095] Muggleton S, King R D, Sternberg M J. Protein secondary structure prediction using logic-based machine learning. Protein Eng. 1992; 5:647-57.
[0096] Bock J R, Gough D A. Predicting protein-protein interactions from primary structure. Bioinformatics. 2001; 17:455-60.
[0097] Cai C Z, Han L Y, Ji Z L, Chen X, Chen Y Z. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003; 31:3692-7.
[0098] Hua S, Sun Z. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol. 2001; 308:397-407.
[0099] Cai Y D, Liu X J, Xu X, Zhou G P. Support vector machines for predicting protein structural class. BMC Bioinformatics. 2001; 2:3.
[0100] Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001; 17:721-8.
[0101] Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010; 26:680-2.
[0102] Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006; 22:1658-9.
[0103] Altschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, Miller W, Lipman D J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25:3389-402.