Method for verifying the primary structure of protein

11635437 · 2023-04-25

Abstract

Disclosed herein is a method for verifying the primary structure of a protein through comparative analyses between ion clusters observed in mass spectra and a series of simulated ion clusters deduced from its putative chemical formula. The method comprises the steps of: preparing a protein sample for mass spectrometric analyses; collecting mass spectra of the protein sample; obtaining master ion cluster from a plurality of ion clusters in the mass spectra; producing a series of simulated ion clusters according to the chemical formula of the protein; finding the best fit for the master ion cluster among the series of simulated ion clusters; and verifying if said best-fit simulated ion cluster corresponds to the chemical formula of the protein.

Claims

1. A method for verifying the primary structure of a protein comprising: obtaining a mass spectrum of a full-length protein; identifying from the mass spectrum a plurality of ion clusters with a mass corresponding to the full-length protein but with different charge states; calculating a master ion cluster from the plurality of ion clusters; and comparing the master ion cluster with a series of simulated ion clusters generated based on the chemical formula of the full-length protein with or without a modification, to find a best fitted simulated ion cluster; wherein the master ion cluster is calculated by a process comprising: summing up the intensities of the most abundant peak at (m/z).sub.ma of each of the plurality of ion clusters, to obtain a starting summation; summing up the intensities of the next larger isotopic peak p(+1), with an m/z larger than the (m/z).sub.ma according to an average isotope spacing, of each most abundant peak, to obtain a first right summation; and summing up the intensities of the next smaller isotopic peak p(−1), with an m/z smaller than the (m/z).sub.ma according to the average isotope spacing, of each most abundant peak, to obtain a first left summation.

2. The method of claim 1, wherein a plurality of right summations of a respective plurality of isotopic peaks p(+l) are obtained, a plurality of left summations of a respective plurality of isotopic peaks p(−m) are obtained, and the starting summation, the plurality of left summations and the plurality of right summations are normalized by dividing by the largest summation among all the summations, wherein l and m each is a positive integer, the isotopic peak p(+l) is the next larger isotopic peak relative to the isotopic peak p(+(l−1)) according to the average isotope spacing, and the isotopic peak p(−m) is the next smaller isotopic peak relative to the isotopic peak p(−(m−1)) according to the average isotope spacing.

3. The method of claim 1, wherein each of the intensities is normalized by dividing by the charge state of the corresponding isotopic peak before being summed up.

4. The method of claim 1, wherein the average isotope spacing is about 1 Dalton.

5. The method of claim 4, wherein the average isotope spacing is 1.00235 Dalton.

6. The method of claim 1, wherein the mass spectrum is obtained through a high-resolution mass spectrometry.

7. The method of claim 1, wherein the master ion cluster and the series of simulated ion clusters are compared by a method selected from the group consisting of chi-square test, Pearson's chi-square test, chi-square test with Yates's correction, Fisher's exact test, McNemar's test, and Cochran's Q test.

8. The method of claim 1, wherein each of the series of simulated ion clusters is generated by a process comprising: given a chemical formula C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z of the full-length protein with or without a modification, combining putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z to obtain the simulated ion cluster of the full-length protein with or without the modification, wherein the putative ion cluster of C.sub.v is represented by the intensities I.sub.n,v=A.sub.12.sub.C·I.sub.n,v−1+A.sub.13.sub.C·I.sub.n−1,v−1, A.sub.12.sub.C and A.sub.13.sub.C being the natural abundances of .sup.12C and .sup.13C respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of H.sub.w is represented by the intensities I.sub.n,w=A.sub.1.sub.H·I.sub.n,w−1+A.sub.2.sub.H·I.sub.n−1,w−1, A.sub.1.sub.H and A.sub.2.sub.H being the natural abundances of .sup.1H and .sup.2H respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of O.sub.x is represented by the intensities I.sub.n,x=A.sub.16.sub.O·I.sub.n,x−1+A.sub.17.sub.O·I.sub.n−1,x−1+A.sub.18.sub.O·I.sub.n−2,x−1, A.sub.16.sub.O, A.sub.17.sub.O and A.sub.18.sub.O being the natural abundances of .sup.16O, .sup.17O and .sup.18O respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of N.sub.y is represented by the intensities I.sub.n,y=A.sub.14.sub.N·I.sub.n,y−1+A.sub.15.sub.N·I.sub.n−1,y−1, A.sub.14.sub.N and A.sub.15.sub.N being the natural abundances of .sup.14N and .sup.15N respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; and the putative ion cluster of S.sub.z is represented by the intensities
I.sub.n,z=A.sub.32.sub.S·I.sub.n,z−1+A.sub.33.sub.S·I.sub.n−1,z−1+A.sub.34.sub.S·I.sub.n−2,z−1+A.sub.36.sub.S·I.sub.n−4,z−1, A.sub.32.sub.S, A.sub.33.sub.S, A.sub.34.sub.S and A.sub.36.sub.S being the natural abundances of .sup.32S, .sup.33S, .sup.34S and .sup.36S respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak.

9. The method of claim 8, wherein the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined one by one.

10. The method of claim 8, wherein the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined according to the positions of the peaks.

11. The method of claim 9, wherein the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined by a process comprising: (i) calculating the intensities I.sub.M,CH, each of which equals Σ.sub.i=0.sup.M I.sub.i,v×I.sub.(M−i),w; (ii) calculating the intensities I.sub.M,CHO, each of which equals Σ.sub.i=0.sup.M I.sub.i,CH×I.sub.(M−i),x; (iii) calculating the intensities I.sub.M,CHON, each of which equals Σ.sub.i=0.sup.M I.sub.i,CHO×I.sub.(M−i),y; (iv) calculating the intensities I.sub.M,CHONS, each of which equals Σ.sub.i=0.sup.M I.sub.i,CHON×I.sub.(M−i),z; wherein i is a non-negative integer and M is the number of putative isotopic peaks other than the putative monoisotopic peak; and wherein the intensities I.sub.M,CHONS represent a simulated ion cluster of the full-length protein with or without the modification.

12. A method according to claim 1, wherein the said series of simulated ion clusters correspond to the series of the chemical formulas that are produced by adding or removing several hydrogen atoms from the chemical formula of the said protein sample.

13. A method according to claim 12, wherein said series of simulated ion clusters comprises ion clusters, each of which is computationally generated by combination of multiple single-element ion clusters, each of which has the same number of atoms as in the chemical formula of the said ion cluster member.

14. A method according to claim 13, wherein each of said ion clusters results from the sequential pairwise combinations of single-element ion clusters based on the principle that isotopologues with the same position number in the said ion cluster member are integrated together in terms of the percentages in the ion cluster and weighted molecular masses.

15. A method according to claim 14, wherein said integration of percentages of isotopologues in the ion cluster is the summation of all percentages of all isotopologues with the same position number.

16. A method according to claim 14, wherein said molecular masses are the result of the equation:
(MM.sub.1×P.sub.1+MM.sub.2×P.sub.2)/(P.sub.1+P.sub.2) where MM.sub.1 and MM.sub.2 are the molecular masses and P.sub.1 and P.sub.2 are the percentages of isotopologues in the first and second ion clusters, respectively, before integration.

17. A method according to claim 7, wherein the i is the rounded integer of (MM.sub.I−MM.sub.MN) where MM.sub.I is the molecular mass of the said isotope I and MM.sub.MN is the monoisotopic mass of the said element.

18. A method according to claim 7, wherein the second (2nd) lightest isotopes, as i=2, are .sup.13C, .sup.2H, .sup.15N, .sup.17O, .sup.33S; the third (3rd) lightest isotopes, as i=3, are .sup.14C, .sup.3H, .sup.16N, .sup.18O, .sup.34S; the fourth (4th) lightest isotope, as i=4, is .sup.35S; and the fifth (5th) lightest isotope, as i=5, is .sup.36S.

19. A method according to claim 13, wherein the production of each single-element ion cluster is accomplished based on the principle that isotopologues with same position number in the said single-element ion cluster are integrated together in terms of the percentages in the ion cluster and weighted molecular masses.

20. A method according to claim 19, wherein the position number of each single-element isotopologue is equal to the result of the following equation:
Σ.sub.i=2.sup.5[(Σ.sub.iN)×(i−1)] where .sub.iN is the number of the ith lightest isotope of the said element included in the said single-element isotopologue.

21. A method according to claim 20, wherein the i is the rounded integer of (MM.sub.I-MM.sub.MN) where MM.sub.I is the molecular mass of the said isotope I and MM.sub.MN is the monoisotopic mass of the said element.

22. A method according to claim 13, wherein each of the said single-element ion clusters is directly taken from databases consisting of the simulated ion clusters for single-element compounds containing different numbers of atoms.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. In the drawings:

(2) FIG. 1 provides the flowchart for verification of protein primary structure with intact protein analysis.

(3) FIG. 2 shows the protein sequences of the tested therapeutic, PNGase F-treated erythropoietin.

(4) FIG. 3 shows the results of SDS-PAGE analysis of erythropoietin products with or without PNGase F treatment.

(5) FIG. 4 shows the overall charge state distribution of PNGase F-treated Eprex and Recormon in liquid chromatographic-mass spectrometric (LC-MS) analysis.

(6) FIG. 5 shows the workflow of IntegrateMS: to mine for ion clusters and to obtain normalized MS of target protein.

(7) FIG. 6 shows the ion clusters of de-N-glycosylated Eprex with an O-linked trisaccharide mined out from MS raw data by IntegrateMS and the subsequent integrated master ion cluster.

(8) FIG. 7 shows the ion clusters of de-N-glycosylated Eprex with an O-linked tetrasaccharide mined out from MS raw data by IntegrateMS and the subsequent integrated master ion cluster.

(9) FIG. 8 illustrates that simulated isotope distribution can be computed by intensity list-based cluster deduction by MacroCluster.

(10) FIG. 9 illustrates the computation of simulated ion cluster based on gradually combining single-element ion clusters by Merger algorithm.

(11) FIG. 10 illustrates the computation of isotope distribution for intensity list construction of elements, C, H and N, using dynamic programming.

(12) FIG. 11 illustrates the computation of isotope distribution for intensity list construction of element, O, using dynamic programming.

(13) FIG. 12 illustrates the computation of isotope distribution for intensity list construction of element, S, using dynamic programming.

(14) FIG. 13 illustrates the primary structure verification of de-N-glycosylated Eprex using CompareMS program.

(15) FIG. 14 illustrates the primary structure verification of de-N-glycosylated Recormon using CompareMS program.

(16) FIG. 15 shows protein sequence of the tested therapeutic, Humulin R.

(17) FIG. 16 illustrates the primary structure verification of protein therapeutic, Humulin R.

(18) FIG. 17 shows the protein sequence of the tested therapeutic, Saizen.

(19) FIG. 18 illustrates the primary structure verification of protein therapeutic, Saizen.

(20) FIG. 19 shows the element-specific intensity list for carbon;

(21) FIG. 20 shows the element-specific intensity list for hydrogen;

(22) FIG. 21 shows the element-specific intensity list for nitrogen;

(23) FIG. 22 shows the element-specific intensity list for oxygen; and

(24) FIG. 23 shows the element-specific intensity list for sulfur.

DESCRIPTION OF THE INVENTION

(25) Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the art to which this invention belongs.

(26) In one aspect, the present invention provides a method for verifying the primary structure of a protein through comparative analyses between ion clusters observed in mass spectra and a series of simulated ion clusters deduced from its putative chemical formula, the method comprising the steps of: preparing a protein sample for mass spectrometric analyses; collecting mass spectra of the protein sample; obtaining master ion cluster from a plurality of ion clusters in the mass spectra; producing a series of simulated ion clusters according to the chemical formula of the protein; finding the best fit for the master ion cluster among the series of simulated ion clusters; and verifying if the best-fit simulated ion cluster corresponds to the chemical formula of the protein sample.

(27) According to certain embodiments of the present invention, the protein sample is prepared by removing diverse types of modifications. According to certain embodiments of the present invention, a diverse type of modification is one that has more than five variations of combinations at one particular amino acid residue. According to certain embodiments of the present invention, the diverse type of modification is glycosylation at asparagine residues of proteins.

(28) According to certain embodiments of the present invention, the protein is a monoclonal antibody, a hormone, a growth factor, a fusion protein, a cytokine, a therapeutic enzyme, a blood factor, a recombinant vaccine, or an anti-coagulant.

(29) According to the present invention, the mass spectra are collected with any mass spectrometric instrument, including but not limited to matrix-assisted laser desorption ionization/time of flight (MALDI-TOF), surface enhanced laser desorption ionization/time of flight (SELDI-TOF), liquid chromatography-mass spectrometry (LC-MS), liquid chromatography tandem mass spectrometry (LC-MS-MS), and electrospray ionization mass spectrometry (ESI-MS).

(30) According to certain embodiments of the present invention, the master ion cluster is generated by using computer algorithms to locate and sum the plurality of ion clusters that arise from different charge states.

(31) According to certain embodiments of the present invention, the series of simulated ion clusters are generated according to the series of the chemical formulas that are produced by adding or removing several hydrogen atoms from the chemical formula of the said protein sample.
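As a small illustration of paragraph (31), the candidate formulas can be enumerated by shifting the hydrogen count of the putative formula. This is a sketch only: the function name, the dictionary representation, the `max_shift` parameter and the example formula are assumptions made here, since the text specifies only that "several" hydrogen atoms are added or removed.

```python
from typing import Dict, List

Formula = Dict[str, int]  # element symbol -> atom count (assumed representation)


def hydrogen_shifted_formulas(formula: Formula, max_shift: int = 3) -> List[Formula]:
    """Return copies of `formula` with -max_shift..+max_shift hydrogen atoms."""
    candidates = []
    for delta in range(-max_shift, max_shift + 1):
        hydrogens = formula.get("H", 0) + delta
        if hydrogens < 0:
            continue  # a formula cannot contain a negative number of atoms
        shifted = dict(formula)  # copy so the input formula is left untouched
        shifted["H"] = hydrogens
        candidates.append(shifted)
    return candidates


# Hypothetical formula, used only to exercise the function.
example = {"C": 10, "H": 20, "O": 5, "N": 3, "S": 1}
series = hydrogen_shifted_formulas(example, max_shift=2)
```

Each formula in `series` would then be passed to the cluster simulation, and the unshifted formula is expected to give the best fit when the putative primary structure is correct.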

(32) According to certain embodiments of the present invention, each simulated ion cluster is generated by sequential combinations of multiple single-element ion cluster simulations whose numbers of atoms are taken from the chemical formula of the simulated ion cluster.

(33) According to certain embodiments of the present invention, the simulated ion cluster with chemical formula C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z is generated by sequential combinations of five single-element ion cluster simulations for C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z, respectively.

(34) According to certain embodiments of the present invention, C.sub.v ion cluster simulation is represented by the percentages in the entire C.sub.v simulation P.sub.n,v=A.sub.12.sub.C·P.sub.n,v−1+A.sub.13.sub.C·P.sub.n−1,v−1, A.sub.12.sub.C and A.sub.13.sub.C being the natural abundances of .sup.12C and .sup.13C respectively, for the n-th putative isotopic peak in relation to the 0-th peak being the putative monoisotopic mass (.sup.12C.sub.v) peak; H.sub.w ion cluster simulation is represented by the percentages in the entire H.sub.w simulation P.sub.n,w=A.sub.1.sub.H·P.sub.n,w−1+A.sub.2.sub.H·P.sub.n−1,w−1, A.sub.1.sub.H and A.sub.2.sub.H being the natural abundances of .sup.1H and .sup.2H respectively, for the n-th putative isotopic peak in relation to the 0-th peak being the putative monoisotopic mass (.sup.1H.sub.w) peak; O.sub.x ion cluster simulation is represented by the percentages in the entire O.sub.x simulation P.sub.n,x=A.sub.16.sub.O·P.sub.n,x−1+A.sub.17.sub.O·P.sub.n−1,x−1+A.sub.18.sub.O·P.sub.n−2,x−1, A.sub.16.sub.O, A.sub.17.sub.O and A.sub.18.sub.O being the natural abundances of .sup.16O, .sup.17O and .sup.18O, respectively, for the n-th putative isotopic peak in relation to the 0-th peak being the putative monoisotopic mass (.sup.16O.sub.x) peak; N.sub.y ion cluster simulation is represented by the percentages in the entire N.sub.y simulation P.sub.n,y=A.sub.14.sub.N·P.sub.n,y−1+A.sub.15.sub.N·P.sub.n−1,y−1, A.sub.14.sub.N and A.sub.15.sub.N being the natural abundances of .sup.14N and .sup.15N respectively, for the n-th putative isotopic peak in relation to the 0-th peak being the putative monoisotopic mass (.sup.14N.sub.y) peak; and S.sub.z ion cluster simulation is represented by the percentages in the entire S.sub.z simulation P.sub.n,z=A.sub.32.sub.S·P.sub.n,z−1+A.sub.33.sub.S·P.sub.n−1,z−1+A.sub.34.sub.S·P.sub.n−2,z−1+A.sub.36.sub.S·P.sub.n−4,z−1, A.sub.32.sub.S, A.sub.33.sub.S, A.sub.34.sub.S and A.sub.36.sub.S being the natural abundances of .sup.32S, .sup.33S, .sup.34S and .sup.36S respectively, for the n-th putative isotopic peak in relation to the 0-th peak being the putative monoisotopic mass (.sup.32S.sub.z) peak.
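The recurrences of paragraph (34) build a single-element cluster one atom at a time, which lends itself to dynamic programming. The sketch below is one possible reading, not the implementation used in the invention: the function name, the table layout, the truncation to `n_peaks` positions, and the abundance values (approximate natural abundances entered here for illustration) are all assumptions.

```python
from typing import Dict, List

# Approximate natural abundances, keyed by nominal mass shift from the
# lightest isotope. Illustrative values assumed for this sketch.
ISOTOPES: Dict[str, Dict[int, float]] = {
    "C": {0: 0.9893, 1: 0.0107},
    "H": {0: 0.999885, 1: 0.000115},
    "N": {0: 0.99636, 1: 0.00364},
    "O": {0: 0.99757, 1: 0.00038, 2: 0.00205},
    "S": {0: 0.9499, 1: 0.0075, 2: 0.0425, 4: 0.0001},
}


def element_cluster(abundance_by_shift: Dict[int, float],
                    n_atoms: int, n_peaks: int) -> List[float]:
    """Percentages P[n] of the n-th peak for n_atoms atoms of one element.

    Implements P[n, k] = sum over isotopes of A * P[n - shift, k - 1],
    starting from zero atoms, where only the 0-th peak is populated.
    """
    cluster = [1.0] + [0.0] * (n_peaks - 1)  # k = 0 atoms (assumed base case)
    for _ in range(n_atoms):
        updated = [0.0] * n_peaks
        for n in range(n_peaks):
            for shift, abundance in abundance_by_shift.items():
                if n - shift >= 0:
                    updated[n] += abundance * cluster[n - shift]
        cluster = updated
    return cluster


c2 = element_cluster(ISOTOPES["C"], n_atoms=2, n_peaks=3)
s1 = element_cluster(ISOTOPES["S"], n_atoms=1, n_peaks=5)
```

Truncating to `n_peaks` positions does not change the retained values, because each peak depends only on peaks at the same or lower positions.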

(35) According to certain embodiments of the present invention, the single-element ion cluster simulations of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined by a process comprising: (i) calculating the percentages P.sub.M,CH of the M-th peaks in the C.sub.vH.sub.w simulation in relation to the 0-th peak being the putative monoisotopic mass (.sup.12C.sub.v.sup.1H.sub.w) peak, each of which equals Σ.sub.i=0.sup.M P.sub.i,v×P.sub.(M−i),w; (ii) calculating the percentages P.sub.M,CHO of the M-th peaks in the C.sub.vH.sub.wO.sub.x simulation in relation to the 0-th peak being the putative monoisotopic mass (.sup.12C.sub.v.sup.1H.sub.w.sup.16O.sub.x) peak, each of which equals Σ.sub.i=0.sup.M P.sub.i,CH×P.sub.(M−i),x; (iii) calculating the percentages P.sub.M,CHON of the M-th peaks in the C.sub.vH.sub.wO.sub.xN.sub.y simulation in relation to the 0-th peak being the putative monoisotopic mass (.sup.12C.sub.v.sup.1H.sub.w.sup.16O.sub.x.sup.14N.sub.y) peak, each of which equals Σ.sub.i=0.sup.M P.sub.i,CHO×P.sub.(M−i),y; (iv) calculating the percentages P.sub.M,CHONS of the M-th peaks in the C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z simulation in relation to the 0-th peak being the putative monoisotopic mass (.sup.12C.sub.v.sup.1H.sub.w.sup.16O.sub.x.sup.14N.sub.y.sup.32S.sub.z) peak, each of which equals Σ.sub.i=0.sup.M P.sub.i,CHON×P.sub.(M−i),z; wherein i is a non-negative integer. However, a method of the present invention is not limited to such order of combination.

(36) According to certain embodiments of the present invention, each of the single-element ion cluster simulations is taken directly from databases consisting of the ion cluster simulations corresponding to single-element compounds containing different numbers of atoms.

(37) According to certain embodiments of the present invention, the best fit is discovered by finding the member of the series of simulated ion clusters with the smallest difference score in comparison with the master ion cluster.

(38) According to certain embodiments of the present invention, the difference score of each simulated ion cluster is assigned by a method such as, but not limited to, a chi-square test, Pearson's chi-square test, a chi-square test with Yates's correction, Fisher's exact test, McNemar's test or Cochran's Q test.
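The best-fit search of paragraphs (37) and (38) can be sketched with a Pearson-style chi-square sum as the difference score. This is only one of the listed options, and the sketch assumes that the master and simulated clusters are already aligned peak by peak and scaled the same way. The function names, labels and toy values are hypothetical.

```python
from typing import Dict, List, Tuple


def chi_square_score(observed: List[float], expected: List[float],
                     eps: float = 1e-12) -> float:
    """Sum of (observed - expected)^2 / expected over aligned peaks."""
    if len(observed) != len(expected):
        raise ValueError("clusters must have the same number of peaks")
    # Peaks with a (near-)zero expected value are skipped to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > eps)


def best_fit(master: List[float],
             simulated: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the label and score of the simulated cluster closest to master."""
    scores = {label: chi_square_score(master, cluster)
              for label, cluster in simulated.items()}
    label = min(scores, key=scores.get)
    return label, scores[label]


# Toy data: labels denote the hydrogen shift of each candidate formula.
master = [0.5, 1.0, 0.6]
candidates = {
    "-1H": [0.7, 1.0, 0.4],
    "0H": [0.5, 1.0, 0.6],
    "+1H": [0.3, 1.0, 0.8],
}
label, score = best_fit(master, candidates)
```

If the winning label is the unshifted formula, the observed cluster agrees with the putative chemical formula; a winner at a shifted formula suggests a mass discrepancy of that many hydrogen atoms.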

(39) In another aspect, the invention provides a method for verifying the primary structure of a protein. The method comprises the following steps: obtaining a mass spectrum of a full-length protein; identifying from the mass spectrum a plurality of ion clusters with a mass corresponding to the full-length protein but with different charge states; calculating a master ion cluster from the plurality of ion clusters; and comparing the master ion cluster with a series of simulated ion clusters generated based on the chemical formula of the full-length protein with or without a modification, to find a best fitted simulated ion cluster. If the best fitted simulated ion cluster corresponds to a full-length protein with a specific type of modification or without modification, said full-length protein with a specific type of modification or without modification represents the verified primary structure of the protein.

(40) The term “primary structure” as used herein refers to the amino acid sequence of a protein and its (post-translational) protein modification(s).

(41) The method of the present invention adopts the “top-down” strategy. As used herein, the term “full-length protein” refers to an intact protein or a protein which is pre-treated to remove certain complicated modifications (but not fragmented) before being subjected to mass spectrometric analysis. For example, N-linked glycosylations can be removed by a PNGase F treatment.

(42) Preferably, the mass spectrum is obtained through a high-resolution mass spectrometry. The high-resolution mass spectrometry includes but is not limited to a matrix-assisted laser desorption ionization/time of flight (MALDI-TOF) mass spectrometry, a surface enhanced laser desorption ionization/time of flight (SELDI-TOF) mass spectrometry, a liquid chromatography-mass spectrometry (LC-MS), a liquid chromatography tandem mass spectrometry (LC-MS-MS), or an electrospray ionization mass spectrometry (ESI-MS).

(43) The master ion cluster is derived from the observed ion clusters in the mass spectrometry, and comprises an ordered set of normalized intensities. According to certain preferred embodiments of the present invention, certain normalized intensities are calculated by a process comprising the following steps: summing up the intensities of the most abundant peak at (m/z).sub.ma of each of the plurality of ion clusters (corresponding to the full-length protein but with different charge states), to obtain a starting summation S.sub.S; summing up the intensities of the next larger isotopic peak p(+1) in the plurality of ion clusters, with an m/z larger than the (m/z).sub.ma according to an average isotope spacing, of each most abundant peak, to obtain a first right summation S.sub.p(+1); and summing up the intensities of the next smaller isotopic peak p(−1) in the plurality of ion clusters, with an m/z smaller than the (m/z).sub.ma according to the average isotope spacing, of each most abundant peak, to obtain a first left summation S.sub.p(−1). The starting, first left and first right summations may be later normalized by the largest “intensity” (summation of intensities).

(44) Other ordered normalized intensities may be calculated through a similar process. As such, a plurality of right summations of a respective plurality of isotopic peaks p(+l) and a plurality of left summations of a respective plurality of isotopic peaks p(−m) may be obtained, wherein l and m each is a positive integer, the isotopic peak p(+l) is the next larger isotopic peak relative to the isotopic peak p(+(l−1)) according to the average isotope spacing, and the isotopic peak p(−m) is the next smaller isotopic peak relative to the isotopic peak p(−(m−1)) according to the average isotope spacing. For the normalization, the starting summation, the plurality of left summations and the plurality of right summations are divided by the largest summation S.sub.M among all the summations. The values of l and m may be readily determined by a person skilled in the art based on actual needs. For example, detection of the left half of the cluster ends at |m|=ΔM.sub.N+2, and detection of the right half ends when ions with relative abundances of less than 5% are reached, wherein ΔM.sub.N is the nominal mass difference between the monoisotopic mass and the most abundant mass of a protein (Chen et al., Anal Biochem 440, 108-113 (2013)).

(45) Accordingly, the master ion cluster may comprise an ordered set of normalized intensities as follows: (S.sub.p(−m)/S.sub.M, S.sub.p(−(m−1))/S.sub.M, . . . , S.sub.p(−1)/S.sub.M, S.sub.S/S.sub.M, S.sub.p(+1)/S.sub.M, . . . , S.sub.p(+(l−1))/S.sub.M, S.sub.p(+l)/S.sub.M).

(46) According to one preferred embodiment of the present invention, each of the observed intensities is normalized by dividing by the charge state of the corresponding isotopic peak before being summed up.

(47) According to the present invention, the average isotope spacing may be about 1 Dalton. Preferably, the average isotope spacing is 1.00235 Dalton.
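The summation and normalization described in paragraphs (43) to (47) can be sketched as follows. Several points are assumptions of this sketch rather than statements from the text: the spectrum is a list of (m/z, intensity) peaks, the most abundant m/z of each charge state is already known, the isotope spacing on the m/z axis is taken as 1.00235 Da divided by the charge, and a peak is matched when it lies within a fixed tolerance `tol` of the expected m/z. All names are hypothetical.

```python
from typing import Dict, List, Tuple

ISOTOPE_SPACING = 1.00235  # Da, the preferred average spacing given in the text
Peak = Tuple[float, float]  # (m/z, intensity)


def intensity_near(peaks: List[Peak], target_mz: float, tol: float) -> float:
    """Intensity of the peak closest to target_mz, or 0.0 if none within tol."""
    best = min(peaks, key=lambda p: abs(p[0] - target_mz), default=None)
    if best is None or abs(best[0] - target_mz) > tol:
        return 0.0
    return best[1]


def master_ion_cluster(peaks: List[Peak], most_abundant_mz: Dict[int, float],
                       left: int, right: int, tol: float = 0.01,
                       charge_normalize: bool = True) -> List[float]:
    """Normalized summations for positions p(-left) .. p(+right)."""
    sums = []
    for k in range(-left, right + 1):
        total = 0.0
        for charge, mz_ma in most_abundant_mz.items():
            # On the m/z axis, neighbouring isotopic peaks are spacing/charge apart.
            target = mz_ma + k * ISOTOPE_SPACING / charge
            intensity = intensity_near(peaks, target, tol)
            # Paragraph (46): optionally divide by the charge state before summing.
            total += intensity / charge if charge_normalize else intensity
        sums.append(total)
    largest = max(sums)
    if largest <= 0.0:
        raise ValueError("no matching peaks found")
    return [s / largest for s in sums]


# Synthetic spectrum with two charge states (1+ and 2+), three peaks each.
peaks = [
    (100.0 - ISOTOPE_SPACING, 4.0), (100.0, 10.0), (100.0 + ISOTOPE_SPACING, 6.0),
    (50.0 - ISOTOPE_SPACING / 2, 8.0), (50.0, 20.0), (50.0 + ISOTOPE_SPACING / 2, 12.0),
]
cluster = master_ion_cluster(peaks, {1: 100.0, 2: 50.0}, left=1, right=1)
```

In this synthetic case the charge-normalized summations are 8, 20 and 12, so the master ion cluster is approximately (0.4, 1.0, 0.6) after division by the largest summation.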

(48) According to the present invention, the master ion cluster and the series of simulated ion clusters are compared by a method selected from the group consisting of chi-square test, Pearson's chi-square test, chi-square test with Yates's correction, Fisher's exact test, McNemar's test, and Cochran's Q test.

(49) According to the present invention, each of the series of simulated ion clusters may be generated by a process comprising: given a chemical formula C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z of the full-length protein with or without a modification, combining putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z to obtain the simulated ion cluster of the full-length protein with or without the modification, wherein the putative ion cluster of C.sub.v is represented by the intensities I.sub.n,v=A.sub.12.sub.C·I.sub.n,v−1+A.sub.13.sub.C·I.sub.n−1,v−1, A.sub.12.sub.C and A.sub.13.sub.C being the natural abundances of .sup.12C and .sup.13C respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of H.sub.w is represented by the intensities I.sub.n,w=A.sub.1.sub.H·I.sub.n,w−1+A.sub.2.sub.H·I.sub.n−1,w−1, A.sub.1.sub.H and A.sub.2.sub.H being the natural abundances of .sup.1H and .sup.2H respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of O.sub.x is represented by the intensities I.sub.n,x=A.sub.16.sub.O·I.sub.n,x−1+A.sub.17.sub.O·I.sub.n−1,x−1+A.sub.18.sub.O·I.sub.n−2,x−1, A.sub.16.sub.O, A.sub.17.sub.O and A.sub.18.sub.O being the natural abundances of .sup.16O, .sup.17O and .sup.18O respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of N.sub.y is represented by the intensities I.sub.n,y=A.sub.14.sub.N·I.sub.n,y−1+A.sub.15.sub.N·I.sub.n−1,y−1, A.sub.14.sub.N and A.sub.15.sub.N being the natural abundances of .sup.14N and .sup.15N respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; and the putative ion cluster of S.sub.z is represented by the intensities
I.sub.n,z=A.sub.32.sub.S·I.sub.n,z−1+A.sub.33.sub.S·I.sub.n−1,z−1+A.sub.34.sub.S·I.sub.n−2,z−1+A.sub.36.sub.S·I.sub.n−4,z−1, A.sub.32.sub.S, A.sub.33.sub.S, A.sub.34.sub.S and A.sub.36.sub.S being the natural abundances of .sup.32S, .sup.33S, .sup.34S and .sup.36S respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak. Accordingly, each of the series of simulated ion clusters comprises an ordered set of normalized putative intensities.
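The five element-specific recurrences of paragraph (49) share one form, which may be easier to read in compact notation. The index k, the shift symbol δ and the boundary conditions below are notation assumed for this restatement; the boundary conditions in particular are not stated in the text and are the natural choice for a cluster of zero atoms.

```latex
I_{n,k} \;=\; \sum_{j} A_j \, I_{\,n-\delta_j,\;k-1},
\qquad
I_{0,0} = 1, \quad I_{n,0} = 0 \ (n > 0), \quad I_{n,k} = 0 \ (n < 0)
```

Here k is the number of atoms of the element, A_j is the natural abundance of its j-th isotope, and δ_j is the nominal mass offset of that isotope from the lightest one: δ takes the values {0, 1} for C, H and N, {0, 1, 2} for O, and {0, 1, 2, 4} for S.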

(50) Preferably, the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined one by one. In one preferred embodiment of the present invention, the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined according to the positions of the peaks. For example, the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z may be combined by a process comprising: (i) calculating the intensities I.sub.M,CH, each of which equals Σ.sub.i=0.sup.M I.sub.i,v×I.sub.(M−i),w; (ii) calculating the intensities I.sub.M,CHO, each of which equals Σ.sub.i=0.sup.M I.sub.i,CH×I.sub.(M−i),x; (iii) calculating the intensities I.sub.M,CHON, each of which equals Σ.sub.i=0.sup.M I.sub.i,CHO×I.sub.(M−i),y; (iv) calculating the intensities I.sub.M,CHONS, each of which equals Σ.sub.i=0.sup.M I.sub.i,CHON×I.sub.(M−i),z; wherein i is a non-negative integer and M is the number of putative isotopic peaks other than the putative monoisotopic peak; and wherein the intensities I.sub.M,CHONS represent a simulated ion cluster of the full-length protein with or without the modification. However, a method of the present invention is not limited to such order of combination.

(51) The present invention also includes the following Embodiments:

(52) 1. A method for verifying the primary structure of a protein comprising:

(53) obtaining a mass spectrum of a full-length protein;

(54) identifying from the mass spectrum a plurality of ion clusters with a mass corresponding to the full-length protein but with different charge states;

(55) calculating a master ion cluster from the plurality of ion clusters; and

(56) comparing the master ion cluster with a series of simulated ion clusters generated based on the chemical formula of the full-length protein with or without a modification, to find a best fitted simulated ion cluster.

(57) 2. The method of Embodiment 1, wherein the master ion cluster is calculated by a process comprising: summing up the intensities of the most abundant peak at (m/z).sub.ma of each of the plurality of ion clusters, to obtain a starting summation; summing up the intensities of the next larger isotopic peak p(+1), with an m/z larger than the (m/z).sub.ma according to an average isotope spacing, of each most abundant peak, to obtain a first right summation; and summing up the intensities of the next smaller isotopic peak p(−1), with an m/z smaller than the (m/z).sub.ma according to the average isotope spacing, of each most abundant peak, to obtain a first left summation.
3. The method of Embodiment 2, wherein a plurality of right summations of a respective plurality of isotopic peaks p(+l) are obtained, a plurality of left summations of a respective plurality of isotopic peaks p(−m) are obtained, and the starting summation, the plurality of left summations and the plurality of right summations are normalized by dividing by the largest summation among all the summations, wherein l and m each is a positive integer, the isotopic peak p(+l) is the next larger isotopic peak relative to the isotopic peak p(+(l−1)) according to the average isotope spacing, and the isotopic peak p(−m) is the next smaller isotopic peak relative to the isotopic peak p(−(m−1)) according to the average isotope spacing.
4. The method of Embodiment 2 or 3, wherein each of the intensities is normalized by dividing by the charge state of the corresponding isotopic peak before being summed up.
5. The method of Embodiment 2, wherein the average isotope spacing is about 1 Dalton.
6. The method of Embodiment 5, wherein the average isotope spacing is 1.00235 Dalton.
7. The method of Embodiment 1, wherein the mass spectrum is obtained through a high-resolution mass spectrometry.
8. The method of Embodiment 1, wherein the master ion cluster and the series of simulated ion clusters are compared by a method selected from the group consisting of chi-square test, Pearson's chi-square test, chi-square test with Yates's correction, Fisher's exact test, McNemar's test, and Cochran's Q test.
9. The method of any of Embodiments 1-8, wherein each of the series of simulated ion clusters is generated by a process comprising: given a chemical formula C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z of the full-length protein with or without a modification, combining putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z to obtain the simulated ion cluster of the full-length protein with or without the modification, wherein the putative ion cluster of C.sub.v is represented by the intensities I.sub.n,v=A.sub.12.sub.C.Math.I.sub.n,v−1+A.sub.13.sub.C.Math.I.sub.n−1,v−1, A.sub.12.sub.C and A.sub.13.sub.C being the natural abundances of .sup.12C and .sup.13C respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of H.sub.w is represented by the intensities I.sub.n,w=A.sub.1.sub.H.Math.I.sub.n,w−1+A.sub.2.sub.H.Math.I.sub.n−1,w−1, A.sub.1.sub.H and A.sub.2.sub.H being the natural abundances of .sup.1H and .sup.2H respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of O.sub.x is represented by the intensities I.sub.n,x=A.sub.16.sub.O.Math.I.sub.n,x−1+A.sub.17.sub.O.Math.I.sub.n−1,x−1+A.sub.18.sub.O.Math.I.sub.n−2,x−1, A.sub.16.sub.O, A.sub.17.sub.O and A.sub.18.sub.O being the natural abundances of .sup.16O, .sup.17O and .sup.18O respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; the putative ion cluster of N.sub.y is represented by the intensities I.sub.n,y=A.sub.14.sub.N.Math.I.sub.n,y−1+A.sub.15.sub.N.Math.I.sub.n−1,y−1, A.sub.14.sub.N and A.sub.15.sub.N being the natural abundances of .sup.14N and .sup.15N respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak; and the putative ion cluster of S.sub.z is represented by the intensities I.sub.n,z=A.sub.32.sub.S.Math.I.sub.n,z−1+A.sub.33.sub.S.Math.I.sub.n−1,z−1+A.sub.34.sub.S.Math.I.sub.n−2,z−1+A.sub.36.sub.S.Math.I.sub.n−4,z−1, A.sub.32.sub.S, A.sub.33.sub.S, A.sub.34.sub.S and A.sub.36.sub.S being the natural abundances of .sup.32S, .sup.33S, .sup.34S and .sup.36S respectively, for the n-th putative isotopic peak starting from the 0-th peak being the putative monoisotopic peak.
10. The method of Embodiment 9, wherein the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined one by one.
11. The method of Embodiment 9, wherein the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined according to the positions of the peaks.
12. The method of Embodiment 10 or 11, wherein the putative ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are combined by a process comprising: (i) calculating the intensities I.sub.M,CH, each of which equals to Σ.sub.i=0.sup.MI.sub.i,v×I.sub.(M−i),w; (ii) calculating the intensities I.sub.M,CHO, each of which equals to Σ.sub.i=0.sup.MI.sub.i,CH×I.sub.(M−i),x; (iii) calculating the intensities I.sub.M,CHON, each of which equals to Σ.sub.i=0.sup.MI.sub.i,CHO×I.sub.(M−i),y; (iv) calculating the intensities I.sub.M,CHONS, each of which equals to Σ.sub.i=0.sup.MI.sub.i,CHON×I.sub.(M−i),z; wherein i is a non-negative integer and M is the number of putative isotopic peaks other than the putative monoisotopic peak; and wherein the intensities I.sub.M,CHONS represents a simulated ion cluster of the full-length protein with or without the modification.
13. A method according to Embodiment 1, wherein the said series of simulated ion clusters correspond to the series of the chemical formulas that are produced by adding or removing several hydrogen atoms from the chemical formula of the said protein sample.
14. A method according to Embodiment 13, wherein each ion cluster member in the said series of simulated ion cluster is computationally generated by combination of multiple single-element ion clusters each of which has the number of atoms the same as that of chemical formula of the said ion cluster member.
15. A method according to Embodiment 14, wherein said ion cluster member results from the sequential pairwise combinations of single-element ion clusters based on the principle that isotopologues with the same position number in the said ion cluster member are integrated together in terms of the percentages in the ion cluster and weighted molecular masses.
16. A method according to Embodiment 15, wherein said integration of percentages of isotopologues in the ion cluster is the summation of all percentages of all isotopologues with the same position number.
17. A method according to Embodiment 15, wherein said molecular masses are the result of the equation:
(MM.sub.1×P.sub.1+MM.sub.2×P.sub.2)/(P.sub.1+P.sub.2)
where MM.sub.1 and MM.sub.2 are the molecular masses and P.sub.1 and P.sub.2 are the percentages of isotopologues in the first and second ion clusters, respectively, before integration.
18. A method according to Embodiment 15, wherein the said position number for each multi-element isotopologue is equal to the result of the following equation:
Σ.sub.i=2.sup.5[(Σ.sub.iN.sub.e(j))×(i−1)]
where .sub.iN.sub.e(j) is the number of the ith lightest isotope of jth element, e(j), included in the said multi-element isotopologue.
19. A method according to Embodiment 18, wherein the i is the rounded integer of (MM.sub.I-MM.sub.MN) where MM.sub.I is the molecular mass of the said isotope I and MM.sub.MN is the monoisotopic mass of the said element.
20. A method according to Embodiment 18, wherein the second (2nd) lightest isotopes, as i=2, are .sup.13C, .sup.2H, .sup.15N, .sup.17O, .sup.33S; the third (3rd) lightest isotopes, as i=3, are .sup.14C, .sup.3H, .sup.16N, .sup.18O, .sup.34S; the fourth (4th) lightest isotope, as i=4, is .sup.35S; and the fifth (5th) lightest isotope, as i=5, is .sup.36S
21. A method according to Embodiment 14, wherein the production of each single-element ion cluster is accomplished based on the principle that isotopologues with same position number in the said single-element ion cluster are integrated together in terms of the percentages in the ion cluster and weighted molecular masses.
22. A method according to Embodiment 21, wherein the position number of each single-element isotopologue is equal to the result of the following equation:
Σ.sub.i=2.sup.5[(Σ.sub.iN)×(i−1)]
where .sub.iN is the number of the ith lightest isotope of the said element included in the said single-element isotopologue.
23. A method according to Embodiment 22, wherein the i is the rounded integer of (MM.sub.I-MM.sub.MN) where MM.sub.I is the molecular mass of the said isotope I and MM.sub.MN is the monoisotopic mass of the said element.
24. A method according to Embodiment 14, wherein each of the said single-element ion clusters is directly taken from the databases consisting of the simulated ion clusters for single-element compounds containing different numbers of atoms.

(58) The present invention is further illustrated by the following examples, which are provided for the purpose of demonstration rather than limitation.

Example 1. The Flowchart for Verification of Protein Primary Structure

(59) A protein sample, with or without prior sample preparation, is analyzed using mass spectrometry, and the MS data are processed with algorithms, e.g. the in-house IntegrateMS, which implements ion cluster location and summation to produce the observed master ion cluster. Meanwhile, the putative primary (1°) structure of the protein sample, including amino acid sequence and modifications, is converted to the expected chemical formula C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z (.sup.0CF). To verify the chemical formula of the protein sample, a series of simulated ion clusters is calculated by programs, e.g. the in-house Macro Cluster, according to the chemical formula CF(m)=.sup.0CF+H.Math.m, where m ranges from −3 to +3. Finally, an algorithm, e.g. the in-house CompareMS, is used to compare the master ion cluster with each of the simulated ion clusters to give a series of difference scores (.sup.mDS), one for each simulated ion cluster of CF(m). The primary structure of the protein sample is verified only if .sup.0DS is the smallest score in the entire DS series (see FIG. 1).

(60) Verification of the primary structure, including amino acid sequence and posttranslational modifications (PTMs), is important for quality evaluation of a protein therapeutic. While protein modifications are key elements of protein structure and are usually associated with particular functions, it remains a grand challenge to evaluate such sophisticated structures in protein therapeutics. In particular, protein modifications causing small changes in molecular mass, such as disulfides, amidations and deamidations, cannot be analyzed properly using a conventional reductionist approach. In contrast, documentation of the molecular mass of a protein therapeutic using mass spectrometry can serve as the first step to confirm its expected chemical formula. While high-resolution mass spectrometry can be applied to discern the details of protein therapeutics, we currently have neither adequate knowledge nor methodologies to properly analyze their primary structures. We have implemented informatics methods to help understand how to deduce monoisotopic masses of protein therapeutics based on the characterization of most abundant masses in ion clusters (Chen et al., Anal Biochem 440, 108-113 (2013)). In this process, we found that informatics methods that simulate ion cluster formation would be essential for the development of methods that directly verify protein primary structure, especially for protein modifications with small changes in molecular mass, such as disulfide bond formation, Gln/Asn deamidation or Glu/Asp amidation.

(61) To test our hypothesis, we streamline the analytical procedure and establish informatics-based methods to deduce the likely primary structure of protein therapeutics by matching the master ion cluster with a series of simulated ion clusters generated based on the chemical formulas that are produced by adding or removing several hydrogen atoms from the chemical formula of the protein sample.

(62) The protein sample, with or without pre-treatment, is first analyzed with high-resolution mass spectrometry. The mass spectrometric data are processed using programs, e.g. IntegrateMS, to obtain a master ion cluster by computationally merging ion clusters that are identified from the protein sample but have different charge states. The putative chemical formula C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z (.sup.0CF) of the tested protein sample is deduced from its protein sequence and known protein modifications using software, e.g. Macro Cluster; the same program also produces a series of simulated ion clusters based on the chemical formulas CF(m) that are produced by adding or removing several hydrogen atoms from the chemical formula of the putative primary structure, .sup.0CF. Finally, programs, e.g. CompareMS, are used to assign a difference score (.sup.mDS) to each of the simulated ion clusters based on its difference from the master ion cluster. The primary structure of the protein sample is validated only when .sup.0DS has the smallest value in the DS series (see FIG. 1).
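
The decision rule of this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the in-house Macro Cluster or CompareMS code: the names formula_series, difference_score and is_verified are hypothetical, the formula is represented as a plain dictionary, and the sum of squared differences between max-normalized clusters stands in for whichever statistical comparison is actually used.

```python
from typing import Dict, Sequence

Formula = Dict[str, int]


def formula_series(cf0: Formula, max_shift: int = 3) -> Dict[int, Formula]:
    """Return CF(m) = 0CF + H*m for m = -max_shift .. +max_shift."""
    return {m: {**cf0, "H": cf0["H"] + m}
            for m in range(-max_shift, max_shift + 1)}


def difference_score(master: Sequence[float], simulated: Sequence[float]) -> float:
    """Hypothetical score: squared differences of max-normalized clusters.

    Both clusters are assumed to be already aligned to the same peak positions.
    """
    if len(master) != len(simulated):
        raise ValueError("clusters must be aligned to the same positions")
    top_m, top_s = max(master), max(simulated)
    return sum((a / top_m - b / top_s) ** 2 for a, b in zip(master, simulated))


def is_verified(scores: Dict[int, float]) -> bool:
    """True only if the score for m = 0 is strictly the smallest in the series."""
    return all(scores[0] < s for m, s in scores.items() if m != 0)


# Putative formula of de-N-glycosylated erythropoietin with a trisaccharide.
epo_tri = {"C": 834, "H": 1338, "O": 261, "N": 228, "S": 5}
series = formula_series(epo_tri)  # seven formulas, H from 1335 to 1341
```

Treating a tie as a failed verification is a design choice of this sketch; the text only requires that .sup.0DS be the smallest score.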

Example 2. Protein Sequences of the Tested Therapeutic, PNGase F-Treated Erythropoietin

(63) Primary structure information of erythropoietin with PNGase F treatment is provided as the baseline for establishing simulated ion clusters. N.fwdarw.D indicates that three asparagine residues (N) are replaced by three aspartic acid residues (D, underlined) after enzymatic removal of N-linked glycans. The solid line shows the disulfide linkage between two cysteine residues. Trisaccharide NeuAc-Hex-HexNAc (FIG. 2A) and tetrasaccharide NeuAc-Hex-HexNAc-NeuAc (FIG. 2B) are two possible types of O-linked glycans on serine 126. The putative chemical formulas of de-N-glycosylated erythropoietin with a trisaccharide or a tetrasaccharide are deduced as C.sub.834H.sub.1338O.sub.261N.sub.228S.sub.5 or C.sub.845H.sub.1355O.sub.269N.sub.229S.sub.5, respectively.

(64) In our studies, erythropoietin with N-oligosaccharides removed is used as an example for verification of primary structure with our intact protein analyses. The putative therapeutic, de-N-glycosylated erythropoietin, is reported to contain 165 amino acids with three asparagines replaced by three aspartic acids (N.fwdarw.D) after enzymatic removal of N-linked glycans (FIGS. 2A and 2B). In addition, its protein modifications include two disulfide linkages and one O-linked glycosylation on serine 126, which can carry one of two possible types of O-linked glycans, either trisaccharide NeuAc-Hex-HexNAc (FIG. 2A) or tetrasaccharide NeuAc-Hex-HexNAc-NeuAc (FIG. 2B). Hence, the putative chemical formulas of de-N-glycosylated erythropoietin with a trisaccharide or a tetrasaccharide are deduced as C.sub.834H.sub.1338O.sub.261N.sub.228S.sub.5 or C.sub.845H.sub.1355O.sub.269N.sub.229S.sub.5, respectively. Based on these chemical formulas, we will demonstrate that our proposed method can help precisely verify the primary structure.

Example 3. SDS-PAGE Analysis of Erythropoietin Products with or without PNGase F Treatment

(65) Erythropoietin samples with (FIG. 3, lanes 3 and 5) or without (FIG. 3, lanes 2 and 4) PNGase F (NG-F) treatment were electrophoresed under non-reducing conditions and visualized on the gel by silver staining. The numbers on the left, expressed in kilodaltons (kDa), are the positions of molecular mass markers. Lot numbers of Eprex and Recormon are EFS5600 and H0743H01, respectively.

(66) To verify the primary structure of protein therapeutics with our proposed analytical methods, human-derived erythropoietin was chosen as the first test protein. Various biologic and biosimilar erythropoietin drugs produced by recombinant DNA technology in cell culture are currently available on the market. Quality control of these protein therapeutics, which are derived from a complex biological system, remains a great challenge. Erythropoietin is a glycoprotein with a molecular mass of about 30.4 kDa, wherein half of its molecular mass is sugar groups. The polypeptide backbone is estimated to be approximately 18 kDa. It has been reported that the three N-linked glycosylation sites on erythropoietin result in dozens of protein structures, which makes post-translational modifications of erythropoietin products difficult to detect.

(67) In order to verify whether removal of N-linked glycosylations simplifies the diversity of erythropoietin structures and helps detect other modifications easier, we performed SDS-PAGE experiment to analyze erythropoietin with or without PNGase F treatment.

(68) Two brands of erythropoietin samples were each incubated with or without 3 U of PNGase F (NG-F) in 25 mM ammonium bicarbonate buffer at 37° C. for 2 hours, followed by addition of 5 μl of 4× sample buffer, which consists of Tris pH 6.8, 10% (w/v) SDS, 0.4% (w/v) bromophenol blue and 50% (v/v) glycerol, and then heated for 10 minutes at 95° C. The processed samples were then applied to a 15% SDS-PAGE gel and electrophoresed at 150 V under non-reducing conditions until the tracking dye reached the bottom of the gel. After electrophoresis, the gel was developed with silver staining.

(69) SDS-PAGE analysis showed that the erythropoietin sample before PNGase F treatment migrated as blurred bands ranging from 30 kDa to 40 kDa under non-reducing conditions. When erythropoietin with PNGase F treatment was subjected to non-reducing SDS-PAGE analysis, it migrated as two polypeptides of about 20 kDa, whose gel mobilities are closer to that expected from the known length of the erythropoietin polypeptide. Together, these data suggest that the widely spread species from 30 kDa to 40 kDa are mainly caused by the variety of N-linked glycosylations. These data also support the use of PNGase F as the enzyme for complete removal of complicated N-linked glycosylations.

Example 4. Overall Charge State Distribution of PNGase F-Treated Eprex and Recormon in Liquid Chromatographic-Mass Spectrometric (LC-MS) Analysis

(70) The average mass spectra of de-N-glycosylated Eprex (FIG. 4A) and Recormon (FIG. 4B) are generated within selected range of LC retention time and enlarged over the indicated mass range. The positive numbers at the top of MS signals indicate the charge states of ion clusters. The arrows mark two major signals of +15 ion clusters corresponding to de-N-glycosylated erythropoietins with a trisaccharide (I) and a tetrasaccharide (II) respectively, which is subsequently verified with our methods.

(71) In order to characterize whether two 20 kDa polypeptides on the gel indeed resulted from O-linked glycosylations of erythropoietin after removal of N-linked glycans and to verify whether they contain two disulfide bonds, we employed liquid chromatography-mass spectrometry (LC-MS) to examine the structural details of these two polypeptides.

(72) To examine MS profiles of these two intact polypeptides, we further subjected PNGase F-treated erythropoietin to LC-MS analyses.

(73) The PNGase F-treated samples were analyzed in an LTQ-Orbitrap hybrid tandem mass spectrometer (ThermoFisher, USA) coupled in-line with an Agilent 1200 nanoflow HPLC system. The HPLC system was equipped with an Agilent mRP-C18 High-Recovery Protein Column (length: 100 mm; internal diameter: 0.5 mm; bead size: 5 μm) as the separating column. The mobile phase consisted of (A) 0.1% formic acid in water and (B) 0.1% formic acid in acetonitrile. The full and SIM mass spectra were collected over the mass range of m/z 200-2000 at a resolving power of 100,000. The collected data were analyzed using Xcalibur software (ThermoFisher, USA).

(74) LC-MS analyses showed that two major protein species (I and II) were both detected for different branded erythropoietins, such as Eprex (FIG. 4A) and Recormon (FIG. 4B). The majority of these two ions had electric charges of +11 to +16. However, reverse ratios of these two major signals were observed for Eprex and Recormon. Besides, they also resulted in different patterns of charge state distributions. Based on mass determination with the previously reported M.sub.ma-turned-M.sub.mi approach, major form I can primarily be confirmed as de-N-glycosylated erythropoietin with an O-linked trisaccharide, while major form II as the same one but with a tetrasaccharide. However, the mass shift of disulfide bonds is too small to elucidate its presence on these structures with the M.sub.ma-turned-M.sub.mi method. Hence, the new analytical method here is developed to solve this difficulty of mass determination. All these ion signals with different charge states but from the same erythropoietin species will be identified and subsequently merged into an observed master ion cluster by using our in-house programs, IntegrateMS. The derived two observed master ion clusters will be verified with our informatics method to answer whether they are indeed as reported O-linked oligosaccharide-containing erythropoietins with two disulfide bonds.

Example 5. Workflow of IntegrateMS: To Mine for Ion Clusters and to Obtain Normalized MS of Target Protein

(75) IntegrateMS screens for ion clusters of the target protein across different charge states. The most abundant mass-over-charge value (m/z).sub.ma, as P.sub.1, at charge state P.sub.2 is input as the starting point of ion cluster fishing. Each charge state, from P.sub.2+N to P.sub.2−N, has its own (m/z).sub.ma. If the (m/z).sub.ma of a given charge state is present in the spectrum, the FullCluster algorithm is activated to obtain the full cluster at that charge state. If not, the next charge state, P.sub.2+N or P.sub.2−N, is tried. For ion cluster mining at a certain charge state, the FullCluster algorithm is designed to hook onto (m/z).sub.ma and then search for neighboring peaks with m/z of (m/z).sub.ma+(1.00235/x).Math.L. If multiple peaks are detected with a mass error of less than 15 ppm, the one with the maximal intensity I.sub.L,x is selected as the ion signal at position L. Detection of the left half of the cluster ends when |L|=ΔM.sub.N+2, and detection of the right half ends when the relative abundance falls below 5%, wherein ΔM.sub.N is the nominal mass difference between the monoisotopic mass and the most abundant mass of a protein (Chen et al., Anal Biochem 440, 108-113 (2013)). As the FullCluster algorithm searches out the individual clusters at different charge states, multiple ion clusters are obtained. The detected ion clusters are combined and normalized to give the observed master ion cluster. Here x denotes the charge state.

Example 6. Ion Clusters of De-N-Glycosylated Eprex with an O-Linked Trisaccharide Mined Out from MS Raw Data by IntegrateMS and the Subsequent Integrated Master Ion Cluster

(76) After LC-MS analysis of the analyte, PNGase F-treated Eprex, the mass spectrometric raw data were processed by IntegrateMS to obtain ion clusters of de-N-glycosylated Eprex with an O-linked trisaccharide at charge states from 10 to 18 (FIG. 6, dash-lined profiles above). Once all the individual clusters were gathered, the signals at the same position among the different charge states were summed to derive the observed master ion cluster (FIG. 6, dash-lined profile below).

Example 7. Ion Clusters of De-N-Glycosylated Eprex with an O-Linked Tetrasaccharide Mined Out from MS Raw Data by IntegrateMS and the Subsequent Integrated Master Ion Cluster

(77) After LC-MS analysis of the analyte, PNGase F-treated Eprex, the mass spectrometric raw data were processed by IntegrateMS to obtain ion clusters of de-N-glycosylated Eprex with an O-linked tetrasaccharide at charge states from 10 to 19 (FIG. 7, dash-lined profiles above). Once all the individual clusters were gathered, the signals at the same position among the different charge states were summed to derive the observed master ion cluster (FIG. 7, dash-lined profile below).

Example 8. Program IntegrateMS

(78) When protein molecules are ionized through electrospray ionization, these molecules can take different numbers of protons to become molecular ions with various positive charge states. As molecular ions with a particular charge state move closely together in the mass analyzer, they should become one ion cluster in the high-resolution mass spectrum. When different numbers of protons are taken, there should be multiple ion clusters observed in the mass spectrum even for a protein with one single chemical formula (Zhang et al., J Am Soc Mass Spectrom 9, 225-33 (1998)). In our previous version, we have chosen the ion cluster with the highest signal for later CompareMS analyses (data not shown). We have observed that most of these ion clusters did not have a smooth profile and many were even defective, lacking a few peaks in the cluster. Intriguingly, when these clusters are put together to become one integrated one, this synthetic cluster has a much smoother profile. Thus, this significant improvement upon signal integration prompted us to develop a computer program that can automatically detect those clusters that are from the same protein molecules but have different charge states.

(79) A procedure for automatically mining the ion clusters of target MS signals from different charge states was developed (FIG. 5). Two parameters, P.sub.1 and P.sub.2, are considered by the IntegrateMS program when LC-MS data of a protein sample are analyzed. P.sub.1 is the m/z value of the highest signal in the cluster, or (m/z).sub.ma, and P.sub.2 is the charge state of this m/z value. First, P.sub.1 and P.sub.2 are used to check whether the clusters at different charge states are present, on the basis of the detection of (m/z).sub.ma for each ion cluster. When the signal of (m/z).sub.ma is present, the FullCluster algorithm is started.
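
One way to obtain the (m/z).sub.ma expected at each neighboring charge state from P.sub.1 and P.sub.2 is sketched below. The sketch assumes that every ion is a protonated species [M+zH].sup.z+ and that the most abundant isotopologue is the same at all charge states; the proton mass constant and the function names are assumptions of this illustration and are not taken from IntegrateMS.

```python
PROTON_MASS = 1.007276  # Da, assumed charge carrier (protonation)


def neutral_mass(mz: float, z: int) -> float:
    """Neutral mass of a protonated ion observed at m/z with charge z."""
    return z * (mz - PROTON_MASS)


def expected_mz(mz_ref: float, z_ref: int, z: int) -> float:
    """m/z expected at charge z for the species seen at (mz_ref, z_ref)."""
    return neutral_mass(mz_ref, z_ref) / z + PROTON_MASS


def seed_mz_by_charge(p1: float, p2: int, n: int) -> dict:
    """Expected (m/z)ma for every charge state from P2 - n to P2 + n."""
    return {z: expected_mz(p1, p2, z)
            for z in range(p2 - n, p2 + n + 1) if z > 0}


# Seeds for charge states 10 to 18 around P1 = 1350.629, P2 = 14.
seeds = seed_mz_by_charge(1350.629, 14, 4)
```

Each seed would then be looked up in the spectrum within the chosen mass tolerance before the cluster at that charge state is profiled.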

(80) The FullCluster algorithm assumes that the mass difference between neighboring signals in a cluster is 1.00235, which is derived using the Averagine concept (Chen et al., Anal Biochem 440, 108-113 (2013); Senko et al., J Am Soc Mass Spectrom 6, 229-233 (1995)). For the cluster with the charge state of P.sub.2, we use the mass step (1.00235/P.sub.2)×L to examine whether other peaks in the cluster are present. For the left half of the cluster, L is a negative integer that ranges from −1 to −(ΔM.sub.N+2). This ΔM.sub.N value is the nominal mass difference between the monoisotopic mass and the most abundant mass of a protein (Chen et al., Anal Biochem 440, 108-113 (2013)). ΔM.sub.N per se is a function of protein molecular mass, specifically ΔM.sub.N=0.63×M.sub.ma (kDa)−0.62, according to our calculation (Chen et al., Anal Biochem 440, 108-113 (2013)). For the right half of the cluster, L is a positive integer, ranging from +1 up to the position at which the signal intensity falls below 0.05 of the intensity of (m/z).sub.ma, i.e. I.sub.0,P2. When these signals are found present, each of their signal intensities, I.sub.L,P2, is recorded. The clusters at the other charge states whose (m/z).sub.ma is detected are subjected to FullCluster analyses as well. Finally, all of the clusters are aligned according to their L values, and the signal intensities with the same L value are added together, which produces the master observed ion cluster (FIG. 5).
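
The peak search and the summation by L value can be sketched as follows. This is a simplified illustration, not the IntegrateMS implementation: the spectrum is assumed to be a list of (m/z, intensity) pairs, the function names are hypothetical, rounding ΔM.sub.N to the nearest integer is an assumption, and the final division by the largest summation follows the normalization described for the master ion cluster.

```python
ISOTOPE_SPACING = 1.00235  # average isotope spacing from the text
PPM_TOL = 15.0             # mass tolerance from the text


def delta_m_n(m_ma_kda: float) -> int:
    """Nominal Mma - Mmi difference; rounding to an integer is an assumption."""
    return round(0.63 * m_ma_kda - 0.62)


def peak_at(peaks, target_mz, ppm=PPM_TOL):
    """Highest intensity within the ppm window around target_mz, else 0.0."""
    tol = target_mz * ppm * 1e-6
    hits = [inten for mz, inten in peaks if abs(mz - target_mz) <= tol]
    return max(hits, default=0.0)


def full_cluster(peaks, mz_ma, z, left_limit, right_cutoff=0.05):
    """Collect intensities by position L around the most abundant peak."""
    i0 = peak_at(peaks, mz_ma)
    if i0 == 0.0:
        return {}
    cluster = {0: i0}
    # Left half: L = -1 .. -left_limit; only peaks that are present are kept.
    for L in range(-1, -left_limit - 1, -1):
        inten = peak_at(peaks, mz_ma + ISOTOPE_SPACING / z * L)
        if inten > 0.0:
            cluster[L] = inten
    # Right half: stop at the first position below 5% of the top peak.
    L = 1
    while True:
        inten = peak_at(peaks, mz_ma + ISOTOPE_SPACING / z * L)
        if inten < right_cutoff * i0:
            break
        cluster[L] = inten
        L += 1
    return cluster


def master_cluster(clusters):
    """Sum intensities with the same L, then divide by the largest sum."""
    total = {}
    for cluster in clusters:
        for L, inten in cluster.items():
            total[L] = total.get(L, 0.0) + inten
    if not total:
        return {}
    top = max(total.values())
    return {L: total[L] / top for L in sorted(total)}
```

For a protein of about 18.9 kDa, delta_m_n(18.9) gives 11, so the left half would be searched down to L = −13.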

(81) Sample Results Using Program IntegrateMS

(82) We applied IntegrateMS to analyze the data acquired for de-N-glycosylated erythropoietins. Erythropoietin is an 18-kDa construct and its primary structure, including one O-linked glycosylation and two disulfide bonds, has been primarily verified using our M.sub.ma-turned-M.sub.mi method (data not shown) (Chen et al., Anal Biochem 440, 108-113 (2013)). As mentioned, there were some problems in locating the correct M.sub.ma, and thus we sought to confirm these results by using IntegrateMS to generate master ion clusters. The integrated results were later analyzed with the subsequent programs.

(83) For integration of de-N-glycosylated Eprex with an O-linked trisaccharide, we used (m/z).sub.ma=1350.629 as the P.sub.1 parameter and charge state z=14 as the P.sub.2 parameter. The IntegrateMS program found nine (m/z).sub.ma values from charge states 10 to 18. With these (m/z).sub.ma values, nine ion clusters were profiled (FIG. 6). Among these ion clusters, those at charge states +14 to +16 have rather smooth profiles. However, the profiles of the others were less regular and, in particular, the +10, +17 and +18 clusters have many defects. Upon integration, the master ion cluster has the best distribution pattern. Similarly, for integration of de-N-glycosylated Eprex with an O-linked tetrasaccharide, we used 1280.126 and 15 as P.sub.1 and P.sub.2 for identification of all related ion clusters. Ten (m/z).sub.ma values were found by the program, and all ten were mined out as non-smooth clusters (FIG. 7). While the master ion cluster kept its crescendo-decrescendo pattern intact, the overall pattern was not as smooth.

(84) Altogether, while these deductions show the effectiveness of IntegrateMS in completion of cluster profiling, the much smoother profiles of master ion clusters highlight the necessity of collective consideration of all molecular ions even at diverse charge states.

Example 9. Simulated Isotope Distribution can be Computed by Intensity List-Based Cluster Deduction by MacroCluster

(85) Prior to ion cluster prediction, the isotope distribution intensities for various numbers of atoms of each element, C, H, O, N and S, were separately computed and recorded in element-specific intensity lists (FIGS. 19-23). To establish the simulated clusters of a protein, the imported primary structure information, such as sequence and PTMs, is summed up as the chemical formula C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z. The isotope distributions of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z are obtained by looking up the element-specific intensity lists. The single-element ion clusters selected by this look-up table procedure are then processed with the following Merger algorithm to obtain the simulated ion cluster of the putative protein.

Example 10. Computation of Simulated Ion Cluster Based on Gradually Combining Single-Element Ion Clusters by Merger Algorithm

(86) To calculate the ion distribution of a therapeutic with chemical formula C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z, the look-up table procedure is designed to obtain each element-specific ion cluster based on the total number of atoms of that element within the molecule. The combination of elements then starts from merging C.sub.v and H.sub.w into C.sub.vH.sub.w with I.sub.m,CH=Σ.sub.i=0.sup.mI.sub.i,C×I.sub.(m−i),H. Merging the elements O.sub.x, N.sub.y and S.sub.z into the intermediates C.sub.vH.sub.w, C.sub.vH.sub.wO.sub.x and C.sub.vH.sub.wO.sub.xN.sub.y follows the same concept, with the formulas I.sub.m,CHO=Σ.sub.i=0.sup.mI.sub.i,CH×I.sub.(m−i),O, I.sub.m,CHON=Σ.sub.i=0.sup.mI.sub.i,CHO×I.sub.(m−i),N and I.sub.m,CHONS=Σ.sub.i=0.sup.mI.sub.i,CHON×I.sub.(m−i),S, respectively.
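
Each of these merging steps is a discrete convolution of two intensity lists indexed by peak number. A minimal sketch is given below; the function names are hypothetical, and the single-element clusters are assumed to be plain lists whose index is the peak number, with index 0 as the monoisotopic peak.

```python
def merge(a, b):
    """I_m = sum over i of a[i] * b[m - i], for every peak number m."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def merge_all(clusters):
    """Merge single-element clusters one by one, e.g. C, then H, O, N, S."""
    result = [1.0]
    for cluster in clusters:
        result = merge(result, cluster)
    return result


# Toy two-peak clusters, used only to show the arithmetic.
toy = merge([0.9, 0.1], [0.8, 0.2])  # approximately [0.72, 0.26, 0.02]
```

Because each input list sums to 1, the merged list also sums to 1, which gives a quick sanity check on an implementation.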

Example 11. Computation of Isotope Distribution for Intensity List Construction of Elements, C, H and N, Using Dynamic Programming

(87) Among amino acid-composed elements, C, H and N have two natural isotopes. We assume the monoisotopic ion is given by peak number 0. For N, the deduction of intensity for peak number 0 can be written as I.sub.0,y=A.sub.14.sub.N×I.sub.0,y−1, while intensity of peak number, n, can be defined as I.sub.n,y=A.sub.14.sub.N×I.sub.n,y−1+A.sub.15.sub.N×I.sub.n−1,y−1. A.sub.14.sub.N, A.sub.15.sub.N: Natural abundances of .sup.14N and .sup.15N respectively; I.sub.0,y: Intensity of peak number, 0, with total atom number, y. See FIG. 10.
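
The recurrence for a two-isotope element can be sketched as follows, building the cluster for y atoms from the cluster for y−1 atoms. The function name is hypothetical and the abundance values for nitrogen are approximate natural abundances supplied for illustration; they are not taken from the text.

```python
def two_isotope_cluster(a_light, a_heavy, atoms):
    """Intensities I_n for `atoms` atoms of an element with two isotopes.

    Implements I_n,y = a_light * I_n,y-1 + a_heavy * I_n-1,y-1,
    starting from a single peak at position 0 for zero atoms.
    """
    cluster = [1.0]
    for _ in range(atoms):
        nxt = [0.0] * (len(cluster) + 1)
        for n, inten in enumerate(cluster):
            nxt[n] += a_light * inten      # atom added as the light isotope
            nxt[n + 1] += a_heavy * inten  # atom added as the heavy isotope
        cluster = nxt
    return cluster


# Approximate natural abundances of 14N and 15N (illustrative values).
A_14N, A_15N = 0.99636, 0.00364
n228 = two_isotope_cluster(A_14N, A_15N, 228)  # N.sub.228 as in the trisaccharide form
```

Storing the intermediate list after every iteration would give an intensity list of the kind looked up in Example 9.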

Example 12. Computation of Isotope Distribution for Intensity List Construction of Element, O, Using Dynamic Programming

(88) The amino acid-composed element, O, has three natural isotopes, .sup.16O, .sup.17O and .sup.18O. The deduction of intensity for monoisotopic ion can be written as I.sub.0,x=A.sub.16.sub.O×I.sub.0,x−1, while intensity of peak number, n, can be defined as I.sub.n,x=A.sub.16.sub.O×I.sub.n,x−1+A.sub.17.sub.O×I.sub.n−1,x−1+A.sub.18.sub.O×I.sub.n−2,x−1. A.sub.16.sub.O, A.sub.17.sub.O, and A.sub.18.sub.O: Natural abundances of .sup.16O, .sup.17O and .sup.18O respectively; I.sub.0,x: Intensity of peak number, 0, with total atom number, x. See FIG. 11.

Example 13. Computation of Isotope Distribution for Intensity List Construction of Element, S, Using Dynamic Programming

(89) The amino acid-composed element, S, has four natural isotopes, .sup.32S, .sup.33S, .sup.34S and .sup.36S. The deduction of intensity for the monoisotopic ion can be written as I.sub.0,z=A.sub.32.sub.S×I.sub.0,z−1, while the intensity of peak number, n, can be defined as I.sub.n,z=A.sub.32.sub.S×I.sub.n,z−1+A.sub.33.sub.S×I.sub.n−1,z−1+A.sub.34.sub.S×I.sub.n−2,z−1+A.sub.36.sub.S×I.sub.n−4,z−1. See FIG. 12.
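
The sulfur recurrence differs from the two-isotope case only in the set of peak shifts (0, 1, 2 and 4), so it can be written once for any element. In the sketch below the function name and the mapping from shift to abundance are assumptions of this illustration, and the sulfur abundances are approximate natural values rather than figures from the text.

```python
def element_cluster(abundance_by_shift, atoms):
    """Intensities by peak number for `atoms` atoms of one element.

    abundance_by_shift maps the peak shift of each isotope (0 for the
    lightest) to its natural abundance, e.g. shifts 0, 1, 2 and 4 for sulfur.
    """
    max_shift = max(abundance_by_shift)
    cluster = [1.0]
    for _ in range(atoms):
        nxt = [0.0] * (len(cluster) + max_shift)
        for n, inten in enumerate(cluster):
            for shift, abundance in abundance_by_shift.items():
                nxt[n + shift] += abundance * inten
        cluster = nxt
    return cluster


# Approximate natural abundances of 32S, 33S, 34S and 36S (illustrative).
SULFUR = {0: 0.9499, 1: 0.0075, 2: 0.0425, 4: 0.0001}
s5 = element_cluster(SULFUR, 5)  # S.sub.5 as in the erythropoietin formulas
```

The same function reproduces the oxygen recurrence of Example 12 when it is given shifts 0, 1 and 2.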

Example 14. Program Macro Cluster Developed Based on a Sequential Merging Approach

(90) We have previously developed a method to deduce the monoisotopic mass of a protein therapeutic through documentation of the relationship between the monoisotopic mass (M.sub.mi) and the most abundant mass (M.sub.ma) determined using high-resolution mass spectrometry. We found that it was sometimes difficult to perform accurate M.sub.mi deduction when there are several signals with similar intensities in the ion cluster. The similarity in peak intensities creates ambiguity in the assignment of the M.sub.ma peak, and a misassigned M.sub.ma may lead to a major error in M.sub.mi determination. Thus, we were prompted to take into consideration all the signals in the ion cluster, rather than one single M.sub.ma signal, in the characterization of protein primary structure. Hence, automatic generation of a full simulated ion cluster from the input primary structure should be established.

(91) In order to profile the simulated ion cluster, we first needed to develop methods that can calculate the relative abundances of different isotopologues that are made of isotopes of five elements: carbon (C), hydrogen (H), nitrogen (N), oxygen (O) and sulfur (S). These methods then need to sum together the abundances of those isotopologues whose molecular masses are too close to be resolved by mass spectrometry. This summation is facilitated by the fact that the mass difference between the lightest isotope (the M.sub.mi isotope) and the other isotopes (non-M.sub.mi isotopes) of any of these elements is very close to 1 Dalton or a multiple thereof. Specifically, the .sup.13C-.sup.12C mass difference is 1.003355 Da; the .sup.2H-.sup.1H difference is 1.006277 Da; the .sup.17O-.sup.16O and .sup.18O-.sup.16O differences are 1.004218 and 2.004246 Da, respectively; the .sup.15N-.sup.14N difference is 0.997035 Da; and the .sup.33S-.sup.32S, .sup.34S-.sup.32S and .sup.36S-.sup.32S differences are 0.999387, 1.995796 and 3.99501 Da, respectively. Thus, the use of any non-M.sub.mi isotope shifts the mass in units of ˜1 Da. If the peak containing only M.sub.mi isotopes is taken as the original position, or position 0, the use of any lightest non-M.sub.mi isotope, e.g. .sup.13C, .sup.2H, .sup.17O, .sup.15N or .sup.33S, moves the isotopologue from position 0 to position 1. Likewise, the use of any second-lightest non-M.sub.mi isotope, e.g. .sup.18O or .sup.34S, moves the isotopologue from position 0 to position 2. In other words, the numbers and types of non-M.sub.mi isotopes in an isotopologue determine its position in the ion cluster. Given this principle, the isotopologues expected at the same cluster position can be identified, grouped and merged together to deduce their collective abundance in the mass spectrum.

(92) Based on this concept, we can simply define the cluster position of isotopologues in an ion cluster. For each single-element isotopologue, its position number is equal to the result of the following equation:
Σ.sub.i=2.sup.5[(Σ.sub.iN)×(i−1)],
where .sub.iN is the number of the ith lightest isotope of the element included in the single-element isotopologue (see Table 1) and (i−1) is the rounded integer of (MM.sub.I-MM.sub.MN), where MM.sub.I is the molecular mass of the isotope I and MM.sub.MN is the monoisotopic mass of the element.

(93) Furthermore, for a molecule with multiple elements, such as a protein, the position number of a multi-element isotopologue is equal to the result of the following equation:
Σ.sub.i=2.sup.5[(Σ.sub.iN.sub.e(j))×(i−1)],
where .sub.iN.sub.e(j) is the number of the ith lightest isotope of the jth element, e(j), included in the multi-element isotopologue, the inner summation runs over all elements e(j), and (i−1) is the rounded integer of (MM.sub.I-MM.sub.MN), where MM.sub.I is the molecular mass of the isotope I and MM.sub.MN is the monoisotopic mass of the element. The second (2nd) lightest isotopes, as i=2, are .sup.13C, .sup.2H, .sup.15N, .sup.17O, .sup.33S; the third (3rd) lightest isotopes, as i=3, are .sup.14C, .sup.3H, .sup.16N, .sup.18O, .sup.34S; the fourth (4th) lightest isotope, as i=4, is .sup.35S; and the fifth (5th) lightest isotope, as i=5, is .sup.36S (see Table 1). Hence, isotopologues of a polypeptide with multiple elements can be grouped based on their position number in an ion cluster.
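The position rule can be illustrated with a minimal Python sketch. It assumes only the mass differences quoted in this example; the names `MASS_DIFFERENCE`, `position_shift` and `position_number` are hypothetical and are not part of the MacroCluster program.

```python
# Mass differences (Da) from the lightest isotope of each element,
# as quoted in this example.
MASS_DIFFERENCE = {
    "13C": 1.003355,
    "2H": 1.006277,
    "17O": 1.004218,
    "18O": 2.004246,
    "15N": 0.997035,
    "33S": 0.999387,
    "34S": 1.995796,
    "36S": 3.99501,
}


def position_shift(isotope: str) -> int:
    """Positional shift caused by one atom of a non-monoisotopic isotope."""
    return round(MASS_DIFFERENCE[isotope])


def position_number(heavy_isotope_counts: dict) -> int:
    """Cluster position of an isotopologue, given its counts of heavy isotopes."""
    return sum(position_shift(isotope) * count
               for isotope, count in heavy_isotope_counts.items())


# Hypothetical isotopologue carrying two 13C, one 15N, one 18O and one 34S.
example_position = position_number({"13C": 2, "15N": 1, "18O": 1, "34S": 1})
```

In this sketch an isotopologue made only of the lightest isotopes is represented by an empty dictionary and stays at position 0.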

(94) The MacroCluster program uses a stepwise process based on a group of ion clusters, each of which contains only one type of element, with the same number of atoms of that element as the analyzed molecule. For instance, if a protein has a chemical formula of C.sub.vH.sub.wO.sub.xN.sub.yS.sub.z, the ion clusters of C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z, which have been acquired beforehand using a dynamic programming approach (see below), are fetched and then sequentially merged (FIG. 8). First, the C.sub.v ion cluster is merged with the H.sub.w one. Based on the principle shown above, the new position of two merged peaks is equal to the sum of their two position numbers prior to merging. Thus, the peak intensity (I.sub.m,CH) of the m-th position of the resulting C.sub.vH.sub.w cluster can be deduced according to the following:
I.sub.m,CH=Σ.sub.i=0.sup.mI.sub.i,C×I.sub.(m−i),H
where I.sub.i,C is the intensity of the peak at the i-th position of the C.sub.v cluster and I.sub.(m−i),H is the intensity of the peak at the (m−i)-th position of the H.sub.w cluster. Notably, for the molecular mass range of most proteins, the intensities are high enough for consideration only at the first few dozen positions (data not shown). It is therefore not necessary to perform a full calculation of all possible signals. Instead, we only carry out the merging of the first one hundred peaks of any cluster involved in the merging process. Once the peak intensities of the C.sub.vH.sub.w cluster are deduced, the following merging calculations continue (FIG. 9):
I.sub.m,CHO=Σ.sub.i=0.sup.mI.sub.i,CH×I.sub.(m−i),O,
I.sub.m,CHON=Σ.sub.i=0.sup.mI.sub.i,CHO×I.sub.(m−i),N, and
I.sub.m,CHONS=Σ.sub.i=0.sup.mI.sub.i,CHON×I.sub.(m−i),S,
with the same merging principles mentioned above. Notably, only the peak intensities from 0- to 99-th positions are produced.
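The sequential merging can be sketched in Python as a truncated discrete convolution. This is an illustrative sketch rather than the MacroCluster code itself; the function names and the `max_peaks` parameter are assumptions, with the default of 100 taken from the text.

```python
from functools import reduce

MAX_PEAKS = 100  # only positions 0 to 99 are kept, as stated in the text


def merge_clusters(cluster_a, cluster_b, max_peaks=MAX_PEAKS):
    """Merge two clusters: I_m = sum over i of I_i,a * I_(m-i),b for m < max_peaks."""
    merged = [0.0] * max_peaks
    for i, intensity_a in enumerate(cluster_a[:max_peaks]):
        # Only partners that keep the summed position i + j below max_peaks.
        for j, intensity_b in enumerate(cluster_b[:max_peaks - i]):
            merged[i + j] += intensity_a * intensity_b
    return merged


def merge_all(clusters, max_peaks=MAX_PEAKS):
    """Merge element clusters in order, e.g. [C_v, H_w, O_x, N_y, S_z]."""
    return reduce(lambda acc, nxt: merge_clusters(acc, nxt, max_peaks), clusters)


# Toy two-peak clusters, used only to show the arithmetic.
toy = merge_clusters([0.9, 0.1], [0.8, 0.2], max_peaks=4)
```

For the toy input, position 1 collects two contributions (0.9×0.2 and 0.1×0.8), which mirrors how isotopologues with the same position number are summed.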

(95) Intensity lists are preprocessed for rapid acquisition of needed information by Macro Cluster program

(96) As shown above, we need a series of ion clusters that differ by only 1 Da in mass for the integration analyses. It seems reasonable that the throughput in generating these ion clusters can be drastically improved by pre-calculation. Hence, we decided to establish element-specific intensity lists, each of which contains the simulated ion clusters of imaginary compounds like C.sub.v, H.sub.w, O.sub.x, N.sub.y and S.sub.z. In order to generate such a list, we tested binomial and polynomial expansion methods (Yergey, Int J Mass Spectrom Ion Phys 52, 337-349 (1983); Yergey et al., Anal Chem 55, 353-356 (1983)), but found that larger errors may occur when the atom numbers increase beyond certain limits (data not shown). Instead, we developed a dynamic programming approach on the basis of the mentioned principle that inclusion of non-M.sub.mi isotopes causes corresponding positional shifts in ion clusters. For elements with two stable isotopes, e.g. carbon, there are two and only two types of ‘pathways’ to synthesize the isotopologues present in the n-th peak of the ion cluster C.sub.v. The first way is to add .sup.12C, the M.sub.mi isotope, to those isotopologues present at the same position (n) of the C.sub.v−1 cluster; such a ‘synthesis’ does not produce a positional shift. The other way is to add .sup.13C, the only non-M.sub.mi carbon isotope, to those isotopologues present at the (n−1) position of the C.sub.v−1 cluster. Since the lightest non-M.sub.mi isotope causes a positional shift of 1, all of these products will be found at the n-th position of the C.sub.v cluster. Thus, the intensity (I.sub.n,v) of the n-th peak of the ion cluster C.sub.v should be equal to:
I.sub.n,v=A.sub.12.sub.C×I.sub.n,v−1+A.sub.13.sub.C×I.sub.n−1,v−1,
where A.sub.12.sub.C and A.sub.13.sub.C are the natural abundances of .sup.12C and .sup.13C, and I.sub.n,v−1 and I.sub.n−1,v−1 are the peak intensities of the n-th and (n−1)-th peaks in the C.sub.v−1 cluster. Likewise, the intensities (I.sub.n,w and I.sub.n,y) of the n-th peaks of the ion clusters H.sub.w and N.sub.y correspond to:
I.sub.n,w=A.sub.1.sub.H×I.sub.n,w−1+A.sub.2.sub.H×I.sub.n−1,w−1, and
I.sub.n,y=A.sub.14.sub.N×I.sub.n,y−1+A.sub.15.sub.N×I.sub.n−1,y−1, respectively (FIG. 10).
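The two-isotope recurrence lends itself to a short dynamic-programming sketch in Python that builds an intensity list for every atom count up to a chosen maximum. The abundance values are approximate, commonly tabulated figures and are an assumption here (they are not taken from Table 1); `build_intensity_list` is a hypothetical name.

```python
from math import comb

# ASSUMPTION: approximate natural abundances of 12C and 13C.
A_12C, A_13C = 0.9893, 0.0107


def build_intensity_list(max_atoms, a_light, a_heavy, max_peaks=100):
    """Return clusters where clusters[v][n] is the intensity I_n,v.

    Each cluster is derived from the one with one fewer atom:
    I_n,v = a_light * I_n,v-1 + a_heavy * I_n-1,v-1.
    """
    # Zero atoms: all intensity sits at position 0.
    clusters = [[1.0] + [0.0] * (max_peaks - 1)]
    for _ in range(max_atoms):
        prev = clusters[-1]
        current = [a_light * prev[0]]
        current += [a_light * prev[n] + a_heavy * prev[n - 1]
                    for n in range(1, max_peaks)]
        clusters.append(current)
    return clusters


carbon_list = build_intensity_list(10, A_12C, A_13C, max_peaks=12)
# For a two-isotope element, small cases can be cross-checked
# against the closed-form binomial term.
binomial_check = comb(10, 2) * A_13C ** 2 * A_12C ** 8
```

Because every cluster is obtained from its predecessor, the whole list up to C.sub.v costs one pass per atom, and any cluster in it can later be fetched without recomputation.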

(97) This means that the ion cluster of each element-specific compound with a specific atom number can always be deduced from the cluster of the compound with one fewer atom. This principle can be further extended to the calculation of O.sub.x and S.sub.z clusters. For the former, there exist two non-M.sub.mi isotopes, .sup.17O and .sup.18O, whose inclusion leads to positional shifts of one and two steps, respectively. Thus, the intensity (I.sub.n,x) of the n-th peak of the ion cluster O.sub.x should be equal to:
I.sub.n,x=A.sub.16.sub.O×I.sub.n,x−1+A.sub.17.sub.O×I.sub.n−1,x−1+A.sub.18.sub.O×I.sub.n−2,x−1,
where A.sub.16.sub.O, A.sub.17.sub.O and A.sub.18.sub.O are the natural abundances of .sup.16O, .sup.17O and .sup.18O, and I.sub.n,x−1, I.sub.n−1,x−1 and I.sub.n−2,x−1 are the peak intensities of the n-th, (n−1)-th and (n−2)-th peaks in the O.sub.x−1 cluster (FIG. 11). When S.sub.z clusters are built, there are three non-M.sub.mi isotopes, namely .sup.33S, .sup.34S and .sup.36S, to consider. The intensity (I.sub.n,z) of the n-th peak of the ion cluster S.sub.z should be equal to:
I.sub.n,z=A.sub.32.sub.S×I.sub.n,z−1+A.sub.33.sub.S×I.sub.n−1,z−1+A.sub.34.sub.S×I.sub.n−2,z−1+A.sub.36.sub.S×I.sub.n−4,z−1,
where A.sub.32.sub.S, A.sub.33.sub.S, A.sub.34.sub.S and A.sub.36.sub.S are natural percentages of .sup.32S, .sup.33S, .sup.34S and .sup.36S, and I.sub.n,z−1, I.sub.n−1,z−1, I.sub.n−2,z−1 and I.sub.n−4,z−1 are the peak intensities of the n-, (n−1)-, (n−2)- and (n−4)-th peaks in the S.sub.z−1 cluster (FIG. 12). With these equations, we have generated the intensity lists for these five elements using computer programming (see FIGS. 19-23).
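The oxygen and sulfur recurrences differ from the two-isotope case only in the set of positional shifts, so one step of the dynamic programming can be written generically. The Python sketch below is illustrative only: the abundance values are approximate, commonly tabulated figures (an assumption, not values from Table 1), and `add_atom` and `element_cluster` are hypothetical names.

```python
# ASSUMPTION: approximate natural abundances, keyed by positional shift.
OXYGEN = {0: 0.99757, 1: 0.00038, 2: 0.00205}              # 16O, 17O, 18O
SULFUR = {0: 0.9499, 1: 0.0075, 2: 0.0425, 4: 0.0001}      # 32S, 33S, 34S, 36S


def add_atom(cluster, abundance_by_shift):
    """One DP step: I_n,x = sum over isotopes of A_isotope * I_(n-shift),x-1."""
    size = len(cluster)
    updated = [0.0] * size
    for n in range(size):
        for shift, abundance in abundance_by_shift.items():
            if n - shift >= 0:
                updated[n] += abundance * cluster[n - shift]
    return updated


def element_cluster(n_atoms, abundance_by_shift, max_peaks=100):
    """Cluster of a single-element compound with n_atoms atoms."""
    cluster = [1.0] + [0.0] * (max_peaks - 1)  # zero atoms: position 0 only
    for _ in range(n_atoms):
        cluster = add_atom(cluster, abundance_by_shift)
    return cluster


s1 = element_cluster(1, SULFUR, max_peaks=6)
o2 = element_cluster(2, OXYGEN, max_peaks=6)
```

In this sketch the gap at position 3 of the one-atom sulfur cluster reflects that no shift of 3 is listed for sulfur, while .sup.36S contributes at position 4.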

Example 15. Primary Structure Verification of De-N-Glycosylated Eprex Using CompareMS Program

(98) To verify the chemical formula of the examined therapeutic, de-N-glycosylated Eprex with an O-linked trisaccharide, the observed master ion cluster (solid-lined profile) was first obtained by MS analysis followed by informatics-based processing with IntegrateMS (FIG. 13A). Meanwhile, the sequence of this construct was provided to MacroCluster. The simulated ion cluster of the putative therapeutic was established (dash-lined profile with m as zero), and a series of predicted ion clusters for the putative chemical formulas with several hydrogen atoms added or removed was also constructed (dash-lined profiles). The numbers at the top of the bars in the lower graph are the difference scores (DS) for these derivatives. The CompareMS result for de-N-glycosylated Eprex with an O-linked tetrasaccharide is shown in FIG. 13B. Lot number of Eprex: EFS5600.

Example 16. Primary Structure Verification of De-N-Glycosylated Recormon Using CompareMS Program

(99) To verify the chemical formula of the examined therapeutic, de-N-glycosylated Recormon with an O-linked trisaccharide, the observed master ion cluster (solid-lined profile) was first obtained by MS analysis followed by informatics-based processing with IntegrateMS (FIG. 14A). Meanwhile, the sequence of this construct was provided to MacroCluster. The simulated ion cluster of the putative therapeutic was established (dash-lined profile with m as zero), and a series of predicted ion clusters for the putative chemical formulas with several hydrogen atoms added or removed was also constructed (dash-lined profiles). The numbers at the top of the bars in the lower graph are the difference scores (DS) for these derivatives. The CompareMS result for de-N-glycosylated Recormon with an O-linked tetrasaccharide is shown in FIG. 14B. Lot number of Recormon: H0743H01.

Example 17. CompareMS Program Finds the Match of Master Ion Cluster from a Series of Ion Clusters Produced by MacroCluster

(100) In order to validate the chemical formula of the protein analyte, CompareMS has been coded to employ MacroCluster to produce ion clusters for a series of compounds whose formulas differ from one another by exactly one H atom.

(101) Routinely, three compounds with extra H atoms are produced, i.e. one to three H atoms are added to the chemical formula of the protein analyte. Likewise, three compounds with fewer H atoms are produced, i.e. one to three H atoms are removed from the original chemical formula. Together with the original formula, the ion clusters of a total of seven compounds are then produced. In order to quantify the difference between the master ion cluster and each of the seven ion clusters, CompareMS assigns a parameter, the difference score (DS), to each ion cluster. The difference score is defined as:

(102) $$X^{2}=\sum_{k=0}^{n}\frac{(A_{o,k}-E_{o,k})^{2}}{E_{o,k}}+\sum_{k=0}^{n}\frac{(A_{t,k}-E_{t,k})^{2}}{E_{t,k}},$$
where A.sub.o,k and A.sub.t,k represent the relative abundances of k-th peaks in the observed and simulated clusters, while E.sub.o,k and E.sub.t,k represent the expected abundances of k-th peaks, respectively. A smaller DS means higher similarity between the two ion clusters. Among all the examined clusters, the one with smallest DS is marked and we examine whether its chemical formula is consistent with the listed protein primary structure. If the answer is positive, the preliminary validation of protein primary structure is completed (FIG. 1).
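A minimal Python sketch of this comparison step follows. It is not the CompareMS code. The text does not state how the expected abundances E.sub.o,k and E.sub.t,k are obtained, so the sketch takes them as explicit inputs and, purely as a labeled assumption, offers a helper that uses the mean of the observed and simulated abundances. All function names are hypothetical.

```python
def hydrogen_series(formula, max_delta=3):
    """Formulas with -max_delta..+max_delta H atoms relative to the input formula."""
    return {delta: {**formula, "H": formula["H"] + delta}
            for delta in range(-max_delta, max_delta + 1)}


def pooled_expected(a_obs, a_sim):
    """ASSUMPTION: expected abundance = mean of observed and simulated values."""
    return [(o + t) / 2 for o, t in zip(a_obs, a_sim)]


def difference_score(a_obs, a_sim, e_obs, e_sim):
    """Sum of (A - E)^2 / E over the observed and the simulated cluster."""
    def term(actual, expected):
        # Peaks with zero expected abundance are skipped to avoid division by zero.
        return sum((a - e) ** 2 / e for a, e in zip(actual, expected) if e > 0)
    return term(a_obs, e_obs) + term(a_sim, e_sim)


def best_fit(observed, simulated_by_delta):
    """Score every simulated cluster and return (delta with smallest DS, all scores)."""
    scores = {}
    for delta, simulated in simulated_by_delta.items():
        expected = pooled_expected(observed, simulated)
        scores[delta] = difference_score(observed, simulated, expected, expected)
    return min(scores, key=scores.get), scores


# Formula quoted in the text for the trisaccharide-bearing erythropoietin.
series = hydrogen_series({"C": 834, "H": 1338, "O": 261, "N": 228, "S": 5})
# Toy three-peak clusters, used only to exercise the scoring.
toy_observed = [0.2, 0.5, 0.3]
toy_simulated = {-1: [0.3, 0.5, 0.2], 0: [0.2, 0.5, 0.3], 1: [0.1, 0.5, 0.4]}
best_delta, toy_scores = best_fit(toy_observed, toy_simulated)
```

In practice each formula in `series` would be passed to the cluster simulation, and the formula whose cluster gives the smallest score would be compared with the listed primary structure.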

Example 18. The Primary Structures of De-N-Glycosylated Erythropoietins are Validated Using CompareMS Program

(103) The primary structure of erythropoietin after removal of the N-linked glycans is said to contain a 165-amino-acid sequence in which three asparagines are replaced by three aspartic acids, one O-linked glycosylation and two disulfide bonds. Among these modifications, the O-linked glycosylation is expressed as the addition of either one trisaccharide or one tetrasaccharide (FIG. 2). As mentioned above, we have produced the master ion clusters for de-N-glycosylated erythropoietins with an O-linked trisaccharide and with an O-linked tetrasaccharide, respectively. We then used their chemical formulas to produce the respective series of simulated ion clusters. For the analysis of de-N-glycosylated erythropoietin with an O-linked trisaccharide, we used C.sub.834H.sub.1338O.sub.261N.sub.228S.sub.5 to produce the seven ion clusters, and a difference score (DS) was assigned to each of these clusters. We found that, for both Eprex and Recormon, the structure without H added or removed has the lowest difference score (i.e. 0.08 for Eprex and 0.01 for Recormon). As the structures with one H removed and one H added have very similar difference scores (FIGS. 13 and 14), these data suggest that the majority of the de-N-glycosylated erythropoietin with two disulfide bonds and bearing an O-linked trisaccharide has the chemical formula C.sub.834H.sub.1338O.sub.261N.sub.228S.sub.5. Indeed, our analyses successfully validate the primary structure of de-N-glycosylated erythropoietin. For the analysis of de-N-glycosylated erythropoietin with an O-linked tetrasaccharide, we used C.sub.845H.sub.1355O.sub.269N.sub.229S.sub.5 to produce the simulated ion clusters, and difference scores were calculated. We again found that the structure without H added or removed has the lowest difference score for both erythropoietin brands (i.e. 0.35 for Eprex and 0.1 for Recormon). The two structures with one extra H and one fewer H had similar difference scores (FIGS. 13 and 14). Hence, we conclude that the chemical formula of the de-N-glycosylated erythropoietin with an O-linked tetrasaccharide should be C.sub.845H.sub.1355O.sub.269N.sub.229S.sub.5, which also validates the listed primary structure.

(104) To evaluate the content of erythropoietin products with different brands, Eprex and Recormon are analyzed in triplicate through our intact protein analyses.

(105) Triplicate experiments on the differently branded erythropoietin samples were performed through our informatics-based procedures to assure method repeatability. The abundances of de-N-glycosylated erythropoietins with an O-linked trisaccharide or an O-linked tetrasaccharide were recorded and compared across the runs. The mean ratio of trisaccharide-containing to tetrasaccharide-containing erythropoietin over the triplicate runs is 1.21±0.19 for Eprex and 0.63±0.04 for Recormon (Table 2). The low standard deviations first show the reproducibility of our analytical methods. Furthermore, our platform reveals different ratios of O-linked oligosaccharide content in the two differently branded erythropoietin products. This indicates that our methods can not only qualitatively verify the primary structure of proteins, but also quantitatively determine the ratios of modifications on the intact protein structure. This utility can be further applied to quality control of protein therapeutics, such as detection of lot-to-lot variations or assessment of the similarity of variously branded protein products.

(106) TABLE 2. The ratios of trisaccharide-modified to tetrasaccharide-modified erythropoietins from Eprex and Recormon.

| Protein (#lot) | No. | No. of sugars | Abundance | Tri/Tetra | Mean (±S.D.) |
| Eprex ® (#EFS5600) | 1. | Tri | 1.55E+04 | 1.00 | 1.21 (±0.19) |
| | | Tetra | 1.56E+04 | | |
| | 2. | Tri | 3.05E+04 | 1.38 | |
| | | Tetra | 2.21E+04 | | |
| | 3. | Tri | 2.92E+04 | 1.26 | |
| | | Tetra | 2.32E+04 | | |
| Recormon ® (#H0743H01) | 1. | Tri | 4.62E+04 | 0.66 | 0.63 (±0.04) |
| | | Tetra | 6.96E+04 | | |
| | 2. | Tri | 5.00E+04 | 0.63 | |
| | | Tetra | 7.88E+04 | | |
| | 3. | Tri | 5.46E+04 | 0.59 | |
| | | Tetra | 9.23E+05 | | |
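The summary statistics in Table 2 can be recomputed from the per-run Tri/Tetra ratios with a few lines of Python. The text does not state which standard deviation formula was used; the sketch assumes the sample standard deviation (n−1 denominator), which appears to reproduce the reported values. The names `RATIOS` and `summarize` are hypothetical.

```python
from statistics import mean, stdev

# Tri/Tetra ratios per run, copied from Table 2.
RATIOS = {
    "Eprex": [1.00, 1.38, 1.26],
    "Recormon": [0.66, 0.63, 0.59],
}


def summarize(ratios):
    """Return (mean, sample standard deviation), each rounded to two decimals."""
    return round(mean(ratios), 2), round(stdev(ratios), 2)


summary = {brand: summarize(values) for brand, values in RATIOS.items()}
```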

Example 19. Protein Sequence of the Tested Therapeutic, Humulin R

(107) Primary structure information of Humulin R is provided as the baseline for establishing the simulated ion clusters. The putative therapeutic, Humulin R, is supposed to contain A and B polypeptide chains and three disulfide linkages (FIG. 15, solid lines), which gives the putative chemical formula C.sub.257H.sub.383O.sub.77N.sub.65S.sub.6. Throughout our protein analyte verification study, we verified the proposed protein primary structure.

Example 20. Primary Structure Verification of Protein Therapeutic, Humulin R, Using the Present Invention

(108) To verify the chemical formula of the examined therapeutic, Humulin R, the observed master ion cluster (FIG. 16, solid-lined profile) was first obtained by MS analysis followed by informatics-based processing using IntegrateMS. Meanwhile, the sequence of this construct was provided to MacroCluster. The simulated ion cluster of the putative therapeutic was established (FIG. 16, dash-lined profile with m as zero), and a series of predicted ion clusters for the putative chemical formulas with several hydrogen atoms added or removed was also constructed (FIG. 16, dash-lined profiles). The numbers at the top of the bars in the lower graph are the difference scores (DS) for these derivatives. Lot number of Humulin R: A930615.

Example 21. Protein Sequence of the Tested Therapeutic, Saizen

(109) Primary structure information of Saizen is provided as the baseline for establishing the simulated ion clusters. The putative therapeutic, Saizen, is supposed to contain 191 amino acids and two disulfide linkages (FIG. 17, solid lines), which gives the putative chemical formula C.sub.990H.sub.1528O.sub.300N.sub.262S.sub.7. Throughout our protein analyte verification study, we verified the proposed protein primary structure.

Example 22. Primary Structure Verification of Protein Therapeutic, Saizen

(110) To verify the chemical formula of the examined therapeutic, Saizen, the observed master ion cluster (FIG. 18, solid-lined profile) was first obtained by MS analysis followed by informatics-based processing using IntegrateMS. Meanwhile, the sequence of this construct was provided to MacroCluster. The simulated ion cluster of the putative therapeutic was established (FIG. 18, dash-lined profile with m as zero), and a series of predicted ion clusters for the putative chemical formulas with several hydrogen atoms added or removed was also constructed (FIG. 18, dash-lined profiles). The numbers at the top of the bars in the lower graph are the difference scores (DS) for these derivatives. Lot number of Saizen: BA020963.

Example 23. Applications of Our Methods on Quality Control of Various Protein Therapeutics

(111) Verification of protein primary structure is an important step in the quality control of protein therapeutics after their production in a biological system. Different brands of erythropoietin were used as the first examples to test our method for verifying protein primary structure. To broaden the application of this method, we then tested other protein drugs, namely Humulin R and Saizen, with our approach. Humulin is similar to the insulin that the body makes naturally and is indicated as an adjunct to diet and exercise to improve glycemic control in adults and children with type 1 and type 2 diabetes mellitus. For the Humulin R analyses, the chemical formula C.sub.257H.sub.383O.sub.77N.sub.65S.sub.6 was used to produce the simulated ion clusters, and the structure without H added or removed has the lowest difference score (i.e. 0.00) (FIGS. 15 and 16). This successfully validates the primary structure of Humulin R with three disulfide bonds. Saizen is a prescription medicine indicated for the treatment of growth hormone deficiency (GHD) in children and adults. The Saizen structure used in treatment is identical to the growth hormone produced by the pituitary gland. For the Saizen analyses, we used C.sub.990H.sub.1528O.sub.300N.sub.262S.sub.7 to produce the simulated ion clusters, and difference scores were calculated. The structure without H added or removed was found to have the lowest difference score for Saizen (i.e. 0.16), which is consistent with the listed primary structure of Saizen with two disulfide bonds (FIGS. 17 and 18).

(112) In summary, we have developed a series of computer programs that can be used to evaluate whether the chemical formula determined by high-resolution mass spectrometry is consistent with its protein primary structure. Since such evaluation is rapid, effective and consistent, this method can be applied to quality control of protein therapeutics.