Peak correlation and clustering in fluidic sample separation

Abstract

A device for analyzing measurement data having a plurality of data sets, each data set being assigned to a respective one of a plurality of measurements, each data set having multiple features being indicative of different fractions of a fluidic sample, the device comprising a cluster determining unit configured for determining feature clusters by clustering features from different data sets presumably relating to the same fraction, a spread determining unit configured for determining for at least a part of the feature clusters a spread of the features within a respective feature cluster, and a display unit configured for displaying at least the part of the feature clusters together with a graphical indication of the corresponding spread.

Claims

1. A device for analyzing measurement data, the device comprising: a processor configured to receive the measurement data, the measurement data comprising a plurality of data sets corresponding to a plurality of respective measurements, wherein: the plurality of respective measurements are performed by a fluid separation apparatus in a plurality of respective measurement runs on a plurality of respective fluidic samples; each data set comprises a plurality of features indicative of different fractions of one of the plurality of respective fluidic samples; each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter; and the first measurement parameter is selected from the group consisting of: a retention time of a chromatography measurement, a retention volume of a chromatography measurement, and a mass to charge ratio of a coupled chromatography and mass spectroscopy measurement; a cluster determining unit configured to determine feature clusters by clustering features from different data sets corresponding to the same fraction based on at least one decision criterion, and further configured to determine, from the feature clusters, a suspicious feature for which a rule for clustering has failed; a spread determining unit configured to determine for at least a part of the feature clusters a spread of the features within a respective feature cluster; and a display unit configured to display at least the part of the feature clusters together with a graphical indication of the corresponding spread, including displaying the suspicious feature, and further configured to display at least the part of the feature clusters according to a coordinate system comprising a first axis and a second axis, wherein: the first axis corresponds to the value of the first measurement parameter; and the second axis corresponds to the number of the respective measurement run.

2. The device of claim 1, wherein the cluster determining unit is configured for: ordering at least a part of the features in accordance with the value of the first measurement parameter, particularly ordering from small to large values; and determining the feature clusters by clustering features to a respective feature cluster which fulfill a clustering condition that a difference regarding the value of the first measurement parameter between adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value.

3. The device of claim 2, wherein the first parameter is retention time, and the predetermined threshold value is a time interval indicative of a difference regarding a retention time of a corresponding fraction in different ones of the measurements.

4. The device of claim 2, wherein the predetermined threshold value is a time interval selected from the group consisting of: a time interval within a range from 0.001 minutes to 0.1 minutes; and a time interval within a range from 0.005 minutes to 0.08 minutes.

5. The device of claim 2, wherein the cluster determining unit is configured for excluding a feature from a feature cluster upon determining that this feature has a value of the first measurement parameter which is larger than a value of the first measurement parameter of another feature of the same data set by less than a predetermined further threshold value.

6. The device of claim 2, wherein the cluster determining unit is configured to determine the feature clusters by clustering all features to a respective feature cluster which fulfill the clustering condition among each other under consideration of a boundary condition that not more than one feature per data set may form part of the same feature cluster.

7. The device of claim 2, wherein the cluster determining unit is configured to determine whether a first and a last of the features in the ordered representation of a feature cluster have a difference regarding the value of the first measurement parameter of more than a predetermined further threshold value, and for triggering an action upon determining that the difference exceeds the predetermined further threshold value.

8. The device of claim 1, wherein the cluster determining unit is configured to determine the feature clusters using a non-recursive algorithm.

9. The device of claim 1, wherein the display unit is configured for displaying, as the graphical indication, a bar having a width corresponding to the respective spread.

10. The device of claim 1, wherein the value of the second measurement parameter for at least the part of the features is displayable encoded by a graphical property of a respective marker representing a corresponding feature in the coordinate system.

11. The device of claim 1, wherein the coordinate system is a Cartesian coordinate system.

12. The device of claim 10, wherein the graphical property is a size of the marker.

13. The device of claim 10, wherein the display unit is configured to display the graphical indication in an overlaying manner with the markers of the features of the corresponding feature cluster.

14. The device of claim 1, wherein the display unit is configured to display the graphical indication extending along the second axis.

15. The device of claim 1, wherein the second measurement parameter is indicative of a detection intensity of a peak of the first measurement parameter.

16. The device of claim 1, comprising a fraction identification unit configured to identify individual fractions assigned to features in different data sets by determining a match with preknown technical information, wherein the cluster determining unit is configured to determine feature clusters by clustering exclusively features which have not been assigned to individual fractions by the fraction identification unit.

17. The device of claim 1, wherein the display unit is configured to display a graphical user interface.

18. The device of claim 1, wherein the measurement data comprises liquid or gaseous chromatography data.

19. The device of claim 1, wherein the measurement data comprises coupled liquid or gaseous chromatography and mass spectroscopy data.

20. The device of claim 1, wherein the measurement data is provided by a measurement device comprising one selected from the group consisting of: a sensor device, a test device for testing a device under test or a substance, a device for chemical, biological and/or pharmaceutical analysis, a fluid separation system configured for separating compounds of a fluid, a capillary electrophoresis device, a liquid chromatography device, a gas chromatography device, an electronic measurement device, and a mass spectroscopy device.

21. A method of analyzing measurement data, the method comprising: receiving the measurement data, the measurement data comprising a plurality of data sets corresponding to a plurality of respective measurements, wherein: the plurality of respective measurements are performed by a fluid separation apparatus in a plurality of respective measurement runs on a plurality of respective fluidic samples; each data set comprises a plurality of features indicative of different fractions of one of the plurality of respective fluidic samples; and each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter; and the first measurement parameter is selected from the group consisting of: a retention time of a chromatography measurement, a retention volume of a chromatography measurement, and a mass to charge ratio of a coupled chromatography and mass spectroscopy measurement; determining feature clusters by clustering features from different data sets corresponding to the same fraction based on at least one decision criterion, and further determining, from the feature clusters, a suspicious feature for which a rule for clustering has failed; determining for at least a part of the feature clusters a spread of the features within a respective feature cluster; and displaying at least the part of the feature clusters together with a graphical indication of the corresponding spread, including displaying the suspicious feature, and further displaying at least the part of the feature clusters according to a coordinate system comprising a first axis and a second axis, wherein: the first axis corresponds to the value of the first measurement parameter; and the second axis corresponds to the number of the respective measurement run.

22. The device of claim 1, wherein the plurality of respective measurement runs correspond to a plurality of respective sample injections performed by the fluid separation apparatus.

23. A non-transitory computer-readable medium, comprising instructions stored thereon, that when executed on a processor, control or perform the steps of the method of claim 21.

24. A device for processing measurement data, the device comprising a processor configured to receive the measurement data, the measurement data comprising a plurality of data sets corresponding to a plurality of respective measurements, wherein: the plurality of respective measurements are performed by a fluid separation apparatus in a plurality of respective measurement runs on a plurality of respective fluidic samples; each data set comprises a plurality of features indicative of different fractions of one of the plurality of respective fluidic samples; each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter; and the first measurement parameter is selected from the group consisting of: a retention time of a chromatography measurement, a retention volume of a chromatography measurement, and a mass to charge ratio of a coupled chromatography and mass spectroscopy measurement; a cluster determining unit configured to determine feature clusters by clustering features from different data sets corresponding to the same fraction based on at least one decision criterion, by: ordering at least a part of the features in accordance with the value of the first measurement parameter; and determining the feature clusters by clustering features to a respective feature cluster in accordance with a clustering condition that a difference regarding the value of the first measurement parameter between adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value, wherein the cluster determining unit is further configured to determine, from the feature clusters, a suspicious feature for which a rule for clustering has failed; a spread determining unit configured to determine for at least a part of the feature clusters a spread of the features within a respective feature cluster; and a display unit configured to display at least the part of the feature clusters together with a graphical indication of the corresponding spread, including displaying the suspicious feature, and further configured to display at least the part of the feature clusters according to a coordinate system comprising a first axis and a second axis, wherein: the first axis corresponds to the value of the first measurement parameter; and the second axis corresponds to the number of the respective measurement run.

25. A method of processing measurement data, the method comprising receiving the measurement data, the measurement data comprising a plurality of data sets corresponding to a plurality of respective measurements, wherein: the plurality of respective measurements are performed by a fluid separation apparatus in a plurality of respective measurement runs on a plurality of respective fluidic samples; each data set comprises a plurality of features indicative of different fractions of one of the plurality of respective fluidic samples; and each feature represents a combination of a value of a first measurement parameter with a value of a second measurement parameter; and the first measurement parameter is selected from the group consisting of: a retention time of a chromatography measurement, a retention volume of a chromatography measurement, and a mass to charge ratio of a coupled chromatography and mass spectroscopy measurement; determining feature clusters by clustering features from different data sets corresponding to the same fraction based on at least one decision criterion, by: ordering at least a part of the features in accordance with the value of the first measurement parameter; and determining the feature clusters by clustering features to a respective feature cluster in accordance with a clustering condition that a difference regarding the value of the first measurement parameter between adjacent features of a feature cluster in the ordered representation is below a predetermined threshold value; further determining, from the feature clusters, a suspicious feature for which a rule for clustering has failed; determining for at least a part of the feature clusters a spread of the features within a respective feature cluster; and displaying at least the part of the feature clusters together with a graphical indication of the corresponding spread, including displaying the suspicious feature, and further displaying at least the part of the feature clusters according to a coordinate system comprising a first axis and a second axis, wherein: the first axis corresponds to the value of the first measurement parameter; and the second axis corresponds to the number of the respective measurement run.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Other objects and many of the attendant advantages of embodiments of the present invention will be readily appreciated and become better understood by reference to the following more detailed description of embodiments in connection with the accompanying drawings. Features that are substantially or functionally equal or similar will be referred to by the same reference signs.

(2) FIG. 1 shows a device for analyzing measurement data having a plurality of data sets according to an exemplary embodiment of the invention.

(3) FIG. 2 to FIG. 4 are schemes relating to the execution of a method of processing measurement data having a plurality of data sets and illustrating an algorithm of clustering, calculating a spread and illustrating both together according to an exemplary embodiment of the invention.

(4) FIG. 5 to FIG. 22 show different images relating to a clustering procedure, spread calculation procedure and a graphic illustration of the latter according to an exemplary embodiment of the invention.

(5) FIG. 23 shows a diagram graphically illustrating different fractions of a fluidic sample separated and being analyzed in terms of cluster formation and spread calculation and illustration.

(6) FIG. 24 shows a liquid separation system, in accordance with embodiments of the present invention, for instance used in high performance liquid chromatography (HPLC) and ultra high performance liquid chromatography (UHPLC).

(7) The illustration in the drawing is schematic.

DETAILED DESCRIPTION

(8) Referring now in greater detail to the drawings, FIG. 24 depicts a general schematic of a liquid separation system 10. A pump 20 receives a mobile phase from a solvent supply 25, typically via a degasser 27, which degasses and thus reduces the amount of dissolved gases in the mobile phase. The pump 20—as a mobile phase drive—drives the mobile phase through a separating device 30 (such as a chromatographic column) comprising a stationary phase. A sampling unit 40 can be provided between the pump 20 and the separating device 30 in order to subject or add (often referred to as sample introduction) a fluidic sample into the mobile phase. The stationary phase of the separating device 30 is adapted for separating compounds of the fluidic sample. A detector 50 is provided for detecting separated compounds of the fluidic sample. A fractionating unit 60 can be provided for outputting separated compounds of the fluidic sample.

(9) While the mobile phase can be comprised of one solvent only, it may also be mixed from plural solvents. Such mixing might be a low pressure mixing and provided upstream of the pump 20, so that the pump 20 already receives and pumps the mixed solvents as the mobile phase. Alternatively, the pump 20 might be comprised of plural individual pumping units, with plural of the pumping units each receiving and pumping a different solvent or mixture, so that the mixing of the mobile phase (as received by the separating device 30) occurs at high pressure and downstream of the pump 20 (or as part thereof). The composition (mixture) of the mobile phase may be kept constant over time, the so called isocratic mode, or varied over time, the so called gradient mode.

(10) A data processing unit 70, which can be a PC or workstation, might be coupled (as indicated by the dotted arrows) to one or more of the devices in the liquid separation system 10 in order to receive information and/or control operation. For example, the data processing unit 70 might control operation of the pump 20 (for instance setting control parameters) and receive therefrom information regarding the actual working conditions (such as output pressure, flow rate, etc. at an outlet of the pump 20). The data processing unit 70 might also control operation of the solvent supply 25 (for instance setting the solvent/s or solvent mixture to be supplied) and/or the degasser 27 (for instance setting control parameters such as vacuum level) and might receive therefrom information regarding the actual working conditions (such as solvent composition supplied over time, flow rate, vacuum level, etc.). The data processing unit 70 might further control operation of the sampling unit 40 (for instance controlling sample injection or synchronization of sample injection with operating conditions of the pump 20). The separating device 30 might also be controlled by the data processing unit 70 (for instance selecting a specific flow path or column, setting operation temperature, etc.), and send, in return, information (for instance operating conditions) to the data processing unit 70. Accordingly, the detector 50 might be controlled by the data processing unit 70 (for instance with respect to spectral or wavelength settings, setting time constants, start/stop data acquisition), and send information (for instance about the detected sample compounds) to the data processing unit 70. The data processing unit 70 might also control operation of the fractionating unit 60 (for instance in conjunction with data received from the detector 50) and provide data back.

(11) Reference numeral 90 schematically illustrates a switchable valve which is controllable for selectively enabling or disabling specific fluidic paths within apparatus 10. The switchable valve 90 is not limited to the position between the pump 20 and the separating device 30 and can also be implemented at other positions, depending on the application.

(12) The data processing unit 70 may also process and display measurement data measured by liquid separation system 10 to enable a user to derive technical information from the measurement. Such procedures according to exemplary embodiments will be described in detail in the following. Particularly, methods for evaluating chromatographic results using data correlation and clustering will be explained.

(13) FIG. 1 shows a device 100 (which corresponds to liquid separation system 10 of FIG. 24) for analyzing liquid chromatography measurement data captured by a liquid chromatography measurement device 102 (which corresponds to components 20, 25, 27, 30, 40, 50, 60, 90 of FIG. 24). The liquid chromatography measurement device 102 carries out a plurality of measurements on a fluidic sample to be separated into various fractions. With each measurement, a corresponding data set is captured by the liquid chromatography measurement device 102. Each data set can be indicative of a chromatogram which has a plurality of peaks, which will also be called signal features or simply features. Each feature indicates the presence of a corresponding fraction or species in the fluidic sample.

(14) After finishing the measurements, the measurement data can be stored in a database 104 for later evaluation.

(15) A fraction identification unit 106 of the device 100 is configured for identifying individual fractions assigned to the features in the chromatogram in different data sets by determining a match with preknown technical information. In other words, certain fractions or components of the fluidic sample which is presently analyzed are expected so that the fraction identification unit 106 can identify peaks in the measurement signals and assign them to the various expected fractions. However, it may also happen that some of the determined features in the measurement spectra cannot be identified, i.e. cannot be assigned to an expected species. This can for instance be caused by impurities in the samples.

(16) Such impurities, which may correspond to undesired or parasitic fractions of the fluidic sample, can then be analyzed by a cluster determining unit 108. The cluster determining unit 108 is configured for determining feature clusters by clustering only the features which could not be assigned to individual fractions by the fraction identification unit 106. For this purpose, the cluster determining unit 108 determines feature clusters by clustering features from different data sets which presumably relate to the same fraction. Examples of a corresponding clustering algorithm, i.e. an algorithm for determining which of the unidentified peaks or features relate to the same fraction or are at least considered to relate to the same fraction, will be discussed below in more detail.
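The separation of identified from unidentified features described in paragraphs (15) and (16) might be sketched as follows. This is only an illustrative sketch and not the implementation of the fraction identification unit 106: the function name `split_identified`, the compound names, the expected retention times and the tolerance are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def split_identified(
    peaks: List[float],
    expected: Dict[str, float],
    tolerance: float,
) -> Tuple[Dict[str, List[float]], List[float]]:
    """Assign each peak to an expected fraction if it lies within the tolerance.

    Peaks that match no expected retention time are returned separately so
    that only they are passed on to clustering.
    """
    identified: Dict[str, List[float]] = {}
    unidentified: List[float] = []
    for rt in peaks:
        match = None
        for name, expected_rt in expected.items():
            if abs(rt - expected_rt) <= tolerance:
                match = name
                break
        if match is None:
            unidentified.append(rt)
        else:
            identified.setdefault(match, []).append(rt)
    return identified, unidentified


# Hypothetical expected retention times (minutes) and one measured data set.
expected = {"compound A": 1.50, "compound B": 3.20}
peaks = [1.52, 2.10, 3.18, 4.40]
identified, unidentified = split_identified(peaks, expected, tolerance=0.05)
```

In this example the peaks at 2.10 and 4.40 minutes match no expected fraction and would be the only ones handed to the cluster determining unit 108.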

(17) The result of the cluster determination is then supplied to a spread determining unit 110. The spread determining unit 110 is configured for determining, for each of the feature clusters individually, a corresponding spread of the features within a respective feature cluster. In other words, a value can be statistically derived which is indicative of a width of the distribution of the individual features within a cluster. The spread is thus an indication of the reliability of the clustering (the larger the spread, the lower the reliability).

(18) After having determined a quantitative measure for the spread for each feature cluster individually, a display unit 112 may be fed with the corresponding data and may be configured for determining display data for actually displaying the feature clusters together with the graphical indication of the corresponding spread, for instance on a monitor.

(19) As indicated by the dashed rectangle denoted with reference numeral 114 in FIG. 1 (which corresponds to component 70 of FIG. 24), units 106, 108, 110, 112 can be realized as a common processor or computer. It is however also possible that each of the units is realized as a separate processor or computer, or that only some of the units are realized as a common processor or computer.

(20) An input/output unit 116 is provided for bidirectional communication with the processor 114 as well as the database 104 and the liquid chromatography measurement device 102. Via the input/output unit 116, a user may input instructions to the system, for instance may determine parameters or may define a measurement to be carried out. It is also possible that results of such a measurement or of the evaluation are displayed to the user via the input/output unit 116, for instance via a monitor.

(21) FIG. 2 to FIG. 4 illustrate how the clustering, the spread determination and the graphical display can be performed for the system shown in FIG. 1.

(22) FIG. 2 shows a diagram 200 having an abscissa 202 along which a retention time is plotted according to a liquid or gaseous chromatography measurement. Along an ordinate 204, different measurements performed with the liquid or gaseous chromatography apparatus 102 are illustrated. This means in the shown example that four different measurements are indicated in the diagram of FIG. 2, each illustrated as a corresponding horizontal dotted line. A number of signal features 208 are shown for each measurement in the diagram 200. Hence, each measurement shows a plurality of such features 208. All features 208 relating to one and the same measurement together form a corresponding data set 206, as shown in FIG. 2 as well. Therefore, the four data sets 206 shown in FIG. 2 correspond to the four measurements. In the example of FIG. 2, each data set 206 has three (in this case unidentified) features 208 which are arranged at remarkably different retention times. The following procedure intends to cluster corresponding features 208 which most probably relate to the same fraction of a sample to be separated in the various measurements.

(23) The manner in which the clustering is performed is shown in FIG. 3 and will be illustrated in the following. Firstly, all unidentified features 208 shown in FIG. 2 are projected on and are ordered quantitatively along an axis 330 shown in FIG. 3 which relates to the abscissa 202 (retention time axis). In other words, all twelve features 208 shown as circles in FIG. 2 are projected onto the abscissa 202 (retention time axis). Hence, the twelve features 208 illustrated as “1”, “2”, . . . , “11”, “12” in FIG. 3 are ordered according to their value of the retention time from small to large values. Feature clusters 350 are then determined by clustering all features 208 which fulfill the clustering condition that a difference regarding the value of the retention time between adjacent features 208 of a feature cluster 350 in the ordered representation is below a predetermined threshold value Δ.sub.TH being indicated in FIG. 3 with reference numeral 354. Hence, a distance Δ.sub.12 between features “1” and “2” is determined and compared to Δ.sub.TH. Since Δ.sub.12 is smaller than Δ.sub.TH, features “1” and “2” are considered to relate to the same feature cluster 350. Next, features “2” and “3” are analyzed which have a mutual distance Δ.sub.23. Since Δ.sub.23 is smaller than Δ.sub.TH, also features “2” and “3” are considered to relate to the same feature cluster 350. This procedure is continued until it is determined that the difference Δ.sub.45 between features “4” and “5” is larger than Δ.sub.TH. Therefore, it is concluded that features “4” and “5” do not relate to the same feature cluster 350. Correspondingly, features “1” to “4” are grouped to form the first feature cluster 350. This procedure is continued so that three feature clusters 350, which are denoted as C1, C2 and C3 in FIG. 3, are identified.
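The ordering and adjacent-difference comparison of paragraph (23) might be sketched as a single non-recursive pass over the sorted features. The sketch is illustrative only: the function name `cluster_by_gap`, the representation of a feature as a (run number, retention time) pair and all numeric values are assumptions, with the threshold chosen inside the range named in claim 4.

```python
from typing import List, Tuple

Feature = Tuple[int, float]  # (measurement run number, retention time in minutes)


def cluster_by_gap(features: List[Feature], threshold: float) -> List[List[Feature]]:
    """Cluster features whose neighbours on the sorted retention-time axis
    differ by less than the threshold."""
    ordered = sorted(features, key=lambda f: f[1])  # small to large retention time
    clusters: List[List[Feature]] = []
    for feature in ordered:
        # Compare with the last feature of the cluster currently being built.
        if clusters and feature[1] - clusters[-1][-1][1] < threshold:
            clusters[-1].append(feature)
        else:
            clusters.append([feature])  # gap too large: start a new cluster
    return clusters


# Four measurements with three unidentified features each (made-up values).
runs = {
    1: [1.00, 2.50, 4.00],
    2: [1.02, 2.53, 4.01],
    3: [0.99, 2.49, 4.03],
    4: [1.03, 2.51, 3.98],
}
features = [(run, rt) for run, rts in runs.items() for rt in rts]
clusters = cluster_by_gap(features, threshold=0.05)
```

With these values the twelve features fall into three clusters of four features each, one feature per measurement, analogous to C1, C2 and C3 of FIG. 3.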

(24) A further consistency check of the cluster formation may be made by comparing a respective width S1, S2 or S3 between the center of the first and the center of the last feature 208 of a respective feature cluster 350 with another threshold value S.sub.TH denoted as reference numeral 356. If one of S1, S2 or S3 would be larger than S.sub.TH, then the corresponding cluster formation would not be considered as reliable and this would be indicated to a user, for instance in the form of an alarm. However, in the present case, each of the cluster formations is considered as consistent. The corresponding values S1, S2 and S3 can be denoted as spreads of corresponding clusters C1, C2 and C3.
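The consistency check of paragraph (24) might be sketched as follows, taking the spread of a cluster as the distance between its first and last feature on the retention time axis. The function names, the limit value and the retention times are assumptions chosen for illustration.

```python
from typing import List, Tuple


def cluster_spread(retention_times: List[float]) -> float:
    """Distance between the first and the last feature of a cluster."""
    return max(retention_times) - min(retention_times)


def check_spreads(
    clusters: List[List[float]], spread_limit: float
) -> List[Tuple[float, bool]]:
    """Return (spread, is_consistent) for every cluster.

    A cluster is treated as consistent when its spread does not exceed the limit.
    """
    results = []
    for retention_times in clusters:
        spread = cluster_spread(retention_times)
        results.append((spread, spread <= spread_limit))
    return results


# Made-up retention times (minutes) of three clusters; the third one is
# deliberately stretched so that the check fails for it.
example_clusters = [
    [0.99, 1.00, 1.02, 1.03],
    [2.49, 2.50, 2.51, 2.53],
    [3.90, 3.98, 4.05, 4.12],
]
results = check_spreads(example_clusters, spread_limit=0.1)
flags = [ok for _, ok in results]
```

The third cluster shows why the additional check can be useful: each adjacent difference stays small, yet the chained features span a wider range than the limit allows, so the cluster would be reported to the user.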

(25) FIG. 4 shows a diagram 400 similar to diagram 200. In addition to the information shown in FIG. 2, a bar 406 being indicative for the extension of the corresponding spread S1, S2 or S3 visually shows to the user how reliable the clustering is.
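How display data for a bar such as bar 406 could be derived from a cluster is sketched here. The sketch assumes, in line with claims 9 and 14, that the bar width equals the spread and that the bar extends along the measurement axis; the function name `spread_bar` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

Feature = Tuple[int, float]  # (measurement run number, retention time in minutes)


def spread_bar(cluster: List[Feature], num_runs: int) -> Dict[str, float]:
    """Compute the rectangle of a spread bar in diagram coordinates."""
    retention_times = [rt for _, rt in cluster]
    left = min(retention_times)
    right = max(retention_times)
    return {
        "x_min": left,             # retention time of the first feature
        "x_max": right,            # retention time of the last feature
        "width": right - left,     # equals the spread of the cluster
        "y_min": 1,                # bar covers all measurement runs
        "y_max": num_runs,
    }


# One made-up cluster with a feature from each of four measurements.
cluster = [(1, 1.00), (2, 1.02), (3, 0.99), (4, 1.03)]
bar = spread_bar(cluster, num_runs=4)
```

A plotting layer of any kind could then draw this rectangle behind the feature markers, so that a wide bar signals a less reliable cluster at a glance.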

(26) Coming back to FIG. 2, a further feature 210 is shown which relates to the second measurement and has a distance to a preceding feature 212 of less than Δ.sub.TH. If such a situation occurs, i.e. that the same measurement shows two features 210, 212 differing less than Δ.sub.TH from one another but relating to the same data set 206, then the later feature 210 is not considered to relate to the same feature cluster 350, because two separable features in the same measurement are indicative of two different fractions and can therefore not be considered to relate to the same fraction for technical considerations. Feature 210 can form a separate cluster with a width or spread of zero, since it is only a single feature.
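The boundary condition of paragraph (26), that a feature cluster holds at most one feature per data set, might be added to a gap-based clustering pass as sketched here. This is one possible reading rather than the claimed implementation: the function name `cluster_one_per_run`, the handling of the excluded feature as a single-feature cluster and all values are assumptions.

```python
from typing import List, Tuple

Feature = Tuple[int, float]  # (measurement run number, retention time in minutes)


def cluster_one_per_run(features: List[Feature], threshold: float) -> List[List[Feature]]:
    """Gap-based clustering in which a run contributes at most one feature
    to a cluster; a later feature of the same run forms its own cluster."""
    ordered = sorted(features, key=lambda f: f[1])
    main: List[List[Feature]] = []
    singles: List[List[Feature]] = []
    for feature in ordered:
        run, rt = feature
        if main and rt - main[-1][-1][1] < threshold:
            if any(member_run == run for member_run, _ in main[-1]):
                singles.append([feature])  # same run already present: keep apart
            else:
                main[-1].append(feature)
        else:
            main.append([feature])
    # Return all clusters ordered by the retention time of their first feature.
    return sorted(main + singles, key=lambda c: c[0][1])


# Run 2 shows two close features (1.02 and 1.04), like features 212 and 210.
features = [
    (1, 1.00), (2, 1.02), (2, 1.04), (3, 0.99), (4, 1.03),
    (1, 2.50), (2, 2.53),
]
clusters = cluster_one_per_run(features, threshold=0.05)
```

Here the later feature of run 2 at 1.04 minutes ends up as a separate cluster with a spread of zero, while the other four features around 1.0 minutes stay together.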

(27) In the following, referring to FIG. 5 to FIG. 22, a system of forming a graphical illustration of measurement results according to exemplary embodiments of the invention will be explained.

(28) FIG. 5 shows a chromatographic signal 500 illustrating different signal features such as peaks 502 as regions of locally high intensity in a liquid chromatography experiment in dependency of a retention time plotted along abscissa 202. A baseline 504 is shown as well.

(29) FIG. 6 shows how the chromatographic signal 500 can be transformed into an equivalent bubble diagram in which the individual peaks 502 are displayed as circular structures or features 208. In other words, the area of each feature 208 corresponds to an area under a corresponding peak 502.
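The conversion of a peak into a bubble whose area matches the area under the peak might be sketched as follows. Trapezoidal integration above a constant baseline is an assumption made for this sketch; the source does not state how the peak area is obtained. The function names, the `scale` parameter and the sample values are hypothetical.

```python
import math
from typing import List


def peak_area(times: List[float], signal: List[float], baseline: float = 0.0) -> float:
    """Area under a peak above a constant baseline, by the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        left = signal[i - 1] - baseline
        right = signal[i] - baseline
        area += 0.5 * (left + right) * (times[i] - times[i - 1])
    return area


def bubble_radius(area: float, scale: float = 1.0) -> float:
    """Radius of a circle whose area equals scale times the peak area."""
    return math.sqrt(scale * area / math.pi)


# A made-up triangular peak: height 10 units, base 0.2 minutes.
times = [0.0, 0.1, 0.2]
signal = [0.0, 10.0, 0.0]
area = peak_area(times, signal)
radius = bubble_radius(area)
```

Scaling the radius with the square root of the area keeps the drawn bubble area proportional to the peak area, which a radius proportional to the area would not.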

(30) FIG. 7 shows an illustration similar to that of FIG. 6, wherein expected retention time windows—more precisely spreads relating to expected peaks—are illustrated in the form of bars which are denoted with reference numeral 700.

(31) FIG. 8 shows a similar diagram as FIG. 7 with the exception that apart from identified peaks, compare reference numeral 208, also some unidentified peaks are shown which are illustrated by reference numeral 800. An unidentified peak 800 means that the corresponding peak is seen in the chromatographic signal 500, however no such peak would be expected theoretically. Such unidentified peaks 800 may result from impurities in a sample or the like.

(32) FIG. 9 shows that, apart from the unidentified peaks 800, it may also happen that certain expected peaks are not found in a chromatographic signal 500, as indicated by reference numeral 900. Not found means that there is no local maximum in the chromatographic signal 500 although it would be expected theoretically.
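Detecting expected peaks that are not found could be sketched as a check of each expected retention time window against the peaks of one chromatographic signal. The function name `missing_expected`, the compound names and the window limits are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def missing_expected(
    peaks: List[float], windows: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Return the names of expected fractions with no peak inside their window."""
    missing = []
    for name, (start, end) in windows.items():
        if not any(start <= rt <= end for rt in peaks):
            missing.append(name)
    return missing


# Hypothetical expected retention time windows (minutes) and measured peaks.
windows = {
    "compound A": (1.45, 1.55),
    "compound B": (3.15, 3.25),
    "compound C": (5.00, 5.10),
}
peaks = [1.52, 2.10, 3.18]
not_found = missing_expected(peaks, windows)
```

In this example only the window of the third compound contains no peak, so it alone would be marked as expected but not found.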

(33) In some events, compare reference numeral 1000 in FIG. 10, an alert may be triggered since an alert rule is violated. In other cases, see reference numeral 1002, a warning may be output to a user when a warning rule is violated.

(34) FIG. 11 shows a diagram 1100 in which all peaks of features 208 are shown as bubbles, wherein the size can be proportional to area, height, amount, etc. Vertical bars 700 show the expected retention time window.

(35) FIG. 12 shows a so-called sequence peak diagram 1200. In this sequence peak diagram 1200, all peaks of features 208 of different injections or measurements are shown as bubbles, wherein the size can be proportional to area, height, amount. The vertical bars 700 show the expected retention time window. Hence, peaks of features 208 from various measurements are illustrated in the sequence peak diagram 1200.

(36) FIG. 13 shows a graphical user interface 1300, in which a user can, in a user-defined manner, design the way of illustrating the various peaks (features 208) and vertical bars 700 in accordance with user preferences.

(37) In the graphical user interface 1400 shown in FIG. 14, two peaks 1402 are marked as suspicious, because certain rules have failed (relating to warning and alert status).

(38) FIG. 15 shows a diagram 1500 in which expected but not found peaks 1502 are shown as well.

(39) FIG. 16 shows a diagram 1600 which indicates that three injections or measurements show unidentified peaks 1602. As a result of clustering, bands 1604 indicate that these unidentified peaks 1602 could be assigned to two unknown compounds.

(40) FIG. 17 shows a graphical user interface 1700 in which a comparison against a reference chromatogram is performed, and a proper match is found.

(41) User interface 1800 shown in FIG. 18 shows that at an unidentified peak 1602, reference and sequence chromatograms do not match very well.

(42) In diagram 1900 in FIG. 19, the sequence chromatogram shows one expected but not found peak 1902, one peak 1904 too many, and one peak 1906 not found.

(43) FIG. 20 shows a diagram 2000, in which peaks of a reference and a sequence chromatogram do not match. However, there is some similarity. FIG. 21 shows a diagram 2100 in which the peaks are aligned (see alignment lines 2102).

(44) FIG. 22 shows a user interface 2200 in which a suspicious marker 2202 is shown.

(45) In FIG. 23, a diagram 2300 can be seen which is similar to diagram 400 and that shows that after clustering of features 208 or peaks the resulting clusters are displayed together with a measure for the spreading.

(46) Unidentified peaks are denoted with reference numeral 2304, identified peaks are denoted with reference numeral 2302, and vertical bands (reference numeral 2306) show formed clusters.

(47) The following description referring to FIG. 23 relates to peak correlation and clustering components. These components allow a user to correlate (cluster) unidentified peaks 2304 based on retention times (see abscissa 202). Peaks with retention times that are very close to each other are assigned to the same cluster. The results are visualized as a graphic control (see FIG. 23) and as table entries (not shown) for further evaluation. The user can control the clustering window size 354 (FIG. 3) which is used for clustering, manually correct a given clustering, and apply various filter operations in order to explore the clusters and peaks in detail.

(48) Clustering of peaks can be used when multiple samples show unidentified peaks 2304 and the question arises whether these unidentified peaks 2304 are likely to be caused by the same compound or impurity. The described method helps the user to classify the unidentified peaks 2304 by aligning all those peaks 2304 which show up at closely the same retention time and handling them as a new entity, i.e. as a yet unknown compound or impurity.

(49) This may also be useful for developing new methods where retention times of all peaks 2302, 2304 are not known in advance. The found clusters can then be turned into expected retention times for identifying these peaks 2302, 2304.

(50) Depending on the nature of the retention time values, clustering will not always lead to a unique solution. Therefore, the user needs an easy way to change the clustering window size 354 (FIG. 3) used for clustering and to view in real time how these manipulations alter the clustering. This enables the user to select the most meaningful solution.

(51) The user interface for this feature comprises a graphical control showing the positions of all peaks 2302, 2304 and clusters as retention time bands 2306, additional entries for the column table where each column (group of columns) represents data from a specific cluster, and various interactive manipulation means for evaluating the clustered peaks 2302, 2304.

(52) Since expected peaks 2302 are clustered implicitly by data analysis, i.e., the peak identification step, this additional clustering will only be applied to unidentified peaks 2304, in an embodiment.

(53) Therefore, the input for clustering is the set of retention times of all unidentified peaks from all injections. Clustering is performed for each signal separately. The only parameter is the clustering window size 354, which specifies the size of the window used to cluster peaks in retention time units (min or sec). If this parameter is not specified, the algorithm will determine a default cluster window size from the minimum of the non-zero differences between the retention times of all unidentified peaks.
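The default described above can be illustrated with a minimal Python sketch. It assumes that the default is taken to be the minimum non-zero difference itself (the text only says the default is determined "from" that minimum); the function name `default_cluster_window` is hypothetical.

```python
def default_cluster_window(retention_times):
    """Return the smallest non-zero gap between sorted retention times.

    Assumption: the default window equals this minimum gap.
    Returns None when fewer than two distinct values exist, since no
    non-zero difference can be formed in that case.
    """
    ordered = sorted(retention_times)
    # Adjacent gaps suffice: the smallest non-zero pairwise difference
    # always occurs between neighbors in sorted order.
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b - a > 0]
    return min(gaps) if gaps else None


# Unidentified peaks pooled from all injections of one signal (minutes).
rts = [2.0, 2.5, 2.5, 4.0, 7.0]
window = default_cluster_window(rts)
```

Identical retention times contribute a zero difference and are therefore ignored when the default is derived.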

(54) Output is a collection of clusters (compare reference numeral 350 in FIG. 3). Each cluster lists the retention times, signal and injections which comprise the cluster, as well as the real width of the cluster, calculated as maximum minus minimum of retention times within the cluster.

(55) This clustering feature can be switched on or activated interactively when evaluating peak or compound results. In case clustering is switched on the method will hold the user specified clustering window size 354 or the information to use a default value.

(56) When exploring the clustering interactively the software may vary the clustering window size 354 and calculate the clustering in the background. As a result the relationship of “number of clusters” versus “cluster window size” can be inspected to allow the user to find an optimal clustering window size 354 for the user data. The software will mark the largest clustering window size 354 at which for all injections not more than one peak 2302, 2304 is included in each cluster.
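A background sweep of this kind might look like the following Python sketch. It assumes a plain gap criterion (a new cluster starts whenever the gap between neighboring sorted retention times exceeds the window) and represents each peak as a hypothetical `(retention_time, injection)` pair; all function names are illustrative.

```python
def gap_clusters(peaks, window):
    """Group (retention_time, injection) pairs: a new cluster starts
    whenever the gap to the previous sorted peak exceeds `window`."""
    ordered = sorted(peaks)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[0] - prev[0] <= window:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return clusters


def one_peak_per_injection(clusters):
    """True if no cluster holds two peaks from the same injection."""
    return all(len({inj for _, inj in c}) == len(c) for c in clusters)


def sweep(peaks, windows):
    """Return (window, cluster count, uniqueness flag) per candidate."""
    rows = []
    for w in windows:
        clusters = gap_clusters(peaks, w)
        rows.append((w, len(clusters), one_peak_per_injection(clusters)))
    return rows


def largest_valid_window(rows):
    """Largest swept window whose clusters keep one peak per injection."""
    valid = [w for w, _, ok in rows if ok]
    return max(valid) if valid else None


peaks = [(2.00, 1), (2.05, 2), (2.10, 3), (3.00, 1), (3.04, 2), (5.00, 3)]
rows = sweep(peaks, [0.01, 0.06, 0.5, 1.0, 2.0])
best = largest_valid_window(rows)
```

The `rows` list holds the data for the "number of clusters" versus "cluster window size" relationship, and `best` corresponds to the window size that the software would mark.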

(57) In the case that multiple signals are available, the software can optionally collect all identified peaks 2302 from all signals as input to the correlation algorithm. In the correlation result set, among the peaks that are from the same injection and within the same cluster but from different signals, the peak with the largest area is marked.
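The marking rule can be sketched in Python as follows. The `Peak` record and the function name `mark_largest_area` are hypothetical and serve only to illustrate the selection of one peak per injection within a cluster.

```python
from collections import namedtuple

# Hypothetical peak record: retention time, injection number,
# signal (detector channel) label and integrated peak area.
Peak = namedtuple("Peak", "rt injection signal area")


def mark_largest_area(cluster):
    """For each injection in one cluster, return the peak with the
    largest area among the peaks coming from different signals."""
    best = {}
    for peak in cluster:
        current = best.get(peak.injection)
        if current is None or peak.area > current.area:
            best[peak.injection] = peak
    return best


cluster = [
    Peak(2.00, 1, "A", 10.0),
    Peak(2.01, 1, "B", 25.0),
    Peak(2.02, 2, "A", 7.0),
]
marked = mark_largest_area(cluster)
```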

(58) In the case multiple detectors are available the signal alignment algorithm may be applied before determining the retention times. This is especially advantageous when combining retention times from all signals as input for the correlation/clustering algorithm.

(59) If the clustering window size 354 is smaller than the minimum of the non-zero differences of all peaks, the number of created clusters equals the number of different retention times. If the clustering window size 354 is larger than the total spread, i.e. maximum minus minimum, of the retention times, the number of created clusters equals one. For all other values of the clustering window size 354, the number of resulting clusters lies between these two values; it is in fact a monotonically falling step function of the window size. The clustering window size 354 is limited by the largest size at which, for each injection, not more than one peak is included in each cluster.
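Under a plain gap criterion, and ignoring the one-peak-per-injection restriction (an assumption made here for simplicity), the step-function behavior can be written compactly. The symbols are introduced only for illustration: the sorted retention times are t_(1) ≤ ... ≤ t_(n), w is the clustering window size, and N(w) is the number of clusters.

```latex
N(w) \;=\; 1 + \Bigl|\bigl\{\, i \in \{1,\dots,n-1\} \;:\; t_{(i+1)} - t_{(i)} > w \,\bigr\}\Bigr|
```

Each gap larger than w opens one additional cluster. For w below the smallest non-zero gap, N(w) equals the number of distinct retention times. For w at or above the largest gap, and hence for any w larger than the total spread, N(w) = 1. Since the set of gaps exceeding w can only shrink as w grows, N(w) never increases with w.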

(60) As mentioned above, FIG. 23 shows the principal layout of the graphical control for presenting all peaks from many injections and their clusters. The X-axis (see reference numeral 202) has the same units as the analyzed signals, i.e. time given in units of min or sec. The Y-axis (see reference numeral 204) shows the number of the injection from which each peak 2302, 2304 is taken. The position of each peak 2302, 2304 is represented by a circle. The size of the circle represents area, height or any other chosen numerical value of a peak 2302, 2304.

(61) Clusters can be visualized by retention time bands 2306 which may be colored. The presentation of FIG. 23 includes also the identified peaks 2302. The width of the retention time bands 2306 for identified peaks 2302 is just the expected retention time plus/minus the identification window size. The width of the retention time bands 2306 for the unidentified peaks 2304 is chosen in a way that retention times, i.e. center of the circles, of all peaks 2304 belonging to a cluster are within the retention time band 2306. In the case a cluster contains only one peak 2304 then only one colored line is drawn as a cluster retention time band 2306.
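The two band-width rules can be sketched in Python as follows. The function names and the representation of a band as a `(left, right)` pair are assumptions for illustration.

```python
def identified_band(expected_rt, identification_window):
    """Band for an identified peak: expected retention time plus/minus
    the identification window size."""
    return (expected_rt - identification_window,
            expected_rt + identification_window)


def unidentified_band(cluster_rts):
    """Band for a cluster of unidentified peaks: spans from the minimum
    to the maximum retention time, so every circle center lies inside.
    A cluster with a single peak yields a zero-width band, i.e. a line."""
    return (min(cluster_rts), max(cluster_rts))


band_a = identified_band(5.0, 0.25)
band_b = unidentified_band([3.0, 3.04, 3.02])
band_c = unidentified_band([7.1])
```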

(62) Identified peaks 2302 and their clusters may be colored differently from unidentified peaks 2304 and the corresponding clusters. For instance, identified peaks 2302 may be colored blue, unidentified peaks 2304 grey.

(63) A selected injection or measurement is visualized by reference numeral 206; a selected peak may be emphasized by four arrows pointing to the corresponding circle (see reference numeral 2308).

(64) Next, an interactive evaluation of correlated unidentified peaks 2304 will be explained. A prerequisite is that multiple injections are already loaded and integrated; identification can be completed but is not needed. In the case no identification has been done, all peaks 2302, 2304 are handled as unidentified. This might be a useful starting point for developing a new method from scratch.

(65) Assuming the user is evaluating chromatograms and peaks, depending on the user interface layout the user would either switch on the correlation/clustering control or switch to a specific sub-view. The system will immediately calculate the clusters and display the result as a graphic and as added columns to the compound table displaying values for the found clusters. The default is to start with all unidentified peaks from a signal and the cluster window size given by the method: either a specific or the system calculated default value. Using a toolbar, the user can easily switch between different available signals.

(66) In order to determine a proper clustering, the user can display a small popup window that shows the relationship between clustering window size 354 and number of clusters. The user can adapt the clustering window size 354 if needed. There may be a slider on the toolbar which allows the user to evaluate the diagram in real time for varying the clustering window size 354.

(67) Other options are to select which attribute will be shown by the size of the circles that represent each peak 2302, 2304 in the graphic. Possible values are: area, height, peak type, or any numeric value that is an outcome of the rule calculator. The real value is proportional to the area of the circle. The sizes of the circles vary between two predefined values for the minimum and maximum circle.
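One way to reconcile "value proportional to circle area" with fixed minimum and maximum circle sizes is a linear interpolation of the area, as in the following Python sketch. The interpolation scheme, the function name `circle_radius` and the default radii are assumptions and are not taken from the description.

```python
import math


def circle_radius(value, v_min, v_max, r_min=3.0, r_max=15.0):
    """Map a peak attribute to a circle radius so that the circle AREA
    grows linearly with the value, bounded by the areas of the smallest
    and largest allowed circles (r_min and r_max are hypothetical)."""
    a_min = math.pi * r_min ** 2
    a_max = math.pi * r_max ** 2
    if v_max == v_min:
        # All peaks carry the same value: use the smallest circle.
        return r_min
    frac = (value - v_min) / (v_max - v_min)
    frac = min(1.0, max(0.0, frac))  # clamp out-of-range values
    area = a_min + frac * (a_max - a_min)
    return math.sqrt(area / math.pi)


# Peak areas between 0 and 100 mapped to radii between 3 and 15.
radii = [circle_radius(v, 0, 100) for v in (0, 50, 100)]
```

Scaling the area rather than the radius avoids visually exaggerating large peaks, because perceived circle size follows area.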

(68) Further on, the user can suppress peaks 2302, 2304 or full injections (measurements) for clustering. This makes sense when outliers have been identified by the data analysis and these outliers might create values which are not representative for all samples or would distort clustering. Peaks 2302, 2304 or full injections can manually be suppressed interactively for instance by moving the cursor near to a circle. The cursor may change its shape visualizing the possible action to suppress a peak 2302, 2304 or injection or to re-activate a suppressed item.

(69) Other filter options are to show and mark unidentified peaks 2304 that are detected in only some of the injections but not in all, and/or to show and mark ranges of the signal where expected peaks 2302 have not been identified, i.e. are for any reason not available.
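The first filter option could be sketched in Python as follows, assuming that each cluster is a list of hypothetical `(retention_time, injection)` pairs; the function name `incomplete_clusters` is illustrative.

```python
def incomplete_clusters(clusters, all_injections):
    """Return (cluster index, missing injections) for every cluster
    whose peaks were detected in only some of the injections."""
    wanted = set(all_injections)
    report = []
    for index, cluster in enumerate(clusters):
        missing = wanted - {inj for _, inj in cluster}
        if missing:
            report.append((index, sorted(missing)))
    return report


clusters = [
    [(2.00, 1), (2.05, 2), (2.10, 3)],  # seen in all three injections
    [(3.00, 1), (3.04, 2)],             # not seen in injection 3
]
report = incomplete_clusters(clusters, [1, 2, 3])
```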

(70) A method according to an embodiment of the invention which includes an algorithm for clustering and correlating data from a series of repeated measurements will be described in detail in the following with an emphasis on the logic of such an algorithm. Integrated with a graphical presentation of the resulting clusters this method allows the user to examine specific features of the measured data in a highly efficient way. The outlined example of peak correlation of chromatographic measurements illustrates advantages of this method, especially in the area of impurity profiling or development of chromatography methods.

(71) The described method allows correlating and clustering any measured numerical feature from a series of repeated measurements. Based on a given small Cluster Window Width (also denoted as predefined threshold value), an algorithm creates clusters of values of a measured feature that are taken from the different measurements of the series. Adjacent values within a cluster are closer to each other than the given window width. However, in an embodiment the chosen Cluster Window Width shall not exceed a size such that more than one data point from a single measurement falls into the same cluster. In general the resulting cluster size may be larger than the starting Cluster Window Width.

(72) The method includes a graphical and tabular presentation of the correlation result. The graphical presentation is a scatter diagram of the measured values. An X-axis relates to the data range of the measured data values and a Y-axis numbers the measurements of the series. The format of the single data points such as color, shape and size can visualize additional features of the data point. A table may be used to list any selected feature of each cluster in a single table column.

(73) In an embodiment, such a system may be applied to chromatographic measurement data. Gas chromatography (GC) and liquid chromatography (LC) are techniques to characterize the chemical composition of gaseous and liquid, i.e. fluidic, samples. During a chromatography run fractions or components (also called compounds) of a mixture are separated, and optionally, identified and quantified. The time it takes the component molecules to travel through the system is called retention time. The result of a chromatographic analysis is a signal (chromatogram) that shows peaks at different retention times corresponding to the different components. In addition, the height or area of the peak can be used to quantify the component in the sample.

(74) One task of data analysis is to allot these peaks, based on the retention time, to components. During method development, the retention times of all components of interest are determined and inserted in the method as expected retention times. When running real samples, the data analysis part of the system scans the chromatograms for peaks at expected retention times and uses the peak area or height to determine the amount of the components.
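The identification step can be illustrated with a minimal Python sketch. It assumes that each expected component is matched to the closest not yet assigned peak within plus/minus an identification window; this tie-breaking rule and the function name `identify_peaks` are assumptions and are not specified in the description.

```python
def identify_peaks(peak_rts, expected, window):
    """Assign each expected component the closest peak within +/- window.

    peak_rts: retention times found in one chromatogram
    expected: dict mapping component name -> expected retention time
    Returns (assignments, unidentified, not_found).
    """
    assignments, used = {}, set()
    for name, rt_exp in expected.items():
        candidates = [(abs(rt - rt_exp), idx)
                      for idx, rt in enumerate(peak_rts)
                      if abs(rt - rt_exp) <= window and idx not in used]
        if candidates:
            _, idx = min(candidates)  # smallest deviation wins
            assignments[name] = peak_rts[idx]
            used.add(idx)
    # Peaks that matched no expected component remain unidentified.
    unidentified = [rt for i, rt in enumerate(peak_rts) if i not in used]
    # Expected components without any peak in their window.
    not_found = [name for name in expected if name not in assignments]
    return assignments, unidentified, not_found


found, unidentified, not_found = identify_peaks(
    [1.52, 2.48, 3.90], {"A": 1.5, "B": 2.5, "C": 6.0}, window=0.1)
```

The `unidentified` list is the kind of input that the clustering operates on, while `not_found` corresponds to expected peaks that are missing from a chromatogram.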

(75) Applied to chromatography, peak clustering can be used to examine unidentified peaks. For instance, LC or GC analysis is applied to create a series of analyses from different samples taken from a batch of a newly synthesized product. In this example, the repeated measurements are the recorded chromatograms; the measured feature is the retention time of any unidentified peak within the chromatograms. The described algorithm creates clusters of unidentified peaks from the different chromatograms for which the retention times are very close to each other. One interpretation is that such clusters are caused by unknown compounds which are regarded as impurities or by-products which should not exist at optimal process control. The found clusters are added as “yet unknown” compounds to the compound list.

(76) Some of the diagrams below (for instance FIG. 23) show an exemplary layout of a scatter plot for peak correlation. Not only the unidentified peaks (reference numeral 2304 in FIG. 23) may be drawn, but also the identified peaks (reference numeral 2302 in FIG. 23). Vertical bands (reference numeral 2306 in FIG. 23) show the created clusters, either given by the below described clustering algorithm for unidentified peaks or for expected peaks by peak identification. The width of the bands for identified peaks is just the expected retention time plus/minus the identification window size as specified in the method. The width of the bars or bands for unidentified peaks is chosen in a way that retention times, i.e. center of the circles, of all peaks belonging to a cluster are within the band. The size of circles is chosen to be proportional to the peak area.

(77) This visualization concept may be integrated into a general data analysis software package for chromatographic data. If a user selects any chromatogram or peak for further inspection the related peak will also be highlighted in the scatter diagram.

(78) In addition to displaying all peaks and their correlation the graphical presentation can be used to highlight a variety of peak attributes and to help navigate to suspicious signals. Peaks can be flagged based on the results from applied data evaluation rules.

(79) Next, an exemplary peak clustering algorithm will be described which may be used for the above-described way of illustrating clusters and their spread.

(80) A prerequisite for peak correlation is that multiple signals are loaded and already integrated; identification could have been completed but is not required. In case no identification has been done all peaks are handled as unidentified. This might be a useful starting point for developing a new method from scratch.

(81) The following cluster algorithm may be applied:

(82)
STEP 1: From each loaded Signal k collect all unidentified Peaks, result: PeaksInSignal (k)
STEP 2: Merge all PeaksInSignal (k) lists, result: PeakList
STEP 3: Sort PeakList (smallest to largest), result: SortedPeakList
STEP 4: Set ClusterInd = 1, add SortedPeakList (1) to PeakCluster (ClusterInd)
STEP 5: FOR i = 2 to NumberOfPeaks in SortedPeakList
    Set k such that SortedPeakList (i) is in PeaksInSignal (k)
    IF ((SortedPeakList (i) − SortedPeakList (i−1)) <= “Cluster Window Width”) AND (No Peaks of PeaksInSignal (k) in PeakCluster (ClusterInd))
        Add SortedPeakList (i) to current PeakCluster (ClusterInd)
    ELSE
        Create a new cluster, increment ClusterInd by 1
        Add SortedPeakList (i) to new PeakCluster (ClusterInd)
    END
NEXT i
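The steps of the cluster algorithm can be sketched in Python as follows. This is an illustrative reading of STEP 1 to STEP 5, not a definitive implementation: the function names, the dictionary input keyed by signal index k, and the `(retention_time, k)` tuples are assumptions. The helper `real_width` follows the description of the real cluster width as maximum minus minimum retention time.

```python
def cluster_peaks(peaks_in_signal, window):
    """Cluster retention times following STEP 1 to STEP 5.

    peaks_in_signal: dict mapping signal index k -> list of retention
                     times of the unidentified peaks in that signal
    window: the Cluster Window Width, in the same unit as the times
    Returns a list of clusters, each a list of (retention_time, k).
    """
    # STEP 1 to 3: merge all per-signal lists and sort by retention time.
    sorted_peaks = sorted((rt, k)
                          for k, rts in peaks_in_signal.items()
                          for rt in rts)
    if not sorted_peaks:
        return []
    # STEP 4: the first peak opens the first cluster.
    clusters = [[sorted_peaks[0]]]
    # STEP 5: walk through the remaining peaks in sorted order.
    for prev, cur in zip(sorted_peaks, sorted_peaks[1:]):
        close_enough = cur[0] - prev[0] <= window
        # No peak of the same signal k may already be in the cluster.
        signal_is_new = all(k != cur[1] for _, k in clusters[-1])
        if close_enough and signal_is_new:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return clusters


def real_width(cluster):
    """Maximum minus minimum retention time within one cluster."""
    times = [rt for rt, _ in cluster]
    return max(times) - min(times)


data = {1: [2.00, 3.00], 2: [2.05, 3.04], 3: [2.10, 5.00]}
clusters = cluster_peaks(data, window=0.06)
```

Because only the distance to the directly preceding peak is tested, a chain of closely spaced peaks can produce a cluster whose real width exceeds the Cluster Window Width.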

(83) The number of found clusters depends on the size of the Cluster Window Width. A very small width will create many clusters, in the extreme case as many as there are unidentified peaks. A helpful tool for preselecting an optimal starting value is a graph of the number of resulting clusters versus the Cluster Window Width.

(84) Embodiments of the invention are capable of assisting the chemist in reviewing many peaks from many samples at a glance. Peak clustering and the graphical presentation allow the chemist to check whether all components have been identified and whether additional compounds have been detected. From this diagram, the chemist can directly focus on checking those components that show unexpected behavior.

(85) It should be noted that the term “comprising” does not exclude other elements or features and the term “a” or “an” does not exclude a plurality. Also elements described in association with different embodiments may be combined. It should also be noted that reference signs in the claims shall not be construed as limiting the scope of the claims.

CPC classification: G01N30/8651, G16C20/80, G06T11/206, G16C20/70

International classification: G06F19/00, G06T11/20
