COMPUTER IMPLEMENTED METHOD FOR ANALYZING HOST PHAGE RESPONSE DATA
20230400460 · 2023-12-14
Inventors
- Rob Cohen (Kensington, MD, US)
- Miguel Barreto-Sanz (Lausanne, CH)
- Priyesh AGRAWAL (Philadelphia, PA, US)
Cpc classification
G01N33/557
PHYSICS
International classification
Abstract
A computer implemented method for analyzing host phage response data comprises fitting a sequence of sigmoidal functions at each time point from a start time to an end time and selecting the best fit at each point. The best fit over all the time points is then selected, and a search performed for a time point with a similar coefficient of determination, but closer in time to a minimum time point indicative of the end of the lag phase. The lag time for the best fit time point is then obtained from the fitted model. If the best fit fails a threshold test then the dataset is considered to be flat and the lag time is set to the end time.
Claims
1. A computer implemented method for analyzing host phage response data, the method comprising: receiving or accessing a host phage response dataset, wherein the dataset comprises a time series dataset for a host-phage combination in which a host bacteria is grown in the presence of a phage, and each data point in the time series dataset comprises a measurement of a parameter indicative of the growth of the host bacteria in the presence of the phage at a specific time; obtaining an estimate of a lag time comprising: for a plurality of time points in a time window from an initial time point to an end time point, fitting one or more candidate functions over a fitting time window from the initial time point to a current time point, wherein fitting estimates a set of growth curve summary parameters comprising at least a lag time and a goodness of fit parameter; selecting a best fit function for the current time point from the one or more fitted candidate functions based on the goodness of fit parameters, wherein the one or more candidate functions each comprise a different functional form and at least one of which is a sigmoidal function, and when none of the goodness of fit estimates pass a threshold goodness of fit then classifying the time series dataset as flat and setting the lag time to the end time point; selecting from the plurality of time points, a best time point based on goodness of fit values at the plurality of time points; searching for an alternative best time point closer in time to, and greater than, a minimum time point than the best time fit and updating the best time point to the alternative best time point when the goodness of fit of the alternative time point is within a threshold amount of the goodness of fit of the best time point; estimating a lag time for the best time point, wherein when the goodness of fit exceeds a threshold goodness of fit the lag time is obtained from the growth curve summary parameter otherwise classifying the time series dataset as flat and setting the lag time to the end time point; and reporting at least the lag time or a hold time calculated as the estimated lag time from which the lag time for a control is subtracted.
2. The computer implemented method as claimed in claim 1, wherein the one or more candidate functions is an ordered set of candidate functions, and fitting one or more candidate functions comprises sequentially fitting each candidate function according to an order in the ordered set and the sequential fitting is terminated and the candidate function is selected as the best fit function when the goodness of fit of the fitted candidate function exceeds a predefined threshold goodness.
3. The computer implemented method as claimed in claim 2, wherein the ordered set of candidate functions comprises a Gompertz function, a Logistic function and a Richards function.
4. The computer implemented method as claimed in claim 3, wherein when fitting of each of the Gompertz function, the Logistic function and the Richards function failed to generate a goodness of fit exceeding the predefined threshold goodness, then applying a Blackman window function to the time series data and repeating the fitting process, and when the repeated fits for each of the Gompertz function, the Logistic function and the Richards function fail to generate a goodness of fit exceeding the predefined threshold goodness, then classifying the time series dataset as flat and setting the lag time to the end time point.
5. The computer implemented method as claimed in claim 1, wherein prior to fitting one or more candidate functions over a fitting time window, determining a maximum height of the time series dataset, and when the maximum height is less than a threshold maximum height, then classifying the time series dataset as flat and setting the lag time to the end time point and terminating the fitting process.
6. The computer implemented method as claimed in claim 1, wherein searching for an alternative best time point comprises: determining when the best time point is less than the minimum time point and when the best time point is less than the minimum time point then the alternative time point is selected based on the closest alternative time point on or after the minimum time point with a goodness of fit within a threshold difference of the goodness of fit for the best time point, and when the best time point is greater than the minimum time point then the alternative time point is selected based on the alternative time point being greater than or equal to the minimum time point and which is the closest alternative time point to the minimum time point and with a goodness of fit within a threshold difference of the goodness of fit for the best time point.
7. The computer implemented method as claimed in claim 6, wherein the minimum time point is 5 hours, the goodness of fit is the coefficient of determination (R.sup.2), and the threshold difference is 0.03.
8. The computer implemented method as claimed in claim 6, wherein the goodness of fit is the co-efficient of determination (R.sup.2) and the predefined threshold goodness is 0.6.
9. The computer implemented method as claimed in claim 1, wherein the end time point is 48 hours.
10. The computer implemented method as claimed in claim 1, wherein the minimum time point is 5 hours.
11. The computer implemented method as claimed in claim 1, wherein when the time series data is classified as flat, estimating a variability measure and when the variability measure exceeds a variability threshold rejecting the time series dataset and classifying the time series dataset as abnormal.
12. The computer implemented method as claimed in claim 1, further comprising normalizing the time series dataset based on an associated control curve.
13. The computer implemented method as claimed in claim 12, wherein a host-growth curve is a host only time-series dataset and normalizing the time series dataset comprises subtracting the host only time-series dataset from the time series dataset, wherein the time series dataset and the host only time-series dataset are obtained from separate wells on a same multi-well plate.
14. The computer implemented method as claimed in claim 13, wherein the multi-well plate further comprises one or more media control wells and the computer implemented method further comprises performing quality assurance comprising at least identifying anomalous media control wells or anomalous host cell only wells and excluding time-series datasets associated with any identified anomalous media control wells or anomalous host cell wells.
15. The computer implemented method as claimed in claim 1, wherein the time series dataset is obtained from a multi-well plate comprising a plurality of host-phage combinations, a set of positive control wells, a set of media control wells, a set of host cell control wells, a first set of diluted host cell wells and a second set of diluted host cell wells, and the computer implemented method is performed for each host-phage combination on the multi-well plate, and a report is generated for each host-phage combination on the multi-well plate.
16. The computer implemented method as claimed in claim 1, wherein the growth curve summary parameters comprise at least a max height, a slope, a lag time, and an area under curve, and the goodness of fit comprises one or more of a coefficient of determination (R.sup.2), a parameter based upon an error term or a residual term, or a summary statistic of residuals.
17. The computer implemented method as claimed in claim 12, wherein the end time point is prior to a final time point and the method further comprising receiving an updated host response dataset comprising additional data points and repeating the normalization obtaining and reporting steps, wherein the reporting includes an estimate of a probability that the phage is efficacious.
18. The computer implemented method as claimed in claim 1, wherein the end time point is prior to a final time point, further comprising determining a final class confidence estimate which is an estimate that a classification of whether the phage is efficacious at the end time point matches the classification of whether the phage is efficacious at the final time point and reporting the estimate that the phage is efficacious further comprises reporting the estimate that the phage is efficacious at the end point and final class confidence estimate, wherein determining a final class confidence estimate is determined based on identifying a set of similar host phage response datasets in a set of historical host phage response datasets comprising a plurality of flat host phage response datasets and a plurality of non-flat host phage response datasets and each comprising data points from a start time to the final time, wherein the final class confidence estimate is determined based on the similar host phage response datasets in which an estimate that the classification of whether the phage is efficacious at the end time point matches the classification of whether the phage is efficacious at the final time point.
19. The computer implemented method as claimed in claim 18, wherein the final class confidence estimate is generated using a random forest based classifier trained on the set of historical host phage response datasets.
20. The computer implemented method as claimed in claim 18, wherein a phage is selected based on the final class confidence estimate exceeding a stopping threshold at an end time point prior to the final time point.
21. A non-transitory, computer program product comprising computer executable instructions for analyzing host phage response data, the instructions executable by a computer for: receiving a host phage response dataset, wherein the dataset comprises a time series dataset for a host-phage combination in which a host bacteria is grown in the presence of a phage, and each data point in the time series dataset comprises a measurement of a parameter indicative of the growth of the host bacteria in the presence of the phage at a specific time; normalizing the time series dataset based on an associated control curve; obtaining an estimate of a lag time comprising: for a plurality of time points in a time window from an initial time point to an end time point, fitting one or more candidate functions over a fitting time window from the initial time point to a current time point, wherein fitting estimates a set of growth curve summary parameters comprising at least a lag time and a goodness of fit parameter; selecting a best fit function for the current time point from the one or more fitted candidate functions based on the goodness of fit parameters, wherein the one or more candidate functions each comprise a different functional form and at least one of which is a sigmoidal function, and when none of the goodness of fit estimates pass a threshold goodness of fit then classifying the time series dataset as flat and setting the lag time to the end time point; selecting from the plurality of time points, a best time point based on goodness of fit values at the plurality of time points; searching for an alternative best time point closer in time to, and greater than, a minimum time point than the best time fit and updating the best time point to the alternative best time point when the goodness of fit of the alternative time point is within a threshold amount of the goodness of fit of the best time point; and estimating a lag time for the best time point, wherein when the goodness of fit exceeds a threshold goodness of fit the lag time is obtained from the growth curve summary parameter otherwise classifying the time series dataset as flat and setting the lag time to the end time point; and reporting at least the lag time or a hold time calculated as the estimated lag time from which the lag time for a control is subtracted.
22. A computing apparatus comprising: at least one memory, and at least one processor wherein the memory comprises instructions to configure the at least one processor for: receiving a host phage response dataset, wherein the dataset comprises a time series dataset for a host-phage combination in which a host bacteria is grown in the presence of a phage, and each data point in the time series dataset comprises a measurement of a parameter indicative of the growth of the host bacteria in the presence of the phage at a specific time; normalizing the time series dataset based on an associated control curve; obtaining an estimate of a lag time comprising: for a plurality of time points in a time window from an initial time point to an end time point, fitting one or more candidate functions over a fitting time window from the initial time point to a current time point, wherein fitting estimates a set of growth curve summary parameters comprising at least a lag time and a goodness of fit parameter; selecting a best fit function for the current time point from the one or more fitted candidate functions based on the goodness of fit parameters, wherein the one or more candidate functions each comprise a different functional form and at least one of which is a sigmoidal function, and when none of the goodness of fit estimates pass a threshold goodness of fit then classifying the time series dataset as flat and setting the lag time to the end time point; selecting from the plurality of time points, a best time point based on goodness of fit values at the plurality of time points; searching for an alternative best time point closer in time to, and greater than, a minimum time point than the best time fit and updating the best time point to the alternative best time point when the goodness of fit of the alternative time point is within a threshold amount of the goodness of fit of the best time point; and estimating a lag time for the best time point, wherein when the goodness of fit exceeds a threshold goodness of fit the lag time is obtained from the growth curve summary parameter otherwise classifying the time series dataset as flat and setting the lag time to the end time point; and reporting at least the lag time or a hold time calculated as the estimated lag time from which the lag time for a control is subtracted.
23. The computer implemented method of claim 1, wherein the host bacteria is grown as planktonic cells.
24. The computer implemented method of claim 1, wherein the host bacteria is grown as a biofilm.
25. The computer implemented method of claim 1, wherein phage-phage synergy, antibiotic-phage synergy, or antibiotic-phage-phage synergy is measured.
26. The non-transitory, computer program product of claim 21, wherein the host bacteria is grown as planktonic cells.
27. The non-transitory, computer program product of claim 21, wherein the host bacteria is grown as a biofilm.
28. The non-transitory, computer program product of claim 21, wherein phage-phage synergy, antibiotic-phage synergy, or antibiotic-phage-phage synergy is measured.
29. The computing apparatus of claim 22, wherein the host bacteria is grown as planktonic cells.
30. The computing apparatus of claim 22, wherein the host bacteria is grown as a biofilm.
31. The computing apparatus of claim 22, wherein phage-phage synergy, antibiotic-phage synergy, or antibiotic-phage-phage synergy is measured.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0056] Embodiments of the present disclosure will be discussed with reference to the accompanying drawings wherein:
[0057]
[0058]
[0059]
[0060]
[0061]
[0062]
[0063]
[0064]
[0065]
[0066]
[0067]
[0068]
[0069]
[0070]
[0071]
[0072]
[0073]
[0074]
[0075]
[0076]
[0077]
[0078]
[0079]
[0080]
[0081]
[0082] In the following description, like reference characters designate like or corresponding parts throughout the figures.
DESCRIPTION OF EMBODIMENTS
[0083] As used in the specification and claims, the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a cell” includes a plurality of cells, including mixtures thereof. The term “a nucleic acid molecule” includes a plurality of nucleic acid molecules. “A phage formulation” can mean at least one phage formulation, as well as a plurality of phage formulations, i.e., more than one phage formulation. As understood by one of skill in the art, the term “phage” can be used to refer to a single phage or more than one phage.
[0084] The present invention can “comprise” (open ended) or “consist essentially of” the components of the present invention as well as other ingredients or elements described herein. As used herein, “comprising” means the elements recited, or their equivalent in structure or function, plus any other element or elements which are not recited. The terms “having” and “including” are also to be construed as open-ended unless the context suggests otherwise. As used herein, “consisting essentially of” means that the invention may include ingredients in addition to those recited in the claim, but only if the additional ingredients do not materially alter the basic and novel characteristics of the claimed invention.
[0085] As used herein, a “subject” is a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets. In other preferred embodiments, the “subject” is a rodent (e.g., a guinea pig, a hamster, a rat, a mouse), murine (e.g., a mouse), canine (e.g., a dog), feline (e.g., a cat), equine (e.g., a horse), a primate, simian (e.g., a monkey or ape), a monkey (e.g., marmoset, baboon), or an ape (e.g., gorilla, chimpanzee, orangutan, gibbon). In other embodiments, non-human mammals, especially mammals that are conventionally used as models for demonstrating therapeutic efficacy in humans (e.g., murine, primate, porcine, canine, or rabbit animals) may be employed. Preferably, a “subject” encompasses any organism, e.g., any animal or human, that may be suffering from a bacterial infection, particularly an infection caused by a multiple drug resistant bacterium.
[0086] As understood herein, a “subject in need thereof” includes any human or animal suffering from a bacterial infection, including but not limited to a multiple drug resistant bacterial infection, a microbial infection or a polymicrobial infection. Indeed, while it is contemplated herein that the methods may be used to target a specific pathogenic species, the method can also be used against essentially all human and/or animal bacterial pathogens, including but not limited to multiple drug resistant bacterial pathogens. Thus, in a particular embodiment, by employing the methods of the present invention, one of skill in the art can design and create personalized phage formulations against many different clinically relevant bacterial pathogens, including multiple drug resistant (MDR) bacterial pathogens.
[0087] As understood herein, an “effective amount” of a pharmaceutical composition refers to an amount of the composition suitable to elicit a therapeutically beneficial response in the subject, e.g., eradicating a bacterial pathogen in the subject. Such response may include e.g., preventing, ameliorating, treating, inhibiting, and/or reducing one of more pathological conditions associated with a bacterial infection.
[0088] The term “about” or “approximately” means within an acceptable range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, “about” can mean a range of up to 20%, preferably up to 10%, more preferably up to 5%, and more preferably still up to 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, preferably within 5 fold, and more preferably within 2 fold, of a value. Unless otherwise stated, the term “about” means within an acceptable error range for the particular value, such as ±1-20%, preferably ±1-10% and more preferably +1-5%. In even further embodiments, “about” should be understood to mean+/−5%.
[0089] Where a range of values is provided, it is understood that each intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either both of those included limits are also included in the invention.
[0090] All ranges recited herein include the endpoints, including those that recite a range “between” two values. Terms such as “about,” “generally,” “substantially,” “approximately” and the like are to be construed as modifying a term or value such that it is not an absolute, but does not read on the prior art. Such terms will be defined by the circumstances and the terms that they modify as those terms are understood by those of skill in the art. This includes, at very least, the degree of expected experimental error, technique error and instrument error for a given technique used to measure a value.
[0091] Where used herein, the term “and/or” when used in a list of two or more items means that any one of the listed characteristics can be present, or any combination of two or more of the listed characteristics can be present. For example, if a composition is described as containing characteristics A, B, and/or C, the composition can contain A feature alone; B alone; C alone; A and B in combination; A and C in combination; B and C in combination; or A, B, and C in combination.
[0092] The term “phage sensitive” or “sensitivity profile” means a bacterial strain that is sensitive to infection and/or killing by phage and/or in growth inhibition. That is phage is efficacious or effective in inhibiting growth of the bacterial strain.
[0093] The term “phage insensitive” or “phage resistant” or “phage resistance” or “resistant profile” is understood to mean a bacterial strain that is insensitive, and preferably highly insensitive to infection and/or killing by phage and/or growth inhibition. That is phage is not efficacious or is ineffective in inhibiting growth of the bacterial strain.
[0094] A “therapeutic phage formulation”, “therapeutically effective phage formulation”, “phage formulation” or like terms as used herein are understood to refer to a composition comprising one or more phage which can provide a clinically beneficial treatment for a bacterial infection when administered to a subject in need thereof.
[0095] As used herein, the term “composition” encompasses “phage formulations” as disclosed herein which include, but are not limited to, pharmaceutical compositions comprising one or more purified phage. “Pharmaceutical compositions” are familiar to one of skill in the art and typically comprise active pharmaceutical ingredients formulated in combination with inactive ingredients selected from a variety of conventional pharmaceutically acceptable excipients, carriers, buffers, and/or diluents. The term “pharmaceutically acceptable” is used to refer to a non-toxic material that is compatible with a biological system such as a cell, cell culture, tissue, or organism (or at least non-toxic in amounts typically used). Examples of pharmaceutically acceptable excipients, carriers, buffers, and/or diluents are familiar to one of skill in the art and can be found, e.g., in Remington's Pharmaceutical Sciences (latest edition), Mack Publishing Company, Easton, Pa. For example, pharmaceutically acceptable excipients include, but are not limited to, wetting or emulsifying agents, pH buffering substances, binders, stabilizers, preservatives, bulking agents, adsorbents, disinfectants, detergents, sugar alcohols, gelling or viscosity enhancing additives, flavoring agents, and colors. Pharmaceutically acceptable carriers include macromolecules such as proteins, polysaccharides, polylactic acids, polyglycolic acids, polymeric amino acids, amino acid copolymers, trehalose, lipid aggregates (such as oil droplets or liposomes), and inactive virus particles. Pharmaceutically acceptable diluents include, but are not limited to, water, saline, and glycerol.
[0096] As used herein, the term “estimating” encompasses a wide variety of actions. For example, “estimating” may include calculating, computing, processing, determining, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, “estimating” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, “estimating” may include resolving, selecting, choosing, establishing and the like.
[0097] Further, as used herein, the test bacteria may be provided to the wells “free” in a liquid preparation (e.g., as a planktonic preparation) or provided as a biofilm. If a biofilm is provided, then the methods can be tested on a single phage or can also be used to test phage/phage synergy, phage/antibiotic synergy, and even phage/phage/antibiotic synergy by just modifying the starting materials. In preferred aspects, lysis of a bacterium by a phage can be measured using assays known in the art, such as but not limited to (a) growth inhibition (either grown as planktonic cells or within a biofilm); (b) optical density; (c) metabolic output; (d) photometry (e.g., fluorescence, absorption, and transmission assays); and/or (e) plaque formation. Phage activity, including synergy of the phage—whether it be phage-phage synergy or phage-antibiotic synergy—can be measured using methods known in the art or described in U.S. Applications 63/153,039, 63/208,173, 63/246,601, 63/291,955, 63/310,875, and 63/288,793, each of which are herein incorporated by reference in their entirety.
[0098] For example, those skilled in the art will appreciate that a biofilm of a bacteria may be prepared in vitro by culturing a suitable bacteria (which may be a bacterial strain obtained from a patient sample) in the presence of a solid surface, such as provided by, for example, a polymeric film or planar surface, or a particle or bead comprised of a polymeric material or other inert material such as glass. In some embodiments, the test bacteria is preferably provided to the test wells in the form of a biofilm attached to the surface of a bead such as a glass bead with a diameter of about 1 to 5 mm, and preferably, a glass bead of about 4 mm (which are suitable for, for example, conventional 96-well multiwell plates, wherein one biofilm-coated bead per test may be sufficient). Otherwise, biofilm produced through in vitro culture may, for example, be removed from a surface (if necessary) and processed into consistently-sized pieces (e.g. 2 mm×2 mm substantially planar pieces) by any method that would be apparent to those skilled in the art (e.g. vigorous agitation and/or microdissection).
[0099] Embodiments of computer implemented methods and systems for analyzing host phage response datasets will now be described.
[0100]
[0101] The data may be provided an electronic format, such as a CSV input file exported by an OmniLog™ (or similar host response, such as Biotec™) apparatus.
[0102] Once the data is received normalization and quality control/quality assurance may be performed 102. For example, as shown in
[0103] Other quality assurance checks may also be made to identify anomalous media control wells 222 or anomalous host cell only wells 223. Datasets associated with any identified anomalous media control wells or anomalous bacterial host cell only wells may be excluded. For example, an entire row may be discarded if the host growth curve in cell control well (column 10) is flat or has an unusual shape such as non-sigmoidal or with unusual lag times which may indicate contamination or other growth issue with the bacteria host. For example, E. coli typically has a lag time in the range of 3-5 hours, and thus a lag time outside of this range may indicate an issue with the bacteria. Various rule based or threshold-based quality assurance procedures may be used. Further a machine learning method may also be used to perform quality assurance. This may be trained on labelled control wells 220 and used to identify anomalous control wells.
[0104] Once any normalization and quality assurance has been performed, automated model fitting and parameter estimation may then be performed, for example to obtain an estimate of the lag time for each dataset 103, along with other model parameters that characterize the curve. A typical growth curve 522 is shown in
[0105] In the case that a phage inhibits the growth of the host bacteria, the growth curve will have a substantially flat profile (i.e., is not sigmoidal) which can be characterized as having a lag time 254 equal to the end time. This is shown in
[0106] Thus, a preliminary step 109 may comprise determining the maximum height 529 of the time series dataset, and if the maximum height is less than a threshold maximum height h.sub.TH, then classifying the time series dataset as flat and setting the lag time 524 to the end time point 521 and terminating (or bypassing) the fitting process. The threshold maximum height h.sub.TH may be determined from the analysis of historical growth curves obtained using the OmniLog™ or similar platform (e.g., Biotec™). The threshold h.sub.TH is used for identifying the flat growth curves and is determined automatically by the algorithm by looking at the corresponding cell control. h.sub.TH is defined as the 30% of the maximum density of the cell control. If an assay well has reached its maximum density which is less than 30% of its cell control, it is marked as a flat curve by the algorithm. The threshold may be varied based on the specific equipment used to generate the host phage growth curves.
[0107] The model fitting 104 is performed for a plurality of time points in a time window from an initial time point to an end time point 529. For example, this may be from the start time (0 h) to an end time (24, 36, 48 hours etc.). The end time point may be arbitrarily set to a time point sufficient for growth of the host bacteria to have stabilized and may be set prior to the end of the dataset. For example, an experiment may run for 50 hours but the end time point may be set to 48 hours (or earlier such as 24 or 36 hours). The plurality of time points may be every time point in the dataset, at periodic intervals (e.g., every 15 minutes) or uniformly across the dataset. The interval between time points at which fitting is performed need to not be constant though (for example, some points could be lost due to data corruption, or the apparatus may not sample the wells at precisely regular intervals. For example, a fully loaded OmniLog™ machine takes around 15 minutes to sample every well in every plate, and thus data for each well may be collected at approximately 15 minutes intervals.
[0108] At each time point we fit one or more candidate functions over a fitting time window 105. The fitting time window may be from the initial time point (i.e., start time of the assay) to the current time point. The fitting fits a model of the candidate function to estimate a set of growth curve summary parameters (the fitted model parameters) comprising at least a lag time and a goodness of fit parameter. As shown in
[0109] This process is further illustrated in
[0110] It can be seen in the plot at 48 hours the data after 10 hours has a peaked profile whilst the fitted line is flat. Often in host-phage growth curves the stationary phase 525 is not particularly flat due to media and dye related effects and during the stationary phase the observed data is not always a true measure of growth. This generates considerable variability from plot to plot or experiment to experiment, making automated model fitting challenging.
[0111] To cope with the inherent variability of host-phage growth curves, embodiments of the model fitting method thus fit one or more candidate functions at each time point, and we select a best fit function for the current time point from the one or more fitted candidate functions based on the goodness of fit parameters 105. Each of candidate functions comprise a different functional form, at least one of which is a sigmoidal function (i.e., has an S shaped form). The sigmoidal functions may be a Gompertz function, a Logistic function or a Richards function, the functional forms 500 of which are shown in
[0112]
[0113] In this embodiment the sequence is a Gompertz function followed by a Logistic function, followed by a Richards function. If neither of these functions generates a good fit, we then apply a Blackman window function to the time series data and repeat the fitting process.
[0114] Having obtained a curve fit for each of the time points in the dataset (or at least a plurality of time points in the dataset) along with a goodness of fit measure such as R.sup.2 the data may be stored or written to a file, such as a CSV file 604 as illustrated in
[0115] We then select a best time point based on goodness of fit values at the plurality of time points 106. A flowchart of a search for the best time point is further illustrated in
[0116] In one embodiment the minimum time point T.sub.min is set at 5 hours. Most bacterial, in the absence of phage, have lag times of 3 hours to 5 hours, and thus the minimum time represents a time around or after the end of the lag phase. Accordingly in some embodiments T.sub.min is set at a time between 3 and 6 hours, such as 3, 3.5, 4, 4.5, 5, 5.5 or 6 hours, with the specific value used based on analysis of historical growth curves for the relevant host bacteria or multiple bacteria in which some average or central value may be selected. In the following exemplary embodiments the minimum time point T.sub.min is set at 5 hours.
[0117] As shown in
[0118] As shown in
[0119] If, however the best time point is greater than the minimum time point then the alternative time point is selected based on the alternative time point being greater than or equal to the minimum time point and which is the closest alternative time point to the minimum time point with a goodness of fit within a threshold difference of the goodness of fit for the best time point. This is to avoid the model fit focusing (or being biased by) on late times in the stationary phase where the curve may be flat and well fit at the expense of well-fitting at the growth phase. As shown in
[0120] This is illustrated in
[0121] In the above embodiment, the minimum time point is 5 hours, the goodness of fit is the coefficient of determination (R.sup.2), and the threshold difference if 0.03.
[0122] We then estimate a lag time for the best time point 107. Additionally, a test is performed to determine if the goodness of fit exceeds a threshold goodness of fit. In one form, the goodness of fit is the co-efficient of determination (R.sup.2) and the predefined threshold goodness is 0.6 (e.g., R.sup.2>0.6). In that case the lag time is obtained from the growth curve summary parameter otherwise we classify the time series dataset as flat and set the lag time 523 to the end time point 529. We then report at least the lag time or a hold time calculated as the estimated lag time from which the lag time for a control is subtracted 108.
[0123] This is further illustrated in
[0124] The legend 812 defines colors for each of a plurality of time windows (<1 hour (red), 1-3 hours (orange), 3-8 hours (yellow), 8-20 hours (green) and >20 hours (blue)). Phage which inhibits bacterial growth (i.e., good phage) have large hold times (>20 hours) and ineffective phage (i.e., poor phage) have short hold times (<1 hour).
[0125] An embodiment of the method was compared to human assessed growth curves and the difference in hold time estimates compared.
[0126] The method can thus identify flat curves and robustly fit growth curves. The threshold maximum height h.sub.TH may also be used to characterize or classify flat curves, curves with unusual or abnormal growth curves that may appear non-sigmoidal and provisionally classified as flat. However abnormal growth curves typically have substantial variability compared to true flat (i.e., due to the phage inhibiting bacterial growth) and thus can be distinguished by analysis of the variability. This may be determined by performing an analysis of the errors or residuals from a linear fit or other curve fit. A machine learning classifier could also be trained to distinguish between flat curves and abnormal curves.
[0127] To examine how quickly the model can distinguish poorly performing phage, an in silico experiment was performed to see how quickly the machine learning model could reliably classify a test host-phage response dataset. In this experiment, a machine learning model was fitted to the complete dataset for each host-phage response, and then a series of test fits was then performed on the dataset where each fit was performed over a progressively increasing time window, using 15-minute intervals. That is a test fit was performed over a time window (0, t) where t was incremented by 15 minutes for each subsequent test fit and the fitted parameters were then provided to the trained machine learning model to provide a classification. As noted above, the trained machine learning model only requires a set of summary parameters, and the time window of the fitted dataset need not be identical to that used to train the model.
[0128]
[0129] These results suggest that the machine learning model is reasonably accurate at quickly predicting poor phage after 10 hours but that it takes longer for effective phage to be identified—in this case around 20 hours). This suggests that the time period should span 20 hours, although tests could be performed after 10 hours to select our clearly ineffective phage. The minimum desirable time period will however depend to some extent on the fitting function used, time window of which fits are performed (e.g., single or piecewise fits), and growth media for the wells used for the host-phage response tests.
[0130] Thus, in one embodiment, the method is performed as data is collected and the best fit model used to predict the likely efficacy of the phage, which can then be used to assess whether to stop the experiment before the nominal final or stopping time (e.g. 48 hours). Thus, the method may further comprise receiving an updated host response dataset comprising additional data points (i.e., the current end time point is updated to a later end time point) and repeating the normalization obtaining and reporting steps, wherein the reporting includes an estimate of the probability that the phage is efficacious. This may simply be a re-estimation of the likely class at the current time considering the additional data. In one embodiment the method comprises determining the final class confidence (or final classification expectancy) and reporting this as an estimate of the probability that the phage is efficacious at the current time point. The final class confidence is an estimate of the likelihood that the classification of whether the phage is efficacious at the end time point (i.e., the current class estimate) matches the classification of whether the phage is efficacious at the final time point current (i.e. the final class estimate). The final class confidence could also refer to a final classification expectancy (i.e. the expectancy or likelihood the current class is the same as the final class) or alternatively final class probability (note this is distinct from a class probability which is the confidence in the choice of the current class at the current time point).
[0131] The final class confidence may be determined based on identifying a set of similar host phage response datasets in a set of historical host phage response datasets. The final class confidence estimate is determined based on the similar host phage response datasets in which the class at the current end time point matches the final class at the final or stopping time point (e.g., 48 hours) The set of historical datasets used for similarity assessment are full datasets spanning the start time point to the final time point and may include both flat (non-sigmoidal) host phage response datasets and non-flat (sigmoidal) host phage response datasets. In some embodiments the final class and class at each time point may be known. However, these could be re-estimated from the complete dataset if not available, or it is determined it is preferable to use the current classification method such as due to a change in the classification method since classes were originally calculated.
[0132] Identifying similar datasets could be performed in several ways. In one embodiment similar historical datasets may be chosen as in those historical datasets which have the same class estimate at the current end time point. In another embodiment similar historical datasets may be chosen as in those historical datasets which have the same class estimate at the current end time point and which match for least a threshold number or percentage (e.g., 75%) of previous time points (e.g. 75%). This could be performed by checking the class at each previous time point and counting the number of matches. In another embodiment correlation measure or a pattern matching methods such as template adaption or signal warping may be used to assess similarity between the two time curves to the current end time point.
[0133] In some embodiments the final class confidence may be determined using a machine learning method such as a random forest classifier trained on the historical datasets which can be used generate an estimate of the final class confidence for the current end time point (which can then be reported).
[0134] A random forest (RF) based classifier may be trained on the growth curve parameters (Max Height, Slope, Lag, AUC) of both curve types (Flat and Not-Flat). To benchmark the model, a dataset of growth curves was split into two equal subsets, each of them consisting of an equal number of flat and not-flat growth curves. Various parameters such as sensitivity, specificity and accuracy were estimated to assess the predictive power of the ML algorithm. To further test the RF classifier, we estimated the % accuracy of the classifier for each time points from 1 to 20 hr. For each time point between 1 to 20 hours, the RF classifier was trained and tested on the estimated growth curve parameters at that particular time point of various growth curves present in the training and testing set, respectively. The RF classifier is integrated with the analysis module to perform predictions as new data are obtained and it predicts the probability score (ranges between 0 to 1) for every growth curve for each time step from 1 to 20 hr. If the predicted score is >=0.50, the growth curve is classified as Not-Flat (Bad Phage) and if the probability score is <0.50, the growth curve is classified as Flat (Good Phage).
[0135]
[0136] A range of machine learning classifiers may be used, such as a Boosted Trees Classifier, Random Forest Classifier, Decision Tree Classifier, Support Vector Machine (SVM) Classifier, Logistic Classifier, etc. In some embodiments the classifier is a probabilistic classifier. That is rather than just issuing a binary classification (e.g., efficacious or not), the classifier outputs the class probability. Probabilistic classifiers include naïve Bayes, binomial regression models, discrete choice models, decision tree and boosting based classifiers. In some embodiments machine learning training comprises separating the complete dataset into a first training dataset and a second validation dataset. The training dataset is preferably around 60-80% of the total dataset. This dataset is used by machine learning models to create a classifier model to accurately identify efficacious phage. The second set is the Validation dataset, which is typically at least 10% of the dataset and more preferably 20-40%: This dataset is used to validate the accuracy of the model created using the training dataset. Data may be randomly allocated to the training dataset and validation datasets. In some embodiments, checks may be performed on the training dataset and validation datasets to ensure a similar proportion of good/bad phage are present in each. In some embodiments where cross-validation is used the dataset may be allocated to three datasets, namely a training dataset, a validation dataset, and a holdout or test dataset. The third holdout or test dataset is typically around 10-20% of the total dataset and is not used for training the machine learning classifier or the cross validation. This holdout dataset provides an unbiased estimate of the accuracy of the machine learning classifier model.
[0137] In one embodiment, the final class confidence estimate may be used to make decisions on whether to continue an experiment for a specific phage host combination or entire plate. For example, if the final class confidence exceeds a stopping threshold (e.g., 67%, 75%, 90%, 95% or 99%) the user can have confidence the final (correct) class has been determined, and the experiment/assay for that host-phage combination can be stopped. This allows another experiment/assay using a new host-phage combination to be started earlier than would normally occur, or it may allow treatment to be initiated using the phage (or phage-host combination) with the high final class confidence. In some embodiments the plate may contain many wells (e.g., 96) and it may not be desirable to continue the experiment until a threshold number (e.g., 75%), or all, of the wells have final class confidence over the stopping criteria. Phage host combinations/wells which were terminated early and which had final class confidence estimate below the stopping threshold could be restarted. In some embodiments specialized plates may be constructed to allow swapping mid experiment/assay. For example, a 96 well plate could comprise 12 swappable 8×1 columns of wells.
[0138] In one embodiment, the fitting step could be performed repeatedly during the course of the host phage experiment. That is, as the experiment progresses, and further data becomes available, the dataset is updated with the additional data points (i.e., the additional times) and the fitting function is refitted and classified on the updated dataset. This is equivalent to progressively increasing the time window with each new fit. In another embodiment the width of the fitting time window could be fixed such that the fitting process is effectively using a sliding time window as further data becomes available. In these embodiments, a probabilistic classifier may be used to output the classification probability. Alternatively, a classification expectancy could be estimated with each new time point/fit. The classification expectancy is an estimate of the probability (or likelihood) that the classification result being correct conditional on the current state determined using the distributions of historical data which contain a point matching the current state at the current time. That is, given set of parameters at a given time in the assay, a number could be produced that is measure of the confidence of the classification outcome (i.e., is the current classification result the expected result) for a given phage. For example, new data could be obtained every 15 minutes, and the classifiers decision could be saved for each time point.
[0139] To obtain the classification expectancy at each point we extract the subset of the historical dataset that had a matching current state. In a first embodiment this could be the dataset with the same classification outcome at the current time point. Having obtained this subset we then determine the percentage of the subset where the current estimate of the classification result was the same as the final classification result (e.g., the classification after completion of the assay) and we return that percentage (or a number based on that percentage). As time progresses this is expected to stabilize on the final value. That is, for an assay performed over 24 hours, we may get a classification result at 4 hours with a probability of 50% (i.e., unstable estimate). By 12 hours the probability may be 75% (likely to be accurate), and by 20 hours it may be 99% (highly likely to be accurate). In another embodiment, the dataset could be the dataset with the same classification outcome at the current time point and with growth measure within some range of the observed growth measure at the current time. This could be achieved by partitioning the growth values into a set of intervals or bins (e.g., 0 to 0.1, 0.1 to 0.2, 0.3 to 0.4, etc.).
[0140] We then identify which interval/bin the observed growth measure falls within and select the subset of historical data that had observed growth measures in the same interval/bin at the same time with the same current classification result. Having obtained this subset, we then determine the percentage of the subset where the current classification result was the same as the final classification result (i.e., is the current classification result the expected classification result). In an alternative embodiment, the dataset could be the dataset with the growth measure within some range of the observed growth measure at the current time as described above (i.e., selection of the dataset ignores the current classification result). We then return the percentage of final classification result for this subset which had a final classification result that matches the current classification result. The classification expectancy can thus provide an early measure of the confidence or stability of the current classification result by leverages the longer time series (and outcomes) available in a historical dataset.
[0141] The above embodiments can be used to identify one or more efficacious phage for a host bacterium. Where multiple phages are tested against the same host bacteria, a therapeutic phage formulation of the most effective phage(s) can be generated for treatment or some other criteria such as inventory status. Selection of which phage, or phages, to include may be obtained using a measure of diversity of the efficacious phage. In one embodiment the measure of diversity is indicative of a different mechanism of action between the phages. This measure of diversity could be estimated by sequencing the phage and using bioinformatics methods or datasets to estimate functional effects/associations and these could be used to assign one or more mechanism of actions labels (these could be selected from a controlled ontology such as the GeneOntology database, or databases of biological networks). Phage combinations can thus be selected based on those with different mechanisms of actions, or where phage are assigned a set of multiple possible mechanisms of action, phage could be selected based on the two phage with most dissimilar sets (i.e. minimum overlap of possible mechanisms of action). Overlapping methods of actions could be defined based on sharing a biological network or pathway, or a GeneOntology (GO) term (or downstream of a GO term), or a GO-CAM model. For example, each pair of phage could be assigned a score based on the number of mechanism of actions not shared by both lists. The largest score would indicate the most diverse (non overlapping) list. In another example, the score could be a weighted score. For example, the previous score could be divided by sum of the two list sizes to weight for list size. Other weighting or scoring functions could be used, such as applying a weighting that takes into account the evidence for a mechanism of action associated with a sequence. Other methods of assessing diversity of possible mechanisms of action could also be used based on bioinformatics data mining or biological network/pathway analysis. This approach provides robustness against a bacterium adapting to the mechanism of action of a single phage, as if the second phage has a different mechanism of action then it is likely to remain effective.
[0142] Embodiments described herein thus advantageously provide automated methods for analyzing/interpreting host phage response data that are designed to be robust to the intrinsic variability observed in host phage response datasets. Embodiments of the method fits functions over a range of time points from the start time to an end time and at each time point uses a sequential fitting approach using a sequence of sigmoidal functions such as Gompertz, Logistic and Richards functions (although others could be used). The best fit time point and model parameters can thus be obtained from the fit, for example to obtain the lag time from which a hold time can be obtained. The method gains robustness by using several different fitting functions to allow the method to adapt to the substantial variability and experimental effects observed in host phage response datasets.
[0143] Further to prevent the fitting from focusing on biologically non-interesting time periods such as the early lag phase or late stable phase and an adjustment step is performed to search for fits of similar quality (e.g., based on similar R.sup.2) but closer to the minimum lag time. That is in return for potentially sacrificing some quality/goodness of fit the method drives the estimate of the best time towards more central, and biologically interesting results.
[0144] A fully loaded OmniLog™ apparatus contains multiple multi-well plates and takes about 15 minutes to scan every well in the apparatus. The fitting process is rapid and can thus be performed after each well or after each plate is read and a report of results generated, such as the color-coded matrix plot shown in
[0145] The automated nature saves considerable manual labor and time, with a pair of human experts taking around 45 minutes to analyze 48 plates. In contrast embodiments of the method can analyze 48 plates in around 10 minutes. Further the method is robust and results can be automatically exported to a data storage, web server, or laboratory information management system, along with associated experimental meta data.
[0146] The approach can be used to identify phage for including in phage formations for treating patients with bacterial infections, and in particular Multiple Drug Resistant Infections. The methods can also be used to identify phage that can be used to clean up bacterial contaminated areas, such as for cleaning up an industrial site. These phage formulations may include two or more phage with different mechanisms of action as described above.
[0147] Variations on the above methods can also be performed. In one embodiment, the historical dataset is used to improve classification when performed during the assay (i.e., at some time point before the full assay time period). In this embodiment, a fit (or multiple fits) is performed over the current time period (e.g., 0 to 6 hours). Then fit results over the same time period is obtained for each host-phage profile in an historical dataset is obtained, and a subset of the historical dataset is selected based on having fit results similar to the fit results for the current host-phage combination (over the current time period). That is, we identify the subset of the historical dataset with a similar phage-host curve to the observed phage-host curve up to this point in time (or over some time range to this point in time). Determining similar phage-host curves could be performed using correlation measures (e.g., a cross correlation or similar similarity measure). We then provide additional data from the historical dataset as further inputs to the classifier (beyond just the fit values). In one embodiment this might be percentage of this subset of the historical dataset which were ultimately efficacious. In one embodiment a random forest classified may be used
[0148]
[0149]
[0150] A computer program may be written, for example, in a general-purpose programming language (e.g., Pascal, C, C++, Java, Python, JSON, etc.) or some specialized application-specific language to provide a user interface, perform the model fitting, and export results. In one embodiment the machine learning model may be generated using a machine learning libraries/packages such as SciKit-Learn, Tensorflow, and PyTorch, may be used. These typically implement a plurality of different classifiers such as a Boosted Trees Classifier, Random Forest Classifier, Decision Tree Classifier, Support Vector Machine (SVM) Classifier, Logistic Classifier, etc. These can each be tested, and the best performing classifier selected.
[0151] The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. For a hardware implementation, processing may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors including CPU and GPUs, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof. Software modules, also known as computer programs, computer codes, or instructions, may contain a number a number of source code or object code segments or instructions, and may reside in any computer readable medium such as a RAM memory, flash memory, ROM memory, EPROM memory, registers, hard disk, a removable disk, a CD-ROM, a DVD-ROM, a Blu-ray disc, or any other form of computer readable medium. In some aspects the computer-readable media may comprise non-transitory computer-readable media (e.g., tangible media). In another aspect, the computer readable medium may be integral to the processor. The processor and the computer readable medium may reside in an ASIC or related device. The software codes may be stored in a memory unit and the processor may be configured to execute them. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
[0152] A non-transitory computer-program product or storage medium comprising computer-executable instructions for carrying out any of the methods described herein can also be generated. A non-transitory computer-readable medium can be used to store (e.g., tangibly embody) one or more computer programs for performing any one of the above-described processes by means of a computer. Further provided is a computer system comprising one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for carrying out any of the methods described herein.
[0153] Those of skill in the art would understand that information and signals may be represented using any of a variety of technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
[0154] Those of skill in the art would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software or instructions, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
[0155] Throughout the specification and the claims that follow, unless the context requires otherwise, the words “comprise” and “include” and variations such as “comprising” and “including” will be understood to imply the inclusion of a stated integer or group of integers, but not the exclusion of any other integer or group of integers.
[0156] The reference to any prior art in this specification is not, and should not be taken as, an acknowledgement of any form of suggestion that such prior art forms part of the common general knowledge.
[0157] It will be appreciated by those skilled in the art that the disclosure is not restricted in its use to the particular application or applications described. Neither is the present disclosure restricted in its preferred embodiment with regard to the particular elements and/or features described or depicted herein. It will be appreciated that the disclosure is not limited to the embodiment or embodiments disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the scope as set forth and defined by the following claims.