DISCERNING BRAIN CANCER TYPE

20220308058 · 2022-09-29

    Abstract

    The present invention relates to methods of determining whether a subject suspected of having a brain tumour has a glioma or a lymphoma. The invention also relates to a diagnostic kit for determining whether a subject suspected of having a brain tumour has a glioma or a lymphoma and a method of facilitating the selection of treatment for a subject suspected of having a brain tumour.

    Claims

    1-27. (canceled)

    28. A method, comprising: performing a spectroscopic analysis upon a blood sample, or a component thereof, isolated from a subject to obtain a spectroscopic signature characteristic of said blood sample, or said component thereof; wherein said spectroscopic analysis is an Attenuated Total Reflection (ATR) FTIR; and determining whether said subject has a glioma or a lymphoma using said obtained spectroscopic signature characteristic of said blood sample, or said component thereof.

    29. The method of claim 28, wherein said spectroscopic signature characteristic of said blood sample, or said component thereof, is a spectrum between 400 and 4000 cm.sup.−1, between 900 and 1800 cm.sup.−1, or between 1400 and 1800 cm.sup.−1.

    30. The method of claim 28, wherein said lymphoma comprises a central nervous system (CNS) lymphoma.

    31. The method of claim 28, wherein a silicon internal reflection element supports said blood sample, or said component thereof, during said spectroscopic analysis.

    32. The method of claim 28, wherein a plurality of ATR crystals support said blood sample, or said component thereof, during said spectroscopic analysis.

    33. The method of claim 28, wherein said glioma is a glioblastoma multiforme.

    34. The method of claim 28, wherein said subject is determined to have said glioma when at least a portion of said obtained spectroscopic signature characteristic of said blood sample, or said component thereof, is lower than at least a portion of a control spectroscopic signature.

    35. The method of claim 28, wherein said subject is determined to have said lymphoma when at least a portion of said obtained spectroscopic signature characteristic of said blood sample, or said component thereof, is higher than at least a portion of a control spectroscopic signature.

    36. The method of claim 34, wherein said control spectroscopic signature comprises a plurality of pre-correlated signatures stored in a database to derive a correlation with said determination of said glioma.

    37. The method of claim 35, wherein said control spectroscopic signature comprises a plurality of pre-correlated signatures stored in a database to derive a correlation with a determination of said lymphoma.

    38. The method of claim 28, wherein determining whether said subject has said glioma or said lymphoma using said obtained spectroscopic signature characteristic of said blood sample, or said component thereof, comprises correlating said obtained spectroscopic signature with a determination of said glioma or said lymphoma based on a predictive model developed by a pattern recognition algorithm stored in a database of pre-correlated analyses.

    39. The method of claim 28, wherein said determining whether said subject has said glioma or said lymphoma using said obtained spectroscopic signature characteristic of said blood sample, or said component thereof, facilitates a determination of a treatment of said subject.

    40. The method of claim 28, further comprising detecting a status of one or more biological markers in said blood sample, or said component thereof, wherein said detecting said status of said one or more biological markers comprises detecting whether or not said one or more biological markers comprise a mutation or mutations, or is a wild type marker.

    41. The method of claim 40, wherein said status of said one or more biological markers in said blood sample, or said component thereof, is conducted on a size-fractionated blood sample, or a size-fractionated component thereof.

    42. The method according to claim 41, wherein said size-fractionated blood sample, or said size-fractionated component thereof, has been obtained by a centrifugal filtration.

    43. The method of claim 40, wherein said one or more biological markers comprises IDH.

    44. The method of claim 28, wherein said subject is treated for said glioma with at least one first treatment.

    45. The method of claim 28, wherein said subject is treated for said lymphoma with at least one second treatment.

    46. The method of claim 44, wherein said at least one first treatment is a surgical procedure.

    47. The method of claim 46, wherein said surgical procedure further comprises determining a degree of resection utilizing said status of said one or more biological markers.

    48. The method of claim 45, wherein said at least one second treatment is a chemotherapy or a radiotherapy.

    49. The method of claim 28, wherein said blood sample is a blood serum or a blood plasma.

    50. A diagnostic kit, comprising: i) a spectroscopic device configured to receive a blood sample, or a component thereof, and generate a spectroscopic signature; and ii) a processing device configured to receive said spectroscopic signature from said spectroscopic device and generate an Attenuated Total Reflection FTIR (ATR-FTIR) spectroscopic analysis.

    51. The diagnostic kit of claim 50, further comprising instructions to determine whether said spectroscopic analysis diagnoses a glioma or a lymphoma.

    52. The diagnostic kit of claim 50, wherein said spectroscopic device and said processing device are an integrated spectroscopic processing device.

    53. The diagnostic kit of claim 50, further comprising a computer, wherein said computer is installed with a software program configured to operate said computer to generate said spectroscopic signature characteristic and said spectroscopic analysis.

    54. The diagnostic kit of claim 50, wherein said spectroscopic device is further configured to automatically prepare said blood sample, or said component thereof, with a pre-determined thickness and dryness.

    55. The diagnostic kit of claim 53, wherein said software program is installed on a computer-readable medium.

    56. A method, comprising: performing a spectroscopic analysis upon a blood sample, or a component thereof, isolated from a subject to obtain a spectroscopic signature characteristic of said blood sample, or said component thereof; wherein said spectroscopic analysis is Attenuated Total Reflection (ATR) FTIR; determining whether said subject has a glioma or a lymphoma using said obtained spectroscopic signature characteristic of said blood sample, or said component thereof; and treating said subject based on said determination of a glioma or a lymphoma.

    57. The method of claim 56, wherein said treating said subject for said glioma is with a surgical procedure.

    58. The method of claim 56, wherein said treating said subject for said lymphoma is with a chemotherapy or a radiotherapy.

    Description

    DETAILED DESCRIPTION

    [0091] Embodiments of the invention will now be described by way of example, and with reference to the accompanying figures, which show:

    [0092] FIG. 1 shows a pre-processing example; (a) raw data, and (b) pre-processed;

    [0093] FIG. 2 shows a Gini importance plot from RF analysis showing the mean spectra from lymphoma (black) and glioblastoma (red). Blue: Protein; Yellow: Lipid; Green: Nucleic acid and Orange: Carbohydrate;

    [0094] FIG. 3 shows PLS scores plot for Lymphoma (black) vs GBM (red);

    [0095] FIG. 4 shows loadings plot for the 2.sup.nd PLS component in the lymphoma vs GBM classification with added biological assignments;

    [0096] FIG. 5 shows bootstrapping analysis to determine sufficient number of resamples required for the lymphoma vs GBM patient dataset: (a) the sensitivity and (b) specificity; and

    [0097] FIG. 6 shows ROC curve displaying trade-off between sensitivity and specificity of the SVM+up-sampling classification of the lymphoma vs GBM patients.

    [0098] FIG. 7 shows examples of whole serum (bottom), the HMW concentrate (middle) and the LMW filtrate (top) spectra. Raw spectra offset for clarity.

    [0099] FIG. 8 shows single model receiver operator characteristic (ROC) graphs for the a) whole serum dataset displaying the PLS-DA (blue), SVM (red) and RF (green) classifiers; and b) the best performing model for each of the tested filtrate fractions: the full spectrum (4000-800 cm.sup.−1, blue), the fingerprint region (1800-1000 cm.sup.−1, red) and the extended fingerprint region (1800-800 cm.sup.−1, green).

    [0100] FIG. 9 shows a) the PLS scores plot between PLS1 and PLS2 for the IDH1-mutated (black) and IDH1-wildtype (red) <3kDa serum filtrate (4000-800 cm.sup.−1) dataset, and b) the loadings for the 2.sup.nd PLS component.

    EXAMPLE 1

    Lymphoma Versus Glioblastoma

    Introduction

    [0101] Neurologists are particularly interested in the differentiation of primary central nervous system (CNS) lymphoma from the highly aggressive stage IV tumour, glioblastoma multiforme (GBM). A serum diagnosis would be beneficial for two reasons: firstly, it can often be difficult to distinguish between them through brain scans, such as magnetic resonance imaging (MRI), and secondly, it determines whether the tumour will be surgically removed or not. If an MRI scan suggests a patient has GBM, then they will be urgently sent for a resection. On the other hand, if it is thought that the tumour is lymphoma, the patient is not immediately operated on and is instead treated with chemo- and radiotherapy. The ambiguity arising from brain scans makes it extremely difficult for neurologists to decide on the best course of action.

    Methods

    Sample Collection and Preparation

    [0102] Serum samples were obtained from three sources; the Walton Centre NHS Trust (Liverpool, UK), Royal Preston Hospital (Preston, UK), and the commercial source Tissue Solutions Ltd (Glasgow, UK). The number of serum samples obtained from each source is shown in Table 2. Ethical approval for this study was obtained (Walton Research Bank and BTNW/WRTB 13_01/BTNW Application #1108).

    TABLE 2 Serum samples used for the Lymphoma vs GBM differentiation

    Source       GBM   Lymphoma
    Liverpool     46         23
    Preston       25         18
    Total         71         41

    [0103] In order to be included in this study, the cancer patients must have had a pathologically confirmed primary lymphoma or glioblastoma brain tumour, and must not have been undergoing chemo- or radiotherapy at the time of collection. Blood samples were collected in serum collection tubes and allowed to clot for up to one hour. The tubes were centrifuged at 2200 g for 15 minutes at room temperature, then the separated serum component was aliquoted and stored in a −80° C. freezer.

    [0104] Prior to spectral analysis, the frozen serum samples were removed from storage and thawed at room temperature (18-25° C.) for an average time of 15-20 minutes. Using a micropipette, 3 μL of serum from one individual patient was deposited onto each of the three sample wells of the optical sample slide (wells 1, 2 and 3), whilst ensuring well ‘0’ remained clean for background collection (ClinSpec Diagnostics Ltd, UK). The serum drops were spread across the well using the pipette tip, in order to create a thin serum film and cover the whole IRE for more uniform deposition. Prepared slides were stacked in 3D printed polylactic acid (PLA) slide holders, which were designed to enable batch drying. The stacked slides were then stored in a drying unit incubator (Thermo Fisher™ Heratherm™, GE) at 35° C. for 1 hour. This step provides even heat and airflow for controlled drying dynamics of the serum droplet, to obtain a smooth, flat homogenous sampling surface.

    Spectral Collection

    [0105] For this study, a Perkin Elmer Spectrum 2 FTIR spectrometer (Perkin Elmer, UK) was used for the spectral collection. A Specac Quest ATR accessory unit was fitted with a specular reflectance puck (Specac Ltd, UK), allowing the SIRE (silicon IRE) to sit on top of the aperture and replace the traditional fixed diamond IRE. The Slide Indexing Unit (ClinSpec Diagnostics Ltd, UK) enabled accurate and reproducible movement across the specular reflectance puck, indexing the optical slide between sample wells. With the first well acting as a background, the three sample wells provide the biological repeats. Each well was analysed in triplicate, resulting in nine spectra per patient. The spectra were acquired in the range 4000-450 cm.sup.−1, at a resolution of 4 cm.sup.−1, with 1 cm.sup.−1 data spacing and 16 co-added scans.

    Spectral Pre-Processing

    [0106] Here we have used the PRFFECT toolbox within RStudio software for the spectroscopic analysis, which can be divided into two parts: spectral pre-processing and spectral classification. The pre-processing step is commonly applied in spectroscopic studies, as it reduces unwanted variance in the dataset. A combination of baseline correction, normalisation and data reduction enables the significant biological information to be emphasised and improves the classification performance. The optimum pre-processing protocol was determined using a trial-and-error iterative approach. The PRFFECT toolbox offers various pre-processing methods, such as binning, smoothing, normalisation and numerical derivatives; we direct the reader towards Smith et al. (2) for more information on the use of this open-source program. FIG. 1 gives an example of data pre-processing: (a) is the mean plot for a whole unprocessed dataset, and (b) shows the spectra cut to a fingerprint region (spectroscopic signature), with baseline correction and a vector normalisation applied, greatly reducing the spectral variation.

    [0107] The optimal pre-processing parameters were found to be (in order); extended multiplicative signal correction (EMSC), spectral cut to the fingerprint region (1800-1000 cm.sup.−1), a minmax normalisation and a binning factor of 8.
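
    The last three steps of this protocol can be sketched in a few lines of Python. This is an illustrative sketch only and not the PRFFECT implementation (which runs in R): the function names are hypothetical, the EMSC step is omitted, and the spectrum is assumed to be supplied as two equal-length lists sampled at 1 cm.sup.−1 spacing.

```python
def cut_region(wavenumbers, absorbances, low=1000.0, high=1800.0):
    """Keep only the points whose wavenumber lies within [low, high] cm^-1."""
    kept = [(w, a) for w, a in zip(wavenumbers, absorbances) if low <= w <= high]
    return [w for w, _ in kept], [a for _, a in kept]


def minmax_normalise(absorbances):
    """Rescale a spectrum so that its minimum is 0 and its maximum is 1."""
    lo, hi = min(absorbances), max(absorbances)
    if hi == lo:
        return [0.0 for _ in absorbances]
    return [(a - lo) / (hi - lo) for a in absorbances]


def bin_spectrum(wavenumbers, absorbances, factor=8):
    """Average consecutive groups of `factor` points; an incomplete tail is dropped."""
    n_bins = len(absorbances) // factor
    binned_w, binned_a = [], []
    for i in range(n_bins):
        chunk = slice(i * factor, (i + 1) * factor)
        binned_w.append(sum(wavenumbers[chunk]) / factor)
        binned_a.append(sum(absorbances[chunk]) / factor)
    return binned_w, binned_a


def preprocess(wavenumbers, absorbances):
    """Cut, min-max normalise, then bin, in the order given in the text (EMSC omitted)."""
    w, a = cut_region(wavenumbers, absorbances)
    a = minmax_normalise(a)
    return bin_spectrum(w, a, factor=8)
```

    With 1 cm.sup.−1 data spacing, a binning factor of 8 reduces the 801 points of the fingerprint region to 100 variables per spectrum.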

    Spectral Analysis

    [0108] The classification step consists of the actual disease predictions; the purpose of this approach is to identify the biosignature from a known patient cohort to develop a trained classification model, and then to use this information to predict the presence of disease in an unknown population.

    [0109] To train the classification models, patients were randomly split into training and test sets, with a 70:30 split. In order to make the predictions more robust, no single patient could appear in more than one of these portions. Models were tuned on the training set (70%), employing a 5-fold cross-validation, and then used to make predictions for the spectra in the test set (30%). The consensus vote amongst the nine spectra that were analysed for each patient was reported as the diagnostic outcome (GBM or Lymphoma).
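
    The patient-level split described above might be sketched as follows. This is a minimal illustration, assuming each spectrum is stored as a (patient ID, spectrum) pair; the function names, the ID format and the fixed seed are hypothetical and not taken from the study.

```python
import random


def split_patients(patient_ids, train_fraction=0.7, seed=0):
    """Randomly split unique patient IDs into disjoint training and test sets."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_train = round(train_fraction * len(unique_ids))
    return set(unique_ids[:n_train]), set(unique_ids[n_train:])


def split_spectra(spectra, train_ids):
    """Send all spectra of a given patient to the same side of the split.

    `spectra` is a list of (patient_id, spectrum) pairs.
    """
    train = [s for s in spectra if s[0] in train_ids]
    test = [s for s in spectra if s[0] not in train_ids]
    return train, test
```

    Splitting on patient IDs rather than on individual spectra is what keeps the nine spectra of one patient from appearing in both the training and the test set.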

    [0110] Model performance is reported in terms of sensitivity, specificity, kappa and balanced accuracy. Sensitivities and specificities (Eq. 1 and 2) are based on the number of correct and incorrect predictions in the external test set. The sensitivity generally refers to the ability of a test to correctly identify the patients with disease and the specificity tends to describe the ability to correctly pick out those without the disease (Lalkhen et al.). However, in this case, the sensitivity applies to GBM and the specificity refers to the ability to identify lymphoma. For this analysis, true positives result from a patient with GBM with five or more spectra out of the nine spectra collected correctly identified, whereas true negatives refer to the patients with lymphoma who have at least five out of the nine spectra correctly identified. False positives are where a lymphoma patient has five or more spectra incorrectly identified as GBM, and a false negative is from a patient with GBM who has five or more spectra incorrectly classified as lymphoma.

    \[ \text{Sensitivity} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \tag{1} \]

    \[ \text{Specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}} \tag{2} \]
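
    As a worked illustration of the per-patient rule and of Eq. 1 and 2, the following Python sketch takes GBM as the positive class, as in the text. The function names are hypothetical, and the balanced accuracy is computed with its standard definition (the mean of sensitivity and specificity), which is assumed here rather than stated in the source.

```python
def patient_call(spectrum_predictions, positive="GBM", negative="Lymphoma"):
    """Majority vote over one patient's spectra (five or more of nine decides)."""
    positive_votes = sum(1 for p in spectrum_predictions if p == positive)
    return positive if positive_votes * 2 > len(spectrum_predictions) else negative


def confusion_counts(true_labels, called_labels, positive="GBM"):
    """Count true/false positives and negatives at the patient level."""
    tp = fn = tn = fp = 0
    for truth, call in zip(true_labels, called_labels):
        if truth == positive and call == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif call == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """Eq. 1: fraction of GBM patients correctly called GBM."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Eq. 2: fraction of lymphoma patients correctly called lymphoma."""
    return tn / (tn + fp)


def balanced_accuracy(sens, spec):
    """Mean of sensitivity and specificity (standard definition, assumed here)."""
    return (sens + spec) / 2
```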

    [0111] To gauge the reliability of the diagnostic model, the Kappa value, κ, gives a quantitative measure of the magnitude of agreement between observers (Eq. 3).

    \[ \kappa = \frac{p_o - p_e}{1 - p_e} \tag{3} \]

    [0112] Where p.sub.o is the relative observed agreement and p.sub.e is the hypothetical probability of chance agreement. The value of κ equates to the level of agreement: κ ≤ 0 indicates no agreement, 0.01-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial and 0.81-1.00 almost perfect agreement (Viera et al., McHugh).
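
    For a two-class problem, Eq. 3 can be evaluated directly from the confusion counts. The sketch below is illustrative only: the function names are hypothetical, and p.sub.e is computed in the usual way from the marginal proportions of the true and predicted classes, which is an assumption about how the study's software derives it.

```python
def cohen_kappa(tp, fn, tn, fp):
    """Kappa (Eq. 3) from a 2x2 confusion matrix."""
    n = tp + fn + tn + fp
    p_o = (tp + tn) / n  # observed agreement
    # Chance agreement: for each class, the product of the proportion of
    # patients truly in that class and the proportion predicted as that class.
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_pos + p_neg
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


def agreement_band(kappa):
    """Map a kappa value to the descriptive bands quoted in the text."""
    if kappa <= 0:
        return "no agreement"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

    For example, a hypothetical test set with 20 true positives, 2 false negatives, 10 true negatives and 2 false positives gives κ = 392/528 ≈ 0.74, which falls in the substantial band.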

    [0113] An n-fold cross-validation was performed (n=5) on the training data to determine the optimum values for the tuning parameters. Due to the slight class imbalance present when examining the difference between GBM (71 patients) vs. lymphoma (41 patients), various sampling methods were used throughout this study to ensure no bias was present within the models: up-sampling, down-sampling and synthetic minority over-sampling technique (SMOTE). The up-sampling method consists of repeatedly sampling the minority class to increase the number of samples, whereas down-sampling selects a subset of the majority class at random, removing the extra samples to make it the same size as the minority class (Simafore). SMOTE is unique in that it artificially mixes the data to create ‘new’ samples and so achieve a more balanced dataset (Chawla et al.).
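
    The three sampling strategies can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions: samples are plain lists of numbers, the function names are hypothetical, and the SMOTE-like routine interpolates towards only the single nearest minority neighbour, whereas the published SMOTE algorithm chooses among several nearest neighbours.

```python
import random


def up_sample(minority, target_size, rng):
    """Repeat randomly chosen minority samples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra


def down_sample(majority, target_size, rng):
    """Keep a random subset of the majority class, without replacement."""
    return rng.sample(majority, target_size)


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def smote_like(minority, n_new, rng):
    """Create synthetic samples on the line between a sample and its nearest
    minority-class neighbour (simplified: one neighbour instead of k)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbour = min(others, key=lambda m: squared_distance(base, m))
        t = rng.random()  # position along the line segment, in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

    With 71 GBM and 41 lymphoma patients, up-sampling would bring lymphoma to 71, down-sampling would reduce GBM to 41, and a SMOTE-style approach would add 30 synthetic lymphoma samples.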

    Random Forest

    [0114] RF is a robust machine learning technique that builds an ensemble of decision trees from the training data using the Classification and Regression Trees (CART) algorithm (Breiman et al.). The RF analysis can extract statistical values, based on the number of true positives, false positives, true negatives and false negatives, determining both the accuracy and reliability of the classification. Additionally, spectral importance results can be graphically viewed in the form of Gini plots. Using the Gini impurity metric, produced from the combined mean decrease in the Gini coefficient with respect to the wavenumbers, RF can rank the spectral features in order of significance—for example, which wavenumbers are the most discriminating between the two classes (Smith et al. (1)).
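
    The quantity behind a Gini importance plot is the decrease in Gini impurity produced by each split in a tree. A minimal Python sketch of that calculation for a single split is given below; the function names are hypothetical, and summing these decreases per wavenumber across all trees (the ΣGini ranking) is left out.

```python
def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 for a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted
```

    A node holding four GBM and four lymphoma spectra has an impurity of 0.5; a split on a wavenumber that separates them perfectly gives the maximum possible decrease of 0.5 for that node.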

    Partial Least Squares-Discriminant Analysis

    [0115] Partial Least Squares-Discriminant Analysis (PLS-DA) is a supervised machine learning method that combines PLS regression (PLSR) and Linear Discriminant Analysis (LDA). This technique can extract important information from complex datasets by reducing the dimensionality to reveal hidden patterns within the data. It separates classes by looking for a straight line that divides the data space into two distinct regions (Ballabio et al.). The data points are projected perpendicularly onto the line, which is known as the discriminator (Lee et al.). The distances from the discriminator are referred to as the discriminant scores (Brereton et al.). This information is provided in the form of new variables called PLS components, where the first PLS component (PLS1) accounts for the greatest variation in the dataset, PLS2 represents the next greatest variation, and so on. PLS scores plots give an overview of the general variation within large datasets, and loadings plots further explain the variance by suggesting where the most variable regions exist, e.g. which spectral regions display the highest disparity.

    Support Vector Machine

    [0116] A support vector machine (SVM) is a supervised algorithm, commonly employed for classification purposes (Cortes et al.). From known data, SVM finds an optimal boundary for the separation of the classes, known as the hyperplane. Support vectors are the individual observations that lie closest to the hyperplane and define its position, and the hyperplane can be used to categorise new samples (de Boves Harrington). The optimisation of SVM tuning parameters can change the classification efficiency dramatically. The cost, C, can be referred to as the penalty parameter and is responsible for the trade-off between smooth boundaries and the ability to classify the training data correctly. The gamma parameter, γ, is responsible for the level of fit. It is important to ensure the model does not overfit the data; here, a grid search was used to identify the tuning values giving the optimal classification performance (Ben-Hur et al.).
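
    A grid search over C and γ can be sketched generically in Python. Two assumptions are made that the text does not confirm: that a radial basis function (RBF) kernel is used, which is the common setting in which γ appears, and that each candidate pair is scored by a cross-validated metric. The scoring function is passed in as a hypothetical callable so that no particular SVM library is implied.

```python
import itertools
import math


def rbf_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2); a larger gamma makes similarity more local,
    which allows a tighter (and more easily overfitted) decision boundary."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared)


def grid_search(costs, gammas, score_fn):
    """Return the (C, gamma) pair with the highest score.

    `score_fn(C, gamma)` is assumed to return a cross-validated score,
    for example the mean balanced accuracy over five folds.
    """
    return max(itertools.product(costs, gammas), key=lambda pair: score_fn(*pair))
```

    In use, `score_fn` would train an SVM on four folds of the training set and evaluate it on the fifth, so that the test set plays no part in choosing C and γ.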

    Centrifugal Filtration

    [0117] To assess whether ATR-FTIR spectroscopy could detect IDH1 mutation, centrifugal filtration was undertaken to enable analysis of the low molecular weight (LMW) fraction of the serum samples. The whole serum from the 72 brain cancer patients was filtered to remove the more abundant high molecular weight (HMW) biomolecules. Commercially available Amicon Ultra-0.5 mL centrifugal filtering devices (Millipore-Merck, Germany) with cut-off points at 3 kDa were used to fractionate the serum samples. The serum was split into two fractions: the ‘filtrate’ and the ‘concentrate’. The filtrate accounts for the biomolecular components below the 3 kDa cut-off point, and the concentrate represents the higher MW serum constituents. Serum from each patient (0.3 mL) was placed in the centrifugal filters, and the filtration tubes were centrifuged for 30 minutes at 14000 g. The filtrates passed through the membranes into the collection vials. The filters were then inverted and centrifuged for 2 minutes at 1000 g to collect the HMW concentrates. The filtrates and concentrates were stored in a −80° C. freezer until the time of analysis.

    [0118] For the centrifugal filtration study, spectra were initially corrected with extended multiplicative signal correction (EMSC) using an averaged filtrate spectrum as the reference (see Kohler et al., for example). As there were two prominent bands present between 1000-800 cm.sup.−1 in the filtered serum spectrum, the lower limit of the spectral cut was extended to 800 cm.sup.−1 to ensure all potentially important biological information was retained. Thus, three spectral cuts were tested: 4000-800 cm.sup.−1, 1800-800 cm.sup.−1 and 1800-1000 cm.sup.−1. All other parameters were consistent with the whole serum analysis.

    Results

    [0119] An initial random forest (RF) model provides us with the biochemical differences between the lymphoma and GBM patients. The Gini plot (FIG. 2) suggests the Amide II region is of importance, closely followed by the Amide I band. Between 1150-1000cm.sup.−1 there are various significant bands, relating to vibrations within nucleic material, glycogen and carbohydrates (Table 3).

    TABLE 3 Top 15 wavenumbers from RF classification of lymphoma vs GBM with tentative biochemical assignments (Baker et al., Movasaghi et al.)

    Wavenumber (cm.sup.−1)   ΣGini   Tentative assignment    Vibrational modes
    1556.5                   95.9    Amide II of proteins    δ(N-H), v(C-N), δ(C-O), v(C-C)
    1564.5                   91.4    Amide II of proteins    δ(N-H), v(C-N), δ(C-O), v(C-C)
    1676.5                   57.9    Amide I of proteins     v(C=O), v(C-N), δ(N-H)
    1684.5                   50.1    Amide I of proteins     v(C=O), v(C-N), δ(N-H)
    1572.5                   42.9    Amide II of proteins    δ(N-H), v(C-N), δ(C-O), v(C-C)
    1548.5                   32.6    Amide II of proteins    δ(N-H), v(C-N), δ(C-O), v(C-C)
    1668.5                   32.2    Amide I of proteins     v(C=O), v(C-N), δ(N-H)
    1660.5                   30.5    Amide I of proteins     v(C=O), v(C-N), δ(N-H)
    1020.5                   19.7    DNA/Glycogen            v(PO.sup.2−)/v(C-O), def(C-OH)
    1100.5                   19.0    Nucleic Acids           v(PO.sup.2−)
    1036.5                   17.4    Glycogen                v(C-O), v(C-C)
    1692.5                   15.3    Amide I of proteins     v(C=O), v(C-N), δ(N-H)
    1108.5                   14.6    Carbohydrate            v(C-O), v(C-C)
    1628.5                   14.5    Amide I of proteins     v(C=O), v(C-N), δ(N-H)
    1620.5                   13.2    Amide I of proteins     v(C=O), v(C-N), δ(N-H)

    v = stretching; δ = bending; def = deformation

    [0120] It was found that the PLS-DA scores plot separates the lymphoma and GBM patients across the 2.sup.nd PLS component (FIG. 3). Again, we see the highest discrimination arises from the Amide II band and the lower wavenumber region on the loadings plot (FIG. 4). For lymphoma vs GBM, the Amide I region is also highly discriminatory, substantiating the RF Gini findings outlined previously in Table 3.

    [0121] Bootstrapping analysis was done on the lymphoma vs GBM training set to search for an acceptable number of iterations. In this case, 51 resamples were also found to be sufficient, with the standard error converging at this point (FIG. 5).
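
    One way to look for convergence of the standard error is to recompute it as resamples accumulate. The Python sketch below is a hedged illustration, assuming the metric of interest (for example, the sensitivity from each resample) is held in a list; the function name is hypothetical and the study's exact convergence criterion is not given in the text.

```python
import statistics


def running_standard_error(values):
    """Standard error of the mean after 2, 3, ..., n resamples."""
    return [statistics.stdev(values[:k]) / k ** 0.5
            for k in range(2, len(values) + 1)]
```

    Plotting this sequence against the number of resamples gives curves of the kind shown in FIG. 5; the number of resamples is taken as sufficient once the curve flattens.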

    [0122] SMOTE proved to be the best sampling technique for RF and PLS-DA, but up-sampling was found to be optimal for the SVM-based model (Table 4).

    TABLE 4 Statistical results for the lymphoma vs GBM test sets from the three different classification models with 51 iterations

                       RF + SMOTE              PLS-DA + SMOTE          SVM + UP
                       Mean   Optimum   SD     Mean   Optimum   SD     Mean   Optimum   SD
    Kappa              0.63   0.94      0.13   0.76   0.94      0.09   0.72   0.94      0.11
    Sensitivity (%)    90.9   100       5.8    90.1   100       5.7    86.6   100       8.5
    Specificity (%)    70.8   100       14.9   86.3   100       9.4    86.3   100       9.5
    Accuracy (%)       80.8   97.6      7.2    88.2   97.6      5.0    86.4   97.6      5.4

    [0123] For this particular dataset, the sensitivity refers to the ability to detect GBM, and the specificity relates to lymphoma. As shown in Table 4, the least effective model for this dataset was RF: despite a high sensitivity, its specificity was rather low at 70.8%. SVM combined with up-sampling performed well, reporting a balanced accuracy of 86.4%. The PLS-DA+SMOTE method seemed to be the optimal model, with a sensitivity of 90.1%, a specificity of 86.3%, and the highest mean κ value of all three models (κ=0.76). Each technique reported 100% for sensitivity and specificity for at least one of the 51 iterations. The sensitivities were relatively stable, but the predictions for lymphoma were more variable; for example, one of the RF resamples reported a specificity of 42%, which ultimately lowered the mean value. That said, the ROC curve for the SVM-based model still indicates promising diagnostic capability, with an AUC value of 0.92 (FIG. 6).

    ATR-FTIR IDH Analysis

    [0124] Brain cancer patients with astrocytoma, oligodendroglioma or GBM were separated based upon their IDH1 status using ATR-FTIR serum spectroscopy. Of the 72 patients included, 36 had the IDH1 mutation and 36 were IDH1-wildtype. The data were classified through RF, PLS-DA and SVM with 100 resamples for each, and the findings are reported in Table 6 on a ‘by patient’ basis. For the whole serum dataset, the SVM model reported a promising sensitivity of 75.9% but an extremely low specificity of 28%. All models seemed to be more effective at picking out the IDH1-mutated serum samples from the test sets, as the sensitivities were higher than the specificities in each case, markedly so for PLS-DA and SVM. It is not clear why this may be, as there were an equal number of samples in each class and there should therefore be no bias present in the models. That being said, the results did not appear to be reliable, and given the poor balanced accuracies (~50%) it could be assumed the correct predictions were ultimately made by chance.

    TABLE 6 Classification results for the IDH1-mutated versus IDH1-wildtype whole serum dataset, after 100 resamples. The mean sensitivity, specificity and balanced accuracy are reported with their corresponding standard deviations (SD).

                                Sensitivity (%)   Specificity (%)   Balanced accuracy (%)
    Sample fraction   Model     Mean    SD        Mean    SD        Mean    SD
    Whole Serum       RF        50.3    15.2      45.4    15.1      47.9    8.6
                      PLS-DA    69.3    13.8      35.3    14.7      52.3    7.4
                      SVM       75.9    17.5      28.0    14.6      51.9    7.7

    [0125] Blood serum comprises thousands of different proteins, ranging from the more abundant HMW serum albumin (50 g/L) to LMW proteins like troponin (1 ng/L). Due to the wealth of biomolecules that exist in a normal serum sample, it was expected to be a significant challenge to identify the subtle alterations in blood composition that may have been associated with the IDH1 mutation. The LMW fraction of serum is believed to contain disease-specific information, making the spectroscopic signature of this fraction useful for diagnostics. Thus, after the poor classification performance for the whole serum data, it was thought that discrete molecular differences could potentially be emphasised through the use of centrifugal filtration.

    [0126] FIG. 7 provides an example of the IR spectra for whole serum, the >3 kDa ‘HMW’ fraction and the <3 kDa ‘LMW’ fraction. The concentrate appears almost identical to the whole serum spectrum; notably, they have a very similar absorbance from the more abundant proteins, such as albumin and immunoglobulins, that exist within the Amide region. With these large proteins and other HMW constituents removed, the filtrate spectrum looks remarkably different, with only a few distinct peaks in the fingerprint region (red spectrum). Three spectral regions were chosen for examination: 4000-800 cm.sup.−1 and 1800-800 cm.sup.−1, both of which encompass the two distinct peaks around 950 cm.sup.−1 and 850 cm.sup.−1, as well as the typical biological fingerprint region (1800-1000 cm.sup.−1). The classification results are reported in Table 7.

    [0127] In each case, the filtrate models were superior to the whole serum models in successfully detecting the IDH1-wildtype patients, reporting specificity values above 60%. The improvement in diagnostic ability due to the filtration step is emphasised in FIG. 8, which displays single model ROC curves for the three whole serum classifiers (FIG. 8a) and the best models for each of the three filtrate datasets (FIG. 8b). As expected from the poor classification results, the ROC curves for the whole serum models fall on the diagonal line, meaning the predictions being made are no better than random guessing, and the reported AUC values of ~0.5 suggest the test has essentially no diagnostic accuracy. However, the inclusion of centrifugal filtration enhanced the ability to successfully discriminate the two IDH1 classes. The corresponding ROC curves in FIG. 8b report AUC values >0.7, which is often deemed an ‘acceptable’ level of discrimination.
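
    The meaning of these AUC values can be made concrete with a small Python sketch. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive case receives a higher classifier score than a randomly chosen negative case; the function name and the example scores are hypothetical.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC as the fraction of positive/negative pairs ranked correctly
    (ties count as half)."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))
```

    Scores that carry no class information give an AUC near 0.5, matching a ROC curve on the diagonal, while perfectly separated scores give 1.0.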

    TABLE-US-00007 TABLE 7 Classification results for the IDH1-mutated versus IDH1-wildtype serum datasets after 100 resamples. The mean sensitivity, specificity and balanced accuracy are reported with their corresponding standard deviations (SD). Best performing models for every sample fraction is highlighted in bold. Sensitivity Specificity Balanced accuracy (%) (%) (%) Sample fraction Model Mean SD Mean SD Mean SD <3 kDa Filtered RF 68.4 16.2 67.5 15.9 68.0 11.1 Serum (4000-800 PLS-DA 75.5 12.3 62.6 15.5 69.1 9.0 cm.sup.−1) SVM 68.4 16.5 64.2 16.0 66.4 10.2 <3 kDa Filtered RF 70.6 17.8 66.4 14.5 68.5 11.2 Serum (1800-800 PLS-DA 65.0 14.6 64.6 16.5 64.8 8.7 cm.sup.−1) SVM 63.2 16.3 63.8 16.9 63.5 9.6 <3 kDa Filtered RF 66.6 15.4 68.1 14.1 67.4 9.9 Serum (1800- PLS-DA 65.9 14.6 56.2 15.5 61.1 9.1 1000 cm.sup.−1) SVM 68.1 15.6 56.8 15.6 62.5 10.1

    [0128] The <3 kDa filtered serum ‘full spectra’ dataset (4000-800 cm.sup.−1) delivered the greatest balanced accuracy of 69.1% when classified by the PLS-DA model. The PLS scores plot in FIG. 9a describes the general variation within the dataset. The major variance is generally described by the first PLS component (PLS1). The PLS1 loadings suggest large differences ˜3400 cm.sup.−1 and ˜1650 cm.sup.−1, although there is no apparent class separation across PLS1 in the scores plot. Despite some overlap, it is evident that the 2.sup.nd PLS component separates the two classes better than PLS1. The PLS2 loadings also highlight significant spectral differences around ˜1650 cm.sup.−1 (FIG. 9b). Interestingly, this is the typical location of the large Amide I band in a normal serum spectrum, accounting for the bond vibrations within an abundance of protein molecules. Even with the HMW proteins filtered out of the samples—like albumin and immunoglobulins—it still appears to be a region of importance when examining molecules of very low molecular weights (<3 kDa), suggesting the smaller protein molecules still have diagnostic potential.

    [0129] In general, the balanced accuracies were enhanced to between 60% and 70% for all tested models. The centrifugal filtration step produced a significant improvement in model performance by delivering more balanced sensitivities and specificities.
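
    As a hedged illustration of how the metrics in Table 7 relate to one another, the sketch below computes sensitivity, specificity and balanced accuracy (taken here as the mean of the two) for each resample, then summarises them as mean and SD. The confusion counts are hypothetical and are not data from the study. The last lines check the relation against one row of reported means.

```python
from statistics import mean, stdev


def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, balanced accuracy) as fractions."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Hypothetical confusion counts (tp, fn, tn, fp), one tuple per resample.
resamples = [(15, 5, 13, 7), (14, 6, 12, 8), (16, 4, 13, 7)]

sens, spec, bal = zip(*(rates(*counts) for counts in resamples))
summary = {
    name: (100 * mean(values), 100 * stdev(values))
    for name, values in (
        ("sensitivity", sens),
        ("specificity", spec),
        ("balanced accuracy", bal),
    )
}
for name, (m, sd) in summary.items():
    print(f"{name}: mean {m:.1f}%, SD {sd:.1f}%")

# Averaging is linear, so the mean balanced accuracy equals the average
# of the mean sensitivity and mean specificity. For the PLS-DA row of the
# 4000-800 cm^-1 dataset this gives a value close to the reported 69.1%.
reported_ba = (75.5 + 62.6) / 2
print(f"balanced accuracy from reported means: {reported_ba:.2f}%")
```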

    Conclusion

    [0130] The implementation of a quick blood serum test for the early detection of brain tumours in a GP setting could have a huge impact on the quality of life and prognosis for patients.

    [0131] We present the ability of the method of the invention to differentiate between brain tumour types. Notably, the separation of lymphoma and GBM through ATR-FTIR spectroscopy would be particularly attractive for neurologists in a secondary care setting, when imaging results are not clear. This proof-of-principle study involved 112 patients, providing a sensitivity of 90.1% and a specificity of 86.3%. A κ value of 0.76 indicates the technique is reliable.
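
    For readers unfamiliar with the κ statistic, the following sketch shows how Cohen's kappa can be computed from a two-class confusion table, and how it can be derived from a sensitivity and specificity once a class prevalence is assumed. The prevalence of 0.5 used below is an assumption made for illustration only; this section does not state the class proportions or how the reported κ was calculated.

```python
def cohen_kappa(tp, fn, tn, fp):
    """Cohen's kappa for a 2x2 confusion table (counts or proportions)."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return (observed - expected) / (1 - expected)


def kappa_from_rates(sensitivity, specificity, prevalence):
    """Kappa implied by sensitivity and specificity at a given prevalence."""
    tp = prevalence * sensitivity
    fn = prevalence * (1 - sensitivity)
    tn = (1 - prevalence) * specificity
    fp = (1 - prevalence) * (1 - specificity)
    return cohen_kappa(tp, fn, tn, fp)


# Assumption: both classes equally represented (prevalence = 0.5).
kappa_balanced = kappa_from_rates(0.901, 0.863, prevalence=0.5)
print(f"kappa at equal prevalence: {kappa_balanced:.3f}")
```

    With equal class prevalence the chance agreement is 0.5, and κ reduces to sensitivity + specificity − 1. For the values reported above this is approximately 0.76, which is consistent with the stated κ but does not establish how it was obtained.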

    [0132] Identification of the molecular status from blood serum prior to biopsy could further direct some patients to alternative treatment strategies. Initially, the whole serum classifiers performed poorly, delivering balanced accuracies of ˜50%. Yet with the introduction of centrifugal filtration the classification performance improved significantly, enhancing the sensitivities and specificities to around 70%. These strategies may be further optimised in prospective clinical studies, and can be extended to identify other important molecular alterations, such as ATRX loss, 1p/19q co-deletion and/or MGMT hypermethylation, with which brain cancer type can be stratified pre-operatively.

    REFERENCES

    [0133] M. J. Baker et al., ‘Using Fourier transform IR spectroscopy to analyse biological materials’, Nature Protocols, vol. 9, no. 8, pp. 1771-1791, July 2014.

    [0134] D. Ballabio and V. Consonni, ‘Classification tools in chemistry. Part 1: linear models. PLS-DA’, Analytical Methods, vol. 5, no. 16, p. 3790, 2013.

    [0135] A. Barth, Biochimica et Biophysica Acta 1767 (2007) 1073-1101.

    [0136] A. Ben-Hur and J. Weston, ‘A User's Guide to Support Vector Machines’, in Data Mining Techniques for the Life Sciences, vol. 609, O. Carugo and F. Eisenhaber, Eds. Totowa, N.J.: Humana Press, 2010, pp. 223-239.

    [0137] L. Breiman, ‘Random Forests’, Machine Learning, vol. 45, no. 1, pp. 5-32, October 2001.

    [0138] R. G. Brereton and G. R. Lloyd, ‘Partial least squares discriminant analysis: taking the magic away’, Journal of Chemometrics, vol. 28, no. 4, pp. 213-225, April 2014.

    [0139] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, ‘SMOTE: Synthetic Minority Over-sampling Technique’, Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, June 2002.

    [0140] C. Cortes and V. Vapnik, ‘Support-Vector Networks’, Machine Learning, vol. 20, no. 3, pp. 273-297, September 1995.

    [0141] S. E. Glassford et al., Biochimica et Biophysica Acta 1834 (2013) 2849-2858.

    P. de Boves Harrington, ‘Support Vector Machine Classification Trees’, Anal. Chem., vol. 87, no. 21, pp. 11065-11071, Nov. 2015.

    [0142] A. Kohler, C. Kirschner, A. Oust, and H. Martens, ‘Extended multiplicative signal correction as a tool for separation and characterization of physical and chemical information in Fourier transform infrared microscopy images of cryo-sections of beef loin’, Applied Spectroscopy 59 (2005) 707-716.

    [0143] A. G. Lalkhen and A. McCluskey, ‘Clinical tests: sensitivity and specificity’, Continuing Education in Anaesthesia Critical Care & Pain, vol. 8, no. 6, pp. 221-223, December 2008.

    [0144] L. C. Lee, C.-Y. Liong, and A. A. Jemain, ‘Partial least squares-discriminant analysis (PLS-DA) for classification of high-dimensional (HD) data: a review of contemporary practice strategies and knowledge gaps’, Analyst, vol. 143, no. 15, pp. 3526-3539, 2018.

    M. L. McHugh, ‘Interrater reliability: the kappa statistic’, Biochem Med (Zagreb), vol. 22, no. 3, pp. 276-282, 2012.

    [0145] Z. Movasaghi, S. Rehman, and Dr. I. ur Rehman, ‘Fourier Transform Infrared (FTIR) Spectroscopy of Biological Tissues’, Applied Spectroscopy Reviews, vol. 43, no. 2, pp. 134-179, February 2008.

    [0146] SIMAFORE, ‘Managing unbalanced data for building machine learning models’, March 2019. [Online]. Available: http://www.simafore.com/blog/handling-unbalanced-data-machine-learning-models.

    [0147] (1) B. R. Smith et al., ‘Combining random forest and 2D correlation analysis to identify serum spectral signatures for neuro-oncology’, Analyst, vol. 141, no. 12, pp. 3668-3678, 2016.

    [0148] (2) B. R. Smith, M. J. Baker, and D. S. Palmer, ‘PRFFECT: A versatile tool for spectroscopists’, Chemometrics and Intelligent Laboratory Systems, vol. 172, pp. 33-42, January 2018.

    [0149] A. J. Viera and J. M. Garrett, ‘Understanding Interobserver Agreement: The Kappa Statistic’, Family Medicine, p. 4.