Identification of Host RNA Biomarkers of Infection

Abstract

The inventive technology includes novel systems, methods and compositions for the identification and classification of host-derived RNA biomarkers produced in response to an infection.

Claims

1-77. (canceled)

78. A method of identifying general host-derived RNA biomarkers of infection comprising the steps of: a) establishing a first biological sample, wherein said first biological sample comprises a tissue sample infected with a first pathogen; b) quantifying one or more genes from said first biological sample that are upregulated in response to the infection compared to a non-infected control biological sample; c) establishing a second biological sample, wherein said second biological sample comprises a saliva sample collected from a subject infected with said pathogen; d) generating an RNA transcript expression dataset by quantifying the RNA transcripts present in said second biological sample that correspond to the one or more genes upregulated in response to infection by said pathogen; and e) analyzing said RNA transcript expression data set and identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to infection by said pathogen.

79. The method of claim 78, further comprising the step of repeating steps a-d using one or more additional pathogens to generate an RNA transcript expression data set.

80. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to said pathogen selected from the group consisting of: SEQ ID NOs. 1-99.

81. The method of claim 78, further comprising the step of identifying host-derived RNA biomarkers of infection commonly upregulated in response to any pathogen.

82. The method of claim 81, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to any pathogen are selected from the group consisting of: SEQ ID NOs. 31-99.

83. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a viral pathogen.

84. The method of claim 83, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a viral pathogen are selected from the group consisting of: SEQ ID NOs. 1-5.

85. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a bacterial pathogen.

86. The method of claim 85, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a bacterial pathogen are selected from the group consisting of: SEQ ID NOs. 6-10.

87. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a retroviral pathogen.

88. The method of claim 87, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a retroviral pathogen are selected from the group consisting of: SEQ ID NOs. 11-15.

89. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a herpesvirus pathogen.

90. The method of claim 89, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a herpesvirus pathogen are selected from the group consisting of: SEQ ID NOs. 16-20.

91. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a respiratory pathogen.

92. The method of claim 91, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a respiratory pathogen are selected from the group consisting of: SEQ ID NOs. 21-25.

93. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a eukaryotic pathogen.

94. The method of claim 93, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a eukaryotic pathogen are selected from the group consisting of: SEQ ID NOs. 26-30.

95. The method of claim 78, wherein the pathogen of said infected tissue sample and pathogen of said infected saliva sample are different pathogens.

96. The method of claim 78, wherein said subject comprises a human subject.

97. A method of identifying host-derived biomarkers of infection comprising the steps of: generating an RNA transcript expression dataset of host-derived biomarker sequence reads according to the method of claim 1; performing data pre-processing on said raw dataset of host biomarker sequence reads comprising one or more of the following steps: filtering out low quality biomarker sequence reads; filtering out contaminating biomarker sequence reads; mapping the filtered biomarker sequence reads to a reference genome; assigning total number of biomarker sequence reads mapped onto each annotated gene within said reference genome; normalizing the biomarker sequence read counts based on one or more control genes; conducting differential expression analysis to determine which host biomarker genes are up-regulated in the dataset; and outputting a dataset of upregulated host-derived biomarkers sequences.

98. The method of claim 97, and further comprising the steps of: merging a plurality of datasets of upregulated host-derived biomarkers sequences for analysis and categorization comprising one or more of the following steps: directly merging said plurality of datasets of upregulated host-derived biomarkers sequences; combining the P-value of said plurality of datasets of upregulated host-derived biomarkers sequences; combining the effect size of said plurality of datasets of upregulated host-derived biomarkers sequences; combining the rank of said plurality of datasets of upregulated host-derived biomarkers sequences; conducting co-expression and network analysis of said plurality of datasets of upregulated host-derived biomarkers sequences; and outputting a dataset of ranked host-derived biomarkers sequences.

99. The method of claim 98, and further comprising the steps of: validating said dataset of ranked host-derived biomarkers sequences comprising one or more of the following steps: comparing a dataset of random gene controls against said dataset of ranked host-derived biomarkers sequences using a machine learning system comprising a classifier; conducting cross-validation on said dataset being applied to said classifier to predict infection or non-infected states of a dataset of unknown RNA sequences; and outputting a dataset of ranked and filtered host-derived biomarker sequences.

Description

BRIEF DESCRIPTION OF DRAWINGS

[0018] The novel aspects, features, and advantages of the present disclosure will be better understood from the following detailed descriptions taken in conjunction with the accompanying figures, all of which are given by way of illustration only, and are not limiting the presently disclosed embodiments, in which:

[0019] FIG. 1A-F: 69 human universal response genes are upregulated in a broad range of infections performed in tissue culture. (A) Heatmap summarizing the observed abundance of mRNA transcripts from RNA-seq data. Each row represents one of the 69 universal response genes. Each column represents the average expression across all mock (-) or infected (+) replicates combined from all studies on a given pathogen. (B) Number of commonly upregulated genes in random combinations of in vitro infection studies. From each of the 71 studies, we curated a list of significantly upregulated genes. We then compared these genes between randomly chosen groups of studies, with 100 random combinations performed at each of the numbers of studies (X-axis). Grey dots are actual values, red dots are mean/median (?) values. The number of commonly upregulated genes (see methods) becomes asymptotic at n=69 genes. (C) Principal component analysis of universal response gene expression data from the datasets analyzed in panel A. Mock (circles) vs. infected (triangles) samples are separated by the primary principal component (80.5% of data variance) on the X axis. The dotted line is arbitrary but separates infected and mock samples. (D-F) Receiver operating characteristic (ROC) curves of various logistic regression models were established using the expression levels of the 69 universal response genes. The area under curve (AUC) is summarized in each graph. (D) The performance of a model trained on 10% of the 387 samples from the 71 in vitro datasets. The model was then used to classify the other 90% of the samples as mock infected or infected. The grey lines indicate each replicate of cross validation, while the red curve summarizes the average ROC curve. (E) Cross validation analyses between different types of infections. 
In each case, the classifier was trained on infections of two types (top of graph) and used to predict whether human cells had been infected with the third type of pathogen, based solely on the expression level of the 69 universal response genes. (F) Cross validation analyses of logistic regression models trained on genes from relevant gene ontology terms, performed as in panel D.
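
By way of non-limiting illustration, the random-combination analysis of panel (B) may be sketched as follows. This is a minimal Python sketch under stated assumptions and not the pipeline actually used: each study is assumed to have already been reduced to a set of significantly upregulated gene symbols, and the function names `common_upregulated` and `sample_common_counts`, the toy study names, and the placeholder genes `GENE_X` and `GENE_Y` are hypothetical.

```python
import random


def common_upregulated(studies):
    """Return the genes that appear in every study's upregulated set."""
    gene_sets = [set(genes) for genes in studies]
    return set.intersection(*gene_sets) if gene_sets else set()


def sample_common_counts(upregulated_by_study, group_size, n_draws=100, seed=0):
    """Count shared upregulated genes for random groups of studies.

    upregulated_by_study maps a study name to its set of upregulated genes.
    For each of n_draws random groups of group_size studies, the number of
    genes upregulated in all studies of the group is recorded.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = sorted(upregulated_by_study)
    counts = []
    for _ in range(n_draws):
        chosen = rng.sample(names, group_size)
        shared = common_upregulated(upregulated_by_study[s] for s in chosen)
        counts.append(len(shared))
    return counts


# Toy input only: three hypothetical studies, not data from the disclosure.
toy_studies = {
    "study_A": {"MX1", "IFIT2", "CXCL8", "GENE_X"},
    "study_B": {"MX1", "IFIT2", "GENE_Y"},
    "study_C": {"MX1", "IFIT2", "CXCL8"},
}

if __name__ == "__main__":
    for size in (1, 2, 3):
        counts = sample_common_counts(toy_studies, size, n_draws=100)
        print(size, sum(counts) / len(counts))
```

In such a sketch, the mean count per group size can only stay level or fall as more studies are combined, which is the behavior that panel (B) summarizes as a plateau.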

[0020] FIG. 2: The kinetics of transcription from universal response genes. Heatmaps show levels of universal response mRNAs, as measured previously in transcriptome datasets from human blood samples. (A) This transcriptome dataset was generated from a 34-year-old male health care worker exposed to Ebola virus in Sierra Leone during the 2013-2015 epidemic. Blood was taken daily starting at 7 days post-symptom onset. (B) This transcriptome dataset is derived from 15 individuals who were experimentally infected with Plasmodium falciparum. Blood was taken every two days up until the day of diagnosis (“D”). Diagnosis occurred 7.5-10.5 days post-infection, defined as the time when two of these criteria were met: positive thick blood smear, parasite density >500 parasites/ml, or symptoms consistent with malaria. In both studies, the transcriptome in whole blood was profiled using microarray. Only a subset of the universal response genes was included on these microarrays; hence each panel has fewer than 69 genes shown. The relative fold change is calculated by comparing microarray signals on the indicated day to the signal of healthy individuals from the same study (malaria N=4, Ebola N=30).

[0021] FIG. 3A-D: Abundance of mRNA in human saliva can determine whether diverse infections are present in the body. (A) Heatmap showing relative expression of each of the universal response genes (rows) in saliva, in transcripts per million (TPM) normalized to row z-score. Each column represents the saliva sample of one individual. (B) Volcano plot of all genes significantly upregulated in all eight infected patients compared to uninfected (DESeq2 Wald test, Fold change ≥2, Adjusted P-value ≤0.01), separated by their fold change in transcript abundance in saliva (infected vs. non-infected) and Benjamini-Hochberg adjusted P-values. The 69 universal-response genes are highlighted in dark red. (C) ROC curve representing the predictive power of the 69 universal response genes to distinguish healthy versus infected individuals. Logistic regression models were constructed with 10% of the in vitro data from FIG. 1, and then used to predict whether individuals SS01-SS23 were infected based solely on the mRNA abundance in saliva. Grey lines indicate individual cross validations (N=20); the red line and shaded area indicate the average and variance from all 20 cross validations, respectively. (D) Total RNA from saliva from three individuals was interrogated by RT-qPCR with primers recognizing each of the universal response mRNAs shown at the bottom. To calculate the fold change of each mRNA in each infected saliva sample (shown on top of each bar), the Ct value was first normalized to the control gene, CALR, and that value was then compared pair-wise to the same value from saliva of 3 non-infected enrollees. The error bar reflects the standard error of the mean (SEM) from the pair-wise comparisons. The horizontal red line shows the highest fold-change for universal response genes in saliva observed in this study by RNA-seq, which is a less sensitive method than RT-qPCR.
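
As a non-limiting illustration of the row z-score normalization referred to in panel (A), the following minimal Python sketch rescales the TPM values of one gene across samples. It is a sketch under assumptions: the function name `row_z_scores` and the example values are hypothetical, and the population standard deviation is used, although a heatmap tool may equally use the sample standard deviation.

```python
from statistics import mean, pstdev


def row_z_scores(tpm_values):
    """Convert one gene's TPM values (one per sample) to row z-scores.

    Each value is centered on the row mean and divided by the row's
    population standard deviation. A constant row has no spread, so it
    is mapped to zeros instead of dividing by zero.
    """
    mu = mean(tpm_values)
    sigma = pstdev(tpm_values)
    if sigma == 0:
        return [0.0 for _ in tpm_values]
    return [(value - mu) / sigma for value in tpm_values]


# Hypothetical TPM values for one gene in four saliva samples.
example_row = [10.0, 12.0, 11.0, 55.0]
example_z = row_z_scores(example_row)

if __name__ == "__main__":
    print([round(z, 2) for z in example_z])
```

Scaling each row separately in this way lets genes with very different absolute abundance share one color scale, at the cost of hiding absolute expression levels.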

[0022] FIG. 4A-C: Universal response transcripts in saliva identified SARS-CoV-2 infected individuals in an asymptomatic, apparently-healthy cohort. (A) Performance of infection screening using host universal response genes in identifying asymptomatic SARS-CoV-2 positive individuals. We trained logistic regression models based on the universal response genes' RT-qPCR fold change data from all but one individual from the asymptomatic SARS-CoV-2 cohort. We then used the model to predict whether the one individual was infected or not. This process is repeated among all individuals, and the prediction result was then compared with the SARS-CoV-2 infection condition and viral load determined using the pathogen-specific RT-qPCR assay (Y-axis). The SARS-CoV-2 negative individuals are represented by the dots in the blue shaded region. The outcome of the infection prediction using universal response genes is summarized as positive (red) and negative (black), using a logistic regression probability cutoff of 0.7. (B) To assess the relationship between the universal response prediction accuracy and the sample viral load, we summarized the prediction truth table comparing universal response prediction outcome and the SARS-CoV-2 RT-qPCR testing result at different viral load cutoffs (only the SARS-CoV-2 positive individuals with the viral load above the cutoff are considered). The corresponding truth table is summarized in the table. (C) To determine the extent of mRNA variation from day to day in human saliva samples, 7 apparently healthy individuals (SS26-SS32) were asked to collect saliva daily for 11 days. Total RNA was isolated from each sample and used as a template for a multiplex TaqMan assay measuring the levels of 15 universal response genes. Five of the universal response genes are shown, and the remainder are shown in FIG. 7. 
For each of the 7 enrollees, their Ct value for each gene was converted to fold change by normalizing it to the Ct value of RPP30, and then again to the abundance of mRNA measured at Day 1. Error bars represent the SEM of 7 individuals.

[0023] FIG. 5: A characterization of the identified universal response genes via gene ontology enrichment analysis. The X-axis, enrichment ratio, is the number of observed genes divided by the number of expected genes in each gene ontology (GO) category. The adjusted P-value indicates the probability of observing the given number of genes in each category by chance. Functions related specifically to antiviral responses are the most enriched, possibly due to an overrepresentation of viruses within the datasets analyzed in panel A, or because innate immunity to viruses is better studied and therefore the genes involved are better annotated.

[0024] FIG. 6: Universal response genes are up- and down-regulated with different kinetics upon infection. Huh7 human liver cells were infected with SARS-CoV-2 at MOI of 0.01 over a time course of 48 hours. Total RNA was harvested 0, 2, 4, 8, 12, 24, and 48 hours post infection. The fold changes of six universal response mRNAs (top of each graph; red data line) and of the SARS-CoV-2 genome (blue data line) were measured by multiplexed TaqMan RT-qPCR assay (see Method). Error bars represent the SEM of 3 biological replicates. Ct value is converted to fold change by normalizing the Ct value to the Ct value of RPP30, and then normalized again to the abundance of mRNA measured in a mock infection. Some universal response genes (CXCL8, IRF9, MX1) are upregulated in the early time points of the infection and then rapidly downregulated within the first 24 hours. This is quite interesting, since this is a low-MOI spreading infection and new cells are constantly getting infected. This would be consistent with a pulse of activity that is then quickly downregulated by a feedback loop. On the other hand, the upregulation of other universal response genes (such as the classical type-I interferon inducible genes, IFIT2, IFITM2, and IFIH1), starts later and increases steadily along with viral genome replication. This result suggests that the abundance of mRNA from any specific universal response gene will depend on the timepoint during infection, even in situations of spreading infections as would be the case in the human body.

[0025] FIG. 7: Abundance of universal response mRNA in human saliva correlates with relative viral load in saliva samples of SARS-CoV-2+ individuals. For universal response genes, we plotted the relative fold change of universal response mRNA in saliva (Y axis) against the concentration of viral genome copies in saliva (X axis). The X axis corresponds to SARS-CoV-2 viral load, determined by RT-qPCR. The Y axis shows the relative fold change of the human mRNA noted at the top of the graph, determined by the TaqMan RT-qPCR assay described in the methods. Each measurement of human mRNA was compared to the average of the same measurement from the saliva of 20 uninfected samples, to calculate the relative fold change that is shown. The horizontal dashed line indicates the fold change of 1. A pink box shows the range of viral loads where people are considered infectious (above 10^6 viral copies/mL). This is because infectious virions are almost never recovered from individuals with viral loads below 10^6 viral copies per mL. Individuals with lower viral loads are either at the beginning of infection, or on the long tail of recovery. Interestingly, the mRNAs of universal response genes accumulate in saliva before this point, at the transition of viral titers to above 10^4 viral copies/mL. This is consistent with a model where mRNAs from universal response genes accumulate in saliva specifically during, and possibly before, periods of acute viral replication.

[0026] FIG. 8: Universal response genes can be found in blood and saliva. On the X axis, the expression levels of human mRNAs in the saliva of SARS-CoV-2+ patients (N=3, SS19-SS21, RNA-seq) were compared to those of uninfected control individuals (N=15, SS1-SS15). The plot shows only genes with fold change >1. On the Y axis is a similar analysis, performed in the blood of individuals from a different SARS-CoV-2 cohort, the recently published COVIDome database. Each dot is a gene, and the universal response genes are shown in red. We find that universal response transcripts (red dots) are as detectable in saliva as in blood, or even more so.

[0027] FIG. 9: mRNA structure is preserved in human saliva samples. Sashimi plot indicating that mRNA structure is preserved during saliva sample collection and processing, so that the exon regions are preferentially sequenced over the introns. Shown here are saliva samples from 5 individuals, with the CXCL8 gene selected as the example.

[0028] FIG. 10: Expression of universal response genes in asymptomatic individuals infected with SARS-CoV-2. Heatmap summarizing mRNA levels from universal response genes in the saliva of SARS-CoV-2-positive individuals and 5 randomly selected uninfected samples (SS33-SS100). Rows represent the 15 universal response mRNAs, measured by RT-qPCR in a multiplex TaqMan assay. Columns are individual enrollees, where the normalized cycle threshold value (Ct) for each mRNA in that enrollee's saliva is compared to the average normalized Ct from 20 uninfected enrollees. The viral load in each saliva sample was measured using a separate RT-qPCR assay and is reported above the heatmap. Importantly, we noticed a strong correlation between the levels of universal response mRNAs observed and the viral load in individuals (top of heatmap). Among saliva samples that carried a high viral load, almost all had an elevated level of universal response mRNAs.

[0029] FIG. 11: Relationship between the universal response screening performance and the probability cutoff for the leave-one-out logistic regression model. In order to assess the performance of the infection screening using the universal response genes, we trained logistic regression models based on the RT-qPCR fold change data from all but one individual from the asymptomatic SARS-CoV-2 cohort (SS33-SS100). We then used the model to predict whether the one individual was infected or not, given a probability cutoff from 0.1 to 0.9 (x-axis). This process is repeated among all individuals, and the prediction result was then compared with the SARS-CoV-2 infection condition determined using the pathogen-specific RT-qPCR assay. The relationship between the probability cutoffs and the comparison outcomes, including specificity (red), sensitivity (blue), and accuracy (black), are summarized in the figure above.
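
The cutoff sweep summarized in this figure may be illustrated, in a non-limiting manner, by the following minimal Python sketch. It assumes that a leave-one-out procedure has already produced one predicted infection probability per individual; the function names `screening_metrics` and `sweep_cutoffs` and the toy probabilities are hypothetical and are not data from the cohort.

```python
def screening_metrics(probabilities, infected, cutoff):
    """Return (sensitivity, specificity, accuracy) at one probability cutoff.

    probabilities: predicted probability of infection per individual.
    infected: matching True/False labels from the pathogen-specific assay.
    An individual is called positive when the probability is >= cutoff.
    """
    tp = fp = tn = fn = 0
    for prob, truth in zip(probabilities, infected):
        called_positive = prob >= cutoff
        if called_positive and truth:
            tp += 1
        elif called_positive and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / total if total else float("nan")
    return sensitivity, specificity, accuracy


def sweep_cutoffs(probabilities, infected):
    """Evaluate the metrics at cutoffs 0.1, 0.2, ..., 0.9."""
    results = []
    for k in range(1, 10):
        cutoff = k / 10
        results.append((cutoff, *screening_metrics(probabilities, infected, cutoff)))
    return results


# Toy values only: three infected and three uninfected individuals.
toy_probabilities = [0.95, 0.80, 0.40, 0.65, 0.20, 0.10]
toy_infected = [True, True, True, False, False, False]

if __name__ == "__main__":
    for row in sweep_cutoffs(toy_probabilities, toy_infected):
        print(row)
```

Raising the cutoff trades sensitivity for specificity, which is why a single operating point such as the 0.7 cutoff of FIG. 4 is chosen only after a sweep of this kind.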

[0030] FIG. 12: Relative fold change of the control genes and the universal response genes over time in healthy human saliva. To determine the extent of mRNA variation from day to day in human saliva samples, 7 individuals (SS26-SS32) were asked to collect saliva on a daily basis over a period of 11 days. Total RNA was isolated from each sample and used as a template in the multiplex TaqMan assay described. Shown here are the one control gene (RACK1) and the 12 universal response genes (IFIH1, IFI6, CXCL10, IFIT3, OAS2, DDX58, IFITM2, MX2, IFI27, IRF9, PARP12 and RTP4) that were quantified. Error bars represent the SEM of 7 individuals. In all panels, Ct value is converted to fold change by normalizing the Ct value to the Ct value of RPP30, and then normalized again to the abundance of mRNA measured on Day 1 for each individual.

[0031] FIG. 13: Optimization of TaqMan assay in cells infected with influenza A virus. A549 human lung cells were infected with Influenza A virus at multiplicity of infection (MOI) of 0.1 for 24 hours. Total RNA was harvested from the cells and 100 ng was used as template in the multiplex TaqMan assay described. To demonstrate the dynamic range and the signal consistency, the raw Ct values are shown in the top panel, and the resulting fold changes are shown in the bottom panel. The error bar indicates the SEM from 2 biological replicates. Ct value is converted to fold change by normalizing the Ct value to the Ct value of RPP30, and then normalized again to the abundance of mRNA measured in a mock infection.
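
The two-step conversion of Ct values to fold change described in this and the preceding legends may be illustrated by the following minimal Python sketch. It is not the authors' analysis code: it assumes the widely used comparative Ct (2^-ΔΔCt) calculation with an amplification efficiency of 100% (a doubling per cycle), and the function name `fold_change`, its parameter names, and the Ct values are hypothetical.

```python
def fold_change(ct_target, ct_control, ct_target_ref, ct_control_ref):
    """Comparative Ct (2^-ddCt) fold change, assuming 100% PCR efficiency.

    ct_target, ct_control: Ct of the response gene and of the control
        gene (for example RPP30) in the sample of interest.
    ct_target_ref, ct_control_ref: the same two Ct values in the
        reference condition (for example a mock infection or Day 1).
    """
    delta_sample = ct_target - ct_control          # normalize to control gene
    delta_reference = ct_target_ref - ct_control_ref
    delta_delta = delta_sample - delta_reference   # normalize to reference
    return 2.0 ** (-delta_delta)


if __name__ == "__main__":
    # Hypothetical Ct values: the target amplifies 3 cycles earlier in the
    # infected sample than in the mock sample, with an unchanged control gene.
    print(fold_change(24.0, 26.0, 27.0, 26.0))
```

Because a lower Ct means more template, a target that crosses threshold three cycles earlier than in the reference, with the control gene unchanged, corresponds to an eight-fold increase under the stated efficiency assumption.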

[0032] FIG. 14: shows 15 host-derived RNA biomarkers that are consistently upregulated during infection by various pathogens. In one embodiment, such host-derived RNA biomarkers may be “general” biomarkers of infection. Previously published RNA sequencing and microarray data was curated from public-domain databases and analyzed using the bioinformatic pipeline illustrated in FIG. 4 below. Vertically, the top 10 host biomarkers are shown and, horizontally, 8 of the studies that carried out infection using 9 different pathogens were chosen for demonstration. In each study, (−) columns indicate mock-infected cells, while (+) columns indicate infected cells. All expression levels of the biomarkers are relative to the mock infection control; red indicates upregulation of that specific biomarker after infection, and blue indicates downregulation (see scale at bottom). Biomarkers were identified and ranked based on how consistently they were upregulated during infection by various pathogens (discussed below and FIG. 4). DENV2=dengue virus type 2; IAV=influenza A virus; HSV=herpes simplex virus; HRV=human rhinovirus; RSV=respiratory syncytial virus. All are viral pathogens except for S. aureus, which is a bacterial pathogen, and Plasmodium falciparum, which is an exemplary eukaryotic pathogen.

[0033] FIG. 15: Certain RNA biomarkers may differentiate between different types of pathogen infection, for example eukaryotic or bacterial versus viral infection. RNA sequencing and microarray datasets (described in the legend to FIG. 1) were further divided into viral versus bacterial and eukaryotic infections. Each subset of data was then analyzed using the biomarker identification pipeline discussed below (and FIG. 4). Biomarkers that are distinctive among viral/bacterial/eukaryotic infection were selected. This embodiment allows the present inventors to distinguish infection origin using host biomarkers. All biomarker expression levels are relative to the mock infection control, red indicates upregulation of that specific biomarker after infection, blue indicates downregulation.

[0034] FIG. 16: Biomarkers that identify infection by different categories of viruses or sites of replication in the human body. RNA sequencing and microarray datasets (described above in FIG. 1 legend) were further divided into different virus categories (here, HIV-1 retrovirus or HSV herpesvirus) or sites of pathogen replication in the human body (here, respiratory viruses). This allows us to further define the nature of the infection using specific host-derived biomarkers of infection. All expression levels of the biomarkers are relative to the mock infection control; red indicates upregulation of that specific biomarker after infection, and blue indicates downregulation.

[0035] FIG. 17: Generalized schematic of bioinformatics pipeline used to identify RNA biomarkers that are indicative of host response to specific infection. High-throughput RNA sequencing (RNA-seq) data or RNA microarray data of host response to infection may be generated, for example by performing qRT-PCR or microarray assays on one or more biological samples that may contain one or more host derived biomarkers, or alternatively curated from publicly accessible databases (NCBI SRA, NCBI GEO). Each RNA-seq or microarray dataset may be generated by different studies. The collection includes multiple cell types and human samples that are infected by different pathogens, including RNA and DNA viruses, and various bacteria species. Additional in vitro and in vivo infection studies may also be carried out to validate and/or generate more reference datasets. In one embodiment, infection-specific biomarkers are generated to differentiate host response that is specific to viral, bacterial, respiratory and/or blood etc. infection. The result summarization step utilizes multiple statistical models to combine the differential expression analysis results from individual studies. Given an unlabeled RNA-seq sample, in silico validation and filtering of biomarkers involves using discovered biomarkers as classification criteria to determine if a given sample is infected.
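
As one non-limiting illustration of how the result summarization step might combine per-study differential expression results, the following minimal Python sketch combines per-study P-values for each gene using Fisher's method and ranks the genes by the combined value. Fisher's method is chosen here purely as an assumption for illustration; the disclosure does not state which statistical models are used. The function names, the placeholder gene `GENE_X`, and the P-values are hypothetical.

```python
import math


def chi2_sf_even_df(x, k):
    """Survival function of a chi-square variable with 2*k degrees of freedom.

    For an even number of degrees of freedom there is a closed form:
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Combine independent P-values from several studies (Fisher's method)."""
    if not p_values:
        raise ValueError("at least one P-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in the interval (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(statistic, len(p_values))


def rank_genes(p_values_by_gene):
    """Return gene names ordered from smallest to largest combined P-value."""
    combined = {g: fisher_combined_p(ps) for g, ps in p_values_by_gene.items()}
    return sorted(combined, key=combined.get)


# Toy input only: per-study P-values for upregulation of two genes.
toy_p_values = {
    "MX1": [0.001, 0.02, 0.004],
    "GENE_X": [0.4, 0.6, 0.03],
}

if __name__ == "__main__":
    print(rank_genes(toy_p_values))
```

Fisher's method assumes the studies are independent; when datasets share samples or controls, a rank-based or effect-size-based combination may be the more cautious choice.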

DETAILED DESCRIPTION OF INVENTION

[0036] In one embodiment, the invention includes systems, methods and compositions for the identification and classification of host biomarkers produced in response to an infection. In one preferred embodiment, the invention includes systems, methods and compositions for the identification and classification of early RNA biomarkers produced by the cell's or subject's innate immune response in response to an infection. Notably, such specific target RNA transcripts or biomarkers produced by a patient's innate immune response may be indicative of early infection. As a result, one embodiment of the inventive technology may include systems, methods and compositions for the detection of these target RNA transcripts, which may act as biomarkers for early infection in a subject.

[0037] In one preferred embodiment of the invention, to identify host-derived RNA biomarkers of infection, cells in culture or in a subject, such as a human subject, may be infected with various pathogens, and then the RNA of the cell or tissues, preferably mammalian tissues, and more preferably human tissue, is collected, sequenced, and compared to a (−) infection control. When different conditions and pathogens are compared to each other, general host RNA biomarkers can be initially derived as shown specifically in FIG. 14, where red boxes indicate that a host gene is upregulated in response to the infection challenge. In a preferred embodiment of the inventive technology, the present inventor may specifically identify universally upregulated genes, like EGR1, that are turned on in all or most infections tested. Such general host RNA biomarkers may be diagnostically indicative of a variety of different types and sites of infection in a subject and may further be used to generate an initial non-specific diagnosis of an early infection in a subject.

[0038] In another preferred embodiment of the invention, the RNA biomarkers produced by the host in response to an infection challenge may be compared between different classes of pathogens. In this manner, specific biomarkers, and preferably host-derived RNA biomarkers, can be identified and classified to indicate different types of infection. For instance, in one embodiment shown in FIG. 15, the present inventors identified biomarkers that differentiate bacterial versus viral infection. In another example shown in FIG. 16, the present inventive technology can be used to identify host-derived biomarkers, and preferably host-derived RNA biomarkers, that are specific to different classes of pathogens (e.g., retroviruses or herpesviruses), or different sites of pathogen replication in the body (e.g., respiratory or gastrointestinal viruses). As outlined in FIG. 17, through in silico validation, the present inventors can employ computer-assisted processes to confirm that each of these sets of biomarkers reliably detects and differentiates viral versus bacterial infection, retrovirus versus other infection, and the like.

[0039] Alternatively, in another embodiment, the target biomarkers can be empirically tested in human or other in vivo trials. For example, one embodiment of the invention includes the validation of target RNA biomarkers of infection using quantitative reverse transcription polymerase chain reaction (RT-PCR) protocols. Biomarkers identified using the methods outlined above may be further confirmed in tissue culture infection experiments. Quantitative RT-PCR (qRT-PCR) of RNA allows specific quantification of the upregulation of candidate biomarkers as a ‘fold change’ in infected cells compared to uninfected cells. Such information helps when evaluating detection sensitivity with respect to a given biomarker. While only twenty-five exemplary biomarker candidates are identified herein, such list should not be construed as limiting on the number of biomarkers that may be identified with the current invention.

[0040] As further highlighted in FIG. 17, high-throughput RNA sequencing (RNA-seq) data as well as quantitative RNA microarray data of the host response to infection may be curated from publicly accessible databases (e.g., NCBI SRA, NCBI GEO) or created in house using in vitro or in vivo infection challenge experiments, or both, to generate biomarker datasets for analysis and identification. Each RNA-seq or RNA microarray dataset may preferably be derived from human cells or tissues that have been infected with one or more pathogens, and then the human RNA response is probed and quantified. A mock (−infection) control or healthy tissue samples may be used in order to subtract out the RNA biomarkers that were already being produced in the cells before they were infected. Notably, while it might seem counter-intuitive to combine datasets from different labs, this can also be of benefit. When RNA-seq and RNA microarray datasets are generated by different groups, in different human cell lines or tissues, using different pathogens, and under different conditions, then any host-derived RNA biomarker of infection upregulated in all of these datasets (see e.g., FIG. 14) has a high probability of being a robust general biomarker.

[0041] In one embodiment the invention may include systems, methods and compositions for the identification and use of one or more host-derived RNA biomarkers of infection. In one preferred embodiment, a first tissue culture experiment can be established and tested to identify target RNA transcripts that may be upregulated during an experimental infection, and that may also be secreted from target cells. RNAs that are upregulated may be used as candidate biomarkers and engineered for compatibility with biomarker detection systems, such as the lateral flow device, as well as qRT-PCR methods and systems generally described by the present inventors in US PCT Application No. PCT/US2020/049290, the specification, figures and sequence identification being incorporated herein by reference. In parallel, RNAs from healthy and infected human saliva may be characterized in a clinical trial (right) in order to identify RNA biomarkers of infection in humans. Those biomarkers, if not already identified in the tissue culture experiments, may be engineered for compatibility with the lateral flow system as generally described above.

[0042] In another embodiment, the invention may include one or more of the host-biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 1-30. In another embodiment, the invention may include one or more virus-specific host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 1-5. In another embodiment, the invention may include one or more retrovirus-specific host RNA biomarkers comprising nucleotide sequences identified in SEQ ID NOs. 6-10. In another embodiment, the invention may include one or more herpesvirus host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 11-15. In another embodiment, the invention may include one or more respiratory virus-specific host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 16-20. In another embodiment, the invention may include one or more eukaryotic pathogen-specific host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 16-20.

[0043] In another embodiment, the invention may include one or more bacteria-specific host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 1-30. In another embodiment, the invention may include the diagnostic use of one or more of the host-biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 1-30. In another embodiment, one or more of the nucleotide sequences identified in SEQ ID NOs. 1-30, and their corresponding encoded mRNA transcript and/or translated polypeptide, may be used as biomarkers for early infection in a subject. In another embodiment, one or more of the nucleotide sequences identified in SEQ ID NOs. 1-30, and their corresponding encoded mRNA transcript and/or translated polypeptide, may be used as biomarkers for identification of the site of replication, or infection, in a subject. In another embodiment, one or more of the nucleotide sequences identified in SEQ ID NOs. 1-30, and their corresponding encoded mRNA transcript and/or translated polypeptide, may be used as biomarkers for identification of pathogen class-specific infection in a subject.

[0044] In another embodiment, identification of one or more RNA biomarkers of infection may help inform treatment of a subject. For example, identification of viral or bacterial-specific host RNA biomarkers may guide a medical practitioner to administer an anti-viral or an antibiotic. It may also, in the case of a viral infection such as SARS-CoV-2, guide a medical practitioner to recommend the subject be quarantined. For example, identification of viral RNA biomarkers associated with a respiratory infection may guide a medical practitioner to administer treatments appropriate for a viral respiratory infection.

[0045] The terminology used herein is for describing embodiments and is not intended to be limiting. As used herein, the singular forms “a,” “an” and “the” include plural referents, unless the content and context clearly dictate otherwise. Thus, for example, a reference to “a biomarker” may include a combination of two or more such biomarkers. Unless defined otherwise, all scientific and technical terms are to be understood as having the same meaning as commonly used in the art to which they pertain. As used herein, “about” or “approximately” means within 10% of a stated concentration range or within 10% of a stated time frame.

[0046] The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.

[0047] Nucleic acids and/or other moieties of the invention may be isolated. As used herein, “isolated” means separate from at least some of the components with which it is usually associated whether it is derived from a naturally occurring source or made synthetically, in whole or in part. Nucleic acids and/or other moieties of the invention may be purified. As used herein, purified means separate from the majority of other compounds or entities. A compound or moiety may be partially purified or substantially purified. Purity may be denoted by weight measure and may be determined using a variety of analytical techniques such as but not limited to mass spectrometry, HPLC, etc.

[0048] As used herein, a biological marker (“biomarker” or “marker”) is a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacological responses to therapeutic interventions, consistent with NIH Biomarker Definitions Working Group (1998). Markers can also include patterns or ensembles of characteristics indicative of particular biological processes. The biomarker measurement can increase or decrease to indicate a particular biological event or process. In addition, if the biomarker measurement typically changes in the absence of a particular biological process, a constant measurement can indicate occurrence of that process. In a preferred embodiment, an RNA biomarker of infection includes one or more RNA transcripts that may be indicative of infection or other normal or abnormal physiological process. It should be noted that where an RNA biomarker of infection is referenced, it includes the sequence of the RNA transcript, whether of the DNA or mRNA sequence, as well as all alternatively spliced RNA transcripts or RNA biomarkers of infection that have undergone an alternative splicing event, as well as related polynucleotides.

[0049] The term “alternative splicing event”, as used herein, designates any sequence variation existing between two polynucleotides arising from the same gene or the same pre-mRNA by alternative splicing. This term also refers to polynucleotides, including splicing isoforms or fragments thereof, comprising said sequence variation. Preferably, said sequence variation is characterized by an insertion or deletion of at least one exon or part of an exon. The term “alternative splicing events” encompasses the original alternative splicing events, the skipping of an exon (Dietz et al., Science 259, 680 (1993); Liu et al., Nature Genet. 16, 328-329 (1997); Nyström-Lahti et al. Genes Chromosomes Cancer 26: 372-375 (1999)), differential splicing due to the cellular environmental conditions (e.g. cell type or physical stimulus) or to a mutation leading to abnormalities of splicing (Siffert et al., Nature Genetics 18: 45-48 (1998)).

[0050] The term “related polynucleotides”, as used herein, refers to polynucleotides having identical sequences except for one or a small number of regions that either have a different sequence, or are deleted or added from one polynucleotide compared to the other. Typical related polynucleotides are splicing isoforms of a same gene, or a gene harboring a genomic deletion or addition compared to another allele of the same gene. Such related polynucleotides may be either full-length polynucleotides such as genomic DNA, mRNAs, full-length cDNAs, or fragments thereof.

[0051] As referred to herein, the terms “nucleic acid”, “nucleic acid molecules” “oligonucleotide”, “polynucleotide”, and “nucleotides” may interchangeably be used. The terms are directed to polymers of deoxyribonucleotides (DNA), ribonucleotides (RNA), and modified forms thereof in the form of a separate fragment or as a component of a larger construct, linear or branched, single stranded, double stranded, triple stranded, or hybrids thereof. The term also encompasses RNA/DNA hybrids. The polynucleotides may include sense and antisense oligonucleotide or polynucleotide sequences of DNA or RNA. The DNA molecules may be, for example, but not limited to: complementary DNA (cDNA), genomic DNA, synthesized DNA, recombinant DNA, or a hybrid thereof. The RNA molecules may be, for example, but not limited to: ssRNA or dsRNA and the like. The terms further include oligonucleotides composed of naturally occurring bases, sugars, and covalent internucleoside linkages, as well as oligonucleotides having non-naturally occurring portions, which function similarly to respective naturally occurring portions. The terms “nucleic acid segment” and “nucleotide sequence segment,” or more generally “segment,” will be understood by those in the art as a functional term that includes both genomic sequences, ribosomal RNA sequences, transfer RNA sequences, messenger RNA sequences, operon sequences, and smaller engineered nucleotide sequences that are encoded or may be adapted to encode, peptides, polypeptides, or proteins. Further, it should be noted that when any sequence is referenced herein, for example a DNA sequence, the corresponding RNA and amino acid sequence is also specifically encompassed in such a disclosure.

[0052] As referred to herein, the term “database” is directed to an organized collection of biological sequence information and/or quantitative measurement of gene expression that may be stored in a digital form. They specifically include open source, as well as non-open source databases. In some embodiments, the database may include any sequence information. In some embodiments, the database may include the genome sequence of a subject or a microorganism. In some embodiments, the database may include expressed sequence information, such as, for example, an EST (expressed sequence tag) or cDNA (complementary DNA) databases. In some embodiments, the database may include non-coding sequences (that is, untranslated sequences), such as, for example, the collection of RNA families (Rfam) which contains information about non-coding RNA genes, structured cis-regulatory elements and self-splicing RNAs. In some embodiments, the databases may include quantitative measurement of expressed gene abundance, such as, for example, the collection of RNA, DNA or cDNA microarray readout. In some embodiments, the databases may include a collection of cDNA sequences captured from biological samples undergoing specific treatment conditions. Such collection of cDNA sequences can be analyzed to determine the relative abundance of gene expressed in the given biological samples, such as, for example, the collection of RNA sequencing data. In exemplary embodiments, the databases may be selected from redundant or non-redundant NCBI SRA database (which is NIH short read sequencing archive database containing publicly available RNA-seq datasets), NCBI GEO database (which is NIH gene expression omnibus database containing publicly available microarray database), NCBI BioProject database (NIH database containing metadata of experimental setup, protocol, patient information etc. relevant to datasets available on NCBI SRA and GEO databases), GenBank databases (which are the NIH genetic sequence database, an annotated collection of all publicly available DNA and RNA sequences). In exemplary embodiments, the databases may be selected from NCBI Short Read Archive databases. Exemplary databases may be selected from, but not limited to: GenBank CDS (Coding sequences database), PDB (protein database), SwissProt database, PIR (Protein Information Resource) database, PRF (protein sequence) database, EMBL Nucleotide Sequence database, NCBI BioProject database, NCBI SRA (Short Read Archive) database, NCBI GEO (Gene Expression Omnibus) database, Broad Institute GTEx (Genotype-Tissue Expression) database, EMBL Expression Atlas, and the like, or any combination thereof.

[0053] As used herein, the term “detection” refers to the qualitative determination of the presence or absence of a microorganism in a sample. The term “detection” also includes the “identification” of a microorganism, i.e., determining the genus, species, or strain of a microorganism according to recognized taxonomy in the art and as described in the present specification. The term “detection” further includes the quantitation of a microorganism in a sample, e.g., the copy number of the microorganism in a microliter (or a milliliter or a liter) or a microgram (or a milligram or a gram or a kilogram) of a sample. The term “detection” also includes the identification of an infection in a subject or sample.

[0054] As used herein the term “pathogen” refers to an organism, including a microorganism, which causes disease in another organism (e.g., animals and plants) by directly infecting the other organism, or by producing agents that cause disease in another organism (e.g., bacteria that produce pathogenic toxins and the like). As used herein, pathogens include, but are not limited to bacteria, protozoa, fungi, nematodes, viroids and viruses, or any combination thereof, wherein each pathogen is capable, either by itself or in concert with another pathogen, of eliciting disease in vertebrates including but not limited to mammals, and including but not limited to humans. The term also specifically includes eukaryotic or protist pathogens, such as the Plasmodium sp. that are the causative agent of Malaria. As used herein, the term “pathogen” also encompasses microorganisms which may not ordinarily be pathogenic in a non-immunocompromised host.

[0055] As used herein, the step of introducing a pathogen to a subject may include both the intentional introduction of a pathogen, such as through a clinical trial, and the natural and unintended introduction of a pathogen to a subject, for example, through a horizontal or vertical pathogen exposure, as well as direct and indirect pathogen transmission, for example including, but not limited to, environmental exposure to a pathogen, zoonotic exposure to a pathogen, vector-borne exposure to a pathogen, or nosocomial exposure to a pathogen.

[0056] The term “infection” or “infect” as used herein is directed to the presence of a microorganism within a subject body and/or a subject cell. For example, a virus may be infecting a subject cell. A parasite (such as, for example, a nematode) may be infecting a subject cell/body. In some embodiments, the microorganism may comprise a virus, a bacterium, a fungus, a parasite, or combinations thereof. According to some embodiments the microorganism is a virus, such as, for example, dsDNA viruses (such as, for example, Adenoviruses, Herpesviruses, Poxviruses), ssDNA viruses (such as, for example, Parvoviruses), dsRNA viruses (such as, for example, Reoviruses), (+) ssRNA viruses, i.e., positive-sense RNA viruses (such as, for example, Picornaviruses, Togaviruses), (−) ssRNA viruses, i.e., negative-sense RNA viruses (such as, for example, Orthomyxoviruses, Rhabdoviruses), ssRNA-RT viruses, i.e., positive-sense RNA viruses with a DNA intermediate in the life cycle (such as, for example, Retroviruses), and dsDNA-RT viruses (such as, for example, Hepadnaviruses). In some embodiments, the microorganism is a bacterium, such as, for example, a gram-negative bacterium, a gram-positive bacterium, and the like. In some embodiments, the microorganism is a fungus, such as a yeast, a mold, and the like. In some embodiments, the microorganism is a parasite, such as, for example, protozoa and helminths or the like. In some embodiments, the infection by the microorganism may inflict a disease and/or a clinically detectable symptom on the subject. In some embodiments, infection by the microorganism may not cause a clinically detectable symptom. In some embodiments, the microorganism is a symbiotic microorganism. In additional embodiments, the microorganism may comprise archaea, protists, microscopic plants (green algae), plankton, and planarians. In some embodiments, the microorganism is unicellular (single-celled). In some embodiments, the microorganism is multicellular.

[0057] As used herein, the term “asymptomatic” refers to an individual who does not exhibit physical symptoms characteristic of being infected with a given pathogen, or a given combination of pathogens.

[0058] The target biomarkers of this invention may be used for diagnostic and prognostic purposes, as well as for therapeutic, drug screening and patient stratification purposes (e.g., to group patients into a number of “subsets” for evaluation), as well as other purposes described herein.

[0059] Some embodiments of the invention comprise detecting in a sample from a patient, a level of a biomarker, wherein the presence or expression levels of the biomarker are indicative of infection or possible infection by one or more pathogens. As used herein, the term “biological sample” or “sample” includes a sample from any bodily fluid or tissue. Biological samples or samples appropriate for use according to the methods provided herein include, without limitation, blood, serum, urine, saliva, tissues, cells, and organs, or portions thereof. A “subject” is any organism of interest, generally a mammalian subject, and preferably a human subject.

[0060] As noted above, in one embodiment qRT-PCR may be utilized to identify one or more host-derived biomarkers of infection. In certain embodiments, intercalator dyes may be used to measure the accumulation of both specific and nonspecific PCR products in an RT-PCR reaction. For example, an intercalator dye such as SYBR Green, or a hydrolysis probe chemistry such as TaqMan, may be used to detect and identify host-derived biomarkers of infection in a qRT-PCR assay.

[0061] Any isothermal amplification protocol can be used according to the methods provided herein. Exemplary types of isothermal amplification include, without limitation, nucleic acid sequence-based amplification (NASBA), loop-mediated isothermal amplification (LAMP), strand displacement amplification (SDA), helicase-dependent amplification (HDA), nicking enzyme amplification reaction (NEAR), signal mediated amplification of RNA technology (SMART), rolling circle amplification (RCA), isothermal multiple displacement amplification (IMDA), single primer isothermal amplification (SPIA), recombinase polymerase amplification (RPA), and polymerase spiral reaction (PSR, available at nature.com/articles/srep12723 on the World Wide Web). In some cases, a forward primer is used to introduce a T7 promoter site into the resulting DNA template to enable transcription of amplified RNA products via T7 RNA polymerase. In other cases, a reverse primer is used to add a trigger sequence of a toehold sequence domain.

[0062] As used herein, the term “amplified” refers to polynucleotides that are copies of a particular polynucleotide, produced in an amplification reaction. An amplified product, according to the invention, may be DNA or RNA, and it may be double-stranded or single-stranded. An amplified product is also referred to herein as an “amplicon”. As used herein, the term “amplicon” refers to an amplification product from a nucleic acid amplification reaction. The term generally refers to an anticipated, specific amplification product of known size, generated using a given set of amplification primers.

[0063] Naturally as can be appreciated, all of the steps as herein described may be accomplished in some embodiments through any appropriate machine and/or device resulting in the transformation of, for example data, data processing, data transformation, external devices, operations, and the like. It should also be noted that in some embodiments, software and/or software solution may be utilized to carry out the objectives of the invention and may be defined as software stored on a magnetic or optical disk or other appropriate physical computer readable media including wireless devices and/or smart phones. In alternative embodiments the software and/or data structures can be associated in combination with a computer or processor that operates on the data structure or utilizes the software. Further embodiments may include transmitting and/or loading and/or updating of the software on a computer perhaps remotely over the internet or through any other appropriate transmission machine or device, or even the executing of the software on a computer resulting in the data and/or other physical transformations as herein described.

[0064] Certain embodiments of the inventive technology may utilize a machine and/or device which may include a general purpose computer, a computer that can perform an algorithm, computer readable medium, software, computer readable medium containing specific programming, a computer network, a server and receiver network, transmission elements, wireless devices and/or smart phones, internet transmission and receiving elements, cloud-based storage and transmission systems, software updateable elements, computer routines and/or subroutines, computer readable memory, data storage elements, random access memory elements, and/or computer interface displays that may represent the data in a physically perceivable transformation such as visually displaying said processed data. In addition, as can be naturally appreciated, any of the steps as herein described may be accomplished in some embodiments through a variety of hardware applications including a keyboard, mouse, computer graphical interface, voice activation or input, server, receiver and any other appropriate hardware device known by those of ordinary skill in the art.

[0065] As used herein, a machine learning system or model is a trained computational model that takes a feature of interest, such as the expression of a host-derived RNA biomarker, and classifies it. Examples of machine learning models include neural networks, including recurrent neural networks and convolutional neural networks; random forest models; restricted Boltzmann machines; recurrent tensor networks; and gradient boosted trees. The term “classifier” (or classification model) is sometimes used to describe all forms of classification model, including deep learning models (e.g., neural networks having many layers) as well as random forest models.

[0066] As used herein, “quantify” means to identify the presence or quantity of an RNA biomarker from a sample.

[0067] As used herein, a machine learning system may include a deep learning model that may include a function approximation method aiming to develop custom dictionaries configured to achieve a given task, be it classification or dimension reduction. It may be implemented in various forms such as by a neural network (e.g., a convolutional neural network), etc. In general, though not necessarily, it includes multiple layers. Each such layer includes multiple processing nodes and the layers process in sequence, with nodes of layers closer to the model input layer processing before nodes of layers closer to the model output. In various embodiments, one layer feeds to the next, etc. The output layer may include nodes that represent various classifications. In certain embodiments, machine learning systems may include artificial neural networks (ANNs), which are a type of computational system that can learn the relationships between an input data set and a target data set. The name ANN originates from a desire to develop a simplified mathematical representation of a portion of the human neural system, intended to capture its “learning” and “generalization” abilities. ANNs are a major foundation in the field of artificial intelligence. ANNs are widely applied in research because they can model highly non-linear systems in which the relationship among the variables is unknown or very complex. ANNs are typically trained on empirically observed data sets. The data set may conventionally be divided into a training set, a test set, and a validation set.
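By way of non-limiting illustration, the conventional division of a data set into training, validation, and test sets may be sketched as follows. The split fractions, the fixed random seed, and the sample identifiers are hypothetical choices made for this example only.

```python
import random


def split_samples(sample_ids, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle sample identifiers and split them into three disjoint sets.

    The fractions and seed are illustrative defaults, not values taken
    from the specification. Samples not assigned to the training or
    validation set fall into the test set.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical sample identifiers.
samples = [f"sample_{i:02d}" for i in range(20)]
train_set, val_set, test_set = split_samples(samples)
print(len(train_set), len(val_set), len(test_set))
```

Keeping the three sets disjoint is what allows the test set to give an unbiased estimate of how a trained model performs on samples it has never seen.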

[0068] Having now described the inventive technology, the same will be illustrated with reference to certain examples, which are included herein for illustration purposes only, and which are not intended to be limiting of the invention.

EXAMPLES

Example 1: Data Pre-Processing

[0069] The present inventors processed the raw microarray or RNA sequencing data through a standardized workflow. For microarray datasets, the pipeline 1) performs background signal correction and signal normalization, 2) annotates probes on the microarray chip with known gene names and accession numbers, and 3) filters probes based on their signal intensities. For RNA sequencing datasets, the pipeline 1) filters out low-quality RNA-seq reads and contaminating sequences, 2) maps the filtered reads to the host (human) genome, 3) determines data quality based on trimming and mapping statistics, and 4) assigns the total number of RNA-seq reads mapped onto each annotated gene within the human genome. The resulting gene expression profiles from both microarray and RNA sequencing datasets are indicative of the relative gene expression level. The pipeline may normalize the read counts based on a set of empirically determined control genes and further conduct differential expression analysis to determine which genes are significantly upregulated within each study.
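The final normalization step above may be illustrated, without limitation, by the following minimal sketch, which scales per-gene read counts by the mean count of a set of control genes and then computes a log2 fold change between an infected and a mock sample. The gene labels, counts, and pseudocount are hypothetical, and the sketch omits the statistical testing that a full differential expression analysis would include.

```python
import math


def normalize_by_controls(counts, control_genes):
    """Scale read counts so the mean count of the control genes equals 1."""
    control_mean = sum(counts[g] for g in control_genes) / len(control_genes)
    if control_mean <= 0:
        raise ValueError("control genes have no reads; cannot normalize")
    return {gene: count / control_mean for gene, count in counts.items()}


def log2_fold_changes(infected, mock, control_genes, pseudocount=1e-3):
    """Per-gene log2(infected / mock) after control-gene normalization.

    pseudocount is a hypothetical small constant that avoids division by
    zero for genes with no reads in one of the samples.
    """
    inf_norm = normalize_by_controls(infected, control_genes)
    mock_norm = normalize_by_controls(mock, control_genes)
    return {
        gene: math.log2((inf_norm[gene] + pseudocount)
                        / (mock_norm[gene] + pseudocount))
        for gene in inf_norm
        if gene in mock_norm
    }


# Placeholder counts for illustration only.
infected_counts = {"CTRL_1": 100, "CTRL_2": 100, "GENE_X": 400}
mock_counts = {"CTRL_1": 50, "CTRL_2": 50, "GENE_X": 50}
lfc = log2_fold_changes(infected_counts, mock_counts,
                        ["CTRL_1", "CTRL_2"], pseudocount=0.0)
print(lfc["GENE_X"])
```

Normalizing to control genes removes the effect of sequencing depth, so that in this example the twofold deeper infected library does not inflate the apparent upregulation of the candidate gene.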

Example 2: Biomarker Discovery

[0070] Based on which host RNA biomarkers are commonly upregulated across different pathogen infections, and how readily they can be detected across different cell types and tissue samples, the present inventors summarized the results from the above data pre-processing steps using statistical methods, including direct merge, combine p-value, combine effect size, combine ranks and/or co-expression analysis. These statistical measures combine the data in a way that accounts for the confidence and reliability of the results.
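As one non-limiting illustration of a “combine p-value” approach, the following sketch applies Fisher's method to the per-study p-values obtained for a single gene. The choice of Fisher's method is an assumption made for illustration, since the specific combination rule is not stated here, and the p-values shown are placeholders.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For even degrees of freedom the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Combine independent p-values for one gene using Fisher's method."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(statistic, 2 * len(p_values))


# Placeholder per-study p-values for a single hypothetical gene.
combined = fisher_combined_p([0.01, 0.01])
print(f"{combined:.6f}")
```

Fisher's method assumes the studies are independent; datasets generated by different groups using different cell lines and pathogens are a reasonable, though not guaranteed, fit for that assumption.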

[0071] Importantly, by focusing on studies that utilized similar infection data from broader categories (e.g. Domain level: virus, bacteria, etc; Viral class: herpesvirus, retrovirus, etc; Site of replication in the body: respiratory virus), the present inventors were also able to identify specific sets of host biomarkers that help differentiate the type of infection as explained below. These discovered biomarkers can either directly move on to empirical testing, or they can be further validated and prioritized by the computer-assisted approaches described in Example 3.

Example 3: In Silico Validation and Filtering

[0072] In another embodiment, the invention may utilize a machine learning system. The summarized host biomarkers may optionally be subject to downstream validation and filtering via supervised machine-learning approaches. In one embodiment, the present inventors provided the classifier (logistic regression, polynomial support vector machine (SVM), Poisson linear discriminant or convolutional neural network) with either the list of biomarkers or random genes (as a control) to construct statistical models around training RNA-seq or RNA microarray datasets. The present inventors then programmed the classifier to determine whether a set of unknown RNA-seq or RNA microarray samples is infected. If the list of biomarkers helps predict the infection condition of the unknown data, the prediction accuracy would be significantly higher compared to the control. To further utilize this approach to filter out less relevant biomarkers from the list, the present inventors removed individual genes from the biomarker list one at a time and repeated the entire classification iteratively. If the removal of a biomarker decreases the prediction accuracy, this suggests that the removed biomarker plays a key role in determining the infection condition. Conversely, if the removal of a biomarker increases, or has no effect on, the prediction accuracy, the removed biomarker could be discarded due to its lack of relevance.
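The iterative removal procedure described above may be sketched, by way of non-limiting example, as a leave-one-gene-out ablation. For brevity, the sketch substitutes a simple nearest-centroid classifier for the classifiers named above; that substitution, along with the gene labels and expression values, is an assumption made solely for illustration.

```python
def nearest_centroid_accuracy(train, test, genes):
    """Accuracy of a nearest-centroid classifier restricted to `genes`.

    train, test: lists of (expression_dict, is_infected) pairs.
    Assumes both classes are present in the training data.
    """
    centroids = {}
    for label in (True, False):
        rows = [expr for expr, lab in train if lab == label]
        centroids[label] = {g: sum(r[g] for r in rows) / len(rows)
                            for g in genes}
    correct = 0
    for expr, lab in test:
        # Squared Euclidean distance to each class centroid.
        dist = {label: sum((expr[g] - c[g]) ** 2 for g in genes)
                for label, c in centroids.items()}
        predicted = min(dist, key=dist.get)
        correct += int(predicted == lab)
    return correct / len(test)


def ablation_scores(train, test, genes):
    """Drop in accuracy when each gene is removed from the biomarker list.

    A positive score suggests the gene contributes to prediction; a score
    of zero or below suggests it is a candidate for removal.
    """
    baseline = nearest_centroid_accuracy(train, test, genes)
    return {
        g: baseline - nearest_centroid_accuracy(
            train, test, [h for h in genes if h != g])
        for g in genes
    }


# Placeholder expression values for illustration only.
train_data = [
    ({"GENE_A": 5.0, "GENE_B": 1.0}, True),
    ({"GENE_A": 6.0, "GENE_B": 2.0}, True),
    ({"GENE_A": 1.0, "GENE_B": 1.0}, False),
    ({"GENE_A": 0.0, "GENE_B": 2.0}, False),
]
test_data = [
    ({"GENE_A": 5.0, "GENE_B": 1.5}, True),
    ({"GENE_A": 1.0, "GENE_B": 1.5}, False),
]
scores = ablation_scores(train_data, test_data, ["GENE_A", "GENE_B"])
print(scores)
```

In this toy example, removing the informative gene lowers accuracy while removing the uninformative gene does not, which is the pattern the filtering step uses to decide which biomarkers to retain.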

Example 4: Virus-Specific Host Biomarkers RNA Sequences

[0073] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a viral infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a viral infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 1-5. In one preferred embodiment, the invention may include the early-detection of a viral infection, such as SARS-CoV-2 (COVID-19), in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 1-5, the detection being accomplished, in one preferred embodiment, by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.

Example 5: Bacteria-Specific Host Biomarkers RNA Sequences

[0074] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a bacterial infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a bacterial infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 6-10. In one preferred embodiment, the invention may include the early-detection of a bacterial infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 6-10, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.

Example 6: Retrovirus-Specific Host Biomarkers RNA Sequences

[0075] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a viral infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a retroviral infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 11-15. In one preferred embodiment, the invention may include the early-detection of a retroviral infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 11-15, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.

Example 7: Herpesvirus-Specific Host Biomarkers RNA Sequences

[0076] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a viral infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a herpesvirus infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 16-20. In one preferred embodiment, the invention may include the early-detection of a herpesvirus infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 16-20, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.

Example 8: Respiratory Virus-Specific Host Biomarkers RNA Sequences

[0077] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a viral infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a respiratory infection, such as SARS-CoV-2 (COVID-19), in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 21-25. In one preferred embodiment, the invention may include the early-detection of a respiratory infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 21-25, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.

Example 9: Eukaryotic and/or Protist Virus-Specific Host Biomarkers RNA Sequences

[0078] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a eukaryotic or protist pathogen infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a eukaryotic or protist pathogen infection, such as Plasmodium falciparum (P. falciparum), the causative agent of malaria, in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 26-30. In one preferred embodiment, the invention may include the early-detection of a eukaryotic or protist pathogen infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 26-30, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.

Example 10: Identification of 69 Human Universal Response Genes to Infection

[0079] In one embodiment, the present inventors identify 69 human “universal response” genes that are upregulated by a broad range of human pathogens. Even when infection resides in distal sites in the body, the mRNAs produced in this universal response are measurable in human saliva. By assessing the abundance of these mRNAs in saliva, we were able to correctly determine whether a person harbors an infection more than 85% of the time. This is true even in the absence of perceived symptoms. As such, the monitoring of these mRNAs in saliva could be a platform for detecting infection in the body, especially as a screening tool for asymptomatic individuals.

[0080] It is striking that there is a core transcriptional response that is triggered by all tested pathogens. Many studies have explored the host gene response to infection, including the 71 studies that we used in the first step of this study (listed in Table 2), or to specific cytokines like interferon. Yet there have been far fewer studies that have looked at commonalities in gene induction by cells infected with different pathogens, and typically these have compared just a few pathogen types. By integrating results from many datasets from a broad range of pathogen types, we identified an asymptotic number of universal response genes (n=69) (SEQ ID NOs. 31-99). Importantly, no new genes were added or subtracted from this list once we surpassed a certain number of datasets analyzed. Thus, we identified the connecting signature that underlies infection, across a broad range of pathogens.

[0081] Importantly, universal response mRNAs are detectable in saliva of infected individuals, regardless of the location of infection. There are two hypotheses to explain why these mRNAs are found in saliva. First, free mRNA, or mRNA encapsulated in dead cells or exosomes, might be entering the oral cavity. This might be occurring for the purpose of targeting these structures for elimination from the body via the gastrointestinal tract. In a second model, interferon and other cytokines produced by a distal infection may be entering the oral cavity and stimulating cells there to execute the transcriptional response that we are measuring. In other words, the mRNA we observe in saliva could be produced or even propagated locally in the mouth. Regardless, the invention highlights the diagnostic value of saliva beyond its current limited use in diagnosing SARS-CoV-2, oral cancers, and Sjögren syndrome.

[0082] To determine which human genes are commonly upregulated in diverse infections, the present inventors first obtained 71 published datasets. These datasets all profiled the transcriptional response of cultured human cells to infection. Studies involving a variety of pathogens were included (29 viruses, 7 bacteria, and 3 fungi), with many of these pathogens represented by more than one dataset (Table 2). Each of the 71 datasets included matched transcript sequencing for infected and mock-infected human cells, usually in multiple replicates (n=387 replicates in all). For each dataset, raw RNA sequencing reads were retrieved from the NCBI Sequence Read Archive (SRA) and analyzed as described in the Methods. We looked for genes that were upregulated in infected conditions (“+” in FIG. 1A) compared to mock infections (“−”). Despite the many variables in these datasets (pathogens, human cell lines, labs conducting the studies), we obtained a list of 69 genes that are consistently upregulated across the array of pathogen types tested (FIG. 1A; genes are listed in Table 3). We refer to these as “universal response” genes. While each infection triggered the expression of many human genes, these 69 genes appear to represent a core transcriptional response that is universal. Universal response genes mainly belong to pathways related to cellular antiviral functions and type-I interferon responses (FIG. 5). Several lines of evidence support the idea that these 69 genes represent a core and universal transcriptional response to infection. First, the number of universal response genes reached an asymptote of 69 genes as more studies were added to the analysis (FIG. 1B). After reaching 69 genes, the addition of more datasets does not add or subtract genes from the set. Second, principal component analysis was performed on the expression data of these 69 genes in all datasets (FIG. 1C). Despite the many variables involved, the main contributor to the data variance (PC1; which explains 80.5% of the variance) cleanly separates these in vitro experiments by condition of infected (triangles) or uninfected (circles). This suggested that levels of mRNAs from this group of 69 genes can differentiate infected from uninfected human cells in all cases.

[0083] We next assessed whether the abundance of these mRNAs in blinded human tissue culture samples could predict whether the cells had been infected or not. Using the 387 samples (meaning, independent experimental replicates) from the 71 in vitro infection datasets, we carried out cross-validation using a logistic regression model. Specifically, we first established the logistic regression classifier using the expression data of the 69 genes in a randomly selected 10% of the samples. This training fraction is much smaller than the 90% typically used in 10-fold cross-validation experiments and was chosen to emphasize the predictive power of the gene set. Next, we evaluated the predictive power of this model to classify the remaining 90% of the 387 samples as infected or not. This cross validation was repeated 10 times, and the accuracy of classification is summarized via a receiver operating characteristic (ROC) curve (FIG. 1D). Overall, the cross validation resulted in a mean area under the curve (AUC) of 0.92, meaning that a randomly chosen infected sample is scored above a randomly chosen mock sample about 92% of the time based on the expression levels of these 69 mRNAs. The worst outcome of the 10 repeats had an AUC of 0.81, and the best an AUC of 0.99.
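
By way of non-limiting illustration, the repeated 10%/90% hold-out evaluation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the implementation used to generate FIG. 1D: the gradient-descent logistic regression, the function names (`auc`, `fit_logistic`, `predict`, `repeated_holdout_auc`), the learning rate and epoch count, and the synthetic five-gene data are all hypothetical and introduced only for illustration.

```python
import math
import random

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain batch gradient descent (assumed optimizer); returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict(model, X):
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def repeated_holdout_auc(X, y, train_frac=0.10, repeats=10, seed=0):
    """Train on a random train_frac of samples, score the rest, repeat.

    Assumes both classes are present in the data; splits that leave either
    the training or the test set with a single class are redrawn.
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    aucs = []
    while len(aucs) < repeats:
        rng.shuffle(idx)
        k = max(2, round(train_frac * len(idx)))
        train, test = idx[:k], idx[k:]
        if len({y[i] for i in train}) < 2 or len({y[i] for i in test}) < 2:
            continue
        model = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = predict(model, [X[i] for i in test])
        aucs.append(auc([y[i] for i in test], scores))
    return aucs

# Synthetic stand-in for expression data: 200 replicates x 5 hypothetical genes,
# with infected replicates (label 1) shifted upward. Not data from the study.
data_rng = random.Random(1)
X, y = [], []
for i in range(200):
    infected = i % 2
    X.append([data_rng.gauss(2.0 * infected, 1.0) for _ in range(5)])
    y.append(infected)

aucs = repeated_holdout_auc(X, y)
print("min/mean/max AUC:", min(aucs), sum(aucs) / len(aucs), max(aucs))
```

The AUC is computed from pairwise comparisons of infected and mock scores, which matches the interpretation of the AUC as the probability that an infected sample is ranked above a mock sample.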

[0084] We then performed additional cross validation analyses among different types of infections (FIG. 1E). We trained the logistic regression classifier using only fungal and bacterial samples and then classified the viral samples as infected or not. This was highly successful and yielded a ROC curve with an AUC=0.93. We then trained the classifier using only viral and bacterial samples and then classified the fungal samples as infected or not (AUC=1.0). Finally, we trained the classifier using a combination of viral and fungal samples, and then classified the bacterial samples as infected or not (AUC=1.0). Collectively, this indicates that the upregulation of these universal response genes in human cell lines can correctly identify infection status, independent of the cell line and pathogen types involved. The fact that training sets on two types of pathogens can classify infections caused by a third proves that these 69 genes truly represent a universal response to infection.

[0085] We next explored whether this group of 69 genes is truly unique, relative to other groups of similar genes. We again performed the same analysis as shown in FIG. 1D, but trained our classifier on genes in relevant gene ontology (GO) terms (shown at the top of graphs; FIG. 1F) instead of the 69 genes identified here. None of the examined gene sets was able to distinguish infected and non-infected conditions to a degree similar to the 69 universal response genes (FIG. 1F). We tried other GO terms (not shown), and were not able to do better than the examples shown. Thus, the 69 host genes we have identified were better able to detect infection than any other human gene set that we examined.

Example 11: Universal Response Genes are also Upregulated in Infected Humans

[0086] We next wanted to determine if universal response genes are upregulated in infected humans. At this point, we transitioned from analyzing data from in vitro infections of human cells to the analysis of data from human biospecimens. We first took advantage of two previously published datasets from human blood, each measuring gene expression by microarray after infection. One study focused on a 34-year-old male health care worker exposed to Ebola virus in Sierra Leone during the 2013-2015 epidemic. Starting 7 days after symptom onset, blood was taken from the individual daily and genome-wide mRNA expression was evaluated by microarray. We extracted from this dataset the expression profiles of the universal response genes (FIG. 2A). A vast majority of the genes are highly upregulated at day 7. Their expression trails off as the person goes through recovery, although the speed of dissipation of these signals is highly variable (a concept explored further in FIGS. 6-7). A few genes at the top of the panel are not upregulated at day 7, with one possibility being that their induction has already dissipated by day 7. In this individual, Ebola virus mRNA was detected between days 7-11, with the peak (Ct=31) at day 9. From this, we can see that the strong upregulation of host universal response genes occurs at least 2 days earlier than the peak of viral load and is sustained much longer.

[0087] Another study focused on 15 individuals experimentally infected with the protist that causes malaria, Plasmodium falciparum. In this study, blood was taken every two days after experimental infection and mRNA transcript abundance was interrogated by microarray, until the point where individuals had detectable pathogen in the bloodstream and/or had symptoms consistent with malaria (indicated as “D” for diagnosed in FIG. 2B). Note that protist pathogens (single-celled eukaryotes) were not represented in the 71 in vitro datasets from which we identified these 69 universal response genes. Nonetheless, more than half of the universal response genes (17/29) that were included on this microarray are upregulated in blood by the time of diagnosis. Based on these two human studies, we conclude that universal response mRNAs are also upregulated in infected humans.

[0088] We next asked whether the abundance of these 69 mRNAs in human saliva could classify humans as infected or not. We find that universal response transcripts can be found to equal degrees in blood and saliva (FIG. 8) so, at this point, we transitioned to analyzing human saliva samples. We first obtained saliva samples from 15 healthy individuals and 8 individuals diagnosed with a variety of infectious diseases. Of the latter, three had been diagnosed with SARS-CoV-2 viral infection, one with Vibrio cholerae bacterial infection, one with Staphylococcus aureus bacterial infection, and one with varicella-zoster virus infection. Two additional saliva samples were included from apparently healthy individuals from whose saliva we were able to map reads corresponding to common respiratory pathogen genomes (see Methods). Total RNA was prepared from each of these 23 human saliva samples, followed by depletion of bacterial and human ribosomal RNA. RNA with high integrity can be readily isolated from saliva (FIG. 9). Libraries were sequenced with high-throughput short-read sequencing.

[0089] We next tested whether the abundance of universal response mRNAs in saliva could determine if a human was harboring an infection. We carried out cross validation and found that a classifier trained on the expression levels of universal response genes in a randomly selected 10% of the in vitro data analyzed above (39 of the 387 experimental replicates from 71 studies), could correctly classify these 23 human saliva samples as having come from someone who is infected or healthy, just from the abundances of these mRNAs in their saliva (FIG. 3C, Mean AUC=0.86). Thus, this classification was made correctly 86% of the time, even with very little training data. Remarkably, the transfer learning approach (trained on in vitro data, then used to classify human biospecimens) only resulted in the loss of 0.06 AUC (0.92 from FIG. 1D compared to 0.86). Classification of patients as infected or not was made correctly 91.2% of the time when all of the in vitro data was used as training data. This means that transcriptional changes observed in infected human cells in culture can be observed with high fidelity in saliva of infected humans.

[0090] Importantly, two of the enrollees in the previous analysis were noted to have no signs of respiratory tract involvement, and some clearly had infection linked to distal sites (gastroenteritis, osteomyelitis/discitis, meningitis), yet these mRNA signatures are reliably detectable in saliva. We next wanted to further confirm that universal response mRNAs can be found in saliva, even when infection is at distal sites in the body. In the next experiment, we included two additional patient saliva samples, one from an enrollee being treated for a Coccidioides fungal infection and another enrollee being treated for Escherichia coli bacterial sepsis stemming from a urinary source. The three enrollees in this experiment were diagnosed with very different infections (viral, fungal, and bacterial) and were specifically noted to not have respiratory involvement in their infections. We used RT-qPCR to quantify mRNA from six of the universal response genes (due to limited sample volumes) from the saliva of these enrollees. We observed from 2- to 10.sup.5-fold upregulation of all six host mRNAs within the saliva of infected individuals compared to three healthy ones (FIG. 3D). In summary, we can detect universal response mRNAs in human saliva, even when there is no apparent respiratory involvement. Again, a viral, bacterial, and fungal infection all lead to this noted over-abundance of universal response mRNAs in saliva.

Example 12: Universal Response Transcripts in Saliva Identified SARS-CoV-2 Infected Individuals in an Asymptomatic, Apparently-Healthy Cohort

[0091] We next asked if this concept would be viable in the context of disease screening, meaning testing people who have no symptoms for the purpose of determining their likelihood of having an infection. During the 2020-21 academic year, the University of Colorado Boulder carried out weekly SARS-CoV-2 screening for students and staff. The screening effort enabled us to enroll university affiliates into an associated human study. We enrolled 68 university affiliates into the study, and each donated a single saliva sample used for both the university RT-qPCR test for SARS-CoV-2, and for analysis of the universal response mRNAs in their saliva. For the latter analysis, we chose samples from individuals who had tested positive (n=48) and negative (n=20) for SARS-CoV-2. What is special about the cohort of 68 individuals is that all had indicated no perceptible symptoms at the time of saliva donation.

[0092] We examined the levels of mRNA from universal response genes in the saliva of these 68 individuals to determine if that information alone could have revealed whether or not they were infected. Instead of sequencing transcripts in saliva, we developed a multiplex TaqMan RT-qPCR assay for measuring 15 of the universal response genes, along with 3 control genes (Methods, Table 5). These 15 genes were chosen to represent a range of expression levels and kinetics amongst the 69 total universal response genes. The expression of these genes in each enrollee is described in FIGS. 7, 10. We next trained a logistic regression model using the RT-qPCR fold-change data from all but one individual. We then classified that (left-out) individual as infected or not (FIG. 4A) by using the trained model and an optimal probability cutoff (FIG. 11). We did this for each individual in the cohort. Overall, we were able to identify SARS-CoV-2-positive individuals with a sensitivity of 79%, specificity of 80%, and overall accuracy of 79%. However, for SARS-CoV-2, infectious virions are almost never recovered from individuals with viral loads below 10.sup.6 viral copies per mL. Individuals with viral loads below this value are either at the beginning of infection, or on the long tail of recovery. A more meaningful analysis for a screening tool would be to ask how often universal response mRNAs in saliva identify people who could be infectious to others. At a >10.sup.6 viral copies per mL cutoff, we were able to identify SARS-CoV-2-positive individuals with sensitivity of 94%, specificity of 80%, and overall accuracy of 87% (FIG. 4B). Importantly, none of these individuals reported symptoms at the time of saliva collection, suggesting that the mRNAs in saliva have more predictive power over infection than even self-perceived symptoms, and that screening based on symptoms would not have identified these people.
apparently healthy individuals who were asked to collect saliva samples daily over a period of 11 days. We then measured the level of universal response mRNAs in their saliva over the time course by RT-qPCR using the multiplex TaqMan assay described above. The expression levels of the universal response genes remained remarkably stable over time (five genes shown in FIG. 4B, the full set in FIG. 12).

[0093] When compared to day 1, transcript abundance in saliva changed no more than 5-fold in subsequent days. Thus, universal response mRNAs are remarkably steady in the saliva of healthy individuals.
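
For illustration only, the sensitivity, specificity, and accuracy values reported in paragraph [0092] are the kind of summary obtained from a confusion table at a chosen probability cutoff, which can be sketched in Python as follows. The criterion behind the "optimal probability cutoff" of FIG. 11 is not stated in this specification, so Youden's J statistic is used here purely as an assumption. The function names, the six toy predictions, and the counts in the final check (38 of 48 positives and 16 of 20 negatives called correctly) are hypothetical: the counts are simply one combination that reproduces the reported rounded percentages for the 48 positive and 20 negative enrollees, and are not taken from the study data.

```python
def confusion(labels, probs, cutoff):
    """Return (tp, fn, tn, fp), calling a sample infected when prob >= cutoff."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= cutoff)
    return tp, fn, tn, fp

def metrics(tp, fn, tn, fp):
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

def best_cutoff(labels, probs):
    """Cutoff maximizing Youden's J = sensitivity + specificity - 1.

    Youden's J is an assumed criterion; the source does not name one.
    """
    def youden(c):
        m = metrics(*confusion(labels, probs, c))
        return m["sensitivity"] + m["specificity"] - 1.0
    return max(sorted(set(probs)), key=youden)

# Hypothetical left-out predictions (1 = infected); not data from the study.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.6, 0.5, 0.2, 0.1]
cutoff = best_cutoff(labels, probs)
print(cutoff, metrics(*confusion(labels, probs, cutoff)))

# Arithmetic check with assumed counts for 48 positives and 20 negatives.
check = metrics(tp=38, fn=10, tn=16, fp=4)
print({name: round(100 * value) for name, value in check.items()})
```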

Example 13: Materials and Methods

[0094] Meta-analysis of NCBI SRA transcriptomics datasets: We carried out a meta-analysis of RNA-seq datasets publicly available in the NCBI SRA (Sequence Read Archive) database. Our criteria for choosing datasets were that human cells in culture were infected with a bacterial, viral, or fungal pathogen, and that the cellular transcriptome was then sequenced along with that of a mock-infected control. We obtained a total of 71 relevant in vitro infection datasets. From these datasets, raw RNA sequencing reads in FASTQ format were downloaded, trimmed using BBDuk (BBMap v38.05), and mapped using HISAT2 v2.1.0 to human genome assembly hg38. Using NCBI RefSeq genome annotation, we then counted the mapped reads assigned to genes or transcripts using featureCounts (Subread v1.6.2).

[0095] First, we looked for genes that were upregulated in each infected dataset versus its matched mock control. For each individual dataset, the infected replicates were compared to the corresponding mock replicates via the DESeq2 Wald test (v3.1.3), from which the fold change and Benjamini-Hochberg adjusted p-values were obtained. Correction for multiple testing was performed throughout. Next, we looked for the subset of these genes that was statistically enriched in infected datasets overall. DESeq2 results from individual datasets were ranked and combined based on the magnitude and consistency of upregulation across the datasets. Specifically, a gene rank, r.sub.g, is assigned within each individual dataset following the formula:

$$r_g = \mathrm{Rank}\left(-\log_{10}\left(\mathrm{Pval}_{\mathrm{Adj}}\right) \times \text{fold change}\right)$$

[0096] Next, to determine which genes were consistently upregulated across different studies, the rank is combined via rank sum statistics. With n studies, the rank sum for each gene, g, is calculated as:

$$RS_g = \sum_{i=1}^{n} r_{g,i}$$

[0097] Hence, each gene is sorted based on the RS.sub.g. We then filtered the gene list based on the within-study adjusted p-value and required that the gene be significant (p.sub.adj<0.05) in 80% of the datasets. As a result, we obtained 69 universal response genes ranked by statistical significance comparing infected vs. mock groups and by the consistency across datasets.
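
As a non-limiting illustration, the ranking, rank-sum combination, and significance filter described in paragraphs [0095]-[0097] can be sketched in Python as follows. Several details are assumptions made for this sketch, not statements of the method actually used: ranks are assigned in ascending order of score, so that a larger rank sum indicates stronger and more consistent upregulation; the fold change is taken on a log2 scale; "in 80% of the datasets" is read as "in at least 80%"; a positive fold change is additionally required by the filter; and ties are broken arbitrarily. The function names and the three-gene, five-dataset input are hypothetical.

```python
import math

def rank_scores(results):
    """Rank genes within one dataset.

    results maps gene -> (log2_fold_change, adjusted_p_value).
    Score = -log10(padj) * fold change; rank 1 = lowest score (assumed direction).
    """
    score = {
        gene: -math.log10(max(padj, 1e-300)) * fc  # floor padj to avoid log10(0)
        for gene, (fc, padj) in results.items()
    }
    ordered = sorted(score, key=score.get)
    return {gene: position + 1 for position, gene in enumerate(ordered)}

def universal_genes(datasets, alpha=0.05, min_frac=0.80):
    """Return genes passing the filter, sorted by descending rank sum."""
    ranks = [rank_scores(d) for d in datasets]
    genes = set.intersection(*(set(d) for d in datasets))
    rank_sum = {g: sum(r[g] for r in ranks) for g in genes}
    keep = [
        g for g in genes
        if sum(d[g][0] > 0 and d[g][1] < alpha for d in datasets)
        >= min_frac * len(datasets)
    ]
    return sorted(keep, key=lambda g: rank_sum[g], reverse=True)

# Hypothetical inputs: gene -> (log2 fold change, adjusted p-value) per dataset.
datasets = [
    {"geneA": (4.0, 1e-10), "geneB": (1.0, 0.01), "geneC": (0.1, 0.9)},
    {"geneA": (3.0, 1e-8), "geneB": (1.5, 0.001), "geneC": (-0.2, 0.8)},
    {"geneA": (5.0, 1e-12), "geneB": (0.5, 0.2), "geneC": (0.0, 1.0)},
    {"geneA": (2.0, 1e-5), "geneB": (1.2, 0.02), "geneC": (0.3, 0.7)},
    {"geneA": (3.5, 1e-9), "geneB": (2.0, 0.003), "geneC": (-0.1, 0.95)},
]
ranked = universal_genes(datasets)
print(ranked)
```

In this toy input, geneB is significant in four of five datasets and therefore sits exactly at the 80% threshold, which shows why the reading of "80%" as inclusive or exclusive matters.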

[0098] Cross-validation using logistic regression models: To evaluate the predictive power of the universal response genes in differentiating infected/uninfected conditions in both in vitro and in vivo RNA-seq datasets, we extracted library size-normalized read counts in transcripts per million format for each sequencing replicate. We next separated the datasets into training and prediction sets. Specifically, 10% of randomly selected sequencing replicates were used to construct the binomial logistic regression model using R package stats (v 3.6.2). The remaining 90% of sequencing replicates were used as the prediction set for evaluation. In the case of in vivo saliva sequencing replicates, the entire dataset was used for prediction. R package ROCR (v1.0.11) was used to generate the ROC curves based on the prediction outcome.
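
For illustration, the conversion of read counts to transcripts per million referred to above can be sketched in Python as follows. This is the commonly used definition and is offered as an assumption, since the specification does not spell out the calculation; the function name `tpm`, the gene names, the counts, and the transcript lengths are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    counts and lengths_bp map gene -> read count and gene -> length in bp.
    """
    # Reads per kilobase corrects for longer transcripts collecting more reads.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    # Rescale so the values for one library sum to one million.
    scale = sum(rpk.values()) / 1e6
    return {gene: value / scale for gene, value in rpk.items()}

# Hypothetical library with three genes; not data from the study.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
values = tpm(counts, lengths_bp)
print(values)
```

Because every library is rescaled to the same total, values computed this way are comparable across sequencing replicates with different depths.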

[0099] For evaluating the predictive power of universal response genes as measured by the TaqMan RT-qPCR assay on SARS-CoV-2 infected/uninfected saliva samples, the relative fold change was calculated by first normalizing the raw Ct values to the corresponding control gene Ct (RPP30) and then comparing to the average normalized Ct of all uninfected individuals. The relative fold change values for each individual were then used for cross validation via logistic regression. Specifically, half of the infected individuals above the said viral load threshold along with half of the uninfected individuals were used as the training set, while the remaining half was used for prediction. The methods for constructing the logistic regression model and for evaluating performance via ROC are the same as above.
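
By way of illustration, the relative fold change calculation described above (normalization to the RPP30 control Ct, followed by comparison to the average normalized Ct of uninfected individuals) can be sketched in Python as follows. The sketch assumes 100% amplification efficiency, i.e. a doubling of product per cycle, which is the usual assumption of the delta-delta Ct method but is not stated explicitly here. The function names and all Ct values are hypothetical, and IFIT2 is used only as an example target.

```python
from statistics import mean

def delta_ct(ct_target, ct_control):
    """Normalize a target Ct to the control gene Ct from the same sample."""
    return ct_target - ct_control

def relative_fold_change(sample, uninfected, target, control="RPP30"):
    """Fold change of `target` in `sample` relative to the uninfected average.

    sample and each entry of uninfected map gene name -> raw Ct value.
    Assumes product doubles every cycle (100% efficiency).
    """
    baseline = mean(delta_ct(u[target], u[control]) for u in uninfected)
    ddct = delta_ct(sample[target], sample[control]) - baseline
    return 2.0 ** (-ddct)

# Hypothetical Ct values; not measurements from the study.
uninfected = [
    {"IFIT2": 30.0, "RPP30": 25.0},
    {"IFIT2": 31.0, "RPP30": 25.0},
]
infected_sample = {"IFIT2": 26.5, "RPP30": 25.0}
fold = relative_fold_change(infected_sample, uninfected, "IFIT2")
print(fold)
```

A lower Ct means more starting template, so a normalized Ct that is four cycles below the uninfected average corresponds to a 16-fold increase under the doubling assumption.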

[0100] Human saliva sample collection, handling, and RNA preparation: Samples SS4, SS5, SS12-SS21, SS24 and SS25 were collected under protocol 17-0562 (U. Colorado Anschutz Medical School; PI Poeschla), where adult participants were consented verbally and donated up to 5 mL of whole saliva. Saliva was collected into Oragene saliva collection kits (DNA Genotek CP-100). The saliva was mixed with the stabilization solution in the collection kit and stored at room temperature for no longer than 2 weeks before being processed for RNA purification. Diagnosis of these individuals was provided in the form of clinical notes. Saliva samples from individuals SS1-SS3, SS6-SS11, SS22, and SS23 were collected under protocol 19-0696 (U. Colorado Boulder, PI Sawyer), where anonymous adults verbally consented and donated up to 2 mL of whole saliva. Saliva was collected into Oragene saliva collection kits as mentioned above. For two individuals, infection status was noticed during RNAseq procedures, and ultimately determined by in silico metagenomic detection with GOTTCHA (v1.0b) applied to the RNAseq reads (additional RNAseq sample preparation and analysis described below). We were able to detect sequencing reads mapping to CoV-NL63 or RSV genomes from the saliva of individuals SS22 and SS23, respectively, so they were presumed to be infected with these pathogens at the time of saliva collection. Saliva samples for apparently healthy individuals over a daily time course (SS26-SS32) were collected under a COVID-19-related sub-study of protocol 19-0696 (U. Colorado Boulder, PI Sawyer), where adult participants consented verbally and donated up to 2 mL of whole saliva per day. The saliva was collected into Oragene saliva collection kits as mentioned above. To purify RNA from saliva samples collected in Oragene saliva collection kits, we used 1 mL of saliva diluted 1:1 in stabilization solution and followed the protocol recommended by the manufacturer (DNA Genotek) to precipitate the nucleic acid. The RNA was further DNase-digested using Turbo DNase (Invitrogen #AM2238) and cleaned up using an RNA clean-up and concentration micro-elute kit (Norgen #61000). The purified RNA was used for RT-qPCR or processed further for RNA-seq.

[0101] To prepare the total RNA for sequencing, we first spiked ERCC RNA spike-in mix (ThermoFisher #4456740) into the saliva total RNA for downstream normalization. We depleted bacterial ribosomal RNA using a pan-bacterial riboPOOL kit (siTOOLS #026). We then prepared the RNA for total RNA sequencing using the KAPA RNA HyperPrep kit with RiboErase to remove human rRNA (Roche #KK8560). Finally, the saliva total RNA libraries were sequenced in 150 bp paired-end format using a NovaSeq 6000 (Illumina) at a depth of 30 million reads.

[0102] Saliva samples for SARS-CoV-2-infected individuals (SS33-SS80) and matched SARS-CoV-2-negative individuals (SS81-SS100) were collected under protocol 20-0417 (U. Colorado Boulder, PI Sawyer), where adult participants 17 years of age or older (under a Waiver of Parental Consent) provided written consent. These samples were collected and tested for the SARS-CoV-2 virus as part of our campus COVID-19 testing initiative during the Fall 2020, Spring 2021, and Summer 2021 semesters. As part of this campus testing operation, university affiliates were asked to fill out a questionnaire to confirm that they did not present any symptoms consistent with COVID-19 at the time of sample donation, and to collect no less than 0.5 mL of saliva into a 5-mL screw-top collection tube. Saliva samples were heated at 95° C. for 30 min on site to inactivate the viral particles for safer handling, and then placed on ice or at 4° C. before being transported to the testing laboratory for RT-qPCR-based SARS-CoV-2 testing performed on the same day. Samples were then kept at −80° C. until RNA preparation. The total RNA of the remaining saliva samples was then purified using TRIzol LS reagent (ThermoFisher #10296028) followed by GeneJET RNA cleanup and concentration kit (ThermoFisher #K0841). The purified total RNA was used for RT-qPCR following the steps described below. Additional saliva samples for general assay development were collected under protocol 20-0068 (U. Colorado Boulder, PI Sawyer), where anonymous adult participants were verbally consented and donated up to 2 mL of whole saliva for use as a reagent in optimization and limit of detection experiments.

[0103] Analysis of high-throughput transcriptomics data from human saliva samples: To profile human transcriptomic changes in human saliva samples, raw RNA sequencing reads in FASTQ format were obtained, trimmed using BBDuk (BBTools v38.05), and mapped using HISAT2 v2.1.0 to human genome assembly hg38 along with the ERCC spike-in sequence reference. Using NCBI RefSeq genome annotation (GRCh38.p13), we then counted the mapped reads assigned to genes or transcripts using featureCounts (Subread v1.6.2). Read counts were first normalized using the R package RUVseq (v1.28.0) to account for library size factors based on the ERCC spike-in counts. Individual samples were then separated into infected and non-infected groups and the differential expression of genes was determined via the DESeq2 (v3.1.3) Wald test, from which the fold change and Benjamini-Hochberg adjusted p-values were obtained.

[0104] RT-qPCR analysis of universal response mRNAs in human saliva: For initial RT-qPCR validation on 3 clinically diagnosed and 3 uninfected samples (FIG. 4D), 2 μL of saliva total RNA was first reverse transcribed to cDNA using poly-dT primers with the SuperScript IV first-strand synthesis system (Invitrogen #18091050). The saliva cDNA was diluted 1:20, and 5 μL of the cDNA dilution was used for each qPCR reaction including 10 μL PowerUp SYBR Green master mix (Applied Biosystems #A25741), 500 nM forward and reverse primers (table below), and nuclease-free water. The qPCR assay was carried out on a QuantStudio3 real-time PCR system (ThermoFisher) consisting of a UDG activation step (50° C. for 2 min, 95° C. for 2 min), 40 cycles of PCR stage (95° C. for 15 s, 60° C. for 60 s, with a 1.6° C./s ramp-up and ramp-down rate), followed by a melt curve stage (95° C. for 15 s, 60° C. for 60 s, slow ramp-up to 95° C. at 0.15° C./s). The cycle threshold (Ct) values were used to calculate relative fold change using the delta-delta Ct method.

TABLE-US-00001
Gene Name   Forward Primer Sequence (5′-3′)   Reverse Primer Sequence (5′-3′)
CALR        TCCCGATCCCAGTATCTATGCC            TCTCTGCTGCCTTTGTTACGC
CXCL8       CCAGGAAGAAACCACCGGAA              CTTGGCAAAACTGCACCTTCAC
EGR1        ACTACCCTAAGCTGGAGGAGA             AGGAAAAGACTCTGCGGTCA
ICAM1       GCAACCTCAGCCTCGCTAT               GGAGTCCAGTACACGGTGAG
IFIH1       ACAGCTTCACCTGGTGTTGGA             ATGGCAAACTTCTTGCATGGCT
IFIT2       CCCTGCCGAACAGCTGAGAA              AGTTGCCGTAGGCTGCTCTC
RSAD2       GTTGGTGAGGTTCTGCAAAGTAGAGTTGCG    TAAGGTAGGAGTCTTTCATCTTCTGGTTAG

Multiplexed RT-qPCR analysis for the quantitative detection of 15 of the universal response mRNAs was carried out using customized and multiplexed TaqMan primer and probe mixes. Together with 3 internal control genes (RPP30, RACK1, and CALR), the levels of all 18 genes are measured in a total of 6 multiplexed reactions (Table 5). Understanding that the contamination of genomic DNA often introduces quantification bias when measuring host gene expression, we explicitly designed primers that span exon junctions and limited the assay elongation time so that only the host mRNA is reverse transcribed and amplified. As each transcript varies in its expression magnitude, we assigned genes into multiplex groups based on similar expression magnitudes observed in the meta-analysis of in vitro datasets and in human saliva. This minimizes competition for amplification reagents. Specifically, to determine the host gene expression levels, 1.5 μL of customized TaqMan multiplex probes were mixed with 5 μL 4X TaqPath 1-step multiplex master mix (ThermoFisher #A28526), 5 μL of saliva total RNA, and 8.5 μL of nuclease-free water. The RT-qPCR assay was carried out on a QuantStudio3 Real-time PCR system (ThermoFisher) consisting of a reverse transcription stage (25° C. for 2 min, 50° C. for 15 min, 95° C. for 2 min) followed by 40 cycles of PCR stage (95° C. for 3 s, 55° C. for 30 s, with a 1.6° C./s ramp-up and ramp-down rate). The cycle threshold (Ct) values were used to calculate relative fold change using the delta-delta Ct method. For the choice of internal control genes, we combined the meta-analysis (FIG. 1; cell culture experiments) and the saliva RNA-seq datasets (FIG. 3; human samples) to select genes for which the expression level remained most constant and abundant across the various conditions inherent to these experiments.

[0105] We optimized this TaqMan assay on RNA harvested from A549 human lung cells that were mock infected or infected with influenza A virus (H3N2/Udorn/307/72) at an MOI of 0.1 for 24 hours. Human lung epithelial cells (A549) were plated at 1×10⁶ cells/well in a 6-well plate. The next day, the cells were infected with influenza A virus at an MOI of 0.1 in serum-free media containing 1.0% bovine serum albumin. After a 1 hour incubation, the inoculum was removed and replaced with growth media containing 1 μg/mL of N-acetylated trypsin. At 24 hours post-infection, total RNA was harvested using the QIAGEN RNeasy Mini kit (QIAGEN #74104). Using these samples, we confirmed that the assay can measure each mRNA over a large dynamic range (Ct 15-40) with a small amount of input RNA (≥100 ng) (FIG. 13). At this moderate MOI and relatively short infection timepoint, 14 of the 15 measured genes were already upregulated. The mRNA upregulation in infected cells ranged from 2.6-fold (CXCL8) to 6.1×10⁵-fold (OAS2).

[0106] Infection of Huh7 cells with SARS-CoV-2: Human hepatoma (Huh7) cells (gift from Charles Rice, Rockefeller University) were grown in 1X DMEM (ThermoFisher cat. no. 12500062) supplemented with 2 mM L-glutamine (Hyclone cat. no. H30034.01), non-essential amino acids (Hyclone cat. no. SH30238.01), and 10% heat-inactivated fetal bovine serum (FBS) (Atlas Biologicals cat. no. EF-0500-A). The virus strain used for the assay was SARS-CoV-2 USA-WA1/2020, passage 3. Virus stocks were obtained from BEI Resources and amplified in Vero E6 cells to passage 3 (P3) with a titer of 5.5×10⁵ PFU/mL. Cells were resuspended to 6.0×10⁵ cells/mL in 10% DMEM and seeded at 2 mL/well in 6-well plates. The plates were then incubated for approximately 24 hours (h) at 37° C., 5% CO2 for the cells to adhere prior to infection. Cells were infected with SARS-CoV-2 at an MOI of 0.01. Samples were harvested at 0, 2, 4, 8, 12, 24, and 48 hours post infection in 200 μL TRIzol reagent for RNA extraction following the manufacturer's protocol.
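As a worked illustration of the quantities given above, the following sketch estimates the volume of P3 virus stock per well that corresponds to an MOI of 0.01. It is not part of the described protocol: the helper name is hypothetical, and the calculation assumes that the cell number at the time of infection equals the number seeded (no growth during the roughly 24 h adherence period), which the text does not state.

```python
def inoculum_volume_ul(cells_per_well, moi, titer_pfu_per_ml):
    """Volume of virus stock (in microliters) giving the requested MOI."""
    # MOI is plaque-forming units per cell, so PFU needed = cells x MOI.
    pfu_needed = cells_per_well * moi
    # Convert PFU to stock volume, then mL to microliters.
    return pfu_needed / titer_pfu_per_ml * 1000.0


# Values taken from the paragraph above.
cells_per_well = 6.0e5 * 2        # 6.0e5 cells/mL seeded at 2 mL/well
stock_titer = 5.5e5               # PFU/mL for the P3 stock
volume_ul = inoculum_volume_ul(cells_per_well, moi=0.01,
                               titer_pfu_per_ml=stock_titer)
print(round(volume_ul, 1))  # 21.8
```

Under these assumptions each well holds 1.2×10⁶ cells and needs 1.2×10⁴ PFU, or about 21.8 μL of stock; if the cells divide before infection, the required volume scales up proportionally.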

TABLES

[0107]

TABLE-US-00002
TABLE 1 Exemplary Host Biomarker identification
SEQ ID NO. 1: indoleamine 2,3-dioxygenase 1 (IDO1) (mRNA)
SEQ ID NO. 2: interferon induced protein with tetratricopeptide repeats 2 (IFIT2), (mRNA)
SEQ ID NO. 3: guanylate binding protein 4 (GBP4), (mRNA)
SEQ ID NO. 4: ISG15 ubiquitin like modifier (ISG15), (mRNA)
SEQ ID NO. 5: radical S-adenosyl methionine domain containing 2 (RSAD2), (mRNA)
SEQ ID NO. 6: methionine adenosyltransferase 1A (MAT1A), (mRNA)
SEQ ID NO. 7: caspase 16, pseudogene (CASP16P), (non-coding RNA)
SEQ ID NO. 8: U1 small nuclear 2 (RNU1-2), (small nuclear RNA)
SEQ ID NO. 9: ArfGAP with GTPase domain, ankyrin repeat and PH domain 11 (AGAP11), (mRNA)
SEQ ID NO. 10: synaptotagmin 4 (SYT4), (mRNA)
SEQ ID NO. 11: glutaminyl-peptide cyclotransferase (QPCT), (mRNA)
SEQ ID NO. 12: interleukin 2 (IL2), (mRNA)
SEQ ID NO. 13: brain abundant membrane attached signal protein 1 (BASP1), transcript variant 1, (mRNA)
SEQ ID NO. 14: family with sequence similarity 30 member A (FAM30A), (long non-coding RNA)
SEQ ID NO. 15: tetraspanin 13 (TSPAN13), (mRNA)
SEQ ID NO. 16: WWC2 antisense RNA 2 (WWC2-AS2), (long non-coding RNA)
SEQ ID NO. 17: prothymosin alpha (PTMA), transcript variant X5, (mRNA)
SEQ ID NO. 18: zinc finger protein 296 (ZNF296), (mRNA)
SEQ ID NO. 19: F-box and WD repeat domain containing 4 pseudogene 1 (FBXW4P1), (non-coding RNA)
SEQ ID NO. 20: SRY-box transcription factor 3 (SOX3), (mRNA)
SEQ ID NO. 21: C-C motif chemokine ligand 8 (CCL8), (mRNA)
SEQ ID NO. 22: cytochrome P450 family 1 subfamily B member 1 (CYP1B1), (mRNA)
SEQ ID NO. 23: long intergenic non-protein coding RNA 2057 (LINC02057), (long non-coding RNA)
SEQ ID NO. 24: adrenoceptor alpha 2B (ADRA2B), (mRNA)
SEQ ID NO. 25: UDP-GlcNAc:betaGal beta-1,3-N-acetylglucosaminyltransferase 6 (B3GNT6), (mRNA)
SEQ ID NO. 26: ankyrin repeat domain 22 (ANKRD22), (mRNA)
SEQ ID NO. 27: FERM domain containing 3 (FRMD3), transcript variant 1, (mRNA)
SEQ ID NO. 28: leucine aminopeptidase 3 (LAP3), (mRNA)
SEQ ID NO. 29: syntaxin 11 (STX11), (mRNA)
SEQ ID NO. 30: toll like receptor 7 (TLR7), (mRNA)

TABLE-US-00003
TABLE 2 Transcriptomics datasets used for the discovery of human universal response genes
SRP Index | Human cell line | Pathogen | Abbreviation | Hour Post Infection | Sequencing Data Type
SRP044763 | IMR90 | Adenovirus | ADV | 24 | mRNA
SRP163661 | MRC5 | Adenovirus | ADV | 24 | Total
SRP202003 | HepG2 | Crimean-Congo hemorrhagic fever virus | CCHFV | 72 | Total
SRP078309 | A549 | Dengue virus 2 | DENV2 | 36 | Total
SRP130978 | HUH751 | Dengue virus 2 | DENV2 | NA | Total
SRP132737 | Huh7 | Dengue virus 2 | DENV2 | 18 | Total
SRP188490 | HEK293 | Dengue virus 2 | DENV2 | 18 | Total
SRP101856 | DC | Ebola virus | EBOV | 24 | Total
SRP111145 | ARPE19 | Ebola virus | EBOV | 24 | Total
SRP131318 | Rhabdomyosarcoma | Enterovirus | EV | 6 | Total
SRP060253 | AGS | Epstein-Barr virus | EBV | NA | Total
SRP255890 | B Cell | Epstein-Barr virus | EBV | NA | Total
SRP272684 | B Cell Lymphoma | Epstein-Barr virus | EBV | 24 | Total
SRP212863 | HUVEC | Hantaan Orthohantavirus | HTNV | 72 | Total
SRP158789 | HepG2 | Hepatitis B virus | HBV | 72 | Total
SRP187206 | HUH751 | Hepatitis C virus | HCV | 148 | Total
SRP091538 | HepG2 | Hepatitis E virus | HEV | 120 | Total
SRP117344 | KMB17 | Herpes Simplex virus 1 | HSV-1 | 48 | Total
SRP154536 | HEK293 | Herpes Simplex virus 1 | HSV-1 | 4 | Total
SRP163661 | MRC5 | Herpes Simplex virus 1 | HSV-1 | 9 | Total
SRP177947 | THP1 | Herpes Simplex virus 1 | HSV-1 | 24 | Total
SRP189489 | HFF | Herpes Simplex virus 1 | HSV-1 | 8 | Total
SRP065236 | HFF | Herpes Simplex virus 2 | HSV-2 | 8 | Total
SRP065236 | EC | Human Cytomegalovirus | HCMV | 48 | Total
SRP065236 | HFF | Human Cytomegalovirus | HCMV | 48 | Total
SRP085236 | NPC | Human Cytomegalovirus | HCMV | 48 | Total
SRP163661 | MRC5 | Human Cytomegalovirus | HCMV | 48 | Total
SRP266618 | NTT | Human Cytomegalovirus | HCMV | 24 | Total
SRP065236 | CD4+ T Cell | Human Immunodeficiency virus 1 | HIV-1 | 120 | Total
SRP155217 | CD4+ T Cell | Human Immunodeficiency virus 1 | HIV-1 | 72 | Total
SRP155822 | Ileum organoid | Human Norovirus | HuNoV | 48 | Total
SRP223234 | HFK | Human Papillomavirus | HPV | NA | Total
SRP253951 | A549 | Human Parainfluenza virus 3 | HPIV3 | 24 | Total
SRP183819 | HNEpC | Human Rhinovirus | HRV | 48 | Total
SRP161185 | ATII | Influenza A virus | IAV | 24 | Total
SRP230823 | HeLa | Influenza A virus | IAV | 24 | Total
SRP234025 | A549 | Influenza A virus | IAV | 48 | Total
SRP253951 | A549 | Influenza A virus | IAV | 9 | Total
SRP272285 | A549 | Influenza A virus | IAV | 6 | Total
SRP277269 | 293T | Influenza A virus | IAV | 6 | Total
SRP261173 | A549 | Influenza A virus | IAV | 12 | Total
SRP170549 | Calu3 | Middle East respiratory syndrome coronavirus | MERS-CoV | 24 | Total
SRP227272 | Calu3 | Middle East respiratory syndrome coronavirus | MERS-CoV | 24 | mRNA
SRP096169 | HFF | Orf virus | ORFV | 8 | Total
SRP277439 | HEK293 | Porcine Rotavirus | PoRV | 12 | Total
SRP229586 | A549 | Respiratory Syncytial virus | RSV | 36 | Total
SRP229586 | H292 | Respiratory Syncytial virus | RSV | 36 | Total
SRP229586 | HBEC | Respiratory Syncytial virus | RSV | 36 | Total
SRP253951 | A549 | Respiratory Syncytial virus | RSV | 24 | Total
SRP115192 | HSAEpC | Rift Valley Fever virus | RVFV | 18 | Total
SRP094462 | HInEpC | Rotavirus | ROTAV | 6 | Total
SRP253951 | A549-ACE2 | Severe acute respiratory syndrome coronavirus 2 | SARS-CoV-2 | 24 | Total
SRP270617 | PHAE | Severe acute respiratory syndrome coronavirus 2 | SARS-CoV-2 | 48 | Total
SRP273473 | DC | Severe acute respiratory syndrome coronavirus 2 | SARS-CoV-2 | 2 | Total
SRP273473 | MAC | Severe acute respiratory syndrome coronavirus 2 | SARS-CoV-2 | 2 | Total
SRP278618 | iPSC-derived cardiomyocyte | Severe acute respiratory syndrome coronavirus 2 | SARS-CoV-2 | 48 | Total
SRP061284 | MeWo | Varicella-zoster virus | VZV | 24 | Total
SRP225661 | A549 | West Nile virus | WNV | 24 | Total
SRP142592 | hNSC | Zika virus | ZIKV | 72 | Total
SRP251704 | A549 | Zika virus | ZIKV | 48 | Total
SRP253197 | HepG2 | Zika virus | ZIKV | 48 | Total
SRP296743 | PBMC | Aspergillus fumigatus | A. fumigatus | 24 | Total
SRP296743 | PBMC | Candida albicans | C. albicans | 24 | Total
SRP296743 | PBMC | Rhizopus oryzae | R. oryzae | 24 | Total
SRP285913 | HeLa | Chlamydia trachomatis | C. trachomatis | 44 | Total
SRP321546 | DLD-1 | Fusobacterium nucleatum | F. nucleatum | 24 | Total
SRP321940 | Primary human trophoblasts | Listeria monocytogenes | L. monocytogenes | 5 | Total
ERP020415 | TRP-1 | Mycobacterium tuberculosis | M. tuberculosis | 48 | Total
ERP115551 | hBMECs | Neisseria meningitidis | N. meningitidis | 6 | mRNA
SRP263458 | HUVEC | Staphylococcus aureus | S. aureus | 16 | Total
SRP072326 | A549 | Streptococcus pneumoniae | S. pneumoniae | 2 | Total

TABLE-US-00004
TABLE 3 The 69 universal response genes in humans
RefSeq Accession | Gene Symbol
NM_030641 | APOL6
NM_001165 | BIRC3
NM_004335 | BST2
NM_001565 | CXCL10
NM_000584 | CXCL8
NM_014314 | DDX58
NM_017631 | DDX60
NM_024119 | DHX58
NM_138287 | DTX3L
NM_004417 | DUSP1
NM_004419 | DUSP5
NM_004420 | DUSP8
NM_001964 | EGR1
NM_001432 | EREG
NM_005252 | FOS
NM_002053 | GBP1
NM_052941 | GBP4
NM_001945 | HBEGF
NM_016323 | HERC5
NM_006734 | HIVEP2
NM_005514 | HLA-B
NM_000201 | ICAM1
NM_005532 | IFI27
NM_006417 | IFI44
NM_006820 | IFI44L
NM_002038 | IFI6
NM_022168 | IFIH1
NM_001547 | IFIT2
NM_001549 | IFIT3
NM_012420 | IFIT5
NM_003641 | IFITM1
NM_006435 | IFITM2
NM_002176 | IFNB1
NM_172140 | IFNL1
NM_016584 | IL23A
NM_001570 | IRAK2
NM_006084 | IRF9
NM_005101 | ISG15
NM_002228 | JUN
NM_015907 | LAP3
NM_002462 | MX1
NM_002463 | MX2
NM_020529 | NFKBIA
NM_012118 | NOCT
NM_002535 | OAS2
NM_006187 | OAS3
NM_003733 | OASL
NM_022750 | PARP12
NM_017554 | PARP14
NM_021127 | PMAIP1
NM_152542 | PPM1K
NM_014330 | PPP1R15A
NM_000958 | PTGER4
NM_006509 | RELB
NM_014470 | RND1
NM_080657 | RSAD2
NM_022147 | RTP4
NM_002999 | SDC4
NM_003745 | SOCS1
NM_007315 | STAT1
NM_003764 | STX11
NM_017633 | TENT5A
NM_001561 | TNFRSF9
NM_003141 | TRIM21
NM_080745 | TRIM69
NM_017414 | USP18
NM_033390 | ZC3H12C
NM_003407 | ZFP36
NM_021035 | ZNFX1

TABLE-US-00005
TABLE 4 Top 30 differentially up- and down-regulated genes from comparison between infected and healthy saliva
Gene Symbol | Log2(Fold Change) | Adjusted P-value
CHRNA5 | 6.05 | 9.35E−76
IL2RA | 6.07 | 1.08E−71
STS | 6.02 | 7.91E−69
BAG5 | 5.80 | 9.31E−64
HBD | 7.01 | 3.53E−53
POR | 6.03 | 4.83E−50
LCN10 | 6.38 | 4.06E−46
C10orf55 | 7.06 | 9.76E−44
TWIST1 | 6.35 | 1.08E−43
CA2 | 6.97 | 1.19E−43
NR0B1 | 7.13 | 7.96E−43
GALE | 5.83 | 1.04E−42
TENT5A | 6.15 | 2.69E−42
WRN | 5.11 | 3.91E−42
NOS3 | 5.95 | 5.09E−41
HBEGF | 5.00 | 8.94E−41
DRD4 | 6.13 | 5.62E−40
NCMAP | 6.31 | 3.29E−39
REN | 5.61 | 7.10E−39
FGG | 4.98 | 2.07E−37
HADHA | 5.01 | 8.57E−37
HBG2 | 7.61 | 2.11E−36
HOXD13 | 4.86 | 2.50E−36
KITLG | 5.31 | 1.18E−35
CHRNB1 | 5.74 | 1.08E−32
ITGB3 | 4.59 | 2.63E−32
BST2 | 6.03 | 3.66E−32
OR56B1 | 7.34 | 4.66E−31
HBG1 | 8.01 | 5.45E−31
RND1 | 7.31 | 6.27E−31
LOC102723665 | −3.38 | 1.86E−06
GCSAM | −4.12 | 1.84E−05
TAAR9 | −5.50 | 2.94E−05
CDCA7L | −3.59 | 1.16E−04
MIR320B2 | −4.81 | 1.47E−04
HULC | −5.84 | 1.49E−04
ZNF235 | −3.25 | 2.40E−04
SLC39A12 | −3.05 | 3.28E−04
IVNS1ABP | −3.87 | 3.58E−04
KLHDC4 | −3.96 | 4.01E−04
SERPINB5 | −3.57 | 4.41E−04
LOC101927143 | −4.42 | 4.45E−04
VAV2 | −3.29 | 4.68E−04
DSEL | −4.39 | 5.69E−04
RPL22 | −2.67 | 7.18E−04
LINC01085 | −3.48 | 7.23E−04
ERVW-1 | −3.94 | 8.02E−04
SLC25A25-AS1 | −3.54 | 8.58E−04
THOC5 | −2.59 | 9.56E−04
UXT-AS1 | −4.49 | 1.21E−03
TRI-AAT1-1 | −3.34 | 1.37E−03
AKAP4 | −3.07 | 1.76E−03
TADA2A | −2.58 | 2.03E−03
LRRC7 | −3.49 | 2.71E−03
LEMD1-AS1 | −3.55 | 3.02E−03
GNG14 | −3.82 | 3.37E−03
ZNF461 | −3.55 | 3.77E−03
LINC01781 | −2.66 | 4.07E−03
SAMD13 | −3.46 | 4.65E−03
SLAMF8 | −1.81 | 5.00E−03

TABLE-US-00006
TABLE 5 Multiplex TaqMan RT-qPCR assay for monitoring host immune gene signature expression.
Group | Gene Target | Primer Name | Primer sequence (5′->3′) | Probe Sequence (5′->3′) | Probe Dye
1 (Controls) | CALR | CALR_F | GAGTATTCTCCCGATCCCAGTATCTATGCC | ATGAGGCATACGCTGAGGAGTTTGG | ABY
1 (Controls) | CALR | CALR_R | ATTTGTTTCTCTGCTGCCTTTGTTACGCCC | |
1 (Controls) | RACK1 | RACK1_F | TCCCACTTTGTTAGTGATGTGGTTATCTCC | CAGTTTGCCCTCTCAGGCTCCT | VIC
1 (Controls) | RACK1 | RACK1_R | CAAATCGCCTCGTGGTGGTGCCCGTTGTGAG | |
1 (Controls) | RPP30 | RPP30_F | AGATTTGGACCTGCGAGCG | TTCTGACCTGAAGGCTCTGCGCG | FAM
1 (Controls) | RPP30 | RPP30_R | GAGCGGCTGTCTCCACAAGT | |
2 | DDX58 | DDX58_F | CCGGAAGACCCTGGACCCTA | TTAGGGAGGAAGAGGTGCAG | ABY
2 | DDX58 | DDX58_R | AGGGCATCCAAAAAGCCACG | |
2 | IFIT2 | IFIT2_F | CCCTGCCGAACAGCTGAGAA | CTGCAACCATGAGTGAGAAC | VIC
2 | IFIT2 | IFIT2_R | AGTTGCCGTAGGCTGCTCTC | |
2 | IFITM2 | IFITM2_F | ATAGCATTCGCGTACTCCGT | TGCCTCCACCGCCAAGTGC | FAM
2 | IFITM2 | IFITM2_R | TGATGCCTCCTGATCTATCGC | |
3 | Mx1 | Mx1_F | TAGAGAGCTGCCAGGCTTTG | TACACACCGTGACGGATATG | ABY
3 | Mx1 | Mx1_R | ATCTGTGAAAGCAAGCCGGA | |
3 | IFI6 | IFI6_F | TCGCTGCTGTGCCCATCTATC | CTGCTGCTCTTCACTTGC | VIC
3 | IFI6 | IFI6_R | TTCTTACCTGCCTCCACCCCAC | |
3 | IFIT3 | IFIT3_F | ACAGCAGAGACACAGAGGGCA | TCATGAGTGAGGTCACCAAG | FAM
3 | IFIT3 | IFIT3_R | AGCTGTGGAAGGATTTTCTCCAGG | |
4 | IFI27 | IFI27_F | GCCACGGAATTAACCCGAGC | CATCAGCAGTGACCAGTGTG | ABY
4 | IFI27 | IFI27_R | GCCACAACTCCTCCAATCACA | |
4 | IFIH1 | IFIH1_F | ACAGCTTCACCTGGTGTTGGA | CGAAGCAAGCCAAAGCTGAAG | VIC
4 | IFIH1 | IFIH1_R | ATGGCAAACTTCTTGCATGGCT | |
4 | PARP12 | PARP12_F | ACCATGCAAACCTGCAATACC | TCCAGGCCCGAAGAGCATC | FAM
4 | PARP12 | PARP12_R | GCAGCGTGCGGTTAAAGAG | |
5 | IRF9 | IRF9_F | GCTCTTCAGAACCGCCTACTTC | CTCCAGCCATACTCCACAGAATC | ABY
5 | IRF9 | IRF9_R | CTCCAGCAAGTATCGGGCAA | |
5 | CXCL10 | CXCL10_F | TGCAAGCCAATTTTGTCCACG | AGCAGTTAGCAAGGAAAGGTC | VIC
5 | CXCL10 | CXCL10_R | GCCTCTGTGTGGTCCATCCT | |
5 | Mx2 | Mx2_F | CATGATTGTGAAGTGCCGGG | CTGAGCTTGGCAGAGGCAAC | FAM
5 | Mx2 | Mx2_R | CAACGGGAGCGATTTTTGGA | |
6 | OAS2 | OAS2_F | CGTTGGTGTTGGCATCTTCTG | CCAGTCCCATCCTTGAAGCAG | ABY
6 | OAS2 | OAS2_R | TGCATTGTCGGCACTTTCC | |
6 | CXCL8 | CXCL8_F | CCAGGAAGAAACCACCGGAA | TGGCCGTGGCTCTCTTG | VIC
6 | CXCL8 | CXCL8_R | CTTGGCAAAACTGCACCTTCAC | |
6 | RTP4 | RTP4_F | TGGACGCTGAAGTTGGATGGC | CTCTCTGTTGGTATTGCTTC | FAM
6 | RTP4 | RTP4_R | CAACTTCGCTGGCAGGAGGAA | |

CPC classification

G16B25/10 (PHYSICS)
C12Q1/6888 (CHEMISTRY; METALLURGY)
G16H10/40 (PHYSICS)
C12Q1/702 (CHEMISTRY; METALLURGY)
C12Q2600/158 (CHEMISTRY; METALLURGY)
C12Q1/689 (CHEMISTRY; METALLURGY)
G16H50/20 (PHYSICS)
C12Q1/705 (CHEMISTRY; METALLURGY)
C12Q1/6893 (CHEMISTRY; METALLURGY)
C12Q1/6883 (CHEMISTRY; METALLURGY)
C12Q2600/112 (CHEMISTRY; METALLURGY)

International classification

C12Q1/6888 (CHEMISTRY; METALLURGY)
G16B25/10 (PHYSICS)
G16H10/40 (PHYSICS)
G16H50/20 (PHYSICS)
