Identification of Host RNA Biomarkers of Infection
20220298584 · 2022-09-22
Inventors
- Sara L. SAWYER (Boulder, CO, US)
- Robin DOWELL (Boulder, CO, US)
- Qing YANG (Longmont, CO, US)
- Nicholas R. MEYERSON (Broomfield, CO, US)
Cpc classification
G16B25/10
PHYSICS
C12Q1/6888
CHEMISTRY; METALLURGY
G16H10/40
PHYSICS
G16H50/20
PHYSICS
C12Q1/705
CHEMISTRY; METALLURGY
C12Q1/6883
CHEMISTRY; METALLURGY
International classification
C12Q1/6888
CHEMISTRY; METALLURGY
G16B25/10
PHYSICS
G16H10/40
PHYSICS
Abstract
The inventive technology includes novel systems, methods and compositions for the identification and classification of host-derived RNA biomarkers produced in response to an infection.
Claims
1-77. (canceled)
78. A method of identifying general host-derived RNA biomarkers of infection comprising the steps of: a) establishing a first biological sample, wherein said first biological sample comprises a tissue sample infected with a first pathogen; b) quantifying one or more genes from said first biological sample that are upregulated in response to the infection compared to a non-infected control biological sample; c) establishing a second biological sample, wherein said second biological sample comprises a saliva sample collected from a subject infected with said pathogen; d) generating an RNA transcript expression dataset by quantifying the RNA transcripts present in said second biological sample that correspond to the one or more genes upregulated in response to infection by said pathogen; and e) analyzing said RNA transcript expression data set and identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to infection by said pathogen.
79. The method of claim 78, further comprising the step of repeating steps a-d using one or more additional pathogens to generate an RNA transcript expression data set.
80. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to said pathogen selected from the group consisting of: SEQ ID NOs. 1-99.
81. The method of claim 78, further comprising the step of identifying host-derived RNA biomarkers of infection commonly upregulated in response to any pathogen.
82. The method of claim 81, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to any pathogen are selected from the group consisting of: SEQ ID NOs. 31-99.
83. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a viral pathogen.
84. The method of claim 83, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a viral pathogen are selected from the group consisting of: SEQ ID NOs. 1-5.
85. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a bacterial pathogen.
86. The method of claim 85, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a bacterial pathogen are selected from the group consisting of: SEQ ID NOs. 6-10.
87. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a retroviral pathogen.
88. The method of claim 87, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a retroviral pathogen are selected from the group consisting of: SEQ ID NOs. 11-15.
89. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a herpesvirus pathogen.
90. The method of claim 89, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a herpesvirus pathogen are selected from the group consisting of: SEQ ID NOs. 16-20.
91. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a respiratory pathogen.
92. The method of claim 91, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a respiratory pathogen are selected from the group consisting of: SEQ ID NOs. 21-25.
93. The method of claim 78, further comprising the step of identifying general host-derived RNA biomarkers of infection that are commonly upregulated in response to a eukaryotic pathogen.
94. The method of claim 93, wherein said host-derived RNA biomarkers of infection commonly upregulated in response to a eukaryotic pathogen are selected from the group consisting of: SEQ ID NOs. 26-30.
95. The method of claim 78, wherein the pathogen of said infected tissue sample and pathogen of said infected saliva sample are different pathogens.
96. The method of claim 78, wherein said subject comprises a human subject.
97. A method of identifying host-derived biomarkers of infection comprising the steps of: generating an RNA transcript expression dataset of host-derived biomarker sequence reads according to the method of claim 1; performing data pre-processing on said raw dataset of host biomarker sequence reads comprising one or more of the following steps: filtering out low quality biomarker sequence reads; filtering out contaminating biomarker sequence reads; mapping the filtered biomarker sequence reads to a reference genome; assigning total number of biomarker sequence reads mapped onto each annotated gene within said reference genome; normalizing the biomarker sequence reads counts based on one or more control genes; conducting differential expression analysis to determine which host biomarker genes are up-regulated in the dataset; and outputting a dataset of upregulated host-derived biomarkers sequences.
98. The method of claim 97, and further comprising the steps of: merging a plurality of datasets of upregulated host-derived biomarkers sequences for analysis and categorization comprising one or more of the following steps: directly merging said plurality of datasets of upregulated host-derived biomarkers sequences; combining the P-value of said plurality of datasets of upregulated host-derived biomarkers sequences; combining the effect size of said plurality of datasets of upregulated host-derived biomarkers sequences; combining the rank of said plurality of datasets of upregulated host-derived biomarkers sequences; conducting co-expression and network analysis of said plurality of datasets of upregulated host-derived biomarkers sequences; and outputting a dataset of ranked host-derived biomarkers sequences.
99. The method of claim 98, and further comprising the steps of: validating said dataset of ranked host-derived biomarkers sequences comprising one or more of the following steps: comparing a dataset of random gene controls against said dataset of ranked host-derived biomarkers sequences using a machine learning system comprising a classifier; conducting cross-validation on said dataset being applied to said classifier to predict infection or non-infected states of a dataset of unknown RNA sequences; and outputting a dataset of ranked and filtered host-derived biomarker sequences.
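As a rough illustration of the normalization, differential expression, and P-value combination steps recited in claims 97 and 98, the following minimal Python sketch may be considered. All function names, gene names, read counts, the pseudocount, and the fold-change threshold are hypothetical. Scaling by the mean of the control genes, a log2 fold-change cutoff, and Fisher's method are stand-ins chosen for illustration only; the claims do not specify a particular algorithm.

```python
import math

def normalize_by_controls(counts, control_genes):
    """Divide every gene's read count by the mean count of the control genes."""
    scale = sum(counts[g] for g in control_genes) / len(control_genes)
    return {gene: n / scale for gene, n in counts.items()}

def log2_fold_changes(infected, control, pseudocount=0.01):
    """log2(infected / control) per gene; the pseudocount avoids division by zero."""
    return {g: math.log2((infected[g] + pseudocount) / (control[g] + pseudocount))
            for g in infected}

def upregulated_genes(fold_changes, threshold=1.0):
    """Genes whose log2 fold change meets the (assumed) threshold."""
    return sorted(g for g, v in fold_changes.items() if v >= threshold)

def fisher_combined_p(p_values):
    """Combine k independent P-values (each > 0) with Fisher's method.

    X = -2 * sum(ln p) follows a chi-square distribution with 2k degrees
    of freedom, whose upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total

# Hypothetical read counts for one infected and one control sample.
controls = ["CTRL_1", "CTRL_2"]
infected_counts = {"CTRL_1": 100, "CTRL_2": 100, "GENE_A": 800, "GENE_B": 100}
control_counts = {"CTRL_1": 200, "CTRL_2": 200, "GENE_A": 200, "GENE_B": 200}

lfc = log2_fold_changes(normalize_by_controls(infected_counts, controls),
                        normalize_by_controls(control_counts, controls))
hits = upregulated_genes(lfc)

# Hypothetical P-values for the same gene from two independent datasets.
combined = fisher_combined_p([0.01, 0.01])
print(hits, combined)
```

In this toy input the infected sample was sequenced at half the depth of the control, so only the control-gene normalization reveals that GENE_A is elevated while GENE_B is unchanged.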
Description
BRIEF DESCRIPTION OF DRAWINGS
[0018] The novel aspects, features, and advantages of the present disclosure will be better understood from the following detailed descriptions taken in conjunction with the accompanying figures, all of which are given by way of illustration only, and are not limiting the presently disclosed embodiments.
DETAILED DESCRIPTION OF INVENTION
[0036] In one embodiment, the invention includes systems, methods and compositions for the identification and classification of host biomarkers produced in response to an infection. In one preferred embodiment, the invention includes systems, methods and compositions for the identification and classification of early RNA biomarkers produced by the innate immune response of a cell or subject in response to an infection. Notably, such specific target RNA transcripts or biomarkers produced by a patient's innate immune response may be indicative of early infection. As a result, one embodiment of the inventive technology may include systems, methods and compositions for the detection of these target RNA transcripts, which may act as biomarkers for early infection in a subject.
[0037] In one preferred embodiment of the invention, to identify host-derived RNA biomarkers of infection, cells in culture or in a subject, such as a human subject, may be infected with various pathogens, and the RNA of the cells or tissues (preferably mammalian tissues, and more preferably human tissue) is then collected, sequenced, and compared to a (−) infection control. When different conditions and pathogens are compared to each other, general host RNA biomarkers can be initially derived as shown specifically in
[0038] In another preferred embodiment of the invention, the RNA biomarkers produced by the host in response to an infection challenge may be compared between different classes of pathogens. In this manner, specific biomarkers, and preferably host-derived RNA biomarkers, can be identified and classified to indicate different types of infection. For instance, in one embodiment shown in
[0039] Alternately, in another embodiment, the target biomarkers can be empirically tested in human or other in vivo trials. For example, one embodiment of the invention includes the validation of target RNA biomarkers of infection using quantitative reverse transcription polymerase chain reaction (qRT-PCR) protocols. Biomarkers identified using the methods outlined above may be further confirmed in tissue culture infection experiments. qRT-PCR of RNA allows specific quantification of the upregulation of candidate biomarkers as a ‘fold change’ in infected cells compared to uninfected cells. Such information helps when evaluating detection sensitivity with respect to a given biomarker. While only twenty-five exemplary biomarker candidates are being identified herein, such list should not be construed as limiting the number of biomarkers that may be identified with the current invention.
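The ‘fold change’ referred to above can be computed from qRT-PCR cycle threshold (Ct) values in several ways. One widely used option is the 2^(−ΔΔCt) calculation, sketched below in Python. The present disclosure does not state which calculation is used, so this is an assumption made for illustration; the function name and Ct values are hypothetical, and the formula presumes roughly 100% amplification efficiency for both the target and the reference gene.

```python
def fold_change_ddct(ct_target_infected, ct_ref_infected,
                     ct_target_uninfected, ct_ref_uninfected):
    """Relative expression of a target transcript by the 2^(-ddCt) calculation.

    Each Ct is first normalized to a reference (control) gene measured in the
    same sample; the infected sample is then compared to the uninfected one.
    """
    delta_infected = ct_target_infected - ct_ref_infected
    delta_uninfected = ct_target_uninfected - ct_ref_uninfected
    delta_delta = delta_infected - delta_uninfected
    return 2 ** (-delta_delta)

# Hypothetical Ct values: the target crosses threshold 3 cycles earlier in
# infected cells, while the reference gene is unchanged.
change = fold_change_ddct(22.0, 18.0, 25.0, 18.0)
print(change)
```

A lower Ct means more starting template, which is why an earlier threshold crossing in infected cells yields a fold change greater than one.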
[0040] As further highlighted in
[0041] In one embodiment the invention may include systems, methods and compositions for the identification and use of one or more host-derived RNA biomarkers of infection. In one preferred embodiment, a first tissue culture experiment can be established and tested to identify target RNA transcripts that may be upregulated during an experimental infection, and that may also be secreted from target cells. RNAs that are upregulated may be used as candidate biomarkers and engineered for compatibility with biomarker detection systems, such as the lateral flow device, as well as qRT-PCR methods and systems generally described by the present inventors in US PCT Application No. PCT/US2020/049290, the specification, figures and sequence identification being incorporated herein by reference. In parallel, RNAs from healthy and infected human saliva may be characterized in a clinical trial (right) in order to identify RNA biomarkers of infection in humans. Those biomarkers, if not already identified in the tissue culture experiments, may be engineered for compatibility with the lateral flow system as generally described above.
[0042] In another embodiment, the invention may include one or more of the host-biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 1-30. In another embodiment, the invention may include one or more virus-specific host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 1-5. In another embodiment, the invention may include one or more retrovirus-specific host RNA biomarkers comprising nucleotide sequences identified in SEQ ID NOs. 6-10. In another embodiment, the invention may include one or more herpesvirus host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 11-15. In another embodiment, the invention may include one or more respiratory virus-specific host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 16-20. In another embodiment, the invention may include one or more eukaryotic pathogen-specific host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 16-20.
[0043] In another embodiment, the invention may include one or more bacteria-specific host RNA biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 1-30. In another embodiment, the invention may include the diagnostic use of one or more of the host-biomarkers comprising nucleotide sequences identified in: SEQ ID NOs. 1-30. In another embodiment, one or more of the nucleotide sequences identified in SEQ ID NOs. 1-30, and their corresponding encoded mRNA transcript and/or translated polypeptide, may be used as biomarkers for early infection in a subject. In another embodiment, one or more of the nucleotide sequences identified in SEQ ID NOs. 1-30, and their corresponding encoded mRNA transcript and/or translated polypeptide, may be used as biomarkers for identification of the site of replication or infection in a subject. In another embodiment, one or more of the nucleotide sequences identified in SEQ ID NOs. 1-30, and their corresponding encoded mRNA transcript and/or translated polypeptide, may be used as biomarkers for identification of pathogen class-specific infection in a subject.
[0044] In another embodiment, identification of one or more RNA biomarkers of infection may help inform treatment of a subject. For example, identification of viral or bacterial-specific host RNA biomarkers may guide a medical practitioner to administer an anti-viral or an antibiotic. It may also, in the case of a viral infection such as SARS-CoV-2, guide a medical practitioner to recommend the subject be quarantined. For example, identification of viral RNA biomarkers associated with a respiratory infection may guide a medical practitioner to administer treatments appropriate for a viral respiratory infection.
[0045] The terminology used herein is for describing embodiments and is not intended to be limiting. As used herein, the singular forms “a,” “an,” and “the” include plural referents, unless the content and context clearly dictate otherwise. Thus, for example, a reference to “a biomarker” may include a combination of two or more such biomarkers. Unless defined otherwise, all scientific and technical terms are to be understood as having the same meaning as commonly used in the art to which they pertain. As used herein, “about” or “approximately” means within 10% of a stated concentration range or within 10% of a stated time frame.
[0046] The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
[0047] Nucleic acids and/or other moieties of the invention may be isolated. As used herein, “isolated” means separate from at least some of the components with which it is usually associated whether it is derived from a naturally occurring source or made synthetically, in whole or in part. Nucleic acids and/or other moieties of the invention may be purified. As used herein, purified means separate from the majority of other compounds or entities. A compound or moiety may be partially purified or substantially purified. Purity may be denoted by weight measure and may be determined using a variety of analytical techniques such as but not limited to mass spectrometry, HPLC, etc.
[0048] As used herein, a biological marker (“biomarker” or “marker”) is a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacological responses to therapeutic interventions, consistent with the NIH Biomarker Definitions Working Group (1998). Markers can also include patterns or ensembles of characteristics indicative of particular biological processes. The biomarker measurement can increase or decrease to indicate a particular biological event or process. In addition, if the biomarker measurement typically changes in the absence of a particular biological process, a constant measurement can indicate occurrence of that process. In a preferred embodiment, an RNA biomarker of infection includes one or more RNA transcripts that may be indicative of infection or other normal or abnormal physiological process. It should be noted that where an RNA biomarker of infection is referenced, it includes the sequence of the RNA transcript, whether of the DNA or mRNA sequence, as well as all alternatively spliced RNA transcripts or RNA biomarkers of infection that have undergone an alternative splicing event, as well as related polynucleotides.
[0049] The term “alternative splicing event”, as used herein, designates any sequence variation existing between two polynucleotides arising from the same gene or the same pre-mRNA by alternative splicing. This term also refers to polynucleotides, including splicing isoforms or fragments thereof, comprising said sequence variation. Preferably, said sequence variation is characterized by an insertion or deletion of at least one exon or part of an exon. The term “alternative splicing events” encompasses the original alternative splicing events, the skipping of exon (Dietz et al., Science 259, 680 (1993); Liu et al., Nature Genet. 16, 328-329 (1997); Nyström-Lahti et al. Genes Chromosomes Cancer 26: 372-375 (1999)), differential splicing due to the cellular environmental conditions (e.g. cell type or physical stimulus) or to a mutation leading to abnormalities of splicing (Siffert et al., Nature Genetics 18: 45-48 (1998)).
[0050] The term “related polynucleotides”, as used herein, refers to polynucleotides having identical sequences except for one or a small number of regions that either have a different sequence, or are deleted or added from one polynucleotide compared to the other. Typical related polynucleotides are splicing isoforms of the same gene, or a gene harboring a genomic deletion or addition compared to another allele of the same gene. Such related polynucleotides may be either full-length polynucleotides such as genomic DNA, mRNAs, full-length cDNAs, or fragments thereof.
[0051] As referred to herein, the terms “nucleic acid”, “nucleic acid molecules”, “oligonucleotide”, “polynucleotide”, and “nucleotides” may be used interchangeably. The terms are directed to polymers of deoxyribonucleotides (DNA), ribonucleotides (RNA), and modified forms thereof in the form of a separate fragment or as a component of a larger construct, linear or branched, single stranded, double stranded, triple stranded, or hybrids thereof. The term also encompasses RNA/DNA hybrids. The polynucleotides may include sense and antisense oligonucleotide or polynucleotide sequences of DNA or RNA. The DNA molecules may be, for example, but not limited to: complementary DNA (cDNA), genomic DNA, synthesized DNA, recombinant DNA, or a hybrid thereof. The RNA molecules may be, for example, but not limited to: ssRNA or dsRNA and the like. The terms further include oligonucleotides composed of naturally occurring bases, sugars, and covalent internucleoside linkages, as well as oligonucleotides having non-naturally occurring portions, which function similarly to respective naturally occurring portions. The terms “nucleic acid segment” and “nucleotide sequence segment,” or more generally “segment,” will be understood by those in the art as a functional term that includes genomic sequences, ribosomal RNA sequences, transfer RNA sequences, messenger RNA sequences, operon sequences, and smaller engineered nucleotide sequences that encode, or may be adapted to encode, peptides, polypeptides, or proteins. Further, it should be noted that when any sequence is referenced herein, for example a DNA sequence, the corresponding RNA and amino acid sequence is also specifically encompassed in such a disclosure.
[0052] As referred to herein, the term “database” is directed to an organized collection of biological sequence information and/or quantitative measurement of gene expression that may be stored in a digital form. They specifically include open source, as well as non-open source databases. In some embodiments, the database may include any sequence information. In some embodiments, the database may include the genome sequence of a subject or a microorganism. In some embodiments, the database may include expressed sequence information, such as, for example, an EST (expressed sequence tag) or cDNA (complementary DNA) databases. In some embodiments, the database may include non-coding sequences (that is, untranslated sequences), such as, for example, the collection of RNA families (Rfam) which contains information about non-coding RNA genes, structured cis-regulatory elements and self-splicing RNAs. In some embodiments, the databases may include quantitative measurement of expressed gene abundance, such as, for example, the collection of RNA, DNA or cDNA microarray readout. In some embodiments, the databases may include a collection of cDNA sequences captured from biological samples undergoing specific treatment conditions. Such collection of cDNA sequences can be analyzed to determine the relative abundance of gene expressed in the given biological samples, such as, for example, the collection of RNA sequencing data. In exemplary embodiments, the databases may be selected from redundant or non-redundant NCBI SRA database (which is NIH short read sequencing archive database containing publicly available RNA-seq datasets), NCBI GEO database (which is NIH gene expression omnibus database containing publicly available microarray database), NCBI BioProject database (NIH database containing metadata of experimental setup, protocol, patient information etc. relevant to datasets available on NCBI SRA and GEO databases), GenBank databases (which are the NIH genetic sequence database, an annotated collection of all publicly available DNA and RNA sequences). In exemplary embodiments, the databases may be selected from NCBI Short Read Archive databases. Exemplary databases may be selected from, but not limited to: GenBank CDS (Coding sequences database), PDB (protein database), SwissProt database, PIR (Protein Information Resource) database, PRF (protein sequence) database, EMBL Nucleotide Sequence database, NCBI BioProject database, NCBI SRA (Short Read Archive) database, NCBI GEO (Gene Expression Omnibus) database, Broad Institute GTEx (Genotype-Tissue Expression) database, EMBL Expression Atlas, and the like, or any combination thereof.
[0053] As used herein, the term “detection” refers to the qualitative determination of the presence or absence of a microorganism in a sample. The term “detection” also includes the “identification” of a microorganism, i.e., determining the genus, species, or strain of a microorganism according to recognized taxonomy in the art and as described in the present specification. The term “detection” further includes the quantitation of a microorganism in a sample, e.g., the copy number of the microorganism in a microliter (or a milliliter or a liter) or a microgram (or a milligram or a gram or a kilogram) of a sample. The term “detection” also includes the identification of an infection in a subject or sample.
[0054] As used herein the term “pathogen” refers to an organism, including a microorganism, which causes disease in another organism (e.g., animals and plants) by directly infecting the other organism, or by producing agents that cause disease in another organism (e.g., bacteria that produce pathogenic toxins and the like). As used herein, pathogens include, but are not limited to bacteria, protozoa, fungi, nematodes, viroids and viruses, or any combination thereof, wherein each pathogen is capable, either by itself or in concert with another pathogen, of eliciting disease in vertebrates including but not limited to mammals, and including but not limited to humans. The term also specifically includes eukaryotic or protist pathogens, such as the Plasmodium sp. that are the causative agent of Malaria. As used herein, the term “pathogen” also encompasses microorganisms which may not ordinarily be pathogenic in a non-immunocompromised host.
[0055] As used herein, the step of introducing a pathogen to a subject may include both the intentional introduction of a pathogen, such as through a clinical trial, and the natural and unintended introduction of a pathogen to a subject, for example, through a horizontal or vertical pathogen exposure, as well as direct and indirect pathogen transmission, including, but not limited to, environmental exposure to a pathogen, zoonotic exposure to a pathogen, vector-borne exposure to a pathogen, or nosocomial exposure to a pathogen.
[0056] The term “infection” or “infect” as used herein is directed to the presence of a microorganism within a subject body and/or a subject cell. For example, a virus may be infecting a subject cell. A parasite (such as, for example, a nematode) may be infecting a subject cell/body. In some embodiments, the microorganism may comprise a virus, a bacterium, a fungus, a parasite, or combinations thereof. According to some embodiments the microorganism is a virus, such as, for example, dsDNA viruses (such as, for example, Adenoviruses, Herpesviruses, Poxviruses), ssDNA viruses (such as, for example, Parvoviruses), dsRNA viruses (such as, for example, Reoviruses), (+) ssRNA viruses, i.e., (+) sense RNA viruses (such as, for example, Picornaviruses, Togaviruses), (−) ssRNA viruses, i.e., (−) sense RNA viruses (such as, for example, Orthomyxoviruses, Rhabdoviruses), ssRNA-RT viruses, i.e., (+) sense RNA viruses with a DNA intermediate in the life cycle (such as, for example, Retroviruses), and dsDNA-RT viruses (such as, for example, Hepadnaviruses). In some embodiments, the microorganism is a bacterium, such as, for example, a gram negative bacterium, a gram positive bacterium, and the like. In some embodiments, the microorganism is a fungus, such as yeast, mold, and the like. In some embodiments, the microorganism is a parasite, such as, for example, protozoa and helminths or the like. In some embodiments, the infection by the microorganism may inflict a disease and/or a clinically detectable symptom to the subject. In some embodiments, infection by the microorganism may not cause a clinically detectable symptom. In some embodiments, the microorganism is a symbiotic microorganism. In additional embodiments, the microorganism may comprise archaea, protists, microscopic plants (green algae), plankton, and planarians. In some embodiments, the microorganism is unicellular (single-celled). In some embodiments, the microorganism is multicellular.
[0057] As used herein, the term “asymptomatic” refers to an individual who does not exhibit physical symptoms characteristic of being infected with a given pathogen, or a given combination of pathogens.
[0058] The target biomarkers of this invention may be used for diagnostic and prognostic purposes, as well as for therapeutic, drug screening and patient stratification purposes (e.g., to group patients into a number of “subsets” for evaluation), as well as other purposes described herein.
[0059] Some embodiments of the invention comprise detecting in a sample from a patient, a level of a biomarker, wherein the presence or expression levels of the biomarker are indicative of infection or possible infection by one or more pathogens. As used herein, the term “biological sample” or “sample” includes a sample from any bodily fluid or tissue. Biological samples or samples appropriate for use according to the methods provided herein include, without limitation, blood, serum, urine, saliva, tissues, cells, and organs, or portions thereof. A “subject” is any organism of interest, generally a mammalian subject, and preferably a human subject.
[0060] As noted above, in one embodiment qRT-PCR may be utilized to identify one or more host-derived biomarkers of infection. In certain embodiments, intercalator dyes may be used to measure the accumulation of both specific and nonspecific PCR products in an RT-PCR reaction. For example, intercalator dyes such as SYBR Green, or probe-based chemistries such as TaqMan, may be used to detect and identify host-derived biomarkers of infection in a qRT-PCR assay.
[0061] Any isothermal amplification protocol can be used according to the methods provided herein. Exemplary types of isothermal amplification include, without limitation, nucleic acid sequence-based amplification (NASBA), loop-mediated isothermal amplification (LAMP), strand displacement amplification (SDA), helicase-dependent amplification (HDA), nicking enzyme amplification reaction (NEAR), signal mediated amplification of RNA technology (SMART), rolling circle amplification (RCA), isothermal multiple displacement amplification (IMDA), single primer isothermal amplification (SPIA), recombinase polymerase amplification (RPA), and polymerase spiral reaction (PSR, available at nature.com/articles/srep12723 on the World Wide Web). In some cases, a forward primer is used to introduce a T7 promoter site into the resulting DNA template to enable transcription of amplified RNA products via T7 RNA polymerase. In other cases, a reverse primer is used to add a trigger sequence of a toehold sequence domain.
[0062] As used herein, the term “amplified” refers to polynucleotides that are copies of a particular polynucleotide, produced in an amplification reaction. An amplified product, according to the invention, may be DNA or RNA, and it may be double-stranded or single-stranded. An amplified product is also referred to herein as an “amplicon”. As used herein, the term “amplicon” refers to an amplification product from a nucleic acid amplification reaction. The term generally refers to an anticipated, specific amplification product of known size, generated using a given set of amplification primers.
[0063] Naturally as can be appreciated, all of the steps as herein described may be accomplished in some embodiments through any appropriate machine and/or device resulting in the transformation of, for example data, data processing, data transformation, external devices, operations, and the like. It should also be noted that in some embodiments, software and/or software solution may be utilized to carry out the objectives of the invention and may be defined as software stored on a magnetic or optical disk or other appropriate physical computer readable media including wireless devices and/or smart phones. In alternative embodiments the software and/or data structures can be associated in combination with a computer or processor that operates on the data structure or utilizes the software. Further embodiments may include transmitting and/or loading and/or updating of the software on a computer perhaps remotely over the internet or through any other appropriate transmission machine or device, or even the executing of the software on a computer resulting in the data and/or other physical transformations as herein described.
[0064] Certain embodiments of the inventive technology may utilize a machine and/or device which may include a general purpose computer, a computer that can perform an algorithm, computer readable medium, software, computer readable medium containing specific programming, a computer network, a server and receiver network, transmission elements, wireless devices and/or smart phones, internet transmission and receiving elements, cloud-based storage and transmission systems, software updateable elements, computer routines and/or subroutines, computer readable memory, data storage elements, random access memory elements, and/or computer interface displays that may represent the data in a physically perceivable transformation such as visually displaying said processed data. In addition, as can be naturally appreciated, any of the steps as herein described may be accomplished in some embodiments through a variety of hardware applications including a keyboard, mouse, computer graphical interface, voice activation or input, server, receiver and any other appropriate hardware device known by those of ordinary skill in the art.
[0065] As used herein, a machine learning system or model is a trained computational model that takes a feature of interest, such as the expression of a host-derived RNA biomarker, and classifies it. Examples of machine learning models include neural networks, including recurrent neural networks and convolutional neural networks; random forest models; restricted Boltzmann machines; recurrent tensor networks; and gradient boosted trees. The term “classifier” (or classification model) is sometimes used to describe all forms of classification model, including deep learning models (e.g., neural networks having many layers) as well as random forest models.
[0066] As used herein, “quantify” means to identify the presence or quantity of an RNA biomarker from a sample.
[0067] As used herein, a machine learning system may include a deep learning model that may include a function approximation method aiming to develop custom dictionaries configured to achieve a given task, be it classification or dimension reduction. It may be implemented in various forms, such as by a neural network (e.g., a convolutional neural network). In general, though not necessarily, it includes multiple layers. Each such layer includes multiple processing nodes, and the layers process in sequence, with nodes of layers closer to the model input layer processing before nodes of layers closer to the model output. In various embodiments, one layer feeds into the next. The output layer may include nodes that represent various classifications. In certain embodiments, machine learning systems may include artificial neural networks (ANNs), which are a type of computational system that can learn the relationships between an input data set and a target data set. The name ANN originates from a desire to develop a simplified mathematical representation of a portion of the human neural system, intended to capture its “learning” and “generalization” abilities. ANNs are a major foundation in the field of artificial intelligence. ANNs are widely applied in research because they can model highly non-linear systems in which the relationship among the variables is unknown or very complex. ANNs are typically trained on empirically observed data sets. The data set may conventionally be divided into a training set, a test set, and a validation set.
[0068] Having now described the inventive technology, the same will be illustrated with reference to certain examples, which are included herein for illustration purposes only, and which are not intended to be limiting of the invention.
EXAMPLES
Example 1: Data Pre-Processing
[0069] The present inventors processed the raw microarray or RNA sequencing data through a standardized workflow. For microarray datasets, the pipeline 1) performs background signal correction and signal normalization, 2) annotates probes on the microarray chip with known gene names and accession numbers, and 3) filters probes based on signal intensity. For RNA sequencing datasets, the pipeline 1) filters out low-quality RNA-seq reads and contaminating sequences, 2) maps the filtered reads to the host (human) genome, 3) determines data quality based on trimming and mapping statistics, and 4) counts the total number of RNA-seq reads mapped onto each annotated gene within the human genome. The resulting gene expression profiles from both microarray and RNA sequencing datasets are indicative of relative gene expression levels. The pipeline may normalize the read counts based on a set of empirically determined control genes and may further conduct differential expression analysis to identify the significantly up-regulated genes within each study.
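The control-gene normalization step mentioned above is not specified in detail. By way of a hedged illustration only, the following Python sketch assumes one possible approach: each sample's read counts are divided by the geometric mean of the counts of its control genes. The function name, the gene labels, and the choice of a geometric mean are assumptions made for illustration and are not taken from the disclosure.

```python
import math

def normalize_by_controls(counts, control_genes):
    """Scale one sample's read counts by the geometric mean of its control genes.

    Assumption: every control gene has a count greater than zero in the sample.
    """
    logs = [math.log(counts[g]) for g in control_genes]
    size_factor = math.exp(sum(logs) / len(logs))
    return {gene: n / size_factor for gene, n in counts.items()}

# Hypothetical toy sample: two control genes and one gene of interest.
sample = {"ctrl1": 100, "ctrl2": 400, "target": 50}
normalized = normalize_by_controls(sample, ["ctrl1", "ctrl2"])
```

In this toy sample the geometric mean of the two control counts is 200, so every count in the sample is divided by 200.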
Example 2: Biomarker Discovery
[0070] To determine which host RNA biomarkers are commonly upregulated across different pathogen infections, and how readily they can be detected across different cell types and tissue samples, the present inventors summarized the results from the above data pre-processing steps using statistical methods, including direct merging, combining p-values, combining effect sizes, combining ranks, and/or co-expression analysis. These statistical measures combine the data in a way that accounts for the confidence and reliability of the results.
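The disclosure does not name a specific procedure for combining p-values. As one hedged illustration, the Python sketch below uses Fisher's method, a common choice that is assumed here rather than confirmed by the text. It combines the per-study p-values for a single gene, treating the studies as independent; the function name and the clamping constant are illustrative.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent per-study p-values for one gene (Fisher's method)."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    # Clamp to avoid log(0) when a study reports p == 0.
    clamped = [min(max(p, 1e-300), 1.0) for p in pvalues]
    # Fisher statistic X = -2 * sum(ln p) is chi-square with 2k degrees of freedom.
    half_stat = -sum(math.log(p) for p in clamped)  # this is X / 2
    k = len(clamped)
    # Closed-form chi-square survival function for even degrees of freedom (2k):
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return min(1.0, math.exp(-half_stat) * total)

# Two studies that are each borderline become jointly more convincing.
combined = fisher_combined_pvalue([0.05, 0.05])
```

With a single study the function returns that study's p-value unchanged, which is a convenient sanity check on the formula.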
[0071] Importantly, by focusing on studies that utilized similar infection data from broader categories (e.g. Domain level: virus, bacteria, etc; Viral class: herpesvirus, retrovirus, etc; Site of replication in the body: respiratory virus), the present inventors were also able to identify specific sets of host biomarkers that help differentiate the type of infection as explained below. These discovered biomarkers can either directly move on to empirical testing, or they can be further validated and prioritized by the computer-assisted approaches described in Example 3.
Example 3: In Silico Validation and Filtering
[0072] In another embodiment, the invention may utilize a machine learning system. The summarized host biomarkers may optionally be subject to downstream validation and filtering via supervised machine-learning approaches. In one embodiment, the present inventors provided the classifier (logistic regression, polynomial support vector machine (SVM), Poisson linear discriminant, or convolutional neural network) with either the list of biomarkers or random genes (as a control) to construct statistical models from training RNA-seq or RNA microarray datasets. The present inventors then programmed the classifier to determine whether a set of unknown RNA-seq or RNA microarray samples is infected. If the list of biomarkers helps predict the infection condition of the unknown data, the prediction accuracy would be significantly higher compared to the control. To further use this approach to filter out less relevant biomarkers from the list, the present inventors removed individual genes from the biomarker list one at a time and repeated the entire classification iteratively. If the removal of a biomarker decreases the prediction accuracy, the removed biomarker likely plays a key role in determining the infection condition. Conversely, if the removal of a biomarker increases, or has no effect on, the prediction accuracy, the removed biomarker could be discarded due to its lack of relevance.
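The iterative filtering described above can be sketched as a leave-one-gene-out loop. The following Python example is a minimal illustration under stated assumptions: a nearest-centroid classifier stands in for the classifiers named in the text (it is a substitute chosen for brevity, not one of the disclosed models), and the data, function names, and gene indices are hypothetical.

```python
def centroid(rows):
    """Per-gene mean expression over a list of sample vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def accuracy(train, test, genes):
    """Nearest-centroid accuracy restricted to the gene indices in `genes`.

    `train` and `test` are lists of (expression_vector, label) pairs,
    with label 1 for infected and 0 for uninfected.
    """
    def pick(vec):
        return [vec[g] for g in genes]
    c_inf = centroid([pick(x) for x, y in train if y == 1])
    c_uninf = centroid([pick(x) for x, y in train if y == 0])
    hits = 0
    for x, y in test:
        v = pick(x)
        pred = 1 if sq_dist(v, c_inf) < sq_dist(v, c_uninf) else 0
        hits += int(pred == y)
    return hits / len(test)

def leave_one_gene_out(train, test, n_genes):
    """Map each gene index to the accuracy change caused by removing it."""
    all_genes = list(range(n_genes))
    baseline = accuracy(train, test, all_genes)
    return {
        g: accuracy(train, test, [h for h in all_genes if h != g]) - baseline
        for g in all_genes
    }

# Hypothetical toy data: gene 0 tracks infection, gene 1 is uninformative.
train = [([9.0, 0.0], 1), ([8.0, 6.0], 1), ([1.0, 5.0], 0), ([2.0, 1.0], 0)]
test = [([7.0, 0.0], 1), ([3.0, 9.0], 0)]
delta = leave_one_gene_out(train, test, n_genes=2)
# delta[0] is negative (removing gene 0 hurts accuracy, so it is retained);
# delta[1] is 0.0 (removing gene 1 changes nothing, so it could be discarded).
```

A negative entry marks a biomarker whose removal lowers accuracy, while a zero or positive entry marks a candidate for removal, mirroring the rule stated in the paragraph above.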
Example 4: Virus-Specific Host Biomarkers RNA Sequences
[0073] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a viral infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a viral infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 1-5. In one preferred embodiment, the invention may include the early-detection of a viral infection, such as SARS-CoV-2 (COVID-19), in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 1-5, the detection being accomplished, in one preferred embodiment, by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.
Example 5: Bacteria-Specific Host Biomarkers RNA Sequences
[0074] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a bacterial infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a bacterial infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 6-10. In one preferred embodiment, the invention may include the early-detection of a bacterial infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 6-10, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.
Example 6: Retrovirus-Specific Host Biomarkers RNA Sequences
[0075] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a viral infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a retroviral infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 11-15. In one preferred embodiment, the invention may include the early-detection of a retroviral infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 11-15, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.
Example 7: Herpesvirus-Specific Host Biomarkers RNA Sequences
[0076] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a viral infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a herpesvirus infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 16-20. In one preferred embodiment, the invention may include the early-detection of a herpesvirus infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 16-20, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.
Example 8: Respiratory Virus-Specific Host Biomarkers RNA Sequences
[0077] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a viral infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a respiratory infection, such as SARS-CoV-2 (COVID-19) in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 21-25. In one preferred embodiment, the invention may include the early-detection of a respiratory infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 21-25, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.
Example 9: Eukaryotic and/or Protist Pathogen-Specific Host Biomarkers RNA Sequences
[0078] One embodiment of the invention may include one or more of the following biomarkers, identified through the methods described herein, as being specifically upregulated in response to a eukaryotic or protist pathogen infection in a human subject. In a preferred embodiment, the invention may include the early-detection of a eukaryotic or protist pathogen infection, such as Plasmodium falciparum (P. falciparum), the causative agent of malaria, in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 26-30. In one preferred embodiment, the invention may include the early-detection of a eukaryotic or protist pathogen infection in a host through the detection of one or more of the biomarkers according to SEQ ID NOs. 26-30, the detection being accomplished by a lateral flow device described by the present inventors in PCT Application No. PCT/US2020/049290, the specification and figures being incorporated herein by reference, or other biomarker detection systems known in the art. Additional embodiments for detecting one or more of the biomarkers identified herein may include a rapid detection LAMP assay, PCR, or other detection methods described generally herein and known in the art.
Example 10: Identification of 69 Human Universal Response Genes to Infection
[0079] In one embodiment, the present inventors identify 69 human “universal response” genes that are upregulated by a broad range of human pathogens. Even when infection resides in distal sites in the body, the mRNAs produced in this universal response are measurable in human saliva. By assessing the abundance of these mRNAs in saliva, we were able to correctly determine whether a person harbors an infection more than 85% of the time. This is true even in the absence of perceived symptoms. As such, the monitoring of these mRNAs in saliva could be a platform for detecting infection in the body, especially as a screening tool for asymptomatic individuals.
[0080] It is striking that there is a core transcriptional response that is triggered by all tested pathogens. Many studies have explored the host gene response to infection, including the 71 studies that we used in the first step of this study (listed in Table 2), or to specific cytokines like interferon. Yet there have been far fewer studies that have looked at commonalities in gene induction by cells infected with different pathogens, and typically these have compared just a few pathogen types. By integrating results from many datasets from a broad range of pathogen types, we identified an asymptotic number of universal response genes (n=69) (SEQ ID NOs. 31-99). Importantly, no new genes were added or subtracted from this list once we surpassed a certain number of datasets analyzed. Thus, we identified the connecting signature that underlies infection, across a broad range of pathogens.
[0081] Importantly, universal response mRNAs are detectable in the saliva of infected individuals, regardless of the location of infection. There are two hypotheses to explain why these mRNAs are found in saliva. First, free mRNA, or mRNA encapsulated in dead cells or exosomes, might be entering the oral cavity. This might be occurring for the purpose of targeting these structures for elimination from the body via the gastrointestinal tract. In a second model, interferon and other cytokines produced by a distal infection may be entering the oral cavity and stimulating cells there to execute the transcriptional response that we are measuring. In other words, the mRNA we observe in saliva could be produced or even propagated locally in the mouth. Regardless, the invention highlights the diagnostic value of saliva beyond its current limited use in diagnosing SARS-CoV-2, oral cancers, and Sjögren syndrome.
[0082] To determine which human genes are commonly upregulated in diverse infections, the present inventors first obtained 71 published datasets. These datasets all profiled the transcriptional response of cultured human cells to infection. Studies involving a variety of pathogens were included (29 viruses, 7 bacteria, and 3 fungi), with many of these pathogens represented by more than one dataset (Table 2). Each of the 71 datasets included matched transcript sequencing for infected and mock-infected human cells, usually in multiple replicates (n=387 replicates in all). For each dataset, raw RNA sequencing reads were retrieved from the NCBI short-read archive and analyzed as described in the Methods. We looked for genes that were upregulated in infected conditions (“+” in
[0083] We next assessed whether the abundance of these mRNAs in blinded human tissue culture samples could predict whether the cells had been infected or not. Using the 387 samples (meaning, independent experimental replicates) from the 71 in vitro infection datasets, we carried out cross-validation using a logistic regression model. Specifically, we first established the logistic regression classifier using the expression data of the 69 genes in 10% of the samples (much less than what is typically used in 10-fold cross-validation experiments, done to emphasize the predictive power), randomly selected. Next, we evaluated the predictive power of this model to classify the remaining 90% of the 387 samples as infected or not. This cross validation was repeated 10 times, and the accuracy of classification is summarized via receiver operating characteristic (ROC) curve (
[0084] We then performed additional cross validation analyses among different types of infections (
[0085] We next explored whether this group of 69 genes is truly unique, relative to other groups of similar genes. We again performed the same analysis as shown in
Example 11: Universal Response Genes are also Upregulated in Infected Humans
[0086] We next wanted to determine if universal response genes are upregulated in infected humans. At this point, we transitioned from analyzing data from in vitro infections of human cells to the analysis of data from human biospecimens. We first took advantage of two previously published datasets from human blood, each measuring gene expression by microarray after infection. One study focused on a 34-year-old male health care worker exposed to Ebola virus in Sierra Leone during the 2013-2015 epidemic. Starting 7 days after symptom onset, blood was taken from the individual daily and genome-wide mRNA expression was evaluated by microarray. We extracted from this dataset the expression profiles of the universal response genes (
[0087] Another study focused on 15 individuals experimentally infected with the protist that causes malaria, Plasmodium falciparum. In this study, blood was taken every two days after experimental infection and mRNA transcript abundance was interrogated by microarray, until the point where individuals had detectable pathogen in the bloodstream and/or had symptoms consistent with malaria (indicated as “D” for diagnosed in
[0088] We next asked whether the abundance of these 69 mRNAs in human saliva could classify humans as infected or not. We find that universal response transcripts can be found to equal degrees in blood and saliva (
[0089] We next tested whether the abundance of universal response mRNAs in saliva could determine if a human was harboring an infection. We carried out cross validation and found that a classifier trained on the expression levels of universal response genes in a randomly selected 10% of the in vitro data analyzed above (39 of the 387 experimental replicates from 71 studies), could correctly classify these 23 human saliva samples as having come from someone who is infected or healthy, just from the abundances of these mRNAs in their saliva (
[0090] Importantly, two of the enrollees in the previous analysis were noted to have no signs of respiratory tract involvement, and some clearly had infection linked to distal sites (gastroenteritis, osteomyelitis/discitis, meningitis), yet these mRNA signatures are reliably detectable in saliva. We next wanted to further confirm that universal response mRNAs can be found in saliva, even when infection is at distal sites in the body. In the next experiment, we included two additional patient saliva samples, one from an enrollee being treated for a Coccidioides fungal infection and another enrollee being treated for Escherichia coli bacterial sepsis stemming from a urinary source. The three enrollees in this experiment were diagnosed with very different infections (viral, fungal, and bacterial) and were specifically noted to not have respiratory involvement in their infections. We used RT-qPCR to quantify mRNA from six of the universal response genes (due to limited sample volumes) from the saliva of these enrollees. We observed from 2- to 10.sup.5-fold upregulation of all six host mRNAs within the saliva of infected individuals compared to three healthy ones (
Example 12: Universal Response Transcripts in Saliva Identified SARS-CoV-2 Infected Individuals in an Asymptomatic, Apparently-Healthy Cohort
[0091] We next asked if this concept would be viable in the context of disease screening, meaning testing people who have no symptoms for the purpose of determining their likelihood of having an infection. During the 2020-21 academic year, the University of Colorado Boulder carried out weekly SARS-CoV-2 screening for students and staff. The screening effort enabled us to enroll university affiliates into an associated human study. We enrolled 68 university affiliates into the study, and each donated a single saliva sample used for both the university RT-qPCR test for SARS-CoV-2, and for analysis of the universal response mRNAs in their saliva. For the latter analysis, we chose samples from individuals who had tested positive (n=48) and negative (n=20) for SARS-CoV-2. What is special about the cohort of 68 individuals is that all had indicated no perceptible symptoms at the time of saliva donation.
[0092] We examined the levels of mRNA from universal response genes in the saliva of these 68 individuals to determine if that information alone could have revealed whether or not they were infected. Instead of sequencing transcripts in saliva, we developed a multiplex TaqMan RT-qPCR assay for measuring 15 of the universal response genes, along with 3 control genes (Methods, Table 5). These 15 genes were chosen to represent a range of expression levels and kinetics amongst the 69 total universal response genes. The expression of these genes in each enrollee is described in
[0093] When compared to day 1, transcript abundance in saliva changed no more than 5-fold in subsequent days. Thus, universal response mRNAs are remarkably steady in the saliva of healthy individuals.
Example 12: Materials and Methods
[0094] Meta-analysis of NCBI SRA transcriptomics datasets: We carried out a meta-analysis of RNA-seq datasets publicly available at the NCBI SRA (short read archive) database. Our criteria for choosing datasets were that human cells in culture were infected with a bacterial, viral, or fungal pathogen, and that the cellular transcriptome was then sequenced along with that of a mock-infected control. We obtained a total of 71 relevant in vitro infection datasets. From these datasets, raw RNA sequencing reads in FASTQ format were downloaded, trimmed using BBDuk (BBMap v38.05), and mapped using HISAT2 v2.1.0 to human genome assembly hg38. Using NCBI RefSeq genome annotation, we then counted the mapped reads assigned to genes or transcripts using featureCounts (Subread v1.6.2).
[0095] First, we looked for genes that were upregulated in each infected dataset versus its matched mock control. For each individual dataset, the infected replicates were compared to the corresponding mock replicates via the DESeq2 Wald test (v3.1.3), from which the fold change and Benjamini-Hochberg adjusted p-values were obtained. Correction for multiple testing was performed throughout. Next, we looked for the subset of these genes that was statistically enriched in infected datasets overall. DESeq2 results from individual datasets were ranked and combined based on the magnitude and consistency of upregulation across the datasets. Specifically, the gene rank, $r_g$, is assigned within each individual dataset following the formula:
$$r_g = \mathrm{Rank}\left(-\log_{10}(p_{\mathrm{adj}}) \times \text{fold change}\right)$$
[0096] Next, to determine which genes were consistently upregulated across different studies, the ranks are combined via rank sum statistics. With $n$ studies, and with $r_{g,i}$ denoting the rank of gene $g$ in study $i$, the rank sum for each gene $g$ is calculated as:
$$RS_g = \sum_{i=1}^{n} r_{g,i}$$
[0097] Hence, the genes are sorted based on $RS_g$. We then filtered the gene list based on the within-study adjusted p-value and required that the gene be significant ($p_{\mathrm{adj}} < 0.05$) in 80% of the datasets. As a result, we obtained 69 universal response genes ranked by statistical significance comparing infected vs. mock groups and by the consistency across datasets.
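The ranking and rank-sum steps above can be illustrated with a short Python sketch. It is offered under several assumptions that the text does not settle: ranks are assigned in ascending order of the score (so a larger rank sum indicates stronger and more consistent upregulation), every gene is present in every dataset, ties are broken by input order, and the fold change is used exactly as supplied. Gene names and values are hypothetical.

```python
import math

def rank_scores(scores):
    """Rank genes within one dataset: 1 = lowest score, N = highest score."""
    ordered = sorted(scores, key=scores.get)
    return {gene: i + 1 for i, gene in enumerate(ordered)}

def rank_sum(datasets):
    """Sum per-dataset ranks for each gene.

    `datasets` is a list of {gene: (adjusted_p_value, fold_change)} mappings,
    one mapping per study.
    """
    totals = {}
    for results in datasets:
        # Score = -log10(adjusted p) x fold change; clamp p to avoid log10(0).
        score = {
            gene: -math.log10(max(padj, 1e-300)) * fc
            for gene, (padj, fc) in results.items()
        }
        for gene, rank in rank_scores(score).items():
            totals[gene] = totals.get(gene, 0) + rank
    return totals

# Hypothetical toy input: two studies, three genes.
studies = [
    {"geneA": (1e-10, 8.0), "geneB": (1e-3, 2.0), "geneC": (0.9, 1.0)},
    {"geneA": (1e-6, 4.0), "geneB": (1e-4, 2.0), "geneC": (0.5, 1.1)},
]
totals = rank_sum(studies)
ranked = sorted(totals, key=totals.get, reverse=True)
```

Here geneA ranks highest in both studies and therefore receives the largest rank sum. The subsequent filter on within-study significance would be applied to this sorted list.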
[0098] Cross-validation using logistic regression models: To evaluate the predictive power of the universal response genes in differentiating infected/uninfected conditions in both in vitro and in vivo RNA-seq datasets, we extracted library size-normalized read counts in transcripts per million format for each sequencing replicate. We next separated the datasets into a training set and a prediction set. Specifically, 10% of randomly selected sequencing replicates were used to construct the binomial logistic regression model using R package stats (v 3.6.2). The remaining 90% of sequencing replicates were used as the prediction set for evaluation. In the case of in vivo saliva sequencing replicates, the entire dataset was used for prediction. R package ROCR (v1.0.11) was used to generate the ROC curves based on the prediction outcome.
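The analysis above was carried out in R. Purely as an illustrative sketch, the following self-contained Python example mirrors the same idea: fit a binomial logistic regression on a random 10% of replicates and score the remaining 90% by area under the ROC curve. The gradient-descent fitting routine, the learning rate, the synthetic single-feature data, and all function names are assumptions for illustration and do not reproduce the R stats or ROCR implementations.

```python
import math
import random

def sigmoid(z):
    # Clip the input so exp() stays in a safe range.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the logistic log-loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def split_10_90(n, rng):
    """Randomly assign about 10% of indices to training and the rest to prediction."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(0.10 * n)
    return idx[:cut], idx[cut:]

rng = random.Random(0)
# Synthetic stand-in for expression features: infected samples are shifted upward.
labels = [i % 2 for i in range(387)]
features = [[rng.gauss(2.0 * lab, 1.0)] for lab in labels]
train_idx, test_idx = split_10_90(len(labels), rng)
w, b = fit_logistic([features[i] for i in train_idx], [labels[i] for i in train_idx])
scores = [sigmoid(w[0] * features[i][0] + b) for i in test_idx]
print("held-out AUC:", round(auc(scores, [labels[i] for i in test_idx]), 3))
```

Repeating the split with different random seeds and collecting the resulting AUC values would approximate the repeated cross-validation described in the text.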
[0099] For evaluating the predictive power of universal response genes as measured by the TaqMan RT-qPCR assay on SARS-CoV-2 infected/uninfected saliva samples, the relative fold change was calculated by first normalizing the raw Ct values to the corresponding control gene Ct (RPP30) and then comparing to the average normalized Ct of all uninfected individuals. The relative fold change values for each individual were then used for cross validation via logistic regression. Specifically, half of the infected individuals above the said viral load threshold, along with half of the uninfected individuals, were used as the training set, while the remaining half was used for prediction. The methods for constructing the logistic regression model and for evaluating performance via ROC are the same as above.
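The relative fold change calculation described above can be sketched in Python as follows. This is a minimal illustration that assumes 100% amplification efficiency (one doubling per cycle), which the text does not state; sample identifiers and Ct values are hypothetical.

```python
def fold_changes(target_ct, control_ct, infected):
    """Relative fold change per sample, computed as 2 ** -(ddCt).

    target_ct / control_ct: {sample_id: Ct} for the target gene and for the
    control gene (RPP30 in the text); infected: {sample_id: bool}.
    """
    # Step 1: normalize each sample's target Ct to its control-gene Ct.
    d_ct = {s: target_ct[s] - control_ct[s] for s in target_ct}
    # Step 2: the reference is the mean normalized Ct over all uninfected samples.
    uninfected = [d_ct[s] for s in d_ct if not infected[s]]
    reference = sum(uninfected) / len(uninfected)
    # Step 3: assume each PCR cycle doubles the product (100% efficiency).
    return {s: 2.0 ** -(d_ct[s] - reference) for s in d_ct}

# Hypothetical toy values: two uninfected samples and one infected sample.
target = {"u1": 30.0, "u2": 32.0, "i1": 26.0}
control = {"u1": 25.0, "u2": 25.0, "i1": 24.0}
status = {"u1": False, "u2": False, "i1": True}
fc = fold_changes(target, control, status)
# fc == {"u1": 2.0, "u2": 0.5, "i1": 16.0}
```

In the toy data the infected sample crosses threshold four cycles earlier than the uninfected average after normalization, which corresponds to a 16-fold increase.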
[0100] Human saliva sample collection, handling, and RNA preparation: Samples SS4, SS5, SS12-SS21, SS24 and SS25 were collected under protocol 17-0562 (U. Colorado Anschutz Medical School; PI Poeschla), where adult participants were consented verbally and donated up to 5 mL of whole saliva. Saliva was collected into Oragene saliva collection kits (DNA Genotek CP-100). The saliva is mixed with the stabilization solution in the collection kit and stored at room temperature for no longer than 2 weeks before being processed for RNA purification. Diagnosis of these individuals was provided in the form of clinical notes. Saliva samples from individuals SS1-SS3, SS6-SS11, SS22, and SS23 were collected under protocol 19-0696 (U. Colorado Boulder, PI Sawyer), where anonymous adults verbally consented and donated up to 2 mL of whole saliva. Saliva was collected into Oragene saliva collection kit as mentioned above. For two individuals, infection status was noticed during RNAseq procedures, and ultimately determined by in silico metagenomic detection using GOTTCHA (v1.0b) using RNAseq reads (additional RNAseq sample preparation and analysis described below). We were able to detect sequencing reads mapping to CoV-NL63 or RSV genomes from the saliva of individual SS22 and SS23, respectively, so they were presumed to be infected with these pathogens at the time of saliva collection. Saliva samples for apparently healthy individuals over a daily time course (SS26-SS32) were collected under a COVID-19-related sub-study of protocol 19-0696 (U. Colorado Boulder, PI Sawyer), where adult participants consented verbally and donated up to 2 mL of whole saliva per day. The saliva was collected into Oragene saliva collection kit as mentioned above. To purify RNA from saliva samples collected in Oragene saliva collection kits, we used 1 mL saliva 1:1 diluted in stabilization solution and followed the manufacturer recommended protocol by DNA Genotek to precipitate the nucleic acid. 
The RNA was further DNase-digested using Turbo DNase (Invitrogen #AM2238) and cleaned up using RNA clean-up and concentration micro-elute kit (Norgen #61000). The purified RNA was used for RT-qPCR or processed further for RNA-seq.
[0101] To prepare the total RNA for sequencing, we first spiked ERCC RNA spike-in mix (ThermoFisher #4456740) into the saliva total RNA for downstream normalization. We depleted bacterial ribosomal RNA using the pan-bacterial riboPOOL kit (siTOOLS #026). We then prepared the RNA for total RNA sequencing using the KAPA RNA HyperPrep kit with RiboErase to remove human rRNA (Roche #KK8560). Finally, the saliva total RNA libraries were sequenced in 150 bp paired-end format using NovaSeq 6000 (Illumina) at a depth of 30 million reads.
[0102] Saliva samples for SARS-CoV-2-infected individuals (SS33-SS80), and matched SARS-CoV-2-negative individuals (SS81-SS100) were collected under protocol 20-0417 (U. Colorado Boulder, PI Sawyer), where adult participants 17 years of age or older (under a Waiver of Parental Consent) provided written consent. These samples were collected and tested for the SARS-CoV-2 virus during our campus COVID-19 testing initiative during the Fall 2020, Spring 2021, and Summer 2021 semesters. As part of this campus testing operation, university affiliates were asked to fill out a questionnaire to confirm that they did not present any symptoms consistent with COVID-19 at the time of sample donation, and to collect no less than 0.5 mL of saliva into a 5-mL screw-top collection tube. Saliva samples were heated at 95° C. for 30 min on site to inactivate the viral particles for safer handling, and then placed on ice or at 4° C. before being transported to the testing laboratory for RT-qPCR-based SARS-CoV-2 testing performed on the same day. Samples were then kept at −80° C. until RNA preparation. The total RNA of the remaining saliva samples was then purified using TRIzol LS reagent (ThermoFisher #10296028) followed by GeneJET RNA cleanup and concentration kit (ThermoFisher #K0841). The purified total RNA was used for RT-qPCR following the steps described below. Additional saliva samples for general assay development were collected under protocol 20-0068 (U. Colorado Boulder, PI Sawyer), where anonymous adult participants consented verbally and donated up to 2 mL of whole saliva for use as a reagent in optimization and limit of detection experiments.
[0103] Analysis of high-throughput transcriptomics data from human saliva samples: To profile transcriptomic changes in human saliva samples, raw RNA sequencing reads in FASTQ format were obtained, trimmed using BBDuk (BBTools v38.05), and mapped using HISAT2 v2.1.0 to human genome assembly hg38 together with the ERCC spike-in reference sequences. Using the NCBI RefSeq genome annotation (GRCh38.p13), we then counted the mapped reads assigned to genes or transcripts using featureCounts (Subread v1.6.2). Read counts were first normalized using the R package RUVSeq (v1.28.0) to account for library size factors based on the ERCC spike-in counts. Individual samples were then separated into infected and non-infected groups, and the differential expression of genes was determined via the DESeq2 (v3.1.3) Wald test, from which the fold change and Benjamini-Hochberg adjusted p-values were obtained.
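As an illustration of the Benjamini-Hochberg adjustment named above, the following minimal Python sketch adjusts a list of raw p-values. It is not the DESeq2 implementation (DESeq2 is an R package, and its results step also applies independent filtering before adjustment). The function name `benjamini_hochberg` and the example p-values are hypothetical and are not taken from the saliva dataset.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative sketch only; the p-values passed in below are made up.
    """
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is no larger than the one ranked above it (monotonicity).
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw Wald-test p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Genes whose adjusted p-value falls below a chosen threshold would then be reported as differentially expressed; the threshold itself is a design choice that the text above does not specify.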
[0104] RT-qPCR analysis of universal response mRNAs in human saliva: For initial RT-qPCR validation on 3 clinically diagnosed and 3 uninfected samples (
TABLE-US-00001
Gene Name  Forward Primer Sequence (5′-3′)  Reverse Primer Sequence (5′-3′)
CALR       TCCCGATCCCAGTATCTATGCC           TCTCTGCTGCCTTTGTTACGC
CXCL8      CCAGGAAGAAACCACCGGAA             CTTGGCAAAACTGCACCTTCAC
EGR1       ACTACCCTAAGCTGGAGGAGA            AGGAAAAGACTCTGCGGTCA
ICAM1      GCAACCTCAGCCTCGCTAT              GGAGTCCAGTACACGGTGAG
IFIH1      ACAGCTTCACCTGGTGTTGGA            ATGGCAAACTTCTTGCATGGCT
IFIT2      CCCTGCCGAACAGCTGAGAA             AGTTGCCGTAGGCTGCTCTC
RSAD2      GTTGGTGAGGTTCTGCAAAGTAGAGTTGCG   TAAGGTAGGAGTCTTTCATCTTCTGGTTAG
Multiplexed RT-qPCR analysis for the quantitative detection of 15 of the universal response mRNAs was carried out using customized and multiplexed TaqMan primer and probe mixes. Together with 3 internal control genes (RPP30, RACK1, and CALR), the levels of all 18 genes are measured in a total of 6 multiplexed reactions (Table 5). Because contaminating genomic DNA often introduces quantification bias when measuring host gene expression, we designed primers that span exon junctions and limited the assay elongation time so that only the host mRNA is reverse transcribed and amplified. As each transcript varies in its expression magnitude, we assigned genes to multiplex groups based on similar expression magnitudes observed in the meta-analysis of in vitro datasets and in human saliva. This minimizes competition for amplification reagents. Specifically, to determine the host gene expression levels, 1.5 μL of customized TaqMan multiplex probes were mixed with 5 μL of 4X TaqPath 1-step multiplex master mix (ThermoFisher #A28526), 5 μL of saliva total RNA, and 8.5 μL of nuclease-free water, for a 20 μL reaction. The RT-qPCR assay was carried out on a QuantStudio 3 Real-Time PCR system (ThermoFisher) and consisted of a reverse transcription stage (25° C. for 2 min, 50° C. for 15 min, 95° C. for 2 min) followed by 40 cycles of a PCR stage (95° C. for 3 s, 55° C. for 30 s, with a 1.6° C./s ramp-up and ramp-down rate). The cycle threshold (Ct) values were used to calculate relative fold change using the delta-delta Ct method. For the choice of internal control genes, we combined the meta-analysis (
[0105] We optimized this TaqMan assay on RNA harvested from A549 human lung cells that were mock-infected or infected with influenza A virus (H3N2/Udorn/307/72) at an MOI of 0.1 for 24 hours. Human lung epithelial cells (A549) were plated at a density of 1×10⁶ cells/well in a 6-well plate. The next day, the cells were infected with influenza A virus at an MOI of 0.1 in serum-free media containing 1.0% bovine serum albumin. After a 1-hour incubation, the inoculum was removed and replaced with growth media containing 1 μg/mL of N-acetylated trypsin. At 24 hours post-infection, total RNA was harvested using the QIAGEN RNeasy Mini kit (QIAGEN #74104). Using these samples, we confirmed that the assay can measure each mRNA over a large dynamic range (Ct 15-40) with a small amount of input RNA (≥100 ng) (
[0106] Infection of Huh7 cells with SARS-CoV-2: Human hepatoma (Huh7) cells (gift from Charles Rice, Rockefeller University) were grown in 1X DMEM (ThermoFisher cat. no. 12500062) supplemented with 2 mM L-glutamine (Hyclone cat. no. H30034.01), non-essential amino acids (Hyclone cat. no. SH30238.01), and 10% heat-inactivated fetal bovine serum (FBS) (Atlas Biologicals cat. no. EF-0500-A). The virus strain used for the assay was SARS-CoV-2, USA-WA1/2020, passage 3. Virus stocks were obtained from BEI Resources and amplified in Vero E6 cells to passage 3 (P3) with a titer of 5.5×10⁵ PFU/mL. Cells were resuspended to 6.0×10⁵ cells/mL in 10% DMEM and seeded at 2 mL/well in 6-well plates. The plates were then incubated for approximately 24 hours (h) at 37° C., 5% CO2 for cells to adhere prior to infection. Cells were infected with SARS-CoV-2 at an MOI of 0.01. Samples were harvested at 0, 2, 4, 8, 12, 24, and 48 hours post infection in 200 μl TRIzol reagent for RNA extraction following the manufacturer's protocol.
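The inoculum volume implied by the seeding density, MOI, and stock titer stated above can be checked with a short calculation. The Python sketch below is illustrative only: it assumes the cell number at infection equals the number seeded per well (cell growth during the roughly 24-hour incubation is ignored), and the function name `inoculum_volume_ul` is hypothetical.

```python
def inoculum_volume_ul(cells_per_well, moi, titer_pfu_per_ml):
    """Volume of virus stock (in microliters) needed for one well.

    MOI is the number of plaque-forming units (PFU) added per cell.
    """
    pfu_needed = cells_per_well * moi
    volume_ml = pfu_needed / titer_pfu_per_ml
    return volume_ml * 1000.0  # mL -> uL


# Values stated in the text: 6.0e5 cells/mL seeded at 2 mL/well,
# MOI of 0.01, stock titer of 5.5e5 PFU/mL.
cells_per_well = 6.0e5 * 2  # assumption: no growth between seeding and infection
volume = inoculum_volume_ul(cells_per_well, moi=0.01, titer_pfu_per_ml=5.5e5)
print(f"{volume:.1f} uL of stock per well")
```

Under these assumptions each well receives about 1.2×10⁴ PFU, which corresponds to roughly 22 μL of undiluted stock; in practice the stock would typically be diluted into a larger inoculum volume.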
TABLES
[0107]
TABLE-US-00002 TABLE 1 Exemplary Host Biomarker identification
SEQ ID NO. 1: indoleamine 2,3-dioxygenase 1 (IDO1) (mRNA)
SEQ ID NO. 2: interferon induced protein with tetratricopeptide repeats 2 (IFIT2), (mRNA)
SEQ ID NO. 3: guanylate binding protein 4 (GBP4), (mRNA)
SEQ ID NO. 4: ISG15 ubiquitin like modifier (ISG15), (mRNA)
SEQ ID NO. 5: radical S-adenosyl methionine domain containing 2 (RSAD2), (mRNA)
SEQ ID NO. 6: methionine adenosyltransferase 1A (MAT1A), (mRNA)
SEQ ID NO. 7: caspase 16, pseudogene (CASP16P), (non-coding RNA)
SEQ ID NO. 8: U1 small nuclear 2 (RNU1-2), (small nuclear RNA)
SEQ ID NO. 9: ArfGAP with GTPase domain, ankyrin repeat and PH domain 11 (AGAP11), (mRNA)
SEQ ID NO. 10: synaptotagmin 4 (SYT4), (mRNA)
SEQ ID NO. 11: glutaminyl-peptide cyclotransferase (QPCT), (mRNA)
SEQ ID NO. 12: interleukin 2 (IL2), (mRNA)
SEQ ID NO. 13: brain abundant membrane attached signal protein 1 (BASP1), transcript variant 1, (mRNA)
SEQ ID NO. 14: family with sequence similarity 30 member A (FAM30A), (long non-coding RNA)
SEQ ID NO. 15: tetraspanin 13 (TSPAN13), (mRNA)
SEQ ID NO. 16: WWC2 antisense RNA 2 (WWC2-AS2), (long non-coding RNA)
SEQ ID NO. 17: prothymosin alpha (PTMA), transcript variant X5, (mRNA)
SEQ ID NO. 18: zinc finger protein 296 (ZNF296), (mRNA)
SEQ ID NO. 19: F-box and WD repeat domain containing 4 pseudogene 1 (FBXW4P1), (non-coding RNA)
SEQ ID NO. 20: SRY-box transcription factor 3 (SOX3), (mRNA)
SEQ ID NO. 21: C-C motif chemokine ligand 8 (CCL8), (mRNA)
SEQ ID NO. 22: cytochrome P450 family 1 subfamily B member 1 (CYP1B1), (mRNA)
SEQ ID NO. 23: long intergenic non-protein coding RNA 2057 (LINC02057), (long non-coding RNA)
SEQ ID NO. 24: adrenoceptor alpha 2B (ADRA2B), (mRNA)
SEQ ID NO. 25: UDP-GlcNAc:betaGal beta-1,3-N-acetylglucosaminyltransferase 6 (B3GNT6), (mRNA)
SEQ ID NO. 26: ankyrin repeat domain 22 (ANKRD22), (mRNA)
SEQ ID NO. 27: FERM domain containing 3 (FRMD3), transcript variant 1, (mRNA)
SEQ ID NO. 28: leucine aminopeptidase 3 (LAP3), (mRNA)
SEQ ID NO. 29: syntaxin 11 (STX11), (mRNA)
SEQ ID NO. 30: toll like receptor 7 (TLR7), (mRNA)
TABLE-US-00003 TABLE 2 Transcriptomics datasets used for the discovery of human universal response genes
SRP Index  Human Cell Line             Pathogen                                         Abbreviation      Hours Post Infection  Sequencing Data Type
SRP044763  IMR90                       Adenovirus                                       ADV               24                    mRNA
SRP163661  MRC5                        Adenovirus                                       ADV               24                    Total
SRP202003  HepG2                       Crimean-Congo hemorrhagic fever virus            CCHFV             72                    Total
SRP078309  A549                        Dengue virus 2                                   DENV2             36                    Total
SRP130978  HUH751                      Dengue virus 2                                   DENV2             NA                    Total
SRP132737  Huh7                        Dengue virus 2                                   DENV2             18                    Total
SRP188490  HEK293                      Dengue virus 2                                   DENV2             18                    Total
SRP101856  DC                          Ebola virus                                      EBOV              24                    Total
SRP111145  ARPE19                      Ebola virus                                      EBOV              24                    Total
SRP131318  Rhabdomyosarcoma            Enterovirus                                      EV                6                     Total
SRP060253  AGS                         Epstein-Barr virus                               EBV               NA                    Total
SRP255890  B Cell                      Epstein-Barr virus                               EBV               NA                    Total
SRP272684  B Cell Lymphoma             Epstein-Barr virus                               EBV               24                    Total
SRP212863  HUVEC                       Hantaan Orthohantavirus                          HTNV              72                    Total
SRP158789  HepG2                       Hepatitis B virus                                HBV               72                    Total
SRP187206  HUH751                      Hepatitis C virus                                HCV               148                   Total
SRP091538  HepG2                       Hepatitis E virus                                HEV               120                   Total
SRP117344  KMB17                       Herpes Simplex virus 1                           HSV-1             48                    Total
SRP154536  HEK293                      Herpes Simplex virus 1                           HSV-1             4                     Total
SRP163661  MRC5                        Herpes Simplex virus 1                           HSV-1             9                     Total
SRP177947  THP1                        Herpes Simplex virus 1                           HSV-1             24                    Total
SRP189489  HFF                         Herpes Simplex virus 1                           HSV-1             8                     Total
SRP065236  HFF                         Herpes Simplex virus 2                           HSV-2             8                     Total
SRP065236  EC                          Human Cytomegalovirus                            HCMV              48                    Total
SRP065236  HFF                         Human Cytomegalovirus                            HCMV              48                    Total
SRP085236  NPC                         Human Cytomegalovirus                            HCMV              48                    Total
SRP163661  MRC5                        Human Cytomegalovirus                            HCMV              48                    Total
SRP266618  NTT                         Human Cytomegalovirus                            HCMV              24                    Total
SRP065236  CD4+ T Cell                 Human Immunodeficiency virus 1                   HIV-1             120                   Total
SRP155217  CD4+ T Cell                 Human Immunodeficiency virus 1                   HIV-1             72                    Total
SRP155822  ileum organoid              Human Norovirus                                  HuNoV             48                    Total
SRP223234  HFK                         Human Papillomavirus                             HPV               NA                    Total
SRP253951  A549                        Human Parainfluenza virus 3                      HPIV3             24                    Total
SRP183819  HNEpC                       Human Rhinovirus                                 HRV               48                    Total
SRP161185  ATII                        Influenza A virus                                IAV               24                    Total
SRP230823  HeLa                        Influenza A virus                                IAV               24                    Total
SRP234025  A549                        Influenza A virus                                IAV               48                    Total
SRP253951  A549                        Influenza A virus                                IAV               9                     Total
SRP272285  A549                        Influenza A virus                                IAV               6                     Total
SRP277269  293T                        Influenza A virus                                IAV               6                     Total
SRP261173  A549                        Influenza A virus                                IAV               12                    Total
SRP170549  Calu3                       Middle East respiratory syndrome coronavirus     MERS-CoV          24                    Total
SRP227272  Calu3                       Middle East respiratory syndrome coronavirus     MERS-CoV          24                    mRNA
SRP096169  HFF                         Orf virus                                        ORFV              8                     Total
SRP277439  HEK293                      Porcine Rotavirus                                PoRV              12                    Total
SRP229586  A549                        Respiratory Syncytial virus                      RSV               36                    Total
SRP229586  H292                        Respiratory Syncytial virus                      RSV               36                    Total
SRP229586  HBEC                        Respiratory Syncytial virus                      RSV               36                    Total
SRP253951  A549                        Respiratory Syncytial virus                      RSV               24                    Total
SRP115192  HSAEpC                      Rift Valley Fever virus                          RVFV              18                    Total
SRP094462  HInEpC                      Rotavirus                                        ROTAV             6                     Total
SRP253951  A549-ACE2                   Severe acute respiratory syndrome coronavirus 2  SARS-CoV-2        24                    Total
SRP270617  PHAE                        Severe acute respiratory syndrome coronavirus 2  SARS-CoV-2        48                    Total
SRP273473  DC                          Severe acute respiratory syndrome coronavirus 2  SARS-CoV-2        2                     Total
SRP273473  MAC                         Severe acute respiratory syndrome coronavirus 2  SARS-CoV-2        2                     Total
SRP278618  iPSC-derived cardiomyocyte  Severe acute respiratory syndrome coronavirus 2  SARS-CoV-2        48                    Total
SRP061284  MeWo                        Varicella-zoster virus                           VZV               24                    Total
SRP225661  A549                        West Nile virus                                  WNV               24                    Total
SRP142592  hNSC                        Zika virus                                       ZIKV              72                    Total
SRP251704  A549                        Zika virus                                       ZIKV              48                    Total
SRP253197  HepG2                       Zika virus                                       ZIKV              48                    Total
SRP296743  PBMC                        Aspergillus fumigatus                            A. fumigatus      24                    Total
SRP296743  PBMC                        Candida albicans                                 C. albicans       24                    Total
SRP296743  PBMC                        Rhizopus oryzae                                  R. oryzae         24                    Total
SRP285913  HeLa                        Chlamydia trachomatis                            C. trachomatis    44                    Total
SRP321546  DLD-1                       Fusobacterium nucleatum                          F. nucleatum      24                    Total
SRP321940  Primary human trophoblasts  Listeria monocytogenes                           L. monocytogenes  5                     Total
ERP020415  TRP-1                       Mycobacterium tuberculosis                       M. tuberculosis   48                    Total
ERP115551  hBMECs                      Neisseria meningitidis                           N. meningitidis   6                     mRNA
SRP263458  HUVEC                       Staphylococcus aureus                            S. aureus         16                    Total
SRP072326  A549                        Streptococcus pneumoniae                         S. pneumoniae     2                     Total
TABLE-US-00004 TABLE 3 The 69 universal response genes in humans
RefSeq Accession  Gene Symbol
NM_030641         APOL6
NM_001165         BIRC3
NM_004335         BST2
NM_001565         CXCL10
NM_000584         CXCL8
NM_014314         DDX58
NM_017631         DDX60
NM_024119         DHX58
NM_138287         DTX3L
NM_004417         DUSP1
NM_004419         DUSP5
NM_004420         DUSP8
NM_001964         EGR1
NM_001432         EREG
NM_005252         FOS
NM_002053         GBP1
NM_052941         GBP4
NM_001945         HBEGF
NM_016323         HERC5
NM_006734         HIVEP2
NM_005514         HLA-B
NM_000201         ICAM1
NM_005532         IFI27
NM_006417         IFI44
NM_006820         IFI44L
NM_002038         IFI6
NM_022168         IFIH1
NM_001547         IFIT2
NM_001549         IFIT3
NM_012420         IFIT5
NM_003641         IFITM1
NM_006435         IFITM2
NM_002176         IFNB1
NM_172140         IFNL1
NM_016584         IL23A
NM_001570         IRAK2
NM_006084         IRF9
NM_005101         ISG15
NM_002228         JUN
NM_015907         LAP3
NM_002462         MX1
NM_002463         MX2
NM_020529         NFKBIA
NM_012118         NOCT
NM_002535         OAS2
NM_006187         OAS3
NM_003733         OASL
NM_022750         PARP12
NM_017554         PARP14
NM_021127         PMAIP1
NM_152542         PPM1K
NM_014330         PPP1R15A
NM_000958         PTGER4
NM_006509         RELB
NM_014470         RND1
NM_080657         RSAD2
NM_022147         RTP4
NM_002999         SDC4
NM_003745         SOCS1
NM_007315         STAT1
NM_003764         STX11
NM_017633         TENT5A
NM_001561         TNFRSF9
NM_003141         TRIM21
NM_080745         TRIM69
NM_017414         USP18
NM_033390         ZC3H12C
NM_003407         ZFP36
NM_021035         ZNFX1
TABLE-US-00005 TABLE 4 Top 30 differentially up- and down-regulated genes from comparison between infected and healthy saliva
Gene Symbol   Log2(Fold Change)  Adjusted P-value
CHRNA5        6.05               9.35E−76
IL2RA         6.07               1.08E−71
STS           6.02               7.91E−69
BAG5          5.80               9.31E−64
HBD           7.01               3.53E−53
POR           6.03               4.83E−50
LCN10         6.38               4.06E−46
C10orf55      7.06               9.76E−44
TWIST1        6.35               1.08E−43
CA2           6.97               1.19E−43
NR0B1         7.13               7.96E−43
GALE          5.83               1.04E−42
TENT5A        6.15               2.69E−42
WRN           5.11               3.91E−42
NOS3          5.95               5.09E−41
HBEGF         5.00               8.94E−41
DRD4          6.13               5.62E−40
NCMAP         6.31               3.29E−39
REN           5.61               7.10E−39
FGG           4.98               2.07E−37
HADHA         5.01               8.57E−37
HBG2          7.61               2.11E−36
HOXD13        4.86               2.50E−36
KITLG         5.31               1.18E−35
CHRNB1        5.74               1.08E−32
ITGB3         4.59               2.63E−32
BST2          6.03               3.66E−32
OR56B1        7.34               4.66E−31
HBG1          8.01               5.45E−31
RND1          7.31               6.27E−31
LOC102723665  −3.38              1.86E−06
GCSAM         −4.12              1.84E−05
TAAR9         −5.50              2.94E−05
CDCA7L        −3.59              1.16E−04
MIR320B2      −4.81              1.47E−04
HULC          −5.84              1.49E−04
ZNF235        −3.25              2.40E−04
SLC39A12      −3.05              3.28E−04
IVNS1ABP      −3.87              3.58E−04
KLHDC4        −3.96              4.01E−04
SERPINB5      −3.57              4.41E−04
LOC101927143  −4.42              4.45E−04
VAV2          −3.29              4.68E−04
DSEL          −4.39              5.69E−04
RPL22         −2.67              7.18E−04
LINC01085     −3.48              7.23E−04
ERVW-1        −3.94              8.02E−04
SLC25A25-AS1  −3.54              8.58E−04
THOC5         −2.59              9.56E−04
UXT-AS1       −4.49              1.21E−03
TRI-AAT1-1    −3.34              1.37E−03
AKAP4         −3.07              1.76E−03
TADA2A        −2.58              2.03E−03
LRRC7         −3.49              2.71E−03
LEMD1-AS1     −3.55              3.02E−03
GNG14         −3.82              3.37E−03
ZNF461        −3.55              3.77E−03
LINC01781     −2.66              4.07E−03
SAMD13        −3.46              4.65E−03
SLAMF8        −1.81              5.00E−03
TABLE-US-00006 TABLE 5 Multiplex TaqMan RT-qPCR assay for monitoring host immune gene signature expression.
Group         Gene Target  Primer Name  Primer Sequence (5′->3′)         Probe Sequence (5′->3′)    Probe Dye
1 (Controls)  CALR         CALR_F       GAGTATTCTCCCGATCCCAGTATCTATGCC   ATGAGGCATACGCTGAGGAGTTTGG  ABY
                           CALR_R       ATTTGTTTCTCTGCTGCCTTTGTTACGCCC
              RACK1        RACK1_F      TCCCACTTTGTTAGTGATGTGGTTATCTCC   CAGTTTGCCCTCTCAGGCTCCT     VIC
                           RACK1_R      CAAATCGCCTCGTGGTGGTGCCCGTTGTGAG
              RPP30        RPP30_F      AGATTTGGACCTGCGAGCG              TTCTGACCTGAAGGCTCTGCGCG    FAM
                           RPP30_R      GAGCGGCTGTCTCCACAAGT
2             DDX58        DDX58_F      CCGGAAGACCCTGGACCCTA             TTAGGGAGGAAGAGGTGCAG       ABY
                           DDX58_R      AGGGCATCCAAAAAGCCACG
              IFIT2        IFIT2_F      CCCTGCCGAACAGCTGAGAA             CTGCAACCATGAGTGAGAAC       VIC
                           IFIT2_R      AGTTGCCGTAGGCTGCTCTC
              IFITM2       IFITM2_F     ATAGCATTCGCGTACTCCGT             TGCCTCCACCGCCAAGTGC        FAM
                           IFITM2_R     TGATGCCTCCTGATCTATCGC
3             Mx1          Mx1_F        TAGAGAGCTGCCAGGCTTTG             TACACACCGTGACGGATATG       ABY
                           Mx1_R        ATCTGTGAAAGCAAGCCGGA
              IFI6         IFI6_F       TCGCTGCTGTGCCCATCTATC            CTGCTGCTCTTCACTTGC         VIC
                           IFI6_R       TTCTTACCTGCCTCCACCCCAC
              IFIT3        IFIT3_F      ACAGCAGAGACACAGAGGGCA            TCATGAGTGAGGTCACCAAG       FAM
                           IFIT3_R      AGCTGTGGAAGGATTTTCTCCAGG
4             IFI27        IFI27_F      GCCACGGAATTAACCCGAGC             CATCAGCAGTGACCAGTGTG       ABY
                           IFI27_R      GCCACAACTCCTCCAATCACA
              IFIH1        IFIH1_F      ACAGCTTCACCTGGTGTTGGA            CGAAGCAAGCCAAAGCTGAAG      VIC
                           IFIH1_R      ATGGCAAACTTCTTGCATGGCT
              PARP12       PARP12_F     ACCATGCAAACCTGCAATACC            TCCAGGCCCGAAGAGCATC        FAM
                           PARP12_R     GCAGCGTGCGGTTAAAGAG
5             IRF9         IRF9_F       GCTCTTCAGAACCGCCTACTTC           CTCCAGCCATACTCCACAGAATC    ABY
                           IRF9_R       CTCCAGCAAGTATCGGGCAA
              CXCL10       CXCL10_F     TGCAAGCCAATTTTGTCCACG            AGCAGTTAGCAAGGAAAGGTC      VIC
                           CXCL10_R     GCCTCTGTGTGGTCCATCCT
              Mx2          Mx2_F        CATGATTGTGAAGTGCCGGG             CTGAGCTTGGCAGAGGCAAC       FAM
                           Mx2_R        CAACGGGAGCGATTTTTGGA
6             OAS2         OAS2_F       CGTTGGTGTTGGCATCTTCTG            CCAGTCCCATCCTTGAAGCAG      ABY
                           OAS2_R       TGCATTGTCGGCACTTTCC
              CXCL8        CXCL8_F      CCAGGAAGAAACCACCGGAA             TGGCCGTGGCTCTCTTG          VIC
                           CXCL8_R      CTTGGCAAAACTGCACCTTCAC
              RTP4         RTP4_F       TGGACGCTGAAGTTGGATGGC            CTCTCTGTTGGTATTGCTTC       FAM
                           RTP4_R       CAACTTCGCTGGCAGGAGGAA
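The relative fold change referred to in paragraph [0104] can be computed from Ct values such as those produced by the multiplex groups in Table 5. The following is a minimal Python sketch of the delta-delta Ct calculation, not the inventors' analysis code. It assumes 100% amplification efficiency (one doubling per cycle) and normalizes to the mean Ct of the internal controls; the function name `ddct_fold_change` and all Ct values shown are hypothetical, not measured data.

```python
from statistics import mean


def ddct_fold_change(target_ct_infected, control_cts_infected,
                     target_ct_uninfected, control_cts_uninfected):
    """Relative fold change of a target gene by the delta-delta Ct method.

    Each delta Ct is the target Ct minus the mean Ct of the internal
    control genes measured in the same sample.
    """
    delta_infected = target_ct_infected - mean(control_cts_infected)
    delta_uninfected = target_ct_uninfected - mean(control_cts_uninfected)
    delta_delta = delta_infected - delta_uninfected
    # Assumption: perfect doubling per cycle, so fold change = 2^(-ddCt).
    return 2.0 ** -delta_delta


# Hypothetical Ct values: one target gene plus three internal controls
# (e.g., RPP30, RACK1, CALR) in an infected and an uninfected sample.
fold = ddct_fold_change(
    target_ct_infected=24.0, control_cts_infected=[20.0, 21.0, 19.0],
    target_ct_uninfected=28.0, control_cts_uninfected=[20.0, 21.0, 19.0],
)
print(fold)
```

With these made-up inputs the target crosses threshold four cycles earlier in the infected sample relative to the controls, which corresponds to a 16-fold increase under the stated efficiency assumption.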