METHODS OF DETERMINING AND PREDICTING MUTATED mRNA SPLICE ISOFORMS
20180051326 · 2018-02-22
Inventors
Cpc classification
C12Q2537/165
CHEMISTRY; METALLURGY
G16B20/20
PHYSICS
G16B25/00
PHYSICS
C12Q2539/105
CHEMISTRY; METALLURGY
G16B20/00
PHYSICS
International classification
Abstract
Mutations that affect mRNA splicing often produce multiple mRNA isoforms containing different exon structures. Definition of an exon and its inclusion in mature mRNA relies on joint recognition of both acceptor and donor splice sites. The instant methodology predicts cryptic and exon skipping isoforms in mRNA produced by splicing mutations from the combined information contents and the distribution of the splice sites and other regulatory binding sites defining these exons. In its simplest form, the total information content of an exon, R.sub.i,total, is the sum of the information contents of its corresponding acceptor and donor splice sites, adjusted for the self-information of the exon length. Differences between R.sub.i,total values of mutant versus normal exons that are concordant with gene expression data demonstrate alterations in the structures and relative abundance of the mRNA transcripts resulting from these mutations.
Claims
1. A method for assessing changes in expression level of a gene having an mRNA splice-altering mutation, said mutation being located within a sequence window circumscribing an exon and one or more intronic sequences of said gene, said one or more intronic sequences being adjacent to said exon, performed by a computer processor executing instructions in tangible memory, said method comprising the steps of: (a) computing and identifying changes in individual information contents of a potential donor and acceptor splice site pair, and one or more splicing regulatory sequences, such as splicing enhancer and/or silencer sequence elements, which together define either a constitutive or a mutated exon, at each nucleotide position by computing a product of their respective information theory-based position weight matrices and a corresponding binary matrix of a respective splice site sequence; (b) computing the total information content, R.sub.i,total, of a potential exon as the sum of the corresponding individual information contents of the acceptor and donor pair, corrected by adding the gap surprisal of an exon whose length is the distance between the donor and acceptor pair; (c) comparing the R.sub.i,total values of all potential mRNA splice isoforms of the wild-type gene and the same values after the wild-type gene sequence is mutated to determine whether the mutation alters the abundance of the mRNA isoforms containing the exon, said comparison resulting in potential exons with different R.sub.i,total values in the wild-type and mutated gene, wherein the splice isoform with the largest R.sub.i,total value is predicted to be the most abundant splice isoform, and the splice isoform with the smallest R.sub.i,total value is predicted to be the least abundant isoform, and the relative abundance of any pair of isoforms corresponds to 2 to the power of the differences between the R.sub.i,total values; and (d) graphically displaying each of the isoforms that are unchanged, newly formed, altered in abundance, or eliminated by the mutation.
2. The method of claim 1, wherein the comparison step (c) determines the relative abundance of a pair of splice isoforms by computing 2 to the power of the difference between the R.sub.i,total values of each isoform.
3. The method of claim 2, wherein the mutation occurs at a cryptic splice site and the R.sub.i,total value of the isoform containing this splice site is increased, resulting in increased abundance of the isoform.
4. The method of claim 3, wherein the mutation is a leaky or partial splicing mutation, said mutation causing a mutant isoform to exceed the abundance of the normal mRNA splice isoform by at least 1 bit or 2 fold.
5. The method of claim 3, wherein a paucimorphic or effectively null allele for a splicing mutation occurs in which a mutant isoform exceeds the abundance of the normal mRNA splice isoform by at least 5 bits or 32 fold.
6. The method of claim 2, wherein the mutation occurs at a natural splice site.
7. The method of claim 6, wherein the mutation is a leaky or partial splicing mutation, said mutation causing the R.sub.i,total of the mutant isoform to be less than the R.sub.i,total value of the normal mRNA splice isoform by at least 1 bit or 2 fold.
8. The method of claim 6, wherein paucimorphic or effectively null allele for a splicing mutation occurs in which the R.sub.i,total of the mutant isoform is less than the R.sub.i,total value of the normal mRNA splice isoform by at least 5 bits or 32 fold.
9. The method of claim 1, wherein the method is specific for first exons, using a first exon-specific gap surprisal function derived from the exon lengths of a majority of human genes encoding spliced mRNAs.
10. The method of claim 1, wherein the method is specific for last exons, using a last exon-specific gap surprisal function derived from the exon lengths of a majority of human genes encoding spliced mRNAs.
11. The method of claim 1, further comprising a step (e) of correcting the R.sub.i,total from step (c) by adding gap surprisal terms for one or more splicing enhancer and/or one or more silencer sequence elements recognized by an RNA binding protein or a small nuclear ribonucleoprotein, wherein a strength of at least one of said splicing enhancer and/or one or more said silencer sequence elements is altered due to the mutation of said gene.
12. The method of claim 11, wherein a secondary gap surprisal is added to take into account distances between at least one natural splice site and each altered splicing enhancer and/or silencer sequence elements, and wherein said secondary gap surprisal is a gap surprisal term computed from a distance between a closest donor or acceptor splice site and one or more splicing regulatory protein binding sites that occur either within said exon or in an adjacent intron of said exon.
13. The method of claim 12, wherein at least one weak binding site that overlaps with a stronger binding site is not taken into account when applying said secondary gap surprisal.
14. The method of claim 1, wherein the total information content (R.sub.i,total) includes a contribution for an RNA binding protein that recognizes its cognate binding site by addition of the RI value of the binding site and a gap surprisal term for said RNA binding protein, said gap surprisal being computed from the distance between said RNA binding protein binding site and the nearest known splice site, said gap surprisal term being determined by scanning the genome for transcribed binding sites of said binding protein with an information-theory derived position weight matrix (abbreviated as PWM), said PWM being derived from a set of RNA sequences bound by said binding protein, said gap surprisal distribution determined from the frequency of each interval length between the known nearest splice site and the binding site for said RNA binding protein, separately for exons and introns, wherein said RNA sequences used to derive the PWM are obtained from CLIP-seq or PAR-CLiP libraries derived by binding of said RNA binding protein to these sequences.
15. The method of claim 1, wherein said step (d) is performed by extracting mRNAs from said at least one cell and by determining the sequence of one or more mRNA molecules derived from said gene.
16. The method of claim 1, wherein said step (d) is performed by extracting proteins from said at least one cell expressing said gene and by determining the sequence of one or more protein molecules derived from said gene.
17. The method of claim 1, further comprising the step of identifying new and unknown splice isoforms and determining their abundance relative to previously known splice isoforms.
18. The method of claim 1, wherein the information contents of all of the splicing regulatory sequences in an exon and adjacent intronic sequences are zero bits.
19. The method of claim 1, wherein the gap surprisal term (g(x)) for internal exons is given by the formula
$g(x) = 7.036\times10^{-23}x^{8} - 6.128\times10^{-19}x^{7} + 2.212\times10^{-15}x^{6} - 4.273\times10^{-12}x^{5} + 4.749\times10^{-9}x^{4} - 3.028\times10^{-6}x^{3} + 0.001026x^{2} - 0.1414x + 6.5383$, where x = length of exon.
20. The method of claim 1, wherein the gap surprisal term (g(x)) for last exons is given by the formula
$g(x) = 5.44\times10^{-24}x^{8} + 4.01\times10^{-20}x^{7} - 1.12\times10^{-16}x^{6} + 1.33\times10^{-13}x^{5} - 2.23\times10^{-11}x^{4} - 1.05\times10^{-7}x^{3} + 0.000104x^{2} - 0.03574x + 4.1378$, where x = length of exon.
21. The method of claim 1, wherein the gap surprisal term (g(x)) for first exons is given by the formula
$g(x) = 3.45\times10^{-23}x^{8} - 2.94\times10^{-19}x^{7} + 1.04\times10^{-15}x^{6} - 1.95\times10^{-12}x^{5} + 2.13\times10^{-9}x^{4} - 1.37\times10^{-6}x^{3} + 0.000490554x^{2} - 0.079260304x + 4.5219$, where x = length of exon.
22. The method of claim 1, wherein the process further comprises the step of testing the predictions of information theory based on exon definition by testing for the presence and abundance of the predicted isoforms by extracting mRNAs or proteins from at least one cell expressing said gene, performing gene expression assays that detect the predicted isoforms, and to determine the most abundant mRNA splice isoforms of said gene, thus allowing the concerted assessment of multiple changes in isoform expression levels within said gene.
23. The method of claim 22, wherein validation of the predicted reduction in residual normal mRNA levels is then observed only when mutation is present, but not when it is absent.
24. The method of claim 22, wherein predicted mutant cryptic isoforms are subsequently validated using the appropriate RT-PCR or RNA sequencing testing procedure.
25. The method of claim 22, wherein the cryptic isoforms present only in individuals carrying the predicted mutation are subsequently validated using appropriate RT-PCR or RNA-sequencing testing procedure, thereby excluding natural alternative mRNA splicing as the source of the isoforms.
26. The method of claim 22, wherein a predicted cryptic exon or pseudoexon is validated by RT-PCR, high throughput RNA sequencing, or a hybridization microarray containing hybridization probes containing sequences complementary to the novel predicted exon.
27. The method of claim 22, wherein the mutation is predicted to cause intron inclusion in the incompletely processed transcript, and the gene expression assay detects the predicted intronic sequences.
28. The method of claim 22, wherein the mutation is predicted to result in overlapping natural and cryptic splice sites of the same polarity that produce exon skipping, and the predicted result is validated by a specific gene expression analysis of this outcome using either RT-PCR, expression microarray, or high throughput RNA Sequencing.
29. The method of claim 22, wherein the mutation is predicted to activate splicing of a cryptic intron within a natural exon, and the predicted result is validated by a specific gene expression analysis of this outcome using either RT-PCR, expression microarray, or high throughput RNA sequencing.
30. The method of claim 22, wherein exon skipping does not occur when the predicted regulatory splice site mutation is absent, only when it is present.
31. A method for determining changes in expression level of a gene having an mRNA splice-altering mutation, said mutation being located within a sequence window circumscribing an exon and one or more intronic sequences of said gene, said one or more intronic sequences being adjacent to said exon, performed by a computer processor executing instructions in tangible memory, said method comprising the steps of: (a) computing and identifying changes in individual information contents of a potential donor and acceptor splice site pair and one or more splicing regulatory sequences, such as splicing enhancer and/or silencer sequence elements, which together define either a constitutive or a mutated exon, at each nucleotide position by computing a product of their respective information theory-based position weight matrices and a corresponding binary matrix of a respective splice site sequence; (b) computing the total information content, R.sub.i,total, of a potential exon as the sum of the corresponding individual information contents of the acceptor and donor pair, corrected by adding the gap surprisal of an exon whose length is the distance between the donor and acceptor pair; (c) comparing the R.sub.i,total values of all potential mRNA splice isoforms of the wild-type gene and the same values after the wild-type gene sequence is mutated to determine whether the mutation alters the abundance of the mRNA isoforms containing the exon, said comparison resulting in potential exons with different R.sub.i,total values in the wild-type and mutated gene, wherein the splice isoform with the largest R.sub.i,total value is predicted to be the most abundant splice isoform, and the splice isoform with the smallest R.sub.i,total value is predicted to be the least abundant isoform, and the relative abundance of any pair of isoforms corresponds to 2 to the power of the differences between the R.sub.i,total values, thereby determining a prediction of information theory based on exon definition; and (d) graphically displaying each of the isoforms that are unchanged, newly formed, altered in abundance, or eliminated by the mutation.
32. The method of claim 31, further comprising a step (e) of correcting the R.sub.i,total from step (b) by adding a gap surprisal term of one or more splicing enhancer and/or one or more silencer sequence elements recognized by an RNA binding protein or a small nuclear ribonucleoprotein, wherein strength of at least one of said splicing enhancer and/or one or more said silencer sequence elements is altered due to the mutation of said gene.
33. The method of claim 31, wherein a secondary gap surprisal is added to take into account distances between at least one natural splice site and each altered splicing enhancer and/or silencer sequence elements, and wherein said secondary gap surprisal is a gap surprisal term computed from a distance between a closest donor or acceptor splice site and one or more splicing regulatory protein binding sites that occur either within said exon or in an adjacent intron of said exon.
34. A method for determining changes in expression level of a gene having an mRNA splice-altering mutation, said mutation being located within a sequence window circumscribing an exon and one or more intronic sequences of said gene, said one or more intronic sequences being adjacent to said exon, performed by a computer processor executing instructions in tangible memory, said method comprising the steps of: (a) generating a genomic polynucleotide sequence of the gene; (b) computing and identifying changes in individual information contents of a potential donor and acceptor splice site pair and one or more splicing regulatory sequences, such as splicing enhancer and/or silencer sequence elements, which together define either a constitutive or a mutated exon, at each nucleotide position by computing a product of their respective information theory-based position weight matrices and a corresponding-binary matrix of a respective splice site sequence; (c) comparing the R.sub.i,total values of all potential mRNA splice isoforms of the wild-type gene and the same values after the wild-type gene sequence is mutated to determine whether the mutation alters the abundance of the mRNA isoforms containing the exon, said comparison resulting in potential exons with different R.sub.i,total values in the wild-type and mutated gene, wherein the splice isoform with the largest R.sub.i,total value is predicted to be the most abundant splice isoform, and the splice isoform with the smallest R.sub.i,total value is predicted to be the least abundant isoform, and the relative abundance of any pair of isoforms corresponds to 2 to the power of the differences between the R.sub.i,total values; and (d) graphically displaying each of the isoforms that are unchanged, newly formed, altered in abundance, or eliminated by the mutation.
35. The method of claim 34, wherein the comparison step (c) determines the relative abundance of a pair of splice isoforms by computing 2 to the power of the difference between the R.sub.i,total values of each isoform.
36. The method of claim 35, wherein the mutation occurs at a cryptic splice site and the R.sub.i,total value of the isoform containing this splice site is increased, resulting in increased abundance of the isoform.
37. The method of claim 36, wherein the mutation is a leaky or partial splicing mutation, said mutation causing a mutant isoform to exceed the abundance of the normal mRNA splice isoform by at least 1 bit or 2 fold.
38. The method of claim 36, wherein a paucimorphic or effectively null allele for a splicing mutation occurs in which a mutant isoform exceeds the abundance of the normal mRNA splice isoform by at least 5 bits or 32 fold.
39. The method of claim 35, wherein the mutation occurs at a natural splice site.
40. The method of claim 39, wherein the mutation is a leaky or partial splicing mutation, said mutation causing the R.sub.i,total of the mutant isoform to be less than the R.sub.i,total value of the normal mRNA splice isoform by at least 1 bit or 2 fold.
41. The method of claim 39, wherein paucimorphic or effectively null allele for a splicing mutation occurs in which the R.sub.i,total of the mutant isoform is less than the R.sub.i,total value of the normal mRNA splice isoform by at least 5 bits or 32 fold.
42. The method of claim 34, further comprising a step (e) of correcting the R.sub.i,total from step (b) by adding a gap surprisal term of one or more splicing enhancer and/or one or more silencer sequence elements recognized by an RNA binding protein or a small nuclear ribonucleoprotein, wherein strength of at least one of said splicing enhancer and/or one or more said silencer sequence elements is altered due to the mutation of said gene.
43. The method of claim 42, wherein a secondary gap surprisal is added to take into account distances between at least one natural splice site and each altered splicing enhancer and/or silencer sequence elements, and wherein said secondary gap surprisal is a gap surprisal term computed from a distance between a closest donor or acceptor splice site and one or more splicing regulatory protein binding sites that occur either within said exon or in an adjacent intron of said exon.
44. A method of predicting the molecular phenotype of a splicing mutation, which produces a probable set of splicing isoforms expressed in mutation carriers based on accurately predicting and quantifying binding site affinity due to sequence mutations in the transcribed DNA template, wherein non-expressed or very low expression exons are eliminated by correcting for suboptimal exon lengths, low affinity binding sites and incorrectly ordered mRNA splice sites, comprising the steps of: (a) computing and identifying changes in individual information contents of a potential donor and acceptor splice site pairs, and one or more splicing regulatory sequences, such as splicing enhancer and/or silencer sequence elements, which together define either a constitutive or a mutated exon, at each nucleotide position by computing a product of their respective information theory-based position weight matrices and a corresponding-binary matrix of a respective splice site sequence; (b) defining potential exons by selecting every pair combination of acceptor and donor splice sites and one or more splicing regulatory sequences in the sequence window, and determining a gap surprisal value based on distance in nucleotides between sites comprising a pair combination, wherein the gap surprisal value is calculated for each potential exon length or distance between splice regulatory sequence and splice site, based on frequency of said length in the genome as the inverse log.sub.2 of said frequency according to the formula; (c) computing the total information content, R.sub.i,total, of a potential exon as the sum of the corresponding individual information contents of the acceptor and donor pair, corrected by adding the gap surprisal of an exon whose length is the distance between the donor and acceptor pair; (d) comparing the R.sub.i,total values of all potential mRNA splice isoforms of the wild-type gene and the same values after the wild-type gene sequence is mutated to determine whether 
the mutation alters the abundance of the mRNA isoforms containing the exon, said comparison resulting in potential exons with different R.sub.i,total values in the wild-type and mutated gene, wherein the splice isoform with the largest R.sub.i,total value is predicted to be the most abundant splice isoform, and the splice isoform with the smallest R.sub.i,total value is predicted to be the least abundant isoform, and the relative abundance of any pair of isoforms corresponds to 2 to the power of the differences between the R.sub.i,total values, wherein the gap surprisal term (g(x)) for internal exons is given by the formula
$g(x) = 7.036\times10^{-23}x^{8} - 6.128\times10^{-19}x^{7} + 2.212\times10^{-15}x^{6} - 4.273\times10^{-12}x^{5} + 4.749\times10^{-9}x^{4} - 3.028\times10^{-6}x^{3} + 0.001026x^{2} - 0.1414x + 6.5383$; wherein the gap surprisal term (g(x)) for last exons is given by the formula
$g(x) = 5.44\times10^{-24}x^{8} + 4.01\times10^{-20}x^{7} - 1.12\times10^{-16}x^{6} + 1.33\times10^{-13}x^{5} - 2.23\times10^{-11}x^{4} - 1.05\times10^{-7}x^{3} + 0.000104x^{2} - 0.03574x + 4.1378$; and wherein the gap surprisal term (g(x)) for first exons is given by the formula
$g(x) = 3.45\times10^{-23}x^{8} - 2.94\times10^{-19}x^{7} + 1.04\times10^{-15}x^{6} - 1.95\times10^{-12}x^{5} + 2.13\times10^{-9}x^{4} - 1.37\times10^{-6}x^{3} + 0.000490554x^{2} - 0.079260304x + 4.5219$, where x = length of exon; and (e) graphically displaying each of the isoforms that are unchanged, newly formed, altered in abundance, or eliminated by the mutation.
45. The method of claim 44, wherein the process further comprises the step of testing the predictions of information theory based on exon definition by testing for the presence and abundance of the predicted isoforms by extracting mRNAs or proteins from at least one cell expressing said gene, performing gene expression assays that detect the predicted isoforms, and to determine the most abundant mRNA splice isoforms of said gene, thus allowing the concerted assessment of multiple changes in isoform expression levels within said gene.
46. A computational method of assessing expression level and structure of mRNAs that combines the total strengths and distributions of splicing recognition sequences in a gene having a splicing mutation which provides results comparable to experimentally determined mRNA transcript analyses comprising: a processor; and a memory medium coupled to the processor, wherein the memory medium stores: individual information contents of a potential donor and acceptor splice site pair, and one or more splicing regulatory sequences, such as splicing enhancer and/or silencer sequence elements which together define either a constitutive or a mutated exon, at each nucleotide position by computing a product of their respective information theory-based position weight matrices and a corresponding-binary matrix of a respective splice site sequence, gap surprisal value based on distance in nucleotides between sites comprising a pair combination and one or more splicing regulatory sequences, wherein the gap surprisal value is calculated for each potential exon length based on frequency of said length in the genome as the inverse log.sub.2 of said frequency, total information content, R.sub.i,total, of a potential exon as the sum of the corresponding individual information contents of the acceptor and donor pair, corrected by adding the gap surprisal of an exon whose length is the distance between the donor and acceptor pair, and R.sub.i,total values of all potential mRNA splice isoforms of the wild-type gene and the same values after the wild-type gene sequence is mutated, and program instructions, executable by the processor to: receive process information wherein the process information includes; computing and identifying changes in individual information contents of a potential donor and acceptor splice site pairs, defining potential exons by selecting every pair combination of acceptor and donor splice sites in the sequence window, and determining a gap surprisal value based on 
distance in nucleotides between sites comprising a pair combination, computing the total information content, R.sub.i,total, of a potential exon, comparing the R.sub.i,total values of all potential mRNA splice isoforms of the wild-type gene and the same values after the wild-type gene sequence is mutated, graphically displaying each of the isoforms that are unchanged, newly formed, altered in abundance, or eliminated by the mutation, and to execute the method of claim 1 using the process information as input, thereby determining whether the mutation alters the abundance of the mRNA isoforms containing the exon.
Description
DETAILED DESCRIPTION
Exon Information Content
[0048] The information content of a spliced exon may be derived from the cumulative contributions of the nucleic acid binding sites recognized by the spliceosomal machinery and the distribution of distances separating binding sites within the same exon. Given a set S of n different binding sites in an exon, each of which is recognized by one of m different proteins, then S={x.sub.n, where 1 ≤ n ≤ m}. The total information content, I.sub.s, of all sites in S is
$$I_s = \sum_{n=1}^{m} R_i(x_n) \qquad (1)$$
The information content of each site, R.sub.i(x.sub.n) (measured in bits) is derived from a weight matrix (R.sub.iw) representing the sequence conservation of each nucleotide in that sequence. The derivation has been presented previously (Schneider, 1997; Rogan et al., 1998).
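The weight-matrix calculation can be illustrated with a minimal Python sketch. It assumes the individual information weight of Schneider (1997), R_iw(b,l) = 2 + log2 f(b,l) - e(n), with the small-sample correction e(n) set to zero for simplicity, and it multiplies the weights by a one-hot (binary) encoding of the sequence as described in the claims. The function names and the two-position frequency table are hypothetical and purely illustrative; they are not taken from the ASSEDA implementation.

```python
import math

BASES = "ACGT"

def weight_matrix(frequencies, sampling_correction=0.0):
    """Build R_iw(b, l) = 2 + log2 f(b, l) - e(n) for each position l.

    `frequencies` is a list of {base: frequency} dicts, one per position.
    All frequencies must be > 0 in this sketch (no pseudocount handling).
    """
    return [
        {b: 2.0 + math.log2(column[b]) - sampling_correction for b in BASES}
        for column in frequencies
    ]

def binary_matrix(sequence):
    """One-hot encode the sequence: rows are positions, columns follow BASES."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]

def r_i(matrix, sequence):
    """Sum the weights selected by the binary matrix, giving R_i in bits."""
    encoded = binary_matrix(sequence)
    return sum(
        matrix[l][b] * encoded[l][k]
        for l in range(len(sequence))
        for k, b in enumerate(BASES)
    )

# Hypothetical two-position site model (illustrative frequencies only).
toy_frequencies = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
toy_matrix = weight_matrix(toy_frequencies)
print(r_i(toy_matrix, "AC"))  # 1.0 bit: only the first position is informative
print(r_i(toy_matrix, "GT"))  # -1.0 bit: a disfavored base at the first position
```

A real splice site model would span many more positions and would be built from the genome-wide acceptor and donor site sets; the arithmetic per position is the same.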
[0049] The information contents of each set of binding sites are modified to account for the probability that these sites occur within the same exon. This requires a gap surprisal term that depends on the transcriptome-wide distribution of the lengths separating them. The gap surprisal is applied to a set of sites within the same exon. Each combination of different binding proteins (x.sub.1, x.sub.2 . . . ) is described by a distinct distribution. The number of different, unordered pairs of binding sites, given n different sites, corresponds to $\binom{n}{2}$ different gap surprisal terms. The gap surprisal for two binding sites (x.sub.p and x.sub.q) separated by L nucleotides, g(L.sub.pq), is
$$g(L_{pq}) = -\log_2 P(L_{pq}) \text{ bits} \qquad (2)$$
where L.sub.pq is the distance between the x.sub.p and x.sub.q sites. We calculate P(L.sub.pq) from experimentally validated inter-site distances from human genes. Equation (2) signifies that the greater the distance between two sites, the larger the gap surprisal (greater penalty) will be, resulting in a biological reduction in the occurrence of exons longer than the consensus length.
[0050] Denoting by G(L.sub.s) the total gap surprisal of the $\binom{n}{2}$ different pairs of sites in set S,
$$G(L_s) = \sum_{p<q} g(L_{pq}) \qquad (3)$$
The total information content (R.sub.i,total) is defined by combining Equations (1) and (3),
$$R_{i,total} = I_s - G(L_s) \qquad (4)$$
[0051] To calculate the R.sub.i,total of an internal exon, we consider the simplest case with a constitutive set of donor and acceptor splice sites (n=2). We define x.sub.1 as the acceptor and x.sub.2 as the donor site. x.sub.n has been extended to incorporate other types of binding sites, including those of the splicing regulatory factors SF2/ASF (SRSF1) and SC35 (SRSF2), that modify exon recognition. These factors act to enhance splicing when the recognition sites are located within exons (ESE) and to repress splicing (ISS) when the sites occur in the intron adjacent to constitutive splice sites (Lim et al., 2011). The sign of this term in R.sub.i,total is positive if the binding site is exonic and negative if it is intronic. The pairwise distribution of functional binding sites in the transcriptome is required to determine g(L.sub.pq). The sign of the g(L.sub.pq) term is negative for exonic locations (ESE) and reversed for intronic sites (ISS). For the first and last exons of a gene, R.sub.i,total is the R.sub.i value of the single splice site in that exon adjusted for g(L), where L is the exon length and g(L) is based on the length distribution for the corresponding terminal exons. We calculate and compare R.sub.i,total values for the strengths of the constitutive splice sites in an exon prior to and after a mutation (detailed below). Isoforms with either different donor or acceptor sites may be predicted for each mutation. Because the lengths of these isoforms may vary considerably from one another, analysis of compound mutations at different gene locations has been disabled in molecular phenotypic analysis. The exon definition transformation requires at least one natural site from an exon to be contained in the predicted isoforms; thus, cryptic or pseudo-exons activated by intronic mutations are not reported. Nevertheless, the point mutation analysis capability of the ASSA server may detect these sites.
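The comparison of wild-type and mutant exons can be sketched as follows. This is a minimal illustration for the simplest case (n=2), assuming that the normalized gap surprisal acts as a penalty that is subtracted from the summed splice site strengths, and using the relationship stated in the claims that the relative abundance of two isoforms is 2 raised to the difference of their R.sub.i,total values. All function names and bit values are hypothetical.

```python
def r_i_total(acceptor_bits, donor_bits, gap_surprisal_bits):
    """R_i,total of an internal exon: site strengths minus the length penalty.

    Assumption: the normalized gap surprisal (>= 0 bits) is subtracted.
    """
    return acceptor_bits + donor_bits - gap_surprisal_bits

def relative_abundance(r_total_a, r_total_b):
    """Predicted abundance of isoform A relative to isoform B (fold)."""
    return 2.0 ** (r_total_a - r_total_b)

def rank_isoforms(r_totals):
    """Order isoform names from most to least abundant (largest R_i,total first)."""
    return sorted(r_totals, key=r_totals.get, reverse=True)

# Hypothetical values in bits: a donor site mutation weakens the natural exon,
# while a nearby cryptic donor defines a slightly longer exon.
wild_type = r_i_total(8.0, 9.0, 0.5)        # 16.5 bits
mutant_natural = r_i_total(8.0, 4.0, 0.5)   # 11.5 bits
cryptic = r_i_total(8.0, 6.0, 1.0)          # 13.0 bits

# A 5-bit loss corresponds to a 32-fold reduction of the natural isoform.
print(relative_abundance(mutant_natural, wild_type))  # 0.03125
print(rank_isoforms({"natural exon": mutant_natural, "cryptic exon": cryptic}))
# ['cryptic exon', 'natural exon']
```

The 1-bit (2-fold) and 5-bit (32-fold) thresholds used in the claims for leaky and effectively null alleles follow directly from `relative_abundance`.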
[0052] Gap surprisal is a penalty that depends on the length of the exon. To define the gap surprisal for a combination of splice sites, a table was constructed that relates the gap surprisal to the length of the exon. The whole genome was scanned, and the frequency of each exon length, and hence its probability of occurrence, was calculated.
[0053] According to Tribus (1961), the amount of self-information contained in a probabilistic event depends only on the probability of that event: the smaller its probability, the larger the self-information associated with receiving the information that the event indeed occurred. The self-information or surprisal $I(\omega_n)$ associated with outcome $\omega_n$ with probability $P(\omega_n)$ is:
$$I(\omega_n) = \log\left(1/P(\omega_n)\right) = -\log P(\omega_n)$$
[0054] Here, the base of the logarithm is not specified; if base 2 is used, $I(\omega_n)$ is in bits. This definition is used to derive the gap surprisal function. The self-information, or gap surprisal, g(L.sub.n), of observing a donor and acceptor site pair separated by L nucleotides is -log.sub.2(P(L.sub.n)) bits. The gap surprisal is defined as follows:
Gap surprisal = log.sub.2(1/probability of occurrence of the exon length).
[0055] This function signifies that the greater the distance between the donor and acceptor sites, the larger the gap surprisal (greater penalty) will be, resulting in a biological reduction in the occurrence of exons longer than the consensus length. The gap surprisal values for different exon lengths were calculated using the above formula.
[0056] The most frequent length was assigned a gap surprisal of zero, because splice sites separated by this distance have the highest likelihood of forming an exon. This length was 96 nucleotides (1901 occurrences among a total of 172250 occurrences). The frequency for this length was 1901/172250=0.011036. The gap surprisal for the most common, i.e., preferred, constitutive exon length is 6.59 bits. To normalize the gap surprisal terms for all other exon lengths to this value and eliminate the gap surprisal penalty for exons of 96 nucleotides, the penalties for all exon lengths were corrected by subtracting 6.59 bits from their respective gap surprisal values.
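The construction and normalization of the gap surprisal table can be sketched as below. This is an illustrative outline rather than the server's code: the function names and the toy list of exon lengths are hypothetical. The offset is taken from the data (the surprisal of the most frequent length) instead of being hard-coded, because evaluating -log2(1901/172250) with the counts quoted above gives roughly 6.50 bits rather than the quoted 6.59 bits.

```python
import math
from collections import Counter

def raw_surprisal(count, total):
    """Self-information, in bits, of a length observed `count` times out of `total`."""
    return -math.log2(count / total)

def gap_surprisal_table(exon_lengths):
    """Map each observed exon length to its normalized gap surprisal in bits.

    The most frequent length has the smallest raw surprisal; subtracting that
    value assigns it 0 bits and leaves every other length with a penalty >= 0.
    """
    counts = Counter(exon_lengths)
    total = sum(counts.values())
    raw = {length: raw_surprisal(c, total) for length, c in counts.items()}
    offset = min(raw.values())
    return {length: bits - offset for length, bits in raw.items()}

# Hypothetical toy sample: 96 nt is the most common length.
toy_lengths = [96] * 4 + [120] * 2 + [300] + [45]
table = gap_surprisal_table(toy_lengths)
print(table[96], table[120], table[300])  # 0.0 1.0 2.0

# Raw surprisal computed from the counts quoted in the text.
print(round(raw_surprisal(1901, 172250), 2))  # 6.5
```

Lengths that never occur in the sample have no entry in this table; how such lengths are scored (for example by a fitted polynomial) is outside the scope of this sketch.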
[0057] For some exons, the total information content of the acceptor, the donor, or both was found to be less than zero bits (most of these represent initial and terminal exons, as expected, since these do not contain both donor and acceptor splice sites). To successfully recognize the initial and terminal exons, a separate exon definition distribution was defined for these.
Gap Surprisals of First and Last Exons
[0058] Because the exon definition hypothesis cannot be applied to the first exon, for which no acceptor site is defined, or to the last exon, for which no donor site is defined, different gap surprisals were defined for selection of these exons. Separate gap surprisal tables were constructed for these exons by scanning RefSeq and identifying the frequencies of different lengths of first and last exons. It was observed that the most frequent length of the first exon was 158 nucleotides and that of the last exon was 232 nucleotides. Hence the minimum gap surprisal (0 bits) was assigned to a length of 158 for the first exon and a length of 232 for the last exon.
Populating the Annotation Database
[0059] The ASSEDA server is based on human genome reference sequence hg19 (GRCh37), GenBank and RefSeq cDNA accessions (downloaded from genome.ucsc.edu, July 2011), and SNP (dbSNP 135) tables. Genome-wide information weight matrices for automatically curated acceptor (n=108,079) and donor (n=111,772) splice sites (acceptor_genome and donor_genome, respectively; described in (Rogan et al., 2003)), were used in the R.sub.i,total calculation. The reference sequence was scanned with these matrices to determine the R.sub.i's of known natural splice sites and used to populate a MySQL database table (ALL_RI, modified from the all_mRNA.txt and the refSeqAli.txt from the UCSC genome browser).
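As an illustration of how a sequence can be scanned with an information weight matrix, the following Python sketch sums, for each window, the matrix weight of the base observed at each position. The three-position matrix and its values are hypothetical and are not the acceptor_genome or donor_genome matrices; the function names are likewise assumptions made for this example.

```python
def individual_information(weights, site):
    """Return R_i (bits) for one site: the sum of the weights of its bases."""
    if len(site) != len(weights):
        raise ValueError("site length must match the matrix length")
    return sum(column[base] for column, base in zip(weights, site))


def scan(weights, sequence):
    """Return (start, R_i) for every window of the sequence."""
    width = len(weights)
    return [
        (start, individual_information(weights, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical weights in bits, one dictionary per matrix position.
toy_weights = [
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -2.0},
    {"A": -2.0, "C": -1.0, "G": -1.0, "T": 1.5},
    {"A": 1.0, "C": -1.0, "G": 0.5, "T": -1.5},
]
hits = scan(toy_weights, "CAGGTAAG")
best = max(hits, key=lambda hit: hit[1])
print(best)
```

Windows with R.sub.i values at or below zero bits would not be expected to function as binding sites, which is why only positive-scoring positions are usually retained.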
[0060] The frequencies of different exon lengths occurring in the RefSeq database were determined for the gap surprisal calculation. Gap surprisals were normalized, based on the highest frequency distance separating splice sites of opposite polarity, which was assigned G(L.sub.s)=0 bits. Separate distributions were compiled, respectively, for first, internal, and last exons, and stored in separate database tables. The start and end positions of first and last exons were relaxed so that any coordinate within a 200 nt window was counted only once, in order to avoid duplication of exons in the gap surprisal calculation (this accounts for variation in the methods used to generate the cDNAs that are mapped onto the genomic sequence).
Incorporating Models of Splicing Regulatory Sequences into R.sub.i,total
[0061] The impact of mutations in ISSs or ESEs at SF2/ASF or SC35 binding sites on constitutive splicing can be predicted by selecting the option to incorporate this term into the R.sub.i,total computation (on the Advanced Options page). Information weight matrices, R.sub.i(b,l), for SF2/ASF, SC35, SRp40 (SRSF5), and SRp55 (SRSF6) were derived from previously published data (Liu et al., 1998; Liu et al., 2000; Smith et al., 2006), and supplemented by experimentally-validated binding sites curated from subsequent publications (sequence logos and weight matrices are available in
Description of Server
[0062] The ASSEDA server retains ASSA's capability to analyze changes in individual information content, but also predicts molecular phenotypes based on changes in R.sub.i,total. ASSEDA and ASSA use the same interface to input sequence variants: HUGO-approved gene symbols, HGVS mutation nomenclature, and dbSNP identifiers, sequence window range around the mutation coordinate, and selected weight matrices as input (
[0063] The window range is a primary determinant of the number of potential isoforms reported, since larger windows capture additional potential cryptic splice sites. The feasibility of exon formation is assessed from the R.sub.i,total values and by rule-based filters that ensure only likely isoforms are reported. These filters eliminate cryptic exons with misordered splice sites, overlapping donor and acceptor sites, internal exons less than 30 nt in length (Dominski and Kole, 1991), and predicted splice isoforms with <1% of exon inclusion relative to the mutated, natural exon strength (a difference in R.sub.i,total between the two isoforms of more than 6.65 bits). The server highlights isoforms with negligible expression when their R.sub.i,total values are at least 1 bit below the R.sub.i,total of the mutated exon. Tabular results can be sorted by column and are paginated, which is particularly helpful for mutations in which numerous cryptic exons are predicted. All rows with potentially expressed isoforms are uncolored, but the wild type exon is indicated in red. Splice isoforms that either cannot be expressed or are minor forms (<5% of the major expressed form) that would not be detectable experimentally are, by default, filtered out. Without filtering, rows containing non-functional or minimally expressed predicted isoforms are highlighted in distinct colors: (1) exons with misordered splice sites (light blue); (2) potential cryptic exons with lower R.sub.i,total values than the normal or mutated exon (1% predicted expression; pink); (3) isoforms that have both incorrect splice site order and low R.sub.i,total values (green). The minimum reportable R.sub.i,total value may also be selected using a horizontal sliding scale bar, which filters out potential exons below this threshold.
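The rule-based filters can be sketched in Python as follows. This is an illustrative reading of the rules, not the server's code: the CandidateExon type, the function name, and the plus-strand coordinate convention (exon length taken as donor minus acceptor plus one) are assumptions, while the 30 nt and 6.65 bit thresholds come from the text.

```python
from dataclasses import dataclass

MIN_INTERNAL_EXON_NT = 30  # minimum internal exon length stated in the text
MAX_DELTA_BITS = 6.65      # 2**-6.65 is about 0.01, i.e. 1% relative inclusion


@dataclass
class CandidateExon:
    acceptor: int    # assumed plus-strand coordinate of the acceptor site
    donor: int       # assumed plus-strand coordinate of the donor site
    ri_total: float  # total information content in bits


def is_reportable(exon, reference_ri_total):
    """Apply the filters: site order, minimum length, relative strength."""
    if exon.donor <= exon.acceptor:
        return False  # misordered or overlapping splice sites
    if exon.donor - exon.acceptor + 1 < MIN_INTERNAL_EXON_NT:
        return False  # internal exon shorter than 30 nt
    if reference_ri_total - exon.ri_total > MAX_DELTA_BITS:
        return False  # predicted at under 1% of the reference exon
    return True


reference = 14.0  # hypothetical R_i,total of the mutated natural exon
print(is_reportable(CandidateExon(100, 250, 12.0), reference))
print(is_reportable(CandidateExon(100, 250, 5.0), reference))
```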
[0064] The server draws a set of box glyphs (
[0065] The server also generates separate custom tracks of each isoform and uploads them to the UCSC genome browser, where they are displayed in the context of the exon containing the mutation as an embedded window within ASSEDA. Each isoform is spectrally color coded based on R.sub.i,total content.
Relative Abundance of Predicted Splice Isoforms
[0066] The server also displays pairwise differences in relative abundance for all predicted isoforms. The relative abundance or fold change in binding affinity of a single binding site is 2.sup.ΔRi, where ΔR.sub.i is the difference between the respective individual information contents of the wild type and mutant forms of the site (Schneider, 1997). We extend the idea of relative abundance from a single binding site to multiple binding sites by comparing their R.sub.i,total values. Suppose n and m are two alternative splice isoforms sharing at least one common splice site, with respective total information contents R.sub.i,total(n) and R.sub.i,total(m). If R.sub.i,total(n)>R.sub.i,total(m), then the relative abundance of n over m will be 2.sup.ΔRi,total(n-m), where ΔR.sub.i,total(n-m)=R.sub.i,total(n)-R.sub.i,total(m). Relative transcript abundance is displayed as a multidimensional graph (with scatterplot3d, an R package for visualization of three dimensional multivariate data). The graph shows predicted pairwise differences in exon abundance (Z axis) of the X axis isoform relative to the one on the Y axis, both before (left graph) and after mutation (right graph). The isoform designations correspond to those shown in the other molecular phenotype tabs.
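The pairwise calculation can be sketched in a few lines of Python. The isoform labels and R.sub.i,total values below are hypothetical, and the function names are chosen for this example only; the sketch assumes nothing beyond the stated relationship that the abundance of isoform n relative to isoform m is 2 raised to the difference of their R.sub.i,total values.

```python
def relative_abundance(ri_total_n, ri_total_m):
    """Fold abundance of isoform n relative to isoform m."""
    return 2.0 ** (ri_total_n - ri_total_m)


def pairwise_abundance(isoforms):
    """Return {(n, m): fold abundance of n over m} for all ordered pairs."""
    return {
        (n, m): relative_abundance(isoforms[n], isoforms[m])
        for n in isoforms
        for m in isoforms
        if n != m
    }


# Hypothetical R_i,total values in bits.
toy_isoforms = {"natural": 10.0, "cryptic": 7.0, "skipped": 6.0}
pairs = pairwise_abundance(toy_isoforms)
print(pairs[("natural", "cryptic")])
```

A table of this kind, computed once for the wild type sequence and once for the mutated sequence, corresponds to the two graphs described above.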
[0067] In order that the manner in which the recited and non-recited advantages and objects of the invention are obtained may be understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the drawings. It is to be understood that these drawings depict only typical embodiments of the invention and are not therefore to be considered limiting of its scope; the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings.
[0068] A brief description of the drawings is provided below to provide additional specificity and detail.
[0082] The following examples are provided for purposes of illustration of embodiments of the present disclosure only and are not intended to be limiting. The reagents, chemicals, instruments and other materials are presented as exemplary components or reagents, and various modifications may be made in view of the foregoing discussion within the scope of this disclosure. Unless otherwise specified in this disclosure, components, reagents, protocol, and other methods used in the disclosure, as described in the Examples, are for the purpose of illustration only.
EXAMPLE 1
Exon Definition by Information Analysis of Functional Exons
[0083] Gap surprisal values of all exon lengths were determined from their respective frequencies in the exome of all RefSeq genes. The gap surprisal penalty was then normalized so that the most common internal exon length (96 nt; 1,901 of n=172,250 exons) was zero bits, by subtracting a constant value of 6.59 bits (its log.sub.2 frequency). Less frequent exon lengths were scaled to this value by subtracting this constant from their respective gap surprisal values. First and terminal exons are, respectively, missing either an acceptor or a donor splice site, and exhibit a broader range of exon lengths. Separate gap surprisal distributions were computed for these exons. The most frequent first and last exons were, respectively, 158 (n=23,471) and 232 (n=21,261) nt in length, corresponding to gap surprisals of 7.8 and 9.4 bits, respectively. R.sub.i,total values were >0 bits for 98.9% of internal exons, 95.3% of first exons, and 93.1% of last exons (
EXAMPLE 2
Interpretation of Splicing Mutations by Exon Definition Analysis
[0084] To assess whether the proposed model of exon definition produced results consistent with observed mutant spliced products, we evaluated a series of reported splicing mutations for which end-point (
[0085] Initially, 20 potential isoforms are found for this mutation, of which those with the highest R.sub.i,total values and the affected natural exon are indicated (
[0086] The structures and lengths of each potential isoform (natural, cryptic, skipped) are also displayed in a separate tab (
EXAMPLE 3
Impact of ESE/ISS Elements
[0087] Elements recognized by the splicing regulatory proteins SF2/ASF, SC35, SRp40, SRp55, and hnRNP-H (HNRNPH1) can now be analyzed with ASSEDA; however, these matrices are based on many fewer sites (usually <50), and the R.sub.i values may not be as accurate as those of constitutive splice sites, especially at the low end of the distribution. The server computes R.sub.i values of any of these individual sites and can incorporate mutations at either SF2/ASF or SC35 sites into the R.sub.i,total computation. Since a mutation can affect multiple predicted sites, the site with the highest R.sub.i value altered by the mutation is analyzed, unless a second cryptic site is strengthened, resulting in a final R.sub.i exceeding that of the original binding site.
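One possible reading of this selection rule is sketched below in Python. The tuple layout, the function name, and the choice to compare a strengthened cryptic site against the pre-mutation R.sub.i of the original site are assumptions made for illustration; the text does not specify these details.

```python
def select_regulatory_site(sites):
    """Pick the regulatory site to carry into the R_i,total calculation.

    sites: list of (position, ri_before, ri_after) tuples in bits.
    """
    # Only sites whose strength is changed by the mutation are considered.
    altered = [s for s in sites if s[1] != s[2]]
    if not altered:
        return None
    # Default choice: the altered site that was strongest before mutation.
    original = max(altered, key=lambda s: s[1])
    # Exception: another site is strengthened beyond the original site.
    rivals = [
        s for s in altered
        if s is not original and s[2] > s[1] and s[2] > original[1]
    ]
    return max(rivals, key=lambda s: s[2]) if rivals else original


# Hypothetical sites: (position, R_i before, R_i after).
weakened_only = [(100, 4.0, 1.0), (103, 2.0, 3.0)]
cryptic_wins = [(100, 4.0, 1.0), (103, 2.0, 6.5)]
print(select_regulatory_site(weakened_only))
print(select_regulatory_site(cryptic_wins))
```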
[0088] A second gap surprisal function, based on the distances between known natural constitutive sites and the closest predicted splicing regulatory site of the same type, was also applied in the R.sub.i,total calculation. Exonic (ESE) and intronic (ISS) elements have independent gap surprisal distributions (
[0089] To assess the effect of including SC35 and SF2/ASF sites in the exon definition model, we evaluated 12 reported mutations/variants in either SF2/ASF or SC35 sites that were reported to affect splicing at adjacent splice sites (
EXAMPLE 4
Analysis of Normally Spliced Large (>1000 nt) Exons
[0090] The exon definition models imply that rare exons (regardless of length) will have large gap surprisal penalties. This is supported by the fact that, for exons beyond a few hundred nucleotides, the penalty function increases with length until it reaches an asymptote at exon lengths present once in the genome. The significant gap surprisal penalties for long exons raise the question of how well the model performs at the extreme lengths in correctly distinguishing natural from decoy exons. The model fails if the contribution of the gap surprisal term exceeds the R.sub.i values of both natural splice sites. In fact, this is generally not the case.
[0091] To assess the ability of the server to predict naturally occurring large exons, 8 large internal exons in genes BRCA1-ex11, BRCA2-ex11, TTN-ex253, JARID2-ex7, KLHL31-ex2, C6orf142-ex4 (MLIP), VCAN-ex8 and C17orf53-ex3 were evaluated using ASSEDA (
EXAMPLE 5
Generation of Information Theory-based Models of mRNA Splicing Regulatory Proteins
[0092] Successful implementation of the information theory-based exon definition model is dependent on the quality of the data used to create the information weight matrices that locate and define the strengths of binding sites. Splice junctions are precisely defined and experimentally validated.
[0093] CLIP-seq libraries for hnRNP A1 (Huelga et al., 2012) and other splicing regulatory binding sites were used to derive information theory-based position weight matrices (PWMs). CLIP-seq libraries were generated by methods that chemically link an RNA binding protein to its cognate binding sites throughout the transcriptome, followed by antibody pull down of the protein crosslinked to these binding sites, conversion of RNA to cDNA in vitro, preparation of libraries of many binding sites, and finally high throughput DNA sequencing of the libraries. PoWeMaGen software, which uses Bipad (Bi and Rogan, 2004) to generate minimum entropy alignments, generates a series of potential binding site models over a range of input parameters. To avoid phasing the alignment on natural splice sites instead of adjacent hnRNP A1 binding sites, models were built from shorter sequences, ranging in length from 18-25 nt. The optimal model was determined by maximizing incremental information while varying binding site length (6-10 nt), number of Monte Carlo cycles (250-5000), and allowing either zero or one site per sequence (ZOOPS) or only one site per sequence (OOPS). The model with the highest average information used a maximum fragment length of 18 nt, 1000 Monte Carlo cycles, OOPS, and a single block binding site length of 6 nt.
[0094] CLIP-seq data were used to compute PWMs for the following RNA binding proteins that participate in the mRNA splicing reaction and/or in exon definition:
TIA1
Ri(b,l) length of PWM: 12 nt
[0095] Monte Carlo cycles: 1000
ZOOPS (Zero Or One site Per Sequence): On
Source:
[0096] Wang Z, Kayikci M, Briese M, Zarnack K, Luscombe N M, Rot G, Zupan B, Curk T, Ule J. iCLIP predicts the dual splicing effects of TIA-RNA interactions. PLoS Biol. 2010 Oct. 26; 8(10):e1000530
PTB
Ri(b,l) length: 6 nt, 10 nt
[0097] Monte Carlo cycles: 250, 1000
ZOOPS: On, On
Source:
[0098] Xue Y, Ouyang K, Huang J, Zhou Y, Ouyang H, Li H, Wang G, Wu Q, Wei C, Bi Y, Jiang L, Cai Z, Sun H, Zhang K, Zhang Y, Chen J, Fu X D. Direct conversion of fibroblasts to neurons by reprogramming PTB-regulated microRNA circuits. Cell. 2013 Jan. 17; 152(1-2):82-96.
HuR
Ri(b,l) length: 7 nt
[0099] Monte Carlo cycles: 250
ZOOPS: Off (an Ri(b,l) built with ZOOPS on is also available, but is very similar)
Source:
[0100] Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, Zavolan M. A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nat Methods. 2011 May 15; 8(7):559-64.
[0101] Each model or PWM was validated with a set of independently published binding sites and, if available, mutations in those binding sites. As an example, validation of hnRNP A1 binding sites and mutations is presented; however, the same approach was used for the other PWMs. A coding sequence mutation in the ETFDH gene, c.158A>G, creates a 5.9 bit hnRNP A1 site and increases exon skipping. See Olsen et al. (2014). BRCA2 mutation c.8165C>G similarly increases skipping and is predicted to create a 6.2 bit site (Liede et al., 2002). In contrast, the variant c.1161A>G in ACADM decreases skipping of exon 11 by reducing the strength of an hnRNP A1 site (6.1 to 1.4 bits). The model also predicted the existence of two strong hnRNP A1 binding sites in a region of ATM shown to bind to the splicing regulator (Pastor and Pagani, 2011).
[0102] The effects of mutations at hnRNP A1 sites on exon definition were determined from the total information content (R.sub.i,total), by incorporating changes in the strengths of these sites, corrected for the gap surprisal, which represents the distance between the hnRNP A1 site and the natural splice site. Gap surprisal values were determined by scanning the genome for hnRNP A1 sites with the PWM, and then determining the frequency of each interval length between known natural sites and the nearest hnRNP A1 site, separately for exons and introns. Differences between the natural and mutated exon R.sub.i,total values correspond to changes in the abundance of the respective isoforms, and can predict exon skipping. The calculation is carried out by the Automated Splice Site and Exon Definition Analysis Server (ASSEDA; http://splice.uwo.ca); See Mucaki et al. Prediction of Mutant mRNA Splice Isoforms by Information Theory-Based Exon Definition. Hum Mutat. 34:557-65 (2013), which is hereby incorporated by reference into this disclosure. Exon definition analysis in ASSEDA was validated for a set of mutations that affect hnRNP A1 binding site strength. BRCA2 variant c.8165C>G decreases the R.sub.i,total from 13.5 to 3.2 bits and results in exon skipping. ACADM variant c.1161A>G, which reduces exon skipping, increases the R.sub.i,total from 18.5 to 20.1 bits.
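As a worked illustration of the two validated variants, the following Python sketch converts the quoted R.sub.i,total values into a predicted fold change. The function name is hypothetical, and applying 2 raised to the R.sub.i,total difference to a wild type versus mutant comparison is an assumption based on the relative abundance relationship stated elsewhere in this disclosure.

```python
# R_i,total values (bits) before and after mutation, as quoted in the text.
variants = {
    "BRCA2 c.8165C>G": (13.5, 3.2),
    "ACADM c.1161A>G": (18.5, 20.1),
}


def fold_change(before, after):
    """Predicted fold change in exon abundance: 2**(after - before)."""
    return 2.0 ** (after - before)


for name, (before, after) in variants.items():
    delta = after - before
    direction = "more inclusion" if delta > 0 else "more skipping"
    print(f"{name}: {delta:+.1f} bits, fold change {fold_change(before, after):.3g} ({direction})")
```

Under these assumptions, the 10.3 bit loss for the BRCA2 variant corresponds to a reduction of more than 1,000-fold in the predicted abundance of the natural exon, while the 1.6 bit gain for the ACADM variant corresponds to an increase of about 3-fold.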
[0103] Table 1 summarizes the validation results for models derived from CLIP-seq data, obtained by evaluating published, peer reviewed binding sites in individual genes.
TABLE-US-00001 TABLE 1 Summary of validation results
RNA binding protein    Binding sites validated
9G8                    1 of 4
TIA1                   7 of 7
PTB                    4 of 4
HuR                    6 of 6
hnRNPA1                3 of 3
hnRNPC                 3 of 4*
hnRNP A2/B1            0 of 1
hnRNP F                1 of 2
hnRNP U                1 of 1
[0104] Validation of the model is measured by the success rate of binding site models in predicting published binding sites in the sequence interval described in the literature publication (successfully detected sites vs total number of binding sites tested). The exact location of the binding site was not always known from the publication, and in those cases, we sought to detect the strongest sites with the highest Ri values within that region, as described below. The results of optimal model construction include sequence logos and Ri(b,l) matrices, and links to the papers reporting the binding sites, among others.
[0105] Based on these validation results, the PTB and hnRNP A1 models have been qualified for mutation analysis. The information contents generated from these PWMs are completely concordant with the published results for all known binding sites, and their motifs (as depicted by the corresponding sequence logos) have a distinct, complex pattern.
[0106] The TIA1, HuR and hnRNP C model validation was also quite successful, but these PWMs consist of low complexity, T-rich motifs (based on the DNA sequence; in the RNA to which the protein binds, these are uridines) that have lower specificity than the PTB and hnRNP A1 binding sites. For TIA1 and HuR, this pyrimidine-rich region is where binding is expected. There have been concerns that these models will positively identify a binding site in nearly any poly-T rich region. As an example, one can refer to the HuR model, in which almost all information is derived from poly-T.
[0107] A summary of data on RNA binding protein motifs involved in mRNA splicing, obtained by entropy minimization of CLIP-seq data, is provided in the following text.
[0108] TIA1/TIAL1
[0109] TIA-1 promotes U1 snRNP binding to the 5′ splice site of intron 6 of FAS. Exonic TIA-1 binding to uridine-rich sequences mediates repression by PTB at the acceptor (3′) site, promoting exon skipping (José María Izquierdo, Nuria Majós, Sophie Bonnal, Concepción Martínez, Robert Castelo, Roderic Guigó, Daniel Bilbao, Juan Valcárcel, Regulation of Fas Alternative Splicing by Antagonistic Effects of TIA-1 and PTB on Exon Definition, Molecular Cell, Volume 19, Issue 4, 19 Aug. 2005, Pages 475-484). This model correctly recognizes the exon 3 terminus at position 573, with a 3.2 bit site at 576, a 4.9 bit site at 596, and a 3-4 bit cluster from 600-602.
[0110] The RNA-binding protein TIA-1 preferentially enhances the use of 5′ splice sites linked to IAS1 (for example, the alternative K-SAM exon in the FGFR2 gene), which are then activated by overexpression of TIA1. See Del Gatto-Konczak F, Bourgeois C F, Le Guiner C, Kister L, Gesnel M C, Stévenin J, Breathnach R. The RNA-binding protein TIA-1 is a novel mammalian splicing regulator acting through intron sequences adjacent to a 5′ splice site. Mol Cell Biol. 2000; 20(17):6287-99.
[0111] Approximately 20 nucleotides beyond the end of the K-SAM exon, information analysis predicts a large cluster of strong binding sites (chromosome 10:123278160-123278310), associated with a long poly-T/poly-A tract. This result is consistent with the well described property of TIA-1 binding to poly-AU-rich domains of RNA.
TABLE-US-00002
Chr. Coord.    Ri value
123278167      5.669410
123278168      10.217979
123278169      2.813830
123278170      5.144820
123278171      4.534150
123278172      8.654270
123278173      1.410610
123278177      4.872140
123278178      1.938000
123278179      5.716410
[0112] In the SMN2 gene, exon 7 inclusion is regulated by TIA-1 interacting with the U1 snRNP. See N. Singh and R. Singh, Alternative splicing in spinal muscular atrophy underscores the role of an intron definition model, RNA Biol. 2011 July-August; 8(4): 600-606. There are two validated TIA-1 sites within the interval (chr5:69,372,420-69,372,490).
TABLE-US-00003
Chr. Coord.    Ri value
69372436       6.438010
69372437       1.917100
69372438       3.805560
69372439       4.751070
69372441       2.209620
69372456       2.445030
69372463       3.158220
69372466       2.991800
69372469       1.997720
69372472       4.344520
69372473       3.055380
69372474       4.637970
69372475       9.499431
69372477       2.657180
69372480       1.036970
69372482       6.704550
69372483       1.218490
69372490       2.263090
[0113] In all 3 instances of valid binding sites in SMN2, a site was found (bolded). The sites exceed 5 bits. Interestingly, the 9.5 bit site is in a region where a binding site is expected based on experimental data but has not been localized (described as ELEMENT 2 in the publication).
[0114] In summary, the TIA-1 model detected strong sites, but weak false positives were also present, as a result of the promiscuity of A/T rich regions being flagged. In order to eliminate false positive binding sites, the TIA1 model is preferably used in combination with a second motif for a distinct RNA binding protein with which TIA1 is known to interact, for example, PTB. The combined motif could be computed as an R.sub.i,total value, based on the strength of each site and the gap surprisal distribution which relates both sites.
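A combined two-site score of this kind might be sketched as follows in Python. The function name, the toy gap surprisal table, and the sign convention (the normalized gap surprisal is treated as a penalty that is subtracted) are assumptions made for illustration and are not stated in this form in the text.

```python
def combined_information(ri_first, ri_second, distance, gap_surprisal):
    """Return a combined R_i,total (bits) for two neighboring binding sites.

    gap_surprisal maps an inter-site distance (nt) to a normalized penalty
    in bits, where the most frequent distance has a penalty of zero.
    """
    penalty = gap_surprisal[distance]
    return ri_first + ri_second - penalty


# Hypothetical gap surprisal table: 10 nt is the most frequent spacing.
toy_gap_surprisal = {10: 0.0, 50: 2.5}

near_pair = combined_information(5.0, 4.0, 10, toy_gap_surprisal)
far_pair = combined_information(5.0, 4.0, 50, toy_gap_surprisal)
print(near_pair, far_pair)
```

With such a score, a TIA1 site that lacks a suitably spaced partner site would rank below one that has a partner at a commonly observed distance, which is the intended effect of combining the motifs.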
[0115] The hnRNP C model is quite accurate: it confirmed 3 of 4 published binding sites, all from papers that demonstrated binding within a 20-70 nt long region, none of which described the precise location of the binding sites. The one that failed was the only one that involved a mutation, which supposedly abolished an hnRNP C site that was not detected with either of the hnRNP C models developed.
[0116] Models for both hnRNP F and hnRNP U result in high bit values for natural splice sites (both donors and acceptors). The CAG pattern in the sequence logo is quite obvious. The possibility cannot be eliminated that the entropy minimization is biasing toward more conserved natural sites, which contaminate these sequences due to their proximity to the hnRNP sites. Furthermore, hnRNP F binding sites are known to have a GGG motif, which is absent from any model built from the hnRNP F data.
[0117] Hu proteins inhibit splicing by binding to intronic recognition sequences adjacent to exon 23a of NF1 (HuB, HuC, and HuD) and adjacent TIA1 sites promote recognition of the donor splice site by U1 snRNP. See Zhu, et al. Mol Cell Biol. 2008 February; 28(4): 1240-1251. Within chr17:29,579,900-29,580,100, TIA-1 sites are present at:
TABLE-US-00004
Chr. Coord.    Ri value (bits)
29580015       3.791960
29580029       7.952610
[0118] A series of Hu protein binding sites has been predicted at a weak donor site in the PLOD2 gene (chromosome 3:145,795,600-145,795,750). See Yeowell, Heather N, Walker, Linda C, Mauger, David M, Seth, Puneet, Garcia-Blanco, Mariano A. TIA Nuclear Proteins Regulate the Alternate Splicing of Lysyl Hydroxylase 2, Journal of Investigative Dermatology (2009) 129, 1402-1411.
TABLE-US-00005
Chr. Coord.    Ri value (in bits)
145795604      6.539410
145795605      2.437480
145795607      5.573260
145795609      4.282010
145795610      3.696390
145795611      6.333310
145795612      0.722530
145795613      8.514270
145795614      6.387630
145795615      6.179630
145795616      7.204071
145795617      8.928380
145795618      0.453510
145795619      7.776460
145795620      4.122941
145795621      4.207820
145795622      9.756490
145795624      5.764780
145795625      3.915710
145795626      6.074350
145795627      0.233480
145795628      6.985560
145795629      2.751471
145795630      7.838311
145795631      8.452850
145795632      10.973180
145795633      7.993841
145795634      6.453230
145795635      7.710070
145795636      1.090840
145795638      3.965630
145795640      9.942340
145795641      8.432720
145795642      4.729580
145795643      2.373280
145795644      3.849880
145795645      5.682571
[0119] PTB. Two different models were computed for PTB, which differ only in the length of the binding sites. The 6SB model is preferred based on published studies on PTB. However, the 6SB model may truncate the site, which is one of the reasons why the 10SB model was also derived.
[0120] As described previously by Izquierdo et al. (2005), PTB represses inclusion of exon 6 in FAS, which was described for TIA1 (although the PTB site is in exon 6). The PTB binding sites lie within the interval chromosome 10:90,770,450-90,770,649. With the 6SB model, several potential binding sites were detected in this interval (the strongest sites are bolded).
TABLE-US-00006
Chr. Coord.    Ri value (bits)
90770505       1.103880
90770512       3.856850
90770517       1.824200
90770535       4.674070
90770543       4.955421
90770556       3.293820
90770564       3.055950
90770578       0.367950
90770582       3.384770
90770589       1.924930
[0121] The two strongest predicted binding sites contain the URE6 element described in the publication, and contain the PTB consensus sequence, UCUU. Using the 10SB model, the corresponding sites are 2.94 and 1.13 bits, respectively, while the site at 90770556 is strengthened from 3.3 to 4.5 bits.
[0122] PTB binding to the CHRNA1 gene has also been reported in the region, chromosome 2: 175622750-17562290 (Rahman M A, Masuda A, Ohe K, Ito M, Hutchinson D O, Mayeda A, Engel A G, Ohno K. HnRNP L and hnRNP LL antagonistically modulate PTB-mediated splicing suppression of CHRNA1 pre-mRNA. Sci Rep. 2013 Oct. 14; 3:2931.). The 7.3 bit site at position 175622764 is described in the publication (Bian Y, Masuda A, Matsuura T, Ito M, Okushin K, Engel A G, Ohno K. Tannic acid facilitates expression of the polypyrimidine tract binding protein and alleviates deleterious inclusion of CHRNA1 exon P3A due to an hnRNP H-disrupting mutation in congenital myasthenic syndrome. Hum Mol Genet. 2009 Apr. 1; 18(7):1229-37). However, the present disclosure provides a 5.8 bit site close to the branch point.
[0123] PTB also binds to both ends of exon 9 of the gene CAPZB (http://rnajournal.cshlp.org/content/19/5/627.long). Downstream of the exon near position 19669210, there is a 3.7 bit site (with the 10 nt long Ri(b,l); 2.2 bits with the 6SB model) situated between two ACUAA elements, which are recognized by the RNA binding protein Quaking. No other predicted sites exist in this region. Upstream of the exon around position 19669400, the published study is less precise about the location of the PTB site. The model of the instant disclosure predicted several potential sites in this region, including a 6.7 bit site 40 nt downstream of the exon and a 4.4 bit site 10 nt downstream.
[0124] HuR/ELAVL1
[0125] HuR (or ELAVL1) regulates inclusion of an exon in the FAS gene, though there is evidence to suggest it is interacting with URE6. HuR is predicted to bind at several locations across exon 6 and upstream in intron 5 (Izquierdo J M. Hu antigen R (HuR) functions as an alternative pre-mRNA splicing regulator of Fas apoptosis-promoting receptor on exon definition. J Biol Chem. 2008 Jul. 4; 283(27):19077-84). The region upstream of the exon (chr10:90,770,450-90,770,649) has a cluster of strong HuR binding sites:
TABLE-US-00007
Chr. Coord     Ri value (in bits)
90770471       6.351841
90770472       8.330290
90770475       7.383730
90770477       5.040200
[0126] Within the exon, there is only a single cluster of strong binding sites, which coincides with the location of the URE6 element, as indicated in the article:
TABLE-US-00008
Chr. Coord     Ri value (in bits)
90770535       3.071350
90770538       4.882600
90770541       4.882600
90770542       2.393560
90770543       9.590730
[0127] HuR exhibits documented binding to the ATM gene. However, binding did not impact the mRNA splicing profile of this gene (http://www.ncbi.nlm.nih.gov/pubmed/21858080). There are 9 consecutive thymine residues, which results in a set of strong binding sites, corresponding to the interval described in the paper (80 nucleotides in length).
TABLE-US-00009
Chr. Coord     Ri value (in bits)
108141430      3.633660
108141431      7.772871
108141432      12.418920
108141433      12.418920
108141434      12.418920
108141435      2.882740
[0128] In Zhu et al. Mol Cell Biol. 2008 February; 28(4): 1240-1251 (cited previously for TIA-1), the authors indicate that multiple Hu proteins bind to exon 23a of NF1. Our HuR model predicts a number of candidate binding sites in this region.
TABLE-US-00010
Chr. Coord.    Ri (in bits)
29579831       2.263210
29579832       4.191080
29579833       3.633660
29579834       7.772871
29579835       2.882740
29579836       0.863631
29579837       7.102510
[0129] In the publication, the TIA1 site is described as adjacent to a Hu binding site downstream of the exon. 9.3 and 5.5 bit HuR binding sites were found (at pos. 29580034-35) immediately upstream and one 7.0 bit HuR site at pos. 29580047 downstream of the TIA1 site.
[0130] hnRNP A1
[0131] The following study shows that hnRNP A1 regulates splicing of the ATM gene (Pastor T, Pagani F. Interaction of hnRNPA1/A2 and DAZAP1 with an Alu-derived intronic splicing enhancer regulates ATM aberrant splicing. PLoS One. 2011; 6(8):e23349) and binds within a 35 nucleotide interval circumscribing position 108141450.
TABLE-US-00011
Chr. Coord     Ri value (in bits)
108141439      5.652870
108141457      1.664050
108141469      4.653870
[0132] A sequence variant creates an hnRNP A1 site within ETFDH (also HNRNP A2/B1 and H). See Olsen et al. (2014).
[0133] This exonic variant at 159601742 was analyzed by information analysis to assess the predicted change in hnRNP A1 site strength. This exon itself is non-constitutive, and it is predicted that this variant increases the hnRNP A1 splicing suppressor strength, thereby increasing exon skipping (hnRNP A1 site at pos. 159601740, with R.sub.i,initial=-11.16->R.sub.i,final=5.94 bits).
[0134] In addition, a weak hnRNP H binding site is created (0.62 bits at pos. 15961742), and another pre-existing site is strengthened (3.79->4.03 bits at pos. 15960173). A preexisting 6.9 bit site 17 nt downstream of the 4.0 bit site was also observed.
[0135] Analysis of this mutation with the hnRNP A2/B1 exon silencer model below did not detect any overlapping or novel binding sites.
[0136] In cases where a weak regulatory site overlaps a stronger site, proteins capable of binding to the weak site are likely to be displaced by the protein with the higher affinity site (stronger site). This scenario dramatically simplifies the analysis of these complex events, because when multiple binding sites are altered by a mutation, the exon definition calculation can effectively ignore the weak binding sites. Changes to total information content from effects on multiple binding sites can be reduced to fewer terms when the overlapping binding sites from different proteins have significant differences in overall binding affinity, namely, information content.
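The simplification described above can be illustrated with a short Python sketch that keeps only the strongest site in each group of overlapping sites. The greedy strongest-first strategy, the dictionary layout, and the example coordinates and R.sub.i values are assumptions made for illustration; the text states only that weak overlapping sites can effectively be ignored.

```python
def dominant_sites(sites):
    """Keep the strongest site wherever predicted sites overlap.

    sites: list of dicts with 'protein', 'start', 'end' (inclusive), 'ri'.
    """
    kept = []
    # Visit sites from strongest to weakest; a site is dropped when it
    # overlaps a stronger site that has already been kept.
    for site in sorted(sites, key=lambda s: s["ri"], reverse=True):
        overlaps = any(
            site["start"] <= k["end"] and k["start"] <= site["end"]
            for k in kept
        )
        if not overlaps:
            kept.append(site)
    return sorted(kept, key=lambda s: s["start"])


# Hypothetical predicted sites around a variant.
toy_sites = [
    {"protein": "hnRNP A1", "start": 100, "end": 105, "ri": 5.9},
    {"protein": "hnRNP H", "start": 103, "end": 108, "ri": 0.6},
    {"protein": "SRSF1", "start": 120, "end": 126, "ri": 3.1},
]
retained = [s["protein"] for s in dominant_sites(toy_sites)]
print(retained)
```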
[0137] hnRNP A2B1
[0138] A different variant in another gene was found to alter the strengths of splicing regulatory sequences, bound by SRSF1 and hnRNP A1, in an alternative exon of the ACADM gene (Bruun G H, Doktor T K, Andresen B S. A synonymous polymorphic variation in ACADM exon 11 affects splicing efficiency and may affect fatty acid oxidation. Mol. Genet Metab. 2013 September-October; 110(1-2):122-8). c.1161A>G improves exon 11 inclusion in ACADM. The A form has been experimentally shown to increase hnRNP A1 binding, whereas the G allele binds SRSF1 (SF2/ASF) with higher affinity. Our predictions follow the experimental results precisely (hnRNP A1 at coordinate 76227021 is reduced in strength, 6.12->1.37 bits, and SRSF1 (SF2/ASF) is increased, -3.08->2.77 bits).
[0139] The gap surprisal distributions for ELAVL1-PTB-TIA1-hnRNPH are shown in
EXAMPLE 6
Failing Binding Site Models as a Result of Data Insufficiency or Bias in the Source Data
[0140] (A) Data insufficiency. Other sources of data were tested to construct information theory-based models. In particular, models were derived from the SpliceAid-F database (Giulietti et al. SpliceAid-F: a database of human splicing factors and their RNA-binding sites. Nucl. Acids Res. 41(D1):D125-D13). In contrast with the CLIP-Seq datasets, this database has been manually curated from published sites of 71 different RNA binding proteins. In order to ensure that the individual information contents of binding sites were distinguishable, models were developed only for proteins for which >20 binding sites had been ascertained. However, PoWeMaGen disqualified a substantial number of motifs derived from this data source (because these sites had negative R.sub.i values and, according to theory, should not be capable of binding protein), resulting in models built from 10-15 sites, which led to large confidence intervals in R.sub.i values. The elimination of some of the sites during analysis may therefore lead to models that are based on too few sites and have questionable accuracy. After disqualifying these models, only PWMs based on hnRNP D and hnRNP I remained. The hnRNP D model is a low complexity binding site that lacks specificity in long polyT-rich regions, resulting in a series of consecutive positive R.sub.i values for predicted adjacent binding sites. Interestingly, the same publications would frequently describe HuR, another polyT binding protein, as binding at these sites as well. The hnRNP I model derived by entropy minimization-based alignment had low sensitivity, failing to detect known binding sites in about 50% of cases, and those sites it did correctly predict were usually quite weak, i.e. <3 bits.
[0141] (B) Sequence bias in the dataset. A CLIP-Seq based SRSF1 model (i.e. ASF/SF2) failed to predict the effect of a G to C substitution in a known SRSF1 binding site (Guo et al. 2013, reference follows). Although it had accurately predicted the presence of 4 sites described in 3 other publications, the particular G to C mutation, which was shown to significantly decrease SRSF1 binding in a laboratory pulldown experiment, was predicted to have the opposite effect, namely, to strengthen the site. The previous SRSF1 model on ASSEDA (Mucaki et al. 2013) correctly predicted that the mutation abolished the site, but the site in the unmutated reference gene sequence was predicted to be weak (1.2 bits). This suggests that the underlying data used to create the CLIP-Seq based information model are biased towards certain motifs and do not comprehensively cover the genome-wide distribution of SRSF1 binding sites. This paper also contained a mutation which abolished an hnRNP A1 site, which was predicted correctly by the CLIP-Seq based hnRNP A1 model (5.1->11.2 bits). See Guo R, Li Y, Ning J, Sun D, Lin L, Liu X. HnRNP A1/A2 and SF2/ASF regulate alternative splicing of interferon regulatory factor-3 and affect immunomodulatory functions in human non-small cell lung cancer cells. PLoS One. 2013 Apr. 29; 8(4):e62729.
EXAMPLE 7
Application of R.sub.i,total to Splicing Regulation: Experimental Validation of BRCA1 and BRCA2 Gene Mutations Predicted by Exon Definition Analysis
[0142] Numerous unclassified variants (UVs) have been identified in splicing regions of disease-associated genes and their characterization as pathogenic mutations or benign polymorphisms is crucial for the understanding of their role in disease development. The number of these alterations has increased considerably as a consequence of next generation sequencing analyses and confounds distinction of disease variants.
[0143] The aim of the present study was to assess the splice isoforms predicted by ASSEDA through qPCR-based analyses. Where mRNA was available, we compared cryptic isoforms computed by exon definition analysis and their predicted abundance to results from semi-quantitative RT-PCR and quantitative RT-PCR studies. Twenty-four UVs in BRCA genes were previously characterized by conventional end-point Reverse Transcriptase-PCR (RT-PCR) [1]. Nineteen splicing mutations and 5 non-spliceogenic base changes were observed. All variants were re-evaluated using ASSEDA (http://ossify.sg.csd.uwo.ca), and the predicted isoforms were annotated (Table 2). The value of the Window Range (i.e., the region before and after the mutated base within which the information content of sites is calculated) was set to 450 nt.
[0144] The qPCR assays were performed using the KAPA SYBR FAST Universal qPCR kit (KAPA BIOSYSTEMS) and examined on an Eco Real-Time PCR System (Illumina). The level of expression of each isoform was measured relative to the level of expression of the same isoform in a reference sample. In addition, the level of expression of each isoform considered in the assay was normalized to the expression of CCDC137, used as a reference gene. For each assay, uniform length amplicons were generated from reverse transcripts using isoform-specific splice junction primers. For BRCA1 c.4987-1G>A, the normal transcript, the exon17 isoform and the transcript derived from the partial retention of intron 16 (187 bp at the 3'-end) were analyzed. For BRCA1 c.5278-2delA, the normal transcript, the exon21 isoform and the transcripts derived from the partial skipping of exon 21 (8 bp at the 5'-end) and the partial retention of intron 20 (51 bp at the 3'-end) were verified. In both analyses, a fragment spanning the BRCA1 exon 8-9 junction was generated to serve as an internal reference.
[0145] ASSEDA detected all splicing mutations (n=19) and 9 of 11 cryptic isoforms observed in UV carriers (Table 1). Non-spliceogenic variants (n=5) did not exhibit significant changes in exon information. Cryptic isoforms of lower abundance not seen in previous analyses were also predicted (between 0 and 4 transcripts per mutation). Verification of these predictions by qPCR is currently ongoing. At present, the BRCA1 c.4987-1G>A and c.5278-2delA mutations have been analyzed. The full-length and the exon17 isoforms for the BRCA1 c.4987-1G>A mutation, and the full-length, the exon21 and the exon21q isoforms for the c.5278-2delA mutation, were confirmed. However, additional low abundance isoforms predicted by ASSEDA were not observed in qPCR experiments, as expected.
[0146] Based on these results, it is concluded that information theory-based exon definition comprehensively detects the repertoire of mutant isoforms experimentally verified by end-point RT-PCR in carriers of the investigated UVs. Preliminary results show that qPCR analyses can determine which of the many potential intronic cryptic splice sites predicted by ASSEDA are potentially relevant and which ones can be dismissed as irrelevant to pathogenicity.
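The double normalization described above (to a reference gene, then to the same isoform in a reference sample) resembles the widely used comparative threshold-cycle calculation. The sketch below is an assumption about how such a ratio could be computed, not a statement of the procedure actually used in this study: the function name, the Ct values and the assumption of 100% amplification efficiency are all hypothetical.

```python
def relative_expression(ct_isoform_sample, ct_refgene_sample,
                        ct_isoform_control, ct_refgene_control):
    """Comparative Ct estimate of isoform expression, sample vs. control.

    Assumes perfect doubling per cycle (100% efficiency), which is an
    idealization made for this illustration.
    """
    # Normalize each isoform Ct to the reference gene (e.g. CCDC137).
    delta_sample = ct_isoform_sample - ct_refgene_sample
    delta_control = ct_isoform_control - ct_refgene_control
    # Compare the carrier sample with the reference sample.
    delta_delta = delta_sample - delta_control
    return 2.0 ** (-delta_delta)

# Hypothetical Ct values: the isoform appears two cycles later in the carrier.
ratio = relative_expression(26.0, 20.0, 24.0, 20.0)
print(f"isoform level relative to reference sample: {ratio:.2f}")
```

A ratio below 1 would indicate that the isoform is less abundant in the carrier than in the reference sample, and a ratio above 1 the opposite.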
[0147] The loss of exon identity due to the combined activation of binding sites associated with silencing of exon recognition and loss of binding sites recognized by exon enhancers has been shown. See Sterne-Weiler T, Howard J, Mort M, Cooper D N, Sanford J R, Loss of exon identity is a common mechanism of human inherited disease. Genome Res. 2011 October; 21(10):1563-71. However, although Sterne-Weiler et al. implicated specific hexamer sequences as contributing to exon skipping, and the splicing factors PTB and SRp20 in regulation of exon skipping, the context of these sequences with respect to their distance to the adjacent constitutive splice sites was not addressed or considered.
[0148] U.S. Pat. No. 8,361,979 B2 describes a method for inducing exon skipping by targeting oligonucleotide sequences to Serine-Arginine rich proteins that promote exon inclusion. However, the method of the '979 patent does not recognize the role that hnRNP A1 plays in proofreading of exon boundaries, nor does it consider that the proximity between this splicing regulatory sequence and the adjacent constitutive splice site is important for exon definition (i.e., targeting neighboring and distant binding sites is likely to have different effects). It also does not transform that distance into units of bits, i.e., gap surprisal, so as to compute R.sub.i,total, the method described in the instant invention for predicting exons that are recognized and processed in unspliced heterogeneous nuclear RNAs.
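To make the conversion of distance into bits concrete, the following sketch computes a gap surprisal from an exon length distribution, combines it with acceptor and donor strengths into R.sub.i,total, and converts a difference in R.sub.i,total into a fold difference in predicted abundance (2 raised to the difference, as recited in the claims). It is a simplified illustration: the length counts and site strengths are hypothetical, regulatory site terms are left out, and the sign convention (surprisal treated as a positive penalty that is subtracted) is an assumption.

```python
import math

def gap_surprisal(exon_length, length_counts):
    """Self-information (bits) of an exon length, -log2 p(length)."""
    total = sum(length_counts.values())
    p = length_counts[exon_length] / total
    return -math.log2(p)

def ri_total(ri_acceptor, ri_donor, exon_length, length_counts):
    """Acceptor plus donor strength, penalized by the gap surprisal.

    Subtracting a positive surprisal is an assumed sign convention.
    """
    return ri_acceptor + ri_donor - gap_surprisal(exon_length, length_counts)

def fold_abundance(ri_total_a, ri_total_b):
    """Predicted abundance of isoform a relative to isoform b."""
    return 2.0 ** (ri_total_a - ri_total_b)

# Hypothetical exon length distribution (length in nt -> observed count).
counts = {120: 8, 90: 4, 300: 4}
natural = ri_total(5.0, 7.0, 120, counts)   # hypothetical natural exon
cryptic = ri_total(3.0, 7.0, 90, counts)    # hypothetical cryptic exon
print(natural, cryptic, fold_abundance(natural, cryptic))
```

With these toy numbers the natural exon scores 3 bits higher than the cryptic exon, so it would be predicted to be about 8-fold more abundant.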
EXAMPLE 8
Exon Definition Analysis Reveals a Previously Unrecognized, but Common Mechanism of Exon Skipping based on hnRNP A1 Cryptic Site Generation
[0149] The recurrent stop-gain mutation c.5791C>T (rs144567652) in FANCM abolishes exon definition, inducing exon skipping, and is a risk factor for familial breast cancer. The c.5791C>T mutation creates a stop codon at residue 1931, causing the loss of 118 amino acids from the FANCM C-terminus, which destroys the functional domain that mediates the interaction with FAAP24 (Ciccia et al. 2007) and DNA translocation (Rosado et al. 2009). However, functional analyses in lymphoblastoid cell lines obtained from two mutation carriers revealed a very low level of the mutated mRNA, suggesting that c.5791C>T has a loss of function effect. This result was unexpected because this mutation occurs in the penultimate exon of the gene, where nonsense mediated decay, the predominant cellular mechanism of mRNA surveillance of premature stop codons, is not expected to cause significant mRNA degradation due to its close proximity to the 3' untranslated region of the mRNA (Shoemaker E and Green R, Nature Struct. & Mol. Biol. 19: 594-601, 2012).
[0150] Information theory-based mutation analysis was used to assess the impact of the variant on splicing regulatory binding sites that regulate definition of the exon. The mutation is predicted to create an overlapping 4.6 bit hnRNP A1 binding site (c.5790_5795; Mucaki et al. 2013), which suppresses normal exon recognition (R.sub.i,total: 3.4 (C)->2.6 (U) bits) and results in complete exon skipping. The novel hnRNP A1 binding site sequence is frequently present in sites crosslinked to hnRNP A1 protein (Huelga et al. 2012). The frequencies of the normal and mutated FANCM hnRNP A1 sites were determined among the sequences used to build the model for the present disclosure, which comprises 140431 binding sites in total. The wild type site (CCGAAU) was not present, which is consistent with its negative R.sub.i value. However, the mutant site CUGAAU was present 716 times in the set of binding sites crosslinked to the protein. These are experimental data from crosslinking experiments using an antibody against hnRNP A1 to pull down these sequences. The reason why exon skipping occurs is related to one of the key functions of hnRNP A1. HnRNP A1 proofreads U2AF binding at the 3' splice site. It also directly interacts with the 5' splice site. See N. R. Zearfoss, E S. Johnson and S P. Ryder, hnRNP A1 and secondary structure coordinate alternative splicing of Mag, RNA (2013) 19: 948-957. For this protein binding site (Tavenez et al. 2012), exonic hnRNP A1 sites distant from known splice sites are very rare in the transcriptome (
[0151] The opal codon in FANCM contained the core sequence of the novel hnRNP A1 site (positions 1-3 of
[0152] Even assuming that exon lengths are distributed randomly with respect to triplet periodicity, one-third of all exon skipping events would not alter the reading frame. Nonsense mutations are generally acknowledged as pathogenic, are frequently lethal, and certainly reduce fecundity. It is well known in the art that nonsense codons induce exon skipping, as an alternative to nonsense mediated decay (T. Casci, Molecular evolution: Dealing with nonsense, Nature Reviews Genetics 12, 805). However, the specific mechanisms by which this phenomenon occurs have only been the subject of speculation, with limited specific evidence for any proposed mechanism. Natural selection has evolved this mechanism to skip this abundant nonsense codon, TGA. For those exon skipping events that preserve the reading frame, the skipping event may result in less severe phenotypes, depending on how the structure of the protein is deformed by the loss of a stretch of amino acids. The periodic behavior of the gap surprisal function for exon lengths that are multiples of three nucleotides suggests selection favoring exons of lengths that preserve the open reading frame.
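The one-third figure follows from simple arithmetic: skipping an exon preserves the downstream reading frame only when the exon length is a multiple of three. A minimal sketch of that check, using hypothetical, uniformly distributed exon lengths rather than real transcriptome data:

```python
def preserves_reading_frame(exon_length):
    """Skipping an exon keeps the frame only if its length is divisible by 3."""
    return exon_length % 3 == 0

def in_frame_fraction(lengths):
    """Fraction of exon lengths whose skipping would preserve the frame."""
    lengths = list(lengths)
    return sum(preserves_reading_frame(n) for n in lengths) / len(lengths)

# Hypothetical case: every length from 1 to 300 nt is equally likely.
print(in_frame_fraction(range(1, 301)))
```

An observed fraction above one-third in real exon length data would be consistent with the selection for frame-preserving lengths suggested by the periodicity of the gap surprisal function.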
[0153] Individual splicing mutations identified by exon definition may be validated by RT-PCR or qRT-PCR.
[0154] Changes may be made in the above methods without departing from the scope hereof. It should be noted that the matter contained in the above description or shown in the accompanying drawings should be interpreted as illustrative and not in a limiting sense. The following claims are intended to cover generic and specific features described herein, as well as statements of the scope of the present methodology, which, as a matter of language, might be said to fall therebetween.
[0155] It should be understood that suitable equivalents may be used in place of or in addition to the various instruments, components or compositions, the function and use of such substitute or additional components being held to be familiar to those skilled in the art and are therefore regarded as falling within the scope of the present disclosure. Therefore, the present examples are to be considered as illustrative and not restrictive, and the present disclosure is not to be limited to the details given herein but may be modified within the scope of the appended claims.
REFERENCES
[0156] The following references are either cited in this disclosure or are of relevance to the present disclosure. All documents listed below, along with other papers, patents and publications of patent applications cited throughout this disclosure, are hereby incorporated by reference as if the full contents are reproduced herein.
Barash Y, Calarco J A, Gao W, Pan Q, Wang X, Shai O, Blencowe B J, Frey B J. 2010. Deciphering the splicing code. Nature. 465(7294):53-9.
Berget S M. 1995. Exon recognition in vertebrate splicing. J Biol Chem. 270:2411-2414.
Bolisetty M T, Beemon K L. 2012. Splicing of internal large exons is defined by novel cis-acting sequence elements. Nucleic Acids Res. 40(18):9244-54.
Cartegni L., Krainer A. R. 2002. Disruption of an SF2/ASF-dependent exonic splicing enhancer in SMN2 causes spinal muscular atrophy in the absence of SMN1. Nat. Genet. 30:377-384.
Churbanov A, Rogozin I B, Deogun J S, Ali H. 2006. Method of predicting splice sites based on signal interactions. Biology Direct. 1:10.
Churbanov A, Vorechovsky I, Hicks C. 2010. A method of predicting changes in human gene splicing induced by genetic variants in context of cis-acting elements. BMC Bioinformatics. 11:22.
Claes K, Vandesompele J, Poppe B, Dahan K, Coene I, De Paepe A, Messiaen L. 2002. Pathological splice mutations outside the invariant AG/GT splice sites of BRCA1 exon 5 increase alternative transcript levels in the 5' end of the BRCA1 gene. Oncogene. 21:4171-4175.
Claes K, Poppe B, Machackova E, Coene I, Foretova L, De Paepe A, and Messiaen L. 2003. Differentiating pathogenic mutations from polymorphic alterations in the splice sites of BRCA1 and BRCA2. Genes Chromosomes Cancer. 37:314-320.
Clark F, Thanaraj T A. 2002. Categorization and characterization of transcript-confirmed constitutively and alternatively spliced introns and exons from human. Hum Mol Genet. 11: 451-464.
Clavero S, Pérez B, Rincón A, Ugarte M, Desviat L R. 2004. Qualitative and quantitative analysis of the effect of splicing mutations in propionic acidemia underlying non-severe phenotypes. Hum Genet. 115(3):239-47.
Cook K B, Kazan H, Zuberi K, Morris Q, and Hughes T R. 2011. RBPDB: a database of RNA-binding specificities. Nucleic Acids Res. 39:D301-8.
Cover T M, Thomas J A. 2006. Elements of information theory. Wiley-Interscience, Hoboken, N.J.: p. 748.
Dalgleish R, Flicek P, Cunningham F, Astashyn A, Tully R E, Proctor G, Chen Y, McLaren W M, Larsson P, Vaughan B W, Beroud C, Dobson G et al. 2010. Locus Reference Genomic sequences: an improved basis for describing human DNA variants. Genome Med. 2:24.
De Conti L, Baralle M, Buratti E. 2012. Exon and intron definition in pre-mRNA splicing. Wiley Interdiscip Rev RNA. doi: 10.1002/wrna.1140.
Divina P, Kvitkovicova A, Buratti E, Vorechovsky I. 2009. Ab initio prediction of mutation-induced cryptic splice-site activation and exon skipping. Eur J Hum Genet. 17:759-765.
Dominski Z, Kole R. 1991. Selection of splice sites in pre-mRNAs with short internal exons. Mol Cell Biol. 11(12):6075-83.
Dominski Z, Kole R. 1992. Cooperation of pre-mRNA sequence elements in splice site selection. Mol Cell Biol. 12:2108-2114.
Goina E, Skoko N, Pagani F. 2008. Binding of DAZAP1 and hnRNPA1/A2 to an exonic splicing silencer in a natural BRCA1 exon 18 mutant. Mol Cell Biol. 28(11):3850-60.
Graveley B R, Maniatis T. 1998. Arginine/serine-rich domains of SR proteins can function as activators of pre-mRNA splicing. Mol Cell. 1:765-771.
Goren A, Kim E, Amit M, Vaknin K, Kfir N, Ram O, Ast G. 2010. Overlapping splicing regulatory motifs-combinatorial effects on splicing. Nucleic Acids Res. 38:3318-3327.
Hwang D Y, Cohen J B. 1997. U1 small nuclear RNA-promoted exon selection requires a minimal distance between the position of U1 binding and the 3' splice site across the exon. Mol Cell Biol. 17:7099-7107.
Ibrahim E C, Schaal T D, Hertel K J, Reed R, Maniatis T. 2005. Serine/arginine-rich protein-dependent suppression of exon skipping by exonic splicing enhancers. Proc Natl Acad Sci USA. 102:5002-5007.
Jaynes E. Information Theory and Statistical Mechanics. Phys. Rev. 106, 620-630 (1957).
Lim K H, Ferraris L, Filloux M E, Raphael B J, Fairbrother W G. 2011. Using positional distribution to identify splicing elements and predict pre-mRNA processing defects in human genes. Proc Natl Acad Sci USA. 108(27):11093-8.
Liu H X, Zhang M, Krainer A R. 1998. Identification of functional exonic splicing enhancer motifs recognized by individual SR proteins. Genes Dev. 12:1998-2012.
Liu H X, Chew S L, Cartegni L, Zhang M Q, Krainer A R. 2000. Exonic splicing enhancer motif recognized by human SC35 under splicing conditions. Mol. Cell. Biol. 20:1063-1071.
Macias-Vidal J, Rodes M, Hernandez-Perez J M, Vilaseca M A, Coll M J. 2009. Analysis of the CTNS gene in 32 cystinosis patients from Spain. Clin Genet. 76:486-489.
Mucaki E J, Ainsworth P, Rogan P K. 2011. Comprehensive prediction of mRNA splicing effects of BRCA1 and BRCA2 variants. Hum Mutat. 32:735-42.
Mucaki E J, Shirley B C, Rogan P K. 2013. Prediction of Mutant mRNA Splice Isoforms by Information Theory-Based Exon Definition. Hum Mutat. 34:557-65.
Nalla V K, Rogan P K. 2005. Automated splicing mutation analysis by information theory. Hum Mutat. 25:334-342.
Olsen et al., The ETFDH c.158A>G Variation Disrupts the Balanced Interplay of ESE- and ESS-Binding Proteins thereby Causing Missplicing and Multiple Acyl-CoA Dehydrogenation Deficiency. Human Mutation, Volume 35, Issue 1, pages 86-95 (2014).
Robberson B L, Cote G J, and Berget S M. 1990. Exon definition may facilitate splice site selection in RNAs with multiple exons. Mol Cell Biol. 10:84-94.
Rogan P K, Faux B M, Schneider T D. 1998. Information analysis of human splice site mutations. Hum Mutat. 12:153-171.
Rogan P K, Svojanovsky S R, Leeder J S. 2003. Information theory-based analysis of CYP2C19, CYP2D6 and CYP3A5 splicing mutations. Pharmacogenetics. 13:207-18.
Rogan P K. 2009. Ab Initio Exon Definition Using an Information Theory-based Approach. Biochemistry Publications. Paper 10. http://ir.lib.uwo.ca/biochempub/10.
Rutter J L, Goldstein A M, Davila M R, Tucker M A, Struewing J P. 2003. CDKN2A point mutations D153spl(c.457G>T) and IVS2+1G>T result in aberrant splice products affecting both p16INK4a and p14ARF. Oncogene. 22:4444-8.
Sanz D J, Acedo A, Infante M, Duran M, Perez-Cabornero L, Esteban-Cardenosa E, Lastra E, Pagani F, Miner C, Velasco E A. 2010. A high proportion of DNA variants of BRCA1 and BRCA2 is associated with aberrant splicing in breast/ovarian cancer patients. Clin Cancer Res. 16:1957-67.
Schneider T D, Stormo G D, Yarus M A, Gold L. 1984. Delila system tools. Nucleic Acids Res. 12:129-140.
Schneider T D. 1997. Information content of individual genetic sequences. J Theor Biol. 189:427-441.
Shultzaberger R K, Bucheimer R E, Rudd K E, Schneider T D. 2001. Anatomy of Escherichia coli ribosome binding sites. J Mol Biol. 313:215-228.
Smith P J, Zhang C, Wang J, Chew S L, Zhang M Q, Krainer A R. 2006. An increased specificity score matrix for the prediction of SF2/ASF-specific exonic splicing enhancers. Hum Mol Genet. 15(16):2490-508.
Spurdle A B, Healey S, Devereau A, Hogervorst F B, Monteiro A N, Nathanson K L, et al. ENIGMA-evidence-based network for the interpretation of germline mutant alleles: an international initiative to evaluate risk and clinical significance associated with sequence variation in BRCA1 and BRCA2 genes. Hum Mutat. 2012; 33(1):2-7.
[0157] Stamm S, Riethoven J J, Le Texier V, Gopalakrishnan C, Kumanduri V, Tang Y, Barbosa-Morais N L, Thanaraj T A. 2006. ASD: a bioinformatics resource on alternative splicing. Nucl Acids Res. 34(suppl 1):D46-55.
Thomassen M, Ana Blanco, Marco Montagna, Thomas V. O. Hansen, Inge S. Pedersen, Sara Gutierrez-Enriquez, Mirela Menendez, Laura Fachal, Marta Santamarina, Ane Y. Steffensen, Lars Jonson, Simona Agata, Phillip Whitey, Silvia Tognazzo, Eva Tornero, Uffe B. Jensen, Judith Balmana, Torben A. Kruse, David E. Goldgar, Conxi Lazaro, Orland Diez, Amanda B. Spurdle, Ana Vega, Characterization of BRCA1 and BRCA2 splicing variants: a collaborative report by ENIGMA consortium members Breast Cancer Res Treat. 2012 April; 132(3):1009-23
Tompson S W, Ruiz-Perez V L, Blair H J, Barton S, Navarro V, Robson J L, Wright M J, Goodship J A. 2007. Sequencing EVC and EVC2 identifies mutations in two-thirds of Ellis-van Creveld syndrome patients. Hum Genet. 120:663-670.
Tribus M. 1961. Thermostatics and thermodynamics: an introduction to energy, information and states of matter, with engineering applications. Van Nostrand, Princeton, N.J.: p. 649.
REFERENCES FOR MUTATIONS IN FIG. 8 ARE LISTED BELOW
[0158] .sup.1Santisteban I, Arredondo-Vega F X, Kelly S, Mary A, Fischer A, Hummell D S, Lawton A, Sorensen R U, Stiehm E R, Uribe L. 1993. Novel splicing, missense, and deletion mutations in seven adenosine deaminase-deficient patients with late/delayed onset of combined immunodeficiency disease. Contribution of genotype to phenotype. J Clin Invest 92:2291-2302.
.sup.2Sanz D J, Acedo A, Infante M, Duran M, Perez-Cabornero L, Esteban-Cardenosa E, Lastra E, Pagani F, Miner C, Velasco E A. 2010. A high proportion of DNA variants of BRCA1 and BRCA2 is associated with aberrant splicing in breast/ovarian cancer patients. Clin Cancer Res 16:1957-67.
.sup.3Chen X, Truong T T, Weaver J, Bove B A, Cattie K, Armstrong B A, Daly M B, Godwin A K. 2006. Intronic alterations in BRCA1 and BRCA2: effect on mRNA splicing fidelity and expression. Hum Mutat 27:427-435.
.sup.4Claes K, Vandesompele J, Poppe B, Dahan K, Coene I, De Paepe A, Messiaen L. 2002. Pathological splice mutations outside the invariant AG/GT splice sites of BRCA1 exon 5 increase alternative transcript levels in the 5' end of the BRCA1 gene. Oncogene 21:4171-4175.
.sup.5Claes K, Poppe B, Machackova E, Coene I, Foretova L, De Paepe A, and Messiaen L. 2003. Differentiating pathogenic mutations from polymorphic alterations in the splice sites of BRCA1 and BRCA2. Genes Chromosomes Cancer 37:314-320.
.sup.6Caux-Moncoutier V, Pages-Berhouet S, Michaux D, Asselain B, Castera L, De Pauw A, Buecher B, Gauthier-Villars M, Stoppa-Lyonnet D, Houdayer C. 2009. Impact of BRCA1 and BRCA2 variants on splicing: clues from an allelic imbalance study. Eur J Hum Genet 17:1471-1480.
.sup.7Gutierrez-Enriquez S, Coderch V, Masas M, Balmana J, Diez O. 2009. The variants BRCA1 IVS6-1G>A and BRCA2 IVS15+1G>A lead to aberrant splicing of the transcripts. Breast Cancer Res Treat 117:461-465.
.sup.8Campos B, Diez O, Domenech M, Baena M, Balmana J, Sanz J, Ramirez A, Alonso C, Baiget M. 2003. RNA analysis of eight BRCA1 and BRCA2 unclassified variants identified in breast/ovarian cancer families from Spain. Hum Mutat 22:337.
.sup.9Rutter J L, Goldstein A M, Davila M R, Tucker M A, Struewing J P. 2003. CDKN2A point mutations D153spl(c.457G>T) and IVS2+1G>T result in aberrant splice products affecting both p16INK4a and p14ARF. Oncogene 22:4444-8.
.sup.10Harland M, Mistry S, Bishop D T, Bishop J A. 2001. A deep intronic mutation in CDKN2A is associated with disease in a subset of melanoma pedigrees. Hum Mol Genet 23:2679-2686.
.sup.11Macias-Vidal J, Rodes M, Hernandez-Perez J M, Vilaseca M A, Coll M J. 2009. Analysis of the CTNS gene in 32 cystinosis patients from Spain. Clin Genet 76:486-489.
.sup.12Tompson S W, Ruiz-Perez V L, Blair H J, Barton S, Navarro V, Robson J L, Wright M J, Goodship J A. 2007. Sequencing EVC and EVC2 identifies mutations in two-thirds of Ellis-van Creveld syndrome patients. Hum Genet 120:663-670.
.sup.13Arranz J A, Pinol F, Kozak L, Perez-Cerda C, Cormand B, Ugarte M, Riudor E. 2002. Splicing mutations, mainly IVS6-1(G>T), account for 70% of fumarylacetoacetate hydrolase (FAH) gene alterations, including 7 novel mutations, in a survey of 29 tyrosinemia type I patients. Hum Mutat 20:180-188.
.sup.14Schloesser M, Hofferbert S, Bartz U, Lutze G, Lammle B, Engel W. 1995. The novel acceptor splice site mutation 11396(G->A) in the factor XII gene causes a truncated transcript in cross-reacting material negative patients. Hum Mol Genet 4:1235-1237.
.sup.15Lapoumeroulie C, Acuto S, Rouabhi F, Labie D, Krishnamoorthy R, Bank A. 1987. Expression of a beta thalassemia gene with abnormal splicing. Nucleic Acids Res 15:8195-8204.
.sup.16Treisman R, Orkin S H, Maniatis T. 1983. Specific transcription and RNA splicing defects in five cloned beta-thalassaemia genes. Nature 302: 591-596.
.sup.17Vidaud M, Gattoni R, Stevenin J, Vidaud D, Amselem S, Chibani J, Rosa J, Goossens M. 1989. A 5' splice-region G->C mutation in exon 1 of the human beta-globin gene inhibits pre-mRNA splicing: a mechanism for beta+-thalassemia. Proc Natl Acad Sci USA 86:1041-1045.
.sup.18Atweh G F, Anagnou N P, Shearin J, Forget B G, Kaufman R E. 1985. Beta-thalassemia resulting from a single nucleotide substitution in an acceptor splice site. Nucleic Acids Res 13:777-790.
.sup.19Bunge S, Steglich C, Zuther C, Beck M, Morris C P, Schwinger E, Schinzel A, Hopwood J J, Gal A. 1993. Iduronate-2-sulfatase gene mutations in 16 patients with mucopolysaccharidosis type II (Hunter syndrome). Hum Mol Genet 2:1871-1875.
.sup.20Erdmann J, Raible J, Maki-Abadi J, Hummel M, Hammann J, Wollnik B, Frantz E, Fleck E, Hetzer R, Regitz-Zagrosek V. 2001. Spectrum of clinical phenotypes and gene variants in cardiac myosin-binding protein C mutation carriers with hypertrophic cardiomyopathy. J Am Coll Cardiol 38:322-330.
.sup.21Dworniczak B, Aulehla-Scholz C, Kalaydjieva L, Bartholome K, Grudda K, Horst J. 1991. Aberrant splicing of phenylalanine hydroxylase mRNA: the major cause for phenylketonuria in parts of southern Europe. Genomics 11:242-246.
.sup.22Maciolek N L, Alward W L, Murray J C, Semina E V, McNally M T. 2006. Analysis of RNA splicing defects in PITX2 mutants supports a gene dosage model of Axenfeld-Rieger syndrome. BMC Med Genet 7:59.
.sup.23Vega A I, Pérez-Cerdá C, Desviat L R, Matthijs G, Ugarte M, Pérez B. 2009. Functional analysis of three splicing mutations identified in the PMM2 gene: toward a new therapy for congenital disorder of glycosylation type Ia. Hum Mutat 30:795-803.
REFERENCES FOR MUTATIONS IN FIG. 9 ARE LISTED BELOW
[0159] .sup.1Miyajima H, Miyaso H, Okumura M, Kurisu J, Imaizumi K. 2002. Identification of a cis-acting element for the regulation of SMN exon 7 splicing. J Biol Chem. 277(26):23271-7.
.sup.2Heintz C, Dobrowolski S F, Andersen H S, Demirkol M, Blau N, Andresen B S. 2012. Splicing of phenylalanine hydroxylase (PAH) exon 11 is vulnerable: molecular pathology of mutations in PAH exon 11. Mol Genet Metab. 106(4):403-11.
.sup.3Sun C, Southard C, Di Rienzo A. 2009. Characterization of a novel splicing variant in the RAPTOR gene. Mutat Res. 9; 662(1-2):88-92.
.sup.4Fukao T, Horikawa R, Naiki Y, Tanaka T, Takayanagi M, Yamaguchi S, Kondo N. 2010. A novel mutation (c.951C>T) in an exonic splicing enhancer results in exon 10 skipping in the human mitochondrial acetoacetyl-CoA thiolase gene. Mol Genet Metab. 100(4):339-44.
.sup.5Gonçalves V, Theisen P, Antunes O, Medeira A, Ramos J S, Jordan P, Isidro G. 2009. A missense mutation in the APC tumor suppressor gene disrupts an ASF/SF2 splicing enhancer motif and causes pathogenic skipping of exon 14. Mutat Res. 662(1-2):33-6.
.sup.6Burgess R, MacLaren R E, Davidson A E, Urquhart J E, Holder G E, Robson A G, Moore A T, Keefe R O, Black G C, Manson F D. 2009. ADVIRC is caused by distinct mutations in BEST1 that alter pre-mRNA splicing. J Med Genet. 46(9):620-5.
.sup.7Jensen C J, Stankovich J, Butzkueven H, Oldfield B J, Rubio J P. 2010. Common variation in the MOG gene influences transcript splicing in humans. J Neuroimmunol. 229(1-2):225-31.
.sup.8Tran V K, Takeshima Y, Zhang Z, Yagi M, Nishiyama A, Habara Y, Matsuo M. 2006. Splicing analysis disclosed a determinant single nucleotide for exon skipping caused by a novel intraexonic four-nucleotide deletion in the dystrophin gene. J Med Genet. 43(12):924-30.
.sup.9Gabut M, Min M, Marsac C, Brivet M, Tazi J, Soret J. 2005. The SR protein SC35 is responsible for aberrant splicing of the E1alpha pyruvate dehydrogenase mRNA in a case of mental retardation with lactic acidosis. Mol Cell Biol. 25(8):3286-94.
.sup.10Colapietro P, Gervasini C, Natacci F, Rossi L, Riva P, Larizza L. 2003. NF1 exon 7 skipping and sequence alterations in exonic splice enhancers (ESEs) in a neurofibromatosis 1 patient. Hum Genet. 113(6):551-4.
.sup.11Raponi M, Kralovicova J, Copson E, Divina P, Eccles D, Johnson P, Baralle D, Vorechovsky I. 2011. Prediction of single-nucleotide substitutions that result in exon skipping: identification of a splicing silencer in BRCA1 exon 6. Hum Mutat. 32(4):436-44.