COMPUTER IMPLEMENTED METHOD TO OPTIMIZE PHYSICAL-CHEMICAL PROPERTIES OF BIOLOGICAL SEQUENCES

Abstract

A computer based biological sequence analysis method provides, after a training phase adopting data from screening experiments, either an evaluation of an input sequence expressing the performance with reference to the chemical-physical feature object of the screening experiment, or at least an optimized output sequence. The method provides the use of a set or library of sequences derived from DMS experiments and SELEX for the generation of a second set of high efficiency biological sequences, whereby high efficiency means, for example, high catalysis capacity, high fitness, high ability to bind to a specific target, high fluorescence activity and, in general, a high performance with reference to the chemical-physical properties of a molecule which are defined at the start and can be selected through experiments.

Claims

1. A computer-based method for processing results of a biological sequence screening experiment, comprising the steps of: a) receiving a set of sample biological sequences selected by at least one screening experiment comprising a selection step, wherein in the selection step, molecules encoded by the sample biological sequences are selected based on a chemical-physical property of interest; b) defining allowed molecular states related to the chemical-physical property of interest and at least one statistical energy function, wherein the at least one statistical energy function is expressed in terms of a multivariate linear function of specific statistical energy parameters and associated to the allowed molecular states; c) providing an expression of a likelihood of obtaining, from the at least one screening experiment, the set of sample biological sequences observed in different rounds of the at least one screening experiment, wherein the expression comprises a selection factor expressing a probability that a sequence is selected during the at least one screening experiment as a function of the specific statistical energy parameters; d) calculating energy parameters of the multivariate linear function by maximizing the expression of the likelihood and considering the set of sample biological sequences; and e) given at least one input sequence, calculating a score of the at least one input sequence based on the at least one statistical energy function identified by the energy parameters calculated in step d to represent an evaluation of the at least one input sequence with respect to the chemical-physical property of interest; and/or generating at least one sequence that maximizes at least locally a score function based on the at least one statistical energy function identified by the energy parameters calculated in step d.

2. The computer-based method according to claim 1, wherein in step d, the set of sample biological sequences is used to define unsupervised training avoiding experimental measurements of physical or chemical parameters.

3. The computer-based method according to claim 1, wherein the expression of the likelihood further comprises an amplification factor and/or a sampling factor, wherein the amplification factor is expressed by a probability that a sequence is amplified during the at least one screening experiment, and the sampling factor is expressed by a probability that a sequence is sampled during the at least one screening experiment.

4. The computer-based method according to claim 1, wherein the selection factor is $\mathcal{P}(\underline{n}(t) \mid \underline{N}(t)) = \frac{\mathcal{S}(\underline{n}(t), C) \prod_{s} \left[ \binom{N_{s}(t)}{n_{s,k}(t), \ldots} \prod_{k} e^{- n_{s,k}(t) E_{k}(s)} \right]}{\sum_{\underline{n}'(t)} \mathcal{S}(\underline{n}'(t), C) \prod_{s} \left[ \binom{N_{s}(t)}{n'_{s,k}(t), \ldots} \prod_{k} e^{- n'_{s,k}(t) E_{k}(s)} \right]},$ wherein: R.sub.s(t) is a number of reads of sequence s in round t, N.sub.s(t) is a number of biological vectors that transfer the sequence s at the round t, $N_{tot}(t) = \sum_{s} N_{s}(t)$ is a total number of the biological vectors in the round t, $R_{tot}(t) = \sum_{s} R_{s}(t)$ is a total number of reads in the round t, $n_{tot}(t) = \sum_{s} \sum_{k \in sel} n_{s,k}(t)$ is a total number of selected biological vectors in the round t, wherein n.sub.s,k(t) is a number of the biological vectors with the sequence s in the state k in the round t, d is a number of different sequences, E.sub.k(s) is a statistical energy of the sequence s in the state k, k∈sel is a set of discrete molecular states that describe the selection step to a following round of the at least one screening experiment, wherein training data are from the set of discrete molecular states, and the set of discrete molecular states comprises non-specific bond, specific bond, folded state, and unfolded state, C is a target number, $\mathcal{S}(\underline{n}(t), C)$ is a function defined as follows: $\mathcal{S}(\underline{n}(t), C) = 1$ if $\sum_{s,k} n_{s,k}(t) \leq C$, or $\mathcal{S}(\underline{n}(t), C) = 0$ in any other case.

5. The computer-based method according to claim 1, wherein the at least one statistical energy function comprises terms expressing independent position biases and epistatic effects.

6. The computer-based method according to claim 5, wherein the at least one statistical energy function is expressed by: $E_{k}(s) = - \sum_{i} \theta_{ki}(s_{i}) - \sum_{i < j} \theta_{k(ij)}(s_{i}, s_{j}) - \sum_{i < j < l} \theta_{k(ijl)}(s_{i}, s_{j}, s_{l}) + \ldots$

7. The computer-based method according to claim 5, wherein the at least one statistical energy function is expressed by: $E_{k}(s) = - U_{k}(L) - \sum_{i = 1}^{L} \theta_{k}(s_{i}) - \sum_{j = 1}^{J} \sum_{i = 1}^{L - j} \theta_{k(j)}(s_{i}, s_{i + j})$

8. The computer-based method according to claim 4, wherein the expression of the likelihood comprises an amplification factor expressed by: $\mathcal{P}(\underline{N}(t + 1) \mid \underline{n}(t)) = \frac{N_{tot}(t + 1)!}{\prod_{s} N_{s}(t + 1)!} \prod_{s = 1}^{d} \left( \frac{\sum_{k \in sel} n_{s,k}(t)}{n_{tot}(t)} \right)^{N_{s}(t + 1)}.$

9. The computer-based method according to claim 8, wherein the expression of the likelihood comprises a reads sampling factor expressed by: $\mathcal{P}(\underline{R}(t) \mid \underline{N}(t)) = \frac{R_{tot}(t)!}{\prod_{s} R_{s}(t)!} \prod_{s = 1}^{d} \left( \frac{N_{s}(t)}{N_{tot}(t)} \right)^{R_{s}(t)}.$

10. The computer-based method according to claim 9, wherein a number of the allowed molecular states is two, and the allowed molecular states are selected or not-selected in the at least one screening experiment.

11. The computer-based method according to claim 10, wherein the selection factor is expressed by: $\mathcal{P}(\underline{n}(t) \mid \underline{N}(t)) = \prod_{s} \left[ \binom{N_{s}(t)}{n_{s}(t)} p_{s}^{n_{s}(t)} (1 - p_{s})^{N_{s}(t) - n_{s}(t)} \right].$

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0049] FIG. 1 shows a general flowchart related to an example of input/output of this method.

[0050] FIG. 2 shows a flow chart related to an example of a pipeline in which this method can be found: starting with the pre-processing of raw data from the sequencing of the experiment to three examples of model output, not to be considered exhaustive of its uses.

[0051] FIG. 3 is a schematic representation of the main definitions used in the description of the model.

[0052] Left panel: N.sub.s(t) is the number of vectors (e.g., phages) exposing sequence s. The number of reads obtained from sequencing is proportional to N.sub.s(t).

[0053] Right panel: n.sub.s(t) is the number of vectors with sequence s that have been selected (for example, that have bound to the target).

[0054] FIG. 4 shows the assessment of the selectivity of mutants. Scatter plot of the statistical binding energy calculated by the model and the selectivity calculated on the data (selectivity formula indicated in the following paragraphs). The four panels in FIG. 4 are related to the experiments described in example 1.

[0055] The circles (crosses) are the sequences on the test set (training set) with an initial count beyond a certain threshold (procedure for checking data quality). In each panel there is the Spearman correlation coefficient for each case. The Spearman coefficient measures the degree of ordinal relationship between two variables. The strong correlation, value of the coefficient between 0.80 and 0.98, shows the model's ability to predict the binding affinity in each of the four experimental datasets.

[0056] FIG. 5 shows the assessment of the selectivity of mutants through the binding energy dispersion graph calculated by the model and the selectivity calculated on the data.

[0057] FIG. 5 shows the evaluation of the selectivity of the sequences of an experiment (example 2) by learning the model on another experiment related to the same protein (example 3).

[0058] The points therefore correspond to sequences belonging to experiment 2 with a total count beyond a certain threshold (procedure for checking data quality). The selectivity calculated from the counts in experiment 2 is shown on the abscissas and the statistical energy of the model initialized by the data of experiment 3 is indicated on the ordinates.

[0059] The strong correlation, Pearson's coefficient, shows the model's ability to predict binding affinity in an experiment independent of that of learning. The Pearson coefficient between two statistical variables expresses the extent of any linearity relationship between them.

[0060] FIG. 6 shows the assessment of the selectivity of mutants through the scatter plot of bond energy calculated by the model and selectivity calculated on the data.

[0061] FIG. 6 shows the evaluation of the selectivity of the sequences with high selectivity (in this case high binding affinity) by learning the model on low selectivity sequences. The data relates to experiment 1. It should be noted that in this test the sequences with high affinity and therefore a priori more informative are hidden from the model during the learning or initialization phase.

[0062] The black dots correspond to the low selectivity training sequences while the gray crosses correspond to the high selectivity test sequences.

[0063] The model score correctly sorts the sequences in relation to the experimental selectivity.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0064] The definitions listed below are used in the description of the invention.

[0065] By Deep Mutational Scanning (DMS) is meant a methodology based on next generation sequencing techniques that measures, in a single experiment, the activity of a number of unique variants on the order of 10.sup.5 (or more) of a reference protein, DNA sequence or RNA.

[0066] By SELEX (Systematic Evolution of Ligands by EXponential enrichment) is meant a combinatorial biochemical technique suitable for the production of DNA or RNA oligonucleotides (both single- and double-stranded), called aptamers, capable of specifically binding a given target.

[0067] By Directed Evolution or DE is meant a technique in which a library of variants is built from one or more sequences and is subjected to a selection process for a property of interest; the best variant or a selection of variants is then used for the next round, in which the procedure is repeated.

[0068] Machine-learning-assisted Directed Evolution means a Directed Evolution process in which an in silico model is trained on the sequence selection data (sequencing of a sample of the selected sequences) from one or more rounds and is used to propose variants for the next round, as described in “Machine learning-assisted directed protein evolution with combinatorial libraries”, Z. Wu et al. arXiv preprint.

[0069] By Deep Sequencing is meant a repeated sequencing technique (hundreds or thousands of times) of a given region of DNA. This new generation sequencing approach allows the detection of rare clonal types, or cells of microorganisms whose genetic contribution can be of the order of 1% of the genetic material analyzed.

[0070] By Ultra Deep Sequencing techniques is meant a particular Deep Sequencing technique restricted to a limited region of the genome that allows the detection of variants whose percentage genetic contribution can be in the order of 10.sup.−7/10.sup.−8.

[0071] By genetic mutation is meant any stable and heritable modification in the nucleotide sequence of a genome or more generally of genetic material (both DNA and RNA) due to external agents or to chance, but not to genetic recombination.

[0072] By somatic mutation is meant a non-heritable genetic mutation.

[0073] By folding or protein folding, is meant the molecular folding process through which proteins reach their three-dimensional structure. By unfolded state of a protein is meant instead the denatured state of linear polypeptide chain.

[0074] By phenotype is meant the set of all the characteristics shown by a living organism, therefore its morphology, its development, and its biochemical and physiological properties, including behavior. By extension, the phenotype of one or more mutations in a coding region of the genome refers to the resulting functional or structural variations.

[0075] By genotype is meant the set of all the genes that make up the DNA (genetic makeup/genetic identity/genetic constitution) of an organism or population.

[0076] Epistasis generally means a non-additive phenotypic effect between individual mutations.

[0077] By residue is meant an amino acid of a protein or polypeptide.

[0078] By molecular state is meant the state of the biological molecule (e.g., bound, non-bound, folded, unfolded), related to the activity or physical-chemical characteristic of the same.

[0079] By coding sequence is meant that portion of the DNA or RNA of a gene that codes for proteins.

[0080] By sequence alignment is meant a bioinformatics procedure in which two or more primary sequences of amino acids, DNA or RNA are arranged in a matrix of common length by inserting appropriate insertion and deletion symbols (symbols that do not denote amino acids or nitrogenous bases).

[0081] By positional bias is meant the frequency of observing a given amino acid in a specific position in a given sequence library or more generally in a multiple sequence alignment.

[0082] By phage display is meant a laboratory technique for the study of protein-protein, protein-peptide and protein-DNA interactions that uses bacteriophages (viruses that infect bacteria). With this technique, the gene coding for the protein of interest is inserted into the gene of a phage coating protein causing the exposure of the protein on the outside of the phage, keeping the gene of that protein inside, establishing a connection between genotype and phenotype.

[0083] By ribosome display is meant a biochemical technique to create proteins that bind a specific ligand. In particular, the technique consists in creating a hybrid between the protein of interest and its progenitor messenger RNA, which is used as a complex to bind a particular ligand immobilized in different selection steps.

[0084] The present invention is directed to a computer implemented method for the analysis and use of sequence libraries deriving from screening experiments, e.g., Deep Mutational Scanning (DMS), Directed Evolution (DE) or SELEX, whose purpose is the selection or evaluation of amino acid or nucleotide sequences, which result in proteins and/or peptides or aptamers selected for a given chemical-physical characteristic during the screening experiment.

[0085] As described in the prior art, the DMS approach is aimed at selecting, given a desired characteristic, the best mutants starting from an initial pool of mutants and can be carried out using different types of experimental tests known in themselves.

[0086] The experiments can, for example, be based on cells (bacteria, yeasts and cultured mammalian cells) with a protein typically expressed by a plasmid or virus, or on the use of systems that develop in-vitro, such as the phage display or ribosome display.

[0087] In general, a library of mutated variants of the gene of interest is synthesized with the DMS, cloned in the appropriate expression vector and introduced, for example, in cells where the protein encoded by the gene has a function that can be selected. The selection can be applied for the protein function or another molecular property of interest, altering the frequency of each variant based on its functional capacity.

[0088] The selection in the screening experiment can be carried out using different strategies based on enzymatic catalysis or on binding to a molecular target, on cell growth favored by the presence of a more or less effective variant or on the separation of cells that express a specific variant. For example, selection can enrich cells with more active protein variants and exhaust those with inactive or highly inefficient variants. The selection can also be made by implementing the physical separation of the variants, such as in display experiments or by using cell separation techniques known in the art. The selection can finally be made before or after specific treatments or periods of time. In any case, the basis of the approach remains a selection process for an established characteristic.

[0089] At the end of one or more selection rounds, both the library present in the initial input population and that of the post-selection population are recovered and the frequency of each variant in the two libraries is determined by high-performance DNA sequencing techniques, in particular Deep Sequencing and Ultra Deep Sequencing.

[0090] As described in the prior art, SELEX's approach is aimed at selecting aptamers capable of binding with high specificity to a selected molecular target (proteins, other nucleic acids, entire cells, for example cancer). Also in this case, a library of oligonucleotide sequences is produced. Each sequence contains two constant regions at the ends to allow PCR amplification, and a central region generated with a random sequence of nucleotides.

[0091] These sequences are then subjected to an in vitro selection procedure in order to separate and amplify mainly functional aptamers rather than the other sequences. The selection can also be based in this case on different techniques known to those skilled in the art, for example on the binding affinity for a specific molecular target or on a catalytic activity. In any case, the basis of the approach remains a selection process for an established characteristic.

[0092] One or more rounds of amplification, selection and separation of the selected molecules can be carried out.

[0093] At the end of one or more selection rounds, the selected sequences are recovered and analyzed by means of high-performance DNA sequencing techniques, in particular Deep Sequencing.

[0094] Starting from the SELEX technique, various variants have been developed with different selection and amplification strategies known to those skilled in the art, such as described in “Zhuo Z, et al.—Recent Advances in SELEX Technology and Aptamer Applications in Biomedicine. Int J Mol Sci. 2017 Oct. 14; 18 (10)”.

[0095] The method according to the invention is therefore based on the use of a model capable of evaluating sequences that present one or more mutations with respect to the sequences of the training set, taking into consideration not only the possibility that these mutations may occur in different positions along the sequence but also, according to an embodiment example, their relative epistasis.

[0096] The selection takes place through a score that is based on a statistical energy function of the sequence, as defined in the description of the method below.

[0097] Once the model has been trained on a set of input sequences (a DMS experiment for example), the model can be used to evaluate any sequence that is aligned with the mutant library used in the experimental screening.

[0098] The model in a preferred embodiment described below considers two states, selected or unselected with which a probability is associated for each sequence.

[0099] However, it can be generalized to several states with which a probability is associated for each sequence. For example, in a preferred embodiment three states are considered: bound, unbound and folded, unfolded.

[0100] In the detailed description of the model, reference is made to the general version, which considers a generic number of states, and the two-state case is then described in particular.

[0101] Notation

[0102] Underlined symbols such as $\underline{x}$ refer to vectors whose elements are indexed by sequence, {x.sub.s}. Bold symbols such as $\mathbf{x}$ denote the set of such quantities over all the sequences and rounds, {x.sub.s(t)}, for each sequence s and round t.

[0103] Symbol Definitions

[0104] R.sub.s(t) is the number of reads of sequence s in round t.

[0105] N.sub.s(t) is the number of vectors transporting sequence s in round t.

[00001] $N_{tot}(t) = \sum_{s} N_{s}(t)$

is the total number of vectors in round t.

[00002] $R_{tot}(t) = \sum_{s} R_{s}(t)$

is the total number of reads in round t.

[00003] $n_{tot}(t) = \sum_{s} \sum_{k \in sel} n_{s,k}(t)$

is the total number of selected vectors in round t, summed over all sequences s and all states k∈sel.

[0106] n.sub.s,k(t) is the number of vectors with sequence s in a state k in round t.

[0107] d is the number of distinct sequences.

[0108] E.sub.k(s) is the statistical energy of sequence s in state k.

[0109] k∈sel is the set of discrete molecular states specifying the selection to the subsequent round of the screening experiment from which the training data come (e.g., non-specific bond, bond, folded, unfolded).

[0110] In general, the statistical energy is a linear multivariate function. In a first example, the energy of each state is defined as a linear function of energy parameters $\theta_{k(i_{1}, \ldots, i_{p})}(s_{i_{1}}, \ldots, s_{i_{p}})$ expressing both independent position bias and epistatic effects (i.e., non-additivity of mutational effects) such as doublet and triplet interactions:

[00004] $E_{k}(s) = - \sum_{i} \theta_{ki}(s_{i}) - \sum_{i < j} \theta_{k(ij)}(s_{i}, s_{j}) - \sum_{i < j < l} \theta_{k(ijl)}(s_{i}, s_{j}, s_{l}) + \ldots$

[0111] Each $\theta_{k(i_{1}, \ldots, i_{p})}(s_{i_{1}}, \ldots, s_{i_{p}})$ is the statistical energy contribution depending on the p amino acids $s_{i_{1}}, \ldots, s_{i_{p}}$ at positions $i_{1}, \ldots, i_{p}$. Together, these constitute the free parameters, which are calculated as scalars during the training procedure.
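As an illustration of the first energy expression truncated at two-site terms, a minimal Python sketch follows. The function name `statistical_energy` and the dictionary layout of `fields` and `couplings` are assumptions made for this example only; the description does not prescribe a data structure.

```python
from itertools import combinations

def statistical_energy(seq, fields, couplings):
    """E(s) = -sum_i theta_i(s_i) - sum_{i<j} theta_ij(s_i, s_j).

    fields:    dict (i, a) -> theta_i(a)            (hypothetical layout)
    couplings: dict (i, j, a, b) -> theta_ij(a, b)  (hypothetical layout)
    Parameters absent from the dictionaries are treated as zero.
    """
    energy = 0.0
    # Independent position bias (one-site terms).
    for i, a in enumerate(seq):
        energy -= fields.get((i, a), 0.0)
    # Epistatic two-site terms over all pairs i < j.
    for i, j in combinations(range(len(seq)), 2):
        energy -= couplings.get((i, j, seq[i], seq[j]), 0.0)
    return energy

# Toy parameters for a sequence of length two.
fields = {(0, "A"): 1.0, (1, "C"): 0.5}
couplings = {(0, 1, "A", "C"): 0.25}
energy_AC = statistical_energy("AC", fields, couplings)
energy_GG = statistical_energy("GG", fields, couplings)
```

With these toy values, the sequence "AC" receives a lower (more favorable) energy than "GG", for which no parameter is defined.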

[0112] The expression above is not to be considered as exhaustive of the possible formulations of statistical energy. For example, statistical energy can be defined alternatively through the following multivariate linear function:

[00005] $E_{k}(s) = - U_{k}(L) - \sum_{i = 1}^{L} \theta_{k}(s_{i}) - \sum_{j = 1}^{J} \sum_{i = 1}^{L - j} \theta_{k(j)}(s_{i}, s_{i + j})$

[0113] This expression can be used, for example, without aligning the sequences in a multiple alignment (see definitions above).

[0114] In addition to the parameters θ dependent on the sequence, there is a term U.sub.k that expresses an energy-statistical contribution dependent on the length L of the sequence.
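The alignment-free expression with the length term U.sub.k(L) can be sketched in Python as follows. The function name `distance_energy`, its argument names and the dictionary layouts are assumptions introduced for illustration; only the formula itself comes from the description.

```python
def distance_energy(seq, length_term, site_params, pair_params, max_gap):
    """E(s) = -U(L) - sum_i theta(s_i) - sum_{j=1..J} sum_i theta_(j)(s_i, s_{i+j}).

    length_term: dict L -> U(L)                     (hypothetical layout)
    site_params: dict a -> theta(a)                 (position-independent)
    pair_params: dict (j, a, b) -> theta_(j)(a, b)  (depends on the gap j only)
    max_gap:     J, the largest separation considered
    """
    length = len(seq)
    energy = -length_term.get(length, 0.0)
    for a in seq:
        energy -= site_params.get(a, 0.0)
    for gap in range(1, max_gap + 1):
        # 0-based i runs over the L - gap pairs (s_i, s_{i+gap}).
        for i in range(length - gap):
            energy -= pair_params.get((gap, seq[i], seq[i + gap]), 0.0)
    return energy

energy_ACA = distance_energy(
    "ACA",
    length_term={3: 0.5},
    site_params={"A": 1.0, "C": 2.0},
    pair_params={(1, "A", "C"): 0.25, (2, "A", "A"): 0.125},
    max_gap=2,
)
```

Because the parameters depend on the letters and on their separation rather than on absolute positions, sequences of different lengths can be scored with the same parameter set.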

[0115] It is also possible that the statistical energy, for example when the sequences are not particularly long, is defined only by the independent position bias, i.e., only the first term of the expression according to the first example above.

[0116] In general, the parameters are calculated for a predefined selection during a screening experiment of at least one chemical-physical characteristic of interest of the molecule.

[0117] T is the number of rounds/cycles.

[0118] C is the number of targets.

[0119] $\mathcal{S}(\underline{n}(t), C)$ is a function defined as follows: $\mathcal{S}(\underline{n}(t), C) = 1$ if

[00006] $\sum_{s, k} n_{s,k}(t) \leq C$

and $\mathcal{S}(\underline{n}(t), C) = 0$ otherwise.

[0120] Referring to a screening experiment, likelihood is defined as the joint probability of T rounds of selection and amplification as follows (eq. 1):

[00007] $\mathcal{P}(\mathbf{p}, \mathbf{n}, \mathbf{N}, \mathbf{R}) = \mathcal{P}_{reg}(\mathbf{p}) \, \mathcal{P}(\underline{N}(0)) \prod_{t = 0}^{T - 1} \mathcal{P}(\underline{R}(t) \mid \underline{N}(t)) \, \mathcal{P}(\underline{N}(t + 1) \mid \underline{n}(t)) \, \mathcal{P}(\underline{n}(t) \mid \underline{N}(t), \mathbf{p})$

[0121] Where $\mathcal{P}_{reg}(\mathbf{p})$ is the regularization term and the term $\mathcal{P}(\underline{N}(0))$ refers to the distribution of vectors present in round zero. The other three factors have the following definitions.

[0122] The read factor $\mathcal{P}(\underline{R}(t) \mid \underline{N}(t))$ is the probability of extracting a set of reads R={R.sub.s(t)} from a distribution of vectors N={N.sub.s(t)} and is defined by the following equation:

[00008] $\mathcal{P}(\underline{R}(t) \mid \underline{N}(t)) = \frac{R_{tot}(t)!}{\prod_{s} R_{s}(t)!} \prod_{s = 1}^{d} \left( \frac{N_{s}(t)}{N_{tot}(t)} \right)^{R_{s}(t)}$

[0123] where R.sub.tot(t) is the total number of reads in round t.
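The read factor is a multinomial probability, so its logarithm can be evaluated with log-gamma functions to avoid overflowing factorials. The following Python sketch is illustrative only; the function name `log_read_factor` and the list-based inputs are assumptions.

```python
import math

def log_read_factor(reads, vectors):
    """Logarithm of the multinomial read factor P(R(t) | N(t)).

    reads:   list of R_s(t), one entry per sequence s
    vectors: list of N_s(t), one entry per sequence s
    """
    r_tot = sum(reads)
    n_tot = sum(vectors)
    log_p = math.lgamma(r_tot + 1)          # ln R_tot!
    for r_s, n_s in zip(reads, vectors):
        log_p -= math.lgamma(r_s + 1)       # - ln R_s!
        if r_s > 0:
            if n_s == 0:
                # A read of a sequence carried by no vector has probability zero.
                return float("-inf")
            log_p += r_s * math.log(n_s / n_tot)
    return log_p

# Two sequences carried by equally many vectors, one read each.
log_p_even = log_read_factor([1, 1], [5, 5])
```

The amplification factor has the same multinomial form, with the selected vectors n.sub.s(t) in place of N.sub.s(t) and the amplified vectors N.sub.s(t+1) in place of the reads.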

[0124] The second term is an amplification factor, i.e., the probability of having N(t+1) vectors amplified in round t+1 starting from n(t) vectors selected in round t, and is defined as:

[00009] $\mathcal{P}(\underline{N}(t + 1) \mid \underline{n}(t)) = \frac{N_{tot}(t + 1)!}{\prod_{s} N_{s}(t + 1)!} \prod_{s = 1}^{d} \left( \frac{\sum_{k \in sel} n_{s,k}(t)}{n_{tot}(t)} \right)^{N_{s}(t + 1)}$

[0125] The third term refers to selection and represents the probability of selecting n(t) vectors from the N(t) vectors present, and is defined as:

[00010] $\mathcal{P}(\underline{n}(t) \mid \underline{N}(t)) = \frac{\mathcal{S}(\underline{n}(t), C) \prod_{s} \left[ \binom{N_{s}(t)}{n_{s,k}(t), \ldots} \prod_{k} e^{- n_{s,k}(t) E_{k}(s)} \right]}{\sum_{\underline{n}'(t)} \mathcal{S}(\underline{n}'(t), C) \prod_{s} \left[ \binom{N_{s}(t)}{n'_{s,k}(t), \ldots} \prod_{k} e^{- n'_{s,k}(t) E_{k}(s)} \right]}$

[0126] The learning or training procedure consists in finding, via known optimization algorithms, the scalar values of the energy parameters $\theta_{k(i_{1}, \ldots, i_{p})}(s_{i_{1}}, \ldots, s_{i_{p}})$ that maximize the a posteriori joint probability $\mathcal{P}$, with the reads R={R.sub.s(t)} set to the experimental training data from a screening experiment.

[0127] An Example of a Preferred Embodiment: a Two-State System with a Rare and Deterministic Link and the Description of the Epistatic Effect with Two-Site Interactions.

[0128] In a preferred embodiment it is possible to consider only two states, i.e., selected and not selected (for example bound or not bound to the target).

[0129] Furthermore, it is assumed that a bound molecule is a rare event, i.e., one with a probability much less than 1.

[0130] A further assumption is an infinite number of targets, C → ∞. This approximation is realistic if the number of targets is much higher than the number of vectors present in the screening experiment, a condition that is satisfied in the majority of experiments.

[0131] In such a case, the state index k can be dropped: a system with k states has k−1 statistical energies, and here there are two states, so a single statistical energy remains.

[0132] Let n.sub.s(t) be the number of vectors with sequence s that are bound in round t.

[0133] According to such assumptions, the three factors described above are simplified as follows:

[0134] The reads factor $\mathcal{P}(\underline{R}(t) \mid \underline{N}(t))$ is eliminated by considering R(t)≈N(t).

[0135] The amplification factor becomes:

[00011] $\mathcal{P}(\underline{N}(t + 1) \mid \underline{n}(t)) = \frac{N_{tot}(t + 1)!}{\prod_{s} N_{s}(t + 1)!} \prod_{s = 1}^{d} \left( \frac{n_{s}(t)}{n_{tot}(t)} \right)^{N_{s}(t + 1)}$

[0136] The selection factor becomes:

[00012] $\mathcal{P}(\underline{n}(t) \mid \underline{N}(t)) = \prod_{s} \left[ \binom{N_{s}(t)}{n_{s}(t)} p_{s}^{n_{s}(t)} (1 - p_{s})^{N_{s}(t) - n_{s}(t)} \right],$

[0137] where p.sub.s is the probability that sequence s is selected and is defined as $p_{s} = 1/(e^{E_{s}} + 1)$, where $E_{s} = E_{bound}(s) - E_{unbound}(s)$.

[0138] Considering that p.sub.s<<1, i.e., $e^{E_{s}} \gg 1$, it is possible to approximate $p_{s} \approx e^{-E_{s}}$.
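A short numerical check of the rare-binding regime may help. With $p_{s} = 1/(e^{E_{s}} + 1)$, a small selection probability corresponds to a large positive E.sub.s, where p.sub.s approaches $e^{-E_{s}}$. The function names below are hypothetical and the energy value is an arbitrary toy number.

```python
import math

def selection_probability(energy):
    """Two-state selection probability p_s = 1 / (exp(E_s) + 1)."""
    return 1.0 / (math.exp(energy) + 1.0)

def selection_probability_rare(energy):
    """Rare-binding approximation p_s ~ exp(-E_s), valid when exp(E_s) >> 1."""
    return math.exp(-energy)

energy = 10.0                                   # toy value with exp(E_s) >> 1
exact = selection_probability(energy)
approx = selection_probability_rare(energy)
# The relative error of the approximation equals exp(-E_s) itself.
relative_error = abs(approx - exact) / exact
```

The relative error shrinks exponentially as E.sub.s grows, which is why the approximation is adequate when binding is rare.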

[0139] In the current example, the energy is parametrized with interactions for one and two sites:

[00013] $E_{s} = - \sum_{i} \theta_{i}(s_{i}) - \sum_{i < j} \theta_{ij}(s_{i}, s_{j})$

[0140] This expression, together with the probability of selecting a sequence, builds the genotype-phenotype map, in that it associates the sequence (genotype) with a probability of being in a molecular state (phenotype).

[0141] From this it follows that the logarithm of the joint probability becomes:

[00014] $\ln \mathcal{P} = \sum_{s, t} N_{s}(t + 1) \ln \frac{N_{s}(t) e^{- E_{s}}}{\sum_{r} N_{r}(t) e^{- E_{r}}}$

[0142] The adopted approximations render the maximization of $\ln \mathcal{P}$ a concave optimization problem in the parameters θ.sub.i and θ.sub.ij. This problem can be solved using the L-BFGS optimization algorithm once the counts of the sequences selected through the experiments have been assigned as N(t).
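The two-state training step can be sketched in Python with NumPy and SciPy. Several simplifications are assumptions of this example and not part of the description: the energy contains one-site terms only, a small L2 penalty stands in for the regularization term, all counts are strictly positive, and the data are a toy library of four sequences over a two-letter alphabet. SciPy's bounded variant `L-BFGS-B` is used without bounds.

```python
import numpy as np
from scipy.optimize import minimize

def one_hot(seqs, alphabet):
    """Encode equal-length sequences as rows of site-by-letter indicators."""
    index = {a: k for k, a in enumerate(alphabet)}
    q = len(alphabet)
    X = np.zeros((len(seqs), len(seqs[0]) * q))
    for s, seq in enumerate(seqs):
        for i, a in enumerate(seq):
            X[s, i * q + index[a]] = 1.0
    return X

def neg_log_likelihood(theta, X, counts, l2):
    """Return (-ln P + l2 * |theta|^2, gradient); counts[t, s] = N_s(t) > 0."""
    minus_energy = X @ theta                        # -E_s, one-site terms only
    value = l2 * (theta @ theta)
    grad = 2.0 * l2 * theta
    for t in range(counts.shape[0] - 1):
        logits = np.log(counts[t]) + minus_energy   # ln N_s(t) - E_s
        shift = logits.max()                        # numerically stable log-sum-exp
        log_norm = shift + np.log(np.exp(logits - shift).sum())
        probs = np.exp(logits - log_norm)
        nxt = counts[t + 1]                         # N_s(t + 1)
        value -= nxt @ (logits - log_norm)
        grad -= X.T @ (nxt - nxt.sum() * probs)
    return value, grad

# Toy data: "A" is enriched two-fold at each of the two positions.
seqs = ["AA", "AC", "CA", "CC"]
counts = np.array([[100.0, 100.0, 100.0, 100.0],
                   [400.0, 200.0, 200.0, 100.0]])
X = one_hot(seqs, "AC")
result = minimize(neg_log_likelihood, np.zeros(X.shape[1]),
                  args=(X, counts, 1e-6), jac=True, method="L-BFGS-B")
theta_hat = result.x
energies = -(X @ theta_hat)                         # fitted E_s per sequence
```

On this toy library the fitted difference θ(A) − θ(C) at each position is close to ln 2, matching the two-fold enrichment built into the counts. Two-site terms could be added by extending the encoding with pairwise indicator columns.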

[0143] The most general version of the problem, in which the parameters θ.sub.i and θ.sub.ij are calculated by maximizing the likelihood function in its general form, i.e., eq. 1, can also be solved by numerical methods.

[0144] Input Data for Training.

[0145] The input data come from screening experiments of variants of biological polymer molecules, for example Deep Mutational Scan, Directed Evolution or techniques based on SELEX.

[0146] The input data are biological sequences, e.g., amino acids or nucleotides of the mutants used in the experiment together with the counts for the selection rounds.

[0147] These can be obtained from sequencing data (DNA reads in fastq format for example).

[0148] The typical bioinformatics procedure, described in FIG. 2, corresponds to the preprocessing carried out in the four datasets used as tests summarized in Table I below.

[0149] The sequenced material consists of DNA strands. Starting from the set of reads for each round, for example in fastq format, the procedure has the following steps:

[0150] filtering of reads with low sequencing quality or whose forward and reverse reads do not match;

[0151] translation of the nucleotide sequences into amino acid sequences and elimination of the sequences with a stop codon;

[0152] counting the number of sequences in each round;

[0153] filtering of sequences with a total number of occurrences in the various rounds of less than ten.
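The translation, counting and filtering steps above can be sketched in Python as follows. The sketch assumes that quality filtering and forward/reverse matching have already been applied, that reads are in frame, and that the standard genetic code is used; the names `translate` and `count_rounds` are hypothetical.

```python
from collections import Counter

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate an in-frame DNA read; unrecognized codons become 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

def count_rounds(reads_per_round, min_total=10):
    """Return {protein: [count in round 0, count in round 1, ...]}."""
    counts = [Counter() for _ in reads_per_round]
    for t, reads in enumerate(reads_per_round):
        for dna in reads:
            protein = translate(dna.upper())
            if "*" in protein:          # discard sequences with a stop codon
                continue
            counts[t][protein] += 1
    totals = Counter()
    for per_round in counts:
        totals.update(per_round)
    # Keep sequences observed at least min_total times over all rounds.
    kept = sorted(s for s, n in totals.items() if n >= min_total)
    return {s: [per_round[s] for per_round in counts] for s in kept}

round0 = ["ATGGCT"] * 6 + ["ATGTAA"] * 20 + ["ATGGGT"] * 3
round1 = ["ATGGCT"] * 8 + ["ATGGGT"] * 2
table = count_rounds([round0, round1])
```

In this toy input the read containing a stop codon is dropped regardless of its abundance, and the variant seen only five times in total falls below the threshold of ten.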

[0154] Training of the Probabilistic Model

[0155] This step, as indicated in the previous paragraphs, numerically solves the problem of maximizing the likelihood function as a function of the parameters of the at least one statistical energy function of the selection factor described above. The optimal values of the parameters thus obtained are characteristic of the set or library of training sequences entering the model and change if these training sequences are modified.

[0156] Uses of the Parameters of the Statistical Energy Function

[0157] The probabilistic method described above therefore analyzes sequencing data libraries deriving from experiments of selection and enrichment of mutant biological sequences with the aim of evaluating at least one input mutant biological sequence and selecting the best for an assigned chemical-physical characteristic. This characteristic is quantified, through the allowed molecular states, with a statistical energy function or with a combination of the statistical energies defined through the energy parameters of each of the statistical energy functions calculated during the learning phase.

[0158] Subsequently it is possible, starting from these parameters, to generate a library of biological sequences with the chemical-physical characteristic of interest.

[0159] The model is then applicable for:

[0160] Evaluating the mutants and selecting the best ones based on a given characteristic. In particular, once the energy parameters as defined above have been determined, it is possible to apply these parameters to calculate the at least one statistical energy function relating to an input sequence, as a non-limiting example to a new sequence, i.e., one not belonging to the sequences used in the training step. The statistical energies E.sub.k(s), once the related energy parameters have been calculated by maximizing the likelihood function, are in fact functions whose argument is a biological sequence. From the energy functions, each of which is associated with one of the molecular states observed during the selection experiment, it is possible to obtain the score of a sequence associated with the desired chemical-physical characteristic to which the molecular states refer.

[0161] Generating a library of biological sequences (e.g., of proteins with a given chemical-physical characteristic), and inferring a set of sequences characterized by an optimized function of statistical energy.

[0162] Mutant Evaluation.

[0163] In a preferred embodiment of the method, a score of a given biological sequence (amino acid or nucleotide) is calculated relative to a characteristic or activity of the encoded protein. The biological sequence to be evaluated can come from the training data (first row of the box on the right of FIG. 2) or from other experiments whose data have not been used for learning the model (second row of the box on the right of FIG. 2).

[0164] In the two-state construction (e.g., bound or unbound) formulated above, the score is, for example, identified with the statistical energy itself:

[00015] $E_{s} = - \sum_{i} θ_{i} (s_{i}) - \sum_{i < j} θ_{ij} (s_{i}, s_{j})$
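
As an illustration only, the two-state statistical energy above can be evaluated with a few lines of Python. This is a minimal sketch under stated assumptions: the 20-letter amino-acid alphabet, the array layout of the parameters and the names `statistical_energy`, `theta1` and `theta2` are hypothetical and are not prescribed by the method.

```python
import numpy as np

# Assumed alphabet (20 standard amino acids); a nucleotide alphabet works the same way.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(ALPHABET)}


def statistical_energy(seq, theta1, theta2):
    """Return E_s = -sum_i theta_i(s_i) - sum_{i<j} theta_ij(s_i, s_j).

    theta1: array of shape (L, q) holding the single-site parameters.
    theta2: array of shape (L, L, q, q); only entries with i < j are read.
    Both layouts are assumptions made for this sketch.
    """
    idx = [INDEX[aa] for aa in seq]
    length = len(idx)
    energy = 0.0
    for i in range(length):
        # Single-site contribution of position i.
        energy -= theta1[i, idx[i]]
        for j in range(i + 1, length):
            # Two-site (epistatic) contribution of the pair (i, j).
            energy -= theta2[i, j, idx[i], idx[j]]
    return energy
```

With the sign convention of the formula, positive parameters lower the energy of the sequences that carry the corresponding residues.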

[0165] In an embodiment with three states (e.g., bound, unbound and unfolded states), the score Φ can be defined as the following combination of the energy of the bound state E.sub.L and the energy of the folded state E.sub.F:

[00016] $Φ (s) = \frac{1}{1 + e^{E_{L} (s)} (1 + e^{E_{F} (s)})}$

[0166] The score is calculated simply by selecting the statistical energy parameters corresponding to the sequence of the input mutant and applying the score formula, which in the two-state example coincides with the statistical energy function given above.

[0167] In particular, a high value of the three-state score above is equivalent to a high probability that the related molecules are in the bound and folded state in a given experiment. On the other hand, if the score is defined as identical to the statistical energy, a low value indicates a high probability of being in the bound state.
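
The three-state score can be sketched in the same spirit. The function name `three_state_score` and its arguments are hypothetical; the two energies E.sub.L(s) and E.sub.F(s) are assumed to have been computed beforehand from their respective parameters.

```python
import math


def three_state_score(e_bound, e_folded):
    """Return Phi(s) = 1 / (1 + exp(E_L(s)) * (1 + exp(E_F(s)))).

    e_bound:  statistical energy E_L(s) of the bound state.
    e_folded: statistical energy E_F(s) of the folded state.
    """
    return 1.0 / (1.0 + math.exp(e_bound) * (1.0 + math.exp(e_folded)))
```

When both energies are zero the score equals 1/3, and it approaches 1 as both energies become strongly negative, in line with the interpretation of a high score as a high probability of the bound and folded state.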

[0168] Generation of One or More Optimal Sequences.

[0169] In a preferred embodiment of the method, it is possible to generate sequences that maximize the score function Φ.sub.s of the model. The latter is defined in general starting from energy functions E.sub.k(s) of each considered state k.

[0170] The generation of sequences with an optimal score takes place through an algorithm that searches for the sequence or sequences that globally or locally maximize an assigned score function Φ.sub.s. An efficient algorithm for this purpose is simulated annealing, a standard optimization algorithm.
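
A minimal sketch of such a search is given below, assuming single-site mutation moves, a geometric cooling schedule and a Metropolis acceptance rule. These choices, together with the names `simulated_annealing`, `score`, `t_start` and `t_end`, are assumptions made for illustration; the description only states that simulated annealing can be used.

```python
import math
import random


def simulated_annealing(score, start, alphabet, n_steps=5000,
                        t_start=1.0, t_end=0.01, seed=0):
    """Search for a sequence with a high value of score(sequence).

    score:    callable mapping a sequence string to a float (to be maximized).
    start:    initial sequence string.
    alphabet: string of allowed symbols.
    Returns the best sequence visited and its score.
    """
    rng = random.Random(seed)
    current = list(start)
    current_score = score("".join(current))
    best, best_score = "".join(current), current_score
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        # Propose a single-site mutation.
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(alphabet)
        new_score = score("".join(current))
        delta = new_score - current_score
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current_score = new_score
            if current_score > best_score:
                best, best_score = "".join(current), current_score
        else:
            current[pos] = old  # reject the move and restore the residue
    return best, best_score
```

If the score is identified with the statistical energy, for which low values are favorable, the negative of the energy can be passed as `score`. Running the search from several starting sequences or seeds yields a set of locally optimal sequences rather than a single one.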

[0171] According to a preferred embodiment, the data are derived from DMS experiments using one of the protein display techniques as described in (Fowler, Douglas M., and Stanley Fields. “Deep mutational scanning: a new style of protein science.” Nature methods 11.8 (2014): 801.).

[0172] According to another preferred embodiment, the data are derived from DMS and Directed Evolution (DE) experiments that are aimed at selecting protein variants that effectively bind a specific molecular target, preferably a peptide or protein, and the method has the purpose of selecting the most selective protein variants among those analyzed.

[0173] According to another preferred embodiment, the data are derived from DMS and DE experiments which are aimed at selecting protein variants that effectively bind a specific molecular target, preferably a peptide or protein, and the method has the aim of generating a library of protein variants that are more selective for the molecular target.

[0174] According to another preferred embodiment, the data are derived from DMS and DE experiments which are aimed at selecting protein variants effective in carrying out a specific catalysis and the method according to the invention has the purpose of generating a library of protein variants that are more active in said enzymatic catalysis.

[0175] According to another preferred embodiment, the data are derived from DMS and DE experiments which are aimed at selecting protein variants effective in carrying out a specific catalysis and the method according to the invention has the purpose of selecting from the library the most active protein variants in the enzymatic catalysis.

[0176] According to another preferred embodiment, the data are derived from DMS and DE experiments which are aimed at selecting protein variants effective in achieving optimal photoluminescence and the method according to the invention has the purpose of selecting the most photoluminescent variants from the library.

[0177] According to another preferred embodiment, the data are derived from DMS and DE experiments which are aimed at selecting active protein variants at high temperature and the method according to the invention has the purpose of selecting the most heat-resistant variants from the library.

[0178] According to a preferred embodiment, the data are derived from SELEX experiments or techniques based on it, known to those skilled in the art; as a non-limiting example, see Zhuo Z, et al., “Recent Advances in SELEX Technology and Aptamer Applications in Biomedicine.” Int J Mol Sci. 2017 Oct. 14; 18 (10).

[0179] According to another preferred embodiment, the data are derived from SELEX experiments or techniques based on it which are aimed at selecting aptamers effective in binding a specific molecular target and the method has the purpose of selecting the most selective aptamers among those analyzed.

[0180] According to another preferred embodiment, the data are derived from SELEX experiments or techniques based on it which are aimed at selecting aptamers effective in binding a specific molecular target and the method aims to generate a library of more selective aptamer variants for the molecular target.

[0181] According to another preferred embodiment, the method is applied to the so-called “Machine-learning-assisted Directed Evolution” process. In this embodiment, the method is trained on data derived from one or more Directed Evolution rounds and applied to generate effective protein variants to be modified and tested in subsequent Directed Evolution rounds, according to the scheme called Machine-Learning-assisted Directed Evolution.

[0182] The method of the present invention is therefore effective and reliable for the in-silico screening of libraries of protein or nucleotide sequences obtained from DMS or DE experiments or from techniques based on SELEX, and for the selection of mutants with the desired characteristics. According to the method of the present invention, it is also possible to use data derived from these experiments to obtain a library of sequences that are highly efficient with respect to a chemical-physical characteristic of a molecule, where high efficiency means, for example, high catalysis capacity, high fitness or high ability to bind to a specific molecular target.

[0183] The method according to the invention is generally applicable to all types of DMS or DE experiments with at least one selection cycle.

[0184] The method according to the invention is generally applicable to all types of HTS-SELEX (High-Throughput Sequencing SELEX) experiments and techniques derived from it with at least one selection cycle.

EXAMPLES

[0185] The following examples are illustrative of the invention and are in no case to be considered as limiting its scope.

[0186] We report below the performance of the method on data from four DMS experiments, which are briefly described below and whose characteristics are summarized in Table I.

TABLE-US-00001 TABLE I

| Bibliography | Protein | Selection target | # mutated positions | # sequenced rounds | # mutants |
| --- | --- | --- | --- | --- | --- |
| Boyer et al. PNAS (2016) | Antibody heavy chain | PVP, DNA | 4 | 3 | ~2.5 × 10.sup.4 |
| Fowler et al. Nature methods (2010) | hYAP65 (WW domain) | GTPPPPYTVG peptide | 25 | 3 | ~3 × 10.sup.5 |
| Araya et al. PNAS (2012) | hYAP65 (WW domain) | GTPPPPYTVG peptide | 34 | 4 | ~6 × 10.sup.5 |
| Olson et al. Current Biology (2014) | GB1 | IgG-Fc | 55 | 2 | ~5.3 × 10.sup.5 |

Example 1

Prediction of the Selectivity of Binding of Mutant Antibodies on Data Deriving from a DMS Experiment Carried Out by Means of a Phage Display

[0187] The model has been tested on data published by S. Boyer et al, “Hierarchy and extremes in selections from pools of randomized proteins.” PNAS (2016).

[0188] The DMS experiment reported is aimed at analyzing a library of antibodies and their binding to a neutral synthetic polymer, polyvinylpyrrolidone (PVP). In this case, the experiment was conducted by carrying out three rounds of amplification and selection using the phage display technique. The initial library was created by saturation mutagenesis of an anti-PVP antibody on four consecutive amino acids of complementarity-determining region 3 (CDR3).

Example 2

Prediction of the Binding Selectivity of WW Domain Mutants of the hYAP65 Protein, Data Deriving from a DMS Experiment Carried Out Using a Phage Display

[0189] The model has been tested on data published by D. M. Fowler et al, “High-resolution mapping of protein sequence-function relationships.” Nature methods (2010).

[0190] The DMS experiment reported is aimed at analyzing a library of WW domain mutants selected to bind a peptide ligand (GTPPPPYTVG). The experiment was conducted by carrying out 6 rounds of amplification and selection using the phage display technique and sequencing 3 of them (rounds 0, 3 and 6). The initial library was created with the “doped oligonucleotide synthesis” technique.

Example 3

Prediction of the Binding Selectivity of WW Domain Mutants of the hYAP65 Protein, Data Deriving from a DMS Experiment Carried Out by Means of Phage Display

[0191] The model has been tested on data published by C. L. Araya et al. “A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function.” PNAS (2012).

[0192] The DMS experiment reported is aimed at analyzing a library of WW domain mutants selected to bind a peptide ligand. In this case, the experiment was conducted by carrying out 4 rounds of amplification and selection using the phage display technique. The initial library was created through chemical synthesis of DNA with “doped nucleotide pools”.

Example 4

Prediction of the Binding Selectivity of Mutants of the IgG-Binding Domain of Protein G (GB1), Data Deriving from a DMS Experiment Performed by Means of mRNA Display

[0193] The model has been tested on data published by C. Olson et al. “A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain.” Current Biology (2014).

[0194] The DMS experiment reported is aimed at analyzing a library of mutants of the GB1 protein selected for binding to IgG-Fc. In this case, the experiment was conducted by carrying out one round of amplification and selection using mRNA display. The initial library was created with a saturation mutagenesis technique.

[0195] Type of Test

[0196] The dataset derived from these experiments was divided randomly into a training set for the model and a test set, on which the statistical binding energy of the model was compared with an experimental measure of the ability of the mutants to bind and therefore be selected for the next round. This measure of a mutant's selectivity is defined from the ratio between the frequency of occurrence of the sequence in one round and its frequency in the preceding round, averaged over successive pairs of rounds. In formulas, we define the selectivity of a sequence s as:

[00017] ${Sel}_{s} = \sum_{t = 1}^{T - 1} \frac{f_{s} (t + 1)}{f_{s} (t)}$

[0197] where f.sub.s(t) is the frequency of the sequence s under examination at round t and T is the total number of rounds (FIGS. 4, 5 and 6).
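
The selectivity measure can be computed directly from the per-round frequencies of a sequence. The sketch below reads the large operator in the selectivity formula as a sum over successive pairs of rounds; the function name `selectivity` and the optional `average` flag, which divides by the number of pairs T − 1 to obtain the average mentioned in the text, are assumptions introduced for illustration.

```python
def selectivity(freqs, average=False):
    """Return Sel_s = sum_{t=1}^{T-1} f_s(t+1) / f_s(t).

    freqs:   frequencies f_s(1), ..., f_s(T) of one sequence, one per sequenced round.
    average: if True, divide the sum by T - 1 (assumed normalization).
    """
    if len(freqs) < 2:
        raise ValueError("at least two sequenced rounds are required")
    if any(f <= 0 for f in freqs[:-1]):
        raise ValueError("the frequency must be positive in every round but the last")
    # Ratio of each round's frequency to that of the preceding round.
    total = sum(nxt / cur for cur, nxt in zip(freqs[:-1], freqs[1:]))
    return total / (len(freqs) - 1) if average else total
```

For example, frequencies of 0.25, 0.5 and 0.25 over three rounds give ratios of 2 and 0.5, so the sum is 2.5 and the average over the two pairs is 1.25. A sequence that is absent from a round has an undefined ratio for the following pair, which is why the sketch rejects non-positive frequencies.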

[0198] The embodiment of the method used in these tests is the one described in the previous paragraphs for a two-state system with rare and deterministic binding, in which the epistatic effect is described with two-site interactions.

CPC classification: G16B25/00; G16B35/20; G16B20/50; G06F17/11

International classification: G16B35/20; G06F17/11