METHOD AND SYSTEM FOR ADVERSARY RESILIENT SCREENING OF SYNTHETIC GENE ORDERS

Abstract

A system and methods for adversary resilient screening of synthetic gene orders, include: applying a first alignment algorithm to generate multiple local alignments of a query sequence with a target sequence, wherein each local alignment aligns a substring of the query sequence to a substring of the target sequence to maximize an alignment score; determining unaligned sections of the query sequence to be alignment gaps; and removing k largest alignment gaps of the query sequence to generate a clean query sequence.

Claims

1. A computer-based method for screening DNA sequences to detect obfuscated sequences of concern (SOCs), comprising: receiving a query sequence for screening against a target database of sequences; applying a first alignment algorithm to generate multiple local alignments of the query sequence with a target sequence of the target database, wherein each local alignment aligns a substring of the query sequence to a substring of the target sequence to maximize an alignment score; from the aligned substrings of the query sequence, determining unaligned sections of the query sequence to be alignment gaps; and removing a number of largest alignment gaps of the query sequence to generate a clean query sequence and a respective clean alignment score that is an indicator of a homology of the target sequence and an obfuscated SOC in the query sequence.

2. The method of claim 1, wherein the number of largest alignment gaps is determined as one less than a number of local alignments of the multiple local alignments that, when combined, align with more of the target sequence than any other combination of the multiple local alignments.

3. The method of claim 1, further comprising: applying a second alignment algorithm to generate a clean alignment between the clean query sequence and the target sequence and to generate the clean alignment score, and outputting the clean alignment score indicating the homology of the target sequence and the obfuscated SOC in the query sequence.

4. The method of claim 3, wherein the first and second alignment algorithms are the same alignment algorithm.

5. The method of claim 3, wherein the first and second alignment algorithms are BLAST algorithms.

6. The method of claim 3, further comprising: reducing the clean alignment score by a gap removal penalty proportional to the number of largest alignment gaps removed, to calculate an adjusted clean alignment score indicating the homology of the target sequence and the obfuscated SOC in the query sequence.

7. The method of claim 6, wherein the adjusted clean alignment score is further adjusted by addition of a probability of biologically successful gap removal to generate a physical SOC.

8. The method of claim 6, wherein the gap removal penalty proportional to the number of largest alignment gaps removed is a per gap removal penalty (prm) multiplied by the number of largest alignment gaps removed.

9. The method of claim 8, wherein prm is a function of a number of base pairs removeable by bioengineering tools.

10. The method of claim 9, wherein the adjusted clean alignment score further includes a negative increment for each gap opening (pgo) and for each gap extension (pgx), and wherein prm=pgo+pgx.Math.x, where x is the removeable number of base pairs.

11. The method of claim 1, further comprising removing different numbers of alignment gaps from the query sequence to generate different clean query sequences; reapplying the second alignment algorithm to each of the different clean query sequences to generate multiple respective clean alignments and respective adjusted clean alignment scores; and wherein outputting the clean alignment and the adjusted clean alignment score comprises determining a maximum score from among the multiple adjusted clean alignment scores and outputting the maximum score as the adjusted clean alignment score and outputting the respective clean alignment.

12. The method of claim 11, further comprising iterating generation of clean alignments and respective adjusted clean alignment scores with respect to all of the target database sequences.

13. The method of claim 12, further comprising ordering the clean alignments and the adjusted clean alignment scores generated with respect to all of the target database sequences according to the adjusted clean alignment score.

14. The method of claim 1, wherein the database of target sequences is a database of SOCs.

15. The method of claim 1, where the number of alignment gaps removed is set to be all gaps greater than a preset threshold number of base pairs, and wherein the number of alignment gaps removed is subsequently incremented by an iterative process removing the largest alignment gap not previously removed, until the adjusted clean alignment score calculated following the gap removal does not increase.

16. The method of claim 1, wherein the first alignment score includes a positive increment for each matching character (rm) and a negative increment for each mismatching character (pmm), for each gap opening (pgo), and for each gap extension (pgx).

Description

BRIEF DESCRIPTION OF DRAWINGS

[0024] For a better understanding of various embodiments of the invention and to show how the same may be carried into effect, reference will now be made, by way of example, to the accompanying drawings. Structural details of the invention are shown to provide a fundamental understanding of the invention, the description, taken with the drawings, making apparent to those skilled in the art how the several forms of the invention may be embodied in practice. In the figures:

[0025] FIG. 1 is a flow chart of a process for adversary resilient screening of DNA sequences, in accordance with an embodiment of the invention; and

[0026] FIGS. 2-4 are graphs representing sequence alignment results of a process for adversary resilient screening of DNA sequences; in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

[0027] Embodiments of the present invention provide a system and methods for adversary resilient screening of synthetic gene orders, that is, screening of DNA sequences that may hide fragments of toxic or otherwise dangerous substances.

[0028] Consider a case where a bioterrorist wants to generate a synthetic virus, a dangerous toxin, or other SOCs. Screening of DNA orders is necessary to prevent such bioterrorism, but an attacker may try to circumvent screening by ordering an obfuscated SOC. Obfuscation may include multiple gRNA scaffolds, each targeting a different cut point, though this could potentially reduce the effectiveness of SOC reconstruction (“decoding”). The size of the SOC fragments and the HDR templates may also vary. Reducing the size of the SOC fragments increases their likelihood of blending within camouflage genes, but also leads to a larger number of cuts and repairs, consequently reducing the effectiveness of reconstruction. The currently known lower bound on the size of HDR templates is 64 bp.

[0029] FIG. 1 is a flow chart of a DNA screening process 100 for identifying SOCs, and in particular for assessing the difficulty of SOC assembly from a given DNA sequence. The process thus improves the resilience of synthetic DNA production against “adversarial” attempts to produce SOCs. Hereinbelow, the process is also referred to as gene edit distance computation, or “GED.” GED screens a query sequence to find all substrings which are similar to fragments of an SOC. Then, GED quantifies the effort of assembling a SOC from these fragments. Although designed with a focus on SOCs, GED can quantify the effort required to assemble any target sequence t from a query sequence q using a standard CRISPR system. This determination includes a step of counting the number of cuts and repairs required for constructing the target sequence from the query sequence.

[0030] In a standard biological sequence alignment, a typical objective is to identify genes conserved in different genomes. To achieve this objective, an alignment algorithm should have parameters that include a match reward (rm), a mismatch penalty (pmm), a gap opening penalty (pgo), and a gap extension penalty (pgx).

[0031] However, identification of SOC fragments requires identification of short, conserved regions. Within these regions, alignments with target sequences should involve minimal gaps within the query sequence. On the other hand, it should be possible to concatenate multiple short, conserved regions regardless of the length of gaps that separate them. This is because when removing a sequence between two consecutive SOC fragments, using the CRISPR system and HDR templates, the distance between the fragments is irrelevant.

[0032] At a first step 102 of process 100, a query sequence q is received, which may be, for example, a sequence that is ordered for synthetic DNA production and which must be scanned to ensure that it is not an SOC.

[0033] At a step 104, the query sequence is scanned against a target sequence t, also called a subject sequence.

[0034] Alignment algorithms, such as BLAST, generally run against a database of potential target sequences and return a set of target sequences that are similar to the query sequence, as well as a set of alignments (also called ranges) for each target sequence. Algorithms based on BLAST typically compute a large set of small local alignments and extend these alignments when doing so increases the alignment score. The process described herein merges these small local alignments to maximize an adjusted score, Score.sup.k.

[0035] At the step 104, an alignment algorithm generates A.sub.q,t , which is a set of local alignments between q and t. Hereinbelow, q[i] denotes the i'th character in q; and t[i] denotes the i'th character in t. Every alignment in α∈A.sub.q,tmaps a substring (i.e., a range of character positions) in q to a substring in t, such that for any two successfully aligned character positions i>j, α(i)>α(j). Hereinbelow, a substring of q of an alignment α is denoted as dom(α) and a substring of t is denoted as img(α)=α(dom(α)). Similarly, α.sup.−1 denotes the inverse alignment, such that dom(α.sup.−1)=img(α) and img(α.sup.−1)=dom(α).

[0036] The score of a BLAST alignment may be computed based on the number of: [0037] matched characters, M=|{i:q[i]=t[α(i)]}|; [0038] the number of mismatched characters, MM=|{i:q[i]≠t[α(i)]}|; [0039] the number of gaps opened in both the query and the target sequences, G=|{i:α(i−1)≠⊥Λα(i)=⊥}|+|{i:α.sub.−1(i−1) ≠⊥Λα.sup.−1(i)=⊥}|; and [0040] the total extent of the gaps, GX=|{i:α(i)=⊥}|+|{i:α.sup.−1(i)=⊥}|,
where ⊥ symbolizes a character that is not aligned.

[0041] There is a reward (rm) for every matching character and penalties for mismatching characters (pmm), gap opening (pgo), and gap extension (pgx). The reward and penalties are configurable. A score of an alignment may be calculated as:

Score=rm.Math.M−pmm.Math.MM−pgo.Math.G−pgx≠GX

[0042] The fraction of q characters successfully aligned is called query coverage (QC): QC(α)=MM/|q|. The percent identity is computed from the sizes of the domain, the image, and the number of successfully mapped characters:

PI(α)=2.Math.MM/(|dom(α)|+|img(α)|)

[0043] At a step 108, unaligned sections of the query sequence (i.e., gaps in the target sequence) are identified as potential camouflage sections. A per-gap removal penalty (prm) may then be applied instead of some gap opening and extension penalties within the target sequence, for each gap removed.

[0044] At a step 110, we remove the k largest sections of potential camouflage sections. In some embodiments, the k sections removed from the query sequence may be determined as the number of disjoint local alignments (denoted as P) minus one. This output “clean” query sequence is then provided to a processing step 112 to be realigned with the same target sequence. If multiple combinations of disjoint local alignments may be applied, these combinations are also tested as described below with respect to step 116. The output clean query sequence may also be stored for subsequent analysis of adversarial attacks.

[0045] FIG. 2 indicates two local alignments of a target sequence t with disjoint sections of a query sequence, q. Note that the alignments have overlapping “images,” that is, the aligned sections of the target query overlap (the overlapping sections indicated as 202 and 204). The two sections of the query sections may be joined at a point 206 to create a clean sequence with high homology to the target sequence.

[0046] For a given target sequence gap g=[a,b] (that is, ∀i∈g, α(i)=⊥), a corresponding section g=[a,b] from the query sequence can be removed with two cuts, at a and at b respectively, followed by an HDR.

[0047] Removing the section g=[a,b] increases the score of the alignment by pgo+pgx.Math.(b−a+1), which we then reduce by prm. G.sub.q(α) and G.sub.t (α) respectively represent sets of all gaps in the query and target sequences according to the alignment α. Let G.sub.tk.Math.G.sub.t be a subset of the k longest gaps in the target sequence to be removed. We represent a probability that all gaps in G.sub.tk will be successfully removed as γ.sub.k. Let .start and .end denote the first and the last position, respectively, in the domain or the image of an alignment α∈A.sub.q,t. Let P=α.sub.1, . . . , α.sub.k be a set of alignments whose domains are disjoint, dom(α.sub.i).end<dom(α.sub.i+1).start. Their images may overlap, as depicted in FIG. 2 BLAST can find an alignment between the unified domains of the alignments (i.e., a “clean” sequence) and the target sequence. Such unified alignments are also referred to as cleaned alignments. The score of a clean alignment is indicated below as Score.sup.C.

[0048] Next, we adjust the clean alignment score to account for the probability of successful removal of the G.sub.tk sections by adversarial agents. We first calculate a virtual score, Score.sup.V, as a score that BLAST would produce if it were to generate the clean alignment (i.e., the alignment of the same domain and image produced by the clean alignment) but without the removal of gaps. Such a score may be calculated as follows:

[00001] ${Score}^{V} = {Score}^{C} - \underset{g \in G_{tk}}{.Math.} (pgo + pgx .Math. .Math. g .Math.)$

That is, Score.sup.V equals Score.sup.C, less the gap opening and gap extension penalties, pgo and pgx, for each of the gaps removed in the clean alignment.

[0049] Next, we generate an adjusted alignment score as the result of the removal of the k longest gaps in the target sequence t, given a probability of successful gap removal γ.sup.|G.sup.tk.sup.|, and considering the gap removal penalty prm:

[00002] ${Score}^{k} (α) = {Score}^{V} + γ^{.Math. G_{tk} .Math.} .Math. \underset{g \in G_{tk}}{.Math.} (pgo + pgx .Math. .Math. g .Math. - prm)$

where the alignment is indicated as α; g is a subset of gaps, G.sub.tk, in the target sequence of α; the gap opening and gap extension penalties are pgo and pgx respectively; the gap removal penalty is prm; and the gap removal probability is γ. As indicated in the equation, Score.sup.k is a function of a total gap removal penalty, which is a sum of gap removal penalties for all k, i.e., a penalty that is proportional to k, namely, k*prm. In addition, Score.sup.k is a function of the gap removal probability, that is, the probability of successful reconstruction of the intended SOC.

[0050] In embodiments of the present invention, the match reward (rm), as well as the mismatch, gap opening, and gap extension penalties (pmm, pgo, pgx) may be set to BLAST default values. In further embodiments, parameters for a BLAST alignment may be rm=2, pmm=3, pgo=5, and pgx=2.

[0051] The value of γ may be set according to assumptions made regarding the expected biological effectiveness of an adversarial “attack” given typical bioengineering tools. A value of γ=0 signifies that no attacker could ever rely on residue Cas9 protein in the cells and the SOC obfuscation attack is impossible. A value of γ=1 signifies 100% success of gene editing, which leads to the successful reconstruction of sequences regardless of the number of SOC fragments. In one test described below, we set γ=0.99.

[0052] The gap removal penalty prm should be set to a value that would justify gap removal using bioengineering tools. For example, if only the removal of gaps larger than x bp is justified, then the gap removal penalty should be set to: prm=pgo+pgx.Math.x.

[0053] The longer a gap is, the higher the alignment penalty and therefore the greater the score enhancement that can be achieved by removing the gap. To optimize Score.sup.k, the longest gaps are therefore selected for removal. The number of gaps k that should be included in G.sub.tk to achieve the maximal Score.sup.k depends on the parameters of the alignment algorithm and can be optimized using a grid search or simple hill climbing algorithm.

[0054] Referring again to FIG. 1, steps 110 through 112 are typically an iterative process, a process that is repeated until an optimal value of Score.sup.k is achieved.

[0055] At a step 114, a test is made as to whether an optimal Score.sup.k has been achieved. This is typically assessed by determining that increasing k for the most recent iteration has decreased the store, meaning that the highest score was achieved at the prior iteration. If different local alignments indicate different sections of the query sequence that may be preserved in the “clean alignment,” these different sections may be selected at a step 116, to be applied in iterations of steps 110 and 112 to achieve the optimal Score.sup.k. Additionally, If the maximum has not been achieved, then k may be incremented at step 116, after which steps 110 and 112 are repeated. Alternatively, a set of aligned substrings of the query sequence may be selected to maximize the coverage of the target sequence. Denoting P as the number of aligned substrings in such a set, k=P−1. If multiple combinations of aligned substrings cover the same extent of the target sequence are possible, the optimal combination may be selected according to the value of Score.sup.k.

[0056] Hereinbelow, a GED difference (GEDD) is a measure of the optimal number of gaps k to remove in a target sequence alignment such that the adjusted alignment score is maximized The GEDD from q to t is:

GEDD(q,t)=ARGMAX.sub.k(MAX.sub.αScore.sup.k(α)).

[0057] According to this definition Score.sup.GED(q,t)(α) is the maximal adjusted score for aligning the query with a given target, after optimal removal of potential camouflage fragments. (Note that GED quantifies the process of transforming q into t but not vice versa.) Until a determination is made at a step 118 that all target sequences have been processed, steps 104 through 114 are repeated for every sequence in the target database, a new target being extracted at the database at a step 120 before each iteration. Typical the target database is a database of known SOC sequences. A final determination of the risk of a query sequence is made based on the maximal adjusted score of the query sequence for any of the SOC sequences. The maximal adjusted score corresponds to an SOC that could most directly be generated from the query sequence.

[0058] The algorithm for GED listed below shows the steps of the process 100 as heuristic pseudocode, using the symbolic representations described above:

TABLE-US-00001 Pseudocode: Gene Edit Distance Input: q - a query sequence t - a target sequence Output: k - the number of cut and repair operations Score.sup.k - adjusted alignment score / / List local alignments A = {.sub.α} sorted by dom(α).start 1 A.sub.q,.sub.t = BLAST(query=q, subject=t) 2 for each subset P .Math. A.sub.q,t of local alignments with disjoint domains do 3 |_ .sub.αP = MERGE(P) 4 P* = argmax.sub.PScore.sup.|P|−1 (.sub.αP) 5 return k = |P*|, Score.sup.k (.sub.αP*) 6 Function MERGE(.sub.α1,..., (.sub.αk): | /* Find global alignment between unified domains | and t */ 7 | A = BLAST(query = q[S.sub.idom (.sub.αj)],subject = t) 8 | .sub.α = cleaned global alignment unifyingA | /* Reintroduce gaps to form a continuous | alignment */ 9 | .sub.αU = {(i, ⊥): i ∈ [dom (.sub.αj),end, dom (.sub.αj+ .sub.1).start], 1 ≤ |_ j < k} return .sub.α

[0059] The output of GED is the adjusted alignment score and the optimal k, that is, the number of cut and repair actions required to reconstruct the target sequence from the query sequence. The adjusted alignment score quantifies both the similarity of the body of an obfuscated SOC, and the effort required to “decode” it within a cell. GED is designed to detect the main body of an obfuscated SOC, because the gRNA scaffolds and HDR templates required for splicing may be distributed among different plasmids or even different orders.

[0060] FIG. 3 presents a graph of the results of aligning an obfuscated α-conotoxin PeIA (a short, toxic pepsin) to the PeIA sequence. BLAST returns three ranges (alignments), indicated as α.sub.1, α.sub.2, and α.sub.3. If pgx>0, these ranges will not be merged by BLAST due to the length of the gap that would be opened in the target sequence. Merging α.sub.1 and α.sub.2 would give the highest value for Score.sup.k (higher than for merging α.sub.1 and α.sub.3, because the alignment of α.sub.2 with the target sequence extends farther than the alignment of α.sub.3 with the target sequence).

[0061] FIGS. 4A and 4B present graphs of an additional alignment example. A 10 Kpb long query sequence is aligned to a 2 Kpb long SOC. An optimal alignment between the two sequences contains 2 K matching base pairs, a few mismatches, and 50 gaps whose length is exponentially distributed, as shown in FIG. 4A. Before removing any gaps (k=0) the adjusted alignment score is less than −4,500. With a gap removal penalty of prm=20, gaps are removed, replacing gap opening and gap extension penalties with prm. FIG. 4B shows the resulting graph of k vs. the adjusted alignment score Score.sup.k. As shown, with γ=0.98, k=10 results in the highest Score.sup.k value. With γ=0.99, the adjusted alignment score can reach 2,000 when k=13 in this example.

[0062] Benchmark sequence screening was performed using local copies of the complete GenBank NT and NR and UniProt databases. While GenoTHREAT requires the entire database for accurate screening, GED only requires the SOCs to compare with the query sequences. We considered the true positive rate (TPR), the false negative rate (FNR), and the false positive rate (FPR). High FNR (low FPR) indicated that the screening method could be evaded using obfuscation. Because some malicious sequences are easier to detect than others, we the hit counts for each group of sequences. In order to analyze the performance of the screening algorithms, we also inspect their confidence levels. The confidence of GED (GEDConf) is simply the maximal adjusted score it returns when screening q. The value of GEDConf is between zero and one, where GEDConf=1 means that q is definitely a SOC. GED successfully detected all obfuscated malicious sequences. Moreover, the large confidence gap between the most suspicious benign sequence and the least suspicious malicious sequence showed GED's robustness as a screening algorithm. We also note that GED is an order of magnitude faster than GenoTHREAT, because it compares the query sequence with a small database of SOC and does not need to query for every 200 bp fragment.

[0063] Experiments demonstrate that 16 out of 50 obfuscated DNA samples are not detected when screened according to the HHS guidelines. We further proposed a hardened screening algorithm termed Gene Edit Distance (GED) that successfully detects all obfuscated DNA samples. Future enhancements to DNA screening may rely on machine learning for sequence analysis and DNA function prediction. Adversarial learning techniques can be used to further increase the resilience of screening algorithms against malicious DNA sequences that are not yet on the SOC list.

[0064] It is to be understood that all or part of process 100 may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof. The computing system may have one or more processors and one or more network interface modules. Processors may be configured as a multi-processing or distributed processing system. Network interface modules may control the sending and receiving of data packets over networks. Security modules control access to all data and modules. All or part of the system and process can be implemented as a computer program product, tangibly embodied in an information carrier, such as a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, such as a programmable processor, computer, or deployed to be executed on multiple computers at one site or distributed across multiple sites. Memory storage may also include multiple distributed memory units, including one or more types of storage media.

[0065] Method steps associated with the system and process can be rearranged and/or one or more such steps can be omitted to achieve the same, or similar, results to those described herein. It is to be understood that the embodiments described hereinabove are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.

METHOD AND SYSTEM FOR ADVERSARY RESILIENT SCREENING OF SYNTHETIC GENE ORDERS

Inventors

Cpc classification

Classification Explorer

G06F16/90335

PHYSICS

Classification Explorer

G16B30/10

PHYSICS

Classification Explorer

G16B50/30

PHYSICS

International classification

Classification Explorer

G16B30/10

PHYSICS

Classification Explorer

G06F16/903

PHYSICS

Classification Explorer

G16B50/30

PHYSICS

Abstract

Claims

Description