COMPOSITIONS AND METHODS FOR IDENTIFICATION OF ZINC FINGERS
20230167436 · 2023-06-01
Inventors
Cpc classification
C12N15/1086
CHEMISTRY; METALLURGY
C07K2319/81
CHEMISTRY; METALLURGY
International classification
Abstract
Provided are improved compositions and methods that are used for identifying interacting zinc fingers in a zinc finger and DNA sequence context. The compositions and methods provide a comprehensive approach that takes into account the effect of adjacent zinc fingers, in part by expanding the repertoire of F2 fingers that are varied at amino acid position 6.
Claims
1. A method of determining amino acid sequences from a plurality of zinc fingers that bind to specific DNA substrates in a DNA sequence dependent manner, wherein the binding of at least some of the zinc fingers in the plurality is determined by expression of a selectable marker and optionally a detectable marker, the method comprising: i) providing DNA substrates that are operably linked to expression of the selectable marker, wherein each DNA substrate includes a segment comprising a DNA sequence configured to detect binding of a contiguous polypeptide comprising three distinct zinc fingers that are F1, F2, and F3, respectively, and wherein the F1, the F2 and the F3 are optionally in an N-terminal to C-terminal orientation; ii) expressing the plurality of zinc fingers in a series of in vivo assays within cells that comprise the DNA substrates, wherein: a) in each assay the F1 comprises the same amino acid sequence; b) in each assay the F2 comprises at least one amino acid difference relative to domain 1 of other zinc fingers in other assays in the series, and wherein said amino acid difference is optionally at position 6 of the F2, and wherein F1 and F2 comprise a functional pair in each assay; and c) in each assay one or more of positions −1, 1, 2, 3, 5 and 6 of each F3 α-helix are randomized; iii) selecting zinc fingers that promote expression of the selectable marker; and iv) determining the sequence of the zinc fingers that promote expression of the selectable marker to identify amino acids that promote said expression, wherein the identified amino acids that promote said expression are included in at least the F3 [domain 2].
2. The method of claim 1, wherein all of positions −1, 1, 2, 3, 5 and 6 of each F3 α-helix are randomized.
3. The method of claim 2, wherein said amino acid difference for each series of assays is at position 6 of the F2.
4. The method of claim 2, wherein each assay in the series of assays comprises at least 64 million distinct F3 α-helices.
5. The method of claim 4, wherein a series of at least four assays are performed.
6. The method of claim 5, wherein the segment of the DNA substrate configured to detect the binding in each assay in the series comprises at least one variable segment, wherein the variable segment comprises three base pair (bp) targets for use in determining the binding.
7. The method of claim 6, wherein 64 distinct 3 bp segments are included in each assay in the series.
8. The method of claim 7, wherein the series of assays is such that 64 independent selections can identify zinc fingers comprising α-helices that may be able to interact with each of the 64 distinct 3 bp targets, and wherein sufficient selections are performed such that at least 16 billion unique zinc finger-DNA substrate interactions may be present and can be analyzed for expression of the selectable marker or the detectable marker or a combination thereof.
9. The method of claim 8, wherein the sufficient selections comprise 384 selections.
10. The method of claim 9, wherein the selectable marker is present and comprises HIS3.
11. The method of claim 9, wherein the detectable marker is present and comprises a fluorescent protein.
12. A zinc finger comprising an amino acid F3 [domain 2] sequence identified by claim 1.
13. A contiguous polypeptide comprising a set of zinc fingers, one of which comprises an amino acid sequence of an F3 identified by the method of claim 1, and a second of which comprises an F2 that was also present in an assay of the method of claim 1, and wherein said contiguous polypeptide can bind with specificity to a 3 bp segment of a DNA substrate that was also present in said assay.
14. A contiguous polypeptide comprising a set of zinc fingers, one of which comprises an amino acid sequence of an F3 identified by the method of claim 1, and a second of which comprises an F2 that was present in an assay of the method of claim 1, and wherein said contiguous polypeptide can bind with specificity to a 3 bp segment of a DNA substrate that was also present in said assay, and further comprising an F1 that was present in an assay of the method of claim 1, wherein said contiguous polypeptide can bind with specificity to the 3 bp segment of the DNA substrate.
Description
BRIEF DESCRIPTION OF THE FIGURES
DETAILED DESCRIPTION
[0027] Unless defined otherwise herein, all technical and scientific terms used in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains.
[0028] Every numerical range given throughout this specification includes its upper and lower values, as well as every narrower numerical range that falls within it, as if such narrower numerical ranges were all expressly written herein. The disclosure is not intended to be bound by any particular theory described herein. The disclosures of all references described in this disclosure are incorporated herein by reference.
[0029] Representative compositions and methods are provided. The disclosure includes all compositions and steps as described herein and as shown in the accompanying figures. The disclosure includes the proviso that any single or combination of reagents and steps may be excluded. Some or all the steps may be performed sequentially, although concurrent performance of steps is not necessarily excluded from the disclosure. The disclosure includes all compositions of matter formed during performance of the described method. The disclosure includes all expression vectors and combinations of expression vectors used to produce, screen and identify zinc fingers as further described herein. The disclosure relates in part to performance of a series of assays. The zinc fingers encoded by expression vectors in the assays, and the expression vectors themselves, may be considered libraries.
[0030] The disclosure includes all described methods of measuring and displaying results obtained from the described assays. The disclosure includes use of Hamming distance to obtain any described measurement. Hamming distance determination is known in the art, and generally involves counting the number of positions at which two sequences of equal length differ.
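As a minimal illustration of the Hamming distance referred to above, the following Python sketch counts mismatched positions between two six-residue helices. The function name and the example helix strings are hypothetical and are not taken from the disclosure; how distances are aggregated across whole selections is not specified here and is not shown.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical helices written as residues at positions -1, 1, 2, 3, 5, 6.
helix_a = "RSDNLR"
helix_b = "RSDHLT"
print(hamming_distance(helix_a, helix_b))  # differs at positions 3 and 6 -> 2
```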
[0031] The term “zinc finger” is abbreviated from time to time in this disclosure as “ZF.” The term “finger” also refers to a zinc finger. The zinc fingers referred to in this disclosure generally comprise Cys2His2 zinc fingers, but other types of zinc fingers are not necessarily excluded.
[0032] ZFs used and identified in assays as further described herein are comprised by contiguous polypeptides. The contiguous polypeptide may be described in an N-terminal to C-terminal orientation due to the anti-parallel DNA binding that characterizes ZF-DNA binding. Each contiguous polypeptide generally comprises a three ZF series. Contiguous polypeptides comprising only two ZFs may also be used, as further described in the Examples. The fingers can be considered F1, F2, and F3, which are also referred to herein as domain 0, domain 1 and domain 2, respectively. In embodiments, two ZFs may be separated from one another in a contiguous polypeptide by an intervening linker segment. In general, for two zinc fingers to be considered part of the same array, a suitable linker between them is 8 or fewer amino acids long, and may be present between the last His of the N-terminal finger and the first Cys of the C-terminal finger. As a non-limiting example, the canonical zinc finger linker TGEKPFA (SEQ ID NO:3), or derivatives of this linker, is found in the majority of natural zinc finger proteins between the 2nd His of an N-terminal finger and the first Cys of the following finger. In one embodiment, a linker is used to separate each ZF pair and skips a base between their targets (PMID: 30850604).
[0033] The present disclosure provides compositions and methods for use in improved analysis of each factor that impacts ZF-DNA interaction in context, and how they influence the ZF-DNA engagement, both individually and combinatorially.
[0034] In certain approaches, the present disclosure provides a method of determining amino acid sequences from a plurality of zinc fingers that bind to specific DNA substrates in a context dependent manner, wherein the binding of at least some of the zinc fingers in the plurality is determined by expression of a selectable marker and optionally a detectable marker.
[0035] In embodiments, interacting zinc finger combinations are selected by binding to DNA, such as a 3 base pair segment, which in turn drives expression of a detectable marker, or a selectable marker, or a combination thereof. In a non-limiting embodiment, the zinc fingers are selected using a series of bacterial one-hybrid assays, but alternative assays may be used, including but not necessarily limited to yeast-based assays, and assays performed in mammalian cells, including but not necessarily limited to human cells. Thus, any assay that provides a readout of zinc finger binding can be adapted for use in the described methods.
[0036] In embodiments, bacterial one-hybrid assays are performed using the zinc finger and DNA substrates as further described herein and by adapting the bacterial one-hybrid assays described in Persikov, et al. (2015), A systematic survey of the Cys2His2 zinc finger DNA-binding landscape, Nucleic Acids Research, 43(3), 1965-1984, and Noyes M. B., Analysis of specific protein-DNA interactions by bacterial one-hybrid assay, Methods Mol. Biol. 2012; 786:79-95, from which the disclosures of bacterial one-hybrid protein selections are incorporated herein by reference.
[0037] The described assays can be adapted to use different selectable or detectable markers, and combinations thereof. Positive and negative selection, and combinations thereof, can be used. The specific markers used are not particularly limited. In embodiments, the selectable marker comprises a gene that encodes a protein that produces or participates in production of a substance that is required for an organism to remain viable, which may be related to the components of a culture medium. Non-limiting embodiments of such genes comprise HIS3 and URA3, or any other auxotrophic marker. In embodiments, the selectable marker may alternatively be an antibiotic resistance gene, such as a gene whose protein product provides resistance to, for example, ampicillin, kanamycin, chloramphenicol, tetracycline or triclosan. In addition, URA3 can be used for negative selection in the presence of 5-fluoroorotic acid (5-FOA).
[0038] The detectable marker may be any marker that produces a detectable signal, non-limiting embodiments of which include green fluorescent protein (GFP), enhanced GFP, mCherry, mTAGBFP2, mPlum, YFP, mPapaya, mStrawberry, blue fluorescent protein (BFP), Sirius, and the like. In embodiments, the detectable labels produce a signal that comprises UV light (<380 nm), visible light (380-740 nm) or far red (>740 nm). Colorimetric assay markers may also be used, such as a β-galactosidase assay.
[0039] The selection of particular combinations of interacting zinc fingers can be adjusted to require, for example, a threshold level of expression of the selectable and/or detectable markers, which may be correlated with affinity of the zinc fingers for a particular DNA sequence.
[0040] The mutation and randomization of zinc fingers can be performed using any suitable techniques, such as site directed mutagenesis. The sequence of zinc fingers can be determined by sequencing of plasmids or other expression vectors that are used to express the zinc fingers which are selected by using the pertinent host cells and selection approaches described above and further herein. The disclosure includes identifying interacting zinc fingers using the described methods, producing the zinc fingers by any suitable protein expression technique, and using the identified zinc fingers and combinations thereof for any purpose.
[0041] In embodiments, the disclosure provides for selection and identification of zinc fingers in a context dependent manner. By “context dependent” it is meant that the described assays take into account the influence of adjacent fingers on one another, and particularly the relationship of F2 and F3. Accordingly, the described assays provide for the first time identification of the amino acids in F3 that promote expression of a marker due to the F3 interaction with the DNA that takes into account influence of its interaction with F2. While the described assays include analysis of ZFs that comprise randomized α-helix amino acids in positions −1, 1, 2, 3, 5 and 6 of each F3 [domain 2], without intending to be bound by any particular theory, it is considered that position 6, and potentially position 5, of F2 have increased influence on F3 DNA binding due to their close proximity when bound to DNA, relative to the other stated positions, such influence not having been previously analyzed in the same depth as in the present disclosure. (See
[0042] The disclosure includes use of F2's that, in at least some of the assays, do not include an Arginine in position 6 of F2, which is one aspect that differentiates the present disclosure from previous approaches. Accordingly, the presently provided approach (as illustrated at least by comparison of
[0043] Further differentiation from previous approaches is illustrated at least by
[0044] In more detail, as shown in
[0045] As also described further herein and by way of the figures, this superior approach is achieved in part by using assays that include F3's [domain 2's] having positions −1, 1, 2, 3, 5 and 6 of each F3 α-helix randomized, but in the context of an F2 [domain 1] wherein at least the position 6 amino acid in F2 and its interaction with the DNA, e.g., different DNA substrates, is different in each assay, and wherein F1 [domain 0] is fixed throughout an assay series, and as stated above, in at least some F2's position 6 is not an Arginine. In embodiments, at least some assays include adenine or cytosine at this overlap position, in addition to the common arginine-guanine contact in other assays.
[0046] By using this approach, the F3 is placed in a context where domain 0 and domain 1 form a functional pair. By “functional pair” it is meant that domain 0 and domain 1 will function together to bind a particular 3 base pair DNA segment when provided with a suitable domain 2, thereby providing a selectable combination of zinc fingers. Thus, the amino acid sequence of domain 0 and domain 1 of each ZF is known in each assay, domain 0 and domain 1 are known to be compatible with each other, and each domain 2 [F3] α-helix is randomized to determine the sequence of which F3 interacts with the functional pair comprised by fixed domain 0 [F1] and domain 1 [F2] having at least position 6 changed in individual assays.
[0047] Thus, in various aspects the disclosure provides methods that provide for determining amino acid sequences from a plurality of zinc fingers that bind to specific DNA substrates in a DNA sequence and zinc finger dependent manner. The binding of at least some of the zinc fingers in the plurality is determined by expression of a selectable marker and optionally a detectable marker. The method generally comprises the following steps:
[0048] i) providing DNA substrates that are operably linked to expression of the selectable marker, wherein each DNA substrate includes a segment comprising a DNA sequence configured to detect binding of a contiguous polypeptide comprising three distinct zinc fingers that are F1 [domain 0], F2 [domain 1] and F3 [domain 2], respectively, and wherein the F1 [domain 0], the F2 [domain 1] and the F3 [domain 2] are optionally in an N-terminal to C-terminal orientation;
[0049] ii) The plurality of zinc fingers are expressed in a series of in vivo assays within cells that comprise the DNA substrates. In each assay the F1 [domain 0] comprises the same amino acid sequence. In each assay, the F2 [domain 1] comprises at least one amino acid difference relative to domain 1 of other zinc fingers in other assays in the series. In embodiments, the amino acid difference is at position 6 of the F2 [domain 1]. F1 [domain 0] and F2 [domain 1] comprise a functional pair in each assay and comprise known sequences. In each assay one or more of positions −1, 1, 2, 3, 5 and 6 of each F3 [domain 2] α-helix are randomized.
[0050] Using this configuration, the method further comprises:
[0051] iii) selecting zinc fingers that promote expression of the selectable marker; and, iv) determining the sequence of the zinc fingers that promote expression of the one or more markers to identify amino acids that promote the expression. The identified amino acids that promote the expression are included in at least the F3 [domain 2]. In embodiments, more than one, or all of positions −1, 1, 2, 3, 5 and 6 of each F3 [domain 2] α-helix are randomized. In embodiments, the amino acid difference for each series of assays is at position 6 of the F2 [domain 1].
[0052] In embodiments, each assay in the series of assays comprises at least 64 million distinct F3 [domain 2] α-helices. In embodiments, a series of at least four assays are performed. In embodiments, 4-12 assays are performed.
[0053] In embodiments, the segment of the DNA substrate configured to detect the binding in each assay in the series comprises at least one variable segment. The variable segment comprises three base pair (bp) targets for use in determining the binding. The variable segments can comprise up to 64 distinct 3 bp segments in each assay.
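The figure of 64 distinct 3 bp segments follows from four possible bases at each of three positions. A minimal Python sketch that enumerates them is given below; the variable names are illustrative only and are not part of the disclosure.

```python
from itertools import product

BASES = "ACGT"

# Every ordered combination of three bases: 4 * 4 * 4 = 64 targets.
targets = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(targets))   # 64
print(targets[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```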
[0054] In embodiments, the series of assays is such that 64 independent selections can identify zinc fingers comprising α-helices that may be able to interact with each of the 64 distinct 3 bp targets, and sufficient selections are performed such that at least 16 billion unique zinc finger-DNA substrate interactions may be present and analyzed for expression of a marker. In an embodiment 384 selections are made.
[0055] The disclosure includes a zinc finger comprising an amino acid F3 [domain 2] sequence identified by the described method. The disclosure also includes a contiguous polypeptide comprising a set of zinc fingers, one of which comprises an amino acid sequence of an F3 [domain 2] identified by a described method, and a second of which comprises an F2 [domain 1] that was also present in a described assay. The described contiguous polypeptide can bind with specificity to a 3 bp segment of a DNA substrate that was also present in the assay. The contiguous polypeptide can also comprise an F1 [domain 0] that was present in an assay. This contiguous polypeptide can also bind with specificity to the 3 bp segment of the DNA substrate.
[0056] To expand on the foregoing description of distinctions of the present approach from prior methods, two general approaches have been previously used: one focused on engineering one finger at a time and a second approach focused on the interface between adjacent ZFs of an array (
[0057] In more detail, the majority of previous ZF engineering efforts come from an era that predates next generation sequencing and therefore methods for the comparison and visualization of ZF data on this scale have not been previously established. For example, the present disclosure provides 12 comprehensive screens of 10 ZF libraries that each represent a unique interface environment in order to provide the helical diversity required to generate compatible ZF pairs across a wide range of targets. We screened these libraries across each of the 64 possible 3 bp targets for functional helical strategies. In total the disclosure includes screening over 49 billion protein-DNA interactions across 768 independent selections. We used Hamming distance and position specific scoring matrices (PSSMs) to quantify differences in the ZF populations that were enriched across binding site and library environments. However, the inter-dependencies of adjacent amino acids on a ZF helix to specify a given DNA target can result in a convoluted picture of multiple sequence motifs or “strategies” able to bind each target. Deconvoluting this picture requires clustering, but with data sets this large the resulting clusters across 768 selections can be difficult to interpret. Thus, the disclosure provides a visualization method to represent the over 1 million functional helical sequences that were recovered on two-dimensional maps with common coordinates. These coordinates were determined by biochemically derived descriptors. By doing so, the data for each selection can be compared visually across the 2D landscapes that describe the functional ZF strategies for a given target under the adjacent finger influence provided by that library. In addition, the disclosure provides for charting where computationally derived clusters are located in this 2D space to help determine the resolution of the clustering.
These comparisons permit visualization of both general and interface-specialized solutions to bind most targets. In addition, these comparisons help to reveal that targets with higher G-content tend to offer more general strategies across all libraries screened. This likely explains why the success of prior ZF screens has been biased towards G-rich targets, as these solutions appear to be compatible across most interface environments.
[0058] To test whether the described interface-focused screening approach would provide the complexity required to find compatible ZF solutions across a wide range of targets, we followed initial screens with a series of second round, two-finger selections. From these selections we were able to enrich compatible pairs across all targets tested. We assembled these ZF pairs as extended arrays to create a series of 42 zinc finger nucleases (ZFNs) and found that all constructs produced activity above background. The mean activity across all ZFNs tested was similar to the activities reported for TALEN and Cas9 screens in the same assay. Together these data demonstrate the utility of comprehensive screens and the necessity of sampling multiple interface influences in the primary ZF screens in order to provide the complexity necessary for down-stream compatibility. Additional description of these and related approaches is provided in the Examples below. In parallel, we have shown that the data described here has sufficient complexity to train an AI-based model for the simple design of ZFs to be employed as nucleases, activators, and repressors.
[0059] With respect to improved ZF analysis by using combinatorial screening, in embodiments, the disclosure provides for a total number of selected helices that range from 128 thousand to over 1 million helices per library screened.
[0060] In embodiments, the disclosure provides for screening of at least 4 libraries. In embodiments, the 4 libraries each provide a different overlap environment (e.g., one library for A, one library for C, one library for G, and one library for T). In embodiments, this approach comprises use of all possible 3 bp substrates, i.e., 64 substrates. In embodiments, each library comprises 64 million sets of ZFs, and would therefore include analysis of approximately 16 billion ZF/DNA interactions (e.g., 4 libraries, each screened against all 64 3 bp substrates with 64 million amino acid combinations, would yield 16.384 billion ZF/DNA interactions analyzed). In embodiments, the disclosure includes analysis of 4-12 libraries. As another non-limiting example, for 12 libraries analyzed using the described approach, the disclosure provides for analysis of approximately 49 billion ZF/DNA interactions (e.g., 12 libraries, each screened against all 64 3 bp substrates with 64 million amino acid combinations, would yield 49.152 billion interactions analyzed). In non-limiting embodiments, the libraries may include five domain 1 interactions that bind A at the interface, five that bind C at the interface, and 3 that bind G. Libraries that bind T at the interface may also be included. Representative and non-limiting examples are provided below. For instance, as described further below in the Examples, the disclosure demonstrates screening libraries across each of 64 possible 3 bp targets for functional helical strategies, thereby providing analysis of over 49 billion protein-DNA interactions across 768 independent selections.
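The interaction counts stated above can be checked with simple arithmetic. The Python sketch below assumes that the figure of 64 million combinations per library corresponds to 20 amino acids at each of six fully randomized positions (20^6 = 64,000,000); that correspondence is an inference from the numbers and is not stated explicitly in the text. Function and constant names are illustrative.

```python
AMINO_ACIDS = 20            # assumption: all 20 standard amino acids allowed
RANDOMIZED_POSITIONS = 6    # positions -1, 1, 2, 3, 5 and 6 of the F3 helix
TARGETS = 4 ** 3            # all 3 bp substrates

helices_per_library = AMINO_ACIDS ** RANDOMIZED_POSITIONS


def selections(n_libraries: int) -> int:
    """One selection per library per 3 bp target."""
    return n_libraries * TARGETS


def interactions(n_libraries: int) -> int:
    """Total helix/target combinations sampled across all selections."""
    return selections(n_libraries) * helices_per_library


print(helices_per_library)   # 64000000
print(interactions(4))       # 16384000000  (16.384 billion)
print(selections(12))        # 768
print(interactions(12))      # 49152000000  (49.152 billion)
```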
[0061] The following Examples are intended to illustrate but not limit the disclosure.
Example 1
[0062] Adjacent Finger Influences have Global Impacts on Specificity
[0063] The first structures of ZF arrays bound to DNA demonstrate the influence that adjacent fingers have on one another (
[0064] We found global and target-specific differences across these libraries. The total number of selected helices ranged from 128 thousand to over 1 million helices per library screened (
Example 2
The Overlap Base and Specificity Strategies
[0065] Zinc finger strategies were enriched in over 85% of the 3 bp target selections for 9 of the library screens. In fact, for every 3 bp target, ZF strategies were enriched in at least 8 different library screens. However, it is not immediately clear if the ZFs enriched to bind a given target are the same or different across libraries or how the overlap base influences the enriched strategies. Therefore, to make a quantitative comparison we measured the relatedness of each selection using Hamming distance (
Example 3
Visualizing Selected Helices in 2-Dimensional Space
[0066] To visualize common and interface-specific strategies enriched to bind any 3 bp target we first used the MUSI generated sequence clusters across all selections. However, this resulted in a cumbersome comparison of thousands of clusters across the 768 selections. Therefore, to more easily visualize and compare helical families across selections we developed a method to distribute selected helices on a common 2D space. In particular, this method is tailored to classify helices by their potential biochemical interaction with DNA bases. First, we encoded each six-amino-acid helix sequence into numerical descriptors based on physical features. As helical side chains interact with bases in the major groove through combinations of hydrogen bonds and hydrophobic contacts, we used an encoding system where each residue is encoded by three integers: (i) the number of hydrogen bond donors, (ii) the number of hydrogen bond acceptors and (iii) the side chain length from the Cα to the most distal atom (
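The encoding described above (three integers per residue, six residues per helix, hence 18 descriptors per helix) followed by projection onto two principal components can be sketched as follows. This is a minimal illustration only: the descriptor values in the table are placeholders chosen for the example and are not the values used in the disclosure, the helix strings are hypothetical, and PCA is computed here with a plain NumPy singular value decomposition rather than with whatever software the authors used.

```python
import numpy as np

# PLACEHOLDER descriptors per residue:
# (H-bond donors, H-bond acceptors, side-chain length).
# Illustrative assumptions only, not values from the disclosure.
DESCRIPTORS = {
    "A": (0, 0, 1), "D": (0, 2, 3), "H": (1, 1, 4), "K": (1, 0, 5),
    "L": (0, 0, 3), "N": (1, 1, 3), "Q": (1, 1, 4), "R": (3, 0, 6),
    "S": (1, 1, 2), "T": (1, 1, 2),
}


def encode_helix(helix: str) -> np.ndarray:
    """Flatten a six-residue helix into an 18-element descriptor vector."""
    if len(helix) != 6:
        raise ValueError("expected six residues (positions -1, 1, 2, 3, 5, 6)")
    return np.array([v for aa in helix for v in DESCRIPTORS[aa]], dtype=float)


def pca_2d(helices: list[str]) -> np.ndarray:
    """Project encoded helices onto their first two principal components."""
    x = np.stack([encode_helix(h) for h in helices])
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


helices = ["RSDNLR", "RSDHLT", "QSANLK", "ASDTLA"]  # hypothetical examples
coords = pca_2d(helices)
print(coords.shape)  # (4, 2)
```

To place helices from different selections on common coordinates, the principal axes would be fitted once on the combined set of helices and then reused for every selection; the sketch fits and projects in one step for brevity.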
[0067] We created a PCA map using these descriptors that includes over 1 million helices recovered in the 768 selections. This map allowed us to track the coordinates of any selected helix, or related helix, and their relative positions on the 2D coordinates of the map. As positions 1 and 5 of the helix have less of an influence on specificity, variability at these positions allows for up to 400 versions of any helix with common “core” residues, suggesting that the map reflects multiple versions of any core helix (a core helix is defined as any helix with the same residues at positions −1, 2, 3, and 6). From this map we find that similar helices are found with similar coordinates (
[0068] An object of clustering and visualization of the enriched helices on the PCA maps is to identify helical sequences where amino acids are dependent on one another and thus represent unique binding strategies. This is believed to be necessary as all of the positions on the ZF helix have the potential to influence one another and therefore, the enriched residues should not be thought of as independent but rather a network of interactions as the entire helix interacts with the target. To confirm that the PCA maps separate related helices we plotted the location of the ZFs included in the MUSI-determined clusters on each selection-specific PCA map. In general, when MUSI determines that there are 3 or fewer clusters, the PCA maps separate most clusters into localized regions of the map. For selections with 4 or more clusters, a subset of the clusters will tend to overlap (
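The core helix definition above (same residues at positions −1, 2, 3 and 6, with positions 1 and 5 free to vary, giving 20 × 20 = 400 possible versions per core) can be illustrated with a short Python sketch. It assumes each helix is written as a six-letter string ordered by positions −1, 1, 2, 3, 5, 6; the function names and example helices are hypothetical.

```python
from collections import defaultdict

POSITIONS = (-1, 1, 2, 3, 5, 6)       # order of residues in the helix string
CORE_POSITIONS = (-1, 2, 3, 6)        # positions that define a core helix
CORE_INDICES = [POSITIONS.index(p) for p in CORE_POSITIONS]  # [0, 2, 3, 5]


def core_of(helix: str) -> str:
    """Return the residues at positions -1, 2, 3 and 6."""
    if len(helix) != len(POSITIONS):
        raise ValueError("expected six residues")
    return "".join(helix[i] for i in CORE_INDICES)


def group_by_core(helices: list[str]) -> dict[str, list[str]]:
    """Group helices that share the same core residues."""
    groups: dict[str, list[str]] = defaultdict(list)
    for helix in helices:
        groups[core_of(helix)].append(helix)
    return dict(groups)


# Hypothetical helices: the first two differ only at positions 1 and 5.
print(group_by_core(["RSDNLR", "RADNVR", "RSDHLT"]))
# {'RDNR': ['RSDNLR', 'RADNVR'], 'RDHT': ['RSDHLT']}
print(20 ** 2)  # 400 possible versions of each core helix
```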
Example 4
Adjacent Fingers Influence General and Specialized Strategies for DNA Targets
[0069] Globally, we compared the target selections in three ways: the ability to enrich functional helices, the quantitative comparison of Hamming distance, and the qualitative comparison of PCA maps that provide a 2-dimensional representation of the specificity landscape. From these maps we can easily visualize populations that are differentially enriched across similar targets within the same library or within the same target across multiple libraries (
Example 5
Zinc Finger Incompatibility is Demonstrated Across Selections
[0070] Hamming distance and PCA plots reveal trends in general populations that are enabled across most library contexts as well as more specialized populations that are specific to subsets of libraries. These specialized populations even appear within the class of highly similar nnG selections where most, but not all, solutions appear to be general (
Example 6
Compatible ZF Pairs are Enriched Across a Wide Range of Targets
[0071] The described library screens demonstrate the profound influence that adjacent fingers have on one another despite their restriction to a relatively small subset of adjacent finger influences. However, we reasoned that the wide range of biochemical environments that these libraries represent, and the diversity of helices enriched across these selections, would provide the complexity necessary to uncover compatible helices across a wide range of targets. Therefore, to consider compatibility on a more comprehensive scale we created pools of ZFs for each 3 bp target that included helices from all successful library screens. From these pools we generated 2-finger libraries compatible with 178 6 bp targets (
[0072] To demonstrate that the compatible pairs are functional outside of the bacterial selection context, we used the pairs to create a series of 42 zinc finger nucleases (ZFNs) that target the coding sequence of eGFP. In order to maintain the independence of binding for one ZF pair versus another, we used a linker to separate each pair that skips a base between their targets. In this way we have reduced the interface compatibility issue for ZFs to a 2-fingered problem across the entire array (
Discussion of Examples
[0073] It will be recognized from the foregoing description that the present disclosure provides an alternative approach that maintains maximum diversity in the randomized ZF while systematically changing the environment presented by the adjacent finger. Among other aspects, the disclosure provides contexts that position the library adjacent to fixed interactions with A's and C's at the interface. This takes into account how these interactions would provide critically differentiated environments, compared to the commonly explored arginine-guanine environment, to generate novel binding strategies that would provide compatible ZF options for a wide range of target sequences. Indeed, the described results demonstrate that this approach was successful as ZF populations enriched using libraries with A and C overlaps were all significantly different from libraries that presented arginine-guanine contacts at the interface (
[0074] A benefit of the presently described approach was confirmed by selecting a series of compatible 2-fingered modules for 178 6 bp targets from libraries generated from these primary, interface focused selections. By choosing the most enriched pairs from each of these selections we were able to produce a set of 42 ZFNs with a range of activity similar to what would be expected from a series of TALENs or a Cas9-based screen of guide RNAs. Further, we demonstrate that weak ZFNs can be improved by extending the array or reducing non-specific affinity that limits off-target binding. Since the primary selections were mostly focused on A and C at the overlap, we maintained this preference in the 6 bp targets specified for the ZFNs in our screen. We also confirmed that the G-content for the most active targets is no different than random, suggesting that the fingers are not limited by G-binding as previous efforts have been. The use of a linker to skip bases between these modules allows each module to function independently. Therefore, a complete set of 4096 independent 2-fingered modules could be generated based on this approach and applied to any target sequence. In parallel, we have shown that the data produced in this disclosure offered enough complexity to generate an AI-based model of ZF specificity that enables their simple design for application as nucleases, activators and repressors.
[0075] The following materials and methods were used to produce the data described herein and depicted in the figures.
Library Builds
[0076] Primary zinc finger libraries: All primary ZF libraries were built as previously described and detailed below. To provide templates for PCR, gBlocks were ordered from IDT that coded for the finger 0 and finger 1 domains of each library (
[0077] 2-finger libraries: Second round selections were used to select compatible pairs from pre-selected ZF pools generated in the primary ZF library selections. We pooled recovered plasmid DNA from our primary single-finger screens on a binding site basis, resulting in a pool of diverse helices (termed “round 2 pools”) with broad compatibility for each of the 64 different binding sites. To ensure these were enriched for functional helices and not background, a simple cutoff was devised to omit unsuccessful selections. Based on the data filtering metrics described, single-finger pools were omitted if fewer than 20% of their reads passed these filters, as those selections would have added a disproportionate amount of non-functional ZFs to our template pools. A table depicting the pooling strategy is shown in
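The 20% cutoff and the per-site pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the disclosed selections: the function names, the dictionary layout, and the selection labels are hypothetical.

```python
def select_pools(pass_counts, total_counts, min_fraction=0.20):
    """Keep selections in which at least min_fraction of reads passed the filters.

    pass_counts / total_counts map a selection label (hypothetical naming)
    to the number of filter-passing reads and the total number of reads.
    """
    kept = []
    for name, total in total_counts.items():
        if total == 0:
            continue  # no reads at all: nothing to pool
        if pass_counts.get(name, 0) / total >= min_fraction:
            kept.append(name)
    return kept


def pool_by_site(kept, site_of):
    """Group the retained selections by their 3 bp binding site."""
    pools = {}
    for name in kept:
        pools.setdefault(site_of[name], []).append(name)
    return pools
```

Each value returned by `pool_by_site` lists the primary selections that would contribute plasmid DNA to the round 2 pool for that binding site.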
Zinc Finger Selections
[0078] Primary ZF Libraries: Libraries were built in a vector that will express the ZFs as a fusion to the omega subunit of the bacterial polymerase using a strong promoter. In the B1H system omega is simply acting as an activation domain. The binding site reporter vectors were built by placing the binding site of interest 10 bp upstream of the −35 box of the promoter that drives HIS3 and GFP expression in the previously described GHUC vector. For example, for the library 2 TAC selection, the binding site 5′ TAC-ACA-AAG 3′ was built into the GHUC vector 10 bp upstream of the promoter where the library domain will bind TAC and domains 1 and 0 of library 2 will bind ACA and AAG, respectively (
[0079] Compatible 2-finger module selections: To identify compatible 2-finger modules from our round 2 libraries, we first built a matching set of vectors containing the intended DNA target and then leveraged omega-dependent activation of the HIS3 reporter in our bacterial 1-hybrid system. Round 2 libraries were co-transformed with the matching reporter vector in USO-ω cells and recovered and titered as described. Based on cell counts the next day, 1×10^6 cells were added in triplicate to a 96-well deep-well plate containing a sterile bead for efficient agitation. Selections were performed in 1 mL NM +Ura/−His supplemented with 100 μg/mL carbenicillin, 50 μg/mL kanamycin, 1 μM IPTG, and 5 mM 3AT. These were grown at 37 °C in a plate shaker for 18, 24, or 40 hours and harvested upon reaching visible turbidity (typically OD>0.6). Triplicates were pooled, miniprepped, and deep sequenced on an Illumina NextSeq 500. Helices were rank-ordered by sequencing reads, and 2-finger modules within the top 5 highest counts were chosen for follow-up assembly and testing in the eGFP nuclease assay.
U2OS GFP Disruption Assay
[0080] Zinc finger nuclease (ZFN) activity was assessed by measuring disruption of an integrated, constitutively expressed eGFP reporter in a previously described clonal U2OS cell line (Reyon et al. 2012, Nat. Biotech). Cells were cultured in DMEM supplemented with 10% FBS, 2 mM GlutaMAX™ (Life Technologies), 1% penicillin/streptomycin, 1% MEM non-essential amino acids (Life Technologies), 2 mM sodium pyruvate, and 400 μg/mL G418. 1 μg of each ZFN monomer plasmid DNA and 200 ng ptdTomato-N1 plasmid DNA were transfected in duplicate into 5×10^5 cells using a Lonza Nucleofector™ 2b Device (Kit V, Program X-001). In each experiment, 2 μg of the parental empty vector (a modified derivative of the JDS71 vector from Addgene) with 200 ng ptdTomato-N1 was used as a negative control, and 2 μg of a dual spCas9-guide expressing vector (modified Addgene plasmid #41815) with 200 ng ptdTomato-N1 was used as a positive control. Cells were grown in 6-well dishes for 3 days post-transfection, harvested and kept on ice, and analyzed for expression of eGFP and tdTomato on a Sony SH800 cell sorter. To restrict analysis to cells that likely received both ZFN monomer plasmids, populations were first gated on the top 15-25% tdTomato+ cells, and then analyzed for loss of eGFP expression.
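The rank-ordering step can be illustrated with a short sketch. It assumes each 2-finger module is represented as a tuple of two helix strings mapped to its read count; this representation, the function name, and the alphabetical tie-break are assumptions made here and are not specified in the text.

```python
def top_modules(read_counts, n=5):
    """Return the n modules with the highest read counts.

    read_counts maps a module, here a (helix_1, helix_2) tuple, to its
    number of sequencing reads. Ties are broken alphabetically so the
    result is deterministic (an assumption, not part of the source).
    """
    ranked = sorted(read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [module for module, _ in ranked[:n]]
```

With `n=5` this returns the candidates that would go on to assembly and testing; the helix strings themselves come from the sequencing data.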
Next Generation Sequencing and Prep
[0081] Primary libraries: Following selection from >5×10^8 library variants, surviving colonies were pooled, miniprepped, and DNA barcoded for sequencing on an Illumina NextSeq® 500. Typically these were performed as a set of 64 3 bp binding sites for a given ‘overlap’ library as follows. 2 μL of pooled plasmid DNA was used as a template for barcoding in a 25 μL reaction with Taq Polymerase (NEB) with the following cycling parameters: 95 °C for 5 min, 20 cycles of [95 °C:20 s, 52 °C:30 s, 68 °C:30 s], 68 °C for 10 min, and held at 4 °C. 5 μL of each reaction was visualized on a 1% agarose gel to confirm apparent equal amplification. All 64 reactions were pooled in equal volumes. These were run out on a 1% agarose gel, gel purified, and submitted to the NYU Genome Technology Center for sequencing on a NextSeq® 500.
[0082] 2-finger libraries: Following selection of ~3×10^6 2F library variants, plasmid DNA was extracted from surviving cells and barcoded for deep sequencing on an Illumina NextSeq® 500 as follows. 2 μL of pooled plasmid DNA was used as a template for barcoding in a 25 μL reaction with GoTaq® Green 2× Mastermix (Promega) with the following cycling conditions: 95 °C for 5 min, 15 cycles of [95 °C:30 s, 68 °C:30 s, 72 °C:60 s], 72 °C for 5 min, and held at 4 °C. 10 μL of each reaction was visualized on a 1% agarose gel to confirm equal amplification, and all reactions were pooled in equal volumes. These were gel-purified from a 1% agarose gel and submitted to the NYU Genome Technology Center for sequencing on an Illumina NextSeq® 500.
Sequence Recovery and Filtering
[0083] All paired-end Illumina reads were demultiplexed and trimmed into 21-mers with in-house Unix scripts based on EMBOSS 6.6.0. Trimmed DNA sequences were translated, and amino acid sequences were considered if they had at least two read counts and were coded by at least two different DNAs. The invariant leucine at helix position +4 was excluded.
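The translation and filtering rules above can be sketched in plain Python. This is a hedged illustration, not the in-house EMBOSS-based scripts: it assumes the 21-mer spans helix positions −1 to +6 (seven codons), that excluding the invariant leucine means dropping the fifth residue, and that reads containing stop codons or ambiguous bases are discarded. The function and variable names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def recover_helices(reads):
    """Translate trimmed 21-mers and apply the read-count and DNA-diversity filters."""
    aa_reads = Counter()          # total reads per helix
    aa_dnas = defaultdict(set)    # distinct coding DNAs per helix
    for dna, n in Counter(reads).items():
        if len(dna) != 21 or set(dna) - set("ACGT"):
            continue  # assumption: skip malformed or ambiguous reads
        protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, 21, 3))
        if "*" in protein:
            continue  # assumption: skip reads with a stop codon
        helix = protein[:4] + protein[5:]  # drop position +4 (assumed index 4)
        aa_reads[helix] += n
        aa_dnas[helix].add(dna)
    return {h: c for h, c in aa_reads.items()
            if c >= 2 and len(aa_dnas[h]) >= 2}
```

The returned dictionary maps each retained six-residue helix to its read count. A helix seen many times but encoded by a single DNA sequence is removed, which matches the requirement for at least two different coding DNAs.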
Clustering and Filtering Selections
[0084] For each selection, helix sequences were clustered using the MUSI software. Each sequence was assigned to the cluster associated with the PWM for which it was assigned the highest responsibility. For each cluster generated, the Shannon entropy value was calculated for each helix residue based on the PWM for that cluster. If a selection lacked a cluster with at least one position with an entropy of two or less, that selection was filtered out for downstream analysis.
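The entropy filter can be illustrated as follows. The sketch assumes each cluster PWM is given as a list of per-position dictionaries mapping amino acids to probabilities, and that entropy is measured in bits (log base 2); neither the data layout nor the unit is stated in the text, and the function names are hypothetical. MUSI itself is not called here.

```python
import math


def position_entropies(pwm):
    """Shannon entropy (bits) of each helix position in one cluster PWM."""
    return [-sum(p * math.log2(p) for p in column.values() if p > 0)
            for column in pwm]


def selection_passes(cluster_pwms, max_entropy=2.0):
    """True if any cluster has at least one position with entropy <= max_entropy."""
    return any(h <= max_entropy
               for pwm in cluster_pwms
               for h in position_entropies(pwm))
```

For reference, a position that is uniform over all 20 amino acids has an entropy of log2(20), about 4.32 bits, so a cutoff of two requires at least one position to show a clear residue preference.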
Principal Component Analysis and Mapping
[0085] Over 1 million different helix sequences were retrieved from all selections. We examined these data with Python 3.8 and the libraries NumPy 1.19, pandas 1.2, scikit-learn (sklearn) 0.23 and Logomaker 0.8. Each helix sequence was converted into 18 integers, where each amino acid is coded by a combination of 3 physical features: side chain length from the Cα to the most distal atom in the side chain, the number of hydrogen bond donors, and the number of hydrogen bond acceptors. We verified with a Random Forest algorithm that this encoding system was optimal for predicting 3-nucleotide binding sites and would not gain from additional features, such as side chain hydrophobicity. We standardized all features before Principal Component Analysis (PCA) reduction of the feature space. The first six Principal Components capture 0.135%, 0.108%, 0.101%, 0.093%, 0.083% and 0.064% of the total variance, highlighting the roughness of the sequence space and the low information content at some helix positions. Helices can be classified by hierarchical clustering on the first two PCA coordinates. More components can be included to capture a more detailed sequence space.
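The encoding and standardization steps that precede PCA can be sketched with the standard library alone. The per-residue feature values are not given in the text, so the feature table is passed in as a parameter and any numbers supplied to it are placeholders; the function names are hypothetical. The population standard deviation is used, which is also what scikit-learn's `StandardScaler` applies.

```python
from statistics import mean, pstdev


def encode_helix(helix, features):
    """Flatten a helix into one vector with 3 features per residue.

    features maps an amino acid letter to a 3-tuple of integers; a
    six-residue helix therefore yields 18 integers.
    """
    vector = []
    for residue in helix:
        vector.extend(features[residue])
    return vector


def standardize(rows):
    """Column-wise z-scores; columns with zero variance are set to 0.0."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, mus, sds)]
            for row in rows]
```

The standardized rows form the matrix that a PCA implementation would take as input.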
Computing Similarity Between Selections by Hamming Distance
[0086] To compare the helices from two selections, A and B, pairwise normalized Hamming distances were computed between the two sets of filtered sequences based on the number of identical amino acids. The minimum normalized Hamming distance was then computed from each helix in selection A to each helix in selection B as well as from each helix in selection B to each helix in selection A. The overall distance between the two selections was computed as the mean of these distances.
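A minimal sketch of this selection-to-selection distance follows. It assumes all helices have equal length and that the overall distance is the mean taken over the pooled A-to-B and B-to-A minimum distances; the text does not say whether the two directions are pooled or averaged separately, so that choice is an assumption. The function names are hypothetical.

```python
def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length helices differ."""
    if len(a) != len(b):
        raise ValueError("helices must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def selection_distance(sel_a, sel_b):
    """Mean of the nearest-neighbor distances in both directions."""
    a_to_b = [min(normalized_hamming(a, b) for b in sel_b) for a in sel_a]
    b_to_a = [min(normalized_hamming(b, a) for a in sel_a) for b in sel_b]
    distances = a_to_b + b_to_a
    return sum(distances) / len(distances)
```

Two selections that share every helix have a distance of 0, and two selections with no identical residue at any position have a distance of 1.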
Molecular Dynamics Simulations
[0087] The PDB file 1AAY was used as the template. The DNA was elongated by 2 bp at each end using X3DNA to avoid end-melting effects, so that the binding of the zinc fingers was not affected. The DNA and protein sequences were mutated using Chimera (cgl.ucsf.edu/chimera/) for each library and test case, and the protonation states were determined by WHATIF (swift.cmbi.umcn.nl/whatif/). The prepared structures were then solvated in a TIP3P water box with a 15-Å buffer of water extending from the protein/DNA complex in each direction, and sodium ions were added to ensure overall charge neutrality. The FF99 Barcelona force field was used for the protein/DNA complex and the zinc Amber force field for the zinc ions. The particle mesh Ewald method was used for electrostatics calculations. The SHAKE algorithm was used to constrain the hydrogen-containing bond lengths, which allowed a 2-fs time step for MD simulation. The non-bonded cut-off was set to 12.0 Å. The systems were energy minimized using a combination of steepest descent and conjugate gradient methods. The systems were then thermalized and equilibrated for 3 ns using a multistage protocol: a 1.5 ns gradual heating from 100 K to 300 K, followed by 1.5 ns of density equilibration, both at a 1-fs step length. A Berendsen thermostat and barostat were used for temperature and pressure regulation during a further 6-ns equilibration at a 2-fs step length with gradually reduced positional constraints at 300 K. The systems were built with tleap and the simulations were conducted with GPU-accelerated Amber18. For each system, three 500-ns trajectories were simulated. The hydrogen bond analysis was performed using BioPython. We considered as a hydrogen bond any contact below 3.5 Å between the atoms O6 and N7 of a guanine and the atoms NH1 and NH2 of an arginine, or ND2 and OD1 of an asparagine. Bifurcated hydrogen bonds between a guanine and an arginine are identified when the two pairs O6-NH1/2 and N7-NH1/2 are both found, allowing the tautomeric bifurcated hydrogen bond.
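The distance criterion for arginine-guanine contacts can be sketched as follows. The analysis in the text was performed with BioPython on simulation frames; this illustration instead takes plain coordinate dictionaries (atom name to x, y, z in Å) for a single frame so that it stays self-contained. The function names and the data layout are assumptions.

```python
import math


def arg_gua_contacts(guanine, arginine, cutoff=3.5):
    """List (guanine atom, arginine atom) pairs closer than cutoff (in Å)."""
    return [(g, r)
            for g in ("O6", "N7")
            for r in ("NH1", "NH2")
            if math.dist(guanine[g], arginine[r]) < cutoff]


def is_bifurcated(contacts):
    """True when both O6 and N7 are contacted by NH1 or NH2."""
    contacted = {g for g, _ in contacts}
    return {"O6", "N7"} <= contacted
```

Applying `arg_gua_contacts` to every frame of a trajectory and counting the frames for which `is_bifurcated` is true would give an occupancy for the bifurcated interaction. `math.dist` requires Python 3.8 or later.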
[0088] While the disclosure has been particularly shown and described with reference to specific embodiments, it should be understood by those having skill in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the present disclosure as disclosed herein.