WHOLE GENOME SGRNA LIBRARY CONSTRUCTING SYSTEM AND APPLICATION THEREOF
20230187025 · 2023-06-15
Inventors
- Fengdan Xu (Suzhou, Jiangsu, CN)
- Liang Jin (Suzhou, Jiangsu, CN)
- Pengyang Xu (Suzhou, Jiangsu, CN)
- Guangyou Duan (Suzhou, Jiangsu, CN)
- Wenyan Zhao (Suzhou, Jiangsu, CN)
- Yi Ge (Suzhou, Jiangsu, CN)
Cpc classification
C12N2310/20
CHEMISTRY; METALLURGY
C12N15/111
CHEMISTRY; METALLURGY
C12N2320/11
CHEMISTRY; METALLURGY
G16B35/00
PHYSICS
C40B40/06
CHEMISTRY; METALLURGY
International classification
G16B35/00
PHYSICS
Abstract
Provided are a system for constructing a genome-wide sgRNA library and a use thereof. The system includes an input module, an sgRNA design module and an sgRNA filtering module. The genome-wide sgRNA library is constructed by building these three modules, optimizing the details and processes within them, and applying multiple design criteria and screening principles. The system and method are concise and efficient, and the resulting library is of high quality, has good activity, and is convenient to use in gene-editing research.
Claims
1. A system for constructing a genome-wide sgRNA library, comprising: (1) an input module, which is configured to download genomic sequences and annotation files from a database, and extract a coding sequence (CDS) as an input target sequence; (2) an sgRNA design module, which is configured to select candidate sgRNAs on a sense strand and an antisense strand of the target sequence according to a set parameter, perform a genome-wide sequence alignment according to a specified number of allowed mismatches, and evaluate off-target rates and grade sgRNAs according to off-target sites and a number of the off-target sites; wherein 20 nt+NGG is selected as a candidate sgRNA on the sense strand and GGN+20 nt is selected as the candidate sgRNA on the antisense strand; (3) an sgRNA filtering module, which is configured to screen evaluated and graded sgRNAs according to the following criteria: removing an sgRNA comprising 4 or more consecutive bases, ensuring that sgRNAs have no overlap, and ensuring that the sgRNAs are evenly distributed on a CDS as much as possible.
2. The system of claim 1, wherein a selection criterion of the target sequence in step (1) comprises that a CDS region is selected as the target sequence for a protein-encoding gene and an exon region is selected as the target sequence for a non-protein-encoding gene.
3. The system of claim 1, wherein the parameter in step (2) comprises a protospacer adjacent motif (PAM) sequence, a sequence length, guanine-cytosine (GC) content, a single/double-strand mode and a number of mismatches allowed in a genome alignment.
4. The system of claim 1, wherein the number of allowed mismatches in step (2) is 3 to 6, preferably 5; and preferably, off-target rate evaluation criteria in step (2) comprise: (a) filtering out an sgRNA capable of being accurately aligned to a plurality of sites in a genome; (b) an sgRNA that is only aligned to a position corresponding to the sgRNA in the genome being Best; and (c) for other sgRNAs, gradually decreasing a penalty point according to a mismatch position of 5′->3′, and comprehensively scoring the other sgRNAs in conjunction with a number of mismatches, wherein a larger penalty point corresponds to a higher risk.
5. The system of claim 1, wherein grading levels in step (2) comprise four levels: best, low-risk, moderate-risk and high-risk.
6. The system of claim 1, wherein screening criteria in step (3) further comprise any one or a combination of at least two of: selecting at most 6 sgRNAs for each target sequence, reserving only a best sgRNA and a low-risk sgRNA, ensuring that a selected sgRNA covers different transcripts of a gene as much as possible, a plurality of sgRNAs of each gene being targeted to different positions of the each gene as much as possible, and GC content of 20% to 80%, preferably, a combination of selecting at most 6 sgRNAs for each target sequence, reserving only the best sgRNA and the low-risk sgRNA, ensuring that the selected sgRNA covers the different transcripts of the gene as much as possible, the plurality of sgRNAs of each gene being targeted to the different positions of the each gene as much as possible, and the GC content of 20% to 80%.
7. A method for constructing an sgRNA library by using the system of claim 1, comprising: (1) selecting a target sequence: downloading genomic sequences and annotation files from a database, and extracting a coding sequence (CDS) as an input target sequence; (2) designing sgRNAs: selecting candidate sgRNAs on a sense strand and an antisense strand of the target sequence according to a set parameter, performing a genome-wide sequence alignment according to a specified number of allowed mismatches, and evaluating off-target rates and grading sgRNAs according to off-target sites and a number of the off-target sites; wherein 20 nt+NGG is selected as a candidate sgRNA on the sense strand and GGN+20 nt is selected as the candidate sgRNA on the antisense strand; (3) screening the sgRNAs: screening evaluated and graded sgRNAs according to the following criteria: removing an sgRNA comprising 4 or more consecutive bases, ensuring that sgRNAs have no overlap, and ensuring that the sgRNAs are evenly distributed on a CDS as much as possible.
8. The method of claim 7, wherein a selection criterion of the target sequence in step (1) comprises that a CDS region is selected as the target sequence for a protein-encoding gene and an exon region is selected as the target sequence for a non-protein-encoding gene; preferably, the parameter in step (2) comprises a protospacer adjacent motif (PAM) sequence, a sequence length, guanine-cytosine (GC) content, a single/double-strand mode and a number of mismatches allowed in a genome alignment; preferably, the number of allowed mismatches in step (2) is 3 to 6, preferably 5; preferably, off-target rate evaluation criteria in step (2) comprise: (a) filtering out an sgRNA capable of being accurately aligned to a plurality of sites in a genome; (b) an sgRNA that is only aligned to a position corresponding to the sgRNA in the genome being Best; and (c) for other sgRNAs, gradually decreasing a penalty point according to a mismatch position of 5′->3′, and comprehensively scoring the other sgRNAs in conjunction with a number of mismatches, wherein a larger penalty point corresponds to a higher risk; preferably, levels for the grading in step (2) comprise four levels: best, low-risk, moderate-risk and high-risk.
9. A method for constructing an sgRNA library by using the system of claim 1, specifically comprising: (1) selecting a target sequence: downloading genomic sequences and annotation files from a database, and extracting a coding sequence (CDS) as an input target sequence; wherein a selection criterion of the target sequence in step (1) comprises that a CDS region is selected as the target sequence for a protein-encoding gene and an exon region is selected as the target sequence for a non-protein-encoding gene; (2) designing sgRNAs: selecting candidate sgRNAs on a sense strand and an antisense strand of the target sequence according to a set parameter comprising a protospacer adjacent motif (PAM) sequence, a sequence length, guanine-cytosine (GC) content, a single/double-strand mode and a number of allowed mismatches, performing a genome-wide sequence alignment according to the number of allowed mismatches, and evaluating off-target rates and grading the sgRNAs as best, low-risk, moderate-risk and high-risk (off-target risk gradients) according to a number of mismatches and a mismatch position; wherein 20 nt+NGG is selected as a candidate sgRNA on the sense strand and GGN+20 nt is selected as the candidate sgRNA on the antisense strand; and off-target rate evaluation criteria comprise: (a) filtering out an sgRNA capable of being accurately aligned to a plurality of sites in a genome; (b) an sgRNA that is only aligned to a position corresponding to the sgRNA in the genome being Best; and (c) for other sgRNAs, gradually decreasing a penalty point according to the mismatch position of 5′->3′, and comprehensively scoring the other sgRNAs in conjunction with the number of mismatches, wherein a larger penalty point corresponds to a higher risk; (3) filtering the sgRNAs: screening evaluated and graded sgRNAs according to the following criteria: removing an sgRNA comprising 4 or more consecutive bases, ensuring that sgRNAs have no overlap, ensuring that the sgRNAs are evenly distributed on a CDS as much as possible, selecting at most 6 sgRNAs for each target sequence, reserving only a best sgRNA and a low-risk sgRNA, ensuring that a selected sgRNA covers different transcripts of a gene as much as possible, a plurality of sgRNAs of each gene being targeted to different positions of the each gene as much as possible, and GC content of 20% to 80%.
10. A genome-wide sgRNA library constructed according to the method of claim 9.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0061]
[0062]
[0063]
[0064]
DETAILED DESCRIPTION
[0065] To further elaborate on the technical means adopted and the effects achieved in the present application, the technical solutions of the present application are further described below with reference to the drawings and specific embodiments, but the present application is not limited to the scope of the embodiments.
EXAMPLE 1
[0066] A system for constructing a genome-wide sgRNA library is created. The system includes an input module, an sgRNA design module and an sgRNA filtering module.
[0067] (1) The input module is configured to download genomic sequences and annotation files from a database, and extract a CDS sequence as an input target sequence.
[0068] (2) The sgRNA design module is configured to select candidate sgRNAs on a sense strand and an antisense strand of the target sequence according to a set parameter, perform a genome-wide sequence alignment, and evaluate off-target rates and grade sgRNAs according to a specified number of allowed mismatches.
[0069] 20 nt+NGG is selected as a candidate sgRNA on the sense strand and GGN+20 nt is selected as the candidate sgRNA on the antisense strand.
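The candidate selection rule above can be sketched as a simple pattern scan. This is only an illustrative sketch, not the applicant's implementation: the function names are hypothetical, and it assumes that a "GGN+20 nt" site on the antisense strand appears, when reading the sense-strand text 5′->3′, as CCN followed by 20 nt (the reverse complement of 20 nt+NGG).

```python
import re

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_candidates(target: str, guide_len: int = 20):
    """List candidate sites as (strand, start, protospacer, pam).

    start is the 0-based offset of the leftmost base of the site
    (PAM included) on the sense-strand text. Lookaheads are used so
    that overlapping sites are all reported.
    """
    target = target.upper()
    hits = []
    # Sense strand: 20 nt followed by NGG.
    sense = re.compile(rf"(?=([ACGT]{{{guide_len}}})([ACGT]GG))")
    for m in sense.finditer(target):
        hits.append(("+", m.start(), m.group(1), m.group(2)))
    # Antisense strand (assumption): CCN + 20 nt on the sense-strand text.
    antisense = re.compile(rf"(?=(CC[ACGT])([ACGT]{{{guide_len}}}))")
    for m in antisense.finditer(target):
        site = reverse_complement(m.group(1) + m.group(2))
        hits.append(("-", m.start(), site[:guide_len], site[guide_len:]))
    return hits
```

Each antisense hit is reported in its own 5′->3′ orientation, so both strands yield a 20 nt protospacer followed by an NGG PAM.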
[0070] (3) The sgRNA filtering module is configured to screen the evaluated and graded sgRNAs according to the following criteria: removing any sgRNA that contains a run of 4 or more consecutive identical bases; ensuring that sgRNAs do not overlap; distributing the sgRNAs as evenly as possible along a CDS; selecting at most 6 sgRNAs for each target sequence; retaining only best and low-risk sgRNAs; ensuring that the selected sgRNAs cover different transcripts of a gene as far as possible; targeting the multiple sgRNAs of each gene to different positions of that gene as far as possible; and requiring a GC content of 20% to 80%.
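Two of the filtering criteria depend only on the sgRNA sequence itself and can be sketched as follows. The sketch assumes that "4 or more consecutive bases" means a run of 4 or more identical bases; the function names and the default thresholds expressed as fractions are illustrative choices, not part of the application.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_homopolymer(seq: str, run: int = 4) -> bool:
    """True if the sequence contains `run` or more identical bases in a row."""
    return re.search(rf"([ACGT])\1{{{run - 1}}}", seq.upper()) is not None


def passes_sequence_filters(seq: str, gc_min: float = 0.20,
                            gc_max: float = 0.80, run: int = 4) -> bool:
    """Apply the homopolymer and GC-content criteria to one sgRNA."""
    if has_homopolymer(seq, run):
        return False
    return gc_min <= gc_fraction(seq) <= gc_max
```

The remaining criteria (overlap, distribution along the CDS, transcript coverage) depend on the positions of several sgRNAs at once and are not covered by this per-sequence check.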
EXAMPLE 2
[0071] A pig genome-wide sgRNA library is constructed by the system in Example 1. A construction process is shown in
[0072] (1) The pig genome sequence and annotation files are downloaded from Ensembl release 90; the positions of the CDS regions of each gene are obtained by parsing the annotation files; and the CDS sequences of all genes are then extracted from the genomic sequence files according to those positions and stored in a FASTA file as the input target sequences of the sgRNA design module. For a protein-encoding gene, the CDS region is selected as the target sequence for sgRNA design. If a gene has multiple transcripts, the CDS sequences of all transcripts are used as the target sequence. A gene with only a single transcript uses all of its CDS regions as the target sequence. For a non-protein-encoding gene, the exon region is used as the target sequence.
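The extraction step can be illustrated with a minimal sketch that assumes the annotation is in the standard nine-column, tab-separated GTF layout with 1-based inclusive coordinates, and that the genome has already been loaded into a dictionary keyed by chromosome name. The function name `extract_features` and the identifiers in the usage note are hypothetical; downloading the files and writing the FASTA output are not shown.

```python
import re
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def extract_features(genome: dict, gtf_lines, feature_type: str = "CDS") -> dict:
    """Map (gene_id, transcript_id) to a list of feature sequences.

    genome maps chromosome name to its sequence; gtf_lines is any
    iterable of GTF text lines. Use feature_type="exon" for
    non-protein-encoding genes.
    """
    segments = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        (chrom, _source, feature, start, end,
         _score, strand, _frame, attrs) = line.rstrip("\n").split("\t")
        if feature != feature_type:
            continue
        gene_id = re.search(r'gene_id "([^"]+)"', attrs).group(1)
        transcript_id = re.search(r'transcript_id "([^"]+)"', attrs).group(1)
        # GTF coordinates are 1-based and inclusive.
        seq = genome[chrom][int(start) - 1:int(end)]
        if strand == "-":
            seq = reverse_complement(seq)
        segments[(gene_id, transcript_id)].append(seq)
    return dict(segments)
```

For example, with `genome = {"1": "AACCGGTTAACC"}` and a plus-strand CDS record spanning positions 3 to 6 for a gene `g1` and transcript `t1`, the result maps `("g1", "t1")` to `["CCGG"]`.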
[0073] (2) Candidate sgRNAs are selected on the sense strand and the antisense strand of the target sequence according to set parameters including the PAM sequence, the sequence length, the GC content and the single/double-strand mode, where 20 nt+NGG is selected as a candidate sgRNA on the sense strand and GGN+20 nt is selected as a candidate sgRNA on the antisense strand. A genome-wide sequence alignment is then performed; a mismatch farther from the PAM sequence (NGG or GGN) is tolerated more readily and is therefore more likely to result in off-target cleavage. Off-target rates are evaluated with the number of allowed mismatches set to 5, and the sgRNAs are graded as best, low-risk, moderate-risk or high-risk (off-target risk gradients). Finally, sgRNAs are selected: moderate-risk and high-risk sgRNAs are removed, Best sgRNAs are selected first, and low-risk sgRNAs are selected second.
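The application does not name the alignment tool used for the genome-wide search. Purely to illustrate what an alignment with a bounded number of mismatches returns, the following brute-force sketch scans every window on both strands and records the mismatch positions. It is far too slow for a real genome, it ignores the PAM requirement at each site, and all names in it are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def mismatch_positions(guide: str, site: str) -> list:
    """0-based positions (5'->3' along the guide) where guide and site differ."""
    return [i for i, (g, s) in enumerate(zip(guide, site)) if g != s]


def find_sites(guide: str, genome: dict, max_mismatches: int = 5) -> list:
    """Return (chrom, start, strand, mismatch_positions) for every window
    of the genome that matches the guide with at most max_mismatches."""
    n = len(guide)
    hits = []
    for chrom, seq in genome.items():
        for start in range(len(seq) - n + 1):
            window = seq[start:start + n]
            # Plus strand: compare the window directly.
            mm = mismatch_positions(guide, window)
            if len(mm) <= max_mismatches:
                hits.append((chrom, start, "+", mm))
            # Minus strand: compare against the window's reverse complement.
            mm = mismatch_positions(guide, reverse_complement(window))
            if len(mm) <= max_mismatches:
                hits.append((chrom, start, "-", mm))
    return hits
```

A hit with an empty mismatch list is an exact match; the on-target site itself appears in the output as one such hit.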
[0074] Off-target rate evaluation criteria are described below.
[0075] (a) An sgRNA that aligns exactly to multiple sites in the genome is filtered out.
[0076] (b) An sgRNA that aligns only to its own position in the genome is graded Best.
[0077] (c) For the other sgRNAs, the penalty point assigned to a mismatch decreases gradually with the mismatch position in the 5′->3′ direction, and each sgRNA is scored by combining these penalty points with the number of mismatches, where a larger penalty point corresponds to a higher risk.
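The application does not disclose the penalty values or the thresholds between grades, so the following is only a sketch of how criteria (a) to (c) could be combined. The linear position weights, the way the number of mismatches enters the score, and the two threshold values are all assumptions made for illustration.

```python
def site_penalty(mismatches: list, guide_len: int = 20) -> float:
    """Penalty contributed by one off-target site.

    Assumption: a mismatch at 0-based position p (5'->3') has weight
    (guide_len - p) / guide_len, so the weight decreases from 5' to 3'.
    The mean weight is divided by the number of mismatches, so a site
    with fewer mismatches contributes a larger penalty.
    """
    k = len(mismatches)
    weights = [(guide_len - p) / guide_len for p in mismatches]
    return (sum(weights) / k) / k


def grade(hits: list, guide_len: int = 20,
          low_max: float = 0.5, moderate_max: float = 1.5) -> str:
    """Grade one sgRNA from the mismatch-position lists of all its
    genomic hits (the on-target hit is an empty list)."""
    exact_matches = sum(1 for h in hits if not h)
    if exact_matches > 1:
        return "filtered"           # criterion (a)
    off_targets = [h for h in hits if h]
    if not off_targets:
        return "best"               # criterion (b)
    total = sum(site_penalty(h, guide_len) for h in off_targets)  # criterion (c)
    if total <= low_max:            # hypothetical threshold
        return "low-risk"
    if total <= moderate_max:       # hypothetical threshold
        return "moderate-risk"
    return "high-risk"
```

Under these assumed weights, an sgRNA whose only off-target site has five mismatches at positions 15 to 19 is graded low-risk, while one with a single-mismatch site at position 0 is graded moderate-risk.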
[0078] (3) The evaluated and graded sgRNAs are screened according to the following criteria: removing any sgRNA that contains a run of 4 or more consecutive identical bases; ensuring that sgRNAs do not overlap; distributing the sgRNAs as evenly as possible along a CDS; selecting at most 6 sgRNAs for each target sequence; retaining only the best and low-risk sgRNAs; ensuring that the selected sgRNAs cover different transcripts of a gene as far as possible; targeting the multiple sgRNAs of each gene to different positions of that gene as far as possible; and requiring a GC content of 20% to 80%.
[0079] For a protein-encoding gene, sgRNAs close to the 5′ end are preferred, and no more than 2 sgRNAs are selected for each CDS. For a non-protein-encoding gene, 4 sgRNAs are designed on the exon sequence of the gene, and the designed sgRNAs do not overlap.
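The per-gene selection for a protein-encoding gene can be sketched as a greedy pass over the graded candidates. This is an illustrative simplification rather than the disclosed procedure: the `Candidate` type, its fields and the 23 nt site length used for the overlap test are hypothetical, and the even-distribution and transcript-coverage criteria are not modelled.

```python
from dataclasses import dataclass

GRADE_RANK = {"best": 0, "low-risk": 1}  # other grades are discarded


@dataclass(frozen=True)
class Candidate:
    seq: str      # sgRNA identifier or sequence
    cds_id: str   # CDS segment the sgRNA lies on
    start: int    # 0-based offset from the 5' end of the gene
    grade: str    # "best", "low-risk", "moderate-risk" or "high-risk"


def select_guides(candidates, max_per_gene: int = 6, max_per_cds: int = 2,
                  site_len: int = 23) -> list:
    """Greedily pick sgRNAs for one gene: best before low-risk, 5'-most
    first, no overlapping sites, and at most max_per_cds per CDS."""
    eligible = [c for c in candidates if c.grade in GRADE_RANK]
    eligible.sort(key=lambda c: (GRADE_RANK[c.grade], c.start))
    chosen = []
    per_cds = {}
    for cand in eligible:
        if len(chosen) >= max_per_gene:
            break
        if per_cds.get(cand.cds_id, 0) >= max_per_cds:
            continue
        # Two sites overlap if their start offsets are closer than site_len.
        if any(abs(cand.start - other.start) < site_len for other in chosen):
            continue
        chosen.append(cand)
        per_cds[cand.cds_id] = per_cds.get(cand.cds_id, 0) + 1
    return chosen
```

Sorting by grade first means that a low-risk sgRNA near the 5′ end is taken only after every usable Best sgRNA has been considered; a different priority order would be an equally plausible reading of the criteria.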
[0080] (4) General overview of the library: in the constructed pig genome-wide sgRNA library, sgRNAs are obtained for 20438 genes in total, of which 17410 genes have 6 sgRNAs and 2828 genes have 1-5 sgRNAs. Experiments on sgRNA quality show that sgRNAs graded low-risk or better are all high-quality sgRNAs, and that the sgRNAs in the constructed library all have activity sufficient for subsequent experiments.
[0081] The applicant has stated that although the detailed method of the present application is described through the embodiments described above, the present application is not limited to the detailed method described above, which means that implementation of the present application does not necessarily depend on the detailed method described above. It should be apparent to those skilled in the art that any improvements made to the present application, equivalent replacements of various raw materials of the product, the addition of adjuvant ingredients, and the selection of specific manners, etc. in the present application all fall within the protection scope and the scope of disclosure of the present application.
[0083] FIG. 1:
[0084] PAM sequence
[0085] Guide RNA
[0086] Cas9 endonuclease
[0087] Matching with genomic sequences
[0088] Genomic DNA
[0089] Double-stranded DNA break repair
[0090] Donor DNA molecule
[0091] Targeted genome modification
[0092] Human cell
[0093] Zebrafish
[0094] Bacterial cell
[0095] FIG. 4:
[0096] 1. Genome-wide CDS region screening
[0097] 2. Design of sgRNA recognition site guiding sequences
[0098] 3. Genome-wide off-target site detection
[0099] 4. Score designed sgRNAs according to information on off-target sites and positions of the off-target sites
[0100] 5. Result screening and statistics
[0101] 6. Algorithm optimization and software development