CLASSIFICATION MODEL TRAINING METHOD, SYSTEM, ELECTRONIC DEVICE AND STORAGE MEDIUM

Abstract

Provided are a classification model training method, system, electronic device, and storage medium. The method includes: determining sampling rates of first-class samples and second-class samples in a data set, and setting the samples with a sampling rate less than a preset value as target samples (S101); determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set (S102); wherein the data distribution feature information is information describing the number of same-class samples in nearest neighbor samples, and the nearest neighbor samples are two samples at a Euclidean distance less than a preset distance; generating new samples corresponding to the target samples based on the data distribution feature information (S103); and training the classification model using the first-class samples, the second-class samples and the new samples (S104).

Claims

1. A classification model training method, characterized by comprising: determining sampling rates of first-class samples and second-class samples in a data set, and setting the samples with a sampling rate less than a preset value as target samples; determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set, wherein the data distribution feature information is information describing the number of same-class samples in nearest neighbor samples, and the nearest neighbor samples are two samples at a Euclidean distance less than a preset distance; generating new samples corresponding to the target samples based on the data distribution feature information; and training the classification model using the first-class samples, the second-class samples and the new samples.

2. The classification model training method of claim 1, wherein the determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set comprises: calculating a superiority ratio between any two nearest neighbor target samples using a first formula, and determining the superiority ratio as the data distribution feature information, wherein the nearest neighbor target samples are two target samples at a Euclidean distance less than the preset distance; wherein the first formula is Rat.sub.im=Numx.sub.i/Numx.sub.im, where Rat.sub.im is a superiority ratio between a sample x.sub.i and a sample x.sub.im, x.sub.i is any sample in the target samples, x.sub.im is an m-th nearest neighbor sample in k same-class nearest neighbor samples of the sample x.sub.i, Numx.sub.i is the number of target samples in k nearest neighbor samples of the sample x.sub.i, and Numx.sub.im is the number of target samples in k nearest neighbor samples of the sample x.sub.im.

3. The classification model training method of claim 2, wherein the generating new samples corresponding to the target samples based on the data distribution feature information comprises: in case that the superiority ratio is less than 1, generating a new sample x.sub.newim corresponding to the target samples using a second formula, wherein the second formula is x.sub.newim=x.sub.im+rand(0, 1)*Rat.sub.im*(x.sub.i−x.sub.im); in case that the superiority ratio is greater than 1, generating a new sample x.sub.newim corresponding to the target samples using a third formula, wherein the third formula is x.sub.newim=x.sub.i+rand(0, 1)/Rat.sub.im*(x.sub.im−x.sub.i); and in case that the superiority ratio is equal to 1, generating a new sample x.sub.newim corresponding to the target samples using a fourth formula, wherein the fourth formula is x.sub.newim=x.sub.i+rand(0, 1)*(x.sub.im−x.sub.i).

4. The classification model training method of claim 1, wherein the training the classification model using the first-class samples, the second-class samples and the new samples comprises: performing a sampling operation on the first-class samples, the second-class samples and the new samples to obtain a sampling result, and performing a training operation on the classification model based on the sampling result to obtain a trained file type detection model.

5. The classification model training method of claim 4, wherein the first-class samples are virus file samples, the second-class samples are non-virus file samples and the classification model is a file type detection model.

6. The classification model training method of claim 5, wherein after the performing a training operation on the classification model based on the sampling result, the method further comprises: performing a detection operation on an unknown file using the trained file type detection model to generate a detection result, to determine whether the unknown file is a virus file based on the detection result.

7. The classification model training method of claim 1, wherein the determining sampling rates of first-class samples and second-class samples in a data set comprises: determining the sampling rates of the first-class samples and the second-class samples in the data set based on quantitative proportions of the samples in the data set.

8. (canceled)

9. An electronic device, characterized by comprising a memory and a processor, wherein the memory has a computer program stored thereon; and the processor, when calling the computer program stored in the memory, implements a classification model training method, comprising: determining sampling rates of first-class samples and second-class samples in a data set, and setting the samples with a sampling rate less than a preset value as target samples; determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set, wherein the data distribution feature information is information describing the number of same-class samples in nearest neighbor samples, and the nearest neighbor samples are two samples at a Euclidean distance less than a preset distance; generating new samples corresponding to the target samples based on the data distribution feature information; and training the classification model using the first-class samples, the second-class samples and the new samples.

10. A storage medium, characterized by having computer-executable instructions stored thereon, wherein the computer-executable instructions, when loaded and executed by a processor, implement a classification model training method, comprising: determining sampling rates of first-class samples and second-class samples in a data set, and setting the samples with a sampling rate less than a preset value as target samples; determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set, wherein the data distribution feature information is information describing the number of same-class samples in nearest neighbor samples, and the nearest neighbor samples are two samples at a Euclidean distance less than a preset distance; generating new samples corresponding to the target samples based on the data distribution feature information; and training the classification model using the first-class samples, the second-class samples and the new samples.

11. The electronic device of claim 9, wherein the determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set comprises: calculating a superiority ratio between any two nearest neighbor target samples using a first formula, and determining the superiority ratio as the data distribution feature information, wherein the nearest neighbor target samples are two target samples at a Euclidean distance less than the preset distance; wherein the first formula is Rat.sub.im=Numx.sub.i/Numx.sub.im, where Rat.sub.im is a superiority ratio between a sample x.sub.i and a sample x.sub.im, x.sub.i is any sample in the target samples, x.sub.im is an m-th nearest neighbor sample in k same-class nearest neighbor samples of the sample x.sub.i, Numx.sub.i is the number of target samples in k nearest neighbor samples of the sample x.sub.i, and Numx.sub.im is the number of target samples in k nearest neighbor samples of the sample x.sub.im.

12. The electronic device of claim 11, wherein the generating new samples corresponding to the target samples based on the data distribution feature information comprises: in case that the superiority ratio is less than 1, generating a new sample x.sub.newim corresponding to the target samples using a second formula, wherein the second formula is x.sub.newim=x.sub.im+rand(0, 1)*Rat.sub.im*(x.sub.i−x.sub.im); in case that the superiority ratio is greater than 1, generating a new sample x.sub.newim corresponding to the target samples using a third formula, wherein the third formula is x.sub.newim=x.sub.i+rand(0, 1)/Rat.sub.im*(x.sub.im−x.sub.i); and in case that the superiority ratio is equal to 1, generating a new sample x.sub.newim corresponding to the target samples using a fourth formula, wherein the fourth formula is x.sub.newim=x.sub.i+rand(0, 1)*(x.sub.im−x.sub.i).

13. The electronic device of claim 9, wherein the training the classification model using the first-class samples, the second-class samples and the new samples comprises: performing a sampling operation on the first-class samples, the second-class samples and the new samples to obtain a sampling result, and performing a training operation on the classification model based on the sampling result to obtain a trained file type detection model.

14. The electronic device of claim 13, wherein the first-class samples are virus file samples, the second-class samples are non-virus file samples and the classification model is a file type detection model.

15. The electronic device of claim 14, wherein after the performing a training operation on the classification model based on the sampling result, the method further comprises: performing a detection operation on an unknown file using the trained file type detection model to generate a detection result, to determine whether the unknown file is a virus file based on the detection result.

16. The storage medium of claim 10, wherein the determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set comprises: calculating a superiority ratio between any two nearest neighbor target samples using a first formula, and determining the superiority ratio as the data distribution feature information, wherein the nearest neighbor target samples are two target samples at a Euclidean distance less than the preset distance; wherein the first formula is Rat.sub.im=Numx.sub.i/Numx.sub.im, where Rat.sub.im is a superiority ratio between a sample x.sub.i and a sample x.sub.im, x.sub.i is any sample in the target samples, x.sub.im is an m-th nearest neighbor sample in k same-class nearest neighbor samples of the sample x.sub.i, Numx.sub.i is the number of target samples in k nearest neighbor samples of the sample x.sub.i, and Numx.sub.im is the number of target samples in k nearest neighbor samples of the sample x.sub.im.

17. The storage medium of claim 16, wherein the generating new samples corresponding to the target samples based on the data distribution feature information comprises: in case that the superiority ratio is less than 1, generating a new sample x.sub.newim corresponding to the target samples using a second formula, wherein the second formula is x.sub.newim=x.sub.im+rand(0, 1)*Rat.sub.im*(x.sub.i−x.sub.im); in case that the superiority ratio is greater than 1, generating a new sample x.sub.newim corresponding to the target samples using a third formula, wherein the third formula is x.sub.newim=x.sub.i+rand(0, 1)/Rat.sub.im*(x.sub.im−x.sub.i); and in case that the superiority ratio is equal to 1, generating a new sample x.sub.newim corresponding to the target samples using a fourth formula, wherein the fourth formula is x.sub.newim=x.sub.i+rand(0, 1)*(x.sub.im−x.sub.i).

18. The storage medium of claim 10, wherein the training the classification model using the first-class samples, the second-class samples and the new samples comprises: performing a sampling operation on the first-class samples, the second-class samples and the new samples to obtain a sampling result, and performing a training operation on the classification model based on the sampling result to obtain a trained file type detection model.

19. The storage medium of claim 18, wherein the first-class samples are virus file samples, the second-class samples are non-virus file samples and the classification model is a file type detection model.

20. The storage medium of claim 19, wherein after the performing a training operation on the classification model based on the sampling result, the method further comprises: performing a detection operation on an unknown file using the trained file type detection model to generate a detection result, to determine whether the unknown file is a virus file based on the detection result.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0034] In order to describe the technical solutions in the embodiments of the present application or the conventional art more clearly, the drawings needed to be used in descriptions about the embodiments or the conventional art will be simply introduced below. It is apparent that the drawings described below are merely some embodiments of the present application. Other drawings may further be obtained by those of ordinary skill in the art according to these drawings without creative work.

[0035] FIG. 1 is a flowchart of a classification model training method according to an embodiment of the present application.

[0036] FIG. 2 is a flowchart of a method for sampling an unbalanced data set according to an embodiment of the present application.

[0037] FIG. 3 is a schematic diagram of an inclination of a new sample according to an embodiment of the present application.

[0038] FIG. 4 is a schematic structural diagram of a classification model training system according to an embodiment of the present application.

DETAILED DESCRIPTION

[0039] In order to make the objective, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely below in combination with the drawings in the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those skilled in the art without creative work shall fall within the protection scope of the present application.

[0040] Reference is made to FIG. 1 below. FIG. 1 is a flowchart of a classification model training method according to an embodiment of the present application.

[0041] The following specific steps may be included.

[0042] Step S101: determining sampling rates of first-class samples and second-class samples in a data set, and setting the samples with a sampling rate less than a preset value as target samples.

[0043] The data set mentioned in this step may include first-class samples and second-class samples. Specifically, the first-class samples may be positive samples, and the second-class samples may be negative samples. In the present embodiment, the sampling rates of the first-class samples and second-class samples in the data set may be determined based on quantitative proportions of the samples in the data set. Specifically, the sampling rate of the samples corresponding to a larger quantitative proportion is higher. It should be understood that the sampling rate is related to the number of the samples, as well as a parameter set for training the classification model.

[0044] In the present embodiment, the samples with a sampling rate less than the preset value are set as the target samples. For example, when the preset value is 1, the first-class samples are set as the target samples if the sampling rate of the first-class samples is less than 1, and the second-class samples are set as the target samples if the sampling rate of the second-class samples is less than 1. Certainly, the preset value may be set flexibly according to a practical application scenario, and no limits are made herein. This step aims to set the samples of a class accounting for a relatively small proportion in the data set as the target samples, thereby generating new samples of the same class in subsequent steps to further balance the proportions of the samples in the data set.
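The selection of target samples in step S101 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes that the sampling rate of a class is its sample count divided by the count of the other class (so that the class with the larger proportion has the higher rate), and the function name `pick_target_class` and the parameter `preset_value` are hypothetical.

```python
from collections import Counter

def pick_target_class(labels, preset_value=1.0):
    """Return (target_label, rates), where rates maps each class to its sampling rate.

    Assumption: the sampling rate of a class is its count divided by the count
    of the other class, so the larger class gets the higher rate.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (a, n_a), (b, n_b) = counts.items()
    rates = {a: n_a / n_b, b: n_b / n_a}
    # Classes whose rate falls below the preset value become the target samples.
    targets = [c for c, r in rates.items() if r < preset_value]
    return (targets[0] if targets else None), rates

labels = [1] * 20 + [0] * 80   # 20 first-class samples, 80 second-class samples
target, rates = pick_target_class(labels)
print(target, rates)
```

With a preset value of 1, only the class that is outnumbered can be selected, which matches the stated aim of this step. A perfectly balanced data set yields no target class under this assumption.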

[0045] Step S102: determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set.

[0046] Before this step, an operation of calculating the Euclidean distances between all the samples in the data set may be performed. Specifically, the Euclidean distances may include Euclidean distances between the first-class samples and Euclidean distances between the second-class samples, and may further include Euclidean distances between the first-class samples and the second-class samples. A data distribution feature of the target samples may be obtained based on the Euclidean distances between all the samples. Here, the data distribution feature information is information describing the number of same-class samples in nearest neighbor samples, and the nearest neighbor samples are two samples at a Euclidean distance less than a preset distance. In the present embodiment, all samples at Euclidean distances less than the preset distance from a certain sample are determined as nearest neighbor samples of the sample. Nearest neighbor samples of a sample may include samples of the same class, or samples of different classes.
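The neighbor definition in step S102 can be illustrated with a brute-force NumPy sketch. It computes all pairwise Euclidean distances, treats samples closer than a preset distance as nearest neighbors, and counts how many of those neighbors share the sample's class. The function name `same_class_neighbor_counts` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def same_class_neighbor_counts(X, y, preset_distance):
    """For every sample, count neighbors within preset_distance and how many share its class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all samples in the data set.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    near = dist < preset_distance
    np.fill_diagonal(near, False)              # a sample is not its own neighbor
    same = near & (y[:, None] == y[None, :])   # neighbors of the same class only
    return near.sum(axis=1), same.sum(axis=1)

X = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [3.0, 3.0]]
y = [1, 1, 0, 0]
total, same = same_class_neighbor_counts(X, y, preset_distance=1.0)
print(total, same)
```

The full distance matrix costs memory quadratic in the number of samples, so this form is only suitable for small data sets; a spatial index would be the usual substitute for larger ones.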

[0047] Step S103: generating new samples corresponding to the target samples based on the data distribution feature information.

[0048] In the present embodiment, new samples corresponding to the target samples are generated based on the obtained data distribution feature information. Specifically, a region in which the target samples are more densely distributed has lower noise within the samples and less significant marginalization. Therefore, in the present embodiment, the new samples corresponding to the target samples may be generated in a region in which the target samples are densely distributed, based on the data distribution feature. It should be understood that this step aims to generate the new samples based on the target samples accounting for a relatively small proportion in the data set, to further balance the numbers of the samples of each class in the data set. As a possible implementation, in the present embodiment, a corresponding number of new samples may be generated based on the difference between the sample numbers of the first-class samples and second-class samples in the data set, such that the first-class samples and the second-class samples are in a number balance state after the new samples are added to the data set. Specifically, the number balance state refers to a state that the difference between the sample numbers of the first-class samples and the second-class samples is within a preset range.
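The number balance state described above implies a simple count of how many new samples are required. The following sketch assumes the preset range is expressed as a maximum allowed difference in sample counts; the function name `new_samples_needed` and the parameter `preset_range` are hypothetical.

```python
def new_samples_needed(n_first, n_second, preset_range=0):
    """Number of target-class samples to generate so the class counts differ by at most preset_range."""
    gap = abs(n_first - n_second)
    # Nothing needs to be generated when the gap is already within the preset range.
    return max(0, gap - preset_range)

print(new_samples_needed(20, 80))
print(new_samples_needed(20, 80, preset_range=10))
```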

[0049] Step S104: training the classification model using the first-class samples, the second-class samples and the new samples.

[0050] Based on obtaining the new samples, in the present embodiment, the new samples may be added to the data set to further train the classification model using the samples in the data set. The classification model mentioned in the present embodiment may be a face recognition model, and furthermore, after a picture is input to the classification model, the classification model may determine whether the picture includes a face image. Alternatively, the classification model may be a virus detection model, and furthermore, after an unknown file is input to the classification model, the classification model may determine whether the unknown file is a virus file.

[0051] In the present embodiment, the first-class samples or second-class samples with a sampling rate less than the preset value are set as the target samples, the target samples being samples of a class accounting for a relatively small proportion in the data set. If the classification model is trained directly using the samples in the data set, the classification model would have a greater tendency to recognize a class accounting for a relatively large proportion in the data set, which affects the recognition effect. In the present embodiment, the data distribution feature information of the target samples is determined based on the Euclidean distances between all the samples, and the new samples of the same class as the target samples are generated dynamically based on the data distribution feature information. As such, the numbers of the samples of each class in the data set are further balanced, and relatively poor model training effects caused by an imbalance between sample classes are avoided. It can be seen that, in the embodiments of the present application, the numbers of samples of various classes in the data set may be balanced, and the prediction accuracy of the classification model may be improved.

[0052] As a further introduction to the embodiment corresponding to FIG. 1, the operation in S102 in the embodiment corresponding to FIG. 1 may specifically be implemented by calculating a superiority ratio between any two nearest neighbor target samples using a first formula, and determining the superiority ratio as the data distribution feature information. Here, the nearest neighbor target samples are two target samples at a Euclidean distance less than the preset distance, and the superiority ratio is information describing a superiority of a region between the pair of nearest neighbor target samples. In the present embodiment, the number of same-class samples within the preset distance from a specific sample is determined as an evaluation criterion of a superiority of a region. With a larger sample number, the region where the sample is located has a higher superiority. The region where the sample is located refers to all regional ranges within the preset distance from the sample. For example, there are 10 same-class nearest neighbor samples for sample A and 20 same-class nearest neighbor samples for sample B. In such case, it may be determined that a superiority of a region where sample B is located is higher than that of a region where sample A is located.

[0053] Specifically, the first formula is Rat.sub.im=Numx.sub.i/Numx.sub.im, where Rat.sub.im is a superiority ratio between a sample x.sub.i and a sample x.sub.im, x.sub.i is any sample in the target samples, x.sub.im is an m-th nearest neighbor sample in k same-class nearest neighbor samples of the sample x.sub.i, Numx.sub.i is the number of target samples in k nearest neighbor samples of the sample x.sub.i, and Numx.sub.im is the number of target samples in k nearest neighbor samples of the sample x.sub.im.
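A minimal sketch of the first formula follows, using brute-force k-nearest-neighbor search in NumPy. The helper names are hypothetical, and the handling of Numx.sub.im = 0 (returning infinity) is an assumption, since the text does not say what happens when the denominator is zero.

```python
import numpy as np

def knn_indices(X, i, k, candidates):
    """Indices of the k candidates closest to X[i], excluding i itself."""
    cand = np.array([j for j in candidates if j != i])
    d = np.linalg.norm(X[cand] - X[i], axis=1)
    return cand[np.argsort(d, kind="stable")[:k]]

def num_target_in_knn(X, y, i, k, target):
    """Num: target-class samples among the k nearest neighbors of X[i], both classes considered."""
    nbrs = knn_indices(X, i, k, range(len(X)))
    return int(np.sum(y[nbrs] == target))

def superiority_ratio(X, y, i, im, k, target):
    """Rat_im = Numx_i / Numx_im (assumed guard: infinity when Numx_im is 0)."""
    num_i = num_target_in_knn(X, y, i, k, target)
    num_im = num_target_in_knn(X, y, im, k, target)
    return num_i / num_im if num_im else float("inf")

# Toy one-dimensional data: three target samples (label 1) next to three others (label 0).
X = np.array([[0.0], [1.0], [2.0], [2.4], [2.8], [6.0]])
y = np.array([1, 1, 1, 0, 0, 0])
rat_21 = superiority_ratio(X, y, i=2, im=1, k=3, target=1)
rat_12 = superiority_ratio(X, y, i=1, im=2, k=3, target=1)
print(rat_21, rat_12)
```

In this toy data the sample at 2.0 sits next to the other class, so it has fewer target-class neighbors than the sample at 1.0, and the ratio between the two is below 1 in one direction and above 1 in the other.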

[0054] If the method of determining the superiority ratio as the data distribution feature information is combined with the embodiment corresponding to FIG. 1, the operation for generating new samples in S103 in FIG. 1 may include the following steps.

[0055] in case that the superiority ratio is less than 1, generating a new sample x.sub.newim corresponding to the target samples using a second formula, wherein the second formula is x.sub.newim=x.sub.im+rand(0, 1)*Rat.sub.im*(x.sub.i−x.sub.im);

[0056] in case that the superiority ratio is greater than 1, generating a new sample x.sub.newim corresponding to the target samples using a third formula, wherein the third formula is x.sub.newim=x.sub.i+rand(0, 1)/Rat.sub.im*(x.sub.im−x.sub.i); and

[0057] in case that the superiority ratio is equal to 1, generating a new sample x.sub.newim corresponding to the target samples using a fourth formula, wherein the fourth formula is x.sub.newim=x.sub.i+rand(0, 1)*(x.sub.im−x.sub.i).
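The second, third, and fourth formulas can be sketched as a single function. This is an illustrative reading only: it draws one random number per new sample, and the function name `generate_new_sample` and the use of NumPy's `default_rng` are assumptions.

```python
import numpy as np

def generate_new_sample(x_i, x_im, rat, rng):
    """Interpolate between x_i and x_im, leaning toward the sample in the superior region."""
    x_i = np.asarray(x_i, dtype=float)
    x_im = np.asarray(x_im, dtype=float)
    r = rng.random()                              # uniform in [0, 1), plays the role of rand(0, 1)
    if rat < 1:                                   # second formula: start from x_im
        return x_im + r * rat * (x_i - x_im)
    if rat > 1:                                   # third formula: start from x_i
        return x_i + r / rat * (x_im - x_i)
    return x_i + r * (x_im - x_i)                 # fourth formula: plain interpolation

rng = np.random.default_rng(0)
x_i, x_im = [0.0, 0.0], [2.0, 0.0]
near_im = generate_new_sample(x_i, x_im, rat=0.5, rng=rng)
near_i = generate_new_sample(x_i, x_im, rat=2.0, rng=rng)
print(near_im, near_i)
```

With these endpoints, a ratio of 0.5 keeps the new sample in the half of the segment nearest x.sub.im, and a ratio of 2 keeps it in the half nearest x.sub.i.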

[0058] In the above-mentioned implementation, the new samples may be generated in a superior region based on the distribution features and tendency of the first-class samples and second-class samples in the data set, thereby further improving a training effect of the classification model. The function rand(0, 1) returns a uniformly distributed random real number greater than or equal to 0 and less than 1.

[0059] As a further introduction to the embodiment corresponding to FIG. 1, the operation for training the classification model in S104 may include: performing a sampling operation on the first-class samples, the second-class samples and the new samples, and performing a training operation on the classification model based on the sampling result.

[0060] Further, in the embodiment corresponding to FIG. 1, the first-class samples may be virus file samples, the second-class samples may be non-virus file samples, and the classification model may be a file type detection model. Correspondingly, after the training operation is performed on the file type detection model based on the sampling result, a detection operation may further be performed on an unknown file using a trained file type detection model to generate a detection result, to determine whether the unknown file is a virus file based on the detection result.

[0061] The flow described in the above-mentioned embodiment will be described below with an embodiment in a practical application. Referring to FIG. 2, FIG. 2 is a flowchart of a method for sampling an unbalanced data set according to an embodiment of the present application. A method for sampling an unbalanced data set is described in the present embodiment. Minority-class samples are generated dynamically based on distribution features and sampling rates of unbalanced data as well as data distribution features of existing data sets. A sample generation mode is controlled to ensure that the new samples are generated in a superior region, thereby mitigating the aggravation of sample marginalization and reducing the probability that the new samples are noise.

[0062] In the present embodiment, a sample is classified and evaluated based on the region where the sample is located, and a proportion of same-class samples in k nearest neighbors of the sample is adopted as a classification standard. A new sample, when constructed, is inclined more to a sample corresponding to a larger proportion of same-class samples in the k nearest neighbor samples, thereby ensuring that the new sample is generated in a superior and more reasonable region. The basic idea of the present embodiment is as follows. k nearest neighbor samples of all minority-class samples are calculated. A quantitative proportion of same-class samples in the k nearest neighbor samples of each minority-class sample is statistically obtained as a standard for evaluating a superiority of the sample. k nearest neighbor samples in samples of the same class as the sample are statistically obtained. N samples are selected from the k nearest neighbor samples as auxiliary samples based on the sampling rate. Values of the sample and the auxiliary samples thereof are calculated, and each eigenvalue of a new sample is generated based on the values and a calculation rule, to obtain an additional sample by combination. The additional sample is added to the data set, to obtain a final balanced data set. Specifically, the present embodiment may include the following steps.

[0063] Step 1: Determining a Sampling Rate.

[0064] If the sampling rate N is less than or equal to 1, an original minority-class sample set is randomly sampled directly according to the sampling rate N, and a random sampling result is determined as an output result of a Tency-SMOTE algorithm. If the sampling rate N is greater than 1, the sampling rate is rounded, and the next step is performed.
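The branch in Step 1 can be sketched as follows. The sketch assumes that random sampling "according to the sampling rate N" means keeping a fraction N of the original minority-class set, and that rounding uses Python's built-in `round`; both are interpretations, and the function name `apply_sampling_rate` is hypothetical.

```python
import random

def apply_sampling_rate(minority, n_rate, seed=None):
    """Return (subset, None) when n_rate <= 1, otherwise (None, rounded rate)."""
    rng = random.Random(seed)
    if n_rate <= 1:
        # Assumed meaning: keep a random fraction n_rate of the minority set and stop.
        size = round(n_rate * len(minority))
        return rng.sample(minority, size), None
    # Otherwise round the rate and continue with Step 2.
    return None, round(n_rate)

subset, n = apply_sampling_rate(list(range(10)), 0.5, seed=0)
print(len(subset), n)
print(apply_sampling_rate(list(range(10)), 2.6))
```

Note that Python's `round` rounds exact halves to the nearest even integer, so a different rounding rule may be needed if the intended behavior is to round halves up.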

[0065] Step 2: Calculating a Superiority of a Region where a Sample is Located.

[0066] The superiority of the region where the sample is located is determined based on a proportion of same-class samples in k nearest neighbor samples of the sample. The superiority of the region where the sample is located is determined as follows.

[0067] For a sample x.sub.i of a minority class, x.sub.im represents the m-th (m<=k) nearest neighbor in k same-class nearest neighbors of the sample x.sub.i. Numx.sub.i represents the number of minority-class samples in the k nearest neighbor samples of the sample x.sub.i when two classes of samples are considered at the same time. Numx.sub.im represents the number of minority-class samples in k nearest neighbors of the sample x.sub.im when two classes of samples are considered at the same time. x.sub.newim represents a new sample extended according to the sample x.sub.i and the sample x.sub.im. Rat.sub.im=Numx.sub.i/Numx.sub.im is defined as a superiority ratio between the sample x.sub.i and the sample x.sub.im. Rat.sub.im<1 indicates that more minority-class samples are distributed around the sample x.sub.im than around the sample x.sub.i, namely a region where the sample x.sub.im is located is superior to that where the sample x.sub.i is located. Therefore, a superiority relationship between a certain sample and an auxiliary sample thereof is determined by the Rat.sub.im value.

[0068] Step 3: Using Different Generation Strategies Based on Different Superiorities of the Region where the Sample is Located.

[0069] Based on the above definition, when a new sample is generated, the new sample is inclined more to the sample x.sub.im (or the region where x.sub.im is located). Referring to FIG. 3, FIG. 3 is a schematic diagram of an inclination of a new sample according to an embodiment of the present application.

[0070] For the sample x.sub.i and the nearest neighbor sample x.sub.im thereof, when a proportion (or number) of minority-class samples in nearest neighbor samples of the sample x.sub.i is greater than that of minority-class samples in nearest neighbor samples of the sample x.sub.im, the newly generated sample x.sub.newim is inclined more to the sample x.sub.i to ensure that the new sample x.sub.newim is generated in a superior region. That is, in FIG. 3, the new sample x.sub.newim is at the left side of the straight line at a higher probability. That is, the following different new sample generation strategies are used according to the Rat.sub.im value between a certain sample and an auxiliary sample thereof:

[00001] $x_{newim} = \begin{cases} x_{im} + rand(0, 1) \cdot {Rat}_{im} \cdot (x_{i} - x_{im}), & {Rat}_{im} < 1 \\ x_{i} + rand(0, 1) / {Rat}_{im} \cdot (x_{im} - x_{i}), & {Rat}_{im} > 1 \\ x_{i} + rand(0, 1) \cdot (x_{im} - x_{i}), & {Rat}_{im} = 1 \end{cases}$

[0071] The above-mentioned sample generation method is analyzed below in detail.

[0072] (a) In case of Rat.sub.im<1, the sample x.sub.i may appear as a sample of a boundary class or a sensitive class. According to the principle that a new sample is in a superior minority-class region, the sample x.sub.newim newly extended in such case is inclined more to the sample x.sub.im, namely:

x.sub.newim=x.sub.im+rand(0, 1)*Rat.sub.im*(x.sub.i−x.sub.im).

[0073] (b) The case that Rat.sub.im is greater than 1 may occur to samples of the boundary class or the sensitive class. Similarly, the sample x.sub.newim newly generated in such case is inclined more to the sample x.sub.i:

x.sub.newim=x.sub.i+rand(0, 1)/Rat.sub.im*(x.sub.im−x.sub.i).

[0074] (c) The case that Rat.sub.im is equal to 1 may occur to samples of the boundary class and the sensitive class, and all samples of the safe class satisfy this condition. In such case, the newly generated sample x.sub.newim is inclined equally to the sample x.sub.i and the sample x.sub.im:

x.sub.newim=x.sub.i+rand(0, 1)*(x.sub.im−x.sub.i).

[0075] It is to be noted that the formula in case (c) is also the generation formula of the original SMOTE algorithm.
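
As a non-limiting illustration, the three generation strategies in cases (a) to (c) may be sketched in Python as follows. The function name generate_new_sample and the parameter r, which stands in for one draw of rand(0, 1), are assumptions made for this sketch and do not appear in the embodiment.

```python
from typing import List, Sequence


def generate_new_sample(x_i: Sequence[float], x_im: Sequence[float],
                        rat: float, r: float) -> List[float]:
    """Apply the piecewise generation rule to one sample pair.

    rat is the superiority ratio Rat_im; r stands in for rand(0, 1)
    (hypothetical parameter, passed in so that results are reproducible).
    """
    if rat < 0.0:
        raise ValueError("rat must be non-negative")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    if rat < 1.0:
        # Case (a): start from x_im and move at most a fraction rat toward x_i.
        return [b + r * rat * (a - b) for a, b in zip(x_i, x_im)]
    if rat > 1.0:
        # Case (b): start from x_i and move at most a fraction 1/rat toward x_im.
        return [a + r / rat * (b - a) for a, b in zip(x_i, x_im)]
    # Case (c): plain interpolation between x_i and x_im.
    return [a + r * (b - a) for a, b in zip(x_i, x_im)]
```

For example, with x.sub.i=(0, 0), x.sub.im=(4, 8) and r=0.5, a ratio of 2 yields (1, 2), a ratio of 0.5 yields (3, 6), and a ratio of 1 yields the midpoint (2, 4).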

[0076] Step 4: Generating a New Sample Based on Different Strategies.

[0077] The feature attributes of a sample and its auxiliary sample are traversed sequentially. The feature values of the new sample are generated one by one according to the applicable strategy in the formula in step 2, to finally obtain the new sample.

[0078] Step 5: Completing Over-Sampling, and Outputting a Sampling Result.

[0079] In the present embodiment, a data set to be processed is obtained first, and the dimensions and feature value types of its sample features are counted. The minority-class sample points in the data set are traversed, and the k nearest neighbor sample points of each minority-class sample are obtained. Here, the k nearest neighbor sample points are obtained, based on feature value balancing, using the Python library scikit-learn (sklearn). N sample points are selected randomly as auxiliary samples according to the sampling rates. The Rat.sub.im values between the sample point and its auxiliary sample points are calculated respectively to determine the offset of each new sample. Each feature value of the new sample is obtained independently according to the Rat.sub.im values, and the feature values are then combined to obtain an additional sample. Finally, all newly generated samples are added to the data set, thereby obtaining a final class-balanced data set. The present embodiment thus addresses the aggravated marginalization of the new sample distribution and the increased noise found in conventional over-sampling methods, makes the new samples generated in over-sampling more reasonable, and improves the accuracy, generalization ability, and other performance of the final model.
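
By way of example only, the over-sampling flow described above may be sketched as below. The sketch replaces the sklearn neighbor search with a brute-force Euclidean search from the Python standard library, omits feature value balancing, and treats N as a fixed number of auxiliary samples per minority-class sample. The names k_nearest, oversample, and n_aux, as well as the handling of the case in which no minority-class sample is among the neighbors of x.sub.im, are assumptions that are not taken from the embodiment.

```python
import math
import random
from typing import Hashable, List, Sequence

Point = Sequence[float]


def k_nearest(idx: int, X: List[Point], k: int, candidates: List[int]) -> List[int]:
    """Indices of the k candidates closest to X[idx] (Euclidean), excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(X[idx], X[j]))
    return others[:k]


def oversample(X: List[Point], y: List[Hashable], minority: Hashable,
               k: int, n_aux: int, rng: random.Random) -> List[List[float]]:
    """Return new minority-class samples; the caller appends them to the data set."""
    all_idx = list(range(len(X)))
    min_idx = [i for i in all_idx if y[i] == minority]

    def num_minority(i: int) -> int:
        # Number of minority-class samples among the k nearest neighbors of X[i].
        return sum(1 for j in k_nearest(i, X, k, all_idx) if y[j] == minority)

    new_samples: List[List[float]] = []
    for i in min_idx:
        same_class_nn = k_nearest(i, X, k, min_idx)
        # Randomly pick the auxiliary samples among the same-class neighbors.
        for m in rng.sample(same_class_nn, min(n_aux, len(same_class_nn))):
            num_i, num_m = num_minority(i), num_minority(m)
            if num_m == 0:
                # Assumption: the embodiment does not define this case.
                rat = 1.0 if num_i == 0 else math.inf
            else:
                rat = num_i / num_m
            new: List[float] = []
            for a, b in zip(X[i], X[m]):
                r = rng.random()  # independent draw for each feature value
                if rat < 1.0:
                    new.append(b + r * rat * (a - b))
                elif rat > 1.0:
                    new.append(a + r / rat * (b - a))
                else:
                    new.append(a + r * (b - a))
            new_samples.append(new)
    return new_samples
```

Because each feature value is drawn independently, a new sample in this sketch lies inside the axis-aligned box spanned by x.sub.i and x.sub.im rather than exactly on the segment between them.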

[0080] Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a classification model training system according to an embodiment of the present application.

[0081] The system may include:

[0082] a target sample setting module 100, configured for determining sampling rates of first-class samples and second-class samples in a data set, and setting the samples with a sampling rate less than a preset value as target samples;

[0083] a distribution feature determining module 200, configured for determining data distribution feature information of the target samples based on Euclidean distances between all the samples in the data set; wherein the data distribution feature information is information describing the number of same-class samples in nearest neighbor samples, and the nearest neighbor samples are two samples at a Euclidean distance less than a preset distance;

[0084] a new sample generating module 300, configured for generating new samples corresponding to the target samples based on the data distribution feature information; and

[0085] a model training module 400, configured for training the classification model using the first-class samples, the second-class samples and the new samples.

[0086] In the present embodiment, the first-class samples or second-class samples with a sampling rate less than the preset value are set as the target samples, the target samples being samples of a class accounting for a relatively small proportion of the data set. If the classification model were trained directly using the samples in the data set, it would tend to recognize the class accounting for a relatively large proportion of the data set, which impairs the recognition effect. In the present embodiment, the data distribution feature information of the target samples is determined based on the Euclidean distances between all the samples, and new samples of the same class as the target samples are generated dynamically based on the data distribution feature information. As such, the numbers of samples of each class in the data set are balanced, and the relatively poor model training effect caused by an imbalance between sample classes is avoided. It can be seen that, according to the embodiments of the present application, the numbers of samples of various classes in the data set may be balanced, and the prediction accuracy of the classification model may be improved.

[0087] Further, the distribution feature determining module 200 is specifically configured for calculating a superiority ratio between any two nearest neighbor target samples using a first formula, and determining the superiority ratio as the data distribution feature information, wherein the nearest neighbor target samples are two target samples at a Euclidean distance less than the preset distance.

[0088] Here, the first formula is Rat.sub.im=Numx.sub.i/Numx.sub.im, where Rat.sub.im is a superiority ratio between a sample x.sub.i and a sample x.sub.im, x.sub.i is any sample in the target samples, x.sub.im is an m-th nearest neighbor sample in k same-class nearest neighbor samples of the sample x.sub.i, Numx.sub.i is the number of target samples in k nearest neighbor samples of the sample x.sub.i, and Numx.sub.im is the number of target samples in k nearest neighbor samples of the sample x.sub.im.
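
As a worked illustration with assumed values that do not come from the embodiment, let k=5, Numx.sub.i=4 and Numx.sub.im=2. The first formula then gives:

$\mathrm{Rat}_{im} = \frac{\mathrm{Num}x_{i}}{\mathrm{Num}x_{im}} = \frac{4}{2} = 2 > 1$

The sample x.sub.i thus has more target samples in its neighborhood than the sample x.sub.im, so the strategy for Rat.sub.im>1 applies and x.sub.newim=x.sub.i+rand(0, 1)/2*(x.sub.im−x.sub.i). Since rand(0, 1)/2 does not exceed 0.5, the new sample lies on the half of the segment between x.sub.i and x.sub.im that is nearer to x.sub.i.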

[0089] Further, the new sample generating module 300 includes:

[0090] a first generation unit, configured for, in case that the superiority ratio is less than 1, generating a new sample x.sub.newim corresponding to the target samples using a second formula, wherein the second formula is x.sub.newim=x.sub.im+rand(0, 1)*Rat.sub.im*(x.sub.i−x.sub.im);

[0091] a second generation unit, configured for, in case that the superiority ratio is greater than 1, generating a new sample x.sub.newim corresponding to the target samples using a third formula, wherein the third formula is x.sub.newim=x.sub.i+rand(0, 1)/Rat.sub.im*(x.sub.im−x.sub.i); and

[0092] a third generation unit, configured for, in case that the superiority ratio is equal to 1, generating a new sample x.sub.newim corresponding to the target samples using a fourth formula, wherein the fourth formula is x.sub.newim=x.sub.i+rand(0, 1)*(x.sub.im−x.sub.i).

[0093] Further, the model training module 400 is specifically configured for performing a sampling operation on the first-class samples, the second-class samples and the new samples, and performing a training operation on the classification model according to a sampling result.

[0094] Further, the first-class samples are virus file samples, the second-class samples are non-virus file samples and the classification model is a file type detection model.

[0095] Further, the system further includes:

[0096] a virus detection module, configured for, after the training operation is performed on the classification model according to the sampling result, performing a detection operation on an unknown file using the trained file type detection model to generate a detection result, and determining whether the unknown file is a virus file based on the detection result.

[0097] Further, the target sample setting module 100 includes:

[0098] a sampling rate determining unit, configured for determining the sampling rates of the first-class samples and second-class samples in the data set based on quantitative proportions of the samples in the data set.
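
The determination of the target samples may be illustrated, purely as a sketch, by the following Python code. It assumes that the sampling rate of a class is simply the proportion of that class in the data set, which is one possible reading of "quantitative proportions". The function names sampling_rates and target_classes and the parameter preset_value are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def sampling_rates(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Assumed definition: sampling rate of a class = its share of all samples."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def target_classes(labels: Sequence[Hashable], preset_value: float) -> List[Hashable]:
    """Classes whose sampling rate is below the preset value (the target samples)."""
    return [cls for cls, rate in sampling_rates(labels).items()
            if rate < preset_value]
```

With 2 virus file samples and 8 non-virus file samples and a preset value of 0.5, only the virus file samples would be set as target samples under this reading.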

[0099] The embodiment of the system part corresponds to the embodiment of the method part. Therefore, for the embodiment of the system part, reference may be made to the description of the embodiment of the method part, which will not be elaborated herein.

[0100] The present application also provides a storage medium, having a computer program stored thereon which, when executed, may implement the steps provided in the above-mentioned embodiment. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

[0101] The present application also provides an electronic device, which may include a memory and a processor. The memory stores a computer program. The processor, when calling the computer program in the memory, may implement the steps provided in the above-mentioned embodiment. Certainly, the electronic device may further include various network interfaces, a power supply, and other components.

[0102] All the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts of the embodiments, reference may be made to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and for the relevant parts, reference may be made to the description of the method part. It should be noted that a person of ordinary skill in the art can make several improvements and modifications to the present application without departing from the principle of the present application, and these improvements and modifications also fall within the scope of protection of the claims of the present application.

[0103] It is also noted that in this specification, relationship terms such as first and second are used only to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any such actual relationship or order between those entities or operations. Further, the terms “include”, “comprise”, or any other variation thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that includes a set of elements includes not only those elements, but also other elements not expressly listed, or elements that are inherent to such process, method, article, or apparatus. Without further restrictions, an element defined by the statement “including a/an” does not exclude the existence of other identical elements in the process, method, article, or apparatus including the element.

CPC classification: G06F18/214; G06F18/24143; G06F18/2415; G06F18/28 (PHYSICS)

International classification: G06F18/214 (PHYSICS)
