TEXT CLASSIFICATION SYSTEM BASED ON FEATURE SELECTION AND METHOD THEREOF

20230214415 · 2023-07-06

Abstract

The present disclosure discloses a text classification system based on feature selection and a method thereof in the technical field of natural language processing and short text classification, comprising: acquiring a text classification data set; dividing the text classification data set into a training text set and a test text set, and then pre-processing the training text set and the test text set; extracting feature entries from the pre-processed training text set through improved chi-square statistics to form feature subsets; using TF-IWF algorithm to give the weight to the extracted feature entries; based on the weighted feature entries, establishing a short text classification model based on a support vector machine; and classifying the pre-processed test text set by the short text classification model. The present disclosure alleviates, to some extent, the problem that short text content is sparse, thereby improving the performance of short text classification.

Claims

1. A text classification method based on feature selection, comprising: acquiring a text classification data set; dividing the text classification data set into a training text set and a test text set, and then pre-processing the training text set and the test text set; extracting feature entries from the pre-processed training text set through improved chi-square statistics to form feature subsets; using TF-IWF algorithm to give the weight to the extracted feature entries; based on the weighted feature entries, establishing a short text classification model based on a support vector machine; and classifying the pre-processed test text set by the short text classification model.

2. The text classification method based on feature selection according to claim 1, wherein the pre-processing comprises first performing standard processing including removing stop words on the text, and then selecting the Jieba word segmentation tool to segment the processed short text content to obtain the training text set and the test text set which have been segmented, and storing the training text set and the test text set in a text database.

3. The text classification method based on feature selection according to claim 2, wherein extracting feature entries from the pre-processed training text set through improved chi-square statistics to form feature subsets comprises: extracting each feature item and its related category information from the text database; calculating a word frequency adjustment parameter α(t,c.sub.i), an intra-category position parameter β and a negative correlation correction factor γ of a feature word t with respect to each category; using an improved formula to calculate the IMP_CHI value of an entry with respect to each category; according to the improved chi-square statistics, obtaining the IMP_CHI value of a feature item t with respect to the whole training set; after calculating the IMP_CHI values of the whole training set, selecting the first M words as the features represented by a document to form a final feature subset according to the descending order of the IMP_CHI values.

4. The text classification method based on feature selection according to claim 3, wherein the improved chi-square statistical formula is:
IMP_CHI(t, c_i) = x²(t, c_i) × α(t, c_i) × β × γ, where α(t, c_i) is the word frequency adjustment parameter, β is the intra-category position parameter, γ is the negative correlation correction factor, and x²(t, c_i) is the traditional chi-square statistic, which is expressed as: x²(t, c_i) = N × (A×D − C×B)² / [(A+C) × (B+D) × (A+B) × (C+D)], where N represents the total number of texts in the training set, A is the number of texts belonging to the category c_i and containing the feature t, B is the number of texts not belonging to the category c_i and containing the feature t, C is the number of texts belonging to the category c_i and not containing the feature t, and D is the number of texts not belonging to the category c_i and not containing the feature t; for multi-category problems, the statistic of a feature item with respect to the whole training set is expressed as:
IMP_CHI_max(t) = max_{j=1,…,m} { IMP_CHI(t, c_j) }, where m is the number of categories.

5. The text classification method based on feature selection according to claim 4, wherein the calculation formula of the word frequency adjustment parameter α(t, c_i) is as follows: α(t, c_i) = (N/n) × tf(t, c_i) / Σ_{j=1}^{m} tf(t, c_j), where N represents the total number of texts in the training set, n represents the number of documents in the text set that contain the feature word t, tf(t, c_i) represents the number of occurrences of t in the texts of the category c_i, and Σ_{j=1}^{m} tf(t, c_j) represents the total number of occurrences of t in the documents of all categories; the word frequency adjustment parameter α(t, c_i) is the ratio of the word frequency of the feature item in a category to its total word frequency over all categories, and a larger α(t, c_i) indicates that the feature item appears more frequently in that category of the text set and discriminates the corresponding category more strongly; the calculation formula of the intra-category position parameter β is as follows: β_j = (1/m) × Σ_{j=1}^{m} [ tf_j(t) − (1/m) × Σ_{i=1}^{m} tf_i(t) ]², which is normalized as: β = 1 − β_j / Σ_{j=1}^{m} β_j², where m represents the total number of categories, and tf_j(t) represents the word frequency of the feature word t in the category j; the calculation formula of the negative correlation correction factor γ is as follows: γ = N(t, c_i) − [ Σ_{j=1}^{m} N(t, c_j) ] / m, where N(t, c_i) is the number of texts in which the feature t appears in the category c_i, Σ_{j=1}^{m} N(t, c_j) is the total number of texts in which t appears in the text set, and m is the number of categories.

6. The text classification method based on feature selection according to claim 1, wherein the TF-IWF algorithm is used to give the weight to the extracted feature entries, wherein the word frequency TF refers to the frequency with which an entry t_i appears in the document d_j, which is generally normalized and is calculated as: TF_{ij} = n_{i,j} / Σ_k n_{k,j}, where n_{i,j} represents the number of occurrences of the entry t_i in the document d_j, and Σ_k n_{k,j} represents the total number of occurrences of all entries in the text d_j; the inverse word frequency IWF is based on the reciprocal of the proportion of the occurrences of an entry in a document to its total occurrences in all documents, and is calculated as: IWF_i = log( Σ_m n_{t_i} / n_{t_i} ), where Σ_m n_{t_i} represents the total number of occurrences of the entry t_i in all documents of the m categories, and n_{t_i} represents the number of occurrences of the entry t_i in the document d_j; the TF-IWF value W_{i,j} is obtained by multiplying the word frequency TF_{ij} by the inverse word frequency IWF_i, and the calculation formula is as follows:
W_{i,j} = TF_{ij} × IWF_i

7. A text classification system based on feature selection, comprising: a data acquisition module, which is configured to acquire a text classification data set; a pre-processing module, which is configured to divide the text classification data set into a training text set and a test text set, and then pre-process the training text set and the test text set; a Chi-square statistics module, which is configured to extract feature entries from the pre-processed training text set through improved chi-square statistics to form feature subsets; a weighting module, which is configured to use TF-IWF algorithm to give the weight to the extracted feature entries; a modeling module, which is configured to, based on the weighted feature entries, establish a short text classification model based on a support vector machine; and a classification module, which is configured to classify the test text set by the short text classification model.

8. A text classification device based on feature selection, comprising a processor and a storage medium; wherein the storage medium is configured to store instructions; the processor is configured to operate according to the instructions to perform the steps of the method according to claim 1.

9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0052] FIG. 1 is a flow chart of a method according to Embodiment 1 of the present disclosure.

[0053] FIG. 2 is a flow chart of feature weighting according to Embodiment 1 of the present disclosure.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0054] The present disclosure will be further described with reference to the accompanying drawings. The following embodiments are only used to illustrate the technical scheme of the present disclosure more clearly, rather than limit the scope of protection of the present disclosure.

Embodiment 1

[0055] Referring to FIGS. 1-2, this embodiment discloses a text classification method based on feature selection. The present disclosure will be further described in detail through specific implementation schemes.

[0056] S1, the Chinese text classification data set THUCNews, published by the Tsinghua University Natural Language Processing Laboratory, is downloaded from the Internet, divided into a training text set and a test text set, and then pre-processed, wherein the pre-processing comprises Chinese word segmentation and stop-word removal, so as to obtain the segmented training set and test set and store them in a text database.

[0057] The process of pre-processing the text is as follows: first performing a series of standard processing, such as removing stop words, on the text, then selecting the Jieba word segmentation tool to segment the processed short text content to obtain the segmented training set and test set, and storing the training set and the test set in a text database.
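As a minimal sketch of this pre-processing step, the following Python fragment drops stop words and segments text. The stop-word list and the whitespace tokenizer are illustrative stand-ins introduced here for demonstration; for Chinese short texts the disclosure uses the Jieba segmenter (e.g. jieba.lcut).

```python
# Sketch of the pre-processing step: stop-word removal plus segmentation.
# STOP_WORDS and the whitespace split are hypothetical stand-ins; the
# disclosure applies the Jieba segmenter (jieba.lcut) to Chinese text.

STOP_WORDS = {"the", "a", "of", "is"}  # hypothetical stop-word list

def preprocess(text: str) -> list[str]:
    """Segment a text and drop stop words, returning the kept tokens."""
    tokens = text.lower().split()  # stand-in for jieba.lcut(text)
    return [t for t in tokens if t not in STOP_WORDS]

# The segmented training and test sets would then be stored in a text database.
train_set = [preprocess(d) for d in ["The price of gold is rising"]]
```

The same routine would be applied to both the training texts and the test texts before feature selection.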

[0058] S2, aiming at the deficiency of the traditional chi-square statistics, a word frequency adjustment factor, an intra-category position parameter and a negative correlation correction factor are introduced.

[0059] The traditional chi-square statistical formula, that is, the CHI value of the feature item t with respect to the category c_i, is as follows:

[00008] x²(t, c_i) = N × (A×D − C×B)² / [(A+C) × (B+D) × (A+B) × (C+D)]

[0060] where N represents the total number of texts in the training set, A is the number of texts belonging to the category c_i and containing the feature t, B is the number of texts not belonging to the category c_i and containing the feature t, C is the number of texts belonging to the category c_i and not containing the feature t, and D is the number of texts not belonging to the category c_i and not containing the feature t. It can be seen from the above formula that when the feature t and the category c_i are independent of each other, A×D − C×B = 0, and at this time x²(t, c_i) = 0. The larger the value of x²(t, c_i), the more relevant the feature item t is to the category c_i.
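The traditional statistic above can be computed directly from the four document counts A, B, C, D. The following sketch uses illustrative counts, including the independence case A×D = C×B:

```python
# Traditional chi-square statistic x^2(t, c_i) computed from the document
# counts A, B, C, D defined above. The counts used below are illustrative.

def chi_square(A: int, B: int, C: int, D: int) -> float:
    N = A + B + C + D  # total number of texts in the training set
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom

independent = chi_square(2, 4, 3, 6)   # A*D == C*B, so t and c_i are independent
correlated = chi_square(10, 2, 3, 20)  # t occurs mostly inside category c_i
```

As the formula predicts, the independent case yields a statistic of zero while the correlated case yields a positive value.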

[0061] There are three disadvantages of the traditional chi-square statistics.

[0062] 1. The traditional chi-square statistical method only considers the number of documents in which a feature word appears in the document set, but does not consider the number of times the feature word appears within a text, which exaggerates the role of low-frequency words and causes defects.

[0063] 2. The traditional chi-square statistical method does not consider how uniformly feature words are distributed within a category.

[0064] 3. The traditional Chi-square statistical method is more inclined to select the feature words negatively related to the category.

[0065] The word frequency adjustment parameter, the intra-category position parameter and the negative correlation correction factor are introduced.

[0066] 1. The calculation formula of the word frequency adjustment parameters is as follows:

[00009] α(t, c_i) = (N/n) × tf(t, c_i) / Σ_{j=1}^{m} tf(t, c_j)

[0067] where N represents the total number of texts in the training set, n represents the number of documents in the text set that contain the feature word t, tf(t, c_i) represents the number of occurrences of t in the texts of the category c_i, and Σ_{j=1}^{m} tf(t, c_j) represents the total number of occurrences of t in the documents of all categories. The word frequency adjustment parameter α(t, c_i) is the ratio of the word frequency of the feature item in a category to its total word frequency over all categories. A larger α(t, c_i) indicates that the feature item appears more frequently in that category of the text set and discriminates the corresponding category more strongly.
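A literal sketch of this parameter, with illustrative numbers, can be written as follows; the counts N, n and the per-category term frequencies are placeholders, not values from a real corpus:

```python
# Word frequency adjustment parameter α(t, c_i) = (N/n) × tf(t, c_i) / Σ_j tf(t, c_j).
# All numbers below are illustrative.

def alpha(N: int, n: int, tf_per_category: list[float], i: int) -> float:
    """N: total texts; n: documents containing t; tf_per_category[i]: tf(t, c_i)."""
    return (N / n) * tf_per_category[i] / sum(tf_per_category)

# Feature t occurs 8, 1 and 1 times in three categories:
a0 = alpha(100, 10, [8.0, 1.0, 1.0], 0)  # concentrated in category 0
a1 = alpha(100, 10, [8.0, 1.0, 1.0], 1)
```

The category in which the feature is concentrated receives the larger adjustment, matching the discussion above.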

[0068] 2. The calculation formula of the intra-category position parameter is as follows:

[00010] β_j = (1/m) × Σ_{j=1}^{m} [ tf_j(t) − (1/m) × Σ_{i=1}^{m} tf_i(t) ]²

[0069] which is normalized as:

[00011] β = 1 − β_j / Σ_{j=1}^{m} β_j²

[0070] where m represents the total number of categories, and tf_j(t) represents the word frequency of the feature word t in the category j. Following the idea of variance, the more uniform the intra-category distribution, the greater β. By introducing the intra-category position parameter, the intra-category distribution of feature words is taken into account in the CHI feature selection, and the category discrimination of the feature words in the feature subsets is improved.
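A literal transcription of the two formulas as printed above can be sketched as follows; the word frequencies are illustrative, and the normalisation is reproduced exactly as stated in the disclosure:

```python
# Intra-category position parameter: β_j is the variance of the word frequency
# of t across the m categories, normalised as β = 1 − β_j / Σ_j β_j², as
# printed above. The frequencies below are illustrative.

def beta(tf_per_category: list[float]) -> float:
    m = len(tf_per_category)
    mean = sum(tf_per_category) / m
    beta_j = sum((tf - mean) ** 2 for tf in tf_per_category) / m  # variance
    norm = sum(beta_j ** 2 for _ in range(m))                     # Σ_j β_j²
    return 1.0 if norm == 0 else 1.0 - beta_j / norm

uniform = beta([5.0, 5.0, 5.0])  # perfectly uniform distribution across categories
skewed = beta([9.0, 0.0, 0.0])   # concentrated in one category
```

With these toy inputs, the perfectly uniform distribution yields the maximal β, consistent with the remark that a more uniform intra-category distribution gives a greater β.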

[0071] 3. The calculation formula of the negative correlation correction factor is as follows:

[00012] γ = N(t, c_i) − [ Σ_{j=1}^{m} N(t, c_j) ] / m

[0072] where N(t, c_i) is the number of texts in which the feature t appears in the category c_i, Σ_{j=1}^{m} N(t, c_j) is the total number of texts in which t appears in the text set, and m is the number of categories. When the number of texts in which the feature t appears in the category c_i is less than the average number of texts in which t appears per category, γ is negative and the IMP_CHI value becomes negative; deleting the features negatively related to the category c_i then avoids the influence of the negative correlation on classification.
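The factor can be sketched directly from its definition; the document counts below are illustrative placeholders:

```python
# Negative correlation correction factor γ: the number of texts of category
# c_i containing t, minus the average per-category count. When t appears in
# c_i less often than average, γ (and hence IMP_CHI) is negative.

def gamma(n_t_in_ci: int, n_t_total: int, m: int) -> float:
    """n_t_in_ci: texts of c_i containing t; n_t_total: all texts containing t; m: categories."""
    return n_t_in_ci - n_t_total / m

negative = gamma(2, 30, 10)   # below the average of 3 texts per category
positive = gamma(10, 30, 10)  # above average: positively correlated with c_i
```

A negative γ flips the sign of the IMP_CHI product, which is how negatively correlated features get screened out.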

[0073] S3, feature subsets are formed through improved chi-square statistics.

[0074] Three concepts, namely the word frequency adjustment parameter, the intra-category position parameter and the negative correlation correction factor, are introduced into the traditional chi-square statistics, and an improved chi-square statistical method, named IMP_CHI (Improved CHI-square), is proposed. The formula is expressed as follows:


IMP_CHI(t, c_i) = x²(t, c_i) × α(t, c_i) × β × γ

[0075] where x²(t, c_i) is the traditional chi-square statistic, α(t, c_i) is the word frequency adjustment parameter, β is the intra-category position parameter, and γ is the negative correlation correction factor.

[0076] For multi-category problems, the statistical calculation method of the feature item with respect to the whole training set is expressed as:


IMP_CHI_max(t) = max_{j=1,…,m} { IMP_CHI(t, c_j) }

[0077] where m is the number of categories. Taking the maximum avoids the following problem: a term t_1 may have a high correlation evaluation value in the category c_1 and carry strong category information for texts of that category, yet have very low evaluation values in the other categories; with a summed score, it would be screened out because its total score is not high, and the classification effect would suffer.
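Putting the factors together, the following sketch combines precomputed per-category values into IMP_CHI scores and takes the maximum over categories. The factor values are illustrative placeholders, not derived from a real corpus:

```python
# IMP_CHI(t, c_j) = x²(t, c_j) × α(t, c_j) × β × γ, with the maximum over the
# m categories taken as the score of t. All factor values are illustrative.

def imp_chi(chi2: float, alpha: float, beta: float, gamma: float) -> float:
    return chi2 * alpha * beta * gamma

# Per-category factor tuples (x², α, β, γ) for one feature t over m = 3 categories:
factors = [(4.0, 1.5, 0.9, 2.0), (1.0, 0.5, 0.9, -1.0), (2.0, 1.0, 0.9, 0.5)]
scores = [imp_chi(*f) for f in factors]
imp_chi_max = max(scores)  # IMP_CHI_max(t)
```

Note how the negative γ in the second category drives its score negative, while the maximum preserves the strong evaluation in the first category.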

[0078] The specific process of the IMP_CHI method can be summarized as follows.

[0079] The text in the text corpus is pre-processed, including word segmentation, part-of-speech tagging, and removal of special symbols and stop words. The text words (title, keywords, abstract, body text and category) are acquired and placed into the initial set.

[0080] Each feature item and its related category information are extracted from the text database in sequence.

[0081] α(t,c.sub.i), β and γ of a feature word t with respect to each category are calculated.

[0082] The improved formula is used to calculate the IMP_CHI value of an entry t with respect to each category.

[0083] According to the improved chi-square statistics, the IMP_CHI value of a feature item t with respect to the whole training set is obtained.

[0084] After calculating the IMP_CHI values of the whole training set, the first M words are selected as the features represented by a document to form a final feature subset according to the descending order of the IMP_CHI values.
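The final selection step above can be sketched as a sort-and-truncate over the scored features; the scores and M below are illustrative:

```python
# Final selection: rank features by their IMP_CHI value in descending order
# and keep the first M as the feature subset. Scores are illustrative.

imp_chi_scores = {"gold": 10.8, "price": 7.2, "season": 0.9, "the": 0.1}
M = 2
feature_subset = sorted(imp_chi_scores, key=imp_chi_scores.get, reverse=True)[:M]
```

Only the M highest-scoring words survive as the features representing a document.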

[0085] S4, TF-IWF algorithm is used to give the weight to the extracted feature entries.

[0086] The TF-IWF algorithm is used to give the weight to the extracted feature entries, and the calculation process is as follows.

[0087] The word frequency TF refers to the frequency that a certain entry t.sub.i appears in the document d.sub.j, which is generally normalized, and the calculation process is as follows:

[00013] TF_{ij} = n_{i,j} / Σ_k n_{k,j}

[0088] where n_{i,j} represents the number of occurrences of the entry t_i in the document d_j, and Σ_k n_{k,j} represents the total number of occurrences of all entries in the text d_j.

[0089] The inverse word frequency IWF_i is based on the reciprocal of the proportion of the occurrences of an entry in a document to its total occurrences in all documents. The function of IWF_i is to prevent words that have a high frequency but contribute little to a document from obtaining a high weight. The calculation process is as follows:

[00014] IWF_i = log( Σ_m n_{t_i} / n_{t_i} )

[0090] where Σ_m n_{t_i} represents the total number of occurrences of the entry t_i in all documents of the m categories, and n_{t_i} represents the number of occurrences of the entry t_i in the document d_j.

[0091] The improved TF-IDF algorithm (that is, the TF-IWF algorithm) is used. The TF-IWF value W_{i,j} is obtained by multiplying the word frequency TF_{ij} by the inverse word frequency IWF_i. The calculation formula is as follows:


W_{i,j} = TF_{ij} × IWF_i

[0092] TF-IWF is used to filter common entries and to give more weight to the entries that better reflect the corpus. If an entry has a high frequency in a given text but a low frequency in the text set as a whole, its TF-IWF value is high.
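A literal sketch of the TF-IWF weight over a toy corpus follows; the corpus and terms are illustrative, and the formulas are implemented exactly as given above:

```python
import math

# TF-IWF weight: TF_ij is the normalised frequency of the entry in document
# d_j, and IWF_i = log(total occurrences of the entry in all documents /
# its occurrences in d_j), per the formulas above. The corpus is a toy example.

def tf_iwf(doc: list[str], corpus: list[list[str]], term: str) -> float:
    n_in_doc = doc.count(term)                  # n_{t_i} in d_j
    if n_in_doc == 0:
        return 0.0
    tf = n_in_doc / len(doc)                    # TF_ij
    total = sum(d.count(term) for d in corpus)  # Σ_m n_{t_i}
    return tf * math.log(total / n_in_doc)      # W_ij = TF_ij × IWF_i

corpus = [["gold", "price", "gold"], ["price", "season"]]
w = tf_iwf(corpus[0], corpus, "price")  # 1 of 3 tokens here, 2 occurrences overall
```

Each selected feature entry would be weighted this way before being fed to the classifier.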

[0093] S5, a support vector machine classifier is selected to classify the text to be tested.

[0094] A short text classification model based on a support vector machine is established. The text data of the test set is input into the trained classification model, the classification result is obtained, and the performance is evaluated. Experiments show that, compared with the traditional chi-square statistical method, the improved chi-square statistical method IMP_CHI proposed in the present disclosure achieves a better feature selection effect with the SVM classifier and, when combined with TF-IWF feature weighting, significantly improves the performance of the classifier.
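As a dependency-free illustration of this final step, the sketch below classifies weighted feature vectors with a nearest-centroid rule. This is only a stand-in for the SVM of the disclosure (in practice a support vector machine such as scikit-learn's sklearn.svm.SVC would be trained on the TF-IWF vectors); the vectors and category labels are illustrative:

```python
# Stand-in for the S5 classification step: a nearest-centroid classifier over
# TF-IWF weighted feature vectors. The disclosure trains an SVM here; the
# vectors and category labels below are illustrative.

def centroid(vectors: list[list[float]]) -> list[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(vec: list[float], centroids: dict[str, list[float]]) -> str:
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist2(vec, centroids[c]))

train = {"finance": [[0.9, 0.1], [0.8, 0.2]], "sports": [[0.1, 0.9], [0.2, 0.8]]}
centroids = {c: centroid(vs) for c, vs in train.items()}
label = classify([0.85, 0.15], centroids)
```

The evaluation step would then compare such predicted labels against the true categories of the test set.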

Embodiment 2

[0095] A text classification system based on feature selection can realize a text classification method based on feature selection according to Embodiment 1, comprising:

[0096] a data acquisition module, which is configured to acquire a text classification data set;

[0097] a pre-processing module, which is configured to divide the text classification data set into a training text set and a test text set, and then pre-process the training text set and the test text set;

[0098] a Chi-square statistics module, which is configured to extract feature entries from the pre-processed training text set through improved chi-square statistics to form feature subsets;

[0099] a weighting module, which is configured to use TF-IWF algorithm to give the weight to the extracted feature entries;

[0100] a modeling module, which is configured to, based on the weighted feature entries, establish a short text classification model based on a support vector machine; and

[0101] a classification module, which is configured to classify the test text set by the short text classification model.

Embodiment 3

[0102] The embodiment of the present disclosure further provides a text classification device based on feature selection, which can realize the text classification method based on feature selection according to Embodiment 1, comprising a processor and a storage medium;

[0103] wherein the storage medium is configured to store instructions;

[0104] the processor is configured to operate according to the instructions to perform the steps of the method described hereinafter:

[0105] acquiring a text classification data set;

[0106] dividing the text classification data set into a training text set and a test text set, and then pre-processing the training text set and the test text set;

[0107] extracting feature entries from the pre-processed training text set through improved chi-square statistics to form feature subsets;

[0108] using TF-IWF algorithm to give the weight to the extracted feature entries;

[0109] based on the weighted feature entries, establishing a short text classification model based on a support vector machine; and

[0110] classifying the pre-processed test text set by the short text classification model.

Embodiment 4

[0111] The embodiment of the present disclosure further provides a computer-readable storage medium which can realize the text classification method based on feature selection according to Embodiment 1, on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method described hereinafter:

[0112] acquiring a text classification data set;

[0113] dividing the text classification data set into a training text set and a test text set, and then pre-processing the training text set and the test text set;

[0114] extracting feature entries from the pre-processed training text set through improved chi-square statistics to form feature subsets;

[0115] using TF-IWF algorithm to give the weight to the extracted feature entries;

[0116] based on the weighted feature entries, establishing a short text classification model based on a support vector machine; and

[0117] classifying the pre-processed test text set by the short text classification model.

[0118] It should be understood by those skilled in the art that the embodiments of the present disclosure can be provided as methods, systems, or computer program products. Therefore, the present disclosure can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Furthermore, the present disclosure may take the form of a computer program product implemented on one or more computer-available storage media (including but not limited to a disk storage, CD-ROM, an optical storage, etc.) in which computer-available program codes are contained.

[0119] The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of the present disclosure. It should be understood that each flow and/or block in flowcharts and/or block diagrams and combinations of flows and/or blocks in flowcharts and/or block diagrams can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing devices to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing devices produce a device for implementing the functions specified in one or more flows in flowcharts and/or one or more blocks in block diagrams.

[0120] These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing devices to work in a specific way, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implement the functions specified in one or more flows in flowcharts and/or one or more blocks in block diagrams.

[0121] These computer program instructions can also be loaded on a computer or other programmable data processing devices, so that a series of operation steps are executed on the computer or other programmable devices to produce a computer-implemented process, so that the instructions executed on the computer or other programmable devices provide steps for implementing the functions specified in one or more flows in flowcharts and/or one or more blocks in block diagrams.

[0122] The above are only the preferred embodiments of the present disclosure. It should be pointed out that for those skilled in the art, several improvements and variations can be made without departing from the technical principle of the present disclosure, which should also be regarded as the scope of protection of the present disclosure.