MEDICAL DISEASE FEATURE SELECTION METHOD BASED ON IMPROVED SALP SWARM ALGORITHM

20230029947 · 2023-02-02

Assignee

Inventors

Cpc classification

International classification

Abstract

The disclosure is a medical disease feature selection method based on an improved salp swarm algorithm. The improved salp swarm algorithm is used to optimize a feature selection problem, the accuracy of the mentioned method is estimated by means of a transfer function and classification by a K-nearest neighbor algorithm, and the salp swarm algorithm is improved by using a self-adapted control parameter and an elite grey wolf ruling policy, so that it helps the algorithm avoid premature convergence in the optimization process and jump out of local optimum, thereby achieving a target of the algorithm with the smallest selected feature quantity and the highest classification precision. The method has the advantages of high rate of convergence, higher classification precision and better robustness.

Claims

1. A medical disease feature selection method based on an improved salp swarm algorithm, the method comprises the following steps: Step S1, acquiring a microarray gene data set of medical diseases, marking a line number of the microarray gene data set of the medical diseases as m and a column number thereof as n, wherein the microarray gene data set of the medical diseases, which is obtained, is formed by arranging m*n gene feature data according to m lines and n columns; partitioning the microarray gene data set of the medical diseases into 10 subsets randomly by using a 10-cross validation function, wherein a line number of each of the 10 subsets is greater than or equal to 1 and column numbers are n; and selecting one subset from the 10 subsets as a validation set, the rest of the subsets being training sets; Step S2, defining a female salp population Y, wherein a size of the female salp population Y being M=20, i.e., there are M individuals in the female salp population Y, and each of the M individuals in the female salp population Yis respectively represented by a data matrix formed by arranging n dimensionality values in one line and n columns; and then performing initialized assignment to each of the dimensionality values of each individual in the female salp population Y by using a random number between 0 and 1 to obtain 0th of the female salp population Y°; Step S3, setting a global optimum fitness value is best, performing initialized assignment of best to positive infinity, setting a global optimal individual to be a bestposition, and initially setting the bestposition as a data matrix [0, 0, 0, . . . , 0] with one line and n columns; Step S4, setting a maximum time of an iteration of the female salp population T=50, setting an iterative time variable t and initially setting t as 1; Step S5, performing a t.sup.th iteration on the female salp population, an iteration process specifically including: Step S5.1, converting each dimensionality value of each individual in a (t−1).sup.th generation female salp population Y.sup.t−1 into 0 or 1 via transfer functions shown in formulae (1)-(2) to obtain a t.sup.th generation binary salp population Bt; B i , j t = { 1 , S ( Y i , j t - 1 ) r 0 , S ( Y i , j t - 1 ) < r ( 1 ) S ( Y i , j t - 1 ) = 1 1 + e - ( Y i , j t - 1 / 3 ) , ( 2 ) wherein Y.sub.i,j.sup.t−1 represents a dimensionality value in the j.sup.th column of the i.sup.th individual in the (t−1).sup.th generation female salp population, i is equal to 1, 2, 3, . . . M, j is equal to 1, 2, 3, . . . n, B.sub.i,j.sup.t represents a dimensionality value in the j.sup.th column of the i.sup.th individual in the t.sup.th generation binary salp population, r is a random number between 0 and 1 and is generated a random function before operation every time, and e is a natural constant; Step S5.2, constructing feature subsets of each individual in the (t−1).sup.th generation female salp population, the step S5.2 specifically including: judging whether the dimensionality value of each column in the i.sup.th individual in the t.sup.th generation binary salp population is 1 or not respectively, if it is 1, selecting a gene feature data in a verification set and the 9 training sets located in the column, and if it is 0, not selecting the gene feature data in the verification set and the 9 training sets located in the column, taking a residual part as the feature subsets of the verification set of the i.sup.th individual in the (t−1).sup.th generation female salp population after deleting the gene feature data of all unselected columns in the verification set, and taking the residual part as the feature subsets of the 9 training sets of the i.sup.th individual in the (t−1).sup.th generation female salp population after deleting the gene feature data of all the unselected columns in the 9 training sets, thereby obtaining the feature subsets of the verification set of each individual in the (t−1).sup.th generation female salp population and the feature subsets of the 9 training sets; Step S5.3, calculating fitness values of each individual in the (t−1).sup.th generation female salp population by adopting a formula (3) and a formula (4), sequencing all the individuals in the (t−1).sup.th generation female salp population according to the fitness values from small to large, marking a minimum fitness as bF.sup.t−1, and taking the individual with a minimum fitness value as a current optimum individual marked as bP.sup.t−1; f i t i t - 1 = ( 1 - a c c i t ) × a + L i t n × b ( 3 ) ac c i t = c c i t c c i t + u c i t × 1 00 % , ( 4 ) wherein fit.sub.i.sup.t−1 represents a fitness value of the i.sup.th individual in the (t−1).sup.th generation female salp population, a represents a classification accuracy weight set as 0.05, b represents an optimum feature selection number weight, a relation between a and b is a+b=1, L.sub.i.sup.t represents a total column number with the dimensionality value of 1 in the i.sup.th individual in the (t−1).sup.th generation female salp population, acct represents a classification accuracy of the i.sup.th individual obtained by a K-nearest algorithm, cc[and uct are obtained by performing a classifying statistical test on data in the feature subset of the verification set and data in the feature subsets of the 9 training sets in the i.sup.th individual in the (t−1).sup.th generation female salp population by adopting the K-nearest algorithm, cc.sub.i.sup.t represents a number of corrected classified data in the feature subsets of the verification set and uc.sub.i.sup.t represents a number of mistaken classified data in the feature subsets of the verification set; Step S5.4, updating the dimensionality values from a first individual to the M/2th individual in the t.sup.th generation binary salp population Bt by adopting a formula (5) respectively to obtain the first individual to the M/2.sup.th individual of the t.sup.th generation binary salp population F.sup.t; F k , j t = { b P j t - 1 + c t r 1 t r 2 t 0 b P j t - 1 - c t r 1 t r 2 t < 0 ( 5 ) c t = 2 e - ( 4 * t T ) 2 , ( 6 ) wherein k is equal to 1, 2, 3 . . . , M/2, r1.sup.t and r2.sup.t is respectively a random number generated by the random function between 0 and 1, c.sup.t is a control parameter represented by a formula (6), bP.sub.j.sup.t−1 represents a j.sup.th column dimensionality value of the current optimum individual bP.sup.t−1, F.sub.k,j represents a kth individual the j.sup.th column dimensionality value in a t.sup.th generation initial salp population Ft and e is the natural constant; Step S5.5, updating the dimensionality values from the (M/2+1).sup.th individual to the M.sup.th individual in the t.sup.th generation binary salp population Bt by adopting a formula (7) respectively by means of a self-adapted control parameter to obtain the (M/2+1).sup.th individual to the M.sup.th 5 individual of the t.sup.th generation binary salp population F.sup.t: F d t = 1 2 × cos ( p i × t 2 T ) × ( B d t + B d - 1 t ) ( 7 ) wherein d is equal to M/2+1, M2+2, M2+3, . . . , M, Bt represents the d.sup.th individual of the t.sup.th generation binary salp population B.sup.t, B.sub.d−1 .sup.t represents the (d−1).sup.th individual in the t.sup.th generation binary salp population B.sup.t, F.sub.d.sup.t represents the d.sup.th individual in the t.sup.th generation initial salp population F.sup.t, pi refers to circularity ratio and cos represents a cosine function; Step S5.6, calculating the fitness values of each individual in the t.sup.th generation initial salp population Ft by adopting a method same with those in the Steps 5.1-5.3, sequencing all the individuals in the t.sup.th generation initial salp population Ft according to the fitness values from small to large, and marking the individual with the minimum fitness value as firt, the individual with a secondary minimum fitness value as sect and the individual with a third minimum fitness value as thi.sup.t; Step S5.7, exploring and developing the t.sup.th generation initial salp population Ft by adopting formulae (8)-(16) based on an elite grey wolf ruling policy to obtain a t.sup.th generation intermediate salp population G.sup.t; d i s j f i r , t = .Math. "\[LeftBracketingBar]" 2 r 3 t × f i r j t - F i , j t .Math. "\[RightBracketingBar]" ( 8 ) G i , j f i r , t = f i r j t - A t × d i s j f i r , t ( 9 ) A t = 2 β t × r 4 t - β t ( 10 ) β t = 2 - 2 t T ( 11 ) di s j s e c , t = .Math. "\[LeftBracketingBar]" 2 r 3 t × s e c j t - F i , j t .Math. "\[RightBracketingBar]" ( 12 ) G i , j s e c , t = s e c j t - A t × d i s j s e c , t ( 13 ) di s j t h i , t = .Math. "\[LeftBracketingBar]" 2 r 3 t × t h i j t - F i , j t .Math. "\[RightBracketingBar]" ( 14 ) G i , j t h i , t = t h i j t - A t × d i s j t h i , t ( 15 ) G i , j t = G i , j f i r , t + G i , j s e c , t + G i , j t h i , t 3 , ( 16 ) wherein r3.sup.t and r4.sup.t are respectively random numbers generated by the random function between 0 and 1, A.sup.t and β.sup.t are vector coefficients, fir.sub.j.sup.t represents the j.sup.th column dimensionality value of the individual with the minimum fitness value in the t.sup.th generation initial salp population F.sup.t, sec.sub.j.sup.t represents the j.sup.th column dimensionality value of the individual with the secondary minimum fitness value in the t.sup.th generation initial salp population F.sup.t, thi.sub.j.sup.t represents the j.sup.th column dimensionality value of the individual with the third minimum fitness value in the t.sup.th generation initial salp population F.sup.t, F.sub.i,j.sup.t represents the j.sup.th column dimensionality value of the i.sup.th individual in the t.sup.th generation initial salp population F.sup.t, and G.sub.i,j.sup.t represents the j.sup.th column dimensionality value of the i.sup.th individual in the t.sup.th generation intermediate salp population G.sup.t; Step S5.8, calculating the fitness values of the t.sup.th generation intermediate salp population Gt by adopting the method same with those in the Steps S5.1-5.3, combining the M individuals in the t.sup.th generation initial salp population Ft and the M individuals in the t.sup.th generation intermediate salp population G.sup.t, sequencing totally 2M individuals according to the fitness values from small to big, selecting the M individuals with smaller fitness values, and arranging the M individuals randomly as the t.sup.th iteration to obtain the t.sup.th generation salp populationY.sup.t; and Step S5.9, comparing the minimum fitness value of the t.sup.th generation salp population Y.sup.t with the global optimum fitness value best, updating best by adopting the minimum fitness value if the minimum fitness value is smaller than the global optimum fitness value best and taking the individual corresponding to the minimum fitness value as a global optimum individual bestposition, and keeping the global optimum individual best position with the global optimum fitness value best invariable if the minimum fitness value is not smaller than the global optimum fitness value best and finishing the t.sup.th iteration; and Step S6, judging whether a current value of t is equal to T or not, if not, updating the value of t by adopting the current value of t plus 1, and then returning to the Step S5 for next an iteration; if it is equal to T, finishing the iteration process, determining columns with the dimensionality values of 1 from the first column to the nth column of the global optimum individual bestposition, which is current, and extracting the gene feature data in the microarray gene data set of the medical diseases in these columns correspondingly to form a selection data set, wherein the selection data set obtained at the time is ae dimensionality-reducing gene feature data set of the medical diseases.

Description

DESCRIPTION OF THE EMBODIMENTS

[0032] Further description of the present invention in detail will be made below in combination with embodiments.

[0033] Embodiment: a medical disease feature selection method based on an improved salp swarm algorithm, including the following steps:

[0034] Step S1, a microarray gene data set of medical diseases is acquired, a line number of the microarray gene data set of medical diseases is marked as m and a column number thereof as n, wherein the obtained microarray gene data set of medical diseases is formed by arranging m*n gene feature data according to m lines and n columns; the microarray gene data set of medical diseases is partitioned into 10 subsets randomly by using a 10-cross validation function, wherein the line number of each subset is greater than or equal to 1 and the column numbers are n; and selecting one subset from the 10 subsets as a validation set, the rest of subsets being training sets;

[0035] Step S2, a female salp population Y is defined, wherein a size of the female salp population Y being M=20, i.e., there are M individuals in the female salp population Y, and each individual in the female salp population Y is respectively represented by a data matrix formed by arranging n dimensionality values in one line and n columns; and then initialized assignment performed to each dimensionality value of each individual in the female salp population Y by using a random number between 0 and 1 to obtain the 0.sup.th female salp population Y.sup.0;

[0036] Step S3, the global optimum fitness value is set as best, initialized assignment of best is performed to positive infinity, the global optimal individual is set to be bestposition, and initially setting the bestposition as a data matrix [0, 0, 0, . . . , 0] with one line and n columns;

[0037] Step S4, the maximum time of iteration of female salp population is set at T=50, an iterative time variable t is set and t is initially set as 1;

[0038] Step S5, the t.sup.th iteration is performed on the female salp population, the iteration specifically including:

[0039] Step S5.1, each dimensionality value of each individual in the (t−1).sup.th female salp population Y.sup.t−1 is converted into 0 or 1 via transfer functions shown in formulae (1)-(2) to obtain a t.sup.th binary salp population B.sup.t;

[00006] B i , j t = { 1 , S ( Y i , j t - 1 ) r 0 , S ( Y i , j t - 1 ) < r , ( 1 ) S ( Y i , j t - 1 ) = 1 1 + e - ( Y i , j t - 1 / 3 ) , ( 2 )

[0040] wherein Y.sub.i,j.sup.t−1 represents a dimensionality value in the j.sup.th column of the i.sup.th individual in the (t−1).sup.th female salp population, i is equal to 1, 2, 3, . . . M, j is equal to 1, 2, 3, . . . n, B represents a dimensionality value in the j.sup.th column of the i.sup.th individual in the t.sup.th binary salp population, r is a random number between 0 and 1 and is generated a random function before operation every time, and e is a natural constant;

[0041] Step S5.2, a feature subset of each individual in the (t−1).sup.th generation female salp population is constructed, the step specifically including: whether the dimensionality value of each column in the i.sup.th individual in the t.sup.th generation binary salp population is 1 or not judged respectively, if it is 1, gene feature data in a verification set and 9 training sets located in the column is selected, and if it is 0, the gene feature data in a verification set and 9 training sets located in the column is not selected, the residual part is taken as the feature subset of the verification set of the i.sup.th individual in the (t−1).sup.th generation female salp population after deleting the gene feature data of all the unselected columns in the verification set, and the residual part is taken as the feature subsets of the 9 training sets of the i.sup.th individual in the (t−1).sup.th generation female salp population after deleting the gene feature data of all the unselected columns in the 9 training sets, thereby obtaining the feature subset of the verification set of each individual in the (t−1).sup.th generation female salp population and the feature subsets of the 9 training sets;

[0042] Step S5.3, a fitness value of each individual in the (t−1).sup.th generation female salp population is calculated by adopting a formula (3) and a formula (4), all the individuals in the (t−1).sup.th generation female salp population are sequenced according to the fitness values from small to large, the minimum fitness is marked as bF.sup.t−1, and the individual with the minimum fitness value is taken as a current optimum individual marked as bP.sup.t−1;

[00007] f i t i t - 1 = ( 1 - a c c i t ) × a + L i t n × b , ( 3 ) acc i t = c c i t c c i t + u c i t × 1 00 % , ( 4 )

[0043] wherein fit -i represents a fitness value of the i.sup.th individual in the (t−1).sup.th generation female salp population, a represents a classification accuracy weight set as 0.05, b represents the optimum feature selection number weight, a relation between a and b is a+b =1, L.sub.i.sup.t represents a total column number with the dimensionality value of 1 in the i.sup.th individual in the (t−1).sup.th generation female salp population, acc.sub.i.sup.t represents a classification accuracy of the i.sup.th individual obtained by the K-nearest algorithm, cc.sub.i.sup.t and uc.sub.i.sup.t are obtained by performing classifying statistical test on data in the feature subset of the verification set and data in the feature subsets of the 9 training sets in the i.sup.th individual in the (t−1).sup.th generation female salp population by adopting the K-nearest algorithm, cc[represents a number of corrected classified data in the feature subset of the verification set and uct represents a number of mistaken classified data in the feature subset of the verification set;

[0044] Step S5.4, the dimensionality values from the first individual to the M/2.sup.th individual in the t.sup.th generation binary salp population Bt are updated by adopting a formula (5) respectively to obtain the first individual to the M/2.sup.th individual of the t.sup.th generation binary salp population F.sup.t;

[00008] F k , j t = { b P j t - 1 + c t r 1 t r 2 t 0 b P j t - 1 - c t r 1 t r 2 t < 0 ( 5 ) c t = 2 e - ( 4 * t T ) 2 ( 6 )

[0045] wherein k is equal to 1, 2, 3 . . . , M/2, r1.sup.t and r2.sup.t is respectively a random number generated by a random function between 0 and 1, ct is a control parameter represented by a formula (6), bPit-1 represents a current optimum individual the j.sup.th column dimensionality value of bPt-1, F.sub.t represents the k.sup.th individual the j.sup.th column dimensionality value in the t.sup.th generation initial salp population Ft and e is a natural constant;

[0046] Step S5.5, the dimensionality values from the (M/2+1).sup.th individual to the M.sup.th individual in the t.sup.th generation binary salp population B.sup.t are updated by adopting a formula (7) respectively by means of the self-adapted control parameter to obtain the (M/2+1).sup.th individual to the M.sup.th 10 individual of the t.sup.th generation binary salp population Ft:

[00009] F d t = 1 2 × cos ( p i × t 2 T ) × ( B d t + B d - 1 t ) ( 7 )

[0047] wherein d is equal to M/2+1, M/2+2, M/2+3, . . . , M, B.sub.d.sup.t represents the d.sup.th individual of the t.sup.th generation binary salp population B.sup.t, B.sub.d−1 .sup.t represents the (d−1).sup.th individual in the t.sup.th generation binary salp population B.sup.t, Fd represents the dth individual in the t.sup.th generation initial salp population F.sup.t, pi refers to circularity ratio and cos represents a cosine function; 15 [0053] Step S5.6, the fitness value of each individual in the t.sup.th generation initial salp population Ft is calculated by adopting a method same with those in the Steps 5.1-5.3, all the individuals in the t.sup.th generation initial salp population Ft are sequenced according to the fitness values from small to large, and the individual with the minimum fitness value is marked as firt, the individual with the secondary minimum fitness value as sect and the individual with the third minimum fitness value as thi.sup.t;

[0048] Step S5.7, the t.sup.th generation initial salp population Ft is explored and developed by adopting formulae (8)-(16) based on the elite grey wolf ruling policy to obtain the t.sup.th generation intermediate salp population Gt;

[00010] d i s j f i r , t = .Math. "\[LeftBracketingBar]" 2 r 3 t × f i r j t - F i , j t .Math. "\[RightBracketingBar]" ( 8 ) G i , j f i r , t = f i r j t - A t × d i s j f i r , t ( 9 ) A t = 2 β t × r 4 t - β t ( 10 ) β t = 2 - 2 t T ( 11 ) di s j s e c , t = .Math. "\[LeftBracketingBar]" 2 r 3 t × s e c j t - F i , j t .Math. "\[RightBracketingBar]" ( 12 ) G i , j s e c , t = s e c j t - A t × d i s j s e c , t ( 13 ) di s j t h i , t = .Math. "\[LeftBracketingBar]" 2 r 3 t × t h i j t - F i , j t .Math. "\[RightBracketingBar]" ( 14 ) G i , j t h i , t = t h i j t - A t × d i s j t h i , t ( 15 ) G i , j t = G i , j f i r , t + G i , j s e c , t + G i , j t h i , t 3 ( 16 )

[0049] wherein r3.sup.t and r4.sup.t are respectively random numbers generated by the random function between 0 and 1, A.sup.t and β.sup.t are vector coefficients, fir.sub.j.sup.t represents the j.sup.th column dimensionality value of the individual with the minimum fitness value in the t.sup.th generation initial salp population F.sup.t, sec.sub.j.sup.t represents the j.sup.th column dimensionality value of the individual with the secondary minimum fitness value in the t.sup.th generation initial salp population F.sup.t, thi.sub.j.sup.t represents the j.sup.th column dimensionality value of the individual with the third minimum fitness value in the t.sup.th generation initial salp population Ft, .sub.F represents the j.sup.th column dimensionality value of the i.sup.th individual in the t.sup.th generation initial salp population Ft, and Gi.sub.t represents the j.sup.th column dimensionality value of the i.sup.th individual in the t.sup.th generation intermediate salp population Gt;

[0050] Step S5.8, the fitness value of the t.sup.th generation intermediate salp population Gt is calculated by adopting the method same with those in the Steps S5.1-5.3, the M individuals in the th generation initial salp population F.sup.t and the M individuals in the t.sup.th generation intermediate salp population G.sup.t are combined, the totally 2M individuals are sequenced according to the fitness values from small to big, the M individuals with smaller fitness values are selected, and the M individuals randomly are arranged as the t.sup.th iteration to obtain the t.sup.th generation salp population Y.sup.t; and

[0051] Step S5.9, the minimum fitness value of the t.sup.th generation salp population yt is compared with the global optimum fitness value best, best is updated by adopting the minimum fitness value if the minimum fitness value is smaller than the global optimum fitness value best, and the individual corresponding to the minimum fitness value is taken as the global optimum individual bestposition, and it is kept that the global optimum individuals bestposition with global optimum fitness value best invariable if the minimum fitness value is not smaller than the global optimum fitness value best, and finishing the t.sup.th iteration; and

[0052] Step S6, whether the current value oft is equal to T or not is judged, if not, the value oft is updated by adopting the current value of t plus 1, and then it is returned to the Step S5 for next iteration; if it is equal to T, the iteration process is finished, columns with the dimensionality value of 1 from the first column to the nth column of the current global optimum individual bestposition are determined, and the gene feature data in the microarray gene data set of the medical diseases in these columns are extracted correspondingly to form a selection data set, wherein the selection data set obtained at the time is the dimensionality-reducing gene feature data set of the medical diseases.

[0053] By taking four data sets D1-D4 in a UCI machine learning library as an example, comparative analysis is performed by adopting the method of the present invention and an existing slap swarm algorithm, where specific information of the four data sets D1-D4 is as shown in a table 1. Results of the fitness values respectively obtained by the method (AGSSA) of the present invention and the existing SSA are shown in a table 2. When the fitness value is minimum, the selected feature quantity is as shown in a table 3. When the fitness value is minimum, an error rate of the feature quantity selected based on the K-nearest algorithm is as shown in a table 4:

TABLE-US-00001 TABLE 1 No. Name No. of Instances No. of Features D1 Exactly 1000 14 D2 Lymphography 148 19 D3 Vote 300 17 D4 WineEW 178 14

TABLE-US-00002 TABLE 2 AGSSA SSA D1 0.0300 0.0376 D2 0.0357 0.0465 D3 0.0243 0.0321 D4 0.0126 0.0154

TABLE-US-00003 TABLE 3 AGSSA SSA D1 7.15 7.70 D2 6.36 7.54 D3 3.73 5.31 D4 3.28 4.01

TABLE-US-00004 TABLE 4 AGSSA SSA D1 0.0026 0.0084 D2 0.0190 0.0269 D3 0.0133 0.0163 D4 0.0000 0.0000

[0054] It may be seen from the above data that on the four data sets, the fitness value of the method of the present invention is minimum, indicating that the method has better optimal performance on feature selection problem. It may be seen from feature quantity selection that on the four data sets, the quantity selected by the improved salp swarm algorithm provided by the present invention is smaller than that of an original salp swarm algorithm, indicating that improvement on the algorithm is effectively, thereby helping the algorithm jumping out of local optimization and increasing the probability of finding the optimal solution. It may be seen from the error rate data that the feature selection classification error rate of the method of the present invention is also smaller than that of the original SSA, which reflects superiority of the algorithm provided by the present invention in optimizing the problems.