Rule discovery system, method, apparatus, and program
09767411 · 2017-09-19
Assignee
Inventors
Cpc classification
G06N3/004
PHYSICS
G06F16/90
PHYSICS
G06N3/043
PHYSICS
International classification
Abstract
A system includes a free itemset generation unit to generate a set of free itemsets, each having a frequency in the database greater than or equal to a predetermined threshold value set in advance, a valid rule candidate generation unit to generate rule candidates and store the generated rule candidates, and a rule minimality decision unit to check minimality of each of the generated rule candidates and output the generated rule candidate to an output apparatus when the generated rule candidate is determined to be minimal.
Claims
1. A rule discovery system comprising: a storage apparatus configured to store a database; a data processing apparatus; and an output apparatus, wherein the data processing apparatus includes; a free itemset generation unit to generate a free itemset including a set of items of attribute-value pairs in the database, the free itemset having a frequency in the database greater than or equal to a predetermined threshold value set in advance; a valid rule candidate generation unit to generate a rule candidate including a condition part having the generated free itemset set thereto and a consequent part having an item that does not share an attribute with the free itemset set thereto, the rule candidate having a confidence greater than or equal to a confidence threshold value set in advance, and to store the generated rule candidate in a storage unit, when a ratio of a frequency of the itemset obtained by adding one item x to a free itemset α to a frequency of the free itemset α is greater than or equal to the confidence threshold value, the valid rule candidate generation unit adding to a list of a rule candidate group including a plurality of the rule candidates a rule including a condition part having a free itemset α set thereto and a consequent part having the one item x set thereto; and a rule minimality decision unit to check minimality of the rule candidate generated by the valid rule candidate generation unit and to output the rule candidate to the output apparatus when the rule candidate is determined to be minimal.
2. The rule discovery system according to claim 1, wherein the rule minimality decision unit sorts the list of the rule candidate group generated by the valid rule candidate generation unit in an ascending order of sizes of the rule candidates, the rule minimality decision unit sequentially extracts, from a head of the list, a minimal one or more of the rule candidates and outputs the extracted minimal one or more of the rule candidates to the output apparatus, and the rule minimality decision unit removes, from the list, one or more of the remaining rule candidates in the list that are redundant with respect to the output minimal one or more of the rule candidates.
3. The rule discovery system according to claim 1, comprising an input apparatus configured to receive a threshold value of the frequency and the threshold value of the confidence as setting parameters.
4. The rule discovery system according to claim 1, wherein the rule is a rule expressed as a CFD (Conditional Functional Dependency).
5. A rule discovery system comprising: a storage apparatus configured to store a database; a data processing apparatus; and an output apparatus, wherein the data processing apparatus includes; a free itemset generation unit to generate a free itemset including a set of items of attribute-value pairs in the database, the free itemset having a frequency in the database greater than or equal to a predetermined threshold value set in advance; a valid rule candidate generation unit to generate a rule candidate including a condition part having the generated free itemset set thereto and a consequent part having an item that does not share an attribute with the free itemset set thereto, the rule candidate having a confidence greater than or equal to a confidence threshold value set in advance, and to store the generated rule candidate in a storage unit; and a rule minimality decision unit to check minimality of the rule candidate generated by the valid rule candidate generation unit and to output the rule candidate to the output apparatus when the rule candidate is determined to be minimal, wherein the valid rule candidate generation unit calculates a frequency of an itemset obtained by adding one item x to each free itemset α, and when a ratio of a frequency of the itemset obtained by adding the one item x to the free itemset α to a frequency of the free itemset α is greater than or equal to the confidence threshold value, the valid rule candidate generation unit repeatedly performs a process of adding to a list of a rule candidate group including a plurality of the rule candidates a rule including a condition part having the free itemset α set thereto and a consequent part having the one item x set thereto, as a rule candidate, until the checking for all combinations of the free itemsets α generated by the free itemset generation unit and the item x is finished.
6. A method of discovering a rule from a database by a data processing apparatus, the method comprising: (a) reading the database, generating a free itemset that includes a set of items of attribute-value pairs in the database and that has a frequency in the database greater than or equal to a predetermined threshold value set in advance, and storing the generated free itemset in a storage unit; (b) generating a rule candidate that includes a condition part having the generated free itemset set thereto and a consequent part having an item that does not share an attribute with the free itemset set thereto, and that has a confidence greater than or equal to a confidence threshold value set in advance, and storing the generated rule candidate in a storage unit, when a ratio of a frequency of the itemset obtained by adding one item x to a free itemset α to a frequency of the free itemset α is greater than or equal to the confidence threshold value, adding to a list of a rule candidate group including a plurality of the rule candidates a rule including a condition part having a free itemset α set thereto and a consequent part having the one item x set thereto; and (c) checking minimality of the rule candidate generated and outputting the rule candidate to an output apparatus when the rule candidate is determined to be minimal.
7. The rule discovery method according to claim 6, comprising: sorting the list of the rule candidate group including the generated rule candidates in an ascending order of sizes of the rule candidates; sequentially extracting a minimal one or more of the rule candidates from a head of the list and outputting the extracted minimal one or more of the rule candidates to the output apparatus; and removing, from the list, one or more of the remaining rule candidates in the list that are redundant with respect to the output minimal one or more of the rule candidates.
8. A non-transitory computer readable recording medium storing a program to cause a computer to execute the processing including: (a) reading a database to generate a free itemset that includes a set of items of attribute-value pairs in the database, and that has a frequency in the database greater than or equal to a predetermined threshold value set in advance, and storing the generated free itemset in a storage unit; (b) generating a rule candidate that includes a condition part having the generated free itemset set thereto and a consequent part having an item that does not share an attribute with the free itemset set thereto and that has a confidence greater than or equal to a confidence threshold value set in advance, and storing the generated rule candidate in a storage unit, when a ratio of a frequency of the itemset obtained by adding one item x to a free itemset α to a frequency of the free itemset α is greater than or equal to the confidence threshold value, adding to a list of a rule candidate group including a plurality of the rule candidates a rule including a condition part having a free itemset α set thereto and a consequent part having the one item x set thereto; and (c) checking minimality of each of the generated rule candidates and outputting the rule candidate to an output apparatus when the rule candidate is determined to be minimal.
9. The non-transitory computer readable recording medium storing a program according to claim 8, wherein when the rule candidate is determined to be minimal in the processing (c), the process of outputting the rule candidate to the output apparatus includes: sorting the list of the rule candidate group including the generated rule candidates in an ascending order of sizes of the rule candidates; sequentially extracting a minimal one or more of the rule candidates from a head of the list and outputting the extracted minimal one or more of the rule candidates to the output apparatus; and removing, from the list, one or more of the remaining rule candidates in the list that are redundant with respect to the output minimal one or more of the rule candidates.
10. A rule discovery apparatus comprising: a free itemset generation unit to read a database and to generate a set of free itemsets each of the free itemsets including attribute-value pairs in the database and each having a frequency in the database greater than or equal to a predetermined threshold value set in advance; a valid rule candidate generation unit to compute a frequency of each of itemsets, each of the itemsets including a condition part having the generated free itemset set thereto and a consequent part having one item x that does not share an attribute with each free itemset set thereto, the valid rule candidate generation unit adding a rule including a condition part having the free itemset set thereto and a consequent part having the one item x set thereto to a list as a rule candidate, when a ratio of frequency of the itemset obtained by adding the one item x to the free itemset to a frequency of the free itemset is greater than or equal to a predetermined confidence threshold value set in advance; and a rule minimality decision unit to sort the list of a rule candidate group including a plurality of the rules generated by the valid rule candidate generation unit in an ascending order of sizes of the rule candidates, the rule minimality decision unit sequentially extracting a minimal one or more of the rule candidates from a head of the list and outputting the extracted minimal one or more of the rule candidates to an output apparatus, the rule minimality decision unit removing, from the list, one or more of the remaining rule candidates in the list that are redundant with respect to the output minimal one or more of the rule candidates, thereby removing one or more of the rule candidates that are not minimal.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1)
(2)
(3)
(4)
DETAILED DESCRIPTION OF DISCLOSURE
(5) The following describes exemplary embodiments of the present invention with reference to the drawings. In discovering a rule(s) from a database, a set of approximate CFDs (constant CFDs, each having a confidence greater than or equal to a preset value), matched to contents of the database, is calculated, based on the database and a parameter(s) (such as frequency and confidence threshold value) entered by a user. More specifically, there are provided a free itemset generation means (21 of
(6) a rule candidate generation unit (22 of
(7) a rule minimality decision unit (23 of
(8) According to the exemplary embodiment, a data processing apparatus (2) includes:
(9) a free itemset generation unit (21 of
(10) a valid rule candidate generation unit (22 of
(11) a rule minimality decision unit (23 of
(12) The rule minimality decision unit (23 in
(13) The rule candidate generation unit (22 in
(14) in the case wherein a ratio of a frequency ratio of the itemset obtained by adding the one item to the free itemset to a frequency of the free itemset is greater than or equal to the confidence threshold value, the rule candidate generation unit performs a process of adding to the list a rule including a condition part having the free itemset α set thereto and a consequent part having one item x set thereto, until the checking for all combinations of the free itemsets α and the item x is finished.
(15) According to the present invention, by first computing free itemsets, portions of approximate constant CFDs corresponding to the condition parts can be efficiently enumerated. Further, according to the present invention, candidates for approximate constant CFDs are each rearranged in an ascending order of the size of the condition part. Then, the candidates are extracted from the head of the list, and the approximate constant CFD that includes a redundant term with respect to each extracted approximate constant CFD is removed. For this reason, only a minimal approximate constant CFD(s) can be efficiently obtained. As a result, a set of the approximate constant CFDs useful for grasping or correcting the contents of the database by the user can be efficiently obtained. A description will be given below in connection with an exemplary embodiment.
First Exemplary Embodiment
(16) Referring to
(17) The storage apparatus 3 includes a database storage unit 31. In the database storage unit 31, there is stored in advance a database from which a rule(s) is (are) to be extracted.
(18) The data processing apparatus 2 includes a free itemset generation unit (means) 21, a valid rule candidate generation unit (means) 22 and a rule minimality decision unit (means) 23.
(19) The free itemset generation means 21 generates a free itemset of the database stored in the database storage unit 31, using a parameter (such as a frequency threshold value k) supplied from the input apparatus 1.
(20) The free itemset generation means 21 stores the generated free itemset in a not shown storage unit in the data processing apparatus 2, a not shown storage unit in the free itemset generation means 21, or a predetermined storage region in the storage apparatus 3. An attribute-value pair appearing in the database is called as an “item”, and a set of items is referred to as an “itemset”.
(21) The valid rule candidate generation means 22 puts the free itemset generated by the free itemset generation means 21 on the condition part of a rule, and puts an item that does not share an attribute with the free itemset on the consequent part of the rule, thereby generating the rule that is valid (whose confidence is greater than or equal to a threshold value p), and adds the generated valid rule to a list L to store the generated valid rule.
(22) The valid rule candidate generation means 22 stores the list L to which the valid rule has been added in a not shown storage unit (e.g., a memory such as a DRAM (dynamic random access memory), a magnetic disk, or the like) in the data processing apparatus 2, a not shown storage unit in the valid rule candidate storage means 22, or a predetermined storage region in the storage apparatus 3.
(23) The rule minimality decision means 23 outputs the rule (valid rule) generated by the valid rule candidate generation means 22 to the output apparatus 4, when the generated rule is determined to be minimal. Herein, rule being “minimal” means that the rule loses its validity when arbitrary one or more items are removed from a condition part of the rule.
(24) Specifically, to take an example, the valid rules obtained by the valid rule candidate generation means 22 are each sorted in an ascending order of the size and is then stored in the list L, and rule candidates are sequentially extracted from the head of the list L and are output. The rule that is included in the list L and is redundant with respect to each currently extracted rule is removed, for example. The rule that is not minimal is thereby removed.
(25) Next, operation of the present exemplary embodiment will be described in detail with reference to
(26) The parameter supplied from the input apparatus 1 and contents of the database given from the database storage unit 31 are supplied to the free itemset generation means 21.
(27) The free itemset generation means 21 extracts, from attribute-value pairs (referred to as items appearing in the database, all free itemsets (referred to as “frequent free itemsets”) whose frequencies are each greater than or equal to a threshold value k (in step A1). As described above, a free itemset is a set of items whose frequency is truly increased by removing arbitrary one or more items (items).
(28) Next, the valid rule candidate generation means 22 generates a valid rule candidate from each of the free itemsets generated by the free itemset generation means 21. Specifically, when an item (item) that does not share an attribute with a free itemset α is indicated by x, the valid rule candidate generation means 22 computes the frequency of the free itemset α and the frequency of an itemset (itemset) (α+{x}) in which one item x is added to the free itemset α (in step A2).
(29) The valid rule candidate generation means 22 checks whether or not a ratio (frequency of α+{x})/(frequency of α) that is a ratio of a frequency of the itemset (α+{x}), in which the one item x is added to the free itemset α, to a frequency of the free itemset α is greater than or equal to the confidence threshold value p (in step A3).
(30) When the ratio (frequency of α+{x})/(frequency of α)≧p, the valid rule candidate generation means 22 adds a constant CFD of ψ: α.fwdarw.x
(31) to the list L as a candidate for a valid rule (approximate constant CFD) (in step A4).
(32) The valid rule candidate generation means 22 determines whether or not the checking in step A3 (checking whether or not the ratio (frequency of α+{x})/(frequency of α) is greater than or equal to the confidence threshold value p) has been finished for all combinations of the free itemsets α and the item x (in step S5). When the valid rule candidate generation means 22 determines that the checking in step A3 is not finished for all the combinations of the free itemsets α and the item x, the procedure returns to step A2. Then, the checking in step A3 is performed for all the combinations of the free itemsets α and the item x.
(33) When the valid rule candidate generation means 22 determines that the checking in step A3 has been finished for all the combinations of the free itemsets α and the item x, the rule minimality decision means 23 checks whether or not the rule of each rule candidate (approximate constant CFD) in the list L generated by the valid rule candidate generation means 22 is minimal.
(34) Specifically, the rule minimality decision means 23 sorts the elements of the list L in the ascending order of the size (number of items in the CFD condition part) of each element (in step A6).
(35) The rule minimality decision means 23 sequentially extracts each element (rule ψ: approximate constant CFD) from the head of the list L where the elements of the list L are each sorted in the ascending order of the size, and outputs the extracted rule (ψ) to the output apparatus 4 (in step A7).
(36) The rule minimality decision means 23 removes, from among the rule candidates included in the list L, the rule candidate that is redundant with respect to the rule (ψ) currently extracted from the list L and output (in step A8).
(37) The rule minimality decision means 23 determines whether or not the list L is vacant (in step A9). When the rule minimality decision means 23 determines that the list L is not vacant (branching to No in step A9), the procedure returns to the process in step A7. When the rule minimality decision means 23 determines that the list L is vacant (branching to Yes in step A9), the rule minimality decision means 23 finishes computation for rule discovery.
(38) The following describes the operation and effect of the present exemplary embodiment will be described.
(39) In the present exemplary embodiment, a rule to be discovered is limited to a constant CFD (approximate constant CFD) that substantially holds. By enumerating free itemsets initially, all candidates for the condition parts of approximate constant CFDs ar enumerated. For this reason, generation of the candidates for the constant CFD (approximate constant CFDs) that substantially hold can be reduced to as little as possible.
(40) Since minimality of each rule is checked in the ascending order of the size of the condition part of the constant CFD, an increase in time needed for determination of the minimality can be reduced. As a result, efficient discovery of the rule becomes possible.
(41) Next, the operation of the present exemplary embodiment will be described, using a specific example. A data set, made up by attributes and tuples of the following Table 1, is registered in the database storage unit 31, as shown in
(42)
(43) TABLE-US-00001 TABLE 1 Attribute 1 Attribute 2 Attribute 3 Tuple 1 1 P S Tuple 2 1 Q S Tuple 3 2 P T Tuple 4 3 P T
(44) The free itemset generation means 21 receives the above-mentioned table 1, and, as parameters, k=2 and p=0.66, where k denotes a lower limit of the frequency (threshold value) and p is a lower limit of the confidence (threshold value) both for determining whether or not a rule is valid.
(45) The free itemset generation means 21 extracts a set of all free itemsets, each having a frequency of occurrence in the database greater than or equal to k=2, which are:
(46) {empty (4), “attribute 1=1” (2), “attribute 2=P” (3), “attribute 3=S” (2), “attribute 3=T” (2)} (each numerical value in parentheses denotes the frequency of the free itemset) (refer to step A1 in
(47) For these five free itemsets, the following valid CFD rules having these five free itemsets provided in the condition parts of the CFDs are obtained.
(48) First, assuming x=“attribute 2=P” for α=empty,
(49) a frequency of empty is 4 (which is equal to the number of all the tuples), and
(50) a frequency of “attribute 2=P” is 3.
(51) The ratio (confidence) of the frequency of “attribute 2=P” and to the frequency of empty is:
(52) (frequency of “attribute 2=P”)/(frequency of empty)=¾=0.75.
(53) This value of 0.75 exceeds the confidence threshold value of 0.66.
(54) For this reason, the valid rule candidate generation means 22 adds
(55) CFD ψ: empty.fwdarw.“attribute 2=P” (frequency=3, and confidence=0.75)
(56) to the list L, as a valid rule candidate.
(57) Similarly, the valid rule candidate generation means 22 adds the following four CFDs (approximate constant CFDs) that substantially hold to the list L as valid rule candidates (in steps A2 to A4):
(58) “attribute 1=1”.fwdarw.“attribute 3=S” (frequency=2, and confidence=1),
(59) “attribute 2=P”.fwdarw.“attribute 3=T” (frequency=2, and confidence=⅔=0.667),
(60) “attribute 3=S.fwdarw.attribute 1=1” (frequency=2, and confidence=1),
(61) “attribute 3=T.fwdarw.attribute 2=P” (frequency=2, and confidence=1).
(62) The rule minimality decision means 23 sorts (rearranges) each element (rule candidate) in the list L in the ascending order of the size (in step A6 in
(63) empty.fwdarw.“attribute 2=P”
(64) appears at a head of the list L (refer to the list L of the CDF candidates in (step A2 to A6) in
(65) The rule minimality decision means 23 extracts this rule:
(66) empty.fwdarw.“attribute 2=P”
(67) from the head of the list L.
(68) Since this is a minimal rule, this rule is output to the output apparatus 4. In this case, out of the rule candidates that are present in the list L, the rule candidate:
(69) “attribute 3=T”.fwdarw.“attribute 2=P”
(70) is redundant with respect to the currently extracted rule candidate:
(71) empty.fwdarw.“attribute 2=P”.
(72) Thus, the rule minimality decision means 23 removes the rule candidate:
(73) “attribute 3=T”.fwdarw.“attribute 2=P”
(74) from the list L.
(75) Next, the rule minimality decision means 23 sequentially extracts the remaining rule candidates that are present in the list L, which are:
(76) “attribute 1=1”.fwdarw.“attribute 3=S” (frequency=2, and confidence=1),
(77) “attribute 2=P”.fwdarw.“attribute 3=T” (frequency=2, and confidence=0.667), and
(78) “attribute 3=S”.fwdarw.“attribute 1=1” (frequency=2, and confidence=1), and then sequentially output these rule candidates to the output apparatus 4.
(79) When the above-mentioned rule candidates are individually extracted from the list L, there is no redundant rule with respect to any of these rule candidates in the list L.
(80) For this reason, the rule minimality decision means 23 outputs the above-mentioned four approximate constant CFDs. As a result, the list L becomes empty, so that the rule discovery algorithm is finished (in steps A7 to A9).
Second Exemplary Embodiment
(81) Next, a second exemplary embodiment of the present invention will be described in detail, with reference to the drawings. Referring to
(82) When a parameter(s) is (are) supplied from the input apparatus 1, the data processing apparatus 6 first generates a free itemset, using a database stored in a database storage unit 31 in a storage apparatus 3.
(83) Next, the data processing apparatus 6 generates a valid rule candidate from the generated free itemset, and is added to a list L.
(84) The data processing apparatus 6 causes a minimal one or more of a set of valid rule candidates stored in the list L to be displayed on the output apparatus 4.
(85) In the above-mentioned exemplary embodiments, the list L to which a rule candidate is to be added is not limited to a data structure such as a linear list or a queue (Queue). An arbitrary storage configuration (data structure) or a storage region for arrangement or the like may be of course used if the storage configuration or the storage region can store a result obtained by sequentially sorting each element in the order of the size of the element, for example.
(86) Each disclosure of the above-listed Patent Literatures and Non-Patent Literatures is incorporated herein by reference. Modification and adjustment of each exemplary embodiment and each example are possible within the scope of the overall disclosure (including the claims) of the present invention and based on the basic technical concept of the present invention. Various combinations and selections of various disclosed elements (including each element of each claim, each element of each example, each element of each drawing, and the like) are possible within the scope of the claims of the present invention. That is, the present invention naturally includes various variations and modifications that could be made by those skilled in the art according to the overall disclosure including the claims and the technical concept. With respect to a numerical value range described herein, an arbitrary numerical value and a small range included in the numerical value range should be construed to be specifically described even unless otherwise explicitly described.