Filtering out unsound, insignificant, unproductive and redundant rules and itemsets

During search Magnum Opus can automatically filter out rules and itemsets that are likely to be of little interest.  The filter mode controls which such rules and itemsets are filtered out.  The five options are Filter-out None, Filter-out Redundant, Filter-out Unproductive, Filter-out Insignificant and Filter-out Unsound.

Each filter mode other than Filter-out None detects the respective type of spurious rule or itemset and removes it from the list of rules or itemsets that is returned to the user.

In the interactive system the current filter mode is displayed in the Filter Mode ComboBox on the Search Settings Page.

In the command-line system the filter mode is specified with the filter-mode command.

Filtering Rules

A rule is redundant if there is another rule with the same Right-Hand-Side and a subset of the Left-Hand-Side that covers exactly the same cases from the data set.  For example, the first of the two rules below is redundant because it has the same coverage as the second.  Adding Tomatoes to the LHS of the second rule does not change the cases that it covers.

Lettuce & Tomatoes -> Cucumber [Coverage=0.250 (250); Support=0.239 (239); Strength=0.956; Lift=2.91; Leverage=0.1568 (156)]

Lettuce -> Cucumber [Coverage=0.250 (250); Support=0.239 (239); Strength=0.956; Lift=2.91; Leverage=0.1568 (156)]

If a rule is redundant then it will have the same support, strength, lift, and leverage as the rule with respect to which it is redundant.
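The redundancy test above can be sketched in a few lines of Python.  This is an illustration only, not Magnum Opus code: the function names, the representation of cases as sets of items and the toy data are all assumptions, and both rules are taken to share the same Right-Hand-Side.

```python
def cover(lhs, cases):
    """Return the indices of the cases that contain every item in lhs."""
    return {i for i, case in enumerate(cases) if lhs <= case}


def is_redundant(lhs, other_lhs, cases):
    """True if other_lhs is a proper subset of lhs and covers the same cases.

    Both rules are assumed to have the same Right-Hand-Side.
    """
    return other_lhs < lhs and cover(lhs, cases) == cover(other_lhs, cases)


# Hypothetical toy data: every case containing Lettuce also contains Tomatoes.
cases = [
    {"Lettuce", "Tomatoes", "Cucumber"},
    {"Lettuce", "Tomatoes"},
    {"Bread"},
    {"Tomatoes", "Bread"},
]

print(is_redundant({"Lettuce", "Tomatoes"}, {"Lettuce"}, cases))
print(is_redundant({"Tomatoes", "Bread"}, {"Tomatoes"}, cases))
```

In this toy data the first call reports a redundant rule, because Lettuce alone covers the same two cases as Lettuce & Tomatoes.  The second call does not, because Tomatoes alone covers more cases than Tomatoes & Bread.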

Note, in Magnum Opus versions 1.0 to 3.0 redundant rules were called trivial rules.

A rule is unproductive if there is another rule with the same Right-Hand-Side and a subset of the Left-Hand-Side that has equal or higher strength.  For example, the first of the rules below is unproductive because it has lower strength than the second.  Adding Promotion1=f to the LHS of the second rule lowers its strength from 0.907 to 0.905.

Profitability99 < 419 & Promotion1=f -> Spend99 < 2030 [Coverage=0.274 (274); Support=0.248 (248); Strength=0.905; Lift=2.72; Leverage=0.1568 (156)]

Profitability99 < 419 -> Spend99 < 2030 [Coverage=0.333 (333); Support=0.302 (302); Strength=0.907; Lift=2.72; Leverage=0.1911 (191)]

If a rule is unproductive then it will have the same or worse support, strength, lift, and leverage as the rule with respect to which it is unproductive.
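As a rough sketch, the comparison can be written as follows.  It assumes that strength is support divided by coverage, which is consistent with the figures shown above, and the function names are hypothetical.  The counts are taken from the two example rules.

```python
def strength(coverage_count, support_count):
    """Strength, assumed here to be support divided by coverage."""
    return support_count / coverage_count


def is_unproductive(rule, generalization):
    """True if the generalization is at least as strong as the rule.

    Each argument is a (coverage_count, support_count) pair.  The two rules
    are assumed to share a Right-Hand-Side, and the generalization's
    Left-Hand-Side is assumed to be a subset of the rule's.
    """
    return strength(*generalization) >= strength(*rule)


rule = (274, 248)            # Profitability99 < 419 & Promotion1=f -> Spend99 < 2030
generalization = (333, 302)  # Profitability99 < 419 -> Spend99 < 2030

print(round(strength(*rule), 3), round(strength(*generalization), 3))
print(is_unproductive(rule, generalization))
```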

A rule is insignificant unless its strength is significantly greater than that of every one of its immediate generalizations and of the default rule.  An immediate generalization is formed by deleting a single condition from the LHS of a rule.  The default rule is formed by deleting all conditions from the LHS of a rule.  A Fisher exact test is used to test for significance.  The critical value for the significance test can be chosen by the user and defaults to 0.01.  For example, the first of the rules below is insignificant using the default critical value of 0.01 because adding NoVisits99 < 35 to the LHS of the second rule does not significantly increase its strength.

Spend99 < 2030 & NoVisits99 < 35 -> Profitability99 < 419 [Coverage=0.272 (272); Support=0.255 (255); Strength=0.938; Lift=2.82; Leverage=0.1644 (164)]

Spend99 < 2030 -> Profitability99 < 419 [Coverage=0.333 (333); Support=0.302 (302); Strength=0.907; Lift=2.72; Leverage=0.1911 (191)]

Filtering out insignificant rules removes many rules that are formed by adding a further condition to the Left-Hand-Side of another rule without substantially increasing its strength.
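The following Python sketch shows a one-sided Fisher exact test on a 2x2 table and how its results might be combined across generalizations.  This section does not state how Magnum Opus lays out the table, so the layout, the function names and the counts are assumptions made for illustration: row one holds the cases covered by the rule, row two holds the cases covered by the generalization but not by the rule, and column one holds the cases that satisfy the Right-Hand-Side.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    With all margins held fixed, this is the probability of a top-left
    count of a or more.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / total


def is_insignificant(p_values, critical_value=0.01):
    """True if the test against any one generalization is not significant.

    p_values holds one p-value per immediate generalization, plus one for
    the default rule.
    """
    return max(p_values) > critical_value


# Hypothetical counts, chosen to be small enough to check by hand.
p = fisher_one_sided(3, 1, 1, 3)
print(p)                      # 17/70, approximately 0.243
print(is_insignificant([p]))  # True at the default critical value of 0.01
```

Taking the largest p-value reflects the requirement that the rule must improve significantly on each of its generalizations, not just on one of them.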

If a rule is redundant then it will also be unproductive.  Hence, Filter-out Unproductive Mode filters out all rules filtered out by Filter-out Redundant Mode. If a rule is unproductive then it will also be insignificant.  Hence, Filter-out Insignificant Mode filters out all rules filtered out by Filter-out Redundant and Filter-out Unproductive Modes.

Filtering Itemsets

An itemset is redundant if it contains two or more items for a single attribute, or if one of its generalizations has coverage identical to that of one of that generalization's own generalizations.

An example of the first type of redundant itemset is Spend99 < 2030 & Spend99 > 4978.
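A check for this first type of redundancy might look like the sketch below.  Representing each item as an (attribute, condition) pair is an assumption made for illustration, as is the function name.

```python
from collections import Counter


def has_repeated_attribute(itemset):
    """True if two or more items in the itemset refer to the same attribute.

    Each item is an (attribute, condition) pair.
    """
    counts = Counter(attribute for attribute, _ in itemset)
    return any(n > 1 for n in counts.values())


print(has_repeated_attribute([("Spend99", "< 2030"), ("Spend99", "> 4978")]))
print(has_repeated_attribute([("Spend99", "< 2030"), ("Profitability99", "< 419")]))
```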

For an example of the second type of redundant itemset, consider the following four itemsets.  Adding Spend99<2030 to Profitability99<419 & SocioEconomicGroup=D2 does not change its coverage.  It follows that for any item that is added to Profitability99<419 & Spend99<2030 & SocioEconomicGroup=D2 the resulting coverage will be identical to that obtained by adding the item to Profitability99<419 & SocioEconomicGroup=D2.  In consequence, any specialization of Profitability99<419 & Spend99<2030 & SocioEconomicGroup=D2 is redundant.  Thus, the fourth itemset below is redundant.

Profitability99<419 & SocioEconomicGroup=D2
[Coverage=0.033 (33); Leverage=0.0007 (0.7)]

Profitability99<419 & Spend99<2030 & SocioEconomicGroup=D2
[Coverage=0.033 (33); Leverage=0.0037 (3.7)]

Profitability99<419 & Grocery<873 & SocioEconomicGroup=D2
[Coverage=0.032 (32); Leverage=0.0051 (5.1)]

Profitability99<419 & Spend99<2030 & Grocery<873 & SocioEconomicGroup=D2
[Coverage=0.032 (32); Leverage=0.0063 (6.3)]
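The second type of redundancy can be sketched as a brute-force check over a small data set.  This follows one reading of the definition, in which only proper generalizations of the itemset are examined.  The function names, the abbreviated item names (P, S, G and D2 standing in for the four items above) and the toy cases are assumptions made for illustration.

```python
from itertools import combinations


def cover(itemset, cases):
    """Return the indices of the cases that contain every item in the itemset."""
    return frozenset(i for i, case in enumerate(cases) if itemset <= case)


def is_redundant_itemset(itemset, cases):
    """True if some proper generalization of the itemset has the same cover
    as one of its own immediate generalizations."""
    items = frozenset(itemset)
    for size in range(2, len(items)):
        for combo in combinations(items, size):
            generalization = frozenset(combo)
            target = cover(generalization, cases)
            for item in generalization:
                if cover(generalization - {item}, cases) == target:
                    return True
    return False


# Hypothetical toy data in which adding S to {P, D2} does not change the cover.
cases = [
    {"P", "S", "D2", "G"},
    {"P", "S", "D2"},
    {"P", "G"},
    {"S", "G"},
    {"D2", "G"},
]

print(is_redundant_itemset({"P", "S", "G", "D2"}, cases))
print(is_redundant_itemset({"P", "G", "D2"}, cases))
```

Only immediate generalizations need to be compared, because removing items can never shrink the cover: if removing two items leaves the cover unchanged, then removing just one of them does too.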

The unproductive filter removes any itemsets that are redundant or that have leverage ≤ 0.0.

The insignificant filter removes any itemsets that are redundant or for which a Fisher exact test fails to reject the null hypothesis that leverage ≤ 0.0.

The Unsound Filter

The test for whether a rule or itemset is unsound adjusts the test for insignificance to take account of the number of rules or itemsets considered in a search.  It first determines the number of associations in the search space for each size of LHS or itemset.  The critical value applied to associations of a given size is then the user-specified critical value divided by the product of the number of possible associations of that size and the number of sizes allowed.  This adjustment strictly controls the risk of any association being accepted falsely at the unadjusted significance level.  Note that any two rules X -> Y and Y -> X (that is, two rules in which the LHS and RHS are swapped) are counted as a single association when determining the number of associations.
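The adjustment amounts to a simple division, sketched below in Python.  The function name and the example counts are hypothetical, and how Magnum Opus counts the associations in the search space is not shown here: the count is simply passed in.

```python
def adjusted_critical_value(critical_value, associations_of_size, sizes_allowed):
    """Critical value applied to associations of one particular size.

    critical_value       -- the unadjusted, user-specified critical value
    associations_of_size -- number of possible associations of this size
    sizes_allowed        -- number of different sizes the search may consider
    """
    return critical_value / (associations_of_size * sizes_allowed)


# Hypothetical search: 1000 possible associations of this size, 4 sizes allowed.
print(adjusted_critical_value(0.01, 1000, 4))
```

With these hypothetical figures an association of this size would have to pass the significance test at 0.0000025 rather than 0.01.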