Association Discovery with Magnum Opus 4.3
A Tutorial Introduction
Table of contents
A worked example of attribute-value data
Statistically sound association discovery
A worked example of holdout evaluation for rules
A worked example of holdout evaluation for itemsets
Computation time, snapshots and anytime results
Copyright © 2007-2009, G. I. Webb & Associates Pty Ltd.
Magnum Opus detects associations within data.
The data is imported into the system from a text file. Users typically extract data from a database into a text file for use with the system. There is considerable flexibility in the formats that may be employed.
The user selects settings that control a search for associations in the data. The user can choose the type of association to be found and the measure used to assess the relative value of an association. The user also specifies the maximum number of associations to be found and any further restrictions on the associations to be considered.
Within the restrictions specified by the user, Magnum Opus finds the associations with the highest values on the specified measure. Magnum Opus will only find fewer than the specified number of associations if the search is terminated by the user or there are fewer than the specified number that satisfy the user specified constraints.
The associations found are recorded in an output file and may optionally be exported to a comma separated value file suitable for input into a spreadsheet for further analysis.
We start with a simple invented example of analyzing the purchasing habits of a customer of a fictitious grocery store. The customer has visited the store on ten occasions, each time buying a different selection of goods. The following item-list file records the customer’s purchasing behavior. Each line represents the items bought on a single visit.
plums, lettuce, tomatoes
celery, confectionery
apples, carrots, tomatoes, potatoes
potatoes
confectionery
carrots
apples, oranges, lettuce, tomatoes
peaches, oranges, celery, potatoes, confectionery
oranges, lettuce, carrots, tomatoes
apples, bananas, plums, carrots, tomatoes, onions
These can be processed by Magnum Opus to find rules such as the following four.
apples -> tomatoes
[Coverage=0.300 (3); Support=0.300 (3); Strength=1.000; Lift=2.00;
Leverage=0.1500 (1.5)]
lettuce -> tomatoes
[Coverage=0.300 (3); Support=0.300 (3); Strength=1.000; Lift=2.00;
Leverage=0.1500 (1.5)]
tomatoes -> apples
[Coverage=0.500 (5); Support=0.300 (3); Strength=0.600; Lift=2.00;
Leverage=0.1500 (1.5)]
tomatoes & oranges -> lettuce
[Coverage=0.200 (2); Support=0.200 (2); Strength=1.000; Lift=3.33;
Leverage=0.1400 (1.4)]
Each rule presents a list of items to the left of the arrow that are associated with the single item to the right of the arrow, followed by a number of statistics that describe the nature of the association. Thus, the first two of these rules indicate that whenever either apples or lettuce are purchased, tomatoes are also purchased. The third rule indicates that apples are more likely to be purchased if tomatoes are purchased. The final rule shows that whenever both tomatoes and oranges are purchased, lettuce is also purchased.
This is a very simplistic example. In practice it would be foolish to draw strong conclusions from such limited data. Indeed, Magnum Opus includes facilities for assessing the strength of evidence in support of a rule, and these mechanisms would reject all the above rules as insufficiently supported by the data. This example is intended to illustrate the type of analysis that Magnum Opus performs, albeit normally on much larger volumes of more complex data.
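The statistics reported for the four rules can be checked by hand. The following Python sketch is not part of Magnum Opus; it is a minimal illustration that counts directly over the ten transactions listed above. The function name rule_stats is our own, and the sketch assumes that the LHS occurs in at least one transaction.

```python
# The ten store visits from the item-list example, one set per visit.
transactions = [
    {"plums", "lettuce", "tomatoes"},
    {"celery", "confectionery"},
    {"apples", "carrots", "tomatoes", "potatoes"},
    {"potatoes"},
    {"confectionery"},
    {"carrots"},
    {"apples", "oranges", "lettuce", "tomatoes"},
    {"peaches", "oranges", "celery", "potatoes", "confectionery"},
    {"oranges", "lettuce", "carrots", "tomatoes"},
    {"apples", "bananas", "plums", "carrots", "tomatoes", "onions"},
]


def rule_stats(lhs, rhs, data):
    """Return the five rule statistics for the rule lhs -> rhs."""
    n = len(data)
    lhs = set(lhs)
    lhs_count = sum(1 for t in data if lhs <= t)               # cases containing the LHS
    both_count = sum(1 for t in data if lhs <= t and rhs in t)  # cases containing LHS and RHS
    rhs_count = sum(1 for t in data if rhs in t)                # cases containing the RHS
    coverage = lhs_count / n
    support = both_count / n
    strength = both_count / lhs_count   # assumes lhs_count > 0
    rhs_freq = rhs_count / n
    return {
        "coverage": coverage,
        "support": support,
        "strength": strength,
        "lift": strength / rhs_freq,
        "leverage": support - coverage * rhs_freq,
    }


for lhs, rhs in [({"apples"}, "tomatoes"), ({"lettuce"}, "tomatoes"),
                 ({"tomatoes"}, "apples"), ({"tomatoes", "oranges"}, "lettuce")]:
    print(sorted(lhs), "->", rhs, rule_stats(lhs, rhs, transactions))
```

For tomatoes & oranges -> lettuce this gives coverage 0.2, support 0.2, strength 1.0, a lift of about 3.33 and a leverage of about 0.14, in agreement with the fourth rule.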
We now provide a fully worked example of an extended variant of the above scenario. The data is now extended to include all customers of the store for a given period of time, resulting in a total of 1000 transactions. The data is contained in the example file distributed with Magnum Opus called tutorial.itl.
Note, there are two versions of Magnum Opus. The command line version runs on Linux systems. The interactive version runs under Windows. In the following and all subsequent examples we provide both a command line for executing the example on the command line system and a step-through of the process for running it on the interactive system. We present the output from the interactive system which may vary in minor respects from that of the command line system.
In the first example we run Magnum Opus with its default settings, except that we limit the number of rules produced to five only.
Command line: mocl item-list-file=tutorial.itl maximum-results=5
Interactive system. First run Magnum Opus. From the File menu select Import Data. The system will display a dialog for selecting a file to open. If necessary, navigate to the Example Files folder within the folder into which you installed the software. Select the file tutorial.itl. The system will now display the following dialog box.

The system recognizes from the itl file extension that the file is probably an item list file. As this is correct and we wish to use the default settings, click the Import Now button. After importing the data the screen should appear as follows.

As we want to limit the number of rules to five, edit the Maximum no. edit box accordingly.

Now click the GO button to commence a search with the selected settings. A dialog will be displayed that allows you to select the file into which the results will be stored. Specify a file name and navigate to the folder in which you want it stored. Then click on the Save button. The system will perform the search, saving the results in the specified file, and then open the file for inspection.
Output:
Magnum Opus - The leader in association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G. I. Webb & Associates Pty Ltd.
Data file: Tutorial.itl
1000 cases / 0 holdout cases / 16 items
Search for rules
Search by leverage
Filter out rules that are insignificant, critical value=0.05
Maximum number of attributes on LHS = 4
Maximum number of rules = 5
Minimum leverage = -1.0
Minimum leverage count = -2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Minimum support = 0.0
Minimum support count = 0
Minimum lift = 0.0
Minimum strength = 0.0
All values allowed on LHS
All values allowed on RHS
Found 5 rules
tomatoes -> lettuce
[Coverage=0.263 (263); Support=0.111 (111); Strength=0.422; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
lettuce -> tomatoes
[Coverage=0.217 (217); Support=0.111 (111); Strength=0.512; Lift=1.94; Leverage=0.0539 (53.9); p=2.35E-019]
tomatoes -> carrots
[Coverage=0.263 (263); Support=0.085 (85); Strength=0.323; Lift=1.85; Leverage=0.0390 (39.0); p=1.83E-012]
carrots -> tomatoes
[Coverage=0.175 (175); Support=0.085 (85); Strength=0.486; Lift=1.85; Leverage=0.0390 (39.0); p=1.83E-012]
onions -> potatoes
[Coverage=0.189 (189); Support=0.082 (82); Strength=0.434; Lift=1.53; Leverage=0.0285 (28.5); p=5.30E-007]
The output file begins with a record of the settings used to produce the rules. It then states the number of rules found, followed by each of those rules. Each rule is composed of two parts. The left-hand-side (LHS) appears before the arrow and the right-hand-side (RHS) appears after the arrow. Then a number of statistics are presented that describe the relationship between the LHS and RHS.
The first rule describes an association between tomatoes and lettuce. The following measures are presented that describe the association.
| Coverage | The coverage of the rule is the number of cases that contain the LHS. In this data 263 cases contain tomatoes, which is 0.263 of the 1000 cases in the data. |
| Support | The support of the rule is the number of cases that contain both the LHS and the RHS. In this data there are 111 cases that contain both tomatoes and lettuce, which represents 0.111 of the total data. |
| Strength | The strength is the support divided by the coverage. This represents the proportion of the cases that contain the LHS that also contain the RHS. It can be thought of as an estimate of the probability that the RHS will occur in a case if the LHS occurs. |
| Lift | The lift is the strength divided by the strength that would be expected if there were no relationship between the LHS and the RHS. A value of 1.0 suggests that there is no relationship between the two. Higher values suggest stronger positive relationships. Lower values suggest stronger negative relationships (the presence of the LHS reduces the likelihood of the RHS). |
| Leverage | The leverage is the support minus the support that would be expected if the LHS and RHS were unrelated to one another. A positive value suggests a positive relationship and a negative value suggests a negative relationship. |
| p | The result of a statistical evaluation of the significance of the rule. The lower this value the less likely that this rule is a spurious outcome resulting from adding an irrelevant value into the LHS. |
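As an arithmetic check on these definitions, the following Python sketch (not part of Magnum Opus; the function name is hypothetical) recomputes the measures for tomatoes -> lettuce from three counts that appear in the output above: 263 cases contain tomatoes, 217 contain lettuce (the coverage of the second rule) and 111 contain both. The p value is omitted because this section does not state how it is calculated.

```python
def measures_from_counts(n_cases, lhs_count, rhs_count, both_count):
    """Compute the rule measures from raw case counts."""
    coverage = lhs_count / n_cases
    support = both_count / n_cases
    strength = both_count / lhs_count          # support divided by coverage
    rhs_freq = rhs_count / n_cases             # strength expected if LHS and RHS are unrelated
    lift = strength / rhs_freq
    leverage = support - coverage * rhs_freq   # support minus expected support
    return coverage, support, strength, lift, leverage


coverage, support, strength, lift, leverage = measures_from_counts(1000, 263, 217, 111)
print(f"Coverage={coverage:.3f} Support={support:.3f} Strength={strength:.3f} "
      f"Lift={lift:.2f} Leverage={leverage:.4f}")
```

The rounded results (0.263, 0.111, 0.422, 1.94 and 0.0539) match the figures reported for the first rule.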
Magnum Opus has several valuable features not found in most association discovery systems. One important difference is that it allows the user to specify both how many associations to find and what measure should be used to judge how interesting an association is. Any of the measures coverage, support, strength, lift or leverage can be used for this purpose.
The first example run, above, found the five rules with the highest leverage. High-leverage rules have a strong positive association between the LHS and RHS: they maximize the number of cases in which the LHS and RHS occur together beyond the number that would be expected if they were not associated with one another.
The two other measures that are most frequently used are strength and lift. For our next example we will rerun the previous analysis using strength as the measure by which to search.
Command line: mocl item-list-file=tutorial.itl \
maximum-results=5 search-mode=strength
Interactive system. Continuing from the previous point, select Strength in the Search by combo box. The screen should now appear as follows.

Now click the GO button to commence the search. As previously, a dialog will be displayed that allows you to select the file into which the results will be stored. Specify a file name and navigate to the folder in which you want it stored. Then click on the Save button. The system will perform the search, saving the results in the specified file, and then open the file for inspection.
Output:
Magnum Opus - The leader in association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G. I. Webb & Associates Pty Ltd.
Data file: Tutorial.itl
1000 cases / 0 holdout cases / 16 items
Search for rules
Search by strength
Filter out rules that are insignificant, critical value=0.05
Maximum number of attributes on LHS = 4
Maximum number of rules = 5
Minimum leverage = -1.0
Minimum leverage count = -2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Minimum support = 0.0
Minimum support count = 0
Minimum lift = 0.0
Minimum strength = 0.0
All values allowed on LHS
All values allowed on RHS
Found 5 rules
bananas & lettuce & peaches -> apples
[Coverage=0.004 (4); Support=0.004 (4); Strength=1.000; Lift=4.52; Leverage=0.0031 (3.1); p=0.0260]
bananas & plums & lettuce -> potatoes
[Coverage=0.004 (4); Support=0.004 (4); Strength=1.000; Lift=3.53; Leverage=0.0029 (2.9); p=0.0385]
lettuce & confectionery & carrots & oranges -> beans
[Coverage=0.002 (2); Support=0.002 (2); Strength=1.000; Lift=14.49; Leverage=0.0019 (1.9); p=0.0476]
plums & onions & peas -> bananas
[Coverage=0.002 (2); Support=0.002 (2); Strength=1.000; Lift=7.87; Leverage=0.0017 (1.7); p=0.0452]
lettuce & oranges & onions -> potatoes
[Coverage=0.008 (8); Support=0.007 (7); Strength=0.875; Lift=3.09; Leverage=0.0047 (4.7); p=0.0369]
Comparing the two sets of rules, the first thing to note is that the rules in the first set all have substantially higher leverage while the second have much higher strength, as these are the measures that each seeks to optimize. It is also notable that the coverage for the rules in the second set is much lower. When coverage is small, there is a substantial risk that values of strength and lift will be overestimated. To guard against this, Magnum Opus supports a Bayesian smoothing mechanism called the m-estimate that adjusts values of strength and lift to reduce this risk. For our next example we will rerun the previous analysis using this mechanism.
Command line: mocl item-list-file=tutorial.itl \
maximum-results=5 search-mode=strength m=2
Interactive system. Continuing from the previous point, select the m-estimate check box. The screen should now appear as follows.

Now click the GO button to commence the search. As previously, a dialog will be displayed that allows you to select the file into which the results will be stored. Specify a file name and navigate to the folder in which you want it stored. Then click on the Save button. The system will perform the search, saving the results in the specified file, and then open the file for inspection.
Output:
Magnum Opus - The leader in association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G. I. Webb & Associates Pty Ltd.
Data file: Tutorial.itl
1000 cases / 0 holdout cases / 16 items
Search for rules
Search by strength
Filter out rules that are insignificant, critical value=0.05
Maximum number of attributes on LHS = 4
Maximum number of rules = 5
Minimum leverage = -1.0
Minimum leverage count = -2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Minimum support = 0.0
Minimum support count = 0
Minimum lift = 0.0
Minimum strength = 0.0
Use m-estimate, m = 2
All values allowed on LHS
All values allowed on RHS
Found 5 rules
lettuce & carrots -> tomatoes
[Coverage=0.045 (45); Support=0.039 (39); Strength estimate=0.841; Lift estimate=3.20; Leverage=0.0272 (27.2); p=3.16E-008]
bananas & plums & lettuce -> potatoes
[Coverage=0.004 (4); Support=0.004 (4); Strength estimate=0.761; Lift estimate=2.69; Leverage=0.0029 (2.9); p=0.0385]
lettuce & oranges & onions -> potatoes
[Coverage=0.008 (8); Support=0.007 (7); Strength estimate=0.757; Lift estimate=2.67; Leverage=0.0047 (4.7); p=0.0369]
bananas & lettuce & peaches -> apples
[Coverage=0.004 (4); Support=0.004 (4); Strength estimate=0.740; Lift estimate=3.35; Leverage=0.0031 (3.1); p=0.0260]
carrots & corn -> lettuce
[Coverage=0.006 (6); Support=0.005 (5); Strength estimate=0.679; Lift estimate=3.13; Leverage=0.0037 (3.7); p=0.00473]
Note first of all that the values for strength and lift are labeled Strength estimate and Lift estimate when the m-estimate is used. Also note that while a number of the same rules are discovered as previously, the estimates of their strength and lift are substantially reduced. Finally, note that one of the rules discovered using the m-estimate has substantially higher coverage than those previously discovered, and that the strength estimate for this rule is quite close to the observed strength (39 / 45 = 0.867). The use of m-estimates is strongly advised when searching by strength or lift.
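This tutorial does not give the formula that Magnum Opus uses for the m-estimate. The Python sketch below assumes the conventional form, in which m imaginary cases distributed according to the overall frequency of the RHS are added to the cases covered by the LHS. The function names are our own. Under that assumption the sketch reproduces the figures reported for lettuce & carrots -> tomatoes, where tomatoes occur in 263 of the 1000 cases.

```python
def m_estimate_strength(support_count, coverage_count, rhs_freq, m=2):
    """Conventional m-estimate: shrink the observed strength towards rhs_freq.

    Assumed form, not taken from the Magnum Opus documentation:
        (support_count + m * rhs_freq) / (coverage_count + m)
    """
    return (support_count + m * rhs_freq) / (coverage_count + m)


def m_estimate_lift(support_count, coverage_count, rhs_freq, m=2):
    """Lift computed from the smoothed strength instead of the observed strength."""
    return m_estimate_strength(support_count, coverage_count, rhs_freq, m) / rhs_freq


# lettuce & carrots -> tomatoes: coverage 45 cases, support 39 cases.
strength_estimate = m_estimate_strength(39, 45, 0.263, m=2)
lift_estimate = m_estimate_lift(39, 45, 0.263, m=2)
print(f"Strength estimate={strength_estimate:.3f}; Lift estimate={lift_estimate:.2f}")
```

With small coverage the adjustment is large: a rule covering only 4 cases is pulled much further towards the overall RHS frequency than one covering 45.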
A search by strength with an m-estimate will tend to find strongly predictive rules. These are rules for which the RHS is very likely whenever the LHS occurs. However, sometimes, rather than highly predictive rules, it is desirable to find rules that ‘beat the odds.’ For example, suppose there is a product that most people buy most of the time, such as might be the case if customers are required to purchase bags if they wish to have their purchases packed. Let us assume that 90% of customers buy bags. In this case the rule
confectionery -> bags
[Coverage=0.336 (336); Support=0.302 (302); Strength=0.900; Lift=1.000;
Leverage=0.0000 (-0.4)]
will enable us to predict with reasonable accuracy that the probability of a customer purchasing a bag if they purchase confectionery is 90%. However, such a rule may not be very useful, as it does not change our default expectation of the probability the customer will purchase a bag. Lift measures how much the rule increases the probability of the RHS relative to the default. To illustrate this, we next perform a search by lift. Note that we will use an m-estimate, as in the previous example.
Command line: mocl item-list-file=tutorial.itl \
maximum-results=5 search-mode=lift m=2
Interactive system. Continuing from the previous point, select Lift in the Search by combo box. The screen should now appear as follows.

Now click the GO button to commence the search. As previously, a dialog will be displayed that allows you to select the file into which the results will be stored. Specify a file name and navigate to the folder in which you want it stored. Then click on the Save button. The system will perform the search, saving the results in the specified file, and then open the file for inspection.
Output:
Magnum Opus - The leader in association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G. I. Webb & Associates Pty Ltd.
Data file: Tutorial.itl
1000 cases / 0 holdout cases / 16 items
Search for rules
Search by lift
Filter out rules that are insignificant, critical value=0.05
Maximum number of attributes on LHS = 4
Maximum number of rules = 5
Minimum leverage = -1.0
Minimum leverage count = -2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Minimum support = 0.0
Minimum support count = 0
Minimum lift = 0.0
Minimum strength = 0.0
Use m-estimate, m = 2
All values allowed on LHS
All values allowed on RHS
Found 5 rules
lettuce & confectionery & carrots & oranges -> beans
[Coverage=0.002 (2); Support=0.002 (2); Strength estimate=0.534; Lift estimate=7.75; Leverage=0.0019 (1.9); p=0.0476]
plums & potatoes & grapes -> beans
[Coverage=0.003 (3); Support=0.002 (2); Strength estimate=0.428; Lift estimate=6.20; Leverage=0.0018 (1.8); p=0.0474]
apples & peaches & onions -> peas
[Coverage=0.007 (7); Support=0.004 (4); Strength estimate=0.463; Lift estimate=5.45; Leverage=0.0034 (3.4); p=0.0307]
bananas & beans -> corn
[Coverage=0.010 (10); Support=0.003 (3); Strength estimate=0.259; Lift estimate=4.80; Leverage=0.0025 (2.5); p=0.0357]
plums & onions & peas -> bananas
[Coverage=0.002 (2); Support=0.002 (2); Strength estimate=0.564; Lift estimate=4.44; Leverage=0.0017 (1.7); p=0.0452]
Whereas the search by strength found rules with higher strength, this search finds rules with reasonable strength for items that are not frequently purchased. For example, beans are only purchased by 6.9% of customers, but when lettuce, confectionery, carrots and oranges are all purchased, beans are always purchased. While the system discounts this evidence due to the small number of examples, it is still taken as evidence of a large increase in the frequency with which beans are purchased by such customers.
Sometimes it will be desirable to find rules for predicting one particular outcome. For example, you might only be interested in predicting the likelihood that customers will purchase beans. The system allows you to restrict the items that are allowed to appear on either the LHS or RHS of a rule. For the next example we will rerun the last analysis but with the RHS restricted to beans.
Command line: mocl item-list-file=tutorial.itl \
maximum-results=5 search-mode=lift m=2 \
rhs-available=beans
Interactive system. Continuing from the previous point, select beans in the Values allowed on RHS selection box. The screen should now appear as follows.

Now click the GO button to commence the search. As previously, a dialog will be displayed that allows you to select the file into which the results will be stored. Specify a file name and navigate to the folder in which you want it stored. Then click on the Save button. The system will perform the search, saving the results in the specified file, and then open the file for inspection.
Output:
Magnum Opus - The leader in association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G. I. Webb & Associates Pty Ltd.
Data file: Tutorial.itl
1000 cases / 0 holdout cases / 16 items
Search for rules
Search by lift
Filter out rules that are insignificant, critical value=0.05
Maximum number of attributes on LHS = 4
Maximum number of rules = 5
Minimum leverage = -1.0
Minimum leverage count = -2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Minimum support = 0.0
Minimum support count = 0
Minimum lift = 0.0
Minimum strength = 0.0
Use m-estimate, m = 2
All values allowed on LHS
Values allowed on RHS:
beans
Only 2 rules satisfy the specified constraints.
lettuce & confectionery & carrots & oranges -> beans
[Coverage=0.002 (2); Support=0.002 (2); Strength estimate=0.534; Lift estimate=7.75; Leverage=0.0019 (1.9); p=0.0476]
plums & potatoes & grapes -> beans
[Coverage=0.003 (3); Support=0.002 (2); Strength estimate=0.428; Lift estimate=6.20; Leverage=0.0018 (1.8); p=0.0474]
Only rules with beans on the RHS are returned. In this case only two such rules can be found.
Sometimes some data elements represent inputs to a process and other outputs. In such circumstances it will often be useful to limit the LHS values to the inputs and the RHS values to the outputs. The rules that are discovered will then represent ways of manipulating the inputs in order to produce specific outcomes.
Rules are a useful way to describe interactions between elements of the data when the objective is to predict the probability of specific items in specific contexts. Sometimes, however, the primary issue is simply to identify which items occur together. In this case, presenting the interactions as rules can be distracting. For example, a single interaction between elements can result in many rules.
Itemsets are simply collections of items that appear together. The system supports two measures of the importance of an itemset, coverage and leverage. The coverage is the number of transactions or cases that contain the itemset. The leverage is the difference between this and the maximum coverage that would be expected assuming that any two subsets of the items were unrelated to one another.
The next example finds itemsets for the tutorial data.
Command line: mocl item-list-file=tutorial.itl \
maximum-results=5 find-itemsets
Interactive system. Continuing from the previous point, select itemsets in the Search for combo box. The screen should appear as follows.

Now click the GO button to commence the search. As previously, a dialog will be displayed that allows you to select the file into which the results will be stored. Specify a file name and navigate to the folder in which you want it stored. Then click on the Save button. The system will perform the search, saving the results in the specified file, and then open the file for inspection.
Output:
Magnum Opus - The leader in association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G. I. Webb & Associates Pty Ltd.
Data file: Tutorial.itl
1000 cases / 0 holdout cases / 16 items
Search for itemsets
Search by leverage
Filter out itemsets that are insignificant, critical value=0.05
Maximum number of values in an itemset = 4
Maximum number of itemsets = 5
Minimum leverage = -1.0
Minimum leverage count = -2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
All values allowed
Found 5 itemsets
lettuce & tomatoes
[Coverage=0.111 (111); Leverage=0.0539 (53.9); p=2.35E-019]
tomatoes & carrots
[Coverage=0.085 (85); Leverage=0.0390 (39.0); p=1.83E-012]
potatoes & onions
[Coverage=0.082 (82); Leverage=0.0285 (28.5); p=5.30E-007]
bananas & peaches
[Coverage=0.040 (40); Leverage=0.0235 (23.5); p=2.74E-009]
lettuce & tomatoes & carrots
[Coverage=0.039 (39); Leverage=0.0196 (19.6); p=1.43E-006]
Each itemset is presented as a list of the items in the set. The coverage and leverage statistics that are provided were described above. To illustrate how itemset leverage is calculated, consider the lettuce & tomatoes & carrots itemset. There are 111 cases that contain lettuce & tomatoes and 175 that contain carrots. Thus, if lettuce & tomatoes were not related to carrots, one would expect there to be approximately 19.4 cases ([175/1000] × [111/1000] × 1000) containing all three elements. There are 85 cases that contain tomatoes & carrots and 217 that contain lettuce. If these two groups were unrelated, one would expect approximately 18.4 cases to contain all three items. There are 45 cases that contain lettuce & carrots and 263 that contain tomatoes. If these two groups were unrelated, one would expect approximately 11.8 cases to contain all three items. Thus, the maximum coverage that can be expected given any assumption that some subsets of these items are unrelated to each other is 19.4. The leverage is the observed coverage less this amount, 39 - 19.4 = 19.6. The p value is the probability that this coverage would be observed if the two subgroups that result in the highest expected coverage were actually unrelated to one another.
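The calculation described in this paragraph can be sketched in a few lines of Python. This is an illustration of the stated definition, not Magnum Opus code, and the function name is hypothetical. Each pair of counts describes one way of splitting the itemset into two subsets.

```python
def itemset_leverage(n_cases, itemset_count, splits):
    """Return (maximum expected count, leverage count) for an itemset.

    splits holds one (count_a, count_b) pair per way of dividing the
    itemset into two subsets: the number of cases containing each subset.
    """
    # Expected number of cases containing both subsets if they were unrelated.
    expected = [count_a * count_b / n_cases for count_a, count_b in splits]
    max_expected = max(expected)
    return max_expected, itemset_count - max_expected


# lettuce & tomatoes & carrots occurs in 39 of 1000 cases.
splits = [
    (111, 175),  # {lettuce, tomatoes} and {carrots}
    (85, 217),   # {tomatoes, carrots} and {lettuce}
    (45, 263),   # {lettuce, carrots} and {tomatoes}
]
max_expected, leverage = itemset_leverage(1000, 39, splits)
print(f"maximum expected count = {max_expected:.1f}, leverage = {leverage:.1f}")
```

The split into lettuce & tomatoes and carrots gives the largest expected count, so it determines the leverage of 19.6 reported in the output.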
So far we have considered only data in the form of lists of items. Many data are recorded in tabular format, with columns representing attributes or fields and each row representing a distinct entity. The cells contain the values of the respective attributes or fields for the given entity. Magnum Opus supports such data, which must be listed in a data file. The columns are separated by a delimiter character such as a TAB or COMMA.
It is also necessary to specify the names and types of the attributes. This information is provided in a separate file called the names file. Each line of a names file starts with the name of an attribute, the first line referring to the leftmost column, the second line to the second leftmost column, and so on.
For categorical attributes, the attribute name is followed by a colon (:) and then either the keyword categorical or a comma separated list of the values that are allowed for the attribute.
Example:
Department: bakery, dairy, beverages
This specifies that the attribute Department can assume any one of three values: bakery, dairy, or beverages. Any case containing any other value will be discarded and an error message generated.
Example:
Department: categorical
This specifies that the attribute Department can assume any value that appears in the data file.
For compatibility with See-5, Magnum Opus also accepts the keyword discrete which is treated as equivalent to categorical.
Numeric attributes must be divided into sub-ranges. These can be specified in the names file. Alternatively, the names file can simply identify the number of sub-ranges and Magnum Opus will select the sub-ranges for you.
For a numeric attribute with specified sub-ranges, the attribute name is followed by a list of sub-range cut points. These indicate how the numeric values for the attribute are to be subdivided into sub-ranges. Each cut point is introduced by one of the relations < or <= which is followed by the value that terminates the sub-range. If the relation is <, the sub-range includes all values less than the specified value. If the relation is <=, the sub-range includes all values less than or equal to the specified value.
Example:
Spend < 10 <= 100
This specifies that the attribute Spend has three sub-ranges, below the first cut point, between the two cut points, and above the last cut point:
Spend < 10
10 <= Spend <= 100
Spend > 100
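To make the cut-point semantics concrete, the following Python sketch assigns a value to a sub-range given a list of cut points. It illustrates the rule described above and is not how Magnum Opus is implemented. The function name and the representation of cut points as (relation, bound) pairs are assumptions made for the example.

```python
def sub_range_index(value, cut_points):
    """Return the index of the sub-range that value falls into.

    cut_points is a list of (relation, bound) pairs in ascending order of
    bound, where relation is "<" or "<=". Index 0 is the lowest sub-range
    and len(cut_points) is the sub-range above the last cut point.
    """
    for index, (relation, bound) in enumerate(cut_points):
        if relation == "<" and value < bound:
            return index
        if relation == "<=" and value <= bound:
            return index
    return len(cut_points)


# The example above: Spend < 10 <= 100
spend_cuts = [("<", 10), ("<=", 100)]
for spend in (9.99, 10, 100, 100.01):
    print(spend, "-> sub-range", sub_range_index(spend, spend_cuts))
```

A spend of exactly 10 falls in the middle sub-range because the first relation is <, while a spend of exactly 100 also falls in the middle sub-range because the second relation is <=.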
To allow Magnum Opus to select sub-ranges, use the keyword numeric, followed by the number of sub-ranges required.
Example:
Spend: numeric 5
For compatibility with See-5, Magnum Opus also accepts the keyword continuous which is treated as numeric 3.
The keyword ignore instructs Magnum Opus to discard any data for the given attribute. This is useful for handling attributes that may appear in the data but which should not be used, such as record identifiers.
We now provide a worked example using the example files distributed with Magnum Opus, tutorial.nam and tutorial.data. Tutorial.nam contains the following:
Profitability99: numeric 3
Profitability98: numeric 3
Spend99: numeric 3
Spend98: numeric 3
NoVisits99: numeric 3
NoVisits98: numeric 3
Dairy: numeric 3
Deli: numeric 3
Bakery: numeric 3
Grocery: numeric 3
SocioEconomicGroup: categorical
Promotion1: t, f
Promotion2: t, f
Most of these attributes are numeric. These numeric attributes have been designated numeric 3, indicating that they should be divided into three sub-ranges, each of which contains approximately the same number of cases. The profitability attributes represent respectively the profit made from a customer in 1999 and 1998. The spend attributes represent the total amount spent by a customer in each year. The NoVisits attributes represent the numbers of store visits in each year. The Dairy, Deli, Bakery, and Grocery attributes record the customer's total spend in each of four significant departments. The remaining three attributes are categorical. The SocioEconomicGroup attribute records an assessment of the customer's socio-economic group. The keyword categorical tells Magnum Opus to use whatever values it finds in the corresponding column in the data file. The final two attributes record whether the customer participated in each of two store promotions. The values that are allowed are listed. This allows error checking. If any other value appears in the column for the attribute an error message will be displayed.
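The tutorial states only that the sub-ranges contain approximately the same number of cases; it does not describe how the boundaries are chosen. One common way to obtain such sub-ranges is equal-frequency binning, sketched below in Python. This is an illustration under that assumption, not the method Magnum Opus uses, and the function name is hypothetical.

```python
def equal_frequency_cuts(values, n_ranges):
    """Return n_ranges - 1 boundaries that split values into sub-ranges of
    roughly equal size. A value v belongs to sub-range i when it is at least
    boundary i - 1 and less than boundary i.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Take the value found at each 1/n_ranges step through the sorted data.
    return [ordered[(i * n) // n_ranges] for i in range(1, n_ranges)]


sample = [5, 1, 9, 3, 7, 2, 8, 4, 6]
print(equal_frequency_cuts(sample, 3))  # boundaries for three sub-ranges
```

When many cases share the same value, the resulting sub-ranges can differ in size, which may be one reason the coverage figures for such sub-ranges are only approximately one third of the data.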
The first line of the data file describes the first entity:
829, 709, 5250, 6560, 70, 82, 1074, 390, 878, 1995, C, f, f
This indicates that for the first entity the value of Profitability99 is 829 and so on through to the value of Promotion2 being ‘f’.
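The correspondence between the names file and a line of the data file is purely positional. The following Python sketch pairs the attribute names from tutorial.nam with the values in the first data line. It only illustrates the mapping and assumes a comma delimiter; it does not reproduce how Magnum Opus parses its input.

```python
# Attribute names in the order they appear in tutorial.nam.
attribute_names = [
    "Profitability99", "Profitability98", "Spend99", "Spend98",
    "NoVisits99", "NoVisits98", "Dairy", "Deli", "Bakery", "Grocery",
    "SocioEconomicGroup", "Promotion1", "Promotion2",
]

# First line of tutorial.data, as shown above.
first_line = "829, 709, 5250, 6560, 70, 82, 1074, 390, 878, 1995, C, f, f"

# The first value belongs to the first attribute, the second to the second, and so on.
values = [field.strip() for field in first_line.split(",")]
record = dict(zip(attribute_names, values))

for name, value in record.items():
    print(f"{name} = {value}")
```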
In the next example we run Magnum Opus on this names file and data file with its default settings, except that we limit the number of rules produced to five only.
Command line: mocl names-file=tutorial.nam \
data-file=tutorial.data maximum-results=5
Interactive system. First run Magnum Opus. From the File menu select Import Data. The system will display a dialog for selecting a file to open. If necessary, navigate to the Example Files folder within the folder into which you installed the software. Select the file tutorial.nam. The system will now display the following dialog box.

The system recognizes from the nam file extension that the file is a names file. As this is correct, we click the Next > button. The system then displays the following dialog box for selecting the data file.

As the system has defaulted to the correct file name and we wish to use the default settings, click the Import Now button. After importing the data the screen should appear as follows.

As we want to limit the number of rules to five, edit the Maximum no. edit box accordingly.

Now click the GO button to commence a search with the selected settings. A dialog will be displayed that allows you to select the file into which the results will be stored. Specify a file name and navigate to the folder in which you want it stored. Then click on the Save button. The system will perform the search, saving the results in the specified file, and then open the file for inspection.
Output:
Magnum Opus - The leader in association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G. I. Webb & Associates Pty Ltd.
Names file: Tutorial.nam
Data file: Tutorial.data
1000 cases / 0 holdout cases / 39 values
Search for rules
Search by leverage
Filter out rules that are insignificant, critical value=0.05
Maximum number of attributes on LHS = 4
Maximum number of rules = 5
Minimum leverage = -1.0
Minimum leverage count = -2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Minimum support = 0.0
Minimum support count = 0
Minimum lift = 0.0
Minimum strength = 0.0
All values allowed on LHS
All values allowed on RHS
Found 5 rules
Spend99<2030 -> Profitability99<419
[Coverage=0.333 (333); Support=0.302 (302); Strength=0.907; Lift=2.72; Leverage=0.1911 (191.1); p=1.66E-178]
Profitability99<419 -> Spend99<2030
[Coverage=0.333 (333); Support=0.302 (302); Strength=0.907; Lift=2.72; Leverage=0.1911 (191.1); p=1.66E-178]
Spend98<1782 -> Profitability98<327
[Coverage=0.331 (331); Support=0.295 (295); Strength=0.891; Lift=2.68; Leverage=0.1848 (184.8); p=5.12E-165]
Profitability98<327 -> Spend98<1782
[Coverage=0.333 (333); Support=0.295 (295); Strength=0.886; Lift=2.68; Leverage=0.1848 (184.8); p=5.12E-165]
NoVisits98<31 -> NoVisits99<35
[Coverage=0.325 (325); Support=0.288 (288); Strength=0.886; Lift=2.69; Leverage=0.1811 (181.1); p=1.89E-159]
As can be seen, the output is very similar to that for transaction data, except that each item consists of an attribute-value pair.
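The measures reported for each rule can be recomputed from the raw counts. The following is a minimal sketch in Python, assuming the usual definitions of coverage, support, strength, lift and leverage. The function name and its arguments are illustrative and are not part of Magnum Opus. The counts are taken from the first two rules in the output above: 1000 cases, 333 cases matching Spend99<2030, 333 cases matching Profitability99<419, and 302 cases matching both.

```python
def rule_measures(n_cases, lhs_count, rhs_count, both_count):
    """Recompute the measures for a rule LHS -> RHS from raw counts.

    Illustrative helper (not a Magnum Opus API), assuming the usual
    definitions of the five measures.
    """
    coverage = lhs_count / n_cases        # proportion of cases matching the LHS
    rhs_coverage = rhs_count / n_cases    # proportion of cases matching the RHS
    support = both_count / n_cases        # proportion matching both LHS and RHS
    strength = both_count / lhs_count     # proportion of LHS cases that match the RHS
    lift = strength / rhs_coverage        # strength relative to the RHS base rate
    leverage = support - coverage * rhs_coverage  # excess over independence
    return {
        "coverage": coverage,
        "support": support,
        "strength": strength,
        "lift": lift,
        "leverage": leverage,
    }


# Counts for Spend99<2030 -> Profitability99<419, read from the output above.
measures = rule_measures(n_cases=1000, lhs_count=333, rhs_count=333, both_count=302)
for name, value in measures.items():
    print(f"{name} = {value:.4f}")
```

Rounded to the precision used in the output, these values agree with the figures shown for the first rule.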
A common analytic task seeks to identify factors that distinguish different groups. This type of analysis is called contrast discovery. To perform contrast discovery it is necessary to provide each example in the data with a label identifying to which group it belongs. For attribute-value data this means providing an attribute whose values indicate group membership. For example, in the tutorial.data file, the Profitability99 attribute might be used to indicate that each example belongs to one of three groups, low profit (Profitability99<419), medium profit (419<=Profitability99<=897) or high profit (Profitability99>897). For transaction data it is necessary to add another item to each transaction. It is important to use a name for these labels that will not be used or mistaken for a standard item. For example, one might add items such as *profitable* and *unprofitable* to the transactions in the tutorial.itl data.
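For transaction data, the group label can be added before the file is imported. The following is a minimal sketch, assuming a comma-delimited item-list line such as those in tutorial.itl. The function name is hypothetical, and how you decide which label each transaction receives depends on your own data.

```python
def add_group_label(transaction_line, label):
    """Append a group label item to one comma-delimited transaction line.

    Hypothetical helper for preparing an item-list file; it is not part
    of Magnum Opus.
    """
    # Split on commas and drop empty fragments left by trailing commas.
    items = [item.strip() for item in transaction_line.split(",") if item.strip()]
    items.append(label)
    return ", ".join(items)


labelled = add_group_label("plums, lettuce, tomatoes", "*profitable*")
print(labelled)
```

The asterisks around the label follow the suggestion above of choosing names that cannot be mistaken for a standard item.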
Once group labels have been added to the data, simply run Magnum Opus restricting the RHS values to the group labels. The next example illustrates this process using the data in the file tutorial.data, treating the Profitability99 attribute as the group variable.
Command line: mocl names-file=tutorial.nam data-file=tutorial.data \
maximum-results=5 rhs-available=Profitability99
Interactive
system. Continuing from the point at which the last
example left off, select the three values for profitability in the Values
allowed on RHS edit box by first
left-clicking Profitability99<419

and then,
holding down the SHIFT key and left-clicking Profitability99>897.

Now
click the GO button to commence a search with the selected settings. A dialog will be displayed that allows you to
select the file into which the results will be stored. Specify a file name and navigate to the
folder in which you want it stored. Then
click on the Save button. The system
will perform the search, saving the results in the specified file and then open
the file for inspection.
Output:
Magnum Opus - The leader in
association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G. I. Webb & Associates
Pty Ltd.
Names file: Tutorial.nam
Data file: Tutorial.data
1000 cases / 0 holdout cases / 39 values
Search for rules
Search by leverage
Filter out rules that are insignificant, critical
value=0.05
Maximum number of attributes on LHS = 4
Maximum number of rules = 5
Minimum leverage = -1.0
Minimum leverage count = -2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Minimum support = 0.0
Minimum support count = 0
Minimum lift = 0.0
Minimum strength = 0.0
All values allowed on LHS
Values allowed on RHS:
Profitability99<419 419<=Profitability99<=897 Profitability99>897
Found 5 rules
Spend99<2030 -> Profitability99<419
[Coverage=0.333 (333); Support=0.302 (302);
Strength=0.907; Lift=2.72; Leverage=0.1911 (191.1); p=1.66E-178]
Spend99>4278 -> Profitability99>897
[Coverage=0.333 (333); Support=0.287 (287);
Strength=0.862; Lift=2.60; Leverage=0.1768 (176.8); p=8.57E-149]
Spend99<2030 & Grocery<873 ->
Profitability99<419
[Coverage=0.278 (278); Support=0.265 (265);
Strength=0.953; Lift=2.86; Leverage=0.1724 (172.4); p=2.52E-008]
Grocery<873 -> Profitability99<419
[Coverage=0.333 (333); Support=0.277 (277);
Strength=0.832; Lift=2.50; Leverage=0.1661 (166.1); p=6.14E-129]
Spend99<2030 & NoVisits99<35 ->
Profitability99<419
[Coverage=0.272 (272); Support=0.255 (255);
Strength=0.938; Lift=2.82; Leverage=0.1644 (164.4); p=0.000257]
The LHS of each rule that is discovered indicates a set of factors that are more frequently associated with the RHS group than with any of the other groups. For example, the first rule indicates that customers with Profitability99<419 are more likely to have a low value for Spend99 (Spend99<2030) than are customers with other levels of Profitability99.
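The contrast expressed by the first rule can be checked with a little arithmetic. This sketch uses only counts reported in the outputs above (1000 cases, 333 with Profitability99<419, 333 with Spend99<2030, and 302 with both). The variable names are illustrative.

```python
n_cases = 1000
group_count = 333    # cases with Profitability99<419 (the group of interest)
factor_count = 333   # cases with Spend99<2030 (the contrasting factor)
both_count = 302     # cases with both

# Proportion of the low-profit group that has low spend.
rate_in_group = both_count / group_count

# Proportion of all other customers that has low spend.
rate_outside_group = (factor_count - both_count) / (n_cases - group_count)

print(f"Spend99<2030 within Profitability99<419: {rate_in_group:.3f}")
print(f"Spend99<2030 within the other groups:     {rate_outside_group:.3f}")
```

The large gap between the two proportions is what makes Spend99<2030 a distinguishing factor for the low-profit group.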
Due to the large
number of potential associations that are considered during association
discovery, it is inevitable that some associations will be ‘discovered’ that
only appear strong by chance. Magnum
Opus incorporates unique facilities for controlling the risk of finding
such associations by applying statistical tests. These tests are adjusted for the size of the
search space and the number of associations found, as appropriate. Assuming the sample data are a random sample
of the broader population about which you wish to reach conclusions, these
tests ensure that the risk of ‘discovering’ a spurious association is no
greater than the user-specified significance level. By default, significance levels are set to
0.05.
Magnum Opus supports two mechanisms for statistically sound association discovery. Within-search testing adjusts the significance level applied to statistical tests used while the search is being conducted. Use the Unsound filter to perform within-search testing. For rule discovery the unsound filter discards any rule whose strength is not significantly higher than that of any of its generalizations (rules formed by deleting elements from the LHS). For itemset discovery, the unsound filter discards itemsets that are not significantly more frequent than could be expected by assuming that any two subsets of the itemset are independent of one another.
Note that the default filter, the Insignificant filter, also applies a statistical test, but that this test is not adjusted for the size of the search space and hence is not statistically sound. The Insignificant filter is useful for discarding rules and itemsets that are very likely to be spurious, but is likely to still accept some spurious associations.
Command line: Add the
option filter=unsound to the command line.
% mocl names-file=tutorial.nam data-file=tutorial.data \
filter=unsound
Interactive
system. Select UNSOUND as the value for the Filter out
option.

The second mechanism is holdout evaluation. This requires that the data are divided into an exploratory and a holdout set. The associations are discovered from the exploratory data and tested on the holdout data. One way to do this is to have Magnum Opus randomly divide the data into these two sets when it is imported. You must then specify that holdout evaluation is to be performed and which statistical tests to employ.
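If you prefer to prepare separate exploratory and holdout files yourself, a random split can be sketched as follows. This is an illustration only and does not reproduce the procedure that Magnum Opus uses internally. The function name, the seed and the use of Python's random module are all assumptions.

```python
import random


def split_exploratory_holdout(records, proportion=0.5, seed=0):
    """Randomly divide records into an exploratory set and a holdout set.

    Hypothetical helper; `proportion` is the share of records placed in
    the exploratory set.
    """
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * proportion)
    return shuffled[:cut], shuffled[cut:]


exploratory, holdout = split_exploratory_holdout(range(1000), proportion=0.5)
print(len(exploratory), len(holdout))
```

Each record lands in exactly one of the two sets, so associations found in the exploratory set are tested on data that played no part in their discovery.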
The following holdout evaluation tests are supported for rules.
| Test | Null Hypothesis | Statistical technique |
|---|---|---|
| Minimum Coverage | Coverage ≤ Min Coverage | Binomial sign test |
| Minimum Support | Support ≤ Min Support | Binomial sign test |
| Minimum Strength | Strength ≤ Min Strength | Binomial sign test |
| Minimum Lift | Lift ≤ Min Lift | Binomial sign test |
| Minimum Leverage | Leverage ≤ Min Leverage | Binomial sign test |
| Positive correlation | Support ≤ Coverage × RHS_Coverage | Fisher exact test |
| Improvement over generalizations | Strength ≤ the maximum Strength of any generalization of the current rule | Fisher exact test |
| Partial with respect to specializations | There exists another rule GLHS -> RHS in the set of best rules, that has not been rejected by holdout evaluation, that is a specialization of the current rule, and such that the LHS and RHS of the current rule are conditionally independent given the negation of GLHS. | Fisher exact test |
The following holdout evaluation tests are supported for itemsets.
| Test | Null Hypothesis | Statistical technique |
|---|---|---|
| Minimum Coverage | Coverage ≤ Min Coverage | Binomial sign test |
| Minimum Leverage | Leverage ≤ Min Leverage | Binomial sign test |
| Improvement over generalizations | Coverage ≤ the maximum of coverage(A) × coverage(B) for any partition of the current itemset into two subsets A and B. | Fisher exact test |
| Self-sufficient | Coverage ≤ the maximum of coverage(A) × coverage(B) for any partition of the current itemset into two subsets A and B within the set of cases not covered by the difference between the current itemset and any of its productive supersets. | |
The positive correlation test is the default test for rules. It tests whether the leverage of the rule is greater than zero. The improvement over generalizations test is the default test for itemsets. The improvement over generalizations tests are equivalent to the tests applied by the unsound filter. The Partial with respect to specializations and Self-sufficient tests check whether a specialization of a rule (a rule created by adding elements to the LHS), or the supersets of an itemset, can explain the frequency with which the rule or itemset occurs.
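To give a feel for the tests on minimum constraints, the following sketch computes a one-sided exact binomial tail probability. It assumes that the test asks how likely the observed holdout count would be if the true proportion were only equal to the minimum. The precise form of the test used by Magnum Opus is not described here, so treat this as an illustration. The minimum coverage of 0.25 is a hypothetical setting. The holdout coverage of 147 out of 500 is the figure reported for potatoes in the worked example that follows.

```python
from math import comb


def binomial_upper_tail(successes, trials, p0):
    """Return P(X >= successes) for X ~ Binomial(trials, p0).

    Illustrative helper; suitable for moderate sample sizes such as the
    500 holdout cases used in this tutorial.
    """
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical minimum coverage of 0.25, tested against 147 of 500 holdout cases.
p_value = binomial_upper_tail(successes=147, trials=500, p0=0.25)
print(f"p = {p_value:.4f}")
```

A small p-value indicates that the observed coverage would be unlikely if the true coverage did not exceed the minimum.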
For more information on statistically sound association discovery, see the worked examples that follow and these research papers:
Webb, G.I. (2007). Discovering Significant Patterns. Machine Learning 68(1).
Webb, G.I. (2008). Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns. Machine Learning 71(2-3).
To illustrate holdout
evaluation for rules we run Magnum Opus on the tutorial.itl
data, selecting 50% of the data for the exploratory set and the remaining 50%
for the holdout set, using holdout evaluation, using the partialness and
improvement tests, searching by support, using no filtering and selecting only tomatoes and potatoes for the LHS and lettuce
and carrots for the RHS.
Command line: mocl item-list-file=tutorial.itl proportion=0.5 \
out-of-sample-holdout-evaluation \
test-partialness=yes test-improvement=yes \
search-mode=support filter=none \
lhs-available=tomatoes,potatoes \
rhs-available=lettuce,carrots
Interactive system. First run Magnum Opus. From the File
Menu select Import Data. The system will
display a dialog for selecting a file to open.
If necessary, navigate to the Example Files folder within the folder
into which you installed the software.
Select the file tutorial.itl. The system will now display the following
dialog box.

The
system recognizes from the itl file
extension that the file is an item list file.
As this is correct, click the Next > button to go to the next
screen.

This screen allows you to select
the delimiter character. As the default
is correct for this file, click Next
> to go to the next screen.

This screen allows you to select how much of the data should be loaded into the exploratory set. For this example we wish to load 50%, so change the Percentage box to 50.

Now click Next >.

The next screen allows you to select
whether holdout evaluation is to be performed.
If it is, you have the choice of either using the data not included in
the exploratory set (the default), or of loading the data from another file. As we wish to use the default, click Import Data. This takes us to the main screen.
We want to select the holdout
tests, so select Rule
Evaluation Holdout Settings from the Preferences menu. This leads to a dialog
that allows you to select the tests and significance level to be applied during
holdout evaluation. Select Improvement
over generalizations and Partial with
respect to specializations.

Then click OK to return to the main screen.
On the main screen select Search by Support and Filter out None. Then select potatoes and tomatoes for the Values
allowed on the LHS and carrots and lettuce for the Values allowed on the RHS.

Now
click GO to commence the search.
Output:
Magnum Opus - The leader in
association discovery technology.
Version 4.3
Copyright (c) 1999-2009 G.
I. Webb & Associates Pty Ltd.
Data file: Tutorial.itl [50% sample]
500 cases / 500 holdout
cases / 16 items
Search for rules
Search by support
Maximum number of attributes
on LHS = 4
Maximum number of rules =
1000
Minimum leverage = -1.0
Minimum leverage count =
-2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Minimum support = 0.0
Minimum support count = 0
Minimum lift = 0.0
Minimum strength = 0.0
Values allowed on LHS:
potatoes
tomatoes
Values allowed on RHS:
carrots
lettuce
Only 6 rules satisfy the
specified constraints.
The following 2 rules passed
holdout evaluation
tomatoes -> lettuce
[Coverage=0.244 (122);
Support=0.106 (53); Strength=0.434; Lift=1.96; Leverage=0.0518 (25.9)]
tomatoes -> carrots
[Coverage=0.244 (122); Support=0.080
(40); Strength=0.328; Lift=1.95; Leverage=0.0390 (19.5)]
The following 4 rules failed
holdout evaluation, adjusted critical value = 0.012500
potatoes -> lettuce
[Coverage=0.272 (136);
Support=0.076 (38); Strength=0.279; Lift=1.26; Leverage=0.0156 (7.8)]
Holdout coverage = 147,
holdout support = 37, holdout strength = 0.252
Fails positive correlation,
p = 0.101
Fails significant
improvement with respect to DEFAULT, p = 0.101
Fails partial test with respect
to tomatoes & potatoes, p = 0.952
potatoes -> carrots
[Coverage=0.272 (136);
Support=0.056 (28); Strength=0.206; Lift=1.23; Leverage=0.0103 (5.2)]
Holdout coverage = 147,
holdout support = 40, holdout strength = 0.272
Fails partial test with
respect to tomatoes & potatoes, p = 0.120
tomatoes & potatoes
-> lettuce
[Coverage=0.072 (36);
Support=0.040 (20); Strength=0.556; Lift=2.50; Leverage=0.0240 (12.0)]
Holdout coverage = 41,
holdout support = 23, holdout strength = 0.561
Fails significant improvement
with respect to tomatoes, p = 0.0172
tomatoes & potatoes
-> carrots
[Coverage=0.072 (36);
Support=0.028 (14); Strength=0.389; Lift=2.31; Leverage=0.0159 (8.0)]
Holdout coverage = 41,
holdout support = 19, holdout strength = 0.463
Fails significant improvement
with respect to tomatoes, p = 0.0166
The
rules that fail holdout evaluation are listed after those that pass. The rule tomatoes & potatoes
-> lettuce
illustrates the significant improvement test.
The rule tomatoes -> lettuce has strength 0.434. The 20 examples that provide the support for
the longer rule do not provide sufficient evidence that the strength of
association is truly higher than that of the shorter rule.
The rule potatoes -> lettuce illustrates the partialness test. The rule tomatoes & potatoes -> lettuce covers 36 of the 136 examples covered by the shorter rule. It also covers 20 of the 38 examples that contain both potatoes and lettuce. Once the 36 examples covered by the longer rule are removed, the remaining support is just 18 out of 100 examples. The resulting strength (0.180) is lower than the default strength for lettuce (0.222, its frequency in the exploratory data). In consequence, it appears that the increased frequency of lettuce in the context of potatoes is solely due to its increased frequency when both potatoes and tomatoes are present.
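The arithmetic behind the partialness example can be laid out explicitly. The counts below are the exploratory-set figures from the output above. The frequency of lettuce (111 of 500 exploratory cases) is taken from the itemset output later in this tutorial. The variable names are illustrative.

```python
# potatoes -> lettuce (the shorter rule)
short_coverage, short_support = 136, 38
# tomatoes & potatoes -> lettuce (the specialization)
long_coverage, long_support = 36, 20

# Remove the cases covered by the specialization from the shorter rule.
remaining_coverage = short_coverage - long_coverage   # potatoes without tomatoes
remaining_support = short_support - long_support      # ... that also contain lettuce
residual_strength = remaining_support / remaining_coverage

# Overall frequency of lettuce in the exploratory data.
lettuce_frequency = 111 / 500

print(f"residual strength = {residual_strength:.3f}")
print(f"lettuce frequency = {lettuce_frequency:.3f}")
```

Because the residual strength does not exceed the overall frequency of lettuce, potatoes without tomatoes show no sign of raising the frequency of lettuce.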
To illustrate holdout evaluation for itemsets we continue the previous example. As before, we use Magnum Opus on the tutorial.itl data, selecting 50% of the data for the exploratory set and the remaining 50% for the holdout set and using holdout evaluation. This time, however, we search for itemsets using the self-sufficient and improvement tests, searching by coverage, using no filtering and allowing only tomatoes, potatoes, lettuce and carrots in the itemsets.
Command line: mocl item-list-file=tutorial.itl proportion=0.5 \
out-of-sample-holdout-evaluation \
find-itemsets test-self-sufficient=yes \
test-improvement=yes search-mode=coverage \
filter=none \
items-available=tomatoes,potatoes,lettuce,carrots
Interactive system. Continuing from the previous example, select
ITEMSETS in the Search for box, select COVERAGE in the Search by box, and
select the items carrots, lettuce, tomatoes and potatoes for the Values
allowed in itemset.

Then select the Itemset Holdout Evaluation Settings… option from the Preferences menu. The default option, Improvement over generalizations, should already be selected. Click Self-sufficient to also select it.

Click OK to return to the main window. Now press GO to commence the search.
Output:
Magnum
Opus - The leader in association discovery technology.
Version 4.2
Copyright (c) 1999-2007 G.
I. Webb & Associates Pty Ltd.
Data file: Tutorial.itl [50% sample]
500 cases / 500 holdout
cases / 16 items
Mon Mar 15
Search for itemsets
Search by leverage
Maximum number of values in
an itemset = 4
Maximum number of itemsets = 100
Minimum leverage = -1.0
Minimum leverage count =
-2147483647
Minimum coverage = 0.0
Minimum coverage count = 1
Values allowed:
carrots
lettuce
potatoes
tomatoes
Only 16 itemsets
satisfy the specified constraints.
The following 9 itemsets passed holdout evaluation
lettuce & tomatoes
[Coverage=0.106 (53); Leverage=0.0518
(25.9)]
tomatoes & carrots
[Coverage=0.080 (40);
Leverage=0.0390 (19.5)]
lettuce & tomatoes &
carrots
[Coverage=0.036 (18);
Leverage=0.0182 (9.1)]
carrots & potatoes
[Coverage=0.056 (28);
Leverage=0.0103 (5.2)]
[Coverage=1.000 (500);
Leverage=0.0000 (0.0)]
potatoes
[Coverage=0.272 (136);
Leverage=0.0000 (0.0)]
tomatoes
[Coverage=0.244 (122);
Leverage=0.0000 (0.0)]
lettuce
[Coverage=0.222 (111);
Leverage=0.0000 (0.0)]
carrots
[Coverage=0.168 (84);
Leverage=0.0000 (0.0)]
The following 7 itemsets failed holdout evaluation, adjusted critical value
= 0.00313
lettuce & potatoes
[Coverage=0.076 (38);
Leverage=0.0156 (7.8)]
Holdout coverage = 37
Fails significant
improvement with respect to lettuce and potatoes, p = 0.101
lettuce & tomatoes &
potatoes
[Coverage=0.040 (20);
Leverage=0.0112 (5.6)]
Holdout coverage = 23
Fails significant
improvement with respect to lettuce & tomatoes and potatoes, p = 0.0498
tomatoes & carrots &
potatoes
[Coverage=0.028 (14);
Leverage=0.0062 (3.1)]
Holdout coverage = 19
Fails significant
improvement with respect to tomatoes & carrots and potatoes, p = 0.0381
lettuce & tomatoes &
carrots & potatoes
[Coverage=0.016 (8);
Leverage=0.0062 (3.1)]
Holdout coverage = 11
Fails significant improvement
with respect to lettuce & tomatoes & carrots and potatoes, p = 0.0204
tomatoes & potatoes
[Coverage=0.072 (36);
Leverage=0.0056 (2.8)]
Holdout coverage = 41
Fails significant
improvement with respect to tomatoes and potatoes, p = 0.580
lettuce & carrots
[Coverage=0.042 (21);
Leverage=0.0047 (2.4)]
Holdout coverage = 24
Fails significant
improvement with respect to lettuce and carrots, p = 0.118
Fails test for
self-sufficiency, p = 0.965
lettuce & carrots &
potatoes
[Coverage=0.016 (8); Leverage=0.0032
(1.6)]
Holdout coverage = 13
Fails significant
improvement with respect to lettuce and carrots & potatoes, p = 0.0570
The itemset tomatoes & potatoes provides a good example of the improvement test. Tomatoes occurs in 0.282 of all holdout records and potatoes occurs in 0.294 of all holdout records. If these items were independent of each other then tomatoes & potatoes would be expected to occur in 0.083 (41.45) of all holdout records. In fact they occur in 41 holdout records, and hence do not indicate any improvement.
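The independence calculation above, and a one-sided Fisher exact test on the same counts, can be sketched as follows. The holdout counts of 141 for tomatoes and 147 for potatoes are derived from the proportions 0.282 and 0.294 of 500 holdout records. The function is an illustrative implementation of the standard hypergeometric tail and is not Magnum Opus code. Its result can be compared with the p-value reported for tomatoes & potatoes in the output above.

```python
from math import comb


def fisher_exact_upper_tail(n_cases, a_count, b_count, both_count):
    """One-sided Fisher exact p-value for two items A and B.

    Returns the probability of seeing at least `both_count` records
    containing both items if A and B were independent, given their
    individual counts. Illustrative helper only.
    """
    total = comb(n_cases, b_count)
    upper = min(a_count, b_count)
    tail = sum(
        comb(a_count, k) * comb(n_cases - a_count, b_count - k)
        for k in range(both_count, upper + 1)
    )
    return tail / total


n_holdout = 500
tomatoes, potatoes, both = 141, 147, 41

expected = tomatoes * potatoes / n_holdout   # expected count under independence
p_value = fisher_exact_upper_tail(n_holdout, tomatoes, potatoes, both)

print(f"expected count = {expected:.2f}, observed count = {both}")
print(f"p = {p_value:.3f}")
```

Since the observed count is slightly below the expected count, the p-value is large and the itemset fails the improvement test.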
The itemset lettuce
& carrots illustrates the
self-sufficiency test. This itemset appears in 24 holdout records. Its superset, lettuce & tomatoes & carrots, appears in 21 of these holdout records, accounting for all of the
improvement in the shorter itemset.
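The outputs above report adjusted critical values of 0.012500 for the rules and 0.00313 for the itemsets. This tutorial does not state how the adjustment is calculated. Purely as an observation, dividing the default significance level of 0.05 by 4 (the number of rules that failed) and by 16 (the number of itemsets considered) reproduces the two reported figures. The sketch below shows that arithmetic. Treat the choice of divisors as an assumption rather than as a description of the algorithm.

```python
significance_level = 0.05   # default significance level

# Assumed divisors, chosen only because they reproduce the reported values.
rules_divisor = 4       # rules that failed holdout evaluation
itemsets_divisor = 16   # itemsets that satisfied the constraints

rules_critical_value = significance_level / rules_divisor
itemsets_critical_value = significance_level / itemsets_divisor

print(f"rules:    {rules_critical_value:.6f}")
print(f"itemsets: {itemsets_critical_value:.3g}")
```

Whatever the exact procedure, the effect is the same: the more associations are tested, the smaller the p-value each one must achieve to pass.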
Magnum Opus provides tremendous flexibility to the
user. Many forms of analysis can be
requested, and Magnum Opus always provides exact results. However, some analyses are intrinsically
difficult, and hence require large amounts of computation to complete. Unfortunately, it is not possible to
accurately predict in advance which analyses will take extreme lengths of time
to complete and which will complete quickly.
When a computation is taking a long time it is often helpful to view the best results discovered so far. This allows you to assess both whether you are performing the correct analysis and whether the results already obtained satisfy the analytic requirement. A set of intermediate results created while computation is in progress is called a snapshot. The following process is used to create a snapshot.
Command line: While the system is running, send the SIGUSR1 signal to the process. The exact command required may vary depending upon the precise operating system and command shell used. The following provides an example under bash on Linux.
% mocl names-file=tutorial.nam data-file=tutorial.data > tutorial.out &
[1] 3342
% kill -SIGUSR1 3342
In this example the process has
been run in the background and has been assigned the process ID 3342.
Interactive system. When the system is in the process of a
search the screen will appear as follows:

Simply click on the blue camera
icon. A dialog will appear that allows
you to specify a file into which the snapshot will be saved.
In general the following actions will decrease compute time.
- Increase the minimum leverage.
- Increase the minimum coverage.
- Increase the minimum support.
- Decrease the maximum LHS length.
- Decrease the maximum number of rules to be found.
- Decrease the number of values allowed on the LHS and the RHS of rules.
Note, increasing the minimum lift or strength will only decrease compute time if use m-estimate is checked or the minimum coverage or support is set to a high value. Increasing minimum lift or strength when minimum coverage and support are both low can substantially increase compute time.
Search by lift and search by strength are both substantially faster when the m-estimate is used.
Magnum Opus is a powerful and flexible tool.
The default settings are sufficient for many analytic tasks. However, advanced users can use the
sophisticated controls to perform a wide variety of complex analyses. We recommend that new users start by using
the default settings and only start using the other controls as they become
familiar with the system.