Slide 1 – Weka Tutorial 5: Association
© 2009 – Mark Polczynski, Technology Forge, www.technologyforge.net, Version 0.1

Welcome to the Technology Forge Weka tutorial number 5. In previous Weka tutorials, we saw examples of two basic data mining and machine learning functions: classification and clustering. In this tutorial, we will learn how to perform a third basic function called association. We practiced classification and clustering on one of the most well-known data mining and machine learning datasets, Fisher's iris dataset. For our examination of association, we will introduce another dataset which is commonly used in data mining texts. But before diving into this dataset, let's examine the basic concept behind association.

Slide 2 – Association: Market Basket Analysis
Got beer? Got diapers? What things tend to "go together" in your market basket?

Association is often linked to so-called "market basket" analysis. Speaking literally, market basket analysis consists of examining the items in the baskets of shoppers checking out at a market to see what types of items "go together". In other words, when people make a trip to the store, what kinds of items do they tend to buy during the same shopping trip? This concept is amusingly illustrated by an apocryphal story about a convenience store.

Slide 3 – Example of Market Basket Analysis
People who bought beer also usually bought diapers. People who bought diapers also usually bought beer.

Supposedly, some shop owner did a market basket analysis on people shopping at his convenience store. He found that people buying diapers often also bought beer. Or was it people buying beer who often bought diapers? Anyway, he decided to put the beer next to the diapers in his store, and the sales of each skyrocketed. Or was it that he put the diapers and beer at opposite ends of the store, and the sales of all the products in between skyrocketed? Actually, it really doesn't matter, because nobody has been able to substantiate this story, so it probably never happened at all. Nevertheless, the story nicely illustrates the market basket concept of things "going together", which is the basis of the data mining association function.

Slide 4 – What can you use associations for?
Cash register checkout data: items in shopping cart, time of day, day of week, season of year, method of payment, location of store, etc.
Fact-based marketing strategies: floor plans, discounts, coupons, etc.

Although the beer-and-diapers story may not be true, data mining association can be a very powerful tool in applications like market basket analysis and other retail sales applications. For example, for supermarkets, often the only source of data available on customers is cash register checkout data, which does tell what items go together in a shopping cart, but can also show what items "go together" with certain times of the day, days of the week, seasons of the year, credit vs. cash vs. check payment, geographical locations of stores, and so on. Discovering associations among these attributes can lead to fact-based marketing strategies for things like store floor plans, special discounts, coupon offerings, and so forth.
Of course, you can imagine many other applications besides retail sales that benefit from association analysis.

Slide 5 – To play, or not to play, that is the question!
If the weather outside is...

Data mining literature seems to collect apocryphal examples like the beer/diaper story. Another common story is the "to play or not to play" story relating weather conditions to whether or not some unidentified person did or did not participate in some unnamed activity. As mysterious as this dataset and its origins are, we will nevertheless use it to demonstrate association, since it is commonly found in data mining texts.

Slide 6 – The TPONTPNom.xls dataset

Here is the "to-play-or-not-to-play" dataset as it is often found in the literature. We have 4 attributes describing weather conditions and a 5th attribute noting the player's decision on whether or not to play. We have 14 instances, or samples, of decisions the player has made under varying weather conditions. If you have not done so yet, please go to the Weka tutorials web site and download the TPONTPNom.xls file, which contains this dataset.

Slide 7 – Attribute values and counts

Here we see the allowable values for each attribute, and how many times each value appears in the dataset. As an aside, note that this dataset contains only nominal attributes, that is, names rather than numerical values, while our iris datasets contained numerical values for the flower dimensions (input attributes), and names for the iris species (classes). But as we saw in Weka tutorial #2, we can always convert numerical attributes into nominal attributes using a discretization filter.

Slide 8 – Classification
Input attributes and class. Goal: predict whether the person will play, given weather conditions.

Recalling our investigation of classification, for this dataset we might be tempted to assign the four attributes related to weather as input attributes, and the attribute showing whether this person did or did not play as the output attribute, or class. Then we could build a decision tree that would predict whether or not the person will play in the future, based on weather conditions. But as we shall see shortly, for association we do not take class designations into consideration.

Slide 9 – For association, we look for things that "go together"
Outlook = sunny AND Temperature = hot go together 2 times; Outlook = sunny AND Temperature = cool go together only once.

Recalling our beer-and-diapers anecdote, here we will be looking for things that "go together", in other words, things that tend to occur at the same time. So, for example, we see that Outlook = sunny AND Temperature = hot go together twice in our data, while sunny and cool occur together only once. The ultimate goal of association is to find association rules of the form: If Outlook = sunny, then Temperature = hot. Right now you may be wondering whether the rule "If Outlook = sunny, then Temperature = hot" is different from the rule "If Temperature = hot, then Outlook = sunny". We will see the answer to this shortly. For association, attributes are often called "items", referring to items found in a shopper's market basket. As we said, for association there is no distinction between input attributes and output attributes, or classes, as there is for classification; there are only items. In this sense, association resembles clustering, which also does not distinguish between attribute types.
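If you want to reproduce the attribute values and counts from slide 7 yourself, here is a minimal sketch in Java (Weka's native language) that loads the dataset and tallies how often each nominal value appears. It assumes a Weka 3.x-era classpath and that the ARFF version of the dataset, TPONTPNom.arff (which we load into the Explorer later in this tutorial), sits in the working directory; the class name and file path are only illustrative.

    import weka.core.Attribute;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AttributeValueCounts {
        public static void main(String[] args) throws Exception {
            // Load the to-play-or-not-to-play data (file name is an assumption;
            // point this at wherever you saved the ARFF version of the dataset).
            Instances data = DataSource.read("TPONTPNom.arff");

            // For every attribute, count how many times each nominal value occurs,
            // reproducing the "attribute values and counts" view on slide 7.
            for (int a = 0; a < data.numAttributes(); a++) {
                Attribute att = data.attribute(a);
                int[] counts = new int[att.numValues()];
                for (int i = 0; i < data.numInstances(); i++) {
                    counts[(int) data.instance(i).value(a)]++;
                }
                StringBuilder line = new StringBuilder(att.name() + ":");
                for (int v = 0; v < att.numValues(); v++) {
                    line.append(" ").append(att.value(v)).append("=").append(counts[v]);
                }
                System.out.println(line);
            }
        }
    }

Running it should print, for example, 9 occurrences of Play = yes and 5 of Play = no, matching what the Explorer shows.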
Slide 10 – Two-item sets of attributes that can go together
The five attributes (nodes): 1 Outlook (sunny, overcast, rainy); 2 Temperature (hot, mild, cool); 3 Humidity (high, normal); 4 Windy (true, false); 5 Play (yes, no).
The 10 node-to-node links: 1-2, 2-3, 3-4, 4-5, 5-1, 1-3, 2-4, 3-5, 4-1, 5-2.

One thing we can see right away about association is that there are an awful lot of combinations of things that can possibly go together. Here is a diagram that shows the pairs, or two-item sets, of the 5 different attributes in this example that can go together. As can be seen, there are 10 such pairs. Remember that we don't have input attributes vs. classes here, so we treat the Play attribute just like the weather-related attributes.

Slide 11 – Combinations of Outlook and Temperature that can go together
sunny/hot, sunny/mild, sunny/cool, overcast/hot, overcast/mild, overcast/cool, rainy/hot, rainy/mild, rainy/cool: 3 x 3 = 9.

Now let's look at the possible combinations of just the Outlook and Temperature two-item set of attributes. We see that there are 9 possible pairs of attribute values that can go together.

Slide 12 – All possible two-item sets
Number of values per attribute: Outlook 3, Temperature 3, Humidity 2, Windy 2, Play 2. Combinations for each node-to-node link: 1-2: 3 x 3 = 9; 2-3: 3 x 2 = 6; 3-4: 2 x 2 = 4; 4-5: 2 x 2 = 4; 5-1: 2 x 3 = 6; 1-3: 3 x 2 = 6; 2-4: 3 x 2 = 6; 3-5: 2 x 2 = 4; 4-1: 2 x 3 = 6; 5-2: 2 x 3 = 6. Total: 57.

On the previous slide, we saw the 9 combinations for just Outlook and Temperature, which are nodes 1 and 2 on this slide. Here, we see the total possible number of two-item value combinations that can go together for this example, which is 57. The number next to each node in the graph is the number of possible values for that attribute. Thus, there are 57 possible two-item combinations from which association rules can be formed.

Slide 13 – Occurrences of each pair

Here are the 57 potential combinations of two-item sets. Also shown is the number of occurrences in the dataset of each potential combination. We see, for example, that the combination Humidity = normal AND Play = yes goes together six times in the dataset. We also see that Humidity = normal AND Play = no occurs together once. Note that since the matrix is symmetrical, we will ignore the lower-left section for the next few slides.

Slide 14
Temperature = cool AND Humidity = normal occurs more often than Temperature = cool AND Humidity = high. Humidity = normal AND Play = yes occurs more often than Humidity = normal AND Play = no.

Now, since Humidity = normal AND Play = yes occurs 6 times, but Humidity = normal AND Play = no occurs only once, the first combination must be more important, simply because it occurs more times. Note also that Temperature = cool AND Humidity = high never go together in the dataset, but Temperature = cool AND Humidity = normal occurs 4 times, so it seems that the second combination must be more important than the first. The logical question at this point is this: Is Temperature = cool AND Humidity = normal more or less important than Humidity = normal AND Play = yes?

Slide 15
"Support" is the number of instances for which a particular rule is correct. Temperature = cool AND Humidity = normal occurs 4 times in the dataset, so the support is 4.

We define support as the number of instances in the dataset where a particular combination of items occurs. The table shows us that, for example, the support for Temperature = cool AND Humidity = normal is 4, since this combination shows up 4 times in the dataset. Note that support is sometimes referred to as coverage.
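Before letting Weka do the counting, it can help to see how little machinery support actually requires. The sketch below, under the same assumptions as the earlier snippet (Weka 3.x Java API, TPONTPNom.arff in the working directory), enumerates every pair of attribute values in every instance and tallies how often each two-item set occurs, which is exactly its support.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TwoItemSetSupport {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("TPONTPNom.arff"); // file name is an assumption

            // Tally every attribute-value pair that occurs together in an instance.
            Map<String, Integer> support = new LinkedHashMap<>();
            for (int i = 0; i < data.numInstances(); i++) {
                Instance inst = data.instance(i);
                for (int a = 0; a < data.numAttributes(); a++) {
                    for (int b = a + 1; b < data.numAttributes(); b++) {
                        String itemSet = data.attribute(a).name() + "=" + inst.stringValue(a)
                                       + " AND " + data.attribute(b).name() + "=" + inst.stringValue(b);
                        support.merge(itemSet, 1, Integer::sum);
                    }
                }
            }

            // Print each two-item set and its support; for this dataset,
            // Humidity=normal AND Play=yes should come out as 6.
            support.forEach((itemSet, count) -> System.out.println(itemSet + " : " + count));
        }
    }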
Slide 16
Here are the 47 two-item sets that have support >= 2. The combinations with the highest support (coverage) are: Humidity = normal AND Play = yes; Windy = false AND Play = yes.

Based on support alone, it appears that the combinations Humidity = normal AND Play = yes and Windy = false AND Play = yes are the two most important combinations in our dataset, since these are the combinations that occur the most. Is it possible to say which of these is more important?

Slide 17
The combination Humidity = normal AND Play = yes occurs 6 times. The number of occurrences of Humidity = normal is 7. The "confidence" of the association rule "If Humidity = normal, then Play = yes" is therefore 6/7, or 86%.

As we just noted, the combination Humidity = normal AND Play = yes has a support of 6. But note that Humidity = normal occurs a total of 7 times in our dataset. This means that for the 7 times that Humidity = normal occurs, Play = yes occurs 6 times, and Play = no occurs only once. Therefore, the association rule "If Humidity = normal, then Play = yes" is true 6 times out of 7, or 86% of the time. We say that our confidence in this rule is 86%. To put this another way, when Humidity = normal, Play had 7 chances to be yes, but was actually yes only 6 times. Confidence is also termed accuracy.

Slide 18
The combination Humidity = normal AND Play = yes occurs 6 times. The number of occurrences of Play = yes is 9. The confidence of the rule "If Play = yes, then Humidity = normal" is therefore 6/9, or 67%.

Now let's look at the rule "If Play = yes, then Humidity = normal", which has the If and Then parts of the rule on the previous slide reversed. We see that this rule has a confidence of 6/9, or 67%. Comparing this to the 86% confidence of the rule "If Humidity = normal, then Play = yes", it seems logical that even though both rules have the same support value of 6, the rule with 86% confidence is more important than the rule with 67% confidence.

Slide 19
The two-item set of Humidity and Play produces two different rules: "If Humidity = normal, then Play = yes" and "If Play = yes, then Humidity = normal".

Here we see calculations of confidence for all of the two-item association rules, with the rules for the combination of Humidity = normal and Play = yes that we just analyzed highlighted for reference. It is important to note that the single two-item set of Humidity and Play is capable of generating two different rules. Recall that earlier we stated that there is a difference between the rule "If Humidity = normal, then Play = yes" and the rule "If Play = yes, then Humidity = normal". We can now see that the difference between these two rules lies in their confidence.
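As a quick check on the arithmetic above, here is a small sketch that computes the confidence of a rule in both directions directly from the instance counts. The attribute and value names (Humidity, normal, Play, yes) are assumed to match those in the ARFF file; adjust them if your copy uses different spellings, and as before the file path is only illustrative.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RuleConfidence {
        // Number of instances where a single attribute takes a given value.
        static int count(Instances data, String att, String val) {
            int n = 0;
            for (int i = 0; i < data.numInstances(); i++) {
                if (data.instance(i).stringValue(data.attribute(att)).equals(val)) n++;
            }
            return n;
        }

        // Number of instances where both conditions hold (the support of the two-item set).
        static int count(Instances data, String a1, String v1, String a2, String v2) {
            int n = 0;
            for (int i = 0; i < data.numInstances(); i++) {
                if (data.instance(i).stringValue(data.attribute(a1)).equals(v1)
                        && data.instance(i).stringValue(data.attribute(a2)).equals(v2)) n++;
            }
            return n;
        }

        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("TPONTPNom.arff"); // file name is an assumption

            int both = count(data, "Humidity", "normal", "Play", "yes");
            // Confidence = support of the combination / occurrences of the "If" part.
            double forward  = (double) both / count(data, "Humidity", "normal"); // 6/7 = 0.86
            double backward = (double) both / count(data, "Play", "yes");        // 6/9 = 0.67
            System.out.println("If Humidity = normal, then Play = yes : " + forward);
            System.out.println("If Play = yes, then Humidity = normal : " + backward);
        }
    }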
Slide 20
Two-item set coverage and two-item set accuracy. For thresholds set at accuracy = 100% and coverage >= 2, the two-item set association rules are: If Temperature = cool, then Humidity = normal; If Outlook = overcast, then Play = yes.

To answer the question as to which of all possible association rules best characterize a dataset, we need to examine both the support and the confidence of each rule. To find the most important rules, we can specify some minimum value of each measure, and then find the rules that meet these criteria. If, for example, we require a confidence of 100% and a minimum support of 2, we see that there are 2 association rules that come out of the two-item sets:
If Temperature = cool, then Humidity = normal
If Outlook = overcast, then Play = yes

Slide 21 – Optimum thresholds?
Dataset -> association rule generator (with support and confidence thresholds) -> association rules. The predictive apriori association algorithm:
•Optimally combines support and confidence into predictive accuracy.
•Requires only that the user specify the number of rules to generate.

So, support and confidence are the quantities we must calculate to determine which of all possible association rules reflect the most important characteristics of our dataset. The problem is this: Where should we set the support and confidence thresholds to get only the important association rules? As it turns out, this is not an easy question to answer. Fortunately, Weka contains an association algorithm that craftily works around this problem. The predictive apriori association algorithm optimally combines support and confidence to calculate a value termed predictive accuracy. The user need only specify how many rules they would like the algorithm to generate, and the algorithm takes care of optimizing support and confidence to find the best rules. Let's see how to use the predictive apriori association algorithm in Weka.

Slide 22 – Load the TPONTPNom.arff dataset

Go ahead and start Weka, and then open the TPONTPNom.arff dataset. Note that since Play is the last attribute in the dataset, Weka assumes it to be the class attribute. If you select the Play attribute for visualization, you see the 5 Play = no instances and the 9 Play = yes instances.

Slide 23 – Show Outlook relative to Play

If you switch to displaying, for example, the Outlook attribute, you can see how the 14 instances are distributed across the two Play values. Recall that Play is not analyzed as a class attribute when finding association rules, so there is nothing special about displaying the Outlook attribute relative to the Play attribute.

Slide 24 – Choose the PredictiveApriori algorithm

As we mentioned, for this tutorial we will use Weka's PredictiveApriori associator, so go ahead and fire this up. Left-click on PredictiveApriori to open up the GenericObjectEditor. Let's set the number of rules generated to 100, the algorithm default. Now click Start to generate the rules.

Slide 25 – Best 100 rules for the TPONTP dataset

Here we see the first 25 of the top 100 association rules generated by PredictiveApriori, with the predictive accuracy of each. The top three rules are:
If Outlook = overcast, then Play = yes
If Temperature = cool, then Humidity = normal
If Humidity = normal and Windy = false, then Play = yes
Each has a predictive accuracy of 0.95583. The first two are the two-item rules that we found earlier in this tutorial.
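If you prefer to work outside the Explorer GUI, something like the following should reproduce the run programmatically. It is a sketch against the Weka 3.6-era Java API, where PredictiveApriori lives in weka.associations and exposes a numRules property; the class name and file path are again assumptions.

    import weka.associations.PredictiveApriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunPredictiveApriori {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("TPONTPNom.arff"); // file name is an assumption

            PredictiveApriori pa = new PredictiveApriori();
            pa.setNumRules(100);        // ask for the best 100 rules, as in the tutorial
            pa.buildAssociations(data); // mine the rules

            // toString() lists the ranked rules together with their predictive accuracy.
            System.out.println(pa);
        }
    }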
Slide 26
The output shows the number of occurrences of Outlook = overcast and the support for Outlook = overcast AND Play = yes. Confidence of "If Outlook = overcast, then Play = yes" is 4/4 = 100%; confidence of "If Humidity = normal, then Play = yes" is 6/7 = 86%; predictive accuracy = 95.6%.

Here is the information provided by the PredictiveApriori output. This shows that while the confidence for rule 1 is 100%, the predictive accuracy calculated by PredictiveApriori is 95.6%. Recall that predictive accuracy is calculated using both confidence and support.

Slide 27

Now let's look at the third-best rule: If Humidity = normal AND Windy = false, then Play = yes. This rule comes from the three-item set of Humidity, Windy, and Play, with two items in the If part of the rule, termed the antecedent, and one item in the Then part, termed the consequent. Rule 8 comes from the four-item set of Outlook, Temperature, Humidity, and Play, with two antecedent terms and two consequent terms. From this we can get an idea of how a vast number of association rules can be generated by the various attributes in the dataset.

Slide 28 – Is there a linkage between classification and association?
Classification goal: predict whether the person will play, given weather conditions. Association rule: If Humidity = normal AND Windy = false, then Play = yes.

Before concluding this tutorial, let us return to an earlier view of the dataset we have been analyzing. While we have emphasized that association treats all attributes in a dataset like items in a market basket, it is clear that the Play item in this dataset represents a class, and that the other attributes can be used to build a classifier that predicts whether or not the player will play based on weather conditions. Now, an association rule like "If Humidity = normal AND Windy = false, then Play = yes" surely sounds like the type of prediction that a classifier would produce. Does this mean that there is some linkage between classification and association? The answer is yes, and the PredictiveApriori algorithm as implemented in Weka takes advantage of this linkage, as we shall now see.

Slide 29
Mine only the association rules that have the designated class attribute as the consequent. Keep the number of rules at 100.

Class association rules are association rules where the Then portion of the rule (the consequent) is restricted to being the class attribute only. In other words, only association rules with the class attribute as the consequent are mined. Since Play is the natural class attribute for this dataset, this means that mining just the class association rules will result in rules that all have Play = yes or Play = no as the consequent. In Weka, open up the GenericObjectEditor for PredictiveApriori and set car (which stands for class association rules) to True. Keep the number of rules mined at 100.

Slide 30

Here are the thirty class association rules found by Weka when using the car option for PredictiveApriori. Also shown is a classification tree created by J48 for this same dataset. Notice the decision tree branch created by J48 that says: If Outlook = overcast, then Play = yes, which is, in fact, the first rule found by PredictiveApriori.

Slide 31

As shown here, J48 also found PredictiveApriori rule 3. You can also find PredictiveApriori rules 4, 6, and 7 in the decision tree. Interestingly, J48 did not find PredictiveApriori rule 2, but recall that the dataset has only 14 samples, so some discrepancies can be expected.
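The same comparison can be scripted. The sketch below, under the same Weka 3.6-era API assumptions as before, mines the class association rules (car set to true, so only rules with Play in the consequent are kept) and then builds a J48 tree on the same data so the two outputs can be compared side by side; the class name, file path, and the setCar setter name mirror the GUI option but are assumptions here.

    import weka.associations.PredictiveApriori;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassAssociationRulesVsJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("TPONTPNom.arff"); // file name is an assumption
            data.setClassIndex(data.numAttributes() - 1);       // Play is the last attribute

            // Mine only the rules whose consequent is the class attribute (car = true).
            PredictiveApriori pa = new PredictiveApriori();
            pa.setCar(true);
            pa.setNumRules(100);
            pa.buildAssociations(data);
            System.out.println(pa);

            // Build the J48 decision tree on the same data for comparison.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }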
Slide 32 – What's the difference between classification and association?

If both classifier and association algorithms can do classification, then what exactly is the difference between these two data mining functions? From a simplified perspective, we can mention three differences:
•In one sense, association is like classification, except that instead of considering just one attribute as the class attribute, association considers every attribute and every combination of attributes as a class, and then performs a form of classification on every combination of attributes.
•Unlike classification, association does not necessarily consider all attributes when attempting to create rules, but operates on subset combinations of attributes; in this case, two-item, three-item, and four-item sets as well as five-item sets.
•Association only works on nominal attributes, not numerical attributes, although numerical attributes can be discretized.

Slide 33 – Topic of the next Weka tutorial
The Weka Experimenter Environment supports efficient automation of algorithm testing.

This concludes our introduction to association. Throughout our introductions to classification, clustering, and association, we have seen different algorithms for each function, and different ways to configure each algorithm. The best algorithm and configuration combination depends heavily on the nature of the particular dataset being analyzed. From this, we can conclude that choosing and optimizing a tool for each new dataset can be a large, tedious, and error-prone process. Fortunately, the Weka Experimenter Environment, the topic of our next tutorial, is designed to improve the efficiency of testing data mining tools and configurations.

Slide 34 – Weka Documentation

As always, these Technology Forge Weka tutorials are intended to augment, not replace, the Weka documentation available from the Weka developers. The primary reference for this and other Technology Forge Weka tutorials is Witten and Frank's book, Data Mining: Practical Machine Learning Tools and Techniques.

Slide 35 – Contact the Author
Mark Polczynski, PhD, The Technology Forge, mhp.techforge@gmail.com

We need your feedback to continuously improve our tutorials. Please direct your comments and suggestions to the author of this tutorial. Thank you for your attention.