Rare Association Rule Mining Part I. An Introduction Based on •PAKDD’04 Tutorial Data Mining for Analysis of Rare Events: A Case Study in Security, Financial and Medical Application, Aleksandar Lazarevic, Jaideep Srivastava, Vipin Kumar •Rare Association Rule Mining: An Overview, Yun Sing Koh & Nathan Rountree (in the book) Mining rare events •outliers •anomalies •exceptions •rare events (in temporal data) • •in other kind of data, in data where the structure is important, e.g. spatial, spatiotemporal, graph Motivation •Despite the enormous amount of data, particular events of interest are still quite rare •Rare events are events that occur very infrequently, i.e. their frequency ranges from 0.1% to less than 10% •However, when they do occur, their consequences can be quite dramatic and quite often in a negative sense • • (PAKDD’04 Tutoria) NeedleInAHaystack • • • • • • • Mining needle in a haystack Examples •Network intrusion detection number of intrusions on the network is typically a very small fraction of the total network traffic •Credit card fraud detection millions of regular transactions are stored, while only a very small percentage corresponds to fraud •Medical diagnostics when classifying the pixels in mammogram images, cancerous pixels represent only a very small fraction of the entire image Examples •Insurance Risk Modeling Claims are rare but very costly •Web mining Less than 3% of all people visiting Amazon.com make a purchase •Targeted Marketing Response is typically rare but can be profitable •Churn Analysis Churn is typically rare but quite costly •Hardware Fault Detection Faults are rare but very costly •Airline No-show Prediction Disease is typically rare but can be deadly Churn analysis is the calculation of the rate of attrition in the customer base of any company. It involves identifying those consumers who are most likely to discontinue using your service or product. Airlines routinely overbook flights based on the expectation that some fraction of booked passengers will not show for each flight. Accurate forecasts of the expected number of noshows for each flight can increase airline revenue Limitations of standard techniques •While most normal events are similar to each other, rare events may be quite different from one another regular credit transaction are fairly standard, while fraudulent ones vary from the standard ones in many different ways •Metrics: Accuracy is not appropriate for evaluating methods for rare event detection precision, recall, •On many applications data keeps arriving in an ongoing stream Example: network traffic data set with 99.9% of normal data and 0.1% of intrusions ROC Curve is a trade-off between detection rate and false alarm rate Confusion matrix • Predicted class • NC C •Actual NC TN FP normal class – NC •class C FN TP rare class – C • •Recall (R) = TP/(TP + FN) •Precision (P) = TP/(TP + FP) •F – measure = 2*R*P/(R+P) AUC • AUC-AreaUnderROCCurve Association rule mining • • Apriori algorithm • Rare association rule mining •why Apriori etc. are not appropriate •approaches –a variable support thershold –mining without support threshold –constraint-based mining –structure-based mining Emerging patterns* •For 2 classes C1 and C2 and support si(X) of itemset X in class Ci, growth rate is defined as s2(X)/s1(X). Given a threshold p >1, itemset X is p-emerging pattern (EP) if growth rate ≥ p •Find EPs in rare class •Find values that have the highest growth rate from the major class to the rare class •Replace different attribute values in the original rare class EPs with the highest growth rate values •New generated EPs have a strong power to discriminate rare class from the major class • •* H. Alhammady, K. Rao, The Application of Emerging Patterns for Improving the Quality of Rare-class Classification, PAKDD 2004, Springer LNAI 3056.