The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 3: An Introduction To Data Mining (I) Johannes Gehrke [email protected] http://www.cs.cornell.edu/johannes Lectures Three and Four Data preprocessing Multidimensional data analysis Data mining Association rules Classification trees Clustering Why Data Preprocessing? Quality decisions come from quality data. Problems with real life data:
Data needs to be integrated from different sources Missing values Noisy and inconsistent values Data is not at the right level of aggregation Recall: The Relational Data Model A relational database is a set of relations A relation has two components: 1. The relation instance. Basically a table, with rows (also: records, tuples) and columns (also: fields, attributes). Number of records in the relation instance: Cardinality. 2. The relation schema. Specifies name of the relation, plus name and type of each column.
Turing award (Nobel price in CS) for Codd in 1981 for his work on the relational model Example: Customer Relation Relation schema: Customers (cid: integer, name: string, byear: integer, state: string) Relation instance: cid 1 2 3 name Jones Smith Smith
byear 1960 1974 1950 state NY CA NY Data Integration Integrate data from multiple sources into a common format for data mining. Note: A good data warehouse has Data Warehouse already taken care of thisServer step. OLTP
DBMSs Other Data Sources Extract, clean, transform, aggregate, load, update Data Marts Data Integration (Contd.) Problem: Heterogeneous schema integration Different attribute names cid 1 2 3 name Jones Smith
Smith byear 1960 1974 1950 Customer-ID 1 2 3 state NY CA NY Different units: Sales in $, sales in Yen, sales in DM Data Integration (Contd.)
Problem: Heterogeneous schema integration Different scales: Sales in dollars versus sales in pennies Derived attributes: Annual salary versus monthly salary cid 1 2 3 monthlySalary 5000 2400 3000 cid 6 7 8
Salary 50,000 100,000 40,000 Data Integration (Contd.) Problem: Inconsistency due to redundancy Customer with customer-id 150 has three children in relation1 and four children in relation2 cid 1 numChildren 3 cid 1 numChildren 4
Computation of annual salary from monthly salary in relation1 does not match annualsalary attribute in relation2 cid 1 2 monthlySalary 5000 6000 cid 1 2 Salary 60,000 80,000 Missing Values Often the values of some attributes in a record are not known.
Reasons for missing values: Attribute does not apply (e.g., maiden name) Inconsistency with other recorded data Equipment malfunction Human errors Attribute introduced recently (e.g., email address) Missing Values: Approaches Ignore the record Complete the missing value: Manual completion: Tedious and likely to be infeasible Fill in a global constant, e.g., NULL, unknown Use the attribute mean or mode Construct a data mining model that predicts the missing value Noisy Data
Examples: Faulty data collection instruments Data entry problems, misspellings Data transmission problems Technology limitation Inconsistency in naming conventions Duplicate records with different values for a common field Noisy Data: Remove Outliers Noisy Data: Smoothing y Y1 Y1 X1 x
Noisy Data: Normalization Scale data to fall within a small, specified range Leave out extreme order statistics Min-max normalization Z-score normalization Normalization by decimal scaling Data Reduction Problem: Data might not be at the right scale for analysis. Example: Individual phone calls versus monthly phone call usage Complex data mining tasks might run a very long time. Example: Multi-terabyte data warehouses One disk drive: About 20MB/s Data Reduction: Attribute Selection
Select the relevant attributes for the data mining task If there are k attributes, there are 2k-1 different subsets Example: {salary,children,byear}, {salary,children}, {salary,byear}, {children,byear}, {salary}, {children}, {byear} Choice of the right subset depends on: Data mining task Underlying probability distribution Attribute Selection (Contd.) How to choose relevant attributes: Forward selection: Select greedily one attribute at a time Backward elimination: Start with all attributes, eliminate irrelevant attributes Combination of forward selection and backward elimination
Data Reduction: Parametric Models Main idea: Fit a parametric model to the data (e.g., multivariate normal distribution) Store the model parameters, discard the data (except for outliers) Parametric Models: Example Instead of storing (x,y) pairs, store only the x-value. Then recompute the y-value using y = ax + b y x Data Reduction: Sampling
Choose a representative subset of the data Simple random sampling may have very poor performance in the presence of skew Data Reduction: Sampling (Contd.) Stratified sampling: Biased sampling Example: Keep population group ratios Example: Keep minority population group count Data Reduction: Histograms Divide data into buckets and store average (sum) for each bucket Can be constructed optimally for one attribute using dynamic programming Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,
7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1-2,12,16), (3-6,8,36), (7-9,6,48), (10-12,6,66) Histograms (Contd.) Equal-width histogram Divides the domain of an attribute into k intervals of equal size Interval width = (Max Min)/k Computationally easy Problems with data skew and outliers Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6, 7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1-3,14,22), (4-6,6,30), (7-9,6,48), (10-12,6,66) Histograms (Contd.) Equal-depth histogram Divides the domain of an attribute into k intervals,
each containing the same number of records Variable interval width Computationally easy Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6, 7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1,8,8), (2-4,8,22), (5-8,8,52), (9-12,8,84) Data Reduction: Discretization Same concept as histograms Divide domain of a numerical attribute into intervals. Replace attribute value with label for interval. Example: Dataset (age; salary): (25;30,000),(30;80,000),(27;50,000), (60;70,000),(50;55,000),(28;25,000)
Discretized dataset (age, discretizedSalary): (25,low),(30,high),(27,medium),(60,high), (50,medium),(28,low) Discretization: Natural Hierarchies Replace low-level concepts with high-level concepts Example replacements: Product SKU by category City by state Industry Country Category State Product City
Year Quarter Month Week Day Data Reduction: Aggregation Natural hierarchies on attributes can be used to aggregate data along the hierarchy Industry Country Category State
Product City Year Quarter Month Week Day Aggregation: Example cid 1 1 1 1 1 2 2
2 2 2 3 4 4 Date Amount 3/ 1/ 2000 $150 3/ 5/ 2000 $50 3/ 29/ 2000 $200 4/ 2/ 2000 $300 4/ 6/ 2000 $200 3/ 2/ 2000 $100
3/ 5/ 2000 $250 3/ 10/ 2000 $100 3/ 11/ 2000 $50 4/ 7/ 2000 $200 4/ 2/ 2000 $300 3/ 17/ 2000 $250 4/ 25/ 2000 $100 cid Month Year Amount Visits 1 3 2000 $400 3 1
3 2000 $500 2 2 3 2000 $500 4 2 4 2000 $200 1 3 4 2000 $300 1 4 3 2000 $250 1 4 4
2000 $100 1 Data Reduction: Other Methods Principal component analysis Fourier transformation Wavelet transformation Data Preprocessing: Summary Problems during data integration: Different attribute names Different units Different scales Derived attributes Redundant data Missing values Imputation Prediction
Noisy data: Outlier removal Smoothing Normalization Data Reduction: Attribute selection Fitting parametric models Sampling Histograms Discretization Aggregation Data Analysis Data preprocessing Multidimensional data analysis Data mining Association rules Classification trees
Clustering Multidimensional Data Analysis Recall: Transactions(ckey, timekey, pkey, units, price) Customers(ckey, cid, name, byear, city, state, country) Time(tkey, day, month, quarter, year) Products(pkey, pname, price, pid, category, industry) Year Industry Country Hierarchies on dimensions: Quarter Category State Month Week Product City Day
Multidimensional Data Analysis Industry 1 Industry 2 Industry Industry 3 NY $1000 CA $2000 WI $1000 $500
$1000 $500 $3000 $3000 $3000 Country=USA Category State Product City Year
Quarter Month Week Day Corresponding Query in SQL SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.country = USA GROUP BY P.industry, C.state Year Industry Country=USA We think that Industry3 in CA is
Quarter interesting. Category State Month Product City Week Day Slice and Drill-Down San Francisc o San Jose
Los Angele s $300 $300 $100 $300 $300 $800 $400 $400 $100 Category1 Category2 Category3 Industry=Industry3
Country Category State=CA Product City Year Quarter Month Week Day Corresponding Query in SQL SELECT SUM(units)
FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND P.industry = Industry3 AND C.state = CA GROUP BY P.category, C.city We think that Category3 is interesting. Industry=Industry3 Country Category State=CA Product City Year Quarter
Month Week Day Slice and Drill-Down Product1 Product2 Product3 San Francisc o San Jose Los Angele
s $20 $20 $60 $160 $160 $480 $20 $20 $60 Industry Country Category=Category3 State=CA
Product City Year Quarter Month Week Day Corresponding Query in SQL SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = CA AND P.category = Category3 GROUP BY P.product, C.city
Nothing new in this view of the data. Industry Country Category=Category3 State=CA Product City Year Quarter Month Week Day
Pivot To (City, Year) 1997 1998 1999 San Francisc o San Jose Los Angele s $20 $20 $60
$100 $600 $100 $20 $20 $60 Industry Country Category=Category3 State=CA Product City Year
Quarter Month Week Day Corresponding Query in SQL SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = CA AND P.category = Category3 GROUP BY C.city, T.year Industry Country Category=Category3
State=CA Product City Year Quarter Month Week Day Multidimensional Data Analysis Set of data manipulation operators Roll-up: Go up one step in a dimension hierarchy. Example: month -> quarter Drill-down: Go down one step in a dimension hierarchy. Example: quarter -> month Slice: Select a subset of the values of a dimension.
Example: All categories -> only Category3 Dice: Select all values of a dimension. Example: Only Category3 -> all categories Pivot: Select new dimensions to visualize the data. Example: Pivot to Time(quarter) and Customer(state) Visual Intuition: Cube roll-up to category Customer Data Mart roll-up to state SH SF Product LA Product1
Product2 Product3 Product4 Product5 Product6 20 30 20 15 10 50 roll-up to week M T W Th F S S Time 50 Units of Product6 sold on Monday in LA OLAP Server Architectures
Relational OLAP (ROLAP) Relational DBMS stores data mart OLAP middleware: Aggregation and navigation logic Optimized for DBMS in the background, but slow and complex Basically only one vendor: Microstrategy Multidimensional OLAP (MOLAP) Specialized array-based storage structure Vendors: Hyperion (Essbase), Appix (iTM1), Oracle, Microsoft OLAP Server Architectures Desktop OLAP (DOLAP) Performs OLAP directly at your PC Vendors: Cognos (Powerplay), Business Objects, Brio Technology, Hummingbird Hybrids and Application OLAP More: www.olapreport.com
The OLAP Market Revenues For The OLAP Market (OLAP Report 11/1998) 3.5 Billion $ 3.0 2.5 2.0 Revenue 1.5 1.0 0.5 0.0 1994 1995
1996 1997 1998 1999 2000 Enterprise Reporting Tools Market Size Forecast for Enterprise Reporting Tools (IDC 11/1998) 1000 Million $ 800 600
Market Size 400 200 0 1997 1998 1999 2000 2001 2002 Summary: Multidimensional Analysis Spreadsheet style data analysis Roll-up, drill-down, slice, dice, and pivot your way to interesting cells in the CUBE
Mainstream technology Established enterprises already have OLAP installations When establishing your e-business, OLAP will be your first step in data analysis Data Mining Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6% Definition (Cont.) Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful,
and ultimately understandable patterns in data. Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns. Why Use Data Mining Today? Human analysis skills are inadequate: Volume and dimensionality of the data High data growth rate Availability of: Data Storage Computational power Off-the-shelf software
An Abundance of Data Supermarket scanners, POS data Preferred customer cards Credit card transactions Direct mail response Call center records Web server logs Customer web site trails ATM machines Demographic data
Evolution of Database Technology 1960s: IMS and network DBMS 1970s: The relational data model, small relational DBMS implementations 1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, images, etc.) 1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology 2000s: High availability, maintainability, seamless integration into business processes Computational Power Moores Law: In 1965, Intel Corp. cofounder Gordon Moore predicted that the density of transistors in an integrated circuit would double every year. Later changed to reflect 18 months progress.
Experts on ants estimate that there are 1016 to 1017 ants on earth. In the year 1997, we produced one transistor per ant. OTS-Software for Data Mining ANGOSS KnowledgeStudio IBM Intelligent Miner Metaputer PolyAnalyst SAS Enterprise Miner SGI Mineset SPSS Clementine Many others
More at http://www.kdnuggets.com/software Why Use Data Mining Today? And competitive pressure! The secret of success is to know something that nobody else knows. Aristotle Onassis Competition on service, not only on price (Banks, phone companies, hotel chains, rental car companies) Personalization Your competitors already apply data mining The Knowledge Discovery Process Steps: 1. Identify business problem 2. Data preprocessing
Data selection: Identify target datasets and relevant fields Data cleaning: Remove noise and outliers Data transformation Create common units Generate new fields 3. Data mining and model evaluation 4. Deployment and measurement of the results Preprocessing and Mining Knowledge Patterns Preprocessed Data Target Data Interpretation
Model Construction Original Data Preprocessing Data Integration and Selection What is a Data Mining Model? A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values. Examples: Linear regression model Classification model Clustering
Data Mining Models (Contd.) A data mining model can be described at two levels: Functional level: Describes model in terms of its intended usage. Examples: Classification, clustering Representational level: Specific representation of a model. Example: Log-linear model, classification tree, nearest neighbor method. Black-box models versus transparent models Data Mining Techniques Supervised learning Classification and regression Unsupervised learning Clustering
Dependency modeling Associations, summarization, causality Outlier and deviation detection Trend analysis and change detection Fields Related to Data Mining Database systems Machine learning Statistics Visualization Parallel computing Traditional Analysis Tools Predefined set of reports Capture most important known business questions Generated at fixed points in time No support for impromptu queries SQL: Write your own SQL query.
Statistics department within a company Slow reaction to users needs Limited understanding of the business questions Problems communicating the findings Example Application: Sports IBM Advanced Scout analyzes NBA game statistics Shots blocked Assists Fouls http://www.research.ibm.com/scout/home. html Advanced Scout Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots."
Pattern is interesting: The average shooting percentage for the Charlotte Hornets during that game was 54%. Example Application: Sky Survey Input data: 3 TB of image data with 2 billion sky objects, took more than six years to complete Goal: Generate a catalog with all objects and their type Method: Use decision trees as data mining model Results: 94% accuracy in predicting sky object classes Increased number of faint objects classified by 300% Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time Gold Nuggets? Investment firm mailing list: Discovered that old people do not respond to IRA mailings
Bank clustered their customers. One cluster: Older customers, no mortgage, less likely to have a credit card Bank of 1911 Customer churn example Market Basket Analysis Consider shopping cart filled with several items Market basket analysis tries to answer the following questions: Who makes purchases? What do customers buy together? In what order do customers purchase items? Market Basket Analysis Given: A database of customer transactions Each transaction is a
set of items Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice} TID 111 111 111 111 112 112 112 113 113 114 114 114
CID 201 201 201 201 105 105 105 106 106 201 201 201 Date 5/1/99 5/1/99 5/1/99 5/1/99 6/3/99 6/3/99
6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen Ink Milk Juice Pen Ink Milk Pen Milk Pen Ink Juice
Qty 2 1 3 6 1 1 1 1 1 2 2 4 Market Basket Analysis (Contd.) Coocurrences 80% of all customers purchase items X, Y and Z together. Association rules 60% of all customers who purchase X
and Y also buy Z. Sequential patterns 60% of customers who first buy X also purchase Y within three weeks. Confidence and Support We prune the set of all possible association rules using two interestingness measures: Confidence of a rule: X => Y has confidence c if P(Y|X) = c Support of a rule: X => Y has support s if P(XY) = s We can also define Support of an itemset (a coocurrence) XY: XY has support s if P(XY) = s Example Examples:
{Pen} => {Milk} Support: 75% Confidence: 75% {Ink} => {Pen} Support: 100% Confidence: 100% TID 111 111 111 111 112 112 112 113 113 114 114 114
CID 201 201 201 201 105 105 105 106 106 201 201 201 Date 5/1/99 5/1/99 5/1/99 5/1/99 6/3/99 6/3/99
6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen Ink Milk Juice Pen Ink Milk Pen Milk Pen Ink Juice
Qty 2 1 3 6 1 1 1 1 1 2 2 4 Exercise Find all itemsets with support >= 75%? TID 111
111 111 111 112 112 112 113 113 114 114 114 CID 201 201 201 201 105 105 105 106
106 201 201 201 Date 5/1/99 5/1/99 5/1/99 5/1/99 6/3/99 6/3/99 6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen
Ink Milk Juice Pen Ink Milk Pen Milk Pen Ink Juice Qty 2 1 3 6 1 1 1 1
1 2 2 4 Exercise Can you find all association rules with support >= 50%? TID 111 111 111 111 112 112 112 113 113
114 114 114 CID 201 201 201 201 105 105 105 106 106 201 201 201 Date 5/1/99 5/1/99
5/1/99 5/1/99 6/3/99 6/3/99 6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen Ink Milk Juice Pen Ink Milk Pen Milk
Pen Ink Juice Qty 2 1 3 6 1 1 1 1 1 2 2 4 Extensions Imposing constraints
Only find rules involving the dairy department Only find rules involving expensive products Only find expensive rules Only find rules with whiskey on the right hand side Only find rules with milk on the left hand side Hierarchies on the items Calendars (every Sunday, every 1st of the month) Market Basket Analysis: Applications Sample Applications Direct marketing Fraud detection for medical insurance Floor/shelf planning Web site layout Cross-selling
Beyond Support and Confidence Example: 5000 students 3000 students play basketball 3750 students eat cereal 2000 students both play basketball and eat cereal Cereal No cereal Sum Basketball No basketball 2000 1750 1000 250 3000 2000 Sum 3750
1250 5000 Misleading Association Rules Basketball No basketball Cereal 2000 1750 No cereal 1000 250 Sum 3000 2000 Sum 3750 1250 5000 Basketball => Cereal (support: 40%,
confidence: 66.7%) is misleading because 75% of students eat cereal Basketball => No cereal (support: 20%, confidence: 33.3%) is more interesting, although with lower support and confidence Interest Interest of rule A => B: P(AB)/(P(A)*P(B)) Symmetric (uses both P(A) and P(B)) Note that confidence is not symmetric (confidence of rule A => B: P(AB)/P(A)) Interest values: Interest = 1: A and B are independent (P(AB)=P(B)*P(A)) Interest > 1: A and B are positively correlated Interest < 1: A and B are negatively correlated Interest: Example
Basketball No basketball Cereal 2000 1750 No cereal 1000 250 Sum 3000 2000 Itemset Cereal, basketball Cereal, no basketball No cereal, basketball No cereal, no basketball Sum 3750 1250 5000
Support Interest 40% 35% 20% 5% 1.125 0.857 0.750 2.000 Extensions Imposing constraints Only find rules involving the dairy department
Only find rules involving expensive products Only find expensive rules Only find rules with whiskey on the right hand side Only find rules with milk on the left hand side Hierarchies on the items Calendars (every Sunday, every 1st of the month) Applications Sample Applications Direct marketing Fraud detection for medical insurance Floor/shelf planning Web site layout Cross-selling Case study Summary Data preprocessing
Quality analysis comes from quality data Multidimensional data analysis Interactive, spreadsheet-style data analysis Data mining The knowledge discovery process Association rules Questions? (In the fourth lecture: More data mining.)