CS 590M Fall 2001: Security Issues in Data Mining Lecture 4: ID3 ID3 (precursor to C4.5) Canonical decision tree learning algorithm by Ross Quinlan Has evolved into C5.0 from Rulequest established product. Basic idea: recursively build decision tree top-down At each node, choose best attribute for decision from those remaining What is best? Algorithm ID3 (Training set S, Attributes R, Class C) If all records in S have same value for C, return node with that value

If R is empty, return node with most common value for C in S Create non-leaf node: Let D be the attribute with largest Gain(D,S) among attributes in R; For each d D, Let Sd be items in S where value of D is d Create edge to ID3(R-{D},C, Sd) What is Gain(D,S)? Information theory: expected reduction in entropy due to sorting S on attribute D entropy(S) = -(p1*log(p1)+...+pn*log(pn)) pi is the probability of class i in S. Gain(D,S) = | Si | entropy ( S )

entropy ( Si ) iD | S | Example: Tennis Day Outlook Temp. D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12

D13 Hot Hot Hot Mild Cool Cool Cool Mild Cold Mild Mild Mild Hot Sunny Sunny Overcast Rain

Rain Rain Overcast Sunny Sunny Rain Sunny Overcast Overcast Humidit y High High High High Normal Normal Normal High

Normal Normal Normal High Normal Wind Play Tennis Weak Strong Weak Weak Weak Strong Weak Weak Weak Strong

Strong Strong Weak No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes What should be first? S=[9+,5-]

E=0.940 S=[9+,5-] E=0.940 Humidity Wind High Normal Weak Strong [3+, 4-] [6+, 1-]

[6+, 2-] E=0.985 E=0.59 2 E=0.81 E=1.0 1 Gain(S,Wind) =0.940-(8/14)*0.811 (6/14)*1.0 =0.048 Gain(S,Humidity) =0.940-(7/14)*0.985 (7/14)*0.592 =0.151

[3+, 3-] (continued) S=[9+,5-] E=0.940 Outlook Sunny [2+, 3-] Over cast [4+, 0] Rain [3+, 2-] E=0.97 E=0.97 E=0.0

1 1 Gain(S,Outlook) =0.940-(5/14)*0.971 -(4/14)*0.0 (5/14)*0.0971 =0.247 And recurse... [D1,D2,,D14] [9+,5-] Sunny Outlook Overcast Rain Ssunny=[D1,D2,D8,D9,D11] [D3,D7,D12,D13] [D4,D5,D6,D10,D14

[4+,0-] [3+,2-] [2+,3-] Yes ? ? Gain(Ssunny , Humidity)=0.970-(3/5)0.0 2/5(0.0) = 0.970 Gain(Ssunny , Temp.)=0.970-(2/5)0.0 2/5(1.0)-(1/5)0.0 = 0.570 ID3 result Outlook Sunny Humidity High No [D1,D2] Normal

Overcast Rain Yes [D3,D7,D12,D1 3] Yes [D8,D9,D11] Wind Strong Weak No Yes

[D6,D14] [D4,D5,D10 Inductive Bias in ID3 H is the power set of instances X Unbiased ? Preference for short trees, and for those with high information gain attributes near the root Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H Occams razor: prefer the shortest (simplest) hypothesis that fits the data Occams Razor Why prefer short hypotheses? Argument in favor: Fewer short hypotheses than long hypotheses A short hypothesis that fits the data is unlikely to be a coincidence

A long hypothesis that fits the data might be a coincidence Argument opposed: There are many ways to define small sets of hypotheses E.g. All trees with a prime number of nodes that use attributes beginning with Z What is so special about small sets based on size of hypothesis Overfitting Consider error of hypothesis h over Training data: errortrain(h) Entire distribution D of data: errorD(h) Hypothesis hH overfits training data if there is an alternative hypothesis hH such that errortrain(h) < errortrain(h) and errorD(h) > errorD(h) Avoid Overfitting How can we avoid overfitting?

Stop growing when data split not statistically significant Grow full tree then post-prune Minimum description length (MDL): Minimize: size(tree) + size(misclassifications(tree)) Reduced-Error Pruning Split data into training and validation set Do until further pruning is harmful: 1. Evaluate impact on validation set of pruning each possible node (plus those below it) 2. Greedily remove the one that most improves the validation set accuracy Produces smallest version of most accurate subtree Rule-Post Pruning 1. Convert tree to equivalent set of rules 2. Prune each rule independently of each other

3. Sort final rules into a desired sequence to use Method used in C4.5 Converting a Tree to Rules Outlook Sunny Overcast Humidity High No If If If If If Rain Yes Normal

Yes Wind Strong No Weak Yes (Outlook=Sunny) (Humidity=High) Then PlayTennis=No (Outlook=Sunny) (Humidity=Normal) Then PlayTennis=Y (Outlook=Overcast) Then PlayTennis=Yes (Outlook=Rain) (Wind=Strong) Then PlayTennis=No (Outlook=Rain) (Wind=Weak) Then PlayTennis=Yes Issues Attributes with many values Maximize gain always selects these But not very interesting (classify on timestamp?)

What to do with continuous attributes? Gain ration Attributes with costs (e.g., medical tests) Unknown values