# CS 590M: Security Issues in Data Mining

CS 590M Fall 2001: Security Issues in Data Mining Lecture 4: ID3 ID3 (precursor to C4.5) Canonical decision tree learning algorithm by Ross Quinlan Has evolved into C5.0 from Rulequest established product. Basic idea: recursively build decision tree top-down At each node, choose best attribute for decision from those remaining What is best? Algorithm ID3 (Training set S, Attributes R, Class C) If all records in S have same value for C, return node with that value

If R is empty, return node with most common value for C in S Create non-leaf node: Let D be the attribute with largest Gain(D,S) among attributes in R; For each d D, Let Sd be items in S where value of D is d Create edge to ID3(R-{D},C, Sd) What is Gain(D,S)? Information theory: expected reduction in entropy due to sorting S on attribute D entropy(S) = -(p1*log(p1)+...+pn*log(pn)) pi is the probability of class i in S. Gain(D,S) = | Si | entropy ( S )

entropy ( Si ) iD | S | Example: Tennis Day Outlook Temp. D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12

D13 Hot Hot Hot Mild Cool Cool Cool Mild Cold Mild Mild Mild Hot Sunny Sunny Overcast Rain

Rain Rain Overcast Sunny Sunny Rain Sunny Overcast Overcast Humidit y High High High High Normal Normal Normal High

Normal Normal Normal High Normal Wind Play Tennis Weak Strong Weak Weak Weak Strong Weak Weak Weak Strong

Strong Strong Weak No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes What should be first? S=[9+,5-]

E=0.940 S=[9+,5-] E=0.940 Humidity Wind High Normal Weak Strong [3+, 4-] [6+, 1-]

[6+, 2-] E=0.985 E=0.59 2 E=0.81 E=1.0 1 Gain(S,Wind) =0.940-(8/14)*0.811 (6/14)*1.0 =0.048 Gain(S,Humidity) =0.940-(7/14)*0.985 (7/14)*0.592 =0.151

[3+, 3-] (continued) S=[9+,5-] E=0.940 Outlook Sunny [2+, 3-] Over cast [4+, 0] Rain [3+, 2-] E=0.97 E=0.97 E=0.0

1 1 Gain(S,Outlook) =0.940-(5/14)*0.971 -(4/14)*0.0 (5/14)*0.0971 =0.247 And recurse... [D1,D2,,D14] [9+,5-] Sunny Outlook Overcast Rain Ssunny=[D1,D2,D8,D9,D11] [D3,D7,D12,D13] [D4,D5,D6,D10,D14

[4+,0-] [3+,2-] [2+,3-] Yes ? ? Gain(Ssunny , Humidity)=0.970-(3/5)0.0 2/5(0.0) = 0.970 Gain(Ssunny , Temp.)=0.970-(2/5)0.0 2/5(1.0)-(1/5)0.0 = 0.570 ID3 result Outlook Sunny Humidity High No [D1,D2] Normal

Overcast Rain Yes [D3,D7,D12,D1 3] Yes [D8,D9,D11] Wind Strong Weak No Yes

[D6,D14] [D4,D5,D10 Inductive Bias in ID3 H is the power set of instances X Unbiased ? Preference for short trees, and for those with high information gain attributes near the root Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H Occams razor: prefer the shortest (simplest) hypothesis that fits the data Occams Razor Why prefer short hypotheses? Argument in favor: Fewer short hypotheses than long hypotheses A short hypothesis that fits the data is unlikely to be a coincidence

A long hypothesis that fits the data might be a coincidence Argument opposed: There are many ways to define small sets of hypotheses E.g. All trees with a prime number of nodes that use attributes beginning with Z What is so special about small sets based on size of hypothesis Overfitting Consider error of hypothesis h over Training data: errortrain(h) Entire distribution D of data: errorD(h) Hypothesis hH overfits training data if there is an alternative hypothesis hH such that errortrain(h) < errortrain(h) and errorD(h) > errorD(h) Avoid Overfitting How can we avoid overfitting?

Stop growing when data split not statistically significant Grow full tree then post-prune Minimum description length (MDL): Minimize: size(tree) + size(misclassifications(tree)) Reduced-Error Pruning Split data into training and validation set Do until further pruning is harmful: 1. Evaluate impact on validation set of pruning each possible node (plus those below it) 2. Greedily remove the one that most improves the validation set accuracy Produces smallest version of most accurate subtree Rule-Post Pruning 1. Convert tree to equivalent set of rules 2. Prune each rule independently of each other

3. Sort final rules into a desired sequence to use Method used in C4.5 Converting a Tree to Rules Outlook Sunny Overcast Humidity High No If If If If If Rain Yes Normal

Yes Wind Strong No Weak Yes (Outlook=Sunny) (Humidity=High) Then PlayTennis=No (Outlook=Sunny) (Humidity=Normal) Then PlayTennis=Y (Outlook=Overcast) Then PlayTennis=Yes (Outlook=Rain) (Wind=Strong) Then PlayTennis=No (Outlook=Rain) (Wind=Weak) Then PlayTennis=Yes Issues Attributes with many values Maximize gain always selects these But not very interesting (classify on timestamp?)

What to do with continuous attributes? Gain ration Attributes with costs (e.g., medical tests) Unknown values

## Recently Viewed Presentations

• CAM stands for Crassulacean Acid Metabolism Called CAM after the plant family in which it was first found and because the CO2 is stored in the form of an acid before use in photosynthesis. Stomata open at night (when evaporation...
• Option to automatically acquire DAP and technique information eliminating the need for manual entry* ... Automatic and Manual Stitch Editors allow you to manipulate and paste individual images together (now standard with system)* ... Customer facing presentation for Image Suite...
• WE MUST CONTINUE STEADFASTLY IN FAITH AC 2:42; RE 2:10 WE MUST GROW IN GRACE & KNOWLEDGE OF JESUS CHRIST 2PE 3:18; As you begin your discipleship, what should you keep in mind? THINGS TO REMEMBER... WE ARE A NEW...
• Ơn yêu thương Cha láng lai như sông tuôn tràn. Sự cứu rỗi đến nơi tay Cha. Your mercy flows like a river wide. And healing comes from Your hands.
• Adam Gouker A2LA, EMC Program Manager ... Telecom 88 Bluetooth 10 CTIA 23 Product Safety (electrical) 68 VCCI 67 SAR 13 WiMAX 4 Energy Star - 23 (1 CB) TCB's - 8 Upcoming A2LA Training November 14-15 (SC) - Intro...
• Integrating CV and TIM: Phase 3 . Long-Term (2025+) Ensure that "back-end" data analytics consider needs of emergency responders in improving situational awareness and decision support. Reliance on CV/data likely to require more aggressive maintenance of CV infrastructure. Continued training...
• See web resources listed at the end. Acknowledgement: Myron Allen and Nicole Ballenger ... If granted, complete 6thyr of extended term in AY20-21, begin new Fixed-Term Rolling Contract in AY 21-22. Recommend change in designation. Review for 3 or 5...
• Then there is the Grasslands Beneficial Management Practices online tool, a resource that makes universally accessible 94 beneficial management practices designed for North American ranchers, conservation organizations, and government and academic institutions.