Chapter 5

Chapter 5

Chapter 5 Divide and Conquer Classification Using Decision Trees and Rules decision trees and rule learners two machine learning methods that make complex decisions from sets of simple choices present their knowledge in the form of logical structures particularly useful for business strategy and process improvement

Objectives How trees and rules "greedily" partition data into interesting segments The most common decision tree and classification rule learners, including the C5.0, 1R, and RIPPER algorithms How to use these algorithms to perform real-world classification tasks, such as identifying risky bank loans and poisonous mushrooms Understanding decision trees tree structure to model the relationships among the features and the potential outcomes

branching decisions, which channel examples into a final predicted class value E.g., Root node Decision nodes split the data across branches Leaf nodes (terminal nodes the action to be taken or expected results

benefit of decision tree algorithms flowchart-like tree structure Not necessarily exclusively for the learner's internal use in a human-readable format

provides insight into how and why the model works or doesn't work classification mechanism can be transparent some potential uses Credit scoring models in which the criteria that causes an applicant to be rejected need to be clearly documented and free from bias Marketing studies of customer behavior such as satisfaction or churn, which will be shared with management or advertising agencies Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression

Pros and cons Pros decision trees are perhaps the single most widely used machine learning technique can be applied to model almost any type of data Cons trees may not be an ideal fit for a task where the data has a large number of nominal features with many levels or it has a large number of numeric

features. result in a very large number of decisions and an overly complex tree. tendency of decision trees to overfit data Divide and conquer recursive partitioning splits the data into subsets, which are then split repeatedly into even smaller subsets, stops when the algorithm determines the data within the subsets are sufficiently homogenous, or another stopping criterion has been met

Steps the root node represents the entire dataset choose a feature to split upon; ideally, it chooses the feature most predictive of the target class. The examples are then partitioned into groups according to the distinct values of this feature stop at a node in a case that: All (or nearly all) of the examples at the node have the same class There are no remaining features to distinguish among the examples The tree has grown to a predefined size limit

Example - Hollywood studio predict whether a potential movie would fall into one of three categories: Critical Success, Mainstream Hit, or Box Office Bust the factors leading to the success and failure of the company's 30 most recent releases build a decision tree split the feature indicating the number of celebrities build a decision tree contd

another split between movies with and without a high budget build a decision tree contd partitioned the data into three groups. top-left corner: entirely of critically acclaimed films. high number of celebrities and a relatively low budget. top-right corner: box office hits with high budgets and a large number of celebrities. final group: little star power but budgets ranging from small to large

could continue to divide and conquer the data by splitting it based on the increasingly specific ranges of budget and celebrity count not advisable to overfit a decision tree axis-parallel splits The fact that each split considers one feature at a time prevents the decision tree from forming more complex decision

boundaries a diagonal line could be created by a decision that asks, "is the number of celebrities is greater than the estimated budget?" If so, then "it will be a critical success." The C5.0 decision tree algorithm C5.0 algorithm Developed by computer scientist J. Ross Quinlan

Iterative Dichotomiser 3 (ID3) -> C4.5 -> C5.0 http://www.rulequest.com/ single-threaded version publically available industry standard to produce decision trees J48 - Java-based open source alternative to C4.5 Pros and cons Choosing the best split

first challenge - identify which feature to split upon split the data such that the resulting partitions contained examples primarily of a single class purity - degree to which a subset of examples contains only a single class Any subset composed of only a single class is called pure Entropy - concept borrowed from information theory that quantifies the randomness, or disorder

measurements of purity Sets with high entropy are very diverse entropy If there are only two possible classes, entropy values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). the minimum value indicates that the sample is completely homogenous the maximum value indicates that the data are as diverse as possible

mathematical notion S - given segment of data c - the number of class levels pi - the proportion of values falling into class level i. E.g., partition of data with two classes: red (60 percent) and white (40 percent) > -0.60 * log2(0.60) - 0.40 * log2(0.40) [1] 0.9709506

entropy for all the possible two-class arrangements Using the curve() function, plot the entropy for all the possible values of x: > curve(-x * log2(x) - (1 - x) * log2(1 - x), col = "red", xlab = "x", ylab = "Entropy", lwd = 4) 50-50 split results in maximum entropy. As one class increasingly

dominates the other, the entropy reduces to zero determine the optimal feature to split upon information gain - the change in homogeneity that would result from a split on each possible feature information gain for a feature F - the difference between the entropy in the segment before the split (S1) and the partitions resulting from the split (S2):

function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions weighing each partition's entropy by the proportion of records falling into the partition the total entropy resulting from a split is the sum of the entropy of each of the n partitions weighted by the proportion of examples falling in the partition (w ). i information gain The higher the information gain, the better a feature is at creating homogeneous groups after a split on this feature.

If the information gain is zero, there is no reduction in entropy for splitting on this feature. the maximum information gain is equal to the entropy prior to the split, which means that the split results in completely homogeneous groups splitting on numeric features Test various splits that divide the values into groups greater than or less than a numeric threshold reduces the numeric feature into a two-level categorical feature

The numeric cut point yielding the largest information gain is chosen for the split Pruning the decision tree A decision tree can continue to grow indefinitely if the tree grows overly large, many of the decisions it makes will be overly specific and the model will be overfitted to the training data pruning a decision tree - reducing its size such that it generalizes better to unseen data One solution - early stopping or pre-pruning the decision tree: stop

the tree from growing once it reaches a certain number of decisions or when the decision nodes contain only a small number of examples there is no way to know whether the tree will miss subtle, but important patterns that it would have learned had it grown to a larger size post-pruning growing a tree that is intentionally too large and pruning leaf nodes to reduce the size of the tree to a more appropriate level often a more effective approach than pre-pruning it is quite difficult to determine the optimal depth of a decision tree without growing it

first. Pruning the tree later on allows the algorithm to be certain that all the important data structures were discovered subtree raising and subtree replacement first grows a large tree that overfits the training data. Later, the nodes and branches that have little effect on the classification are removed. In some cases, entire branches are moved further up the tree or replaced by simpler decisions

Example identifying risky bank loans using C5.0 decision trees Decision trees are widely used in the banking industry due to their high accuracy and ability to formulate a statistical model in plain language we will develop a simple credit approval model using C5.0 decision trees Step 1 collecting data to identify factors that are predictive of higher risk of default. need to obtain data on a large number of past bank loans and whether the

loan went into default, as well as information on the applicant The dataset dataset donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg. The dataset contains information on loans obtained from a credit agency in Germany Preprocessed version: download the credit.csv file from Packt Publishing's website

Characteristics 1,000 examples on loans a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default Step 2 exploring and preparing the data > credit <- read.csv("credit.csv")

Numeric features and default rate Data preparation creating random training and test datasets split our data into two portions: a training dataset 90% (900 records) a test dataset 10% (100 records) the credit dataset is not randomly ordered

random sample - sample() function common practice is to set a seed value, which causes the randomization process to follow a sequence that can be replicated later on if desired set.seed() function pseudorandom number generator The commands > set.seed(123) > train_sample <- sample(1000, 900) > str(train_sample)

int [1:900] 288 788 409 881 937 46 525 887 548 453 ... > credit_train <- credit[train_sample, ] > credit_test <- credit[-train_sample, ] Step 3 training a model on the data use the C5.0 algorithm in the C50 package > install.packages("C50") > library(C50)

?C5.0 Control command displays the help page use the default C5.0 configuration > credit_model <- C5.0(credit_train[-17], credit_train$default) see the tree's decisions > summary(credit_model)

The first three lines could be represented in plain language as: 1. If the checking account balance is unknown or greater than 200 DM, then classify as "not likely to default." 2. Otherwise, if the checking account balance is less than zero DM or between one and 200 DM. 3. And the credit history is perfect or very good, then classify as "likely to default." The numbers in parentheses indicate the number of examples meeting the criteria for that decision, and the number incorrectly

classified by the decision. - e.g., On the first line, 412/50 indicates that of the 412 examples reaching the decision, 50 were incorrectly classified as not likely to default Sometimes a tree results in decisions that make little logical sense. e.g., why would an applicant whose credit history is very good be likely to default? confusion matrix

summary(credit_model) model correctly classified all but 133 of the 900 training instances for an error rate of 14.8 percent. A total of 35 actual no values were incorrectly classified as yes (false positives) 98 yes values were misclassified as no (false negatives) Step 4 evaluating model performance > credit_pred <- predict(credit_model, credit_test)

accuracy of 73 percent and an error rate of 27 percent the model only correctly predicted 14 of the 33 actual loan defaults in the test data, or 42 percent Step 5 improving model performance Our model's error rate is likely to be too high

if the model had predicted "no default" for every test case, it would have been correct 67 percent of the time our model performed especially poorly at identifying applicants who do default on their loans Boosting the accuracy of decision trees adaptive boosting a process in which many decision trees are built and the trees vote on the best class for each example research by Rob Schapire and Yoav Freund rooted in the notion that by combining a number of weak performing learners,

you can create a team that is much stronger than any of the learners alone The trials parameter the number of separate decision trees to use in the boosted team The trials parameter sets an upper limit the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy 10 trials - a number that has become the de facto standard, as research suggests that this reduces error rates on test data by about 25 percent > credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10)

> credit_boost10 Number of boosting iterations: 10 Average tree size: 47.5 the model's performance The classifier made 34 mistakes on 900 training examples for an error rate of 3.8 percent Total error rate 27% -> 18% (> 25% reduction) still not doing well at predicting defaults,

predicting only 20/33 = 61% correctly Making mistakes more costlier than others One solution to reduce the number of false negatives may be to reject a larger number of borderline applicants the interest the bank would earn from a risky loan is far outweighed by the massive loss it would incur if the money is not paid back at all C5.0 - Assign a penalty to different types of errors cost matrix - specifies how much costlier each error is, relative to any other prediction > matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))

> names(matrix_dimensions) <- c("predicted", "actual") > matrix_dimensions $predicted [1] "no" "yes" $actual [1] "no" "yes" R fills a matrix by filling columns one by one from top to bottom: Predicted no, actual no

Predicted yes, actual no Predicted no, actual yes Predicted yes, actual yes penalty values we believe that a loan default costs the bank four times as much as a missed opportunity The result

more mistakes overall: 37% 79% of the actual defaults were predicted to be defaults reduction of false negatives at the expense of increasing false positives Understanding classification rules Classification rules represent knowledge in the form of logical if-else statements that assign a class to unlabeled examples

Antecedent - comprises certain combinations of feature values Consequent - specifies the class value to assign when the rule's conditions are met Rule learners can be used for applications that generate knowledge for future action, such as: Identifying conditions that lead to a hardware failure in mechanical devices Describing the key characteristics of groups of people for customer segmentation Finding conditions that precede large drops or increases in the prices of shares on the stock market

some distinct advantages Tree - must be applied from top-to-bottom through a series of decisions rules - propositions that can be read much like a statement of fact decision trees bring a particular set of biases to the task rule learner avoids by identifying the rules directly Rule learners are generally applied to problems where the features are primarily or entirely nominal. They do well at identifying rare events, even if the rare event occurs only for a very specific interaction among feature values

Separate and conquer The process involves identifying a rule that covers a subset of examples in the training data separating this partition from the remaining data

As the rules are added, additional subsets of the data are separated Until the entire dataset has been covered and no more examples remain drilling down into the data by creating increasingly specific rules to identify class values identify whether or not an animal is a mammal set of all animals using the available features to

find homogeneous groups first rule might suggest that any land-based animals are mammals frogs are amphibians, not mammals drill down further by suggesting that mammals must walk on land and have a tail distinguishing bats from the other remaining animals

A potential feature - presence of fur the rule learning process stops a total of three rules: Animals that walk on land and have tails are mammals If the animal does not have fur, it is not a mammal Otherwise, the animal is a mammal separate and conquer algorithms are also known as covering algorithms

resulting rules are called covering rules The 1R algorithm ZeroR - a rule learner that literally learns no rules. For every unlabeled example, regardless of the values of its features, it predicts the most common class The 1R algorithm (One Rule or OneR) - selecting a single rule it tends to perform better than you might expect As demonstrated in empirical studies, the accuracy of this algorithm can

approach that of much more sophisticated algorithms for many real-world tasks strengths and weaknesses of the 1R algorithm The way 1R works For each feature 1R divides the data into groups based on similar values of the feature. Then, for each segment, the algorithm predicts the majority class.

The error rate for the rule based on each feature is calculated and the rule with the fewest errors is chosen as the one rule how 1R works for the animal data the single most important rule "one rule" based on Travels By: If the animal travels by air, it is not a mammal If the animal travels by land, it is a mammal If the animal travels by sea, it is not a mammal

The RIPPER algorithm Early rule learning algorithms were plagued by a couple of problems First, they were notorious for being slow, which made them ineffective for the increasing number of large datasets. Secondly, they were often prone to being inaccurate on noisy data. Incremental Reduced Error Pruning (IREP) proposed in 1994 by Johannes Furnkranz and Gerhard Widmer uses a combination of pre-pruning and post-pruning methods that grow very

complex rules and prune them before separating the instances from the full dataset. Although this strategy helped the performance of rule learners, decision trees often still performed better Repeated Incremental Pruning to Produce Error Reduction (RIPPER) 1995 William W. Cohen improved upon IREP to generate rules that match or exceed the

performance of decision trees strengths and weaknesses of RIPPER the RIPPER algorithm a patchwork of efficient heuristics for rule learning three-step process: 1. Grow: uses the separate and conquer technique to greedily add conditions to a rule until it perfectly classifies a subset of data or runs out of attributes for splitting the information gain criterion is used to identify the next splitting attribute

2. Prune: the information gain criterion is used to identify the next splitting attribute. When increasing a rule's specificity no longer reduces entropy, the rule is immediately pruned. Steps one and two are repeated until it reaches a stopping criterion 3. Optimize Pros and cons The RIPPER algorithm can create much more complex rules than can

the 1R algorithm, as it can consider more than one feature E.g., "if an animal flies and has fur, then it is a mammal rules can quickly become more difficult to comprehend New rule learning algorithms: IREP++, SLIPPER, TRIPPER, etc. Rules from decision trees Beginning at a leaf node and following the branches back to

the root, you will have obtained a series of decisions. These can be combined into a single rule predict movie success Following the paths from the root node down to each leaf, the rules would be: 1. If the number of celebrities is low, then the movie will be a Box Office Bust.

2. If the number of celebrities is high and the budget is high, then the movie will be a Mainstream Hit. 3. If the number of celebrities is high and the budget is low, then the movie will be a Critical Success. Con: the resulting rules are often more complex than those learned by a rule learning algorithm Pro: it is sometimes more computationally efficient to generate rules from trees

What makes trees and rules greedy? greedy learners - use data on a first-come, first-served basis make partitions one at a time, finding the most homogeneous partition first, followed by the next best, and so on, until all examples have been classified greedy algorithms are not guaranteed to generate the optimal, most accurate, or smallest number of rules greedy learner may quickly find a single rule that is accurate for one subset of data; the learner may miss the opportunity to develop a more nuanced set of rules with

better overall accuracy on the entire set of data. without using the greedy approach to rule learning, it is likely that for all but the smallest of datasets, rule learning would be computationally infeasible computational feasibility Subtle differences once divide and conquer splits on a feature,

the partitions created by the split may not be re-conquered, only further subdivided. a tree is permanently limited by its history of past decisions. decision tree: If an animal walks on land and has fur, then it is a mammal (bears, cats, dogs, elephants, pigs, rabbits, rats, rhinos) If an animal walks on land and does not have fur,

then it is not a mammal (frogs) If the animal does not walk on land and has fur, then it is a mammal (bats) If the animal does not walk on land and does not have fur, then it is not a mammal (birds, insects, sharks, fish, eels) once separate and conquer finds a rule, any examples not covered by all of the rule's conditions may be reconquered.

The rule learner: Animals that walk on land and have tails are mammals (bears, cats, dogs, elephants, pigs, rabbits, rats, rhinos) If the animal does not has fur, it is not a mammal (birds, eels, fish, frogs, insects, sharks) Otherwise, the animal is a mammal (bats)

Rule learner Note how frog was classified Comparison rule learners often find a more parsimonious set of rules than those generated from decision trees. the computational cost of rule learners may be somewhat higher than for decision trees - reuse of data Example identifying poisonous mushrooms

with rule learners there are no clear rules to identify whether a wild mushroom is poisonous or edible Step 1 collecting data Mushroom dataset by Jeff Schlimmer of Carnegie Mellon University. The raw dataset is available freely at the UCI Machine Learning Repository (http:// archive.ics.uci.edu/ml) The dataset includes information on 8,124 mushroom samples from 23 species of gilled mushrooms listed in Audubon Society Field Guide to North American Mushrooms (1981).

22 features of the mushroom samples, including characteristics such as cap shape, cap color, odor, gill size and color, stalk shape, and habitat Download the mushrooms.csv file from the Packt Publishing website Step 2 exploring and preparing the data > mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE) The output of the str(mushrooms) include: $ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 ... since the veil type does not vary across samples, it does not provide any

useful information for prediction > mushrooms$veil_type <- NULL By assigning NULL to the veil type vector, R eliminates the feature from the mushrooms data frame distribution of the mushroom type class variable About 52 percent of the mushroom samples (N = 4,208) are edible, while 48 percent (N = 3,916) are poisonous

we do not need to hold some samples out of the training data for testing purposes. We are not trying to develop rules that cover unforeseen types of mushrooms; we are merely trying to find rules that accurately depict the complete set of known mushroom types. Therefore, we can build and test the model on the same data Step 3 training a model on the data ZeroR ignores all of the features and simply predicts the target's mode its rule would state that all the mushrooms are edible

use the 1R implementation in the RWeka package called OneR() install.packages("RWeka") library(RWeka) Syntax R formula syntax The class variable to be learned goes to the left of the tilde, and the predictor features are written on the right,

separated by + operators e.g., y ~ x1 + x2 y~. constructing the rules to predict type > mushroom_1R <- OneR(type ~ ., data = mushrooms) these rules could be summarized in a simple rule of thumb: "if the mushroom smells unappetizing, then it is likely to be

poisonous." Step 4 evaluating model performance a = edible and b = poisonous 1R classifier did not classify any edible mushrooms as poisonous it did classify 120 poisonous mushrooms as edible

Step 5 improving model performance JRip(), a Java-based implementation of the RIPPER rule learning algorithm included in the RWeka package train the JRip() rule learner

> mushroom_JRip <- JRip(type ~ ., data = mushrooms) nine rules The first three rules could be expressed as: If the odor is foul, then the mushroom type is poisonous If the gill size is narrow and the gill color is buff, then the mushroom type is poisonous If the gill size is narrow and the odor is pungent, then the mushroom type is poisonous

the ninth rule implies that any mushroom sample that was not covered by the preceding eight rules is edible Else, the mushroom is edible The numbers next to each rule indicate the number of instances covered by the rule and a count of misclassified instances how the rules are applied to the mushroom data

The End of Chapter 5

Recently Viewed Presentations

  • CSR - vpmthane

    CSR - vpmthane

    - The World Business Council for Sustainable Development (WBCSD) www.wbcsd.org "the businessman's decision and actions taken for reasons at least partially beyond the firm's direct economic or technical interest." "pursue those policies, to make those decisions, or to follow those...
  • Intermediate Accounting - Cengage

    Intermediate Accounting - Cengage

    The Need for Financial Accounting Information. Forces that create demand. Companies compete for a wide variety of resources such as financial capital, physical and natural resources, intellectual property and technology, new product and service ideas, skilled employees and executives, customers,...
  • LEON 2: LA LANGUE ET LIDENTIT LES COMMUNAUTS

    LEON 2: LA LANGUE ET LIDENTIT LES COMMUNAUTS

    En 1916, le Conseil éducatif de la Louisiane a déclaré l'anglais la seule langue de classes des écoles publiques . En 1921, la Constitution de la Louisiane a confirmé cela. En conséquence, des générations d'enfants ont été punies pour avoir...
  • GRASP - Graduate Research Advanced Skills Program RESEARCH

    GRASP - Graduate Research Advanced Skills Program RESEARCH

    WORK CLOSELY WITH YOUR SOURCES . 5. Strategies for Analysing Sources (Rosenwasser and Stephen 2009, pp. 219-221) 1. " Make your sources . s. peak - q. uote, paraphrase or summarize . in order to. analyse." 2. " Attend carefully...
  • Text Segment Source Code void test_function(int a) {

    Text Segment Source Code void test_function(int a) {

    The instruction pointer eip will jump to the next address to be executed in the test_function portion of the text segment of memory. The stack frame for main is expanded by eight bytes, so that it can accommodate the return...
  • The Consumer Welfare Gains of Guatemalas Liberal Reforms

    The Consumer Welfare Gains of Guatemalas Liberal Reforms

    The Consumer Welfare Gains of Guatemala's Liberal Reforms Thomas W. Hazlett, Giancarlo Ibarguen S. and Wayne A. Leighton Presentation to "Convergence or Competition?
  • Internet Number Resource Status Report

    Internet Number Resource Status Report

    Publications. Internet Number Status Report. Updated quarterly. Global stats on IPv4, IPv6, ASN, IPv4 Transfers. New extended stats released in 2019 Q1
  • Headland and Bays

    Headland and Bays

    HEADLAND AND BAYS"In every outthrust headland, in every curving beach, in every grain of sand there is the story of the earth." Rachel CarsonAlong the rugged North Coast the hills meet the sea in a succession of beautiful bays between...