Logistic Modeling with Applications to Marketing and Credit Risk in the Automotive Industry
Bruce Lund and Keith Shields
Magnify Analytic Solutions, Division of Marketing Associates
MSUG conference, June 4, 2015 (v18b)

Two Goals for this Talk
I. Discuss concepts and tools to add to your logistic regression modeling practice.
II. Show how to use these concepts and tools to:
   a) Select a final logistic model
   b) Build credit risk models (with applications to automotive finance)

A. Briefly: What is Logistic Regression?

An elevator pitch for a very tall building.

Binary Outcomes
Many predictive models in marketing and credit have binary outcomes: (Buy vs. Did-Not-Buy), (Defaulted on Loan vs. Paid as Agreed).
The outcome of interest (often the lower-frequency outcome) is the "event," coded as Y = 1. The non-event is coded as Y = 0.
In addition to Y we have predictors Xk that give information about Y.
So, for each customer, we want to use the predictors Xk to compute the probability that the event occurs, that is, the probability that Y = 1. But how to do this?

How the Logistic Model connects predictors to events
1. P(Y=1|Xk) is related to the Xk's by taking a weighted sum of the predictors, called xbeta, and substituting xbeta into the logistic function:
      P = P(Y=1|Xk) = 1 / (1 + exp(-xbeta)),   where xbeta = β0 + Σk βk Xk
   and the β's are unknown parameters.
2. We need to find the β's which best connect the Xk's to the Y's:
   if Y = 1, we want P to be big; if Y = 0, we want 1 - P to be big.
3. This is done by maximizing the likelihood function L as a function of the β's:
      Max over β of  L(β | X) = Πi Pi^Yi * (1 - Pi)^(1 - Yi)
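The fitted model on the next slides can be sanity-checked with a small Python sketch (illustration only; the talk's code is SAS). Using the intercept and X coefficient from the Wald table, and ignoring the negligible G effect, the logistic function reproduces the raw event rates implied by the DATALINES frequencies; the names b0, b1 below are ours, not from the slides.

```python
import math

def logistic(xbeta):
    """Logistic function: maps a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-xbeta))

# Fitted coefficients from the PROC LOGISTIC output shown below
b0, b1 = -3.0018, 0.9220     # intercept and coefficient of X

p_x0 = logistic(b0)          # P(Y=1 | X=0), ignoring the negligible G effect
p_x1 = logistic(b0 + b1)     # P(Y=1 | X=1)

# Raw event rates from the DATALINES frequencies
rate_x0 = (50 + 37) / (1000 + 750 + 50 + 37)     # events / total when X = 0
rate_x1 = (100 + 75) / (800 + 600 + 100 + 75)    # events / total when X = 1
```

The fitted probabilities agree with the raw rates to about four decimal places, as expected since the model (with the near-zero G effect) is nearly saturated in X.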

PROC LOGISTIC Output: Significance Test

DATA WORK;
INPUT Y X G$ F;
DATALINES;
0 0 M 1000
0 0 F 750
1 0 M 50
1 0 F 37
0 1 M 800
0 1 F 600
1 1 M 100
1 1 F 75
;
PROC LOGISTIC DATA=WORK DESC;
CLASS G;
MODEL Y = X G;
FREQ F;

The Wald chi-square with 1 d.f. tests H0: βX = 0 vs. H1: βX ≠ 0.

Analysis of Maximum Likelihood Estimates
                          Standard    Wald
Parameter   DF  Estimate  Error       Chi-Square  Pr > ChiSq
Intercept   1   -3.0018   0.1103      741.2621    <.0001
X           1    0.9220   0.1360       45.9666    <.0001
G       F   1   -0.0023   0.0655        0.0013     0.9716

The Estimate column gives the fitted betas.

PROC LOGISTIC Output

Model Fit Statistics
-2 Log L is the log(likelihood) multiplied by -2.

                Intercept     Intercept and
Criterion       Only          Covariates
AIC             1850.302      1805.330
SC              1856.437      1823.735
-2 Log L        1848.302 (r)  1799.330 (f)     f = full, r = restricted

Testing Global Null Hypothesis: BETA=0
Likelihood Ratio chi-square = [-2 Log L(r)] - [-2 Log L(f)]

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   48.9721      2    <.0001
Score              48.6156      2    <.0001
Wald               45.9677      2    <.0001

The Score chi-square approximates the LR chi-square, but it is a computational short-cut. The SCORE CHI-SQUARE is what makes Best Subsets model selection possible: it ranks ALL possible models, efficiently. (Discussed later; SAS code about the Score chi-square is on the web site.)

B. Creating the Analytic Data Set and Sampling Decisions

We are assuming we have at least semi BIG DATA.

Training, Validation, Test; #Events; #Predictors
- We will assume the Analytic Dataset has > 1,500 events.
- Split the Analytic Dataset into Training, Validation, and Test data sets.
- Training and Validation are used together in model fitting.

- Test is used only in the final evaluation (place it in Al Gore's Lock Box).
- One suggestion: split the Analytic Dataset into Training / Validation / Test by 40% / 30% / 30%.

How many predictors can be in the model?
Guideline: #Events / 10 > degrees of freedom used in the model.
- If #Events = 600, then 600/10 = 60 = max d.f. in the model.
- If C is a class predictor with 10 levels, it uses 9 d.f.

Under-Sample the Non-Events (Y=0) to avoid long run-times
Under-sampling does not create bias: in expectation only the intercept is affected (although, due to random sampling, all parameter estimates will change). The predicted P's are wrong but can be adjusted by any of three methods. Assume the non-events are sampled at 1%:
1. In PROC LOGISTIC: PRIOREVENT = event / (event + 100*nonevents), where the counts of events and non-events come from the sample.
2. In a DATA step after PROC LOGISTIC, add: adj_P = P / (P + (1 - P) * 100);
3. Use WEIGHT wgt; in PROC LOGISTIC, with the weight computed upstream as wgt = 100*(Y=0) + (Y=1); No adjustment of P is needed, but the likelihood-based and Wald statistics are inflated and not reliable for evaluating the model.

C. Preparing, Screening, Transforming of Predictors

Five steps to prepare predictors before modeling (we discuss #3):
1. Inspect predictors for missing values, extremes, and errors (EDA)
2. Screen out (eliminate) predictors with little predictive power
3. Transform predictors to achieve the best fit
4. Remove predictors to avoid multicollinearity
5. Test for interactions among the X's and add them to the candidate list

Classify X as Nominal, Discrete, or Continuous
- Nominal: X values are labels, even if they are numbers (Democrat, Republican, Independent coded 1, 2, 3; but you can't do math on them).

- Discrete: X is numeric with a few distinct values (e.g., number of children in the household).
- Continuous: X is numeric with many distinct values (money, time, distance).
Is X discrete or continuous? Sometimes a fuzzy judgement call is made.

Recommended transformations:
- Nominal and Discrete: optimal binning and the weight-of-evidence (WOE) transformation
- Continuous: the Function Selection Procedure (FSP)

Weight of Evidence (WOE) Transformation
X: predictor; Y: target (response). There cannot be any zeros in the Y=0 and Y=1 columns.

X      Y=0   Y=1   Col % Y=0   Col % Y=1   WOE = Log(%Y=1 / %Y=0)
X=X1   2     1     0.400       0.333       -0.1823
X=X2   1     1     0.200       0.333        0.5108
X=X3   2     1     0.400       0.333       -0.1823
SUM    5     3     1.000       1.000

If X = X3 then X_woe = -0.1823. In general, if X = Xk then X_woe = log(%Y=1 / %Y=0 | X = Xk). X is transformed to X_woe.

WOE vs. CLASS vs. Dummies
THREE MODELS, SAME RESULT:
PROC LOGISTIC; MODEL Y = X_woe; OUTPUT OUT = OUT1 P = P1;               (WOE)
PROC LOGISTIC; CLASS X; MODEL Y = X; OUTPUT OUT = OUT2 P = P2;          (CLASS)
PROC LOGISTIC; MODEL Y = X_dum1 X_dum2; OUTPUT OUT = OUT3 P = P3;       (DUMMY)

Y   X    X_woe     X_dum1   X_dum2
0   X1   -0.1823   1        0
0   X1   -0.1823   1        0
1   X1   -0.1823   1        0
0   X2    0.5108   0        1
1   X2    0.5108   0        1
0   X3   -0.1823   0        0
0   X3   -0.1823   0        0
1   X3   -0.1823   0        0

P1 = P2 = P3. Note that X_woe uses the same d.f. as CLASS X.

Information Value (IV) of X

X      Y=0   Y=1   Col % Y=0 (A)   Col % Y=1 (B)   (B)-(A) = (C)   WOE (D)   IV term (C)*(D)
X=X1   2     1     0.400           0.333           -0.0667         -0.1823   0.0122
X=X2   1     1     0.200           0.333            0.1333          0.5108   0.0679
X=X3   2     1     0.400           0.333           -0.0667         -0.1823   0.0122
SUM    5     3     1.000           1.000                                     IV = 0.0923

IV measures predictive power:
IV      Interpretation
0.02    un-predictive
0.1     weak
0.2     medium
0.3     strong

Drop predictors below IV = 0.1?
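The WOE and IV arithmetic in the tables above is easy to verify outside SAS; here is a small Python sketch (illustration only; the talk's workflow is SAS) that reproduces the table's numbers from the raw counts.

```python
import math

# Non-event / event counts by level of X, from the WOE and IV slides
counts = {"X1": (2, 1), "X2": (1, 1), "X3": (2, 1)}  # level: (#Y=0, #Y=1)

tot0 = sum(n0 for n0, n1 in counts.values())  # 5 non-events
tot1 = sum(n1 for n0, n1 in counts.values())  # 3 events

woe = {}
iv = 0.0
for level, (n0, n1) in counts.items():
    pct0, pct1 = n0 / tot0, n1 / tot1     # column percentages (A) and (B)
    woe[level] = math.log(pct1 / pct0)    # WOE = log(%Y=1 / %Y=0)
    iv += (pct1 - pct0) * woe[level]      # accumulate IV term (C) * (D)
```

Replacing each level of X by `woe[level]` is exactly the X_woe transformation, and `iv` matches the slide's total of about 0.092.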

See the paper and code by Lin (SGF 2015).

Why Bin (Collapse) X Before WOE Coding?
- WOE coding burns up degrees of freedom (over-fitting).
- Collapsing levels of X saves d.f. (parsimony) but reduces IV.
- But there is a win-win: do the collapsing, because usually IV decreases very little in the early stages of collapsing.
- Sometimes we want X_woe to have monotonic values versus the ordering of X; collapse until this is achieved. Example: X=1, X_woe = -0.5; X=2, X_woe = 1.0; X=3, X_woe = 2.3.
How to do the collapsing? The next slide gives one approach.

Optimal Binning via %Best_Collapse, Lund / Brotherton (MWSUG 2013)
The macro collapses (bins) the levels of X, guided by IV and log likelihood (LL), in mode (A)ny pair of levels or ad(J)acent levels only, and writes the SAS WOE code.

DATA EXAMPLE;
INPUT X $ Y F;   /* X = level to be binned, Y = target (0, 1), F = frequency */
DATALINES;
A 0 4
A 1 6
B 0 8
B 1 4
C 0 2
C 1 5
D 0 3
D 1 9
;
%Best_Collapse(Dataset, X, Y, F, IV, A, , , , WOE)   /* Dataset = name of the data set */

k   X_STAT (c)   -2*Log L   IV        Bins
4   0.68750      50.6084    0.51783   A | B | C | D
3   0.68382      50.6373    0.51441   A | B | C+D
2   0.65196      51.2002    0.45335   A+C+D | B

X_STAT = the c statistic of PROC LOGISTIC; CLASS X; MODEL Y = X;

Function Selection Procedure (FSP) for Continuous X
- FSP finds the best transformation of a continuous X for use in logistic regression.
- FSP was developed in the mid-1990s by P. Royston, W. Sauerbrei, and D. Altman for bio-statistical applications.
- See the book Multivariable Model-building (2008) by Royston & Sauerbrei.
- See Lund (SGF 2015) for discussion and SAS macros to run FSP.
- See the Appendix for more slides about FSP.

D. Finding Multiple Candidate Logistic Models for Evaluation and Comparison

Strategy for Selection of Multiple Models
- Use the Schwarz-Bayes Criterion (SBC) to rank ALL (or the most promising) models; this ranking is based on models fitted to the TRAINING data set. (SBC is defined on the next slide.)
- Select, perhaps, 3 to 20 candidate models to be measured on the Validation sample.
- Use the Validation sample to assess prediction for each of the 3 to 20 models. Select a final model.
- Use TEST for measurement of the final model's performance.

SBC and Ranking Models
SBC = -2 * LL + log(n) * K, where K = d.f. in the model and n = the sample size.
- A model with more log likelihood is better: MORE FIT. Equivalently, a model with less -2 * LL is better.
- Adding the penalty log(n) * K makes -2 * LL an "honest" measure of fit.
- The model with the smaller SBC is better (better honest fit).
- All models can be ranked by their SBC: smaller is better.
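As a check on the definitions, this Python sketch (illustration only; the talk's code is SAS) reproduces the SC and AIC values from the earlier PROC LOGISTIC fit-statistics slide. There, n = 3412 is the sum of the FREQ counts, and K = 3 counts the intercept, X, and the 1 d.f. of the two-level CLASS variable G. It also computes the chi-square(1) tail probabilities behind the Type 1 error rates quoted on a later slide.

```python
import math

def sbc(neg2_log_l, k, n):
    """Schwarz-Bayes Criterion: -2*LL plus a log(n) penalty per model d.f."""
    return neg2_log_l + math.log(n) * k

def aic(neg2_log_l, k):
    """Akaike Information Criterion: -2*LL plus a penalty of 2 per model d.f."""
    return neg2_log_l + 2 * k

n = 3412  # 1000 + 750 + 50 + 37 + 800 + 600 + 100 + 75
sc_full  = sbc(1799.330, 3, n)   # reproduces the SC row: 1823.735
sc_int   = sbc(1848.302, 1, n)   # intercept-only SC: 1856.437
aic_full = aic(1799.330, 3)      # reproduces the AIC row: 1805.330

# Type 1 error of "add X if D > penalty," where D ~ chi-square(1) under H0.
# For 1 d.f., P(D > x) = erfc(sqrt(x / 2)).
p_sbc = math.erfc(math.sqrt(math.log(10_000) / 2))  # SBC rule at n = 10,000
p_aic = math.erfc(math.sqrt(2 / 2))                 # AIC rule (penalty = 2)
```

The two tail probabilities come out to about 0.24% and 15.7%, matching the "SBC and Type 1 Error Probability" slide.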

An alternative is AIC = -2 * LL + 2 * K.
When computing K, a CLASS variable C (or its C_woe version) with L levels contributes L - 1 d.f.; K also includes 1 for the intercept.

Concept developers: Hirotugu Akaike (1927-2009) and Gideon Schwarz (1933-2007).
SAS logistic training (with a focus on predictive modeling) uses SBC in generating multiple models for evaluation and comparison.

SBC vs. AIC
Both SBC (1978) and AIC (1973) are firmly grounded in (complex) theory, and there is no view that one is always superior to the other. I think SBC is better for predictive modeling in DM / CR because SBC prevents over-fitting in models with large n and many Xk.

SBC and Type 1 Error Probability
X is added to the model only if SBC_before > SBC_after:
   -2 * LL(K) + log(n) * K > -2 * LL(K+1) + log(n) * (K + 1)
   that is, D = [-2 * LL(K)] - [-2 * LL(K+1)] > log(n)
If n = 10,000, then log(10000) = 9.2, so add X only if D > 9.2.
Under H0: βX = 0, D is a chi-square with 1 d.f. (if X is not a CLASS variable), and Prob(D > 9.2 | βX = 0) = 0.24%. So 0.24% is the Type 1 error probability with SBC and n = 10,000: we are unlikely to add an insignificant X.
Using AIC the threshold is 2, and Prob(D > 2 | βX = 0) = 15.7%.

Finding Candidate Models for Evaluation and Comparison
Three methods, each of which can be successful:
1. A subject matter expert fits models by guided trial and error. But this is not reproducible as a process.
2. Find all (or many) models using PROC LOGISTIC Best Subsets (i.e., SELECTION=SCORE) and rank them by (pseudo) SBC.
3. Use PROC HPLOGISTIC (SAS/STAT 13.2) to find many good models and rank them by SBC. Best Subsets is not available in HPLOGISTIC, so what to do? Use a new method, to be explained.

Example Data Set
The slides to follow use the data set getStarted from the PROC HPLOGISTIC documentation (n = 100). Only Y, C, X1, X2, and X8 are used in the examples; C is nominal.

data getStarted;
   input C$ Y X1-X10;
   datalines;
D 0 10.2 6 1.6 38 15 2.4 20 0.8 8.5 3.9
F 1 12.2 6 2.6 42 61 1.5 10 0.6 8.5 0.7
D 1  7.7 1 2.1 38 61 1.0 90 0.6 7.5 5.2
... 97 more lines ...
;

Backward Plus with PROC HPLOGISTIC

ods output SelectionDetails = seldtl_b;    /* variables removed by BACKWARD */
ods output CandidateDetails = candtl_b;    /* other candidates for removal (the PLUS) */
PROC HPLOGISTIC DATA = getStarted;
   CLASS C;
   MODEL Y (descending) = X1 X2 X8 C;
   SELECTION METHOD=BACKWARD (SELECT = SBC CHOOSE = SBC STOP = NONE) DETAILS=ALL;
DATA canseldtl_b;
   MERGE seldtl_b candtl_b;
   BY step;
PROC PRINT DATA = canseldtl_b;

SELECT=SBC finds the X which gives the smallest SBC if X is removed.

Models from Backward Plus

Step   Effect Removed         Other Candidates (SBC if removed instead)   SBC after removal
0      (full: C X1 X2 X8)                                                 .
1      C                      X1 (155.19), X2 (159.94), X8 (160.36)       129.14
2      X1                     X8 (131.61), X2 (131.74)                    128.19
3      X2                     X8 (128.70)                                 128.38
4      X8                                                                 128.22

Step 1: C is removed and the new SBC is 129.14. The not-chosen candidates at step 1 are X1, X2, and X8, listed with the SBC each would give if removed instead.
Best model: X2 X8 (the model after step 2).

All Subsets vs. Backward Plus
If there are K predictors, then the number of all subsets is 2^K - 1.

Backward Plus provides only 1 + 2 + ... + K = (K+1)*K/2 models.

Backward pass for K = 4 (all subsets = 15, Backward Plus = 10). Variables removed by SBC, in order: C, X1, X2, X8. The first model listed at each step is the model after removal; the next are the other (not chosen) candidates; subsets in brackets are never examined.

Step 0 (full):  C X1 X2 X8
Step 1:  X1 X2 X8;   candidates: C X2 X8, C X1 X2, C X1 X8
Step 2:  X2 X8;      candidates: X1 X2, X1 X8;   [C X1, C X2, C X8]
Step 3:  X8;         candidate: X2;              [X1, C]
Step 4:  null

Go Crazy: Backward-Forward Plus (SAS code on the web site)
ALSO run HPLOGISTIC FORWARD and add its candidate models. For this example the FORWARD pass adds {X1}, {C}, and {C X8}, but {C X1} and {C X2} are still omitted:

Step 0 (full):  C X1 X2 X8
Step 1:  X1 X2 X8;   candidates: C X2 X8, C X1 X2, C X1 X8
Step 2:  X2 X8;      candidates: X1 X2, X1 X8, C X8;   [C X1, C X2]
Step 3:  X8;         candidates: X2, X1, C
Step 4:  null

If there are K predictors, B-F Plus finds K models at each step after step 0 (excluding the null step), so Total = K^2 - K + 1, far fewer than 2^K - 1. Does B-F Plus find the best-SBC model? Usually.

B-F Plus Model Ranking
SBC is approximated, and there are small differences between the SBC values from FORWARD and BACKWARD; take the average if a model appears in both.

Variables in Model (consolidated F and B)   Average SBC
X2 X8                                       128.237
X8                                          128.554
X2                                          128.869
X1 X2 X8                                    130.155
X1 X2                                       131.608
X1                                          131.822
X1 X8                                       131.922
C X2 X8                                     155.285
C X8                                        156.167
C                                           157.357
C X1 X2 X8                                  159.123
C X1 X8                                     159.939
C X1 X2                                     160.356

X2 X8 is still best after adding the Forward models.

HPLOGISTIC, CLASS Variables, and WOE Variables
- Find the top-SBC models using CLASS variables in the Backward and Forward runs.
- Then substitute WOE-coded variables for the CLASS variables.
- Fit and evaluate the models on the VALIDATION and TEST data sets.

PROC LOGISTIC Best Subsets
Best Subsets using PROC LOGISTIC with SELECTION=SCORE. The example to follow again uses the data set getStarted from the HPLOGISTIC documentation (n = 100).
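The subset counts quoted on the last few slides are easy to tabulate (a Python sketch, illustration only):

```python
def n_all_subsets(k):
    """Number of non-empty subsets of k predictors (All Subsets)."""
    return 2**k - 1

def n_backward_plus(k):
    """Models examined by Backward Plus: 1 + 2 + ... + k."""
    return k * (k + 1) // 2

def n_bf_plus(k):
    """Models examined by Backward-Forward Plus: K^2 - K + 1."""
    return k * k - k + 1

# (all subsets, Backward Plus, B-F Plus) for a few predictor counts
counts = {k: (n_all_subsets(k), n_backward_plus(k), n_bf_plus(k))
          for k in (4, 10, 30)}
```

For K = 4 this gives (15, 10, 13), matching the slides; at K = 30, All Subsets would require over a billion models while B-F Plus examines only 871.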

SELECTION = SCORE Syntax (aka Best Subsets)

PROC LOGISTIC;
   MODEL Y = / SELECTION = SCORE START=s1 STOP=s2 BEST=b;

- Look at models with a count of predictors between s1 and s2.
- For each k in [s1, s2], find the b "best" models having k predictors. (What is "best"? Explained on the next slide.)

Example: with 4 predictors and START=1 STOP=4 BEST=3, we get 10 models:
- 1-variable models: {X1}, {X2}, {X3}, {X4}: take the best 3
- 2-variable models: {X1 X2}, {X1 X3}, {X1 X4}, {X2 X3}, {X2 X4}, {X3 X4}: take the best 3
- 3-variable models: {X1 X2 X3}, {X1 X2 X4}, {X1 X3 X4}, {X2 X3 X4}: take the best 3
- 4-variable models: {X1 X2 X3 X4}: only 1 to take

What is "Best"?

PROC LOGISTIC DATA = getStarted DESCENDING;
   MODEL Y = / SELECTION = SCORE START = s1 STOP = s2 BEST = b;

The "BEST" models are those with the highest Score chi-square among the models having k variables. Using the Score chi-square is the short-cut that makes Best Subsets possible, but SBC is not available.

Recall: the Score chi-square gives the significance of the model.

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   48.9721      2    <.0001
Score              48.6156      2    <.0001
Wald               45.9677      2    <.0001

The short-cut: the Score chi-square does not require maximizing the likelihood.

CLASS Not Allowed for SELECTION = SCORE
CLASS is not allowed for Best Subsets; you must use WOE or dummy variables.

- Using dummies greatly increases the number of subsets: impractical.
- But trouble: PROC LOGISTIC is unaware of upstream WOE coding, so the d.f. for a WOE variable is counted as 1, not L-1 (where the WOE variable has L levels). Keep this in mind.
Other reasons to use WOE instead of dummies in modeling:
- Some dummies may not be selected in a final model (unintended binning); this applies to Stepwise and Best Subsets methods.
- Coefficients of dummies may not be ordered in the final model as was expected.

Class Variable to WOE
Here is code to convert C from getStarted to C_woe:

DATA getStarted;
SET getStarted;
IF C in ( "A" ) THEN C_woe = -0.809318612 ;
IF C in ( "B" ) THEN C_woe = 0.1069721196 ;
IF C in ( "C" ) THEN C_woe = 0.6177977433 ;
IF C in ( "D" ) THEN C_woe = -0.403853504 ;
IF C in ( "E" ) THEN C_woe = -1.145790849 ;
IF C in ( "F" ) THEN C_woe = -0.809318612 ;
IF C in ( "G" ) THEN C_woe = -0.703958097 ;
IF C in ( "H" ) THEN C_woe = 0.1069721196 ;
IF C in ( "I" ) THEN C_woe = 0.1069721196 ;
IF C in ( "J" ) THEN C_woe = 1.4932664807 ;
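Outside SAS, the same recode is just a table lookup. Here is a hypothetical Python equivalent of the IF/THEN block above (illustration only):

```python
# WOE values for C, as produced by %Best_Collapse on the getStarted data
c_woe = {
    "A": -0.809318612, "B": 0.1069721196, "C": 0.6177977433,
    "D": -0.403853504, "E": -1.145790849, "F": -0.809318612,
    "G": -0.703958097, "H": 0.1069721196, "I": 0.1069721196,
    "J": 1.4932664807,
}

def recode(c_value):
    """Map a level of C to its WOE value (same effect as the SAS IF/THEN code)."""
    return c_woe[c_value]
```

Notice that levels sharing a WOE value (A and F; B, H, and I) have in effect been collapsed into the same bin.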

The WOE values come from:
%Best_Collapse(getStarted, C, Y, 1, IV, A, , , , WOE)

Degrees of Freedom for WOE-coded Variables

PROC FORMAT LIBRARY = Work;
   VALUE $DF "C_woe" = 9 Other = 1;
run;

This must be done for all WOE variables available to the models.

Ranking Models from SELECTION=SCORE: ScoreP in place of SBC
Since SBC is not available from SELECTION = SCORE, instead use:
   ScoreP = -Score Chi-Sq + log(n) * (DF + 1)     ("P" for penalized Score chi-square)
DF must include the d.f. of WOE-coded predictors (we'll use the FORMAT above). ScoreP might sort the models differently than SBC would, but the top models will be found.

Computing ScoreP and Ranking Models

ODS OUTPUT Bestsubsets = Score_Out;
PROC LOGISTIC DATA = getStarted;
   MODEL Y = X1 X2 X8 C_woe / SELECTION = SCORE START=1 STOP=4 BEST=big;

BEST= is set big enough that ALL models (with size between START and STOP) are output to Score_Out; here "big" = 6 will work. We need ALL models so that we can compare all of them AFTER the DF corrections are made for the WOE variables.

PROC PRINT DATA = Score_Out;

Obs    NumberOfVariables   ScoreChiSq   VariablesInModel
1      1                   12.3744      C_woe
2      1                    4.2986      X8
3-14   (omitted)
15     4                   22.3223      X1 X2 X8 C_woe

A DATA step then reads Score_Out and uses the format $DF. to compute:
   ScoreP = -Score Chi-Sq + log(n) * (DF + 1)
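Applying that formula by hand is straightforward. This Python sketch (illustration only; the talk's code is SAS) reproduces the ScoreP values shown on the next slide, using the corrected DF of 9 + 1 + 1 = 11 when C_woe appears alongside two other predictors:

```python
import math

def score_p(score_chisq, df, n):
    """Penalized Score chi-square: ScoreP = -ScoreChiSq + log(n) * (DF + 1)."""
    return -score_chisq + math.log(n) * (df + 1)

n = 100  # getStarted has 100 observations

# (variables, Score chi-square, corrected DF) for a few rows of Score_Out;
# C_woe carries 9 d.f. via the $DF format, every other predictor 1 d.f.
rows = [
    ("X2 X8",        8.9499,  2),
    ("X8",           4.2986,  1),
    ("X2",           3.9882,  1),
    ("X1 X2 C_woe", 17.1507, 11),
]
ranked = sorted(rows, key=lambda r: score_p(r[1], r[2], n))  # smaller is better
```

The ranking puts X2 X8 first, in agreement with both the sorted PROC PRINT and the HPLOGISTIC B-F Plus result.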

Computing ScoreP and Ranking Models (SAS code on the web site)

PROC SORT DATA = Score_Out; BY ScoreP;
PROC PRINT DATA = Score_Out;

Obs    NumberOfVariables   ScoreChiSq   VariablesInModel   DF   ScoreP
1      2                    8.9499      X2 X8              2     4.8656
2      1                    4.2986      X8                 1     4.9117
3      1                    3.9882      X2                 1     5.2221
4-14   (omitted)
15     3                   17.1507      X1 X2 C_woe        11   38.1113

X2 X8 is best here, and it is also the best model found by HPLOGISTIC Backward plus candidates.

Overview: PROC LOGISTIC with SELECTION = SCORE
- If there are more than roughly 50 X's, SELECTION=SCORE is not feasible (run-time). Use SELECTION=BACKWARD FAST to reduce the X's to about 30 or fewer. If you need K = 50 in the model, use HPLOGISTIC B-F Plus instead.
- Use judgment regarding START and STOP (e.g., why use START = 1?). Then create ALL models by using BEST = big.
- Rank these models by the penalized Score chi-square.
- Pick a cut-off of M models (3 to 20) and measure them on the Validation sample.

E. Evaluation of Models
Now we have 3 to 20 models to evaluate on the Validation sample. How do we do this? Discuss after lunch! Keith Shields will present. Tell your friends to attend.

Contact Information
Bruce Lund and Keith Shields
Magnify Analytic Solutions, Division of Marketing Associates
[email protected]
[email protected]

Appendix: Function Selection Procedure

Function Selection Procedure (FSP)
Step 1: First translate X, if needed, so that it is positive. Then compute:

FP1: g(X, p) = β0 + β1 * X^p. There are 8 functions in FP1.
FP2: G(X, p1, p2) = β0 + β1 * X^p1 + β2 * X^p2 for p1 ≠ p2, and
     G(X, p1, p1) = β0 + β1 * X^p1 + β2 * X^p1 * log(X) for p1 = p2.
There are 36 functions in FP2.
Here the X^p are called fractional polynomials, with p taken from S = {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, where p = 0 denotes log(X).
FSP does this: find the best (maximum likelihood) transform of X from among FP1 and the best transform from among FP2.

Function Selection Procedure (FSP)
Step 2 is a 3-step test that tells us to drop X, use X (linear), use the best FP1, or use the best FP2. Each test statistic is
   Test Stat = {-2 Log(L) restricted} - {-2 Log(L) full} ~ Chi-Sq.
The 3 steps, each at the chosen significance level:
1. A 4 d.f. test of the best FP2 against the null model. If the test is not significant, drop X and stop; else continue.
2. A 3 d.f. test of the best FP2 against linear X. If the test is not significant, stop (the final model is linear); else continue.
3. A 2 d.f. test of the best FP2 vs. the best FP1. If the test is significant, the final model is the best FP2; otherwise it is the best FP1.

Function Selection Procedure (FSP)
Example: G(X, -1, -1) = β0 + β1 * X^-1 + β2 * X^-1 * log(X).
FP1 functions are monotonic; FP2 functions (see graph) can be non-monotonic.

A Check List when Starting the Project
In the beginning, important actions are taken:
- Define the target variable (events and non-events).
- Decide to use a logistic model (and perhaps other methods).
- Determine data sources; resolve IT issues.
- The values of the predictor variables are determined as of an "obs-date".

- Decide whether the obs-date varies by record or is a fixed obs-date for all records.
Compile the Analytic Data Set and make important decisions:
- Whether to segment (e.g., by purchase history) and model by segment.
- Allocation percentages to the Training, Validation, and Test data sets. Training and Validation are used in model fitting; Test is used in the final evaluation.
- Whether to sub-sample the target non-events; this applies within segments.

Segmenting the Analytic Data Set
A model for each segment:
Upside:
- Opportunity to spread the logistic probabilities: the segments may have very different average event rates, giving better lifts.
- Allows use of variables unique to a segment (e.g., current loan customers applying for a new loan vs. never-seen-before customers).
- Avoids using many interaction variables in a single (no-segment) model.
Downside:
- More work (modeling), more maintenance, and possible small-sample-size issues.
