Short overview of Weka 1 Weka: Explorer Visualisation Attribute selections Association rules Clusters Classifications Weka: Memory issues Windows Edit the RunWeka.ini file in the directory of installation of Weka maxheap=128m -> maxheap=1280m Linux Launch Weka using the command ($WEKAHOME is the

installation directory of Weka) Java -jar -Xmx1280m $WEKAHOME/weka.jar 3 ISIDA ModelAnalyser Features: Imports output files of general data mining programs, e.g. Weka Visualizes chemical structures Computes statistics for classification models Builds consensus models by combining different individual models 4

Foreword For time reason: Not all exercises will be performed during the session They will not be entirely presented neither Numbering of the exercises refer to their numbering into the textbook. 5 Ensemble Learning Igor Baskin, Gilles Marcou and Alexandre Varnek 6 Hunting season Single hunter Courtesy of Dr D. Fourches

Hunting season Many hunters What is the probability that a wrong decision will be taken by majority voting? Probability of wrong decision ( < 0.5) < 0.5)) Each voter acts independently 45% 40% 35% 30% 25% =0.4 =0.3 =0.2 =0.1 20%

15% 10% 5% 0% 1 3 5 7 9 11 13 15

17 19 More voters less chances to take a wrong decision ! 9 The Goal of Ensemble Learning Combine base-level models which are diverse in their decisions, and complementary each other Different possibilities to generate ensemble of models on one same initial data set Compounds - Bagging and Boosting

Descriptors - Random Subspace Machine Learning Methods - Stacking 10 Principle of Ensemble Learning Perturbed sets ENSEMBLE Matrix 1 Learning algorithm

Model M1 Matrix 2 Learning algorithm Model M2 Matrix 3 Learning algorithm Model Me Training set

C1 D1 Dm Consensus Model Cn Compounds/ Descriptor Matrix 11 Ensembles Generation: Bagging

Compounds - Bagging and Boosting Descriptors - Random Subspace Machine Learning Methods - Stacking 12 Bagging

Bagging = Bootstrap Aggregation Introduced by Breiman in 1996 Based on bootstraping with replacement Usefull for unstable algorithms ( < 0.5)e.g. decision trees) Leo Breiman (1928-2005) Leo Breiman (1996). Bagging predictors. Machine Learning. 24(2):123-140. 13 Bootstrap Sample Si from training set S Training set S Dm D1

Dm D1 C1 C3 C2 C2 C3 Si C2 C4

C4 . . . . . . Cn C4 All compounds have the same probability to be selected Each compound can be selected several times or

even not selected at all (i.e. compounds are sampled randomly with replacement) Efron, B., & Tibshirani, R. J. (1993). "An introduction to the bootstrap". New York: Chapman & Hall 14 Bagging Data with perturbed sets of compounds Training set S1 C1 C3

Cn ENSEMBLE Learning algorithm Model M1 C1 C2 C4 . . . C4 C2

C8 C2 S2 C9 C7 C2 C2 Voting (classification) Learning algorithm Model M2 C1 Se

C4 C1 C3 C4 C8 Consensus Model Averaging (regression) Learning algorithm Model Me 15 Classification - Descriptors ISIDA descritpors: Sequences

Unlimited/Restricted Augmented Atoms Nomenclature: txYYlluu. x: type of the fragmentation YY: fragments content l,u: minimum and maximum number of constituent atoms Classification - Data Acetylcholine Esterase inhibitors ( < 0.5) 27 actives, 1000 inactives) 16 Classification - Files train-ache.sdf/test-ache.sdf Molecular files for training/test set train-ache-t3ABl2u3.arff/test-ache-t3ABl2u3.arff descriptor and property values for the training/test set

ache-t3ABl2u3.hdr descriptors' identifiers AllSVM.txt SVM predictions on the test set using multiple fragmentations 17 Regression - Descriptors ISIDA descritpors: Sequences Unlimited/Restricted Augmented Atoms Nomenclature: txYYlluu. x: type of the fragmentation YY: fragments content l,u: minimum and maximum number of constituent atoms

Regression - Data Log of solubility ( < 0.5) 818 in the training set, 817 in the test set) 18 Regression - Files train-logs.sdf/test-logs.sdf Molecular files for training/test set train-logs-t1ABl2u4.arff/test-logs-t1ABl2u4.arff descriptor and property values for the training/test set logs-t1ABl2u4.hdr descriptors' identifiers AllSVM.txt SVM prodictions on the test set using multiple fragmentations 19

Exercise 1 Development of one individual rules-based model (JRip method in WEKA) 20 Exercise 1 Load train-ache-t3ABl2u3.arff 21 Exercise 1 Load test-ache-t3ABl2u3.arff 22 Exercise 1

Setup one JRip model 23 Exercise 1: rules interpretation 187. (C*C),(C*C*C),(C*C-C),(C*N),(C*N*C),(C-C),(C-C-C),xC* 188. (C-N),(C-N-C),(C-N-C),(C-N-C),xC 189. (C*C),(C*C),(C*C*C),(C*C*C),(C*C*N),xC 24 Exercise 1: randomization What happens if we randomize the data and rebuild a JRip model ? 25

Exercise 1: surprizing result ! Changing the data ordering induces the rules changes 26 Exercise 2a: Bagging Reinitialize the dataset In the classifier tab, choose the meta classifier Bagging 27 Exercise 2a: Bagging Set the base classifier as JRip Build an ensemble of 1 model

28 Exercise 2a: Bagging Save the Result buffer as JRipBag1.out Re-build the bagging model using 3 and 8 iterations Save the corresponding Result buffers as JRipBag3.out and JRipBag8.out Build models using from 1 to 10 iterations 29 Bagging 0.88 0.86 0.84 Classification ROC AUC

0.82 0.8 AChE 0.78 0.76 ROC AUC of the consensus model as a function of the number of bagging iterations 0.74 0.72 0.7 0.68 0 1

2 3 4 5 6 7 8 9 10 Number of bagging iterations

30 Bagging Of Regression Models 31 Ensembles Generation: Boosting Compounds - Bagging and Boosting Descriptors -

Random Subspace Machine Learning Methods - Stacking 32 Boosting Boosting works by training a set of classifiers sequentially by combining them for prediction, where each latter classifier focuses on the mistakes of the earlier classifiers. AdaBoost classification Yoav Freund Regression boosting

Robert Shapire Jerome Friedman Yoav Freund, Robert E. Schapire: Experiments with a new boosting algorithm. In: Thirteenth International Conference on Machine Learning, San Francisco, 148-156, 1996. J.H. Friedman (1999). Stochastic Gradient Boosting. Computational Statistics and Data Analysis. 38:367-378. 33 Boosting for Classification. AdaBoost w w w w Training set C1 C2

C3 C4 . . . Cn C1 C2 C3 C4 S1 . . . Cn

w S2 w C1 w C2 w C3 w C.4 . . w Cn Se w w

w w w C1 C2 C3 C4 . . . n C e ENSEMBLE

e e e Learning algorithm Model M1 e Weighted averaging & thresholding e e e

e Learning algorithm Model M2 Learning algorithm Model Mb Consensus Model e

34 Developing Classification Model Load train-ache-t3ABl2u3.arff In classification tab, load test-ache-t3ABl2u3.arff 35 Exercise 2b: Boosting In the classifier tab, choose the meta classifier AdaBoostM1 Setup an ensemble of one JRip model 36 Exercise 2b: Boosting Save the Result buffer as JRipBoost1.out Re-build the boosting model using 3 and 8 iterations Save the corresponding Result buffers as

JRipBoost3.out and JRipBoost8.out Build models using from 1 to 10 iterations 37 Boosting for Classification. AdaBoost 0.83 0.82 Classification ROC AUC 0.81 0.8 AChE 0.79

0.78 ROC AUC as a function of the number of boosting iterations 0.77 0.76 0.75 0.74 0 1 2 3 4

5 6 7 8 9 10 Log(Number of boosting iterations) 38 Bagging vs Boosting 1 1

0.95 0.95 0.9 0.9 0.85 0.85 0.8 0.8 0.75 0.75

0.7 1 10 Base learner JRip 100 Bagging Boosting 0.7 1 10 100 1000

Base learner DecisionStump 39 Conjecture: Bagging vs Boosting Bagging leverages unstable base learners that are weak because of overfitting (JRip, MLR) Boosting leverages stable base learners that are weak because of underfitting (DecisionStump, SLR) 40 Ensembles Generation: Subspace Random Compounds

- Bagging and Boosting Descriptors - Random Subspace Machine Learning Methods - Stacking 41 Random Subspace Method Tin Kam Ho

Introduced by Ho in 1998 Modification of the training data proceeds in the attributes ( < 0.5)descriptors) space Usefull for high dimensional data Tin Kam Ho (1998). The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(8):832-844. 42 Random Subspace Method: Random Descriptor Selection Training set with initial pool of descriptors C1 D2 D3 D4

. . . D1 Dm Cn C1 D3 D2 Dm All descriptors have the same probability to be

selected Each descriptor can be selected only once Only a certain part of descriptors are selected in each run D4 Cn Training set with randomly selected descriptors 43 Random Subspace Method Data sets with randomly selected descriptors S1

D4 D2 D3 ENSEMBLE Learning algorithm Model M1 Voting (classification) Training set D1 D2 D3 D4 Dm S2 D1 D2 D3

Learning algorithm Model M2 Consensus Model Averaging (regression) Se D4 D2 D1 Learning algorithm Model Me 44

Developing Regression Models Load train-logs-t1ABl2u4.arff In classification tab, load test-logs-t1ABl2u4.arff 45 Exercise 7 Choose the meta method Random SubSpace. 46 Exercise 7 Base classifier: Multi-Linear Regression without descriptor selection Build an ensemble of 1

model then build an ensemble of 10 models. 47 Exercise 7 1 model 10 models 48 Exercise 7 49 Random Forest Random Forest = Bagging + Random Subspace

Particular implementation of bagging where base level algorithm is a random tree Leo Breiman (1928-2005) Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32. 50 Ensembles Generation: Stacking Compounds - Bagging and Boosting

Descriptors - Random Subspace Machine Learning Methods - Stacking 51 Stacking David H. Wolpert Introduced by Wolpert in 1992 Stacking combines base learners by means of a separate meta-learning method using their

predictions on held-out data obtained through crossvalidation Stacking can be applied to models obtained using different learning algorithms Wolpert, D., Stacked Generalization., Neural Networks, 5(2), pp. 241-259., 1992 Breiman, L., Stacked Regression, Machine Learning, 24, 1996 52 Stacking The same data set Different algorithms Data set S Learning algorithm L1

ENSEMBLE Model M1 Machine Learning Meta-Method (e.g. MLR) Training set C1 Cn D1 Dm Data set S

Data set S Learning algorithm L2 Model M2 Data set S Learning algorithm Le

Model Me Consensus Model 53 Exercise 9 Choose meta method Stacking Click here 54 Exercise 9 Delete the classifier ZeroR Add PLS classifier (default parameters) Add Regression Tree M5P (default

parameters) Add Multi-Linear Regression without descriptor selection 55 Exercise 9 Click here Select Multi-Linear Regression as meta-method 56 Exercise 9 57 Exercise 9

Rebuild the stacked model using: kNN (default parameters) Multi-Linear Regression without descriptor selection PLS classifier (default parameters) Regression Tree M5P 58 Exercise 9 59 Exercise 9 - Stacking Learning algorithm R (correlation coefficient)

RMSE MLR PLS M5P (regression trees) 1-NN (one nearest neighbour) 0.8910 0.9171 0.9176 1.0068 0.8518 0.8461 0.8455

1.1889 Stacking of MLR, PLS, M5P 0.9366 0.7460 Stacking of MLR, PLS, M5P, 1-NN 0.9392 0.7301 Regression models for LogS

60 Conclusion Ensemble modeling converts several weak classifiers ( < 0.5)Classification/Regression problems) into a strong one. There exist several ways to generate individual models Compounds Descriptors Machine Learning Methods 61 Thank you and Question s? Ducks and hunters, thanks to D. Fourches 62

Exercise 1 Development of one individual rules-based model for classification (Inhibition of AChE) One individual rules-based model is very unstable: the rules change as a function of ordering the compounds in the dataset 63 o M l de 1 2 el od M

el d o m 3 Ensemble modelling M od el 4 LR M NN

M V S Ensemble modelling kN N