Software Testing for Machine Learning

Automated Test Oracles for Machine Learning Software
PD Dr. Steffen Herbold

Contents
- Motivation
- Oracles and pseudo oracles
- Trivial oracles

Machine Learning Software

Modern applications: fraud detection, credit scoring, object recognition, autonomous driving.

Machine Learning and Safety

ISO 26262 standard: Road vehicles, Functional safety, Part 6: Product development at the software level.

Automotive Safety Integrity Level D (ASIL D): "ASIL D represents likely potential for severely life-threatening or fatal injury in the event of a malfunction and requires the highest level of assurance that the dependent safety goals are sufficient and have been achieved." The standard defines requirements for quality assurance.

Testing of Machine Learning

Training data → learning algorithm → model; the model is evaluated on test data with ML metrics.

Excerpt from ISO 26262, quality assurance for ASIL D:
- Equivalence class analysis
- Boundary value analysis
- Modified Condition / Decision Coverage
- ...

Requires a Complex Test Process
- Specification
- Correctness
- Robustness
- Boundaries
- Coverage
- Equivalence classes
- Exception scenarios

(Mock book cover: "Machine Learning Software Testing for Dummies")

Scientific Literature?

Contents
- Motivation
- Oracles and pseudo oracles
- Trivial oracles

Software Testing!

A test driver feeds the input of each test case (1 to n) into the test object and compares the results with the expectations:
- Result 1 vs. Expectation 1
- Result n vs. Expectation n

A test passes if the result matches the expectation.

The Test Oracle

The oracle determines the expectation.

The Oracle Problem

Testing with Pseudo Oracles
- Simulates a real oracle
- Useful to validate properties of algorithms

M. D. Davis and E. J. Weyuker, "Pseudo-oracles for non-testable programs," in Proceedings of the ACM '81 Conference, 1981.

Pseudo Oracles for Machine Learning
- Approach 1: Metamorphic testing
- Approach 2: Comparison of different implementations

C. Murphy, G. E. Kaiser, and M. Arias, "An approach to software testing of machine learning applications," in SEKE, vol. 167, 2007.

Approach 1: Metamorphic Testing

How does the output change if I manipulate the input?

Metamorphic testing: a morphism is applied to the training data; the learning algorithm is run on both the original and the morphed training data, and the resulting original model and morphed model are compared.

Metamorphic Relations: Naive Bayes Classifier

The data consists of features and a class label. Examples of relations:
- Permuting class labels does not affect results
- Permuting the feature ordering does not affect results
- Adding constants to numeric features does not affect results
- Adding new features with a constant value does not affect results

X. Xie, J. W. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, "Testing and validating machine learning classifiers by metamorphic testing," Journal of Systems and Software, vol. 84, no. 4, pp. 544-558, 2011.

Approach 2: Testing Through Other Implementations

Does my algorithm yield the same results as the competition?
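The metamorphic relations listed above translate directly into executable tests. A sketch, using a toy nearest-centroid classifier as an illustrative stand-in (not the classifier from the cited study):

```python
# Metamorphic tests for a toy nearest-centroid classifier: the relations
# "permuting feature ordering" and "adding a constant to all numeric
# features" must leave the predictions unchanged.
from statistics import mean

def fit(X, y):
    """Return one centroid (mean feature vector) per class."""
    classes = sorted(set(y))
    return {c: [mean(x[j] for x, label in zip(X, y) if label == c)
                for j in range(len(X[0]))] for c in classes}

def predict(model, X):
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(model, key=lambda c: dist2(x, model[c])) for x in X]

X = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.2]]
y = ["a", "a", "b", "b"]
X_test = [[0.1, 0.8], [0.95, 0.1]]

original = predict(fit(X, y), X_test)

# Relation 1: permuting the feature ordering must not affect results.
perm = lambda rows: [[r[1], r[0]] for r in rows]
assert predict(fit(perm(X), y), perm(X_test)) == original

# Relation 2: adding a constant to all numeric features must not affect results.
shift = lambda rows: [[v + 5.0 for v in r] for r in rows]
assert predict(fit(shift(X), y), shift(X_test)) == original
```

No expectation for the concrete predictions is needed: the morphed model only has to agree with the original model, which is exactly what makes this a pseudo oracle.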

Comparison of Implementations

The same training data is given to implementation A and implementation B; model A and model B are then compared.
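A sketch of this comparison for simple linear regression, with two independent illustrative implementations (closed form vs. gradient descent, both written here for the example):

```python
# Pseudo oracle by cross-implementation comparison: two independent
# implementations of simple linear regression (y = a*x + b) should
# agree up to numerical tolerance.

def fit_closed_form(xs, ys):
    # Ordinary least squares via the closed-form solution.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def fit_gradient_descent(xs, ys, lr=0.01, steps=20000):
    # The same model fitted by plain gradient descent on the mean
    # squared error.
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        err = [a * x + b - y for x, y in zip(xs, ys)]
        a -= lr * 2 / n * sum(e * x for e, x in zip(err, xs))
        b -= lr * 2 / n * sum(err)
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

a1, b1 = fit_closed_form(xs, ys)
a2, b2 = fit_gradient_descent(xs, ys)

# The pseudo oracle: both implementations must agree within tolerance.
assert abs(a1 - a2) < 1e-4 and abs(b1 - b2) < 1e-4
```

Note the tolerance: since the two implementations are only numerically equivalent, exact equality would be the wrong oracle.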

Comparison of Linear Regression

B. D. McCullough, Taha Mokfi, and Mahsa Almaeenejad, "On the accuracy of linear regression routines in some data mining packages," WIREs Data Mining Knowl Discov, 2018, doi: 10.1002/widm.1279.

Contents
- Motivation
- Oracles and pseudo oracles
- Trivial oracles

Testing of Classification Algorithms
- Pilot study
- Simple metamorphic tests that can be applied to (almost) any classification algorithm
- Automated tests for basic functioning: smoke testing
- Application of the tests to state-of-the-art software

Six Metamorphic Tests

Same results if:
- the data does not change
- 1 is added to all numeric features
- the order of the instances changes
- the order of the features changes
- meta data changes

The results are the opposite if the class labels are inverted.

Smoke Testing

Validate basic properties of the implementation:
- No crashes
- Return values exist and are not null

For machine learning:
- Models can be trained
- Predictions can be made

No oracle required!

Design of the Smoke Tests

What are good training/test data for smoke tests?
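One possible shape for such smoke tests, sketched with a hypothetical stand-in "learner" (a majority-class predictor) in place of a real library:

```python
# Smoke-test sketch: no oracle is needed, we only check that training and
# prediction run without crashing and return non-null values, even on
# edge-case data (all zeros, values close to machine precision).
import sys

def fit(X, y):
    # Toy stand-in for any classifier: predict the majority class.
    return max(set(y), key=y.count)

def predict(model, X):
    return [model for _ in X]

smoke_data = {
    "all zeros": [[0.0, 0.0], [0.0, 0.0]],
    "machine precision": [[sys.float_info.min, sys.float_info.epsilon]] * 2,
    "uniform [0,1]": [[0.1, 0.9], [0.4, 0.2]],
}

for name, X in smoke_data.items():
    model = fit(X, ["yes", "no"])   # must not crash
    preds = predict(model, X)       # must not crash
    assert preds is not None and all(p is not None for p in preds), name
```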

22 Smoke Tests for Classification
- Data in [0, 1]
- Features close to machine precision
- Random classes
- All numeric values 0
- Only a single value in a class

Prototype: atoml

Results for Weka
- Only two algorithms passed all tests without problems
- Decision trees crash on some of the data; reason: stack overflow due to recursion
- Adding +1 changes results
- Permutations change results
- No problems if the data does not change

Results for scikit-learn and Spark

Developer Feedback

The results were reported on the developer mailing lists:
- Weka: wekalist@list.waikato.ac.nz
- scikit-learn: [email protected]
- Spark: dev@spark.apache.org

Positive Feedback
- "This is definitely helpful!"
- "This sounds like a useful tool."
- "Thanks for sharing your analysis."
- "We really need more work in this direction."
- "Do you have any interest in applying your knowledge to industry?"

However...
- "However, I wanted to point out not all your expectations are warranted ..."
- "I suspect that ..."

Deviations Are Not Always Wrong
- Minibatches and bagging: instances/features are subdivided; the partitions depend on the random seed and the order of the data
- Asymmetric initializations
- Order as tie breaker
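The tie-breaker point can be made concrete with a toy one-nearest-neighbor example (constructed for illustration, not taken from the talk):

```python
# Why a deviation is not automatically a bug: a 1-NN classifier that breaks
# distance ties by taking the first nearest training instance. Permuting
# the instance order then legitimately changes the prediction for a tied
# point, violating the "order of instances" metamorphic relation.

def predict_1nn(train, x):
    # min() returns the FIRST minimal element, so ties depend on data order.
    return min(train, key=lambda item: sum((a - b) ** 2
                                           for a, b in zip(item[0], x)))[1]

train = [((0.0, 0.0), "a"), ((2.0, 0.0), "b")]
tied_point = (1.0, 0.0)  # exactly equidistant from both training instances

assert predict_1nn(train, tied_point) == "a"
assert predict_1nn(list(reversed(train)), tied_point) == "b"
```

Both outcomes are defensible classifications; the metamorphic relation is simply too strict for this algorithm.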

Users vs. Devs

User: "How should I know every detail?"

"We could try and document all the cases where the result will not fulfill these invariances, but I think that might be too much. At some point we need the users to understand what's going on. If you look at the random forest algorithm and you fix the random state it's obvious that feature order matters." (scikit-learn core developer)

Note: Weka's random forest fulfills this property.

Deviation vs. Significance

"Was the final difference in accuracy statistically significant?" (creator of Apache Spark)

Further Steps
1. Significance tests for differences
2. Algorithm-specific metamorphic tests
3. Other types of algorithms, e.g., clustering
4. Use combinatorial testing
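For step 1, a significance check on the disagreements between two classifiers can be sketched with an exact McNemar test; the discordant counts in the example are hypothetical:

```python
# Significance-test sketch: an exact two-sided McNemar test on the
# disagreements between two classifiers evaluated on the same test set.
from math import comb

def mcnemar_exact(b, c):
    """b and c are the discordant counts: instances where exactly one
    of the two classifiers is correct. Under H0 (equal accuracy), the
    smaller count follows Binomial(b + c, 0.5)."""
    n, k = b + c, min(b, c)
    p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)

# Hypothetical example: among the disagreements, classifier A alone is
# correct 15 times, classifier B alone 5 times.
p = mcnemar_exact(15, 5)
print(f"p = {p:.4f}")  # → p ≈ 0.0414, so the difference is significant at 0.05
```

Only the disagreements matter here, which makes the test a natural fit for comparing two implementations on identical test data.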

Hyperparameters and Testing

How do hyperparameters affect my tests?

Combinatorial Testing of Hyperparameters
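Pairwise combinatorial testing of hyperparameters can be sketched as follows; the parameter grid is hypothetical, and the greedy construction is one simple way to build a covering suite:

```python
# Pairwise (2-way) combinatorial testing: every pair of hyperparameter
# values appears in at least one configuration, with fewer runs than the
# full Cartesian product. Built greedily: repeatedly pick the candidate
# configuration that covers the most still-uncovered pairs.
from itertools import combinations, product

params = {                      # hypothetical hyperparameter grid
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.1, 1.0, 10.0],
    "scale": [True, False],
}
names = list(params)

all_pairs = {((n1, v1), (n2, v2))
             for n1, n2 in combinations(names, 2)
             for v1 in params[n1] for v2 in params[n2]}

def pairs_of(config):
    return {((n1, config[n1]), (n2, config[n2]))
            for n1, n2 in combinations(names, 2)}

suite, uncovered = [], set(all_pairs)
while uncovered:
    best = max((dict(zip(names, vals)) for vals in product(*params.values())),
               key=lambda c: len(pairs_of(c) & uncovered))
    suite.append(best)
    uncovered -= pairs_of(best)

print(len(suite), "configurations instead of",
      len(list(product(*params.values()))))
```

For this 3x3x2 grid, the suite stays well below the 18 configurations of the full product while still exercising every value pair at least once.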

J. Chandrasekaran, H. Feng, Y. Lei, D. R. Kuhn, and R. Kacker, "Applying combinatorial testing to data mining algorithms," in 2017 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), March 2017, pp. 253-261.

Agenda
- Introduction and motivation
- The oracle problem and pseudo oracles
- Case study for classification problems
- Outlook and summary

The Future (?)
- Quality assurance and machine learning
- Joint conferences
- Tools and methods
- (Modification of) standards

Summary
