NAVIGATING THE PERILOUS WATERS OF VALIDATION: THE CASE OF RETINAL IMAGE ANALYSIS
(in four easy parts)

Emanuele Trucco, FRSA, FIAPR
VAMPIRE project
Computer Vision and Image Processing group
Computing, School of Science and Engineering
University of Dundee, UK

"Our observation of nature must be diligent, our reflection profound, and our experiments exact"
Denis Diderot

PART 1
A BRIEF WORD ON CONTEXT
(where we come from)

VAMPIRE
VESSEL ASSESSMENT and MEASUREMENT PLATFORM for IMAGES of the RETINA
- RESEARCH: DELIVER EFFECTIVE, ADVANCED IMAGE AND DATA ANALYSIS SOFTWARE TOOLS SUPPORTING CLINICAL EYE-RELATED RESEARCH
- TRAINING: SPECIALISTS OF IMAGE AND DATA ANALYSIS IN AN INTERDISCIPLINARY ENVIRONMENT
- TRANSLATION: MAKE A DIFFERENCE TO HEALTHCARE

TWO VAMPIRE GAMES
- BIOMARKERS: n = lots
  THE RETINA AS A WINDOW ON THE BODY: dementias, CVD, diabetes complications, ...
- THE EYE CLINIC: n = 1

Cameron et al.: Lateral thinking: interocular symmetry, Progress in Retinal and Eye Research 2017
McGrory et al.: Retinal microvascular network geometry, British Journal of Ophthalmology 2016
Taylor et al.: Retinal vascular fractal dimension, PLoS ONE 2015
Manivannan and Trucco: Subcategory classifiers for MIL, IEEE TMI 2017
Franco-Cardenas et al.: Assessment of Ischemic Index in Retinal Vascular Disease, Seminars in Ophthalmology 2016
Annunziata et al.: Accelerating convolutional sparse coding, IEEE TMI 2017

VAMPIRE 3.1
SEMI-AUTOMATIC TOOL
- >100 morphometry measurements of the retinal vasculature per image
- Include width-related, tortuosity, bifurcations, fractal dimension
- By zone, vascular tree, vessel

VALIDATION
Trucco, Ruggeri, Karnowski et al. (15 international groups): Validating retinal fundus image analysis algorithms: issues and a proposal. Investigative Ophthalmology and Visual Science, vol 54, 2013.
A Lisowska, R Annunziata, GK Loh, D Karl, E Trucco: An Experimental Assessment of Five Indices of Retinal Vessel Tortuosity with the RET-TORT Public Dataset. Proc. IEEE EMBC14 Intern Conf on Engineering in Medicine and Biology, Chicago (USA), Aug 2014.
JR Cameron, RD Megaw, AJ Tatham, S McGrory, TJ MacGillivray, FN Doubal, JM Wardlaw, E Trucco, S Chandran, B Dhillon: Lateral thinking - interocular symmetry and asymmetry in neurovascular patterning, in health and disease. Progress in Retinal and Eye Research vol 59, pp 131-157, 2017.

PART 2
FIRST DEFINITION - AND CONSEQUENCES
(coastal navigation with charts)

VALIDATION: A FIRST DEFINITION
Validation = the process of showing that an algorithm performs correctly by comparing its output with a reference standard.
WHY, ISN'T THIS ALL WE NEED?
NO. SORRY. LIFE IS MORE COMPLEX.

SKELETAL VALIDATION PROCEDURE
(staying with first definition)
GIVEN CLINICAL MOTIVATION AND SOFTWARE TOOL:
1. PROCURE CLINICALLY RELEVANT, WELL-CHARACTERIZED DATA SET
2. PROCURE ANNOTATIONS FROM WELL-CHARACTERIZED EXPERTS
3. COMPUTE AUTOMATIC MEASUREMENTS
4. COMPARE STATISTICALLY EXPERTS' ANNOTATIONS WITH AUTOMATIC MEASUREMENTS

ESSENTIAL FIRST-DEF TOOLKIT
1. CORRELATION, AGREEMENT, ASSOCIATION: Pearson, Spearman (rank), biserial (number vs dichotomous), intra-class correlation coefficient, Cohen's kappa
2. VISUALIZATION METHODS: scattergrams, Bland-Altman graphs
3. ROC CURVES, SENSITIVITY / SPECIFICITY, PRECISION-RECALL
4. STATISTICAL ANALYSIS: statistical tests, significance levels (are paired measurements different by chance?)

LET'S BE CAREFUL WITH THE FIRST DEFINITION
Validation = the process of showing that an algorithm performs correctly by comparing its output with a reference standard.
EXAMPLE: SUPER-HUMAN PERFORMANCE (NOTICE: VERY GOOD CNN WORK!)
- SEGMENTATION TASK: RETINAL VASCULATURE
- PUBLIC DATA SETS: DRIVE (40 IMGS), STARE (20)
- TWO ANNOTATORS, O1 AND O2
- TAKE O1 AS GOLD STANDARD TO TRAIN/TEST
- COMPARE WITH O2 TO CHECK HUMAN PERFORMANCE
- SYSTEM ACHIEVES BETTER AGREEMENT WITH O1 THAN O2 DOES
Maninis et al.: Deep Retinal Image Understanding, MICCAI 2016

(PARTIAL) TROUBLE LIST
Name raises expectations hugely, BUT
- Only 2 annotators: very preliminary
- Unspecified annotators (e.g. experience): experience? seniority?
- Doctors not trained to annotate in our sense: artificial task for doctors
- Task-dependent annotations (in doctor's mind): annotation protocols?
- Volume of data set: very preliminary
- Representativeness of data sets: cohort? disease? race?
See also MICCAI 2017 poster: How many radiologists does it take to annotate a lesion?

LESSON LEARNT?
DIFFERENT PURPOSES, DIFFERENT VALIDATION
1. NOVEL-ALGORITHM PAPER (LIMITED DATA SETS OK, BUT SPELL OUT LIMITS)
2. CLINICALLY USEFUL METHOD (or grand claims) (EXTENSIVE VALIDATION PROTOCOL ETC)
PLEASE PRESENT VALIDATION RESULTS REALISTICALLY
DO NOT BE AFRAID OF TRUTH! (MORE USEFUL THAN BRAGGING)
Check out the 10 greatest cases of fraud in university research on the web!
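The first-definition toolkit listed on the ESSENTIAL FIRST-DEF TOOLKIT slide can be sketched in a few lines. The following is a minimal, standard-library-only illustration of four of its items (Pearson correlation, Bland-Altman limits of agreement, Cohen's kappa, sensitivity / specificity). The function names and the sample vessel-width values are hypothetical and are not taken from VAMPIRE or from any cited paper; the 1.96 multiplier assumes roughly normal differences.

```python
import math


def pearson(x, y):
    """Pearson correlation between two paired measurement lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) of agreement.

    Limits are bias +/- 1.96 * SD of the paired differences.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    bias = sum(d) / n
    sd = math.sqrt(sum((v - bias) ** 2 for v in d) / (n - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical labels.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(r1)
    labels = set(r1) | set(r2)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in labels)
    return (observed - expected) / (1 - expected)


def sens_spec(pred, ref):
    """Sensitivity and specificity of binary predictions against a reference."""
    tp = sum(p == 1 and r == 1 for p, r in zip(pred, ref))
    tn = sum(p == 0 and r == 0 for p, r in zip(pred, ref))
    fp = sum(p == 1 and r == 0 for p, r in zip(pred, ref))
    fn = sum(p == 0 and r == 1 for p, r in zip(pred, ref))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical vessel widths (pixels): expert annotation vs automatic tool.
expert = [10.2, 11.5, 9.8, 12.1, 10.9]
automatic = [10.0, 11.9, 9.5, 12.4, 11.0]
print("Pearson r:", pearson(expert, automatic))
print("Bland-Altman (bias, lower, upper):", bland_altman(expert, automatic))
```

Correlation alone does not show agreement (a tool that doubles every width still correlates perfectly), which is one reason the toolkit pairs it with Bland-Altman analysis.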

FIRST DEF: GOOD-PRACTICE PROPOSALS
Jannin, Grova, Maurer's proposal on validation protocols
P Jannin, C Grova, C R Maurer: Model for defining and reporting reference-based validation protocols in medical image processing. Int J CARS vol 1, 2006.
L. Maier-Hein, A. Groch, A. Bartoli et al.: Comparative Validation of Single-Shot Optical Techniques for Laparoscopic 3-D Surface Reconstruction. IEEE TMI Vol. 33, No. 10, 2014.
Notice: definitions of verification and validation as per software engineering

FIRST DEF: THE BEST YOU CAN ACHIEVE

By definition, it makes no sense to claim higher accuracy than that of the ground truth (given your annotations).
Ground truth measurements vary with annotators. Hence the best possible performance is reached when:
The difference between measurements from program and any annotator is statistically the same as that between any two annotators,
i.e., the program looks like an annotator.
BUT different experiences etc? Hold on! STAPLE coming in a few minutes.
R. Annunziata, A. Kheirkhah, S. Aggarwal, P. Hamrah and E. Trucco, A Fully Automated Tortuosity Quantification System with Application to Corneal Nerve Fibres in Confocal Microscopy Image, Medical Image Analysis, vol 32, August 2016.
R. Annunziata, A. Kheirkhah, S. Aggarwal, B.M. Cavalcanti, P. Hamrah and E. Trucco, Two-Dimensional Plane for Multi-Scale Quantification of Corneal Subbasal Nerve Tortuosity, Investigative Ophthalmology & Visual Science, Vol. 57, No. 3, 2016.

PART 3
BEYOND THE FIRST DEFINITION
(approaching the open ocean)

A WIDER VIEW ON VALIDATION
A COMPLEX AND INTERDISCIPLINARY PROCESS, OF WHICH SOFTWARE VERIFICATION IS ONLY ONE STEP
[Flowchart; recoverable labels: START: CLINICAL NEED; BUILD SOFTWARE PROTOTYPE; DOES IT WORK WITH TASK-SPECIFIC DATA AND ANNOTATIONS?; DOES IT WORK WITHIN A PROCEDURE / SYSTEM / ETC?; DOES IT HELP IN CLINICAL TRIALS / STUDIES?; END: CLINICS, MEDICAL RESEARCH. Side labels: FIRST DEFINITION; VALIDATION ON OUTCOME]
Friedman CP, Wyatt JC: Evaluation methods in biomedical informatics (2nd edition). New York: Springer Publishing, October 2005. XXX others

VALIDATION ON OUTCOME
TEST THE EFFICACY / PERFORMANCE OF A SOFTWARE MODULE FROM THE OUTPUT OF THE PIPELINE
(adv: no annotations on the specific item computed; doctor trained for task!)
EXAMPLE 1: DIABETIC RETINOPATHY SCREENING
- Detect microaneurysms automatically
- Test on outcome: association between detector output and referral by gold-standard expert
EXAMPLE 2: PROGRESSION TO PLUS DISEASE IN ROP (omitted)
NOT WITHOUT CHALLENGES!

Defining outcome; black box mixing many details.
Trucco, Ruggeri, Karnowski et al. (15 international groups): Validating retinal fundus image analysis algorithms: issues and a proposal. Investigative Ophthalmology and Visual Science, vol 54, 2013.
Fleming AD, Philip S, Goatman KA, Prescott GJ, Sharp PF, Olson JA: The evidence for automated grading in diabetic retinopathy screening. Curr Diabetes Rev. 7(4), 2011.
Worrall, Wilson, Brostow: Automated Retinopathy of Prematurity Case Detection with Convolutional Neural Networks. MICCAI Deep Learning Workshop 2016.
Fleck BW, Williams C, Juszczak E, et al.; BOOST II Retinal Image Digital Analysis (RIDA) Group: An international comparison of retinopathy of prematurity grading performance within the Benefits of Oxygen Saturation Targeting II trials. Eye (Lond). 2017.

DON'T JUST ASK FOR ANNOTATIONS
(the need and importance of precise annotation protocols)
EXPERIENCE: ANNOTATING LOCATIONS OF ARTERIAL STENOSIS IN WHOLE-BODY MRA SCANS
- PROTOCOL AGREED WITH 3 RADIOLOGISTS
- TRAINING FOR ANNOTATION
- SOFTWARE TOOL PROVIDED
STILL, PUZZLING INTER-OBSERVER VARIATIONS!
- Agree annotation protocol with clinical annotators (CAs)
- Train CAs on software
- Organize annotation rehearsal
Figure courtesy of Andrew McNeil, CVIP Dundee; paper in preparation
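Inter-observer variation is also the yardstick in the "best you can achieve" criterion stated earlier: the program should differ from any annotator no more than annotators differ from each other. The sketch below is a deliberately crude range check on mean absolute differences, not a statistical test; a real study would need a proper test of the kind the slide calls for ("statistically the same"). All names and numbers are hypothetical.

```python
from itertools import combinations


def mean_abs_diff(a, b):
    """Mean absolute difference between two paired measurement lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def annotator_likeness(annotators, program):
    """Compare program-vs-annotator with annotator-vs-annotator differences.

    annotators: dict mapping annotator name -> list of measurements
    program:    list of measurements on the same items
    Returns the two lists of differences and a crude verdict: True when
    the program is never further from an annotator than the most
    discordant pair of annotators are from each other.
    """
    inter = [
        mean_abs_diff(annotators[p], annotators[q])
        for p, q in combinations(sorted(annotators), 2)
    ]
    prog = [mean_abs_diff(program, m) for m in annotators.values()]
    return {
        "inter_annotator": inter,
        "program_vs_annotator": prog,
        "within_human_range": max(prog) <= max(inter),
    }


# Hypothetical tortuosity grades from three annotators and a program.
graders = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [1, 3, 3]}
print(annotator_likeness(graders, [1.5, 2.5, 3.5]))
```

The range check only illustrates the logic of the criterion; with few annotators it says little, which echoes the "only 2 annotators" item in the trouble list.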

SCORING ANNOTATIONS WHILE SEGMENTING: STAPLE
EM ALGORITHM ESTIMATES HIDDEN GROUND TRUTH AND PERFORMANCE OF ANNOTATIONS (TEMPLATES)
(majority: all votes the same)
Figure from Warfield, Zou, Wells: Simultaneous Truth and Performance Level Estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE TMI 23(7), 2004.
Akhondi-Asl and Warfield: Simultaneous Truth and Performance Level Estimation through fusion of probabilistic segmentations. IEEE TMI 32(10), 2013.

PART 4
WHAT NEXT?
(venturing into the uncharted ocean)
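Before venturing on, the STAPLE idea from the previous slide can be sketched for the binary case. This is a simplified illustration written for these notes, not the authors' implementation: it assumes a fixed global foreground prior (the mean of all votes), equal initial sensitivity and specificity for every annotator, independence between annotators, and no spatial model. The function name, defaults and toy data are hypothetical.

```python
def staple_binary(decisions, n_iter=50, init=0.9, eps=1e-6):
    """EM estimate of hidden binary truth and per-annotator performance.

    decisions: list of annotators, each a list of 0/1 votes per pixel.
    Returns (weights, sens, spec) where weights[i] is the posterior
    probability that pixel i is foreground, and sens[j] / spec[j] are
    the estimated sensitivity / specificity of annotator j.
    """
    def clamp(v):
        # Keep probabilities away from 0 and 1 to avoid degenerate products.
        return min(max(v, eps), 1.0 - eps)

    n_raters, n_pix = len(decisions), len(decisions[0])
    prior = clamp(sum(sum(d) for d in decisions) / (n_raters * n_pix))
    sens = [init] * n_raters
    spec = [init] * n_raters
    weights = [prior] * n_pix

    for _ in range(n_iter):
        # E-step: posterior of "true foreground" at each pixel.
        for i in range(n_pix):
            fg, bg = prior, 1.0 - prior
            for j in range(n_raters):
                if decisions[j][i]:
                    fg *= sens[j]
                    bg *= 1.0 - spec[j]
                else:
                    fg *= 1.0 - sens[j]
                    bg *= spec[j]
            weights[i] = fg / (fg + bg)

        # M-step: re-estimate each annotator's sensitivity and specificity.
        w_sum = sum(weights)
        for j in range(n_raters):
            tp = sum(w for w, d in zip(weights, decisions[j]) if d)
            tn = sum(1.0 - w for w, d in zip(weights, decisions[j]) if not d)
            sens[j] = clamp(tp / w_sum)
            spec[j] = clamp(tn / (n_pix - w_sum))

    return weights, sens, spec


# Toy example: annotators A and B agree, C misses two pixels and adds one.
A = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
B = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
C = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
weights, sens, spec = staple_binary([A, B, C])
print("consensus:", [int(w > 0.5) for w in weights])
print("sensitivity:", sens)
print("specificity:", spec)
```

Unlike a plain majority vote, the estimate weights each annotator by inferred performance, so a consistently discordant annotator counts for less.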

LEARNING WITH WEAK ANNOTATIONS: MIL
- CLASSIFIER LEARNS REFER / NO-REFER IMGS
- NO MANUAL SEGMENTATION (R/NR LABEL ONLY)
- LEARNS IMPLICITLY DR LESIONS (MULTIPLE-INSTANCE LEARNING)
- ALL 6 MAIN LESIONS LEARNT WITH 600-IMGS TRAINING (MESSIDOR DATA SET):
  0.881 ACCURACY ON 600 IMGS FROM MESSIDOR
  0.761 ON 25,702 IMGS FROM eOPHTA
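The weak-annotation idea can be illustrated with the standard multiple-instance assumption: an image (bag) is REFER if at least one of its patches (instances) looks abnormal. The toy sketch below learns a single score threshold from image-level labels only. It is not the method of the papers cited on this slide; the per-patch scores, names and data are hypothetical.

```python
def predict_bag(bag, threshold):
    """A bag is positive if its highest-scoring instance reaches the threshold."""
    return max(bag) >= threshold


def fit_threshold(bags, labels):
    """Pick the threshold with the best bag-level accuracy.

    bags:   list of lists of per-instance abnormality scores
    labels: list of bools (True = REFER), one per bag; no instance labels
    """
    candidates = sorted({score for bag in bags for score in bag})
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum(predict_bag(b, t) == y for b, y in zip(bags, labels))
        acc = correct / len(bags)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical patch scores for four images; only image labels are given.
bags = [[0.1, 0.2, 0.9], [0.2, 0.1, 0.3], [0.8, 0.1], [0.3, 0.2]]
labels = [True, False, True, False]
threshold, accuracy = fit_threshold(bags, labels)

# Instances at or above the learnt threshold are the implicit "lesions".
flagged = [[s for s in bag if s >= threshold] for bag in bags]
print(threshold, accuracy, flagged)
```

Although no patch was ever labelled, the patches that trigger a REFER decision are singled out as a by-product, which is the sense in which lesions are learnt implicitly.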

G Quellec, M Lamard, M Abramoff et al.: A multiple-instance learning framework for diabetic retinopathy screening, Medical Image Analysis vol 16, 2012.
S Manivannan and E Trucco: Subcategory Classifiers for Multiple-Instance Learning and Its Application to Retinal Nerve Fiber Layer Visibility Classification. IEEE Trans on Medical Imaging, vol 36 no 5, May 2017.
RELATED MACHINE LEARNING / COMPUTER VISION WORK: AUTO-ANNOTATIONS, CONCEPT DISCOVERY, XFER LEARNING

LEARNING WITHOUT ANNOTATIONS: RCA
PREDICTING THE PERFORMANCE OF A SEGM METHOD ON NEW DATA
"Our results indicate that, at least to some extent, it is indeed possible to predict the performance level of a segmentation method on each individual case, in the absence of ground truth" [1]
FOCUS: SEGMENTATION, LARGE-SCALE STUDIES (e.g. UKBB)
[1] Vanya V. Valindria, Ioannis Lavdas, Wenjia Bai, et al.: Reverse Classification Accuracy: Predicting Segmentation Performance in the Absence of Ground Truth. IEEE TMI 2017.
T. Kohlberger, V. Singh, C. Alvino, C. Bahlmann, L. Grady: Evaluating segmentation error without ground truth. Proc MICCAI 2012.
H. Zhang, J. E. Fritts, S. A. Goldman: Image segmentation evaluation: A survey of unsupervised methods. Computer Vision and Image Understanding, vol. 110, 2008.
Figure and quote from [1]

AUTOMATIC ANNOTATIONS AND GT
(aka who needs experts?)
[1] L Ballerini, L Bonaldi, E Menti, A Ruggeri, E Trucco: Automatic Generation of Synthetic Retinal Fundus Images: Vascular Network. MICCAI Workshop on Simulation and Synthesis in Medical Imaging (SASHIMI), 2016.

a. GENERATING SYNTHETIC GROUND TRUTH
[2] Costa, Galdran et al.: Towards adversarial retinal image synthesis. arXiv:1701.08974v1 [cs.CV], 2017.
b. ANNOTATING REAL DATA AUTOMATICALLY
[3] Shotton et al.: Efficient pose estimation from single depth images. IEEE PAMI 35(12), 2013.
[4] Guillaumin et al.: ImageNet auto-annotation with segmentation propagation. Int J Comput Vis, 2014.
[Figures from [1], [2], [3] and [4]]

AUTOMATIC ANNOTATIONS (b): DOES IT WORK?
- SEGMENTATION MASKS = ANNOTATIONS
- TESTS ON SEGMENTATION QUALITY
- CAN AN AUTOMATIC SEGMENTATION PROVIDE GROUND TRUTH TO VALIDATE AN AUTOMATIC SEGMENTATION?
- BACK TO VALIDATION ON OUTCOME!

CROWDSOURCING ANNOTATIONS
SEVERAL STUDIES SHOW IT IS FEASIBLE (e.g. DR)
D Mitry et al.: The accuracy and reliability of crowdsource annotations of digital retinal images. Translational Vision Science and Technology 5(5), 2016.
CERTAINLY TO COLLECT ANNOTATIONS, e.g. REGIONS
Maier-Hein et al.: Can masses of non-experts train highly accurate image classifiers? A crowdsourcing approach to instrument segmentation in laparoscopic images. MICCAI 2014.
BUT TRANSLATION INTO SCREENING NOT TRIVIAL: RELIABILITY, CONTINUITY, BELIEVABILITY, NATIONAL HEALTH SYSTEM DYNAMIC INERTIA
Mitry D, Peto T, Hayat S, et al.: Crowdsourcing as a screening tool to detect clinical features of glaucomatous optic neuropathy from digital photography. PLoS One 2015.

CROWDSOURCING: HUMANS AND BORGS
HUMANS
- Center for Open Science's Many Labs: crowdsourcing human analyses
- CrowdMed
Silberzahn and Uhlmann: Crowdsourced research: many hands make tight work. Comment, Nature 526, 189-191, 2015.
BORGS
- Human Dx (IBM Watson): harness both machine learning and the crowdsourced wisdom of human physicians
- Babylon Health: AI-powered chatbot triage service (aim: full diagnosis via AI). >$85m investment, 250,000 pieces of medical advice

LET'S WRAP IT UP
(back to harbour)

FINAL THOUGHTS
"When you think you have all the answers, the world comes and changes all the questions"
J F Pinto, quoted in [1]
First definition:
- Still very necessary; please use properly!
- Good practice must be spread
- Comprehensive, regularly updated, multi-cohort data sets
Generation of large-scale synthetic ground truth very promising, especially with DL / GAN, but still much to do
What validation is required for methods to be accepted and used clinically?
Can we really test without ground truth? (Hmm! More probably with limited GT)
[1] A Espinosa: Si tú me dices ven lo dejo todo pero dime ven. Debolsillo, 2011.

CVIP / VAMPIRE DUNDEE
Dr Shazia Akbar; Dr Hind Azegrouz, SNCCR, Spain; Dr Roberto Annunziata, UCL; Dr Lucia Ballerini, UoEdinburgh; Dr Colin Buchanan, Epipole; Kai Sing Chin; Tianjun Huang; Dr Wenqi Li, UCL; Prof Stephen MacKenna; Andrew McNeil; Dr Siyamalan Manivannan; Dr Enrico Pellegrini, OPTOS plc; Dr Adria Perez Rovira, UoRotterdam; Haocheng Shen; Dr Sebastian Stein, UGlasgow; Dr Roy Wang; Kris Zutis, NHS Tayside

VAMPIRE EDINBURGH
Dr Tom MacGillivray; Dr Sarah McGrory; Tom Pearson; Dr Devanjali Relan; Dr Gavin Robertson, OPTOS plc

CLINICAL/SCIENTIFIC COLLABS
Dr James Cameron, UoEdinburgh, NHS; Prof Ian Deary, UoEdinburgh, CCACE; Dr Alex Doney, NHS Tayside; Prof Bal Dhillon, Dr Fergus Doubal, UoEdinburgh, NHS; Prof Paul Foster, UCL Moorfields; Dr Pedram Hamrah, Harvard Med School, US; Dr Ruth Hogg, QUB, NHS; Prof Graeme Houston, NHS Tayside, UoDundee; Dr Jean Pierre Hubschman, UCLA Jules Stein Eye Inst; Dr Ahmad Kheirkhah, Harvard Med School, US; Dr Gareth MacKay, QUB, NHS; Dr Danny Mitry, Dr Tunde Peto, UCL Moorfields; Prof Axel Pries, Charité, D; Prof Edwin van Beek, Prof Joanna Wardlaw, UoEdinburgh; Dr Peter Wilson, UoAuckland, NZ

MAIN INDUSTRIAL COLLABS
OPTOS plc; Toshiba MV Edinburgh; Epipole plc; NIDEK Technologies

COMPUTER SCIENCE COLLABS
Matteo Barbieri, Dr Annalisa Barla, UoGenova, I; Lorenza Bonaldi, Alessandro Cavinato, Alessandro Dazzi, UoPadova; Prof Andrea Giachetti, UoVerona, I; Prof Andrew Hunter, UoLincoln; Dr Jiang (Jimmy) Liu, Dr Damon Wong, A*STAR Singapore; Dr Carmen Lupascu, UoPalermo, I; Elisa Menti, UoPadova, I; Prof Giovanni Montana, King's College London; Monica Morellato, UoVerona, I; Ilaria Pieretti, UoPadova, I; Prof Mimmo Tegolo, UoPalermo, I; Jeff Wigdahl, UoPadova, I; Prof Alessandro Verri, UoGenova, I

THANK YOU!
http://vampire.computing.dundee.ac.uk

