Designing Experimentation Metrics
SOMIT GUPTA, MICROSOFT ANALYSIS & EXPERIMENTATION

Importance of right metrics

In 1902, the French quarter in Hanoi was overrun with rats. A "deratisation" scheme paid citizens for each rat they captured (the proof required for payment was the rat's tail).

Rats killed per day
  April 1902, week 1    1,000/day
  April 1902, week 2    4,000/day
  July 1902             20,000/day

But they barely made a dent in the problem! Investigation revealed two phenomena:
1. Tailless rats started appearing
2. A thriving rat farming industry emerged in the city

https://community.redhat.com/blog/2014/07/when-metrics-go-wrong/
http://www.freakonomics.com/media/vannrathunt.pdf
https://en.wikipedia.org/wiki/Cobra_effect

Experimentation Metrics Taxonomy

While analyzing the results of an experiment, we compute many metrics of different types and roles.
1. Data Quality metrics
2. OEC (Overall Evaluation Criteria) metric
3. Guardrail metrics
4. Local feature and diagnostic metrics

A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
Principles for the design of online metrics
Data-Driven Metric Development for Online Controlled Experiments
Seven Rules of Thumb for Web Site Experimenters
Data Quality metrics | OEC metrics | Guardrail metrics | Local feature/Diagnostic metrics

Data Quality metrics

Are the results trustworthy?
  Sample Ratio Mismatch (SRM)
  Data loss
  Click reliability
  Cookie churn

OEC: Overall Evaluation Criteria
Was the treatment successful?
A single metric or a few key metrics
Two key properties:
1. Alignment with long-term company goals (Directionality)
2. Ability to impact (Sensitivity)
OEC vs KPIs (Key Performance Indicators)
  KPIs are lagging metrics reported monthly/quarterly/yearly at the overall product level (DAU, MAU, Revenue, etc.)
  The OEC is a leading metric measured during the experiment (e.g. 2 weeks) at the user level, which is indicative of a long-term increase in KPIs
Designing a good OEC is hard
  o Example: OEC for a search engine

http://www.exp-platform.com/Pages/hippo_long.aspx
http://bit.ly/expUnexpected
http://www.exp-platform.com/Pages/PuzzlingOutcomesExplained.aspx
http://www.exp-platform.com/Documents/2016CIKM_MeasuringMetrics.pdf

OEC for Search

The two key search engine (Bing, Google) KPIs are Query Share (distinct queries) and Revenue
Should the OEC be Queries/User and Revenue/User?
Example: A ranking bug in an experiment resulted in very poor search results
  Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
  Distinct queries went up over 10%, and revenue went up over 30%
What metrics should be in the OEC for a search engine?
http://www.exp-platform.com/Pages/PuzzlingOutcomesExplained.aspx

OEC for Search

Analyzing queries per month, we have

\[
\frac{\text{Queries}}{\text{Month}} = \frac{\text{Queries}}{\text{Session}} \times \frac{\text{Sessions}}{\text{User}} \times \frac{\text{Users}}{\text{Month}}
\]

where a session begins with a query and ends after 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions.)
In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal
Key observation: we want users to find answers and complete tasks quickly, so queries/session should be smaller
The OEC should therefore be based on the middle term: Sessions/User

OEC for Search: Sensitivity

While Sessions/User has great directionality, it rarely moves in our experiments
  Both because it is hard to change the pattern of user visits in a short-term experiment, and because of its statistical properties
  More on this later in the tutorial
The Search OEC we developed includes Sessions/User, but also adds other, more sensitive surrogate metrics that are predictive of Sessions/User movement.
Surrogates are based on the concept of search success: how successful were users in their search tasks?
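As a rough illustration of the session definition and the Queries/Session and Sessions/User terms discussed for the Search OEC, the sketch below computes them from a query log. The log schema (pairs of user id and timestamp), the function names, and the choice to treat a gap of exactly 30 minutes as the start of a new session are assumptions made for this example, not details taken from the slides.

```python
from collections import defaultdict
from datetime import timedelta

# Inactivity threshold from the slides; treating a gap of exactly
# 30 minutes as a new session is an assumption of this sketch.
SESSION_GAP = timedelta(minutes=30)


def count_sessions(timestamps, gap=SESSION_GAP):
    """Count sessions in one user's query timestamps.

    A new session starts at the first query and whenever at least
    `gap` has elapsed since the previous query.
    """
    sessions = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous >= gap:
            sessions += 1
        previous = ts
    return sessions


def decompose(query_log, gap=SESSION_GAP):
    """Split total queries into Queries/Session, Sessions/User and Users.

    `query_log` is an iterable of (user_id, timestamp) pairs
    (a hypothetical schema chosen for this example).
    """
    by_user = defaultdict(list)
    for user_id, ts in query_log:
        by_user[user_id].append(ts)
    if not by_user:
        raise ValueError("query_log is empty")

    users = len(by_user)
    queries = sum(len(ts_list) for ts_list in by_user.values())
    sessions = sum(count_sessions(ts_list, gap) for ts_list in by_user.values())
    return {
        "queries_per_session": queries / sessions,
        "sessions_per_user": sessions / users,
        "users": users,
    }
```

Multiplying the three returned values gives back the total number of queries in the log, which mirrors the decomposition above. In an experiment, the same computation would be run separately on the control and treatment logs and the Sessions/User values compared.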
http://www.exp-platform.com/Pages/PuzzlingOutcomesExplained.aspx

More OEC Examples

Netflix: Subscription business
  KPI: Retention (i.e. the fraction of users who return month over month)
  OEC: Viewing Hours. Strong correlation between viewing hours and retention
Coursera: Cares about course completion; makes money when users pay for certifications
  KPIs: Course completions, # Certificates sold, Revenue
  OEC: Test completion and Course engagement. Predictive of course completion and certificates sold
Examples are from Designing with Data: Improving the User Experience with A/B Testing

How to come up with the right OEC?

In the beginning:
  Start simple: frequency of user visits can be a good indicator of user happiness
  Evaluate and improve based on learning experiments:
    o An obviously positive change: one that will clearly increase user happiness, like removing ads
    o An obviously negative change: like adding latency or decreasing the relevance of search results
Continue to improve directionality and sensitivity over time:
  Set up a metric evaluation framework: curate a diverse set of labeled experiments agreed to be positive, negative or neutral with respect to long-term value. Test changes to the OEC on this set.
Advanced topics
  Proxy metrics: metrics that are predictive of long-term gain
  Surrogates: predictive of the gain in the long-term outcome
  Modeling long-term effects: user learning, market equilibrium
  Tradeoffs: weighting multiple OEC metrics (e.g. revenue/retention) to have a single composite metric

http://www.exp-platform.com/Documents/2016CIKM_MeasuringMetrics.pdf

Experimentation Metrics Taxonomy

While analyzing the results of an experiment, we usually compute 1000s of metrics of different types and roles.
1. Data Quality metrics
   Are the results trustworthy? e.g. Sample Ratio Mismatch
2. OEC (Overall Evaluation Criteria) metric
   Was the treatment successful? e.g. Sessions/User
3. Guardrail metrics
   Did the treatment cause unacceptable harm to key metrics? e.g. KPI metrics, Performance metrics, short-term Revenue
4. Local feature and diagnostic metrics
   Why did the OEC and guardrail metrics move or not move? e.g. number of impressions and clicks on a feature/button/link

Guardrails: Measuring Performance

1. Performance regressions/improvements sink/raise most metrics
   Slowdown experiments at Amazon, Google and Bing show a significant impact of performance on key metrics
2. The tail end of performance is very important
   The performance distribution is long-tailed. It is important to measure the worst performance, usually the 95th and 99th percentiles.
3. Measure different parts of the performance distribution
   An average performance metric is not very useful. Measure different percentiles: 25, 50, 75, 90, 95, 99.
4. Perceived performance is important
   e.g. Time to First Action, Time to Content Above the Fold. Preloading/lazy loading improves perceived performance.
5. Performance varies a lot
   It varies by day of week, type of device, browser, market, and usage of the product. Use segments and triggered analysis.

Summary
Having good metrics is critical
  You get what you measure
Metrics have different types and roles in the analysis of an experiment
  Data Quality metrics, OEC metric, Guardrail metrics, Local feature/Diagnostic metrics
The challenge of designing a good OEC
  A leading metric measured in a ~2 week period but indicative of long-term goals
  Start simple and continuously improve over time

Questions?
http://exp-platform.com
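The Sample Ratio Mismatch check listed among the data quality metrics can be sketched as a chi-squared goodness-of-fit test on the number of users per variant. This is a minimal illustration rather than the method used by any particular platform: the function names, the default 50/50 split, and the 0.001 alert threshold are assumptions chosen for the example.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """P-value of a chi-squared goodness-of-fit test (1 degree of freedom)
    comparing observed user counts with the configured traffic split."""
    total = control_users + treatment_users
    if total <= 0:
        raise ValueError("need at least one user")
    if not 0 < expected_control_share < 1:
        raise ValueError("expected_control_share must be between 0 and 1")

    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_users - expected_control) ** 2 / expected_control
        + (treatment_users - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_users, treatment_users, expected_control_share=0.5, alpha=0.001):
    """Flag a sample ratio mismatch when the p-value falls below `alpha`
    (the 0.001 default is an assumed threshold for this sketch)."""
    return srm_p_value(control_users, treatment_users, expected_control_share) < alpha
```

For example, `has_srm(50000, 48000)` flags a mismatch under a 50/50 design, while `has_srm(5050, 4950)` does not. When the check fires, the other metrics of the experiment should not be trusted until the cause of the imbalance is understood.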