
Designing Experimentation Metrics
SOMIT GUPTA, MICROSOFT ANALYSIS & EXPERIMENTATION

Importance of right metrics
In 1902, the French quarter in Hanoi was overrun with rats. A "deratisation" scheme paid citizens for each rat they captured (the proof requested for payment was the rat's tail).

Rats killed per day:
April 1902, Week 1: 1,000/day
April 1902, Week 2: 4,000/day
July 1902: 20,000/day

But they barely made a dent in the problem! Investigation revealed two phenomena:
1. Tailless rats started appearing
2. A thriving rat farming industry emerged in the city

https://community.redhat.com/blog/2014/07/when-metrics-go-wrong/
http://www.freakonomics.com/media/vannrathunt.pdf

https://en.wikipedia.org/wiki/Cobra_effect

Experimentation Metrics Taxonomy
While analyzing the results of an experiment we compute many metrics of different types and roles:
1. Data Quality metrics
2. OEC (Overall Evaluation Criteria) metric
3. Guardrail metrics
4. Local feature and diagnostic metrics

Further reading:
A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
Principles for the design of online metrics
Data-Driven Metric Development for Online Controlled Experiments
Seven Rules of Thumb for Web Site Experimenters

Data Quality metrics
Are the results trustworthy?
Sample Ratio Mismatch (SRM)
Data loss
Click reliability
Cookie churn
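To make the SRM check concrete, here is a minimal sketch of the standard chi-squared goodness-of-fit test against the configured traffic split (the counts and the 50/50 split are hypothetical; assumes scipy is available):

```python
# Minimal SRM check: a chi-squared goodness-of-fit test comparing the
# observed user counts per variant against the configured split.
# Counts below are hypothetical; assumes scipy is installed.
from scipy.stats import chisquare

control_users, treatment_users = 50_912, 50_088  # hypothetical counts
total = control_users + treatment_users
expected = [total * 0.5, total * 0.5]  # configured 50/50 split

stat, p_value = chisquare([control_users, treatment_users], f_exp=expected)

# A very small p-value (a common threshold is p < 0.001) signals an SRM:
# the observed split is unlikely under the configured ratio, so the
# experiment's results should not be trusted until the cause is found.
if p_value < 0.001:
    print(f"SRM detected (p = {p_value:.2e}): investigate before reading results")
else:
    print(f"No SRM detected (p = {p_value:.3f})")
```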

OEC: Overall Evaluation Criteria
Was the treatment successful?
A single metric or a few key metrics
Two key properties:
1. Alignment with long-term company goals (Directionality)
2. Ability to impact (Sensitivity)

OEC vs. KPIs (Key Performance Indicators)
KPIs are lagging metrics reported monthly/quarterly/yearly at the overall product level (DAU, MAU, Revenue, etc.)
The OEC is a leading metric measured during the experiment (e.g., over 2 weeks) at the user level, which is indicative of a long-term increase in KPIs
Designing a good OEC is hard
o Example: OEC for a search engine

http://www.exp-platform.com/Pages/hippo_long.aspx

http://bit.ly/expUnexpected
http://www.exp-platform.com/Pages/PuzzlingOutcomesExplained.aspx
http://www.exp-platform.com/Documents/2016CIKM_MeasuringMetrics.pdf

OEC for Search
The two key search engine (Bing, Google) KPIs are Query Share (distinct queries) and Revenue
Should the OEC be Queries/User and Revenue/User?
Example: A ranking bug in an experiment resulted in very poor search results
Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
Distinct queries went up over 10%, and revenue went up over 30%
What metrics should be in the OEC for a search engine?

http://www.exp-platform.com/Pages/PuzzlingOutcomesExplained.aspx

OEC for Search
Analyzing queries per month, we have

Queries/Month = (Queries/Session) × (Sessions/User) × (Users/Month)

where a session begins with a query and ends with 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions.)
In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal
Key observation: we want users to find answers and complete tasks quickly, so Queries/Session should be smaller
The OEC should therefore be based on the middle term: Sessions/User
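To make the session definition concrete, here is a minimal sketch (Python, with hypothetical query logs) of applying the 30-minute inactivity rule to compute Sessions/User:

```python
# Minimal sessionization sketch: count a user's sessions by starting a new
# session whenever 30+ minutes pass between consecutive queries.
# Data and helper names are hypothetical, for illustration only.
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def count_sessions(query_times):
    """Count sessions in a list of a user's query timestamps."""
    times = sorted(query_times)
    if not times:
        return 0
    sessions = 1
    for prev, curr in zip(times, times[1:]):
        if curr - prev >= SESSION_GAP:
            sessions += 1  # a gap of 30+ minutes starts a new session
    return sessions

# Hypothetical query logs for two users
logs = {
    "user_a": [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 5),
               datetime(2024, 1, 1, 14, 0)],  # 2 sessions
    "user_b": [datetime(2024, 1, 1, 10, 0)],  # 1 session
}

total_sessions = sum(count_sessions(t) for t in logs.values())
print("Sessions/User =", total_sessions / len(logs))  # 3 / 2 = 1.5
```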

OEC for Search: Sensitivity
While Sessions/User has great directionality, it rarely moves in our experiments
Both because it is hard to change the pattern of user visits in a short-term experiment, and because of its statistical properties
More on this later in the tutorial
The Search OEC we developed includes Sessions/User, but also adds other, more sensitive surrogate metrics that are predictive of Sessions/User movement
Surrogates are based on the concept of search success: how successful were users in their search tasks?

http://www.exp-platform.com/Pages/PuzzlingOutcomesExplained.aspx
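One way to see the statistical-properties point: a rough sample-size sketch using the n ≈ 16σ²/Δ² rule of thumb from Seven Rules of Thumb for Web Site Experimenters (referenced above). The metric means and standard deviations below are hypothetical:

```python
# Rough sample size per variant needed to detect a relative change in a
# metric, using the n ~= 16 * sigma^2 / delta^2 rule of thumb
# (~80% power at alpha = 0.05). Metric values below are hypothetical.

def users_needed(mean, std, relative_change):
    delta = mean * relative_change  # absolute effect size to detect
    return 16 * std**2 / delta**2

# A metric with high variance relative to its mean (like Sessions/User)
# needs far more users to detect the same 1% move than a tighter metric.
print(users_needed(mean=1.5, std=2.0, relative_change=0.01))  # ~284,000 users
print(users_needed(mean=0.2, std=0.1, relative_change=0.01))  # ~40,000 users
```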

More OEC Examples
Netflix: Subscription business
KPI: Retention (i.e., the fraction of users who return month over month)
OEC: Viewing Hours. Strong correlation between viewing hours and retention
Coursera: Cares about course completion; makes money when users pay for certifications
KPIs: Course completions, # Certificates sold, Revenue
OEC: Test completion and Course engagement. Predictive of course completions and certificates sold

Examples are from Designing with Data: Improving the User Experience with A/B Testing

How to come up with the right OEC?
In the beginning:
Start simple: frequency of user visits can be a good indicator of user happiness
Evaluate and improve based on learning experiments:
o An obviously positive change that will clearly increase user happiness, like removing ads
o An obviously negative change, like adding latency or decreasing the relevance of search results
Continue to improve directionality and sensitivity over time:
Set up a metric evaluation framework: curate a diverse set of labeled experiments agreed to be positive, negative, or neutral with respect to long-term value, and test changes to the OEC on this set (see the sketch after this list)
Advanced topics:
Proxy metrics: metrics that are predictive of long-term gain

Surrogates: metrics predictive of the gain in the long-term outcome
Modeling long-term effects: user learning, market equilibrium
Tradeoffs: weighting multiple OEC metrics (e.g., revenue/retention) into a single composite metric

http://www.exp-platform.com/Documents/2016CIKM_MeasuringMetrics.pdf
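A minimal sketch of what such an evaluation framework could compute (the labeled corpus and all numbers are hypothetical): score a candidate OEC by how often it moves significantly on non-neutral experiments (sensitivity) and how often those significant moves agree with the agreed-upon label (directionality):

```python
# Hypothetical sketch: score a candidate OEC against a corpus of labeled
# experiments agreed to be positive, negative, or neutral.
# Sensitivity:    how often the metric moves significantly on non-neutral experiments.
# Directionality: how often those significant moves agree with the label.

ALPHA = 0.05

# Each record: (agreed label, candidate metric's delta, p-value). Hypothetical data.
labeled_experiments = [
    ("positive", +0.021, 0.001),
    ("positive", +0.004, 0.310),  # real improvement the metric failed to detect
    ("negative", -0.015, 0.004),
    ("negative", +0.008, 0.020),  # significant move in the WRONG direction
    ("neutral",  +0.001, 0.800),
]

non_neutral = [(l, d, p) for l, d, p in labeled_experiments if l != "neutral"]
significant = [(l, d) for l, d, p in non_neutral if p < ALPHA]
agree = [1 for l, d in significant if (d > 0) == (l == "positive")]

print(f"sensitivity:    {len(significant)}/{len(non_neutral)}")   # 3/4
print(f"directionality: {sum(agree)}/{len(significant)}")         # 2/3
```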

Experimentation Metrics Taxonomy
While analyzing the results of an experiment we usually compute 1000s of metrics of different types and roles:
1. Data Quality metrics: Are the results trustworthy? e.g., Sample Ratio Mismatch
2. OEC (Overall Evaluation Criteria) metric: Was the treatment successful? e.g., Sessions/User
3. Guardrail metrics: Did the treatment cause unacceptable harm to key metrics? e.g., KPI metrics, Performance metrics, short-term Revenue
4. Local feature and diagnostic metrics: Why did the OEC and guardrail metrics move or not move? e.g., number of impressions and clicks on a feature/button/link

Guardrails: Measuring Performance
1. Performance regressions/improvements sink/raise most metrics

Slowdown experiments at Amazon, Google, and Bing show a significant impact of performance on key metrics
2. The tail end of performance is very important
The performance distribution is long-tailed. It is important to measure the worst performance, usually the 95th and 99th percentiles.
3. Measure different parts of the performance distribution
Average performance metrics are not very useful. Measure different percentiles: 25, 50, 75, 90, 95, 99.
4. Perceived performance is important
e.g., Time to First Action, Time to Content Above the Fold. Preloading/lazy loading improve perceived performance.
5. Performance varies a lot
It varies by day of week, type of device, browser, market, and usage of the product. Use segments and triggered analysis.
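A minimal sketch of reporting a latency guardrail at several percentiles rather than the mean (hypothetical, long-tailed latency samples; assumes numpy is available):

```python
# Report a latency guardrail at several percentiles instead of the mean:
# long-tailed distributions hide regressions in the average.
# Latency samples below are hypothetical (lognormal, i.e., long-tailed).
import numpy as np

rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=5.5, sigma=0.6, size=100_000)

for pct in (25, 50, 75, 90, 95, 99):
    print(f"p{pct}: {np.percentile(latencies_ms, pct):8.1f} ms")
print(f"mean: {latencies_ms.mean():8.1f} ms  # alone, this hides the tail")
```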

Summary
Having good metrics is critical: you get what you measure
Metrics have different types and roles in the analysis of an experiment: Data Quality metrics, OEC metric, Guardrail metrics, Local feature/Diagnostic metrics
The challenge of designing a good OEC: a leading metric measured over a ~2-week period but indicative of long-term goals
Start simple and continuously improve over time

Questions?
http://exp-platform.com
