Using Large Scale Log Analysis to Understand Human Behavior

JITP 2011
Jaime Teevan, Microsoft Research

Mark Twain
Students prefer used textbooks that are annotated. [Marshall 1998]
"Cowards die many times before their deaths." Annotated by Nelson Mandela
David Foster Wallace
"I have discovered a truly marvelous proof ... which this margin is too narrow to contain." Pierre de Fermat (1637)

Digital Marginalia
  Do we lose marginalia with digital documents?
  Internet exposes information experiences
    Meta-data, annotations, relationships
    Large-scale information usage data
  Change in focus
    With marginalia, interest is in the individual
    Now we can look at experiences in the aggregate

http://hint.fm/seer

Defining Behavioral Log Data
  Behavioral log data are:
    Traces of human behavior, seen through a sensor
    Actual, real-world behavior
      Not recalled behavior or subjective impression
    Large-scale, real-time
  Behavioral log data are not:
    Non-behavioral sources of large-scale data
    Collected data (e.g., poll data, surveys, census data)
    Crowdsourced data (e.g., Mechanical Turk)

Real-World, Large-Scale, Real-Time
  Private behavior is exposed
    Example: Porn queries, medical queries
  Rare behavior is common
    Example: Observe 500 million queries a day
      Interested in behavior that occurs 0.002% of the time
      Still observe the behavior 10 thousand times a day!
  New behavior appears immediately
    Example: Google Flu Trends

Overview
  How behavioral log data can be used
  Sources of behavioral log data
    Challenges with privacy and data sharing
  Example analysis of one source: Query logs
    To understand people's information needs
    To experiment with different systems
  What behavioral logs cannot reveal
    How to address limitations

Practical Uses for Behavioral Data
  Behavioral data to improve Web search
    Offline log analysis
      Example: Re-finding common, so add history support
    Online log-based experiments
      Example: Interleave different rankings to find best algorithm
    Log-based functionality
      Example: Boost clicked results in a search result list
  Behavioral data on the desktop
    Goal: Allocate editorial resources to create Help docs
    How to do so without knowing what people search for?

Societal Uses of Behavioral Data
  Understand people's information needs
  Understand what people talk about
  Impact public policy? (E.g., DonorsChoose.org)
  [Baeza Yates et al. 2007]

Generalizing About Behavior
  Button clicks
  Feature use
  Structured answers
  Information use
  Information needs
  What people think
  Human behavior

Personal Use of Behavioral Data
  Individuals now have a lot of behavioral data
  Introspection of personal data popular
    My Year in Status
    Status Statistics
  Expect to see more
    As compared to others
    For a purpose

Overview
  Behavioral logs give practical, societal, personal insight
  Sources of behavioral log data
    Challenges with privacy and data sharing
  Example analysis of one source: Query logs
    To understand people's information needs
    To experiment with different systems
  What behavioral logs cannot reveal
    How to address limitations

Web Service Logs
  Example sources
    Search engines
    Commercial websites
  Types of information
    Behavior: Queries, clicks

    Content: Results, products
  Example analysis
    Query ambiguity
    Teevan, Dumais & Liebling. To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent. SIGIR 2008
    IT & Politics; Integral Theory and Practice; Parenting

Public Web Service Content
  Example sources
    Social network sites
    Wiki change logs
  Types of information
    Public content
    Dependent on service
  Example analysis
    Twitter topic models
    Ramage, Dumais & Liebling. Characterizing microblogging using latent topic models. ICWSM 2010
    http://twahpic.cloudapp.net

Web Browser Logs
  Example sources
    Proxies
    Toolbar
  Types of information
    Behavior: URL visit
    Content: Settings, pages
  Example analysis
    Diff-IE (http://bit.ly/DiffIE)
    Teevan, Dumais & Liebling. A Longitudinal Study of How Highlighting Web Content Change Affects .. Interactions. CHI 2010
  Example analysis
    Webpage revisitation
    Adar, Teevan & Dumais. Large Scale Analysis of Web Revisitation Patterns. CHI 2008

Client-Side Logs
  Example sources
    Client application
    Operating system
  Types of information
    Web client interactions
    Other interactions: rich!
  Example analysis
    Stuff I've Seen
    Dumais, Cutrell, Cadiz, Jancke, Sarin & Robbins. Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003

Types of Logs Rich and Varied
  Sources of Log Data
    Web services
      Search engines
      Commerce sites
    Public Web services
      Social network sites
      Wiki change logs
    Web Browsers
      Proxies
      Toolbars or plug-ins
    Client applications
  Types of Information Logged
    Interactions
      Posts, edits
      Queries, clicks
      URL visits
      System interactions
    Context
      Results
      Ads
      Web pages shown

Public Sources of Behavioral Logs
  Public Web service content

    Twitter, Facebook, Digg, Wikipedia
    At JITP: InfoExtractor, Facebook Harvester, scraping tools
  Research efforts to create logs
    At JITP: Roxy, a research proxy
    Lemur Community Query Log Project
      http://lemurstudy.cs.umass.edu/
      1 year of data collection = 6 seconds of Google logs
  Publicly released private logs
    DonorsChoose.org
      http://developer.donorschoose.org/the-data
    Enron corpus, AOL search logs, Netflix ratings

Example: AOL Search Dataset
  August 4, 2006: Logs released to academic community
    3 months, 650 thousand users, 20 million queries
    Logs contain anonymized User IDs
  August 7, 2006: AOL pulled the files, but already mirrored
  August 9, 2006: New York Times identified Thelma Arnold
    "A Face Is Exposed for AOL Searcher No. 4417749"
    Queries for businesses, services in Lilburn, GA (pop. 11k)
    Queries for Jarrett Arnold (and others of the Arnold clan)
    NYT contacted 14 people in Lilburn with Arnold surname
    When contacted, Thelma Arnold acknowledged her queries
  August 21, 2006: 2 AOL employees fired, CTO resigned
  September, 2006: Class action lawsuit filed against AOL

  AnonID   Query                         QueryTime            ItemRank  ClickURL
  -------  ----------------------------  -------------------  --------  --------
  1234567  jitp                          2006-04-04 18:18:18  1         http://www.jitp.net/
  1234567  submission process jipt       2006-04-04 18:18:18  3         http://www.jitp.net/m_mscript.php?p=2
  1234567  computational social scinece  2006-04-24 09:19:32
  1234567  computational social science  2006-04-24 09:20:04  2         http://socialcomplexity.gmu.edu/phd.php
  1234567  seattle restaurants           2006-04-24 09:25:50  2         http://seattletimes.nwsource.com/rests
  1234567  perlman montreal              2006-04-24 10:15:14  4         http://oldwww.acm.org/perlman/guide.html
  1234567  jitp 2006 notification all    2006-05-20 13:13:13

Example: AOL Search Dataset
  Other well known AOL users
    User 927: how to kill your wife
    User 711391: i love alaska
      http://www.minimovies.org/documentaires/view/ilovealaska
  Anonymous IDs do not make logs anonymous
    Contain directly identifiable information
      Names, phone numbers, credit cards, social security numbers
    Contain indirectly identifiable information
      Example: Thelma's queries
      Birthdate, gender, zip code identifies 87% of Americans

Example: Netflix Challenge
  October 2, 2006: Netflix announces contest
    Predict people's ratings for a $1 million prize
    100 million ratings, 480k users, 17k movies
    Very careful with anonymity post-AOL
  May 18, 2008: Data de-anonymized
    Paper published by Narayanan & Shmatikov
    Uses background knowledge from IMDB
    Robust to perturbations in data
  December 17, 2009: Doe v. Netflix
  March 12, 2010: Netflix cancels second competition

  "All customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy. . . Even if, for example, you knew all your own ratings and their dates you probably couldn't identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset)"

  Ratings
    1: [Movie 1 of 17770]
    12, 3, 2006-04-18 [CustomerID, Rating, Date]
    1234, 5, 2003-07-08 [CustomerID, Rating, Date]
    2468, 1, 2005-11-12 [CustomerID, Rating, Date]
  Movie Titles
    10120, 1982, Bladerunner
    17690, 2007, The Queen

Overview
  Behavioral logs give practical, societal, personal insight
  Sources include Web services, browsers, client apps
    Public sources limited due to privacy concerns
  Example analysis of one source: Query logs
    To understand people's information needs
    To experiment with different systems

  What behavioral logs cannot reveal
    How to address limitations

  Query                         Time              User
  ----------------------------  ----------------  ------
  jitp 2011                     10:41 am 5/15/11  142039
  social science                10:44 am 5/15/11  142039
  computational social science  10:56 am 5/15/11  142039
  jitp 2011                     11:21 am 5/15/11  659327
  crowne plaza seattle          11:59 am 5/15/11  318222
  restaurants seattle           12:01 pm 5/15/11  318222
  pikes market restaurants      12:17 pm 5/15/11  318222
  stuart shulman                12:18 pm 5/15/11  142039
  daytrips in seattle, wa       1:30 pm 5/15/11   554320
  jitp 2011                     1:30 pm 5/15/11   659327
  jitp program                  2:32 pm           43545

Data cleaning pragmatics
  Significant part of data analysis
  Ensure cleaning is appropriate
  Keep track of the cleaning process
  Keep the original data around
    Example: ClimateGate
  Things to clean
    Language
    System errors
    Spam
    Porn

[Figure: the sample log above, annotated by query typology, query behavior, and long term trends]

Uses of Analysis
  Ranking
    E.g., precision
  System design
    E.g., caching
  User interface
    E.g., history
  Test set development
  Complementary research

Things Observed in Query Logs
  Summary measures
    Query frequency: Queries appear 3.97 times [Silverstein et al. 1999]
    Query length: 2.35 terms [Jansen et al. 1998]
  Analysis of query intent
    Query types and topics: Navigational, Informational, Transactional [Broder 2002]
  Temporal features
    Session length: Sessions 2.20 queries long [Silverstein et al. 1999]
    Common re-formulations [Lau and Horvitz, 1999]
  Click behavior
    Relevant results for query [Joachims 2002]
    Queries that lead to clicks

Surprises About Query Log Data
  From early log analysis
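The summary measures listed under Things Observed in Query Logs (query length, query frequency, session length) can be computed directly from rows of user, time, and query. The sketch below is only illustrative: the rows are hypothetical and loosely modelled on the sample log in the slides, and the 30-minute session cutoff is an assumption, since the talk does not say how sessions were delimited.

```python
from datetime import datetime, timedelta

# Hypothetical rows: (user_id, timestamp, query).
LOG = [
    ("A", datetime(2011, 5, 15, 10, 41), "jitp 2011"),
    ("A", datetime(2011, 5, 15, 10, 44), "social science"),
    ("A", datetime(2011, 5, 15, 10, 56), "computational social science"),
    ("B", datetime(2011, 5, 15, 11, 21), "jitp 2011"),
    ("A", datetime(2011, 5, 15, 12, 18), "stuart shulman"),
]

# Assumed inactivity cutoff that starts a new session (not from the talk).
SESSION_GAP = timedelta(minutes=30)


def summarize(log, gap=SESSION_GAP):
    """Return mean query length, mean query frequency, and session statistics."""
    queries = [query for _, _, query in log]
    mean_terms = sum(len(q.split()) for q in queries) / len(queries)
    # Average number of times each distinct query string appears.
    mean_frequency = len(queries) / len(set(queries))

    # Count sessions: a user's query starts a new session when it is the
    # user's first query or follows more than `gap` of inactivity.
    sessions = 0
    last_seen = {}
    for user, ts, _ in sorted(log, key=lambda row: (row[0], row[1])):
        if user not in last_seen or ts - last_seen[user] > gap:
            sessions += 1
        last_seen[user] = ts

    return {
        "mean_terms": mean_terms,
        "mean_frequency": mean_frequency,
        "sessions": sessions,
        "mean_session_length": len(queries) / sessions,
    }


stats = summarize(LOG)
```

On real logs the choice of session cutoff changes the session-length figure noticeably, so it is worth reporting alongside the result.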

    Examples: Jansen et al. 2000, Broder 1998
    Queries are not 7 or 8 words long
    Advanced operators not used or misused
    Nobody used relevance feedback
    Lots of people search for sex
    Navigation behavior common
  Prior experience was with library search

Surprises About Microblog Search?
  Ordered by time
    Time important
    People important
    Specialized syntax
    Queries common
  Ordered by relevance
    Often navigational
    Time and people less important
    No syntax use
    Queries longer
    Queries

Generalizing Across Systems
  Bing experiment #123: A particular feature
  Bing: A web search engine
  Bing, Google, Yahoo: Web search engines
  Different corpora: Search engines
  Browser, search, email: Information seeking
  Build new features
  Build better systems
  Build new tools

Partitioning the Data

  Corpus
  Language
  Location
  Device
  Time
  User
  System variant
  [Baeza Yates et al. 2007]

Partition by Time
  Periodicities
  Spikes
  Real-time data
    New behavior
    Immediate feedback
  Individual
    Within session
    Across sessions
  [Beitzel et al. 2004]

Partition by User
  [Teevan et al. 2007]
  Temporary ID (e.g., cookie, IP address)
    High coverage but high churn
    Does not necessarily map directly to users
  User account
    Only a subset of users

Partition by System Variant
  Also known as controlled experiments
  Some people see one variant, others another
  Example: What color for search result links?
    Bing tested 40 colors
    Identified #0044CC
    Value: $80 million

Everything is Significant
  Everything is significant, but not always meaningful
    Choose the metrics you care about first
    Look for converging evidence
  Choose comparison group carefully
    From the same time period
    Log a lot because it can be hard to recreate state
    Confirm with metrics that should be the same
  High variance, calculate empirically
  Look at the data

Overview
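One way to read the Everything is Significant advice to calculate variance empirically is to estimate uncertainty by resampling the logged data instead of relying on a closed-form formula. The sketch below uses a percentile bootstrap for the difference in a per-user metric between two system variants. The bootstrap is a swapped-in illustration, not a method named in the talk, and the function name, sample data, and parameters are all hypothetical.

```python
import random


def bootstrap_diff_ci(control, treatment, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(treatment) - mean(control).

    `control` and `treatment` are lists of a per-user metric, for example
    clicks per user under each variant.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the group sizes.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical clicks per user for two variants.
variant_a = [1, 0, 2, 1, 0, 3, 1, 1]
variant_b = [2, 1, 2, 0, 1, 4, 1, 2]
interval = bootstrap_diff_ci(variant_a, variant_b)
```

Running the same function on two random halves of a single variant is a cheap check in the spirit of "confirm with metrics that should be the same": the interval should comfortably contain zero.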

  Behavioral logs give practical, societal, personal insight
  Sources include Web services, browsers, client apps
    Public sources limited due to privacy concerns
  Partitioned query logs to view interesting slices
    By corpus, time, individual
    By system variant = experiment
  What behavioral logs cannot reveal
    How to address limitations

What Logs Cannot Tell Us
  People's intent
  People's success
  People's experience
  People's attention
  People's beliefs of what happens
  Behavior can mean many things
    81% of search sequences ambiguous [Viermetz et al. 2006]
  [Figure: timeline of one search sequence, from 7:12 Query through clicks and reads of Results 1 and 3 to 7:27 Save links locally]

Example: Click Entropy

  Question: How ambiguous is a query?
  Approach: Look at variation in clicks [Teevan et al. 2008]
  Measure: Click entropy
    Low if no variation
      journal of information
    High if lots of variation
      jitp (IT & Politics; Integral Theory and Practice; Parenting)

Which Has Less Variation in Clicks?
  www.usajobs.gov v. federal government jobs
  find phone number v. msn live search
  Results change
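The click entropy measure from the Click Entropy slide can be sketched as the Shannon entropy of the distribution of clicked results for a query, H(q) = -sum over URLs u of p(u|q) log2 p(u|q). This assumes the usual Shannon form, and the URLs below are placeholders rather than data from the talk.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Entropy in bits of the clicked-URL distribution for one query."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1 / p), summed over every distinct clicked URL.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Every click goes to the same result: no variation.
navigational = click_entropy(["site-a", "site-a", "site-a", "site-a"])  # 0.0

# Clicks are spread evenly over four results: lots of variation.
ambiguous = click_entropy(["site-a", "site-b", "site-c", "site-d"])  # 2.0
```

Low values suggest that most people issuing the query want the same result, while high values suggest several distinct intents behind the same query string.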

    singapore pools v. singaporepools.com
    Result entropy = 5.7 v. Result entropy = 10.7
  Result quality varies
    tiffany v. tiffany's
    nytimes v. connecticut newspapers
    Click position = 2.6 v. Click position = 1.6
  Task impacts # of clicks
    campbells soup recipes v. vegetable soup recipe
    soccer rules v. hockey equipment
    Clicks/user = 1.1 v. Clicks/user = 2.1

Beware of Adversaries
  Robots try to take advantage of your service
    Queries too fast or common to be a human
    Queries too specialized (and repeated) to be real
  Spammers try to influence your interpretation [Fetterly et al. 2004]
    Click-fraud, link farms, misleading content
  Never-ending arms race
    Look for unusual clusters of behavior
    Adversarial use of log data

Beware of Tyranny of the Data
  Can provide insight into behavior
    Example: What is searched for, how needs are expressed
  Can be used to test hypotheses
    Example: Compare ranking variants or link color
  Can only reveal what can be observed
    Cannot tell you what you cannot observe
    Example: Nobody uses Twitter to re-find

Supplementing Log Data
  Enhance log data
    Collect associated information
    Example: For browser logs, crawl visited webpages
    Instrumented panels
  Converging methods
    Usability studies
    Eye tracking
    Surveys
    Field studies
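The Beware of Adversaries heuristic that some traffic is too fast or too frequent to be human can be sketched as a simple per-user filter. Both thresholds below are hypothetical placeholders, not values from the talk, and the function assumes it is given one user's query timestamps for a single day.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical thresholds; real values would be tuned on real traffic.
MIN_MEDIAN_GAP = timedelta(seconds=2)
MAX_QUERIES_PER_DAY = 1000


def looks_automated(timestamps, min_gap=MIN_MEDIAN_GAP, max_per_day=MAX_QUERIES_PER_DAY):
    """Flag one user's day of queries as likely robot traffic."""
    ts = sorted(timestamps)
    if len(ts) > max_per_day:
        return True  # too many queries for one person in a day
    if len(ts) < 3:
        return False  # too little evidence to judge pacing
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(ts, ts[1:])]
    # A very small typical gap means queries arrive faster than a person types.
    return median(gaps) < min_gap.total_seconds()


start = datetime(2011, 5, 15, 10, 0, 0)
bot_like = [start + timedelta(seconds=i) for i in range(10)]
human_like = [start, start + timedelta(minutes=3), start + timedelta(minutes=15)]
flags = (looks_automated(bot_like), looks_automated(human_like))  # (True, False)
```

A filter like this is itself a cleaning step, so the flagged rows are better set aside than deleted, in line with the earlier advice to keep the original data around.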

    Diary studies

Example: Re-Finding Intent
  Large-scale log analysis of re-finding [Tyler and Teevan 2010]
    Do people know they are re-finding?
    Do they mean to re-find the result they do?
    Why are they returning to the result?
  Small-scale critical incident user study
    Browser plug-in that logs queries and clicks
    Pop up survey on repeat clicks and 1/8 new clicks
  Insight into intent + Rich, real-world picture
    Re-finding often targeted towards a particular URL
    Not targeted when query changes or in same session

Summary
  Behavioral logs give practical, societal, personal insight
  Sources include Web services, browsers, client apps
    Public sources limited due to privacy concerns
  Partitioned query logs to view interesting slices
    By corpus, time, individual
    By system variant = experiment
  Behavioral logs are powerful but not complete picture
    Can expose small differences and tail behavior
    Cannot expose motivation, which is often adversarial
    Look at the logs and supplement with complementary data

Questions?
Jaime Teevan

References

Adar, E., J. Teevan and S.T. Dumais. Large scale analysis of Web revisitation patterns. CHI 2008.
Akers, D., M. Simpson, T. Winograd and R. Jeffries. Undo and erase events as indicators of usability problems. CHI 2009.
Baeza-Yates, R., G. Dupret and J. Velasco. A study of mobile search queries in Japan. Query Log Analysis: Social and Technological Challenges. WWW 2007.
Beitzel, S.M., E.C. Jensen, A. Chowdhury, D. Grossman and O. Frieder. Hourly analysis of a very large topically categorized Web query log. SIGIR 2004.
Broder, A. A taxonomy of Web search. SIGIR Forum 36(2), 2002.
Chilton, L. and J. Teevan. Addressing information needs directly in the search result page. WWW 2011.
Cutrell, E., D.C. Robbins, S.T. Dumais and R. Sarin. Fast, flexible filtering with Phlat: Personal search and organization made easy. CHI 2006.
Dagon, D. Botnet detection and response: The network is the infection. OARC Workshop 2005.
Dasu, T. and T. Johnson. Exploratory data mining and data cleaning. 2004.
Dumais, S.T., E. Cutrell, J.J. Cadiz, G. Jancke, R. Sarin and D.C. Robbins. Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003.
Fetterly, D., M. Manasse and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam Web pages. Workshop on the Web and Databases 2004.
Fox, S., K. Karnawat, M. Mydland, S.T. Dumais and T. White. Evaluating implicit measures to improve Web search. TOIS 23(2), 2005.
Jansen, B.J., A. Spink, J. Bateman and T. Saracevic. Real life information retrieval: A study of user queries on the Web. SIGIR Forum 32(1), 1998.
Joachims, T. Optimizing search engines using clickthrough data. KDD 2002.
Kellar, M., C. Watters and M. Shepherd. The impact of task on the usage of Web browser navigation mechanisms. GI 2006.
Kohavi, R., R. Longbotham, D. Sommerfield and R.M. Henne. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18(1), 2009.
Kohavi, R., R. Longbotham and T. Walker. Online experiments: Practical lessons. IEEE Computer 43(9), 2010.
Kotov, A., P. Bennett, R.W. White, S.T. Dumais and J. Teevan. Modeling and analysis of cross-session search tasks. SIGIR 2011.
Kulkarni, A., J. Teevan, K.M. Svore and S.T. Dumais. Understanding temporal query dynamics. WSDM 2011.
Lau, T. and E. Horvitz. Patterns of search: Analyzing and modeling Web query refinement. User Modeling 1999.
Marshall, C.C. The future of annotation in a digital (paper) world. GSLIS Clinic 1998.
Narayanan, A. and V. Shmatikov. Robust de-anonymization of large sparse datasets. IEEE Symposium on Security and Privacy 2008.
Silverstein, C., M. Henzinger, H. Marais and M. Moricz. Analysis of a very large Web search engine query log. SIGIR Forum 33(1), 1999.
Tang, D., A. Agarwal and D. O'Brien. Overlapping experiment infrastructure: More, better, faster experimentation. KDD 2010.
Teevan, J., E. Adar, R. Jones and M. Potts. Information re-retrieval: Repeat queries in Yahoo's logs. SIGIR 2007.
Teevan, J., S.T. Dumais and D.J. Liebling. To personalize or not to personalize: Modeling queries with variation in user intent. SIGIR 2008.
Teevan, J., S.T. Dumais and D.J. Liebling. A longitudinal study of how highlighting Web content change affects people's Web interactions. CHI 2010.
Teevan, J., D.J. Liebling and G.R. Geetha. Understanding and predicting personal navigation. WSDM 2011.
Teevan, J., D. Ramage and M.R. Morris. #TwitterSearch: A comparison of microblog search and Web search. WSDM 2011.
Tyler, S.K. and J. Teevan. Large scale query log analysis of re-finding. WSDM 2010.
Viermetz, M., C. Stolz, V. Gedov and M. Skubacz. Relevance and impact of tabbed browsing behavior on Web usage mining. Web Intelligence 2006.
Weinreich, H., H. Obendorf, E. Herder and M. Mayer. Off the beaten tracks: Exploring three aspects of Web navigation. WWW 2006.
White, R.W., S.T. Dumais and J. Teevan. Characterizing the influence of domain expertise on Web search behavior. WSDM 2009.
