Experiences with Tracing Causality in Networked Services

Experiences with Tracing Causality in Networked Services

Experiences with Tracing Causality in Networked Services Rodrigo Fonseca, Brown Michael Freedman, Princeton George Porter, UCSD April 2010 INM/WREN San Jose, CA Which way to Bangalore? Troubleshooting Networked Systems Hard to develop, debug, deploy, troubleshoot No standard way to integrate debugging, monitoring, diagnostics Status quo: device centric

Load Balancer Web 1 Firewall ... ... 28 03:55:38 PM fire... 28 03:55:38 PM fire... 28 03:55:38 PM fire... 28 03:55:38 PM fire... 28 03:55:38 PM fire... 28 03:55:38 PM fire... 28 03:55:38 PM fire... 28 03:55:39 PM fire... 28 03:55:39 PM fire...

28 03:55:39 PM fire... 28 03:55:39 PM fire... 28 03:55:39 PM fire... 28 03:55:39 PM fire... 28 03:55:39 PM fire... 28 03:55:39 PM fire... ... ... ... [04:03:23 2006] [notice] Dispatch s1... [04:03:23 2006] [notice] Dispatch s2... [04:04:18 2006] [notice] Dispatch s3... [04:07:03 2006] [notice] Dispatch s1... [04:10:55 2006] [notice] Dispatch s2... [04:03:24 2006] [notice] Dispatch s3... [04:04:47 2006] [crit] Server s3 down... ...

... ... ... 72.30.107.159 - - [20/Aug/2006:09:12:58 -0700] "GET /ga 65.54.188.26 - - [20/Aug/2006:09:13:32 -0700] "GET /rob 65.54.188.26 - - [20/Aug/2006:09:13:32 -0700] "GET /rob 65.54.188.26 - - [20/Aug/2006:09:13:32 -0700] "GET /gal 65.54.188.26 - - [20/Aug/2006:09:13:32 -0700] "GET /gal 66.249.72.163 - - [20/Aug/2006:09:15:04 -0700] "GET /ga 66.249.72.163 - - [20/Aug/2006:09:15:07 -0700] "GET /ga 66.249.72.163 - - [20/Aug/2006:09:15:10 -0700] "GET /ro 66.249.72.163 - - [20/Aug/2006:09:15:11 -0700] "GET /ga ... ... Web 2

... ... 72.30.107.159 - - [20/Aug/2006:09:12:58 -0700] "GET /ga 65.54.188.26 - - [20/Aug/2006:09:13:32 -0700] "GET /rob 65.54.188.26 - - [20/Aug/2006:09:13:32 -0700] "GET /rob 65.54.188.26 - - [20/Aug/2006:09:13:32 -0700] "GET /gal 65.54.188.26 - - [20/Aug/2006:09:13:32 -0700] "GET /gal 66.249.72.163 - - [20/Aug/2006:09:15:04 -0700] "GET /ga 66.249.72.163 - - [20/Aug/2006:09:15:07 -0700] "GET /ga 66.249.72.163 - - [20/Aug/2006:09:15:10 -0700] "GET /ro 66.249.72.163 - - [20/Aug/2006:09:15:11 -0700] "GET /ga ... ... Database ... ...

LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: LOG: ...

... statement: select oid... statement: SELECT COU... statement: SELECT g2_... statement: select oid... statement: SELECT COU... statement: SELECT g2_... statement: select oid... statement: SELECT COU... statement: SELECT g2_... statement: select oid... statement: select oid... statement: SELECT COU... statement: SELECT g2_... statement: select oid... statement: SELECT COU... statement: SELECT g2_...

statement: select oid... Status quo: device centric Determining paths: Join logs on time and ad-hoc identifiers Relies on well synchronized clocks extensive application knowledge Requires all operations logged to guarantee complete paths This talk Causality Tracking: an alternative Many previous frameworks: X-Trace, PIP, Whodunit, Magpie, Googles Dapper

Experiences integrating and using X-Trace Outline Tracing causality with X-Trace Case studies 802.1X Authentication Service CoralCDN and OASIS anycast service Challenges Conclusion Outline Tracing causality with X-Trace Case studies 802.1X Authentication Service CoralCDN and OASIS anycast service

Challenges Conclusion X-Trace X-Trace records events in a distributed execution and their causal relationship Events are grouped into tasks Well defined starting event and all that is causally related Each event generates a report, binding it to one or more preceding events Captures full happens-before relation X-Trace Output HTTP Client

HTTP Proxy TCP 1 Start IP TCP 1 End IP Router IP HTTP Server

TCP 2 Start IP TCP 2 End IP Router Task graph capturing task execution Nodes: events across layers, devices Edges: causal relations between events IP Router

IP Basic Mechanism a g [T, a] HTTP Client HTTP Proxy [T, a]

b TCP 1 End d IP TCP 2 Start i e IP Router HTTP

Server h f TCP 1 Start c n [T, g] IP IP

X-Trace Report TaskID: T EventID: g j kf Edge: from a, IP Router IP Router m TCP 2 End

l IP Each event uniquely identified within a task: [TaskId, EventId] [TaskId, EventId] propagated along execution path For each event create and log an X-Trace report Enough info to reconstruct the task graph X-Trace Library API Handles propagation within app Threads / event-based (e.g., libasync) Akin to a logging API: Main call is logEvent(message) Library takes care of event id creation, binding, reporting, etc Implementations in C++, Java, Ruby, Javascript

Outline Tracing causality with X-Trace Case studies 802.1X Authentication Service CoralCDN and OASIS anycast service Challenges Conclusion 802.1X Authentication Service Authenticator e.g. Acc. Point Client EAP L2

RADIUS Over UDP Auth Server RADIUS Identity Store e.g. LDAP LDAP Identified 5 common authentication issues from vendor logs Added a few X-Trace instrumentation points sufficient to differentiate these faults Introduced faults in a test environment

802.1X Authentication Service Instrumentation was easy: Nested invocations No in-task concurrency Extensible protocols (RADIUS, LDAP) Modular, request-oriented server software 802.1X Example Faults Misconfigured Firewall: no LDAP Miscalibrated Timeout Value Key: multiple correlated vantage points Can help tune timeout values CoralCDN and OASIS Instrumented production deployment Heavy use of sampling: 0.1% of requests to CoralCDN traced

Leveraged libasync, libarpc X-Trace instrumentation Much more complex program flow E.g. windowed parallel RPC calls, variable timeouts Found bugs, performance problems, clock skews CoralCDN CoralCDN Distributed HTTP Cache CoralCDN Response Times 189s: Linux TCP Timeout connecting

to origin Slow connection Proxy -> Client Slow connection Origin -> Proxy Timeout in RPC, due to slow Planetlab node!

189 seconds Same symptoms, very different causes B/ s K 1 /s B K 10 /s KB 0 10 /s B 1M

Outline Brief X-Trace Intro Case studies 802.1X Authentication Service CoralCDN OASIS Anycast Service Challenges Conclusion Hidden Channels Example: CoralCDN DNS Calls DNS Resolver Tasks A

resolve(foo,*) B Send C foo DNS resolve * Receive In general: deferral structures E.g., queues, thread pools, continuations

Store metadata with the structure Often encapsulated in libraries, high leverage Incidental vs. Semantic Concurrency Forks and joins tricky for nave instrumentation Non-intuitive fork Incorrect join Incidental vs. Semantic Concurrency Extra code annotation fixes the problem Manually change parent of do() events Manually add edges from done() to end Dealing with Black Boxes client

proxy server Ideal scenario: all components instrumented with X-Trace Log all events Dealing with Black Boxes client proxy server Gray-box proxy: passes X-Trace metadata on Log events on the client and server Layering does this automatically

Dealing with Black Boxes client proxy server Black box proxy: drops X-Trace metadata No X-Trace events on proxy or server Can always trace around black box, in client Outline Brief X-Trace Intro Case studies 802.1X Authentication Service CoralCDN OASIS Anycast Service

Challenges Conclusion Revisiting Troubleshooting Device-centric Logs Depends on well syncd clocks Joins on ad-hoc identifiers Needs all ops logged for complete traces No modifications to existing code Task-centric traces Does not depend on clocks (can actually fix

them) Deterministic joins on standardized ids Sample-based tracing possible Requires instrumentation X-Trace Instrumentation Instrumenting is easy in most cases A few key libraries go a long way Can be done iteratively Refining expectations (a la Pip) Partial annotation still useful Independent instrumentation feasible Huge benefits Conclusions

Simple, uniform task graphs useful in debugging, troubleshooting, diagnostics Instrumentation is feasible Causal tracing should be a first-class concept in networked systems Thank you More details on paper For more info: www.x-trace.net www.coralcdn.org

Recently Viewed Presentations

  • Hiroshima dilemma - Humble Independent School District

    Hiroshima dilemma - Humble Independent School District

    Welcome. Pick up the handout . Look over the grammar review notes. Try to complete the "Class Practice and Examples." We are going to the book room @ .
  • Northeast region:

    Northeast region:

    New Jersey. Massachusetts. Connecticut. How did you do and Section 2! Well, how did you do at identifying the states?? Are you ready for the next part…Capitals and Abbreviations!! The following slides will have a state name on them.
  • Variable Frequency AC Source - Bradley

    Variable Frequency AC Source - Bradley

    Variable Frequency AC Source (VFACS) Capable of delivering 208 Vrms and 5 A (35.5V rms for initial testing) ... Use IGBT pairs and DC rails to amplify PWM signal. ... IRF520 MOSFET* FMG2G75US60 IGBT Pair with anti-parallel diodes.
  • La Guerra Civil Espaola La Guerra Civil Espaola

    La Guerra Civil Espaola La Guerra Civil Espaola

    Se organizan las "Brigadas mixtas". Milicias Brigadas Internacionales Partidos y fuerzas políticas Zona rebelde. Zona republicana. Zona rebelde Falange Española y de las JONS: surge de la fusión en 1934 de la Falange Española fundada por José Antonio Primo de...
  • Katalyzované reakce

    Katalyzované reakce

    Katalyzované reakce Kinetika katalyzovaných reakcí, vliv vnitřní difuze na selektivitu Heterogenní katalýza Chemisorpce Reakce látky s aktivním místem na povrchu katalyzátoru Adsorpční izoterma Závislost rovnovážného množství adsorbované látky na složení směsi Langmuirova izoterma adsorpce v jediné vrstvě stupeň pokrytí povrchu...
  • More Ap Comparative Government Theory

    More Ap Comparative Government Theory

    CORPORATISM (mixed autonomy): fewer groups compete, usually one for each interest sector (ex: labor, agriculture, management) Group monopoly over its sector is officially approved by state State Corporatism. Interest groups take the lead and dominate state Societal Corporatism or Neo-Corporatism
  • Instrumentation: Validity Measuring Caring in Nursing

    Instrumentation: Validity Measuring Caring in Nursing

    Criterion-Related Validity: Predictive Validity. Test scores are related to a criterion. Criterion is behavior that test scores are used to predict. Correlation coefficient X is test score and Y is criterion score. predictive or concurrent validity estimate is established by...
  • ITU in Asia Pacific : Cooperation with APNIC

    ITU in Asia Pacific : Cooperation with APNIC

    to explore ways and means for greater collaboration and coordination between ITU and relevant organizations2 involved in the development of IP-based networks and the future Internet, through cooperation agreements, as appropriate, in order to increase the role of ITU in...