Overview: The Engagement and Performance Operations Center

Dr. Jennifer M. Schopf, PI, EPOC
Indiana University International Networks
National Science Foundation Award #1826994

Jason Zurawski, Co-PI, EPOC
ESnet / Lawrence Berkeley National Laboratory

Engagement and Performance Operations Center
- Joint project between Indiana University and ESnet
  - Co-PIs Jent (IU GlobalNOC) and Zurawski (ESnet)
- Part of the CC* program for domestic science support
  - Program Officer: Kevin Thompson
  - Award #1826994, $3.5M over 3 years
- Partnerships with regional, infrastructure, and science communities that span the NSF and DOE continuum of funding

02/06/2020 2019, Engagement and Performance Operations Center (EPOC)

Why an Engagement Operations Center?
- Today's science is collaborative science
- Collaborative science
  - Multiple partners
  - Multiple data sets
  - Many points of connection
  - Cross agency cooperation
- With better access to data we ask harder questions
- Interactive data sources change the science we do

Understanding End-to-End Performance is Hard
- Lots of pieces: host system through networks to host system
- No one controls all the pieces
- Unknown expectations for what performance should be
- Soft failures are hard to find
- Many, many points of coordination

Partners are Needed to Scale Engagement
- Regional Network Partners
  - Large scale collaborations, each of which supports multiple institutions in a given geographical region
- Infrastructure Partners
  - Collaborative teams that coordinate on special areas
- Science Community Partners
  - Significant groups of application domain specialists
- Support that spans federal funding bodies
- Sharing knowledge and lessons between communities

Initial Regional Network Partners
- The Indiana State Network (I-Light)
- The Ohio State R&E Network (OARnet)
- The Keystone Initiative for Network Based Education and Research (KINBER)
- The Great Plains Network (GPN)
- The Texas State R&E Network (LEARN)
- The Front Range Gigapop (FRGP)

Initial Infrastructure Partners
- XSEDE
  - Uses Campus Champions to support a single virtual system
- Campus Research Computing Consortium (CaRC)
  - Consortium of 30+ campuses to facilitate access to CI
- NSF Cybersecurity Center of Excellence (CCOE)
  - Supports cybersecurity for NSF funded projects
- Science Gateway Community Institute (SGCI)
  - Support for scientists building/using data portals
- Internet2
  - Supports 200+ educational, research and community members
- The Quilt
  - Provides a central organization to share best practices

Initial Science Community Partners
- Earth Science Information Partners (ESIP)
  - 180+ member consortium of Earth science data and technologists
- World Climate Research - International Climate Network
  - 1,000s of Earth System scientists using climate repositories
- IU Grand Challenge Precision Health Initiative
  - Broad set of precision health applications
- University of Hawaii System Astronomy Community
  - 15+ astronomy facilities
- Midwest Big Data Hub (MBDH)
  - 12 state collaboration to support the use of data
- The Open Storage Network (OSN)
  - Will support dozens of applications across a broad set of domains

Related Efforts: CI Engineering Coordination
- Discussion community created during the early rounds of CC* funding
- Target was to give the newly minted CI Engineers a place to ask questions/discuss items of importance
- Opened wider to include members of the R&E networking/computing communities: ~500 members
- Weekly talks on various topics: most Fridays @ 2pm ET starting March 29th
  - Join list for announcements
- Join the e-mail list: https://groups.google.com/a/lbl.gov/forum/#!contactowner/esnet-cybinf-engr

Expansion
- EPOC is a starting point, not a destination
  - Fully anticipate bringing on additional partners that provide networking, infrastructure, or a focus on other scientific areas
  - No one will be turned away
  - Mechanisms in place to on-board new participants
- Department of Energy Laboratories and Facilities
  - ESnet Science Engagement has a long history of interaction on addressing reported problems or assisting in evaluating existing and new science efforts
  - Emerging program (via EPOC and regular operation) in embedding resources into a project/facility

EPOC Five Main Focus Areas
1. Roadside Assistance for Performance Problems
2. Application Deep Dives
3. Network Analysis (NetSage)
4. Services "in a box" (DMZ, testpoint in a box, etc.)
5. Training


Something's Broken
- "My file transfers were working fine last week but this morning nothing runs?"
- My side? Their side? Something in between?

EPOC Roadside Assistance
- [email protected]
- "This file transfer worked last week, but it doesn't anymore?"
- Think of this like a flat tire, crash repair
- EPOC is a collaboration of 3 teams already supporting this
  - ESnet Science Engagement ([email protected])
  - [email protected]
  - IRNC NOC Performance Engagement Team (PET)

Roadside Assistance Process
- Anyone can submit
  - Don't have to be NSF funded or at a specific university
- Contact [email protected]
- Within 24 hours, gets triaged
  - Some initial investigation to verify the issues
  - A Case Manager and Lead Engineer are assigned
  - Shareable infrastructure set up

Problem: Many orgs involved results in many tickets, which no one has all the info about
- Solution: Local experts on both sides pulled in early
  - Partner with campus champions, CaRC, XSEDE, Regionals, etc.
- Solution: Folder for all engineering docs shared across orgs
  - No single ticketing system is open to everyone
  - Folder has docs, lists of tickets (and contacts), maps/diagrams, etc.
  - Anyone working on the case gets access

Problem: Submitter often doesn't know status
- Solution: online shareable Customer Case Document
  - Written for non-engineers
  - Updated frequently, generally twice a week
  - General overview and goal of the case
  - Contact points (Case Manager)
  - Current status and next steps listed

Troubleshooting
- Understanding the end-to-end path of the data transfer
- Use public test and validation services to identify potential issues
- Coordination with a wide set of engineering help staff along the data transfer path

Outcomes
- Ticket stays open until reporter is satisfied with result
- Write ups follow
  - Engineering guidance added to http://fasterdata.es.net
- Sharing how to solve problems is as important as solving them for us!

Consulting
- Lighter weight than a full roadside assistance
- Submission process is the same: contact [email protected]
  - Suggestions for DTNs and DMZs
  - Firewalls and DMZs
  - Data projections for science fields
  - Expected (real) performance between two sites
  - Advice on how to conduct a performance assessment of a network and applications
  - Or others!
- Similar operations approach
- Results/suggestions will be added to fasterdata.es.net over time

EX: PanSTARRS Poor Performance
- Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) collects and shares data to enable researchers to more accurately estimate galaxy redshifts, improving their understanding of the local cosmic expansion and dark energy
- Regular 100TB data transfers
  - Institute for Astronomy at University of Hawaii (UH)
  - Space Telescope Science Institute at Johns Hopkins University
- Experienced only 320 Mbps speeds
  - End-to-end path believed to be 10G or 100G, so expected multi-Gbps at least
- Involved engineers from
  - International Networks at Indiana University ([email protected])
  - IRNC NOC Performance Engagement Team (PET)
  - ESnet
  - Mid-Atlantic Crossroads (MAX)
  - Internet2 NOC
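To make the gap between observed and expected performance concrete, the following is a minimal sketch (not part of the original slides) that estimates how long one 100TB transfer takes at a given sustained rate. The function name and the 5 Gbps comparison rate are illustrative assumptions; decimal units (1 TB = 10^12 bytes) are assumed, and protocol overhead is ignored.

```python
def transfer_time_hours(size_terabytes: float, rate_mbps: float) -> float:
    """Hours needed to move size_terabytes at a sustained rate of rate_mbps.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Mbps = 1e6 bits/s) and
    ignores protocol overhead, so real transfers will take somewhat longer.
    """
    bits = size_terabytes * 1e12 * 8          # terabytes -> bits
    seconds = bits / (rate_mbps * 1e6)        # bits / (bits per second)
    return seconds / 3600


# 320 Mbps and 1 Gbps are the rates quoted in the slides;
# 5 Gbps is a hypothetical "multi-Gbps" target for comparison only.
for label, rate in [("320 Mbps", 320), ("1 Gbps", 1000), ("5 Gbps", 5000)]:
    hours = transfer_time_hours(100, rate)
    print(f"{label:>9}: {hours:7.1f} h ({hours / 24:.1f} days)")
```

Under these assumptions a single 100TB transfer at 320 Mbps takes roughly 29 days, which is why a path believed to support multi-Gbps rates prompted an investigation.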

EX: PanSTARRS: Problem Identification 1 (The Usual Suspects)
- perfSONAR testing identified JHU did not have a 10G connection through MAX to Internet2
  - Campus network upgraded to MAX/Internet2
- perfSONAR testing identified default UH to CONUS route was 10G
  - Updated default route to PIREN 100G Hawaii to LA
- Maximum Transmission Unit (MTU) setting on several routers was less than the recommended 9000 byte frame size (Jumbo Frames)
  - Larger MTU settings make data transmissions more efficient, because the CPUs on switches and routers can process a larger payload for each frame, but this only works if each link in the network path, including servers and endpoints, is configured to enable jumbo frames at the same MTU
- TCP buffer settings on end hosts were misconfigured
  - ESnet recommended settings are available at: http://fasterdata.es.net/host-tuning/background/

EX: PanSTARRS: Problem Identification 2 (The Less Usual Suspects)
- At UH, underpowered Top of Rack (TOR) switch bottleneck, misconfigured access control lists, and misconfigured firewalls
  - Equipment placement redesigned to remove bottlenecks from path
- Bespoke and aging software/systems set up
  - Data spread across many unreliable hosts; work in progress to redesign storage approach
  - Software required manual intervention and many hard-to-maintain dependencies; software workflow rewrite being discussed

EX: PanSTARRS Outcome
- Transmission rates went from 320 Mbps to 1 Gbps sustained
- Several additional architectural and software issues were identified, which are now part of the project's longer-term upgrade path

EX: LHC Data Movement Issues between Pakistan and UK
- High Energy Physics, specifically the Large Hadron Collider, is set up to share data from Tier 1 sites, which are large, regional sites storing all or most of the data, to Tier 2 sites, smaller country-level sites, which in turn share data to local universities and researchers
- National Center for Physics (NCP)
  - Tier 2 LHC site at the Quaid-i-Azam University Campus in Islamabad, Pakistan
  - 1G connection to Pakistan national network (PERN)
- Queen Mary University, London
  - Tier 1 site for region
- Transfer rates
  - NCP-QM as low as 40 Mbps
  - NCP-Australia Tier 1: 500 Mbps transfers
  - NCP-ESnet Tier 1: 280 Mbps transfers
- Additional intermittent performance problems over previous 2 years

EX: LHC Pakistan-UK Problem Identification (1)
- A traffic shaping misconfiguration on the NCP connection to PERN limited R&E traffic to 50 Mbps
  - PERN removed traffic shaping for R&E traffic
- Top of rack switch bottleneck between NCP's file transfer node and edge router
  - Moved file transfer node to the edge router, performance increased from 40 Mbps to 100 Mbps or better
- Small amounts of ongoing, intermittent packet loss within the campus network
  - Identified by perfSONAR, cause unclear
  - Moving the data node closer to the edge of their network alleviated the issue
  - Work continues to identify source of loss

EX: LHC Pakistan-UK Problem Identification (2)
- Packet loss identified inside the PERN regional network
  - Specific cause of the loss still unclear, work ongoing
- Additional bottlenecks between PERN and TEIN (Asian) networks
  - 1 Gbps between national and regional network
  - Congestion is common, therefore so is packet loss
  - Upgrade to 10 Gbps being explored
  - Temporary use of commercial path being explored
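The slides do not say how loss was modelled, but a common back-of-the-envelope way to see why even small amounts of packet loss matter on a long path is the Mathis et al. approximation for a single TCP stream: throughput is at most about MSS / (RTT * sqrt(p)). The sketch below applies it with purely hypothetical numbers (150 ms round-trip time, 0.01% loss); the function name and every input value are assumptions for illustration, not measurements from this case, and the constant factor in the formula is taken as 1.

```python
import math


def mathis_limit_mbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Approximate ceiling on single-stream TCP throughput, in Mbps.

    Uses the Mathis et al. approximation rate <= MSS / (RTT * sqrt(p)),
    with the constant factor taken as 1 for simplicity.
    """
    rtt_s = rtt_ms / 1000
    bits_per_second = (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))
    return bits_per_second / 1e6


# Hypothetical long-haul path: 150 ms RTT with 0.01% packet loss.
rtt_ms, loss = 150, 1e-4
standard = mathis_limit_mbps(1460, rtt_ms, loss)   # MSS for a 1500 byte MTU
jumbo = mathis_limit_mbps(8960, rtt_ms, loss)      # MSS for a 9000 byte MTU
print(f"1500 byte MTU: {standard:.1f} Mbps per stream")
print(f"9000 byte MTU: {jumbo:.1f} Mbps per stream")
```

With these assumed inputs a single stream is capped well below 1 Gbps regardless of link capacity, and the larger MSS that comes with jumbo frames raises the ceiling in proportion, which is consistent with the emphasis on both loss and MTU in the examples above.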

EX: LHC Pakistan-UK Outcome
- Original data transfer NCP to Queen Mary: 40 Mbps
- After engagement transfer speed: ~480 Mbps
- Additional areas for performance improvements identified
  - Larger scale and longer term changes to infrastructure needed
  - Discussions ongoing


EPOC Deep Dives
- Think of this as regular maintenance, oil change, or planning to buy a new car
- Based on seminal work by ESnet to develop Scientific Case Studies
- Walk through science workflow with the actual scientists
- Way to understand needs and planning
- Often identifies issues that have nothing to do with networks, and everything to do with sociology

Anatomy of a Deep Dive
- Two primary components
  - Narrative
  - Data Estimation
- Built on the ESnet Requirements Review template
  - Contains helper text to guide what is wanted
  - Items that make sense for DOE/ESnet may not make sense for another institution; we're modifying this as we go
- https://fasterdata.es.net/science-dmz/science-and-network-requirements-review/

We Walk Through Scientific Components
1. Background information
   - Brief overview of the facility, nature of the science being performed
2. Collaborators
   - Identify people and institutions that a science group interacts with
3. Instrumentation
   - Local and remote scientific instruments and facilities
4. Process of Science
   - Explain "a day in the life" of the science group
   - Should tie together the instruments, the people, and the resources

And Also More Technical Aspects
5. Software Infrastructure
6. Network and Data Architecture
7. Cloud Services
8. Outstanding Issues and Pain Points
- Local and regional IT staff are critical to these parts, and help form valuable partnerships that may not exist, or could use strengthening

When This Is Done
- Better understanding of the science, data movement, who's using what pieces, dependencies, and time frames
- Identification of bottlenecks or pain points becomes more obvious
- Relationships build between layers (engineering, science, administration)
- Clear path toward improvement and success

Need for Network Instrumentation
- Performance and measurement are 2 sides of a coin
- Common basic measurement data is the first step to understanding performance issues
  - E.g. Global perfSONAR Deployment, http://my.es.net
- NetSage framework
  - SNMP, perfSONAR, Flow, Tstat data
  - Grafana-based dashboards to visualize performance
  - http://portal.netsage.global

Shift of focus to network owner for US-EU Circuit
- Normal day: http://portal.netsage.global
- Monday Feb 5

NetSage for NEAAR link Feb 1-5

NetSage Focus on Use Cases (Questions)
- Bandwidth Dashboard: http://portal.netsage.global
  - How used are the links?
  - Where are congestion points?
- Flow Data Dashboards
  - What are the top sites using the IRNC links?
  - What are the top sources/destinations for an organization?
- Data Archive Monitoring (tstat)
  - Top Source/Destination pairs (volume, rate)
  - Retransmission stats

Data Archive Example

Measurement and EPOC
- Expectation that all Regional Network Partners will have at least a partial NetSage deployment
  - Enable better understanding of network performance issues writ large
- Working on methods to integrate infrastructure partners
  - NERSC/TACC so far; let us know if you want to join!
- Goal is to find problems BEFORE the scientist/user community does
- "If you can't measure it you can't improve it" -Peter Drucker


How to increase adoption/deliver value to partners of smaller size?
- Observations by IU/ESnet after several years of community events (e.g. OIN Workshops):
  - Lots of interest in new technologies
    - May not know of (immediate) use cases
  - Resources (time/$) to design/specify/build are hard to come by
    - Easier to pay for a service that is built for you, or maintain something someone else builds
  - "Why would I need that?"
    - Unfunded mandates have a way of being ignored

What is a Service-in-a-Box?
- Basic idea:
  - Only large facilities with dedicated funding can afford the time/effort to design/install/operate/maintain a dedicated science infrastructure
  - Ameliorate the costs of design/install at a higher level (e.g. regional network)
  - Create infrastructure that can be delivered as a service
  - Operation can be local or regional (offer flexibility based on the environment and resources available)
  - Develop a business model that facilitates cost recovery and upgrade schedules

What is a Service-in-a-Box? Goals
- Offer a way for traditionally smaller/less resourced facilities to use emerging technology to support scientific use cases
- Create new paths for regionals to interact with/learn about/support scientific use cases
- Reduce cost for service deployment and operation
- Increase adoption/improve outcomes on a larger scale
- E.g. turn "I can't do that" into "How did I live without this?"

SIAB Audience
- Deployer
  - Regional network, national laboratory, resource provider, or other centrally located entity
  - Has a staff that is available to handle purchasing, design, implementation, operation of advanced technology
- User
  - Smaller (not to be pejorative) facility that is already a customer of the deployer (e.g. network service, resource consumer, etc.)
  - Doesn't have the above resources to make an advanced service happen
  - Has a use case (Single? Multiple?) that could benefit
  - Not afraid to treat this as an experiment (since the alternative is nada)

SIAB Examples
- Assuming the following are true
  - Limited IT staff & budget
  - Interest in services, but not critical 24/7/365 need for them
- Measurement and Monitoring Service
  - Want the ability to understand performance in/out of campus to remote locations, and assistance in fixing local network design/use areas of friction
- Data Transmission Service
  - User with irregular (e.g. not daily) bulk data movement to/from well known scientific facility

SIAB Locality Assumptions (?)
- Regional/infrastructure providers and their customers have to touch each other somewhere
  - Someplace on the campus, someplace centrally located where others may be present (e.g. put things where that touch point is)
- Hardware specification is time consuming
  - And most come to the same (good) answers on needs/capability (e.g. there aren't many ways to skin the DTN/pS/DMZ cat)
- There is power in bulk purchase
  - "You buy enough meat, they'll give you anything" -Cosmo Kramer

SIAB Locality: perfSONAR (EX)

SIAB Locality: DTN (EX)

Anticipated Offerings
- perfSONAR
- Science DMZ
  - Deployment of regional hardware to support campus high-performance needs
- Data Transfer Hardware/Software
  - Rental or co-location of capable hardware and storage
- Network Capacity Testing
  - Use of 10G/40G/100G/(400G?) hardware to prove out new circuits, or debug old ones
- Security
  - Regionally deployed IDS infrastructure that doesn't impact high performance networking

Training
- Follow on to OIN (http://oinworkshop.com) series that reached over 750 people in the NSF/DOE funding space during the 3 year operational period
- Hands on perfSONAR sessions
  - Especially for small nodes
  - Would include file transfer tests
- How to do an Application Deep Dive
  - Also known as "How to talk to Scientists"
- DMZ/DTN Set Up
- To request, send mail to [email protected] and include "Training Request:" in the subject line

Take Aways
- EPOC is an NSF-funded operations center to help scale science engagement and problem resolution
- Single point of contact to help with end-to-end performance issues
  - [email protected]
- More about EPOC: http://epoc.global
- Jennifer Schopf, [email protected]
- Jason Zurawski, [email protected]
- Dave Jent, [email protected]

National Science Foundation Award #1826994
