Spark vs R Performance Tradeoff - Computer Science at ...

Information Technology @ Eastman

Skills for Success
[Chart: three segments of 33.3% each]

Business Process Skills
  • Order to Cash, Forecasting & Budgeting, etc.
  • Process Modeling
  • Project Management

Technical Skills
  • Concepts, not just specific programming languages
  • ERP
  • Mobility
  • Service-Oriented Architecture (SOA)
  • Software as a Service (SaaS)
  • Cloud computing
  • Infrastructure

Soft Skills
  • Adaptable (lifelong learner)
  • Self Directed
  • Analytical
  • Leadership
  • Teamwork
  • Communication

Possible Roles in IT
  • Data Sciences
  • Business Solutions (ERP)
  • SAP ERP Application Dev
  • Program Office
  • Infrastructure Services
  • Six Sigma
  • IT Security

Why Eastman?
  • Career growth opportunities
  • Competitive compensation
  • Work/life balance
  • Benefits including:
      • 80 hours vacation plus 11 paid holidays
      • Health care
      • Retirement contribution

Further Questions?
  • See the Careers section on www.eastman.com
  • Contact Chad Drinnon at [email protected]
  • Contact Melinda LaPrade, College Director of Career Services, at [email protected]

Capstone project ideas from Eastman
Business Analytics Group

Comparison of Spark vs R
Business Analytics Group

R at Eastman
  • R is the primary tool for analyzing business data at Eastman
  • Analysis is done in-memory
      • Typical analyst workstation has 16 GB of RAM
      • Servers have more
  • Single threaded*
  * There are ways to leverage multiple cores/threads, but they don't always work

Apache Spark
  • Spark is so hot right now!
  • Includes machine learning, graph analytics and more
  • Fast and scalable
  • Replaces MapReduce and offers a SQL-like query language (Spark SQL)
  • Can use HDFS and other filesystems

Tradeoff
  R
      • We know R very well
      • Most of our datasets today fit in memory
      • Development ecosystem and workflow are already established
  Spark
      • Scalable to handle big data
      • Native multi-threading with machine learning should be faster even on medium-size problems
      • Spark is new and unknown to us; it requires training and change management

Capstone project
  • When should one switch to Spark from R?
  • Given increasingly larger data sets (more rows and more columns), how does Spark performance compare to R in fitting different basic models (e.g. linear regression)?
  • How does a single instance of Spark compare to a small cluster (in AWS, for example)?
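One way to frame the capstone questions is as a timing harness: fit the same model on data sets of increasing size with each tool, then look for the size at which one tool pulls ahead. The sketch below is only an illustration under stated assumptions. `make_dataset`, `fit_ols`, `time_fit` and `crossover_size` are hypothetical names, the pure-Python `fit_ols` is a stand-in for whatever fit is actually being timed (for example R's `lm()` or Spark MLlib's `LinearRegression`, each called from its own wrapper), and the 2x `speedup` threshold is an arbitrary reading of "significantly faster".

```python
import random
import time


def make_dataset(n_rows, n_cols, seed=0):
    """Synthetic data (hypothetical): y = 1*x1 + 2*x2 + ... + small noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    y = [sum((j + 1) * v for j, v in enumerate(row)) + rng.gauss(0.0, 0.1)
         for row in X]
    return X, y


def fit_ols(X, y):
    """Stand-in model fit: least squares (no intercept) via normal equations."""
    p = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def time_fit(fit, X, y, repeats=3):
    """Best-of-N wall-clock seconds for one model fit."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fit(X, y)
        best = min(best, time.perf_counter() - start)
    return best


def crossover_size(sizes, baseline_secs, candidate_secs, speedup=2.0):
    """Smallest size from which the candidate stays >= `speedup`x faster."""
    answer = None
    for n, base, cand in zip(sizes, baseline_secs, candidate_secs):
        if base >= speedup * cand:
            if answer is None:
                answer = n
        else:
            answer = None  # the advantage did not hold; keep looking
    return answer


if __name__ == "__main__":
    for n in [1_000, 5_000]:
        X, y = make_dataset(n, n_cols=5)
        print(n, f"{time_fit(fit_ols, X, y):.4f}s")
```

In an actual study, one list of timings would come from the R runs and one from the Spark runs (single instance or cluster), and `crossover_size` would be applied to those two lists. Requiring the advantage to hold for all larger sizes, rather than at one size only, guards against a single noisy measurement.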

Deliverable
Eastman would like a report detailing at what point (size of data) the performance of Spark significantly exceeds that of R. We would use this information to make decisions regarding when to begin implementing Spark for analytics at Eastman.

Support
  • Many large, public datasets are available to use with this project. If needed, Eastman will recommend one or more specific large datasets
  • Eastman can provide analytics expertise (e.g. linear regression, logistic regression, etc.) if needed

Why choose this project?
  • A chance to get practical experience with Spark for your resume (it really is hot right now)
  • A chance to get practical experience (or more experience) with distributed processing
  • A chance to learn more about analytics (analytics + computer science = $$$ in the job market)

SQL vs NoSQL Performance with text
Business Analytics Group

Text Analytics at Eastman
  • Eastman generates a lot of text in the form of call reports written by our salesforce after visiting a customer.
  • Recently, web scraping has been added as a source of text data to help us identify consumer sentiment towards chemicals and products of interest.
  • Text analytics is performed on this data to help pull out significant themes and trends.

SQL at Eastman
  • The majority of Eastman's data is stored in a SQL database of some type.
  • The data is typically modelled with a star schema
      • Avoids replication of dimensional data
      • Reasonably fast
  • Web-scraped data is currently stored in star schema form
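To make the star schema idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`fact_review`, `dim_product`, `dim_source`) and the sample rows are invented for illustration and are not Eastman's actual schema. It shows the point made above: product and source details are stored once in dimension tables, and the fact table only carries keys plus the review itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table, with small dimension tables around it.
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_source  (source_id  INTEGER PRIMARY KEY, site TEXT);
    CREATE TABLE fact_review (
        review_id   INTEGER PRIMARY KEY,
        product_id  INTEGER REFERENCES dim_product(product_id),
        source_id   INTEGER REFERENCES dim_source(source_id),
        rating      INTEGER,
        review_text TEXT
    );
""")

# Hypothetical sample data: each product name is stored exactly once.
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Product A"), (2, "Product B")])
conn.executemany("INSERT INTO dim_source VALUES (?, ?)",
                 [(1, "example-reviews.test")])
conn.executemany(
    "INSERT INTO fact_review VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 5, "Great clarity"),
     (2, 1, 1, 3, "Cloudy after a month"),
     (3, 2, 1, 4, "Solid")],
)

# Analytical queries join the fact table back to its dimensions.
rows = conn.execute("""
    SELECT p.name, COUNT(*) AS n_reviews, AVG(f.rating) AS avg_rating
    FROM fact_review AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)
```

Note that the free-text column sits in the fact table with a fixed set of neighbours; adding a new attribute to only some reviews would mean altering the table or adding another one.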

NoSQL
  • "Not Only SQL" is a class of database management systems that are designed to handle specific types of data storage challenges
      • Key Value Pairs
      • Graph Databases
      • Document Stores
      • etc.

Why is there excitement about this emerging class of databases?
  • Data such as text is coming with more variety than traditional data sources and is increasingly a poor fit with relational databases and predefined data models
  • New database designs along with ever-increasing compute power are making schema-on-read feasible
      • Much better fit with unstructured or variable data such as text
  • New database designs work increasingly well in a distributed fashion and are more amenable to big data
      • Text is one of the bigger forms of data.

Deliverables
Eastman would like a report detailing performance comparisons of SQL (e.g. MySQL, SQL Server Express) vs NoSQL (e.g. MongoDB) on a database containing hundreds/thousands of reviews. We would use this data to make decisions regarding our long-term technology strategy for storing text data.

Support
  • Eastman will provide review data, if needed
  • Publicly available text data sets are also available, which Eastman could recommend
  • Eastman will provide technical assistance, if needed, in web scraping new reviews

Why choose this project?
  • A chance to get practical experience with NoSQL databases, which are increasing rapidly in popularity and are a nice resume addition
  • A chance to get practical experience with web scraping
      • Interest in text as a data source is booming and web scraping is important as a way to acquire text data
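The shape of the SQL vs NoSQL comparison can be sketched with the Python standard library alone. Everything here is a stand-in and an assumption: `sqlite3` plays the role of the relational database (not MySQL or SQL Server Express), a list of JSON strings parsed at query time plays the role of a document store with schema-on-read (not MongoDB), and the review data and function names are invented. Timings from this sketch say nothing about the real systems; it only illustrates running the same question against both storage styles and checking that the answers agree before comparing speed.

```python
import json
import sqlite3
import time


def make_reviews(n):
    """Generate n hypothetical reviews; every third one has an extra field."""
    reviews = []
    for i in range(n):
        doc = {"id": i, "product": f"product-{i % 10}",
               "text": "odor noticed" if i % 4 == 0 else "works fine"}
        if i % 3 == 0:
            doc["tags"] = ["packaging"]  # variable field, present only sometimes
        reviews.append(doc)
    return reviews


def load_sql(reviews):
    """Relational stand-in: fixed tables, so the optional 'tags' are dropped."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.execute("CREATE TABLE review (id INTEGER PRIMARY KEY, product_id INTEGER, text TEXT)")
    names = sorted({r["product"] for r in reviews})
    conn.executemany("INSERT INTO product (name) VALUES (?)", [(n,) for n in names])
    ids = dict(conn.execute("SELECT name, product_id FROM product"))
    conn.executemany("INSERT INTO review VALUES (?, ?, ?)",
                     [(r["id"], ids[r["product"]], r["text"]) for r in reviews])
    return conn


def count_sql(conn, product, word):
    query = """SELECT COUNT(*) FROM review AS r
               JOIN product AS p ON p.product_id = r.product_id
               WHERE p.name = ? AND r.text LIKE ?"""
    return conn.execute(query, (product, f"%{word}%")).fetchone()[0]


def load_docs(reviews):
    """Document-store stand-in: one JSON string per review, any fields allowed."""
    return [json.dumps(r) for r in reviews]


def count_docs(docs, product, word):
    hits = 0
    for raw in docs:
        doc = json.loads(raw)  # schema-on-read: structure is interpreted here
        if doc.get("product") == product and word in doc.get("text", ""):
            hits += 1
    return hits


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    reviews = make_reviews(10_000)
    conn, docs = load_sql(reviews), load_docs(reviews)
    for label, fn, store in [("sql", count_sql, conn), ("docs", count_docs, docs)]:
        hits, secs = timed(fn, store, "product-0", "odor")
        print(label, hits, f"{secs:.4f}s")
```

One detail to watch when comparing results: SQLite's `LIKE` is case-insensitive for ASCII text while Python's `in` is case-sensitive, so the two counts only match here because the sample text is all lower case. A real comparison would need the same care to make sure both systems answer the same question.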
