Exnay-Javaway In the Clouday (an experience report)

Big Data, Data Mining, Tools

N = ALL

CORRELATION vs CAUSATION

Data Sources...

Data Creation, Storage, Costs

Infrastructure

NoSQL Flavors
https://www.youtube.com/watch?v=qI_g07C_Q5I

NoSQL
o Not Only SQL (sort of)
o Greater scalability
o Designed for distributed computing and commodity (not cheap) hardware
o Variety of flavors: https://www.youtube.com/watch?v=qI_g07C_Q5I

Topic: Algorithms

Tools

Speaking of the Cloud

High Level Flow Example

Hadoop

MapReduce

HDFS
o Distributed file system
o Write-once/read-many
o Fault tolerance / redundancy
o Processing logic close to data
http://www.ibm.com/developerworks/library/wa-introhdfs/
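
A minimal sketch of the MapReduce flow, assuming the usual map, shuffle, and reduce stages and using word count as the example. This is plain single-process Python, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are made up for illustration, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, the step a framework runs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence and return {word: count}, ordered by word."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["the fog the mud", "fog everywhere"]  # hypothetical input lines
    print(word_count(sample))
```

In a real cluster each mapper would run on the node that holds its block of the input file (processing logic close to data), and the shuffle would move the grouped pairs across the network to the reducers.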

Hive

    CREATE TABLE docs (line STRING);

    CREATE TABLE word_counts AS
    SELECT word, count(1) AS count
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
    GROUP BY word
    ORDER BY word;

Hive with Some Structure

Data

    123    F
    456    M
    789    M
    111    M
    222    M
    333    F
    444    F
    555    M

    create table if not exists p_genders (
        p_id string,
        gender string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    SELECT * from p_genders;

Pig Latin

    A = load 's3://pmb4bucket/input/bleakhouse/bleakhouse.txt';
    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
    C = group B by word;
    D = foreach C generate COUNT(B), group;
    store D into 's3://pmb4hadoop/output/bleakhouse';

Complex Event Processing

Tools

Data Scientist

o Not just a bean counter - it's about modeling
o General skill set:
    o Math (linear algebra, statistics, calculus, discrete math)
    o Business sense
    o Programming skills
    o Communication
    o etc., etc., etc.
https://www.youtube.com/watch?v=ceeiUAmbfZk

Our Schedule

o Setting the goals for a data mining project
o Setting up KNIME
o Gathering and preparing data
o Visualization
o Machine Learning
o Naive Bayes
o Clustering and Classification
o Dimension reduction

But first

