Statistical lingua franca is R - Montefiore Institute

Statistical lingua franca is R - Montefiore Institute

Statistical lingua franca Introduction to R Lecture 2 22 Sept 2015 Kyrylo Bessonov 1 Outline 1. Introduction to R language 1. basics 2. loops 3. data structures 2. Visualization

1. rich plotting options 3. Bioinformatics 1. repositories 2. annotation of biological IDs 2 Definition R is a free software environment for statistical computing and graphics1 R is considered to be one of the most widely used languages amongst statisticians, data miners, bioinformaticians and others. R is free implementation of S language Other commercial statistical packages are

SPSS, SAS, MatLab 1 3 R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria (http://www.R-project.org/) History 1993 creation by Ross Ihaka and Robert Gentleman at the University of Auckland Community extended via Packages

Facts initial interpreter ~ 1000 lines of C code GNU project 4 R language features Based on S language (commercial) Created in 1976 by Bell Labs Main moto: "to turn ideas into software, quickly and faithfully" (John Chambers) Memory allocation Fixed allocation at startup on-the-fly garbage collector

memory hungry Variables have lexical scoping functions have access to the variables which were in the effect when the function was defined 5 Lexical scoping The value of the variable is searched in the environment it is being called. If the value is not found, the search is continued in parent environment make.counter <- function(x = 0){ function(){ x <<- x + 1 cat("Inside the make.counter() the x is",x,"and not 5!\n") print(environment())

} } x <- 5 cat("Global value of x is", x , "\n"); print(environment()) counter <- make.counter() counter() print(x) 1) The value of x inside make.counter()is stored even after the function call and belongs to the function environment (i.e. scope) 2) The global value of x is not modified 3) The inner function (i.e. function()) can only access the local value of x More details are given at the functions section of the tutorial

6 R language features Flexible Plot Layouts The layout command divides up the area into a matrix. nf <- layout(matrix(c(1,1,0,2), 2, 2, byrow=TRUE), respect=TRUE) Calling layout.show()results in the layout being displayed for your reference. 7 R language features

A flexible statistical analysis toolkit Statistical models Regression, ANOVA, GLM, trees Free data analysis No subscription fees Transparent code Is a language Can write complex scripts Not limited by the GUI Active community 8

Why to learn R? R is widely used by bioinformaticians and statisticians It is multiplatform iOS, Windows, Linux Large number of libraries Main library repositories CRAN and BioConductor 9 In a nut-shell

R is a scripting language and, as such, is much more easier to learn than other compiled languages such as C R has reasonably well written documentation (vignettes) Syntax in R is simple and intuitive if one has basic statistics skills R scripts will be provided and explained in-class 10 Installation of R 1. go to http://www.r-project.org/ 2. select CRAN (Comprehensive R Archive Network) from left menu

3. link to nearby geographic site 4. select your operating system 5. choose "Base" installation 6. save R-X.X.X-win32.exe (windows) or R-X.X.Xmini.dmg (Mac OS X) 7. run the installation program accepting defaults 11 Tutorial Introduction to R language Part 1: Basics 12 Topics covered in this tutorial Operators / Variables

Main objects types Visualization plot modification functions Writing and reading data to/from files Bioinformatics applications Annotation of array probes 13 Variables/Operators Variables store one element x <- 25 Here x variable is assigned value 25

Check value assigned to the variable x >x [1] 25 Basic mathematical operators : arithmetic: + , - , * , / power: ^ Use parenthesis to obtain desired sequence of mathematical operations 14 Arithmetic operators What is the value of small z here? >x <- 25 > y <- 15 > z <- (x + y)*2

> Z <- z*z > z [1] 80 15 Logical operators These operators mostly work on vectors, matrices and other data types Type of data is not important, the same operators are used for numeric and character data types Operator < <=

> >= == != !x x|y x&y Description less than less than or equal to greater than greater than or equal to exactly equal to not equal to

Not x x OR y x AND y 16 Logical operators Can be applied to vectors in the following way. The return value is either True or False > v1 [1] 48 2 3 4 5 > v1 <= 3 [1] FALSE TRUE TRUE FALSE FALSE

17 Key functions mathematical functions sqrt, log, exp, sin, cos, tan simple functions max, min, length, sum, mean, var, sort other useful functions abs(-5) #absolute value exp(8) # exponentiation log(exp(8)) # natural logarithm sqrt(64) # square root 18

R workspace Display all workplace objects (variables, vectors, etc.) via ls(): >ls() [1] "Z" "v1" "x" "y" "z" Useful tip: to save workplace and restore from a file use: >save.image(file = " workplace.rda") >load(file = "workplace.rda") 19 How to find help info?

Any function in R has help information To invoke help use ? Sign or help(): ? function_name() ? mean help(mean, try.all.packages=T) To search in all packages installed in your R installation always use try.all.packages=T in help() To search for a key word in R documentation use help.search(): help.search("mean") 20

Basic data types Data could be of 3 basic data types: numeric character logical Numeric variable type / class: > x <- 1 > mode(x) [1] "numeric" > class(x) [1] "numeric" 21 Basic data types

Logical variable type (True/False): > y <- 3<4 > mode(y) [1] "logical" Character variable type: > z <- "Hello class" > mode(z) [1] "character" 22 Objects/Data structures The main data objects in R are: Matrices (single data type)

Data frames (supports various data types) Lists (contain set of vectors) Other more complex objects with slots S3 and S4 objects Matrices are 2D objects (rows/columns) > m <- matrix(0,2,3) > m [,1] [,2] [,3] [1,] 0 0 0 [2,] 0 0 0 23 Vectors c()

Vectors have only 1 dimension and represent enumerated sequence of data. They can also store variables > v1 <- c(1, 2, 3, 4, 5) > mean(v1) [1] 3 The elements of a vector are specified /modified with braces (e.g. [number]) > v1[1] <- 48 > v1 [1] 48 2 3 4 5 24

Lists Lists contain various vectors. Each vector in the list can be accessed by double braces [[number]]. Lists could be of different types > x <- c(1, 2, 3, 4) > y <- c(2, 3, 4) > L1 <- list(x, y) > L1 [[1]] [1] 1 2 3 4 [[2]] [1] 2 3 4 25

Matrices Contain data of the same type i.e. all cells are one of the 3 basic types Defined with matrix() Arguments: initialization value (i.e. 0) ncol: number of columns nrow: number of rows matrix(0,ncol=2,nrow=2) [,1] [,2] [1,] 0 0

[2,] 0 0 26 Data frames Data frames are similar to matrices but can contain various data types > x <- c(1,5,10) > y <- c("A", "B", "C") > z <-data.frame(x,y) x y 1 1 A 2 5 B

3 10 C To get/change column and row names use colnames() and rownames() 27 Factors Factors are special could be vectors of integers or characters can identify levels i.e. unique variables in a vector > letters = c("A","B","C","A","C","C") - the vector letters has 6 chars but 3 levels > letters = factor(letters)

[1] A B C A C C Levels: A B C > summary(letters) A B C 2 1 3 - summary()allows to see - number of variables per level - unique variables (levels) 28 Conversion between data types One can convert one type of data into another using as.xxx where xxx is a data

type 29 S4 objects Allow to store more complex data strutures Benefits Provide greater modularity Check for data type errors during assignment Cleaner code Every S4 objects belongs to a class require(methods) setClass("Box", slots=c(box_name = "character",

address = "numeric", files = "list) ,prototype=list(box_name=NULL, address=NULL, files=list()) ) 30 S4 objects Each object is an instance of a class object = new("Box") To access slots in the object use @ operator [email protected]_name [email protected]

[email protected] Assign values to slots [email protected]_name="pretty_box" 31 S4 object functions Use functions to get and set values in objects setMethod(f="getBoxName", signature="Box", definition=function(object){ return([email protected]_name) }) setMethod(f="setBoxName", signature="Box", definition=function(object,name){ [email protected]_name <-name })

32 Read Input/ Write Output To read data into R from a text file use read.table() read help(read.table) to learn more scan() is a more flexible alternative raw_data <-read.table(file="data_file.txt") To write data into R from a text file use write.table() > write.table(mydata, "data_file.txt") 33

Loops Loops allow repetition operations can check given condition execute n number of times R is not specifically designed for loops use them sparsely or replace by apply() avoid nested loops Will look at for loop while loop repeat loop 34

For loop Executed n number of times safe to use variable i will take values found in the array (i.e. 1:10) for(i in 1:10){ print(i) } 35 while loop Will test infinitely for a particular condition until it becomes TRUE i=1;

while(i<=10){ print(i) i=i+1 } 36 repeat similar to the while loop will always begin the loop Executed at least once Important to have a breaking condition otherwise, infinite loop x <-1

repeat { x <- x+1 #if(x >= 5)break; } print(x) 37 apply() A faster alternative to loops Apply function to 2D data structures (matrix, df) apply(X, MARGIN, FUN, ...) X: is matrix or data frame (df) MARGIN: 1-columns and 2-rows FUN: function to apply to rows or columns M <- matrix(1:6, nrow=3, byrow=TRUE)

apply(M, 1, sum) apply(M, 2, sum) #columns #rows 38 lapply() is applied on lists / vectors it does not work on matrices or higherdimensional arrays directly returns list() M <- matrix(1:6, nrow=3, byrow=TRUE) lapply(1:2,function(x){sum(M[,x])})

L=list(sample(1:10,10)) lapply(L,mean) 39 sapply() the "s" in "sapply" stands for "simplify simplifies output to a vector M <- matrix(1:6, nrow=3, byrow=TRUE) sapply(1:2,function(x){sum(M[,x])}) [1] 9 12 40 tapply() Apply function to subset data matrix

tapply(summary, group, Function) Summary variable: selected variable to apply function on Group variable: the filter variable used to split data Function: function to apply to resulting sets medical.data

Conditional statements allow to make choices specific parts of the code executed Add flexibility Main statements if else switch 42 If..else if (test expression) { #code

} else if (test expression) { #code } else () { #code } else()does not have any test expression accepts all cases not covered by if()or else if() test expression returns logical value (T or F) 43 If..else sign <- function(x){ if(x > 0){ print("x is a positive number");

} else if (x == 0){ print("x is zero"); } else{ print("x is negative") } } x<- 5; sign(x); x<- 0; sign(x); x<- -2; sign(x); 44 swich() switch (statement, list) the statement is evaluated and value returned based on this value, the corresponding item in the

list is returned y <- rnorm(5) x <- "sd" z <- switch(x,"mean"=mean(y),"median"=median(y),"variance"=var(y),"sd"=sd(y)) print(z) sd(y) 45 Tutorial Introduction to R language Part 2: Functions 46

Functions in R Functions encapsulate chucks of code Helps with debugging and code readability Variables passed by pass by value add.function<-function(x){ x+5 } > add.function(2) [1] 7 47 Functions in R

defined with the function() directive stored as R objects can be nested The return() directive is the last expression in the function body to be evaluated f <- function() { # Do something interesting return(1) } 48 Arguments

named arguments optionally could have default values not every function call in R makes use of all arguments i.e. function arguments could be missing then the default values are used mydata <- sample(1:10,20, replace=T) sort(mydata) sort(mydata, decreasing=T) 49 Arguments can be matched by

relative sequential position: f(arg1, arg2, arg3) rnorm(100,0,1) name: f(data=arg1, replace=arg2) rnorm(n=100, mean = 0, sd = 1) can mix positional and by name matching mydata=data.frame(y=rnorm(100),x=rnorm(100)) lm(data = mydata, y ~ x, model = FALSE, 1:100) #equivalently lm(y ~ x, mydata, 1:100, model = FALSE) 50 Lazy evaluation Arguments to functions are evaluated lazily i.e. R does not check the # of arguments supplied to

function i.e. can give less but not more args to a function args are evaluated as needed f <- function(a,b) { a^2 } f(a=10) #no error even if b is missing 51 The argument The ... argument indicates a variable # of arguments adds flexibility and abstraction Uknown number of variables to be passed to a function apriori

x=1:10; y=rnorm(10) f <- function(x, y, ...) { cat("value of x:",x,"\n"); cat("value of y:",y,"\n"); plot(x,y, ...) } f(x,y) #adding extra argument type=l f(x,y, type="l") 52 Scoping Variable value is searched through a series of environments the global environment

the namespaces of loaded libraries search() [1] ".GlobalEnv" "package:stats" [4] "package:grDevices" "package:utils" [7] "package:methods" "Autoloads" "package:graphics" "package:datasets" "package:base" The global environment is the users workspace 53

Scoping Loading a package via library()adds the namespace of that package to position 2 of the search list library(igraph) search() [1] ".GlobalEnv" [4] "package:graphics" [7] "package:datasets" [10] "package:base" "package:igraph" "package:stats" "package:grDevices" "package:utils"

"package:methods" "Autoloads" 54 Scoping The scoping rules for R different from the original S language uses lexical scoping related on how R searches for the variable value simplifies statistical calculations f <- function(x, y) { y=2x+z

} z is a free variable not defined in f() 55 Searching for the value free variable value will be searched in the environment in which a function was called In the parent environment In the top-level environment (i.e. global) In the empty environment if value not found error thrown 56

Scoping Typically, the user defined values are found in the workspace (i.e. global environment) a function is defined in the global environment can have functions defined inside other functions the environment in which a function is defined could be environment of another function child parent environments build.power <- function(n) { p <- function(x) { x^n }

print(p) } square <- build.power(2) cube <- build.power(3) 57 Lexical Scoping With lexical scoping the variable value is looked up in the environment in which the function was defined y <- 10 f <- function(x) { y <- 2 y^2

} f() [1] 4 58 Other languages Lexical scoping is supported in Perl Python 59 Dynamic scoping With dynamic scoping

The variable value is looked up in the environment from which the function was called (the calling environment) y <- 10 f <- function(x) { y^2 } f() [1] 100 60 Regression and Generalized Linear Models Functions Examples

61 Regression Regression analysis is used to explain or model the relationship between a single variable Y, called the response, output or dependent variable, and one or more predictor, input, independent or explanatory variables, X1, , Xp p=1 simple regression p>1 multivariate regression 62 Main uses of regression analysis

Prediction of future observations. Assessment of the effect of, or relationship between, explanatory variables on the response. A general description of data structure 63 Regression in R the basic syntax for doing regression in R lm(Y~model)to fit linear models glm(Y ~model)to fit generalized linear models. where model could be specified following

the general syntax rules in R consists of predictor terms provides Y~X relationship 64 Model syntax 65 Simple linear regression The regression of Y on X is given by for i = 1, , p Unknown parameters intercept

point in which the line intercepts the y-axis slope or coefficient Increase in Y per unit change in X 66 Objective To find the equation of the line that best fits the data. find andsuch that the fitted values of are as close as possible to observed

Residuals The difference between the observed value and the fitted value 67 residual sum of squares(RSS) A usual way of calculating and based on the minimization of the sum of the squared residuals, or residual sum of squares (RSS) = 68

Regression example thuesen dataset Ventricular shortening velocity 24 rows and 2 columns short.velocity (Y) blood.glucose (X) It contains ventricular shortening velocity and blood glucose for type 1 diabetic patients. 69 Regression example file="http://www.montefiore.ulg.ac.be/~kbessonov/present_data/ GBIO0009-1_TopInBioinf2015-16/lectures/L2/thuesen.txt"

data <- read.table(file, header=TRUE, stringsAsFactors=FALSE) options(na.action =na.exclude) fit.lm <- lm(short.velocity ~ blood.glucose, data=data) data.frame(data, fitted.value=fitted(fit.lm), residual=resid(fit.lm)) 1 2 3 4 5 blood.glucose 15.3 10.8

8.1 19.5 7.2 short.velocity fitted.value residual 1.76 1.433841 0.326158532 1.34 1.335010 0.004989882 1.27 1.275711 -0.005711308 1.47 1.526084 -0.056084062 1.27

1.255945 0.014054962 70 Measuring Goodness of Fit Coefficient of Determination, R2 Need to calculate TSS, RSS and SSreg The ANOVA breaks the total variability observed in the sample into two parts TSS = SSreg + RSS TSS: total sum of squares (entire sample variability) SSreg: regression sum of squares Explained by fitted model RSS: residual sum of squares (unexplained variability)

71 Calculating R 2 anova(fit.lm) Analysis of Variance Table Response: short.velocity Df Sum Sq Mean Sq F value Pr(>F) blood.glucose 1 0.20727 0.207269 4.414 0.0479 * Residuals 21 0.98610 0.046957 ---

R2 = SSreg/TSS R2 = 0.20727/(0.98610+0.20727) R2 = 0.1736846 72 Getting lm() summary summary(fit.lm) Call: lm(formula = short.velocity ~ blood.glucose, data = data) Residuals: Min 1Q Median 3Q

Max -0.40141 -0.14760 -0.02202 0.03001 0.43490 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.09781 0.11748 9.345 6.26e-09 *** blood.glucose 0.02196 0.01045 2.101 0.0479 * --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 0.2167 on 21 degrees of freedom (1 observation deleted due to missingness)

Multiple R-squared: 0.1737, Adjusted R-squared: 0.1343 F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479 * - are shorthand for significance levels Estimate - is the value of slope calculated by the regression Std. Error - Measure of the variability in the estimate for the coefficient t value - measures whether or not the coefficient for this variable is meaningful for the model. It is used to calculate the significance levels. R-squared - Metric for evaluating the goodness of fit of your model. Higher is better with 1 being the best. DF - The Degrees of Freedom is the difference between the number of observations (24) and the number of variables used in your model minus 1 (intercept counts as a variable). DF = 24-2-1 = 23 73 Generalized Linear Models Functions The glm() is designed to perform generalized linear models regression on

outcome data that is binary count proportion continuous overcomes limitation of classical regression models do not need to transform the response Y to have a normal distribution N(0, 2)

limitations assumes linear Y~X relationship 74 glm() glm(formula, family = "gaussian", data, ) formula: symbolic description of the model to fit e.g. y~x1+x2 or y~. or y~x1+x2+x1*x2 family: type of distribution to apply to the response variable data: data frame with the variables glm returns an object with slots

summary(obj)provides compact summary 75 logistic regression example Using data on admissions where rank of 1 represents the highest and 4 the lowest prestige. Determine the major variable/factor impacting the admission decision mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv") head(mydata) admit gre gpa rank 1 0 380 3.61 3 2 1 660 3.67

3 fit <- glm(admit ~ gre + gpa + rank, family=binomial(link="logit"), data=mydata) summary(fit) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.449548 1.132846 -3.045 0.00233 ** gre 0.002294 0.001092 2.101 0.03564 * gpa 0.777014 0.327484

2.373 0.01766 * rank -0.560031 0.127137 -4.405 1.06e-05 *** AIC: 467.44 (the lower the better) 76 Exploring glm() object attributes(fit) $names [1] "coefficients" [4] "effects" [7] "qr" [10] "deviance" [13] "iter"

[16] "df.residual" [19] "converged" [22] "call" [25] "data" [28] "method" "residuals" "R" "family" "aic" "weights" "df.null" "boundary" "formula" "offset"

"contrasts" "fitted.values" "rank" "linear.predictors" "null.deviance" "prior.weights" "y" "model" "terms" "control" "xlevels" $class [1] "glm" "lm"

77 Accessing glm S3 object values Can look at beta coefficients of each variable fit$coefficients Can check how good is my fitted model fit$deviance fit$aic Use TAB key for auto-complete 78 Regression in Genetics Response yj for j=1..n where case(1) or control(0)

Could refer to two types of patients or phenotypes Genotypes gi for markers i=1..p To run glm() need to code data Additive Dominant Recesive 79 Data There are genotypes for 1018 individuals at 32 SNP markers 32 columns give the marker genotypes AA-11, AG-12 and GG-22, with NA for missing

The genotypes 890kb region flanking the CYP2D6 gene associated with the metabolism of drugs file="http://www.well.ox.ac.uk/rmott/LECTURES/ LOGISTIC_REGRESSION/ugeno.dat" data <- read.table(file, header=TRUE, stringsAsFactors=FALSE) 80 Coding Additive coding Code genotypes as AA(11) x=0, AG(12) x=1, GG(22) x=2

Recessive coding Code genotypes as AA x=0, AG x=0, GG x=1 Dominant coding additive <- function( x ) { return(as.numeric(factor(x))-1) } recessive <- function( x ) { return ( ifelse( additive(x) > 1, 1, 0 ) )

} dominant <- function( x ) { Code genotypes as return ( ifelse( additive(x) > 0, 1, AA x=0, 0 ) ) } AG x=1, GG x=1 these functions allow to convert a genotype vector in a certain way (i.e. coding) 81 Fitting the additive model Additive coding

x=apply(data[,-c(1:2)],2,additive) data_prep=as.data.frame(cbind(y=data$y,x)) fit.add <- glm(y~m1, data=data_prep, family = 'binomial' ) Call: glm(formula = y ~ m1, family = "binomial", data = data_prep) Coefficients: (Intercept) -2.8695 m1 -0.9872 Degrees of Freedom: 1017 Total (i.e. Null); Null Deviance: 343.7

Residual Deviance: 335.1 AIC: 339.1 1016 Residual 82 additive model example 83 Fitting the dominant model Dominant coding x=apply(data[,-c(1:2)],2,dominant) data_prep=as.data.frame(cbind(y=data$y,x))

fit.dom <- glm(y~m1, data=data_prep, family = 'binomial' ) Call: glm(formula = y ~ m1, family = "binomial", data = data_prep) Coefficients: (Intercept) -2.880 m1 -1.004 Degrees of Freedom: 1017 Total (i.e. Null); Null Deviance: 343.7 Residual Deviance: 336.2 AIC: 340.2

1016 Residual 84 Dominant model example 85 Fitting the recessive model Recessive coding x=apply(data[,-c(1:2)],2,recessive) data_prep=as.data.frame(cbind(y=data$y,x)) fit.recessive <- glm(y~m1, data=data_prep, family = 'binomial' )

Call: glm(formula = y ~ m1, family = "binomial", data = data_prep) Coefficients: (Intercept) -3.126 m1 -15.440 Degrees of Freedom: 1017 Total (i.e. Null); Null Deviance: 343.7 Residual Deviance: 340.1 AIC: 344.1

1016 Residual 86 Recessive model example 87 In-class exercise Part 1 of 2 88 In-class Exercises 1) Create vectors a=(5, 6, 7) and b=(10,3,1) and

obtain a total sum of their elements that are greater than 5 2) Calculate 3) Create matrix A and replace the element located in the 3rd row and 3rd column by the sum of the 1st and 2nd row 89 In-class Exercises 4) Write a function which takes a single argument which is a matrix A (see Q3). The function should return a matrix which is the same as the function argument but every odd number is doubled.

90 Plots generation in R R provides very rich set of plotting possibilities The basic command is plot() Each library has its own version of plot() function When R plots graphics it opens graphical device that could be either a window or a file 91 Plotting functions

R offers following array of plotting functions Function Description plot(x) plot of the values of x variable on the y axis bi-variable plot of x and y values (both axis scaled based on values of x and y variables) circular pie-char Plots a box plot showing variables via their quantiles Plots a histogram(bar plot)

plot(x,y) pie(y) boxplot(x) hist(x) 92 Plot modification functions Often R plots are not optimal at 1st R has an array of graphical parameters Consult here is the full list Some of the graphical parameters can be specified inside plot() or using other graphical functions such as lines()

93 Plot modification functions Function points(x,y) lines(x,y) Description add points to the plot using coordinates specified in x and y vectors adds a line using coordinates in x and y mtext(text,side=3) adds text to a given margin specified by side number

boxplot(x) arrows(x0,y0,x1,y1, angle=30, code=1) abline(h=y) rect(x1, y1, x2, y2) legend(x,y) title() this a histogram that bins values of x into categories represented as bars adds arrow to the plot specified by the x0, y0, x1, y1 coordinates. Angle provides rotational angle and code specifies at which end arrow should be drawn draws horizontal line at y coordinate draws rectangle at x1, y1, x2, y2 coordinates plots legend of the plot at the position specified by x and y vectors used to generate a given plot

adds title to the plot axis(side, vect) locator() adds axis depending on the chosen one of the 4 sides; vector specifying where tick marks are drawn used interactively to select locations with mouse 94 Plot margins Outer margin par(oma=c(3, 2, 2,1))

Fig. margin par(mar=c(5, 4, 4, 2)) Note the num. order down, left, up, right 95 Visualization of data in R cars <- c(1, 3, 6, 4, 9) trucks <- c(2, 5, 4, 5, 12) g_range <- range(0, cars, trucks) plot(cars, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE) lines(trucks, type="o", pch=22, lty=2, col="red")

axis(2, las=1, at=4*0:g_range[2]) axis(1, at=1:5, lab=c("Mon","Tue","Wed","Thu","Fri")) box() title(xlab="Days", col.lab=rgb(0,0.5,0)) title(ylab="Total", col.lab=rgb(0,0.5,0)) 96 Visualization of data in R c1 <- c(1,3,6,4,9); c2 <- c(2,5,4,5,12); c3 <- c(4,4,6,6,16); autos_data <- cbind(c1,c2,c3); colnames(autos_data)<-c("cars", "trucks", "suvs"); barplot(as.matrix(autos_data), main="Autos", ylab= "Total", beside=TRUE, col=rainbow(5)) legend("topleft", c("Mon","Tue","Wed","Thu","Fri"), cex=0.8,

bty="n", fill=rainbow(5)) 97 Visualization of data in R #Expand right side of the margin for the legend par(xpd=T, mar=par()$mar+c(0,0,0,4)) #Graph autos using heat colors #put 10% of the space between each bar, and make #labels smaller with horizontal y-axis labels barplot(t(autos_data), main="Autos", ylab="Total", col=heat.colors(3), space=0.1, cex.axis=0.8, las=1, names.arg=c("Mon","Tue","Wed","Thu","Fri"), cex=0.8) legend(6, 30, colnames(autos_data), cex=0.8, fill=heat.colors(3));

# Restore default margins par(mar=c(5, 4, 4, 2) + 0.1); 98 Visualization of data in R dotchart(t(autos_data), color=c("red","blue","darkgreen"), main="Dotchart for Autos", cex=0.8) 99 Visualization of data in R r <- rlnorm(1000); hist(r)

100 Visualization of data in R r <- rlnorm(1000) # Get the distribution h <- hist(r, plot=F, breaks=c(seq(0,max(r)+1, .1))) # Plot the distribution using log scale on both axes, and # use blue points plot(h$counts[h$counts > 0], log="xy", pch=20, col="blue", main="Log-normal distribution", xlab="Value", ylab="Frequency") 101 In-class exercises Part 2 of 2

102 In-class exercises 1) create two perfectly correlated vectors (with =1) and plot them. 2) Change data points to a dashed line 3) Add a horizontal and vertical red lines to your plot at coordinates: a) (1,10) and (4,4); b) (7,7) and (1,20) 4) Add text to the plot with text() to the left and right of the diagonal line 5) Add title to your plot 6) Add legend to the plot 103

Installation of new libraries There are two main R repositories CRAN BioConductor To install package/library from CRAN install.packages("seqinr") To install packages from BioConductor source("http://bioconductor.org/biocLite.R") biocLite("GenomicRanges") 104 The Comprehensive R Archive Network (CRAN)

CRAN package repository features 7154 available packages install.packages("packageName") Some popular packages ggplot2 beautiful plots Matrix sparse matrices igraph graphs and analysis thereof 105 source("http://bioconductor.org/biocLite.R"); biocLite("packageName") Repository of biology-related libraries in R

178,856 packages Some list of libraries biomaRt IRanges 106 Installation of the new libraries Download and install latest R version on your PC. Go to http://cran.r-project.org/ Install following libraries by running install.packages(c("seqinr", "ape", "GenABEL") source("http://bioconductor.org/biocLite.R") biocLite(c("limma", "muscle",

"affy","hgu133plus2.db","Biostings")) 107 Biological annotation with biomaRt Biological experiments require ID conversions e.g. microarray probe id to gene name e.g. mapping of SNP to gene symbol BioMart online service can be accessed via web-GUI programmatically via biomaRt 108

BioMart Go to http://central.biomart.org/converter/#!/ID_converter/ Select genome assembly version Paste or upload the ID list E.g. covert SOX1 to GOID 109 biomaRt Access BioMart programmatically

install the biomaRt library source("http://www.bioconductor.org/biocLite.R"); biocLite("biomaRt"); require(biomaRt); - use the listMarts() to see the different databases listMarts() - we will use the ensembl mart ensMart<-useMart("ensembl") 110 biomaRt - listDatasets() to see which data sets that are available in the database listDatasets(ensMart)

- Will use homo sapiens dataset ensembl_hs_mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl") - List retrieved attributes (db fields) listAttributes(ensembl_hs_mart)[1:100,] - Download data ensembl_df <- getBM( attributes=c("ensembl_gene_id", "ensembl_transcript_id", "hgnc_symbol","chromosome_name", "entrezgene"), mart=ensembl_hs_mart ) 111 biomaRt

- List of genes to annotate my_genes = c("ENSG00000197971", "ENSG00000153165", "ENSG00000159352", "ENSG00000146006", "ENSG00000149809", "ENSG00000204179", "ENSG00000213023", "ENSG00000115008", "ENSG00000130844","ENSG00000155363") - Annotate IDs my_genes_ann = ensembl_df[match(my_genes, ensembl_df$ensembl_gene_id),] ensembl_gene_id ensembl_transcript_id hgnc_symbol chromosome_name entrezgene 66435 ENSG00000197971 ENST00000382582 MBP 18

4155 66038 ENSG00000153165 ENST00000409886 RGPD3 2 653489 190545 ENSG00000159352 ENST00000368884 PSMD4 1 5710 112 In-class exercise

1) Map SNP rs2066844 to a gene with biomaRt library. What is its gene symbol, full gene name and gene biological function? 113 References [1] Team, R. Core. "R Language Definition." (2000). [2] Durinck, Steffen, et al. "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics 21.16 (2005): 3439-3440. 114 115

In-class exercises (answers) s.103 x=seq(1,10) y=2*seq(1,10) lines(c(1,10),c(4,4)) lines(c(7,7),c(1,20)) text(5, 8, "This is my plot", adj = c(0,0)) plot(x,y, type="c") legend(9,15, "data" , lty=1, col=c('red'), cex=0.8) 116 In-class exercises (answers). s113 library(biomaRt) snp_db=useMart("snp", dataset="hsapiens_snp")

getBM(attributes="associated_gene", filters = "snp_filter", values = "rs2066844", mart=snp_db) 117

Recently Viewed Presentations

  • RSAT Training Tool: Trauma-Informed Correctional Care

    RSAT Training Tool: Trauma-Informed Correctional Care

    BJA's intent is to use the RSAT performance measures to understand trends and changes that grantees experience over time. With this practical understanding, BJA is better able to meet requests from Congress, Office of Justice Programs, the Department of Justice,...
  • Alliance Commissioningdifferent models for better outcomes My Time,

    Alliance Commissioningdifferent models for better outcomes My Time,

    The Stockport Mental Health People Powered Health Project: Cashable Savings and Benefits. Fewer people in expensive services for shorter periods- demonstrated. Improved productivity for clinicians in primary and secondary care- demonstrated. Sustained outcomes and social returns- demonstrated. Reduced use of...
  • PDF Questions Submitted by our customers: Gus Lluberes

    PDF Questions Submitted by our customers: Gus Lluberes

    This data is the first piece of information the search engine finds. Since PDF files are TEXT (unless scanned and not captured as text), search engines will do a full text search on the documents. Search Results Search Criteria A:...
  • Stereochemistry and Drug Action  Isomers and types of

    Stereochemistry and Drug Action Isomers and types of

    Stereochemistry and Drug Action Geometric Isomers (cis/trans) … other examples Trans isomer, i.e., E, is 1000-times more histaminic than cis, Z Vitamin A has all E double bonds, any Z would make it inactive! Stereochemistry and Drug Action Stereoisomers Enantiomers...
  • Postsecondary Planning Night Margaret Griffin, Last Names A-L,

    Postsecondary Planning Night Margaret Griffin, Last Names A-L,

    Attend Financial Aid night September 25, 2018 at 6pm in the library. Fill out the FAFSA starting as early as October 1 of your senior year (based on prior year taxes) Attend our FAFSA completion workshops Oct 9th and Nov...
  • IP Fragmentation - Columbus State University

    IP Fragmentation - Columbus State University

    The router Not the subnet Multiple Fragmentations Original packet may be fragmented multiple times along its route Defragmentation Internet layer process on destination host defragments, restoring the original packet IP Defragmentation only occurs once Fragmentation and IP Fields More Fragments...
  • Drawing Near to God James 4:8 (ESV) 8

    Drawing Near to God James 4:8 (ESV) 8

    Romans 4:6-8 (ESV) 6 Just as David also speaks of the blessing of the one to whom God counts righteousness apart from works: 7 "Blessed are those whose lawless deeds are forgiven, and whose sins are covered; 8 blessed is...
  • A Tipsy Idol, Lowing Cows, and Golden Rats

    A Tipsy Idol, Lowing Cows, and Golden Rats

    The Philistines had two main temples in honor of Dagon-one at Ashdod and one at Gaza, with smaller temples elsewhere. More Facts, Please! Lowing cows-In Bible times the terms used in regard to cattle were the same as now: a...