Data Mining in Bioinformatics Sadik A. Khuder, Ph.D., College of Medicine University of Toledo Introduction Bioinformatics is the science of managing, mining, and interpreting information from biological data. Various genome projects have contributed to an exponential growth in DNA and protein sequence databases. Rapid advances in high-throughput technologies, such as microarrays, mass spectrometry and new/next-generation sequencing, can monitor quantitatively the presence or activity of thousands of genes, RNAs, proteins,

metabolites, and compounds in a given biological state. The ongoing influx of these data, the pressing need to address complex biomedical challenge, and the gap between the two have collectively created exciting opportunities for data mining researchers. What is Data Mining? Data mining (or knowledge discovery) is the process of analyzing data and summarizing it into useful information Information are used to develop new biomarkers, or find a new line of treatment etc.. Data mining aims at the extraction of hidden predictive information from large databases Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it,

and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The challenge Researchers in bioinformatics are estimated to produce 10,000,000,000,000,000 bytes of data each year ( 16 billion of CD-roms). How do we learn something from this data? Commonly many relevant variables in the data E.g. one chip ~30000 variables Typically many cross tabulations

Data mining can help summarize the data Also reduce the chance of obtaining spurious results Some Examples Microarray gene expression data analysis Identification of regulatory binding sites Identification of splice junction sites & translation start sites Experimental technique for inference of protein complexes Phylogenetic analysis Protein domain identification Identification of structural motifs Prediction reliability assessment of protein structures

Predicting protein functional classes such as localization sites and modifications Growth of GenBank and WGS Growth of GenBank and WGS Data in ArrayExpress NGS technology evolution Pareek at al, 2011. J Appl Genet. 2011 Nov;52(4):413-35 NGS technology evolution Pavlopoulos et al. BioData Mining 2013, 6:13

NGS technology evolution Pavlopoulos et al. BioData Mining 2013, 6:13 Metabolomic Data integration workflow for the systematic classification of unknown metabolites. PLoS Genet. Oct 2012; 8(10): Data Mining Feedback

Pre-Processing & Representation Knowledge discovery Intermediate representation Data Background Knowledge Supervised vs. Unsupervised Learning Imagine an organism or machine which experiences a series of sensory inputs: x1, x2, x3, x4, . . .

Supervised learning: The machine is also given desired outputs y1, y2, . . ., and its goal is to learn to produce the correct output given a new input. Unsupervised learning: The goal of the machine is to build representations of x that can be used for reasoning, decision making, predicting things, communicating etc. Supervised vs. Unsupervised Learning Unsupervised learning: The classes are unknown a priori and need to be discovered from the data, e.g. cluster analysis, class discovery, unsupervised pattern recognition Supervised learning- The classes are predefined and the task is to understand the

basis for the classification from a set of labeled objects. This information is then used to classify future observations, e.g. classification, discriminant analysis, class prediction, supervised, pattern recognition Eisen MB, Spellman PT, Brown PO and Botstein D. (1998) Multivariate Analysis Methods Analysis of interdependence (Unsupervised) No variables thought of as dependent Look at the relationships among variables, objects or cases Analysis of dependence (Supervised) Where one (or more) variables are dependent

variables, to be explained or predicted by others Unsupervised Learning Raw Data Dimensionali ty Reduction Representati on Distance or Similarity Matrix

Graphical Validation Clustering Internal Extermal The Dimensionality Problem Caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space. It is a significant obstacle in high dimension data analysis. The sparsity increases exponentially given a fixed amount of data points. 64 data points are simulated form a uniform (0 , 1) distribution.

Points are clustered together more sparse more sparse N<

>>p. In bioinformatics, usually N<100, and p > 1000. How to deal with this N <

Useful for both data modeling and data analysis. Linear Dimensionality Reduction Methods Principal Component Analysis (PCA) Multi-Dimensional Scaling (MDS) Correspondence Analysis Principal components analysis (PCA) An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D For a matrix of m variables x n samples, create a new covariance matrix of size n x n Thus transform some large number of variables

into a smaller number of uncorrelated variables called principal components (PCs). Page 211 Principal Components The first principal component is identified as the vector (or equivalently the linear combination of variables) on which the most data variation can be projected The 2nd principal component is a vector perpendicular to the first, chosen so that it contains as much of the remaining variation as possible And so on for the 3rd principal component, the

4th, the 5th etc. PCA : objectives to reduce dimensionality to determine the linear combination of variables to choose the most useful variables (features)

to visualize multidimensional data to identify groups of objects (e.g. genes/samples) to identify outliers Page 211 Principal Components 1 2 3

4 5 1 C1 C2 C3 C4 C5 C6 Similarity Criterion: Correlatio ns Correlations 6 6

2 Project data points onto new axes (eigenvectors) Calculate eigenvectors with greatest eigenvalues Linear combinations Orthogonal Principal Components

Principal components Principal Components-example PCA plot for Alizadeh-V2. A scatter plot with the two first largest components of a PCA for AlizadehV2. Colors indicate the three classes in the data: diffuse large B-cell lymphoma in red (DLBCL), follicular lymphoma in green (FL) and chronic lymphocytic leukemia in blue(CLL). BMC Bioinformatics. 2008 Nov 27;9:497. Scree plot A scree plot displays eigenvalues versus factors. Used to estimate the number of factors that usefully capture the variance in the data. Look for where the plot levels off and exclude those factors. Typically, we hope to find as few factors as possible to explain the original data.

Eigenvalue Components Multidimensional Scaling Multidimensional scaling (MDS) is a dimension reduction technique that can be considered as an alternative to PC MDS uses distances in the calculation It attempts to arrange "objects" in a space with a particular number of dimensions so as to reproduce the observed distances. As a result, we can "explain" the distances in terms of underlying dimensions Multidimensional Scaling (MDS)

Displays (transformed) multivariate data in low-dimensional space Different from plots based on PC Primary objective is to fit the original data into low-dimensional system Distortion caused by reduction of dimensionality is minimized Distortion Similarities or dissimilarities among data Multidimensional Scaling Given a set of similarities (or distances) between every pair of N items

Find a representation of the items in few dimensions Inter-item proximities nearly match the original similarities (or distances) Non-metric and Metric MDS Non-metric MDS Uses only the rank orders of the N(N-1)/2 original similarities and not their magnitudes Metric MDS Actual magnitudes of original similarities are used Also known as principal coordinate analysis Measures of goodness-of-fit: Stress

N items, M = N(N-1) / 2 similarities. Assume no ties, and arrange s i1k1 si2 k 2 siM k M Find a q-dimensional configuration such that d i(1kq1) d i(2qk)2 d i(Mqk) M Kruskals Stress

(q) (q) d d ik ik Stress( q) k ik (q) 2 d ik k i k

1/ 2 2 dik(q ) are numbers known to satisfy the ordering. Takanes Stress

2 2 d d ik ik SStress k ik 4 d ik

k i k 2 1/ 2 Basic Algorithm Obtain and order the M pairs of similarities Try a configuration in q dimensions Determine

inter-item distances and reference numbers Minimize Kruskals or Takanes stress Move the points around to obtain an improved configuration Repeat until minimum stress is obtained Example 1 Identification of a molecular signature in human type 1 diabetes mellitus using serum and functional genomics.

In MDS analysis, each sample is plotted in a three-dimensional space where the similar samples are plotted in closer proximity compared with the dissimilar ones. RO samples are shown as red circles and HC samples are shown as black squares. J Immunol. 2008 Feb 1;180(3):1929-37 Example 2 Different cell types are multi-dimensionally scaled. The color codes indicate clear clustering. Correspondence Analysis (CA) Special case of PCA transforms a table of

numerical data into a graphic summary A simplified, more interpretable display leads to deeper understanding of the fundamental relationships/structure inherent in the data A map of basic relationships, with much of the noise eliminated and usually reduces the dimensionality of the data CA basic ideas Derived from methods of contingency table analysis

Most suited for analysis of categorical data: counts, presence-absence data possibly better to use PCA for continuous (i.e., ratio) data CA makes no assumptions about the distribution of the input variables Interpretation Correspondence analysis plots should be interpreted by looking at points relative to the origin Points that are in similar directions are positively associated Points that are on opposite sides of the origin are negatively associated Points that are far from the origin exhibit the

strongest associations Also the results reflect relative associations, not just which rows are highest or lowest overall Correspondence Analysis Distance between two species, i and j, over sites k=1,,p is x p

D ij k 1 ik / ri x jk / rj

2 ck ri species totals cksite totals Difference in proportions of each species at each site Then do Principal Coordinates Analysis

Short 28 2 27 11 26 0 0 3 2 1 0 5 0 0 4 1 2 0

0 0 1 0 1 0 0 0 0 0 3 4-plus 1

12 9 3 Short 8 2 5 1 Reg 13 -1 -3

-3 Eigenvalue 0.205 #1 #2 #3 #4 #5 #6 #7 #8 #9 #11 #12 #13

#14 #15 Reg 48 8 9 0 1 1 4 0 0 3 0 4

1 1 Dim(2) mtDNA haplotype Example Correspondence Plot 4-plus 6 117 14 15

-1 1 Eigenvalue Dim(1)0.394 4 3 CA of proteomics data Correspondence analysis on all 641 proteins identified by shotgun proteomics in the BAL fluid of control and/or cystic fibrosis

subjects. Correspondence analysis of the subset of 154 differentially enriched proteins identified by an SpI 0.75 or SpI 0.75. Yellow spheres, individual proteins; red spheres, cystic fibrosis subjects; green spheres, control subjects. Journal of Proteome Research 2008, 7, 845854 845 Prediction of Protein Functional Classes by Correspondence Analysis Homology extension is performed against a compact non-redundant database using a fast search

model to reduce running time. Correspondence analysis (CA) is incorporated as an efficient feature reduction to generate a clear visual separation of different protein classes. Functional classes are predicted by a combination of accurate compact set (CS) relation and interpretable one-nearest neighbor (1-NN) algorithm. CA can not only efficiently identify discriminative features but also provide a clear visualization of different functional classes.

PLOS ONE October 2013 | Volume 8 | Issue 10 | Figure 4. Correspondence analysis of the GramNeg_1444 data set. The figure shows Gram-negative bacteria proteins with localization labels, gapped-dipeptides (black circle) and gapped-dipeptide signatures (stars with corresponding color) projected in top two major CA dimensions whose percentage of variance is shown in parentheses. PLOS ONE October 2013 | Volume 8 | Issue 10 | Microarray Data Mining Feedback Pre-Processing & Representation Knowledge discovery

Intermediate representation Microarray Data Background Knowledge PCA for Microarray Data Thousands of variables (genes) and a very small number of (biological) replicates, so multivariate regression analysis is difficult in practice. The extremely high dimensional space of gene expression measurements obtained by microarrays impedes the detection of underlying patterns in gene expression data and the

identification of discriminatory genes. Data Format object 1 2 3 . . . n attribute 1 2 3

4.7 3.8 5.9 5.2 6.9 3.8 5.8 4.2 3.9 . . . . . . . . . 6.3 1.6 4.7 ... ...

... ... . . . ... m 1.3 2.9 4.4 . . . 2.0

Microarray Data genes object 1 2 3 . . . n attribute time points 1 2

3 ... m 4.7 3.8 5.9 ... 1.3 5.2 6.9 3.8 ... 2.9 5.8 4.2 3.9 ... 4.4 . . . . . . . . . . . . .

. . 6.3 1.6 4.7 ... 2.0 estimated expression levels Microarray Data genes object 1 2 3 . . . n

attribute tissue types 1 2 3 ... m 4.7 3.8 5.9 ... 1.3 5.2 6.9 3.8 ... 2.9 5.8 4.2 3.9 ... 4.4 . . . . . . . . .

. . . . . . 6.3 1.6 4.7 ... 2.0 estimated expression levels Microarray Data genes object 1 2 3

. . . n treatment attribute conditions 1 2 3 ... m 4.7 3.8 5.9 ... 1.3 5.2 6.9 3.8 ... 2.9 5.8 4.2 3.9 ... 4.4 . . . .

. . . . . . . . . . . 6.3 1.6 4.7 ... 2.0 estimated expression levels Example 1 gene_a

gene_b gene_c gene_d Pat_N1 90 190 90 200 Pat_N2 Pat_A1 Pat_A2 110 190 210 210 390

410 110 110 90 100 400 90 Pat_B1 290 590 120 600 Pat_B2 310

610 80 200 Covariance Matrix gene_a gene_b gene_c gene_d gene_a 8120 16120 -80 8380

gene_b 16120 32120 -80 18380 gene_c gene_d -80 8380 -80 18380 240 2020 2020 39350

Eigen Values & Eigen Vectors Eigen values [1] 6.000824e+04 1.972570e+04 9.566785e+01 3.899597e-01 Eigen vectors [,1] [,2] [,3] [,4] [1,] -0.31240722 0.33844158 -0.19322348 0.86632772 [2,] -0.64183253 0.61246043 0.05781972 -0.45782105

[3,] -0.02237963 -0.07752419 -0.97732560 -0.19576472 [4,] -0.69996564 -0.71016866 0.06446891 0.03939986 Singular Value Decomposition D [1] 6.000824e+04 1.972570e+04 9.566785e+01 3.899597e-01 U [,1] [,2] [,3] [,4] [1,] -0.31240722 -0.33844158 -0.19322348 -0.86632772 [2,] -0.64183253 -0.61246043 0.05781972 0.45782105 [3,] -0.02237963 0.07752419 -0.97732560 0.19576472 [4,] -0.69996564 0.71016866 0.06446891 -0.03939986

V [,1] [,2] [,3] [,4] [1,] -0.31240722 -0.33844158 -0.19322348 -0.86632772 [2,] -0.64183253 -0.61246043 0.05781972 0.45782105 [3,] -0.02237963 0.07752419 -0.97732560 0.19576472 [4,] -0.69996564 0.71016866 0.06446891 -0.03939986 R Code for Example 1 dataf = read.table("example1.txt") dataf1 = t(dataf) pca =princomp(dataf1,cor=F,scores=TRUE)

plot(pca)biplot(pca) PCA for Example 1 Biplot for Example 1 R Code for Example 2 data1=read.table("example2.txt") pc.c1=princomp(data1) summary(pc.c1) summary(pc.c1 <- princomp(data1, cor=TRUE)) print(summary(princomp(data1, cor=TRUE), loadings = TRUE, cutoff = 0.2), digits = 2) screeplot(pc.c1) screeplot(pc.c1, npcs=8, type="lines") biplot(princomp(data1))

biplot(pc.c1, choices = 1:2, scale = .2, pc.biplot = T) c1=cov(data1) eigen(c1) c2=svd(c1) PCA for Example 2 Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Standard deviation 17.0440506 9.9498858 7.8672439 7.01213985 6.71861146 Proportion of Variance 0.4734996 0.1613653 0.1008833 0.08014475 0.07357547 Cumulative Proportion 0.4734996 0.6348649 0.7357482 0.81589292 0.88946838 Comp.6

Comp.7 Comp.8 Comp.9 Standard deviation 4.66546927 4.53155163 3.91057506 3.19668232 Proportion of Variance 0.03547845 0.03347093 0.02492615 0.01665608 Cumulative Proportion 0.92494683 0.95841776 0.98334392 1.00000000 PCA for Example 2 Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 x1 -0.45 -0.36 0.78 0.21 x2 -0.37 0.40 0.29

0.46 -0.42 0.44 x3 -0.42 0.21 -0.21 0.50 0.49 -0.48 X4 -0.30 0.24 -0.24 -0.53 -0.50 -0.27 -0.33 -0.23 X5 -0.31 -0.54 0.36 -0.45 -0.51 x6 -0.42 0.27

-0.49 0.40 -0.55 x7 -0.32 0.69 -0.49 0.24 x8 -0.57 -0.57 -0.22 0.46 x9 -0.24 -0.20 0.79 -0.45 -0.25

Screeplot for Example 2 Biplot for Example 2 Example 2 Importance of Components Plot of Samples Plot of Genes Biplot of Samples and Genes Three dimensional plot