Elastic Maps for Data Analysis Alexander Gorban, Leicester with Andrei Zinovyev, Paris Plan of the talk INTRODUCTION Two paradigms for data analysis: statistics and modelling Clustering and K-means Self Organizing Maps PCA and local PCA Plan of the talk 1. Principal manifolds and elastic maps The notion of of principal manifold (PM)

Constructing PMs: elastic maps Adaptation and grammars 2. Application technique Projection and regression Maps and visualization of functions 3. Implementation and examples Two basic paradigms for data analysis Data set Statistical Analysis Data Modelling Statistical Analysis

Existence of a Probability Distribution; Statistical Hypothesis about Data Generation; Verification/Falsification of Hypothesises about Hidden Properties of Data Distribution Data Modelling

Univers e of models We should find the Best Model for Data description; We know the Universe of Models; We know the Fitting Criteria; Learning Errors and Generalization Errors analysis for the Model Verification

Example: Simplest Clustering K-means algorithm K (i ) Centers y(i) {x ( j) :x ( j)

y (i ) x ( j) y ( m) 1 p U x ( j ) y (i ) N i 1x ( j )K ( i )

m} 2 1)Minimize U for given {K(i)} (find centers); Data points x(j) 2)Minimize U for given {y(i)} (find classes); 3)If {K(i)} change, then go to step 1. Centers can be lines, manifolds, with the same algorithm

1st Principal components + mean points for classes SOM - Self Organizing Maps Set of nodes is a finite metric space with distance d(N,M); 0) Map set of nodes into dataspace Nf0(N); 1) Select a datapoint X (random);

2) Find a nearest fi(N) (N=NX); 3) fi+1(N) = fi(N) +wi(d(N, NX))(X- fi(N)), where wi(d) (0

X X )( X i i j X j) p 1 q 1 Principal components: eigenvectors of the covariance matrix: e , ; ... 0 i i

1 2 The local covariance matrix (w is a positive cutting 1 p function) q q q cov y ( xi , x j ) w( y X )( X i X i )( X j X j ) p 1 q 1

The field of principal components: eigenvectors of the local covariance matrix, ei(y). Trajectories of these vector-fields present geometry of local data structure. between two basic paradigms is not crucial (Almost) Back to Statistics: Quasi-statistics: 1) delete one point from the dataset, 2) fitting, 3) analysis of the error for the deleted data;

The overfitting problem and smoothed data points (it is very close to non-parametric statistics) Principal manifolds Elastic maps framework LLE ISOMAP Multidim. scaling Visualization Non-linear Data-mining

methods PCA Kmeans Principal manifolds Supervised classification SVM Clustering SOM Factor

analysis Regression, approximation Finite set of objects in RN IRIS database X Petal heght Petal width Sepal

width Sepal height 4.9 3 1.4 0.2 Iris-setosa 4.7 3.2

1.3 0.3 Iris-setosa 4.6 3.1 1.5 0.2 Iris-setosa 7 3.2 4.7

1.4 Iris-versicolor 6.4 3.2 4.5 1.5 Iris-versicolor 6.9 3.1 4.9

1.5 Iris-versicolor 6.3 3.3 6 5.8 2.7 7.1 6.3 SPECIES

2.5 Iris-virginica X 1.9 Iris-virginica 3 5.9 2.1 Iris-virginica 2.9 5.6 1.8 Iris-virginica

i i=1.. m Mean point 1 m X Xi m i 1 K-means clustering m

X i closest Y 2 min i 1 m X i 1 i X

2 min Principal Object m i 1 2 min Principal Component Analysis M

ax im al di sp er sio n 1st Principal axis

2nd principal axis Principal manifold Statistical Self-consistency x -1(x) x = E(y|((y)=x) Principal Manifold

What do we want? Non-linear surface (1D, 2D, 3D ) Smooth and not twisted The data model is unknown Speed (time linear with Nm) Uniqueness Fast way to project datapoints Metaphor of elasticity U(E),

U(R) Data points U(Y) Graph nodes Constructing elastic nets y E (0) E (1) R (1) R (0) R (2)

Definition of elastic energy Xj (Y ) U y U E (0) E (1) R (1) R (0) R (2) U U

U s (E) j X y (i ) 2 i 1 x ( j ) K ( i ) (i )

(i ) i E (1) E (0) 2 i 1 . (Y ) 1 N

p U (E) ( R) r (i ) (i ) (i ) i R (1) R (2) 2 R (0)

i 1 U (R) i 0 , i 0 2 Elastic manifold Global minimum and softening 0, 0 103 0, 0 102 0, 0 101

0, 0 10-1 Adaptive algorithms Refining net: Growing net Idea of scaling: Adaptive net Scaling Rules For uniform d-dimensional net from the condition of constant energy density we obtain: 1 2 ... s ( s);

1 2 ... r (r ) 2 d 0 s d 4 d 0 r d s is number of edges, r is number of ribs in a given volume Grammars of Construction Substitution rules Examples: 1) For net refining:

substitutions of columns and rows 2) For growing nets: substitutions of elementary cells. Substitutions in factors Graph factorization Substitution rule Transformation of factor =

Substitutions in factors Graph transformation Transformation selection A grammar is a list of elementary graph transformations. Energetic criterion: we select and apply an elementary applicable transformation that provides the maximal energy decrease (after a fitting

step). The number of operations for this selection should be in order O(N) or less, where N is the number of Projection onto the manifold Closest node of the net Closest point of the manifold Mapping distortions 1 2 Two basic types of distortion: 1) Projecting distant points in

the close ones (bad resolution) 2) Projecting close points in the distant ones (bad topology compliance) Instability of projection Best Matching Unit (BMU) for a data point is the closest node of the graph, BMU2 is the second-close node. If BMU and BMU2 are not adjacent on the graph, then the data point is unstable. Gray polygons are the areas of instability. Numbers denote the degree of instability, how many

nodes separate BMU from Colorings: visualize any function Density visualization Example: different topologies RN R2 VIDAExpert tool and elmap C++ package

Regression and principal manifolds regression F(x) Data principal component Gen.curve Grid x x F ( x ) i

i 2 min xi Pxi 2 min Projection and regression

Data with gaps are modelled as affine manifolds, the nearest point on the manifold provides the optimal filling of gaps. Iterative error mapping For a given elastic manifold and a datapoint x(i) the error vector is (i ) (i ) (i ) xerr x P ( x ) where P(x) is the projection of data point x(i) onto the manifold. The errors form a new dataset, and we can construct another map, getting regular model of errors. So we have the first map that models the data itself, the second map

that models errors of the first model, and so on. Every point initial data space is modeled by the vector ~x xP(inx)the P2 ( x P( x)) P3 ( x P( x) P2 ( x P( x))) .... Image skeletonization or clustering around curves Image skeletonization or clustering around curves Approximation of molecular surfaces Application: economical data

Density Gross output Profit Growth temp Medical table 1700 patients with infarctus myocarde Patients map, density Lethal cases Medical table 1700 patients with infarctus

myocarde 128 indicators Age Numberof infarctus in anamnesis Stenocardia functional class Codon usage in all genes of one genome Escherichia coli

Bacillus subtilis Majority of genes Foreign genes Highly expressed genes Hydrophobic genes Golubs leukemia dataset 3051 genes, 38 samples (ALL/B-cell,ALL/Tcell,AML) ap of genes: vote for ALL

ALL sample vote for AML used by T.Golub AML sample used by W.L Golubs leukemia dataset map of samples: cell AML

ALL/B-cell Cystatin C density CA2 Carbonic anhydrase II ALL/T- Retinoblastoma binding protein P48 X-linked Helicase II Useful links

Principal components and factor analysis http://www.statsoft.com/textbook/stfacan.html http://149.170.199.144/multivar/pca.htm Principal curves and surfaces http://www.slac.stanford.edu/pubs/slacreports/slac-r-276.html http://www.iro.umontreal.ca/~kegl/research/pcurves/ Self Organizing Maps http://www.mlab.uiah.fi/~timo/som/

http://davis.wpi.edu/~matt/courses/soms/ http://www.english.ucsb.edu/grad/student-pages/jdouglass/coursework/hyperliterature/soms/ Elastic maps http://www.ihes.fr/~zinovyev/ http://www.math.le.ac.uk/~ag153/homepage/ Several names

K-means clustering: MacQueen, 1967; SOM: T. Kohonen, 1981; Principal curves: T. Hastie and W. Stuetzle, 1989; Elastic maps: A. Gorban, A. Zinovyev, A. Rossiev, 1998; Polygonal models for principal curves: B. Kgl, 1999; Local PCA for orincipal curves construction: J. J. Verbeek, N. Vlassis, and B. Krse, 2000. Two of them are Authors Thank you for your attention!

Questions?