Microarray Data Analysis
- Data preprocessing and visualization
- Supervised learning
- Unsupervised learning
- Machine learning approaches
- Clustering and pattern detection
- Gene regulatory region prediction based on co-regulated genes
- Linkage between gene expression data and gene sequence/function databases

Unsupervised learning
Supervised methods
- Can only validate or reject hypotheses
- Cannot lead to discovery of unexpected partitions
Unsupervised learning
- No prior knowledge is used
- Explore structure of data on the basis of correlations and similarities

DEFINITION OF THE CLUSTERING PROBLEM

Eytan Domany

CLUSTER ANALYSIS YIELDS DENDROGRAM
[Figure: dendrogram; axis T (resolution)]

BUT WHAT ABOUT THE OKAPI?

Centroid methods: K-means
- Data points at $X_i$, $i = 1, \ldots, N$
- Centroids at $Y_\nu$, $\nu = 1, \ldots, K$
- Assign data point $i$ to centroid $\nu$: $S_i = \nu$
- Cost E:
$$E(S_1, S_2, \ldots, S_N; Y_1, \ldots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu) \, (X_i - Y_\nu)^2$$
- Minimize E over $S_i$, $Y_\nu$

K-means
- Guess K = 3.
- Start with random positions of centroids.
- Assign each data point to closest centroid.
- Move centroids to center of assigned points.
- Iterate till minimal cost.
[Figures: iterations 0 to 3]

K-means - Summary
- Fast algorithm: compute distances from data points to centroids
- Result depends on initial centroid positions
- Must preset K
- Fails for non-spherical distributions
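
The K-means steps and the cost E above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `squared_distance` and `kmeans`, the `seed` and `max_iter` parameters, and the choice to start from K randomly chosen data points are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate the assignment and centroid-update steps until labels stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points picked at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no label changed, so the cost cannot decrease further
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Cost E: sum of squared distances from each point to its assigned centroid.
    cost = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, cost


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of three points.
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, centers, cost = kmeans(data, k=2)
    print(labels, centers, cost)
```

Because the result depends on the initial centroids, one common practice is to rerun with several seeds and keep the run with the lowest cost.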

Agglomerative Hierarchical Clustering
- Initially each point = cluster.
- At each step, merge the pair of nearest clusters.
- Need to define the distance between the new cluster and the other clusters:
  - Single Linkage: distance between closest pair.
  - Complete Linkage: distance between farthest pair.
  - Average Linkage: average distance between all pairs, or distance between cluster centers.
- The dendrogram induces a linear ordering of the data points.
[Figure: dendrogram of points 1 to 5; vertical axis: distance between joined clusters]

Hierarchical Clustering Summary
- Results depend on distance update method
- Greedy iterative process
- NOT robust against noise
- No inherent measure to identify stable clusters
- Average Linkage is the most widely used clustering method in gene expression analysis
[Figure: Nature 2002, breast cancer]

Heat map
- Cluster both genes and samples
- Samples should cluster together based on experimental design
- Often a way to catch labelling errors or heterogeneity in samples
[Figure: epinephrine-treated rat fibroblast cells; labels: ID, Probe, 1h]
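
The linkage rules for agglomerative clustering differ only in how the distance between two clusters is computed from point-to-point distances. The sketch below makes that concrete for one pair of clusters. The names `euclidean`, `center` and `linkage_distances`, the use of Euclidean distance, and the toy coordinates are assumptions chosen for illustration.

```python
import math
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def center(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))


def linkage_distances(cluster_a, cluster_b, dist=euclidean):
    """Distance between two clusters under each linkage rule."""
    # All point-to-point distances across the two clusters.
    pairs = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairs),                  # closest pair
        "complete": max(pairs),                # farthest pair
        "average": sum(pairs) / len(pairs),    # mean over all pairs
        "centroid": dist(center(cluster_a), center(cluster_b)),  # cluster centers
    }


if __name__ == "__main__":
    # Hypothetical clusters on a line, to keep the arithmetic easy to check.
    a = [(0.0, 0.0), (1.0, 0.0)]
    b = [(3.0, 0.0), (5.0, 0.0)]
    for rule, value in linkage_distances(a, b).items():
        print(rule, value)
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, so the choice of rule can change which pair of clusters is merged next.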

Exercise
Use Average Linkage Algorithm and Manhattan distance.

Gene ID   Exp1   Exp2
1           45     55
2           55     78
3          148    241
4         1303    765
5          774    607
6          439    383

Issues in Cluster Analysis
- A lot of clustering algorithms
- A lot of distance/similarity metrics
- Which clustering algorithm runs faster and uses less memory?
- How many clusters after all?
- Are the clusters stable?
- Are the clusters meaningful?
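
For the exercise and the "how many clusters" question, a small average-linkage run with Manhattan distance shows the order in which clusters join. This is only a sketch: it assumes the twelve exercise values pair up row by row as (Exp1, Exp2) for genes 1 to 6, and the names `genes`, `manhattan`, `average_linkage` and `agglomerate` are made up for the example.

```python
from itertools import combinations

# ASSUMPTION: the exercise values read row-wise as (Exp1, Exp2) per gene.
genes = {
    1: (45, 55), 2: (55, 78), 3: (148, 241),
    4: (1303, 765), 5: (774, 607), 6: (439, 383),
}


def manhattan(p, q):
    """Manhattan (city-block) distance between two expression profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_linkage(a, b):
    """Mean Manhattan distance over all cross-cluster gene pairs."""
    total = sum(manhattan(genes[i], genes[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate():
    """Greedy merging; returns a list of (cluster_a, cluster_b, distance)."""
    clusters = [(g,) for g in genes]  # initially each gene is its own cluster
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average-linkage distance.
        a, b = min(combinations(clusters, 2), key=lambda ab: average_linkage(*ab))
        merges.append((a, b, average_linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges


if __name__ == "__main__":
    for a, b, d in agglomerate():
        print(f"merge {a} + {b} at distance {d:.1f}")
```

Under that row-wise assumption, genes 1 and 2 merge first at Manhattan distance 33 and gene 4 joins last. Stopping the merge loop early, or looking for a large jump in merge distance, is one informal way to choose the number of clusters.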

Which Clustering Method Should I Use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- How strict do I want to be? Split or join?
- Can a gene be in multiple clusters?
- Hard or soft boundaries between clusters?

The End
Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.
We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have. We wish you all a wonderful summer break!