Microarray Data Analysis - UA Computer Science

Microarray Data Analysis
- Data preprocessing and visualization
- Supervised learning
- Unsupervised learning
- Machine learning approaches
- Clustering and pattern detection
- Gene regulatory region prediction based on co-regulated genes
- Linkage between gene expression data and gene sequence/function databases

Unsupervised learning
Supervised methods
- Can only validate or reject hypotheses
- Cannot lead to discovery of unexpected partitions
Unsupervised learning
- No prior knowledge is used
- Explores the structure of the data on the basis of correlations and similarities

DEFINITION OF THE CLUSTERING PROBLEM

Eytan Domany

CLUSTER ANALYSIS YIELDS DENDROGRAM
T (RESOLUTION)

BUT WHAT ABOUT THE OKAPI?

Centroid methods: K-means
- Data points at $X_i$, $i = 1, \dots, N$
- Centroids at $Y_\nu$, $\nu = 1, \dots, K$
- Assign data point $i$ to centroid $\nu$: $S_i = \nu$
- Cost $E$:

$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$

- Minimize $E$ over $S_i$, $Y_\nu$

K-means
Guess K = 3
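A minimal Python sketch of one common way to minimize the K-means cost E: alternate between reassigning the points (the S_i) and moving the centroids (the Y), stopping when the assignments no longer change. The function names, the `init` parameter, and the nine sample points are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Alternate assignment and centroid updates until assignments stop changing.

    Returns (assign, centroids, cost), where assign[i] is the centroid index
    of point i and cost is the summed squared distance to assigned centroids.
    """
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly chosen data points.
    centroids = list(init) if init is not None else rng.sample(list(points), k)
    assign = [0] * len(points)
    for iteration in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if iteration > 0 and new_assign == assign:
            break  # assignments are stable, so the cost cannot decrease further
        assign = new_assign
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, s in zip(points, assign) if s == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(sq_dist(p, centroids[s]) for p, s in zip(points, assign))
    return assign, centroids, cost


# Hypothetical 2-D data: three visually separated groups of three points.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.0), (8.0, 9.5),
    (1.0, 9.0), (0.5, 8.0), (1.5, 9.5),
]
assign, centroids, cost = kmeans(
    points, k=3, init=[(1.0, 1.0), (8.0, 8.0), (1.0, 9.0)]
)
print(assign)
```

Passing `init` explicitly makes the run reproducible. With random starting centroids the result can differ from run to run, and K still has to be chosen in advance.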

K-means iterations
1. Start with random positions of centroids. (Iteration = 0)
2. Assign each data point to the closest centroid. (Iteration = 1)
3. Move centroids to the center of their assigned points. (Iteration = 2)
4. Iterate until the cost is minimal. (Iteration = 3)

K-means - Summary
- Fast algorithm: compute distances from data points to centroids
- Result depends on initial centroid positions
- Must preset K
- Fails for non-spherical distributions

Agglomerative Hierarchical Clustering

- Initially each point = cluster.
- At each step, merge the pair of nearest clusters.
- Need to define the distance between the new cluster and the other clusters:
  - Single Linkage: distance between closest pair.
  - Complete Linkage: distance between farthest pair.
  - Average Linkage: average distance between all pairs, or distance between cluster centers.

[Figure: points 1 to 5 and their dendrogram; vertical axis: distance between joined clusters]
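The merge procedure and the three linkage rules can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not an efficient or definitive implementation: the function names are hypothetical, the five one-dimensional points are made up, and the distance between clusters is recomputed from all point pairs at every step.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance, an alternative to pass as `dist`."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_distance(c1, c2, points, dist, linkage):
    """Distance between two clusters (tuples of point indices)."""
    pair_d = [dist(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair
        return min(pair_d)
    if linkage == "complete":    # farthest pair
        return max(pair_d)
    if linkage == "average":     # mean over all pairs
        return sum(pair_d) / len(pair_d)
    raise ValueError("unknown linkage: " + linkage)


def agglomerate(points, dist=euclidean, linkage="average"):
    """Greedy bottom-up clustering; returns the merges as (a, b, height)."""
    clusters = [(i,) for i in range(len(points))]  # initially each point = cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of nearest clusters under the chosen linkage.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, dist, linkage),
        )
        height = cluster_distance(a, b, points, dist, linkage)
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


# Hypothetical one-dimensional data, indexed 0..4.
points = [(0.0,), (1.0,), (3.0,), (7.0,), (8.5,)]
for linkage in ("single", "complete", "average"):
    print(linkage, [round(h, 3) for _, _, h in agglomerate(points, linkage=linkage)])
```

The list of merge heights is what a dendrogram draws on its vertical axis. On this data all three rules merge the same clusters in the same order and differ only in the heights; on other data the linkage choice can change the grouping itself.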

The dendrogram induces a linear ordering of the data points.

Hierarchical Clustering Summary
- Results depend on distance update method
- Greedy iterative process
- NOT robust against noise
- No inherent measure to identify stable clusters
- Average Linkage is the most widely used clustering method in gene expression analysis

[Figure: Nature 2002, breast cancer]

Heat map
- Cluster both genes and samples
- Samples should cluster together based on experimental design
- Often a way to catch labelling errors or heterogeneity in samples

Epinephrine Treated Rat Fibroblast Cell

ID  Probe         1h      5h      10h     18h     24h
1   D21869_s_at   25.7    55.0    170.7   305.5   807.9
2   D25233_at     705.2   578.2   629.2   641.7   795.3
3   D25543_at     2148.7  1303.0  915.5   149.2   96.3
4   L03294_g_at   241.8   421.5   577.2   866.1   2107.3
5   J03960_at     774.5   439.8   314.3   256.1   44.4
6   M81855_at     1487.6  1283.7  1372.1  1469.1  1611.7
7   L14936_at     1212.6  1848.5  2436.2  3260.5  4650.9
8   L19998_at     767.9   290.8   300.2   129.4   51.5
9   AB017912_at   1813.7  3520.6  4404.3  6853.1  9039.4
10  M32855_at     234.1   23.1    789.4   312.7   67.8

Heat map
- Correlation coeff
- Normalized across each gene

Distance Issues
- Euclidean distance
- Pearson distance
[Figure labels: g1, g3, g2, g4]
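A short Python sketch of why the choice of distance matters, using three probes from the epinephrine table (rows 6, 7 and 9). Two points are assumptions rather than statements from the slides: Pearson distance is taken here as 1 - r, a common convention, and "normalized across each gene" is read as subtracting the gene's mean and dividing by its standard deviation. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_r(a, b):
    """Pearson correlation coefficient between two profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


def pearson_distance(a, b):
    """Assumed convention: distance = 1 - r (0 for identical shapes)."""
    return 1.0 - pearson_r(a, b)


def standardize(row):
    """Normalize one gene across its time points (zero mean, unit variance)."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in row) / n)
    return [(x - mean) / sd for x in row]


# Expression at 1h, 5h, 10h, 18h, 24h, copied from the table.
m81855 = [1487.6, 1283.7, 1372.1, 1469.1, 1611.7]     # row 6: roughly flat
l14936 = [1212.6, 1848.5, 2436.2, 3260.5, 4650.9]     # row 7: rising
ab017912 = [1813.7, 3520.6, 4404.3, 6853.1, 9039.4]   # row 9: rising, larger scale

print("Euclidean 7-6:", euclidean(l14936, m81855))
print("Euclidean 7-9:", euclidean(l14936, ab017912))
print("Pearson   7-6:", pearson_distance(l14936, m81855))
print("Pearson   7-9:", pearson_distance(l14936, ab017912))
```

On raw values, Euclidean distance places row 7 closer to the flat row 6 because their magnitudes are similar, while Pearson distance places it closer to row 9, which has the same rising shape. After `standardize` is applied to each gene, Euclidean distance orders the pairs the same way as Pearson distance.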

[Figure: chart of gene1, gene2, gene3, gene4 with series time0, time1, time2, time3; vertical axis 0 to 400]

Exercise
Use Average Linkage Algorithm and Manhattan distance.

Gene ID: 1 2 3 4 5 6
Exp1 Exp2
45 55 55 78 148 241 1303 765 774 607 439 383

Issues in Cluster Analysis
- A lot of clustering algorithms
- A lot of distance/similarity metrics
- Which clustering algorithm runs faster and uses less memory?
- How many clusters after all?
- Are the clusters stable?
- Are the clusters meaningful?

Which Clustering Method Should I Use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- How strict do I want to be? Split or join?
- Can a gene be in multiple clusters?
- Hard or soft boundaries between clusters?

The End
Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.

We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have. We wish you all a wonderful summer break!
