Toward a More Accurate Genome: Algorithms for the Analysis of HighThroughput Sequencing Data Dissertation Defense W. Jacob B. Biesinger Tuesday, May 27th AREM TreeHMM Applied statistics Genomix Applied computer science ChIP-seq iCLIP-seq Computational Methods

Probabilistic Models A brief history of DNA sequencing New biological insight Seeded technological revolution: highthroughput sequencing Completed in 2003 at a cost of $3 billion and 10 years of labor and planning First time weve determined the sequence of a large genome XPRIZE: $10 million for 100 genomes @ $1,000 each XPRIZE cancelled:

Outpaced by Innovation A brief history of DNA sequencing low-throughput high cost Now 250nt, $.05/MB Kercher et al., Bioessays (2010). A revolution in biology HTS has changed the way that much of biology is done today New experimental methods Targeted resequencing

Whole-genome sequencing Exome sequencing ChIP-seq RNA-seq MeDIP-seq CLIP-seq ChIRP-seq Hi-C ChIA-PET New (or rebranded) fields of study Genomics Transcriptomics Metabolomics Microbiomics Toxicogenomics Epigenomics

Interactonomics Circadiomics HTS (already) has real impact Clinical Impact o Discovery of inheritable genetic disorders o Cancer biology (identify cancer subtypes) o Evolution and spread of infectious diseases o Prenatal diagnostics o Now transitioning into clinical laboratory o Lead to personalized therapies Basic Biology o Gene expression levels o Identify regulatory network structure o Elucidate fundamental biological processes find promoter TATA binding, splicing mechanisms, the drivers of cellular state/stem cell stemness

Limitations of HTS methods You cant trust 1/100 bases We all wish the error rate were uniform All kinds of hidden biases based on the sequence composition (GC-content, strand-bias, positional bias, But we have much more data. How can we best use it all? Computational biology to the rescue! Detect and correct errors and biases See the biology beyond

the letters AREM Harness HTS read mapping uncertainty to improve analysis methods Resolve ambiguity through Machine Learning GATATAAACT ACGTGATATAAACTGCGTCGGATATAAACTACTCTAGG Most genomes are riddled with repetitive sequence o o

o o Variable lengths (six to several thousand bp) Up to 66% of the Human genome* ~30% of reads map ambiguously** Ambiguous reads often excluded completely or some subset are included at random AREM: Aligning Reads by Expectation-Maximization General framework for resolving repeats; we demonstrate how with ChIP-seq data *Koning et al. PLOS Genetics 7 (12): e1002384 **Langmead et. al. Genome Biology 10 (2009) R25

Qing Zhou, PNAS 1643816443 Identifying Peaks Look for regions with many reads piled together Treat as Nx1 dataset (N is in 10s of milions) Smooth via kernel density ChIP reads Strand bias Control reads Non-uniform control MACS: Zhang et al, 2008

A mixture model for ChIP-Seq un-ChIPed background K enriched regions Each read has some probability of belonging to each of the peak and background regions Identify best peak configuration by maximizing read likelihood A mixture model for ChIP-Seq AAAGTCTATCCCAGGCTC Which region is the most likely source of the ambiguous reads? The alignment with highest likelihood (Not so simple if were unsure where the K enriched regions are located)

Maximize Likelihood via E-M Overall problem: find best peak configuration Consider all possible peak sources and all possible alignments Expectation (With regions fixed, update alignments) Maximization (With alignments fixed, find best regions) Expectation Maximization in action Expectation Maximization r2 r1

r3 E-M is a machine learning method with many applications, especially in mixture models. Accounting for non-uniform control Define alignment likelihood as poisson survival of peak vs. unenriched background ChIP reads Control reads Test datasets We used motif presence to indicate peak quality Cohesin structural protein, known to bind repetitive regions of the genome D4Z4 sub-telomeric repeat associated with Facioscapulohumeral

Disorder * Cohesin often co-localizes with CTCF (motif in 80% peaks from mouse and human) Srebp-1 traditional transcription factor Contains a well-characterized binding motif CTCF binding motif Srebp-1 binding motif * Zeng et. al. PLoS Genetics, 5(7) 2009 AREM shows better performance in repeat regions than other peak finders Cohesin Method 1

2 Alignments Peaks New FDR Motif Repeat MACS ---

2,368,229 18,556 --- 2.80% 81.67% 56.55% SICER --- 2,368,229

17,092 --- 12.71% 82.55% 70.42% --- 1.90% 81.32% 55.30%

AREM 1 2,368,229 19,012 AREM 10 7,616,647 19,881 1,404 3.80%

81.04% 58.88% AREM 20 12,312,878 19,935 1,517 3.70% 80.88% 59.66%

AREM 40 20,527,010 19,863 1,546 3.20% 80.93% 60.34% AREM 80

34,537,311 19,820 1,538 2.90% 80.73% 60.91% 1. Allow for sequences with one alignment. 2. Allow for sequences with up to 10-80 possible alignments.

8% more peaks, similar FDR, many peaks in repeats! AREM shows better performance in repeat regions than other peak finders Srebp-1 Method 1 2 Alignments Peaks New

FDR Motif Repeat MACS --- 10,482,005 721 --- 4.85%

46.60% 53.95% SICER --- 10,482,005 622 --- 9.0% 59.00%

77.33% --- 8.0% 39.08% 53.47% AREM 1 10,482,005 1,438

AREM 10 28,347,869 1,815 262 10.5% 39.22% 56.04% AREM

20 44,493,532 1,748 227 8.0% 39.95% 55.97% AREM 40

72,453,642 1,685 248 8.2% 40.34% 56.46% AREM 80 118,744,757

1,695 272 7.3% 40.66% 56.73% 1. Allow for sequences with one alignment. 2. Allow for sequences with up to 10-80 possible alignments.

5% more peaks called at lower FDR Availability Realigns and calls peaks: Align reads Identify K regions enriched with alignments M Step: Update peak enrichment from alignment probabilities E Step: Update alignment

probabilities from enrichment Check convergence Call treatment peaks Call control peaks Calculate FDR 12 million alignments < 20 minutes < 1.6 GB RAM 120 million alignments < 30 minutes < 6 GB RAM

AREM is a python package Download from AREM can be applied in other contexts Repeat problem plagues all of HTS analysis AREM framework can be applied to other analysis methods o o o RNA-seq: re-align ambiguous reads to the most abundant transcripts SNP/variant calling: re-align ambiguous reads to the genotypes that the reads agree with Many other possibilities

AREM TreeHMM Unsupervised clustering of multiple genomes for improved biological insight Scaling up: multiple ChIP datasets from multiple cell types Determine binding site dynamics by performing the same ChIP experiment at different timepoints/cell stages Integrate multiple datasets for biological insight

Scaling up: multiple ChIP datasets from multiple species Nine ChIP-seq experiments CTCF H3k27me3 H3k36me3

H4k20me1 H3k4me1 H3k4me2 H3k4me3 H3k27ac H3k9ac Histone modifications (not transcription factors) Scaling up: multiple ChIP datasets from multiple species Nine ChIP-seq experiments

CTCF H3k27me3 H3k36me3 H4k20me1 H3k4me1 H3k4me2 H3k4me3 H3k27ac H3k9ac

Nine human cell types embryonic stem cell (H1 ES) erythrocytic leukaemia cells (K562) B-lymphoblastoid cells(GM12878) hepatocellular carcinoma cells (HepG2) umbilical vein endothelial cells (HUVEC) skeletal muscle myoblasts (HSMM)

normal lung fibroblasts (NHLF) normal epidermal keratinocytes (NHEK) mammary epithelial cells (HMEC) Ernst et al, Nature, 2011 Histone mark combinations indicate gene function Active Promoter Active transcription Active Enhancer Repressed Gene Zhou et al, Nature Rev. Gen., 2011 Binding dynamics across cell types

Active Promoter Repressed Promoter Active Promoter Neural genes repressed in muscle cells Olig1: Neural transcription factor Polm: DNA polymerase (gene needed in all cell types) ES cells: Embryonic stem cells NPCs: Neural progenitor cells MEFs: Embryonic fibroblasts (muscle)

Neurog1: Neurogenesis transcription factor Pparg: Adipogenesis transcription factor Fabp7: Neural progenitor marker Mikkelsen et al., Nature 2007 Scaling up: multiple ChIP datasets we automatically fromCan multiple species Nine ChIP-seq experiments

learn biologys histone code? Nine human cell types CTCF embryonic stem cell (H1 ES) H3k27me3 erythrocytic leukaemia cells (K562) H3k36me3

B-lymphoblastoid cells(GM12878) H4k20me1 hepatocellular carcinoma cells (HepG2) H3k4me1 umbilical vein endothelial cells (HUVEC) H3k4me2 skeletal muscle myoblasts (HSMM) H3k4me3 What does the data tell lung fibroblasts (NHLF) normal H3k27ac us about cell normal epidermal keratinocytes (NHEK) differentiation? mammary epithelial cells (HMEC) H3k9ac Ernst et al, Nature, 2011

Unsupervised Learning Family of machine learning methods to recognize patterns in datasets o Includes K-means, hierarchical clustering, selforganizing maps, and many other methods Hidden Markov Model can capture spatial transitions We build on a previous ChromHMM model Histone mark co-occurrence and spatial transitions What about multiple cell types? ES cells NPCs One genomic locus: states are organized in tree

xij: observed histone marks (position i, species j) zij: hidden chromatin state (to be inferred) A new TreeHMM for lineages Connect multiple cell types in a tree Connect adjacent regions of the genome TreeHMM Recap Data: M x N x L matrix of binary histone mark presence o M species, related to each other by a tree o N contiguous genomic regions o L different histone marks

Use a Tree Hidden Markov Model to do unsupervised learning We are given K, the number different histone states to find o K x L emission matrix o K x K transition matrix for root species o K x K x K transition matrix for other species Inference in TreeHMM Just one problem: Inferring the hidden state in this model is intractable when K or M is large (KM state space) We use variational methods to approximate the model o Choose a tractable family of surrogate models, then optimize them to look like the more complicated

model Mean field: Optimize single nodes separately Structured mean field: Optimize complete HMM chains separately Loopy belief propagation How good are the approximations? Structured mean field approximates the exact model very well K=5, very small dataset Spatial transitions between states

Vertical parent specific transition matrix Insulator (CTCF) (state 14) Active Promoter (state 3) Strong Enhancer (state 5) Strong vertical information for some states Insulator (CTCF) (state 14) Active Promoter

(state 3) Enhancer state is rarely inherited from ES cells Strong Enhancer (state 5) Validation and comparison Do our predicted states have any grounding in real biology? Validate them using a different dataset: transcription factor ChIP-seq o o Record number of recovered TF binding sites

Record fold enrichment vs. random overlap with TF binding sites Compare vs. ChromHMM (like our model, but no tree component) Predicted Promoters Recovered sites (thousands) We predict fewer sites with better accuracy Fold enrichment Taf1 is part of all cells promoter machinery

Predicted Enhancers Recovered sites (thousands) We predict more sites at better or similar accuracy Fold enrichment TreeHMM take home messages None of this would be possible without interesting data and lots of it o In several organisms, there are many new datasets

doing comprehensive surveys of biological function Extending models toward real biology is worth it o o Can lead to improved accuracy and new insights Machine Learning has tools for many more models than now employed in biology AREM TreeHMM What if we dont have a reference genome? Distributed database for genome assembly

Genomix No reference genome? For most analyses, we use the standard reference genome o o o Many organisms dont have a reference Others have a poor quality reference Some samples are too different from the reference Cancer genomes are genetically unstable, subject to genome shattering Low cost of sequencing makes it feasible to do de novo assembly using HTS reads

o o Human genome cost ~$3 billion and 10 years, finishing in 2003 Done today in a few weeks for a few thousand dollars CA ACTG AC A A A C A CC TG AC AAC TTC

Genome Assembly (1st Approximation) CA ACTG AC A A A C TG CA CC C A A A C A ACTG AC


TG AC AAC TTC Extraction, sonication hundreds of millions of fragments Assembly CA ACTG AC A AA C A CC TG AC AAC TTC

High-Throughput Sequencing Overlap Graphs ...ACTGAATCTAGGTCTGCGT TCTTTGACCA. GTCTGCGTACCCTACGTCTGACTGCCGCGGCAA CGCGGCAAACGGCTAGCTGTGTTTTTACTTCTTTGA TCTTTGATTC. Find prefix/suffix overlaps between all reads Form a graph where the reads are nodes and overlaps are edges Find a Hamiltonian path through the graph (touching all nodes once)

De Bruijn Graphs ...ACTGAATCTAGGC Repeated kmers are collapsed into a single node ...CCTCTAGGGTGC Form a graph where all Kmers of the reads are nodes and edges correspond to shared K-1mers Find an Eulerian path through the graph (touching all edges once) Pros and Cons Overlap Graph: Requires comparison between all reads

Hamiltonian path is harder than Eulerian Errors detected via consensus sequences Can handle repeats shorter than read length Mostly suited for a few, long reads De Bruijn Graph: Scales with complexity of genome, not # of reads* Many errors show as unique graph structures Error identification is critical Without additional work, handles only repeat < K

Well-suited for many short, low-quality reads * except for errors... Our goal: scalable de Bruijn graph assembly Algorithm Genome Size Small Medium Large Velvet


Contrail Genomix

If you have enough memory Interlude: a different revolution Storage is cheap and everything is recorded o Social network data, browsing/shopping history, search terms, retail information Traditionally, all this would be kept in large databases for easy query o Now, the data cant fit on one machine Demand for scalable alternatives

o o Google revealed MapReduce and GFS papers (internet scale) Hadoop (open source) soon followed In 2013, 50% of Fortune 50 Scalable algorithms need scalable frameworks Hyracks: efficient and flexible alternative to Hadoop o o o Seamless use of available memory and disk space Additional operators, index structures

Brings relational DB concepts to the cloud Pregelix: scalable graph algorithms o o o Hyracks-based open source Pregel implementation Handles all scheduling, network, message handling, etc Think like a vertex K-means clustering in the cloud Hyracks stack Hyracks

Genomix Graph algorithms Filesystem abstraction: Shared-nothing, HDFS, NFS, etc. Build graph De Bruijn graph of the small genome of E. coli after error correction Chaisson et al., Genome Research 2009 Sequencing Errors (Bubble Merge) Breadth-first search identifies common paths Extract and compare path sequences

When paths are similar, prune worst one Graph Compression We can collapse long chains of nodes into single nodes representing long chains o T All later operations will take fewer iterations H H T T H

T Use a randomized algorithm to coordinate nodes Graph Compression We can collapse long chains of nodes into single nodes representing long chains o All later operations will take fewer iterations Use a randomized algorithm to coordinate nodes Scaffolding Use the reads to guide a growing path

candidate path reads have mutual overlap with walk high-confidence region Check all candidate paths out to a certain distance D single error read connects the walk incorrectly Scaffolding Use the reads to guide a growing path One set of edges must dominate the other for the walk to proceed

Timings: Small Genomes Hadoop-based Contrail lags behind Velvet is super fast Ray is another parallel framework for de Bruijn graph assembly With Human Genome same figure with a larger genome Genomix has no memory constraints, though having it will speed up computation Ray requires ~450GB RAM

across machines Velvet is a no-show Some groups reported assembling a human genome with 2TB RAM... Assembly Accuracy Assembly is tough, but at least you can scale up! Assembly was once relegated to small bacterial genomes Dropping costs and better tech are making assembly available to much larger genomes o o

Important for understudied organisms Important for cancers and other diseases where genome structure is affected Dont need huge servers w/ beefy RAM o Small clusters will do the job (rent them on EC2!) Recap Three algorithms leveraging HTS data in different ways o AREM enables analysis in repetitive regions of the genome o TreeHMM synthesizes multiple datasets in related cell types to better annotate the genome o Genomix applies when the reference is inadequate or

unavailable and provides a scalable solution to assembly HTS requires solid computational models and algorithms to be successful Exciting time to be in biology! Costs continue to drop, quality is increasing New experimental methods are revealing comprehensive, in-depth biology at scales weve never seen before Computational methods are required to overcome errors, but also to model biological realities Acknowledgements TreeHMM AREM

Genomix Min Score and performance How many chromatin states? Apply tree-HMM to the Broad dataset (9 chromatin modification in 9 cell types): The optimal number of states is 18 by BIC. Model selection: AIC (Akaike information criterion), BIC (Bayesian information criterion) A Simple Lineage Tree

H1-ES cells erythrocytic leukaemia skeletal muscle myoblasts mammary epithelial cells normal epidermal keratinocytes

umbilical vein endothelial cells

Recently Viewed Presentations

  • The Crusades (Section 3)

    The Crusades (Section 3)

    Peasants. The First Crusade (1096-1099) Peasant army. Untrained. Lacked military equipment. Many killed by Muslim Turks. Knights. Succeeded in capturing Jerusalem. Second Crusade (1147-1149) After victory many Christians went back home. The Turks eventually took back much of the territory.
  • U.S. History 1st 9 Weeks Review - Socorro Independent School ...

    U.S. History 1st 9 Weeks Review - Socorro Independent School ...

    Alfred Thayer Mahan: Focused on Naval dominance for the success of expansionism abroad. The Spanish-American war is significant to the U.S. due to the acquisition of territory which allowed fro more resources and trade routes. Hawaii's annexation was important because...
  • My Life Development - Concordia University-Nebraska

    My Life Development - Concordia University-Nebraska

    This is an example of contemporary life-events approach. Santrock defines this as "an approach emphasizing that how a life event influences the individual's development depends not only on the event but also on mediating factors, the individual's adaptation to the...


    *Oxymoron is putting two contradictory words together. ex. Jumbo shrimp *Paradox is a statement, proposition, or situation that seems to be absurd or contradictory, but in fact is or may be true. ex. 'Everything begins where it ends.' ex. 'The...
  • Nuclear Chemistry - Baltimore Polytechnic Institute

    Nuclear Chemistry - Baltimore Polytechnic Institute

    Drill: Calculate the solubility of MgF2 in 0.10 M KF. Ksp MgF2 = 6.4 x 10-9 Nuclear Chemistry Radioactive Nuclei A nucleus that spontaneously decomposes Isotopes Elements with the same atomic number, but different mass number Isotopes All elements have...
  • History of Manipulation - APTA

    History of Manipulation - APTA

    Arial Calibri Garamond Wingdings Arial Unicode MS Abadi MT Condensed Extra Bold Times New Roman Office Theme Microsoft Excel Chart Position on Thrust Joint Manipulation Provided by Physical Therapists Manipulation Legislative Challenges PT profession response APTA Manipulation Task Force National...
  • Packet Level Algorithms - DIMACS

    Packet Level Algorithms - DIMACS

    Lots of parameters: number of hash functions, cells per bucket, fingerprint size, etc. Useful for flexible design. Fingerprint Compressed Filter (FCF) Example Experiment Summary FCF-based ACSM is the clear winner. Better performance than less space for the others in test...


    ACUTE RENAL FAILURE Carly Thompson MD, CCFP August 27, 2008 Indications for Hemodialysis Volume overload Hyperkalemia >6.5 or rising Acid-base imbalance Symptomatic uremia (pericarditis, encephalopathy, bleeding dyscrasia, nausea, vomiting, pruritis) Uremia (BUN >100) Dialyzable intoxications - methanol, ethylene glycol, theophylline,...