Agilent bake-off: oligonucleotide layers per gene


Investigating the source and function of most of the genome: endogenous retroelements and the proteins that bind them

Tim Hughes, University of Toronto
Banff International Research Station, "Statistical and Computational Challenges in Large Scale Molecular Biology"
March 27, 2017

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)

Genome: bought the book; hard to read (Eric Lander's seven-word Nano lecture, 2003 Ig Nobel awards)

TTTTTAGTAGCAATTTGTACTGATGGTATGGGGCCAAGAGATATATCTTAGAGGGAGGGCTGAGGGTTTGAAGTCCAACTCCTAAGCCAGTGCCAGAA GAGCCAAGGACAGGTACGGCTGTCATCACTTAGACCTCACCCTGTGGAGCCACACCCTAGGGTTGGCCAATCTACTCCCAGGAGCAGGGAGGGCAGGA GCCAGGGCTGGGCATAAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGG TGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCA AGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTAT TGGTCTATTTTCCCACCCTTAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATG GGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACT

GAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACGCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGT TAAGTTCATGTCATAGGAAGGGGATAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATTGCATCAGTGTGGAAGTCTCAGGATCGTTTT AGTTTCTTTTATTTGCTGTTCATAACAATTGTTTTCTTTTGTTTAATTCTTGCTTTCTTTTTTTTTCTTCTCCGCAATTTTTACTATTATACTTAATG CCTTAACATTGTGTATAACAAAAGGAAATATCTCTGAGATACATTAAGTAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTACATTACTATTTGGA ATATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATTTTCTTTTATTTTTAATTGATACATAATCATTATACATATTTATGGGTTAAAGT GTAATGTTTTAATATGTGTACACATATTGACCAAATCAGGGTAATTTTGCATTTGTAATTTTAAAAAATGCTTTCTTCTTTTAATATACTTTTTTGTT TATCTTATTTCTAATACTTTCCCTAATCTCTTTCTTTCAGGGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGAT AATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATAGCAGCT ACAATCCAGCTACCATTCTGCTTTTATTTTATGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTC TTATCTTCCTCCCACAGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAG AAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAA GTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCAATGATGTATTTAAAT TATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCAAACCTTGGGAAAATACACTA TATCTTAAACTCCATGAAAGAAGGTGAGGCTGCAAACAGCTAATGCACATTGGCAACAGCCCCTGATGCATATGCCTTATTCATCCCTCAGAAAAGGA

Gene regulation occurs at many steps: promoter definition, enhancer definition, chromatin remodelling and modification, DNA topology, PIC formation, initiation, capping, elongation, splicing, cleavage and polyadenylation, termination, nuclear export, RNA localization, translation, degradation.

Cells can "classify" elements based solely on sequence; we can too, right?
Many successes at individual problems

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important

Classifiers for promoter and polyadenylation site
Classifier outputs are emissions for HMM representing gene state
Model learns what constitutes "normal" genes, predicts "dark matter" transcripts, heterogeneous ends, etc.
Tested by MPRA and by "gene synthesis"
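To make the "classifier outputs are emissions" idea concrete, here is a minimal Viterbi sketch. It is not the model from the talk: the four states, the transition probabilities, and the way the two classifier scores are combined into an emission likelihood are all illustrative assumptions.

```python
import math

# Hypothetical gene states; the talk does not list the actual state set.
STATES = ["intergenic", "promoter", "transcribed", "polyA"]

# Assumed transition probabilities (each row sums to 1); illustrative only.
TRANS = {
    "intergenic":  {"intergenic": 0.90, "promoter": 0.10},
    "promoter":    {"promoter": 0.50, "transcribed": 0.50},
    "transcribed": {"transcribed": 0.90, "polyA": 0.10},
    "polyA":       {"polyA": 0.50, "intergenic": 0.50},
}

EPS = 1e-12  # floor that keeps log() finite for disallowed transitions


def log(x):
    return math.log(max(x, EPS))


def emission(state, p_prom, p_polya):
    """Assumed likelihood of the two classifier scores at one position."""
    if state == "promoter":
        return p_prom * (1 - p_polya)
    if state == "polyA":
        return (1 - p_prom) * p_polya
    return (1 - p_prom) * (1 - p_polya)


def viterbi(scores, start="intergenic"):
    """Most probable state path for a list of (p_prom, p_polya) scores."""
    best = {
        s: log(1.0 if s == start else 0.0) + log(emission(s, *scores[0]))
        for s in STATES
    }
    back = []
    for obs in scores[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev, score = max(
                ((p, best[p] + log(TRANS[p].get(s, 0.0))) for p in STATES),
                key=lambda t: t[1],
            )
            new_best[s] = score + log(emission(s, *obs))
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy classifier scores: a promoter-like position, then a polyA-like position.
scores = [(0.01, 0.01), (0.99, 0.01), (0.01, 0.01),
          (0.01, 0.01), (0.01, 0.99), (0.01, 0.01)]
print(viterbi(scores))
```

Because the transition structure only allows the order promoter, transcribed, polyA, the decoded path encodes the relative positioning of elements, which is the point of challenge 2.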

Genomic transcript predictions
[Figure: genome browser view with tracks for Initiation, Elongation, Termination, Unified Model, TSS/Gene/CpA, RNA-seq, ORFs, and other transcripts on the + and - strands; a Ty element is labeled]

[Figure: the gal1/gal10 locus with RNA-seq, Unified Model, TSS/Gene/CpA, Initiation, Elongation, and Termination tracks]

Growing evidence that human transcription is also sloppy

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important
3. Missing motifs for TFs

CIS-BP: Catalog of Inferred Sequence Binding Preferences
Tracks homology and predicts motifs across >300 genomes (Weirauch et al., Cell 2014 established thresholds for >40 DBD types)
Weirauch, Yang et al., Cell 2014
CisBP-RNA: Ray, Kazan, Cook, Weirauch, Najafabadi et al., Nature, 2013

No motif for hundreds of human TFs
Most are C2H2 zinc finger proteins

A sampling of the ~710 human C2H2 proteins
KRAB domain (352):

ZNF554 ZNF670 ZNF454 ZNF136 ZNF705A ZNF460 ZNF667 ZNF514 ZNF45 ZNF528
SCAN domain (52): ZSCAN22 MZF1
BTB domain (50): ZBTB12 ZBTB48
SET domain (11): PRDM5
C2H2 only: ZNF271 CTCF ZNF384 ZNF628 YY1
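The homology-based inference described for CIS-BP (a TF without data inherits the motif of a characterized TF whose DNA-binding domain is similar enough, with a threshold per DBD type) can be sketched roughly as follows. This is a toy illustration, not the CIS-BP pipeline: the threshold values, the `THRESHOLDS` table, the function names, and the example sequences are all hypothetical, and real DBDs would need a proper alignment first.

```python
# Hypothetical per-family identity thresholds (fraction of identical residues).
# These numbers are placeholders, not the published CIS-BP thresholds.
THRESHOLDS = {"C2H2 ZF": 0.70, "Homeodomain": 0.70, "bHLH": 0.60}


def percent_identity(a, b):
    """Fraction of identical residues between two pre-aligned DBD sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def infer_motif(query_dbd, family, characterized):
    """Return (tf_name, identity) for the closest characterized homolog.

    `characterized` maps TF names to aligned DBD sequences that already have
    an experimentally determined motif. Returns None when no homolog reaches
    the family threshold, i.e. the query remains a "missing motif" TF.
    """
    threshold = THRESHOLDS[family]
    best = None
    for name, dbd in characterized.items():
        ident = percent_identity(query_dbd, dbd)
        if ident >= threshold and (best is None or ident > best[1]):
            best = (name, ident)
    return best


# Made-up 10-residue "DBDs" purely for demonstration.
characterized = {"TF_A": "RKRGRQTYTR", "TF_B": "AAAAAAAAAA"}
print(infer_motif("RKRGRQTYSR", "Homeodomain", characterized))
print(infer_motif("GGGGGGGGGG", "Homeodomain", characterized))
```

A query that returns `None` corresponds to the situation on the slide: no sufficiently similar characterized protein exists, so a motif cannot be inferred by homology.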

No motif for hundreds of human TFs
Most are C2H2 zinc finger proteins
Increase of ~260 in the last 4 months!
Next ENCODE includes 159 C2H2 proteins
Yin et al. (Taipale lab, in press) has SELEX data for several dozen C2H2 proteins

A system for analysis of human C2H2 zinc finger DNA binding
Recognition code trained on B1H data
RCADE (Recognition Code-Assisted Discovery of regulatory Elements)
Najafabadi, Mnaimneh, Schmitges et al., Nature Biotech 2015
Najafabadi et al., Bioinformatics 2015
ChIP-seq with GFP-tagged inducible ORFs in HEK293 cells
RCADE motifs for 131 C2H2-ZF proteins

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)
2. Relative positioning of regulatory elements along chromosomes is important
3. Missing motifs for TFs
4. Mapping the domestication of retroelements and transposons, and the genesis and adaptive roles of the proteins that bind them

Human Endogenous Retroelement and Transposon catalog is incomplete

Proprietary "RepBase" is the gold standard
RepBase combines automated and manually defined models / consensus / who-knows-what
DFAM is an "open access" alternative, but only covers four species (Hs, Dr, Ce, Dm)
Many of the models are truncated or noncoding

Are there human TFs that evolved to silence elements that no longer exist in the human genome?

Most retroelements in genomes are truncated (figure from Imbeault et al., 2017)
Most LINE L1 models in DFAM are truncated
A "working" LINE L1 is 6-7 kb

[Figure: length of consensus (y-axis, 0 to 6000) plotted against Kimura divergence (x-axis, 0 to 70)]

Will ancestral genome reconstructions improve recovery of active "source" elements?
[Figure: original ERE and its individual copies in present-day genomes]
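For reference, Kimura divergence is commonly computed with the two-parameter (K80) model, K = -1/2 ln((1 - 2p - q) sqrt(1 - 2q)), where p and q are the fractions of aligned sites showing transitions and transversions. The sketch below assumes that plain K80 form; the slides do not say which variant (for example, with CpG adjustment) was used, and the function name and example sequences are made up.

```python
import math

PURINES = {"A", "G"}


def kimura_divergence(copy, consensus):
    """Kimura two-parameter distance between two pre-aligned sequences.

    Columns containing anything other than A/C/G/T (gaps, N) are skipped.
    Multiply the result by 100 to get a percent-style axis value.
    """
    transitions = transversions = compared = 0
    for a, b in zip(copy.upper(), consensus.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable aligned positions")
    p = transitions / compared
    q = transversions / compared
    arg = (1 - 2 * p - q) * math.sqrt(max(1 - 2 * q, 0.0))
    if arg <= 0:
        raise ValueError("sequences too diverged for the K80 correction")
    return -0.5 * math.log(arg)


# One transition (A->G) in ten aligned sites: p = 0.1, q = 0.
k = kimura_divergence("GAAAAAAAAA", "AAAAAAAAAA")
print(round(100 * k, 2))
```

The correction matters for old elements: the corrected distance grows faster than the raw mismatch fraction, so ancient copies sit further right on a divergence axis than their raw percent mismatch would suggest.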

Acknowledgements
Hughes lab: Hamed Najafabadi (McGill), Matt Weirauch (Cincinnati), Frank Schmitges, Marjan Barazandeh, Laura Campitelli, Ally Yang, Ernest Radovani, Mihai Albu, Hong Zheng, Debashish Ray, Sam Lambert, Tharsan Kanagalingam

Jack Greenblatt Andrew Emili Guoqing Zhong Peter Young Wei Feng Dai Hua Tang Hongbo Guo Quaid Morris Philip Kim CIHR NIH CIFAR
