Investigating the source and function of most of the genome: endogenous retroelements and the proteins that bind them
Tim Hughes
University of Toronto
Banff International Research Station
"Statistical and Computational Challenges in Large Scale Molecular Biology"
March 27, 2017

Challenges
1. Mapping between DNA sequence and transcriptional outputs remains a good, but hard, computational problem (many problems)

Genome: bought the book; hard to read
(Eric Lander's seven-word Nano lecture, 2003 Ig Nobel awards)
TTTTTAGTAGCAATTTGTACTGATGGTATGGGGCCAAGAGATATATCTTAGAGGGAGGGCTGAGGGTTTGAAGTCCAACTCCTAAGCCAGTGCCAGAA GAGCCAAGGACAGGTACGGCTGTCATCACTTAGACCTCACCCTGTGGAGCCACACCCTAGGGTTGGCCAATCTACTCCCAGGAGCAGGGAGGGCAGGA GCCAGGGCTGGGCATAAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGG TGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCA AGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTAT TGGTCTATTTTCCCACCCTTAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATG GGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACT
GAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACGCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGT TAAGTTCATGTCATAGGAAGGGGATAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATTGCATCAGTGTGGAAGTCTCAGGATCGTTTT AGTTTCTTTTATTTGCTGTTCATAACAATTGTTTTCTTTTGTTTAATTCTTGCTTTCTTTTTTTTTCTTCTCCGCAATTTTTACTATTATACTTAATG CCTTAACATTGTGTATAACAAAAGGAAATATCTCTGAGATACATTAAGTAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTACATTACTATTTGGA ATATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATTTTCTTTTATTTTTAATTGATACATAATCATTATACATATTTATGGGTTAAAGT GTAATGTTTTAATATGTGTACACATATTGACCAAATCAGGGTAATTTTGCATTTGTAATTTTAAAAAATGCTTTCTTCTTTTAATATACTTTTTTGTT TATCTTATTTCTAATACTTTCCCTAATCTCTTTCTTTCAGGGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGAT AATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATAGCAGCT ACAATCCAGCTACCATTCTGCTTTTATTTTATGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTC TTATCTTCCTCCCACAGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAG AAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAA GTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCAATGATGTATTTAAAT TATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCAAACCTTGGGAAAATACACTA TATCTTAAACTCCATGAAAGAAGGTGAGGCTGCAAACAGCTAATGCACATTGGCAACAGCCCCTGATGCATATGCCTTATTCATCCCTCAGAAAAGGA

Gene regulation occurs at many steps
Promoter definition
Enhancer definition
Chromatin remodelling and modification
DNA topology
PIC formation
Initiation
Capping
Elongation
Splicing
Cleavage and polyadenylation
Termination
Nuclear export
RNA localization
Translation
Degradation

Cells can "classify" elements based solely on sequence; we can too, right?
Many successes at individual problems

Challenges
2. Relative positioning of regulatory elements along chromosomes is important

Classifiers for promoter and polyadenylation site
Classifier outputs are emissions for an HMM representing gene state
Model learns what constitutes "normal" genes; predicts "dark matter" transcripts, heterogeneous ends, etc.
Tested by MPRA and by "gene synthesis"
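
The idea of feeding classifier outputs into an HMM as emissions can be sketched as follows. This is a minimal illustration, not the model from the talk: the four states, the transition probabilities, the way the two classifier scores are combined into an emission likelihood, and the example scores are all assumptions made for the sketch. Decoding uses the standard Viterbi algorithm in log space.

```python
import math

# Hypothetical gene-state labels; the talk does not list the model's states.
STATES = ["intergenic", "promoter", "gene", "polyA"]

# Assumed transition probabilities that enforce the order
# intergenic -> promoter -> gene -> polyA -> intergenic.
TRANS = {
    "intergenic": {"intergenic": 0.9, "promoter": 0.1},
    "promoter": {"promoter": 0.5, "gene": 0.5},
    "gene": {"gene": 0.9, "polyA": 0.1},
    "polyA": {"polyA": 0.5, "intergenic": 0.5},
}


def emission(state, p_prom, p_polya):
    """Likelihood of one bin's classifier scores under a state (assumed form).

    Intergenic and gene-body bins share the same emission; only the
    transition structure tells them apart.
    """
    if state == "promoter":
        return p_prom * (1 - p_polya)
    if state == "polyA":
        return (1 - p_prom) * p_polya
    return (1 - p_prom) * (1 - p_polya)


def viterbi(scores, start="intergenic"):
    """Most likely state path for a list of (p_promoter, p_polyA) pairs."""
    def log(x):
        return math.log(max(x, 1e-12))  # guard against log(0)

    best = {
        s: log(emission(s, *scores[0])) if s == start else -math.inf
        for s in STATES
    }
    backpointers = []
    for p_prom, p_polya in scores[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Best predecessor among states that are allowed to move into s.
            score, prev = max(
                (best[r] + log(TRANS[r][s]), r) for r in STATES if s in TRANS[r]
            )
            new_best[s] = score + log(emission(s, p_prom, p_polya))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Made-up classifier outputs for six consecutive bins.
example_scores = [
    (0.01, 0.01),
    (0.99, 0.01),  # promoter classifier fires
    (0.01, 0.01),
    (0.01, 0.01),
    (0.01, 0.99),  # polyadenylation-site classifier fires
    (0.01, 0.01),
]
path = viterbi(example_scores)
print(path)
```

Because the transitions only allow the promoter, gene and polyA states in that order, a promoter call with no downstream polyadenylation signal (or the reverse) is penalized, which is one way relative positioning of elements can enter the model.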

[Figure: Genomic transcript predictions. Tracks shown for + and - strands: Initiation, Elongation, Termination, Unified Model, TSS/Gene/CpA, RNASeq, ORFs, Other Transcripts. Panels: Ty element; gal1/gal10.]

Growing evidence that human transcription is also sloppy

Challenges
3. Missing motifs for TFs

CIS-BP: Catalog of Inferred Sequence Binding Preferences
Tracks homology and predicts motifs across >300 genomes (Weirauch et al., Cell 2014 established thresholds for >40 DBD types)
Weirauch, Yang et al., Cell 2014
CisBP-RNA: Ray, Kazan, Cook, Weirauch, Najafabadi et al., Nature, 2013

No motif for hundreds of human TFs
Most are C2H2 zinc finger proteins

A sampling of the ~710 human C2H2 proteins
KRAB domain (352): ZNF554, ZNF670, ZNF454, ZNF136, ZNF705A, ZNF460, ZNF667, ZNF514, ZNF45, ZNF528
SCAN domain (52): ZSCAN22, MZF1
BTB domain (50): ZBTB12, ZBTB48
SET domain (11): PRDM5
C2H2 only: ZNF271, CTCF, ZNF384, ZNF628, YY1

No motif for hundreds of human TFs
Most are C2H2 zinc finger proteins
Increase of ~260 in the last 4 months!
Next ENCODE includes 159 C2H2 proteins
Yin et al. (Taipale lab, in press) have SELEX data for several dozen C2H2 proteins

A system for analysis of human C2H2 zinc finger DNA binding
Recognition code trained on B1H data
RCADE (Recognition Code-Assisted Discovery of regulatory Elements)
Najafabadi, Mnaimneh, Schmitges et al., Nature Biotech 2015
Najafabadi et al., Bioinformatics 2015
ChIP-seq with GFP-tagged inducible ORFs in HEK293 cells
RCADE motifs for 131 C2H2-ZF proteins
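
To make the recognition-code idea concrete, here is a minimal sketch of how per-finger base preferences might be assembled into a seed motif and scanned against a sequence. It is not RCADE itself, which goes on to refine such predictions against ChIP-seq data. The `FINGER_PREFS` table, the finger names and every number in it are invented for illustration; a real recognition code would predict these preferences from each finger's DNA-contacting residues. The sketch also assumes each triplet is written 5' to 3' and relies on the fact that C2H2 arrays bind antiparallel to the DNA, so the N-terminal finger contacts the 3'-most triplet.

```python
import math

BASES = "ACGT"

# Hypothetical per-finger preferences: three positions (5'->3'), each a
# probability distribution over A, C, G, T. All values are made up.
FINGER_PREFS = {
    "finger1": [[0.1, 0.1, 0.7, 0.1], [0.1, 0.7, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]],  # GCA
    "finger2": [[0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.7, 0.1]],  # TGG
}


def array_to_pwm(fingers):
    """Concatenate triplets for fingers listed N- to C-terminal.

    The order is reversed because the N-terminal finger binds the 3' end
    of the site.
    """
    pwm = []
    for name in reversed(fingers):
        pwm.extend(FINGER_PREFS[name])
    return pwm


def consensus(pwm):
    """Most probable base at each motif position."""
    return "".join(BASES[col.index(max(col))] for col in pwm)


def log_odds(pwm, site, background=0.25):
    """Log2 odds of a site under the motif versus a uniform background."""
    return sum(
        math.log2(col[BASES.index(base)] / background)
        for col, base in zip(pwm, site)
    )


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window on one strand."""
    width = len(pwm)
    return max(
        (log_odds(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


pwm = array_to_pwm(["finger1", "finger2"])
score, position = best_hit(pwm, "AATGGGCATT")
print(consensus(pwm), position, round(score, 2))
```

A fuller version would also scan the reverse complement and would treat the predicted motif only as a starting point for enrichment testing in ChIP-seq peaks.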

Challenges
4. Mapping the domestication of retroelements and transposons, and the genesis and adaptive roles of the proteins that bind them

Human Endogenous Retroelement and Transposon catalog is incomplete
Proprietary "RepBase" is the gold standard
RepBase combines automated and manually defined models / consensus / who-knows-what
DFAM is an "open access" alternative, but only covers four species (Hs, Dr, Ce, Dm)
Many of the models are truncated or noncoding
Are there human TFs that evolved to silence elements that no longer exist in the human genome?

Most retroelements in genomes are truncated (figure from Imbeault et al., 2017)

Most LINE L1 models in DFAM are truncated
A "working" LINE L1 is 6-7 kb
[Plot: length of consensus (y-axis, 0 to 6000) against Kimura divergence (x-axis, 0 to 70)]

Will ancestral genome reconstructions improve recovery of active "source" elements?
[Diagram: original ERE; individual copies in present-day genomes]
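
For readers unfamiliar with the Kimura divergence used as a plot axis above, the sketch below computes the Kimura two-parameter distance between a consensus and one aligned copy. It assumes the axis refers to this standard measure (the slide does not say, and repeat-annotation tools usually report it as a percentage), and the function name and example sequences are made up. With P the fraction of transitions and Q the fraction of transversions, the distance is K = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)).

```python
import math

PURINES = {"A", "G"}


def kimura_2p(consensus_seq, copy_seq):
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped. Raises
    ValueError if nothing can be compared, or (from math.log) if the
    sequences are too diverged for the formula to be defined.
    """
    transitions = transversions = compared = 0
    for a, b in zip(consensus_seq.upper(), copy_seq.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable aligned positions")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One C->T transition in ten aligned positions (toy example).
k = kimura_2p("ACGTACGTAC", "ACGTACGTAT")
print(round(100 * k, 2), "percent")
```

Older copies have accumulated more substitutions relative to their consensus, so a higher value is commonly read as greater age of the element family.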

Acknowledgements
Hughes lab: Hamed Najafabadi (McGill), Matt Weirauch (Cincinnati), Frank Schmitges, Marjan Barazandeh, Laura Campitelli, Ally Yang, Ernest Radovani, Mihai Albu, Hong Zheng, Debashish Ray, Sam Lambert, Tharsan Kanagalingam
Jack Greenblatt, Andrew Emili, Guoqing Zhong, Peter Young, Wei Feng Dai, Hua Tang, Hongbo Guo, Quaid Morris, Philip Kim
CIHR, NIH, CIFAR