Comparative Genomics
Ross Hardison, Penn State University
PSU, Nov. 28, 2006
Major collaborators:
  Webb Miller, Francesca Chiaromonte, Laura Elnitski, David King, et al., PSU
  James Taylor, Courant Institute, New York University
  David Haussler, Jim Kent, Univ. California at Santa Cruz
  Ivan Ovcharenko, Lawrence Livermore National Lab

Major goals of comparative genomics
Identify all DNA sequences in a genome that are functional
  Selection to preserve function
  Adaptive selection
Determine the biological role of each functional sequence
Elucidate the evolutionary history of each type of sequence
Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research

Three major classes of evolution
Neutral evolution
  Acts on DNA with no function
  Genetic drift allows some random mutations to become fixed in a population
Purifying (negative) selection
  Acts on DNA with a conserved function
  Signature: Rate of change is significantly slower than that of neutral DNA
  Sequences with a common function in the species examined are under purifying (negative) selection
Darwinian (positive) selection
  Acts on DNA in which changes benefit an organism
  Signature: Rate of change is significantly faster than that of neutral DNA

Ideal case for interpretation
[Figure: similarity vs. position along chromosome; labels: negative (purifying) selection, neutral DNA, positive (adaptive) selection]
Negative (purifying) selection: exonic segments coding for regions of a polypeptide with common function in two species.
Positive (adaptive) selection: exonic segments coding for regions of a polypeptide in which change is beneficial to one of the two species.

Taxonomic distribution of homologs of mouse proteins
Waterston et al.

Conservation in different parts of genes
Average percent identity (black) or percent aligned (blue) for 10,000 orthologous genes
Waterston et al., Mouse Genome, Nature
Levels of conservation (Human vs Mouse) in different types of proteins
[Figure legends, two panels]
  Black: all orthologous proteins (Hum-mouse); 12,845 1:1 gene pairs
  …: proteins with recognized domains
  …y: proteins without recognized domains
  Black: Nuclear proteins
  Red: Cytoplasmic proteins
  Gray: Extracellular proteins; positive, diversifying selection
KA = rate of nonsynonymous substitutions
KS = rate of synonymous substitutions
Waterston et al., Nature 2002

Rat-specific gene expansions
Genes that have expanded in number in rats are enriched in
  Immune function / antigen recognition: immunoglobulins, T-cell receptor alpha
  Detoxification: cytochrome P450
  Reproduction: alpha2u-globulin
  Olfaction and odorant detection: olfactory receptors
These genes are also rapidly evolving
Segmental duplications are enriched for the same genes
Rat Genome SPC 2004 Nature

Adaptive remodeling of gene clusters
Figure 13. Adaptive remodeling of genomes and genes. a, Orthologous regions of rat, human and mouse genomes encoding pheromone-carrier proteins of the lipocalin family (a2u-globulins in rat and major urinary proteins in mouse) shown in brown. Zfp37-like zinc finger genes are shown in blue. Filled arrows represent likely genes, whereas striped arrows represent likely pseudogenes. Gene expansions are bracketed. Arrowhead orientation represents transcriptional direction. Flanking genes 1 and 2 are TSCOT and CTR1, …
Rat Genome SPC 2004 Nature

DCODE.org Comparative Genomics: Align your own sequences
blastZ
multiZ and TBA
zPicture interface for aligning sequences
Automated extraction of sequence and annotation

Pre-computed alignment of genomes
blastZ for pairwise alignments
multiZ for multiple alignment
  Human, chimp, mouse, rat, chicken, dog
  Also multiple fly, worm, yeast genomes
Organize local alignments: chains and nets
All against all comparisons
  High sensitivity and specificity
Computer cluster at UC Santa Cruz
  1024 CPUs, Pentium III
  Job takes about half a day
Results available at
  UCSC Genome Browser: http://genome.ucsc.edu
  Galaxy server: http://www.bx.psu.edu
[Photos: Webb Miller, Jim Kent, David Haussler]
Schwartz et al., 2003, blastZ, Genome Research
Blanchette et al., 2004, TBA and multiZ, Genome Research

Genome-wide local alignment chains
Human: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 1…
blastZ: Each segment of human is given the opportunity to align with all mouse sequences.
… blastZ in parallel for all human segments. Collect all local alignments above th…
… local alignments into a set of chains based on position in assembly and ori…
[Figure: Human and Mouse sequences joined by Level 1 chain, Level 2 chain, Net]

Comparative genomics to find functional sequences
Find common sequences: blastZ, multiZ
Identify functional sequences: ~145 Mbp
[Figure: genome size in million base pairs (Mbp) for Human, Mouse, Rat; values shown: 2,900; 2,500; 2,400; 1,200; All mammals: 1000 Mbp]
Also birds: 72 Mb
Papers in Nature from mouse and rat and chicken genome consortia, 2002, …

Use measures of alignment quality to discriminate functional from nonfunctional DNA
Compute a conservation score adjusted for the local neutral rate
Score S for a 50 bp region R is the normalized fraction of aligned bases that are identical
  Subtract mean for aligned ancestral repeats in the surrounding region
  Divide by standard deviation
  $S = \sqrt{n}\,(p - \mu) / \sqrt{\mu (1 - \mu)}$
  p = fraction of aligned sites in R that are identical between human and mouse
  μ = average fraction of aligned sites that are identical in aligned ancestral repeats in the surrounding region
  n = number of aligned sites in R
Waterston et al., Nature

Decomposition of conservation score into neutral and likely-selected portions
[Figure legend: Neutral DNA (ARs); All DNA; Likely selected DNA]
At least 5-6%
S is the conservation score adjusted for variation in the local substitution rate.
The frequency of the S score for all 50 bp windows in the human genome is shown.
From the distribution of S scores in ancestral repeats (mostly neutral DNA), can compute a probability that a given alignment could result from locally adjusted neutral rate.
Waterston et al., Nature

DNA sequences of mammalian genomes
Human: 2.9 billion bp, finished
  High quality, comprehensive sequence, very few gaps
Mouse, rat, dog, opossum, chicken, frog, etc.
About 40% of the human genome aligns with mouse
  This is conserved, but not all is under selection.
About 5-6% of the human genome has been under purifying selection since the rodent-primate divergence
About 1.2% codes for protein
The 4 to 5% of the human genome that is under selection but does not code for protein should have:
  Regulatory sequences
  Non-protein coding genes (UTRs and noncoding RNAs)
  Other important sequences

Conservation score S in different types of regions
[Figure legend]
  Red: Ancestral repeats (mostly neutral)
  Blue: First class in label
  Green: Second class in label
Waterston et al., Nature
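
The conservation score S from the preceding slides can be sketched in a few lines. This is a minimal illustration, not the code used for the mouse genome paper. It assumes the standard deviation is the binomial one, sqrt(mu (1 - mu) / n), where p is the fraction of identical aligned sites in the window, mu is the mean identity in surrounding ancestral repeats, and n is the number of aligned sites. The function names are hypothetical.

```python
import math


def window_identity(human, mouse):
    """Return (p, n) for one aligned window.

    p is the fraction of aligned sites that are identical and n is the
    number of aligned sites. Columns with a gap ('-') in either sequence
    are not counted as aligned sites (an assumption of this sketch).
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(h, m) for h, m in zip(human.upper(), mouse.upper())
             if h != "-" and m != "-"]
    n = len(pairs)
    matches = sum(1 for h, m in pairs if h == m)
    return (matches / n if n else 0.0), n


def conservation_score(p, mu, n):
    """Subtract the local neutral mean and divide by the standard deviation.

    Assumes a binomial standard deviation sqrt(mu * (1 - mu) / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    return (p - mu) / math.sqrt(mu * (1.0 - mu) / n)
```

A window whose identity equals the local neutral mean scores 0; windows more similar than nearby ancestral repeats score above 0.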

Leverage many species to improve accuracy and resolution of signals for constraint
ENCODE multispecies alignment group
Margulies et al., 2007

Coverage of human by alignments with other vertebrates ranges from 1% to 91%
[Figure: phylogenetic tree with divergence times in millions of years (5.4, 91, 92, 173, 220, 310, 360, 450) for Human, Chimp, Mouse, Rat, Dog, Cow, Opossum, Platypus, Chicken, Frog, Zebrafish, Tetraodon, Fugu; x-axis: percent of human aligning with second species]

Distinctive divergence rates for different types of functional DNA sequences
[Figure, two panels; x-axis: time of divergence from common ancestor to human, Myr ago]

Large divergence in cis-regulatory modules from opossum to platypus
[Figure: series for Genome, Known regulatory regions, CpG islands, Functional promoters, Coding exons, Ultraconserved (HM); x-axis: time of divergence from common ancestor to human, Myr ago]

cis-Regulatory modules conserved from human to fish
[Figure: tree with divergence times in millions of years (91, 173, 310, 450)]
About 20% of CRMs
Tend to regulate genes whose products control transcription and development

cis-Regulatory modules conserved in eutherian mammals and marsupials
[Figure: tree with divergence times in millions of years (91, 173, 310, 450)]
Human-marsupial alignments capture about 60% of CRMs
  Tend to occur close to genes involved in aminoglycan synthesis, organelle biosynthesis
Human-mouse alignments capture about 87% of CRMs
  Tend to occur close to genes involved in apoptosis, steroid hormone receptors, etc.
Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.

Score multi-species alignments for features associated with function
Multiple alignment scores
  Margulies et al. (2003) Genome Research 13:2505-2518
  Binomial, parsimony
PhastCons
  Siepel et al. (2005) Genome Research 15:1034-1050
  Phylogenetic Hidden Markov Model
  Posterior probability that a site is among the most highly conserved sites
GERP
  Cooper et al. (2005) Genome Research 15:901-913
  Genomic Evolutionary Rate Profiling
  Measures constraint as rejected substitutions = nucleotide substitution deficits

phastCons: Likelihood of being constrained
Phylogenetic Hidden Markov Model
Posterior probability that a site is among the most highly conserved sites
Allows for variation in rates along lineages
c is conserved (constrained)
n is nonconserved (aligns but is not clearly subject to purifying selection)
Siepel et al. (2005) Genome Research 15:1034-1050

Larger genomes have more of the constrained DNA in noncoding regions
Siepel et al. 2005, Genome Research

Some constrained introns are editing complementary regions: GRIA2
Siepel et al. 2005, Genome Research

3' UTRs can be highly constrained over large distances
3' UTRs contain RNA processing signals, miRNA targets, other regions subject to constraints
Siepel et al. 2005, Genome Research

Ultraconserved elements = UCEs
At least 200 bp with no interspecies differences
  Bejerano et al. (2004) Science 304:1321-1325
  481 UCEs with no changes among human, mouse and rat
  Also conserved out to dog and chicken
  More highly conserved than vast majority of coding regions
Most do not code for protein
  Only 111 out of 481 overlap with protein-coding exons
  Some are developmental enhancers.
  Nonexonic UCEs tend to cluster in introns or in vicinity of genes encoding transcription factors regulating development
  88 are more than 100 kb away from an annotated gene; may be distal enhancers

GO category analysis of UCE-associated genes
Genes in which a coding exon overlaps a UCE
  91 Type I genes
  RNA binding and modification
  Transcriptional regulation
Genes in the vicinity of a UCE (no overlap of coding exons)
  211 Type II genes
  Transcriptional regulation
  Developmental regulators
Bejerano et al. (2004) Science

Intronic UCE in SOX6 enhances expression in melanocytes in transgenic mice
[Figure labels: UCEs; Tested UCEs]
Pennacchio et al., http://enhancer.lbl.gov/

The most stringently conserved sequences in eukaryotes are mysteries
Yeast MATa2 locus
  Most conserved region in 4 species of yeast
  100% identity over 357 bp
  Role is not clear
Vertebrate UCEs
  More constrained than exons in vertebrates
  Noncoding UCEs are not detectable outside chordates, whereas coding regions are
  Were they fast-evolving prior to vertebrate/invertebrate divergence?
  Are they chordate innovations? Where did they come from?
  Role of many is not clear; need for 100% identity over 200 bp is not obvious for any
What molecular process requires strict invariance for at least 200 nucleotides?
  One possibility: Multiple, overlapping functions

Use measures of alignment texture to discriminate functional classes of DNA
Mouse Cons track (L-scores): measures of alignment quality.
  Match > Mismatch > Gap
Alternatively, can analyze the patterns within alignments (texture) to try to distinguish among functional classes
  Regulatory regions vs bulk DNA
  Patterns are short strings of matches, mismatches, gaps
  Find frequencies for each string using training sets
    93 known regulatory regions
    200 ancestral repeats (neutral)
Regulatory potential genome-wide
Elnitski et al. (2003) Genome Research 13: 64-72.

Evaluate patterns in alignments to discriminate functional classes of DNA
1. Collapse the alignment to a small alphabet, e.g.
     Match involving G or C = S
     Match involving A or T = W
     Transition = I
     Transversion = V
     Gap = G

     Alignment           seq1  G T A C C T A C T A C G C A
                         seq2  G T G T C G - - A G C C C A
     Collapsed alphabet        S W I I S V G G V I S V S W

2. Is a pattern, e.g., SWIIS followed by V, found more frequently in alignments of known cis-regulatory modules (set of 93) or neutral DNA (200 ancestral repeats)?
   [Figure: worked ratios of pattern frequencies (fractions 5/10, 1/6, 1/4, 2/8, 1/4, 3/6) giving values 3, 1 and 0.5]
3. The regulatory potential for any alignment is a log-likelihood estimate of the extent to which its patterns are more like those in regulatory regions than in neutral DNA.

Regulatory potential (RP) to distinguish functional classes
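
Step 1 above (collapsing a pairwise alignment to the S/W/I/V/G alphabet) and the pattern counting needed for step 2 can be sketched as follows. This is an illustrative sketch rather than the published implementation; the function names are hypothetical, and it assumes the input contains only A, C, G, T and '-'.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def collapse_column(a, b):
    """Map one alignment column to S, W, I, V or G."""
    a, b = a.upper(), b.upper()
    if a == "-" or b == "-":
        return "G"                      # gap
    if a not in BASES or b not in BASES:
        raise ValueError(f"unexpected character in column: {a!r}/{b!r}")
    if a == b:
        return "S" if a in "GC" else "W"  # match on G/C vs. match on A/T
    both_purines = a in PURINES and b in PURINES
    both_pyrimidines = a in PYRIMIDINES and b in PYRIMIDINES
    return "I" if (both_purines or both_pyrimidines) else "V"


def collapse_alignment(seq1, seq2):
    """Collapse two equal-length aligned sequences to a symbol string."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return "".join(collapse_column(a, b) for a, b in zip(seq1, seq2))


def count_patterns(collapsed, k):
    """Count every overlapping pattern of k symbols."""
    return Counter(collapsed[i:i + k] for i in range(len(collapsed) - k + 1))


collapsed = collapse_alignment("GTACCTACTACGCA", "GTGTCG--AGCCCA")
```

Run on the example alignment from the slide, `collapsed` reproduces the collapsed string shown there; counting patterns separately in the regulatory and neutral training sets gives the frequencies compared in step 2.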

Good performance of regulatory potential (RP) for finding cis-regulatory modules
Taylor et al. (2006) Genome Research, in press (October or November)

Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. Can rescue by expressing an estrogen-responsive form of GATA-1
Rylski et al., Mol Cell Biol. 2003

Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes

Conservation of predicted binding sites for transcription factors
Binding site for GATA-1
See poster from Yuepin Zhou, Yong Cheng, Hao Wang et al.

preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids

preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome

Examples of validated preCRMs

Correlation of Enhancer Activity with RP Score

Validation status for 99 tested fragments

preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated

Conclusions
Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection).
Patterns in alignments and conservation of some TFBSs can be used to predict some cis-regulatory elements.
The predictions of cis-regulatory elements for erythroid genes are validated at a good rate.
Databases and servers such as the UCSC Table Browser, Galaxy, and others provide access to these data.
  http://genome.ucsc.edu/
  http://www.bx.psu.edu/

Many thanks
Wet Lab: Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King
PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko
RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski
Alignments, chains, nets, browsers, ideas: Webb Miller, Jim Kent, David Haussler
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU

Regulatory Potential (RP) features
Computation of 2-way RP score using 5-symbol, 5th order Markov model
Symbols: MAT, MGC, V, T, GAP

Alignment
  Hum      G   T   A  C  C   T  A   C   T  A  C   C   C   A
  Mus      G   T   G  T  C   G  -   -   A  G  C   C   C   A
  Symbols  MGC MAT T  T  MGC V  GAP GAP V  T  MGC MGC MGC MAT

Training
  Positive set: 93 known CRMs
  Negative set: 200 ancestral repeats
  Each training set gives a table of frequencies, with one row per context of five preceding symbols (MAT-MAT-MAT-MAT-MAT, MAT-MAT-MAT-MAT-MGC, ...) and one column per next symbol (MAT, MGC, V, T, GAP)
  Example entries for context MAT-T-T-MGC-V: 0.001 in one training table and 0.0001 in the other

Score matrix formed by taking log-odds ratio
  Example entry for context MAT-T-T-MGC-V: ln(10)
Measure how much more likely an alignment is regulatory as compared with neutral
Log-odds ratios for each symbol over the entire length are normalized for the length of the alignments
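
The log-odds scoring described on this slide can be sketched as a pair of Markov models, one trained on the positive set and one on the negative set. This is a minimal sketch under stated assumptions, not the published RP implementation: the function names are hypothetical, the add-one pseudocount is an assumption introduced here to avoid zero probabilities, and the order is a parameter (the slide uses order 5; the usage note uses order 1 to stay small).

```python
import math
from collections import defaultdict

SYMBOLS = ("MAT", "MGC", "V", "T", "GAP")


def transition_counts(sequences, order):
    """Count how often each symbol follows each context of `order` symbols."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            context = tuple(seq[i - order:i])
            counts[context][seq[i]] += 1
    return counts


def probability(counts, context, symbol, pseudocount=1.0):
    """Smoothed P(symbol | context); the pseudocount is an assumption."""
    row = counts.get(context, {})
    total = sum(row.values()) + pseudocount * len(SYMBOLS)
    return (row.get(symbol, 0) + pseudocount) / total


def rp_score(seq, pos_counts, neg_counts, order):
    """Sum of per-symbol log-odds ratios, normalized by the number of terms."""
    terms = []
    for i in range(order, len(seq)):
        context = tuple(seq[i - order:i])
        p_pos = probability(pos_counts, context, seq[i])
        p_neg = probability(neg_counts, context, seq[i])
        terms.append(math.log(p_pos / p_neg))
    return sum(terms) / len(terms) if terms else 0.0


# Toy training data (hypothetical), order 1 for brevity.
pos = transition_counts([["MAT", "MGC", "MAT", "MGC"]], order=1)
neg = transition_counts([["MAT", "T", "MAT", "T"]], order=1)
```

With these toy tables, a transition seen only in the positive set scores above 0 and one seen only in the negative set scores below 0. In the slide's own example, frequencies of 0.001 and 0.0001 for the same context give a ratio of 10, hence the entry ln(10).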

Finding and analyzing genome data
NCBI Entrez: http://www.ncbi.nlm.nih.gov
Ensembl/BioMart: http://www.ensembl.org
UCSC Table Browser: http://genome.ucsc.edu
Galaxy

Browsers vs Data Retrieval
Browsers are designed to show selected information on one locus or region at a time.
  UCSC Genome Browser
  Ensembl
Run on top of databases that record vast amounts of information.
Sometimes need to retrieve one type of information for many genomic intervals or genome-wide.
Access this by querying on the tables in the databases or data marts
  UCSC Table Browser
  EnsMart or BioMart
  Entrez at NCBI

Retrieve all the protein-coding exons in humans

Galaxy: Data retrieval and analysis
Data can be retrieved from multiple external sources, or uploaded from the user's computer
Hundreds of computational tools
  Data editing
  File conversion
  Operations: union, intersection, complement
  Compute functions on data
  Statistics
  EMBOSS tools for sequence analysis
  PHYLIP tools for molecular evolutionary analysis
  PAML to compute substitutions per site
Add your own tools

Galaxy via Table Browser: coding exons

Retrieve human mutations

Find exons with human mutations: Intersection

Compute length using expression

Statistics on exon lengths

Plot a histogram of exon lengths

Distribution of (human mutation) exon lengths
What is that really long exon?

Sort by length
SACS has an 11 kb exon
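
The Galaxy walkthrough above (intersect exons with mutations, compute lengths, summarize, sort) can be mimicked outside Galaxy in plain Python. This sketch is not Galaxy's implementation. The interval records are hypothetical toy data, coordinates are assumed to be half-open (start inclusive, end exclusive), and the nested loop is only suitable for small inputs.

```python
from statistics import mean, median

# Hypothetical toy data. Exons: (chrom, start, end, name). Mutations: (chrom, start, end).
EXONS = [
    ("chr1", 100, 200, "exonA"),
    ("chr1", 300, 1300, "exonB"),
    ("chr2", 100, 150, "exonC"),
]
MUTATIONS = [
    ("chr1", 150, 151),
    ("chr1", 1299, 1300),
    ("chr2", 150, 151),
]


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def exons_with_mutations(exons, mutations):
    """Intersection step: keep exons that overlap at least one mutation."""
    hits = []
    for chrom, start, end, name in exons:
        if any(m_chrom == chrom and overlaps(start, end, m_start, m_end)
               for m_chrom, m_start, m_end in mutations):
            hits.append((chrom, start, end, name))
    return hits


def exon_lengths(exons):
    """Compute step: length = end - start for each exon."""
    return {name: end - start for _, start, end, name in exons}


hits = exons_with_mutations(EXONS, MUTATIONS)
lengths = exon_lengths(hits)
values = list(lengths.values())
summary = {"count": len(values), "mean": mean(values),
           "median": median(values), "max": max(values)}
# Sort step: longest exon first, to spot outliers such as a very long exon.
longest_first = sorted(lengths.items(), key=lambda item: item[1], reverse=True)
```

Sorting by length puts the outlier at the top of the list, which is the same move the slides use to find the long SACS exon.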