Using Gene3D to Transfer Functional Properties between Homologues

Exploiting Structural and Comparative Genomics to Reveal Protein Functions

How many domain families can we find in the genomes, and can we predict the functions of relatives?
Exploiting protein structure to predict protein functions
Using correlated phylogenetic profiles based on CATH domains to reveal functional associations

CATH: Domain families of known structure
Gene3D: Protein families and domain annotations for completed genomes

CATHEDRAL
Oliver Redfern and Andrew Harrison
Combines a rapid graph theory secondary structure filter with dynamic programming for accurate residue alignment
An SVM is used to combine scores and assess the significance of a match

CATH version 3.0
1100 fold groups
2100 homologous superfamilies
86,000 domains

Fold Recognition Performance
[Figure: % correct fold (y-axis, 0.8 to 1) against rank (x-axis, 0 to 25) for CATHEDRAL, CE, DALI, LSQMAN, STRUCTAL, SSAP and SSAP DDP.]

Gene3D: Domain annotations in genome sequences
>2 million protein sequences from 300 completed genomes and UniProt
Scan against a library of HMM models (~2000 CATH, ~9000 Pfam)
Assign domains to CATH and Pfam superfamilies
Benchmarking by structural data shows that 76% of remote homologues can be identified using the HMMs

Gene3D: Domain annotations in genome sequences
DomainFinder: structural domains from CATH take precedence
[Figure: a protein chain from N to C terminus annotated with CATH-1, Pfam-1, Pfam-2 and NewFam domains.]

[Figure: percentage of all domain family sequences (y-axis) against domain families ranked by size, i.e. number of domain sequences (x-axis), for Pfam families of unknown structure, NewFam families of unknown structure and CATH superfamilies of known structure.]
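The precedence rule stated for DomainFinder (structural domains from CATH take precedence) can be illustrated with a minimal sketch. This is not the DomainFinder algorithm itself: the `Hit` class, the `resolve` function, the tie-breaking by hit length and all residue coordinates are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str   # "CATH", "Pfam" or "NewFam" (labels taken from the slide)
    family: str
    start: int    # first residue of the hit, inclusive
    end: int      # last residue of the hit, inclusive


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve(hits):
    """Greedy resolution: consider CATH hits first, then longer hits,
    and keep a hit only if it does not overlap one already kept."""
    ordered = sorted(
        hits,
        key=lambda h: (h.source != "CATH", -(h.end - h.start + 1), h.start),
    )
    kept = []
    for hit in ordered:
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Hypothetical coordinates: Pfam-2 overlaps CATH-1 and is therefore dropped.
hits = [
    Hit("Pfam", "Pfam-1", 1, 70),
    Hit("CATH", "CATH-1", 80, 200),
    Hit("Pfam", "Pfam-2", 150, 260),
    Hit("NewFam", "NewFam", 270, 350),
]
architecture = [h.family for h in resolve(hits)]
print(architecture)
```

A greedy pass is the simplest way to express "CATH first"; a real implementation would likely tolerate small overlaps and use match scores rather than length alone.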

~90% of domain sequences in the genomes and UniProt can be assigned to ~7000 domain families
<100 families account for 50% of domain sequences of known fold

[Figure: a structural superfamily (CATH) divided into subfamilies of relatives F1 to F5; relatives within a subfamily are likely to have similar functions.]
Only ~3% of diverse sequences in large CATH domain families have known structures

Gene3D: Domain mappings for 300 Completed Genomes
300 genomes, >2 million sequences including UniProt and RefSeq
Structural domain assignments from CATH
Functional domain assignments from Pfam
Iterative profile search methodology
Also: SWISS-PROT, EC, COGs, GO, KEGG, MIPS, BIND, IntAct
http://www.biochem.ucl.ac.uk:8080/Gene3D
Russell Marsden, Corin Yeats, Michael Maibaum, David Lee
Yeats et al., Nucleic Acids Res. 2006

Conservation of enzyme function in homologous domains with same multidomain architecture (MDA) in Gene3D
[Figure: Protein 1 and Protein 2 compared by domain architecture (Pfam-1, CATH-1, NewFam, Pfam-2); domains in the same architectures.]
[Figure: conservation of EC number to 3 levels (%, y-axis, 0 to 100) against sequence identity (x-axis, bins from 11-20 to 91-100), with one series per level of architecture overlap, from no overlap to 100% overlap in steps of 10%.]

Sequence identity thresholds for 95% conservation of enzyme function (to 3 EC levels)
332 highly conserved families
60 highly variable families
[Figure: number of families (0 to 200) and number of sequences (log scale, 1 to 1,000,000) against sequence identity threshold (bins from 11-20% to 91-100%).]

Conservation of Enzyme Function in CATH Domain Families
[Figure: structural similarity (SSAP) score (y-axis) against pairwise sequence identity (%, x-axis), with pairs marked as same function or different function.]
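The conservation measure used on these slides (agreement of EC numbers to 3 levels) can be sketched as a simple string comparison. The function names and the example pairs below are illustrative assumptions, not taken from Gene3D; a "-" is treated as an unspecified level that never counts as a match.

```python
def ec_levels_shared(ec_a: str, ec_b: str) -> int:
    """Count how many leading EC levels agree (0 to 4)."""
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        shared += 1
    return shared


def conserved_to_level(ec_a: str, ec_b: str, level: int = 3) -> bool:
    """True if the two EC numbers agree on at least `level` leading levels."""
    return ec_levels_shared(ec_a, ec_b) >= level


def percent_conserved(pairs, level: int = 3) -> float:
    """Percentage of homologous pairs whose EC numbers agree to `level`."""
    if not pairs:
        return 0.0
    hits = sum(conserved_to_level(a, b, level) for a, b in pairs)
    return 100.0 * hits / len(pairs)


# Illustrative pairs only; not data from the study.
pairs = [
    ("1.1.1.1", "1.1.1.2"),    # agree to 3 levels
    ("1.1.1.1", "1.2.1.1"),    # agree to 1 level
    ("2.7.1.1", "2.7.1.1"),    # agree to 4 levels
    ("3.4.21.-", "3.4.21.5"),  # agree to 3 levels, 4th unspecified
]
print(percent_conserved(pairs))
```

Binning such pairs by sequence identity or SSAP score and calling `percent_conserved` per bin would give curves of the kind described in the figure placeholders.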

Correlation of structural variability with number of different functional groups
[Figure: COGs vs SSGs. Number of COG functional groups (y-axis, 0 to 90) against number of diverse structural sub-groups within a family (x-axis, 0 to 60), with legend bins 0-25, 25-50, 50-75 and 75-100; P-loop hydrolases are labelled (COG-270, SSG-67).]

Some families show great structural diversity
Gabrielle Reeves
Multiple structural alignment by CORA allows identification of consensus secondary structure and embellishments (2DSEC algorithm)
In 117 superfamilies, relatives expanded by 2-fold or more
These families represent more than half the genome sequences of known fold

Structural embellishments can modify the active site
Galectin binding superfamily

Structural embellishments can modulate domain interactions
[Figure: side orientation and face orientation views of glucose 6-phosphate dehydrogenase and dihydrodipicolinate reductase. Additional secondary structures, shown at (a), are involved in subunit interactions.]

Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions
ATP Grasp superfamily: biotin carboxylase, D-alanine-D-alanine ligase
[Figure: dimer of biotin carboxylase.]

Secondary structure insertions are distributed along the chain but aggregate in 3D
[Figure: frequency (%, y-axis, 0 to 80) against size of indel in number of secondary structures (x-axis, 0 to 12). Bars with an indel frequency below 1% are labelled 0.85%, 0.38%, 0.23%, 0.11%, 0.06% and 0.02%.]
85% of residue insertions comprise only 1 or 2 secondary structures

60% of domains have secondary structure embellishments co-located in 3D with 3 or more other embellishments
In 80% of domains, 1 or more embellishments contact other domains or subunits

~80% of variable families adopt regular layered architectures
2 Layer Alpha Beta Sandwich
2 Layer Beta Sandwich
Alpha / Beta Barrel
3 Layer Alpha Beta Sandwich

Function prediction to Guide Target Selection for Structural Genomics
[Figure: a structural superfamily (CATH) divided into subfamilies F1 to F5; close relatives with the same MDA are likely to have similar functions.]
Only ~3% of diverse sequence families (S30 clusters) in large CATH families have known structures

Conservation of Enzyme Function in Homologous Domains
[Figure: conservation of EC levels (% frequency, y-axis, 0 to 100) against structure similarity (SSAP) score (x-axis, bins from 50-60 to 90-100), with categories Not Conserved, Less than 3 EC, EC3 and EC4.]

FLORA structural templates for assigning structures to functional subgroups in CATH
Perform CORA multiple structural alignment on functional subfamilies within a CATH superfamily
Use CORAXplode (HMMs) to find related sequences in UniProt and identify conserved residues (seeds)
Explore the local structural environment of seed residues to find conserved structural motifs
Dataset of 84 enzyme superfamilies in CATH, of which 21 are functionally very diverse

Finding conserved residue positions (seeds)
Multiple sequence alignment of relatives from the functional family, guided by the structure alignment
Identify the most highly conserved residue positions (seed positions) using Scorecons (Valdar and Thornton, 2001)

FLORA Algorithm for Identifying Structural Homologues with Similar Functions
Assign conserved sequence seeds
Expand to local environment of 12
Identify structurally conserved residue cliques and generate template
New structures are scanned against a library of FLORA templates, and SVMs are used to assess the significance of matches

Performance of FLORA vs Global Structure Comparison (SSAP)
[Figure: coverage (y-axis, 0 to 1) against error rate (x-axis, 0 to 0.2) for SSAP and FLORA.]
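The seed-finding step (picking the most highly conserved alignment columns) can be sketched as below. This is not Scorecons: it substitutes a plain Shannon-entropy score scaled by column occupancy, and the function names, the 0.9 threshold and the toy alignment are all assumptions for illustration.

```python
import math
from collections import Counter


def column_conservation(column) -> float:
    """Score in [0, 1]: 1 for an invariant, fully occupied column.
    Entropy-based stand-in for a Scorecons-style conservation score."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    score = 1.0 - entropy / math.log2(20)   # 20 amino acid types
    return score * (n / len(column))        # penalise gapped columns


def seed_positions(alignment, threshold: float = 0.9):
    """Return 0-based indices of alignment columns scoring >= threshold."""
    columns = zip(*alignment)
    return [
        i for i, col in enumerate(columns)
        if column_conservation(col) >= threshold
    ]


# Toy alignment (hypothetical sequences): only the first two columns are invariant.
alignment = ["GKSTL", "GKTAL", "GKSCV", "GKTDI"]
seeds = seed_positions(alignment)
print(seeds)
```

Scorecons additionally weights substitutions by residue similarity, so a production pipeline would call that tool rather than this simplified score.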

Eisenberg Phylogenetic Profiles for Detecting Functional Associations
Presence or absence of superfamily in organism:

                 sp1  sp2  sp3  sp4
Superfamily 1     1    0    1    0
Superfamily 2     1    0    1    0
Superfamily 3     0    1    1    0

Superfamilies with matching profiles (here Superfamily 1 and Superfamily 2) are marked as functionally linked.

Gene3D Phylogenetic Occurrence Profiles
[Table: number of relatives from each CATH domain superfamily (Superfamily 1 to 3) in each organism (sp1 to sp4).]
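The presence/absence comparison behind Eisenberg-style profiles can be sketched by grouping superfamilies whose binary profiles are identical. The profile values come from the slide; the function names and the exact-match criterion are assumptions (published methods usually allow near matches).

```python
from collections import defaultdict


def to_presence(counts):
    """Reduce an occurrence profile (counts per organism) to presence/absence."""
    return tuple(1 if c > 0 else 0 for c in counts)


def linked_groups(profiles):
    """Group names that share an identical presence/absence profile."""
    by_profile = defaultdict(list)
    for name, profile in profiles.items():
        by_profile[tuple(profile)].append(name)
    return [sorted(names) for names in by_profile.values() if len(names) > 1]


# Binary profiles over organisms sp1..sp4, as shown on the slide.
profiles = {
    "Superfamily 1": (1, 0, 1, 0),
    "Superfamily 2": (1, 0, 1, 0),
    "Superfamily 3": (0, 1, 1, 0),
}
groups = linked_groups(profiles)
print(groups)
```

`to_presence` shows how a count-based occurrence profile collapses to the binary form, which is the information the occurrence profiles retain and the binary profiles discard.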

Phylogenetic Occurrence Profiles Based on Domain Superfamily and Subfamilies in Gene3D
[Figure: a superfamily subdivided into 30%, 40% and 50% sequence identity clusters.]

Phylogenetic Profiles for Families and Subfamilies
Juan Ranea and Corin Yeats
Domains clustered at different levels of sequence similarity: superfamily, 30%, 40%, 50%, 60%, 100%
[Table: phylogenetic occurrence profile matrix, with one row per cluster (Cluster 1 to Cluster n), one column per species (Sp1 to Spn), and each cell giving a number of relatives.]

Comparison of Pairs of Phylogenetic Profiles
[Figure: occurrence profile matrix (Cluster 1 to Cluster n against Sp1 to Spn), with the profile of Cluster 1 plotted across species alongside those of Cluster 2, Cluster 5 and Cluster 7; Euclidean distances E1 and E2 are marked.]
Euclidean distance: E1 >> E2

Statistical Significance of Correlated Pairs (Comparison against 3 randomised models)
[Figure: frequency (y-axis, 0 to 80) against Pearson correlation coefficient (x-axis, bins from (-0.3)-(-0.2) to (0.9)-(1.0)) for the real matrix and random matrices I, II and III.]
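Comparing two occurrence profiles by Euclidean distance and Pearson correlation, the two measures named on these slides, can be sketched as follows. The cluster counts are invented for illustration and are not values from the Gene3D matrix; the function names are likewise assumptions.

```python
import math


def euclidean(x, y) -> float:
    """Euclidean distance between two equal-length occurrence profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y) -> float:
    """Pearson correlation coefficient; 0.0 if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numbers of relatives per species (Sp1..Sp6).
cluster_a = [6, 9, 6, 9, 5, 9]
cluster_b = [4, 8, 4, 8, 4, 8]   # rises and falls with cluster_a
cluster_c = [1, 0, 7, 0, 2, 1]   # unrelated pattern

e_far = euclidean(cluster_a, cluster_c)
e_near = euclidean(cluster_a, cluster_b)
print(e_far > e_near, round(pearson(cluster_a, cluster_b), 2))
```

Euclidean distance is sensitive to absolute family size, whereas Pearson correlation only tracks whether counts rise and fall together, which may be why both appear in the analysis.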

Domain Associations Network from 13 Eukaryotes
Actin & VCP-like ATPases
DNA replication and repair
Chaperones and Cytoskeleton
DNA Topoisomerase & Elongation Factor G

DNA Topoisomerase & Elongation Factor G
[Figure: number of domain relatives (y-axis, 0 to 10) against species (x-axis, 1 to 13).]

Highly correlated profiles correspond to pairs of families with significant similarity in GO functions
[Figure: biological processes. Frequency of significant GO semantic similarity scores (%, y-axis, 0 to 60) against distances of correlated profile scores (x-axis, bins from (0)-(1) to (>=19)), with series %Frq and %Sum_SS/Frq.]

Summary
On average, 85% of domain sequences in genomes can be assigned to ~6000 domain families in CATH and Pfam
Information on multidomain architectures (MDAs) can extend functional annotations obtained through domain-based homologies
Specific structural templates for functional subgroups within domain families can also help in assigning functions as more structures are solved

Analysis of Gene3D phylogenetic occurrence profiles allows detection of functional associations between families

Acknowledgements
CATH: Lesley Greene, Alison Cuff, Ian Sillitoe, Tony Lewis, Mark Dibley, Oliver Redfern, Tim Dallman
Gene3D: Corin Yeats, Sarah Addou, Russell Marsden, David Lee, Alastair Grant, Ilhem Diboun, Juan Garcia Ranea
http://www.biochem.ucl.ac.uk/bsm/cath_new
Medical Research Council, Wellcome Trust, NIH, EU funded Biosapiens, EU funded Embrace, BBSRC
