Title goes here - teaching.bioinformatics.dtu.dk

Sequence alignment & Substitution matrices
By Thomas Nordahl

Sequence alignment
1. Sequence alignment is the most important technique used in bioinformatics
2. Infer properties from one protein to another
   1. Homologous sequences often have similar biological functions
3. Most information can be deduced from a sequence if the 3D structure is known
4. 3D-structure determination is very time-consuming (X-ray, NMR)
   1. Several mg of pure protein is required (> 100 mg)
   2. Make crystal, solve structure: 1-3 years
   3. Large facilities are needed to produce X-rays
      1. Rotating anode or synchrotron
5. Determining the primary sequence is fast and cheap
6. Structure is more conserved than sequence

Figures: Growth of GenBank and WGS; Structures in PDB; Car parts analogy to protein folds; Protein class & folds; Structures in SCOP database

The world seems to consist of approx. 1400 protein folds. Until 2014 no new folds have been observed.

What can we learn from sequence alignment
- Find a similar sequence from another organism
- Information from the known sequence can be inherited
- Layers of conserved information: Structure > function > sequence, where > means "more conserved than"
- Structure (3D) is the most conserved feature
- Proteins with different functions may still share the same structure
- Proteins with different sequences may still share the same function
- Often same function if 40-50% sequence identity
- Often same protein fold if above 30% sequence identity

Sequence alignment

  M V S T A
  M A T S A

Identity matrix:

     M  V  S  T  A
  M  1  0  0  0  0
  V  0  1  0  0  0
  S  0  0  1  0  0
  T  0  0  0  1  0
  A  0  0  0  0  1

Number of identical amino acids, % identity?
Alignment score using the identity matrix?

Similar amino acids can be substituted, therefore other types of substitution matrices are used.

Blosum matrices
- Blosum matrices are the most commonly used substitution matrices: Blosum50, Blosum62, Blosum80
- Symmetrical 20 x 20 matrix, where each element is the substitution score
- Positive scores: the amino acids are likely to be aligned in a sequence alignment; they share similar chemical characteristics
- Negative scores: less likely substitutions, but they still occur
- Zero scores: the substitution is observed about as often as expected by chance

Q) In an alignment, which amino acid is Arg most likely to align to besides itself?

Log-odds scores
Log-odds scores are given by log(Observed/Expected).
The log-odds score of matching amino acid j with amino acid i in an alignment is

  log(Pij/(QiQj))

where Pij is the observed frequency of i aligned with j, and Qi, Qj are the frequencies of amino acids i and j in the data set.
The log-odds score (in half-bit units) is

  Sij = 2log2(Pij/(QiQj))

where log2(x) = ln(x)/ln(2). S has been normalized to half bits, hence the factor 2.

Example of a scoring matrix

BLOSUM80

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   7  -3  -3  -3  -1  -2  -2   0  -3  -3  -3  -1  -2  -4  -1   2   0  -5  -4  -1
R  -3   9  -1  -3  -6   1  -1  -4   0  -5  -4   3  -3  -5  -3  -2  -2  -5  -4  -4
N  -3  -1   9   2  -5   0  -1  -1   1  -6  -6   0  -4  -6  -4   1   0  -7  -4  -5
D  -3  -3   2  10  -7  -1   2  -3  -2  -7  -7  -2  -6  -6  -3  -1  -2  -8  -6  -6
C  -1  -6  -5  -7  13  -5  -7  -6  -7  -2  -3  -6  -3  -4  -6  -2  -2  -5  -5  -2
Q  -2   1   0  -1  -5   9   3  -4   1  -5  -4   2  -1  -5  -3  -1  -1  -4  -3  -4
E  -2  -1  -1   2  -7   3   8  -4   0  -6  -6   1  -4  -6  -2  -1  -2  -6  -5  -4
G   0  -4  -1  -3  -6  -4  -4   9  -4  -7  -7  -3  -5  -6  -5  -1  -3  -6  -6  -6
H  -3   0   1  -2  -7   1   0  -4  12  -6  -5  -1  -4  -2  -4  -2  -3  -4   3  -5
I  -3  -5  -6  -7  -2  -5  -6  -7  -6   7   2  -5   2  -1  -5  -4  -2  -5  -3   4
L  -3  -4  -6  -7  -3  -4  -6  -7  -5   2   6  -4   3   0  -5  -4  -3  -4  -2   1
K  -1   3   0  -2  -6   2   1  -3  -1  -5  -4   8  -3  -5  -2  -1  -1  -6  -4  -4
M  -2  -3  -4  -6  -3  -1  -4  -5  -4   2   3  -3   9   0  -4  -3  -1  -3  -3   1
F  -4  -5  -6  -6  -4  -5  -6  -6  -2  -1   0  -5   0  10  -6  -4  -4   0   4  -2
P  -1  -3  -4  -3  -6  -3  -2  -5  -4  -5  -5  -2  -4  -6  12  -2  -3  -7  -6  -4
S   2  -2   1  -1  -2  -1  -1  -1  -2  -4  -4  -1  -3  -4  -2   7   2  -6  -3  -3
T   0  -2   0  -2  -2  -1  -2  -3  -3  -2  -3  -1  -1  -4  -3   2   8  -5  -3   0
W  -5  -5  -7  -8  -5  -4  -6  -6  -4  -5  -4  -6  -3   0  -7  -6  -5  16   3  -5
Y  -4  -4  -4  -6  -5  -3  -5  -6   3  -3  -2  -4  -3   4  -6  -3  -3   3  11  -3
V  -1  -4  -5  -6  -2  -4  -4  -6  -5   4   1  -4   1  -2  -4  -3   0  -5  -3   7

Sij = 2log2(Pij/(QiQj))
Log-odds scores have been rounded off to integers.

An example
Sij = 2log2(Pij/(QiQj))
- Pij can be calculated as Nij/(Sumij Nij), where Nij is the number of times amino acid i is aligned to amino acid j
- Sumij Nij is the total number of aligned pairs
- Qi is the observed frequency of amino acid i in the alignment

How to calculate NAA

MSA (Multiple Sequence Alignment)

        1 2 3 4
seq1:   V V A D
seq2:   A A A D
seq3:   D V A D
seq4:   D A A A

NAA = 14 (ordered pairs of A from two different sequences in the same column: 0 + 2 + 12 + 0 for columns 1 to 4)

The same rule gives the counts for the other pairs: NAD, NAV, NDA, NDD, NDV, NVA, NVD and NVV.
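The count NAA = 14 quoted for the toy MSA can be reproduced with a short Python sketch. It assumes that Nij counts ordered pairs of residues taken from two different sequences in the same alignment column, which is the reading that matches the slide's total of 48 pairs; the function and variable names are illustrative and not part of the original material.

```python
from collections import Counter
from itertools import permutations

# Toy MSA from the slide: rows are sequences, columns are alignment positions.
MSA = ["VVAD", "AAAD", "DVAD", "DAAA"]


def pair_counts(msa):
    """Count ordered residue pairs (i, j) found in the same column of two different sequences."""
    counts = Counter()
    for column in zip(*msa):
        # permutations works on positions, so two identical residues in
        # different sequences still form a pair (e.g. A with A).
        for a, b in permutations(column, 2):
            counts[a + b] += 1
    return counts


N = pair_counts(MSA)
total = sum(N.values())

# Pij = Nij / Sumij Nij
P = {pair: n / total for pair, n in N.items()}

print("NAA =", N["AA"], "of", total, "pairs")
```

With four sequences every column contributes 4 x 3 = 12 ordered pairs, so four columns give 48 pairs in total, and Pij is simply Nij/48.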

An example

Pair   Nij   Pij = Nij/48
AA     14    14/48
AD      5     5/48
AV      5     5/48
DA      5     5/48
DD      8     8/48
DV      2     2/48
VA      5     5/48
VD      2     2/48
VV      2     2/48

MSA (Multiple Sequence Alignment)

        1 2 3 4
seq1:   V V A D
seq2:   A A A D
seq3:   D V A D
seq4:   D A A A

QA = 8/16, QD = 5/16, QV = 3/16

Example continued: so what does this mean?

QA = 0.50, QD = 0.31, QV = 0.19

Pair   Pij    QiQj    Sij
AA     0.29   0.25    0.44
AD     0.10   0.16   -1.17
AV     0.10   0.09    0.30
DA     0.10   0.16   -1.17
DD     0.17   0.10    1.54
DV     0.04   0.06   -0.98
VA     0.10   0.09    0.30
VD     0.04   0.06   -0.98
VV     0.04   0.04    0.49

The Sij values are computed from the unrounded frequencies (for example PVV = 0.0417 and QVQV = 0.0352).
BLOSUM is a log-likelihood matrix: Sij = 2log2(Pij/(QiQj))

The Scoring matrix

        A       D       V
A    0.44   -1.17    0.30
D   -1.17    1.54   -0.98
V    0.30   -0.98    0.49

And what does the BLOSUMXX mean?
XX is the percent identity at which sequences are clustered before the substitutions are counted.
- High Blosum values mean high similarity between clusters: conserved substitutions allowed
- Low Blosum values mean low similarity between clusters: less conserved substitutions allowed
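The worked example (pair frequencies, background frequencies and the resulting 3 x 3 scoring matrix) can be reproduced end to end with the sketch below. It is a minimal illustration under the slide's conventions: ordered pairs are counted, so the expected frequency is QiQj for every pair, and every pair is assumed to occur at least once. The names `log_odds_scores`, `MSA` and `ALPHABET` are made up for this example.

```python
import math
from collections import Counter

MSA = ["VVAD", "AAAD", "DVAD", "DAAA"]  # toy alignment from the slides
ALPHABET = "ADV"


def log_odds_scores(msa, alphabet):
    """Return {(i, j): Sij} with Sij = 2*log2(Pij / (Qi*Qj)), in half-bits."""
    # Nij: ordered pairs (i, j) from two different sequences in the same column.
    n = Counter()
    for column in zip(*msa):
        col = Counter(column)
        for i in alphabet:
            for j in alphabet:
                # n_i * n_j pairs, or n_i * (n_i - 1) when i and j are the same residue.
                n[i, j] += col[i] * (col[j] - (1 if i == j else 0))
    total_pairs = sum(n.values())

    # Qi: frequency of residue i over the whole alignment.
    residues = Counter("".join(msa))
    total_residues = sum(residues.values())
    q = {i: residues[i] / total_residues for i in alphabet}

    # Assumes every pair was observed at least once (Pij > 0), as in this toy MSA.
    return {
        (i, j): 2 * math.log2((n[i, j] / total_pairs) / (q[i] * q[j]))
        for i in alphabet
        for j in alphabet
    }


S = log_odds_scores(MSA, ALPHABET)
for i in ALPHABET:
    print(i, " ".join(f"{S[i, j]:6.2f}" for j in ALPHABET))
```

Rounded to two decimals this gives SAA = 0.44, SAD = -1.17, SAV = 0.30, SDD = 1.54, SDV = -0.98 and SVV = 0.49, the values in the slide's scoring matrix. A real BLOSUM matrix is built from many alignment blocks with sequence clustering, which this sketch does not attempt.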

BLOSUM80 (the matrix listed under "Example of a scoring matrix"): mean diagonal score = 9.4, mean off-diagonal score = -2.9

BLOSUM30

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1   0   0  -3   1   0   0  -2   0  -1   0   1  -2  -1   1   1  -5  -4   1
R  -1   8  -2  -1  -2   3  -1  -2  -1  -3  -2   1   0  -1  -1  -1  -3   0   0  -1
N   0  -2   8   1  -1  -1  -1   0  -1   0  -2   0   0  -1  -3   0   1  -7  -4  -2
D   0  -1   1   9  -3  -1   1  -1  -2  -4  -1   0  -3  -5  -1   0  -1  -4  -1  -2
C  -3  -2  -1  -3  17  -2   1  -4  -5  -2   0  -3  -2  -3  -3  -2  -2  -2  -6  -2
Q   1   3  -1  -1  -2   8   2  -2   0  -2  -2   0  -1  -3   0  -1   0  -1  -1  -3
E   0  -1  -1   1   1   2   6  -2   0  -3  -1   2  -1  -4   1   0  -2  -1  -2  -3
G   0  -2   0  -1  -4  -2  -2   8  -3  -1  -2  -1  -2  -3  -1   0  -2   1  -3  -3
H  -2  -1  -1  -2  -5   0   0  -3  14  -2  -1  -2   2  -3   1  -1  -2  -5   0  -3
I   0  -3   0  -4  -2  -2  -3  -1  -2   6   2  -2   1   0  -3  -1   0  -3  -1   4
L  -1  -2  -2  -1   0  -2  -1  -2  -1   2   4  -2   2   2  -3  -2   0  -2   3   1
K   0   1   0   0  -3   0   2  -1  -2  -2  -2   4   2  -1   1   0  -1  -2  -1  -2
M   1   0   0  -3  -2  -1  -1  -2   2   1   2   2   6  -2  -4  -2   0  -3  -1   0
F  -2  -1  -1  -5  -3  -3  -4  -3  -3   0   2  -1  -2  10  -4  -1  -2   1   3   1
P  -1  -1  -3  -1  -3   0   1  -1   1  -3  -3   1  -4  -4  11  -1   0  -3  -2  -4
S   1  -1   0   0  -2  -1   0   0  -1  -1  -2   0  -2  -1  -1   4   2  -3  -2  -1
T   1  -3   1  -1  -2   0  -2  -2  -2   0   0  -1   0  -2   0   2   5  -5  -1   1
W  -5   0  -7  -4  -2  -1  -1   1  -5  -3  -2  -2  -3   1  -3  -3  -5  20   5  -3
Y  -4   0  -4  -1  -6  -1  -2  -3   0  -1   3  -1  -1   3  -2  -2  -1   5   9   1
V   1  -1  -2  -2  -2  -3  -3  -3  -3   4   1  -2   0   1  -4  -1   1  -3   1   5

Matrix      Mean diagonal score   Mean off-diagonal score
Blosum30    8.3                   -1.16
Blosum80    9.4                   -2.9
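The first value quoted for each matrix (8.3 for Blosum30, 9.4 for Blosum80) matches the mean of the matrix diagonal, that is, the average score for aligning an amino acid with itself. The sketch below checks this; the diagonal values are copied by hand from the two matrices listed in this section, and the variable names are illustrative.

```python
from statistics import mean

# Diagonal entries (self-substitution scores) in the column order
# A R N D C Q E G H I L K M F P S T W Y V, copied from the matrices above.
BLOSUM80_DIAGONAL = [7, 9, 9, 10, 13, 9, 8, 9, 12, 7, 6, 8, 9, 10, 12, 7, 8, 16, 11, 7]
BLOSUM30_DIAGONAL = [4, 8, 8, 9, 17, 8, 6, 8, 14, 6, 4, 4, 6, 10, 11, 4, 5, 20, 9, 5]

mean80 = mean(BLOSUM80_DIAGONAL)  # 187 / 20
mean30 = mean(BLOSUM30_DIAGONAL)  # 166 / 20

print("Blosum80 mean diagonal:", mean80)
print("Blosum30 mean diagonal:", mean30)
```

The means are 9.35 and 8.3, consistent with the quoted 9.4 and 8.3. The second value in each pair (-2.9 and -1.16) would need the full 20 x 20 matrices to check, so it is not recomputed here.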
