CS 6293 Advanced Topics: Translational Bioinformatics Lectures 3-4: Pair-wise Sequence Alignment Outline Biological background Global sequence alignment Local sequence alignment Optional: linear-space alignment algorithm Heuristic alignment: BLAST Evolution at the DNA level C ACGGTGCAGTCACCA ACGTTGC-GTCCACCA DNA evolutionary events (sequence edits): Mutation, deletion, insertion Sequence conservation implies function next generation OK OK OK

X X Still OK? Why sequence alignment? Conserved regions are more likely to be functional Can be used for finding genes, regulatory elements, etc. Similar sequences often have similar origin and function Can be used to predict functions for new genes / proteins Sequence alignment is one of the most widely used computational tools in biology Global Sequence Alignment S T S T AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition An alignment of two strings S, T is a pair of strings S, T (with spaces) s.t. (1) |S| = |T|, and (|S| = length of S)

(2) removing all spaces in S, T leaves S, T What is a good alignment? Alignment: The best way to match the letters of one sequence with those of the other How do we define best? S: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--T: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC The score of aligning (characters or spaces) x & y is (x,y). Score of an alignment: An optimal alignment: one with max score Scoring Function Sequence edits: Mutations Insertions Deletions Scoring Function: Match: +m Mismatch: -s Gap (indel): -d AGGCCTC AGGACTC AGGGCCTC AGG-CTC

~~~AAC~~~ ~~~A-A~~~ Match = 2, mismatch = -1, gap = -1 Score = 3 x 2 2 x 1 1 x 1 = 3 More complex scoring function Substitution matrix Similarity score of matching two letters a, b should reflect the probability of a, b derived from the same ancestor It is usually defined by log likelihood ratio Active research area. Especially for proteins. Commonly used: PAM, BLOSUM An example substitution matrix A C G T A C G T 3

-2 -1 -2 3 -2 -1 3 -2 3 How to find an optimal alignment? A nave algorithm: for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 i |A| align all other chars to spaces compute its value S = abcd A = cd T = wxyz B = xz retain the max -abc-d a-bc-d end w--xyz -w-xyz output the retained alignment Analysis Assume |S| = |T| = n

Cost of evaluating one alignment: n How many alignments are there: pick n chars of S,T together say k of them are in S match these k to the k unpicked chars of T Total time: E.g., for n = 20, time is > 240 >1012 operations Dynamic Programming for sequence alignment Suppose we wish to align x1xM y1yN Let F(i,j) = optimal score of aligning x1xi y1yj Scoring Function: Match: +m Mismatch: -s Gap (indel): -d Optimal substructure 1 2 i M

x: 1 y: 2 j ... N ... If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then The alignment between x[1..i] and y[1..j] is also optimal Easy to prove by contradiction Recursive solution Notice three possible cases: 1. ~~~~~~~ xM ~~~~~~~ yN 2. F(M,N) = F(M-1, N-1) + -s, if not xM aligns to a gap

~~~~~~~ xM ~~~~~~~ 3. m, if xM = yN xM aligns to yN max F(M,N) = F(M-1, N) - d yN aligns to a gap ~~~~~~~ ~~~~~~~ yN F(M,N) = F(M, N-1) - d Recursive solution Generalize: F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) d F(i, j-1) d (Xi,Yj) = m if Xi = Yj, and s otherwise Boundary conditions: F(0, 0) = 0. -jd: y[1..j] aligned to gaps. F(0, j) = ? -id: x[1..i] aligned to gaps. F(i, 0) = ?

What order to fill? F(0,0) F(i-1, j-1)1 F(i-1, j) 1 i F(i, j-1) 3 2 F(i, j) F(M,N ) j What order to fill? F(0,0) F(M,N ) Example x = AGTA y = ATA F(i,j)

m= 1 s =1 d =1 i = j=0 1 A 2 T 3 A 0 1 2 A G 3 T

4 A Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = 0 0 j=0 1 A -1 2 T -2

3 A -3 1 2 3 4 A G T A -1 -2 -3 -4 Example x = AGTA

y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 A G 3 T 4 A 0 -1 -2 -3 -4

1 1 A -1 2 T -2 3 A -3 0 -1 -2 Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1

i = j=0 0 1 2 A G 3 T 4 A 0 -1 -2 -3 -4 1 A -1

1 0 -1 -2 2 T -2 0 0 1 3 A -3 0 Example x = AGTA y = ATA F(i,j) m= 1

s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3

-4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A

-3 -1 -1 0 2 Optimal Alignment: F(4,3) = 2 Example x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2

3 4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0

-1 -2 2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Optimal

Alignment: F(4,3) = 2 This only tells us the best score Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max i = j=0 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) d F(i, j-1) d 1 2 3 A G

T A 0 -1 -2 -3 -4 m= 1 s =1 d =1 4 1 A -1 1 0 -1 -2

A 2 T -2 0 0 1 0 A 3 A -3 -1 -1 0 2

Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max i = j=0 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) d F(i, j-1) d 1 2 3 A G T A 0

-1 -2 -3 -4 m= 1 s =1 d =1 4 1 A -1 1 0 -1 -2 T A 2

T -2 0 0 1 0 T A 3 A -3 -1 -1 0 2 Trace-back x = AGTA y = ATA F(i,j)

F(i,j) = max i = j=0 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) d F(i, j-1) d 1 2 3 A G T A 0 -1 -2

-3 -4 m= 1 s =1 d =1 4 1 A -1 1 0 -1 -2 G T A 2 T -2 0

0 1 0 - 3 A -3 -1 -1 0 2 T A Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max

i = j=0 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) d F(i, j-1) d 1 2 3 A G T A 0 -1 -2 -3 -4

m= 1 s =1 d =1 4 1 A -1 1 0 -1 -2 A G T A 2 T -2 0 0

1 0 A - 3 A -3 -1 -1 0 2 T A Trace-back x = AGTA y = ATA F(i,j) F(i,j) = max i = j=0

1 A 0 F(i-1, j-1) + (Xi,Yj) F(i-1, j) d F(i, j-1) d 1 2 3 A G T A 0 -1 -2 -3 -4

-1 1 0 -1 -2 m= 1 s =1 d =1 4 2 T -2 0 0 1 0 3

A -3 -1 -1 0 2 Optimal Alignment: F(4,3) = 2 AGTA ATA Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = 0 0

j=0 1 A -1 2 T -2 3 A -3 1 2 3 4 A G T

A -1 -2 -3 -4 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 A G

T A 0 -1 -2 -3 -4 1 0 -1 -2 1 A -1 2 T

-2 3 A -3 3 4 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3

4 A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1

-2 2 T -2 0 0 1 0 3 A -3 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i =

j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4 1

A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A -3 -1

-1 0 2 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A

G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2 2 T

-2 0 0 1 0 3 A -3 -1 -1 0 2 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1

d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2 -3 -4

1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3 A

-3 -1 -1 0 2 Using trace-back pointers x = AGTA y = ATA F(i,j) m= 1 s =1 d =1 i = j=0 0 1 2 3 4

A G T A 0 -1 -2 -3 -4 1 A -1 1 0 -1 -2

2 T -2 0 0 1 0 3 A -3 -1 -1 0 2 Using trace-back pointers x = AGTA y = ATA F(i,j)

m= 1 s =1 d =1 i = j=0 0 1 2 3 4 A G T A 0 -1 -2

-3 -4 1 A -1 1 0 -1 -2 2 T -2 0 0 1 0 3

A -3 -1 -1 0 2 Optimal Alignment: F(4,3) = 2 AGTA ATA The Needleman-Wunsch Algorithm 1. Initialization. a. b. c. 2. F(0, 0) F(0, j) F(i, 0)

= 0 =-jd =-id Main Iteration. Filling in scores a. For each i = 1M For each j = 1N F(i, j) Ptr(i,j) 3. = max = F(i-1,j) d F(i, j-1) d F(i-1, j-1) + (xi, yj) UP, LEFT DIAG [case 1] [case 2] [case 3] if [case 1] if [case 2]

if [case 3] Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment Complexity Time: O(NM) Space: O(NM) Linear-space algorithms do exist (with the same time complexity) Equivalent graph problem S1 = A G T A (0,0) S2 = A : a gap in the 1st sequence 1

T A 1 1 : a gap in the 2nd sequence 1 : match / mismatch 1 Value on vertical/horizontal line: -d Value on diagonal: m or -s (3,4) Number of steps: length of the alignment Path length: alignment score Optimal alignment: find the longest path from (0, 0) to (3, 4) General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible. A variant of the basic algorithm Scoring scheme: m = s = d: 1 Seq1: CAGCA-CTTGGATTCTCGG || |:||| Seq2: ---CAGCGTGG-------Seq1: CAGCACTTGGATTCTCGG |||| | | ||

Seq2: CAGC-----G-T----GG Score = -7 Score = -2 The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1) A variant of the basic algorithm Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- Then, we dont want to penalize gaps in the ends The Overlap Detection variant yN y1 x1 xM Changes: 1. Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.

Termination maxi F(i, N) FOPT = max maxj F(M, j) Different types of overlaps x x y y The local alignment problem Given two strings X = x1xM, Y = y1yN Find substrings x, y whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex Y = xxxcde x y X = cxde Y = c-de Why local alignment Conserved regions may be a small part of the whole Global alignment might miss them if flanking junk outweighs similar regions

Genes are shuffled between genomes A B C B D D A C Nave algorithm for all substrings X of X and Y of Y Align X & Y via dynamic programming Retain pair with max value end ; Output the retained pair Time: O(n2) choices for A, O(m2) for B, O(nm) for DP, so O(n3m3 ) total. Reminder The overlap detection algorithm We do not give penalty to gaps at either end Free gap Free gap

The local alignment idea Do not penalize the unaligned regions (gaps or mismatches) The alignment can start anywhere and ends anywhere Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment By resetting alignment score to zero The Smith-Waterman algorithm Initialization: F(0, j) = F(i, 0) = 0 0 F(i 1, j) d Iteration: F(i, j) = max F(i, j 1) d F(i 1, j 1) + (xi, yj) The Smith-Waterman algorithm Termination: 1. If we want the best local alignment FOPT = maxi,j F(i, j) 2. If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back Match: 2 Mismatch: -1 Gap: -1 a b c x d

e x 0 0 0 0 0 0 0 0 x 0 x 0 x 0 c 0 d 0 e 0 Match: 2 Mismatch: -1 Gap: -1

a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 x 0 0 0 x 0 0 0

c 0 0 0 d 0 0 0 e 0 0 0 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0

0 0 x 0 0 0 0 x 0 0 0 0 x 0 0 0 0 c 0 0 0 2 d 0 0 0 1

e 0 0 0 0 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2

x 0 0 0 0 2 x 0 0 0 0 2 c 0 0 0 2 1 d 0 0 0 1 0 e 0 0 0

0 0 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2 1 x

0 0 0 0 2 1 x 0 0 0 0 2 1 c 0 0 0 2 1 1 d 0 0 0 1 0 3 e 0

0 0 0 0 2 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2

1 0 x 0 0 0 0 2 1 0 x 0 0 0 0 2 1 0 c 0 0 0 2 1 1 0 d 0 0

0 1 0 3 2 e 0 0 0 0 0 2 5 Match: 2 Mismatch: -1 Gap: -1 a b c x d e x 0 0 0 0 0 0

0 0 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2

c 0 0 0 2 1 1 0 1 d 0 0 0 1 1 3 2 1 e 0 0 0 0 0 2 5 4 Trace back Match: 2 Mismatch: -1

Gap: -1 a b c x d e x 0 0 0 0 0 0 0 0 x 0 0 0 0 2 1 0 2 x 0 0 0

0 2 1 0 2 x 0 0 0 0 2 1 0 2 c 0 0 0 2 1 1 0 1 d 0 0 0 1 1 3

2 1 e 0 0 0 0 0 2 5 4 Trace back Match: 2 Mismatch: -1 Gap: -1 cxde | || c-de x-de | || xcde a b c x d e x

0 0 0 0 0 0 0 0 x 0 0 0 0 2 1 0 2 x 0 0 0 0 2 1 0 2 x 0 0 0

0 2 1 0 2 c 0 0 0 2 1 1 0 1 d 0 0 0 1 1 3 2 1 e 0 0 0 0 0 2

5 4 No negative values in local alignment DP array Optimal local alignment will never have a gap on either end Local alignment: Smith-Waterman Global alignment: Needleman-Wunsch Analysis Time: O(MN) for finding the best alignment Time to report all alignments depends on the number of sub-opt alignments Memory: O(MN) O(M+N) possible Optional: more efficient alignment algorithms Given two sequences of length M, N Time: O(MN) Ok (?) Space: O(MN) bad 1Mb seq x 1Mb seq = 1000G memory Can we do better?

Bounded alignment Good alignment should appear near the diagonal Bounded Dynamic Programming If we know that x and y are very similar Assumption: Then, xi | yj # gaps(x, y) < k implies |ij| k Iteration: For i = 1M For j = max(1, i k)min(N, i+k)

F(i 1, j 1)+ (xi, yj) F(i, j) = max F(i, j 1) d, if j > i k F(i 1, j) d, if j < i + k k Termination: same Analysis Time: O(kM) << O(MN) Space: O(kM) with some tricks => M 2k M 2k Given two sequences of length M, N Time: O(MN) ok Space: O(MN) bad 1mb seq x 1mb seq = 1000G memory Can we do better?

Linear space algorithm If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment? Linear space algorithm When we finish, we know how we have aligned the ends of the sequences XM YN Nave idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN)) (0, 0) M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

(0,0) (3,0) (3,2) (3,4) (3,6) (6,6) Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k)) Hirschbergs idea Divide and conquer! Y X Forward algorithm Align x1x2xM/2 with Y M/2 F(M/2, k) represents the best alignment between x1x2xM/2 and y1y2yk Backward Algorithm Y X

M/2 Backward algorithm Align reverse(xM/2+1xM) with reverse(Y) B(M/2, k) represents the best alignment between reverse(xM/2+1xM) and reverse(ykyk+1yN ) Linear-space alignment Using 2 (4) rows of space, we can compute for k = 1N, F(M/2, k), B(M/2, k) M/2 Linear-space alignment Now, we can find k* maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k* Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2 Linear-space alignment k* M/2 M/2 Iterate this procedure to the two sub-problems!

N-k* Analysis Memory: O(N) for computation, O(N+M) to store the optimal alignment Time: MN for first iteration k M/2 + (N-k) M/2 = MN/2 for second k M/2 M/2 N-k MN MN/2 MN/4 MN + MN/2 + MN/4 + MN/8 + = MN (1 + + + 1/8 + 1/16 + ) = 2MN = O(MN) MN/8 Heuristic Local Sequence Alignment: BLAST State of biological databases Sequenced Genomes: Human Mouse

Neurospora Tetraodon Drosophila Rice sea squirts 3109 2.7109 4107 3108 1.2108 1.0109 1.6108 Yeast Rat Fugu fish 3.3108 Mosquito 2.8108 Worm Arabidopsis Current rate of sequencing (before new-generation sequencing): 4 big labs 3 109 bp /year/lab 10s small labs Private sectors With new-generation sequencing: Easily generating billions of reads daily 1.2107 2.6109 1.0108 1.2108

Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? Assume we try Smith-Waterman: Our new gene 104 The entire genomic database 1010 - 1011 May take several weeks! Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? - Potential genes - Other functional units Assume we try Smith-Waterman: Our newly sequenced mammal 3109 The entire genomic database 1010 - 1011

> 1000 years ??? BLAST Basic Local Alignment Search Tool Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 The most widely used bioinformatics tool Which is better: long mediocre match or a few nearby, short, strong matches with the same total score? Score-wise, exactly equivalent Biologically, later may be more interesting, & is common At least, if must miss some, rather miss the former BLAST is a heuristic algorithm emphasizing the later speed/sensitivity tradeoff: BLAST may miss former, but gains greatly in speed BLAST Available at NCBI (National Center for Biotechnology Information) for download and online use. http://blast.ncbi.nlm.nih.gov/ Along with many sequence databases query Main idea: 1.Construct a dictionary of all the words in the query 2.Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster

than Smith-Waterman DB BLAST Original Version Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score T (typically T = k) query Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold scan DB query BLAST Original Version k = 4, T = 4 The matching word GGTC initiates an alignment

Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC C C C T T C C T G G A T T G C G A Example: A C G A A G T A A G G T C C A G T Gapped BLAST Pairs of words can initiate alignment Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A Added features: A C G A A G T A A G G T C C A G T

Example Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins] Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 125138 tacacccagattacaccccga 125158 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 125104 tacacccagattacaccccga 125124 >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query:

Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 3891 tacacccagattacaccccga 3911 Example Query: Human atoh enhancer, 179 letters [1.5 min] Result: 57 blast hits 1. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres... 2. 3. 4. 5. 6. 7. 8. 355 1e-95 264 4e-68

256 9e-66 78 5e-12 54 7e-05 44 0.068 42 0.27 42 0.27 gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318 Different types of BLAST blastn: search nucleic acid databases blastp: search protein databases blastx: you give a nucleic acid sequence, search protein databases tblastn: you give a protein sequence,

search nucleic acid databases tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences BLAST cons and pros Advantages Fast!!!! A few minutes to search a database of 1011 bases Disadvantages Sensitivity may be low Often misses weak homologies New improvement Make it even faster Mainly for aligning very similar sequences or really long sequences E.g. whole genome vs whole genome Make it more sensitive PSI-BLAST: iteratively add more homologous sequences PatternHunter: discontinuous seeds Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ:

Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower Things weve covered so far Global alignment Needleman-Wunsch and variants Local Alignment Smith-Waterman Improvement on space and time Heuristic algorithms BLAST families Things we did not cover: Statistics for sequence alignment To handle gaps more accurately: affine gap penalty Multiple sequence alignment