# Presented by Mario Flores, Xuepo Ma, and Nguyen

Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen Outline Short-read alignment Algorithm Results Comparisons between short-read and longread alignment Long-read alignment Algorithm Results Motivation Motivation: new DNA sequencing technologies call fast and accurate read alignment programs. MAQ: Pros: accurate, feature rich and fast enough to align short reads from

single individual. Cons: MAQ does NOT support gapped alignment for single-end reads => unsuitable for alignment longer reads where indels may occur frequently. Alignment with BWT : efficiently align short sequencing reads against a large reference sequence allowing mismatches and gaps Burrows Wheeler Transfrom X: actgct W: gcc Z=1 actgct\$ ctgct\$a tgct\$ac gct\$act ct\$actg t\$actgc

\$actgct 0 1 2 3 4 5 6 6 0 4 1 3 5 2

i S[i] \$actgct actgct\$ ct\$actg ctgct\$a gct\$act t\$actgc tgct\$ac B[i] Inexact Matching - number of deference in string W Take string W=gcc for example. 1. W(0,0)=g, g is a substring of X, D(0)=0; 2. W(0,1)=gc, gc is a

substring of X, D(1)=0; 3. W(0,2)=gcc, gcc is not a substring of X, D(2)=1. Inexact Matching - Searching X: actgct W: gcc 0,6 t c a 6,6

c a g ^ 4,4 1,1 t ^ g c

t 1,1 c 3,3 1,1 1,1 1 1,1 3 1,1

2 ^ 1,1 ^ ^ 1,1 3,3 3,3 1,1

6 a a a 6,6 6,6 1,1 ^ c

t 5 3,3 3,3 6,6 1,1 4 2,3 1,1 2,3

a g Exact Matching Let the D(i)=0, then the algorithm can search for the exact matching Simulated data Accuracy BWA is more accurate than Bowtie and SOAPv2 based on criterion 1. Speed BWA is the fastest second only to SOAPv2. Memory

MAQs memory footprint is 1GB, but it increases linearly with the number of reads to be aligned. BWA only uses 2.3 GB for singleend mapping and 3GB for pairedend ( as much as Bowtie). SOAPv2 uses 5.4 GB. Differences between short-read and longread alignment Short-read alignment Align full-length read Long-read alignment Efficient for ungapped Find local matches alignment or limited Permissive about gaps alignment gaps Fast and accurate long-read alignment with Burrows-Wheeler transform Motivations

Many programs for short sequencing Not many for reads>200 bp BLAT, SSAHA2 New platforms are producing longer sequences: Roche/454 >400bp, Illumina>100 bp, Pacific > 1000 bp New algorithm: Burrows Wheeler Aligners Smith-Waterman Alignment BWA-SW Before NGS After NGS FASTA 1988 SOAP 2008 BLAST 1997

MAQ 2008 MegaBLAST 2000 Bowtie 2009 SSAHA2 2001 BWA 2009 BLAT 2002 BWA-SW 2010 Burrows Wheeler Aligners Smith-Waterman Alignment BWA-SW Overview Algorithm (1) Build FM-indices for reference and query sequences

(2) Represent reference in a prefix trie (3) Represents query in prefix in DAWG (directed acyclic word graph) transformed from the prefix trie of the query sequence Example: String GOOGOL a. 3 nodes has SA interval [4,4] b. Their parents have interval [1,2],[1,2] and [1,1] start of a string prefix trie The two numbers in A node gives the SA interval of the

node In prefix DAWG The [4,4] node has parents [1,2] and [1,1] Node [4,4] represents the strings OG, OGO, OGOL Prefix tree Prefix DAWG Burrows Wheeler Aligners Smith-Waterman Alignment BWA-SW Overview Algorithm (4) Dynamic programming with heuristics to accelerate algorithm Heuristics rules: A) Restrict the dynamic programming algorithm around good

matches only B) Report only alignments largely non-overlapping Result of these heuristics is: Savings in computing time Burrows Wheeler Aligners Smith-Waterman Alignment BWA-SW Heuristic strategies for acceleration (1) Z best : Traverse G(W) in outer loop and T(X) in inner loop, and at each node u in G(W) only keep the top Z best scoring nodes in T(X) that match u rather than keeping all the matching nodes Where G(W) prefix DAWG of query sequence W T(X) prefix trie for reference sequence X u root of G(W) (2) Take only best few alignments covering each region of the query sequence Result Implementation of BWA-SW takes a BWA index and a query FASTA and FASTQ file as

inputs. Typical sequencing reads requires less than 4GB. The peak memory is 6.4 GB in total on one query sequence with 1 million base pairs. Simulated data Speed BWA-SW is fastest, and its speed is not sensitive to the read length or error rates. Memory BWA-SW uses about 4GB (as much as BLAT). SSAHA2 uses 2.4GB for >=500 bp reads, and 5.3 GB for shorter reads. BWA-SW supports multi-threading while SSAHA2 and BLAT do not. Accuracy BWA-SW can detect chimera reads, and produces fewer false

chimeric reads given lower base errors. Conclusion Short-read alignment cannot be used for longread alignment due to: Full-length read vs local matches. Ungapped or limited gap vs larger number of gaps. BWA-short is more accurate, use less memory and competitively fast. BWA-long is the best in market in speed, accuracy and memory. Questions ?????

## Recently Viewed Presentations

• Identify negative stereotypes for clues on how to improve (even positive stereotypes, these are still expectations people come in with). Understand your business from the consumer perspective in order to enhance their experience (no matter what the expectation or stereotype...
• Area Δ = ½ bc sin A = ½ ac sin B = ½ ab sin C. Example 1. Find the area. Example 2 . Farmer Jones owns a triangular piece of land. He labels each corner of the land...
• Your assignment for the election to be held on _____, 20____ is to be a/an _____ at _____ You should be at your assigned polling place by 5:15 AM and are to remain until the Moderator has dismissed you. No...
• T L Q strategy. These slides are a summary of the lesson that we did on Thursday, November 2, in periods 2, 3, and 4.This lesson was designed to accompany Activity 2.4 in our Springboard text.
• Testosterone Replacement for Men and Women Robert A. Jones Memorial Medical Centre Adelaide South Australia Presenting from the Premises of Port Power The human ovary produces more testosterone than oestrogen- weight for weight The postmenopausal ovary continues to produce testosterone...
• Platform Thesaurus. Satellite Series Example. Instrument Thesaurus. Instrument to Platform Relation Example. Instrument to Instrument Type Relation. Link between instrument and instrument type is part of the hierarchy as well (narrower/broader). ... European Space Agency ...
• United States Regions ... Georgia • Tennessee Kentucky • Virginia Louisiana • West Virginia The Northeastern Region Connecticut • New Jersey Delaware • New York Maine • Pennsylvania Maryland • Rhode Island Massachusetts • Vermont New Hampshire * * Title:...
• 12F describe how environmental change can impact ecosystem stability KEY CONCEPT Ecological succession is a process of change in the species that make up a community. Succession occurs following a disturbance in an ecosystem.