# Data Structures for Strings: Chapter 8

## Introduction
- Numbers as key values are data items of constant size that can be compared in constant time.
- In real applications, text processing is more important than the processing of numbers.
- We therefore need different structures for strings than for numeric keys.

## Motivating Example
- 112 < 467: a numerical comparison takes O(1).
- Comparing strings lexicographically does not reflect the similarity of the strings: Western > Eastern, and a string comparison takes O(min(|s1|, |s2|)), where |s| denotes the length of the string s.
- Text fragments have a length; they are not elementary objects that the computer can process in a single step. (Pneumonoultramicroscopicsilicovolcanoconiosis!)

## Applications
- Bioinformatics (DNA/RNA or protein sequence data)
- Search engines
- Spell checkers

## 8.1 Tries & Compressed Tries

### Tries
- The basic tool for string data structures, playing a role similar to the balanced binary search tree, is called a trie (derived from "retrieval"; pronounced either "try" or "tree").
- The nodes of this tree are not binary: each node has potentially one outgoing edge for each possible character, so the degree is at most the alphabet size |A|.

### Tries (cont.)
- Prefix vs. suffix. Example "computer": prefixes (c, co, com); suffixes (r, er, ter).
- Each node in this tree structure corresponds to a prefix of some strings of the set. If the same prefix occurs several times, there is only one node to represent it.
- The root of the tree structure is the node corresponding to the empty prefix.

### Tries Example
A = {a, c, e}. Strings:

aaa, aaccee, ac, cc, cea, cece, eee

(Figure: the trie storing these strings; each shared prefix, such as aa or ce, is represented by a single node, and each stored string ends at its own node.)

### Tries Example
Standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }.

(Figure: the standard trie; each root-to-leaf path spells one of the strings.)

### Prefix-free Sets
What if a whole string is a prefix of another string? For example, car and card: the set {car, card} is not prefix-free.

### String Termination
Strings are sequences of characters from some alphabet, but for use in the computer we need one important further piece of information: how to recognize where a string ends. There are two solutions for this:
1. We can have an explicit termination character, added at the end of each string but not allowed to occur within a string, e.g. \0 (ASCII code 0); or
2. We can store, together with each string, its length.

### Termination Example
Strings:

exam, example, fail, false, tree, trie, true

(Figure: the trie for these strings with explicit \0 termination characters at the leaves, so that exam can be stored even though it is a prefix of example.)

### String Termination (cont.)
The special termination character \0 simplifies code in a number of ways, but it has the disadvantage of reserving one character of the alphabet that may then not occur in strings. There are many nonprintable ASCII codes that should never occur in text, and \0 is just one of them; but there are also many applications in which the strings do not represent text but, for example, machine instructions.

### Find, Insert and Delete
To perform a find operation in this structure:
1. Start in the node corresponding to the empty prefix.
2. Read the query string, following for each read character the outgoing pointer for that character to the next node.
3. After reading the whole query string, we arrive at the node corresponding to that string as a prefix.
4. If the query string is contained in the set of strings stored in the trie, and that set is prefix-free, then this node belongs to that unique string.

To perform an insert operation in this structure:
1. Perform a find.
2. Any time we encounter a nil pointer, create a new node.

Example: insert "extra". (Figure: the find follows the existing prefix "ex"; new nodes are created for t, r, a and the terminator.)

### Find, Insert and Delete (cont.)
To perform a delete operation in this structure:
1. Perform a find.
2. Delete all nodes on the path from the \0 leaf back to the root, stopping when we reach a node with more than one child.

Example: delete "extra". (Figure: the nodes for t, r, a and the terminator are removed; the shared prefix "ex" remains.)

### Performance
- Find: we follow all the characters of the word, so O(|q|), where q is the query string.
- Insert: first find, then create nodes, each an array of length |A|, so O(|q| · |A|).
- Delete: first find, then delete nodes, each an array of length |A|, so O(|q| · |A|).
- We can get rid of the |A| dependence in the delete operation by using reference counts. All new nodes must still be initialized with NULL pointers, so the |A| dependence in the insert operation does not disappear.
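The operations above can be sketched in Python. This is a minimal illustration, assuming lowercase words over a 26-letter alphabet, with one extra array slot playing the role of the \0 terminator; the delete walk uses the reference counts just mentioned, so it never scans a whole node array:

```python
# Array-node trie: each node holds an array of |A|+1 child slots and a
# reference count of used slots, so delete runs in O(|q|) via the counts.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # assumed alphabet for this sketch
END = len(ALPHABET)                       # index of the terminator slot

def new_node():
    return {"next": [None] * (len(ALPHABET) + 1), "refs": 0}

def find(root, q):
    node = root
    for c in q:                           # follow one pointer per character
        node = node["next"][ALPHABET.index(c)]
        if node is None:
            return False
    return node["next"][END] is not None  # terminator present?

def insert(root, w):
    node = root
    for c in w:
        i = ALPHABET.index(c)
        if node["next"][i] is None:       # nil pointer: create a new node
            node["next"][i] = new_node()
            node["refs"] += 1
        node = node["next"][i]
    if node["next"][END] is None:
        node["next"][END] = True          # mark the end of the string
        node["refs"] += 1

def delete(root, w):
    """Remove w (assumed present) from the trie."""
    path = [root]
    for c in w:                           # first: perform the find
        path.append(path[-1]["next"][ALPHABET.index(c)])
    path[-1]["next"][END] = None
    path[-1]["refs"] -= 1
    # walk back toward the root, removing nodes whose count dropped to zero
    for node, c in zip(reversed(path[:-1]), reversed(w)):
        i = ALPHABET.index(c)
        if node["next"][i]["refs"] > 0:   # node still shared: stop
            break
        node["next"][i] = None
        node["refs"] -= 1
```

Note that `ALPHABET.index(c)` is itself an O(|A|) scan; a real implementation would map characters to indices directly, but the structure of the three operations is the same.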

### Alphabet Size
The problem here is the dependence on the size of the alphabet, which determines the size of the nodes. There are several ways to reduce or avoid this problem. A simple method is to replace the big array nodes by linked lists of only those entries that are actually used.

### List Example
(Figure: the same trie with list-based nodes.)

### List Performance
Find, insert and delete: O(|q| · |A|).
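A sketch of the list-based node idea in Python (the `Node`/`Entry` names are illustrative; each node keeps only the characters actually in use, found by a linear scan of up to |A| list entries, which is where the O(|q| · |A|) bound comes from):

```python
class Node:
    def __init__(self):
        self.entries = None            # head of the linked list of children

class Entry:
    def __init__(self, char, child, next_entry):
        self.char = char               # character on the outgoing edge
        self.child = child             # node reached over this edge
        self.next = next_entry         # next entry in this node's list

def lookup(node, c):
    e = node.entries
    while e is not None:               # linear scan: up to |A| entries
        if e.char == c:
            return e.child
        e = e.next
    return None

def insert(node, w):
    for c in w + "\0":                 # explicit terminator
        child = lookup(node, c)
        if child is None:
            child = Node()             # prepend a new list entry
            node.entries = Entry(c, child, node.entries)
        node = child

def find(node, q):
    for c in q + "\0":
        node = lookup(node, c)
        if node is None:
            return False
    return True
```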

### Alphabet Reduction
Another way to avoid the problem with the alphabet size |A| is alphabet reduction. We can represent the alphabet A as a set of k-tuples over a smaller alphabet B, i.e. a subset of B^k. By this, each string gets longer by a factor of k, but the alphabet size can be reduced to |B|, roughly |A|^(1/k).

### Alphabet Reduction Example
For the standard ASCII codes, we can break each 8-bit character into two 4-bit characters, which reduces the node size from 256 pointers to 16 pointers.

### Reduction Performance
With strings longer by a factor of k and node arrays of size |A|^(1/k), find takes O(k · |q|) and insert and delete take O(k · |q| · |A|^(1/k)).
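The 8-bit to 4-bit reduction from the example can be sketched as a key transformation (a hypothetical `reduce_key` helper; the trie then stores these nibble sequences instead of the raw characters):

```python
def reduce_key(word):
    """Map an ASCII string to its sequence of 4-bit characters (values 0..15).
    The key doubles in length, but the alphabet shrinks from 256 to 16."""
    out = []
    for b in word.encode("ascii"):
        out.append(b >> 4)       # high nibble
        out.append(b & 0x0F)     # low nibble
    return out
```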

### Other Reduction Techniques
- The trie structure with balanced search trees as nodes: each character step becomes a search in a tree with at most |A| entries, so find, insert and delete take O(|q| · log |A|) time; the space is proportional to the total length of the stored strings.
- The ternary trie structure: the nodes are arranged in a manner similar to a binary search tree, but with up to three children. Each node contains one character as key and one pointer each for query characters that are smaller, larger, or equal. In a balanced ternary trie, find takes O(|q| + log n) time; the space is again proportional to the total length of the strings.

### Patricia Tree
- Patricia: Practical Algorithm To Retrieve Information Coded In Alphanumeric.
- A path-compressed trie: instead of explicitly storing nodes with just one outgoing edge, we skip these nodes and keep track of the number of skipped characters. The path-compressed trie contains only nodes with at least two outgoing edges.
- Each node contains a number: the number of characters that should be skipped before the next relevant character is looked at. This reduces the required number of nodes from the total length of all strings to the number of words in our structure.
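A minimal path-compression sketch in Python (an illustrative recursive construction, not the original algorithm: each branching node stores the position of the next relevant character, single-child chains are skipped entirely, and the full comparison at the leaf supplies the check of the skipped characters):

```python
def build(strings, pos=0):
    """Path-compressed trie over a set of distinct strings.
    Inner node: (position, {char: subtree}); leaf: the stored string itself."""
    if len(strings) == 1:
        return strings[0]                      # leaf stores the whole string
    # skip ahead to the first position where the strings differ
    while len({s[pos] if pos < len(s) else "\0" for s in strings}) == 1:
        pos += 1
    children = {}
    for s in strings:
        c = s[pos] if pos < len(s) else "\0"   # '\0' marks end of string
        children.setdefault(c, []).append(s)
    return (pos, {c: build(g, pos + 1) for c, g in children.items()})

def find(node, q):
    while isinstance(node, tuple):             # descend branching nodes only
        pos, children = node
        c = q[pos] if pos < len(q) else "\0"
        if c not in children:
            return False
        node = children[c]
    return node == q                           # verify the skipped characters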

- We need in each access a second pass over the string, to check all the skipped characters of the found string against the query string.
- This technique of reducing the number of nodes is justified only if the alphabet is large.

### Patricia Tree: Example
(Figure: a path-compressed trie with skip counts stored in the inner nodes.)

### Patricia Tree: Insert & Delete
The insertion and deletion operations create significant difficulties: we need to find where to insert a new branching node, but this requires that we know the skipped characters. One (clumsy) solution is to store in each node a pointer to one of the strings in the subtrie reached through that node, for there we have that skipped substring already available.

## 8.2 Dictionaries Allowing Errors in Queries

### The Problem of Exact Match (1)
Tries essentially mirror the lexicographic order of the strings. The trie-based structures find only exact matches; if the query string contains an error (for example, a word is misspelled, or a typing or transmission error happened), the correct string will not

be found. It would be highly desirable to have a dictionary structure that keeps track of a set of strings and finds all strings that differ in at most d characters from the query string.

### The Problem of Exact Match (2)
Example: if d = 1 and w1 = Computer, then we may have Komputer, Aomputer, Bomputer, Domputer, etc., or Camputer, Cbmputer, etc.

### Trivial Solutions
1. Generate for each word wi all the words that differ from it in at most d places, and store all these word variants in a trie.
   - For each wi, there are O(|A|^d · length(wi)^d) variants.
   - The size of the underlying structure increases from O(Σi length(wi)) to O(|A|^d · Σi length(wi)^(d+1)).
   - The query time stays O(length(q)).
   - The structure size becomes a problem.
2. Use just a standard trie for the words, but generate for each query string q all the strings that differ from it in at most d places, and perform all these queries on the trie.
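The variant generation used by both trivial solutions can be sketched as follows (a brute-force enumeration of all strings within Hamming distance d, applied either to each stored word or to the query):

```python
from itertools import combinations, product

def variants(word, alphabet, d):
    """All strings over `alphabet` that differ from `word` in at most d places."""
    out = {word}
    for k in range(1, d + 1):
        for positions in combinations(range(len(word)), k):
            for letters in product(alphabet, repeat=k):
                w = list(word)
                for p, c in zip(positions, letters):
                    w[p] = c               # substitute one chosen character
                out.add("".join(w))
    return out
```

For d = 1 this already produces 1 + length(word) · (|A| - 1) distinct strings, which is why the approach collapses for d >= 2.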

   - For each q, there are O(|A|^d · length(q)^d) variants.
   - The size of the underlying structure does not change.
   - The query time for each variant stays O(length(q)), so the total time for all queries is O(length(q)) · O(|A|^d · length(q)^d).
   - The query time becomes a problem.

Both of the above are useless, at least for d >= 2.

### Minor Improvements
- Improvement 1: Because there are too many leaves, path compression could help in reducing the exponent from d+1 to d. But it is not obvious how to construct the structure in that time. (An example of compression was already seen in previous slides.)
- Improvement 2: Use a trie with list-based nodes, which would remove one |A| factor. (An example of list-based nodes was already seen in previous slides.)
- Improvement 3: Combine the above solutions. Still useless, but essentially the best we have for d >= 2.

### The Remarkable Achievement of Brodal and Gasieniec (1996)
The structure consists of a double trie and a balanced search tree. The idea of this double-trie structure is to build one trie for all words wi and a second trie for the words written backward (wi reversed).

(Figure: the prefix trie of the words and the trie of the reversed words, side by side.)

### The Remarkable Achievement of Brodal and Gasieniec (1996), cont.
For each word wi we look at all pairs of (prefix, suffix) that are separated by a single character, so as to have the different ways to write wi = α c β, where α is a prefix of wi, β is a suffix of wi, and c is a single character. For example:

- If w = bear, α = be, β = r, and c is any character, then we may have bear, bebr, becr, beer, etc.
- If w = bear, α = bea, β is empty, and c is any character, then we may have beaa, beab, bead, etc.

In general, the possible words have the following components: (prefix) c (suffix).
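The decomposition just described can be sketched as a small helper (hypothetical `splits` function) that enumerates every (prefix, suffix) pair separated by one character:

```python
def splits(word):
    """All (prefix, suffix) pairs of `word` separated by a single character,
    i.e. every position where the one differing character may sit."""
    return [(word[:i], word[i + 1:]) for i in range(len(word))]
```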

### Example
Suppose that we have the following set of words: {abcyz, abxuz, atxyz}, and a query abxyz.

(Figure: the prefix trie of the words and the suffix trie of the reversed words. Matching the query against the prefix trie pushes the prefixes a, ab, abx onto a stack; combining the stack entries with suffix-trie matches yields the candidates abcyz, abxuz, abxyz.)

### Example (cont.): The Technique Used
If no exact match takes place, then:
1. Go through the prefix trie up to the maximal matched prefix; each visited node is pushed onto a stack.
2. Go to the suffix trie and traverse up to one character before the maximal prefix.
3. Concatenate each visited node in the suffix trie with the stack entries.
4. Now all candidate words are generated, and they are used in find operations over the already built search tree.

### Time Complexities
Building the structure:
a) For a given word wi, we can generate all the node pairs in time O(length(wi)).

So, for all words w1, ..., wn, we can generate all node pairs in time O(Σi length(wi)).
b) Inserting or finding a pair in the balanced search tree costs only O(log n).
Total build time: O((Σi length(wi)) · log n).

For each query, the worst case is O(length(q) · log n): up to length(q) + 1 candidate pairs, each looked up in the search tree. What is the best case?

### Theorem: Brodal and Gasieniec (1996)
The double-trie structure, combined with a balanced search tree, finds all dictionary words that differ from the query string in at most one character, within the preprocessing and query bounds above.

### Double-Trie Structure with More General Models
- Not only might one character be exchanged; one character could also be inserted or deleted. This corresponds to using the edit distance instead of the Hamming distance.
- There is no change in the total complexity, but more words are added to the tries.
- The Hamming distance of two strings of equal length is the number of positions at which the corresponding symbols are different:

Western vs. Eastern = 2.

Edit distance: the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into another. Example: West to Western = 3 insert operations.

## 8.3 Suffix Trees

### Definition
- The suffix tree is a static structure that preprocesses a long string s and answers, for a query string q, if and where it occurs in the long string.
- Each substring of s is a prefix of a suffix of s. If we construct a trie that stores all suffixes of the long string s, then its nodes correspond to the substrings of s, and we can decide for any query q in O(length(q)) time whether it is a substring of s.
- This structure would use O(length(s)^2) nodes. Why? A string of length n has n suffixes of total length n(n+1)/2, and each character of each suffix can contribute a node.

### A More Compact Representation
There is no need to store all suffixes explicitly: we can encode each one by a beginning and an end address in the long string s, giving a representation with O(length(s)) nodes.
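The uncompressed suffix-trie idea can be illustrated directly in Python (a naive sketch using nested dicts as nodes; it really does use O(length(s)^2) nodes, which is exactly what the compact representation avoids):

```python
def suffix_trie(s):
    """Trie of all suffixes of s, as nested dicts. O(length(s)^2) nodes."""
    root = {}
    for i in range(len(s)):        # insert every suffix s[i:]
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root

def is_substring(trie, q):
    """Every substring of s is a prefix of a suffix: O(length(q)) steps."""
    node = trie
    for c in q:
        if c not in node:
            return False
        node = node[c]
    return True
```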

### Suffix Links
The suffix link is an extra pointer that points from a node representing a string a0...ak to the node representing the string a1...ak, that is, its suffix after deleting the first character.

### Building the Structure
The straightforward algorithm takes O(n^2) time, but it can be reduced to O(n) if path compression is used. For more details, see the original paper "On-line construction of suffix trees" (Esko Ukkonen).

## 8.4 Suffix Arrays

### Definition
The suffix array is an alternative structure to the suffix tree, developed by Manber and Myers (1993). It preprocesses a long string and then answers, for a query string, whether it occurs as a substring of the preprocessed string.

Possible advantages:
- Its size does not depend on the size of the alphabet.
- It offers a quite different tool to attack the same type of string problems.
- It has a straightforward implementation and is said to be smaller than suffix trees.

### The Underlying Idea
Consider all suffixes of the preprocessed string s in lexicographic order and perform binary search on them to find a given query string.

### Time Complexity (1)
We need O(log length(s)) lexicographic comparisons between q and some suffix of s. Without additional information, each comparison takes O(length(q)) time, for a total of O(length(q) · log length(s)). Can we do better?
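The underlying idea can be sketched as follows (a naive O(n^2 log n) construction by sorting the suffixes, for illustration only; the skew algorithm described below builds the array in linear time):

```python
def build_suffix_array(s):
    """Starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def is_substring(s, sa, q):
    """Binary search for q among the sorted suffixes:
    O(log length(s)) steps, each comparing up to length(q) characters."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(q)] < q:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(q)] == q
```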

### Using the Longest Common Prefix (LCP)
In the binary search, let k be the (precomputed) length of the longest common prefix of the boundary suffix left and the middle suffix, and let l be the length of the longest common prefix of left and the query q.
- If l < k, then the query string cannot be between left and middle.
- By the same argument, if l > k, then middle cannot be between left and q.
- So if we have the numbers k and l, we can decide the outcome of the comparison in that step of the binary search without looking at the string q, unless l = k. If l = k, we compare further and update the value of l; over the whole search, l can only grow, up to length(q).

### Time Complexity (2)
With the LCP information, the total query time drops to O(length(q) + log length(s)). But how can we build the structure?

### Building the Structure (Skew Algorithm)
1. Construct the suffix array A12 of the suffixes starting at positions i with i mod 3 != 0.

This is done by a recursive call of the skew algorithm on a string of two-thirds the length.
2. Construct the suffix array A0 of the remaining suffixes (positions i with i mod 3 = 0).
3. Merge the two suffix arrays into one.

Full example at: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture3Materials/script.pdf

### Building the Structure: Example
This example shows the general idea. Let us say S = CSWESTERN, 9 characters.
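The mod-3 grouping of character triples used in this example can be computed as follows (a small sketch; '$' is the conventional padding character for suffixes shorter than three characters):

```python
def triples(s):
    """Split the suffixes of s into mod-3 groups of character triples,
    padding the end of the string with '$'."""
    padded = s + "$$"                 # so every starting position has 3 chars
    groups = {0: [], 1: [], 2: []}
    for i in range(len(s)):
        groups[i % 3].append(padded[i:i + 3])
    return groups
```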

Positions: 0=C, 1=S, 2=W, 3=E, 4=S, 5=T, 6=E, 7=R, 8=N.

Group 0: i mod 3 = 0; Group 1: i mod 3 = 1; Group 2: i mod 3 = 2.

- Group 0: CSW (i=0), EST (i=3), ERN (i=6)
- Group 1: SWE (i=1), STE (i=4), RN$ (i=7)
- Group 2: WES (i=2), TER (i=5), N$$ (i=8)

### Building the Structure (cont.)
Take two parts, Group 1 and Group 2, and handle them recursively. By using radix sort, we obtain the ordered list of the group-12 suffixes:

- Group 0: CSW, EST, ERN
- Group 12 (sorted): N$$, RN$, STE, SWE, TER, WES

Construct the suffix array of Group 12, construct the suffix array of Group 0, and then merge.

### Time Complexity of Building Suffix Arrays, in a Few Words
Each level of the recursion does O(n) work (radix sort and merge) on a string of two-thirds the length, so T(n) = T(2n/3) + O(n) = O(n).

THANK YOU