Symbolic Representations of Time Series
Eamonn Keogh and Jessica Lin
Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521
[email protected]

Important! Read This! These slides are from an early talk about SAX; some slides will make little sense out of context, but are provided here to give a quick intro to the utility of SAX. Read [1] for more details. You may use these slides for any teaching purpose, so long as they are clearly identified as being created by Jessica Lin and Eamonn Keogh. You may not use the text and images in a paper or tutorial without express prior permission from Dr. Keogh.

[1] Lin, J., Keogh, E., Lonardi, S. & Chiu, B. (2003). A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. San Diego, CA. June 13.

Outline of Talk
Prologue: Background on Time Series Data Mining
The importance of the right representation
A new symbolic representation
motif discovery
anomaly detection
visualization
Appendix: classification, clustering, indexing

What are Time Series?
A time series is a collection of observations made sequentially in time, e.g. 25.2750, 25.3250, 25.3500, ..., 24.6250, 24.6750, 24.7500.
[Figure: an example time series of 500 points.]

Time Series are Ubiquitous!
People measure things...
Schwarzenegger's popularity rating.
Their blood pressure.
The annual rainfall in New Zealand.
The value of their Yahoo stock.
The number of web hits per second.
...and things change over time.

Thus time series occur in virtually every medical, scientific and business domain.

Image data may best be thought of as time series.
Video data may best be thought of as time series.
[Figure: the Gun-Draw dataset — hand at rest, hand moving above holster, hand moving down to grasp gun, hand moving to shoulder level, steady pointing.]

What do we want to do with the time series data?
Clustering, Classification, Rule Discovery, Query by Content, Motif Discovery, Novelty Detection.
All these problems require similarity matching.

Euclidean Distance Metric
Given two time series Q = q1...qn and C = c1...cn, their Euclidean distance is defined as:

D(Q, C) = sqrt( sum_{i=1..n} (qi - ci)^2 )

The Generic Data Mining Algorithm
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the problem at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.
But which approximation should we use?

Time Series Representations
Data Adaptive: Sorted Coefficients; Piecewise Polynomial (Piecewise Linear Approximation: Interpolation, Regression; Adaptive Piecewise Constant Approximation); Singular Value Decomposition; Symbolic (Natural Language, Strings); Trees.
Non Data Adaptive: Wavelets (Orthonormal: Haar, Daubechies dbn n > 1; Bi-Orthonormal: Coiflets, Symlets); Random Mappings; Spectral (Discrete Fourier Transform, Discrete Cosine Transform); Piecewise Aggregate Approximation.
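As a concrete illustration of the Euclidean distance defined above, here is a minimal, dependency-free sketch (the function name is ours, not from the talk):

```python
import math

def euclidean_distance(q, c):
    """Euclidean distance between two equal-length time series."""
    if len(q) != len(c):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))

# Identical series are at distance zero; the classic 3-4-5 triangle
# gives a distance of exactly 5.
print(euclidean_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))            # 5.0
```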

[Figure: a sample time series of length 128 and its approximation under DFT, DWT, SVD, APCA, PAA, PLA and a symbolic (SYM) representation.]

The Generic Data Mining Algorithm (revisited)
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the problem at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.
This only works if the approximation allows lower bounding.

What is lower bounding?
Exact (Euclidean) distance D(Q,S); lower bounding distance DLB(Q,S).

D(Q, S) = sqrt( sum_{i=1..n} (qi - si)^2 )

DLB(Q', S') = sqrt( sum_{i=1..M} (sri - sri-1) * (qvi - svi)^2 )

Lower bounding means that for all Q and S, we have DLB(Q,S) <= D(Q,S).

Time Series Representations
(The same taxonomy as before, with the symbolic branch highlighted.)
We can live without trees, random mappings and natural language, but it would be nice if we could lower bound strings (symbolic or discrete approximations).
A lower bounding symbolic approach would allow data miners to:
Use suffix trees, hashing, Markov models etc.
Use text processing and bioinformatics algorithms.
We have created the first symbolic representation of time series that allows:
Lower bounding of Euclidean distance
Dimensionality Reduction
Numerosity Reduction
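To make the lower-bounding idea concrete, here is a small numerical check (helper names and parameters are our own, not from the talk) that the PAA-based distance — the stepping stone to SAX — never exceeds the true Euclidean distance:

```python
import math
import random

def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal
    segments. Assumes len(series) is divisible by w, for simplicity."""
    seg = len(series) // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def euclid(q, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def paa_distance(q, c, w):
    """Distance on the PAA representations, rescaled by sqrt(n/w);
    this lower-bounds the true Euclidean distance."""
    n = len(q)
    return math.sqrt(n / w) * euclid(paa(q, w), paa(c, w))

random.seed(0)
q = [random.gauss(0, 1) for _ in range(128)]
c = [random.gauss(0, 1) for _ in range(128)]
print(paa_distance(q, c, 8) <= euclid(q, c))  # True: never overestimates
```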

We call our representation SAX
Symbolic Aggregate ApproXimation

How do we obtain SAX?
First convert the time series to PAA representation, then convert the PAA to symbols. It takes linear time.
[Figure: a time series C of length 128 reduced to 8 PAA segments, then mapped to the SAX word baabccbc.]

Time series subsequences tend to have a highly Gaussian distribution.
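A minimal sketch of that two-step conversion: z-normalize, take the PAA, then map each segment mean to a symbol using breakpoints that cut the standard normal into equiprobable regions. The helper names are ours, and only alphabets of size 3 and 4 are tabulated here; a full implementation would tabulate breakpoints for larger alphabets.

```python
import bisect
import statistics

# Breakpoints that cut N(0,1) into equiprobable regions
# (standard normal quantiles; only a = 3 and a = 4 shown here).
BREAKPOINTS = {3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}

def znormalize(series):
    """Rescale to mean 0, std 1 (assumes a non-constant series)."""
    mu = statistics.mean(series)
    sd = statistics.pstdev(series)
    return [(x - mu) / sd for x in series]

def paa(series, w):
    """Mean of each of w equal segments (len(series) divisible by w)."""
    seg = len(series) // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def sax(series, w, a):
    """Convert a time series to a SAX word of length w over alphabet size a."""
    cuts = BREAKPOINTS[a]
    symbols = "abcdefghij"
    return "".join(symbols[bisect.bisect(cuts, v)]
                   for v in paa(znormalize(series), w))

# A series that starts high and ends low maps to a word that starts
# late in the alphabet and ends at 'a'.
word = sax([2.0, 2.1, 1.9, 0.1, -0.2, 0.0, -2.0, -1.9], w=4, a=3)
print(word)
```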

[Figure: a normal probability plot of the (cumulative) distribution of values from subsequences of length 128.]
Why a Gaussian?

Visual Comparison
[Figure: a raw time series of length 128 compared with its DFT, PLA, Haar and APCA approximations, and its SAX word.]
A raw time series of length 128 is transformed into the word ffffffeeeddcbaabceedcbaaaaacddee.
We can use more symbols to represent the time series since each symbol requires fewer bits than real numbers (float, double).

Euclidean Distance:

D(Q, C) = sqrt( sum_{i=1..n} (qi - ci)^2 )

PAA distance lower-bounds the Euclidean distance:

DR(Q', C') = sqrt(n/w) * sqrt( sum_{i=1..w} (q'i - c'i)^2 )

MINDIST on the SAX words (e.g. C^ = baabccbc, Q^ = babcacca):

MINDIST(Q^, C^) = sqrt(n/w) * sqrt( sum_{i=1..w} dist(q^i, c^i)^2 )
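A sketch of MINDIST for an alphabet of size 3, with dist() realized as a breakpoint table lookup (the breakpoint values ±0.43 and all helper names are ours): adjacent symbols are at distance zero, and symbols two or more apart are separated by the gap between their breakpoints.

```python
import math

# Breakpoints for an alphabet of size 3 (equiprobable under N(0,1)).
BETA = [-0.43, 0.43]

def cell(r, c):
    """dist() between two SAX symbols, via the breakpoint lookup.
    Adjacent (or equal) symbols are at distance zero."""
    i, j = sorted((ord(r) - ord("a"), ord(c) - ord("a")))
    return 0.0 if j - i <= 1 else BETA[j - 1] - BETA[i]

def mindist(word1, word2, n):
    """Lower bound of the Euclidean distance between two time series of
    original length n, computed from their SAX words of length w."""
    w = len(word1)
    return math.sqrt(n / w) * math.sqrt(
        sum(cell(a, b) ** 2 for a, b in zip(word1, word2)))

print(cell("a", "b"))            # 0.0 -- adjacent symbols
print(round(cell("a", "c"), 2))  # 0.86 -- BETA[1] - BETA[0]
print(round(mindist("baabccbc", "babcacca", n=128), 2))
```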

dist() can be implemented using a table lookup.

SAX is just as good as other representations, or working on the raw data, for most problems (slides shown at the end of this presentation). Now let us consider SAX for two hot problems, novelty detection and motif discovery. We will start with novelty detection.

Novelty Detection
Also known as: fault detection, interestingness detection, anomaly detection, surprisingness detection.
Note that this problem should not be confused with the relatively simple problem of outlier detection. Remember Hawkins' famous definition of an outlier:
"... an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated from a different mechanism..."
Thanks Doug, the check is in the mail. (Douglas M. Hawkins)

We are not interested in finding individually surprising datapoints, we are interested in finding surprising patterns.
Lots of good folks have worked on this, and closely related problems. It is referred to as the detection of Aberrant Behavior [1], Novelties [2], Anomalies [3], Faults [4], Surprises [5], Deviants [6], Temporal Change [7], and Outliers [8].
1. Brutlag, Kotsakis et al.
2. Dasgupta et al., Borisyuk et al.
3. Whitehead et al., Decoste
4. Yairi et al.
5. Shahabi, Chakrabarti
6. Jagadish et al.
7. Blockeel et al., Fawcett et al.
8. Hawkins.

Arrr... what be wrong with current approaches?
[Figure: the blue time series at the top is a normal healthy human electrocardiogram with an artificial flatline added. The sequence in red at the bottom indicates how surprising local subsections of the time series are under the measure.]

Our Solution
Based on the following intuition: a pattern is surprising if its frequency of occurrence is greatly different from that which we expected, given previous experience.
This is a nice intuition, but useless unless we can more formally define it, and calculate it efficiently.

Note that unlike all previous attempts to solve this problem, our notion of the surprisingness of a pattern is not tied exclusively to its shape. Instead it depends on the difference between the shape's expected frequency and its observed frequency.
For example, consider the familiar head and shoulders pattern shown below...
The existence of this pattern in a stock market time series should not be considered surprising, since such patterns are known to occur (even if only by chance). However, if it occurred ten times this year, as opposed to occurring an average of twice a year in previous years, our measure of surprise will flag the shape as being surprising. Cool eh?
The pattern would also be surprising if its frequency of occurrence is less than expected. Once again our definition would flag such patterns.

We call our algorithm Tarzan!
Tarzan is not an acronym. It is a pun on the fact that the heart of the algorithm relies on comparing two suffix trees, tree to tree!
Homer, I hate to be a fuddy-duddy, but could you put on some pants?
Tarzan (R) is a registered trademark of Edgar Rice Burroughs, Inc.

We begin by defining some terms. Professor Frink?
Definition 1: A time series pattern P, extracted from database X, is surprising relative to a database R, if the probability of its occurrence is greatly different to that expected by chance, assuming that R and X are created by the same underlying process.

But you can never know the probability of a pattern you have never seen! And probability isn't even defined for real valued time series!
We need to discretize the time series into symbolic strings — SAX!! (e.g. aaabaabcbabccb)
Once we have done this, we can use Markov models to calculate the probability of any pattern, including ones we have never seen before.

If x = principalskinner: the alphabet of x is {a,c,e,i,k,l,n,p,r,s}, and |x| is 16.
skin is a substring of x; prin is a prefix of x; ner is a suffix of x.
If y = in, then fx(y) = 2. If y = pal, then fx(y) = 1.

Can we do all this in linear space and time? Yes! Some very clever modifications of suffix trees (mostly due to Stefano Lonardi) let us do this in linear space. An individual pattern can be tested in constant time!

Experimental Evaluation
We would like to demonstrate two features of our proposed approach.
Sensitivity (High True Positive Rate): the algorithm can find truly surprising patterns in a time series.
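The counts fx(y) above, and the Markov-model estimate of a pattern's expected frequency, can be sketched naively as follows (Tarzan uses suffix trees to obtain these in linear time; this brute-force version is only illustrative, and the function names are ours):

```python
def substring_count(x, y):
    """f_x(y): number of (possibly overlapping) occurrences of y in x."""
    return sum(1 for i in range(len(x) - len(y) + 1) if x[i:i + len(y)] == y)

def markov_estimate(x, y):
    """Markov-chain estimate of the expected count of y in x from shorter
    substrings: f(y[:-1]) * f(y[1:]) / f(y[1:-1])."""
    denom = substring_count(x, y[1:-1])
    if denom == 0:
        return 0.0
    return substring_count(x, y[:-1]) * substring_count(x, y[1:]) / denom

x = "principalskinner"
print(len(x))                                    # 16
print(substring_count(x, "in"))                  # 2
print(substring_count(x, "pal"))                 # 1
print(x.startswith("prin"), x.endswith("ner"))   # True True
```

A pattern whose observed count is far from its Markov estimate is flagged as surprising.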

Selectivity (Low False Positive Rate): the algorithm will not find spurious surprising patterns in a time series.
Sensitive and selective, just like me.

Experiment 1: Shock ECG
[Figure: training data, test data (subset), and Tarzan's level of surprise over 1,600 points.]

Experiment 2: Video (Part 1)
[Figure: training data, test data (subset), and Tarzan's level of surprise over 12,000 points. We zoom in on one section in the next slide.]

Experiment 2: Video (Part 2)
[Figure: the zoomed section with annotations — normal sequence; actor misses holster; laughing and flailing hand; briefly swings gun at target, but does not aim; normal sequence.]

Experiment 3: Power Demand (Part 1)
We consider a dataset that contains the power demand for a Dutch research facility for the entire year of 1997. The data is sampled over 15 minute averages, and thus contains 35,040 points.
Demand for Power? Excellent!

[Figure: the first 3 weeks of the power demand dataset. Note the repeating pattern of a strong peak for each of the five weekdays, followed by relatively quiet weekends.]

Experiment 3: Power Demand (Part 2)
Mmm.. anomalous..
We used from Monday January 6th to Sunday March 23rd as reference data. This time period is devoid of national holidays. We tested on the remainder of the year. We will just show the 3 most surprising subsequences found by each algorithm. For each of the 3 approaches we show the entire week (beginning Monday) in which the 3 largest values of surprise fell.
Both TSA-tree and IMM returned sequences that appear to be normal workweeks; however, Tarzan returned 3 sequences that correspond to the weeks that contain national holidays in the Netherlands. In particular, from top to bottom, the week spanning both December 25th and 26th and the weeks containing Wednesday April 30th (Koninginnedag, Queen's Day) and May 19th (Whit Monday).
[Figure: the three most surprising weeks according to Tarzan, TSA-Tree and IMM.]

NASA recently said TARZAN "holds great promise for the future"*. There is now a journal version of TARZAN (under review); if you would like a copy, just ask. In the meantime, let us consider motif discovery.
* Isaac, D. and Christopher Lynnes, 2003. Automated Data Quality Assessment in the Intelligent Archive, White Paper prepared for the Intelligent Data Understanding program.

SAX allows Motif Discovery!
[Figure: the Winding Dataset (the angular speed of reel 2), 2,500 points.]
Informally, motifs are recurring patterns.
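To see why a clever algorithm is needed, here is a brute-force 1-motif search (our own sketch, not the talk's algorithm): it requires on the order of m^2 calls to the Euclidean distance function, and excludes trivial matches only crudely, by requiring starting points to be at least n apart.

```python
import math

def euclid(q, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def brute_force_1motif(t, n, r):
    """Find the subsequence of length n with the most matches within
    range r. O(m^2) distance calls over the m = len(t)-n+1 subsequences;
    overlapping (trivial) matches are excluded by an offset gap of n."""
    m = len(t) - n + 1
    best_count, best_pos = -1, -1
    for p in range(m):
        count = sum(1 for q in range(m)
                    if abs(p - q) >= n and euclid(t[p:p + n], t[q:q + n]) <= r)
        if count > best_count:
            best_count, best_pos = count, p
    return best_pos, best_count

# A toy series where the pattern [0, 1, 0] repeats at offsets 0, 4 and 8:
# the motif at offset 0 has two non-trivial matches.
series = [0, 1, 0, 5, 0, 1, 0, 5, 0, 1, 0, 5]
print(brute_force_1motif(series, n=3, r=0.1))  # (0, 2)
```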

Motif Discovery
To find these 3 motifs would require about 6,250,000 calls to the Euclidean distance function.
[Figure: three motifs A, B and C discovered in the Winding Dataset (the angular speed of reel 2), each about 140 points long.]

Why Find Motifs?
Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns.
Several time series classification algorithms work by constructing typical prototypes of each class. These prototypes may be considered motifs.
Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (which we see as motifs), and detecting future patterns that are dissimilar to all typical shapes.
In robotics, Oates et al. have introduced a method to allow an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these experiences as motifs.
In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient's recovery based on the discovery of similar patterns. Once again, we see these similar patterns as motifs.
Animation and video capture (Tanaka and Uehara, Zordan and Celly).

Trivial Matches
[Figure: Space Shuttle STS-57 telemetry (inertial sensor), 1,000 points, with a subsequence C and its trivial matches.]
Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) <= R, then M is called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q, or there does not exist a subsequence M' beginning at q' such that D(C, M') > R, and either q < q' < p or p < q' < q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 <= i < K.

OK, we can define motifs, but how do we find them? The obvious brute force search algorithm is just too slow. Our algorithm is based on a hot idea from bioinformatics, random projection*, and the fact that SAX allows us to lower bound discrete representations of time series.
* J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB'01. 2001.

A simple worked example of our motif discovery algorithm (the next 4 slides).
[Figure: a time series T of length m = 1,000; each subsequence C is converted to a SAX word and stored as a row of the matrix S^.]

With a = 3 (alphabet {a,b,c}), n = 16 and w = 4: assume that we have a time series T of length 1,000, and a motif of length 16, which occurs twice, at time T1 and time T58. Every subsequence of length 16 is converted to a SAX word of length 4 and stored as a row of the matrix S^; rows 1 and 58 hold the same word, since the motif occurs at both offsets.

A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project the matrix into buckets. Collisions are recorded by incrementing the appropriate location in the collision matrix.

A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project the matrix into buckets. Once again, collisions are recorded by incrementing the appropriate location in the collision matrix.

We can calculate the expected values in the collision matrix, assuming there are NO patterns. Informally, E(k, a, w, d, t) is the expected cell value after k iterations, for t words of length w over an alphabet of size a, allowing up to d mismatches; the chance that two random words agree on a projection is governed by the term sum_{i=0..d} C(w,i) * ((a-1)/a)^i * (1/a)^(w-i), accumulated over the k iterations.

Suppose E(k, a, w, d, t) = 2. After the iterations, the cell for the pair (1, 58) holds 27, far above the expected value, while the other cells hold small values (0-3) consistent with chance; the pair (1, 58) is therefore flagged as a candidate motif.
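A sketch of the projection step itself (function and variable names are ours): words that agree on a randomly chosen mask of columns land in the same bucket, and each co-bucket pair increments the collision matrix.

```python
import random
from collections import defaultdict

def random_projection_collisions(words, k, iterations, seed=0):
    """Hash SAX words on k randomly chosen columns; words that agree on
    the masked positions collide, and repeated collisions across
    iterations suggest candidate motifs. Returns pair -> collision count."""
    rng = random.Random(seed)
    w = len(words[0])
    collisions = defaultdict(int)
    for _ in range(iterations):
        mask = sorted(rng.sample(range(w), k))
        buckets = defaultdict(list)
        for idx, word in enumerate(words):
            buckets[tuple(word[i] for i in mask)].append(idx)
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    collisions[(members[i], members[j])] += 1
    return collisions

# Words 0 and 2 are identical, so they collide in every iteration,
# whatever mask is drawn.
words = ["acba", "bbca", "acba", "cabc"]
hits = random_projection_collisions(words, k=2, iterations=10)
print(hits[(0, 2)])  # 10
```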

A Simple Experiment
Let's embed two motifs into a random walk time series, and see if we can recover them.
[Figure: the random walk with the planted motif occurrences A, B, C, D marked; the planted motifs and the real motifs recovered by the algorithm, each of length 128.]

Some Examples of Real Motifs
[Figure: motifs found in Motor 1 (DC current) and astrophysics (photon count) datasets.]

How Fast can we find Motifs?
[Figure: runtime in seconds versus length of time series (1,000 to 5,000) for Brute Force and TS-P.]

Let us very quickly look at some other problems where SAX may make a contribution:
Visualization
Understanding the why in classification and clustering

SAX Summary
For most classic data mining tasks (classification, clustering and indexing), SAX is at least as good as the raw data, DFT, DWT, SVD etc.
SAX allows the best anomaly detection algorithm.
SAX is the engine behind the only realistic time series motif discovery algorithm.

The Last Word

The sun is setting on all other symbolic representations of time series; SAX is the only way to go.

Conclusions
SAX is poised to make major contributions to time series data mining in the next few years.
A more general conclusion: if you want to solve your data mining problem, think representation, representation, representation.

The slides that follow demonstrate that SAX is as good as DFT, DWT etc. for the classic data mining tasks; this is important, but not very exciting, thus relegated to this appendix.

Experimental Validation
Clustering: Hierarchical, Partitional
Classification: Nearest Neighbor, Decision Tree
Indexing: VA File
Anomaly Detection, Motif Discovery (discrete data only)

Clustering
Hierarchical Clustering: compute pairwise distance, merge similar clusters bottom-up. Compared with Euclidean, IMPACTS, and SDA.
[Figure: hierarchical clustering dendrograms under Euclidean, IMPACTS (alphabet=8), SDA and SAX.]

Partitional Clustering: K-means. Optimize the objective function by minimizing the sum of squared intra-cluster errors. Compared with raw data.
[Figure: K-means objective function over 11 iterations for raw data and our symbolic SAX approach.]

Classification
Nearest Neighbor: leaving-one-out cross validation. Compared with Euclidean distance, IMPACTS, SDA, and LPmax. Datasets: Control Charts & CBF (Cylinder, Bell, Funnel).
[Figure: error rate versus alphabet size (5-10) for IMPACTS, SDA, Euclidean, LPmax and SAX on the Cylinder-Bell-Funnel and Control Chart datasets.]

Decision Tree: defined for real data, but attempting to use a decision tree on raw time series data would be a mistake; the high dimensionality/noise level would result in deep, bushy trees. Geurts (01) suggests representing the time series as a regression tree (an Adaptive Piecewise Constant Approximation), and training the decision tree on it.

Decision (Regression) Tree error rates:
Dataset | SAX         | Regression Tree
CC      | 3.04 ± 1.64 | 2.78 ± 2.11
CBF     | 0.97 ± 1.41 | 1.14 ± 1.02

Indexing
Indexing scheme similar to the VA (Vector Approximation) File: the dataset is large and disk-resident, and the reduced dimensionality could still be too high for an R-tree to perform well. Compared with Haar wavelet.
[Figure: indexing performance for DWT Haar and SAX on the Ballbeam, Chaotic, Memory and Winding datasets.]