
Non-Standard-Datenbanken: Bayesian Networks, Inference, and Learning. Prof. Dr. Ralf Möller, Universität zu Lübeck, Institut für Informationssysteme.

Example: a dentist problem with four variables: Toothache (is the pain in question really a toothache?), Cavity (could it be a cavity?), Catch (does the steel probe catch, producing a test pain?), Weather (sunny, rainy, cloudy, snow). The following presentations contain material from Chapter 14 (Sections 1 and 2).

Prior probability. Prior or unconditional probabilities of propositions, e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72, correspond to belief prior to the arrival of any (new) evidence. A probability distribution gives values for all possible assignments: P(Weather) = <0.72, 0.1, 0.08, 0.1>

(normalized, i.e., sums to 1 because exactly one of the conditions must be the case)

Full joint probability distribution. A joint probability distribution for a set of random variables gives the probability of every atomic event over those random variables. P(Weather, Cavity) is a 4 × 2 matrix of values:

Weather =        sunny   rainy   cloudy   snow
Cavity = true    0.144   0.02    0.016    0.02
Cavity = false   0.576   0.08    0.064    0.08

A full joint probability distribution involves all random variables of the domain: P(Toothache, Catch, Cavity, Weather)

Every query about a domain can be answered by the full joint distribution.

Discrete random variables: notation. Dom(Weather) = {sunny, rainy, cloudy, snow}, with Dom(Weather) disjoint from the domains of the other random variables. The atomic event Weather = rainy is often written simply as rainy. Example: in P(rainy), the random variable Weather is implicitly defined by the value rainy. For a Boolean variable Cavity, the atomic event Cavity = true is written as cavity and Cavity = false as ¬cavity. Examples: P(cavity) or P(¬cavity).

Conditional probability. Conditional or posterior probabilities, e.g., P(cavity | toothache) = 0.8 (as a distribution: <0.8, 0.2>), i.e., given that toothache is all I know. (Notation for conditional distributions:

P(Cavity | Toothache) is a 2-element vector of 2-element vectors.) If we know more, e.g., cavity is also given, then we have P(cavity | toothache, cavity) = 1. New evidence may be irrelevant, allowing simplification, e.g., P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8. This kind of inference, sanctioned by domain knowledge, is crucial.

Conditional probability. Definition of conditional probability (in terms of unconditional probabilities): P(a | b) = P(a ∧ b) / P(b) if P(b) > 0. The product rule gives an alternative formulation (∧ is commutative):

P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a). A general version holds for whole distributions, e.g., P(Weather, Cavity) = P(Weather | Cavity) P(Cavity). View this as a set of 4 × 2 equations, not as a matrix multiplication: (1,1) P(Weather=sunny | Cavity=true) P(Cavity=true), (1,2) P(Weather=sunny | Cavity=false) P(Cavity=false), etc. The chain rule is derived by successive application of the product rule:

P(X1, …, Xn) = P(X1, …, Xn−1) P(Xn | X1, …, Xn−1)
             = P(X1, …, Xn−2) P(Xn−1 | X1, …, Xn−2) P(Xn | X1, …, Xn−1)
             = …
             = ∏_{i=1..n} P(Xi | X1, …, Xi−1)

Inference by enumeration. Start with the full joint probability distribution. For any proposition φ, sum the probabilities of the atomic events where it is true: P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2 — the unconditional or marginal probability of toothache. This process is called marginalization or summing out. It relies on Axiom 3: P(A ∨ B) = P(A) + P(B) for disjoint events A, B.

Marginalization and conditioning. Let Y, Z be sequences of random variables such that Y ∪ Z denotes all random variables describing the world. Marginalization: P(Y) = Σ_{z ∈ Z} P(Y, z). Conditioning: P(Y) = Σ_{z ∈ Z} P(Y | z) P(z).

Inference by enumeration. Start with the full joint probability distribution. For any proposition φ, sum the atomic events where it is true: P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28 (equivalently, P(cavity ∨ toothache) = P(cavity) + P(toothache) − P(cavity ∧ toothache)).

Inference by enumeration. Starting from the joint distribution, we can also compute conditional probabilities (product rule):

P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache) = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.08 / 0.2 = 0.4
P(cavity | toothache) = (0.108 + 0.012) / 0.2 = 0.6

Normalization

The denominator P(z) (or P(toothache) in the example before) can be viewed as a normalization constant α:

P(Cavity | toothache) = α P(Cavity, toothache)
                      = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
                      = α [<0.108, 0.016> + <0.012, 0.064>] = α <0.12, 0.08> = <0.6, 0.4>

General idea: compute the distribution over the query variable by fixing the evidence variables (Toothache) and summing over the hidden variables (Catch).

Inference by enumeration, contd. Typically, we are interested in the posterior joint distribution of the query variables Y given specific values e for the evidence variables E (X are all variables of the modeled world). Let the hidden variables be H = X − Y − E; then the required summation of joint entries is done by summing out the hidden variables: P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h). The terms in the summation are joint entries because Y, E, and H together exhaust the set of random variables X.
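The marginalization and normalization steps above can be sketched directly in a few lines. The joint entries are those of the dentist example; the two ¬toothache rows are completed here from the standard AIMA table (an assumption, since only their sums appear in the slides):

```python
# Full joint P(Toothache, Catch, Cavity); keys: (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

def prob(event):
    """P(phi): sum the probabilities of the atomic events where phi holds."""
    return sum(p for world, p in joint.items() if event(*world))

def normalize(vector):
    """Multiply by the normalization constant alpha so the entries sum to 1."""
    z = sum(vector)
    return [v / z for v in vector]

p_toothache = prob(lambda t, c, cav: t)            # marginalization -> 0.2
p_cav_or_t  = prob(lambda t, c, cav: cav or t)     # -> 0.28
# P(Cavity | toothache) = alpha <P(cavity, toothache), P(~cavity, toothache)>
posterior = normalize([prob(lambda t, c, cav: t and cav),
                       prob(lambda t, c, cav: t and not cav)])  # -> [0.6, 0.4]
```

Enumerating the full joint like this is exponential in the number of variables, which is exactly the problem discussed next.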

Obvious problems: 1. Worst-case time complexity O(d^n), where d is the largest domain size and n denotes the number of random variables. 2. Space complexity O(d^n) to store the joint distribution. 3. How to find the numbers for the O(d^n) entries?

Independence. A and B are independent iff P(A | B) = P(A), or P(B | A) = P(B), or P(A, B) = P(A) P(B). For example, P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather): 32 entries reduced to 12. Absolute independence is powerful but rare. Dentistry is a large field with hundreds of variables, none of which are independent. What to do?
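Independence claims like these can be verified mechanically: a sketch using the P(Weather, Cavity) matrix above for absolute independence, and (anticipating the next section) the dentist joint for conditional independence. The ¬toothache entries of the joint are completed from the standard AIMA table:

```python
# P(Weather, Cavity) from the earlier 4 x 2 table.
pwc = {("sunny", True): 0.144, ("rainy", True): 0.02,
       ("cloudy", True): 0.016, ("snow", True): 0.02,
       ("sunny", False): 0.576, ("rainy", False): 0.08,
       ("cloudy", False): 0.064, ("snow", False): 0.08}

p_sunny  = sum(p for (w, _), p in pwc.items() if w == "sunny")   # 0.72
p_cavity = sum(p for (_, c), p in pwc.items() if c)              # 0.2
# Absolute independence: P(sunny, cavity) == P(sunny) P(cavity)
independent = abs(pwc[("sunny", True)] - p_sunny * p_cavity) < 1e-9

# Dentist joint P(Toothache, Catch, Cavity): check that
# P(catch | toothache, cavity) == P(catch | cavity).
joint = {(True, True, True): 0.108, (True, True, False): 0.016,
         (True, False, True): 0.012, (True, False, False): 0.064,
         (False, True, True): 0.072, (False, True, False): 0.144,
         (False, False, True): 0.008, (False, False, False): 0.576}
prob = lambda ev: sum(p for w, p in joint.items() if ev(*w))
lhs = prob(lambda t, c, cav: t and c and cav) / prob(lambda t, c, cav: t and cav)
rhs = prob(lambda t, c, cav: c and cav) / prob(lambda t, c, cav: cav)
# lhs == rhs == 0.9: Catch is conditionally independent of Toothache given cavity
```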

Conditional independence. P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries. If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache: (1) P(catch | toothache, cavity) = P(catch | cavity). The same independence holds if I haven't got a cavity: (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity). Catch is conditionally independent of Toothache given Cavity: P(Catch | Toothache, Cavity) = P(Catch | Cavity). Equivalent statements: P(Toothache | Catch, Cavity) = P(Toothache | Cavity), and P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity).

Conditional independence contd. Write out the full joint distribution using the chain rule:

P(Toothache, Catch, Cavity) = P(Toothache | Catch, Cavity) P(Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
  = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)    (conditional independence)

i.e., 2 + 2 + 1 = 5 independent numbers. In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n. Conditional independence is our most basic and robust form of knowledge about uncertain environments.

Naïve Bayes model. The assumption that effects are independent given the cause is usually wrong, but it works well in practice.

Bayesian networks. A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions. Syntax:

a set of nodes, one per variable; a directed, acyclic graph (a link means "directly influences"); a conditional distribution for each node given its parents, P(Xi | Parents(Xi)). In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values.

Example. The topology of the network encodes conditional independence assertions: Weather is independent of the other variables; Toothache and Catch are conditionally independent given Cavity.

Example. I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglary?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. The network topology reflects "causal" knowledge: a burglar can set the alarm off; an earthquake can set the alarm off; the alarm can cause Mary to call; the alarm can cause John to call.

Compactness of the full joint encoding due to network sparseness. A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows for the combinations of parent values.

Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p). If each variable has no more than k parents, the complete network requires n · 2^k numbers, i.e., it grows linearly with n, vs. 2^n for the full joint distribution. For the burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31).

Semantics. The full joint distribution is defined as the product of the local conditional distributions: P(X1, …, Xn) = ∏_{i=1..n} P(Xi | Parents(Xi)), e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)

= 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.00063

Noisy-OR representation. Suppose A alone causes C with probability 0.8 and B alone with probability 0.1. The whole CPT for C is then determined by these two numbers:

A  B  P(c | A, B)
f  f  0.0
f  t  0.1
t  f  0.8
t  t  0.1 + 0.8 − 0.1 · 0.8 = 0.82

A substantial reduction of the probabilities to be stored.

Inference question: what must be true to allow for a certain (desired) conclusion? Enumeration corresponds to evaluating a sum-product evaluation tree with path length n (= number of random variables) and degree d (= maximal number of domain elements). We can do better with Variable Elimination (VE); its basic objects follow.
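The joint-entry calculation and the noisy-OR table above can be checked in a few lines; the CPT numbers are those of the burglary example, and noisy-OR combines the per-cause probabilities as 1 − ∏(1 − p_i):

```python
# Chain-rule joint entry for the burglary network:
# P(j, m, a, ~b, ~e) = P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e)
entry = 0.90 * 0.70 * 0.001 * (1 - 0.001) * (1 - 0.002)   # about 0.00063

def noisy_or(cause_probs):
    """P(effect | exactly these causes active) under the noisy-OR model."""
    q = 1.0
    for p in cause_probs:
        q *= 1.0 - p
    return 1.0 - q

# All four CPT rows for C from the two numbers 0.8 and 0.1:
cpt = {(False, False): noisy_or([]),          # 0.0
       (False, True):  noisy_or([0.1]),       # 0.1
       (True,  False): noisy_or([0.8]),       # 0.8
       (True,  True):  noisy_or([0.8, 0.1])}  # 0.82
```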

Track objects called factors (processed right-to-left; in the evaluation tree: bottom-up). The initial factors are the local CPTs, e.g., P(M | A) ⇝ f_M(A). During elimination, new factors are created — a form of dynamic programming.

Basic operation: pointwise product. The pointwise product of two factors, for example: f1(A, B) × f2(B, C) = f(A, B, C); in general, f1(X1, …, Xj, Y1, …, Yk) × f2(Y1, …, Yk, Z1, …, Zl) = f(X1, …, Xj, Y1, …, Yk, Z1, …, Zl), which has 2^(j+k+l) entries if all variables are binary. Factors are joined by pointwise product.

Basic operation: summing out. Summing out a variable X from a product of factors: move any factors that do not depend on X outside the summation, then add up the submatrices in the pointwise product of the remaining factors:

Σ_x f1 × … × fk = f1 × … × fi × Σ_x f_{i+1} × … × fk = f1 × … × fi × f_X̄

assuming f1, …, fi do not depend on X.

VE result for the burglary example: [figure].

Further optimization: finding irrelevant variables. For the burglary query, only {Alarm, Earthquake, Burglary} are relevant. Irrelevant variables can be identified by a graph-theoretical criterion (m-separation) on the associated moral graph. Many other inference algorithms and optimizations are known; they are discussed in detail in a course on Bayesian knowledge.

Acknowledgements: slides from the AIMA book provided by Cristina Conati, UBC.

Data mining with Bayesian networks — overview: full Bayesian learning, MAP learning,

maximum likelihood learning, and learning Bayesian networks (fully observable, and with hidden (unobservable) variables).

Full Bayesian learning. In the learning methods we have seen so far, the idea was always to find the best model that could explain some observations. In contrast, full Bayesian learning sees learning as Bayesian updating of a probability distribution over the hypothesis space, given data. H is the hypothesis variable with possible values (hypotheses) h1, …, hn and prior probability distribution P(H). The j-th observation dj gives the outcome of the random variable Dj; the training data are d = d1, …, dk.

Example. Suppose we have 5 types of candy bags: 10% are 100% cherry candies (h100); 20% are 75% cherry + 25% lime (h75); 40% are 50% cherry + 50% lime (h50); 20% are 25% cherry + 75% lime (h25); 10% are 100% lime (h0). Then we observe candies drawn from some bag. Let's call θ the parameter that defines the fraction of cherry candies in a bag, and hθ the corresponding hypothesis. Which of the five kinds of bag has generated my 10 observations? P(hθ | d). What flavour will the next candy be? Prediction P(X | d).

Full Bayesian learning. Given the data so far, each hypothesis hi has a posterior probability: P(hi | d) = α P(d | hi) P(hi) (Bayes' theorem), where P(d | hi) is called the likelihood of the data under each hypothesis. Predictions over a new entity X are a weighted

average over the predictions of each hypothesis:

P(X | d) = Σ_i P(X, hi | d) = Σ_i P(X | hi, d) P(hi | d) = Σ_i P(X | hi) P(hi | d)    (the data add nothing to a prediction given the hypothesis)
         ∝ Σ_i P(X | hi) P(d | hi) P(hi)

The weights are given by the data likelihood and the prior of each hypothesis. + No need to pick one best-guess hypothesis! − Need to iterate over all hypotheses.

Example. If we re-wrap each candy and return it to the bag, our 10 observations are independent and identically distributed (i.i.d.), so P(d | hθ) = ∏_j P(dj | hθ) for j = 1, …, 10. For a given hθ, P(dj = cherry | hθ) = θ and P(dj = lime | hθ) = 1 − θ. Given N observations, of which c are cherry and ℓ = N − c are lime:

P(d | hθ) = ∏_j P(dj | hθ) = θ^c (1 − θ)^ℓ

Up to the binomial coefficient, this is the binomial distribution: the probability of a number of successes in a sequence of N independent trials with binary outcome, each of which yields success with probability θ. For instance, after observing 3 lime candies in a row, P([lime, lime, lime] | h50) = 0.5^3 = 0.125, because the probability of seeing lime for each observation is 0.5 under this hypothesis.

All-limes: posterior probability of H. P(hi | d) = α P(d | hi) P(hi). Initially, the hypotheses with higher priors dominate (h50, with prior 0.4). As data come in, the finally best hypothesis (h0) starts to dominate, as the probability of seeing this data given the other hypotheses gets increasingly small. After seeing three lime candies in a row, the probability of the all-lime hypothesis already exceeds that of every other hypothesis.

Prediction probability: P(next candy is lime | d) = Σ_i P(next candy is lime | hi) P(hi | d). The probability that the next candy is lime increases with the probability that the bag is an all-lime one.

Overview: full Bayesian learning, MAP learning, maximum likelihood learning, learning Bayesian networks (fully observable, and with hidden (unobservable) variables).

MAP approximation. Full Bayesian learning seems like a very safe bet, but unfortunately it does not work well in practice: summing over the hypothesis space is often intractable (e.g., there are 2^(2^6) = 18,446,744,073,709,551,616 Boolean functions of 6 attributes). A very common approximation is maximum a posteriori (MAP) learning:

Instead of doing prediction by considering all possible hypotheses, as in P(X | d) = Σ_i P(X | hi) P(hi | d), make predictions based on the hypothesis h_MAP that maximizes P(hi | d), i.e., maximize P(d | hi) P(hi); then P(X | d) ≈ P(X | h_MAP).

MAP approximation. MAP is a good approximation when P(X | d) ≈ P(X | h_MAP). In our example, h_MAP is the all-lime bag after only 3 candies, predicting that the next candy will be lime with p = 1. The Bayesian learner gave a prediction of 0.8 — safer after seeing only 3 candies.

Bias. As more data arrive, MAP and Bayesian prediction become closer, as h_MAP's competing hypotheses become less and less likely. It is often easier to find h_MAP (an optimization problem) than to deal with a large summation problem. P(H) plays an important role in both MAP and full Bayesian learning (it defines the learning bias) and is used to define a tradeoff between model complexity and its ability to fit the data.
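The contrast between the full Bayesian prediction and the MAP prediction can be reproduced for the candy bags (priors and cherry fractions as given above; the observation sequence is all limes):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]     # h100, h75, h50, h25, h0
cherry = [1.0, 0.75, 0.5, 0.25, 0.0]   # fraction of cherry per hypothesis

def posterior(n_limes):
    """P(h_i | d) after n_limes lime candies in a row (i.i.d. draws)."""
    w = [p * (1 - c) ** n_limes for p, c in zip(priors, cherry)]
    z = sum(w)
    return [wi / z for wi in w]

def bayes_pred_lime(n_limes):
    """Full Bayesian prediction P(next candy is lime | d)."""
    return sum((1 - c) * q for c, q in zip(cherry, posterior(n_limes)))

def map_pred_lime(n_limes):
    """MAP prediction: use only the hypothesis maximizing the posterior."""
    post = posterior(n_limes)
    return 1 - cherry[post.index(max(post))]

# After 3 limes: MAP picks h0 and predicts lime with probability 1,
# while the Bayesian prediction is still only about 0.8.
```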

More complex models can explain the data better (higher P(d | hi)) — with a danger of overfitting — but they are a priori less likely, because there are more of them than of simpler models (lower P(hi)). I.e., a common learning bias is to penalize complexity.

Overview: full Bayesian learning, MAP learning, maximum likelihood learning, learning Bayesian networks (fully observable, and with hidden (unobservable) variables).

Maximum likelihood (ML) learning. A further simplification over full Bayesian and MAP learning: assume uniform priors over the space of hypotheses. MAP learning (maximize P(d | hi) P(hi)) then reduces to maximizing P(d | hi). When is ML appropriate? [Howard, R.: Decision analysis: Perspectives on inference, decision, and experimentation. Proceedings of the IEEE 58(5), 632-643, 1970]

Maximum likelihood (ML) learning, contd. When is ML appropriate? It is used in statistics as the standard (non-Bayesian) statistical learning method by those who distrust the subjective nature of hypothesis priors; when the competing hypotheses are indeed equally likely (e.g., have the same complexity); and with very large datasets, for which P(d | hi) tends to overcome the influence of P(hi).

Overview: full Bayesian learning, MAP learning, maximum likelihood learning, learning Bayesian networks (fully observable — complete data — and with hidden (unobservable) variables).

Learning BNets: complete data. We start by applying ML to the simplest type of BN learning: the structure is known, the data contain observations for all variables (all variables are observable, no missing data), and the only things we need to learn are the network's parameters.

Maximum likelihood parameter estimation. Assume the structure of a BN is known. Goal: estimate the BN parameters θ, the entries in the CPTs P(X | Parents(X)). A parameterization θ is good if the observed data are likely to be generated by it: maximum likelihood estimation (MLE). Principle: choose θ* such that P(D | θ*) is maximized, assuming identically distributed, independent samples (i.i.d. samples).
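For discrete CPTs, maximizing P(D | θ) reduces to relative frequencies, as the candy-factory derivation below makes explicit. A minimal sketch with invented counts (the numbers are purely illustrative):

```python
# Invented observation counts for illustration: N = 1000 candies.
c, l = 600, 400      # cherry / lime                     (N = c + l)
rc, gc = 480, 120    # cherry with red / green wrapper   (rc + gc = c)
rl, gl = 120, 280    # lime with red / green wrapper     (rl + gl = l)

# Setting the derivatives of the log-likelihood to zero yields:
theta  = c / (c + l)   # MLE of P(Flavor = cherry)        -> 0.6
theta1 = rc / c        # MLE of P(Wrapper = red | cherry) -> 0.8
theta2 = rl / l        # MLE of P(Wrapper = red | lime)   -> 0.3
```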

Application example: candy factory. A manufacturer chooses the color of the candy wrapper with a certain probability depending on the flavor, where the corresponding distribution is not known. If Flavor = cherry, choose a red wrapper with probability θ1; if Flavor = lime, choose a red wrapper with probability θ2. The Bayesian network contains three parameters to be learned: θ, θ1, θ2.

Application example: candy factory.

P(W = green, F = cherry | h_{θ,θ1,θ2}) = P(W = green | F = cherry, h_{θ,θ1,θ2}) · P(F = cherry | h_{θ,θ1,θ2}) = θ (1 − θ1)    (*)

We unwrap N candies; c are cherry and ℓ are lime,

rc cherry with red wrapper, gc cherry with green wrapper, rℓ lime with red wrapper, gℓ lime with green wrapper. Each trial yields a combination of wrapper and flavor as in (*):

P(d | h_{θ,θ1,θ2}) = ∏_j P(dj | h_{θ,θ1,θ2}) = θ^c (1−θ)^ℓ · θ1^rc (1−θ1)^gc · θ2^rℓ (1−θ2)^gℓ

Application example: candy factory. Maximize the logarithm of the objective function:

c log θ + ℓ log(1−θ) + rc log θ1 + gc log(1−θ1) + rℓ log θ2 + gℓ log(1−θ2)

Take the derivatives with respect to θ, θ1, θ2; expressions without the term being differentiated vanish.

Maximum likelihood parameter estimation. Estimation by relative frequencies: θ = c/N, θ1 = rc/c, θ2 = rℓ/ℓ. This process is applicable to every fully observable BN. With complete data and maximum likelihood parameter estimation, parameter learning decomposes (via taking logarithms) into separate learning problems for each parameter (CPT). Each parameter is determined by the relative frequency of a node value given the values of the parent nodes

(this concludes the estimation).

Popular application: the naïve Bayes model. The naïve Bayes model is a very simple Bayesian network for classification: the class variable C (to be predicted) forms the root, and the attribute variables Xi (the observations) are the leaves. It is naïve because it assumes that the attribute values are conditionally independent given the class:

P(C | x1, x2, …, xn) = P(C, x1, x2, …, xn) / P(x1, x2, …, xn) ∝ P(C) ∏_i P(xi | C)

Deterministic predictions can be obtained by choosing the most probable class. Scaling to real data is very good:

only 2n + 1 parameters are needed (for n Boolean attributes).

Overview: full Bayesian learning (BMA), MAP learning, maximum likelihood learning — fully observable (complete data), and with hidden (unobservable) variables.

Parameter learning with hidden variables. So far we have assumed that data are available for every variable of the BN. What do we do if that is not the case, i.e., there are hidden variables in the network? The relative-frequency approach cannot simply be carried over (no counts are known for a hidden variable H).

A supposed solution: avoid hidden variables. That might work for small networks. But what do we

do here? Suppose every variable has 3 values (low, moderate, high); the numbers at the nodes represent how many parameters must be determined for the CPT of that node — 78 probability values in total. Leaving out HeartDisease is not a good solution at all! The symptoms are no longer conditionally independent given the values of their parent nodes, so there are many more edges and many more probabilities to specify: 708 in total. We would need far more data to learn all these values reasonably.

Example: the candy factory once more. Two bags of candy (1 and 2) have been mixed together. Candies are described by three features: Flavor and Wrapper as before, plus Hole (whether there is a hole in the middle). The features of a candy depend, with certain probabilities, on the bag it comes from. For each candy we want to predict which bag

it came from, given the observed features: a naïve Bayes model with θ = P(Bag = 1), θ_Fj = P(Flavor = cherry | Bag = j), θ_Wj = P(Wrapper = red | Bag = j), θ_Hj = P(Hole = yes | Bag = j), for j = 1, 2.

Expectation-Maximization (EM). If we keep the hidden variables and want to learn the network parameters from data, we have a form of unsupervised learning: the data do not provide information on every aspect of the data points. Expectation-Maximization is a general algorithm for learning model parameters from incomplete data; here it is shown for discrete values in the BN context.

EM: general idea. If we had data for all variables in the BN, we could learn the parameters with ML (or MAP) approaches by determining relative frequencies, as discussed before.

If we had the parameters, we could estimate the posterior probabilities of every event, e.g., P(H | A, B, C). [Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977]

EM: general idea. The algorithm starts with "invented" (randomly generated) information to solve the learning problem: invented network parameter values (virtually noted in the CPTs; CPT: conditional probability table). Then the initial values are refined in two steps. Expectation (E): update the data (virtually) with expectations derived from the current model. Maximization (M): given the updated data, update the parameters of the BN using the maximum likelihood (ML) approach. This is the same step as in the fully observable case.

EM: how does it work with naïve Bayes models? Consider the following data: N examples with Boolean attributes X1, X2, X3, X4, to be categorized with the possible values of a class C = {1, 2, 3}. We use a naïve Bayes classifier with hidden variable C (its values are unobserved, '?', in the data).

EM: initialization. The algorithm starts with "invented" (randomly generated) information to solve the learning problem: invented network parameter values (virtually noted in the CPTs), i.e., it defines random parameters.

EM: Expectation step (computing the expected counts). What do we need in order to learn the network parameters with the maximum likelihood approach? For P(C): count(data points with C = i) / count(all data points), for i = 1, 2, 3. For P(Xh | C): count(data with Xh = valk and C = i) / count(data with C = i), for all values valk of Xh and i = 1, 2, 3.

EM: Expectation step (computing the expected counts). So far we only have N = count(data points). We approximate all the other counts with expected counts derived from the model with the "invented" parameters. The expected count N̂(C = i) is the sum,

over all N examples, of the probabilities that each example falls into category i:

N̂(C = i) = Σ_{j=1..N} P(C = i | attributes of example e_j) = Σ_{j=1..N} P(C = i | x1j, x2j, x3j, x4j)

EM: Expectation step (computing the expected counts). How do we obtain the necessary values from the model? Easily, with the naïve Bayesian network:

P(C = i | x1j, x2j, x3j, x4j) = P(C = i, x1j, x2j, x3j, x4j) / P(x1j, x2j, x3j, x4j) = P(x1j | C = i) ⋯ P(x4j | C = i) P(C = i) / P(x1j, x2j, x3j, x4j)

The denominator can also be determined directly from the network (exercise), using the "invented" parameters of the network.

EM: Expectation step (computing the expected counts). By a similar step, we obtain the expected counts for examples with attribute Xh = valk and category i.

These are needed later for the estimation of P(Xh | C):

P(Xh = valk | C = i) = Exp.Counts(examples with Xh = valk and C = i) / Exp.Counts(examples with C = i) = N̂(Xh = valk, C = i) / N̂(C = i)

for all values valk of Xh and i = 1, 2, 3. For example:

N̂(X1 = t, C = 1) = Σ_{e_j with X1 = t} P(C = 1 | x1j = t, x2j, x3j, x4j)

Again, we obtain these probabilities from the current model.

EM: general idea (recap). The algorithm starts with "invented" (randomly generated) network parameter values (virtually noted in the CPTs), then refines them in two steps. Expectation (E): update the data (virtually) with expectations derived from the current model. Maximization (M): given the updated data, update the parameters of the BN using the maximum likelihood (ML) approach — the same step as in the fully observable case.

Maximization step (refining the parameters). Now we refine the network parameters by maximum likelihood learning on the expected counts:

P(C = i) = N̂(C = i) / N
P(Xj = valk | C = i) = N̂(Xj = valk ∧ C = i) / N̂(C = i)

for all values valk of Xj and i = 1, 2, 3.

EM cycle. Now the E-step can be repeated: expected counts (augmented data) ⇄ probabilities.
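One full E/M cycle for a small naïve Bayes model with a hidden class can be sketched as follows. For brevity, this sketch uses two classes and two Boolean attributes instead of three classes and four attributes; the data points and initial parameters are invented for illustration:

```python
# Observed attribute values; the class C of each example is hidden.
data = [(True, True), (True, False), (False, True), (False, False), (True, True)]
p_c = [0.5, 0.5]                 # "invented" initial P(C = i)
p_x = [[0.7, 0.6], [0.3, 0.4]]   # p_x[i][h] = initial P(X_h = true | C = i)

def posterior(x):
    """E-step helper: P(C = i | x) under the current parameters."""
    w = []
    for i in range(2):
        s = p_c[i]
        for h, xh in enumerate(x):
            s *= p_x[i][h] if xh else 1 - p_x[i][h]
        w.append(s)
    z = sum(w)
    return [wi / z for wi in w]

# E-step: expected counts N(C=i) and N(X_h = true, C = i).
n_c = [0.0, 0.0]
n_xc = [[0.0, 0.0], [0.0, 0.0]]
for x in data:
    post = posterior(x)
    for i in range(2):
        n_c[i] += post[i]
        for h, xh in enumerate(x):
            if xh:
                n_xc[i][h] += post[i]

# M-step: relative frequencies on the expected counts.
p_c = [n_c[i] / len(data) for i in range(2)]
p_x = [[n_xc[i][h] / n_c[i] for h in range(2)] for i in range(2)]
```

Iterating these two steps until the parameters stabilize is exactly the EM cycle described above.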

Example: the candy factory once more. Two bags of candy (1 and 2) have been mixed together. Candies are described by three features: Flavor and Wrapper as before, plus Hole (whether there is a hole in the middle). The features of a candy depend, with certain probabilities, on the bag it comes from. For each candy we want to predict which bag it came from, given the observed features: a naïve Bayes model with θ = P(Bag = 1), θ_Fj = P(Flavor = cherry | Bag = j), θ_Wj = P(Wrapper = red | Bag = j), θ_Hj = P(Hole = yes | Bag = j), for j = 1, 2.

Data. Assume the true parameters are: θ = 0.5; θ_F1 = θ_W1 = θ_H1 = 0.8; θ_F2 = θ_W2 = θ_H2 = 0.3. Assumption: counts are "sampled" from P(C, F, W, H) with N = 1000, giving us a dataset. We now want to reconstruct the true parameters from these

data using EM.

EM: initialization. Assign random initial values to the parameters. To simplify the presentation, we use here:

θ(0) = 0.6;  θ_F1(0) = θ_W1(0) = θ_H1(0) = 0.6;  θ_F2(0) = θ_W2(0) = θ_H2(0) = 0.4

Now we run one EM cycle to compute θ(1).

E-step. First we need the expected count of candies from bag 1: add up the probabilities that each of the N data points comes from bag 1. Let flavor_j, wrapper_j, hole_j be the values of the corresponding attributes for the j-th data point:

N̂(Bag = 1) = Σ_{j=1..N} P(Bag = 1 | flavor_j, wrapper_j, hole_j)
            = Σ_{j=1..N} P(flavor_j, wrapper_j, hole_j | Bag = 1) P(Bag = 1) / P(flavor_j, wrapper_j, hole_j)
            = Σ_{j=1..N} P(flavor_j | Bag = 1) P(wrapper_j | Bag = 1) P(hole_j | Bag = 1) P(Bag = 1) / Σ_i P(flavor_j | Bag = i) P(wrapper_j | Bag = i) P(hole_j | Bag = i) P(Bag = i)

E-step. The summation can be broken into the 8 candy groups from the table. For example, the sum over the 273 cherry candies with a red wrapper and a hole (the first entry in the table) yields

273 · θ_F1(0) θ_W1(0) θ_H1(0) θ(0) / (θ_F1(0) θ_W1(0) θ_H1(0) θ(0) + θ_F2(0) θ_W2(0) θ_H2(0) (1 − θ(0))) = 273 · 0.6^4 / (0.6^4 + 0.4^4) = 273 · 0.1296 / 0.1552 ≈ 227.97

M-step. Computing the other 7 candy groups likewise, we obtain N̂(Bag = 1) = 612.4. Now we run the M-step to refine θ, using the expected count of data points from bag 1: θ(1) = N̂(Bag = 1) / N = 0.6124.

Another parameter. We do the same for the parameter θ_F1. E-step: determine the expected count of cherry candies from bag 1:

N̂(Bag = 1 ∧ Flavor = cherry)

= Σ_{j: flavor_j = cherry} P(Bag = 1 | flavor_j = cherry, wrapper_j, hole_j)

computable from the naïve Bayes model as discussed before (exercise). M-step: refine θ_F1 by computing the relative frequency θ_F1(1) = N̂(Bag = 1 ∧ Flavor = cherry) / N̂(Bag = 1).

Learning result. After one full cycle over all parameters we have: θ(1) = 0.6124; θ_F1(1) = 0.6684, θ_W1(1) = 0.6483, θ_H1(1) = 0.658; θ_F2(1) = 0.3887, θ_W2(1) = 0.3817, θ_H2(1) = 0.3827. For every set of parameters we can determine the log-likelihood, as before. One can show that this value increases with every EM iteration (convergence). EM with maximum likelihood can, however, get stuck in local maxima (if necessary, use several starting points, take priors into account, or use MAP).
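The E-step contribution of the first candy group and the subsequent M-step update can be checked numerically. N̂(Bag = 1) = 612.4 is taken from the slide, since the full table of the eight group counts is not reproduced here:

```python
theta0 = 0.6                 # initial P(Bag = 1)
f1 = w1 = h1 = 0.6           # initial bag-1 parameters
f2 = w2 = h2 = 0.4           # initial bag-2 parameters

# Posterior weight of Bag = 1 for a cherry candy with red wrapper and a hole:
num = f1 * w1 * h1 * theta0                  # = 0.6**4 = 0.1296
den = num + f2 * w2 * h2 * (1 - theta0)      # = 0.1296 + 0.4**4 = 0.1552
contribution = 273 * num / den               # expected count from this group

# Summing the analogous terms over all 8 groups gives 612.4 (from the slide);
# the M-step then sets theta(1) = N(Bag = 1) / N:
theta_next = 612.4 / 1000                    # = 0.6124
```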

Learning result, contd. For each set of parameters we can determine the log-likelihood:

P(d | h_{θ(i), θ_F1(i), θ_W1(i), θ_H1(i), θ_F2(i), θ_W2(i), θ_H2(i)}) = ∏_{j=1..1000} P(d_j | h_{θ(i), …})

One can show that the log-likelihood increases in every iteration, so that the initial likelihood is already surpassed after three iterations. [Figure: log-likelihood L(d) vs. iteration number, rising from about −2025 toward −1975 within roughly 120 iterations.]

EM: discussion. For more complex BNs the algorithm is the same. In general, we must compute the CPT entries θ_ijk for every variable Xi given its parent values Pai:

θ_ijk = P(Xi = xij | Pai = paik) = N̂(Xi = xij; Pai = paik) / N̂(Pai = paik)

The expected counts are computed by summing over the examples, after all the necessary P(Xi = xij, Pai = paik) have been determined with BN inference algorithms. The procedure can lead to a combinatorial …

Learning Bayesian network structures. Given a training set

D = {x[1], …, x[M]}, find the model that best matches D: model selection plus parameter estimation. [Figure: an Inducer takes the data D — complete instantiations (E[m], B[m], A[m], C[m]) for m = 1..M — and outputs a network structure over E, B, A, C together with its parameters.]

Some of the following slides are from an AI course on graphical models by Burgard/De Raedt/Kersting/Nebel.

Model selection. Goal: select the best network structure, given the data. Input: training data and a scoring function. Output: a network that maximizes the score.

Structure selection: scoring. Bayesian approach: put a prior over parameters and structure; a balance between model complexity and fit to the data is obtained as a byproduct:

Score_D(G) = log P(G | D) = log [P(D | G) P(G)] (up to a constant)

Here P(D | G) asks: does G explain D well, can we learn G's parameters from D with ML? And P(G) is a prior, e.g., with respect to MDL. The marginal likelihood just comes from our

parameter estimates. The prior on structure can be any measure we want; typically it is a function of the network complexity (MDL principle). The same key property holds: decomposability — Score(structure) = Σ_i Score(substructure of Xi).

Heuristic search. [Figure: local search over structures on B, E, A, C with operators such as adding an edge (score: add C → E), deleting an edge (score: delete E → A), and reversing an edge (score: reverse E → A), scoring each resulting candidate network.]

That's fine. But what do we do if we have incomplete data? We cannot use decomposability.

Local search in practice. Perform EM for each candidate graph: a parametric optimization (EM) in parameter space, reaching a local maximum for each candidate structure G1, G2, G3, G4, …, Gn.

Local search in practice, contd. Performing EM for each candidate graph is computationally expensive: parameter optimization via EM is non-trivial, we need to perform EM for all candidate structures, and we spend time even on poor candidates. In practice, only a few candidates can be considered.

Structural EM [Friedman et al. 98]. Recall that with complete data we had

decomposition, and hence efficient search. Idea: instead of optimizing the real score, find a decomposable alternative score such that maximizing the new score yields an improvement in the real score.

Structural EM. Idea: use the current model to help evaluate new structures. Outline: perform search in (structure, parameters) space; at each iteration, use the current model to find either better-scoring parameters (a parametric EM step) or a better-scoring structure (a structural EM step). [Nir Friedman: The Bayesian structural EM algorithm. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI'98), 129-138, 1998]

The score for structure G and parameterization θ,

given data over observed RVs O ScoreO (G, ) = log P(O:G,) Pen(G, ,O) Handle hiddens H (non-observed) by conditionalizing (like in full bayesian learning) using current structure G* and parameterization * Q(G, :G*, *) = Eh~P(h|O:G*, *)[logP(O,h:G, )] Pen(G, Alternating model selecetion EM Choose G0 and 0 randomly Loop for n= 0,1, until convergence Find Gn+1, n+1 s.t. Q(Gn+1, n+1: Gn, n) > Q(Gn, n: Gn, n) ,O) 96 Structural EM Reiterate ComputationExpected Counts X1 X2 X3 H Y1

Y2 Y3 Training Data N(X1) N(X2) N(X3) N(H, X1, X1, X3) N(Y1, H) N(Y2, H) N(Y3, H) N(X2,X1) N(H, X1, X3) N(Y1, X2) N(Y2, Y1, H) Score & Parameterize X1 X2 X3

H Y1 X1 Y2 X2 Y3 X3 H Y1 Y2 Y3 Structure Learning: incomplete data Data E B A Current model

Expectation Inference: P(S|X=0,D=1,C=0,B=1) Maximization Parameters EM-algorithm: iterate until convergence S X D C B <1 1 ? 0 1> <0 0 0 ? ? > Expected counts S 1 1 0 1

X D C 0 1 0 1 1 0 0 0 0 0 0 0 .. B 1 1 0 1 Structure Learning: incomplete data B E Data E B A A Current model

S X D C B <1 1 ? 0 1> <0 0 0 ? ? > Expected Expectation Inference: P(S|X=0,D=1,C=0,B=1) counts Maximization Parameters S 1 1 0 1 Maximization Structure SEM-algorithm:

iterate until convergence B E X D C 0 1 0 1 1 0 0 0 0 0 0 0 .. E E A A B 1 1 0 1 B B
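The E-step/M-step loop above can be sketched on a minimal two-node network A → B where some B values are missing; the toy data and variable names below are ours, not from the slides.

```python
# Minimal parametric EM for a two-node network A -> B with missing B values.
# E-step: replace each missing B by its posterior under the current CPT
# (fractional expected counts); M-step: re-estimate the CPT from those counts.
data = [(1, 1), (1, 1), (1, None), (0, 0), (0, None), (1, 0)]  # None = missing B

p_b_given_a = {0: 0.5, 1: 0.5}  # current estimate of P(B=1 | A=a)

for _ in range(50):  # a fixed iteration count stands in for a convergence test
    counts = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # counts[a] = [N(a, B=0), N(a, B=1)]
    for a, b in data:
        if b is None:  # E-step: expected counts via inference P(B | A=a)
            counts[a][1] += p_b_given_a[a]
            counts[a][0] += 1 - p_b_given_a[a]
        else:          # observed value contributes a whole count
            counts[a][b] += 1
    # M-step: maximum-likelihood re-estimation from the expected counts
    p_b_given_a = {a: c[1] / (c[0] + c[1]) for a, c in counts.items()}
```

Structural EM wraps this same loop inside a structure search: the expected counts computed in the E-step are reused to score and parameterize candidate structures, restoring decomposability.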

Variations on a theme
Known structure, fully observable: only need to do parameter estimation
Known structure, hidden variables: use expectation maximization (EM) to estimate parameters
Unknown structure, fully observable: heuristic search through structure space, plus parameter estimation
Unknown structure, hidden variables: structural EM 100

Big Datasets Destroy Simple Abstractions
SELECT * FROM RAWDATA
[Figure: sensor/RFID streams (+ metadata, floor plans, …) are loaded into raw data tables of a relational DBMS; input and output files contain readings such as (time=10am, id=1, temp=20), (10am, 2, 21), …, (10am, 7, 29)]

HeisenData -- Towards a Next-Generation Uncertain Data Management System
Project Funding: FP7 Marie-Curie International Reintegration Grants (FP7-PEOPLE-2009-RG, Reference No. 249217)
Project Duration: 1/3/2010 - 28/2/2014
Fellow & Scientist-in-Charge: Prof. Minos Garofalakis

Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models
Daisy Zhe Wang+, Eirinaios Michelakis+, Minos Garofalakis*+, Joseph M. Hellerstein+
University of California, Berkeley+; Yahoo! Research*
25th August 2008, VLDB 101

Minos Garofalakis, Projekt HeisenData

BayesStore Data Model
1. Incomplete Relation Rp
2. Distribution over Possible Worlds F
Sensor1(Time(T), Room(R), Sid, Temperature(Tp)p, Light(L)p)
Incomplete relation Sensor1p (missing values are the random variables X1, …, X7):

t1: T=1 R=1 Sid=1 Tp=Hot  L=X1
t2: T=1 R=1 Sid=2 Tp=Cold L=Drk
t3: T=1 R=1 Sid=3 Tp=X2   L=X3
t4: T=1 R=2 Sid=1 Tp=X4   L=Brt
t5: T=1 R=2 Sid=2 Tp=Hot  L=X5
t6: T=1 R=2 Sid=3 Tp=X6   L=X7

F = Pr[X1, …, X7] is the probabilistic distribution of Sensor1p
N: number of missing values, |X|: size of the domain, |F| = O(|X|^N)

Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein. "BAYESSTORE: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models", Proceedings of VLDB 2008 (PVLDB, Vol. 1), 2008. http://heisendata.softnet.tuc.gr

The Skyscrapers Example
For all sensors in all rooms at all timestamps, Light and Temperature readings are correlated.

Definitions
Stripe: A family of random variables from the same probabilistic attribute.
First-order Factor: A family of local models, which share the same structure and conditional probability table (CPT).
BayesStore Data Type: The input and output abstract data type of queries in BayesStore, which consists of data and model.

F as a First-order Bayesian Network
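The blow-up |F| = O(|X|^N) can be made concrete by enumerating the possible worlds of a tiny incomplete relation; the four-tuple toy relation below is our own illustration, not the Sensor1 data.

```python
from itertools import product

# Toy incomplete relation in the spirit of Sensor1p: None marks a missing
# value, i.e., a random variable X_i ranging over the attribute domain.
domain = ["Hot", "Cold"]                                   # |X| = 2
rows = [("t1", "Hot"), ("t2", None), ("t3", None), ("t4", "Cold")]

missing = [i for i, (_, tp) in enumerate(rows) if tp is None]  # N = 2

# One possible world per assignment of the missing values: |X|**N worlds.
worlds = []
for assignment in product(domain, repeat=len(missing)):
    world = [list(r) for r in rows]
    for idx, val in zip(missing, assignment):
        world[idx][1] = val
    worlds.append(world)
```

Already two missing Boolean-like values yield 2**2 = 4 worlds; materializing F explicitly is hopeless for realistic N, which is why BayesStore represents F compactly as a (first-order) Bayesian network instead.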

[Table: Sensor1p over two timestamps (T=1, 2), rooms R=1, 2, and Sid=1, 2, 3; tuples t1, …, t8; missing Tp and L values are the random variables X1, …, X14]
Stripes (FO variables) group these random variables, e.g.:
All Tp values in Sensor1p with Sid=1
All Tp values in Sensor1p with Sid=2
All Tp values in Sensor1p with Sid != 2
All Tp values in Sensor1p
All L values in Sensor1p
http://heisendata.softnet.tuc.gr

F as a First-order Bayesian Model
Mapping between stripes
[Figure: first-order factors link stripes, e.g., "All Tp values" → "All L values" and "All Tp values with Sid=1" → "All Tp values with Sid=2"]

First-order factors: each family of local models shares one CPT, e.g.:

P(L | Tp):
Tp    L    p
Cold  Brt  0.1
Hot   Brt  0.9
Hot   Drk  0.1
Cold  Drk  0.9

P(Tp2 | Tp1):
Tp1   Tp2  p
Cold  Cold 0.1
Cold  Hot  0.9
Hot   Hot  0.1
Hot   Cold 0.9

P(Tp):
Tp    p
Cold  0.6
Hot   0.4

Query Semantics
[Figure: (I) queries over the incomplete relation and model represent (II) relational and inference queries over the possible worlds and their distribution; the results (III) in turn represent (IV) the resulting possible worlds and distribution]

Query Algebra
Relational Queries
ML Inference Queries
Full Distribution Queries
http://heisendata.softnet.tuc.gr
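A first-order factor, as defined above, stores a single CPT and applies it to every member of a stripe; the sketch below (function name and toy stripe are ours, CPT values from the P(L | Tp) table above) evaluates the shared table over all fully observed (Tp, L) pairs of a stripe.

```python
# One shared CPT for the first-order factor P(L | Tp); the same table is
# used for every (Tp, L) pair in the stripe instead of one CPT per tuple.
cpt_l_given_tp = {
    ("Cold", "Brt"): 0.1, ("Hot", "Brt"): 0.9,
    ("Hot", "Drk"): 0.1, ("Cold", "Drk"): 0.9,
}

# Stripe: all fully observed (Tp, L) readings of a toy relation.
stripe = [("Hot", "Brt"), ("Cold", "Drk"), ("Hot", "Brt")]

def stripe_probability(stripe, cpt):
    """Product of P(L | Tp) over all variables in the stripe."""
    p = 1.0
    for tp, light in stripe:
        p *= cpt[(tp, light)]
    return p

p = stripe_probability(stripe, cpt_l_given_tp)  # 0.9 * 0.9 * 0.9
```

Sharing one CPT across a stripe is exactly what keeps the first-order model small even when the relation contains thousands of tuples.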

Selection
Selection over Incomplete Relation Rp
Selection over Model MFOBN
Example: σ(Tp=Cold) over Sensor1p

Sensor1p:
t1: T=1 R=1 Sid=1 Tp=Hot  L=X1
t2: T=1 R=1 Sid=2 Tp=Cold L=Drk
t3: T=1 R=1 Sid=3 Tp=X2   L=X3
t4: T=1 R=2 Sid=1 Tp=X4   L=Brt
t5: T=1 R=2 Sid=2 Tp=Hot  L=X5
t6: T=1 R=2 Sid=3 Tp=X6   L=X7

Selection over the incomplete relation keeps every tuple that may satisfy Tp=Cold, i.e., whose Tp value is Cold or still missing:

σ(Tp=Cold)(Sensor1p):
t2: T=1 R=1 Sid=2 Tp=Cold L=Drk
t3: T=1 R=1 Sid=3 Tp=X2   L=X3
t4: T=1 R=2 Sid=1 Tp=X4   L=Brt
t6: T=1 R=2 Sid=3 Tp=X6   L=X7

Tuple Correlation Graph (TCG) for FFOBN(Sensor1): compute the transitive closure over the TCG, so that tuples correlated with the selected ones are retained as well.
http://heisendata.softnet.tuc.gr

Selection over the model MFOBN conditions the first-order Bayesian network on the predicate: FFOBN | Tp=Cold
First-order factors of FFOBN:
P(L | Tp): Cold/Brt 0.1, Hot/Brt 0.9, Hot/Drk 0.1, Cold/Drk 0.9
P(Tp2 | Tp1): Cold/Cold 0.9, Cold/Hot 0.1, Hot/Hot 0.9, Hot/Cold 0.1
P(Tp): Cold 0.6, Hot 0.4

The result schema carries an existence attribute: Sensor1(T, R, Sid, Tpp, Lp, Exist(E)p), with pr(E=1) = 1 iff Tp=Cold and pr(E=0) = 1 iff Tp=Hot
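The relational half of this selection semantics can be sketched as follows; the tuple encoding with None for missing values and the function name are our own, not BayesStore's API.

```python
# Each tuple: (tid, T, R, Sid, Tp, L); None encodes a missing value X_i.
sensor1 = [
    ("t1", 1, 1, 1, "Hot", None),
    ("t2", 1, 1, 2, "Cold", "Drk"),
    ("t3", 1, 1, 3, None, None),
    ("t4", 1, 2, 1, None, "Brt"),
    ("t5", 1, 2, 2, "Hot", None),
    ("t6", 1, 2, 3, None, None),
]

def select_tp(rel, value):
    """sigma(Tp=value) over the incomplete relation: a tuple survives
    if its Tp either equals the value or is missing (may still match)."""
    return [t for t in rel if t[4] == value or t[4] is None]

result = select_tp(sensor1, "Cold")
# t2 (Tp=Cold) plus t3, t4, t6 (Tp missing) survive; t1 and t5 are dropped.
```

The model half (conditioning FFOBN on Tp=Cold and introducing the Existp attribute) then assigns each surviving tuple its probability of actually being in the answer.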

Project & Join
Project
Project over Incomplete Relation: keep the projected attributes and the attributes correlated with them
Project over Model: retrieve only the part of the model relevant to the projected attributes
Join
Join over Incomplete Relations with a deterministic join condition (e.g., Sensor1.Sid = Sensor2.Sid)
Join over Models by merging the local models for the Existp attribute
Probabilistic selection with a probabilistic join condition (e.g., Sensor1.Lightp = Sensor2.Lightp)
http://heisendata.softnet.tuc.gr

Non-Standard-Datenbanken
Bayesian Networks, Inference, Learning, Relational Extension
Joint probabilities, Bayes' rule
Bayesian networks, query answering
Learning methods: BME, MAP, ML, EM, structural EM
Extension to relational data models
Outlook: probabilistic relational modeling is covered in depth in the lecture Web-Mining-Agents of the master's programme 119
