causal inference: an introduction and some results Alex Dimakis UT Austin joint work with Murat Kocaoglu, Karthik Shanmugam Sriram Vishwanath, Babak Hassibi Overview Discovering causal directions Part 1: Interventions and how to design them Chordal graphs and combinatorics Part 2: If you cannot intervene: entropic causality A theorem of identifiability A practical algorithm for Shannon entropy causal inference

Good empirical performance on standard benchmark Many open problems Disclaimer There are many frameworks of causality For time-series: Granger causality Potential Outcomes / CounterFactuals framework (Imbens & Rubin) Pearls structural equation models with independent errors Additive models, Dawids decision-oriented approach, Information Geometry, many others Overview Discovering causal directions Part 1: Interventions and how to design them Chordal graphs and combinatorics Part 2: A new model: entropic causality A theorem of identifiability A practical algorithm for Shannon entropy causal inference

Good empirical performance on standard benchmark Many open problems Smoking causes cancer Joint pdf Observational data S: Heavy smoker 0 C: Lung cancer before 60 0 1 1 0 1 1 . 1 .

S=0 S=1 C=0 30/100 10/100 C=1 20/100 40/100 Causality= mechanism S C Pr(S,C)

Causality= mechanism S=0 S=1 S=1 C=0 30/50 10/50 C=1 20/50 40/50 0.5 Pr(S) S S=0

0.5 C Pr(C/S) Pr(S,C) Universe 1 S=0 S=1 S=1 C=0 30/50 10/50 C=1 20/50

40/50 0.5 Pr(S) S S=0 0.5 C Pr(C/S) Pr(S,C) C=F(S,E C=F(S,E )) EE SS Universe 2

S C Universe 2 C=0 C=1 C=1 S=0 30/(100*0.4) = 0.75 20/(100*0.6) = 0.33 S=1 10/(100*0.4) = 0.25 40/(100*0.6) = 0.66 0.6 Pr(C)

S C=0 0.4 C Pr(S/C) Pr(S,C) S=F(C,E S=F(C,E )) EE CC How to find the causal direction? Pr(S,C) Pr(S) Pr(C/S) S

C=F(S,E) ES C How to find the causal direction? Pr(S,C) Pr(S) Pr(C/S) S C=F(S,E) ES C Pr(C) Pr(S/C) S S=F(C,E ) E S C

How to find the causal direction? It is impossible to find the true causal direction from observational data for two random variables. Pr(S,C) (Unless we make more assumptions) You need interventions, i.e. messing with the mechanism. For more than two r.v.s there is a rich theory and some directions can be learned without interventions. (Spirtes et al.) Pr(S) Pr(C/S) Pr(C) Pr(S/C) S C=F(S,E) ES C S S=F(C,E ) E S

C Overview Discovering causal directions Part 1: Interventions and how to design them Chordal graphs and combinatorics Part 2: A new model: entropic causality A theorem of identifiability A practical algorithm for Shannon entropy causal inference Good empirical performance on standard benchmark Many open problems Intervention: force people to smoke S=0 S=1 S=0

S=1 C=0 30/50 10/50 C=1 20/50 40/50 0.5 0.5 Pr(S) Pr(C/S) Flip coin and force each person to smoke or not, with prob . S

C In Universe1 (i.e. Under SC) , new joint pdf stays same as before intervention. Intervention: force people to smoke C=0 C=1 C=0 C=1 S=0 30/(100*0.4) = 0.75 20/(100*0.6) = 0.33 S=1 10/(100*0.4) = 0.25 40/(100*0.6) = 0.66

0.4 0.6 Pr(C) Pr(S/C) Flip coin and force each person to smoke or not, with prob . S C In Universe 2 (Under CS) S, C will become independent after intervention. Intervention: force people to smoke C=0 C=1 C=0 C=1

S=0 30/(100*0.4) = 0.75 20/(100*0.6) = 0.33 S=1 10/(100*0.4) = 0.25 40/(100*0.6) = 0.66 0.4 0.6 Pr(C) Pr(S/C) Flip coin and force each person to smoke or not, with prob . S C

In Universe 2 (Under CS) S, C will become independent after intervention. So check correlation on data after intervention and find true direction! More variables S4 S1 S5 S3 S2 S7 True Causal DAG S6 More variables S4 S1

S5 S3 S2 S7 True Causal DAG S6 From observational Data we can learn Conditional independencies. Obtain Skeleton (lose directions) More variables S4 S1 S3 S2 S4

S5 S7 True Causal DAG S1 S6 From observational Data we can learn Conditional independencies. Obtain Skeleton (lose directions) S5 S3 S2 Skeleton S7

S6 PC Algorithm (Spirtes et al. Meek) S4 S1 S5 S3 S2 Skeleton There are a few directions we can learn from observational Data (Immoralities, Meek Rules) S7 S6 Spirtes, Glymour, Scheines 2001, PC Algorithm C. Meek , 1995. Andersson, Madigan, Perlman, 1997

PC Algorithm (Spirtes et al. Meek) S4 S1 S5 S3 S2 Skeleton There are a few directions we can learn from observational Data (Immoralities, Meek Rules) S7 S6 Spirtes, Glymour, Scheines 2001, PC Algorithm C. Meek , 1995. Andersson, Madigan, Perlman, 1997 How interventions reveal directions

S4 S1 S5 S3 S2 Intervened Set S ={S1,S2,S4} We choose a subset of the variables S and Intervene (i.e. force random values ) S7 S6 How interventions reveal directions S4 S1 S5 S3

S2 Intervened Set S ={S1,S2,S4} We choose a subset of the variables S and Intervene (i.e. force random values ) S7 Directions of edges from S to Sc are revealed to me. S6 How interventions reveal directions S4 S1 S5 S3 S2 Intervened Set S ={S1,S2,S4}

We choose a subset of the variables S and Intervene (i.e. force random values ) S7 Directions of edges from S to Sc are revealed to me. S6 Re-apply PC Algorithm+Meek rules to learn a few more edges possibly Learning Causal DAGs S4 S1 S5 S3 S2 Skeleton Given a skeleton graph, how many interventions

are needed to learn all directions ? S7 A-priori fixed set of interventions (nonAdaptive) S6 Learning Causal DAGs S4 S1 Given a skeleton graph, how many interventions are needed to learn all directions ? S5 S3 A-priori fixed set of interventions (nonAdaptive) S6 Adaptive Randomized Adaptive

S2 Skeleton S7 Learning Causal DAGs S4 S1 S5 S3 S2 Skeleton Given a skeleton graph, how many interventions are needed to learn all directions ? S7 S6 A-priori fixed set of interventions (nonAdaptive) Theorem (Hauser & Buhlmann 2014): Log() interventions suffice ) interventions suffice

() interventions suffice = chromatic number of skeleton) Learning Causal DAGs S4 S1 S5 S3 S2 Skeleton Thm: Log() interventions suffice ) interventions suffice Proof: 1.Color the vertices. (legal coloring) S7 S6 Learning Causal DAGs S4 S1 S5

S3 S2 Skeleton Thm: Log() interventions suffice ) interventions suffice Proof: 1.Color the vertices. S7 S6 2. Form table with binary representations of colors Red: 00 Green: 0 1 Blue: 1 0 S1 0 0

S2 0 1 S3 1 0 S4 0 1 S5 1 0

S6 0 1 S7 1 0 Learning Causal DAGs S4 S1 S5 S3 S2 Skeleton Thm: Log() interventions suffice ) interventions suffice Proof: 1.Color the vertices.

S7 S6 2. Form table with binary representations of colors Red: 00 Green: 0 1 Blue: 1 0 3. Each intervention is indexed by a column of this table. S1 0 0 S2 0

1 S3 1 0 S4 0 1 S5 1 0 S6 0 1

S7 1 0 Intervention 1 Learning Causal DAGs S4 S1 S5 S3 S2 Thm: Log() interventions suffice ) interventions suffice Proof: 1.Color the vertices. S6 S7

For any edge, its two vertices have different colors. Their binary reps are different in 1 bit. So for some intervention, one is in set and other is not. So I will learn its direction. . 2. Form table with binary representations of colors Red: 00 Green: 0 1 Blue: 1 0 3. Each intervention is indexed by a column of this table. S1 0 0 S2

0 1 S3 1 0 S4 0 1 S5 1 0 S6 0

1 S7 1 0 Intervention 1 Learning Causal DAGs S4 S1 S5 S3 S2 Skeleton Given a skeleton graph, how many interventions are needed to learn all directions ? S7

S6 A-priori fixed set of interventions (nonAdaptive) Log() interventions suffice ) Adaptive (NIPS15): Adaptive cannot improve for all graphs. Randomized Adaptive (Li,Vetta, NIPS14): loglog(n) interventions with high probability suffice for complete skeleton. Major problem: Size of interventions We choose a subset of the variables S and Intervene (i.e. force random values ) S4 S1 S5 S3

S2 Intervened Set S ={S1,S2,S4} S7 We need exponentially many samples in the size of the intervention set S. S6 Question: If each intervention has size up to k, how many interventions do we need ? Eberhardt: A separating system on ) interventions suffice elements with weight k is sufficient to produce a non-adaptive causal inference algorithm A separating system on n elements with weight k is a {0,1} matrix with n distinct columns and each row having weight at most k. Reyni, Kantona, Wegener: (n,k) separating systems have size Major problem: Size of interventions Open problem: Is a separating system

S4 S1 S5 S3 S2 Intervened Set S ={S1,S2,S4} necessary or can adaptive algorithms do better ? S7 S6 (NIPS15): For complete graph skeletons, separating systems are necessary. Even for adaptive algorithms. We can use lower bounds on size of separating systems to get lower bounds on the number of interventions. Randomized adaptive: loglogn interventions Our result: n/k loglog k interventions suffice ,

each of size up to k. A good algorithm for general graphs Overview Discovering causal directions Part 1: Interventions and how to design them Chordal graphs and combinatorics Part 2: A new model: entropic causality A theorem of identifiability A practical algorithm for Shannon entropy causal inference Good empirical performance on standard benchmark Many open problems Data-driven causality How to find causal direction without interventions Impossible for two variables. Possible under assumptions. Popular assumption Y= F(X) + E, (E X)

(Additive models)(Shimizu et al., Hoyer et al., Peters et al. Chen et al., Mooij et al.) This work: Use information theory for general data-driven causality. Y= F(X,E), (E X) (related work: Janzing, Mooij, Zhang, Lemeire: not additive assumption but no noise. Y=F(X) ) Entropic Causality Given data Xi,Yi. Search over explanations assuming XY Y= F(X,E) , (E X) Simplest explanation: One that minimizes H(E). Search in the other direction, assuming YX X= F(Y,E) , (E Y) If H(E) << H(E) decide YX If H(E) <

S C S= F(C,E) , (E C) H(E) big Entropic Causality in pictures S C You You may may be be thinking thinking that is that min min H(E) H(E)S is like like

minimizing minimizing H(C/S). H(C/S). C But But itit is is fundamentally fundamentally different different (well (well prove prove its its NP-hard NP-hard to to compute) compute) C= F(S,E) , (E S) H(E) small S= F(C,E) , (E C) H(E) big

Question 1: Identifiability? If data is generated from XY , i.e. Y= f(X,E), (E X) and H(E) is small. Is it true that all possible reverse explanations X= f(Y,E) , (E Y) must have H(E) big, for all f,E ? Theorem 1: If X,E,f are generic, then identifiability holds for H0 (support of distribution of E must be large). Conjecture 1: Same result holds for H1 (Shannon entropy). Question 2: How to find simplest explanation? Minimum entropy coupling problem: Given some marginal distributions U1,U2, .. Un , find the joint distribution that has these as marginals and has minimal entropy. (NP-Hard, Kovacevic et al. 2012). Theorem 2: Finding the simplest data explanation f,E, is equivalent to solving the minimum entropy coupling problem. How to use: We propose a greedy algorithm that empirically performs reasonably well Proof idea Consider Y = f(X, E). (X,Y over n sized alphabet.)

pi,j =P(Y = i|X=j) = P(f(X,E) = i | X = j) = P( fj(E) = i ) since X e1 e2 e3 e4 e5 e6 . . . em Distribution of E f1 p1,1 p2,1 p3,1 . .

. pn,1 Distribution of Y conditioned on X = 1 E Each conditional probability is a subset sum of distribution of E Si,j: index set for pi,j Performance on Tubingen dataset Accuracy vs. Decision Percentage 100 Entropy-based Causal Inference 68% Confidence Interval 95% Condifence Interval 90

Decision rate: Fraction of pairs that algorithm makes a decision. 80 Decision made when |H(X,E)-H(Y,E)|> t (t determines the decision rate) Accuracy, % 70 60 50 Confidence intervals based on number of datapoints 40 30

20 10 0 20 30 40 50 60 70 Decision Rate, % 80 90 100 Slightly better than ANMs

Conclusions 1 Learning causal graphs with interventions is a fun graph theory problem The landscape when the sizes of interventions is bounded is quite open, especially for general graphs. Good combinatorial algorithms with provable guarantees? Conclusions 2 Introduced a new framework for data-driven causality for two variables Established Identifiability for generic distributions for H0 entropy. Conjectured it holds for Shannon entropy. Inspired by Occams razor. Natural and different from prior works. Natural for categorical variables (Additive models do not work there) Proposed practical greedy algorithm using Shannon entropy. Empirically performs very well for artificial and real causal datasets. fin Existing Theory: Additive Noise Models Assume Y = f(X)+E, XE Identifiability 1: If f nonlinear, then g, N Y such that X = g(Y)+N (almost surely) If E non-Gaussian, g, N Y such that X = g(Y)+N

Performs 63% on real data* Drawback: Additivity is a restrictive functional assumption * Cause Effect Pairs Dataset: https://webdav.tuebingen.mpg.de/cause-effect/ Existing Theory: Independence of Cause and Mechanism Function f chosen independently from distribution of X by nature Notion of independence: Assign a variable to f, check logslope integral Boils down to: X causes Y if h(Y) < h(X) [h: differential entropy] Drawback: No exogenous variable assumption (deterministic X-Y relation) Continuous variables only Open Problem Work in most general functional form Y = f(X,E) Handle ordinal as well as categorical data Our Approach Consider discrete variables X, Y, E.

Use total input (Renyi) entropy as a measure of complexity Choose the simpler model Assumption: (Renyi) entropy of exogenous variable E is small Theoretical guarantees for identifiable H0 Renyi entropy (cardinality) Causal direction (almost surely) if E has small cardinality Performance of Greedy Joint Entropy Minimization n marginal distributions each with n states are randomly generated for each n Minimum J oint Entropy by Greedy Algorithm 1.2

Average Gap H * (X1 ,X 2 ,...X n )-maxi H(X i ) Minimum Gap Maximum Gap H * (X1 ,X 2 ,...,Xn ) - maxi(H(Xi)) 1 0.8 The minimum joint entropy obtained by the greedy algorithm is at most 1 bit away from the largest marginal maxiH(Xi) 0.6 0.4 0.2 0 2

2.5 3 3.5 4 4.5 5 5.5 Log of number of states, log 2 (n) 6 6.5 7 Results

Shannon Entropy-based Identifiability Accuracy for Low Entropy E Generate distributions of X,Y by randomly selecting f, X, E. 1 Probability of Success 0.98 0.96 0.94 Probability of success is the fraction of points where H(X,E) < H(Y,N). 1 0.92 0.999 0.9

0.998 0.88 0.997 0.86 0 0.5 1 0.84 0.82 0 0.2 0.4 Larger n drives probability of success

to 1 when $H(E) < log(n), supporting the conjecture. n=4 n=8 n = 16 n = 32 n = 64 n = 128 0.6 H(E)/log(n) 0.8 1 Characterization of Conditionals Define conditional distribution Let p = [p1T, p2T, , pnT]T. Then Ex.: where M is a block partition matrix:

Each block of length n is a partitioning of columns General Position Argument Suppose Y|X = j are uniform over simplex (not realistic, toy example) Note: Let xi exp(1). Then following is a uniform random vector over the simplex: Drop n rows of p to make it (almost) i.i.d. Claim: There does not exist an e with H0 < n(n-1) Proof: Assume otherwise. Rows of M are linearly dependent. a such that aT M = 0 Then aTp = 0 Implies a random hyperplane being orthogonal to a vector, has probability 0.

Our contribution Nature chooses X, E, f. Joint distribution over X, Y implied Choose X, E randomly over simplex. Derive X|Y from induced joint Any Y for which X = g(Y, ) implies Corresponds to a non-zero polynomial being zero, has probability 0. Formal Result X, Y discrete r.v.s with cardinality n Y = f(X,E) where E X is also discrete f is generic (technical condition to avoid edge cases, true in real data) Distribution vectors of X, E uniformly randomly sampled from simplex Then with probability 1, there does not exist N Y such that there exist g that satisfies X = g(Y, N) Working with Shannon Entropy Given Y|X, finding E with minimum Shannon entropy such that there is f that satisfies Y = f(X,E) is equivalent to Given marginal distributions of n variables Xi, find the joint distribution with minimum entropy

NP hard problem. We propose a greedy algorithm (that produces a local optimum)