Games on Graphs
Uri Zwick, Tel Aviv University
Lecture 3: Turn-Based Stochastic Games
Last modified 17/11/2019

Lecture 3
Back to Turn-Based Stochastic Games (TBSGs)
Value iteration
Policy iteration
Linear Programming

Turn-based Stochastic Games (TBSGs)
[Shapley (1953)] [Gillette (1957)] [Condon (1992)]

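Objective functions (are they well defined?), where $a_t$ denotes the action used at step $t$:
Total cost, finite horizon: min/max $\mathbb{E}\big[\sum_{t=0}^{T-1} c(a_t)\big]$
Total cost, infinite horizon: min/max $\mathbb{E}\big[\sum_{t=0}^{\infty} c(a_t)\big]$
Discounted cost: min/max $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t c(a_t)\big]$
Limiting average cost: min/max $\mathbb{E}\big[\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} c(a_t)\big]$

TBSG terminology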

$S_0, S_1$: sets of states controlled by the two players.
$A_i$: set of actions from state $i$.
$c(a)$: cost of action $a$.
$p(a,j)$: probability of moving to state $j$ after action $a$.
$\sum_j p(a,j) \le 1$, for every action $a$.
With probability $1 - \sum_j p(a,j)$, the game ends. (The sink is not considered to be a state.)
Player 0 is the minimizer. Player 1 is the maximizer.

Total cost TBSGs with the stopping condition
To make the total cost well defined we assume:
Stopping condition: For every pair of strategies of the players, the game ends with probability 1.

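Discounted cost is a special case.
Limiting average cost can be solved using similar, but more complicated, techniques.

Optimality equations for TBSGs
$x_i = \min_{a \in A_i} \big( c(a) + \sum_j p(a,j)\, x_j \big)$, $i \in S_0$
$x_i = \max_{a \in A_i} \big( c(a) + \sum_j p(a,j)\, x_j \big)$, $i \in S_1$
Theorem: For a stopping TBSG:
1. The optimality equations have a unique solution.
2. If $\sigma(i)$ attains the minimum in the equation of $i$ for every $i \in S_0$, and $\tau(i)$ attains the maximum in the equation of $i$ for every $i \in S_1$, then $\sigma$ and $\tau$ are optimal strategies for the two players.

How do we solve the optimality equations?
LP does not work.
Can still use value iteration and some form of policy iteration.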

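The Value Iteration Operator
$T : \mathbb{R}^n \to \mathbb{R}^n$:
$(Tx)_i = \min_{a \in A_i} \big( c(a) + \sum_j p(a,j)\, x_j \big)$, $i \in S_0$
$(Tx)_i = \max_{a \in A_i} \big( c(a) + \sum_j p(a,j)\, x_j \big)$, $i \in S_1$
$T^k x$ is the optimal value vector of a $k$-step game with terminal cost vector $x$. (Simple proof using backward induction.)
$x$ is a solution of the optimality equations iff $Tx = x$. In other words, $x$ is a fixed-point of $T$.

Discounted cost → Total cost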

Discount factor $0 < \gamma < 1$.
Multiplying $c(a_t)$, the cost of the $t$-th step, by $\gamma^t$ is equivalent, in expectation, to stopping the game at each step with probability $1-\gamma$. The resulting game is stopping.

Strategy Improvement for TBSGs
Let $\sigma$ be a positional strategy for min.

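Let $\tau$ be the optimal counter-strategy of max to $\sigma$. (Can be found by solving an MDP.)
Let $\sigma'$ be a strategy obtained by performing improving switches to $\sigma$, keeping $\tau$ unchanged.
Let $\tau'$ be the optimal counter-strategy of max to $\sigma'$.
Lemma: $v^{(\sigma',\tau')} < v^{(\sigma,\tau)}$ (that is, $\le$ in every coordinate and $<$ in at least one).
Proof: Consider $(\sigma,\tau)$ as a policy in a 1-player min game. As $\tau$ is optimal for max, all $\tau \to \tau'$ switches are non-deteriorating for min. Thus $(\sigma',\tau')$ is obtained from $(\sigma,\tau)$ via improving or non-deteriorating switches, hence $v^{(\sigma',\tau')} < v^{(\sigma,\tau)}$, by the fact that policy iteration works for MDPs.

Positional optimal strategies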

The termination of the strategy iteration algorithm proves that the optimality equations have a solution.
As for MDPs, we can read off a pair of positional strategies that give the corresponding values.
Using the values to modify the costs, we see that no general strategy can possibly do any better.

Strategy iteration for two-player games
Start with an arbitrary strategy $\sigma$.
Compute an optimal counter-strategy $\tau$.
If there are improving switches for $\sigma$, keeping $\tau$ fixed, perform some of them.
Repeat until there are no improving switches.
The final strategies are optimal for the two players.
Consider the move from $(\sigma,\tau)$ to $(\sigma',\tau')$. From the point of view of player 0, controlling both $\sigma$ and $\tau$, all switches performed are improving or non-deteriorating! Thus $v^{(\sigma',\tau')} < v^{(\sigma,\tau)}$.

Repeated best response?
Policy/Strategy iteration is asymmetric.
One of the players improves her strategy locally, by performing improving switches.
The other player improves her strategy globally, by computing a best response.
What if both players use best response? Or if they both improve locally?
In both cases, the algorithm may cycle! [Condon (1993)]

Repeated best response may cycle

[Figure sequence, Condon (1993): a small stopping game shown with its final payoffs and the initial strategies of the two players. Successive panels: MAX switches both actions; MIN switches both actions; MAX switches both actions; MIN switches both actions. And we are back to the starting position!]

An essentially minimal example, as each player must have at least two vertices.

Can be converted to an MPG.
[Figure sequence: the corresponding MPG, with the same cycle of panels: MAX switches both actions; MIN switches both actions; MAX switches both actions; MIN switches both actions. And we are back to the starting position!]
Note: We did not give a general strategy iteration algorithm for MPGs yet.

Local improvements by both players
Exercise: Construct a stopping TBSG on which there is a sequence of alternating improving switches by both players that cycles.
Why doesn't the proof given for the strategy iteration algorithm, in which the first player uses improving switches while the other player uses best response, work in this case?

Strategy Iteration for discounted TBSGs

Greedy Strategy Iteration, also known as SWITCH-ALL, or Howard's algorithm:
Perform the best switch from each state of player 0.
Compute the best response of player 1.
For discounted MDPs and TBSGs, with discount factor $\gamma$, Howard's algorithm terminates after $O\big(\frac{m}{1-\gamma}\log\frac{1}{1-\gamma}\big)$ iterations, where $m$ is the total number of actions.
[Ye (2011)] [Hansen-Miltersen-Z (2012)] [Scherrer (2016)]

Matrix notation (reminder)

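$J \in \{0,1\}^{m \times n}$, with $J_{a,i} = 1$ iff $a \in A_i$: state/action incidence matrix (rows indexed by actions, columns by states).
$P \in \mathbb{R}^{m \times n}$, with $P_{a,j} = p(a,j)$: transition probabilities.
$c \in \mathbb{R}^m$: costs.
If $\pi$ is a strategy profile, then $J_\pi = I$.
Note that $J_\pi$, $P_\pi$ and $c_\pi$ are obtained by taking the rows indexed by $\pi$.

Values and Modified costs
Let $\pi$ be a strategy profile.
Let $v^\pi = (I - \gamma P_\pi)^{-1} c_\pi$ be the values under $\pi$.
Let $\bar{c}^\pi = c - (J - \gamma P)\, v^\pi$ be the modified costs w.r.t. $\pi$.
Lemma 1: Let $\pi$ and $\pi'$ be two strategy profiles. Then,
$v^{\pi'} - v^{\pi} = (I - \gamma P_{\pi'})^{-1}\, \bar{c}^{\pi}_{\pi'}$.
Indeed, $v^{\pi'} - v^{\pi} = (I - \gamma P_{\pi'})^{-1} \big( c_{\pi'} - (I - \gamma P_{\pi'})\, v^{\pi} \big) = (I - \gamma P_{\pi'})^{-1}\, \bar{c}^{\pi}_{\pi'}$.
Can you give an intuitive interpretation of the lemma?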
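A small numerical check can make Lemma 1 concrete. The sketch below takes the lemma to be the identity $v^{\pi'} - v^{\pi} = (I - \gamma P_{\pi'})^{-1} \bar{c}^{\pi}_{\pi'}$, with modified costs $\bar{c}^{\pi}_a = c(a) + \gamma \sum_j p(a,j) v^\pi_j - v^\pi_i$ for $a \in A_i$, and verifies it on a two-state discounted example. The example game, the discount factor 0.9, and the helper names (`solve`, `system`, `values`, `modified_cost`) are assumptions made for illustration only.

```python
GAMMA = 0.9
# Hypothetical example: action name -> (state, cost, transition row over states).
ACTIONS = {
    "a": (0, 1.0, [0.0, 1.0]),
    "b": (0, 3.0, [1.0, 0.0]),
    "c": (1, 2.0, [0.5, 0.5]),
    "d": (1, 0.0, [0.0, 1.0]),
}


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def system(profile):
    """Matrix I - GAMMA * P_pi; profile[i] is the action chosen at state i."""
    n = len(profile)
    return [[(1.0 if i == j else 0.0) - GAMMA * ACTIONS[profile[i]][2][j]
             for j in range(n)] for i in range(n)]


def values(profile):
    """Values under a profile: v = (I - GAMMA * P_pi)^{-1} c_pi."""
    return solve(system(profile), [ACTIONS[a][1] for a in profile])


def modified_cost(action, v):
    """Modified cost of an action with respect to the value vector v."""
    state, cost, row = ACTIONS[action]
    return cost + GAMMA * sum(p * vj for p, vj in zip(row, v)) - v[state]


pi, pi_prime = ["a", "c"], ["b", "d"]
v, v_prime = values(pi), values(pi_prime)
lhs = [x - y for x, y in zip(v_prime, v)]
rhs = solve(system(pi_prime), [modified_cost(a, v) for a in pi_prime])
print(lhs, rhs)
```

The two printed vectors agree up to rounding. One way to read this: the modified costs of $\pi'$ measure the one-step gain or loss of deviating from $\pi$, and $(I - \gamma P_{\pi'})^{-1}$ accumulates these one-step differences, discounted, along the play of $\pi'$.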

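Strategy Iteration vs. Value Iteration
Lemma 2: The value iteration operator $T$ of a discounted TBSG is a contraction, i.e., $\|Tu - Tv\|_\infty \le \gamma\, \|u - v\|_\infty$, for every $u, v \in \mathbb{R}^n$.
Lemma 3: Let $\pi = (\sigma, \tau)$, where $\tau$ is a best response to $\sigma$. Let $\pi'$ be a profile obtained by one iteration of Howard's algorithm. Then $v^{\pi'} \le T v^{\pi}$. (In other words, strategy iteration is faster than value iteration.)
Corollary 4: If $\pi^0, \pi^1, \ldots$ is a sequence of strategy profiles generated by Howard's algorithm, and $v^*$ is the optimal solution, then $\|v^{\pi^k} - v^*\|_\infty \le \gamma^k\, \|v^{\pi^0} - v^*\|_\infty$.

Action elimination by Strategy Iteration
Lemma 4: If $\pi^0, \pi^1, \ldots$ is a sequence of strategy profiles generated by Howard's algorithm and $\pi^*$ is an optimal profile, then
$\|\bar{c}^{\pi^*}_{\pi^k}\|_\infty \le \frac{\gamma^k}{1-\gamma}\, \|\bar{c}^{\pi^*}_{\pi^0}\|_\infty$.
Corollary 5: If $\pi^0$ is not optimal and $\frac{\gamma^k}{1-\gamma} < 1$, then at least one action in $\pi^0$ is not used in any of the profiles $\pi^k, \pi^{k+1}, \ldots$

Proof of Lemma 4
$\|\bar{c}^{\pi^*}_{\pi^k}\|_\infty = \|(I - \gamma P_{\pi^k})(v^{\pi^k} - v^*)\|_\infty$ (Lemma 1)
$\le \|v^{\pi^k} - v^*\|_\infty$ (As $v^{\pi^k} \ge v^*$ and $P_{\pi^k} \ge 0$)
$\le \gamma^k\, \|v^{\pi^0} - v^*\|_\infty$ (Corollary 4)
$= \gamma^k\, \|(I - \gamma P_{\pi^0})^{-1}\, \bar{c}^{\pi^*}_{\pi^0}\|_\infty$ (Lemma 1)
$\le \frac{\gamma^k}{1-\gamma}\, \|\bar{c}^{\pi^*}_{\pi^0}\|_\infty$ (Each row of $(I - \gamma P_{\pi^0})^{-1}$ sums to $\frac{1}{1-\gamma}$)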
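To make the SWITCH-ALL rule concrete, here is a minimal sketch of Howard's algorithm on a two-state discounted TBSG. The example game, the discount factor 0.9, and all names (`OWNER`, `ACTIONS`, `appraisal`, `switch_all`, `best_response`, `howard`) are assumptions for illustration. Player 1's best response is computed here by an inner policy iteration on the MDP obtained by fixing player 0's actions, which is one possible choice and not necessarily the one intended in the lecture.

```python
GAMMA = 0.9
OWNER = [0, 1]  # OWNER[i]: player controlling state i (0 = min, 1 = max)
ACTIONS = {     # hypothetical example: name -> (state, cost, transition row)
    "a": (0, 1.0, [0.0, 1.0]),
    "b": (0, 3.0, [1.0, 0.0]),
    "c": (1, 2.0, [0.5, 0.5]),
    "d": (1, 0.0, [0.0, 1.0]),
}


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def values(profile):
    """Values under a profile; profile[i] is the action chosen at state i."""
    n = len(profile)
    A = [[(1.0 if i == j else 0.0) - GAMMA * ACTIONS[profile[i]][2][j]
          for j in range(n)] for i in range(n)]
    return solve(A, [ACTIONS[a][1] for a in profile])


def appraisal(action, v):
    """Cost of playing `action` once and then receiving the values v."""
    _, cost, row = ACTIONS[action]
    return cost + GAMMA * sum(p * vj for p, vj in zip(row, v))


def switch_all(profile, player, eps=1e-9):
    """Switch every state of `player` to its best action w.r.t. current values."""
    v = values(profile)
    sign = 1.0 if player == 0 else -1.0  # player 0 minimizes, player 1 maximizes
    new = list(profile)
    for i, owner in enumerate(OWNER):
        if owner != player:
            continue
        for name, (state, _, _) in ACTIONS.items():
            if state == i and sign * appraisal(name, v) < sign * appraisal(new[i], v) - eps:
                new[i] = name
    return new


def best_response(profile):
    """Optimal counter-strategy of player 1, found by policy iteration."""
    while True:
        new = switch_all(profile, player=1)
        if new == profile:
            return profile
        profile = new


def howard(profile):
    """Alternate SWITCH-ALL for player 0 with a best response of player 1."""
    profile = best_response(profile)
    iterations = 0
    while True:
        new = switch_all(profile, player=0)
        if new == profile:
            return profile, values(profile), iterations
        profile = best_response(new)
        iterations += 1


profile, v, iterations = howard(["b", "d"])
print(profile, v, iterations)
```

The tolerance `eps` is only there to avoid switching on rounding noise. On larger games the inner best-response loop is where most of the work happens, since each outer iteration solves a full MDP for player 1.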

Howard's algorithm for discounted TBSGs
Theorem: Howard's strategy iteration solves an $m$-action discounted TBSG or MDP with discount factor $\gamma$ using at most $O\big(\frac{m}{1-\gamma}\log\frac{1}{1-\gamma}\big)$ iterations.
Let $L = \log_{1/\gamma} \frac{1}{1-\gamma}$. Then $L = O\big(\frac{1}{1-\gamma}\log\frac{1}{1-\gamma}\big)$.
Every $L$ iterations, a new action is eliminated, i.e., appears at $\pi^k$ but does not appear in $\pi^{k+L}, \pi^{k+L+1}, \ldots$
Thus, the total number of iterations before an optimal strategy profile is found is at most $O(mL)$.

Lower bounds for Howard's algorithm for non-discounted problems
There are deterministic $n$-state MDPs on which Howard's algorithm performs $\Omega(n^2)$ iterations. [Hansen-Z (2010)]

There are $n$-state MDPs on which Howard's algorithm performs $2^{\Omega(n)}$ iterations. [Friedmann (2009)] [Fearnley (2010)]
Results also hold for discount factors sufficiently close to 1.

Optimal positional strategies for discount factors close to 1
Theorem: Let $G$ be a TBSG. Then, there exist positional strategies $\sigma$ and $\tau$ of the two players, and $\gamma_0 < 1$, such that $\sigma$ and $\tau$ are optimal for every discount factor $\gamma \in [\gamma_0, 1)$.
For positional $\sigma$ and $\tau$, let $v^{\sigma,\tau}(\gamma)$ be the discounted values of all states, with discount factor $\gamma$, when using $\sigma, \tau$.
Lemma: For every state $i$, $v^{\sigma,\tau}_i(\gamma)$ is a rational function of $\gamma$, with numerator and denominator of degree at most $n$.

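Proof: $v^{\sigma,\tau}(\gamma) = (I - \gamma P_{\sigma,\tau})^{-1} c_{\sigma,\tau}$ + Cramer's rule.
Fact: Two non-identical rational functions of degree at most $n$ intersect at most $2n$ times.

Optimal positional strategies for discount factors close to 1
Theorem: Let $G$ be a TBSG. Then, there exist positional strategies $\sigma$ and $\tau$ of the two players, and $\gamma_0 < 1$, such that $\sigma$ and $\tau$ are optimal for every discount factor $\gamma \in [\gamma_0, 1)$.
For a sequence of discount factors $\gamma_k \to 1$, let $\sigma_k, \tau_k$ be positional and optimal for $\gamma_k$.
Let $(\sigma, \tau)$ be a pair that appears an infinite number of times in the sequence $(\sigma_k, \tau_k)$.
By the rationality of $v^{\sigma,\tau}(\gamma)$, there exists $\gamma_0 < 1$, such that $\sigma$ and $\tau$ are optimal for every $\gamma \in [\gamma_0, 1)$.

Optimal positional strategies for TBSG with limiting average cost
Theorem: Let $G$ be a TBSG. Then, there exist positional strategies $\sigma$ and $\tau$ of the two players, and $\gamma_0 < 1$, such that $\sigma$ and $\tau$ are optimal for every discount factor $\gamma \in [\gamma_0, 1)$.
Theorem: The strategies $\sigma$ and $\tau$ are also positional optimal strategies under the limiting average cost objective.
Lemma: Let $\pi = (\sigma, \tau)$, where $\sigma$ and $\tau$ are positional. Then, $\lim_{\gamma \to 1} (1-\gamma)\, v^{\pi}(\gamma)$ is the vector of limiting average costs under $\pi$.
Note: Here we do not assume the stopping condition.

END of LECTURE 3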