# CS267: Introduction

CS 267 Applications of Parallel Computers Hierarchical Methods for the N-Body problem James Demmel www.cs.berkeley.edu/~demmel/cs267_Spr05 28/03/2005 CS267 Lecture 17 Big Idea Suppose the answer at each point depends on data at all the other points Electrostatic, gravitational force Solution of elliptic PDEs Graph partitioning Seems to require at least O(n2) work, communication If the dependence on distant data can be compressed Because it gets smaller, smoother, simpler

Then by compressing data of groups of nearby points, can cut cost (work, communication) at distant points Apply idea recursively: cost drops to O(n log n) or even O(n) Examples: Barnes-Hut or Fast Multipole Method (FMM) for electrostatics/gravity/ Multigrid for elliptic PDE Multilevel graph partitioning (METIS, Chaco,) 28/03/2005 CS267 Lecture 17 Outlin e Motivation Obvious algorithm for computing gravitational or electrostatic force on N bodies takes O(N2) work How to reduce the number of particles in the force sum

We must settle for an approximate answer (say 2 decimal digits, or perhaps 16 ) Basic Data Structures: Quad Trees and Oct Trees The Barnes-Hut Algorithm (BH) An O(N log N) approximate algorithm for the N-Body problem The Fast Multipole Method (FMM) An O(N) approximate algorithm for the N-Body problem Parallelizing BH, FMM and related algorithms 28/03/2005 CS267 Lecture 17 Particle Simulation t=0 while t < t_final for i = 1 to n n = number of particles compute f(i) = force on particle i

for i = 1 to n move particle i under force f(i) for time dt using F=ma compute interesting properties of particles (energy, etc.) t = t + dt end while f(i) = external_force + nearest_neighbor_force + N-Body_force External_force is usually embarrassingly parallel and costs O(N) for all particles external current in Sharks and Fish Nearest_neighbor_force requires interacting with a few neighbors, so still O(N) van der Waals, bouncing balls N-Body_force (gravity or electrostatics) requires all-to-all interactions f(i) = f(i,k) f(i,k) = force on i from k k != i - 28/03/2005 f(i,k) = c*v/||v||3 in 3 dimensions or f(i,k) = c*v/||v||2 in 2 dimensions v = vector from particle i to particle k , c = product of masses or charges

||v|| = length of v Obvious algorithm costs O(N2), but we can do better... CS267 Lecture 17 Application s Astrophysics and Celestial Mechanics Intel Delta = 1992 supercomputer, 512 Intel i860s 17 million particles, 600 time steps, 24 hours elapsed time M. Warren and J. Salmon Gordon Bell Prize at Supercomputing 92 Sustained 5.2 Gflops = 44K Flops/particle/time step 1% accuracy Direct method (17 Flops/particle/time step) at 5.2 Gflops would have taken 18 years, 6570 times longer Plasma Simulation Molecular Dynamics Electron-Beam Lithography Device Simulation Fluid Dynamics (vortex method) Good

sequential algorithms too! 28/03/2005 CS267 Lecture 17 Reducing the number of particles in the force sum All later divide and conquer algorithms use same intuition Consider computing force on earth due to all celestial bodies Look at night sky, # terms in force sum >= number of visible stars Oops! One star is really the Andromeda galaxy, which contains billions of real stars - Seems like a lot more work than we thought Dont worry, ok to approximate all stars in Andromeda by a single point at its center of mass (CM) with same total mass D = size of box containing Andromeda , r = distance of CM to Earth Require that D/r be small enough Idea not new: Newton approximated earth and falling apple by CMs 28/03/2005

CS267 Lecture 17 What is new: Using points at CM recursively From Andromedas point of view, Milky Way is also a point mass Within Andromeda, picture repeats itself As long as D1/r1 is small enough, stars inside smaller box can be replaced by their CM to compute the force on Vulcan Boxes nest in boxes recursively 28/03/2005 CS267 Lecture 17 Outlin e Motivation Obvious algorithm for computing gravitational or electrostatic force on N bodies

takes O(N2) work How to reduce the number of particles in the force sum We must settle for an approximate answer (say 2 decimal digits, or perhaps 16 ) Basic Data Structures: Quad Trees and Oct Trees The Barnes-Hut Algorithm (BH) An O(N log N) approximate algorithm for the N-Body problem The Fast Multipole Method (FMM) An O(N) approximate algorithm for the N-Body problem Parallelizing BH, FMM and related algorithms 28/03/2005 CS267 Lecture 17 Quad Trees Data structure to subdivide the plane Nodes can contain coordinates of center of box, side length Eventually also coordinates of CM, total mass, etc.

In a complete quad tree, each nonleaf node has 4 children 28/03/2005 CS267 Lecture 17 Oct Trees Similar Data Structure to subdivide space 28/03/2005 CS267 Lecture 17 Using Quad Trees and Oct Trees All our algorithms begin by constructing a tree to hold all the particles Interesting cases have nonuniformly distributed particles In a complete tree most nodes would be empty, a waste of space

and time Adaptive Quad (Oct) Tree only subdivides space where particles are located 28/03/2005 CS267 Lecture 17 Example of an Adaptive Quad Tree Child nodes enumerated counterclockwise from SW corner, empty ones excluded 28/03/2005 CS267 Lecture 17 Adaptive Quad Tree Algorithm (Oct Tree analogous) Procedure Quad_Tree_Build

end 28/03/2005 CS267 Lecture 17 Cost of Adaptive Quad Tree Constrution Cost <= N * maximum cost of Quad_Tree_Insert = O( N * maximum dept of Quad_Tree) Uniform Distribution of particles Depth of Quad_Tree = O( log N ) Cost <= O( N * log N ) Arbitrary distribution of particles Depth of Quad_Tree = O( # bits in particle coords ) = O( b ) Cost <= O( b N ) 28/03/2005 CS267 Lecture 17

Outlin e Motivation Obvious algorithm for computing gravitational or electrostatic force on N bodies takes O(N2) work How to reduce the number of particles in the force sum We must settle for an approximate answer (say 2 decimal digits, or perhaps 16 ) Basic Data Structures: Quad Trees and Oct Trees The Barnes-Hut Algorithm (BH) An O(N log N) approximate algorithm for the N-Body problem The Fast Multipole Method (FMM) An O(N) approximate algorithm for the N-Body problem Parallelizing BH, FMM and related algorithms 28/03/2005 CS267 Lecture 17

Barnes-Hut Algorithm A Hierarchical O(n log n) force calculation algorithm, J. Barnes and P. Hut, Nature, v. 324 (1986), many later papers Good for low accuracy calculations: RMS error =k || approx f(k) - true f(k) ||2 / || true f(k) ||2 /N)1/2 ~ 1% (other measures better if some true f(k) ~ 0) High Level Algorithm (in 2D, for simplicity) 1) Build the QuadTree using QuadTreeBuild already described, cost = O( N log N) or O(b N) 2) For each node = subsquare in the QuadTree, compute the CM and total mass (TM) of all the particles it contains post order traversal of QuadTree, cost = O(N log N) or O(b N) 3) For each particle, traverse the QuadTree to compute the force on it, using the CM and TM of distant subsquares core of algorithm cost depends on accuracy desired but still O(N log N) or O(bN) 28/03/2005 CS267 Lecture 17

Step 2 of BH: compute CM and total mass of each node Compute the CM = Center of Mass and TM = Total Mass of all the particles in each node of the QuadTree ( TM, CM ) = Compute_Mass( root ) function ( TM, CM ) = Compute_Mass( n ) compute the CM and TM of node n if n contains 1 particle the TM and CM are identical to the particles mass and location store (TM, CM) at n return (TM, CM) else post order traversal: process parent after all children for all children c(j) of n j = 1,2,3,4 ( TM(j), CM(j) ) = Compute_Mass( c(j) ) endfor TM = TM(1) + TM(2) + TM(3) + TM(4) the total mass is the sum of the childrens masses CM = ( TM(1)*CM(1) + TM(2)*CM(2) + TM(3)*CM(3) + TM(4)*CM(4) ) / TM the CM is the mass-weighted sum of the childrens centers of mass store ( TM, CM ) at n

return ( TM, CM ) end if Cost = O(# nodes in QuadTree) = O( N log N ) or O(b N) 28/03/2005 CS267 Lecture 17 Step 3 of BH: compute force on each particle For each node = square, can approximate force on particles outside the node due to particles inside node by using the nodes CM and TM This will be accurate enough if the node if far away enough from the particle For each particle, use as few nodes as possible to compute force, subject to accuracy constraint Need criterion to decide if a node is far enough from a particle D = side length of node r = distance from particle to CM of node = user supplied error tolerance < 1

Use CM and TM to approximate force of node on box if D/r < 28/03/2005 CS267 Lecture 17 Computing force on a particle due to a node Suppose node n, with CM and TM, and particle k, satisfy D/r < Let (xk, yk, zk) be coordinates of k, m its mass Let (xCM, yCM, zCM) be coordinates of CM r = ( (xk - xCM)2 + (yk - yCM)2 + (zk - zCM)2 )1/2 G = gravitational constant Force on k ~ G * m * TM * ( xCM - xk , yCM - yk , zCM zk ) / r^3 28/03/2005 CS267 Lecture 17 Details of Step 3 of

BH for each particle, traverse the QuadTree to compute the force on it for k = 1 to N f(k) = TreeForce( k, root ) compute force on particle k due to all particles inside root endfor function f = TreeForce( k, n ) compute force on particle k due to all particles inside node n f=0 if n contains one particle evaluate directly f = force computed using formula on last slide else r = distance from particle k to CM of particles in n D = size of n if D/r < ok to approximate by CM and TM compute f using formula from last slide else need to look inside node for all children c of n f = f + TreeForce ( k, c ) end for end if

end if 28/03/2005 CS267 Lecture 17 Analysis of Step 3 of BH Correctness follows from recursive accumulation of force from each subtree Each particle is accounted for exactly once, whether it is in a leaf or other node Complexity analysis Cost of TreeForce( k, root ) = O(depth in QuadTree of leaf containing k) Proof by Example (for >1): For each undivided node = square, (except one containing k), D/r < 1 < There are 3 nodes at each level of the QuadTree There is O(1) work per node Cost = O(level of k)

Total cost = O(k level of k) = O(N log N) - Strongly depends on 28/03/2005 CS267 Lecture 17 k Outlin e Motivation Obvious algorithm for computing gravitational or electrostatic force on N bodies takes O(N2) work How to reduce the number of particles in the force sum We must settle for an approximate answer (say 2 decimal digits, or perhaps 16 ) Basic Data Structures: Quad Trees and Oct Trees The Barnes-Hut Algorithm (BH) An O(N log N) approximate algorithm for the N-Body problem

The Fast Multipole Method (FMM) An O(N) approximate algorithm for the N-Body problem Parallelizing BH, FMM and related algorithms 28/03/2005 CS267 Lecture 17 Fast Multiple Method (FMM) A fast algorithm for particle simulation, L. Greengard and V. Rokhlin, J. Comp. Phys. V. 73, 1987, many later papers Greengard: 1987 ACM Dissertation Award; Rohklin: 1999 NAS Differences from Barnes-Hut FMM computes the potential at every point, not just the force FMM uses more information in each box than the CM and TM, so it is both more accurate and more expensive In compensation, FMM accesses a fixed set of boxes at every level, independent of D/r

BH uses fixed information (CM and TM) in every box, but # boxes increases with accuracy. FMM uses a fixed # boxes, but the amount of information per box increase with accuracy. FMM uses two kinds of expansions Outer expansions represent potential outside node due to particles inside, analogous to (CM,TM) Inner expansions represent potential inside node due to particles outside; Computing this for every leaf node is the computational goal of FMM First review potential, then return to FMM 28/03/2005 CS267 Lecture 17 Gravitational/Electrostatic Potential FMM will compute a compact expression for potential (x,y,z) which can be evaluated and/or differentiated at any point In 3D with x,y,z coordinates

Potential=(x,y,z) = -1/r = -1/(x2 + y2 + z2)1/2 Force = -grad (x,y,z) = - (d/dx , d/dy , d/dz) = -(x,y,z)/r3 In 2D with x,y coordinates Potential=(x,y) = log r = log (x2 + y2)1/2 Force = -grad (x,y) = - (d/dx , d/dy ) = -(x,y)/r2 In 2D with z = x+iy coordinates Potential=(z) = log |z| = Real( log z ) because log z = log |z|ei = log |z| + i Drop Real( ) from calculations, for simplicity Force = -(x,y)/r2 = -z / |z|2 28/03/2005 CS267 Lecture 17 2D Multipole Expansion (Taylor expansion in 1/z)

(z) = potential due to zk, k=1,,n = k mk * log |z - zk| sum from k=1 to n = Real( k mk * log (z - zk) ) drop Real() from now on = M * log(z) + e>=1 z-e e Taylor Expansion in 1/z where M = k mk = Total Mass and e = k mk zke This is called a Multipole Expansion in z = M * log(z) + r>=e>=1 z-e e + error( r ) r = number of terms in Truncated Multipole Expansion

and error( r ) = r

1 bit of accuracy 24 terms enough for single precision, 53 terms for double precision In 3D, can use spherical harmonics or other expansions 28/03/2005 CS267 Lecture 17 Error outside larger box is O( c^(-r) ) Outer(n) and Outer Expansion (z) ~ M * log(z - zn) + r>=e>=1 (z-zn)-e e Outer(n) = (M, 1 , 2 , , r , zn ) Stores data for evaluating potential (z) outside node n due to particles inside n zn = center of node n Cost of evaluating (z) is O( r ), independent of the number of particles inside n

Cost grows linearly with desired number of bits of precision ~r Will be computed for each node in Quad_Tree Analogous to (TM,CM) in Barnes-Hut M and 1 same information as Barnes-Hut 28/03/2005 CS267 Lecture 17 Inner(n) and Inner Expansion Outer(n) used to evaluate potential outside node n due to particles inside n Inner(n) will be used to evaluate potential inside node n due to particles outside n 0<=e<=r e * (z-zn)e zn = center of node n, a D-by-D box Inner(n) = ( 0 , 1 , , r , zn ) Particles outside n must lie outside 3D-by-3D box centered at zn

28/03/2005 CS267 Lecture 17 Top Level Description of FMM (1) Build the QuadTree (2) Call Build_Outer(root), to compute outer expansions of each node n in the QuadTree Traverse QuadTree from bottom to top, combining outer expansions of children to get out outer expansion of parent (3) Call Build_ Inner(root), to compute inner expansions of each node n in the QuadTree Traverse QuadTree from top to bottom, converting outer to inner expansions and combining them (4) For each leaf node n, add contributions of nearest particles directly into Inner(n) final Inner(n) is desired output: expansion for potential at each point due to all particles

28/03/2005 CS267 Lecture 17 Step 2 of FMM: Outer_shift: converting Outer(n 1) to Outer(n ) For step2 2 of FMM (as in step 2 of BH) we want to compute Outer(n) cheaply from Outer( c ) for all children c of n How to combine outer expansions around different points? k(z) ~ Mk * log(z-zk) + r>=e>=1 (z-zk)-e ek expands around zk , k=1,2 First step: make them expansions around same point n1 is a child (subsquare) of n2 zk = center(nk) for k=1,2 Outer(n1) expansion accurate outside blue dashed square, so also accurate outside black dashed square So there is an Outer(n2) expansion with different k and center z2 which represents the same potential as Outer(n

) outside the black dashed box 28/03/2005 1 CS267 Lecture 17 Outer_shift: continued Given 1(z) = M1 * log(z-z1) + r>=e>=1 (z-z1)-e e1 Solve for M2 and e2 in 1(z) ~ 2(z) = M2 * log(z-z2) + r>=e>=1 (z-z1)-e e2 Get M2 = M1 and each e2 is a linear combination of the e1 multiply r-vector of e1 values by a fixed r-by-r matrix to get e2 ( M2 , 12 , , r2 , z2 ) = Outer_shift( Outer(n1) , z2 ) = desired Outer( n2 ) 28/03/2005 CS267 Lecture 17

Step 2 of FMM: compute Outer(n) for each node n in QuadTree Compute Outer(n) for each node of the QuadTree outer = Build_Outer( root ) function ( M, 1,,r , zn) = Build_Outer( n ) compute outer expansion of node n if n if a leaf it contains 1 (or a few) particles compute and return Outer(n) = ( M, 1,,r , zn) directly from its definition as a sum else post order traversal: process parent after all children Outer(n) = 0 for all children c(k) of n k = 1,2,3,4 Outer( c(k) ) = Build_Outer( c(k) ) Outer(n) = Outer(n) + Outer_shift( Outer(c(k)) , center(n)) just add component by component endfor return Outer(n) end if

Cost = O(# nodes in QuadTree) = O( N ) same as for Barnes-Hut 28/03/2005 CS267 Lecture 17 Top Level Description of FMM (1) Build the QuadTree (2) Call Build_Outer(root), to compute outer expansions of each node n in the QuadTree Traverse QuadTree from bottom to top, combining outer expansions of children to get out outer expansion of parent (3) Call Build_ Inner(root), to compute inner expansions of each node n in the QuadTree Traverse QuadTree from top to bottom, converting outer to inner expansions and combining them (4) For each leaf node n, add contributions of nearest particles

directly into Inner(n) final Inner(n) is desired output: expansion for potential at each point due to all particles 28/03/2005 CS267 Lecture 17 Step 3 of FMM: Compute Inner(n) for each n in QuadTree Need Inner(n1) = Inner_shift(Inner(n2)) 28/03/2005 Need Inner(n4) = Convert(Outer(n3)) CS267 Lecture 17 Step 3 of FMM: Inner(n1) = Inner_shift(Inner(n2))

Inner(nk) = ( 0k , 1k , , rk , zk ) Innerexpansion =0<=e<=r ek * (z-zk)e Solve 0<=e<=r e1 * (z-z1)e = 0<=e<=r e2 * (z-z2)e for e1 given z1, e2 , and z2 (r+1) x (r+1) matrix-vector multiply CS267 Lecture 17 28/03/2005 Step 3 of FMM: Inner(n4) = Convert(Outer(n3)) Inner(n4) = ( 0 , 1 , , r , z4 ) Outer(n3) = (M, 1 , 2 , , r , z3 )

Solve 0<=e<=r e * (z-z4)e = M*log (z-z3) + 0<=e<=r e * (z-z3)-e for e given z4 , e , and z3 (r+1) x (r+1) matrix-vector multiply 28/03/2005 CS267 Lecture 17 Step 3 of FMM: Computing Inner(n) from other expansions We will use Inner_shift and Convert to build each Inner(n) by combing expansions from other nodes Which other nodes? As few as necessary to compute the potential accurately Inner_shift(Inner(parent(n), center(n)) will account for potential from particles far enough away from parent (red nodes below) Convert(Outer(i), center(n)) will account for potential from particles in boxes at same level in Interaction Set (nodes labeled i below) 28/03/2005

CS267 Lecture 17 Step 3 of FMM: Interaction Set Interaction Set = { nodes i that are children of a neighbor of parent(n), such that i is not itself a neighbor of n} For each i in Interaction Set , Outer(i) is available, so that Convert(Outer(i),center(n)) gives contribution to Inner(n) due to particles in i Number of i in Interaction Set is at most 62 - 32 = 27 in 2D Number of i in Interaction Set is at most 63 - 33 = 189 in 3D 28/03/2005 CS267 Lecture 17 Step 3 of FMM: Compute Inner(n) for each n in QuadTree Compute Inner(n) for each node of the QuadTree outer = Build_ Inner( root ) function ( 1,,r , zn) = Build_ Inner( n ) compute inner expansion of node n

p = parent(n) p=nil if n = root Inner(n) = Inner_shift( Inner(p), center(n) ) Inner(n) = 0 if p = root for all i in Interaction_Set(n) Interaction_Set(root) is empty Inner(n) = Inner(n) + Convert( Outer(i), center(n) ) add component by component end for for all children c of n complete preorder traversal of QuadTree Build_Inner( c ) end for Cost = O(# nodes in QuadTree) = O( N ) 28/03/2005 CS267 Lecture 17 Top Level Description of FMM (1) Build the QuadTree (2) Call Build_Outer(root), to compute outer expansions

of each node n in the QuadTree Traverse QuadTree from bottom to top, combining outer expansions of children to get out outer expansion of parent (3) Call Build_ Inner(root), to compute inner expansions of each node n in the QuadTree Traverse QuadTree from top to bottom, converting outer to inner expansions and combining them (4) For each leaf node n, add contributions of nearest particles directly into Inner(n) if 1 node/leaf, then each particles accessed once, so cost = O( N ) final Inner(n) is desired output: expansion for potential at each point due to all particles 28/03/2005 CS267 Lecture 17

Outlin e Motivation Obvious algorithm for computing gravitational or electrostatic force on N bodies takes O(N2) work How to reduce the number of particles in the force sum We must settle for an approximate answer (say 2 decimal digits, or perhaps 16 ) Basic Data Structures: Quad Trees and Oct Trees The Barnes-Hut Algorithm (BH) An O(N log N) approximate algorithm for the N-Body problem The Fast Multipole Method (FMM) An O(N) approximate algorithm for the N-Body problem Parallelizing BH, FMM and related algorithms 28/03/2005 CS267 Lecture 17

Parallelizing Hierachical N-Body codes Barnes-Hut, FMM and related algorithm have similar computational structure: 1) Build the QuadTree 2) Traverse QuadTree from leaves to root and build outer expansions (just (TM,CM) for Barnes-Hut) 3) Traverse QuadTree from root to leaves and build any inner expansions 4) Traverse QuadTree to accumulate forces for each particle One parallelization scheme will work for them all Based on D. Blackston and T. Suel, Supercomputing 97 UCB PhD Thesis, David Blackston, Pbody Assign regions of space to each processor Regions may have different shapes, to get load balance - Each region will have about N/p particles Each processor will store part of Quadtree containing all particles (=leaves) in its

region, and their ancestors in Quadtree Top of tree stored by all processors, lower nodes may also be shared Each processor will also store adjoining parts of Quadtree needed to compute forces for particles it owns Subset of Quadtree needed by a processor called the Locally Essential Tree (LET) Given the LET, all force accumulations (step 4)) are done in parallel, without communication 28/03/2005 CS267 Lecture 17 Programming Model BSP BSP Model = Bulk Synchronous Programming Model All processors compute; barrier; all processors communicate; barrier; repeat Advantages easy to program (parallel code looks like serial code) easy to port (MPI, shared memory, TCP network) Possible disadvantage

Rigidly synchronous style might mean inefficiency? Not a real problem, since communication costs low FMM 80% efficient on 32 processor Cray T3E FMM 90% efficient on 4 PCs on slow network FMM 85% efficient on 16 processor SGI SMP (Power Challenge) Better efficiencies for Barnes-Hut, other algorithms 28/03/2005 CS267 Lecture 17 Load Balancing Scheme 1: Orthogonal Recursive Bisection (ORB) Warren and Salmon, Supercomputing 92 Recursively split region along axes into regions containing equal numbers of particles Works well for 2D, not 3D (available in Pbody) Partitioning for 16 procs:

28/03/2005 CS267 Lecture 17 Load Balancing Scheme 2: Costzones Called Costzones for Shared Memory PhD thesis, J.P. Singh, Stanford, 1993 Called Hashed Oct Tree for Distributed Memory Warren and Salmon, Supercomputing 93 We will use the name Costzones for both; also in Pbody Idea: partition QuadTree instead of space Estimate work for each node, call total work W Arrange nodes of QuadTree in some linear order (lots of choices) Assign contiguous blocks of nodes with work W/p to processors Works well in 3D 28/03/2005

CS267 Lecture 17 Linearly Ordering Quadtree nodes for Costzones Hashed QuadTrees (Warren and Salmon) Assign unique key to each node in QuadTree, then compute hash(key) to get integers that can be linearly ordered If (x,y) are coordinates of center of node, interleave bits to get key Put 1 at left as sentinel Nodes at root of tree have shorter keys 28/03/2005 CS267 Lecture 17 Linearly Ordering Quadtree nodes for Costzones (continued) Assign unique key to each node in QuadTree, then compute hash(key) to get a linear order

key = interleaved bits of x,y coordinates of node, prefixed by 1 Hash(key) = bottom h bits of key (eg h=4) Assign contiguous blocks of hash(key) to same processors 28/03/2005 CS267 Lecture 17 Determining Costzones in Parallel Not practical to compute QuadTree, in order to compute Costzones, to then determine how to best build QuadTree Random Sampling: All processors send small random sample of their particles to Proc 1 Proc 1 builds small Quadtree serially, determines its Costzones, and broadcasts them to all processors Other processors build part of Quadtree they are assigned by these Costzones

All processors know all Costzones; we need this later to compute LETs 28/03/2005 CS267 Lecture 17 Computing Locally Essential Trees (LETs) Warren and Salmon, 1992; Liu and Bhatt, 1994 Every processor needs a subset of the whole QuadTree, called the LET, to compute the force on all particles it owns Shared Memory Receiver Driven Protocol Each processor reads part of QuadTree it needs from shared memory on demand, keeps it in cache Drawback: cache memory appears to need to grow proportionally to P to remain scalable Distributed Memory

Sender driven protocol Each processor decides which other processors need parts of its local subset of the Quadtree, and sends these subsets 28/03/2005 CS267 Lecture 17 Locally Essential Trees in Distributed Memory How does each processor decide which other processors need parts of its local subset of the Quadtree? Barnes-Hut: Let j and k be processors, n a node on processor j Let D(n) be the side length of n Let r(n) be the shortest distance from n to any point owned by k If either (1) D(n)/r(n) < and D(parent(n))/r(parent(n)) >= , or (2) D(n)/r(n) >= then node n is part of ks LET, and so proc j should send n to k Condition (1) means (TM,CM) of n can be used on proc k, but this is

not true of any ancestor Condition (2) means that we need the ancestors of type (1) nodes too FMM 28/03/2005 CS267 Lecture 17 Simpler rules based just on relative positions in QuadTree Performance Results - 1 512 Proc Intel Delta Warren and Salmon, Supercomputing 92 8.8 M particles, uniformly distributed .1% to 1% RMS error 114 seconds = 5.8 Gflops - Decomposing domain

7 secs - Building the OctTree 7 secs - Tree Traversal - Communication during traversal - Force evaluation 54 secs

- Load imbalance 7 secs 33 secs 6 secs Rises to 160 secs as distribution becomes nonuniform 28/03/2005 CS267 Lecture 17 Performance Results - 2 Cray T3E Blackston, 1999 10-4 RMS error General 80% efficient on up to 32 processors Example: 50K particles, both uniform and nonuniform -

preliminary results; lots of tuning parameters to set Uniform Tree size MaxDepth Time(secs) Speedup Speedup vs O(n2) Nonuniform 1 proc 4 procs 1 proc 4 procs 2745 4

172.4 5729 10 14.7 5729 10 2.4 6.1 >500 2745 4 38.9 4.4 >50 28/03/2005 CS267 Lecture 17 Future work - portable, efficient

code including all useful variants Old Slides 28/03/2005 CS267 Lecture 17 Fast Multiple Method (FMM) A fast algorithm for particle simulation, L. Greengard and V. Rokhlin, J. Comp. Phys. V. 73, 1987, many later papers Greengard won 1987 ACM Dissertation Award Differences from Barnes-Hut FMM computes the potential at every point, not just the force FMM uses more information in each box than the CM and TM, so it is both more accurate and more expensive In compensation, FMM accesses a fixed set of boxes at every level, independent of D/r BH uses fixed information (CM and TM) in every box, but # boxes

increases with accuracy. FMM uses a fixed # boxes, but the amount of information per box increase with accuracy. FMM uses two kinds of expansions Outer expansions represent potential outside node due to particles inside, analogous to (CM,TM) Inner expansions represent potential inside node due to particles outside; Computing this for every leaf node is the computational goal of FMM First review potential, then return to FMM 28/03/2005 CS267 Lecture 17 Gravitational/Electrostatic Potential Force on particle at (x,y,z) due to one at origin = -(x,y,z)/r 3 Instead of force, consider potential (x,y,z) = -1/r potential satisfies 3D Poisson equation d2/dx2 + d2/dy2 + d2/dz2 = 0

Force = -grad (x,y,z) = - (d/dx , d/dy , d/dz) FMM will compute a compact express for (x,y,z), which can be evaluated and/or differentiated at any point For simplicity, present algorithm in 2D instead of 3D force = -(x,y)/r2 = -z / |z|2 where z = x + y (complex number) potential = log |z| potential satisfies 2D Poisson equation d2/dx2 + d2/dy2 = 0 equivalent to gravity between infinite parallel wires instead of point masses 28/03/2005 CS267 Lecture 17 2D Multipole Expansion (Taylor expansion in 1/z) (z) = potential due to zk, k=1,,n = k mk * log |z - zk|

= Real( k mk * log (z - zk) ) since log z = log |z|ei = log |z| + i drop Real() from now on = k mk * [ log(z) + log (1 - zk/z) ] how logarithms work = M * log(z) + k mk * log (1 - zk/z) where M = k mk = M * log(z) + k mk * e>=1 (zk/z)e Taylor expansion converges if |zk/z| < 1 = M * log(z) + e>=1 z-e k mk zke swap order of summation 28/03/2005 CS267 Lecture 17 = M * log(z) + e>=1 z-e e Error in Truncated 2D Multipole Expansion

error( r ) = r c >= maxk |zk|/ |z| error( r ) = O(cr+1) Suppose all the particles zk lie inside a D-by-D square centered at the origin Suppose z is outside a 3D-by-3D square centered at the origin 1/c = 1.5*D/(D/sqrt(2)) ~ 2.12 Each extra term in expansion increases accuracy about 1 bit 24 terms enough for single, 53 terms for double In 3D, can use spherical harmonics or other expansions 28/03/2005 CS267 Lecture 17

Outer(n) and Outer Expansion (z) ~ M * log(z) + r>=e>=1 z-e e Outer(n) = (M, 1 , 2 , , r ) Stores data for evaluating potential (z) outside node n due to particles inside n Used in place of (TM,CM) in Barnes-Hut Will be computed for each node in QuadTree Cost of evaluating (z) is O( r ), independent of the number of particles inside n 28/03/2005 CS267 Lecture 17 Outer_shift: converting Outer(n1) to Outer(n2) As in step 2 of BH, we want to compute Outer(n) cheaply from Outer( c ) for all children of n How to combine outer expansions around different points? k(z) ~ Mk * log(z-zk) + r>=e>=1 (z-zk)-e ek expands around zk , k=1,2 First step: make them expansions around same point

n1 is a subsquare of n2 zk = center(nk) for k=1,2 Outer(n1) expansion accurate outside blue dashed square, so also accurate outside black dashed square So there is an Outer(n2) expansion with different k and center z2 which represents the same potential as Outer(n ) outside the black dashed box 28/03/2005 1 CS267 Lecture 17 Outer_shift: continued Given 1(z) = M1 * log(z-z1) + r>=e>=1 (z-z1)-e e1 Solve for M2 and e2 in

1(z) ~ 2(z) = M2 * log(z-z2) + r>=e>=1 (z-z1)-e e2 Get M2 = M1 and each e2 is a linear combination of the e1 multiply r-vector of e1 values by a fixed r-by-r matrix to get e2 ( M2 , 12 , , r2 ) = Outer_shift( Outer(n1) , z2 ) = desired Outer( n2 ) 28/03/2005 CS267 Lecture 17 FMM: compute Outer(n) for each node n in QuadTree Compute the Outer(n) for each node of the QuadTree outer = Build_ Outer( root ) function ( M, 1,,r ) = Build_ Outer( n ) compute outer expansion of node n if n if a leaf it contains 1 (or a few) particles compute and return Outer(n) = ( M, 1,,r ) directly from its definition as a sum else post order traversal: process parent after all children

Outer(n) = 0 for all children c(k) of n k = 1,2,3,4 Outer( c(k) ) = Build_ Outer( c(k) ) Outer(n) = Outer(n) + Outer_shift( Outer(c(k)) , center(n)) just add component by component endfor return Outer(n) end if Cost = O(# nodes in QuadTree) = O( N log N ) or O(b N) same as for Barnes-Hut 28/03/2005 CS267 Lecture 17 Inner(n) and Inner Expansion Outer(n) used to evaluate potential outside n due to particles inside Inner(n) will be used to evaluate potential inside n

due to particles outside 0<=k<=r k * (z-zc)k zc = center of n, a D-by-D box Inner(n) = ( 0 , 1 , , r , zc ) Particles outside lie outside 3D-by-3D box centered at zc Output of FMM will include Inner(n) for each leaf node n in QuadTree Need to show how to convert Outer to Inner expansions 28/03/2005 CS267 Lecture 17 Summary of FMM (1) Build the QuadTree (2) Call Build_Outer(root), to compute outer expansions of each node n in the QuadTree (3) Traverse the QuadTree from top to bottom, computing Inner(n) for each node n in QuadTree

still need to show how to convert outer to inner expansions (4) For each leaf node n, add contributions of nearest particles directly into Inner(n) since Inner(n) only includes potential from distant nodes 28/03/2005 CS267 Lecture 17

## Recently Viewed Presentations

• Aboriginal title - right to land itself vs. a rt to do or participate in an activity or practice. Ab groups must have occupied the land at the time of European occupation. The lands are not restricted to traditional use,...
• SSC COMPONENT QUB MEDICAL STUDENTS: FRIDAY, 17 FEBRUARY 2012 One small step for man … … The Model for Improvement (keep it small and simple ...
• "Old Historicism" (not a real word) New Historicism. Most of us are raised to think about history in the traditional way . One would read an account of a Revolutionary War battle written by an American historian in 1944 and...
• "Where the Rubber Hits the Cornfield!" ... (CILs) with Cherie Clark, an outstanding Iowa Disability Rights Advocate from Cedar Rapids. Cherie had a great relationship with Senator Bob Dvorsky, who represented the area where Cherie lives, and he was also...
• FILARIOIDEA Filariidae Onchocercidae Dirofilaria spp. Seteria spp. Onchocerca spp. κ.ά. Parafilaria spp. Aelurostrongylus abstrusus Kυψελίδες γάτας Α: 4-6mm Θ: 9-10mm Agiostrongylus vasorum Κλάδοι πνευμονικής αρτηρίας σκύλου κ.ά. Α: 14-18mm Θ: 18-25mm Κλάδοι πνευμ. αρτηρίας A:14-18mm, Θ:18-25mm L1 επιβ. 8-16 ...
• Monogamous mating and eusociality. Mating system is central to the proposed evolution of eusociality (coming in ch. 12) Colonies of social insects (Hymenoptera, Isoptera) ... Humans by comparison: 20 - 20,000 Hz. Substrate vibrations (solid or water) Chordotonal Organ -...
• Michelle - A child on a journey walks along with his guide. ... Using "I'm not sure" or "I don't know" wiht a statement suchs as "what should we do" "how could we handle it" or "how can we find...
• * Gregor Mendel was born on July 22, 1822, into a peasant family in a small town in Heinzendorf, Austria (now the Czech Republic), and spent much of his youth working in his family's orchards and gardens.