Performance OpenGL: Platform Independent Techniques
Dave Shreiner
Brad Grantham

What You'll See Today
  An in-depth look at the OpenGL pipeline from a performance perspective
  Techniques for determining where OpenGL application performance bottlenecks are
  A bunch of simple, good habits for OpenGL applications

Performance Tuning Assumptions
  You're trying to tune an interactive OpenGL application
  There's an established metric for estimating the application's performance
    Consistent frames/second
    Number of pixels or primitives to be rendered per frame
  You can change the application's source code

Errors Skew Performance Measurements
  OpenGL Reports Errors Asynchronously
    OpenGL doesn't tell you when something goes wrong
    Need to use glGetError() to determine if something went wrong
    Calls with erroneous parameters will silently set error state and return without completing

Checking a single command
  Simple Macro

    #define CHECK_OPENGL_ERROR( cmd ) \
        cmd; \
        { GLenum error; \
          while ( (error = glGetError()) != GL_NO_ERROR ) { \
              printf( "[%s:%d] '%s' failed with error %s\n", \
                      __FILE__, __LINE__, #cmd, \
                      gluErrorString(error) ); \
          } \
        }

  Some limitations on where the macro can be used
    Can't use inside of a glBegin() / glEnd() pair (glGetError() is not a legal call there)

The OpenGL Pipeline (The Macroscopic View)
  [Diagram: Application -> Transformation -> Rasterization -> Framebuffer]

Performance Bottlenecks
  Bottlenecks are the performance-limiting part of the application
  Application bottleneck
    Application may not pass data fast enough to the OpenGL pipeline
  Transform-limited bottleneck
    OpenGL may not be able to process vertex transformations fast enough

Performance Bottlenecks (cont.)
  Fill-limited bottleneck
    OpenGL may not be able to rasterize primitives fast enough

There Will Always Be A Bottleneck
  Some portion of the application will always be the limiting factor to performance
  If the application performs to expectations, then the bottleneck isn't a problem
  Otherwise, need to be able to identify which part of the application is the bottleneck
  We'll work backwards through the OpenGL pipeline in resolving bottlenecks

Fill-limited Bottlenecks
  System cannot fill all the pixels required in the allotted time
  Easiest bottleneck to test
    Reduce number of pixels application must fill
    Make the viewport smaller

Reducing Fill-limited Bottlenecks
  The Easy Fixes
    Make the viewport smaller
      This may not be an acceptable solution, but it's easy
    Reduce the frame-rate

      \frac{800\ \text{M pixels/second}}{75\ \text{frames/second}} \approx 10.7\ \text{M pixels/frame}

      \frac{800\ \text{M pixels/second}}{60\ \text{frames/second}} \approx 13.3\ \text{M pixels/frame}

A Closer Look at OpenGL's Rasterization Pipeline
  [Diagram: point, line, triangle, bitmap, and pixel rectangle rasterization -> texture mapping engine -> color sum (separate specular color) -> fog engine -> to fragment tests]

Reducing Fill-limited Bottlenecks (cont.)
  Rasterization Pipeline
    Cull back-facing polygons
      Requires that all primitives have the same facing (consistent winding)
    Use per-vertex fog, as compared to per-pixel
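
The frame-rate arithmetic on the "Easy Fixes" slide (fill rate divided by frame rate gives a per-frame pixel budget) can be sketched in a few lines. This is only an illustration: the helper names pixelsPerFrame and affordableDepthComplexity are hypothetical, and the 800 M pixels/second figure is the one used on the slide, not a measurement.

```cpp
#include <cassert>

// Hypothetical helper: pixels the hardware can fill in one frame, given a
// peak fill rate (pixels/second) and a target frame rate (frames/second).
double pixelsPerFrame(double fillRatePixelsPerSecond, double framesPerSecond) {
    assert(framesPerSecond > 0.0);
    return fillRatePixelsPerSecond / framesPerSecond;
}

// Hypothetical helper: how many times, on average, each pixel of a
// width x height viewport can be written within that per-frame budget.
double affordableDepthComplexity(double budgetPixels, int width, int height) {
    assert(width > 0 && height > 0);
    return budgetPixels / (static_cast<double>(width) * height);
}
```

Calling pixelsPerFrame(800e6, 60.0) reproduces the slide's roughly 13.3 M pixels/frame; dividing that by the viewport area gives a rough feel for how much overdraw a frame can tolerate before it becomes fill-limited.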

A Closer Look at OpenGL's Rasterization Pipeline (cont.)
  [Diagram: fragments (from previous stages) -> pixel ownership test -> scissor test -> alpha test -> stencil test -> depth buffer test -> blending -> dithering -> logical operations -> framebuffer]

Reducing Fill-limited Bottlenecks (cont.)
  Fragment Pipeline
    Do less work per pixel
      Disable dithering
      Depth-sort primitives to reduce depth testing
      Use alpha test to reject transparent fragments
        Saves doing a pixel read-back from the framebuffer in the blending phase

A Closer Look at OpenGL's Pixel Pipeline
  [Diagram: pixel rectangle -> pixel unpacking -> type conversion -> pixel scale & bias -> pixel map -> clamp -> framebuffer or texture memory]

Working with Pixel Rectangles
  Texture downloads and Blts
    OpenGL supports many formats for storing pixel data
      Signed and unsigned types, floating point
    Type conversions from storage type to framebuffer / texture memory format occur automatically

Pixel Data Conversions
  [Chart: results on Machine 1, Machine 2, and Machine 3 for GL_BYTE, GL_UNSIGNED_BYTE, GL_SHORT, GL_UNSIGNED_SHORT, GL_INT, GL_UNSIGNED_INT, and GL_FLOAT; vertical axis from 0 to 30]

Pixel Data Conversions (cont.)
  [Chart: results on Machine 1, Machine 2, and Machine 3 for GL_UNSIGNED_SHORT_4_4_4_4, GL_UNSIGNED_SHORT_4_4_4_4_REV, GL_UNSIGNED_SHORT_5_5_5_1, GL_UNSIGNED_SHORT_1_5_5_5_REV, GL_UNSIGNED_INT_8_8_8_8, GL_UNSIGNED_INT_8_8_8_8_REV, GL_UNSIGNED_INT_10_10_10_2, and GL_UNSIGNED_INT_2_10_10_10_REV; vertical axis from 0 to 2.5]

Pixel Data Conversions (cont.)
  Observations
    Signed data types probably aren't optimized
      OpenGL clamps colors to [0, 1]
    Match pixel format to window's pixel format for blts
      Usually involves using packed pixel formats
    No significant difference for rendering speed for texture's internal format
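
As an illustration of "match the pixel format" using a packed type, the sketch below converts 8-bit RGBA data once, on the CPU, into the 16-bit layout that GL_UNSIGNED_SHORT_5_5_5_1 describes for GL_RGBA data. Whether 5-5-5-1 is actually the window's native format is an assumption that must be checked per platform; the function names packRGBA5551 and convertImage are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack one 8-bit-per-channel RGBA pixel into the 16-bit layout of
// GL_UNSIGNED_SHORT_5_5_5_1 with format GL_RGBA: red in bits 15-11,
// green in bits 10-6, blue in bits 5-1, alpha in bit 0.
std::uint16_t packRGBA5551(std::uint8_t r, std::uint8_t g,
                           std::uint8_t b, std::uint8_t a) {
    return static_cast<std::uint16_t>(((r >> 3) << 11) |
                                      ((g >> 3) << 6)  |
                                      ((b >> 3) << 1)  |
                                       (a >> 7));
}

// Convert a whole image once, ahead of time, so that no per-frame type
// conversion is needed when the pixels are handed to OpenGL.
std::vector<std::uint16_t> convertImage(const std::vector<std::uint8_t>& rgba8) {
    assert(rgba8.size() % 4 == 0);   // expects tightly packed RGBA bytes
    std::vector<std::uint16_t> out;
    out.reserve(rgba8.size() / 4);
    for (std::size_t i = 0; i < rgba8.size(); i += 4) {
        out.push_back(packRGBA5551(rgba8[i], rgba8[i + 1],
                                   rgba8[i + 2], rgba8[i + 3]));
    }
    return out;
}
```

The converted buffer would then be passed with format GL_RGBA and type GL_UNSIGNED_SHORT_5_5_5_1, trading color precision for a format that may avoid a conversion in the driver.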

Fragment Operations and Fill Rate
  The more you do, the less you get
    The more work per pixel, the less fill you get

Fragment Operations and Fill Rate (cont'd)
  [Charts: Percentage of Peak Fill (0% to 120%) for Machine 1, Machine 2, Machine 3, and Machine 4; horizontal axis ticks from 1 to 64]

Texture-mapping Considerations
  Use Texture Objects
    Allows OpenGL to do texture memory management
      Loads texture into texture memory when appropriate
    Only convert data once
    Provides queries for checking if a texture is resident
      Load all textures, and verify they all fit simultaneously

Texture-mapping Considerations (cont.)
  Texture Objects (cont.)
    Assign priorities to textures
      Provides hints to texture-memory manager on which textures are most important
    Can be shared between OpenGL contexts
      Allows one thread to load textures; other thread to render using them
    Requires OpenGL 1.1
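
Before relying on the residency queries, a rough offline estimate can show whether a texture set has any chance of fitting simultaneously. The sketch below sums the texels of each texture, including a full mipmap chain, against a memory budget. It is an estimate only: the TextureDesc type and both function names are hypothetical, and real drivers may pad, tile, or convert textures, so the residency queries on the target machine remain the real test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a 2D texture for estimation purposes.
struct TextureDesc {
    int  width;
    int  height;
    int  bytesPerTexel;
    bool mipmapped;
};

// Bytes for the base level plus, if mipmapped, every level down to 1x1.
std::size_t textureBytes(const TextureDesc& t) {
    assert(t.width > 0 && t.height > 0 && t.bytesPerTexel > 0);
    std::size_t total = 0;
    int w = t.width;
    int h = t.height;
    for (;;) {
        total += static_cast<std::size_t>(w) * h * t.bytesPerTexel;
        if (!t.mipmapped || (w == 1 && h == 1)) break;
        if (w > 1) w /= 2;   // each level halves a dimension until it reaches 1
        if (h > 1) h /= 2;
    }
    return total;
}

// True when the estimated footprint of all textures fits in the budget.
bool allFit(const std::vector<TextureDesc>& textures, std::size_t textureMemoryBytes) {
    std::size_t sum = 0;
    for (const TextureDesc& t : textures) sum += textureBytes(t);
    return sum <= textureMemoryBytes;
}
```

A full mipmap chain adds about one third to the base level, which is why a 256 x 256 RGBA texture grows from 262144 to 349524 bytes in this estimate.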

Texture-mapping Considerations (cont.)
  Sub-loading Textures
    Only update a portion of a texture
    Reduces bandwidth for downloading textures
    Usually requires modifying texture-coordinate matrix

Texture-mapping Considerations (cont.)
  Know what sizes your textures need to be
    What sizes of mipmaps will you need?
  OpenGL 1.2 introduces texture level-of-detail
    Ability to have fine grained control over mipmap stack
      Only load a subset of mipmaps
      Control which mipmaps are used

What If Those Options Aren't Viable?
  Use more or faster hardware
  Utilize the extra time in other parts of the application
    Transform pipeline
      tessellate objects for smoother appearance
      use better lighting
    Application
      more accurate simulation
      better physics

Transform-limited Bottlenecks
  System cannot process all the vertices required in the allotted time
  If application doesn't speed up in fill-limited test, it's most likely transform-limited
  Additional tests include
    Disable lighting
    Disable texture coordinate generation

A Closer Look at OpenGL's Transformation Pipeline
  [Diagram: vertex coordinates -> modelview transform -> projection transform -> clipping; normals and color -> lighting; texture coordinates -> texture coordinate generation -> texture coordinate transform]

Reducing Transform-limited Bottlenecks
  Do less work per-vertex
    Tune lighting
    Use typed OpenGL matrices
    Use explicit texture coordinates
    Simulate features in texturing
      lighting

Lighting Considerations
  Use infinite (directional) lights
    Less computation compared to local (point) lights
  Don't use GL_LIGHT_MODEL_LOCAL_VIEWER
  Use fewer lights
    Not all lights may be hardware accelerated

Lighting Considerations (cont.)
  Use a texture-based lighting scheme
    Only helps if you're not fill-limited

Reducing Transform-limited Bottlenecks (cont.)
  Matrix Adjustments
    Use typed OpenGL matrix calls

      Typed               Untyped
      glRotate*()         glLoadMatrix*()
      glScale*()          glMultMatrix*()
      glTranslate*()
      glLoadIdentity()

    Some implementations track matrix type to reduce matrix-vector multiplication operations

Application-limited Bottlenecks
  When OpenGL does all you ask, and your application still runs too slow
  System may not be able to transfer data to OpenGL fast enough
  Test by modifying application so that no rendering is performed, but all data is still transferred to OpenGL

Application-limited Bottlenecks (cont.)
  Rendering in OpenGL is triggered when vertices are sent to the pipe
  Send all data to pipe, just not necessarily in its original form
    Replace all glVertex*() and glColor*() calls with glNormal*() calls
      glNormal*() only sets the current vertex's normal values
    Application transfers the same amount of data to the pipe, but doesn't have to wait for rendering to complete
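
The three tests described so far (shrink the viewport, reduce per-vertex work, send data without rendering) can be combined into one decision procedure that works backwards through the pipeline. The sketch below is one possible reading of those slides, not a method taken from them: the FrameTimes fields, the function names, and the 10% speed-up threshold are all assumptions, and the timings are expected to come from the application's own measurements.

```cpp
#include <cassert>

enum class Bottleneck { Fill, Transform, Application, Unclear };

// Seconds per frame, measured by the application under each experiment.
struct FrameTimes {
    double baseline;           // unmodified application
    double smallViewport;      // same scene, smaller viewport
    double reducedVertexWork;  // lighting and texture coordinate generation disabled
    double noRendering;        // glVertex*()/glColor*() replaced by glNormal*()
};

// A run counts as faster only if it beats the baseline by the given fraction.
bool spedUp(double baseline, double modified, double threshold) {
    return modified < baseline * (1.0 - threshold);
}

// Work backwards through the pipeline: fill first, then transform,
// then the application itself. The threshold is an arbitrary assumption.
Bottleneck classify(const FrameTimes& t, double threshold = 0.10) {
    assert(t.baseline > 0.0);
    if (spedUp(t.baseline, t.smallViewport, threshold))     return Bottleneck::Fill;
    if (spedUp(t.baseline, t.reducedVertexWork, threshold)) return Bottleneck::Transform;
    if (!spedUp(t.baseline, t.noRendering, threshold))      return Bottleneck::Application;
    return Bottleneck::Unclear;  // rendering costs something, but neither test isolated it
}
```

If the frame time barely moves even when nothing is rendered, the time is being spent before the data reaches OpenGL, which is the application-limited case.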

Reducing Application-limited Bottlenecks
  No amount of OpenGL transform or rasterization tuning will help the problem
  Revisit application design decisions
    Data structures
    Traversal methods
    Storage formats
  Use an application profiling tool (e.g. pixie & prof, gprof, or other similar tools)

The Novice OpenGL Programmer's View of the World
  [Diagram: Set State <-> Render]

What Happens When You Set OpenGL State
  The amount of work varies by operation

    Operation                                      Work performed
    Turning on or off a feature (glEnable())       Set the feature's enable flag
    Set a typed set of data (glMaterialfv())       Set values in OpenGL's context
    Transfer untyped data (glTexImage2D())         Transfer and convert data from host format into internal representation

  But all request a validation at next rendering operation

A (Somewhat) More Accurate Representation
  [Diagram: Set State -> Validation -> Render]

Validation
  OpenGL's synchronization process
  Validation occurs in the transition from state setting to rendering

    glMaterialfv( GL_FRONT, GL_DIFFUSE, blue );
    glEnable( GL_LIGHT0 );
    glBegin( GL_TRIANGLES );

  Not all state changes trigger a validation
    Vertex data (e.g. color, normal, texture coordinates)
    Changing rendering primitive

What Happens in a Validation
  Changing state may do more than just set values in the OpenGL context
    May require reconfiguring the OpenGL pipeline
      selecting a different rasterization routine
      enabling the lighting machine
    Internal caches may be recomputed
      vertex / viewpoint independent data

The Way it Really Is (Conceptually)
  [Diagram: Set State -> Validation -> Render, with a separate "Different Rendering Primitive" path]

Why Be Concerned About Validations?
  Validations can rob performance from an application
    Redundant state and primitive changes
  Validation is a two-step process
    Determine what data needs to be updated
    Select appropriate rendering routines based on enabled features

How Can Validations Be Minimized?
  Be Lazy
    Change state as little as possible
    Try to group primitives by type
    Beware of "under the covers" state changes
      GL_COLOR_MATERIAL
        may force an update to the lighting cache on every call to glColor*()

How Can Validations Be Minimized? (cont.)
  Beware of glPushAttrib() / glPopAttrib()
    Very convenient for writing libraries
    Saves lots of state when called
      All elements of an attribute group are copied for later
    Almost guaranteed to do a validation when calling glPopAttrib()

State Sorting
  Simple technique, big payoff
  Arrange rendering sequence to minimize state changes
    Group primitives based on their state attributes
    Organize rendering based on the expense of the operation

State Sorting (cont.)
  Most Expensive
    Texture Download
    Modifying Lighting Parameters
    Matrix Operations
    Vertex Data
  Least Expensive

A Comment on Encapsulation
  An Extremely Handy Design Mechanism, however ...
  Encapsulation may affect performance
    Tendency to want to complete all operations for an object before continuing to next object
      limits state sorting potential
      may cause unnecessary validations

A Comment on Encapsulation (cont.)
  Using a "visitor" type pattern can reduce state changes and validations
    Usually a two-pass operation
      Traverse objects, building a list of rendering primitives by state and type
      Render by processing lists
    Popular method employed by many scene-graph packages
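
The two-pass idea (collect first, then render in state order) can be sketched without any OpenGL calls by counting how often state would have to change. Everything here is illustrative: DrawItem and its integer ids are hypothetical stand-ins for real texture, material, and primitive state, and the sort key order simply follows the "most expensive first" ranking from the State Sorting slides.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical record built during the first (traversal) pass.
struct DrawItem {
    int textureId;   // most expensive to change, so it is the primary sort key
    int materialId;  // lighting parameters come next
    int primitive;   // stand-in for GL_TRIANGLES, GL_QUADS, ...
    int meshId;      // which geometry to draw
};

// Second pass, step one: order the list so items with equal state are adjacent.
void sortByState(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b) {
            return std::tie(a.textureId, a.materialId, a.primitive) <
                   std::tie(b.textureId, b.materialId, b.primitive);
        });
}

// Count how many texture or material switches a render pass over the list
// would need. Assumes real ids are non-negative, so -1 means "nothing set".
int countStateChanges(const std::vector<DrawItem>& items) {
    int changes = 0;
    int tex = -1;
    int mat = -1;
    for (const DrawItem& d : items) {
        if (d.textureId != tex)  { ++changes; tex = d.textureId; }
        if (d.materialId != mat) { ++changes; mat = d.materialId; }
    }
    return changes;
}
```

For six items whose textures alternate between two ids, the unsorted list needs seven state changes while the sorted list needs three, which is the kind of saving the slides attribute to state sorting.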

Scene Graph Lessons
  Can you use pre-packaged software?
    Save yourself some trouble
  Lessons learned from scene graphs
    Inventor
    OpenRM
    DirectModel
    Cosmo3D
    DirectSceneGraph/Fahrenheit
    Performer
    OpenSG

Scene Graph Lessons
  Organization is the key
    Organize scene data for performance
      E.g. by transformation & bounding hierarchy
  All about balance (like all high-perf coding)
    Speed versus convenience
    Portability versus speed

Performance Goals
  Identify performance techniques and goals
    Sort to eliminate costly state changes
    Evaluate state lazily
      Don't set state if won't draw geometry
    Eliminate redundancy
      Don't set "red" if material is already "red"
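
Both goals (lazy evaluation and redundancy elimination) fit in a small state cache. The sketch below records what the application asked for and only forwards it when geometry is about to be drawn and the value differs from what was last applied. The class name LazyMaterial is hypothetical, and the callback stands in for a real call such as glMaterialfv(); no OpenGL is used, so the behavior can be checked by counting callback invocations.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <utility>

using Color = std::array<float, 4>;

// Hypothetical cache for one piece of state (a material color).
class LazyMaterial {
public:
    // 'apply' stands in for the real state-setting call.
    explicit LazyMaterial(std::function<void(const Color&)> apply)
        : apply_(std::move(apply)) {}

    // Record the request only; nothing is forwarded yet (lazy evaluation).
    void set(const Color& c) { wanted_ = c; }

    // Call right before issuing geometry. Forwards the state only if it
    // differs from what was last applied (redundancy elimination).
    void flushBeforeDraw() {
        if (!haveApplied_ || wanted_ != applied_) {
            apply_(wanted_);
            applied_ = wanted_;
            haveApplied_ = true;
        }
    }

private:
    std::function<void(const Color&)> apply_;
    Color wanted_{};
    Color applied_{};
    bool  haveApplied_ = false;
};
```

State that is set but never followed by a draw costs nothing, and setting "red" twice in a row forwards a single call.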

Feature Goals
  Identify algorithms and requirements
    E.g. Sort alpha/transparent shapes back-to-front
  Identify multiple rendering passes
    Shadow texture
    Environment Map
  Will you need threading?

Scene Graph Lessons
  Put units of work in nodes/leaves
    Speed of traversal vs. convenience
    Instancing
      e.g. Vertex Arrays, Material, Primitives
  Do work where you have the information needed
    e.g. don't require leaf info before traversal

Scene Graph Lessons
  Data (Nodes) versus operation (Action)
    VertexSet::Draw() might be too inflexible
    Renderer::draw(Node*) allows more flexibility
      Write new Renderer without changing nodes

Scene Graph Lessons
  Data (Nodes) versus operation (Action)
    Might take a little thought - but beneficial
    Gives you flexibility for future
    Need to keep careful eye on performance

Example Data Structure
  Directed Graph / Tree
    Internal and leaf Nodes
    Subclassing
  Render Traversal
    Use bounding volumes
    OpenGL state management

Example
  [Diagram: scene graph built from Group, Group, Shape, VertexSet, VertexState, and PixelState nodes]

Internal Nodes
  Represent spatial organization
    Bounding hierarchy, spatial grid, etc.
  Might encode some functionality
    Animation
    Level of Detail

Internal Nodes
  Opportunities from bounding volumes
    View frustum cull (early if using a hierarchy)
    Screen space size estimation
      e.g. for level-of-detail or subdivision
    Ray/picking optimizations
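
The "early if using a hierarchy" remark can be made concrete with bounding spheres: if a node's sphere lies completely outside any frustum plane, the whole subtree is skipped without visiting its children. This is a minimal sketch under stated assumptions: the types CullNode, Sphere, and Plane are hypothetical, plane normals are assumed to be unit length and to point into the frustum, and each node's sphere is assumed to enclose all of its children. It relies on std::vector accepting an incomplete element type, which is guaranteed from C++17.

```cpp
#include <cassert>
#include <vector>

// A point is inside the half-space when nx*x + ny*y + nz*z + d >= 0.
struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, radius; };

// Hypothetical scene node: a bounding sphere plus child nodes.
struct CullNode {
    Sphere bounds;                   // assumed to enclose this node and all children
    std::vector<CullNode> children;  // empty for a leaf
};

// True when the sphere lies entirely outside at least one plane.
bool outsideFrustum(const Sphere& s, const std::vector<Plane>& planes) {
    for (const Plane& p : planes) {
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius) return true;
    }
    return false;
}

// Count the leaves that survive culling. A rejected node returns at once,
// so none of its descendants are ever tested.
int countVisibleLeaves(const CullNode& n, const std::vector<Plane>& planes) {
    if (outsideFrustum(n.bounds, planes)) return 0;
    if (n.children.empty()) return 1;
    int visible = 0;
    for (const CullNode& c : n.children) visible += countVisibleLeaves(c, planes);
    return visible;
}
```

The test is conservative: a sphere that straddles a plane is kept, so nothing visible is dropped, at the cost of occasionally drawing geometry that ends up clipped.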

Leaf Nodes
  Make leaf data approximate OpenGL state
    Rapid application from leaf to OpenGL
  But encode some abstraction
    Inventor's SoComplexity
    EnvironmentMap instead of assuming GL_SPHERE_MAP
    Allow optimizations on specific platforms

Example: VertexSet

    class VertexSet {
        GLenum fmt;        // vertex format
        float *verts;      // vertex data
        int vertCount;
    };

Example: VertexState

    class VertexState {
        Material *Mtl;
        Lighting *Lights;  // renamed so the member does not hide the Lighting type
        ClipPlane *Planes;
        ...
    };

Example: Node

    class Node {
        Volume Bounds;     // bounding volume
        ...
    };

Example: Shape

    class Shape : public Node {
        VertexSet *Verts;
        VertexState *VState;
        PixelState *PState;
        int *PrimLengths;   // one entry per primitive
        GLenum *PrimTypes;  // one entry per primitive
        int *PrimIndices;
        ...
    };

Example: Group

    class Group : public Node {
        int ChildCount;
        Node **Children;    // pointers, so children can be Shapes or Groups
        ...
    };

OpenGL Context Encapsulation
  Bundles of OpenGL state
    Items commonly changed together
      texture
      fragment ops
      lighting
      material
      vertex arrays
      shaders

Example: OpenGLRenderer
  This is where you evaluate lazily and skip redundant state changes

    class OpenGLRenderer : public Traversal {
    public:
        virtual void Traverse(Node *root);
    private:
        void Schedule(float mtx[16], Shape *shape);
        void Finish(void);
        ...
    };

Example: OpenGLRenderer
  OpenGLRenderer::Traverse(root)
    Recursive traversal - call from app
    Check bounding volume
    Accumulate transformation matrix
  private OpenGLRenderer::Schedule(mtx, shape)
    Schedule a shape for rendering
  private OpenGLRenderer::Finish()
    State sort by bundle
    Sort transparent objs back to front
    Draw all the objects
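
What Finish() has to decide is a draw order. The sketch below shows one way to derive it from the scheduled shapes; it is not the authors' implementation. ScheduledShape, finishOrder, and the fields are hypothetical, drawing opaque shapes before transparent ones is an assumption (the slide only asks for a state sort and a back-to-front sort), and viewDepth is assumed to have been computed at Schedule() time with larger values meaning farther from the eye.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record stored by Schedule() for each shape.
struct ScheduledShape {
    int   bundleId;     // identifies a bundle of OpenGL state (texture, material, ...)
    bool  transparent;
    float viewDepth;    // distance from the eye; larger is farther (assumed convention)
    int   shapeId;
};

// Returns the order in which Finish() would draw the scheduled shapes.
std::vector<ScheduledShape> finishOrder(std::vector<ScheduledShape> scheduled) {
    // Opaque shapes first, transparent shapes last (an assumption of this sketch).
    auto firstTransparent = std::stable_partition(
        scheduled.begin(), scheduled.end(),
        [](const ScheduledShape& s) { return !s.transparent; });

    // Opaque shapes: group by state bundle to minimize state changes.
    std::stable_sort(scheduled.begin(), firstTransparent,
        [](const ScheduledShape& a, const ScheduledShape& b) {
            return a.bundleId < b.bundleId;
        });

    // Transparent shapes: farthest first, so nearer ones blend over them.
    std::stable_sort(firstTransparent, scheduled.end(),
        [](const ScheduledShape& a, const ScheduledShape& b) {
            return a.viewDepth > b.viewDepth;
        });

    return scheduled;
}
```

Note the tension this exposes: transparent shapes are ordered by depth rather than by state, so they may still cause state changes that the opaque pass avoids.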

Case Study: Rendering A Cube
  More than one way to render a cube
    Render 100000 cubes
      Render two quads, and one quad-strip
      Render six separate quads

Case Study: Method 1
  Once for each cube ...

    glColor3fv( color );
    for ( i = 0; i < NUM_CUBE_FACES; ++i ) {
        glBegin( GL_QUADS );
        glVertex3fv( cube[cubeFace[i][0]] );
        glVertex3fv( cube[cubeFace[i][1]] );
        glVertex3fv( cube[cubeFace[i][2]] );
        glVertex3fv( cube[cubeFace[i][3]] );
        glEnd();
    }

Case Study: Method 2
  Once for each cube ...

    glColor3fv( color );
    glBegin( GL_QUADS );
    for ( i = 0; i < NUM_CUBE_FACES; ++i ) {
        glVertex3fv( cube[cubeFace[i][0]] );
        glVertex3fv( cube[cubeFace[i][1]] );
        glVertex3fv( cube[cubeFace[i][2]] );
        glVertex3fv( cube[cubeFace[i][3]] );
    }
    glEnd();

Case Study: Method 3

    glBegin( GL_QUADS );
    for ( i = 0; i < numCubes; ++i ) {
        /* inner loop uses its own counter so it does not clobber i */
        for ( j = 0; j < NUM_CUBE_FACES; ++j ) {
            glVertex3fv( cube[cubeFace[j][0]] );
            glVertex3fv( cube[cubeFace[j][1]] );
            glVertex3fv( cube[cubeFace[j][2]] );
            glVertex3fv( cube[cubeFace[j][3]] );
        }
    }
    glEnd();

Case Study: Method 4
  Once for each cube ...

    glColor3fv( color );
    glBegin( GL_QUADS );
    glVertex3fv( cube[cubeFace[0][0]] );
    glVertex3fv( cube[cubeFace[0][1]] );
    glVertex3fv( cube[cubeFace[0][2]] );
    glVertex3fv( cube[cubeFace[0][3]] );
    glVertex3fv( cube[cubeFace[1][0]] );
    glVertex3fv( cube[cubeFace[1][1]] );
    glVertex3fv( cube[cubeFace[1][2]] );
    glVertex3fv( cube[cubeFace[1][3]] );
    glEnd();

    glBegin( GL_QUAD_STRIP );
    for ( i = 2; i < NUM_CUBE_FACES; ++i ) {
        glVertex3fv( cube[cubeFace[i][0]] );
        glVertex3fv( cube[cubeFace[i][1]] );
    }
    glVertex3fv( cube[cubeFace[2][0]] );
    glVertex3fv( cube[cubeFace[2][1]] );
    glEnd();

Case Study: Method 5

    /* first pass: the two end quads of every cube in a single glBegin/glEnd */
    glBegin( GL_QUADS );
    for ( i = 0; i < numCubes; ++i ) {
        Cube& cube = cubes[i];
        glColor3fv( color[i] );
        glVertex3fv( cube[cubeFace[0][0]] );
        glVertex3fv( cube[cubeFace[0][1]] );
        glVertex3fv( cube[cubeFace[0][2]] );
        glVertex3fv( cube[cubeFace[0][3]] );
        glVertex3fv( cube[cubeFace[1][0]] );
        glVertex3fv( cube[cubeFace[1][1]] );
        glVertex3fv( cube[cubeFace[1][2]] );
        glVertex3fv( cube[cubeFace[1][3]] );
    }
    glEnd();

    /* second pass: one quad strip per cube for the remaining faces */
    for ( i = 0; i < numCubes; ++i ) {
        Cube& cube = cubes[i];
        glColor3fv( color[i] );
        glBegin( GL_QUAD_STRIP );
        for ( j = 2; j < NUM_CUBE_FACES; ++j ) {
            glVertex3fv( cube[cubeFace[j][0]] );
            glVertex3fv( cube[cubeFace[j][1]] );
        }
        glVertex3fv( cube[cubeFace[2][0]] );
        glVertex3fv( cube[cubeFace[2][1]] );
        glEnd();
    }

Case Study: Results
  [Chart: results for Method 1 through Method 5 on Machine 1, Machine 2, and Machine 3; vertical axis from 0.0 to 2.0]

Rendering Geometry
  OpenGL has four ways to specify vertex-based geometry
    Immediate mode
    Display lists
    Vertex arrays
    Interleaved vertex arrays

Rendering Geometry (cont.)
  Not all ways are created equal
  [Chart: Immediate, Display List, Array Element, Draw Array, Draw Elements, Interleaved Array Element, Interleaved Draw Array, and Interleaved Draw Elements on Machine 1, Machine 2, and Machine 3; vertical axis from 0 to 3.5]

Rendering Geometry (cont.)
  Add lighting and color material to the mix
  [Chart: the same eight submission methods on Machine 1, Machine 2, and Machine 3; vertical axis from 0 to 6]

Case Study: Application Description
  1.02M Triangles
  507K Vertices
  Vertex Arrays
    Colors
    Normals
    Coordinates
  Color Material

Case Study: What's the Problem?
  Low frame rate
    On a machine capable of 13M polygons/second, application was getting less than 1 frame/second

      \frac{13.1\ \text{M polygons/second}}{1.02\ \text{M triangles/frame}} \approx 12\ \text{frames/second}

  Application wasn't fill limited

Case Study: The Rendering Loop
  Vertex Arrays

    glVertexPointer( GL_VERTEX_POINTER );
    glNormalPointer( GL_NORMAL_POINTER );
    glColorPointer( GL_COLOR_POINTER );

  glDrawElements() index based rendering
  Color Material

    glColorMaterial( GL_FRONT, GL_AMBIENT_AND_DIFFUSE );

Case Study: What To Notice
  Color Material changes two lighting material components per glColor*() call
  Not that many colors used in the model
    18 unique colors, to be exact
    (3 * 1020472 - 18) = 3061398 redundant color calls per frame

Case Study: Conclusions
  A little state sorting goes a long way
    Sort triangles based on color
  Rewriting the rendering loop slightly

    for ( i = 0; i < numColors; ++i ) {
        glColor3fv( color[i] );
        glDrawElements( ..., trisForColor[i] );
    }

  Frame rate increased to six frames/second
    500% performance increase
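
The rewritten loop needs a per-color index list (trisForColor on the slide). One way such lists could be built as a pre-process is sketched below; the Triangle type, the helper names, and the assumption of exactly one color per triangle are hypothetical and not taken from the case-study application. The second helper just restates the slide's redundant-call arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical triangle record: three vertex indices and one color index.
struct Triangle {
    unsigned v[3];
    int colorIndex;
};

// Bucket vertex indices by color, so each color needs a single glColor*()
// call followed by one indexed draw of its bucket.
std::vector<std::vector<unsigned>> groupByColor(const std::vector<Triangle>& tris,
                                                int numColors) {
    assert(numColors >= 0);
    std::vector<std::vector<unsigned>> trisForColor(static_cast<std::size_t>(numColors));
    for (const Triangle& t : tris) {
        assert(t.colorIndex >= 0 && t.colorIndex < numColors);
        std::vector<unsigned>& bucket = trisForColor[static_cast<std::size_t>(t.colorIndex)];
        bucket.insert(bucket.end(), t.v, t.v + 3);
    }
    return trisForColor;
}

// The slide's arithmetic: one color per vertex, minus the calls actually needed.
long redundantColorCalls(long triangleCount, long uniqueColors) {
    return 3 * triangleCount - uniqueColors;
}
```

With 1020472 triangles and 18 colors, redundantColorCalls returns the 3061398 figure from the slide; after grouping, the per-frame color calls drop to one per bucket.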

Summary
  Know the answer before you start
    Understand rendering requirements of your applications
      Have a performance goal
    Utilize applicable benchmarks
      Estimate what the hardware's capable of
  Organize rendering to minimize OpenGL validations and other work

Summary (cont.)
  Pre-process data
    Convert images and textures into formats which don't require pixel conversions
    Pre-size textures
      Simultaneously fit into texture memory
      Mipmaps
    Determine what's the best format for sending data to the pipe

Questions & Answers
  Thanks for coming
  Updates to notes and slides will be available at
    http://www.plunk.org/Performance.OpenGL
  Feel free to email if you have questions
    Dave Shreiner [email protected]
    Brad Grantham [email protected]

References
  OpenGL Programming Guide, 3rd Edition. Woo, Mason et al., Addison Wesley
  OpenGL Reference Manual, 3rd Edition. OpenGL Architecture Review Board, Addison Wesley
  OpenGL Specification, Version 1.4. OpenGL Architecture Review Board

For More Information
  SIGGRAPH 2002 Course # 3 - Developing Efficient Graphics Software
  SIGGRAPH 2000 Course # 32 - Advanced Graphics Programming Techniques Using OpenGL

Acknowledgements
  A Big Thank You to
    Peter Shaheen for a number of the benchmark programs
    David Shirley for Case Study application