This page tries to collate a set of best practices to help you quickly get to the bottom of any performance problems you are having using Clutter.

/!\ the content is currently a work in progress /!\

Introducing the tools

Sysprof

Sysprof is a stochastic profiler: it rapidly samples which line of code the CPU is executing. When you stop profiling, it builds a graph of which lines of code account for the most samples. This gives a strong indication of what code should be optimized. Alternatively, once you can see which code is to blame, you may be able to change something else so that the code path is not hit at all.

A big advantage of Sysprof is that you don't have to modify your application to use it, and it is packaged for most major Linux distributions.

One disadvantage is that it only considers the CPU, so if your Clutter application is GPU bound then the results from Sysprof are of little use. Another problem is that the results from Sysprof can be somewhat overwhelming and quite hard to analyse if you aren't familiar with the internals of Clutter or the other libraries involved in running your application.

Sysprof screenshot

UProf

UProf is a toolkit for adding domain specific instrumentation to a project. Clutter and Cogl use UProf to define timers and counters throughout the Clutter code that we think may be of interest when analyzing the performance of Clutter applications. We have timers around the redrawing code, the picking code, specific stages of the journal flushing code and more. UProf can provide a textual report at the end of an application run, or you can use the ncurses UI to get a live view of where time is being spent in your application.

To use UProf, fetch the code with git clone git://github.com/rib/UProf.git and then, after building and installing it, configure Clutter with --enable-profile.

Unlike Sysprof, the timing is not based on frequent sampling of the CPU; instead we programmatically define where timing should start and stop. The timers measure the real, wall clock time that has elapsed, which means that any GPU or IO work happening between timers will be reflected in the results.

Here is an example of UProf output:

context: Clutter

counters:
Name                            Total Per Frame 

Actor real-paint counter        3185  13        
Actor pick-paint counter        632   2         
_clutter_backend_redraw counter 245   1         
_clutter_do_pick counter        79    0         
glsl vertex compile counter     20    0         
arbfp compile counter           2     0         

timers:
Name                                      Total   Per Frame  Percent                                                
                                          msecs                                                                     

Mainloop                                  2157.99 8.81       100.000% █████████████████████████████████████████████ 
  Master Clock                            1925.95 7.86        89.247% ████████████████████████████████████████▏     
    Redrawing                             1132.60 4.62        52.484% ███████████████████████▌                      
      Painting actors                     926.40  3.78        42.929% ███████████████████▎                          
        Stage clear                       828.56  3.38        38.395% █████████████████▎                            
      glXSwapBuffers                      204.71  0.84         9.486% ████▎                                         
    Event Processing                      596.40  2.43        27.637% ████████████▍                                 
    Timelines Advancement                 191.67  0.78         8.882% ███▉                                          
  Picking                                 588.22  2.40        27.258% ████████████▎                                 
    Read Pixels                           339.82  1.39        15.747% ███████                                       
    Stage clear (pick)                    219.89  0.90        10.189% ████▌                                         
    Painting actors (pick mode)           15.20   0.06         0.704% ▎                                             
  Journal Flush                           52.64   0.21         2.440% █                                             
    flush: vbo+texcoords+material+entries 37.96   0.15         1.759% ▊                                             
      flush: texcoords+material+entries   36.73   0.15         1.702% ▊                                             
        flush: material+entries           34.77   0.14         1.611% ▋                                             
          flush: modelview+entries        30.57   0.12         1.417% ▋                                             
  Journal Log                             4.29    0.02         0.199%                                               
  Material Flush                          3.94    0.02         0.183%                                               
  Layouting                               1.05    0.00         0.049%                                               
  Mainloop Idle                           0.99    0.00         0.046%                                               
  _cogl_material_equal                    0.23    0.00         0.011%                

A brief explanation of some of the items in the UProf report follows:

  • Actor real-paint counter: This is the number of times that actors have been requested to paint themselves. If more than a couple of hundred actors are painted per frame, that is probably more than is actually needed and it could be hurting performance badly (see "Painting actors" below).
  • Shader compile counters: If shaders (glsl or arbfp) are being compiled in each frame, this could have a very big impact on performance. Consider caching them.
  • Master Clock: This is the amount of time spent preparing each frame. Consider the desired FPS: the time spent in each clock tick should be less than 1000 / FPS milliseconds. If it's bigger, then reducing it should be a first goal.
  • Painting actors: If the "Master Clock" timer is too high and a considerable portion of the time is spent here, consider reducing the number of actors that are painted per frame and the amount of work that is performed when painting each.
  • Journal Flush: Sending of commands to the GPU. In order to reduce the time spent here, make sure that batching is optimal (as explained below) and that there isn't superfluous drawing.
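
The frame budget rule from the "Master Clock" item is simple arithmetic, and a small script makes it easy to check a report against several target frame rates. The sketch below is only an illustration: the function names are made up for this page, and the 7.86 ms figure is the per-frame "Master Clock" time from the sample report above.

```python
def frame_budget_ms(fps):
    """Return the time available per frame, in milliseconds, at a target FPS."""
    return 1000.0 / fps


def is_over_budget(per_frame_ms, fps):
    """True if the measured per-frame time exceeds the budget for the target FPS."""
    return per_frame_ms > frame_budget_ms(fps)


if __name__ == "__main__":
    master_clock_ms = 7.86  # "Master Clock" per-frame time from the sample report
    for fps in (30, 60, 120, 144):
        print("%3d FPS: budget %.2f ms, over budget: %s"
              % (fps, frame_budget_ms(fps), is_over_budget(master_clock_ms, fps)))
```

With the sample figures, 7.86 ms fits within the 60 FPS budget of roughly 16.67 ms, so for that run the master clock would not be the first thing to reduce.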

Clutter and Cogl environment variables

Clutter and Cogl come with quite a lot of options that can be tuned at runtime using the CLUTTER_DEBUG, CLUTTER_PAINT and COGL_DEBUG environment variables. A lot of these options enable tracing of various sub-systems, but other options disable a piece of functionality so that you can eliminate it from the equation and decide whether it is the cause of a given problem.

To use these options you need to build Clutter with --enable-debug and --enable-cogl-debug. To get a feel for the Cogl debug options available, you can export COGL_DEBUG=help before running a Clutter application and you will see the following list:

     Supported debug values:
                     handle: debug ref counting issues for Cogl objects
                    slicing: debug the creation of texture slices
                      atlas: debug texture atlas management
              blend-strings: debug blend-string parsing
                    journal: view all geometry passing through the journal
                   batching: show how geometry is being batched in the journal
                   matrices: trace all matrix manipulation
                       draw: misc tracing of some drawing operations
                      pango: trace the pango renderer
             texture-pixmap: trace the Cogl texture pixmap backend
                 rectangles: add wire outlines for all rectangular geometry
           disable-batching: disable the journal batching
               disable-vbos: disable use of OpenGL vertex buffer objects
               disable-pbos: disable use of OpenGL pixel buffer objects
  disable-software-transform use the GPU to transform rectangular geometry
           dump-atlas-image: dump atlas changes to an image file
              disable-atlas: disable texture atlasing
          disable-texturing: disable texturing primitives
              disable-arbfp: disable use of ARBfp
               disable-glsl: disable use of GLSL
           disable-blending: disable use of blending
                show-source: show generated ARBfp/GLSL
                     opengl: traces some select OpenGL calls
                  offscreen: debug offscreen support

Of particular interest when debugging performance issues are the disable-XYZ options. If you can disable a broad range of functionality and understand how that affects your program, it can give some big hints about your problem.

/!\ TODO: explain in more detail what each option really does, and how you can expect them to affect a Clutter application.

Ideas about methodology

It's hard to define one true methodology for finding the root cause of all performance problems, so instead we'll try to collate some of the insight we have gained from profiling Clutter applications. Others probably have good ideas too; please add yours if they aren't currently covered.

Understand if you are CPU, GPU or IO bound

A good place to start when getting to the bottom of a performance issue is to understand whether it primarily relates to a CPU, GPU or IO bottleneck. A simple way to tell if you have a CPU bottleneck is just to run top and look at the CPU% column for your application. If top shows that your application isn't busy then you can expect that you have a GPU or IO problem.
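
The same check can be scripted by comparing the CPU time a process consumed with the wall clock time it ran for. The sketch below is one possible way to do that on a Unix-like system, not an official tool: `cpu_fraction` is a name invented here, and the sleeping child process is only a stand-in for your application's command line.

```python
import os
import subprocess
import sys
import time


def cpu_fraction(argv):
    """Run argv to completion and return (CPU seconds used) / (wall seconds elapsed)."""
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, check=False)
    wall = time.monotonic() - start
    after = os.times()
    # children_user/children_system cover child processes that have been waited for.
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return cpu / wall if wall > 0 else 0.0


if __name__ == "__main__":
    # Stand-in workload that only sleeps; replace it with your application's command.
    command = [sys.executable, "-c", "import time; time.sleep(0.5)"]
    print("CPU fraction: %.2f" % cpu_fraction(command))
```

A value close to 1.0 (or higher for a multi-threaded program) suggests a CPU bottleneck, while a value well below 1.0 means the process spent most of its time waiting on something else, such as the GPU or IO.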

Currently I don't know of a convenient way to determine if an application is IO bound vs GPU bound, but usually that can be guessed with some understanding of the application being debugged. At some point we should add a debug option that simply NOPs all rendering, which would give an easy way to eliminate the GPU from the equation, but we don't have that currently.

You should only use sysprof if you are CPU bound!

Are you performing redundant clears?

For immediate mode renderers, redundant clears can be a big problem. When Clutter paints the stage for each frame, the first thing it normally does is clear the stage to the current stage color. If your application draws over every pixel of the stage, though, this clear is redundant and just wastes resources. Clutter 1.4 introduced a new API for this case: clutter_stage_set_no_clear_hint. Use it to tell Clutter that you will be covering every pixel of the stage, so there's no need to do the clear.

Is your geometry being batched?

Most Clutter scenes are basically composed of lots of textured rectangles. For simple rectangles we implement some special purpose batching to reduce the number of OpenGL state changes and draw calls we make, because OpenGL drivers don't typically cope well with separate draw calls for such tiny primitives. The component that handles this batching is called the "Cogl Journal".

The Cogl Journal batching depends on rectangles being drawn sequentially with similar state (it won't re-order your geometry to achieve batching). If you submit 10 red rectangles (or more specifically 10 rectangles that use the same CoglMaterial state) then they will be batched into one vertex buffer and we will make just one draw call.
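
To see why submission order matters, here is a toy model of sequential batching. It is not Cogl's implementation: the real journal compares full CoglMaterial state and batches at several nested levels, whereas this sketch just treats each rectangle as a hypothetical material id and groups consecutive equal ids, with each group standing for one draw call.

```python
from itertools import groupby


def batch_runs(material_ids):
    """Group consecutive equal material ids; each run stands for one draw call."""
    return [(material, len(list(run))) for material, run in groupby(material_ids)]


if __name__ == "__main__":
    same_state = ["red"] * 10           # 10 rectangles sharing one material
    interleaved = ["red", "blue"] * 5   # same count, but alternating materials
    print(len(batch_runs(same_state)), "batch(es) for 10 red rectangles")
    print(len(batch_runs(interleaved)), "batch(es) for 10 interleaved rectangles")
```

Because nothing is re-ordered, the interleaved sequence breaks into one batch per rectangle, while sorting the same rectangles by material before submitting them would need only two.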

Some state changes immediately act as a synchronization point and result in the Journal being flushed. Examples are: use of CoglVertexBuffer (used for long runs of text), changes of depth testing state, changes of fog state, and switching to draw to a different framebuffer.

How do you find out if your rectangles are being batched? The easiest way is to run your application with COGL_DEBUG=rectangles exported in your environment. This enables a visual debugging mode that draws outlines around rectangles in different colors. Rectangles that are batched together will have the same color outline. Another option is to use COGL_DEBUG=batching. This debug option prints details about batching to the console. For test actors we see this:

BATCHING: journal len = 6
BATCHING:  vbo offset batch len = 6
BATCHING:   material batch len = 6
BATCHING:    modelview batch len = 6

which shows that the hands are drawn together. It wouldn't be so good if you saw lots of "len = 1" lines instead.
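
For a long run, counting "len = 1" lines by eye gets tedious, so it can help to summarise the output with a script. The sketch below assumes the console output has been saved as text and that every line has the same shape as the sample above; the regular expression and function names are assumptions made for this example, not part of Cogl.

```python
import re

# Assumed line shape, taken from the sample output above:
#   "BATCHING:   material batch len = 6"
BATCH_LINE = re.compile(r"^BATCHING:\s+(?P<kind>.+?) batch len = (?P<len>\d+)\s*$")


def batch_lengths(log_text):
    """Map each batch kind (e.g. 'material') to the list of batch lengths seen."""
    lengths = {}
    for line in log_text.splitlines():
        match = BATCH_LINE.match(line)
        if match:
            lengths.setdefault(match.group("kind"), []).append(int(match.group("len")))
    return lengths


def single_entry_share(lens):
    """Fraction of batches that contain only one entry (0.0 if there are none)."""
    return sum(1 for n in lens if n == 1) / len(lens) if lens else 0.0


SAMPLE = """\
BATCHING: journal len = 6
BATCHING:  vbo offset batch len = 6
BATCHING:   material batch len = 6
BATCHING:    modelview batch len = 6
"""

if __name__ == "__main__":
    for kind, lens in batch_lengths(SAMPLE).items():
        print("%s: %d batch(es), %.0f%% with len = 1"
              % (kind, len(lens), 100 * single_entry_share(lens)))
```

A high share of single-entry batches at the material level would point at frequent material changes between consecutive rectangles.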

See if you have extra journal flushes happening

In an ideal situation we would accumulate all the geometry for a full frame into the journal and only flush it once, before performing a swap-buffers operation to display the contents of your back buffer. Because some state changes aren't currently trackable through the journal (for example, changes to the stencil test functions), there are times when we forcibly flush the journal, and that may impact the performance of your application. One fairly straightforward way to investigate this is to attach to your program with GDB and set a breakpoint on _cogl_journal_flush. Each time the breakpoint is hit, look at the backtrace; hopefully you can figure out from that what is causing the extra flushes.

Eventually we expect that most state changes will be trackable through journal entries so that we can remove this problem but until then it's an important gotcha to look out for!

Understand that profiling applications running under a compositor is hard!

There are a number of facets that make profiling applications under a compositor hard, but one of the biggest problems to be aware of is that current compositors provide no synchronization between client and compositor rendering. This means there can be a big discrepancy between the frame rate of the compositor and that of your application. At one extreme, this can result in you running a client faster than 60fps and starving the compositor of the resources it needs to render at 60fps. Conversely, a compositor that is rendering too fast, or that requires a lot of resources to render, may starve your application of resources.

There is ongoing work to synchronize Clutter clients running under Mutter, but as of February 2013 this hasn't been finished yet.

Too many actors

Clutter will avoid drawing actors unnecessarily by tracking where on the screen each actor draws and skipping (culling) those that aren't visible. In order to do so, Clutter queries each actor's "paint volume". This isn't completely trivial, so it may be good to check that your actors report their paint volumes correctly and are skipped when completely hidden. This can be done by running with CLUTTER_DEBUG=clipping, which will tell you which actors are being culled or not, and why. If this isn't working as expected, the paint volumes can be visualized with CLUTTER_PAINT=paint-volumes.

Even when culling is working as expected, an excess of unneeded actors can still be a performance problem because of the overhead of allocation in containers, generic book-keeping and memory consumption.

Sometimes calculating the paint-volume of an actor can be too expensive and it may be more convenient for containers to skip hidden actors in their ::paint implementation and maybe in ::allocate.

Actor creation can be expensive as well. If you are going to have several hundreds of entities that can be expressed as actors, consider packing the data inside a ClutterListModel and using a container that fetches that data lazily.

Projects/Clutter/Profiling (last edited 2013-11-22 18:46:20 by WilliamJonMcCann)