I’ve been working hard on an OpenGL renderer. The main reason is that we want to be able to move away from DirectX and Windows, for oh so many reasons. However, one of the major problems is the handling of shaders. With DirectX, we can use the quite flexible FX framework, which lets us encapsulate shaders, render states and variables into a very neat package. From a content management perspective, this is extremely flexible since render states can be implemented as a separate subsystem. This is the reason why I’ve been developing the FX framework I’ve been talking about.

Well, it works, and as a result we now have a functioning OpenGL renderer. The only downside is that it's much slower than the DirectX renderer. I'm currently investigating whether this is driver related, shader related or just stupidly implemented in Nebula. However, it's identical in every other respect, meaning we've crossed one of the biggest thresholds on the way to a working version on Linux.

These are the results:

DirectX 11.0 version, ~60 FPS.

OpenGL 4.0 version, ~20 FPS.

It also just dawned on me that the “Toggle Benchmark with F3!” is not showing on the DirectX 11 version. Gotta look into that…

Anyway, this seriously got me thinking about what was actually taking the time. I made some improvements in the AnyFX API and reduced the number of GL calls from 63k per frame down to 16k, but it made no difference in FPS. What you see here is approximately 2500 models which are drawn individually (no instancing). The DirectX renderer only updates its constant buffers and renders, using the Effects11 framework as the backend for shader handling. The OpenGL renderer does the exact same thing: it only updates vital uniforms and uniform buffers, then renders. I used apitrace to see what was taking so much time. What I observed was that each draw call took about 20 microseconds, which multiplied by 2500 gives 0.05 seconds per frame, which amounts to 20 FPS. The calls that update the buffers take only a fraction of that time; however, the GPU waits an ENORMOUS amount of time before even starting the rendering process, as can be seen in this picture.


The blue lines describe the CPU load; the width of a line or section shows how long it takes. We can see that each call costs some CPU time. However, we can also see that we issue the rendering calls well before the GPU actually starts to get any work done. We can also very easily observe that the GPU isn't busy with anything else, so there is no apparent reason (as far as I can tell) why it doesn't start immediately.
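As a quick sanity check on the numbers from the trace, here is a small sketch of the arithmetic. It assumes the frame is dominated entirely by per-draw cost, which is a simplification, and the function names are made up for this post rather than taken from Nebula.

```cpp
#include <cassert>

// Estimated frame rate if every draw call costs the same fixed time and
// nothing else contributes to the frame. The inputs are the rough figures
// read off the trace, not precise measurements.
double estimatedFps(double microsecondsPerDraw, int drawCalls)
{
    assert(microsecondsPerDraw > 0.0 && drawCalls > 0);
    const double frameSeconds = microsecondsPerDraw * 1e-6 * drawCalls;
    return 1.0 / frameSeconds;
}

// The inverse question: how many microseconds may one draw call cost
// if we want to reach targetFps with the given number of draw calls?
double maxMicrosecondsPerDraw(double targetFps, int drawCalls)
{
    assert(targetFps > 0.0 && drawCalls > 0);
    return 1e6 / (targetFps * drawCalls);
}
```

With 20 microseconds per draw and 2500 draws, estimatedFps lands at about 20, matching what I see on screen. Turned around, reaching 60 FPS with the same 2500 draws would leave somewhere under 7 microseconds per draw.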

Crazy. We'll see if the performance is as low on Linux as it is on Windows. Keep in mind that this was done on an ATI card. On the Nvidia card I have access to, I got ~40 FPS, but even so, the performance fluctuated between ~20 FPS and ~40 FPS, seemingly at random. Weird. It might have something to do with SLI, but I'm not competent enough to say.

// Gustav


  1. Floh

    Hmm I haven't worked with uniform buffers yet (I'm more or less restricted to GLES2), but getting dynamic buffer updates to work without stalling the pipeline is more complicated in OpenGL than in D3D (e.g. simply doing a glBufferData or glBufferSubData update will stall if the buffer is currently used by the GPU). You might need to use "buffer orphaning", or manage double buffering yourself, or use glMapBuffer with whatever the GL equivalent of the D3D MapDiscard flag is. Did you try to put only static/constant shader state (e.g. material parameters) into uniform buffers (so they would be immutable), and update the dynamic parameters (mvp matrix) through traditional glUniformXXX() calls? I would assume that this would be fast, since 2500 draw calls per frame at 20 fps is really slow (e.g. there really must be something wrong). In Drakensang Online's GL renderer we also saw bad performance with dynamic vertex buffer updates for the UI, until we implemented buffer orphaning, which fixed the performance problem. Also you'll have to be aware that direct buffer content access via something like glMapBuffer will probably never be fast on platforms where the GL renderer is "far away" from the rendering code (e.g. WebGL on Chrome, where the GL rendering code lives in its own process). Cheers and keep up the good work 🙂 -Floh
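For readers who have not met the idiom, the "buffer orphaning" mentioned above can be sketched roughly as follows. This is only an illustration, not code from Nebula, AnyFX or Drakensang Online. To keep it compilable without a GL context, the two GL entry points are hidden behind a hypothetical RecordingBackend. A real backend would forward to glBufferData and glBufferSubData on the currently bound buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a bound GL buffer. It only records which calls
// were made, so the order of operations can be inspected without a driver.
struct RecordingBackend {
    std::vector<std::string> calls;

    // Real version: glBufferData(target, size, data, usage).
    void bufferData(std::size_t /*size*/, const void* data) {
        calls.emplace_back(data ? "bufferData(data)" : "bufferData(null)");
    }

    // Real version: glBufferSubData(target, offset, size, data).
    void bufferSubData(std::size_t /*offset*/, std::size_t /*size*/,
                       const void* /*data*/) {
        calls.emplace_back("bufferSubData");
    }
};

// Orphan the old storage, then upload into the fresh one.
template <typename Backend>
void orphanAndUpload(Backend& gl, std::size_t capacity,
                     const void* data, std::size_t size)
{
    assert(data != nullptr && size <= capacity);

    // Step 1: re-specify the whole store with a null pointer. The driver is
    // then free to hand out new storage while the GPU finishes reading the
    // old one, instead of stalling the CPU until the old storage is idle.
    gl.bufferData(capacity, nullptr);

    // Step 2: fill the new storage with this frame's data.
    gl.bufferSubData(0, size, data);
}
```

As far as I know, the closest GL counterpart to the D3D discard flag is glMapBufferRange with GL_MAP_WRITE_BIT combined with GL_MAP_INVALIDATE_BUFFER_BIT, which is intended to have a similar effect to the two-step version above.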

    • Gustav Sterbrant

      Thank you so much! I tried focusing on those bugging buffer updates, and I found that I was actually setting the ViewProjection matrices for each object just before every render. The thing about ViewProjection is that it doesn't really need to be updated more than once per frame, so I thought it would make sense to put it in a uniform block so that it could be shared across multiple shaders. The mistake I made was to copy the DX11 transform device, which applied all transforms to the current shader in ApplyModelTransforms(), which, come to think of it, makes absolutely no sense at all. Our underlying OpenGL effects library has no way of knowing whether the value it's been fed is identical to the one already in the buffer, so it would continuously update the entire block prior to each draw. Problem solved!
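The fix described here was to stop setting ViewProjection per object. A complementary guard, sketched below under my own assumptions, is to keep a CPU-side copy of the block and skip the upload when nothing has changed. ShadowedBlock and its methods are hypothetical names for illustration and are not part of AnyFX or Nebula. The upload callback stands in for whatever actually writes the GL buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical CPU-side shadow copy of one uniform block.
class ShadowedBlock {
public:
    explicit ShadowedBlock(std::size_t size) : shadow_(size, 0) {}

    // Write `size` bytes at `offset`. Returns true only if the bytes differ
    // from what the shadow copy already holds.
    bool set(std::size_t offset, const void* data, std::size_t size) {
        assert(data != nullptr && offset + size <= shadow_.size());
        if (std::memcmp(shadow_.data() + offset, data, size) == 0) {
            return false; // identical value, nothing to do
        }
        std::memcpy(shadow_.data() + offset, data, size);
        dirty_ = true;
        return true;
    }

    // Call once before a draw. `upload` is any callable taking
    // (const unsigned char*, std::size_t); it runs only when the block has
    // changed since the last flush. Returns true if an upload happened.
    template <typename Upload>
    bool flush(Upload&& upload) {
        if (!dirty_) {
            return false;
        }
        upload(shadow_.data(), shadow_.size());
        dirty_ = false;
        return true;
    }

private:
    std::vector<unsigned char> shadow_;
    bool dirty_ = true; // force one upload so the GPU copy starts in sync
};
```

With something like this in place, setting the same ViewProjection for 2500 objects would cost 2500 small memcmp calls on the CPU but only one buffer upload per frame. The trade-off is the extra memory for the shadow copy.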
