I’ve been working hard on an OpenGL renderer. The main reason is that we want to be able to move away from DirectX and Windows, for oh so many reasons. However, one of the major problems is the handling of shaders. With DirectX, we can use the quite flexible FX framework, which lets us encapsulate shaders, render states and variables into a very neat package. From a content management perspective, this is extremely flexible since render states can be implemented as a separate subsystem. This is the reason why I’ve been developing the FX framework I’ve been talking about.

Well, it works, and a result we now have a functioning OpenGL renderer. The only downside is that it’s extremely slower than the DirectX renderer. I’m currently investigating if this is driver related, shader related or stupidly implemented in Nebula. However, it’s identical in any other aspect, meaning we’ve crossed one of the biggest thresholds with getting a working version in Linux.

These are the results:

DirectX 11.0 version, 60 fps flat.

DirectX 11.0 version, ~60 FPS.

OpenGL 4.0 version, 20 fps flat.

OpenGL 4.0 version, ~20 FPS.

It also just dawned on me that the “Toggle Benchmark with F3!” is not showing on the DirectX 11 version. Gotta look into that…

Anyway, this seriously got me thinking what actually demanded time. I made some improvements in the AnyFX API and reduced the number of GL calls from 63k per frame down to 16k, but it made no difference in FPS. What you see here is approximately 2500 models which are individually drawn (no instancing). The DirectX renderer only updates its constant buffers and renders using the Effects11 framework as backend for shader handling. The OpenGL renderer does the exact same thing, only updates vital uniforms and uniform buffers and renders. As a matter of fact, I used apitrace to watch what was taking so much time. What I observed was that each draw call took about 20 microseconds, which multiplied with 2500 results in 0.05 seconds per frame which amounts to 20 fps. The methods for updating the buffer takes only a fraction of the time, however the GPU waits an ENORMOUS amount of time before even starting the rendering process, as can be seen in this picture.


The blue lines describe the CPU load, the width of the line or section determines the time it takes. We can see that each call costs some CPU time. However, we can also see that we start rendering way earlier than the actual GPU starts to get some work done. We can also very easily observe that GPU isn’t busy with anything, so there is no apparent reason (as far as I can tell) as to why it doesn’t start immediately.

Crazy. We’ll see if the performance is as low in Linux as it is in Windows. Have in mind that this is done on an ATI card. On the Nvidia-card I have access to, I got ~40 FPS, but even so, the performance waved between ~20 FPS and ~40 FPS seemingly at random. Weird. It can have something to do with SLI, but I’m not competent to say.

// Gustav

Tone mapping

While I’ve been working on AnyFX, I’ve also looked into tone mapping. Now, Nebula encodes and decodes pseudo-hdr by just down-scaling and up-scaling, something which works rather well in most cases. However, HDR was applied screen-wide and had no adaptation of brightness or color, and this is exactly what tone mapping solves.

I was a bit stunned of how easy it was to perform tone mapping. We need to downscale the color buffer down to a 2×2 image, and then calculate the average luminance value into a 1×1 texture. We then simply copy the 1×1 texture to be used the next frame.

One way to perform the downscaling could be a sequence of post effects which performs a simple 2×2 box average. However, since what we are doing is essentially mipmapping, we could just aswell generate mips instead. The only downside to this is that we generate one more level than we want, but it’s still much more efficient than having to run a series of consecutive downscale passes.

When the average luminance has been calculated, we just use that value with a tone mapping operator to perform the effect. We perform the eye adaptation part when we perform the 2×2 -> 1×1 downscale. The HLSL code for the operator is:

static const float g_fMiddleGrey = 0.6f;
static const float g_fMaxLuminance = 16.0f;</pre>
	Calculates HDR tone mapping
ToneMap(float4 vColor, float lumAvg, float4 luminance)
	// Calculate the luminance of the current pixel
	float fLumPixel = dot(vColor.rgb, luminance);
	// Apply the modified operator (Eq. 4)
	float fLumScaled = (fLumPixel * g_fMiddleGrey) / lumAvg;
	float fLumCompressed = (fLumScaled * (1 + (fLumScaled / (g_fMaxLuminance * g_fMaxLuminance)))) / (1 + fLumScaled);
	return float4(fLumCompressed * vColor.rgb, vColor.a);

We use a constant for the middle gray area and the maximum amount of luminance. This could be parametrized, but it’s not really necessary. We calculate the average luminance using the following kernel:

	Performs a 2x2 kernel downscale
psMain(float4 Position : SV_POSITION0,
	float2 UV : TEXCOORD0,
	out float result : SV_TARGET0)
	float2 pixelSize = GetPixelSize(ColorSource);
	float fAvg = 0.0f;

	// source should be a 512x512 texture, so we sample the 8'th mip of the texture
	float sample1 = dot(ColorSource.SampleLevel(DefaultSampler, UV + float2(0.5f, 0.5f) * pixelSize, 8), Luminance);
	float sample2 = dot(ColorSource.SampleLevel(DefaultSampler, UV + float2(0.5f, -0.5f) * pixelSize, 8),  Luminance);
	float sample3 = dot(ColorSource.SampleLevel(DefaultSampler, UV + float2(-0.5f, 0.5f) * pixelSize, 8), Luminance);
	float sample4 = dot(ColorSource.SampleLevel(DefaultSampler, UV + float2(-0.5f, -0.5f) * pixelSize, 8), Luminance);
	fAvg = (sample1+sample2+sample3+sample4) * 0.25f;

	float fAdaptedLum = PreviousLum.Sample(DefaultSampler, float2(0.5f, 0.5f));
	result = clamp(fAvg + (fAdaptedLum - fAvg) * ( 1 - pow( 0.98f, 30 * TimeDiff) ), 0.3, 1.0f);

The 0.98f can be adjusted to modify the speed of which the eye adaptation occurs. We can also adjust the factor with which we multiply the TimeDiff variable. Here we use 30, but we could use any value. Modifying either value will affect the speed of the adaptation.

Also, in order to make the pipeline more streamlined, we first downsample the color buffer (which has a screen-relative size) down to 512×512 before we perform the mipmap generation. This ensures us there will be a level with 2×2, and using simple math, we can calculate the mip level must be the 8th.

512×512 Level 0
256×256 Level 1
128×128 Level 2
64×64 Level 3
32×32 Level 4
16×16 Level 5
8×8 Level 6
4×4 Level 7
2×2 Level 8

So to conclude. First we perform a downscale from variable size down to 512×512. Then we perform a mipmap generation on the render target. Calculate the average luminance and use the time difference to blend between different levels of luminance. We then use the luminance value and perform the operator described above. In order for this to be properly handled, we blend both the bloom and the final result. If we perform bloom without the tonemapping, we will get an overabundance of bloom. The difference can be seen below:

Full tonemapping
Full tonemapping
Blur tonemapping disabled
Blur tonemapping disabled
Final tonemapping disabled
Final tonemapping disabled

// Gustav