Coherently persistent

Both OpenGL and DirectX have moved from sending data to the GPU as single values to sending values in a buffer. In DirectX 10+ this is forced on the user, and it’s flexible but also somewhat confusing. The idea is to have a block of memory which can be retained between executions, which I will admit is rather clever. By comparison, glUniform* writes a value into the state of one specific program object, so the value cannot be shared and has to be assigned again for every program that needs it. OpenGL 3.1 introduced another method alongside traditional uniforms, called uniform buffers. This is the same idea as the method seen in DirectX, however the performance of uniform buffers is abysmal on most drivers. To be honest, most types of buffer updates which require glBufferData or glBufferSubData are somewhat slow, even if we choose to orphan the current buffer using glInvalidateBufferData and use some multi-buffering method. The main reason is that data has to be flushed to the GPU whenever we make one of these calls, which means not only a lot of interaction with the driver, but also a need to synchronize.

Something very new and very cool in OpenGL is the ability to persistently map a buffer into CPU memory and have the GL push the data to the GPU when it’s required. This basically lets the driver decide when to synchronize the data. Pretty awesome, since this allows us to effectively queue draw calls. However, in order to avoid stomping data in flight, i.e. writing to a part of memory which has not yet been pushed and used, or which is currently being transferred, we must wait until the GPU is done with that segment of the buffer before we write to it again. This has been thoroughly discussed and demonstrated elsewhere, but for extra clarity I will explain how it is implemented in Nebula.

This is also how AnyFX handles variables, OpenGL uniform blocks and uniform buffers.

// get alignment
GLint alignment;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &alignment);

// calculate aligned size
this->alignedSize = (bufferSize + alignment - 1) - (bufferSize + alignment - 1) % alignment;

// setup
glGenBuffers(1, this->buffers);
glBindBuffer(GL_UNIFORM_BUFFER, this->buffers[0]);
GLenum flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_UNIFORM_BUFFER, this->alignedSize * this->numBackingBuffers, NULL, flags | GL_DYNAMIC_STORAGE_BIT);
this->glBuffer = (GLchar*)glMapBufferRange(GL_UNIFORM_BUFFER, 0, this->alignedSize * this->numBackingBuffers, flags);

The magic here is of course the new fancy GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT, together with glBufferStorage and glMapBufferRange. glBufferStorage lets us tell the GL ‘give me an immutable buffer with the given size’. Since it’s immutable we can’t change its size, which is possible with glBufferData. It’s also vitally important to round the size of each segment up to a multiple of the alignment, because every segment offset is later passed to glBindBufferRange, which requires offsets to be multiples of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. In my experience glMapBufferRange will otherwise return a null pointer and raise an invalid operation error.
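The rounding on the ‘calculate aligned size’ line can be checked in isolation. This is a minimal sketch, assuming two hypothetical helpers (AlignUp and SegmentOffset are not part of AnyFX); the alignment of 256 used when trying it out is only a typical value, the real one has to be queried through GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of AnyFX): round size up to the next
// multiple of alignment, using the same arithmetic as the setup code.
inline std::size_t AlignUp(std::size_t size, std::size_t alignment)
{
    assert(alignment > 0);
    const std::size_t padded = size + alignment - 1;
    return padded - padded % alignment;
}

// Hypothetical helper: byte offset of segment 'index' in a buffer made of
// equally sized, aligned segments.
inline std::size_t SegmentOffset(std::size_t index, std::size_t alignedSize)
{
    return index * alignedSize;
}
```

Because every segment has the aligned size, every segment offset is automatically a multiple of the alignment as well.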

AnyFX makes sure that every shader which somehow includes the same uniform buffer also uses the same backend, so we can basically share this buffer among all shader programs, which is nice.

Then, whenever we set a variable, we get this:

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::SetVariable(InternalEffectVariable* var, void* value)
{    
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, var->byteSize);
    this->isDirty = true;
}

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::SetVariableArray(InternalEffectVariable* var, void* value, size_t size)
{
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, size);
    this->isDirty = true;
}

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::SetVariableIndexed(InternalEffectVariable* var, void* value, unsigned i)
{
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset + i * var->byteSize);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, var->byteSize);
    this->isDirty = true;
}

The thing to note here is that data points into a buffer which is coherent between GPU and CPU. Basically, we just calculate offsets into the buffer and copy the data into the buffer at that offset. glBufferOffset is the byte offset of the segment to which we are currently writing. The functions LockBuffer and UnlockBuffer look like this:

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::LockBuffer()
{
    if (this->syncs[*this->ringIndex] == 0)
    {
        // the flags argument of glFenceSync is a bitfield and must be 0
        this->syncs[*this->ringIndex] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

        // traverse to next buffer
        *this->ringIndex = (*this->ringIndex + 1) % this->numBackingBuffers;
        *this->glBufferOffset = *this->ringIndex * this->alignedSize;
    }    
}

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::UnlockBuffer()
{
    // wait for sync
    if (this->syncs[*this->ringIndex] != 0)
    {
        GLbitfield waitFlags = 0;
        GLuint64 waitDuration = 0;
        do
        {
            GLenum result = glClientWaitSync(this->syncs[*this->ringIndex], waitFlags, waitDuration);
            if (result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED) break;

            waitFlags = GL_SYNC_FLUSH_COMMANDS_BIT;
            waitDuration = 1000000000;
        } 
        while (true);   

        glDeleteSync(this->syncs[*this->ringIndex]);
        this->syncs[*this->ringIndex] = 0;
    }
}

The ring index selects the segment of the multi-buffered backing storage we are currently working with, and wraps around when it reaches the last one. Basically, we have to make sure that a segment stays locked until the GL has had time to use it, so that we don’t modify data which is still in flight. Also note that we only move on to the next segment when we lock the buffer. This allows us to fill the following segment without waiting.
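To see how the ring behaves without a GL context, here is a CPU-only model of the same logic. It is just a sketch under assumptions: RingModel is a hypothetical type, a fence is reduced to a flag, and ‘waiting’ is only counted, whereas the real code uses glFenceSync, glClientWaitSync and glDeleteSync:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-only model of the ring of backing segments (hypothetical type).
// A real implementation stores GLsync objects; here a fence is just a flag.
struct RingModel
{
    std::size_t alignedSize;
    std::vector<bool> fenced;   // true while the GPU may still read the segment
    std::size_t ringIndex = 0;
    std::size_t waits = 0;      // how many times we had to wait on a fence

    RingModel(std::size_t segmentSize, std::size_t numSegments)
        : alignedSize(segmentSize), fenced(numSegments, false) {}

    // Byte offset of the segment we are currently writing to.
    std::size_t Offset() const { return ringIndex * alignedSize; }

    // Mirrors UnlockBuffer(): wait only if the current segment is fenced.
    void Unlock()
    {
        if (fenced[ringIndex])
        {
            ++waits;                    // stands in for glClientWaitSync + glDeleteSync
            fenced[ringIndex] = false;
        }
    }

    // Mirrors LockBuffer(): fence the current segment, then move to the next.
    void Lock()
    {
        if (!fenced[ringIndex])
        {
            fenced[ringIndex] = true;   // stands in for glFenceSync
            ringIndex = (ringIndex + 1) % fenced.size();
        }
    }
};
```

With three segments, the first three unlock/lock rounds never wait; only when the ring wraps around to segment 0 does Unlock find a fence.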

When we want to perform the draw, we do this just before the draw call:

 
//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::Commit()
{
    // bind buffer at the current position
    glBindBufferRange(GL_UNIFORM_BUFFER, this->uniformBlockBinding, this->buffers[0], *this->ringIndex * this->alignedSize, this->alignedSize);
}

glBindBufferRange is really fast, so we suffer almost no overhead doing this.

So in which sequence does all of this happen? Writes to a coherently mapped buffer are visible to the draw calls issued after them, so it is the draw call that reads a segment which we must wait for in order to avoid data corruption. This basically means that we have to call LockBuffer() after we do a draw call. So, in essence, the shading system has to handle three moments: when a variable is set, just before a draw call is performed, and after a draw call is done. So basically:

// Wait for current segment to be available
for each (buffer in program)
{
    UnlockBuffer(); // glClientWaitSync() + glDeleteSync()
    SetVariable();
    SetVariable();
    SetVariable();
    SetVariable();
    SetVariable();
    glBindBufferRange();
}
glDraw*();
for each (buffer in program)
{
    LockBuffer();   // glFenceSync() + Increase buffer offset to next buffer
}

This is nice, because we only lock and unlock the buffer if something has changed. If nothing is different, we just bind the buffer range and draw as normal. This also allows us to keep variables outside uniform buffers if we want. This could be useful if we have variables which are not shared, or which are already applied once (per pass variables for example).
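The ‘only if something has changed’ part can be modelled with the isDirty flag that SetVariable sets. The sketch below is an assumption about how that flag is consumed, not the actual AnyFX code; BlockModel, AfterDraw and Draw are hypothetical names, and GL calls are replaced by counters:

```cpp
#include <cassert>

// Hypothetical CPU-only model of one uniform block.
struct BlockModel
{
    bool isDirty = false;
    int fencesCreated = 0;   // counts would-be glFenceSync calls
    int binds = 0;           // counts would-be glBindBufferRange calls

    void SetVariable() { isDirty = true; }

    // Always cheap: bind the current segment before the draw.
    void Commit() { ++binds; }

    // After the draw: only fence (and advance) if the segment was written.
    void AfterDraw()
    {
        if (isDirty)
        {
            ++fencesCreated;
            isDirty = false;
        }
    }
};

// One draw call; 'changed' tells whether any variable was set for it.
inline void Draw(BlockModel& block, bool changed)
{
    if (changed) block.SetVariable();
    block.Commit();
    block.AfterDraw();
}
```

Three draws of which only the first changes a variable give three binds but a single fence.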

Now for per-frame buffers, we might want to lock and unlock manually, since per-frame variables don’t have to wait for the next draw call to finish, but only for the next frame. This is where this->manualLocking comes into play. With manual locking, we decide explicitly when a uniform buffer should be locked and unlocked. In Nebula, we do this:

    const Ptr<ShaderInstance>& shdInst = this->GetSharedShader();
    AnyFX::EffectVarblock* block = shdInst->GetAnyFXEffect()->GetVarblockByName("PerFrame");
    RenderDevice::Instance()->SetPerFrameBlock(block);

Then on BeginFrame in OGL4RenderDevice:

    // unlock per-frame buffer
    this->perFrameBlock->UnlockBuffer();

Lastly on EndFrame in OGL4RenderDevice:

    // lock per-frame buffer
    this->perFrameBlock->LockBuffer();

Remember that since blocks can be shared, we can use the good old shared shader which also contains all uniform buffers shared by other shaders.

AnyFX has been extended to allow for selecting how big the buffer should be, in multiples. Example:

shared buffers=2048 varblock PerObject
{
	mat4 Model;
	mat4 InvModel;
	int ObjectId;
};

This means we have a uniform buffer (or constant buffer in DX) which is backed 2048 times 😀. This might seem excessive, but it’s not really that bad considering we have ONE of these for the entire application (using the qualifier shared), and the data only amounts to about 270 kB (132 bytes × 2048), somewhat more once every segment is padded to the offset alignment. It allows us to perform some 2048 draw calls before we actually have to wait to render a new object, which is nice.
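The memory estimate is easy to reproduce. A small sketch, assuming 4-byte floats and ints and, for the padded case, an offset alignment of 256 bytes (a common value, but driver dependent); the names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Payload of PerObject: two mat4 plus one int.
constexpr std::size_t kMat4Bytes = 16 * sizeof(float);
constexpr std::size_t kPayloadBytes = 2 * kMat4Bytes + sizeof(int);

// Round size up to the next multiple of alignment.
constexpr std::size_t RoundUpToMultiple(std::size_t size, std::size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}

// Total storage for 'copies' segments of 'segmentBytes' each.
constexpr std::size_t TotalBytes(std::size_t segmentBytes, std::size_t copies)
{
    return segmentBytes * copies;
}
```

TotalBytes(kPayloadBytes, 2048) gives the unpadded figure, while TotalBytes(RoundUpToMultiple(kPayloadBytes, 256), 2048) gives what the buffer would occupy with a 256-byte alignment, which is still well under a megabyte.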

The performance increase from this is tremendous. It’s so fast in fact that if I don’t call glFinish() each frame to synchronize with the CPU, the view matrix doesn’t keep up with the current frame. However, calling glFinish() proved to be bad for performance, since some calls may take A LOT of time for the GPU to finish, like huge geometry and instanced batches. Syncing effectively stops us from putting new commands in the GL command queue (because we are waiting for the GPU to finish everything), and the command queue should never stop being fed if one is aiming for performance.

All modules which stream data to the GL have been implemented to use this method, including the text renderer and the shape renderer for spontaneous geometry. Next up is to convert the particle system to utilize this feature too. After that I think we will look into bindless textures.

We are still very much alive.

Subroutines and suboptimal solutions

I managed to remove the render thread, getting rid of the client/server side object structure which was vital in order to keep the rendering in its own thread. What happened then was that we gained a significant boost in performance, probably because the overhead required for syncing was greater than the performance we gained from threading.

I’ve then been investigating how to extend AnyFX to support some of the stuff I left out in version 1.0, namely shader storage buffers (or RWBuffer in DirectX) and dynamically linked shader functions. The work is currently ongoing, however the syntax for the new buffer and the dynamically linked functions is already in place. Behold!

// declare subroutine 'interface'
prototype vec4 colorMethod(vec2 UV);

// declare implementation of colorMethod which produces static color
subroutine (colorMethod) vec4 staticColorMethod(vec2 UV)
{
   return vec4(1);
}

// declare implementation of colorMethod which produces textured color
subroutine (colorMethod) vec4 texturedColorMethod(vec2 UV)
{
    return texture(SomeSampler, UV);
}

colorMethod dynamicColorMethodVariable;

The dynamicColorMethodVariable then works as a special variable, meaning there is no way to change it using the AnyFX API. The syntax for defining a program previously looked something like this:

program SomeProgram
{
    vs = SomeVertexShader();
}

However, the shader binding syntax now accepts arguments to the shader, in the form of subroutine bindings. For example:

program SomeProgram
{
    vs = SomeVertexShader(dynamicColorMethodVariable = texturedColorMethod);
}

This would bind the vertex shader SomeVertexShader and bind the dynamicColorMethodVariable subroutine variable to be the one that uses a texture. This allows us to create programs which are just marginally different from other programs, and allows us to perform an ‘incremental’ change of the program state, compared to exchanging the whole program object each time we want some variation. The only problem is that Nebula doesn’t really have any concept of knowing whether an incremental shader change is possible or not.

So here comes yet another interesting concept: what if we were to sort materials (which are already sorted based on batch) by variation? Consider the following illustration:

FlatGeometryLit -
                |--- Static
                |--- VertexColor
                |--- Multilayered
                |--- LightmappedLit
                |--- Shiny
                |--- Organic
                |--- Animated
                |--- Skinned
                |--- Skin
                |--- SkinnedShiny
                |--- Foliage

This is the order in which they will be rendered, however they are all opaque geometry, so they might render in any order within this list. However, the change between, let’s say, Static, Shiny and Animated is actually not that much, just a couple of lines of shader code. There is no linkage difference between the shaders, and as such they can use the same shader, but with different subroutine sets! If we were to sort this list based on ‘change’, we would probably end up with something like this:

FlatGeometryLit -
                |--- Static
                |--- Shiny
                |--- Animated
                |--- Foliage
                |--- Organic
                |--- VertexColor
                |--- Multilayered
                |--- LightmappedLit
                |--- Skinned
                |--- Skin
                |--- SkinnedShiny

This is because most of these shaders share the same number of vertex shader inputs, or pixel shader outputs. However, if we simply implement the shaders to have equal functions, then AnyFX could figure out which programs are duplicates of others, and then simply tell us which materials should actually apply their program, and which materials are subordinate and thus only require an incremental update. What we will end up with is a sorted list of materials, where the first ‘unique’ material will be dominant, and the others will be incremental. The list would look like this:

FlatGeometryLit -
                |--- Static         -- dominant
                |--- Shiny          -- incremental
                |--- Animated       -- incremental
                |--- Foliage        -- incremental
                |--- Organic        -- incremental
                |--- VertexColor    -- dominant  (introduces vertex colors in vertex layout, cannot be subroutined)
                |--- Multilayered   -- incremental
                |--- LightmappedLit -- dominant  (introduces secondary UV set in vertex layout, cannot be subroutined)
                |--- Skinned        -- dominant  (introduces skin weights and joint indices in vertex layout, cannot be subroutined)
                |--- SkinnedShiny   -- incremental
                |--- Skin           -- incremental

As we can see here, every time we encounter an incremental material, we can simply perform a smaller update rather than set the entire shader program, which will probably spare us some performance if we have lots of variation in materials. This table only shows the base materials for a specific batch. However, the algorithm would sort all batches in this manner in order to make the entire pipeline reduce its API heavy calls. This is probably not a performance issue right now, seeing as we have a rather small set of materials per batch type. However, consider a game with lots of custom made shader derivatives. Currently, these derivatives would more or less have a copy of the shader code of some base shader, and then apply the program prior to each group of objects with that shader.
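The dominant/incremental marking can be sketched as a single pass over the sorted list. This is only an illustration under assumptions: Material, vertexLayout and ClassifyMaterials are hypothetical names, and the rule ‘same vertex layout key means the program can be shared’ is my simplification of what AnyFX would have to figure out:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical material record. Materials with an equal vertexLayout key are
// assumed to be able to share one program and differ only in subroutines.
struct Material
{
    std::string name;
    std::string vertexLayout;
    bool dominant;
};

// Expects 'sorted' to be ordered so that materials sharing a layout are
// adjacent. The first material of each run is marked dominant (full program
// switch); the rest of the run only needs an incremental update.
inline void ClassifyMaterials(std::vector<Material>& sorted)
{
    for (std::size_t i = 0; i < sorted.size(); ++i)
    {
        sorted[i].dominant =
            (i == 0) || (sorted[i].vertexLayout != sorted[i - 1].vertexLayout);
    }
}
```

When rendering, a dominant material would bind its program, while an incremental one would only change the subroutine bindings.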

The next thing to tackle on the list is getting shader storage blocks working. The syntax for these babies is also already defined, but they are only implemented as stubs. The shader storage block counterpart in AnyFX is called varbuffer. As opposed to varblock, the varbuffer allows for application control of the internal buffer storage, meaning we can retrieve its handle and read/write data from it as we please. We can also attach the buffer to some other part of the pipeline, which requires information that resides outside the scope of AnyFX. Also, varbuffers support member arrays of undetermined size! As such, a varbuffer will have some way of allocating a buffer with a dimension set from the application side. Consider the following:

struct ObjectVariables
{
   float MatEmissiveIntensity;
   float MatSpecularIntensity;
   mat4 Model;
};

varbuffer SomeBuffer
{
   ObjectVariables vars[];
};

This creates a buffer which contains variables per object rendered. From the AnyFX API we can then tell the varbuffer to allocate a backend buffer of a given size, which can be used to, for example, perform bulk rendering with full per-object variation using glMultiDraw*. The only issue with this is that AnyFX usually handles variables as objects which one can retrieve and simply set, but in this case a variable would live inside a struct in an array, and is thus not something which is publicly available. We could solve the same problem using the already existing varblock syntax with just a set of fixed-size arrays of variables. However, shader storage blocks (varbuffer) have a much bigger guaranteed size: an implementation must support at least 16 MB, compared to 16 KB for uniform blocks (varblock), meaning that with varblocks we cannot have as much data per multi draw as we can with varbuffers.
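On the application side, filling such a buffer means mirroring the struct layout on the CPU. The sketch below assumes the std430 layout rules, where a mat4 is aligned to 16 bytes, so the two floats are followed by 8 bytes of padding; the padding member and PackObjects are hypothetical and not part of AnyFX:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// CPU mirror of the ObjectVariables struct, assuming std430 layout.
struct alignas(16) ObjectVariables
{
    float MatEmissiveIntensity;
    float MatSpecularIntensity;
    float padding[2];   // explicit padding so Model starts at byte 16
    float Model[16];    // mat4, column-major
};

static_assert(sizeof(ObjectVariables) == 80, "unexpected struct size");
static_assert(offsetof(ObjectVariables, Model) == 16, "Model must start at byte 16");

// Packs all objects into one contiguous byte array, ready to be copied into
// the storage buffer in a single upload.
inline std::vector<unsigned char> PackObjects(const std::vector<ObjectVariables>& objects)
{
    std::vector<unsigned char> bytes(objects.size() * sizeof(ObjectVariables));
    if (!objects.empty())
    {
        std::memcpy(bytes.data(), objects.data(), bytes.size());
    }
    return bytes;
}
```

The shader would then index vars[] with some per-draw index to fetch the variables of the object being drawn.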

This is totally worth looking into, seeing as it would (probably) enable a much faster execution rate of draw calls: we can pack probably every single object with the same shader in the scene into one single glMultiDraw*. However, it will probably not work with the current approach of setting a value through a variable. Instead it will need some code which gathers up a bunch of objects and their variables, packs them into a buffer, and then renders everything. More on that when the subroutines are working!

// Gustav

Oh render thread, why art thou so hard to reach?

Recently, I’ve discovered some issues regarding unstable frame rates, as well as a couple of glitches related to the render thread. One of the major issues with having ALL the rendering in a fat separate thread is that all communication to and from the render thread has to go through the client-server side ‘interface’, meaning that a ModelEntity communicates with its internal counterpart InternalModelEntity to pass information back and forth. This method works really well, since the ModelEntity and the InternalModelEntity never have to be in perfect sync.

However, we have encountered problems where we actually want something from the render thread RIGHT NOW, meaning we block the game thread to wait for render thread data. Take this example code:

Ptr<Graphics::ItemAtPosition> msg = Graphics::ItemAtPosition::Create();
msg->SetPosition(mousePos);
Graphics::GraphicsInterface::Instance()->SendWait(msg.upcast<Messaging::Message>());

// get by id
const Ptr<Game::Entity>& entity = BaseGameFeature::EntityManager::Instance()->GetEntityByUniqueId(msg->GetItem());

This requires the game thread to basically wait for the render thread to handle the message before it can continue. Well, this is good and all, since the render thread executes in parallel. However, this cannot be done while we are running the game. In order to avoid several synchronizations with the render thread in the same game frame, the application goes into a lockstep mode, meaning the game thread basically waits for the render thread to finish, then performs a rendezvous and synchronizes. This means both threads must arrive at the same ‘position’ for a sync to take place, so we cannot possibly block either thread during the lockstep phase! So, in pure catch-22 fashion, we cannot run the above code if we are in lockstep, and if we’re not in lockstep we get random synchronizations with the render thread which screw up our main thread timings.

Now this is just the most recent problem. We’ve continuously had problems with the render thread, so we thought, hey, why not rethink the whole fat render thread concept?! The idea is to only make low-level OpenGL/DX calls on separate thread(s) and have all the previously internal render stuff straight on the main thread. Since Nebula already uses jobs to handle much of the computationally heavy pre-render processing such as culling and skeletal animations, we shouldn’t really lose that much performance (I hope). Also, if we can utilize multiple threads to execute draw calls and such, we should be in really good form, since the CPU heavy API calls will be run in a thread which is not related to the main loop.

If we take a look at SUI, we have sort of the same problem. The way we communicate with SUI, which is on the render thread, is by passing messages to it from the main thread, which is very clunky and greatly inhibits control. This will also be solved if we implement rendering so that it does not reside entirely within a separate fat thread.

As an addition to this, I am planning on looking into the new stuff posted at GDC to reduce OpenGL drawcall overhead. Explanation of this can be found here: http://blogs.nvidia.com/blog/2014/03/20/opengl-gdc2014/

Basically, they explain that the Mantle counterpart of OpenGL already exists, although it’s not widely known. They explain how GPU buffers can be persistently mapped to CPU memory and then synchronized once per frame, instead of the usual glMapBuffer/glUnmapBuffer pair which forces a synchronization each time. They also explain how per-object variables can be stored in buffers and used with indirect drawing, instead of going through the glUniform-syntax. Basically, you create a uniform buffer or shader storage buffer which contains the variables for each object, update it, and then just fetch the variables on a per-object basis. This allows you to buffer ONCE per frame, and then simply tell the GPU to render. If you want to get really fancy, you can even create a draw command buffer much like you do with uniforms, and just tell the GL to render X objects with this setup. We can then create worker threads, one for uniform updates and one for drawing, and then just call glMultiDrawElementsIndirect from the main thread to execute the batch.
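The draw command buffer mentioned above is just an array of small structs. The struct below follows the command layout that the OpenGL specification defines for glMultiDrawElementsIndirect; everything else is a sketch under assumptions: MeshRange and BuildCommands are hypothetical, and using baseInstance to carry the object index is only one possible convention:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One command as read by glMultiDrawElementsIndirect (five 32-bit fields).
struct DrawElementsIndirectCommand
{
    std::uint32_t count;          // number of indices
    std::uint32_t instanceCount;  // number of instances
    std::uint32_t firstIndex;     // offset into the index buffer, in indices
    std::int32_t  baseVertex;     // value added to every index
    std::uint32_t baseInstance;   // first instance id
};

static_assert(sizeof(DrawElementsIndirectCommand) == 20, "commands must be tightly packed");

// Hypothetical description of where a mesh lives in shared vertex/index buffers.
struct MeshRange
{
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// Builds one command per mesh. baseInstance is set to the object index so a
// shader could use it to look up per-object variables (assumed convention).
inline std::vector<DrawElementsIndirectCommand> BuildCommands(const std::vector<MeshRange>& meshes)
{
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(meshes.size());
    for (std::size_t i = 0; i < meshes.size(); ++i)
    {
        DrawElementsIndirectCommand cmd;
        cmd.count = meshes[i].indexCount;
        cmd.instanceCount = 1;
        cmd.firstIndex = meshes[i].firstIndex;
        cmd.baseVertex = meshes[i].baseVertex;
        cmd.baseInstance = static_cast<std::uint32_t>(i);
        commands.push_back(cmd);
    }
    return commands;
}
```

The resulting array would be copied into a buffer bound to GL_DRAW_INDIRECT_BUFFER before the multi draw call is issued.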

So basically, instead of having the thread border somewhere in the middle between the game and render parts, we instead push the thread border to the absolute limit, meaning we have light threads on the backend which just do the CPU heavy API calls. It should be fairly easy to see if the threading offloads some main thread work and thus gives us a performance boost, or if the overhead of syncing with the threads takes more time than they give back.

// Gustav

Physically based lighting

Seeing as we’re aiming for a bleeding edge engine, there is no need to skip out on anything. A little bird whispered in my ear that there are other ways of performing lighting than the standard simple Blinn-Phong method commonly used, and since we’re on a pretty flexible budget when it comes to graphics performance, I thought I should give it a good looksie.

Physically based lighting basically takes more into account than regular lighting. It also provides a more ‘real’ representation of the world in terms of reflective light (albedo) and surface roughness/gloss. Couple that with the original cheat called normal maps and you’ve got yourself some pretty good-looking effects. Basically, all materials have been given a new roughness map which allows a graphics artist to author the surface complexity of a model. This allows lighting to properly respond to the surface instead of just applying a uniform specular reflectiveness. The shader code (mostly taken and translated from http://www.altdevblogaday.com/2011/08/23/shader-code-for-physically-based-lighting/) looks like this:

// normalized Blinn-Phong distribution, roughness acts as the specular power
float normalizationTerm = (roughness + 2.0f) / 8.0f;
float blinnPhong = pow(NH, roughness);
float specularTerm = normalizationTerm * blinnPhong;
float cosineTerm = NL;

// Schlick's Fresnel approximation
float base = 1.0f - HL;
float exponent = pow(base, 5.0f);
vec3 fresnelTerm = specColor.rgb + ( 1.0f - specColor.rgb ) * exponent;

// visibility term
float alpha = 1.0f / ( sqrt ( (PI / 4.0f) * roughness + (PI / 2.0f)) );
float visibilityTerm = (NL * (1.0f - alpha) + alpha ) * ( NV * ( 1.0f - alpha ) + alpha );
visibilityTerm = 1.0f / visibilityTerm;

// final specular contribution, clamped to [0, 1]
vec3 spec = clamp(specularTerm * cosineTerm * fresnelTerm * visibilityTerm, 0.0f, 1.0f) * lightColor.xyz;

As you can see, this code is way more complex than the standard formulae. What you can see here is that instead of using a constant value for specular power, we use the roughness. This allows us to have a per-pixel roughness authored by a graphics artist. The only downside to this is that roughness is somewhat unintuitive in terms of encoding/decoding. To decode roughness, which is stored as a value in the range [0..1], I use this formula (taken from Physically-based Lighting in Call Of Duty: Black Ops):

float specPower = exp2(10 * specColor.a + 1);

With specColor.a in the range [0..1], this puts our specular power in the range [2, 2048]. Since our method uses the Blinn-Phong distribution, the specular power lies far outside the range [0..1], but it is easier to compute than the more advanced yet more intuitive Beckmann distribution (which actually operates in the range [0..1]). The result can be seen in the picture below:

Screenshot from 2013-12-12 13:25:13

Note the specular light given off by the local lights, which was previously non-existent.
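The decoding formula and the normalization term from the shader can be tried out on the CPU to get a feeling for the numbers. This is a plain C++ transcription for illustration only; the function names are hypothetical:

```cpp
#include <cassert>
#include <cmath>

// Decodes a stored value in [0, 1] into a Blinn-Phong exponent, using the
// same formula as the shader line: exp2(10 * a + 1).
inline float DecodeSpecPower(float stored)
{
    assert(stored >= 0.0f && stored <= 1.0f);
    return std::exp2(10.0f * stored + 1.0f);
}

// Normalization factor from the lighting code: (n + 2) / 8.
inline float NormalizationTerm(float specPower)
{
    return (specPower + 2.0f) / 8.0f;
}
```

The stored values 0, 0.5 and 1 decode to exponents of 2, 64 and 2048, so each step of 0.1 in the texture doubles the specular power.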

A part of performing physically based lighting is to also use reflections and proper ‘roughing’ of the reflections. Reflections affect both specular light (since it’s actually a reflection, go figure) and the final color of the surface. To account for this in our completely deferred renderer, the environment maps on reflective objects take roughness into account and select a specific mip level in the environment map based on the roughness. The awesome cubemapgen tool (https://code.google.com/p/cubemapgen/) can take an ordinary cube map and generate mips where each mip is a BRDF approximation (actually there are several different algorithms, but for the sake of clarity we’ll stick to Blinn-Phong BRDFs). We can also tell it to generate each new mip using a glossiness falloff, resulting in a very good-looking mip chain for our cube maps.
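How roughness picks a mip level is not spelled out here, so the following is only a guess at the simplest possible mapping: a linear one where roughness 0 selects the sharpest mip and roughness 1 the last one. RoughnessToMip is a hypothetical helper; in a shader the result would typically be passed to textureLod:

```cpp
#include <algorithm>
#include <cassert>

// Assumed linear mapping from roughness in [0, 1] to a (fractional) mip
// level in [0, mipCount - 1]. Out-of-range roughness is clamped.
inline float RoughnessToMip(float roughness, int mipCount)
{
    assert(mipCount > 0);
    const float clamped = std::min(std::max(roughness, 0.0f), 1.0f);
    return clamped * static_cast<float>(mipCount - 1);
}
```

A fractional result is fine, since trilinear filtering blends between the two neighbouring mips.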

You may have come across this image http://seblagarde.files.wordpress.com/2011/07/reference_top_ref_bottom_mipchain.jpg showing a series of cube maps with different levels of reflectiveness, which is exactly what we are doing and what we want. Just to clarify, this is all precomputed using an original cube map and is not done in real-time! The more interesting part is that what is visible in your environment cube map is irrelevant. What is relevant is that the average color of the cube map fits your scene in terms of colors and lighting. In the pictures below, we have the same model ranging in roughness from 0 to 1.

Roughness set to 0.0

Roughness set to 0.5

Roughness set to 1.0

As you can see, the roughness changes the surface look of the object dramatically, although it still uses the exact same shader. Also note that using the skydome cube map as the reflective cube map is a bit ugly since it’s half bright, half dark. That’s all!

 

// Gustav

OpenGL

I’ve been working hard on an OpenGL renderer. The main reason is that we want to be able to move away from DirectX and Windows, for oh so many reasons. However, one of the major problems is the handling of shaders. With DirectX, we can use the quite flexible FX framework, which lets us encapsulate shaders, render states and variables into a very neat package. From a content management perspective, this is extremely flexible since render states can be implemented as a separate subsystem. This is the reason why I’ve been developing the FX framework I’ve been talking about.

Well, it works, and as a result we now have a functioning OpenGL renderer. The only downside is that it’s much slower than the DirectX renderer. I’m currently investigating if this is driver related, shader related or stupidly implemented in Nebula. However, it’s identical in every other aspect, meaning we’ve crossed one of the biggest thresholds in getting a working version on Linux.

These are the results:

DirectX 11.0 version, 60 fps flat.

OpenGL 4.0 version, 20 fps flat.

It also just dawned on me that the “Toggle Benchmark with F3!” is not showing on the DirectX 11 version. Gotta look into that…

Anyway, this seriously got me thinking about what was actually taking time. I made some improvements in the AnyFX API and reduced the number of GL calls from 63k per frame down to 16k, but it made no difference in FPS. What you see here is approximately 2500 models which are individually drawn (no instancing). The DirectX renderer only updates its constant buffers and renders using the Effects11 framework as backend for shader handling. The OpenGL renderer does the exact same thing: it only updates vital uniforms and uniform buffers and renders. I used apitrace to see what was taking so much time. What I observed was that each draw call took about 20 microseconds, which multiplied by 2500 results in 0.05 seconds per frame, which amounts to 20 fps. The methods for updating the buffer take only a fraction of the time, however the GPU waits an ENORMOUS amount of time before even starting the rendering process, as can be seen in this picture.

glprofile

The blue lines describe the CPU load; the width of a line or section shows the time it takes. We can see that each call costs some CPU time. However, we can also see that we start issuing rendering calls way earlier than the GPU actually starts to get some work done. We can also very easily observe that the GPU isn’t busy with anything, so there is no apparent reason (as far as I can tell) as to why it doesn’t start immediately.
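The estimate of 20 microseconds per draw call translates into a frame rate with a bit of arithmetic. A tiny sketch with hypothetical helper names, assuming every call costs the same and nothing else takes time:

```cpp
#include <cassert>

// Frame time in seconds when every draw call costs the same CPU time.
inline double FrameSeconds(int drawCalls, double microsecondsPerCall)
{
    return drawCalls * microsecondsPerCall * 1e-6;
}

// Frames per second for a given frame time.
inline double FramesPerSecond(double frameSeconds)
{
    assert(frameSeconds > 0.0);
    return 1.0 / frameSeconds;
}
```

FrameSeconds(2500, 20.0) comes out at about 0.05 seconds, which FramesPerSecond turns into roughly 20, so the per-call cost alone is enough to explain the measured frame rate.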

Crazy. We’ll see if the performance is as low in Linux as it is in Windows. Keep in mind that this is done on an ATI card. On the Nvidia card I have access to, I got ~40 FPS, but even so, the performance fluctuated between ~20 FPS and ~40 FPS seemingly at random. Weird. It may have something to do with SLI, but I’m not competent to say.

// Gustav

Water

Nebula has been missing something for a very long time. When I started working with Nebula something like 3 years ago, we had simple UV-animated alpha planes which were supposed to look like water. And for their time, they looked really good.

However, today a simple uv-animated alpha plane won’t quite cut it. Instead, we need something fancier, incorporating refraction, reflection and specular highlighting. I have been doing exactly this for the past day. The result is beautiful and realistic-ish water (let’s face it, water in real life is rather boring). Picture time!

Refraction

Reflection and specularity

 

However, I’ve been a bit lazy with the reflections in our implementation. The GPU Gems article from Nvidia shows that we should render the scene from below the water plane in order to get correct reflections. Instead, I simply cheat and use the already lit and rendered image as a reflection. This makes the reflections completely wrong, but it still looks good…
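For reference, ‘rendering from below the water plane’ usually boils down to mirroring the camera across the plane. This is not what our implementation does; it is only a sketch of that idea, assuming a horizontal water plane at height waterHeight along the Y axis, with hypothetical names:

```cpp
#include <cassert>

struct Vec3
{
    float x, y, z;
};

// Mirrors a point across the horizontal plane y = waterHeight.
// A camera position mirrored this way ends up below the water surface.
inline Vec3 MirrorAcrossWater(const Vec3& p, float waterHeight)
{
    return Vec3{ p.x, 2.0f * waterHeight - p.y, p.z };
}
```

A camera at (1, 5, 2) above water at height 2 would be mirrored to (1, -1, 2); the view direction would have to be mirrored the same way.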

Sometime in the future, I might write a new frame shader which cheaply renders the objects being reflected without all the fancy stuff like SSS, HBAO and multiple light shading, so as to give decent-looking geometry for reflections. For the time being, though, this serves nicely. I also have an alternative shader which renders the reflections using a pre-rendered environment map, which may look good when using water in small local areas where real-time reflections are easily overlooked.

The water is fully customizable, with reflection intensity, water color, deep water color and of course reflection map. The reflection map is supposed to be a pre-rendered cube map of the scene, so that reflections can be done without rendering everything twice. One can select whether to cheat, or to use an environment map.

I’ve also written a billboard rendering system, which basically just lets us render textures to billboards which is very useful for the level editor and other such tools. This is a crude representation of how it can look:

Lights represented as billboards

With actual icons, we can neatly show billboards instead of geometry to represent such things as spotlights, pointlights, the global light, and any other such entities which can’t, or shouldn’t, be represented with geometry.
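The usual way to make a quad face the camera is to build it from the camera’s right and up vectors. The sketch below shows that general technique and is not the Nebula billboard code; Vec3, Add, Scale and BillboardCorners are hypothetical names, and the camera vectors are assumed to be normalized:

```cpp
#include <array>
#include <cassert>

struct Vec3
{
    float x, y, z;
};

inline Vec3 Add(const Vec3& a, const Vec3& b)
{
    return Vec3{ a.x + b.x, a.y + b.y, a.z + b.z };
}

inline Vec3 Scale(const Vec3& v, float s)
{
    return Vec3{ v.x * s, v.y * s, v.z * s };
}

// Returns the corners of a camera-facing square of side 'size' around
// 'center', in the order bottom-left, bottom-right, top-right, top-left.
inline std::array<Vec3, 4> BillboardCorners(const Vec3& center, const Vec3& cameraRight,
                                            const Vec3& cameraUp, float size)
{
    assert(size > 0.0f);
    const Vec3 r = Scale(cameraRight, size * 0.5f);
    const Vec3 u = Scale(cameraUp, size * 0.5f);
    std::array<Vec3, 4> corners;
    corners[0] = Add(center, Add(Scale(r, -1.0f), Scale(u, -1.0f)));
    corners[1] = Add(center, Add(r, Scale(u, -1.0f)));
    corners[2] = Add(center, Add(r, u));
    corners[3] = Add(center, Add(Scale(r, -1.0f), u));
    return corners;
}
```

Since the corners are rebuilt from the camera vectors, the icon always faces the viewer regardless of where the light entity is placed.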

Next thing on the list is terrain rendering, so stay tuned!

 

// Gustav