Subroutines and suboptimal solutions

I managed to remove the render thread, getting rid of the client/server side object structure that was needed to keep the rendering in its own thread. The result was a significant boost in performance, probably because the overhead required for syncing was greater than whatever the extra thread gained us.

I’ve then been investigating how to extend AnyFX to support some of the stuff I left out in version 1.0, namely shader storage buffers (RWBuffer in DirectX) and dynamically linked shader functions. The work is currently ongoing, but the syntax for the new buffers and the dynamically linked functions is already in place. Behold!

// declare subroutine 'interface'
prototype vec4 colorMethod(vec2 UV);

// declare implementation of colorMethod which produces static color
subroutine (colorMethod) vec4 staticColorMethod(vec2 UV)
{
    return vec4(1);
}

// declare implementation of colorMethod which produces textured color
subroutine (colorMethod) vec4 texturedColorMethod(vec2 UV)
{
    return texture(SomeSampler, UV);
}

// declare a 'subroutine variable' of type colorMethod
colorMethod dynamicColorMethodVariable;

The dynamicColorMethodVariable then works as a special variable, meaning it cannot be set through the ordinary AnyFX variable API; instead, it gets bound when defining a program. The syntax for defining a program previously looked something like this:

program SomeProgram
{
    vs = SomeVertexShader();
}

However, the shader binding syntax now accepts arguments to the shader, in the form of subroutine bindings. For example:

program SomeProgram
{
    vs = SomeVertexShader(dynamicColorMethodVariable = texturedColorMethod);
}

This would use the vertex shader SomeVertexShader and bind the dynamicColorMethodVariable subroutine variable to the implementation that samples a texture. It lets us create programs which are only marginally different from other programs, and lets us perform an 'incremental' change of the program state instead of exchanging the whole program object each time we want some variation. The only problem is that Nebula currently has no way of knowing whether an incremental shader change is possible or not.
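Just to give an idea of what such an incremental change boils down to in raw OpenGL (ARB_shader_subroutine, GL 4.0), here is a minimal sketch; the helper function and the assumption that the vertex stage has only this single subroutine uniform are mine, not part of AnyFX:

#include <vector>
#include <GL/glew.h>

// Rebind a single subroutine uniform without touching the rest of the program state.
// Assumes 'program' is currently in use; note that all active subroutine uniform
// locations of a stage must be set in one call.
void SetColorMethod(GLuint program, const char* implementation)
{
    // index of the chosen implementation, e.g. "texturedColorMethod"
    GLuint impl = glGetSubroutineIndex(program, GL_VERTEX_SHADER, implementation);

    // location of the subroutine uniform "dynamicColorMethodVariable"
    GLint loc = glGetSubroutineUniformLocation(program, GL_VERTEX_SHADER, "dynamicColorMethodVariable");

    GLint numLocations = 0;
    glGetProgramStageiv(program, GL_VERTEX_SHADER, GL_ACTIVE_SUBROUTINE_UNIFORM_LOCATIONS, &numLocations);

    std::vector<GLuint> indices(numLocations, 0);
    indices[loc] = impl;
    glUniformSubroutinesuiv(GL_VERTEX_SHADER, numLocations, indices.data());
}

The important bit is that this is a couple of uniform-style calls rather than a full program exchange.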

So here comes yet another interesting concept: what if we were to sort materials (which are already sorted by batch) by variation? Consider the following illustration:

FlatGeometryLit -
                |--- Static
                |--- VertexColor
                |--- Multilayered
                |--- LightmappedLit
                |--- Shiny
                |--- Organic
                |--- Animated
                |--- Skinned
                |--- Skin
                |--- SkinnedShiny
                |--- Foliage

This is the order in which they will be rendered; however, they are all opaque geometry, so they could render in any order within this list. The change between, let's say, Static, Shiny and Animated is actually not that big, just a couple of lines of shader code. There is no linkage difference between them, so they could use the same shader program, only with different subroutine sets! If we were to sort this list based on 'change', we would probably end up with something like this:

FlatGeometryLit -
                |--- Static
                |--- Shiny
                |--- Animated
                |--- Foliage
                |--- Organic
                |--- VertexColor
                |--- Multilayered
                |--- LightmappedLit
                |--- Skinned
                |--- Skin
                |--- SkinnedShiny

This is because most of these shaders share the same set of vertex shader inputs and pixel shader outputs. If we implement the shaders around the same function prototypes, AnyFX could figure out which programs are duplicates of others, and then tell us which material should actually apply its program and which materials are subdominant and thus only require an incremental update. What we end up with is a sorted list of materials, where the first 'unique' material is dominant and the others are incremental. The list would look like this:

FlatGeometryLit -
                |--- Static         -- dominant
                |--- Shiny          -- incremental
                |--- Animated       -- incremental
                |--- Foliage        -- incremental
                |--- Organic        -- incremental
                |--- VertexColor    -- dominant  (introduces vertex colors in vertex layout, cannot be subroutined)
                |--- Multilayered   -- incremental
                |--- LightmappedLit -- dominant  (introduces secondary UV set in vertex layout, cannot be subroutined)
                |--- Skinned        -- dominant  (introduces skin weights and joint indices in vertex layout, cannot be subroutined)
                |--- SkinnedShiny   -- incremental
                |--- Skin           -- incremental

As we can see, every time we encounter an incremental material we can simply perform a smaller update rather than set the entire shader program, which should save us some performance if we have lots of variation in materials. This table only shows the base materials for a specific batch; the algorithm would sort all batches in this manner so that the entire pipeline reduces its API-heavy calls. This is probably not a performance issue right now, since we have a rather small set of materials per batch type, but consider a game with lots of custom-made shader derivatives. Currently, each such derivative would more or less be a copy of some base shader's code, and the program would be applied before each group of objects using that shader.
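A hypothetical sketch of how the sorted list could then be applied; the Material struct and the dominant flag are placeholders of my own, not actual Nebula or AnyFX types:

#include <vector>
#include <GL/glew.h>

struct Material
{
    bool dominant;                      // true if this material has to bind its own program
    GLuint program;                     // program object, shared between incremental materials
    std::vector<GLuint> subroutines;    // subroutine indices describing this variation
};

void ApplyMaterials(const std::vector<Material>& sorted)
{
    GLuint current = 0;
    for (const Material& mat : sorted)
    {
        if (mat.dominant && mat.program != current)
        {
            // dominant material: full program change
            glUseProgram(mat.program);
            current = mat.program;
        }

        // incremental (and dominant) materials: just rebind the subroutine set,
        // since glUseProgram resets the subroutine uniform state anyway
        glUniformSubroutinesuiv(GL_VERTEX_SHADER, (GLsizei)mat.subroutines.size(), mat.subroutines.data());

        // ... render every object using this material ...
    }
}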

The next thing to tackle on the list is getting shader storage blocks working. The syntax for these babies is also already defined, but is only implemented by stubs. The AnyFX counterpart of a shader storage block is called a varbuffer. As opposed to a varblock, a varbuffer allows application control of the internal buffer storage, meaning we can retrieve its handle and read/write data from it as we please. We can also attach the buffer to some other part of the pipeline, which requires information that resides outside the scope of AnyFX. Varbuffers also support member arrays of undetermined size! As such, a varbuffer will have some way of allocating a buffer with a dimension set from the application side. Consider the following:

struct ObjectVariables
{
   float MatEmissiveIntensity;
   float MatSpecularIntensity;
   mat4 Model;
};

varbuffer SomeBuffer
{
   ObjectVariables vars[];
};

This creates a buffer which contains variables per rendered object. From the AnyFX API we can then tell the varbuffer to allocate a backend buffer of a given size, which can then be used, for example, to perform bulk rendering with full per-object variation using glMultiDraw*. The only issue is that AnyFX usually handles variables as objects which one can retrieve and simply set, but in this case a variable lives inside a struct inside an array, and is thus not something that is publicly available. We could solve the same problem with the already existing varblock syntax, using a set of fixed-size arrays of variables. However, shader storage blocks (varbuffer) have a much larger minimum guaranteed size, 16 MB, compared to the 16 KB guaranteed for uniform blocks (varblock), so with varblocks we cannot fit as much data per multi-draw as we can with varbuffers.
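To make the idea more concrete, here is a rough sketch of what the backend side of such an allocation could look like in raw GL; the std430 padding and the binding point are assumptions about how AnyFX would lay out the varbuffer, not how it actually does it:

#include <vector>
#include <GL/glew.h>

// CPU mirror of the ObjectVariables struct; std430 aligns the mat4 to 16 bytes,
// hence the explicit padding after the two floats
struct ObjectVariables
{
    float MatEmissiveIntensity;
    float MatSpecularIntensity;
    float padding[2];
    float Model[16];    // column-major 4x4 matrix
};

void SetupObjectBuffer(GLuint& buffer, const std::vector<ObjectVariables>& vars)
{
    // allocate a backend buffer with one entry per visible object and upload the data
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
    glBufferData(GL_SHADER_STORAGE_BUFFER, vars.size() * sizeof(ObjectVariables), vars.data(), GL_DYNAMIC_DRAW);

    // bind to the binding point assigned to SomeBuffer (0 is just an assumption here)
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer);
}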

This is totally worth looking into, since it would probably enable a much faster execution rate of draw calls: we could pack more or less every object with the same shader in the scene into a single glMultiDraw*. It will probably not work with the current approach of setting values through individual variables, though; it will need some code which gathers up a bunch of objects and their variables, packs them into a buffer, and then renders everything. More on that when the subroutines are working!

// Gustav

Oh render thread, why art thou so hard to reach?

Recently, I’ve discovered some issues regarding unstable frame rates, as well as a couple of glitches related to the render thread. One of the major issues with having ALL the rendering in a fat separate thread is that all communication to and from the render thread has to go through the client/server side 'interface', meaning that a ModelEntity communicates with its internal counterpart InternalModelEntity to pass information back and forth. This method works really well, since the ModelEntity and the InternalModelEntity never have to be in perfect sync.

However, we have encountered problems where we actually want something from the render thread RIGHT NOW, meaning we block the game thread to wait for render thread data. Take this example code:

Ptr<Graphics::ItemAtPosition> msg = Graphics::ItemAtPosition::Create();
msg->SetPosition(mousePos);
Graphics::GraphicsInterface::Instance()->SendWait(msg.upcast<Messaging::Message>());

// get by id
const Ptr<Game::Entity>& entity = BaseGameFeature::EntityManager::Instance()->GetEntityByUniqueId(msg->GetItem());

This requires the game thread to basically wait until the render thread has handled the message before we can continue. This is fine in itself, since the render thread executes in parallel; however, it cannot be done while we are running the game. To avoid several synchronizations with the render thread within the same game frame, the application goes into a lockstep mode, meaning the game thread basically waits for the render thread to finish, then performs a rendezvous and synchronizes. This means both threads must arrive at the same 'position' for a sync to take place, so we cannot possibly lock either thread during the lockstep phase! So, in pure catch-22 fashion, we cannot run the above code if we are in lockstep, and if we're not in lockstep we get random synchronizations with the render thread which screw up our main thread timings.

Now, this is just the most recent problem; we've continuously had problems with the render thread, so we thought: hey, why not rethink the whole fat render thread concept?! The idea is to only make the low-level OpenGL/DX calls on separate thread(s), and have all the previously internal render stuff straight on the main thread. Since Nebula already makes good use of jobs to handle much of the computation-heavy pre-render processing, such as culling and skeletal animation, we shouldn't really lose that much performance (I hope). Also, if we can utilize multiple threads to execute draw calls and the like, we should be in really good shape, since the CPU-heavy API calls will run on threads that are not tied to the main loop.

If we take a look at SUI, we have essentially the same problem. The way we communicate with SUI, which lives on the render thread, is by passing messages to it from the main thread, which is very clunky and greatly inhibits control. This would also be solved if rendering no longer resides entirely within a separate fat thread.

In addition to this, I am planning to look into the new stuff presented at GDC for reducing OpenGL draw call overhead. An explanation can be found here: http://blogs.nvidia.com/blog/2014/03/20/opengl-gdc2014/

Basically, they explain that the Mantle counterpart of OpenGL already exists, although it's not widely known. They explain how GPU buffers can be persistently mapped to CPU memory and synchronized once per frame, instead of the usual glMapBuffer/glUnmapBuffer pattern which forces a synchronization each time. They also explain how indirect drawing, combined with buffers containing per-object variables, can replace the glUniform style of setting state. Basically, you create a uniform buffer or shader storage buffer which contains the variables for each object, update it, and then just fetch the variables on a per-object basis in the shader. This allows you to buffer ONCE per frame, and then simply tell the GPU to render. If you want to get really fancy, you can even create a draw command buffer, much like you do with uniforms, and just tell GL to render X objects with this setup. We can then create worker threads, one for uniform updates and one for drawing, and then just call glMultiDrawElementsIndirect from the main thread to execute the batch.
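A minimal sketch of the persistently mapped buffer plus indirect drawing idea from that talk, assuming GL 4.3/4.4 (glMultiDrawElementsIndirect and ARB_buffer_storage); fence synchronization against frames still in flight is left out to keep it short:

#include <GL/glew.h>

// standard layout of one indirect draw command, as consumed by glMultiDrawElementsIndirect
struct DrawElementsIndirectCommand
{
    GLuint count;
    GLuint instanceCount;
    GLuint firstIndex;
    GLuint baseVertex;
    GLuint baseInstance;
};

DrawElementsIndirectCommand* SetupCommandBuffer(GLuint& cmdBuffer, GLsizei maxDraws)
{
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    const GLsizeiptr size = maxDraws * sizeof(DrawElementsIndirectCommand);

    // immutable storage which stays mapped for the lifetime of the buffer
    glGenBuffers(1, &cmdBuffer);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmdBuffer);
    glBufferStorage(GL_DRAW_INDIRECT_BUFFER, size, nullptr, flags);
    return (DrawElementsIndirectCommand*)glMapBufferRange(GL_DRAW_INDIRECT_BUFFER, 0, size, flags);
}

// per frame: worker threads write the per-object uniforms and fill cmds[0..numDraws-1]
// through the persistent pointer, then the context thread issues a single call
void DrawBatch(GLsizei numDraws)
{
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, numDraws, 0);
}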

So basically, instead of having the thread border somewhere in the middle between the game and render parts, we push the thread border to the absolute limit, meaning we have light threads on the backend which just perform the CPU-heavy API calls. It should be fairly easy to see whether the threading offloads some main thread work and thus gives us a performance boost, or whether the overhead of syncing with the threads takes more time than it gives back.
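As a rough illustration of what such a thin backend could look like, here is a sketch of a command queue consumed by a thread that owns the GL/DX context; everything here is made-up scaffolding, not Nebula code:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// The main thread records commands; a backend thread which has the graphics context
// current pops and executes them, keeping the CPU-heavy API calls off the main loop.
class RenderBackend
{
public:
    void Push(std::function<void()> cmd)
    {
        {
            std::lock_guard<std::mutex> lock(this->mutex);
            this->commands.push(std::move(cmd));
        }
        this->signal.notify_one();
    }

    // runs on the backend thread
    void Run()
    {
        for (;;)
        {
            std::unique_lock<std::mutex> lock(this->mutex);
            this->signal.wait(lock, [this] { return !this->commands.empty(); });
            std::function<void()> cmd = std::move(this->commands.front());
            this->commands.pop();
            lock.unlock();
            cmd();  // the actual API call(s)
        }
    }

private:
    std::queue<std::function<void()>> commands;
    std::mutex mutex;
    std::condition_variable signal;
};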

// Gustav