Oh render thread, why art thou so hard to reach?

Recently, I’ve discovered some issues regarding unstable frame rates, as well as a couple of glitches related to the render thread. One of the major issues with having ALL the rendering in a fat separate thread is that all communication to and from the render thread has to go through the client-server style ‘interface’, meaning that a ModelEntity communicates with its internal counterpart InternalModelEntity to pass information back and forth. This method works really well, since the ModelEntity and the InternalModelEntity never have to be in perfect sync.
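To make the asynchronous path a bit more concrete, here is a minimal sketch of the client/server entity pattern; the class and message names below are generic stand-ins I made up for illustration, not Nebula’s actual ModelEntity/InternalModelEntity code.

// Simplified stand-in for the ModelEntity / InternalModelEntity split: the
// game-thread object never touches its render-thread counterpart directly,
// it only posts messages into a queue that the render thread drains later.
// All names here are illustrative, not Nebula's actual classes.
#include <functional>
#include <mutex>
#include <queue>

struct Matrix44 { float m[16]; };

// Lives on the render thread; owns the actual renderable state.
class InternalEntity
{
public:
    void SetTransform(const Matrix44& t) { this->transform = t; }
private:
    Matrix44 transform;
};

// Thread-safe mailbox: the game thread writes, the render thread reads.
class MessageQueue
{
public:
    void Push(std::function<void()> msg)
    {
        std::lock_guard<std::mutex> lock(this->mutex);
        this->messages.push(std::move(msg));
    }
    void DrainOnRenderThread()
    {
        std::lock_guard<std::mutex> lock(this->mutex);
        while (!this->messages.empty())
        {
            this->messages.front()();
            this->messages.pop();
        }
    }
private:
    std::mutex mutex;
    std::queue<std::function<void()>> messages;
};

// Lives on the game thread; forwards changes asynchronously and never blocks.
class GameEntity
{
public:
    GameEntity(InternalEntity* internal, MessageQueue* queue) : internal(internal), queue(queue) {}
    void SetTransform(const Matrix44& t)
    {
        InternalEntity* target = this->internal;
        this->queue->Push([target, t]() { target->SetTransform(t); });
    }
private:
    InternalEntity* internal;
    MessageQueue* queue;
};

The key property is that SetTransform on the game side returns immediately; the two copies of the state only converge when the render thread gets around to draining the queue, which is exactly why the two objects never need to be in perfect sync.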

However, we have encountered problems where we actually want something from the render thread RIGHT NOW, meaning we block the game thread to wait for render thread data. Take this example code:

// ask the render thread which item is under the mouse; SendWait blocks until the message has been handled
Ptr<Graphics::ItemAtPosition> msg = Graphics::ItemAtPosition::Create();
msg->SetPosition(mousePos);
Graphics::GraphicsInterface::Instance()->SendWait(msg.upcast<Messaging::Message>());

// get by id
const Ptr<Game::Entity>& entity = BaseGameFeature::EntityManager::Instance()->GetEntityByUniqueId(msg->GetItem());

This requires the game thread to wait for the message to be handled by the render thread before we can continue. That would be fine on its own, since the render thread executes in parallel. However, it cannot be done while we are running the game: in order to avoid several synchronizations with the render thread during the same game frame, the application goes into a lockstep mode, meaning the game thread basically waits for the render thread to finish, then performs a rendezvous and synchronizes. Both threads must arrive at the same ‘position’ for a sync to take place, so we cannot possibly block either thread during the lockstep phase. In pure catch-22 fashion, we cannot run the above code if we are in lockstep, and if we’re not in lockstep we get random synchronizations with the render thread which screw up our main thread timings.
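To make the deadlock concrete, here is a stripped-down model of a two-thread lockstep; the FrameRendezvous class and the frame functions are illustrative assumptions, not Nebula’s actual synchronization code.

// Both threads must reach the rendezvous at the end of every frame before
// either may start the next one. Generic stand-in, not Nebula's own primitive.
#include <condition_variable>
#include <mutex>

class FrameRendezvous
{
public:
    // Blocks until both the game thread and the render thread have arrived.
    void Arrive()
    {
        std::unique_lock<std::mutex> lock(this->mutex);
        if (++this->arrived < 2)
        {
            this->cond.wait(lock, [this] { return this->arrived >= 2; });
        }
        else
        {
            this->cond.notify_all();
        }
        // NOTE: a real implementation would also reset 'arrived' for the next frame.
    }
private:
    std::mutex mutex;
    std::condition_variable cond;
    int arrived = 0;
};

void GameThreadFrame(FrameRendezvous& rendezvous)
{
    // ... update game logic ...
    // If we issue a blocking SendWait() here, we stop and wait for the render
    // thread to answer. But the render thread only handles messages as part of
    // its own frame, and it may already have finished that frame and be waiting
    // at the rendezvous for us. Neither thread can proceed: the catch-22 above.
    rendezvous.Arrive();
}

void RenderThreadFrame(FrameRendezvous& rendezvous)
{
    // ... process pending messages, cull, draw ...
    rendezvous.Arrive();
}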

Now, this is just the most recent problem; we’ve had problems with the render thread continuously, so we thought, hey, why not rethink the whole fat render thread concept?! The idea is to make only the low-level OpenGL/DX calls on separate thread(s), and to move everything that previously lived in the internal render side straight onto the main thread. Since Nebula already uses jobs to handle much of the computationally heavy pre-render processing such as culling and skeletal animation, we shouldn’t really lose that much performance (I hope). Also, if we can utilize multiple threads to execute draw calls and such, we should be in really good shape, since the CPU-heavy API calls will run in a thread which is not tied to the main loop.
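As a rough sketch of what such a light backend thread could look like (the RenderBackend class and its interface are assumptions for illustration, not the final design): the main thread only records lightweight commands, and the backend thread is the only one touching the graphics API.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class RenderBackend
{
public:
    RenderBackend() : running(true), worker([this] { this->Run(); }) {}
    ~RenderBackend()
    {
        { std::lock_guard<std::mutex> lock(this->mutex); this->running = false; }
        this->cond.notify_one();
        this->worker.join();
    }
    // Called from the main thread: just enqueue, never touch the API here.
    void Submit(std::function<void()> apiCall)
    {
        { std::lock_guard<std::mutex> lock(this->mutex); this->commands.push(std::move(apiCall)); }
        this->cond.notify_one();
    }
private:
    // Backend thread: drains the queue and makes the CPU-heavy API calls.
    void Run()
    {
        std::unique_lock<std::mutex> lock(this->mutex);
        while (this->running || !this->commands.empty())
        {
            this->cond.wait(lock, [this] { return !this->running || !this->commands.empty(); });
            while (!this->commands.empty())
            {
                std::function<void()> cmd = std::move(this->commands.front());
                this->commands.pop();
                lock.unlock();
                cmd();              // e.g. draw calls, state changes, buffer uploads
                lock.lock();
            }
        }
    }
    std::mutex mutex;
    std::condition_variable cond;
    std::queue<std::function<void()>> commands;
    bool running;
    std::thread worker;
};

The main loop would then do something like backend.Submit([] { /* glDrawElements(...) */ }); the point being that the expensive API work happens off the main loop without the whole render frontend having to live on another thread.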

If we take a look at SUI, we have sort of the same problem. The way we communicate with SUI, which lives on the render thread, is by passing messages to it from the main thread, which is very clunky and greatly inhibits control. This too will be solved if rendering no longer resides entirely within a separate fat thread.

As an addition to this, I am planning to look into the new material presented at GDC on reducing OpenGL draw call overhead. An explanation of this can be found here: http://blogs.nvidia.com/blog/2014/03/20/opengl-gdc2014/

Basically, they explain that the Mantle counterpart of OpenGL already exists, although it’s not widely known. They show how GPU buffers can be persistently mapped to CPU memory and synchronized once per frame, instead of the usual glMapBuffer/glUnmapBuffer pattern which forces a synchronization every time. They also explain how indirect drawing, combined with buffers that hold per-object variables, can replace the glUniform-style of setting state per draw. Basically, you create a uniform buffer or shader storage buffer which contains the variables for each object, update it, and then just fetch the variables on a per-object basis in the shader. This allows you to buffer ONCE per frame, and then simply tell the GPU to render. If you want to get really fancy, you can even fill a draw command buffer much like you do with the uniforms, and just tell the GL to render X objects with this setup. We can then create worker threads, one for uniform updates and one for filling the draw command buffer, and then just call glMultiDrawElementsIndirect from the main thread to execute the batch.
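Here is a rough sketch of that setup, assuming an OpenGL 4.4 context (glBufferStorage, persistent mapping) with multi-draw indirect available; the loader header, buffer sizes, binding points and the PerObject layout are made-up examples, not Nebula code.

// Persistent mapping + multi-draw-indirect, roughly as described in the talk.
#include <GL/glew.h>

struct PerObject                       // must match the std430 block in the shader
{
    float modelMatrix[16];
    float color[4];
};

struct DrawElementsIndirectCommand     // standard layout expected by GL
{
    GLuint count;
    GLuint instanceCount;
    GLuint firstIndex;
    GLuint baseVertex;
    GLuint baseInstance;
};

static const GLsizei MaxObjects = 1024;
static const GLbitfield MapFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

GLuint objectBuffer, indirectBuffer;
PerObject* objectPtr;                  // stays mapped for the buffer's lifetime
DrawElementsIndirectCommand* commandPtr;

void SetupBuffers()
{
    // Per-object variables live in a shader storage buffer, mapped once and kept mapped.
    glGenBuffers(1, &objectBuffer);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, objectBuffer);
    glBufferStorage(GL_SHADER_STORAGE_BUFFER, MaxObjects * sizeof(PerObject), nullptr, MapFlags);
    objectPtr = (PerObject*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, MaxObjects * sizeof(PerObject), MapFlags);

    // Draw commands are buffered the same way instead of issuing one glDraw* per object.
    glGenBuffers(1, &indirectBuffer);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBufferStorage(GL_DRAW_INDIRECT_BUFFER, MaxObjects * sizeof(DrawElementsIndirectCommand), nullptr, MapFlags);
    commandPtr = (DrawElementsIndirectCommand*)glMapBufferRange(GL_DRAW_INDIRECT_BUFFER, 0, MaxObjects * sizeof(DrawElementsIndirectCommand), MapFlags);
}

void DrawFrame(GLsizei numObjects)
{
    // Worker threads could fill objectPtr[i] and commandPtr[i] here; since the
    // buffers are persistently mapped these are plain memory writes, no glMap/glUnmap.

    // The shader indexes the storage buffer per draw (e.g. via gl_DrawID from
    // ARB_shader_draw_parameters) to fetch its per-object data.
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, objectBuffer);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);

    // One call submits the whole batch.
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, numObjects, sizeof(DrawElementsIndirectCommand));

    // A real frame would also fence (glFenceSync/glClientWaitSync) before
    // overwriting data the GPU may still be reading.
}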

So basically, instead of having the thread border somewhere in the middle between the game and render parts, we push the thread border to the absolute limit, meaning we have light threads on the backend which just perform the CPU-heavy API calls. It should be fairly easy to see whether the threading offloads some main thread work and thus gives us a performance boost, or whether the overhead of syncing with the threads takes more time than it gives back.

// Gustav
