Vulkan – Designing a back-end.

With the Khronos validation layer becoming more and more complete (although perhaps not entirely), the Vulkan renderer implementation is also coming along nicely. At the moment, I cannot produce anything but a black window, hopefully mainly because the handling of descriptor sets is not completely done yet.

However, the design choices and the way to handle Vulkan are still worth bringing up, so this post is going to be comprehensive, with illustrations showing the thought process.

Command buffers

As you may or may not know, most of the operations done on the GPU are executed through a command queue. This is apparent in OpenGL if you take a look at the functions glFlush and glFinish. In Vulkan, however, command buffers are yours to allocate, destroy, reset, propagate, queue up and, most importantly, populate with commands. Noteworthy is also that Vulkan operates by submitting Command buffers to Queues, and your GPU might support more than one Queue. A Queue can be thought of as a road, although some Queues only accept buses, some bikes, and others are for pedestrians. In Vulkan, there are three types of Queues that matter here.

  • Transfer – Allows for fast memory transfers between GPU and CPU, as well as between resources locally on the GPU.
  • Graphics – Allows for render commands such as Draw, BeginRenderPass, etc.
  • Compute – Allows for dispatch calls.
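To make the three queue types a bit more concrete, here is a minimal sketch of how one might pick a queue family for each kind of work. The flag values mirror Vulkan's VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT and VK_QUEUE_TRANSFER_BIT, but QueueFamily and FindQueueFamily are hypothetical stand-ins so that the snippet compiles without the Vulkan headers. In a real renderer the flags would come from vkGetPhysicalDeviceQueueFamilyProperties.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the values match VkQueueFlagBits.
enum QueueFlags : uint32_t
{
    GraphicsBit = 0x1,
    ComputeBit  = 0x2,
    TransferBit = 0x4
};

struct QueueFamily
{
    uint32_t flags; // which kinds of work this family accepts
    uint32_t count; // how many queues the family exposes
};

// Returns the index of the family that supports 'wanted' with the fewest
// extra capabilities, so a transfer-only family wins over a do-everything one.
// Returns -1 if no family supports the wanted flags.
int FindQueueFamily(const std::vector<QueueFamily>& families, uint32_t wanted)
{
    assert(wanted != 0);
    int best = -1;
    int bestExtraBits = 33; // more than any 32-bit mask can contain
    for (size_t i = 0; i < families.size(); i++)
    {
        const QueueFamily& family = families[i];
        if (family.count == 0 || (family.flags & wanted) != wanted) continue;

        // Count the capabilities we did not ask for.
        int extraBits = 0;
        for (uint32_t extra = family.flags & ~wanted; extra != 0; extra >>= 1)
            extraBits += (int)(extra & 1u);

        if (extraBits < bestExtraBits)
        {
            best = (int)i;
            bestExtraBits = extraBits;
        }
    }
    return best;
}
```

Preferring the most specialized family is only one possible policy; the idea is that a dedicated transfer family, if the GPU has one, is the bike lane from the road analogy.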

The intuitive way would be to try to replicate the GL4 behavior by simply creating a single MAIN command buffer into which you put all your commands, and then executing it at the end of the frame. While this is a fine enough solution, it will cause some issues which we will get into later. But mainly, the whole idea of using command buffers is to be able to, in some cases, precompute command buffers for reuse (binding render target and render target-specific shader variables, viewports, etc) or to populate them rapidly when drawing, using threads to do so. There are currently three main command buffers in Nebula, one for each queue.

Think of command buffers as sync points. Begin a command buffer, and all commands done afterwards are written to this buffer. When the buffer is ended, the buffer can then be used to run the commands on the GPU. Sounds simple. It’s not.

[Image: gameloop]

It’s not, because most rendering engines today are not designed around this principle, and if they were, it would be equally hard to make them work for older implementations, perhaps even breaking a huge part of the user space code. The most obvious place to begin and end a command buffer is, just like it was with DX9, within BeginScene and EndScene. My bet is that most engines are designed around this principle. So let’s say I want to load a texture, or perhaps I just want to create a vertex buffer and update it with data. Well, if we don’t do it within our frame, then the command buffer won’t be in the Begin-state, and the operation will fail.
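The begin/record/end contract, and the way it bites when something happens outside the frame, can be sketched with a toy model. ToyCommandBuffer is a hypothetical class and not the Vulkan API; it only mimics the rule that commands can be recorded between Begin and End and nowhere else.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model of a command buffer's recording state.
class ToyCommandBuffer
{
public:
    enum class State { Initial, Recording, Executable };

    // Start a fresh recording, discarding whatever was recorded before.
    void Begin()
    {
        commands.clear();
        state = State::Recording;
    }

    // Returns false when called outside Begin/End, which is the situation
    // a texture load or buffer update outside the frame runs into.
    bool Record(const std::string& cmd)
    {
        if (state != State::Recording) return false;
        commands.push_back(cmd);
        return true;
    }

    // Finish the recording; only valid while recording.
    void End()
    {
        assert(state == State::Recording);
        state = State::Executable;
    }

    // Only an ended buffer may be handed to a queue.
    bool CanSubmit() const { return state == State::Executable; }
    size_t NumCommands() const { return commands.size(); }

private:
    State state = State::Initial;
    std::vector<std::string> commands;
};
```

A call such as Record("UpdateBuffer") before Begin simply returns false here, which is the toy version of the failure described above.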

[Image: gameloop-problem]

There are two ways to solve this.

Solution 1 – the lazy option

Create a command buffer when needed, create your resource, then record the command that updates it with data. Finally, submit the command buffer to update the image.
This solution works fine, but it might be slow if you are loading in many textures on the fly, because some frames might get tons of vkQueueSubmit calls while the rest will be idle. This could be fixed if you know when you begin creating/updating resources and when you end, which leads me into the next part.

Solution 2 – delegates

The other solution is to postpone the command until BeginFrame. This can be done using a simple delegate system, which allows you to just save the command into a struct, add it to a vector, and then run it whenever you want to run it. For Nebula I implemented several ‘buckets’ of these kinds of delegates, so that we can run them on Begin/EndFrame, Begin/EndPass, etc. This solution easily allows us to accumulate tons of resource updates and run them in a single Queue submit, instead of making lots of smaller ones. This type of delayed command is also extremely useful for releasing memory, which I will talk about later.
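A bucketed delegate system of this kind could look roughly like the sketch below. This is not Nebula's actual code: Bucket, DelegateQueue, Push and Run are hypothetical names, and std::function stands in for whatever struct the engine really stores.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical bucket names, following the Begin/EndFrame and Begin/EndPass
// points mentioned in the text.
enum class Bucket { BeginFrame, EndFrame, BeginPass, EndPass, NumBuckets };

class DelegateQueue
{
public:
    // Save a command for later instead of running it right away.
    void Push(Bucket bucket, std::function<void()> fn)
    {
        assert(bucket != Bucket::NumBuckets);
        buckets[(size_t)bucket].push_back(std::move(fn));
    }

    // Run everything saved in a bucket, in insertion order, and empty it.
    // Returns the number of delegates that ran.
    size_t Run(Bucket bucket)
    {
        assert(bucket != Bucket::NumBuckets);
        // Swap the list out first, so a delegate may push new delegates
        // without invalidating the loop below.
        std::vector<std::function<void()>> pending;
        pending.swap(buckets[(size_t)bucket]);
        for (auto& fn : pending) fn();
        return pending.size();
    }

private:
    std::vector<std::function<void()>> buckets[(size_t)Bucket::NumBuckets];
};
```

A texture load would then Push its upload into the BeginFrame bucket, and the renderer would call Run(Bucket::BeginFrame) once the command buffer is in the Begin-state, so all accumulated updates end up in the same submit.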

Threading

One of the main benefits of command buffers is that they allow us to record commands from several threads, and isn’t that wonderful? Turns out it’s not quite as simple as that, for several reasons. The first is that we must use a uniquely created command buffer per thread, and also have a command buffer pool per thread. The reason for this is that command pools are externally synchronized: if two threads allocate from the same pool, or record into buffers belonging to it, at the same time, the pool’s state can be corrupted and there is no guarantee for the order of the commands. Using multiple buffers allows us to ensure that a specific sequence of commands is executed in order; however, we want to avoid doing a submit for every one of these command buffers when we are done. This is why there are secondary command buffers, which allow us to record commands that are then executed from within a primary command buffer (read: a MAIN command buffer), so that everything is executed with a single submit.

In Nebula the VkRenderDevice class has a constant integer describing how many graphics, compute and transfer threads should be created, along with completion events so that we may wait for the threads to finish. These will then act as a pool, each receiving commands using a scheme implemented by the RenderDevice, which is described later.

Commands are sent to threads using something very similar to the delegate system, by putting structs in a thread-safe queue and having each command buffer building thread populate its own command buffer. That part isn’t really the problem; the issue is how to distribute the command buffer buildup from the rendering pipeline to the threads. One way would obviously be to switch threads for each draw command, so given four threads we might get the following.
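The thread-safe queue plus a building thread can be sketched with the standard library alone. Everything here is hypothetical (Command, CommandQueue, RunWorker); the 'recorded' vector plays the role of the thread's command buffer, and joining the thread plays the role of the completion event.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical command struct; a real one would carry draw parameters.
struct Command
{
    std::string name;
    bool quit; // sentinel telling the worker to stop
};

class CommandQueue
{
public:
    void Enqueue(Command cmd)
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            pending.push(std::move(cmd));
        }
        cv.notify_one();
    }

    // Blocks until a command is available.
    Command Dequeue()
    {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return !pending.empty(); });
        Command cmd = std::move(pending.front());
        pending.pop();
        return cmd;
    }

private:
    std::mutex mtx;
    std::condition_variable cv;
    std::queue<Command> pending;
};

// Feeds 'names' to one worker thread and returns what the worker recorded.
std::vector<std::string> RunWorker(const std::vector<std::string>& names)
{
    CommandQueue queue;
    std::vector<std::string> recorded;
    std::thread worker([&] {
        for (;;)
        {
            Command cmd = queue.Dequeue();
            if (cmd.quit) break;
            recorded.push_back(cmd.name); // "record into the command buffer"
        }
    });
    for (const auto& n : names) queue.Enqueue(Command{ n, false });
    queue.Enqueue(Command{ "", true });
    worker.join(); // wait for the worker before reading 'recorded'
    assert(recorded.size() == names.size());
    return recorded;
}
```

With a single worker the recorded order always matches the enqueue order; the ordering problems only start once the commands are spread over several workers.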

draw1 draw2 draw3 draw4 draw5 draw6 draw7 draw8
  • Thread 1: draw1 draw5
  • Thread 2: draw2 draw6
  • Thread 3: draw3 draw7
  • Thread 4: draw4 draw8

So if we then would collect these draws together, we would get

draw1 draw5 draw2 draw6 draw3 draw7 draw4 draw8

Clearly, our draws are out of order, which is fine if we are using depth-testing. The real problem comes with binding the rendering state (or pipelines as they are called). Consider the following linear command list.

pipeline1 draw1 draw2 pipeline2 draw3 draw4 pipeline3 draw5 draw6 pipeline4 draw7 draw8

Then our threads will get

  • Thread 1: pipeline1 draw1 draw5
  • Thread 2: pipeline2 draw2 draw6
  • Thread 3: pipeline3 draw3 draw7
  • Thread 4: pipeline4 draw4 draw8

Basically, draws 5-8 will now not get the shader associated with them, producing an incorrect result. To be honest, there is no perfect way of handling this, but there is a way to split draws into threads by going by the lowest common denominator, which is the shader pipeline. In Nebula right now, the thread used for draw command population is cycled whenever we change pipeline, which results in the above command list looking like this on the threads.

  • Thread 1: pipeline1 draw1 draw2
  • Thread 2: pipeline2 draw3 draw4
  • Thread 3: pipeline3 draw5 draw6
  • Thread 4: pipeline4 draw7 draw8
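The thread cycling can be written down as a small pure function. This is a sketch of the idea rather than Nebula's implementation: Cmd and SplitByPipeline are hypothetical, and commands are reduced to names so the result is easy to inspect.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical command representation: either a pipeline bind or a draw.
struct Cmd
{
    bool isPipelineBind;
    std::string name;
};

// Distributes commands over 'numThreads' per-thread lists, moving on to the
// next thread every time a pipeline bind shows up, so that every draw stays
// on the same thread as the pipeline it depends on.
std::vector<std::vector<std::string>> SplitByPipeline(
    const std::vector<Cmd>& cmds, size_t numThreads)
{
    assert(numThreads > 0);
    std::vector<std::vector<std::string>> threads(numThreads);
    size_t current = 0;
    bool seenBind = false;
    for (const Cmd& cmd : cmds)
    {
        if (cmd.isPipelineBind)
        {
            // Advance on every bind except the very first one.
            if (seenBind) current = (current + 1) % numThreads;
            seenBind = true;
        }
        threads[current].push_back(cmd.name);
    }
    return threads;
}
```

Feeding it the eight-draw list with four threads gives one pipeline and its two draws per thread. With fewer threads than pipelines the function wraps around, so a thread ends up with several pipeline groups.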

However, when not swapping shaders, we will obviously get all draws on the same thread, which might not be overly efficient. I haven’t come up with a solution for this yet, but one obvious option would be to sync and rendezvous the command buffer threads every X calls, and start a new batch.
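One possible reading of the "every X calls" idea, and I want to stress that this is an untested assumption and not something Nebula does, is to also cycle the thread once a batch has reached a maximum number of draws, repeating the pipeline bind on the new thread. BatchCmd, SplitWithBatchLimit and maxDraws are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical command representation: either a pipeline bind or a draw.
struct BatchCmd
{
    bool isPipelineBind;
    std::string name;
};

// Cycles to the next thread on every pipeline change AND whenever the current
// batch already holds 'maxDraws' draws. In the second case the pipeline bind
// is repeated, since the next thread's command buffer knows nothing about it.
std::vector<std::vector<std::string>> SplitWithBatchLimit(
    const std::vector<BatchCmd>& cmds, size_t numThreads, size_t maxDraws)
{
    assert(numThreads > 0 && maxDraws > 0);
    std::vector<std::vector<std::string>> threads(numThreads);
    size_t current = 0;
    size_t drawsInBatch = 0;
    bool seenBind = false;
    std::string boundPipeline;
    for (const BatchCmd& cmd : cmds)
    {
        if (cmd.isPipelineBind)
        {
            if (seenBind) current = (current + 1) % numThreads;
            seenBind = true;
            boundPipeline = cmd.name;
            drawsInBatch = 0;
        }
        else if (drawsInBatch == maxDraws)
        {
            // Batch is full: start a new one on the next thread.
            current = (current + 1) % numThreads;
            if (!boundPipeline.empty()) threads[current].push_back(boundPipeline);
            drawsInBatch = 0;
        }
        threads[current].push_back(cmd.name);
        if (!cmd.isPipelineBind) drawsInBatch++;
    }
    return threads;
}
```

The price is a redundant pipeline bind per batch, and a real version would still need the rendezvous step to stitch the batches back together in order.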

Now this is only really relevant if we want to rapidly prepare our scene using draws, but how does it fare with transfers and computes? Well, with computes you have the same deal: bind a shader and dispatch. The only real difference between geometry rendering and computation is that we also bind vertex and index buffers when we need to produce geometry. Transfers, however, are a completely different thing. In theory, each transfer operation could be done on its own thread, meaning that for every single new buffer update, we cycle through the threads.

Pipelines

In the good old days with the older APIs, binding the rendering state or compute state was easy: you would just bind shaders, vertex layouts, render settings like blending and alpha testing, samplers and subroutines, and they incrementally built up a rendering state. In the next-gen APIs, this work is left to the developer, and in Vulkan the result is referred to as a Pipeline. A pipeline describes EVERYTHING required by the GPU to perform a draw, so you might understand that this structure is huge.

The only problem with pipelines is that they couple together so many different pieces of information – shader programs, vertex attribute layout, blending, rasterizing and scissor state options, depth state options, viewports, etc. Just look at this beast of a structure! https://www.khronos.org/registry/vulkan/specs/1.0/man/html/VkGraphicsPipelineCreateInfo.html

The reason for this setup was that many games used to create tons of shaders, bind them, but not actually use them. This caused tons of unwanted validation for a shader-vertex-render state setup which was never supposed to be used, and slowed down performance. This new method forces the developer to say they are done, and that this is a complete state to render with.

Now you might already predict that trying to figure out all possible combinations of all your setups and precomputing all these pipelines is the best way to solve it, and you would be correct. However, how do you look it up afterwards? Well, the good folks at Khronos thought of this and implemented a pipeline cache. As far as I understand it, the cache lets the driver reuse the work from earlier pipeline creation, so if a pipeline using the same members has already been created, vkCreateGraphicsPipelines or vkCreateComputePipelines (the latter being far easier to precompute) can return quickly instead of building everything from scratch. Very neat if you ask me, and according to Valve, this is magic and it’s incredibly fast.
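Independently of the driver-side cache, an engine can keep its own lookup table so that a pipeline for a given state is only created once per run. Below is a minimal sketch, assuming the state can be boiled down to a few integer ids. PipelineKey, PipelineDatabase and the integer handle are hypothetical, and the creator callback stands in for the real call to vkCreateGraphicsPipelines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical key: whatever uniquely identifies a complete render state.
struct PipelineKey
{
    uint32_t shaderId;
    uint32_t vertexLayoutId;
    uint32_t passId;

    bool operator<(const PipelineKey& rhs) const
    {
        return std::tie(shaderId, vertexLayoutId, passId)
             < std::tie(rhs.shaderId, rhs.vertexLayoutId, rhs.passId);
    }
};

class PipelineDatabase
{
public:
    using Handle = uint64_t; // stand-in for a VkPipeline handle
    using Creator = std::function<Handle(const PipelineKey&)>;

    explicit PipelineDatabase(Creator fn) : createFn(std::move(fn))
    {
        assert(createFn != nullptr);
    }

    // Returns the stored pipeline for this key, creating it on first use.
    Handle GetOrCreate(const PipelineKey& key)
    {
        auto it = pipelines.find(key);
        if (it != pipelines.end()) return it->second;
        Handle handle = createFn(key);
        pipelines.emplace(key, handle);
        return handle;
    }

    size_t Size() const { return pipelines.size(); }

private:
    Creator createFn;
    std::map<PipelineKey, Handle> pipelines;
};
```

Asking for the same key twice only invokes the creator once, which is the behavior you want at draw time regardless of how fast the driver's own cache turns out to be.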

In Nebula, the shaders loaded as compute shaders can predict everything they need and create the pipeline on resource load, which is very flexible indeed. For graphics, I implemented a flagging system, where a flag is set whenever a member of this struct is initialized. When all flags are set and we want to render, we call a function that creates and binds the pipeline, but only if something has changed since the last one was built.

	// when the input layout is set: store it and mark the pipeline as stale
	this->currentPipelineBits |= InputLayoutInfoSet;
	this->currentPipelineInfo.pInputAssemblyState = inputLayout;
	this->currentPipelineBits &= ~PipelineBuilt;

	// when the framebuffer layout is set: same idea
	this->currentPipelineBits |= FramebufferLayoutInfoSet;
	this->currentPipelineInfo.renderPass = framebufferLayout.renderPass;
	this->currentPipelineInfo.subpass = framebufferLayout.subpass;
	this->currentPipelineInfo.pViewportState = framebufferLayout.pViewportState;
	this->currentPipelineBits &= ~PipelineBuilt;

	// before drawing: every flag in the AllInfoSet mask must be set,
	// and the pipeline is only rebuilt if some state changed
	n_assert((this->currentPipelineBits & AllInfoSet) == AllInfoSet);
	if ((this->currentPipelineBits & PipelineBuilt) == 0)
	{
		this->CreateAndBindGraphicsPipeline();
		this->currentPipelineBits |= PipelineBuilt;
	}