We’ve been quiet here but we haven’t been idle. We’ve been working on our tools for the 2016 iteration of Nebula Trifid, adding stuff, removing other stuff and doing UI optimizations. I’ve also been working on the OpenGL renderer somewhat and I now deem it to be in a stable state.
Before I jump into the work currently being done on Vulkan, I would like to finish up the last post by explaining how I solve uniform buffers and updating them in Nebula.
Where previously shaders were just resources meant to be instantiated by creating a shader instance, the shader itself is now applicable, and shader instances can be seen as derivatives of the original shader. The ‘main’ shader can be applied and its variables updated. The shader resource holds a list of uniform buffers, each directly associated with a ‘varblock’ in AnyFX, and as such, updating the uniforms in the shader requires a Begin -> Update -> End procedure before rendering. This lets the uniform buffers in the shader accumulate changes and flush them in one go when done.
Shader instances hold their own backings of said buffers, meaning they have a copy of the shader’s buffers but can provide their own per-instance unique variables. This way, we have solved per-object animation of certain variables, such as alpha blending, without disturbing the main shader buffers. This might seem like a waste of space, and it is if the shader code has tons of uniforms which are never in use. However, AnyFX knows whether a varblock is actually used and can report this to the engine, which then simply skips loading that varblock.
Apart from this automated system, a varblock in AnyFX can be marked with the annotation “System” = true; meaning it will be managed by the Nebula system. Shaders and shader instances will NOT create a buffer backing for these varblocks, and will NOT apply them automatically. This is on purpose, since some buffers are supposed to be maintained by Nebula and Nebula alone. These buffer -> varblock bindings include:
- Per object transforms and ID.
- Camera matrices and camera related stuff.
- Instancing buffers.
- Joint buffers.
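A system-managed varblock might look something like this. This is a sketch of the idea only; the exact AnyFX annotation syntax and variable names are illustrative:

```
// Sketch: a varblock annotated as system-managed. Nebula creates and
// updates the backing buffer itself, so no automatic backing is made.
varblock PerObject
[
	bool System = true;
]
{
	mat4 Model;
	int ObjectId;
};
```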
These are retrieved using a new API in AnyFX which lets us determine block size and variable offsets, and then bind an OGL4ShaderVariable straight onto the uniform buffer. Updating said OGL4ShaderVariable is just a memcpy into the persistently mapped memory of the uniform buffer object, and we’re done. Simple.
This split between explicit and automatic use of uniform buffers lets us optimize what gets updated, and when.
Like the rest of the developers out there, I had to jump straight into Vulkan as soon as it was released. To start off, I have never seen such an explicit API with such a talkative and redundant syntax. That being said, I like it. It reminds me of DX11, but without all that COM. It’s a C-like API where structs are passed to functions to describe how those functions should operate, and any operation liable to fail returns a result object, and it’s consistently so.
The first thing that ‘had to go’ was the AnyFX API. Now now, I didn’t throw it away, but I quickly realized that with such a close-to-metal API, the shading system also had to be closer to the engine. Therefore, I developed a secondary API for AnyFX, called low-level, and split the previous API into its own folder called high-level. High-level is meant to manage the shading automatically, where you just apply a program and set some variables, whereas the low-level API provides nothing more than shader reflection, including all the AnyFX-specific information.
Also, SOIL is no longer a viable alternative for image loading, so we have to use DevIL for that. Luckily for us, we have our own DevIL fork and have added some stuff to it, which in turn also gave me some insight into the API. It shouldn’t be too much work (if any) to make DevIL able to load a texture and just give me the compressed data straight away, so that I can load it into a Vulkan image.
The AnyFX compiler has also received some minor additions. Not only has the compiler language been cleaned up a bit, it has also gotten some extensions. Similar to the layout(x, y, z) syntax in GLSL, AnyFX has received something called a qualifier expression. This is essentially a qualifier with an attached value, say group(5), which is a qualifier ‘group’ with the value ‘5’. The group qualifier in particular extends AnyFX with a Vulkan-specific language feature, declaring which resources go into which DescriptorSet. Qualifiers used for a specific language are ignored when compiling for a language without said support, so we can still use the same shaders. In the future, however, the qualifier expression syntax will most likely replace the parameter qualifiers that determine special behavior for shader function parameters, such as [color0]. Also, the AnyFX compiler no longer uses the hardware-dependent GLSL compiler, but instead uses the Khronos Group reference compiler (glslang).
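The group qualifier might be used along these lines. This is a sketch of the intent, not verified AnyFX syntax, and the block names are illustrative:

```
// Sketch: group(n) assigns a resource to DescriptorSet n when compiling
// for Vulkan, and is ignored for targets without descriptor sets.
group(0) varblock PerFrame { ... };   // set 0: changes once per frame
group(1) varblock PerObject { ... };  // set 1: changes per draw
group(1) sampler2D AlbedoMap;
```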
The render device in Vulkan is also somewhat different from the OGL4 and DX11 versions, in that it uses 4 worker threads to populate 4 command buffers instead of sending commands to the queue directly. This way we can parallelize issuing draws, shader switches and descriptor binds (images, textures, samplers and buffers). We cannot use the TP system with jobs, however, because we need to explicitly control which draw goes to which thread and in what order.
To decide when to switch the thread we push commands to, we currently use a Pipeline as a ‘context switch’, which causes the next thread to receive the next series of vertex buffers, index buffers and draws. This may be inefficient, because we might actually get just one shader, one vertex buffer, and LOADS of draws, so perhaps sorting just the draws into their own threads and syncing the threads per pipeline is more efficient. Another solution is to have one thread per pipeline and have them spawn their own draw threads, but command buffers have to be externally synchronized (meaning only one thread may manipulate them at a time), so that’s somewhat complicated. This figure illustrates what I mean:
- Thread -> P -> IA -> Draw -> Draw -> Draw | Switch thread | P -> IA -> Draw
- Thread -> P -> IA -> Draw -> Draw -> Draw -> IA -> Draw -> Draw -> Draw | Switch thread |
- Thread -> P -> IA -> Draw | Switch thread |
- Thread -> P -> IA -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw | Switch thread |
Here, P means PipelineStateObject and IA means input assembly (VBOs, IBO, offsets). This execution style ensures the draw calls happen in the correct order relative to their pipeline and vertex and index buffers; however, it can also give us this:
- Thread -> P -> IA -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw -> Draw
- Thread -> Zzzz…
- Thread -> Zzzz…
- Thread -> Zzzz…
Which is really bad, because we want to utilize our threads as much as possible. Ironically, this scenario would be IDEAL in OpenGL and DX11, because there we have no context switching between our draws. Here, however, it just means our threads are idling. For brevity, the above illustrations do not show the bindings of descriptor sets before each draw.
Currently, I’m working on a system which can construct and update DescriptorSets so that we may update the shader uniform state in an efficient manner. In essence, a DescriptorSet packages an entire set of shader state (textures, uniform buffers, storage buffers and the new sampler objects) in one unit. Preferably, I would like to have a static DescriptorSet per surface and then just bind it, but the number of DescriptorSets is fixed when creating the DescriptorPool, so adding more objects might cause DescriptorSet allocation to fail. The idea in Vulkan is to have a DescriptorSet per level of change, for example:
- Per frame (lights, shadow maps).
- Per view (camera, shadow casting matrices).
- Per pass (Input render targets, [subpass input attachments]).
- Per material (textures, parameters).
- Per object.
Vulkan then lets you bind them one by one, updating inner-loop sets while letting outer sets remain the same. Nebula, however, can never assume this behavior will be incremental, and will bind whichever set(s) have changed, so it is up to the shader developer to make sure that variables are grouped by frequency of update. Most likely, Nebula will bind them one by one, through calls to vkCmdBindDescriptorSets for each descriptor set, but I haven’t gotten there quite yet.
Another thing which Nebula should really utilize is the concept of subpasses. Subpasses allow for per-pixel shader caches, meaning we can read a previously written fragment without having to dump all the information to a framebuffer first. This gives us two major things:
- Performance when dealing with G-buffers.
- No more ping-pong textures. And also more performance.
Because we can read G-buffer data using a new type of uniform called an input attachment, we can now avoid dumping the G-buffer to a framebuffer. The lighting shaders can later just read the pixel values straight from the input attachment, instead of having to sample a texture, allowing us to skip the overhead of G-buffer sampling in some instances. At some point, however, the G-buffers do need to dump at least their depth values to a framebuffer for later use (god rays, AO, DoF etc.), but we can probably get away with skipping the albedo and emissive parts. However you twist and turn it, you end up using less memory bandwidth and thus getting better performance. We can also skip ping-ponging when doing bloom and blurs, because we can use a pixel-local cache to save a horizontal blur result when doing a vertical blur, and vice versa.
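In Vulkan GLSL, reading a G-buffer value through an input attachment looks something like this (a minimal fragment shader sketch; the attachment and output names are illustrative):

```glsl
// The albedo G-buffer is declared as a subpass input attachment rather
// than a sampled texture; this only works within the same render pass.
layout (input_attachment_index = 0, set = 0, binding = 0) uniform subpassInput AlbedoBuffer;
layout (location = 0) out vec4 Color;

void main()
{
	// subpassLoad reads the value written at this exact pixel by a
	// previous subpass; no sampler and no UV coordinates are needed
	vec4 albedo = subpassLoad(AlbedoBuffer);
	Color = albedo; // would feed into the lighting computation
}
```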
To enable this in Nebula, we need to add it to the design of a frame shader. A frame shader (perhaps they should have a renderer-specific implementation?) would need a SuperPass tag which can hold many ordinary Pass tags. A <SuperPass> would be the Vulkan equivalent of a RenderPass, and an ordinary <Pass> would be a subpass. This would allow us to pass ‘framebuffers’ as input attachments within the <SuperPass>, and when the <SuperPass> ends, we declare what actually gets resolved.
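A frame shader under this design might look something like the following. This is purely a hypothetical sketch of the idea; the attribute names and the Resolve tag are my own invention, not an existing Nebula format:

```xml
<SuperPass name="GeometryAndLighting">
	<!-- becomes a Vulkan RenderPass; attachments stay on-chip between subpasses -->
	<Pass name="GBuffer" output="Albedo, Normals, Depth"/>
	<Pass name="Lighting" input="Albedo, Normals, Depth" output="Light"/>
	<!-- when the SuperPass ends, declare which attachments are actually resolved -->
	<Resolve attachments="Light, Depth"/>
</SuperPass>
```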
Nebula also needs (with the help of AnyFX) to enable sampler objects in the shader code. In GLSL there are two texture type families, sampler1D -> samplerCubeArray and image1D -> imageCubeArray. Samplers have a sampler state attached from the application, while images are only readable through texel fetches. In the Vulkan GLSL language, however, there is a new type of texture called, well, texture, and it follows the same range from texture1D -> textureCubeArray. To sample such a texture, we write texture(sampler2D(AlbedoMap, PointSampler), UV); which is very similar to the DX11 way: AlbedoMap.Sample(PointSampler, UV); In the Vulkan GLSL language, we can construct a texture/sampler couple at the point of sampling, which allows the exampled AlbedoMap to be sampled using many different samplers without having to exist more than once. However, Vulkan also supports pre-coupled samplers and images, so for the time being, we will stick to that syntax.
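A small sketch of the separate texture/sampler pattern in Vulkan GLSL (the set/binding layout and names are illustrative):

```glsl
// Separate texture and sampler objects: the same texture can be sampled
// with different samplers without the image existing more than once.
layout (set = 1, binding = 0) uniform texture2D AlbedoMap;
layout (set = 0, binding = 0) uniform sampler PointSampler;
layout (set = 0, binding = 1) uniform sampler LinearSampler;

vec4 SampleAlbedo(vec2 uv)
{
	// couple texture and sampler at the point of use
	return texture(sampler2D(AlbedoMap, PointSampler), uv);
}
```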