Vulkan – Shading ideas

So this is where the Vulkan renderer is right now.
[Image: vulkan5]

What you see might be unimpressive, but once you get to this stage there isn’t too much left. As you can see, I can load textures, which are compressed (hopefully you can’t see that), and render several objects with different shaders, textures, and uniform values such as positions.

You might think this is near completion, with just a couple of post effects left to redo (the mipmap reduction compute shader, for example), but you would be wrong.

In this example, the Vulkan renderer created a single descriptor set per object, which I thought was fine; I basically assumed that was what descriptor sets were for. I believed that a descriptor set would work like a package of variable bindings that you apply all at once, instead of individually selecting textures and uniform buffers. However, on my GPU, an AMD Fury Nano sporting a massive 4 GB of GPU memory (it doesn’t run Chrome, so it’s massive), I ran out of memory when reaching a meager 2000 objects. I had never actually run out of GPU memory before.

So I decided to check how much memory I had actually allocated. While Vulkan supplies you with a nice set of callback functions to look this up, they don’t really tell you much about descriptor pools. I had already narrowed the memory exhaustion down to happening when I create too many objects, so it cannot be a texture issue. Anyhow, in order to have per-object unique variables, each object allocates its own uniform buffer backing for the ‘global’ uniform buffer. Buffer memory never exceeds roughly 260 MB, so the problem is not there.

So the only conclusion I can draw is that the AMD driver allocates TONS of memory for the descriptor sets. So I did a bit of studying, and I decided to go with this solution for handling descriptor sets: Vulkan Fast Paths.

The TL;DR of the PDF is to put all textures into huge arrays, so I did:

#define MAX_2D_TEXTURES 4096
#define MAX_2D_MS_TEXTURES 64
#define MAX_CUBE_TEXTURES 128
#define MAX_3D_TEXTURES 128

group(TEXTURE_GROUP) texture2D 		Textures2D[MAX_2D_TEXTURES];
group(TEXTURE_GROUP) texture2DMS 	Textures2DMS[MAX_2D_MS_TEXTURES];
group(TEXTURE_GROUP) textureCube 	TexturesCube[MAX_CUBE_TEXTURES];
group(TEXTURE_GROUP) texture3D 		Textures3D[MAX_3D_TEXTURES];

And textures are fetched through:

group(TEXTURE_GROUP) shared varblock RenderTargetIndices
{
	// base render targets
	uint DepthBufferIdx;
	uint NormalBufferIdx;
	uint AlbedoBufferIdx;	
	uint SpecularBufferIdx;
	uint LightBufferIdx;
	
	// shadow buffers
	uint CSMShadowMapIdx;
	uint SpotLightShadowMapIdx;
};

Well, render targets are. On the ordinary shader level, textures are fetched by an index which is unique per object. I also took the liberty of implementing samplers as separate objects. They are bound in the shader like uniforms and can be combined with textures in GLSL, as defined in the GL_KHR_vulkan_glsl section ‘Combining separate samplers and textures’. This is good if we have a texture array like the one above, where we can’t really assign a sampler per texture in the shader, because when writing the shaders we have absolutely no clue which texture goes where. It is much more flexible to assign a sampler state at the point where we know what kind of texture we want. Let me give you an example.

The old way would be:

samplerstate GeometryTextureSampler
{
	Samplers = { SpecularMap, EmissiveMap, NormalMap, AlbedoMap, DisplacementMap, RoughnessMap, CavityMap };
	Filter = MinMagMipLinear;
	AddressU = Wrap;
	AddressV = Wrap;
};
...
vec4 diffColor = texture(AlbedoMap, UV) * MatAlbedoIntensity;
float roughness = texture(RoughnessMap, UV).r * MatRoughnessIntensity;
vec4 specColor = texture(SpecularMap, UV) * MatSpecularIntensity;
float cavity = texture(CavityMap, UV).r;

The new way is:

samplerstate GeometryTextureSampler 
{
	Filter = MinMagMipLinear;
	AddressU = Wrap;
	AddressV = Wrap;
};
...
vec4 diffColor = texture(sampler2D(AlbedoMap, GeometryTextureSampler), UV) * MatAlbedoIntensity;
float roughness = texture(sampler2D(RoughnessMap, GeometryTextureSampler), UV).r * MatRoughnessIntensity;
vec4 specColor = texture(sampler2D(SpecularMap, GeometryTextureSampler), UV) * MatSpecularIntensity;
float cavity = texture(sampler2D(CavityMap, GeometryTextureSampler), UV).r;

While the new way is only possible in GLSL through the GL_KHR_vulkan_glsl extension, this has been the default way in DirectX since version 10. This syntax also allows for a direct mapping of texture sampling between GLSL and HLSL if we want to use HLSL above shader model 3.0.

This method basically allows for all textures to be bound to a single descriptor set, and this descriptor set can then be applied to bind ALL textures at the same time. So when this texture library is submitted, we basically have access to all textures directly in the shader. Neat huh? It’s like bindless textures, and that is exactly what AMD mentions in the talk.
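If every texture lives in one big array, something on the CPU side has to hand out array indices. The post does not show how that is done, so the following is only a minimal sketch under my own assumptions: `TextureSlotTable`, `Acquire`, `Release` and `InvalidSlot` are hypothetical names, and writing the actual descriptor for a slot is left out entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical marker returned when the texture array is full.
constexpr uint32_t InvalidSlot = UINT32_MAX;

// Hands out indices into a fixed-size texture array (e.g. 4096 entries
// for 2D textures). It only does the bookkeeping; updating the
// descriptor set entry for a slot is up to the caller.
class TextureSlotTable
{
public:
    explicit TextureSlotTable(uint32_t capacity) : capacity_(capacity) {}

    // Returns a free array index, or InvalidSlot if none is left.
    uint32_t Acquire()
    {
        if (!free_.empty())
        {
            // Reuse a slot that was released earlier.
            uint32_t slot = free_.back();
            free_.pop_back();
            return slot;
        }
        if (next_ < capacity_) return next_++;
        return InvalidSlot;
    }

    // Makes a previously acquired index available again.
    void Release(uint32_t slot)
    {
        assert(slot < next_);
        free_.push_back(slot);
    }

private:
    uint32_t capacity_;
    uint32_t next_ = 0;          // first index that was never handed out
    std::vector<uint32_t> free_; // released indices, reused first
};
```

The index returned by `Acquire` is what an object would store and pass to the shader in place of a texture binding.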

Then we come to uniform buffers. I read the Vulkan Memory Management article and all of a sudden it became completely clear to me. If we want to keep the number of descriptor sets down, we can’t have an individual buffer per object, because that requires either a descriptor set per object with the individual buffer bound to it, or syncing the rendering commands and updating the descriptor set being used.

So the naive solution is to use the same uniform buffer and expand its size for every new object. If you follow the NVIDIA article, that is clearly not a good way to go. Instead, the uniform buffers implement a clever array allocation method, where we grow the total size by a set number of instances and keep a list of used and free indices (which can be converted to offsets) into the buffer. Allocating when there are no free indices grows the buffer by the maximum of a set amount (8) and the number of instances requested. Allocating when there are free indices returns the offset calculated from said free index. Allocating a range of instances first attempts to fit the range into the free indices, if there are enough of them, and allocates a new chunk if no such range can be found.
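The allocation scheme described above can be sketched in a few lines of C++. This is not the engine’s actual code: `UniformInstancePool` and its methods are hypothetical names, the sketch tracks free slots with a bitmap rather than the lists of used and free indices the post describes, and resizing the real GPU buffer is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side bookkeeping for one shared uniform buffer.
class UniformInstancePool
{
public:
    // stride: bytes per instance, assumed to be suitably aligned already.
    // growBy: minimum number of instances added when the pool grows.
    explicit UniformInstancePool(uint32_t stride, uint32_t growBy = 8)
        : stride_(stride), growBy_(growBy)
    {
        assert(stride > 0 && growBy > 0);
    }

    // Reserves `count` consecutive instances and returns the byte offset
    // of the first one.
    uint32_t Allocate(uint32_t count = 1)
    {
        assert(count > 0);
        // First try to fit the range into already existing free slots.
        uint32_t run = 0;
        for (uint32_t i = 0; i < used_.size(); i++)
        {
            run = used_[i] ? 0 : run + 1;
            if (run == count) return Mark(i + 1 - count, count, true);
        }
        // No fitting range: grow by max(growBy, count), use the new chunk.
        uint32_t first = static_cast<uint32_t>(used_.size());
        used_.resize(used_.size() + std::max(growBy_, count), false);
        return Mark(first, count, true);
    }

    // Returns `count` instances starting at byte offset `offset`.
    void Free(uint32_t offset, uint32_t count = 1)
    {
        assert(offset % stride_ == 0);
        assert(offset / stride_ + count <= used_.size());
        Mark(offset / stride_, count, false);
    }

    // Size the backing GPU buffer needs to have, in bytes.
    uint32_t CapacityInBytes() const
    {
        return static_cast<uint32_t>(used_.size()) * stride_;
    }

private:
    // Flags a range of slots and returns the byte offset of its start.
    uint32_t Mark(uint32_t first, uint32_t count, bool value)
    {
        std::fill(used_.begin() + first, used_.begin() + first + count, value);
        return first * stride_;
    }

    uint32_t stride_;
    uint32_t growBy_;
    std::vector<bool> used_; // one flag per instance slot
};
```

With a 256-byte stride, the first two single allocations return offsets 0 and 256 while the capacity jumps straight to 8 instances (2048 bytes); a later request for 16 instances cannot fit in the remaining slots and is placed in a newly added chunk.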

So basically, the Vulkan uniform buffer implementation uses a pool allocator to grow its size (it doesn’t shrink, though, which we actually might want it to do). Because we are using GPU memory, we might want to avoid doubling the memory, but that is a problem for later. Each allocation returns the offset into the buffer, so that we can bind the descriptor with per-object offsets later, which means we retain the exact same descriptor set and only modify the offsets.
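One detail worth spelling out is how an instance index becomes a byte offset. Vulkan requires uniform buffer offsets to be multiples of the device limit `minUniformBufferOffsetAlignment`, which the specification guarantees to be a power of two. The post does not say which binding mechanism is used for the per-object offsets (dynamic uniform buffer descriptors would be one option), so the helpers below are a hedged sketch with hypothetical names, `AlignUp` and `InstanceOffset`, covering only the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Rounds `size` up to the next multiple of `alignment`.
// Assumes `alignment` is a power of two, as the Vulkan limit is.
constexpr uint32_t AlignUp(uint32_t size, uint32_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}

// Byte offset of instance `index` when every instance occupies
// `instanceSize` bytes padded up to the device's minimum alignment.
constexpr uint32_t InstanceOffset(uint32_t index,
                                  uint32_t instanceSize,
                                  uint32_t minAlignment)
{
    return index * AlignUp(instanceSize, minAlignment);
}
```

For example, a 200-byte variable block on a device with a 256-byte alignment limit gets a 256-byte stride, so instance 3 starts at byte 768.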

So to sum up:

  • Texture arrays with all textures bound at the same time, submitting the entire texture library (or libraries, for all 2D, 2DMS, Cube and 3D textures).
  • Uniform buffers are created per shader (resource-level) and each instance allocates a chunk of memory in this buffer.
  • Offsets into the same buffer are used per object so we can have the same descriptor set but jump around in it, giving us per-object variables.
  • Textures are sent as indexes, and can thus be on a per-object basis too.

The only real issue with this method is read-write textures, also known as images in GLSL. Since image variables have to be declared with a format qualifier denoting how to read from the image, we can’t really bind them as above. However, images are not updated anywhere near as frequently as textures are. Instead they are bound and switched on a per-shader basis, like with post effects, and are either statically assigned or can be predicted. For example, doing a horizontal + vertical blur pass requires the same image to be bound for both passes. If we want to perform a format change, like in the HBAO shader, where we transfer from ao, p -> ao, we can just bind the same image to two different slots and thus avoid descriptor updates.

Oh, I should also mention that all of this might soon be possible to do in OpenGL too, with the GL SPIR-V extension, which should give us the ability in OpenGL to use samplers as separate objects. Texture arrays already exist, and so do uniform buffers.
