Adaptive virtual terrain texturing

Today I would like to talk about a new algorithm implemented in Nebula, and this one is for making beautiful terrain at scale! Our implementation of this algorithm is pieced together from the little concrete information provided by Ka Chen in his great presentation at GDC 2015, where he presented a novel way of doing virtual texturing that increases resolution dynamically, instead of the otherwise static resolution of ordinary virtual texturing.


Before we get down to brass tacks, a little motivation as to why on earth we would go through all this work and not just do ordinary virtual texturing using the GPU sparse binding API (which Nebula also supports). Well, the answer is resolution. And if your question is why we can't do the texture splatting (sampling the materials) at runtime instead of storing it in some texture cache, the answer is that in order to add small detail like decals on the terrain, we would have to render many, many decal boxes every frame, which would be expensive, while this method caches that result so it can simply be sampled every frame.

In our solution, which follows the one from Far Cry 4, we want a maximum theoretical SubTexture resolution of 64k, which over an area of 64×64 meters yields 1024 pixels per meter. We use a fixed tile size of 256×256 pixels, so the previous statement can also be reformulated as a resolution per tile: one tile covers 0.25 meters at the highest resolution, and the whole SubTexture, meaning 64 meters, at the lowest resolution.
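To make those numbers concrete, here is a quick sanity check of the arithmetic (a standalone sketch, not engine code; the constant names are mine):

```cpp
#include <cassert>
#include <cstdint>

// Back-of-the-envelope check of the resolution figures above:
// a 64k SubTexture over 64 meters gives 1024 pixels per meter,
// and a 256-pixel tile then covers 0.25 meters at the highest resolution.
constexpr uint32_t SubTextureWorldSize = 64;        // meters
constexpr uint32_t MaxSubTextureResolution = 65536; // pixels (theoretical)
constexpr uint32_t TileSize = 256;                  // pixels

constexpr uint32_t PixelsPerMeter = MaxSubTextureResolution / SubTextureWorldSize; // 1024
constexpr float MetersPerTileHighest = float(TileSize) / float(PixelsPerMeter);    // 0.25 m
constexpr uint32_t TilesPerSide = MaxSubTextureResolution / TileSize;              // 256 tiles
```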


The algorithm can be summarized into a few abstract steps, we will then go through each step and have a look at the implementation. These steps include both CPU and GPU work, but they are presented in the order in which a programmer would approach it.


The first thing we need to do is set everything up. This step consists of the following: divide the world into 64×64 meter regions, each of which gets a SubTexture assigned. We will discuss the details of the SubTexture later, but for now, let's just say it requires a world position in the range [-half world size, half world size] if the world is centered around 0, an indirection texture offset which can be initialized to 0xFFFFFFFF, a maximum LOD level float, and an unsigned number of tiles, which can also be initialized to 0xFFFFFFFF.

We also need an indirection texture, which in our case is 2048×2048 pixels with a mip chain. The mip chain should meet the requirements for the maximum number of tiles in a SubTexture, such that at the highest mip, a SubTexture represents only a single indirection pixel. So, for 256-pixel tiles, the number of mips is 8. We also have our physical texture caches, one each for albedo, material and normals, at an 8192×8192 standard size, or 8448×8448 padded size (explained later). The physical caches are not mipped.
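The mip-count claim can be checked the same way: the largest SubTexture spans 65536 / 256 = 256 indirection pixels per side, and halving per mip it takes 8 steps to collapse that block to a single pixel (again a sketch with my own names, not Nebula code):

```cpp
#include <cassert>
#include <cstdint>

// The largest SubTexture covers 65536 / 256 = 256 indirection pixels per side.
// Halving per mip level, count how many steps it takes for that block to
// shrink to a single indirection pixel at the top of the chain.
constexpr uint32_t MaxSubTextureResolution = 65536;
constexpr uint32_t TileSize = 256;

inline uint32_t IndirectionMipSteps()
{
    uint32_t pixels = MaxSubTextureResolution / TileSize; // 256 indirection pixels
    uint32_t steps = 0;
    while (pixels > 1)
    {
        pixels >>= 1;
        steps++;
    }
    return steps;
}
```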

We also need to render the lowres fallback. This is a texture used to render pixels for which we have no SubTextures covering them, because they are too far away. This is done with a simple screen space pass and a shader which is more or less equivalent to the tile update shader.

Our solution differs from Ka Chen's solution and the follow-up solution from Ghost Recon Wildlands in how we output our pages. In the first solution, a buffer was attached and written to; in Ghost Recon Wildlands they used a 3D texture (which sounds a bit wasteful) to map page coordinate XY to a pixel in a texture plane, with the mip as the texture plane selector. In our solution, we use a buffer of indirection buffer statuses, which uses atomic operations to decide whether a page has already been produced, and if it hasn't, we output it directly to our page buffer.

		uint index = mipOffset + pageCoord.x + pageCoord.y * mipSize;
		uint status = atomicExchange(PageStatuses[index], 1u);
		if (status == 0x0)
		{
			uvec4 entry = PackPageDataEntry(1u, subTextureIndex, lowerMip, pageCoord.x, pageCoord.y, subTextureTile.x, subTextureTile.y);

			uint entryIndex = atomicAdd(PageList.NumEntries, 1u);
			PageList.Entry[entryIndex] = entry;
		}

Here we calculate the buffer index (which is mipped, hence the offset and size modifications to the index calculation), use an atomic exchange with the page status, and if it was 0, add the page to the output! This way, we don't need to clear the buffer, nor do we have to implement a second pass to extract the data from this pass!

We need a buffer to store the page statuses used in the Terrain Prepass, which should map to a texture. Therefore, we have to make a buffer big enough to capture all mips, and also generate a list of mip sizes and mip buffer offsets such that we can emulate this being a mipped texture on the GPU. We avoid any indirection data on the CPU side, because of reasons related to copying and rearranging subtexture regions.

uint offset = 0;
for (uint i = 0; i < IndirectionNumMips; i++)
{
	uint width = IndirectionTextureSize >> i;
	uint height = IndirectionTextureSize >> i;
	terrainVirtualTileState.indirection[i].Fill(IndirectionEntry{ 0xF, 0xFFFFFFFF, 0xFFFFFFFF });

	terrainVirtualTileState.indirectionMipSizes[i] = width;
	terrainVirtualTileState.indirectionMipOffsets[i] = offset;
	offset += width * height;
}

Because alignment would waste memory if we used arrays of single floats or integers in GLSL constant buffers, we opted to store these values in ivec4s using the following method:

for (SizeT j = 0; j < terrainVirtualTileState.indirectionMipOffsets.Size(); j++)
{
	uniforms.VirtualPageBufferMipOffsets[j / 4][j % 4] = terrainVirtualTileState.indirectionMipOffsets[j];
	uniforms.VirtualPageBufferMipSizes[j / 4][j % 4] = terrainVirtualTileState.indirectionMipSizes[j];
}

Later in the GPU code, you will see how we get these values back with a similar operation.


  1. Read back data from Terrain Prepass – CPU
    1. Setup tile update jobs for tiles that should be deleted
    2. Setup tile update jobs for tiles that should be rendered
    3. Prepare indirection buffer updates
  2. Update SubTexture buffer – CPU
    1. Based on camera distance to the world space region occupied by a SubTexture, calculate the resolution it should have (explained in detail later)
    2. Calculate number of tiles from resolution, each tile is 256 pixels in size
    3. If SubTexture changed resolution, allocate a new indirection texture region for the new resolution, and deallocate the old
    4. If SubTexture increased in size, copy the whole region of indirection values to the new region, and shift the mip down such that we only write to mips 1..x and avoid 0
    5. If SubTexture decreased in size, copy whole region from mips 0..x-1 to new region
  3. Copy buffers – CPU/GPU Transfer
    1. Copy SubTextures from old regions to new regions if changed
    2. Copy indirection pixels from buffer to texture
    3. Copy from staging subtexture buffer to GPU buffer
  4. Clear entries – GPU – Compute
  5. Render Terrain Prepass – GPU/Render
    1. Updates page entry buffer
  6. Copy page entries to CPU-readable buffer to be used in 1. – GPU/Transfer
  7. Render page tiles – GPU/Render
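The ordering above can be sketched as a single per-frame function. The step names here are illustrative stand-ins, not Nebula's actual API; each stub just records that it ran, so the sequence itself is checkable:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hedged sketch of the per-frame ordering listed above. The names are
// invented for illustration; the real engine interleaves these steps with
// the rest of the frame.
static std::vector<std::string> executed;

static void Step(const std::string& name) { executed.push_back(name); }

void TerrainFrame()
{
    Step("readback");    // 1. CPU: consume page entries produced N frames ago
    Step("subtextures"); // 2. CPU: re-evaluate SubTexture resolutions
    Step("copies");      // 3. CPU/GPU transfer: indirection + SubTexture copies
    Step("clear");       // 4. GPU compute: reset PageList.NumEntries
    Step("prepass");     // 5. GPU render: bin visible pages
    Step("pagecopy");    // 6. GPU transfer: copy page list to CPU-readable buffer
    Step("tiles");       // 7. GPU render: update physical cache tiles
}
```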


Given the overview above, it is time to go through what a SubTexture is. Imagine that we want our 64k resolution for a 64×64 meter area. For a world of size, let's say, 8192 meters, we would need a texture over 8 million pixels per side (8192 m × 1024 pixels per meter)! Even with sparse resources, the biggest possible resource is 16k on Nvidia and 32k on AMD, so that is immediately out of the question. However, we will talk about the usage of sparse resources a bit later…

So think of it like this: we pretend we have that giant texture, but only on the CPU side. We already have our world split into 64×64 meter regions, so we can think of each 64×64 meter region as a smaller virtual texture within the giant one. This is what we refer to as a SubTexture: it is a virtual texture within the virtual texture :D. Think of the maximum resolution, like 64k for a SubTexture at the highest resolution, as the theoretical size of a texture, which as you will see won't be fully utilized.

Based on the camera distance to that region, we decide how many actual physical texture tiles should be represented by that region. In the highest resolution case, where we have a 64k resolution, we have 256×256 tiles. These tiles can be found in two places: first as a 256×256 block of pixels in the indirection texture, and then as scattered 256×256 pixel tiles in the three physical caches. So each tile occupies 1 pixel in the indirection texture, and a 256×256 pixel tile in the physical caches. SubTextures closer to the camera use more indirection pixels, and therefore more texture tiles, and hence provide a higher resolution for nearby pixels.

Illustration of the sub textures on a heightmap. To the left we see sub textures with differing tile counts, and on the right we see the indirection pixels that they occupy. One indirection pixel points to a 256×256 pixel texture tile. Blue means highest resolution (most tiles) and red means lowest resolution (fewest tiles)

Now you might think: hold on, if we have 256×256 indirection pixels for the highest resolution SubTexture, and each tile is 256×256 pixels, that would mean a 65536×65536 pixel texture, which is astronomical! Well, on top of all this complexity, we will also introduce mipmaps! Each SubTexture occupies a block of indirection pixels, but as we mentioned before, the indirection texture is actually mipmapped, meaning we have one block per mip level, all the way down to where a SubTexture represents only a single indirection pixel. This also means that each tile is rendered to represent a certain mip, so for a 256×256 tile at mip 0, a tile covers 0.25 meters of the world, but at mip 8, it covers a whopping 64 meters. A SubTexture covering only 1×1 tiles has only a single mip.

Calculating LODs

You might have gathered by now that one of the things we have to do manually in this algorithm is calculate the LODs. We need to calculate LODs ourselves for two things: first to determine how many tiles a SubTexture should use, and then, when rendering the terrain on the GPU, to decide which mip a pixel needs so we can mark the page at the proper mip as resident. Our formula is rather straightforward:

		// control the maximum resolution as such, to get 10.24 texels/cm, we need to have 65536 pixels (theoretical) for a 64 meter region
		const uint maxResolution = SubTextureTileWorldSize * 1024;

		// distance where we should switch lods, set to every 2 meters
		const float switchDistance = 2.0f;

		// mask out y coordinate by multiplying the result with (1, 0, 1)
		Math::vec4 min = Math::vec4(subTex.worldCoordinate[0], 0, subTex.worldCoordinate[1], 0);
		Math::vec4 max = min + Math::vec4(64.0f, 0.0f, 64.0f, 0.0f);
		Math::vec4 cameraXZ = Math::ceil(cameraTransform.position * Math::vec4(1, 0, 1, 0));
		Math::vec4 nearestPoint = Math::minimize(Math::maximize(cameraXZ, min), max);
		float distance = length(nearestPoint - cameraXZ);

		// if we are outside the virtual area, just default the resolution to 0
		uint resolution = 0;
		if (distance > 300)
			goto skipResolution;

		// at every regular distance interval, increase t
		uint t = Math::n_max(1.0f, (distance / switchDistance));

		// calculate lod logarithmically, such that it goes geometrically slower to progress to higher lods
		uint lod = Math::n_min((uint)Math::n_log2(t), (IndirectionNumMips - 1));

		// calculate the resolution by offseting the max resolution with the lod
		resolution = maxResolution >> lod;


		// calculate the amount of tiles, which is the final lodded resolution divided by the size of a tile
		// the max being maxResolution and the smallest being 1
		uint tiles = resolution / PhysicalTextureTileSize;

Two constants to keep in mind here: SubTextureTileWorldSize is set to 64 and represents the world size a SubTexture covers, and IndirectionNumMips represents the number of mips in the indirection texture, which we earlier figured out was 8. What this code does is calculate the resolution of the SubTexture and then convert it to a number of tiles. This is then used as the resolution for this SubTexture during this frame.
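To see the formula in action, here is a standalone port of the snippet (the Math::n_* helpers replaced with their <cmath>/<algorithm> equivalents, goto replaced by an early return): at 0 meters we get the full 256 tiles, and at 10 meters t = 5, lod = 2, giving 64 tiles.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Standalone port of the LOD-to-tiles calculation above, for checking the
// numbers. Constants match the article: 64 m SubTextures, 256 px tiles,
// 8 indirection mips.
constexpr uint32_t SubTextureTileWorldSize = 64;
constexpr uint32_t PhysicalTextureTileSize = 256;
constexpr uint32_t IndirectionNumMips = 8;

uint32_t TilesForDistance(float distance)
{
    // 10.24 texels/cm means 65536 theoretical pixels for a 64 meter region
    const uint32_t maxResolution = SubTextureTileWorldSize * 1024;
    const float switchDistance = 2.0f;

    // outside the virtual area, default the resolution to 0
    if (distance > 300.0f)
        return 0;

    // at every regular distance interval, increase t
    uint32_t t = (uint32_t)std::max(1.0f, distance / switchDistance);

    // lod grows logarithmically, so higher lods take geometrically longer to reach
    uint32_t lod = std::min((uint32_t)std::log2((float)t), IndirectionNumMips - 1u);

    // final resolution, converted to a number of tiles
    uint32_t resolution = maxResolution >> lod;
    return resolution / PhysicalTextureTileSize;
}
```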

Let’s move on to the first GPU loop! Yes, there are two loops…


The clear pass is rather intuitive, but for the sake of it, let’s explain exactly what it is doing.

[local_size_x] = 1
void main()
{
	PageList.NumEntries = 0u;
}

It just sets the number of entries in the entry output list to 0.


Remember the LOD calculation for the SubTexture? On the GPU it is slightly different. There, we do the following:

// convert world space to positive integer interval [0..WorldSize]
vec2 worldSize = vec2(WorldSizeX, WorldSizeZ);
vec2 unsignedPos = worldPos.xz + worldSize * 0.5f;
uvec2 subTextureCoord = uvec2(unsignedPos / VirtualTerrainSubTextureSize);

// calculate subtexture index, and early-out if it is out of bounds
uint subTextureIndex = subTextureCoord.x + subTextureCoord.y * VirtualTerrainNumSubTextures.x;
if (subTextureIndex >= VirtualTerrainNumSubTextures.x * VirtualTerrainNumSubTextures.y)
	return;
TerrainSubTexture subTexture = SubTextures[subTextureIndex];

// if this subtexture is bound on the CPU side, use it
if (subTexture.tiles != 0xFFFFFFFF)
{
	// calculate LOD
	const float lodScale = 4 * subTexture.tiles;
	vec2 dy = dFdy(worldPos.xz * lodScale);
	vec2 dx = dFdx(worldPos.xz * lodScale);
	float d = max(1.0f, max(dot(dx, dx), dot(dy, dy)));
	d = clamp(sqrt(d), 1.0f, pow(2, subTexture.maxLod));
	float lod = log2(d);

Let’s walk through what is happening. The first thing we do is convert the pixel coordinate into the [0..WorldSize] range. Using that, we select which SubTexture (remember, a 64×64 meter area) this pixel is in. We check that the SubTexture is valid by checking whether it has tiles. Then we take the partial derivatives in x and y of the world position, scaled by the number of tiles the SubTexture has (the range of these values being, as we said before, [1..256]) multiplied by a constant factor of 4, which just seemed to work well. That difference is used as a distance measure for our pixel to determine the slope, which is then clamped so it doesn’t exceed the maximum LOD for this SubTexture. The max LOD is determined by the number of tiles in the SubTexture, so a SubTexture of 256 tiles has a max LOD of 8, while a SubTexture of 1 tile has a max LOD of 0.

Binning the pages

The next stage of the shader is to bin the pages. Following from the code before inside the inner condition, we do the following:

	// calculate pixel position relative to the world coordinate for the subtexture
	vec2 relativePos = worldPos.xz - subTexture.worldCoordinate;

	// the mip levels are the ones rounded up and down from the lod value we receive
	uint upperMip = uint(ceil(lod));
	uint lowerMip = uint(floor(lod));

	// calculate tile coords
	uvec2 subTextureTile;
	uvec2 pageCoord;
	vec2 dummy;
	CalculateTileCoords(lowerMip, subTexture.tiles, relativePos, subTexture.indirectionOffset, pageCoord, subTextureTile, dummy);

	// since we have a buffer, we must find the appropriate offset and size into the buffer for this mip
	uint mipOffset = VirtualPageBufferMipOffsets[lowerMip / 4][lowerMip % 4];
	uint mipSize = VirtualPageBufferMipSizes[lowerMip / 4][lowerMip % 4];

	uint index = mipOffset + pageCoord.x + pageCoord.y * mipSize;
	uint status = atomicExchange(PageStatuses[index], 1u);
	if (status == 0x0)
	{
		uvec4 entry = PackPageDataEntry(1u, subTextureIndex, lowerMip, pageCoord.x, pageCoord.y, subTextureTile.x, subTextureTile.y);

		uint entryIndex = atomicAdd(PageList.NumEntries, 1u);
		PageList.Entry[entryIndex] = entry;
	}

	// if the mips are not identical, we repeat the binning for the upper mip
	if (upperMip != lowerMip)
	{
		// ... same as above, using upperMip instead of lowerMip
	}

Here, we first calculate the position of this pixel relative to the SubTexture in world space. This is used to determine the SubTexture-relative tile this pixel will be binned into. We also calculate an upper and a lower mip, such that we capture both mips that will be used when sampling. The CalculateTileCoords function is implemented as such:

void
CalculateTileCoords(in uint mip, in uint maxTiles, in vec2 relativePos, in uvec2 subTextureIndirectionOffset, out uvec2 pageCoord, out uvec2 subTextureTile, out vec2 tilePosFract)
{
	// calculate the amount of meters a single tile covers, adjusted by the mip and the number of tiles at max lod
	vec2 metersPerTile = VirtualTerrainSubTextureSize / float(maxTiles >> mip);

	// calculate subtexture tile index by dividing the relative position by the meters per tile
	vec2 tilePos = relativePos / metersPerTile;
	tilePosFract = fract(tilePos);
	subTextureTile = uvec2(tilePos);

	// the actual page within that tile is the indirection offset of the whole
	// subtexture, plus the sub texture tile index
	pageCoord = (subTextureIndirectionOffset >> mip) + subTextureTile;
}

This function takes a position relative to the SubTexture in world space; that is, if a SubTexture is at coordinates (-256, -256) in world space, then a pixel at (-256, -256) has relativePos (0, 0). The argument mip is the mip provided by the calculation above, maxTiles is the number of tiles the SubTexture provides at mip 0, and subTextureIndirectionOffset is the offset in the indirection texture where this SubTexture resides. The output is pageCoord, which is the indirection-texture-space coordinate for this page. Accompanying it is subTextureTile, which is the tile index for this pixel inside the SubTexture. Going back to the calling function, we use these values to calculate the index of the tile we are updating: using the mip offsets and mip sizes from the setup phase, a simple 2D-to-1D index conversion tells us which indirection pixel is affected, and therefore which page should be updated. As mentioned before, we use atomics to synchronize the writes to the PageEntry list, which has a cost but is much cheaper than having a clear and extraction pass.
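To make the mapping concrete, here is a CPU-side port of CalculateTileCoords with scalars standing in for the GLSL vectors (the fractional output is dropped; names otherwise kept), followed by the numbers for a pixel 32×16 meters into a 256-tile SubTexture whose indirection block starts at x = 512:

```cpp
#include <cassert>
#include <cstdint>

// CPU-side sketch of the GLSL CalculateTileCoords above.
// VirtualTerrainSubTextureSize is 64 meters, per the article.
constexpr float VirtualTerrainSubTextureSize = 64.0f;

void CalculateTileCoords(
    uint32_t mip, uint32_t maxTiles, float relX, float relY,
    uint32_t indirectionOffsetX, uint32_t indirectionOffsetY,
    uint32_t& pageX, uint32_t& pageY, uint32_t& tileX, uint32_t& tileY)
{
    // meters covered by one tile at this mip
    float metersPerTile = VirtualTerrainSubTextureSize / float(maxTiles >> mip);

    // tile index inside the SubTexture
    tileX = uint32_t(relX / metersPerTile);
    tileY = uint32_t(relY / metersPerTile);

    // indirection texture coordinate: mip-shifted SubTexture offset plus tile index
    pageX = (indirectionOffsetX >> mip) + tileX;
    pageY = (indirectionOffsetY >> mip) + tileY;
}
```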

Read back

Okay, so now we are done with the heavy lifting on the GPU side for this frame, but we will be consuming work generated on a previous frame to update tiles on this frame! Sounds confusing? Well, because we are buffering frames, the read back for this frame is at least N frames old, so any tile updates we do now might actually already be out of view, but because of the nature of buffering there really is no way around this, unfortunately.

When reading back the data, we get what we produced in the shader, and as you may have noticed, the data on the GPU is packed, so we first need to unpack it. The packing is a way to reduce the amount of memory passed over the PCI bus during read back, so it's important to keep it as small as possible. However, this can be contended and optimized later, for example by removing the pageCoords and calculating them on the CPU side.

		uint status, subTextureIndex, mip, pageCoordX, pageCoordY, subTextureTileX, subTextureTileY;
		UnpackPageDataEntry(updateList[0].Entry[i], status, subTextureIndex, mip, pageCoordX, pageCoordY, subTextureTileX, subTextureTileY);

		// the update state is either 1 if the page is allocated, or 2 if it used to be allocated but has since been deallocated
		uint updateState = status;
		uint index = pageCoordX + pageCoordY * (dims.width >> mip);
		IndirectionEntry& entry = terrainVirtualTileState.indirection[mip][index];

We get the indirection entry by using the page coords and mip produced by the GPU. This is actually all the info we need; now all we have to do is allocate or deallocate pages in the physical texture caches, produce tile update jobs, and we are done!

One important thing to note here is that there is a code path for the read back information which does the following:

				TerrainSubTexture& subTexture = subTextures[subTextureIndex];
				float metersPerTile = SubTextureTileWorldSize / float(subTexture.tiles >> entry.mip);

This uses the SubTexture buffer which the GPU assumed was resident at the time of these page updates, and given subTextureIndex, it becomes clear we need to use the SubTextures that were in use when the GPU was producing its pages. Therefore, the SubTexture CPU buffer is also N-buffered. In fact, one has to N-buffer quite a few things for this algorithm to work, and this problem in particular cost a lot of time to realize.

Texture space allocation

As previously mentioned, we allocate indirection pages when we update our SubTextures, and physical texture cache pixels when we get the read back from the GPU.

We implement two different ways to allocate texture memory: one for the indirection texture, which is done with a quadtree, and one for the physical texture caches, which is done with an LRU cache.

Indirection allocation

Indirection regions are allocated with a quadtree due to its efficiency at searching for a region of a specific size, or finding a previously allocated indirection region. A quadtree divides a space spatially into four equally sized squares, and each such square can in turn be divided into another four. However, a quadtree doesn't necessarily allocate all of these levels up-front, but does so dynamically when needed. Because every step of a search reduces the search space by a factor of 4, it has a log4 search time, which is really good for us since we need to allocate regions of different sizes quickly. Indirection pixels are allocated based on SubTexture size, and when a SubTexture changes size, it allocates a new area, queues a copy from the old area to the new, and then deletes the old area next frame, such that another SubTexture can fill the space.

Physical texture page tile allocation

With physical pages it is a little more sensitive. Since we have no mechanism to determine whether a page is still in use, we need a smarter approach to decide whether we should reuse a physical texture page. One solution would be to implement a mechanism where we clear the page entries, determine which pages are not seen, and deallocate those pages, but if we do, we lose the tiles we just recently saw! This means that if we turn around, the same pages we just saw will be updated again, even if they never changed. The same happens when we resize SubTextures: we copy the old area to the new one, filling as many mips as we can, to reduce the amount of popping when moving between SubTextures.

Instead, we implement an LRU (least recently used) cache, built from a dictionary for cache lookup and a doubly linked list. We use the doubly linked list to keep track of the most recently used item and the least recently used item. When an item is found in the lookup, we move its linked list node to the head of the list. If an item is not found in the lookup, we allocate a new node and put it at the head and in the lookup. If the item is not found and the list is full, we delete the least recently used item from the list, which evicts that tile. Implementation-wise, we allocate one node per tile in the physical caches, and reuse these nodes to avoid memory allocation when we free and obtain a new item. Every node is associated with a tile coordinate, so this is how we determine which tile should be updated.

Copying between mips

I briefly mentioned that when a SubTexture gets resized, we perform a copy between the mips in order to fill as much as we can of the new region. Another way to conceptualize this is that we have two ‘layers’ of detail, the size of a SubTexture in tiles, and the mips in the indirection texture for that SubTexture. A SubTexture will most likely only have a few pages used in just some of the mips, meaning we might have a few pages in mip 0, more in mip 1, and the rest in mip 2. If you remember, this is because the readback will tell us for which SubTexture and in which mip a page is currently visible, and it will trigger a tile update.

But can we preserve this data when a SubTexture is scaled up or down? Just like in Ka Chen's presentation, yes of course! When a SubTexture is scaled down, we simply copy from the current region to the new one, cutting off some of those top mips. If we scale up, we make sure to copy as many mips as we have resident into the new region. As an example, let's say a region goes from 256 tiles to 128. This means that mips 0-8 contain valid information, but the new region can only fit mips 1-8, so we copy those. If we scale up, let's say from 128 to 256, then we are missing mip 0, but the rest of the mip chain is already there, so we simply copy mips 0-7 of the old region (128) to mips 1-8 of the new region (256). Because the size of the region follows the same binary series as the mips, a 256-tile SubTexture is equivalent to a 128-tile one if we offset the mip by 1!
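The "offset the mip by 1" equivalence is easy to verify: at every mip, a 256-tile region at mip m + 1 covers exactly as many indirection pixels as a 128-tile region at mip m (a trivial sketch, names mine):

```cpp
#include <cassert>
#include <cstdint>

// Indirection pixels per side occupied by a SubTexture of 'tiles' tiles
// at a given mip; tiles halve with each mip level.
inline uint32_t IndirectionPixels(uint32_t tiles, uint32_t mip)
{
    return tiles >> mip;
}
```

Because the two mip chains line up pixel-for-pixel, the resize copy is a plain region copy with the mip index shifted by one.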


Alright, the last detail to think about is the anisotropic filtering issue. Why would we have such an issue, you might ask? Well, we have a texture cache with tiles tightly packed next to each other, which means that if we sample along the border of such a tile, we will surely pick up texture information from a neighboring tile, and that produces ugly artifacts. To combat this, we simply add some padding to the tiles, so a 256×256 tile is now 264×264 pixels, leaving a 4 pixel border on each side. In the sampling shader, we simply offset the sample position by 4 pixels and use a 256 pixel extent, leaving us with a 256×256 pixel sample space again.

The red border shows the inner measurement which is 256×256 pixels, while the total tile is 264×264.
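The padding arithmetic from the paragraph above, as a quick check (constant names are mine): a 4-pixel border on each side turns a 256-pixel tile into 264 pixels, and 32 such tiles fill the 8448-pixel padded cache while the payloads alone account for the 8192 standard size.

```cpp
#include <cassert>
#include <cstdint>

// Sanity check of the padded tile layout in the physical caches.
constexpr uint32_t InnerTileSize = 256;                              // sampled payload
constexpr uint32_t TileBorder = 4;                                   // padding per side
constexpr uint32_t PaddedTileSize = InnerTileSize + 2 * TileBorder;  // 264
constexpr uint32_t TilesPerCacheSide = 32;                           // tiles per cache row
```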

Vulkan – Designing a new frame script system

Nebula has a neat feature called frame shaders. Frame shaders are XML scripts which describe the rendering of an entire frame. However, frame shaders in Nebula were designed with a DirectX 9 mindset, and are in dire need of a rewrite.

With Vulkan, and partly with OpenGL 4, there are slightly more efficient ways of binding render targets. In DirectX 9 there was a clear distinction between multiple render targets and singular ones. In OpenGL we have framebuffers, an object containing all render targets. In DirectX 10-12 we bind render targets individually. In Vulkan and OpenGL, we can have a framebuffer but select only a set of its attachments to actually use, allowing us to avoid binding framebuffers more often than needed. In Vulkan, we can even pass data between renders through input attachments, so our new design has to take that into consideration.

We also want to be able to apply global variables in the frame shader, so that we can, for example, switch out the NormalMap, AlbedoBuffer, etc., if we render with VR or want to produce a reflection cube texture. This allows the frame script to apply settings per execution, which can be shared across all shaders used when rendering the script.

So one of the design choices is to build a frame scripting system which allows us to add frame operations just like FramePassBase; however, a FramePass is a bit too 'high-level', since it implies something like a texture and some draws. With the new system, we want to execute memory barriers, trigger events to keep track of compute shader jobs, and assemble highly optimized subpass dependency chains, meaning a frame operation can be much simpler than an actual pass.

Also, we want to slightly redesign some of the concepts of CoreGraphics: we no longer begin a pass with a render target, multiple render targets, or a render target cube, but instead use a pass object. The pass already knows about its subpasses and attachments, and the frame system knows when to bind it and what to do when it is bound.

Enter Frame2.

Declare RenderTexture    - can be used as shader variable and render target
	Fixed size 		- in pixels
	Relative size	- to screen (decimal 0-1)
	Dynamic size	- can be adjusted in settings
	Format			- any renderable color format
	Multisample		- true if render texture supports multisampling
Declare RenderDepthStencil	- implements a depth-stencil buffer for rendering
	Fixed size 		- in pixels
	Relative size	- to screen (decimal 0-1)
	Dynamic size	- can be adjusted in settings
	Format			- any accepted depth-stencil format
	Multisample		- true if depth-stencil supports multisampling
Declare ReadWriteTexture   - can be used as shader variable and compute shader input/output and fragment shader input/output
	Fixed size 		- in pixels
	Relative size		- to screen (decimal 0-1)
	Dynamic size		- can be adjusted in settings
	Format			- any color format (renderable or otherwise) but not depth-stencil 
	Multisample		- true if read-write image supports multisampling
Declare ReadWriteBuffer - can be used as compute shader input/output
	Size		- in bytes
	Relative size	- to screen, size is now size per screen pixel
Declare Event - can be used to signal dependent work that other work is done
	Set 		- created in an already set state

Declare Algorithm
	Class		- to create instance of algorithm class
	- List all global values used by this frame shader
Pass <Name>
	- List all attachments being used, then use index as lookup
	- Pass implicitly creates rendertarget/framebuffer
	RenderTexture <Name of declared RenderTexture>
		- Clear color
		- If name is __WINDOW__ use backbuffer
	RenderDepthStencil <Name of declared DepthStencil>
		- Clear stencil, clear depth
	Subpass <Name>
		- List subpasses depended upon
		- List of attachment indices
			- Output color
			- Output depth-stencil
			- Output input attachment (call something clever, like shader-local)
			- Resolve <boolean>
			- Passthrough (automatically assume all unmentioned attachments are pass-through)
			- Dependency
		- Viewports and scissor rects
			- Set viewport -> float4(x, y, width, height), index
			- Set scissor rect -> float4(x, y, width, height), index
			- If none are given, whole texture is used
		- Drawing
			SortedBatch <Name>
				- Must be inside subpass
				- Renders objects in Z-order
			Batch <Name>
				- Must be inside subpass
				- Batch renders objects using as minimal shader switches as possible
				- Renders materials declaring a Pass with <Name>. Rename Pass to Batch in materials.			
			System <Name>
				- Name decides what to do
				- Must be inside subpass
				- Lights means light sources
				- LightProbes means light probes
				- UI means GUI renderers
				- Text means text rendering
				- Shapes means debug shapes
			FullscreenEffect <Name>
				- Must be inside subpass
				- Bind shader
				- Update variables
			SubpassAlgorithm <Name>
				- Select algorithm to run
				- Select stage to execute so we can execute different phases with different subpasses
                                - Inputs listed from 
				- Must be inside subpass
Copy <Name>
	- Must be outside of pass
	- Target texture
	- Source texture
Blit <Name>
	- Like copy, but may filter if formats are not the same, or size differs
	- Must be outside of pass
ComputeAlgorithm <Name>
	- Select algorithm to run
	- InputImage input readwrite image
	- OutputImage output readwrite image
        - Allow asynchronous computation <boolean>, if not defined is false
	- Has to be outside of a Pass
Compute <Name>
	- Bind shader
	- Update variables
        - Allow asynchronous computation <boolean>, will require an Event to be used to control execution
	- Compute X, Y, Z
		- Is number of work groups, work group size is defined in shader
Barrier <Name>
	- Implements a memory blockade between two operations
	- Signals a ReadWriteBuffer or a ReadWriteTexture to be blocked from one pipeline stage to another, also denoting the access flags allowed for both resources prior and after the barrier
	- Can be used as a pipeline bubble, or as the content of an Event
	- Can be executed within a render pass or outside

Event <Name>
	- Can be reset
	- Can be set
	- Can be waited for
	- Can be called within render pass and outside

For the sake of minimalism, the new system also implements the script as a simplified JSON file instead of XML, making it slightly more readable, although almost equally stupid. An example of a frame shader, now called frame script, can look like this:

version: 2,
engine: "NebulaTrifid",
		{ name: "NormalBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "DepthBuffer", 		format: "R32F", 			relative: true,  width: 1.0, height: 1.0 },
		{ name: "AlbedoBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "SpecularBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "EmissiveBuffer", 	format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "LightBuffer", 		format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "ColorBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "ScreenBuffer", 	format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "BloomBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 0.5, height: 0.5 },
		{ name: "GodrayBuffer", 	format: "A8B8G8R8", 		relative: true,  width: 0.5, height: 0.5 },
		{ name: "ShapeBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "AverageLumBuffer", format: "R16F", 			relative: false, width: 1.0, height: 1.0 },
		{ name: "SSSBuffer", 		format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "__WINDOW__" }
		{ name: "HBAOBuffer", 		format: "R16F", 			relative: true, width: 1.0, height: 1.0 }
		{ name: "ZBuffer", 			format: "D32S8", 			relative: true, width: 1.0, height: 1.0 }
			name: 		"Tonemapping", 
			class: 		"Algorithms::TonemapAlgorithm", 
			name:		"HBAO",
			class: 		"Algorithms::HBAOAlgorithm",
			name: 		"FinalizeState", 
			shader:		"shd:finalize", 
				{semantic: "ColorTexture", 		value: "ColorBuffer"},
				{semantic: "LuminanceTexture", 	value: "AverageLumBuffer"},
				{semantic: "BloomTexture", 		value: "BloomBuffer"}
			name: 		"GatherState",
			shader: 	"shd:gather",
				{semantic: "LightTexture", 		value: "LightBuffer"},
				{semantic: "SSSTexture", 		value: "SSSBuffer"},
				{semantic: "EmissiveTexture", 	value: "EmissiveBuffer"},
				{semantic: "SSAOTexture", 		value: "HBAOBuffer"},
				{semantic: "DepthTexture", 		value: "DepthBuffer"}
		name:			"DeferredTextures",
			{ semantic:"AlbedoBuffer", 		value:"AlbedoBuffer" },
			{ semantic:"DepthBuffer", 		value:"DepthBuffer" },
			{ semantic:"NormalBuffer", 		value:"NormalBuffer" },				
			{ semantic:"SpecularBuffer", 	value:"SpecularBuffer" },
			{ semantic:"EmissiveBuffer", 	value:"EmissiveBuffer" },
			{ semantic:"LightBuffer", 		value:"LightBuffer" }
		name: 		"HBAO-Prepare",
		algorithm:	"HBAO",
		function:	"Prepare"
		name: "DeferredPass",
			{ name: "AlbedoBuffer", 	clear: [0.1, 0.1, 0.1, 1], 		store: true	},
			{ name: "NormalBuffer", 	clear: [0.5, 0.5, 0, 0], 		store: true },
			{ name: "DepthBuffer", 		clear: [-1000, 0, 0, 0], 		store: true },
			{ name: "SpecularBuffer", 	clear: [0, 0, 0, 0], 			store: true	},
			{ name: "EmissiveBuffer", 	clear: [0, 0, 0, -1], 			store: true	},
			{ name: "LightBuffer", 		clear: [0.1, 0.1, 0.1, 0.1], 	store: true	},
			{ name: "SSSBuffer", 		clear: [0.5, 0.5, 0.5, 1], 		store: true }
		depthStencil: { name: "ZBuffer", clear: 1, clearStencil: 0, store: true },
			name: "GeometryPass",
			dependencies: [], 
			attachments: [0, 1, 2, 3, 4],
			depth: true,
			batch: "FlatGeometryLit", 
			batch: "TesselatedGeometryLit"
			name: "LightPass",
			dependencies: [0],
			inputs: [0, 1, 2, 3, 4],
			depth: true,
			attachments: [5],
			system: "Lights"
		name: 			"Downsample2x2",
		algorithm: 		"Tonemapping",
		function: 		"Downsample"
		name: 			"HBAO-Compute",
		algorithm:		"HBAO",
		function:		"HBAOAndBlur"
		name: "PostPass",
			{ name: "DepthBuffer",  		load: true },
			{ name: "AverageLumBuffer", 	clear: [0.5, 0.5, 0.5, 1] },
			{ name: "ColorBuffer", 			clear: [0.5, 0.5, 0.5, 1] },
			{ name: "ScreenBuffer", 		clear: [0.5, 0.5, 0.5, 1], store: true},
			{ name: "BloomBuffer", 			clear: [0.5, 0.0, 0.5, 1] },
			{ name: "GodrayBuffer", 		clear: [-1000, 0, 0, 0] },
			{ name: "ShapeBuffer", 			clear: [-1000, 0, 0, 0] }
		depthStencil: { name: "ZBuffer", load: true },
			name: "Gather",
			dependencies: [],
			attachments: [2],
			depth: false,
				name: 				"GatherPostEffect",
				shaderState: 		"GatherState",
				sizeFromTexture: 	"ColorBuffer"
			name: "AverageLum",
			dependencies: [0],
			attachments: [1],
			depth: false,
				name: 				"AverageLuminance",
				algorithm: 			"Tonemapping",
				function: 			"AverageLum"
			name: "Unlit",
			dependencies: [],
			attachments: [6],
			depth: true,
			batch: "Unlit",
			batch: "ParticleUnlit",
			system: "Shapes"
			name: "FinishPass",
			dependencies: [1, 2],
			inputs: [0, 5, 6],
			attachments: [3],
			depth: false,
				name: 				"ToScreen",
				shaderState: 		"FinalizeState",
				sizeFromTexture: 	"ColorBuffer"
			system: "Text"
		name: 		"CopyToNextFrame",
		algorithm: 	"Tonemapping",
		function: 	"Copy"
		name: 		"SwapWindowBuffer",
		texture: 	"__WINDOW__"
		name: 		"CopyToWindow",
		from: 		"ScreenBuffer",
		to: 		"__WINDOW__"

Some of the design choices are:

  • GlobalState
  • Assigns global variables, like the deferred textures used by this frame shader. Other frame shaders can execute and apply their values.

  • RenderTextures contains list of all declared color renderable textures
  • We want to declare all textures in a neat manner, so a single row per texture is nice.

  • DepthStencils contains list of all declared depth stencil targets
  • We might want to use more than one depth stencil sometimes.

  • ReadWriteTextures contains list of all declared textures which supports read-write operations
  • Used for image load-stores.

  • ReadWriteBuffers contains list of all declared buffers which supports read-write operations
  • Used for compute shaders to load-store data. Size is the size in bytes, but if the relative flag is used, size denotes the byte size per pixel.
    A size of 1, 1 with the relative flag on a 1024×768 pixel screen will allocate 1024×768 bytes; 0.5, 0.5 allocates 512×384.

  • Algorithms contain all algorithms used by this frame shader
  • We want to declare algorithms beforehand, so that we can select which pass to use within it dependent on where we are.

  • Pass assigns a list of render targets which may be applied during the pass
  • Only draws are allowed within a pass, because a pass can’t guarantee order of execution of subpasses. A pass defines a list of allowed attachments, and which depth-stencil to use. A pass doesn’t really do anything, the work is done in subpasses.

  • Subpass actually binds render targets
  • Subpasses work on the concepts of OpenGL4 and Vulkan. Binding a framebuffer is done in the pass, the subpass then selects which attachments should be used and in which order.
    Subpasses have dependencies if other subpasses needs to be completed before this subpass can run. Subpasses list the attachments used by the pass by index. Subpasses may also contain the most important part, which is drawing.

  • Drawing!
  • We have four types of draw methods.

    • Batch performs a batch render by shader – surface – mesh to avoid unnecessary switches
    • OrderedBatch performs an ordered batch render on all materials, but renders in Z-order (or perhaps some other scheme) instead of per shader, so it's potentially detrimental to performance, since it may switch shaders many times
    • System runs a render system built-in, like Lights, LightProbes, UI, Text and Shapes
    • FullscreenEffect renders what we previously called a post effect, though it doesn't really have to be 'post' per se. Fullscreen effects require a shader and, in the majority of cases, a list of variable updates. They also need a texture from which to derive the size of the fullscreen quad
  • Algorithm execution
  • We want to execute algorithms, but since the new system has full control over what is being bound and when, the algorithm is not allowed to begin or end passes. So what do we need from algorithms? Algorithms need to update shader variables, apply meshes and render, however they might also need to render to more than one texture at a time. An algorithm needs to be able to run a certain step.

    • SubpassAlgorithm can only be executed within a pass, and is done for rendering geometry. Subpass algorithms may take values as input, however it is better to use the global state if possible
    • ComputeAlgorithm must be executed outside a pass, and may not render stuff, but only dispatch computations. Compute algorithms must be provided with at least one ReadWriteImage or ReadWriteBuffer, otherwise the compute algorithm is pointless. Compute algorithms can be asynchronous if hardware supports it. If it doesn’t, it just runs inline with graphics.
  • Compute
  • Like before, we can select a compute shader and run X, Y, Z number of work groups, using Width, Height, Depth sizes for each work group. However, we can now allow computes to execute asynchronously if the hardware supports it. If it doesn't, they just run inline with graphics. If a compute uses the relative flag, then size denotes the workgroup size rounded down to fit the resolution.

  • Barrier
  • Executes a barrier between operations. A barrier describes just that, a barrier between adjacent operations. If two interdependent operations are not directly after each other, then we can use events to wait for some later operation to be done.

  • Event
  • We can declare and use events (consistently) during a frame. Events can be set for example after a compute is done, and waited for somewhere else, with work being done in between. While barriers are better to use between immediate operations where the second depend on the first, events can be done for computations at the beginning of the frame, and be waited for just before they are needed.

  • Copy and Blit
  • We also want to perform an image copy, or a slightly more expensive blit operation (which allows for conversions/decompression)
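
The relative-size rule for ReadWriteBuffers can be sketched as a small helper (hypothetical function, not the actual Nebula API):

```cpp
#include <cassert>
#include <cstdint>

// Size of a ReadWriteBuffer allocation. With the relative flag, the width and
// height factors scale the screen resolution and the declared size is
// interpreted as bytes per pixel; without it, size is just a byte count.
uint32_t ReadWriteBufferBytes(bool relative, float width, float height,
                              uint32_t size, uint32_t screenW, uint32_t screenH)
{
    if (!relative)
        return size;
    uint32_t w = uint32_t(screenW * width);
    uint32_t h = uint32_t(screenH * height);
    return w * h * size; // size acts as bytes per pixel here
}
```

With a size of 1, 1 and one byte per pixel on a 1024×768 screen this yields 1024×768 bytes, and 0.5, 0.5 yields 512×384 pixels' worth, matching the figures above.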

Backwards compatibility

Since Vulkan defines a very explicit API, it shouldn’t be hard to translate from Vulkan ‘down’ to for example OpenGL or DirectX. Events in OpenGL can be used with glFenceSync and memory barriers with glMemoryBarrier.

In DirectX 11 and below, no such mechanisms exist, so barriers and fences are purely semantic in the frame shader, denoting where a barrier or sync would exist given a more explicit API, meaning they have no actual function. In OpenGL, a pass is a call to glBindFramebuffer and a subpass is glDrawBuffers. In DirectX, a pass is just a list of renderable textures, and a subpass selects a subset of these render targets and binds them prior to drawing. In Metal, a subpass is the info used in MTLRenderPipelineDescriptor to create a render pipeline, and a render pass is just a list of textures to pick from.

Compute algorithms and plain computes are simply not available if the API doesn't support them. Perhaps the RenderDevice should report whether the underlying device can actually perform computes, but at the same time, it is impossible to load compute shaders unless the device supports them. And since Nebula loads all exported shaders by default, this might be a problem.

Final thoughts

In some cases we might actually want to modify a frame script. All 'runnable' elements in a frame script are of the class FrameOp, meaning we can technically insert FrameOps into the script at runtime. For example, if we want VR, then perhaps we want the last subpass to present not to the screen but to a texture, and we might want to switch which texture to present to, like for the left eye and right eye, without necessarily having two of every screen buffer.

We might also want to be able to assemble a frame script in code, for example when implementing different shadow mapping methods. We could, for example, merge frame scripts together by just adding ops. The old system used a class called FramePassBase, which would bind a render target and run batches, much like a render pass. However, a FramePassBase binds a render target object as-is and assumes all attachments will be used in the shader. With the new method we can, for example, bind the CSM shadow buffer and the spot light shadow buffer atlas in one render pass, then all cube maps for point lights as 6 * number of point light shadows layers. FrameOps allow fine-grained control over how a script is executed, but they also allow us to break validation: since both a FramePass and a FrameSubpass are FrameOps, we can technically use a FrameSubpass without it being inside a FramePass if we assemble one in code. The frame script loader enforces correct formatting.

We also want to slightly rework the render plugin system so that we can determine when a plugin should render, seeing as it is important to be able to decide which texture the result will end up in. One idea is to execute a subset of plugins from the script, and have the plugins register with the plugin registry under a certain group.

The exact details of this design will probably change during development, but the basic concepts are here. One of the major concerns is that the new system should prevent the user from doing stupid things, while fully utilizing Vulkan and, by extension, the other renderers. The script can, for example, find required dependencies just by looking at which subpass or algorithm uses a resource, and inform the programmer, much like a validation process. It could also be extended with a graphical design interface, although it is doubtful that sensitive features like this should be exposed to someone who is not familiar with GPU programming.

In the current state, the frame script system implements objects of different types, each with a certain behavior. However, since we know the code beforehand, it should be possible to unravel the frame script and produce just the code, meaning the engine would generate the rendering process at compile time, much like the NIDL system, making the rendering pipeline both debuggable and more efficient, without all the indirection.

Vulkan – Beyond the pipeline cache

Don’t you go thinking I have been idle now just because I haven’t written anything down. As a matter of fact, I implemented a whole new render script system, which allows full utilization of Vulkan features such as subpasses and explicit synchronizations such as barriers and events.

The current Vulkan implementation in Nebula covers most major parts, including lighting, shadowing, shape rendering, GUI, text rendering and particles. What's left to implement and validate are the compute parts. However, working with Vulkan is not as simple as many think. There are tons of problems, driver-related and otherwise, which is why I decided to implement my own pipeline cache system.

Basically, the Vulkan pipeline cache can just return a VkPipeline object when we use the same objects to create a pipeline twice. That is cute and cool, but internally the system has to serialize at least 14 integers (12 pointers, plus 2 integers for the subpass index and the number of shader states). This is handled by the driver, so relying on it being intelligent, or even efficient, has proven to be a leap of faith. So I figured: how many different 'objects' do we use to create a pipeline in Nebula? Turns out we use just 4: pass info, shader, vertex layout and vertex input.

So the idea came to mind to just incrementally build a DAG of the currently applied states, and if the selected DAG path, when calling GetOrCreatePipeline(), already has a pipeline created, return it instead of creating it. The newest AMD driver, 16.9.1, fails to serialize pipelines, so calling vkCreateGraphicsPipelines always creates and links a new one, which dropped my runtime performance from 140 FPS down to 12. Terrible, but it gave me the motivation to avoid calling a vkCreateX function every time I need something new.

Enter the Nebula Pipeline Database. Sounds cool, but it is a simple structure which layers the different pipeline states into tiers, building a tree that encodes the dependencies which in the end produce a VkPipeline. The class works by applying shading states in tiers: Pass, Shader, Vertex layout, Primitive input. If a pass is applied, all the lower states get invalidated. If a vertex layout is applied, it will be 'applied' to the current pass. We construct a tree like so:

Pass 1
	Shader 1
		Vertex layout 1
			Primitive input 1
			Primitive input 2
		Vertex layout 2
			Primitive input 3
			Primitive input 4
	Shader 2
		...
Pass 2
	...
Pass 3
	...

(a leaf may be null if no primitive input has been applied on that path yet)

When setting a state, we try to find an already created node for that tier. If no node is found, we create it using the currently applied state. This allows us to rather quickly find the subtree and retrieve an already created pipeline. You might think this is very cumbersome just to combine pipeline features, but it boosted the base frame rate by several percent, because using only these few identifying objects is much faster than the driver implementation, for obvious reasons. The driver could never assume we have the same code layout as we do in Nebula, so it has to assume every part of the pipeline is dynamic.

Also, the render device doesn't request a new pipeline from the database object unless the state has actually changed, so we can avoid tons of tree traversals, searches and VkPipelineCache requests just by assuming the state doesn't need to change.
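
The tiered lookup can be sketched like this; a flat map over the four-state tuple stands in for the tree, and plain integers stand in for the Vulkan handles (all names hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Tiered pipeline cache sketch: Pass > Shader > Vertex layout > Primitive input.
// Setting a state at one tier invalidates everything below it; the full 4-tuple
// identifies a pipeline, which is created at most once.
class PipelineDatabase
{
public:
    void SetPass(uint64_t p)           { pass = p;   shader = layout = input = 0; }
    void SetShader(uint64_t s)         { shader = s; layout = input = 0; }
    void SetVertexLayout(uint64_t l)   { layout = l; input = 0; }
    void SetPrimitiveInput(uint64_t i) { input = i; }

    uint64_t GetOrCreatePipeline()
    {
        auto key = std::make_tuple(pass, shader, layout, input);
        auto it = pipelines.find(key);
        if (it != pipelines.end())
            return it->second;          // cache hit, no driver call needed
        uint64_t pipe = ++nextPipeline; // stand-in for vkCreateGraphicsPipelines
        pipelines[key] = pipe;
        return pipe;
    }

private:
    uint64_t pass = 0, shader = 0, layout = 0, input = 0;
    uint64_t nextPipeline = 0;
    std::map<std::tuple<uint64_t, uint64_t, uint64_t, uint64_t>, uint64_t> pipelines;
};
```

The real database stores the tiers as an actual tree so a partial state change only touches the affected subtree, but the caching behavior is the same: a second request for an identical state combination never reaches vkCreateGraphicsPipelines.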

So what’s left to do?

Platform and vendor compatibility stuff. At the current stage, the code doesn't consider violations of hardware limits, such as the number of uniform buffers per shader stage or per descriptor set. This is an apparent problem on NVIDIA cards, where the number of concurrently bound uniform buffers is limited to 12. Also left is testing and figuring out how events and barriers work, or what they are actually needed for, since render passes implement barriers themselves, and compute shaders running on the same queue seem to be internally synchronized.

Vulkan – Persistent descriptor sets

Vulkan allows us to bind shader resources like textures, images, storage buffers, uniform buffers and texel buffers in an incremental manner. For example, we can bind all view matrices in a single descriptor set (actually, just a single uniform buffer) and have it persist between several pipeline switches. However, it’s not super clear how descriptor sets are deemed compatible between pipelines.

NOTE: When mentioning shader later, I mean AnyFX style shaders, meaning a single shader can contain several vertex/pixel/hull/domain/geometry/compute shader modules.

I could never get the descriptor sets to work perfectly, that is, to bind the frame-persistent descriptors once at the start of each frame and then not bind them again for the entire frame (or view). Currently, I bind my 'shared' descriptor sets after I start a render pass or bind a compute shader.

When binding a descriptor set, all descriptor sets currently bound with a set number lower than the one you are binding have to be compatible. So if we have set 0 bound, and bind set 3, then for set 0 to stay bound it has to be compatible with the pipeline. If we switch pipelines, the descriptor sets compatible between pipelines are retained, provided they follow the previous rule. That is, if Pipeline A has sets 0, 1, 2, 3 bound and Pipeline B is bound, and sets 0 and 1 are compatible, then 2 and 3 will be unbound and will need to be bound again.
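
That retention rule amounts to taking the longest common prefix of compatible set layouts; a sketch (hypothetical helper, with plain integers standing in for layout handles):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns how many descriptor sets survive a pipeline switch: the longest
// common prefix of compatible set layouts. Sets at or after the first
// incompatible index are unbound and must be bound again.
size_t RetainedSets(const std::vector<uint64_t>& oldLayouts,
                    const std::vector<uint64_t>& newLayouts)
{
    size_t n = 0;
    while (n < oldLayouts.size() && n < newLayouts.size()
           && oldLayouts[n] == newLayouts[n])
        ++n;
    return n;
}
```

For Pipeline A with sets 0-3 and Pipeline B sharing only the first two layouts, the helper returns 2: sets 0 and 1 stay bound, sets 2 and 3 must be rebound.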

Where do we find the biggest churn of shader variables? Well, clearly in each individual shader. For example, let's pick the shader billboard.fx, which has a vec4 Color and a sampler2D AlbedoMap. In AnyFX, the Color variable would be a uniform tucked away in a uniform buffer, and the AlbedoMap would be its own resource. In the Vulkan implementation, they would also be assigned a set number, and to avoid screwing with lower sets, thereby invalidating descriptor sets, this 'default set' has to be high enough that no other set goes above it. However, since we can't really know the shader developer's intention of how sets are used, the compiler can be supplied a flag, /DEFAULTSET, which determines where all default sets go. This means that the engine and the shader developer themselves can decide where the most-likely-to-be-incompatible descriptor set should go.

I also got texture arrays and indexing to work properly, so now all textures are submitted as one huge array of descriptors, and whenever an object is rendered, all that is updated is the index into the array, supplied in a uniform buffer. This way we can keep the number of descriptor sets down to a minimum of 1 per set number per shader resource. Allocating a new resource using a certain shader will expand the uniform buffer to accommodate the object-specific data.

First off is the naïve way:

Memory Memory Memory Memory Memory Memory Memory Memory
Buffer Buffer Buffer Buffer Buffer Buffer Buffer Buffer
Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8

Which was where I was a couple of days ago, and this forced me to use one descriptor set per shader state, since each shader state has its own buffer. The slightly less bad way of doing this is:

Buffer Buffer Buffer Buffer Buffer Buffer Buffer Buffer
Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8

Which reduces memory allocations but also doesn’t help with keeping the descriptor set count low.

The better way is to keep one buffer per shader and treat its memory as a pool of per-object slots:

Object 1 Free Free Free Free Free Free Free

Allocating a new object just returns a free slot.

Object 1 Object 2 Free Free Free Free Free Free

If the memory backing is full, we expand the buffer size and allocate new memory.

Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8
Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8 Object 9 Free Free Free Free Free Free Free

As you can see, the buffer stays the same, meaning we can keep it bound in the descriptor set, and just change its memory backing. The only thing the shader state needs to do now is to submit the exact same descriptor state as all sibling states, but provide its own offset into the buffer.

However, since we need to create a new buffer in Vulkan to bind new memory, we actually have to update the descriptor set when we expand, but this will only be done when creating a shader state, which is done outside of the rendering loop anyways.
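
The slot scheme above can be sketched as a growing pool (hypothetical names; the real implementation additionally re-creates the Vulkan buffer and updates the descriptor set when it grows, as noted):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One growing backing store shared by all states of a shader. Each state
// allocates a slot and only its byte offset differs; freeing returns the slot
// to a free list. Growing doubles the capacity (a new buffer and memory in
// Vulkan, hence the descriptor set update mentioned above).
class UniformBufferPool
{
public:
    explicit UniformBufferPool(size_t stride, size_t capacity = 8)
        : stride(stride), capacity(capacity) {}

    size_t AllocOffset()
    {
        size_t slot;
        if (!freeSlots.empty()) { slot = freeSlots.back(); freeSlots.pop_back(); }
        else
        {
            if (used == capacity) capacity *= 2; // expand the backing memory
            slot = used++;
        }
        return slot * stride;
    }

    void FreeOffset(size_t offset) { freeSlots.push_back(offset / stride); }
    size_t Capacity() const { return capacity; }

private:
    size_t stride, capacity, used = 0;
    std::vector<size_t> freeSlots;
};
```

Allocating returns the next free slot's byte offset, and expansion keeps all existing offsets valid, which is exactly what lets sibling shader states share one descriptor while only the offset differs.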

Textures are bound by the shader server: each time a texture is created it registers with the shader server, which performs a descriptor set write. The texture descriptor set must be set index 0 so that it can be shared by all shaders.

Consider this shader:

group(1) varblock MaterialVariables
{
	// contents omitted
};
group(1) sampler2D MaterialSampler;

group(2) r32f image2D ReadImage;
group(2) image2D WriteImage;

group(3) varblock KernelVariables
{
	// contents omitted
};

Resulting in this layout on the engine side.

Descriptor set 1: uniform buffer, sampler
Descriptor set 2: image, image
Descriptor set 3: uniform buffer

Creating a ‘state’ of this shader would only perform an expansion of the uniform buffers in sets 1 and 3, but the sampler and two images will be directly bound to the descriptor set of the shader, meaning that any per-object texture switches would cause all objects to switch textures. We don’t want that, obviously, but we’re almost there. We can still create a state of this shader and not bind our own uniform buffers, by simply expanding the uniform buffers in sets 1 and 3 to accommodate for the per-object variables. To do this for textures, we need to apply the texture array method mentioned before.

group(0) sampler2D AllMyTextures[2048];
group(1) varblock MaterialVariables
{
	uint MaterialTextureId;
};

group(2) r32f image2D ReadImage;
group(2) image2D WriteImage;

group(3) varblock KernelVariables
{
	// contents omitted
};

Which results in the following layout:

Descriptor set 0: sampler array
Descriptor set 1: uniform buffer
Descriptor set 2: image, image
Descriptor set 3: uniform buffer

Now, texture selection is just a matter of uniform values: supplying a per-object value for the uniform buffer member MaterialTextureId. While this is trivial for samplers, it also leaves us asking for more. For example, how do we perform different sampling of textures when all of them are bound in one array? Vulkan allows a texture to be bound with an immutable sampler in the descriptor set, so that's one option, although in AnyFX we supply all our sampler information in the shader code by doing something like:

samplerstate MaterialSamplerState
{
	Samplers = { MaterialSampler };
	Filter = Anisotropic;
};

But we can’t anymore, because we don’t have MaterialSampler, and applying this sampler state to all textures in the entire engine might not be correct either. Luckily for us, the KHR_vulkan_glsl extension supplies us with the ability to decouple textures from samplers, and create the sampler in shader code. So I enabled AnyFX to create such a separate sampler object, although to do so one must omit the list of samplers. So the above code would be:

group(1) samplerstate MaterialSamplerState
{
	Filter = Anisotropic;
};

Which results in a separate sampler object, and the texture array is declared with the non-combined texture2D type:

group(0) texture2D AllMyTextures[2048];

And finally, sampling the texture is

vec4 Color = texture(sampler2D(AllMyTextures[MaterialTextureId], MaterialSamplerState), UV);

Instead of

vec4 Color = texture(AllMyTextures[MaterialTextureId], UV);

Which allows us, in the shader code, to explicitly select which sampler state to use, even if all our textures are submitted once per frame. I could also implement a list of combined image-samplers really easily, allowing for example a graphics artist to supply the texture together with sampler information, have that updated directly into the descriptor set, and still be able to fetch the proper sampler from the array.

For the sake of completeness, here’s the final shader layout:

Descriptor set 0: texture array, sampler state
Descriptor set 1: uniform buffer
Descriptor set 2: image, image
Descriptor set 3: uniform buffer

So this proves we can use uniform buffers to select textures too, covering all our bases in one tied-up bow. Neat. Except for images, and here's why.

Images are not switched around and messed with like textures are, and for good reason. An image is used when a shader needs to perform read-writes to texels of the same resource, meaning images are mostly used for random access and random writes, for post effects and the like, and are thus not as prone to change as, for example, individual objects. Instead, images are mostly consistent and can be bound during rendering engine setup. We could implement image arrays like we do texture arrays, but we must consider the HUGE number of format combinations required to fit all cases.

Images can, like textures, be 2D, 2D multisample, 3D or Cube, just to mention the common types. There are special cases like 2DArray, CubeArray and so forth, but array textures are not even used or supported in Nebula; I never saw the need for them. However, images also need a format qualifier if the image is to be used with imageLoad, meaning we would basically need a uniform array of all 4 ordinary types, in all permutations of formats. While possible, I deemed it a big no-no, and instead decided that since images are special-use resources for single-fire read-writes, a shader has to update the descriptor set each time it wants to change one, meaning it's more efficient to, within the same shader, reuse the same variable and simply not perform a new binding. All in all, this shouldn't become a problem.

What's left to do is to enforce certain descriptor set layouts in the shader loader, so that no shader creator accidentally uses a reserved descriptor set (like 0 for textures, 1 for camera, 2 for lighting, 3 for instancing). If a shader does, it will manipulate a reserved descriptor set and become incompatible, which we can't have, since it would cause manually applied descriptors to stop being bound, resulting in unpredictable behavior. Another way of solving this is to change the group syntax in AnyFX into something more stable and easier to validate, like a structure-like syntax, for example:

group 0
{
	sampler2D Texture;
	varblock Block
	{
		vec4 Vector;
		uint Index;
	};
};

And then assert that no group is later declared with the same index. To handle stray variables declared outside of a group, the compiler simply generates the default group, and puts all strays in there.

The only issue I have with the above syntax is the annoying level of indirection before you actually get to the meat of the shader code. I think implementing an engine-side check is the way to go for now, but implementing groups as a structure like the above could be a valid idea, since we might want the same behavior in all rendering APIs. Consider this for OpenGL too, where we can guarantee that applying a group of uniforms and textures remains persistent if all shaders share the same declaration. Although in OpenGL, since we don't have descriptor sets, we must simply ensure that the location values for individual groups remain consistent.
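
The engine-side check discussed above could be as simple as this sketch (hypothetical helper; the reserved indices are the ones from the example: 0 for textures, 1 for camera, 2 for lighting, 3 for instancing):

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Reject shaders that declare variables in engine-reserved groups:
// 0 = textures, 1 = camera, 2 = lighting, 3 = instancing.
bool ValidateShaderGroups(const std::set<uint32_t>& groupsUsedByShader)
{
    static const std::set<uint32_t> reserved = { 0, 1, 2, 3 };
    for (uint32_t g : groupsUsedByShader)
        if (reserved.count(g))
            return false; // would clobber a reserved descriptor set
    return true;
}
```

Running this in the shader loader at load time would catch a stray group declaration before it silently unbinds a manually applied descriptor set at runtime.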

Vulkan – Shading ideas

So this is where the Vulkan renderer is right now.

What you see might be unimpressive, but when getting to this stage there isn't too much left. As you can see, I can load textures, which are compressed (hopefully you can't see that), and render several objects with different shaders, uniform values (like positions) and textures.

This might seem to be near completion, with just a couple of post effects needing to be redone (the mipmap reduction compute shader, for example), but you would be wrong.

In this example, the Vulkan renderer created a single descriptor set per object, which I thought was fine; I basically assumed that is what descriptor sets were for. I believed that descriptor sets would be like variable setups, applied as a package instead of individually selecting textures and uniform buffers. However, on my GPU, an AMD Fury Nano sporting a massive 4 GB of GPU memory (it doesn't run Chrome, so it's massive), I ran out of memory when reaching a meager 2000 objects. Out of GPU memory; never actually experienced that before.

So I decided to check how much memory I actually allocated, and while Vulkan supplies you with a nice set of callback functions to look this up, it doesn't really do much for descriptor pools. I had already boiled the memory exhaustion down to happening when I create too many objects, so it cannot be a texture issue. Anyhow, in order to have per-object unique variables, each object allocates its own uniform buffer backing for the 'global' uniform buffer, but buffer memory never exceeds ~260 MB. The problem is not there.

So the only conclusion I can draw is that the AMD driver allocates TONS of memory for the descriptor sets. So I did a bit of studying and decided to go with the approach for handling descriptor sets presented in Vulkan Fast Paths.

The TL;DR of the PDF is to put all textures into huge arrays, so I did:

#define MAX_2D_TEXTURES 4096
#define MAX_2D_MS_TEXTURES 64
#define MAX_CUBE_TEXTURES 128	// missing from the original excerpt; value assumed
#define MAX_3D_TEXTURES 128

group(TEXTURE_GROUP) texture2D 		Textures2D[MAX_2D_TEXTURES];
group(TEXTURE_GROUP) texture2DMS 	Textures2DMS[MAX_2D_MS_TEXTURES];
group(TEXTURE_GROUP) textureCube 	TexturesCube[MAX_CUBE_TEXTURES];
group(TEXTURE_GROUP) texture3D 		Textures3D[MAX_3D_TEXTURES];

And textures are fetched through:

group(TEXTURE_GROUP) shared varblock RenderTargetIndices
{
	// base render targets
	uint DepthBufferIdx;
	uint NormalBufferIdx;
	uint AlbedoBufferIdx;
	uint SpecularBufferIdx;
	uint LightBufferIdx;
	// shadow buffers
	uint CSMShadowMapIdx;
	uint SpotLightShadowMapIdx;
};

Well, render targets are, at least. At the ordinary shader level, textures are fetched by an index which is unique per object. I also took the liberty of implementing samplers as uniform-like objects, bound in the shader, which can be combined with textures in GLSL as defined in GL_KHR_vulkan_glsl, section “Combining separate samplers and textures”. This lets us assemble samplers and textures in the shader code, which is good when we have a texture array like the one above, where we can’t really assign a sampler per texture in the shader: when writing the shaders we have absolutely no clue which texture goes where, so it is much more flexible to assign a sampler state once we know what kind of texture we want. Let me give you an example.

The old way would be:

samplerstate GeometryTextureSampler
{
	Samplers = { SpecularMap, EmissiveMap, NormalMap, AlbedoMap, DisplacementMap, RoughnessMap, CavityMap };
	Filter = MinMagMipLinear;
	AddressU = Wrap;
	AddressV = Wrap;
};

vec4 diffColor = texture(AlbedoMap, UV) * MatAlbedoIntensity;
float roughness = texture(RoughnessMap, UV).r * MatRoughnessIntensity;
vec4 specColor = texture(SpecularMap, UV) * MatSpecularIntensity;
float cavity = texture(CavityMap, UV).r;

The new way is:

samplerstate GeometryTextureSampler
{
	Filter = MinMagMipLinear;
	AddressU = Wrap;
	AddressV = Wrap;
};

vec4 diffColor = texture(sampler2D(AlbedoMap, GeometryTextureSampler), UV) * MatAlbedoIntensity;
float roughness = texture(sampler2D(RoughnessMap, GeometryTextureSampler), UV).r * MatRoughnessIntensity;
vec4 specColor = texture(sampler2D(SpecularMap, GeometryTextureSampler), UV) * MatSpecularIntensity;
float cavity = texture(sampler2D(CavityMap, GeometryTextureSampler), UV).r;

While the new way is only possible in GLSL through the GL_KHR_vulkan_glsl extension, it has been the default in HLSL since DirectX 10. This syntax also allows for a direct mapping of texture sampling between GLSL <-> HLSL if we want to use HLSL shader model 4.0 and up.

This method basically allows for all textures to be bound to a single descriptor set, and this descriptor set can then be applied to bind ALL textures at the same time. So when this texture library is submitted, we basically have access to all textures directly in the shader. Neat huh? It’s like bindless textures, and that is exactly what AMD mentions in the talk.
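On the engine side, something has to hand out those per-object indices into the big texture arrays. Here is a minimal sketch of how that could look; this is my own illustration, not Nebula’s actual code, and the class and method names are made up:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hands out slots in the shared Textures2D[MAX_2D_TEXTURES] array. The engine
// writes the returned index into the object's uniform data; the shader then
// indexes the texture array with it.
class TextureSlotAllocator
{
public:
    explicit TextureSlotAllocator(uint32_t capacity) : capacity(capacity) {}

    uint32_t Alloc()
    {
        if (!freeSlots.empty())
        {
            // reuse a slot from a previously destroyed texture
            uint32_t slot = freeSlots.back();
            freeSlots.pop_back();
            return slot;
        }
        assert(nextSlot < capacity && "texture array exhausted");
        return nextSlot++;
    }

    void Free(uint32_t slot) { freeSlots.push_back(slot); }

private:
    uint32_t capacity;
    uint32_t nextSlot = 0;
    std::vector<uint32_t> freeSlots;
};
```

With this scheme the descriptor set holding the texture array never changes per object; only the integer index in the object’s uniform data does.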

Then we come to uniform buffers. I read the Vulkan Memory Management article and all of a sudden it became completely clear to me. If we want to keep the number of descriptor sets down, we can’t have an individual buffer per object, because that requires either a descriptor set per object with the individual buffer bound to it, or syncing the rendering commands and updating the descriptor set in use.

So the solution is to use the same uniform buffer and expand its size per object. Naively resizing for every object is clearly not a good way to go if you follow the NVIDIA article, so instead the uniform buffers implement an array allocation method: we grow the total size by a set number of instances and keep lists of used and free indices (which can be converted to offsets) into the buffer. Allocating when there are no free indices grows the buffer by the maximum of a set amount (8) and the number of instances requested. Allocating when there are free indices returns the offset calculated from such a free index. Trying to allocate a range of values first attempts to fit the range in the list of free indices, and allocates a new chunk if no such range could be found.

So basically, the Vulkan uniform buffer implementation uses a pool allocator to grow its size (it doesn’t shrink, which we might actually want to do). Because we are using GPU memory, we might want to avoid doubling the memory, but that is a problem for later. Each allocation returns the offset into the buffer, so that we can bind the descriptor with per-object offsets later, which means we retain the exact same descriptor set and only modify the offsets.
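The free-list scheme described above can be sketched in a few lines. Again, a minimal illustration with made-up names, not the actual Nebula code; for brevity it only allocates single instances, not ranges:

```cpp
#include <cstdint>
#include <vector>

// One pool per shader-level uniform buffer: the buffer grows in whole
// instances, and a free list of instance indices is converted to byte
// offsets on allocation.
class UniformBufferPool
{
public:
    explicit UniformBufferPool(uint32_t instanceStride) : stride(instanceStride) {}

    // Allocate one instance; returns the byte offset into the buffer.
    uint32_t AllocInstance()
    {
        if (freeIndices.empty())
            Grow(8); // grow by max(8, requested); requested is 1 here
        uint32_t index = freeIndices.back();
        freeIndices.pop_back();
        return index * stride;
    }

    void FreeInstance(uint32_t offset) { freeIndices.push_back(offset / stride); }

    uint32_t BufferSize() const { return numInstances * stride; }

private:
    void Grow(uint32_t instances)
    {
        // in a real implementation this reallocates the GPU-side buffer;
        // push new indices in reverse so the lowest index is handed out first
        for (uint32_t i = instances; i > 0; i--)
            freeIndices.push_back(numInstances + i - 1);
        numInstances += instances;
    }

    uint32_t stride;
    uint32_t numInstances = 0;
    std::vector<uint32_t> freeIndices;
};
```

The returned byte offsets are exactly what gets passed along when binding the descriptor set, so the set itself is never rewritten.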

So to sum up:

  • Texture arrays with all textures bound at the same time, submitting the entire texture library (or libraries, for all 2D, 2DMS, Cube and 3D textures).
  • Uniform buffers are created per shader (resource-level), and each instance allocates a chunk of memory in this buffer.
  • Offsets into the same buffer are used per object, so we can keep the same descriptor set but jump around in it, giving us per-object variables.
  • Textures are sent as indices, and can thus be selected on a per-object basis too.
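One practical detail with the per-object offset approach: Vulkan requires that dynamic uniform buffer offsets be multiples of the device limit minUniformBufferOffsetAlignment (often 256 bytes), so the per-instance stride has to be rounded up accordingly. A small helper, with an assumed name:

```cpp
#include <cstdint>

// Round size up to the next multiple of alignment (alignment must be a
// power of two, which Vulkan guarantees for this limit).
inline uint32_t AlignUp(uint32_t size, uint32_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}
```

For example, a 68-byte per-object block on a device with a 256-byte minimum alignment occupies a full 256-byte slot in the shared buffer; that waste is the price of keeping a single descriptor set.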

The only real issue with this method is read-write textures, also known as images in GLSL. Since image variables have to be declared with a format qualifier denoting how to read from the image, we can’t really bind them as above. However, images are not updated at nearly the same frequency as textures; instead they are bound and switched on a per-shader basis, as with post effects, and are either statically assigned or can be predicted. For example, a horizontal + vertical blur pass requires the same image to be bound between both passes, but if we want to perform a format change, like in the HBAO shader, where we transfer from ao, p -> ao, we can just bind the same image to two different slots and thus avoid descriptor updates.

Oh, I should also mention that all of this might soon be possible in OpenGL too, with the GL SPIR-V extension, which should give us the ability to use samplers as separate objects in OpenGL. Texture arrays already exist, and so do uniform buffers.

Vulkan – Designing a back-end.

With the Khronos validation layer becoming more and more (although perhaps not entirely) complete, the Vulkan renderer implementation is coming along nicely. At the moment I cannot produce anything but a black window, hopefully mainly because the handling of descriptor sets is not completely done yet.

However, the design choices and the way to handle Vulkan are still noteworthy to bring up, so this post is going to be comprehensive with illustrations showing the thought process.

Command buffers

As you may or may not know, most of the operations done on the GPU are issued through a command queue. This is apparent in OpenGL if you take a look at the functions glFlush and glFinish. In Vulkan, however, command buffers are yours to allocate, destroy, reset, queue up and, most importantly, populate with commands. Noteworthy is also that Vulkan operates by submitting command buffers to Queues, and your GPU might support more than one Queue. A Queue can be thought of as a road, although some Queues only accept buses, some bikes, and others are for pedestrians. In Vulkan, there are three different types of Queues.

  • Transfer – Allows for fast memory transfers between GPU and CPU, as well as between resources locally on the GPU.
  • Graphics – Allows for render commands such as Draw, BeginRenderPass, etc.
  • Compute – Allows for dispatch calls.

The intuitive way would be to try to replicate the GL4 behavior by simply creating a single MAIN command buffer into which you put all your commands, and then executing it at the end of the frame. While this is a fine enough solution, it causes some issues which we will get into later. But mainly, the whole point of command buffers is to be able to precompute them for reuse in some cases (binding render targets and render target-specific shader variables, viewports, etc.), or to populate them rapidly when drawing, using threads to do so. There are currently 3 main buffers in Nebula, one for each queue.

Think of command buffers as sync points. Begin a command buffer, and all commands recorded afterwards are written to it. When the buffer is ended, it can be used to run the commands on the GPU. Sounds simple. It’s not.


It’s not, because most rendering engines today are not designed around this principle, and if they were, it would be equally hard to make them work on older implementations, perhaps even breaking a huge part of the user-space code. The most obvious place to begin and end a command buffer is, just like it was with DX9, within BeginScene and EndScene. My bet is that most engines are designed around this principle. So let’s say I want to load a texture, or perhaps just create a vertex buffer and update it with data. Well, if we don’t do it within our frame, the command buffer won’t be in the Begin state, and the command will thus fail.


There are two ways to solve this.

Solution 1 – the lazy option

Create a command buffer when needed, create your resource, then record into it that you want to update the resource with data. Last, you submit the command buffer and update the image.
This solution works fine, but it might be slow if you are loading many textures on the fly, because some frames might get tons of vkQueueSubmit calls while the rest sit idle. This could be fixed if you knew when you begin creating/updating resources and when you end, which leads me into the next part.

Solution 2 – delegates

The other solution is to postpone the command until BeginFrame. This can be done using a simple delegate system, which allows you to save the command into a struct, add it to a vector, and then run it whenever you want. For Nebula I implemented several ‘buckets’ of these delegates, so that we can run them on Begin/EndFrame, Begin/EndPass, etc. This easily allows us to accumulate tons of resource updates and run them in a single queue submit, instead of making lots of smaller ones. This way of delaying commands is also extremely useful for releasing memory, which I will talk about later.
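A bucketed delegate system like the one described can be sketched very compactly. This is my own illustration, not the Nebula implementation, and the bucket names are invented:

```cpp
#include <functional>
#include <vector>

// Commands are recorded as closures and flushed at well-defined points in
// the frame, so many resource updates collapse into one queue submit.
enum class DelegateBucket { OnBeginFrame, OnEndFrame, OnBeginPass, OnEndPass, NumBuckets };

class DelegateQueue
{
public:
    // Record a command to be run when the given bucket is flushed.
    void Defer(DelegateBucket bucket, std::function<void()> cmd)
    {
        buckets[static_cast<size_t>(bucket)].push_back(std::move(cmd));
    }

    // Called at e.g. BeginFrame: run and clear everything queued for the bucket.
    void Flush(DelegateBucket bucket)
    {
        auto& cmds = buckets[static_cast<size_t>(bucket)];
        for (auto& cmd : cmds) cmd();
        cmds.clear();
    }

private:
    std::vector<std::function<void()>> buckets[static_cast<size_t>(DelegateBucket::NumBuckets)];
};
```

In the engine, the closures would record Vulkan transfer commands; here the mechanism is what matters, and it works just as well for deferred memory releases.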


One of the main points of command buffers is that they allow us to queue up commands in threads, and isn’t that wonderful? Turns out it’s not quite as simple as that, for several reasons. The first is that we must use a uniquely created command buffer per thread, and also have a command buffer pool per thread. The reason is that if we share pools, one thread may allocate a buffer which another already has, causing the same buffer to appear in multiple threads, and thus there is no guarantee for the order of the commands. Using multiple buffers allows us to ensure that a specific sequence of commands is executed in order; however, we want to avoid doing a submit for every one of these command buffers when we are done. This is why there are secondary command buffers, which allow us to record commands which are then patched into a primary command buffer (read: a MAIN command buffer), after which everything is executed with a single submit.

In Nebula the VkRenderDevice class has a constant integer describing how many graphics, compute and transfer threads should be created, along with completion events so that we may wait for the threads to finish. These will then act as a pool, each receiving commands using a scheme implemented by the RenderDevice, which is described later.

Commands are sent to threads using something very similar to the delegate system, by putting structs in a thread-safe queue and having the command buffer building threads populate their command buffers. The problem isn’t really that; the issue is how to distribute the command buffer buildup from the rendering pipeline to the threads. One way would obviously be to switch threads for each draw command, so given four threads we might get the following.

draw1 draw2 draw3 draw4 draw5 draw6 draw7 draw8
  • Thread 1: draw1 draw5
  • Thread 2: draw2 draw6
  • Thread 3: draw3 draw7
  • Thread 4: draw4 draw8

So if we then collect these draws together, we get

draw1, draw5, draw2, draw6, draw3, draw7, draw4, draw8

Clearly, our draws are out of order, which is fine if we are using depth testing. The real problem comes with binding the rendering state (or pipelines, as they are called). Consider the following linear command list.

pipeline1 draw1 draw2 pipeline2 draw3 draw4 pipeline3 draw5 draw6 pipeline4 draw7 draw8

Then our threads will get

  • Thread 1: pipeline1 draw1 draw5
  • Thread 2: pipeline2 draw2 draw6
  • Thread 3: pipeline3 draw3 draw7
  • Thread 4: pipeline4 draw4 draw8

Basically, draws 5-8 will now not get the pipeline associated with them, resulting in an incorrect result. To be honest, there is no perfect way of handling this, but there is a way to split draws into threads by going by the lowest common denominator, which is the shader pipeline. In Nebula right now, the thread used for draw command population is cycled whenever we change pipeline, which results in the above command list looking like this on the threads.

  • Thread 1: pipeline1 draw1 draw2
  • Thread 2: pipeline2 draw3 draw4
  • Thread 3: pipeline3 draw5 draw6
  • Thread 4: pipeline4 draw7 draw8

However, when not swapping pipelines, we will obviously get all draws on the same thread, which might not be overly efficient. I haven’t come up with a solution for this yet, but one obvious option would be to sync and rendezvous the command buffer threads every X calls and start a new batch.

Now, this is only really relevant if we want to rapidly prepare our scene with draws, but how does it fare with transfers and computes? Well, with computes you have the same deal: bind shader, then dispatch. The only real difference between geometry rendering and computation is that we also bind vertex and index buffers when we need to produce geometry. Transfers, however, are a completely different thing. In theory, each transfer operation could be done on its own thread, meaning that for every new buffer update we circulate the threads.


In the good old days with the older APIs, binding the rendering state or compute state was easy; you would just bind shaders, vertex layouts, render settings like blending and alpha testing, samplers and subroutines, and they would incrementally build up a rendering state. In the next-gen APIs this work is left to the developer, and is referred to in Vulkan as a Pipeline. A pipeline describes EVERYTHING required by the GPU to perform a draw, so you may understand that this structure is huge.

The only problem with pipelines is that they couple together so many different pieces of information – shader programs, vertex attribute layout, blending, rasterizing and scissor state options, depth state options, viewports, etc. Just look at this beast of a structure!

The reason for this setup was that many games used to create tons of shaders, bind them, but not actually use them. This caused tons of unwanted validation for a shader-vertex-render state setup which was never supposed to be used, and slowed down performance. This new method forces the developer to say they are done, and that this is a complete state to render with.

Now you might already predict that figuring out all possible combinations of your setups and precomputing all these pipelines is the best way to solve it, and you would be correct. However, how do you look a pipeline up afterwards? Well, the good folks at Khronos thought of this and implemented a pipeline cache, which only creates a new pipeline if none exists; if one was already created with the same members, then vkCreateGraphicsPipelines or vkCreateComputePipelines (the latter being infinitely easier to precompute) will simply return the pipeline you created earlier. Very neat if you ask me, and according to Valve this is magic and incredibly fast.

In Nebula, shaders loaded as compute shaders can predict everything they need and create the pipeline on resource load, which is very flexible indeed. For graphics, I implemented a flagging system, where a flag is set whenever a member of this struct is initialized, and when all flags are set and we want to render, we call a function to build and bind the pipeline.

	// setting the input layout marks its flag and invalidates the built pipeline
	this->currentPipelineBits |= InputLayoutInfoSet;
	this->currentPipelineInfo.pInputAssemblyState = inputLayout;
	this->currentPipelineBits &= ~PipelineBuilt;

	// same pattern for the framebuffer layout
	this->currentPipelineBits |= FramebufferLayoutInfoSet;
	this->currentPipelineInfo.renderPass = framebufferLayout.renderPass;
	this->currentPipelineInfo.subpass = framebufferLayout.subpass;
	this->currentPipelineInfo.pViewportState = framebufferLayout.pViewportState;
	this->currentPipelineBits &= ~PipelineBuilt;

	// when rendering: every info flag must be set, not just any one of them
	n_assert((this->currentPipelineBits & AllInfoSet) == AllInfoSet);
	if ((this->currentPipelineBits & PipelineBuilt) == 0)
	{
		// create (or fetch from the cache) and bind the pipeline here,
		// elided in this excerpt
		this->currentPipelineBits |= PipelineBuilt;
	}