General – Id allocator

With the rewrite of the graphics system, there is an obvious need for a way to easily and consistently implement allocators. So what do we need for a DOD design?

Iteration 1 – Macros

Primarily, we need some class which is capable of having any number of members. This is in itself non-intuitive, because an N-member template class cannot generate a variable name for each member. The alternative is a series of macros which construct the class for us, something like __BeginClass, __AddMember, __EndClass, but here’s the issue: while declaring the class itself is easy to do with macros, there also has to be an allocator function that uses Ids to recycle slices into those arrays. So we can do Begin/Add/End for the class declaration, but then we also need a Begin/Add/End pattern for the allocation function. Ugly:

	__AddStorage(VkBuffer, buffers);
	__AddStorage(VkDeviceMemory, mems);
	__AddStorage(Resources::ResourceId, layouts);
	__AddStorage(Base::GpuResourceBase::Usage, usages);
	__AddStorage(Base::GpuResourceBase::Access, access);
	__AddStorage(Base::GpuResourceBase::Syncing, syncing);
	__AddStorage(int, numVertices);
	__AddStorage(int, vertexSize);
	__AddStorage(int, mapcount);

	__AddAllocator(buffers, nullptr);
	__AddAllocator(mems, nullptr);
	__AddAllocator(layouts, Ids::InvalidId24);
	__AddAllocator(usages, Base::GpuResourceBase::UsageImmutable);
	__AddAllocator(access, Base::GpuResourceBase::AccessRead);
	__AddAllocator(syncing, Base::GpuResourceBase::SyncingCoherent);
	__AddAllocator(numVertices, 0);
	__AddAllocator(vertexSize, 0);
	__AddAllocator(mapcount, 0);

The good side is that we can declare default values for each slice. Still, having to write the same thing twice is not pretty, and neither are the macros underlying it. It’s very easy to make a mistake, and even Visual Studio is of little help when debugging macros. Another problem is that if we need a complex type with commas in it, the macro will treat everything after a comma as a new argument, so:

__AddStorage(std::map<int, float>, mapping);

is going to assume the first argument is “std::map<int”, and so on. To circumvent this, we first need to typedef the map. Annoying, ugly, and ultimately a work-around. This was iteration 1.

Iteration 2 – Generic programming method

While I am opposed to boost-like (or STL-style) generic programming, where simple things like strings become template types because it’s cool, this problem really has no better solution. The behavior is simple: one id pool, N arrays of data, one allocation function which allocates a new slice in all N arrays, and some function which, using an id from the pool, can retrieve and deallocate data from all arrays simultaneously.

	/// we need a thread-safe allocator since it will be used by both the memory and stream pool
	typedef Ids::IdAllocatorSafe<
		RuntimeInfo,						// 0 runtime info (for binding)
		LoadInfo,							// 1 loading info (mostly used during the load/unload phase)
		MappingInfo							// 2 used when image is mapped to memory
	> VkTextureAllocator;

RuntimeInfo, LoadInfo and MappingInfo are structs which denote components of a texture:

	struct LoadInfo
	{
		VkImage img;
		VkDeviceMemory mem;
		TextureBase::Dimensions dims;
		uint32_t mips;
		CoreGraphics::PixelFormat::Code format;
		Base::GpuResourceBase::Usage usage;
		Base::GpuResourceBase::Access access;
		Base::GpuResourceBase::Syncing syncing;
	};
	struct RuntimeInfo
	{
		VkImageView view;
		TextureBase::Type type;
		uint32_t bind;
	};
	struct MappingInfo
	{
		VkBuffer buf;
		VkDeviceMemory mem;
		VkImageCopy region;
		uint32_t mapCount;
	};

The problem with this solution is that the members are not named but numbered, so Get requires a template integer argument selecting the member. However, it’s implemented such that Get can resolve its return type for us, which is nice.

	/// during the load-phase, we can safely get the structs
	VkTexture::RuntimeInfo& runtimeInfo = this->Get<0>(res);
	VkTexture::LoadInfo& loadInfo = this->Get<1>(res);

For textures, we are using the thread-safe variant, since textures can either be files loaded in a thread, or memory-loaded directly from memory. Thus it requires either the Enter/Leave get pattern, or GetSafe. We can also use GetUnsafe, but that is greatly discouraged because of the obvious syncing issue. In the above code we can see that Get takes the index of the member in the allocator, and automatically resolves the return type. Technically, this is solved by a long chain of generic programming types, unfolding the template arguments and generating an Array Append for each type.

template <typename C>
struct get_template_type;

/// get inner type of a template type
template <template <typename> class C, typename T>
struct get_template_type<C<T>>
{
	using type = T;
};

/// get inner type of a constant ref outer type
template <template <typename> class C, typename T>
struct get_template_type<const C<T>&>
{
	using type = T;
};

/// helper typedef so that the above expression can be used like decltype
template <typename C>
using get_template_type_t = typename get_template_type<C>::type;

/// unpacks allocations for each member in a tuple
template<class...Ts, std::size_t...Is>
void alloc_for_each_in_tuple(std::tuple<Ts...>& tuple, std::index_sequence<Is...>)
{
	using expander = int[];
	(void)expander{ 0, ((void)std::get<Is>(tuple).Append(get_template_type_t<Ts>()), 0)... };
}

/// entry point for above expansion function
template<class...Ts>
void alloc_for_each_in_tuple(std::tuple<Ts...>& tuple)
{
	alloc_for_each_in_tuple(tuple, std::make_index_sequence<sizeof...(Ts)>());
}

/// get type of contained element in Util::Array stored in std::tuple
template <int MEMBER, class ... TYPES>
using tuple_array_t = get_template_type_t<std::tuple_element_t<MEMBER, std::tuple<Util::Array<TYPES>...>>>;

The internet helped me greatly. The allocator can be created as such:

template <class ... TYPES>
class IdAllocator
{
public:
	/// constructor
	IdAllocator(uint32_t maxid = 0xFFFFFFFF, uint32_t grow = 512) : pool(maxid, grow), size(0) {};
	/// destructor
	~IdAllocator() {};

	/// allocate a new resource, and generate new entries if required
	Ids::Id32 AllocResource()
	{
		Ids::Id32 id = this->pool.Alloc();
		if (id >= this->size)
		{
			alloc_for_each_in_tuple(this->objects);
			this->size++;
		}
		return id;
	}

	/// recycle id
	void DeallocResource(const Ids::Id32 id) { this->pool.Dealloc(id); }

	/// get single item from id, template expansion might hurt
	template <int MEMBER>
	inline tuple_array_t<MEMBER, TYPES...>&
	Get(const Ids::Id32 index)
	{
		return std::get<MEMBER>(this->objects)[index];
	}

private:
	Ids::IdPool pool;
	uint32_t size;
	std::tuple<Util::Array<TYPES>...> objects;
};

The only real magic here is that we use std::tuple to store the data, tuple_array_t to find out the type of a tuple member, and alloc_for_each_in_tuple to allocate a slice in each array. It’s all compile time, and all generic, but not so generic as to be too hard to understand. Cheerio!

Now, the coolest thing by far is that it’s possible to chain these allocators, which makes it easy to adapt class hierarchies!

	/// this member allocates shaders
	Ids::IdAllocator<
		AnyFX::ShaderEffect*,						//0 effect
		SetupInfo,									//1 setup immutable values
		RuntimeInfo,								//2 runtime values
		VkShaderProgram::ProgramAllocator,			//3 variations
		VkShaderState::ShaderStateAllocator			//4 the shader states, sorted by shader
	> shaderAlloc;

Here, VkShaderProgram::ProgramAllocator allocates all individual shader combinations, and VkShaderState::ShaderStateAllocator contains all the texture and uniform binds. They can obviously also have their own allocators, and so on, and so forth! And since they are now aligned as a single array under a single item of the parent type, which in this case is the shader allocator, they also appear linearly in memory. So when we bind a shader and then swap its states, all of the states for that shader lie next to each other, which is great for cache locality!

Graphics – New design philosophy


Data-oriented design has been the new thing ever since it was rediscovered, and for good reason. The funny part is that in practice it is a regression in technology, back to the good old days of C, although the motivations may be different. So here is the main difference between OOP and DOD:

With OOP, an object is a singular instance of its data and methods. As an OOP class gets more members, its size increases, and with it the stride between consecutive elements in an array of such objects. In addition, an OOP solution has a tendency to allocate an instance of an object whenever it is required. OOP is somewhat intuitive to many modern programmers, because it attempts to explain the code in clear text.

The DOD way is very different. It’s still okay to have classes and members, although care should be taken as to how those members are used. For example, if some code only requires members A and B, then it’s bad for the cache if members C and D lie between each element. So how do we keep an object-like mentality? Say we have a class A with members a, b, and c. Instead of treating each A as an individual object, we have a new class, AHub, which is the manager of all A instances. The AHub contains a, b and c as individual arrays. So how do we identify individual A objects? They become an index into those arrays, and since those arrays are uniform in length, each index becomes a slice. It’s fine if, for example, a is itself another class or struct. There are many benefits to a design of this nature:

1. Objects become integers. This is nice because there is no need to include anything to handle an integer, and the implementation can easily be obfuscated in a shared library.
2. No need to keep track of pointers and their usage. When an ID is released nothing is really deleted, instead the ID is just recycled. However, there are ways to check if an ID is valid.
3. Ownership of objects is super-clear. The hub classes will indiscriminately be the ONLY class responsible for creating and releasing IDs.

As an example of how different the code can be:

// OOP method
Texture* tex = new Texture();

// DOD method
Id tex = Graphics::CreateTexture(width, height, format);

Now, one of the things which may be slightly less comfortable is where to put all those functions. Since all we are playing with are IDs, there are no member functions to call. So where is the function interface? The hub, of course! Since the hubs are the interfaces to our objects, the hubs themselves are responsible for the operations, and they can just as well be singleton instances. And because the hubs are singletons, we can have free functions declared in a namespace which belong to no class at all. So in the above example, we have the line:

Id tex = Graphics::CreateTexture(width, height, format);

But since textures are managed by some singleton, what Graphics::CreateTexture really does is this:

namespace Graphics
{

Id CreateTexture(int width, int height, PixelFormat format)
{
    return TextureHub::CreateTexture(width, height, format);
}

} // namespace Graphics

Now the benefit is that all functions can go into the same namespace, Graphics in this case, and the programmer does not need to keep track of what the hub is called.


In Nebula, textures are treated as resources, and they go through a different system to be created, but the process is exactly the same. Textures are managed by a ResourcePool, like all resources, and the ResourcePools are also responsible for implementing the behavior of those resources. With this new system, smart pointers are not really needed that much, but one of the few cases where they are still in play is for those pools. The resources have a main hub, called the ResourceManager, and it contains a list of pools (which are also responsible for loading and saving). There are two families of pools: stream pools and memory pools. Stream pools can act asynchronously, and fetch their data from some URI, for example a file. Memory pools are always immediate, and take their information from data already in memory.

Textures, for example, can either be a file, like a .dds asset, or a buffer mapped and loaded by some other system, like LibRocket. Memory pools have a specific set of functions to create a resource: ReserveResource, which creates a new empty resource and returns its Id, and UpdateResource, which takes a pointer to some update structure which is then used to update the data.

The way a resource is created is through a call to the ResourceManager, which is formatted like so:

Resources::ResourceId id = ResourceManager::Instance()->ReserveResource(reuse_name, tag, MemoryVertexBufferPool::RTTI);
struct VboUpdateInfo info = {...};
ResourceManager::Instance()->UpdateResource(id, &info);

reuse_name is a global resource Id which ensures that consecutive calls to ReserveResource will return the same Id. tag is a global tag, which will delete all resources under the same tag if DiscardByTag is called on that pool. The last argument is the type of pool which is supposed to reserve this resource. In order to make this easier for the programmer, we can create a function within the CoreGraphics namespace as such:

namespace CoreGraphics
{

Resources::ResourceId CreateVertexBuffer(reuse_name, tag, numVerts, vertexComponents, dataPtr, dataPtrSize)
{
    Resources::ResourceId id = ResourceManager::Instance()->ReserveResource(reuse_name, tag, MemoryVertexBufferPool::RTTI);
    struct VboUpdateInfo info = {...};
    ResourceManager::Instance()->UpdateResource(id, &info);
    return id;
}

} // namespace CoreGraphics

ReserveResource has to go through and find the MemoryVertexBufferPool first, so we can eliminate that too by just saving a pointer to the MemoryVertexBufferPool somewhere, perhaps in the same header. This is completely safe, since the list of pools must be initialized first, so their indices are already fixed.

Now all we have to do to get our functions is include the CoreGraphics header, and we are all set! No need to know about nasty class names; everything is just in there, like a nice simple facade. Extending it is super easy: just declare the same namespace in some other file and add new functions! Since we are always dealing with singletons and static hubs, none of this should be too complicated. It’s back to functions again! We can choose to have those functions declared in a separate header per use case, for example all texture-related functions could live in the texture.h header, or they could all be exposed in a single include. Haven’t decided yet.

One of the big benefits is that while it’s quite complicated to expose a class to, for example, a scripting interface, exposing a header with simple functions is very simple. And since everything is handles, the user never has to know about the implementation, and is only exposed to the functions they might want.


So I mentioned that everything is returned as a handle, but a handle can contain much more information than just an integer. The resource Id is one such example; it contains the following:

1. First (leftmost) 32 bits is the unique id of the resource instance within the loader.
2. Next 24 bits is the resource id as specified by reuse_name for memory pools, or the path to the file for stream pools.
3. Last 8 bits is the id of the pool itself within the ResourceManager. This allows an Id to be immediately recognized as belonging to a certain pool, and the pool can be retrieved directly if required.

This system intrinsically keeps track of the usage count, since the number of times a resource is in use is indicated by the unique resource instance ids, and once all of those ids have been returned, the resource itself is safe to discard.

All the graphics side objects are also handles. If we for example want to bind a vertex buffer to slot 0 with offset 0, we do this:

CoreGraphics::BindVertexBuffer(id, 0, 0);

Super simple, and that function will fetch the required information from id, and send it to the render device. While this all looks good here, there is still tons of work left to do in order to convert everything.

Vulkan – Designing a new frame script system

Nebula has a neat feature called frame shaders. Frame shaders are XML scripts which describe the rendering of an entire frame. However, the frame shaders in Nebula were designed with a DirectX 9 mindset, and are in dire need of a rewrite.

With Vulkan, and partly in OpenGL 4, there are slightly more efficient ways of binding render targets. In DirectX 9 there was a clear distinction between multiple render targets and singular ones. In OpenGL we have framebuffers, which are objects containing all render targets. In DirectX 10-12 we bind render targets individually. In Vulkan and OpenGL we can have a framebuffer but select only a subset of its attachments to actually use, allowing us to bind framebuffers less often than before. In Vulkan we can even pass data between renders through input attachments, so our new design has to take that into consideration.

We also want to be able to apply global variables in the frame shader, so that we can, for example, switch out the NormalMap, AlbedoBuffer, etc. if we render with VR or want to produce a reflection cube texture. This allows the frame script to apply settings per execution, which can be shared across all shaders used when rendering the script.

So one of the design choices is to build a frame scripting system which lets us add frame operations just like FramePassBase. However, a FramePass is a bit too ‘high-level’, since it implies something like a texture and some draws. With the new system we want to execute memory barriers, trigger events to keep track of compute shader jobs, and assemble highly optimized subpass dependency chains, meaning a frame operation can be much simpler than an actual pass.

Also, we want to slightly redesign some of the concepts of CoreGraphics, where we don’t begin a pass with a render target, a multiple render target or a render target cube, but instead use a pass object. The pass already knows about its subpasses and attachments, and the frame system knows when to bind it and what to do when it is bound.

Enter Frame2.

Declare RenderTexture    - can be used as shader variable and render target
	Fixed size 		- in pixels
	Relative size	- to screen (decimal 0-1)
	Dynamic size	- can be adjusted in settings
	Format			- any renderable color format
	Multisample		- true if render texture supports multisampling
Declare RenderDepthStencil	- implements a depth-stencil buffer for rendering
	Fixed size 		- in pixels
	Relative size	- to screen (decimal 0-1)
	Dynamic size	- can be adjusted in settings
	Format			- any accepted depth-stencil format
	Multisample		- true if depth-stencil supports multisampling
Declare ReadWriteTexture   - can be used as shader variable and compute shader input/output and fragment shader input/output
	Fixed size 		- in pixels
	Relative size		- to screen (decimal 0-1)
	Dynamic size		- can be adjusted in settings
	Format			- any color format (renderable or otherwise) but not depth-stencil 
	Multisample		- true if read-write image supports multisampling
Declare ReadWriteBuffer - can be used as compute shader input/output
	Size		- in bytes
	Relative size	- to screen, size is now size per screen pixel
Declare Event - can be used to signal dependent work that other work is done
	Set 		- created in an already set state

Declare Algorithm
	Class		- to create instance of algorithm class
GlobalState
	- List all global values used by this frame shader
Pass <Name>
	- List all attachments being used, then use index as lookup
	- Pass implicitly creates rendertarget/framebuffer
	RenderTexture <Name of declared RenderTexture>
		- Clear color
		- If name is __WINDOW__ use backbuffer
	RenderDepthStencil <Name of declared DepthStencil>
		- Clear stencil, clear depth
	Subpass <Name>
		- List subpasses depended upon
		- List of attachment indices
			- Output color
			- Output depth-stencil
			- Output input attachment (call something clever, like shader-local)
			- Resolve <boolean>
			- Passthrough (automatically assume all unmentioned attachments are pass-through)
			- Dependency
		- Viewports and scissor rects
			- Set viewport -> float4(x, y, width, height), index
			- Set scissor rect -> float4(x, y, width, height), index
			- If none are given, whole texture is used
		- Drawing
			SortedBatch <Name>
				- Must be inside subpass
				- Renders objects in Z-order
			Batch <Name>
				- Must be inside subpass
				- Batch renders objects using as minimal shader switches as possible
				- Renders materials declaring a Pass with <Name>. Rename Pass to Batch in materials.			
			System <Name>
				- Name decides what to do
				- Must be inside subpass
				- Lights means light sources
				- LightProbes means light probes
				- UI means GUI renderers
				- Text means text rendering
				- Shapes means debug shapes
			FullscreenEffect <Name>
				- Must be inside subpass
				- Bind shader
				- Update variables
			SubpassAlgorithm <Name>
				- Select algorithm to run
				- Select stage to execute so we can execute different phases with different subpasses
                                - Inputs listed from 
				- Must be inside subpass
Copy <Name>
	- Must be outside of pass
	- Target texture
	- Source texture
Blit <Name>
	- Like copy, but may filter if formats are not the same, or size differs
	- Must be outside of pass
ComputeAlgorithm <Name>
	- Select algorithm to run
	- InputImage input readwrite image
	- OutputImage output readwrite image
        - Allow asynchronous computation <boolean>, if not defined is false
	- Has to be outside of a Pass
Compute <Name>
	- Bind shader
	- Update variables
        - Allow asynchronous computation <boolean>, will require an Event to be used to control execution
	- Compute X, Y, Z
		- Is number of work groups, work group size is defined in shader
Barrier <Name>
	- Implements a memory blockade between two operations
	- Signals a ReadWriteBuffer or a ReadWriteTexture to be blocked from one pipeline stage to another, also denoting the access flags allowed for both resources prior and after the barrier
	- Can be used as a pipeline bubble, or as the content of an Event
	- Can be executed within a render pass or outside

Event <Name>
	- Can be reset
	- Can be set
	- Can be waited for
	- Can be called within render pass and outside

For the sake of minimalism, the new system implements the script as a simplified JSON file instead of XML, making it slightly more readable, although almost equally stupid. An example of a frame shader, now called a frame script, can look like this:

{
	version: 2,
	engine: "NebulaTrifid",
	renderTextures: [
		{ name: "NormalBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "DepthBuffer", 		format: "R32F", 			relative: true,  width: 1.0, height: 1.0 },
		{ name: "AlbedoBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "SpecularBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "EmissiveBuffer", 	format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "LightBuffer", 		format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "ColorBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "ScreenBuffer", 	format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "BloomBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 0.5, height: 0.5 },
		{ name: "GodrayBuffer", 	format: "A8B8G8R8", 		relative: true,  width: 0.5, height: 0.5 },
		{ name: "ShapeBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "AverageLumBuffer", format: "R16F", 			relative: false, width: 1.0, height: 1.0 },
		{ name: "SSSBuffer", 		format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "__WINDOW__" },
		{ name: "HBAOBuffer", 		format: "R16F", 			relative: true,  width: 1.0, height: 1.0 }
	],
	depthStencils: [
		{ name: "ZBuffer", 			format: "D32S8", 			relative: true,  width: 1.0, height: 1.0 }
	],
	algorithms: [
		{
			name: 		"Tonemapping", 
			class: 		"Algorithms::TonemapAlgorithm"
		},
		{
			name:		"HBAO",
			class: 		"Algorithms::HBAOAlgorithm"
		}
	],
	shaderStates: [
		{
			name: 		"FinalizeState", 
			shader:		"shd:finalize", 
			variables: [
				{semantic: "ColorTexture", 		value: "ColorBuffer"},
				{semantic: "LuminanceTexture", 	value: "AverageLumBuffer"},
				{semantic: "BloomTexture", 		value: "BloomBuffer"}
			]
		},
		{
			name: 		"GatherState",
			shader: 	"shd:gather",
			variables: [
				{semantic: "LightTexture", 		value: "LightBuffer"},
				{semantic: "SSSTexture", 		value: "SSSBuffer"},
				{semantic: "EmissiveTexture", 	value: "EmissiveBuffer"},
				{semantic: "SSAOTexture", 		value: "HBAOBuffer"},
				{semantic: "DepthTexture", 		value: "DepthBuffer"}
			]
		}
	],
	globalState: {
		name:			"DeferredTextures",
		variables: [
			{ semantic:"AlbedoBuffer", 		value:"AlbedoBuffer" },
			{ semantic:"DepthBuffer", 		value:"DepthBuffer" },
			{ semantic:"NormalBuffer", 		value:"NormalBuffer" },
			{ semantic:"SpecularBuffer", 	value:"SpecularBuffer" },
			{ semantic:"EmissiveBuffer", 	value:"EmissiveBuffer" },
			{ semantic:"LightBuffer", 		value:"LightBuffer" }
		]
	},
	computeAlgorithm: {
		name: 		"HBAO-Prepare",
		algorithm:	"HBAO",
		function:	"Prepare"
	},
	pass: {
		name: "DeferredPass",
		attachments: [
			{ name: "AlbedoBuffer", 	clear: [0.1, 0.1, 0.1, 1], 		store: true	},
			{ name: "NormalBuffer", 	clear: [0.5, 0.5, 0, 0], 		store: true },
			{ name: "DepthBuffer", 		clear: [-1000, 0, 0, 0], 		store: true },
			{ name: "SpecularBuffer", 	clear: [0, 0, 0, 0], 			store: true	},
			{ name: "EmissiveBuffer", 	clear: [0, 0, 0, -1], 			store: true	},
			{ name: "LightBuffer", 		clear: [0.1, 0.1, 0.1, 0.1], 	store: true	},
			{ name: "SSSBuffer", 		clear: [0.5, 0.5, 0.5, 1], 		store: true }
		],
		depthStencil: { name: "ZBuffer", clear: 1, clearStencil: 0, store: true },
		subpass: {
			name: "GeometryPass",
			dependencies: [], 
			attachments: [0, 1, 2, 3, 4],
			depth: true,
			batch: "FlatGeometryLit", 
			batch: "TesselatedGeometryLit"
		},
		subpass: {
			name: "LightPass",
			dependencies: [0],
			inputs: [0, 1, 2, 3, 4],
			depth: true,
			attachments: [5],
			system: "Lights"
		}
	},
	computeAlgorithm: {
		name: 			"Downsample2x2",
		algorithm: 		"Tonemapping",
		function: 		"Downsample"
	},
	computeAlgorithm: {
		name: 			"HBAO-Compute",
		algorithm:		"HBAO",
		function:		"HBAOAndBlur"
	},
	pass: {
		name: "PostPass",
		attachments: [
			{ name: "DepthBuffer",  		load: true },
			{ name: "AverageLumBuffer", 	clear: [0.5, 0.5, 0.5, 1] },
			{ name: "ColorBuffer", 			clear: [0.5, 0.5, 0.5, 1] },
			{ name: "ScreenBuffer", 		clear: [0.5, 0.5, 0.5, 1], store: true},
			{ name: "BloomBuffer", 			clear: [0.5, 0.0, 0.5, 1] },
			{ name: "GodrayBuffer", 		clear: [-1000, 0, 0, 0] },
			{ name: "ShapeBuffer", 			clear: [-1000, 0, 0, 0] }
		],
		depthStencil: { name: "ZBuffer", load: true },
		subpass: {
			name: "Gather",
			dependencies: [],
			attachments: [2],
			depth: false,
			fullscreenEffect: {
				name: 				"GatherPostEffect",
				shaderState: 		"GatherState",
				sizeFromTexture: 	"ColorBuffer"
			}
		},
		subpass: {
			name: "AverageLum",
			dependencies: [0],
			attachments: [1],
			depth: false,
			subpassAlgorithm: {
				name: 				"AverageLuminance",
				algorithm: 			"Tonemapping",
				function: 			"AverageLum"
			}
		},
		subpass: {
			name: "Unlit",
			dependencies: [],
			attachments: [6],
			depth: true,
			batch: "Unlit",
			batch: "ParticleUnlit",
			system: "Shapes"
		},
		subpass: {
			name: "FinishPass",
			dependencies: [1, 2],
			inputs: [0, 5, 6],
			attachments: [3],
			depth: false,
			fullscreenEffect: {
				name: 				"ToScreen",
				shaderState: 		"FinalizeState",
				sizeFromTexture: 	"ColorBuffer"
			},
			system: "Text"
		}
	},
	computeAlgorithm: {
		name: 		"CopyToNextFrame",
		algorithm: 	"Tonemapping",
		function: 	"Copy"
	},
	swapbuffers: {
		name: 		"SwapWindowBuffer",
		texture: 	"__WINDOW__"
	},
	copy: {
		name: 		"CopyToWindow",
		from: 		"ScreenBuffer",
		to: 		"__WINDOW__"
	}
}

Some of the design choices are:

  • GlobalState
  • Assigns global variables, like the deferred textures used by this frame shader. Other frame shaders can execute and apply their values.

  • RenderTextures contains list of all declared color renderable textures
  • We want to declare all textures in a neat manner, so a single row per texture is nice.

  • DepthStencils contains list of all declared depth stencil targets
  • We might want to use more than one depth stencil sometimes.

  • ReadWriteTextures contains list of all declared textures which supports read-write operations
  • Used for image load-stores.

  • ReadWriteBuffers contains list of all declared buffers which supports read-write operations
  • Used for compute shaders to load-store data. Size is in bytes, but if the relative flag is used, size denotes the byte size per pixel.
    A size of 1, 1 with the relative flag on a 1024×768 pixel screen will allocate 1024×768 bytes; 0.5, 0.5 gives 512×384.

  • Algorithms contain all algorithms used by this frame shader
  • We want to declare algorithms beforehand, so that we can select which pass to use within it dependent on where we are.

  • Pass assigns a list of render targets which may be applied during the pass
  • Only draws are allowed within a pass, because a pass can’t guarantee order of execution of subpasses. A pass defines a list of allowed attachments, and which depth-stencil to use. A pass doesn’t really do anything, the work is done in subpasses.

  • Subpass actually binds render targets
  • Subpasses work on the concepts of OpenGL4 and Vulkan. Binding a framebuffer is done in the pass, the subpass then selects which attachments should be used and in which order.
    Subpasses have dependencies if other subpasses needs to be completed before this subpass can run. Subpasses list the attachments used by the pass by index. Subpasses may also contain the most important part, which is drawing.

  • Drawing!
  • We have four types of draw methods.

    • Batch performs a batch render by shader – surface – mesh to avoid unnecessary switches
    • OrderedBatch performs an ordered batch render, on all materials, but renders in Z-order (or perhaps some other scheme) instead of per shader, so it’s potentially detrimental to performance, since it may switch shaders many times
    • System runs a render system built-in, like Lights, LightProbes, UI, Text and Shapes
    • FullscreenEffect renders what we before called a post effect, but it doesn’t really have to be ‘post’ per-se. Fullscreen effects require a shader, potentially (in the majority of cases) a list of variable updates. Also needs a texture to extract the size of the fullscreen quad
  • Algorithm execution
  • We want to execute algorithms, but since the new system has full control over what is being bound and when, the algorithm is not allowed to begin or end passes. So what do we need from algorithms? Algorithms need to update shader variables, apply meshes and render, however they might also need to render to more than one texture at a time. An algorithm needs to be able to run a certain step.

    • SubpassAlgorithm can only be executed within a pass, and is done for rendering geometry. Subpass algorithms may take values as input, however it is better to use the global state if possible
    • ComputeAlgorithm must be executed outside a pass, and may not render stuff, but only dispatch computations. Compute algorithms must be provided with at least one ReadWriteImage or ReadWriteBuffer, otherwise the compute algorithm is pointless. Compute algorithms can be asynchronous if hardware supports it. If it doesn’t, it just runs inline with graphics.
  • Compute
  • Like before, we can select a compute shader and run X, Y, Z number of work groups, using Width, Height, Depth sizes for each work group. However, now we can allow computes to be executed asynchronously if the hardware supports it. If it doesn’t, they just run inline with graphics. If a compute uses the relative flag, then size denotes the workgroup size rounded down to fit the resolution.

  • Barrier
  • Executes a barrier between operations. A barrier describes just that, a barrier between adjacent operations. If two interdependent operations are not directly after each other, then we can use events to wait for some later operation to be done.

  • Event
  • We can declare and use events (consistently) during a frame. Events can be set, for example, after a compute is done, and waited for somewhere else, with work being done in between. While barriers are better to use between immediate operations where the second depends on the first, events suit computations kicked off at the beginning of the frame and waited for just before their results are needed.

  • Copy and Blit
  • We also want to perform an image copy, or a slightly more expensive blit operation (which allows for conversions/decompression)
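
The relative flag on computes, mentioned above, can be sketched as follows. This is just my reading of that one sentence (the struct and function names are hypothetical, not Nebula API): the given sizes are the workgroup dimensions, and the dispatch counts come from dividing the resolution by them, rounded down.

```cpp
// Sketch: with the 'relative' flag, (sizeX, sizeY, sizeZ) denote workgroup
// dimensions, and the number of groups is the resolution divided by the
// group size. Integer division rounds down, so trailing pixels that do not
// fill a whole workgroup are not dispatched.
struct Dispatch { int groupsX, groupsY, groupsZ; };

Dispatch
ComputeRelativeDispatch(int width, int height, int depth,
                        int sizeX, int sizeY, int sizeZ)
{
    return Dispatch{ width / sizeX, height / sizeY, depth / sizeZ };
}
```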

Backwards compatibility

Since Vulkan defines a very explicit API, it shouldn’t be hard to translate from Vulkan ‘down’ to for example OpenGL or DirectX. Events in OpenGL can be used with glFenceSync and memory barriers with glMemoryBarrier.

In DirectX 11 and down, no such mechanisms exist, so barriers and fences are purely semantic in the frame script, denoting where a barrier or sync would exist given a more explicit API, meaning they have no actual function. In OpenGL, a pass is a call to glBindFramebuffer, and a subpass is glDrawBuffers. In DirectX, a pass is just a list of renderable textures, and a subpass selects a subset of these render targets and binds them prior to drawing. In Metal, a subpass is the info used in MTLRenderPipelineDescriptor to create a render pipeline, and a render pass is just a list of textures to pick from.

Compute algorithms and plain computes are simply not available if the API doesn’t support them. Perhaps the RenderDevice should expose whether or not the underlying device can actually perform computes, but at the same time, it is impossible to load compute shaders unless the device supports them. And since Nebula loads all exported shaders by default, this might be a problem.

Final thoughts

In some cases we might actually want to modify a frame script. All ‘runnable’ elements in a frame script are of the class FrameOp, meaning we can technically insert FrameOps into the script at runtime. For example, if we want VR, then perhaps we want the last sub-pass to not present to screen, but instead to a texture, and we might want to switch which texture to present to, like for left eye, right eye, without necessarily having two of every screen buffer.

We might also want to be able to assemble a frame script in code, for example when implementing different shadow mapping methods. We could, for example, merge frame scripts together by just adding ops. The old system used a class called FramePassBase, which would bind a render target and run batches, which is very much like a render pass. However a FramePassBase binds a render target object as-is, and will assume all attachments will be used in the shader. With the new method, we can, for example, bind the CSM shadow buffer and the spot light shadow buffer atlas in one render pass, then all cube maps for point lights as 6 * number of point light shadows as layers. FrameOps give us fine-grained control of how a script is executed, but they also allow us to break validation: since both a FramePass and a FrameSubpass are FrameOps, we can technically place a FrameSubpass outside a FramePass if we assemble one in code. The frame script loader enforces correct formatting.

We also want to slightly rework the render plugin system so that we can determine when a plugin should render, seeing as it is important to be able to decide which texture the result will end up in. An idea is to be able to execute a subset of plugins in the script, and then have the plugins register to the plugin registry with a certain group.

The exact details of this design will probably change during development, however the basic concepts are here. One of the major concerns is that the new system should prevent the user from doing stupid things, and also be able to fully utilize Vulkan and, by extension, the other renderers. The script can, for example, find required dependencies by just looking at which subpass or algorithm is using a resource, and inform the programmer, just like a validation process. It could also be extended with a graphical design interface, though it is doubtful that sensitive features like this should be exposed to someone who is not familiar with GPU programming.

In the current state, the frame script system will implement objects of different types, which have a certain behavior. However, since we know the code beforehand, it should be possible to just unravel the frame script and produce just the code, meaning the engine will generate the rendering process during compilation, much like the NIDL system, thus making the rendering pipeline both debuggable and more efficient, without all the indirection.

Vulkan – Beyond the pipeline cache

Don’t you go thinking I have been idle now just because I haven’t written anything down. As a matter of fact, I implemented a whole new render script system, which allows full utilization of Vulkan features such as subpasses and explicit synchronizations such as barriers and events.

The current state of Nebula implements most major parts, including lighting, shadowing, shape rendering, GUI, text rendering and particles. What’s left to implement and validate is the compute parts. However working with Vulkan is not as simple as many think. There are tons of problems, driver-related and otherwise, which is why I decided to implement my own pipeline cache system.

Basically, the Vulkan pipeline cache can just return a VkPipeline object when we use the same objects to create a pipeline twice. That is cute and cool, but internally the system has to serialize at least 14 integers (12 pointers, 2 integers for the subpass index and the number of shader states). This is handled by the driver, so relying on it being intelligent or even efficient has proven to be a leap of faith. So I figured, how many different ‘objects’ do we use to create a pipeline in Nebula? Turns out, we just use four: pass info, shader, vertex layout, vertex input.

So the idea came to mind to just incrementally build a DAG of the currently applied states, and if the selected DAG path, when calling GetOrCreatePipeline(), has a pipeline created, just return it instead of creating it. The newest AMD driver, 16.9.1, fails to serialize pipelines, so calling vkCreateGraphicsPipelines always creates and links a new one, which brought my runtime performance down from 140 FPS to 12. Terrible, but it gave me the motivation to avoid calling a vkCreateX function every time I needed something new.

Enter the Nebula Pipeline Database. Sounds cool, but it is a simple tree structure which layers different pipeline states into tiers, building a dependency chain which in the end creates a VkPipeline. The class works by applying shading states in tiers. The tiers are: Pass, Shader, Vertex layout, Primitive Input. If one applies a pass, then all the lower states get invalidated. If a vertex layout is applied, then it will be ‘applied’ under the current pass. We construct a tree like so:

Pass:            Pass 1 | Pass 2 | Pass 3
Shader:          Shader 1, Shader 2 | Shader 3, Shader 4 | Shader 5, Shader 6
Vertex layout:   Vertex layout 1 … Vertex layout 12 (two per shader)
Primitive input: Primitive input 1 … Primitive input 16, plus null entries for paths with no pipeline created yet

When setting a state, we try to find an already created node for that tier. If no node is found, we create it using the currently applied state. This allows us to rather quickly find the subtree and retrieve an already created pipeline. You might think this is very cumbersome just to combine pipeline features, but it boosted the base frame rate by several percent, because this way, using only 5 identifying objects, is much faster than the driver implementation, and for obvious reasons. The driver could never assume we have the same code layout as we do in Nebula, so it has to assume every part of the pipeline is dynamic.
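
As a sketch of this tier structure (all names here are illustrative, not Nebula’s actual API, and a plain int stands in for VkPipeline so the sketch runs without a Vulkan device):

```cpp
#include <cstdint>
#include <map>

// Four tiers of nested lookup: pass -> shader -> vertex layout -> primitive
// input. The leaf caches the created 'pipeline'.
using Tier4 = std::map<uint32_t, int>;     // primitive input -> pipeline
using Tier3 = std::map<uint32_t, Tier4>;   // vertex layout
using Tier2 = std::map<uint32_t, Tier3>;   // shader
using Tier1 = std::map<uint32_t, Tier2>;   // pass

class PipelineDatabase
{
public:
    // applying a higher tier implicitly selects the subtree for lower tiers
    void SetPass(uint32_t p)   { pass = p; }
    void SetShader(uint32_t s) { shader = s; }
    void SetLayout(uint32_t l) { layout = l; }
    void SetInput(uint32_t i)  { input = i; }

    // Walks the tree, creating nodes for states seen for the first time.
    // Only 'creates' a pipeline (here: bumps a counter) when this exact
    // 4-state path is new; otherwise the cached one is returned.
    int GetOrCreatePipeline()
    {
        Tier4& leaf = tree[pass][shader][layout];
        auto it = leaf.find(input);
        if (it == leaf.end())
            it = leaf.emplace(input, numCreated++).first;
        return it->second;
    }

    int numCreated = 0;
private:
    Tier1 tree;
    uint32_t pass = 0, shader = 0, layout = 0, input = 0;
};
```

With only four identifying objects, the lookup is a handful of map searches instead of whatever the driver hashes internally.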

Also, the render device doesn’t request a new pipeline from the database object unless the state has actually changed, so we can effectively avoid tons of tree traversals, searches and VkPipelineCache requests just by assuming the state doesn’t need to change.

So what’s left to do?

Platform and vendor compatibility stuff. At the current stage, the code doesn’t consider violations against hardware limits, such as the number of uniform buffers per shader stage, or per descriptor set. This is an apparent problem on NVIDIA cards, where the number of concurrently bound uniform buffers is limited to 12. Also, testing and figuring out how events and barriers work or what they are actually needed for, since render passes implement barriers themselves, and compute shaders running on the same queue seem to be internally synchronized.

Vulkan – Persistent descriptor sets

Vulkan allows us to bind shader resources like textures, images, storage buffers, uniform buffers and texel buffers in an incremental manner. For example, we can bind all view matrices in a single descriptor set (actually, just a single uniform buffer) and have it persist between several pipeline switches. However, it’s not super clear how descriptor sets are deemed compatible between pipelines.

NOTE: When mentioning shader later, I mean AnyFX style shaders, meaning a single shader can contain several vertex/pixel/hull/domain/geometry/compute shader modules.

I could never get the descriptor sets to work perfectly, which is to bind the frame-persistent descriptors first each frame, and then not bind them again for the entire frame (or view). Currently, I bind my ‘shared’ descriptor sets after I start a render pass or bind a compute shader.

When binding a descriptor set, all descriptor sets currently bound with a set number lower than the one you are binding now have to be compatible. So if we have set 0 bound, and bind set 3, then for set 0 to stay bound, it has to be compatible with the pipeline. If we switch pipelines, then the descriptor sets compatible between pipelines will be retained, if they follow the previous rule. That is, if Pipeline A has sets 0, 1, 2, 3 and Pipeline B is bound, and sets 0 and 1 are compatible, then 2 and 3 will be unbound and will need to be bound again.
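
This rule amounts to a longest-common-prefix test over the pipelines’ set layouts. A hand-rolled illustration (plain ints stand in for VkDescriptorSetLayout handles; the function name is mine):

```cpp
#include <vector>
#include <cstddef>

// Given the set layouts of the previously bound pipeline (a) and the newly
// bound pipeline (b), descriptor sets stay valid only up to the first set
// number whose layout differs.
size_t
SetsRetained(const std::vector<int>& a, const std::vector<int>& b)
{
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) n++;
    return n;  // sets [0, n) remain bound; sets >= n must be rebound
}
```

In the example from the text, pipelines sharing layouts for sets 0 and 1 but not 2 and 3 would retain exactly two sets.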

Where do we find the biggest churn of shader variables? Well, clearly in each individual shader. For example, let’s pick the shader billboard.fx, which has a vec4 Color and a sampler2D AlbedoMap. In AnyFX, the Color variable would be a uniform tucked away in a uniform buffer, and the AlbedoMap would be its own resource. In the Vulkan implementation, they would also be assigned a set number, and to avoid invalidating lower descriptor sets, this ‘default set’ would have to be high enough that no other set goes above it. However, since we can’t really know the shader developer’s intention for how sets are used, the compiler can be supplied a flag, /DEFAULTSET, which determines where all default sets will go. This means that the engine and the shader developer can decide where the most-likely-to-be-incompatible descriptor set should go.

I also got texture arrays and indexing to work properly, so now all textures are submitted as a huge array of descriptors, and whenever an object is rendered all that is updated is the index into the array, which is supplied in a uniform buffer. This way, we can keep the number of descriptor sets down to a minimum of 1 per set number per shader resource. Allocating a new resource using a certain shader will expand the uniform buffer to accommodate the object-specific data.

First off is the naïve way:

Memory Memory Memory Memory Memory Memory Memory Memory
Buffer Buffer Buffer Buffer Buffer Buffer Buffer Buffer
Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8

Which was where I was a couple of days ago, and this forced me to use one descriptor per shader state, since each shader state has their own buffer. The slightly less bad way of doing this is:

Buffer Buffer Buffer Buffer Buffer Buffer Buffer Buffer
Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8

Which reduces memory allocations but also doesn’t help with keeping the descriptor set count low.

Object 1 Free Free Free Free Free Free Free

Allocating a new object just returns a free slot.

Object 1 Object 2 Free Free Free Free Free Free

If the memory backing is full, we expand the buffer size and allocate new memory.

Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8
Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8 Object 9 Free Free Free Free Free Free Free

As you can see, the buffer stays the same, meaning we can keep it bound in the descriptor set, and just change its memory backing. The only thing the shader state needs to do now is to submit the exact same descriptor state as all sibling states, but provide its own offset into the buffer.

However, since we need to create a new buffer in Vulkan to bind new memory, we actually have to update the descriptor set when we expand, but this will only be done when creating a shader state, which is done outside of the rendering loop anyways.

Textures are bound by the shader server: each time a texture is created, it registers with the shader server, which performs a descriptor set write. The texture descriptor set must be set index 0, so that it can be shared by all shaders.

Consider this shader:

group(1) varblock MaterialVariables
group(1) sampler2D MaterialSampler;

group(2) r32f image2D ReadImage;
group(2) image2D WriteImage;

group(3) varblock KernelVariables

Resulting in this layout on the engine side.

Descriptor set 1: Uniform buffer, Sampler
Descriptor set 2: Image, Image
Descriptor set 3: Uniform buffer

Creating a ‘state’ of this shader would only perform an expansion of the uniform buffers in sets 1 and 3, but the sampler and two images will be directly bound to the descriptor set of the shader, meaning that any per-object texture switches would cause all objects to switch textures. We don’t want that, obviously, but we’re almost there. We can still create a state of this shader and not bind our own uniform buffers, by simply expanding the uniform buffers in sets 1 and 3 to accommodate for the per-object variables. To do this for textures, we need to apply the texture array method mentioned before.

group(0) sampler2D AllMyTextures[2048];
group(1) varblock MaterialVariables
   uint MaterialTextureId;

group(2) r32f image2D ReadImage;
group(2) image2D WriteImage;

group(3) varblock KernelVariables

Which results in the following layout:

Descriptor set 0: Sampler array
Descriptor set 1: Uniform buffer
Descriptor set 2: Image, Image
Descriptor set 3: Uniform buffer

Now, texture selection is just a matter of setting uniform values, supplying a per-object value for the uniform buffer value MaterialTextureId. While this is trivial for samplers, it also leaves us asking for more. For example, how do we perform different sampling of textures when all samplers are bound in an array? Vulkan allows for a texture to be bound with an immutable sampler in the descriptor set, so that’s one option, although we supply all our sampler information in AnyFX in the shader code by doing something like:

samplerstate MaterialSamplerState
   Samplers = { MaterialSampler };
   Filter = Anisotropic;

But we can’t anymore, because we don’t have MaterialSampler, and applying this sampler state to all textures in the entire engine might not be correct either. Luckily for us, the KHR_vulkan_glsl extension supplies us with the ability to decouple textures from samplers, and create the sampler in shader code. So I enabled AnyFX to create such a separate sampler object, although to do so one must omit the list of samplers. So the above code would be:

group(1) samplerstate MaterialSamplerState
   Filter = Anisotropic;

Which results in a separate sampler object, and the texture array declaration becomes:

group(0) texture2D AllMyTextures[2048];

And finally, sampling the texture is

vec4 Color = texture(sampler2D(AllMyTextures[MaterialTextureId], MaterialSamplerState), UV);

Instead of

vec4 Color = texture(AllMyTextures[MaterialTextureId], UV);

Which will allow us to, in the shader code, explicitly select which sampler state to use, even if we have all our textures submitted once per frame. I could also implement a list of image-samplers combined really easily, and allow for example a graphics artist to supply the texture with sampler information, and just have that updated directly into the descriptor set, but still be able to fetch the proper sampler from the array.

For the sake of completeness, here’s the final shader layout:

Descriptor set 0: Texture array
Descriptor set 1: Sampler state, Uniform buffer
Descriptor set 2: Image, Image
Descriptor set 3: Uniform buffer

So this proves we can utilize uniform buffers to select textures too, covering all our grounds in one tied up bow. Neat. Except for images, and here’s why.

Images are not switched around and messed around with like textures are, and for good reason. An image is used when a shader needs to perform a read-write to texels in the same resource, meaning that images are mostly used for random access and random writes, for post effects and the like, and are thus not as prone to changes as for example individual objects. Instead, images are mostly consistent, and can be bound during rendering engine setup. We could implement image arrays like we do texture arrays, however we must consider the HUGE amount of format combinations required to fit all cases.

Images can, like textures, be 2D, 2D multisample, 3D, Cube, just to mention the common types. We obviously have special cases like 2DArray, CubeArray and so forth, but array textures are not even used or supported in Nebula; never saw the need for them. However, images also need a format qualifier if the image is to be supported with imageLoad, meaning we basically need a uniform array of all 4 ordinary types, with all permutations of formats. While possible, I deemed it a big no-no, and instead determined that since images are special use resources for single-fire read-writes, a shader has to update the descriptor set each time it wants to change an image, meaning it’s more efficient to, in the same shader, reuse the same variable and just not perform a new binding. All in all, this shouldn’t become a problem.

What’s left to do is to enforce certain descriptor set layouts in the shader loader, so that no shader creator accidentally uses a reserved descriptor set (like 0 for textures, 1 for camera, 2 for lighting, 3 for instancing). If the shader does, it will manipulate a reserved descriptor set, which will cause it to become incompatible, and we can’t have that since it will simply cause manually applied descriptors to stop being bound, resulting in unpredictable behavior. Another way of solving this issue is by changing the group-syntax in AnyFX to something more stable and easier to validate, like making it into a structure-like syntax, for example:

group 0
   sampler2D Texture;
   varblock Block
     vec4 Vector;
     uint Index;

And then assert that no group is later declared with the same index. To handle stray variables declared outside of a group, the compiler simply generates the default group, and puts all strays in there.

The only issue I have with the above syntax is the annoying level of indirection before you actually get to the meat of the shader code. I think implementing an engine-side check is the way to go for now, but implementing groups as a structure like above could be a valid idea, since we might want to have the same behavior in all rendering APIs. Consider this for OpenGL too, in which we can guarantee that applying a group of uniforms and textures will remain persistent if all shaders share the same declaration. Although, in OpenGL, since we don’t have descriptor sets, we must simply ensure that the location values for individual groups remain consistent.
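
The engine-side check could be as simple as the sketch below (a hypothetical helper, not existing Nebula code), rejecting any shader that declares a group inside the reserved range listed earlier:

```cpp
#include <vector>

// Groups 0-3 are reserved by the engine per the text: 0 textures, 1 camera,
// 2 lighting, 3 instancing. Shader-declared groups must start above them.
bool
GroupsAreValid(const std::vector<int>& declaredGroups, int firstFreeGroup = 4)
{
    for (int g : declaredGroups)
        if (g < firstFreeGroup) return false;  // touches a reserved set
    return true;
}
```

The shader loader would run this over the groups parsed from the AnyFX reflection data and refuse to load offending shaders.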

Vulkan – Shading ideas

So this is where the Vulkan renderer is right now.

What you see might be unimpressive, but when getting to this stage there isn’t too much left. As you can see, I can load textures, which are compressed (hopefully you can’t see that), render several objects with different shaders and uniform values, like positions, and textures.

This might seem to be near completion, just a couple of post effects which might need to be redone (mipmap reduction compute shader for example), but you would be wrong.

In this example, the Vulkan renderer created a single descriptor set per object, which I thought was fine, and I basically assumed that is what descriptor sets were for. I believed that descriptor sets would be like using variable setups and just applying them as a package, instead of individually selecting textures and uniform buffers. However, on my GPU, which is an AMD Fury Nano, sporting a massive 4 GB of GPU memory (it doesn’t run Chrome, so it’s massive), I ran out of memory when reaching a meager 2000 objects. Out of GPU memory; I had never actually experienced that before.

So I decided to check how much memory I actually allocated, and while Vulkan supplies you with a nice set of callback functions to look this up, it doesn’t really do much for descriptor pools, and I had already boiled the memory exhaustion down to happening when I create too many objects, so it cannot be a texture issue. Anyhow, in order to have per-object unique variables, each object allocates its own uniform buffer backing for the ‘global’ uniform buffer. Buffer memory never exceeds ~260 MB, so the problem is not there.

So the only conclusion I can draw is that the AMD driver allocates TONS of memory for the descriptor sets. So I did a bit of studying, and I decided to go with this solution for handling descriptor sets: Vulkan Fast Paths.

The TL;DR of the PDF is to put all textures into huge arrays, so I did:

#define MAX_2D_TEXTURES 4096
#define MAX_2D_MS_TEXTURES 64
#define MAX_CUBE_TEXTURES 128	// missing in the original snippet; value assumed
#define MAX_3D_TEXTURES 128

group(TEXTURE_GROUP) texture2D 		Textures2D[MAX_2D_TEXTURES];
group(TEXTURE_GROUP) texture2DMS 	Textures2DMS[MAX_2D_MS_TEXTURES];
group(TEXTURE_GROUP) textureCube 	TexturesCube[MAX_CUBE_TEXTURES];
group(TEXTURE_GROUP) texture3D 		Textures3D[MAX_3D_TEXTURES];

And textures are fetched through:

group(TEXTURE_GROUP) shared varblock RenderTargetIndices
	// base render targets
	uint DepthBufferIdx;
	uint NormalBufferIdx;
	uint AlbedoBufferIdx;	
	uint SpecularBufferIdx;
	uint LightBufferIdx;
	// shadow buffers
	uint CSMShadowMapIdx;
	uint SpotLightShadowMapIdx;

Well, render targets are. On the ordinary shader level, textures would be fetched by an index which is unique per object. I also took the liberty of implementing samplers which are like uniforms, bound in the shader, and can be assembled in GLSL as defined in GL_KHR_vulkan_glsl, section Combining separate samplers and textures. This allows us to assemble samplers and textures in the shader code, which is good if we have a texture array like above, where we can’t really assign a sampler per texture in the shader, because we have absolutely no clue when writing the shaders which texture goes where. It’s much more flexible to assign a sampler state once we know what kind of texture we want. Let me give you an example.

The old way would be:

samplerstate GeometryTextureSampler
	Samplers = { SpecularMap, EmissiveMap, NormalMap, AlbedoMap, DisplacementMap, RoughnessMap, CavityMap };
	Filter = MinMagMipLinear;
	AddressU = Wrap;
	AddressV = Wrap;
vec4 diffColor = texture(AlbedoMap, UV) * MatAlbedoIntensity;
float roughness = texture(RoughnessMap, UV).r * MatRoughnessIntensity;
vec4 specColor = texture(SpecularMap, UV) * MatSpecularIntensity;
float cavity = texture(CavityMap, UV).r;

The new way is:

samplerstate GeometryTextureSampler 
	Filter = MinMagMipLinear;
	AddressU = Wrap;
	AddressV = Wrap;
vec4 diffColor = texture(sampler2D(AlbedoMap, GeometryTextureSampler), UV) * MatAlbedoIntensity;
float roughness = texture(sampler2D(RoughnessMap, GeometryTextureSampler), UV).r * MatRoughnessIntensity;
vec4 specColor = texture(sampler2D(SpecularMap, GeometryTextureSampler), UV) * MatSpecularIntensity;
float cavity = texture(sampler2D(CavityMap, GeometryTextureSampler), UV).r;

While the new way is only possible in GLSL through the KHR_vulkan extension, this has been the default way in DirectX since version 10. This syntax also allows for a direct mapping of texture sampling between GLSL<->HLSL if we want to use HLSL above shader model 3.0.

This method basically allows for all textures to be bound to a single descriptor set, and this descriptor set can then be applied to bind ALL textures at the same time. So when this texture library is submitted, we basically have access to all textures directly in the shader. Neat huh? It’s like bindless textures, and that is exactly what AMD mentions in the talk.

Then we come to uniform buffers. I read the Vulkan Memory Management article and all of a sudden it became completely clear to me. If we want to keep the number of descriptor sets down, we can’t have an individual buffer per object, because that requires either a descriptor set per object with the individual buffer bound to it, or it requires us to sync the rendering commands and update the descriptor set being used.

So the solution is to use the same uniform buffer and expand its size per object. If you follow the NVIDIA article, naïvely resizing on every allocation is clearly not a good way to go. Instead, the uniform buffers implement a clever array allocation method, where we grow the total size by a set number of instances, and keep a list of used and free indices (which can be converted to offsets) into the buffer. Allocating when there are no free indices grows the buffer by the maximum of a set amount (8) or the number of instances requested. Allocating when there are free indices returns the offset calculated using said free index, and trying to allocate a range of values first attempts to fit the range within the list of free indices, or allocates a new chunk if no such range could be found.

So basically, the Vulkan uniform buffer implementation uses a pool allocator to grow its size (it doesn’t shrink it though, which we actually might want to do). But because we are using GPU memory, we might want to avoid doubling the memory, however that is a problem for later. Each allocation returns the offset into the buffer, so that we can bind the descriptor with per-object offsets later, which means we retain the exact same descriptor set but only modify the offsets.
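
The index/offset bookkeeping described above can be sketched like this (a minimal illustration with names of my own choosing; the real implementation would also reallocate the VkBuffer memory when it grows):

```cpp
#include <vector>
#include <cstddef>

// A growing pool of fixed-size instances inside one uniform buffer. Indices
// map to byte offsets as index * stride; freeing pushes the index back onto
// the free list so later allocations can reuse it.
class UniformChunkAllocator
{
public:
    explicit UniformChunkAllocator(size_t stride) : stride(stride) {}

    size_t Alloc()
    {
        if (freeIndices.empty()) Grow(8);  // grow by at least 8 instances
        size_t idx = freeIndices.back();
        freeIndices.pop_back();
        return idx * stride;               // per-object offset into the buffer
    }

    void Free(size_t offset) { freeIndices.push_back(offset / stride); }

    size_t Capacity() const { return capacity; }  // instances the buffer holds

private:
    void Grow(size_t instances)
    {
        // real code: create a bigger buffer, copy, update the descriptor set
        for (size_t i = 0; i < instances; i++) freeIndices.push_back(capacity + i);
        capacity += instances;
    }

    size_t stride, capacity = 0;
    std::vector<size_t> freeIndices;
};
```

Binding then always uses the same buffer and descriptor set, with the returned offset supplied as a dynamic offset per object.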

So to sum up:

  • Texture arrays with all textures bound at the same time, submitting the entire texture library (or libraries, for all 2D, 2DMS, Cube and 3D textures).
  • Uniform buffers are created per shader (resource-level) and each instance allocates a chunk of memory in this buffer.
  • Offsets into the same buffer are used per object so we can have the same descriptor set but jump around in it, giving us per-object variables.
  • Textures are sent as indexes, and can thus be on a per-object basis too.

The only real issue with this method is read-write textures, also known as images in GLSL. Since image variables have to be declared with a format qualifier denoting how to read from the image, we can’t really bind them as above. However images are not really at the same level of update frequency as textures; instead they are bound and switched on a per-shader basis, like with post effects, and are either statically assigned or can be predicted. For example, doing a blur horizontal + vertical pass requires the same image to be bound between both passes, however if we want to perform a format change, like in the HBAO shader, where we transfer from ao, p -> ao, we can just bind the same image to two different slots, and thus avoid descriptor updates.

Oh, I should also mention that all of this might soon be possible in OpenGL too, with the GL SPIRV extension, which should give us the ability in OpenGL to use samplers as separate objects. Texture arrays already exist, and so do uniform buffers.

Vulkan – Designing a back-end.

With the Khronos validation layer becoming more and more (although perhaps not entirely) complete, the Vulkan renderer implementation is also coming along nicely. At the moment, I cannot produce anything but a black window, hopefully mainly because the handling of descriptor sets is not completely done yet.

However, the design choices and the way to handle Vulkan are still noteworthy to bring up, so this post is going to be comprehensive with illustrations showing the thought process.

Command buffers

As you may or may not know, most of the operations done on the GPU are done in a command queue. This is apparent in OpenGL if you take a look at the functions glFlush and glFinish. In Vulkan however, command buffers are for you to allocate, destroy, reset, propagate, queue up and, most importantly, populate with commands. Noteworthy is also that Vulkan operates by submitting Command buffers to Queues, and your GPU might support more than one Queue. A Queue can be thought of as a road, although some Queues only accept busses, some bikes, and others are for pedestrians. In Vulkan, there are three different types of Queues.

  • Transfer – Allows for fast memory transfers between GPU and CPU, as well as between resources locally on the GPU.
  • Graphics – Allows for render commands such as Draw, BeginRenderPass, etc.
  • Compute – Allows for dispatch calls.

The intuitive way would be to try to replicate the GL4 behavior, by simply creating a single MAIN command buffer into which you put all your commands, and then execute it at the end of the frame. While this is a fine enough solution, it will cause some issues which we will get into later. But mainly, the whole idea of using command buffers is to be able to, in some cases, precompute command buffers for reuse (binding render target and render target-specific shader variables, viewports, etc) or for rapid population when drawing, and using threads to do so. There are currently 3 main buffers in Nebula, one for each queue.

Think of command buffers as sync points. Begin a command buffer, and all commands done afterwards are written to this buffer. When the buffer is ended, the buffer can then be used to run the commands on the GPU. Sounds simple. It’s not.


It’s not, because most rendering engines today are not designed around this principle, and if they were, it would be equally hard to make them work for older implementations, perhaps even breaking huge parts of the user space code. The most obvious way to begin and end a command buffer is, just like it was with DX9, within BeginScene and EndScene. My bet is that most engines are designed around this principle. So let’s say I want to load a texture, or perhaps I want to just create a vertex buffer and update it with data. Well, if we don’t do it within our frame, then the command buffer won’t be in the Begin state, and the command will thus fail.


There are two ways to solve this.

Solution 1 – the lazy option

Create a command buffer when needed, create your resource, then add to the command buffer that you want to update it with data. Last, you submit the command buffer and update the image. This solution works fine, but it might be slow if you are loading many textures on the fly, because some frames might get tons of vkQueueSubmits while the rest will be idle. This could be fixed if you know when you begin creating/updating resources and when you end, which leads me into the next part.

Solution 2 – delegates

The other solution is to postpone the command until BeginFrame. This can be done using a simple delegate system, which allows you to save the command into a struct, add it to a vector, and then run it whenever you want. For Nebula I implemented several ‘buckets’ of these delegates, so that we can run them on Begin/EndFrame, Begin/EndPass, etc. This easily allows us to accumulate tons of resource updates and run them in a single queue submit, instead of making lots of smaller ones. This kind of deferred command is also extremely useful for memory releasing, which I will talk about later.
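The bucket idea can be sketched in a few lines. This is a minimal illustration, not the engine's actual code; the names (DelegateBucket, Defer, Flush) are hypothetical:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical bucket ids; the real engine has more (Begin/EndPass, etc.)
enum class DelegateBucket { OnBeginFrame, OnEndFrame, NumBuckets };

class DelegateQueue
{
public:
    // Record a command now, run it later at the chosen sync point
    void Defer(DelegateBucket bucket, std::function<void()> cmd)
    {
        this->buckets[static_cast<size_t>(bucket)].push_back(std::move(cmd));
    }

    // Run and clear every command deferred to the given bucket, so all
    // pending resource updates can go into a single queue submit
    void Flush(DelegateBucket bucket)
    {
        auto& cmds = this->buckets[static_cast<size_t>(bucket)];
        for (auto& cmd : cmds) cmd();
        cmds.clear();
    }

private:
    std::vector<std::function<void()>> buckets[static_cast<size_t>(DelegateBucket::NumBuckets)];
};
```

The commands sit untouched until Flush is called at the matching point in the frame, which is exactly what lets many small updates collapse into one submit.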


One of the main points of command buffers is that they allow us to queue up commands from threads, and isn’t that wonderful? It turns out it’s not quite as simple as that, for several reasons. The first is that we must use a uniquely created command buffer per thread, and also a command buffer pool per thread. The reason is that if we share pools, one thread may allocate a buffer which another already has, causing the same buffer to appear in multiple threads, and thus there is no guarantee for the order of the commands. Using multiple buffers allows us to ensure that a specific sequence of commands is executed in order; however, we want to avoid doing a submit for each of these command buffers when we are done. This is why there are secondary command buffers, which allow us to create and record commands which are then patched into a primary command buffer (read: a MAIN command buffer), after which everything is executed with a single submit.

In Nebula the VkRenderDevice class has a constant integer describing how many graphics, compute and transfer threads should be created, along with completion events so that we may wait for the threads to finish. These will then act as a pool, each receiving commands using a scheme implemented by the RenderDevice, which is described later.

Commands are sent to threads using something very similar to the delegate system: structs are put in a thread-safe queue, and the command buffer building threads populate their command buffers from it. The problem isn’t really that; the issue is how to distribute the command buffer buildup from the rendering pipeline across the threads. One way would obviously be to switch threads for each draw command, so given four threads we might get the following.

draw1 draw2 draw3 draw4 draw5 draw6 draw7 draw8
  • Thread 1: draw1 draw5
  • Thread 2: draw2 draw6
  • Thread 3: draw3 draw7
  • Thread 4: draw4 draw8

So if we then collect these draws together, we get

draw1, draw5, draw2, draw6, draw3, draw7, draw4, draw8

Clearly, our draws are out of order, which is fine if we are using depth-testing. The real problem comes with binding the rendering state (or pipelines, as they are called). Consider the following linear command list.

pipeline1 draw1 draw2 pipeline2 draw3 draw4 pipeline3 draw5 draw6 pipeline4 draw7 draw8

Then our threads will get

  • Thread 1: pipeline1 draw1 draw5
  • Thread 2: pipeline2 draw2 draw6
  • Thread 3: pipeline3 draw3 draw7
  • Thread 4: pipeline4 draw4 draw8

Basically, draws 5-8 will now not get the pipeline associated with them, resulting in an incorrect result. To be honest, there is no perfect way of handling this, but there is a way to split draws into threads by going by the lowest common denominator, which is the shader pipeline. In Nebula right now, the thread used for draw command population is cycled whenever we change pipeline, which results in the above command list looking like this on the threads.

  • Thread 1: pipeline1 draw1 draw2
  • Thread 2: pipeline2 draw3 draw4
  • Thread 3: pipeline3 draw5 draw6
  • Thread 4: pipeline4 draw7 draw8

However, when not swapping pipelines, we will obviously get all draws on the same thread, which might not be very efficient. I haven’t come up with a solution for this yet, but one obvious option would be to sync and rendezvous the command buffer threads every X calls, and start a new batch.
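The pipeline-cycling distribution described above boils down to a simple rule: advance the target thread on every pipeline bind, so a pipeline and all draws recorded under it land together. A minimal sketch (names hypothetical, commands reduced to strings):

```cpp
#include <string>
#include <vector>

// A command is either a pipeline bind or a draw
struct Cmd { bool isPipeline; std::string name; };

// Distribute a linear command list over numThreads recording threads,
// cycling to the next thread whenever a pipeline is bound
std::vector<std::vector<std::string>>
DistributeCommands(const std::vector<Cmd>& cmds, size_t numThreads)
{
    std::vector<std::vector<std::string>> threads(numThreads);
    size_t current = numThreads - 1; // first pipeline bind advances to thread 0
    for (const Cmd& cmd : cmds)
    {
        if (cmd.isPipeline) current = (current + 1) % numThreads;
        threads[current].push_back(cmd.name);
    }
    return threads;
}
```

Feeding the eight-draw, four-pipeline list from above through this function reproduces the per-thread grouping shown earlier; feeding it one pipeline and many draws reproduces the single-busy-thread problem.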

Now, this is only really relevant if we want to rapidly prepare our scene using draws, but how does it fare with transfers and computes? Well, with computes you have the same deal: bind shader, then compute. The only real difference between geometry rendering and computation is that we also bind vertex and index buffers when we need to produce geometry. Transfers, however, are a completely different thing. In theory, a single transfer operation could be done on one thread each, meaning that for every single new buffer update, we circulate the threads.


In the good old days with the older APIs, binding the rendering state or compute state was easy, you would just bind shaders, vertex layouts, render settings like blending and alpha testing, samplers and subroutines and they incrementally built up a rendering state. In the next-gen APIs, this work is left to the developer, and is referred to in Vulkan as a Pipeline. A pipeline describes EVERYTHING required by the GPU to perform a draw, so you might understand this structure is huge.

The only problem with pipelines is that they couple together so many different pieces of information – shader programs, vertex attribute layout, blending, rasterizing and scissor state options, depth state options, viewports, etc. Just look at this beast of a structure!

The reason for this setup was that many games used to create tons of shaders, bind them, but not actually use them. This caused tons of unwanted validation for a shader-vertex-render state setup which was never supposed to be used, and slowed down performance. This new method forces the developer to say they are done, and that this is a complete state to render with.

Now you might already predict that figuring out all possible combinations of your setups and precomputing all these pipelines is the best way to solve it, and you would be correct. However, how do you look them up afterwards? Well, the good folks at Khronos thought of this and implemented a pipeline cache, which only creates a new pipeline if none exists; if one already exists with the same members, then vkCreateGraphicsPipelines or vkCreateComputePipelines (the latter being far easier to precompute) will just return the same pipeline you created earlier. Very neat if you ask me, and according to Valve, this is magic and it’s incredibly fast.
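Conceptually, the cache is a map from the pipeline state description to the built pipeline. The real work happens inside VkPipelineCache in the driver; this sketch only illustrates the lookup behavior, with a tiny stand-in key instead of the full create-info struct:

```cpp
#include <map>
#include <tuple>

// Stand-in for the many fields of VkGraphicsPipelineCreateInfo
struct PipelineKey
{
    int shaderProgram;
    int vertexLayout;
    int blendState;
    bool operator<(const PipelineKey& rhs) const
    {
        return std::tie(shaderProgram, vertexLayout, blendState)
             < std::tie(rhs.shaderProgram, rhs.vertexLayout, rhs.blendState);
    }
};

class PipelineCache
{
public:
    // Returns the existing pipeline for an identical state description,
    // otherwise 'builds' a new one and remembers it
    int GetOrCreate(const PipelineKey& key)
    {
        auto it = this->pipelines.find(key);
        if (it != this->pipelines.end()) return it->second; // cache hit
        int handle = this->nextHandle++;                    // expensive build
        this->pipelines[key] = handle;
        return handle;
    }

private:
    std::map<PipelineKey, int> pipelines;
    int nextHandle = 1;
};
```

Requesting a pipeline twice with the same key pays the build cost once, which is why precomputing combinations up front pairs so well with the cache.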

In Nebula, shaders loaded as compute shaders can predict everything they need and create the pipeline on resource load, which is very convenient. For graphics, I implemented a flagging system, where a flag is set whenever a member of this struct is initialized, and when all flags are set and we want to render, we build the pipeline.

	// setting the input layout marks that part of the pipeline as known,
	// and invalidates any previously built pipeline
	this->currentPipelineBits |= InputLayoutInfoSet;
	this->currentPipelineInfo.pInputAssemblyState = inputLayout;
	this->currentPipelineBits &= ~PipelineBuilt;

	// same pattern for the framebuffer-related state
	this->currentPipelineBits |= FramebufferLayoutInfoSet;
	this->currentPipelineInfo.renderPass = framebufferLayout.renderPass;
	this->currentPipelineInfo.subpass = framebufferLayout.subpass;
	this->currentPipelineInfo.pViewportState = framebufferLayout.pViewportState;
	this->currentPipelineBits &= ~PipelineBuilt;

	// at draw time: every part of the state must have been set, and the
	// pipeline is only (re)created if something changed since the last build
	n_assert((this->currentPipelineBits & AllInfoSet) == AllInfoSet);
	if ((this->currentPipelineBits & PipelineBuilt) == 0)
	{
		// ... create and bind the pipeline here ...
		this->currentPipelineBits |= PipelineBuilt;
	}


Stable release

We’ve been quiet here, but we haven’t been idle. We’ve been working on our tools for the 2016 iteration of Nebula Trifid, adding stuff, removing other stuff and doing UI optimizations. I’ve also been working somewhat on the OpenGL renderer, and I now deem it to be in a stable state.

Before I jump into the work currently being done on Vulkan, I would like to finish up the last post by explaining how I solve uniform buffers and updating them in Nebula.

Where shaders were previously just resources meant to be instantiated by creating a shader instance, the shader itself is now applicable, and shader instances can be seen as derivatives of the original shader. The ‘main’ shader can be applied and its variables updated. The shader resource holds a list of uniform buffers directly associated with a ‘varblock’ in AnyFX, and as such, updating the uniforms in the shader requires a Begin -> Update -> End procedure before rendering. This causes the uniform buffers in the shader to accumulate changes, and flush them directly when done.

Shader instances hold their own backings of said buffers, meaning they have a copy of the shader’s buffer but can provide their own per-instance unique variables. This way, we have solved per-object animation of certain variables, such as alpha blending, without disturbing the main shader buffers. This might seem like a waste of space, and it is if the shader code has tons of uniforms which are never in use. However, AnyFX knows if a varblock is being used, and can report so to the engine, which in turn simply doesn’t bother loading that varblock.

Apart from this automated system, a varblock in AnyFX can be marked with the annotation “System” = true; meaning it will be managed by the Nebula system. Shaders and shader instances will then NOT create a buffer backing for these varblocks, and will NOT apply them automatically. This is on purpose, since some buffers are supposed to be maintained by Nebula and Nebula alone. These buffer -> varblock bindings include:

  1. Per object transforms and ID.
  2. Camera matrices and camera related stuff.
  3. Instancing buffers.
  4. Joint buffers.
  5. Lights.

These are retrieved using a new API in AnyFX which lets us determine block size and variable offset, and then bind an OGL4ShaderVariable straight to the uniform buffer. Updating said OGL4ShaderVariable is just a memcpy into a persistently mapped buffer in the uniform buffer object, and we’re done. Simple.
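The update path can be sketched like this. The struct and names are hypothetical stand-ins for OGL4ShaderVariable; the point is that reflection hands us a byte offset, and setting a variable is a single memcpy into already-mapped memory:

```cpp
#include <cstdint>
#include <cstring>

// Minimal stand-in for a shader variable bound straight to a uniform buffer
struct ShaderVariable
{
    uint8_t* mappedBase; // persistently mapped uniform buffer memory
    size_t offset;       // variable offset inside the block, from reflection

    // Setting the variable is just a memcpy at the reflected offset;
    // no glBufferSubData, no rebind, no flush call needed here
    template <typename T>
    void Set(const T& value)
    {
        std::memcpy(this->mappedBase + this->offset, &value, sizeof(T));
    }
};
```

In real code, mappedBase would come once from mapping the buffer persistently (glBufferStorage + glMapBufferRange with GL_MAP_PERSISTENT_BIT), after which the GPU sees writes without any further API calls.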

This explicit <-> automatic use of uniform buffers lets us optimize what gets updated and when.


Like the rest of the developers out there, I had to jump straight into Vulkan as soon as it was released. To start off, I have never seen such an explicit API with such a talkative and redundant syntax. That being said, I like it. It reminds me of DX11, but without all that COM. It’s a C-like API where structs are passed to functions to determine how a function will operate, and any operation liable to cause an error returns a result code, and it’s consistently so.

The first thing that ‘had to go’ was the AnyFX API. Now now, I didn’t throw it away, but I quickly realized that with such a close-to-metal API, the shading system also had to be closer to the engine. Therefore, I developed a secondary API for AnyFX, called low-level, and split the previous API into its own folder called high-level. High-level is meant to manage the shading automatically, where you just apply a program and set some variables, whereas the low-level API provides nothing more than shader reflection, including all the AnyFX stuff.

Also, SOIL is no longer a viable alternative for image loading, so we have to use DevIL for that. Luckily for us, we have our own DevIL fork and have added some stuff to it, which in turn also gave me some insight into the API. It shouldn’t be too much work (if any) to make DevIL able to load a texture and just give me the compressed data straight away, so that I can load it into a Vulkan image.

However, the AnyFX compiler has also received some minor additions. Not only has the compiler language been cleaned up a bit, but it has also gotten some extensions. Similar to the layout(x, y, z) syntax in GLSL, AnyFX has received something called a qualifier expression. This is essentially a qualifier with an attached value, say group(5), which is the qualifier ‘group’ with the value ‘5’. The group qualifier in particular is used to extend AnyFX with a Vulkan-specific language feature, declaring which resources go into which DescriptorSet. Qualifiers used for specific languages are ignored when compiling for a language without said support, so that we can still use the same shaders. In the future, the qualifier expression syntax will most likely replace the parameter qualifiers that determine special behavior for shader function parameters, such as [color0]. Also, the AnyFX compiler no longer uses the hardware-dependent GLSL compiler, but instead uses the Khronos Group reference compiler (glslang).

The render device in Vulkan is also somewhat different from the OGL4 and DX11 versions, in that it uses 4 worker threads to populate 4 command buffers instead of just sending commands to the queue directly. This way we can parallelize the process of issuing draws, shader switches and descriptor binds (images, textures, samplers and buffers). We cannot use the TP-system with jobs, however, because we need to explicitly control which draw goes to which thread and in what order.

To switch which thread commands are pushed to, we currently use a pipeline bind as a ‘context switch’, which causes the next thread to receive the next series of vertex buffers, index buffers and draws. This may be inefficient, because we might actually get just one shader, one vertex buffer and LOADS of draws, so perhaps sorting just the draws into their own threads is more efficient, syncing the threads per pipeline. Another solution is to have one thread per pipeline and have them spawn their own draw threads, but command buffers have to be externally synchronized (meaning only one thread may manipulate them at a time), so that’s somewhat complicated. This figure illustrates what I mean:

  1. Thread -> P -> IA -> Draw -> Draw -> Draw | Switch thread | P -> IA -> Draw
  2. Thread -> P -> IA -> Draw -> Draw -> Draw -> IA -> Draw -> Draw -> Draw  | Switch thread |
  3. Thread -> P -> IA -> Draw | Switch thread |
  4. Thread -> P -> IA -> Draw  -> Draw -> Draw -> Draw -> Draw -> Draw | Switch thread |

Here, P means PipelineStateObject and IA means input assembly (VBOs, IBO, offsets). This execution style ensures the draw calls happen in the order they should relative to their pipeline and vertex and index buffers; however, it can also give us this:

  1. Thread -> P -> IA -> Draw -> Draw -> Draw -> Draw  -> Draw  -> Draw  -> Draw  -> Draw  -> Draw  -> Draw  -> Draw  -> Draw  -> Draw  -> Draw  -> Draw  -> Draw
  2. Thread -> Zzzz…
  3. Thread -> Zzzz…
  4. Thread -> Zzzz…

Which is really bad, because we want to utilize our threads as much as possible. Ironically, this scenario would be IDEAL in OpenGL and DX11, because we would have no context switching between our draws. Now, however, it just means our threads are idling. For brevity, the above illustrations do not show the bindings of descriptor sets before each draw.



Currently, I’m working on a system which can construct and update DescriptorSets so that we may update the shader uniform state in an efficient manner. In essence, a DescriptorSet packages an entire set of shader state – uniforms, textures, uniform buffers, storage buffers and the new sampler objects. Preferably, I would like to have a static DescriptorSet per surface and then just bind it, but the number of DescriptorSets is finitely defined when creating the DescriptorPool, so adding more objects might cause the DescriptorSet allocation to fail. The idea in Vulkan is to have a DescriptorSet per level of change, for example:

  1. Per frame (lights, shadow maps).
  2. Per view (camera, shadow casting matrices).
  3. Per pass (Input render targets, [subpass input attachments]).
  4. Per material (textures, parameters).
  5. Per object.

Vulkan then lets you bind them one by one, updating inner-loop sets while letting outer sets remain the same. Nebula, however, can never assume this behavior will be incremental, and will bind whichever set(s) have changed, so it is up to the shader developer to make sure that variables are grouped by frequency of update. Nebula will probably do it one by one, through calls to vkCmdBindDescriptorSets for each descriptor set, but I haven’t gotten there quite yet.
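The "bind whichever set(s) have changed" logic can be sketched as a small redundancy filter. This is an illustration with hypothetical names, not the engine's code: each frequency slot remembers its last bound set, and only changed slots trigger an actual bind:

```cpp
#include <array>
#include <cstddef>

// One slot per change frequency: frame, view, pass, material, object
constexpr size_t NumSlots = 5;

struct SetBinder
{
    std::array<int, NumSlots> bound{}; // 0 = nothing bound yet
    int rebinds = 0;                   // counts actual bind calls issued

    void Bind(size_t slot, int set)
    {
        if (this->bound[slot] == set) return; // unchanged, skip the bind
        this->bound[slot] = set;
        this->rebinds++; // here real code would call vkCmdBindDescriptorSets
    }
};
```

With this, the per-frame/per-view/per-pass sets stay bound across draws while only the per-object slot churns, which is the behavior the frequency grouping is meant to enable.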

Another thing Nebula should really utilize is the concept of subpasses. Subpasses allow for per-pixel shader caches, meaning we can read a previously written fragment without having to dump all the information to a framebuffer first. This gives us two major things:

  1. Performance when dealing with G-buffers.
  2. No more ping-pong textures. And also more performance.

Because we can read G-buffer data using a new type of uniform, called an input attachment, we can now avoid dumping the G-buffer to a framebuffer. The lighting shaders can later just read the pixel values straight from the input attachment instead of having to sample a texture, allowing us to skip the overhead of G-buffer sampling in some instances. At some point the G-buffers do need to dump at least their depth values to a framebuffer for later use (godrays, AO, DoF, etc.), but we can probably get away with skipping the albedo and emissive parts. However you twist and turn it, you end up using less memory bandwidth and thus getting better performance. We can also skip ping-ponging when doing bloom and blurs, because we can use a pixel-local cache to save a horizontal blur result when doing a vertical blur, and vice versa.

To enable this in Nebula, we need to add it to the design of a frame shader. A frame shader (perhaps they should have a renderer-specific implementation?) would need a SuperPass tag which can hold many ordinary Pass tags. A <SuperPass> would be the Vulkan equivalent of a RenderPass, and an ordinary <Pass> would be a subpass. This would allow us to pass ‘framebuffers’ as input attachments within the <SuperPass>, and when the <SuperPass> ends, we declare what actually becomes resolved.

Nebula also needs (with the help of AnyFX) to enable sampler objects in the shader code. In GLSL there are two texture types: sampler1D -> samplerCubeArray and image1D -> imageCubeArray. Samplers have a sampler state attached to them from the application, while images are only readable through texel fetches. However, in the Vulkan GLSL dialect there is a new type of texture called, well, texture, and it follows the same range, texture1D -> textureCubeArray. To sample such a texture, we write texture(sampler2D(AlbedoMap, PointSampler), UV); which is very similar to the DX11 way, which is AlbedoMap.Sample(PointSampler, UV); In the Vulkan GLSL dialect, we can create a sampler couple at the point of sampling, which allows the exampled AlbedoMap to be sampled using many different samplers without having to exist more than once. However, Vulkan also supports pre-coupled samplers and images (combined image samplers), so for the time being, we will stick to that syntax.

Working with surface materials

The work of moving the render state stuff out of the model and into its own resource has been done, and it turns out that it works. Its concept is similar to what the previous post touched upon, although with some slight modifications.

For example, my initial idea was to apply a surface material ONCE per shader, and then iterate each submesh and render it. However, it turns out that a submesh with a given set of surface material properties might actually need to use other shaders. For example, we might want to do instanced rendering, which picks a model matrix from an array using gl_InstanceID, and occasionally non-instanced rendering, which uses the singular model matrix. One could consider always performing instanced rendering, i.e. always treating per-object variables as an array, although there is no semantic way in the shader language (and contextually shouldn’t be) to denote variables or uniform buffer blocks as having a per-draw-call update pattern. What we don’t want is to have to implement an instanced version of each surface. An optional solution is to not use instancing for ordinary objects, but exclusively for particles, grass or other frequently updated, (massively) repeated stuff.

Instead, it might be smarter to adopt the DX11 way, where every single variable is part of a buffer block, and variables declared outside an explicit buffer block are just considered part of the global block. This method always requires an entire block update, even if only one variable has changed, although the new material system offers a flexible way around this.

Instead of working on a per-object basis, where each object literally sets its state for each draw, it’s more attractive to use shader storage blocks as shared storage for variables which are guaranteed to be unique per object, such as the model matrix, object id, etc. Each surface material provides a single uniform buffer block which is updated ONCE when the surface is created, and modified if a surface gets any constants changed (which is rare during runtime).

When rendering, we simply only need to apply the shader, bind the surface block, then per each object bind their unique blocks, and we’re done with the variable updates. Or are we?

A problem comes from the fact that we might want to animate some shader variable, for example the alpha blend value, which forces us to update an otherwise shared uniform buffer block. Unfortunately, this means we not only have to modify the block to use the new alpha value, but also modify it back to its default state once we no longer want the change.

So one idea is to keep changeable values in a transient uniform buffer block (which uses persistent mapping for performance), have one or several material blocks (one for foliage stuff, one for water stuff, one for ordinary objects, etc.) and have a shader storage block for variables which are guaranteed to change on a per-object basis. Why a shader storage block, you might ask? Well, if we can buffer all of our transforms into one huge array, we can retain a global dictionary of these variables, which can then be accessed in the shader.

By doing this we can minimize the need to set the same transforms twice for the same object, for example when doing shadow mapping (which is three passes: global, spot, point), the default shading, and then perhaps some extra shaders. A shader storage block is flexible in this context, because it can hold a dynamic number of instances. This method basically approaches instancing, because with it we could also utilize the MultiDraw calls, since we have prepared our state prior to rendering, and all submeshes already have their unique data posted to the shader.
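The CPU-side packing step this implies can be sketched briefly. These names (PerObject, DrawBatch) are hypothetical; the idea is just that every visible object's per-draw data goes into one array backing the storage block, and the index handed back is what the shader uses (via gl_DrawID or gl_InstanceID) to look it up:

```cpp
#include <vector>

// Per-object data as it would be laid out in the shader storage block
struct PerObject
{
    float transform[16]; // model matrix
    int objectId;
};

struct DrawBatch
{
    std::vector<PerObject> storage; // uploaded once per frame to the SSBO

    // Append an object's data; the returned index matches the draw's
    // position in the MultiDraw call, so the shader can index the array
    int Add(const PerObject& obj)
    {
        this->storage.push_back(obj);
        return static_cast<int>(this->storage.size()) - 1;
    }
};
```

Because the whole array is resident before rendering starts, a single MultiDraw can cover every submesh in the batch without any per-object state changes in between.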

My concern with this is that we might have the case where not all submeshes share the same surface material. This is the case if we have a composite mesh, where some part uses, for example, subsurface scattering, and the rest is glass, or cloth, or some other type of material. The issue is basically that there is only a very small subset of cases where we win any performance, since we need this particular scenario: a mesh where all parts share the same surface material and shaders.

The issue with clever drawing methods always falls back on the issue of addressing variables when drawing more than one object. Storing all variables in per-object buffers is trivial, but doesn’t allow us to use any of the fancy APIs like MultiDraw. Storing all variables in global buffers is difficult, because we need to define a set of global buffers (with hard-coded names) which are explicitly updated at an index representing the object. This poses a problem if we have several parameter sets, for example PBR, foliage and UV animation, and then try to somehow use 3 global buffers, because we must also have another buffer which denotes each object’s index into these buffers (unless we can use gl_DrawID).

I think the best way might be to do something like this:
Surface material blocks are retrieved by looking at the surface variables; we extract their blocks and create a persistently mapped buffer for them. Objects which need to modify singular values do so by writing to the persistently mapped buffer directly, which makes the value active at the next render. The issue is that with any buffering method, we can only make a limited number of buffer changes before we need a sync. Basically, if we use triple buffering, we need to wait on every third change, because the previous change might not have been drawn yet.

We also have one transient buffer which changes every frame (time, view matrix, wind direction, global light stuff), and finally a buffer per submesh which contains its transforms and object ID. So to summarize:

  • Surface materials contain buffers for each block they are ‘touching’. Mapped for rare changes (triple buffered).
  • Transient frame buffer like the one we have now, but with more of the common variables. Mapped for changes (triple buffered).
  • Per-object transform buffer. Mapped for changes (triple buffered).
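The triple-buffering scheme behind all three bullets reduces to a small piece of bookkeeping: writes go to slot (frame mod 3), and a slot is only reused once the frame that last wrote it has been drawn. A minimal sketch, with the GPU-side fence wait stubbed out as a comment:

```cpp
// Number of in-flight copies of each mapped buffer range
constexpr int NumFrames = 3;

struct TripleBufferedRange
{
    int frameIndex = 0;

    // Which third of the mapped range the CPU may write this frame
    int CurrentSlot() const { return this->frameIndex % NumFrames; }

    void NextFrame()
    {
        this->frameIndex++;
        // Real code would wait here on the fence guarding the new
        // CurrentSlot(), since the GPU may still be reading the data
        // written into it three frames ago.
    }
};
```

This is exactly why the sync cost moves from every N’th object to every N’th frame: the wait happens once per frame when a slot cycles around, not once per buffer poke.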

AnyFX should be changed so that variables which lie outside a varblock (uniform buffer block/constant buffer) are just put in a global varblock with a reserved name. AnyFX should probably turn over the responsibility of handling uniform buffers to the engine, and not manage them intrinsically. As such, setting a variable which lies in a block will just assert, because it would have no purpose or result. Instead we set the buffer which should be used for a varblock by sending an OpenGL buffer handle to the VarBlock implementation. This will also make it simpler to integrate the API with future APIs which force explicit control of buffers (read: Vulkan, DX11+), but still wrap the shader part in a comfortable and flexible manner.

In essence, we should minimize the buffer changes; we’re no longer restricted to the current system where we poke inside a gigantic transform buffer each time we draw. The current method suffers from the issue that we have to sync every N’th object, and we update the same object several times per frame – possibly 6 times (!). The new method syncs every N’th frame instead, which should scale better when the same object gets rendered multiple times.

Material system rewrite

With the game projects done, we’ve gotten some feedback related to the game creation process with Nebula. So we have a gigantic list of stuff we should implement/fix/change for the next iteration.

One of these things (one of the biggest) is making a new material system. Currently, a material is defined in an XML which explains where during a frame a material is used. Materials are then put on models, and the material explains when during the frame a model should be rendered, and with which shader. So basically, you have this relation:

  • Material
    • Pass 1
      • BatchGroup ‘FlatGeometryLit’ with shader ‘static’ and variation ‘Static’
    • Pass 2
      • BatchGroup ‘GlobalShadows’ with shader ‘shadow’ and variation ‘Static|CSM’
  • Model
    • Model node
      • Material
      • Variables
      • FrameShader
        • FrameBatch, which denotes a BatchGroup
          • For all materials belonging to BatchGroup
            • Apply material
            • For all visible model nodes using material
              • Apply mesh
              • For all visible model node instances using material
                • Apply node state
                • Draw

So a material describes when and how an object should be rendered, a model uses a material to understand when it should be rendered, and the frame batch tells when a certain batch group should be executed.

While this method works, it’s a bit inflexible. The main downsides are:

  • The state of a model is saved within each model node.
  • The material is saved on a resource level for each model node, so it cannot be switched out.
  • Material variables have to be set on each model node, and cannot be reused.

So the new system is different: instead of saving the shading state (material settings) within a model, it is saved in a separate resource, which at the time is called a SurfaceMaterial. A SurfaceMaterial is created as a resource in the content browser, and it contains the values for each material variable. It uses an existing material as a template, since the surface material doesn’t denote when it should be rendered. This makes it possible to create new surfaces by taking a material as a template, and then saving the material variables (textures, intensities, etc.) in a separate resource. When making new assets later on, it will be possible to use the same surface on several models, which is nice because it makes it faster for graphics artists to assign textures, alpha clip thresholds and texture multipliers, since they only need to assign an already created surface.

Furthermore, since the state of a model is now detached from the model resource, it also allows us to change the material at run-time. This means we now (finally) have the ability to switch the material on objects in real-time, something which is extremely useful for, for example, hiding opaque objects too close to the camera, procedural dissolves, fades, etc.

In theory, it would also be possible to further improve performance by sorting each model node instance by surface, so that a surface is only applied once per frame; however, it is difficult to filter out instance-unique surfaces (for example if we have one object with a unique alpha blend factor). This might not be necessary (or even noticeable), since the shader subsystem only actually applies a shader variable if it differs from what is currently set, so setting the same surface multiple times results in close to no overhead.
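The redundant-apply filter mentioned above amounts to remembering what is currently applied and turning repeat applications into no-ops. A tiny sketch (hypothetical names, surfaces reduced to integer ids):

```cpp
// Filters out redundant surface applications: only a genuinely different
// surface causes a real state change
struct SurfaceApplier
{
    int currentSurface = -1; // nothing applied yet
    int applies = 0;         // counts actual state changes issued

    void Apply(int surface)
    {
        if (surface == this->currentSurface) return; // already set, skip
        this->currentSurface = surface;
        this->applies++; // real work (variable uploads, binds) happens here
    }
};
```

This is why sorting by surface is an optimization rather than a correctness requirement: unsorted draws that happen to repeat a surface already cost almost nothing.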

The original render system used a clever bucket system, where model node instances would be put into buckets depending on shader. When I made the first iteration of the material system, I made this bucket system use materials to group (sort) objects, so that material switches would be as few as possible. However, that system relied on materials being defined in the model resource. It was easy to switch to a system where each model node instance decides which bucket it is put in.

The biggest change is to convert all .attributes into surface XMLs, then remove the duplicates, then replace the field in each state node which corresponds to its material name with the name of a generated surface. Perhaps it is easiest to just go through the projects, collect their values, make new surfaces and assign them manually. We also need to create tools to make surfaces with. This should be fairly straightforward, seeing as we can change the state of a material by setting a value, and swap materials on models using the new system, so we should be able to visualize it easily enough.

So instead of the above hierarchy, we have something like this:

  • SurfaceMaterial
    • Material
    • Variables
  • Model
    • Model node
      • Surface material name
      • Model node instances
        • Pointer to surface material