Vulkan – Designing a new frame script system

Nebula has a neat feature called frame shaders. Frame shaders are XML scripts that describe the rendering of an entire frame. However, frame shaders in Nebula are designed with a DirectX 9 mindset, and are in dire need of a rewrite.

With Vulkan, and partly with OpenGL 4, there are more efficient ways of binding render targets. In DirectX 9 there was a clear distinction between multiple render targets and singular ones. In OpenGL we have framebuffers, objects containing all render targets. In DirectX 10-12 we bind render targets individually. In Vulkan and OpenGL we can have a framebuffer but select only a subset of attachments to actually use, allowing us to avoid binding framebuffers more often than needed. In Vulkan we can even pass data between subpasses through input attachments, so our new design has to take that into consideration.

We also want to be able to apply global variables in the frame shader, so that we can, for example, switch out the NormalMap, AlbedoBuffer, etc., if we render in VR or want to produce a reflection cube texture. This allows the frame script to apply per-execution settings, which can be shared across all shaders used when rendering the script.

So one of the design choices is a frame scripting system which allows us to add frame operations just like FramePassBase; however, a FramePass is a bit too ‘high-level’, since it implies something like a texture and some draws. With the new system, we want to execute memory barriers, trigger events to keep track of compute shader jobs, and assemble highly optimized subpass dependency chains, meaning a frame operation can be much simpler than an actual pass.
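The idea above can be sketched as a tiny class hierarchy; everything here (class names, the `run`/log shape) is illustrative rather than Nebula's actual API:

```python
# Illustrative sketch of the frame-operation idea: every scripted step, from
# a full pass down to a single barrier, shares one minimal interface, so a
# frame op can be far simpler than a pass. Names are made up, not Nebula's.
class FrameOp:
    def __init__(self, name):
        self.name = name

    def run(self, log):
        raise NotImplementedError

class FrameBarrier(FrameOp):
    def run(self, log):                      # much simpler than a pass:
        log.append(("barrier", self.name))   # just a memory blockade

class FramePass(FrameOp):
    def __init__(self, name, subpasses):
        super().__init__(name)
        self.subpasses = subpasses

    def run(self, log):                      # a pass binds, runs children, unbinds
        log.append(("begin-pass", self.name))
        for sub in self.subpasses:
            sub.run(log)
        log.append(("end-pass", self.name))

script = [FrameBarrier("PreDeferred"), FramePass("DeferredPass", [])]
log = []
for op in script:
    op.run(log)
```

Because a barrier and a pass are both just ops, the script executor can interleave them freely without special cases.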

Also, we want to slightly redesign some of the concepts of CoreGraphics: instead of beginning a pass with a render target, a multiple render target, or a render target cube, we use a pass object. The pass already knows about its subpasses and attachments, and the frame system knows when to bind it and what to do when it is bound.

Enter Frame2.

Declare RenderTexture    - can be used as shader variable and render target
	Fixed size 		- in pixels
	Relative size	- to screen (decimal 0-1)
	Dynamic size	- can be adjusted in settings
	Format			- any renderable color format
	Multisample		- true if render texture supports multisampling
	Name
	
Declare RenderDepthStencil	- implements a depth-stencil buffer for rendering
	Fixed size 		- in pixels
	Relative size	- to screen (decimal 0-1)
	Dynamic size	- can be adjusted in settings
	Format			- any accepted depth-stencil format
	Multisample		- true if depth-stencil supports multisampling
	Name
	
Declare ReadWriteTexture   - can be used as shader variable and compute shader input/output and fragment shader input/output
	Fixed size 		- in pixels
	Relative size		- to screen (decimal 0-1)
	Dynamic size		- can be adjusted in settings
	Format			- any color format (renderable or otherwise) but not depth-stencil 
	Multisample		- true if read-write image supports multisampling
	Name
	
Declare ReadWriteBuffer - can be used as compute shader input/output
	Size		- in bytes
	Relative size	- to screen, size is now size per screen pixel
	Name
	
Declare Event - can be used to signal dependent work that other work is done
	Set 		- created in an already set state
	Name 

Declare Algorithm
	Class		- to create instance of algorithm class
	
GlobalState
	- List all global values used by this frame shader
	
Pass <Name>
	- List all attachments being used, then use index as lookup
	- Pass implicitly creates rendertarget/framebuffer
	
	RenderTexture <Name of declared RenderTexture>
		- Clear color
		- If name is __WINDOW__ use backbuffer
	RenderDepthStencil <Name of declared DepthStencil>
		- Clear stencil, clear depth
	
	Subpass <Name>
		- List subpasses depended upon
		- List of attachment indices
			- Output color
			- Output depth-stencil
			- Output input attachment (call something clever, like shader-local)
			- Resolve <boolean>
			- Passthrough (automatically assume all unmentioned attachments are pass-through)
			- Dependency
			
		- Viewports and scissor rects
			- Set viewport -> float4(x, y, width, height), index
		- Set scissor rect -> float4(x, y, width, height), index
			- If none are given, whole texture is used
			
		- Drawing
			SortedBatch <Name>
				- Must be inside subpass
				- Renders objects in Z-order
				
			Batch <Name>
				- Must be inside subpass
				- Batch renders objects with as few shader switches as possible
				- Renders materials declaring a Pass with <Name>. Rename Pass to Batch in materials.			
				
			System <Name>
				- Name decides what to do
				- Must be inside subpass
				- Lights means light sources
				- LightProbes means light probes
				- UI means GUI renderers
				- Text means text rendering
				- Shapes means debug shapes
			
			FullscreenEffect <Name>
				- Must be inside subpass
				- Bind shader
				- Update variables
							
			SubpassAlgorithm <Name>
				- Select algorithm to run
				- Select stage to execute so we can execute different phases with different subpasses
				- Inputs listed from
				- Must be inside subpass
				
Copy <Name>
	- Must be outside of pass
	- Target texture
	- Source texture
			
Blit <Name>
	- Like copy, but may filter if formats are not the same, or size differs
	- Must be outside of pass
	
ComputeAlgorithm <Name>
	- Select algorithm to run
	- InputImage input readwrite image
	- OutputImage output readwrite image
	- Allow asynchronous computation <boolean>; if not defined, defaults to false
	- Has to be outside of a Pass
	
Compute <Name>
	- Bind shader
	- Update variables
	- Allow asynchronous computation <boolean>, will require an Event to be used to control execution
	- Compute X, Y, Z
		- Is number of work groups, work group size is defined in shader
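As a sketch of how the X, Y, Z counts relate to a target resolution (a hypothetical helper; the actual work group size lives in the shader):

```python
# Hypothetical helper relating the X, Y, Z dispatch counts to a target
# resolution: the number of work groups is the resolution divided by the
# work group size (defined in the shader), rounded up to cover every pixel.
from math import ceil

def dispatch_counts(resolution, group_size):
    return tuple(ceil(r / g) for r, g in zip(resolution, group_size))

# A 1920x1080 target processed by 16x16x1 work groups:
dispatch_counts((1920, 1080, 1), (16, 16, 1))  # -> (120, 68, 1)
```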
	
Barrier <Name>
	- Implements a memory blockade between two operations
	- Signals a ReadWriteBuffer or a ReadWriteTexture to be blocked from one pipeline stage to another, also denoting the access flags allowed for both resources prior and after the barrier
	- Can be used as a pipeline bubble, or as the content of an Event
	- Can be executed within a render pass or outside

Event <Name>
	- Can be reset
	- Can be set
	- Can be waited for
	- Can be called within render pass and outside

For the sake of minimalism, the new system also implements the script as a simplified JSON file instead of XML, making it slightly more readable, although almost equally stupid. An example of a frame shader, now called a frame script, can look like this:

version: 2,
engine: "NebulaTrifid",
framescript: 
{
	renderTextures:
	[
		{ name: "NormalBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "DepthBuffer", 		format: "R32F", 			relative: true,  width: 1.0, height: 1.0 },
		{ name: "AlbedoBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "SpecularBuffer", 	format: "A8R8G8B8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "EmissiveBuffer", 	format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "LightBuffer", 		format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "ColorBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "ScreenBuffer", 	format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "BloomBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 0.5, height: 0.5 },
		{ name: "GodrayBuffer", 	format: "A8B8G8R8", 		relative: true,  width: 0.5, height: 0.5 },
		{ name: "ShapeBuffer", 		format: "A8B8G8R8", 		relative: true,  width: 1.0, height: 1.0 },
		{ name: "AverageLumBuffer", format: "R16F", 			relative: false, width: 1.0, height: 1.0 },
		{ name: "SSSBuffer", 		format: "A16B16G16R16F", 	relative: true,  width: 1.0, height: 1.0 },
		{ name: "__WINDOW__" }
	],
	
	readWriteTextures:
	[
		{ name: "HBAOBuffer", 		format: "R16F", 			relative: true, width: 1.0, height: 1.0 }
	],
	
	depthStencils:
	[
		{ name: "ZBuffer", 			format: "D32S8", 			relative: true, width: 1.0, height: 1.0 }
	],
	
	algorithms:
	[
		{ 
			name: 		"Tonemapping", 
			class: 		"Algorithms::TonemapAlgorithm", 
			renderTextures:
			[
				"ColorBuffer",
				"AverageLumBuffer"
			]
		},
		{
			name:		"HBAO",
			class: 		"Algorithms::HBAOAlgorithm",
			renderTextures:
			[
				"DepthBuffer"
			],
			readWriteTextures:
			[
				"HBAOBuffer"
			]
		}
	],
	
	shaderStates:
	[
		{ 
			name: 		"FinalizeState", 
			shader:		"shd:finalize", 
			variables:
			[
				{semantic: "ColorTexture", 		value: "ColorBuffer"},
				{semantic: "LuminanceTexture", 	value: "AverageLumBuffer"},
				{semantic: "BloomTexture", 		value: "BloomBuffer"}
			]
		},
		{
			name: 		"GatherState",
			shader: 	"shd:gather",
			variables:
			[
				{semantic: "LightTexture", 		value: "LightBuffer"},
				{semantic: "SSSTexture", 		value: "SSSBuffer"},
				{semantic: "EmissiveTexture", 	value: "EmissiveBuffer"},
				{semantic: "SSAOTexture", 		value: "HBAOBuffer"},
				{semantic: "DepthTexture", 		value: "DepthBuffer"}
			]
		}
	],
	
	globalState:
	{
		name:			"DeferredTextures",
		variables:
		[
			{ semantic:"AlbedoBuffer", 		value:"AlbedoBuffer" },
			{ semantic:"DepthBuffer", 		value:"DepthBuffer" },
			{ semantic:"NormalBuffer", 		value:"NormalBuffer" },				
			{ semantic:"SpecularBuffer", 	value:"SpecularBuffer" },
			{ semantic:"EmissiveBuffer", 	value:"EmissiveBuffer" },
			{ semantic:"LightBuffer", 		value:"LightBuffer" }
		]
	},
	
	computeAlgorithm:
	{
		name: 		"HBAO-Prepare",
		algorithm:	"HBAO",
		function:	"Prepare"
	},
	pass:
	{
		name: "DeferredPass",
		attachments:
		[
			{ name: "AlbedoBuffer", 	clear: [0.1, 0.1, 0.1, 1], 		store: true	},
			{ name: "NormalBuffer", 	clear: [0.5, 0.5, 0, 0], 		store: true },
			{ name: "DepthBuffer", 		clear: [-1000, 0, 0, 0], 		store: true },
			{ name: "SpecularBuffer", 	clear: [0, 0, 0, 0], 			store: true	},
			{ name: "EmissiveBuffer", 	clear: [0, 0, 0, -1], 			store: true	},
			{ name: "LightBuffer", 		clear: [0.1, 0.1, 0.1, 0.1], 	store: true	},
			{ name: "SSSBuffer", 		clear: [0.5, 0.5, 0.5, 1], 		store: true }
		],
		
		depthStencil: { name: "ZBuffer", clear: 1, clearStencil: 0, store: true },
		
		subpass:
		{
			name: "GeometryPass",
			dependencies: [], 
			attachments: [0, 1, 2, 3, 4],
			depth: true,
			batch: "FlatGeometryLit", 
			batch: "TesselatedGeometryLit"
		},
		subpass:
		{
			name: "LightPass",
			dependencies: [0],
			inputs: [0, 1, 2, 3, 4],
			depth: true,
			attachments: [5],
			system: "Lights"
		}
	},
	computeAlgorithm:
	{
		name: 			"Downsample2x2",
		algorithm: 		"Tonemapping",
		function: 		"Downsample"
	},
	computeAlgorithm:
	{
		name: 			"HBAO-Compute",
		algorithm:		"HBAO",
		function:		"HBAOAndBlur"
	},
	pass:
	{
		name: "PostPass",
		attachments:
		[
			{ name: "DepthBuffer",  		load: true },
			{ name: "AverageLumBuffer", 	clear: [0.5, 0.5, 0.5, 1] },
			{ name: "ColorBuffer", 			clear: [0.5, 0.5, 0.5, 1] },
			{ name: "ScreenBuffer", 		clear: [0.5, 0.5, 0.5, 1], store: true},
			{ name: "BloomBuffer", 			clear: [0.5, 0.0, 0.5, 1] },
			{ name: "GodrayBuffer", 		clear: [-1000, 0, 0, 0] },
			{ name: "ShapeBuffer", 			clear: [-1000, 0, 0, 0] }
		],
		
		depthStencil: { name: "ZBuffer", load: true },
		subpass:
		{
			name: "Gather",
			dependencies: [],
			attachments: [2],
			depth: false,
			fullscreenEffect:
			{
				name: 				"GatherPostEffect",
				shaderState: 		"GatherState",
				sizeFromTexture: 	"ColorBuffer"
			}
		},
		subpass:
		{
			name: "AverageLum",
			dependencies: [0],
			attachments: [1],
			depth: false,
			subpassAlgorithm:
			{
				name: 				"AverageLuminance",
				algorithm: 			"Tonemapping",
				function: 			"AverageLum"
			}
		},
		subpass:
		{
			name: "Unlit",
			dependencies: [],
			attachments: [6],
			depth: true,
			batch: "Unlit",
			batch: "ParticleUnlit",
			system: "Shapes"
		},
		subpass:
		{
			name: "FinishPass",
			dependencies: [1, 2],
			inputs: [0, 5, 6],
			attachments: [3],
			depth: false,
			fullscreenEffect: 
			{
				name: 				"ToScreen",
				shaderState: 		"FinalizeState",
				sizeFromTexture: 	"ColorBuffer"
			},
			plugins:
			{
				name:"UI",
				filter:"UI"
			},
			system: "Text"
		}
	},
	computeAlgorithm:
	{
		name: 		"CopyToNextFrame",
		algorithm: 	"Tonemapping",
		function: 	"Copy"
	},
	swapbuffers:
	{
		name: 		"SwapWindowBuffer",
		texture: 	"__WINDOW__"
	},
	blit:
	{
		name: 		"CopyToWindow",
		from: 		"ScreenBuffer",
		to: 		"__WINDOW__"
	}
}

Some of the design choices are:

  • GlobalState
  • Assigns global variables, like the deferred textures used by this frame shader. Other frame shaders can execute and apply their values.

  • RenderTextures contains list of all declared color renderable textures
  • We want to declare all textures in a neat manner, so a single row per texture is nice.

  • DepthStencils contains list of all declared depth stencil targets
  • We might want to use more than one depth stencil sometimes.

  • ReadWriteTextures contains list of all declared textures which support read-write operations
  • Used for image load-stores.

  • ReadWriteBuffers contains list of all declared buffers which support read-write operations
  • Used for compute shaders to load-store data. Size is the size in bytes, but if the relative flag is used, size denotes the byte size per pixel.
    A size of 1, 1 with the relative flag on a 1024×768 pixel screen will allocate 1024×768 bytes; 0.5, 0.5 gives 512×384.
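A minimal sketch of that arithmetic, with illustrative names:

```python
# Illustrative arithmetic for a relative ReadWriteBuffer: the declared size
# becomes bytes per pixel, and the width/height fractions scale the covered
# screen area. Function name and parameters are made up for the example.
def relative_buffer_bytes(screen_w, screen_h, rel_w, rel_h, bytes_per_pixel=1):
    return int(screen_w * rel_w) * int(screen_h * rel_h) * bytes_per_pixel

relative_buffer_bytes(1024, 768, 1.0, 1.0)  # -> 786432 (1024 x 768 bytes)
relative_buffer_bytes(1024, 768, 0.5, 0.5)  # -> 196608 (512 x 384 bytes)
```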

  • Algorithms contain all algorithms used by this frame shader
  • We want to declare algorithms beforehand, so that we can select which pass to use within it dependent on where we are.

  • Pass assigns a list of render targets which may be applied during the pass
  • Only draws are allowed within a pass, because a pass can’t guarantee order of execution of subpasses. A pass defines a list of allowed attachments, and which depth-stencil to use. A pass doesn’t really do anything, the work is done in subpasses.

  • Subpass actually binds render targets
  • Subpasses work on the concepts of OpenGL4 and Vulkan. Binding a framebuffer is done in the pass, the subpass then selects which attachments should be used and in which order.
    Subpasses have dependencies if other subpasses need to be completed before this subpass can run. Subpasses list the attachments used by the pass by index. Subpasses may also contain the most important part, which is drawing.
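The dependency-by-index rule can be checked mechanically; here is a hypothetical validation helper in the spirit of the loader's checks (not actual engine code):

```python
# Hypothetical validation helper: every dependency index must refer to an
# earlier subpass in the same pass, so the dependency chain forms a DAG.
def validate_subpass_dependencies(dependency_lists):
    for i, deps in enumerate(dependency_lists):
        if any(d >= i for d in deps):
            return False
    return True

# Dependency lists as they appear for the PostPass subpasses in the
# example script: Gather, AverageLum, Unlit, FinishPass.
validate_subpass_dependencies([[], [0], [], [1, 2]])  # -> True
```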

  • Drawing!
  • We have four types of draw methods.

    • Batch performs a batch render by shader – surface – mesh to avoid unnecessary switches
    • OrderedBatch performs an ordered batch render across all materials, but renders in Z-order (or perhaps some other scheme) instead of per shader, so it is potentially detrimental to performance, since it may switch shaders many times
    • System runs a render system built-in, like Lights, LightProbes, UI, Text and Shapes
    • FullscreenEffect renders what we previously called a post effect, though it doesn’t really have to be ‘post’ per se. Fullscreen effects require a shader and, in the majority of cases, a list of variable updates. They also need a texture from which to derive the size of the fullscreen quad
  • Algorithm execution
  • We want to execute algorithms, but since the new system has full control over what is bound and when, an algorithm is not allowed to begin or end passes. So what do we need from algorithms? Algorithms need to update shader variables, apply meshes and render; however, they might also need to render to more than one texture at a time. An algorithm also needs to be able to run a certain step.

    • SubpassAlgorithm can only be executed within a pass, and is done for rendering geometry. Subpass algorithms may take values as input, however it is better to use the global state if possible
    • ComputeAlgorithm must be executed outside a pass, and may not render stuff, but only dispatch computations. Compute algorithms must be provided with at least one ReadWriteImage or ReadWriteBuffer, otherwise the compute algorithm is pointless. Compute algorithms can be asynchronous if hardware supports it. If it doesn’t, it just runs inline with graphics.
  • Compute
  • Like before, we can select a compute shader and run X, Y, Z work groups, using Width, Height, Depth sizes for each work group. However, we can now allow computes to execute asynchronously if the hardware supports it. If it doesn’t, the compute just runs inline with graphics. If a compute uses the relative flag, then size denotes the work group size rounded down to fit the resolution.

  • Barrier
  • Executes a barrier between operations. A barrier describes just that: a barrier between adjacent operations. If two interdependent operations are not directly after each other, we can instead use events, letting the later operation wait for the earlier one to be done.

  • Event
  • We can declare and use events (consistently) during a frame. Events can be set, for example, after a compute is done, and waited for somewhere else, with work being done in between. While barriers are better between immediate operations where the second depends on the first, events can be set for computations at the beginning of the frame and waited for just before their results are needed.
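A toy model of the scripted event's life cycle (class and method names are illustrative; a real event gates GPU work rather than returning a boolean):

```python
# Toy model of the scripted event's life cycle; in a real renderer wait()
# would stall the GPU timeline instead of returning a boolean. The class
# name and methods are illustrative only.
class FrameEvent:
    def __init__(self, name, set_on_create=False):
        self.name = name
        self.signaled = set_on_create  # the 'Set' flag from the declaration

    def set(self):
        self.signaled = True

    def reset(self):
        self.signaled = False

    def wait(self):
        return self.signaled  # may the dependent work proceed?

hbao_done = FrameEvent("HBAO-Done")
hbao_done.set()   # signaled after the HBAO compute finishes
hbao_done.wait()  # -> True: work queued in between may now consume the result
```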

  • Copy and Blit
  • We also want to perform an image copy, or a slightly more expensive blit operation (which allows for conversions/decompression)

Backwards compatibility

Since Vulkan defines a very explicit API, it shouldn’t be hard to translate from Vulkan ‘down’ to for example OpenGL or DirectX. Events in OpenGL can be used with glFenceSync and memory barriers with glMemoryBarrier.

In DirectX 11 and below, no such mechanisms exist, so barriers and fences are simply semantic in the frame shader, denoting where a barrier or sync would exist given a more explicit API, meaning they have no actual function. In OpenGL, a pass is a call to glBindFramebuffer, and a subpass is glDrawBuffers. In DirectX, a pass is just a list of renderable textures, and a subpass selects a subset of these render targets and binds them prior to drawing. In Metal, a subpass is the info used in MTLRenderPipelineDescriptor to create a render pipeline, and a render pass is just a list of textures to pick from.

Compute algorithms and plain computes are simply not available if the API doesn’t support them. Perhaps the RenderDevice should report whether the underlying device can actually perform computes, but at the same time, it is impossible to load compute shaders unless the device supports them. And since Nebula loads all exported shaders by default, this might be a problem.

Final thoughts

In some cases we might actually want to modify a frame script. All ‘runnable’ elements in a frame script are of the class FrameOp, meaning we can technically insert FrameOps into the script at runtime. For example, if we want VR, then perhaps we want the last subpass to render not to the screen but to a texture, and we might want to switch which texture to present to, like left eye or right eye, without necessarily having two of every screen buffer.

We might also want to be able to assemble a frame script in code, for example when implementing different shadow mapping methods. We could, for example, merge frame scripts together by just adding ops. The old system used a class called FramePassBase, which would bind a render target and run batches, much like a render pass. However, a FramePassBase binds a render target object as-is, and assumes all attachments will be used in the shader. With the new method we can, for example, bind the CSM shadow buffer and the spot light shadow buffer atlas in one render pass, then render all point light cube maps as 6 × (number of point light shadows) layers. FrameOps give us fine-grained control over how a script is executed, but they also allow us to break validation: since both a FramePass and a FrameSubpass are FrameOps, we can technically place a FrameSubpass outside a FramePass if we assemble one in code. The frame script loader enforces correct structure.

We also want to slightly rework the render plugin system so that we can determine when a plugin should render, seeing as it is important to be able to decide which texture the result will end up in. An idea is to be able to execute a subset of plugins in the script, and then have the plugins register to the plugin registry with a certain group.

The exact details of this design will probably change during development, but the basic concepts are here. One of the major concerns is that the new system should prevent the user from doing stupid things, while also being able to fully optimize and utilize Vulkan and, by extension, the other renderers. The script can, for example, find required dependencies just by looking at which subpass or algorithm is using a resource, and inform the programmer, much like a validation process. It could also be extended with a graphical design interface, although it is doubtful that sensitive features like this should be exposed to someone who is not familiar with GPU programming.

In its current state, the frame script system implements objects of different types, each with a certain behavior. However, since we know the code beforehand, it should be possible to unravel the frame script and produce just the code, meaning the engine would generate the rendering process during compilation, much like the NIDL system, making the rendering pipeline both debuggable and more efficient, without all the indirection.

Vulkan – Beyond the pipeline cache

Don’t you go thinking I have been idle now just because I haven’t written anything down. As a matter of fact, I implemented a whole new render script system, which allows full utilization of Vulkan features such as subpasses and explicit synchronizations such as barriers and events.

The current version of Nebula implements most major parts, including lighting, shadowing, shape rendering, GUI, text rendering and particles. What’s left to implement and validate are the compute parts. However, working with Vulkan is not as simple as many think. There are tons of problems, driver related and otherwise, which is why I decided to implement my own pipeline cache system.

Basically, the Vulkan pipeline cache can just return a VkPipeline object when we use the same objects to create a pipeline twice. That is cute and cool, but internally the system has to serialize at least 14 integers (12 pointers, plus 2 integers for the subpass index and the number of shader states). This is handled by the driver, so relying on it being intelligent or even efficient has proven to be a leap of faith. So I figured: how many different ‘objects’ do we use to create a pipeline in Nebula? Turns out we just use 4: pass info, shader, vertex layout, vertex input.

So the idea came to mind to incrementally build a DAG of the currently applied states, and if the selected DAG path, when calling GetOrCreatePipeline(), already has a pipeline created, just return it instead of creating it. The newest AMD driver, 16.9.1, fails to serialize pipelines, so calling vkCreateGraphicsPipelines always creates and links a new one, which dropped my runtime performance from 140 FPS to 12. Terrible, but it gave me the motivation to avoid calling a vkCreateX function every time I needed something new.

Enter the Nebula Pipeline Database. Sounds cool, but it is a simple structure which layers different pipeline states into tiers, constructing a tree-like dependency which in the end creates a VkPipeline. The class works by applying shading states in tiers. The tiers are: Pass, Shader, Vertex layout, Primitive input. If one applies a pass, all the lower states get invalidated. If a vertex layout is applied, it will be ‘applied’ under the current pass. We construct a tree like so:

Pass:            Pass 1 | Pass 2 | Pass 3
Shader:          Shader 1 | Shader 2 | Shader 3 | Shader 4 | Shader 5 | Shader 6
Vertex layout:   Vertex layout 1 | Vertex layout 2 | Vertex layout 3 | Vertex layout 4 | Vertex layout 5 | Vertex layout 6 | Vertex layout 7 | Vertex layout 8 | Vertex layout 9 | Vertex layout 10 | Vertex layout 11 | Vertex layout 12
Primitive input: Primitive input 1 | Primitive input 2 | Primitive input 3 | Primitive input 4 | Primitive input 5 | Primitive input 6 | Primitive input 7 | Primitive input 8 | Primitive input 9 | Primitive input 10 | Primitive input 11 | Primitive input 12 | Primitive input 13 | Primitive input 14 | Primitive input 15 | Primitive input 16 | null | null

When setting a state, we try to find an already created node for that tier. If no node is found, we create it using the currently applied state. This allows us to quickly find the subtree and retrieve an already created pipeline. You might think this is very cumbersome just to combine pipeline features, but it boosted the base frame rate by several percent, because this way, using only a handful of identifying objects, is much faster than the driver implementation, for obvious reasons: the driver could never assume we have the same code layout as we do in Nebula, so it has to assume every part of the pipeline is dynamic.
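The tiered lookup can be sketched as nested dictionaries, one level per tier; `PipelineDatabase` and the create callback here are illustrative stand-ins, not the engine's actual classes:

```python
# Illustrative sketch of the tiered pipeline cache. Tiers (high to low):
# pass, shader, vertex layout, primitive input. Applying a higher tier
# invalidates everything below it; a pipeline is only created the first
# time a full path through the tree is walked.
class PipelineDatabase:
    TIERS = ("pass", "shader", "vertex_layout", "primitive_input")

    def __init__(self, create_pipeline):
        self.create_pipeline = create_pipeline  # called on cache miss
        self.tree = {}                          # nested dicts, one per tier
        self.state = [None] * len(self.TIERS)   # currently applied objects

    def set(self, tier, obj):
        i = self.TIERS.index(tier)
        self.state[i] = obj
        for j in range(i + 1, len(self.TIERS)):  # invalidate lower tiers
            self.state[j] = None

    def get_or_create(self):
        node = self.tree
        for obj in self.state[:-1]:              # walk/extend the tree
            node = node.setdefault(obj, {})
        leaf = self.state[-1]
        if leaf not in node:                     # miss: create exactly once
            node[leaf] = self.create_pipeline(tuple(self.state))
        return node[leaf]

created = []
db = PipelineDatabase(lambda s: created.append(s) or len(created))
db.set("pass", "DeferredPass")
db.set("shader", "static")
db.set("vertex_layout", "pos_normal_uv")
db.set("primitive_input", "triangle_list")
first = db.get_or_create()
second = db.get_or_create()  # same path: cached, no second creation
```

Applying a new shader invalidates the vertex layout and primitive input below it, which mirrors the described invalidation of lower states.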

Also, the render device doesn’t request a new pipeline from the database object unless the state has actually changed, so we can avoid tons of tree traversals, searches and VkPipelineCache requests just by assuming the state doesn’t need to change.

So what’s left to do?

Platform and vendor compatibility work. At the current stage, the code doesn’t consider violations of hardware limits, such as the number of uniform buffers per shader stage, or per descriptor set. This is an apparent problem on Nvidia cards, where the number of concurrently bound uniform buffers is limited to 12. Also, testing and figuring out how events and barriers work, or what they are actually needed for, since render passes implement barriers themselves, and compute shaders running on the same queue seem to be internally synchronized.
