Vulkan – Beyond the pipeline cache

Don’t you go thinking I have been idle now just because I haven’t written anything down. As a matter of fact, I implemented a whole new render script system, which allows full utilization of Vulkan features such as subpasses and explicit synchronizations such as barriers and events.

The current of Nebula implements most major parts, including lighting, shadowing, shape rendering, GUI, text rendering and particles. What’s left to implement and validate is the compute-parts. However working with Vulkan is not so simple as many think. There are tons of problems, driver related and otherwise, which is why I decided to implement my own pipeline cache system.

Basically, the Vulkan pipeline cache can just return a VkPipeline object when we use the same objects to create a pipeline twice. That is cute and cool, but internally the system has to at least serialize 14 integers (12 pointers, 2 integers for the subpass index and the number of shader states). This is handled by the driver, so relying on it being intelligent or even efficient has proven to be a leap of faith. So I figured, how many different ‘objects’ do we use to create a pipeline in Nebula? Turns out, we just use 4, pass info, shader, vertex layout, vertex input.

So the idea came to mind to just incrementally build a DAG of the currently applied states, and if the selected DAG path, when calling GetOrCreatePipeline(), has a pipeline created, just return it instead of create it. The newest AMD driver, 16.9.1 fails to serialize pipelines, so calling vkCreateGraphicsPipelines always creates and links a new one, which downed my runtime performance from 140 FPS down to 12. Terrible, but it gave me the motivation to avoid calling a vkCreateX function everytime I needed something new.

Enter the Nebula Pipeline Database. Sounds so cool, but is a simple tree structure which layers different pipeline states into tiers, and constructs a tree-like structure in order to construct a dependency which in the end creates a VkPipeline. The class works by applying shading states in tiers. The tiers are: Pass, Shader, Vertex layout, Primitive Input. If one applies a pass, then all the lower states get invalidated. If a vertex layout is applied, then it will be ‘applied’ to the current pass. We construct a tree like so:

Pass 1 Pass 2 Pass 3
Shader 1 Shader 2 Shader 3 Shader 4 Shader 5 Shader 6
Vertex layout 1 Vertex layout 2 Vertex layout 3 Vertex layout 4 Vertex layout 5 Vertex layout 6 Vertex layout 7 Vertex layout 8 Vertex layout 9 Vertex layout 10 Vertex layout 11 Vertex layout 12
Primitive input 1 Primitive input 2 Primitive input 3 Primitive input 4 Primitive input 5 Primitive input 6 Primitive input 7 Primitive input 8 Primitive input 9 Primitive input 10 Primitive input 11 Primitive input 12 Primitive input 13 Primitive input 14 Primitive input 15 Primitive input 16 null null

When setting a state, we try to find an already created node for that tier. If no node is found, we create it using the currently applied state. This allows us to rather quickly find the subtree and retrieve an already created pipeline. You might think this is very cumbersome just to combine pipeline features, but it boosted the base frame rate by several percent, because this way, using only 5 identifying objects, is much faster than the driver implementation, and for obvious reasons. The driver could never assume we have the same code layout as we do in Nebula, so it has to assume every part of the pipeline is dynamic.

Also, the render device doesn’t request a new pipeline from the database object unless the change has actually changed, so we can effectively avoid tons of tree traversals, searches and VkPipelineCache requests just by assuming the state doesn’t need to change.

So what’s left to do?

Platform and vendor compatibility stuff. At the current stage, the code doesn’t consider violations against hardware limits, such as the number of uniform buffers per shader stage, or per descriptor set. This is an apparent problem on nvidia cards, where the number of concurrently bound uniform buffers is limited to 12. Also, testing and figuring out how events and barriers work or what they are actually needed for, since renderpasses implement barriers themselves, and compute shaders run on the same queue seems to be internally synchronized.

Leave Comment