Material system rewrite

With the game projects done, we’ve gotten some feedback related to the game creation process with Nebula. So we have a gigantic list of stuff we should implement/fix/change for the next iteration.

One of these things (one of the biggest) is making a new material system. Currently, a material is defined in an XML which explains where during a frame a material is used. Materials are then put on models, and the material explains when during the frame a model should be rendered, and with which shader. So basically, you have this relation:

  • Material
    • Pass 1
      • BatchGroup ‘FlatGeometryLit’ with shader ‘static’ and variation ‘Static’
    • Pass 2
      • BatchGroup ‘GlobalShadows’ with shader ‘shadow’ and variation ‘Static|CSM’
  • Model
    • Model node
      • Material
      • Variables
      • FrameShader
        • FrameBatch, which denotes a BatchGroup
          • For all materials belonging to BatchGroup
            • Apply material
            • For all visible model nodes using material
              • Apply mesh
              • For all visible model node instances using material
                • Apply node state
                • Draw

So a material describes when and how an object should be rendered, a model uses a material to know when it should be rendered, and the frame batch tells when a certain batch group should be executed.

While this method works, it’s a bit inflexible. The main downsides are:

  • The state of a model is saved within each model node.
  • The material is saved on a resource level for each model node, so it cannot be switched out.
  • Material variables have to be set on each model node, and cannot be reused.

So the new system is different: instead of saving the shading state (material settings) within a model, it is saved in a separate resource, which at the time of writing is called SurfaceMaterial. A SurfaceMaterial is created as a resource in the content browser, and it contains the values for each material variable. It uses an existing material as a template, since the surface material itself doesn’t denote when it should be rendered. This makes it possible to create new surfaces by taking a material as a template and then saving the material variables (textures, intensities etc.) in a separate resource. When making new assets later on, it will be possible to use the same surface on several models, which is nice because it makes it faster for graphics artists to assign textures, alpha clip thresholds and texture multipliers, since they only need to assign an already created surface.

Furthermore, since the state of a model is now detached from the model resource, it also allows us to change the material during run-time. This means we now (finally) have the ability to switch the material on objects in real-time, something which is extremely useful for, for example, hiding opaque objects too close to the camera, procedural dissolves, fades, etc.

In theory, it would also be possible to further improve performance by sorting each model node instance based on surface, so that a surface only gets applied once per frame. However, it is difficult to filter out instance-unique surfaces (for example if we have one object with a unique alpha blend factor). This might not be necessary (or even noticeable), since the shader subsystem will only actually apply a shader variable if it differs from what is currently set, so setting the same surface multiple times results in close to no overhead.
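
As a rough illustration of that redundancy check, this is the kind of filtering the shader subsystem does before touching the driver (a minimal sketch with made-up names, not the actual Nebula code):

#include <cstring>

// Minimal sketch of redundant-state filtering: a value is only pushed to the
// driver if it differs from what is currently set.
struct CachedMatrixVariable
{
    float current[16] = {};

    template <typename ApplyFunc>
    void Set(const float value[16], ApplyFunc&& apply)
    {
        // skip the (potentially expensive) driver interaction if nothing changed
        if (std::memcmp(current, value, sizeof(current)) != 0)
        {
            std::memcpy(current, value, sizeof(current));
            apply(value);   // the actual glUniform*/buffer write happens only here
        }
    }
};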

The original render system used a clever bucket-system, where model node instances would be put into buckets depending on shader. When I made the first iteration of the material system, I made it so that this bucket-system used materials to group (sort) objects, so that material switches would be as few as possible. However this system relied on the fact that materials are defined in the model resource. It was easy to switch this to a system where each model node instance would decide which bucket it would be put in.

The biggest change is to convert all .attributes into surface XMLs, then remove those that are duplicates, then replace the field in each state node which corresponds to its material name with the name of a generated surface instead. Perhaps it is just easiest to go through the projects, collect their values, make new materials and assign them manually. Then we also need to create tools to make materials with. This should be fairly straightforward, seeing as we can change the state of a material by setting a value, and swap materials on models using the new system, so we should be able to visualize it easily enough.

So instead of the above hierarchy, we have something like this:

  • SurfaceMaterial
    • Material
    • Variables
  • Model
    • Model node
      • Surface material name
      • Model node instances
        • Pointer to surface material
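
In C++ terms, the new relationship could be sketched roughly like this (purely illustrative; the actual class names and members in Nebula will differ):

#include <map>
#include <memory>
#include <string>

// A surface is a standalone resource: a material template plus its values.
struct SurfaceMaterial
{
    std::string materialTemplate;                  // the material used as template, deciding when/how to render
    std::map<std::string, float> scalarValues;     // intensities, alpha clip thresholds...
    std::map<std::string, std::string> textures;   // texture variable -> texture resource name
};

// The model node only stores the name of a surface resource...
struct ModelNode
{
    std::string surfaceName;
};

// ...while each instance holds a pointer that can be swapped at run-time.
struct ModelNodeInstance
{
    std::shared_ptr<SurfaceMaterial> surface;      // switchable without touching the model resource
};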

Tool updates

Apart from finally releasing the core engine on GitHub, we have of course been busy with the tools, working on streamlining the UI and making it more usable in general. Both major editors have received facelifts like dockable widgets and attribute editors that consume less space. The content browser has received support for browsing (and playing) sounds, previewing UI layouts, and resource browsers for textures.

[Screenshot: the content browser]
The level editor has support for layers, as well as for selecting which UI layouts and sounds to preload; game entities can be transformed into other entity classes or into environment entities, and vice versa. On top of all that, it is possible to drag and drop items between the two programs.

[Screenshot: the level editor]

Nebula trifid now on Github

After messing around for all too long we finally moved our main development to GitHub. It’s still rough around the edges and probably close to impossible to build without knowing what you are doing, but we are working on that.
The demo content will follow suit soon as well, so that it is easier to get started in general.

The beginning of a preprocessing pipeline.

We’ve been somewhat silent. I would say very silent. Silence doesn’t mean we’ve done nothing though, quite the contrary.

We’ve been working A LOT on the tools, mainly the content browser and the level editor. On my part, I’ve been fiddling with the content browser to make it simpler, more minimalistic and faster. For example, the content browser now allows a static model to have one or several particle nodes attached, which means that artists are able to build more complex (albeit non-skinned) pre-defined objects by adding particle nodes to them. The content browser is also more responsive and interactive due to the new system used when saving, loading and applying object parameter changes such as textures and other shader variables.

However, more relevant is the new IBL pipeline, which allows a level designer to place light probes in the scene, have them render a reflection cube map and an irradiance cube map, and then have these applied in a novel fashion on overlapping objects. To do so, a designer puts a light probe in the scene, gives it a few parameters, presses a button and woop, the entire area is affected by said reflections and irradiance. This effect gives an illusion of real-time GI, since it simulates specular light bounces through reflections, and ambient light bounces through irradiance. The following image shows the result as displayed inside the level editor:

This image shows how reflections are projected inside the scene.

To do this, we first capture the scene from 6 angles using a bounding box as a capturing area. This area is used to later on reproject the reflecting rays so that we get parallax corrected cube maps when rendering. The result from the render is saved as a render target, and is then processed by CubeMapGen, which is now integrated into Nebula as an external library. Using a predefined set of settings, we generate the reflections and optionally, the irradiance; output it to the work folder and then assign them to the influence zones.

Simple stuff. But here comes the interesting part. As far as I have come across (this might be false), the common solution is to handle objects entering the reflection zone, which then get assigned a reflection map from which to calculate the reflections. Some solutions use the camera as a singular point of interest, and assign the reflections to every visible object when the camera enters the influence zone. We do it differently.

We had this conundrum where we visualized two different zones of irradiance separated by a sharp border, say a door. Inside the room the lighting is dark, and outside the room, in the corridor, you have strong lights in the ceiling. If an object would move between said areas, then the irradiance would be gradual as the object would cross between these zones. This can be accomplished in a pixel shader by simply submitting N reflections and cube maps, calculate the distance between pixel and cube map, and then apply a blending based on the distance.

We thought of another way in which we don’t have to do the work per object. A way that would let us draw an ‘infinite’ amount of reflections and irradiance per pixel, and without the overhead of having to calculate reflections on pixels we simply cannot see. Enter deferred decal projected reflections.

By using decals, we can project reflections and irradiance into the scene using an origin of reflection and an influence area. The decal blends the reflections and irradiance based on a distance field function (box or sphere), which allows other reflections to be blended in. Decals are then rendered as boxes into the scene like any other object, but with a special shader that respects roughness and specularity. Using this method, we avoid:

1. Reflecting unseen pixels.
2. Submitting and binding textures per each affected object.
3. Applying reflections on dynamic objects without doing collision checks.
4. Having an upper limit on reflection/irradiance affecting zones.
5. Popping.

We have some limitations however, namely:

1. Render decals ordered on layer.
2. Have to switch between box and sphere distance field functions without shader switch overhead (sorting is impossible since the layer dictates draw order).
3. Potentially switch back and forth between many different textures (if we can see many reflection simultaneously).
4. We don’t calculate reflection and store it in the G-buffer. The projector shader is somewhat complex and computational heavy, so any simplifications are welcome.

Our solution gives us the ability to project decals into the scene instead of having to apply them per object, meaning we won’t have any popping or appearing artifacts, and we get good scalability with the number of objects. I doubt this hasn’t been done before, and there are probably caveats with this method which are yet to be discovered.
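
For reference, the distance-field weighting described above boils down to something like the following (a simplified sketch of the idea in C++; the real projector shader of course evaluates this per pixel in GLSL, and the names are made up):

#include <algorithm>
#include <cmath>

// Blend weight for a sphere-shaped influence zone: 1 at the reflection origin,
// falling off to 0 at the influence radius. A box zone works the same way,
// just with a box distance function instead.
float SphereInfluenceWeight(const float pixelPos[3], const float origin[3], float radius, float falloff)
{
    float dx = pixelPos[0] - origin[0];
    float dy = pixelPos[1] - origin[1];
    float dz = pixelPos[2] - origin[2];
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);

    // normalized distance inside the zone, shaped by a falloff exponent
    float t = std::clamp(1.0f - dist / radius, 0.0f, 1.0f);
    return std::pow(t, falloff);
}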

Physically based lighting revisited

I’ve been working closely with one of our graphics artists to iron out the faults with the PBR rendering. We also thought it would be a good idea to include a way to utilize IBL as well, since it seems to be the way to go in modern lighting pipelines. Last time I just gathered my code from: http://www.massimpressionsprojects.com/dev/altdevblog/2011/08/23/shader-code-for-physically-based-lighting/ which more or less describes how to implement PBR in realtime, so go there if you wish to implement PBR in your engine. Instead, I thought I should explain how to combine the PBR with the newly implemented IBL techniques we use!

So to start off, for rendering with IBL you need two textures, or cube maps to be more precise. The first is an ordinary environment map, and is used for reflections. The other is called an irradiance map, and describes the light being radiated from the surrounding environment; an irradiance map can be generated from an environment map using for example ‘cubemapgen’ (https://code.google.com/p/cubemapgen/). The irradiance map is sampled differently from the environment map: whereas the reflections should be just that, a reflection on the surface, the irradiance is more like diffuse light. So to sample reflections and irradiance, we currently use this code (and I doubt it’s gonna be subject to change):

Geometry

	mat2x3 ret;

	// bring the view space normal to world space for the cube map lookups
	vec4 worldNorm = (InvView * vec4(viewSpaceNormal, 0));

	// reflect the view vector around the world space normal
	vec3 reflectVec = reflect(worldViewVec, worldNorm.xyz);

	// N dot V term used for the Fresnel rim below
	float x = dot(-viewSpaceNormal, normalize(viewSpacePos.xyz)) * MatFresnelDistance;
	vec3 rim = FresnelSchlickGloss(specularColor.rgb, x, roughness);

	// ret[1] = specular reflection, ret[0] = diffuse irradiance
	ret[1] = textureLod(EnvironmentMap, reflectVec, (1 - roughness) * EnvNumMips).rgb * rim;
	ret[0] = textureLod(IrradianceMap, worldNorm.xyz, 0).rgb;
	return ret;

I’m using a matrix here because for some reason subroutines cannot have two ‘out’ arguments and subroutines cannot return structs…

You might also notice that we input a view space normal, so we actually transform it into a world space normal first. That’s a bit expensive, but all of the lighting is done in view space, so this is probably going to be cheaper than performing the lighting transforms for each light. So, without further ado, let’s dive into what the code does!

	vec3 reflectVec = reflect(worldViewVec, worldNorm.xyz);

Calculate reflection vector by reflecting the view vector with the world normal, this will obviously give us the vector to use when sampling the reflections.

	float x = dot(-viewSpaceNormal, normalize(viewSpacePos.xyz)) * MatFresnelDistance;

This is basically the N dot V term we calculate for use in the Fresnel calculation on the row below.

	vec3 rim = FresnelSchlickGloss(specularColor.rgb, x, roughness);

Calculate Fresnel using a modified algorithm which takes into account the roughness of the surface. This function looks like this:

vec3
FresnelSchlickGloss(vec3 spec, float dotprod, float roughness)
{
	float base = 1.0 - saturate(dotprod);
	float exponent = pow(base, 5);
	return spec + (max(vec3(roughness), spec) - spec) * exponent;
}

By this point we have what we want! So we just sample our textures using the data!

	ret[1] = textureLod(EnvironmentMap, reflectVec, (1 - roughness) * EnvNumMips).rgb * rim;
	ret[0] = textureLod(IrradianceMap, worldNorm.xyz, 0).rgb;

Now we’re done with sampling the reflections. What is left now is to somehow get this into the rendering pipeline for further processing. We use the roughness to select the mipmap, where EnvNumMips denotes the number of mips present in this specific environment map.
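
For reference, EnvNumMips is just the mip count of the environment cube map, so the lod math amounts to something like this (a small sketch; how Nebula actually derives and feeds this value is not shown in the snippet above):

#include <cmath>

// Number of mip levels in a (power-of-two) cube map with the given face size.
int CubeMapMipCount(int faceSize)
{
    return static_cast<int>(std::floor(std::log2(static_cast<float>(faceSize)))) + 1;
}

// The lod used in textureLod above; with this convention a lower roughness
// value selects a blurrier mip.
float ReflectionLod(float roughness, int envNumMips)
{
    return (1.0f - roughness) * static_cast<float>(envNumMips);
}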

This is a typical fragment shader used in Nebula:

shader
void
psUber(in vec3 ViewSpacePos,
	in vec3 Tangent,
	in vec3 Normal,
	in vec3 Binormal,
	in vec2 UV,
	in vec3 WorldViewVec,
	[color0] out vec4 Albedo,
	[color1] out vec4 Normals,
	[color2] out float Depth,	
	[color3] out vec4 Specular,
	[color4] out vec4 Emissive) 
{
	Depth = calcDepth(ViewSpacePos);
	
	vec4 diffColor = texture(DiffuseMap, UV) * vec4(MatAlbedoIntensity, MatAlbedoIntensity, MatAlbedoIntensity, 1);
	float roughness = texture(RoughnessMap, UV).r * MatRoughnessIntensity;
	vec4 emsvColor = texture(EmissiveMap, UV) * MatEmissiveIntensity;
	vec4 specColor = texture(SpecularMap, UV) * MatSpecularIntensity;
	
	vec4 normals = texture(NormalMap, UV);
	vec3 bumpNormal = normalize(calcBump(Tangent, Binormal, Normal, normals));

	mat2x3 env = calcEnv(specColor, bumpNormal, ViewSpacePos, WorldViewVec, roughness);
	Specular = calcSpec(specColor.rgb, roughness);
	Albedo = calcColor(diffColor, vec4(1), AlphaBlendFactor) * (1 - Specular);	
	Emissive = vec4(env[0] * Albedo.rgb + env[1], 1) + emsvColor;
	
	Normals = PackViewSpaceNormal(bumpNormal);
}

Inputs and outputs

Here is another interesting detail. What is actually a specular map when using PBR? In our engine, we define it as just that: reflective color in RGB. To simplify authoring for our graphics artists, all textures can be adjusted using simple scalar values. The same actually goes for the Fresnel effect mentioned earlier, albeit it’s not actually physically correct. Anyway, we also use subroutines, so all functions named ‘calcXXX’ call some subroutine function. What we are interested in here is env, and what we feed to Emissive and Albedo. We can see that we cheat a bit with the albedo color by using 1 – Specular. This isn’t really energy conserving since it doesn’t take the Fresnel effect into account. To emissive, we simply do this:

	Emissive = vec4(env[0] * Albedo.rgb + env[1], 1) + emsvColor;

This calculation comes from the two previously calculated arguments, env[0] is the diffuse irradiance, and env[1] is the specular reflection. Later in the pipeline, when we have calculated the light we simply add this value to the total color value of said pixel. The next part will cover how lights are calculated using the above data.

Lights

	vec3 viewVec = normalize(ViewSpacePosition);
	vec3 H = normalize(GlobalLightDir.xyz - viewVec);
	float NL = saturate(dot(ViewSpaceNormal, GlobalLightDir.xyz));
	float NH = saturate(dot(ViewSpaceNormal, H));
	float NV = saturate(dot(ViewSpaceNormal, -viewVec));
	float HL = saturate(dot(H, GlobalLightDir.xyz));
	vec3 spec;
	BRDFLighting(NH, NL, NV, HL, specPower, specColor.rgb, spec);
	vec3 final = (albedoColor.rgb + spec) * diff;

So yes, this is the good old set of calculations normally required to do lighting. We calculate the H vector a bit differently than the usual LightDir + ViewVec, and this is because we actually have the view vector inverted, since viewVec is the vector FROM the camera to the point, so the formula becomes GlobalLightDir + -viewVec. The same must be applied when calculating the N dot V product, since it’s supposed to represent the angle between the normal and the view vector as seen from the point on the surface. This is important to consider, since without it the Fresnel attenuation will fail and you will get overly strong specular highlights. I was struggling with getting the specular right and it turned out that the solution was simple, so be sure that these dot products are both saturated and, well, correct! The function then outputs the result to an output parameter, in this case called spec. diff in the final calculation is the diffuse color of the light. Roughness is converted to specular power, but it’s a rather simple process; we simply do:

float specPower = exp(13 * roughness + 1);

To get it into a range of 2-8192 which allows us to get rather strong specular highlights if the roughness is low. Also note that specular power is only relevant to use when we feed it into the BRDF function, and not before. In the previous instances we actually just use the raw roughness value.

The pros and cons

PBR materials require way more authoring and obviously loads of knowledge on the artist’s side, albeit the results are loads better since the lighting doesn’t have to be built into the object itself for every scene. Basically, you will need at least 4 texture inputs per object:

Albedo – Color of the surface, or in lighting terms it’s the direct reflected color of a surface. Channels = RGBA (A is used for alpha blending/testing)

Specular/Reflectiveness – Color of the reflectivity. For most materials this is going to be a whiteish hue, but for some metals, for example gold or copper, the reflectivity is a goldish hue. Channels = RGB, each channel represents that color’s reflectivity.

Roughness/Glossiness – Value of surface roughness. For artists this is a value going between 0-255, or in a shader between 0-1, and it corresponds to the microsurface density. Depending on how you want it, it can be 1 for glossy, or 1 for rough. Channels = Red only

Normals – Self explanatory.

However, in order to get the values just right, we also provide a set of intensity sliders for each texture, which makes it simpler for an artist to tune the values by scaling the texture values with a simple multiplication. This ensures that roughness and specularity match the wanted values.

Why multidraw is inflexible.

I’ve been working more on the performance stuff in Nebula. In the current version, I added a function to AnyFX with which one can determine whether a uniform buffer should be synchronized or not. While this may result in corruption during rendering due to stomping data in transit, it also gives an enormous boost in performance. One easy way to work around the stomping is to increase the buffer size so that one can render at least one frame before needing to restart writing to the buffer. In the current state, we have all per-object data in a uniform buffer holding 8192 (a massive amount of) sub-buffers. This gives us the ability to issue (!) 8192 draws each frame without destroying the data, which should be sufficient. If one needs synchronizing buffers, one can just remove the ‘nosync’ qualifier from the varblock. Anyway.

In my quest I also checked out glMultiDraw*, which is nice in principle, but there are many things that bother me. To start with, using glMultiDraw, we are only really able to do slightly more flexible instanced rendering. Yes, there is the gl_DrawIDARB GLSL shader variable which lets us address variables in a shader based on which one of the multiple draws we are in, and yes, this is very simple to implement. But! And this is one major but. It is very non-intuitive to put every per-object variable in arrays. While we can define unsized arrays and use shader storage buffers, which are awesome, it requires the engine to, at its core, redefine how it sets variables. For example, some variables have to be set indexed in an array, and some may be set as individual uniforms. Well, we can’t really do any individual uniforms anymore, so everything has to be in arrays. This will cause some major confusion since both are syntactically allowed, yet their behavior is completely different. To illustrate, this is how most rendering pipelines work in my experience:

for each material
{
  ApplyMaterial()                         // apply material by setting shaders, tessellation patch size, subroutines
  ApplyMaterialConstants()                // shared variables like time, random value and whatnot
  for each visible model using material
  {
    ApplyModel()                          // set vertex buffer, index buffer
    for each visible model instance using material
    {
      ApplyObjectConstants()               // apply per-object variables, model transform, color etc
      Draw()
    }
  }
}

Which allows us to apply some variables for each draw. Now, if we use the multi draw principle, we will instead get this flow which is better in terms of overhead.

for each material
{
  ApplyMaterial()                         // apply material by setting shaders, tessellation patch size, subroutines
  ApplyMaterialConstants()                // shared variables like time, random value and whatnot
  for each visible model using material
  {
    ApplyModel()                          // set vertex buffer, index buffer
    for each visible model instance using material
    {
      ApplyIndexedObjectConstants()               // apply per-object variables, model transform, color etc
      AccumulateDrawData()                       // add draw data to buffer, i.e. base vertex, base index, number of instances etc.
    }
    MultiDraw()
  }
}
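
For reference, the AccumulateDrawData/MultiDraw pair could be backed by GL’s indirect draw structures, roughly like this (a sketch assuming indexed geometry and a GLEW-style loader; none of this is lifted from Nebula):

#include <GL/glew.h>
#include <vector>

// Layout GL expects for each indexed indirect draw command.
struct DrawElementsIndirectCommand
{
    GLuint count;           // number of indices
    GLuint instanceCount;   // usually 1 per model node instance
    GLuint firstIndex;      // base index into the index buffer
    GLuint baseVertex;      // base vertex offset
    GLuint baseInstance;    // can double as a per-draw object id
};

std::vector<DrawElementsIndirectCommand> commands;

void AccumulateDrawData(GLuint numIndices, GLuint firstIndex, GLuint baseVertex, GLuint objectIndex)
{
    commands.push_back({ numIndices, 1, firstIndex, baseVertex, objectIndex });
}

void MultiDraw(GLuint indirectBuffer)
{
    // upload the accumulated commands and issue them all in one call
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBufferData(GL_DRAW_INDIRECT_BUFFER,
                 commands.size() * sizeof(DrawElementsIndirectCommand),
                 commands.data(), GL_STREAM_DRAW);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr,
                                (GLsizei)commands.size(),
                                sizeof(DrawElementsIndirectCommand));
    commands.clear();
}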

Now, if we have a uniform which isn’t in an array (because a developer, graphics artist(!!!) or shader generation tool may have created a variable the ordinary way), then we will only get the variable from the last call in the list. It might be especially difficult for a shader generation tool to know whether to make an ordinary uniform or to group it into a shader block. Even though this is a user error, it still makes more sense to be able to use ordinary variables as before. The failsafe way to implement this in GLSL would be like this:

struct PerObject
{
  mat4 Model;
  mat4 InvModel;
  vec4 Diffuse;
  int ObjectId;
};

layout(std430) buffer PerObjectBuffer
{
  PerObject perObjectData[];
};

Then setting the variables in an indexed fashion. The problem here lies in the fact that the uniforms will not be visible to the user. Instead, we have to reflect this buffer in the engine code so that we have a matching buffer, and then update into that. The above code will not be as simple as setting the value in a variable, which is one of the major downsides. Another way of doing the same thing is like this:

layout(std430) buffer PerObjectBuffer
{
  mat4 Model[];
  mat4 InvModel[];
  vec4 Diffuse[];
  int ObjectId[];
};

This would be preferable if it were syntactically okay, but unfortunately it isn’t.

In that case one also needs to define each variable as an array to not get undefined behavior. Furthermore, making all variables into buffers of structs like in the first example also automatically removes the transparency of using a variable to set a value. Why, you may ask? Well, shader storage buffers can have many members, but only the last may be an unsized array. So we might expose buffer members as variables (just like with uniform blocks), but exposing a member which is a struct would basically only give us a member which is defined as a block of memory. Setting this block of memory from the engine then requires that we have a matching struct on the engine side which we send to the GPU, so we also have to serialize the struct types in order to reflect them on the engine side. Summarized, doing this:

variable->SetMatrix(this->modelTransform);

is much simpler to understand than

struct PerObject
{
  matrix44 Model;
  matrix44 InvModel;
  float4 Diffuse;
  int ObjectId;
} foo;
foo.Model = this->modelTransform;
variable->SetStruct(&foo);

Why? Well what if we need to set more than one of the members (for example Model, then InvModel at some later stage)? There is no way to apply this to a generic case! The thing we could do is to simulate a struct as a list of variables, by making:

struct PerObject
{
  mat4 Model;
  mat4 InvModel;
  vec4 Diffuse;
  int ObjectId;
};

layout(std430) buffer PerObjectBuffer
{
   int MaterialType;
   PerObject data[];   
};

Appear like this on the client end:

buffer PerObjectBuffer
{
   int MaterialType;
   mat4 Model[];
   mat4 InvModel[];
   vec4 Diffuse[];
   int ObjectId[];
};

By making the indexed variables here appear as if they were buffer members, we could use a clever device to offset them when we write to them (using SetVariableIndexed), while still having them exposed as ordinary variables. The only thing we have to consider is that the offset of a member increases by the size of the struct for each index, plus a constant offset for the non-indexed members and for the member’s own position within the struct. Confused? Well, let’s make an example. Say we are going to set Diffuse for index 0; the offset is then int + mat4 + mat4, that is, past the non-indexed MaterialType and past Model and InvModel inside the first struct. If we set it for index 1, the offset becomes int + (mat4 + mat4 + vec4 + int) * 1 + mat4 + mat4. The generic formula is (size of the non-indexed members) + (size of the struct) * index + (offset of the member within the struct). In this case, the non-indexed part is simply an int, while the struct is mat4 + mat4 + vec4 + int.
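
A tiny helper capturing that formula (the sizes would come from shader reflection rather than being hard-coded; the name is made up):

#include <cstddef>

// Byte offset of one member of an indexed struct inside a storage buffer laid
// out as { non-indexed members..., Struct data[] }.
std::size_t IndexedMemberOffset(std::size_t nonIndexedSize,  // everything before data[]
                                std::size_t structStride,    // size of one array element
                                std::size_t memberOffset,    // member's offset inside the struct
                                std::size_t index)           // which element we are writing
{
    return nonIndexedSize + structStride * index + memberOffset;
}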

This being said, OpenGL has a lot of features and is flexible and so on. The only problem is that it’s somewhat complicated to wrap if one wants to expose all of the advanced features. It should also be mentioned that there is a significant difference between uniforms and buffer members. Uniforms are unique, meaning there can be only one uniform per program with a given name. Buffer members are not uniforms, but rather structure members, meaning they are not unique. If we have this code for example:

struct PerObject
{
  mat4 Model;
  mat4 InvModel;
  vec4 Diffuse;
  int ObjectId;
};

layout(std430) buffer PerObjectBuffer
{
   int MaterialType;
   PerObject data[];   
};

layout(std430) buffer SomeOtherBuffer
{
   PerObject data[];   
};

Then PerObject will be used twice, meaning Model will appear twice in the program context. This completely ruins our previous attempts at making storage buffers transparent as variables. Conclusion: we need to change how the engine handles variables. It’s probably simpler to just use uniform buffers and have the members as arrays, since that is already implemented and won’t be weird like this. We can then implement shader storage buffers as simple structures to/from which we can read/write data when using, for example, compute shaders.
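
To make the struct-mirroring concrete, reflecting the PerObject buffer on the engine side would boil down to something like this (a sketch with made-up names, assuming a GLEW-style loader; the padding reflects std430 rounding the struct up to a 16-byte stride, and none of this is actual AnyFX code):

#include <GL/glew.h>
#include <vector>

// Engine-side mirror of the GLSL PerObject struct. std430 aligns the struct to
// 16 bytes because of the vec4 member, so ObjectId is padded up to that stride.
struct PerObjectCPU
{
    float Model[16];
    float InvModel[16];
    float Diffuse[4];
    int   ObjectId;
    int   _pad[3];       // keeps sizeof(PerObjectCPU) equal to the std430 array stride
};
static_assert(sizeof(PerObjectCPU) == 160, "must match the std430 array stride");

// Re-upload the whole array and bind it to the storage buffer binding point.
void UploadPerObjectData(GLuint ssbo, GLuint bindingPoint, const std::vector<PerObjectCPU>& data)
{
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER,
                 data.size() * sizeof(PerObjectCPU), data.data(), GL_DYNAMIC_DRAW);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, bindingPoint, ssbo);
}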

Coherently persistent

OpenGL as well as DirectX have moved from sending data to the GPU as single values to sending values in a buffer. In DirectX 10+, this is forced on the user, and it’s flexible but also somewhat confusing. The idea is to have a block of memory which can be retained between executions, which I will admit is rather clever. Taking this into consideration, using glUniform* will simply send data which is only valid during the execution of the current program. As soon as the current program is switched, the set of uniform variables is cleared and must be assigned again. However, in OpenGL 3.1, another method was introduced in parallel to traditional uniforms, called uniform buffers. This is identical to the method seen in DirectX; however, the performance of uniform buffers is abysmal on most drivers. To be honest, most types of buffer updates which require the use of glBufferData or glBufferSubData are somewhat slow, even if we choose to orphan the current buffer using glInvalidateBufferData and use some multi-buffering method. The main reason is that data has to be flushed to the GPU whenever we make one of these calls, which not only means a lot of interaction with the driver, but also a need to synchronize.

Something very new and very cool in OpenGL is the ability to persistently map a buffer to CPU memory, and have the GL push the data to the GPU when it’s required. This basically allows us to let the driver decide when to synchronize the data. Pretty awesome, since this allows us to effectively queue draw calls. However, in order to avoid stomping the data in flight, i.e. writing to a part of memory which is not yet pushed and used, or which is currently being transferred, we must make sure to wait for that fragment of data to be consumed. This has been thoroughly discussed and shown to many extents, but just for extra clarity, I will explain how it is implemented in Nebula.

For extra clarity, this is also how AnyFX handles variables and OpenGL uniform blocks and uniform buffers.

// get alignment
GLint alignment;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &alignment);

// calculate aligned size
this->alignedSize = (bufferSize + alignment - 1) - (bufferSize + alignment - 1) % alignment;

// setup
glGenBuffers(1, this->buffers);
glBindBuffer(GL_UNIFORM_BUFFER, this->buffers[0]);
GLenum flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_UNIFORM_BUFFER, this->alignedSize * this->numBackingBuffers, NULL, flags | GL_DYNAMIC_STORAGE_BIT);
this->glBuffer = (GLchar*)glMapBufferRange(GL_UNIFORM_BUFFER, 0, this->alignedSize * this->numBackingBuffers, flags);

The magic here is of course the new fancy GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT as well as glBufferStorage and glMapBufferRange. glBufferStorage gives us the opportunity to tell the GL ‘give me an immutable buffer with given size’. Since it’s immutable, we can’t change its size, which of course is possible with glBufferData. It’s also vitally important to make sure we align the buffer size to be in multiples of the alignment size. Otherwise, glMapBufferRange will return a null pointer and invoke an invalid operation.

AnyFX makes sure that every shader which somehow includes the same uniform buffer also uses the same backend, so we can basically share this buffer among all shader programs, which is nice.

Then, whenever we set a variable, we get this:

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::SetVariable(InternalEffectVariable* var, void* value)
{    
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, var->byteSize);
    this->isDirty = true;
}

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::SetVariableArray(InternalEffectVariable* var, void* value, size_t size)
{
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, size);
    this->isDirty = true;
}

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::SetVariableIndexed(InternalEffectVariable* var, void* value, unsigned i)
{
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset + i * var->byteSize);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, var->byteSize);
    this->isDirty = true;
}

The thing to note here is that data is a pointer into a buffer which is coherent between GPU and CPU. Basically, we just calculate an offset into the buffer and copy the data to that offset. glBufferOffset here is the byte offset into the buffer to which we are currently writing. The functions LockBuffer and UnlockBuffer look like this:

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::LockBuffer()
{
    if (this->syncs[*this->ringIndex] == 0)
    {
        this->syncs[*this->ringIndex] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, NULL);

        // traverse to next buffer
        *this->ringIndex = (*this->ringIndex + 1) % this->numBackingBuffers;
        *this->glBufferOffset = *this->ringIndex * this->alignedSize;
    }    
}

//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::UnlockBuffer()
{
    // wait for sync
    if (this->syncs[*this->ringIndex] != 0)
    {
        GLbitfield waitFlags = 0;
        GLuint64 waitDuration = 0;
        do
        {
            GLenum result = glClientWaitSync(this->syncs[*this->ringIndex], waitFlags, waitDuration);
            if (result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED) break;

            waitFlags = GL_SYNC_FLUSH_COMMANDS_BIT;
            waitDuration = 1000000000;
        } 
        while (true);   

        glDeleteSync(this->syncs[*this->ringIndex]);
        this->syncs[*this->ringIndex] = 0;
    }
}

Ring index is an increasing number which corresponds to the current segment of the multi-buffered backing storage we should work with. Basically, we have to make sure that the range in the buffer is blocked, so that we don’t modify the data before the GL has had time to use it. Also note that we only advance to the next segment when we lock the buffer. This allows us to fill the consecutive segment of the buffer without waiting.

When we want to perform the draw, we do this just before the draw call:

 
//------------------------------------------------------------------------------
/**
*/
void 
GLSL4EffectVarblock::Commit()
{
    // bind buffer at the current position
    glBindBufferRange(GL_UNIFORM_BUFFER, this->uniformBlockBinding, this->buffers[0], *this->ringIndex * this->alignedSize, this->alignedSize);
}

glBindBufferRange is really fast, so we suffer almost no overhead doing this.

So in which sequence does all of this happen? OpenGL defines that coherently mapped buffers are synced whenever a draw call is executed, which is exactly what we must wait for in order to avoid data corruption. This basically means that we have to perform the LockBuffer() function after we do a draw call. So, in essence, the shading system must be prepared for when a variable is set, just before a draw call is performed, and after a draw call is done. So basically:

// Wait for current segment to be available
for each (buffer in program)
{
    UnlockBuffer(); // glClientWaitSync() + glDeleteSync()
    SetVariable();
    SetVariable();
    SetVariable();
    SetVariable();
    SetVariable();
    glBindBufferRange();
}
glDraw*();
for each (buffer in program)
{
    LockBuffer();   // glFenceSync() + Increase buffer offset to next buffer
}

This is nice, because we only lock and unlock the buffer if something has changed. If nothing is different, we just bind the buffer range and draw as normal. This also allows us to keep variables outside uniform buffers if we want. This could be useful if we have variables which are not shared, or which are already applied once (per pass variables for example).

Now for per-frame buffers, we might want to lock and unlock manually, since per-frame variables don’t have to wait until the first upcoming draw call is done, but rather until the next frame. This is where this->manualLocking comes into play. With manual locking, we can decide if a uniform buffer should be locked explicitly. In Nebula, we do this:

    const Ptr<ShaderInstance>& shdInst = this->GetSharedShader();
    AnyFX::EffectVarblock* block = shdInst->GetAnyFXEffect()->GetVarblockByName("PerFrame");
    RenderDevice::Instance()->SetPerFrameBlock(block);

Then on BeginFrame in OGL4RenderDevice:

    // unlock per-frame buffer
    this->perFrameBlock->UnlockBuffer();

Lastly on EndFrame in OGL4RenderDevice:

    // lock per-frame buffer
    this->perFrameBlock->LockBuffer();

Remember that since blocks can be shared, we can use the good old shared shader which also contains all uniform buffers shared by other shaders.

AnyFX has been extended to allow for selecting how big the buffer should be, in multiples. Example:

shared buffers=2048 varblock PerObject
{
	mat4 Model;
	mat4 InvModel;
	int ObjectId;
};

This means we have a uniform buffer (or constant buffer in DX) which is backed 2048 times 😀 . This might seem excessive, but it’s not really that bad considering we have ONE of these for the entire application (using the shared qualifier), and it only amounts to 270 kB. It allows us to perform some 2048 draw calls before we actually have to wait to render a new object, which is nice.
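
Back-of-the-envelope for that number (ignoring any padding the block layout may add after ObjectId):

#include <cstddef>

constexpr std::size_t mat4Size   = 16 * sizeof(float);              // 64 bytes
constexpr std::size_t blockSize  = 2 * mat4Size + sizeof(int);      // Model + InvModel + ObjectId = 132 bytes
constexpr std::size_t totalBytes = blockSize * 2048;                // 270336 bytes, roughly 270 kB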

The performance increase from this is tremendous. It’s so fast, in fact, that if I don’t call glFinish() each frame to synchronize with the CPU, the view matrix doesn’t keep up with the current frame. However, this proved to be bad for performance, since some calls may take A LOT of time for the GPU to finish, like huge geometry and instanced batches. Syncing will effectively stop us from putting new commands in the GL command queue (because we are waiting for the GPU to finish everything), and the command queue should never stop being fed if one is aiming for performance.

All modules which stream data to the GL have been implemented to use this method, including the text renderer and the shape renderer for spontaneous geometry. Next up is to convert the particle system to utilize this feature too. After that I think we will look into bindless textures.

We are still very much alive.

Subroutines and suboptimal solutions

I managed to remove the render thread, getting rid of the client/server-side object structure which was vital in order to keep the rendering in its own thread. What happened then was that we gained a significant boost in performance, probably because the overhead required for syncing was greater than the performance we gained from having the rendering in its own thread.

I’ve then been investigating how to extend AnyFX with support for some of the stuff I left out in version 1.0, namely shader storage buffers (or RWBuffer in DirectX) and dynamically linked shader functions. The work is currently ongoing; however, the syntax for the new buffers and the dynamically linked functions is already in place. Behold!

// declare subroutine 'interface'
prototype vec4 colorMethod(vec2 UV);

// declare implementation of colorMethod which produces static color
subroutine (colorMethod) vec4 staticColorMethod(vec2 UV)
{
   return vec4(1);
}

// declare implementation of colorMethod which produces textured color
subroutine (colorMethod) vec4 texturedColorMethod(vec2 UV)
{
    return texture(SomeSampler, UV);
}

colorMethod dynamicColorMethodVariable;

The dynamicColorMethodVariable then works as a special variable, meaning there is no way to change it using the AnyFX API. The syntax for defining a program previously looked something like this:

program SomeProgram
{
    vs = SomeVertexShader();
}

However, the shader binding syntax now accepts arguments to the shader in the form of subroutine bindings. For example:

program SomeProgram
{
    vs = SomeVertexShader(dynamicColorMethodVariable = texturedColorMethod);
}

This would bind the vertex shader SomeVertexShader and bind the dynamicColorMethodVariable subroutine variable to be the one that uses a texture. This allows us to create programs which are just marginally different from other programs, and allows us to perform an ‘incremental’ change of the program state, compared to exchanging the whole program object each time we want some variation. The only problem is that Nebula doesn’t really have any concept of knowing whether an incremental shader change is possible or not.
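
Under the hood this maps to GL’s subroutine API, which is roughly what AnyFX would have to emit at bind time (a sketch with made-up names, assuming a GLEW-style loader and that the stage has this single subroutine uniform; not the actual generated code):

#include <GL/glew.h>

// Select 'texturedColorMethod' as the implementation behind
// 'dynamicColorMethodVariable' for the vertex stage of the given program.
void BindTexturedColorMethod(GLuint program)
{
    GLuint index = glGetSubroutineIndex(program, GL_VERTEX_SHADER, "texturedColorMethod");

    // glUniformSubroutinesuiv expects one index per active subroutine uniform
    // location; with a single subroutine uniform its location is 0.
    GLuint indices[1] = { index };
    glUniformSubroutinesuiv(GL_VERTEX_SHADER, 1, indices);
}

Note that GL resets subroutine uniform state every time a program is bound, so these indices have to be re-applied after each glUseProgram, which is exactly the kind of ‘incremental’ apply described above.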

So here comes yet another interesting concept: what if we were to sort materials (which are already sorted based on batch) by variation? Consider the following illustration:

FlatGeometryLit -
                |--- Static
                |--- VertexColor
                |--- Multilayered
                |--- LightmappedLit
                |--- Shiny
                |--- Organic
                |--- Animated
                |--- Skinned
                |--- Skin
                |--- SkinnedShiny
                |--- Foliage

This is the order in which they will be rendered; however, they are all opaque geometry, so they might render in any order within this list. The change between, let’s say, Static, Shiny and Animated is actually not that big, just a couple of lines of shader code. There is no linkage difference between the shaders, and as such they can use the same shader, but with different subroutine sets! If we were to sort this list based on ‘change’, we would probably end up with something like this:

FlatGeometryLit -
                |--- Static
                |--- Shiny
                |--- Animated
                |--- Foliage
                |--- Organic
                |--- VertexColor
                |--- Multilayered
                |--- LightmappedLit
                |--- Skinned
                |--- Skin
                |--- SkinnedShiny

This is because most of these shaders share the same number of vertex shader inputs, or pixel shader outputs. However, if we simply implement the shaders to have equivalent functions, then AnyFX could figure out which programs are duplicates of others, and then simply tell us which material should actually apply its program, and which materials are subdominant and thus only require an incremental update. What we will end up with is a sorted list of materials, where the first ‘unique’ material will be dominant, and the others will be incremental. The list would look like this:

FlatGeometryLit -
                |--- Static         -- dominant
                |--- Shiny          -- incremental
                |--- Animated       -- incremental
                |--- Foliage        -- incremental
                |--- Organic        -- incremental
                |--- VertexColor    -- dominant  (introduces vertex colors in vertex layout, cannot be subroutined)
                |--- Multilayered   -- incremental
                |--- LightmappedLit -- dominant  (introduces secondary UV set in vertex layout, cannot be subroutined)
                |--- Skinned        -- dominant  (introduces skin weights and joint indices in vertex layout, cannot be subroutined)
                |--- SkinnedShiny   -- incremental
                |--- Skin           -- incremental

As we can see here, every time we encounter a recessive material, we can simply perform a smaller update rather than set the entire shader program, which will probably spare us some performance if we have lots of variation in materials. This table only shows the base materials for a specific batch; however, the algorithm would sort all batches in this manner in order to reduce the pipeline’s API-heavy calls. This is probably not a performance issue right now, seeing as we have a rather small set of materials per batch type. However, consider a game with lots of custom-made shader derivatives: currently, these derivatives would more or less have a copy of the shader code of some base shader, and then apply the whole program prior to each group of objects with that shader.
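
In pseudo-C++, the per-batch loop would then look something like this (illustrative only; the real material class in Nebula obviously differs):

#include <vector>

// Minimal stand-in for a material, just to show the decision.
struct Material
{
    bool dominant = false;

    void ApplyProgram()     { /* glUseProgram + full pipeline state */ }
    void ApplySubroutines() { /* glUniformSubroutinesuiv for the few differing functions */ }
    void RenderObjects()    { /* draw all visible instances using this material */ }
};

// Walk one batch in its sorted order; only dominant materials pay for a full
// program switch, the recessive ones just patch their subroutine bindings.
void RenderBatch(std::vector<Material*>& sortedMaterials)
{
    for (Material* material : sortedMaterials)
    {
        if (material->dominant)
            material->ApplyProgram();
        else
            material->ApplySubroutines();

        material->RenderObjects();
    }
}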

The next thing to tackle on the list is getting shader storage blocks working. The syntax for these babies is also already defined, but only implemented as stubs. The shader storage block counterpart in AnyFX is called varbuffer. As opposed to a varblock, a varbuffer allows for application control of the internal buffer storage, meaning we can retrieve its handle and read/write data from it as we please. We can also attach the buffer to some other part of the pipeline, which requires information that resides outside the scope of AnyFX. Also, varbuffers support member arrays of undetermined size! As such, a varbuffer will have some way of allocating a buffer with a dimension set from the application side. Consider the following:

struct ObjectVariables
{
   float MatEmissiveIntensity;
   float MatSpecularIntensity;
   mat4 Model;
};

varbuffer SomeBuffer
{
   ObjectVariables vars[];
};

This creates a buffer which contains variables per object rendered. We can then, from the AnyFX API, tell the varbuffer to allocate a backend buffer of a given size, which can then be used to, for example, perform bulk rendering with full per-object variation using glMultiDraw*. The only issue with this is that AnyFX usually handles variables as objects which one can retrieve and simply set, but in this case a variable would live inside a struct of an array type, and is thus not something which is publicly available. However, we can solve the same problem using the already existing varblock syntax with just a set of arrays of variables with a fixed size. The downside is that shader storage blocks (varbuffer) have a much bigger minimum implementation size, 16 MB, compared to the 16 kB defined for uniform blocks (varblock), meaning we cannot fit as much data per multi draw in a varblock as we can in a varbuffer.

This is totally worth looking into, seeing as it would (probably) enable a much faster execution rate of draw calls, since we could pack just about every single object with the same shader in the scene into one single glMultiDraw*. However, it will probably not work with the current approach of using a variable to set a value in, but will need some code which gathers up a bunch of objects and their variables, packs them into a buffer, and then renders everything. More on that when the subroutines are working!

// Gustav

Oh render thread, why art thou so hard to reach?

Recently, I’ve discovered some issues regarding unstable frame rates, as well as a couple of glitches related to the render thread. One of the major issues with having ALL the rendering in a fat separate thread is that all communication to and from the render thread has to go through the client-server side ‘interface’, meaning that a ModelEntity communicates with its internal counterpart InternalModelEntity to pass information back and forth. This method works really well, since the ModelEntity and the InternalModelEntity never have to be in perfect sync.

However, we have encountered problems where we actually want something from the render thread RIGHT NOW, meaning we block the game thread to wait for render thread data. Take this example code:

Ptr<Graphics::ItemAtPosition> msg = Graphics::ItemAtPosition::Create();
msg->SetPosition(mousePos);
Graphics::GraphicsInterface::Instance()->SendWait(msg.upcast<Messaging::Message>());

// get by id
const Ptr<Game::Entity>& entity = BaseGameFeature::EntityManager::Instance()->GetEntityByUniqueId(msg->GetItem());

This requires the game thread to basically wait for the message to be handled by the render thread before we can continue. Well, this is good and all, since the render thread executes in parallel; however, this cannot be done while we are running the game. This is because, in order to avoid several synchronizations with the render thread during the same game frame, the application goes into a lockstep mode, meaning the game thread basically waits for the render thread to finish, then performs a rendezvous and synchronizes. This means both threads must arrive at the same ‘position’ for a sync to take place, meaning that we cannot possibly lock either thread during the lockstep phase! So, in a pure catch-22 fashion, we cannot do the above code if we are in lockstep, and if we’re not in lockstep we will get random synchronizations with the render thread which screw up our main thread timings.

Now this is just the most recent problem; we’ve continuously had problems with the render thread, so we thought, hey, why not rethink the whole fat render thread concept?! The idea is to only make low-level OpenGL/DX calls on separate thread(s) and have all the previously internal render stuff straight on the main thread. Since Nebula already makes nice use of jobs to handle much of the computationally heavy pre-render processing such as culling and skeletal animations, we shouldn’t really lose that much performance (I hope). Also, if we can utilize multiple threads to execute draw calls and such, we should be in really good shape, since the CPU-heavy API calls will run in a thread which is not related to the main loop.

If we take a look at SUI, we have sort of the same problem. The way we communicate with SUI, which lives on the render thread, is by passing messages to it from the main thread, which is very clunky and greatly inhibits control. This will also be solved if rendering no longer resides entirely within a separate fat thread.

As an addition to this, I am planning on looking into the new stuff posted at GDC to reduce OpenGL drawcall overhead. Explanation of this can be found here: http://blogs.nvidia.com/blog/2014/03/20/opengl-gdc2014/

Basically, they explain that the Mantle counterpart of OpenGL already exists, although it’s not widely known. They explain how GPU buffers can be persistently mapped to CPU memory and then synchronized once per frame, instead of the usual glMap/glUnmap which forces a synchronization each time. They also explain how to draw using indirect drawing, and how per-object variables can be put into buffers instead of being set with the glUniform syntax. Basically, you create a uniform buffer or shader storage buffer which contains variables for each object, update it, and then just fetch the variables on a per-object basis. This allows you to buffer ONCE per frame, and then simply tell the GPU to render. If you want to get really fancy, you can even create a draw command buffer much like you do with uniforms, and just tell the GL to render X objects with this setup. We can then create worker threads, one for uniform updates and one for drawing, and then just call glMultiDrawElementsIndirect from the main thread to execute the batch.

So basically, instead of having the thread border somewhere in the middle between the game and render parts, we instead push the thread border to the absolute limit, meaning we have light threads on the backend which just do the CPU-heavy API calls. It should be fairly easy to see if the threading offloads some main thread work and thus gives us a performance boost, or if the overhead of syncing with the threads takes more time than it gives back.

// Gustav

Fun with the content tools and other updates

We have been busy lately as our game projects will be starting up soon. General updates have been a bit slow, but here is a short recap of what has happened in the last few months:

We implemented full Linux support for both Nebula and the toolchain. There are some minor issues with the math library, which is currently based on Bullet’s vectormath and some of our own code; I think some matrix functions are a bit off, but other than that everything works as expected. That obviously means that the OpenGL4 renderer has become the primary render engine now. Some other libraries have been integrated as well, for example Havok (Windows only due to licensing limitations) and Recast & Detour.

Apart from that there are obviously piles of fixes everywhere, but mostly to the toolchain, namely the level editor and the content browser.

Samuel has had some fun with the tools and took some screencasts while experimenting.