Coherently persistent

Both OpenGL and DirectX have moved from sending data to the GPU as single values to sending values in a buffer. In DirectX 10+ this is forced on the user; it’s flexible but also somewhat confusing. The idea is to have a block of memory which can be retained between executions, which I will admit is rather clever. By contrast, a value set with glUniform* belongs to a single program object, so a value that several programs share must be assigned again for each of them. In OpenGL 3.1, another method was introduced in parallel to traditional uniforms, called uniform buffers. This mirrors the method seen in DirectX, however the performance of uniform buffers is abysmal on most drivers. To be honest, most types of buffer updates which require glBufferData or glBufferSubData are somewhat slow, even if we choose to orphan the current buffer using glInvalidateBufferData and use some multi-buffering method. The main reason is that data has to be flushed to the GPU whenever we make one of these calls, which not only means a lot of interaction with the driver, but also a need to synchronize.

Something very new and very cool with OpenGL is the power to persistently map a buffer to CPU memory, and have the GL push the data to the GPU when it’s required. This basically allows us to let the driver decide when to synchronize the data. Pretty awesome, since this allows us to effectively queue draw calls. However, in order to avoid stomping the data in flight, i.e. writing to a part of memory which has not yet been pushed and used, or which is currently being transferred, we must wait until the GL is done with that part of the buffer. This has been thoroughly discussed and demonstrated elsewhere, but for extra clarity, I will explain how it is implemented in Nebula.

For extra clarity, this is also how AnyFX handles variables and OpenGL uniform blocks and uniform buffers.

// get alignment
GLint alignment;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &alignment);

// calculate aligned size
this->alignedSize = (bufferSize + alignment - 1) - (bufferSize + alignment - 1) % alignment;

// setup
glGenBuffers(1, this->buffers);
glBindBuffer(GL_UNIFORM_BUFFER, this->buffers[0]);
GLenum flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_UNIFORM_BUFFER, this->alignedSize * this->numBackingBuffers, NULL, flags | GL_DYNAMIC_STORAGE_BIT);
this->glBuffer = (GLchar*)glMapBufferRange(GL_UNIFORM_BUFFER, 0, this->alignedSize * this->numBackingBuffers, flags);

The magic here is of course the new fancy GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT as well as glBufferStorage and glMapBufferRange. glBufferStorage gives us the opportunity to tell the GL ‘give me an immutable buffer with the given size’. Since the storage is immutable, we can’t change its size afterwards, which of course is possible with glBufferData. It’s also vitally important to round the size of each segment up to a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. Otherwise, the offsets of the individual segments will not be aligned, and glBindBufferRange generates an error for an unaligned uniform buffer offset instead of binding the range.
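To make the rounding concrete, here is a minimal sketch of the same arithmetic outside of any GL context. AlignUp and SegmentOffset are illustrative names, not part of AnyFX, and the alignment of 256 used in the comments is only an assumed value; the real one is whatever glGetIntegerv reports on the driver at hand.

```cpp
#include <cassert>
#include <cstddef>

// Round size up to the next multiple of alignment (hypothetical helper,
// arithmetically equivalent to the alignedSize expression in the listing).
inline std::size_t AlignUp(std::size_t size, std::size_t alignment)
{
    assert(alignment > 0);
    return ((size + alignment - 1) / alignment) * alignment;
}

// Byte offset of ring segment `index` inside the backing buffer. Because
// alignedSize is a multiple of the alignment, so is every segment offset.
inline std::size_t SegmentOffset(std::size_t index, std::size_t alignedSize)
{
    return index * alignedSize;
}

// Example values, assuming an alignment of 256 bytes:
//   AlignUp(132, 256)                    -> 256
//   AlignUp(257, 256)                    -> 512
//   SegmentOffset(3, AlignUp(132, 256))  -> 768
```

With an unrounded segment size of 132 bytes, the second segment would start at byte 132, which is not a multiple of 256 and could not be bound as a uniform buffer range under that assumed alignment.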

AnyFX makes sure that every shader which somehow includes the same uniform buffer also uses the same backend, so we can basically share this buffer among all shader programs, which is nice.

Then, whenever we set a variable, we get this:

//------------------------------------------------------------------------------
/**
    Copy a single value into the current segment of the mapped buffer.
*/
void 
GLSL4EffectVarblock::SetVariable(InternalEffectVariable* var, void* value)
{    
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, var->byteSize);
    this->isDirty = true;
}

//------------------------------------------------------------------------------
/**
    Copy size bytes of array data into the current segment of the mapped buffer.
*/
void 
GLSL4EffectVarblock::SetVariableArray(InternalEffectVariable* var, void* value, size_t size)
{
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, size);
    this->isDirty = true;
}

//------------------------------------------------------------------------------
/**
    Copy one element into slot i of an array variable in the current segment.
*/
void 
GLSL4EffectVarblock::SetVariableIndexed(InternalEffectVariable* var, void* value, unsigned i)
{
    char* data = (this->glBuffer + *this->glBufferOffset + var->byteOffset + i * var->byteSize);
    if (!this->manualLocking) this->UnlockBuffer();
    memcpy(data, value, var->byteSize);
    this->isDirty = true;
}
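The offset arithmetic in these setters can be tried out without a GL context by letting ordinary CPU memory stand in for the mapped pointer. The sketch below does that; VariableLayout, WriteIndexed and DemoWriteAndReadBack are hypothetical names invented for the illustration, and the segment size of 256 bytes and the variable offset of 32 bytes are arbitrary assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for InternalEffectVariable: where a variable lives
// inside one segment, and how large a single element of it is.
struct VariableLayout
{
    std::size_t byteOffset;
    std::size_t byteSize;
};

// Copy element `index` of a variable into the segment that starts at
// segmentOffset. `mapped` plays the role of the persistently mapped pointer.
inline void WriteIndexed(char* mapped, std::size_t segmentOffset,
                         const VariableLayout& var, const void* value, std::size_t index)
{
    assert(mapped != nullptr);
    char* dst = mapped + segmentOffset + var.byteOffset + index * var.byteSize;
    std::memcpy(dst, value, var.byteSize);
}

// Two 256-byte segments in plain memory. Write element 1 of a vec4 array
// that starts 32 bytes into segment 1, then read the same bytes back.
inline float DemoWriteAndReadBack()
{
    const std::size_t segmentSize = 256;
    std::vector<char> backing(2 * segmentSize, 0);
    const VariableLayout colors{32, 4 * sizeof(float)};
    const float value[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    WriteIndexed(backing.data(), 1 * segmentSize, colors, value, 1);

    float readBack[4] = {};
    std::memcpy(readBack, backing.data() + segmentSize + 32 + 4 * sizeof(float), sizeof(readBack));
    return readBack[2]; // third component of the element just written
}
```

In the real code the destination is GPU-visible memory, so the same memcpy is all that is needed to update a uniform; no glBufferSubData call follows it.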

The thing to note here is that data points into a buffer which is coherent between GPU and CPU. Basically, we just calculate offsets into the buffer and copy the data into the buffer at that offset. glBufferOffset here is the byte offset of the segment to which we are currently writing. The functions LockBuffer and UnlockBuffer look like this:

//------------------------------------------------------------------------------
/**
    Fence the segment that was just used by a draw call and move on to the next one.
*/
void 
GLSL4EffectVarblock::LockBuffer()
{
    if (this->syncs[*this->ringIndex] == 0)
    {
        // the flags argument of glFenceSync must be zero
        this->syncs[*this->ringIndex] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

        // traverse to next buffer
        *this->ringIndex = (*this->ringIndex + 1) % this->numBackingBuffers;
        *this->glBufferOffset = *this->ringIndex * this->alignedSize;
    }    
}

//------------------------------------------------------------------------------
/**
    Wait until the GL is done with the current segment, so that it is safe to write to it.
*/
void 
GLSL4EffectVarblock::UnlockBuffer()
{
    // wait for sync
    if (this->syncs[*this->ringIndex] != 0)
    {
        GLbitfield waitFlags = 0;
        GLuint64 waitDuration = 0;
        do
        {
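            GLenum result = glClientWaitSync(this->syncs[*this->ringIndex], waitFlags, waitDuration);
            if (result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED) break;
            if (result == GL_WAIT_FAILED) break; // error in the GL; looping again would never end

            // the first pass only polls; later passes flush and wait up to one second
            waitFlags = GL_SYNC_FLUSH_COMMANDS_BIT;
            waitDuration = 1000000000;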
        } 
        while (true);   

        glDeleteSync(this->syncs[*this->ringIndex]);
        this->syncs[*this->ringIndex] = 0;
    }
}

Ring index is the index of the segment of the multi-buffered backing storage we are currently working with; it increases with every lock and wraps around once it reaches the number of backing buffers. Basically, we have to make sure that a range in the buffer stays locked while it is in use, so that we don’t modify the data before the GL has had time to use it. Also note that we only move on to the next segment when we lock the buffer. This allows us to fill the next segment without waiting, as long as the GL is already done with it.
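How many segments are enough depends on how far the GPU runs behind the CPU. The following CPU-only toy model gives a feel for it. It is not how AnyFX is implemented: a fence is reduced to the number of the draw call that last read a segment, the GPU is assumed to lag a fixed number of draws behind, and RingModel and CountStalls are names made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of the ring. fences[i] holds the number of the draw call that
// last read segment i, or 0 if the segment has no pending fence.
class RingModel
{
public:
    explicit RingModel(std::size_t numSegments) : fences(numSegments, 0), ringIndex(0) {}

    // Mirrors LockBuffer(): fence the segment the draw just used, then advance.
    void Lock(std::uint64_t drawNumber)
    {
        assert(drawNumber > 0); // 0 is reserved for "no fence"
        fences[ringIndex] = drawNumber;
        ringIndex = (ringIndex + 1) % fences.size();
    }

    // Mirrors UnlockBuffer(): returns true if the CPU would have had to wait,
    // i.e. the fake GPU has not yet finished the draw that fenced this segment.
    bool Unlock(std::uint64_t gpuCompletedDraws)
    {
        const bool mustWait = fences[ringIndex] > gpuCompletedDraws;
        fences[ringIndex] = 0; // the real code deletes the sync object here
        return mustWait;
    }

private:
    std::vector<std::uint64_t> fences;
    std::size_t ringIndex;
};

// Issue `draws` draw calls while the fake GPU stays `gpuLag` draws behind the
// CPU, and count how often Unlock() would have stalled.
inline std::size_t CountStalls(std::size_t numSegments, std::uint64_t draws, std::uint64_t gpuLag)
{
    RingModel ring(numSegments);
    std::size_t stalls = 0;
    for (std::uint64_t draw = 1; draw <= draws; ++draw)
    {
        const std::uint64_t submitted = draw - 1;
        const std::uint64_t completed = submitted > gpuLag ? submitted - gpuLag : 0;
        if (ring.Unlock(completed)) ++stalls;
        ring.Lock(draw);
    }
    return stalls;
}
```

In this model, four segments with a GPU that is two draws behind never stall, while two segments with a GPU that is four draws behind stall on every draw from the third one on. The takeaway is only qualitative: the ring should hold more segments than the number of draws the GPU can be behind.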

When we want to perform the draw, we do this just before the draw call:

 
//------------------------------------------------------------------------------
/**
    Bind the current segment of the buffer to the uniform block binding point.
*/
void 
GLSL4EffectVarblock::Commit()
{
    // bind buffer at the current position
    glBindBufferRange(GL_UNIFORM_BUFFER, this->uniformBlockBinding, this->buffers[0], *this->ringIndex * this->alignedSize, this->alignedSize);
}

glBindBufferRange is really fast, so we suffer almost no overhead doing this.

So in which sequence does all of this happen? With a coherent mapping, whatever the CPU has written is visible to the GL commands issued afterwards, so the draw call is the point at which a segment gets consumed, and that draw call is exactly what we must wait for in order to avoid data corruption. This basically means that we have to call LockBuffer() after we do a draw call. So, in essence, the shading system has to act at three points: when a variable is set, just before a draw call is performed, and after a draw call has been issued. So basically:

// Wait for current segment to be available
for each (buffer in program)
{
    UnlockBuffer(); // glClientWaitSync() + glDeleteSync()
    SetVariable();
    SetVariable();
    SetVariable();
    SetVariable();
    SetVariable();
    glBindBufferRange();
}
glDraw*();
for each (buffer in program)
{
    LockBuffer();   // glFenceSync() + Increase buffer offset to next buffer
}

This is nice, because we only lock and unlock the buffer if something has changed. If nothing is different, we just bind the buffer range and draw as normal. This also allows us to keep variables outside uniform buffers if we want. This could be useful if we have variables which are not shared, or which are already applied once (per pass variables for example).

Now for per-frame buffers, we might want to lock and unlock manually, since per-frame variables don’t have to wait for the next draw call to finish, only for the next frame. This is where this->manualLocking comes into play. With manual locking, we decide explicitly when a uniform buffer is locked and unlocked. In Nebula, we do this:

    const Ptr<ShaderInstance>& shdInst = this->GetSharedShader();
    AnyFX::EffectVarblock* block = shdInst->GetAnyFXEffect()->GetVarblockByName("PerFrame");
    RenderDevice::Instance()->SetPerFrameBlock(block);

Then on BeginFrame in OGL4RenderDevice:

    // unlock per-frame buffer
    this->perFrameBlock->UnlockBuffer();

Lastly on EndFrame in OGL4RenderDevice:

    // lock per-frame buffer
    this->perFrameBlock->LockBuffer();

Remember that since blocks can be shared, we can use the good old shared shader which also contains all uniform buffers shared by other shaders.

AnyFX has been extended to allow for selecting how big the buffer should be, as a multiple of the block size. Example:

shared buffers=2048 varblock PerObject
{
	mat4 Model;
	mat4 InvModel;
	int ObjectId;
};

This means we have a uniform buffer (or constant buffer in DX) which is backed 2048 times 😀 . This might seem excessive, but it’s not really that bad considering we have ONE of these for the entire application (using the qualifier shared), and the raw data only amounts to about 270 kB (132 bytes times 2048), somewhat more once each copy is padded to the offset alignment. It allows us to perform some 2048 draw calls before we actually have to wait to render a new object, which is nice.
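The memory figure is easy to double-check. The sketch below computes the footprint of the ring for the PerObject block, once without padding and once with each copy padded to an assumed offset alignment of 256 bytes. RingFootprint and PerObjectRawSize are illustrative names, the alignment is an assumption, and any padding the GLSL layout rules add at the end of the block is ignored.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for `copies` copies of a block, each padded to `alignment`.
inline std::size_t RingFootprint(std::size_t blockSize, std::size_t alignment, std::size_t copies)
{
    assert(alignment > 0);
    const std::size_t padded = ((blockSize + alignment - 1) / alignment) * alignment;
    return padded * copies;
}

// PerObject holds two mat4 (64 bytes each) and one int (4 bytes).
inline std::size_t PerObjectRawSize()
{
    return 2 * 64 + 4; // 132 bytes
}

// RingFootprint(PerObjectRawSize(), 1, 2048)   -> 270336 bytes (unpadded)
// RingFootprint(PerObjectRawSize(), 256, 2048) -> 524288 bytes (512 KiB)
```

Either way the buffer stays well under a megabyte, which is small for something that is allocated once for the whole application.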

The performance increase from this is tremendous. It’s so fast, in fact, that if I don’t call glFinish() each frame to synchronize the CPU with the GPU, the view matrix doesn’t keep up with the current frame. However, calling glFinish() proved to be bad for performance, since some calls may take A LOT of time for the GPU to finish, like huge geometry and instanced batches. While we wait for the GPU to finish everything, no new commands are put in the GL command queue, and the command queue should never stop being fed if one is aiming for performance.

All modules which stream data to the GL have been implemented to use this method, including the text renderer and the shape renderer for spontaneous geometry. Next up is to convert the particle system to utilize this feature too. After that I think we will look into bindless textures.

We are still very much alive.
