A driver on the GPU

The title might be a bit hyperbolic here, but we’re indeed exploring a first step in that direction with radv. The impetus here is the ExecuteIndirect command in Direct3D 12 and some games that are using it in non-trivial ways. (e.g. Halo Infinite)

ExecuteIndirect can be seen as an extension of what we have in Vulkan with vkCmdDrawIndirectCount. It adds extra capabilities. To support that with vkd3d-proton we need the following indirect Vulkan capabilities:

Binding vertex buffers.
Binding index buffers.
Updating push constants.

This functionality happens to be a subset of VK_NV_device_generated_commands and hence I’ve been working on implementing a subset of that extension on radv. Unfortunately, we can’t really give the firmware a “extended indirect draw call” and execute stuff, so we’re stuck generating command buffers on the GPU.

The way the extension works, the application specifies a command “signature” on the CPU, which specifies that for each draw call the application is going to update A, B and C. Then, at runtime, the application provides a buffer providing the data for A, B and C for each draw call. The driver then processes that into a command buffer and then executes that into a secondary command buffer.

The workflow is then as follows:

The application (or vkd3d-proton) provides the command signature to the driver which creates an object out of it.
The application queries how big a command buffer (“preprocess buffer”) of $n$ draws with that signature would be.
The application allocates the preprocess buffer.
The application does its stuff to generate some commands.
The application calls vkCmdPreprocessGeneratedCommandsNV which converts the application buffer into a command buffer (in the preprocess buffer)
The application calls vkCmdExecuteGeneratedCommandsNV to execute the generated command buffer.

What goes into a draw in radv

When the application triggers a draw command in Vulkan, the driver generates GPU commands to do the following:

Flush caches if needed
Set some registers.
Trigger the draw.

Of course we skip any of these steps (or parts of them) when they’re redundant. The majority of the complexity is in the register state we have to set. There are multiple parts here

Fixed function state:
1. subpass attachments
2. static/dynamic state (viewports, scissors, etc.)
3. index buffers
4. some derived state from the shaders (some tesselation stuff, fragment shader export types, varyings, etc.)
shaders (start address, number of registers, builtins used)
user SGPRs (i.e. registers that are available at the start of a shader invocation)

Overall, most of the pipeline state is fairly easy to emit: we just precompute it on pipeline creation and memcpy it over if we switch shaders. The most difficult is probably the user SGPRs, and the reason for that is that it is derived from a lot of the remaining API state . Note that the list above doesn’t include push constants, descriptor sets or vertex buffers. The driver computes all of these, and generates the user SGPR data from that.

Descriptor sets in radv are just a piece of GPU memory, and radv binds a descriptor set by providing the shader with a pointer to that GPU memory in a user SGPR. Similarly, we have no hardware support for vertex buffers, so radv generates a push descriptor set containing internal texel buffers and then provides a user SGPR with a pointer to that descriptor set.

For push constants, radv has two modes: a portion of the data can be passed in user SGPRs directly, but sometimes a chunk of memory gets allocated and then a pointer to that memory is provided in a user SGPR. This fallback exists because the hardware doesn’t always have enough user SGPRs to fit all the data.

On Vega and later there are 32 user SGPRs, and on earlier GCN GPUs there are 16. This needs to fit pointers to all the referenced descriptor sets (including internal ones like the one for vertex buffers), push constants, builtins like the start vertex and start instance etc. To get the best performance here, radv determines a mapping of API object to user SGPR at shader compile time and then at draw time radv uses that mapping to write user SGPRs.

This results in some interesting behavior, like switching pipelines does cause the driver to update all the user SGPRs because the mapping might have changed.

Furthermore, as an interesting performance hack radv allocates all upload buffers (for the push constant and push descriptor sets), shaders and descriptor pools in a single 4 GiB region of of memory so that we can pass only the bottom 32-bits of all the pointers in a user SGPR, getting us farther with the limited number of user SGPRs. We will see later how that makes things difficult for us.

Generating a commandbuffer on the GPUs

As shown above radv has a bunch of complexity around state for draw calls and if we start generating command buffers on the GPU that risks copying a significant part of that complexity to a shader. Luckily ExecuteIndirect and VK_NV_device_generated_commands have some limitations that make this easier. The app can only change

vertex buffers
index buffers
push constants

VK_NV_device_generated_commands also allows changing shaders and the rotation winding of what is considered a primitive backface but we’ve chosen to ignore that for now since it isn’t needed for ExecuteIndirect (though especially the shader switching could be useful for an application).

The second curveball is that the buffer the application provides needs to provide the same set of data for every draw call. This avoids having to do a lot of serial processing to figure out what the previous state was, which allows processing every draw command in a separate shader invocation. Unfortunately we’re still a bit dependent on the old state that is bound before the indirect command buffer execution:

The previously bound index buffer
Previously bound vertex buffers.
Previously bound push constants.

Remember that for vertex buffers and push constants we may put them in a piece of memory. That piece of memory needs to contains all the vertex buffers/push constants for that draw call, so even if we modify only one of them, we have to copy the rest over. The index buffer is different: in the draw packets for the GPU there is a field that is derived from the index buffer size.

So in vkCmdPreprocessGeneratedCommandsNV radv partitions the preprocess buffer into a command buffer and an upload buffer (for the vertex buffers & push constants), both with a fixed stride based on the command signature. Then it launches a shader which processes a draw call in each invocation:

   if (shader used vertex buffers && we change a vertex buffer) {
      copy all vertex buffers 
      update the changed vertex buffers
      emit a new vertex descriptor set pointer
   }
   if (we change a push constant) {
      if (we change a push constant in memory) {
         copy all push constant
         update changed push constants
         emit a new push constant pointer
      }
      emit all changed inline push constants into user SGPRs
   }
   if (we change the index buffer) {
      emit new index buffers
   }
   emit a draw command
   insert NOPs up to the stride

In vkCmdExecuteGeneratedCommandsNV radv uses the internal equivalent of vkCmdExecuteCommands to execute as if the generated command buffer is a secondary command buffer.

Challenges

Of course one does not simply move part of the driver to GPU shaders without any challenges. In fact we have a whole bunch of them. Some of them just need a bunch of work to solve, some need some extension specification tweaking and some are hard to solve without significant tradeoffs.

Code maintainability

A big problem is that the code needed for the limited subset of state that is supported is now in 3 places:

The traditional CPU path
For determining how large the preprocess buffer needs to be
For the shader called in vkCmdPreprocessGeneratedCommandsNV to build the preprocess buffer.

Having the same functionality in multiple places is a recipe for things going out of sync. This makes it harder to change this code and much easier for bugs to sneak in. This can be mitigated with a lot of testing, but a bunch of GPU work gets complicated quickly. (e.g. the preprocess buffer being larger than needed still results in correct results, getting a second opinion from the shader to check adds significant complexity).

`nir_builder` gets old quickly

In the driver at the moment we have no good high level shader compiler. As a result a lot of the internal helper shaders are written using the nir_builder helper to generate nir, the intermediate IR of the shader compiler. Example fragment:

   nir_push_loop(b);
   {
      nir_ssa_def *curr_offset = nir_load_var(b, offset);

      nir_push_if(b, nir_ieq(b, curr_offset, cmd_buf_size));
      {
         nir_jump(b, nir_jump_break);
      }
      nir_pop_if(b, NULL);

      nir_ssa_def *packet_size = nir_isub(b, cmd_buf_size, curr_offset);
      packet_size = nir_umin(b, packet_size, nir_imm_int(b, 0x3ffc * 4));

      nir_ssa_def *len = nir_ushr_imm(b, packet_size, 2);
      len = nir_iadd_imm(b, len, -2);
      nir_ssa_def *packet = nir_pkt3(b, PKT3_NOP, len);

      nir_store_ssbo(b, packet, dst_buf, curr_offset, .write_mask = 0x1,
                     .access = ACCESS_NON_READABLE, .align_mul = 4);
      nir_store_var(b, offset, nir_iadd(b, curr_offset, packet_size), 0x1);
   }
   nir_pop_loop(b, NULL);

It is clear that this all gets very verbose very quickly. This is somewhat fine as long as all the internal shaders are tiny. However, between this and raytracing our internal shaders are getting significantly bigger and the verbosity really becomes a problem.

Interesting things to explore here are to use glslang, or even to try writing our shaders in OpenCL C and then compiling it to SPIR-V at build time. The challenge there is that radv is built on a diverse set of platforms (including Windows, Android and desktop Linux) which can make significant dependencies a struggle.

Preprocessing

Ideally your GPU work is very suitable for pipelining to avoid synchronization cost on the GPU. If we generate the command buffer and then execute it we need to have a full GPU sync point in between, which can get very expensive as it waits until the GPU is idle. To avoid this VK_NV_device_generated_commands has added the separate vkCmdPreprocessGeneratedCommandsNV command, so that the application can batch up a bunch of work before incurring the cost a sync point.

However, in radv we have to do the command buffer generation in vkCmdExecuteGeneratedCommandsNV as our command buffer generation depends on some of the other state that is bound, but might not be bound yet when the application calls vkCmdPreprocessGeneratedCommandsNV.

Which brings up a slight spec problem: The extension specification doesn’t specify whether the application is allowed to execute vkCmdExecuteGeneratedCommandsNV on multiple queues concurrently with the same preprocess buffer. If all the writing of that happens in vkCmdPreprocessGeneratedCommandsNV that would result in correct behavior, but if the writing happens in vkCmdExecuteGeneratedCommandsNV this results in a race condition.

The 32-bit pointers

Remember that radv only passes the bottom 32-bits of some pointers around. As a result the application needs to allocate the preprocess buffer in that 4-GiB range. This in itself is easy: just add a new memory type and require it for this usage. However, the devil is in the details.

For example, what should we do for memory budget queries? That is per memory heap, not memory type. However, a new memory heap does not make sense, as the memory is also still subject to physical availability of VRAM, not only address space.

Furthermore, this 4-GiB region is more constrained than other memory, so it would be a shame if applications start allocating random stuff in it. If we look at the existing usage for a pretty heavy game (HZD) we get about

40 MiB of command buffers + upload buffers
200 MiB of descriptor pools
400 MiB of shaders

So typically we have a lot of room available. Ideally the ordering of memory types would get an application to prefer another memory type when we do not need this special region. However, memory object caching poses a big risk here: Would you choose a memory object in the cache that you can reuse/suballocate (potentially in that limited region), or allocate new for a “better” memory type?

Luckily we have not seen that risk play out, but the only real tested user at this point has been vkd3d-proton.

Secondary command buffers.

When executing the generated command buffer radv does that the same way as calling a secondary command buffer. This has a significant limitation: A secondary command buffer cannot call a secondary command buffer on the hardware. As a result the current implementation has a problem if vkCmdExecuteGeneratedCommandsNV gets called on a secondary command buffer.

It is possible to work around this. An example would be to split the secondary command buffer into 3 parts: pre, generated, post. However, that needs a bunch of refactoring to allow multiple internal command buffers per API command buffers.

Where to go next

Don’t expect this upstream very quickly. The main reason for exploring this in radv is ExecuteIndirect support for Halo Infinite, and after some recent updates we’re back into GPU hang limbo with radv/vkd3d-proton there. So while we’re solving that I’m holding off on upstreaming in case the hangs are caused by the implementation of this extension.

Furthermore, this is only a partial implementation of the extension anyways, with a fair number of limitations that we’d ideally eliminate before fully exposing this extension.

Written on April 25, 2022