Making Reading from VRAM less Catastrophic
In an earlier article I showed how reading from VRAM with the CPU can be very slow. It however turns out there there are ways to make it less slow.
The key to this are instructions with non-temporal hints, in particular VMOVNTDQA. The Intel Instruction Manual says the following about this instruction:
“MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory type, the nontemporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the temporary internal buffer if data is available. “ (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2)
This sounds perfect for our VRAM and WC System Memory buffers as we typically only read 16-bytes per instruction and this allows us to read entire cachelines at time.
It turns out that Mesa already implemented a streaming memcpy using these instructions so all we had to do was throw that into our benchmark and write a corresponding memcpy that does non-temporal stores to benchmark writing to these memory regions.
As a reminder, we look into three allocation types that are exposed by the amdgpu Linux kernel driver:
VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.
Cacheable system memory. This is system memory that has caching enabled on the CPU and there is cache snooping to ensure the memory is coherent between the CPU and GPU (up till the top level caches. The GPU caches do not participate in the coherence).
USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.
Furthermore this still uses a RX 6800 XT + a 2990WX with 4 channel 3200 MT/s RAM.
|method (MiB/s)||VRAM||Cacheable System Memory||USWC System Memory|
|read via memcpy||15||11488||137|
|write via memcpy||10028||18249||11480|
|read via streaming memcpy||756||6719||4409|
|write via streaming memcpy||10550||14737||11652|
Using this memcpy implementation we get significantly better performance in uncached memory situations, 50x for VRAM and 26x for USWC system memory. If this is a significant bottleneck in your workload this can be a gamechanger. Or if you were using SDMA to avoid this hit, you might be able to do things at significantly lower latency. That said it is not at a level where it does not matter. For big copies using DMA can still be a significant win.
Note that I initially gave an explanation on why the non-temporal loads should be faster, but the increases in performance are significantly above what something that just fiddles with loading entire cachelines would achieve. I have not dug into the why of the performance increase.
I have been claiming DMA is faster for CPU readbacks of VRAM in both this article and the previous article on the topic. One might ask how fast DMA is then. To demonstrate this I benchmarked VRAM<->Cacheable System Memory copies using the SDMA hardware block on Radeon GPUs.
Note that there is a significant overhead per copy here due to submitting work to the GPU, so I will shows results vs copy size. The rate is measured while doing a wait after each individual copy and taking the wall clock time as these usecases tend to be latency sensitive and hence batching is not too interesting.
|copy size||copy from VRAM (MiB/s)||copy to VRAM (MiB/s)|
This shows that for reads DMA is faster than a normal memcpy at 4 KiB and faster than a streaming memcpy at 64 KiB. Of course one still needs to do their CPU access at that point, but at both these thresholds even with an additional CPU memcpy the total process should still be fast with DMA.