Best way to write LSM-atomic to VRAM?


Hey

I do some atomic-add operations in a shader on groupshared data (groupshared = local shared memory). At the end of the shader I write this data to VRAM. I'm trying to figure out whether I need a barrier (GroupMemoryBarrierWithGroupSync) at the end of the shader or not. If I put the VRAM write on the last thread in the group, is it likely to be executed last in the group? Or is there some other heuristic/method that might work? There will be some dozens of instructions between the last atomic operation and the VRAM operation in the shader. The code will function even if occasionally some atomic-add operations “go missing”, but that would only be acceptable if it happens rarely, e.g. with less than a 0.1% chance.
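Roughly, the structure looks like this (a simplified sketch only; the buffer name, group size and counter are made up for illustration):

```hlsl
RWStructuredBuffer<uint> OutputBuffer : register(u0);

groupshared uint g_Counter;

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID)
{
    if (gtid.x == 0)
        g_Counter = 0;
    GroupMemoryBarrierWithGroupSync();

    // ... per-thread work ...
    uint prev;
    InterlockedAdd(g_Counter, 1, prev);   // atomic-add on groupshared data

    // ... some dozens of unrelated instructions here ...

    // Is a GroupMemoryBarrierWithGroupSync needed at this point, or is the
    // write below likely to execute after all the adds anyway?
    if (gtid.x == 63)                     // the "last" thread in the group
        OutputBuffer[gid.x] = g_Counter;  // write the per-group result to VRAM
}
```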

I imagine this could be platform-specific as well. I mainly just wanted to reach out and check whether anyone seeing this message has any clues or experience in this regard.


You need a GroupMemoryBarrierWithGroupSync after the LDS ops, and before any thread writes to VRAM.
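In code, the safe ordering would look something like this (just a sketch, assuming a simple per-group counter; the buffer name and group size are placeholders):

```hlsl
RWStructuredBuffer<uint> OutputBuffer : register(u0);

groupshared uint g_Counter;

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID)
{
    if (gtid.x == 0)
        g_Counter = 0;
    GroupMemoryBarrierWithGroupSync();

    uint prev;
    InterlockedAdd(g_Counter, 1, prev);   // LDS ops

    GroupMemoryBarrierWithGroupSync();    // all LDS ops are finished and visible here

    if (gtid.x == 0)
        OutputBuffer[gid.x] = g_Counter;  // safe: every add has happened before this write
}
```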

With the barrier in place, it's guaranteed there are no issues (neither platform-specific ones nor threads missing the barrier); if you still see problems, either you did something wrong or it's a driver bug.

If a following dispatch processes the data from VRAM, you may also need an API barrier in between to synchronize that as well (depends on the API).

51mon said:
If I put the VRAM-write on the last thread in the group is that likely to be executed last in the group?

You should not rely on that.

Try to put more instructions between the LDS operation and the flush. That way there is more time for the LDS to be updated. But again, somewhere along the way the manufacturer's implementation could decide not to do it the way you expect if you don't explicitly put a barrier in the code.

If I were you, I would place the LDS work as early in the shader as I can. Then I would add after it as much code that does not touch the LDS as I can. And only right before I need the updated data from the LDS, I would issue an LDS memory barrier, just to be sure it has actually been updated at all. Again, the implementation could reorder all of that.
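Roughly this ordering (only a sketch of the idea; the "unrelated work" loop is a made-up stand-in):

```hlsl
RWStructuredBuffer<uint> OutputBuffer : register(u0);

groupshared uint g_Counter;

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID)
{
    if (gtid.x == 0)
        g_Counter = 0;
    GroupMemoryBarrierWithGroupSync();

    // 1) the LDS work, as early in the shader as possible
    uint prev;
    InterlockedAdd(g_Counter, 1, prev);

    // 2) as much code as possible that does not touch the LDS (stand-in work)
    float unrelated = 0.0f;
    for (uint i = 0; i < 16; ++i)
        unrelated += sin((float)(gtid.x + i));

    // 3) the barrier only right before the updated LDS data is actually needed
    GroupMemoryBarrierWithGroupSync();

    if (gtid.x == 0)
    {
        OutputBuffer[gid.x * 2 + 0] = g_Counter;
        OutputBuffer[gid.x * 2 + 1] = asuint(unrelated);
    }
}
```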

I see you have something interesting in mind. Something that does not need barriers the way they were intended to work (or maybe you are just doing something the wrong way?). If I were you, I would be strict with the shader code while keeping my unorthodox logic for the implementation as a whole.

Notice there are two LDS barriers: one operates on a wave, the other on the whole group boundary. I optimize for speed, but in my experience wave-boundary barriers are so fast that I rarely count them. I prefer to have one more wave barrier that I don't need than to miss one that I do need and get a wrong result that could be very hard to detect by tests.

Thanks for some great feedback!

To give some further context:

It's for replacing some DX11 append buffers when porting to Vulkan/DX12. The idea is to do the append operation first in groupshared memory within the thread group, and then at the end of the shader merge this into VRAM. I read that the “old gen” append buffers had the counter stored in "global shared memory", so I guess this might be a step up in perf. Other benefits, such as managing this in a single resource and probably more efficient GPU→CPU readback, also follow.
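The rough idea looks something like this (a sketch only, not the real code; names, sizes and the event condition are made up):

```hlsl
struct Event { uint payload; };

RWStructuredBuffer<Event> EventBuffer  : register(u0);
RWStructuredBuffer<uint>  EventCounter : register(u1);   // single global counter

groupshared Event g_Events[64];
groupshared uint  g_Count;
groupshared uint  g_BaseOffset;

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    if (gtid.x == 0)
        g_Count = 0;
    GroupMemoryBarrierWithGroupSync();

    // "Append" into groupshared memory instead of into a DX11 append buffer:
    bool produceEvent = (dtid.x & 3) == 0;    // stand-in for the real condition
    if (produceEvent)
    {
        uint slot;
        InterlockedAdd(g_Count, 1, slot);
        g_Events[slot].payload = dtid.x;
    }

    GroupMemoryBarrierWithGroupSync();

    // Merge the per-group result into VRAM at the end of the shader:
    if (gtid.x == 0)
    {
        uint base;
        InterlockedAdd(EventCounter[0], g_Count, base);  // reserve a range once per group
        g_BaseOffset = base;
    }
    GroupMemoryBarrierWithGroupSync();

    if (gtid.x < g_Count)
        EventBuffer[g_BaseOffset + gtid.x] = g_Events[gtid.x];
}
```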

In the current case (the reason for this thread) the append operation unfortunately can't move much due to dependencies. I noticed in the past that barriers sometimes caused quite a slowdown, so it seems tempting to look into non-barrier solutions if possible. The append buffer holds events of a sort, so a small failure rate could probably be tolerated (a missing entry just won't trigger the code consuming the event).

API barriers should not be needed since the consumption of the appended data happens on a different frame (in this particular case).

This might require a bit of experimentation on my end + benchmarking etc.

51mon said:
This might require a bit of experimentation on my end + benchmarking etc.

yes.

Having the counter incremented in LDS is a nice idea. Notice that you can use the three-argument InterlockedAdd() call to immediately get the old value of the counter and use it to compute the offset you push to in VRAM. Depending on your code, you might not need a barrier at all: just use the third argument of the InterlockedAdd() call to compute the address to push to inside VRAM. Use a barrier only once, right after you reset the counter (to a known value, not necessarily zero; it depends on your code).
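Something like this is what I mean (a sketch based on my assumption that every group reserves a fixed-size slice of the output buffer, so nothing after the first barrier needs to wait; all names are made up):

```hlsl
#define MAX_EVENTS_PER_GROUP 64

RWStructuredBuffer<uint> EventBuffer : register(u0);

groupshared uint g_Count;

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID,
            uint3 dtid : SV_DispatchThreadID)
{
    if (gtid.x == 0)
        g_Count = 0;                         // reset the counter to a known value
    GroupMemoryBarrierWithGroupSync();       // the only barrier in the shader

    bool produceEvent = (dtid.x & 3) == 0;   // stand-in for the real condition
    if (produceEvent)
    {
        uint slot;
        InterlockedAdd(g_Count, 1, slot);    // the third argument returns the old value
        // The returned value directly gives this thread its own slot, so it never
        // has to wait for the other threads before pushing to VRAM:
        EventBuffer[gid.x * MAX_EVENTS_PER_GROUP + slot] = dtid.x;
    }
}
```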

Barriers for the whole group can be pretty slow. Barriers at the width of a wave have never felt slow to me.

