
Should there be a limit on the number of open GPUCommandEncoder instances? #4622

Closed
mwyrzykowski opened this issue May 6, 2024 · 35 comments

Labels: api resolved, api, needs-cts-issue

@mwyrzykowski commented May 6, 2024

The validation for GPUDevice.createCommandEncoder is mostly missing from the specification (https://gpuweb.github.io/gpuweb/#issue-f6026bc8), and there currently seems to be nothing preventing a website from creating hundreds of thousands of GPUCommandEncoders, attempting to encode commands, and never calling finish.

There doesn't appear to be any known website or CTS coverage of this, but the default Metal limit on the number of command buffers is documented to be 64. So after the 64th encoder that encodes commands without calling finish, we would see a hang unless the command buffers are discarded or completed in some fashion.
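For illustration, a minimal Swift sketch of that scenario (assuming a Metal-capable device; the API names are standard Metal, and 64 is the documented default rather than a queryable value):

```swift
import Metal

// Holding the default queue's entire pool of command buffers open without
// committing any of them makes the next makeCommandBuffer() call block.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!   // default pool: 64 command buffers

var open: [MTLCommandBuffer] = []
for i in 0..<65 {
    // On the 65th iteration this call blocks, because none of the previous
    // 64 buffers was committed, so none of them can ever complete.
    open.append(queue.makeCommandBuffer()!)
    print("created command buffer \(i + 1)")
}
```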

An implementation could theoretically store the commands until Queue.submit time to get the correct ordering, but that ends up deferring a lot of work, and it would be challenging to predict whether the commands need to be buffered or not.

It seems like we should have a limit on the number of open GPUCommandEncoders: most applications are unlikely to run into this issue, and attempting to solve the edge case in implementations seems like unnecessary, wasted effort.

It could possibly be a hard-coded limit in the validation for https://gpuweb.github.io/gpuweb/#issue-f6026bc8, depending on the other backends.

@Kangz (Contributor) commented May 13, 2024

I thought we had discussed this a long time ago in the group (and concluded not to add a limit) but can't find anything better than https://github.com/gpuweb/gpuweb/wiki/Minutes-2018-09-27 (search for "in-flight").

IMHO it would be unfortunate to expose this limit to applications because it is a form of global-state validation instead of the mostly local validation we have right now. And it seems like it could be papered over somewhat easily by browsers (IIRC Firefox sends the whole command buffer to the GPU process only on finish(), and Chromium records on submit()).

There are other issues, like WebGPU not being the only user of the Metal device: any limit is potentially too large when other components in the same process use the singleton MTLDevice for rendering 2D UI or other things (including other WebGPU tabs?).

@grovesNL (Contributor)

This was also complicated a bit at the time because newCommandQueueWithMaxCommandBufferCount allows you to raise the limit on Metal, but there didn't (doesn't?) appear to be a way to query the acceptable range (rdar://39726760 and rdar://39726960).

Years ago we hardcoded a higher limit of 2048 for gfx/wgpu because it was pretty easy to hit 64 pending command buffers depending on the application/game we tested. I don't know if there are better suggestions for a reasonable default now, but it would be helpful to know whether implementations that open command buffers immediately could avoid this validation by raising the limit to a much larger value.
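As a hedged sketch of that workaround (maxCommandBufferCount: is the real Metal parameter; 2048 just mirrors the value wgpu hardcoded):

```swift
import Metal

// Request a larger pool up front; the count cannot be changed on an existing
// queue, and the acceptable range is not queryable.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue(maxCommandBufferCount: 2048)
```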

@mwyrzykowski (Author) commented May 13, 2024

I thought we had discussed this a long time ago in the group (and concluded not to add a limit) but can't find anything better than https://github.com/gpuweb/gpuweb/wiki/Minutes-2018-09-27 (search for "in-flight").

IMHO it would be unfortunate to expose this limit to applications because it is a form of global-state validation instead of the mostly local validation we have right now. And it seems like it could be papered over somewhat easily by browsers (IIRC Firefox sends the whole command buffer to the GPU process only on finish(), and Chromium records on submit()).

Historically, the group appears to have avoided requiring implementations to defer work, as commented in:
#2612 - whole discussion around how to implement GPURenderBundles in an ICB without deferring work
#3787 (comment) - last part refers to this

This is the only issue we have come across that may actually require large amounts of work to be deferred until submit time. Firefox would presumably run into the same issue if the command buffers are created on the call to finish(), since the restriction is on non-committed command buffers.

There are other issues, like WebGPU not being the only user of the Metal device: any limit is potentially too large when other components in the same process use the singleton MTLDevice for rendering 2D UI or other things (including other WebGPU tabs?).

I don't think this is a concern, since the underlying limit is on the MTLCommandQueue. Presumably, the order of these operations would be enforced by the browser implementation. Creating multiple MTLCommandQueue instances for WebGPU is not practical due to the ordering of operations on resources, which is enforced at the driver level by using a single queue. This is not a concern for 2D UI or other WebGPU tabs, which would have their own queues and no interdependencies on each other.

@Kangz the reason we are asking for a limit here is that WebKit would likely throw an out-of-memory error after 64 GPUCommandEncoders which have yet to call finish. Presumably at some point, even in a deferred implementation, an out-of-memory error would need to be thrown on the GPUDevice.createCommandEncoder call, so WebKit's implementation would fall into that path at 64 (GPUCommandEncoder + non-committed GPUCommandBuffer) instances.

If we had a limit, we could use newCommandQueueWithMaxCommandBufferCount, mentioned above, for the applications which absolutely need more than 64 command buffers in flight at once, at some potentially significant performance cost. @grovesNL please correct me if I'm misunderstanding, but it sounds like gfx/wgpu was running into the limit at 64 command buffers? The same issue would seem to still exist, just at higher values. Regarding the acceptable range: anything above 64 is not recommended and comes at a perf cost, especially for any browser implementation which may run on iOS, where the app is more likely to be jetsam'ed. A game or application less concerned with the performance cost could presumably increase this limit up to UINT_MAX, but would likely, at least on iOS, get jetsam'ed once it uses too many command buffers. The 'too many' number before getting jetsam'ed is hard to pin down, which is why the default of 64 is provided.

@grovesNL (Contributor)

@mwyrzykowski

please correct me if I'm misunderstanding, but it sounds like gfx/wgpu was running into the limit at 64 command buffers

Right, 64 seemed to be too limiting for some applications we tested with. There doesn't seem to be a good workaround for in-flight command buffers though, because there's no way to upgrade/downgrade the max command buffer count of an existing queue as far as I know (without asking the application to re-record its commands onto a new queue). For implementations that immediately start recording command buffers, it seems probably necessary for now to raise the limit to some higher but still conservative value.

@aleino-nv

Unless I'm misunderstanding something, a limit of 64 (including in-flight) command buffers would make it impossible for an app to have a multithreaded renderer with more than 64 worker threads? (Probably the limit would be even less since in-flight buffers are counted.)
This is not an issue on e.g. Windows desktop, so it seems like a very unfortunate limitation to put into the spec.

Can this be solved in the browser's WebGPU/Metal back-end? If Metal command buffers are expensive, then the Metal back-end could "virtualize" command buffers, i.e. offer cheap command buffers that just live in regular memory for the app to own.

Then, when these are submitted, it would copy these command buffers to a set of real Metal command buffers that would be owned and managed by the Metal back-end.
If this is slow, then maybe it's possible to create an implementation that only falls back to virtualizing when needed?

@mwyrzykowski (Author)

Unless I'm misunderstanding something, a limit of 64 (including in-flight) command buffers would make it impossible for an app to have a multithreaded renderer with more than 64 worker threads? (Probably the limit would be even less since in-flight buffers are counted.) This is not an issue on e.g. Windows desktop, so it seems like a very unfortunate limitation to put into the spec.

I didn't think the API today allows for multithreading across worker threads: https://github.com/gpuweb/gpuweb/wiki/The-Multi-Explainer#multi-threading-javascript

When such a feature is discussed in more detail and introduced, we would prefer it to be implementable via MTLParallelRenderCommandEncoder, which would allow encoding in parallel among multiple threads without using anywhere close to 64 command buffers.

@aleino-nv

I didn't think the API today allows for multithreading across worker threads: https://github.com/gpuweb/gpuweb/wiki/The-Multi-Explainer#multi-threading-javascript

Yes, but I assume when it becomes supported, then it will be supported by the same WebGPU command encoder objects we're now discussing a limit for? If so, I don't see why it matters whether multithreaded command recording works today.

When such a feature is discussed in more detail and introduced, we would prefer it to be implementable via MTLParallelRenderCommandEncoder, which would allow encoding in parallel among multiple threads without using anywhere close to 64 command buffers.

I had a look at the documentation for MTLParallelRenderCommandEncoder.
The relationship to Vulkan is not quite clear to me. In Vulkan you might have one 'pool' per thread, and record many 'buffers' per pool within a thread.

It seems like in Metal you'd share one 'buffer' between multiple threads, and create from that buffer one (or more?) 'multithreaded command encoders' per thread. (?)
How is the sharing of buffers between threads implemented?

@mwyrzykowski (Author)

I didn't think the API today allows for multithreading across worker threads: https://github.com/gpuweb/gpuweb/wiki/The-Multi-Explainer#multi-threading-javascript

Yes, but I assume when it becomes supported, then it will be supported by the same WebGPU command encoder objects we're now discussing a limit for?

Apple's position would likely be for that not to be the case, as it is not implementable via Metal without essentially falling back to single threaded encoding.

It seems like in Metal you'd share one 'buffer' between multiple threads, and create from that buffer one (or more?) 'multithreaded command encoders' per thread. (?) How is the sharing of buffers between threads implemented?

A MTLCommandBuffer is not sharable across threads, except when using MTLParallelRenderCommandEncoder and calling -[MTLParallelRenderCommandEncoder renderCommandEncoder] for each thread. This deserves a discussion, but I would say it is outside the scope of this issue, which concerns the limit on the number of command buffers for single-threaded encoding. Outside of the CTS, which only triggered the issue due to #4599, we have not seen any usage require more than 64 command buffers. That is why either a field in the GPUQueueDescriptor or an additional field in GPUSupportedLimits seems appropriate, as some applications ported directly from another backend may use more than 64 buffers at once.

@aleino-nv

A MTLCommandBuffer is not sharable across threads, except when using MTLParallelRenderCommandEncoder and calling -[MTLParallelRenderCommandEncoder renderCommandEncoder]

Right, this is what I meant by "sharing".

To me it's as if the various terms are a bit mixed up between Vulkan and Metal.

  • Metal's 'command buffer' together with the 'parallel render command encoder' objects seem to be like some kind of shareable version of Vulkan's 'command pool' objects.
  • Metal's 'command encoder' objects seem to correspond to Vulkan's 'command buffer' objects.

I don't understand why one might want a shareable (Vulkan) 'command pool' object.
(I see that it'd be one way to work around the "max 64 command buffer limit", but I don't yet see this limit as a necessary premise.)

@mwyrzykowski (Author)

I don't understand why one might want a shareable (Vulkan) 'command pool' object. (I see that it'd be one way to work around the "max 64 command buffer limit", but I don't yet see this limit as a necessary premise.)

Command buffers in Metal are somewhat expensive. If you want to perform multithreaded encoding, creating multiple command buffers is much slower than using one buffer with a parallel encoder. Command buffer boundaries force results to be CPU-visible, whereas a parallel render command encoder avoids this.

@Kangz (Contributor) commented May 20, 2024

My understanding of Metal is the following:

MTLCommandBuffer is an object that contains a pointer to the queue it was created from. It acts as an allocator of commands that can be encoded via the various MTLCommandEncoders that are created (one at a time) from the command buffer. None of these objects are internally synchronized.

MTLParallelRenderCommandEncoder is one of these encoders; it can additionally vend multiple normal MTLRenderCommandEncoders to be encoded on various threads independently. This avoids the need to internally synchronize them. In Vulkan-speak, think of this object as a factory for secondary command buffers that live inside a predeclared render pass.

Parallel render command encoders would map to GPURenderBundle, and even then not exactly, because bundles can be created before their beginRenderPass operation.
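To make that model concrete, a hedged Swift sketch of the parallel-encoding pattern (renderPassDescriptor and the per-thread draw code are assumed placeholders):

```swift
import Metal
import Dispatch

// One command buffer, one predeclared render pass, several sub-encoders that
// can each be recorded on a different thread without extra command buffers.
func encodeInParallel(queue: MTLCommandQueue,
                      renderPassDescriptor: MTLRenderPassDescriptor,
                      threadCount: Int) {
    let commandBuffer = queue.makeCommandBuffer()!
    let parallel = commandBuffer.makeParallelRenderCommandEncoder(
        descriptor: renderPassDescriptor)!

    // Sub-encoder creation order defines the GPU execution order.
    let subEncoders = (0..<threadCount).map { _ in
        parallel.makeRenderCommandEncoder()!
    }
    DispatchQueue.concurrentPerform(iterations: threadCount) { i in
        // ... encode this thread's draws on subEncoders[i] ...
        subEncoders[i].endEncoding()
    }

    parallel.endEncoding()
    commandBuffer.commit()
}
```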

@Kangz (Contributor) commented May 20, 2024

In the meeting it was quite clear that there is a cost to MTLCommandBuffer that we should avoid, with the limit of 64 being a soft limit on some devices that will quickly cause OOM or other badness if raised (as opposed to an arbitrary limit that would be free to increase).

However, adding a limit to the specification seems like it 1) will not solve the entirety of the problem, 2) will cause pain to developers, and 3) isn't necessary to mitigate the issue.

1. It doesn't solve the problem

The limit in Metal is on the total number of command buffers currently created on an MTLCommandQueue: not just the ones currently encoding, but also the command buffers already in flight. This means that, to be precise, the limit in WebGPU would need to account for the latter as well, and this would be extremely non-portable.

The limit also includes command buffers created for the use of the WebGPU implementation itself, for example for complex operations like copyExternalImageToTexture. It is definitely an implementation detail how many of these are necessary, but they would decrease the limit and make it unpredictable for developers to count.

In the future, if command buffers are that expensive, browsers are potentially going to multiplex multiple GPUDevices onto the same MTLCommandQueue. At that point the counting starts to get very difficult, because any number of WebGPU devices could be drawing from the requested maximum number of command buffers.

2. Pain to developers

In WebGPU we have tried to have as little stateful validation as possible, let alone global validation (do we have any except device loss?), because each additional piece of stateful/global validation makes it more difficult for developers to ensure their application won't run into it in production. Especially for global validation. Imagine an e-commerce site showing many 3D models in different canvases. Each of them is likely to have a different render loop, and it would be much better if the site didn't stop rendering after you add the 65th command encoder.

That no application has run into this in production on Safari, but that wgpu-rs did need to increase the limit, is a strong indicator that as WebGPU gets more use, more and more developers risk running into this issue without knowing it, and that sites will break when pushed a bit too hard.

In general, in WebGPU we do the hard stuff so that developers don't have to worry about it. Examples are barriers in Vulkan, descriptor set management, driver workarounds, and memory allocation and suballocation. MTLCommandBuffer management is just one more of these things we can do for them, so that it is solved ~3 times instead of many more times by Web authors.

3. Other mitigations

A lot of them were proposed in the last meetings (I'm not taking credit for them).

As a first step, the WebGPU specification could allow for some graceful failure when allocating a GPUCommandEncoder, so it is spec-compliant for a trivial implementation to say "no" when going above the internal limit.

A slot hint from the site could be added to the default queue descriptor, although I think most developers would just ignore it.

A suggestion was to detect when the internal limit is reached and recreate a command queue that allows more command buffers, eventually migrating to run entirely on the new queue.

Or the implementation could resort to a record-replay mechanism after some internal limit is reached. This is some work, but not a crazy amount of work, as both Dawn and wgpu have implemented variants of this.
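As a rough illustration of that last option (hypothetical type; only a buffer-to-buffer copy is shown), the idea is to record commands as closures in CPU memory and replay them into a real MTLCommandBuffer only at submit time:

```swift
import Metal

// Past some internal threshold, commands live in CPU memory as closures and
// consume no Metal command buffer until replay.
final class DeferredBlitEncoder {
    private var recorded: [(MTLBlitCommandEncoder) -> Void] = []

    func copy(from src: MTLBuffer, to dst: MTLBuffer, size: Int) {
        recorded.append { blit in
            blit.copy(from: src, sourceOffset: 0,
                      to: dst, destinationOffset: 0, size: size)
        }
    }

    // Several deferred encoders could even be coalesced into one
    // MTLCommandBuffer here, at submit() time.
    func replay(into commandBuffer: MTLCommandBuffer) {
        let blit = commandBuffer.makeBlitCommandEncoder()!
        recorded.forEach { $0(blit) }
        blit.endEncoding()
    }
}
```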

@mwyrzykowski (Author) commented May 20, 2024

I'm not sure if these points are accurate.

  1. like it will not solve the entirety of the problem
    I don't think this statement is true.

1. It doesn't solve the problem

The limit in Metal is on the total number of command buffers currently created on an MTLCommandQueue: not just the ones currently encoding, but also the command buffers already in flight.

To be precise, as long as the command buffer has been committed via -[MTLCommandBuffer commit], it does not matter whether it is in flight or not. -[MTLCommandQueue commandBuffer] and variants will return the next command buffer upon completion of the 64th buffer.

So it would be portable, since tracking the number of GPUCommandEncoder instances that have not called finish(), plus the GPUCommandBuffers not yet passed to GPUQueue.submit(), is quite simple.

I think it does solve the problem unless there is something else which hasn't been raised.
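For concreteness, a hedged sketch of that tracking (hypothetical types, not WebKit's actual implementation): count encoders that haven't called finish() plus unsubmitted command buffers, and fail createCommandEncoder() past the limit.

```swift
enum WebGPUError: Error { case outOfMemory }

final class DeviceState {
    let maxOpenCommandBuffers = 64
    private(set) var openCount = 0   // open encoders + unsubmitted buffers

    func createCommandEncoder() throws -> CommandEncoder {
        guard openCount < maxOpenCommandBuffers else {
            throw WebGPUError.outOfMemory
        }
        openCount += 1
        return CommandEncoder { self.openCount -= 1 }
    }
}

final class CommandEncoder {
    private let retire: () -> Void
    init(retire: @escaping () -> Void) { self.retire = retire }
    // The slot is released only when the resulting command buffer is
    // finally passed to GPUQueue.submit().
    func finishAndSubmit() { retire() }
}
```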

The limit also includes command buffers created for the use of the WebGPU implementation itself, for example for complex operations like copyExternalImageToTexture. It is definitely an implementation detail how many of these are necessary, but they would decrease the limit and make it unpredictable for developers to count.

This does not impact the limit, since the implementation can call commit on those MTLCommandBuffers.

2. Pain to developers

In WebGPU we have tried to have as little stateful validation as possible, let alone global validation (do we have any except device loss?), because each additional piece of stateful/global validation makes it more difficult for developers to ensure their application won't run into it in production. Especially for global validation. Imagine an e-commerce site showing many 3D models in different canvases. Each of them is likely to have a different render loop, and it would be much better if the site didn't stop rendering after you add the 65th command encoder.

This scenario was raised in the meeting but I don't really understand it. Do you mean, from a single worker or from the main thread, the website creates many GPUCommandEncoder instances via GPUDevice.createCommandEncoder(), begins render or compute passes, and doesn't submit them?

As long as they call device.queue.submit([ GPUCommandEncoder.finish() ]), their use case is going to work fine for as many canvases as they like, until they otherwise run into memory issues. So this seems like incorrect API usage to me: creating many encoders and buffers and never sending them to GPUQueue.submit().

That no application has run into this in production on Safari, but that wgpu-rs did need to increase the limit, is a strong indicator that as WebGPU gets more use, more and more developers risk running into this issue without knowing it, and that sites will break when pushed a bit too hard.

I would rather use the term 'a naively ported application', or something along those lines, for what would run into this limit. This is one reason for -[MTLDevice newCommandQueueWithMaxCommandBufferCount:]: it is partly an escape hatch for an application which was ported from Vulkan or a different backend directly to Metal without concern for re-architecting for performance.

3. Other mitigations

A lot of them were proposed in the last meetings (I'm not taking credit for them).

As a first step, the WebGPU specification could allow for some graceful failure when allocating a GPUCommandEncoder, so it is spec-compliant for a trivial implementation to say "no" when going above the internal limit.

A slot hint from the site could be added to the default queue descriptor, although I think most developers would just ignore it.

If they ignore it, that creates a greater argument for adding a GPUSupportedLimit and enforcing validation for it across implementations, as portability across devices is a goal of WebGPU.

A suggestion was to detect when the internal limit is reached and recreate a command queue that allows more command buffers, eventually migrating to run entirely on the new queue.

We could do this, but it comes at significant cost, which is why we desire a GPUSupportedLimit. A developer raising the limit presumably knows what they are doing. It would also give us the ability to report a higher limit for certain devices (Macs) and a lower one for mobile devices.

Requiring the developer to respect the limit would increase portability across devices.

Or the implementation could resort to a record-replay mechanism after some internal limit is reached. This is some work, but not a crazy amount of work, as both Dawn and wgpu have implemented variants of this.

We discussed this internally within WebKit, but the conclusion was that a record-replay approach is not optimal for real-time applications on mobile devices. The recording of commands happens twice each render update, once internally in the browser implementation and then again in the driver, increasing latency.

As we continue the discussion, it seems a GPUSupportedLimit is the best approach, as one of WebGPU's goals is portability across devices.

@kainino0x (Contributor)

To clarify about the multithreading question: once we have multithreading, I don't think it will be practical to prevent creating command encoders from multiple threads at once. If there's a limit, then it's global, and people will hit it. The only thing that an MTLParallelRenderCommandEncoder-equivalent concept does is provide developers with an alternative way of encoding that won't hit that limit (which they can only use if they're parallelizing a single render pass). I don't see how we could force people to use it.[1]

For example, if an app tries to create navigator.hardwareConcurrency threads and run init work on all of them, everything will work fine until it runs on a device that has enough cores. Or if it tries to spin up one thread for each model in a scene file provided by the user, and the user provides a scene with more than 64 models, same thing.

So, the limit improves portability in one way, but also introduces a serious new portability hazard. Of course, there are always hazards to spinning up a lot of threads, but typically machines with that many cores can handle a decent amount of work happening on each one simultaneously. The problem with a limit is that it would be enforced at 64 by default regardless of the system's capabilities.

@mwyrzykowski:
To be precise, as long as the command buffer has been committed via -[MTLCommandBuffer commit], it does not matter whether it is in flight or not. -[MTLCommandQueue commandBuffer] and variants will return the next command buffer upon completion of the 64th buffer.

The documentation says that newCommandQueueWithMaxCommandBufferCount "Creates a queue you use to submit rendering and computation commands to a GPU that has a fixed number of uncompleted command buffers."
addCompletedHandler defines "completed" as "after the GPU finishes running the commands in the command buffer."

That supports Corentin's interpretation. Is the documentation incorrect?

@Kangz:
This means that to be precise the limit in WebGPU would need to account for the latter as well, and this would be extremely non-portable.

We could do it without needing to account for submitted-but-not-completed command buffers in the validation, by instead having createCommandEncoder() possibly stall until some command buffers have completed. The performance characteristics would be pretty nasty, but at least it wouldn't be as terribly fragile.
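A hedged sketch of that stalling behavior (browser-internal; the semaphore sizing assumes Metal's default pool of 64):

```swift
import Metal
import Dispatch

// Gate buffer creation on a semaphore sized to the Metal pool, and release a
// slot from each command buffer's completion handler.
let slots = DispatchSemaphore(value: 64)

func makeTrackedCommandBuffer(on queue: MTLCommandQueue) -> MTLCommandBuffer {
    slots.wait()                       // stalls while 64 buffers are uncompleted
    let buffer = queue.makeCommandBuffer()!
    buffer.addCompletedHandler { _ in
        slots.signal()                 // frees a slot once the GPU finishes
    }
    return buffer
}
```

Note the hazard described next: a buffer that is never committed never signals, so a stall can become a deadlock.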

FWIW we could sort of hack our way around this limit by hard-limiting each thread to 64 command buffers, but allowing other threads to createCommandEncoder() by stalling until other app threads' command buffers have completed. But this would introduce significant risk of applications running into a confusing deadlock that circles through both the browser and app code.

(Assuming from here forward that submitted-but-not-completed command buffers are not an issue.)

@mwyrzykowski:
This scenario was raised in the meeting but I don't really understand it. Do you mean, from a single worker or from the main thread, the website creates many GPUCommandEncoder instances via GPUDevice.createCommandEncoder(), begins render or compute passes, and doesn't submit them?

As long as they call device.queue.submit([ GPUCommandEncoder.finish() ]), their use case is going to work fine for as many canvases as they like, until they otherwise run into memory issues. So this seems like incorrect API usage to me: creating many encoders and buffers and never sending them to GPUQueue.submit().

As long as each widget calls createCommandEncoder() and submit() in the same JS task, I think you're correct. Issues would arise if there's any await between createCommandEncoder() and submit(). This is possible, but should be quite unusual - it's only possible for command buffers that don't write into the canvas (because canvas textures expire during await).

Footnotes

  [1] Except maybe by completely disallowing parallel encoding to a single queue - you can have 64 command buffers in one thread, but only 1 thread may be encoding at a time. Sounds like a non-starter, but maybe it could work if we can parallelize entire command encoders, not just render passes.

@mwyrzykowski (Author)

@mwyrzykowski:
To be precise, as long as the command buffer has been committed via -[MTLCommandBuffer commit], it does not matter whether it is in flight or not. -[MTLCommandQueue commandBuffer] and variants will return the next command buffer upon completion of the 64th buffer.

The documentation says that newCommandQueueWithMaxCommandBufferCount "Creates a queue you use to submit rendering and computation commands to a GPU that has a fixed number of uncompleted command buffers." addCompletedHandler defines "completed" as "after the GPU finishes running the commands in the command buffer."

That supports Corentin's interpretation. Is the documentation incorrect?

The documentation is correct here, and tracking the buffers in flight is not needed. As noted in the documentation for -[MTLCommandQueue commandBuffer]:

This method blocks the calling CPU thread when the queue doesn’t have any free command buffers, and returns after the GPU finishes executing one.

A committed command buffer will either complete successfully or complete due to an error. -[MTLCommandQueue commandBuffer] will block, returning with a new command buffer once that occurs. It only returns nil in out-of-memory scenarios, but the process is likely to get killed due to memory pressure before that, or system performance will deteriorate on macOS due to exhausting all virtual memory.

@kainino0x (Contributor)

The documentation is correct here, and tracking the buffers in flight is not needed. As noted in the documentation for -[MTLCommandQueue commandBuffer]

Ohh, I see. Sorry, you said this, and I didn't understand for some reason.

@aleino-nv

What it ultimately comes down to is this: do we think of WebGPU command buffers as cheap enough that developers can create one per core without worrying, or don't we?

If we do, that means they don't map 1-to-1 to 'Metal command buffers', thus the Metal backends ought not to associate 'Metal command buffers' 1-to-1 with 'WebGPU command buffers', and thus there is no need to impose a limit on WebGPU.

If we don't, then what's WebGPU's story for multithreaded rendering on more than 64 threads?

@aleino-nv

Also, a good application would not try to create a renderer with anywhere close to 64 rendering threads on an iPhone anyway!
Apps should probably query the core count, directly or indirectly, before they create any command buffers for multithreaded rendering.

@mwyrzykowski (Author) commented May 21, 2024

What it ultimately comes down to is this: do we think of WebGPU command buffers as cheap enough that developers can create one per core without worrying, or don't we?

I don't think this is accurate. One of the goals of WebGPU, and of the web in general, is portability: we cannot decide whether something should be a limit based on the features of a particular device. Quite the opposite: most of the limits in GPUSupportedLimits are values which vary between devices and hardware.

This is the reasoning behind WebGPU using the default limits when none are requested on the GPUDevice: portability. If the developer chooses to request a higher limit and the browser supports it, then it is perfectly acceptable to create any number of command buffers up to the supported limit.

If we don't, then what's WebGPU's story for multithreaded rendering on more than 64 threads?

Apple's preference would be a design that is implementable with MTLParallelRenderCommandEncoder. But I don't want to derail this conversation discussing it, as WebGPU does not support multithreaded rendering with a single device today. We may also choose to allow the developer to create multiple GPUQueue instances; that would be another solution.

@aleino-nv

What it ultimately comes down to is this: do we think of WebGPU command buffers as cheap enough that developers can create one per core without worrying, or don't we?

I don't think this is accurate. One of the goals of WebGPU, and of the web in general, is portability: we cannot decide whether something should be a limit based on the features of a particular device. Quite the opposite: most of the limits in GPUSupportedLimits are values which vary between devices and hardware.

It makes no sense to take a limit from a completely different concept ('Metal command buffer') and impose it on an API object ('WebGPU command buffer') just because they happen to share a name.

@Kangz (Contributor) commented May 21, 2024

GPU Web WG 2024-05-15
  • MW: If we do not have this, then I don’t think we can encode any work on the encoders until the call to submit, because there’s the possibility that the website creates a bunch of encoders, but then fails to submit them, then we’ll block because we don’t have enough available command buffers.
  • CW: I want to avoid global validation. But Chromium encodes on submit, so this isn’t an issue for us. But also, OOM and device loss are always valid ways out of the situation. If that were the choice, then you’d probably increase the command limit with the appropriate Metal API call. Alternatively, you could reserve a few command buffers, then you could switch to encode-at-submit once you run out. But that would be more complexity (two paths). If you could set a limit of, say, 1000, then developers might never hit it. 60 seems borderline, middleware might consume a lot, or complex pages. So we should try to raise the limit substantially higher than 60, to make the global limit more acceptable.
  • MW: we don’t really see people hitting this limit, although games ported from Vulkan could indeed hit it. So some GPUs could have a higher limit. But there’s a substantial performance hit on Metal from having a high limit.
  • CW: People might not hit the limit in development, and only discover it on the user’s machines. I’m not sure that we need a configurable limit.
  • MW: There’s no limit on the device, it’s a limit on the Metal queue, which is presumably tab-specific. We can’t just create more Metal queues, because the driver wouldn’t be able to force the ordering of command buffers.
  • MM: Could you use shared events?
  • MW: I’d need to investigate that, it might be a potential option.
  • MM: Perhaps the reason not many apps have hit the limit is that there's no Mac with 64 cores. Such PCs do exist. So a limit of 64 is too low
  • KG: It’s worth mentioning that Josh Groves says that wgpu was actually hitting the 64 limit in the wild, and raised the limit to 2000.
  • MW: Why does the number of cores have any relevance?
  • MM: A common pattern is to have each thread recording a different command buffer
  • CW: We can’t do this sharing on the web right now.
  • CW: Room: what if we had it configurable, but required to be at least 1000
  • MW: Anything greater than 64 will have consequences on Metal mobile devices. Your app will get “jetsam”ed: your Safari tab will get killed. Command buffers are totally not free. This is why the proposal from Apple’s side is to keep the limit at 64, but let developers raise it if they’re willing to accept the consequences.
  • BJ: This same limit wouldn’t apply to render bundles, would it?
  • MW: No, those don’t create their own command buffer.
  • KG: Many people won’t realize they’re going to run into it, they’ll run into it only after they ship, and we’re going to be introducing issues where the systems wouldn’t otherwise have any problem. We’d be saving some tabs from being process-killed, but we’re increasing developer pain with this validation.
  • MM: Isn’t that an argument for applying validation consistently everywhere, and applying the limit to everyone?
  • KG: The place I see these happen is not the case you’d see on a developer machine until you’d created a test case that matches what the user is seeing. Sometimes that’s hard, users are doing strange things. But we still have to debug the problem, even when they’re doing something weird. But games will work on CI and then fail on user machines
  • KN: Depending on how this is implemented, it might be a new case of limit, depending on what maximum we expose. If we wanted to let you ask for higher, but with a higher risk of getting your process killed, that’d be a new thing for the spec. It might kill the whole browser, which wouldn’t be good.
  • KN: Secondly, I would very strongly prefer not to have global limits. I don’t like the complexity of the two-path solution, but if it lets us avoid a global limit, I think it’s worth it. Or, we could allow OOM on command buffer creation. But it would probably break the whole app.
  • BJ: If this does become a limit, we’re going to have to make some arbitrary choices. There’s fundamentally no limit in Chrome’s implementation, so we’re going to pick a high limit.
  • BJ: It depends on how we as a group view this particular limit. Is it like running out of texture memory: there’s plenty around, but you can still easily create a desktop app that creates much more than a mobile app could handle. And in that case we clearly don’t want to put a limit on texture creation - that’s best handled with OOM. Do we view this as a similar situation? I think we can. But I also feel like a two-tier encoding system would make a lot of difference here. Don’t want to be insensitive to imposing an engineering load on other teams.
  • JB: Would it be possible for Metal to treat all command buffers as bundle and at submission treat them as bundles?
  • MW: No: bundles don’t support all the commands that command buffers do
  • MM: Could we compromise on the limit?
  • MW: Anything over 64 starts to have an impact.
  • BJ: I don’t think it’s too terrible for us to push for best practices that discourage lots of outstanding queues. Even in Chrome, where we’re not actively writing to command buffers as people record, we’ll still accumulate memory, so it’s not a terrible idea to discourage profligate command buffer usage. “Are you leaking?” warning
  • CW: It seems like we have a large panel of options. The option BJ was describing: Browser should probably warn people when they have a lot of command buffers open - far below the limit, maybe 10. And that would help developers avoid getting surprised by what happens on the users’ machines. Give devs advance warning.
  • KG: This seems parallel to classic OOM: application works great on desktop, yet falls over on mobile. Even allocating 4G of memory seems like it'll attract attention from the OOM killer. This is seen as tolerable/necessary even if not ideal, and people cope with it. These feel right on the edge between "we should validate this for portability" and "we should let each device do what it's capable of"
  • MW: We could do this silently, the downside is that it would work on higher-end machines but not lower-end machines or mobile devices. I agree we have the problem in general. The limit would help ensure more consistent behavior.
  • MM: BJ earlier asked what the difference is between this and running out of texture memory. First, there’s a developer expectation difference: developers understand running out of texture memory. Second, from the discussion in the issue, it seems like developers do hit this issue more often than they run out of memory. So I think there is a difference between the two situations.
  • KG: I agree there’s a difference, just not enough of a difference to merit a hard limit.
  • MW: The upside to the hard or configurable limit is that it would push adoption of moderated command buffer usage, which improves portability.
  • KG: Sometimes it’s useful to embed certain amounts of complexity in our implementations because we only have three WebGPU implementations, as opposed to thousands of content pages. It’s better for browser vendors to solve the problem than to foist it off on web devs.
  • CW: In WebKit, is it just a matter of the amount of engineering? Not having the two-tier thing? Or is there a fundamental reason to never do it? Say, “Virtual calls are expensive”?
  • MW: The engineering aspect is solvable, but it introduces a latency when you replay, and a bit more memory pressure.
  • CW: Those consequences would only happen in the cases that would run into the limit anyway.
  • MM: If you were writing a native Metal app, you wouldn't record your commands into a memory buffer. You would simply raise the limit on the Metal queue.
  • CW: But we don’t have a fixed workload. We can’t anticipate the web page’s workload.
  • MM: You might be able to: if the queue doesn't support enough command buffers, you could make a new queue with a larger capacity and then slowly migrate over to it. So you'd gradually raise the queue limit
  • MW: That’s an option. Pages will still get killed, and we’ll have to give them feedback that they shouldn’t encode so many buffers.
  • MM: Would adding a limit provide that feedback?
  • MW: Yes, because you’d get the feedback immediately, regardless of the machine
  • CW: You wouldn’t necessarily get the workload that provokes the failure.
  • CW: Would it make sense to say, we can have OOM on CommandBuffer.finish? And have warnings in browsers that discourage making lots of command buffers? And take MM's suggestion of gradually increasing the queue size. Would that be a good outcome?
  • MW: We’d like the OOM on createCommandEncoder. This would be a solution, at least.
  • MM: There are two different priorities: one: create a system where the browser doesn’t crash. Two: support applications that need more than 64 but less than the device’s limit.
  • CW: I’d like to suggest that kind of many-bandaid solution. Seems like a good tradeoff between making applications care about a hard limit, while at the same time highly likely to work in WebKit, while also steering developers into not using too many command buffers. A palatable tradeoff?
  • MW: The downside is that they’ll not check for this OOM exception instead of the limit - which people also won’t check for
  • KN: Would the OOM error be a stopgap until you’ve done something in the implementation to paper over it?
  • MW: We’d have to see if we even want to ever allocate more than 64. It may just never work on iOS. So we’d very much prefer that the app simply doesn’t allow using more than the hard limit of command buffers.
  • MM: Would you prefer that it be impossible for the app to request the 65th command buffer?
  • MW: Yes.
  • KG: It was mentioned that in Metal, you can request a higher limit when you create a Queue. Could we surface that limit somehow in WebGPU itself?
  • MM: WebGPU doesn’t have multiple command queues right now
  • KN: It’d go in the device descriptor.
  • MM: Okay, so device descriptor specifies a command buffer in flight cap. What happens if the content surpasses that?
  • KG: Things will fail.
  • AL: It seems like these shouldn’t be so expensive. Why are they so expensive?
  • MW: We’re informed that it slows down notifications in the kernel, and notifications for buffer mapping and work completion get worse latency. I could get some numbers.
  • AL: Is that in-flight command buffers? Why is it different from having them in a memory buffer?
  • MW: No, the limit includes non-submitted buffers
  • MM: This isn’t just normal memory: it’s pinned memory that the GPU needs access to. And there’s actually a fixed-size table in the kernel.
  • AL: Couldn’t we record buffers in memory.
  • CW: Hardware is weird.
  • AL: Drivers always let you pretend you have more things than you really do
  • MM: This was answered above. Latency, memory consumption.
  • CW: We can’t change the way drivers behave. Ideally the drivers would be different, but we have to take them as they are.
  • KG: I think it’s worth considering whether it would be possible to surface the native-level workaround for this.
  • MM: What does the group think about the proposal where there’s a minimum command buffer count, and if the app exceeds that, it may or may not be validated?
  • CW: That could be surfaced as OOM or device loss. So, yes, but I don't know if it really addresses the concern
  • AL: If you pick 64, doesn’t that mean you can’t use half your threads on a machine with more cores?
  • CW: Yes, and it’s even worse because command buffers in flight contribute to the count, too.
  • KG: If it’s tolerable in the native ecosystem to make people request higher limits in advance, why wouldn’t it be okay in WebGPU too?
  • MM: If the application sets that number to 100, then the likelihood that they’ll get OOM if they use fewer than 100 is very low.
  • CW: It seems like we didn’t find consensus this time. Let’s try to continue discussion in the issue and think about our options.
  • Continue discussion in subsequent meetings

@litherum (Contributor) commented May 22, 2024

Given that:

  1. My opinion means nothing
  2. The committee was not enthusiastic about this proposal in the call today,

... Here's what I think should happen:

  1. createCommandEncoder() should be allowed to be fallible (if it isn't already). The spec and MDN should make it obvious that this method can fail. If/when it fails, the spec prescribes something well-defined to happen.
  2. Notes should be added to the spec and/or MDN to indicate that authors should try not to have many open GPUCommandEncoders.
  3. All browsers should add console warnings when the 65th open GPUCommandEncoder comes into existence
  4. Because of the above, the WebKit team doesn't need to do anything - the 65th open GPUCommandEncoder can just fail to be created. If the WebKit team observes that this causes a sufficient number of WebGPU sites to not work, the WebKit team can update their implementation to either:
    1. Simply replace their call to -[MTLDevice newCommandQueue] with -[MTLDevice newCommandQueueWithMaxCommandBufferCount:100] (or some other hardcoded value which is enough to make the apps work), or
    2. Create a second MTLCommandQueue at runtime, and migrate future command buffers over to it, using MTLSharedEvents

@mwyrzykowski (Author)

  1. createCommandEncoder() should be allowed to be fallible (if it isn't already). The spec and MDN should make it obvious that this method can fail. If/when it fails, the spec prescribes something well-defined to happen.

  2. Notes should be added to the spec and/or MDN to indicate that authors should try not to have many open GPUCommandEncoders.

  3. All browsers should add console warnings when the 65th open GPUCommandEncoder comes into existence

  4. Because of the above, the WebKit team doesn't need to do anything - the 65th open GPUCommandEncoder can just fail to be created. If the WebKit team observes that this causes a sufficient number of WebGPU sites to not work, the WebKit team can update their implementation to either:

    1. Replace their call to -[MTLDevice newCommandQueue] with -[MTLDevice newCommandQueueWithMaxCommandBufferCount:100] (or some other hardcoded value which is enough to make the apps work), or
    2. Create a second MTLCommandQueue and slowly migrate future command buffers over to it, using MTLSharedEvents

Yes, that would resolve the issue. The concerns are portability and performance. We have gone through extensive discussions to ensure portability in the past, e.g., #3961.

If we are not concerned about the impact to one of the three major backends supported by WebGPU, then we can say an implementation may report device lost or return a GPUOutOfMemory error on the call to GPUDevice.createCommandEncoder and resolve this issue.

We have previously shown that splitting command buffers impacts performance substantially (#2189 (comment)), and there is no practical way for a WebGPU framework to coalesce two or more GPUCommandBuffer instances into a single MTLCommandBuffer. If we were to establish this limit, we would place the responsibility on the website to keep the number of command buffers small to avoid performance cliffs. Or, if it is a GPUSupportedLimit, we would allow the website to raise the limit from the default at its own detriment.

Without the limit, the concerns are that the website may not work on iOS devices at all, and that performance will be worse on Metal backends if the developer fails to pay attention.

@litherum (Contributor)

the website may not work on iOS devices at all

For the sake of discussion, let's say the number of open command buffers which would cause jetsam is 500. I think there are 3 classes of apps:

  1. Apps that only need fewer than 64 open command buffers at a time
  2. Apps that need between 64 and 500 open command buffers at a time
  3. Apps that need more than 500 open command buffers at a time

For the first class, they will work on WebKit and non-WebKit out of the box. Easy peasy.

For the third class, there is nothing WebKit can do to make them work. All we can do is ask developers nicely to please not write this kind of app (by adding language in the spec, or in MDN, or both). All WebKit can do is try to avoid crashing.

For the second class, WebKit must do something if it wants such apps to run. If WebKit says "sorry, you had more than 64 open command buffers, you just lost your device," then that's a real shame - the application could have run on the device.

So, the question we're discussing here is: "what should the mechanism be whereby WebKit grants more than 64 open command buffers to the app?"

One proposal is "the app should know exactly how many open command buffers it's going to need ahead of time, and it should tell that number to WebKit at device creation, and WebKit should support exactly that number on that device." Another proposal is "the app should indicate that it needs more open command buffers by .... trying to open more command buffers."

I don't think the issue at hand is "will the app run on iOS?" - both proposals involve the second class of apps getting what it asked for.

performance will be worse on Metal backends if the developer fails to pay attention

I think two things can simultaneously be true:

  • Developers should limit the number of open command buffers they create, and
  • The second class of applications should get all the command buffers they ask for

I also don't believe a limit is the best way to inform authors not to have too many open command buffers. It's too easy for an author to say "oh, I hit the default limit of 64, so I'm just going to bump up the limit so my application runs." And lo and behold, bumping the limit fixes the problem, all without redesigning their app to use fewer open command buffers. I think a better solution to this problem is to inform authors before they have their entire app running, perhaps when they're learning WebGPU in the first place, by reading MDN or the spec (or whatever else).

@kainino0x (Contributor)

For the third class, there is nothing WebKit can do to make them work. All we can do is ask developers nicely to please don't write this kind of app (by adding language in the spec, or in MDN, or both). All WebKit can do is try to avoid crashing.

Shouldn't the workarounds listed above - most simply record-and-replay - make this third class of application run just fine?
If there's nothing WebKit can do, there's also nothing Dawn/wgpu can do, but I don't think anyone has asserted that.

IIUC the only(?) objection was that it would penalize performance when you reach that point, without any way of pushing developers away from it. To push developers, we could pick a fixed number and have a warning (in all browsers) when you go over that number of open command encoders. (Though I'd want the number to be more than 64.) That's much more palatable than any kind of hard error, which I strongly think is going to cause serious new portability/reliability issues.

@mwyrzykowski (Author)

Shouldn't the workarounds listed above - most simply record-and-replay - make this third class of application run just fine? If there's nothing WebKit can do, there's also nothing Dawn/wgpu can do, but I don't think anyone has asserted that.

Yes, record-and-replay would work until we run out of memory; we could even coalesce most GPUCommandBuffers into a single MTLCommandBuffer with that approach. The downside, from very rough estimates on https://webgpu.github.io/webgpu-samples/?sample=animometer, is roughly 80% added frame latency from command recording in the driver.

IIUC the only(?) objection was that it would penalize performance when you reach that point, without any way of pushing developers away from it. To push developers, we could pick a fixed number and have a warning (in all browsers) when you go over that number of open command encoders. (Though I'd want the number to be more than 64.) That's much more palatable than any kind of hard error, which I strongly think is going to cause serious new portability/reliability issues.

A loud warning that you are using too many command buffers along with record-and-replay could be a good compromise.

The performance impact of using many command buffers seems to be growing, not shrinking. I tried the benchmark from #2189 (comment) on my M2 Mac Studio, and it reported over 1000% slower performance when splitting the buffers.

So as long as we strongly encourage developers not to use hundreds of command buffers but support the ones which decide to, I think WebKit would find that acceptable.

@litherum (Contributor) commented May 23, 2024

In my opinion (which doesn’t matter at all), the WebGPU spec should not require record-and-replay in any circumstances.

@mwyrzykowski (Author) commented May 23, 2024

In my opinion (which doesn’t matter at all), the WebGPU spec should not require record-and-replay in any circumstances.

Apple has the same opinion, and nothing else in the specification requires record-and-replay, as it impacts the real-time aspect of WebGPU. This was the initial motivation for a limit.

However, if the group is strongly opposed to the limit, then communicating to the developer via a warning of sorts ("you have too many GPUCommandEncoder / GPUCommandBuffer instances without calling GPUQueue.submit(); we have switched to record-replay at substantial latency cost; please re-architect your website / framework") would be acceptable. We anticipate most WebGPU users will use WebGPU through a framework, so evangelism could direct the frameworks to optimal API usage: keeping the command buffer count small and passing GPUCommandEncoders to 3rd-party frameworks instead of creating new ones.

I modified Myles' benchmark to see when we run out of command buffers. I also confirmed that simply requesting a queue with a higher max count does not appear to impact performance; using more than a certain number of command buffers is what causes the degradation. So we can simply request UINT_MAX, and there is no need for the create-new-queue-with-higher-limit -> copy-from-old-queue step.

M2 Mac Studio / M2 MacBookPro: ~64000 command buffers
iPhone 13 mini: ~4096 command buffers
7th gen iPad (model: iPad7,12 / Apple3 / Compat feature level): ~2048 command buffers

So, roughly, on the lowest-end devices, 2048 command buffers reach the jetsam memory limit of 300MB.

The real number an iOS app can have in flight is given, very roughly, by the following formula, solving for K, where K is the max command buffer count:

300 - bufferTextureIOSurfaceMemoryInMBs - K * (300 / 2048) = 0
300 - bufferTextureIOSurfaceMemoryInMBs = K * (300 / 2048)
K = (300 - bufferTextureIOSurfaceMemoryInMBs) / (300 / 2048)

Assuming the app has 150MB of resident texture, buffer, and other IOSurface memory, that leaves us with merely 150 command buffers (the initial math saying 21 was wrong) before the app gets killed due to memory pressure on lower-end devices.

Attaching the benchmark as there is a possibility my estimates are off:
CommandBufferTest.zip

It also prints out the runtime cost of submitting the same work using 1 command buffer vs. many command buffers, which is generally in the 500-2000% slower range on the devices I tried.

@aleino-nv

WebGPU through a framework so evangelism could direct the frameworks to optimal API usage by keeping the command buffer count small

I don't think it's about "big" or "small" -- rather, I think it's a matter of keeping the count appropriate for the CPU of the device.
Presumably the UMD + kernel have in all cases been designed to support rendering using all CPU cores, meaning there is in all cases enough memory for at least a handful of command buffers per core. Thus if apps are somewhat aware of what kind of CPU they're running on, they won't get anywhere near the limit where they would run out of command buffer memory.

I figure that if an app surpasses this limit anyway, an extra memcpy is probably not the worst of its performance problems.
In that case, warning and falling back to a slower mode seems fair.

@mwyrzykowski (Author)

WebGPU through a framework so evangelism could direct the frameworks to optimal API usage by keeping the command buffer count small

I don't think it's about "big" or "small" -- rather, I think it's a matter of keeping the count appropriate for the CPU of the device. Presumably the UMD + kernel have in all cases been designed to support rendering using all CPU cores, meaning there is in all cases enough memory for at least a handful of command buffers per core. Thus if apps are somewhat aware of what kind of CPU they're running on, they won't get anywhere near the limit where they would run out of command buffer memory.

My understanding is the main slowdown from multiple command buffers comes from the fact that UMA + TBDR architectures need to commit all the results back to unified memory at each command buffer boundary. Without going into processor specifics, the on chip and tile memory is much faster than the unified memory. The command buffer boundaries force a sync point, even for private memory (MTLStorageModePrivate), and the memory bandwidth causes the performance degradation.

I don't believe anything in the WebGPU spec prevents us from coalescing several GPUCommandBuffer instances into a single MTLCommandBuffer at GPUQueue.submit() time, but that has the downside of record-and-replay, which @litherum correctly pointed out we have avoided requiring so far.
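For illustration, a minimal record-then-replay sketch (not any browser's actual implementation; BackendEncoder is a stand-in for a native encoder such as Metal's):

```ts
// Encode-time calls only append closures; submit() replays them, so several
// recorded command buffers can be coalesced into one backend command buffer.
interface BackendEncoder {
  copyBufferToBuffer(src: number, dst: number, size: number): void;
}

class RecordedCommandBuffer {
  readonly commands: Array<(e: BackendEncoder) => void> = [];
}

class RecordingEncoder {
  private cb = new RecordedCommandBuffer();
  copyBufferToBuffer(src: number, dst: number, size: number): void {
    // Validate eagerly here if desired, but defer the real work until replay.
    this.cb.commands.push((e) => e.copyBufferToBuffer(src, dst, size));
  }
  finish(): RecordedCommandBuffer {
    return this.cb;
  }
}

function submitAll(backend: BackendEncoder, buffers: RecordedCommandBuffer[]): void {
  for (const cb of buffers) {
    for (const cmd of cb.commands) cmd(backend); // replay into a single encoder
  }
}
```

The latency cost mentioned above comes from exactly this structure: no command reaches the driver until submitAll runs.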

@Kangz
Contributor

Kangz commented May 28, 2024

GPU Web WG 2024-05-22 Atlantic-time
  • JB: Moz is not happy with imposing a limit on the number of GPUCommandEncoders. Memory consumption of buffering and replaying commands isn't substantial. If we impose a restriction on all users of this API because of this particular Metal limitation - would be nice to know that the numbers on Safari are so severe, and the workaround has such a severe impact, that it's prohibitive. Need a big weight to impose this kind of limitation. Would want to see some measurements - increases of KB/MB etc. Without details like that, it's not appealing.
  • MM: you mean - if every CommandQueue was created with the option "1000 max command buffers on this queue"?
  • JB: several points in the design space we discussed. One, doing an over-reservation. What's the impact of doubling the # buffers that a queue's ready to create? Another, command recording shouldn't do command encoding immediately, but buffer it up. wgpu does this already. So does Dawn. Talked with Mike about this - using memory for 2 things - my feeling is that it's true there's overlap but it's not a doubling. Think that first representation can be very compact. Not persuasive enough to me that we want to impose it on all our users.
  • MW: our biggest concern is the latency this adds. Have to wait for submit() to issue commands to the driver.
  • MM: historically, when Apple was implementing its GPU process, we started with a design like this. Queue commands, then replay upon submit(). Big perf regression, was a big deal. Immediately obvious from profiles. The first thing that we had to fix to get the GPU process to work performantly.
  • JB: maybe just do this as a fallback. When people create too many command buffers, fall back to buffering them up. A price they pay, that they opt into.
  • MW: that info's not really available to us. If we spec this, what happens when you get to this limit? The first line in the issue says that the validation for createCommandEncoder's empty in the spec.
  • CW: we could do an OOM there. Always possibility of device loss, too. Not ideal - wgpu has use cases that run into this. We don't see many. People will run into this, in production, when they least expect it. Device loss or OOM isn't great for the users. If we add a hard error, it's bad for every browser. If Safari wants graceful fallback, it can have it. Seems it should be easy to write a fallback path, because WebGPU is very stateless in its command encoding.
  • AL: a well written app would check roughly how many CPU cores are available. Would be fine on iOS where there are strict limits. Wouldn't hit the slow case where you virtualize the buffers. Why not make it robust and fall back to virtualized buffers?
  • MM: a lot of this hinges on whether apps will hit this limit or not. People are posting in the thread saying they do hit this limit. Nail down this question? A sufficient number will never be hit by real apps? Then raise the internal impl to that number, and go forward.
  • MW: WebKit doesn't let the core count be queried for security reasons. Want something in GPU supported limits - in Compat, if the dev knows they'll use a larger number than what might cause problems, they request it at their own risk. Also talked with the Metal team - there's a way to raise this limit, often needed when porting Vulkan games to Metal, but this isn't the optimal way to use the API.
  • JB: don't think a well written app can know in advance how many simultaneous command buffers it needs. 64 > 8, but < 1024. Can I have an architecture plugging together very disparate things? Folks concerned about load are not technical. Part of the web API responsibility to support that use case. Requiring global coordination is an anti-pattern.
  • MW: we don't understand the global coordination argument. WebGPU doesn't allow command encoding from multiple threads today, and if you don't call submit(), the command buffer won't be executed.
  • MM: in Safari I just checked navigator.hardwareConcurrency and it returns 8.
  • KN: according to a StackOverflow thread from 2019, desktop Safari returns 8 regardless of how many cores your mac has; and mobile safari returns 2 regardless of the number of cores the device has.
  • KN: you traverse scene graph and queue up all buffer updates, then render afterward. In single frame, possible to use multiple command buffers inside the frame. Don't know if people would be likely to go over 64. Esp if some are being recorded asynchronously. Plausible it could happen. I agree with Jim that there are probably good app designs where this can happen.

@teoxoy
Member

teoxoy commented May 29, 2024

I also confirmed simply requesting a queue with a higher max count does not appear to impact performance. Using more than a certain number causes the degradation. So we can simply request UINT_MAX.

If this is the case, do we need the limit in WebGPU? We could instead encourage users not to use that many command buffers. It seems mobile devices are the ones most affected by the perf degradation, and I don't think it would come as a surprise to developers targeting mobile.

I think we do need to allow GPUDevice.createCommandEncoder to return an OOM error though.
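A sketch of how a page could observe such an OOM through the existing error scope mechanism (assumes device is a GPUDevice and that this runs inside an async function; createCommandEncoder itself stays synchronous):

```ts
device.pushErrorScope('out-of-memory');
const encoder = device.createCommandEncoder();
const error = await device.popErrorScope();
if (error) {
  // Too many open encoders / unsubmitted command buffers:
  // submit or drop pending work before retrying.
  console.warn('createCommandEncoder OOM:', error.message);
}
```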

@mwyrzykowski
Author

Encouraging users to not use that many command buffers along with an OOM error on GPUDevice.createCommandEncoder seems like it would work 👍

@kainino0x kainino0x added needs-cts-issue This change requires tests (or would need tests if accepted), but may not have a CTS issue filed yet api resolved Resolved - waiting for a change to the API specification labels May 29, 2024
@kainino0x
Contributor

+needs-cts-issue to test creating 1000 concurrent command encoders as agreed upon in the meeting.
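A rough sketch of such a test (not the actual CTS code): open 1000 encoders concurrently, then finish and submit them all, expecting no error:

```ts
const encoders: GPUCommandEncoder[] = [];
for (let i = 0; i < 1000; ++i) {
  encoders.push(device.createCommandEncoder()); // all left open simultaneously
}
device.queue.submit(encoders.map((e) => e.finish()));
```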

@mwyrzykowski
Author

Seems resolved, closing

mwyrzykowski added a commit to mwyrzykowski/WebKit that referenced this issue Jun 6, 2024
…r's behalf, instead trigger device lost (275212)

https://bugs.webkit.org/show_bug.cgi?id=275212
<radar://129343583>

Reviewed by NOBODY (OOPS!).

Per gpuweb/gpuweb#4622 (comment), the API
will allow for creation of up to 1000 GPUCommandEncoder and non-submitted GPUCommandBuffer
instances.

Ideally we will advocate for minimal usage, but there was strong pushback to setting a lower
limit. So we can inform developers that, to obtain optimal performance on Apple's platforms,
they should put several milliseconds of work in each command buffer rather than spreading
less work than that across a large number of buffers.

* Source/WebGPU/WebGPU/Device.h:
* Source/WebGPU/WebGPU/Device.mm:
(WebGPU::Device::create):
* Source/WebGPU/WebGPU/Queue.mm:
(WebGPU::Queue::commandBufferWithDescriptor):
webkit-commit-queue pushed a commit to mwyrzykowski/WebKit that referenced this issue Jun 7, 2024
…r's behalf, instead trigger device lost (275212)

https://bugs.webkit.org/show_bug.cgi?id=275212
<radar://129343583>

Reviewed by Dan Glastonbury.

Per gpuweb/gpuweb#4622 (comment), the API
will allow for creation of up to 1000 GPUCommandEncoder and non-submitted GPUCommandBuffer
instances.

Ideally we will advocate for minimal usage, but there was strong pushback to setting a lower
limit. So we can inform developers that, to obtain optimal performance on Apple's platforms,
they should put several milliseconds of work in each command buffer rather than spreading
less work than that across a large number of buffers.

* Source/WebGPU/WebGPU/Device.h:
* Source/WebGPU/WebGPU/Device.mm:
(WebGPU::Device::create):
* Source/WebGPU/WebGPU/Queue.mm:
(WebGPU::Queue::commandBufferWithDescriptor):

Canonical link: https://commits.webkit.org/279799@main