Should there be a limit on the number of open GPUCommandEncoder instances? #4622
I thought we had discussed this a long time ago in the group (and concluded not to add a limit) but can't find anything better than https://github.com/gpuweb/gpuweb/wiki/Minutes-2018-09-27 (search for "in-flight"). IMHO it would be unfortunate to expose this limit to applications because it is a form of global-state validation, instead of the mostly local validation we have right now. It also seems like it could be papered over fairly easily by browsers (IIRC Firefox sends the whole command buffer to the GPU process only on submission). There are other issues, like WebGPU not being the only user of the Metal device: any limit is potentially too large when other components in the same process use the singleton MTLDevice for rendering 2D UI or other things (other WebGPU tabs too?).
This was also complicated a bit at the time. Years ago we hardcoded a higher limit of 2048 for gfx/wgpu because it was pretty easy to hit 64 pending command buffers depending on the application/game we tested. I don't know if there are better suggestions for a reasonable default now, but it would be helpful to know whether implementations that open command buffers immediately could avoid this validation by raising the limit to a much larger value.
Historically, I was under the impression the group has avoided requiring implementations to defer work, as commented previously: this is the only issue we have come across that may actually require large amounts of work to be deferred until submission.
I don't think this is a concern since the underlying limit is on the MTLQueue. Presumably, the order of these operations would be enforced by the browser implementation. Creating multiple MTLQueue instances for WebGPU is not practical due to the ordering of operations on resources, which is enforced at the driver level by using a single queue. This is not a concern for 2D UI or other WebGPU tabs, which would have their own queues and no interdependencies on each other. @Kangz the reason we are asking for a limit here is that WebKit would likely throw an out-of-memory error after 64 GPUCommandEncoders which have yet to call finish. Presumably at some point, even in a deferred implementation, an out-of-memory error would need to be thrown on the GPUDevice.createCommandEncoder call, so WebKit's implementation would fall into that path at 64 (GPUCommandEncoder + non-committed GPUCommandBuffer) instances. If we had a limit, we could use it to size the underlying queue via -[MTLDevice newCommandQueueWithMaxCommandBufferCount:].
Right, 64 seemed too limiting for some applications we tested with. There doesn't seem to be a good workaround for in-flight command buffers though, because there's no way to raise or lower the max command buffer count for an existing queue as far as I know (without asking the application to re-record its commands onto a new queue). For implementations that immediately start recording command buffers, it seems necessary to raise the limit to a higher, but still conservative, value for now.
Unless I'm misunderstanding something, a limit of 64 (including in-flight) command buffers would make it impossible for an app to have a multithreaded renderer with more than 64 worker threads? (The effective limit would be even lower, since in-flight buffers are counted.) Can this be solved in the browser's WebGPU/Metal back-end? If Metal command buffers are expensive, then the Metal back-end could "virtualize" command buffers, i.e. offer cheap command buffers that just live in regular memory for the app to own. Then, when these are submitted, it would copy these command buffers into a set of real Metal command buffers owned and managed by the Metal back-end.
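The "virtualize" idea can be sketched roughly as follows (all class and method names are hypothetical, not part of any real browser backend): encoders record plain closures in CPU memory, and only submit() replays them into a single stand-in for a real Metal command buffer, so the backend-side buffer count stays constant no matter how many encoders the app keeps open.

```typescript
// Hypothetical sketch: "virtual" command buffers live in plain memory.
type BackendCommand = (backendLog: string[]) => void;

class VirtualCommandBuffer {
  readonly commands: BackendCommand[] = [];
}

class VirtualCommandEncoder {
  private buffer = new VirtualCommandBuffer();

  // Recording is cheap: append a closure, no backend object is created.
  record(cmd: BackendCommand): void {
    this.buffer.commands.push(cmd);
  }

  finish(): VirtualCommandBuffer {
    return this.buffer;
  }
}

class VirtualQueue {
  // Stands in for the single real MTLCommandBuffer the backend would own.
  readonly backendLog: string[] = [];

  // Replay every virtual buffer into the one real buffer at submit time,
  // keeping the Metal-side count constant regardless of open encoder count.
  submit(buffers: VirtualCommandBuffer[]): void {
    for (const b of buffers) {
      for (const cmd of b.commands) cmd(this.backendLog);
    }
  }
}
```

Under this scheme an app could keep hundreds of encoders open without ever touching the Metal queue's 64-buffer default, at the cost of the extra record-then-replay copy discussed later in this thread.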
I don't think the API today allows for multithreading across worker threads: https://github.com/gpuweb/gpuweb/wiki/The-Multi-Explainer#multi-threading-javascript When such a feature is discussed in more detail and introduced, we would prefer it to be implementable via MTLParallelRenderCommandEncoder, which allows encoding in parallel amongst multiple threads without using anywhere close to 64 command buffers.
Yes, but I assume when it becomes supported, then it will be supported by the same WebGPU command encoder objects we're now discussing a limit for? If so, I don't see why it matters whether multithreaded command recording works today.
I had a look at the documentation for MTLParallelRenderCommandEncoder. It seems like in Metal you'd share one 'buffer' between multiple threads, and create from that buffer one (or more?) 'multithreaded command encoders' per thread. (?)
Apple's position would likely be for that not to be the case, as it is not implementable via Metal without essentially falling back to single threaded encoding.
An MTLCommandBuffer is not sharable across threads, except when using MTLParallelRenderCommandEncoder and calling -[MTLParallelRenderCommandEncoder renderCommandEncoder] for each thread. This deserves a discussion, but I would say it is outside the scope of this issue, which concerns the limit on the number of command buffers for single-threaded encoding. Outside of the CTS, which only triggered the issue due to #4599, we have not seen any usage require more than 64 command buffers. That is why either a field in the GPUQueueDescriptor or an additional field in GPUSupportedLimits seems appropriate, as some applications directly ported from another backend may use more than 64 buffers at once.
Right, this is what I meant by "sharing". To me it's as if the various terms are a bit mixed up between Vulkan and Metal.
I don't understand why one might want a shareable (Vulkan) 'command pool' object.
Command buffers in Metal are somewhat expensive. If you want to perform multithreaded encoding, then creating multiple command buffers is much slower than using one buffer with a parallel encoder. Command buffer boundaries force results to become CPU-visible, whereas a parallel render command encoder avoids this.
My understanding of Metal is the following: parallel render command encoders would map to multithreaded encoding of a single render pass.
In the meeting it was quite clear that there is a cost to command buffers in Metal. However, adding a limit in the specification seems like it 1) will not solve the entirety of the problem, 2) will cause pain to developers, and 3) isn't necessary to mitigate the issue.

1. It doesn't solve the problem

The limit in Metal is on the total number of command buffers currently created on an MTLQueue: not just the ones currently encoding, but also the command buffers already in flight. This means that, to be precise, the limit in WebGPU would need to account for the latter as well, and this would be extremely non-portable. The limit also includes command buffers created for the use of the WebGPU implementation itself, for example for certain complex operations. In the future, if command buffers are that expensive, browsers are potentially going to multiplex multiple WebGPU command buffers onto fewer Metal command buffers.

2. Pain to developers

In WebGPU we tried to have as little stateful validation as possible, let alone global validation (do we have any, except for device loss?), because each additional piece of stateful/global validation makes it more difficult for developers to ensure their application is not going to run into it in production. Especially for global validation. Imagine an e-commerce site showing many 3D models in different canvases. Each of them is likely to have a different render loop, and it would be much better if the site didn't stop rendering after you add the 65th command encoder. That no application ran into this in production for Safari, but that wgpu-rs did need to increase a limit, is a strong indicator that as WebGPU gets more use, more and more developers risk running into this issue without knowing it, and that sites will break when pushed a bit too hard. In general, in WebGPU we do the hard stuff so that developers don't have to worry about it. Examples are barriers in Vulkan, descriptor set management, driver workarounds, and memory allocation and suballocation.

3. Other mitigations

A lot of these were proposed in the last meetings (I'm not taking credit for them). As a first step the WebGPU specification could allow for some graceful failure when allocating a command encoder. A slot hint from the site could be added to the default queue descriptor, although I think most developers would just ignore it. A suggestion was to detect when the internal limit is reached and recreate a command queue with more command buffers allowed, eventually migrating to run entirely on this queue. Or the implementation could resort to record-replay mechanisms after some internal limit is reached. This is some work, but not a crazy amount, as both Dawn and wgpu implemented variants of this.
I'm not sure if these points are accurate.
To be precise, as long as the command buffer has called commit (which GPUQueue.submit would do), it no longer counts against the Metal limit. So it would be portable, since tracking the number of GPUCommandEncoder instances that have not called finish() plus the GPUCommandBuffers not yet passed to GPUQueue.submit is quite simple. I think it does solve the problem, unless there is something else which hasn't been raised.
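The bookkeeping described here is simple enough to sketch (hypothetical names; a real implementation would generate a WebGPU out-of-memory error rather than throw): the tracked count is encoders that haven't called finish() plus finished-but-unsubmitted buffers, and submitted buffers stop counting.

```typescript
// Hypothetical tracker for "open" command buffers: encoders without finish()
// plus finished-but-unsubmitted GPUCommandBuffers. Submitted buffers are
// committed on the Metal side and no longer count against the limit.
class OpenBufferTracker {
  private open = 0;

  constructor(readonly limit: number) {}

  get openCount(): number {
    return this.open;
  }

  // Would run inside createCommandEncoder(); throws in place of a real
  // implementation generating an out-of-memory error.
  onCreateEncoder(): void {
    if (this.open >= this.limit) {
      throw new Error("too many open command encoders/buffers");
    }
    this.open += 1;
  }

  // finish() turns an open encoder into an unsubmitted buffer: no change.
  onFinish(): void {}

  // submit() commits the buffers, releasing their slots.
  onSubmit(bufferCount: number): void {
    this.open -= bufferCount;
  }
}
```

Note that nothing here needs to observe GPU-side completion, which is the portability point being made: the count depends only on API calls the site itself makes.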
This does not impact the limit since the implementation can call commit on the MTLCommandBuffer.
This scenario was raised in the meeting but I don't really understand it. Do you mean, from a single worker or from the main thread, the website creates many GPUCommandEncoder instances via GPUDevice.createCommandEncoder(), begins render or compute passes, and doesn't submit them? As long as they eventually submit, this shouldn't be an issue.
I would rather say 'a naively ported application', or something along those lines, would run into this limit. This is one reason for -[MTLDevice newCommandQueueWithMaxCommandBufferCount:]. It is partly an escape hatch for an application which was ported from Vulkan or a different backend directly to Metal without concern for re-architecting for performance.
If they ignore it, that is a stronger argument for adding a GPUSupportedLimit and enforcing validation for it across implementations, as portability across devices is a goal of WebGPU.
We could do this, but it comes at significant cost, hence the desire for a GPUSupportedLimit. A developer raising the limit seems to know what they are doing. It would also give us the ability to report a higher limit for certain devices (Macs) and a lower one for mobile devices. Requiring the developer to respect the limit would increase portability across devices.
We discussed this internally within WebKit, but the conclusion is that a record-replay approach is not optimal for real-time applications on mobile devices. The recording of commands happens twice each render update, once internally in the browser implementation and then again in the driver, increasing latency. As we continue the discussion, it seems a GPUSupportedLimit is the best approach, as one of WebGPU's goals is portability across devices.
To clarify the multithreading question: once we have multithreading, I don't think it will be practical to prevent creating command encoders from multiple threads at once. If there's a limit, then it's global, and people will hit it. The only thing that an MTLParallelRenderCommandEncoder-equivalent concept does is provide developers with an alternative way of encoding that won't hit that limit (one they can only use if they're parallelizing a single render pass). I don't see how we could force people to use it; an app could just as easily create one command encoder per thread. So, the limit improves portability in one way, but also introduces a serious new portability hazard. Of course, there are always hazards to spinning up a lot of threads, but typically machines with that many cores can handle a decent amount of work happening on each one simultaneously. The problem with a limit is that it would be enforced at 64 by default regardless of the system's capabilities.
The documentation says the maximum limits the number of command buffers that haven't completed execution yet. That supports Corentin's interpretation. Is the documentation incorrect?
We could do it without needing to account for submitted-but-not-completed command buffers in the validation, by instead counting only open encoders and unsubmitted command buffers.

FWIW, we could sort of hack our way around this limit by hard-limiting each thread at 64 command buffers, while allowing other threads to proceed with their own allotment.

(Assuming from here forward that submitted-but-not-completed command buffers are not an issue.)
As long as each widget is calling submit, that scenario shouldn't hit the limit.
The documentation is correct here, and tracking the buffers in flight is not needed. As noted in the documentation for -[MTLCommandQueue commandBuffer]:

"a committed command buffer will either complete successfully or complete due to an error."
Ohh, I see. Sorry, you said this, and I didn't understand for some reason.
What it ultimately comes down to is this: do we think of WebGPU command buffers as cheap enough that developers can create one per core without worrying, or don't we? If we do, that means they don't map to 'Metal command buffers'; thus the Metal backends ought not to associate 'Metal command buffers' 1-to-1 with 'WebGPU command buffers', and thus there is no need to impose a limit on WebGPU. If we don't, then what's WebGPU's story for multithreaded rendering on more than 64 threads?
Also, a good application would not try to create a renderer with anywhere close to 64 rendering threads on an iPhone anyway!
I don't think this is accurate. One of the goals of WebGPU, and the web in general, is portability, but that doesn't mean a limit can't be based on the capabilities of the device. Quite the opposite: most of the limits in GPUSupportedLimits are values which vary between devices/hardware. This is the reasoning behind WebGPU using the default limits when none are requested on the GPUDevice: portability. If the developer chooses to request a higher limit and the browser supports it, then it is perfectly acceptable to create any number of command buffers up to the supported limit.
Apple's preference would be a design which is implementable with MTLParallelRenderCommandEncoder. But I don't want to derail this conversation by discussing that, as WebGPU does not support multithreaded rendering with a single device today. We may also choose to allow the developer to create multiple GPUQueue instances; that would be another solution.
It makes no sense to take a limit from a completely different concept ('Metal command buffer') and impose it on an API object ('WebGPU command buffer') just because they happen to be named the same. |
GPU Web WG 2024-05-15
Given that:
... Here's what I think should happen:
Yes, that would resolve the issue. The concerns are portability and performance. We have gone through extensive discussions to ensure portability in the past, e.g., #3961. If we are not concerned about the impact on one of the three major backends supported by WebGPU, then we can say an implementation may report device lost or return a GPUOutOfMemory error on the call to createCommandEncoder. We have previously shown that splitting command buffers impacts performance substantially (#2189 (comment)), and there is no practical way for a WebGPU framework to coalesce two or more GPUCommandBuffer instances into a single MTLCommandBuffer. If we establish this limit, we place the responsibility on the website to keep the number of command buffers small to avoid performance cliffs. Or, if it is a GPUSupportedLimit, we allow the website to raise the limit from the default to its own detriment. Without the limit, the concern is:
For the sake of discussion, let's say the number of open command buffers which would cause jetsam is 500. I think there are 3 classes of apps: those that need at most 64 open command buffers, those that need between 64 and 500, and those that need more than 500.
For the first class, they will work on WebKit and non-WebKit out of the box. Easy peasy. For the third class, there is nothing WebKit can do to make them work. All we can do is ask developers nicely to please not write this kind of app (by adding language in the spec, or in MDN, or both); all WebKit can do is try to avoid crashing. For the second class, WebKit must do something if it wants such apps to run. If WebKit says "sorry, you had more than 64 open command buffers, you just lost your device," then that's a real shame - the application could have run on the device. So, the question we're discussing here is: what should the mechanism be whereby WebKit grants more than 64 open command buffers to the app? One proposal is "the app should know exactly how many open command buffers it's going to need ahead of time, tell that number to WebKit at device creation, and WebKit should support exactly that number on that device." Another proposal is "the app should signal that it needs more open command buffers by .... trying to open more command buffers." I don't think the issue at hand is "will the app run on iOS?" - both proposals involve the second class of apps getting what it asked for.
I think two things can simultaneously be true:
I also don't believe a limit is the best way to inform authors not to have too many open command buffers. It's too easy for an author to say "oh, I hit the default limit of 64, so I'm just going to bump up the limit so my application runs." And lo and behold, bumping the limit fixes the problem, all without redesigning their app to use fewer open command buffers. I think a better solution is to inform authors before they have their entire app running, perhaps when they're learning WebGPU in the first place, by reading MDN or the spec (or whatever else).
Shouldn't the workarounds listed above - most simply record-and-replay - make this third class of application run just fine? IIUC the only(?) objection was that it would penalize performance when you reach that point, without any way of pushing developers away from it. To push developers, we could pick a fixed number and emit a warning (in all browsers) when you go over that number of open command encoders. (Though I'd want the number to be more than 64.) That's much more palatable than any kind of hard error, which I strongly believe would cause serious new portability/reliability issues.
Yes, record-and-replay would work until we run out of memory; we could even coalesce most GPUCommandBuffers into a single MTLCommandBuffer with that approach. The downside, from very rough estimates on https://webgpu.github.io/webgpu-samples/?sample=animometer, is roughly 80% additional frame latency spent in command recording in the driver.
A loud warning that you are using too many command buffers, along with record-and-replay, could be a good compromise. The performance impact of using many command buffers seems to be growing, not shrinking: I tried the benchmark from #2189 (comment) on my M2 Mac Studio and it reported over 1000% slower when splitting the buffers. So as long as we strongly encourage developers not to use hundreds of command buffers, but support the ones which decide to, I think WebKit would find that acceptable.
In my opinion (which doesn't matter at all), the WebGPU spec should not require record-and-replay in any circumstances.
Apple has the same opinion, and nothing else in the specification requires record-and-replay, as it impacts the real-time aspect of WebGPU. This was the initial motivation for a limit. However, if the group is strongly opposed to the limit, then communicating to the developer, via a warning of sorts, that they have too many open command encoders would be acceptable.

I modified Myles' benchmark to see when we run out of command buffers. I also confirmed that simply requesting a queue with a higher max count does not appear to impact performance; using more than a certain number of buffers is what causes the degradation. So we can simply request a queue with a higher maximum command buffer count up front.
So, roughly, on the lowest-end devices, 2048 command buffers reaches the jetsam memory limit of 300MB. The real number an iOS app can have in flight can be estimated, very roughly, by dividing the remaining memory budget by the per-command-buffer cost. Assuming the app has 150MB of resident texture, buffer, and other IOSurface memory, that leaves us with roughly 1000 open command buffers.

Attaching the benchmark, as there is a possibility my estimates are off. It also prints out the runtime cost of submitting the same work using 1 command buffer vs. many command buffers, which is generally in the 500-2000% slower range on the devices I tried.
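As a back-of-envelope check of those estimates (all figures are the rough numbers quoted above, not measurements of my own): 300MB spread over 2048 buffers is about 0.15MB per open command buffer, and with 150MB of the budget already resident, roughly a thousand open buffers fit before jetsam.

```typescript
// Rough arithmetic using the figures quoted in the discussion (approximate).
const jetsamBudgetMB = 300;      // memory limit where iOS jetsams the process
const buffersAtJetsam = 2048;    // open command buffers that hit that limit
const perBufferMB = jetsamBudgetMB / buffersAtJetsam; // ~0.146 MB each

const residentMB = 150;          // texture/buffer/IOSurface memory in use
const headroomMB = jetsamBudgetMB - residentMB;
const maxOpenBuffers = Math.floor(headroomMB / perBufferMB); // ~1024
```

This is consistent with the limit of 1000 open encoders/buffers that WebKit later adopted.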
I don't think it's about "big" or "small" - rather, I think it's a matter of keeping the count appropriate for the CPU of the device. I figure that if an app surpasses this limit anyway, an extra memcpy is probably not the worst of its performance problems.
My understanding is the main slowdown from multiple command buffers comes from the fact that UMA + TBDR architectures need to commit all results back to unified memory at each command buffer boundary. Without going into processor specifics, the on-chip tile memory is much faster than unified memory. The command buffer boundaries force a sync point, even for private memory (MTLStorageModePrivate), and the memory bandwidth causes the performance degradation. I don't believe anything in the WebGPU spec prevents us from coalescing several GPUCommandBuffer instances into a single MTLCommandBuffer at submit time.
GPU Web WG 2024-05-22 Atlantic-time
If this is the case, do we need the limit in WebGPU? We can instead encourage users not to use that many command buffers. It seems mobile devices are the ones affected by the perf degradation the most, and I don't think it would be a surprise for developers targeting mobile. I think we do need to allow some form of graceful failure, though.
Encouraging users not to use that many command buffers, along with an OOM error on createCommandEncoder when resources are exhausted, sounds reasonable.
Seems resolved, closing |
…r's behalf, instead trigger device lost (275212)

https://bugs.webkit.org/show_bug.cgi?id=275212
<radar://129343583>

Reviewed by Dan Glastonbury.

Per gpuweb/gpuweb#4622 (comment), the API will allow for creation of up to 1000 GPUCommandEncoder and non-submitted GPUCommandBuffer instances. Ideally we will advocate for minimal usage, but there was strong pushback against setting a lower limit. So we can inform developers that, for optimal performance on Apple's platforms, they should keep several milliseconds of work in each command buffer rather than spreading the same work across a large number of buffers.

* Source/WebGPU/WebGPU/Device.h:
* Source/WebGPU/WebGPU/Device.mm: (WebGPU::Device::create):
* Source/WebGPU/WebGPU/Queue.mm: (WebGPU::Queue::commandBufferWithDescriptor):

Canonical link: https://commits.webkit.org/279799@main
The validation for GPUDevice.createCommandEncoder is mostly missing from the specification (https://gpuweb.github.io/gpuweb/#issue-f6026bc8), but there currently seems to be nothing preventing a website from creating hundreds of thousands of GPUCommandEncoders, attempting to encode commands, and never calling finish.

There doesn't appear to be any known website or CTS coverage of this, but the default Metal limit for the number of command buffers is documented to be 64. So after the 64th encoder which encodes commands and does not call finish, we would see a hang if the command buffers aren't discarded or completed in some fashion.
An implementation could theoretically store the commands until Queue.submit time to get the correct ordering, but that ends up deferring a lot of work, and it would be challenging to predict whether the commands need to be buffered or not.
It seems like we should have a limit on the number of open GPUCommandEncoders as most applications are unlikely to run into this issue and attempting to solve it in the implementations for the edge case seems like unnecessary, wasted effort.
It could possibly be a hard coded limit in the validation for https://gpuweb.github.io/gpuweb/#issue-f6026bc8 depending on the other backends.