Use @builtin(num_workgroups) or @builtin(local_invocation_index) is very slow on M1 MacBook #5677
This is expected behavior. You are using a workgroup size of (1, 1, 1), which is extremely small and will destroy performance. Instead, use a larger workgroup size (32x32 tends to be good) and dispatch correspondingly fewer workgroups: (125, 32) in your case. Finally, add a branch to check that you are within the bounds of the array, since the dispatch is now actually acting over an "area" of 1024 rather than 1000. If your index is 1000 or above, you don't want to touch the array.
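The arithmetic behind the suggested dispatch can be sketched on the host side. This is a minimal illustration, not code from wgpu: the helper names `workgroups_needed` and `in_bounds` are hypothetical, and the 4000 x 1000 problem size is taken from the dispatch reported in this issue. Note that a 32x32 workgroup is 1024 invocations, which may exceed a device's default per-workgroup invocation limit, so it is worth checking the adapter's limits before picking a size.

```rust
/// Number of workgroups needed to cover `items` invocations along one axis
/// when each workgroup spans `workgroup_size` invocations (rounds up).
/// Hypothetical helper, not part of the wgpu API.
fn workgroups_needed(items: u32, workgroup_size: u32) -> u32 {
    (items + workgroup_size - 1) / workgroup_size
}

/// Mirrors the bounds check the shader would perform: invocations that land
/// in the rounded-up padding must not touch the array.
fn in_bounds(x: u32, y: u32, width: u32, height: u32) -> bool {
    x < width && y < height
}

fn main() {
    // Problem size assumed from the (4000, 1000, 1) dispatch in the issue.
    let (width, height) = (4000u32, 1000u32);
    let wg = 32u32;

    let groups_x = workgroups_needed(width, wg);
    let groups_y = workgroups_needed(height, wg);
    println!("dispatch ({groups_x}, {groups_y}, 1)"); // dispatch (125, 32, 1)

    // The y axis is over-covered: 32 groups * 32 invocations = 1024 rows.
    let padded_rows = groups_y * wg - height;
    println!("rows that must be skipped: {padded_rows}"); // 24

    // The last real row is in bounds, the first padded row is not.
    println!("{}", in_bounds(0, 999, width, height)); // true
    println!("{}", in_bounds(0, 1000, width, height)); // false
}
```

In the shader, the equivalent of `in_bounds` would be an early return at the top of the entry point, keyed on the global invocation id.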
Thanks for your reply.
@compute
@workgroup_size(workgroup_len)
fn main(@builtin(workgroup_id) workgroup_id: vec3<u32>,
@builtin(num_workgroups) num_workgroups: vec3<u32>,
@builtin(local_invocation_index) local_invocation_index: u32
) {
var sum = 0;
var si = 0u;
var index = 0u;
if (local_invocation_index == 0) {
item_si = workgroup_id.x + workgroup_id.y * num_workgroups.x + workgroup_id.z * num_workgroups.x * num_workgroups.y;
item_index = item_si + (item_si / 4000) * (4040 - 4000);
}
workgroupBarrier();
index = item_index;
si = item_si;
for (var i = local_invocation_index; i < arrayLength(&kernal_offset); i += workgroup_len) {
sum += data[i32(index) + kernal_offset[i]] * kernel_value[i];
}
if (local_invocation_index != workgroup_len - 1) {
atomicAdd(&item_sum, sum);
}
workgroupBarrier();
if (local_invocation_index == workgroup_len - 1){
output[si] = sum + item_sum;
}
}
In short, no matter how bad the code is, using @builtin(num_workgroups) instead of a constant number should not take additional time.
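The index arithmetic in the shader above can be checked on the CPU. The sketch below assumes that 4000 is the number of outputs per row and 4040 is the stride of a padded input row; the shader does not say so explicitly, and the function names are hypothetical. Since `num_workgroups` is simply the size of the dispatch, passing `[4000, 1000, 1]` here corresponds to substituting a constant when the dispatch size is fixed.

```rust
/// Flattens a 3D workgroup id into a linear index, the same way the shader
/// combines `workgroup_id` with `num_workgroups`.
fn linear_workgroup_index(id: [u32; 3], num_workgroups: [u32; 3]) -> u32 {
    id[0] + id[1] * num_workgroups[0] + id[2] * num_workgroups[0] * num_workgroups[1]
}

/// Maps a linear output index to an index into the input buffer, using the
/// constants from the shader. Interpreting them as an output row length and
/// a padded input stride is an assumption made for this sketch.
fn padded_input_index(item_si: u32) -> u32 {
    const OUTPUT_ROW: u32 = 4000;
    const INPUT_STRIDE: u32 = 4040;
    item_si + (item_si / OUTPUT_ROW) * (INPUT_STRIDE - OUTPUT_ROW)
}

fn main() {
    // Dispatch size from the issue; this is what `num_workgroups` reports.
    let num_workgroups = [4000u32, 1000, 1];

    // First workgroup of the second row.
    let si = linear_workgroup_index([0, 1, 0], num_workgroups);
    println!("{si} -> {}", padded_input_index(si)); // 4000 -> 4040

    // Last workgroup of the dispatch.
    let si = linear_workgroup_index([3999, 999, 0], num_workgroups);
    println!("{si} -> {}", padded_input_index(si)); // 3999999 -> 4039959
}
```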
I don't know what we can do about this though. We are doing an almost 1:1 translation from WGSL to MSL and the items in question are
Thank you for your contribution to this project!
But it is fine on Windows 10 with an Nvidia 2080 Ti.
When I dispatch (4000, 1000, 1) workgroups without using @builtin(num_workgroups) or @builtin(local_invocation_index), it takes just 90 ms, but if I use these two builtin values, it takes 1800 ms on an M1 MacBook.
wgpu version: 0.20.0 & latest https://github.com/gfx-rs/wgpu.git