
Text classification example gives "Shader validation error" when run on multiple GPUs #1745

Open
joshhansen opened this issue May 7, 2024 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@joshhansen

Describe the bug
Running the text classification example's AG News training step on multiple discrete GPUs fails with a "Shader validation error".

This error partially overlaps with the one reported in #1088.

To Reproduce
On a system with two or more discrete GPUs:

git clone https://github.com/tracel-ai/burn.git
cd burn/examples/text-classification

Edit examples/ag-news-train.rs like so:

-        launch::<Autodiff<Wgpu<AutoGraphicsApi, ElemType, i32>>>(vec![WgpuDevice::default()]);
+        launch::<Autodiff<Wgpu<AutoGraphicsApi, ElemType, i32>>>(vec![
+            WgpuDevice::DiscreteGpu(0),
+            WgpuDevice::DiscreteGpu(1),
+        ]);

cargo run --example ag-news-train --features wgpu

Expected behavior
The training proceeds, utilizing both GPUs.

Desktop (please complete the following information):

  • OS: Linux Mint 21.3 Cinnamon
  • Kernel 6.5.0-28-generic
  • Burn 0.14 master commit: a8661a2
  • Threadripper 7965WX on ASUS WRX90E-SAGE
  • 4x RTX 6000 Ada GPUs
  • Nvidia 545.29.06
@nathanielsimard nathanielsimard self-assigned this May 8, 2024
@nathanielsimard nathanielsimard added the bug Something isn't working label May 8, 2024
@nathanielsimard
Member

Looking at the experiment.log, the problem seems to come from Vulkan's validation layer rather than from a multi-device error. I tested on my system and I can run the training with multiple devices. You could try disabling Vulkan's validation layer (branch wgpu-no-validation).

Alternatively, you could try the LibTorch backend instead.
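[Editor's aside, not from the thread: if rebuilding against the wgpu-no-validation branch is inconvenient, another way to test the same hypothesis is to ask the Vulkan loader itself not to inject the Khronos validation layer. This assumes a reasonably recent Vulkan loader (1.3.234 or newer) that honors `VK_LOADER_LAYERS_DISABLE`:]

```shell
# Assumes a Vulkan loader new enough to honor VK_LOADER_LAYERS_DISABLE;
# this prevents the Khronos validation layer from being loaded for this run.
VK_LOADER_LAYERS_DISABLE='VK_LAYER_KHRONOS_validation' \
    cargo run --example ag-news-train --features wgpu
```

If the "Shader validation error" persists even with the layer disabled, it is more likely being raised by wgpu itself than by the Vulkan validation layer.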

@joshhansen
Author

Training does appear to work with the LibTorch GPU backend with multiple GPUs specified. That may not be of much use to me, though; I am specifically migrating away from LibTorch because of its lack of thread safety.

Running on the wgpu-no-validation branch surprisingly results in the same validation error:
experiment.log

@nathanielsimard
Member

@joshhansen My intuition is that the problem may come from a precision error: wgpu can't convert the literal to a float32. If you change that value, does it work?

@joshhansen
Author

Change 0.00000000023283064365386963f? My apologies, I'm not familiar with Burn's compilation process; where would that value "live" such that I could modify it?

@nathanielsimard
Member

@joshhansen I guessed it was a constant defined by your code 😅
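[Editor's aside, not from the thread: the literal in question, 0.00000000023283064365386963, is exactly 2^-32 (1/4294967296), a constant commonly used to map a random u32 into the unit interval. Since it is a power of two well within f32's exponent range, the value itself is exactly representable in float32, which a small standalone check (a sketch, not Burn code) can confirm:]

```rust
fn main() {
    // The literal from the thread; it is exactly 2^-32 = 1/4294967296.
    let literal: f64 = 0.00000000023283064365386963;
    assert_eq!(literal, 1.0 / 4294967296.0);

    // Powers of two in this range are exactly representable in IEEE-754
    // binary32, so an f64 -> f32 -> f64 round-trip loses nothing.
    let round_trip = literal as f32 as f64;
    assert_eq!(round_trip, literal);

    println!("literal is exactly 2^-32 and round-trips through f32");
}
```

If this check passes, the value converts to f32 without loss, which would suggest the validation error stems from how the generated shader source spells or types the literal, rather than from the value being unrepresentable.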
