Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues running BERT on Windows #166

Open
pixelspark opened this issue May 9, 2023 · 4 comments
Open

Issues running BERT on Windows #166

pixelspark opened this issue May 9, 2023 · 4 comments
Assignees

Comments

@pixelspark
Copy link
Collaborator

On Windows, running BERT models leads to an error about loop unrolling in HLSL. Apparently one of the shaders has a loop with a too large (static) number of iterations. Same model runs fine on Mac/Metal. Will look into this later.

@pixelspark pixelspark self-assigned this May 9, 2023
@pixelspark
Copy link
Collaborator Author

pixelspark commented May 22, 2023

The error:

Caused by:
    In Device::create_compute_pipeline
    Internal error: FXC D3DCompile error (0x80004005): (81,21-35): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
(35,5-15): error X3511: unable to unroll loop, loop does not appear to terminate in a timely manner (329 iterations) or unrolled loop is too large, use the [unroll(n)] attribute to force an exact higher number

The generated HLSL:

    struct NagaConstants {
        int base_vertex;
        int base_instance;
        uint other;
    };
    ConstantBuffer<NagaConstants> _NagaConstants: register(b0);

    ByteAddressBuffer input_left : register(t0);
    ByteAddressBuffer input_right : register(t1);
    ByteAddressBuffer input_bias : register(t2);
    RWByteAddressBuffer output_0_ : register(u0);

    [numthreads(1, 1, 1)]
    void main(uint3 global_id : SV_DispatchThreadID)
    {
        float4x4 tmpsum = (float4x4)0;
        float4x4 product = (float4x4)0;
        uint k = (uint)0;
        uint index_mat = (uint)0;
        uint index_mat_1 = (uint)0;

        uint bias_index = (global_id.x % 192u);
        uint x = (global_id.x / 192u);
        uint stack_index = global_id.y;
        uint left_offset = (stack_index * 0u);
        uint right_offset = (stack_index * 0u);
        uint output_offset = (stack_index * 0u);
        uint index = ((output_offset + (x * 768u)) + bias_index);
        float4 zero_vec = float4(0.0, 0.0, 0.0, 0.0);
        float4x4 zero_matrix = float4x4(zero_vec, zero_vec, zero_vec, zero_vec);
        tmpsum = zero_matrix;
        product = zero_matrix;
        k = 0u;
        bool loop_init = true;
        while(true) {
            if (!loop_init) {
                uint _expr109 = k;
                k = (_expr109 + 1u);
            }
            loop_init = false;
            uint _expr28 = k;
            if ((_expr28 < 768u)) {
            } else {
                break;
            }
            {
                uint _expr34 = k;
                uint index_left = ((left_offset + (x * 3072u)) + _expr34);
                uint _expr36 = k;
                uint index_right = ((right_offset + (_expr36 * 768u)) + bias_index);
                float4 _expr46 = asfloat(input_left.Load4((index_left + 0u)*16+0));
                float4 _expr52 = asfloat(input_left.Load4((index_left + 768u)*16+0));
                float4 _expr58 = asfloat(input_left.Load4((index_left + 1536u)*16+0));
                float4 _expr64 = asfloat(input_left.Load4((index_left + 2304u)*16+0));
                float4x4 mat_left = float4x4(_expr46, _expr52, _expr58, _expr64);
                float4 _expr71 = asfloat(input_right.Load4((index_right + 0u)*16+0));
                float4 _expr77 = asfloat(input_right.Load4((index_right + 192u)*16+0));
                float4 _expr83 = asfloat(input_right.Load4((index_right + 384u)*16+0));
                float4 _expr89 = asfloat(input_right.Load4((index_right + 576u)*16+0));
                float4x4 mat_right = float4x4(_expr71, _expr77, _expr83, _expr89);
                product = mul(mat_left, mat_right);
                index_mat = 0u;
                bool loop_init_1 = true;
                while(true) {
                    if (!loop_init_1) {
                        uint _expr106 = index_mat;
                        index_mat = (_expr106 + 1u);
                    }
                    loop_init_1 = false;
                    uint _expr94 = index_mat;
                    if ((_expr94 < 4u)) {
                    } else {
                        break;
                    }
                    {
                        uint _expr97 = index_mat;
                        uint _expr99 = index_mat;
                        float4 _expr101 = tmpsum[_expr99];
                        uint _expr102 = index_mat;
                        float4 _expr104 = product[_expr102];
                        tmpsum[_expr97] = (_expr101 + _expr104);
                    }
                }
            }
        }
        index_mat_1 = 0u;
        bool loop_init_2 = true;
        while(true) {
            if (!loop_init_2) {
                uint _expr132 = index_mat_1;
                index_mat_1 = (_expr132 + 1u);
            }
            loop_init_2 = false;
            uint _expr114 = index_mat_1;
            if ((_expr114 < 4u)) {
            } else {
                break;
            }
            {
                float4 bias = asfloat(input_bias.Load4(bias_index*16+0));
                uint _expr123 = index_mat_1;
                uint _expr128 = index_mat_1;
                float4 _expr130 = tmpsum[_expr128];
                output_0_.Store4((index + (_expr123 * 192u))*16+0, asuint((_expr130 + bias)));
            }
        }
        return;
    }

To cause the error on Windows (note, uses my personal project):

 $env:RUST_LOG=debug
$env:WGPU_BACKEND="dx12"
cargo run --bin quokka-ui --release

It fails with WGPU_BACKEND=vulkan or WGPU_BACKEND=dx11 as well, but the error message there is not very informative.

@pixelspark
Copy link
Collaborator Author

WONNX-generated WGSL:

 alias Scalar = f32;
    alias GemmVec = vec4<f32>;
    alias GemmMat = mat4x4<f32>;

    struct GemmArrayVector {
        data: array<GemmVec>
    };

    @group(0) @binding(0)
    var<storage, read> input_left: GemmArrayVector;

    @group(0) @binding(1)
    var<storage, read> input_right: GemmArrayVector;

     // Bias
        @group(0) @binding(2)
        var<storage, read> input_bias: GemmArrayVector;

        @group(0) @binding(3)
        var<storage, read_write> output_0: GemmArrayVector;


    @compute @workgroup_size(1, 1)
    fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
        let y = global_id.x % 192u;
        let x = global_id.x / 192u;


        let stack_index = global_id.y;
        let left_offset = stack_index * 0u;
        let right_offset = stack_index * 0u;
        let output_offset = stack_index * 0u;

        let index = output_offset + (x * 768u) + y;


        let zero_vec = GemmVec(

                        Scalar(),
                        Scalar(),
                        Scalar(),
                        Scalar()
        );

        let zero_matrix = GemmMat(

                        zero_vec,
                        zero_vec,
                        zero_vec,
                        zero_vec
        );

        var tmpsum = zero_matrix;
        var product = zero_matrix;

        for(var k: u32 = 0u; k < 768u; k = k + 1u) {
                let index_left = left_offset + (x * 3072u) + k;
                let index_right = right_offset + (k * 768u) + y;

                let mat_left = GemmMat(

                                input_left.data[index_left + 0u],
                                input_left.data[index_left + 768u],
                                input_left.data[index_left + 1536u],
                                input_left.data[index_left + 2304u]
                );

                let mat_right = GemmMat(

                                input_right.data[index_right + (0u)],
                                input_right.data[index_right + (192u)],
                                input_right.data[index_right + (384u)],
                                input_right.data[index_right + (576u)]
                );

                product = mat_right * mat_left;

                for(var index_mat: u32 = 0u; index_mat < 4u; index_mat = index_mat + 1u) {
                        tmpsum[index_mat] = tmpsum[index_mat] + product[index_mat];
                }
        }


                let bias_index =

                         y ;

                for(var index_mat: u32 = 0u; index_mat < 4u; index_mat = index_mat + 1u) {


                                let bias = input_bias.data[bias_index ];


                        output_0.data[index + (index_mat * 192u)] =tmpsum[index_mat] +bias;
                }

    }

@pixelspark
Copy link
Collaborator Author

Seeing a different issue now, when writing the input buffers:

[2023-05-23T05:47:19Z DEBUG wonnx::gpu] write input data for input_ids:0
[2023-05-23T05:47:19Z WARN  wonnx::gpu] reading int64 input 'input_ids:0' as int32 (int64 is not supported for calculation but can be used as input as long as values fit in int32)
[2023-05-23T05:47:19Z ERROR wgpu_hal::auxil::dxgi::result] Buffer creation failed: 0x887A0005
[2023-05-23T05:47:19Z ERROR wgpu::backend::direct] Handling wgpu errors as fatal by default
thread 'main' panicked at 'wgpu error: Validation Error

Caused by:
    In Queue::write_buffer
    Parent device is lost

', C:\Users\vandervorst\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-0.16.0\src\backend\direct.rs:3019:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2023-05-23T05:47:19Z INFO  wgpu_core::hub] Dropping Global
[2023-05-23T05:47:19Z INFO  wgpu_core::device] Destroying 1 command encoders
[2023-05-23T05:47:19Z WARN  gpu_allocator::allocator::free_list_allocator] leak detected: {
        memory type: 1
        memory block: 6
        chunk: {
            chunk_id: 29,
            size: 0x10000,
            offset: 0x0,
            allocation_type: Linear,
            name: (wgpu internal) Staging,
            backtrace:
        }
    }
error: process didn't exit successfully: `target\release\quokka-ui.exe` (exit code: 101)

Could be just my Windows laptop...

@teoxoy
Copy link

teoxoy commented May 25, 2023

0x887A0005 = DXGI_ERROR_DEVICE_REMOVED (https://learn.microsoft.com/en-us/windows/win32/direct3ddxgi/dxgi-error)

The video card has been physically removed from the system, or a driver upgrade for the video card has occurred. The application should destroy and recreate the device. For help debugging the problem, call ID3D10Device::GetDeviceRemovedReason.

Not sure why this is happening but apparently it can get triggered by other things.

In my experience actual physical device loss/removal was rare (ex: SurfaceBook GPU hot plug when running on remote GPU and opted in to support it), but other things cause that error value as well (some cases of crashes, driver updates, etc).
In practice, device removal mostly happened when our app hit driver bugs (ex: some drivers crashed if you used too much address space, some crashed with specific shaders etc).

from gfx-rs/wgpu#1624 (comment)

It would be nice to call GetDeviceRemovedReason and get more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants