Issues running BERT on Windows #166

pixelspark · 2023-05-09T22:58:53Z

On Windows, running BERT models leads to an error about loop unrolling in HLSL. Apparently one of the shaders has a loop with a too large (static) number of iterations. Same model runs fine on Mac/Metal. Will look into this later.

pixelspark · 2023-05-22T06:37:10Z

The error:

Caused by:
    In Device::create_compute_pipeline
    Internal error: FXC D3DCompile error (0x80004005): (81,21-35): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
(35,5-15): error X3511: unable to unroll loop, loop does not appear to terminate in a timely manner (329 iterations) or unrolled loop is too large, use the [unroll(n)] attribute to force an exact higher number

The generated HLSL:

    struct NagaConstants {
        int base_vertex;
        int base_instance;
        uint other;
    };
    ConstantBuffer<NagaConstants> _NagaConstants: register(b0);

    ByteAddressBuffer input_left : register(t0);
    ByteAddressBuffer input_right : register(t1);
    ByteAddressBuffer input_bias : register(t2);
    RWByteAddressBuffer output_0_ : register(u0);

    [numthreads(1, 1, 1)]
    void main(uint3 global_id : SV_DispatchThreadID)
    {
        float4x4 tmpsum = (float4x4)0;
        float4x4 product = (float4x4)0;
        uint k = (uint)0;
        uint index_mat = (uint)0;
        uint index_mat_1 = (uint)0;

        uint bias_index = (global_id.x % 192u);
        uint x = (global_id.x / 192u);
        uint stack_index = global_id.y;
        uint left_offset = (stack_index * 0u);
        uint right_offset = (stack_index * 0u);
        uint output_offset = (stack_index * 0u);
        uint index = ((output_offset + (x * 768u)) + bias_index);
        float4 zero_vec = float4(0.0, 0.0, 0.0, 0.0);
        float4x4 zero_matrix = float4x4(zero_vec, zero_vec, zero_vec, zero_vec);
        tmpsum = zero_matrix;
        product = zero_matrix;
        k = 0u;
        bool loop_init = true;
        while(true) {
            if (!loop_init) {
                uint _expr109 = k;
                k = (_expr109 + 1u);
            }
            loop_init = false;
            uint _expr28 = k;
            if ((_expr28 < 768u)) {
            } else {
                break;
            }
            {
                uint _expr34 = k;
                uint index_left = ((left_offset + (x * 3072u)) + _expr34);
                uint _expr36 = k;
                uint index_right = ((right_offset + (_expr36 * 768u)) + bias_index);
                float4 _expr46 = asfloat(input_left.Load4((index_left + 0u)*16+0));
                float4 _expr52 = asfloat(input_left.Load4((index_left + 768u)*16+0));
                float4 _expr58 = asfloat(input_left.Load4((index_left + 1536u)*16+0));
                float4 _expr64 = asfloat(input_left.Load4((index_left + 2304u)*16+0));
                float4x4 mat_left = float4x4(_expr46, _expr52, _expr58, _expr64);
                float4 _expr71 = asfloat(input_right.Load4((index_right + 0u)*16+0));
                float4 _expr77 = asfloat(input_right.Load4((index_right + 192u)*16+0));
                float4 _expr83 = asfloat(input_right.Load4((index_right + 384u)*16+0));
                float4 _expr89 = asfloat(input_right.Load4((index_right + 576u)*16+0));
                float4x4 mat_right = float4x4(_expr71, _expr77, _expr83, _expr89);
                product = mul(mat_left, mat_right);
                index_mat = 0u;
                bool loop_init_1 = true;
                while(true) {
                    if (!loop_init_1) {
                        uint _expr106 = index_mat;
                        index_mat = (_expr106 + 1u);
                    }
                    loop_init_1 = false;
                    uint _expr94 = index_mat;
                    if ((_expr94 < 4u)) {
                    } else {
                        break;
                    }
                    {
                        uint _expr97 = index_mat;
                        uint _expr99 = index_mat;
                        float4 _expr101 = tmpsum[_expr99];
                        uint _expr102 = index_mat;
                        float4 _expr104 = product[_expr102];
                        tmpsum[_expr97] = (_expr101 + _expr104);
                    }
                }
            }
        }
        index_mat_1 = 0u;
        bool loop_init_2 = true;
        while(true) {
            if (!loop_init_2) {
                uint _expr132 = index_mat_1;
                index_mat_1 = (_expr132 + 1u);
            }
            loop_init_2 = false;
            uint _expr114 = index_mat_1;
            if ((_expr114 < 4u)) {
            } else {
                break;
            }
            {
                float4 bias = asfloat(input_bias.Load4(bias_index*16+0));
                uint _expr123 = index_mat_1;
                uint _expr128 = index_mat_1;
                float4 _expr130 = tmpsum[_expr128];
                output_0_.Store4((index + (_expr123 * 192u))*16+0, asuint((_expr130 + bias)));
            }
        }
        return;
    }

To cause the error on Windows (note, uses my personal project):

 $env:RUST_LOG=debug
$env:WGPU_BACKEND="dx12"
cargo run --bin quokka-ui --release

It fails with WGPU_BACKEND=vulkan or WGPU_BACKEND=dx11 as well, but the error message there is not very informative.

pixelspark · 2023-05-22T06:39:31Z

WONNX-generated WGSL:

 alias Scalar = f32;
    alias GemmVec = vec4<f32>;
    alias GemmMat = mat4x4<f32>;

    struct GemmArrayVector {
        data: array<GemmVec>
    };

    @group(0) @binding(0)
    var<storage, read> input_left: GemmArrayVector;

    @group(0) @binding(1)
    var<storage, read> input_right: GemmArrayVector;

     // Bias
        @group(0) @binding(2)
        var<storage, read> input_bias: GemmArrayVector;

        @group(0) @binding(3)
        var<storage, read_write> output_0: GemmArrayVector;


    @compute @workgroup_size(1, 1)
    fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
        let y = global_id.x % 192u;
        let x = global_id.x / 192u;


        let stack_index = global_id.y;
        let left_offset = stack_index * 0u;
        let right_offset = stack_index * 0u;
        let output_offset = stack_index * 0u;

        let index = output_offset + (x * 768u) + y;


        let zero_vec = GemmVec(

                        Scalar(),
                        Scalar(),
                        Scalar(),
                        Scalar()
        );

        let zero_matrix = GemmMat(

                        zero_vec,
                        zero_vec,
                        zero_vec,
                        zero_vec
        );

        var tmpsum = zero_matrix;
        var product = zero_matrix;

        for(var k: u32 = 0u; k < 768u; k = k + 1u) {
                let index_left = left_offset + (x * 3072u) + k;
                let index_right = right_offset + (k * 768u) + y;

                let mat_left = GemmMat(

                                input_left.data[index_left + 0u],
                                input_left.data[index_left + 768u],
                                input_left.data[index_left + 1536u],
                                input_left.data[index_left + 2304u]
                );

                let mat_right = GemmMat(

                                input_right.data[index_right + (0u)],
                                input_right.data[index_right + (192u)],
                                input_right.data[index_right + (384u)],
                                input_right.data[index_right + (576u)]
                );

                product = mat_right * mat_left;

                for(var index_mat: u32 = 0u; index_mat < 4u; index_mat = index_mat + 1u) {
                        tmpsum[index_mat] = tmpsum[index_mat] + product[index_mat];
                }
        }


                let bias_index =

                         y ;

                for(var index_mat: u32 = 0u; index_mat < 4u; index_mat = index_mat + 1u) {


                                let bias = input_bias.data[bias_index ];


                        output_0.data[index + (index_mat * 192u)] =tmpsum[index_mat] +bias;
                }

    }

pixelspark · 2023-05-23T05:53:58Z

Seeing a different issue now, when writing the input buffers:

[2023-05-23T05:47:19Z DEBUG wonnx::gpu] write input data for input_ids:0
[2023-05-23T05:47:19Z WARN  wonnx::gpu] reading int64 input 'input_ids:0' as int32 (int64 is not supported for calculation but can be used as input as long as values fit in int32)
[2023-05-23T05:47:19Z ERROR wgpu_hal::auxil::dxgi::result] Buffer creation failed: 0x887A0005
[2023-05-23T05:47:19Z ERROR wgpu::backend::direct] Handling wgpu errors as fatal by default
thread 'main' panicked at 'wgpu error: Validation Error

Caused by:
    In Queue::write_buffer
    Parent device is lost

', C:\Users\vandervorst\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-0.16.0\src\backend\direct.rs:3019:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2023-05-23T05:47:19Z INFO  wgpu_core::hub] Dropping Global
[2023-05-23T05:47:19Z INFO  wgpu_core::device] Destroying 1 command encoders
[2023-05-23T05:47:19Z WARN  gpu_allocator::allocator::free_list_allocator] leak detected: {
        memory type: 1
        memory block: 6
        chunk: {
            chunk_id: 29,
            size: 0x10000,
            offset: 0x0,
            allocation_type: Linear,
            name: (wgpu internal) Staging,
            backtrace:
        }
    }
error: process didn't exit successfully: `target\release\quokka-ui.exe` (exit code: 101)

Could be just my Windows laptop...

teoxoy · 2023-05-25T11:37:06Z

0x887A0005 = DXGI_ERROR_DEVICE_REMOVED (https://learn.microsoft.com/en-us/windows/win32/direct3ddxgi/dxgi-error)

The video card has been physically removed from the system, or a driver upgrade for the video card has occurred. The application should destroy and recreate the device. For help debugging the problem, call ID3D10Device::GetDeviceRemovedReason.

Not sure why this is happening but apparently it can get triggered by other things.

In my experience actual physical device loss/removal was rare (ex: SurfaceBook GPU hot plug when running on remote GPU and opted in to support it), but other things cause that error value as well (some cases of crashes, driver updates, etc).
In practice, device removal mostly happened when our app hit driver bugs (ex: some drivers crashed if you used too much address space, some crashed with specific shaders etc).

from gfx-rs/wgpu#1624 (comment)

It would be nice to call GetDeviceRemovedReason and get more details.

pixelspark self-assigned this May 9, 2023

pixelspark added a commit that referenced this issue May 23, 2023

fix: issue running Gemm on Windows (#166)

c772653

pixelspark mentioned this issue Oct 25, 2023

"Unable to unroll loop" error compiling generated HLSL code that writes to a matrix row in a loop gfx-rs/wgpu#4436

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues running BERT on Windows #166

Issues running BERT on Windows #166

pixelspark commented May 9, 2023

pixelspark commented May 22, 2023 •

edited

pixelspark commented May 22, 2023

pixelspark commented May 23, 2023

teoxoy commented May 25, 2023

Issues running BERT on Windows #166

Issues running BERT on Windows #166

Comments

pixelspark commented May 9, 2023

pixelspark commented May 22, 2023 • edited

pixelspark commented May 22, 2023

pixelspark commented May 23, 2023

teoxoy commented May 25, 2023

pixelspark commented May 22, 2023 •

edited