
Assistance Required with tvmjs Integration and webgpu.get_fmap Error #59

Open
ZoneLikeWonderland opened this issue Feb 28, 2024 · 10 comments


@ZoneLikeWonderland

Firstly, I'd like to express my admiration for the remarkable work done on this project. The advancements and capabilities it offers are truly impressive.

I've been diligently following the provided "walkthrough.ipynb" to familiarize myself with the pipeline. Unfortunately, I encountered an issue with the trace part, which seems to malfunction, possibly due to updates in the diffusers library. To circumvent this, I opted for a simplified network module as demonstrated below:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # identity plus one, just to get a trivial graph
        return x + 1

Following this, I proceeded to convert the network to ONNX format and subsequently to IR:

# Setup reconstructed from the exported graph below: (1, 3) float32 on CPU
net = Net().eval()
dtype, device = torch.float32, "cpu"
input = torch.ones(1, 3)

trace = torch.jit.trace(net, input.to(dtype).to(device))

torch.onnx.export(
    trace, input.to(dtype).to(device), "test/net.onnx", verbose=True,
    input_names=["input"], output_names=["input"],
)
# Exported graph: graph(%input.1 : Float(1, 3, strides=[3, 1], requires_grad=0, device=cpu)):
#   %/Constant_output_0 : Float(requires_grad=0, device=cpu) = onnx::Constant[value={1}, onnx_name="/Constant"](), scope: Net:: # /tmp/ipykernel_3004661/4051825751.py:6:0
#   %input : Float(1, 3, strides=[3, 1], requires_grad=0, device=cpu) = onnx::Add[onnx_name="/Add"](%input.1, %/Constant_output_0), scope: Net:: # /tmp/ipykernel_3004661/4051825751.py:6:0
#   return (%input)

# ============= Diagnostic Run torch.onnx.export version 2.0.0+cu117 =============
# verbose: False, log level: Level.ERROR
# ======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
import onnx
import tvm
from tvm import relax
from tvm.relax.frontend.onnx import from_onnx

onnx_model_path = "test/net.onnx"
model = onnx.load(onnx_model_path)
tvm_model = from_onnx(model, keep_params_in_input=True)
tvm_model

# # from tvm.script import ir as I
# # from tvm.script import relax as R

# @I.ir_module
# class Module:
#     @R.function
#     def main(input_1: R.Tensor((1, 3), dtype="float32")) -> R.Tensor((1, 3), dtype="float32"):
#         R.func_attr({"num_input": 1})
#         with R.dataflow():
#             gv: R.Tensor((1, 3), dtype="float32") = R.add(input_1, R.const(1, "float32"))
#             R.output(gv)
#         return gv

After that, I compiled it to wasm:

tvm_model, model_params = relax.frontend.detach_params(tvm_model) # no params actually
target = tvm.target.Target(
    "webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm"
)
ex = relax.build(mod=tvm_model, target=target)
ex.export_library("test/net.wasm")
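
One step I was unsure about: whether the Relax module needs its ops legalized and a default GPU schedule applied before relax.build when targeting webgpu. A minimal sketch of what I mean (the transform names are my assumption from a recent TVM build with Relax, so I may be off here):

# Hypothetical extra lowering before relax.build for a GPU target:
# legalize high-level Relax ops to TIR, then give them a default GPU schedule.
tvm_model = relax.transform.LegalizeOps()(tvm_model)
with tvm.target.Target("webgpu"):
    tvm_model = tvm.tir.transform.DefaultGPUSchedule()(tvm_model)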

Finally, I used the following JS to run it:

const tvmjs = require("./public/dist/tvmjs.bundle.js");
const EmccWASI = require("./public/dist/tvmjs_runtime.wasi.js");


window.tvmjs = tvmjs

async function asyncInitTVM() {


    const wasmSource = await (
        await fetch("./public/net.wasm")
    ).arrayBuffer();


    const logger = function (message) {
        console.log(message);
    };

    const tvm = await tvmjs.instantiate(
        new Uint8Array(wasmSource),
        new EmccWASI(),
        logger
    );

    const output = await tvmjs.detectGPUDevice();
    if (output !== undefined) {
        var label = "WebGPU";
        if (output.adapterInfo.description.length != 0) {
            label += " - " + output.adapterInfo.description;
        } else {
            label += " - " + output.adapterInfo.vendor;
        }
        console.log("Initialize GPU device: " + label);
        tvm.initWebGPU(output.device);
    } else {
        console.log("This browser env do not support WebGPU");
    }



    let device, vm, net;
    tvm.withNewScope(() => {
        device = tvm.webgpu();
        // device = tvm.cpu();
        vm = tvm.detachFromCurrentScope(tvm.createVirtualMachine(device));
        net = tvm.detachFromCurrentScope(vm.getFunction("main"));
    });

    await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule());

    const input_cpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", tvm.cpu()).copyFrom([1, 1, 1])
        )
    });
    const input_gpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", device)
        )
    });

    input_gpu.copyFrom(input_cpu);
    await tvm.webgpu().sync();
    console.log("input_cpu", input_cpu.toArray());

    let output_gpu;
    tvm.withNewScope(() => {
        output_gpu = net(input_gpu);
        output_gpu = tvm.detachFromCurrentScope(output_gpu);
    });


    const output_cpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", tvm.cpu()).copyFrom([2, 3, 4])
        )
    });

    output_cpu.copyFrom(output_gpu);
    await tvm.webgpu().sync();
    console.log("output_cpu", output_cpu.toArray());

}

asyncInitTVM()

However, I've hit a roadblock during the execution phase, particularly at await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule());, where the console outputs the following error:

tvmjs.bundle.js:1863  Uncaught (in promise) Error: Cannot find function webgpu.get_fmap
    at Module.getFunction (tvmjs.bundle.js:1863:23)
    at Instance.eval (tvmjs.bundle.js:2791:38)
    at Generator.next (<anonymous>)
    at eval (tvmjs.bundle.js:28:75)
    at new Promise (<anonymous>)
    at __awaiter (tvmjs.bundle.js:24:16)
    at Instance.asyncLoadWebGPUPipelines (tvmjs.bundle.js:2786:20)
    at asyncInitTVM (main.js:48:15)

In addition, I found that when I use llvm as the build target instead of webgpu, use tvm.cpu() as the device, and skip this line, the example works.
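
For reference, a sketch of the CPU variant that works for me (the wasm filename here is just illustrative; on the JS side I use tvm.cpu() as the device and skip asyncLoadWebGPUPipelines):

# Build for plain wasm on the CPU instead of webgpu
target = tvm.target.Target("llvm -mtriple=wasm32-unknown-unknown-wasm")
ex = relax.build(mod=tvm_model, target=target)
ex.export_library("test/net_cpu.wasm")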

Given the scarcity of detailed documentation and tutorials on integrating custom networks with tvmjs, especially regarding WebGPU support, I find myself in need of your expertise and guidance.

Could you please help me identify any potential missteps in my approach? I am particularly interested in ensuring that my network can be successfully operated using tvmjs and would greatly appreciate any insights or suggestions you might have.

Thank you very much for your time and assistance.

@senlyu163
Copy link

Same error... did you solve it?

@ZoneLikeWonderland

Same error... did you solve it?

not yet👀

@senlyu163

@tqchen Sorry to bother you. If it's convenient, can you give some suggestions?

@senlyu163

@ZoneLikeWonderland The error is still reported on Ubuntu, but it works fine on an Apple M2 Pro. I am new to TVM and don't know why...

@abacaaaaaa

@senlyu163
Sorry to bother you. Following the Get Started documentation, I successfully ran build.py and deploy.py. However, when deploying on the web I encountered the following issue:

Generate error, GPUPipelineError: Entry-point uses workgroup_size(1024, 1, 1) that exceeds the maximum allowed (256, 256, 64).
    at ValidateComputeStageWorkgroupSize (../../third_party/dawn/src/dawn/native/ShaderModule.cpp:1004)


My environment: Python 3.11, macOS 14.5, Apple M2.
Do you have any suggestions? Thanks.

@senlyu163

@senlyu163 Sorry to bother you. Following the Get Started documentation, I successfully ran build.py and deploy.py. However, when deploying on the web I encountered the following issue:

Generate error, GPUPipelineError: Entry-point uses workgroup_size(1024, 1, 1) that exceeds the maximum allowed (256, 256, 64).
    at ValidateComputeStageWorkgroupSize (../../third_party/dawn/src/dawn/native/ShaderModule.cpp:1004)


My environment: Python 3.11, macOS 14.5, Apple M2. Do you have any suggestions? Thanks.

I also encountered this problem when deploying on the web, probably because of WebGPU's limits. You can tune with the configuration in the following code.

    ms.relax_integration.tune_relax(
        mod=mod,
        target=tvm.target.Target("apple/m1-gpu-restricted"),
        ......
    )

I tested several configurations and succeeded under the “apple/m1-gpu-restricted” configuration.

@abacaaaaaa

@senlyu163
Thank you for your response. I’ve found multiple instances of ms.relax_integration.tune_relax in the TVM source code and made modifications to all the files accordingly. However, the issue still persists. Could you please provide detailed information on the files that require modification? Thank you very much.

@senlyu163

@abacaaaaaa

  1. You only need to specify the target as "apple/m1-gpu-restricted" when generating log_db in the tune phase.
  2. Just use the repo's default code during the lib compilation phase.

I think the problem you are hitting is that the number of GPU threads configured during the tune phase is too large.
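
My understanding (an assumption on my part, not something I have verified in the TVM source) is that the restricted tag mainly caps the threads per block, so tuned workgroup sizes stay within WebGPU's default limits:

import tvm

# Assumption: "apple/m1-gpu-restricted" lowers max_threads_per_block
# (e.g. 256 instead of 1024), keeping tuned kernels inside WebGPU's
# default workgroup limits.
for tag in ["apple/m1-gpu", "apple/m1-gpu-restricted"]:
    print(tag, tvm.target.Target(tag).max_num_threads)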

I recommend referring to guoyaol's repo: https://github.com/guoyaol/web-real-esrgan

@abacaaaaaa

@senlyu163
Sorry to bother you. I tried specifying apple/m1-gpu-restricted when generating the log_db, referring to https://github.com/guoyaol/web-real-esrgan and https://github.com/happyme531/RK3588-stable-diffusion-GPU, but the issue still persists.

My code is as follows:

def tune(mod: tvm.IRModule) -> None:
    from tvm import meta_schedule as ms

    ms.relax_integration.tune_relax(
        mod=mod,
        target=tvm.target.Target("apple/m1-gpu-restricted"),
        params={},
        builder=ms.builder.LocalBuilder(
            max_workers=2,
        ),
        runner=ms.runner.LocalRunner(),
        work_dir="log_db_tuning_1000_small",
        max_trials_global=5000,  
        max_trials_per_task=200, 
        strategy=ms.search_strategy.EvolutionarySearch(init_min_unmeasured=10, max_fail_count=15),
    )

I’m just tinkering with the parameters — max_trials_global, max_trials_per_task, init_min_unmeasured, and max_fail_count — and they might not be entirely reasonable. However, even after this tentative tuning, the current log_db still throws errors on the web. Could you kindly provide some suggestions? If it’s convenient for you, could you please share the tune file or log_db files? Thank you very much.

@senlyu163

@abacaaaaaa
The tune code looks correct.

I verified WebGPU on the web-real-esrgan project and have not tried stable-diffusion yet, so I cannot provide a log_db for SD. You can reproduce the web-real-esrgan work and then migrate it to SD.

I am still new to TVM, so I cannot offer more professional advice.
