
[ONNX][TorchToLinalg] Add support for dynamic dims in Interpolate lowering #3351

Merged: 4 commits into llvm:main on May 17, 2024

Conversation

zjgarvey (Collaborator)

Addresses Shark-Turbine #196

Related tracker Shark-Turbine #566

Related onnx.Resize issues Shark-Turbine #616

AmosLewis (Collaborator) commented May 16, 2024

python ./run.py --torchmlirbuild ../../torch-mlir/build --tolerance 0.001 0.001 --cachedir ./huggingface_cache --ireebuild ../../iree-build -f onnx -g models --mode onnx --report --tests onnx/models/RRDB_ESRGAN_vaiq_int8 --torchtolinalg

Have you tested with the SHARK-TestSuite? On my local test it still fails:

LLVM ERROR: checking for an interface (`mlir::ReifyRankedShapedTypeOpInterface`) that was promised by dialect 'tensor' but never implemented. This is generally an indication that the dialect extension implementing the interface was never registered.

It would also be better to test with the other resize-op-related models to make sure they all pass.

zjgarvey (Collaborator, Author)

> python ./run.py --torchmlirbuild ../../torch-mlir/build --tolerance 0.001 0.001 --cachedir ./huggingface_cache --ireebuild ../../iree-build -f onnx -g models --mode onnx --report --tests onnx/models/RRDB_ESRGAN_vaiq_int8 --torchtolinalg
>
> Have you tested with the SHARK-TestSuite? On my local test it still fails:
>
> LLVM ERROR: checking for an interface (`mlir::ReifyRankedShapedTypeOpInterface`) that was promised by dialect 'tensor' but never implemented. This is generally an indication that the dialect extension implementing the interface was never registered.
>
> It would also be better to test with the other resize-op-related models to make sure they all pass.

Here is the issue filed for this; it is unrelated to this patch:

#3352
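For context on what that error refers to, here is a minimal sketch (not torch-mlir's actual setup) of how an MLIR-based tool registers dialect extensions so that interfaces "promised" by a dialect, such as `ReifyRankedShapedTypeOpInterface` for `tensor`, are actually implemented at run time. The catch-all helpers `registerAllDialects` and `registerAllExtensions` are upstream MLIR utilities used here only for illustration; a real tool would register a narrower set.

```cpp
// Sketch only: illustrates the kind of registration the LLVM ERROR above is
// complaining about. Torch-mlir's real entry points are narrower than this.
#include "mlir/IR/DialectRegistry.h"
#include "mlir/IR/MLIRContext.h"
#include "mlir/InitAllDialects.h"
#include "mlir/InitAllExtensions.h"

int main() {
  mlir::DialectRegistry registry;
  mlir::registerAllDialects(registry);   // dialect definitions (tensor, linalg, ...)
  mlir::registerAllExtensions(registry); // extensions / external interface models
  mlir::MLIRContext context(registry);
  // With all upstream dialects and extensions registered, interface queries
  // such as ReifyRankedShapedTypeOpInterface on tensor ops should not hit the
  // "promised but never implemented" error.
  return 0;
}
```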

AmosLewis (Collaborator)

I cherry-picked this patch and tested locally. It looks like some previously passing models fail again with this PR:

- (half_pixel, linear)
  - DeepLabV3_resnet50_vaiq_int8: passed
  - FCN_vaiq_int8: passed
  - LRASPP_vaiq_int8: passed -> failed
  - U-2-Net_vaiq_int8: passed -> failed
- (asymmetric, nearest)
  - pytorch-3dunet_vaiq_int8
  - RRDB_ESRGAN_vaiq_int8
  - YoloNetV3_vaiq_int8: passed
  - yolov8n_vaiq_int8: passed -> failed

python ./run.py --torchmlirbuild ../../torch-mlir/build --tolerance 0.001 0.001 --cachedir ./huggingface_cache --ireebuild ../../iree-build -f onnx -g models --mode onnx --report --tests onnx/models/U-2-Net_vaiq_int8 --torchtolinalg

| tests                        | model-run   | onnx-import   | torch-mlir   | iree-compile   | inference   |
|:-----------------------------|:------------|:--------------|:-------------|:---------------|:------------|
| onnx/models/LRASPP_vaiq_int8 | passed      | passed        | failed       | notrun         | notrun      |
| onnx/models/U-2-Net_vaiq_int8 | passed      | passed        | passed       | failed         | notrun      |
| onnx/models/yolov8n_vaiq_int8 | passed      | passed        | failed       | notrun         | notrun      |
LRASPP_vaiq_int8.default.torch-onnx.mlir:195:12: error: failed to legalize operation 'torch.aten.convolution' that was explicitly marked illegal
    %191 = torch.operator "onnx.Conv"(%178, %184, %190) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 16 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [1 : si64, 1 : si64]} : (!torch.vtensor<[1,16,112,112],f32>, !torch.vtensor<[16,1,3,3],f32>, !torch.vtensor<[16],f32>) -> !torch.vtensor<[1,16,112,112],f32> 
           ^
LRASPP_vaiq_int8.default.torch-onnx.mlir:195:12: note: see current operation: %562 = "torch.aten.convolution"(%550, %552, %561, %238, %238, %238, %45, %240, %36) : (!torch.vtensor<[1,16,112,112],!torch.qint8>, !torch.vtensor<[16,1,3,3],!torch.qint8>, !torch.vtensor<[16],si32>, !torch.list<int>, !torch.list<int>, !torch.list<int>, !torch.bool, !torch.list<int>, !torch.int) -> !torch.vtensor<[1,16,112,112],si32>
yolov8n_vaiq_int8.default.torch-onnx.mlir:262:12: error: failed to legalize operation 'torch.aten.convolution' that was explicitly marked illegal
    %258 = torch.operator "onnx.Conv"(%245, %251, %257) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [1 : si64, 1 : si64]} : (!torch.vtensor<[1,16,160,160],f32>, !torch.vtensor<[16,16,3,3],f32>, !torch.vtensor<[16],f32>) -> !torch.vtensor<[1,16,160,160],f32> 
           ^
U-2-Net_vaiq_int8.default.onnx.linalg.mlir:7124:12: error: 'func.func' op exceeded stack allocation limit of 32768 bytes for function. Got 204800 bytes
    %866 = linalg.generic {indexing_maps = [#map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} outs(%668 : tensor<1x512x20x20xf32>) {
           ^
U-2-Net_vaiq_int8.default.onnx.linalg.mlir:10:3: note: called from
  func.func @torch_jit(%arg0: tensor<1x3x320x320xf32>) -> (tensor<1x1x320x320xf32>, tensor<1x1x320x320xf32>, tensor<1x1x320x320xf32>, tensor<1x1x320x320xf32>, tensor<1x1x320x320xf32>, tensor<1x1x320x320xf32>, tensor<1x1x320x320xf32>) {
  ^

zjgarvey (Collaborator, Author)

> I cherry-picked this patch and tested locally. It looks like some previously passing models fail again with this PR:
>
> - (half_pixel, linear)
>   - DeepLabV3_resnet50_vaiq_int8: passed
>   - FCN_vaiq_int8: passed
>   - LRASPP_vaiq_int8: passed -> failed
>   - U-2-Net_vaiq_int8: passed -> failed
> - (asymmetric, nearest)
>   - pytorch-3dunet_vaiq_int8
>   - RRDB_ESRGAN_vaiq_int8
>   - YoloNetV3_vaiq_int8: passed
>   - yolov8n_vaiq_int8: passed -> failed

Hi @AmosLewis, thanks for testing this out.

The convolution op failures are happening because the following PRs have not been merged yet:

torch-mlir PR #3341, which depends on the upstream llvm-project PR #92136.

This is not an issue related to this particular patch; it likely came about due to the work on improving operand quantization in #3327 and #3332.

I'm not sure exactly what causes the stack allocation limit issue. It seems to happen during some dequant ops, but this should not be new as far as I am aware. I can focus my attention on these issues if you'd like, but again, I don't think that issue is likely to be specific to this patch.

A good comparison would be to run those same tests at head and compare to this branch.

zjgarvey (Collaborator, Author)

@AmosLewis

Also for reference, a few days ago, I ran all of the onnx model tests and triaged the torch-mlir failures:

Test onnx/models/VideoResNet_vaiq_int8 failed [torch-mlir]
    onnx.constant??
Test onnx/models/MobileNetV3_small_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/RegNet_y_8gf_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/Inception_v4_vaiq_int8 failed [torch-mlir]
    average Pool
Test onnx/models/pytorch-3dunet_vaiq_int8 failed [torch-mlir]
    resize
Test onnx/models/ShuffleNet_v2_x2_0_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/MNASNet_1_3_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/LRASPP_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/RRDB_ESRGAN_vaiq_int8 failed [torch-mlir]
    resize
Test onnx/models/KeypointRCNN_vaiq_int8 failed [torch-mlir]
    onnx if
Test onnx/models/EfficientNet_v2_s_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/retinanet_resnet50_fpn_vaiq_int8 failed [torch-mlir]
    onnx if
Test onnx/models/ConvNeXt_vaiq_int8 failed [torch-mlir]
    grouped q convolution

All of the ones marked "grouped q convolution" have a fix incoming.

This list and the flags used to run them are in my most recent comment in this issue.

AmosLewis (Collaborator) commented May 17, 2024

> A good comparison would be to run those same tests at head and compare to this branch.

Makes sense; we need to test along with that patch, #3341.

> I don't think that issue is likely to be specific to this patch.

Agreed, but we still need to test to double-check.

> stack allocation limit issue

Could you run it on your machine? It might be because my VM is running out of memory.

zjgarvey (Collaborator, Author)

> stack allocation limit issue
>
> Could you run it on your machine? It might be because my VM is running out of memory.

I'm not sure exactly what the guard is in place for. I was reading someone else's similar issue recently: iree issue.

It might be possible to remove the guard by adding the flag `--iree-llvmcpu-fail-on-out-of-bounds-stack-allocation=false` to iree-compile, as Mahesh mentioned in that issue. When I tried this for RAFT_vaiq_int8, iree-compile just sat there for about 30 minutes.
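For illustration only, a hedged sketch of where that flag would go in an iree-compile invocation; the target backend and output name below are placeholders, not taken from this thread. Only the stack-allocation flag comes from the discussion above.

```shell
# Hypothetical invocation; file and backend names are assumptions.
iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-fail-on-out-of-bounds-stack-allocation=false \
  U-2-Net_vaiq_int8.default.onnx.linalg.mlir \
  -o U-2-Net_vaiq_int8.vmfb
```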

rsuderman merged commit 6cba93b into llvm:main on May 17, 2024. 3 checks passed.
BaneTrifa pushed a commit to BaneTrifa/torch-mlir that referenced this pull request on May 24, 2024:

[ONNX][TorchToLinalg] Add support for dynamic dims in Interpolate lowering (llvm#3351)

Addresses Shark-Turbine #196 (nod-ai/SHARK-TestSuite#196)

Related tracker Shark-Turbine #566 (nod-ai/SHARK-Turbine#566)

Related onnx.Resize issues Shark-Turbine #616 (nod-ai/SHARK-Turbine#616)