
Segfault during GPU export of Reshape after importing model from ONNX #154

Status: Open
jan-haug opened this issue on Mar 22, 2023 · 10 comments
Labels: bug (Something isn't working)

@jan-haug (Contributor) commented on Mar 22, 2023:

When you load a Reshape node, an OnnxReshape instance is created with custom logic for the torch.onnx export. (See https://github.com/ENOT-AutoDL/onnx2torch/blob/main/onnx2torch/node_converters/reshape.py#L32-L33 )

However, I'm getting segmentation faults when exporting the torch model on the GPU with this logic. Removing this if-condition (from the link above) entirely fixes the issue for me. What is the reason for this handling, and is there a way around it, or could it be extended to work with CUDA too?

        if torch.onnx.is_in_onnx_export():
            return DefaultExportToOnnx.export(forward_lambda, 'Reshape', input_tensor, shape, {})

I'm running torch==1.13.1 and exporting with ONNX opset 14. CPU export works fine, but unfortunately that's not really an option in my case.
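
For reference, removing that branch amounts to something like the monkey-patch below (a minimal sketch only: it assumes the forward(input_tensor, shape) signature from the linked reshape.py and re-implements the plain reshape path, bypassing the custom ONNX export entirely):

from onnx2torch.node_converters.reshape import OnnxReshape


def _plain_reshape_forward(self, input_tensor, shape):
    # ONNX Reshape semantics: a 0 in `shape` means "copy this dimension from the input".
    target = [int(input_tensor.shape[i]) if int(dim) == 0 else int(dim) for i, dim in enumerate(shape)]
    return input_tensor.reshape(target)


# Illustration of "removing the if-condition", applied before torch.onnx.export.
OnnxReshape.forward = _plain_reshape_forward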

Standalone reproducer:

import os
import onnx2torch
import tempfile
import torch


class ReshapeModel(torch.nn.Module):
    def forward(self, x):
        return x.reshape(-1, 512)


def test_export():
    tmp_path = tempfile.mkdtemp()
    sample = torch.rand((1, 512, 1, 1))
    model = ReshapeModel()
    out_path = os.path.join(tmp_path, "temp.onnx")

    # Export the original model and convert it back to PyTorch.
    torch.onnx.export(model, sample, out_path)
    model_reconstructed = onnx2torch.convert(out_path)

    # Move the converted model to the GPU; re-exporting it is where the segfault occurs.
    model_reconstructed.to("cuda")
    torch.onnx.export(model_reconstructed, sample, out_path)


if __name__ == "__main__":
    test_export()
@senysenyseny16 (Collaborator) commented:

Hi, @jan-haug!

I can't reproduce:

❯ docker run --rm --gpus all -ti -v $(pwd):/io pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime bash
root@7c0263e9a633:/workspace# cd /io
root@7c0263e9a633:/io# pip3 install onnx2torch
Collecting onnx2torch
  Downloading onnx2torch-1.5.6-py3-none-any.whl (115 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.6/115.6 kB 722.2 kB/s eta 0:00:00
Requirement already satisfied: torch>=1.8.0 in /opt/conda/lib/python3.10/site-packages (from onnx2torch) (1.13.1)
Requirement already satisfied: torchvision>=0.9.0 in /opt/conda/lib/python3.10/site-packages (from onnx2torch) (0.14.1)
Collecting onnx>=1.9.0
  Downloading onnx-1.13.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.5/13.5 MB 14.1 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.16.4 in /opt/conda/lib/python3.10/site-packages (from onnx2torch) (1.22.3)
Collecting protobuf<4,>=3.20.2
  Downloading protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 19.6 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions>=3.6.2.1 in /opt/conda/lib/python3.10/site-packages (from onnx>=1.9.0->onnx2torch) (4.4.0)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from torchvision>=0.9.0->onnx2torch) (2.28.1)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /opt/conda/lib/python3.10/site-packages (from torchvision>=0.9.0->onnx2torch) (9.3.0)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->torchvision>=0.9.0->onnx2torch) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->torchvision>=0.9.0->onnx2torch) (2022.9.24)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->torchvision>=0.9.0->onnx2torch) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests->torchvision>=0.9.0->onnx2torch) (1.26.13)
Installing collected packages: protobuf, onnx, onnx2torch
Successfully installed onnx-1.13.1 onnx2torch-1.5.6 protobuf-3.20.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
root@7c0263e9a633:/io# python3 reshape_segfault.py 
root@7c0263e9a633:/io# 

What is the reason for this handling and is there a way around this or could it be extended to work with cuda too?

The reason is to ensure backward compatibility of the round trip ONNX -> PyTorch -> ONNX: without this handling, the converted PyTorch code is sometimes exported back to ONNX as a set of primitive operations rather than as the original ONNX operation.
In other words, it guarantees, for example: ONNX Gather -> PyTorch Gather implementation -> ONNX Gather.
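
For illustration, that guarantee can be checked with a small script like the one below (a sketch on CPU, where export works; file names are placeholders):

import onnx
import onnx2torch
import torch


class ReshapeModel(torch.nn.Module):  # same toy model as in the reproducer above
    def forward(self, x):
        return x.reshape(-1, 512)


sample = torch.rand((1, 512, 1, 1))
torch.onnx.export(ReshapeModel(), sample, "original.onnx")

# Round trip: ONNX -> PyTorch -> ONNX. The re-exported graph should still
# contain a Reshape node instead of a decomposition into primitive ops.
converted = onnx2torch.convert("original.onnx")
torch.onnx.export(converted, sample, "roundtrip.onnx")

reexported_ops = {node.op_type for node in onnx.load("roundtrip.onnx").graph.node}
assert "Reshape" in reexported_ops, reexported_ops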

senysenyseny16 added the bug (Something isn't working) label and self-assigned this issue on Mar 23, 2023.
@jan-haug (Contributor, Author) commented:

@senysenyseny16 my bad, I forgot the if __name__ == "__main__" block that actually calls the function in the script. I updated the example above.

With that, I also get the error when following the docker setup you posted.

@senysenyseny16 (Collaborator) commented:

I have reproduced the problem. Thanks for your report; I think the problem is in the C++ code, and I'll try to debug it.

@jan-haug (Contributor, Author) commented:

Thanks! FWIW, this also pertains to (some) other ops where a custom mapping is implemented in this way, for example Slice, so it doesn't seem to be specific to the exact operation the mapping is used for; see the sketch below.
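
For example, a Slice variant of the reproducer (a sketch, not tested in exactly this form) looks like:

import torch


class SliceModel(torch.nn.Module):
    def forward(self, x):
        # x[:, :256] is exported to ONNX as a Slice node, which onnx2torch maps
        # to a module with the same custom torch.onnx export handling.
        return x[:, :256]


# Reusing test_export from the reproducer with SliceModel in place of
# ReshapeModel should hit the same crash during the GPU re-export.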

@github-actions bot commented:
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions bot commented:
This issue was closed because it has been stalled for 10 days with no activity.

@jan-haug (Contributor, Author) commented:

Any update on this?

@senysenyseny16 (Collaborator) commented:

@jan-haug

We think the problem is in the export mechanism of PyTorch, so we are unlikely to be able to do anything about it right now.
Maybe the export in 2.0 is better 🙅.

Sorry for the late reply.
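
For anyone who wants to check whether the newer exporter behaves differently, here is a sketch (assuming PyTorch >= 2.1, where torch.onnx.dynamo_export is available; paths are placeholders and this is untested here):

import onnx2torch
import torch

# Re-export the converted model with the TorchDynamo-based exporter instead of
# the legacy TorchScript-based torch.onnx.export that crashes above.
converted = onnx2torch.convert("temp.onnx").to("cuda")
sample = torch.rand((1, 512, 1, 1), device="cuda")

onnx_program = torch.onnx.dynamo_export(converted, sample)
onnx_program.save("temp_dynamo.onnx")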
