what(): new_refcount != 1 INTERNAL ASSERT FAILED #126010

Open
ezyang opened this issue May 11, 2024 · 0 comments
Labels: high priority, oncall: pt2, triaged

ezyang commented May 11, 2024

🐛 Describe the bug

Internal xref: https://fb.workplace.com/groups/1075192433118967/posts/1426066368031570/

The internal user decided to try a different approach which doesn't hit this problem, but we should eventually figure this out.

The problem is related to weakref and tensor resurrection. I don't have more data than that.

Backtrace:

                       /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/libsupc++/../../.././libstdc++-v3/libsupc++/eh_throw.cc:95
    @ 00000000031e28cb (unknown)
    @ 00000000031e2a32 (unknown)
    @ 0000000000a5994b THPStorage_fix_weakref(_object*, _object*)
                       fbcode/caffe2/c10/util/intrusive_ptr.h:274
                       -> ./fbcode/caffe2/torch/csrc/StorageMethods.cpp
    @ 000000000058cdd1 method_vectorcall_NOARGS(_object*, _object* const*, unsigned long, _object*) [clone .__uniq.59579175487726443387915304255113886938] [clone .llvm.10286476335678555850]
    @ 000000000039283f call_function(_ts*, PyTraceInfo*, _object***, long, _object*) [clone .__uniq.79849310599369217189729546442812793949]
    @ 0000000000331352 _PyEval_EvalFrameDefault

Python trace:

--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aiplatform/logging/console_logger.py", line 31, in format
    message = super(CppGlogFormatter, self).format(record)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/libfb/py/log.py", line 157, in format
    result = CustomFormatter.format(self, record)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/libfb/py/log.py", line 33, in format
    return logging.Formatter.format(self, record)
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "<string>", line 52, in <module>
  File "<string>", line 49, in __run
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/__par__/meta_only/bootstrap.py", line 98, in run_as_main
    oss_run_as_main(
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/__par__/bootstrap.py", line 94, in run_as_main
    main()
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aimp/cli/cli.py", line 1467, in main
    cli()
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/click/core.py", line 1719, in invoke
    rv.append(sub_ctx.command.invoke(sub_ctx))
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aimp/cli/cli.py", line 251, in wrapper
    func(ctx, *other_args, **kwargs)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aimp/cli/cli.py", line 1140, in gpu_disagg_split
    ).execute(
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/inference_enablement/model_processing/components/model_arch_transform/interface/transform.py", line 62, in execute
    self._execute(transform_state, transform_lib)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aimp/lib/generic_disagg_split.py", line 489, in _execute
    self._split(gm, transform_state, transform_lib)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aimp/lib/generic_disagg_split.py", line 659, in _split
    res = split_graphmodule(
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aiplatform/model_splitting/generic/model_split.py", line 183, in split_graphmodule
    n = comp.graph.node_copy(node, remap_func)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/torch/fx/graph.py", line 1316, in node_copy
    args = map_arg(node.args, arg_transform)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/torch/fx/node.py", line 743, in map_arg
    return map_aggregate(a, lambda x: fn(x) if isinstance(x, Node) else x)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/torch/fx/node.py", line 751, in map_aggregate
    t = tuple(map_aggregate(elem, fn) for elem in a)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/torch/fx/node.py", line 751, in <genexpr>
    t = tuple(map_aggregate(elem, fn) for elem in a)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/torch/fx/node.py", line 761, in map_aggregate
    return fn(a)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/torch/fx/node.py", line 743, in <lambda>
    return map_aggregate(a, lambda x: fn(x) if isinstance(x, Node) else x)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aiplatform/model_splitting/generic/model_split.py", line 173, in remap_func
    x_copy = _gen_placeholder_node_with_the_same_metadata(x, comp.graph)
  File "/data/users/paulzhan/fbsource/buck-out/v2/gen/fbcode/eaf2f1bb2a81291f/aimp/cli/__cli__/cli#link-tree/aiplatform/model_splitting/generic/model_split.py", line 68, in _gen_placeholder_node_with_the_same_metadata
    logger.info("Meta: ", meta_key, " ", meta_value)
Message: 'Meta: '
Arguments: ('tensor_meta', ' ', TensorMetadata(shape=torch.Size([u17]), dtype=torch.int32, requires_grad=False, stride=(1,), memory_format=torch.contiguous_format, is_quantized=False, qparams={}))
W0508 12:40:19.421077 2381550 ExceptionTracer.cpp:162] Exception tracer library not linked, stack traces not available
E0508 12:40:19.421105 2381550 ExceptionTracer.cpp:218] terminate() called, exception stack follows
E0508 12:40:19.421115 2381550 ExceptionTracer.cpp:220] Exception type: c10::Error

E0508 12:40:19.421129 2381550 ExceptionTracer.cpp:222] exception stack complete
terminate called after throwing an instance of 'c10::Error'
  what():  new_refcount != 1 INTERNAL ASSERT FAILED at "fbcode/caffe2/c10/util/intrusive_ptr.h":276, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero.
Exception raised from retain_ at fbcode/caffe2/c10/util/intrusive_ptr.h:276 (most recent call first):

# 0  std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), caffe2::(anonymous namespace)::registerStacktraceFetcher(int*, char***)::$_2>::_M_invoke(std::_Any_data const&)
# 1  0x0000000003111b63
# 2  0x0000000003110252
# 3  0x00000000031103d2
# 4  THPStorage_fix_weakref(_object*, _object*)
# 5  method_vectorcall_NOARGS(_object*, _object* const*, unsigned long, _object*) [clone .__uniq.59579175487726443387915304255113886938] [clone .llvm.10286476335678555850]
# 6  call_function(_ts*, PyTraceInfo*, _object***, long, _object*) [clone .__uniq.79849310599369217189729546442812793949]
# 7  _PyEval_EvalFrameDefault
# 8  _PyEval_Vector
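
For readers unfamiliar with the assertion: the message says the check in `retain_` fails when the count reads 1 right after an increment, which means the count was already 0 before it. Below is a toy Python model of that invariant. The class and function names are hypothetical, and this is a sketch of the idea, not the c10 `intrusive_ptr` code.

```python
class ToyRefCounted:
    """Toy stand-in for an intrusively reference-counted object (hypothetical)."""

    def __init__(self):
        # One owner exists at creation time.
        self.refcount = 1

    def retain(self):
        self.refcount += 1
        new_refcount = self.refcount
        # A count of 1 immediately after an increment means it was 0 before,
        # i.e. someone is trying to revive an object that had no owners left.
        if new_refcount == 1:
            raise RuntimeError("Cannot increase refcount after it reached zero.")
        return new_refcount

    def release(self):
        self.refcount -= 1
        return self.refcount


def retain_after_last_release():
    """Drop the only owner, then try to retain again; return the error text."""
    obj = ToyRefCounted()
    obj.release()  # count is now 0
    try:
        obj.retain()
    except RuntimeError as exc:
        return str(exc)
    return None


live = ToyRefCounted()
ok_count = live.retain()               # normal case: one owner becomes two
failure = retain_after_last_release()  # the case the assertion guards against
```

How `THPStorage_fix_weakref` ends up retaining a storage whose count has already hit zero is the open question here; the sketch only shows what the check is protecting.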

All in all it's pretty weird.
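
Side note on the `--- Logging error ---` block in the Python trace: it comes from `logger.info("Meta: ", meta_key, " ", meta_value)`. The logging module applies extra positional arguments to the message with `%`, and `'Meta: '` has no placeholders, hence the `TypeError`. The logging module reports such errors and carries on, so this looks like noise rather than the cause of the `c10::Error` abort (that is an inference, not something the trace confirms). A minimal sketch of a call that formats cleanly, assuming a hypothetical logger name and a placeholder value:

```python
import io
import logging


def build_logger(stream):
    """Return a logger that writes bare messages to the given stream."""
    logger = logging.getLogger("meta_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_meta(logger, meta_key, meta_value):
    # One %s placeholder per extra argument; formatting is deferred to logging.
    logger.info("Meta: %s %s", meta_key, meta_value)


stream = io.StringIO()
demo_logger = build_logger(stream)
log_meta(demo_logger, "tensor_meta", "<metadata repr>")  # placeholder value
logged = stream.getvalue()
```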

Versions

main

cc @gchanan @zou3519 @kadeng @msaroufim @bdhirsh @anijain2305 @chauhang

bdhirsh added the triaged label and removed the triage review label on May 14, 2024