Fix access violation exception on shutdown #2386

Frawak · 2024-05-13T14:38:09Z

What does this implement/fix? Explain your changes.

When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in the CLRObject.reflectedObjects are zero. This caused an exception in some scenarios because Runtime.PyObject_TYPE(reflectedClrObject) is called while the reference counter is at zero.

After TypeManager.RemoveTypes(); is called in the Runtime.Shutdown() method, reference count decrements to zero do not invoke ClassBase.tp_clear for managed objects anymore which normally is responsible for freeing GC handles and removing references from CLRObject.reflectedObjects. Collecting objects referenced in CLRObject.reflectedObjects only after leads to an unstable state in which the reference count for these object addresses is zero while still maintaining them to be used for further pseudo-cleanup. In that time, the memory could have been reclaimed already which leads to the exception.

Does this close any currently open issues?

#1977

Any other comments?

...

Checklist

Check all those that are applicable and complete.

Make sure to include one or more tests for your change
If an enhancement PR, please create docs and at best an example
Ensure you have signed the .NET Foundation CLA
Add yourself to AUTHORS
Updated the CHANGELOG

When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in the `CLRObject.reflectedObjects` are zero. This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference counter is at zero. After `TypeManager.RemoveTypes();` is called in the `Runtime.Shutdown()` method, reference count decrements to zero do not invoke `ClassBase.tp_clear` for managed objects anymore which normally is responsible for removing references from `CLRObject.reflectedObjects`. Collecting objects referenced in `CLRObject.reflectedObjects` only after leads to an unstable state in which the reference count for these object addresses is zero while still maintaining them to be used for further pseudo-cleanup. In that time, the memory could have been reclaimed already which leads to the exception.

Frawak · 2024-05-13T19:11:47Z

Alright. At least the embedding tests went through. I am going to take another look.

Otherwise, collecting all at this earlier point results in corrupt memory for derived types.

Frawak · 2024-05-13T20:43:30Z

The original change broke the shutdown flow for derived objects. The (hopefully) final proposition from me remains the same, only restricting the early collection run to the managed objects which would, as described above, run into the problem of the cleared tp_clear slot at a later point.

Frawak · 2024-05-21T07:15:42Z

@filmor @lostmsu Could you please trigger the workflows/pipelines again?

lostmsu · 2024-05-21T18:07:41Z

This feels like a band-aid rather than proper fix. Theoretically invoking GC should not leave anything in a broken state, but types derived across .NET/Python boundaries are tricky. We should come up with a correct way to finalize them rather than adding such a band-aid, IMHO.

It is possible your scenario keeps invalid references. E.g. Python or .NET hold live references to each other's objects at the point when you call Shutdown. You need to narrow down the scenario to a short test case that would reliably crash, then we can see if Python.NET needs to fix that and how or provide a better error message.

Frawak · 2024-05-21T20:08:47Z

@lostmsu To make sure we are on the same page and in risk of repeating myself, I will copy some code snippets of the current master.

The shutdown method:

            NullGCHandles(ExtensionType.loadedExtensions);
            ClassManager.RemoveClasses();
            TypeManager.RemoveTypes(); // this removes tp_clear slot
            _typesInitialized = false;

            ...

            PyGC_Collect();
            bool everythingSeemsCollected = TryCollectingGarbage(MaxCollectRetriesOnShutdown,
                                                                 forceBreakLoops: true); // this decreases the reference count

The tp_clear method:

        public static int tp_clear(BorrowedReference ob)
        {
            var weakrefs = Runtime.PyObject_GetWeakRefList(ob);
            if (weakrefs != null)
            {
                Runtime.PyObject_ClearWeakRefs(ob);
            }

            if (TryFreeGCHandle(ob)) // what NullGCHandles tries to do after the fact
            {
                IntPtr addr = ob.DangerousGetAddress();
                bool deleted = CLRObject.reflectedObjects.Remove(addr); // the normal removal from hashset
                Debug.Assert(deleted);
            }

            int baseClearResult = BaseUnmanagedClear(ob);
            if (baseClearResult != 0)
            {
                return baseClearResult;
            }

            ClearObjectDict(ob);
            return 0;
        }

tp_clear should be responsible for freeing the GC handle and removing the address from CLRObject.reflectedObjects what the forceBreakLoops tries to do separatedly.

                for (int i = 0; i < 2; i++)
                {
                    GC.Collect();
                    GC.WaitForPendingFinalizers();
                    pyCollected += PyGC_Collect();
                    pyCollected += Finalizer.Instance.DisposeAll();
                }
                if (Volatile.Read(ref _collected) == 0 && pyCollected == 0)
                {
                    if (attempt + 1 == runs) return true;
                }
                else if (forceBreakLoops)
                {
                    NullGCHandles(CLRObject.reflectedObjects);
                    CLRObject.reflectedObjects.Clear();
                }

There are two crucial points which I hope you see are wrong in the current shutdown flow:

tp_clear is not invoked after decreasing the reference count of a respective object (referenced in CLRObject.reflectedObjects) to zero.
The reference count to all addresses in CLRObject.reflectedObjects is zero when calling NullGCHandles(CLRObject.reflectedObjects);, i.e. handling the addresses in that state.

This can be seen by just debugging and stepping into the code, regardless if an exception is thrown or not. Hence, the state is already broken after the line pyCollected += Finalizer.Instance.DisposeAll();.

Is this not enough that a fix is necessary?
An error message does not help because nothing can be done really to handle it. If you extend Python and unload pythonnet, you cannot use C# exception types anymore to catch them in the Python domain. This is an unhandled exception that crashes the Python process.

Frawak · 2024-05-21T20:32:09Z

This feels like a band-aid rather than proper fix.

Is this a feeling because of function name semantics which could be dissolved by refactoring the solution?

Effectively, my changes only run this:

                    while (!_objQueue.IsEmpty)
                    {
                        if (!_objQueue.TryDequeue(out var obj))
                            continue;

                        if (obj.RuntimeRun != run)
                        {
                            HandleFinalizationException(obj.PyObj, new RuntimeShutdownException(obj.PyObj));
                            continue;
                        }

                        IntPtr copyForException = obj.PyObj;
                        Runtime.XDecref(StolenReference.Take(ref obj.PyObj));
                        collected++;
                        try
                        {
                            Runtime.CheckExceptionOccurred();
                        }
                        catch (Exception e)
                        {
                            HandleFinalizationException(obj.PyObj, e);
                        }
                    }

Before this:

            NullGCHandles(ExtensionType.loadedExtensions);
            ClassManager.RemoveClasses();
            TypeManager.RemoveTypes();
            _typesInitialized = false;

            MetaType.Release();
            PyCLRMetaType.Dispose();
            PyCLRMetaType = null!;

            Exceptions.Shutdown();
            PythonEngine.InteropConfiguration.Dispose();
            DisposeLazyObject(clrInterop);
            DisposeLazyObject(inspect);
            DisposeLazyObject(hexCallable);
            PyObjectConversions.Reset();

lostmsu · 2024-05-21T22:35:42Z

An explanation is not enough, we need a minimal test (not necessarily written as a test function, could be a short script or .NET console app) that fails before the fix, and passes after, preferably with just a handful of allocated objects.

I am being so picky because without certainty about exactly what's going on I can not be sure that your fix does not just randomly happens to help your particular use case while breaking someone else's use case not yet covered by tests. And it introduces a little extra complexity too.

Frawak · 2024-05-22T15:20:58Z

This is the most down grinded version of my case that I could accomplish: https://github.com/Frawak/pythonnet/tree/fix/1977-access-violation-gc-testcase/exception_scenario
Please read the README.

I hope that the exception is as reliable on other machines.

preferably with just a handful of allocated objects

The problem with that: The exception only occurs with some internal memory management from Python (my interpretation) which is hard to reproduce in general and I do not see how it can be done with a few simple objects. The process has to be triggered by its inner logic to access the memory while pythonnet is shutting down.

I also had to add the other open-source library because while using that, the exception occurred in specific cases. But there is nothing wrong with the library and its types. They are pretty simple and no IDisposables or the like.
Trying to magically trigger this exception with default .NET library types is like gambling in the lottery.

And as I said: It is not about the surrounding conditions when the exception is thrown. They are not the cause. The cause can be seen with any script using .NET objects from Python and debugging into the shutdown method. It is just lucky that the exception is rare.

Frawak · 2024-05-31T09:19:47Z

@lostmsu Are you going to look at the code example? Does the exception get thrown when you run it?

It's a process. Work with me.

lostmsu · 2024-05-31T17:31:18Z

@Frawak I took a look at the example, just want to see if I understood it correctly: it does not appear to have either Python classes derived from .NET classes, nor .NET classes derived from Python classes? E.g. all it does is imports .NET lib, instantiates a few classes from there as temporary variables, and that's it?

P.S. sorry about late response, we are unpaid maintainers here, doing it at our leisure.

Frawak · 2024-05-31T20:55:46Z

I took a look at the example, just want to see if I understood it correctly: it does not appear to have either Python classes derived from .NET classes, nor .NET classes derived from Python classes? E.g. all it does is imports .NET lib, instantiates a few classes from there as temporary variables, and that's it?

That is correct.

P.S. sorry about late response, we are unpaid maintainers here, doing it at our leisure.

No need to apologize. Of course I know that. But any information (like "maybe in x weeks") is better than no information. Otherwise, I don't know whether it has genuinely slipped your mind and you need a poke or you come around to it at your own time. In my experience, the former is more common.

Frawak force-pushed the fix/1977-access-violation-gc branch from d496959 to 4e5afdf Compare May 13, 2024 14:43

Restrict first garbage collection

6f0f671

Otherwise, collecting all at this earlier point results in corrupt memory for derived types.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix access violation exception on shutdown #2386

Fix access violation exception on shutdown #2386

Frawak commented May 13, 2024 •

edited

Frawak commented May 13, 2024 •

edited

Frawak commented May 13, 2024

Frawak commented May 21, 2024 •

edited

lostmsu commented May 21, 2024 •

edited

Frawak commented May 21, 2024 •

edited

Frawak commented May 21, 2024

lostmsu commented May 21, 2024 •

edited

Frawak commented May 22, 2024 •

edited

Frawak commented May 31, 2024

lostmsu commented May 31, 2024 •

edited

Frawak commented May 31, 2024 •

edited

Fix access violation exception on shutdown #2386

Are you sure you want to change the base?

Fix access violation exception on shutdown #2386

Conversation

Frawak commented May 13, 2024 • edited

What does this implement/fix? Explain your changes.

Does this close any currently open issues?

Any other comments?

Checklist

Frawak commented May 13, 2024 • edited

Frawak commented May 13, 2024

Frawak commented May 21, 2024 • edited

lostmsu commented May 21, 2024 • edited

Frawak commented May 21, 2024 • edited

Frawak commented May 21, 2024

lostmsu commented May 21, 2024 • edited

Frawak commented May 22, 2024 • edited

Frawak commented May 31, 2024

lostmsu commented May 31, 2024 • edited

Frawak commented May 31, 2024 • edited

Frawak commented May 13, 2024 •

edited

Frawak commented May 13, 2024 •

edited

Frawak commented May 21, 2024 •

edited

lostmsu commented May 21, 2024 •

edited

Frawak commented May 21, 2024 •

edited

lostmsu commented May 21, 2024 •

edited

Frawak commented May 22, 2024 •

edited

lostmsu commented May 31, 2024 •

edited

Frawak commented May 31, 2024 •

edited