Stack overflow not sent to backend #977

pontusn · 2024-04-10T11:26:41Z

Description
We have discovered that crashes due to stack overflow are not uploaded to Sentry. The application silently crash without any additional details reported.

When does the problem happen

During build
During run-time
When capturing a hard crash

Environment

OS: Windows 10 and later, 32-bit
Compiler: MSVC2022
vcpkg, 0.7.2, also applies to older versions

Steps To Reproduce
Generate stack overflow by allocating all available stack and then calling into any function.

Log output
None.

Mitigations
After discovering this issue we applied several mitigations:

Increased stack size to reduce probability
Reserve one extra stack page for recovery SetThreadStackGuarantee
Inject SEH handler that calls _resetstkoflw to avoid subsequent crash in error handling

However, I would very much prefer that this situations was handled also with "naive" integration.

The text was updated successfully, but these errors were encountered:

supervacuus · 2024-04-10T16:30:24Z

Hi @pontusn, thanks for the report. Can you provide more information?

which crash-backend (crashpad, breakpad, inproc) do you use?
do you specify any stack-relevant parameters (/STACK, /GS, /Gs, etc.) in your build?
do you see the same behavior on 64-bit systems?
do other crashes report correctly?
do you register either an on_crash or before_send hook in which you allocate memory on the stack?

Generate stack overflow by allocating all available stack and then calling into any function.

I am just asking because I can't reproduce this. I regularly test against stack overflows, and Sentry certainly receives stack overflows on Windows.

Can you elaborate on your mitigations? Did any of them fix the failure to report (i.e., do you see EXCEPTION_STACK_OVERFLOW events in Sentry after applying your mitigations)?

In particular, _resetstkoflw should not be called by any filter functions, which are the primary handlers in all our backends; it is also unclear how it could help in that situation inside an UnhandledExceptionFilter.

SetThreadStackGuarantee() is a function that sets the guarantee per thread and will only provide the requested stack size for the thread in which sentry_init() would be run. So, this is not a generic solution, but I could imagine adding this to the docs. However, that information would only make sense for us to add after we understand your setup better.

However, I would very much prefer that this situations was handled also with "naive" integration.

Can you explain what you mean by "naive" integrations? What do you expect from the Native SDK?

pontusn · 2024-04-10T16:55:09Z

Crashpad (default)
/GS
/STACK was previously missing (ie 1Mb), now explicitly set to 2Mb
Our application is 32-bit only, hence no information about 64-bit.
Other crashes are reported, but since this failed I really don't know if other crashes are missing
No hooks

The crash I was investigating happened inside MSFTEDIT_CLASS when handling WM_COPY for 260000 characters. I suspect some internal handling use _alloca during conversion between RTF, ANSI and UNICODE to place all formats on clipboard.

We have been using Sentry Native SDK with great results for +3 years. The "naive" integrations is simply calling sentry_init and start collecting. Now our integration goes much deeper, but the core is the simplicity. I expect that the SDK will continue to help us improve software quality in ways no one in the team initially imagined.

I'll try to investigate the problem deeper, but for some reason these crashes are nor uploaded....

pontusn · 2024-04-10T17:35:31Z

I've now confirmed that usage of _alloca is the trigger.

supervacuus · 2024-04-11T08:16:01Z

Crashpad (default)

/GS

/STACK was previously missing (ie 1Mb), now explicitly set to 2Mb

Our application is 32-bit only, hence no information about 64-bit.

Other crashes are reported, but since this failed I really don't know if other crashes are missing

No hooks

Thanks! With my question regarding other kinds of crashes, I want to ensure that stack-overflows specifically do not report instead of you experiencing a regression with crash reporting in general. Can I ask you to check a few things?

Did any of your mentioned mitigations fix the reporting issue (i.e., not only prevent the stack-overflow from happening but allow the Native SDK to report a stack-overflow when it happens)?
Can you access the Native SDK debug log in the affected application? If so, do you see the log entries "flushing session and queue before crashpad handler" and "handing control over to crashpad" when the stack-overflow happens? Do you see any warnings during the initialization?
Do you see new *.dmp files being produced in <database-path>/reports directory?
Can you provoke another crash, like an access violation in which you write to a NULL pointer or another invalid address, and check whether it reports correctly?
Can you check that the crashpad_handler.exe is running when you trigger the stack-overflow?

We have been using Sentry Native SDK with great results for +3 years. The "naive" integrations is simply calling sentry_init and start collecting. Now our integration goes much deeper, but the core is the simplicity. I expect that the SDK will continue to help us improve software quality in ways no one in the team initially imagined.

It's awesome that the Native SDK has given you value over the years. We'll try to keep it that way! None of the mentioned mitigations should be necessary for the Native SDK to detect the crash, so let's find out why it is not reporting the stack overflow.

On the other hand, if you expect that some part of your application requires stack allocations of any considerable size and there is no way to change that to a heap allocation, then I think specifying larger stack commit/reserve values is the right approach, independent of your use of the Native SDK.

supervacuus · 2024-04-11T09:35:36Z

I experimented a little more, and I can provoke stack overflows that hit your scenario. I seemingly have hit the __chkstk in all my previous tests, and now, through minor reordering in the triggering function, I can bypass it and land on a global guard page.

At this point, the stack is exhausted enough that the UnhandledExceptionFilter is no longer invoked. SetThreadStackGuarantee() is a fix for that. I will create a stable integration test for an exhausted stack and add docs and the fix to the code.

cc @kahest, this is sigaltstack() for Windows. I will add related docs when I add the Android docs. I also wonder whether extending the WER module for stack-overflows wouldn't be an interesting scenario since it should be informed about these exceptions, too, independent of whether all threads have called SetThreadStackGuarantee() or not. It also would make us independent of guesstimating the stack's size (including required docs to let users tune it), given that the clients' crash hooks could considerably increase the required stack size.

@pontusn, you don't need to do any more checks on your side. Thanks again for the report!

pontusn · 2024-04-23T14:40:35Z

Any information on when fix for this will be included in SDK?

supervacuus · 2024-04-23T19:31:55Z

Any information on when fix for this will be included in SDK?

Not before next week. Other tasks, including precursors for this topic, are still in progress. After these tasks it will be the next thing I will work on.

Just to be clear, we cannot provide a comprehensive fix for this problem because we don't own all threads. For these, it is mostly a case for the documentation.

We will provide a fix for the thread that executes sentry_init() and, I think, adapt the crashpad WER module so that we can report crashes like this one even if the in-process crash handler crashes due to stack exhaustion. But this only works for users of the crashpad backend.

pontusn · 2024-04-24T06:32:09Z

For our product, where the integration with Sentry is in a separate DLL, we hooked thread-level injection via DllMain:

extern "C" BOOL WINAPI DllMain(HINSTANCE hInstance, DWORD dwReason, LPVOID lpReserved)
{
    if (dwReason == DLL_PROCESS_ATTACH || dwReason == DLL_THREAD_ATTACH)
    {
        // Reserve memory for handling "stack overflow"
        ULONG ulStackGuarantee = 64 * 1024;
        SetThreadStackGuarantee(&ulStackGuarantee);
    }

    //...
}

I would assume you could use similar pattern for sentry.dll itself. Possibly guarded by some initialization flag.

These articles provided me with interesting details:
https://peteronprogramming.wordpress.com/2016/08/10/crashes-you-cant-handle-easily-2-stack-overflows-on-windows/
https://devblogs.microsoft.com/oldnewthing/20200610-00/?p=103855

supervacuus · 2024-04-24T09:17:46Z

Thanks @pontusn. The problem with using DllMain is twofold:

it only affects threads started after sentry.dll was loaded
it doesn't cover the considerable use of the Native SDK as a static library

But I'll consider adding it, even if only as part of the documentation.

pontusn · 2024-04-24T20:33:18Z

Fair limitations, to my understanding there is no alternative method to inject code for thread initialization in Windows.

For most applications I would assume Sentry is among first things to initialize on the start thread, hence all threads should be included for real world use cases.

The static scenario could be handled by embedding a minimal DLL for this purpose that is exported to file and loaded dynamically.

However, as you already mentioned, documentation is probably the key part of addressing this issue

getsantry bot added the Waiting for: Product Owner label Apr 10, 2024

getsantry bot removed the Waiting for: Product Owner label Apr 10, 2024

getsantry bot added the Waiting for: Product Owner label Apr 10, 2024

getsantry bot removed the Waiting for: Product Owner label Apr 11, 2024

supervacuus self-assigned this Apr 11, 2024

supervacuus added bug Something isn't working Platform: Windows area: backend labels Apr 11, 2024

getsantry bot added the Waiting for: Product Owner label Apr 23, 2024

supervacuus added the area: docs label Apr 23, 2024

getsantry bot removed the Waiting for: Product Owner label Apr 23, 2024

supervacuus mentioned this issue Apr 23, 2024

feat(native): advanced usage section to explain OS error handling getsentry/sentry-docs#9787

Merged

getsantry bot added the Waiting for: Product Owner label Apr 24, 2024

kahest removed the Waiting for: Product Owner label Apr 24, 2024

getsantry bot added the Waiting for: Product Owner label Apr 24, 2024

kahest removed the Waiting for: Product Owner label Apr 26, 2024

supervacuus mentioned this issue May 7, 2024

fix: correctly support stack overflows across platforms #982

Merged

supervacuus closed this as completed in #982 May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stack overflow not sent to backend #977

Stack overflow not sent to backend #977

pontusn commented Apr 10, 2024 •

edited

supervacuus commented Apr 10, 2024

pontusn commented Apr 10, 2024

pontusn commented Apr 10, 2024

supervacuus commented Apr 11, 2024

supervacuus commented Apr 11, 2024 •

edited

pontusn commented Apr 23, 2024

supervacuus commented Apr 23, 2024

pontusn commented Apr 24, 2024 •

edited

supervacuus commented Apr 24, 2024

pontusn commented Apr 24, 2024

Stack overflow not sent to backend #977

Stack overflow not sent to backend #977

Comments

pontusn commented Apr 10, 2024 • edited

supervacuus commented Apr 10, 2024

pontusn commented Apr 10, 2024

pontusn commented Apr 10, 2024

supervacuus commented Apr 11, 2024

supervacuus commented Apr 11, 2024 • edited

pontusn commented Apr 23, 2024

supervacuus commented Apr 23, 2024

pontusn commented Apr 24, 2024 • edited

supervacuus commented Apr 24, 2024

pontusn commented Apr 24, 2024

pontusn commented Apr 10, 2024 •

edited

supervacuus commented Apr 11, 2024 •

edited

pontusn commented Apr 24, 2024 •

edited