Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack overflow not sent to backend #977

Closed
2 of 3 tasks
pontusn opened this issue Apr 10, 2024 · 10 comments · Fixed by #982
Closed
2 of 3 tasks

Stack overflow not sent to backend #977

pontusn opened this issue Apr 10, 2024 · 10 comments · Fixed by #982
Assignees

Comments

@pontusn
Copy link

pontusn commented Apr 10, 2024

Description
We have discovered that crashes due to stack overflow are not uploaded to Sentry. The application silently crash without any additional details reported.

When does the problem happen

  • During build
  • During run-time
  • When capturing a hard crash

Environment

  • OS: Windows 10 and later, 32-bit
  • Compiler: MSVC2022
  • vcpkg, 0.7.2, also applies to older versions

Steps To Reproduce
Generate stack overflow by allocating all available stack and then calling into any function.

Log output
None.

Mitigations
After discovering this issue we applied several mitigations:

  • Increased stack size to reduce probability
  • Reserve one extra stack page for recovery SetThreadStackGuarantee
  • Inject SEH handler that calls _resetstkoflw to avoid subsequent crash in error handling

However, I would very much prefer that this situations was handled also with "naive" integration.

@supervacuus
Copy link
Collaborator

Hi @pontusn, thanks for the report. Can you provide more information?

  • which crash-backend (crashpad, breakpad, inproc) do you use?
  • do you specify any stack-relevant parameters (/STACK, /GS, /Gs, etc.) in your build?
  • do you see the same behavior on 64-bit systems?
  • do other crashes report correctly?
  • do you register either an on_crash or before_send hook in which you allocate memory on the stack?

Generate stack overflow by allocating all available stack and then calling into any function.

I am just asking because I can't reproduce this. I regularly test against stack overflows, and Sentry certainly receives stack overflows on Windows.

Can you elaborate on your mitigations? Did any of them fix the failure to report (i.e., do you see EXCEPTION_STACK_OVERFLOW events in Sentry after applying your mitigations)?

In particular, _resetstkoflw should not be called by any filter functions, which are the primary handlers in all our backends; it is also unclear how it could help in that situation inside an UnhandledExceptionFilter.

SetThreadStackGuarantee() is a function that sets the guarantee per thread and will only provide the requested stack size for the thread in which sentry_init() would be run. So, this is not a generic solution, but I could imagine adding this to the docs. However, that information would only make sense for us to add after we understand your setup better.

However, I would very much prefer that this situations was handled also with "naive" integration.

Can you explain what you mean by "naive" integrations? What do you expect from the Native SDK?

@pontusn
Copy link
Author

pontusn commented Apr 10, 2024

  • Crashpad (default)
  • /GS
  • /STACK was previously missing (ie 1Mb), now explicitly set to 2Mb
  • Our application is 32-bit only, hence no information about 64-bit.
  • Other crashes are reported, but since this failed I really don't know if other crashes are missing
  • No hooks

The crash I was investigating happened inside MSFTEDIT_CLASS when handling WM_COPY for 260000 characters. I suspect some internal handling use _alloca during conversion between RTF, ANSI and UNICODE to place all formats on clipboard.

We have been using Sentry Native SDK with great results for +3 years. The "naive" integrations is simply calling sentry_init and start collecting. Now our integration goes much deeper, but the core is the simplicity. I expect that the SDK will continue to help us improve software quality in ways no one in the team initially imagined.

I'll try to investigate the problem deeper, but for some reason these crashes are nor uploaded....

@pontusn
Copy link
Author

pontusn commented Apr 10, 2024

I've now confirmed that usage of _alloca is the trigger.

@supervacuus
Copy link
Collaborator

  • Crashpad (default)
  • /GS
  • /STACK was previously missing (ie 1Mb), now explicitly set to 2Mb
  • Our application is 32-bit only, hence no information about 64-bit.
  • Other crashes are reported, but since this failed I really don't know if other crashes are missing
  • No hooks

Thanks! With my question regarding other kinds of crashes, I want to ensure that stack-overflows specifically do not report instead of you experiencing a regression with crash reporting in general. Can I ask you to check a few things?

  • Did any of your mentioned mitigations fix the reporting issue (i.e., not only prevent the stack-overflow from happening but allow the Native SDK to report a stack-overflow when it happens)?
  • Can you access the Native SDK debug log in the affected application? If so, do you see the log entries "flushing session and queue before crashpad handler" and "handing control over to crashpad" when the stack-overflow happens? Do you see any warnings during the initialization?
  • Do you see new *.dmp files being produced in <database-path>/reports directory?
  • Can you provoke another crash, like an access violation in which you write to a NULL pointer or another invalid address, and check whether it reports correctly?
  • Can you check that the crashpad_handler.exe is running when you trigger the stack-overflow?

We have been using Sentry Native SDK with great results for +3 years. The "naive" integrations is simply calling sentry_init and start collecting. Now our integration goes much deeper, but the core is the simplicity. I expect that the SDK will continue to help us improve software quality in ways no one in the team initially imagined.

It's awesome that the Native SDK has given you value over the years. We'll try to keep it that way! None of the mentioned mitigations should be necessary for the Native SDK to detect the crash, so let's find out why it is not reporting the stack overflow.

On the other hand, if you expect that some part of your application requires stack allocations of any considerable size and there is no way to change that to a heap allocation, then I think specifying larger stack commit/reserve values is the right approach, independent of your use of the Native SDK.

@supervacuus
Copy link
Collaborator

supervacuus commented Apr 11, 2024

I experimented a little more, and I can provoke stack overflows that hit your scenario. I seemingly have hit the __chkstk in all my previous tests, and now, through minor reordering in the triggering function, I can bypass it and land on a global guard page.

At this point, the stack is exhausted enough that the UnhandledExceptionFilter is no longer invoked. SetThreadStackGuarantee() is a fix for that. I will create a stable integration test for an exhausted stack and add docs and the fix to the code.

cc @kahest, this is sigaltstack() for Windows. I will add related docs when I add the Android docs. I also wonder whether extending the WER module for stack-overflows wouldn't be an interesting scenario since it should be informed about these exceptions, too, independent of whether all threads have called SetThreadStackGuarantee() or not. It also would make us independent of guesstimating the stack's size (including required docs to let users tune it), given that the clients' crash hooks could considerably increase the required stack size.

@pontusn, you don't need to do any more checks on your side. Thanks again for the report!

@supervacuus supervacuus self-assigned this Apr 11, 2024
@pontusn
Copy link
Author

pontusn commented Apr 23, 2024

Any information on when fix for this will be included in SDK?

@supervacuus
Copy link
Collaborator

Any information on when fix for this will be included in SDK?

Not before next week. Other tasks, including precursors for this topic, are still in progress. After these tasks it will be the next thing I will work on.

Just to be clear, we cannot provide a comprehensive fix for this problem because we don't own all threads. For these, it is mostly a case for the documentation.

We will provide a fix for the thread that executes sentry_init() and, I think, adapt the crashpad WER module so that we can report crashes like this one even if the in-process crash handler crashes due to stack exhaustion. But this only works for users of the crashpad backend.

@pontusn
Copy link
Author

pontusn commented Apr 24, 2024

For our product, where the integration with Sentry is in a separate DLL, we hooked thread-level injection via DllMain:

extern "C" BOOL WINAPI DllMain(HINSTANCE hInstance, DWORD dwReason, LPVOID lpReserved)
{
    if (dwReason == DLL_PROCESS_ATTACH || dwReason == DLL_THREAD_ATTACH)
    {
        // Reserve memory for handling "stack overflow"
        ULONG ulStackGuarantee = 64 * 1024;
        SetThreadStackGuarantee(&ulStackGuarantee);
    }

    //...
}

I would assume you could use similar pattern for sentry.dll itself. Possibly guarded by some initialization flag.

These articles provided me with interesting details:
https://peteronprogramming.wordpress.com/2016/08/10/crashes-you-cant-handle-easily-2-stack-overflows-on-windows/
https://devblogs.microsoft.com/oldnewthing/20200610-00/?p=103855

@supervacuus
Copy link
Collaborator

Thanks @pontusn. The problem with using DllMain is twofold:

  • it only affects threads started after sentry.dll was loaded
  • it doesn't cover the considerable use of the Native SDK as a static library

But I'll consider adding it, even if only as part of the documentation.

@pontusn
Copy link
Author

pontusn commented Apr 24, 2024

Fair limitations, to my understanding there is no alternative method to inject code for thread initialization in Windows.

For most applications I would assume Sentry is among first things to initialize on the start thread, hence all threads should be included for real world use cases.

The static scenario could be handled by embedding a minimal DLL for this purpose that is exported to file and loaded dynamically.

However, as you already mentioned, documentation is probably the key part of addressing this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants