Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions #3111

Closed
elefeint opened this issue Jul 8, 2022 · 7 comments · Fixed by #3122
Closed

Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions #3111

elefeint opened this issue Jul 8, 2022 · 7 comments · Fixed by #3122
Assignees
Labels
type/enhancement A general enhancement
Milestone

Comments

@elefeint
Copy link

elefeint commented Jul 8, 2022

Related issues: #3036, #2521, #2917, #1118

The combination of unrecoverable Error subclasses and reactive/asynchronous applications can easily lead to applications being left in semi-working state with no visible errors being logged.

Motivation

In an imperative, blocking application, an unrecoverable Error subclass would lead to the main application thread terminating with a stack trace. However, in reactive/asynchronous applications, the Error is often propagated up to a thread in some thread pool, which upon termination is peacefully returned to the pool. Unless some layer of the application catches Throwables and logs the error, it will be invisible to the end users and application maintainers. On the other hand, Reactor sinks/subscriptions will be left in a "hanging" state -- neither working, nor terminated -- leading to hard to catch and hard to debug situations.

The specific scenario that caused me to look into this was a minor library version incompatibility that resulted in java.lang.NoSuchMethodError being thrown in some stream mapping logic. While unrecoverable from both Java perspective (because it's an Error) and from Reactor throwIfJvmFatal perspective (because it's a LinkageErrror), this type of issue tends to be isolated to a particular code path and perhaps does not need to lead to application shutdown. At minimum, it would be good to have consistent logging of unrecoverable errors.

Desired solution

In order of the amount of work involved.

  1. Logging fatal errors as part of Exceptions.throwIfJvmFatal.
  2. While Reactor is doing the right thing by not turning thrown Error into stream errors, it would be good if the downstream code could recover if desired. This would require that FluxSink is left in a state in which it's capable of terminating -- in the reactorFluxSinkUnrecoverableError() test below, it would mean sink.error(e); results in the proper stream termination (although the exception passed in to the sink probably needs to be wrapped in something that won't re-trigger throwIfJvmFatal behavior).

Considered alternatives

Leave everything as is; not handling Error subclasses works as intended. While true, it does lead to developers spending a lot of time troubleshooting -- see related issues. And that's only the people who did figure out what was going wrong! It's a valid option though.

Additional context

The following two tests demonstrate the difference in behavior between normal exceptions being handled and converted to stream errors, and unrecoverable errors leading to a hanging stream.

  // this test passes
 @Test
  void reactorFluxSinkRecoverableException() {

    Flux<String> testFlux = Flux.create(sink -> {
      sink.onCancel(() -> System.out.println("*** canceled"));
      sink.onDispose(() -> System.out.println("*** disposed"));
      try {
        sink.next("some value");
        sink.complete();
      } catch(Exception e) {
        // This point is never reached because Reactor correctly converts exception into stream error.
        System.out.println("Exception in `next` processing caught, handled: " + e.getMessage());
        sink.error(e);
      }

    }).map(value -> {
      System.out.println("received value: " + value);
      throw new RuntimeException("normal exception");
    });

    StepVerifier.create(testFlux)
            .verifyErrorMessage("normal exception");
  }

  // this test hangs
  @Test
  void reactorFluxSinkUnrecoverableError() {

    Flux<String> testFlux = Flux.create(sink -> {
      sink.onCancel(() -> System.out.println("*** canceled"));
      sink.onDispose(() -> System.out.println("*** disposed"));
      try {
        sink.next("some value");
        sink.complete();
      } catch(Throwable e) {
        System.out.println("Throwable in `next` processing caught, handled: " + e.getMessage());
        // this will have no effect; a terminal state is no longer reachable for the stream
        sink.error(e);
      }
    }).map(value -> {
      System.out.println("received value: " + value);
      throw new LinkageError("unrecoverable error");
    });

    // This stream will hang.
    StepVerifier.create(testFlux)
            .verifyErrorMessage("unrecoverable error");
  }

@reactorbot reactorbot added the ❓need-triage This issue needs triage, hasn't been looked at by a team member yet label Jul 8, 2022
@simonbasle
Copy link
Member

I think the logging of fatal errors would be the most straightforward solution. would it be enough of a step forward @elefeint ?

wrapping the fatal exception isn't something that can currently easily be implemented without painstakingly exploring and refactoring the whole call tree of throwIfJvmFatal.

@simonbasle simonbasle added type/enhancement A general enhancement status/need-decision This needs a decision from the team and removed ❓need-triage This issue needs triage, hasn't been looked at by a team member yet labels Jul 13, 2022
@OlegDokuka
Copy link
Contributor

OlegDokuka commented Jul 13, 2022

I think the logging of fatal errors would be the most straightforward solution. would it be enough of a step forward @elefeint ?

@simonbasle I don't think we should and can do anything here but logging. Another alternative can be adding something like Exceptions.isFatal(throwable) so one can see if exceptions can be propagated downstream or not but I doubt it can be useful.


@elefeint, it is worth mentioning that FluxSink reactor java doc clearly states on next|complete|error:

Might throw an unchecked exception in case of a fatal error downstream which cannot be propagated to any asynchronous handler (aka a bubbling exception).

which means that FluxSink#next|error|complete returns normally except for a fatal error which should be equal to a kernel panic for the reactive chain (the whole chain is dead now).

Also, it is worth mentioning that a fatal error may mean internal state corruption (because they can be thrown unexpectedly) so further communication with the chain could be impossible.

Thus, wrapping sink calls makes no sense at all, especially calling sink.error in the catch section (assume if sink.next|complete|error is thrown, you can do nothing but log)

@elefeint
Copy link
Author

I agree that logging has the highest benefit-to-effort. It would absolutely solve the biggest problem of people not knowing where to start looking for a problem.

For the second part -- allowing sinks to recover, the amount of work and payoff are both less clear. This behavior is definitely documented in sink.error() and in Exceptions, but making use of the documentation requires already knowing what went wrong, which is not possible without debugging currently.

While it's true that errors like OOM are likely to leave the application in a bad state regardless of what reactor sinks do, right now the throwIfJvmFatal consequences lead to an unrecoverable state even if the rest of the application is okay. I could not figure out what it is that prevents any future calls to FluxSink.error() orFluxSink.complete() from taking effect, so I have no idea how much work changing that would be.

@OlegDokuka
Copy link
Contributor

OlegDokuka commented Jul 13, 2022

While it's true that errors like OOM are likely to leave the application in a bad state regardless of what reactor sinks do, right now the throwIfJvmFatal consequences lead to an unrecoverable state even if the rest of the application is okay. I could not figure out what it is that prevents any future calls to FluxSink.error() orFluxSink.complete() from taking effect, so I have no idea how much work changing that would be.

It is not even related to the application state, it is related to the internal state-machine implementation of the project reactor. We do not assume that everything should blow at any moment, we assume the majority of method calls that we do returns normally. If an error like OOM happens spuriously, we just bubble it up, and eventually, it is handled at some point (so can be logged). However, there is no guarantee that the place from which the error was propagated is also handled exception properly, etc. All that leads to the fact that some volatile or non-volatile states that we maintain may be corrupted because of that. After all, it may end up in the impossibility to progress further at all (e.g. sending other signals).

@simonbasle simonbasle self-assigned this Jul 18, 2022
@simonbasle
Copy link
Member

I'll take that one and add logging to Exceptions.throwIfFatal

@simonbasle simonbasle added this to the 3.4.22 milestone Jul 19, 2022
@simonbasle simonbasle removed the status/need-decision This needs a decision from the team label Jul 19, 2022
@simonbasle simonbasle changed the title Allow FluxSink to terminate even for Throwable/Error Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions Jul 21, 2022
simonbasle added a commit that referenced this issue Jul 21, 2022
This commit improves the discoverability of exceptions that are "fatal"
to Reactor by:
 - adding `Exceptions.isFatal` and `Exceptions.isJvmFatal` methods
 - adding logging to `Exceptions.throwIfFatal` and
 `Exceptions.throwIfJvmFatal` just before a fatal exception is thrown

This should at least help pinpointing such occurrences from the logs,
in cases where the thrown exception bubble all the way up the stack of
a thread with no uncaughtExceptionHandler for example.

Fixes #3111.
@schananas
Copy link

It would be helpful to add more control over fatal exceptions. In my case, a large object was attempted to be loaded into memory which caused of memory exception, which was not fatal for application in general but only for a particular request that was too large. We didn't notice any errors, nothing was in the log and Flux just kept hanging forever, because throwIfFatal was thrown before a cancelation of the subscription. It was really hard to find the cause of why Flux was hanging forever.

Since we consider that OutOfMemoryException is not fatal for us, now we need to find all the places in the code where we could potentially load a large object in memory, try-catch, and in case of OutOfMemoryException to convert to a nonfatal version of exception.

Instead of just adding a log, I would propose to add a Hooks.doOnFatalException(...), where users could override fatal logic, like mapping an exception to non fatal version.

@simonbasle
Copy link
Member

@schananas if you have pieces of code that are susceptible to OOM but you're confident you can recover from these, then you are the best suited to implement that. I'd argue only you can take the responsibility to wrap such code in a try-catch block, because Reactor certainly can't identify these specific occurrences. I doubt even you can, but that's another story (see this informative stackoverflow answer for some context).

with the introduction of isFatal is the associated PR, it could open up a path of hooking a Predicate but as explained above that Predicate would apply too broadly (filter by exception Class, message...). Nothing better than user knowledge to pinpoint "I'm about to decode a potentially huge object, let's try/catch". We won't be pursuing that path for now.

simonbasle added a commit that referenced this issue Aug 1, 2022
This commit improves the discoverability of exceptions that are "fatal"
to Reactor by:
 - adding `Exceptions.isFatal` and `Exceptions.isJvmFatal` methods
 - adding logging to `Exceptions.throwIfFatal` and
 `Exceptions.throwIfJvmFatal` just before a fatal exception is thrown

This should at least help pinpointing such occurrences from the logs,
in cases where the thrown exception bubble all the way up the stack of
a thread with no uncaughtExceptionHandler for example.

Fixes #3111.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement A general enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants