Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions #3111

elefeint · 2022-07-08T15:57:46Z

Related issues: #3036, #2521, #2917, #1118

The combination of unrecoverable Error subclasses and reactive/asynchronous applications can easily lead to applications being left in semi-working state with no visible errors being logged.

Motivation

In an imperative, blocking application, an unrecoverable Error subclass would lead to the main application thread terminating with a stack trace. However, in reactive/asynchronous applications, the Error is often propagated up to a thread in some thread pool, which upon termination is peacefully returned to the pool. Unless some layer of the application catches Throwables and logs the error, it will be invisible to the end users and application maintainers. On the other hand, Reactor sinks/subscriptions will be left in a "hanging" state -- neither working, nor terminated -- leading to hard to catch and hard to debug situations.

The specific scenario that caused me to look into this was a minor library version incompatibility that resulted in java.lang.NoSuchMethodError being thrown in some stream mapping logic. While unrecoverable from both Java perspective (because it's an Error) and from Reactor throwIfJvmFatal perspective (because it's a LinkageErrror), this type of issue tends to be isolated to a particular code path and perhaps does not need to lead to application shutdown. At minimum, it would be good to have consistent logging of unrecoverable errors.

Desired solution

In order of the amount of work involved.

Logging fatal errors as part of Exceptions.throwIfJvmFatal.
While Reactor is doing the right thing by not turning thrown Error into stream errors, it would be good if the downstream code could recover if desired. This would require that FluxSink is left in a state in which it's capable of terminating -- in the reactorFluxSinkUnrecoverableError() test below, it would mean sink.error(e); results in the proper stream termination (although the exception passed in to the sink probably needs to be wrapped in something that won't re-trigger throwIfJvmFatal behavior).

Considered alternatives

Leave everything as is; not handling Error subclasses works as intended. While true, it does lead to developers spending a lot of time troubleshooting -- see related issues. And that's only the people who did figure out what was going wrong! It's a valid option though.

Additional context

The following two tests demonstrate the difference in behavior between normal exceptions being handled and converted to stream errors, and unrecoverable errors leading to a hanging stream.

  // this test passes
 @Test
  void reactorFluxSinkRecoverableException() {

    Flux<String> testFlux = Flux.create(sink -> {
      sink.onCancel(() -> System.out.println("*** canceled"));
      sink.onDispose(() -> System.out.println("*** disposed"));
      try {
        sink.next("some value");
        sink.complete();
      } catch(Exception e) {
        // This point is never reached because Reactor correctly converts exception into stream error.
        System.out.println("Exception in `next` processing caught, handled: " + e.getMessage());
        sink.error(e);
      }

    }).map(value -> {
      System.out.println("received value: " + value);
      throw new RuntimeException("normal exception");
    });

    StepVerifier.create(testFlux)
            .verifyErrorMessage("normal exception");
  }

  // this test hangs
  @Test
  void reactorFluxSinkUnrecoverableError() {

    Flux<String> testFlux = Flux.create(sink -> {
      sink.onCancel(() -> System.out.println("*** canceled"));
      sink.onDispose(() -> System.out.println("*** disposed"));
      try {
        sink.next("some value");
        sink.complete();
      } catch(Throwable e) {
        System.out.println("Throwable in `next` processing caught, handled: " + e.getMessage());
        // this will have no effect; a terminal state is no longer reachable for the stream
        sink.error(e);
      }
    }).map(value -> {
      System.out.println("received value: " + value);
      throw new LinkageError("unrecoverable error");
    });

    // This stream will hang.
    StepVerifier.create(testFlux)
            .verifyErrorMessage("unrecoverable error");
  }

The text was updated successfully, but these errors were encountered:

simonbasle · 2022-07-13T09:41:20Z

I think the logging of fatal errors would be the most straightforward solution. would it be enough of a step forward @elefeint ?

wrapping the fatal exception isn't something that can currently easily be implemented without painstakingly exploring and refactoring the whole call tree of throwIfJvmFatal.

OlegDokuka · 2022-07-13T12:42:43Z

I think the logging of fatal errors would be the most straightforward solution. would it be enough of a step forward @elefeint ?

@simonbasle I don't think we should and can do anything here but logging. Another alternative can be adding something like Exceptions.isFatal(throwable) so one can see if exceptions can be propagated downstream or not but I doubt it can be useful.

@elefeint, it is worth mentioning that FluxSink reactor java doc clearly states on next|complete|error:

Might throw an unchecked exception in case of a fatal error downstream which cannot be propagated to any asynchronous handler (aka a bubbling exception).

which means that FluxSink#next|error|complete returns normally except for a fatal error which should be equal to a kernel panic for the reactive chain (the whole chain is dead now).

Also, it is worth mentioning that a fatal error may mean internal state corruption (because they can be thrown unexpectedly) so further communication with the chain could be impossible.

Thus, wrapping sink calls makes no sense at all, especially calling sink.error in the catch section (assume if sink.next|complete|error is thrown, you can do nothing but log)

elefeint · 2022-07-13T13:20:51Z

I agree that logging has the highest benefit-to-effort. It would absolutely solve the biggest problem of people not knowing where to start looking for a problem.

For the second part -- allowing sinks to recover, the amount of work and payoff are both less clear. This behavior is definitely documented in sink.error() and in Exceptions, but making use of the documentation requires already knowing what went wrong, which is not possible without debugging currently.

While it's true that errors like OOM are likely to leave the application in a bad state regardless of what reactor sinks do, right now the throwIfJvmFatal consequences lead to an unrecoverable state even if the rest of the application is okay. I could not figure out what it is that prevents any future calls to FluxSink.error() orFluxSink.complete() from taking effect, so I have no idea how much work changing that would be.

OlegDokuka · 2022-07-13T16:23:59Z

While it's true that errors like OOM are likely to leave the application in a bad state regardless of what reactor sinks do, right now the throwIfJvmFatal consequences lead to an unrecoverable state even if the rest of the application is okay. I could not figure out what it is that prevents any future calls to FluxSink.error() orFluxSink.complete() from taking effect, so I have no idea how much work changing that would be.

It is not even related to the application state, it is related to the internal state-machine implementation of the project reactor. We do not assume that everything should blow at any moment, we assume the majority of method calls that we do returns normally. If an error like OOM happens spuriously, we just bubble it up, and eventually, it is handled at some point (so can be logged). However, there is no guarantee that the place from which the error was propagated is also handled exception properly, etc. All that leads to the fact that some volatile or non-volatile states that we maintain may be corrupted because of that. After all, it may end up in the impossibility to progress further at all (e.g. sending other signals).

simonbasle · 2022-07-18T12:41:10Z

I'll take that one and add logging to Exceptions.throwIfFatal

This commit improves the discoverability of exceptions that are "fatal" to Reactor by: - adding `Exceptions.isFatal` and `Exceptions.isJvmFatal` methods - adding logging to `Exceptions.throwIfFatal` and `Exceptions.throwIfJvmFatal` just before a fatal exception is thrown This should at least help pinpointing such occurrences from the logs, in cases where the thrown exception bubble all the way up the stack of a thread with no uncaughtExceptionHandler for example. Fixes #3111.

schananas · 2022-07-28T11:08:59Z

It would be helpful to add more control over fatal exceptions. In my case, a large object was attempted to be loaded into memory which caused of memory exception, which was not fatal for application in general but only for a particular request that was too large. We didn't notice any errors, nothing was in the log and Flux just kept hanging forever, because throwIfFatal was thrown before a cancelation of the subscription. It was really hard to find the cause of why Flux was hanging forever.

Since we consider that OutOfMemoryException is not fatal for us, now we need to find all the places in the code where we could potentially load a large object in memory, try-catch, and in case of OutOfMemoryException to convert to a nonfatal version of exception.

Instead of just adding a log, I would propose to add a Hooks.doOnFatalException(...), where users could override fatal logic, like mapping an exception to non fatal version.

simonbasle · 2022-07-29T08:06:14Z

@schananas if you have pieces of code that are susceptible to OOM but you're confident you can recover from these, then you are the best suited to implement that. I'd argue only you can take the responsibility to wrap such code in a try-catch block, because Reactor certainly can't identify these specific occurrences. I doubt even you can, but that's another story (see this informative stackoverflow answer for some context).

with the introduction of isFatal is the associated PR, it could open up a path of hooking a Predicate but as explained above that Predicate would apply too broadly (filter by exception Class, message...). Nothing better than user knowledge to pinpoint "I'm about to decode a potentially huge object, let's try/catch". We won't be pursuing that path for now.

This commit improves the discoverability of exceptions that are "fatal" to Reactor by: - adding `Exceptions.isFatal` and `Exceptions.isJvmFatal` methods - adding logging to `Exceptions.throwIfFatal` and `Exceptions.throwIfJvmFatal` just before a fatal exception is thrown This should at least help pinpointing such occurrences from the logs, in cases where the thrown exception bubble all the way up the stack of a thread with no uncaughtExceptionHandler for example. Fixes #3111.

reactorbot added the ❓need-triage This issue needs triage, hasn't been looked at by a team member yet label Jul 8, 2022

simonbasle added type/enhancement A general enhancement status/need-decision This needs a decision from the team and removed ❓need-triage This issue needs triage, hasn't been looked at by a team member yet labels Jul 13, 2022

simonbasle self-assigned this Jul 18, 2022

simonbasle added this to the 3.4.22 milestone Jul 19, 2022

simonbasle removed the status/need-decision This needs a decision from the team label Jul 19, 2022

simonbasle changed the title ~~Allow FluxSink to terminate even for Throwable/Error~~ Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions Jul 21, 2022

simonbasle mentioned this issue Jul 21, 2022

Add logging in Exceptions.throwIf[Jvm]Fatal, add isFatal methods #3122

Merged

simonbasle linked a pull request Jul 26, 2022 that will close this issue

Add logging in Exceptions.throwIf[Jvm]Fatal, add isFatal methods #3122

Merged

simonbasle closed this as completed in 50bb983 Aug 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions #3111

Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions #3111

elefeint commented Jul 8, 2022

simonbasle commented Jul 13, 2022

OlegDokuka commented Jul 13, 2022 •

edited

elefeint commented Jul 13, 2022

OlegDokuka commented Jul 13, 2022 •

edited

simonbasle commented Jul 18, 2022

schananas commented Jul 28, 2022

simonbasle commented Jul 29, 2022

Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions #3111

Exceptions.throwIfFatal / throwIfJvmFatal should log fatal exceptions #3111

Comments

elefeint commented Jul 8, 2022

Motivation

Desired solution

Considered alternatives

Additional context

simonbasle commented Jul 13, 2022

OlegDokuka commented Jul 13, 2022 • edited

elefeint commented Jul 13, 2022

OlegDokuka commented Jul 13, 2022 • edited

simonbasle commented Jul 18, 2022

schananas commented Jul 28, 2022

simonbasle commented Jul 29, 2022

OlegDokuka commented Jul 13, 2022 •

edited

OlegDokuka commented Jul 13, 2022 •

edited