Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Supervision.restart for SubFlow's #981

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mdedetrich
Copy link
Contributor

@mdedetrich mdedetrich commented Jan 19, 2024

Provide support for restarting subflow.

@mdedetrich mdedetrich marked this pull request as draft January 19, 2024 09:54
@mdedetrich mdedetrich force-pushed the support-supervision-restart-for-subflow branch 6 times, most recently from 0f472e9 to e555122 Compare January 19, 2024 10:26
@mdedetrich mdedetrich added this to the 1.1.0 milestone Jan 23, 2024
} else {
// Start draining
if (!hasBeenPulled(in)) pull(in)
decider(cause) match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cause can be NoMoreElementsNeeded | StageWasCompleted should test that first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this being tested in a different way later down with isClosed/hasBeenPulled? Also I had a look at parts of the Pekko codebase that implements Supervision.restart and I don't see NoMoreElementsNeeded | StageWasCompleted being tested in this way.

This doesn't mean its wrong, but if thats the case then testing NoMoreElementsNeeded | StageWasCompleted would be unique to SubFlow and I am not sure why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NoMoreElementsNeeded | StageWasCompleted indicate the downstream cancel without exception, then that should not go through the supervision decider ? @raboof @jrudolph may have better understanding than me , ping~

@mdedetrich mdedetrich force-pushed the support-supervision-restart-for-subflow branch from e555122 to ec6e2d5 Compare March 18, 2024 08:24
@mdedetrich
Copy link
Contributor Author

@He-Pin @jxnu-liguobin @Roiocam @raboof

So I just noticed that for SubFlow's in general the supervision strategy doesn't properly propagate (see comment at https://github.com/apache/incubator-pekko/blob/c44c0b7cbdab11d85176cfe062288fdcba16c56a/stream-tests/src/test/scala/org/apache/pekko/stream/scaladsl/FlowSplitAfterSpec.scala#L213) which mean that this PR isn't going to work until we solve the fundamental underlying issue.

Due to this I will set the PR to blocked and make another issue on this point

@mdedetrich mdedetrich added the blocked Blocked for some reason label Mar 18, 2024
@pjfanning
Copy link
Contributor

I'm planning to cut an RC for 1.1.0-M1. This will likely not be in it - but can be included in 1.1.0-M2.

@mdedetrich
Copy link
Contributor Author

I'm planning to cut an RC for 1.1.0-M1. This will likely not be in it - but can be included in 1.1.0-M2.

There is an actual bug with the functionality introduced in #252, i.e. Supervision.resumeStrategy doesn't actually resume when an exception is thrown.

This needs to be solved for 1.1.0, ideally in -M1 but I don't know when you wanted to make the actual release (I guess after incubation is fully passed?)

@mdedetrich
Copy link
Contributor Author

Bug made at #1205

@pjfanning
Copy link
Contributor

is there a chance that we could revert #252 and try a full change for 1.1.0-M2?

@mdedetrich
Copy link
Contributor Author

is there a chance that we could revert #252 and try a full change for 1.1.0-M2?

We are actually using this feature in production so I would try and avoid that. If fixing the bug ends up taking too long I would propose we fix it in 1.1.0-M2 since no one right now is actually relying on Supervision.resumeStrategy catching and resuming exceptions

@pjfanning
Copy link
Contributor

I'm only to delaying the 1.1.0-M1-RC1 until next week if this a reasonable chance that this can be addressed soon.

@mdedetrich
Copy link
Contributor Author

I'm only to delaying the 1.1.0-M1-RC1 until next week if this a reasonable chance that this can be addressed soon.

Agreed

@mdedetrich mdedetrich removed the blocked Blocked for some reason label Mar 19, 2024
@mdedetrich mdedetrich force-pushed the support-supervision-restart-for-subflow branch 4 times, most recently from 7a81112 to ab934dc Compare March 19, 2024 16:18
@mdedetrich mdedetrich force-pushed the support-supervision-restart-for-subflow branch from ab934dc to 186f1dc Compare March 19, 2024 16:18
@mdedetrich
Copy link
Contributor Author

So I had a look at this after it was unblocked and it does seem to be harder than normal, this is because a new SubstreamHandler is created from the original input port which makes things complicated if we have already handed over to the new SubStreamHandler because we have crossed the boundary of the split decision. I think to solve this properly we need to persist already retrieved elements.

Furthermore its not just Split we have to support but also Group (the other operator that creates SubFlow). For this reason and in the interest of also reducing the number of milestones ill drop this unless I happen to be missing something.

@pjfanning @He-Pin @Roiocam wdyt?

@mdedetrich mdedetrich removed this from the 1.1.0-M1 milestone Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants