
test_helper requests stop the server before logging can finish when the logger uses async methods #242

Open
kybishop opened this issue May 11, 2013 · 8 comments


@kybishop
Contributor

I have a server that uses Goliath::Request.log_block to do some post-response work. In that log block I eventually make redis calls with redis-rb (synchrony driver). The log block stops processing as soon as it fires off the request to Redis.

Here is what I believe is happening:

  1. The log block yields to the reactor after making the async call
  2. The reactor defers to the test_helper request, which then stops the eventmachine, regardless of the log block's ongoing request

I've come up with this ugly hack to get the log block to finish, but I despise its implications:

with_api(RtbNicerApi) { get_request(request.dup) { EM::Synchrony.sleep 0.1 } }

This forces the callback to wait 1/10 of a second before terminating the eventmachine, giving the log block just enough time to finish, but damn is it gross (a sleep in a test, and it depends on the log block finishing in under 0.1 seconds!)

Any ideas on how to check that logging has finished before terminating the eventmachine?

@nolman
Member

nolman commented May 12, 2013

Ya, I believe you are right as to what is happening, here's a gist reproducing it:
https://gist.github.com/nolman/5564680

+1 to that implementation feeling icky. I can't think of a clean way to do it, but I may be missing an obvious EventMachine handle. The only method I can think of is http://eventmachine.rubyforge.org/EventMachine.html#connection_count-class_method, which I am not sure we would want to use. I would have hoped that stop_event_loop would wait for those callbacks to complete, but it does not (it just calls EM::stop).

In the past I have dealt with this kind of keep-the-reactor-running code by doing something like this:
https://gist.github.com/nolman/5564693
It basically makes the happy path fast but the failure path slow, since it waits for some condition to be met. I wasn't super happy with it, but happier than I was with the sleep. My $0.02.

I believe the right way to fix this would be a method in EventMachine that stops the reactor from accepting new connections and bleeds the existing callbacks down to 0, with an optional timeout.
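The wait-for-a-condition approach from the gist above can be sketched in plain Ruby. `FakeReactor` is a stand-in so the pattern runs without EventMachine; in real code the `next_tick` calls would be `EM.next_tick` and the clean-stop callback would call `EM.stop`. All names here are illustrative, not Goliath's API.

```ruby
# A toy tick queue standing in for the EventMachine reactor.
class FakeReactor
  def initialize
    @queue = []
  end

  def next_tick(&blk)
    @queue << blk
  end

  def run
    @queue.shift.call until @queue.empty?
  end
end

# Re-check `condition` once per tick. Call `on_done` as soon as it holds, or
# `on_timeout` after `max_ticks` attempts, so a buggy log block fails the
# test quickly instead of hanging it forever.
def wait_until(reactor, condition, max_ticks, on_done, on_timeout)
  ticks = 0
  check = lambda do
    if condition.call
      on_done.call
    elsif (ticks += 1) < max_ticks
      reactor.next_tick(&check)
    else
      on_timeout.call
    end
  end
  reactor.next_tick(&check)
end

reactor  = FakeReactor.new
log_done = false
result   = nil
reactor.next_tick { log_done = true }         # the log block finishing on a later tick
wait_until(reactor, -> { log_done }, 50,
           -> { result = :stopped_cleanly },  # this is where EM.stop would go
           -> { result = :timed_out })
reactor.run
```

This keeps the happy path fast (one extra tick) while bounding the failure case, instead of always paying a fixed sleep.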

@kybishop
Contributor Author

EDIT: I deleted my earlier comments on EM.connection_count as I understand it a little better now, and think it might even be a reasonable way of checking if EM is actually done serving the request, but I would like someone with more eventmachine experience to chime in. I'm trying to dig into why EM.connection_count == 1 when you call with_api, and why it jumps to 3 when called from within get_request, even when no async methods are called in the API. If I can establish a reliable pattern, e.g. EM.connection_count == 3 after a request truly finishes, I think EM.connection_count is a worthwhile approach.

I'll dig into eventmachine to see if I can come up with anything generalized; any words of advice greatly appreciated. Send a rescue party if I don't return in a few days :)

@kybishop
Contributor Author

  1. The first connection comes from starting the server itself
  2. The second comes from establishing the connection for the test_helper_request
  3. The third comes from something having to do with the callback of the em-http-request, though I don't fully understand this
EDIT: I believe the third is actually from the EM server receiving the request made to itself (one connection from the sender, one from the receiver)

Seeing how @igrigorik is the author of em-http-request 😃, perhaps he could chime in on the third, and on whether there would be any gotchas to using EM.connection_count to see if the test request is truly finished. If my understanding is correct, EM.connection_count should indeed equal 3 when the request is truly finished, unless I'm missing some way connections could be established.

I'll happily whip up a pull request if it seems reasonable to wait on EM.connection_count == 3 before calling EM.stop_server, implementing some form of user-alterable maximum wait time with a low default.
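The proposed check boils down to a small predicate. In this sketch `count` is injected in place of `EM.connection_count` so the logic can be shown standalone; the baseline of 3 is the value observed above, and (as discussed below) it would need to hold across setups for this to be safe.

```ruby
# Hypothetical helper: the request is considered finished once the reactor's
# connection count has fallen back to the idle baseline (3 connections were
# observed above: server, test_helper request, and the server's own accept).
def request_truly_finished?(count, baseline: 3)
  count == baseline
end

request_truly_finished?(5)  # still has in-flight connections
request_truly_finished?(3)  # back to baseline, safe to stop the reactor
```

In test_helper this predicate would be polled (e.g. from next_tick) with `EM.connection_count`, with some user-alterable maximum wait before giving up.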

@nolman
Member

nolman commented May 13, 2013

I wonder if testing a goliath server over https will start more connections? I suspect it will. I think connection_count can be really dangerous to rely on, but I may very well be wrong. I am just not sure that the number will always be 3, and it does seem a bit scary to me to add this code into goliath (even if only in test_helper). Curious what @dj2 and @igrigorik think.

Having played with netty a bit, their default behavior when shutting down the reactor is to block until the descriptors have all been closed. It sucks if you have a bug in your code and you are leaking descriptors (but it is arguably worse if that bug makes it to production). I do think this is the right approach: either adding a new method or changing stop_event_loop to first wait for all in-flight requests to complete.

I feel like this is something that should be fixed in eventmachine but I don't believe it would be a simple fix.

@kybishop
Contributor Author

Ah crap, the use of connection pools in the log block results in varied EM.connection_counts. I'm not quite sure what the options are from here; hopefully others will chime in. I'll keep looking for a solution in the meantime.

@dj2
Contributor

dj2 commented May 13, 2013

I'd have to agree with @nolman, it sounds like the right solution is to fix this in EM itself. Possibly by adding an exit_when_all_connections_closed, or something similar?

@kybishop kybishop reopened this May 13, 2013
@kybishop
Contributor Author

Sorry, didn't mean to close and re-open. My googling shows that many people opt to keep track of the connections themselves. I'm going to go with this solution in the short term, but I see it as a hack-job that doesn't address the true issue in test_helper/eventmachine.

I'd like to contribute back to the community, so I'll dig into eventmachine in my free time to see if I can come up with a graceful shutdown. I'm hoping eventmachine keeps track of registered callbacks and I can go from there. I can see how a lot of issues could arise from simply closing open connections, as that could stop internal requests mid-flight. We'd need to somehow allow current requests to keep processing while denying new ones.
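The "track the connections yourself" workaround mentioned above can be sketched as a small counter. The module and hook names here are hypothetical: in a Goliath app you would call `track` before kicking off async work in the log block, `untrack` in its completion callback, and poll `drained?` before calling EM.stop.

```ruby
# Minimal in-flight work tracker (plain Ruby; the Mutex keeps the counter
# safe even if callbacks fire from multiple threads).
module InFlight
  @count = 0
  @mutex = Mutex.new

  def self.track
    @mutex.synchronize { @count += 1 }
  end

  def self.untrack
    @mutex.synchronize { @count -= 1 }
  end

  def self.drained?
    @mutex.synchronize { @count.zero? }
  end
end

InFlight.track             # log block starts its redis call
busy = !InFlight.drained?  # not yet safe to stop the reactor
InFlight.untrack           # redis callback completed
safe = InFlight.drained?   # all tracked work done, safe to stop
```

Unlike EM.connection_count, this only counts work the app explicitly registers, so connection pools don't skew the number, at the cost of having to instrument every async call site.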

@kybishop
Contributor Author

Hey all, I thought of a rather simple workaround that doesn't require messing with the internals of eventmachine:

What if the test_helper's TestHelper#server wrapped Goliath::Request.log_block with a parent proc? The parent proc could set a variable in the test_helper class when the request is truly finished. From there, the callback could just call a #wait_until_log_block_finishes method, which invokes itself in a next_tick block until the log block is finished (or some maximum amount of ticks/time elapses).

I propose this as a solution since it seems rather difficult to get eventmachine to perform a graceful shutdown.
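The parent-proc wrapping itself is just a few lines. In this standalone sketch a local proc stands in for Goliath::Request.log_block so the idea can be shown without a running server, and `finished` plays the role of the flag that #wait_until_log_block_finishes would poll from next_tick; all names besides log_block are assumptions, not Goliath's API.

```ruby
# Wrap the app's log block so the test helper knows when it truly finished.
finished       = false
user_log_block = proc { |env| :user_logging_done }  # stands in for the app's log block

wrapped_log_block = proc do |env|
  result   = user_log_block.call(env)  # run the original log block to completion
  finished = true                      # signal the test helper that logging is done
  result                               # preserve the original return value
end

outcome = wrapped_log_block.call({})
# the helper's next_tick wait loop would now see `finished` and safely call EM.stop
```

Because the wrapper only flips a flag after delegating, the user's log block behaves exactly as before; the helper just gains a reliable completion signal to wait on.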
