Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing graceful shutdown #46

Open
amv opened this issue Dec 23, 2016 · 6 comments
Open

Implementing graceful shutdown #46

amv opened this issue Dec 23, 2016 · 6 comments

Comments

@amv
Copy link
Collaborator

amv commented Dec 23, 2016

I am trying to implement a graceful shutdown for my workers, meaning that when I receive a signal (or otherwise decide so), all of my workers should continue doing their job, but no new jobs should be accepted.

My current implementation is to call client.forgetAllWorkers() when receiving the signal, and then waiting for all of the jobs to do their thing before exiting. I only call client.disconnect() some seconds after the last job has finished executing (and should have delivered it's payload).

Unfortunately at least with the C version of the gearmand server, the results of these jobs that finish after the client.forgetAllWorkers() never reach the clients who have submitted the jobs.

So either the gearmand stops passing the results on after unregistering the work, or abraxas fails to deliver the results after client.forgetAllWorkers() has been called.

In the former case I would need a way to tell abraxas to stop grabbing any future jobs even though the functions are still registered.. And in the latter case I would love to have a fix for this on abraxas side..

Would anyone happen to come up with a simple way to test which of these theories (or some alternate one) is the case?

@amv
Copy link
Collaborator Author

amv commented Dec 23, 2016

I did some more test and it seems like looping through all of the registered functions and calling client.unregisterWorker( fn ) for each of them actually gets me the result I wanted.

For some reason client.forgetAllWorkers() has a different effect, which causes the results not ending up to the caller.

Is there a reason why forgetAllWorkers() works in a different way than calling unregisterWorker( fn ) for all registered functions? If so, how should this be documented?

@nponeccop
Copy link
Collaborator

nponeccop commented Dec 23, 2016

You can use gearman-packet library to implement a simple low-level client to test whether "the gearmand stops passing the results on after unregistering the work".

Note that on protocol level graceful shutdown should be implemented by avoiding sending GRAB_JOB and PRE_SLEEP packets, not by sending CANT_DO packet.

Also you can review the code/detailed logs to see if abraxas sends CANT_DO when you call forgetAllWorkers. If it does we probably need a separate client.gracefulShutdown. I'm against something more low-level like pauseAllWorkers/resumeAllWorkers unless somebody actually needs it. Designing features you don't plan to use leads to bad design IMO.

Note that we do need to support one-shot workers, that is graceful shutdown after receiving the first job. See @amv comment on #19

Upd; Ahaha I didn't see that it's you.

@nponeccop
Copy link
Collaborator

As for unregisterWorker vs forgetAllWorkers, they are both defined in worker.js

unregisterWorker sends CANT_DO (which seems safe from your experience so my PRE_SLEEP hypothesis above is wrong). While forgetAllWorkers sends RESET_ABILITIES. So you may wish to test the CANT_DO vs RESET_ABILITIES hypothesis. From the protocol spec RESET_ABILITIES is the same as calling many CANT_DO in a row, so there should be no difference.

@amv
Copy link
Collaborator Author

amv commented Dec 23, 2016

I did a quick test by modifying the forgetAllWorkers() and got it working properly by changing the current implementation:

Worker.forgetAllWorkers = function () {
    if (! this._workersCount) return;
    this._workers = {};
    this._workersCount = 0;
    clearInterval(this.keepAlive);
    this.getConnectedServers().forEach(function(conn) {
        conn.socket.unhandleJobAssign();
        conn.socket.resetAbilities();
    });
}

To a different implementation:

Worker.forgetAllWorkers = function () {
    if (! this._workersCount) return;
    var funcs = Object.keys(this._workers);
    this._workers = {};
    this._workersCount = 0;
    clearInterval(this.keepAlive);
    this.getConnectedServers().forEach(function(conn) {
        conn.socket.unhandleJobAssign();
        funcs.forEach( function ( func ) {
            conn.socket.cantDo( func );
        } )
    });
}

From this test I would guess that it is the resetAbilities() which for some reason blocks the responses from getting to the requesting clients. I believe this means this is a bug in the C gearmand RESET_ABILITIES handler, which does too much cleaning.

@nponeccop
Copy link
Collaborator

Can you test with "our" abraxas-based Javascript server?

@nponeccop
Copy link
Collaborator

Both CANT_DO and RESET_ABILITIES in the C gearman are in libgearman-server/server.cc and I don't see any extra cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants