Implementing graceful shutdown #46

amv · 2016-12-23T12:09:44Z

I am trying to implement a graceful shutdown for my workers, meaning that when I receive a signal (or otherwise decide so), all of my workers should continue doing their job, but no new jobs should be accepted.

My current implementation is to call client.forgetAllWorkers() when receiving the signal, and then waiting for all of the jobs to do their thing before exiting. I only call client.disconnect() some seconds after the last job has finished executing (and should have delivered it's payload).

Unfortunately at least with the C version of the gearmand server, the results of these jobs that finish after the client.forgetAllWorkers() never reach the clients who have submitted the jobs.

So either the gearmand stops passing the results on after unregistering the work, or abraxas fails to deliver the results after client.forgetAllWorkers() has been called.

In the former case I would need a way to tell abraxas to stop grabbing any future jobs even though the functions are still registered.. And in the latter case I would love to have a fix for this on abraxas side..

Would anyone happen to come up with a simple way to test which of these theories (or some alternate one) is the case?

The text was updated successfully, but these errors were encountered:

amv · 2016-12-23T13:22:12Z

I did some more test and it seems like looping through all of the registered functions and calling client.unregisterWorker( fn ) for each of them actually gets me the result I wanted.

For some reason client.forgetAllWorkers() has a different effect, which causes the results not ending up to the caller.

Is there a reason why forgetAllWorkers() works in a different way than calling unregisterWorker( fn ) for all registered functions? If so, how should this be documented?

nponeccop · 2016-12-23T13:31:08Z

You can use gearman-packet library to implement a simple low-level client to test whether "the gearmand stops passing the results on after unregistering the work".

Note that on protocol level graceful shutdown should be implemented by avoiding sending GRAB_JOB and PRE_SLEEP packets, not by sending CANT_DO packet.

Also you can review the code/detailed logs to see if abraxas sends CANT_DO when you call forgetAllWorkers. If it does we probably need a separate client.gracefulShutdown. I'm against something more low-level like pauseAllWorkers/resumeAllWorkers unless somebody actually needs it. Designing features you don't plan to use leads to bad design IMO.

Note that we do need to support one-shot workers, that is graceful shutdown after receiving the first job. See @amv comment on #19

Upd; Ahaha I didn't see that it's you.

nponeccop · 2016-12-23T13:45:05Z

As for unregisterWorker vs forgetAllWorkers, they are both defined in worker.js

unregisterWorker sends CANT_DO (which seems safe from your experience so my PRE_SLEEP hypothesis above is wrong). While forgetAllWorkers sends RESET_ABILITIES. So you may wish to test the CANT_DO vs RESET_ABILITIES hypothesis. From the protocol spec RESET_ABILITIES is the same as calling many CANT_DO in a row, so there should be no difference.

amv · 2016-12-23T13:58:35Z

I did a quick test by modifying the forgetAllWorkers() and got it working properly by changing the current implementation:

Worker.forgetAllWorkers = function () {
    if (! this._workersCount) return;
    this._workers = {};
    this._workersCount = 0;
    clearInterval(this.keepAlive);
    this.getConnectedServers().forEach(function(conn) {
        conn.socket.unhandleJobAssign();
        conn.socket.resetAbilities();
    });
}

To a different implementation:

Worker.forgetAllWorkers = function () {
    if (! this._workersCount) return;
    var funcs = Object.keys(this._workers);
    this._workers = {};
    this._workersCount = 0;
    clearInterval(this.keepAlive);
    this.getConnectedServers().forEach(function(conn) {
        conn.socket.unhandleJobAssign();
        funcs.forEach( function ( func ) {
            conn.socket.cantDo( func );
        } )
    });
}

From this test I would guess that it is the resetAbilities() which for some reason blocks the responses from getting to the requesting clients. I believe this means this is a bug in the C gearmand RESET_ABILITIES handler, which does too much cleaning.

nponeccop · 2016-12-23T14:01:39Z

Can you test with "our" abraxas-based Javascript server?

nponeccop · 2016-12-23T14:09:10Z

Both CANT_DO and RESET_ABILITIES in the C gearman are in libgearman-server/server.cc and I don't see any extra cleanup.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementing graceful shutdown #46

Implementing graceful shutdown #46

amv commented Dec 23, 2016

amv commented Dec 23, 2016

nponeccop commented Dec 23, 2016 •

edited

nponeccop commented Dec 23, 2016

amv commented Dec 23, 2016

nponeccop commented Dec 23, 2016

nponeccop commented Dec 23, 2016

Implementing graceful shutdown #46

Implementing graceful shutdown #46

Comments

amv commented Dec 23, 2016

amv commented Dec 23, 2016

nponeccop commented Dec 23, 2016 • edited

nponeccop commented Dec 23, 2016

amv commented Dec 23, 2016

nponeccop commented Dec 23, 2016

nponeccop commented Dec 23, 2016

nponeccop commented Dec 23, 2016 •

edited