
performance of workers limited by downlink bandwidth #20

Open
nponeccop opened this issue Dec 12, 2015 · 3 comments

@nponeccop
Collaborator

Imagine that there is one heavy worker, i.e. it consumes so many resources that it's not practical to run more than one worker.

In this case it is still beneficial to grab more than one job to fully utilize the connection. E.g. if one job (JOB_ASSIGN packet) is 10 KB long, then on a 10 Mbit/s connection with 25 ms latency there should be 10 * 1024 * 1024 * 0.025 / (8 * 10 * 1024) = 4 packets in flight (after rounding 3.2 up).
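
As a quick sanity check, the same bandwidth-delay arithmetic in TypeScript (numbers taken from the example above; this is only an illustration, not code from the library):

// Bandwidth-delay product for the example link (illustration only).
const linkBitsPerSec = 10 * 1024 * 1024;  // 10 Mbit/s
const rttSec = 0.025;                     // 25 ms latency
const jobBytes = 10 * 1024;               // ~10 KB JOB_ASSIGN packet

const bitsInFlight = linkBitsPerSec * rttSec;                  // 262144 bits
const jobsInFlight = Math.ceil(bitsInFlight / (8 * jobBytes)); // 3.2 rounded up to 4
console.log(jobsInFlight); // 4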

My proposal is to have another control for the job count: maxJobs controls how many jobs are executed concurrently, and maxExtraJobsInFlight (or a shorter name) controls, well, the extra jobs in flight.

So in the example situation mentioned above, we would have maxJobs = 1; maxExtraJobsInFlight = 4.
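
For illustration only, the worker options for that situation might look like this (maxJobs already exists, as described above; maxExtraJobsInFlight is the proposed, still hypothetical option, and the object shape is just a sketch, not the real API):

// Hypothetical worker options for the heavy-worker scenario (sketch only).
const workerOptions = {
    maxJobs: 1,               // the job is heavy, so execute only one at a time
    maxExtraJobsInFlight: 4,  // keep ~4 extra JOB_ASSIGNs in flight to fill the 10 Mbit/s x 25 ms pipe
};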

@iarna
Owner

iarna commented May 12, 2016

This seems like a reasonable addition to me.

@nponeccop
Collaborator Author

nponeccop commented May 12, 2016

The code is already there in https://github.com/streamcode9/abraxas/commit/fbb7be1e0f075a7257115432bead5597efe1e6a3 (see this._queue). I'm testing it now against both the Abraxas server and upstream Gearman, and later I will split it into more reasonable separate changes.

nponeccop pushed a commit to nponeccop/abraxas that referenced this issue May 15, 2016
@nponeccop
Collaborator Author

I cleaned up the changes and pushed them to nponeccop/master. I added a maxQueued option. As a separate change, I send PRE_SLEEP only when the link is idle, namely when we don't have an outstanding GRAB_JOB.

Unfortunately, while the current implementation shows excellent performance over WAN (I got about 3 ms of overhead per job on batches of 1000 jobs over a bad 200 ms link), it breaks support for multiple servers.

The whole algorithm/protocol looks like this (in pseudocode; <- is a packet sent to the server, -> is a packet received from it or a local worker event):

invariant active + grabbing + queued <= max_active + max_queued
invariant active < max_active implies queued == 0
invariant active >= 0 && active <= max_active
invariant queued >= 0 && queued <= max_queued
invariant grabbing >= 0

max_active = ...
max_queued = ...
grabbing = 0
active = 0
queued = 0

run():
    dequeueJob()
    active++
    queued--
grab():
    <- GRAB_JOB
    grabbing++
start():
    <- CAN_DO test_tube
    <- PRE_SLEEP
-> NO_JOB:
    grabbing--
    if grabbing == 0
        <- PRE_SLEEP
-> NOOP:
    while (active + grabbing + queued < max_active + max_queued)
        grab()
-> JOB_ASSIGN:
    grabbing--
    queueJob()
    queued++
    if active < max_active
        run()
-> workComplete:
    active--
    if (queued > 0)
        run()
        assert active == max_active
    grab()
    <- WORK_COMPLETE
-> workData:
    <- WORK_DATA

It's rather complicated already, and with speculative GRAB_JOBs sent to multiple servers it will be even worse.
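
For concreteness, here is a minimal TypeScript sketch of the same counter bookkeeping for a single server (class and method names are made up for illustration; send() and execute() stand in for the real abraxas packet and job plumbing):

// Minimal sketch of the counter bookkeeping above, for a single server.
// send() and execute() are placeholders, not the real abraxas internals.
class GrabScheduler {
    private active = 0;    // jobs currently executing
    private queued = 0;    // JOB_ASSIGNs received but not yet started
    private grabbing = 0;  // GRAB_JOBs sent that have no reply yet
    private jobs: unknown[] = [];

    constructor(
        private maxActive: number,
        private maxQueued: number,
        private send: (packet: string) => void,
        private execute: (job: unknown, done: () => void) => void,
    ) {}

    start() {                        // start(): register and go to sleep
        this.send('CAN_DO test_tube');
        this.send('PRE_SLEEP');
    }
    private grab() {                 // grab(): ask the server for one more job
        this.send('GRAB_JOB');
        this.grabbing++;
    }
    private run() {                  // run(): move a job from the queue into execution
        const job = this.jobs.shift();
        this.queued--;
        this.active++;
        this.execute(job, () => this.onWorkComplete());
    }
    onNoJob() {                      // -> NO_JOB: sleep only when the link is idle
        this.grabbing--;
        if (this.grabbing === 0) this.send('PRE_SLEEP');
    }
    onNoop() {                       // -> NOOP: refill the pipeline up to the limits
        while (this.active + this.grabbing + this.queued < this.maxActive + this.maxQueued) {
            this.grab();
        }
    }
    onJobAssign(job: unknown) {      // -> JOB_ASSIGN: queue it, run it if we have capacity
        this.grabbing--;
        this.jobs.push(job);
        this.queued++;
        if (this.active < this.maxActive) this.run();
    }
    private onWorkComplete() {       // -> workComplete: free a slot, backfill, grab again
        this.active--;
        if (this.queued > 0) this.run();
        this.grab();
        this.send('WORK_COMPLETE');
    }
}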
