
ISSUE-159 VegasLimit for batch processing #161

Open
ericzhifengchen wants to merge 1 commit into main
Conversation

@ericzhifengchen commented Aug 4, 2020

Background
At Uber, we use the Kafka producer to publish messages to brokers, and we have seen two problems that can overload the producer while running the system:

  1. Surges in message rate.
  2. Broker degradation.

When the Kafka producer is overloaded, messages accumulate in its buffer, which causes high message delivery latency, garbage-collector saturation, and a drop in throughput.
We think load shedding is the right way to deal with overload; however, the existing concurrency-limit algorithms are not a perfect fit for a batching system such as the Kafka producer.

Abstraction
As a batching system, the Kafka producer comes with a sending buffer. When sending messages, clients put them into this buffer,
and one or more sender threads keep fetching messages from the buffer and sending them to the broker. Latency is measured after the client receives the ack from the broker.
So in this system, latency has two components:

  • message buffer time
  • request roundtrip time

Concretely, we have the following equation:
(1) observed-latency = buffer-time + roundtrip-latency
In practice, when the system is not overloaded, the following relationship holds:
(2) 0 < buffer-time < BF * roundtrip-latency (BF stands for bufferFactor = 1 / num-inflight-requests)
Combining equation (1) with relationship (2) gives:
(3) roundtrip-latency < observed-latency < (1 + BF) * roundtrip-latency
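
To make relationship (3) concrete, here is a small numeric sketch; the latency figure and request count are illustrative only, not measurements from our system.

    // Illustrative numbers only: 4 in-flight requests, 10 ms broker roundtrip.
    public class LatencyBoundsExample {
        public static void main(String[] args) {
            double roundtripLatencyMs = 10.0;
            int numInflightRequests = 4;
            double bufferFactor = 1.0 / numInflightRequests; // BF = 1 / num-inflight-requests

            double lowerBoundMs = roundtripLatencyMs;                      // no buffer time
            double upperBoundMs = (1 + bufferFactor) * roundtripLatencyMs; // 12.5 ms

            // Relationship (3): when not overloaded, observed-latency should stay inside these bounds.
            System.out.printf("expected observed latency in (%.1f, %.1f) ms%n", lowerBoundMs, upperBoundMs);
        }
    }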

The problem with VegasLimit is that it uses rtt_noload as the boundary for detecting saturation.
For a batching system, this can easily cause over-shedding, so we are proposing a small enhancement to make VegasLimit a good fit for batching systems.
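
For comparison, here is a rough paraphrase of the queue estimate the stock VegasLimit performs (it treats everything above rtt_noload as queueing); the wrapper class and the numbers are illustrative only:

    public class CurrentVegasEstimateSketch {
        // Paraphrase of the stock estimate: any latency above rtt_noload is counted as queueing.
        static int currentQueueEstimate(double estimatedLimit, long rtt_noload, long rtt) {
            return (int) Math.ceil(estimatedLimit * (1 - (double) rtt_noload / rtt));
        }

        public static void main(String[] args) {
            // A healthy batching producer: rtt_noload = 8 ms, observed rtt = 10 ms,
            // where the extra 2 ms is expected buffer time rather than overload.
            System.out.println(currentQueueEstimate(100, 8_000_000L, 10_000_000L)); // prints 20
            // A "queue" of 20 out of a limit of 100 is inferred, which drives over-shedding.
        }
    }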

Solution
Here is the high-level solution:

  1. Introduce a bufferFactor to represent the portion of buffer-time in observed-latency.
  2. Take the bufferFactor into account when calculating the queue size, to avoid over-shedding.
  3. Take the bufferFactor into account when resetting the probe, to avoid limit inflation.

With bufferFactor representing the portion of buffer-time in observed-latency, we propose a new equation to estimate the queue size:

    queueSize = (int) Math.ceil(estimatedLimit * (1 - (double)rtt_noload * (bufferFactor + 1) / rtt));

It suggests no queue when 0 < rtt < rtt_noload * (bufferFactor + 1).
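
A minimal sketch of the proposed estimate, using the same healthy-producer scenario as above; the helper method is illustrative and may not match the patch's exact structure:

    public class BufferAwareEstimateSketch {
        static int queueSize(double estimatedLimit, long rtt_noload, long rtt, double bufferFactor) {
            return (int) Math.ceil(estimatedLimit * (1 - (double) rtt_noload * (bufferFactor + 1) / rtt));
        }

        public static void main(String[] args) {
            double bufferFactor = 0.25;
            // Healthy producer: rtt = 10 ms is entirely explained by buffer time, so no queue is inferred.
            System.out.println(queueSize(100, 8_000_000L, 10_000_000L, bufferFactor)); // prints 0
            // Genuine overload: rtt = 16 ms exceeds rtt_noload * (bufferFactor + 1) = 10 ms.
            System.out.println(queueSize(100, 8_000_000L, 16_000_000L, bufferFactor)); // prints 38
        }
    }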

In practice, we found that this change prevented over-shedding, but it also introduced another problem: limit inflation.
Limit inflation means the concurrency limit gradually increases as time elapses.
The reason is that on probe we set

        rtt_noload = rtt

but when the system is under load, rtt actually falls in the range [rtt_noload, (bufferFactor + 1) * rtt_noload].
As a result, if rtt happens to be at the higher end when we probe, both rtt_noload and estimatedLimit inflate.
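
A toy simulation of that ratchet, modelling only the probe resetting rtt_noload to whatever rtt it happens to observe at the high end; the numbers are made up:

    public class LimitInflationSketch {
        public static void main(String[] args) {
            double bufferFactor = 0.25;
            double rttNoload = 10.0; // ms, the true no-load roundtrip at the start

            for (int probe = 1; probe <= 5; probe++) {
                // Worst case: the probe observes rtt at the top of [rtt_noload, (1 + BF) * rtt_noload].
                double observedRtt = (1 + bufferFactor) * rttNoload;
                rttNoload = observedRtt; // probe resets rtt_noload = rtt
                System.out.printf("after probe %d: rtt_noload = %.2f ms%n", probe, rttNoload);
            }
            // rtt_noload drifts from 10 ms to about 30.5 ms, and estimatedLimit inflates with it.
        }
    }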

We solved the problem in two steps:

  1. On probe, update estimatedLimit with the following equation:

          estimatedLimit = Math.max(initialLimit, Math.ceil(estimatedLimit / (bufferFactor + 1)));
    

This change reduces estimatedLimit to its bare size (no buffer), so if the system is under load, it starts rejecting requests and concurrency starts dropping.
Eventually, when concurrency has dropped to estimatedLimit, the observed rtt should equal rtt_noload, and the limit inflation is resolved.
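
A sketch of this reset, following the equation above; the method and its placement are illustrative, not the actual patch:

    public class ProbeResetSketch {
        static double resetOnProbe(double estimatedLimit, int initialLimit, double bufferFactor) {
            // Shrink the limit back to its bare (no-buffer) size so that, under load,
            // requests start getting rejected and concurrency can fall to the new limit.
            return Math.max(initialLimit, Math.ceil(estimatedLimit / (bufferFactor + 1)));
        }

        public static void main(String[] args) {
            // With bufferFactor = 0.25, an inflated limit of 125 is pulled back to 100.
            System.out.println(resetOnProbe(125, 20, 0.25)); // prints 100.0
        }
    }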

  2. Pause updates of estimatedLimit for an extra amount of time, to allow rtt to drop back to rtt_noload:

          pauseUpdateTime = rtt * (1 + bufferFactor / (1 + bufferFactor))
    

It takes time for concurrency to drop to the updated estimatedLimit, and during that time, updates of estimatedLimit should be paused.
Otherwise estimatedLimit could increase in the meantime, and we would miss the opportunity to observe rtt_noload.
The time to wait can be estimated with the equation above.
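
A sketch of the pause window, following the equation above; the class and method names are illustrative only:

    public class ProbePauseSketch {
        static double pauseUpdateTime(double rtt, double bufferFactor) {
            // Enough time for in-flight work to drain so the probe can observe
            // rtt close to rtt_noload before limit updates resume.
            return rtt * (1 + bufferFactor / (1 + bufferFactor));
        }

        public static void main(String[] args) {
            // With rtt = 12.5 ms and bufferFactor = 0.25, updates pause for about 15 ms.
            System.out.println(pauseUpdateTime(12.5, 0.25));
        }
    }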

@ericzhifengchen (Author)

CI passes locally. How do I rerun it?
