Implement throttling for service to storage api #959

Open
codenrhoden opened this issue Aug 7, 2017 · 3 comments · May be fixed by #1040

@codenrhoden
Member

@cduchesne commented on Sun Feb 05 2017

libStorage should have a throttling mechanism to prevent sending too many requests to storage api


@codenrhoden commented on Wed Mar 15 2017

@koensayr If you want a place to paste that throttling info, this is the place.


@codenrhoden commented on Mon Mar 20 2017

Thoughts from @cantbewong:

Consider this situation:
You need to create, mount, or unmount a volume and the operation takes a long time to complete. The reason could be that the “downstream” 3rd party API used to accomplish this:

  • Is slow
  • Imposes rate limiting (rejects or fails to successfully execute requests under heavy use). Two forms of rate limiting are known to exist:
    ** A cap on requests per unit of time
    ** A cap on the number of outstanding requests in progress
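To make the two forms concrete, here is a minimal Go sketch (not libStorage code) of a client that respects both a requests-per-minute cap and an outstanding-requests cap. The limits of 20 per minute and 20 in flight are illustrative assumptions, and callStorageAPI is a hypothetical stand-in for a real downstream call.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// callStorageAPI is a hypothetical stand-in for a real downstream storage-provider call.
func callStorageAPI(ctx context.Context, req string) error {
	fmt.Println("dispatching", req)
	return nil
}

func main() {
	ctx := context.Background()

	// Form 1: cap on requests per unit of time (illustrative: 20 per minute).
	limiter := rate.NewLimiter(rate.Every(time.Minute/20), 1)

	// Form 2: cap on the number of outstanding requests in progress
	// (illustrative: 20 in flight), modeled with a buffered channel used
	// as a counting semaphore.
	inFlight := make(chan struct{}, 20)

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		req := fmt.Sprintf("request-%d", i)

		// Block until the per-minute budget allows another dispatch.
		if err := limiter.Wait(ctx); err != nil {
			fmt.Println("rate limiter:", err)
			break
		}

		inFlight <- struct{}{} // blocks if too many requests are already outstanding
		wg.Add(1)
		go func(req string) {
			defer wg.Done()
			defer func() { <-inFlight }()
			_ = callStorageAPI(ctx, req)
		}(req)
	}
	wg.Wait()
}
```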

So, you basically have two options:

  1. You can force the API client to wait
  2. You can immediately return a preliminary status response (202 Accepted is the convention defined by RFC 7231) and defer final status reporting to some later point
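Option 2 is sketched below, assuming a hypothetical /volumes endpoint and an in-memory task counter: the handler starts the long-running work asynchronously and immediately answers 202 Accepted with a task reference the caller could poll. Real task-state tracking, error reporting, and the polling endpoint itself are omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync/atomic"
)

// taskID is a naive in-memory counter; a real service would persist task state
// so a /tasks/{id} endpoint could report final status.
var taskID uint64

// createVolume accepts the request, starts the slow work in the background,
// and immediately returns 202 Accepted (RFC 7231) with a task URL.
func createVolume(w http.ResponseWriter, r *http.Request) {
	id := atomic.AddUint64(&taskID, 1)

	go func() {
		// ... perform the slow create/mount/unmount against the storage API ...
	}()

	w.Header().Set("Location", fmt.Sprintf("/tasks/%d", id))
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusAccepted)
	_ = json.NewEncoder(w).Encode(map[string]any{"task": id, "status": "pending"})
}

func main() {
	http.HandleFunc("/volumes", createVolume)
	_ = http.ListenAndServe(":8080", nil)
}
```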

This falls into the pattern of a “long running operation” that commonly leads to:

  • A desire for a cancellation API
  • A mechanism to support “push” notification (websocket, message queue, etc)
  • A need for state maintenance in the service

Furthermore, the downstream API could additionally exhibit non-deterministic ordering behavior. For example, even though a caller submits an unmount request followed by a mount request, the mount request may be attempted first.

While it is almost always a mistake to impose an implementation demand in a functional spec, these characteristics lead to:

  • A desire for state to be held in the form of a queue in order to preserve ordering of requests
  • A desire for optional support for a unified queue that combines multiple request types into a common queue to preserve ordering, e.g. unmount + mount combined into the same queue
  • The queue should be implemented by libStorage itself, in order to prevent duplication of effort in individual plugins. If libStorage API calls are submitted at a rate greater than the queue can be emptied, the queue will eventually reach capacity, since there is no way to implement a queue of unlimited size. When this happens, the libStorage API should return an error (429 Too Many Requests or 503 Service Unavailable would both be considered conventional)
  • Because the needs of various plugins vary, these aspects of queueing and rate limiting should be configurable on a per-plugin basis. A plugin will have the option to define these settings. At the plugin’s option, independent or combined queues may be selected:
    ** Rate limiting (max calls to API per unit time): e.g. do not dispatch external 3rd party requests from this queue at a rate greater than 20 requests per minute
    ** Outstanding request limiting: e.g. Stop dispatching request if more than 20 active requests are pending final resolution
    ** Retry Timeout specification: e.g. if any request dispatched from this queue has been pending more than 5 minutes, attempt a retry
    ** Cancel Timeout specification: e.g. if any request dispatched from this queue has been pending more than 5 minutes, attempt a cancel
    ** Queue clear invocation for use on communication or authentication error.
    *** Certain kinds of errors, such as a credential rejection or an unreachable host, might be best handled by flushing all pending queued operations rather than re-encountering the error as each queue entry is processed. A mechanism will be provided to allow plugin code to flush the queue.
    ** Dispatch filter option. A plugin should have the option to look at the next impending dispatched entry from the queue and temporarily suspend dispatch.
    *** Some plugins may have a maximum of one pending operation per client (cluster node) or per volume. This feature will allow a plugin to impose appropriate limits on queue draining
    ** These defaults can be defined by a driver. Override by user config is an optional feature that could be deferred to a later release.
Option | Meaning
MaxPerMinute | Max calls to be submitted to the downstream storage provider API before queueing (or libStorage API rejection) takes place
MaxActivePerVolume | Max operations (attach, detach, etc.) to be submitted to the driver for a specific volume before queueing is engaged
MaxActivePerInstance | Max operations (attach, detach, etc.) to be submitted to the driver for a specific instance (cluster node) before queueing is engaged
MaxActivePerVolumeInstanceTuple | Max operations (attach, detach, etc.) to be submitted to the driver for a specific volume-instance combination before queueing is engaged
MaxActiveOverall | Max operations of all types submitted and pending with the driver before queueing is engaged
CancelTimeoutSecs | Should an operation not terminate (success or failure) within this time, a cancel will be submitted to the driver. Note that if queueing is engaged, the cancel will skip the queue. This is intentional because queueing may be happening because of problematic operations that might best be cleared by cancellation
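In Go, such per-driver settings might look roughly like the sketch below; the field names come straight from the table above, while the types and example defaults are assumptions for illustration only.

```go
package throttle

// ThrottleConfig is a sketch of per-driver throttling settings mirroring the
// option table above. A driver would declare its own defaults, with user
// overrides treated as an optional, later feature.
type ThrottleConfig struct {
	MaxPerMinute                    int // max downstream API calls per minute before queueing/rejection
	MaxActivePerVolume              int // max in-flight operations per volume
	MaxActivePerInstance            int // max in-flight operations per instance (cluster node)
	MaxActivePerVolumeInstanceTuple int // max in-flight operations per volume+instance pair
	MaxActiveOverall                int // max in-flight operations of all types
	CancelTimeoutSecs               int // seconds before a pending operation is cancelled
}

// exampleDefaults shows how a driver might declare illustrative defaults.
var exampleDefaults = ThrottleConfig{
	MaxPerMinute:      20,
	MaxActiveOverall:  20,
	CancelTimeoutSecs: 300,
}
```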

An Example of issue/problem (still open) in Kubernetes support for AWS:
kubernetes/kubernetes#31858

The platform API should be able to respect the backend storage platform’s rate limits, either through a centralized throttling mechanism or through something deferred to the driver, e.g. the API overloads seen with AWS EBS and EFS.

@daniel-jirca

Currently I cannot mount EFS volumes with the rexray docker plugin because of API throttling.
AWS support says that:

After inspecting our logs, we have confirmed that the plugin is hitting a rate limit in the EFS API call. From a casual inspection of the rexray source code, it appears that it uses the DescribeFileSystems and DescribeMountTargets API calls prior to mounting a file system, to ensure that the volume and mount targets exist. When multiple mounts are made in quick succession, this causes the API rate limiting, which in turn results in the throttling exceptions that you see.

Unfortunately, in this state the plugin is mostly unusable, because mounting of volumes is unreliable and leaves the system in a continuous retry state. Some form of request limiting should be implemented.

@brewsteropsdev

@thenoots I am also experiencing this issue. The plugin worked fine for a couple of mounts, but as we scaled out across services we quickly hit the rate limit, wreaking havoc across our deployments.

@akutz akutz linked a pull request Sep 22, 2017 that will close this issue
@codenrhoden codenrhoden added this to the 2017-11.1 milestone Sep 26, 2017
@codenrhoden codenrhoden removed this from the 2017.12-1 milestone Dec 12, 2017
@codenrhoden codenrhoden added this to the 2018.01-1 milestone Jan 15, 2018
@codenrhoden
Member Author

The scope of this issue is going to change a bit, in light of roadmap plans for CSI support in REX-Ray. The end result is more or less the same as the WIP PR submitted previously (#1040), but the general idea is that REX-Ray will present a mechanism to throttle/rate-limit API calls made by CSI plugins. This can be done in a "global" scope, supporting multiple REX-Ray instances across nodes, and even across multiple Docker or Kubernetes clusters, when they use the same AWS key, since an AWS rate limit is tied to that account.
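As a rough illustration of that idea, the sketch below keys one shared rate limiter to each backend account (e.g. an AWS key) within a single process. The package, types, and the 20-calls-per-minute figure are assumptions, and the truly "global" scope across nodes and clusters described above would additionally require a shared backing store, which is not shown here.

```go
package throttle

import (
	"context"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// accountLimiters hands out one shared limiter per backend account (e.g. an
// AWS key), so every CSI-plugin-originated call using the same account draws
// from the same budget. This is per-process only; a cross-node "global"
// limiter would need a shared store.
type accountLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perMin   int
}

func newAccountLimiters(perMin int) *accountLimiters {
	return &accountLimiters{limiters: map[string]*rate.Limiter{}, perMin: perMin}
}

func (a *accountLimiters) get(account string) *rate.Limiter {
	a.mu.Lock()
	defer a.mu.Unlock()
	l, ok := a.limiters[account]
	if !ok {
		l = rate.NewLimiter(rate.Every(time.Minute/time.Duration(a.perMin)), 1)
		a.limiters[account] = l
	}
	return l
}

// Wait blocks until the shared budget for the given account allows another call.
func (a *accountLimiters) Wait(ctx context.Context, account string) error {
	return a.get(account).Wait(ctx)
}
```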

@clintkitson clintkitson modified the milestones: 2018.02-1, 2018.11-1 Jan 16, 2018