Implement throttling for service to storage api #959

Open
codenrhoden opened this issue Aug 7, 2017 · 3 comments · May be fixed by #1040

@codenrhoden
Member

@cduchesne commented on Sun Feb 05 2017

libStorage should have a throttling mechanism to prevent sending too many requests to storage api


@codenrhoden commented on Wed Mar 15 2017

@koensayr If you want a place to paste that throttling info, this is the place.


@codenrhoden commented on Mon Mar 20 2017

Thoughts from @cantbewong:

Consider this situation:
You need to create, mount, or unmount a volume and the operation takes a long time to complete. The reason could be that the “downstream” 3rd party API used to accomplish this:

  • Is slow
  • Imposes rate limiting (rejects or fails to successfully execute requests under heavy use). Two forms of rate limiting are known to exist:
    ** A cap on requests per unit of time
    ** A cap on the number of outstanding requests in progress
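To make the two forms concrete, here is a minimal Go sketch (not libStorage code) of a client that respects both a requests-per-minute cap and an outstanding-requests cap. The limits of 20 per minute and 20 in flight are illustrative assumptions, and callStorageAPI is a hypothetical stand-in for a real downstream call.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// callStorageAPI is a hypothetical stand-in for a real downstream storage-provider call.
func callStorageAPI(ctx context.Context, req string) error {
	fmt.Println("dispatching", req)
	return nil
}

func main() {
	ctx := context.Background()

	// Form 1: cap on requests per unit of time (illustrative: 20 per minute).
	limiter := rate.NewLimiter(rate.Every(time.Minute/20), 1)

	// Form 2: cap on the number of outstanding requests in progress
	// (illustrative: 20 in flight), modeled with a buffered channel used
	// as a counting semaphore.
	inFlight := make(chan struct{}, 20)

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		req := fmt.Sprintf("request-%d", i)

		// Block until the per-minute budget allows another dispatch.
		if err := limiter.Wait(ctx); err != nil {
			fmt.Println("rate limiter:", err)
			break
		}

		inFlight <- struct{}{} // blocks if too many requests are already outstanding
		wg.Add(1)
		go func(req string) {
			defer wg.Done()
			defer func() { <-inFlight }()
			_ = callStorageAPI(ctx, req)
		}(req)
	}
	wg.Wait()
}
```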

So, you basically have two options:

  1. You can force the API client to wait
  2. You can immediately return a preliminary status response (202 Accepted is the convention defined by RFC 7231) and defer final status reporting to some later point
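Option 2 is sketched below, assuming a hypothetical /volumes endpoint and an in-memory task counter: the handler starts the long-running work asynchronously and immediately answers 202 Accepted with a task reference the caller could poll. Real task-state tracking, error reporting, and the polling endpoint itself are omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync/atomic"
)

// taskID is a naive in-memory counter; a real service would persist task state
// so a /tasks/{id} endpoint could report final status.
var taskID uint64

// createVolume accepts the request, starts the slow work in the background,
// and immediately returns 202 Accepted (RFC 7231) with a task URL.
func createVolume(w http.ResponseWriter, r *http.Request) {
	id := atomic.AddUint64(&taskID, 1)

	go func() {
		// ... perform the slow create/mount/unmount against the storage API ...
	}()

	w.Header().Set("Location", fmt.Sprintf("/tasks/%d", id))
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusAccepted)
	_ = json.NewEncoder(w).Encode(map[string]any{"task": id, "status": "pending"})
}

func main() {
	http.HandleFunc("/volumes", createVolume)
	_ = http.ListenAndServe(":8080", nil)
}
```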

This falls into the pattern of a “long running operation” that commonly leads to:

  • A desire for a cancellation API
  • A mechanism to support “push” notification (websocket, message queue, etc)
  • A need for state maintenance in the service

Furthermore, the downstream API could additionally exhibit non-deterministic ordering behavior. For example, even though a caller submits an unmount request followed by a mount request, the mount request may be attempted first.

While it is almost always a mistake to impose an implementation demand in a functional spec, these characteristics lead to:

  • A desire for state to be held in the form of a queue in order to preserve ordering of requests
  • A desire for optional support for a unified queue that combines multiple request types into a common queue to preserve ordering, e.g. unmount + mount combined into the same queue
  • The queue should be implemented by libStorage itself, in order to prevent duplication of effort in individual plugins. If libStorage API calls are submitted at a rate greater than the queue can be emptied, the queue will eventually reach capacity, since there is no way to implement a queue of unlimited size. When this happens, the libStorage API should return an error (429 Too Many Requests or 503 Service Unavailable would both be considered conventional)
  • Because the needs of various plugins vary, these aspects of queueing and rate limiting should be configurable on a per-plugin basis. A plugin will have the option to define these settings. At the plugin’s option, independent or combined queues may be selected:
    ** Rate limiting (max calls to API per unit time): e.g. do not dispatch external 3rd party requests from this queue at a rate greater than 20 requests per minute
    ** Outstanding request limiting: e.g. Stop dispatching request if more than 20 active requests are pending final resolution
    ** Retry Timeout specification: e.g. if any request dispatched from this queue has been pending more than 5 minutes, attempt a retry
    ** Cancel Timeout specification: e.g. if any request dispatched from this queue has been pending more than 5 minutes, attempt a cancel
    ** Queue clear invocation for use on communication or authentication error.
    *** Certain kinds of errors, such as a credential rejection or an unreachable host, might be best handled by flushing all pending queued operations rather than re-encountering the error as each queue entry is processed. A mechanism will be provided to allow plugin code to flush the queue.
    ** Dispatch filter option. A plugin should have the option to look at the next impending dispatched entry from the queue and temporarily suspend dispatch.
    *** Some plugins may have a maximum of one pending operation per client (cluster node) or per volume. This feature will allow a plugin to impose appropriate limits on queue draining
    ** These defaults can be defined by a driver. Override by user config is an optional feature that could be deferred to a later release.
Option | Meaning
MaxPerMinute | Max calls to be submitted to the downstream storage provider API before queueing (or libStorage API rejection) takes place
MaxActivePerVolume | Max operations (attach, detach, etc.) to be submitted to the driver for a specific volume before queueing is engaged
MaxActivePerInstance | Max operations (attach, detach, etc.) to be submitted to the driver for a specific instance (cluster node) before queueing is engaged
MaxActivePerVolumeInstanceTuple | Max operations (attach, detach, etc.) to be submitted to the driver for a specific volume-instance combination before queueing is engaged
MaxActiveOverall | Max operations of all types submitted and pending with the driver before queueing is engaged
CancelTimeoutSecs | Should an operation not terminate (success or failure) within this time, a cancel will be submitted to the driver. Note that if queueing is engaged, the cancel will skip the queue. This is intentional because queueing may be happening because of problematic operations that might best be cleared by cancellation
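In Go, such per-driver settings might look roughly like the sketch below; the field names come straight from the table above, while the types and example defaults are assumptions for illustration only.

```go
package throttle

// ThrottleConfig is a sketch of per-driver throttling settings mirroring the
// option table above. A driver would declare its own defaults, with user
// overrides treated as an optional, later feature.
type ThrottleConfig struct {
	MaxPerMinute                    int // max downstream API calls per minute before queueing/rejection
	MaxActivePerVolume              int // max in-flight operations per volume
	MaxActivePerInstance            int // max in-flight operations per instance (cluster node)
	MaxActivePerVolumeInstanceTuple int // max in-flight operations per volume+instance pair
	MaxActiveOverall                int // max in-flight operations of all types
	CancelTimeoutSecs               int // seconds before a pending operation is cancelled
}

// exampleDefaults shows how a driver might declare illustrative defaults.
var exampleDefaults = ThrottleConfig{
	MaxPerMinute:      20,
	MaxActiveOverall:  20,
	CancelTimeoutSecs: 300,
}
```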

An Example of issue/problem (still open) in Kubernetes support for AWS:
kubernetes/kubernetes#31858

The platform API should be able to respect the backend storage platform’s rate limits, either through a centralized throttling mechanism or through something deferred to the driver, e.g. the API overloads seen with AWS EBS and EFS.

@daniel-jirca

Currently I cannot mount EFS volumes with the rexray docker plugin because of API throttling.
AWS support says that:

After inspecting our logs, we have confirmed that the plugin is hitting a rate limit in the EFS API call. From a casual inspection of the rexray source code, it appears that it uses the DescribeFileSystems and DescribeMountTargets API calls prior to mounting a file system, to ensure that the volume and mount targets exist. When multiple mounts are made in quick succession, this causes the API rate limiting, which in turn results in the throttling exceptions that you see.

Unfortunately, in this state the plugin is mostly unusable, because mounting of volumes is unreliable and leaves the system in a continuous retry state. Some form of request limiting should be implemented.

@brewsteropsdev

@thenoots I am also experiencing this issue. The plugin worked fine for a couple of mounts, but as we scaled out across services we quickly hit the rate limit, wreaking havoc across our deployments.

@akutz akutz linked a pull request Sep 22, 2017 that will close this issue
@codenrhoden codenrhoden added this to the 2017-11.1 milestone Sep 26, 2017
@codenrhoden codenrhoden removed this from the 2017.12-1 milestone Dec 12, 2017
@codenrhoden codenrhoden added this to the 2018.01-1 milestone Jan 15, 2018
@codenrhoden
Member Author

The scope of this issue is going to change a bit, in light of roadmap plans for CSI support in REX-Ray. The end result is more or less the same as the WIP PR submitted previously (#1040), but the general idea is that REX-Ray will present a mechanism to throttle/rate-limit API calls made by CSI plugins. This can be done in a "global" scope, supporting multiple REX-Ray instances across nodes, and even across multiple Docker or Kubernetes clusters, when they use the same AWS key, since an AWS rate limit is tied to that account.
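As a rough illustration of that idea, the sketch below keys one shared rate limiter to each backend account (e.g. an AWS key) within a single process. The package, types, and the 20-calls-per-minute figure are assumptions, and the truly "global" scope across nodes and clusters described above would additionally require a shared backing store, which is not shown here.

```go
package throttle

import (
	"context"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// accountLimiters hands out one shared limiter per backend account (e.g. an
// AWS key), so every CSI-plugin-originated call using the same account draws
// from the same budget. This is per-process only; a cross-node "global"
// limiter would need a shared store.
type accountLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perMin   int
}

func newAccountLimiters(perMin int) *accountLimiters {
	return &accountLimiters{limiters: map[string]*rate.Limiter{}, perMin: perMin}
}

func (a *accountLimiters) get(account string) *rate.Limiter {
	a.mu.Lock()
	defer a.mu.Unlock()
	l, ok := a.limiters[account]
	if !ok {
		l = rate.NewLimiter(rate.Every(time.Minute/time.Duration(a.perMin)), 1)
		a.limiters[account] = l
	}
	return l
}

// Wait blocks until the shared budget for the given account allows another call.
func (a *accountLimiters) Wait(ctx context.Context, account string) error {
	return a.get(account).Wait(ctx)
}
```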

@clintkitson clintkitson modified the milestones: 2018.02-1, 2018.11-1 Jan 16, 2018