HTTP QoS for asymmetric DDoS mitigation #488

Open
krizhanovsky opened this issue May 19, 2016 · 4 comments

krizhanovsky commented May 19, 2016

Some of the ideas for this issue are inspired by Web2K: Bringing QoS to Web Servers and the AIS Danger Theory. Basic client clustering and classification is required.

Stress calculation

HTTP QoS works with 3 input sources:

  1. a stress module invokes the QoS mechanism when local system stress or upstream server stress is observed;
  2. configured QoS policies;
  3. the initial assumption that all TCP connections have equal TCP send buffer sizes (ingress and egress), i.e. equal QoS.

There should be 3 stress modules: local system (the host running Tempesta) stress, upstream server stress, and dangerous clients. The following local stress parameters must be monitored, and their triggering values must be configurable (a minimal accounting sketch follows the list):

  1. NIC packet drops (a global system parameter). Packet drops significantly degrade TCP performance, so QoS recalculation for the clients (with the corresponding TCP window size reduction, see Redesign of TCP synchronous sending and data caching #391 point 3) must be initiated. Probably some more network performance events should be processed as stress conditions;
    UPD NIC packet drops aren't the only thing reducing TCP transmission performance, e.g. network congestion is another such parameter. Probably we should hook tcp_enter_cwr(), tcp_enter_loss() or some other TCP CWND modification function. Check the appropriate NET_INC_STATS() calls for the stress accounting.
  2. Total system memory and the memory required to service a particular client. Memory must be accounted for local Tempesta operations (e.g. using TfwPool) and for system resources like sockets, so that many TIME_WAIT & FIN_WAIT2 sockets with the client reduce its QoS.
    TCP memory pressure (tcp_under_memory_pressure() & Co) is an obvious solution here.
  3. CPU usage (system-wide and per client);
  4. ring-buffer work queue overruns;
  5. Tempesta's response latency.
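
For illustration only, a minimal sketch of how the counters and their configurable triggering values could be grouped and checked. All names here (e.g. tfw_local_stress) are hypothetical and are not existing Tempesta code:

#include <stdbool.h>

/* Hypothetical local stress counters, updated from the hooks discussed above. */
struct tfw_local_stress {
	unsigned long	nic_drops;	/* NIC packet drops */
	unsigned long	cwnd_events;	/* tcp_enter_cwr()/tcp_enter_loss() hits */
	unsigned long	wq_overruns;	/* ring-buffer work queue overruns */
	unsigned long	lat_us;		/* recent Tempesta response latency, us */
};

/* Hypothetical configurable triggering values for the counters above. */
struct tfw_local_stress_cfg {
	unsigned long	max_nic_drops;
	unsigned long	max_cwnd_events;
	unsigned long	max_wq_overruns;
	unsigned long	max_lat_us;
};

/* True if any configured triggering value is exceeded, i.e. local stress. */
static bool
tfw_local_stress_triggered(const struct tfw_local_stress *s,
			   const struct tfw_local_stress_cfg *cfg)
{
	return s->nic_drops > cfg->max_nic_drops
	       || s->cwnd_events > cfg->max_cwnd_events
	       || s->wq_overruns > cfg->max_wq_overruns
	       || s->lat_us > cfg->max_lat_us;
}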

Upstream server stress should basically be based on APM results. Besides DDoS mitigation, the technique mitigates upstream application server livelock caused by Tempesta FW processing: if an upstream server runs on the same host as Tempesta FW, then softirq monopolizes the CPU, so all ingress traffic is processed by Tempesta FW only, leaving no CPU resources for user space activities. Mogul addressed this issue. The following metrics must be measured:

  1. Upstream response time: average, maximum or a particular percentile;
  2. servers' send queue overruns (the queue limit is implemented in [HTTP] Don't pipeline non-idempotent method requests #419);
  3. ratio of the number of requests sent by a client (ReqNum) to the number of responses forwarded to it (RespNum). This is a per-client measurement (this is the Mēris DDoS case; see the accounting sketch after this list).
  4. How many requests must be fulfilled by requesting backend servers (see [Frang] negative cache entries rate limit #520 as well).
  5. Ratio between client traffic and traffic sent to backends (the HTTP/2 amplification case).
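
A minimal per-client accounting sketch for item 3; the names are hypothetical and the triggering ratio would be a configurable value, not the constant shown here:

#include <stdbool.h>

/* Hypothetical per-client request/response counters. */
struct client_acct {
	unsigned long	req_num;	/* requests received from the client */
	unsigned long	resp_num;	/* responses forwarded back to it */
};

/*
 * Trigger upstream stress for the client if it has sent noticeably more
 * requests than it has received responses (the Meris-like pattern).
 * @max_ratio is a configurable triggering value, e.g. 8.
 */
static bool
client_req_resp_stress(const struct client_acct *c, unsigned long max_ratio)
{
	/* Avoid division by zero while the first responses are in flight. */
	unsigned long resp = c->resp_num ? c->resp_num : 1;

	return c->req_num / resp > max_ratio;
}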

A client is obviously dangerous if it ignores our cookies or JS challenge, i.e. it just sends us many requests without the set cookie and ignoring JS timers. It could be just a dummy web client, which is probably fine, so we should analyze other performance measurements for the client (how many requests it sends, request/response rates etc.). See the static limit implementation for #535.

If any of the stress modules is triggered by exceeding a system or upstream limit, e.g. packet drops or an upstream response time, then the most greedy clients (sending the largest number of packets or having the highest ReqNum/RespNum value correspondingly) must get reduced TCP windows for all their connections, or all their connections must be closed.

TBD Different locations and request methods can load a server differently, so we should not rely on the "average load".

QoS

Different resources of the same vhost, different vhosts, as well as different server groups can require different QoS. The local stress module configuration is global, while limits for upstream server response time must be configured with finer granularity (e.g. a server group servicing static content responds faster than dynamic content servers running database queries). The QoS of a particular resource/vhost/server group must be specified by an integer from 0 to 100: a higher value means higher QoS. The default value is 50. If we cannot process a request for a client or resource with a high QoS value, then a stress event is triggered and connection eviction or TCP window size reduction must be initiated.

Typically, arrays of lists indexed by client QoS weight should be used to avoid big locks on reordering. If a client changes its QoS weight, it is moved to the list for the appropriate QoS value.
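
A minimal sketch of the "array of lists indexed by QoS weight" idea using the kernel list API; the structure names are hypothetical, and a real implementation would also need to serialize concurrent moves of the same client:

#include <linux/list.h>
#include <linux/spinlock.h>

#define TFW_QOS_MAX	100	/* QoS weights are 0..100, default 50 */

/* Hypothetical client descriptor carrying its current QoS weight. */
struct qos_client {
	struct list_head	qos_list;	/* links the client into a bucket */
	unsigned int		qos;		/* current weight, 0..100 */
};

/* One short per-bucket lock instead of a single big lock over all clients. */
struct qos_sched {
	struct list_head	buckets[TFW_QOS_MAX + 1];
	spinlock_t		locks[TFW_QOS_MAX + 1];
};

/* Move a client to the bucket matching its new QoS weight. */
static void
qos_client_requeue(struct qos_sched *s, struct qos_client *c, unsigned int new_qos)
{
	spin_lock(&s->locks[c->qos]);
	list_del(&c->qos_list);
	spin_unlock(&s->locks[c->qos]);

	c->qos = new_qos;

	spin_lock(&s->locks[new_qos]);
	list_add_tail(&c->qos_list, &s->buckets[new_qos]);
	spin_unlock(&s->locks[new_qos]);
}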

We need to provide traffic shaping for vhosts. Basic L3 traffic shaping can be done by tc: a lower Qdisc bandwidth will raise an upstream stress event, so QoS for all the vhost's clients must be reduced. However, a configuration option for HTTP requests per second must be introduced for vhosts.

The QoS API must be generic enough for future ML classifiers working in user space, so complex clustering algorithms can be used to set client QoS more accurately. Since HTTP messages are offloaded to user space by #77, we also need to export client statistics to user space as well.

Since QoS rules can be used for DDoS mitigation, it's expected that there will be plenty of rules and that most of them can change dynamically. So the rules should be stored in TDB (probably with some eviction strategy) and be analyzed with tdbq. In this sense the issue relates to #731.

TCP-BPF addresses per-connection TCP parameters (e.g. buffer sizes, cwnd, etc.). Also, BBRx and PCC-Vivace employ ML for congestion control. So it makes sense to change not only the receive window as a dynamic QoS parameter: besides dynamic performance optimization, we must solve the probable malicious resource exhaustion problem.
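
Illustration only, not Tempesta code: a minimal TCP-BPF (sockops) sketch that clamps a connection's congestion window on establishment. The clamp value is an arbitrary example; a real program would read it from a per-client QoS map:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifndef SOL_TCP
#define SOL_TCP 6	/* as in <netinet/tcp.h> */
#endif

SEC("sockops")
int clamp_cwnd(struct bpf_sock_ops *skops)
{
	int clamp = 64;	/* example clamp in segments, would come from a QoS map */

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* Clamp the congestion window of the established connection. */
		bpf_setsockopt(skops, SOL_TCP, TCP_BPF_SNDCWND_CLAMP,
			       &clamp, sizeof(clamp));
		break;
	}
	return 1;
}

char _license[] SEC("license") = "GPL";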

BTW, in some cases client and backend connections can work in very different network environments, e.g. poor, distant Internet connections for clients and fast LAN connections to backends. So consider (maybe move to a new task) setting different congestion control algorithms for client and server connections, as well as using different, dynamically calculated parameters as in TCP-BPF.

Server/vhost QoS

There are 2 types of QoS: client and server QoS. Server/vhost QoS is static, defined by a system administrator, and defines how important a resource is; client QoS is dynamic and calculated depending on the current system and backend stress caused by the client. Client QoS is expressed by the TCP receive buffer and receive window correspondingly, i.e. it essentially manipulates socket throughput in bytes. Meanwhile, it makes sense to manage server/vhost QoS in terms of RPS.

Currently we have ratio load balancing working in terms of forwarded requests. QoS, just like tc, should care about the minimum RPS provided to a particular vhost/server (if one wants throughput QoS, they can use tc for backend IP addresses); in particular, 2 configuration options must be introduced (a configuration sketch follows the list):

  • qos_rps <value in rps> - minimum provided RPS to a vhost
  • qos_delay <percentile> <value in ms> - maximum request delay
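
A configuration sketch of the proposed options. The qos_rps and qos_delay directives are the ones proposed above and are not implemented; the srv_group block layout is only assumed for illustration:

srv_group dynamic {
    server 127.0.0.1:9090;

    # Proposed in this issue, not implemented yet:
    qos_rps 500;         # guarantee at least 500 requests per second
    qos_delay 95 200;    # 95th percentile of request delay must stay below 200 ms
}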

APM provides percentile delays and shall provide RPS statistics. If APM observes the QoS statistics for a particular vhost/server falling below the configured values, it should trigger an upstream server stress event, so some client TCP receive buffers must be reduced to free more resources. The question is how to define which clients to suppress: we should not(?) suppress clients requesting the crucial vhosts (we have a minimum RPS, clients want more RPS, and it's wrong to shape them out). Probably we should leave the question for more advanced ML with client clustering and just limit the most active clients for this task. A better solution is TBD.

There could be a wrong configuration, e.g. qos_delay less than the network delay. It is completely wrong to try to shape clients in this case, so we should stop after some number of attempts and print a warning that we cannot achieve the specified QoS.

We need to carefully analyze how our server QoS (also as a stress trigger) working in terms of RPS can cooperate with native Linux QoS working with PPS and throughput. In particular, can we handle a QoS stress trigger via TCP CA function calls? How can we configure RPS QoS and integrate it with Linux QoS?

TCP flow control

Stress configurations must have soft and hard limits. Reaching a soft limit triggers TCP window size reduction AND stops accepting new connections under system stress, while a hard limit requires immediate resource freeing, so connections must be evicted (closed) immediately. Closing connections can be very harmful for session-oriented Web services, so this is the last thing we should do. Connection closing can use sending an HTTP error message, normal TCP closing (sending FIN), resetting (sending RST), or silent dropping (just silently freeing the connection data structures - the roughest method, but the most efficient under DDoS). The behavior must be configurable. Meanwhile, QoS for resources should be guaranteed by TCP window reduction only.

Suppose we have 2 sockets: sock_rcv - a socket on which we currently read data, and sock_snd - a socket through which we forward the received data. sock_snd may not be able to send all the received data due to TCP congestion and receive windows as well as other competing receive sockets (e.g. if we have several client sockets sending requests through the same server socket). Thus our announced receive window for sock_rcv must be influenced by the congestion and receive windows on sock_snd. Moreover, since TCP windows are dynamic, we have to keep some more data in the TCP send queue in addition to the data on the wire to be able to immediately send more data. However, there is no sense in obeying the tcp_wmem limitation.
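
A minimal sketch of the coupling idea, purely illustrative and not the actual Tempesta or kernel code: the window announced on sock_rcv is capped by the free send capacity of sock_snd plus a small reserve:

/*
 * Illustration only: clamp the receive window we announce on sock_rcv by
 * how much sock_snd can actually push out, plus a reserve that keeps the
 * send queue non-empty while the windows grow.
 */
static unsigned int
coupled_rcv_window(unsigned int rcv_wnd,	/* window we would announce */
		   unsigned int snd_in_flight,	/* bytes queued/on the wire on sock_snd */
		   unsigned int snd_capacity,	/* min(cwnd, peer rwnd) of sock_snd, bytes */
		   unsigned int reserve)	/* extra data kept in the send queue */
{
	unsigned int snd_room = snd_capacity > snd_in_flight
				? snd_capacity - snd_in_flight : 0;
	unsigned int limit = snd_room + reserve;

	return rcv_wnd < limit ? rcv_wnd : limit;
}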

Besides limiting the client TCP window size, we might need to limit the window size on the upstream TCP connection carrying HTTP/1 to block HTTP/2 amplification attacks.

TODO. Currently we don't bother with Tempesta socket memory limitations since in proxy mode we just forward packets instead of making real allocations. Probably this is an issue. Probably sockets can be freed from under us. See the __sk_free(sk) call in sock_wfree().

HTTP flow control

HTTP/2 (#309) and HTTP/3 (#724) provide flow control close to TCP's, so full HTTP proxying (#1125) makes the same TCP flow control concepts described above applicable to the HTTP window.

HTTP/2 HPACK, and correspondingly HTTP/3 QPACK, introduce the HTTP/2 amplification threat, which must be handled with the #498 flow control and the QoS from this issue. Basically, we need to compare and limit the ratio between ingress HTTP/2 and egress HTTP/1.1 traffic (see the sketch below).
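
A minimal sketch of that ratio check; the names are hypothetical and the threshold would be a configurable value:

#include <stdbool.h>

/* Hypothetical per-client HTTP/2 amplification accounting. */
struct h2_ampl_acct {
	unsigned long	in_h2_bytes;	/* compressed HTTP/2 (HPACK) ingress */
	unsigned long	out_h1_bytes;	/* decompressed HTTP/1.1 egress to backends */
};

/*
 * Trigger a stress/QoS action if the client's small HTTP/2 input expands
 * into disproportionally large HTTP/1.1 output towards the backends.
 */
static bool
h2_amplification_detected(const struct h2_ampl_acct *a, unsigned long max_ratio)
{
	unsigned long in = a->in_h2_bytes ? a->in_h2_bytes : 1;

	return a->out_h1_bytes / in > max_ratio;
}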

This task also relates to equal QoS (root stream prioritization) for different clients. See The ultimate guide to HTTP resource prioritization (task #1196).

In the case of HTTP/2 or QUIC <-> Tempesta <-> HTTP/2 or QUIC (see #1125), we might need to propagate the client flow control settings to the upstream connections to block HTTP/2 amplification attacks.

Clients handling

Currently we identify clients by their IP only. A new configuration option must be introduced to specify which data should be used for client identification (Sticky cookie, IP, User-Agent, Referer, etc.). Early client operation must still be done by IP address, for the parent client, for Frang low-level limiting. However, as an HTTP request is read, a new child client must be "forked" and used hereafter for accounting.

The filter module must initiate closing of a client's connections (in the configured fashion) when it adds the client's address to the blocking table. That will free unnecessary resources faster.

We also must implement default and Keep-Alive header-defined timeouts for open connections. Timers from #387 must be integrated with the eviction strategy for TfwCliConnection and the TCP window calculation (#488).

Connection eviction and TCP window size reduction must be done in a separate kernel thread. Old connections are typically proven, so the thread should increase QoS for such connections from time to time to mitigate previous penalties on them. When the QoS value for a client is increased, the client's connections must increase their TCP receive window and socket write memory. A connection should never receive QoS higher than specified in the config.
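
A sketch of the periodic "forgiveness" pass done by that thread, reusing the hypothetical qos_sched/qos_client structures from the sketch in the QoS section above; qos_max is the per-client ceiling from the configuration:

/*
 * Illustration of the periodic QoS recovery: old, well-behaved clients
 * slowly get their QoS back, but never above the configured ceiling.
 */
static void
qos_client_recover(struct qos_sched *s, struct qos_client *c,
		   unsigned int step, unsigned int qos_max)
{
	unsigned int new_qos = c->qos + step;

	if (new_qos > qos_max)
		new_qos = qos_max;
	if (new_qos != c->qos)
		qos_client_requeue(s, c, new_qos);
	/* ...then grow the TCP receive window and socket write memory. */
}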

Consider Websocket (#755): reduce QoS for clients who try to exhaust system resources using slow Websocket connections.

JavaScript Challenge

We must set a proper timeout for the JavaScript challenge. [Is this still relevant after #1102 and #2025?]

Client QoS should also be decreased for 'suspicious' clients by sending them the JS challenge (JSCH). From #1102:

At the moment we send the JS challenge on each request. While it was OK with the Cookie challenge, since a user doesn't see the redirects, it makes the user experience significantly worse for the JS challenge, since all users now see the message about browser verification. We must send the JS challenge only if a user is suspicious, e.g. has exceeded an HTTP request rate soft limit. This is essentially #598 (comment), so the task depends on #598.

Cloudflare also does not send JSCH on each request, only on particular triggers (GeoIP, IP reputation, WAF rules).

References

We need to explore the mitigation techniques cited by the paper (also referenced by #496), especially for ReDoS vulnerabilities and algorithmic complexity attacks.

Shenango uses ring-buffer work queue saturation as an indicator of overload, and this technique is also usable for this issue.

Cloudflare's probabilistic approach to providing per-flow QoS: while the DDoS attack isn't fully blocked, innocent streams don't suffer.

Understanding Host Network Stack Overheads proposes to adjust TCP buffers depending on the current TCP state (e.g. windows).

krizhanovsky commented Jan 25, 2018

From #100: the most urgent thing is to keep security accounting data for a client for some time after the last client connection is closed. This is very important to properly track client security limits for Connection: close connections. See https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/client.c#L89

Since we have to evict client accounting data after 'some time', it makes sense to store it in a TDB table.

The comment was moved to a separate issue, #1115.

krizhanovsky commented Mar 28, 2018

The consequence of the issue also appears in a simple test with a configuration such as:

listen 192.168.100.4:80;
server 127.0.0.1:9090 conns_n=1;
cache 0;
server_queue_size 1;

In this case we get a lot of error responses:

# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:80/
Running 30s test @ http://192.168.100.4:80/
  8 threads and 4096 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   135.34ms  260.83ms   1.98s    89.95%
    Req/Sec     8.12k     2.74k   29.53k    72.39%
  1934441 requests in 30.09s, 207.60MB read
  Socket errors: connect 0, read 0, write 0, timeout 882
  Non-2xx or 3xx responses: 1916248
Requests/sec:  64296.04
Transfer/sec:      6.90MB

The only server queue is busy, but we continue to read new requests and just send error responses for them. This activity wastes resources and degrades user experience. Instead, we must politely slow down the clients if we're unable to process their requests, and not read their requests at all (a minimal backpressure sketch is shown below).
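
A minimal illustration of the intended behavior (a hypothetical hook, not the actual Tempesta code): when the server queue is full, stop reading further client data, so the unread data keeps our advertised window small and the client is throttled by ordinary TCP flow control instead of receiving error responses:

#include <stdbool.h>

/*
 * Hypothetical receive hook: if the only server connection's queue is full,
 * leave the client's data in the socket receive queue.  The unread data
 * shrinks our advertised window, and TCP itself slows the client down.
 */
static bool
client_may_read(unsigned int srv_queue_len, unsigned int srv_queue_max)
{
	return srv_queue_len < srv_queue_max;
}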

The problem is the subject of #940 (Requests queueing if there is no backend connection).
