Redesign of TCP synchronous sending and data caching #391

krizhanovsky opened this issue Jan 7, 2016 · 0 comments

krizhanovsky commented Jan 7, 2016

The current implementation of ss_send() and the cache logic is unsatisfactory. The following changes should be made; many of them are inspired by the Sandstorm web server. The first points are referenced by #534.

Queuing

Initially this issue complained that too many queues are involved in proxying an HTTP request to a server socket (linked with #687):

  1. TfwCliConn->seq_queue is required to order server responses to pipelined requests;
  2. TfwSrvConn->fwd_queue and TfwSrvConn->nip_queue implement the server_queue_size and server_forward_timeout limits and are responsible for the per-connection and per-server failover logic;
  3. the ring-buffer work queue is used for lock-free proxying among sockets living on different CPUs;
  4. the TCP send queue is used to quickly send the next portion of data on ACK and to keep unacknowledged data.

Per-server TfwSrvConn queues live longer than per-socket TCP send queues, and the TCP send queues are required to provision ready-to-send data, so none of these queues can be eliminated. Moreover, TCP windows vary dynamically, so we don't know how much data a socket will be able to send at the moment we schedule data for transmission (it still has to be queued in the TCP control block). This also means we have to accept that TCP may adjust an skb on the transmission path, since it works with dynamically changing TCP connection parameters.
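
A minimal sketch of the queue layering described above; the member names seq_queue, fwd_queue and nip_queue come from this issue, while the exact layout of the real TfwCliConn/TfwSrvConn structures is richer and may differ:

```c
#include <linux/list.h>
#include <linux/spinlock.h>

/* Simplified view only; the real Tempesta structures carry many more fields. */
typedef struct {
	spinlock_t		seq_qlock;
	struct list_head	seq_queue;	/* orders responses to pipelined requests */
} TfwCliConn;

typedef struct {
	spinlock_t		fwd_qlock;
	struct list_head	fwd_queue;	/* bounded by server_queue_size */
	struct list_head	nip_queue;	/* non-idempotent requests kept for failover */
	unsigned int		qsize;		/* checked against server_queue_size */
} TfwSrvConn;

/*
 * Below these per-connection queues sit the per-CPU ring-buffer work queue
 * (lock-free cross-CPU hand-off) and, finally, the kernel's per-socket TCP
 * send queue (sk->sk_write_queue), which keeps data until it is ACKed.
 */
```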

TODO

  • tfw_cache_build_resp() can produce a very large response. Many of the skbs are simply dropped by the Qdisc; the others will be sent by the TSQ tasklet. So there is no sense in doing large TDB scans; rather, the scans should be done in small portions whenever there is room in the Qdisc. Actually, the currently available room in proxy_buffering from HTTP message buffering and streaming #498 can be used for now as the amount of memory fetched from the cache. This also reduces bufferbloat. The response transfer rate is also subject to HTTP QoS (HTTP QoS for asymmetric DDoS mitigation #488): if a client has a high QoS mark, it receives the data faster, i.e. we should send min(client advertised TCP window size, QoS result). The response should be assembled with knowledge of the current TCB state of the client connection to avoid additional splitting at the TCP/IP layer, typically in a callback from tcp_write_xmit() or tcp_transmit_skb(), where we know the exact "right" skb size (i.e. we can trick the TCP/IP stack by queueing an empty skb with skb->len set to the maximum value and fill the skb from the cache later via a callback, once we know how much data we are going to send), and/or on the sk->sk_write_space() hook. In general we should implement a pull strategy: when data is ACKed, we pull the next portion of data from the cache (see the sketches after this list). This step reduces skb transformations for cached responses. Proxied messages (requests and responses) cannot be optimized this way: it makes no difference when we split an skb, since we need to queue it anyway, and tcp_write_xmit() knows how much data can be sent at a particular point in time.

  • MSG_MORE/skb->xmit_more provide optimizations all the way down to the NIC driver layer, so these flags must be implemented and used for sending data from the cache and through TLS. It also seems that the Nagle algorithm makes no sense in Tempesta, since we do our best to form skbs as large as possible, so the algorithm should be switched off (see the corresponding sketch after this list).

  • The RX softirq can perform millions of TDB block traversals to process a single request for a large stored file, which leads to significant packet drops for other clients. Moreover, concurrent requests for large files may amount to an effective DoS. Large files should be transferred in smaller blocks through the work queue, as is done currently (see the sketch after this list);

  • Ingress response pieces must be stored in the cache as soon as they arrive at the interface, while the data is still hot in the CPU caches (RFC 7234, Section 3.1 allows this behavior);

  • skbs freed in net_tx_action() should simply be reused for the next message to send (so NET_TX_SOFTIRQ should be interchanged with NET_RX_SOFTIRQ). Per-CPU skb caches should be introduced (see the sketch after this list). Memory corruption when adding/changing Connection: header in an HTTP response. #353 also requires a different skb linear-data allocation in pages, or simply reallocating and copying linear data on demand. Also see Do not generate static responses #163 (comment)

  • Copies in ss_send() should be eliminated. We copy skbs in ss_send() to be able to resend them in case of a server connection failure and to compare the URI when a response is received. Since we use paged data (and should use headerless skbs - check this!) and get() the pages, all TfwStr pointers from HttpReq into the skb pages remain valid after TCP/IP operations on the skbs. tcp_ack() doesn't seem to mangle skbs, but just updates their TCP control block or removes them from the send queue, so it seems we don't need real skb copies for ACK processing. Thus, with the previous point implemented so that we don't just free skbs, we can (1) simply pass an skb to the SS/TCP/IP layer and (2) hook the point where the split skbs are freed and reacquire them for further possible retransmission (see the sketch after this list).

  • Also, after Enforce the correct order of responses. Handle non-idempotent requests. #660, Tempesta operates on HTTP queues, so functions like __tfw_http_resp_fwd() that send many HTTP messages in one shot should be rewritten in a scatter manner, i.e. calling SS only once for the whole message queue.

  • The instrumentation patch from Server failovering may cause crashes under load or during getting of perfstat #692 (comment) fully replaces the SLAB allocator with our own page allocator - this probably makes sense; at least better memory utilization will be achieved. Also, functions like skb_split_inside_header() can benefit from using skbs with head_frag to manipulate page fragments instead of copying kmalloc()'ed areas.

  • The current ss_skb_unroll_slow() should be optimized somehow, preferably to avoid copies entirely; at the very least, the consumed ingress skb must be reused instead of allocating a new one.

  • [INVALID: doesn't make sense with Optimizer for HTTP messages adjustment #1103 and the updated Properly store and build HTTP headers #634] tfw_cache_build_resp() should keep the assembled TfwHttpResp, probably referenced by TfwCacheEntry, and return a copy of it: copying a list of skbs and an assembled TfwHttpResp is much faster than scanning TDB and assembling all HTTP structures for subsequent adjustments;

  • [DONE in #391 fixes] The RX softirq is also responsible for HTTP message retransmissions (requests to an upstream server or responses to a client), so we must lock 2 sockets. Given that requests and responses can arrive at the same time, 2 CPUs can try to lock the same sockets in different orders, leading to deadlock. Kernel threads using the SS interface also suffer from this issue (a socket can be locked by a softirq and a kernel thread, see Socket deadlock on ss_send() from thread context #337). This is also bad for performance since different CPUs compete for the same sockets. Thus the transmission action must be scheduled onto the proper CPU and performed from the TX softirq;

  • [DONE] Header writing must be optimized. __alloc_skb() does serious memory provisioning, so the space must be used efficiently: in some cases I've seen HTTP headers arrive in a separate page while the skb's page was shared with other skbs. First, new headers should be added by just moving the CRLF instead of inserting new fragments. Next, ss_skb_alloc_pages() shouldn't allocate a headerless skb, but instead fully use the first page along with the allocated skb.
    The second point is done in Tempesta TLS performance optimizations #1037. A stronger optimization is proposed in Optimizer for HTTP messages adjustment #1103.

  • All work is done in softirq, i.e. we spend more time processing each packet under heavy load, but we don't do anything while the system is idle. Small traffic bursts are mitigated by the device queue, and Tempesta frees that queue more slowly than vanilla Linux. However, large traffic bursts lead to TCP buffer overflows on a normal system, and Tempesta behaves better there since it processes TCP segments faster than a usual user-space process. Meanwhile, a less loaded CPU just idles, although it could do useful work such as evicting old entries from various caches. Also, more work in softirq means more data structures are accessed and the memory footprint is larger, so CPU cache starvation is more likely. Thus, asynchronous cleanups (e.g. eviction of old cache entries while the system has enough resources) and other non-crucial logic (e.g. traffic classification) should be done in separate kernel threads running on designated CPU cores to mitigate cache pressure (see the sketch after this list). The asynchronous logic can be made synchronous under particular conditions, e.g. the garbage-collection thread could be woken up when there is not enough memory, i.e. under system stress (VMM exhaustion, NIC packet drops, an attack is detected and we should not pass the packets, etc.). Another example of synchronous garbage collection is evicting old/invalid entries encountered while scanning or updating a data structure (however, scanning and eviction require different lock types);

  • If a server connection is dropped and we have something to send to the server, we can and should send the data together with the final handshake ACK (somewhat related to TCP Fast Open #144).

  • There are several calls to ss_skb_split(), which allocates and copies memory. It seems we can just use offsets and lengths of the data chunks, plus take a reference on the skb, to avoid the ss_skb_split() calls (see the sketch after this list).

  • [DONE: there are no Linux work queues anymore, we use our own TfwRBQueue] There is a fixed number of CPUs, equal to the number of softirqs, so we can fully utilize the CPUs with softirqs. Any queueing, like the current work queues, is a potential source of bufferbloat;

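Regarding the pull-strategy item above: a minimal sketch of how the next portion of a cached response could be pulled from the sk->sk_write_space() hook. The tfw_cache_pull_resp() and tfw_qos_tx_budget() helpers are hypothetical names used only for illustration; a real callback would also need to cooperate with TSQ and the Qdisc room accounting.

```c
#include <net/tcp.h>

/* Hypothetical helpers, not the real Tempesta API. */
size_t tfw_cache_pull_resp(struct sock *sk, size_t budget);
size_t tfw_qos_tx_budget(struct sock *sk);

/*
 * Write-space hook installed as sk->sk_write_space on a client socket:
 * once previously queued data is ACKed and the socket has room again,
 * pull only as much of the cached response as can actually be sent now,
 * bounded by the client's advertised window and its QoS budget (#488).
 * Simplified: window arithmetic ignores corner cases such as wrap-around.
 */
static void tfw_sk_write_space(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	u32 wnd_room = tcp_wnd_end(tp) - tp->snd_nxt;	/* advertised window left */
	int wspace = sk_stream_wspace(sk);		/* socket buffer room */
	size_t budget;

	if (wspace <= 0 || !wnd_room)
		return;

	budget = min_t(size_t, wnd_room, wspace);
	budget = min_t(size_t, budget, tfw_qos_tx_budget(sk));
	if (budget)
		tfw_cache_pull_resp(sk, budget);
}
```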
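For the MSG_MORE/Nagle item: a sketch of disabling Nagle on a Tempesta-managed socket, plus an illustration of MSG_MORE semantics using the generic kernel_sendpage() helper. Tempesta itself works on skbs below this level, so the sending loop only demonstrates the flag; tfw_sock_no_nagle() mirrors what the kernel does for TCP_NODELAY.

```c
#include <net/tcp.h>
#include <linux/net.h>

/*
 * Switch the Nagle algorithm off: Tempesta already forms skbs as large
 * as possible, so delaying small segments only adds latency.  This is
 * the same bit manipulation the kernel performs for TCP_NODELAY.
 */
static void tfw_sock_no_nagle(struct sock *sk)
{
	tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF | TCP_NAGLE_PUSH;
}

/*
 * MSG_MORE illustration: when a response spans several fragments, mark
 * all but the last submission with MSG_MORE so TCP keeps coalescing and
 * only the final call pushes the data towards the NIC.
 */
static int tfw_send_frags(struct socket *sock, struct page **pages,
			  size_t *lens, int n)
{
	int i, r;

	for (i = 0; i < n; i++) {
		int flags = (i < n - 1) ? MSG_MORE : 0;

		r = kernel_sendpage(sock, pages[i], 0, lens[i], flags);
		if (r < 0)
			return r;
	}
	return 0;
}
```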
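For the large-file item: a sketch of transferring a stored object in bounded portions, so a single request cannot monopolize a softirq with a huge TDB scan. The cursor structure and the tfw_cache_send_range()/tfw_cache_tx_requeue() helpers are hypothetical and shown only to illustrate the re-queueing pattern.

```c
#include <net/sock.h>
#include <linux/mm.h>

#define TFW_CACHE_TX_BLOCKS	32	/* illustrative per-pass budget */

/* Hypothetical cursor into a stored cache entry (not the real TDB API). */
struct tfw_cache_tx {
	struct sock	*sk;
	unsigned long	key;	/* cache entry key */
	size_t		off;	/* bytes already sent */
	size_t		total;	/* full entry size */
};

/* Hypothetical helpers, named only for illustration. */
int tfw_cache_send_range(struct sock *sk, unsigned long key,
			 size_t off, size_t len);
void tfw_cache_tx_requeue(struct tfw_cache_tx *tx);

/*
 * One pass of the transmission work item: send a small, fixed number of
 * blocks and re-queue the cursor, instead of scanning millions of TDB
 * blocks in one shot from the RX softirq.
 */
static void tfw_cache_tx_work(struct tfw_cache_tx *tx)
{
	size_t chunk = min_t(size_t, tx->total - tx->off,
			     TFW_CACHE_TX_BLOCKS * PAGE_SIZE);

	if (tfw_cache_send_range(tx->sk, tx->key, tx->off, chunk))
		return;	/* real code would handle the error and failover */
	tx->off += chunk;

	if (tx->off < tx->total)
		tfw_cache_tx_requeue(tx);
}
```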
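For the per-CPU skb cache item: a minimal sketch of parking skbs freed on TX for reuse on the same CPU instead of returning them to the allocator. The reset in tfw_skb_cache_put() is intentionally naive; a real implementation must also deal with frags, dst, header offsets and so on.

```c
#include <linux/skbuff.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

#define TFW_SKB_CACHE_MAX	64	/* illustrative per-CPU bound */

static DEFINE_PER_CPU(struct sk_buff_head, tfw_skb_cache);

/* Called once at start-up; the lists are only touched from local softirq. */
static void tfw_skb_cache_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		skb_queue_head_init(per_cpu_ptr(&tfw_skb_cache, cpu));
}

/* Reuse a parked skb if we have one, otherwise fall back to the allocator. */
static struct sk_buff *tfw_skb_cache_get(void)
{
	struct sk_buff_head *q = this_cpu_ptr(&tfw_skb_cache);
	struct sk_buff *skb = __skb_dequeue(q);

	return skb ?: alloc_skb(0, GFP_ATOMIC);
}

/* Park an skb freed on TX completion instead of kfree_skb()'ing it. */
static void tfw_skb_cache_put(struct sk_buff *skb)
{
	struct sk_buff_head *q = this_cpu_ptr(&tfw_skb_cache);

	if (q->qlen >= TFW_SKB_CACHE_MAX) {
		kfree_skb(skb);
		return;
	}
	/* Naive reset; real code must also drop frags, dst, offsets, etc. */
	skb->len = 0;
	skb->data = skb->head;
	skb_reset_tail_pointer(skb);
	__skb_queue_head(q, skb);
}
```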
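For the ss_send() copy-elimination item: a sketch of holding the paged data of an skb for a possible retransmission by taking per-fragment page references instead of skb_copy(), and releasing them once the message no longer needs to be resent. This illustrates only the reference counting, not the hook for reacquiring split skbs.

```c
#include <linux/skbuff.h>

/*
 * Pin the paged data of an skb handed to the TCP/IP stack so the same
 * pages (and the TfwStr pointers into them) can be reused for a resend
 * on server connection failure, without copying the skb.
 */
static void tfw_skb_hold_frags(struct sk_buff *skb)
{
	int i;

	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
		skb_frag_ref(skb, i);
}

/* Drop the references once the message has been fully acknowledged. */
static void tfw_skb_put_frags(struct sk_buff *skb)
{
	int i;

	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
		skb_frag_unref(skb, i);
}
```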
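For the asynchronous-cleanup item: a sketch of a garbage-collection thread bound to a designated CPU, normally running periodically and woken synchronously under memory pressure or attack mitigation. tfw_cache_evict_some() is a hypothetical bounded eviction pass, and real code would need proper synchronization around the urgency flag.

```c
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/err.h>

static DECLARE_WAIT_QUEUE_HEAD(tfw_gc_wq);
static bool tfw_gc_urgent;	/* set + wake_up(&tfw_gc_wq) under stress */

/* Hypothetical bounded eviction pass over old cache entries. */
static void tfw_cache_evict_some(void);

/*
 * Background garbage collector running on a designated CPU so it does
 * not pollute the caches of the CPUs doing softirq work; it wakes up
 * periodically or immediately when kicked via the wait queue.
 */
static int tfw_gc_thread(void *unused)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible_timeout(tfw_gc_wq,
						 tfw_gc_urgent ||
						 kthread_should_stop(),
						 HZ);
		tfw_gc_urgent = false;
		tfw_cache_evict_some();
	}
	return 0;
}

static struct task_struct *tfw_gc_start(int cpu)
{
	struct task_struct *t;

	t = kthread_create(tfw_gc_thread, NULL, "tfw_gc/%d", cpu);
	if (!IS_ERR(t)) {
		kthread_bind(t, cpu);	/* keep it off the hot softirq CPUs */
		wake_up_process(t);
	}
	return t;
}
```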
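For the ss_skb_split() item: a sketch of describing a data chunk by an (skb, offset, length) triple and pinning the skb with a reference, instead of splitting and copying. The structure and helpers are illustrative only.

```c
#include <linux/skbuff.h>

/*
 * Illustrative alternative to ss_skb_split(): refer to a region of an
 * existing skb by offset and length and hold a reference on the skb,
 * instead of allocating a new skb and copying data into it.
 */
struct tfw_skb_chunk {
	struct sk_buff	*skb;
	unsigned int	off;
	unsigned int	len;
};

static void tfw_skb_chunk_init(struct tfw_skb_chunk *c, struct sk_buff *skb,
			       unsigned int off, unsigned int len)
{
	c->skb = skb_get(skb);	/* take a reference instead of copying */
	c->off = off;
	c->len = len;
}

static void tfw_skb_chunk_release(struct tfw_skb_chunk *c)
{
	kfree_skb(c->skb);	/* drops our reference */
}
```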
@krizhanovsky krizhanovsky self-assigned this Jan 7, 2016
@krizhanovsky krizhanovsky added this to the 0.5.0 Web Server milestone Jan 7, 2016
@krizhanovsky krizhanovsky changed the title TCP synchronous sending Redesign of TCP synchronous sending and data caching Jan 8, 2016
krizhanovsky added a commit that referenced this issue Jan 9, 2016
@krizhanovsky krizhanovsky modified the milestones: 0.6 WebOS, 0.5.0 Web Server Feb 26, 2017
@krizhanovsky krizhanovsky modified the milestones: backlog, 0.6 KTLS Mar 22, 2018
krizhanovsky added a commit that referenced this issue Jul 3, 2018
Some FSM DSL defines are moved to lib/fsm.h, http_limit.c ported to the new API.
Address #391.12: ss_skb_alloc() extended with an argument for head room.
Many cleanups again.
@krizhanovsky krizhanovsky added this to the 0.7 HTTP/2 milestone Oct 29, 2018
krizhanovsky added a commit that referenced this issue Nov 29, 2018
* Encrypt hash for server finished (missing functionality).
* Multiple fixes in handling scatter lists;
* Multiple fixes for IV handling in encryption and decryption code.
* Fix TLS record header and tag allocation in skb (linked with #391.11).
* Many cleanups and nicer debug and error reporting.

Kernel:
* Fix TLS skb type handling to call sk_write_xmit() callback.
* Reserve room for TLS header in skb headroom.
* Reset the TCP connection if we cannot encrypt data on it, instead of retransmitting
  it in plaintext. This leads to a warning similar to #984 - leave as a TODO for now.
@krizhanovsky krizhanovsky modified the milestones: 0.8 TLS 1.3, 1.1 Network performance & scalability, 1.1 TBD (Network performance & scalability), 1.1 TDB (ML, QUIC, DoH etc.) Feb 11, 2019
@krizhanovsky krizhanovsky modified the milestones: 0.9 - TDB, 1.2 TBD Jan 3, 2022
@krizhanovsky krizhanovsky removed their assignment Apr 3, 2024
@krizhanovsky krizhanovsky modified the milestones: 1.1: TBD, 0.9 - LA Apr 4, 2024