Redesign of TCP synchronous sending and data caching #391

krizhanovsky opened this issue Jan 7, 2016 · 0 comments

krizhanovsky commented Jan 7, 2016

The current implementation of ss_send() and the cache logic is unsatisfactory. The following changes should be made; many of them are inspired by the Sandstorm web server. The first points are referenced by #534.

Queuing

Initially this issue complained that too many queues are involved in proxying an HTTP request to a server socket (linked with #687):

  1. TfwCliConn->seq_queue is required to order server responses to pipelined requests;
  2. TfwSrvConn->fwd_queue and TfwSrvConn->nip_queue implement the server_queue_size and server_forward_timeout limits and are responsible for the per-connection and per-server failover logic;
  3. the ring-buffer work queue is used for lock-free proxying among sockets living on different CPUs;
  4. the TCP send queue is used to quickly send the next portion of data on ACK and to keep unacknowledged data.

Per-server TfwSrvConn queues live longer than per-socket TCP send queues, and the TCP send queues are required to provision ready-to-send data, so none of these queues can be eliminated. Moreover, TCP windows vary dynamically, so we don't know how much data a socket will be able to send at the moment we schedule data for transmission (it still has to be queued in the TCP control block). This also means we have to accept that TCP may adjust an skb on the transmission path, since it works with dynamically changing TCP connection parameters.
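
A minimal sketch of the queue layering described above; the member names seq_queue, fwd_queue and nip_queue come from this issue, while the exact layout of the real TfwCliConn/TfwSrvConn structures is richer and may differ:

```c
#include <linux/list.h>
#include <linux/spinlock.h>

/* Simplified view only; the real Tempesta structures carry many more fields. */
typedef struct {
	spinlock_t		seq_qlock;
	struct list_head	seq_queue;	/* orders responses to pipelined requests */
} TfwCliConn;

typedef struct {
	spinlock_t		fwd_qlock;
	struct list_head	fwd_queue;	/* bounded by server_queue_size */
	struct list_head	nip_queue;	/* non-idempotent requests kept for failover */
	unsigned int		qsize;		/* checked against server_queue_size */
} TfwSrvConn;

/*
 * Below these per-connection queues sit the per-CPU ring-buffer work queue
 * (lock-free cross-CPU hand-off) and, finally, the kernel's per-socket TCP
 * send queue (sk->sk_write_queue), which keeps data until it is ACKed.
 */
```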

TODO

  • tfw_cache_build_resp() can produce a very large response. Many of the skbs are simply dropped by the Qdisc; the others will be sent by the TSQ tasklet. So there is no sense in doing large TDB scans; rather, the scans should be done in small portions whenever there is room in the Qdisc. Actually, the currently available room in proxy_buffering from HTTP message buffering and streaming #498 can be used for now as the amount of memory fetched from the cache. This also reduces bufferbloat. The response transfer rate is also subject to HTTP QoS (HTTP QoS for asymmetric DDoS mitigation #488): if a client has a high QoS mark, it receives the data faster, i.e. we should send min(client advertised TCP window size, QoS result). The response should be assembled with knowledge of the current TCB state of the client connection to avoid additional splitting at the TCP/IP layer, typically in a callback from tcp_write_xmit() or tcp_transmit_skb(), where we know the exact "right" skb size (i.e. we can trick the TCP/IP stack by queueing an empty skb with skb->len set to the maximum value and fill the skb from the cache later via a callback, once we know how much data we are going to send), and/or on the sk->sk_write_space() hook. In general we should implement a pull strategy: when data is ACKed, we pull the next portion of data from the cache (see the sketches after this list). This step reduces skb transformations for cached responses. Proxied messages (requests and responses) cannot be optimized this way: it makes no difference when we split an skb, since we need to queue it anyway, and tcp_write_xmit() knows how much data can be sent at a particular point in time.

  • MSG_MORE/skb->xmit_more provide optimizations all the way down to the NIC driver layer, so these flags must be implemented and used for sending data from the cache and through TLS. It also seems that the Nagle algorithm makes no sense in Tempesta, since we do our best to form skbs as large as possible, so the algorithm should be switched off (see the corresponding sketch after this list).

  • The RX softirq can perform millions of TDB block traversals to process a single request for a large stored file, which leads to significant packet drops for other clients. Moreover, concurrent requests for large files may amount to an effective DoS. Large files should be transferred in smaller blocks through the work queue, as is done currently (see the sketch after this list);

  • Ingress response pieces must be stored in the cache as soon as they arrive at the interface, while the data is still hot in the CPU caches (RFC 7234, Section 3.1 allows this behavior);

  • skbs freed in net_tx_action() should simply be reused for the next message to send (so NET_TX_SOFTIRQ should be interchanged with NET_RX_SOFTIRQ). Per-CPU skb caches should be introduced (see the sketch after this list). Memory corruption when adding/changing Connection: header in an HTTP response. #353 also requires a different skb linear-data allocation in pages, or simply reallocating and copying linear data on demand. Also see Do not generate static responses #163 (comment)

  • Copies in ss_send() should be eliminated. We copy skbs in ss_send() to be able to resend them in case of a server connection failure and to compare the URI when a response is received. Since we use paged data (and should use headerless skbs - check this!) and get() the pages, all TfwStr pointers from HttpReq into the skb pages remain valid after TCP/IP operations on the skbs. tcp_ack() doesn't seem to mangle skbs, but just updates their TCP control block or removes them from the send queue, so it seems we don't need real skb copies for ACK processing. Thus, with the previous point implemented so that we don't just free skbs, we can (1) simply pass an skb to the SS/TCP/IP layer and (2) hook the point where the split skbs are freed and reacquire them for further possible retransmission (see the sketch after this list).

  • Also, after Enforce the correct order of responses. Handle non-idempotent requests. #660, Tempesta operates on HTTP queues, so functions like __tfw_http_resp_fwd() that send many HTTP messages in one shot should be rewritten in a scatter manner, i.e. calling SS only once for the whole message queue.

  • The instrumentation patch from Server failovering may cause crashes under load or during getting of perfstat #692 (comment) fully replaces the SLAB allocator with our own page allocator - this probably makes sense; at least better memory utilization will be achieved. Also, functions like skb_split_inside_header() can benefit from using skbs with head_frag to manipulate page fragments instead of copying kmalloc()'ed areas.

  • The current ss_skb_unroll_slow() should be optimized somehow, preferably to avoid copies entirely; at the very least, the consumed ingress skb must be reused instead of allocating a new one.

  • [INVALID: doesn't make sense with Optimizer for HTTP messages adjustment #1103 and the updated Properly store and build HTTP headers #634] tfw_cache_build_resp() should keep the assembled TfwHttpResp, probably referenced by TfwCacheEntry, and return a copy of it: copying a list of skbs and an assembled TfwHttpResp is much faster than scanning TDB and assembling all HTTP structures for subsequent adjustments;

  • [DONE in #391 fixes] The RX softirq is also responsible for HTTP message retransmissions (requests to an upstream server or responses to a client), so we must lock 2 sockets. Given that requests and responses can arrive at the same time, 2 CPUs can try to lock the same sockets in different orders, leading to deadlock. Kernel threads using the SS interface also suffer from this issue (a socket can be locked by a softirq and a kernel thread, see Socket deadlock on ss_send() from thread context #337). This is also bad for performance since different CPUs compete for the same sockets. Thus the transmission action must be scheduled onto the proper CPU and performed from the TX softirq;

  • [DONE] Header writing must be optimized. __alloc_skb() does serious memory provisioning, so the space must be used efficiently: in some cases I've seen HTTP headers arrive in a separate page while the skb's page was shared with other skbs. First, new headers should be added by just moving the CRLF instead of inserting new fragments. Next, ss_skb_alloc_pages() shouldn't allocate a headerless skb, but instead fully use the first page along with the allocated skb.
    The second point is done in Tempesta TLS performance optimizations #1037. A stronger optimization is proposed in Optimizer for HTTP messages adjustment #1103.

  • All work is done in softirq, i.e. we spend more time processing each packet under heavy load, but we don't do anything while the system is idle. Small traffic bursts are mitigated by the device queue, and Tempesta frees that queue more slowly than vanilla Linux. However, large traffic bursts lead to TCP buffer overflows on a normal system, and Tempesta behaves better there since it processes TCP segments faster than a usual user-space process. Meanwhile, a less loaded CPU just idles, although it could do useful work such as evicting old entries from various caches. Also, more work in softirq means more data structures are accessed and the memory footprint is larger, so CPU cache starvation is more likely. Thus, asynchronous cleanups (e.g. eviction of old cache entries while the system has enough resources) and other non-crucial logic (e.g. traffic classification) should be done in separate kernel threads running on designated CPU cores to mitigate cache pressure (see the sketch after this list). The asynchronous logic can be made synchronous under particular conditions, e.g. the garbage-collection thread could be woken up when there is not enough memory, i.e. under system stress (VMM exhaustion, NIC packet drops, an attack is detected and we should not pass the packets, etc.). Another example of synchronous garbage collection is evicting old/invalid entries encountered while scanning or updating a data structure (however, scanning and eviction require different lock types);

  • If a server connection is dropped and we have something to send to the server, we can and should send the data together with the final handshake ACK (somewhat related to TCP Fast Open #144).

  • There are several calls to ss_skb_split(), which allocates and copies memory. It seems we can just use offsets and lengths of the data chunks, plus take a reference on the skb, to avoid the ss_skb_split() calls (see the sketch after this list).

  • [DONE: there are no Linux work queues anymore, we use our own TfwRBQueue] There is a fixed number of CPUs, equal to the number of softirqs, so we can fully utilize the CPUs with softirqs. Any queueing, like the current work queues, is a potential source of bufferbloat;

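Regarding the pull-strategy item above: a minimal sketch of how the next portion of a cached response could be pulled from the sk->sk_write_space() hook. The tfw_cache_pull_resp() and tfw_qos_tx_budget() helpers are hypothetical names used only for illustration; a real callback would also need to cooperate with TSQ and the Qdisc room accounting.

```c
#include <net/tcp.h>

/* Hypothetical helpers, not the real Tempesta API. */
size_t tfw_cache_pull_resp(struct sock *sk, size_t budget);
size_t tfw_qos_tx_budget(struct sock *sk);

/*
 * Write-space hook installed as sk->sk_write_space on a client socket:
 * once previously queued data is ACKed and the socket has room again,
 * pull only as much of the cached response as can actually be sent now,
 * bounded by the client's advertised window and its QoS budget (#488).
 * Simplified: window arithmetic ignores corner cases such as wrap-around.
 */
static void tfw_sk_write_space(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	u32 wnd_room = tcp_wnd_end(tp) - tp->snd_nxt;	/* advertised window left */
	int wspace = sk_stream_wspace(sk);		/* socket buffer room */
	size_t budget;

	if (wspace <= 0 || !wnd_room)
		return;

	budget = min_t(size_t, wnd_room, wspace);
	budget = min_t(size_t, budget, tfw_qos_tx_budget(sk));
	if (budget)
		tfw_cache_pull_resp(sk, budget);
}
```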
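For the MSG_MORE/Nagle item: a sketch of disabling Nagle on a Tempesta-managed socket, plus an illustration of MSG_MORE semantics using the generic kernel_sendpage() helper. Tempesta itself works on skbs below this level, so the sending loop only demonstrates the flag; tfw_sock_no_nagle() mirrors what the kernel does for TCP_NODELAY.

```c
#include <net/tcp.h>
#include <linux/net.h>

/*
 * Switch the Nagle algorithm off: Tempesta already forms skbs as large
 * as possible, so delaying small segments only adds latency.  This is
 * the same bit manipulation the kernel performs for TCP_NODELAY.
 */
static void tfw_sock_no_nagle(struct sock *sk)
{
	tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF | TCP_NAGLE_PUSH;
}

/*
 * MSG_MORE illustration: when a response spans several fragments, mark
 * all but the last submission with MSG_MORE so TCP keeps coalescing and
 * only the final call pushes the data towards the NIC.
 */
static int tfw_send_frags(struct socket *sock, struct page **pages,
			  size_t *lens, int n)
{
	int i, r;

	for (i = 0; i < n; i++) {
		int flags = (i < n - 1) ? MSG_MORE : 0;

		r = kernel_sendpage(sock, pages[i], 0, lens[i], flags);
		if (r < 0)
			return r;
	}
	return 0;
}
```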
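For the large-file item: a sketch of transferring a stored object in bounded portions, so a single request cannot monopolize a softirq with a huge TDB scan. The cursor structure and the tfw_cache_send_range()/tfw_cache_tx_requeue() helpers are hypothetical and shown only to illustrate the re-queueing pattern.

```c
#include <net/sock.h>
#include <linux/mm.h>

#define TFW_CACHE_TX_BLOCKS	32	/* illustrative per-pass budget */

/* Hypothetical cursor into a stored cache entry (not the real TDB API). */
struct tfw_cache_tx {
	struct sock	*sk;
	unsigned long	key;	/* cache entry key */
	size_t		off;	/* bytes already sent */
	size_t		total;	/* full entry size */
};

/* Hypothetical helpers, named only for illustration. */
int tfw_cache_send_range(struct sock *sk, unsigned long key,
			 size_t off, size_t len);
void tfw_cache_tx_requeue(struct tfw_cache_tx *tx);

/*
 * One pass of the transmission work item: send a small, fixed number of
 * blocks and re-queue the cursor, instead of scanning millions of TDB
 * blocks in one shot from the RX softirq.
 */
static void tfw_cache_tx_work(struct tfw_cache_tx *tx)
{
	size_t chunk = min_t(size_t, tx->total - tx->off,
			     TFW_CACHE_TX_BLOCKS * PAGE_SIZE);

	if (tfw_cache_send_range(tx->sk, tx->key, tx->off, chunk))
		return;	/* real code would handle the error and failover */
	tx->off += chunk;

	if (tx->off < tx->total)
		tfw_cache_tx_requeue(tx);
}
```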
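For the per-CPU skb cache item: a minimal sketch of parking skbs freed on TX for reuse on the same CPU instead of returning them to the allocator. The reset in tfw_skb_cache_put() is intentionally naive; a real implementation must also deal with frags, dst, header offsets and so on.

```c
#include <linux/skbuff.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

#define TFW_SKB_CACHE_MAX	64	/* illustrative per-CPU bound */

static DEFINE_PER_CPU(struct sk_buff_head, tfw_skb_cache);

/* Called once at start-up; the lists are only touched from local softirq. */
static void tfw_skb_cache_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		skb_queue_head_init(per_cpu_ptr(&tfw_skb_cache, cpu));
}

/* Reuse a parked skb if we have one, otherwise fall back to the allocator. */
static struct sk_buff *tfw_skb_cache_get(void)
{
	struct sk_buff_head *q = this_cpu_ptr(&tfw_skb_cache);
	struct sk_buff *skb = __skb_dequeue(q);

	return skb ?: alloc_skb(0, GFP_ATOMIC);
}

/* Park an skb freed on TX completion instead of kfree_skb()'ing it. */
static void tfw_skb_cache_put(struct sk_buff *skb)
{
	struct sk_buff_head *q = this_cpu_ptr(&tfw_skb_cache);

	if (q->qlen >= TFW_SKB_CACHE_MAX) {
		kfree_skb(skb);
		return;
	}
	/* Naive reset; real code must also drop frags, dst, offsets, etc. */
	skb->len = 0;
	skb->data = skb->head;
	skb_reset_tail_pointer(skb);
	__skb_queue_head(q, skb);
}
```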
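For the ss_send() copy-elimination item: a sketch of holding the paged data of an skb for a possible retransmission by taking per-fragment page references instead of skb_copy(), and releasing them once the message no longer needs to be resent. This illustrates only the reference counting, not the hook for reacquiring split skbs.

```c
#include <linux/skbuff.h>

/*
 * Pin the paged data of an skb handed to the TCP/IP stack so the same
 * pages (and the TfwStr pointers into them) can be reused for a resend
 * on server connection failure, without copying the skb.
 */
static void tfw_skb_hold_frags(struct sk_buff *skb)
{
	int i;

	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
		skb_frag_ref(skb, i);
}

/* Drop the references once the message has been fully acknowledged. */
static void tfw_skb_put_frags(struct sk_buff *skb)
{
	int i;

	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
		skb_frag_unref(skb, i);
}
```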
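For the asynchronous-cleanup item: a sketch of a garbage-collection thread bound to a designated CPU, normally running periodically and woken synchronously under memory pressure or attack mitigation. tfw_cache_evict_some() is a hypothetical bounded eviction pass, and real code would need proper synchronization around the urgency flag.

```c
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/err.h>

static DECLARE_WAIT_QUEUE_HEAD(tfw_gc_wq);
static bool tfw_gc_urgent;	/* set + wake_up(&tfw_gc_wq) under stress */

/* Hypothetical bounded eviction pass over old cache entries. */
static void tfw_cache_evict_some(void);

/*
 * Background garbage collector running on a designated CPU so it does
 * not pollute the caches of the CPUs doing softirq work; it wakes up
 * periodically or immediately when kicked via the wait queue.
 */
static int tfw_gc_thread(void *unused)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible_timeout(tfw_gc_wq,
						 tfw_gc_urgent ||
						 kthread_should_stop(),
						 HZ);
		tfw_gc_urgent = false;
		tfw_cache_evict_some();
	}
	return 0;
}

static struct task_struct *tfw_gc_start(int cpu)
{
	struct task_struct *t;

	t = kthread_create(tfw_gc_thread, NULL, "tfw_gc/%d", cpu);
	if (!IS_ERR(t)) {
		kthread_bind(t, cpu);	/* keep it off the hot softirq CPUs */
		wake_up_process(t);
	}
	return t;
}
```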
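For the ss_skb_split() item: a sketch of describing a data chunk by an (skb, offset, length) triple and pinning the skb with a reference, instead of splitting and copying. The structure and helpers are illustrative only.

```c
#include <linux/skbuff.h>

/*
 * Illustrative alternative to ss_skb_split(): refer to a region of an
 * existing skb by offset and length and hold a reference on the skb,
 * instead of allocating a new skb and copying data into it.
 */
struct tfw_skb_chunk {
	struct sk_buff	*skb;
	unsigned int	off;
	unsigned int	len;
};

static void tfw_skb_chunk_init(struct tfw_skb_chunk *c, struct sk_buff *skb,
			       unsigned int off, unsigned int len)
{
	c->skb = skb_get(skb);	/* take a reference instead of copying */
	c->off = off;
	c->len = len;
}

static void tfw_skb_chunk_release(struct tfw_skb_chunk *c)
{
	kfree_skb(c->skb);	/* drops our reference */
}
```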
@krizhanovsky krizhanovsky self-assigned this Jan 7, 2016
@krizhanovsky krizhanovsky added this to the 0.5.0 Web Server milestone Jan 7, 2016
@krizhanovsky krizhanovsky changed the title TCP synchronous sending Redesign of TCP synchronous sending and data caching Jan 8, 2016
krizhanovsky added a commit that referenced this issue Jan 9, 2016
@krizhanovsky krizhanovsky modified the milestones: 0.6 WebOS, 0.5.0 Web Server Feb 26, 2017
@krizhanovsky krizhanovsky modified the milestones: backlog, 0.6 KTLS Mar 22, 2018
krizhanovsky added a commit that referenced this issue Jul 3, 2018
Some FSM DSL defines are moved to lib/fsm.h, http_limit.c ported to the new API.
Address #391.12: ss_skb_alloc() extended with an argument for head room.
Many cleanups again.
@krizhanovsky krizhanovsky added this to the 0.7 HTTP/2 milestone Oct 29, 2018
krizhanovsky added a commit that referenced this issue Nov 29, 2018
* Encrypt hash for server finished (missing functionality).
* Multiple fixes in handling scatter lists;
* Multiple fixes for IV handling in encryption and decryption code.
* Fix TLS record header and tag allocation in skb (linked with #391.11).
* Many cleanups and nicer debug and error reporting.

Kernel:
* Fix TLS skb type handling to call sk_write_xmit() callback.
* Reserve room for TLS header in skb headroom.
* Reset the TCP connection if we cannot encrypt data on it, instead of retransmitting
  it in plaintext. This leads to a warning similar to #984 - leave as a TODO for now.
@krizhanovsky krizhanovsky modified the milestones: 0.8 TLS 1.3, 1.1 Network performance & scalability, 1.1 TBD (Network performance & scalability), 1.1 TDB (ML, QUIC, DoH etc.) Feb 11, 2019
@krizhanovsky krizhanovsky modified the milestones: 0.9 - TDB, 1.2 TBD Jan 3, 2022
@krizhanovsky krizhanovsky removed their assignment Apr 3, 2024
@krizhanovsky krizhanovsky modified the milestones: 1.1: TBD, 0.9 - LA Apr 4, 2024