Numerous TCP stack warnings when Tempesta is involved #439

Closed
keshonok opened this issue Mar 24, 2016 · 11 comments

@keshonok
Contributor

Tempesta runs on a test bed with two 10G Ethernet adapters. The back end server and the benchmark host are connected to the test bed via 10G links as well. Tempesta sets the SOCK_DBG flag on server sockets. When the benchmark is running, the kernel on the test bed spits out numerous warnings like those below. The warnings come in different combinations.

[  346.598587] Ack 119200080 before 119203464:119212488
[  346.603681] Ack 119200080 before 119203464:119212488
[  346.608784] Ack 119201772 before 119203464:119212488
[  346.636327] out of order segment: rcv_next 688373 seq 689287 - 68982F
[  346.636449] out of order segment: rcv_next 688373 seq 68982F - 689DD7
[  346.636475] out of order segment: rcv_next 688373 seq 689DD7 - 68A37F
[  346.636536] out of order segment: rcv_next 688373 seq 68A37F - 68A927
[  346.636673] out of order segment: rcv_next 688EC3 seq 68A927 - 68AECF
[  346.636686] out of order segment: rcv_next 688EC3 seq 68AECF - 68B477
[  346.636757] out of order segment: rcv_next 688EC3 seq 68B477 - 68BA1F
[  346.870002] ofo requeuing : rcv_next E7E2722D seq E7E2722D - E7E2C15D
[  346.870004] ofo requeuing : rcv_next E7E2C15D seq E7E2C15D - E7E31635
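For context, these messages are printed only when the SOCK_DBG flag is set on a socket: the TCP input path wraps them in the kernel's SOCK_DEBUG() macro. A rough sketch (the exact kernel code may differ slightly between versions):

    /* include/net/sock.h: messages are gated on the per-socket SOCK_DBG flag. */
    #define SOCK_DEBUG(sk, msg...)                                  \
            do {                                                    \
                    if ((sk) && sock_flag((sk), SOCK_DBG))          \
                            printk(KERN_DEBUG msg);                 \
            } while (0)

    /* net/ipv4/tcp_input.c emits the warnings seen above, approximately: */
    SOCK_DEBUG(sk, "Ack %u before %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
    SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
               tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);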
@keshonok keshonok self-assigned this Mar 24, 2016
@keshonok keshonok added this to the 0.5.0 Web Server milestone Mar 24, 2016
@keshonok
Contributor Author

The test bed setup is as follows:

                                     Internet
                                        |
    ---------- 1G ---------------------------------------------- 1G -------------
    |                                   |                                       |
[Benchmark server] <==== 10G ====> [Tempesta] <==== 10G ====> [Back end server (Nginx)]

A separate local network is set up over these 10G links. The network consists of point-to-point connections between the neighbouring servers. The server with Tempesta (and the patched kernel) is in the middle and has the ip_forward kernel parameter set to 1 via sysctl, so that full connectivity between the three hosts can be tested.

A typical benchmark run is as simple as this:

ab -k -n 30000 -c 1000 http://192.168.100.150:80/

Below are the combinations tried and the observed results:

  • The classic, original setup. Start Tempesta on the Tempesta server with the configuration that points to the back end server on the 10G link. Run the benchmark against Tempesta from the benchmark server, also over the 10G link. This test leads to multiple warnings from the TCP/IP stack.
  • Same as above, but run the benchmark over the 1G link. Just a few of these warnings appear, and often none at all.
  • Start Tempesta with a configuration that points to the back end server on the 1G link. Run the benchmark on the 10G link. Just a few of these warnings appear, and often none at all.
  • Start Nginx on the Tempesta server, and run Tempesta with the local Nginx as the back end. The test shows no warnings.
  • Start Tempesta with a configuration that points to the back end server on the 1G link. Run the benchmark on the 1G link as well. The test shows no warnings.
  • The original setup, but run the benchmark from the back end server. Just a few warnings appear, but the test is never completely free of them.

In all of these tests GRO is enabled on all network interfaces, and I can confirm that GROed SKBs are indeed seen in Tempesta.

It certainly appears that the issue is most pronounced when Tempesta transfers SKBs between sockets that are on different 10G interfaces controlled by the same ixgbe driver.

@keshonok
Contributor Author

Tempesta performs two types of SKB modifications.

  • Splitting an SKB in two. In the case of pipelined messages, an HTTP message may end in the middle of an SKB. In that case Tempesta creates an SKB that contains exactly the contents of the message, and the rest is placed in another SKB (a rough sketch of this operation is given at the end of this comment). This operation is safe as long as both SKBs created in the process are semantically valid.
  • Addition, deletion, or modification of HTTP message headers. Data needs to be contiguous in an SKB. For that to happen, pieces of SKB data, as well as internal SKB structures, need to be modified. These are potentially dangerous operations.

In tests, the appearance of kernel warnings from the Linux TCP/IP stack is NOT related to these modifications of SKBs. When these modifications are completely commented out, the warnings still occur at about the same rate.
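For reference, the SKB split described in the first bullet above can be sketched with the generic kernel helpers roughly as follows. The function name split_skb_at() is hypothetical, and Tempesta's real code additionally has to keep checksums, TCP sequence accounting and GSO metadata consistent:

    /* Hypothetical sketch: split @skb in two at offset @len, so @skb keeps
     * exactly the first @len bytes (the end of one HTTP message) and the
     * returned SKB holds the rest of the pipelined data. */
    static struct sk_buff *split_skb_at(struct sk_buff *skb, u32 len)
    {
            /* Tailroom needed for the linear data that moves to the new SKB. */
            unsigned int nsize = len < skb_headlen(skb) ?
                                 skb_headlen(skb) - len : 0;
            struct sk_buff *tail = alloc_skb(nsize, GFP_ATOMIC);

            if (!tail)
                    return NULL;
            /* Move everything after @len (linear data and paged frags)
             * from @skb to @tail. */
            skb_split(skb, tail, len);
            return tail;
    }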

@keshonok
Contributor Author

keshonok commented Apr 1, 2016

Due to the numerous out-of-order segments and out-of-order ACKs, the Linux TCP/IP stack resets these connections from time to time by sending RST. The connections are closed, and the outstanding requests are deleted. These are the requests without responses, and the deletion of these requests is reflected in Tempesta statistics as shown below:

Client messages other errors            : 4459

@keshonok
Contributor Author

keshonok commented Apr 4, 2016

The issue is also reliably reproduced on a test bed that consists of a host and a VM running on the host:

  • The host machine is Ubuntu 14.04 LTS.
  • The VM is QEMU+KVM running CentOS 7, the Tempesta kernel, and Tempesta.
  • The host is connected to the VM via a bridge. The network driver in the VM is virtio-net.
  • The back end server is Apache, which runs on the host.
  • The benchmark is ab, and it also runs on the host.

@milabs
Contributor

milabs commented Apr 29, 2016

I've done some tests with customized nginx and ab. First of all, I've patched nginx (1.9.15) to enable the SO_DEBUG option on its sockets:

diff --git a/src/core/ngx_connection.c b/src/core/ngx_connection.c
index 5a53bac..71823e3 100644
--- a/src/core/ngx_connection.c
+++ b/src/core/ngx_connection.c
@@ -559,6 +559,8 @@ ngx_open_listening_sockets(ngx_cycle_t *cycle)
                 continue;
             }

+            setsockopt(s, SOL_SOCKET, SO_DEBUG, "\x01\x00\x00\x00", sizeof(int));
+
 #if (NGX_HAVE_UNIX_DOMAIN)

             if (ls[i].sockaddr->sa_family == AF_UNIX) {
diff --git a/src/event/ngx_event_accept.c b/src/event/ngx_event_accept.c
index 1c87a34..63be385 100644
--- a/src/event/ngx_event_accept.c
+++ b/src/event/ngx_event_accept.c
@@ -135,6 +135,8 @@ ngx_event_accept(ngx_event_t *ev)
             return;
         }

+        setsockopt(s, SOL_SOCKET, SO_DEBUG, "\x01\x00\x00\x00", sizeof(int));
+
 #if (NGX_STAT_STUB)
         (void) ngx_atomic_fetch_add(ngx_stat_accepted, 1);
 #endif
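
For reference, the hard-coded "\x01\x00\x00\x00" optval in the patch above assumes a little-endian 32-bit int; a more conventional way to enable SO_DEBUG from user space looks like this (a sketch, not part of the actual patch):

    #include <sys/socket.h>

    /* Equivalent of the setsockopt() calls in the patch: pass an int with
     * value 1 instead of a hard-coded byte string. */
    static int enable_so_debug(int fd)
    {
            int on = 1;

            return setsockopt(fd, SOL_SOCKET, SO_DEBUG, &on, sizeof(on));
    }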

Next, I've compiled nginx with the following configure options:

./configure --prefix=/etc/nginx --conf-path=/etc/nginx/nginx.conf --sbin-path=/usr/bin/nginx --pid-path=/run/nginx.pid --lock-path=/run/lock/nginx.lock --user=http --group=http --http-log-path=/var/log/nginx/access.log --error-log-path=stderr --http-client-body-temp-path=/var/lib/nginx/client-body --http-proxy-temp-path=/var/lib/nginx/proxy --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-ipv6 --with-pcre-jit --with-file-aio --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_degradation_module --with-http_flv_module --with-http_geoip_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_realip_module --with-http_secure_link_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-imap --with-imap_ssl_module

And finally, I ran nginx and the benchmark on the same host:

$ cat stress-test.sh
#!/bin/bash
for i in $(eval echo {1..$1}); do
    (sleep 1 && siege -b -c 100 http://127.0.0.1:80/index.html &>/dev/null)&
done
$ stress-test.sh 100

Next, I got the following in the kernel's log:

$ dmesg | head
[394948.151671] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies.  Check SNMP counters.
[394960.366261] Ack 3864231500 after 3864231267:3864231267
[394960.366268] Ack 3864232113 after 3864231267:3864231267
[394960.366383] Ack 3864232113 after 3864231267:3864231267
[394960.558702] Ack 3864232113 after 3864231267:3864231267
[394960.568694] Ack 3864232113 after 3864231267:3864231267
[394960.772045] Ack 3864232113 after 3864231267:3864231267
[394960.855392] Ack 3864232113 after 3864231267:3864231267
[394961.178723] Ack 3864232113 after 3864231267:3864231267
[394961.448740] Ack 3864232113 after 3864231267:3864231267

So, it seems that it is normal for TCP to produce such messages under heavy load with SO_DEBUG (SOCK_DBG) enabled.

@krizhanovsky
Contributor

We need to test whether we have the problem with a standard Linux kernel + Nginx on the real hardware test bed.

@milabs
Contributor

milabs commented Jul 5, 2016

OFO is triggered when the kernel meets a packet in the stream whose SEQ number is greater than the SEQ expected by TCP. With the SOCK_DBG option enabled on the socket, the kernel shows an out of order segment: rcv_next ... seq ... - ... message.
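Roughly, the receive path compares the segment's sequence numbers against tp->rcv_nxt and takes the out-of-order path when the segment starts beyond it. A simplified sketch of the decision (not the exact kernel code):

    /* Simplified sketch of tcp_data_queue() in net/ipv4/tcp_input.c. */
    struct tcp_skb_cb *cb = TCP_SKB_CB(skb);

    if (cb->seq == tp->rcv_nxt) {
            /* In-order segment: queue it and advance rcv_nxt. */
    } else if (!after(cb->end_seq, tp->rcv_nxt)) {
            /* Segment ends at or before rcv_nxt: already received, drop it. */
    } else if (before(cb->seq, tp->rcv_nxt)) {
            /* Partial overlap: trim the duplicate part and queue in order. */
    } else {
            /* Segment starts beyond rcv_nxt: there is a hole in the stream,
             * so it goes to the out-of-order queue and, with SOCK_DBG set,
             * SOCK_DEBUG() prints the "out of order segment" message. */
    }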

At the moment, the following issues were gathered:

  1. QEMU can drop packets while forwarding them from the tap device to the VM networking driver (virtio_net or e1000). There are no counters in QEMU which do such accounting, so the observer cannot see such drops except by dumping the traffic. There were multiple OFOs with virtio_net, and sporadic OFOs with e1000.
  2. I've used a setup of (PC) <--wire--> (Router) <--wifi--> (Laptop) and captured traffic on both sides. Tempesta runs on the laptop; nginx and ab run on the PC. Next, I waited for an OFO message from the kernel on the laptop side. For such a message I grabbed the SEQ number dumped by the kernel and then looked for that number in the captured TCP stream. The observation shows that some packets were transmitted from nginx but never reached Tempesta. So, in my opinion, the reason for the OFO in that case was a packet drop.
  3. I've tried to reproduce OFO without Tempesta, with nginx only. For that reason I've patched nginx to add SOCK_DBG on the sockets used. Unfortunately, I got no OFOs except when cancelling ab. Besides that, I've seen a lot of messages like Ack ... before ...

In my opinion, packet drops cause the OFOs, but that might not be the only reason. More research is needed.

@krizhanovsky
Contributor

Let's try to reproduce the issue on our 3-machine test bed with the standard 3.16 Debian kernel. It makes sense to build wrk with SOCK_DEBUG, as well as both Nginx instances (the upstream and the proxy), and check the results. Meanwhile, we must learn the reason for the reordering, i.e. which mechanism exactly causes packet reordering (probably it's caused by TSQ on the ixgbe adapter itself).

@krizhanovsky
Contributor

The issue doesn't appear with a clean Nginx/Linux setup and appears with Tempesta only. Also, it seems the reason for it is the same as in #645 and #583: SKBs are scheduled to wrong CPUs due to an inappropriate sender_cpu value.

@krizhanovsky
Contributor

krizhanovsky commented Nov 23, 2016

Now that #645 is fixed, I don't see Ack ... before ... on either the upstream server or Tempesta. However, if I try to establish many connections at once, like:

     # ./wrk/wrk -t 16 -c 8192 -d 600s http://192.168.0.1/

, then we get the problem, but only during connection establishment; there are no warnings while the benchmark is running. Also note the work queue overrun messages.

    [ 1772.426404] out of order segment: rcv_next 4202B91B seq 4202BAFF - 4202BBF1
    [ 1772.426431] out of order segment: rcv_next 4202B91B seq 4202BBF1 - 4202BCE3
    [ 1772.426466] out of order segment: rcv_next C4E4D677 seq C4E4D94D - C4E4DA3F
    [ 1772.426948] out of order segment: rcv_next C4E4D677 seq C4E4DA3F - C4E4DB31
    [ 1772.427021] out of order segment: rcv_next 4202B91B seq 4202BCE3 - 4202BDD5
    [ 1772.427070] out of order segment: rcv_next C4E4D677 seq C4E4DB31 - C4E4DC23
    [ 1772.427237] ofo requeuing : rcv_next 6B0DD918 seq 6B0DD550 - 6B0DDDD2
    [ 1772.427284] out of order segment: rcv_next 4202B91B seq 4202BDD5 - 4202BEC7
    [ 1772.432869] out of order segment: rcv_next C4E4D677 seq C4E4DC23 - C4E4DD15
    [ 1772.432906] out of order segment: rcv_next 4202B91B seq 4202BEC7 - 4202BFB9
    [ 1772.435134] out of order segment: rcv_next 2AA3821C seq 2AA386D6 - 2AA387C8
    [ 1772.435178] out of order segment: rcv_next 2AA3821C seq 2AA387C8 - 2AA388BA
    [ 1772.435546] out of order segment: rcv_next 5F1C9A74 seq 5F1C9F2E - 5F1CA020
    [ 1772.435578] out of order segment: rcv_next 5F1C9A74 seq 5F1CA020 - 5F1CA112
    [ 1772.436034] out of order segment: rcv_next 5F1C9A74 seq 5F1CA112 - 5F1CA204
    [ 1772.440072] out of order segment: rcv_next C4E4D677 seq C4E4DD15 - C4E4DE07
    [ 1772.440128] out of order segment: rcv_next 4202B91B seq 4202BFB9 - 4202C0AB
    [ 1772.442265] out of order segment: rcv_next 4202B91B seq 4202C0AB - 4202C19D
    [ 1772.442321] out of order segment: rcv_next C4E4D677 seq C4E4DE07 - C4E4DEF9
    [ 1772.442360] out of order segment: rcv_next 5F1C9A74 seq 5F1CA204 - 5F1CA2F6
    [ 1772.442376] out of order segment: rcv_next C4E4D677 seq C4E4DEF9 - C4E4DFEB
    [ 1772.442398] out of order segment: rcv_next 4202B91B seq 4202C19D - 4202C28F
    [ 1772.442400] out of order segment: rcv_next 5F1C9A74 seq 5F1CA2F6 - 5F1CA3E8
    [ 1772.443227] out of order segment: rcv_next 2AA3821C seq 2AA388BA - 2AA389AC
    [ 1772.443309] out of order segment: rcv_next 2AA3821C seq 2AA389AC - 2AA38A9E
    [ 1772.443773] out of order segment: rcv_next 2AA3821C seq 2AA38A9E - 2AA38B90
    [ 1772.457991] ofo requeuing : rcv_next 2AA386D6 seq 2AA38400 - 2AA38B90
    [ 1772.464722] ofo requeuing : rcv_next 4202BDD5 seq 4202BAFF - 4202C28F
    [ 1772.465659] ofo requeuing : rcv_next C4E4DB31 seq C4E4D85B - C4E4DFEB
    [ 1772.474546] ofo requeuing : rcv_next 5F1C9F2E seq 5F1C9C58 - 5F1CA3E8
    [ 1772.612360] net_ratelimit: 1462 callbacks suppressed
    [ 1772.612361] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.612362] [sync_sockets] Warning: Cannot schedule socket ffff880e08a7b480 for transmission
    [ 1772.612601] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.612602] [sync_sockets] Warning: Cannot schedule socket ffff880e12ef8000 for transmission
    [ 1772.612889] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.612890] [sync_sockets] Warning: Cannot schedule socket ffff880d6b956040 for transmission
    [ 1772.613331] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.613333] [sync_sockets] Warning: Cannot schedule socket ffff880e08a788c0 for transmission
    [ 1772.613637] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.613639] [sync_sockets] Warning: Cannot schedule socket ffff880e08a7b480 for transmission

Nginx running on Linux 4.1.27-tfw doesn't exhibit the behaviour.

krizhanovsky added a commit that referenced this issue Nov 27, 2016
krizhanovsky added a commit that referenced this issue Nov 27, 2016
minimizing probability of livelock between new connections establishing
and processing packets in already established connections.
@krizhanovsky
Contributor

krizhanovsky commented Nov 28, 2016

This is just a performance issue, linked with #488. Many (8192 in my tests) concurrently establishing connections get stuck at bh_lock_sock_nested(sk) in tcp_v4_rcv(). The connections differ only in TCP source port, so the network adapter does a poor job of distributing packets among CPU cores. Moreover, we mistakenly enabled software RPS on the interface, causing performance overhead. The issue can be easily reproduced by

    # ab -c 8192 -n 100000 -r http://192.168.0.1/

tcpdump shows that the OFO messages were caused by a dropped first upstream response, so all the following upstream transmissions produce OFO messages. While the softirqs faced the spinlock contention, they were unable to process the net device backlog queue, and TCP segments (the upstream responses) were dropped by the NIC.
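For context, the contention point mentioned above is roughly this fragment of tcp_v4_rcv() in net/ipv4/tcp_ipv4.c (simplified; the exact code differs between kernel versions):

    /* Simplified sketch of socket locking in the TCP receive softirq. */
    bh_lock_sock_nested(sk);
    if (!sock_owned_by_user(sk)) {
            /* Process the segment in softirq context; with Tempesta the
             * application-layer (HTTP) work is also done in softirq, which
             * increases the time spent here under contention. */
            tcp_v4_do_rcv(sk, skb);
    } else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf + sk->sk_sndbuf)) {
            /* Backlog is full: the segment is dropped. */
    }
    bh_unlock_sock(sk);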

Nginx doesn't exhibit the problem because it doesn't process the application layer in softirq. Doing more work in softirq obviously emphasises the problem. We also brutally drop TCP segments in case of performance issues, which worsens the problem. Thus, the problem is multi-layer. Firstly, I'll add performance optimizations by proper system tuning in tempesta.sh. Next, #488 must be solved. After that, the connection lock should be optimized; this is #653.

It is worth mentioning that a kernel compiled with SOCK_DBG enabled for all sockets, and without Tempesta FW loaded, also spits OFO messages. So I've moved the setting of the option under #ifdef DEBUG.
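A hypothetical sketch of that change (the actual Tempesta code and build flag may differ):

    /* Enable SOCK_DBG on a Tempesta server socket only in debug builds,
     * so production kernels don't emit these diagnostics. */
    #ifdef DEBUG
            sock_set_flag(sk, SOCK_DBG);
    #endif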

krizhanovsky added a commit that referenced this issue Nov 29, 2016