Numerous TCP stack warnings when Tempesta is involved #439

Closed
keshonok opened this issue Mar 24, 2016 · 11 comments

@keshonok
Contributor

Tempesta runs on a test bed with two 10G Ethernet adapters. The back end server and the benchmark host are connected to the test bed via 10G links as well. Tempesta sets the SOCK_DBG flag on server sockets. When the benchmark is running, the kernel on the test bed spits out numerous warnings like those below. The warnings come in different combinations.

[  346.598587] Ack 119200080 before 119203464:119212488
[  346.603681] Ack 119200080 before 119203464:119212488
[  346.608784] Ack 119201772 before 119203464:119212488
[  346.636327] out of order segment: rcv_next 688373 seq 689287 - 68982F
[  346.636449] out of order segment: rcv_next 688373 seq 68982F - 689DD7
[  346.636475] out of order segment: rcv_next 688373 seq 689DD7 - 68A37F
[  346.636536] out of order segment: rcv_next 688373 seq 68A37F - 68A927
[  346.636673] out of order segment: rcv_next 688EC3 seq 68A927 - 68AECF
[  346.636686] out of order segment: rcv_next 688EC3 seq 68AECF - 68B477
[  346.636757] out of order segment: rcv_next 688EC3 seq 68B477 - 68BA1F
[  346.870002] ofo requeuing : rcv_next E7E2722D seq E7E2722D - E7E2C15D
[  346.870004] ofo requeuing : rcv_next E7E2C15D seq E7E2C15D - E7E31635
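For context, these messages are printed only when the SOCK_DBG flag is set on a socket: the TCP input path wraps them in the kernel's SOCK_DEBUG() macro. A rough sketch (the exact kernel code may differ slightly between versions):

    /* include/net/sock.h: messages are gated on the per-socket SOCK_DBG flag. */
    #define SOCK_DEBUG(sk, msg...)                                  \
            do {                                                    \
                    if ((sk) && sock_flag((sk), SOCK_DBG))          \
                            printk(KERN_DEBUG msg);                 \
            } while (0)

    /* net/ipv4/tcp_input.c emits the warnings seen above, approximately: */
    SOCK_DEBUG(sk, "Ack %u before %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
    SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
               tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);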
@keshonok keshonok self-assigned this Mar 24, 2016
@keshonok keshonok added this to the 0.5.0 Web Server milestone Mar 24, 2016
@keshonok
Contributor Author

The test bed setup is as follows:

                                     Internet
                                        |
    ---------- 1G ---------------------------------------------- 1G -------------
    |                                   |                                       |
[Benchmark server] <==== 10G ====> [Tempesta] <==== 10G ====> [Back end server (Nginx)]

A separate local network is set up over these 10G links. The network consists of point-to-point connections between the neighbouring servers. The server with Tempesta (and the patched kernel) is in the middle and has the ip_forward kernel parameter set to 1 via sysctl, so that full connectivity between the three hosts can be tested.

A typical benchmark run is as simple as this:

ab -k -n 30000 -c 1000 http://192.168.100.150:80/

Below are the combinations tried and the observed results:

  • The classic, original setup. Start Tempesta on the Tempesta server with the configuration that points to the back end server on the 10G link. Run the benchmark against Tempesta from the benchmark server, also over the 10G link. This test leads to multiple warnings from the TCP/IP stack.
  • Same as above, but run the benchmark over the 1G link. Just a few of these warnings appear, and often none at all.
  • Start Tempesta with a configuration that points to the back end server on the 1G link. Run the benchmark on the 10G link. Just a few of these warnings appear, and often none at all.
  • Start Nginx on the Tempesta server, and run Tempesta with the local Nginx as the back end. The test shows no warnings.
  • Start Tempesta with a configuration that points to the back end server on the 1G link. Run the benchmark on the 1G link as well. The test shows no warnings.
  • The original setup, but run the benchmark from the back end server. Just a few warnings appear, but the test is never completely free of them.

In all of these tests GRO is enabled on all network interfaces, and I can confirm that GROed SKBs are indeed seen in Tempesta.

It certainly appears that the issue is most pronounced when Tempesta transfers SKBs between sockets that are on different 10G interfaces controlled by the same ixgbe driver.

@keshonok
Contributor Author

Tempesta performs two types of SKB modifications.

  • Splitting an SKB in two. In the case of pipelined messages, an HTTP message may end in the middle of an SKB. In that case Tempesta creates an SKB that contains exactly the contents of the message, and the rest is placed in another SKB (a rough sketch of this operation is given at the end of this comment). This operation is safe as long as both SKBs created in the process are semantically valid.
  • Addition, deletion, or modification of HTTP message headers. Data needs to be contiguous in an SKB. For that to happen, pieces of SKB data, as well as internal SKB structures, need to be modified. These are potentially dangerous operations.

In tests, the appearance of kernel warnings from the Linux TCP/IP stack is NOT related to these modifications of SKBs. When these modifications are completely commented out, the warnings still occur at about the same rate.
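For reference, the SKB split described in the first bullet above can be sketched with the generic kernel helpers roughly as follows. The function name split_skb_at() is hypothetical, and Tempesta's real code additionally has to keep checksums, TCP sequence accounting and GSO metadata consistent:

    /* Hypothetical sketch: split @skb in two at offset @len, so @skb keeps
     * exactly the first @len bytes (the end of one HTTP message) and the
     * returned SKB holds the rest of the pipelined data. */
    static struct sk_buff *split_skb_at(struct sk_buff *skb, u32 len)
    {
            /* Tailroom needed for the linear data that moves to the new SKB. */
            unsigned int nsize = len < skb_headlen(skb) ?
                                 skb_headlen(skb) - len : 0;
            struct sk_buff *tail = alloc_skb(nsize, GFP_ATOMIC);

            if (!tail)
                    return NULL;
            /* Move everything after @len (linear data and paged frags)
             * from @skb to @tail. */
            skb_split(skb, tail, len);
            return tail;
    }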

@keshonok
Contributor Author

keshonok commented Apr 1, 2016

Due to the numerous out-of-order segments and out-of-order ACKs, the Linux TCP/IP stack resets these connections from time to time by sending RST. The connections are closed, and the outstanding requests are deleted. These are the requests without responses, and the deletion of these requests is reflected in Tempesta statistics as shown below:

Client messages other errors            : 4459

@keshonok
Contributor Author

keshonok commented Apr 4, 2016

The issue is also reliably reproduced on a test bed that consists of a host and a VM running on the host:

  • The host machine is Ubuntu 14.04 LTS.
  • The VM is QEMU+KVM running CentOS 7, the Tempesta kernel, and Tempesta.
  • The host is connected to the VM via a bridge. The network driver in the VM is virtio-net.
  • The back end server is Apache, which runs on the host.
  • The benchmark is ab, and it also runs on the host.

@milabs
Contributor

milabs commented Apr 29, 2016

I've done some tests with customized nginx and ab. First of all, I've patched nginx (1.9.15) to enable the SO_DEBUG option on its sockets:

diff --git a/src/core/ngx_connection.c b/src/core/ngx_connection.c
index 5a53bac..71823e3 100644
--- a/src/core/ngx_connection.c
+++ b/src/core/ngx_connection.c
@@ -559,6 +559,8 @@ ngx_open_listening_sockets(ngx_cycle_t *cycle)
                 continue;
             }

+            setsockopt(s, SOL_SOCKET, SO_DEBUG, "\x01\x00\x00\x00", sizeof(int));
+
 #if (NGX_HAVE_UNIX_DOMAIN)

             if (ls[i].sockaddr->sa_family == AF_UNIX) {
diff --git a/src/event/ngx_event_accept.c b/src/event/ngx_event_accept.c
index 1c87a34..63be385 100644
--- a/src/event/ngx_event_accept.c
+++ b/src/event/ngx_event_accept.c
@@ -135,6 +135,8 @@ ngx_event_accept(ngx_event_t *ev)
             return;
         }

+        setsockopt(s, SOL_SOCKET, SO_DEBUG, "\x01\x00\x00\x00", sizeof(int));
+
 #if (NGX_STAT_STUB)
         (void) ngx_atomic_fetch_add(ngx_stat_accepted, 1);
 #endif
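
For reference, the hard-coded "\x01\x00\x00\x00" optval in the patch above assumes a little-endian 32-bit int; a more conventional way to enable SO_DEBUG from user space looks like this (a sketch, not part of the actual patch):

    #include <sys/socket.h>

    /* Equivalent of the setsockopt() calls in the patch: pass an int with
     * value 1 instead of a hard-coded byte string. */
    static int enable_so_debug(int fd)
    {
            int on = 1;

            return setsockopt(fd, SOL_SOCKET, SO_DEBUG, &on, sizeof(on));
    }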

Next, I've compiled nginx with the following configure options:

./configure --prefix=/etc/nginx --conf-path=/etc/nginx/nginx.conf --sbin-path=/usr/bin/nginx --pid-path=/run/nginx.pid --lock-path=/run/lock/nginx.lock --user=http --group=http --http-log-path=/var/log/nginx/access.log --error-log-path=stderr --http-client-body-temp-path=/var/lib/nginx/client-body --http-proxy-temp-path=/var/lib/nginx/proxy --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-ipv6 --with-pcre-jit --with-file-aio --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_degradation_module --with-http_flv_module --with-http_geoip_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_realip_module --with-http_secure_link_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-imap --with-imap_ssl_module

And finally, I ran nginx and the benchmark on the same host:

$ cat stress-test.sh
#!/bin/bash
for i in $(eval echo {1..$1}); do
    (sleep 1 && siege -b -c 100 http://127.0.0.1:80/index.html &>/dev/null)&
done
$ stress-test.sh 100

Next, I got the following in the kernel's log:

$ dmesg | head
[394948.151671] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies.  Check SNMP counters.
[394960.366261] Ack 3864231500 after 3864231267:3864231267
[394960.366268] Ack 3864232113 after 3864231267:3864231267
[394960.366383] Ack 3864232113 after 3864231267:3864231267
[394960.558702] Ack 3864232113 after 3864231267:3864231267
[394960.568694] Ack 3864232113 after 3864231267:3864231267
[394960.772045] Ack 3864232113 after 3864231267:3864231267
[394960.855392] Ack 3864232113 after 3864231267:3864231267
[394961.178723] Ack 3864232113 after 3864231267:3864231267
[394961.448740] Ack 3864232113 after 3864231267:3864231267

So, it seems that it is normal for TCP to produce such messages under heavy load with SO_DEBUG (SOCK_DBG) enabled.

@krizhanovsky
Contributor

We need to test whether we have the problem with a standard Linux kernel + Nginx on the real hardware test bed.

@milabs
Contributor

milabs commented Jul 5, 2016

OFO is triggered when the kernel meets a packet in the stream whose SEQ number is greater than the SEQ expected by TCP. With the SOCK_DBG option enabled on the socket, the kernel shows an out of order segment: rcv_next ... seq ... - ... message.
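Roughly, the receive path compares the segment's sequence numbers against tp->rcv_nxt and takes the out-of-order path when the segment starts beyond it. A simplified sketch of the decision (not the exact kernel code):

    /* Simplified sketch of tcp_data_queue() in net/ipv4/tcp_input.c. */
    struct tcp_skb_cb *cb = TCP_SKB_CB(skb);

    if (cb->seq == tp->rcv_nxt) {
            /* In-order segment: queue it and advance rcv_nxt. */
    } else if (!after(cb->end_seq, tp->rcv_nxt)) {
            /* Segment ends at or before rcv_nxt: already received, drop it. */
    } else if (before(cb->seq, tp->rcv_nxt)) {
            /* Partial overlap: trim the duplicate part and queue in order. */
    } else {
            /* Segment starts beyond rcv_nxt: there is a hole in the stream,
             * so it goes to the out-of-order queue and, with SOCK_DBG set,
             * SOCK_DEBUG() prints the "out of order segment" message. */
    }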

At the moment, the following issues were gathered:

  1. QEMU can drop packets while forwarding them from the tap device to the VM networking driver (virtio_net or e1000). There are no counters in QEMU which do such accounting, so the observer cannot see such drops except by dumping the traffic. There were multiple OFOs with virtio_net, and sporadic OFOs with e1000.
  2. I've used a setup of (PC) <--wire--> (Router) <--wifi--> (Laptop) and captured traffic on both sides. Tempesta runs on the laptop; nginx and ab run on the PC. Next, I waited for an OFO message from the kernel on the laptop side. For such a message I grabbed the SEQ number dumped by the kernel and then looked for that number in the captured TCP stream. The observation shows that some packets were transmitted from nginx but never reached Tempesta. So, in my opinion, the reason for the OFO in that case was a packet drop.
  3. I've tried to reproduce OFO without Tempesta, with nginx only. For that reason I've patched nginx to add SOCK_DBG on the sockets used. Unfortunately, I got no OFOs except when cancelling ab. Besides that, I've seen a lot of messages like Ack ... before ...

In my opinion, packet drops cause the OFOs, but that might not be the only reason. More research is needed.

@krizhanovsky
Contributor

Let's try to reproduce the issue on our 3-machine test bed with the standard 3.16 Debian kernel. It makes sense to build wrk with SOCK_DEBUG, as well as both Nginx instances (the upstream and the proxy), and check the results. Meanwhile, we must learn the reason for the reordering, i.e. which mechanism exactly causes packet reordering (probably it's caused by TSQ on the ixgbe adapter itself).

@krizhanovsky
Contributor

The issue doesn't appear with a clean Nginx/Linux setup and appears with Tempesta only. Also, it seems the reason for it is the same as in #645 and #583: SKBs are scheduled to wrong CPUs due to an inappropriate sender_cpu value.

@krizhanovsky
Contributor

krizhanovsky commented Nov 23, 2016

Now that #645 is fixed, I don't see Ack ... before ... on either the upstream server or Tempesta. However, if I try to establish many connections at once, like:

     # ./wrk/wrk -t 16 -c 8192 -d 600s http://192.168.0.1/

, then we get the problem, but only during connection establishment; there are no warnings while the benchmark is running. Also note the work queue overrun messages.

    [ 1772.426404] out of order segment: rcv_next 4202B91B seq 4202BAFF - 4202BBF1
    [ 1772.426431] out of order segment: rcv_next 4202B91B seq 4202BBF1 - 4202BCE3
    [ 1772.426466] out of order segment: rcv_next C4E4D677 seq C4E4D94D - C4E4DA3F
    [ 1772.426948] out of order segment: rcv_next C4E4D677 seq C4E4DA3F - C4E4DB31
    [ 1772.427021] out of order segment: rcv_next 4202B91B seq 4202BCE3 - 4202BDD5
    [ 1772.427070] out of order segment: rcv_next C4E4D677 seq C4E4DB31 - C4E4DC23
    [ 1772.427237] ofo requeuing : rcv_next 6B0DD918 seq 6B0DD550 - 6B0DDDD2
    [ 1772.427284] out of order segment: rcv_next 4202B91B seq 4202BDD5 - 4202BEC7
    [ 1772.432869] out of order segment: rcv_next C4E4D677 seq C4E4DC23 - C4E4DD15
    [ 1772.432906] out of order segment: rcv_next 4202B91B seq 4202BEC7 - 4202BFB9
    [ 1772.435134] out of order segment: rcv_next 2AA3821C seq 2AA386D6 - 2AA387C8
    [ 1772.435178] out of order segment: rcv_next 2AA3821C seq 2AA387C8 - 2AA388BA
    [ 1772.435546] out of order segment: rcv_next 5F1C9A74 seq 5F1C9F2E - 5F1CA020
    [ 1772.435578] out of order segment: rcv_next 5F1C9A74 seq 5F1CA020 - 5F1CA112
    [ 1772.436034] out of order segment: rcv_next 5F1C9A74 seq 5F1CA112 - 5F1CA204
    [ 1772.440072] out of order segment: rcv_next C4E4D677 seq C4E4DD15 - C4E4DE07
    [ 1772.440128] out of order segment: rcv_next 4202B91B seq 4202BFB9 - 4202C0AB
    [ 1772.442265] out of order segment: rcv_next 4202B91B seq 4202C0AB - 4202C19D
    [ 1772.442321] out of order segment: rcv_next C4E4D677 seq C4E4DE07 - C4E4DEF9
    [ 1772.442360] out of order segment: rcv_next 5F1C9A74 seq 5F1CA204 - 5F1CA2F6
    [ 1772.442376] out of order segment: rcv_next C4E4D677 seq C4E4DEF9 - C4E4DFEB
    [ 1772.442398] out of order segment: rcv_next 4202B91B seq 4202C19D - 4202C28F
    [ 1772.442400] out of order segment: rcv_next 5F1C9A74 seq 5F1CA2F6 - 5F1CA3E8
    [ 1772.443227] out of order segment: rcv_next 2AA3821C seq 2AA388BA - 2AA389AC
    [ 1772.443309] out of order segment: rcv_next 2AA3821C seq 2AA389AC - 2AA38A9E
    [ 1772.443773] out of order segment: rcv_next 2AA3821C seq 2AA38A9E - 2AA38B90
    [ 1772.457991] ofo requeuing : rcv_next 2AA386D6 seq 2AA38400 - 2AA38B90
    [ 1772.464722] ofo requeuing : rcv_next 4202BDD5 seq 4202BAFF - 4202C28F
    [ 1772.465659] ofo requeuing : rcv_next C4E4DB31 seq C4E4D85B - C4E4DFEB
    [ 1772.474546] ofo requeuing : rcv_next 5F1C9F2E seq 5F1C9C58 - 5F1CA3E8
    [ 1772.612360] net_ratelimit: 1462 callbacks suppressed
    [ 1772.612361] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.612362] [sync_sockets] Warning: Cannot schedule socket ffff880e08a7b480 for transmission
    [ 1772.612601] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.612602] [sync_sockets] Warning: Cannot schedule socket ffff880e12ef8000 for transmission
    [ 1772.612889] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.612890] [sync_sockets] Warning: Cannot schedule socket ffff880d6b956040 for transmission
    [ 1772.613331] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.613333] [sync_sockets] Warning: Cannot schedule socket ffff880e08a788c0 for transmission
    [ 1772.613637] [tempesta] Warning: Socket work queue overrun: [0]
    [ 1772.613639] [sync_sockets] Warning: Cannot schedule socket ffff880e08a7b480 for transmission

Nginx running on Linux 4.1.27-tfw doesn't exhibit the behaviour.

krizhanovsky added a commit that referenced this issue Nov 27, 2016
krizhanovsky added a commit that referenced this issue Nov 27, 2016
minimizing probability of livelock between new connections establishing
and processing packets in already established connections.
@krizhanovsky
Contributor

krizhanovsky commented Nov 28, 2016

This is just a performance issue, linked with #488. Many (8192 in my tests) concurrently establishing connections get stuck at bh_lock_sock_nested(sk) in tcp_v4_rcv(). The connections differ only in TCP source port, so the network adapter does a poor job of distributing packets among CPU cores. Moreover, we mistakenly enabled software RPS on the interface, causing performance overhead. The issue can be easily reproduced by

    # ab -c 8192 -n 100000 -r http://192.168.0.1/

tcpdump shows that the OFO messages were caused by a dropped first upstream response, so all the following upstream transmissions produce OFO messages. While the softirqs faced the spinlock contention, they were unable to process the net device backlog queue, and TCP segments (the upstream responses) were dropped by the NIC.
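For context, the contention point mentioned above is roughly this fragment of tcp_v4_rcv() in net/ipv4/tcp_ipv4.c (simplified; the exact code differs between kernel versions):

    /* Simplified sketch of socket locking in the TCP receive softirq. */
    bh_lock_sock_nested(sk);
    if (!sock_owned_by_user(sk)) {
            /* Process the segment in softirq context; with Tempesta the
             * application-layer (HTTP) work is also done in softirq, which
             * increases the time spent here under contention. */
            tcp_v4_do_rcv(sk, skb);
    } else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf + sk->sk_sndbuf)) {
            /* Backlog is full: the segment is dropped. */
    }
    bh_unlock_sock(sk);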

Nginx doesn't exhibit the problem because it doesn't process the application layer in softirq. Doing more work in softirq obviously emphasises the problem. We also brutally drop TCP segments in case of performance issues, which worsens the problem. Thus, the problem is multi-layer. Firstly, I'll add performance optimizations by proper system tuning in tempesta.sh. Next, #488 must be solved. After that, the connection lock should be optimized; this is #653.

It is worth mentioning that a kernel compiled with SOCK_DBG enabled for all sockets, and without Tempesta FW loaded, also spits OFO messages. So I've moved the setting of the option under #ifdef DEBUG.
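A hypothetical sketch of that change (the actual Tempesta code and build flag may differ):

    /* Enable SOCK_DBG on a Tempesta server socket only in debug builds,
     * so production kernels don't emit these diagnostics. */
    #ifdef DEBUG
            sock_set_flag(sk, SOCK_DBG);
    #endif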

krizhanovsky added a commit that referenced this issue Nov 29, 2016