
TLS connection error oversize record received #164

Open
beyhan opened this issue Jul 16, 2019 · 9 comments

beyhan commented Jul 16, 2019

We are from the BOSH team, and we use the nats.rb client with TLS enabled. We are seeing the NATS server error "oversized record received".
We observed that the client sends plaintext messages to the server before TLS is started, which results in the oversized-record error.
We only hit the issue under a heavy load of NATS messages from the BOSH Director. We use nats.rb version 0.9.2, but we tried 0.11.0 as well and the issue was still reproducible.
Here is an example of what we see with tshark. The Director is the client and uses nats.rb version 0.9.2. nats is the server, version 1.3.0, which is forked and can be found here.

In the case where the connection fails with a TLS handshake error:

director -> nats (SYN)
director <- nats (SYN, ACK)
director -> nats (ACK)
director -> nats (PSH)
payload: CONNECT {"verbose":false,"pedantic":false,"lang":"ruby","version":"0.9.2","protocol":1,"ssl_required":true,"tls_required":true}
director <- nats (ACK)
director <- nats (PSH, ACK)
payload:
INFO {"server_id":"xA5UfriyGj2oHzAaQH42Ji","version":"1.3.0-bosh.2 2018-10-25T15:57:27Z a79a47d","proto":1,"go":"go1.11","host":"0.0.0.0","port":4222,"auth_required":true,"tls_required":true,"tls_verify":true,"max_payload":1048576,"client_id":4172}

Normally, when the communication is successful:

director -> nats (SYN)
director <- nats (SYN, ACK)
director -> nats (ACK)
director <- nats (PSH)
payload:
INFO {"server_id":"xA5UfriyGj2oHzAaQH42Ji","version":"1.3.0-bosh.2 2018-10-25T15:57:27Z a79a47d","proto":1,"go":"go1.11","host":"0.0.0.0","port":4222,"auth_required":true,"tls_required":true,"tls_verify":true,"max_payload":1048576,"client_id":4173} 
director -> nats (ACK)
director -> nats (PSH, ACK)
payload: <binary TLS ClientHello bytes, garbled in the capture dump>

The difference is that in the error case the NATS CONNECT message is sent before the INFO message and before TLS is started. The main issue for us is that we lose messages that never reach the NATS server in this scenario. Is there a guarantee that the NATS client will deliver messages to the NATS server, or can some be lost during reconnect errors (e.g. in EM/SSL buffers)?
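The traces above imply the correct client-side ordering: read and parse the server's INFO line first, upgrade the socket to TLS (since tls_required is set), and only then send CONNECT over the encrypted stream. A minimal pure-Ruby sketch of that ordering check (parse_info is an illustrative helper written for this issue, not part of the nats.rb API):

```ruby
require "json"

# Reject anything that is not the server's INFO line. Sending CONNECT
# in plaintext before INFO/TLS is what makes the server's TLS layer
# read it as an oversized record.
def parse_info(line)
  raise "expected INFO, got: #{line[0, 20].inspect}" unless line.start_with?("INFO ")
  JSON.parse(line[5..-1].strip)
end

info = parse_info(%(INFO {"tls_required":true,"max_payload":1048576}\r\n))
# Only after parsing INFO (and seeing tls_required) should the client
# start the TLS handshake and then send CONNECT over the encrypted stream.
```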

beyhan (Author) commented Jul 16, 2019

Additional context for this issue in this Slack thread: https://natsio.slack.com/archives/C069GSYFP/p1563202735013400

videlov commented Jul 17, 2019

Further analysis with the latest version of nats.rb (0.11.0) showed a one-time occurrence of a new error message:

TLS handshake error: tls: first record does not look like a TLS handshake

After a few more attempts it was not seen again. However, we still lose NATS messages that did not reach the server. On the server side we observed that, around the time a few messages were lost, the client opened two connections but used only one, which was not the case with version 0.9.2. Additionally, we see SSL connection errors in EM, but not on the NATS server side. I will provide @wallyqs with logs from our setup. We tried quite a lot of different options with no success.

Please advise: what is the best way to proceed from here?

derekcollison (Member) commented:

Providing the logs would be great. We are actively trying to determine the root cause of the issue.

wallyqs (Member) commented Jul 17, 2019

Thanks for sharing the info and the logs.

We are currently still trying to reproduce this issue. The ordering of the CONNECT and INFO protocol messages should be resolved by the upgrade to v0.11.0, so I am still not sure which condition the client is running into right now.

One issue we can see from the logs is that there are Stale Client Connection errors from the clients, so I wonder whether there is some blocking behavior happening in the client: to get into that state, the client connection must have been lingering idle for about 6 minutes without responding to the ping/pong protocol:

[7] 2019/07/17 08:07:26.472828 [DBG] 10.0.3.11:32982 - cid:995 - Client Ping Timer
[7] 2019/07/17 08:07:26.472851 [DBG] 10.0.3.11:32982 - cid:995 - Stale Client Connection - Closing
[7] 2019/07/17 08:07:26.473023 [DBG] 10.0.3.11:32982 - cid:995 - Client connection closed

It is possible to lower the ping/pong interval in the configuration file in order to detect these clients sooner (in nats-server v1.3.0 this should work):

ping_interval: 15 # seconds
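For completeness, a sketch of the related server options; ping_max is the nats-server 1.x option bounding how many unanswered PINGs are tolerated before the stale check fires (verify both names against your forked server version):

```
ping_interval: 15   # seconds between server-initiated PINGs
ping_max: 2         # unanswered PINGs allowed before "Stale Client Connection"
```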

Besides NATS.on_error, there are a couple of other callbacks that may be worth implementing to gather more info about the condition of the client; sharing these below:

NATS.on_error do |e|
  puts "Error: #{e}"
end

NATS.on_close do
  puts "Connection closed"
end

NATS.on_disconnect do |e|
  puts "Disconnected! Reason: #{e}"
  puts "Pending data: #{NATS.pending_data_size}"
end

NATS.on_reconnect do |nats|
  puts "Reconnected! Pending data: #{nats.pending_data_size}"
end

Note that while the client is reconnecting it will buffer some of the commands (pending_data_size). They will be flushed once the client reaches the connected state, but some of the commands may be lost when it disconnects.
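To make that buffering behavior concrete, here is a hedged pure-Ruby model of it (names are illustrative, not the nats.rb internals): commands published while disconnected accumulate in a pending buffer up to a size limit and are flushed on reconnect; anything beyond the limit is dropped, which matches the message loss observed in this issue.

```ruby
# Toy model of reconnect-time buffering; PendingBuffer is hypothetical.
class PendingBuffer
  attr_reader :dropped

  def initialize(max_bytes)
    @max_bytes = max_bytes
    @pending = []
    @dropped = []
  end

  # Buffer the command if it fits, otherwise drop it (message loss).
  def publish(cmd)
    if pending_data_size + cmd.bytesize > @max_bytes
      @dropped << cmd
    else
      @pending << cmd
    end
  end

  # Analogous to the pending_data_size the client exposes.
  def pending_data_size
    @pending.sum(&:bytesize)
  end

  # On reconnect, everything buffered is written out and the buffer clears.
  def flush!
    sent, @pending = @pending, []
    sent
  end
end
```

In this model only the commands that fit the buffer survive a disconnect, which is why inspecting pending_data_size in on_disconnect/on_reconnect (as suggested above) helps quantify potential loss.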

beyhan (Author) commented Jul 18, 2019

@wallyqs thanks for the prompt response! Is there a way to find out which commands get lost during disconnects?

videlov commented Jul 19, 2019

@wallyqs I am not sure the additional points you refer to will bring much more valuable info, but we will include them in the further analysis. As I already mentioned in a private message, we traced the messages to just before the OpenSSL EM helper classes encrypt them and send them over the connection. We added some debug traces to the EventMachine C++ code: the reply_to of a NATS message is monitored and dumped right before SSL_write, here: https://github.com/eventmachine/eventmachine/blob/d761d2a6916b854b5d62aeaa4775be359e9cf6ce/ext/ssl.cpp#L512

Please advise how to continue investigating the issue. What kind of tracing/debug output would you need to fully understand and reproduce it?

derekcollison (Member) commented:

I think we should all jump on a Google Hangout or a Zoom call as a next step. WDYT?

videlov commented Jul 24, 2019

Yes, that makes sense. Let's discuss in Slack how and when.

derekcollison (Member) commented:

Sounds good.

@wallyqs wallyqs added the v0 label Jun 8, 2023
@bruth bruth removed the v0 label Aug 18, 2023