Connection process stopped handling requests #334

Open
IgorKoshnarov opened this issue Mar 24, 2024 · 11 comments

@IgorKoshnarov

IgorKoshnarov commented Mar 24, 2024

I use gun 2.0.0 as an HTTP client in my application and reuse already opened connections to handle multiple requests.

If for some request
gun:await(Connection, StreamRef, Timeout)
returns {error, Reason}, I close the connection and stop using it, except in the {error, timeout} case, where I consider the connection still valid and keep using it.
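
A minimal sketch of this pattern, for reference (Conn, Timeout and the pool helpers keep_in_pool/1 and drop_from_pool/1 are placeholders, not actual gun API):

%% Issue the request, wait for the response, and decide whether the
%% connection stays in the pool based on the await result.
StreamRef = gun:post(Conn, "/v1/topic/get",
                     [{<<"content-type">>, <<"application/json">>}], <<"{}">>),
case gun:await(Conn, StreamRef, Timeout) of
    {response, nofin, _Status, _Headers} ->
        {ok, _Body} = gun:await_body(Conn, StreamRef, Timeout),
        keep_in_pool(Conn);
    {response, fin, _Status, _Headers} ->
        keep_in_pool(Conn);
    {error, timeout} ->
        %% Treated as transient: the connection stays in the pool.
        keep_in_pool(Conn);
    {error, _Reason} ->
        %% Any other error: close and drop the connection.
        gun:close(Conn),
        drop_from_pool(Conn)
end.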

This works fine most of the time, but recently my application stopped handling requests, and it turned out that the gun connection processes had lists of unhandled messages in their mailboxes, like this:

rp(process_info(<0.31175.48>, messages)).
{messages,[{'$gen_cast',{request,<0.11099.53>,
                                 #Ref<0.3234177338.3490709505.118665>,<<"POST">>,
                                 <<"/v1/topic/get">>,
                                  [{<<"content-type">>,<<"application/json">>}],
                                 <<"{}">>,infinity}},
           {'$gen_cast',{request,<0.12338.53>,
                                 #Ref<0.3234177338.3095658499.71403>,<<"POST">>,
                                 <<"/v1/topic/get">>,
                                  [{<<"content-type">>,<<"application/json">>}],
                                 <<"{}">>,infinity}}]}

And every new request just adds a new message to that mailbox.

I've looked into gun's code and I see that it should handle such messages without a problem in the 'connected' state; here is the relevant clause from gun:

connected(cast, {request, ReplyTo, StreamRef, Method, Path, Headers, Body, InitialFlow},
		State=#state{origin_host=Host, origin_port=Port,
			protocol=Protocol, protocol_state=ProtoState, cookie_store=CookieStore0,
			event_handler=EvHandler, event_handler_state=EvHandlerState0}) ->
	{Commands, CookieStore, EvHandlerState} = Protocol:request(ProtoState,
		dereference_stream_ref(StreamRef, State), ReplyTo,
		Method, Host, Port, Path, Headers, Body,
		InitialFlow, CookieStore0, EvHandler, EvHandlerState0),
	commands(Commands, State#state{cookie_store=CookieStore,
		event_handler_state=EvHandlerState});

I tried to see the status of that connection process (with a big timeout, just in case):
sys:get_status(<0.31175.48>, 600000).

But it didn't work; I just found a new system message in the gun connection process's mailbox:
{system,{<0.22180.130>,#Ref<0.3234177338.4042522626.87030>}, get_status}

So I can't find out what state this gen_statem server is in, and I don't know what caused it to switch into this state from the 'connected' state.
The process_info looks like this:

process_info(<0.31175.48>).          
[{current_function,{gen,do_call,4}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,8},
 {links,[<0.1197.0>]},
 {dictionary,[{'$initial_call',{gun,init,1}},
              {'$ancestors',[gun_conns_sup,gun_sup,<0.1195.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1194.0>},
 {total_heap_size,8370},
 {heap_size,4185},
 {stack_size,63},
 {reductions,1402917},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,6}]},
 {suspending,[]}]

The protocol used by the connection is HTTP/2, and TLS is enabled.

I can't figure out why that could happen. I'm sure that this process was in the 'connected' state because I stored its pid in my pool of connections when the connection was established.

@essen
Member

essen commented Mar 24, 2024

If this {current_function,{gen,do_call,4}} never changes, you could try tracing the process with dbg (or recon, which can disable the trace automatically after a few seconds).

If you see nothing, then I suppose the process really is stuck in gen:do_call/4. In that case it's likely to be a bug in OTP. There have been a bug or two fixed in patch releases fairly recently that you might be running into.
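
For example, a sketch with recon (assuming recon is included in the release, and using the pid from the report above):

%% Trace calls made by the stuck connection process; recon_trace removes the
%% trace patterns by itself after 10 trace messages, so it is reasonably safe
%% to run on a production node.
Pid = pid(0, 31175, 48).
recon_trace:calls({gen, do_call, '_'}, 10, [{pid, Pid}]).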

@IgorKoshnarov
Author

Thanks, yes, {current_function, {gen, do_call, 4}} is always the same.
I'll try dbg and report here if something pops up.

@IgorKoshnarov
Author

I couldn't use dbg somehow. I tried erl -remsh into my app and opening a trace port, but got an error:

dbg:tracer(port, dbg:trace_port(ip, 50000)).
** exception error: undefined function dbg:trace_port/2

The Erlang version on the server is 23.

Also I did
rp(process_info(<0.3238.0>, backtrace)).

and got this:
{backtrace,<<"Program counter: 0x00007f3542681490 (gen:do_call/4 + 280)\ny(0) []\ny(1) infinity\ny(2) []\ny(3) #Ref<0.3234177338.3271557122.27248>\ny(4) <0.21289.50>\n\n0x00007f34ab0c1630 Return addr 0x00007f34c72fb808 (gen_statem:call_dirty/4 + 104)\ny(0) infinity\ny(1) {start,infinity}\ny(2) <0.21289.50>\ny(3) Catch 0x00007f34c72fb840 (gen_statem:call_dirty/4 + 160)\n\n0x00007f34ab0c1658 Return addr 0x00007f3542a7e1c8 (ssl_gen_statem:call/2 + 72)\ny(0) Catch 0x00007f3542a7e1d8 (ssl_gen_statem:call/2 + 88)\n\n0x00007f34ab0c1668 Return addr 0x00007f3542a751d0 (ssl_gen_statem:handshake/2 + 152)\ny(0) []\ny(1) {sslsocket,{gen_tcp,#Port<0.5296215>,tls_connection,undefined},[<0.21289.50>,<0.22818.50>]}\n\n0x00007f34ab0c1680 Return addr 0x00007f3542a94790 (tls_gen_connection:start_fsm/8 + 1240)\ny(0) []\ny(1) []\ny(2) []\ny(3) []\ny(4) []\ny(5) []\ny(6) []\ny(7) []\ny(8) []\ny(9) []\ny(10) Catch 0x00007f3542a947a0 (tls_gen_connection:start_fsm/8 + 1256)\n\n0x00007f34ab0c16e0 Return addr 0x00007f3542a74fc8 (ssl_gen_statem:connect/8 + 104)\ny(0) Catch 0x00007f3542a74fd8 (ssl_gen_statem:connect/8 + 120)\n\n0x00007f34ab0c16f0 Return addr 0x00007f34c5189890 (gun:normal_tls_handshake/4 + 384)\ny(0) []\ny(1) gun_default_event_h\ny(2) []\ny(3) [http2,http]\ny(4) #{socket=>#Port<0.5296215>,timeout=>infinity,tls_opts=>[{alpn_advertised_protocols,[<<\"http/1.1\">>,<<\"h2\">>]},{client_preferred_next_protocols,{client,[<<\"http/1.1\">>,<<\"h2\">>],<<\"http/1.1\">>}},{server_name_indication,\"management.me\"},{alpn_advertised_protocols,[<<\"http/1.1\">>,<<\"h2\">>]},{client_preferred_next_protocols,{client,[<<\"http/1.1\">>,<<\"h2\">>],<<\"http/1.1\">>}}]}\ny(5) {state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}\ny(6) undefined\n\n0x00007f34ab0c1730 Return addr 0x00007f34c5188518 (gun:initial_tls_handshake/3 + 416)\ny(0) []\ny(1) []\ny(2) 2\ny(3) []\ny(4) []\n\n0x00007f34ab0c1760 Return addr 0x00007f34c72fea10 (gen_statem:loop_state_callback/11 + 872)\ny(0) []\ny(1) []\ny(2) true\ny(3) false\ny(4) []\ny(5) false\ny(6) []\ny(7) {initial_tls_handshake,{state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}}\ny(8) [{internal,{retries,2,#Port<0.5296215>}}]\ny(9) {state,{initial_tls_handshake,{state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}},[],#{},false}\ny(10) []\ny(11) {params,state_functions,false,<0.1197.0>,[gun],<0.3238.0>,infinity}\ny(12) Catch 0x00007f34c72fea80 (gen_statem:loop_state_callback/11 + 984)\n\n0x00007f34ab0c17d0 Return addr 0x00007f34c7c83c00 (proc_lib:init_p/5 + 264)\ny(0) []\ny(1) []\ny(2) []\ny(3) Catch 0x00007f34c7c83c10 (proc_lib:init_p/5 + 280)\n\n0x00007f34ab0c17f8 Return addr 0x000000000092aa38 (<terminate process normally>)\n">>}

But this didn't give me any clue.
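
(As a side note, the current_stacktrace item of process_info returns the same stack as {Module, Function, Arity, Location} tuples, which is easier to read than the raw backtrace binary; a sketch using the pid above:)

%% Does not require the target process to handle any message; the VM returns
%% the stack directly as MFA tuples with file/line information.
rp(erlang:process_info(pid(0, 3238, 0), current_stacktrace)).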

@essen
Member

essen commented Mar 25, 2024

Just do

%% Start a local tracer, trace local calls in the relevant modules (the x
%% built-in match spec also reports return values and exceptions), then
%% enable call tracing on the stuck process (build the pid with pid/3):
dbg:tracer().
dbg:tpl(gen, x).
dbg:tpl(gen_statem, x).
dbg:tpl(gun, x).
dbg:p(pid(0, 31175, 48), c).

@IgorKoshnarov
Author

IgorKoshnarov commented Mar 25, 2024

It failed on the first command:

dbg:tracer().
** exception error: undefined function dbg:tracer/0

It seems that the dbg module is not present on the server.

code:is_loaded(dbg).
false

tried to load it:

code:ensure_loaded(dbg).
{error,embedded}

@essen
Member

essen commented Mar 25, 2024

Ah well, you are probably running a release that doesn't include runtime_tools. Make a note to always include runtime_tools in the future to allow debugging things. I would also recommend adding https://github.com/zhongwencool/observer_cli (which also depends on recon).
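
For illustration, a sketch for a rebar3/relx release (the application name and version here are made up):

%% rebar.config: ship runtime_tools (dbg) and recon with the release so the
%% tracing tools are available on the running node.
{deps, [recon]}.

{relx, [
    {release, {my_app, "0.1.0"}, [my_app, runtime_tools, recon]}
]}.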

In any case if you are on 23 I would recommend upgrading. What you are seeing looks like a bug in OTP, which may have been fixed since. At the very least make sure you are on the latest 23 patch release (OTP-23.3.4.20).

Then, if you run into this again, with runtime_tools we would have an easier time confirming that the process is indeed stuck, and a report could be sent to the OTP team.

@IgorKoshnarov
Author

IgorKoshnarov commented Mar 25, 2024

Ok, thanks, currently I see this version:
cat /usr/lib/erlang/releases/RELEASES

%% coding: utf-8
[{release,"Erlang/OTP","23","11.2.2.18",
  [{kernel,"7.3.1.7","/usr/lib/erlang/lib/kernel-7.3.1.7"},
   {stdlib,"3.14.2.3","/usr/lib/erlang/lib/stdlib-3.14.2.3"},
   {sasl,"4.0.2","/usr/lib/erlang/lib/sasl-4.0.2"}],
  permanent}].

@essen
Member

essen commented Mar 25, 2024

That's OTP-23.3.4.18.

OTP-23.3.4.19 fixes an issue that caused processes using max_heap_size (which you might have set) to become zombies, as well as issues around process signal queues.

OTP-23.3.4.20 doesn't seem to have a relevant fix, but it doesn't hurt to use it.

@IgorKoshnarov
Author

I'll put here the reply I got from Fred Hebert on Erlang Forums, based on my backtrace:

https://erlangforums.com/t/gun-gen-statem-process-queue-messages-stoped-being-handled/3413/5

Your FSM is currently stuck in a call to another FSM. In this case it appears to be related to TLS connection handling, during the handshake. The pid of that FSM is <0.21289.50>.

Since that call has an infinity timeout, your own FSM is gonna be stuck until the other one returns; if it’s hung (because it’s waiting on your FSM or on a broken external connection), then you’re also going to be stuck forever.

Your options are likely to be one of:

- put a timeout on the call (and possibly handle the errors)
- move the call to be asynchronous to a sub-process so even if it hangs it doesn’t block the FSM
- investigate and resolve the issue with the TLS handshake to make sure it can’t happen (may not always be possible if the causes are outside of your control)

@essen
Member

essen commented Mar 25, 2024

Yes I suppose you could configure tls_handshake_timeout to something other than infinity to avoid the issue of the Gun connection process getting stuck. But I don't know what would happen to the TLS process. Maybe investigating it would tell us more. I think its pid is <0.21289.50> if your node is still up.

Edit: Note that depending on what the issue is, setting a timeout will not help. It depends on which process gets stuck and why.
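
For reference, a sketch of setting it when opening the connection (the host, port and the 5000 ms value are examples; as the edit above says, this may not help depending on where the hang actually is):

%% Open with a finite TLS handshake timeout instead of the default infinity.
{ok, ConnPid} = gun:open("management.me", 443, #{
    transport => tls,
    tls_handshake_timeout => 5000,
    protocols => [http2, http]
}),
{ok, _Protocol} = gun:await_up(ConnPid).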

@IgorKoshnarov
Author

IgorKoshnarov commented Mar 25, 2024

> I think its pid is <0.21289.50> if your node is still up.

I had to restart the application because of complaints, so I can't investigate that process. And currently there are no queues in the gun connection processes; as I said, it happens occasionally under unknown conditions.
