Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stream_error with retry #291

Open
dubrovine opened this issue May 6, 2022 · 9 comments
Open

stream_error with retry #291

dubrovine opened this issue May 6, 2022 · 9 comments

Comments

@dubrovine
Copy link

Hello

Gun vsn - 2.0.0-rc.2

I faced some stream_error issue with the latest version of gun

connection pool options for gun

Opts2 = Opts#{ connect_timeout => 5000, retry => 200, retry_timeout => 500 % 500ms }

here is the part of the code

`request(PoolName, Method, Path, ReqHeaders, ReqBody)->
    request(PoolName, Method, Path, ReqHeaders, ReqBody, 0, undefined).

request(_, _, _, _, _, 20, ErrLast)->
    ErrLast;
request(PoolName, Method, Path, ReqHeaders, ReqBody, Counter, _)->
    
    ConnPid   = erlpool:pid(PoolName),
    StreamRef = gun:request(ConnPid, Method, Path, ReqHeaders, ReqBody, #{reply_to => self()}),
    
    case gun:await(ConnPid, StreamRef, ?TIMEOUT) of

        {response, fin, Status, Headers} ->
            {ok, Status, Headers, <<>>};

        {response, nofin, Status, Headers} ->
            
            case gun:await_body(ConnPid, StreamRef, ?TIMEOUT) of

                {ok, Body} ->
                    {ok, Status, Headers, Body};

                {error, Err} = ErrLast ->
                    timer:sleep(200),
                    request(PoolName, Method, Path, ReqHeaders, ReqBody, Counter+1, ErrLast)
            end;

        {error, Err} = ErrLast ->
            timer:sleep(200),
            request(PoolName, Method, Path, ReqHeaders, ReqBody, Counter+1, ErrLast)

    end.`

a lot of time I receive the error

`
<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}
`

Yes it is working as it should after retrying but I wanted to ask if it's good decision to make it work this way ?

I also tried to use settings like max_concurent_streams (200) but had no luck (Still reproducing such error messages)

Could you please point me to the right direction to (maybe) avoid this type of messages and re-try function (sometimes I need to execute function 15 times before it pass correctly)

thanks in advance

@essen
Copy link
Member

essen commented May 6, 2022

Not enough info I'm afraid. I don't know where this message comes from. Which pool are you using? What server? Any way I can reproduce?

@dubrovine
Copy link
Author

dubrovine commented May 6, 2022

Pool is erlPool (10sockets for each PoolName)
Server is cowboy 2.9.0 (If I understand you correctly)

`open(PoolName, PoolSize, Host, Port, Opts)->

    Opts2 = Opts#{
        connect_timeout => 5000,
        retry => 200,
        retry_timeout => 500 % 500ms
    },

    PoolRes = 
    erlpool:start_pool(PoolName, [
        {size, PoolSize},
        {group, http2_pools},
        {start_mfa, {gun, start_link, [self(), util:to_list(Host), Port, Opts2]}},
        {supervisor_period, 1},
        {supervisor_intensity, 1000},
        {supervisor_restart, permanent}
    ]),
    case PoolRes of
        ok -> 
            log:info("[HTTP2] START_POOL ~p ~p ~p", [self(), PoolName, PoolSize]),
            ok;
        {error,{already_started,_}} -> 
            log:error("[HTTP2] START_POOL already_started ~p ~p ~p", [self(), PoolName, PoolSize]),
            ok;
        PoolErr ->
            log:error("[HTTP2] START_POOL_ERR ~p ~p ~p ~p", [self(), PoolName, PoolSize, PoolErr]),
            {error, PoolErr}
    end.`

error message I receive directly from gun:await(ConnPid, StreamRef, ?TIMEOUT)
unfortunately I can not reproduce the issue locally of on dev environment (only in production env)

we have like 30-50k requests to apns per minute

please tell me if you need more specific information

@essen
Copy link
Member

essen commented May 6, 2022

Oh it's against APNS? If I recall they were doing something weird where they only allow 1 stream initially and then increase that to 100 after the first request succeeded. You might be running into that. Try to trace the HTTP/2 settings that Gun receives.

@dubrovine
Copy link
Author

Okay thank you. I will close the ticket for now and get back to you when I will be able to trace settings on production env

@dubrovine dubrovine reopened this May 6, 2022
@dubrovine
Copy link
Author

dubrovine commented May 6, 2022

Here is what I found

connected to active production node make some manual tests

OK TEST

`(channelapi@16.38.26.28)6> ConnPid = erlpool:pid('apns_privat24-2.0-voip-prod'). 
<0.2411.0>
(channelapi@16.38.26.28)7> 
(channelapi@16.38.26.28)7> StreamRef = gun:request(ConnPid, <<"POST">>, <<"/3/device/***">>, [{<<"apns-topic">>, <<"ua.hidden">>}, {<<"apns-push-type">>,<<"voip">>}], <<"{\"aps\" : {\"badge\" : 9,\"sound\" : \"bingbong.aiff\"},\"messageID\" : \"ABCDEFGHIJ\"}">>, #{reply_to => self()}).
#Ref<0.1992702093.3131834371.253809>
(channelapi@16.38.26.28)8> gun:info(ConnPid).
#{cookie_store => undefined,intermediaries => [],
  origin_host => "api.push.apple.com",origin_port => 443,
  origin_scheme => <<"https">>,owner => <0.2178.0>,
  protocol => http2,
  sock_ip => {16,38,26,28},
  sock_port => 52616,
  socket =>
      {sslsocket,{gen_tcp,#Port<0.453357>,tls_connection,
                          undefined},
                 [<0.21000.64>,<0.17325.65>]},
  transport => tls}
(channelapi@16.38.26.28)9> gun:stream_info(ConnPid , StreamRef).
{ok,#{ref => #Ref<0.1992702093.3131834371.253809>,
      reply_to => <0.27133.65>,state => running}}
(channelapi@16.38.26.28)10> gun:await(ConnPid, StreamRef, 5000).
{response,fin,200,
          [{<<"apns-id">>,
            <<"016B51EA-EF0D-E5C9-772E-E3FD48F6B1AE">>}]}`

ERROR TEST

`(channelapi@16.38.26.28)6> ConnPid1 = erlpool:pid('apns_privat24-2.0-voip-prod'). 
<0.2413.0>
(channelapi@16.38.26.28)7> 
(channelapi@16.38.26.28)7> StreamRef1 = gun:request(ConnPid1, <<"POST">>, <<"/3/device/***">>, [{<<"apns-topic">>, <<"ua.hidden">>}, {<<"apns-push-type">>,<<"voip">>}], <<"{\"aps\" : {\"badge\" : 9,\"sound\" : \"bingbong.aiff\"},\"messageID\" : \"ABCDEFGHIJ\"}">>, #{reply_to => self()}).
#Ref<0.1992702093.3132620803.58733>
(channelapi@16.38.26.28)8> gun:info(ConnPid1).
#{cookie_store => undefined,intermediaries => [],
  origin_host => "api.push.apple.com",origin_port => 443,
  origin_scheme => <<"https">>,owner => <0.2178.0>,
  protocol => http2,
  sock_ip => {16,38,26,28},
  sock_port => 50478,
  socket =>
      {sslsocket,{gen_tcp,#Port<0.491351>,tls_connection,
                          undefined},
                 [<0.5892.68>,<0.19792.67>]},
  transport => tls}
(channelapi@16.38.26.28)9> gun:stream_info(ConnPid1 , StreamRef1).
{ok,undefined}
(channelapi@16.38.26.28)10> gun:await(ConnPid1, StreamRef1, 5000).
{error,{stream_error,closed}}`

all the data seems like the same except connection pid and stream ref but gun:stream_info respond with undefined

@essen
Copy link
Member

essen commented Nov 18, 2023

All I know about this is that it likely happens because the client was disconnected. Most likely it received an {error, closed} back after trying to do a socket operation. You would get a more verbose stream_error tuple otherwise. I do not know why that happens or what to do about it. Having a trace of what Gun is doing when this happens may help.

@Vkutovoy92
Copy link

Hi, I have the same, with 1.3.0 I had
{error,{stop,{goaway,215,no_error,<<>>},'Client is going away.'}}
{error,{stop,{goaway,145,no_error,<<>>},'Client is going away.'}}
{error,{stop,{goaway,143,no_error,<<>>},'Client is going away.'}}

And after update to 2.0.1 these problems are not exist but I have had new error
{stream_error,{closed,{error,closed}}}
And this I can catch in condition:

`

{response, nofin, HttpCode = 200,  _Headers} ->
  case gun:await_body(ConnPid, StreamRef, ?TIMEOUT) of
    {ok, Body}  ->
      FinishTime = time_util:militime() - StartTime,
      metrics:instrument(FinishTime, UrlForMetrics, true, HttpCode),
      DecodeBody = json:decode(Body),
      {prepare_response(DecodeBody), FinishTime};
    {error,{stream_error,{closed,{error,closed}}}} = Error ->
      log:error("{"?MODULE_STRING"}  - stream_error, error ~p", [Error]),
      FinishTime = time_util:militime() - StartTime,
      metrics:instrument(FinishTime, UrlForMetrics, false, stream_error),
      ?ErrorResp(FinishTime)
  end;

`

@Vkutovoy92
Copy link

I have noticed that this error turns up on highload services and almost at the same time.
2 screens from 2 different services
Снимок экрана 2023-11-20 в 17 13 14

Снимок экрана 2023-11-20 в 17 13 09

@essen
Copy link
Member

essen commented Nov 20, 2023

Could be happening if the network has hiccups. You should see TCP connections getting closed at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants