Connection pooling, port exhaustion in App Service #615

dpolivy · 2020-04-18T00:13:33Z

I've recently enabled Application Insights for some node.js services running in Azure App Service, and have been running into lots of issues with TCP connections and SNAT port exhaustion. I do have live metrics enabled, for now.

I'm looking for some guidance and best practices around addressing this, considering the outbound socket limitations on Azure App Service (only 120 sockets guaranteed per instance). Also, I have some feedback at the end on documentation.

I am using agentkeepalive as per the App Service best practices. Since I have 5 node processes in my instance, I have maxSockets set to 24. My initialization logic looks like this:

const HttpsAgent = require('agentkeepalive').HttpsAgent;
const https = require('https');
https.globalAgent = new HttpsAgent({
	maxSockets: 24,
	maxFreeSockets: 10,
	timeout: 60000,
	freeSocketTimeout: 30000
});

// Initialize App Insights
const appInsights = require('applicationinsights');
appInsights.setup(config.azure.applicationInsightsKey)
	.setSendLiveMetrics(true)
	.setAutoCollectConsole(true, true);
appInsights.defaultClient.config.httpsAgent = https.globalAgent;
appInsights.start();

Of my 5 node processes, 2 of them also interact with a few other external APIs on a frequent basis, so Application Insights is not the only consumer of external sockets.
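The per-process limit above comes from dividing the instance's socket budget by the number of Node processes. Here is a minimal sketch of that arithmetic, using the built-in `https.Agent` rather than agentkeepalive. The helper name `perProcessMaxSockets` and its `reserved` parameter are hypothetical, and 120 is the figure quoted in this post, not a verified platform limit:

```javascript
const https = require('https');

// Hypothetical helper: split an instance-wide socket budget across processes.
// `reserved` holds back sockets for traffic that bypasses the shared agent.
function perProcessMaxSockets(instanceBudget, processCount, reserved = 0) {
  if (processCount < 1) throw new RangeError('processCount must be >= 1');
  return Math.max(1, Math.floor((instanceBudget - reserved) / processCount));
}

const maxSockets = perProcessMaxSockets(120, 5); // 24, matching the post

// Note: in Node's built-in Agent, maxSockets is a per-host cap.
// maxTotalSockets (available in newer Node releases) caps sockets across all hosts.
const pooledAgent = new https.Agent({
  keepAlive: true,
  maxSockets,
  maxTotalSockets: maxSockets,
  maxFreeSockets: 10,
});

console.log(maxSockets, pooledAgent.maxSockets);
```

Because `maxSockets` applies per host, a process that talks to several hosts can exceed its share unless a total cap is also set. That may be worth checking when other external APIs share the budget.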

With debug logging enabled, I'm also seeing frequent instances of these errors in the logs:

ApplicationInsights:QuickPulseSender [ 'Live Metrics endpoint could not be reached 25 consecutive times. Most recent error:',
  { Error: read ECONNRESET
      at exports._errnoException (util.js:1020:11)
      at TLSWrap.onread [as _originalOnread] (net.js:580:26)
      at TLSWrap.<anonymous> (D:\home\site\wwwroot\node_modules\applicationinsights\node_modules\async-listener\glue.js:188:31) code: 'ECONNRESET', errno: 'ECONNRESET', syscall: 'read' } ]

ApplicationInsights:QuickPulseSender [ 'Live Metrics endpoint could not be reached 50 consecutive times. Most recent error:',
  { Error: connect ETIMEDOUT 23.96.28.38:443
      at Object.exports._errnoException (util.js:1020:11)
      at exports._exceptionWithHostPort (util.js:1043:20)
      at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1099:14)
    code: 'ETIMEDOUT',
    errno: 'ETIMEDOUT',
    syscall: 'connect',
    address: '23.96.28.38',
    port: 443 } ]

I'm pretty sure the live metrics stream is not working fully, since, for example, servers pop in and out of the list on an intermittent basis. Presumably it's due to the errors above.

I'm still in the early stages of debugging some of my connection pooling issues, and would certainly appreciate some assistance in making sure app insights works properly in App Service while respecting the outbound connection limitations.

Other Feedback
By setting https.globalAgent, I expected Application Insights to use that for outbound connections. Instead, it seems that this module uses its own agent implementation. I didn't find any documentation discussing using this module in App Service. It might be nice to include some details about explicitly setting the httpsAgent property so that the module can use connection pooling and HTTP keep-alive.


markwolff · 2020-07-08T22:25:51Z

This should be fixed by #621

dpolivy · 2020-07-09T18:23:56Z

@markwolff Unless I'm missing something, the merged commit doesn't actually use https.globalAgent if nothing is specified in the config? I am using the new module in my app and am still seeing SNAT port exhaustion errors.

markwolff · 2020-07-09T19:00:13Z

This SDK cannot use https.globalAgent because the default value does not meet our TLS requirements. You need to configure the agent with this SDK manually:

appInsights.defaultClient.config.httpsAgent = https.globalAgent;

dpolivy · 2020-07-13T17:37:21Z

I am doing that in my code. I will try to find some time to dive deeper this week to see why it is still causing the port exhaustion.

But out of curiosity, isn't the supported TLS version more a function of the server vs the client? Just because the client can support TLS 1.0, etc, if the server doesn't support it, it won't get negotiated. Are you enforcing TLS versions on the server, which would make the need to do so on the client moot?

dpolivy · 2020-08-13T16:33:49Z

@markwolff Even with the agent set as you describe (as I have been doing all along), I continue to see SNAT port exhaustion in App Service. We ran a test with support in which we commented out setting the agent, and as expected the port exhaustion skyrocketed (3000 failed SNAT requests per 5-minute period). So for whatever reason, even with the custom agent, it's still using too many sockets.

Also, is there a reason that you don't automatically enable keep-alive in whatever default agent you're using?

markwolff · 2020-08-13T16:47:35Z

The default agent is a minimum viable config to satisfy our TLS requirements. Any additional config decisions are left to the user (by overwriting our default agent).

Are you seeing these snat errors only when live metrics is enabled? Or does that only accelerate it? Also are you measuring it in a way other than looking for logged errors? Are you aware of anything in your app also not honoring the global agent in the default way similar to this sdk?

FWIW we insert the agent in our http calls here, and every network request is made through this util.

ApplicationInsights-node.js/Library/Util.ts

Lines 322 to 337 in b23988f

    
var isHttps = requestUrlParsed.protocol === 'https:' && !proxyUrl;
if (isHttps && config.httpsAgent !== undefined) {
    options.agent = config.httpsAgent;
} else if (!isHttps && config.httpAgent !== undefined) {
    options.agent = config.httpAgent;
} else if (isHttps) {
    // HTTPS without a passed in agent. Use one that enforces our TLS rules
    options.agent = Util.tlsRestrictedAgent;
}

if (isHttps) {
    return https.request(<any>options, requestCallback);
} else {
    return http.request(<any>options, requestCallback);
}
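
The branch order in that excerpt can be paraphrased as a pure function, which makes it easier to see which agent a given request ends up with. This is only a sketch of the quoted logic, not the SDK's code. `pickAgent`, `userAgent` and `fallback` are hypothetical names, and the plain objects stand in for real agents:

```javascript
// Paraphrase of the quoted branch order. `config` and `fallbackAgent` are stand-ins.
function pickAgent(requestUrl, proxyUrl, config, fallbackAgent) {
  // As in the excerpt, a request sent through a proxy is not treated as HTTPS
  const isHttps = new URL(requestUrl).protocol === 'https:' && !proxyUrl;
  if (isHttps && config.httpsAgent !== undefined) return config.httpsAgent;
  if (!isHttps && config.httpAgent !== undefined) return config.httpAgent;
  if (isHttps) return fallbackAgent;
  return undefined; // plain HTTP with nothing configured: Node's default agent applies
}

const userAgent = { name: 'user' };
const fallback = { name: 'tls-restricted' };

console.log(pickAgent('https://example.invalid/track', undefined, { httpsAgent: userAgent }, fallback).name);
console.log(pickAgent('https://example.invalid/track', undefined, {}, fallback).name);
```

Read this way, a user-supplied `httpsAgent` only takes effect on the config object that the request actually receives. If a sub-client holds a different config object, it falls through to the fallback agent.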

dpolivy · 2020-08-13T16:57:54Z

> The default agent is a minimum viable config to satisfy our TLS requirements. Any additional config decisions are left to the user (by overwriting our default agent).

But as I said earlier: isn't the supported TLS version more a function of the server vs the client? Just because the client can support TLS 1.0, etc, if the server doesn't support it, it won't get negotiated. Are you enforcing TLS versions on the server, which would make the need to do so on the client moot (and you could use the global agents by default)?

> Are you seeing these snat errors only when live metrics is enabled? Or does that only accelerate it? Also are you measuring it in a way other than looking for logged errors? Are you aware of anything in your app also not honoring the global agent in the default way similar to this sdk?

Yes, this is only happening when Live Metrics is enabled. With it disabled, I do not see any SNAT exhaustion errors or failed TCP connections. I am specifically using agentkeepalive as recommended in the App Service documentation and limiting my total # of sockets per Node instance to keep the total # of allowed sockets below the 160 threshold as recommended by documentation and support.

Is Live Metrics creating a new outbound HTTP connection for each data point it is sending? It certainly seems to look that way based on the graphs. Maybe this would be better implemented using something like web sockets? All of my other outbound API calls are definitely going through my custom agent.

markwolff · 2020-08-13T17:17:08Z

The default TLS agent came at a time when the server accepted more TLS versions than our security guidelines allowed (a transition period). Since the server now rejects unsupported TLS versions, I agree the default agent could probably be removed.

After taking a closer look, you actually need to give the agent to the live metrics config object as well. The defaultClient isn't passing down the initial config to its live metrics client, which seems like an oversight on my part.

appInsights.liveMetricsClient.config.httpsAgent = https.globalAgent;

// or

appInsights.defaultClient.quickPulseClient.config.httpsAgent = https.globalAgent;

dpolivy · 2020-08-13T17:36:50Z

> After taking a closer look, you actually need to give the agent to the live metrics config object as well. The defaultClient isn't passing down the initial config to its live metrics client, which seems like an oversight on my part.

I will try this now and let you know how it goes! This certainly seems like it could be the source of the issue.

aderici · 2020-09-10T09:56:39Z

This is a problem for us as well. Is there any confirmed way of fixing it? We will have to disable Application Insights if I cannot solve this.

dpolivy · 2020-09-10T17:02:44Z

@aderici I'm still seeing SNAT failures, but this is what my initialization is doing and at least it does seem to get AI and Live Metrics to use my agentkeepalive instance:

appInsights.defaultClient.config.httpsAgent = https.globalAgent;
appInsights.defaultClient.quickPulseClient.config.httpsAgent = https.globalAgent;
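
The two assignments can be wrapped in a small helper that reports which ones took effect. This is a sketch under assumptions: `applyHttpsAgent` is a hypothetical name, the property paths are the ones mentioned in this thread and may not exist in every SDK version, and the demo runs against a stand-in object rather than the real SDK:

```javascript
const https = require('https');

// Hypothetical helper: set the agent on each config object that exists.
// Returns the list of places where the assignment was made.
function applyHttpsAgent(client, agent) {
  const applied = [];
  if (client && client.config) {
    client.config.httpsAgent = agent;
    applied.push('defaultClient');
  }
  if (client && client.quickPulseClient && client.quickPulseClient.config) {
    client.quickPulseClient.config.httpsAgent = agent;
    applied.push('quickPulseClient');
  }
  return applied;
}

// Stand-in shaped like the client described in this thread (not the real SDK)
const fakeClient = { config: {}, quickPulseClient: { config: {} } };
const sharedAgent = new https.Agent({ keepAlive: true, maxSockets: 24 });

console.log(applyHttpsAgent(fakeClient, sharedAgent));
```

The thread does not say at what point `quickPulseClient` is created, so checking the returned list after calling the helper is one way to confirm the live metrics client was actually reached.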

markwolff · 2020-09-10T17:21:39Z

Is there another network request being made here (by this sdk or otherwise)? Do you have any network debug logs to see which requests aren't using your http agent? Live metrics makes the most network calls by far in this sdk, and so I would expect re-routing that would resolve the issue.

dpolivy · 2020-09-10T17:27:04Z

> Do you have any network debug logs to see which requests aren't using your http agent?

Any examples on how to generate this? I can log via agentkeepalive but that only shows network requests using that agent, not any that use other agents.
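
One possible way to see requests that bypass a given agent is to wrap the module's `request` function and compare the agent each call would use against the expected one. This is a rough sketch, not something the SDK provides. `watchAgentUse` is a hypothetical name, and the demo uses a stand-in module object so that nothing touches the network:

```javascript
// Wrap `mod.request` and report calls whose agent is not `expectedAgent`.
// `mod` would normally be the `https` module; `onMiss` receives each bypassing call.
function watchAgentUse(mod, expectedAgent, onMiss) {
  const original = mod.request;
  mod.request = function watchedRequest(...args) {
    // Signatures: request(url[, options][, cb]) or request(options[, cb])
    const options = args.slice(0, 2).find(
      (a) => a && typeof a === 'object' && !(a instanceof URL)
    ) || {};
    // When no agent is given, Node falls back to the module's global agent
    const effective = options.agent !== undefined ? options.agent : mod.globalAgent;
    if (effective !== expectedAgent) {
      const first = args[0];
      const target = typeof first === 'string' || first instanceof URL
        ? String(first)
        : `${options.hostname || options.host || 'unknown'}${options.path || ''}`;
      onMiss({ target, stack: new Error('bypassed shared agent').stack });
    }
    return original.apply(this, args);
  };
  return () => { mod.request = original; }; // call this to undo the patch
}

// Demo against a stand-in module
const sharedAgent = { name: 'shared' };
const fakeHttps = { globalAgent: { name: 'global' }, request: () => 'sent' };
const misses = [];
const undo = watchAgentUse(fakeHttps, sharedAgent, (m) => misses.push(m.target));

fakeHttps.request({ hostname: 'a.invalid', path: '/ok', agent: sharedAgent });
fakeHttps.request({ hostname: 'b.invalid', path: '/ping' });
fakeHttps.request('https://c.invalid/post', { agent: false });
undo();

console.log(misses);
```

Against the real `https` module, the wrapper would need to be installed before other modules capture a reference to `https.request`. As far as I know, `https.get` calls the internal request function directly, so it would need the same treatment. The captured `stack` shows which caller issued each bypassing request.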

markwolff · 2020-09-10T17:37:45Z

Do the SNAT errors show the url it was thrown on? Would it throw on a URL that is using the correct agent, where maybe 99% of requests are using the agent and the 1% that isn't is causing it to go over the connection limit (and hopefully it shows us that 1% URL)? If this is the case, then live metrics could use up all your connections and then any non-agent request made by anything in your app could start throwing errors.

Otherwise you can try Fiddler/Wireshark/etc. to view all requests (but I remember I had to jump through quite a few hoops to get Fiddler working for Node.js apps). Are you reproducing this locally as well? That is, does setting any sort of max-connection agent and running the app surface the error at some point?
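
For a local experiment along those lines, the built-in agent exposes its in-use sockets, idle sockets, and queued requests, so pool pressure can be observed without a packet sniffer. This is a minimal sketch, assuming the built-in `https.Agent`. `socketCounts` is a hypothetical helper name:

```javascript
const https = require('https');

// Sum the per-host arrays an Agent keeps for in-use sockets,
// idle keep-alive sockets, and requests waiting for a free socket.
function socketCounts(agent) {
  const total = (byHost) =>
    Object.values(byHost || {}).reduce((n, list) => n + list.length, 0);
  return {
    active: total(agent.sockets),
    free: total(agent.freeSockets),
    queued: total(agent.requests),
  };
}

const agent = new https.Agent({ keepAlive: true, maxSockets: 24 });
console.log(socketCounts(agent)); // all zeros until requests are made
```

Logging these counts periodically while the app runs would show whether requests pile up in `queued`, which suggests the cap is being hit. It only covers traffic routed through this agent, so it cannot by itself reveal requests that use a different one.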

dpolivy · 2020-09-10T17:58:32Z

> Do the SNAT errors show the url it was thrown on?

No, at least not in App Service diagnostics. I've asked support but they don't have any further details either.

> Are you repro-ing this locally as well?

My machine doesn't have the same SNAT behavior as App Service, so I think it's impossible for me to reproduce this exactly, as it's primarily an issue with the App Service platform and the limits in place. Also, I can't use Fiddler or Wireshark in App Service to sniff traffic, but ultimately I'm not sure what that would tell us as the agent is a Node.js construct and is independent of what's sent over the wire.

I know that for the explicit external API calls I make, I am passing in the agent I want to use. I've verified that. Aside from that, there shouldn't be any outbound API calls other than the applicationinsights module and the Azure storage module. Have you actually done testing of applicationinsights in all various combinations using the settings recommended here to verify all calls are correctly using the specified httpsAgent? We already identified that Live Metrics wasn't using the default agent when set -- is there some other part internally that may be doing the same?

dpolivy · 2020-09-25T18:10:01Z

Just wanted to leave a quick follow-up here. I was accepted into the Private Endpoint preview for Azure Monitor, and got that configured on my network. This allows all App Insights/Live Metrics traffic to be routed via my vnet to a private endpoint instead of via the public internet (and thus requiring SNAT ports). Since deploying the change, here's how my SNAT usage looks:

I'd say it's pretty hard to argue that the SNAT usage/issues were not caused by the App Insights SDK. Perhaps it's just due to the volume of requests generated or the way I'm using it (with multiple Node apps in the same web app, plus a second slot of the same), but it seems the SDK is not well suited to running in App Service.

On a separate but related note, have you thought about enabling keepalive by default in your default agent (or even using agentkeepalive to help with that) to promote better connection re-use as a default best practice? Or at least cover this in the documentation?
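
As an illustration of that suggestion, a default agent could in principle combine keep-alive with a TLS floor using only Node's built-in `https.Agent`. This is a sketch, not the SDK's configuration: TLS 1.2 as the minimum is an assumption (the thread does not say which versions the SDK requires), and the socket limits are simply the values from the original post:

```javascript
const https = require('https');

// Assumption: TLS 1.2 as the floor. The thread does not state the SDK's actual requirement.
const defaultAgent = new https.Agent({
  keepAlive: true,        // reuse sockets instead of opening one per request
  maxSockets: 24,         // per-host cap, value taken from the original post
  maxFreeSockets: 10,     // idle sockets kept open for reuse
  minVersion: 'TLSv1.2',  // TLS option applied to each new connection
});

console.log(defaultAgent.maxSockets, defaultAgent.maxFreeSockets);
```

With keep-alive on, repeated posts to the same endpoint can reuse an existing connection rather than opening a new one each time. That reuse is what the question about defaults is after.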

markwolff added the bug label Apr 20, 2020

markwolff mentioned this issue Apr 30, 2020

Update all Senders to use https.globalAgent instead of tlsRestrictedAgent #621

Merged

markwolff closed this as completed Jul 8, 2020

bradoyler mentioned this issue May 24, 2022

require env var to enable live metrics v0.4.1 gopuff/appinsights-logger#26

Merged
