Connection pooling, port exhaustion in App Service #615
Comments
This should be fixed by #621
@markwolff Unless I'm missing something, the merged commit doesn't actually use
This SDK cannot use `appInsights.defaultClient.config.httpsAgent = https.globalAgent;`
I am doing that in my code. I will try to find some time to dive deeper this week to see why it is still causing the port exhaustion. But out of curiosity, isn't the supported TLS version more a function of the server than the client? Just because the client can support TLS 1.0, etc., if the server doesn't support it, it won't get negotiated. Are you enforcing TLS versions on the server, which would make the need to do so on the client moot?
@markwolff Even with the agent set as you describe (and as I have been doing), I continue to see SNAT port exhaustion running in App Service. We did a test with support where we commented out setting the agent, and as expected the port exhaustion skyrocketed (3,000 failed SNAT requests per 5-minute period). So for whatever reason, even with the custom agent, it's still using too many sockets. Also, is there a reason you don't automatically enable keep-alive in whatever default agent you're using?
The default agent is a minimum viable config to satisfy our TLS requirements. Any additional config decisions are left to the user (by overwriting our default agent).

Are you seeing these SNAT errors only when Live Metrics is enabled, or does that only accelerate it? Are you measuring it in a way other than looking for logged errors? Are you aware of anything in your app also not honoring the global agent in the default way, similar to this SDK?

FWIW, we insert the agent in our HTTP calls here, and every network request is made through this util: ApplicationInsights-node.js/Library/Util.ts, lines 322 to 337 in b23988f
But as I said earlier: isn't the supported TLS version more a function of the server than the client? Just because the client can support TLS 1.0, etc., if the server doesn't support it, it won't get negotiated. Are you enforcing TLS versions on the server, which would make the need to do so on the client moot (and you could use the global agents by default)?
Yes, this is only happening when Live Metrics is enabled. With it disabled, I do not see any SNAT exhaustion errors or failed TCP connections. I am specifically using

Is Live Metrics creating a new outbound HTTP connection for each data point it is sending? It certainly looks that way based on the graphs. Maybe this would be better implemented using something like WebSockets? All of my other outbound API calls are definitely going through my custom agent.
The default TLS agent came at a time when the server accepted more TLS versions than our security guidelines prescribed (a transition period). Since the server will reject unsupported TLS now, I agree the default agent could probably be removed.

After taking a closer look, you actually need to give the agent to the Live Metrics config object as well. The defaultClient isn't passing down the initial config to its live metrics client, which seems like an oversight on my part.
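A sketch of that workaround, under stated assumptions: the agent name and settings are illustrative, and it assumes the SDK version in use exposes `appInsights.liveMetricsClient.config.httpsAgent` alongside `defaultClient.config.httpsAgent` (worth verifying against your SDK version):

```javascript
const https = require('https');
const appInsights = require('applicationinsights');

// Illustrative shared keep-alive agent (settings are placeholders).
const keepAliveAgent = new https.Agent({ keepAlive: true, maxSockets: 24 });

appInsights.setup('<your-instrumentation-key>')
  .setSendLiveMetrics(true)
  .start();

// Route regular telemetry through the custom agent...
appInsights.defaultClient.config.httpsAgent = keepAliveAgent;

// ...and, since defaultClient does not pass its config down, set the agent
// on the live metrics client's config as well (assumed property path).
if (appInsights.liveMetricsClient) {
  appInsights.liveMetricsClient.config.httpsAgent = keepAliveAgent;
}
```

Until the config is passed down automatically, both assignments would be needed so that neither the telemetry channel nor the Live Metrics (QuickPulse) pings open unpooled sockets.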
I will try this now and let you know how it goes! This certainly seems like it could be the source of the issue.
This is a problem for us as well. Is there a confirmed way of fixing it? We will have to disable App Insights if we can't solve this.
@aderici I'm still seeing SNAT failures, but this is what my initialization is doing and at least it does seem to get AI and Live Metrics to use my
Is there another network request being made here (by this SDK or otherwise)? Do you have any network debug logs to see which requests aren't using your HTTP agent? Live metrics makes the most network calls by far in this SDK, so I would expect re-routing that to resolve the issue.
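One low-overhead way to capture those logs (a general Node.js technique, not something specific to this SDK) is the built-in `NODE_DEBUG` tracing, which prints socket creation and agent reuse for every http/https request to stderr. The entry point name here is just a placeholder for your app:

```shell
# Trace every http/https request the process makes and capture it to a file;
# server.js is a hypothetical entry point, substitute your own.
NODE_DEBUG=http,https node server.js 2> network-debug.log
```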
Any examples of how to generate this? I can log via
Do the SNAT errors show the URL they were thrown on? Maybe 99% of requests are using the agent and the 1% that isn't is pushing you over the connection limit (and hopefully the error shows us that 1% URL). If so, Live Metrics could use up all your connections, and then any non-agent request made by anything in your app could start throwing errors.

Otherwise you can try Fiddler/Wireshark/etc. to view all requests (though I remember having to jump through quite a few hoops to get Fiddler working for Node.js apps). Are you repro-ing this locally as well? i.e., setting some sort of max-connection agent and running the app should surface the error at some point.
No, at least not in App Service diagnostics. I've asked support but they don't have any further details either.
My machine doesn't have the same SNAT behavior as App Service, so I think it's impossible for me to reproduce this exactly, as it's primarily an issue with the App Service platform and the limits in place. Also, I can't use Fiddler or Wireshark in App Service to sniff traffic, but ultimately I'm not sure what that would tell us, as the agent is a Node.js construct and is independent of what's sent over the wire. I know that for the explicit external API calls I make, I am passing in the agent I want to use. I've verified that. Aside from that, there shouldn't be any outbound API calls other than the
Just wanted to leave a quick follow-up here. I was accepted into the Private Endpoint preview for Azure Monitor and got that configured on my network. This allows all App Insights/Live Metrics traffic to be routed via my VNet to a private endpoint instead of via the public internet (and thus requiring SNAT ports). Since deploying the change, here's how my SNAT usage looks:

I'd say it's pretty hard to argue that the SNAT usage/issues were not caused by the App Insights SDK. Perhaps it's just due to the volume of requests generated or the way I'm using it (with multiple Node apps in the same web app, plus a second slot of the same), but it seems like the SDK is not well suited to running in App Service.

On a separate but related note, have you thought about enabling
I've recently enabled Application Insights for some node.js services running in Azure App Service, and have been running into lots of issues with TCP connections and SNAT port exhaustion. I do have live metrics enabled, for now.
I'm looking for some guidance and best practices around addressing this, considering the outbound socket limitations on Azure App Service (only 120 sockets guaranteed per instance). Also, I have some feedback at the end on documentation.
I am using `agentkeepalive` as per the App Service best practices. Since I have 5 Node processes in my instance, I have `maxSockets` set to 24. My initialization logic looks like this:

Of my 5 Node processes, 2 of them also interact with a few other external APIs on a frequent basis, so Application Insights is not the only consumer of external sockets.
With debug logging enabled, I'm also seeing frequent instances of these errors in the logs:
I'm pretty sure the live metrics stream is not working fully, since, for example, servers pop in and out of the list on an intermittent basis. Presumably it's due to the errors above.
I'm still in the early stages of debugging some of my connection pooling issues, and would certainly appreciate some assistance in making sure app insights works properly in App Service while respecting the outbound connection limitations.
Other Feedback
By setting `https.globalAgent`, I expected Application Insights to use it for outbound connections. Instead, it seems that this module uses its own agent implementation. I didn't find any documentation discussing using this module in App Service; it might be nice to include some details about explicitly setting the `httpsAgent` property so that the module can use connection pooling and HTTP keep-alive.