Random "The service is unavailable." and "Azure Functions runtime is unreachable" errors #8583

Open
Arjan321 opened this issue Jul 27, 2022 · 142 comments

Comments

@Arjan321

We have been running a couple of Azure Functions in various subscriptions, and every once in a while (about once a week), the entire Azure Function goes down and reports "The service is unavailable." when accessing the Function-app via HTTP and reports "Azure Functions runtime is unreachable" in the Azure Portal.

HTTP response:
image

Portal:
image

The issue appears randomly, without any changes on our end (no deployment, etc.), and also resolves randomly without any interaction from our side.

In the "Activity log" an error is written for the job "Sync Web Apps Function Triggers" with status "Failed":

        "statusCode": "BadRequest",
        "statusMessage": "{\"Code\":\"BadRequest\",\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\",\"Target\":null,\"Details\":[{\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\"},{\"Code\":\"BadRequest\"},{\"ErrorEntity\":{\"Code\":\"BadRequest\",\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\"}}],\"Innererror\":null}",

The issue seems quite similar to #8519, however we are running Linux.
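
The `statusMessage` in the Activity log entry quoted above is a JSON document escaped inside a JSON string, which makes it hard to read. A minimal Python sketch for unpacking it, assuming the two lines shown are wrapped in braces to form a complete object (the wrapper and the `extract_host_error` helper are illustrative and not part of any Azure SDK):

```python
import json

# The two Activity log lines from the report, wrapped in braces (an assumption)
# so that they form a complete JSON object.
activity_log_entry = r'''{
  "statusCode": "BadRequest",
  "statusMessage": "{\"Code\":\"BadRequest\",\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\",\"Target\":null,\"Details\":[{\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\"},{\"Code\":\"BadRequest\"},{\"ErrorEntity\":{\"Code\":\"BadRequest\",\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\"}}],\"Innererror\":null}"
}'''


def extract_host_error(entry_json: str) -> tuple:
    """Return (outer status code, inner message) from an activity log entry."""
    entry = json.loads(entry_json)
    # statusMessage is itself a JSON document encoded as a string,
    # so it needs a second decoding pass.
    inner = json.loads(entry["statusMessage"])
    return entry["statusCode"], inner["Message"]


status, message = extract_host_error(activity_log_entry)
print(status, "-", message)
# → BadRequest - Encountered an error (ServiceUnavailable) from host runtime.
```

The outer `BadRequest` is only how the trigger sync job reports the failure; the inner message shows that the host runtime itself answered with ServiceUnavailable.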

This is causing quite some problems, since we are no longer able to provide reliable service to our end-users.

Investigative information

  • Timestamp: Wednesday, 27 July 2022, 07:47, Coordinated Universal Time (UTC)
  • Function App version: V4
  • Function App name: Please use Invocation ID
  • Function name(s) (as appropriate): All functions within the app are non-working
  • Invocation ID: None of the functions work, but to find our function-app: 2022-07-27T08:05:13.773Z - e95ba16f-f2a2-420c-bcfd-4891d54198ae
  • Region: West-Europe

Repro steps

None that we can find

Expected behavior

Always work

Known workarounds

None

Related information

Hosting Model: Consumption Plan
OS: Linux
Version: V4
Hosting Model: In-Process
Language: C#/dotnet6
Configuration:

  • WEBSITE_RUN_FROM_PACKAGE points to a valid file on a storage account
  • WEBSITE_MOUNT_ENABLED: 1
  • WEBSITE_ENABLE_SYNC_UPDATE_SITE: false
  • WEBSITE_CONTENTAZUREFILECONNECTIONSTRING: Valid Key Vault reference to a storage account
  • SCM_DO_BUILD_DURING_DEPLOYMENT: false
  • FUNCTIONS_WORKER_RUNTIME: dotnet-isolated
  • FUNCTIONS_EXTENSION_VERSION: ~4
  • AzureWebJobsDisableHomepage: true
@wbail
Contributor

wbail commented Aug 27, 2022

Hi Arjan321,

Normally this error message is related to misconfiguration.

Please take a look at the bullets below:

@Arjan321
Author

As mentioned in the original issue, those settings point to valid values.

Given the extreme randomness of both the issue appearing and resolving itself, I highly doubt any setting on our end is responsible.

@gilesmatthews

Hi, is there any update on this issue? I am seeing the same random service unavailable issues. Thanks

@rlucassen

Same issue here, completely random. Waiting on a fix for a while now.

@LuckyLub

Sorry for you guys, but I'm happy to read this... thought I was going crazy. What region do you guys have your Azure Functions running? Ours are in West Europe.

@rlucassen

> Sorry for you guys, but I'm happy to read this... thought I was going crazy. What region do you guys have your Azure Functions running? Ours are in West Europe.

Mine are running in West Europe as well, same for @Arjan321

@tomabg

tomabg commented Sep 21, 2022

same region West Europe

@tomabg

tomabg commented Sep 21, 2022

This also breaks our Terraform deployment. Since this is only our test environment, we will try a different region soon.

@rlucassen

> This also breaks our Terraform deployment. Since this is only our test environment, we will try a different region soon.

Curious to know if other regions do work.

Heard that Microsoft is doing updates in the West Europe region around next week

@LuckyLub

We also deploy with Terraform BTW. Changing the location to Central US seems to do the trick.

@jnekrasov

Any news on this?! We are experiencing the same problems in West Europe region

@Ralle1986

Also seeing issue in West Europe region

@balag0
Contributor

balag0 commented Sep 27, 2022

Sorry for the delay in responding. Yes, there was a regression due to a recent update in the region which caused intermittent errors.
The fix rollout has already started and is currently in progress. If any apps are still experiencing failures, could you please share the details so we can double-check them? Thanks.

@LuckyLub

LuckyLub commented Oct 4, 2022

Before we move back, can you please confirm when the roll-out is completed?

@lightwaver

Any news on this topic, or a link to the issue to see if it's solved?

@LuckyLub

LuckyLub commented Oct 10, 2022

@balag0 any update?

@LuckyLub

@surgupta-msft, @balag0, any updates? Just tried to redeploy to West-Europe, still running into random "The service is unavailable." errors. Running the same functions in North-Europe works just fine.

@LuckyLub

Just had contact with Azure's Help + Support. I presented the problem, and it seemed to be known. However, they will collect some logs from my Azure Functions and look further into it. They are still working on upgrading the services in West-Europe. For now, they advised using another region.

@jeroenvermunt

I can confirm that it is still occurring

@Rutix

Rutix commented Oct 17, 2022

We have had these problems too. We have been in contact with Azure Support and they are saying the following:

"
Conclusion:
The 503s were detected ONLY on Azure Front End, the Front End instances encountered some unexpected error at that moment and weren’t able to handle the http requests and distributed them to specific workers that were hosting your function app.

.....

Let’s take a closer look at these 2 Front End instances at that time, Front End instance 24 encountered an error when trying to get a worker from the data role, and the same situation for the instance 5.

image
image

Unfortunately, this is an underlying platform issue as the Front End is an important component inside of Azure platform, and both the user side and our side cannot have action to interact with it, we apologize for all the inconvenience caused, but please rest assured that I’ve already reflected this to the Microsoft Azure Team and they already know this, we met this kind of issue before.
"

^ This was several months ago now. The fact that these problems are still popping up is disappointing. Telling us to move to a different region is also disappointing. We are bound by compliance requirements, so we can't leave our region that easily.

@rlucassen

Last Wednesday we got a message from Microsoft that this issue should be fixed. We had already moved to a Premium plan because we got tired of it after 4 months. I'm curious to know if anybody is still encountering this issue in the West-Europe region?

@LuckyLub

So using premium is a valid workaround?

@rlucassen

> So using premium is a valid workaround?

Yes, upgrading to premium worked for us.

@Arjan321
Author

Arjan321 commented Oct 21, 2022

> So using premium is a valid workaround?

If "Throwing money at the problem" counts as a workaround, then yes. This ticket is specifically about the Consumption Plan.

@LuckyLub

Just good to know what options are out there.

@LuckyLub

Message from MS:

Product Group is still working on improvement on the west Europe region. Will keep you updated on any progress.

@Rutix

Rutix commented Oct 25, 2022

I got some messages that we got hit again. We are not entirely sure if it was the same cause as this issue but we had "Service unavailable" a couple of times today in west europe.

@rdvansloten

@Rutix I happened upon this thread as well this afternoon, also from The Netherlands, having Functions (Consumption tier) in West Europe. They were unavailable and/or throwing SSL errors. I also could not deploy from VS Code (unavailable)

What fixed it for me is going into the Portal and Restarting my Functions manually.

@maaaNu

maaaNu commented Jun 2, 2023

Hi Guys,

Is somebody able to create a function app in euw? I was not able to create a function app for my service plan, either with Terraform or manually. E.g. if I try to use the Azure portal, I am stuck in the "validating" step:

Screenshot 2023-06-02 at 11 02 31

@FFranz93

FFranz93 commented Aug 1, 2023

Due to the quantity of comments I did not read all of them, but on our Function App (running on Windows with a Consumption Plan in West Europe) we are also facing the same issue randomly about once a week.

@milkyjoe90

I've been experiencing this since 6pm last night, linux consumption app returning 503 for both http triggered calls and deployments from Azure DevOps pipelines and releases

@javast97

javast97 commented Aug 3, 2023

If anyone is experiencing the same issue, I resolved it by refreshing the SAS token.

See this discussion below:

#9113

@rdvansloten

> If anyone is experiencing the same issue, I resolved it by refreshing the SAS token.
>
> See this discussion below:
>
> #9113

I don't think this applies to most of us. This has been happening in West Europe with Functions that don't use SAS tokens or are less than a month old.

@javast97

javast97 commented Aug 3, 2023

> > If anyone is experiencing the same issue, I resolved it by refreshing the SAS token.
> > See this discussion below:
> > #9113
>
> I don't think this applies to most of us. This has been happening in West Europe with Functions that don't use SAS tokens or are less than a month old.

It may not apply to all of you, but I didn't land in this thread by searching for "West Europe" specifically. I searched for the error, and this thread is at the top of the results, so maybe it helps someone.

@AliVaseghnia

The SAS token in the WEBSITE_RUN_FROM_PACKAGE URL had expired in my case; updating the URL fixed the issue for me.
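
For anyone wanting to rule this cause out quickly: a SAS URL carries its expiry in the `se` (signed expiry) query parameter, so it can be checked without calling Azure at all. A minimal sketch, assuming the package URL is a standard blob SAS URL; the example URL, account name, and helper names below are hypothetical:

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def sas_expiry(url: str) -> Optional[datetime]:
    """Return the expiry from the `se` parameter of a SAS URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("se")
    if not values:
        return None
    # `se` is an ISO 8601 UTC timestamp, e.g. 2023-08-01T00:00:00Z.
    expiry = datetime.fromisoformat(values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        # Date-only values carry no offset; SAS times are UTC.
        expiry = expiry.replace(tzinfo=timezone.utc)
    return expiry


def is_expired(url: str, now: Optional[datetime] = None) -> bool:
    """True if the URL has an `se` value that is not in the future."""
    expiry = sas_expiry(url)
    now = now or datetime.now(timezone.utc)
    return expiry is not None and expiry <= now


# Hypothetical package URL, for illustration only.
package_url = (
    "https://example.blob.core.windows.net/deployments/app.zip"
    "?sv=2021-08-06&se=2023-08-01T00%3A00%3A00Z&sr=b&sp=r&sig=REDACTED"
)
print(is_expired(package_url, now=datetime(2023, 8, 3, tzinfo=timezone.utc)))
# → True
```

This only inspects the URL; it does not tell you whether the token was revoked or whether the storage account is reachable.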

@mrubiottec

This has intermittently occurred on our end as well. Any update?

@ottosulin

We also see these on our Static Web Apps about once a week, for short periods. AFAIK, SWA backends run on Function Apps.

@mrubiottec

Microsoft has to act on this quickly. This looks to be a global problem.

@noontz

noontz commented Nov 27, 2023

I have the exact same problem deploying a completely new, unaltered Azure Function template from Visual Studio to a North Europe App Service Plan. The pipeline succeeds with no issues or warnings.

@bluebobbo

bluebobbo commented Nov 30, 2023

Yup, same exact issue here. East US. I also no longer see the "your function app 4.0 is running" page when I go to the function's URL. Strangely enough, a few weeks ago this was all good.

I ended up deleting the function app and service plan and recreating them, and then things worked normally for a couple of hours.

Now back to the same issue.

@isadag

isadag commented Jan 21, 2024

I thought I was going crazy until I found this post. I have a relatively new function that has now experienced this twice (today, and a couple of days ago). It is really frustrating that there is no workaround or fix. I haven't been able to find any status or outage info either when this occurs. The 503 Service Unavailable occurs without any changes in the code or settings and then resolves itself after quite some time (it took about 1 hour). Restarting the function app has no effect whatsoever.

Running a Linux Function, Consumption plan, in Sweden, and in this case I'm having issues with my HTTP-triggered function. I've tried West Europe too, but got the exact same issue there.

Any news on this? Any tips to avoid this behavior?

Isa

@isadag

isadag commented Jan 23, 2024

I wanted to add a follow-up comment since I've been comparing two different accounts (my own personal, and an account belonging to an organization I supported with some setup previously.) The thing I find interesting is that when I check both accounts and compare some stats, I see the same information for both accounts, which I assume to be because of a general (even global ish, or at least in the same region?) issue in Azure.

What stats did I look up and how?

  1. In the Azure portal, go to your function app
  2. In the left menu, click the link called Diagnose and solve problems
  3. Click the search bar and wait a few seconds for it to load before you type in: Function App Down or Reporting Errors
  4. Click on that item in the search bar's dropdown
  5. In the time-picker, select the day you had issues. In my case I set it to Wednesday the 17th of January (one of the days I had issues with my function apps), from 00:00 until 00:00 on the 18th.
  6. Let it load..... once it has analyzed the data you should get some information and stats
  7. Scroll down to the panel containing info about the issues. The panel has the heading: Host Runtime instance (Dynamic Plan) was not available for a long time period (> 15 minutes)

Both accounts (not connected to each-other in any way) are running a Linux Function, Consumption plan, in Sweden Central.

When observing the report in the Azure Portal I can see that the functions had issues for a total of 470 minutes, during which the function was running on 0 worker instances. On the 21st of January (Sunday) there seemed to be issues as well, this time for a total of 1435 minutes.

I have not found any outage information from MS.

If you are running azure functions, could you take 5 minutes to:

  • look up those two dates (17/1 and 21/1) and get back here with info whether you had the same issue and what info is stated there?
  • write info about type of function, OS, which region you're running in and whether you're running consumption/premium/dedicated.
image

@tom08zehn

tom08zehn commented Jan 26, 2024

Thanks @isadag for this step-by-step guide.

I'm running an Azure Function with an HTTP trigger (Node.js 18) on a Consumption plan in France Central and I'm also running into random HTTP 500 errors. Workaround: refresh page. Impact: unstable service to users. It's definitely not an error caused by my app because I wrapped the entire code in a try-catch-block and Application Insights does not show any application error.

Any ideas for a solution or further investigation?

For me it reports 2 errors (full exception below):

  1. Functions that are not triggering
Host Runtime instance (Dynamic Plan) was not available for a long time period (> 15 minutes)

Description: Function was running on 0 worker instance for more than 1420 minutes between 1/25/2024 8:20:00 AM and 1/26/2024 8:05:00 AM.

Possible Cause: Function App was offline due to previous deployment

image

  2. Function Executions and Errors

Detected function(s) having execution failure rate between 0.1% and 1%.

image

Timestamp : 1/25/2024 5:25:54 PM
Inner Exception Type: System.Exception
Total Occurrences: 1
Latest Exception Message:

Full Exception :
Exception while executing function: Functions.register ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: node exited with code -1073740791 (0xC0000409) ---> System.Exception
   End of inner exception
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters,FunctionInvocationContext context) in /_/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:line 101
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) in /_/src/WebJobs.Script/Description/FunctionInvokerBase.cs:line 82
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) in /_/src/WebJobs.Script/Description/FunctionGenerator.cs:line 225
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync[TReflected,TReturnValue](Object instance,Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker,ParameterHelper parameterHelper,CancellationTokenSource timeoutTokenSource,CancellationTokenSource functionCancellationTokenSource,Boolean throwOnTimeout,TimeSpan timerInterval,IFunctionInstance instance) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 581
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance,ParameterHelper parameterHelper,ILogger logger,CancellationTokenSource functionCancellationTokenSource) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 527
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance,FunctionStartedMessage message,FunctionInstanceLogEntry instanceLogEntry,ParameterHelper parameterHelper,ILogger logger,CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
   End of inner exception
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance,FunctionStartedMessage message,FunctionInstanceLogEntry instanceLogEntry,ParameterHelper parameterHelper,ILogger logger,CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance,CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108
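
The exit code in the trace above is printed both as a signed 32-bit integer and as hex. A small sketch of how the two forms relate and how the hex value can be looked up; the `KNOWN` table is a hand-picked, non-exhaustive set of Windows NTSTATUS values and the helper names are illustrative:

```python
def to_ntstatus(exit_code: int) -> str:
    """Render a signed 32-bit process exit code as an unsigned hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# A few well-known NTSTATUS codes; illustrative, not exhaustive.
KNOWN = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe(exit_code: int) -> str:
    """Look up the symbolic name for an exit code, if it is in the table."""
    return KNOWN.get(exit_code & 0xFFFFFFFF, "unknown")


print(to_ntstatus(-1073740791), describe(-1073740791))
# → 0xC0000409 STATUS_STACK_BUFFER_OVERRUN
```

Despite the name, this status is, as far as I know, also what Windows reports for fail-fast aborts in native code, so it indicates that the node process was terminated by native code rather than by a JavaScript exception. That would be consistent with a try-catch block in the function not seeing anything.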

@benrutter

Seem to be having this same issue now (also in West Europe region). We've noticed random "service is unavailable" failures for a while - although recently an entire functions service was down for 1420 minutes between 1/29/2024 9:40:00 AM and 1/30/2024 9:25:00 AM.

No outage information reported from my end - and seems to have "fixed itself". As someone trying to debug this, having an entire service just stop working without any changes on our side is kind of crazy.
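
As a sanity check on the numbers the diagnostics report gives, the length of the quoted window can be computed directly. A minimal sketch, assuming the timestamps are in the US-style format shown in the report (the `window_minutes` helper is illustrative, and `%p` parsing assumes an English locale):

```python
from datetime import datetime

# Timestamp format as printed in the diagnostics report, e.g. 1/29/2024 9:40:00 AM.
FMT = "%m/%d/%Y %I:%M:%S %p"


def window_minutes(start: str, end: str) -> int:
    """Length of the interval between two report timestamps, in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


print(window_minutes("1/29/2024 9:40:00 AM", "1/30/2024 9:25:00 AM"))
# → 1425
```

The window is 1425 minutes long, so "more than 1420 minutes" on 0 worker instances means the detector saw no workers for essentially the whole interval it looked at. The same holds for the 1/25/2024 8:20:00 AM to 1/26/2024 8:05:00 AM window reported earlier in the thread.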

@tom08zehn

For me the "service is unavailable" error (HTTP 500 status code) occurs in ~one out of 100 HTTP calls and it's immediately fixed when you reload the page (I'm serving a tiny website). There is no long outage but I see in the logs that the reason is a System.Exception.

Does anybody know how to troubleshoot this with Microsoft? Even if it's a consumption plan it's still a paid service.
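
Since a reload fixes the error in this case, callers that cannot wait for a platform fix could retry transient responses themselves. A minimal sketch of retry with exponential backoff, assuming the request is idempotent; `call_with_retry`, the status set, and the delays are illustrative choices, and `send` stands in for whatever HTTP client call you actually use:

```python
import time
from typing import Callable

# Status codes treated as transient in this sketch (an assumption).
TRANSIENT = {500, 502, 503}


def call_with_retry(send: Callable[[], int], attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `send` until it returns a non-transient status or attempts run out."""
    status = send()
    for attempt in range(1, attempts):
        if status not in TRANSIENT:
            break
        # Exponential backoff: base_delay, then 2x, then 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status = send()
    return status


# Simulated endpoint that fails once and then succeeds.
responses = iter([503, 200])
print(call_with_retry(lambda: next(responses), sleep=lambda seconds: None))
# → 200
```

This only hides short blips from end users. It does nothing for the long outages described elsewhere in this thread, and it should not be applied to requests that are unsafe to repeat.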

@isadag

isadag commented Jan 30, 2024

I am wondering the same thing, how can this be reported and escalated to an MS team??

I find it very strange that there is no information on the status page regarding these outages. The comments in this GitHub issue all show that it is not application specific or the effect of some action taken by us developers or ops teams, but rather something in the Functions platform that seems to be region-wide. It is very concerning indeed.

@noontz

noontz commented Jan 31, 2024

> Does anybody know how to troubleshoot this with Microsoft? Even if it's a consumption plan it's still a paid service.

Take the lack of feedback from MS on this issue as a clue.
From my own experience this will require a PAID support subscription, and even then there will be a reluctance from MS to recognize / admit bugs on Azure. I had another case that took 2 months before the issue was even recognized, let alone the time to find a workaround. And yes, I had to pay for that experience with a support subscription.

@tom08zehn

I'm in contact with MS Azure support (my org has a paid support subscription)... will keep you posted...

@peterboba

peterboba commented Feb 1, 2024

We're also in contact with MS support since we're experiencing HTTP 503/502 responses when trying to access https://{func}.scm.azurewebsites.net/. Mostly West Europe, but also seen in other regions.

@agravityio

We are facing the same issue. The 502/503 occurs when we apply the User Assigned Managed Identity (UAMI) to the functions on an EP1 plan. We have always used the same Bicep file, so it must be something on Microsoft's side. I tried to create the function manually in the portal in West Europe and it succeeded. But we need to access the Key Vault, so after applying the UAMI the 502/503 came back.

I am running a Windows function app (.NET 6.0 in-process) on West Europe on Elastic Premium 1. So the "pay your way out" is not an option, since we are already paying.

@fabiocav - Do you have any thoughts on that?

@ajuch

ajuch commented Feb 5, 2024

We're also affected by random 503 errors in our function app. Also in West Europe. We updated Microsoft.Azure.Functions.Worker.Sdk from 1.15.1 to 1.16.4 and updated .Net from 7 to 8. The function app works, but tests often fail with status code 503. This didn't happen before.

@agravityio

Okay. I invested some time in this issue and I have to retract my earlier theory. In my case it does not have anything to do with UAMI.

Just for all others who run into this issue: until now, it was not a problem to have the following app settings enabled on creation:

        { name: 'WEBSITE_RUN_FROM_PACKAGE' value: '1' }
        { name: 'WEBSITE_ENABLE_SYNC_UPDATE_SITE' value: 'true' }

After removing these, creating the Function App was possible again.
They will be added after pipeline deployment anyway, but on creation they caused the 502/503 response status.

@isadag

isadag commented Feb 18, 2024

@peterboba and @tom08zehn did you guys hear anything back from MS Support regarding this issue? 🙏🏻

@tom08zehn

MS is still investigating... and no news for a week. In my case it seems like the random HTTP 500 error occurs when my Node.js Function establishes a connection to the Azure SQL database (serverless) via the mssql module. The strange thing in my eyes is that my code retries all db activities (connect, query, etc.), but none of this is executed because the underlying Node.js process crashes... So it's either a general Node.js issue or a specific problem with the mssql module or with the Azure SQL db or with the Azure network or ...

@isadag

isadag commented Feb 18, 2024

OK, sounds strange. In my case the request never even hit my code; it failed as soon as the function app was requested. Instead of getting the "your app is up and running" screen, I would get the 503.

For instance let's say I have an http endpoint here https://myfunction.azurewebsites.net/api/myendpoint. Even if I just tried to browse to the function app at: https://myfunction.azurewebsites.net I'd get a 503 instead of the "up and running" screen, and none of my code is being executed from there.

@tom08zehn

In my case the Function code is executed. I see it because I added some verbose logging at the very beginning and before/after database queries.

At the moment, my gut feeling tells me that it's a general issue in the way Azure serverless works (trying to allocate resources and running the process) or it's related to networking or the database.

I added a timer trigger to my Function that runs a simple database query every 3 minutes to keep both the Function itself and the database in a "hot state" so they never fall asleep - because a cold start takes 35-75 seconds in my case, which is unacceptable.

@danbasszeti

I'm currently having this issue, mostly when I've cancelled requests that were taking ridiculously long due to the new .NET 8 isolated cold start and then retried them.

@tom08zehn

In my case, the problem was solved by switching the app's platform setting from 32-bit to 64-bit.

I had random HTTP 500 errors and the logs showed that it happened when a database connection was established. My guess is that the connection was broken at some point due to Azure's resource allocation, e.g. when cold-starting the database and sometimes also during runtime... maybe virtual memory reallocation?
