Handling failures? #24

Open
kenny-evitt opened this issue Aug 21, 2019 · 30 comments · Fixed by #50

Comments

@kenny-evitt

The app I'm maintaining references OpenIDConnect.Worker directly in the app's supervisor. Recently, I was working on the app while my internet was down, and I was very surprised that my app failed to start; it seemed to be because the worker itself was failing to start (its update_documents call was failing).

I also tested an 'intermittent' failure by using a network monitoring tool to block connections to intuit.com and then manually triggering the worker's 'update documents' action (by sending a suitable message to the worker process with my app running under iex). I was again surprised that the (expected) failure of the worker caused my app itself to shut down (and almost immediately, too).

I'm still pretty new to Elixir and Erlang/OTP, so I'm not sure whether this is 'reasonable' and I should expect to handle failures myself, or whether the Worker module itself should (at least ideally) gracefully handle errors connecting to the configured OpenID Connect providers.

@kenny-evitt
Author

I ended up writing my own little 'worker manager' GenServer and referencing it in my app's top-level supervisor; here's the manager module I ended up with:

defmodule MyApp.OpenIdConnectWorkerManager do
  use GenServer

  require Logger



  def start_link(init_arg) do
    GenServer.start_link(__MODULE__, init_arg)
  end



  # `GenServer` callbacks


  @impl true
  def init(init_arg)

  def init(:ignore) do
    :ignore
  end

  def init(_init_arg) do
    state = try_to_start_worker(nil)
    {:ok, state}
  end


  @impl true
  def handle_info(msg, state)

  def handle_info({:DOWN, _ref, :process, object, reason}, state) do
    {worker_pid, _} = state

    if object === worker_pid do
      Logger.error("OpenID Connect worker failed; reason: #{inspect reason}")
      delay_ms = next_delay_ms(nil)
      Process.send_after(self(), :try_to_start_worker, delay_ms)
      {:noreply, build_state(nil, delay_ms)}
    else
      Logger.info("Down signal received NOT matching the OpenID Connect worker process")
      {:noreply, state}
    end
  end

  def handle_info(:try_to_start_worker, state) do
    {_, previous_delay_ms} = state
    state = try_to_start_worker(previous_delay_ms)
    {:noreply, state}
  end



  # Other functions


  def build_state(worker_pid, delay_ms) do
    {worker_pid, delay_ms}
  end

  def try_to_start_worker(previous_delay_ms) do
    Logger.info("Starting OpenID Connect worker ...")

    worker_config = Application.get_env(:my_app, :openid_connect_providers)

    case GenServer.start(OpenIDConnect.Worker, worker_config, name: :openid_connect) do
      {:ok, worker_pid} ->
        Logger.info("OpenID Connect worker started")
        Process.monitor(worker_pid)
        build_state(worker_pid, nil)
      {:error, reason} ->
        Logger.error("OpenID Connect worker failed to start; reason: #{inspect reason}")
        delay_ms = next_delay_ms(previous_delay_ms)
        Process.send_after(self(), :try_to_start_worker, delay_ms)
        build_state(nil, delay_ms)
    end
  end


  def next_delay_ms(nil), do: 225_000

  def next_delay_ms(previous_delay_ms) when previous_delay_ms >= 3600_000, do: 3600_000

  def next_delay_ms(previous_delay_ms) do
    previous_delay_ms * 2
  end
end

The next_delay_ms function implements a very simple exponential backoff, starting at 225 seconds and doubling after each failed attempt to start the worker, up to a maximum of an hour.

@bcardarella
Member

There could be a better way to handle starting the worker; this would be a good starting point for discussion.

@kenny-evitt
Author

Minimally, it seems like all of the changes could be restricted to the OpenIDConnect.Worker.update_documents function:

  defp update_documents(provider, config) do
    {:ok, %{remaining_lifetime: remaining_lifetime}} =
      {:ok, documents} = OpenIDConnect.update_documents(config)

    refresh_time = time_until_next_refresh(remaining_lifetime)

    Process.send_after(self(), {:update_documents, provider}, refresh_time)

    documents
  end

The error return type from OpenIDConnect.update_documents would need to be caught/matched, a suitable value returned or assigned for documents, and a 'retry delay' assigned to refresh_time or otherwise passed to the Process.send_after call.
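For illustration, a minimal sketch of that change might look something like the following (this assumes the {:error, :update_documents, reason} error shape; the 30-second retry delay is an arbitrary placeholder, and callers would need to tolerate the {:error, reason} return value):

  defp update_documents(provider, config) do
    case OpenIDConnect.update_documents(config) do
      {:ok, %{remaining_lifetime: remaining_lifetime} = documents} ->
        refresh_time = time_until_next_refresh(remaining_lifetime)
        Process.send_after(self(), {:update_documents, provider}, refresh_time)
        documents

      {:error, :update_documents, reason} ->
        # Placeholder retry delay; schedule another attempt instead of crashing
        Process.send_after(self(), {:update_documents, provider}, 30_000)
        {:error, reason}
    end
  end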

@kenny-evitt
Author

Alternatively, it could simply be spelled out in the README that users should consider implementing their own workers if they need something more robust. For one, I could imagine that users might want to handle failures differently depending on the provider.

@kouno

kouno commented Oct 11, 2019

I ran into the same issue and I think this should be fixed within the openid_connect package itself.

Implementing an exponential backoff could be a strategy, but using a delay of 225 seconds and up to an hour, like the code snippet that @kenny-evitt posted, would be too long IMO, because my users would be unable to log in until that call succeeds. In my case, ideally, it should be retried after a few seconds and then back off exponentially until we reach 5 or 10 minutes (maybe make that configurable?). This mechanism would apply only to the failing provider, while the other providers could keep working.
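For illustration only, a backoff helper with those bounds (starting at a couple of seconds and capped at 10 minutes; the exact values are arbitrary) could mirror the earlier next_delay_ms:

  # Start small, double after each failure, cap at 10 minutes
  def next_delay_ms(nil), do: 2_000
  def next_delay_ms(previous_delay_ms) when previous_delay_ms >= 600_000, do: 600_000
  def next_delay_ms(previous_delay_ms), do: previous_delay_ms * 2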

Another point that I think should be discussed is that getting the documents initially is really important, but refreshing them is not as important (AFAICT), since I don't expect the discovery documents to be updated very often. Only if the data has been sitting there for a really long time, or if the service started responding with errors, would you really need to update the documents.

WDYT?

@bcardarella
Member

@kouno I'd be open to different strategies, or even a way to customize them. Our use cases didn't require the more complex scenarios you're describing, which is why we haven't been in a good position to implement them.

WRT refetching the discovery documents: according to the spec, everything should be refreshed when the TTL window for the resource expires. The overhead of fetching is extremely minimal.

@kouno

kouno commented Oct 11, 2019

WRT refetching the discovery documents, according to the spec everything should be refreshed when the TTL window for the resource expires.

I'm not sure what expiry you are speaking of and I can't seem to find it in the documentation (https://openid.net/specs/openid-connect-discovery-1_0.html#ProviderConfigurationResponse). Do you have a reference that would explain what we are supposed to check? Are you speaking about the cache-control header returned when querying the discovery document maybe?

@bcardarella
Member

It's been a while and I'll have to dig through the document, but when I was implementing this library I recall that all resources with an expiration should be refetched after expiration.

@kouno

kouno commented Oct 13, 2019

Seems like it's all in the HTTP headers:

defp find_max_age(headers) when is_map(headers) do
  case Regex.run(~r"(?<=max-age=)\d+", Map.get(headers, "Cache-Control", "")) do
    [max_age] -> {:ok, String.to_integer(max_age)}
    _ -> :error
  end
end

defp find_age(headers) when is_map(headers) do
  case Map.get(headers, "Age") do
    nil -> :error
    age -> {:ok, String.to_integer(age)}
  end
end

But we are ignoring the remaining lifetime of the discovery document and using the remaining lifetime of the JWKS URI instead:

with {:ok, discovery_document, _} <- fetch_resource(uri),
     {:ok, certs, remaining_lifetime} <- fetch_resource(discovery_document["jwks_uri"]),
     {:ok, jwk} <- from_certs(certs) do
  {:ok,
   %{
     discovery_document: normalize_discovery_document(discovery_document),
     jwk: jwk,
     remaining_lifetime: remaining_lifetime
   }}

However, for me that isn't even used, since my provider does not allow caching (i.e. no-cache), which means it falls back to the hard-coded value:

@refresh_time 60 * 60 * 1000

@nbryant42

nbryant42 commented Apr 8, 2020

I don't believe the discovery document is in any way time-sensitive for the vast majority of providers. I have reviewed its contents for the providers I work with (Google, and a small company in the insurance industry), and neither contains any autogenerated values. It should change only rarely.

The spec for OpenID Connect Discovery is here, and it does not say anything about how cache should be handled: https://openid.net/specs/openid-connect-discovery-1_0.html

So I think the most reliable solution is that a failure to re-retrieve the discovery document, after it has already been fetched and cached, should simply be ignored and can be attempted again at the next regular interval.

I'm happy to submit a PR for such a fix.

A failure of the initial retrieval is another story, and will inevitably cause errors until the document can be successfully retrieved, unless we build a mechanism to store its known content by other means. But except for ensuring that condition doesn't cause the entire application to exit, I'm probably going to treat such additional caching as beyond the scope of a PR.
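As a rough sketch of that idea (not the actual PR), the refresh path could fall back to the previously cached documents when a re-fetch fails; refresh_or_keep below is a hypothetical helper:

  # Sketch only: try to refresh, but keep the cached copy when the fetch fails
  defp refresh_or_keep(config, cached_documents) do
    case OpenIDConnect.update_documents(config) do
      {:ok, documents} -> documents
      {:error, :update_documents, _reason} -> cached_documents
    end
  end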

nbryant42 pushed a commit to nbryant42/openid_connect that referenced this issue Apr 8, 2020
@bcardarella
Member

Google most certainly does issue a new discovery document at the end of each cache period. Just because the values aren't changing now doesn't mean they never will.

@nbryant42

Correct. That's why the PR keeps retrying.

nbryant42 pushed a commit to nbryant42/openid_connect that referenced this issue Apr 8, 2020
@nbryant42

PR is here: #30
I've just finished testing it locally.

@danielweil

Hello guys, I am having the same problem.
Locally, the commits in the PR fix the issue. In production, they stopped the application from crashing on init, but when I try to use the authorization_uri, it gives me an error:

[screenshot of the error]

I know very little of Elixir, can someone help me out?

@bcardarella
Member

DockYard has assigned a new engineer to address this issue. @mcrumm will be taking a look at the library, probably starting tomorrow (maybe?), and attempting to resolve the issue. Sorry for the pain 😞

@danielweil

Hello @bcardarella,

Thank you. I will be waiting.

Sincerely,
Daniel

@kenny-evitt
Author

@danielweil That error – specifically the :nxdomain error reason – seems to imply that the domain of the OpenID URI you're using/trying couldn't be resolved by DNS.

The general problem of this worker crashing on errors can be solved in several of the ways mentioned or described earlier in the comments on this issue, but you're going to have to fork this repo, make the changes yourself, and update your Mix dependency for this package to use your customized version. You could probably also set up a suitable supervisor that would just restart the worker indefinitely.

It also looks like you could modify Visag.Auth.SessionController.new/2 (which appears to be your own app's code) to handle errors returned by your call to OpenIDConnect.authorization_uri/3. That might be sufficient for your needs.

@danielweil

danielweil commented Aug 6, 2020

@kenny-evitt Hey, thanks for the reply.

It is very strange, because the authentication with the SSO provider works perfectly when running the project on my machine. To make it work, I had to fork the dependency and alter the code as described in this thread. But it still gives me the error when I try it on the production machine.

I also tried adding a redirect URL in the code, instead of using authorization_url to generate the provider address for me, and when it comes back with the session information it gives the same error about "no match of right hand side value".

Can you think of a reason why the OpenID URI I'm using couldn't be resolved by DNS?

Thank you

@nbryant42

nbryant42 commented Aug 6, 2020

Hello,

I am the author of the PR mentioned in @danielweil's comment.

The intention of the PR is to prevent the app from crashing on init due to a transient DNS resolution failure such as :nxdomain. If such a failure persists after the app has started, the first attempt to resolve the domain of the OpenID URI will indeed fail, and there is nothing that can be done in such a case.

The MatchError occurs because of widespread error-handling issues in the original codebase. The problem is that many functions call a subroutine and take its result blindly, without checking for {:ok, ...} vs {:error, ...}, as is traditional Elixir style. The PR makes things incrementally better but doesn't solve the whole problem; changing things to avoid raising MatchError would involve breaking changes to the public API exposed by the OpenIDConnect module. So I think the PR at least doesn't regress anything, and I've been running it in prod at my employer for months without issue.

That said, my use case is pretty simple; I think we only call OpenIDConnect.verify/3.

@danielweil

Guys, I finally figured out what my issue was.

With the advice you gave me about failed DNS resolution, I looked up this post about DNS resolution failures within Kubernetes:
https://tech.findmypast.com/k8s-dns-lookup/

It explained that there is an issue with Alpine images. So I replaced the bitwalker/alpine-elixir-phoenix image with the elixir image, added npm through the Dockerfile, and it worked!

Thanks to all those who helped.

@kenny-evitt
Author

@danielweil Great!

I had started writing this in response:

Yeah, I can think of various reasons why DNS resolution fails, but it's very hard to pick one to focus on based on the info you've provided. (DNS can be, in the worst cases, extremely hard to debug.)

The first thing I'd check is whether you can access the OpenID URI at all from the production server, e.g. via ping or nslookup or dig.

@kenny-evitt
Author

@nbryant42 Hi – thanks for the PR!

It's still not clear what the best way is for this package to handle errors. It was relatively straightforward to set up a supervising GenServer to monitor this package's worker, and doing that lets anyone decide how best to handle intermittent (or permanent) errors for their own usage.

Otherwise, your changes look okay. I added comments on a few of them that I had questions about.

@nbryant42

Hi @kenny-evitt, thanks for the feedback. The first cut of this PR was definitely a quick-and-dirty minimal fix. I've made some updates to the PR that should bring it in line with style guidelines; feel free to have another look!

nbryant42 pushed a commit to nbryant42/openid_connect that referenced this issue Aug 6, 2020
nbryant42 pushed a commit to nbryant42/openid_connect that referenced this issue Aug 6, 2020
@nbryant42

Just rebased the PR on top of master, as well.

@tcrossland
Contributor

tcrossland commented Aug 11, 2020

Just my 2¢: For ease of library adoption, rather than having to implement some fancy incremental back-off supervisor, ideally the worker process should just be set up as documented here, e.g.

children = [
  {OpenIDConnect.Worker, Application.get_env(:my_app, :oidc_config)}
]

Supervisor.start_link(children, opts)

Ideally the worker shouldn't die in init/1 on a network error (as it does currently), since that brings down the whole supervision tree and the application. A better approach IMHO would be for it to attempt update_documents again after some configurable interval. But possibly the worker SHOULD die after a certain time limit if it has never been able to perform update_documents.

One approach would be to include a retry interval and a maximum time limit in the configuration. The worker would periodically attempt to load its initial configuration until the maximum time limit is reached, after which it could terminate.

For the general (rare?) case of multiple providers, it's probably worth considering starting a child process for each provider, so that a DNS failure for one provider doesn't bring down the others.
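A rough sketch of the per-provider idea (illustrative only; it assumes each worker can be registered under its own name, whereas the examples in this thread register a single :openid_connect process):

providers = Application.get_env(:my_app, :oidc_config)

children =
  for {provider, config} <- providers do
    %{
      id: {:oidc_worker, provider},
      start:
        {GenServer, :start_link,
         [OpenIDConnect.Worker, [{provider, config}], [name: :"openid_connect_#{provider}"]]}
    }
  end

Supervisor.start_link(children, strategy: :one_for_one)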

@anildigital

anildigital commented Mar 1, 2023

This is still an issue currently. If we pass the wrong value to discovery_document_uri, or if the network is down, it makes the whole app crash.

** (Mix) Could not start application app: App.Application.start(:normal, []) returned an error: shutdown: failed to start child: OpenIDConnect.Worker
    ** (EXIT) an exception was raised:
        ** (MatchError) no match of right hand side value: {:error, :update_documents, %HTTPoison.Error{id: nil, reason: :nxdomain}}
            (openid_connect 0.2.2) lib/openid_connect/worker.ex:55: OpenIDConnect.Worker.update_documents/2
            (openid_connect 0.2.2) lib/openid_connect/worker.ex:23: anonymous fn/1 in OpenIDConnect.Worker.init/1
            (elixir 1.13.4) lib/enum.ex:1593: Enum."-map/2-lists^map/1-0-"/2

@bcardarella
Member

@anildigital are you pulling from HEAD or from Hex?

@anildigital

@bcardarella From Hex: https://hex.pm/packages/openid_connect. It's 0.2.2 in mix.lock.

@bcardarella
Member

You may want to try pulling from this repo and see if the problem still exists.

@anildigital

anildigital commented Mar 2, 2023

@bcardarella

-  "openid_connect": {:hex, :openid_connect, "0.2.2", "c05055363330deab39ffd89e609db6b37752f255a93802006d83b45596189c0b", [:mix], [{:httpoison, "~> 1.2", [hex: :httpoison, repo: "hexpm", optional: false]}, {:jason, ">= 1.0.0", [hex: :jason, repo: "hexpm", optional: false]}, {:jose, "~> 1.8", [hex: :jose, repo: "hexpm", optional: false]}], "hexpm", "735769b6d592124b58edd0582554ce638524c0214cd783d8903d33357d74cc13"},
+  "openid_connect": {:git, "https://github.com/DockYard/openid_connect.git", "4cef83893f4dcb4ac44ed50aace83dc597ca98aa", []},

Using the git version also doesn't fix the issue.
