Handling failures? #24

Open
kenny-evitt opened this issue Aug 21, 2019 · 30 comments · Fixed by #50

Comments

@kenny-evitt

The app I'm maintaining references OpenIDConnect.Worker directly in the app's supervisor. Recently, I was working on the app while my internet was down, and I was very surprised that my app failed to start; it seemed to be because the worker itself was failing to start (its update_documents call was failing).

I also tested an 'intermittent' failure by using a network monitoring tool to block connections to intuit.com and then manually triggering the worker's 'update documents' action (by sending a suitable message to the worker process with my app running under iex). I was again surprised that the (expected) failure of the worker caused my app itself to shut down (and almost immediately, too).

I'm still pretty new to Elixir and Erlang/OTP, so I'm not sure whether this is 'reasonable' and I should expect to handle failures myself, or whether the Worker module itself should (at least ideally) gracefully handle errors connecting to the configured OpenID Connect providers.

@kenny-evitt
Author

I ended up writing my own little 'worker manager' GenServer and referencing it in my app's top-level supervisor; here's the manager module I ended up with:

defmodule MyApp.OpenIdConnectWorkerManager do
  use GenServer

  require Logger



  def start_link(init_arg) do
    GenServer.start_link(__MODULE__, init_arg)
  end



  # `GenServer` callbacks


  @impl true
  def init(init_arg)

  def init(:ignore) do
    :ignore
  end

  def init(_init_arg) do
    state = try_to_start_worker(nil)
    {:ok, state}
  end


  @impl true
  def handle_info(msg, state)

  def handle_info({:DOWN, _ref, :process, object, reason}, state) do
    {worker_pid, _} = state

    if object === worker_pid do
      Logger.error("OpenID Connect worker failed; reason: #{inspect reason}")
      delay_ms = next_delay_ms(nil)
      Process.send_after(self(), :try_to_start_worker, delay_ms)
      {:noreply, build_state(nil, delay_ms)}
    else
      Logger.info("Down signal received NOT matching the OpenID Connect worker process")
      {:noreply, state}
    end
  end

  def handle_info(:try_to_start_worker, state) do
    {_, previous_delay_ms} = state
    state = try_to_start_worker(previous_delay_ms)
    {:noreply, state}
  end



  # Other functions


  def build_state(worker_pid, delay_ms) do
    {worker_pid, delay_ms}
  end

  def try_to_start_worker(previous_delay_ms) do
    Logger.info("Starting OpenID Connect worker ...")

    worker_config = Application.get_env(:my_app, :openid_connect_providers)

    case GenServer.start(OpenIDConnect.Worker, worker_config, name: :openid_connect) do
      {:ok, worker_pid} ->
        Logger.info("OpenID Connect worker started")
        Process.monitor(worker_pid)
        build_state(worker_pid, nil)
      {:error, reason} ->
        Logger.error("OpenID Connect worker failed to start; reason: #{inspect reason}")
        delay_ms = next_delay_ms(previous_delay_ms)
        Process.send_after(self(), :try_to_start_worker, delay_ms)
        build_state(nil, delay_ms)
    end
  end


  def next_delay_ms(nil), do: 225_000

  def next_delay_ms(previous_delay_ms) when previous_delay_ms >= 3600_000, do: 3600_000

  def next_delay_ms(previous_delay_ms) do
    previous_delay_ms * 2
  end
end

The next_delay_ms function implements a very simple exponential backoff, starting at 225 seconds and doubling after each failed attempt to start the worker, up to a maximum of an hour.

@bcardarella
Member

There could be a better way to handle starting the worker; this would be a good starting point for discussion.

@kenny-evitt
Author

Minimally, it seems like all of the changes could be restricted to the OpenIDConnect.Worker.update_documents function:

  defp update_documents(provider, config) do
    {:ok, %{remaining_lifetime: remaining_lifetime}} =
      {:ok, documents} = OpenIDConnect.update_documents(config)

    refresh_time = time_until_next_refresh(remaining_lifetime)

    Process.send_after(self(), {:update_documents, provider}, refresh_time)

    documents
  end

The error return type from OpenIDConnect.update_documents would need to be caught/matched, a suitable value returned or assigned for documents, and a 'retry delay' assigned to refresh_time or otherwise passed to the Process.send_after call.
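For illustration, a minimal sketch of that change might look something like the following (this assumes the {:error, :update_documents, reason} error shape; the 30-second retry delay is an arbitrary placeholder, and callers would need to tolerate the {:error, reason} return value):

  defp update_documents(provider, config) do
    case OpenIDConnect.update_documents(config) do
      {:ok, %{remaining_lifetime: remaining_lifetime} = documents} ->
        refresh_time = time_until_next_refresh(remaining_lifetime)
        Process.send_after(self(), {:update_documents, provider}, refresh_time)
        documents

      {:error, :update_documents, reason} ->
        # Placeholder retry delay; schedule another attempt instead of crashing
        Process.send_after(self(), {:update_documents, provider}, 30_000)
        {:error, reason}
    end
  end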

@kenny-evitt
Author

Alternatively, it could simply be spelled out in the README that users should consider implementing their own workers if they need something more robust. For one, I could imagine that users might want to handle failures differently depending on the provider.

@kouno

kouno commented Oct 11, 2019

I ran into the same issue and I think this should be fixed within the openid_connect package itself.

Implementing an exponential backoff could be a strategy, but using a delay of 225 seconds and up to an hour, like the code snippet that @kenny-evitt posted, would be too long IMO, because my users would be unable to log in until that call succeeds. In my case, ideally, it should be retried after a few seconds and then back off exponentially until we reach 5 or 10 minutes (maybe make that configurable?). This mechanism would apply only to the failing provider, while the other providers could keep working.
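For illustration only, a backoff helper with those bounds (starting at a couple of seconds and capped at 10 minutes; the exact values are arbitrary) could mirror the earlier next_delay_ms:

  # Start small, double after each failure, cap at 10 minutes
  def next_delay_ms(nil), do: 2_000
  def next_delay_ms(previous_delay_ms) when previous_delay_ms >= 600_000, do: 600_000
  def next_delay_ms(previous_delay_ms), do: previous_delay_ms * 2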

Another point that I think should be discussed is that getting the documents initially is really important, but refreshing them is not as important (AFAICT), since I don't expect the discovery documents to be updated very often. Only if the data has been sitting there for a really long time, or if the service started responding with errors, would you really need to update the documents.

WDYT?

@bcardarella
Member

@kouno I'd be open to different strategies, or even a way to customize them. Our use cases didn't require the more complex scenarios you're describing, which is why we haven't been in a good position to implement them.

WRT refetching the discovery documents: according to the spec, everything should be refreshed when the TTL window for the resource expires. The overhead of fetching is extremely minimal.

@kouno

kouno commented Oct 11, 2019

WRT refetching the discovery documents, according to the spec everything should be refreshed when the TTL window for the resource expires.

I'm not sure what expiry you are speaking of and I can't seem to find it in the documentation (https://openid.net/specs/openid-connect-discovery-1_0.html#ProviderConfigurationResponse). Do you have a reference that would explain what we are supposed to check? Are you speaking about the cache-control header returned when querying the discovery document maybe?

@bcardarella
Member

It's been a while and I'll have to dig through the document, but when I was implementing this library I recall that all resources with an expiration should be refetched after expiration.

@kouno

kouno commented Oct 13, 2019

Seems like it's all in the HTTP headers:

defp find_max_age(headers) when is_map(headers) do
  case Regex.run(~r"(?<=max-age=)\d+", Map.get(headers, "Cache-Control", "")) do
    [max_age] -> {:ok, String.to_integer(max_age)}
    _ -> :error
  end
end

defp find_age(headers) when is_map(headers) do
  case Map.get(headers, "Age") do
    nil -> :error
    age -> {:ok, String.to_integer(age)}
  end
end

But we are ignoring the remaining lifetime of the discovery document and using the remaining lifetime of the JWKS URI instead:

with {:ok, discovery_document, _} <- fetch_resource(uri),
     {:ok, certs, remaining_lifetime} <- fetch_resource(discovery_document["jwks_uri"]),
     {:ok, jwk} <- from_certs(certs) do
  {:ok,
   %{
     discovery_document: normalize_discovery_document(discovery_document),
     jwk: jwk,
     remaining_lifetime: remaining_lifetime
   }}

However, for me that isn't even used, since my provider does not allow caching (i.e. no-cache), which means it falls back to the hard-coded value:

@refresh_time 60 * 60 * 1000

@nbryant42

nbryant42 commented Apr 8, 2020

I don't believe the discovery document is in any way time-sensitive for the vast majority of providers. I have reviewed its contents for the providers I work with (Google, and a small company in the insurance industry), and neither contains any autogenerated values. It should change only rarely.

The spec for OpenID Connect Discovery is here, and it does not say anything about how cache should be handled: https://openid.net/specs/openid-connect-discovery-1_0.html

So I think the most reliable solution is that a failure to re-retrieve the discovery document, after it has already been fetched and cached, should simply be ignored and can be attempted again at the next regular interval.

I'm happy to submit a PR for such a fix.

A failure of the initial retrieval is another story, and will inevitably cause errors until the document can be successfully retrieved, unless we build a mechanism to store its known content by other means. But except for ensuring that condition doesn't cause the entire application to exit, I'm probably going to treat such additional caching as beyond the scope of a PR.
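As a rough sketch of that idea (not the actual PR), the refresh path could fall back to the previously cached documents when a re-fetch fails; refresh_or_keep below is a hypothetical helper:

  # Sketch only: try to refresh, but keep the cached copy when the fetch fails
  defp refresh_or_keep(config, cached_documents) do
    case OpenIDConnect.update_documents(config) do
      {:ok, documents} -> documents
      {:error, :update_documents, _reason} -> cached_documents
    end
  end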

nbryant42 pushed a commit to nbryant42/openid_connect that referenced this issue Apr 8, 2020
@bcardarella
Member

Google most certainly does issue a new discovery document at the end of each cache period. Just because the values aren't changing now doesn't mean they never will.

@nbryant42

Correct. That's why the PR keeps retrying.

nbryant42 pushed a commit to nbryant42/openid_connect that referenced this issue Apr 8, 2020
@nbryant42

PR is here: #30
I've just finished testing it locally.

@danielweil

Hello guys, I am having the same problem.
Locally, the commits in the PR fix the issue. In production, they stopped the application from crashing on init, but when I try to use the authorization_uri, it gives me an error:

[screenshot of the error]

I know very little of Elixir, can someone help me out?

@bcardarella
Member

DockYard has assigned a new engineer to address this issue. @mcrumm will be taking a look at the library, probably starting tomorrow (maybe?), and attempting to resolve the issue. Sorry for the pain 😞

@danielweil

Hello @bcardarella,

Thank you. I will be waiting.

Sincerely,
Daniel

@kenny-evitt
Author

@danielweil That error – specifically the :nxdomain error reason – seems to imply that the domain of the OpenID URI you're using/trying couldn't be resolved by DNS.

The general problem of this worker crashing on errors can be solved in several of the ways mentioned or described earlier in the comments on this issue, but you're going to have to fork this repo, make the changes yourself, and update your Mix dependency for this package to use your customized version. You could probably also set up a suitable supervisor that would just restart the worker indefinitely.

It also looks like you could modify Visag.Auth.SessionController.new/2 (which appears to be your own app's code) to handle errors returned by your call to OpenIDConnect.authorization_uri/3. That might be sufficient for your needs.

@danielweil

danielweil commented Aug 6, 2020

@kenny-evitt Hey, thanks for the reply.

It is very strange, because the authentication with the SSO provider works perfectly when running the project on my machine. To make it work, I had to fork the dependency and alter the code as described in this thread. But it still gives me the error when I try it on the production machine.

I also tried adding a redirect URL in the code, instead of using authorization_url to generate the provider address for me, and when it comes back with the session information it gives the same error about "no match of right hand side value".

Can you think of a reason why the OpenID URI I'm using couldn't be resolved by DNS?

Thank you

@nbryant42

nbryant42 commented Aug 6, 2020

Hello,

I am the author of the PR mentioned in @danielweil's comment.

The intention of the PR is to prevent the app from crashing on init due to a transient DNS resolution failure such as :nxdomain. If such a failure persists after the app has started, the first attempt to resolve the domain of the OpenID URI will indeed fail, and there is nothing that can be done in such a case.

The MatchError occurs because of widespread error-handling issues in the original codebase. The problem is that many functions call a subroutine and take its result blindly, without checking for {:ok, ...} vs {:error, ...}, as is traditional Elixir style. The PR makes things incrementally better but doesn't solve the whole problem; changing things to avoid raising MatchError would involve breaking changes to the public API exposed by the OpenIDConnect module. So I think the PR at least doesn't regress anything, and I've been running it in prod at my employer for months without issue.

That said, my use case is pretty simple; I think we only call OpenIDConnect.verify/3.

@danielweil

Guys, I finally figured out what my issue was.

With the advice you gave me about failed DNS resolution, I looked up this post about DNS resolution failures within Kubernetes:
https://tech.findmypast.com/k8s-dns-lookup/

It explained that there is an issue with Alpine images. So I replaced the bitwalker/alpine-elixir-phoenix image with the elixir image, added npm through the Dockerfile, and it worked!

Thanks to all those who helped.

@kenny-evitt
Author

@danielweil Great!

I had started writing this in response:

Yeah, I can think of various reasons why DNS resolution fails, but it's very hard to pick one to focus on based on the info you've provided. (DNS can be, in the worst cases, extremely hard to debug.)

The first thing I'd check is whether you can access the OpenID URI at all from the production server, e.g. via ping or nslookup or dig.

@kenny-evitt
Author

@nbryant42 Hi – thanks for the PR!

It's still not clear what the best way is for this package to handle errors. It was relatively straightforward to set up a supervising GenServer to monitor this package's worker, and doing that lets anyone decide how best to handle intermittent (or permanent) errors for their own usage.

Otherwise, your changes look okay. I added comments on a few of them that I had questions about.

@nbryant42

Hi @kenny-evitt, thanks for the feedback. The first cut of this PR was definitely a quick-and-dirty minimal fix. I've made some updates to the PR that should bring it in line with style guidelines; feel free to have another look!

nbryant42 pushed a commit to nbryant42/openid_connect that referenced this issue Aug 6, 2020
nbryant42 pushed a commit to nbryant42/openid_connect that referenced this issue Aug 6, 2020
@nbryant42

Just rebased the PR on top of master, as well.

@tcrossland
Contributor

tcrossland commented Aug 11, 2020

Just my 2¢: For ease of library adoption, rather than having to implement some fancy incremental back-off supervisor, ideally the worker process should just be set up as documented here, e.g.

children = [
  {OpenIDConnect.Worker, Application.get_env(:my_app, :oidc_config)}
]

Supervisor.start_link(children, opts)

Ideally the worker shouldn't die in init/1 on a network error (as it does currently), since that brings down the whole supervision tree and the application. A better approach IMHO would be for it to attempt update_documents again after some configurable interval. But possibly the worker SHOULD die after a certain time limit if it has never been able to perform update_documents.

One approach would be to include a retry interval and a maximum time limit in the configuration. The worker would periodically attempt to load its initial configuration until the maximum time limit is reached, after which it could terminate.

For the general (rare?) case of multiple providers, it's probably worth considering starting a child process for each provider, so that a DNS failure for one provider doesn't bring down the others.
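A rough sketch of the per-provider idea (illustrative only; it assumes each worker can be registered under its own name, whereas the examples in this thread register a single :openid_connect process):

providers = Application.get_env(:my_app, :oidc_config)

children =
  for {provider, config} <- providers do
    %{
      id: {:oidc_worker, provider},
      start:
        {GenServer, :start_link,
         [OpenIDConnect.Worker, [{provider, config}], [name: :"openid_connect_#{provider}"]]}
    }
  end

Supervisor.start_link(children, strategy: :one_for_one)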

@anildigital

anildigital commented Mar 1, 2023

This is still an issue currently. If we pass the wrong value to discovery_document_uri, or if the network is down, it makes the whole app crash.

** (Mix) Could not start application app: App.Application.start(:normal, []) returned an error: shutdown: failed to start child: OpenIDConnect.Worker
    ** (EXIT) an exception was raised:
        ** (MatchError) no match of right hand side value: {:error, :update_documents, %HTTPoison.Error{id: nil, reason: :nxdomain}}
            (openid_connect 0.2.2) lib/openid_connect/worker.ex:55: OpenIDConnect.Worker.update_documents/2
            (openid_connect 0.2.2) lib/openid_connect/worker.ex:23: anonymous fn/1 in OpenIDConnect.Worker.init/1
            (elixir 1.13.4) lib/enum.ex:1593: Enum."-map/2-lists^map/1-0-"/2

@bcardarella
Member

@anildigital are you pulling from HEAD or from Hex?

@anildigital

@bcardarella From Hex: https://hex.pm/packages/openid_connect. It's 0.2.2 in mix.lock.

@bcardarella
Member

You may want to try pulling from this repo and see if the problem still exists.

@anildigital

anildigital commented Mar 2, 2023

@bcardarella

-  "openid_connect": {:hex, :openid_connect, "0.2.2", "c05055363330deab39ffd89e609db6b37752f255a93802006d83b45596189c0b", [:mix], [{:httpoison, "~> 1.2", [hex: :httpoison, repo: "hexpm", optional: false]}, {:jason, ">= 1.0.0", [hex: :jason, repo: "hexpm", optional: false]}, {:jose, "~> 1.8", [hex: :jose, repo: "hexpm", optional: false]}], "hexpm", "735769b6d592124b58edd0582554ce638524c0214cd783d8903d33357d74cc13"},
+  "openid_connect": {:git, "https://github.com/DockYard/openid_connect.git", "4cef83893f4dcb4ac44ed50aace83dc597ca98aa", []},

Using the git version also doesn't fix the issue.
