
Consul name resolution in-memory cache #3121

Merged
merged 30 commits into dapr:master on Oct 13, 2023

Conversation

a-elsheikh
Contributor

Description

Added a service cache to the Consul name resolution component

Clone of the ancient PR #963 in order to work around the DCO check

Issue reference

#934

Checklist

Please make sure you've completed the relevant tasks for this PR, out of the following list:

  • Code compiles correctly
  • Created/updated tests
  • Extended the documentation / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]

Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
@a-elsheikh a-elsheikh requested review from a team as code owners September 11, 2023 12:41
@a-elsheikh a-elsheikh changed the title from "nr_consul_cache squashed commits to resolve dco" to "Consul name resolution in-memory cache (DCO)" on Sep 11, 2023
Contributor

@ItalyPaleAle ItalyPaleAle left a comment


First quick pass - I am not very familiar with blocking queries, could you please help me understand how this PR works?

Also, how does the in-memory cache get purged? And is there a way to limit its size?

Co-authored-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
@a-elsheikh
Contributor Author

First quick pass - I am not very familiar with blocking queries, could you please help me understand how this PR works?

Also, how does the in-memory cache get purged? And is there a way to limit its size?

Thanks for picking this up :)

The way this works is: whenever a request comes in to resolve an app-id, the registry is checked for an entry. If one doesn't exist, a watcher routine is kicked off for the service and the request falls back to the vanilla route (i.e. the service is requested directly from the consul agent).

The watcher routine uses HTTP long polling (consul "blocking queries") against the consul agent to track the set of healthy nodes for the given service/app-id. Whenever there is a change in state, the agent responds with the new set of healthy nodes, which is written to the registry entry, and the routine then continues on to the next long poll.

Any given key (i.e. app-id) in the registry is kept for as long as its watcher routine is running; if a routine fails, it removes its key from the registry. Finally, the current implementation has no way of limiting the cache size.
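To sketch the shape of this, here's a minimal per-service watcher using the official Consul Go client (github.com/hashicorp/consul/api). The registry type and its method names are illustrative stand-ins, not the exact code in the PR:

package nrconsul

import (
	"log"
	"sync"
	"time"

	consul "github.com/hashicorp/consul/api"
)

// registry is a minimal stand-in for the in-memory cache described above.
type registry struct {
	mu      sync.RWMutex
	entries map[string][]*consul.ServiceEntry
}

func newRegistry() *registry {
	return &registry{entries: map[string][]*consul.ServiceEntry{}}
}

func (r *registry) addOrUpdate(service string, nodes []*consul.ServiceEntry) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[service] = nodes
}

func (r *registry) remove(service string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.entries, service)
}

// watchService long-polls the Consul health API (a "blocking query") for one
// service and keeps the registry entry up to date until the query fails.
func watchService(client *consul.Client, reg *registry, service string) {
	var lastIndex uint64
	for {
		entries, meta, err := client.Health().Service(service, "", true, &consul.QueryOptions{
			WaitIndex: lastIndex,        // block until the index moves past this value
			WaitTime:  10 * time.Minute, // upper bound on how long the agent holds the request
		})
		if err != nil {
			log.Printf("watch for %q failed: %v", service, err)
			reg.remove(service) // the key is evicted when its watcher dies
			return
		}
		if meta.LastIndex != lastIndex {
			lastIndex = meta.LastIndex
			reg.addOrUpdate(service, entries) // store the new set of healthy nodes
		}
	}
}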

@ItalyPaleAle
Contributor

@a-elsheikh thanks for explaining! This seems like a very interesting approach. I do have some concerns however:

  1. It seems that there's a possibly unbounded number of goroutines being spawned. Is it possible to have a single watcher goroutine for all keys? Each goroutine adds quite a bit of resources and pressure on the GC
  2. Also, if there's a long-polling request per each key, wouldn't each request be an open TCP socket? I am concerned with the number of open connections and sockets that could cause. Again, a single watcher would be a lot more efficient.

@a-elsheikh
Contributor Author

@a-elsheikh thanks for explaining! This seems like a very interesting approach. I do have some concerns however:

  1. It seems that there's a possibly unbounded number of goroutines being spawned. Is it possible to have a single watcher goroutine for all keys? Each goroutine adds quite a bit of resources and pressure on the GC
  2. Also, if there's a long-polling request per each key, wouldn't each request be an open TCP socket? I am concerned with the number of open connections and sockets that could cause. Again, a single watcher would be a lot more efficient.

@ItalyPaleAle - to answer both questions: as far as I'm aware there is no way to invoke the consul health API for multiple services/app-ids, only one per request. With that said, the number of goroutines used by this pattern will be equal to the number of unique services/app-ids the sidecar is resolving; in that context I don't see how the goroutines would be expensive unless an app is resolving a very large number of distinct app-ids.

I do, though, appreciate the concern about the lack of an upper bound (or even a time-based expiry) on the key set, especially since it also scales the number of connections; such is the cost of long polling. Of course, the impact of all this depends on the deployment type, i.e. edge vs. k8s.

What do you think about adding a configurable upper bound combined with an LRU policy on the key set, so that by default the sidecar only tracks n services?
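For illustration, a bounded registry along those lines might look roughly like the sketch below, using hashicorp/golang-lru; the names and the eviction callback are placeholders, not anything in this PR:

package nrconsul

import (
	consul "github.com/hashicorp/consul/api"
	lru "github.com/hashicorp/golang-lru/v2"
)

type boundedRegistry struct {
	cache *lru.Cache[string, []*consul.ServiceEntry]
}

// newBoundedRegistry tracks at most maxServices app-ids; when the least
// recently resolved key is evicted, its watcher is stopped via the callback.
func newBoundedRegistry(maxServices int, stopWatcher func(service string)) (*boundedRegistry, error) {
	c, err := lru.NewWithEvict[string, []*consul.ServiceEntry](maxServices,
		func(service string, _ []*consul.ServiceEntry) {
			stopWatcher(service)
		})
	if err != nil {
		return nil, err
	}
	return &boundedRegistry{cache: c}, nil
}

func (b *boundedRegistry) get(service string) ([]*consul.ServiceEntry, bool) {
	return b.cache.Get(service) // Get also marks the key as recently used
}

func (b *boundedRegistry) addOrUpdate(service string, nodes []*consul.ServiceEntry) {
	b.cache.Add(service, nodes)
}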

FYI @berndverst

@ItalyPaleAle
Contributor

ItalyPaleAle commented Sep 12, 2023

@a-elsheikh thanks for explaining! This seems like a very interesting approach. I do have some concerns however:

  1. It seems that there's a possibly unbounded number of goroutines being spawned. Is it possible to have a single watcher goroutine for all keys? Each goroutine adds quite a bit of resources and pressure on the GC
  2. Also, if there's a long-polling request per each key, wouldn't each request be an open TCP socket? I am concerned with the number of open connections and sockets that could cause. Again, a single watcher would be a lot more efficient.

@ItalyPaleAle - to answer both questions: as far as I'm aware there is no way to invoke the consul health API for multiple services/app-ids, only one per request. With that said, the number of goroutines used by this pattern will be equal to the number of unique services/app-ids the sidecar is resolving; in that context I don't see how the goroutines would be expensive unless an app is resolving a very large number of distinct app-ids.

I do, though, appreciate the concern about the lack of an upper bound (or even a time-based expiry) on the key set, especially since it also scales the number of connections; such is the cost of long polling. Of course, the impact of all this depends on the deployment type, i.e. edge vs. k8s.

What do you think about adding a configurable upper bound combined with an LRU policy on the key set, so that by default the sidecar only tracks n services?

FYI @berndverst

Sadly, I don't think we can assume that an app only talks to a small number of other apps; that's a dangerous assumption IMHO.

Also, apps can scale horizontally, possibly to dozens or hundreds of instances. In this case, I am more concerned about the usage of TCP sockets, and potential for TCP port exhaustion on the consul container. :(

I think an LRU policy, or even just a timeout, may help.

A few other options:

  1. Enable this only for gRPC connections to Consul. This will still have the issues with goroutines (which could be mitigated by a timeout or something). But gRPC uses multiplexing so there's a single connection (and single TCP socket) per app
  2. Consider doing this a bit more lazily? For example:
    • Keep a value in cache for a few seconds (could be small)
    • Before the value expires, ping Consul to check if the value is still valid and refresh the cache. This can be done in background with a one-off request, so it doesn't require long polling.
    • Continue refreshing in background, on an interval.
    • Stop refreshing a value if it hasn't been used in a certain amount of time
    • This has the potential of returning stale values, such as instances of an app that are dead. But this is fine, as apps should be able to handle these scenarios (and Dapr resiliency can help)

@berndverst
Member

@a-elsheikh unfortunately there are still too many details to be figured out, and getting this PR merged in 1.12 (technically already well beyond code freeze) was a stretch. With that said, we are committed to helping you get this PR merged in time for Dapr 1.13.

@berndverst berndverst added this to the v1.13 milestone Sep 12, 2023
@berndverst berndverst added the do-not-merge PR is not ready for merging label Sep 12, 2023
@a-elsheikh
Contributor Author

Sadly, I don't think we can assume that an app only talks to a small number of other apps; that's a dangerous assumption IMHO.

Also, apps can scale horizontally, possibly to dozens or hundreds of instances. In this case, I am more concerned about the usage of TCP sockets, and potential for TCP port exhaustion on the consul container. :(

I think an LRU policy, or even just a timeout, may help.

A few other options:

  1. Enable this only for gRPC connections to Consul. This will still have the issues with goroutines (which could be mitigated by a timeout or something). But gRPC uses multiplexing so there's a single connection (and single TCP socket) per app

  2. Consider doing this a bit more lazily? For example:

    • Keep a value in cache for a few seconds (could be small)
    • Before the value expires, ping Consul to check if the value is still valid and refresh the cache. This can be done in background with a one-off request, so it doesn't require long polling.
    • Continue refreshing in background, on an interval.
    • Stop refreshing a value if it hasn't been used in a certain amount of time
    • This has the potential of returning stale values, such as instances of an app that are dead. But this is fine, as apps should be able to handle these scenarios (and Dapr resiliency can help)

@ItalyPaleAle you're right, a growing number of clients against the consul agent would be problematic. I've struggled to find any documentation around the gRPC streaming (this) - I think it's only intended for consul->consul comms. It would actually solve both problems, as we could use a single goroutine to subscribe to all the changes and update the cache. I'll keep an eye out, but let me know if you're aware of any docs/spec for a gRPC API on the consul agents.

The next best thing: I've found a query on the HTTP health API that allows us to track state changes for multiple services with a single call. The returned dataset is limited, so it can only act as a notification and we'll need another request to resolve the target nodes for any changed services; this sacrifices some latency but should still be a much better outcome.
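Roughly, the idea is something like this hedged sketch (the exact endpoint usage and index comparison are my assumption of the approach, not necessarily what the final code will do):

package nrconsul

import (
	consul "github.com/hashicorp/consul/api"
)

// watchAllServices runs one blocking query over all health checks and reports
// which of the tracked services changed since lastIndex. The returned data is
// only a notification; the caller re-resolves each changed service with a
// separate Health().Service() call.
func watchAllServices(client *consul.Client, tracked map[string]struct{}, lastIndex uint64) ([]string, uint64, error) {
	checks, meta, err := client.Health().State(consul.HealthAny, &consul.QueryOptions{
		WaitIndex: lastIndex, // long poll: returns when any health check changes
	})
	if err != nil {
		return nil, lastIndex, err
	}

	changed := make([]string, 0)
	seen := make(map[string]struct{})
	for _, check := range checks {
		if _, ok := tracked[check.ServiceName]; !ok {
			continue // not a service we are caching
		}
		if _, dup := seen[check.ServiceName]; dup {
			continue
		}
		// A check whose ModifyIndex moved past our last blocking index means
		// that service's health changed and its cache entry needs a refresh.
		if check.ModifyIndex > lastIndex {
			seen[check.ServiceName] = struct{}{}
			changed = append(changed, check.ServiceName)
		}
	}
	return changed, meta.LastIndex, nil
}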

@ItalyPaleAle
Contributor

The next best thing: I've found a query on the HTTP health API that allows us to track state changes for multiple services with a single call. The returned dataset is limited, so it can only act as a notification and we'll need another request to resolve the target nodes for any changed services; this sacrifices some latency but should still be a much better outcome.

That sounds like an interesting idea! If the refreshing is done in background, the latency impact should not make a huge difference anyways.

@berndverst berndverst removed the do-not-merge PR is not ready for merging label Sep 19, 2023
…vices and updating cache

Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
…call, added more comments

Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Contributor

@ItalyPaleAle ItalyPaleAle left a comment


Thanks for making these changes, it looks good :)

I have left some comments. Could you also please try running the tests for this package with the -race flag and confirm there are no race conditions?
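For reference, from the repo root that would be something along the lines of:

go test -race ./nameresolution/consul/...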

	if _, ok := serviceKeys[service]; !ok {
		r.registry.addOrUpdate(service, nil)
	}
default:
Contributor


The problem with this approach is that you are racing with the producer. If the producer adds to the channel faster than you consume them, you will end the loop sooner.

Consider maybe making registrationChannel accept a slice so you can send all keys to add in bulk, in a single message over the channel?

Contributor Author


Sorry, there's a mistake in that comment - I've rephrased both possibilities of the produce/consume statement and addressed them below.

If the producer adds to the channel faster than you consume them, the loop will not end.

This is only possible if the user is constantly resolving new IDs, in that bizarre case caching would be irrelevant.

If the consumer reads from the channel faster than the producer writes, you will end the loop sooner.

This is obviously the expected behaviour.

The registration channel is only written to when a new service/app-id that is not in the registry is being resolved; there is really no batching concept here, so it doesn't quite make sense to make it a slice channel. To elaborate (a sketch of the drain loop follows the two lists below):

ResolveID (sidecar invoke)

  1. Resolver checks registry for key
  2. Key does not exist and is written to channel
  3. Resolver invokes consul agent directly

Watcher routine (perpetual)

  1. Watcher picks up key from the channel and cancels long poll
  2. Watcher adds the key to the registry
  3. Watcher checks for any other new keys on channel
  4. Watcher invokes long poll with new set of keys
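Here's a minimal, self-contained sketch of that drain step; the function and parameter names are placeholders, not the PR's code:

package nrconsul

import "context"

// drainRegistrations blocks for the first new app-id, then drains any others
// already queued without blocking, mirroring steps 1-4 above. register and
// poll are stand-ins for "add key to registry" and "run the next long poll".
func drainRegistrations(ctx context.Context, registrations <-chan string,
	known map[string]struct{}, register func(string), poll func()) {
	// Steps 1-2: wait for at least one new key, then register it.
	select {
	case <-ctx.Done():
		return
	case service := <-registrations:
		if _, ok := known[service]; !ok {
			known[service] = struct{}{}
			register(service)
		}
	}

	// Step 3: consume any other keys already on the channel; the default
	// case exits as soon as the channel is empty (the producer/consumer
	// race discussed above only matters if new IDs keep arriving).
	for {
		select {
		case service := <-registrations:
			if _, ok := known[service]; !ok {
				known[service] = struct{}{}
				register(service)
			}
		default:
			// Step 4: kick off the next long poll with the updated key set.
			poll()
			return
		}
	}
}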


func (r *resolver) runWatchPlan(p *watchPlan, services []string, ctx context.Context, watchTask chan bool) {
	defer func() {
		recover()
Contributor


Why is there a need for a recover()? We normally try to avoid panicking entirely

Contributor Author


The recover() here was to cater for any transport panics that might occur from within the consul libs. It's clear to me now that this was a false expectation - panics do not behave like exceptions, and there are no explicit panics referenced within the consul client API :) Will remove.

a-elsheikh and others added 4 commits October 4, 2023 10:07
Co-authored-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Co-authored-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Co-authored-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Co-authored-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
@a-elsheikh
Contributor Author

Thanks for making these changes, it looks good :)

I have left some comments. Could you also please try running the tests for this package with the -race flag and confirm there are no race conditions?

@ItalyPaleAle thanks for raising this one - I didn't know this feature existed, it's awesome. It found races in the test mocks where invocation counts are checked and waited on; in any case, I've fixed those now, so it's a clean run every time 👍

@ItalyPaleAle
Contributor

@a-elsheikh thank you for your patience while we're reviewing this! With the 1.12 release repeatedly delayed, and Bernd OOF for a while, it's taking me longer to review these PRs :(

I have a few cosmetic changes. I cannot push to your branch, so can you please merge this?

curl -L "https://github.com/dapr/components-contrib/commit/346bcb17a2b8934ef48558503d179261ba65a98e.patch" | git am

After that... At this stage this PR looks really good, and it's almost ready to be merged :)

The only thing missing is the ability for the component to perform a clean shutdown. For most components we implement io.Closer (i.e. add a Close() error method) so they can stop background goroutines etc. This component didn't have any background processes, so we didn't implement it. However, now you have a background process started by r.startWatcher(), so we need to implement:

  • A Close() error method that Dapr can invoke to shut down the component
  • In the close method, there needs to be a way to stop the background watcher (a nice way could be using a channel for example)

You can see an example in the mDNS resolver!
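A minimal sketch of the idea, assuming a field named watcherStopChannel (the actual implementation can differ):

package nrconsul

import "sync"

type resolver struct {
	closeOnce          sync.Once
	watcherStopChannel chan struct{}
}

// Close is invoked by Dapr on shutdown to stop background goroutines.
// sync.Once makes it safe to call multiple times.
func (r *resolver) Close() error {
	r.closeOnce.Do(func() {
		close(r.watcherStopChannel) // wakes the watcher's select below
	})
	return nil
}

// inside the watcher loop:
//
//	select {
//	case <-r.watcherStopChannel:
//		return // clean shutdown requested via Close()
//	case service := <-r.registrationChannel:
//		// ... handle new key ...
//	}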

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
@a-elsheikh
Contributor Author

@ItalyPaleAle thanks for the update, response below 👍

I have a few cosmetic changes. I cannot push to your branch, so can you please merge this?

curl -L "https://github.com/dapr/components-contrib/commit/346bcb17a2b8934ef48558503d179261ba65a98e.patch" | git am

I've applied the patch and it looks good, but I'm not sure I agree with replacing the "watcher started" mutex+bool with an atomic bool. The following code is on a hot path; it makes sense to only synchronize if we think the watcher is not running, otherwise we should be happy to read the bool and continue. I've also benchmarked this for good measure, and on parallel runs the atomic is an order of magnitude slower. Of course, this only applies if a sidecar is making parallel invocations, and we're talking ns deltas, so perhaps clear is better than clever 😅 I'll keep the change and leave it to your discretion.

func (r *resolver) getService(service string) (*consul.ServiceEntry, error) {
	var services []*consul.ServiceEntry

	if r.config.UseCache {
		r.startWatcher() // will sync every time service is resolved
		...
  • A Close() error method that Dapr can invoke to shut down the component
  • In the close method, there needs to be a way to stop the background watcher (a nice way could be using a channel for example)

Perfect - I'll use this as an opportunity to add an optional deregister on close; it's much needed!

Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
@@ -54,7 +54,8 @@ As of writing the configuration spec is fixed to v1.3.0 of the consul api
 | Tags | `[]string` | Configures any tags to include if/when registering services |
 | Meta | `map[string]string` | Configures any additional metadata to include if/when registering services |
 | DaprPortMetaKey | `string` | The key used for getting the Dapr sidecar port from consul service metadata during service resolution, it will also be used to set the Dapr sidecar port in metadata during registration. If blank it will default to `DAPR_PORT` |
-| SelfRegister | `bool` | Controls if Dapr will register the service to consul. The name resolution interface does not cater for an "on shutdown" pattern so please consider this if using Dapr to register services to consul as it will not deregister services. |
+| SelfRegister | `bool` | Controls if Dapr will register the service to consul on startup. If unset it will default to `false` |
+| SelfDeregister | `bool` | Controls if Dapr will deregister the service from consul on shutdown. If unset it will default to `false` |
Contributor


I think when we tried to do that (to solve #2490), it turned out to be a bad idea: it un-registered all instances of an app, not just the one that was going offline.

I would prefer if this wasn't even offered as an option, as it can cause confusion to users.

(And with #2490 merged in 1.12, this isn't needed anyways)

@ItalyPaleAle
Contributor

I've applied the patch and it looks good, but I'm not sure I agree with replacing the "watcher started" mutex+bool with an atomic bool. The following code is on a hot path; it makes sense to only synchronize if we think the watcher is not running, otherwise we should be happy to read the bool and continue. I've also benchmarked this for good measure, and on parallel runs the atomic is an order of magnitude slower.

The previous code had a race condition. If two requests came in at the same time and the watcher wasn't running, watcherStarted would be modified by one goroutine while being read by the other. Yes, there was a lock around the code that modified the value, but not around the code that read it. go test -race would highlight that as a failure (if there were a test invoking getService in parallel).

The general rule is that if there are 2 goroutines accessing the same value, and at least one of them is doing a write, then you must either use a lock or some atomic value. You can read more here: https://go.dev/doc/articles/race_detector
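As an illustration of that rule, a race-free "start the watcher once" check could use an atomic.Bool (Go 1.19+) with compare-and-swap; the names here are placeholders, not the exact code in this PR:

package nrconsul

import "sync/atomic"

type watcherState struct {
	started atomic.Bool
	start   func() // stand-in for the code that spawns the watcher goroutine
}

// ensureWatcher is safe to call from many goroutines on the hot path: every
// caller does a single atomic compare-and-swap, and only the one that flips
// the flag from false to true actually starts the watcher. There is no
// unsynchronized read of a plain bool.
func (w *watcherState) ensureWatcher() {
	if w.started.CompareAndSwap(false, true) {
		w.start()
	}
}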

I'll use this as an opportunity to add an optional deregister on close, it is much needed!

I left a comment on the code. A few months ago I did some investigation and found that deregister did deregister the entire service. So if your app is scaled horizontally, it un-registers the other instances too. Because of that, I think this option is "dangerous" as users may not be aware of its behavior if the app is scaled horizontally.

In Dapr 1.12, #2490 was merged so we should be good with unregistering!

@a-elsheikh
Contributor Author

I left a comment on the code. A few months ago I did some investigation and found that deregister did deregister the entire service. So if your app is scaled horizontally, it un-registers the other instances too. Because of that, I think this option is "dangerous" as users may not be aware of its behavior if the app is scaled horizontally.

@ItalyPaleAle Can you share any details of this investigation? The behaviour you're describing is contrary both to what I'm observing and to the documentation. The only way I can see that happening is if you were using an older version of dapr which (by default) did not generate unique IDs for the registrations and instead used the AppID, giving every instance of a service on each node the same ID; this is no longer the case thanks to #1802.

Even with that said, I still don't see how that behaviour could have happened using agent/service/deregister.

@ItalyPaleAle
Contributor

@a-elsheikh I wrote about that here: #2490 (comment)

I can't remember exactly the details, it's been a while :(

@ItalyPaleAle
Contributor

This was the PR I created as a test: #2594. In there you can still see the commit where I added the "Deregister" and then removed it because of those issues.

Signed-off-by: Abdulaziz Elsheikh <abdulaziz.elsheikh@gmail.com>
@a-elsheikh
Contributor Author

This was the PR I created as a test: #2594. In there you can still see the commit where I added the "Deregister" and then removed it because of those issues.

@ItalyPaleAle I reviewed the comments but it's still not clear how this was tested - our test environment also uses horizontally scaled apps and the behaviour looks fine 🤔 Can you try re-testing with this implementation please 🙏

@ItalyPaleAle
Contributor

This was the PR I created as a test: #2594. In there you can still see the commit where I added the "Deregister" and then removed it because of those issues.

@ItalyPaleAle I reviewed the comments but it's still not clear how this was tested - our test environment also uses horizontally scaled apps and the behaviour looks fine 🤔 Can you try re-testing with this implementation please 🙏

I don't remember, to be fully honest - it was a long time ago!

OK, let's merge this PR. We can do more tests later, and if we see something off in the behavior of unregistering, we can disable it later on.

Thanks for your contributions :)

One more ask: can you please update our docs to document the new options for Consul? See dapr/docs

@ItalyPaleAle ItalyPaleAle added the documentation required This issue needs documentation label Oct 13, 2023
@ItalyPaleAle ItalyPaleAle merged commit b343683 into dapr:master Oct 13, 2023
84 of 86 checks passed
@ItalyPaleAle ItalyPaleAle changed the title from "Consul name resolution in-memory cache (DCO)" to "Consul name resolution in-memory cache" on Oct 13, 2023
@a-elsheikh
Contributor Author

Thanks for your contributions :)

One more ask: can you please update our docs to document the new options for Consul? See dapr/docs

@ItalyPaleAle you're most welcome - I'll get the docs updated soon! Thanks for the help 🙏
