
ipcache: don't short-circuit InjectLabels if source differs #24875

Merged 2 commits into cilium:main from the ipcache-label-source-overwrite branch on Apr 21, 2023

Conversation

@squeed (Contributor) commented Apr 13, 2023

InjectLabels is one of the functions responsible for synchronizing the ipcache metadata store and the IP store. As such, it shouldn't short-circuit when the numeric identity is the same but the source is different; otherwise the update to the ipcache is incomplete.

This can happen, for example, when there are two identities for the same IP, which occurs on daemon restart whenever a CIDR is referenced.

Fixes: #24502

@squeed added labels kind/bug, release-note/bug, sig/policy, and sig/agent on Apr 13, 2023
```diff
@@ -224,7 +224,7 @@ func (ipc *IPCache) InjectLabels(ctx context.Context, modifiedPrefixes []netip.Prefix) {
 		// kvstore and via the k8s control plane. If the new
 		// security identity is the same as the one currently
 		// being used, then no need to update it.
-		if oldID.ID == newID.ID {
+		if oldID.ID == newID.ID && prefixInfo.Source() == oldID.Source {
```
Member:
Ah, interesting. Previously this code assumed, for one, that if the identity is the same, then the source of that identity is the same. However, as the comment immediately above demonstrates, that may not be true.

Secondly, I think the previous implementation also implicitly assumed that if the eventual result of the ipcache update here maps a specific IP to a specific identity, then, when the identity is the same, we don't need to propagate the event to the other subsystems, because the ultimate result in the datapath is the same. For the UpdatePolicyMaps() call below I think that's true: technically there is nothing to do if only the source differs. However, the main ipcache structure update below, done as part of the entriesToReplace logic, happens in a layer that knows about source and uses it for protections against ipcache entry deletion and the like. That layer really should be informed about a change in source. 👍

Member:
While I think we could potentially avoid the extra policymap notification for this case, this probably doesn't have much of an impact and likely just results in a little extra processing to figure out that it's a no-op. That's acceptable IMO.

Comment on lines 240 to 242:

```diff
 			// have now been removed, then we need to explicitly
 			// work around that to remove the old higher-priority
 			// identity and replace it with this new identity.
-			if entryExists && prefixInfo.Source() != oldID.Source {
+			if entryExists && prefixInfo.Source() != oldID.Source && oldID.ID != newID.ID {
 				forceIPCacheUpdate[prefix] = true
 			}
```
Member:

I'm looking through this line with the following cases:

  1. An ipcache entry exists for some IP, the numeric identity is the same and the source is the same
  2. An ipcache entry exists for some IP, the numeric identity is the same and the source is different
  3. An ipcache entry exists for some IP, the numeric identity is different and the source is the same
  4. An ipcache entry exists for some IP, the numeric identity is different and the source is different

Case (1) never gets to this point because of the check above. No change in behaviour ✔️ .

Case (2) previously didn't get to this line, but this PR changes the logic above, so now it hits here. The change on this line avoids "forcing" the ipcache update for this case. This basically means that if the existing entry happens to have a higher-priority Source than the newly calculated one, the ipcache upsert would fail and warn in the logs. Technically I think this couldn't happen today given the current users of this ipcache metadata layer, but it could happen in the future. I think the right path would actually be to forcefully update the ipcache in that case as well. I can theorize about future implications, but my main issue right now is not understanding why this case needs to avoid the force.

Case (3) is not a big deal, because overriding an entry with the same-priority source should work (cf. source.AllowOverwrite()), so the force hack is not necessary there. ✔️

Case (4) is effectively the same before and after this change, we will effectively trust this metadata layer's decision about the correct underlying ipcache Identity+Source over the top of any users of the legacy direct ipcache.Upsert() API. ✔️

sidenote, I can't wait to switch all the callers to the new async metadata interface and resolve this internally in the metadata structures rather than having this forceIPCacheUpdate hack in the metadata layer in order to override some of the core ipcache behaviour to handle these cases correctly 😅
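
To make the case analysis above concrete, here is a minimal sketch of the precedence idea behind source.AllowOverwrite() and the forceIPCacheUpdate escape hatch. The Source names, their ordering, and the upsert shape below are simplified assumptions for illustration, not Cilium's actual types or precedence table:

```go
package main

import "fmt"

// Source is a simplified stand-in for Cilium's pkg/source type; this
// two-level ordering is an assumption for illustration only.
type Source int

const (
	Generated     Source = iota // e.g. CIDR identities from the policy engine
	KubeAPIServer               // kube-apiserver metadata (higher priority)
)

// allowOverwrite mimics the idea behind source.AllowOverwrite(): a plain
// upsert succeeds only when the incoming source has equal or higher
// precedence than the existing entry's source.
func allowOverwrite(existing, incoming Source) bool {
	return incoming >= existing
}

// upsert models the hack discussed above: force corresponds to
// forceIPCacheUpdate[prefix] = true and bypasses the precedence check so
// the metadata layer's decision wins.
func upsert(existing, incoming Source, force bool) bool {
	return force || allowOverwrite(existing, incoming)
}

func main() {
	// Case (3): equal-priority sources, so the plain overwrite is
	// allowed and no force is needed.
	fmt.Println(upsert(Generated, Generated, false)) // true

	// Case (2): a higher-priority existing entry rejects the plain
	// upsert; without force, the write fails and warns in the logs.
	fmt.Println(upsert(KubeAPIServer, Generated, false)) // false
	fmt.Println(upsert(KubeAPIServer, Generated, true))  // true
}
```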

Contributor Author:

Agreed that case 2 is the interesting one. I don't understand all the possible values for the metadata source and the ipcache source well enough, so I wanted to preserve the existing behavior as much as possible (save for the additional ipcache upsert).

If you think it's safe to always force when the metadata diverges from the ipcache, then I'm happy to remove this conditional. I'm not so sure, mostly because it is possible to have multiple identities for the same /32 CIDR. In that case, we would always want the higher-precedence source to "win".

Member:

After a bit more thinking, I think that in practice they're one and the same at least in the immediate term.

The old ipcache.Upsert() API mostly assumed that there's only one identity for each IP, and the callers will resolve what the identity is. If there's more than one identity, the callers can override one another with a higher priority Source. The awkward part is that it's up to every ipcache user to understand every other ipcache user and accept the results of that overriding behaviour. The callers today mostly just pretend those cases don't exist (and mostly get away with it).

The newer ipcache.UpsertLabels() API mostly assumes that callers should just specify a set of labels to be associated with the IP, then this ipcache.InjectLabels() will resolve any conflicts between different sources attempting to associate specific identities. Today, this logic primarily handles just the cases where the prefix needs to be associated either with the kube-apiserver (add that label into the identity) or with a remote node. These two are typically higher priority sources so they are likely to take precedence over the users from the older ipcache anyway (at least now that your patch ensures that this code propagates the call into the older ipcache.Upsert() API).

I think my argument for having this logic override regardless is that we're aiming for this InjectLabels() logic to resolve conflicts between different Sources. Therefore, if some other piece of logic did an ipcache.Upsert() with a higher-priority source and then this logic tried to override it, the other logic shouldn't be able to go and ipcache.Delete() the entry that this logic installed, because the source would no longer match.

That said, over time we'll just convert all of the users over to this logic and at that point this discussion will be moot.
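
A rough sketch of the two models contrasted above, using deliberately simplified toy types (the real ipcache.Upsert(), UpsertLabels(), and InjectLabels() have different signatures and semantics):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Legacy model: each caller resolves an identity itself and writes it
// directly; conflicting callers arbitrate via source precedence.
type legacyIPCache map[string]string // ip -> identity

func (c legacyIPCache) Upsert(ip, identity string) { c[ip] = identity }

// Newer model: callers only attach labels to an IP; a separate
// InjectLabels-style pass resolves the accumulated label set into a
// single identity and pushes it into the legacy cache.
type metadataStore map[string]map[string]bool // ip -> label set

func (m metadataStore) UpsertLabels(ip string, labels ...string) {
	if m[ip] == nil {
		m[ip] = map[string]bool{}
	}
	for _, l := range labels {
		m[ip][l] = true
	}
}

func (m metadataStore) InjectLabels(c legacyIPCache) {
	for ip, set := range m {
		var labels []string
		for l := range set {
			labels = append(labels, l)
		}
		sort.Strings(labels)
		// In Cilium this allocates a security identity for the label
		// set; here the joined label string stands in for the identity.
		c.Upsert(ip, strings.Join(labels, ","))
	}
}

func main() {
	c := legacyIPCache{}
	m := metadataStore{}
	c.Upsert("10.0.0.1", "cidr:10.0.0.1/32")              // legacy caller
	m.UpsertLabels("10.0.0.1", "reserved:kube-apiserver") // metadata caller
	m.InjectLabels(c)                                     // resolves the conflict
	fmt.Println(c["10.0.0.1"])
}
```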

Contributor Author:

Right, the other awkwardness is that updates to the metadata and the ipcache happen in separate "transactions"; there are a few bits of code that do (Metadata lock, Metadata read, Metadata unlock) -> (IPCache lock, IPCache write, IPCache unlock), but something could have touched the metadata in the meantime.

As you said, it mostly works, and the bits of logic that do this are getting smaller and smaller. However, it's still something to be aware of, and I wonder if we will discover more bugs in this vein. I did a quick scan and didn't find much else; most things hard-code the Source and just ignore an upsert failure.
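
A minimal sketch of the split-transaction pattern described above; the locks and maps are hypothetical stand-ins for the real metadata and ipcache structures:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	metadataMu sync.Mutex
	ipcacheMu  sync.Mutex
	metadata   = map[string]string{} // prefix -> labels
	ipcache    = map[string]string{} // prefix -> identity
)

// syncPrefix shows the (lock, read, unlock) -> (lock, write, unlock)
// shape: the metadata read and the ipcache write are separate critical
// sections, so another goroutine can change the metadata in between.
func syncPrefix(prefix string) {
	metadataMu.Lock()
	labels := metadata[prefix]
	metadataMu.Unlock()

	// Race window: metadata[prefix] may change here, making the ipcache
	// write below stale relative to the current metadata.

	ipcacheMu.Lock()
	ipcache[prefix] = "identity-for(" + labels + ")"
	ipcacheMu.Unlock()
}

func main() {
	metadata["192.0.2.1/32"] = "reserved:kube-apiserver"
	syncPrefix("192.0.2.1/32")
	fmt.Println(ipcache["192.0.2.1/32"])
}
```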

@joestringer (Member):

/test

@christarazi (Member) left a comment:

Looks like Joe went through this one pretty thoroughly and I think I have a rough understanding of how this fixes the original issue (#24502).

Do we also need to change the daemon CIDR restoration logic to also use source.Restored?

@squeed (Contributor Author) commented Apr 14, 2023

> Do we also need to change the daemon CIDR restoration logic to also use source.Restored?

Nope, AllocateCIDRs() is hard-coded to use the generated source, which is very low precedence.

The last TODO for this is to write some tests, which I'm doing now. Should be ready for final review on Monday.

@joestringer (Member):
Heads up: I found this while investigating something else; it provides some background on the origin of the logic being changed here: #19765 (comment). Given that the etcd tests are failing, I think this is reintroducing the problem I originally hit while writing this logic.

@joestringer (Member):

> Do we also need to change the daemon CIDR restoration logic to also use source.Restored?
>
> Nope, AllocateCIDRs() is hard-coded to use the generated source, which is very low precedence.

I would argue that we should make this change, but that it's not necessary in this PR. We can follow up separately, I don't think it is likely to impact the issue at hand here.

@squeed (Contributor Author) commented Apr 17, 2023

It seems like the tests are failing because they're detecting an (expected) error message. In this case, we intend for the usual ipcache precedence rules to abort the upsert; that's not an error.

@joestringer it seems like we can just not return an error message in upsertLocked() if only the source differs. Alternatively, we could swallow the error message in InjectLabels() but that seems less correct. Thoughts?
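
For illustration, a sketch of the second option (swallowing the expected error in the caller). errOverwriteDenied, injectOne(), and their shapes are hypothetical, not the real upsertLocked() or InjectLabels() signatures:

```go
package main

import (
	"errors"
	"fmt"
)

// errOverwriteDenied is a hypothetical sentinel for "the usual ipcache
// precedence rules aborted this upsert".
var errOverwriteDenied = errors.New("existing entry has a higher-priority source")

// injectOne swallows the precedence abort only when the caller knows the
// identity is unchanged; the datapath state is then already correct, so
// the abort is expected rather than an error worth logging.
func injectOne(upsert func() error, sameIdentity bool) error {
	err := upsert()
	if err != nil && sameIdentity && errors.Is(err, errOverwriteDenied) {
		return nil
	}
	return err
}

func main() {
	failing := func() error { return errOverwriteDenied }
	fmt.Println(injectOne(failing, true))  // <nil>: expected abort, swallowed
	fmt.Println(injectOne(failing, false)) // real failure propagates
}
```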

@squeed force-pushed the ipcache-label-source-overwrite branch from a64eb6f to ef36fad on April 17, 2023 13:09
@squeed (Contributor Author) commented Apr 17, 2023

I wound up swallowing the error message in InjectLabels() because the caller has more "context" to understand that the error is safe.

I also added a test case that emulates what I saw on real systems. It does some small amount of hackery to simulate the race condition.

I'm marking this as ready for review.

@squeed (Contributor Author) commented Apr 17, 2023

/test

(edit: the failure is due to an issue posting a Slack message)

@squeed squeed marked this pull request as ready for review April 17, 2023 13:21
@squeed squeed requested review from a team as code owners April 17, 2023 13:21
@squeed added the release-blocker/1.13 label on Apr 17, 2023
@squeed (Contributor Author) commented Apr 17, 2023

/test-1.26-net-next

@joestringer (Member) left a comment:

I also went searching for a leaking identity refcount related to the restore logic, but I found the corresponding identity release for the identity with the specific labels that should have been allocated from the start, so I believe that aspect to be correct.
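
As background, a toy model of the refcount discipline being checked here; the allocator below is a hypothetical stand-in, not Cilium's identity allocator:

```go
package main

import "fmt"

// refAllocator pairs every Allocate with a Release; an identity whose
// count never returns to zero is leaked and outlives its users.
type refAllocator struct {
	refs map[int]int // identity -> refcount
}

func (a *refAllocator) Allocate(id int) { a.refs[id]++ }

func (a *refAllocator) Release(id int) {
	if a.refs[id] == 0 {
		return // releasing an unallocated identity is a caller bug
	}
	a.refs[id]--
	if a.refs[id] == 0 {
		delete(a.refs, id) // identity fully freed
	}
}

// leaked reports identities that were allocated but never fully released.
func (a *refAllocator) leaked() []int {
	var out []int
	for id := range a.refs {
		out = append(out, id)
	}
	return out
}

func main() {
	a := &refAllocator{refs: map[int]int{}}
	a.Allocate(16777217) // restored at startup with approximate labels
	a.Allocate(16777217) // re-allocated once the real labels are known
	a.Release(16777217)  // restore logic drops its reference
	a.Release(16777217)  // last user drops its reference
	fmt.Println(a.leaked()) // []: every Allocate paired with a Release
}
```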

Comment on lines +129 to +131:

```go
// Now, emulate policyAdd(), which calls AllocateCIDRs()
_, err = IPIdentityCache.AllocateCIDRs([]netip.Prefix{prefix}, []identity.NumericIdentity{oldID}, nil)
assert.NoError(t, err)
```
Member:

I'm still not sure I quite understand how this additional allocation of the specific /32 from the policy engine relates to the rest of the bug or whether this matches the user reports.

For this to be relevant in all of the user cases, we would need a policy that says something like:

```yaml
egress:
- toCIDR:
  - w.x.y.z/32
```

Looking back through #24502, I only find two examples of policies that users are using: fromEntities: cluster which has a label selector for the kube-apiserver entity, and a fromCIDR for a shorter prefix (/24 as pointed out here). Neither case has a CIDR policy specifically for the /32s of the kube-apiserver IPs. As a result, neither should end up interacting with the ipcache for this IP address. We can certainly ask the users if they are creating such /32 policies, but I would have thought they would have brought that up if it was a relevant aspect of the issue.

Member:

Minor nit here as well, daemon/cmd/policy.go does not feed the old identities in, since it doesn't know the old identities. That said, I tried locally to pass nil for the old identities and pass an output parameter as the third argument + run UpsertGeneratedIdentities() immediately for the returned identities and that failed in the same way with the old code as the way this test is written. While that would be a more accurate emulation, I don't think it materially affects the test.

Member:

I was able to reproduce similar behaviour without these lines, by moving the upsertLabels up to the top of the function. This would be equivalent to syncing the k8s endpoints for the kubernetes service on startup before allocating the identities from the ipcache map. That said, as far as I can tell this order is not possible at runtime currently, since those identities are allocated around line 628 of daemon/cmd/daemon.go and upserted around line 767, whereas the k8s watchers are not launched until line 1086. Unless I'm missing something about the way that the Hive-based k8s resource handling behaves, this option seems like a dead end.

Member:

After a bit of looking around, the only options I really see are some other policy logic allocating the Identity (ToCIDR, ToServices, ToFQDNs) or maybe the node manager logic, but the latter shouldn't occur in an EKS environment like the users report. So this test is probably about as good an approximation as we'll get, short of coming up with some complicated e2e test. 👍

Contributor Author:

Indeed, some of the allocations here are probably superfluous; I was trying to match the race condition I'd observed in logs / delve sessions. If you like, I can try and pare this down to a minimum reproducer.


```go
log.WithError(err2).WithFields(logrus.Fields{
	logfields.IPAddr:   prefix,
	logfields.Identity: id,
}).Error("Failed to replace ipcache entry with new identity after label removal. Traffic may be disrupted.")
```
Member:

Given that this is only wrapping the case where the identities remain the same but the source changes, I think that this should be OK and the error message still accurately applies to the case we're worried about. I think that once we properly switch the node manager over to the newer ipcache.UpsertLabels() APIs we should be able to revert this hunk without it causing complaints.

Contributor Author:

Yes, agreed.

@nbusseneau (Member) left a comment:

I'll just blindly approve this one since I'm late and ci-structure changes are trivial.

@squeed (Contributor Author) commented Apr 20, 2023

Looks like Jenkins didn't write status. Huh. This was all green before.

@squeed (Contributor Author) commented Apr 20, 2023

/test-1.27-net-next

Commit 1:

This makes the mock allocator work similarly to the "real" one: if an ID
is requested and it is not in use, then accept it.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Commit 2:

InjectLabels is one of the functions responsible for synchronizing the
ipcache metadata store and ip store. As such, it shouldn't shortcut when
the numeric identity is the same, but the source is different; this
means that an update to the ipcache isn't complete.

This can happen, for example, when there are two identities for the same
IP, which can happen on daemon restart whenever a CIDR is referenced.

Fixes: cilium#24502
Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed force-pushed the ipcache-label-source-overwrite branch from ef36fad to 6549b5c on April 20, 2023 09:20
@squeed (Contributor Author) commented Apr 20, 2023

(rebased to pick up master -> main rename)

@squeed (Contributor Author) commented Apr 20, 2023

An update: a few end users have picked up this change and deployed it to their environments; they confirmed it fixed the issue. Thanks to the testers (and for the awesome bug report)!

@squeed added labels backport/author and needs-backport/1.13 on Apr 20, 2023
@nathanjsweet nathanjsweet removed their request for review April 20, 2023 15:19
@joestringer (Member):

/test

@squeed (Contributor Author) commented Apr 21, 2023

Discussed with the service mesh people; we agreed the test failure looked like a flake (links: test failure, flake issue).

@squeed (Contributor Author) commented Apr 21, 2023

The test failure is confusing: curl between pods failed with exit code 22, which indicates a non-200 HTTP response code.

I take this to mean the actual TCP connection was successful, so I don't believe this PR has caused any issues. Still, I'm making sure the sysdumps don't show anything untoward.

Edit: The identity and ipcache values for the pods in question are correct, which is all this PR really touches, so I'm leaning towards flake. I wish we had more info in the sysdumps, specifically server pod logs.

@maintainer-s-little-helper bot added the ready-to-merge label on Apr 21, 2023
@joestringer (Member):
The test also only failed on one of many jobs, and after re-running that workflow, it has passed. Good to merge 👍

@joestringer joestringer merged commit 8d3a498 into cilium:main Apr 21, 2023
56 checks passed
@joestringer added backport-pending/1.13 and removed needs-backport/1.13 labels on Apr 24, 2023
@joestringer (Member):
@squeed I have manually updated the labels on this PR in order to track the latest status. This is typically taken care of by the standard backport process, so please consider following that in future so that we can easily track which releases the PRs land in.

@michi-covalent added backport-done/1.13 and removed backport-pending/1.13 labels on Apr 25, 2023
Labels

- backport/author: The backport will be carried out by the author of the PR.
- backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
- kind/bug: This is a bug in the Cilium logic.
- ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
- release-blocker/1.13: This issue will prevent the release of the next version of Cilium.
- release-note/bug: This PR fixes an issue in a previous release of Cilium.
- sig/agent: Cilium agent related.
- sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.

Successfully merging this pull request may close these issues.

kube-apiserver IP addresses disappear from bpf ipcache 10 minutes after restarting cilium-agent
5 participants