
ipcache: don't short-circuit InjectLabels if source differs #24875

Merged 2 commits into cilium:main from the ipcache-label-source-overwrite branch on Apr 21, 2023

Conversation

@squeed (Contributor) commented Apr 13, 2023

InjectLabels is one of the functions responsible for synchronizing the ipcache metadata store and the IP store. As such, it shouldn't short-circuit when the numeric identity is the same but the source is different; otherwise the update to the ipcache is incomplete.

This can happen, for example, when there are two identities for the same IP, which occurs on daemon restart whenever a CIDR is referenced.

Fixes: #24502

@squeed added labels kind/bug, release-note/bug, sig/policy, and sig/agent on Apr 13, 2023
```diff
@@ -224,7 +224,7 @@ func (ipc *IPCache) InjectLabels(ctx context.Context, modifiedPrefixes []netip.Prefix) {
 		// kvstore and via the k8s control plane. If the new
 		// security identity is the same as the one currently
 		// being used, then no need to update it.
-		if oldID.ID == newID.ID {
+		if oldID.ID == newID.ID && prefixInfo.Source() == oldID.Source {
```
Member:
Ah, interesting. Previously this code assumed, for one, that if the identity is the same, then the source of that identity is the same. However, as the comment immediately above demonstrates, that may not be true.

Secondly, I think the previous implementation also implicitly assumed that if the eventual result of the ipcache update here maps a specific IP to a specific identity, then, when the identity is the same, we don't need to propagate the event to the other subsystems, because the ultimate result in the datapath is the same. For the UpdatePolicyMaps() call below I think that's true: technically there is nothing to do if only the source differs. However, the main ipcache structure update below, done as part of the entriesToReplace logic, happens in a layer that knows about source and uses it for protections against ipcache entry deletion and the like. That layer really should be informed about a change in source. 👍

Member:
While I think we could potentially avoid the extra policymap notification for this case, this probably doesn't have much of an impact and likely just results in a little extra processing to figure out that it's a no-op. That's acceptable IMO.

Comment on lines 240 to 242:

```diff
 			// have now been removed, then we need to explicitly
 			// work around that to remove the old higher-priority
 			// identity and replace it with this new identity.
-			if entryExists && prefixInfo.Source() != oldID.Source {
+			if entryExists && prefixInfo.Source() != oldID.Source && oldID.ID != newID.ID {
 				forceIPCacheUpdate[prefix] = true
 			}
```
Member:

I'm looking through this line with the following cases:

  1. An ipcache entry exists for some IP, the numeric identity is the same and the source is the same
  2. An ipcache entry exists for some IP, the numeric identity is the same and the source is different
  3. An ipcache entry exists for some IP, the numeric identity is different and the source is the same
  4. An ipcache entry exists for some IP, the numeric identity is different and the source is different

Case (1) never gets to this point because of the check above. No change in behaviour ✔️ .

Case (2) previously didn't get to this line, but this PR changes the logic above, so now it hits here. The change on this line avoids "forcing" the ipcache update for this case. This basically means that if the existing entry happens to have a higher-priority Source than the newly calculated one, the ipcache upsert would fail and warn in the logs. Technically I think this couldn't happen today given the current users of this ipcache metadata layer, but it could happen in the future. I think the right path would actually be to forcefully update the ipcache in that case as well. I can theorize about future implications, but my main issue right now is not understanding why this case needs to avoid the force.

Case (3) is not a big deal, because overriding an entry with the same-priority source should work (cf. source.AllowOverwrite()), so the force hack is not necessary there. ✔️

Case (4) is effectively the same before and after this change, we will effectively trust this metadata layer's decision about the correct underlying ipcache Identity+Source over the top of any users of the legacy direct ipcache.Upsert() API. ✔️

sidenote, I can't wait to switch all the callers to the new async metadata interface and resolve this internally in the metadata structures rather than having this forceIPCacheUpdate hack in the metadata layer in order to override some of the core ipcache behaviour to handle these cases correctly 😅
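
To make the case analysis above concrete, here is a minimal sketch of the precedence idea behind source.AllowOverwrite() and the forceIPCacheUpdate escape hatch. The Source names, their ordering, and the upsert shape below are simplified assumptions for illustration, not Cilium's actual types or precedence table:

```go
package main

import "fmt"

// Source is a simplified stand-in for Cilium's pkg/source type; this
// two-level ordering is an assumption for illustration only.
type Source int

const (
	Generated     Source = iota // e.g. CIDR identities from the policy engine
	KubeAPIServer               // kube-apiserver metadata (higher priority)
)

// allowOverwrite mimics the idea behind source.AllowOverwrite(): a plain
// upsert succeeds only when the incoming source has equal or higher
// precedence than the existing entry's source.
func allowOverwrite(existing, incoming Source) bool {
	return incoming >= existing
}

// upsert models the hack discussed above: force corresponds to
// forceIPCacheUpdate[prefix] = true and bypasses the precedence check so
// the metadata layer's decision wins.
func upsert(existing, incoming Source, force bool) bool {
	return force || allowOverwrite(existing, incoming)
}

func main() {
	// Case (3): equal-priority sources, so the plain overwrite is
	// allowed and no force is needed.
	fmt.Println(upsert(Generated, Generated, false)) // true

	// Case (2): a higher-priority existing entry rejects the plain
	// upsert; without force, the write fails and warns in the logs.
	fmt.Println(upsert(KubeAPIServer, Generated, false)) // false
	fmt.Println(upsert(KubeAPIServer, Generated, true))  // true
}
```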

Contributor Author:

Agreed that case 2 is the interesting one. I don't understand all the possible values for the metadata source and the ipcache source well enough, so I wanted to preserve the existing behavior as much as possible (save for the additional ipcache upsert).

If you think it's safe to always force when the metadata diverges from the ipcache, then I'm happy to remove this conditional. I'm not so sure, mostly because it is possible to have multiple identities for the same /32 CIDR. In that case, we would always want the higher-precedence source to "win".

Member:

After a bit more thinking, I think that in practice they're one and the same at least in the immediate term.

The old ipcache.Upsert() API mostly assumed that there's only one identity for each IP, and the callers will resolve what the identity is. If there's more than one identity, the callers can override one another with a higher priority Source. The awkward part is that it's up to every ipcache user to understand every other ipcache user and accept the results of that overriding behaviour. The callers today mostly just pretend those cases don't exist (and mostly get away with it).

The newer ipcache.UpsertLabels() API mostly assumes that callers should just specify a set of labels to be associated with the IP, then this ipcache.InjectLabels() will resolve any conflicts between different sources attempting to associate specific identities. Today, this logic primarily handles just the cases where the prefix needs to be associated either with the kube-apiserver (add that label into the identity) or with a remote node. These two are typically higher priority sources so they are likely to take precedence over the users from the older ipcache anyway (at least now that your patch ensures that this code propagates the call into the older ipcache.Upsert() API).

I think my argument for having this logic override regardless is that we're aiming for this InjectLabels() logic to resolve conflicts between different Sources. Therefore, if some other piece of logic did an ipcache.Upsert() with a higher-priority source and then this logic tried to override it, the other logic shouldn't be able to go and ipcache.Delete() the entry that this logic installed, because the source would no longer match.

That said, over time we'll just convert all of the users over to this logic and at that point this discussion will be moot.
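
A rough sketch of the two models contrasted above, using deliberately simplified toy types (the real ipcache.Upsert(), UpsertLabels(), and InjectLabels() have different signatures and semantics):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Legacy model: each caller resolves an identity itself and writes it
// directly; conflicting callers arbitrate via source precedence.
type legacyIPCache map[string]string // ip -> identity

func (c legacyIPCache) Upsert(ip, identity string) { c[ip] = identity }

// Newer model: callers only attach labels to an IP; a separate
// InjectLabels-style pass resolves the accumulated label set into a
// single identity and pushes it into the legacy cache.
type metadataStore map[string]map[string]bool // ip -> label set

func (m metadataStore) UpsertLabels(ip string, labels ...string) {
	if m[ip] == nil {
		m[ip] = map[string]bool{}
	}
	for _, l := range labels {
		m[ip][l] = true
	}
}

func (m metadataStore) InjectLabels(c legacyIPCache) {
	for ip, set := range m {
		var labels []string
		for l := range set {
			labels = append(labels, l)
		}
		sort.Strings(labels)
		// In Cilium this allocates a security identity for the label
		// set; here the joined label string stands in for the identity.
		c.Upsert(ip, strings.Join(labels, ","))
	}
}

func main() {
	c := legacyIPCache{}
	m := metadataStore{}
	c.Upsert("10.0.0.1", "cidr:10.0.0.1/32")              // legacy caller
	m.UpsertLabels("10.0.0.1", "reserved:kube-apiserver") // metadata caller
	m.InjectLabels(c)                                     // resolves the conflict
	fmt.Println(c["10.0.0.1"])
}
```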

Contributor Author:

Right, the other awkwardness is that updates to the metadata and the ipcache happen in separate "transactions"; there are a few bits of code that do (Metadata lock, Metadata read, Metadata unlock) -> (IPCache lock, IPCache write, IPCache unlock), but something could have touched the metadata in the meantime.

As you said, it mostly works, and the bits of logic that do this are getting smaller and smaller. However, it's still something to be aware of, and I wonder if we will discover more bugs in this vein. I did a quick scan and didn't find much else; most things hard-code the Source and just ignore an upsert failure.
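
A minimal sketch of the split-transaction pattern described above; the locks and maps are hypothetical stand-ins for the real metadata and ipcache structures:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	metadataMu sync.Mutex
	ipcacheMu  sync.Mutex
	metadata   = map[string]string{} // prefix -> labels
	ipcache    = map[string]string{} // prefix -> identity
)

// syncPrefix shows the (lock, read, unlock) -> (lock, write, unlock)
// shape: the metadata read and the ipcache write are separate critical
// sections, so another goroutine can change the metadata in between.
func syncPrefix(prefix string) {
	metadataMu.Lock()
	labels := metadata[prefix]
	metadataMu.Unlock()

	// Race window: metadata[prefix] may change here, making the ipcache
	// write below stale relative to the current metadata.

	ipcacheMu.Lock()
	ipcache[prefix] = "identity-for(" + labels + ")"
	ipcacheMu.Unlock()
}

func main() {
	metadata["192.0.2.1/32"] = "reserved:kube-apiserver"
	syncPrefix("192.0.2.1/32")
	fmt.Println(ipcache["192.0.2.1/32"])
}
```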

@joestringer (Member):

/test

@christarazi (Member) left a comment:

Looks like Joe went through this one pretty thoroughly and I think I have a rough understanding of how this fixes the original issue (#24502).

Do we also need to change the daemon CIDR restoration logic to also use source.Restored?

@squeed (Contributor Author) commented Apr 14, 2023

> Do we also need to change the daemon CIDR restoration logic to also use source.Restored?

Nope, AllocateCIDRs() is hard-coded to use the generated source, which is very low precedence.

The last TODO for this is to write some tests, which I'm doing now. Should be ready for final review on Monday.

@joestringer (Member):
Heads up: I found this while investigating something else; it provides some background on the origin of the logic being changed here: #19765 (comment). Given that the etcd tests are failing, I think this is reintroducing the problem I originally hit while writing this logic.

@joestringer (Member):

> Do we also need to change the daemon CIDR restoration logic to also use source.Restored?
>
> Nope, AllocateCIDRs() is hard-coded to use the generated source, which is very low precedence.

I would argue that we should make this change, but that it's not necessary in this PR. We can follow up separately, I don't think it is likely to impact the issue at hand here.

@squeed (Contributor Author) commented Apr 17, 2023

It seems like the tests are failing because they're detecting an (expected) error message. In this case, we intend for the usual ipcache precedence rules to abort the upsert; that's not an error.

@joestringer it seems like we can just not return an error message in upsertLocked() if only the source differs. Alternatively, we could swallow the error message in InjectLabels() but that seems less correct. Thoughts?
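
For illustration, a sketch of the second option (swallowing the expected error in the caller). errOverwriteDenied, injectOne(), and their shapes are hypothetical, not the real upsertLocked() or InjectLabels() signatures:

```go
package main

import (
	"errors"
	"fmt"
)

// errOverwriteDenied is a hypothetical sentinel for "the usual ipcache
// precedence rules aborted this upsert".
var errOverwriteDenied = errors.New("existing entry has a higher-priority source")

// injectOne swallows the precedence abort only when the caller knows the
// identity is unchanged; the datapath state is then already correct, so
// the abort is expected rather than an error worth logging.
func injectOne(upsert func() error, sameIdentity bool) error {
	err := upsert()
	if err != nil && sameIdentity && errors.Is(err, errOverwriteDenied) {
		return nil
	}
	return err
}

func main() {
	failing := func() error { return errOverwriteDenied }
	fmt.Println(injectOne(failing, true))  // <nil>: expected abort, swallowed
	fmt.Println(injectOne(failing, false)) // real failure propagates
}
```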

@squeed force-pushed the ipcache-label-source-overwrite branch from a64eb6f to ef36fad on April 17, 2023 13:09
@squeed (Contributor Author) commented Apr 17, 2023

I wound up swallowing the error message in InjectLabels() because the caller has more "context" to understand that the error is safe.

I also added a test case that emulates what I saw on real systems. It does some small amount of hackery to simulate the race condition.

I'm marking this as ready for review.

@squeed (Contributor Author) commented Apr 17, 2023

/test

(edit: the failure is due to an issue posting a Slack message)

@squeed squeed marked this pull request as ready for review April 17, 2023 13:21
@squeed squeed requested review from a team as code owners April 17, 2023 13:21
@squeed added the release-blocker/1.13 label on Apr 17, 2023
@squeed (Contributor Author) commented Apr 17, 2023

/test-1.26-net-next

@joestringer (Member) left a comment:

I also went searching for a leaking identity refcount related to the restore logic, but I found the corresponding identity release for the identity with the specific labels that should have been allocated from the start, so I believe that aspect to be correct.
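
As background, a toy model of the refcount discipline being checked here; the allocator below is a hypothetical stand-in, not Cilium's identity allocator:

```go
package main

import "fmt"

// refAllocator pairs every Allocate with a Release; an identity whose
// count never returns to zero is leaked and outlives its users.
type refAllocator struct {
	refs map[int]int // identity -> refcount
}

func (a *refAllocator) Allocate(id int) { a.refs[id]++ }

func (a *refAllocator) Release(id int) {
	if a.refs[id] == 0 {
		return // releasing an unallocated identity is a caller bug
	}
	a.refs[id]--
	if a.refs[id] == 0 {
		delete(a.refs, id) // identity fully freed
	}
}

// leaked reports identities that were allocated but never fully released.
func (a *refAllocator) leaked() []int {
	var out []int
	for id := range a.refs {
		out = append(out, id)
	}
	return out
}

func main() {
	a := &refAllocator{refs: map[int]int{}}
	a.Allocate(16777217) // restored at startup with approximate labels
	a.Allocate(16777217) // re-allocated once the real labels are known
	a.Release(16777217)  // restore logic drops its reference
	a.Release(16777217)  // last user drops its reference
	fmt.Println(a.leaked()) // []: every Allocate paired with a Release
}
```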

Comment on lines +129 to +131:

```go
// Now, emulate policyAdd(), which calls AllocateCIDRs()
_, err = IPIdentityCache.AllocateCIDRs([]netip.Prefix{prefix}, []identity.NumericIdentity{oldID}, nil)
assert.NoError(t, err)
```
Member:

I'm still not sure I quite understand how this additional allocation of the specific /32 from the policy engine relates to the rest of the bug or whether this matches the user reports.

For this to be relevant in all of the user cases, we would need a policy that says something like:

```yaml
egress:
- toCIDR:
  - w.x.y.z/32
```

Looking back through #24502, I only find two examples of policies that users are using: fromEntities: cluster which has a label selector for the kube-apiserver entity, and a fromCIDR for a shorter prefix (/24 as pointed out here). Neither case has a CIDR policy specifically for the /32s of the kube-apiserver IPs. As a result, neither should end up interacting with the ipcache for this IP address. We can certainly ask the users if they are creating such /32 policies, but I would have thought they would have brought that up if it was a relevant aspect of the issue.

Member:

Minor nit here as well, daemon/cmd/policy.go does not feed the old identities in, since it doesn't know the old identities. That said, I tried locally to pass nil for the old identities and pass an output parameter as the third argument + run UpsertGeneratedIdentities() immediately for the returned identities and that failed in the same way with the old code as the way this test is written. While that would be a more accurate emulation, I don't think it materially affects the test.

Member:

I was able to reproduce similar behaviour without these lines, by moving the upsertLabels up to the top of the function. This would be equivalent to syncing the k8s endpoints for the kubernetes service on startup before allocating the identities from the ipcache map. That said, as far as I can tell this order is not possible at runtime currently, since those identities are allocated around line 628 of daemon/cmd/daemon.go and upserted around line 767, whereas the k8s watchers are not launched until line 1086. Unless I'm missing something about the way that the Hive-based k8s resource handling behaves, this option seems like a dead end.

Member:

After a bit of looking around, the only options I really see are some other policy logic allocating the Identity (ToCIDR, ToServices, ToFQDNs) or maybe the node manager logic, but the latter shouldn't occur in an EKS environment like the users report. So this test is probably about as good an approximation as we'll get, short of coming up with some complicated e2e test. 👍

Contributor Author:

Indeed, some of the allocations here are probably superfluous; I was trying to match the race condition I'd observed in logs / delve sessions. If you like, I can try and pare this down to a minimum reproducer.


```go
log.WithError(err2).WithFields(logrus.Fields{
	logfields.IPAddr:   prefix,
	logfields.Identity: id,
}).Error("Failed to replace ipcache entry with new identity after label removal. Traffic may be disrupted.")
```
Member:

Given that this is only wrapping the case where the identities remain the same but the source changes, I think that this should be OK and the error message still accurately applies to the case we're worried about. I think that once we properly switch the node manager over to the newer ipcache.UpsertLabels() APIs we should be able to revert this hunk without it causing complaints.

Contributor Author:

Yes, agreed.

@nbusseneau (Member) left a comment:

I'll just blindly approve this one since I'm late and ci-structure changes are trivial.

@squeed (Contributor Author) commented Apr 20, 2023

Looks like Jenkins didn't write status. Huh. This was all green before.

@squeed (Contributor Author) commented Apr 20, 2023

/test-1.27-net-next

Commit 1:

This makes the mock allocator work similarly to the "real" one: if an ID
is requested and it is not in use, then accept it.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Commit 2:

InjectLabels is one of the functions responsible for synchronizing the
ipcache metadata store and ip store. As such, it shouldn't shortcut when
the numeric identity is the same, but the source is different; this
means that an update to the ipcache isn't complete.

This can happen, for example, when there are two identities for the same
IP, which can happen on daemon restart whenever a CIDR is referenced.

Fixes: cilium#24502
Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed force-pushed the ipcache-label-source-overwrite branch from ef36fad to 6549b5c on April 20, 2023 09:20
@squeed (Contributor Author) commented Apr 20, 2023

(rebased to pick up master -> main rename)

@squeed (Contributor Author) commented Apr 20, 2023

An update: a few end users have picked up this change and deployed it to their environments; they confirmed it fixed the issue. Thanks to the testers (and for the awesome bug report)!

@squeed added labels backport/author and needs-backport/1.13 on Apr 20, 2023
@nathanjsweet nathanjsweet removed their request for review April 20, 2023 15:19
@joestringer (Member):

/test

@squeed (Contributor Author) commented Apr 21, 2023

Discussed with the service mesh people; we agreed the test failure looked like a flake (links: test failure, flake issue).

@squeed (Contributor Author) commented Apr 21, 2023

The test failure is confusing: curl between pods failed with exit code 22, which indicates a non-200 HTTP response code.

I take this to mean the actual TCP connection was successful, so I don't believe this PR has caused any issues. Still, I'm making sure the sysdumps don't show anything untoward.

Edit: The identity and ipcache values for the pods in question are correct, which is all this PR really touches, so I'm leaning towards flake. I wish we had more info in the sysdumps, specifically server pod logs.

@maintainer-s-little-helper bot added the ready-to-merge label on Apr 21, 2023
@joestringer (Member):
The test also only failed on one of many jobs, and after re-running that workflow, it has passed. Good to merge 👍

@joestringer joestringer merged commit 8d3a498 into cilium:main Apr 21, 2023
56 checks passed
@joestringer added backport-pending/1.13 and removed needs-backport/1.13 labels on Apr 24, 2023
@joestringer (Member):
@squeed I have manually updated the labels on this PR in order to track the latest status. This is typically taken care of by the standard backport process, so please consider following that in future so that we can easily track which releases the PRs land in.

@michi-covalent added backport-done/1.13 and removed backport-pending/1.13 labels on Apr 25, 2023
Labels

- backport/author: The backport will be carried out by the author of the PR.
- backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
- kind/bug: This is a bug in the Cilium logic.
- ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
- release-blocker/1.13: This issue will prevent the release of the next version of Cilium.
- release-note/bug: This PR fixes an issue in a previous release of Cilium.
- sig/agent: Cilium agent related.
- sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.

Successfully merging this pull request may close these issues.

kube-apiserver IP addresses disappear from bpf ipcache 10 minutes after restarting cilium-agent
5 participants