ipcache don't short-circuit InjectLabels if source differs #24875
Conversation
```diff
@@ -224,7 +224,7 @@ func (ipc *IPCache) InjectLabels(ctx context.Context, modifiedPrefixes []netip.P
 			// kvstore and via the k8s control plane. If the new
 			// security identity is the same as the one currently
 			// being used, then no need to update it.
-			if oldID.ID == newID.ID {
+			if oldID.ID == newID.ID && prefixInfo.Source() == oldID.Source {
```
Ah, interesting. Previously this code assumed that if the identity is the same, then the source of that identity is also the same. However, as the comment immediately above demonstrates, that may not be true. Secondly, I think the previous implementation also implicitly assumed that if the eventual result of the ipcache update here is going to map a specific IP to a specific identity, then when the identity is the same we don't need to propagate the event to the other subsystems, because the ultimate result in the datapath is the same. For the UpdatePolicyMaps() call below I think that's true: technically there is nothing to do if only the source is different. However, the main ipcache structure update below, as part of the entriesToReplace logic, does know about source and uses it for some protections against ipcache entry deletion, etc. Therefore, that layer really should be informed about a change in source. 👍
While I think we could potentially avoid the extra policymap notification for this case, this probably doesn't have much of an impact and likely just results in a little extra processing to figure out that it's a no-op. That's acceptable IMO.
```diff
 			// have now been removed, then we need to explicitly
 			// work around that to remove the old higher-priority
 			// identity and replace it with this new identity.
-			if entryExists && prefixInfo.Source() != oldID.Source {
+			if entryExists && prefixInfo.Source() != oldID.Source && oldID.ID != newID.ID {
 				forceIPCacheUpdate[prefix] = true
 			}
```
I'm looking through this line with the following cases:
1. An ipcache entry exists for some IP, the numeric identity is the same, and the source is the same.
2. An ipcache entry exists for some IP, the numeric identity is the same, and the source is different.
3. An ipcache entry exists for some IP, the numeric identity is different, and the source is the same.
4. An ipcache entry exists for some IP, the numeric identity is different, and the source is different.
Case (1) never gets to this point because of the check above. No change in behaviour. ✔️
Case (2) previously didn't get to this line, but this PR changes that logic above, so now it hits here. The change on this line avoids "forcing" the ipcache update for this case. This basically means that if the existing entry happens to have a higher-priority Source than the newly calculated one, the ipcache upsert would fail and warn in the logs. Technically I think this couldn't happen today, given the current users of this ipcache metadata layer, but it could happen in the future. I think the right path here would actually be to forcefully update the ipcache in that case as well. I can theorize about future implications, but my main issue right now is not understanding why this case needs to avoid the force.
Case (3) is not a big deal, because overriding an entry with a same-priority source should work (cf. source.AllowOverwrite()), so the force hack is not necessary there. ✔️
Case (4) is effectively the same before and after this change: we will trust this metadata layer's decision about the correct underlying ipcache Identity+Source over that of any users of the legacy direct ipcache.Upsert() API. ✔️
Sidenote: I can't wait to switch all the callers to the new async metadata interface and resolve this internally in the metadata structures, rather than having this forceIPCacheUpdate hack in the metadata layer to override some of the core ipcache behaviour to handle these cases correctly 😅
Agreed that case 2 is the interesting one. I don't understand all the possible values for metadata source and ipcache source well enough, so I wanted to preserve existing behavior as much as possible (save for the additional ipcache upsert).
If you think it's safe to always force when the metadata diverges from the ipcache, then I'm happy to remove this conditional. I'm not so sure, mostly because it is possible to have multiple identities for the same /32 CIDR. In that case, we would always want the higher-precedence source to "win".
After a bit more thinking, I think that in practice they're one and the same at least in the immediate term.
The old ipcache.Upsert() API mostly assumed that there's only one identity for each IP, and that the callers will resolve what the identity is. If there's more than one identity, the callers can override one another with a higher-priority Source. The awkward part is that it's up to every ipcache user to understand every other ipcache user and accept the results of that overriding behaviour. The callers today mostly just pretend those cases don't exist (and mostly get away with it).

The newer ipcache.UpsertLabels() API mostly assumes that callers should just specify a set of labels to be associated with the IP, and then ipcache.InjectLabels() will resolve any conflicts between different sources attempting to associate specific identities. Today, this logic primarily handles just the cases where the prefix needs to be associated either with the kube-apiserver (add that label into the identity) or with a remote node. These two are typically higher-priority sources, so they are likely to take precedence over the users of the older ipcache anyway (at least now that your patch ensures that this code propagates the call into the older ipcache.Upsert() API).

I think my argument for having this logic override regardless is that we're aiming for this InjectLabels() logic to resolve conflicts between different Sources. Therefore, if some other piece of logic did an ipcache.Upsert() with a higher-priority source and this logic then tried to override it, the other logic shouldn't be able to go and ipcache.Delete() the entry that this logic installed, because the source would no longer match.

That said, over time we'll just convert all of the users over to this logic, and at that point this discussion will be moot.
Right, the other awkwardness is that updates to the metadata and the ipcache are in separate "transactions": there are a few bits of code that do a (metadata lock, metadata read, metadata unlock) -> (ipcache lock, ipcache write, ipcache unlock), but something could have touched the metadata in the meantime.
As you said, it mostly works, and the bits of logic that do this are getting smaller and smaller. However, it's still something to be aware of, and I wonder if we will discover more bugs in this vein. I did a quick scan and didn't find much else; most things hard-code the Source and just ignore an upsert failure.
/test
Looks like Joe went through this one pretty thoroughly and I think I have a rough understanding of how this fixes the original issue (#24502).
Do we also need to change the daemon CIDR restoration logic to also use source.Restored?

Nope. The last TODO for this is to write some tests, which I'm doing now. Should be ready for final review on Monday.
Heads up: I found this while investigating something else. It provides some background on the origin of the logic being changed here: #19765 (comment). Given that the etcd tests are failing, I think this is reintroducing the problem I originally hit while writing this logic.
I would argue that we should make this change, but that it's not necessary in this PR. We can follow up separately; I don't think it is likely to impact the issue at hand here.
It seems like the tests are failing because they're detecting an (expected) error message. In this case, we are intending for the usual ipcache precedence rules to abort; it is not an error. @joestringer it seems like we can just not return an error message there.
Force-pushed a64eb6f to ef36fad.
I wound up swallowing the error message there. I also added a test case that emulates what I saw on real systems; it does a small amount of hackery to simulate the race condition. I'm marking this as ready for review.
/test (edit: failure is due to... an issue posting a Slack message)
/test-1.26-net-next
I also went searching for a leaking identity refcount related to the restore logic, but I found the corresponding identity release for the identity with the specific labels that should have been allocated from the start, so I believe that aspect to be correct.
```go
	// Now, emulate policyAdd(), which calls AllocateCIDRs()
	_, err = IPIdentityCache.AllocateCIDRs([]netip.Prefix{prefix}, []identity.NumericIdentity{oldID}, nil)
	assert.NoError(t, err)
```
I'm still not sure I quite understand how this additional allocation of the specific /32 from the policy engine relates to the rest of the bug, or whether it matches the user reports.
For this to be relevant in all of the user cases, we would need a policy that says something like:

```yaml
egress:
- toCIDR:
  - w.x.y.z/32
```

Looking back through #24502, I only find two examples of policies that users are using: fromEntities: cluster, which has a label selector for the kube-apiserver entity, and a fromCIDR for a shorter prefix (/24 as pointed out there). Neither case has a CIDR policy specifically for the /32s of the kube-apiserver IPs. As a result, neither should end up interacting with the ipcache for this IP address. We can certainly ask the users whether they are creating such /32 policies, but I would have thought they would have brought that up if it were a relevant aspect of the issue.
Minor nit here as well: daemon/cmd/policy.go does not feed the old identities in, since it doesn't know them. That said, I tried locally passing nil for the old identities, passing an output parameter as the third argument, and running UpsertGeneratedIdentities() immediately for the returned identities, and that failed in the same way with the old code as the way this test is written. While that would be a more accurate emulation, I don't think it materially affects the test.
I was able to reproduce similar behaviour without these lines by moving the upsertLabels up to the top of the function. This would be equivalent to syncing the k8s endpoints for the kubernetes service on startup before allocating the identities from the ipcache map. That said, as far as I can tell this order is not possible at runtime currently, since those identities are allocated around line 628 of daemon/cmd/daemon.go and upserted around line 767, whereas the k8s watchers are not launched until line 1086. Unless I'm missing something about the way the Hive-based k8s resource handling behaves, this option seems like a dead end.
After a bit of looking around, the only options I really see are some other policy logic allocating the identity (ToCIDR, ToServices, ToFQDNs) or maybe the node manager logic, but the latter shouldn't occur in an EKS environment as the users report. So this test is probably about as good an approximation as we'll get, short of coming up with some complicated e2e test. 👍
Indeed, some of the allocations here are probably superfluous; I was trying to match the race condition I'd observed in logs / delve sessions. If you like, I can try to pare this down to a minimal reproducer.
```go
		log.WithError(err2).WithFields(logrus.Fields{
			logfields.IPAddr:   prefix,
			logfields.Identity: id,
		}).Error("Failed to replace ipcache entry with new identity after label removal. Traffic may be disrupted.")
```
Given that this is only wrapping the case where the identities remain the same but the source changes, I think this should be OK, and the error message still accurately applies to the case we're worried about. I think that once we properly switch the node manager over to the newer ipcache.UpsertLabels() APIs, we should be able to revert this hunk without it causing complaints.
Yes, agreed.
I'll just blindly approve this one, since I'm late and the ci-structure changes are trivial.
Looks like Jenkins didn't write status. Huh. This was all green before.
/test-1.27-net-next |
This makes the mock allocator work similarly to the "real" one: if an ID is requested and it is not in use, then accept it. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
InjectLabels is one of the functions responsible for synchronizing the ipcache metadata store and ip store. As such, it shouldn't shortcut when the numeric identity is the same, but the source is different; this means that an update to the ipcache isn't complete. This can happen, for example, when there are two identities for the same IP, which can happen on daemon restart whenever a CIDR is referenced. Fixes: cilium#24502 Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Force-pushed ef36fad to 6549b5c.
(rebased to pick up the master -> main rename)
An update: a few end users have picked up this change and deployed it to their environments; they confirmed it fixed the issue. Thanks to the testers (and for the awesome bug report)!
/test
Discussed with the service mesh folks; agreed the test failure looked like a flake. See the test failure flake issue.
The test failure is confusing: curl between pods failed with exit code 22, which indicates a non-200 HTTP response code. I take this to mean the actual TCP connection was successful, so I don't believe this PR has caused any issues. Still, I'm making sure the sysdumps don't show anything untoward. Edit: the identity and ipcache values for the pods in question are correct, which is all this PR really touches, so I'm leaning towards flake. I wish we had more info in the sysdumps, specifically server pod logs.
The test also only failed on one of many jobs, and after re-running that workflow, it has passed. Good to merge 👍
@squeed I have manually updated the labels on this PR in order to track the latest status. This is typically taken care of by the standard backport process, so please consider following that in future so that we can easily track which releases the PRs land in.
InjectLabels is one of the functions responsible for synchronizing the ipcache metadata store and ip store. As such, it shouldn't shortcut when the numeric identity is the same, but the source is different; this means that an update to the ipcache isn't complete.
This can happen, for example, when there are two identities for the same IP, which can happen on daemon restart whenever a CIDR is referenced.
Fixes: #24502