-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
052 server side apply #9306
base: main
Are you sure you want to change the base?
052 server side apply #9306
Conversation
233eb6b
to
1aea465
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the PodOperator the right place to try things out? Pods are in general never patched but deleted and created. So the SSA gets litttle use there and I wonder whether any tests actually touch it there.
...or/src/main/java/io/strimzi/operator/cluster/operator/resource/ResourceOperatorSupplier.java
Outdated
Show resolved
Hide resolved
52fef64
to
b40b42e
Compare
Good point, I chose it as a simple example only really. In the proposal the issue was seen on StatefulSet, but I think StrimziPodSet is the replacement for this? So I'll look into doing it on the StrimziPodSetOperator instead |
b40b42e
to
51e3e24
Compare
I think StrimziPodSets would work. Or Service Accounts for example might be a good example.
So how does it differentiate from the other mock server? |
@@ -262,6 +297,23 @@ protected T patchOrReplace(String namespace, String name, T desired) { | |||
return operation().inNamespace(namespace).withName(name).patch(PatchContext.of(PatchType.JSON), desired); | |||
} | |||
|
|||
protected T patchOnly(Reconciliation reconciliation, String namespace, String name, T desired) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the logic behind naming it patchOnly
? I'm not sure I understand what would be the alternative to it.
protected T patchOnly(Reconciliation reconciliation, String namespace, String name, T desired) { | ||
try { | ||
return operation().inNamespace(namespace).withName(name).patch(serverSideApplyPatchContext(false), desired); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to watch here for some more specific exception or error code?
try { | ||
return operation().inNamespace(namespace).withName(name).patch(serverSideApplyPatchContext(false), desired); | ||
} catch (Exception e) { | ||
LOGGER.debugCr(reconciliation, "{} {} in namespace {} failed to apply, using force", resourceKind, name, namespace, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is unexpected, should it be a warning message?
I left some more comments. I guess the next steps would be:
|
And I guess share what I tried ... it does not seem to work on my ARM based MacOS. It seems to fail with this when trying to run the
And this from command line:
|
@scholzj We did try it, and seems it doesn't work on Power(ppc64le).
|
I get those connection warnings too, perhaps there is a way to turn them off in the logging, but they are not necessarily indicating a failure - @Vaibhav-Nazare do you have further logs after the ones you showed? It should eventually show the usual test output. Mine does pass and doesn't take very long ( This issue confirming that the warnings in the logs are just readiness checks: java-operator-sdk/jenvtest#97 Also an issue with someone experiencing the same timeout: java-operator-sdk/jenvtest#105 And another suggesting maybe an existing kubeconfig file could be the problem: java-operator-sdk/jenvtest#116 My logs for reference:
|
bc62be9
to
10174a8
Compare
I've updated the regression tests to use server side apply and added the flag to all the operators, so opening the PR properly for review. Please let me know if there are any remaining changes you think are needed. |
I will have another look and try to run the system tests, But we still need to make the unit tests work for everyone and on all platforms we support. We cannot rely on system tests only. |
In the first case of Power(ppc64le), we need to see the full log output, but I won't be able to help to reproduce/diagnose it. In the case of ARM based MacOS, this is the same platform as mine so it could be a different issue - did my previous comment shed any light on the problem? I linked some issues from the project that could be related i.e. maybe an existing kube config file is causing the problem? |
/azp run feature-gates-regression |
Azure Pipelines successfully started running 1 pipeline(s). |
5a737fe
to
c775fa2
Compare
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run feature-gates-regression |
Azure Pipelines successfully started running 1 pipeline(s). |
Hey @scholzj! I will be taking over work on this PR with help from @jamesfarlow (if needed). Thanks for triggering regression tests! I see a few of them failed and Azure killed two jobs as well. Is that normal? Looking at the previous runs I don't think so but still, this run took a lot more time than other runs. I also ran all failed ones (namely |
There is a 6 hours limit. So if there are too many failures, the pipeline will be killed. If you check the logs, you should see at least some of the tests that failed in them. Also, in the pipelines that failed in time, you should see what failed. When a lot of stuff fails like this, it is often related to similar issues.
Are you sure you run then with the server-side-apply enabled? I did not had much time to work on this lately. I was looking at how to deal with the mocked tests to work on all platforms and to work in a reasonable time. One of the alternatives might be to use https://github.com/dajudge/kindcontainer instead of the Jenvtest. But the main issue is that all of them are much much slower than the current solution. And what really needs to be done for this PR is not just have single test for server apply. But have all the MockKube tests use something that supports server side apply. Unfortunately, I'm busy with Kraft support, so did not had much time to move forward with it :-(. |
54403af
to
f8f61de
Compare
Signed-off-by: Adam Jedro <adam.jedro@mettle.co.uk>
Signed-off-by: Adam Jedro <adam.jedro@mettle.co.uk>
b6b10d3
to
a7e32db
Compare
Hi @scholzj The last run of regression tests passed without any failures. Could you share the next steps regarding this PR? Thanks! |
@adamj-m I'm at the KubeCon this week so will probably not get to this until next week, sorry. As for the docs, please check the docs for the other feature gates: There are several docs section where the feature gates are listed and explained what they do. So we should have it for this as well. IIRC the code, I think it would be useful to have some more unit tests with the SSA. E.g. copy one of the KafkaAssembly tests with Mockito and one with MockKube and have a copy of them run with SSA. I might be able to provide more details next week if I get to this. |
@adamj-m Trying to get back to this after KubeCon. As for the documentation, I think there are two main chapters to add this to:
Just check what is there for the existing feature gates and add similar sections for this one. I will try to go through the code again and leave more comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments:
- I suspect you should also update the
io.strimzi.operator.common.operator.resource.concurrent
package. - This does not seem to propagate to the User and Topic Operators. I guess we should decide if that is desired or not. It might be useful to have it covered by a single feature gate and have it promote together. Unless we decide to not use it there - the value there is probably much smaller than in the cluster operator.
- You should test the Strizi Drain Cleaner and the impact this has on it. Will it still work? Will it need to force patching the manual rolling update annotations?
- You should run some Cluster Operator mock tests with the SSA feature gate enabled. For example:
KafkaAssemblyOperatorMockTest
KafkaAssemblyOperatorWithPoolsKRaftMockTest
KafkaConnectAssemblyOperatorMockTest
StrimziPodSetControllerMockTest
- Just create a copy of these classes with a new name and enable there the SSA
- Similarly, you should run some Cluster Operator mockito tests with the SSA feature gate enabled. For example:
KafkaAssemblyOperatorPodSetTest
KafkaMirrorMaker2AssemblyOperatorPodSetTest
- Maybe some more such as
KafkaStatusTest
etc. - Again, same principle, copy the test and test the same things with SSA.
- (In order to propagate the feature gate to beta, all the existing mockito tests will likely need to be updated sooner or later, but no need to do it for all of them in this PR 😉)
if (current == null || desired == null) { | ||
return false; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? If it really is needed, maybe you could add some explanatory comment?
cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaCluster.java
Show resolved
Hide resolved
cluster-operator/src/main/java/io/strimzi/operator/cluster/model/ZookeeperCluster.java
Show resolved
Hide resolved
boolean isNoop = patchResult instanceof ReconcileResult.Noop; | ||
boolean patchedUsingServerSideApply = patchResult instanceof ReconcileResult.Patched | ||
&& ((ReconcileResult.Patched<Deployment>) patchResult).isUsingServerSideApply(); | ||
|
||
if (isNoop || patchedUsingServerSideApply) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code section seems wrong. We should use annotation with hashes to track certificate changes. And I wonder if we already do and this just wasn't removed. But that is not related to this PR - I will try to look into it separately.
But I wonder why does the patchedUsingServerSideApply
matters here? Why does that have an impact on the rolling of the pods? Does SSA not return Noop?
@@ -595,6 +595,7 @@ public void createRenewOrReplace(String namespace, String clusterName, Map<Strin | |||
// cluster CA certificate annotation handling | |||
Map<String, String> certAnnotations = new HashMap<>(2); | |||
certAnnotations.put(ANNO_STRIMZI_IO_CA_CERT_GENERATION, String.valueOf(caCertGeneration)); | |||
certAnnotations.put(Annotations.ANNO_STRIMZI_IO_FORCE_RENEW, "false"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, for this and all the other annotations which the operator owns through this -> I guess you would also need to change the commands in the docs to force the update? Or will they work as they are today?
LOGGER.debugCr(reconciliation, "Deleting managedFields from request before pathing resource {} {}/{}", resourceKind, namespace, name); | ||
desired.getMetadata().setManagedFields(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? Maybe some comments would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to update managed fields (the internal metadata about SSA / field ownership) at all. It caused problems with StrimziPodSet
where whole StrimziPodSet
object was replaced.
I think it was io/strimzi/operator/cluster/operator/assembly/KafkaReconciler.java:527
but I am not sure :)
@@ -248,6 +278,17 @@ protected Future<ReconcileResult<T>> internalUpdate(Reconciliation reconciliatio | |||
} | |||
} | |||
|
|||
protected Future<ReconcileResult<T>> internalPatch(Reconciliation reconciliation, String namespace, String name, T desired) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAybe you can add some Javadocs to these methods even through they are not public?
@@ -262,6 +303,23 @@ protected T patchOrReplace(String namespace, String name, T desired) { | |||
return operation().inNamespace(namespace).withName(name).patch(PatchContext.of(PatchType.JSON), desired); | |||
} | |||
|
|||
protected T patch(Reconciliation reconciliation, String namespace, String name, T desired) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reasoning behind having patch
and internalPatch
? I'm not sure I really understand the need fo the two methods. Maybe this is where the Javadocs would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't understand the difference between this test and the next one for PodOperator. Plus, we likely never patch a Pod, so this test is not really that valuable on its own.
Why not integrate these tests into all the test classes instead of creating a separate classes for just few of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow the changes in this file. Why are they needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was flaky and failed on my env (M2) - I fixed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never saw this fail on my environment. Are you sure this wasn't related to any changes from this PR?
Also, I think the waitUntilCondition
should not be called with a blocking code, so I'm not sure whether the change you did is correct. Also, my understanding is that it is repeatedly calling the predicate until the timeout. So the predicate should not be hanging for 10 seconds.
Signed-off-by: Adam Jedro <adam.jedro@mettle.co.uk>
Signed-off-by: Adam Jedro <adam.jedro@mettle.co.uk>
…rrent and add FeatureGates support to User Operator Signed-off-by: Adam Jedro <adam.jedro@mettle.co.uk>
Signed-off-by: Adam Jedro <adam.jedro@mettle.co.uk>
Hi @scholz Thanks for the thought review!
|
Thanks for the changes.
I did not realized that. I think what you did would work in general. But I'm not sure it is optimal. I wonder if it would be better to have a single global feature gates configuration in operator-common shared between all classes. That way you can:
I realize that I suggested to copy them. But seeing the result, I wonder if it would be better to do one of this:
Can you elaborate on this please? Why do you think it is not needed?
This would be worth investigating so that we know what to expect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I guess I suggested to just copy it. I wonder if using some parametrization or abstract class with two implementations each using just a different SSA setting would make it easier 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably have some unit tests? Also, how does this handle interdependencies between feature gates which are a very common thing?
/** | ||
* Configuration string with feature gates settings | ||
*/ | ||
public static final ConfigParameter<FeatureGates> FEATURE_GATES = new ConfigParameter<>("STRIMZI_FEATURE_GATES", parseFeatureGates(), "", CONFIG_VALUES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. But I guess you also need to pass it to the user operator based on the setting in the cluster operator.
Hey, unfortunately Adam is not gonna be able to work further on this. I'm gonna step into his place. Wanted to inform you as we're still interested to push this forward and not just let this feature rot. |
@krystian-kulgawczuk-natwest Ok, thanks for letting us know. Please check the last comments I opened and let me know if there is something you need me to help with. I think this is very close to being complete, so it would be great to get it done. |
FYI: I opened a new proposal strimzi/proposals#118 to best deal with how the feature gates are configured in the User and Topic Operators. Hopefully that way we can split it off from this PR and simplify things a bit. |
Hey @scholzj, read through the proposal and it seems like a good preemptive move to not cause a lot of mess with feature gates. I just got back from vacation, so I will be able to give more attention here 😅 |
I'm fine to implement the proposal once it is approved. But we will then need to rebase this PR on top of it once it is merged. Let's see how many and hwat kind of comments will the proposal get. If things go well, it might be done next week. |
Ok cool. I'll try to close everything else that is not related to feature gates. |
I looked through those tests and tried to elaborate what Adam had in mind. I think what he meant was for the tests you mentioned, controllers/operator/etc that would actually take as an argument this is an example of such a test. In order to properly test the Please correct me wherever I'm wrong and I'm up for any suggestions how do you think this should be done. |
@krystian-kulgawczuk-natwest The
|
I get the point of writing tests for both of these cases, but I'm not exactly sure how do we want to achieve that for |
Sorry, I don't follow what exactly you mean with that. |
Okay, let me try to explain this as detailed as possible 😅 So for
*1) kube api supplier edit1: |
In some cases you might need to mock these things yes. But in others, you likely mock the whole |
Type of change
Description
This PR is an initial attempt at adding Server Side Apply (SSA) to the reconcile process, as per proposal 052 - https://github.com/strimzi/proposals/blob/main/052-k8s-server-side-apply.md
It is not finished work but I am opening the PR to get early feedback - some details below and also prompts/questions on areas that I am unsure about:
PodOperatorServerSideApplyApiServerTest.java
test.PodOperatorServerSideApplyTest.java
which overrides any tests which were failing in the originalPodOperatorTest.java
so that it is obvious where the behaviour is changed and how we have to change the mocks/verifys to match. How would you see this approach scaling if we enable SSA on all the operators? Should we create new abstract classes to represent the SSA behaviour and create new tests extending these for all the operators when SSA is enabled?FeatureGatesST
?ResourceOperatorSupplier.java
but I can't actually get this part of the project to build on my IDE so may not be correct. I also can't build on the terminal when I rebased to the latest upstream changes due to errors unrelated to my changesChecklist
Please go through this checklist and make sure all applicable tasks have been done