Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One Pager: Resource Status Improvements #5453

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

TerjeLafton
Copy link
Contributor

Description of your changes

This PR introduces a one-pager with some potential improvements to the current statuses and how they are set.

The idea was initially brought up in this crossplane-runtime issue: crossplane/crossplane-runtime#583
It was then discussed further in this PR, also in crossplane-runtime, alongside a potential implementation to solve the issue: crossplane/crossplane-runtime#654

This was a big enough change that we agreed to document it in a one-pager. I took a while to work through this thoroughly, but it is still only based on my experiences working with Crossplane at work. Any feedback or discussion is very welcome.

I have:

  • Read and followed Crossplane's [contribution process].
    - [ ] Run make reviewable to ensure this PR is ready for review.
    - [ ] Added or updated unit tests.
    - [ ] Added or updated e2e tests.
    - [ ] Linked a PR or a [docs tracking issue] to [document this change].
    - [ ] Added backport release-x.y labels to auto-backport this PR.

Terje Lafton / Intility AS added 2 commits March 4, 2024 15:09
@TerjeLafton TerjeLafton requested review from negz and a team as code owners March 5, 2024 12:10
@TerjeLafton TerjeLafton requested a review from turkenh March 5, 2024 12:10
@TerjeLafton
Copy link
Contributor Author

@negz @turkenh @phisco

As discussed in the PR, here is an attempt at a one-pager. Any review is greatly appreciated :)

@negz negz self-assigned this Mar 11, 2024
@negz
Copy link
Member

negz commented Mar 11, 2024

@TerjeLafton Thanks for opening this. I'll do a review pass ASAP - probably later this week at this stage.

@TerjeLafton
Copy link
Contributor Author

Just a friendly reminder, @negz, in case it got lost in other tasks 😄

@negz
Copy link
Member

negz commented Mar 25, 2024

@TerjeLafton Apologies, I got pretty sidetracked with Kubecon EU last week. This is near the top of my todo list.

@negz
Copy link
Member

negz commented Mar 25, 2024

@TerjeLafton Here's a quick brain dump of some things that I'd like to see discussed/considered in this design. Apologies if some of this is already covered, this is a bit write-only until I get time to load up all the context again:

@TerjeLafton
Copy link
Contributor Author

TerjeLafton commented Mar 26, 2024

@TerjeLafton Here's a quick brain dump of some things that I'd like to see discussed/considered in this design. Apologies if some of this is already covered, this is a bit write-only until I get time to load up all the context again:

Thank you, the points you mentioned are very insightful! I'll just reply to your thoughts as a reply now, but please

  • Should we try to adopt kstatus (see here)? Would it help with the problem at hand? I wonder how widely adopted it is by other projects.

I did look into kstatus as part of the research for this one-pager, but I mostly used it to learn more about the theory and history of statuses in k8s. AFAIU, kstatus is mostly a library that users can use to inspect their k8s resources and easily get the true status of the resource, if it is supported.

Crossplane has quite good conditions implemented already, so adopting kstatus would mostly mean changing the Reason field of the Ready condition to be these strings: InProgress, Failed, Current, Terminating, NotFound or Unknown.
This is how I see this being implemented:

  • Creating and Updating => InProgress
  • Unavailable => Failed
  • Available => Current
  • Deleting => Terminating
  • The ready status for new resources are not updated until a significant change, like something going wrong or it being successful. To comply we should probably set Unknown immediately to make sure it always has a status. No status is considered successful in kstatus, so this would be unfortunate.
  • The NotFound could be implemented after external.Observe is called, if it returns ResourceExists: False. It would be a short lived status as it would quickly be changed to Failed, InProgress or Current, but it could be nice.

As conditions are part of the CRD's API, would this be considered a breaking change? It would probably be a very minor one, but I know some of the systems we have build would need changing to support this.

Implementing this would comply with my wish for the Ready condition to be more about if the desired state is compliant, so I don't personally mind this change. Of course including that we should requeue a reconcile after updates to get the status from InProgress to Current as soon as possible.

As stated in the issue, by you and others, I see observedGeneration as a very good addition, but for different reasons. kstatus also mentions this, as one drawback of conditions is if the controller is not running or is unable to update a resource's conditions, it would keep it's old status, which could be wrong.

The users should primarily be focused on conditions, and we should give them everything they need to understand the status of their resource here. The observedGeneration would just give users of kstatus, or someone that checks this themselves, more insight, as it would show the resource as Ready: false.

  • Which Crossplane resources will we change? Only MRs? Or also XRs?

Initially I didn't consider XRs part of this, as their status is only based on the status of their referenced resources. With the addition of composition functions, I am even responsible for updating their condition myself using your function-auto-ready.

But if we consider adopting kstatus, then XRs should probably also be updated. I am not familiar with the reconciler for XRs and other Crossplane CRDs though, so I couldn't say what this would mean to implement.

  • What's the "source of truth" for the MR being creating, updating, etc? Do we expect the external API to actually tell us the external resource is creating, updating, etc (e.g. their API has a field that indicates this) or do we infer it?

For public clouds, I think some of their resource APIs might indicate this. I know for example the VirtualNetworkGateway in Azure has a status field which indicates if it is creating or ready, as this resource takes 40 minutes to deploy.

But for most APIs, I think it would be impossible to indicate from their API fields if they are creating, updating, etc. Inferring it would probably be the most consistent way.
I know provider-azure has extra conditions like AsyncOperation and LastAsyncOperation, which usually provide extra context for users, but I am not familiar with their implementation.

For me, a successful call to external.Create should have the status Creating or InProgress depending on if we want to support kstatus. When we have successfully observed the resource and verified that it is compliant with the desired state, we will change it to Available or Current.
The same would apply for calls to external.Update, as described in the one-pager.

UpToDate -->|No| Update
```

#### Solution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The canonical Kube way is to have observedGeneration as proposed for Crossplane in crossplane/crossplane-runtime#633. If it does not match the generation in metadata, the reconcile has not finished.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advantage here: it reduces writes when an update is requested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what is the cost of a write? I imagine it is quite small, especially with updates which in my experience is one of the rarer operations.

My point is that I think it is worth it to have more precise statuses that can fully represent the state of the resource.

Copy link
Member

@jbw976 jbw976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool @TerjeLafton, thank you for taking the time to think through this scenario and write down your thoughts in this proposal. That's really helpful for community members to articulate and drive the changes that are important to them. Thank you! 🙇‍♂️

I've thought about this more deeply this morning, and while I may not have some of the specific implementation details for crossplane-runtime's reconciler, I do feel like I've formed an opinion on this topic.

In general, I would agree with the high level problem statement, which I would summarize as "How do we know when a managed resource exactly matches its desired state?". A new status condition could possibly be added that captures this need. To majorly simplify the purpose of each status condition, I think they would look like:

  • Ready: resource is ready to use / available
  • Synced: we were able to apply the desired to the system
  • UpToDate: the observed state of the system matches the desired state we applied

The observedGeneration changes in crossplane/crossplane-runtime#633 are going to be helpful, but I don't think they fully cover this UpToDate scenario you've been pursuing in this PR. IIRC, observedGeneration is more about "we saw up to these desired changes and we have acted on them", as described in kstatus:

If the generation and the observedGeneration of a resource does not match, it means there are changes that the controller has not yet seen, and therefore not acted upon.

I don't think that says anything about "and now the observed state of the external system matches that desired state", especially for systems that take a long time to update themselves internally to the state we requested. As we all seem to agree on, external APIs from say the cloud providers generally won't tell us this state directly for all their resources, so the best we can likely do is "does observed actually match desired"? This 3rd condition would mean exactly that and probably be useful.

That is what I would support, as opposed to changing the meaning of Ready.

A few thoughts about kstatus:

  • it could definitely be useful for us to be conformant with a broader standard in use by others in the ecosystem
  • we'll be closer to being able to support it with Add ability to expose resource reconciliation progress crossplane-runtime#633 (comment)
  • It would probably be a requirement to do it in an "additive" way, as opposed to changing any meaning of our long standing conditions with breaking changes, e.g. our existing conditions don't go away or change.

```

#### Solution
Implement an `Updating` status, mimicking the `Creating` status. Setting this status every time an update request either succeeds or fails, informing a user that an update request has been requested, but we have not verified if the resource is compliant with the updated desired state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to add such a status, I think according to the API conventions, it would be better to not surface as a transient state:

Condition type names should describe the current observed state of the resource, rather than describing the current state transitions. This typically means that the name should be an adjective ("Ready", "OutOfDisk") or a past-tense verb ("Succeeded", "Failed") rather than a present-tense verb ("Deploying").

Perhaps it would be better to expose it as UpToDate or something that is an adjective or past-tense verb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this part too, and I agree.
As I was basing my proposal on updating the existing Ready condition, and I wanted it to be more about showing that it wasn't ready yet, I went with Updating, so it would match the Creating reason.

If the solution ends up being creating another condition, then I don`t mind following the API conventions linked.

#### Solution
Implement an `Updating` status, mimicking the `Creating` status. Setting this status every time an update request either succeeds or fails, informing a user that an update request has been requested, but we have not verified if the resource is compliant with the updated desired state.

This also requires updating the reconciler to immediately requeue a reconcile after a successfuly requested update. This ensures that an updated resource is immediatly checked for compliance, to give user quick feedback on if the resource is `Available` again after the update.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected this to already be the case - when we perform an apply/update to an external resource, that we would requeue, because we know there's likely to be an external change soon in response to our update request.

If we're not doing that, then I would support a code change so we reconcile again soon after applying a change to the external system, instead of waiting until the next poll internal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! This was my main motivation for looking into this, and it started with this issue. A suggested fix was also proposed in this PR.

I do believe there are reasons for why the reconciler wasn't designed to reconcile after an update though, having the amount of reconciles in mind. If we assume that a successful update request is enough to say that this resource is ready and compliant, then we don't need to reconcile.
My view has always been that this isn't optimal, and that a successful observe of the resource and checking that it matches our desired state is the only way to really say it is ready for use. An extra reconcile seems worth it to me.


This also requires updating the reconciler to immediately requeue a reconcile after a successfuly requested update. This ensures that an updated resource is immediatly checked for compliance, to give user quick feedback on if the resource is `Available` again after the update.

Note that this would technically change the definition of the `Ready` condition. However, this definition is not very well documented, so I am not sure how many know about it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should change the semantics or meaning of our long standing Ready condition. Way too many folks have made (codified) assumptions about the meaning across the ecosystem, that it's not an option IMO to change its meaning. As you already mentioned above from the API conventions:

Once defined, the meaning of a Condition can not be changed arbitrarily - it becomes part of the API, and has the same backwards- and forwards-compatibility concerns of any other part of the API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally get what you mean! As you pointed out, I commented on this myself, so it was always on my mind.

You know much better than me what others in the community have created, so if you think this is too much of a breaking change, I can understand that.
But do you think there are that many that have created things with a strong assumption that Ready is not about desired state matching observed state? My point is just that, while I agree this would be a breaking change, I think it might be worth looking into if it would actually break anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure people have yes - unfortunately it's impossible to ever know that fully within our ecosystem and it's many thousands of adopters. We can never hear from everyone and therefore we always have to stick with the fundamental principle of no breaking changes or changes to semantics in anything that is v1 API.

#### Solution
Modify the existing condition helper functions in Crossplane to optionally accept a string argument intended for the `Message` field of a condition. This would empower provider authors to provide human-readable summaries when setting condition states.

Crossplane would retain the ability to overwrite a provider-set `Reason` for the `Ready` condition if it's incorrect. This ensures consistency within the API, while providing more nuance in status reporting.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure Crossplane should take on the burden of assessing whether a provider has set an accurate Reason. The provider has more information about the external system than crossplane-runtime does, and I suspect we should either trust the providers fully or not allow them to set a Reason at all. Having Crossplane assess the accuracy of a provider statement feels like it would not be reliable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! My preferred solution would be for providers to be fully responsible for the Ready condition. I would like there to be guidelines for which reasons you can use for a condition though, so the API is consistent across providers.

But as the crossplane-runtime does update the Ready condition, for example with Reason: Creating, it wouldn't work to make that change. That is just why I suggested this, but it is not the most important thing to fix.

Crossplane currently has primary control over the `Ready` condition, limiting the ability of providers to communicate detailed, human-readable status information to users via the `Message` field. While Crossplane's consistent conditions create predictable expectations for users, this generic nature leaves valuable context underutilized. Important status information for users and developers is often relegated to Kubernetes events and logs, which can be less convenient to access and interpret.

#### Solution
Modify the existing condition helper functions in Crossplane to optionally accept a string argument intended for the `Message` field of a condition. This would empower provider authors to provide human-readable summaries when setting condition states.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement seems to conflict with the previous "Problem" statement:

limiting the ability of providers to communicate detailed, human-readable status information to users via the Message field

Providers already can set the Message field, but we should add the ability to set the Message field? Did you perhaps mean Reason field here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, the first sentence in the solution wasn't optimal 😄
The limitation stems from the crossplane-runtime setting the Ready condition most of the time, without a message.

Say something went wrong when creating an external resource. I would then want to create something like, to give good feedback to my users in a human readable format.

Type: Ready
Status: False
Reason: Creating
Message: This and that went wrong, etc.

If I create this condition and set it on my MR, it would immediately be overwritten by crossplane-runtime with a similar condition, without the message.

So my intent with this was basically to respect the message field, but it could have been explained better!

Co-authored-by: Pete Lumbis <pete@upbound.io>
Signed-off-by: Terje Lafton <terje@lafton.io>
@TerjeLafton
Copy link
Contributor Author

Thank you for the review, @jbw976!

I understand the reluctance to change the Ready condition. I have tried to explain why I don't think it would be too big of a deal, but I fully understand if you don't agree. You know much better than me what the community has made and how breaking this change would be.

Adding an extra condition would solve the issue too. My only problem with this is that the Ready condition and the UpToDate condition, would overlap in most ways except for how they view a resource that has recently been updated. But with good documentation, I think it could make sense to users!

I definitely support trying to be compliant with kstatus! I might be wrong though, but it seems like kstatus is largely based on the Ready condition, which Crossplane already provides. Changing Ready to be compliant with kstatus would be a very breaking change.

I am new to this, so please tell me what I should do with this PR as reviews come. Do we discuss it until we reach a point where we generally agree, and then rewrite the one-pager to match the final design decision, if any, or how does this work?

@pedjak
Copy link
Contributor

pedjak commented Apr 19, 2024

related PR: crossplane/crossplane-runtime#633

+1 on being kstatus compliant.

@jbw976
Copy link
Member

jbw976 commented Apr 22, 2024

Thanks for your continued collaboration @TerjeLafton! I think this is a tricky issue to work through because it has a lot of nuances and impacts every resource in the crossplane ecosystem, which isn't a small surface area 😉

I am new to this, so please tell me what I should do with this PR as reviews come. Do we discuss it until we reach a point where we generally agree, and then rewrite the one-pager to match the final design decision, if any, or how does this work?

Typically if we reach alignment during a PR discussion, it's reasonable to push a commit that captures that newly aligned state. When things are still in active discussion, it's reasonable to hold off on making any updates. Try to find a balance between avoiding churning your doc and pushing work that you may need to change again. It's tough to find a perfect balance, but thanks for the efforts always!

I'll share my current thoughts here (influenced by some feedback from @negz), but I still have some lingering questions that I don't have answers for personally - perhaps others in the community will be able to add their knowledge as well.

  • An UpToDate condition feels useful for users to quickly understand if observed state matches the desired state
  • However, there are doubts about the feasibility and practical usefulness of this condition across the N 1000's of resources in the Crossplane ecosystem
    • The vast majority of the time, there likely is no noticeable delay between when desired state is applied to an external API and when that API reports that the status has been applied (in the form of observed state).
    • In those cases, UpToDate essentially becomes the same as Synced
    • Do you have some particular resources where you've found this type of functionality would be most helpful? Some practical examples may help ground the thinking again.
  • Behavior for Claims and XRs
    • I don't think it was discussed in this proposal, but I was imagining that UpToDate would be fairly naive/straightforward here. A Claim is UpToDate if its XR is UpToDate. A XR is UpToDate when all its composed resources are UpToDate. This condition should probably be settable by composition authors similar to how Ready works (e.g., function-auto-ready)
  • Overlap with Add ability to expose resource reconciliation progress crossplane-runtime#633
    • I'm trying to reason about this new functionality and if it can help answer the questions we want here too. I'm not sure of how it's being rolled out yet though, because I don't see any consumers of that functionality yet.
    • It's problem statement seemed very similar though, but I'm not actually sure how much overlap there is.
  • kstatus support

@jbw976
Copy link
Member

jbw976 commented Apr 24, 2024

@TerjeLafton note the further discussion on kstatus in #2672 (comment) - I feel like I'm going in a bit of circles mentally as we're thinking through the somewhat foundational impact this idea has (sorry about that), but it's getting harder to avoid kstatus now and to not just completely dive into it to get the full picture figured out. What do you think?

@TerjeLafton
Copy link
Contributor Author

I think going all in on kstatus is perfect! It would solve all the completeness issues with the current API.

But is there a way to be fully compliant with kstatus without breaking the current API? I might be missinterpreting the 'kstatus' document about conditions, but it seems to be relying on a condition named Ready, which Crossplane already have. Chaning the Ready condition to be compliant would definitely be breaking.
Is it possible to implement the kstatus required condition with another name?


I'll just provide a little context for what inspired this PR in the first place, as you asked for some relevant examples.

We were building a Crossplane provider for use in-house against network devices. Because of security related reasons, we couldn't let a resource that is not compliant with the desired state be seen as Ready.
Because of this, we implemented an Updating status for the Ready condition, which solved this issue for us. We also had to have a quite high poll interval, more than an hour, for the provider, as the devices we were programming against could not handle too many requests.
As described in the one-pager and comments afterwards, we then had issues with Crossplane not requeueing reconciles after the call to external.Update, which caused our resources to be not ready and Updating for an hour.

As I can implement custom status for conditions, my issues would be solved only be reconcileing after updates. It also seems like you agreed that this is a good change to implement here.

It is mostly the discussions in the issue and PR I made that inspired me to look more into if there could be made improvements to the API to make it more complete. I tried to think of ways to improve it without breaking changes, or at least minor ones, and kstatus was on my mind, but I didn't see a way to be compliant without big breaing changes.

A minor extra comment: We also implemented a Waiting status for Ready to be used when a resource is waiting for it's references to be Ready, which made it a lot easier for our users. I didn't include this here as it didn't seem as importent.

@TerjeLafton
Copy link
Contributor Author

I just read the kstatus document again, as I have been struggeling to fully grasp it.
It seems I might have been very wrong in my understanding of kstatus, even that it requires a Ready condition. The Ready condition is just a fallback, it seems.

So to be kstatus compliant, you basically just need to implement observedGeneration, and optionally the Reconciling and Stalled conditions. Through this, kstatus is able to give us a reliable status for our resource.

This sounds good, and should be something we could implement quite easily. The Reconciling and Stalled could probably be implemented in crossplane-runtime too, so no changes to providers would be nessecary.

But one concern I have is about how users experience the API directly. kstatus is nice and makes it easy for programmers to get a consistent status from a resource for what they are building, but every user won't look at their resources with tooling that parses the resource with kstatus.

The API convention states that the conditions should ideally be a complete view for the user to get a good idea of the status of their resource. I feel like a complete Ready condition or similar is still needed to give the users everything they need.

@vibe
Copy link

vibe commented May 2, 2024

Realizing I need to consume the rest of these comments and links everyone is sharing, but for what it's worth

We currently use the ready status on a resource to determine whether or not we can access status properties on the resource, which usually includes some form of ID / ARN.

From our perspective, a resource that is updating is still "ready" even if the update fails because its existing status properties remain valid and usable.

For example we may request to expand a disk on a VM which fails, but other resources consuming that VM's ID should consider the VM ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants