Service for read-write instance #3722

Open
Agalin opened this issue Sep 14, 2023 · 7 comments

@Agalin

Agalin commented Sep 14, 2023

Overview

There should be one more service created for a cluster - one pointing only to the writeable primary. It should not point to any pod if the cluster is currently in standby mode.

Use Case

I have a service mesh (Linkerd) configured in a way that makes it possible to replicate (mirror) services across clusters and use them as if they were local. It's an alternative to a LoadBalancer or NodePort. But in the case of a primary cluster failure, failover is still a multi-step process:

  • Configure one of the standby clusters to be a new primary (disable standby).
  • Update all apps to use a mirrored service of the new primary.
  • If direct replication is used: Update all other clusters to use a mirrored service of the new primary.
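
For reference, the cross-cluster mirroring is driven by a label on the exported service in Linkerd's multicluster extension. A minimal sketch (the service name, and the choice of which service to export, are assumptions from my setup):

# Export the operator-managed primary service so that linkerd-multicluster
# mirrors it into the linked clusters (mirrored copies get the link name as a suffix).
kubectl label service db-primary mirror.linkerd.io/exported=true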

Linkerd (and probably other service meshes) makes it possible to automatically fail over a service when the main one becomes unreachable: it can route all traffic to the mirrored service. But it works in such a way that traffic is split equally between all secondary services, which means that with the current setup you can only put a single failover cluster in the secondary list. Otherwise, when the primary fails, traffic gets split equally across all clusters even though their nodes are read-only.

With a service pointing only to a writeable node, failover would look like this:

  • Configure one of the standby clusters to be a new primary (disable standby).
  • Linkerd automatically updates routing rules to point to the new primary cluster after detecting an available endpoint.
  • All apps are transparently redirected to the new primary.
  • All other clusters are transparently redirected to the new primary.
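
For context, the automatic failover above is what Linkerd's failover extension does on top of an SMI TrafficSplit. A rough sketch, assuming the linkerd-failover and linkerd-smi extensions are installed and that db-primary-dc2 is a service mirrored from another cluster; the annotation and label keys are the ones I believe the extension expects, so double-check them against its docs:

# Hypothetical TrafficSplit: the failover extension shifts weight to a mirrored
# backend when the local db-primary loses all of its endpoints.
kubectl apply -f - <<EOF
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: db-primary
  annotations:
    failover.linkerd.io/primary-service: db-primary
  labels:
    app.kubernetes.io/managed-by: linkerd-failover
spec:
  service: db-primary
  backends:
    - service: db-primary        # local backend, preferred
      weight: 1
    - service: db-primary-dc2    # mirrored service from another cluster (assumed name)
      weight: 0
EOF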

Desired Behavior

To the list of services managed by the operator:

  • db-ha
  • db-ha-config
  • db-pods
  • db-primary
  • db-replicas

add a new one (db-ha-rw, db-primary-writeable, whatever) - or even just attach a label to the current primary indicating that it is writeable - then the user can create such a service manually.
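
To make the request concrete, a user-managed version of such a service might look like the sketch below. The read-write label key is hypothetical (providing or documenting something like it is exactly what this issue asks for), and the role value carried by the primary pod is worth confirming with kubectl get pods --show-labels:

# Hypothetical read-write-only service for a cluster named "db": it only has
# endpoints while a pod carries both the primary role and the read-write label,
# so a cluster in standby mode exposes nothing.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: db-ha-rw
spec:
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
  selector:
    postgres-operator.crunchydata.com/cluster: db
    postgres-operator.crunchydata.com/role: master   # verify the role value in your deployment
    example.com/read-write: "true"                   # hypothetical label
EOF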

Environment

Tell us about your environment:

Please provide the following details:

  • Platform: Kubernetes
@ValClarkson
Contributor

Hi @Agalin

There should be one more service created for a cluster - one pointing only to the writeable primary. It should not point to any pod if the cluster is currently in standby mode.

This is an interesting idea. It sounds like you have everything you need to try it out. If it works well, we can consider automating it here in the operator.

... failover would look like this:

  • Configure one of the standby clusters to be a new primary (disable standby).
  • Linkerd automatically updates routing rules to point to the new primary cluster after detecting an available endpoint.

It sounds like the first step is manual, or you've automated it already. If so, in that same step (or separately), you can update the selector of the service you're proposing. Maybe the following?

# coming out of standby
kubectl patch postgrescluster hippo -p '{...}'
kubectl patch service hippo-linkerd -p '{...}'
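
Fleshing that out with hedged payloads (this assumes a PGO version where standby is toggled via spec.standby.enabled, and the hippo-linkerd service plus its selector keys are just examples):

# leave standby mode (assumes spec.standby.enabled exists in your PGO version)
kubectl patch postgrescluster hippo --type merge \
  -p '{"spec":{"standby":{"enabled":false}}}'
# repoint the user-managed service at this cluster's primary pod (example selector keys)
kubectl patch service hippo-linkerd --type merge \
  -p '{"spec":{"selector":{"postgres-operator.crunchydata.com/cluster":"hippo","postgres-operator.crunchydata.com/role":"master"}}}'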

Let us know in the Discord how it goes!

@Agalin
Author

Agalin commented Sep 15, 2023

The first step (failover to another DC) is expected to be manual. I don't think it would be safe to automate without PGO being multicluster-aware.

Yeah, it should be possible to test by creating a service in each DC that selects nothing everywhere except in the current read-write cluster (where it selects the primary) and patching it together with the cluster transitioning from standby.

Now that I think about it - maybe it's doable even in a single action. Create a service which selects the local primary + a custom label configured through the postgrescluster object. Only configure this label on the read-write cluster. Then it can be changed together with standby, in a single patch operation.
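
A sketch of how that custom label could be wired, assuming spec.metadata.labels on the PostgresCluster propagates to the instance pods (the cluster name and label key are placeholders):

# On the read-write cluster only: have the operator stamp a custom label onto
# everything it manages, including the instance pods.
kubectl patch postgrescluster hippo --type merge \
  -p '{"spec":{"metadata":{"labels":{"example.com/read-write":"true"}}}}'

The per-cluster service would then select both the primary role label and example.com/read-write, so it only resolves to a pod in the cluster currently holding the writeable primary.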

@jmckulk
Collaborator

jmckulk commented Nov 30, 2023

Hey @Agalin, do you have any updates on this?

@Agalin
Author

Agalin commented Nov 30, 2023

Ah sorry, was busy with various things and forgot to update this topic. 😞

I've been able to create the described setup - a separate service that selects only the primary node of the read-write cluster, with the read-write label assigned manually. I have this service deployed in each cluster and replicated between them using Linkerd with failover rules configured, so if a cluster dies it takes just a single PostgresCluster resource update in a single k8s cluster (set standby to false and add that custom label) to perform a failover to a standby in all k8s clusters at once.
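
For the record, that single resource update could look roughly like this (the cluster name, field names, and label key are examples/assumptions to verify against your PGO version):

# promote this k8s cluster: leave standby mode and mark it read-write in one patch
kubectl patch postgrescluster hippo --type merge \
  -p '{"spec":{"standby":{"enabled":false},"metadata":{"labels":{"example.com/read-write":"true"}}}}'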

@tjmoore4
Contributor

tjmoore4 commented Feb 6, 2024

Hello @Agalin. Thank you for the update. I'm curious if, with your current setup, you still believe the additional managed service would be beneficial. There is obviously a bit of overhead with each service we manage, so if this can be accomplished as needed using existing labels, etc., that may be the best option, but I want to make sure I understand the use case as well as possible.

@Agalin
Author

Agalin commented Feb 7, 2024

I understand your concerns and believe that the service isn't necessary. Neither is a new label, although either that or adding one more option to postgres-operator.crunchydata.com/role would be nice, as it increases state visibility.

From my perspective, all the building blocks are there, which means a sufficient solution would be to document it. If you find this scenario not particularly useful for the broader user base, then I won't oppose closing this issue.

@tjmoore4
Contributor

@Agalin thank you for the additional details. After thinking a bit more through your scenario, I've added an item to our backlog for consideration.
