Add an interface to refresh contact points #1681

Open
joao-r-reis opened this issue Mar 14, 2023 · 0 comments


joao-r-reis commented Mar 14, 2023

Problem statement

After the introduction of the HostDialer interface (#1629), we have successfully created a library that leverages this interface to use gocql to connect to Datastax Astra (original issue was #1487).

However, we are seeing errors during operations that change the full set of C* hosts, where new hosts are added to the C* cluster topology and old hosts are removed. I investigated these errors and found two problems. First, gocql might currently drop some topology events if a status event for the same host is received at roughly the same time. Second, the control connection can reconnect without triggering a ring refresh, which is an issue because some events might be missed during the reconnection process. Since these issues can affect any user of the driver regardless of their C* deployment type, I've submitted a separate PR to fix them (#1680).

Another issue that we discovered is that if, for any reason, the topology events are not correctly sent by the server (e.g. the server nodes crash instead of gracefully restarting/shutting down), gocql can get "stuck" with a list of hosts that are no longer valid. It is then unable to reconnect until the user forcefully closes the session/cluster and creates a new one.

Proposal

We are interested in adding an interface to gocql that would allow client libraries to "refresh" the contact points when gocql fails to reconnect the control connection. I was thinking that it could be something like:

type ContactPointProvider interface {
    Get() ([]*HostInfo, error)
}

Using this interface, a client library could provide an implementation that returns a list of HostInfo objects that would then be passed to the HostDialer (e.g. the HostDialer in gocql-astra uses host IDs as server names in TLS). In the specific case of gocql-astra, these hosts would have a 0.0.0.0 IP address, because the actual IP address is only looked up via system tables during a ring refresh operation. That should not be a problem: gocql-astra is already calling NewCluster("0.0.0.0") and it works fine.

Alternatively, we could just use the same string type that is used for the hosts parameter in the NewCluster() function:

type ContactPointProvider interface {
    Get() ([]string, error)
}

gocql-astra's custom implementation would return a 0.0.0.0 string, and gocql would then build a HostInfo object with an empty host ID. Our custom HostDialer could then see that the provided HostInfo has no host ID, and the behavior would be the same as during session initialization (retrieve host IDs from Astra and use one of them as the server name).

Other libraries and users in general would also be able to leverage this interface to provide a new set of contact points which might be useful in k8s and other types of "dynamic" environments. They can return a freshly resolved set of IP addresses or a new set of hostnames that gocql will then resolve.

This provider would be called by gocql during a control connection reconnection, after all current hosts have been attempted unsuccessfully. Then, gocql would use the new set of contact points to attempt to reconnect the control connection.


@martin-sucha Would you be willing to consider this proposal and review my implementation of it? Your work allowed us to create gocql-astra in the first place and this particular proposal could help not only us but also other users that are facing similar issues with k8s based deployments of C*.

This proposal is independent of #1680, but the goal of both is the same: making the driver more robust in environments where C* topology changes happen frequently. Note that this proposal alone will not be enough if the control connection does not perform a ring refresh after a successful reconnection, so merging #1680 would also be required for this proposal to be effective.
