Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Go client libraries for etcd #11980

Merged
merged 6 commits into from Sep 29, 2021
Merged

Conversation

tsaarni
Copy link
Contributor

@tsaarni tsaarni commented Jul 2, 2021

This change updates the go client libraries for both etcd2 and etcd3.

Fixes #9915

Signed-off-by: Tero Saarni tero.saarni@est.tech

@hashicorp-cla
Copy link

hashicorp-cla commented Jul 2, 2021

CLA assistant check
All committers have signed the CLA.

@vercel vercel bot temporarily deployed to Preview – vault July 2, 2021 12:29 Inactive
@vercel vercel bot temporarily deployed to Preview – vault July 2, 2021 12:31 Inactive
@tsaarni
Copy link
Contributor Author

tsaarni commented Jul 2, 2021

I would like to kindly ask help on how to run the etcd related unit tests.

So far, I have started etcd3 in docker according to instructions on etcd release notes, then executed
ETCD_ADDR=http://localhost:2379 go test -v github.com/hashicorp/vault/physical/etcd -run ^TestEtcd3Backend and monitored the HTTP activity on Wireshark. The test case passes but I don't see much activity on the etcd REST API at all, and surely not anything related to sdk/physical/testing.go:ExerciseBackend(), so I think this must be wrong approach

@ncabatoff
Copy link
Contributor

Hi @tsaarni,

I had issues running the docker container, so I followed the MacOS instructions in the link you provided instead. When I ran the test using the command you gave, I did see what looked like relevant traffic on port 2379:

sudo tcpdump -A -i lo0 tcp port 2379 
ETCD_ADDR=http://localhost:2379 go test -count=1 -v github.com/hashicorp/vault/physical/etcd -run ^TestEtcd3Backend

There was lots of activity, but this packet seemed unambiguous:

10:24:18.351267 IP localhost.52041 > localhost.2379: Flags [P.], seq 1869:1948, ack 1510, win 6356, options [nop,nop,TS val 2496272358 ecr 2496272358], length 79
E.....@.@............I  Kz&...q.......w.....
........................~.i..}.....-..........(
&/vault-1625495058/foo/694d7a770c6659a7

The test also fails if I invoke it this way without etcd running.

One thing to bear in mind is the -count=1 option I gave to go test. If the test had already passed before with those inputs, Go won't re-run it by default, it'll give you the cached result, unless you override that behaviour e.g. by saying -count=1.

I'll also warn you that we've had difficulties upgrading etcd in the past, involving some incompatibilities with it and gprc and cloud.google.com/go. So this may be a very nontrivial upgrade...

@ncabatoff
Copy link
Contributor

I'll also warn you that we've had difficulties upgrading etcd in the past, involving some incompatibilities with it and gprc and cloud.google.com/go. So this may be a very nontrivial upgrade...

This might be better now that etcd 3.5.0 is out. The problems we had were complicated but some notes I've consulted suggested that they might be resolved upon that release.

Back to the testing: many of our backends are tested in CI using docker, e.g. https://github.com/hashicorp/vault/blob/main/physical/consul/consul_test.go#L159. Maybe you could do something similar as part of this PR? I don't know why we haven't already, but I suspect it's just that no one has gotten around to it yet.

@tsaarni
Copy link
Contributor Author

tsaarni commented Jul 5, 2021

This might be better now that etcd 3.5.0 is out. The problems we had were complicated but some notes I've consulted suggested that they might be resolved upon that release.

Hopefully that is the case. The version that is imported currently is bit problematic since some vulnerability scanners flag it as security issue.

many of our backends are tested in CI using docker, e.g. https://github.com/hashicorp/vault/blob/main/physical/consul/consul_test.go#L159. Maybe you could do something similar as part of this PR? I don't know why we haven't already, but I suspect it's just that no one has gotten around to it yet.

Cool, yes I can add that.

Thanks for your help!

@tsaarni
Copy link
Contributor Author

tsaarni commented Jul 8, 2021

I've added test container.

I forced v2 API version in etcd_test.go which I think was the intention since there is also etcd3_test.go? Without that, the code would prefer v3 API if running against etcd v3.x servers. That said, when running with v2 the test fails:

=== RUN   TestEtcdBackend
    etcd_test.go:40: should be exactly 1 key == bar: [bar nested1/]
    etcd_test.go:40: should be empty at end: [foo/]
    etcd_test.go:40: should be empty at end: [foo/]
    etcd_test.go:41: level 1 expected [bar bar/]: [bar bar/ nested1/]
--- FAIL: TestEtcdBackend (2.56s)

I do not know the reason for this. It is running against etcd v3.5.0 server which I started with --enable-v2 to allow v2 API to be used. I have not yet tested if the test case would work with historical etcd v2.3 release. I should do that, but I have doubts if that matters much. After all, etcd 2.3 is years out of support, so I guess the use case for running Vault with v2 must be still against etcd 3.x track?

BTW the v2 API will be removed shortly https://github.com/etcd-io/etcd/blob/main/CHANGELOG-3.5.md: "etcd 3.5 is the last version that supports V2 API".

@ncabatoff
Copy link
Contributor

Thanks for bringing this up. We discussed it internally and agree it makes sense to drop etcd v2 support. Would you like to tackle this as part of this PR? I'm happy to help however I can.

@tsaarni
Copy link
Contributor Author

tsaarni commented Jul 11, 2021

Would you like to tackle this as part of this PR? I'm happy to help however I can.

Would it be OK to create another issue for removal of etcd2 support and continue with it in separate PR?

I would be grateful if we could proceed with the client library upgrade separately from etcd2 deprecation, just to fix #9915 first. Would it be acceptable to proceed by disabling etcd2 tests in this PR (*)?

(*) Meanwhile, I have figured out the reason for the etcd2 test case failures:

=== RUN   TestEtcdBackend
    etcd_test.go:40: should be exactly 1 key == bar: [bar nested1/]
    etcd_test.go:40: should be empty at end: [foo/]
    etcd_test.go:40: should be empty at end: [foo/]
    etcd_test.go:41: level 1 expected [bar bar/]: [bar bar/ nested1/]
--- FAIL: TestEtcdBackend (2.56s)

First, following observations

  • The test case failure is NOT caused by upgrade of etcd client library: it fails with the old client version too.
  • The test case failure is NOT caused by etcd server version: the failure happens with legacy etcd2 server and etcd3 server in etcd v2 mode.

The root cause for the failure is the data model in etcd2. It had a hierarchical data model where keys could be nested. Removal of leaf key/node in a hierarchy would not automatically clean up the empty nodes that lead to the leaf node. Same problem was noted with zookeeper backend several years ago in #1964. The failing test case was introduced by that PR, but I think nobody tested with etcd2 server. It was even mentioned in the ticket "I am not sure if other backends do not share this bug (i.e. PSQL, MySQL, ETCd, S3, etc...)". Indeed, I believe this problem is relevant for etcd backend aswell.

etcd3 introduced new flat data model and therefore nesting is implemented purely by key naming convention. Therefore the etcd3 backend implementation does not suffer from the problem of leaving empty nodes hanging around and the test case passes.

@ncabatoff
Copy link
Contributor

I would be grateful if we could proceed with the client library upgrade separately from etcd2 deprecation, just to fix #9915 first. Would it be acceptable to proceed by disabling etcd2 tests in this PR (*)?

Sure, that works.

Thanks for the investigation and explanation re the test failure. I'm glad to hear it's not something we need to worry about given that we're dropping the v2 support.

tsaarni added a commit to Nordix/vault that referenced this pull request Jul 15, 2021
* Added etcd server container to run etcd3 tests automatically.

* Removed etcd2 test case: it fails the backend tests but the failure is
  unrelated to the uplift.  The etcd2 backend implementation does not
  remove empty nested nodes when removing leaf (see comments in hashicorp#11980).

Signed-off-by: Tero Saarni <tero.saarni@est.tech>
@vercel vercel bot temporarily deployed to Preview – vault July 15, 2021 11:16 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook July 15, 2021 11:16 Inactive
@tsaarni
Copy link
Contributor Author

tsaarni commented Jul 15, 2021

Hi @ncabatoff, I've now updated this PR by removing the failing etcd2 test case.

I would like to request review for this PR in order to proceed with the client library uplift.

Also, as discussed previously, I've posted new issue #12091 for removal of v2 API support. I will be submitting an initial PR for that separately, and I look forward for guidance on that removal process, since it touches much more visible parts and has potential for being a breaking change.

@ncabatoff
Copy link
Contributor

Hi @tsaarni,

Sorry for the delay, we needed to see how this plays out on the enterprise side. The bad news is that it doesn't work there. We'll need to make changes to our gRPC-using code there before we can merge this.

I'm afraid I can't commit to a time when that will happen, but we're eager to be able to upgrade these dependencies, and we don't want to get into a situation where a security issue forces our hand, so I don't think it will be a super long time.

@ncabatoff
Copy link
Contributor

Hi @tsaarni, we've laid the groundwork on the enterprise side, so we're ready to accept this patch now. Could you pull in main and resolve conflicts please, then I'll merge it?

tsaarni added a commit to Nordix/vault that referenced this pull request Aug 18, 2021
* Added etcd server container to run etcd3 tests automatically.

* Removed etcd2 test case: it fails the backend tests but the failure is
  unrelated to the uplift.  The etcd2 backend implementation does not
  remove empty nested nodes when removing leaf (see comments in hashicorp#11980).

Signed-off-by: Tero Saarni <tero.saarni@est.tech>
@vercel vercel bot temporarily deployed to Preview – vault August 18, 2021 13:41 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook August 18, 2021 13:41 Inactive
@tsaarni
Copy link
Contributor Author

tsaarni commented Aug 18, 2021

@ncabatoff Cool! I've now rebased the PR with main.

@ncabatoff
Copy link
Contributor

Looks like you have a conflict on go.sum. There's also a problem with the circleci tests not running, I'm looking into that.

* Added etcd server container to run etcd3 tests automatically.

* Removed etcd2 test case: it fails the backend tests but the failure is
  unrelated to the uplift.  The etcd2 backend implementation does not
  remove empty nested nodes when removing leaf (see comments in hashicorp#11980).

Signed-off-by: Tero Saarni <tero.saarni@est.tech>
@vercel vercel bot temporarily deployed to Preview – vault-storybook August 18, 2021 15:15 Inactive
@tsaarni
Copy link
Contributor Author

tsaarni commented Aug 18, 2021

Now the PR is up to date again.

I see warning "First-time contributors need a maintainer to approve running workflows".

@ncabatoff
Copy link
Contributor

I see warning "First-time contributors need a maintainer to approve running workflows".

Yeah, it was saying that before, and I approved it, but it didn't start running them. I did so again just now but it still seems stuck.

@tsaarni
Copy link
Contributor Author

tsaarni commented Aug 18, 2021

I think I had this exact same experience with another project, also using circleci. Just a random thought but I wonder if manually approved workflows could get stuck when I've used force push?

* Changelog

Signed-off-by: Tero Saarni <tero.saarni@est.tech>
@vercel vercel bot temporarily deployed to Preview – vault August 18, 2021 18:51 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook August 18, 2021 18:51 Inactive
@tsaarni
Copy link
Contributor Author

tsaarni commented Aug 18, 2021

Out of curiosity, I added a new commit (just a period at the end of changelog entry - it can be squashed afterwards).
I wonder if circleci would now execute the PR if the workflow is approved again?

@ncabatoff
Copy link
Contributor

Looks like it worked, good idea! And now you have a test failure (which looks like an easy fix) and a Solaris build failure, which I'm less confident about. I wonder if etcd still supports Solaris in recent versions? I know they used to...

@tsaarni
Copy link
Contributor Author

tsaarni commented Aug 18, 2021

I will check the solaris build failure tomorrow.

Regarding the other one:

vault/cluster/inmem_layer_test.go:259:3: Fatalf format %d has arg accepted of wrong type go.uber.org/atomic.Int32

I'm not connecting the dots yet, what is the root cause?

@ncabatoff
Copy link
Contributor

I will check the solaris build failure tomorrow.

Regarding the other one:

vault/cluster/inmem_layer_test.go:259:3: Fatalf format %d has arg accepted of wrong type go.uber.org/atomic.Int32

I'm not connecting the dots yet, what is the root cause?

Should be

		t.Fatalf("expected 18 connections to be accepted, got %d", accepted.Load())

Presumably relates to the version update of the atomic package.

* Fixed TestInmemCluster_ConnectCluster after go.uber.org/atomic upgrade

Signed-off-by: Tero Saarni <tero.saarni@est.tech>
@vercel vercel bot temporarily deployed to Preview – vault-storybook August 18, 2021 20:13 Inactive
@vercel vercel bot temporarily deployed to Preview – vault August 18, 2021 20:13 Inactive
@tsaarni
Copy link
Contributor Author

tsaarni commented Aug 18, 2021

Should be ... accepted.Load())

Thanks, I should have spotted that!

Regarding Solaris build, it is failing due to undefined socket option

$ XC_ARCH=amd64 XC_OS=solaris XC_OSARCH=solaris/amd64 scripts/build.sh
==> Removing old directory...
==> Building...
Number of parallel builds: 15

-->   solaris/amd64: github.com/hashicorp/vault

1 errors occurred:
--> solaris/amd64 error: exit status 1
Stderr: # go.etcd.io/etcd/client/pkg/v3/transport
../../go/pkg/mod/go.etcd.io/etcd/client/pkg/v3@v3.5.0/transport/sockopt_unix.go:14:54: undefined: unix.SO_REUSEPORT

I did not see it mentioned anywhere that Solaris support would have been removed so I will check this.

@tsaarni
Copy link
Contributor Author

tsaarni commented Aug 18, 2021

Submitted a question etcd-io/etcd#13298

@ncabatoff
Copy link
Contributor

Hi @tsaarni,

Looks like the fix for etcd-io/etcd#13298 was merged. Would you mind refreshing this PR please?

@tsaarni
Copy link
Contributor Author

tsaarni commented Sep 29, 2021

@ncabatoff It seems tricky to uplift while unreleased. I seem to end up tangled with dependencies between different etcd packages. I did not manage to get it updated even by forcing with replace statements in go.mod :(

@vercel vercel bot temporarily deployed to Preview – vault September 29, 2021 16:53 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook September 29, 2021 16:53 Inactive
@tsaarni
Copy link
Contributor Author

tsaarni commented Sep 29, 2021

@ncabatoff I got it sorted out but it is not pretty!

First, I needed to figure out what the indirect dependencies are, then find out the pseudo-version identifier for the latest commit in main. I did that by running go get <package>@main. I added replace for them, which fixed the problem. The most recent tags look strange v3.0.0 and v3.5.0-alpha but I think this is because etcd creates release tags in release branch, not in main.

Update: I should have realized that I can use released versions of other etcd modules and only replace the one where I have the Solaris fix. So I updated the PR once more. This simplified it a bit but still not pretty :)

@vercel vercel bot temporarily deployed to Preview – vault September 29, 2021 17:47 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook September 29, 2021 17:47 Inactive
Copy link
Contributor

@ncabatoff ncabatoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great - we can tolerate a little bit of ugliness in go.mod for a while, it's far better than the current situation. Thanks so much for tackling this and for being patient with us.

@ncabatoff ncabatoff merged commit 113860b into hashicorp:main Sep 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade go.etcd.io/etcd
3 participants