
aws-s3: Inconsistent package list in cluster deployment #317

Open
favoyang opened this issue Nov 26, 2019 · 12 comments
Labels
enhancement · plugin: aws-s3

Comments

@favoyang (Contributor)

Describe the bug

This report demonstrates that verdaccio with the s3 backend is stateful: the package list is cached in memory, which causes inconsistency and race conditions in a cluster environment. A solution is proposed below, and discussion is welcome.

I've set up a minimal clustered verdaccio deployment: two verdaccio instances, an S3 store, and nginx as a reverse proxy. The s3 plugin is slightly modified, but nothing that touches the core logic.

Before the test, I already have one package (com.littlebigfun.addressable-importer) in verdaccio.

# space is an alias for aws-cli

$ space ls s3://openupm/verdaccio/
                           PRE com.littlebigfun.addressable-importer/
2019-11-27 00:26:39        126 verdaccio-s3-db.json

$ space cp s3://openupm/verdaccio/verdaccio-s3-db.json -
{"list":["com.littlebigfun.addressable-importer"],"secret":"..."}

Let's publish another package (com.bastianblokland.enumgenerator) for testing.

$ npm --registry=my-registry publish
...
+ com.bastianblokland.enumgenerator@1.6.1

Logs show that the return code is 201, so the publish succeeded. The NotFoundError is harmless for a new package. Notice that the publish was handled by verdaccio instance 0 (the log prefix tells which instance).

0|verdaccio  |  info <-- 127.0.0.1 requested 'PUT /com.bastianblokland.enumgenerator'
0|verdaccio  |  error-=- s3: [S3PackageManager writeTarball headObject] { NotFoundError: no such package available
0|verdaccio  |  http <-- 201, user: openupm(156.236.113.121 via 127.0.0.1), req: 'PUT /com.bastianblokland.enumgenerator', bytes: 1683542/53

The added package is verified in S3.

$ space cp s3://openupm/verdaccio/verdaccio-s3-db.json -
{"list":["com.littlebigfun.addressable-importer","com.bastianblokland.enumgenerator"],"secret":"..."}

$ space ls s3://openupm/verdaccio/
                           PRE com.bastianblokland.enumgenerator/
                           PRE com.littlebigfun.addressable-importer/
2019-11-27 01:08:59        162 verdaccio-s3-db.json

Now the buggy part: let's curl the package list twice. Notice that only the second call returns the newly added package.

# first pass - wrong
$ curl https://my-registry/-/verdaccio/packages
[
  {
    "name": "com.littlebigfun.addressable-importer",
    ...
  }
]
# second pass - correct
$ curl https://my-registry/-/verdaccio/packages
[
  {
    "name": "com.bastianblokland.enumgenerator",
    ...
  },
  {
    "name": "com.littlebigfun.addressable-importer",
    ...
  }
]

Logs show that the second, correct curl result comes from verdaccio instance 0, the instance that just handled the publish. The incorrect result comes from verdaccio instance 1. We can repeat the request many times with the same outcome: instance 1 never returns the newly added package.

1|verdaccio  |  info <-- 127.0.0.1 requested 'GET /-/verdaccio/packages'
1|verdaccio  |  http <-- 200, user: null(156.236.113.121 via 127.0.0.1), req: 'GET /-/verdaccio/packages', bytes: 0/4070
0|verdaccio  |  info <-- 127.0.0.1 requested 'GET /-/verdaccio/packages'
0|verdaccio  |  http <-- 200, user: null(156.236.113.121 via 127.0.0.1), req: 'GET /-/verdaccio/packages', bytes: 0/7787

This behavior implies that verdaccio keeps some sort of in-memory cache of the package list (verdaccio-s3-db.json). Until I restart verdaccio instance 1, there is no way to tell it to refresh that cache. I haven't checked the source code yet, so this is only a guess. But if it is true, verdaccio is not scalable and can only run as a single instance. That wasn't my expectation after discussing with @juanpicado in verdaccio/verdaccio#1459 (comment), where I asked how a shared package list is handled in a cluster environment.

I need some time to read the getLocalDatabase method in https://github.com/verdaccio/verdaccio/blob/dbf20175dc68dd81e52363cc7e8013e24947d0fd/src/lib/storage.ts to figure it out, but please point me in the right direction if I'm missing something obvious.

To Reproduce
You will need a similar deployment: two verdaccio instances managed by pm2 and an S3 backend. Nginx isn't necessary.

Expected behavior
All verdaccio instances should return the latest package list right after a new package is added (or removed).

Configuration File (cat ~/.config/verdaccio/config.yaml)

storage: ./storage
plugins: ./plugins
max_body_size: 200mb
listen: 0.0.0.0:4873

server:
  keepAliveTimeout: 60

middlewares:
  audit:
    enabled: true

web:
  enable: true

auth:
  htpasswd:
    file: ./htpasswd
    max_users: -1

packages:
  '@*/*':
    # scoped packages
    access: $all
    publish: $authenticated
    unpublish: $authenticated

  '**':
    access: $all
    publish: $authenticated
    unpublish: $authenticated

store:
  aws-s3-storage:
    bucket: openupm
    region: sfo2
    endpoint: ...
    accessKeyId: ...
    secretAccessKey: ...
    s3ForcePathStyle: true
    keyPrefix: 'verdaccio/'
    tarballACL: public-read
    tarballEdgeUrl: ...

convert_to_local_tarball_url: false

Environment information

verdaccio: 4.3.4 (modified: verdaccio/verdaccio#1580)
s3-plugin: 8.4.2 (modified: #249)
My modifications are made for #250, which is not related to this bug.


@favoyang (Contributor, Author) commented Nov 27, 2019

Continuing the discussion.

local-storage

The package list (data) is cached in memory via _fetchLocalPackages() when verdaccio starts. When a package is added or removed, or the secret is set, the package list is written back to disk via _sync. The local-storage backend is designed for the simple case and works fine with one verdaccio instance, but it is not suitable for a cluster deployment because there is no way to propagate the mutated package list to the other running instances.

aws-s3-storage

The package list is lazily loaded and cached in memory via _getData(). When a package is added or removed, or the secret is set, the package list is written back via _sync, which pushes the list held in the current instance's memory back to storage. This is essentially the same design as the local-storage backend, just operating on S3 objects instead of local files, so it does not work in a cluster either.
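For illustration, here is a minimal sketch of the pattern shared by both backends, with method names mirroring the ones mentioned above (_getData/_sync); the S3 calls, bucket/key names and types are simplified assumptions, not the plugin's actual code:

import { S3 } from 'aws-sdk';

interface LocalStorage {
  list: string[];
  secret: string;
}

class CachedDatabase {
  private s3 = new S3();
  private cached: LocalStorage | null = null; // per-process cache

  // Lazily load verdaccio-s3-db.json once, then serve every read from memory.
  private async _getData(): Promise<LocalStorage> {
    if (!this.cached) {
      const obj = await this.s3
        .getObject({ Bucket: 'openupm', Key: 'verdaccio/verdaccio-s3-db.json' })
        .promise();
      this.cached = JSON.parse(obj.Body!.toString('utf8'));
    }
    return this.cached!;
  }

  // Write this instance's in-memory copy back to S3; other instances keep
  // serving their own stale copy until they restart.
  private async _sync(): Promise<void> {
    await this.s3
      .putObject({
        Bucket: 'openupm',
        Key: 'verdaccio/verdaccio-s3-db.json',
        Body: JSON.stringify(this.cached),
      })
      .promise();
  }

  async add(name: string): Promise<void> {
    const data = await this._getData();
    if (!data.list.includes(name)) {
      data.list.push(name);
      await this._sync();
    }
  }
}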

I've been digging through a related but old cluster thread, where @wzrdtales originally asked how to handle the .sinopia-db file in a cluster environment. Later @juanpicado stepped in and suggested creating the plugin (adapter) structure first (which we have today, thanks for that), but the original question was never resolved. @wzrdtales said he would take a look, and the conversation stopped there.

Solution A: remove the cache of the package list. It seems like the obvious choice, but it has a race condition. Say two users publish two new packages at the same time: both verdaccio instances read the package list from S3, add their package name, and write it back. The second write overwrites the first, and the first newly added package goes missing. So solution A does not work as-is.

Solution B: get rid of the package list file and use the directory names to represent existing packages.

  • add/remove no longer maintains a list file; uploading package.json and the tarball is enough.
  • getting the package list performs an s3 ls action, treating each sub-folder as a package name, with no in-memory cache of the package list.
  • there is no race condition, because the package list is never written.
  • scoped packages are supported by performing an s3 ls action on each scope folder, for a total of 1 + num_of_scoped_packages s3 list actions.
  • the solution is roughly implemented (without scoped package support) in fix: stateless s3 storage for verdaccio/verdaccio/issues/1595 #275; rough functional testing shows no issues. A sketch of the idea follows this list.
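Roughly, the list-based approach could look like this (AWS SDK v2, assumed bucket/prefix values, pagination and error handling omitted); a sketch of the idea, not the code in #275:

import { S3 } from 'aws-sdk';

const s3 = new S3();

// Derive the package list from the "folder" structure instead of a db file:
// every top-level prefix under keyPrefix is treated as a package name.
async function listPackages(bucket: string, keyPrefix: string): Promise<string[]> {
  const res = await s3
    .listObjectsV2({ Bucket: bucket, Prefix: keyPrefix, Delimiter: '/' })
    .promise();
  return (res.CommonPrefixes || [])
    .map(p => (p.Prefix || '').slice(keyPrefix.length).replace(/\/$/, ''))
    .filter(name => name.length > 0);
  // Scoped packages would need one extra listObjectsV2 call per '@scope/' prefix.
}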

Solution C: move the package list file to redis or another database. Take redis as an example: it is fast enough that performance is not a worry, and we can store each package as a hash, i.e. pkg:name: {data if any}. add/remove/search can then be refactored into more efficient versions. There is no race condition, since verdaccio instances work on different hashes (i.e. pkg:pkgname-a: {} and pkg:pkgname-b: {}). The problems are:

  • Redis is not a strongly persistent database; you need to configure it carefully to get the durability you want.
  • The storage becomes a mix of two systems, S3 and Redis (or another database). The naming might look a bit odd, e.g. aws-s3-redis-storage.

Thoughts?

@favoyang (Contributor, Author) commented Nov 28, 2019

Updated solution B in #275 and added scoped package support. I implemented solution B first because it is the cleanest option that doesn't change the structure of the verdaccio storage plugin.

Solution C, however, depends on how we mix redis into the system.

  • C-1, the simplest option, is to implement a new aws-s3-redis-storage backend that fuses redis and s3 for cluster use.
  • C-2 is to add an option to store the package list and secret in redis, so it works for other storage backends as well. When redis is enabled, the storage backend is only used to store tarballs and package-info JSON.

@favoyang (Contributor, Author) commented Nov 28, 2019

Out of curiosity, I had a look at the google-cloud storage backend created by @griffithtp. Surprisingly, that implementation should work well in a cluster (though I'm not using google cloud :-( ). It is a good reference for understanding the issue.

In the google-cloud storage backend:

  • It stores package.json and tarballs in google-cloud/storage, an s3-like simple object store.
  • It stores the secret and package presence info (the package list) in google-cloud/datastore, a NoSQL database (think redis/couchdb/mongodb) that handles concurrent IO for you. The underlying data structure is a lot like a redis hash:
VerdaccioDataStore:
  package-a: {name: 'package-a'}
  package-b: {name: 'package-b'}
  • To get the package list, it queries the VerdaccioDataStore kind and collects all key names (a query sketch follows this list).
  • Hence, it never needs to sync a whole package list back to storage.
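For reference, a keys-only query over that kind might look roughly like this (kind name taken from the sketch above; an illustration of the idea, not the verdaccio-google-cloud plugin's actual code):

import { Datastore } from '@google-cloud/datastore';

const datastore = new Datastore();

// The package list is simply the set of keys under one kind, so there is no
// shared db file to read, mutate and write back.
async function getPackageNames(): Promise<string[]> {
  const query = datastore.createQuery('VerdaccioDataStore').select('__key__');
  const [entities] = await datastore.runQuery(query);
  return entities.map(entity => String(entity[datastore.KEY].name));
}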

Based on this, I'm more confident about implementing solution C as option C-1, a new storage backend called aws-s3-redis-storage. Since the s3 backend and the google-cloud backend are probably the only two backends that care about the concurrency issue, and the google-cloud backend already handles it well, option C-2 (making redis available to all other backends) isn't useful.

The C-1 solution will be similar to the google-cloud one:

  • It stores package.json and tarballs in s3, as before.
  • It stores the secret and package presence info (the package list) in redis, along the lines sketched below.
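A sketch of the redis side of C-1, following the pkg:<name> layout from the earlier comment (ioredis is an assumed dependency; key names are illustrative):

import Redis from 'ioredis';

const redis = new Redis(); // a shared redis reachable by every verdaccio instance

// Each package lives under its own key, so two instances publishing different
// packages never overwrite each other's write.
async function addPackage(name: string): Promise<void> {
  await redis.hset(`pkg:${name}`, 'name', name);
}

async function removePackage(name: string): Promise<void> {
  await redis.del(`pkg:${name}`);
}

async function getPackageNames(): Promise<string[]> {
  // SCAN instead of KEYS so large registries don't block redis.
  const names: string[] = [];
  let cursor = '0';
  do {
    const [next, keys] = await redis.scan(cursor, 'MATCH', 'pkg:*', 'COUNT', 100);
    cursor = next;
    names.push(...keys.map(key => key.slice('pkg:'.length)));
  } while (cursor !== '0');
  return names;
}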

@juanpicado (Member)

Hi @favoyang, thanks for this whole thread and sorry for the delay. I'll read all your proposals carefully; I know you are right, and I'll post my thoughts here later on.

@juanpicado (Member)

I see, well, the #275 approach is interesting. When I made the google-cloud plugin I was able to use the datastore to make it scalable. The AWS plugin is a fork of verdaccio-s3-storage, which is in turn a fork of the default local-storage plugin; the only intention was to be able to use an AWS S3 bucket, not to be scalable. If AWS offers something similar to the datastore, we might follow the same approach.

First of all, I'll move this ticket to the monorepo, since verdaccio does not define itself as scalable by default; that depends on which storage you use.

Second, let's give option B a try. I'll run some tests and see what happens; I'm worried about how expensive the s3 ls might be.

Third, I don't have any example setup for practicing clustering with verdaccio. My dream is a nice local setup with Kubernetes so I can scale easily and run more accurate tests.

I'd like to keep the aws-s3 plugin simple, with few dependencies; not everybody needs to cluster multiple instances. But I also understand the plugin does not cover every need.

@juanpicado juanpicado transferred this issue from verdaccio/verdaccio Jan 25, 2020
@juanpicado juanpicado added plugin: aws-s3 enhancement New feature or request labels Jan 25, 2020
@juanpicado juanpicado changed the title Inconsistent package list in cluster deployment aws-s3: Inconsistent package list in cluster deployment Jan 25, 2020
@favoyang (Contributor, Author)

If AWS offers something similar to the datastore, we might follow the same approach.

The AWS equivalent of google-cloud/datastore is DynamoDB, but I don't like that idea, because the plugin is called aws-s3, not aws. For example, I maintain a deployment on DigitalOcean Spaces, an aws-s3-compatible service, and the DigitalOcean platform doesn't have DynamoDB.

I will update solution B in #275 later. It's the simplest way to solve the scalability issue to some degree without adding another dependency.

Second, let's give option B a try. I'll run some tests and see what happens; I'm worried about how expensive the s3 ls might be.

I've been using #275 in a deployment for a while (low traffic so far). That deployment doesn't use scoped packages, so the package-list endpoint performs a single s3 list call (O(1)). It's the most ideal scenario.

This also raises an interesting point about splitting the data store by type of data:

  • tarball files should be stored in a file-like system.
  • package metadata may be stored in a file-like system or a database-like system.

I wouldn't be surprised if, in a real-world cluster, the store ended up being named something like s3-mysql / s3-postgresql / s3-mysql-redis.

@benjamine commented Jun 7, 2020

+1, got bitten by this too (using ECS+S3; we will probably scale down to 1 instance in the meantime).
I think a solution that introduces no extra dependencies and no latency cost is:

  • have each instance poll the modified date of a file (maybe verdaccio-s3-db.json)
  • do this in the background at a certain interval, and flush the cache whenever a new value is seen (a sketch follows this list)
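A sketch of that polling approach (AWS SDK v2; bucket/key names, interval and the flushCache callback are placeholders, not an existing API):

import { S3 } from 'aws-sdk';

const s3 = new S3();
let lastSeen: Date | undefined;

// Poll the db file's LastModified in the background and drop the in-memory
// package list whenever another instance has rewritten the file.
function watchDbFile(bucket: string, key: string, flushCache: () => void, intervalMs = 30_000): void {
  setInterval(async () => {
    try {
      const head = await s3.headObject({ Bucket: bucket, Key: key }).promise();
      if (lastSeen && head.LastModified && head.LastModified > lastSeen) {
        flushCache();
      }
      lastSeen = head.LastModified;
    } catch {
      // ignore transient errors; the next tick will retry
    }
  }, intervalMs);
}

// usage: watchDbFile('openupm', 'verdaccio/verdaccio-s3-db.json', () => flushPackageListCache());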

@lneveu commented Jun 11, 2020

I'm facing the same issues described by @favoyang when trying to deploy multiple verdaccio instances with the aws-s3-storage plugin. I dived into this thread and had a look at your pull request #275, but I'm worried about some of the changes there.

Currently, only packages published to the proxy are saved in the database JSON file. With the changes in your pull request, every package in the storage is considered an "internal" package and is therefore affected by the "add"/"remove"/"search" APIs.
Moreover, with a large storage (especially when caching all packages from uplink sources, as in my case), performance will decrease (maybe that's not a real issue?).

Maybe solution A, the simplest one I think since it doesn't add a database mechanism, should be reconsidered with the addition of a simple locking mechanism?

@favoyang (Contributor, Author)

Currently, only packages published to the proxy are saved in the database JSON file. With the changes in your pull request, every package in the storage is considered an "internal" package and is therefore affected by the "add"/"remove"/"search" APIs.

You're right. There's no way to tell uplink packages apart from internal ones based on the directory names alone.

Moreover, with a large storage (especially when caching all packages from uplink sources, as in my case), performance will decrease (maybe that's not a real issue?).

See https://stackoverflow.com/questions/25061960/s3-performance-for-list-by-prefix-with-millions-of-objects-in-a-single-bucket

Prefix search itself seems fast enough, though you may still need multiple network round-trips because the results are returned in chunks. I don't really worry about it, because it's rare to use verdaccio to serve many thousands of packages; the verdaccio UI would become the bottleneck first.

Maybe solution A, the simplest one I think since it doesn't add a database mechanism, should be reconsidered with the addition of a simple locking mechanism?

Refs barolab/verdaccio-minio#13; a robust solution would be:

  • remove the cache of the package list.
  • add a transaction-style locking mechanism around the actions that modify the package list (publish/unpublish). Basically, only one package can be added or removed at a time: a single, global blocking point.
  • the global blocking point would also avoid the rarer race condition on the package metadata file (package.json). It happens
    • if a user publishes two versions of one package in quick succession: the first published version is lost.
    • if a user publishes a version and then immediately unpublishes another version: the first published version is lost.

To be honest, I'm not interested in implementing the correct-but-unappealing solution A: it works in a cluster, but it isn't designed for the cluster.

@favoyang (Contributor, Author) commented Oct 30, 2020

@benjamine #275 (comment)

This is so close :) — is there anything beyond resolving conflicts that we can help with here? Thanks!

Thanks for taking care of it. If you follow the related issue #317, you'll notice that PR #275 implements solution B, using the s3 list action to construct the package list.

But in #317 (comment), @lneveu found an issue with that solution: it cannot distinguish user-submitted (internal) packages from proxied (cached) packages, because verdaccio by default saves them all into the same directory. I don't use a proxy setup, so I hadn't noticed the issue in my production deployment.

@lneveu also proposed extending solution A with a file-locking mechanism. I think it's the same idea as barolab/verdaccio-minio#13, but I haven't dug into that.

I haven't had enough time to implement a file-locking mechanism. It needs to be carefully designed so that whenever you want to write the package list (the verdaccio db file), e.g. when adding or removing a package, you:

  • wait for the lock file, with a timeout
  • acquire the lock file
  • read the content again so you know it's fresh
  • add/remove the entry
  • write it back
  • release the lock file
  • make sure the lock is removed if anything goes wrong (a rough sketch of this flow follows)
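To make the steps above concrete, here is a rough sketch of the flow; acquireLock/readDbFromS3/writeDbToS3 are hypothetical helpers, and implementing acquireLock atomically on top of plain S3 is exactly the open question:

interface DbFile {
  list: string[];
  secret: string;
}

// Hypothetical primitives; building an atomic acquireLock on plain S3 is the
// unsolved part of this proposal.
declare function acquireLock(key: string, opts: { timeoutMs: number }): Promise<{ release(): Promise<void> }>;
declare function readDbFromS3(): Promise<DbFile>;
declare function writeDbToS3(db: DbFile): Promise<void>;

async function withDbLock(mutate: (db: DbFile) => void | Promise<void>): Promise<void> {
  const lock = await acquireLock('verdaccio/verdaccio-s3-db.lock', { timeoutMs: 10_000 }); // wait + acquire
  try {
    const db = await readDbFromS3(); // re-read inside the lock so the copy is fresh
    await mutate(db);                // add/remove the entry
    await writeDbToS3(db);           // write back while still holding the lock
  } finally {
    await lock.release();            // always release, even if anything above throws
  }
}

// usage: await withDbLock(db => { db.list.push('some-new-package'); });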

Basically, it's a transaction lock. Locking only the write method is useless; you would still hit the race condition (see the discussion of solution A in #317 (comment)). But I'm not sure it's even possible to implement a transaction lock here; all we're given is a sync method:

  private async _sync(): Promise<void> {

Anyway, I prefer to close #275 unless anyone can figure out a way to distinguish internal packages from proxied packages to make it useful.

Or, if anyone familiar with the project can tell whether it's possible to implement a transaction lock in a storage plugin, that would help. I also want to point out that a file lock doesn't make the system scalable, since it keeps the single-point write bottleneck. But it would at least fix the concurrency issue when more than one person publishes at the same time, by letting one of them fail, and that could be good enough for most people who really just want to scale verdaccio for heavy reads.

In the meantime, I've created two other projects. Combined, they let you use s3 for tarball serving and Redis as the database (for the package list, package metadata, and security info). If anyone is looking for a scalable workaround, please check them out.

@benjamine commented Nov 18, 2020

@favoyang thanks for all the context.
So, I see s3 has a native solution for pessimistic locking: https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lock.html
Note that it already supports setting a lock timeout.

IMO an optimistic lock would be a lot better for this use case (try to write; if the version has changed, fail and retry), but that doesn't seem to be supported out of the box, nor do I think it would be simple to implement.

What do you think about using S3's native lock capability? (I'm still reading whether it's a bucket-wide lock or whether it can be applied to a single file.)

@favoyang (Contributor, Author)

I closed #275. Anyone with more energy is welcome to propose an implementation of solution A using a locking mechanism.
