Freeze releases and website changes, pending cache fixes? #1416

Closed
ljharb opened this issue Jul 13, 2023 · 22 comments

Comments

@ljharb
Member

ljharb commented Jul 13, 2023

Every time a commit is pushed to the website, or a release is done, I'm told the Cloudflare cache of nodejs.org/dist is purged. Repopulating the cache causes a lot of server churn, which in turn causes both nodejs.org/dist and iojs.org/dist to break.

During this time, anyone trying to install node may encounter 5xx errors; anyone using nvm to do anything remote may encounter 5xx errors (nvm relies on both index.tab files to list available versions to install); and any CI based on dynamically building a matrix from index.tab is likely to encounter 5xx errors.
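For context on that last point, here's a minimal sketch (not nvm's actual implementation; the index.tab column layout is assumed) of the kind of consumer that breaks the moment the origin starts returning 5xx during a purge:

```ts
// Minimal sketch (not nvm's actual implementation; the index.tab column layout
// is assumed) of a consumer that builds a version list from the dist index.
// When the origin is overloaded during a purge, this is the request that 5xxs.
async function listAvailableVersions(): Promise<string[]> {
  const res = await fetch("https://nodejs.org/dist/index.tab");
  if (!res.ok) {
    // The failure mode described above: 5xx while the cache is repopulating.
    throw new Error(`index.tab unavailable: HTTP ${res.status}`);
  }
  const text = await res.text();
  return text
    .split("\n")
    .slice(1)                            // skip the header row
    .filter((line) => line.trim().length > 0)
    .map((line) => line.split("\t")[0]); // first column is the version, e.g. "v20.4.0"
}

// A CI job building a test matrix, or a remote version listing, boils down to this:
listAvailableVersions().then((versions) => console.log(versions.slice(0, 5)));
```

Every install and every dynamically built CI matrix funnels through a request like this, so a purge window is immediately user-visible.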

I would offer my opinion that "changes to the website" are likely never more important than "people's ability to install node", and "a new release of node" is, modulo security fixes, almost never more important than that ability either.

Fixing the problem requires people with all of access, ability, and time, and one or more of those has been lacking for a while - and to be clear, I'm not complaining about this fact: everyone involved in node is doing their best to volunteer (or wrangle from an employer) what time they can. However, I think it's worth considering ways to avoid breakage until such time as a fix can be implemented.

Additionally, this seems like very critical infrastructure work that perhaps @openjs-foundation could help with - cc @rginn, @bensternthal for thoughts on prioritizing this work (funding and/or person-hours) for DESTF?

I'd love to hear @nodejs/build, @nodejs/releasers, and @nodejs/tsc's thoughts on this.

Related: nodejs/nodejs.org#5302 nodejs/nodejs.org#4495 and many more

@MattIPv4
Member

Major +1 here. The frequent releases and website updates cause a full cache purge in Cloudflare every single time, putting a massive load on the origin that it currently cannot handle and leading to Node.js becoming essentially unavailable to download.

Until time is put into reworking the origin (likely moving it primarily to R2, with a Worker that handles fallback to the origin server) and into reworking how cache purging happens for releases, I would agree that a freeze of releases + website updates makes sense, to ensure the Cloudflare cache is retained and Node.js is actually available for folks to download (there's no point releasing new versions if folks can't download them or read the docs for them).
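For reference, a minimal sketch of what that R2-plus-Worker fallback could look like (the binding name, bucket contents, and origin hostname are hypothetical, not the actual Node.js infra configuration):

```ts
// Rough sketch of the proposed Worker: serve dist assets from an R2 bucket and
// fall back to the existing origin on a miss. Binding name, bucket contents,
// and origin hostname are hypothetical. Types come from @cloudflare/workers-types.
export interface Env {
  DIST_BUCKET: R2Bucket; // R2 binding configured for the Worker
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const key = new URL(request.url).pathname.replace(/^\/+/, "");

    // Try R2 first so the origin server stays out of the hot path entirely.
    const object = await env.DIST_BUCKET.get(key);
    if (object) {
      return new Response(object.body, {
        headers: { etag: object.httpEtag },
      });
    }

    // Anything not yet mirrored to R2 still comes from the current origin.
    return fetch(new Request(`https://origin.example.com/${key}`, request));
  },
};
```

The point of a design like this is that a full cache purge would then mostly cost R2 reads instead of hammering the origin server.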

@ovflowd
Member

ovflowd commented Jul 13, 2023

Major +1 here.

A random unordered mental dump:

  • A lot of threads on Slack regarding these subjects
  • A lot of man-hours (volunteer, non-paid) from the Build WG + some folks (including @MattIPv4 and me)
  • A lot of discussions happening
  • The Website Team receiving most of the complaints, and we're kind of trying to manage things
  • A lot of flakiness and issues boiling up
  • Users being affected from time to time, including other open-source projects and CI systems such as GitHub Actions and more

A few (potential) suggestions:

  • Freeze pushes to nodejs.org during US hours (keep pushes to non-working hours)
    • Enable GitHub merge queues to reduce the number of actual pushes
  • It's ultimately in the OpenJS members' (sponsors') interest for us to have reliable infra
  • We can't keep having our volunteers stepping in that much; it's draining them.
  • Honestly speaking, we should either get the Foundation to hire someone to take care of Build/Infra or have one (or more) member companies hire someone to help us with infra. (Some companies, such as Postman, RedHat, and Cloudflare, have full-time dedicated people for open source; we could get them to open a role to support us. AsyncAPI, for example, is fully backed by Postman.)

@nschonni
Member

Linking to the cache purge tracking issue nodejs/build#3410

@jasnell
Member

jasnell commented Jul 13, 2023

I don't have a lot of context here but it sounds like there's pain and it's great to relieve pain so I'm all for whatever needs to be done here.

@mcollina
Member

There are no good options here. The best outcome would be to have somebody redesign these pipelines to only purge the URLs that are needed, or possibly nothing at all (and use stale-while-revalidate semantics); a rough sketch of per-URL purging is below.

Given that this requires a volunteer to lead that effort or funds, possibly the least bad options would be to:

  1. limit cache invalidations to Node.js releases (not canary or nightly)
  2. release the website only once a week / ideally on weekends

I'm not happy with any of these, but I don't think we can do much better right now.
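To illustrate the "only purge the URLs that are needed" idea, a rough sketch against Cloudflare's purge-by-URL endpoint (zone ID, token handling, and the URL list are placeholders, not the project's actual release tooling):

```ts
// Sketch of "purge only what a release touches" via Cloudflare's purge-by-URL
// endpoint. Zone ID, token handling, and the URL list are placeholders, not the
// project's actual release tooling.
const ZONE_ID = process.env.CF_ZONE_ID!;
const API_TOKEN = process.env.CF_API_TOKEN!;

async function purgeReleaseUrls(version: string): Promise<void> {
  const files = [
    "https://nodejs.org/dist/index.tab",
    "https://nodejs.org/dist/index.json",
    `https://nodejs.org/dist/${version}/`, // just the new listing page; purge-by-prefix needs an Enterprise plan
  ];

  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ files }), // instead of { purge_everything: true }
    }
  );
  if (!res.ok) throw new Error(`Purge failed: HTTP ${res.status}`);
}
```

The stale-while-revalidate part would instead be a response-header change (roughly `Cache-Control: public, max-age=300, stale-while-revalidate=86400`), so the edge keeps serving the previous object while it refetches in the background.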

@MoLow
Member

MoLow commented Jul 14, 2023

I have raised this before at https://github.com/nodejs/build, but since we are discussing hosting static files, is there a reason why we are managing the infrastructure ourselves rather than using a managed service for this (e.g. Amazon S3, Azure Blob Storage, GitHub Pages, Cloudflare Pages, etc.)?
Is there some security or integrity consideration that has led to the decision to host files ourselves?

I think the solutions suggested above are fine for the short term, but if there is no other reason, we should probably consider a managed solution for the long term.

If this makes sense, I am glad to lead such an effort.

@mcollina
Member

@MoLow mostly money. Node.js infrastructure consumes very little of the Foundation's money.

Moreover, all of this was put in place a long time ago and there were fewer options at the time.

@targos
Member

targos commented Jul 14, 2023

I think it's also related to the fact that Node.js was downloaded a lot less when this was put into place many years ago.

@ovflowd
Member

ovflowd commented Jul 14, 2023

Hey @mcollina, just to mention that your proposed solutions will not solve the situation (I guess that's why you called them bad options?) but only reduce the problems. From what you've explained, they would already reduce them somewhat significantly, but I'd say it's just a patch, because the moment we do a cache invalidation the issue happens again; our servers are simply unable to handle it.

@ovflowd
Member

ovflowd commented Jul 14, 2023

I have raised this before at nodejs/build, but since we are discussing hosting static files, is there a reason why we are managing the infrastructure ourselves rather than using a managed service for this (e.g. Amazon S3, Azure Blob Storage, GitHub Pages, Cloudflare Pages, etc.)?

We are having talks about adopting Cloudflare R2; they offered us the R2 service (similar to AWS S3) for free, covering all the traffic and needs we have. It is a path we're exploring!

@ovflowd
Member

ovflowd commented Jul 14, 2023

I think the solutions suggested above are fine for the short term, but if there is no other reason we should probably consider a managed solution for the long term.

A managed solution still requires someone to "manage" it, or at least maintain it. In the case of R2, we need to write Cloudflare Workers and do a lot of initial configuration just to mirror our current www-standalone server to R2 (at least the binaries and assets).
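As a purely hypothetical sketch of that mirroring step, using R2's S3-compatible API (bucket name, paths, and credential handling are placeholders, not the Build WG's actual tooling):

```ts
// Hypothetical sketch of the mirroring step, pushing an existing release asset
// into R2 via its S3-compatible API. Bucket name, paths, and credential
// handling are placeholders, not the Build WG's actual tooling.
import { readFile } from "node:fs/promises";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const r2 = new S3Client({
  region: "auto",
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

// Store the artifact under the same path it is served from, so a Worker can
// look it up by request pathname (e.g. "dist/v20.4.0/node-v20.4.0.tar.gz").
async function mirrorAsset(localPath: string, key: string): Promise<void> {
  await r2.send(
    new PutObjectCommand({
      Bucket: "nodejs-dist-mirror", // placeholder bucket name
      Key: key,
      Body: await readFile(localPath),
    })
  );
}
```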

FYI, a lot of discussion is happening in #nodejs-build; our issue right now is definitely not a lack of good plans/ideas, but the lack of someone able to execute them.

@ovflowd
Member

ovflowd commented Jul 14, 2023

If this makes sense, I am glad to lead such an effort.

I think it's better to let the people in the Node.js Build WG who understand the situation completely lead this initiative technically. What we need is an ack from the TSC about this issue and confirmation that we're able to dedicate resources to it.

Not to mention, what @ljharb suggested would already be a temporary "workaround" to improve the user experience by reducing website builds and release "promotions". We still need someone (or a bunch of people) to carry out the long-term plan...

@mhdawson
Member

I'd be OK with not invalidating for nightly and canary releases, or possibly doing those invalidations less often. For the others, I don't think releases happen often enough that we should slow down releases of the Current and LTS lines.

@mhdawson
Member

As @ovflowd mentions, the key question is what we do in the mid to long term in terms of "We still need someone (or a bunch of people) to carry out the long-term plan...".

Sounds like @MoLow, who is a member of the Build WG, has offered to lead work on the mid-to-longer-term plan in #1416 (comment), and I think it would be great to start working on that.

I also think that, in terms of keeping things up and running even after we have a new/better infrastructure, we need people who can drop everything else when needed to address problems with the downloads, OR we need to set the expectation that it's best effort and there is no SLA: the downloads may not be available at any point in time, and people should plan for that. On this front I've asked for help from the Foundation in the past on the build side, presented to the board, worked with Foundation staff on summaries of work, etc., but unfortunately that did not result in resources to let us be more proactive. It may be a different time, and/or the situation may be more urgent now, so looking at that again might make sense.

@ovflowd
Member

ovflowd commented Jul 14, 2023

Sounds like @MoLow, who is a member of the Build WG, has offered to lead work on the mid-to-longer-term plan in #1416 (comment), and I think it would be great to start working on that.

I completely forgot @MoLow was on the Build team; +1 for him to lead the initiative!

@danielleadams
Member

Thanks for bringing this up. Currently, there is an LTS release in flight that I'd like to get out because it has a lot of anticipated changes (nodejs/node#48694). I had planned to get it out around 1:00 UTC to accommodate a "low activity" time, but that doesn't look like it's going to happen.

Instead, I'm just going to get this release out as soon as possible (hopefully in the next 12 hours), and then in the next release meeting we can discuss optimal time frames for promoting builds.

@ovflowd
Member

ovflowd commented Jul 17, 2023

Thanks @danielleadams. I'll be monitoring our infra and will let you know if anything weird happens 👀

@richardlau
Member

In terms of actual releases we're not doing them that often (for example, the last non-security 18.x release prior to the one @danielleadams is working on was back in April) -- I don't think freezing releases would actually gain much. The last actual release, for example, was 20.4.0 on 5 July and we've had plenty of issues since then without a new release being put out.

We are purging the Cloudflare cache perhaps three or more times a day for the nightly and v8-canary builds -- as far as the current tooling/scripts are concerned, there is no difference in how those are treated vs releases (so it's one thing to say that maybe they should not be, but another to do the remedial work). And while frequent cache purges are certainly not helping the situation, I'm not convinced that the problem is entirely related to the Cloudflare cache.

@MattIPv4
Member

I think perhaps the wording here, regarding freezing of releases, was intended to also capture the release of nightly/canary builds, as those also cause cache purges.

While I agree that cache purging itself is probably not the core issue here (the origin just seems to be rather unhappy), avoiding purging the cache many times a day is definitely going to massively improve the situation, as Cloudflare will be able to actually serve stuff from its cache rather than it being repeatedly wiped and forcing traffic to be served from the struggling origin.

@ovflowd
Member

ovflowd commented Jul 18, 2023

I think perhaps the wording here, regarding freezing of releases, was intended to also capture the release of nightly/canary builds, as those also cause cache purges.

☝️ exactly this!

While I agree that cache purging itself is probably not the core issue here (the origin just seems to be rather unhappy), avoiding purging the cache many times a day is definitely going to massively improve the situation,

Same here. If we can avoid purging caches for nightly/canary releases, as that might not be all that necessary, that'd be great!

@mhdawson
Member

I think we chose not to freeze releases or changes to the website, so this can be closed. Unless there are objections to closing this in the next few days, I'll go ahead and do that.

@MattIPv4
Member

+1. The website is no longer served out of NGINX and releases are now served from R2, AFAIK, so I think this is no longer a problem.
