Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RRFC] Avoid distributing Readme.md when is not needed #744

Open
H4ad opened this issue Nov 1, 2023 · 15 comments
Open

[RRFC] Avoid distributing Readme.md when is not needed #744

H4ad opened this issue Nov 1, 2023 · 15 comments

Comments

@H4ad
Copy link

H4ad commented Nov 1, 2023

Motivation ("The Why")

I found two packages that ship Readme.md, chalk and commander.js.

If we stop shipping Readme.md, we can save 631GB and 1.3TB per week of bandwidth for these two packages.

If we consider all the packages in the NPM ecosystem, I think the bandwidth that can be saved is huge.

Example

I created two PRs to propose a lighter version instead of shipping the heaviest Readme.md:

How

But I think instead of trying to fix each package, maybe we can change the behavior of NPM itself.

Current Behaviour

Today, we always include README.md.

Desired Behaviour

Not have to the README.md on the bundle.

Possible solutions

We can remove Readme.md when the repository URL is defined and if Readme.md is not included in the files.
To present the content of the NPM page, when it can be fetched from the repository URL.

Only when the user explicitly defines Readme.md in files or when the repository URL is not defined then we will keep Readme.md.

All this behavior must happen in the prepack step because we should not modify the package after it is generated, as there are many packages that send not only the .tgz but also the hash to validate the integrity, so there is no way for NPM modify package contents after publish.

@wraithgar
Copy link
Member

What would we display on npmjs.org for a package?

@H4ad
Copy link
Author

H4ad commented Nov 1, 2023

The NPM can go to https://github.com/chalk/chalk#readme and fetch the readme directly when the repository URL is a host that NPM trust, like Github for example.

But this is just an initial idea, I want to raise this discussion so we could see if there is a good solution for this problem, or see if this is not a problem at all.

In my perspective, the developers don't need to have the README.md on the bundle, for other developers it could be a feature.

If the only reason of README.md is bundled is just to show a nice page for the package, so maybe we could introduce another API (or behavior) that provides the same feature without increasing the bundle size.

@jakebailey
Copy link

Without some sort of metadata that allows a package to be traced exactly back to what generated it, I don't know how the website would be able to fetch the right version. Other websites that use the registry (like jsdelivr, yarnpkg, etc) all use the README info within the package to work, and would too need to fetch this information on the fly and cache it themselves. (Of course, that level of tracing wound be nice, and some tooling adds a gitHead prop, but that's not universal and may not be accurate when produced automatically depending on the situation.)

I personally would like to see some sort of info that says that the bandwidth used by README files is significant enough to warrant the change; it would seem to me that READMEs are small fish compared to the traffic generated by large packages or packages with large dependency graphs.

@wesleytodd
Copy link

Honestly long run I would prefer we move to the readme being updatable separate from the tarball upload (basically what @H4ad said in the follow up comment). It would be a larger long term change for the registry, so not sure if it is a topic for an RFC or just something to ask from npm folks. I believe we had talked about that idea before as well but I don't immediately see an issue/discussion to that point.

@wraithgar
Copy link
Member

The readme in the tarball is not a significant source of extra bandwidth to be honest.

The readme in the packument is. That's a separate issue and we have internal discussions about how to stop needing to send that / store it.

@wesleytodd
Copy link

wesleytodd commented Nov 1, 2023

I did not mean that it being in the tarball was a bandwidth issue. I meant more that we should decouple the two things. Readme in the tarball or not, we should enable the readme on the website to be updated without publishing a new version. That would then enable projects to decide if bundling the readme made sense for their project without npm doing any change.

@H4ad
Copy link
Author

H4ad commented Nov 1, 2023

@jakebailey If you know a way to get the top downloaded packages, I can try to get this statistic.

But by looking just for two packages that we know are heavily downloaded (chalk and commander), it can save ~7,7TB/month, so I think the number is considerable, but the magnitude of significance can be different from person to person, for me TB is something to worry, for NPM could not.

The readme in the tarball is not a significant source of extra bandwidth, to be honest.

I disagree in parts, for commander.js the README alone is 12,7kB of 45.8kB (gzip), but I can be biased so I think we should get more data on this.

@ljharb
Copy link
Contributor

ljharb commented Nov 1, 2023

If the readme is 12.7k compressed that suggests a way-too-large readme that could be split into separate files, rather than “readmes shouldn’t be in the tarball”.

@wesleytodd
Copy link

wesleytodd commented Nov 1, 2023

Honestly I like having docs in readme's for small enough scoped libraries. So I am not a huge fan of that proposed solution as a reason not to try and make things better. Although I do agree that is the right approach for now.

If we can get movement on the registry api then I think it is reasonable to ask for an endpoint just for posting a new readme as a metata bit attached to the package not from the tarball itself. This solves more than one problem, it allows for readme updates without a publish (a long time ask) and also enables reducing the tarball size if packages do have large docs written in the readme.

@mcollina
Copy link

mcollina commented Nov 2, 2023

The solution to solve most of the module size issues is to have two published bundles:

  1. source bundle, with all the docs, tests, source files, build scripts etc.
  2. binary bundle, with just the (built) js and only the assets needed for production

With a command, npm could “augment” the given module with the missing files (in case somebody needs them locally). Users that needs offline access could also turn it on by default.

(A popular argument is that github releases provides that source bundle. This is unfortunately not correct, because a published module cannot be deleted, while a github repo can).

@wesleytodd
Copy link

Yeah, we want distributions! This would help so much with other issues as well. That would be a great solution to this problem.

@H4ad
Copy link
Author

H4ad commented Nov 2, 2023

Maybe we can integrate with #519 and add a new type of distribution that is only used to address the source bundle, and this kind of distribution can be used to render the content on NPM page.

@wesleytodd
Copy link

That sounds like a great idea!

@ljharb
Copy link
Contributor

ljharb commented Nov 2, 2023

That’s basically what I’ve been asking npm for for the better part of a decade :-) i would happily remove tests and a bunch of other dev-time files from the “runtime” package if i could set an npm config value and always have the “full’ package locally installed (for 450+ packages).

@mcollina
Copy link

mcollina commented Nov 2, 2023

same here @ljharb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants