feat: Initial RFC for sharding #112

balavishnuvj · 2023-06-01T04:58:29Z

Summary

This RFC proposes adding sharding support to the ESLint package. Sharding is a technique used to divide a large workload into smaller, more manageable parts that can be processed concurrently. By introducing sharding in ESLint, we aim to improve linting performance for large codebases and reduce the overall execution time of linting tasks.

Related Issues

eslint/eslint#16726

eslint-github-bot · 2023-06-01T04:58:34Z

Hi @balavishnuvj!, thanks for the Pull Request

The first commit message isn't properly formatted. We ask that you update the message to match this format, as we use it to generate changelogs and automate releases.

The commit message tag wasn't recognized. Did you mean "docs", "fix", or "feat"?
There should be a space following the initial tag and colon, for example 'feat: Message'.
The first letter of the tag should be in lowercase

To Fix: You can fix this problem by running git commit --amend, editing your commit message, and then running git push -f to update this pull request.

Read more about contributing to ESLint here

linux-foundation-easycla · 2023-06-01T04:58:34Z

✅ login: balavishnuvj / name: Balavishnu V J (fcb2ab0)
❌ The email address for the commit (c7468f7) is not linked to the GitHub account, preventing the EasyCLA check. Consult this Help Article and GitHub Help to resolve. (To view the commit's email address, add .patch at the end of this PR page's URL.) For further assistance with EasyCLA, please submit a support request ticket.

eslint-github-bot · 2023-06-01T04:59:08Z

Hi @balavishnuvj!, thanks for the Pull Request

The first commit message isn't properly formatted. We ask that you update the message to match this format, as we use it to generate changelogs and automate releases.

The commit message tag wasn't recognized. Did you mean "docs", "fix", or "feat"?
There should be a space following the initial tag and colon, for example 'feat: Message'.
The first letter of the tag should be in lowercase

To Fix: You can fix this problem by running git commit --amend, editing your commit message, and then running git push -f to update this pull request.

Read more about contributing to ESLint here

bradzacher · 2023-06-01T08:52:39Z

One important thing to note is that if we rely solely on ESLint internals for sharding then we will very easily create situations where type-aware linting will be a lot more expensive than it should have been.

For example if a shard contains 1 file from a project, then type-aware linting would mor eor less need to redo all of the type work performed in another shard.

This behaviour would not be obvious to end users (as sharding would likely be an opaque action) which would lead to a lot of wasted time and resources.

Ideally some sharding solution should be able to ask the language provider how to best shard the codebase so that correct optimisations can be made.

For example in TS projects it's logical to shard a lint run by project so that the files contained in a tsconfig is entirely contained within one shard.

This is something we've discussed before in the context of language plugins as a whole in the ecosystem.

Note that this situation doesn't just apply to typescript-eslint, but also to eslint-plugin-import - which will do parsing out-of-band and cache that information for the lint run. So sharding the middle of a dependency graph could lead to wasted time and work.

voxpelli · 2023-06-01T12:13:21Z

@bradzacher Adding an option for a plug-in and/or plug-in rule to mark itself as undesirable / unsupported in sharding could help mitigate the effect of that in the short term, until a solution that enable them to also be shared gets added?

bradzacher · 2023-06-01T12:37:08Z

typescript-eslint and eslint-plugin-import are used by something like 70% of the eslint community - which really, really limits the impact of a feature that lands without design consideration for those parts of the ecosystem.

I personally don't think this is something that should ship into ESLint unless there is a solution for those usecases given how big of a part of the ecosystem that they are.

The reason being that failure to plan for this now could mean we design ourselves into a corner and creat a feature that's too hard to retrofit to support them without major changes.

I'm not saying that v1 of this feature needs to explicitly support these cases - but IMO this RFC should at least show serious consideration into how they might be supported in order to ensure that a v2+ can be built without major breaking changes.

I'm also not a TSC member, I'm just a maintainer of typescript-eslint - so what I say is just what I'd love to see, nothing more.

voxpelli · 2023-06-01T12:49:11Z

@bradzacher Though not all of eg typescript-eslint requires it to read a full project and it already warns of the performance impact of enabling such rules?

By including parserOptions.project in your config, you incur the performance penalty of asking TypeScript to do a build of your project before ESLint can do its linting.

And its recommended config does not include any such rules?

On the main topic – I'm not sure whether sharding by file or by rules is the proper way to go to speed things up.

As @bradzacher points out, some rules are easier to run in the same process and have it share the same caches. For them it would make more sense to have them have a separate process to other more light weight rules and thus enable them to be processed in parallel?

I believe that was brought up as an alternative in #42, which essentially suggested same thing thing as this RFC and which resulted in eg. eslint/eslint#13525 and eslint/eslint#14139 before being closed.

bradzacher · 2023-06-01T13:36:07Z

You're correct - we don't recommend type-aware rules by default because there is performance impact to using type information and for many users their codebase is large enough that this impact is too great for them.

But we do recommend people use type-aware rules where possible because they are so powerful and cover so much more. Which is why we also provide a separate config, recommended-requiring-type-checking, which is our recommended set for those that want type-aware linting.

Not all of the typescript-eslint community uses type-aware linting, for sure, just like not all of the eslint-plugin-import community uses the rules that parse dependencies - it's sadly impossible to get exact numbers on those subsets.

Sharding by rule is dangerous, in a way, because there are rules that depend on other rules to function.

For example in react codebases the core no-unused-vars lint rule will not work correctly without react/jsx-uses-vars and react/jsx-uses-react from eslint-plugin-react. Those rules call context.markVariableAsUsed which in turn impacts what no-unused-vars reports!

Sharding by rule would need to take these implicit dependencies into account.

Additionally sharding by rule doesn't solve the duplicate work problem. For example typescript-eslint type information is calculated at parse-time, not at rule time and is thus shared between rules. Similarly eslint-plugin-import dependency data is calculated at lint time, but is cached and shared between rules. So in both cases sharding by rules is explicitly worse than sharding by files!

It is possible that you could have a plugin declare "expensive" rules or "related" rules to have them sharded together, though in future we are considering building "optionally type-aware rules" for which the user can opt-in to stricter behaviour for specific rules at config time.

Which is why I say that sharding by file, but with information from the "language plugin" is the best way to think about this, IMO. ESLint can supply a default implementation for the sharding algorithm (eg for the "JS language") which boils down to "split into N shards naively using a deterministic algorithm", and typescript-eslint could specify a new implementation which shards around projects instead. If they wanted, eslint-plugin-import could provide an override implementation which is dependency-tree aware.

nzakas · 2023-06-01T17:57:13Z

@balavishnuvj thanks for taking the time to put this together, it's a good explainer of why sharding could be helpful but it doesn't explain how the sharding would be implemented. The purpose of the RFC is to put together a tech spec -- you should be investigating how ESLint works and how to implement sharding, not just presenting arguments as to why sharding is a good idea. Without the implementation details, we just don't have enough information to provide any real feedback.

This is something we'd like to investigate, but unless someone can suggest a possible implementation, we really can't go any further.

jfmengels · 2023-06-01T20:22:55Z

designs/2023-sharding/README.md

+
+Q: Will enabling sharding affect the linting results or introduce any inconsistencies?
+
+A: Enabling sharding should not affect the linting results or introduce any inconsistencies. Sharding is designed to distribute the workload and execute the linting process independently on each shard. The aggregated results should provide a unified view of the linting issues across all shards, ensuring consistent analysis.


I think you're right that this won't affect the current analysis. That said, it can limit future kinds of analysis like multi-file linting (which I've written about here, specifically this section), where sharding could make this impossible or at best become a hindrance that would require new solutions (like opting out some rules out of the sharding strategy).
You could end up with different results when turning on/off sharding, which would be pretty bad.
That said, I don't remember whether ESLint aims to implement this one day, but it's like eslint-plugin-import does in a round-about way.

balavishnuvj · 2023-06-02T06:50:56Z

@nzakas thanks. I have shared this to get an initial idea about the feature and to verify if I have missed any part. Will add implementation details over the next week.

balavishnuvj · 2023-06-04T16:41:56Z

have added some implementation detail. Can create a draft PR to help with the implementation details.

designs/2023-sharding/README.md

bmish · 2023-06-07T23:16:36Z

designs/2023-sharding/README.md

+
+- **Resource Consumption**: Sharding involves launching multiple processes/threads to handle each shard concurrently. This can result in increased resource consumption, such as CPU and memory usage, particularly when dealing with a large number of shards or codebases.
+
+- **Potential Overhead**: The overhead of shard coordination and result aggregation may impact the overall performance gain achieved through parallel execution. Careful optimization and benchmarking should be performed to ensure that the benefits of sharding outweigh the associated overhead.


Overhead is definitely a concern. As @bradzacher calls out, there could be situations where sharding unexpectedly reduces performance. And given overhead costs, sharding might only be worthwhile in codebases of a large-enough size. How would the user reasonably know when it makes sense to enable sharding? Can/should users realistically be expected to perform "careful optimization and benchmarking" on their own to decide when to use the feature?

The significant overhead would be to figure out the sharding buckets. But for large enough projects, it would be negligible.

How would the user reasonably know when it makes sense to enable sharding? Can/should users realistically be expected to perform "careful optimization and benchmarking" on their own to decide when to use the feature?

Sharding is an optimization feature, and it is essential for users to exercise diligence when deciding to enable it. While we can provide guidance and recommendations, it is ultimately up to the users to assess their specific use cases and evaluate whether sharding is necessary and beneficial for their codebase.

Users should consider a few factors like Codebase Size, Performance Issues and Resource Availability.

The guidelines and recommendations we provide should include example scenarios of when it makes sense to use sharding, when it doesn't make sense, and what kind of performance improvements can be expected. For example, we could include results from real-world codebases like "a codebase of 100k lines across 1k JS files saw a 20% speedup in lint running time from 5 min to 4 min when using sharding" (completely made-up example).

In fact, I think the RFC should test and report the real-world performance impacts (likely on some popular, large, open source repos), in order to justify implementing the sharding feature in the first place.

I don't think we need to be too prescriptive about this. In reality, people will likely try it out and determine if there's any benefit to their current process rather than going by any guidelines we publish. As with most of ESLint, I think we can provide the capability and let people try it, use it, or discard it depending on their individual results.

In fact, I think the RFC should test and report the real-world performance impacts (likely on some popular, large, open source repos), in order to justify implementing the sharding feature in the first place.

I disagree. Let's not move the goalposts here. We should be able to come up with scenarios where we reasonably expect sharding to improve performance and also scenarios where we believe it will decrease performance. Building out a full prototype and running it on random projects doesn't give us useful information as most tools would need to be tuned over time to make this work in individual projects.

Also, please see my earlier comment about sharding vs. parallel linting. It seems like we are combining those in this RFC, but that's not what the original issue was about.

bmish · 2023-06-07T23:31:35Z

designs/2023-sharding/README.md

+  eslint --shard=1/3
+  ```
+
+- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy.


Earlier, @bradzacher mentioned:

Sharding by rule is dangerous, in a way, because there are rules that depend on other rules to function.

For example in react codebases the core no-unused-vars lint rule will not work correctly without react/jsx-uses-vars and react/jsx-uses-react from eslint-plugin-react. Those rules call context.markVariableAsUsed which in turn impacts what no-unused-vars reports!

Sharding by rule would need to take these implicit dependencies into account.

To touch on this a bit more, there are various ways rules could be using shared / global state, as discussed in this thread. This behavior might not be encouraged or officially supported, but just be aware that sharding even a single rule could break things if the rule gathers up state while running across files. It would be good to call this possibility out, whether rules are allowed to do this generally or only through a supported API like markVariableAsUsed, and how sharding could impact such rules.

It would be helpful to provide some details about what other tools do that have sharding. How does Jest decide which files to use in a shard, for example?

Co-authored-by: Bryan Mishkin <698306+bmish@users.noreply.github.com>

nzakas · 2023-06-14T15:56:27Z

One thing I want to highlight for everyone: I don't think we should be dissuaded from working through this RFC because there are complexities involved. As with many ESLint features, people ultimately use the ones that work best for them and discard the ones that don't. Even the note about plugins potentially wanting to indicate how to shard is a bit of a distraction at this point. Once that logic is codified, it can always move from the core out into a plugin for modification. What's more interesting at this moment is how sharding as a concept applies to ESLint (or doesn't).

nzakas

This is a good first step regarding implementation. I think it would be helpful to separate out sharding from parallel linting and also to reference how other tools are already using sharding. That would help ground this RFC and help us better evaluate it.

nzakas · 2023-06-14T16:54:02Z

designs/2023-sharding/README.md

+
+The implementation of sharding in the ESLint package involves the following steps:
+
+- **Configuration**: Introduce a new configuration option or flag in the ESLint configuration file to enable sharding. This option can specify the number of shards or the desired granularity of the shards.


This will have to be done at the CLI level. It's not possible to have it in the config file.

nzakas · 2023-06-14T16:54:41Z

designs/2023-sharding/README.md

+  eslint --shard=1/3
+  ```
+
+- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy.


It would be helpful to provide some details about what other tools do that have sharding. How does Jest decide which files to use in a shard, for example?

nzakas · 2023-06-14T16:56:33Z

designs/2023-sharding/README.md

+
+- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy.
+
+- **Parallel Execution**: Launch multiple processes or threads to handle each shard concurrently. Each process/thread should execute the linting process independently on its assigned shard.


The original issue made it a point to separate sharding from parallel linting.

For this RFC, we should focus on sharding, which involves running ESLint on a subset of the found files rather than the entire set.

nzakas · 2023-06-14T16:58:27Z

designs/2023-sharding/README.md

+
+- **Resource Consumption**: Sharding involves launching multiple processes/threads to handle each shard concurrently. This can result in increased resource consumption, such as CPU and memory usage, particularly when dealing with a large number of shards or codebases.
+
+- **Potential Overhead**: The overhead of shard coordination and result aggregation may impact the overall performance gain achieved through parallel execution. Careful optimization and benchmarking should be performed to ensure that the benefits of sharding outweigh the associated overhead.


Also, please see my earlier comment about sharding vs. parallel linting. It seems like we are combining those in this RFC, but that's not what the original issue was about.

nzakas · 2023-06-14T17:00:55Z

designs/2023-sharding/README.md

+## Detailed Design
+
+1. Get the files in a deterministic order.
+   in `lib/cli.js`, we can get all the files. Update the `executeOnFiles` function


I think you mean cli-engine.js? If so, this file is deprecated and will be removed, so changes need to be made in flat-eslint.js instead.

nzakas · 2023-06-14T17:01:24Z

designs/2023-sharding/README.md

+   }
+   ```
+
+2. Divide the files into shards. The number of shards is determined by the `--shard` option.


Again, it would be helpful to point to what other tools do to determine which files are in a shard.

nzakas · 2023-10-09T15:24:42Z

@balavishnuvj are you going to follow up on this?

nzakas · 2023-11-13T16:00:26Z

No response, so closing.

balavishnuvj changed the title ~~Initial RFC for sharding~~ feat: Initial RFC for sharding Jun 1, 2023

feat: RFC for sharding

fcb2ab0

balavishnuvj force-pushed the rfc-sharding branch from 559fd24 to fcb2ab0 Compare June 1, 2023 05:06

eslint-github-bot bot added the feature label Jun 1, 2023

nzakas added the Initial Commenting This RFC is in the initial feedback stage label Jun 1, 2023

nzakas mentioned this pull request Jun 1, 2023

Change Request: add ability to shard eslint eslint/eslint#16726

Closed

1 task

jfmengels reviewed Jun 1, 2023

View reviewed changes

adds implementation details

87a4879

balavishnuvj force-pushed the rfc-sharding branch from c7468f7 to 87a4879 Compare June 4, 2023 16:44

bmish reviewed Jun 7, 2023

View reviewed changes

balavishnuvj and others added 2 commits June 9, 2023 15:13

Update designs/2023-sharding/README.md

5e3c99b

Co-authored-by: Bryan Mishkin <698306+bmish@users.noreply.github.com>

Update designs/2023-sharding/README.md

2f6c897

Co-authored-by: Bryan Mishkin <698306+bmish@users.noreply.github.com>

nzakas requested changes Jun 14, 2023

View reviewed changes

nzakas closed this Nov 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Initial RFC for sharding #112

feat: Initial RFC for sharding #112

balavishnuvj commented Jun 1, 2023

eslint-github-bot bot commented Jun 1, 2023

linux-foundation-easycla bot commented Jun 1, 2023 •

edited

eslint-github-bot bot commented Jun 1, 2023

bradzacher commented Jun 1, 2023 •

edited

voxpelli commented Jun 1, 2023

bradzacher commented Jun 1, 2023 •

edited

voxpelli commented Jun 1, 2023

bradzacher commented Jun 1, 2023

nzakas commented Jun 1, 2023

jfmengels Jun 1, 2023

balavishnuvj commented Jun 2, 2023

balavishnuvj commented Jun 4, 2023

bmish Jun 7, 2023

balavishnuvj Jun 9, 2023

bmish Jun 9, 2023

nzakas Jun 14, 2023

nzakas Jun 14, 2023

bmish Jun 7, 2023

nzakas Jun 14, 2023

nzakas commented Jun 14, 2023

nzakas left a comment

nzakas Jun 14, 2023

nzakas Jun 14, 2023

nzakas Jun 14, 2023

nzakas Jun 14, 2023

nzakas Jun 14, 2023

nzakas Jun 14, 2023

nzakas commented Oct 9, 2023

nzakas commented Nov 13, 2023


		Q: Will enabling sharding affect the linting results or introduce any inconsistencies?

		A: Enabling sharding should not affect the linting results or introduce any inconsistencies. Sharding is designed to distribute the workload and execute the linting process independently on each shard. The aggregated results should provide a unified view of the linting issues across all shards, ensuring consistent analysis.


		- Resource Consumption: Sharding involves launching multiple processes/threads to handle each shard concurrently. This can result in increased resource consumption, such as CPU and memory usage, particularly when dealing with a large number of shards or codebases.

		- Potential Overhead: The overhead of shard coordination and result aggregation may impact the overall performance gain achieved through parallel execution. Careful optimization and benchmarking should be performed to ensure that the benefits of sharding outweigh the associated overhead.


		The implementation of sharding in the ESLint package involves the following steps:

		- Configuration: Introduce a new configuration option or flag in the ESLint configuration file to enable sharding. This option can specify the number of shards or the desired granularity of the shards.


		- Workload Division: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy.

		- Parallel Execution: Launch multiple processes or threads to handle each shard concurrently. Each process/thread should execute the linting process independently on its assigned shard.

feat: Initial RFC for sharding #112

feat: Initial RFC for sharding #112

Conversation

balavishnuvj commented Jun 1, 2023

Summary

Related Issues

eslint-github-bot bot commented Jun 1, 2023

linux-foundation-easycla bot commented Jun 1, 2023 • edited

eslint-github-bot bot commented Jun 1, 2023

bradzacher commented Jun 1, 2023 • edited

voxpelli commented Jun 1, 2023

bradzacher commented Jun 1, 2023 • edited

voxpelli commented Jun 1, 2023

bradzacher commented Jun 1, 2023

nzakas commented Jun 1, 2023

Choose a reason for hiding this comment

balavishnuvj commented Jun 2, 2023

balavishnuvj commented Jun 4, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nzakas commented Jun 14, 2023

nzakas left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nzakas commented Oct 9, 2023

nzakas commented Nov 13, 2023

linux-foundation-easycla bot commented Jun 1, 2023 •

edited

bradzacher commented Jun 1, 2023 •

edited

bradzacher commented Jun 1, 2023 •

edited