Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Perf improvement #68

Closed
wants to merge 7 commits into from
Closed

[WIP] Perf improvement #68

wants to merge 7 commits into from

Conversation

pvdlg
Copy link
Contributor

@pvdlg pvdlg commented Jan 31, 2018

See benchmark results

And a recap of the gains:

Test Current New Improvement
negative globs (some files inside dir) - sync 41 op/s 247 op/s 6x
negative globs (some files inside dir) - async 38 op/s 107 op/s 2.8x
negative globs (some files inside dir) - gitignore - sync 11 op/s 82 op/s 7.4x
negative globs (some files inside dir) - gitignore - async 12 op/s 43 op/s 3.6x
negative globs (whole dir) - sync 37 op/s 127 op/s 3.4x
negative globs (whole dir) - async 33 op/s 64 op/s 1.9x
negative globs (whole dir) - gitignore - sync 11 op/s 62 op/s 5.6x
negative globs (whole dir) - gitignore - async 11 op/s 32 op/s 2.9x
multiple positive globs - sync 18 op/s 114 op/s 6.3x
multiple positive globs - async 17 op/s 61 op/s 3.6x
multiple positive globs - gitignore - sync 8 op/s 61 op/s 7.6x
multiple positive globs - gitignore - async 8 op/s 19 op/s 2.4x

Performance improvements

Use fast-glob instead of node-glob.

Handle .gitignore differently: Parse the .gitignore files and transform the content into globs, then revert them and add them to the patterns.
This allow to glob and read .gitignore files only once per call (while it was done once per task before). In addition, it avoid to retrieve files that would be excluded after by the gitignoreFilter. This allow to take advantage of fast-glob optimizations.
That should help (or probably fix) xojs/xo#65

Micro optimizations in the code (less branches, avoid creating unnecessary functions etc...).

Code changes

nodir option is replaced by onlyFiles (as it's the name used by fast-glob). We could map nodir to onlyFiles to avoid a breaking change. Not sure if it's useful though, as some node-glob options are not available yet in fast-glob so it would be a breaking anyway.

When using expandDir, directories directly under cwd are not included. See mrmlnc/fast-glob#47.
glob('tmp/**') => ['tmp', 'tmp/a.tmp', 'tmp/b.tmp', 'tmp/c.tmp', 'tmp/d.tmp', 'tmp/e.tmp']
fast-glob('tmp/**') => ['tmp/a.tmp', 'tmp/b.tmp', 'tmp/c.tmp', 'tmp/d.tmp', 'tmp/e.tmp']
Not sure which one is the expected behavior though.

I implemented a workaround for mrmlnc/fast-glob#45 to avoid trailing slashes for directories. Fixed in fast-glob@2.02.

I added a default list of ignore directory when globbing the .gitignore files, to workaround mrmlnc/fast-glob#42. Even when the bug is fixed its probably to a good idea to keep the workaround as there is no reason to look for a .gitignore file in node_modules, bower_components etc...
mrmlnc/fast-glob#42 can also appear for users with directory with a large amount of files. We should probably wait for it to be fixed.
Fixed in fast-glob2.0.2.

I dropped the globby.gitignore.sync([options]) for now. My understanding is that was used for checking is a file is ignored in a .gitignored. With the new implementation it would add additional code more or less disconnected to the main features. We could recommend to do globby(['file_to_check'], {gitignore: true}) instead.
If necessary I can add the functionality again.

Benchmark changes

I added a benchmark with .gitignore. The comparison is relevant only between globby (working directory) and globby (upstream/master) as other tools do not support reading the exclusions from a .gitignore file.

I changed the negative globs (whole dir) bench to use a more realistic pattern. The patter ['a/*', '!a/**'] doesn't make much sense as the 2 pattern just contradict each others. It also happens to be a very specific edge case that only node-glob optimize very well. See mrmlnc/fast-glob#45

I replaced it with ['b/*', 'a/*', '!a/**'] which is a more realistic (less edgy) use case and give a more valuable information regarding the performance of each solution.

TODO

  • Clean up the code, maybe split in multiple files (both the test and index files)
  • Add more unit test for edge cases regarding .gitignore files (exclude comments for example)
  • Add more comments and JSDoc in the code
  • Update the README
  • Bring back globby.gitignore.sync([options])?
  • Map nodir to onlyFiles?

`['a/*', '!a/**']` is not really a relevant pattern to test as it wouldn't select any file.
The pattern `['b/*', 'a/*', '!a/**']` still test the performance of ignore a full directory but make more realistic still selecting other files
@pvdlg
Copy link
Contributor Author

pvdlg commented Jan 31, 2018

For further improvements we could drop the generateGlobTasks function and just pass the patterns (with inverted patterns from .gitignore) to fast-glob.
We would have some cases that would work differently though:

  • With ['a.tmp', '!*.tmp'] fast-glob returns [], while globby returns ['a.tmp'].
  • But with ['tmp/a.tmp', '!tmp'] both returns ['tmp/a.tmp'].

It seems that fast-glob allow negative patterns to be override by positive ones only for directory (i.e. the negative patter has to be a directory). But it always allow positive ones to override negative ones.

@pvdlg
Copy link
Contributor Author

pvdlg commented Feb 1, 2018

After doing a bit more test, it turns the current solution doesn't work properly for negative patterns in .gitignore, as files matched by those patterns end up being included in the result even if they are not part of the patterns passed by the user.

I added a failing test to demonstrate that.

.gitignore negative have to used as an exception in the scope of the files ignored within the same .gitgnore file. Unfortunately I don't see how this can be achieved without doing 2 steps as currently (glob then filter what's defined in the .gitignore).

If someone has a an idea how to solve this, let me know! Otherwise we'll have to revert to the old method.

@marionebl
Copy link
Contributor

@pvdlg Just a rough idea: What if we combine both approaches, like this.

  1. Read and parse relevant ignore files
  2. Create ignore globs when safe, e.g. ignores with exclusion would be omitted. This should be considered only as performance optimization
  3. Create a filter function as before and apply to fast-glob results to ensure all filter criteria are met

@pvdlg
Copy link
Contributor Author

pvdlg commented Feb 2, 2018

That sounds a good idea. I'm going to try that.

For info I did some test with the current master using fast-glob. With the current implementation (on master) I don't see much differences between node-glob and fast-glob regarding the cases with a .gitignore. I have an improvement a 1 to 3 op/s.

Not directly related, but I noticed that in most performance issue reported for XO, the performance issue come from what appear to be a bug in node-glob that is not present in fast-glob: isaacs/node-glob#309

glob.sync('**/*.js', {ignore: ['**/node_modules/**']})
// => Works as expected, content of node_modules is excluded

glob.sync('./**/*.js', {ignore: ['**/node_modules/**']})
// => Content of node_modules is NOT excluded

This problem doesn't happen with fast-glob. It seems a lot of folks pass to XO a pattern that starts with ./ and that create a huge performance problem as non of the XO default ignores are used.

If we switch to fast-glob (without modifying the .gitignore handling) we would improve the performances a lot on those cases. With the case described in xojs/xo#234, after setting Eslint to sue the same rules and plugins as XO I get:

  • ~16s with ESLint
  • ~30s with xo ./**/*.js
  • ~19s with xo **/*.js
  • ~16s with xo **/*.js if I replace the XO default ignores by just **/node_modules/**

=> Long story short, keeping the code as is and just replacing node-glob by fast-glob should get XO on par or pretty close to ESLint. Maybe we can do that, so we can get a XO release out, and then we'll continue to work on improving globby performances further.

@sindresorhus
Copy link
Owner

We could map nodir to onlyFiles to avoid a breaking change. Not sure if it's useful though, as some node-glob options are not available yet in fast-glob so it would be a breaking anyway.

Nah, we shouldn't map it. We need to do a major bump for this regardless.

=> Long story short, keeping the code as is and just replacing node-glob by fast-glob should get XO on par or pretty close to ESLint. Maybe we can do that, so we can get a XO release out, and then we'll continue to work on improving globby performances further.

I would be ok with that.

@sindresorhus
Copy link
Owner

Awesome work on investigating this @pvdlg :)

@sindresorhus
Copy link
Owner

@pvdlg Is there something worth merging from this PR? I could at least see the additional benchmarks and tests being useful.

@pvdlg
Copy link
Contributor Author

pvdlg commented Jan 9, 2019

Yes we could merge the benchmark and the tests.
I don't have much open source time currently, but I can open a new PR with only test/bench when I get time.

@sindresorhus
Copy link
Owner

Alright. Sounds good.

@sindresorhus
Copy link
Owner

Closing this, but would really appreciate the mentioned PR with test/bench improvements whenever/if you have time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants