Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recursive option #24

Closed

Conversation

schnittstabil
Copy link
Contributor

A recursive option for sindresorhus/cpy#10 (comment).

The procedure would be as follows:

  1. glob patterns as usual
  2. for each path in the result, execute an additional glob: var paths = glob(path + '{,/**}', opts)
  • if path is a file, then paths equals to [path]
  • if path is a directory, then paths contains path and all its files and folders with respect to opts (especially: ignore, dot and follow)

@UltCombo
Copy link
Contributor

I'm not sure if I have the necessary context to review this PR, but I'm wondering: for the use case you have presented (cpy), wouldn't it be viable to just append {,/**} to the end of all glob patterns?

@schnittstabil
Copy link
Contributor Author

@UltCombo The results may be the same, the runtimes are not, e.g.:

globby(['node_modules', '!node_modules'], {recursive: true})
// vs
globby(['node_modules{,/**}', '!node_modules{,/**}'])

@sindresorhus, @UltCombo About the use cases: I'm struggling with myself:
As dot defaults to false, should globby(['node_modules'], {recursive: true}) also include node_modules/.bin or not - what kind of behavior is expected by most users?

An alternative:

// typeof(recursive): boolean or object

globby(['node_modules'], {
  recursive: {
    dot: true,}
}) 

@UltCombo
Copy link
Contributor

Note that negation globs are always dot: true, this is a limitation of node-glob's ignore patterns (doc). That said, using the patterns ['node_modules{,/**}', '!node_modules{,/**}'] _should_ not traverse the node_modules directory (I'm pretty positive, but we should test to be sure).

And what I'm trying to say is, it would be nice if we could avoid adding so much complexity to this module. If the results are actually the same, and the perf is the same or better, I'd suggest to have the recursive: true option simply map the input globs appending ,{/**} to them. Then again, this might be an unnecessary abstraction that the caller (cpy) could handle on its own.

Apart from the complexity, another important point is that your code is instantiating a new glob object for each matched path, which likely adds a huge perf overhead to parse all these new generated glob patterns. Also, these new glob patterns are not sharing their caches (statCache, symlinks, realpathCache) which may result in a considerable perf loss depending on the use case. From the docs:

If you are running many glob operations, you can pass a Glob object as the options argument to a subsequent operation to shortcut some stat and readdir calls. At the very least, you may pass in shared symlinks, statCache, realpathCache, and cache options, so that parallel glob operations will be sped up by sharing information about the filesystem.

Although the (lack of) cache is easy to fix, I believe it would still not be worth to spawn a new glob instance for each matched path. Perhaps it is worth trying to implement something like the {,/**} logic I've mentioned above and compare the perf of both approaches.

Another general point about this feature: Did you consider "recursive" copies? I mean, copying a directory to a nested directory, e.g. if you copy the directory /a to /a/b/c, will it result in an infinite copy loop? Maybe it is worth trying to prevent that.
(Not really relevant to this repository, but keep it in mind when preparing the PR to cpy)

@schnittstabil
Copy link
Contributor Author

['node_modules{,/**}', '!node_modules{,/**}'] should not traverse the node_modules directory (I'm pretty positive, but we should test to be sure).

To be honest, I have only tested this:

glob('node_modules', {ignore:[ 'node_modules' ]}, )
// vs
glob('node_modules{,/**}', {ignore:[ 'node_modules{,/**}' ]}, )

The first one was more than 7 times faster than the second (in the globby directory) - however I don't know if node_modules was traversed 😏

important point is that your code is instantiating a new glob object for each matched path,

That is true, but I didn't came up with a better, readable idea - In addition, the nested Promise.all sucks too 😒

Did you consider "recursive" copies?

Thank you, yes I did. There are also some issues with mutually recursively linked directories to handle. Therefore I'm planning to determine the source files first, and starting to copy after.

@UltCombo
Copy link
Contributor

The first one was more than 7 times faster than the second (in the globby directory) - however I don't know if node_modules was traversed 😏

Well, that relative discrepancy is only useful when you pair it with the total execution time.

If the total execution time is just some nano/microseconds, then it is perfectly normal that the second pattern takes a few microseconds more to be parsed into a regex by the underlying minimatch library.

If the total execution times actually takes more than a few seconds, then it really means there is a perf issue with those patterns.

@schnittstabil
Copy link
Contributor Author

Well, that relative discrepancy is only useful when you pair it with the total execution time.

You are very hard to convince, I like that - but I would not have wrote it, if it was negligible 😏

# install glob and some big dependencies like ava

$ time node -e "require('glob')('node_modules', {ignore:[ 'node_modules' ]}, console.log);"
null []

real    0m0.238s
user    0m0.200s
sys 0m0.028s

$ time node -e "require('glob')('node_modules{,/**}', {ignore:[ 'node_modules{,/**}' ]}, console.log);"
null []

real    0m4.020s
user    0m3.832s
sys 0m0.552s

Apart from that, I think a recursive option is not necessarily need anymore: sindresorhus/cpy#10 (comment)

@UltCombo
Copy link
Contributor

You are very hard to convince, I like that - but I would not have wrote it, if it was negligible 😏

Hahah yeah, I like to be throughout with my reviews. Furthermore, I'm currently working 11h+ a day, so I'm saving time by offloading the burden of proof to you. 😛

Indeed, looks like {,/**} causes perf issues as well. That's odd, IIRC node-glob applies special treatment to ignore globs ending with /** as to not traverse inside the matched directories. Perhaps this special treatment is only applied when the pattern literally ends with /** (that is, {,/**} may not be recognized as valid for the optimization path in the current logic).

I'd like to see bench tests for glob('node_modules/**', {ignore:[ 'node_modules/**' ]}) and glob('node_modules{,/**}', {ignore:[ 'node_modules', 'node_modules/**' ]}), I believe both of these should get into the fast path. I think I should have more free time towards the end of the week, so I should be able to run such tests myself by then.

Well, in any case, if you'd like to land this PR sooner, feel free to work with other reviewers here. 😉

@schnittstabil
Copy link
Contributor Author

I believe both of these should get into the fast path.

You are right, ignore:[ 'node_modules', 'node_modules/**' ] is much faster.

@schnittstabil
Copy link
Contributor Author

Thank you, @UltCombo, you put me on the right track. It turned out that a recursive option wouldn't help much though.

@UltCombo
Copy link
Contributor

Oh, nice!
Sorry for the delay, this was still in my list. Good to know you have found a better approach. 😃
Let me know if you need help reviewing this implementation in cpy or something else.

@sindresorhus sindresorhus mentioned this pull request May 18, 2016
Closed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants