
code generator reparses the same input files again and again #1060

Closed
mraleph opened this issue Apr 5, 2024 · 13 comments · Fixed by #1061
bug Something isn't working



mraleph commented Apr 5, 2024

I became interested in freezed / build_runner performance when I saw this reddit thread.

I took @jankuss' original benchmark https://github.com/jankuss/genq/tree/00b0ea50143a1e813d7a1010dc172ccc09483cd1/benchmarks/freezed_benchmark and profiled the build_runner invocation. To my surprise, most of the time was spent in parsing. So I added a print('parsing $uri at ${StackTrace.current}') in FileState.parse in the analyzer. That revealed that we are just continuously parsing the same files again and again.

I then looked at the stack traces: they pointed at things like documentationOfParameter and tryGetAstNodeForElement. It seems that for each parameter it calls getParsedLibraryByElement, which causes the input file to be reparsed. This seems suboptimal and will certainly result in rather poor performance.

I am by no means a build expert, but it seems that either getParsedLibraryByElement should not be used like this, or it should do more caching and not reparse the input file again and again. /cc @natebosch @jakemac53

I think currently each input file is reparsed for each constructor and for each parameter in each constructor? That's a lot of reparsing, so you can probably make freezed run two orders of magnitude faster on this benchmark.
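The reparsing pattern described above, and the obvious fix, can be modeled in a few lines of plain Dart. This is a hypothetical sketch: `Library`, `parseLibrary`, `perParameter`, and `cachedParse` are stand-ins for the analyzer's real ParsedLibraryResult / getParsedLibraryByElement machinery, not its actual API.

```dart
// Stand-in for the analyzer's parsed-library result.
class Library {
  final Uri uri;
  Library(this.uri);
}

int parseCount = 0;

// Models a full re-parse of a source file (what getParsedLibraryByElement
// apparently triggers on each call, per the stack traces above).
Library parseLibrary(Uri uri) {
  parseCount++;
  return Library(uri);
}

// Anti-pattern: one parse per constructor parameter.
void perParameter(Uri uri, List<String> params) {
  for (final _ in params) {
    parseLibrary(uri);
  }
}

// Fix: memoize the parse per library URI and reuse the result.
final _cache = <Uri, Library>{};
Library cachedParse(Uri uri) => _cache.putIfAbsent(uri, () => parseLibrary(uri));

void main() {
  final uri = Uri.parse('package:example/model.dart');

  perParameter(uri, ['a', 'b', 'c']);
  print(parseCount); // 3 parses for 3 parameters

  parseCount = 0;
  for (var i = 0; i < 3; i++) {
    cachedParse(uri);
  }
  print(parseCount); // 1 parse, reused thereafter
}
```

With N constructors times M parameters per file, the anti-pattern does N×M parses where one would do, which is consistent with the "two orders of magnitude" estimate.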

mraleph added the bug (Something isn't working) and needs triage labels on Apr 5, 2024
@rrousselGit (Owner)

> To my surprise, most of the time was spent in parsing

How did you know that?

@rrousselGit (Owner)

I'll look into it this weekend. Thanks for the heads up.


mraleph commented Apr 5, 2024

> How did you know that?

I profiled it with perf and then looked at the flame graph.

@rrousselGit (Owner)

Is there no way to share the parsed AST across code generators?

I have two code generators that need the AST. It sounds like if someone uses both, all their files will be parsed twice.


mraleph commented Apr 5, 2024

> Is there no way to share the parsed AST across code generators?

I am not sure. @natebosch @jakemac53 would know.

I am a bit puzzled why there does not seem to be any caching/sharing - maybe due to the mutability of the underlying AST?

Then I would at least expect some documentation highlighting the cost of these methods.

But again - today is the first time I have even looked at this code.

I think as long as each code generator parses input files a small number of times, things are going to have acceptable performance, but calling these methods repeatedly for each element seems to be a no-no which will lead to really bad performance.

@rrousselGit (Owner)

AFAIK it's a question of "the AST is too complex" and "the AST costs a lot of memory".

I asked for some improvements there before (as I need the AST in all of my generators), and this was the TL;DR from what I remember.

@natebosch

> I am not sure. @natebosch @jakemac53 would know.
>
> I am a bit puzzled why there does not seem to be any caching/sharing - maybe due to the mutability of the underlying AST?

We don't cache at the build system level because it leads to problems like dart-lang/build#3657

> AFAIK it's a question of "the AST is too complex" and "the AST costs a lot of memory".
>
> I asked for some improvements there before (as I need the AST in all of my generators), and this was the TL;DR from what I remember.

This answers why we don't preemptively parse ASTs for all libraries so they can be read synchronously. The reason we don't cache is the analyzer's behavior when we reuse objects after adding more files to the analysis context.

@jakemac53
Copy link

build_runner does have the Resource API, which can be used to cache things across build steps. But I don't think it would work well for caching analyzer objects, because they tend to get invalidated often (which can make them unusable). The resolved AST might not do this, but I really don't know. It is better to cache some result which was derived from the analyzer result.

But in general, yes, it is known that going from the element model => AST node causes a full re-parse; AST nodes are not cached. You could instead ask for the resolved AST of a library once, and then work from that instead of the element model?

Ultimately, @scheglov is probably the best one to weigh in here.
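The "cache a result derived from the analyzer result" idea can also be sketched in plain Dart. Everything here is illustrative: `FieldSummary`, `summarize`, and `summaryFor` are hypothetical stand-ins for deriving cheap, immutable data from one resolved library AST, rather than holding on to analyzer objects that may be invalidated.

```dart
// Plain, immutable data derived from the AST - safe to cache, since it
// holds no references to analyzer-internal objects.
class FieldSummary {
  final String name;
  final String type;
  const FieldSummary(this.name, this.type);
}

int resolveCount = 0;

// Models walking a resolved library AST once and deriving plain data.
// Here the "source" is a toy 'type name' line format standing in for a
// real compilation unit.
List<FieldSummary> summarize(String source) {
  resolveCount++;
  return [
    for (final line in source.trim().split('\n'))
      FieldSummary(line.split(' ').last, line.split(' ').first),
  ];
}

// Cache the derived summary per library URI, not the analyzer result.
final _summaries = <String, List<FieldSummary>>{};
List<FieldSummary> summaryFor(String uri, String source) =>
    _summaries.putIfAbsent(uri, () => summarize(source));

void main() {
  const source = 'String name\nint age';
  // Many build steps can ask for the summary; the source is walked once.
  for (var i = 0; i < 5; i++) {
    summaryFor('package:example/person.dart', source);
  }
  print(resolveCount); // 1
  print(summaryFor('package:example/person.dart', source)
      .map((f) => '${f.type} ${f.name}')
      .join(', ')); // String name, int age
}
```

Because the cached value is plain data, it stays valid even if the analyzer later discards or rebuilds its internal state for that library.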

@rrousselGit (Owner)

I've pushed a fix for this specific issue, although it's still not ideal.

To be fair, genq's comparison is quite misleading: it's faster because it relies only on the AST.


mraleph commented Apr 5, 2024

Thanks! I can confirm that it is much faster now - 10s vs the 100s it used to take before on my machine.

I think a whopping 50% of the time is spent formatting the generated code. When I comment out formatOutput here, code generation time decreases to ~4.5s.


mraleph commented Apr 5, 2024

If I 10x the number of input files, the time goes to 18s, which to me indicates that a large portion of the ~5s is JIT warmup overhead and that the actual cost of code generation is just half of that.

I tried AOT-compiling the build_runner build script, but the naive attempt fails because build_runner seems to use mirrors somewhere.

Anyway, I might poke at this more some time later - but TBH this reveals to me that we have some really low-hanging fruit here and there.

@rrousselGit (Owner)

Yeah, the fact that 50% of the time is spent on formatting is a known issue. It's shared with all code generators.
If dart format could natively ignore generated files, I'd just not run the formatter on them.

@rrousselGit (Owner)

I agree that we have a bunch of low-hanging fruit.

IMO another big issue is how build_runner listens to all imported files (including transitive imports).
On large projects, we can easily end up having all generated files regenerate on any change in the project, which is really inefficient.
I was thinking about changing riverpod_generator to rely only on the AST to avoid this. But for Freezed, that's not really doable.

There are probably a few possible APIs to help here.

4 participants