-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Introducing an AST factory #3770
base: master
Are you sure you want to change the base?
Conversation
@@ -2003,8 +1990,10 @@ public virtual ISqlExpression ConvertExpressionImpl(ISqlExpression expression, C | |||
return expression; | |||
} | |||
|
|||
protected virtual ISqlExpression ConvertFunction(SqlFunction func) | |||
protected virtual ISqlExpression ConvertFunction(ISqlExpression expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would recommend this remains SqlFunction func
, and AstFactory.Coalesce()
returns a SqlFunction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the ConvertFunction
business was annoying, I had to hack something to be able to proceed without making too many deep changes for now.
Long-term, I would like to see ConvertFunction
be gone, seems to me its purpose is to switch function names so that correct SQL is generated. This could also happen at SQL generation time without modifying the AST at all, but I don't want to engage in too many refactorings at the same time.
Would recommend AstFactory.Coalesce() returns a SqlFunction.
Seems against the principle of AST factory to constraint what is returned too much.
For instance, one local optimization I have on the map for later is that Coalesce
can drop all its arguments after the 1st one that cannot be null.
If this happens to be the 1st argument, Coalesce
can just return the first argument. In that case there's no guarantee that Coalesce
returns a SqlFunction
, could be any expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Wonder if a lot of the remaining functionality in this method could be converted to Function attributes or expression mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, I haven't got my head too deep into that yet.
Gotta say one challenge when trying to improve the code is that you hit some design quirks and then it's a rabbit hole to work-around or fix them.
Other instances where I had to hack around:
- I've changed
Invert
to return aISqlPredicate
instead ofIQueryElement
... turned out every caller had a cast already! 😱 . - I've changed
NotExpr
to take aISqlPredicate
instead ofISqlExpression
. Makes sense and was already the case almost everywhere.
(SqlFunction { Name: "Coalesce" or PseudoFunctions.COALESCE } ca, | ||
SqlFunction { Name: "Coalesce" or PseudoFunctions.COALESCE } cb) => ToArray(ca.Parameters, cb.Parameters), | ||
(SqlFunction { Name: "Coalesce" or PseudoFunctions.COALESCE } ca, _) => ToArray(ca.Parameters, b), | ||
(_, SqlFunction { Name: "Coalesce" or PseudoFunctions.COALESCE } cb) => ToArray(a, cb.Parameters), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the commentary of the PR, these cases should be recognizing an existing SqlFunction
object and updating it rather than preparing an array for a brand new SqlFunction
object in l180. This probably means that SqlFunction.Parameters
should use an IReadOnlyList<>
or similar, where SqlFunction
provides an internal void AddParameter(ISqlExpression a)
that updates the underlying list of that property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm still trying to come up with better ways to reduce allocations and would love new ideas here.
I think our SQL AST nodes should be immutable. Maybe not in the strict sense, I'd maybe allow having flags set later during analysis / optimizations, but as far as what what the node is/represents.
Immutability enables re-using the node safely, for example when expanding an expression a == b
into a == b or a is null and b is null
without cloning the whole a
and b
subtree.
Another idea is that we could cache and reuse sub-expressions across various compilations.
I'm not set on what's the best way to reduce copying of immutable structures, there are options. I'd like to balance clean code and efficiency.
Some ideas:
-
This specific
coalesce
example might be common and bad when converting a long??
chain.
To avoid this we could complement it by some custom code in the expression converter to immediately create the flattened node.
I have done this foror
/and
predicates, the Expression converter immediately flattens them before calling the factory, although the factory would also flatten (by copying) later tree transforms that would put anor
inside anor
. -
Maybe nodes could be mutable until they're cloned? Could be indicated by a bit on the node.
This makes any code that want to mutate the node less clean because it has to check the bit and it needs 2 code paths for the case where it can mutate and the case where it needs a copy :( -
For nodes that are nicer being mutated when they're first created (
or
/and
immediately come to mind in current linq2db design), maybe we could have a Builder pattern, where a mutablestruct
is used until we're done and then an immutable node is produced. -
Maybe we can use more sophisticated structures inside ast nodes, e.g. a collection of nodes that can share and reuse a buffer when stuff is appended?
BTW in this regard, I think we would benefit from using frugal collections, i.e. structs with optimized storage for 2-4 elements, instead of generic arrays/list, as many nodes commonly have very few children. That would cut down allocations some more... Lots of ideas but I don't want to touch the AST nodes until the factory is fully in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should do some research on red-green trees from Roslyn: https://ericlippert.com/2012/06/08/red-green-trees/. It may help you find a better balance between immutability and memory pressure. I think the gist of it is similar to option 2, where you have internal mutable trees and public immutable trees once cloned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I change my mind every day but seeing some stuff that's done with the AST today, I'm having second-thoughts about immutability. I'm still unsure what the best design will be, but in all cases it'll be easier to tweak AST representation with the factory in place.
RE: red-green trees I don't think they would help with building the AST, as the green part is immutable and basically matches what our AST is and does today.
The design might be interesting if we want to maximize the sharing of immutable nodes (the green part), maybe even across queries, while allowing a mutable layer on top, e.g. for flags and other annotations (would be the red part).
My current thoughts:
- The builder approach is most promising if we want to optimize allocations to the max during
Expression
conversion. - Another design that would work well during optimizations too: add a flag indicating that the node is shared (hence immutable -- but most nodes would not be) and mutating methods. Those methods would mutate and return the same node when they can, and return a fresh modified copy when they can't.
The second design pushed to the max would seriously reduce allocations during full tree optimizations by SqlOptimizer
. Currently, when something is modified, every parent node of the tree is replaced by a new one -> in many cases that may not be required if you just swap child references.
About SqlBuildersSqlBuilder is the last step of the pipeline, it takes an optimized tree and prints a SQL string. I've noticed some builders create a bunch of AST nodes, the main reason is to print those nodes as an alternative for "standard" nodes in the tree. These kind of transformations are often done by SqlOptimizer but not always... I'm not gonna introduce the AST factory inside SqlBuilder classes. This is a last step past every optimizations and the provider-specific builders know exactly what they want to print, so let them be. It is quite wasteful to create those temporary nodes. It'd be good to remove them with some more direct calls to SqlBuilder helpers that can print expressions without allocating new nodes. A good example is the |
I'll agree with this comment. I think ultimately there should be three distinct phases: building (when initial LINQ calls are made), optimizing (SqlOptimizer and co.), and rendering (SqlBuilder and co.). Currently, it sounds like optimizing and rendering are blended, with some optimization happening during the rendering phase and some pre-rendering happening during the optimization phase. It would definitely be a bit cleaner if the optimization was independent of rendering, with maybe some flags available from the data provider that affect certain features during the optimization phase. |
@viceroypenguin, I've did that separation, and I've got performance degradation ;) So, SqlBuilder also optimises SQL on the fly. It is critical for queries which depends on parameter values. Main disadvantage of doing optimisation on whole Statement is a lot of new allocations and AST rebuilding. |
This is an attempt at improving the internals of linq2db, there is no change for users.
As linq2db is large and complex, the changes in the PR are just "getting started" and far from complete, although as I will discuss below it should probably be merged incrementally in multiple steps..
Main idea: add an AST factory
The core idea of this PR is to introduce a factory class to create our SQL abstract syntax tree.
So, instead of new'ing up nodes with
new SqlPredicate.IsNull(...)
and friends, we call intoast.IsNull(...)
.What's the difference anyway?!
new T
must return an instance ofT
, whereas a factory could return anyISqlExpression
.This opens up quite a few possibilities:
Local optimizations
The AST factory can apply local transformations and return an equivalent, optimized expression. I think I've seen this referred to as optimization by construction.
For example, if you call
and(...)
with a series of nodes that incl. afalse
constant, the factory would just return afalse
constant; Nested calls tocoalesce(...)
can be flattened in a single node.Today, these transformations are generally done by
SqlOptimizer
. Doing them at construction time largely reduces the number of allocations, which is beneficial for performance.To illustrate, consider the expression
isnull(row(x, y))
in a provider that doesn't supportrow
.Today, compilation could go like this:
With a builder, the same process happens but immediately at construction without visiting anything, and without constructing intermediate nodes:
Summary: current design allocates 7 nodes (in addition to x, y); factory only allocates 2!
Interning
Common constants or expressions such as
null
,0
,1
,true
,false
are cached and re-used, reducing allocations.If the codebase is 100% going through the AST factory, we can take advantage of the fact that values are interned and make reference equalities rather than pattern matching to check if something is a well-known value like
0
.Provider-specific translations
The factory could be a static class, but I've made it an instance, so that we can have provider-specific factories.
This enable provider-specific transformations, such as lowering unsupported expressions like
SqlRow
in the previous example.Enforcing invariants
The factory can enforce some constraints on our AST nodes, for example we could guarantee that a some structures like
or
,coalesce
are always flattened, which makes further processing more efficient.Benefits of AST factory
Performance
As explained above, a factory can apply transformations at construction without allocating nodes.
This improves performance because it reduces GC pressure and requires less "visiting".
Separation of concerns
Some optimizations are actually made by the expression parser.
This complicates the parser with things that are not its responsibilities.
Furthermore, those optimizations are either not applied later when the tree is transformed, or duplicated.
With a dedicated class that consistently does these transforms, the calling code can be simplified.
I think that if you look at the code that is already modified in this PR, the result is IMHO generally much cleaner -- and I'm only getting started.
Future-proof
This is a first step, that will help future refactorings.
I think our SQL AST is not ideal and there are changes I would like to make for cleaner code and more efficient execution.
This abstraction layer, once complete, will make it easier to apply those changes project-wide.
Provider-independent remoting?
This is a long-term idea that is not on my map for now, but anyway:
Instead of serializing a provider specific AST on the client, we could call into a "generic" AST factory that does nothing but creates a neutral structure to serialize these calls.
On the server, we can rebuild the AST by making the same calls into a provider-specific factory.
This could make the client provider-agnostic, which is nice to have, IMHO.
Rollout
Obviously this is a very large-scale change, as there are calls to AST constructors all over the place.
I've started the work just so that you can get a better grasp of what the code would actually look like.
I think this is gonna be too annoying to merge to fully do in a single PR.
If we commit to this idea, I think we should regularly merge progress. Not all the code calls into the factory, but everything is still working and can be released.
There's a lot to do, in waves: