Following the discussion from #816, there would be benefits to allowing materializer nodes (both `DataLoader` and `DataSaver`) to be defined statically at the `Driver` level:
- The nodes can be called directly via `.execute()`.
- Materializers appear in the `HamiltonGraph` and in visualizations even if they aren't executed.
- The DAG, including the materializers, can be validated before execution.
For basic `to.parquet()` usage, it might be more efficient to store simple Python functions using `pd.to_parquet()` in a module to enable this pattern. More powerful materializers (e.g., dlt) would benefit from this approach, though.
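The proposal can be sketched with a toy builder and driver. This is not Hamilton's real API — `Builder`, `Driver`, and the tuple-based materializer here are minimal stand-ins to illustrate that materializer nodes become ordinary, executable graph nodes:

```python
# Toy sketch: materializers are registered on the builder and added to the
# graph at build time, so they can be executed by name like any other node.
# NOT Hamilton's real implementation; all names here are illustrative.

def df() -> dict:
    # Stand-in dataflow node producing some data.
    return {"a": [1, 2, 3]}

storage = {}  # pretend destination (a file, a table, ...)

def save_to_storage(data: dict) -> dict:
    storage["df"] = data
    return {"rows": len(data["a"])}

class Driver:
    def __init__(self, nodes):
        self.nodes = nodes

    def execute(self, final_vars):
        # Materializer nodes are requested by name, like any other node.
        return {v: self.nodes[v]() for v in final_vars}

class Builder:
    def __init__(self):
        self.nodes = {}           # name -> zero-arg callable (toy "graph")
        self.materializers = []   # (node_name, upstream_name, save_fn)

    def with_modules(self, *fns):
        for fn in fns:
            self.nodes[fn.__name__] = fn
        return self

    def with_materializers(self, *mats):
        self.materializers.extend(mats)
        return self

    def build(self) -> Driver:
        # Materializer nodes join the graph at build time,
        # not at .materialize() time.
        for name, upstream, save_fn in self.materializers:
            self.nodes[name] = lambda u=upstream, s=save_fn: s(self.nodes[u]())
        return Driver(self.nodes)

dr = (
    Builder()
    .with_modules(df)
    .with_materializers(("save_df", "df", save_to_storage))
    .build()
)
result = dr.execute(["save_df"])
print(result)  # {'save_df': {'rows': 3}}
```

Because the saver is a regular node, it would also show up in graph visualizations and be validated with the rest of the DAG before execution.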
Following #877
Added the ability to add materializer nodes to the `FunctionGraph` at `Driver` build time.
Use cases:
- Easier to view materializers as part of `.display_all_functions()`
- Allows calling materializers by name with `.execute()`, which would let some users ignore `.materialize()` entirely
- Possible to combine "static" materializers at build time and "dynamic" materializers at execution time
- Catch invalid dataflows earlier than `dr.materialize()`
## Changes
- Builder now accepts `.with_materializers(*materializers)`
- `Builder.build()` adds materializer nodes after creating the `Driver` and returns the updated Driver. The logic is derived from `Driver.materialize()`
- `Builder.copy()` copies the materializers
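The `copy()` behavior can be sketched with a stand-in class (not Hamilton's real `Builder`); the point is that the materializer list is copied, not shared, so the original and the copy can diverge:

```python
# Toy sketch of Builder.copy() carrying materializers over.
class Builder:
    def __init__(self):
        self.modules = []
        self.materializers = []

    def with_materializers(self, *mats):
        self.materializers.extend(mats)
        return self

    def copy(self) -> "Builder":
        new = Builder()
        new.modules = list(self.modules)
        # Copy the list (not just the reference) so the copies can diverge.
        new.materializers = list(self.materializers)
        return new

base = Builder().with_materializers("to_parquet")
variant = base.copy().with_materializers("to_csv")
print(base.materializers)     # ['to_parquet']
print(variant.materializers)  # ['to_parquet', 'to_csv']
```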
## How I tested this
- Added tests to check materializer nodes (savers and loaders) are properly added
- Tested that "static" and "dynamic" materializers can be used together
## Notes
- Each time materializers are added, `post_graph_construct` hooks are triggered
- Should we include materializers in the `version` hashing of the dataflow? Should we treat "static" and "dynamic" materializers differently? IMO, we might want to create two versions: one for "the dataflow transforms, ignoring materializers" and one for "all nodes", since they answer different equality / diffing questions
- There's duplicated logic between `Builder.build()` and `Driver.materialize()`. A better approach could be to have `Driver.add_materializers()` and `Driver.execute_materializers()`; both `Builder.build()` and `Driver.materialize()` could call `Driver.add_materializers()`
- Related to #713, users need to be aware that materializers need a valid `target` or `dependencies` (type and name) to connect to.
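The two-version idea from the notes could look like the following sketch. Node "code" is just a string here, and the hashing scheme is illustrative, not Hamilton's actual versioning:

```python
import hashlib

def version(nodes: dict) -> str:
    """Hash node names and code in a stable order."""
    h = hashlib.sha256()
    for name in sorted(nodes):
        h.update(name.encode())
        h.update(nodes[name].encode())
    return h.hexdigest()

# Toy node sets: transform nodes vs. materializer nodes.
transform_nodes = {"df": "def df(): ...", "clean_df": "def clean_df(df): ..."}
materializer_nodes = {"save_df": "to.parquet(...)"}

# One version answers "did the transforms change?",
# the other "did anything in the dataflow change?"
transform_version = version(transform_nodes)
full_version = version({**transform_nodes, **materializer_nodes})

# Adding a materializer leaves the transform-only version untouched.
assert version(transform_nodes) == transform_version
assert transform_version != full_version
```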
Squashed commits of:
* added Builder.with_materializers()
* moved logic to Driver __init__
* added type annotation
* made materializers an internal attribute
* updated docs; pre-commit hook; typing bugfix
---------
Co-authored-by: zilto <tjean@DESKTOP-V6JDCS2>
Solution 2: an alternative would be to allow materializers to be imported and added via `.with_modules()`. For example, a `production_materializers.py` module would contain the materializer definitions.
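As a sketch of the module idea, a hypothetical `production_materializers.py` could hold plain functions wrapping `pd.to_parquet()`. The actual example was not shown in the issue, so `saved_df`, the path, and the return shape are all assumptions:

```python
# Hypothetical contents of production_materializers.py: a plain Hamilton-style
# function wrapping DataFrame.to_parquet, importable via .with_modules().

def saved_df(df) -> dict:
    """Persist `df` to parquet and return metadata, as a DataSaver node would."""
    path = "./df.parquet"
    df.to_parquet(path)  # any object exposing to_parquet works for this sketch
    return {"path": path}
```

For simple cases like this, a plain module function may be all that's needed; the static-materializer machinery pays off for more powerful materializers (e.g., dlt).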