Description
The Kedro runner calls the data catalog's shallow_copy, which always returns a new DataCatalog object, destroying any custom DataCatalog subclass being copied.
Context
I created a custom PickleDataCatalog class in order to dynamically handle multiple pickle objects in the after_node_run hook event.
My custom data catalog and hooks were properly set in src/my_project_name/settings.py:
from .hooks import ModelSavingHook # noqa: F401
HOOKS = (ModelSavingHook(),)
...
from .pickle import PickleDataCatalog
DATA_CATALOG_CLASS = PickleDataCatalog
It turned out that the preset DATA_CATALOG_CLASS is lost during the pipeline lifecycle, and the catalog param is of type DataCatalog instead of the expected custom PickleDataCatalog.
Steps to Reproduce
Create a simple custom PickleDataCatalog class:
import logging
import os
from typing import Any

from kedro.io import DataCatalog
from kedro_datasets.pickle import PickleDataset


class PickleDataCatalog(DataCatalog):
    def __init__(self, pickle_directory: str = "data/models/", *args, **kwargs):
        super().__init__(*args, **kwargs)
        logging.info(f"PickleDataCatalog instance created, id: {id(self)}")
        self._pickle_directory = pickle_directory

    def save(self, name: str, data: Any) -> None:
        # Register a PickleDataset on the fly for unknown names, then delegate.
        if name not in self._datasets:
            self.add(name, PickleDataset(filepath=os.path.join(self._pickle_directory, f"{name}.pickle")))
        super().save(name, data)
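For illustration, this is how the class is meant to behave when saving a name that has no catalog entry yet (the model_a name and object here are hypothetical):

catalog = PickleDataCatalog()
# registers a PickleDataset at data/models/model_a.pickle on the fly, then saves
catalog.save("model_a", {"weights": [1, 2, 3]})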
Create hooks.py with a custom ModelSavingHook handler to properly maintain the files:
import logging
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node

from .pickle import PickleDataCatalog


class ModelSavingHook:
    """Hook to save models to disk after they are run."""

    @hook_impl
    def before_node_run(  # noqa: PLR0913
        self,
        node: Node,
        catalog: DataCatalog,
        inputs: dict[str, Any],
        is_async: bool,
        session_id: str,
    ):
        pass  # load everything that is required

    @hook_impl
    def after_node_run(self, catalog: PickleDataCatalog, outputs: Dict[str, Any], node, inputs: Dict[str, Any]):
        logging.info(f"Catalog type in hook: {type(catalog)}, id: {id(catalog)}")
        if not isinstance(catalog, PickleDataCatalog):
            raise TypeError(f"Expected `PickleDataCatalog`, got {type(catalog)}")
        # iterating outputs and saving objects via the passed `catalog` param
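The elided body would do something along these lines (a sketch of the idea, not the author's exact code):

for name, data in outputs.items():
    # delegate to PickleDataCatalog.save, which registers a PickleDataset on demand
    catalog.save(name, data)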
Update src/my_project_name/settings.py:
from .hooks import ModelSavingHook # noqa: F401
HOOKS = (ModelSavingHook(),)
...
from .pickle import PickleDataCatalog
DATA_CATALOG_CLASS = PickleDataCatalog
kedro run
Expected Result
While investigating the pipeline lifecycle, I can see that the custom DataCatalog is properly propagated to the following events:
after_catalog_created
before_pipeline_run
Unfortunately it is lost on before_node_run and after_node_run. Expected: the custom PickleDataCatalog should also be passed to these events.
Actual Result
The catalog object received on before_node_run is different from the one on after_catalog_created and before_pipeline_run! Instead of the expected PickleDataCatalog, the default kedro.io.DataCatalog is passed (check id(catalog) on all events).
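A quick way to observe this is a throwaway diagnostics hook that logs the catalog's type and id at every event; this is just a sketch for verification, not part of the fix:

import logging

from kedro.framework.hooks import hook_impl


class CatalogIdentityHook:
    """Logs which catalog object each lifecycle event receives."""

    @hook_impl
    def after_catalog_created(self, catalog):
        logging.info("after_catalog_created: %s id=%s", type(catalog), id(catalog))

    @hook_impl
    def before_pipeline_run(self, catalog):
        logging.info("before_pipeline_run: %s id=%s", type(catalog), id(catalog))

    @hook_impl
    def before_node_run(self, catalog):
        # On kedro 0.19.5 this logs kedro.io.DataCatalog with a new id,
        # because the runner passes a shallow copy built with DataCatalog(...).
        logging.info("before_node_run: %s id=%s", type(catalog), id(catalog))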
Cause of the problem
It turned out that the runner in kedro.runner calls shallow_copy on the catalog, which is implemented as:
def shallow_copy(
    self, extra_dataset_patterns: Patterns | None = None
) -> DataCatalog:
    """Returns a shallow copy of the current object.

    Returns:
        Copy of the current object.
    """
    ...
    return DataCatalog(
        datasets=self._datasets,
        dataset_patterns=dataset_patterns,
        load_versions=self._load_versions,
        save_version=self._save_version,
    )
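This makes the type loss easy to reproduce in isolation (a sketch, assuming the PickleDataCatalog from above):

copy = PickleDataCatalog().shallow_copy()
print(type(copy))  # <class 'kedro.io.data_catalog.DataCatalog'> - subclass is lost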
In order to fix that bug we need to use the actual type of the catalog object, which would be:
def shallow_copy(
    self, extra_dataset_patterns: Patterns | None = None
) -> DataCatalog:
    """Returns a shallow copy of the current object.

    Returns:
        Copy of the current object.
    """
    ...
    return self.__class__(
        datasets=self._datasets,
        dataset_patterns=dataset_patterns,
        load_versions=self._load_versions,
        save_version=self._save_version,
    )
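With self.__class__ the copy keeps the caller's type, so the same check passes (again just a sketch):

copy = PickleDataCatalog().shallow_copy()
assert isinstance(copy, PickleDataCatalog)  # the subclass now survives the copy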
Your Environment
Kedro version used (pip show kedro or kedro -V): kedro, version 0.19.5
Python version used (python -V): Python 3.9.17
Operating system and version: macOS