an option to download only added files for a given dataset version #1100

kirillfish · 2023-08-24T15:30:12Z

Until now, calling get_local_copy on a clearml.Dataset object downloaded all files from the dataset and, recursively, from all of its parents (or created soft links for files that had already been downloaded). There was no way to get only the files added in this particular version of the dataset while ignoring its parents. This small PR implements exactly that.

Testing Instructions

Register a parent dataset and add some files
Register a child dataset, inherited from the first one, and add some more files
Use only_added argument (False by default):

from clearml import Dataset
dataset = Dataset.get(dataset_name='child')
data_base_dir = dataset.get_local_copy(only_added=True)

data_base_dir will contain only files returned by list_added_files
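
One way to check this claim is to compare the downloaded directory against the expected list of relative paths. The helper below is only a sketch: compare_dir_to_expected is a hypothetical name and not part of clearml, and it assumes the expected list holds paths relative to the dataset root.

```python
from pathlib import Path
from typing import Iterable, Set, Tuple


def compare_dir_to_expected(base_dir: str, expected: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) relative paths under base_dir.

    Hypothetical helper for testing, not part of clearml.
    """
    root = Path(base_dir)
    # Collect every file (soft links to files included) as a relative POSIX path
    found = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    # Normalize the expected paths the same way so the two sets are comparable
    wanted = {Path(e).as_posix() for e in expected}
    return wanted - found, found - wanted
```

With a real dataset this would presumably be called as compare_dir_to_expected(data_base_dir, dataset.list_added_files()), and both returned sets should be empty if the flag behaves as described. Modified files, if they are downloaded as well, would show up in the second set.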

Other Information

Potential issue:

Soft links are still being created for files in the diff which have already been downloaded - this is ok

But if you first call child.get_local_copy(only_added=True) and then child.get_local_copy(), the second call does not create soft links for the existing files and downloads the diff again, which is probably not OK. The same applies to "grandchildren" datasets. I'm still figuring out why. On the other hand, this could be acceptable if the only_added=True flag is meant only for debugging or for quickly inspecting datasets.

eugen-ajechiloae-clearml · 2023-08-28T10:42:00Z

I believe that this will also download the modified files (which is good), but maybe the name only_added is not appropriate. How about ignore_parent_datasets?

kirillfish · 2023-08-28T16:16:02Z

@eugen-ajechiloae-clearml done, I renamed it

eugen-ajechiloae-clearml · 2023-09-11T10:27:35Z

clearml/datasets/dataset.py

+        unified_list = set()
+        for ds_id in datasets:
+            dataset = self.get(dataset_id=ds_id)
+            unified_list |= set(dataset._dataset_file_entries.values())


@kirillfish This line seems to break:

unified_list |= set(dataset._dataset_file_entries.values())
TypeError: unhashable type: 'FileEntry'

Did you mean to construct the set based on the ids? Like:

unified_list |= {entry.id for entry in dataset._dataset_file_entries.values()}
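
For context, here is a minimal sketch of why the set construction fails and why keying on a hashable field avoids it. Entry is a stand-in dataclass, not clearml's FileEntry, and the id field is an assumption taken from the suggestion above; the real class may expose a different attribute.

```python
from dataclasses import dataclass


@dataclass  # eq=True without frozen=True sets __hash__ to None, so instances are unhashable
class Entry:
    # Stand-in for illustration only; field names are assumptions
    id: str
    relative_path: str


entries = [
    Entry("1", "TRAIN/a.png"),
    Entry("2", "TEST/b.png"),
    Entry("1", "TRAIN/a.png"),  # duplicate, e.g. the same file seen via two datasets
]

try:
    set(entries)  # fails the same way as set(dataset._dataset_file_entries.values())
    hashable = True
except TypeError:
    hashable = False

# Building the set from a hashable field works and de-duplicates the entries
unique_ids = {e.id for e in entries}
```

Note that a set of ids drops the entry objects themselves, so any later code that expects entries would need a dict keyed by id instead.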

@kirillfish btw, here's a simple test snippet (you'll need to adapt it a little) to check that it works properly.

from pathlib import Path
import shutil

from clearml import Dataset, StorageManager


def main():
    manager = StorageManager()

    print("STEP1 : Downloading mnist dataset")
    mnist_dataset = Path(manager.get_local_copy(
        remote_url="https://allegro-datasets.s3.amazonaws.com/datasets/MNIST.zip", name="MNIST"))
    mnist_dataset_train = mnist_dataset / "TRAIN"
    mnist_dataset_test = mnist_dataset / "TEST"

    print("STEP2 : Creating the training dataset")
    mnist_dataset = Dataset.create(
        dataset_project="dataset_examples", dataset_name="MNIST Training Dataset")
    mnist_dataset.add_files(path=mnist_dataset_train, dataset_path="TRAIN")
    mnist_dataset.upload()
    mnist_dataset.finalize()

    print("STEP3 : Create a child dataset of mnist dataset using TEST Dataset")
    child_dataset = Dataset.create(
        dataset_project="dataset_examples", dataset_name="MNIST Complete Dataset",
        parent_datasets=[mnist_dataset.id])
    child_dataset.add_files(path=mnist_dataset_test, dataset_path="TEST")
    child_dataset.upload()
    child_dataset.finalize()

    print("We are done, have a great day :)")


main()

# Full local copy, including files inherited from the parent dataset
dataset_path = Dataset.get("<replace with child dataset id, i.e. MNIST Complete Dataset ID>").get_local_copy()
print(list(Path(dataset_path).glob("*")))

# Remove the cached copy so the next call starts from a clean state
shutil.rmtree(dataset_path)

# Local copy that skips the parent datasets
dataset_path = Dataset.get("<replace with child dataset id, i.e. MNIST Complete Dataset ID>").get_local_copy(ignore_parent_datasets=True)
print(list(Path(dataset_path).glob("*")))

@kirillfish any news on this?

Hi @kirillfish , any update?

an option to download only added files for a given dataset version

83330d2

renamed only_added -> ignore_parent_datasets

f3916cc

kirillfish marked this pull request as ready for review August 28, 2023 16:14

eugen-ajechiloae-clearml reviewed Sep 11, 2023


kirillfish commented Aug 24, 2023

eugen-ajechiloae-clearml commented Aug 28, 2023

kirillfish commented Aug 28, 2023

eugen-ajechiloae-clearml Sep 11, 2023

AlexandruBurlacu Sep 11, 2023

jkhenning Oct 17, 2023

jkhenning Nov 7, 2023
