Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

localsync will always fail to skip unchanged files because of 'downloadStatistics' #3394

Open
mrtgenet opened this issue Mar 24, 2022 · 3 comments

Comments

@mrtgenet
Copy link
Contributor

Hello there,

I've been trying to use localsync through the API to use downloadFolderRecursive with metadata and prevent re-downloading unchanged files as expected. The code shown in girder client is pretty simple :

@main.command('localsync', short_help=_short_help, help='%s\n\n%s' % (_short_help, _common_help))
@_CommonParameters(additional_parent_types=[])
@click.pass_obj
def _localsync(gc, parent_type, parent_id, local_folder):
    if parent_type != 'folder':
        raise Exception('localsync command only accepts parent-type of folder')
    gc.loadLocalMetadata(local_folder)
    gc.downloadFolderRecursive(parent_id, local_folder, sync=True)
    gc.saveLocalMetadata(local_folder)

So in my client I did gc.loadLocalMetadata(local_folder), gc.downloadFolderRecursive(parent_id, local_folder, sync=True) and gc.saveLocalMetadata(local_folder) but it would not work, it kept re-downloading every files evey time, .girder_metadata file present or not.

The problem lies in the definition of gc.downloadFolderRecursive(), in the test for syncing:

def downloadFolderRecursive(self, folderId, dest, sync=False):
    """
    Download a folder recursively from Girder into a local directory.

    :param folderId: Id of the Girder folder or resource path to download.
    :type folderId: ObjectId or Unix-style path to the resource in Girder.
    :param dest: The local download destination.
    :type dest: str
    :param sync: If True, check if item exists in local metadata
        cache and skip download provided that metadata is identical.
    :type sync: bool
    """
    offset = 0
    folderId = self._checkResourcePath(folderId)
    while True:
        folders = self.get('folder', parameters={
            'limit': DEFAULT_PAGE_LIMIT,
            'offset': offset,
            'parentType': 'folder',
            'parentId': folderId
        })

        for folder in folders:
            local = os.path.join(dest, self.transformFilename(folder['name']))
            os.makedirs(local, exist_ok=True)

            self.downloadFolderRecursive(folder['_id'], local, sync=sync)

        offset += len(folders)
        if len(folders) < DEFAULT_PAGE_LIMIT:
            break

    offset = 0

    while True:
        items = self.get('item', parameters={
            'folderId': folderId,
            'limit': DEFAULT_PAGE_LIMIT,
            'offset': offset
        })
        
        for item in items:
            _id = item['_id']
            self.incomingMetadata[_id] = item
        
# --------> HERE <--------------------------------------------------------------------
            if sync and _id in self.localMetadata and item == self.localMetadata[_id]:
                continue
            self.downloadItem(item['_id'], dest, name=item['name'])
        
        offset += len(items)
        if len(items) < DEFAULT_PAGE_LIMIT:
            break

The test item == self.localMetadata[_id] will always fail, since the 'downloadStatistics' metadata field has changed on the server. We would like to check that id, name, last modification time etc. haven't changed but the number of downloads, requests and stuff is causing this test never ever be True.

I ended up recoding the gc.downloadFolderRecursive() method as follows, which works well:

def downloadFolderRecursive(self, folderId, dest, sync=False):
    """
    Download a folder recursively from Girder into a local directory.

    :param folderId: Id of the Girder folder or resource path to download.
    :type folderId: ObjectId or Unix-style path to the resource in Girder.
    :param dest: The local download destination.
    :type dest: str
    :param sync: If True, check if item exists in local metadata
        cache and skip download provided that metadata is identical.
    :type sync: bool
    """
    offset = 0
    folderId = self._checkResourcePath(folderId)
    while True:
        folders = self.get('folder', parameters={
            'limit': DEFAULT_PAGE_LIMIT,
            'offset': offset,
            'parentType': 'folder',
            'parentId': folderId
        })

        for folder in folders:
            local = os.path.join(dest, self.transformFilename(folder['name']))
            os.makedirs(local, exist_ok=True)

            self.downloadFolderRecursive(folder['_id'], local, sync=sync)

        offset += len(folders)
        if len(folders) < DEFAULT_PAGE_LIMIT:
            break

    offset = 0

    while True:
        items = self.get('item', parameters={
            'folderId': folderId,
            'limit': DEFAULT_PAGE_LIMIT,
            'offset': offset
        })
        
        for item in items:
# --------> THERE <-----------------------------------------------------------------------------------------
            # remove 'downloadStatistics' field from metadata, which causes to re download everything anyway
            if 'downloadStatistics' in item:
                item.pop('downloadStatistics')
            if 'downloadStatistics' in self._gc.localMetadata:
                self._gc.localMetadata.pop('downloadStatistics')

            _id = item['_id']
            self.incomingMetadata[_id] = item
        
            if sync and _id in self.localMetadata and item == self.localMetadata[_id]:
                continue
            self.downloadItem(item['_id'], dest, name=item['name'])
        
        offset += len(items)
        if len(items) < DEFAULT_PAGE_LIMIT:
            break

That way, the 'downloadStatistics' field is also removed from the .girder_matadata local file, as it is of no interest for my use, but keeping it there and yet not using it for the sync test it straight forward.

I hope this helps, I hope it is not a misunderstanding of what localsync is expected to do, don't hesitate to tell me if this is all wrong.

Cheers.

@zachmullen
Copy link
Member

I believe this is simply incorrect behavior that you've discovered. Rather than removing fields, we should just identify some known set of fields that we can use to indicate change in the item. The updated field is probably the easiest.

I'd be happy to review a PR if you open one.

@mrtgenet
Copy link
Contributor Author

Hello @zachmullen
Thank you, I will do so then

@mrtgenet
Copy link
Contributor Author

Done here, tested locally and works as expected for me: #3397 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants