
Move attached files between assetstores #3467

Open
manthey opened this issue Sep 15, 2023 · 3 comments
@manthey (Member) commented Sep 15, 2023

We have a function to move files between assetstores, but it doesn't support moving attached files, only files that are owned directly by an item.

Probably the correct behavior is to modify the upload record in `moveFileToAssetstore` to reflect the attached status.

Further, there is a major performance penalty when moving files because we build up each upload chunk by repeatedly concatenating the small downloaded blocks. We should instead keep a list of these blocks and join them once per upload chunk rather than iteratively concatenating binary strings.
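To make the cost concrete, here is a standalone sketch (plain Python, not Girder code; the block and chunk counts are illustrative) contrasting repeated concatenation with a single join:

```python
# Illustrative sizes only: ~64 KB download blocks, 512 blocks per upload chunk.
BLOCK = b'x' * 65536
N = 512


def concat_repeatedly():
    # Each += allocates a new bytes object and copies everything accumulated
    # so far, so the total bytes copied grows quadratically with N.
    chunk = b''
    for _ in range(N):
        chunk += BLOCK
    return chunk


def join_once():
    # Collect references in a list, then copy each block exactly once.
    chunks = []
    for _ in range(N):
        chunks.append(BLOCK)
    return b''.join(chunks)
```

Both produce identical output; only the amount of memory traffic differs.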

@willdunklin (Contributor) commented

To quote my comment from the import-tracker plugin PR (DigitalSlideArchive/import-tracker#19 (comment)):

That looks good. I've been able to essentially recreate a version of `moveFileToAssetstore` for attached files. Using the lower-level APIs also exposes control over the file metadata, so maintaining the `created` field is handled automatically (by the `upload.update(...)` line).

```python
import io

from girder.models.file import File
from girder.models.item import Item
from girder.models.upload import Upload


def move_meta_file(file, assetstore):
    parent = Item().findOne({'_id': file['attachedToId']})

    # Read the whole file into memory via the download generator.
    chunk = None
    try:
        for data in File().download(file, headers=False)():
            if chunk is not None:
                chunk += data
            else:
                chunk = data
    except Exception as e:
        return {'error': f'Exception downloading file: {e}'}

    # Re-upload into the destination assetstore, keeping the attachment.
    upload = Upload().uploadFromFile(
        obj=io.BytesIO(chunk), size=file['size'], name=file['name'],
        parentType=file['attachedToType'], parent=parent,
        mimeType=file['mimeType'], attachParent=True, assetstore=assetstore)
    # Carry over the original file's metadata (e.g. the created field).
    upload.update({k: v for k, v in file.items() if k != 'assetstoreId'})
    upload = File().save(upload)

    return upload
```

This is modeled after the large_image plugin's code for manipulating attached files: https://github.com/girder/large_image/blob/master/girder/girder_large_image/rest/tiles.py#L1538-L1547

I believe this version addresses the chunking issue you mentioned, correct? Or am I misinterpreting the bottleneck at hand here?

@manthey (Member, Author) commented Sep 18, 2023

The chunking issue is the concatenation in this loop:

```python
chunk = None
try:
    for data in File().download(file, headers=False)():
        if chunk is not None:
            chunk += data
        else:
            chunk = data
```

Specifically, the download iterator yields something on the order of 64 KB at a time. We don't want to read the whole file before uploading (since it could be arbitrarily large), but the Girder code defaults to an upload chunk size of 32 MB, which means we might concatenate blocks 512 times. Each concatenation creates a new bytes object and copies all the data accumulated so far (and is slow about it). Better would be to collect the blocks in a list:

```python
chunks = []
for data in File().download(file, headers=False)():
    chunks.append(data)
    if sum(len(chunk) for chunk in chunks) >= chunkSize:
        uploadchunk = b''.join(chunks)
        upload = self.handleChunk(
            upload, RequestBodyStream(io.BytesIO(uploadchunk), len(uploadchunk)))
        progress.update(increment=len(uploadchunk))
        chunks = []

if len(chunks):
    uploadchunk = b''.join(chunks)
    upload = self.handleChunk(
        upload, RequestBodyStream(io.BytesIO(uploadchunk), len(uploadchunk)))
    progress.update(increment=len(uploadchunk))
```

Though the repeated code bothers me, and it might be better to keep a running tally of lengths rather than calling `sum` on every block.
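One way to address both points, the repeated flush code and the per-block `sum`, is a small generator that batches blocks by a running byte count (a hypothetical `batch_chunks` helper, sketched here outside of Girder):

```python
def batch_chunks(source, chunk_size):
    """Yield joined batches of at least chunk_size bytes from source.

    source is any iterable of bytes blocks (e.g. a download generator).
    A final short batch is yielded if any data remains.
    """
    chunks = []
    total = 0  # running tally instead of calling sum() per block
    for data in source:
        chunks.append(data)
        total += len(data)
        if total >= chunk_size:
            yield b''.join(chunks)
            chunks = []
            total = 0
    if chunks:
        yield b''.join(chunks)
```

With this, the upload loop collapses to a single `for uploadchunk in batch_chunks(...)` body containing one `handleChunk` call.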

@willdunklin (Contributor) commented

Got it. I made a PR with a modified version of your code for the speedup, and it performs noticeably better in comparison. I'm now working on handling attached files on top of those changes.
