
Make duplicate file assignment deterministic. #222

Open
MiningMarsh opened this issue Nov 14, 2023 · 5 comments
Comments

@MiningMarsh

I've been using phockup to organize a set of camera photos and videos, which I then copy into a flat file structure. I then synchronize both the flat and the organized directories back to the source files using rsync. Lastly, I synchronize both of these directories to my phone's camera roll with FolderSync for Android.

I've noticed that some files are re-synchronized on every single transfer. Rsync shows a sync for each of these files due to a checksum and size difference, and when I compare two example files, they differ by a single byte:

$ cmp /var/{tmp,share}/camera/20230613-174230-2.mp4
/var/tmp/camera/20230613-174230-2.mp4 /var/share/camera/20230613-174230-2.mp4 differ: byte 807014, line 3188

Tracking this down a bit more, I've noticed this only seems to happen to files that phockup detects as having the same timestamps, i.e., the ones that get a -# appended to the end of their filenames. Some of my files have different contents but identical timestamps. As far as I can tell, the order in which phockup assigns the discriminating number to these files is not stable: on one run one file might be assigned *-2, and on another run a different file might be assigned *-2.

Is it possible that the sorting method of identified duplicates could be made stable, so that in these cases, the file contents will not change each iteration? I can fix this on my end by removing duplicates, but it seems suboptimal that phockup might cause files to swap places with every iteration.
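For illustration, a stable assignment could be as simple as sorting the colliding files by source path before numbering them. This is a hypothetical sketch of the idea being requested, not phockup's actual code; the function name, the suffix convention (first file unsuffixed, duplicates starting at -2), and the path-based tie-breaker are all assumptions:

```python
import os

def assign_suffixes(target_base, ext, sources):
    """Map each source file to a stable target name.

    `sources` are files that resolved to the same timestamp-based name.
    Sorting them by source path is a hypothetical tie-breaker that makes
    the -N numbering deterministic across runs, regardless of the order
    in which the files were discovered.
    """
    names = {}
    for i, src in enumerate(sorted(sources)):
        suffix = "" if i == 0 else f"-{i + 1}"
        names[src] = f"{target_base}{suffix}{ext}"
    return names

# Example: two different videos that share a timestamp always map the
# same way, no matter which one is processed first.
print(assign_suffixes("20230613-174230", ".mp4",
                      ["/src/b.mp4", "/src/a.mp4"]))
# → {'/src/a.mp4': '20230613-174230.mp4', '/src/b.mp4': '20230613-174230-2.mp4'}
```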

@MiningMarsh MiningMarsh changed the title Pickup modifying MD5 checksum of files by modifying files. Make duplicate file assignment deterministic. Nov 14, 2023
@rob-miller
Contributor

Are you using concurrency?

@MiningMarsh
Author

I did enable concurrency, with 32 cores.

@rob-miller
Contributor

As @ivandokov has not commented, could you try without concurrency and see if the issue remains? I'm also curious how much of a speed impact that has for you.

@MiningMarsh
Author

I can't test the stability issue right at the moment: I already removed the duplicates to resolve the problem on my end, since the constant syncing was causing me trouble, so I'll need to create some test data and get back to you. I can give you the speed difference right now, though. This is on a Ryzen 5950X processor.

With 32 cores:

[2023-11-18 20:02:35] - [INFO] - Processed 1796 files in 40.08 seconds. Average Throughput: 44.81 files/second
[2023-11-18 20:02:35] - [INFO] - Copied 1796 files.

With 1 core:

[2023-11-18 20:10:47] - [INFO] - Processed 1796 files in 306.53 seconds. Average Throughput: 5.86 files/second
[2023-11-18 20:10:47] - [INFO] - Copied 1796 files.

I'll try to test the issue without concurrency tomorrow.

@ivandokov
Owner

The way the folders are traversed with concurrency is causing this issue. Unfortunately I didn't build this feature and I am not really sure how to fix it.
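One common pattern for this kind of problem is to keep the expensive per-file work concurrent but make the order-dependent step (counter assignment) serial over a sorted file list. This is a hypothetical sketch of that pattern, not phockup's actual architecture; all names here are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def process_deterministically(files, extract_base_name):
    """Hypothetical pattern: do the expensive work (EXIF reads, checksums)
    in parallel, but assign -N suffixes serially in sorted-path order, so
    the numbering no longer depends on thread scheduling or traversal order.
    """
    ordered = sorted(files)
    with ThreadPoolExecutor(max_workers=32) as pool:
        # Parallel phase: completion order is irrelevant because results
        # are keyed by path, not by arrival order.
        base_names = dict(zip(ordered, pool.map(extract_base_name, ordered)))
    # Serial phase: counters advance in a fixed, reproducible order.
    counters, result = {}, {}
    for path in ordered:
        base = base_names[path]
        n = counters.get(base, 0) + 1
        counters[base] = n
        result[path] = base if n == 1 else f"{base}-{n}"
    return result
```

The key point is that only the name-assignment loop needs to be ordered; the metadata extraction, which dominates the runtime, stays fully parallel, so the 32-core speedup reported above should be largely preserved.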
