Skip to content

hekmon/deduper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

deduper

Analyse 2 paths on the same file system to found identical files and hard link them to save space.

How it works

  • Indexing: both paths will be analyzed and the structure of the directories tree and their corresponding inodes mapped in memory (files & directories)
  • Then the structure of the A path will be walked and for each regular file, the mapped memory structure of path B will be analyzed to find potential candidates
    • First a list of all files in B having the exact same size of the A files analyzed will be compiled (empty files will be ignored)
    • Then this list will be pruned based on several criterias
      • Candidates in B that are already hardlinks of the reference A file will be removed from the list
      • Files that do not have the same inode metadata (ownership [uid, gid] and file mode) will be removed from the candidates list to avoid breaking potential current access to these files (as hardlinks share the same metadata by design)
        • Unless the -force flag is set, in that case candidates are kept (but will have their metadata changed once hardlinking is done)
      • For candidates that are still on the list, a SHA256 checksum will be performed to ensure they have indeed the same content as the reference file in A currently being processed
  • For candidates that have passed all the tests and are still on the candidates list:
    • if the -apply flag has been set
      • They will be removed (in order to free their path)
      • Reffile in A will be hard linked to the path that the B candidate had, making it available once again but dedupped with A this time
    • if the -apply flag has not been set
      • A reporting will be printed of what would have been done (and saved) with the flag on

Usage

Usage of ./deduper:
  -apply
        By default deduper run in dry run mode: set this flag to actually apply changes
  -debug
        Show debug logs during the analysis phase
  -dirA string
        Referential directory
  -dirB string
        Second directory to compare dirA against
  -force
        Dedup files that have the same content even if their inode metadata (ownership and mode) are not the same
  -minSize string
        Set the minimum size a file must have to be kept for analysis (ex: 100MiB)
  -workers int
        Set the maximum numbers of workers that will perform IO tasks (default 6)

Example

./deduper -minSize 10MiB -workers 8 -dirA "$(pwd)/example/dirA" -dirB "$(pwd)/example/dirB" -apply

Example GIF