deduper

Analyse two paths on the same file system to find identical files and hard link them to save space.

How it works

Indexing: both paths will be analyzed, and the structure of their directory trees and the corresponding inodes (files and directories) will be mapped in memory
Then the directory tree of path A will be walked and, for each regular file, the in-memory index of path B will be searched for potential candidates
- First, a list of all files in B having exactly the same size as the A file being analyzed will be compiled (empty files are ignored)
- Then this list will be pruned based on several criteria
  - Candidates in B that are already hard links of the reference A file will be removed from the list
  - Files that do not have the same inode metadata (ownership [uid, gid] and file mode) will be removed from the candidate list to avoid breaking existing access to these files (hard links share the same metadata by design)
    - Unless the -force flag is set, in which case these candidates are kept (but will have their metadata changed once hard linking is done)
  - For candidates that are still on the list, a SHA256 checksum will be computed to ensure they do have the same content as the reference file in A currently being processed
For candidates that have passed all the tests and are still on the candidate list:
- if the -apply flag has been set
  - They will be removed (in order to free their path)
  - The reference file in A will be hard linked to the path the B candidate had, making it available once again but deduplicated with A this time
- if the -apply flag has not been set
  - A report will be printed of what would have been done (and how much space would have been saved) with the flag on

Usage

Usage of ./deduper:
  -apply
        By default deduper run in dry run mode: set this flag to actually apply changes
  -debug
        Show debug logs during the analysis phase
  -dirA string
        Referential directory
  -dirB string
        Second directory to compare dirA against
  -force
        Dedup files that have the same content even if their inode metadata (ownership and mode) are not the same
  -minSize string
        Set the minimum size a file must have to be kept for analysis (ex: 100MiB)
  -workers int
        Set the maximum numbers of workers that will perform IO tasks (default 6)
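
The -minSize value is given as a string such as 100MiB. How the tool parses it is not described here; the sketch below is one possible approach under the assumption that only binary (IEC) suffixes and whole numbers are accepted. `parseSize` and the `units` table are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// units lists the accepted suffixes, longest first so that "B" does not
// match before "MiB" does.
var units = []struct {
	suffix string
	factor int64
}{
	{"TiB", 1 << 40},
	{"GiB", 1 << 30},
	{"MiB", 1 << 20},
	{"KiB", 1 << 10},
	{"B", 1},
}

// parseSize converts strings like "100MiB" into a number of bytes.
func parseSize(s string) (int64, error) {
	s = strings.TrimSpace(s)
	for _, u := range units {
		if !strings.HasSuffix(s, u.suffix) {
			continue
		}
		num := strings.TrimSpace(strings.TrimSuffix(s, u.suffix))
		n, err := strconv.ParseInt(num, 10, 64)
		if err != nil || n < 0 {
			return 0, fmt.Errorf("invalid size %q", s)
		}
		return n * u.factor, nil
	}
	return 0, fmt.Errorf("invalid size %q: missing unit", s)
}

func main() {
	n, err := parseSize("100MiB")
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 104857600
}
```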

Example

./deduper -minSize 10MiB -workers 8 -dirA "$(pwd)/example/dirA" -dirB "$(pwd)/example/dirB" -apply
