
yafd


yafd is a (yet another) file deduplicator.

Usage

For detailed info, see USAGE or the manpage (man yafd).

The easiest way to use yafd is to pass a directory or a set of directories to it.

jason@io ~ yafd .

It can recurse as well.

jason@io ~ yafd -r .

Another easy way to use yafd is to pass the files to check as arguments. Shell globbing helps.

jason@io ~ yafd **/*.c **/*.h

You can also pipe paths to yafd via stdin. This makes it easy to limit the set of files to check.

jason@io ~ find . -size +1M | yafd

The output can also be piped to other commands to act on the duplicate files.

jason@io ~ find /usr/src -size +1M | yafd | xargs du -b | awk '{ x+=$1; } END { print x; }'
12659698

Performance

yafd is not yet always the fastest deduplicator (see the HDD results below). If performance is a concern, it may be worth considering another deduplicator like rmlint. Performance can be tuned using command arguments (--bytes, --blocksize, --threads, etc.), although yafd with its defaults should be usable for most tasks.
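
For example, a run that tunes the thread count and block size might look like the following (the values shown are only illustrative; see USAGE or the manpage for the exact option semantics).

jason@io ~ yafd -r --threads 4 --blocksize 65536 .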

Here are some metrics for reference.

SSD (btrfs)

                   time       throughput      throughput (dup)
yafd               4.30s      267.88 MiB/s    175.70 MiB/s
rmlint             7.43s      155.13 MiB/s    101.74 MiB/s
fdupes             30.34s     37.99 MiB/s     24.92 MiB/s
duff               25.14s     45.86 MiB/s     30.08 MiB/s
yafd (cached)      0.61s      1.84 GiB/s      1.20 GiB/s
rmlint (cached)    2.46s      466.21 MiB/s    307.40 MiB/s
fdupes (cached)    12.27s     93.94 MiB/s     61.61 MiB/s
duff (cached)      6.51s      176.17 MiB/s    116.12 MiB/s

HDD (ext4)

                   time       throughput      throughput (dup)
yafd               1087.59s   1.05 MiB/s      711.99 KiB/s
rmlint             65.03s     163.46 MiB/s    107.21 MiB/s
fdupes             322.57s    3.57 MiB/s      2.34 MiB/s
duff               954.70s    1.20 MiB/s      811.10 KiB/s
yafd (cached)      7.05s      163.46 MiB/s    107.21 MiB/s
rmlint (cached)    2.84s      406.37 MiB/s    266.53 MiB/s
fdupes (cached)    12.44s     92.64 MiB/s     60.76 MiB/s
duff (cached)      6.56s      175.76 MiB/s    115.28 MiB/s

NFS (v4)

                   time       throughput      throughput (dup)
yafd               197.08s    5.85 MiB/s      3.83 MiB/s
rmlint             461.26s    2.49 MiB/s      1.63 MiB/s
fdupes             648.24s    1.77 MiB/s      1.16 MiB/s
duff               466.69s    2.47 MiB/s      1.62 MiB/s
yafd (cached)      95.04s     12.13 MiB/s     7.95 MiB/s
rmlint (cached)    423.90s    2.71 MiB/s      1.78 MiB/s
fdupes (cached)    611.19s    1.88 MiB/s      1.23 MiB/s
duff (cached)      403.72s    2.85 MiB/s      1.87 MiB/s

(1) The Linux kernel sources (4.3, 4.4) were searched for identical files.

(2) For an equivalent comparison, the following command arguments were used:

yafd --recurse --zero
rmlint --algorithm=paranoid --hidden -o fdupes:stdout
fdupes --recurse
duff -rpta -f#

(3) Linux 4.4.0 and an Intel Ivy Bridge CPU (i7-3632QM) were used for the benchmarks.

Install

You can download a copy of the source, or you can clone the repository using git.

jason@io ~ git clone git://github.com/uxcn/yafd.git

It's a good idea to check out a specific release.

jason@io ~/yafd git checkout v0.1

In the project directory, run the autoconf script.

jason@io ~/yafd ./autoconf.sh CFLAGS='-march=native -mtune=native -O2'

Specifying the architecture allows algorithms that rely on architecture-specific implementations to be used. The easiest way to do this is usually -march=native. You can also enable instruction sets explicitly via autoconf.

jason@io ~/yafd ./autoconf.sh --enable-sse4_2

To install to a directory other than /usr/local, you can manually configure the prefix. If you do, make sure your PATH and MANPATH are set correctly.

jason@io ~/yafd ./autoconf.sh --prefix=$HOME
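
For example, with --prefix=$HOME and the standard autoconf install layout (an assumption; adjust the paths if your layout differs), the following would make the installed binary and manpage visible.

jason@io ~ export PATH=$HOME/bin:$PATH
jason@io ~ export MANPATH=$HOME/share/man:$MANPATH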

Run make install to compile and install.

jason@io ~/yafd make install

Currently, yafd compiles and is tested on Linux, FreeBSD, OS X, and Windows, although patches and pull requests for other platforms are definitely welcome.

Versions

0.1 - alpha release

FAQ

Why write another file deduplicator?

A lot of the current ones were more complicated than I wanted, didn't perform well, or weren't portable.

Why doesn't yafd do X?

Most likely nobody has asked for X yet. If you think something's missing, send a feature request or, even better, a pull request.

How does yafd work?

The basic algorithm is to group files by size, compute a hash over a small (random) chunk of each file, and then compare the files that have the same hash. This is a bit of an oversimplification though. For a better understanding, it may help to try reading the code.
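
As a rough illustration, here is a simplified Python sketch of that idea (illustrative only, not yafd's actual implementation; it hashes the first bytes of each file rather than a random chunk, and the chunk size is an arbitrary choice).

import hashlib
import os
import sys
from collections import defaultdict
from filecmp import cmp

CHUNK = 4096  # illustrative chunk size

def find_duplicates(paths):
    by_size = defaultdict(list)  # 1) group candidate files by size
    for path in paths:
        if os.path.isfile(path):
            by_size[os.path.getsize(path)].append(path)
    for files in by_size.values():
        if len(files) < 2:
            continue
        by_hash = defaultdict(list)  # 2) hash a small chunk of each file
        for path in files:
            with open(path, 'rb') as f:
                by_hash[hashlib.sha1(f.read(CHUNK)).digest()].append(path)
        for candidates in by_hash.values():  # 3) byte-compare hash collisions
            remaining = candidates
            while len(remaining) > 1:
                first, rest = remaining[0], remaining[1:]
                dups = [p for p in rest if cmp(first, p, shallow=False)]
                if dups:
                    yield [first] + dups
                remaining = [p for p in rest if p not in dups]

if __name__ == '__main__':
    for group in find_duplicates(sys.argv[1:]):
        print('\n'.join(group), end='\n\n')

Passing a file list on the command line (for example, python3 dedup_sketch.py **/*.c) prints groups of identical files, one blank-line-separated group per set of duplicates.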

other deduplicators