
yafd


yafd is a (yet another) file deduplicator.

Usage

For detailed info, see USAGE or the manpage (man yafd).

The easiest way to use yafd is to pass a directory or a set of directories to it.

jason@io ~ yafd .

It can recurse as well.

jason@io ~ yafd -r .

Another easy way to use yafd is to pass the files to check as arguments. Shell globbing helps.

jason@io ~ yafd **/*.c **/*.h

You can also pipe paths to yafd via stdin. This makes it easy to limit the set of files to check.

jason@io ~ find . -size +1M | yafd

The output can also be piped to other commands to act on the duplicate files.

jason@io ~ find /usr/src -size +1M | yafd | xargs du -b | awk '{ x+=$1; } END { print x; }'
12659698

Performance

yafd is not yet always the fastest deduplicator (see the HDD results below). If performance is a concern, it may be worth considering another deduplicator like rmlint. Performance can be tuned using command arguments (--bytes, --blocksize, --threads, etc.), although yafd with its defaults should be usable for most tasks.
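
For example, a run that tunes the thread count and block size might look like the following (the values shown are only illustrative; see USAGE or the manpage for the exact option semantics).

jason@io ~ yafd -r --threads 4 --blocksize 65536 .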

Here are some metrics for reference.

SSD (btrfs)

                   time       throughput      throughput (dup)
yafd               4.30s      267.88 MiB/s    175.70 MiB/s
rmlint             7.43s      155.13 MiB/s    101.74 MiB/s
fdupes             30.34s     37.99 MiB/s     24.92 MiB/s
duff               25.14s     45.86 MiB/s     30.08 MiB/s
yafd (cached)      0.61s      1.84 GiB/s      1.20 GiB/s
rmlint (cached)    2.46s      466.21 MiB/s    307.40 MiB/s
fdupes (cached)    12.27s     93.94 MiB/s     61.61 MiB/s
duff (cached)      6.51s      176.17 MiB/s    116.12 MiB/s

HDD (ext4)

                   time       throughput      throughput (dup)
yafd               1087.59s   1.05 MiB/s      711.99 KiB/s
rmlint             65.03s     163.46 MiB/s    107.21 MiB/s
fdupes             322.57s    3.57 MiB/s      2.34 MiB/s
duff               954.70s    1.20 MiB/s      811.10 KiB/s
yafd (cached)      7.05s      163.46 MiB/s    107.21 MiB/s
rmlint (cached)    2.84s      406.37 MiB/s    266.53 MiB/s
fdupes (cached)    12.44s     92.64 MiB/s     60.76 MiB/s
duff (cached)      6.56s      175.76 MiB/s    115.28 MiB/s

NFS (v4)

                   time       throughput      throughput (dup)
yafd               197.08s    5.85 MiB/s      3.83 MiB/s
rmlint             461.26s    2.49 MiB/s      1.63 MiB/s
fdupes             648.24s    1.77 MiB/s      1.16 MiB/s
duff               466.69s    2.47 MiB/s      1.62 MiB/s
yafd (cached)      95.04s     12.13 MiB/s     7.95 MiB/s
rmlint (cached)    423.90s    2.71 MiB/s      1.78 MiB/s
fdupes (cached)    611.19s    1.88 MiB/s      1.23 MiB/s
duff (cached)      403.72s    2.85 MiB/s      1.87 MiB/s

(1) The Linux kernel sources (4.3, 4.4) were searched for identical files.

(2) For an equivalent comparison, the following command arguments were used:

yafd --recurse --zero
rmlint --algorithm=paranoid --hidden -o fdupes:stdout
fdupes --recurse
duff -rpta -f#

(3) Linux 4.4.0 and an Intel Ivy Bridge CPU (i7-3632QM) were used for the benchmarks.

Install

You can download a copy of the source, or you can clone the repository using git.

jason@io ~ git clone git://github.com/uxcn/yafd.git

It's a good idea to check out a specific release.

jason@io ~/yafd git checkout v0.1

In the project directory, run the autoconf script.

jason@io ~/yafd ./autoconf.sh CFLAGS='-march=native -mtune=native -O2'

Specifying the architecture allows algorithms that rely on architecture-specific implementations to be used. The easiest way to do this is usually -march=native. You can also enable instruction sets explicitly via autoconf.

jason@io ~/yafd ./autoconf.sh --enable-sse4_2

To install to a directory other than /usr/local, you can manually configure the prefix. If you do, make sure your PATH and MANPATH are set correctly.

jason@io ~/yafd ./autoconf.sh --prefix=$HOME
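
For example, with --prefix=$HOME and the standard autoconf install layout (an assumption; adjust the paths if your layout differs), the following would make the installed binary and manpage visible.

jason@io ~ export PATH=$HOME/bin:$PATH
jason@io ~ export MANPATH=$HOME/share/man:$MANPATH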

Run make install to compile and install.

jason@io ~/yafd make install

Currently, yafd compiles and is tested on Linux, FreeBSD, OS X, and Windows, although patches and pull requests for other platforms are definitely welcome.

Versions

0.1 - alpha release

FAQ

Why write another file deduplicator?

A lot of the current ones were more complicated than I wanted, didn't perform well, or weren't portable.

Why doesn't yafd do X?

Most likely nobody has asked for X yet. If you think something's missing, send a feature request or, even better, a pull request.

How does yafd work?

The basic algorithm is to group files by size, compute a hash over a small (random) chunk of each file, and then compare the files that have the same hash. This is a bit of an oversimplification though. For a better understanding, it may help to try reading the code.
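
As a rough illustration, here is a simplified Python sketch of that idea (illustrative only, not yafd's actual implementation; it hashes the first bytes of each file rather than a random chunk, and the chunk size is an arbitrary choice).

import hashlib
import os
import sys
from collections import defaultdict
from filecmp import cmp

CHUNK = 4096  # illustrative chunk size

def find_duplicates(paths):
    by_size = defaultdict(list)  # 1) group candidate files by size
    for path in paths:
        if os.path.isfile(path):
            by_size[os.path.getsize(path)].append(path)
    for files in by_size.values():
        if len(files) < 2:
            continue
        by_hash = defaultdict(list)  # 2) hash a small chunk of each file
        for path in files:
            with open(path, 'rb') as f:
                by_hash[hashlib.sha1(f.read(CHUNK)).digest()].append(path)
        for candidates in by_hash.values():  # 3) byte-compare hash collisions
            remaining = candidates
            while len(remaining) > 1:
                first, rest = remaining[0], remaining[1:]
                dups = [p for p in rest if cmp(first, p, shallow=False)]
                if dups:
                    yield [first] + dups
                remaining = [p for p in rest if p not in dups]

if __name__ == '__main__':
    for group in find_duplicates(sys.argv[1:]):
        print('\n'.join(group), end='\n\n')

Passing a file list on the command line (for example, python3 dedup_sketch.py **/*.c) prints groups of identical files, one blank-line-separated group per set of duplicates.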

other deduplicators