"simpleDDD" is short for "simple Duplicate Document Detectio". This program is a simple implementation of the model in paper "Finding similar files in large document repositories". For the details of this algorithm, you can read that paper.
Platform:
Linux (originally developed on Ubuntu 8.04)
Dependencies:
- Tokyo Cabinet
- the filesystem module of Boost 1.39
- gcc, g++: change the makefile to match the version of gcc you use, i.e. replace the "43" in "-lboost_filesystem-gcc43-mt" with the version number of your gcc ("43" corresponds to gcc 4.3)
- Python 2.5 and python-dev
- libssl-dev
#Parameters
There are 11 main parameters you can adjust to suit your application. They are defined in the file "glo.h".
The threshold for duplicate documents. It determines how strictly "duplicate" is defined.
As the name indicates, it defines the name of the result file.
They are used in the TTTD algorithm and define the minimum and maximum chunk size respectively. They are crucial parameters because they determine the granularity of the chunking: the larger these two parameters are, the lower the precision and the faster the run.
They are the two divisors used in the TTTD algorithm. According to experiments, they have little impact on the result.
It defines whether or not to check "can the metadata fit in memory". If your data is so large that the metadata file cannot fit in memory, you need to partition it. If the data is small, you do not need to partition (partitioning costs more time). If this flag is false, the program always partitions.
The size of the sliding window in TTTD. The smaller it is, the lower the precision and the faster the run.
See the file "debug.h".
The maximum size of the sub-problem file. The default is 512*1024.
The name of the directory where the corpus is located.
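To make the interaction between the chunking parameters and the duplicate threshold more concrete, here is a minimal Python 3 sketch of TTTD-style chunking followed by a chunk-overlap similarity score. It is not the code of simpleDDD: the names t_min, t_max, d_main, d_backup, window and threshold are hypothetical stand-ins for the settings in "glo.h", the rolling hash is a simple polynomial hash chosen for illustration, and the similarity measure (fraction of bytes in shared chunks) is an assumption rather than the exact measure used by the program or the paper.

```python
import hashlib


def tttd_chunks(data, t_min=64, t_max=512, d_main=97, d_backup=47, window=16):
    """Split bytes into chunks with a TTTD-style breakpoint search.

    All parameter names and defaults are illustrative, not taken from glo.h.
    """
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks = []
    start = 0      # index of the first byte of the current chunk
    backup = -1    # last position that matched the backup divisor
    h = 0          # rolling hash over the last `window` bytes
    for pos, byte in enumerate(data):
        if pos >= window:
            h = (h - data[pos - window] * top) % mod
        h = (h * base + byte) % mod
        length = pos - start + 1
        if length < t_min:
            continue                      # never cut below the minimum size
        if h % d_backup == d_backup - 1:
            backup = pos                  # remember a fallback breakpoint
        if h % d_main == d_main - 1:
            cut = pos                     # main divisor matched: cut here
        elif length >= t_max:
            cut = backup if backup >= 0 else pos  # forced cut at the maximum
        else:
            continue
        chunks.append(data[start:cut + 1])
        start = cut + 1
        backup = -1
    if start < len(data):
        chunks.append(data[start:])       # trailing chunk may be short
    return chunks


def similarity(a, b, **params):
    """Fraction of a's bytes that lie in chunks whose digest also occurs in b."""
    digests_b = {hashlib.sha1(c).digest() for c in tttd_chunks(b, **params)}
    shared = sum(len(c) for c in tttd_chunks(a, **params)
                 if hashlib.sha1(c).digest() in digests_b)
    return shared / len(a) if a else 0.0


def is_duplicate(a, b, threshold=0.8, **params):
    """Apply a hypothetical duplicate threshold to the similarity score."""
    return similarity(a, b, **params) >= threshold
```

In this sketch, raising t_min and t_max produces fewer and larger chunks, so there are fewer digests to compare but a small edit invalidates more bytes, which is the precision/speed trade-off described above. Raising threshold makes the "duplicate" decision stricter.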
- Install all the dependencies.
- Adjust any parameters that need changing, then run ./bat.sh in the src directory.
- Run ./main.
Change the value of the parameter "DEBUG_LEVEL" and run ./bat_debug.sh to compile. Then you can debug in one of the following ways:
- Set "DEBUG_LEVEL"; different levels of debug information will be printed according to its value. (Redirecting the output to a file is recommended, because the debug info can be massive.)
- Run gdb main.
- Profiling: run ./main. After the program ends, run gprof main gmon.out.