Introduction

"simpleDDD" is short for "simple Duplicate Document Detection". This program is a simple implementation of the model described in the paper "Finding similar files in large document repositories". For the details of the algorithm, see that paper.

System design

Environment requirement

Platform:

Linux (originally on Ubuntu 8.04)

Dependencies:

Tokyo Cabinet: how to install
the filesystem module of Boost 1.39: how to install
gcc, g++: change the makefile to match the version of gcc you use, i.e. change the "43" in "-lboost_filesystem-gcc43-mt" to the version number of your gcc ("43" stands for gcc 4.3)
python2.5 and python-dev
libssl-dev

# Parameters

There are 11 main parameters that you can adjust for your application. They are defined in the file "glo.h".

THRESHOLD

The threshold for duplicate documents. It determines how strict the definition of "duplicate" is.
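
To make the role of THRESHOLD concrete, here is a minimal C++ sketch. It assumes that each document is represented by the set of its chunk hashes, and that similarity is the share of the smaller document's hashes that also appear in the other document. The functions `similarity` and `isDuplicate`, the use of strings as hashes, and the 0-to-1 scale of the threshold are all assumptions made for illustration. The measure simpleDDD actually uses is defined in the paper and in the source code.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical similarity measure (an assumption, not taken from simpleDDD):
// number of shared chunk hashes divided by the size of the smaller set.
double similarity(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() || b.empty()) return 0.0;
    std::size_t shared = 0;
    for (const std::string& h : a) {
        if (b.count(h) != 0) ++shared;  // hash occurs in both documents
    }
    const std::size_t smaller = a.size() < b.size() ? a.size() : b.size();
    return static_cast<double>(shared) / static_cast<double>(smaller);
}

// Two documents count as duplicates once their similarity reaches the threshold.
// A higher threshold means a stricter notion of "duplicate".
bool isDuplicate(const std::set<std::string>& a, const std::set<std::string>& b,
                 double threshold) {
    assert(threshold >= 0.0 && threshold <= 1.0);
    return similarity(a, b) >= threshold;
}
```

In this sketch, two documents that share 3 of 4 chunk hashes are reported as duplicates with a threshold of 0.7, but not with a threshold of 0.8.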

resultFileName

As the name indicates, it defines the name of the result file.

TMIN and TMAX

They are used in the TTTD (Two Thresholds, Two Divisors) algorithm. They define the minimum and maximum chunk size respectively. They are crucial parameters, because they determine the granularity of the chunking. The larger these two parameters are, the lower the precision and the faster the program runs.

D and DDASH

They are the two divisors used in the TTTD algorithm. According to experiments, they have little impact on the result.
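
The interplay of TMIN, TMAX, D and DDASH can be sketched as follows. This is a minimal C++ illustration of TTTD-style chunking, not the code in this repository. The `TttdParams` struct and the `tttdChunk` function are hypothetical names, and the window hash is a plain byte sum chosen for readability (a real implementation would more likely use a Rabin-style fingerprint).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bundle of the chunking parameters from glo.h.
struct TttdParams {
    std::size_t tmin;     // TMIN: minimum chunk size
    std::size_t tmax;     // TMAX: maximum chunk size
    std::uint32_t d;      // D: main divisor
    std::uint32_t ddash;  // DDASH: backup divisor
    std::size_t window;   // SLIDING_WINDOW_SIZE
};

// Returns the end offset (exclusive) of every chunk in `data`.
std::vector<std::size_t> tttdChunk(const std::string& data, const TttdParams& p) {
    assert(p.tmin > 0 && p.tmin <= p.tmax);
    assert(p.d > 0 && p.ddash > 0 && p.window > 0);

    std::vector<std::size_t> ends;
    std::size_t start = 0;   // first byte of the current chunk
    std::size_t backup = 0;  // backup breakpoint, 0 means "none yet"
    std::uint32_t hash = 0;  // placeholder hash: sum of the bytes in the window

    for (std::size_t i = 0; i < data.size(); ++i) {
        // Slide the window: add the new byte, drop the one that fell out.
        hash += static_cast<unsigned char>(data[i]);
        if (i >= p.window) hash -= static_cast<unsigned char>(data[i - p.window]);

        const std::size_t len = i + 1 - start;  // chunk length including byte i
        if (len < p.tmin) continue;             // still below the minimum size

        // The backup divisor marks a fallback breakpoint for later use.
        if (hash % p.ddash == p.ddash - 1) backup = i + 1;

        std::size_t cut = 0;
        if (hash % p.d == p.d - 1) {
            cut = i + 1;                           // main divisor matched
        } else if (len >= p.tmax) {
            cut = (backup != 0) ? backup : i + 1;  // forced cut at the maximum size
        }
        if (cut != 0) {
            ends.push_back(cut);
            start = cut;
            backup = 0;
        }
    }
    if (start < data.size()) ends.push_back(data.size());  // trailing chunk
    return ends;
}
```

For example, on 20 bytes of 'a' followed by 80 bytes of 'b', with TMIN 10, TMAX 30, D 1000, DDASH 389 and a window of 4, the main divisor never matches. The first chunk reaches the maximum size and is cut at the backup breakpoint at offset 20, and the following chunks are cut at the maximum size.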

ENABLE_FITTING

It defines whether the program checks if the metadata can fit in memory. If your data is so large that the metadata file cannot fit in memory, the data must be partitioned. If the data is small, partitioning is unnecessary (and it costs more time). If this flag is false, the program always partitions.
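
The decision described above can be sketched in C++ as follows. The function name `shouldPartition` and its byte-count arguments are hypothetical; how simpleDDD measures the metadata size and the available memory is not shown here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the ENABLE_FITTING description:
// decide whether the data has to be partitioned.
bool shouldPartition(bool enableFitting, std::size_t metadataBytes,
                     std::size_t memoryBudgetBytes) {
    assert(memoryBudgetBytes > 0);  // a zero budget would make the check meaningless
    if (!enableFitting) return true;           // no fit check: always partition
    return metadataBytes > memoryBudgetBytes;  // partition only if it does not fit
}
```

With the flag disabled, the sketch partitions regardless of size. With the flag enabled, it partitions only when the metadata exceeds the memory budget.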

SLIDING_WINDOW_SIZE

The size of the sliding window in TTTD. The smaller it is, the lower the precision and the faster the program runs.

DEBUG_LEVEL

See the file "debug.h".

MAX_FILE_SIZE

The maximum size of a sub-problem file. The default is 512*1024.

sourceDir

The name of the directory where the corpus is located.

Running

Install all the dependencies.
Adjust any parameters that need changing, then run ./bat.sh in the src directory.
Run ./main.

Debugging

Change the value of the parameter "DEBUG_LEVEL" and run ./bat_debug.sh to compile. Then you can debug in one of the following ways:

Set "DEBUG_LEVEL"; different levels of debug information are printed according to its value. (Redirecting the output to a file is recommended, because the debug output can be massive.)
Run gdb main.
Profiling: run ./main. After the program ends, run gprof main gmon.out.
