
teloon/simpleDDD


Introduction

"simpleDDD" is short for "simple Duplicate Document Detection". This program is a simple implementation of the model described in the paper "Finding similar files in large document repositories". For the details of the algorithm, see that paper.

System design

[Figure: system design diagram]

Environment requirements

Platform:

Linux (originally developed on Ubuntu 8.04)

Dependencies:

  1. Tokyo Cabinet (how to install)
  2. the filesystem module of Boost 1.39 (how to install)
  3. gcc, g++: edit the makefile to match the version of gcc you use, i.e. replace the "43" in "-lboost_filesystem-gcc43-mt" with the version number of your gcc ("43" corresponds to gcc 4.3)
  4. python2.5 and python-dev
  5. libssl-dev

Parameters

There are 11 main parameters that you can adjust to suit your application. They are defined in the file "glo.h".

THRESHOLD

The threshold for treating documents as duplicates. It determines how strict the definition of "duplicate" is.
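As a rough illustration of how such a threshold can be applied, the sketch below compares two documents by the fraction of chunk hashes they share. This is an assumption for illustration only: the names `sharedFraction` and `isDuplicate` are hypothetical, the threshold is assumed to lie in [0, 1], and the actual program (and the paper) may measure similarity differently, for example by weighting chunks by their length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper: fraction of a's distinct chunk hashes that also occur in b.
// Returns 0.0 when a has no chunks.
double sharedFraction(const std::set<std::uint64_t>& a,
                      const std::set<std::uint64_t>& b) {
    if (a.empty()) return 0.0;
    std::size_t shared = 0;
    for (std::uint64_t h : a) {
        if (b.count(h) != 0) ++shared;
    }
    return static_cast<double>(shared) / static_cast<double>(a.size());
}

// Hypothetical decision rule: two documents count as duplicates when each
// shares at least `threshold` of its chunk hashes with the other.
// `threshold` plays the role of THRESHOLD and is assumed to be in [0, 1].
bool isDuplicate(const std::set<std::uint64_t>& a,
                 const std::set<std::uint64_t>& b,
                 double threshold) {
    assert(threshold >= 0.0 && threshold <= 1.0);
    return sharedFraction(a, b) >= threshold &&
           sharedFraction(b, a) >= threshold;
}
```

Under this sketch, a higher threshold means fewer pairs are reported, which matches the idea of a stricter "duplicate".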

resultFileName

As the name indicates, it defines the name of the result file.

TMIN and TMAX

They are used in the TTTD (Two Thresholds, Two Divisors) chunking algorithm, where they define the minimum and maximum chunk size respectively. They are crucial parameters because they determine the granularity of the chunking. The larger these two parameters are, the coarser the chunks: precision drops, but the program runs faster.

D and DDASH

They are the two divisors used in the TTTD algorithm. According to experiments, they have little impact on the result.

ENABLE_FITTING

It defines whether the program checks if the metadata can fit in memory. If your data is so large that the metadata file cannot fit in memory, the problem must be partitioned. If the data is small, partitioning is unnecessary (and it costs extra time). If this flag is false, the program partitions all the time.
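The decision described above can be sketched as a small predicate. This is a minimal sketch, not the program's actual code: the function name `shouldPartition` and the `memoryBudgetBytes` argument are hypothetical, and how the real program measures available memory is not stated here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the ENABLE_FITTING decision.
// enableFitting     : value of the ENABLE_FITTING flag
// metadataBytes     : size of the metadata file
// memoryBudgetBytes : memory assumed to be available for metadata (an assumption)
bool shouldPartition(bool enableFitting,
                     std::uint64_t metadataBytes,
                     std::uint64_t memoryBudgetBytes) {
    assert(memoryBudgetBytes > 0);
    if (!enableFitting) {
        return true;  // flag off: partition all the time, no fit check
    }
    return metadataBytes > memoryBudgetBytes;  // partition only when it does not fit
}
```

With the flag on, small inputs skip partitioning and avoid its extra cost; with the flag off, the fit check is never made.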

SLIDING_WINDOW_SIZE

The size of the sliding window in TTTD. A smaller window lowers precision but makes the program faster.
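To show how the TTTD parameters above interact, here is a minimal sketch of TTTD chunking. It is not the code of this program: the `TttdParams` struct, the `tttdChunk` function, and the simple polynomial rolling hash are assumptions made for illustration (the real implementation may use a different fingerprint). Only the roles of TMIN, TMAX, D, DDASH and SLIDING_WINDOW_SIZE are taken from the descriptions above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical parameter bundle; field names mirror the parameters in glo.h.
struct TttdParams {
    std::size_t tmin;     // TMIN: minimum chunk size
    std::size_t tmax;     // TMAX: maximum chunk size
    std::uint64_t d;      // D: main divisor
    std::uint64_t ddash;  // DDASH: backup divisor
    std::size_t window;   // SLIDING_WINDOW_SIZE
};

// Returns the end offset (exclusive) of every chunk in `data`.
std::vector<std::size_t> tttdChunk(const std::string& data, const TttdParams& p) {
    assert(p.tmin > 0 && p.tmin <= p.tmax);
    assert(p.d > 0 && p.ddash > 0 && p.window > 0);

    const std::uint64_t base = 257;  // assumed hash base; arithmetic wraps mod 2^64
    std::uint64_t topPow = 1;        // base^(window-1), weight of the outgoing byte
    for (std::size_t i = 1; i < p.window; ++i) topPow *= base;

    std::vector<std::size_t> cuts;
    std::size_t last = 0;    // start of the current chunk
    std::size_t backup = 0;  // backup breakpoint, 0 means "none yet"
    std::uint64_t hash = 0;

    for (std::size_t i = 0; i < data.size(); ++i) {
        // Slide the window: drop the outgoing byte, then add the new one.
        if (i >= p.window) {
            hash -= topPow * static_cast<unsigned char>(data[i - p.window]);
        }
        hash = hash * base + static_cast<unsigned char>(data[i]);

        const std::size_t len = i + 1 - last;
        if (len < p.tmin) continue;  // chunk still too small to cut

        if (hash % p.ddash == p.ddash - 1) backup = i + 1;  // remember a backup cut
        if (hash % p.d == p.d - 1) {                         // main divisor matched
            cuts.push_back(i + 1);
            last = i + 1;
            backup = 0;
            continue;
        }
        if (len < p.tmax) continue;  // keep looking for a breakpoint

        // Reached TMAX without a main match: prefer the backup cut if one exists.
        const std::size_t cut = (backup != 0) ? backup : i + 1;
        cuts.push_back(cut);
        last = cut;
        backup = 0;
    }
    if (last < data.size()) cuts.push_back(data.size());  // trailing chunk
    return cuts;
}
```

In this sketch every chunk except possibly the last has a size between `tmin` and `tmax`, so raising both values yields fewer, larger chunks.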

DEBUG_LEVEL

The level of debug output. See the file "debug.h" for details.

MAX_FILE_SIZE

The maximum size of a sub-problem file. The default is 512*1024.

sourceDir

The name of the directory where the corpus is located.

Running

  1. Install all the dependencies.

  2. Adjust any parameters that need changing. Then run ./bat.sh in the src directory.

  3. Run ./main.

Debugging

Change the value of the parameter "DEBUG_LEVEL" and run ./bat_debug.sh to compile. Then you can debug in one of the following ways:

  1. Debug output: different levels of debug information are printed according to "DEBUG_LEVEL". (Redirecting the output to a file is recommended, because the debug information can be massive.)

  2. Debugger: run gdb main.

  3. Profiling: run ./main. After the program ends, run gprof main gmon.out.

About

simple Near Duplicate Document Detection program
