Skip to content
/ ncd Public

This C++ application lets you calculate and plot the Normalized Compression Distance between a set of files which is more awesome than it might sound

License

Notifications You must be signed in to change notification settings

MKleusberg/ncd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Normalized Compression Distance

What is this?

The NCD (Normalized Compression Distance) algorithm lets you calculate the similarities between a set of files so you can plot them as a hierarchical cluster. The resulting image lets you easily detect relationships in your data.

For example?

Say, you have a set of DNA sequences. Coincidentally there is one in this directory. In an ideal world you could now analyse the information stored in these sequences and decide about which species are related on this basis. Unfortunately though, it's not possible to calculate this so-called information distance. However, there is an approximation, the normalised compression distance which estimates the information distance by using a compression algorithm on the files. You might want to read the Wikipedia article for more details.

This is what this program does. The result is a text file containing the similarities in a matrix, a huge pile of numbers. This data can then be plotted using a tool like R. If you don't know R, it might be best to start by having a look at a graphical frontend like RStudio.

This should give you an image like this one: DNA example

Is it any good?

Yes.

How do you build it?

  • Setup build environment. I'm using LLVM/Clang but g++ works just as fine
  • Install dependencies (LZMA library and Boost filesystem library)
  • make all
  • make run

How can I run run it?

$ ./cluster input-dir output-file

This reads all files in the specified input directory and performs an NCD calculation on them. The result are written to the specified output file which is overwritten without warning if it already exists.

You can then use the included plot.R file to plot the NCD matrix.

Any prebuilt executables available?

No.

More questions?

Feel free to just open a new issue.

About

This C++ application lets you calculate and plot the Normalized Compression Distance between a set of files which is more awesome than it might sound

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published