Hash Tree Data Backup

About

In this project, I have implemented an efficient backup system using hash trees. Since directories in all operating systems are implemented in a tree-like structure, it is possible to store the state of a file system using a hash tree. By using a hash tree, we can easily find which files have been modified by just checking the hashes of the tree in a top-down manner.

The project has three important parts:

Construction of Hash Tree
Comparing Hash Trees (to find modified files)
Backup The Changes

Construction of Hash (Merkle) Tree

A Merkle tree is a generalisation of the hash list and is a hash-based data structure. It is a tree structure in which each leaf node is a hash of a block of data, and each non-leaf node is a hash of its children. Generally, Merkle trees have a branching factor of 2, where each node has a maximum of 2 children. However, an n-ary hash tree is required here for the file system.

The hash-tree folder consists of the implementation of the hash tree class which build a hash tree. It can be built using two ways:

From file path - Build a hash tree of data with the root as the input file path
From text file - Build the hash tree data structure from an existing hash tree (stored in a text file as an adjacency list)

The hash tree is built from the file path using a DFS-like approach. I made use of OpenMP tasks to parallelise the building process so that different threads can handle different branches (and nodes) of the tree simultaneously. The SHA256 implementation from the C++ hash-library is used to compute the hash of data. The hash tree is stored as an adjacency list using STL's map data structure.

To build a hash tree from a data/file path. Run the following code:

HashTree *ht_data = new HashTree(dataPath, false);
ht_data->print_metadata();

The hash tree can be stored/saved as a text file in the form of an adjancency list. Below is the format of the text file:

src1|src1_hash|dest1|dest1_hash|dest2|dest2_hash| ... |destn|destn_hash 
src2|src2_hash|dest1|dest1_hash|dest2|dest2_hash| ... |destn|destn_hash 
. 
. 
. 
srcm|srcm_hash|dest1|dest1_hash|dest2|dest2_hash| ... |destn|destn_hash

To build the hash tree from the text file. Run the following code:

HashTree *ht_backup = new HashTree(backupPath+"hashtree.txt", true);
ht_backup->print_metadata();

Comparing Hash Trees

Now, say we have the hash tree of a previous backup and the hash tree of the current system. It is possible to find what new changes have been made by simply comparing the hash trees. This way we can efficiently find which files have been modified and only backup those changes. This is much better that making a completely new copy of all the contents of the system. The hash tree of the backup can be retrieved from the text file and the current hash tree can be built directly from the file path.

Hash trees are compared by comparing the hashes of each node. BFS is used for the comparison. If the hashes for a particular node matches, then none of the files below its branch has been modified and we do not have to explore them. For the comparison to happen correctly, it is crucial for the children in the adjacency list to be sorted (alphanumerically). The function compareTrees takes two hash trees as arguments are generates queries in a global query vector. The queries are the different files and folders which have been modified. There are three kinds of queries: add, modify, and delete. The function can be called the following way:

compareTrees(ht_backup, ht_data);

Backup The Changes

Finally, the queries are executed on the existing backup itself. This has the benefit of saving space since no new backup needs to be created. However, the program can easily be modified to create a new backup. The queries are executed using the copy and remove_all functions from teh file system library. The new hash tree replaces the old one using the save function. OpenMP work-sharing constructs is also used to execute multiple queries in parallel. A dynamic schedule is used since queries take different amounts of time to execute. The query class has two member functions: "print" which displays the query information and "execute" which executes the query.

Sample Output

Results

The parallel code performed much better than the serial code. On average, the parallel code performed at least 2x better than the serial code. Below is the table showing the time comparison between parallel code (4 threads) and serial code.

Size (mb)	Serial (sec)	Parallel (4 threads) (sec)
10	0.810	0.295
25	2.350	0.913
50	2.724	1.504
250	11.481	4.840
500	18.680	9.759
1000	51.533	24.750
5000	272	124

The program also performs slightly better as the number of threads are increased. The performance depends on various factors such as the size of the data and the algorithm used by the CPU to schedule OpenMP tasks. However, we do observe that the difference in performance between serial and parallel is much more noticable for large data.

No. of Threads	Time - 250mb	Time - 500mb	Time - 1gb
1	12.67	22.772	37.938
2	10.301	16.72	23.95
3	8.200	13.84	23.502
4	7.572	13.59	23.45
6	5.899	11.75	22.182
8	4.573	10.12	20.56

Note

For those not interested in the openMP code, a version of the code without openMP can be found in the serial branch.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
hash-library		hash-library
hash-tree		hash-tree
images		images
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
main.cpp		main.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hash-library

hash-library

hash-tree

hash-tree

images

images

.gitignore

.gitignore

CMakeLists.txt

CMakeLists.txt

README.md

README.md

main.cpp

main.cpp

Repository files navigation

Hash Tree Data Backup

About

Construction of Hash (Merkle) Tree

Comparing Hash Trees

Backup The Changes

Sample Output

Results

Note

About

Releases

Packages

Languages

ramanakshay/Hash-Tree-Data-Backup

Folders and files

Latest commit

History

Repository files navigation

Hash Tree Data Backup

About

Construction of Hash (Merkle) Tree

Comparing Hash Trees

Backup The Changes

Sample Output

Results

Note

About

Topics

Resources

Stars

Watchers

Forks

Languages