agarwalgaurav811/Data-deduplication

Data-deduplication

About

This is a model for a data deduplication challenge. It identifies unique patients in a dataset by applying machine learning techniques, namely logistic regression and clustering, with the help of the Python library dedupe. dedupe takes in human-labeled training data and derives the best matching rules for your dataset, so that it can quickly and automatically find similar records, even in very large databases.
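To make the idea concrete, here is a minimal standard-library sketch of the two stages described above: scoring record pairs with a logistic function, then clustering the pairs that score as matches. It is not the dedupe API and not this repository's code. The records, field names, weights, bias, and threshold are all hypothetical; in practice the weights would be learned from labeled training pairs.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical patient records; the field names are illustrative only.
records = [
    {"name": "John Smith", "dob": "1980-01-02"},
    {"name": "Jon Smith", "dob": "1980-01-02"},
    {"name": "Mary Jones", "dob": "1975-06-30"},
]

# Hand-picked values standing in for what a trained logistic
# regression would learn from human-labeled pairs (assumption).
WEIGHTS = {"name": 6.0, "dob": 4.0}
BIAS = -7.0


def match_probability(a, b):
    """Logistic score over per-field string similarities in [0, 1]."""
    z = BIAS
    for field, weight in WEIGHTS.items():
        z += weight * SequenceMatcher(None, a[field], b[field]).ratio()
    return 1.0 / (1.0 + math.exp(-z))


def cluster(recs, threshold=0.5):
    """Merge records connected by a matching pair (union-find)."""
    parent = list(range(len(recs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(recs)), 2):
        if match_probability(recs[i], recs[j]) >= threshold:
            parent[find(j)] = find(i)
    return [find(i) for i in range(len(recs))]


# One cluster label per record; equal labels mean "same patient".
cluster_ids = cluster(records)
```

Comparing every pair is quadratic in the number of records, so this sketch only suits small inputs; large datasets need a way to limit which pairs are compared.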

Installation

  • Install Python and pip for your system, following the guide available here
  • git clone https://github.com/agarwalgaurav811/Data-deduplication && cd Data-deduplication
  • pip install -r requirements.txt
  • pip install -e .

Running Instructions

python main.py

A file named "Deduplication output.csv" will be created in the data directory. It contains a new column called 'Cluster ID'; records that share the same Cluster ID are considered to refer to the same patient.
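The output can be inspected by grouping rows on the 'Cluster ID' column. The following is a minimal sketch using only the standard library; the helper names and the sample rows (including the `name` column) are hypothetical, and only the 'Cluster ID' column name comes from the description above.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(csv_file, cluster_column="Cluster ID"):
    """Return a dict mapping each cluster ID to its list of rows."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[cluster_column]].append(row)
    return dict(groups)


def duplicate_clusters(groups):
    """Keep only clusters that contain more than one record."""
    return {cid: rows for cid, rows in groups.items() if len(rows) > 1}


# Hypothetical sample standing in for the real output file.
sample = io.StringIO(
    "name,Cluster ID\n"
    "John Smith,0\n"
    "Jon Smith,0\n"
    "Mary Jones,1\n"
)
groups = group_by_cluster(sample)
duplicates = duplicate_clusters(groups)
```

To run it on the real output, pass a file object instead of the sample, for example one opened with `open("data/Deduplication output.csv", newline="")`, assuming the script is run from the repository root.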
