Search Engine Evaluation and Near Duplicate Detection

University project for the course Data Mining Technology for Business and Society. It covers building a search engine with the PyTerrier library and the Near Duplicate Detection task.

Project Tasks

The project is divided into two main parts:

Search Engine Evaluation:
- Use the PyTerrier library to build a search engine for the 'irds:nfcorpus/dev' dataset and improve its performance by comparing pipelines that combine different preprocessing steps and weighting models. Choose appropriate evaluation metrics for the information retrieval task (such as Normalized Discounted Cumulative Gain or Mean Reciprocal Rank) to assess the quality of the engine.
- Given several scenarios, such as a company that needs a search engine for a dataset of scientific papers, build the appropriate search engine with a single specific configuration of preprocessing, weighting model and evaluation metric, and justify the choice.
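The two metrics named above can be illustrated with a minimal pure-Python sketch. It is not the notebook's code and does not use PyTerrier. The function names are hypothetical, and each input is assumed to be the list of graded relevance labels of the retrieved documents in ranked order. As a simplification, nDCG is normalised against the ideal reordering of the retrieved list only, not against all judged documents.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the same gains sorted in descending order.

    Assumption: the ideal ranking is built from the retrieved list only.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(gains):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, g in enumerate(gains, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over a list of per-query gain lists."""
    return sum(reciprocal_rank(g) for g in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical relevance labels for two queries.
    runs = [[0, 2, 1], [1, 0, 0]]
    print("nDCG@3 per query:", [ndcg_at_k(g, 3) for g in runs])
    print("MRR:", mean_reciprocal_rank(runs))
```

nDCG rewards placing highly relevant documents near the top, while MRR only looks at the position of the first relevant result, so the choice between them depends on whether the scenario needs a well-ordered list or a single good answer.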
Near Duplicate Detection:
- In this part of the homework, the goal is to find, approximately, all near-duplicate documents in a collection of documents, following the rules below.
- Near-duplicates are all pairs of documents with a Jaccard similarity greater than or equal to 0.95.
- Each set of shingles, which represents an original document, must be sketched into a Min-Hashing sketch of length at most 210.
- The probability that a pair of documents with Jaccard similarity 0.95 becomes a near-duplicate candidate must be greater than 0.97.
- The process that generates near-duplicate pairs must produce as few False-Negatives and False-Positives as possible.
- The running time of the whole LSH process must be less than 10 minutes, and the choice of hyperparameters, such as the number of rows and bands for the LSH, must be motivated.
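One way to motivate the rows and bands is through the standard banding formula: with r rows per band and b bands, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. The sketch below searches all pairs with r * b <= 210 that satisfy the 0.97 requirement at s = 0.95. The function names are hypothetical, and the tie-breaking criterion is an assumption that is not stated in the rules: among feasible pairs it picks the one with the lowest candidate probability at a reference similarity of 0.85, used here as a rough proxy for False-Positives.

```python
def candidate_probability(s, rows, bands):
    """Probability that a pair with Jaccard similarity s shares at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def choose_rows_and_bands(max_sketch_len=210, target_sim=0.95,
                          min_prob=0.97, reference_sim=0.85):
    """Return (fp_proxy, rows, bands) for the feasible pair with the lowest proxy.

    Feasible means rows * bands <= max_sketch_len and the candidate
    probability at target_sim is strictly greater than min_prob.
    reference_sim is an assumed value used only to rank feasible pairs.
    """
    best = None
    for rows in range(1, max_sketch_len + 1):
        for bands in range(1, max_sketch_len // rows + 1):
            if candidate_probability(target_sim, rows, bands) <= min_prob:
                continue  # violates the requirement at Jaccard = 0.95
            fp_proxy = candidate_probability(reference_sim, rows, bands)
            if best is None or fp_proxy < best[0]:
                best = (fp_proxy, rows, bands)
    return best


if __name__ == "__main__":
    fp_proxy, rows, bands = choose_rows_and_bands()
    print("rows:", rows, "bands:", bands, "sketch length:", rows * bands)
    print("P(candidate | J=0.95):", candidate_probability(0.95, rows, bands))
    print("P(candidate | J=0.85):", fp_proxy)
```

Under these assumptions the search settles on 22 rows and 9 bands, which uses 198 of the 210 allowed hash values. A different reference similarity, or a criterion based on the measured False-Positive count, may lead to a different pair, so the result is a starting point for the justification and not a definitive answer.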

giulio-derasmo/Search-Engine-Evaluation-and-Near-Duplicate-Detection
