Skip to content

An information retrieval system which consists of various techniques' implementations like indexing, tokenization, stopping, stemming, page ranking, snippet generation and evaluation of results

Notifications You must be signed in to change notification settings

parshva45/Information-Retrieval-System

Repository files navigation

Prerequisites to run this project:
1) Python Environment
2) Java IDE like Eclipse to import entire Lucene Folders as Java Projects


Steps for project execution:

NOTE: Execute each file in the exact order as mentioned below.
Execute python file by:
> python file_name.py

Only Lucene model is implemented in Java. To execute it, just import the entire project into Eclipse
> Run project as Java Application
1) Phase 1/Task 1/Step 4/Lucene/src/Lucene.java
2) Phase 1/Task 3/Part A/Step 4/Lucene (Stopped)/src/Lucene.java
3) Phase 1/Task 3/Part B/Step 3/Lucene (Stemmed)/src/Lucene.java


Execute each file in the exact order as mentioned below.
Input and output fields for every file has been mentioned for better understanding.
-----------------------------------------------------------------------------------------------------------------------------

PHASE 1:

TASK 1 Steps:

1. Execute Phase 1/Task 1/Step 1/tokenizer.py
I/P: Raw HTML dir
O/P: Phase 1/Task 1/Step 1/Tokenizer Output/

2. Execute Phase 1/Task 1/Step 2/create_inverted_list.py
	- I/P: Phase 1/Task 1/Step 1/Tokenizer Output/
	
	- O/P: Phase 1/Task 1/Setp 2/Inverted_List.txt
	- O/P: Phase 1/Task 1/Step 2/DocumentID_DocLen.txt
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Inverted_List.txt
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-DocumentID_DocLen.txt

3. Execute Phase 1/Task 1/Step 3/query_cleaning.py
	- I/P: Phase 1/Task 1/Step 3/cacm.query.txt
	- O/P: Phase 1/Task 1/Step 3/Cleaned_Queries.txt
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt

4. Execute the 4 algorithms of ranking (BM25, Lucene, TFIDF, Query Likelihood Model)

A.

a. Execute Phase 1/Task 1/Step 4/BM25/bm25_no_relevance.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Inverted_List.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-DocumentID_DocLen.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	
	- O/P: Phase 1/Task 1/Step 4/BM25/BM25_NonRelevance_Top5_Docs.txt
	- O/P: Phase 1/Task 1/Step 4/BM25/BM25_NonRelevance_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 1/Step 4/BM25/BM25_NonRelevance_Top100_Pages.txt
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-NoRelevance-Top100Docs-perQuery/
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25_NoRelevance_Top5_Query_Pages.txt

b. Execute Phase 1/Task 1/Step 4/BM25/rel_doc.py
	- I/P: cacm.rel.txt
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QueryID_RelevantDocs.txt

c. Execute Phase 1/Task 1/Step 4/BM25/bm25_relevance.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Inverted_List.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-DocumentID_DocLen.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QueryID_RelevantDocs.txt
	
	- O/P: Phase 1/Task 1/Step 4/BM25/BM25_Relevance_Top5_Docs.txt
	- O/P: Phase 1/Task 1/Step 4/BM25/BM25_Relevance_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 1/Step 4/BM25/BM25_Relevance_Top100_Pages.txt
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-Relevance-Top100Docs-perQuery/
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-Relevance-Top5Docs-perQuery/
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25_Relevance_Top5_Query_Pages.txt

B. Executee Phase 1/Task 1/Step 4/TF-IDF/tf-idf_normalized.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Inverted_List.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-DocumentID_DocLen.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	
	- O/P: Phase 1/Task 1/Step 4/TF-IDF/TF_IDF_Normalized_Top5_Docs.txt
	- O/P: Phase 1/Task 1/Step 4/TF-IDF/TF_IDF_Normalized_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 1/Step 4/TF-IDF/TF_IDF_Normalized_Top100_Pages.txt
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-TF-IDF-Normalized-Top100Docs-perQuery/
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-TF_IDF_Normalized_Top5_Query_Pages.txt

C. Executee Phase 1/Task 1/Step 4/QLM/QLM.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Inverted_List.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-DocumentID_DocLen.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	
	- O/P: Phase 1/Task 1/Step 4/QLM/QLM_Top5_Docs.txt
	- O/P: Phase 1/Task 1/Step 4/QLM/QLM_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 1/Step 4/QLM/QLM_Top100_Pages.txt
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QLM-Top100Docs-perQuery/
	- O/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QLM_Top5_Query_Pages.txt

D. Execute Phase 1/Task 1/Step 4/Lucene/src/Lucene.java

Task 2 Steps:

1. Execute generate_QueryID_Top5Docs_Dictionary.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-Relevance-Top5Docs-perQuery/
	- O/P: Phase 1/Task 2/Encoded Data Structures (PRF)/Encoded-QueryID_Top5Docs_BM25_Relevance.txt

2. Execute creating_inv_list_for_top_5.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- I/P: Phase 1/Task 2/common_words.txt
	- I/P: Phase 1/Task 2/Encoded Data Structures (PRF)/Encoded-QueryID_Top5Docs_BM25_Relevance.txt
	- I/P: Phase 1/Task 1/Step 1/Tokenizer Output/
	- O/P: Phase 1/Task 2/Encoded Data Structures (PRF)/Encoded-Queries_With_Their_Expansion_Terms.txt

3. Execute create_expanded_queries.py
	- I/P: Phase 1/Task 2/Encoded Data Structures (PRF)/Encoded-Queries_With_Their_Expansion_Terms.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QueryID_RelevantDocs.txt
	- O/P: Phase 1/Task 2/Encoded Data Structures (PRF)/Encoded-Expanded_Queries.txt

4. Execute bm25_Relevance_PRF.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Inverted_List.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-DocumentID_DocLen.txt
	- I/P: Phase 1/Task 2/Encoded Data Structures (PRF)/Encoded-Expanded_Queries.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QueryID_RelevantDocs.txt
	- I/P: Phase 1/Task 1/Step 1/Tokenizer Output/
	
	- O/P: Phase 1/Task 2/Step 4/BM25_Relevance_PRF_Top100_Pages.txt
	- O/P: Phase 1/Task 2/Encoded Data Structures (PRF)/Encoded-BM25-Relevance-PRF-Top100Docs-perQuery/

TASK 3 Steps:

Part A (Stopping):

Steps:
1. Exectute Phase 1/Task 3/Part A/Step 1/tokenizer_with_stopping.py
	- I/P: Raw HTML dir
	- I/P: Phase 1/Task 3/Part A/common_words.txt
	- O/P: Phase 1/Task 3/Part A/Step 1/Stopped Tokenizer Output/

2. Execute Phase 1/Task 3/Step 2/create_stopped_inverted_list.py
	- I/P: Phase 1/Task 3/Part A/Step 1/Stopped Tokenizer Output/
	
	- O/P: Phase 1/Task 3/Part A/Step 2/Stopped_Inverted_List.txt
	- O/P: Phase 1/Task 3/Part A/Step 2/Stopped_DocumentID_DocLen.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_Inverted_List.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_DocumentID_DocLen.txt

3. Execute Phase 1/Task 3/Step 3/query_cleaning_stopwords_removed.py
	- I/P: Phase 1/Task 3/Part A/common_words.txt
	- I/P: Phase 1/Task 3/Part A/Step 3/cacm.query.txt
	
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Cleaned_Queries_Stopped.txt
	- O/P: Phase 1/Task 3/Part A/Step 3/Cleaned_Queries_Stopped.txt

4.

Part A (BM25 (Stopped))
1. Execute Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/bm25_no_relevance_stopping.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_DocumentID_DocLen.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Cleaned_Queries_Stopped.txt
	
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/Stopped_BM25_NoRelevance_Top5_Docs.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/Stopped_BM25_NoRelevance_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/Stopped_BM25_NoRelevance_Top100_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_BM25-NoRelevance-Top100Docs-perQuery/
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_BM25_NoRelevance_Top5_Query_Pages.txt

2. Execute Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/rel-doc-stopped.py
	- I/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/cacm.rel.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_QueryID_RelevantDocs.txt

3. Execute Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/bm25_relevance_stopping.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_DocumentID_DocLen.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Cleaned_Queries_Stopped.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_QueryID_RelevantDocs.txt
	- I/P: Phase 1/Task 3/Part A/Step 1/Stopped Tokenizer Output/

	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/Stopped_BM25_Relevance_Top5_Docs.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/Stopped_BM25_Relevance_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/Stopped_BM25_Relevance_Top100_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_BM25-Relevance-Top100Docs-perQuery/
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_BM25_Relevance_Top5_Query_Pages.txt

Part B (QLM (Stopped))
Execute Phase 1/Task 3/Part A/Step 4/QLM (Stopped)/qlm_stopping.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_DocumentID_DocLen.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Cleaned_Queries_Stopped.txt
	
	- O/P: Phase 1/Task 3/Part A/Step 4/QLM (Stopped)/Stopped_QLM_Top5_Docs.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/QLM (Stopped)/Stopped_QLM_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/QLM (Stopped)/Stopped_QLM_Top100_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_QLM-Top100Docs-perQuery/
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_QLM_Top5_Query_Pages.txt

Part C (TF-IDF (Stopped))
Execute Phase 1/Task 3/Part A/Step 4/TF-IDF (Stopped)/tf-idf_normalized_stopping.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_DocumentID_DocLen.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Cleaned_Queries_Stopped.txt
	
	- O/P: Phase 1/Task 3/Part A/Step 4/TF-IDF (Stopped)/Stopped_TF-IDF_Normalized_Top5_Docs.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/TF-IDF (Stopped)/Stopped_TF-IDF_Normalized_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/TF-IDF (Stopped)/Stopped_TF-IDF_Normalized_Top100_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_TF-IDF-Normalized-Top100Docs-perQuery/
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_TF-IDF_Normalized_Top5_Query_Pages.txt

Part D (Lucene)
Execute Phase 1/Task 3/Part A/Step 4/Lucene (Stopped)/src/Lucene.java


Task 3 Part B (Stemming):

Steps:
1. Execute Phase 1/Task 3/Part B/Step 1/cacm_stem_extracter.py
	- I/P: Phase 1/Task 3/Part B/Step 1/cacm_stem.txt
	- O/P: Phase 1/Task 3/Part B/Step 1/Stemmed_Corpus/

2. Execute Phase 1/Task 3/Part B/Step 2/create_stemmed_inverted_list.py
	- I/P: Phase 1/Task 3/Part B/Step 1/Stemmed_Corpus/
	- O/P: Phase 1/Task 3/Part B/Step 2/Stemmed_Inverted_List.txt
	- O/P: Phase 1/Task 3/Part B/Step 2/Stemmed_DocumentID_DocLen.txt
	- O/P: Phase 1/Task 3/Part B/Encoded Data Structures (Stemmed)/Encoded-Stemmed_Inverted_List.txt
	- O/P: Phase 1/Task 3/Part B/Encoded Data Structures (Stemmed)/Encoded-Stemmed_DocumentID_DocLen.txt

3. (BM25 (Stemmed))
1. Execute Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/bm25_no_relevance_stemming.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_DocumentID_DocLen.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Cleaned_Queries_Stemmed.txt
	
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/Stemmed_BM25_NoRelevance_Top5_Docs.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/Stemmed_BM25_NoRelevance_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/Stemmed_BM25_NoRelevance_Top100_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_BM25-NoRelevance-Top100Docs-perQuery/
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_BM25_NoRelevance_Top5_Query_Pages.txt

2. Execute Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/rel-doc-stemmed.py
	- I/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/cacm.rel.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_QueryID_RelevantDocs.txt

3. Execute Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/bm25_relevance_stemming.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_DocumentID_DocLen.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Cleaned_Queries_Stemmed.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_QueryID_RelevantDocs.txt
	- I/P: Phase 1/Task 3/Part A/Step 1/Stemmed Tokenizer Output/

	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/Stemmed_BM25_Relevance_Top5_Docs.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/Stemmed_BM25_Relevance_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stemmed)/Stemmed_BM25_Relevance_Top100_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_BM25-Relevance-Top100Docs-perQuery/
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_BM25_Relevance_Top5_Query_Pages.txt

Part B (QLM (Stemmed))
Execute Phase 1/Task 3/Part A/Step 4/QLM (Stemmed)/qlm_stemming.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_DocumentID_DocLen.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Cleaned_Queries_Stemmed.txt
	
	- O/P: Phase 1/Task 3/Part A/Step 4/QLM (Stemmed)/Stemmed_QLM_Top5_Docs.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/QLM (Stemmed)/Stemmed_QLM_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/QLM (Stemmed)/Stemmed_QLM_Top100_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_QLM-Top100Docs-perQuery/
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_QLM_Top5_Query_Pages.txt

Part C (TF-IDF (Stemmed))
Execute Phase 1/Task 3/Part A/Step 4/TF-IDF (Stemmed)/tf-idf_normalized_stemming.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_DocumentID_DocLen.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Cleaned_Queries_Stemmed.txt
	
	- O/P: Phase 1/Task 3/Part A/Step 4/TF-IDF (Stemmed)/Stemmed_TF-IDF_Normalized_Top5_Docs.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/TF-IDF (Stemmed)/Stemmed_TF-IDF_Normalized_Top5_Query_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Step 4/TF-IDF (Stemmed)/Stemmed_TF-IDF_Normalized_Top100_Pages.txt
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_TF-IDF-Normalized-Top100Docs-perQuery/
	- O/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stemmed)/Encoded-Stemmed_TF-IDF_Normalized_Top5_Query_Pages.txt

Part D (Lucene)
Execute Phase 1/Task 3/Part A/Step 4/Lucene (Stemmed)/src/Lucene.java

-----------------------------------------------------------------------------------------------------------------------------


PHASE 2:

Exectute Phase 2/Snippet_generation.py
	- I/P: Raw HTML dir
	- I/P: Phase 2/common_word.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_Inverted_List.txt
	- I/P: Phase 1/Task 3/Part A/Step 4/Lucene (Stopped)/Stopped_Lucene_Top5_Docs.txt
	- I/P: Phase 1/Task 3/Part A/Step 4/BM25 (Stopped)/Stopped_BM25_NoRelevance_Top5_Docs.txt
	- I/P: Phase 1/Task 3/Part A/Step 4/QLM (Stopped)/Stopped_QLM_Top5_Docs.txt
	- I/P: Phase 1/Task 3/Part A/Step 4/TF-IDF (Stopped)/Stopped_TF_IDF_Normalized_Top5_Docs.txt
	
	- O/P: Phase 2/Snippets_Text/Snippets_Stopped_Lucene.txt
	- O/P: Phase 2/Snippets_Text/Snippets_Stopped_BM25_NoRelevance.txt
	- O/P: Phase 2/Snippets_Text/Snippets_Stopped_QLM.txt
	- O/P: Phase 2/Snippets_Text/Snippets_Stopped_TF_IDF_Normalized.txt
	- O/P: Phase 2/Snippets_HTML/Snippets_Stopped_Lucene.html
	- O/P: Phase 2/Snippets_HTML/Snippets_Stopped_BM25_NoRelevance.html
	- O/P: Phase 2/Snippets_HTML/Snippets_Stopped_QLM.html
	- O/P: Phase 2/Snippets_HTML/Snippets_Stopped_TF_IDF_Normalized.html


-----------------------------------------------------------------------------------------------------------------------------


PHASE 3:

Steps:

1. Part A 
Execute Phase 3/Step 1/BM25 (No Relevance)/generate_QueryID_Top100Docs_bm25_no_relevance.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-NoRelevance-Top100Docs-perQuery/
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_BM25_NoRelevance.txt

1. Part B
Execute Phase 3/Step 1/BM25 (Relevance)/generate_QueryID_Top100Docs_bm25_relevance.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-Relevance-Top100Docs-perQuery/
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_BM25_Relevance.txt

1. Part C
Execute Phase 3/Step 1/Lucene/generate_QueryID_Top100Docs_lucene.py
	- I/P: Phase 1/Task 1/Step 4/Lucene/Lucene_Top100_Docs.txt
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_Lucene.txt

1. Part D
Execute Phase 3/Step 1/QLM/generate_QueryID_Top100Docs_QLM.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QLM-Top100Docs-perQuery/
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_QLM.txt

1. Part E
Execute Phase 3/Step 1/TF-IDF/generate_QueryID_Top100Docs_tf-idf_normalized.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-TF-IDF-Normalized-Top100Docs-perQuery/
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_tf-idf_normalized.txt

1. Part F
Execute Phase 3/Step 1/BM25 (Relevance with PRF)/generate_QueryID_Top100Docs_bm25_relevance_PRF.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-Relevance-PRF-Top100Docs-perQuery/
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_BM25_Relevance_PRF.txt

1. Part G
Execute Phase 3/Step 1/BM25 (Relevance Stopped)/generate_QueryID_Top100Docs_bm25_relevance_stopped.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_BM25-Relevance-Top100Docs-perQuery
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_Stopped_BM25_Relevance.txt

1. Part H
Execute Phase 3/Step 1/Lucene (Stopped)/generate_QueryID_Top100Docs_lucene_stopped.py
	- I/P: Phase 1/Task 3/Part A/Step 4/Lucene (Stopped)/Stopped_Lucene_Top100_Docs.txt
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_Stopped_Lucene.txt

1. Part I
Execute Phase 3/Step 1/TF-IDF (Stopped)/generate_QueryID_Top100Docs_tf-idf_normalized_stopped.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_TF-IDF-Normalized-Top100Docs-perQuery
	- O/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_Stopped_tf-idf_normalized.txt



2. Part A 
Execute Phase 3/Step 2/BM25 (No Relevance)/retrieval_model_evaluation_bm25_no_relevance.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QueryID_RelevantDocs.txt
	- I/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_BM25_NoRelevance.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- O/P: Phase 3/Precision Recall Tables/BM25 Evaluation Results/BM25 (No Relevance)/

2. Part B
Execute Phase 3/Step 2/BM25 (Relevance)/retrieval_model_evaluation_bm25_relevance.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-Relevance-Top100Docs-perQuery/
	- I/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_BM25_Relevance.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- O/P: Phase 3/Precision Recall Tables/BM25 Evaluation Results/BM25 (Relevance)/

2. Part C
Execute Phase 3/Step 2/Lucene/retrieval_model_evaluation_lucene.py
	- I/P: Phase 1/Task 1/Step 4/Lucene/Lucene_Top100_Docs.txt
	- I/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_Lucene.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- O/P: Phase 3/Precision Recall Tables/Lucene Evaluation Results/Lucene/

2. Part D
Execute Phase 3/Step 2/QLM/retrieval_model_evaluation_QLM.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-QLM-Top100Docs-perQuery/
	- I/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_QLM.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- O/P: Phase 3/Precision Recall Tables/QLM Evaluation Results/QLM/

2. Part E
Execute Phase 3/Step 2/TF-IDF/retrieval_model_evaluation_tf-idf_normalized.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-TF-IDF-Normalized-Top100Docs-perQuery/
	- I/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_tf-idf_normalized.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- O/P: Phase 3/Precision Recall Tables/TF-IDF Evaluation Results/TF-IDF/

2. Part F
Execute Phase 3/Step 1/BM25 (Relevance with PRF)/retrieval_model_evaluation_by_bm25_with_relevance_for_PRM_s_top_100.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-BM25-Relevance-Top100Docs-perQuery/
	- I/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_BM25_Relevance_PRF.txt
	- I/P: Phase 1/Task 2/Encoded-Expanded_Queries.txt
	- O/P: Phase 3/Precision Recall Tables/BM25 (Relevance with PRF)/

2. Part G
Execute Phase 3/Step 1/BM25 (Relevance Stopped)/retrieval_model_evaluation_relevance_stopped.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_BM25-Relevance-Top100Docs-perQuery/
	- I/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_Stopped_BM25_Relevance.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Cleaned_Queries_Stopped.txt
	- O/P: Phase 3/Precision Recall Tables/BM25 (Relevance Stopped)/

2. Part H
Execute Phase 3/Step 1/Lucene (Stopped)/retrieval_model_evaluation_lucene_stopped.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_QueryID_RelevantDocs.txt
	- I/P: Phase 3/Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_Stopped_Lucene.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Cleaned_Queries_Stopped.txt
	- O/P: Phase 3/Precision Recall Tables/Lucene (Stopped)/

2. Part I
Execute Phase 3/Step 1/TF-IDF (Stopped)/retrieval_model_evaluation_tf-idf_normalized_stopped.py
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Stopped_QueryID_RelevantDocs.txt
	- I/P: Encoded Data Structures (Phase 3)/Encoded-QueryID_Top100Docs_Stopped_tf-idf_normalized.txt
	- I/P: Phase 1/Task 3/Part A/Encoded Data Structures (Stopped)/Encoded-Cleaned_Queries_Stopped.txt
	- O/P: Phase 3/Precision Recall Tables/TF-IDF-Normalized (Stopped)/


-----------------------------------------------------------------------------------------------------------------------------


Supplementary Features:

Part A (No Stopping):

Execute Supplementary Features/Part A (No Stopping)/create_inverted_list_with_positions_no_stop.py
	- I/P: Phase 1/Task 1/Step 1/Tokenizer Output
	- O/P: Supplementary Features/Inverted_List_With_Positions_No_Stopping.txt
	- O/P: Supplementary Features/Encoded Data Structures (Bonus)/Encoded-Inverted_List_Position_No_Stopping.txt

Execute Supplementary Features/Part A (No Stopping)/proximity_no_stop.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-DocumentID_DocLen.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- I/P: Supplementary Features/Encoded Data Structures (Bonus)/Encoded-Inverted_List_Position_No_Stopping.txt
	- O/P: Supplementary Features/Encoded Data Structures (Bonus)/Encoded-Top100-Docs-Proximity-NoStopping/
	- O/P: Supplementary Features/Part A (No Stopping)/Proximity_NoStopping_Top100_Pages.txt

Part B (Stopping):

Execute Supplementary Features/Part B (Stopping)/create_inverted_list_with_positions_stop.py
	- I/P: Phase 1/Task 3/Part A/Step 1/Stopped Tokenizer Output
	- I/P: Phase 1/Task 3/Part A/common_words.txt
	- O/P: Supplementary Features/Inverted_List_With_Positions_Stopping.txt
	- O/P: Supplementary Features/Encoded Data Structures (Bonus)/Encoded-Inverted_List_Position_With_Stopping.txt

Execute Supplementary Features/Part B (Stopping)/proximity_with_stop.py
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-DocumentID_DocLen.txt
	- I/P: Phase 1/Task 1/Encoded Data Structures/Encoded-Cleaned_Queries.txt
	- I/P: Supplementary Features/Encoded Data Structures (Bonus)/Encoded-Inverted_List_Position_With_Stopping.txt
	- O/P: Supplementary Features/Encoded Data Structures (Bonus)/Encoded-Top100-Docs-Proximity-With-Stopping/
	- O/P: Supplementary Features/Part B (Stopping)/Proximity_With_Stopping_Top100_Pages.txt

About

An information retrieval system which consists of various techniques' implementations like indexing, tokenization, stopping, stemming, page ranking, snippet generation and evaluation of results

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published