Doc-Phi: Intelligent Document Finder

Doc-Phi is a Python application for efficiently retrieving documents on your device using natural language queries. It currently supports the following operations:

Adding documents to Doc-Phi
Querying based on natural language to find the most relevant documents
Listing the manually and automatically assigned tags of a particular document

The application consists of a backend daemon service and a command-line interface.

Installation

Docker Container

A ready-to-use Docker image, with all dependencies pre-installed and the Doc-Phi backend daemon started automatically, is available at: https://hub.docker.com/r/aman18e/doc_phi

To use the application, follow these steps:

Install Docker Engine on the local system by following https://docs.docker.com/get-docker/
Pull the image from Docker Hub using the command sudo docker pull aman18e/doc_phi
Create a container from the pulled image using the following command:
```
sudo docker run -v path_to_files:/home/data/ -it aman18e/doc_phi
```
Replace path_to_files with the absolute path of the directory containing your document files.

The running container can be stopped with the exit command. To restart it, use the command sudo docker start container_name

Manual Installation

Users can also install all dependencies manually and run the application with a Python installation on the local system or in a virtual environment, using the following steps:

Clone the repository using git clone https://github.com/ShivanshMishra18/IntelligentDocFinder.git
Run cd IntelligentDocFinder
Install Python 3.8 and pip on the local system
Install PyTorch by running pip install torch==1.10.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
Run pip install -r requirements.txt to install remaining python dependencies.

Create the backend daemon service from the docfinder.service file:

sudo cp docfinder.service /etc/systemd/system/docfinder.service
sudo systemctl enable docfinder.service
sudo systemctl start docfinder.service

Set up a shortcut for CLI using the command given below:
```
echo 'alias doc-phi="python3.8 /home/IntelligentDocFinder/cli/doc_phi.py"' >> ~/.bashrc
```
After reloading the shell (for example, with source ~/.bashrc), the application can be used through the doc-phi command.

Dependencies

The major dependencies include:

python 3.8
pytorch 1.10.1
click 8.0.3
huggingface-hub 0.2.1
ipcqueue 0.9.7
lmdb 1.2.1
nltk 3.6.6
numpy 1.21.5
python-docx 0.8.11
python-pptx 0.6.21
scikit-learn 1.0.1
sentence-transformers 2.1.0
transformers 4.14.1

Usage

The doc-phi command-line interface is used as follows:

Usage: doc-phi [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  add     Add files to the database
  search  Search files using suitable query
  tags    Retrieve allotted tags for a file

Currently, the following operations can be performed:

1. Adding Documents

Various types of documents can be added using the following syntax:

doc-phi add

After this command is executed, an interactive interface collects the details of the document.

2. Querying based on Natural Language

The most relevant documents can be retrieved using the following command:

doc-phi search -q <query>

The results are displayed in order of document rank, calculated using the Okapi BM25 ranking function.

3. Listing of the Tags

The tags, both manual and automatic, assigned to a document can be viewed using the following command:

doc-phi tags -f <file_name>

How Doc-Phi works

The querying backend of Doc-Phi combines BM25, a conventional ranking algorithm based on TF-IDF, with a contextual sentence-level BERT model, MSMARCO. All neural models run on the PyTorch backend. The indexes and document metadata are stored in the Lightning Memory-Mapped Database (LMDB). To assign tags automatically from a generic knowledge space, Doc-Phi uses an ontology derived from the union of the FIGER and TypeNet ontologies.

Data Flow Diagram

The following sections describe the details of various components of the application.

1. Document Processing

The core functionality of Doc-Phi is to process the documents added to it so that semantic retrieval is quick and efficient. As discussed above, the processing includes:

word frequency based statistics used for filtering relevant documents
neural transformer-based processing of paragraphs for contextual similarity matching

Along with the processing required for querying, this component also assigns the tags that are shown with query results.

This processing is achieved by the following logical steps:

a. Paragraphs Extractor

The processing operations within each document are performed at the granularity of a paragraph. The paragraphs extractor module lays out the interface and implements classes capable of reading documents and returning iterators over the paragraphs for various file types.

The module is thread-safe so that it can be used with the distributed pipeline. A factory for iterators is also provided.
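As a rough illustration of the interface-plus-factory idea, the sketch below defines an abstract paragraph iterator, one plain-text implementation and a factory keyed on the file extension. All names here (ParagraphIterator, TextParagraphIterator, make_paragraph_iterator) are hypothetical and are not taken from the Doc-Phi source. Each iterator opens its own file handle, so separate iterators share no state between threads.

```python
import os
from abc import ABC, abstractmethod
from typing import Iterator


class ParagraphIterator(ABC):
    """Hypothetical interface: yields the paragraphs of one document."""

    def __init__(self, path: str) -> None:
        self.path = path

    @abstractmethod
    def __iter__(self) -> Iterator[str]:
        ...


class TextParagraphIterator(ParagraphIterator):
    """Treats blank-line-separated blocks of a plain-text file as paragraphs."""

    def __iter__(self) -> Iterator[str]:
        with open(self.path, encoding="utf-8") as handle:
            block = []
            for line in handle:
                if line.strip():
                    block.append(line.strip())
                elif block:
                    yield " ".join(block)
                    block = []
            if block:  # last paragraph without a trailing blank line
                yield " ".join(block)


# Illustrative registry; real file types (docx, pptx, ...) would be added here.
_ITERATORS = {".txt": TextParagraphIterator}


def make_paragraph_iterator(path: str) -> ParagraphIterator:
    """Factory: choose an iterator class from the file extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in _ITERATORS:
        raise ValueError(f"unsupported file type: {extension!r}")
    return _ITERATORS[extension](path)
```

A caller only needs the factory: for paragraph in make_paragraph_iterator(path) works the same way for every registered file type.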

b. Distributed Pipeline

All paragraphs go through various stages of transformation so that they can be used for filtering relevant documents during the first phase of querying. This involves natural language processing methods such as tokenizing, lemmatizing, stemming, and removing stopwords and punctuation.

This phase is implemented using a distributed multithreaded pipeline which, in essence, allows various portions of a document to be at different stages of the transformation at the same time while taking advantage of multiple cores of the machine. Each stage is allowed to have multiple workers.

The pipeline takes an iterator from the paragraphs extractor, a list of processing functions and an accumulator function, which keeps it flexible across use cases.
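The shape of such a pipeline can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the Doc-Phi implementation: the function name run_pipeline, the use of queue.Queue between stages and the workers_per_stage parameter are all hypothetical. Each stage has its own worker threads, so different paragraphs can be at different stages at the same time.

```python
import queue
import threading
from typing import Callable, Iterable, List

_STOP = object()  # sentinel that tells a worker to exit


def run_pipeline(items: Iterable, stages: List[Callable], accumulate: Callable,
                 workers_per_stage: int = 2) -> None:
    """Pass every item through all stages, then hand each result to `accumulate`."""
    # queues[i] feeds stage i; the last queue collects the finished items.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage: Callable, inbox: queue.Queue, outbox: queue.Queue) -> None:
        while True:
            item = inbox.get()
            if item is _STOP:
                break
            outbox.put(stage(item))

    groups = []
    for index, stage in enumerate(stages):
        group = [
            threading.Thread(target=worker, args=(stage, queues[index], queues[index + 1]))
            for _ in range(workers_per_stage)
        ]
        for thread in group:
            thread.start()
        groups.append(group)

    for item in items:
        queues[0].put(item)

    # Shut down stage by stage: a stage only stops after everything
    # queued before its sentinels has been processed.
    for index, group in enumerate(groups):
        for _ in group:
            queues[index].put(_STOP)
        for thread in group:
            thread.join()

    # All workers have exited, so the final queue holds every result.
    while not queues[-1].empty():
        accumulate(queues[-1].get())
```

Because several workers run per stage, results reach the accumulator in no guaranteed order; an accumulator that needs ordering would have to carry an index with each item.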

c. TF-IDF

TF-IDF stands for "Term Frequency-Inverse Document Frequency", a technique to quantify words in a set of documents. A score is computed for each word to signify its importance in the document and in the corpus.

Term Frequency tells how many times a given word occurs in a particular document. Inverse Document Frequency measures how informative a word is: words that appear in fewer documents carry more weight. Given the words of a query, TF and IDF can together be used to find relevant documents. Keyword-based search engines traditionally relied on this kind of scoring. Doc-Phi, however, uses it (with the BM25 algorithm) only to filter relevant documents.
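For reference, the standard Okapi BM25 formula can be written in a few lines of plain Python. This is a sketch of the textbook formula, not Doc-Phi's code: the function name bm25_scores and the parameter values k1 = 1.5 and b = 0.75 are assumptions, and the IDF variant used is the common non-negative form ln((N - n + 0.5) / (n + 0.5) + 1).

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Score every document (given as a token list) against the query tokens."""
    total = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / total
    term_freqs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # n(q): number of documents that contain each query token at least once
    doc_freq = {
        token: sum(1 for counts in term_freqs.values() if token in counts)
        for token in set(query)
    }

    scores = {}
    for doc_id, counts in term_freqs.items():
        # Longer-than-average documents are penalised through this factor.
        length_norm = 1 - b + b * len(docs[doc_id]) / avg_len
        score = 0.0
        for token in query:
            freq = counts[token]
            if freq == 0:
                continue
            n = doc_freq[token]
            idf = math.log((total - n + 0.5) / (n + 0.5) + 1)
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores[doc_id] = score
    return scores
```

Documents that share no token with the query score exactly 0.0, which is what makes the score usable as a first-pass filter.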

d. Embeddings for Documents

For the documents that pass the BM25 filter, Doc-Phi ranks them in order of relevance to the user query. This uses sentence-level BERT models, which can encode paragraphs into real-valued vectors. This can be achieved in either of two ways:

By using MSMARCO (the default) to compute and store an embedding for every paragraph while the document is processed
By using the contextual word embeddings from BERT models and aggregating them into sentence embeddings through a weighted sum based on some parameter

Note: Paragraph-level granularity was chosen on the heuristic that a paragraph primarily makes one point. Document-level embeddings, on the other hand, capture only the 'big picture', while sentence-level embeddings, though more informative, would require much more storage and significantly increase search time while querying.
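The second option, aggregating word embeddings with a weighted sum, can be illustrated as follows. The weighting parameter is not specified in this document, so the weights below are placeholders (an IDF-style weight per word would be one possibility), and the function name weighted_embedding is hypothetical. The sketch divides by the total weight, which turns the weighted sum into a weighted average so that long and short paragraphs stay comparable; that normalisation is an assumption.

```python
from typing import List, Sequence


def weighted_embedding(vectors: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Combine per-word vectors into one vector using per-word weights."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector and at least one vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dimension = len(vectors[0])
    return [
        sum(weight * vector[i] for vector, weight in zip(vectors, weights)) / total
        for i in range(dimension)
    ]
```

For example, two word vectors [1.0, 0.0] and [0.0, 1.0] with weights 3.0 and 1.0 combine into [0.75, 0.25].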

e. Tag Assignment

Although the CLI allows assigning manual tags for a document, the real power of Doc-Phi is automatic assignment of tags to documents. This is achieved by using the paragraph embeddings computed in the above step and finding the cosine similarity with the embeddings from a predefined set of tags.

We have defined our own ontology (tag set) for this purpose which is derived from the FIGER and the TypeNet knowledge hierarchies used for entity classification purposes.
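A minimal sketch of similarity-based tagging is shown below. It assumes that every tag in the ontology already has an embedding in the same vector space as the paragraphs. The function names, the 0.5 threshold, the limit of three tags and the choice of scoring a tag by its best-matching paragraph are all illustrative assumptions, not Doc-Phi's actual settings.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_tags(paragraph_embeddings: List[Vector], tag_embeddings: Dict[str, Vector],
                threshold: float = 0.5, max_tags: int = 3) -> List[str]:
    """Return the tags whose best paragraph similarity reaches the threshold."""
    best = {
        tag: max((cosine(p, tag_vec) for p in paragraph_embeddings), default=0.0)
        for tag, tag_vec in tag_embeddings.items()
    }
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return [tag for tag, score in ranked if score >= threshold][:max_tags]
```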

2. Database Management System

Doc-Phi uses LMDB (Lightning Memory-Mapped Database) as its database. LMDB is a key-value store; the following properties motivated its use:

Read transactions are extremely cheap.
It is memory-mapped, allowing zero-copy lookup and iteration.
No application-level caching is required: LMDB fully exploits the operating system’s buffer cache.

More about LMDB can be found in its official documentation.

Schema

The database is organized into the following data stores:

document
It contains the details of the documents added to Doc-Phi. Each document is identified by a unique identifier (uuid). The value holds the document's attributes and their details as a dictionary.
tf
tf stands for term frequency, as in TF-IDF. The tf store contains the frequency of each token in each document. The key for the tf store is obtained by concatenating the document id with the token itself (as a string).
nq
The nq store keeps track of the documents in which a token has appeared at least once. It is used by Okapi BM25.
token
This data store tracks the list of tokens present in each document. The key is the document_id.
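To make the key layout concrete, the sketch below fills the four stores for one document. Plain dictionaries stand in for the LMDB stores so that the example runs without a database. Keys and values are encoded as bytes because LMDB stores bytes. The ":" separator in the tf key, the JSON encoding of values and the function names are assumptions made for illustration.

```python
import json
import uuid
from collections import Counter
from typing import Dict, List


def tf_key(doc_id: str, token: str) -> bytes:
    """Key of the tf store: document id concatenated with the token."""
    return f"{doc_id}:{token}".encode("utf-8")  # ":" separator is an assumption


def index_document(stores: Dict[str, dict], details: dict, tokens: List[str]) -> str:
    """Write one document into the document, tf, nq and token stores."""
    doc_id = str(uuid.uuid4())
    stores["document"][doc_id.encode()] = json.dumps(details).encode()

    counts = Counter(tokens)
    for token, count in counts.items():
        stores["tf"][tf_key(doc_id, token)] = str(count).encode()
        # nq: remember that this document contains the token at least once
        holders = json.loads(stores["nq"].get(token.encode(), b"[]"))
        holders.append(doc_id)
        stores["nq"][token.encode()] = json.dumps(holders).encode()

    stores["token"][doc_id.encode()] = json.dumps(sorted(counts)).encode()
    return doc_id
```

With this layout, BM25 can read a term frequency with a single key lookup and obtain the number of documents containing a token from the length of its nq entry.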

Data Access Object (DAO)

Doc-Phi uses a DAO as an interface that provides data operations without exposing the details of the database. As a result, there is no tight coupling between the database and the application logic, and a different database can be used without affecting the main application.
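The decoupling can be illustrated with an abstract interface and one interchangeable implementation. The class and method names below (DocumentDAO, InMemoryDocumentDAO, save, find) are hypothetical and only show the pattern; an LMDB-backed class would implement the same two methods.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class DocumentDAO(ABC):
    """Hypothetical data-access interface seen by the application logic."""

    @abstractmethod
    def save(self, doc_id: str, details: dict) -> None:
        ...

    @abstractmethod
    def find(self, doc_id: str) -> Optional[dict]:
        ...


class InMemoryDocumentDAO(DocumentDAO):
    """Dictionary-backed implementation, useful for tests."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def save(self, doc_id: str, details: dict) -> None:
        self._rows[doc_id] = dict(details)  # copy so callers cannot mutate storage

    def find(self, doc_id: str) -> Optional[dict]:
        row = self._rows.get(doc_id)
        return dict(row) if row is not None else None


def describe(dao: DocumentDAO, doc_id: str) -> str:
    """Application code depends only on the interface, not on the database."""
    details = dao.find(doc_id)
    return "unknown document" if details is None else details.get("name", "unnamed")
```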

3. Query Processing

Doc-Phi takes a query in natural language and returns a list of the most relevant documents. This functionality is handled by the QueryEngine class and the SemanticEmbedder module. The series of operations can be broken down into the following functional units:

Non-pipelined processing (related: distributed pipeline)
Query Augmentation (nlpaug)
2-level filtering
Ranking Function (Okapi BM25, augmented query tokens)
Sentence Embeddings

Once the sentence embeddings of the query are obtained, the cosine similarity between the query and each document that passed level-1 filtering is computed. The resulting documents are then displayed in descending order of relevance.
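The two levels can be combined as in the following sketch, which takes already-computed BM25 scores and paragraph embeddings as input. It is an illustration under assumptions: the function name two_level_search, the shortlist size keep and the rule that a document's relevance is the similarity of its best-matching paragraph are not taken from the Doc-Phi source.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_level_search(query_vec: Vector, bm25: Dict[str, float],
                     paragraphs: Dict[str, List[Vector]], keep: int = 10) -> List[str]:
    """Level 1: shortlist by BM25 score. Level 2: re-rank by cosine similarity."""
    shortlist = sorted(
        (doc_id for doc_id, score in bm25.items() if score > 0),
        key=lambda doc_id: bm25[doc_id],
        reverse=True,
    )[:keep]

    def relevance(doc_id: str) -> float:
        # Assumption: a document is as relevant as its best paragraph.
        return max((cosine(query_vec, p) for p in paragraphs[doc_id]), default=0.0)

    return sorted(shortlist, key=relevance, reverse=True)
```

The expensive embedding comparison only runs on the shortlist, so its cost depends on keep rather than on the size of the whole collection.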

4. Command Line Interface

The tool is made available as a client-server program. The server is intended to be run as a daemon to provide service to possibly multiple clients. The client at present has a command-line interface.

The server owns a message queue to which all clients write their requests along with their process identifier (PID). Because a single server reads from the queue, results stay correct. This condition is also demanded by LMDB.

The client is designed to sleep after placing a request. The server writes the response into a shared memory location which is known to the client as well. The server then raises an OS-level signal to wake the client up.
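The request and wake-up cycle can be simulated inside one process to show the order of events. In this sketch, threads stand in for separate processes, a queue.Queue stands in for the message queue, a dictionary stands in for the shared memory location and a threading.Event stands in for the OS-level signal. None of these stand-ins are what Doc-Phi actually uses, and all names are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict

requests: queue.Queue = queue.Queue()      # stand-in for the message queue
responses: Dict[int, str] = {}             # stand-in for shared memory
wakeups: Dict[int, threading.Event] = {}   # stand-in for OS-level signals


def server(handle: Callable[[str], str]) -> None:
    """Single consumer: requests are answered one at a time."""
    while True:
        message = requests.get()
        if message is None:  # shutdown marker used only by this sketch
            break
        client_id, payload = message
        responses[client_id] = handle(payload)  # write the response
        wakeups[client_id].set()                # wake the sleeping client


def client(client_id: int, payload: str) -> str:
    """Place a request tagged with the client id, sleep, then read the response."""
    wakeups[client_id] = threading.Event()
    requests.put((client_id, payload))
    wakeups[client_id].wait()  # sleep until the server signals
    del wakeups[client_id]
    return responses.pop(client_id)
```

Tagging each request with the client's identifier is what lets the single server direct its response and wake-up to the right client.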

License

BSD 2-Clause "Simplified" License
