This demo app lets you chat with your own PDFs. It uses the vector search capabilities of Couchbase to augment the OpenAI results in a Retrieval-Augmented Generation (RAG) setup.
For the full tutorial, please visit the Developer Portal.
Note that you need Couchbase Server 7.6 or higher for Vector Search.
You can upload your PDFs with custom data & ask questions about the data in the chat box.
For each question, you will get two answers:
- one using RAG (Couchbase logo)
- one using pure LLM - OpenAI (🤖).
For RAG, we use LangChain, Couchbase Vector Search & OpenAI. We fetch the parts of the PDF relevant to the question using Vector Search & add them as context to the LLM prompt. The LLM is instructed to answer based on the context from the Vector Store.
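The retrieve-then-prompt flow described above can be sketched in plain Python. This is only an illustration, not the app's code: the app uses LangChain and Couchbase Vector Search, whereas the helper names (`retrieve`, `build_prompt`), the three-dimensional toy embeddings, and the prompt wording below are all assumptions made for the example.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(question_vec, chunks, k=2):
    """Return the texts of the k chunks that score highest against the question."""
    ranked = sorted(
        chunks,
        key=lambda chunk: dot(question_vec, chunk["embedding"]),
        reverse=True,
    )
    return [chunk["text"] for chunk in ranked[:k]]


def build_prompt(question, context_chunks):
    """Place the retrieved chunks in the prompt as context for the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy data: real embeddings would have 1536 dimensions, not 3.
chunks = [
    {"text": "Couchbase Server 7.6 adds Vector Search.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Streamlit renders the chat UI.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "The index uses dot_product similarity.", "embedding": [0.7, 0.6, 0.1]},
]
question_vec = [1.0, 0.5, 0.0]  # hypothetical embedding of the question

top = retrieve(question_vec, chunks, k=2)
prompt = build_prompt("Which release adds Vector Search?", top)
print(prompt)
```

The pure-LLM answer corresponds to sending the question alone, without the `Context:` section.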
- Install the dependencies: `pip install -r requirements.txt`
- Copy the `secrets.example.toml` file in the `.streamlit` folder, rename it to `secrets.toml`, and replace the placeholders with the actual values for your environment:

      OPENAI_API_KEY = "<open_ai_api_key>"
      DB_CONN_STR = "<connection_string_for_couchbase_cluster>"
      DB_USERNAME = "<username_for_couchbase_cluster>"
      DB_PASSWORD = "<password_for_couchbase_cluster>"
      DB_BUCKET = "<name_of_bucket_to_store_documents>"
      DB_SCOPE = "<name_of_scope_to_store_documents>"
      DB_COLLECTION = "<name_of_collection_to_store_documents>"
      INDEX_NAME = "<name_of_fts_index_with_vector_support>"
      AUTH_ENABLED = "True/False" # enables authentication for the streamlit app using LOGIN_PASSWORD
      LOGIN_PASSWORD = "<password to access the streamlit app>"
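A quick check that no placeholder was left unreplaced can save a confusing connection error later. The sketch below is an assumption about how such a check could look, not part of the app: `check_secrets` and `REQUIRED_KEYS` are hypothetical names, and the function accepts any mapping (a plain `dict` here; passing Streamlit's `st.secrets` is assumed to work the same way because it behaves like a mapping).

```python
# Keys taken from secrets.example.toml; LOGIN_PASSWORD is only needed
# when AUTH_ENABLED is "True".
REQUIRED_KEYS = (
    "OPENAI_API_KEY", "DB_CONN_STR", "DB_USERNAME", "DB_PASSWORD",
    "DB_BUCKET", "DB_SCOPE", "DB_COLLECTION", "INDEX_NAME",
)


def is_unset(value):
    """True if the value is empty or still an unreplaced <placeholder>."""
    text = str(value).strip()
    return not text or (text.startswith("<") and text.endswith(">"))


def check_secrets(secrets):
    """Return (missing_keys, auth_enabled) for a mapping of secret values."""
    missing = [key for key in REQUIRED_KEYS if is_unset(secrets.get(key, ""))]
    # AUTH_ENABLED is stored as a string in the example file.
    auth_enabled = str(secrets.get("AUTH_ENABLED", "False")).strip().lower() == "true"
    if auth_enabled and is_unset(secrets.get("LOGIN_PASSWORD", "")):
        missing.append("LOGIN_PASSWORD")
    return missing, auth_enabled


example = {
    "OPENAI_API_KEY": "sk-example",
    "DB_CONN_STR": "<connection_string_for_couchbase_cluster>",  # not replaced yet
    "AUTH_ENABLED": "True",
}
missing, auth_enabled = check_secrets(example)
print("missing:", missing)
print("auth enabled:", auth_enabled)
```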
- Create the Search index on the Full Text Service in Couchbase. For this demo, you can import the index definition given below by following the instructions for your deployment.
  - Couchbase Capella
    - Copy the index definition to a new file, `index.json`.
    - Import the file in Capella using the instructions in the documentation.
    - Click on Create Index to create the index.
  - Couchbase Server
    - Click on Search -> Add Index -> Import.
    - Copy the index definition given below into the Import screen.
    - Click on Create Index to create the index.
Here, we are creating the index `pdf_search` on the documents in the `docs` collection within the `shared` scope in the bucket `pdf-docs`. The Vector field is set to `embedding` with 1536 dimensions and the text field is set to `text`. We are also indexing and storing all the fields under `metadata` in the document as a dynamic mapping to account for varying document structures. The similarity metric is set to `dot_product`. If any of these parameters change, please adapt the index accordingly.

    {
      "name": "pdf_search",
      "type": "fulltext-index",
      "params": {
        "doc_config": {
          "docid_prefix_delim": "",
          "docid_regexp": "",
          "mode": "scope.collection.type_field",
          "type_field": "type"
        },
        "mapping": {
          "default_analyzer": "standard",
          "default_datetime_parser": "dateTimeOptional",
          "default_field": "_all",
          "default_mapping": {
            "dynamic": true,
            "enabled": false
          },
          "default_type": "_default",
          "docvalues_dynamic": false,
          "index_dynamic": true,
          "store_dynamic": false,
          "type_field": "_type",
          "types": {
            "shared.docs": {
              "dynamic": true,
              "enabled": true,
              "properties": {
                "embedding": {
                  "enabled": true,
                  "dynamic": false,
                  "fields": [
                    {
                      "dims": 1536,
                      "index": true,
                      "name": "embedding",
                      "similarity": "dot_product",
                      "type": "vector",
                      "vector_index_optimized_for": "recall"
                    }
                  ]
                },
                "text": {
                  "enabled": true,
                  "dynamic": false,
                  "fields": [
                    {
                      "index": true,
                      "name": "text",
                      "store": true,
                      "type": "text"
                    }
                  ]
                }
              }
            }
          }
        },
        "store": {
          "indexType": "scorch",
          "segmentVersion": 16
        }
      },
      "sourceType": "gocbcore",
      "sourceName": "pdf-docs",
      "sourceParams": {},
      "planParams": {
        "maxPartitionsPerPIndex": 64,
        "indexPartitions": 16,
        "numReplicas": 0
      }
    }
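If you adapt the index, it can help to confirm that the vector field name, dimensions, and similarity metric in the definition still match what your embedding model and app expect. The following is a small sketch under that assumption: `vector_fields` is a hypothetical helper, and `sample` is a trimmed copy of the definition above that keeps only the keys the helper reads.

```python
import json


def vector_fields(index_def):
    """List (type_name, field_name, dims, similarity) for every vector field."""
    found = []
    types = index_def["params"]["mapping"]["types"]
    for type_name, type_mapping in types.items():
        for prop in type_mapping.get("properties", {}).values():
            for field in prop.get("fields", []):
                if field.get("type") == "vector":
                    found.append(
                        (type_name, field["name"], field["dims"], field["similarity"])
                    )
    return found


# Trimmed copy of the index definition, for illustration only.
sample = json.loads("""
{
  "name": "pdf_search",
  "params": {
    "mapping": {
      "types": {
        "shared.docs": {
          "properties": {
            "embedding": {
              "fields": [
                {"dims": 1536, "name": "embedding",
                 "similarity": "dot_product", "type": "vector"}
              ]
            },
            "text": {
              "fields": [{"name": "text", "store": true, "type": "text"}]
            }
          }
        }
      }
    }
  }
}
""")

for type_name, name, dims, similarity in vector_fields(sample):
    print(f"{type_name}: field={name} dims={dims} similarity={similarity}")
```

To check your own file, load it with `json.load(open("index.json"))` and pass the result to `vector_fields`.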
- Run the application: `streamlit run chat_with_pdf.py`