
VDK and RAG: ideas for periodically scraping organisational data sources and publishing new data into the vector store


This wiki page is dedicated to outlining and discussing a method for periodically scraping organizational data sources and publishing the new data into a vector store, using Versatile Data Kit (VDK).

Before reading further, make sure you are familiar with the initiative: https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-milestone-25-vector-database-ingestion

Initially, we'll outline the tools and methods that closely align with the core concept of our proof of concept (POC). Following this, we'll introduce additional tools and resources that could further aid in the development and enhancement of the desired functionalities.

For the POC with Confluence and LangChain

To replicate the desired functionality using the example for Confluence data retrieval, there are two main strategies:

  • Tracking updated files using metadata timestamps:

By saving the timestamp of the last data job run in a file and updating it on each run, you can identify updated files: compare the "when" value in a file's metadata against the timestamp of the last run, and if the "when" value is more recent, mark the file as updated. A sketch of this check is shown below.

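A minimal sketch of this check, assuming the "when" metadata value is an ISO 8601 timestamp; the state file name is a hypothetical choice (VDK job properties could store the timestamp instead):

```python
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("last_run_timestamp.txt")  # hypothetical state file

def load_last_run() -> datetime:
    """Timestamp of the previous job run; earliest possible time on the first run."""
    if STATE_FILE.exists():
        return datetime.fromisoformat(STATE_FILE.read_text().strip())
    return datetime.min.replace(tzinfo=timezone.utc)

def save_last_run(ts: datetime) -> None:
    STATE_FILE.write_text(ts.isoformat())

def is_updated(metadata: dict, last_run: datetime) -> bool:
    """True if the document's "when" value is more recent than the last run."""
    # Normalize a trailing "Z" so fromisoformat() accepts it on older Pythons
    when = datetime.fromisoformat(metadata["when"].replace("Z", "+00:00"))
    return when > last_run
```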

Utilize the ConfluenceLoader to load only the pages that have been updated since the last check. This can be done by using the cql parameter in the loader to query for pages updated after a certain date, as in the sketch below.
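A sketch under those assumptions, with placeholder URL and credentials; note that depending on the LangChain version, cql is accepted by load() or by the constructor:

```python
from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url="https://example.atlassian.net/wiki",  # placeholder Confluence URL
    username="user@example.com",               # placeholder credentials
    api_key="API_TOKEN",
)

# CQL filter: only pages modified after the saved last-run date
documents = loader.load(cql='type=page and lastmodified > "2024/01/25"')
```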

  • Identifying deleted files through database comparison:

Regularly take snapshots of the Confluence content using the ConfluenceLoader and compare successive snapshots to identify missing pages, which could indicate deletions (snapshot comparison). See the sketch below.
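A minimal sketch of the comparison, assuming each loaded document exposes its page ID in metadata; load_previous_snapshot and save_snapshot are hypothetical persistence helpers:

```python
# Current snapshot: the set of page IDs present in Confluence right now
current_ids = {doc.metadata["id"] for doc in loader.load(cql="type=page")}

previous_ids = load_previous_snapshot()  # hypothetical helper

# IDs present before but missing now were likely deleted
deleted_ids = previous_ids - current_ids

save_snapshot(current_ids)  # hypothetical helper
```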

Tools that can be useful for further development

  • One of the tools that can be used is the Atlassian Python API:

The Atlassian Python API provides a way to interact programmatically with Atlassian services like Jira, Confluence, Bitbucket, Service Desk, and Xray. It offers modules for each service, allowing users to perform various actions such as managing issues, projects, users, groups, and more. The API supports different authentication methods including OAuth, Kerberos, cookie file, and Personal Access Token. For more information and specific details, you can visit the Atlassian Python API documentation.

Specifically for Confluence:

The Confluence section of the Atlassian Python API offers functionalities for interacting with Confluence, a content collaboration tool. Through this API, users can manage Confluence data, including pages, blogs, comments, spaces, and attachments. It enables operations such as creating, updating, deleting, and retrieving various content types. The API also supports advanced features like searching for content using CQL (Confluence Query Language) and handling page history and versions.
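A minimal sketch, assuming the atlassian-python-api package with placeholder URL and credentials; the exact shape of the search response may vary by Confluence version:

```python
from atlassian import Confluence

confluence = Confluence(
    url="https://example.atlassian.net/wiki",  # placeholder Confluence URL
    username="user@example.com",               # placeholder credentials
    password="API_TOKEN",
)

# CQL search for pages in a space; "DOCS" is a placeholder space key
results = confluence.cql('type=page and space="DOCS"', limit=50)
for hit in results.get("results", []):
    print(hit["content"]["id"], hit["content"]["title"])
```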

To learn more about CQL, refer to: https://developer.atlassian.com/server/confluence/cql-field-reference/

Tracking:

  • Deleted Confluence pages:

The Atlassian Python API does not directly provide a feature for tracking deleted pages in Confluence. Once a page or attachment is deleted, it is typically not easily traceable through the standard API. However, there are indirect ways to monitor deletions, one of which is sketched below.
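The sketch probes previously ingested page IDs and treats a "not found" response as a deletion. The list of known IDs is assumed to be stored by the ingestion job, and depending on the library version a missing page may surface as requests.HTTPError or as the library's own error type:

```python
import requests

def find_deleted_pages(confluence, known_page_ids):
    """Return the subset of known_page_ids that no longer exist in Confluence."""
    deleted = []
    for page_id in known_page_ids:
        try:
            confluence.get_page_by_id(page_id)
        except requests.HTTPError as err:
            if err.response is not None and err.response.status_code == 404:
                deleted.append(page_id)  # page is gone
            else:
                raise  # other failures should not be mistaken for deletions
    return deleted
```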

  • Updated Confluence pages:

You can use this API to retrieve information about recently updated content, such as pages and attachments. It supports methods to fetch a list of content that has been updated within a specified time frame; one option is sketched below.
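A sketch of one such check, assuming expand="version" includes a version.when timestamp per page; ISO timestamps in the same format compare correctly as strings:

```python
# List pages with version metadata; "DOCS" is a placeholder space key
pages = confluence.get_all_pages_from_space("DOCS", expand="version", limit=100)

last_run = "2024-01-25T00:00:00.000Z"  # placeholder last-run timestamp
recently_updated = [p for p in pages if p["version"]["when"] > last_run]
```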

Note: another idea is to use LangChain directly, since its ConfluenceLoader builds on this API as well.
