Powerful-Auto-Researcher (PAR)

👋 Introduction 👋

Hello! I am a beginner developer who is greatly interested in the rapidly emerging field of LLMs.

This is an experimental project in its early stages, created to study LLM prompting and Python on my own.
However, I believe it is quite an intriguing idea, and I would love to receive plenty of feedback and opinions from the brilliant developers on GitHub while honing my development skills!!

Thank you for coming by, and please keep an eye out for future updates!

😉 Help Me & Discuss With Me 😉

If you would like to see the results of the document generation, please leave a comment in the designated issue. I will be happy to provide you with the generated documents for your review.

🌠 Overview 🌠

This project aims to build a more powerful RAG system powered by LangChain, LangGraph, and Anthropic.
While the project is still in its early experimental stages and there are many steps ahead, it is an endeavor that I would like to share with the brilliant developers on GitHub, discussing exciting ideas and possibilities!

Before you start, you can see test cases and results in the Test_Case folder!

❗WARNING❗

This project may consume a very large number of tokens, so be careful!

❗WARNING❗

🚀 HOW TO START 🚀

The main third-party libraries currently used in this project include:

  • 2024.04.06 added -> In the future, ParentDocumentRetriever will be used.

  • YouTube Search (main search engine)
  • Wikipedia (main search engine)
  • arXiv (main search engine)

The main libraries required for running the project can be found in the requirements.txt file.

To clone the project, run the following command:

git clone https://github.com/MIRACLE-cowf/Powerful-Auto-Researcher.git
cd Powerful-Auto-Researcher

Next, install the required libraries:

pip install -r requirements.txt

Set your API keys in a .env file in the PAR folder!
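A minimal sketch of what the .env file might contain is shown below. The exact variable names depend on the PAR code, so treat these as assumptions: ANTHROPIC_API_KEY, TAVILY_API_KEY, and PINECONE_API_KEY are the names the respective SDKs and LangChain integrations usually read from the environment, and LANGCHAIN_API_KEY is only needed if you enable LangSmith tracing.

# .env (example only -- verify the exact names against the PAR code)
ANTHROPIC_API_KEY=sk-ant-...
TAVILY_API_KEY=tvly-...
PINECONE_API_KEY=...
LANGCHAIN_API_KEY=...   # optional, for LangSmith tracing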

To start the project, run:

py -m main

🧐 What is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation (RAG) is a process that optimizes the output of large language models (LLMs) by enabling them to reference reliable knowledge bases outside of their training data sources before generating responses.
LLMs are trained on vast amounts of data and use billions of parameters to generate original results for tasks such as answering questions, translating languages, and completing sentences.
RAG extends the capabilities of already powerful LLMs to the internal knowledge bases of specific domains or organizations, eliminating the need to retrain the model. This is a cost-effective approach to improving LLM results and maintaining relevance, accuracy, and usefulness across a variety of scenarios.
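To make the pattern concrete, here is a minimal, framework-free sketch of the RAG loop described above. It is only an illustration of the general idea, not the PAR implementation: the retrieval step is a deliberately naive keyword match, and a real system would use embeddings and a vector database.

# Minimal RAG sketch: retrieve relevant text, then let the LLM answer using it as context.
# Illustrative only -- PAR itself uses LangChain/LangGraph and a real vector store.
import anthropic

KNOWLEDGE_BASE = [
    "PAR rewrites the user's question into three different search queries.",
    "PAR stores generated documents in a vector store for later reuse.",
    "PAR uses Tavily, Wikipedia, YouTube, and arXiv as search engines.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Naive keyword-overlap scoring; stands in for an embedding-based vector DB search.
    words = set(question.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=512,
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text

print(answer("Which search engines does PAR use?"))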

👀 So, what is the main difference between RAG and this project, PAR (Powerful-Auto-Researcher)?

Concept & Features 💪

This PAR project draws inspiration from various approaches, including the Plan-and-Execute and STORM architectures discussed below.

This PAR project primarily uses Anthropic's Claude 3 family due to their Long Context Window capability.
Although the advent of 'Long Context Window' has reduced the necessity for RAG, this PAR project aims to leverage its advantages beyond the scope of a single document.
Unlike conventional RAG techniques that extract small token amounts from various sources, PAR seeks to retrieve more substantial token content from diverse origins.
By processing a large volume of tokens and sources on a single topic simultaneously, the project strives to achieve higher-quality results.

To this end, the PAR project employs a prompting approach similar to Multi-Query-Retriever, where the user's original question is interpreted from multiple perspectives and reformulated into different question formats on the same topic.
This enables the model to handle a wider range of thought processes and adaptability, which is the primary objective.

The second powerful aspect of PAR lies in its ability to generate documents autonomously by leveraging custom search engines developed by the project's creators.
This allows for extensive searches and the utilization of diverse sources such as Tavily, arXiv, YouTube, Wikipedia, and more.

Users often invest significant time in conducting searches. To address this issue, PAR aims to integrate results from various search engines to automatically generate documents.
These documents are made available for users to access later and are embedded and stored in a vector store for future use in identical or similar content searches.

💻 How does Powerful-Auto-Researcher (PAR) work? 💻

PAR is fundamentally designed to start with the user's question, convert it into multi-queries, and leverage the existing vector DB data sources possessed by the user or developer.

👾 Multi-Query-Generate

To search the vector DB, the user's question is transformed into three different queries serving the following purposes (a rough sketch follows the list):

  1. Regenerate a related search query focusing on the core aspects of the original question.
  2. Regenerate a query using semantically similar phrases to the original question.
  3. Regenerate a new query expanding on the potential intent behind the original question.
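The sketch below shows one way such a multi-query step can be prompted with LangChain and Claude. The prompt wording and the MultiQuery schema are illustrative assumptions, not PAR's actual prompts or classes.

# Sketch: turn one user question into the three query variants described above.
# Prompt wording and schema are assumptions, not PAR's real prompts.
from pydantic import BaseModel, Field
from langchain_anthropic import ChatAnthropic

class MultiQuery(BaseModel):
    core_query: str = Field(description="Query focused on the core aspects of the question")
    similar_query: str = Field(description="Query using semantically similar phrasing")
    intent_query: str = Field(description="Query expanding on the likely intent behind the question")

llm = ChatAnthropic(model="claude-3-haiku-20240307")
generator = llm.with_structured_output(MultiQuery)  # structured output via tool calling

queries = generator.invoke(
    "Rewrite the following question as three different search queries "
    "(core aspects, similar phrasing, expanded intent): "
    "How does retrieval-augmented generation reduce hallucinations?"
)
print(queries.core_query, queries.similar_query, queries.intent_query, sep="\n")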

🔍 Vector DB Search

For each of the three generated queries, a search is conducted in the user's current vector DB, and the search results are retrieved. In the current project, the Pinecone online vector database is used, and a maximum of three search results are obtained for each query.
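A rough sketch of this step with LangChain's Pinecone integration is shown below. The index name and the embedding model are assumptions (this README does not state which embedding model PAR uses), so adjust both to your own setup.

# Sketch: run each generated query against a Pinecone index, up to 3 results per query.
# Index name and embedding model are assumptions; PAR's actual choices may differ.
from langchain_openai import OpenAIEmbeddings        # placeholder embedding model
from langchain_pinecone import PineconeVectorStore   # reads PINECONE_API_KEY from the environment

vectorstore = PineconeVectorStore(
    index_name="par-documents",   # hypothetical index name
    embedding=OpenAIEmbeddings(),
)

queries = [
    "core aspects of the original question",
    "semantically similar rephrasing",
    "query expanded toward the likely intent",
]

for query in queries:
    docs = vectorstore.similarity_search(query, k=3)
    print(query, "->", [d.page_content[:60] for d in docs])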

💯 Grade

The LLM evaluates the vector DB search results for each of the three queries. After the evaluation, similar to a typical RAG system, the results are used to either generate a response (Generation) or move to the Thought-High-Level-Outline stage.
(This part is currently under consideration and design to explore more possibilities.)
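One possible shape of this grading step is sketched below, using structured output to get a yes/no relevance judgment and then routing on it. The Grade schema and the routing rule are assumptions about one plausible design, not PAR's exact logic.

# Sketch: grade retrieved documents and decide the next stage.
# Schema and routing rule are assumptions, not PAR's exact logic.
from pydantic import BaseModel, Field
from langchain_anthropic import ChatAnthropic

class Grade(BaseModel):
    relevant: bool = Field(description="True if the document helps answer the question")

llm = ChatAnthropic(model="claude-3-haiku-20240307")
grader = llm.with_structured_output(Grade)

def route(question: str, documents: list[str]) -> str:
    grades = [
        grader.invoke(f"Question: {question}\n\nDocument: {doc}\n\nIs this document relevant?")
        for doc in documents
    ]
    if any(g.relevant for g in grades):
        return "generate"                    # answer directly from the vector DB hits
    return "thought_high_level_outline"      # fall back to autonomous document generation

print(route("What is RAG?", ["RAG lets an LLM consult external knowledge before answering."]))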

📚 Thought-High-Level-Outline

(This Thought-High-Level-Outline part is also under consideration and design to explore more possibilities and exciting experiments.)
I consider this the most powerful part of the project.
It combines three smaller components implemented as a LangGraph (a minimal sketch follows the list):

  • 📗 Think & Inner monologue: The LLM "thinks" about why the user asked the question and what content the LLM itself would like to see included in the final document if it were the user.
  • 📘 Create a draft & outline: Based on the thoughts from the previous step, a high-level draft of the document is generated. This step focuses on creating a high-level outline rather than a completed draft. (This is quite similar to the Plan stage in the Plan-and-Execute architecture.)
  • 📋 Generate search queries and select/assign tools: Performs the actual "searches" based on the thoughts and sections generated in the previous steps.

These three steps are inspired by the STORM architecture in LangGraph.
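The sketch below wires the three components into a small LangGraph. The node bodies are stubs with made-up prompts and state fields; PAR's actual graph, prompts, and state will differ.

# Sketch: the three Thought-High-Level-Outline components as a minimal LangGraph.
# Prompts and state fields are assumptions; PAR's real graph will differ.
from typing import TypedDict
from langchain_anthropic import ChatAnthropic
from langgraph.graph import END, StateGraph

llm = ChatAnthropic(model="claude-3-haiku-20240307")

class OutlineState(TypedDict, total=False):
    question: str
    thoughts: str
    outline: str
    search_plan: str

def think(state: OutlineState) -> OutlineState:
    # Inner monologue: why was this asked, and what should the final document contain?
    msg = llm.invoke(f"Think step by step about why a user would ask: {state['question']}")
    return {"thoughts": msg.content}

def draft_outline(state: OutlineState) -> OutlineState:
    msg = llm.invoke(f"Write a high-level document outline based on these thoughts:\n{state['thoughts']}")
    return {"outline": msg.content}

def plan_searches(state: OutlineState) -> OutlineState:
    msg = llm.invoke(f"For each section, list search queries and which tool to use:\n{state['outline']}")
    return {"search_plan": msg.content}

graph = StateGraph(OutlineState)
graph.add_node("think", think)
graph.add_node("draft_outline", draft_outline)
graph.add_node("plan_searches", plan_searches)
graph.set_entry_point("think")
graph.add_edge("think", "draft_outline")
graph.add_edge("draft_outline", "plan_searches")
graph.add_edge("plan_searches", END)

app = graph.compile()
result = app.invoke({"question": "How do vector databases work?"})
print(result["search_plan"])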

🦾 Agent

This part is identical to the standard agent provided by LangChain.
However, since an Anthropic version is being used, several code modifications have been made by directly examining and customizing LangChain modules.
Currently, four search engines are being used: Tavily, Wikipedia, YouTube, and arXiv.
If you have the opportunity to inspect the code directly, or simply have a keen interest, your active feedback is highly appreciated!
Please note that the code may change based on updates to LangChain.
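For reference, the stock LangChain pattern for binding these four search tools to a Claude-based tool-calling agent looks roughly like this. This is the generic setup, not PAR's customized agent or output parser, and the system prompt is a made-up placeholder.

# Sketch: the four search tools wired into a generic LangChain tool-calling agent.
# This is the stock LangChain pattern, not PAR's customized agent/output parser.
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_anthropic import ChatAnthropic
from langchain_community.tools import WikipediaQueryRun, YouTubeSearchTool
from langchain_community.tools.arxiv.tool import ArxivQueryRun
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.utilities import ArxivAPIWrapper, WikipediaAPIWrapper
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

tools = [
    TavilySearchResults(max_results=3),                    # needs TAVILY_API_KEY
    WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper()),
    YouTubeSearchTool(),
    ArxivQueryRun(api_wrapper=ArxivAPIWrapper()),
]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a research assistant. Use the tools to gather sources."),  # placeholder prompt
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

llm = ChatAnthropic(model="claude-3-sonnet-20240229")
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

print(executor.invoke({"input": "Recent papers on retrieval-augmented generation"})["output"])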

✉️ Generation

(This stage is also currently under consideration and design to explore more possibilities.)
Similar to the G (Generation) step in a typical RAG pipeline, this stage generates the final response.
If coming from the Grade stage, the response is generated based on the contents searched from the Vector DB.
If coming from the Thought-High-Level-Outline stage, the response is generated based on the document created through the previous stages.

Before generating the response, the user can decide whether to review and store the generated document in the vector DB.
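A compact sketch of how this stage could be structured is shown below: answer from whichever source the flow arrived with, and optionally embed and store a newly generated document for future searches. The function shape, index name, and embedding model are all assumptions, mirroring the earlier Pinecone sketch rather than PAR's actual code.

# Sketch: final Generation step with optional storage of a generated document.
# Function shape, index name, and embedding model are assumptions, not PAR's code.
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings        # placeholder embedding model
from langchain_pinecone import PineconeVectorStore

llm = ChatAnthropic(model="claude-3-haiku-20240307")
vectorstore = PineconeVectorStore(index_name="par-documents", embedding=OpenAIEmbeddings())

def generate(question: str, context: str, source: str, store_document: bool = False) -> str:
    # source is "vector_db" when coming from the Grade stage,
    # or "generated_document" when coming from Thought-High-Level-Outline.
    if source == "generated_document" and store_document:
        vectorstore.add_texts([context], metadatas=[{"question": question}])
    answer = llm.invoke(f"Using this material:\n{context}\n\nAnswer the question: {question}")
    return answer.content

print(generate("What is RAG?",
               "RAG lets an LLM consult external sources before answering.",
               source="generated_document", store_document=True))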

The following diagram illustrates the conceptual flow of this project:

(Diagram: Powerful-Auto-Researcher conceptual flow)

🔍 Questions 🔎

Why use Anthropic?

  • It is my personal judgment, but I found Anthropic to be more finely tunable simply through prompts. Moreover, the Long Context Window was incredibly appealing for this project. However, this does not mean that other LLM models are incompatible. As all interfaces of the project operate based on LangChain, there is a high possibility of it functioning with all LLMs and Chat Models supported by LangChain. (I cannot give a definite answer as I have not tested it out! My apologies!)

Which Anthropic model should I use?

  • In all cases, I used Anthropic's Haiku model! However, when it came to running the agent, models beyond Haiku, such as Sonnet and Opus, performed significantly better. In principle, any model should work, but one thing is certain: I believe the Sonnet or Opus models will give much better performance. (I cannot give a definite answer for this part either, as I have not conducted many tests yet! My apologies!)

✅ Update Log ✅

2024.04.11

  • Now you can find new test cases that have been added after each update in the issue tracker!

2024.04.10

  • New updates are now being managed through the GitHub issue tracker. You can check it!
  • Project changes, bug tracking, feature requests, and more can be effectively tracked and documented using issues.
  • In addition to the update log, various topics will be managed systematically through the use of issues.
  • Please refer to the GitHub issues for future updates and changes.

2024.04.06

  • With the official support for Tool Calling from Anthropic and the corresponding update to LangChain, the function calling has been changed from using XML tags to the official function calling method.
  • With the availability of official tool calling, there have been modifications to the LangSmith prompts and several new classes have been introduced for structured output.
  • In the Retrieve Vector DB stage, I plan to use LangChain's ParentDocumentRetriever beyond simple similarity search, so preliminary work has been done to enable the use of MongoDB.
  • To fetch the Raw content and AI responses from the Tavily API, I have customized the TavilySearchResults function in LangChain.
  • For smooth usage of Anthropic's agent, I have customized the OutputParser provided by LangChain.

2024.04.02

  • Project First Init

🔥 Feedback 🔥

As a beginner developer, I am greatly seeking diverse feedback from the brilliant developers on GitHub!

I would appreciate any kind of feedback, regardless of the type, be it Python syntax, structure, prompting, readme, etc.!

I sincerely ask for your active feedback!

Thank you!
miracle.cowf@gmail.com