scrapy_llm_loader is a Scrapy extension that enables data extraction using LangChain, a framework for building applications with large language models (LLMs) such as OpenAI's GPT models.
- Integration with LangChain: Utilizes LangChain to process HTML content and extract structured data.
- OpenAI GPT Model Support: Compatible with OpenAI's GPT models, providing high-quality content extraction.
- Scrapy Compatibility: Seamlessly integrates with existing Scrapy projects, enhancing them with advanced LLM capabilities.
scrapy_llm_loader can be installed with pip:

```shell
pip install scrapy_llm_loader
```
Before using scrapy_llm_loader, ensure that OPENAI_API_KEY is set in your project's settings.py, or pass it explicitly via the openai_api_key argument when initializing the loader.
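For example, the key can be defined in settings.py; a common pattern (a sketch, not mandated by the library) is to read it from an environment variable rather than hard-coding it:

```python
# settings.py (sketch): pull the key from the environment
# so it never gets committed to version control.
import os

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
```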
To use scrapy_llm_loader in your Scrapy project:
- Import LangChainLoader from scrapy_llm_loader.loader.
- Define your item model using Pydantic.
- Create an instance of LangChainLoader in your spider and use it to load items.
Example:
```python
import scrapy
from pydantic import BaseModel, Field

from scrapy_llm_loader.loader import LangChainLoader


class MyItem(BaseModel):
    name: str = Field(description="name of the product")
    price: str = Field(description="price of the product")
    # Describe other fields here


class MySpider(scrapy.Spider):
    # Your spider definition

    def parse(self, response):
        loader = LangChainLoader(item_class=MyItem, response=response, crawler=self.crawler)
        item = loader.load_item()
        yield item.dict()
```
Generally you don't want to send every HTML element attribute to the LLM. If the data you need is only in the inner text of the elements, you can call the load_item function as follows:

```python
from scrapy_llm_loader.utils import CleaningMode

item = loader.load_item(cleaning_mode=CleaningMode.TEXT_ONLY)
```
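To illustrate what text-only cleaning conceptually achieves (this is an illustration using the standard library, not scrapy_llm_loader's actual implementation), compare the raw markup with what survives when tags and attributes are dropped:

```python
# Illustration only: "text only" cleaning keeps the inner text
# and discards tags and attributes, shrinking the LLM prompt.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Collect non-empty text nodes; tags/attributes are ignored.
        text = data.strip()
        if text:
            self.parts.append(text)


extractor = TextExtractor()
extractor.feed('<div class="price" data-sku="42"><span>$9.99</span></div>')
print(" ".join(extractor.parts))  # → $9.99
```

Attributes like `class` and `data-sku` never reach the model, which usually reduces token usage without losing the data you want extracted.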