scrapy_llm_loader

scrapy_llm_loader is a Scrapy extension that extracts structured data from pages using LangChain, a framework for building applications on large language models (LLMs) such as OpenAI's GPT models.

Features

Integration with LangChain: Utilizes LangChain to process HTML content and extract structured data.
OpenAI GPT Model Support: Compatible with OpenAI's GPT models, providing high-quality content extraction.
Scrapy Compatibility: Seamlessly integrates with existing Scrapy projects, enhancing them with advanced LLM capabilities.

Installation

Install scrapy_llm_loader with pip:

pip install scrapy_llm_loader

Note

Before using scrapy_llm_loader, make sure OPENAI_API_KEY is set in your project's settings.py, or pass the key explicitly through the openai_api_key argument when initializing the loader.
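
A minimal sketch of the settings.py route follows. It assumes the key has been exported as an environment variable with the same name; that lookup is an illustration to keep the secret out of version control, not something the package is known to require.

```python
# settings.py (sketch)
import os

# Assumption: the key is exported in the shell as OPENAI_API_KEY.
# Falls back to an empty string so that importing the settings never fails.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
```

If you prefer not to touch the project settings, the same value can instead be passed as openai_api_key when the loader is created, as described in the note above.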

Usage

To use scrapy_llm_loader in your Scrapy project:

Import LangChainLoader from scrapy_llm_loader.loader.
Define your item model using Pydantic.
Create an instance of LangChainLoader in your spider and use it to load items.

Example:

import scrapy
from pydantic import BaseModel, Field
from scrapy_llm_loader.loader import LangChainLoader


class MyItem(BaseModel):
    # Field descriptions tell the LLM what to extract for each attribute
    name: str = Field(description="name of the product")
    price: str = Field(description="price of the product")
    # Describe other fields here


class MySpider(scrapy.Spider):
    # Your spider definition (name, start_urls, ...)
    def parse(self, response):
        # Build a loader for this response and extract one MyItem from it
        loader = LangChainLoader(item_class=MyItem, response=response, crawler=self.crawler)
        item = loader.load_item()
        # Yield a plain dict, which Scrapy accepts as an item
        yield item.dict()

HTML Cleaning Options

Generally, you don't want to send every HTML element attribute to the LLM. If the data you need is only in the inner HTML of the elements, you can call load_item as follows:

from scrapy_llm_loader.utils import CleaningMode

item = loader.load_item(cleaning_mode=CleaningMode.TEXT_ONLY)
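
To illustrate why stripping markup helps, here is a rough standard-library sketch of what a text-only cleaning pass might look like. It is not the package's implementation: TextOnlyExtractor and html_to_text are hypothetical names introduced only for this example, and the actual behavior of CleaningMode.TEXT_ONLY may differ.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags, attributes, scripts and styles."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored entirely; only track tags whose content is skipped.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment, separated by single spaces."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    '<div class="product" data-id="7"><h1>Widget</h1>'
    '<span style="color:red">$9.99</span><script>var a = 1;</script></div>'
)
cleaned = html_to_text(sample)
```

For this sample the cleaned text is much shorter than the raw markup, which means fewer tokens sent to the model. The trade-off is that any value stored only in an attribute (for example an href or a data-id) is lost.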
