containers-scrapy

Scrapy with Python 3 in a Docker environment, ready to deploy on any dev machine or runtime node.

Commands

Outside of container

A: development mode

docker run -it --rm \
    -v $(pwd):/code \
    -e USER=<your user on host> \
    bluekvirus/scrapy

B: runtime mode

docker run -d --rm \
    -v $(pwd):/code \
    -e SPIDER_NAME=<the spider name> \
    -e INTERVAL=[the repeat interval, default 60s] \
    --log-opt max-size=5m \
    --log-opt max-file=1 \
    bluekvirus/scrapy
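
The image's entrypoint is not shown in this README, so the following is only a rough sketch of what runtime mode amounts to. It assumes the container re-runs `scrapy crawl $SPIDER_NAME` every INTERVAL seconds, and that INTERVAL is given in seconds with an optional `s` suffix. The function names `parse_interval`, `crawl_command` and `run_forever` are hypothetical and do not come from the image.

```python
import os
import subprocess
import time


def parse_interval(raw, default=60):
    """Turn an INTERVAL value such as "90" or "90s" into whole seconds.

    The accepted formats are an assumption, not the image's documented rules.
    """
    if raw is None or not raw.strip():
        return default
    text = raw.strip().lower()
    if text.endswith("s"):
        text = text[:-1]
    seconds = int(text)
    if seconds <= 0:
        raise ValueError("INTERVAL must be a positive number of seconds")
    return seconds


def crawl_command(environ):
    """Build the command for a single crawl from SPIDER_NAME."""
    name = environ.get("SPIDER_NAME")
    if not name:
        raise RuntimeError("SPIDER_NAME is not set")
    return ["scrapy", "crawl", name]


def run_forever(environ=None):
    """Run the spider, sleep for the interval, and repeat until stopped."""
    environ = os.environ if environ is None else environ
    command = crawl_command(environ)
    interval = parse_interval(environ.get("INTERVAL"))
    while True:
        # check=False: a failed crawl should not stop later runs.
        subprocess.run(command, check=False)
        time.sleep(interval)
```

Calling `run_forever()` inside a Scrapy project directory would start the loop; the two helper functions can be checked on their own without Scrapy installed.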

Note that the spider or its pipeline might require further environment variables to operate; be sure to provide each one with an additional -e (e.g. SLACK_WEBHOOK and REPORT_RATIO_THRESHOLD in the nv-gpu-nowinstock spider's pipeline).
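
A pipeline that depends on such variables can fail early with a clear message when one is missing. The helper below is a minimal sketch under that assumption; `require_env` is a hypothetical name and is not taken from the nv-gpu-nowinstock pipeline.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables as a dict, or raise naming the missing ones.

    Hypothetical helper: the real pipeline may read its settings differently.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "missing environment variables (pass each with -e): " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}
```

Inside a pipeline this could be called once at start-up, for example `settings = require_env(["SLACK_WEBHOOK", "REPORT_RATIO_THRESHOLD"])`, with numeric values converted afterwards (e.g. `float(settings["REPORT_RATIO_THRESHOLD"])`).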

Real runtime cmd example

sudo docker run -d --log-opt max-size=5m --log-opt max-file=1 -v $(pwd):/code -e SPIDER_NAME=nv-gpu-nowinstock -e SLACK_WEBHOOK=https://hooks.slack.com/services/.../.../y0D46F4443McqHW8PRjUitNS -e REPORT_RATIO_THRESHOLD=1.35 bluekvirus/scrapy

Repository branching is not supported at the moment; the cloned Scrapy repository always uses the master branch.

Inside of container (dev mode only)

A: interactive debug using fetch(request) and response.css().extract()

scrapy shell <url>

B: run spider by file

scrapy runspider <crawler.py> [-o items.json]

C: run spider defined in current Scrapy project crawlers (after startproject, genspider)

scrapy crawl <crawler by name>
