Simple Website Screenshots as a Service (Django, Selenium, Docker, Docker-compose)

simplecto/screenshots

Purpose

The purpose of this project is to explore and experiment with what it takes to make a website screen-shotting tool. At first it may seem like an easy task, but it becomes complex once you try.

NOTE: If you just want a tool that "just works" then I suggest you try any of the capable services linked below.

Common problems

  • JavaScript-heavy pages (almost all pages these days); many sites use JavaScript to load content after the page has downloaded into the browser. You therefore need a modern JavaScript engine to parse and execute those extra instructions and get the content as it was intended to be seen by humans.
  • Geography-restricted content; some sites in the US have blocked visitors from Europe because of GDPR. Do you accept this, or is there a way to work around it?
  • Bot and automation detection schemes; some sites use services that block automated processes from collecting content. This includes taking screenshots.
  • Improperly configured domain names, SSL/TLS encryption certificates, and other network-related issues
  • Nefarious website owners and hacked sites that attempt to exploit the web browser to mine crypto-currencies. This puts an added load on your resources and can significantly slow your render-times.
  • Taking too many screenshots at a time may overload the server and cause timeouts or failure to load pages.
  • Temporary network or website failure; if the problem is on the site's end, how will we know that and schedule another attempt later?
  • People using the service as a de facto proxy (e.g. pranksters downloading porn at their schools or in public places)
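
For the temporary-failure problem above, one common approach is to classify a failure as transient or permanent and retry transient ones with an increasing delay. The following is a minimal sketch, not something taken from this project: the function names, the 60-second base delay, the one-hour cap, the five-attempt limit, and the rule for what counts as transient are all assumptions chosen for illustration.

```python
import random


def is_transient(status_code):
    """Guess whether a failed page load is worth retrying later.

    Assumption: `status_code` is None when the request timed out or the
    connection failed, otherwise the HTTP status returned by the site.
    """
    if status_code is None:
        return True  # timeout or network error: may be temporary
    return status_code == 429 or 500 <= status_code <= 599


def retry_delay_seconds(attempt, base=60, cap=3600, jitter=0.0):
    """Delay before retry number `attempt` (1-based): doubles each time, up to `cap`."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    delay = min(cap, base * 2 ** (attempt - 1))
    # Optional jitter spreads retries out so failed jobs do not all retry at once.
    return delay + random.uniform(0, jitter * delay)


def should_retry(attempt, status_code, max_attempts=5):
    """Retry only transient failures, and only up to `max_attempts` tries."""
    return attempt < max_attempts and is_transient(status_code)
```

A worker could store the attempt count and the next attempt time alongside each screenshot record, so a failed job is picked up again by the normal polling loop once its delay has passed.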

Requirements

My development environment is on macOS, so Homebrew and PyCharm are my friends here.

  • Python 3.x (stable) in a virtual environment (this is the only version I'm working with)
  • Selenium/geckodriver/chrome-driver installed via Homebrew: brew install geckodriver
  • Docker
  • Postgres installed via Homebrew.

I don't use Docker on my development machine because I have not figured out how to get PyCharm's awesome debugger working well inside Docker containers. If you can, ping me.

Getting started

  1. Check out the repo
  2. Install a local virtual environment: python -m venv venv/
  3. Jump into venv/ with: source venv/bin/activate
  4. Install requirements: pip install -r requirements.txt
  5. Create the Postgres database for the project: CREATE DATABASE screenshots
  6. Copy env.sample to env in the root source folder
  7. Check / update values in the env file if needed
  8. Install Selenium geckodriver for your platform: brew install geckodriver
  9. Migrate the database: cd src && ./manage.py migrate
  10. Create the cache table: cd src && ./manage.py createcachetable
  11. Create the superuser: cd src && ./manage.py createsuperuser
  12. Start the worker: cd src && ./manage.py screenshot_worker_ff
  13. Finally, start the webserver: cd src && ./manage.py runserver 0.0.0.0:8000

Open a browser at http://localhost:8000 and see the screenshot app in all its glory.

System Architecture

[Figure: system architecture diagram]

Web process

Django runs as usual in either development mode or inside gunicorn (for production).

Worker Processes

One or more workers run as independent processes, in parallel with the webserver process. They connect to the database and poll for new work on an interval. This pattern obviates the need for Celery, Redis, RabbitMQ, or other complicated moving parts in the system.

The worker processes work like this:

  1. poll database for new screenshots to make
  2. find a screenshot, mark it as pending
  3. launch Selenium and take a screenshot of the resulting page (up to a 60-second time limit)
  4. save screenshot to database
  5. shut down the Selenium browser
  6. sleep
  7. repeat
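
The steps above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the project's actual worker: it uses an in-memory-friendly sqlite3 table instead of Django and Postgres so that it runs on its own, and the table layout, the status values, and the names claim_next, process_one, and run_worker are all hypothetical. The browser step is passed in as a `capture` callable that takes a URL and returns image bytes.

```python
import sqlite3
import time

# Hypothetical table; status moves from 'new' to 'pending' to 'done' or 'failed'.
SCHEMA = """
CREATE TABLE screenshot (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'new',
    image BLOB
)
"""


def claim_next(conn):
    """Steps 1-2: find one new job and mark it pending. Returns (id, url) or None."""
    row = conn.execute(
        "SELECT id, url FROM screenshot WHERE status = 'new' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Re-checking the status in WHERE means the claim fails cleanly if
    # another worker marked the same row pending first.
    cur = conn.execute(
        "UPDATE screenshot SET status = 'pending' WHERE id = ? AND status = 'new'",
        (row[0],),
    )
    conn.commit()
    return row if cur.rowcount == 1 else None


def process_one(conn, capture):
    """Steps 3-5: take the screenshot and save it. Returns True if a job was handled."""
    job = claim_next(conn)
    if job is None:
        return False
    job_id, url = job
    try:
        image = capture(url)  # browser start, page load, and shutdown happen in here
        conn.execute(
            "UPDATE screenshot SET status = 'done', image = ? WHERE id = ?",
            (image, job_id),
        )
    except Exception:
        conn.execute("UPDATE screenshot SET status = 'failed' WHERE id = ?", (job_id,))
    conn.commit()
    return True


def run_worker(conn, capture, poll_interval=5.0, max_iterations=None):
    """Steps 6-7: sleep, then repeat (forever unless max_iterations is set)."""
    done = 0
    while max_iterations is None or done < max_iterations:
        process_one(conn, capture)
        time.sleep(poll_interval)
        done += 1
```

In a real worker, `capture` would presumably wrap Selenium: `driver.set_page_load_timeout(60)` enforces the time limit, `driver.get_screenshot_as_png()` returns the image bytes, and `driver.quit()` shuts the browser down. With several workers on Postgres, Django's `select_for_update(skip_locked=True)` is one way to make the claim step safe against two workers picking the same row.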

But where are images stored?

In the database! Now, before you lose it -- I know what many of you will say about storing images in the database. I have linked to the StackOverflow here:

My rationale is this:

  • All content lives in the database, so there are no syncing issues between the data (screenshots) and the metadata (database).
  • Images will be smallish because they are compressed screenshots of no more than 1 MB (often far less). But we will need to run many iterations and save as much metadata about the screenshots as possible to really know.
  • Thumbnails will be stored in cache (also a database table), but get purged after 30 days.
  • Today's compute, network, and storage capacities are so big that 1 TB is no longer considered unreasonable. This means that if we build up a screenshot database of 1 TB, then that is a good problem to have and we can re-architect from there.

Note: This is a hypothesis, and I am willing to change my mind if this does not work out.
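
To put rough numbers on the rationale above, here is a back-of-the-envelope calculation. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the 250 KB average is a made-up figure for illustration, not a measurement from this project.

```python
MB = 10**6   # decimal megabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def screenshots_that_fit(budget_bytes, avg_size_bytes):
    """How many screenshots of a given average size fit in a storage budget."""
    return budget_bytes // avg_size_bytes


# Worst case from the text: every screenshot is a full 1 MB.
worst_case = screenshots_that_fit(TB, 1 * MB)   # 1,000,000 screenshots

# Hypothetical average of 250 KB per screenshot.
typical = screenshots_that_fit(TB, 250_000)     # 4,000,000 screenshots

# The 30-day thumbnail purge expressed in seconds, the unit cache timeouts
# are usually given in.
THUMBNAIL_TTL_SECONDS = 30 * 24 * 60 * 60       # 2,592,000 seconds
```

So even at the 1 MB upper bound, a 1 TB database holds about a million screenshots before a re-architecture becomes necessary.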

Recommended reading on the subject

Alternative Services

Thank-yous

Contributing

Please fork and submit pull requests if you are inspired to do so. Issues are open as well.