Skip to content
This repository has been archived by the owner on Oct 26, 2022. It is now read-only.

Python program with SQL queries to extract information from a relational database

License

Notifications You must be signed in to change notification settings

br3ndonland/udacity-fsnd-sql-logs

Repository files navigation

SQL database logs analysis

Udacity logo

Udacity Full Stack Developer Nanodegree program

Brendon Smith (br3ndonland)

code license

Table of Contents

Project description

Databases can contain trillions of rows and columns, something that spreadsheet programs like Google Sheets and Microsoft Excel simply can't handle. Structured Query Language (SQL) is used to query (ask) the computer to retrieve information from databases.

I completed this project as part of the Udacity Full Stack Developer Nanodegree program. Students were asked to write SQL queries to extract information from a database of news articles with over a million rows. The SQL queries contain advanced joins, selection, and calculation features. Another general-purpose computing language called Python was used to group and control the queries.

(Back to TOC)

Repository contents

Required files

Other files

  • db/ - files that set up the PostgreSQL database
  • img/ - images used in repository
  • info/ -
  • sql/ - database queries in SQL format
  • docker-compose.yml and Dockerfile - files used by Docker to set up containers for the application.
  • Pipfile and Pipfile.lock - files used by Pipenv to set up a Python virtual environment.
  • Vagrantfile - file used to configure a Vagrant virtual environment.

(Back to TOC)

Installation

The application can be run by setting up either a virtual environment or a virtual machine. Instructions for each option are provided below.

Run with Pipenv virtual environment

Clone repository

~cd /path/to/repo
~/path/to/repo
❯ git clone git@github.com:br3ndonland/udacity-fsnd-sql-logs.git

Install dependencies

Install dependencies with a package manager of choice, such as Homebrew on macOS.

brew install python pipenv postgresql
brew cask install postico

Major dependencies:

  • Python
  • Pipenv
  • PostgreSQL or the Postgres app
  • Postico (optional): helpful GUI that can complement Postgres. Postico needs a Postgres server to connect to local databases.

Install Pipenv virtual environment

  • Pipenv was used to manage the development virtual environment for this project.

    • Install Pipenv with a system package manager like Homebrew, or with the Python package manager pip.

    • Use Pipenv to install the virtual environment from the Pipfile with pipenv install --dev.

      • The --dev flag was used to accommodate the Black autoformatter, which is considered a pre-release package (see code style below).

      • When generating the initial Pipfile containing the Black dev package, the --pre flag added a line to the Pipfile to allow pre-release packages. (TOML format):

        [pipenv]
        allow_prereleases = true
      • Further information can be found in the Pipenv docs.

  • VSCode can be configured to recognize the Pipenv virtual environment. See Using Python environments in VS Code.

    • Command Palette -> Python: Select Interpreter. Select virtual environment.
    • Command Palette -> Python: Create Terminal. Creates a terminal and automatically activates the virtual environment. If the virtual environment is set as the interpreter for the workspace, new terminal windows will automatically start in the virtual environment.

Set up database

Postgres should be installed as described above. Postico is a helpful GUI for PostgreSQL that can complement (but not replace) Postgres. Postico needs a Postgres server to connect to local databases.

Unzip news data
  • Unzip data/newsdata.zip. If the file is not present, download and save in the data/ directory.

  • The zipped news database has been added to this repository with Git LFS. To work with Git LFS:

    # Install Git LFS on your system. For macOS:
    ~
    ❯ brew install git-lfs
    ~cd ~/path/to/repo
    # Create .gitattributes and tell Git LFS what to track:
    ~/path/to/repo
    ❯ touch .gitattributes
    ~/path/to/repo
    ❯ git lfs track "*.zip"
    Tracking "*.zip"
    # Commit .gitattributes
    ~/path/to/repo
    ❯ git add .gitattributes
    ~/path/to/repo
    ❯ git commit -m "Initialize Git LFS"
    # Initialize Git LFS, once per repo:
    ~/path/to/repo
    ❯ git lfs install
    Updated git hooks.
    Git LFS initialized.
    ~/path/to/repo
    ❯ git add --all
    ~/path/to/repo
    ❯ git commit
    ~/path/to/repo
    ❯ git lfs ls-files
    SHA * data/newsdata.zip
    ~/path/to/repo
    ❯ git push
    # Should see something like "Uploading LFS objects"
  • It is important to verify that Git LFS is properly initialized before committing files to be tracked with LFS. It is very difficult to get files un-tracked by Git after they have been committed. If you get into this situation, you can try git lfs migrate import, as described in this GitHub issue, but it might be easier to just locally undo and create a new branch.

Create user and database
  • Enter PostgreSQL with psql on the command line.
  • Create vagrant user:
    • newsdata.sql is configured for a PostgreSQL user vagrant.
    • Without Vagrant, you may need to run createdb before starting the psql CLI.
    • When using the Vagrant virtual machine, this step is specified in the Vagrantfile and performed during vagrant up.
  • Create news database.
  • Grant user access to database.
  • Load newsdata.sql into database.
~
❯ psql
psql (11.3)
Type "help" for help.

user=# CREATE USER vagrant;
user=# CREATE DATABASE news;
user=# GRANT ALL PRIVILEGES ON DATABASE "news" to vagrant;
user=# \q

~
❯ cd /path/to/repo
~/path/to/repo
❯ psql -d news -f db/data/newsdata.sql

Run queries

The queries can be run by psql, Postico, or Python.

Run queries with psql
~/path/to/repo
❯ psql -U vagrant -d news -a -f ./sql/1-most-popular-articles.sql
❯ psql -U vagrant -d news -a -f ./sql/2-most-popular-authors.sql
❯ psql -U vagrant -d news -a -f ./sql/3-http-request-error-rate.sql
Run queries with Postico
Set up database connection

Setting up a database connection in Postico

Connect to database

Run queries

Running SQL queries with Postico

Run queries with Python
~/path/to/repo
❯ pipenv shell
udacity-fsnd-sql-logs-hash ❯ python logs.py

Run with Docker Compose

Docker background

  • Docker is a technology for running lightweight virtual machines called containers.

    • An image is the executable set of files read by Docker.
    • A container is a running image.
    • The Dockerfile tells Docker how to build the container.
  • VSCode has built-in Docker features. See Working with Docker in VSCode and the VSCode tutorial on deploying Python with Docker.

  • To install Docker tools locally:

  • To build a Docker image and run the container:

    cd path/to/repo
    docker build . -t app:latest
    docker run -d -p 80:80 app:latest
    • -t web tells Docker to name the image app. Adding . builds from the current directory.
    • -d runs the container in detached mode. Docker will display the container hash and return the terminal prompt.
    • -p 80:80 maps the http port 80 from your local machine to port 1040 in the container.
    • A tag can be specified with name:tag, otherwise, the tag latest will be used.
  • Expand this details element for more useful Docker commands.
    # List images
    docker image ls
    
    # List containers
    docker container ls
    
    # Inspect a container (web in this example) and return the IP Address
    docker inspect web | grep IPAddress
    
    # Stop a container
    docker container stop # container hash
    
    # Remove a downloaded image
    docker image rm # image hash or name
    
    # Remove a container
    docker container rm # container hash
    
    # Prune stopped containers (wipes them and resets their state)
    docker container prune
    
    # Connect to running container (sort of like SSH)
    docker ps # get ID/hash of container you want to connect to
    docker exec -it [ID] /bin/bash
    # Or, connect as root:
    docker exec -u 0 -it [ID] /bin/bash
    
    # Copy file to/from container:
    docker cp [container_name]:/path/to/file destination.file

Pipenv and Docker

  • There are a few adjustments needed to enable Pipenv and Docker to work together.

  • Pipenv must first be installed with pip.

  • Python dependencies are then installed from Pipfile.lock.

    • Docker containers don't need virtual environments, so the --system flag is used to install packages into the container's global Python installation. Thus, it is not necessary to enter the virtual environment with pipenv shell before starting the application.
    • The --deploy flag causes the build to fail if the Pipfile.lock is out of date.
    • The --ignore-pipfile flag tells Pipenv to use the Pipfile.lock for installation instead of the Pipfile.
    # Pull an image: alpine images are tightly controlled and small in size
    FROM python:3.7-alpine
    LABEL app=news
    WORKDIR /app
    # Copy the directory to /app in the container
    COPY . /app
    # Install psycopg2 PostgreSQL dependencies
    RUN apk update; apk add build-base postgresql postgresql-dev libpq
    # Install Pipenv
    RUN python -m pip install pipenv
    # Install packages from Pipfile.lock, configured for Docker deployments
    RUN pipenv install --system --deploy --ignore-pipfile

Docker Compose

TODO. A preliminary Docker Compose file has been added, but the application is not fully functional within Docker yet.

Run with Vagrant virtual machine

A virtual machine can be used to run the code from an operating system with a defined configuration. The virtual machine has all the dependencies needed to run the application.

Install virtual machine

Install dependencies
  • HashiCorp Vagrant

    • Software that provides an operating system in a defined configuration, allowing it to run identically across many personal computers.
  • Vagrant and VirtualBox can be installed with Homebrew Cask:

    brew cask install vagrant
Install Vagrant configuration
  • Vagrant creates a virtual environment using a provider.

    • Oracle VirtualBox is the most common provider used. It is also available as a Homebrew Cask.
    • Docker can be used as a Vagrant provider instead of VirtualBox.
    • The provider normally installs the operating system using boxes. The Docker provider does not require a config.vm.box setting. Rather, it pulls an image from a container registry like Docker Hub.
  • Vagrant is configured with a Vagrantfile.

    • This project was originally completed with the Udacity virtual machine configuration, a Git repository from Udacity that contains a Vagrantfile to configure Vagrant. If using the Udacity configuration, note that some of the necessary Python modules in the Udacity virtual machine configuration are only included for Python 2, and not Python 3. If needed, install the modules with pip.
    • A Vagrantfile was later added directly to this repo, with Docker as a provider.
  • Vagrant reads the Vagrantfile, and uses the provider to run the virtual environment.

  • Install the Udacity Vagrant virtual machine using the instructions in logs-udacity.md. Summary:

    ~cd /path/to/repo
    ~/path/to/repo
    ❯ git clone git@github.com:udacity/fullstack-nanodegree-vm.git
    ~/path/to/repo
    ❯ cd fullstack-nanodegree-vm/vagrant
    ~/path/to/repo/fullstack-nanodegree-vm/vagrant
    ❯ git clone git@github.com:br3ndonland/udacity-fsnd-sql-logs.git
    ~/path/to/repo/fullstack-nanodegree-vm/vagrant
    ❯ vagrant up
    ~/path/to/repo/fullstack-nanodegree-vm/vagrant
    ❯ vagrant ssh
    ❯ vagrant@vagrant:~$ cd /vagrant/udacity-fsnd-sql-logs
    vagrant@vagrant:/vagrant/logs$ python3 logs.py

(Back to TOC)

A tale of three queries

I implemented three SQL queries:

1. Most popular articles

What are the most popular three articles of all time?

Returns a sorted list of the three most highly accessed articles, with the top article first.

Query:

select title, num from
    (select substr(path, 10), count(*) as num from log
    where path !='/' group by path)
as hits, articles where substr = slug order by num desc limit 3;

Terminal output:

              title               | views
----------------------------------+--------
 Candidate is jerk, alleges rival | 338647
 Bears love berries, alleges bear | 253801
 Bad things gone, say good people | 170098
(3 rows)

2. Most popular authors

Who are the most popular article authors of all time?

Returns a sorted list of the most popular article authors, with the most popular author at the top.

Query:

select name, sum(views) as total_views from
    (select name, author, title, views from
        (select substr(path, 10), count(*) as views from log
            where path !='/' group by path)
        as hits, articles, authors
        where substr = slug and author = authors.id
        order by views desc)
    as threetables group by name order by total_views desc;

Terminal output:

          name          | total_views
------------------------+-------------
 Ursula La Multa        |      507594
 Rudolf von Treppenwitz |      423457
 Anonymous Contributor  |      170098
 Markoff Chaney         |       84557
(4 rows)

3. HTTP request error rate

On which days did more than 1% of requests lead to errors?

Returns a list of days on which >1% of HTTP requests resulted in HTTP error codes.

Query:

select errdate, http_requests, http_404,
100.0 * http_404 / http_requests as errpct from
    (select date_trunc('day', time) as reqdate, count(*)
    as http_requests from log group by reqdate)
    as requests,
    (select date_trunc('day', time) as errdate, count(*)
    as http_404 from log where status = '404 NOT FOUND'
    group by errdate)
    as errors
where reqdate = errdate
and errors.http_404 > 0.01 * requests.http_requests
order by errdate desc;

Terminal output:

        errdate         | http_requests | http_404 |       errpct
------------------------+---------------+----------+--------------------
 2016-07-17 00:00:00+00 |         55907 |     1265 | 2.2626862468027260
(1 row)

(Back to TOC)