SQL database logs analysis

Udacity Full Stack Developer Nanodegree program

Project description

Databases can contain trillions of rows and columns, something that spreadsheet programs like Google Sheets and Microsoft Excel simply can't handle. Structured Query Language (SQL) is used to query (ask) the computer to retrieve information from databases.

I completed this project as part of the Udacity Full Stack Developer Nanodegree program. Students were asked to write SQL queries to extract information from a database of news articles with over a million rows. The SQL queries contain advanced joins, selection, and calculation features. Another general-purpose computing language called Python was used to group and control the queries.

(Back to TOC)

Repository contents

Required files

logs.py - main program code
logs-output.txt - example code output
README.md - this file, a concise description of the project

Other files

db/ - files that set up the PostgreSQL database
img/ - images used in repository
info/ -
- logs-methods.md - computational narrative explaining how I wrote the code for the project
- logs-review.md - Udacity code review
- logs-udacity.md - project description, installation instructions, and rubric from Udacity
sql/ - database queries in SQL format
docker-compose.yml and Dockerfile - files used by Docker to set up containers for the application.
Pipfile and Pipfile.lock - files used by Pipenv to set up a Python virtual environment.
Vagrantfile - file used to configure a Vagrant virtual environment.

(Back to TOC)

Installation

The application can be run by setting up either a virtual environment or a virtual machine. Instructions for each option are provided below.

Run with Pipenv virtual environment

Clone repository

~
❯ cd /path/to/repo
~/path/to/repo
❯ git clone git@github.com:br3ndonland/udacity-fsnd-sql-logs.git

Install dependencies

Install dependencies with a package manager of choice, such as Homebrew on macOS.

brew install python pipenv postgresql
brew cask install postico

Major dependencies:

Python
Pipenv
PostgreSQL or the Postgres app
Postico (optional): helpful GUI that can complement Postgres. Postico needs a Postgres server to connect to local databases.

Install Pipenv virtual environment

Pipenv was used to manage the development virtual environment for this project.
- Install Pipenv with a system package manager like Homebrew, or with the Python package manager pip.
- Use Pipenv to install the virtual environment from the Pipfile with pipenv install --dev.
  - The --dev flag was used to accommodate the Black autoformatter, which is considered a pre-release package (see code style below).
  - When generating the initial Pipfile containing the Black dev package, the --pre flag added a line to the Pipfile to allow pre-release packages. (TOML format):
```
[pipenv]
allow_prereleases = true
```
  - Further information can be found in the Pipenv docs.
VSCode can be configured to recognize the Pipenv virtual environment. See Using Python environments in VS Code.
- Command Palette -> Python: Select Interpreter. Select virtual environment.
- Command Palette -> Python: Create Terminal. Creates a terminal and automatically activates the virtual environment. If the virtual environment is set as the interpreter for the workspace, new terminal windows will automatically start in the virtual environment.

Set up database

Postgres should be installed as described above. Postico is a helpful GUI for PostgreSQL that can complement (but not replace) Postgres. Postico needs a Postgres server to connect to local databases.

Unzip news data

Unzip data/newsdata.zip. If the file is not present, download and save in the data/ directory.

The zipped news database has been added to this repository with Git LFS. To work with Git LFS:

# Install Git LFS on your system. For macOS:
~
❯ brew install git-lfs
~
❯ cd ~/path/to/repo
# Create .gitattributes and tell Git LFS what to track:
~/path/to/repo
❯ touch .gitattributes
~/path/to/repo
❯ git lfs track "*.zip"
Tracking "*.zip"
# Commit .gitattributes
~/path/to/repo
❯ git add .gitattributes
~/path/to/repo
❯ git commit -m "Initialize Git LFS"
# Initialize Git LFS, once per repo:
~/path/to/repo
❯ git lfs install
Updated git hooks.
Git LFS initialized.
~/path/to/repo
❯ git add --all
~/path/to/repo
❯ git commit
~/path/to/repo
❯ git lfs ls-files
SHA * data/newsdata.zip
~/path/to/repo
❯ git push
# Should see something like "Uploading LFS objects"

It is important to verify that Git LFS is properly initialized before committing files to be tracked with LFS. It is very difficult to get files un-tracked by Git after they have been committed. If you get into this situation, you can try git lfs migrate import, as described in this GitHub issue, but it might be easier to just locally undo and create a new branch.

Create user and database

Enter PostgreSQL with psql on the command line.
Create vagrant user:
- newsdata.sql is configured for a PostgreSQL user vagrant.
- Without Vagrant, you may need to run createdb before starting the psql CLI.
- When using the Vagrant virtual machine, this step is specified in the Vagrantfile and performed during vagrant up.
Create news database.
Grant user access to database.
Load newsdata.sql into database.

~
❯ psql
psql (11.3)
Type "help" for help.

user=# CREATE USER vagrant;
user=# CREATE DATABASE news;
user=# GRANT ALL PRIVILEGES ON DATABASE "news" to vagrant;
user=# \q

~
❯ cd /path/to/repo
~/path/to/repo
❯ psql -d news -f db/data/newsdata.sql

Run queries

The queries can be run by psql, Postico, or Python.

Run queries with `psql`

~/path/to/repo
❯ psql -U vagrant -d news -a -f ./sql/1-most-popular-articles.sql
❯ psql -U vagrant -d news -a -f ./sql/2-most-popular-authors.sql
❯ psql -U vagrant -d news -a -f ./sql/3-http-request-error-rate.sql

Run queries with Postico

Set up database connection

Connect to database

Run queries

Run queries with Python

~/path/to/repo
❯ pipenv shell
udacity-fsnd-sql-logs-hash ❯ python logs.py

Run with Docker Compose

Docker background

Docker is a technology for running lightweight virtual machines called containers.
- An image is the executable set of files read by Docker.
- A container is a running image.
- The Dockerfile tells Docker how to build the container.
VSCode has built-in Docker features. See Working with Docker in VSCode and the VSCode tutorial on deploying Python with Docker.
To install Docker tools locally:
- Ubuntu Linux: follow the instructions for Ubuntu Linux, making sure to follow the postinstallation steps to activate the Docker daemon.
- macOS and Windows: install Docker Desktop (available via Homebrew with brew cask install docker).
To build a Docker image and run the container:
```
cd path/to/repo
docker build . -t app:latest
docker run -d -p 80:80 app:latest
```
- -t web tells Docker to name the image app. Adding . builds from the current directory.
- -d runs the container in detached mode. Docker will display the container hash and return the terminal prompt.
- -p 80:80 maps the http port 80 from your local machine to port 1040 in the container.
- A tag can be specified with name:tag, otherwise, the tag latest will be used.

Expand this details element for more useful Docker commands.

# List images
docker image ls

# List containers
docker container ls

# Inspect a container (web in this example) and return the IP Address
docker inspect web | grep IPAddress

# Stop a container
docker container stop # container hash

# Remove a downloaded image
docker image rm # image hash or name

# Remove a container
docker container rm # container hash

# Prune stopped containers (wipes them and resets their state)
docker container prune

# Connect to running container (sort of like SSH)
docker ps # get ID/hash of container you want to connect to
docker exec -it [ID] /bin/bash
# Or, connect as root:
docker exec -u 0 -it [ID] /bin/bash

# Copy file to/from container:
docker cp [container_name]:/path/to/file destination.file

Pipenv and Docker

There are a few adjustments needed to enable Pipenv and Docker to work together.
Pipenv must first be installed with pip.

Python dependencies are then installed from Pipfile.lock.

Docker containers don't need virtual environments, so the --system flag is used to install packages into the container's global Python installation. Thus, it is not necessary to enter the virtual environment with pipenv shell before starting the application.
The --deploy flag causes the build to fail if the Pipfile.lock is out of date.
The --ignore-pipfile flag tells Pipenv to use the Pipfile.lock for installation instead of the Pipfile.

# Pull an image: alpine images are tightly controlled and small in size
FROM python:3.7-alpine
LABEL app=news
WORKDIR /app
# Copy the directory to /app in the container
COPY . /app
# Install psycopg2 PostgreSQL dependencies
RUN apk update; apk add build-base postgresql postgresql-dev libpq
# Install Pipenv
RUN python -m pip install pipenv
# Install packages from Pipfile.lock, configured for Docker deployments
RUN pipenv install --system --deploy --ignore-pipfile

Docker Compose

TODO. A preliminary Docker Compose file has been added, but the application is not fully functional within Docker yet.

Run with Vagrant virtual machine

A virtual machine can be used to run the code from an operating system with a defined configuration. The virtual machine has all the dependencies needed to run the application.

Install virtual machine

Install dependencies

HashiCorp Vagrant
- Software that provides an operating system in a defined configuration, allowing it to run identically across many personal computers.
Vagrant and VirtualBox can be installed with Homebrew Cask:
```
brew cask install vagrant
```

Install Vagrant configuration

Vagrant creates a virtual environment using a provider.
- Oracle VirtualBox is the most common provider used. It is also available as a Homebrew Cask.
- Docker can be used as a Vagrant provider instead of VirtualBox.
- The provider normally installs the operating system using boxes. The Docker provider does not require a config.vm.box setting. Rather, it pulls an image from a container registry like Docker Hub.
Vagrant is configured with a Vagrantfile.
- This project was originally completed with the Udacity virtual machine configuration, a Git repository from Udacity that contains a Vagrantfile to configure Vagrant. If using the Udacity configuration, note that some of the necessary Python modules in the Udacity virtual machine configuration are only included for Python 2, and not Python 3. If needed, install the modules with pip.
- A Vagrantfile was later added directly to this repo, with Docker as a provider.
Vagrant reads the Vagrantfile, and uses the provider to run the virtual environment.

Install the Udacity Vagrant virtual machine using the instructions in logs-udacity.md. Summary:

~
❯ cd /path/to/repo
~/path/to/repo
❯ git clone git@github.com:udacity/fullstack-nanodegree-vm.git
~/path/to/repo
❯ cd fullstack-nanodegree-vm/vagrant
~/path/to/repo/fullstack-nanodegree-vm/vagrant
❯ git clone git@github.com:br3ndonland/udacity-fsnd-sql-logs.git
~/path/to/repo/fullstack-nanodegree-vm/vagrant
❯ vagrant up
~/path/to/repo/fullstack-nanodegree-vm/vagrant
❯ vagrant ssh
❯ vagrant@vagrant:~$ cd /vagrant/udacity-fsnd-sql-logs
vagrant@vagrant:/vagrant/logs$ python3 logs.py

(Back to TOC)

A tale of three queries

I implemented three SQL queries:

1. Most popular articles

What are the most popular three articles of all time?

Returns a sorted list of the three most highly accessed articles, with the top article first.

Query:

select title, num from
    (select substr(path, 10), count(*) as num from log
    where path !='/' group by path)
as hits, articles where substr = slug order by num desc limit 3;

Terminal output:

              title               | views
----------------------------------+--------
 Candidate is jerk, alleges rival | 338647
 Bears love berries, alleges bear | 253801
 Bad things gone, say good people | 170098
(3 rows)

2. Most popular authors

Who are the most popular article authors of all time?

Returns a sorted list of the most popular article authors, with the most popular author at the top.

Query:

select name, sum(views) as total_views from
    (select name, author, title, views from
        (select substr(path, 10), count(*) as views from log
            where path !='/' group by path)
        as hits, articles, authors
        where substr = slug and author = authors.id
        order by views desc)
    as threetables group by name order by total_views desc;

Terminal output:

          name          | total_views
------------------------+-------------
 Ursula La Multa        |      507594
 Rudolf von Treppenwitz |      423457
 Anonymous Contributor  |      170098
 Markoff Chaney         |       84557
(4 rows)

3. HTTP request error rate

On which days did more than 1% of requests lead to errors?

Returns a list of days on which >1% of HTTP requests resulted in HTTP error codes.

Query:

select errdate, http_requests, http_404,
100.0 * http_404 / http_requests as errpct from
    (select date_trunc('day', time) as reqdate, count(*)
    as http_requests from log group by reqdate)
    as requests,
    (select date_trunc('day', time) as errdate, count(*)
    as http_404 from log where status = '404 NOT FOUND'
    group by errdate)
    as errors
where reqdate = errdate
and errors.http_404 > 0.01 * requests.http_requests
order by errdate desc;

Terminal output:

        errdate         | http_requests | http_404 |       errpct
------------------------+---------------+----------+--------------------
 2016-07-17 00:00:00+00 |         55907 |     1265 | 2.2626862468027260
(1 row)

(Back to TOC)

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
db		db
img		img
info		info
sql		sql
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
Pipfile		Pipfile
README.md		README.md
Vagrantfile		Vagrantfile
docker-compose.yml		docker-compose.yml
logs.py		logs.py

License

br3ndonland/udacity-fsnd-sql-logs

Folders and files

Latest commit

History

Repository files navigation

SQL database logs analysis

Table of Contents

Project description

Repository contents

Required files

Other files

Installation

Run with Pipenv virtual environment

Clone repository

Install dependencies

Install Pipenv virtual environment

Set up database

Unzip news data

Create user and database

Run queries

Run queries with psql

Run queries with Postico

Set up database connection

Connect to database

Run queries

Run queries with Python

Run with Docker Compose

Docker background

Pipenv and Docker

Docker Compose

Run with Vagrant virtual machine

Install virtual machine

Install dependencies

Install Vagrant configuration

A tale of three queries

1. Most popular articles

2. Most popular authors

3. HTTP request error rate

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

Run queries with `psql`