
Docker data stack

Run

  1. Install Docker Desktop
  2. Create a .env file in the repo root by copying .env.template
  3. Fill in the desired POSTGRES_PASSWORD value in the .env file
  4. Build and start the containers:
docker compose up -d --build

Jupyter

Check out the jupyterlab container logs and click on the link that looks like http://127.0.0.1:8089/lab?token=...
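From a notebook you can reach the other services over the compose network. Below is a minimal sketch of querying Postgres, assuming psycopg2 is available in the image, the Postgres service is reachable as db, the default postgres database and user are used, and POSTGRES_PASSWORD from .env is passed into the container (all of these are assumptions; adjust them to your compose file):

import os

import psycopg2

# Connect to the Postgres service on the compose network
conn = psycopg2.connect(
    host="db",
    dbname="postgres",
    user="postgres",
    password=os.environ["POSTGRES_PASSWORD"],
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()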

Trino

docker exec -it trino trino
SHOW SCHEMAS FROM db;
USE db.public;
SHOW TABLES FROM public;
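
The same queries can also be run from a notebook with the Trino Python client (pip install trino). This is a minimal sketch; the host, port, and user are assumptions, so adjust them to the trino service's port mapping in docker-compose:

from trino.dbapi import connect

# Connect to the Trino coordinator and point at the db catalog
conn = connect(host="localhost", port=8080, user="trino", catalog="db", schema="public")
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())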

Spark

docker exec -it spark-master /bin/bash
cd /opt/spark/bin
./spark-submit --master spark://0.0.0.0:7077 \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi  \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar 100
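
The same Pi estimate can be driven from Python. Here is a minimal PySpark sketch, assuming pyspark is installed where it runs and the master is reachable as spark://spark-master:7077 (the hostname is an assumption based on the container name):

from operator import add
from random import random

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("spark-pi-py")
    .getOrCreate()
)

# Monte Carlo estimate of Pi, mirroring the SparkPi example above
n = 100_000

def inside(_):
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), 100).map(inside).reduce(add)
print(f"Pi is roughly {4.0 * count / n}")
spark.stop()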

Thrift

docker exec -it spark-master /bin/bash
./bin/beeline
!connect jdbc:hive2://localhost:10000 scott tiger
show databases;
create table hive_example(a string, b int) partitioned by(c int);
alter table hive_example add partition(c=1);
insert into hive_example partition(c=1) values('a', 1), ('a', 2),('b',3);
select count(distinct a) from hive_example;
select sum(b) from hive_example;
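
The same session can be driven from Python with PyHive (pip install pyhive thrift), which speaks the same HiveServer2 protocol as beeline. A minimal sketch, assuming it runs inside the spark-master container where the Thrift server listens on localhost:10000; from another container, use the service hostname instead:

from pyhive import hive

# Same credentials as the beeline JDBC URL above
conn = hive.connect(host="localhost", port=10000, username="scott")
cur = conn.cursor()
cur.execute("SHOW DATABASES")
print(cur.fetchall())
cur.execute("SELECT sum(b) FROM hive_example")
print(cur.fetchone())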

ScyllaDB

Connect to cqlsh

docker exec -it scylla-1 cqlsh

Create keyspace

CREATE KEYSPACE data
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

Use keyspace and create table

USE data;

CREATE TABLE data.users (
    user_id uuid PRIMARY KEY,
    first_name text,
    last_name text,
    age int
);

Insert data

INSERT INTO data.users (user_id, first_name, last_name, age)
  VALUES (123e4567-e89b-12d3-a456-426655440000, 'Polly', 'Partition', 77);
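
The same insert can be done from Python with the DataStax driver, which also works with ScyllaDB (pip install cassandra-driver). This is a minimal sketch; the contact point and port are assumptions, so adjust them to how the scylla nodes are exposed in docker-compose:

from uuid import uuid4

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"], port=9042)
session = cluster.connect("data")  # use the keyspace created above
session.execute(
    "INSERT INTO users (user_id, first_name, last_name, age) VALUES (%s, %s, %s, %s)",
    (uuid4(), "Polly", "Partition", 77),
)
for row in session.execute("SELECT first_name, last_name, age FROM users"):
    print(row.first_name, row.last_name, row.age)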

Kafka

Create topic

docker exec -it kafka kafka-topics.sh --create --topic test --bootstrap-server 127.0.0.1:9092

Kafka producer

See kafka_producer.ipynb
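
For reference, a minimal producer sketch with kafka-python (pip install kafka-python); the notebook is the source of truth and may use a different client. The broker address kafka:9092 assumes this runs from another container on the compose network; from the host, use the published port instead:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092")
producer.send("test", b"hello from the producer")  # topic created above
producer.flush()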

Kafka consumer

See kafka_consumer.ipynb
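
A matching consumer sketch, under the same assumptions as the producer above:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "test",
    bootstrap_servers="kafka:9092",
    auto_offset_reset="earliest",  # read the topic from the beginning
    consumer_timeout_ms=10_000,    # stop iterating after 10s of silence
)
for message in consumer:
    print(message.value)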

Airflow

Check out the .env.template file. Copy the Airflow-related variables into your .env file and update their values where necessary.

Slack integration

You need to create a Slack app and set up the AIRFLOW_CONN_SLACK_API_DEFAULT environment variable with the Slack API key. If you don't want to use this integration, remove the AIRFLOW_CONN_SLACK_API_DEFAULT variable from your .env file.
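
Airflow resolves the connection id slack_api_default from the AIRFLOW_CONN_SLACK_API_DEFAULT environment variable, so a task can post to Slack without any setup in the UI. A minimal DAG sketch, assuming a recent Airflow with the Slack provider package installed and a #general channel in your workspace (both assumptions):

from datetime import datetime

from airflow import DAG
from airflow.providers.slack.operators.slack import SlackAPIPostOperator

with DAG(
    dag_id="slack_notify_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    SlackAPIPostOperator(
        task_id="notify_slack",
        slack_conn_id="slack_api_default",  # backed by AIRFLOW_CONN_SLACK_API_DEFAULT
        channel="#general",                 # assumption: adjust to your workspace
        text="Hello from the docker data stack",
    )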