tatitati/pipeline-user-orders

ELT implementation with Python for EL and dbt for T

Q: should I use csv or parquet?

-> Parquet (see the sketch below):
  • It embeds a schema in the file (faster to read the data)
  • It allows schema evolution
  • It includes data compression
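
A minimal sketch with pandas and pyarrow (the column names are just placeholders) showing that the schema and the compression travel inside the file:

```
# Hedged sketch: assumes pandas and pyarrow are installed; the columns are made up.
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"user_id": [1, 2], "name": ["ana", "bob"]})
df.to_parquet("users.parquet", compression="snappy")  # schema + compression are embedded

print(pq.read_schema("users.parquet"))  # the schema can be read without scanning the data
```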

Q: can I copy parquet files from s3 to snowflake?

-> Yes. Create an external stage over the bucket and load the files with COPY INTO (a sketch follows below).
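
A minimal sketch with snowflake-connector-python; the account, credentials and target table are assumptions (the stage is the one created further down), not the project's actual configuration:

```
# Hedged sketch: account, user, password and the target table are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="XXXXX",
    database="MYDBT", schema="DE_BRONZE",
)
conn.cursor().execute("""
    COPY INTO users_raw                          -- hypothetical target table
    FROM @s3pipelineusersorders/myfile.parquet
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE      -- map parquet fields onto the table columns
""")
```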

Q: do I need to specify the schema when reading from oltp-mysql?

-> No. If you’re loading data into Spark from a file, you’ll probably want to specify a schema to avoid making Spark infer it. For a MySQL database, however, that’s not necessary: the table already has its own schema and Spark can translate it (see the sketch below).
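
A minimal sketch with PySpark; the JDBC URL, credentials and table name are assumptions, and the MySQL JDBC driver must be on the Spark classpath:

```
# Hedged sketch: host, database, credentials and table name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract_orders").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://oltp-mysql:3306/shop")
    .option("dbtable", "orders")
    .option("user", "root")
    .option("password", "XXXXX")
    .load()  # no explicit schema: Spark translates MySQL's own schema
)
orders.printSchema()
```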

Q: I uploaded a parquet file to S3 (myfile.parquet). How can I see from Snowflake what data this file contains?:

```
-- External stage pointing at the S3 bucket
CREATE STAGE "MYDBT"."DE_BRONZE".s3pipelineusersorders
URL = 's3://pipelineusersorders'
CREDENTIALS = (AWS_KEY_ID = 'XXXXX' AWS_SECRET_KEY = 'XXXXXXX');

-- List the files available in the stage
LIST @s3pipelineusersorders;

-- File format so Snowflake knows how to parse the staged files
CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = PARQUET
  COMPRESSION = SNAPPY;

-- Each Parquet row comes back as one VARIANT column ($1)
SELECT $1
FROM @s3pipelineusersorders/myfile.parquet
(file_format => 'my_parquet_format');

-- Individual fields can be extracted from the variant and cast
SELECT $1:name::varchar
FROM @s3pipelineusersorders/myfile.parquet
(file_format => 'my_parquet_format');
```

Q: how do I initialise the dbt project folder named transform?:

```
dbt init transform   # scaffold a new dbt project called "transform"
dbt debug            # check that the profile and connection work
dbt run -m silver    # run only the models selected by "silver"
```

Q: should I upload one parquet file for users and another for orders, or only one file as the result of joining them?

-> one per each:
  • Orders can be updated at different times than users
  • One user can have multiple orders. With a single joined file we might fetch user information when we only need order information, or the opposite: it hurts performance and adds complexity (see the sketch below)
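
A minimal sketch with pandas; the columns are placeholders standing in for the real extracts from MySQL. Each table is written (and can be refreshed) independently, and the join is left to dbt in the T step:

```
# Hedged sketch: the DataFrames stand in for the real extracted tables.
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "name": ["ana", "bob"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 1, 2]})

users.to_parquet("users.parquet", compression="snappy")    # refreshed when users change
orders.to_parquet("orders.parquet", compression="snappy")  # refreshed when orders change
```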
