Skip to content

mara/mara-google-sheet-downloader

Repository files navigation

Mara Google Sheet Downloader

Build Status PyPI - License PyPI version Slack Status

This package contains a google sheet downloader to be used with the mara ETL framework:

  • Download a Google sheet to a database table
  • Cells can be validated and formatted during download

 

Installation

To use the library directly, use pip:

pip install mara-google-sheet-downloader

or

pip install git+https://github.com/mara/mara-google-sheet-downloader.git

 

Example

Here is a pipeline "gs_demo" which downloads to a table. This assumes you have a spread sheet under the URL https://docs.google.com/spreadsheets/d/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/edit. This sheet must be shared with the email address for which you configured the credentials (see below).

The spreadsheet contains a worksheet 'test' which contains the following colums:

city some_date amount already paid? price comments
Berlin 01.01.2020 3 ja 2.3 added by JK on 2020-03-12

Note that it doesn't contain an id column at the start!

from mara_pipelines.pipelines import Pipeline, Task
from mara_pipelines.commands.sql import ExecuteSQL
from mara_google_sheet_downloader.mara_integration import DownloadGoogleSpreadsheet

pipeline = Pipeline(
    id='gs_demo',
    description='A small pipeline that demonstrates the a google sheet download')

pipeline.add(Task(
    id='download_', description='Download a google sheet',
    commands=[
        ExecuteSQL(
            sql_statement=f"""
-- Creates the table where the google sheet data should end up in
DROP TABLE IF EXISTS public.gs_test;
CREATE TABLE public.gs_test (
id BIGINT PRIMARY KEY,
city TEXT,
some_date DATE,
amount INTEGER,
is_paid BOOLEAN,
price DOUBLE PRECISION,
comments TEXT
)
""",
            echo_queries=False,
        ),
        DownloadGoogleSpreadsheet(
            spreadsheet_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',  # from the URL
            worksheet_name='test',  # worksheet
            # Validators:
            # c: counter (additional column, will start at 1 and count each row),
            # s: string,
            # d(in_fmt=%d.%m.%Y): date in YYYY.mm.dd format,
            # i: integer,
            # b(true:ja,false=nein): boolean with ja/nein interpreted as True/False,
            # f: float,
            # s: string
            # any cell which does not confirm to this spec will fail the import!
            columns_definition='csd(in_fmt=%d.%m.%Y)ib(true:ja,false=nein)fs',
            target_table_name='public.gs_test', # table where the data should end up
            target_db_alias='dwh', # alias of the DB where the data should end up
            skip_rows=1), # how many rows at the top should be skipped
    ]),
)

Config

The downloader needs OAuth2 credentials, either use a service account or a user account.

Credentials will need the scopes 'https://www.googleapis.com/auth/spreadsheets.readonly', 'https://www.googleapis.com/auth/drive.readonly'.

Example with OAuth2 credentials for a user account:

from mara_app.monkey_patch import patch
import mara_google_sheet_downloader.config
patch(mara_google_sheet_downloader.config.gs_user_account_client_id)(lambda:"....client_id...")
patch(mara_google_sheet_downloader.config.gs_user_account_client_secret)(lambda:"...client_secret...")
patch(mara_google_sheet_downloader.config.gs_user_account_refresh_token)(lambda:"...initial_refresh_token...")

Setup access to sheets to be downloaded

All sheets which should be accessed by the downloader must be shared with the email address associated with these credentials. This email address is:

  • for user account credentials: the email address of the user who created the credentials.
  • for service accounts: the email address of the service account itself (e.g. "@.iam.gserviceaccount.com"). This email address is e.g. included in the json file you can download.

CLI

This package contains a small cli app which downloads a google sheet and outputs it as csv.

You can use it stand alone, see mara-google-sheet-downloader --help for how to use it.