
M3D API


M3D stands for Metadata Driven Development and is a cloud and platform agnostic framework for the automated creation, management and governance of metadata and data flows from multiple source systems to multiple target systems. The main features and design goals of M3D are:

  • Cloud and platform agnostic
  • Enforcement of a global data model, including speaking names and business objects
  • Governance by conventions instead of maintaining state and logic
  • Lightweight and easy to use
  • Flexible development of new features
  • Stateless execution with minimal external dependencies
  • Enable self-service
  • Possibility to extend to multiple destination systems (currently AWS EMR)

M3D consists of two components: m3d-api, which is provided in this repo, and m3d-engine, which contains the main logic built on Apache Spark.

Use cases

M3D can be used for:

  • Creation of data lake environments
  • Management and governance of metadata
  • Data flows from multiple sources
  • Data flows to multiple target systems
  • Algorithms as data frame transformations

adidas is not responsible for the usage of this software for purposes other than those described in the use cases above.

M3D Architecture

M3D is based on a layered architecture, using AWS S3 buckets as storage and Spark/Scala for processing. Using the M3D API you can create data lake environments in a reproducible way. These are the layers defined in the M3D architecture:

  • At the lowest level we have the inbound layer, where raw data is uploaded by source systems. The format of the source data is not fixed and a number of formats are supported by M3D. Only this layer is accessible by external non-M3D governed systems.
  • On top of the inbound layer, we have the landing layer, in which archived raw data from the inbound layer is stored together with the metadata that is used for further loading to the lake. It can be used for exploration on the raw files and for reprocessing but does not provide a Hive schema.
  • The next layer is the lake layer, where data is persisted in Parquet format for consumption by applications. This layer should be accessed using Hive. There are also lake-to-lake algorithms that read from and write to this layer.
  • The top layer is the lake-out layer which is a virtual layer for globally standardized semantic names.
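
To make the layering concrete, the sketch below shows one hypothetical way the layers could map onto S3 storage; the bucket names are illustrative placeholders, not M3D conventions:

    # Hypothetical layer-to-bucket mapping (placeholder names):
    s3://my-m3d-inbound/    # raw files uploaded by source systems
    s3://my-m3d-landing/    # archived raw data plus loading metadata
    s3://my-m3d-lake/       # Parquet data, consumed through Hive
    # lake-out is a virtual layer of standardized semantic names on top of the lake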

Graphically, the architecture of M3D looks like this:

[M3D architecture diagram]

AWS Prerequisites for Out of the Box Usage

  • Four S3 buckets: inbound, landing, lake and application. The application bucket will contain the jar artifact of m3d-engine.
  • An account for managing clusters in the AWS console.
  • A host machine with internet access.
  • An access key with permissions to write to the specified buckets and to create/delete EMR clusters.
  • Databases for landing, lake and lake_out.
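
As an illustration, the buckets could be created with the AWS CLI; the bucket names below are placeholders you would replace with your own:

    # Create the four S3 buckets M3D expects (placeholder names).
    aws s3 mb s3://my-m3d-inbound
    aws s3 mb s3://my-m3d-landing
    aws s3 mb s3://my-m3d-lake
    aws s3 mb s3://my-m3d-application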

Setup and Deployment: The Easy Way

The quickest way to get started with the M3D API is to use the GUI installer, which is available for different platforms (Windows/Linux/Mac). With the GUI installer you can set up m3d-api and m3d-engine on a remote host (or on localhost if you have a Unix-based system) and load local tables into an AWS EMR environment right out of the box. This requires an active AWS account, which can be created by visiting this link. If you already have an AWS account, make sure to have your access key and secret access key at hand for a successful installation and deployment of the environment in EMR. You can go to this repository to build the installer UI for your preferred OS.

After the installation completes, the final steps performed by the GUI installer are:

  • Display the sample data to be uploaded from on-premises storage to the AWS cloud
  • Display the structure of the tables in the on-premises database that will be created in the AWS cloud
  • Create the environment in the AWS cloud
  • Upload the data to the S3 inbound bucket
  • Start the EMR cluster
  • Execute the FullLoad Spark algorithm contained in m3d-engine to put the data in the lake layer
  • Shut down EMR resources

Setup and Deployment: Advanced Users

Advanced users can install M3D with conda by entering the following command in a terminal:

    conda install -c some-channel m3d-api

Available API calls

  • create_table: Creates a table in the AWS environment based on TCONX files (see the FAQ below).
  • drop_table: Drops a table in the AWS environment. The files will remain in storage.
  • truncate_table: Removes all files of a table from storage.
  • create_lake_out_view: Executes an HQL statement to generate a view in the AWS environment.
  • drop_lake_out_view: Removes a given view in the AWS environment.
  • load_table: Loads a table in AWS from a specified source.
  • run_algorithm: Executes an algorithm available in m3d-engine.
  • create_emr_cluster: Initializes an EMR cluster in AWS.
  • delete_emr_cluster: Terminates an EMR cluster in AWS.
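
Each call is dispatched through m3d_main.py via the -function argument plus the arguments described in the next section. As a hypothetical sketch of the general invocation pattern (paths and names are placeholders; depending on the function, additional arguments such as -emr_cluster_id may be required):

    python m3d_main.py -function drop_table \
        -config /relative/to/m3d-api/config/m3d/config.json \
        -destination_system emr \
        -destination_database emr_database \
        -destination_environment test \
        -destination_table test_table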

API Arguments

  • -function: Name of the function to execute.
  • -config: Location of the configuration JSON file. An example configuration file is provided below:
        {
            "emails": [
                "test@test.com"
            ],
            "dir_exec": "/tmp/",
            "python": {
                "main": "m3d_main.pyc",
                "base_package": "m3d"
            },
            "subdir_projects": {
                "m3d_engine": "m3d-engine/target/scala-2.11/",
                "m3d_api": "m3d-api/"
            },
            "tags": {
                "full_load": "full_load",
                "delta_load": "delta_load",
                "append_load": "append_load",
                "table_suffix_stage": "_stg1",
                "table_suffix_swap": "_swap",
                "config": "config",
                "system": "system",
                "algorithm": "algorithm",
                "table": "table",
                "view": "view",
                "upload": "upload",
                "pushdown": "pushdown",
                "aws": "aws",
                "file": "file"
            },
            "data_dict_delimiter": "|"
        }
  • -cluster_mode: Specifies whether the function should execute in a cluster or on a single node.
  • -destination_system: Name of the system to which data will be loaded.
  • -destination_database: Name of the destination database.
  • -destination_environment: Name of the destination environment (test, dev, preprod, prod, etc.).
  • -destination_table: Name of the table in the destination_database of the destination_system where data will be written to.
  • -algorithm_instance: Name of the algorithm from m3d-engine to be executed.
  • -load_type: Type of the load algorithm to be executed (FullLoad, DeltaLoad, or AppendLoad).
  • -ext_params: Parameters in JSON format expected by an algorithm in m3d-engine.
  • -spark_params: Spark parameters in JSON format.
  • -core_instance_count: Number of executor nodes in the EMR cluster.
  • -core_instance_type: AWS node instance type for each executor node in the EMR cluster.
  • -master_instance_type: AWS node instance type for the master node in the EMR cluster.
  • -emr_version: Version of EMR to use for EMR clusters.

Not all arguments are mandatory for every API call, and some calls accept additional arguments that are not listed above (for example, -emr_cluster_id and -destination_table_location_prefix, which appear in the examples below). Please check the source code to identify the required parameters for the API call you would like to use.
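
JSON-valued arguments such as -spark_params are passed inline as strings. The sketch below is illustrative only: the algorithm name is hypothetical, and the accepted Spark keys depend on your m3d-engine build:

    python m3d_main.py -function run_algorithm \
        -config /relative/to/m3d-api/config/m3d/config.json \
        -destination_system emr \
        -destination_database emr_database \
        -destination_environment test \
        -algorithm_instance some_algorithm \
        -spark_params '{"spark.executor.instances": "3", "spark.executor.memory": "4G"}' \
        -emr_cluster_id id-of-started-cluster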

Example Use Case: Loading Data into AWS Environment

As a demonstration of M3D's capabilities, we provide an example that loads data from data files into AWS. Prerequisites: cd into the working directory where you have m3d-api and m3d-engine copied, whether from conda or from the GUI installer. For m3d-engine, you will need the compiled jar, or you can build it manually with SBT.

Before you proceed, make sure you have everything in the prerequisites section completed and that the entries in the config.json file have been adjusted to match your setup. Also make sure the relevant information is in the TCONX file, such as column names, lake table name, destination database, etc. Note that for the example below, destination_database is set to emr_database, destination_system is emr and destination_environment is test. For table_name, we use test_table. Database names in the M3D layers should match the names you defined in the prerequisites section.

The steps are the following:

  • Upload a CSV file containing the data to be loaded into the lake. You can use the AWS CLI to place the file in the inbound bucket, for example:

    aws s3 cp data.csv s3://your-inbound-bucket/test/data.csv
    
  • Create an EMR cluster

    python m3d_main.py -function create_emr_cluster \
        -core_instance_type m4.large \
        -master_instance_type m4.large \
        -core_instance_count 3 \
        -destination_system emr \
        -destination_database emr_database \
        -destination_environment test \
        -config /relative/to/m3d-api/config/m3d/config.json \
        -emr_version emr-6.2.0
    
  • Create the environment in AWS by invoking the create_table API

    python m3d_main.py -function create_table \
        -config /relative/to/m3d-api/config/m3d/config.json \
        -destination_system emr \
        -destination_database emr_database \
        -destination_environment test \
        -destination_table table_name \
        -destination_table_location_prefix table_location_prefix \
        -emr_cluster_id id-of-started-cluster
    
  • Trigger the FullLoad algorithm in m3d-engine to load data from the inbound layer into the lake layer.

    python m3d_main.py -function load_table \
        -config /relative/to/m3d-api/config/m3d/config.json \
        -destination_system emr \
        -destination_database emr_database \
        -destination_environment test \
        -destination_table table_name \
        -load_type FullLoad \
        -emr_cluster_id id-of-started-cluster
    
  • OPTIONAL: Shut down the EMR cluster. Normally you will stop the current EMR cluster after a load job completes, but if you would like to connect to the cluster after the loading job is finished, you can skip this final API call to keep it running. You can then open Hue to query the data via Hive on the running EMR cluster by connecting to the master instance, if it was configured as suggested in this guide.

    python m3d_main.py -function delete_emr_cluster \
        -config /relative/to/m3d-api/config/m3d/config.json \
        -destination_system emr \
        -destination_database emr_database \
        -destination_environment test \
        -emr_cluster_id id-of-started-cluster
    

License and Software Information

© adidas AG

adidas AG publishes this software and accompanying documentation (if any) subject to the terms of the Apache 2.0 license with the aim of helping the community with our tools and libraries, which we think can also be useful for other people. You will find a copy of the Apache 2.0 license in the root folder of this package. All rights not explicitly granted to you under the Apache 2.0 license remain the sole and exclusive property of adidas AG.

NOTICE: The software has been designed solely for the purpose of automated creation, management and governance of metadata and data flows. The software is NOT designed, tested or verified for productive use whatsoever, nor for any use related to high-risk environments, such as health care, highly or fully autonomous driving, power plants, or other critical infrastructures or services.

If you want to contact adidas regarding the software, you can mail us at software.engineering@adidas.com.

For further information open the adidas terms and conditions page.


FAQ

  • What is a TCONX file? It is a JSON file containing the definition of a table to be created in a Hadoop environment. Entries in the file include the destination database, the table name in the lake, the table columns, and the names of the columns in the different M3D layers. For an example of what a TCONX file looks like, take a look at the samples subdirectory in this repo. It is important to note that the parameters mentioned above (table name, environment, etc.) are part of the TCONX file naming convention. In samples/tconx-(emr)-(emr_database)-(test)-(prefix)_(table_name).json, the parts in parentheses are:
    • emr - this is the destination system
    • emr_database - this is the destination database
    • test - this is the destination environment
    • prefix - this is the name of the source system generating the data
    • table_name - the name of the table for which the TCONX file was generated
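
As an illustration of this convention, using the values from the example use case above (destination system emr, database emr_database, environment test, a source system called prefix, and table test_table), the file would be named:

    samples/tconx-emr-emr_database-test-prefix_test_table.json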
