YARN-dev-tools

This project contains developer helper scripts that simplify everyday tasks related to Apache Hadoop YARN development.

Main dependencies

gitpython - GitPython is a python library used to interact with git repositories, high-level like git-porcelain, or low-level like git-plumbing.
tabulate - python-tabulate: Pretty-print tabular data in Python, a library and a command-line utility.
bs4 - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
TODO: Missing dependencies

Contributing

TODO

Authors

Szilard Nemeth - Initial work

License

TODO

Acknowledgments

TODO

Getting started

In order to use this tool, you need to have at least Python 3.8 installed.

Use yarn-dev-tools from package (Recommended)

If you don't want to tinker with the source code, you can install yarn-dev-tools from PyPI. This is probably the easiest way to use it. You don't need to install anything manually, as I created a script that performs the installation automatically. The script has a setup-vars function at the beginning that defines the following environment variables:

YARNDEVTOOLS_ROOT: Specifies the directory where the Python virtualenv will be created. yarn-dev-tools will be installed into this virtualenv.
HADOOP_DEV_DIR: Should be set to the upstream Hadoop repository root, e.g.: "~/development/apache/hadoop/"
CLOUDERA_HADOOP_ROOT: Should be set to the downstream Hadoop repository root, e.g.: "~/development/cloudera/hadoop/"

It is best to add the latter two environment variables to your bashrc / zshrc file (depending on which shell you use) so that they persist across shell sessions.
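
Before running the installer, it can help to confirm that the two repository variables are set and point to existing directories. The following is a minimal sketch of such a check, not part of yarn-dev-tools; the function name `check_repo_env` is hypothetical.

```python
import os
from pathlib import Path

# Variable names taken from the list above.
REPO_ENV_VARS = ("HADOOP_DEV_DIR", "CLOUDERA_HADOOP_ROOT")


def check_repo_env(env=None):
    """Return a dict mapping each problematic variable to a short description.

    An empty dict means both variables are set and point to directories.
    Hypothetical helper, for illustration only.
    """
    env = os.environ if env is None else env
    problems = {}
    for name in REPO_ENV_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).expanduser().is_dir():
            # expanduser() resolves a leading "~" as in the examples above.
            problems[name] = f"not a directory: {value}"
    return problems
```

Calling `check_repo_env()` without arguments inspects the current process environment; passing a plain dict makes the check easy to try out in isolation.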

Use yarn-dev-tools from source

If you want to use yarn-dev-tools from source, first you need to install its dependencies. The project root contains a pyproject.toml file that has all the dependencies listed. The project uses Poetry to resolve the dependencies so you need to install poetry as well. Simply go to the root of this project and execute poetry install --without localdev. Alternatively, you can run make from the root of the project.

Setting up handy aliases to use yarn-dev-tools

If you completed the installation (either from source or from the package), you may want to define some shell aliases to use the tool more easily. On my system, I have these. Please make sure to source this script so that the 'yarndevtools' command is available, since it is defined as a shell function. It is important to set HADOOP_DEV_DIR and CLOUDERA_HADOOP_ROOT, as mentioned above, before sourcing the script.

After these steps, you will have a basic set of aliases that is enough to get you started.

Setting up yarn-dev-tools with Cloudera CDSW

Initial setup

Check out the branch 'cloudera-mirror-version'
Upload the initial setup scripts to the CDSW files, to the root directory (/home/cdsw).
You can do this by drag & drop, after choosing "Files" from the left-hand side menu.

Create and launch a new CDSW session. Wait for the session to launch, then open a terminal by clicking "Terminal access" on the top menu bar.
Execute this command in the CLI:

~/initial-cdsw-setup.sh user cloudera

The initial-cdsw-setup.sh script performs the following actions:

1. Downloads the scripts that clone the upstream and downstream Hadoop repositories and install yarndevtools itself as a Python module. The download location is /home/cdsw/scripts. Please note that the files are downloaded from the GitHub master branch of this repository!
2. Executes the scripts downloaded in step 1. This can take some time, especially cloning Hadoop. Note: The individual CDSW jobs are responsible for making sure the repositories are cloned.
3. Copies the Python-based job configs for all jobs to /home/cdsw/jobs.

After this, all you have to do in CDSW is to set up the projects and their starter scripts like this:

Project	Starter script location	Arguments for script
Jira umbrella data fetcher	scripts/start_job.py	jira-umbrella-data-fetcher
Unit test result aggregator	scripts/start_job.py	unit-test-result-aggregator
Unit test result fetcher	scripts/start_job.py	unit-test-result-fetcher
Branch comparator	scripts/start_job.py	branch-comparator
Review sheet backport updater	scripts/start_job.py	review-sheet-backport-updater
Reviewsync	scripts/start_job.py	reviewsync

More details for the internals of the `initial-cdsw-setup.sh` script

The two provided arguments, user and cloudera, correspond to:

PYTHON_MODULE_MODE=user
EXEC_MODE=cloudera

In any case, the scripts that download the Hadoop repos (either upstream or downstream) are downloaded from https://github.com/szilard-nemeth/yarn-dev-tools. See this code block for details.

PYTHON_MODULE_MODE can be set to user or global. It controls if the Python package should be installed globally or just for the user. See this code block for more details.

EXEC_MODE controls just one thing: the downstream Hadoop repo will only be downloaded if EXEC_MODE is set to cloudera.
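
The effect of the two arguments can be summarized in a few lines. The sketch below only models the behaviour described above; the function name `plan_setup` and the returned structure are hypothetical and do not mirror the actual shell script. It assumes the upstream repository is cloned in every mode, since EXEC_MODE is said to affect only the downstream one.

```python
def plan_setup(python_module_mode, exec_mode):
    """Model the decisions described above for the two script arguments.

    Hypothetical helper, for illustration only.
    """
    if python_module_mode not in ("user", "global"):
        raise ValueError(f"Unsupported PYTHON_MODULE_MODE: {python_module_mode}")
    # ASSUMPTION: the upstream repo is always cloned; the downstream repo
    # is only cloned when EXEC_MODE is 'cloudera'.
    repos = ["upstream"]
    if exec_mode == "cloudera":
        repos.append("downstream")
    return {
        "install_scope": python_module_mode,  # 'user' or 'global'
        "repos_to_clone": repos,
    }
```

With the arguments used in the command above, `plan_setup("user", "cloudera")` yields a per-user installation and both repositories.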

The script called install-requirements.sh will be executed. What does the install-requirements.sh do?

Uninstalls the yarn-dev-tools python package
Installs the yarn-dev-tools python package

Installation details can be found here. As you can see in this code block, the env var called YARNDEVTOOLS_VERSION controls how the package is installed. In the current setup, YARNDEVTOOLS_VERSION=repo (set as an env var in CDSW / Project Settings / Advanced), so the package is installed from the github.com repository with this command:

pip3 install git+https://github.com/szilard-nemeth/yarn-dev-tools.git@cloudera-mirror-version

See https://jira.cloudera.com/browse/COMPX-17121 for detailed execution logs.
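
To illustrate how a single variable can select the installation source, here is a rough sketch of the decision. It is not the logic of install-requirements.sh: the function name `pip_install_args` is hypothetical, the `latest` value is the one listed in the "Other environment variables" table, and both the `--upgrade` flag and the handling of any other value as an exact version pin are assumptions.

```python
# URL copied from the pip3 command shown above.
REPO_URL = "git+https://github.com/szilard-nemeth/yarn-dev-tools.git@cloudera-mirror-version"


def pip_install_args(version):
    """Build a pip3 argument list for a given YARNDEVTOOLS_VERSION value.

    Hypothetical helper, for illustration only.
    """
    if version == "repo":
        # Install straight from the GitHub branch.
        return ["pip3", "install", REPO_URL]
    if version == "latest":
        # Most recent release on PyPI; --upgrade is an assumption.
        return ["pip3", "install", "--upgrade", "yarn-dev-tools"]
    # ASSUMPTION: any other value is treated as an exact version pin.
    return ["pip3", "install", f"yarn-dev-tools=={version}"]
```

The resulting list can be passed to `subprocess.run(..., check=True)` to perform the installation.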

CDSW environment variables

Common environment variables for CDSW jobs

All common environment variables are declared in a class called CdswEnvVar.

Name	Level	Mandatory?	Default value	Description
MAIL_ACC_USER	Project	Yes	N/A	Username for the Gmail account that is being used for sending emails
MAIL_ACC_PASSWORD	Project	Yes	N/A	Password for the Gmail account that is being used for sending emails
MAIL_RECIPIENTS	Project or Job	No	yarn_eng_bp@cloudera.com	Comma separated email addresses to send emails to. If not specified, the YARN mailing list is the default: yarn_eng_bp@cloudera.com. Can also be specified on Job level.
ENABLE_GOOGLE_DRIVE_INTEGRATION	Project or Job	No	True	Whether to enable Google Drive integration for saving result files.
DEBUG_ENABLED	Project or Job	No	Job-level default	Whether to enable debug mode for yarndevtools commands. Adds the `--debug` switch to CLI commands. Accepted values: True, False
OVERRIDE_SCRIPT_BASEDIR	Project	No	N/A	Option to change the scripts dir for CDSW jobs. Do not modify unless absolutely necessary!
ENABLE_LOGGER_HANDLER_SANITY_CHECK	Project or Job	No	True	Whether to enable sanity checking the number of loggers after first logger initialization. Can be disabled if errors come up during logger setup.
CLOUDERA_HADOOP_ROOT	Project	Yes	<CDSW_BASEDIR>/repos/cloudera/hadoop/	Downstream repository path for Hadoop. Auto set for CDSW
HADOOP_DEV_DIR	Project	Yes	<CDSW_BASEDIR>/repos/apache/hadoop/	Upstream repository path for Hadoop. Auto set for CDSW
PYTHONPATH	Project	No	$PYTHONPATH:/home/cdsw/scripts	Tweaked PYTHONPATH, to correctly reload python dependencies. Do not modify unless absolutely necessary!
TEST_EXEC_MODE	Project	No	cloudera	Test execution mode. Can take values of `TestExecMode` enum. For CDSW, it should be always set to `TestExecMode.CLOUDERA`
PYTHON_MODULE_MODE	Project	No	user	Python module mode. Can take values of `user` and `global`. For CDSW, it should be always set to `user`.
INSTALL_REQUIREMENTS	Project	No	True	Whether to run the install-requirements.sh script. Do not modify unless absolutely necessary!
RESTART_PROCESS_WHEN_REQUIREMENTS_INSTALLED	Project	No	False	Only used for testing
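
As an illustration of how the mandatory flags and defaults in this table play out, the following sketch reads a few of the variables from a mapping. It is not the CdswEnvVar class: the function names and the returned keys are hypothetical, and the boolean parsing assumes only the `True` / `False` spellings listed above.

```python
import os

# Default taken from the MAIL_RECIPIENTS row above.
DEFAULT_RECIPIENTS = "yarn_eng_bp@cloudera.com"


def parse_bool(value, default):
    """Interpret 'True' / 'False' values; fall back to the default if unset."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() == "true"


def read_common_settings(env=None):
    """Read a subset of the common variables (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [n for n in ("MAIL_ACC_USER", "MAIL_ACC_PASSWORD") if not env.get(n)]
    if missing:
        raise ValueError("Missing mandatory variables: " + ", ".join(missing))
    recipients = env.get("MAIL_RECIPIENTS", DEFAULT_RECIPIENTS)
    return {
        "mail_user": env["MAIL_ACC_USER"],
        # Comma separated addresses, as described in the table.
        "recipients": [r.strip() for r in recipients.split(",") if r.strip()],
        "drive_enabled": parse_bool(env.get("ENABLE_GOOGLE_DRIVE_INTEGRATION"), True),
    }
```

Passing a dict instead of relying on `os.environ` keeps the sketch easy to try without touching the real project settings.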

Environment variables for job: Jira umbrella data fetcher

Corresponding class: JiraUmbrellaFetcherEnvVar

Name	Mandatory?	Default value	Actual value	Description
UMBRELLA_IDS	Yes	N/A	"YARN-10888 YARN-10889"	Comma separated list of umbrella Jira IDs

Environment variables for job: Unit test result aggregator

Corresponding class: UnitTestResultAggregatorEmailEnvVar

Name	Mandatory?	Default value	Actual value	Description
GSHEET_CLIENT_SECRET	Yes	N/A	/home/cdsw/.secret/projects/cloudera/hadoop-reviewsync/client_secret_service_account_snemeth_cloudera_com.json	Path to the Google Sheets client secret file. Used for authenticating with Google Sheets.
GSHEET_SPREADSHEET	Yes	N/A	"Failed testcases parsed from emails [generated by script]"	Name of the Google Sheets to work on
GSHEET_WORKSHEET	Yes	N/A	"Failed testcases"	Name of the Google Sheets worksheet to work on
REQUEST_LIMIT	No	999	3000	Limit the number of Gmail threads to query.
MATCH_EXPRESSION	Yes	N/A	YARN::org.apache.hadoop.yarn MR::org.apache.hadoop.mapreduce	Match expressions that serve as a basis for grouping and rendering tables of test failures. See this file for more details
ABBREV_TC_PACKAGE	No	N/A	org.apache.hadoop.yarn.server	Whether to abbreviate testcase package names in outputs in order to save screen space. The specified string will be abbreviated with the starting letters.
AGGREGATE_FILTERS	No	N/A	CDPD-7.1.x CDPD-7.x	The resulted emails and testcases for each filter will be aggregated to a separate worksheet with name aggregated where WS is equal to the value specified by the --gsheet-worksheet argument.
SKIP_AGGREGATION_RESOURCE_FILE	No	N/A	N/A	Specify a file that defines lines to skip. Lines starting with these strings will not be parsed from the emails.
SKIP_AGGREGATION_RESOURCE_FILE_AUTO_DISCOVERY	Yes	N/A	1	Whether to enable auto-discovery of skip aggregation resource file. Can take values to enable: ("True", "true", "1") or to disable: ("False", "false", "0").
GSHEET_COMPARE_WITH_JIRA_TABLE	No	N/A	"testcases with jiras"	A value should be provided if comparison of failed testcases with reported Jira table must be performed. The value is a name to a Google Sheets worksheet, for example 'testcases with jiras'
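
Two of the values in this table have a small format of their own. The sketch below shows one way to read them. The `ALIAS::package` interpretation of MATCH_EXPRESSION is inferred from the example value only, and both function names are hypothetical; the tool's real parsing may differ.

```python
# Accepted spellings copied from the SKIP_AGGREGATION_RESOURCE_FILE_AUTO_DISCOVERY row.
ENABLE_VALUES = ("True", "true", "1")
DISABLE_VALUES = ("False", "false", "0")


def parse_match_expressions(raw):
    """Split whitespace-separated 'ALIAS::pattern' items into a dict.

    ASSUMPTION: the format is inferred from the example value in the table.
    """
    result = {}
    for item in raw.split():
        alias, sep, pattern = item.partition("::")
        if not sep or not alias or not pattern:
            raise ValueError(f"Expected ALIAS::pattern, got: {item!r}")
        result[alias] = pattern
    return result


def parse_auto_discovery(raw):
    """Map the documented enable / disable spellings to a boolean."""
    if raw in ENABLE_VALUES:
        return True
    if raw in DISABLE_VALUES:
        return False
    raise ValueError(f"Unsupported value: {raw!r}")
```

Rejecting unknown spellings, rather than silently treating them as false, is a design choice of this sketch that makes configuration typos visible early.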

Environment variables for job: Unit test result fetcher

Corresponding class: UnitTestResultFetcherEnvVar. For legacy reasons, Jenkins-related env vars are declared in the class called CdswEnvVar.

Name	Mandatory?	Default value	Actual value	Description
JENKINS_USER	Yes	N/A	snemeth	User name for Cloudera Jenkins API access.
JENKINS_PASSWORD	Yes	N/A		Password for Cloudera Jenkins API access.
BUILD_PROCESSING_LIMIT	No	999	999	Limit the number of Jenkins builds to fetch
FORCE_SENDING_MAIL	No	False	False	Force sending email for all Jenkins runs, even if they were sent out earlier
RESET_JOB_BUILD_DATA	No	False	False	Reset job build data for specified jobs. Useful when job build data is corrupted.

Environment variables for job: Branch comparator

Corresponding class: BranchComparatorEnvVar

Name	Mandatory?	Default value	Actual value	Description
BRANCH_COMP_FEATURE_BRANCH	No	origin/CDH-7.1-maint	origin/CDH-7.1-maint	Name of the feature branch
BRANCH_COMP_MASTER_BRANCH	No	origin/cdpd-master	origin/cdpd-master	Name of the master branch
BRANCH_COMP_REPO_TYPE	No	downstream (`RepoType.DOWNSTREAM`)	N/A	Repository type. Can take a value of `RepoType` enum

Environment variables for job: Review sheet backport updater

Corresponding class: ReviewSheetBackportUpdaterEnvVar

Name	Mandatory?	Default value	Actual value	Description
GSHEET_CLIENT_SECRET	Yes	N/A	/home/cdsw/.secret/projects/cloudera/hadoop-reviewsync/client_secret_service_account_snemeth_cloudera_com.json	Path to the Google Sheets client secret file. Used for authenticating with Google Sheets.
GSHEET_SPREADSHEET	Yes	N/A	"YARN/MR Reviews"	Name of the Google Sheets to work on
GSHEET_WORKSHEET	Yes	N/A	"Reviews done"	Name of the Google Sheets worksheet to work on
GSHEET_JIRA_COLUMN	Yes	N/A	"JIRA"	Name of the column that contains Jira issue IDs in the Google Sheets spreadsheet
GSHEET_UPDATE_DATE_COLUMN	Yes	N/A	"Last Updated"	Name of the column where this script will store last updated date in the Google Sheets spreadsheet
GSHEET_STATUS_INFO_COLUMN	Yes	N/A	"Backported"	Name of the column where this script will store patch status info in the Google Sheets spreadsheet
BRANCHES	Yes	N/A	origin/CDH-7.1-maint origin/cdpd-master origin/CDH-7.1.6.x origin/CDH-7.1.7.1057 origin/CDH-7.1.7.2000 origin/CDH-7.1.8.x	Check backports against these branches. Values should be separated by space.

Environment variables for job: Reviewsync

Corresponding class: ReviewSyncEnvVar

Name	Mandatory?	Default value	Actual value	Description
GSHEET_CLIENT_SECRET	Yes	N/A	/home/cdsw/.secret/projects/cloudera/hadoop-reviewsync/client_secret_service_account_snemeth_cloudera_com.json	Path to the Google Sheets client secret file. Used for authenticating with Google Sheets.
GSHEET_SPREADSHEET	Yes	N/A	"YARN/MR Reviews"	Name of the Google Sheets to work on
GSHEET_WORKSHEET	Yes	N/A	Incoming	Name of the Google Sheets worksheet to work on
GSHEET_JIRA_COLUMN	Yes	N/A	"JIRA"	Name of the column that contains Jira issue IDs in the Google Sheets spreadsheet
GSHEET_UPDATE_DATE_COLUMN	Yes	N/A	"Last Updated"	Name of the column where this script will store last updated date in the Google Sheets spreadsheet
GSHEET_STATUS_INFO_COLUMN	Yes	N/A	"Reviewsync"	Name of the column where this script will store patch status info in the Google Sheets spreadsheet
BRANCHES	Yes	N/A	branch-3.2 branch-3.1	List of branches to apply patches that are targeted to trunk. Values should be separated by space.
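
Both BRANCHES variables are space-separated lists. A minimal sketch of reading such a value follows; the function name `parse_branches` is hypothetical, and dropping duplicate entries is a choice made here, not documented behaviour of the tool.

```python
def parse_branches(raw):
    """Split a space-separated BRANCHES value into a list.

    Duplicates are dropped while the original order is kept
    (an assumption of this sketch, not documented behaviour).
    """
    branches = list(dict.fromkeys(raw.split()))
    if not branches:
        raise ValueError("BRANCHES is mandatory and must not be empty")
    return branches
```

For the Reviewsync value shown above, the result is the two-element list of `branch-3.2` and `branch-3.1`.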

Other environment variables

Name	Mandatory?	Default value	Class	Description
IGNORE_SMTP_AUTH_ERROR	No	False	EnvVar	Enable to ignore `SMTPAuthenticationError`s
FORCE_COLLECTING_ARTIFACTS	No	False	YarnDevToolsTestEnvVar	Enable to always collect all test artifacts.
PROJECT_DETERMINATION_STRATEGY	Yes	N/A	YarnDevToolsEnvVar	Method for detecting the project name. Value can be one of: `common_file`, `sys_path`, `repository_dir`. `common_file` is suitable for most of the use-cases. Behaviour defined in external repo (https://github.com/szilard-nemeth/python-commons)
ENV_CLOUDERA_HADOOP_ROOT	Yes	N/A	YarnDevToolsEnvVar	Alias of `CLOUDERA_HADOOP_ROOT`, see CDSW env vars above
ENV_HADOOP_DEV_DIR	Yes	N/A	YarnDevToolsEnvVar	Alias of `HADOOP_DEV_DIR`, see CDSW env vars above
YARNDEVTOOLS_VERSION	Yes	repo	N/A (script)	Used by script `install-requirements.sh`. See this function for details. Special value of `latest` means using the most recent pypi version. Special value of `repo` means use the most recent version from the repository.

Use-cases

Examples for YARN backporter

To backport YARN-6221 to 2 branches, run these commands:

yarn-backport YARN-6221 COMPX-6664 cdpd-master
yarn-backport YARN-6221 COMPX-6664 CDH-7.1-maint --no-fetch

The first argument is the upstream Jira ID.
The second argument is the downstream Jira ID.
The third argument is the downstream branch.
The --no-fetch option skips git fetch on both repos.

How to backport to an already existing relation chain?

Go to Gerrit UI and download the patch. For example:

git fetch "https://gerrit.sjc.cloudera.com/cdh/hadoop" refs/changes/29/156429/5 && git checkout FETCH_HEAD

Checkout a new branch

git checkout -b my-relation-chain

Run backporter with:

yarn-backport YARN-10314 COMPX-7855 CDH-7.1.7.1000 --no-fetch --downstream_base_ref my-relation-chain

where:
The first argument is the upstream Jira ID.
The second argument is the downstream Jira ID.
The third argument is the downstream branch.
The --no-fetch option skips git fetch on both repos.
The --downstream_base_ref <local-branch> option bases the backport on a local branch, so the Git remote name won't be prepended.
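
The argument layout described above can be modelled with a small argparse definition. This is a hypothetical parser that mirrors the documented arguments for illustration; it is not the tool's real command-line definition, and the destination names are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented yarn-backport arguments."""
    parser = argparse.ArgumentParser(prog="yarn-backport")
    parser.add_argument("upstream_jira_id", help="Upstream Jira ID, e.g. YARN-10314")
    parser.add_argument("downstream_jira_id", help="Downstream Jira ID, e.g. COMPX-7855")
    parser.add_argument("downstream_branch", help="Downstream branch to backport to")
    parser.add_argument("--no-fetch", action="store_true",
                        help="Skip git fetch on both repos")
    parser.add_argument("--downstream_base_ref", default=None,
                        help="Local branch to base the backport on")
    return parser


# Parse the example command line from the relation chain section.
example = "YARN-10314 COMPX-7855 CDH-7.1.7.1000 --no-fetch --downstream_base_ref my-relation-chain"
args = build_parser().parse_args(example.split())
```

Here `args.no_fetch` is `True` and `args.downstream_base_ref` holds the local branch name, which matches the description of the two options.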

Finally, I set up two aliases for pushing the changes to the downstream repo:

alias git-push-to-cdpdmaster="git push <REMOTE> HEAD:refs/for/cdpd-master%<REVIEWER_LIST>"
alias git-push-to-cdh71maint="git push <REMOTE> HEAD:refs/for/CDH-7.1-maint%<REVIEWER_LIST>"

where REVIEWER_LIST is in this format: "r=user1,r=user2,r=user3,..."
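
The reviewer list format is easy to generate. The following sketch builds the REVIEWER_LIST string and the full push command; the function names are hypothetical, and the remote name passed in stands for the <REMOTE> placeholder above.

```python
def reviewer_list(users):
    """Build the 'r=user1,r=user2,...' string described above."""
    return ",".join(f"r={user}" for user in users)


def push_command(remote, branch, users):
    """Assemble the git push arguments used by the aliases (illustrative only)."""
    refspec = f"HEAD:refs/for/{branch}%{reviewer_list(users)}"
    return ["git", "push", remote, refspec]
```

The returned list can be handed to `subprocess.run` as is, which avoids shell quoting issues with the `%` character.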

Contributing

Setup of pre-commit

Configure pre-commit as described in this blogpost.

Commands:

Install pre-commit: pip install pre-commit
Make sure to add pre-commit to your path. For example, on a Mac system, pre-commit is installed here: $HOME/Library/Python/3.8/bin/pre-commit.
Execute pre-commit install to install git hooks in your .git/ directory.

Running the tests

TODO

Troubleshooting

Installation issues

If you encounter an error similar to this one:

An error has occurred: InvalidManifestError: 
=====> /<userhome>/.cache/pre-commit/repoBP08UH/.pre-commit-hooks.yaml does not exist
Check the log at /<userhome>/.cache/pre-commit/pre-commit.log

please run: pre-commit autoupdate

More info can be found here.
