Contributing to DataLad

Files organization

  • datalad/ is the main Python module where major development is happening, with major submodules being:
    • cmdline/ - helpers for accessing interface/ functionality from command line
    • customremotes/ - custom special remotes for annex provided by datalad
    • downloaders/ - support for accessing data from various sources (e.g. http, S3, XNAT) via a unified interface.
      • configs/ - specifications for known data providers and associated credentials
    • interface/ - high level interface functions which get exposed via command line (cmdline/) or Python (datalad.api).
    • tests/ - some unit- and regression-tests (more can be found under tests/ of the corresponding submodules; see Tests)
      • utils.py provides convenience helpers used by unit-tests such as @with_tree, @serve_path_via_http and other decorators
    • ui/ - user-level interactions, such as messages about errors, warnings, progress reports, and -- when supported by the available frontend -- interactive dialogs
    • support/ - various support modules, e.g. for git/git-annex interfaces, constraints for the interface/, etc
  • benchmarks/ - asv benchmarks suite (see Benchmarking)
  • docs/ - documentation, yet to be heavily populated
    • bash-completions - bash and zsh completion setup for datalad (just source it)
  • fixtures/ - currently not under git; contains fixtures generated by vcr
  • sandbox/ - various scripts and prototypes which are not part of the main codebase distributed with releases
  • tools/ - helper utilities used during development, testing, and benchmarking of DataLad, implemented in whichever language is most appropriate (Python, bash, etc.)

Whenever a new top-level file or folder is added to the repository, it should be listed in MANIFEST.in so that it will be either included in or excluded from source distributions as appropriate. See here for information about writing a MANIFEST.in.
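
For illustration only (these entries are hypothetical examples, not necessarily DataLad's actual manifest), MANIFEST.in uses include/exclude/prune style directives such as:

include CONTRIBUTING.md
graft docs
prune sandbox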

How to contribute

The preferred way to contribute to the DataLad code base is to fork the main repository on GitHub. Here we outline the workflow used by the developers:

  1. Have a clone of our main project repository as the origin remote in your git:

       git clone git://github.com/datalad/datalad
    
  2. Fork the project repository: click on the 'Fork' button near the top of the page. This creates a copy of the code base under your account on the GitHub server.

  3. Add your forked clone as a remote to the local clone you already have on your local disk:

       git remote add gh-YourLogin git@github.com:YourLogin/datalad.git
       git fetch gh-YourLogin
    

    To ease addition of other GitHub repositories as remotes, here is a little bash function/script to add to your ~/.bashrc:

     ghremote () {
             url="$1"
             proj=${url##*/}          # repository name, e.g. datalad.git
             url_=${url%/*}
             login=${url_##*[:/]}     # login/organization name (handles both ssh and https URL forms)
             git remote add gh-$login "$url"
             git fetch gh-$login
     }
    

    thus you could simply run:

      ghremote git@github.com:YourLogin/datalad.git
    

    to add the above gh-YourLogin remote. Additional handy aliases such as ghpr (to fetch an existing PR from someone's remote) and ghsendpr can be found in yarikoptic's bash config file

  4. Create a branch (generally off the origin/master) to hold your changes:

       git checkout -b nf-my-feature
    

    and start making changes. Ideally, use a prefix signaling the purpose of the branch:

    • nf- for new features
    • bf- for bug fixes
    • rf- for refactoring
    • doc- for documentation contributions (including in the code docstrings).
    • bm- for changes to benchmarks

    We recommend not working in the master branch!
  5. Work on this copy on your computer using Git to do the version control. When you're done editing, do:

       git add modified_files
       git commit
    

    to record your changes in Git. Ideally, prefix your commit messages with NF, BF, RF, DOC, or BM, mirroring the branch name prefixes; you can also use TST for commits concerned solely with tests, and BK to signal that the commit causes a breakage (e.g. of tests) at that point. Multiple entries can be joined with a + (e.g. rf+doc-). See git log for examples. If a commit closes an existing DataLad issue, add (Closes #ISSUE_NUMBER) to the end of the message (see the illustration after this list).

  6. Push to GitHub with:

       git push -u gh-YourLogin nf-my-feature
    

    Finally, go to the web page of your fork of the DataLad repo, and click 'Pull request' (PR) to send your changes to the maintainers for review. This will send an email to the committers. You can commit new changes to this branch and keep pushing to your remote -- GitHub automagically adds them to your previously opened PR.
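
    As an illustration of the commit message conventions from step 5, a hypothetical commit (the issue number is invented for the example) could look like:

       git commit -m "BF+TST: handle spaces in dataset paths (Closes #1234)"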

(If any of the above seems like magic to you, then look up the Git documentation on the web.) Our Design Docs provide a growing collection of insights on the command API principles and the design of particular subsystems in DataLad to inform standard development practice.
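
If your feature branch falls behind origin/master while the PR is under review, one common way to refresh it (a sketch, not a mandated workflow; branch and remote names as in the steps above) is:

git fetch origin
git rebase origin/master nf-my-feature
git push --force-with-lease gh-YourLogin nf-my-feature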

Development environment

We support Python 3 only (>= 3.7).

See README.md:Dependencies for basic information about installation of datalad itself. On Debian-based systems we recommend enabling NeuroDebian, since we use it to provide backports of recent, fixed versions of the external modules we depend upon:

apt-get install -y -q git git-annex-standalone
apt-get install -y -q patool python3-scrapy python3-{argcomplete,git,humanize,keyring,lxml,msgpack,progressbar,requests,setuptools}

and additionally, for development we suggest using tox and new versions of dependencies from PyPI:

apt-get install -y -q python3-{dev,httpretty,pytest,pip,vcr,virtualenv} python3-tox
# Some libraries which might be needed for installing via pip
apt-get install -y -q lib{ffi,ssl,curl4-openssl,xml2,xslt1}-dev

some of which you could also install from PyPI using pip (prior installation of the libraries listed above might be necessary)

pip install -r requirements-devel.txt

and you will need to install a recent git-annex using the means appropriate for your OS (for Debian/Ubuntu, once again, just use NeuroDebian).
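
Alternatively, from a clone of the repository, an editable install together with the development extras (the devel extra referenced elsewhere in this document) can be obtained with, e.g.:

pip install -e .[devel]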

Contributor Files History

The original repository provided a .zenodo.json file, and we generate a .tributors file from it via:

pip install tributors
tributors --version
0.0.18

It helps to have a GitHub token to increase API limits:

export GITHUB_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Instructions for these environment variables can be found here. Then update zenodo:

tributors update zenodo
INFO:    zenodo:Updating .zenodo.json
INFO:    zenodo:Updating .tributors cache from .zenodo.json
WARNING:tributors:zenodo does not support updating from names.

If more than one ORCID is found for a user, you will be given a list to check; the others will be updated in the file. You can then curate the file as you see fit. We next want to add the .all-contributorsrc file:

$ tributors init allcontrib
INFO:allcontrib:Generating .all-contributorsrc for datalad/datalad
$ tributors update allcontrib
INFO:allcontrib:Updating .all-contributorsrc
INFO:allcontrib:Updating .tributors cache from .all-contributorsrc
INFO:allcontrib:⭐️ Found new contributor glalteva in .all-contributorsrc
INFO:allcontrib:⭐️ Found new contributor adswa in .all-contributorsrc
INFO:allcontrib:⭐️ Found new contributor chrhaeusler in .all-contributorsrc
...
INFO:allcontrib:⭐️ Found new contributor bpoldrack in .all-contributorsrc
INFO:allcontrib:⭐️ Found new contributor yetanothertestuser in .all-contributorsrc
WARNING:tributors:allcontrib does not support updating from orcids.
WARNING:tributors:allcontrib does not support updating from email.

We can then populate the shared .tributors file:

$ tributors update-lookup allcontrib

And then we can rely on the GitHub action to update contributors. The action is set to run on merges to master, i.e. when contributions are finalized. This means that we add new contributors and look for new ORCIDs, as we did above.

Additional Hints

Merge commits

For merge commits to have a more informative description, add the following section to your .git/config or ~/.gitconfig:

[merge]
log = true

and if conflicts occur, provide a short summary of how they were resolved in a "Conflicts" listing within the merge commit (see example).
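
A hypothetical merge commit message with such a "Conflicts" listing (the file name and commit subject are invented for the example) might look like:

Merge branch 'maint'

* maint:
  BF: tolerate spaces in remote paths

Conflicts:
    datalad/support/gitrepo.py - kept the newer helper from master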

Quality Assurance

It is recommended to check that your contribution complies with the following rules before submitting a pull request:

  • All public methods should have informative docstrings with sample usage presented as doctests when appropriate.

  • All other tests pass when everything is rebuilt from scratch.

  • New code should be accompanied by tests.

The documentation contains a Design Document specifically on running and writing tests that we encourage you to read beforehand. Further hands-on advice is detailed below.

Tests

datalad/tests contains tests for the core portion of the project, and more tests are provided under the tests/ subdirectories of the corresponding submodules to simplify re-running the tests concerning that portion of the codebase. To execute many of the tests, the codebase first needs to be "installed" in order to generate scripts for the entry points. For that, the recommended course of action is to use virtualenv, e.g.

virtualenv --system-site-packages venv-tests
source venv-tests/bin/activate
pip install -r requirements.txt
python setup.py develop

and then use that virtual environment to run the tests, via

pytest datalad

To later deactivate the virtualenv, simply enter

deactivate

Alternatively, or complementary to that, you can use tox -- there is a tox.ini file which sets up a few virtual environments for testing locally, which you can later reuse like any other regular virtualenv for troubleshooting. Additionally, the tools/testing/test_README_in_docker script can be used to establish a clean Docker environment (based on any NeuroDebian-supported release of Debian or Ubuntu) with all dependencies listed in README.md pre-installed.
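
For example, to discover and run the tox environments (a sketch; the available environment names are defined in tox.ini):

pip install tox
tox -l               # list environments defined in tox.ini
tox                  # run all of them
tox -e ENVNAME       # run a single environment from the list above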

CI setup

We are using several continuous integration services to run our battery of tests for every PR and on the default branch. Please note that a new contributor's first PR needs workflow approval from a team member before the CI runs start, but we promise to promptly review and start the CI runs on your PR. As the full CI suite takes a while to complete, we recommend running at least the tests directly related to your contribution locally beforehand. Logs from all CI runs are collected periodically by con/tinuous and archived at smaug:/mnt/btrfs/datasets/datalad/ci/logs/. For developing on Windows you can use free Windows VMs. If you would like to propose patches against git-annex itself, submit them against the datalad/git-annex repository, which builds and tests git-annex.

Coverage

You can also check for common programming errors with the following tools:

  • Code should have good unit-test coverage (at least 80%); check with:

        pip install pytest pytest-cov coverage
        pytest --cov=datalad path/to/tests_for_package
    
  • We rely on https://codecov.io to provide a convenient view of code coverage. Installation of the codecov extension for Firefox/Iceweasel or Chromium is strongly advised, since it provides coverage annotation of pull requests.
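
To inspect coverage locally in more detail, a typical invocation (a sketch using standard pytest-cov options) that produces both a terminal summary and an HTML report is:

pytest --cov=datalad --cov-report=term-missing --cov-report=html datalad
# then open htmlcov/index.html in a browser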

Linting

We are not (yet) fully PEP8 compliant, so please use these tools as guidelines for your contributions, but do not use them to PEP8-fix the entire code base.

Sidenote: watch Raymond Hettinger - Beyond PEP 8

  • No pyflakes warnings, check with:

         pip install pyflakes
         pyflakes path/to/module.py
    
  • No PEP8 warnings, check with:

         pip install pep8
         pep8 path/to/module.py
    
  • AutoPEP8 can help you fix some of the easy, mechanical errors:

         pip install autopep8
         autopep8 path/to/pep8.py
    

Also, some team developers use PyCharm community edition, which provides a built-in PEP8 checker and handy tools such as smart splits/joins, making it easier to maintain code following the PEP8 recommendations. NeuroDebian provides the pycharm-community-sloppy package to ease PyCharm installation even further.
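
Since only your contribution needs to comply, one way (a sketch relying on GNU xargs) to restrict the checks to the Python files you touched relative to origin/master is:

git diff --name-only origin/master...HEAD | grep '\.py$' | xargs -r pyflakes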

Benchmarking

We use asv to benchmark some core DataLad functionality. The benchmark suite is located under benchmarks/, and periodically we publish the results of running the benchmarks on a dedicated host to http://datalad.github.io/datalad/ . Those results are collected and available under the .asv/ submodule of this repository, so to get started:

  • git submodule update --init .asv
  • pip install .[devel] or just pip install asv
  • asv machine - to configure asv for your host if you want to run benchmarks locally

And then you could use asv in multiple ways.

Quickly benchmark the working tree

  • asv run -E existing - benchmark using the existing Python environment and just print out the results (not stored anywhere). You can add -q to run each benchmark just once (giving less reliable estimates)
  • asv run -b api.supers.time_createadd_to_dataset -E existing would run that specific benchmark using the existing Python environment

Note: --python=same (-E existing) seems to have restricted applicability, e.g. can't be used for a range of commits, so it can't be used with continuous.

Compare results for two commits from recorded runs

Use asv compare to compare results from different runs, which should be available under .asv/results/<machine>. (Note that the example below passes ref names instead of commit IDs, which requires asv v0.3 or later.)

> asv compare -m hopa maint master

All benchmarks:

       before           after         ratio
     [b619eca4]       [7635f467]
-           1.87s            1.54s     0.82  api.supers.time_createadd
-           1.85s            1.56s     0.84  api.supers.time_createadd_to_dataset
-           5.57s            4.40s     0.79  api.supers.time_installr
          145±6ms          145±6ms     1.00  api.supers.time_ls
-           4.59s            2.17s     0.47  api.supers.time_remove
          427±1ms          434±8ms     1.02  api.testds.time_create_test_dataset1
-           4.10s            3.37s     0.82  api.testds.time_create_test_dataset2x2
      1.81±0.07ms      1.73±0.04ms     0.96  core.runner.time_echo
       2.30±0.2ms      2.04±0.03ms    ~0.89  core.runner.time_echo_gitrunner
+        420±10ms          535±3ms     1.27  core.startup.time_help_np
          111±6ms          107±3ms     0.96  core.startup.time_import
+         334±6ms          466±4ms     1.39  core.startup.time_import_api

Run and compare results for two commits

asv continuous could be used to first run benchmarks for the to-be-tested commits and then provide stats:

  • asv continuous maint master - would run and compare maint and master branches
  • asv continuous HEAD - would compare HEAD against HEAD^
  • asv continuous master HEAD - would compare HEAD against state of master
  • TODO: continuous -E existing

Notes:

  • only significant changes will be reported
  • raw results from benchmarks are not stored (use --record-samples if desired)

Run and record benchmark results (for later comparison, etc.)

Profile a benchmark and produce a nice graph visualization

Example (replace with the benchmark of interest)

asv profile -v -o profile.gprof usecases.study_forrest.time_make_studyforrest_mockup
gprof2dot -f pstats profile.gprof | dot -Tpng -o profile.png \
    && xdg-open profile.png

Common options

  • -E to restrict to a specific environment, e.g. -E virtualenv:2.7
  • -b can be used to specify specific benchmark(s)
  • -q to run each benchmark just once for a quick assessment (results are not stored, since they are too unreliable)

Easy Issues

A great way to start contributing to DataLad is to pick an item from the list of Easy issues in the issue tracker. Resolving these issues allows you to start contributing to the project without much prior knowledge. Your assistance in this area will be greatly appreciated by the more experienced developers as it helps free up their time to concentrate on other issues.

Maintenance teams coordination

We distinguish particular aspects of DataLad's functionality, each corresponding to parts of the code base in this repository, and loosely maintain teams assigned to these aspects. While any contributor can tackle issues on any aspect, you may want to refer to members of such teams (via GitHub tagging or review requests) or the team itself (via GitHub issue label team-<area>) when creating a PR, feature request, or bug report. Members of a team are encouraged to respond to PRs or issues within the given area, and pro-actively improve robustness, user experience, documentation, and performance of the code.

New and existing contributors are invited to join teams:

  • core: core API/commands (@datalad/team-core)

  • git: Git interface (e.g. GitRepo, protocols, helpers, compatibility) (@datalad/team-git)

  • gitannex: git-annex interface (e.g. AnnexRepo, protocols, helpers, compatibility) (@datalad/team-gitannex)

  • remotes: (special) remote implementations (@datalad/team-remotes)

  • runner: sub-process execution and IO (@datalad/team-runner)

  • services: interaction with 3rd-party services (create-sibling*, downloaders, credentials, etc.) (@datalad/team-services)

Recognizing contributions

We welcome and recognize all contributions from documentation to testing to code development.

You can see a list of current contributors in our zenodo file. If you are new to the project, don't forget to add your name and affiliation there! We also have an .all-contributorsrc that is updated automatically on merges. Once it's merged, if you helped in a non-standard way (e.g., a contribution other than code) you can open a pull request to add any All Contributors Emoji that match your contribution types.

Thank you!

You're awesome. 👋😃

Various hints for developers

Useful tools

  • While performing IO/network-heavy operations, use dstat for quick logging of various health stats in a separate terminal window:

      dstat -c --top-cpu -d --top-bio --top-latency --net
    
  • To monitor the speed of any data pipelining, pv is really handy; just plug it into the middle of your pipe.

  • For remote debugging, epdb can be used (available via pip): place import epdb; epdb.serve() in the Python code and then connect to it with python -c "import epdb; epdb.connect()".

  • We are using codecov, which has extensions for the popular browsers (Firefox, Chrome) that annotate pull requests on GitHub regarding changed coverage.

Useful Environment Variables

Refer to datalad/config.py for information on how to add these environment variables to the config file and on their naming convention.

  • DATALAD_DATASETS_TOPURL: Used to point to an alternative location for the /// dataset. When running tests, it is preferred to set it to https://datasets-tests.datalad.org

  • DATALAD_LOG_LEVEL: Used to control the verbosity of logs printed to stdout while running datalad commands/debugging

  • DATALAD_LOG_NAME: Whether to include logger name (e.g. datalad.support.sshconnector) in the log

  • DATALAD_LOG_OUTPUTS: Used to control whether both stdout and stderr of external command execution are logged in detail (at the DEBUG level)

  • DATALAD_LOG_PID: To instruct datalad to log the PID of the process

  • DATALAD_LOG_TARGET: Where to log: stderr (default), stdout, or another filename

  • DATALAD_LOG_TIMESTAMP: Used to add timestamp to datalad logs

  • DATALAD_LOG_TRACEBACK: If this flag is set to 'collide', runs the TraceBack function with collide set to True; this replaces any common prefix between the current traceback log and the previous invocation with "..."

  • DATALAD_LOG_VMEM: Reports memory utilization (resident/virtual) at every log line, needs psutil module

  • DATALAD_EXC_STR_TBLIMIT: This flag is used by datalad to cap the number of traceback steps included in exception logging and result reporting to DATALAD_EXC_STR_TBLIMIT of pre-processed entries from traceback.

  • DATALAD_SEED: To seed Python's random RNG, which will also be used for the generation of dataset UUIDs, to make those random values reproducible. You might also want to set all the relevant git config variables like we do in one of the travis runs

  • DATALAD_TESTS_TEMP_KEEP: The rmtemp function will not remove temporary files/directories created for testing if this flag is set

  • DATALAD_TESTS_TEMP_DIR: Create a temporary directory at the location specified by this flag. It is used by tests to create a temporary git directory while testing git-annex archives, etc.

  • DATALAD_TESTS_NONETWORK: Skips network tests completely if this flag is set. Examples include tests for S3, git_repositories, OpenfMRI, etc.

  • DATALAD_TESTS_SSH: Skips SSH tests if this flag is not set. If you enable this, you need to set up a "datalad-test" and "datalad-test2" target in your SSH configuration. The second target is used by only a couple of tests, so depending on the tests you're interested in, you can get by with only "datalad-test" configured.

    A Docker image that is used for DataLad's tests is available at https://github.com/datalad-tester/docker-ssh-target. Note that the DataLad tests assume that target files exist in DATALAD_TESTS_TEMP_DIR, which restricts the "datalad-test" target to being either the localhost or a container that mounts DATALAD_TESTS_TEMP_DIR.

  • DATALAD_TESTS_NOTEARDOWN: If this flag is set, teardown_package, which cleans up temp files and directories created by tests, is not executed

  • DATALAD_TESTS_USECASSETTE: Specifies the location of the file in which the VCR module records network transactions. Currently used when testing custom special remotes

  • DATALAD_TESTS_OBSCURE_PREFIX: A string to prefix the most obscure (but supported by the filesystem) test filename

  • DATALAD_TESTS_PROTOCOLREMOTE: Binary flag to specify whether to test protocol interactions of custom remote with annex

  • DATALAD_TESTS_RUNCMDLINE: Binary flag to specify whether shell testing using shunit2 is to be carried out

  • DATALAD_TESTS_TEMP_FS: Specify the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation

  • DATALAD_TESTS_TEMP_FSSIZE: Specify the size of the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation

  • DATALAD_TESTS_NONLO: Specifies network interfaces to bring down/up for testing. Currently used by travis.

  • DATALAD_TESTS_KNOWNFAILURES_PROBE: Binary flag to test whether "known failures" actually still are failures. That is, it changes the behavior of tests decorated with any of the known_failure decorators so that they are not skipped but executed, and fail if they would pass (indicating that the decorator may be removed/reconsidered).

  • DATALAD_TESTS_GITCONFIG: Additional content to add to ~/.gitconfig in the tests HOME environment. \n is replaced with os.linesep.

  • DATALAD_TESTS_CREDENTIALS: Set to system to allow for credentials possibly present in the user/system wide environment to be used.

  • DATALAD_CMD_PROTOCOL: Specifies the protocol used by the Runner to note shell command or Python function call times and allows for dry runs: 'externals-time' for ExecutionTimeExternalsProtocol, 'time' for ExecutionTimeProtocol, and 'null' for NullProtocol. Any new DATALAD_CMD_PROTOCOL has to implement datalad.support.protocol.ProtocolInterface

  • DATALAD_CMD_PROTOCOL_PREFIX: Sets a prefix to add before the command call times are noted by DATALAD_CMD_PROTOCOL.

  • DATALAD_USE_DEFAULT_GIT: Instructs datalad to use the git available in the current environment, and not the one which possibly comes with git-annex (the default behavior).

  • DATALAD_ASSERT_NO_OPEN_FILES: Instructs test helpers to check for open files at the end of a test. If set, remaining open files are logged at ERROR level. Alternative modes are: "assert" (raise AssertionError if any open file is found), "pdb"/"epdb" (drop into debugger when open files are found, info on files is provided in a "files" dictionary, mapping filenames to psutil process objects).

  • DATALAD_ALLOW_FAIL: Instructs the @never_fail decorator to allow failures, e.g. to ease debugging.
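
As an illustration, these variables are typically set inline for a particular invocation of the test suite, e.g.:

DATALAD_LOG_LEVEL=debug DATALAD_TESTS_NONETWORK=1 pytest datalad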

Release(s) workflow

Branches

  • master: changes toward the next MAJOR.MINOR.0 release. Release candidates (tagged with an rcX suffix) are cut from this branch
  • maint: bug fixes for the latest released MAJOR.MINOR.PATCH
  • maint-MAJOR.MINOR: generally not used, unless a release with a critical bug fix is needed.

Workflow

  • upon release of MAJOR.MINOR.0, maint branch needs to be fast-forwarded to that release
  • bug fixes for already released functionality should be submitted against the maint branch
  • cherry-picking fixes from master into maint is allowed where needed
  • master branch accepts PRs with new functionality
  • master branch merges maint as frequently as needed
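
A sketch of the corresponding git operations (the bug-fix branch name is invented for the example):

# start a bug fix against the released code
git checkout -b bf-some-fix origin/maint
# ... fix, commit, and open a PR against maint ...

# periodically bring maint back into master
git checkout master
git merge maint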

Helpers

Makefile provides a number of useful make targets:

  • linkissues-changelog: converts (#ISSUE) placeholders into proper markdown within CHANGELOG.md
  • update-changelog: uses above linkissues-changelog and updates .rst changelog
  • release-pypi: ensures no dist/ exists yet, creates a wheel and a source distribution and uploads to pypi.
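
These are invoked in the usual way, e.g.:

make update-changelog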

Releasing with GitHub Actions, auto, and pull requests

New releases of DataLad are created via a GitHub Actions workflow using datalad/release-action, which was inspired by auto. Whenever a pull request is merged into maint that has the "release" label, that workflow updates the changelog based on the pull requests since the last release, commits the results, tags the new commit with the next version number, and creates a GitHub release for the tag. This in turn triggers a job for building an sdist & wheel for the project and uploading them to PyPI.

CHANGELOG entries and labelling pull requests

DataLad uses scriv to maintain CHANGELOG.md. Adding the label CHANGELOG-missing to a PR triggers a workflow to add a new scriv changelog fragment under changelog.d/ using the PR title as the content. That generated changelog snippet can subsequently be tuned to improve the prospective CHANGELOG entry. The changelog section that the workflow adds the entry to depends on the semver- label added to the PR:

  • semver-minor — for changes corresponding to an increase in the minor version component
  • semver-patch — for changes corresponding to an increase in the patch/micro version component; this is the default label for unlabelled PRs
  • semver-internal — for changes only affecting the internal API
  • semver-documentation — for changes only affecting the documentation
  • semver-tests — for changes to tests
  • semver-dependencies — for updates to dependency versions
  • semver-performance — for performance improvements
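
If you prefer to add a changelog fragment yourself instead of relying on the label-triggered workflow, scriv provides a command for that (a sketch; the fragment's format and location are governed by the project's scriv configuration):

pip install scriv
scriv create        # creates a new fragment under changelog.d/ for you to edit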

git-annex

Even though git-annex is a separate project, DataLad's and git-annex's development is often intertwined.

Filing issues

It is not uncommon to discover potential git-annex bugs or git-annex feature requests while working on DataLad. In those cases, it is common for developers and contributors to file an issue in git-annex's public bug tracker at git-annex.branchable.com. Here are a few hints on how to go about it:

  • You can report a new bug or browse through existing bug reports at git-annex.branchable.com/bugs
  • In order to associate a bug report with the DataLad project, you can add the following markup to the description: [[!tag projects/datalad]]
  • You can add author metadata with the following markup: [[!meta author=yoh]]. Some authors will be automatically associated with the DataLad project by git-annex's bug tracker.

Testing and contributing

To provide downstream testing of development git-annex against DataLad, we maintain the datalad/git-annex repository. It provides daily builds of git-annex with a CI setup to run git-annex's built-in tests and DataLad's tests across all supported operating systems. It also has a facility to test git-annex on your client systems, following the instructions. All the build logs and artifacts (installer packages, etc.) for daily builds and releases are collected using con/tinuous and archived on smaug:/mnt/btrfs/datasets/datalad/ci/git-annex/. You can test your fixes for git-annex by submitting patches for it following the instructions.