Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add benchmarking of various mounting strategies #67

Draft
wants to merge 29 commits into
base: main
Choose a base branch
from
Draft

Conversation

jwodder
Copy link
Member

@jwodder jwodder commented Jan 19, 2024

Closes #66.

To do:

  • Implement mounting & unmounting with webdavfs
  • Implement mounting & unmounting with davfs2
  • Implement timing of the following tests:
    • pynwb_open_load_ns
    • matnwb_nwbRead
    • DANDI_CACHE=ignore dandi ls (to load metadata) on a single local asset
  • Add a subcommand for doing all the benchmarking at once (mounting each mount and running & timing of tests)
    • Emit a summary of timing results
    • Add an option for only using specified mount types
    • Add an option for whether to update the local clone of the dandisets repo for fusefs
  • Add a subcommand that just runs & times the tests against a path given on the command line
  • When mounting with fusefs, clone the dandisets repository first if it isn't present at the dataset path.
  • PROBLEM: webdavfs doesn't support redirects: Support redirects for non-collection resources miquels/webdavfs#30
    • Comment out the webdavfs support for now
  • Document how to set up davfs2 for use by dandisets-healthstatus
  • Actually run benchmarks
    • Run on smaug
    • The assets to test should be one (or more?) sample assets of some "typical" size (a few GBs). sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb in 000016 is suggested as a possible candidate.

@jwodder jwodder added the enhancement New feature or request label Jan 19, 2024
@jwodder jwodder self-assigned this Jan 19, 2024
Copy link

codecov bot commented Jan 19, 2024

Codecov Report

Attention: Patch coverage is 64.49275% with 98 lines in your changes are missing coverage. Please review.

Project coverage is 61.30%. Comparing base (d71a25d) to head (9655733).
Report is 5 commits behind head on main.

Files Patch % Lines
code/src/healthstatus/mounts.py 56.25% 55 Missing and 1 partial ⚠️
code/src/healthstatus/tests.py 69.35% 19 Missing ⚠️
code/src/healthstatus/__main__.py 71.42% 10 Missing ⚠️
code/src/healthstatus/checker.py 64.00% 8 Missing and 1 partial ⚠️
code/src/healthstatus/core.py 87.50% 3 Missing ⚠️
code/src/healthstatus/util.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   60.43%   61.30%   +0.86%     
==========================================
  Files           9       10       +1     
  Lines         685      876     +191     
  Branches      169      209      +40     
==========================================
+ Hits          414      537     +123     
- Misses        251      319      +68     
  Partials       20       20              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@satra
Copy link
Member

satra commented Jan 19, 2024

since s3 can have different latencies at different times of the day, let's also make sure we have some estimate of s3 latency during each benchmark if these benchmarks take some time to run. if they run in mins, i would be less worried about latencies, and in such a scenario we should just get multiple estimates to create some error bars.

@jwodder
Copy link
Member Author

jwodder commented Jan 19, 2024

@satra

make sure we have some estimate of s3 latency

How?

@satra
Copy link
Member

satra commented Jan 19, 2024

this is old, but something like this: https://github.com/dvassallo/s3-benchmark

@jwodder
Copy link
Member Author

jwodder commented Jan 22, 2024

@satra This seems like something that should be done separately from dandisets-healthstatus. Trying to integrate it into this PR doesn't seem sensible.

@jwodder jwodder force-pushed the gh-66 branch 2 times, most recently from c237d13 to 2c7bebe Compare January 22, 2024 21:00
@jwodder
Copy link
Member Author

jwodder commented Jan 23, 2024

@yarikoptic Problem: dandisets-healthstatus requires Pydantic 2.0, yet this PR adds an extra dependency on dandi (and dandidav, which requires dandi), which still requires Pydantic 1.x.

@yarikoptic
Copy link
Member

since it all in motion, I think it would be ok to point to that branch you have for dandi-cli with pydantic 2.0 compat

@@ -3,7 +3,7 @@ set -ex

PYTHON="$HOME"/miniconda3/bin/python
DANDISETS_PATH=/mnt/backup/dandi/dandisets-healthstatus/dandisets
MOUNT_PATH=/mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse
MOUNT_PATH=/tmp/dandisets-fuse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do it under some safer user-specific location, e.g. /var/run/user/$UID/.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse is the path we've been using on drogon for FUSE-mounting the Dandisets. Also, tools/run.sh is just for generating the healthstatus reports; the shell script for generating the benchmarks is unfinished and hasn't been committed yet.

@yarikoptic
Copy link
Member

Please provide results of running such benchmarking across possible solutions e.g. on typhon. (should be less busy ATM)

@yarikoptic
Copy link
Member

scrape that about typhon, I forgot that we rely on having dandisets around. Please do it on drogon.

@jwodder
Copy link
Member Author

jwodder commented Mar 27, 2024

@yarikoptic You initially said here to run the benchmarks on smaug.

@yarikoptic
Copy link
Member

if benchmarks rely on full clone of dandisets/ hierarchy, probably best to just run on drogon. If you want to replicate the hierarchy then indeed can do on smaug or typhon. Choose the host you deem most appropriate for this.

@jwodder
Copy link
Member Author

jwodder commented Mar 27, 2024

@yarikoptic I need permission to sudo-run the following commands on smaug:

/usr/bin/mount -t webdavfs -o allow_other https://webdav.dandiarchive.org /tmp/dandisets-fuse
/usr/bin/mount -t davfs https://webdav.dandiarchive.org /tmp/dandisets-fuse

Note that the colons in the URLs need to be escaped when adding them to the sudoers file.

Also, follow_redirect in /etc/davfs2/davfs2.conf needs to be set to 1.

@yarikoptic
Copy link
Member

done

@jwodder
Copy link
Member Author

jwodder commented Mar 27, 2024

@yarikoptic matlab needs to be installed on smaug so that I can benchmark the associated test.

@jwodder
Copy link
Member Author

jwodder commented Mar 28, 2024

@yarikoptic Ping.

@yarikoptic
Copy link
Member

done now -- the same 2022b version is installed systemwide

@jwodder
Copy link
Member Author

jwodder commented Mar 29, 2024

@yarikoptic When I try to run a matlab test on smaug, it fails with:

    License checkout failed.
    License Manager Error -1
    The license file cannot be found.

    Troubleshoot this issue by visiting: 
    https://www.mathworks.com/support/lme/R2022b/1

    Diagnostic Information:
    Feature: MATLAB 
    License path: /home/jwodder/.matlab/R2022b_licenses:/usr/local/MATLAB/R2022b/licenses/license.dat:/usr/local/MATLA
    B/R2022b/licenses 
    Licensing error: -1,359. System Error: 2

Note that there is no /usr/local/MATLAB/R2022b/licenses folder on the server.

@yarikoptic
Copy link
Member

could you please give me full matlab invocation to ensure to work correctly? on smaug you do it under your account or some other (like datalad etc)?

@jwodder
Copy link
Member Author

jwodder commented Mar 29, 2024

@yarikoptic

matlab -nodesktop -batch 'nwb = nwbRead('"'"'/tmp/dandisets-fuse/000016/sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb'"'"')'

where /tmp/dandisets-fuse is a FUSE mount and there is a copy of matnwb in matnwb/ in the current directory (and the envvar MATLABPATH points to this matnwb/). The command is run under my account.

@yarikoptic
Copy link
Member

command didn't run under my account on drogon, but worked (errored but past the license check) under dandi so it is user specific somewhere... strace pointed to /home/dandi/.matlab/R2022b_licenses/ ... but can't be just copied since "Your username does not match the username in the license file." .. started matlab's initiator script under VNC on smaug under my login but provided jwodder as the target login, changed permissions for the license... now works for jwodder account (none else, bleh)

@jwodder
Copy link
Member Author

jwodder commented Mar 29, 2024

@yarikoptic The benchmarking is failing because the matlab test on FUSE is exceeding the 1-hour timeout. I tried increasing the timeout to 2 hours, but it exceeded that as well. Should I try increasing the timeout to something incredibly high or take another approach?

@yarikoptic
Copy link
Member

how long would it run on that file if downloaded in full? if it is just generally very slow (half an hour) -- might be smth to relay to matnwb.

@jwodder
Copy link
Member Author

jwodder commented Mar 29, 2024

@yarikoptic 42 seconds

@yarikoptic
Copy link
Member

hm. Any ideas on why fuse solution takes that long? how long it takes with datalad-fuse?

@jwodder
Copy link
Member Author

jwodder commented Mar 29, 2024

@yarikoptic I don't know why it's so slow with FUSE, and I don't know how long it would take with FUSE, as the benchmark code kills the process at the 2-hour timeout.

@yarikoptic
Copy link
Member

please make time out 5 hours and run against both fuse solutions -- datalad-fuse and dandidav + davfs2

@yarikoptic
Copy link
Member

ideally: profile datalad-fuse while running the test to see where it spends time.

@jwodder
Copy link
Member Author

jwodder commented Apr 1, 2024

@yarikoptic The matnwb test on datalad-fuse exceeded the five-hour time limit as well.

How exactly should I profile it? Just use py-spy?

@yarikoptic
Copy link
Member

First - py-spy would not hurt indeed.

Then I would have probably added log lines at DEBUG level within datalad-fuse to see what is actually taking time there if py-spy was not conclusive.

@jwodder
Copy link
Member Author

jwodder commented May 30, 2024

@yarikoptic Is there a way to get datalad's logs to include timestamps?

@yarikoptic
Copy link
Member

yes, there is also a number of other possibly helpful options (available through env vars or even git config since defined in common_cfg) for augmenting logging behavior:

❯ pwd
/home/yoh/proj/datalad/datalad-maint
❯ grep DATALAD_LOG CONTRIBUTING.md
- *DATALAD_LOG_LEVEL*:
- *DATALAD_LOG_NAME*:
- *DATALAD_LOG_OUTPUTS*:
- *DATALAD_LOG_PID*
- *DATALAD_LOG_TARGET*
- *DATALAD_LOG_TIMESTAMP*:
- *DATALAD_LOG_TRACEBACK*:
- *DATALAD_LOG_VMEM*:

@jwodder
Copy link
Member Author

jwodder commented May 31, 2024

Disregard

@yarikoptic When I try running datalad -l debug fusefs ... on smaug with DATALAD_LOG_TIMESTAMP=1 set, it crashes with:

Traceback (most recent call last):
  File "/bin/datalad", line 33, in <module>
    sys.exit(load_entry_point('datalad==0.19.5', 'console_scripts', 'datalad')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/bin/datalad", line 25, in importlib_load_entry_point
    return next(matches).load()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/metadata/__init__.py", line 202, in load
    module = import_module(match.group('module'))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3/dist-packages/datalad/__init__.py", line 112, in <module>
    cfg = ConfigManager()
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 399, in __init__
    self.reload(force=True)
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 460, in reload
    self._stores[store_id] = self._reload(runargs)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 488, in _reload
    stdout, stderr = self._run(
                     ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 869, in _run
    out = self._runner.run(self._config_cmd + args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/runner/runner.py", line 242, in run
    raise CommandError(
datalad.runner.exception.CommandError: CommandError: 'git --git-dir=/dev/null config -z -l --show-origin' failed with exitcode 1 [err: '/usr/lib/git-annex.linux/git: 2: readlink: not found
/usr/lib/git-annex.linux/git: 6: dirname: not found
** cannot find base directory (I seem to be /usr/lib/git-annex.linux/git)']

I don't know why the datalad executable at /bin/datalad would be executed; when datalad fusefs is run, the first item in PATH is the bin directory in a virtualenv that contains datalad, so I would expect that to be run instead.

EDIT: I realized that I wasn't setting the environment for the datalad fusefs command correctly, so PATH et alii were being wiped out.

@jwodder
Copy link
Member Author

jwodder commented May 31, 2024

@yarikoptic I believe sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb in 000016 was a poor choice of asset to test, as it has seemingly always timed out in the normal fusefs tests, and it continues to time out when testing out benchmarking. (Thus, as the timing-out is not specific to the benchmarking, if you want me to investigate it, you should file a separate issue.) Please choose another asset to test the benchmarking on, one that isn't currently marked as timing out.

@yarikoptic
Copy link
Member

Let's try on sub-mouse1-fni16/sub-mouse1-fni16_ses-170808184141.nwb in the same dandiset.. if I read yaml correctly it is ok for pynwb and errors out on matnwb (but does not timeout). In general -- feel welcome to choose any asset you deem appropriate and not too "easy" (fast)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compare performance of webdav+davfs2, webdav+webdavfs, and datalad-fuse
3 participants