Add benchmarking of various mounting strategies #67

jwodder · 2024-01-19T13:18:55Z

Closes #66.

To do:

codecov · 2024-01-19T13:20:32Z

Codecov Report

Attention: Patch coverage is 64.49275%, with 98 lines in your changes missing coverage. Please review.

Project coverage is 61.30%. Comparing base (d71a25d) to head (9655733).
Report is 5 commits behind head on main.

Files	Patch %	Lines
code/src/healthstatus/mounts.py	56.25%	55 Missing and 1 partial ⚠️
code/src/healthstatus/tests.py	69.35%	19 Missing ⚠️
code/src/healthstatus/__main__.py	71.42%	10 Missing ⚠️
code/src/healthstatus/checker.py	64.00%	8 Missing and 1 partial ⚠️
code/src/healthstatus/core.py	87.50%	3 Missing ⚠️
code/src/healthstatus/util.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   60.43%   61.30%   +0.86%     
==========================================
  Files           9       10       +1     
  Lines         685      876     +191     
  Branches      169      209      +40     
==========================================
+ Hits          414      537     +123     
- Misses        251      319      +68     
  Partials       20       20

satra · 2024-01-19T19:58:25Z

since s3 can have different latencies at different times of the day, let's also make sure we have some estimate of s3 latency during each benchmark if these benchmarks take some time to run. if they run in mins, i would be less worried about latencies, and in such a scenario we should just get multiple estimates to create some error bars.
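The "multiple estimates to create some error bars" idea could be sketched roughly as follows. This is only an illustration, not code from this PR: `time_repeated` and `summarize` are hypothetical helper names, and the callable being timed stands in for whatever a single benchmark run would be.

```python
import statistics
import time
from typing import Callable


def time_repeated(run_once: Callable[[], object], repeats: int = 5) -> list[float]:
    """Run ``run_once`` several times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of durations."""
    mean = statistics.fmean(durations)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, stdev / len(durations) ** 0.5
```

The standard error shrinks with the square root of the number of repeats, so a handful of runs per mount type would already show whether differences between mounts exceed the run-to-run noise.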

jwodder · 2024-01-19T19:59:48Z

@satra

> make sure we have some estimate of s3 latency

How?

satra · 2024-01-19T20:01:38Z

this is old, but something like this: https://github.com/dvassallo/s3-benchmark

jwodder · 2024-01-22T14:47:05Z

@satra This seems like something that should be done separately from dandisets-healthstatus. Trying to integrate it into this PR doesn't seem sensible.

jwodder · 2024-01-23T19:49:30Z

@yarikoptic Problem: dandisets-healthstatus requires Pydantic 2.0, yet this PR adds an extra dependency on dandi (and dandidav, which requires dandi), which still requires Pydantic 1.x.

yarikoptic · 2024-01-23T20:39:12Z

since it is all in motion, I think it would be ok to point to that branch you have for dandi-cli with pydantic 2.0 compat

yarikoptic · 2024-03-27T13:13:55Z

tools/run.sh

@@ -3,7 +3,7 @@ set -ex

 PYTHON="$HOME"/miniconda3/bin/python
 DANDISETS_PATH=/mnt/backup/dandi/dandisets-healthstatus/dandisets
-MOUNT_PATH=/mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse
+MOUNT_PATH=/tmp/dandisets-fuse


let's do it under some safer user-specific location, e.g. /var/run/user/$UID/.

jwodder · 2024-03-27

Why? /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse is the path we've been using on drogon for FUSE-mounting the Dandisets. Also, tools/run.sh is just for generating the healthstatus reports; the shell script for generating the benchmarks is unfinished and hasn't been committed yet.

yarikoptic · 2024-03-27T13:16:28Z

Please provide results of running such benchmarking across possible solutions e.g. on typhon. (should be less busy ATM)

yarikoptic · 2024-03-27T13:17:07Z

scratch that about typhon, I forgot that we rely on having dandisets around. Please do it on drogon.

jwodder · 2024-03-27T13:32:04Z

@yarikoptic You initially said here to run the benchmarks on smaug.

yarikoptic · 2024-03-27T13:36:25Z

if benchmarks rely on full clone of dandisets/ hierarchy, probably best to just run on drogon. If you want to replicate the hierarchy then indeed can do on smaug or typhon. Choose the host you deem most appropriate for this.

jwodder · 2024-03-27T14:01:02Z

@yarikoptic I need permission to sudo-run the following commands on smaug:

/usr/bin/mount -t webdavfs -o allow_other https://webdav.dandiarchive.org /tmp/dandisets-fuse
/usr/bin/mount -t davfs https://webdav.dandiarchive.org /tmp/dandisets-fuse

Note that the colons in the URLs need to be escaped when adding them to the sudoers file.

Also, follow_redirect in /etc/davfs2/davfs2.conf needs to be set to 1.
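For anyone writing the sudoers entries by hand, the escaping mentioned above can be sketched as below. The sudoers manual lists `,`, `:`, `=` and `\` as characters that must be backslash-escaped inside command arguments. The helper names and the `ALL=(root) NOPASSWD:` rule layout are assumptions made for illustration, not what was actually configured on smaug.

```python
def escape_sudoers_args(command: str) -> str:
    """Backslash-escape the characters sudoers treats specially in command arguments."""
    escaped = []
    for ch in command:
        if ch in ",:=\\":
            escaped.append("\\")
        escaped.append(ch)
    return "".join(escaped)


def sudoers_rule(user: str, command: str) -> str:
    """Format a hypothetical passwordless rule; the rule layout is an assumption."""
    return f"{user} ALL=(root) NOPASSWD: {escape_sudoers_args(command)}"


if __name__ == "__main__":
    print(sudoers_rule(
        "jwodder",
        "/usr/bin/mount -t davfs https://webdav.dandiarchive.org /tmp/dandisets-fuse",
    ))
```

A generated file can be syntax-checked with `visudo -c -f <file>` before it is installed.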

yarikoptic · 2024-03-27T14:17:54Z

done

jwodder · 2024-03-27T15:37:07Z

@yarikoptic matlab needs to be installed on smaug so that I can benchmark the associated test.

jwodder · 2024-03-28T16:21:13Z

@yarikoptic Ping.

yarikoptic · 2024-03-28T20:32:37Z

done now -- the same 2022b version is installed systemwide

jwodder · 2024-03-29T11:56:48Z

@yarikoptic When I try to run a matlab test on smaug, it fails with:

    License checkout failed.
    License Manager Error -1
    The license file cannot be found.

    Troubleshoot this issue by visiting: 
    https://www.mathworks.com/support/lme/R2022b/1

    Diagnostic Information:
    Feature: MATLAB 
    License path: /home/jwodder/.matlab/R2022b_licenses:/usr/local/MATLAB/R2022b/licenses/license.dat:/usr/local/MATLAB/R2022b/licenses 
    Licensing error: -1,359. System Error: 2

Note that there is no /usr/local/MATLAB/R2022b/licenses folder on the server.

yarikoptic · 2024-03-29T13:56:42Z

could you please give me the full matlab invocation so I can ensure it works correctly? on smaug, do you run it under your account or some other (like datalad etc)?

jwodder · 2024-03-29T13:58:28Z

@yarikoptic

matlab -nodesktop -batch 'nwb = nwbRead('"'"'/tmp/dandisets-fuse/000016/sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb'"'"')'

where /tmp/dandisets-fuse is a FUSE mount and there is a copy of matnwb in matnwb/ in the current directory (and the envvar MATLABPATH points to this matnwb/). The command is run under my account.
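The `'"'"'` sequences in the command above are just shell quoting for the single quotes inside the MATLAB expression. A minimal sketch of producing the same command line from Python with the standard `shlex` module follows; `matlab_nwbread_command` is a hypothetical helper name, not something in this repository.

```python
import shlex


def matlab_nwbread_command(nwb_path: str) -> str:
    """Build a shell-safe command line that asks MATLAB to nwbRead() one file."""
    # The MATLAB code uses single-quoted strings; a path that itself contains a
    # single quote would need extra MATLAB-level escaping, which is not handled here.
    matlab_code = f"nwb = nwbRead('{nwb_path}')"
    # shlex.join() quotes each argument for a POSIX shell, replacing every
    # embedded single quote with the '"'"' sequence.
    return shlex.join(["matlab", "-nodesktop", "-batch", matlab_code])


if __name__ == "__main__":
    print(matlab_nwbread_command(
        "/tmp/dandisets-fuse/000016/sub-mouse1-fni16/"
        "sub-mouse1-fni16_ses-161228151100.nwb"
    ))
```

When the command is launched from Python rather than pasted into a shell, passing the argument list directly to `subprocess.run` avoids the quoting step altogether.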

yarikoptic · 2024-03-29T14:44:06Z

command didn't run under my account on drogon, but worked (errored but past the license check) under dandi so it is user specific somewhere... strace pointed to /home/dandi/.matlab/R2022b_licenses/ ... but can't be just copied since "Your username does not match the username in the license file." .. started matlab's initiator script under VNC on smaug under my login but provided jwodder as the target login, changed permissions for the license... now works for jwodder account (none else, bleh)

jwodder · 2024-03-29T18:05:58Z

@yarikoptic The benchmarking is failing because the matlab test on FUSE is exceeding the 1-hour timeout. I tried increasing the timeout to 2 hours, but it exceeded that as well. Should I try increasing the timeout to something incredibly high or take another approach?

yarikoptic · 2024-03-29T19:17:09Z

how long would it run on that file if downloaded in full? if it is just generally very slow (half an hour) -- might be smth to relay to matnwb.

jwodder · 2024-03-29T19:23:10Z

@yarikoptic 42 seconds

yarikoptic · 2024-03-29T19:26:44Z

hm. Any ideas on why fuse solution takes that long? how long it takes with datalad-fuse?

jwodder · 2024-03-29T19:29:18Z

@yarikoptic I don't know why it's so slow with FUSE, and I don't know how long it would take with FUSE, as the benchmark code kills the process at the 2-hour timeout.
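To make the behaviour described here concrete: a timed run that is killed at the timeout yields no duration at all, only the fact that the limit was exceeded. The sketch below shows that pattern with the standard library; it is not the benchmark code from this PR, and `time_command` is a hypothetical name.

```python
import subprocess
import sys
import time
from typing import Optional, Sequence


def time_command(argv: Sequence[str], timeout: float) -> Optional[float]:
    """Return the elapsed seconds for ``argv``, or None if it hit ``timeout``."""
    start = time.perf_counter()
    try:
        # On timeout, subprocess.run() kills the child and raises TimeoutExpired.
        subprocess.run(
            argv,
            timeout=timeout,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return None
    return time.perf_counter() - start


if __name__ == "__main__":
    # A trivially short command standing in for a real test invocation.
    print(time_command([sys.executable, "-c", "pass"], timeout=60))
```

Returning `None` instead of the timeout value keeps "did not finish" distinct from "took exactly this long" in any later summary.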

yarikoptic · 2024-03-29T20:34:02Z

please make time out 5 hours and run against both fuse solutions -- datalad-fuse and dandidav + davfs2

yarikoptic · 2024-03-29T20:34:36Z

ideally: profile datalad-fuse while running the test to see where it spends time.

jwodder · 2024-04-01T12:40:16Z

@yarikoptic The matnwb test on datalad-fuse exceeded the five-hour time limit as well.

How exactly should I profile it? Just use py-spy?

yarikoptic · 2024-04-01T16:56:44Z

First - py-spy would not hurt indeed.

Then I would have probably added log lines at DEBUG level within datalad-fuse to see what is actually taking time there if py-spy was not conclusive.

jwodder · 2024-05-30T13:50:16Z

@yarikoptic Is there a way to get datalad's logs to include timestamps?

yarikoptic · 2024-05-30T20:30:09Z

yes, there are also a number of other possibly helpful options (available through env vars or even git config since defined in common_cfg) for augmenting logging behavior:

❯ pwd
/home/yoh/proj/datalad/datalad-maint
❯ grep DATALAD_LOG CONTRIBUTING.md
- *DATALAD_LOG_LEVEL*:
- *DATALAD_LOG_NAME*:
- *DATALAD_LOG_OUTPUTS*:
- *DATALAD_LOG_PID*
- *DATALAD_LOG_TARGET*
- *DATALAD_LOG_TIMESTAMP*:
- *DATALAD_LOG_TRACEBACK*:
- *DATALAD_LOG_VMEM*:

jwodder · 2024-05-31T12:46:30Z

Disregard

@yarikoptic When I try running datalad -l debug fusefs ... on smaug with DATALAD_LOG_TIMESTAMP=1 set, it crashes with:

Traceback (most recent call last):
  File "/bin/datalad", line 33, in <module>
    sys.exit(load_entry_point('datalad==0.19.5', 'console_scripts', 'datalad')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/bin/datalad", line 25, in importlib_load_entry_point
    return next(matches).load()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/metadata/__init__.py", line 202, in load
    module = import_module(match.group('module'))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3/dist-packages/datalad/__init__.py", line 112, in <module>
    cfg = ConfigManager()
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 399, in __init__
    self.reload(force=True)
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 460, in reload
    self._stores[store_id] = self._reload(runargs)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 488, in _reload
    stdout, stderr = self._run(
                     ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 869, in _run
    out = self._runner.run(self._config_cmd + args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/runner/runner.py", line 242, in run
    raise CommandError(
datalad.runner.exception.CommandError: CommandError: 'git --git-dir=/dev/null config -z -l --show-origin' failed with exitcode 1 [err: '/usr/lib/git-annex.linux/git: 2: readlink: not found
/usr/lib/git-annex.linux/git: 6: dirname: not found
** cannot find base directory (I seem to be /usr/lib/git-annex.linux/git)']

I don't know why the datalad executable at /bin/datalad would be executed; when datalad fusefs is run, the first item in PATH is the bin directory in a virtualenv that contains datalad, so I would expect that to be run instead.

EDIT: I realized that I wasn't setting the environment for the datalad fusefs command correctly, so PATH et alii were being wiped out.
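The pitfall in the EDIT above is easy to hit with `subprocess`: passing `env={"DATALAD_LOG_TIMESTAMP": "1"}` replaces the whole environment, so `PATH` and the rest disappear. A minimal sketch of layering the extra variable on top of the inherited environment follows; `run_with_extra_env` is a hypothetical helper, and this is not necessarily how the PR's code sets the environment.

```python
import os
import subprocess
import sys
from typing import Mapping, Sequence


def run_with_extra_env(
    argv: Sequence[str], extra_env: Mapping[str, str]
) -> subprocess.CompletedProcess:
    """Run ``argv`` with ``extra_env`` added to (not replacing) the current environment."""
    # Copy the inherited environment first so PATH and friends survive,
    # then let the extra variables override or extend it.
    env = {**os.environ, **extra_env}
    return subprocess.run(argv, env=env, check=True, capture_output=True, text=True)


if __name__ == "__main__":
    result = run_with_extra_env(
        [sys.executable, "-c", "import os; print(os.environ['DATALAD_LOG_TIMESTAMP'])"],
        {"DATALAD_LOG_TIMESTAMP": "1"},
    )
    print(result.stdout.strip())
```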

jwodder · 2024-05-31T13:20:28Z

@yarikoptic I believe sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb in 000016 was a poor choice of asset to test, as it has seemingly always timed out in the normal fusefs tests, and it continues to time out when testing out benchmarking. (Thus, as the timing-out is not specific to the benchmarking, if you want me to investigate it, you should file a separate issue.) Please choose another asset to test the benchmarking on, one that isn't currently marked as timing out.

yarikoptic · 2024-05-31T16:28:07Z

Let's try on sub-mouse1-fni16/sub-mouse1-fni16_ses-170808184141.nwb in the same dandiset.. if I read yaml correctly it is ok for pynwb and errors out on matnwb (but does not timeout). In general -- feel welcome to choose any asset you deem appropriate and not too "easy" (fast)

jwodder added the enhancement New feature or request label Jan 19, 2024

jwodder self-assigned this Jan 19, 2024

jwodder force-pushed the gh-66 branch 2 times, most recently from c237d13 to 2c7bebe Compare January 22, 2024 21:00

jwodder force-pushed the gh-66 branch from 42885a7 to 0c27cb1 Compare February 5, 2024 17:45

yarikoptic mentioned this pull request Feb 16, 2024

datalad-fuse via fsspec caching is too slow - needs a solution #53

Open

yarikoptic reviewed Mar 27, 2024

jwodder force-pushed the gh-66 branch from 2d66ed0 to 9fbbae7 Compare March 27, 2024 14:01

jwodder added 8 commits March 27, 2024 10:11

Move fuse-mounting code to a dedicated function

5d34ef3

Mounting dandidav

e4f331d

Mounting webdavfs

3a9f0de

Timing tests

dfcd068

More logging

47b45cc

Add "dandi" extra

fb0f546

time-mounts command

b127ab7

Only use anyio.Path where it's actually needed

9d91982

--mounts option

c3a7895

jwodder added 2 commits March 27, 2024 10:24

Use hosted dandidav service instead of running locally

f28b180

Shell script for running benchmarks

9a13d6e

jwodder force-pushed the gh-66 branch from a942d75 to 9a13d6e Compare March 27, 2024 14:24

jwodder added 3 commits March 27, 2024 10:28

The weird import error seems to be gone now

74e2729

Ensure dataset directory and mount point directory exist

a3a5a77

Adjust bench.sh

695305a

time-mounts: Prepare timed tests

98cdf1e

Fix webdav mount paths

9655733

yarikoptic mentioned this pull request May 31, 2024

Investigate timeouts for a good number of assets #75

Open
