Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ORA remote more aware of the remote OS when using SSH #7549

Open
wants to merge 3 commits into
base: maint
Choose a base branch
from

Conversation

mslw
Copy link
Contributor

@mslw mslw commented Jan 23, 2024

The ORA remote (more specifically, its SSHRemoteIO.ensure_writeable context manager) runs the stat command on the remote end to check file permissions. The command format and its output differ between macOS and Linux ("It doesn't even think about a windows server").

Previously, the format was chosen based on the local OS, which makes no sense for a command that is executed on the remote end. This led to situations where it was impossible to push data from Mac to Linux because the stat command errored out, and permissions were left incorrect, when moving a file from transfer to its desired location - see #7536

With this change, the ORA remote will run uname -S on the remote end (of an SSH connection) to determine where it is operating, if it needs to. To the best of my knowledge (https://en.wikipedia.org/wiki/Uname), checking for "Darwin" should be sufficient to detect macOS.

This PR adds remote_uname as a lazy property of ORA remote's SSHRemoteIO class, which is accessed before running said stat commands (replacing a locally operating check of on_osx).

Having access to the property avoids re-running uname for every file that needs to be touched, and having the property lazily
resolved avoids running uname for operations which don't need it.

This should fix #7536

@mslw
Copy link
Contributor Author

mslw commented Jan 24, 2024

I was surprised by the CI failures. These were the failed tests:

  1. Test on macOS / test (snapshot) (pull_request): ../local/tests/test_diff.py::test_path_diff
  2. Test on macOS / test (brew) (pull_request): ../tests/test_annexrepo.py::test_annex_copy_to
  3. Appveyor Ubu20a1: ../datalad/downloaders/tests/test_http.py::test_download_ftp
  4. Appveyor Ubu20P37b: ../datalad/downloaders/tests/test_http.py::test_download_ftp

The Appveyor macOS tests pass. I also noticed that 1, 2, and 4 are using Python 3.7, despite its EOL.

@mslw mslw added the semver-patch Increment the patch version when merged label Jan 24, 2024
@yarikoptic
Copy link
Member

makes sense to me, rerunning failed jobs but might be related given the specific of failure (both OSX) ;-)

Copy link

codecov bot commented Feb 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.24%. Comparing base (8639b90) to head (df864e0).
Report is 3 commits behind head on maint.

Additional details and impacted files
@@            Coverage Diff             @@
##            maint    #7549      +/-   ##
==========================================
- Coverage   91.25%   91.24%   -0.02%     
==========================================
  Files         325      325              
  Lines       43428    43430       +2     
  Branches     5778     5779       +1     
==========================================
- Hits        39630    39626       -4     
- Misses       3783     3789       +6     
  Partials       15       15              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mslw
Copy link
Contributor Author

mslw commented Feb 5, 2024

Looking at the results after the rerun, the only failing test is now:

  1. ../distributed/tests/test_push.py::test_force_checkdatapresent (that's core/distributed/...)

which did not fail before.

Looking at the recent macOS test runs for different PRs, I see that the same error cropped up at least once 1 (run 7645277233) but passed in most of them (e.g. all green run 7782008144). All these runs use the same git-annex version.

I don't see the connection to the code changes, and will therefore re-run the macOS tests once again 🤞

One observation in case the failure persists or happens elsewhere: why would the test_force_checkdatapresent produce spurious results but only sometimes? In the test's code I see comments addressing changes in how git-annex handles timestamps (nb. the comment includes a link, but the link target says "HEAD" instead of the full ID of this commit), and the test asserts either "ok" or "notneeded" result depending on the git-annex version. I did not fully understand the nature of the changes, but maybe the behavior can be affected by when things happen (within a second or not), and the test does not sufficiently account for that?

Footnotes

  1. For the methodical log greppers: FAILED ../distributed/tests/test_push.py::test_force_checkdatapresent - AssertionError: Desired result

@yarikoptic
Copy link
Member

yarikoptic commented Feb 5, 2024

indeed we had it failed elsewhere as well, but there it seems to fail consistently and only on OSX and this PR does concern only with OSX... didn't look deeper but it might be just that this PR unravels otherwise not previously observed issue

edit: this particular FAILED is quite recent

(git)smaug:/mnt/datasets/datalad/ci/logs[master]git
$> datalad foreach-dataset --o-s relpath -r -J10 git grep 'FAILED ../distributed/tests/test_push.py::test_force_checkdatapresent - AssertionError: Desired result'
2024/01/24/pr/7550/4abd2b2/github-Test on macOS-4857-failed/0_test (brew).txt:2024-01-24T21:10:29.4942580Z FAILED ../distributed/tests/test_push.py::test_force_checkdatapresent - AssertionError: Desired result                                                  
2024/01/24/pr/7550/4abd2b2/github-Test on macOS-4857-failed/test (brew)/8_Run tests.txt:2024-01-24T21:10:29.4942580Z FAILED ../distributed/tests/test_push.py::test_force_checkdatapresent - AssertionError: Desired result
2024/01/31/pr/7553/ac778d5/github-CrippledFS-4249-failed/0_test.txt:2024-01-31T21:36:28.8425784Z FAILED ../distributed/tests/test_push.py::test_force_checkdatapresent - AssertionError: Desired result
2024/01/31/pr/7553/ac778d5/github-CrippledFS-4249-failed/test/8_Run tests.txt:2024-01-31T21:36:28.8425781Z FAILED ../distributed/tests/test_push.py::test_force_checkdatapresent - AssertionError: Desired result
2024/02/05/pr/7556/6411506/github-Test on macOS-4873-failed/1_test (snapshot).txt:2024-02-05T02:50:12.4644640Z FAILED ../distributed/tests/test_push.py::test_force_checkdatapresent - AssertionError: Desired result                                              
2024/02/05/pr/7556/6411506/github-Test on macOS-4873-failed/test (snapshot)/8_Run tests.txt:2024-02-05T02:50:12.4644640Z FAILED ../distributed/tests/test_push.py::test_force_checkdatapresent - AssertionError: Desired result
datalad foreach-dataset --o-s relpath -r -J10 git grep   67.02s user 79.67s system 153% cpu 1:35.31 total  

@mslw
Copy link
Contributor Author

mslw commented Feb 5, 2024

Thanks for checking!

Alas, re-running test on macOS for this PR after my previous comment I got test_force_checkdatapresent failing with both snapshot and brew. I still don't think it's related (the test pushes between two local repos) but I don't have good arguments and I understand that you first want to investigate further for underlying issues, related or not.

I looked a little deeper but I'm on Debian. So far I can't reproduce locally (using latest autobuild git-annex), and still don't have a good clue of what might be going on.

mih added a commit to mih/datalad-next that referenced this pull request Apr 17, 2024
Analysis and patch are taken more-or-less exactly from
datalad/datalad#7549. Thanks @mslw!

The patch is slightly modified to be more compact, and work by replacing
`SSHRemoteIO.ensure_writeable()` alone.

Fixes datalad/datalad#7536
The ORA remote (more specifically, its `SSHRemoteIO.ensure_writeable`
context manager) runs the `stat` command on the remote end to check
file permissions. The command format and its output differ between
macOS and Linux.

Previously, the format was chosen based on the *local* OS, which makes
no sense for a command that is executed on the *remote* end. This led
to situations where it was impossible to push data from Mac to Linux
because the stat command errored out, and permissions were left
incorrect, when moving a file from `transfer` to its desired location.

With this change, the ORA remote will run `uname -S` on the remote end
(of an SSH connection) to determine where it is operating.

To the best of my knowledge (https://en.wikipedia.org/wiki/Uname),
checking for "Darwin" should be good.
This adds a property `remote_uname`, with lazy resolution (by running
`uname -s` on the remote end), to the ORA remote's `SHRemoteIO` class.
Thus, the SSH commands can operate with an awareness of the remote OS,
if needed.

One situation when this is needed is the class's `ensure_writable`,
which needs to run `stat` with an OS-specific command (and output)
format. Having access to the property avoids re-running `uname` for
every file that needs to be touched, and having the property lazily
resolved avoids running `uname` for operations which don't need it.

Note that although `uname` is UNIX-specific, the class docstring
already says that "It doesn't even think about a windows server".
@mslw
Copy link
Contributor Author

mslw commented Apr 24, 2024

Hey, I rebased the pull request onto the 1.0 maint, and all tests passed, no doubt thanks to #7581 (see checklist for HEAD~1). The only thing that was red is codecov/project, and I don't think I can do anything about it.

The mac tests now fail at setting up python 3.7, but that is a completely new storyline...

One big caveat is that the datalad test setup can not test operation between OSs, so we have to rely on manual testing. I don't have a Mac, but @christian-monch does and I heard from him (please confirm) that the patch indeed works as intended.

@yarikoptic would you consider merging on that basis?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-patch Increment the patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Publishing to ria store
2 participants