Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Stale] Compare identical environments by using a same_file::Handle #179

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

darnuria
Copy link
Contributor

@darnuria darnuria commented Jul 9, 2023

This PR implements this suggestion #145 (comment) and try to fix #180.

Stalled because the current implementation with same_file open files since windows needs it, but lmdb itself use posix locks with fnctl syscall on unixes.

We use the same_file::Handle struct that compares dev_t and ino_t instead of raw paths internally to make the comparison more accurate. On Windows, you should read how it works 馃槄.

EDITS: Edited to add the caveat found during implementing the idea.

Caveat of this PR

Tl;DR Keeping FileDescriptor open, may trouble flock/fcntl works inside lmdb and screw multi-process lock with lmdb.

From comment: #179 (comment)

Unix ideal: just check (dev_t, ino_t)

Unix way

stat kind of function clean nice no File Descriptor keeped.
Note: that same file keep open a File Descriptor open!

https://github.com/BurntSushi/same-file/blob/master/src/unix.rs#L10

Windows

An other same-file rely on winapi32, involving Filehandles.

Lmdb limitation

For sure it may break the flock()ing (not the best system api..), and the multi-process features of lmdb is really handy.

Checked-out lmdb sources:

About flocking in lmdb.h the sentense you pointed changed with:

commit 77845345ca9bf3854fd9da60a3e3b0527fa9c76a
Author: Hallvard Furuseth <hallvard@openldap.org>
Date:   Tue Sep 27 07:03:42 2016 +0200

    ITS#8505 Clarify fork() caveat, mdb_env_get_fd(), flock->fcntl.

diff --git a/libraries/liblmdb/lmdb.h b/libraries/liblmdb/lmdb.h
index 30d5862..319fcf6 100644
--- a/libraries/liblmdb/lmdb.h
+++ b/libraries/liblmdb/lmdb.h
@@ -97,11 +97,12 @@
  *    transactions.  Each transaction belongs to one thread.  See below.
  *    The #MDB_NOTLS flag changes this for read-only transactions.
  *
- *  - Use an MDB_env* in the process which opened it, without fork()ing.
+ *  - Use an MDB_env* in the process which opened it, not after fork().
  *
  *  - Do not have open an LMDB database twice in the same process at
  *    the same time.  Not even from a plain open() call - close()ing it
- *    breaks flock() advisory locking.
+ *    breaks fcntl() advisory locking.  (It is OK to reopen it after
+ *    fork() - exec*(), since the lockfile has FD_CLOEXEC set.)
  *
  *  - Avoid long-lived transactions.  Read transactions prevent
  *    reuse of pages freed by newer write transactions, thus the

@Kerollmops
Copy link
Member

Kerollmops commented Jul 9, 2023

Thank you @darnuria,

It looks like the same_file::Handle implements Eq + Hash, which makes it a good candidate for replacing the current PathBuf key in the OPENED_ENV hashmap. What do you think?

@darnuria darnuria force-pushed the open_env_device_inode branch 3 times, most recently from 0fba17a to 993fe55 Compare July 9, 2023 16:08
heed/src/env.rs Outdated Show resolved Hide resolved
heed/src/env.rs Show resolved Hide resolved
heed/src/env.rs Outdated Show resolved Hide resolved
heed/src/env.rs Outdated Show resolved Hide resolved
@darnuria darnuria marked this pull request as ready for review July 9, 2023 16:12
@Kerollmops Kerollmops added the breaking A change that is breaking the semver label Jul 10, 2023
heed/src/env.rs Show resolved Hide resolved
@Kerollmops Kerollmops changed the title Check env with device and Inode instead of paths. Compare identical environments by using a same_file::Handle Jul 10, 2023
heed/src/env.rs Outdated Show resolved Hide resolved
heed/src/env.rs Show resolved Hide resolved
Comment on lines +42 to +44
/// Mind that Handle currently open the file, so avoid writing through the fd held by [same_file::Handle],
/// since the file will also be opened by LMDB.
static OPENED_ENV: Lazy<RwLock<HashMap<Handle, EnvEntry>>> = Lazy::new(RwLock::default);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaved a mark since Handle open the file and hold a file descriptor https://github.com/BurntSushi/same-file/blob/master/src/unix.rs#L10

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so can we add a comment about it? I mean a really documentation comment in the lib.rs file. Under an Important/Remark header, please? I don't know how LMDB behaves towards that.

I am not sure that we really care about this sentence as we keep it open the whole env-lifetime right?

Do not have open an LMDB database twice in the same process at the same time. Not even from a plain open() call - close()ing it breaks flock() advisory locking.

Copy link
Contributor Author

@darnuria darnuria Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(backported in description in case this PR got stuck/closed to help somebody)
Yeah my worries comes from that line, and my first idea to just check (dev_t, ino_t)

For sure it may break the flock()ing (not the best system api..), and the multi-process features of lmdb is really handy.

Checked-out lmdb sources:

About flocking in lmdb.h the sentense you pointed changed with:

commit 77845345ca9bf3854fd9da60a3e3b0527fa9c76a
Author: Hallvard Furuseth <hallvard@openldap.org>
Date:   Tue Sep 27 07:03:42 2016 +0200

    ITS#8505 Clarify fork() caveat, mdb_env_get_fd(), flock->fcntl.

diff --git a/libraries/liblmdb/lmdb.h b/libraries/liblmdb/lmdb.h
index 30d5862..319fcf6 100644
--- a/libraries/liblmdb/lmdb.h
+++ b/libraries/liblmdb/lmdb.h
@@ -97,11 +97,12 @@
  *    transactions.  Each transaction belongs to one thread.  See below.
  *    The #MDB_NOTLS flag changes this for read-only transactions.
  *
- *  - Use an MDB_env* in the process which opened it, without fork()ing.
+ *  - Use an MDB_env* in the process which opened it, not after fork().
  *
  *  - Do not have open an LMDB database twice in the same process at
  *    the same time.  Not even from a plain open() call - close()ing it
- *    breaks flock() advisory locking.
+ *    breaks fcntl() advisory locking.  (It is OK to reopen it after
+ *    fork() - exec*(), since the lockfile has FD_CLOEXEC set.)
  *
  *  - Avoid long-lived transactions.  Read transactions prevent
  *    reuse of pages freed by newer write transactions, thus the

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what do we conclude? Can we still use the same_file::Handle struct? If not, what do we plan? Is it even possible to get the dev_t and ino_t without opening the file?

Copy link
Contributor Author

@darnuria darnuria Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stat function family (except the fstatat) check metadata and return and don't maintain a file descriptor.

It's the kind of function used by metadata in rust std, sadly posix never standardized a "device / FSID" so ino_t/dev_t is left to implementation detail.

For windows it's more https://docs.rs/winapi-util/latest/winapi_util/file/fn.information.html
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfileinformationbyhandle

heed/src/env.rs Outdated Show resolved Hide resolved
heed/src/env.rs Outdated Show resolved Hide resolved
@Kerollmops Kerollmops added this to the v0.20.0 milestone Jul 10, 2023
@Kerollmops
Copy link
Member

I read somewhere that hard links on Windows require Administrator privileges...

@darnuria
Copy link
Contributor Author

darnuria commented Jul 10, 2023

(not directly related to hardlink) Yeah windows CI had troble with theses two: https://github.com/meilisearch/heed/actions/runs/5508749200/jobs/10041079210?pr=179#step:4:140

@Kerollmops
Copy link
Member

Do you know if we can run a certain pipeline in the CI as admin or Windows admin?

Samefile abstract away differences between Unixes and windows to check
if two files lead to the same file on the underlying filesystem.

Now it should follow through soft-links, hard-links and file moved.
Precise that Handle open the file.
heed/src/env.rs Outdated Show resolved Hide resolved
@darnuria
Copy link
Contributor Author

Do you know if we can run a certain pipeline in the CI as admin or Windows admin?

Don't know, also discovering that windows way of doing is really complicated. 馃ぁ, like moving dir or getting file information.

assert_eq!(env_name, env.path());

let env_renamed = dir.path().join("serafina.mdb");
std::fs::rename(&env_name, &env_renamed).unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kerollmops
Copy link
Member

Don't know, also discovering that windows way of doing is really complicated. 馃ぁ, like moving dir or getting file information.

I would like to skip the tests that require Admin rights on Windows as I don't have time to make them work in the CI for now 馃槵 What do you think?

@darnuria darnuria marked this pull request as draft July 20, 2023 16:06
@darnuria darnuria changed the title Compare identical environments by using a same_file::Handle [Stale] Compare identical environments by using a same_file::Handle Jul 20, 2023
@Kerollmops Kerollmops modified the milestones: v0.20.0, v0.21.0 Feb 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking A change that is breaking the semver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Checking for already opened LMDB Environments
2 participants