Inconsistent inode numbers for mounted files #10047
Comments
Yes, gVisor makes up inode numbers for host files. If I recall correctly, this is necessary to support checkpoint/restore, such that inode numbers remain consistent after a container is restored on a different machine.
Hmm. Checkpoint/restore seems rather finicky in the presence of hostPath mounts anyway, I think? There's no guarantee that the host's filesystem looks identical on another machine, I guess 🤔. Isn't checkpoint/restore off the table in such a scenario anyway? That said, I'd personally be fine with a flag that essentially disables checkpoint/restore in favor of use-cases like the one described.
There has been work done on this front. Just want to provide the context.

There are two things used to identify files on a filesystem: 1) the inode number and 2) the device ID. Different files can have the same inode number on different devices. The gVisor sandbox virtualizes device IDs, so we can't share host device IDs. As of now, what we do in the gofer filesystem is give the gofer mount a virtual (sandbox-internal) device ID and generate new inode numbers (incrementally) for each file. The inode number generation is done by combining the host inode number and host device ID. This is actually quite expensive: we need to maintain a huge, ever-growing map (covering every unique file ever encountered) from [host inode, host device ID] to gofer inode number. If a host file is not found in this map, we increment a counter and use that value as the inode number. Note that we can not pass through the host inode number as-is because there might be conflicts (the host filesystem being served may itself contain multiple mountpoints with different devices and conflicting inode numbers). Because of this, syscalls that report inode numbers have to pay for this lookup.

For these performance reasons, @nixprime had made this proposal: #6665 (comment). As per this, we can pass through the host inode number but map the host device ID to a sentry-internal device ID. I had implemented this proposal in #7801, but I dropped it for S/R reasons: #6665 (comment).

My question to you is: will the approach taken in #7801 work for you? It will give you the same inode numbers across usages, but the device ID of the file may be different in pods.
For my somewhat specific use-case of fluent-bit, I do believe that just having stable inode numbers across pods would be sufficient. For posterity, their sqlite DB looks like this:

(The first two instances are different instances of the "same" fluent-bit pod on the same node; the last one is me changing to …) They don't seem to be taking the device ID into account (Vector does: https://vector.dev/docs/reference/configuration/sources/file/#fingerprint.strategy, but it provides an alternative strategy that would sidestep this issue altogether). So yes, I believe just having stable inodes would be fine.
@ayushr2 do you have a feeling for whether your proposal could be massaged into an acceptable state wrt. S/R, or whether it could be put behind a flag for that reason? Let me know if I can be of any help or if you'd want me to take a whack at trying to implement it. I'm trying to get a sense of the alternatives I have for moving forward on our side.
Hey @markusthoemmes, I am having this conversation internally. I think we are committed to checking in the inode passthrough approach (#7801). It is a performance and compatibility win. It is just a matter of whether we want to do that unconditionally or preserve the current behavior behind a flag. I will update here once we have a conclusion. Reasons for not preserving the current S/R behavior:
Thanks for the update @ayushr2, hugely appreciated! 🥳
There are some applications that rely on both device ID and inode number stability, so just having the "inode passthrough" would not suffice: the device IDs are still virtualized, and on restore we would have to reassign sentry-internal device IDs (which may change even though the underlying host device/inode numbers didn't). I guess it is best to gate the current behavior behind a flag and implement the "inode passthrough" approach as the default.
I do not have cycles immediately to pick this up. @markusthoemmes if this is urgent for you, feel free to rebase #7801 and implement it with a flag. Happy to code review. Otherwise, I will try to pick this up soon-ish.
@ayushr2 not ultra urgent, but it'd be interesting to know a rough timeline just for expectation management. I can try to take a whack at it, but superficially it looks like there are a few dragons there if both paths have to be kept intact 😅
Description
I'm trying to run fluent-bit inside of gVisor. It uses a `hostPath` mount to read all the container logs and then forward them somewhere else. To keep track of what was already dealt with, it writes a sqlite DB file to disk (also a `hostPath` in my case). It keeps track of the underlying files by path and inode.

After noticing some log duplication after rolling my pods, I dug in, and it seems the files mounted via the `hostPath` mount do not have consistent inode numbering: it varies from restart to restart of the container. That completely breaks any kind of tracking fluent-bit could be doing in this case. The inode numbers are very low, suggesting to me that gVisor is doing some internal assignment of these numbers; the "real" inodes are way higher.

I've tried switching `directfs` and `overlay2` on and off but haven't noticed any change in behavior.

Steps to reproduce
Run the following pod and compare the logs it produces. Preferably, there's a couple of pods on the same machine to trigger the effect.
runsc version
docker version (if using docker)
No response
uname
Linux pool-apps-appworkload-shared-s-4vcpu-8gb-o6j7h 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response