
Fix bugzilla 24524: Very slow process fork if RLIMIT_NOFILE is too high #8990

Open

wants to merge 5 commits into master

Conversation

trikko
Contributor

@trikko trikko commented Apr 27, 2024

When the soft limit for NOFILE is set to a high number, the current code for spawnProcess, pipeProcess, etc. becomes very slow and memory-intensive.

This is particularly evident when running a D application inside a Docker container. Docker sets the soft limit to the maximum allowed, which on some systems can be as high as 2^30.

This code reads the list of file descriptors actually in use from /dev/fd or /proc/self/fd, avoiding the need to loop over the entire range of possible file descriptors.
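For illustration, a minimal sketch of the idea in D (not the actual patch: the real code runs in the forked child, so it avoids GC allocation and uses a malloc'd list instead):

import core.sys.posix.dirent : DIR, dirent, opendir, readdir, closedir;
import core.sys.posix.unistd : close;
import core.stdc.stdlib : strtol;

// Close every fd >= minFD that is actually open, by listing /dev/fd.
void closeOpenFDs(int minFD)
{
    DIR* dir = opendir("/dev/fd");   // "/proc/self/fd" on Linux works too
    if (dir is null) return;         // not mounted: fall back to the old loop

    int[] toClose;
    dirent* entry;
    while ((entry = readdir(dir)) !is null)
    {
        if (entry.d_name[0] == '.') continue;   // skip "." and ".."
        immutable fd = cast(int) strtol(entry.d_name.ptr, null, 10);
        if (fd >= minFD) toClose ~= fd;
    }
    closedir(dir);                   // also releases the fd used for the scan;
    foreach (fd; toClose) close(fd); // re-closing that one just returns EBADF
}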

@dlang-bot
Contributor

dlang-bot commented Apr 27, 2024

Thanks for your pull request and interest in making D better, @trikko! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Auto-close | Bugzilla | Severity    | Description
           | 24524    | enhancement | Very slow process fork if RLIMIT_NOFILE is too high

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + phobos#8990"

@CyberShadow
Member

CyberShadow commented Apr 27, 2024

  1. I think this code is very Linux-specific. std.process should be POSIX-compatible.
  2. There is no guarantee that procfs is mounted.

Edit: I see now it falls back to the old code.

@trikko
Contributor Author

trikko commented Apr 27, 2024

1 - I think /dev/fd exists on many Unix-based OSes. Maybe /proc/self/fd is Linux-specific.
2 - That's the reason why I check /dev/fd first.

Please note that this code could be simplified if scandir were declared in druntime's dirent module.

@CyberShadow
Member

I see this adds a third method of doing the same thing (/proc/fd -> poll -> for loop).

Has anyone checked how other language runtimes handle this (besides being consistent about CLOEXEC)?

@trikko
Contributor Author

trikko commented Apr 27, 2024

Some related examples:

LibreOffice uses a function from dirent.h that is not available in druntime: dirfd, which tells you which fd you are using to browse the directory (so you can exclude it and avoid closing the directory you are reading).

Not sure how standard dirent.h is, but with that function we could avoid the list and all the mallocs/frees. Probably just an over-optimization.
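For illustration, a sketch of what that dirfd variant could look like (the extern(C) prototype is an assumption here, since druntime does not declare dirfd at this point):

import core.sys.posix.dirent : DIR, dirent, opendir, readdir, closedir;
import core.sys.posix.unistd : close;
import core.stdc.stdlib : strtol;

extern(C) nothrow @nogc int dirfd(DIR* dir); // assumed: not (yet) in druntime

// Same /dev/fd scan, but with no list and no malloc/free: dirfd tells us which
// descriptor backs the DIR*, so everything else can be closed while iterating
// (procfs/fdescfs tolerates entries disappearing mid-scan in practice).
void closeOpenFDsInline(int minFD)
{
    DIR* dir = opendir("/dev/fd");
    if (dir is null) return;
    immutable ownFD = dirfd(dir);    // the one fd we must not close while reading
    dirent* entry;
    while ((entry = readdir(dir)) !is null)
    {
        if (entry.d_name[0] == '.') continue;
        immutable fd = cast(int) strtol(entry.d_name.ptr, null, 10);
        if (fd >= minFD && fd != ownFD) close(fd);
    }
    closedir(dir);
}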

@trikko
Contributor Author

trikko commented Apr 27, 2024

Even Docker and Podman use this approach to close fds:
https://github.com/docker/docker-ce/blob/master/components/engine/daemon/graphdriver/devmapper/deviceset.go#L1265
https://github.com/containers/podman/blob/main/libpod/container_top_linux.go#L141

@CyberShadow
Member

Even Docker and Podman use this approach to close fds:

Great research! This is good evidence that we're on the right track.

LibreOffice uses a function from dirent.h that is not available in druntime: dirfd, which tells you which fd you are using to browse the directory (so you can exclude it and avoid closing the directory you are reading).

Not sure how standard dirent.h is, but with that function we could avoid the list and all the mallocs/frees. Probably just an over-optimization.

It seems to be in POSIX:

https://pubs.opengroup.org/onlinepubs/9699919799.2013edition/functions/dirfd.html

I think that means it should be safe to use. It can be added to Druntime in parallel to a temporary private extern(C) declaration in Phobos.

Member

@CyberShadow CyberShadow left a comment

Approving now (thanks!), but if dirfd allows getting rid of the linked list, that would be even better.

@trikko
Contributor Author

trikko commented Apr 27, 2024

https://en.wikibooks.org/wiki/C_Programming/POSIX_Reference/dirent.h

Here they say it's a "pseudo-standard": I'm not sure that every C library implements that function. Is there a way to tell when these functions were added to the POSIX standard? Maybe druntime is relying on an older POSIX standard?

I don't want to mess everything up!

@CyberShadow
Member

Is there a way to tell when these functions were added to the POSIX standard?

Bottom of that page says issue 7, which seems to have been released in 2017.

I don't want to mess everything up!

Well, that's what CI is for :)

But another way to approach this is to go through the list of OSes which are supported by DMD (which aren't many). GDC and LDC carry compatibility patches for lots of things anyway, I believe.

Yet another way is to use static if and use this function only if it's declared in Druntime for the current OS. Then, it can be added only for OSes which are known to have it implemented in their libc. This is done in std.file to detect the current platform's subsecond precision API.
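A sketch of that detection approach (assuming the declaration would live in core.sys.posix.dirent):

import core.sys.posix.dirent; // dirfd may or may not be declared for this OS

// Returns the descriptor backing dir where druntime declares dirfd, else -1.
int directoryFD(DIR* dir)
{
    static if (is(typeof(&dirfd)))
        return dirfd(dir);   // declared for this platform: skip this fd inline
    else
        return -1;           // not declared: caller keeps the list-based close
}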

@thewilsonator
Contributor

Please change the commit title to Fix bugzilla 24524: [description of issue] so the bot picks it up.

@trikko trikko changed the title Fix issue #24524 Fix bugzilla 24524: Very slow process fork if RLIMIT_NOFILE is too high Apr 27, 2024
@trikko
Contributor Author

trikko commented Apr 27, 2024

I wonder if I should still keep the old method for the case of low limits. At that point it's probably faster to just iterate and close all the fds, rather than reading a list from "/dev/fd".

I guess it also depends on how many files are open. If the maximum is 1000 and all 1000 are open, obviously the old way works better.

But if only a couple of files are open, the new method wins.

In any case, if the limit is too high, the old method blows up. On macOS the hard limit is set to "unlimited", and this is really dangerous if someone sets a high soft limit :)

@thewilsonator
Contributor

commit message title, not PR title

std/process.d Outdated
}
foreach (i; 0 .. maxToClose)
// Try to open the directory /dev/fd or /proc/self/fd
DIR* dir = opendir("/dev/fd");
Member

This is POSIX code, right? Does this exist on all POSIX platforms? BSDs, OSX, Cygwin, etc.

Comment on lines +1041 to +1043
// Missing druntime declaration
pragma(mangle, "dirfd")
extern(C) nothrow @nogc int dirfd(DIR* dir);
Member

Documentation on opengroup says

This interface was introduced because the Base Definitions volume of POSIX.1-2017 does not make public the DIR data structure.

Unless the year is wrong it looks like the function might be too new to be considered.

Member

See the discussion above :)

Member

See the discussion above :)

I've just seen it now. I'm firmly in the not sure camp, as I'll be the one who'll get stick for undefined reference issues on Solaris 11 and Darwin 12.

Member

Do we have any way of quickly checking?

Being able to run a command / try to compile a simple C program on a variety of OSes would be a really useful tool to have...

Contributor Author

@trikko trikko Apr 28, 2024

The first version I put here indeed used a linked list of fds. I can restore it, if you want.

About Solaris 11: https://stackoverflow.com/a/28025462

About Darwin: this page says those functions were added in 4.2BSD. Isn't Darwin BSD-based? https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/dirfd.3.html

@schveiguy
Member

Hm... I was rather thinking getrlimit would be the test, and then only if it's over a certain limit would it make sense to use the /proc/fd read.

The poll mechanism has the advantage that it is only one syscall for the entire "read" operation.
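For reference, a rough sketch of that poll mechanism (not Phobos's exact code):

import core.sys.posix.poll : pollfd, poll, nfds_t, POLLNVAL;
import core.sys.posix.unistd : close;
import core.stdc.stdlib : calloc, free;

// One pollfd per candidate descriptor and a single poll() with zero timeout;
// POLLNVAL in revents marks the descriptors that are NOT open.
void closeViaPoll(int minFD, int maxFDs)
{
    immutable n = maxFDs - minFD;
    auto fds = cast(pollfd*) calloc(n, pollfd.sizeof);  // 8 bytes per candidate
    if (fds is null) return;
    scope(exit) free(fds);

    foreach (i; 0 .. n)
        fds[i].fd = minFD + i;      // events left at 0: we only want validity

    poll(fds, cast(nfds_t) n, 0);   // the single syscall for the whole range
    foreach (i; 0 .. n)
        if (!(fds[i].revents & POLLNVAL))
            close(fds[i].fd);       // no POLLNVAL => the fd is open
}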

@ibuclaw
Member

ibuclaw commented Apr 28, 2024

@trikko
Contributor Author

trikko commented Apr 28, 2024

Hm... I was rather thinking getrlimit would be the test, and then only if it's over a certain limit would it make sense to use the /proc/fd read.

The poll mechanism has the advantage that it is only one syscall for the entire "read" operation.

... and the disadvantage that it allocates one struct for each possible fd, so it becomes slow and eats memory.

What is the right limit?

@trikko
Contributor Author

trikko commented Apr 28, 2024

I have done some tests with a ulimit starting from 1_000_000 and going down to 100_000, in 10 steps. For each step, I tested the performance with 0, 1000, …, 9000 open file descriptors. The results are attached.

Things to keep in mind:

  • In some cases, the limit is much higher than 1,000,000 and can even reach up to 4000 times as much, even if the user then opens only one file.
  • How many files does a user normally open?

Having said that, it is necessary to find a balance between the two methods, deciding when to use one or the other based on the ulimit. Any ideas?

results.txt

@CyberShadow
Member

Having said that, it is necessary to find a balance between the two methods, deciding when to use one or the other based on the ulimit. Any ideas?

How about benchmarking the various methods, seeing which one is faster in which circumstances, and choosing the implementation to use based on that?

We can use forkPipeOut as a rough estimate of the number of currently-open file descriptors.

@trikko
Contributor Author

trikko commented Apr 28, 2024

Having said that, it is necessary to find a balance between the two methods, deciding when to use one or the other based on the ulimit. Any ideas?

How about benchmarking the various methods, seeing which one is faster in which circumstances, and choosing the implementation to use based on that?

We can use forkPipeOut as a rough estimate of the number of currently-open file descriptors.

Even if POSIX says you must assign the smallest unused FD to a new descriptor, it could be a good rough estimate. Or maybe we can open some "/dev/null" to fill some holes. :)

Benchmark: did you see the results.txt file? Am I missing something?

@CyberShadow
Member

Benchmark: did you see the results.txt file? Am I missing something?

Right - will the dumb for loop always be slower than poll, then?

If so, I guess the rule would be something like if (fd_limit / 10000 < nr_open_fds / 300) { use poll } else { use "/dev/fd" iteration }?
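(That inequality rearranges to fd_limit / nr_open_fds < 10000/300 ≈ 33, i.e. use poll only while the limit stays within a few dozen times the number of open descriptors.)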

@trikko
Contributor Author

trikko commented Apr 28, 2024

With few FDs open it wins, and that's probably the common case. How much software have you written with more than a dozen FDs open at the same time? 🙂

@trikko
Contributor Author

trikko commented Apr 28, 2024

Looking at the stats above, I think:

  • maxfds/(1+estimated_open) > 120 => /dev/fd
  • maxfds > N => /dev/fd
  • else: old way (it allocates 8 bytes * maxfds)

So, how much RAM do we want to allocate for the old method in the worst case?

Then N = maxram/8 (bytes per FD used by poll)
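For example, assuming poll's 8-byte pollfd: capping that worst-case allocation at 1 MiB gives N = 2^20 / 8 = 131072 = 128*1024 descriptors, which is the threshold the simplified logic further down settles on.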

@CyberShadow
Member

Good idea to limit memory usage. Unless someone has better ideas, I can think of two options:

  1. We pick an arbitrary number, like 64KB or 1MB
  2. We pick a number that's small enough that we can get it from the stack, to avoid any potential repercussions of allocating memory during a fork.

@trikko
Contributor Author

trikko commented Apr 28, 2024

Anyway the fallback is still prone to fail/crash if /dev/fd and /proc/self/fd don't exist.

@schveiguy
Member

... and the disadvantage that it allocates one struct for each possible fd, so it becomes slow and eats memory.

Not really. It's one allocation for all the structs. Looping through sequential memory is pretty quick. I would expect the largest cost to be the system call.

Indeed, the best solution would be something that said "here are all the open fds". I believe the poll mechanism is better than the dumb loop, and that it is probably more efficient than the /dev/ file for some small number of max fds.

Your test is using much higher numbers than I would even expect in normal operations. For example, on macOS, my default ulimit -n is 2560. On my Linux system, the limit is 1024.

I would say, if the limit is more than 8k, then try using the /dev/ mechanism. We might also want to split those polls into multiple calls for large numbers of max fds to avoid trying to allocate monstrous blocks. Logic something like:

auto maxfds = getMaxFds;                          // helper names are placeholders
if (maxfds < 8192) doPoll(0);                     // small limit: one poll() covers everything
else if (canOpenDevToLookForFds) doDevSearch();   // otherwise read open fds from /dev/fd
else for (auto minfd = 0; minfd < maxfds; minfd += 8192)
    doPoll(minfd);                                // chunked polls: bounded allocation per call

@trikko
Contributor Author

trikko commented Apr 29, 2024

Your test is using much higher numbers than I would even expect in normal operations. For example, on macOS, my default ulimit -n is 2560. On my Linux system, the limit is 1024.

Once again: we have a problem if the app is running on Docker, for example, where apparently the soft limit is raised up to the hard limit of the host machine. The issue was discovered on Docker.

A guy was trying to run my app in a Docker container on a Linux host with a 2^30 hard limit (and a normal soft limit like yours). He said it takes many seconds to start just one process.

I know you can set the soft limit from inside the app using setrlimit, but a common user just sees the spawnProcess function allocating some GB of RAM.

Edit: so I'm not sure the loop with poll is so fast.

@schveiguy
Member

Yes, I get it. I'm saying the threshold can be much, much lower, like 8k. 100k is too much, let alone 1 million; I don't think we should be using poll at that point. I'm sure the poll version beats the version that opens the directory with only 8k file descriptors, even with a small number of open files.

If we make the threshold 8k, it will use poll on most normal systems and be fast, and use the dev filesystem above that and be fast on e.g. Docker.

@trikko
Contributor Author

trikko commented Apr 30, 2024

Yes, I get it. I'm saying the threshold can be much, much lower, like 8k. 100k is too much, let alone 1 million; I don't think we should be using poll at that point. I'm sure the poll version beats the version that opens the directory with only 8k file descriptors, even with a small number of open files.

If we make the threshold 8k, it will use poll on most normal systems and be fast, and use the dev filesystem above that and be fast on e.g. Docker.

I think that values over 8k are not that uncommon. Please consider that, for example, machines running MongoDB usually have a 64k limit. See the "recommended ulimit settings" section here.

Don't you like the heuristic way of deciding which method to use?

I can of course replace this:

 if (
    r.rlim_cur / (forkPipeOut+1) > 120 ||   // ... the number of open file descriptors is small relative to the limit ...
    r.rlim_cur > 1024*1024                  // ... or the soft limit is high (poll would allocate a huge array)
)

Simply with:

if (r.rlim_cur > 8*1024) 

Is this what you mean?

@trikko
Contributor Author

trikko commented Apr 30, 2024

If we make the threshold 8k, it will use poll on most normal systems and be fast, and use the dev filesystem above that and be fast on e.g. Docker.

Test with limit == 10k, 9k, 8k, ...

The dumb method still seems to be faster than poll when ~ limit/open > 120 (stdin/stdout/stderr alone raise the count to 3).

(Anyway, we can set a "poll only" zone for limit < 20k or something like that, but I would keep the heuristic algorithm.)

test10000.txt

@schveiguy
Member

Can you post the code you use for testing? Does the test include the fork/exec? My interest is only in testing the closing of file descriptors, not the other stuff.

@trikko
Contributor Author

trikko commented Apr 30, 2024

Just a stupid script, using a cloned version of std.process:

import std.stdio;
import std.datetime : Clock, SysTime;
import core.stdc.stdio : FILE, fopen, fclose;

void main()
{
	import core.sys.posix.sys.resource : rlimit, getrlimit, RLIMIT_NOFILE, setrlimit;


	// Get the maximum number of file descriptors that could be open.
	rlimit r;
	getrlimit(RLIMIT_NOFILE, &r);

	r.rlim_cur = 10000;
	setrlimit(RLIMIT_NOFILE, &r);


	for(size_t i = 0; i <= 9000; i+=1000)
	{
		r.rlim_cur = 10000 - i;
		stderr.writeln("\n\n--- Maximum number of file descriptors: ", r.rlim_cur);
		setrlimit(RLIMIT_NOFILE, &r);

		for(size_t k = 0; k < 10; ++k)
		{
			immutable file_to_open = 80*k;
			stderr.writeln("\n  - FDs open: ", file_to_open);
			stderr.writeln("  - Starting spawnProcess(`echo`); 10 times...");

			FILE *[] fds;
			foreach(x; 0 .. file_to_open)
			{
				fds ~= fopen("/dev/urandom", "r");
			}

			SysTime c;

			c = Clock.currTime;
			foreach(_; 0 .. 10)
			{
				import std.process: spawnProcess;
				spawnProcess("echo");
			}
			stderr.writeln("Time (poll): ", Clock.currTime - c);

			c = Clock.currTime;
			foreach(_; 0 .. 10)
			{
				import pro2: spawnProcess;
				spawnProcess("echo");
			}
			stderr.writeln("Time (/dev/fd): ", Clock.currTime - c);

			foreach(fd; fds)
			{
				fclose(fd);
			}

			fds = null;
		}
	}


}

@schveiguy
Member

OK, so the timings include the actual fork/exec of the other process. I think I will try to build a test that does just the fd closing.

@trikko
Contributor Author

trikko commented Apr 30, 2024

OK, so the timings include the actual fork/exec of the other process. I think I will try to build a test that does just the fd closing.

That's why I spawn the same process 10 times.

@trikko
Contributor Author

trikko commented Apr 30, 2024

This works with a simplified logic:

  • over 128*1024 descriptors => dumb way
  • fewer than 128*1024 descriptors => poll
  • fallback => very dumb
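Reading "dumb way" as the /dev/fd scan (trikko's naming in the test10000.txt comment) and "very dumb" as the plain close() loop, the dispatch would look roughly like this, with hypothetical helper names:

enum pollThreshold = 128 * 1024;   // 8-byte pollfd each => at most ~1 MiB for poll
if (maxfds > pollThreshold)
    closeViaDevFd(minFD);          // scan the short list in /dev/fd
else
    closeViaPoll(minFD, maxfds);   // single poll() syscall, bounded allocation
// If /dev/fd and /proc/self/fd are both unavailable, fall back to the
// plain close() loop over every candidate descriptor.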

Member

@schveiguy schveiguy left a comment

Nice, this looks like what I was expecting, thanks!

std/process.d Outdated
@schveiguy
Member

/home/runner/work/phobos/dmd/compiler/test/test_results/runnable/d/testthread_0: undefined symbol: _D4core6thread8osthread6Thread5sleepFNbNiNeSQBq4time8DurationZv

How is the tester not finding some thread symbol? Nothing is changing in druntime...

@ibuclaw
Member

ibuclaw commented May 4, 2024

/home/runner/work/phobos/dmd/compiler/test/test_results/runnable/d/testthread_0: undefined symbol: _D4core6thread8osthread6Thread5sleepFNbNiNeSQBq4time8DurationZv

How is the tester not finding some thread symbol? Nothing is changing in druntime...

Seems to be introduced by #8992

@CyberShadow

This comment was marked as outdated.

@ibuclaw
Member

ibuclaw commented May 4, 2024

Seems to be introduced by #8992

That doesn't look right. It's not even on the same branch. Did you mean to link to a PR in another repo?

If I understand right, it introduces a second version of the compiler to the CI environment.

Reverting it fixes the phobos pipelines.

https://github.com/dlang/phobos/actions/workflows/main.yml?query=branch%3Amaster

@CyberShadow
Member

Oops, you're right!
