Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable oppportunistic fd counting fast path #486

Merged
merged 2 commits into from
Mar 7, 2023

Conversation

bobrik
Copy link
Contributor

@bobrik bobrik commented Dec 26, 2022

Existing slow path takes ~725ms of CPU time on my laptopto count 1 million open files. Compare this to just 0.25ms for the baseline, when no extra files are open by a Go program:

Open files reported: 7
 Gathered metrics in 0.26ms
Open files reported: 1000007
 Gathered metrics in 724.50ms

Adding fastpath from Linux v6.2 makes it fast:

Open files reported: 6
 Gathered metrics in 0.29ms
Open files reported: 1000006
 Gathered metrics in 0.31ms

This is before taking in account any lock contention effects in the kernel if you try to count files from multiple threads concurrently, which makes the slow path even slower, burning a lot more CPU in the process. See:

The code I used
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	run()

	openLotsOfFiles(1000000)

	run()
}

func run() {
	s := time.Now()

	metrics, err := prometheus.DefaultGatherer.Gather()
	if err != nil {
		panic(err)
	}

	for _, metric := range metrics {
		if metric.GetName() == "process_open_fds" {
			fmt.Printf("Open files reported: %.0f\n", *metric.Metric[0].Gauge.Value)
		}
	}

	fmt.Printf(" Gathered metrics in %.2fms\n", float64(time.Since(s).Microseconds())/1000)
}

func openLotsOfFiles(lots int) []*os.File {
	files := make([]*os.File, lots)

	for i := 0; i < lots; i++ {
		file, err := os.Open("/etc/hosts")
		if err != nil {
			panic(err)
		}

		files[i] = file
	}

	return files
}
go build -o /tmp/lol .
for i in $(seq 1 5); do GOMAXPROCS=1 /tmp/lol; done

Existing slow path takes ~725ms of CPU time on my laptopto count
1 million open files. Compare this to just 0.25ms for the baseline,
when no extra files are open by a Go program:

```
Open files reported: 7
 Gathered metrics in 0.26ms
Open files reported: 1000007
 Gathered metrics in 724.50ms
```

Adding fastpath from Linux v6.2 makes it fast:

```
Open files reported: 6
 Gathered metrics in 0.29ms
Open files reported: 1000006
 Gathered metrics in 0.31ms
```

This is before taking in account any lock contention effects in the kernel
if you try to count files from multiple threads concurrently, which makes
the slow path even slower, burning a lot more CPU in the process. See:

* torvalds/linux@f1f1f2569901

Signed-off-by: Ivan Babrou <github@ivan.computer>
@bobrik
Copy link
Contributor Author

bobrik commented Dec 26, 2022

Tests are unhappy:

--- FAIL: TestFileDescriptorsLen (0.00s)
    proc_test.go:244: want fds 5, have 224

That's because they make FileDescriptorsLen look at the Stat() result for a real directory, which behaves differently from /proc/pid/fd. I'm open to suggestions on how to address this in tests.

ivan@vm:~/projects/prometheus-procfs$ stat testdata/fixtures/proc/26231/fd | head -2
  File: testdata/fixtures/proc/26231/fd
  Size: 224       	Blocks: 0          IO Block: 1048576 directory
ivan@vm:~/projects/prometheus-procfs$ ls testdata/fixtures/proc/26231/fd | wc -l
5

Copy link
Member

@discordianfish discordianfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM, thanks!

@discordianfish
Copy link
Member

No clear recommendation for the tests.. Might need to restructure this to fix it.

For some behaviors there is no substitution for real `/proc`.

One example is `stat()` for `/proc/<pid>/fd` returning the number
of open files for `procfs`. The field has a different meaning on
regular filesystems and we need to check for it in tests.

Work around this by carrying the fact that `FS` is real in the `FS`
struct itself, so that it can be checked where it matters. The realness
is determined once per `FS` creation by doing `statfs()` and checking
for the `PROC_SUPER_MAGIC` match in the returned fs type.

Signed-off-by: Ivan Babrou <github@ivan.computer>
@bobrik
Copy link
Contributor Author

bobrik commented Jan 4, 2023

I added a commit with a possible solution. I'm not sure if it's a good one, but I can't think of anything better.

Initially I hoped for a procfs detection in stat() syscall result itself, but it doesn't seem possible. Adding another syscall to double check whether stat() result can be trusted doesn't seem right either, as we only need to learn this fact once per mountpoint.

@discordianfish discordianfish merged commit 80824fa into prometheus:master Mar 7, 2023
@bobrik bobrik deleted the ivan/file-count-fastpath branch March 7, 2023 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants