
CPU usage increases after folder scans (and hence, over time) #9417

Closed
olifre opened this issue Feb 13, 2024 · 15 comments · Fixed by #9451
Labels: bug, needs-triage
Milestone: v1.27.5

Comments

olifre (Contributor) commented Feb 13, 2024

  • Syncthing: v1.27.3, Linux (64-bit Intel/AMD)
  • OS: Debian 12, Kernel 6.1.0-18-amd64
  • Size: 23,075 files, 1,738 dirs, ~45.6 GiB total size, 6 folders
  • Staggered versioning enabled (not sure if relevant)
  • inotify watches enabled

Observation

CPU usage is generally low (< 1 % on a small Intel SoC), but keeps growing slowly over time, reaching 2 % and above after more than two days, i.e. more than doubling while idle. After prolonged runtimes of some weeks, I have also observed > 4 %.

Reproduction steps

Either wait ;-), or run many folder rescans. After slowly hitting "Rescan all folders" about 80 times and then waiting until everything had settled down, CPU usage had increased as well.
So the regular hourly scan is likely what causes it.

Further information

Initially reported (before I found a reproducer) in the forum:
https://forum.syncthing.net/t/slowly-increasing-cpu-usage-over-time/21639/2

Note that I also observe a higher rate of syscalls in this state, such as:

[pid  1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4,  <unfinished ...>
[pid   962] <... nanosleep resumed>NULL) = 0
[pid   962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=2996919}) = -1 ETIMEDOUT (Connection timed out)
[pid  1605] <... epoll_pwait resumed>[], 128, 13, NULL, 0) = 0
[pid   962] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid  1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4,  <unfinished ...>
[pid   962] <... nanosleep resumed>NULL) = 0
[pid   962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=1697811} <unfinished ...>
[pid  1605] <... epoll_pwait resumed>[], 128, 12, NULL, 0) = 0
[pid   962] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid  1605] epoll_pwait(4,  <unfinished ...>
[pid   962] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid  1605] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4,  <unfinished ...>
[pid   962] <... nanosleep resumed>NULL) = 0
[pid   962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=3958898} <unfinished ...>
[pid  1605] <... epoll_pwait resumed>[], 128, 14, NULL, 0) = 0
[pid   962] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)

I am not sure whether this is really related, though.

olifre added the bug and needs-triage labels on Feb 13, 2024
calmh (Member) commented Feb 13, 2024

Take a CPU profile?

https://docs.syncthing.net/users/profiling.html

olifre (Contributor, Author) commented Feb 13, 2024

Wow, nice, kudos for this great documentation on how to grab a profile!

I gzipped the profiles so I could upload them here to GitHub:

Hope this helps.

calmh (Member) commented Feb 13, 2024

Looks like it was running a database GC in the second profile. May have been an unlucky coincidence. Take another one?

olifre (Contributor, Author) commented Feb 13, 2024

Sure, no problem — to decrease chances of anything else interfering, I grabbed three in a row:

calmh (Member) commented Feb 14, 2024

The increase seems to be in findRunnable, which could indicate a lot of goroutines (a goroutine leak somewhere). I ran Syncthing for a couple of hours with a 5s rescan on one of my folders and saw no increase in goroutines, at least. (Possibly a small increase in background CPU usage, but within the margin of error, small fractions of a percent.) It could also be something as simple as memory fragmentation increasing GC work.

olifre (Contributor, Author) commented Feb 14, 2024

Interesting, sure, memory fragmentation due to many rescans (after long runtimes) seems quite likely.

The reason I stumbled upon this is that I monitor power consumption for the home server running Syncthing (and other things). The x86_64 machine (Intel Pentium Silver N6005) usually draws 3 W after a reboot, but was drawing >4 W after a week, approaching that value with an almost linear increase over time. That is still not a lot, of course, but a 30 % increase (which also translates into UPS runtime) was reason enough for the nerd in me to start investigating, and the power consumption data basically looked as if there was a "leak" (not in memory, but in CPU usage).
Since restarting Syncthing brought it back down to 3 W right away, I came here 😉.

I then also noticed the increased number of epoll_pwait syscalls Syncthing appears to perform after prolonged runtime (see above), but maybe that is just how goroutines interact.

Do you think it is likely that my configuration triggers this?
I don't have anything obviously special configured apart from staggered versioning, which is probably unrelated.

Is there a way to dump the number of goroutines at runtime? Then I could confirm that this number stays constant for me as well.

If we don't find the cause, or you think it is not fruitful to delve deeper, that is of course also perfectly fine with me; I will then likely work around it with a daily restart, a longer interval between full scans, or both.
Thanks in any case for checking and trying to reproduce on your end! (And of course for Syncthing itself, too 😉)

calmh (Member) commented Feb 15, 2024

You can enable the built-in profiler by running with --debug-profiler-listen 127.0.0.1:9090. That gives you a web page at http://localhost:9090/debug/pprof/ where you can see counts of goroutines and other things, and, if so inclined, also take profiles directly with the Go profiler, e.g. go tool pprof http://localhost:9090/debug/pprof/goroutine, or go tool pprof http://localhost:9090/debug/pprof/profile for a CPU profile.
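
As an editorial aside, not part of the original reply: assuming the profiler is listening on 127.0.0.1:9090 as described above, a small Go program can log the goroutine count over time. With debug=1 the goroutine endpoint returns text whose first line reads "goroutine profile: total N", so a minimal sketch is:

package main

// Minimal sketch (not Syncthing code): poll the pprof endpoint exposed by
// --debug-profiler-listen 127.0.0.1:9090 and print the goroutine count once
// a minute, to check whether it grows over time.

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://127.0.0.1:9090/debug/pprof/goroutine?debug=1")
		if err != nil {
			log.Fatal(err)
		}
		// With debug=1, the first line reads "goroutine profile: total N".
		first, err := bufio.NewReader(resp.Body).ReadString('\n')
		resp.Body.Close()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s %s", time.Now().Format(time.RFC3339), first)
		time.Sleep(time.Minute)
	}
}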

calmh (Member) commented Feb 15, 2024

There's also the new metrics endpoint that exposes a few things, which I collect on my home server. There I can clearly corroborate the idle CPU usage increasing over time:

[Screenshot: idle CPU usage climbing over time]

However, it's not accompanied by an increase in goroutines or memory, so it's not a simple leak of that kind:

[Screenshots: goroutine count and memory usage over time, both flat]

calmh (Member) commented Feb 15, 2024

It's also not seen as an increase in threads, GC time, file system operations, or open file descriptors (I'll spare you graphs not showing an increase).

calmh added a commit to calmh/syncthing that referenced this issue Feb 15, 2024
olifre (Contributor, Author) commented Feb 15, 2024

Many thanks, that's very enlightening! The built-in profiler, which I had missed until now, is really great. ❤️

I will also take a look, but I will probably not spot anything new, as my Go knowledge is very basic at the moment.
Still, it is reassuring that you are also seeing the effect in your monitored metrics, so at least it is not down to a very exotic configuration on my end.

gary094 commented Feb 28, 2024

I can confirm that I also see CPU usage increase after a few days of uptime.

Edit: @calmh: Could you let me know which metric you used for the CPU usage? I can only see process_cpu_seconds_total in the /metrics endpoint, but no current CPU usage.
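
Editorial note, not a reply from the maintainers: process_cpu_seconds_total is a counter, so "current" CPU usage is its rate of change; in Prometheus that is typically rate(process_cpu_seconds_total[5m]). Done by hand, a hedged Go sketch (assuming the metrics endpoint is reachable at http://127.0.0.1:8384/metrics; add an X-API-Key header if your setup requires authentication) could look like this:

package main

// Minimal sketch (not from the thread): derive a CPU usage percentage from
// the process_cpu_seconds_total counter by sampling it twice. The URL below
// is an assumption; adjust it to your GUI/metrics address.

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

func cpuSeconds(url string) float64 {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Skip the "# HELP" / "# TYPE" comment lines; match the sample line.
		if strings.HasPrefix(line, "process_cpu_seconds_total") {
			fields := strings.Fields(line)
			if len(fields) < 2 {
				log.Fatal("unexpected metric line:", line)
			}
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				log.Fatal(err)
			}
			return v
		}
	}
	log.Fatal("process_cpu_seconds_total not found")
	return 0
}

func main() {
	const url = "http://127.0.0.1:8384/metrics"
	const interval = 30 * time.Second
	before := cpuSeconds(url)
	time.Sleep(interval)
	after := cpuSeconds(url)
	fmt.Printf("average CPU usage over %v: %.2f %%\n",
		interval, (after-before)/interval.Seconds()*100)
}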

olifre (Contributor, Author) commented Feb 29, 2024

For completeness, Alex has also reported this in the forums:
https://forum.syncthing.net/t/slowly-increasing-cpu-usage-over-time/21639/4?u=olifre
In his case, he reports that this only started to show up with 1.27.3, and not with 1.27.2.

Sadly, I don't have enough monitoring history to confirm (I never saw a noticeable increase before, but my monitoring was just worse in the past).

colingibbs1 commented

I've run into this problem as well. A walk through the bowels of Go's timer internals reveals that there are hundreds of timers with a two-second period being processed. With the helpful hints about the scan routines, I found this path missing a timer stop:

--- a/lib/scanner/walk.go
+++ b/lib/scanner/walk.go
@@ -161,6 +161,7 @@ func (w *walker) walk(ctx context.Context) chan ScanResult {
 		}
 
 		if len(filesToHash) == 0 {
+			ticker.Stop()
 			close(finishedChan)
 			return
 		}
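
(Editorial aside, not part of the original comment.) The reason this costs CPU even though nothing reads from the ticker: time.NewTicker arms a runtime timer that keeps firing every period until Stop is called, so each scan that takes the early "nothing to hash" return leaves one more two-second timer behind for the scheduler to service. A minimal self-contained sketch of the pattern, with illustrative names:

package main

// Sketch (not Syncthing code) of the leak pattern found above: a time.Ticker
// that is never stopped keeps its underlying runtime timer firing on every
// period, even though no goroutine is reading from its channel.

import (
	"fmt"
	"runtime"
	"time"
)

func leakyScan(filesToHash int) {
	ticker := time.NewTicker(2 * time.Second) // same period as in walk.go
	// ... a progress reporter would normally select on ticker.C here ...
	if filesToHash == 0 {
		// BUG: returning without ticker.Stop() leaves the timer armed forever.
		return
	}
	ticker.Stop()
}

func main() {
	for i := 0; i < 500; i++ {
		leakyScan(0) // each "scan" with nothing to hash leaks one ticker
	}
	// The goroutine count stays flat (matching the observation above that
	// there is no goroutine leak), but the runtime now services ~500
	// two-second timers indefinitely, which shows up as idle CPU and as
	// nanosleep/futex/epoll_pwait churn under strace.
	fmt.Println("goroutines:", runtime.NumGoroutine())
	time.Sleep(time.Minute)
}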

olifre (Contributor, Author) commented Mar 5, 2024

@colingibbs1 That's a great find!
It also matches Alex's observation in the forums, i.e. that the effect is new in 1.27.3, as that "return path" without a stop of the timer was newly added in 1.27.3 here:
2f3eacd

imsodin (Member) commented Mar 5, 2024

Argh, let me do some coding for Syncthing for once and I add a resource leak. I'll fix it asap (while I should review stuff instead of course, but I wanna fix :) ). Thanks a lot for finding this!

imsodin added a commit to imsodin/syncthing that referenced this issue on Mar 5, 2024:
Move the ticker closer to where it's used and defer stop it to avoid missing a branch.

calmh pushed a commit that referenced this issue on Mar 5, 2024:
Move the ticker closer to where it's used and defer stop it to avoid missing a branch.

Fixes regression introduced in 2f3eacd

Fixes #9417
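
(Editorial sketch, not the actual patch; see #9451 and the commits above for the real change.) The pattern the commit message describes, creating the ticker right where it is used and deferring the stop so no return path can miss it, looks roughly like this, with hypothetical names:

package scanner // illustrative package name, not the actual Syncthing layout

import (
	"context"
	"time"
)

// reportProgress is a hypothetical helper showing the "defer stop" pattern:
// the ticker is created where it is used and stopped via defer, so no return
// path can leak the timer.
func reportProgress(ctx context.Context, done <-chan struct{}) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop() // runs on every return path
	for {
		select {
		case <-ctx.Done():
			return
		case <-done:
			return
		case <-ticker.C:
			// emit a progress update here
		}
	}
}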
calmh added this to the v1.27.5 milestone on Mar 11, 2024