
CPU usage increases after folder scans (and hence, over time) #9417

Closed
olifre opened this issue Feb 13, 2024 · 15 comments · Fixed by #9451
Labels: bug, needs-triage
Milestone: v1.27.5

Comments

olifre (Contributor) commented Feb 13, 2024

  • Syncthing: v1.27.3, Linux (64-bit Intel/AMD)
  • OS: Debian 12, Kernel 6.1.0-18-amd64
  • Size: 23,075 files, 1,738 dirs, ~45.6 GiB total size, 6 folders
  • Staggered versioning enabled (not sure if relevant)
  • inotify watches enabled

Observation

CPU usage is generally low (< 1 % on a small Intel SoC), but keeps growing slowly over time, reaching 2 % and above after more than two days, i.e. more than doubling while idle. After prolonged runtimes of some weeks, I have also observed > 4 %.

Reproduction steps

Either wait ;-), or run many folder rescans. After slowly hitting "Rescan all folders" about 80 times and then waiting until everything had settled down, CPU usage had increased as well.
So the regular hourly scan is likely what causes it.

Further information

Initially reported (before I found a reproducer) in the forum:
https://forum.syncthing.net/t/slowly-increasing-cpu-usage-over-time/21639/2

Note that I also observe a higher rate of syscalls in this state, such as:

[pid  1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4,  <unfinished ...>
[pid   962] <... nanosleep resumed>NULL) = 0
[pid   962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=2996919}) = -1 ETIMEDOUT (Connection timed out)
[pid  1605] <... epoll_pwait resumed>[], 128, 13, NULL, 0) = 0
[pid   962] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid  1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4,  <unfinished ...>
[pid   962] <... nanosleep resumed>NULL) = 0
[pid   962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=1697811} <unfinished ...>
[pid  1605] <... epoll_pwait resumed>[], 128, 12, NULL, 0) = 0
[pid   962] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid  1605] epoll_pwait(4,  <unfinished ...>
[pid   962] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid  1605] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid  1605] epoll_pwait(4,  <unfinished ...>
[pid   962] <... nanosleep resumed>NULL) = 0
[pid   962] futex(0x1a2afa0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=3958898} <unfinished ...>
[pid  1605] <... epoll_pwait resumed>[], 128, 14, NULL, 0) = 0
[pid   962] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)

I am not sure whether this is really related, though.

olifre added the bug and needs-triage labels on Feb 13, 2024
calmh (Member) commented Feb 13, 2024

Take a CPU profile?

https://docs.syncthing.net/users/profiling.html

olifre (Contributor, Author) commented Feb 13, 2024

Wow, nice, kudos for this great documentation on how to grab a profile!

I gzipped the profiles so I could upload them here to GitHub:

Hope this helps.

calmh (Member) commented Feb 13, 2024

Looks like it was running a database GC in the second profile. May have been an unlucky coincidence. Take another one?

olifre (Contributor, Author) commented Feb 13, 2024

Sure, no problem — to decrease chances of anything else interfering, I grabbed three in a row:

calmh (Member) commented Feb 14, 2024

The increase seems to be in findRunnable, which could indicate a lot of goroutines (a goroutine leak somewhere). I ran Syncthing for a couple of hours with a 5s rescan on one of my folders and saw no increase in goroutines, at least. (Possibly a small increase in background CPU usage, but within the margin of error, small fractions of a percent.) It could also be something as simple as memory fragmentation increasing GC work.

olifre (Contributor, Author) commented Feb 14, 2024

Interesting, sure, memory fragmentation due to many rescans (after long runtimes) seems quite likely.

The reason I stumbled upon this is that I monitor power consumption for the home server running Syncthing (and other things). The x86_64 machine (Intel Pentium Silver N6005) usually draws 3 W after a reboot, but was drawing >4 W after a week, approaching that value with an almost linear increase over time. That is still not a lot, of course, but a 30 % increase (which also translates into UPS runtime) was reason enough for the nerd in me to start investigating, and the power consumption data basically looked as if there was a "leak" (not in memory, but in CPU usage).
Since restarting Syncthing brought it back down to 3 W right away, I came here 😉.

I then also noticed the increased number of epoll_pwait syscalls Syncthing appears to perform after prolonged runtime (see above), but maybe that is just how goroutines interact.

Do you think it is likely that my configuration triggers this?
I don't have anything obviously special configured apart from staggered versioning, which is probably unrelated.

Is there a way to dump the number of goroutines at runtime? Then I could confirm that this number stays constant for me as well.

If we don't find the cause, or you think it is not fruitful to delve deeper, that is of course also perfectly fine with me; I will then likely work around it with a daily restart, a longer interval between full scans, or both.
Thanks in any case for checking and trying to reproduce on your end! (And of course for Syncthing itself, too 😉)

calmh (Member) commented Feb 15, 2024

You can enable the built-in profiler by running with --debug-profiler-listen 127.0.0.1:9090. That gives you a web page at http://localhost:9090/debug/pprof/ where you can see counts of goroutines and other things, and, if so inclined, also take profiles directly with the Go profiler, e.g. go tool pprof http://localhost:9090/debug/pprof/goroutine, or go tool pprof http://localhost:9090/debug/pprof/profile for a CPU profile.
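
As an editorial aside, not part of the original reply: assuming the profiler is listening on 127.0.0.1:9090 as described above, a small Go program can log the goroutine count over time. With debug=1 the goroutine endpoint returns text whose first line reads "goroutine profile: total N", so a minimal sketch is:

package main

// Minimal sketch (not Syncthing code): poll the pprof endpoint exposed by
// --debug-profiler-listen 127.0.0.1:9090 and print the goroutine count once
// a minute, to check whether it grows over time.

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://127.0.0.1:9090/debug/pprof/goroutine?debug=1")
		if err != nil {
			log.Fatal(err)
		}
		// With debug=1, the first line reads "goroutine profile: total N".
		first, err := bufio.NewReader(resp.Body).ReadString('\n')
		resp.Body.Close()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s %s", time.Now().Format(time.RFC3339), first)
		time.Sleep(time.Minute)
	}
}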

calmh (Member) commented Feb 15, 2024

There's also the new metrics endpoint that exposes a few things, which I collect on my home server. There I can clearly corroborate the idle CPU usage increasing over time:

[Screenshot: idle CPU usage climbing over time]

However, it's not accompanied by an increase in goroutines or memory, so it's not a simple leak of that kind:

[Screenshots: goroutine count and memory usage over time, both flat]

calmh (Member) commented Feb 15, 2024

It's also not seen as an increase in threads, GC time, file system operations, or open file descriptors (I'll spare you graphs not showing an increase).

calmh added a commit to calmh/syncthing that referenced this issue Feb 15, 2024
olifre (Contributor, Author) commented Feb 15, 2024

Many thanks, that's very enlightening! The built-in profiler, which I had missed until now, is really great. ❤️

I will also take a look, but I will probably not spot anything new, as my Go knowledge is very basic at the moment.
Still, it is reassuring that you are also seeing the effect in your monitored metrics, so at least it is not down to a very exotic configuration on my end.

gary094 commented Feb 28, 2024

I can confirm that I also see CPU usage increase after a few days of uptime.

Edit: @calmh: Could you let me know which metric you used for the CPU usage? I can only see process_cpu_seconds_total in the /metrics endpoint, but no current CPU usage.
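
Editorial note, not a reply from the maintainers: process_cpu_seconds_total is a counter, so "current" CPU usage is its rate of change; in Prometheus that is typically rate(process_cpu_seconds_total[5m]). Done by hand, a hedged Go sketch (assuming the metrics endpoint is reachable at http://127.0.0.1:8384/metrics; add an X-API-Key header if your setup requires authentication) could look like this:

package main

// Minimal sketch (not from the thread): derive a CPU usage percentage from
// the process_cpu_seconds_total counter by sampling it twice. The URL below
// is an assumption; adjust it to your GUI/metrics address.

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

func cpuSeconds(url string) float64 {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Skip the "# HELP" / "# TYPE" comment lines; match the sample line.
		if strings.HasPrefix(line, "process_cpu_seconds_total") {
			fields := strings.Fields(line)
			if len(fields) < 2 {
				log.Fatal("unexpected metric line:", line)
			}
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				log.Fatal(err)
			}
			return v
		}
	}
	log.Fatal("process_cpu_seconds_total not found")
	return 0
}

func main() {
	const url = "http://127.0.0.1:8384/metrics"
	const interval = 30 * time.Second
	before := cpuSeconds(url)
	time.Sleep(interval)
	after := cpuSeconds(url)
	fmt.Printf("average CPU usage over %v: %.2f %%\n",
		interval, (after-before)/interval.Seconds()*100)
}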

olifre (Contributor, Author) commented Feb 29, 2024

For completeness, Alex has also reported this in the forums:
https://forum.syncthing.net/t/slowly-increasing-cpu-usage-over-time/21639/4?u=olifre
In his case, he reports that this only started to show up with 1.27.3, and not with 1.27.2.

Sadly, I don't have enough monitoring history to confirm (I never saw a noticeable increase before, but my monitoring was just worse in the past).

colingibbs1 commented

I've run into this problem as well. A walk through the bowels of Go's timer internals reveals that there are hundreds of timers with a two-second period being processed. With the helpful hints about the scan routines, I found this path missing a timer stop:

--- a/lib/scanner/walk.go
+++ b/lib/scanner/walk.go
@@ -161,6 +161,7 @@ func (w *walker) walk(ctx context.Context) chan ScanResult {
 		}
 
 		if len(filesToHash) == 0 {
+			ticker.Stop()
 			close(finishedChan)
 			return
 		}
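
(Editorial aside, not part of the original comment.) The reason this costs CPU even though nothing reads from the ticker: time.NewTicker arms a runtime timer that keeps firing every period until Stop is called, so each scan that takes the early "nothing to hash" return leaves one more two-second timer behind for the scheduler to service. A minimal self-contained sketch of the pattern, with illustrative names:

package main

// Sketch (not Syncthing code) of the leak pattern found above: a time.Ticker
// that is never stopped keeps its underlying runtime timer firing on every
// period, even though no goroutine is reading from its channel.

import (
	"fmt"
	"runtime"
	"time"
)

func leakyScan(filesToHash int) {
	ticker := time.NewTicker(2 * time.Second) // same period as in walk.go
	// ... a progress reporter would normally select on ticker.C here ...
	if filesToHash == 0 {
		// BUG: returning without ticker.Stop() leaves the timer armed forever.
		return
	}
	ticker.Stop()
}

func main() {
	for i := 0; i < 500; i++ {
		leakyScan(0) // each "scan" with nothing to hash leaks one ticker
	}
	// The goroutine count stays flat (matching the observation above that
	// there is no goroutine leak), but the runtime now services ~500
	// two-second timers indefinitely, which shows up as idle CPU and as
	// nanosleep/futex/epoll_pwait churn under strace.
	fmt.Println("goroutines:", runtime.NumGoroutine())
	time.Sleep(time.Minute)
}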

olifre (Contributor, Author) commented Mar 5, 2024

@colingibbs1 That's a great find!
It also matches Alex's observation in the forums, i.e. that the effect is new in 1.27.3, as that "return path" without a stop of the timer was newly added in 1.27.3 here:
2f3eacd

imsodin (Member) commented Mar 5, 2024

Argh, let me do some coding for Syncthing for once and I add a resource leak. I'll fix it asap (while I should review stuff instead of course, but I wanna fix :) ). Thanks a lot for finding this!

imsodin added a commit to imsodin/syncthing that referenced this issue on Mar 5, 2024:
Move the ticker closer to where it's used and defer stop it to avoid missing a branch.

calmh pushed a commit that referenced this issue on Mar 5, 2024:
Move the ticker closer to where it's used and defer stop it to avoid missing a branch.

Fixes regression introduced in 2f3eacd

Fixes #9417
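
(Editorial sketch, not the actual patch; see #9451 and the commits above for the real change.) The pattern the commit message describes, creating the ticker right where it is used and deferring the stop so no return path can miss it, looks roughly like this, with hypothetical names:

package scanner // illustrative package name, not the actual Syncthing layout

import (
	"context"
	"time"
)

// reportProgress is a hypothetical helper showing the "defer stop" pattern:
// the ticker is created where it is used and stopped via defer, so no return
// path can leak the timer.
func reportProgress(ctx context.Context, done <-chan struct{}) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop() // runs on every return path
	for {
		select {
		case <-ctx.Done():
			return
		case <-done:
			return
		case <-ticker.C:
			// emit a progress update here
		}
	}
}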
calmh added this to the v1.27.5 milestone on Mar 11, 2024