Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keepalived_healthcheckers[62205]: A thread timer expired 5.167032 seconds ago #2271

Open
qd19zzx opened this issue Apr 2, 2023 · 4 comments

Comments

@qd19zzx
Copy link

qd19zzx commented Apr 2, 2023

Describe the issue
We're using keepalived to perform health checks on the realserver. We've noticed that the interval between TCP health checks is inconsistent with the deloy_loop setting. Although deloy_loop is set to 3 seconds, the actual interval is 7 seconds. Additionally, we've seen a large number of 'Keepalived_healthcheckers[62205]: A thread timer expired 5.167316 seconds ago' in our logs, I think the interval inconsistent maybe related to this. And the comment in the code suggests that it may be related to high load, but we have sufficient CPU and memory. What other factors could cause A thread timer expired and lead to inaccurate interval times?

top - 15:23:10 up 2 days, 1:44, 3 users, load average: 5.00, 4.92, 4.56
Tasks: 809 total, 1 running, 808 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
%Cpu8 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu16 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu17 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu19 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu20 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu21 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu22 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu23 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu24 : 1.3 us, 2.0 sy, 0.0 ni, 96.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu25 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 : 99.7 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
%Cpu27 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu28 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu29 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu30 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu31 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

[root@localhost l4lb-agent]# free -h
total used free shared buff/cache available
Mem: 472Gi 237Gi 230Gi 244Mi 4.3Gi 231Gi
Swap: 4.0Gi 0.0Ki 4.0Gi

Keepalived version
Output of keepalived -v
Keepalived v2.2.7 (03/21,2023), git commit +

Copyright(C) 2001-2023 Alexandre Cassen, acassen@gmail.com

Built with kernel headers for Linux 4.18.0
Running on Linux 4.18.0-240.el8.x86_64 #1 SMP Mon Jan 4 11:55:39 UTC 2021
Distro: CentOS Linux 8

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exusr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-vrrp PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXXsr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro

Config options: LVS OLD_CHKSUM_COMPAT INIT=systemd SYSTEMD_NOTIFY

System options: VSYSLOG MEMFD_CREATE LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CFRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_C

Distro (please complete the following information):

  • Name Centos
  • Version CentOS Linux release 8.3.2011
  • Architecture x86_64

Configuration file:
A full copy of the configuration file, obfuscated if necessary to protect passwords and IP addresses

global_defs {
lvs_notify_fifo /tmp/lvs_rs_update
checker_process_name lvs_checker
}

virtual_server 177.17.7.3 4040 {
delay_loop 6
persistence_timeout 50
protocol tcp
real_server 177.17.6.5 4045 {
weight 100
inhibit_on_failure
TCP_CHECK {
connect_timeout 10
retry 3
delay_before_retry 3
delay_loop 3
connect_port 4041
log_all_failures on
}
}
real_server 177.17.6.4 4044 {
weight 100
inhibit_on_failure
TCP_CHECK {
connect_timeout 10
retry 3
delay_before_retry 3
delay_loop 3
connect_port 4041
log_all_failures on
}
}
real_server 177.17.6.6 4046 {
weight 100
inhibit_on_failure
TCP_CHECK {
connect_timeout 10
retry 3
delay_before_retry 3
delay_loop 3
connect_port 4041
log_all_failures on
}
}
}

System Log entries
Full keepalived system log entries from when keepalived started

Apr 02 15:21:11 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.162731 seconds ago
Apr 02 15:21:18 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.165175 seconds ago
Apr 02 15:21:25 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.166563 seconds ago
Apr 02 15:21:32 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.163723 seconds ago
Apr 02 15:21:39 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.165628 seconds ago
Apr 02 15:21:47 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.167099 seconds ago
Apr 02 15:21:54 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.167093 seconds ago
Apr 02 15:22:01 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.165756 seconds ago
Apr 02 15:22:08 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.167222 seconds ago
Apr 02 15:22:15 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.166807 seconds ago

@pqarmitage
Copy link
Collaborator

I notice that CPUs 25, 26 and 27 are running at 100% and also the hardware interrupt rate is not insignificant. Could the processes running on CPUs 25, 26 and 27 be using some other (i.e. non memory/CPU) system resources that stop keepalived being scheduled?

Is /tmp local, and are the keepalived logs being written to local filestore? We have experienced similar delays in the past when keepalived is reading/writing files that are on network attached devices.

The thread timer expired times are all 5.16... seconds, which is extremely strange.

Can you capture the network traffic for the TCP_CHECKs? It might give some indication as to why they are running a 7 second intervals, although the thread timer expired messages is likely to be the reason why the checkers are running at 7 or 8 second intervals.

With the information provided above I cannot see any other reason for why the kernel is not scheduling keepalived to run as needed.

@qd19zzx
Copy link
Author

qd19zzx commented Apr 2, 2023

Thank you for your reply. In fact, we have tried setting CPU affinity and binding Keepalived to an idle CPU core. However, we still encountered thread timer expired logs. We also do not have network storage. The attachment is the TCP check packet that I captured.
tcp_checks.zip

@qd19zzx
Copy link
Author

qd19zzx commented Apr 3, 2023

hi @pqarmitage ,I enable the epoll_debug flag. And I noticed that all "A thread timer" are generated when calling epoll_wait on fd 1.
Here is the log, the system time is within 1 second, but a timeout of 5.16 seconds is reported. It seems that the timer is inaccurate, and the TCP check interval is also likely to be inaccurate due to the inaccuracy of the timer. What could be the reason for the timer inaccuracy?

Apr 3 09:45:41 localhost Keepalived_healthcheckers[73631]: Setting timer_fd 0.000000001
Apr 3 09:45:41 localhost Keepalived_healthcheckers[73631]: calling epoll_wait
Apr 3 09:45:41 localhost Keepalived_healthcheckers[73631]: epoll_wait returned 1 fds
Apr 3 09:45:41 localhost Keepalived_healthcheckers[73631]: A thread timer expired 5.166934 seconds ago

@pqarmitage
Copy link
Collaborator

@qd19zzx Are you still experiencing this problem?

The only way I can think of to debug this is to run with --debug=DcEc which will produce a large amount of log entries. To enable these options you need to specify configure options --enable-epoll-thread-dump --enable-epoll-debug. I would also suggest --enable-log-file so that log entries can be written to a file rather than via syslog.

Run keepalived with the following options: --debug=DcEc -g/tmp/epoll-debug.log. You will get various log files, but the interesting one should be /tmp/epoll_debug_check.log. It will show all the thread details and we should be able to see what thread timer is expiring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants