Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Config Server Heartbeat Management #1016

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

liangry
Copy link
Collaborator

@liangry liangry commented Jul 14, 2023

  1. mark agent's status as online, if (for example) two consecutive heartbeats received
  2. mark agent's status as offline, if config server do not receive any heartbeat in (for example) three consecutive cycles
  3. remove agent's record if agent has been offlined for more than (for example) 12 hours

ref: #917 (comment)

1. mark agent's status as online, if (for example) two consecutive heartbeats received
2. mark agent's status as offline, if config server do not receive any heartbeat in (for example) three consecutive cycles
3. remove agent's record if agent has been offlined for more than (for example) 12 hours

current := time.Now()
deadline := current.Add(-time.Duration(setting.GetSetting().ClearHoursAfterOffline) * time.Hour).Unix()
for _, v := range agentList {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

每次心跳都遍历检查,开销是不是有点高了?heartbeat会被高频调用,特别是agent多的时候。我感觉过期检查可以放在每次用户查询的时候,在返回列表前检查一次,有过期的删掉/变更状态,这样遍历频率不高也不影响展示。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果只是摘除离线已经很久的 Agent,是可以移动到 ListAgents 接口处理的。但是,这里还要处理心跳周期内没有接收到心跳请求的情况,需要记录失败次数和决定是否标记为离线状态,这是不能在用户查询时处理的。而且,这个失败处理在 batchUpdateAgentMessage 内,无论有多少个 Agent,都在一个批处理周期内处理一次,对性能影响应该不大。

ONLINE = 1;
OFFLINE = 2;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里或许不需要单独拉一个类型,在代码中以String格式写死就可以了

@@ -78,8 +85,8 @@ message HeartBeatRequest {
string agent_type = 3; // Required, Agent's type(ilogtail, ..)
AgentAttributes attributes = 4; // Agent's basic attributes
repeated string tags = 5; // Agent's tags
string running_status = 6; // Required, Agent's running status
int64 startup_time = 7; // Required, Agent's startup time
RunningStatus running_status = 6; // Agent's running status
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同上,这里的类型改变不是必要的,最好还是不要修改了

@@ -15,6 +15,13 @@ enum CheckStatus {
MODIFIED = 2;
}

// Define Agent's running status
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个PR整体功能没有问题了,但是一些实现可能需要在讨论一下,细节的部分在代码那里说明。
主要围绕两个点:
1、是否需要修改PB?需要,但是有一些不太必要
2、过期机制的实现。目前靠记录连续几次心跳是否正常来判断是否在线/过期,有几个点想确认一下:
2.1、OnlineAfterSuccess的初始值为2会不会太多了?启动有些慢。
2.2、Agent的信息里,beat_cycle_time和interval参数好像有些重复,interval目前还没有用到
2.3、心跳的本意是检查当前状态/分析故障。如果是分析故障的话,那应该记录一个agent的持久化的心跳信息(目前是覆盖写),可以查询是哪一段时间心跳丢失。如果只记录一个success_beat_count和fail_beat_count,是否有必要呢?目前的实现是不是可以简化一下

int64 latest_beat_time = 8; // Agent's latest heartbeat time
int64 beat_cycle_time = 9; // Agent's heartbeat cycle time
int32 success_beat_count = 10; // Agent's success heartbeat count
int32 fail_beat_count = 11; // Agent's fail heartbeat count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beat_cycle_time和interval语义应该是一样的,可以考虑合并?
这里的success_beat_count和fail_beat_count是否需要?
如果去掉上面提到的三个值,那么只剩一个latest_beat_time是否需要记录在Agent中,改变Agent的数据结构?还是说直接放在heartbeat的response里,供调用者了解就可以了?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants