[WIP] Add monitoring for the duration of one epoll loop iteration #143

Open
hyj1991 opened this issue Feb 22, 2022 · 11 comments

@hyj1991
Member

hyj1991 commented Feb 22, 2022

No description provided.

@legendecas
Member

Do you mean uv_metrics_idle_time?

@hyj1991
Member Author

hyj1991 commented Feb 22, 2022

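> uv_metrics_idle_time

What I have in mind is actually the sum of the time spent in each phase of the loop. There happens to be a real case where this kind of monitoring is what would make alerting practical.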

@hyj1991
Member Author

hyj1991 commented Feb 22, 2022

image

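I haven't worked out how to get values for the phases shown here. If we were changing the runtime itself, we could simply add instrumentation points.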

@hyj1991
Member Author

hyj1991 commented Feb 22, 2022

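After reading the description of uv_metrics_idle_time, it looks like this would work: on each tick, subtract the previous tick's value to get the idle time for that tick, then subtract that from the total duration shown in the figure above. The result is the real time the event loop spent on that tick.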
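
The subtraction described here can be sketched as a small pure function. This is only an illustration under assumptions: the `TickSample` shape and the `busyTimePerTick` name are hypothetical, and the cumulative idle value is assumed to have already been read (for example from uv_metrics_idle_time, which reports nanoseconds) and converted to milliseconds.

```typescript
// Hypothetical sample shape: one record per loop iteration (tick).
interface TickSample {
  // Wall-clock duration of the whole tick, in milliseconds.
  durationMs: number;
  // Cumulative idle time observed at the end of the tick, in milliseconds.
  // Assumed to be a monotonically increasing counter.
  cumulativeIdleMs: number;
}

// For each tick: idle delta = current cumulative idle - previous cumulative idle,
// busy time = tick duration - idle delta.
function busyTimePerTick(samples: TickSample[], initialIdleMs = 0): number[] {
  let previousIdleMs = initialIdleMs;
  return samples.map((sample) => {
    const idleDeltaMs = sample.cumulativeIdleMs - previousIdleMs;
    previousIdleMs = sample.cumulativeIdleMs;
    // Clamp at zero in case the two measurements are slightly out of sync.
    return Math.max(0, sample.durationMs - idleDeltaMs);
  });
}

const busy = busyTimePerTick([
  { durationMs: 10, cumulativeIdleMs: 8 },   // mostly waiting for I/O
  { durationMs: 120, cumulativeIdleMs: 10 }, // mostly running callbacks
]);
console.log(busy);
```

The second tick stands out even though its idle delta is small, which is the kind of per-tick signal the comment is after.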

@legendecas
Member

Is the goal to monitor loop utilization? For example, something like https://nodesource.com/blog/event-loop-utilization-nodejs

@hyj1991
Member Author

hyj1991 commented Feb 22, 2022

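> https://nodesource.com/blog/event-loop-utilization-nodejs

I think that at the granularity of the collection interval (say 1 min as the minimum interval), the ELU and CPU figures it describes will not fluctuate much. This feature is more about getting the actual CPU time cost of each individual loop iteration, so that it can provide useful alerting information when a single iteration blocks even though the average load over the 1 min window is not high.

We have in fact hit this in production: RPC timeouts caused by just a few severe loop blocks within 1 min. Capturing a profile in that situation does not show much detail, because the time span is stretched out.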
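
A small worked example may make the averaging problem concrete. The numbers are invented for illustration (three blocks of 800, 600 and 700 ms inside a 60 s window), and `summarizeWindow` is a hypothetical helper, not something from the project.

```typescript
// Summarize a collection window given the durations of the blocking
// loop iterations inside it (all other time is assumed idle).
function summarizeWindow(blockedMs: number[], windowMs: number) {
  const totalBusyMs = blockedMs.reduce((sum, ms) => sum + ms, 0);
  return {
    // Fraction of the window the loop was busy: the kind of number
    // a 1 min average reports.
    averageUtilization: totalBusyMs / windowMs,
    // Longest single block: what an RPC caller would actually feel.
    worstBlockMs: Math.max(0, ...blockedMs),
  };
}

const summary = summarizeWindow([800, 600, 700], 60000);
console.log(summary);
```

Here the average utilization is 2100 / 60000 = 3.5%, which would not trigger a load-based alert, while the worst single block is 800 ms, long enough to exceed a typical sub-second RPC timeout.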

@hyj1991
Member Author

hyj1991 commented Feb 22, 2022

> the ELU and CPU figures it describes will not fluctuate much

That statement was wrong; CPU also includes GC and similar work.

@legendecas
Member

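> We have in fact hit this in production: RPC timeouts caused by just a few severe loop blocks within 1 min. Capturing a profile in that situation does not show much detail, because the time span is stretched out.

It seems that detecting event loop delay would cover this scenario. For example, set up a repeating timer and watch how late it fires; if the delay is high, record and report it. E.g. https://github.com/tj/node-blocked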
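
The repeating-timer idea can be sketched roughly as follows. This is a minimal illustration in the spirit of the linked package, not its actual implementation or API; `timerLagMs`, `watchLoopDelay`, and the default interval and threshold are all hypothetical choices.

```typescript
// How late a timer fired relative to when it was expected, never negative.
function timerLagMs(expectedAtMs: number, firedAtMs: number): number {
  return Math.max(0, firedAtMs - expectedAtMs);
}

// Start a repeating timer and report whenever it fires later than
// thresholdMs after its expected time. Returns a function that stops it.
function watchLoopDelay(
  onDelay: (delayMs: number) => void,
  intervalMs = 100,
  thresholdMs = 50,
): () => void {
  // Date.now() keeps the sketch portable; a monotonic clock would be
  // more robust against system clock adjustments.
  let expectedAtMs = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const nowMs = Date.now();
    const delayMs = timerLagMs(expectedAtMs, nowMs);
    expectedAtMs = nowMs + intervalMs;
    if (delayMs >= thresholdMs) {
      onDelay(delayMs);
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A caller would do something like `const stop = watchLoopDelay((ms) => console.warn("loop blocked for about", ms, "ms"))` and call `stop()` on shutdown. The detection granularity is bounded by `intervalMs`, so a block shorter than the threshold goes unreported.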

@hyj1991
Member Author

hyj1991 commented Feb 23, 2022

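> It seems that detecting event loop delay would cover this scenario. For example, set up a repeating timer and watch how late it fires; if the delay is high, record and report it. E.g. https://github.com/tj/node-blocked

Makes sense, I'll look into it.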

@fengmk2
Member

fengmk2 commented Feb 23, 2022

ELU could also be monitored as an observability metric.

@hyj1991
Member Author

hyj1991 commented Feb 23, 2022

> ELU could also be monitored as an observability metric.

Yes, the plan for this issue is to add monitoring for both loop delay and ELU.
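
For reference, both metrics can be collected from JavaScript with Node.js's built-in perf_hooks module. The sketch below is one possible shape and not the approach this project has committed to (the thread does not say whether collection would happen in JavaScript or natively); `createLoopSampler` and the snapshot field names are hypothetical.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

const NS_PER_MS = 1e6; // the delay histogram reports nanoseconds

function createLoopSampler(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let lastElu = performance.eventLoopUtilization();

  return {
    // Returns metrics covering the period since the previous sample() call.
    sample() {
      const currentElu = performance.eventLoopUtilization();
      // With two arguments, the result is the delta between the readings.
      const delta = performance.eventLoopUtilization(currentElu, lastElu);
      lastElu = currentElu;
      const snapshot = {
        utilization: delta.utilization,
        maxDelayMs: histogram.max / NS_PER_MS,
        p99DelayMs: histogram.percentile(99) / NS_PER_MS,
      };
      histogram.reset(); // start a fresh window for the next sample
      return snapshot;
    },
    stop() {
      histogram.disable();
    },
  };
}
```

Calling `sample()` once per collection interval gives the averaged ELU alongside the maximum and p99 delay for that interval, so a few severe blocks remain visible even when the average is low.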

@hyj1991 hyj1991 self-assigned this Aug 24, 2022
3 participants