After Kitex switches service discovery, requests hit closed connections or I/O timeouts #1246

Open
b675987273 opened this issue Feb 5, 2024 · 1 comment


b675987273 commented Feb 5, 2024

Describe the bug

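We run a Kitex RPC architecture with two service discovery backends, psm-etcd and psm-dns: when etcd service discovery fails, we fall back to DNS. Each backend works correctly on its own. In our scenario, etcd failed and the DNS resolver took over as the fallback, and everything was fine during that period. A long time later, while etcd service discovery was recovering, requests started to fail with errors such as "reset by peer" and use of a closed connection. The errors eventually went away on their own.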
err = remote or network error: get connection error: dial tcp 10.4.82.241:8888: connection has been closed by peer
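I then looked at the code. Could the cause be that this part is too aggressive and directly deletes the old connections for the IPs that were removed? etcd discovery returns instance IPs, while DNS discovery returns a VIP. Hertz does not seem to have similar logic.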

if long, ok := pool.(remote.LongConnPool); ok {
	kc.opt.Bus.Watch(discovery.ChangeEventName, func(ev *event.Event) {
		ch, ok := ev.Extra.(*discovery.Change)
		if !ok {
			return
		}
		for _, inst := range ch.Removed {
			if addr := inst.Address(); addr != nil {
				long.Clean(addr.Network(), addr.String())
			}
		}
	})
}

Wouldn't relying on the long connection pool's idle time and expire time be enough to gradually clean up these connections?

Contributor

felix021 commented Feb 5, 2024

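Kitex designed this Change event so that instances that have gone offline are cleaned up promptly, which avoids continuing to send requests to them (and avoids unbalanced load). Judging from the code, the implementation is fine: the connections in the pool are all idle, so closing them should not affect in-flight requests.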

The "reset by peer" errors in the issue description may not be caused by this.
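The reasoning about idle connections can be illustrated with a small sketch. It assumes a simplified pool in which a checked-out connection is no longer held by the pool; the types `conn` and `pool` and their methods are hypothetical and only mirror the idea, not Kitex's actual code. Under that assumption, cleaning an address closes only the connections parked in the pool.

```go
package main

import "fmt"

// conn is a hypothetical stand-in for a network connection.
type conn struct {
	addr   string
	closed bool
}

// pool is a toy pool (not Kitex's) that holds only idle connections, keyed by address.
type pool struct {
	idle map[string][]*conn
}

func newPool() *pool { return &pool{idle: make(map[string][]*conn)} }

// Get checks a connection out: it reuses an idle one if available,
// otherwise it creates a new one. A checked-out connection is not in the pool.
func (p *pool) Get(addr string) *conn {
	list := p.idle[addr]
	if n := len(list); n > 0 {
		c := list[n-1]
		p.idle[addr] = list[:n-1]
		return c
	}
	return &conn{addr: addr}
}

// Put returns a connection to the pool once the request has finished.
func (p *pool) Put(c *conn) { p.idle[c.addr] = append(p.idle[c.addr], c) }

// Clean closes all idle connections for addr and returns how many were closed.
func (p *pool) Clean(addr string) int {
	list := p.idle[addr]
	for _, c := range list {
		c.closed = true
	}
	delete(p.idle, addr)
	return len(list)
}

func main() {
	const addr = "10.4.82.241:8888"
	p := newPool()

	inFlight := p.Get(addr) // still serving a request, not in the pool
	parked := p.Get(addr)
	p.Put(parked) // request finished, connection is idle again

	n := p.Clean(addr) // instance removed by service discovery
	fmt.Println(n, parked.closed, inFlight.closed)
}
```

In this model the in-flight connection is untouched by `Clean`, which matches the claim that closing pooled connections should not affect requests that are already running.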
