memory leak by use HijackRequests() #748

Closed
allanpk716 opened this issue Nov 9, 2022 · 18 comments
Labels
question Questions related to rod

Comments

allanpk716 commented Nov 9, 2022

Rod Version: v0.111.0

The code to demonstrate your question

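// PageNavigateWithProxy loads desURL in page, sending every request of that page through proxyUrl.
func PageNavigateWithProxy(page *rod.Page, proxyUrl string, desURL string, timeOut time.Duration) (*rod.Page, *proto.NetworkResponseReceived, error) {

	// Intercept all requests made by this page; the router is stopped when the function returns.
	router := page.HijackRequests()
	defer router.Stop()

	router.MustAdd("*", func(ctx *rod.Hijack) {
		// Note: the parse error is ignored, and a new client and transport are built for every request.
		px, _ := url.Parse(proxyUrl)
		err := ctx.LoadResponse(&http.Client{
			Transport: &http.Transport{
				Proxy:           http.ProxyURL(px),
				TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
			},
		}, true) // loadBody is true, so the whole response body is read into memory
		if err != nil {
			return
		}
	})
	go router.Run()

	err := page.SetUserAgent(&proto.NetworkSetUserAgentOverride{
		UserAgent: RandomUserAgent(true),
	})
	if err != nil {
		if page != nil {
			page.Close()
		}
		return nil, nil, err
	}
	// Capture the first NetworkResponseReceived event seen during navigation.
	var e proto.NetworkResponseReceived
	wait := page.WaitEvent(&e)
	// rod.Try turns a panic from the Must* calls into an error.
	err = rod.Try(func() {
		page.Timeout(timeOut).MustNavigate(desURL).MustWaitLoad()
		wait()
	})
	if err != nil {
		return page, &e, err
	}
	if page == nil {
		return nil, nil, errors.New("page is nil")
	}

	return page, &e, nil
}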

What you got

I wrapped some of rod's APIs in my own code. The goal of this function is to have each page load its target through its own, independent proxy.

my code

Every page used here is closed with page.Close() once it is no longer needed.

  1. I'm not sure whether the problem lies in rod's HijackRequests or in how I use it.
  2. My code is given below, along with two pprof files.
  3. After this function has been called repeatedly for a while, memory slowly grows.

pprof files:

To view the difference between the two files:

go tool pprof -http=':8081'  -diff_base .\pprof.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz .\pprof.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
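
For reference, two heap snapshots like the ones compared above can be captured from inside the Go process with the standard runtime/pprof package. The sketch below is not taken from the original report: the function names, the file names, and the point at which the workload runs between snapshots are all assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// heapProfile returns a heap profile in the format read by `go tool pprof`.
func heapProfile() ([]byte, error) {
	// Run a collection first so the profile reflects objects that are still live.
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// saveHeapProfile writes a heap snapshot to path (a file name chosen by the caller).
func saveHeapProfile(path string) error {
	data, err := heapProfile()
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

func main() {
	// Hypothetical file names; take one snapshot before and one after the workload.
	if err := saveHeapProfile("heap.001.pb.gz"); err != nil {
		fmt.Println("first snapshot failed:", err)
		return
	}
	// ... run the page navigation workload here ...
	if err := saveHeapProfile("heap.002.pb.gz"); err != nil {
		fmt.Println("second snapshot failed:", err)
		return
	}
	fmt.Println("snapshots written")
}
```

The two files can then be passed to `go tool pprof` with `-diff_base`, as in the command above.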

Screenshot of the diff:
image

Memory grows slowly.

What you expected to see

I would appreciate some direction on how to track down this apparent leak, and on whether it comes from incorrect usage on my side or is actually a bug.

What have you tried to solve the question

I looked at the examples in the two related test files in the rod project:

allanpk716 added the question label Nov 9, 2022

rod-robot commented

Please fix the format of your markdown:

67 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "pprof  Files:"]
68 MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "* [pprof.alloc_objects.alloc_s..."]
83 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2]
84 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 3]
101 MD012/no-multiple-blanks Multiple consecutive blank lines [Expected: 1; Actual: 2]

generated by check-issue

ysmood (Member) commented Nov 9, 2022

If you don't need to read the body, there is no need to set loadBody to true. In your screenshot, reading the body accounts for a large share of the memory, which is expected.

rod/hijack.go

Line 225 in 1139c6b

func (h *Hijack) LoadResponse(client *http.Client, loadBody bool) error {
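
Separately from loadBody, the handler in the report builds a new http.Client and http.Transport for every hijacked request. The net/http documentation recommends reusing transports, because each one keeps its own pool of idle connections. Whether this contributed to the growth reported here was never established in the thread, so the following is only a sketch of a per-proxy client cache. The names clientCache and newClientCache and the timeout values are hypothetical.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/url"
	"sync"
	"time"
)

// clientCache hands out one *http.Client per proxy URL, so the Transport
// and its connection pool are reused across requests.
type clientCache struct {
	mu      sync.Mutex
	clients map[string]*http.Client
}

func newClientCache() *clientCache {
	return &clientCache{clients: make(map[string]*http.Client)}
}

// get returns the cached client for proxyURL, creating it on first use.
func (c *clientCache) get(proxyURL string) (*http.Client, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if cl, ok := c.clients[proxyURL]; ok {
		return cl, nil
	}
	px, err := url.Parse(proxyURL)
	if err != nil {
		return nil, fmt.Errorf("parse proxy URL: %w", err)
	}
	cl := &http.Client{
		Timeout: 30 * time.Second, // assumed value, not from the report
		Transport: &http.Transport{
			Proxy:           http.ProxyURL(px),
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // kept from the reported code
			IdleConnTimeout: 90 * time.Second,                      // assumed value
		},
	}
	c.clients[proxyURL] = cl
	return cl, nil
}

func main() {
	cache := newClientCache()
	a, err := cache.get("http://127.0.0.1:8080") // hypothetical proxy address
	if err != nil {
		fmt.Println("proxy setup failed:", err)
		return
	}
	b, _ := cache.get("http://127.0.0.1:8080")
	fmt.Println("same client reused:", a == b)
	// Inside the hijack handler, the cached client would be passed in place of
	// the per-request one, for example: ctx.LoadResponse(a, false)
}
```

With a cache like this, the handler can also report a malformed proxy URL once, up front, instead of silently ignoring the parse error on every request.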

allanpk716 (Author) commented

I'll change it to false and leave it running over lunch to test. Thanks.

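allanpk716 (Author) commented

So far, testing shows no leak in this part. But I have run into a new problem: with loadBody set to true, a 403 from the site came back almost instantly. With it set to false, a 403 takes a long time to return (the full timeout). Is there a way to make 4xx and 5xx status codes return faster?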

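allanpk716 (Author) commented

> If you don't need to read the body, there is no need to set loadBody to true. In your screenshot, reading the body accounts for a large share of the memory, which is expected.
>
> rod/hijack.go
>
> Line 225 in 1139c6b
>
> func (h *Hijack) LoadResponse(client *http.Client, loadBody bool) error {

After the body has been read, is there a way to release that memory when the hijack is stopped? The current change makes my crawler very inefficient.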

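ysmood (Member) commented Nov 9, 2022

Then the leak is most likely caused by how you use it; I can't see where else the code could leak. This part of rod is already very thin, with no complex wrapping or abstraction at all.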

ysmood (Member) commented Nov 9, 2022

Can you provide a minimal reproduction following this document?

https://github.com/go-rod/rod/blob/master/.github/ISSUE_TEMPLATE/question.md

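allanpk716 (Author) commented

> Can you provide a minimal reproduction following this document?
>
> https://github.com/go-rod/rod/blob/master/.github/ISSUE_TEMPLATE/question.md

Sure. I'll first set up a local HTTP server to simulate the 403 case I mentioned. If I can reproduce it, I'll share the reproduction project and the steps.

As for the leak shown in the screenshot above, I have seen it elsewhere before, also attributed to io.ReadAll (there is a history there). Some search results suggest that this Go function should be avoided or used with care. Could you take a look at that part as well?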

allanpk716 (Author) commented Nov 9, 2022

To reproduce the same problem, I created a test project, rod_helper_sample, which contains the reproduction steps. It pulls in some libraries, so it may not be minimal; sorry about that.

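ysmood (Member) commented Nov 9, 2022

If you need high performance, you can implement reading the body yourself instead of relying on rod for it; rod only does very simple handling here.

I most likely won't optimize hijack any further, because the CDP protocol itself offers too little in this area.

I suggest you take a look at #395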
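
Reading the body yourself, as suggested above, makes it possible to cap how much of a response is held in memory. Below is a minimal sketch using only the standard library, assuming a caller-chosen size limit. The names readBodyLimited, readStringLimited, and errBodyTooLarge are hypothetical, and this is not how rod reads bodies internally.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errBodyTooLarge is returned when a body is longer than the allowed limit.
var errBodyTooLarge = errors.New("response body exceeds limit")

// readBodyLimited reads at most limit bytes from r and fails if the body is longer.
func readBodyLimited(r io.Reader, limit int64) ([]byte, error) {
	// Allow one extra byte so that an oversized body can be detected.
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errBodyTooLarge
	}
	return data, nil
}

// readStringLimited is a small convenience wrapper used to exercise the sketch.
func readStringLimited(s string, limit int64) ([]byte, error) {
	return readBodyLimited(strings.NewReader(s), limit)
}

func main() {
	data, err := readStringLimited("hello", 16)
	fmt.Println(string(data), err)

	_, err = readStringLimited("this body is too long", 8)
	fmt.Println("rejected:", errors.Is(err, errBodyTooLarge))
}
```

In a real handler the reader would be the http.Response body, which still has to be closed by the caller.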

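allanpk716 (Author) commented

> If you need high performance, you can implement reading the body yourself instead of relying on rod for it; rod only does very simple handling here.
>
> I most likely won't optimize hijack any further, because the CDP protocol itself offers too little in this area.
>
> I suggest you take a look at #395

All I really want is for a single page to use a specified proxy, and then, once I have the page object, to run additional checks and page operations. Is there another way to meet this need?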

allanpk716 (Author) commented

I have read #693. I'll test again once you have implemented it.

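ysmood (Member) commented Nov 9, 2022

I wrote a stress test, and hijack does not appear to leak: it ran well over a hundred thousand requests without any problem. See #745

The leak is most likely caused by your own code. I'm closing this issue for now; we can reopen it if there are new findings.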

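ysmood closed this as not planned Nov 9, 2022

allanpk716 (Author) commented

My current usage is a single Browser on which pages are continually created and closed, and that leaks memory. I then switched to using a Browser for a while, closing it, and launching a new one to carry on. So far I have observed no leak with that approach. I'll leave it running overnight and come back with an update.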
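
The workaround described here, replacing the Browser periodically, can be sketched roughly as follows. This is an illustration and not the reporter's code: it recycles after a fixed number of uses instead of after a period of time, and recycler, closer, and fakeBrowser are hypothetical names. The closer interface assumes the browser handle exposes a Close() error method, which is an assumption about how a *rod.Browser would be plugged in.

```go
package main

import (
	"fmt"
	"sync"
)

// closer is the only behaviour the sketch needs from a browser handle.
type closer interface {
	Close() error
}

// recycler hands out a shared browser and replaces it after maxUses acquisitions.
type recycler struct {
	mu      sync.Mutex
	launch  func() (closer, error)
	maxUses int
	current closer
	uses    int
}

func newRecycler(maxUses int, launch func() (closer, error)) *recycler {
	return &recycler{launch: launch, maxUses: maxUses}
}

// acquire returns the current browser, first replacing it if it is used up.
func (r *recycler) acquire() (closer, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.current != nil && r.uses >= r.maxUses {
		_ = r.current.Close() // best effort; the old browser is discarded either way
		r.current = nil
	}
	if r.current == nil {
		b, err := r.launch()
		if err != nil {
			return nil, fmt.Errorf("launch browser: %w", err)
		}
		r.current, r.uses = b, 0
	}
	r.uses++
	return r.current, nil
}

// fakeBrowser stands in for a real browser so the sketch runs without Chrome.
type fakeBrowser struct{ closed bool }

func (f *fakeBrowser) Close() error {
	f.closed = true
	return nil
}

func main() {
	launched := 0
	r := newRecycler(2, func() (closer, error) {
		launched++
		return &fakeBrowser{}, nil // real code would launch and connect a browser here
	})
	for i := 0; i < 5; i++ {
		if _, err := r.acquire(); err != nil {
			fmt.Println("acquire failed:", err)
			return
		}
	}
	fmt.Println("browsers launched:", launched)
}
```

The sketch assumes that work on the old browser has finished before it is replaced; closing a browser while its pages are still in use would break them.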

ysmood (Member) commented Nov 10, 2022

I suggest looking at how rod's unit tests use gotrace to prevent leaks: https://github.com/ysmood/gotrace

Use gotrace to test your own project.
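
The idea behind this kind of check is that a leak often shows up as goroutines that are still alive after the work has finished. The sketch below does not use gotrace's API. It is a much cruder stand-in built on the standard runtime package, and goroutineDelta and its settle delay are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// goroutineDelta runs fn, waits settleMillis for short-lived goroutines to exit,
// and returns how many more goroutines exist than before the call.
func goroutineDelta(fn func(), settleMillis int) int {
	before := runtime.NumGoroutine()
	fn()
	time.Sleep(time.Duration(settleMillis) * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	block := make(chan struct{})
	defer close(block)

	// A goroutine that finishes before fn returns leaves nothing behind.
	clean := goroutineDelta(func() {
		done := make(chan struct{})
		go func() { close(done) }()
		<-done
	}, 50)

	// A goroutine stuck on a channel receive stays alive and is counted.
	leaky := goroutineDelta(func() {
		go func() { <-block }()
	}, 50)

	fmt.Println("clean:", clean, "leaky:", leaky)
}
```

A count like this only says that something is still running, not what it is; a goroutine dump or a tool such as gotrace is needed to see the stacks involved.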

allanpk716 (Author) commented

OK, thanks.

5idu commented Nov 27, 2022

@allanpk716 Did you solve the leak? I'm running into the same situation.

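allanpk716 (Author) commented

> @allanpk716 Did you solve the leak? I'm running into the same situation.

Just close the Browser after it has been in use for a while, then launch a new Browser and carry on.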
