[BUG] Abnormal tasks when filtering in simple mode #111

PIGfaces · 2022-07-01T03:06:20Z

Version

Golang: go version go1.18 darwin/arm64
Browser: Google Chrome 103.0.5060.53
OS: macOS 12.4 (21F79) [m1]
Arch: arm64
Commit
- 321828272c66c95a05fd262365a501e8f7b5d031
- dbf70647a44bbfbdaeec98791f90c2497d781708 - latest

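Problem description

When I crawl a target site in simple filter mode, the following happens whenever a single page contains more than a thousand links:

All new crawl tasks time out
When debugging with the browser window visible (--no-headless), the log keeps printing Crawling ******* , but the browser does not open any new tabs
After force-quitting the browser, the tasks do not stop and the log still prints Crawling ********

This does not happen when the page has relatively few links

Command executed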

crawlergo -c /PATH/MY/Browser -t 10 -m 4000 -f simple --no-headless --output-mode none https://www.xbiquge.so

Additional log output

crawlergo/pkg/task_main.go

Lines 230 to 232 in dbf7064

    
    t.crawlerTask.Result.resultLock.Lock()
    t.crawlerTask.Result.AllReqList = append(t.crawlerTask.Result.AllReqList, tab.ResultList...)
    t.crawlerTask.Result.resultLock.Unlock()

I added a log statement after line 231 to help with debugging

    .....Lock()
    logger.Logger.Info("tab task result count: ", len(tab.ResultList))
    ......Unlock()

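Log screenshots

Explanation of the circled areas

In the first screenshot, more than 1000 URLs were collected. The second screenshot shows that all subsequent crawl tasks time out and the browser no longer opens new tabs.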

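Steps to reproduce

Test on either of the commits listed above
Run the command with the arguments shown above

Observed behavior

After a few minutes:

The log is flooded with Crawling **** entries, followed by a large number of navigate timeout entries
The browser no longer opens new tabs
crawlergo keeps running after the browser is force-quit

Expected behavior

When a page contains a large number of links:

A new tab opens whenever the log prints Crawling ***
crawlergo exits normally once the browser has exited
Normal page navigation does not time out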
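
One way to get the "exit once the browser has exited" behavior is to make every task submission give up when a shared context is cancelled. The sketch below is only an illustration of that idea with the standard library: submitOrCancel, the queue channel, and the example URLs are hypothetical and are not part of crawlergo.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// submitOrCancel hands task to queue, or gives up as soon as ctx is cancelled.
// Both the function and the queue are hypothetical stand-ins, not crawlergo code.
func submitOrCancel(ctx context.Context, queue chan<- string, task string) error {
	select {
	case queue <- task:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo submits one task while the context is live and one after cancellation.
func demo() (accepted, cancelled bool) {
	ctx, cancel := context.WithCancel(context.Background())
	queue := make(chan string, 1) // room for exactly one pending task

	// The buffer is free and the context is live, so the send succeeds.
	accepted = submitOrCancel(ctx, queue, "https://example.com/a") == nil

	cancel() // stands in for "the browser has exited"

	// The buffer is full and nobody reads it, so only the cancel branch can fire.
	err := submitOrCancel(ctx, queue, "https://example.com/b")
	cancelled = errors.Is(err, context.Canceled)
	return accepted, cancelled
}

func main() {
	accepted, cancelled := demo()
	fmt.Println("first task accepted:", accepted)
	fmt.Println("second task cancelled:", cancelled)
}
```

With this shape, a submitter that is waiting for a free slot returns as soon as the context is cancelled instead of waiting indefinitely.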


PIGfaces · 2022-07-01T03:13:51Z

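[Code Review] My own analysis

Once a tab has collected all its links, each one is added to the task pool asynchronously, which leaves a large number of blocked tasks. My guess: this is why tasks keep running even after the browser has exited.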

crawlergo/pkg/task_main.go

Lines 202 to 208 in dbf7064

    
    go func() {
        err := t.Pool.Submit(task.Task)
        if err != nil {
            t.taskWG.Done()
            logger.Logger.Error("addTask2Pool ", err)
        }
    }()
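
The pile-up described above can be reproduced in isolation. The following is a minimal sketch, assuming the pool's Submit blocks while every worker is busy; boundedPool and countBlockedSubmitters are hypothetical stand-ins written with the standard library only, not the pool crawlergo actually uses.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// boundedPool is a hypothetical fixed-size pool whose Submit blocks when full.
type boundedPool struct{ slots chan struct{} }

// Submit waits for a free slot, then runs task in its own goroutine.
func (p *boundedPool) Submit(task func()) {
	p.slots <- struct{}{} // blocks while every worker slot is taken
	go func() {
		defer func() { <-p.slots }()
		task()
	}()
}

// countBlockedSubmitters starts one goroutine per link, as in the snippet
// above, and reports how many Submit calls have not returned once the pool
// is saturated with tasks that never finish on their own.
func countBlockedSubmitters(links, workers int) int {
	pool := &boundedPool{slots: make(chan struct{}, workers)}
	release := make(chan struct{}) // closed at the end to let tasks finish
	var started, returned int64
	var wg sync.WaitGroup

	for i := 0; i < links; i++ {
		wg.Add(1)
		go func() {
			atomic.AddInt64(&started, 1)
			pool.Submit(func() {
				defer wg.Done()
				<-release // simulates a navigation that hangs
			})
			atomic.AddInt64(&returned, 1)
		}()
	}

	// Wait until every submitter is running and the pool is saturated.
	target := int64(workers)
	if int64(links) < target {
		target = int64(links)
	}
	for atomic.LoadInt64(&started) < int64(links) || atomic.LoadInt64(&returned) < target {
		time.Sleep(time.Millisecond)
	}
	blocked := int(atomic.LoadInt64(&started) - atomic.LoadInt64(&returned))

	close(release) // unblock the tasks so the remaining submitters drain
	wg.Wait()
	return blocked
}

func main() {
	fmt.Println("blocked submitters:", countBlockedSubmitters(1000, 10))
}
```

In this model, a page with 1000 links and a pool of 10 workers leaves every submitter beyond the first 10 parked inside Submit, and nothing in those goroutines notices that the browser has gone away.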

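Another thing I don't understand: the page timeout is only applied inside err := t.Pool.Submit(task.Task), where the context gets its timeout when a new tab is created. The top-level browser context should therefore be unaffected, and creating a tab should still succeed. In practice, however, the browser does not open any new tabs. I tried upgrading the chromedp library, but that made no difference.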

byposeidon · 2022-12-17T21:03:37Z

I'm getting the same error (navigate timeout). Please fix this bug. @Qianlitp

redkit75 · 2023-01-04T10:35:35Z

This critical bug has not been resolved. Crawlergo fails to perform its most important function.

Qianlitp added the bug label Jul 1, 2022

redkit75 mentioned this issue Jan 6, 2023

add: add a maximum overall run time for the crawler #137

Merged
