fix: cheerio issues with HTML serialization #1113

Merged: 4 commits, Aug 12, 2021
2 changes: 1 addition & 1 deletion .github/workflows/check-ts-support.yml
@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- - uses: actions/setup-node@v1
+ - uses: actions/setup-node@v2
with:
node-version: 16
registry-url: https://registry.npmjs.org/
6 changes: 4 additions & 2 deletions .github/workflows/check.yml
@@ -21,7 +21,7 @@ jobs:
steps:
- uses: actions/checkout@v2
- name: Use Node.js ${{ matrix.node-version }}
- uses: actions/setup-node@v1
+ uses: actions/setup-node@v2
with:
node-version: ${{ matrix.node-version }}
- name: Install Dependencies
@@ -34,7 +34,9 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- - uses: actions/setup-node@v1
+ - uses: actions/setup-node@v2
+ with:
+ node-version: 16
- name: Install Dependencies
run: npm install
- name: Run Linter
4 changes: 2 additions & 2 deletions .github/workflows/release.yml
@@ -51,7 +51,7 @@ jobs:
uses: actions/checkout@v2
-
name: Use Node.js 16
- uses: actions/setup-node@v1
+ uses: actions/setup-node@v2
with:
node-version: 16
-
@@ -72,7 +72,7 @@ jobs:
-
uses: actions/checkout@v2
-
- uses: actions/setup-node@v1
+ uses: actions/setup-node@v2
with:
node-version: 16
registry-url: https://registry.npmjs.org/
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,7 @@
2.0.2 / BETA
====================
- Fix serialization issues in `CheerioCrawler` caused by parser conflicts in recent versions of `cheerio`.

2.0.1 / 2021/08/06
====================
- Use `got-scraping` 2.0.1 until fully compatible.
1 change: 0 additions & 1 deletion package.json
@@ -57,7 +57,6 @@
"@apify/ps-tree": "^1.1.4",
"@apify/storage-local": "^2.0.1",
"@apify/utilities": "^1.1.2",
- "@types/cheerio": "^0.22.30",
"@types/domhandler": "^2.4.2",
"@types/node": "^15.14.2",
"@types/socket.io": "^2.1.13",
10 changes: 9 additions & 1 deletion src/crawlers/cheerio_crawler.js
@@ -553,7 +553,15 @@ class CheerioCrawler extends BasicCrawler {

request.loadedUrl = response.url;

- const $ = dom ? cheerio.load(dom, { xmlMode: isXml }) : null;
+ const $ = dom
+ ? cheerio.load(dom, {
+ xmlMode: isXml,
+ // Recent versions of cheerio use parse5 as the HTML parser/serializer. It's more strict than htmlparser2
+ // and not good for scraping. It also does not have a great streaming interface.
+ // Here we tell cheerio to use htmlparser2 for serialization, otherwise the conflict produces weird errors.
+ _useHtmlParser2: true,
+ })
+ : null;

crawlingContext.$ = $;
crawlingContext.contentType = contentType;
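
The option handling in this hunk can be looked at in isolation. The following is a minimal sketch, not the crawler's actual code: `loadDom` and `stubCheerio` are hypothetical names introduced here, and the cheerio module is passed in as an argument so the snippet runs without the package installed. `_useHtmlParser2` is the option the diff passes; its underscore prefix suggests it is internal to cheerio, so it may change between versions.

```javascript
// Hypothetical helper mirroring the change in this diff. The cheerio module is
// injected as a parameter so the option handling can be exercised with a stub.
function loadDom(cheerio, dom, isXml) {
    // Same fallback as the crawler code: no DOM means no $ function.
    if (!dom) return null;
    return cheerio.load(dom, {
        xmlMode: isXml,
        // Ask cheerio to use htmlparser2 for serialization instead of parse5.
        _useHtmlParser2: true,
    });
}

// Stub standing in for cheerio (an assumption for illustration only):
// it just records the arguments it receives.
const stubCheerio = {
    load: (dom, options) => ({ dom, options }),
};

const loaded = loadDom(stubCheerio, { type: 'root', children: [] }, false);
console.log(loaded.options);
```

In real use the first argument would be `require('cheerio')`, and the returned value would be the `$` function that `handlePageFunction` receives.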
68 changes: 68 additions & 0 deletions test/crawlers/cheerio_crawler.test.js
@@ -31,6 +31,52 @@ const responseSamples = {
+ '</item>\n'
+ '</items>',
image: fs.readFileSync(path.join(__dirname, 'data/apify.png')),
html: '<!doctype html>\n'
+ '<html>\n'
+ '<head>\n'
+ ' <title>Example Domain</title>\n'
+ '\n'
+ ' <meta charset="utf-8">\n'
+ ' <meta http-equiv="Content-type" content="text/html; charset=utf-8">\n'
+ ' <meta name="viewport" content="width=device-width, initial-scale=1">\n'
+ ' <style type="text/css">\n'
+ ' body {\n'
+ ' background-color: #f0f0f2;\n'
+ ' margin: 0;\n'
+ ' padding: 0;\n'
+ ' font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n' // eslint-disable-line max-len
+ ' \n'
+ ' }\n'
+ ' div {\n'
+ ' width: 600px;\n'
+ ' margin: 5em auto;\n'
+ ' padding: 2em;\n'
+ ' background-color: #fdfdff;\n'
+ ' border-radius: 0.5em;\n'
+ ' box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n'
+ ' }\n'
+ ' a:link, a:visited {\n'
+ ' color: #38488f;\n'
+ ' text-decoration: none;\n'
+ ' }\n'
+ ' @media (max-width: 700px) {\n'
+ ' div {\n'
+ ' margin: 0 auto;\n'
+ ' width: auto;\n'
+ ' }\n'
+ ' }\n'
+ ' </style> \n'
+ '</head>\n'
+ '\n'
+ '<body>\n'
+ '<div>\n'
+ ' <h1>Example Domain</h1>\n'
+ ' <p>This domain is for use in illustrative examples in documents. You may use this\n'
+ ' domain in literature without prior coordination or asking for permission.</p>\n'
+ ' <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n'
+ '</div>\n'
+ '</body>\n'
+ '</html>\n',
};

const app = express();
@@ -64,6 +110,10 @@ app.get('/mirror', (req, res) => {
res.send('<html><head><title>Title</title></head><body>DATA</body></html>');
});

app.get('/html-type', (req, res) => {
res.type('html').send(responseSamples.html);
});

app.get('/json-type', (req, res) => {
res.json(responseSamples.json);
});
@@ -252,6 +302,24 @@ describe('CheerioCrawler', () => {
await cheerioCrawler.run();
});

test('should serialize body and html', async () => {
expect.assertions(2);
const sources = [`http://${HOST}:${port}/html-type`];
const requestList = await Apify.openRequestList(null, sources);

const cheerioCrawler = new Apify.CheerioCrawler({
requestList,
maxRequestRetries: 0,
maxConcurrency: 1,
handlePageFunction: async ({ $, body }) => {
expect(body).toBe(responseSamples.html);
expect($.html()).toBe(body);
},
});

await cheerioCrawler.run();
});

describe('should timeout', () => {
let ll;
beforeAll(() => {
2 changes: 1 addition & 1 deletion test/typescript/crawlers/cheerio_crawler.test.ts
@@ -50,7 +50,7 @@ describe('CheerioCrawler TS', () => {
test('Can pass around and call `handler({ var }: { var: Type})`', async () => {
// This form can also be easily reused as above.
// Auto-completion works on defined input variables in parameter list.
- const y = async ({$}: { $?: cheerio.Selector }) => {
+ const y = async ({$}: { $?: cheerio.CheerioAPI }) => {
expect($!('a').attr('href')).toEqual('#');
};
