Does this work on relative paths out of the box? #421

mhluska · 2018-03-27T21:43:53Z

I'm trying to crawl a web page which has only root-relative paths. One example snippet from the content of the initial page would be <a href="/someusername" class="profile-link">.
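For reference, a root-relative href like the one above resolves against the origin of the page it appears on. Here is a minimal sketch using Node's built-in WHATWG `URL` class; the `resolveHref` helper is a hypothetical name for illustration, and this is not necessarily how simplecrawler resolves links internally.

```javascript
// Resolve an href found on a page against that page's URL.
// `resolveHref` is a hypothetical helper, not part of simplecrawler.
function resolveHref(href, pageUrl) {
    // The second argument to the URL constructor is the base URL.
    return new URL(href, pageUrl).href;
}

// A root-relative path replaces the whole path of the base URL.
console.log(resolveHref('/someusername', 'https://www.example.com/'));

// A plain relative path replaces only the last path segment.
console.log(resolveHref('c', 'https://www.example.com/a/b'));
```

So a crawler that resolves links this way would queue the root-relative profile link as an absolute URL on the same host.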

The crawler only fetches the first page and then gives up. Here's the code I'm using:

#!/usr/bin/env node

const Crawler = require('simplecrawler');
const crawler = new Crawler('https://www.example.com');

crawler.decodeResponses = true;

// Wrap emit so that every event is logged together with the queue's progress.
var originalEmit = crawler.emit;
crawler.emit = function(evtName, queueItem) {
    crawler.queue.countItems({ fetched: true }, function(err, completeCount) {
        if (err) {
            throw err;
        }

        crawler.queue.getLength(function(err, length) {
            if (err) {
                throw err;
            }

            console.log("fetched %d of %d — %d open requests, %d open listeners",
                completeCount,
                length,
                crawler._openRequests.length,
                crawler._openListeners);
        });
    });

    console.log(evtName, queueItem ? queueItem.url ? queueItem.url : queueItem : null);
    originalEmit.apply(crawler, arguments);
};

crawler.on('fetchcomplete', (queueItem, responseBody) => {
    console.log(`Found URL ${queueItem.url}`);
});

crawler.start();

Here's the output:

fetched 0 of 1 — 0 open requests, 0 open listeners
crawlstart null
fetched 0 of 1 — 1 open requests, 0 open listeners
fetchstart https://www.example.com/
fetched 0 of 1 — 1 open requests, 0 open listeners
fetchheaders https://www.example.com/
fetched 1 of 1 — 0 open requests, 0 open listeners
fetchcomplete https://www.example.com/
Found URL https://www.example.com/
fetched 1 of 1 — 0 open requests, 0 open listeners
complete null

mhluska · 2018-03-27T23:11:14Z

Actually, this was failing because the initial resource was missing a Content-Type header. Going to close.
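To illustrate why a missing header can stop a crawl after the first page, here is a minimal sketch of the kind of check a crawler might make before scanning a response for links. The `PARSEABLE_TYPES` list and the `shouldParseForLinks` function are assumptions for illustration, not simplecrawler's actual code.

```javascript
// Hypothetical allow-list of content types worth scanning for links.
// This list is an assumption, not simplecrawler's real default.
const PARSEABLE_TYPES = [/^text\//i, /^application\/xhtml\+xml/i];

// Decide whether a response body should be scanned for links,
// given its headers (keys lowercased, as Node's http module provides them).
function shouldParseForLinks(headers) {
    const contentType = headers['content-type'];
    if (!contentType) {
        // No Content-Type header: the body is never scanned, so no
        // new URLs are queued and the crawl ends after this page.
        return false;
    }
    return PARSEABLE_TYPES.some((pattern) => pattern.test(contentType));
}

console.log(shouldParseForLinks({ 'content-type': 'text/html; charset=utf-8' }));
console.log(shouldParseForLinks({}));
```

If the server cannot be changed to send the header, the `supportedMimeTypes` setting described in the simplecrawler README is probably the place to look, though whether it covers a completely absent header is something to verify.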

mhluska closed this as completed Mar 27, 2018
