This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Does this work on relative paths out of the box? #421

Closed
mhluska opened this issue Mar 27, 2018 · 1 comment

mhluska commented Mar 27, 2018

I'm trying to crawl a web page whose links are all root-relative paths. For example, the initial page contains <a href="/someusername" class="profile-link">.
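For reference, a root-relative href like the one above resolves against the origin of the page it appears on. Here is a minimal sketch using Node's built-in `URL` class. The helper name `resolveHref` is hypothetical and is not part of simplecrawler; it only shows what the resolved URL should look like.

```javascript
// Resolve an href found in a page against that page's URL.
// Uses the WHATWG URL class that ships with Node; no third-party packages.
// `resolveHref` is a hypothetical helper name, not a simplecrawler API.
function resolveHref(href, pageUrl) {
    return new URL(href, pageUrl).href;
}

const pageUrl = 'https://www.example.com/';

// A root-relative path replaces the whole path of the base URL.
console.log(resolveHref('/someusername', pageUrl));
// prints https://www.example.com/someusername

// A plain relative path is resolved against the base URL's directory.
console.log(resolveHref('photos', 'https://www.example.com/users/'));
// prints https://www.example.com/users/photos
```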

The crawler only fetches the first page and then gives up. Here's the code I'm using:

#!/usr/bin/env node

const Crawler = require('simplecrawler');
const crawler = new Crawler('https://www.example.com');

crawler.decodeResponses = true;

var originalEmit = crawler.emit;
crawler.emit = function(evtName, queueItem) {
    crawler.queue.countItems({ fetched: true }, function(err, completeCount) {
        if (err) {
            throw err;
        }

        crawler.queue.getLength(function(err, length) {
            if (err) {
                throw err;
            }

            console.log("fetched %d of %d — %d open requests, %d open listeners",
                completeCount,
                length,
                crawler._openRequests.length,
                crawler._openListeners);
        });
    });

    console.log(evtName, queueItem ? queueItem.url ? queueItem.url : queueItem : null);
    originalEmit.apply(crawler, arguments);
};

crawler.on('fetchcomplete', (queueItem, responseBody) => {
  console.log(`Found URL ${queueItem.url}`);
});

crawler.start();

Here's the output:

fetched 0 of 1 — 0 open requests, 0 open listeners
crawlstart null
fetched 0 of 1 — 1 open requests, 0 open listeners
fetchstart https://www.example.com/
fetched 0 of 1 — 1 open requests, 0 open listeners
fetchheaders https://www.example.com/
fetched 1 of 1 — 0 open requests, 0 open listeners
fetchcomplete https://www.example.com/
Found URL https://www.example.com/
fetched 1 of 1 — 0 open requests, 0 open listeners
complete null

mhluska commented Mar 27, 2018

Actually, this was failing because the initial resource was missing a Content-Type header, not because of the relative paths. Going to close.
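To illustrate why a missing header would stop the crawl after one page: a crawler that only scans responses of a known MIME type for links will fetch a header-less response but never parse it. The sketch below is standalone and is not simplecrawler's actual source; the function name `shouldDiscoverLinks` and the pattern list are assumptions made for illustration.

```javascript
// Illustrative only: these patterns and this function are assumptions,
// not code taken from simplecrawler.
const supportedMimeTypes = [/^text\//i, /^application\/(xhtml\+)?xml/i];

// Decide whether a response body should be scanned for links,
// based on its (lower-cased) response headers.
function shouldDiscoverLinks(headers) {
    const contentType = headers['content-type'];
    if (!contentType) {
        // No Content-Type header: the body type is unknown, so skip link discovery.
        return false;
    }
    // Drop parameters such as "; charset=utf-8" before matching.
    const mimeType = contentType.split(';')[0].trim();
    return supportedMimeTypes.some((pattern) => pattern.test(mimeType));
}

console.log(shouldDiscoverLinks({ 'content-type': 'text/html; charset=utf-8' })); // true
console.log(shouldDiscoverLinks({ 'content-type': 'image/png' })); // false
console.log(shouldDiscoverLinks({})); // false
```

With a check like this, the first page is still reported as fetched, but no links from it are queued, which matches the "fetched 1 of 1" output above.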

@mhluska mhluska closed this as completed Mar 27, 2018