subdomain crawl with "allowedDomains" parameter crawls top domain, too #381

Open
michaelpapesch opened this issue Nov 29, 2021 · 0 comments

michaelpapesch commented Nov 29, 2021

When I crawl the subdomain "test.domain.com", result.response.url also includes URLs from the parent domain "domain.com".
I tried passing the subdomain to "allowedDomains" both as a plain string and as a regexp.
I don't understand why this happens. Shouldn't the "allowedDomains" parameter prevent the crawler from scanning URLs on other domains?

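const HCCrawler = require('headless-chrome-crawler');

// `domain`, `url` and `connection` are defined elsewhere in my script.
(async () => {
    const crawler = await HCCrawler.launch({
        headless: true,
        args: [
            '--ignore-certificate-errors',
            '--no-sandbox',
        ],
        // Only this host should be crawled.
        allowedDomains: [domain],
        maxDepth: 8,
        // Run the default crawl, then attach the full page HTML to the result.
        customCrawl: async (page, crawl) => {
            const result = await crawl();
            result.content = await page.content();
            return result;
        },
        onSuccess: result => {
            // Unexpectedly, this also contains URLs from "domain.com".
            const values = [
                result.response.url
            ];
        },
    });
    await crawler.queue(url);
    await crawler.onIdle();
    await crawler.close().then(() => connection.end());
    console.log('Scan completed.');
})();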
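
Until the cause is clear, one possible stopgap is to filter results by hostname inside onSuccess. The following is only a minimal sketch: `isSameHost` is a hypothetical helper, not part of headless-chrome-crawler, and it assumes result.response.url is an absolute URL string. It relies only on the standard `URL` class available in Node.js.

```javascript
// Hypothetical helper (not part of headless-chrome-crawler).
// Returns true only when the URL's hostname equals the expected host exactly,
// so "domain.com" does not pass when "test.domain.com" is expected.
function isSameHost(resultUrl, expectedHost) {
    try {
        // URL normalizes the hostname to lowercase.
        return new URL(resultUrl).hostname === expectedHost.toLowerCase();
    } catch (err) {
        // Not a parseable absolute URL.
        return false;
    }
}

// Possible use inside the onSuccess callback from the snippet above:
//   onSuccess: result => {
//       if (!isSameHost(result.response.url, domain)) return;
//       ...
//   },
```

This only hides the unwanted results after the fact. It does not stop the crawler from requesting pages on the parent domain, so it is a workaround rather than a fix.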
