
Crawl stops on non-www URLs #338

Open · cosmiXs opened this issue Jan 9, 2019 · 5 comments

cosmiXs commented Jan 9, 2019

What is the current behavior?
If I specify a domain with "www.", e.g. "http://www.domainname.com/", but the server's preferred-domain setting is without "www.", then the crawling process stops.

The reverse is unfortunately also true: if the domain does use "www." but I specify it without, e.g. "http://domainname.com/", the crawling also stops.

If the current behavior is a bug, please provide the steps to reproduce

What is the expected behavior?
Normally I would expect it to recognize the domain name without "www." as the same site and continue crawling.

What is the motivation / use case for changing the behavior?

Please tell us about your environment:

  • Version: latest
  • Platform / OS version: macOS
  • Node.js version: latest
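
A minimal sketch that should reproduce the behavior described above, assuming headless-chrome-crawler's HCCrawler API; "domainname.com" is a stand-in for a site whose server redirects "www." to the bare domain:

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    // Log every page the crawler actually reaches.
    onSuccess: result => console.log(result.response.url),
  });
  // The server's preferred domain is the non-www form, so after the
  // redirect only the home page is fetched and the crawl stops.
  await crawler.queue({ url: 'http://www.domainname.com/', maxDepth: 2 });
  await crawler.onIdle();
  await crawler.close();
})();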
@matheuschimelli

Sorry, but I can't understand what you mean. What do you really want to do? If a site doesn't use www in its domain, you have no reason to crawl it with www. I don't think that's a bug. Please provide more details.

cosmiXs (Author) commented Jan 24, 2019

I do not know in advance whether a domain explicitly requires "www." or not.
I have a list of domains that I want to crawl; I've placed them into a file and I'm reading them from there. By default I'm putting "www." in front of all the domains, but when the crawler reaches a domain that explicitly does not use "www." (as forced by the server's preferred-domain setting), it only accesses the home page and then exits.
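
A sketch of that workflow under the same HCCrawler assumption; "domains.txt" is a hypothetical file holding one bare domain per line:

const fs = require('fs');
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    onSuccess: result => console.log(result.response.url),
  });
  // One bare domain per line in the file; "www." is prefixed by default,
  // which breaks crawling for sites that force the bare domain.
  const domains = fs.readFileSync('domains.txt', 'utf8').split('\n').filter(Boolean);
  for (const domain of domains) {
    await crawler.queue({ url: `http://www.${domain}/`, maxDepth: 2 });
  }
  await crawler.onIdle();
  await crawler.close();
})();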

@simlevesque

Hey @cosmiXs, I made an npm package to fix this. It's called 'redirect-chain'. You give it your entry-point URL and it gives you back the domain redirect chain. Then use this array as allowedDomains.

https://www.npmjs.com/package/redirect-chain
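
A usage sketch, assuming a factory-style require and the domains() method used later in this thread; the entry-point URL is a stand-in:

const redirectChain = require('redirect-chain')();

(async () => {
  // Follows the HTTP redirects from the entry point and collects every
  // domain along the chain, e.g. ['domainname.com', 'www.domainname.com'].
  const domains = await redirectChain.domains('http://domainname.com/');
  console.log(domains);
})();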

@vycoder

vycoder commented Mar 29, 2019

I'm having the same problem. I get a timeout when visiting a non-www URL, even though I can visit it in the browser just fine.

I tried using @simlevesque's solution but I still get the same problem.

await crawler.queue({
  url,
  allowedDomains: await redirectChain.domains(url),
});

Still no luck. I'm getting an "Error: Navigation Timeout Exceeded: 30000ms exceeded".
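
One thing worth checking: 30000ms is Puppeteer's default navigation timeout. A combined sketch that raises it per queued request, assuming the crawler accepts a per-request timeout option; the URL is a stand-in:

const HCCrawler = require('headless-chrome-crawler');
const redirectChain = require('redirect-chain')();

(async () => {
  const url = 'http://domainname.com/'; // stand-in entry point
  const crawler = await HCCrawler.launch({
    onSuccess: result => console.log(result.response.url),
  });
  await crawler.queue({
    url,
    // Allow both the www and non-www forms discovered via the redirects.
    allowedDomains: await redirectChain.domains(url),
    // Raise the navigation timeout past the 30s default to rule out a
    // slow redirect chain as the cause of the error above.
    timeout: 60000,
  });
  await crawler.onIdle();
  await crawler.close();
})();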

@kulikalov (Contributor)

@cosmiXs @vycoder could you provide a full code example to reproduce the issue?

kulikalov added the bug label Oct 17, 2020