Puppeteer is being blocked by some sites (which uses distill networks) #4985

Closed
manibharathytu opened this issue Sep 28, 2019 · 14 comments

Comments

@manibharathytu

manibharathytu commented Sep 28, 2019

Steps to reproduce

Tell us about your environment:

  • Puppeteer version: 1.12.2
  • Platform / OS version: Ubuntu
  • URLs (if applicable):
  • Node.js version: 11.4

What steps will reproduce the problem?

Please include code that reproduces the issue.

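const puppeteer = require("puppeteer");

async function test() {
    // Launch a visible (non-headless) Chromium instance.
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    // timeout: 0 disables the navigation timeout.
    await page.goto('https://streeteasy.com', { waitUntil: 'networkidle0', timeout: 0 });
}

test().catch(console.error);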

What is the expected result?
It should get the proper response (which you can see by browsing to https://streeteasy.com)
https://drive.google.com/open?id=1-L3bjQWs9Et4Kk6dGOhUyi2ePEgtuEMM
Refer 1.png

What happens instead?
Captcha page is shown
https://drive.google.com/open?id=1-L3bjQWs9Et4Kk6dGOhUyi2ePEgtuEMM
Refer 2.png

@manibharathytu
Author

manibharathytu commented Sep 28, 2019

At first I was running in the default headless mode, and the site detected that it was a bot. I then switched to headless: false, but it is still detected. If I browse with my regular Chromium browser it works, but if I browse through Puppeteer with headless: false (which is also a Chromium browser) it is still detected as a bot.

@developervariety

This is outside the Puppeteer developers' control. You could use extensions to lower the detection rate, e.g. https://github.com/berstend/puppeteer-extra

@nicoandmee

Shameless plug for my framework which defeats distill: https://nicoandmee.github.io/puppeteer-theater/

@moujoud

moujoud commented Jan 31, 2020

Did you try adding --user-agent to your Puppeteer args?

const args = [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-infobars',
    '--window-position=0,0',
    '--ignore-certificate-errors',
    '--ignore-certificate-errors-spki-list',
    '--user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36"'
];
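
A minimal sketch of how an args array like the one above might be passed to puppeteer.launch. The helper names buildLaunchArgs and launchWithArgs are hypothetical, and only a subset of the flags is kept. One assumption worth verifying: since Puppeteer starts the browser without going through a shell, double quotes embedded in a flag value are likely passed through literally, so this sketch leaves them out and logs navigator.userAgent to check what the page actually sees. Nothing here is claimed to get past distill.

```javascript
// Hypothetical helper: builds a launch-args array similar to the one above.
function buildLaunchArgs(userAgent) {
  return [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--window-position=0,0',
    // No inner quotes: each array element is handed over as one argument.
    `--user-agent=${userAgent}`,
  ];
}

// Hypothetical helper: launches a visible browser with those args.
async function launchWithArgs(userAgent) {
  // Required lazily so buildLaunchArgs can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({
    headless: false,
    args: buildLaunchArgs(userAgent),
  });
  const page = await browser.newPage();
  // Log the user agent the page reports, to confirm the flag took effect.
  console.log(await page.evaluate(() => navigator.userAgent));
  return browser;
}
```

An alternative that avoids command-line quoting questions altogether is page.setUserAgent(userAgent), called on each page before navigating.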

@Ca10San

Ca10San commented Feb 19, 2020

Did you try adding --user-agent to your Puppeteer args?

const args = [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-infobars',
    '--window-position=0,0',
    '--ignore-certificate-errors',
    '--ignore-certificate-errors-spki-list',
    '--user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36"'
];

I tried this just now and it really works. You saved my life 😁

@lattice0

Did you try adding --user-agent to your Puppeteer args?

const args = [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-infobars',
    '--window-position=0,0',
    '--ignore-certificate-errors',
    '--ignore-certificate-errors-spki-list',
    '--user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36"'
];

These args make it worse for me: instant blocking.

@sssubik

sssubik commented May 29, 2020

@LucasZanella @manibharathytu is the issue solved?

@lattice0

@LucasZanella @manibharathytu is the issue solved?

For me it didn't work. I had to install a plugin I found somewhere that does the job. I can't remember the name.

@manibharathytu
Author

manibharathytu commented May 31, 2020

@LucasZanella @manibharathytu is the issue solved?

Partially.

This is because distill uses a machine-learning-based backend that takes in all the browser parameters and decides whether the client is a bot or not, so it is very difficult to trick it completely.

After a lot of research and debugging of their JS code, I was able to tweak my code to trick distill 4 out of 5 times. I had to dynamically randomize parts of my Puppeteer browser fingerprint, and even my IP, to make it look like it is not a bot.

I had to manually tweak, one by one, the browser parameters that the distill JS checks (e.g. audio/video support, installed extensions, mouse movement), so that distill would treat my Puppeteer instance as a normal Chromium browser. I don't remember all the parameters I had to tweak.

You can get a good idea by debugging their JS and seeing which checks they perform. Then compare against a normal Chromium browser and see which parameter differences make distill think that Puppeteer is a bot. I vaguely remember being stuck on 2 or 3 native browser parameters that I can't change from JS; because of those differences, distill was still able to detect it 1 out of 5 times. If those parameters can be hacked, distill can be tricked 100% of the time.
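
The comparison described above could be sketched as follows: collect the same set of properties in a normal Chromium session and in a Puppeteer-driven one, then diff the two results. The function names collectFingerprint and diffFingerprints are hypothetical, and the properties listed are illustrative assumptions only; the thread does not establish which parameters distill actually checks.

```javascript
// Meant to run inside the page, e.g. via page.evaluate(collectFingerprint),
// or pasted into the DevTools console of a normal browser.
// The chosen properties are illustrative, not a list of what distill checks.
function collectFingerprint() {
  return {
    userAgent: navigator.userAgent,
    webdriver: navigator.webdriver,
    languages: Array.from(navigator.languages || []),
    pluginCount: navigator.plugins ? navigator.plugins.length : 0,
    hardwareConcurrency: navigator.hardwareConcurrency,
  };
}

// Returns an object holding only the keys whose values differ.
function diffFingerprints(normal, automated) {
  const keys = new Set([...Object.keys(normal), ...Object.keys(automated)]);
  const diff = {};
  for (const key of keys) {
    // Compare via JSON so that arrays are compared by content.
    if (JSON.stringify(normal[key]) !== JSON.stringify(automated[key])) {
      diff[key] = { normal: normal[key], automated: automated[key] };
    }
  }
  return diff;
}

// Usage with made-up sample values: only the webdriver key differs.
const sampleDiff = diffFingerprints(
  { webdriver: false, pluginCount: 3 },
  { webdriver: true, pluginCount: 3 }
);
console.log(sampleDiff);
```

Diffing plain objects keeps the comparison independent of how the values were gathered, so the same helper works for output copied from the console and for results returned by page.evaluate.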

Some references to get started :
https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth
https://stackoverflow.com/questions/51731848/how-to-avoid-being-detected-as-bot-on-puppeteer-and-phantomjs
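
For orientation, a minimal sketch of how the stealth plugin from the first reference is typically wired in, assuming the third-party packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth are installed. The wrapper name launchWithStealth is hypothetical, and whether this is enough for any particular site (including those protected by distill) is not something the thread confirms.

```javascript
// Hypothetical wrapper around puppeteer-extra with the stealth plugin enabled.
async function launchWithStealth(url) {
  // Third-party packages, required lazily; install them first with
  // npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');

  // Register the plugin before launching the browser.
  puppeteer.use(StealthPlugin());

  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  return { browser, page };
}
```

The caller is responsible for closing the returned browser once it is done with the page.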

@stale

stale bot commented Jun 26, 2022

We're marking this issue as unconfirmed because it has not had recent activity and we weren't able to confirm it yet. It will be closed if no further activity occurs within the next 30 days.

@stale stale bot added the unconfirmed label Jun 26, 2022
@stale

stale bot commented Jul 26, 2022

We are closing this issue. If the issue still persists in the latest version of Puppeteer, please reopen the issue and update the description. We will try our best to accommodate it!

@stale stale bot closed this as completed Jul 26, 2022
@Nyceane

Nyceane commented May 22, 2023

Please reopen the issue... I am still seeing the problem on some sites

@clearly-outsane

I can't get past Zillow no matter what

@abhijeetkushe

abhijeetkushe commented Sep 8, 2023

I recently upgraded from Node.js 12 to Node.js 16. For that I had to switch from https://github.com/alixaxel/chrome-aws-lambda/tree/v3.1.1 to https://github.com/Sparticuz/chrome-aws-lambda/tree/puppeteer%4013.5.0, because I started seeing the issues described in the open bug alixaxel/chrome-aws-lambda#264. After that I started seeing errors like "Enable JavaScript and cookies to continue" with the same code. I got past that problem by passing the args mentioned in #4985 (comment). But now I am seeing errors like this: "The owner of this website (www.perimeterdermatology.com) has banned your access based on your browser's signature (802a5baa3eb43b68-ua60)". I know the two packages use different Puppeteer versions, so the bundled Chrome browser is also different (v3.1.0...v14.1.0). But can someone say what exactly has changed? The same websites were scraping fine with the old setup but not with the new one.


10 participants