Proxy for page #678

Open
ivibe opened this issue Sep 4, 2017 · 81 comments
Labels
chromium Issues with Puppeteer-Chromium feature upstream

Comments

@ivibe

ivibe commented Sep 4, 2017

Hi!

Could someone tell me whether it's possible to set a proxy not only for a Chromium instance, but also for a single page?

So the current solution is:
const browser = await puppeteer.launch({ args: [ '--proxy-server=127.0.0.1:9876' ] });

Desired solution in my case is something like this:
const page = await browser.newPage({ args: [ '--proxy-server=127.0.0.1:9876' ] });

With a per-page proxy, it would be possible to run a single Chrome instance but use different proxies depending on the page.

Thanks in advance!

@JoelEinbinder
Collaborator

You can use request interception to forward requests from each page to the correct proxy.

@ivibe
Author

ivibe commented Sep 5, 2017

@JoelEinbinder could you show an example of how I can forward requests through a SOCKS proxy using request interception?

@aslushnikov
Contributor

@ivibe unfortunately, this is not possible for SOCKS proxy, you'll have to launch a separate browser instance for this case.

Out of curiosity, why would you need this?

@ivibe
Author

ivibe commented Sep 6, 2017

It's a pity.

My use case is web scraping. Web servers can block IPs, or the proxy server can become inactive, which is why I need to change proxies relatively often.

Of course, I would like to avoid the performance hit of launching many instances of Chromium. Is there any chance that such functionality (i.e. changing the proxy dynamically) will be implemented in future Chromium releases?

@ks07

ks07 commented Sep 6, 2017

+1

This feature would be useful for me too, as I'm currently forced to launch multiple chromium instances if I need to access multiple URLs via different proxies. To add to what @ivibe suggested for use-cases, this could also be useful if you need to access resources behind firewalls with no common proxy that can pass through both. Alternatively, this would be useful if you wanted to test or screenshot your web application from multiple sources - e.g. if page content changes based on the visitor's IP's geolocation.

If there is a way to work around this as suggested by @JoelEinbinder, perhaps the SOCKS limitation could be sidestepped by placing a proxy in the middle that exposes an HTTP proxy interface to the SOCKS connection (e.g. https://superuser.com/questions/423563/convert-http-requests-to-socks5).

@Khady

Khady commented Sep 6, 2017

unfortunately, this is not possible for SOCKS proxy, you'll have to launch a separate browser instance for this case.

Which proxy types are supported in this case?

@blue-cp

blue-cp commented Sep 9, 2017

+1
We have the exact same use case. This would be a very useful feature.
If we could set an HTTP proxy per page, that would be great.

@fhmd4k

fhmd4k commented Oct 14, 2017

I think you can intercept every request and send it through an HTTP(S) proxy!

@fhmd4k

fhmd4k commented Oct 14, 2017

A SOCKS proxy affects the whole browser (all tabs), so the only option is to run a separate browser instance (with a different userDataDir) for each proxy.
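
A minimal sketch of the one-browser-per-proxy approach described above. The helper name `launchOptionsForProxy` and the profile directory layout are illustrative assumptions; only the `--proxy-server` flag and the `userDataDir` launch option come from Chromium and Puppeteer.

```javascript
const os = require('os');
const path = require('path');

// Hypothetical helper: builds puppeteer.launch() options for one proxy.
// Each proxy gets its own profile directory so instances don't share state.
function launchOptionsForProxy(proxy, baseDir = os.tmpdir()) {
  // Turn "socks5://10.0.0.1:1080" into a filesystem-safe directory name.
  const dirName = 'profile-' + proxy.replace(/[^a-zA-Z0-9]+/g, '_');
  return {
    args: [`--proxy-server=${proxy}`],
    userDataDir: path.join(baseDir, dirName),
  };
}

// Usage (requires puppeteer to be installed):
//   const puppeteer = require('puppeteer');
//   const browsers = await Promise.all(
//     proxies.map(p => puppeteer.launch(launchOptionsForProxy(p))));
```

This keeps the cost the thread is worried about (one Chromium process per proxy), but it works for SOCKS as well as HTTP proxies.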

@Khady

Khady commented Oct 15, 2017 via email

@ivibe
Author

ivibe commented Oct 15, 2017

@fhmd4k even if we consider only regular HTTP(S) proxies, it would be nice to see an example of using one by capturing requests.

@barbolo

barbolo commented Nov 10, 2017

Hi, I've been working around this issue and I'm already able to make it work with HTTP websites. For HTTPS websites I'm still facing some issues.

It may sound a bit hacky and complex... hmm... that's because it really is! But hey, it works.

The idea is to create a local Downstream Proxy that parses the address of the Upstream Proxy from the headers of the page's requests.

[Diagram omitted. Image credits: https://www.fedux.org/articles/2015/04/11/setup-a-proxy-with-ruby.html]

You can use something like this per page:

await page.setExtraHTTPHeaders({proxy_addr: "200.11.11.11", proxy_port: "999"}); // header values must be strings
// 200.11.11.11:999 is the address of your final proxy you want to use (the Upstream Proxy).

You should start chrome using --proxy-server=downstream-proxy-address.

Then, your custom Downstream Proxy should extract those proxy headers and forward the request for the proper Upstream Proxy.

For HTTPS requests, the issue I'm facing is intercepting the CONNECT requests sent when the secure tunnel is being created. In this case the proxy headers are not sent by Chrome, and I'm looking for another way of transmitting the proxy information to the Downstream Proxy without hacking Chrome/Chromium itself.

The Downstream Proxy should be a very lightweight process running on your operating system. For reference, the proxy I've built consumes about 20MB of memory. I won't share the proxy code for now because it currently exposes some security risks for my application.
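
Since the proxy code itself was not shared, here is a rough sketch of what the plain-HTTP half of such a Downstream Proxy might look like, using only Node's built-in `http` module. The function names `extractUpstream` and `createDownstreamProxy` are hypothetical, the header names are the ones used in the comment above, and the sketch does not handle HTTPS (CONNECT) traffic at all.

```javascript
const http = require('http');

// Header names taken from the comment above; they are this scheme's own
// convention, not anything Chrome or Puppeteer defines.
const ADDR_HEADER = 'proxy_addr';
const PORT_HEADER = 'proxy_port';

// Splits incoming headers into the upstream proxy address and the headers
// that should be forwarded (the routing headers themselves are stripped).
function extractUpstream(headers) {
  const { [ADDR_HEADER]: host, [PORT_HEADER]: port, ...forwarded } = headers;
  if (!host || !port) return null;
  return { host, port: Number(port), headers: forwarded };
}

// Plain-HTTP only: a browser sends a proxy the absolute URL as the request
// target, so req.url can be passed to the upstream proxy unchanged.
function createDownstreamProxy() {
  return http.createServer((req, res) => {
    const upstream = extractUpstream(req.headers);
    if (!upstream) {
      res.writeHead(400);
      res.end('missing proxy_addr/proxy_port headers');
      return;
    }
    const proxyReq = http.request({
      host: upstream.host,
      port: upstream.port,
      method: req.method,
      path: req.url,
      headers: upstream.headers,
    }, proxyRes => {
      // Relay the upstream proxy's response back to the browser.
      res.writeHead(proxyRes.statusCode, proxyRes.headers);
      proxyRes.pipe(res);
    });
    proxyReq.on('error', () => {
      if (!res.headersSent) res.writeHead(502);
      res.end();
    });
    req.pipe(proxyReq);
  });
}

// Usage: createDownstreamProxy().listen(8000), then launch Chrome with
// --proxy-server=127.0.0.1:8000.
```

Stripping the routing headers before forwarding keeps the upstream proxy address from leaking to the target site.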

@tzellman

I could be wrong, but I believe SOCKS5 is already supported: http://www.chromium.org/developers/design-documents/network-stack/socks-proxy

--proxy-server="socks5://myproxy:8080"
--host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE myproxy"
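
For reference, a small sketch of how these two flags might be passed through Puppeteer's `args` array. The helper name `socks5Args` is hypothetical. Because the array is handed to the browser process without going through a shell, the surrounding quotes shown above are left out.

```javascript
// Hypothetical helper that builds the two flags quoted above for a SOCKS5 proxy.
function socks5Args(host, port) {
  return [
    `--proxy-server=socks5://${host}:${port}`,
    // Prevent local DNS lookups for everything except the proxy host itself.
    `--host-resolver-rules=MAP * ~NOTFOUND , EXCLUDE ${host}`,
  ];
}

// Usage (requires puppeteer to be installed):
//   const browser = await puppeteer.launch({ args: socks5Args('myproxy', 8080) });
```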

@barbolo

barbolo commented Nov 21, 2017

@tzellman that sets a single proxy for chrome and not for each page (tab) of chrome.

@gwaramadze

@barbolo I believe this workaround applies to most headless browsers. We have set up similar stack with PhantomJS:
client => haproxy => phantomjs => server

Same story, works great for HTTP resources but fails to route HTTPS as there is no access to additional headers, querystring, nothing... We are even considering SSL termination but that's just soooo much hacking to achieve such a simple thing :/

Did you have any luck with working around HTTPS requests?

@barbolo

barbolo commented Dec 13, 2017

@gwaramadze Yes, I've found some ways of making this scheme work with HTTPS and I'll share how I'm currently doing it.

Like I've said in the previous comment, the custom headers with the proxy information were ignored by Chrome when communicating with the downstream proxy server. However the user-agent header was being transmitted.

The first approach I tried was to encapsulate the proxy information in a JSON string sent as the user-agent header. For example, I would change the Chrome user-agent for each tab to look like this:

var userAgent = JSON.stringify({
  "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
  "proxy-addr" : "111.111.111.111",
  "proxy-port" : "9999",
});
page.setUserAgent(userAgent);

That way I can intercept the user-agent in the Downstream Proxy and parse the Proxy attributes from it.

The problem is that the user-agent is also encrypted in the connection and sent directly to the final HTTP server. It's impossible to intercept it and fix it before sending it to the HTTP server. So the final HTTP server would receive a bizarre user agent string that would include your proxy connection information. If that is not a problem for you, that will work. But for me it could be a problem.

So what I ended up doing was to create a list with thousands of user agent strings and for each new tab:

  1. Choose a user agent string from the list and set it as the page's user agent.

  2. Send a request to the downstream proxy specifying that requests with this user agent string should use a given upstream proxy.

  3. Send a new request from this tab.

  4. In the downstream proxy, find which proxy should be used based on the user agent string.

That's how I'm doing it now. Steps 2 and 4 require custom logic in the downstream proxy.
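
The four steps above could be sketched roughly as follows. This is not the author's code: `UserAgentProxyRegistry` and `makeConnectHandler` are hypothetical names, the registry is assumed to live in the same process as the downstream proxy (the comment describes sending it a request instead), and the upstream is assumed to be an HTTP proxy that accepts CONNECT.

```javascript
const net = require('net');

// Maps a user-agent string to the upstream proxy chosen for one tab.
class UserAgentProxyRegistry {
  constructor(userAgents) {
    this.free = [...userAgents];   // user agents not yet bound to a tab
    this.byUserAgent = new Map();  // user agent -> { host, port }
  }

  // Steps 1 and 2: reserve an unused user agent and bind it to a proxy.
  assign(proxy) {
    const userAgent = this.free.pop();
    if (userAgent === undefined) throw new Error('user-agent pool exhausted');
    this.byUserAgent.set(userAgent, proxy);
    return userAgent;
  }

  // Step 4: look up the proxy for an incoming request's user agent.
  lookup(userAgent) {
    return this.byUserAgent.get(userAgent) ?? null;
  }

  // Return the user agent to the pool once the tab is closed.
  release(userAgent) {
    if (this.byUserAgent.delete(userAgent)) this.free.push(userAgent);
  }
}

// 'connect' handler for an http.Server: relays the CONNECT request to the
// upstream proxy selected by user agent, then pipes bytes in both directions.
function makeConnectHandler(registry) {
  return (req, clientSocket, head) => {
    const proxy = registry.lookup(req.headers['user-agent']);
    if (!proxy) {
      clientSocket.end('HTTP/1.1 502 Bad Gateway\r\n\r\n');
      return;
    }
    const upstream = net.connect(proxy.port, proxy.host, () => {
      // req.url is "host:port" for CONNECT; the upstream's own reply
      // flows back to the browser through the pipe.
      upstream.write(`CONNECT ${req.url} HTTP/1.1\r\nHost: ${req.url}\r\n\r\n`);
      if (head.length) upstream.write(head);
      upstream.pipe(clientSocket);
      clientSocket.pipe(upstream);
    });
    upstream.on('error', () => clientSocket.destroy());
    clientSocket.on('error', () => upstream.destroy());
  };
}
```

On the Puppeteer side, the string returned by `registry.assign(...)` would be passed to `page.setUserAgent(...)` before the first navigation (step 3), and the handler would be attached with `server.on('connect', makeConnectHandler(registry))`.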

Another approach that should work is to make changes to the source of chromium network to allow other headers to be transmitted. But that would be more maintenance work in the long term.

@gwaramadze

@barbolo Thanks, this is quite an interesting hack. I wouldn't want to meddle with user agents too much, as they might be checked by anti-scraping algorithms.

@barbolo

barbolo commented Dec 13, 2017

@gwaramadze yes. That's why I'm using the other approach. For instance, you have thousands of real chrome user agents available for recent versions of the chrome browser.

@Ogofo

Ogofo commented Jan 11, 2018

Is this feature in active development? I have the same issue, and I guess the use case is widespread.

@chaims

chaims commented Jan 22, 2018

+1
I have the same use case! Waiting for a solution!

@qingpengchen2011

+1

@dvssmgk

dvssmgk commented Jan 29, 2018

+1 I have a similar use case too. Waiting for a solution that can set a proxy per page.

@barbolo

barbolo commented Jan 29, 2018

I don't think Puppeteer has anything to do with this issue. The problem is with Chrome, which doesn't provide any API to configure proxy.

You can either use a workaround like I've suggested above or you can build Chromium with a modified Network Stack, which I don't see as a good option.

@flyxl

flyxl commented Jan 31, 2018

I'm using request interception to forward requests:

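// Assumed imports (not shown in the original comment). The socksHost/socksPort
// options match the socks5-http-client and socks5-https-client packages.
const request = require('request-promise-native');
const Sock5HttpAgent = require('socks5-http-client/lib/Agent');
const Sock5HttpsAgent = require('socks5-https-client/lib/Agent');

// Hypothetical wrapper class; the original showed only the two methods.
class ProxiedPages {
    async newPage(browser) {
        const page = await browser.newPage();

        await page.setRequestInterception(true);
        page.on('request', async interceptedRequest => {
            const resType = interceptedRequest.resourceType();
            // Only documents and XHRs go through the proxy; everything else loads directly.
            if (['document', 'xhr'].includes(resType)) {
                const options = {
                    uri: interceptedRequest.url(),
                    method: interceptedRequest.method(),
                    headers: interceptedRequest.headers(),
                    body: interceptedRequest.postData(),
                    usingProxy: true,
                };
                try {
                    const response = await this.fetch(options);
                    await interceptedRequest.respond({
                        status: response.statusCode,
                        contentType: response.headers['content-type'],
                        headers: response.headers,
                        body: response.body,
                    });
                } catch (err) {
                    // Without this, a failed fetch would leave the request pending forever.
                    await interceptedRequest.abort();
                }
            } else {
                interceptedRequest.continue();
            }
        });
        return page;
    }

    fetch(options) {
        const isHttps = options.uri.startsWith('https');

        if (options.usingProxy || process.env.NODE_ENV === 'production') {
            options.agentClass = isHttps ? Sock5HttpsAgent : Sock5HttpAgent;
            options.agentOptions = {
                socksHost: 'localhost', // Defaults to 'localhost'.
                socksPort: 9050 // Defaults to 1080.
            };
        }

        // Resolve with the full response (status, headers, body), not just the body.
        options.resolveWithFullResponse = true;
        // Pass non-2xx responses through to the page instead of rejecting.
        options.simple = false;

        return request(options);
    }
}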

Please note that in my case I only forward document and xhr requests, I ignore the baseUrl request option, and I use request-promise-native instead of request. You can replace the proxy settings in the fetch function.

@joelgriffith
Contributor

You can use a project like browserless and configure per-request proxies via query params. This, coupled with the page.authenticate method, allows for pretty flexible usage.

browserless is here
page.authenticate is here

@banxian

banxian commented Mar 31, 2018

@flyxl I used your code in my project to forward all requests to a proxy, but it caused some 502 errors from the server. Setting the proxy directly in the launch options works fine.
I guess the problem is caused by requests being reordered, which conflicts with the server's logic.

@mathiasbynens
Member

Chromium tracking issue: https://bugs.chromium.org/p/chromium/issues/detail?id=1090797

@mikespnu

mikespnu commented Jun 6, 2020

Anybody having issues with Puppeteer-page-proxy?
I'm getting the following error:

dist/source/create.js:155
                    yield item;
                    ^^^^^

SyntaxError: Unexpected strict mode reserved word
    at createScript (vm.js:80:10)
    at Object.runInThisContext (vm.js:139:10)
    at Module._compile (module.js:617:28)

@gajus

gajus commented Jun 6, 2020

Anybody having issues with Puppeteer-page-proxy?
I'm getting the following error:

dist/source/create.js:155
                    yield item;
                    ^^^^^

SyntaxError: Unexpected strict mode reserved word
    at createScript (vm.js:80:10)
    at Object.runInThisContext (vm.js:139:10)
    at Module._compile (module.js:617:28)

What Node.js version?

@mikespnu

mikespnu commented Jun 6, 2020 via email

@mikespnu

mikespnu commented Jun 7, 2020

Node needed to be updated. It's working fine

@runningabcd

Awesome! When will this be available for Python? Right now there is no Python support.

@Nisthar

Nisthar commented Nov 25, 2020

EDIT:
It's possible with puppeteer-page-proxy.
It supports setting a proxy for an entire page, or if you like, it can set a different proxy for each request.
Repository:
https://github.com/Cuadrix/puppeteer-page-proxy

Is this library still working for you?

joone pushed a commit to joone/puppeteer that referenced this issue Aug 23, 2021
…ontext

Issue: puppeteer#678

Example:

const browser = await puppeteer.launch();
const context = await browser.createIncognitoBrowserContext('myproxy.com:3128');
const page = await context.newPage()
await page.authenticate({username: 'foo', password: 'bar' });
await page.goto('https://google.com');
await browser.close();
jschfflr pushed a commit that referenced this issue Sep 18, 2021
…ontext (#7516)

Example:

(async () => {
  const browser = await puppeteer.launch();
  const context = await browser.createIncognitoBrowserContext('myproxy.com:3128');
  const page = await context.newPage()
  await page.authenticate({username: 'foo', password: 'bar' });
  await page.goto('https://google.com');
  await browser.close();
})();

Issue: #678
@radiolondra

@Nisthar
AFAIK the puppeteer-page-proxy library has some issues.
I tried using it with my proxy and ran into problems, for example when going to https://whatismyipaddress.com/ (and other similar sites) simply to get the proxy IP address in the proxied page. It also fails when Google serves a reCAPTCHA while scraping (and not only with Google).
By contrast, everything works fine with the standard Puppeteer launch arg '--proxy-server'.
The library does not seem to be actively maintained; even its issues go unanswered.

@Kikobeats
Contributor

You can set a proxy per browser context, which in the end is pretty similar:
https://pptr.dev/next/api/puppeteer.browsercontextoptions
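
A minimal sketch of the per-context approach, assuming the `proxyServer` field of the context options documented at the link above. The helper name `openViaProxy` is hypothetical. The context-creation method has been renamed across Puppeteer versions, so the sketch tries the newer name first and falls back to the older one.

```javascript
// Opens `url` in a fresh browser context that routes its traffic through
// `proxyServer`. `browser` is a Puppeteer Browser instance.
async function openViaProxy(browser, proxyServer, url) {
  // Recent Puppeteer versions expose createBrowserContext; older ones named
  // the same method createIncognitoBrowserContext.
  const create = browser.createBrowserContext ?? browser.createIncognitoBrowserContext;
  const context = await create.call(browser, { proxyServer });
  const page = await context.newPage();
  await page.goto(url);
  return { context, page };
}

// Usage (requires puppeteer to be installed):
//   const browser = await puppeteer.launch();
//   const a = await openViaProxy(browser, 'http://proxy-a.example:3128', 'https://example.com');
//   const b = await openViaProxy(browser, 'http://proxy-b.example:3128', 'https://example.com');
```

Each context has its own cookies and cache, so two contexts with different proxies behave much like two separate browsers while sharing one Chromium process.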
