
Privacy Pioneer Web Crawler

A web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer. The crawler utilizes Privacy Pioneer's code; however, this repository is not related to the actual development of the Privacy Pioneer extension. Instead, it is an implementation of Privacy Pioneer that uses Selenium for browser automation. Thus, it is not necessary to download or install the extension code on your own; the necessary components of the extension are already provided in this repository. You only need to clone the Privacy Pioneer repository if you would like to make your own changes to the extension. For Selenium to use the extension, it must be compiled and packaged into an .xpi file. Since the crawler only works with a compiled version of the extension, the two repositories need to be separate.

The code in this repo is developed and maintained by the Privacy Pioneer team.

Notes on Setup

Our crawler is intended to be run in different geographic locations, with the goal of investigating different privacy laws and how websites adhere to them. To that end, you may use a VPN, a web proxy, cloud computing, or another method suited to your needs. We provide instructions for setting up on the cloud because of problems we encountered when crawling through a VPN (see Known Issues below). We also outline the steps to install the crawler locally, in case you want to use the crawler in some other fashion. If you are not planning to crawl on the cloud, feel free to skip to the crawler setup.

1. Instructions for Creating a New VM on Google Cloud

This section will outline the necessary steps to create a VM on the cloud. You will need to create a project in the Google Cloud console. Unless otherwise specified, leave each setting at its default value. Click the triangles next to the step numbers to see an example of what you should see at each step.

1. Navigate to the Compute Engine and click Create Instance.
2. Choose the name, region, and zone for your machine. Your decision should reflect what location you'd like to crawl from.
3. Select the appropriate type of machine you'd like to use. If you're on the fence about whether or not your machine will be powerful enough, it's better to overestimate. We've had issues with weaker machines where Selenium stops working when a machine doesn't have sufficient memory.
4. Change the server boot disk to Windows. In theory, there's no reason why you couldn't run this crawler on a Linux server. However, we haven't tested this, and we recommend the Windows route because you have easy access to a GUI. This makes checking if the crawler is operating as expected significantly easier.
5. Allow HTTP and HTTPS messages through the firewall. Then, click "Create".
6. Now that you have your server, click on the triangle next to "RDP" and select "Set Windows Password". Be sure to save these credentials somewhere safe, as they will not be shown again.

You should now have a working Google Cloud VM. To connect to the VM, use the Remote Desktop Connection app on Windows. Provide the external IP, username, and password. After connecting, you should see the server desktop. Next, you'll need to go through the crawler setup instructions.

Note: When crawling with multiple locations, you can avoid the hassle of setting up for each VM by using a machine image.

2. Instructions for Setting up the Crawler on Windows

The previous steps prepared you to set up the crawler on the cloud. Now, we'll actually set up the crawler. This process is identical locally and on the cloud.

Note: The crawler requires that Firefox Nightly be installed.

Start by cloning this repo. If you want to make changes to the Privacy Pioneer extension for the crawl, then look here for a guide. If you want to use the extension as we intend to use it, then you can ignore said guide.

Install MySQL and MySQL Shell. Once installed, enter MySQL Shell and run the following commands:

\connect root@localhost

Enter your MySQL root password. If you haven't set one up yet, the shell should prompt you to create one. Use the password abc; the ALTER USER command below and the .env file you will create later both assume this password.

Next, switch the shell over from JS to SQL mode.

\sql

To set up the crawler's access to the database via your root account, run:

ALTER USER 'root'@'localhost' IDENTIFIED WITH 'mysql_native_password' BY 'abc';
FLUSH PRIVILEGES;

2.1 Database Setup

Next, we will set up the MySQL database. This is important because we need a place to store the evidence that Privacy Pioneer will find. Interactions with the database will be managed by the scripts located in the rest-api directory. We are also using a special version of Privacy Pioneer that is designed to interact with this database.

First, in the MySQL shell, create the database:

CREATE DATABASE analysis;

Then, access it:

USE analysis;

Lastly, create a table where any evidence that Privacy Pioneer finds will be stored:

CREATE TABLE entries
  (id INTEGER PRIMARY KEY AUTO_INCREMENT, timestp varchar(255), permission varchar(255), rootUrl varchar(255),
  snippet varchar(4000), requestUrl varchar(9000), typ varchar(255), ind varchar(255), firstPartyRoot varchar(255),
  parentCompany varchar(255), watchlistHash varchar(255),
  extraDetail varchar(255), cookie varchar(255), loc varchar(255));

You can now exit the MySQL shell.

In the rest-api folder, create a new file called .env, and save the following to that file:

DB_CONNECTION=mysql
DB_HOST=localhost
DB_DATABASE=analysis
DB_USERNAME=root
DB_PASSWORD=abc
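To confirm that the credentials in .env work, you can optionally run a quick connection check from the rest-api directory. The sketch below is illustrative only and assumes the dotenv and mysql2 packages; the actual scripts in rest-api manage the real database connection for you:

// Optional sanity check, run from rest-api. A minimal sketch assuming
// `npm install dotenv mysql2`; the real rest-api scripts handle the connection.
require("dotenv").config(); // loads DB_HOST, DB_DATABASE, DB_USERNAME, DB_PASSWORD

const mysql = require("mysql2/promise");

async function checkConnection() {
  const connection = await mysql.createConnection({
    host: process.env.DB_HOST,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_DATABASE,
  });
  await connection.query("SELECT 1"); // fails if the credentials or database are wrong
  console.log(`Connected to ${process.env.DB_DATABASE} as ${process.env.DB_USERNAME}`);
  await connection.end();
}

checkConnection().catch(console.error);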

2.2 Crawler Setup

Lastly, you will need to manually set the zip code and the GPS coordinates that you will be crawling from. You can accomplish this by opening up the local crawler script local-crawler.js and modifying the following values:

const TARGET_LAT = 10.12; // replace this value with your intended latitude
const TARGET_LONG = -11.12; // replace this value with your intended longitude
const TARGET_ZIP = "011000"; // replace this value with your intended zip code (note that it must be a string)

3. Instructions for Running the Crawler

Using the terminal, enter the privacy-pioneer-web-crawler/rest-api directory. Run either:

npm install
node index.js

or

npm install
npm start

In another instance of the terminal, enter the privacy-pioneer-web-crawler/selenium-crawler directory, and run either:

npm install
node local-crawler.js

or

npm install
npm start

These two commands are enough to get the crawl running. You will know the crawl is working when an instance of Firefox Nightly opens with the Privacy Pioneer extension loaded and begins visiting websites.
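If you want to confirm that evidence is actually reaching the database, you can check the entries table while the crawl runs, either from the MySQL shell or with a small script. The snippet below is a minimal, optional sketch assuming the mysql2 package and the credentials from the setup above:

// Optional: peek at the most recent evidence rows while the crawl runs.
// Minimal sketch assuming `npm install mysql2`; values match the setup above.
const mysql = require("mysql2/promise");

(async () => {
  const db = await mysql.createConnection({
    host: "localhost",
    user: "root",
    password: "abc",
    database: "analysis",
  });
  const [rows] = await db.query(
    "SELECT rootUrl, permission, typ, requestUrl FROM entries ORDER BY id DESC LIMIT 5"
  );
  console.table(rows); // prints the latest evidence entries found by Privacy Pioneer
  await db.end();
})();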

4. Changing the Extension for a Crawl

If you need to change Privacy Pioneer itself, follow these steps to have your changes reflected in your crawl.

  1. Clone the Privacy Pioneer repo and make any changes that you'd like to see.

    Note: If you are making your own version of the crawler, then you will need to remember to enable "crawl mode" within the extension source code. The instructions for doing that can be found in the comments located here. The gist is that you will need to set the IS_CRAWLING flag to true. If you are testing changes to the crawler, you will also need to set the IS_CRAWLING_TESTING flag to true. This is necessary so that the functionality for setting the location data and recording crawl data is enabled.

  2. Once the changes have been made, run npm run build from within the privacy-pioneer directory.

  3. Navigate to the newly made dev directory.

  4. In the manifest.json file, add the following code at the bottom (within the JSON). Firefox will not let you add an extension without this ID.

    "browser_specific_settings": {
        "gecko": {
          "id": "{daf44bf7-a45e-4450-979c-91cf07434c3d}"
        }
      }
  5. Within the dev directory, compress all of the files into a zip file.

  6. Rename the file extension from .zip to .xpi. Functionally, these files will behave the same. This is the format that Firefox uses to load an extension.

  7. Place this new file into the selenium-crawler directory, and modify the crawler accordingly. Make sure that the aforementioned local-crawler.js file is looking for the correct extension, i.e.,

.addExtensions("ext.xpi");

is pointing to the right XPI file.
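For orientation, the snippet below sketches how a packaged .xpi is typically attached to a Selenium-driven Firefox Nightly session with the selenium-webdriver package. It is illustrative only; local-crawler.js already contains the real setup, so you only need to make sure its .addExtensions() call points at your new file.

// Illustrative sketch of loading a packaged extension into Firefox Nightly
// with selenium-webdriver; the real setup lives in local-crawler.js.
const { Builder } = require("selenium-webdriver");
const firefox = require("selenium-webdriver/firefox");

async function launchWithExtension() {
  const options = new firefox.Options()
    .setBinary(firefox.Channel.NIGHTLY) // use the Firefox Nightly install
    .addExtensions("ext.xpi");          // the repackaged Privacy Pioneer build

  const driver = await new Builder()
    .forBrowser("firefox")
    .setFirefoxOptions(options)
    .build();

  await driver.get("https://example.com"); // visit a site so the extension can run
  // ...crawl logic would go here...
  await driver.quit();
}

launchWithExtension().catch(console.error);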

5. Known Issues

Coordinates and Zip Codes under VPNs

The motivation to use Google Cloud was primarily fueled by this issue. As described in the Privacy Pioneer repo, the extension is meant to find evidence of location data being collected. However, when using a VPN (or any service without a static IP), it becomes nearly impossible for Privacy Pioneer to find evidence of GPS location and/or zip code. This is due to how Privacy Pioneer determines the user's location, so there will almost certainly be a discrepancy between where Privacy Pioneer thinks the user is and where a website thinks the user is. Since these features are built into the extension, it would be difficult to make Privacy Pioneer work with a VPN crawl without significant changes to the architecture. Thus, we have opted to hard-code the latitude, longitude, and zip code for our crawls. For instructions on how to do this, look here.

Cloud Computing Power

We've had issues with Selenium working properly on relatively weak virtual machines. We recommend using the n2-standard-4 preset in Google Cloud.

Connecting to Cloud VMs

Currently, the only way to actually see the GUI is through the Remote Desktop Connection app on Windows.

Starting the Crawl

If the crawler fails to start, simply run it again from privacy-pioneer-web-crawler/selenium-crawler. Firefox Nightly is updated often, and an update can cause the program to crash on the first bootup.
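Re-running the crawler manually is usually all it takes. If you would rather have the process retry on its own, a small wrapper like the hypothetical one below (saved in the selenium-crawler directory) would work; it is not part of this repository.

// Hypothetical retry wrapper; not part of this repository.
// Re-runs the crawler start-up a few times before giving up.
const { execSync } = require("child_process");

const MAX_ATTEMPTS = 3;

for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
  try {
    console.log(`Starting crawler (attempt ${attempt} of ${MAX_ATTEMPTS})...`);
    execSync("node local-crawler.js", { stdio: "inherit" });
    break; // crawler exited normally
  } catch (err) {
    console.error(`Crawler exited with an error on attempt ${attempt}.`);
    if (attempt === MAX_ATTEMPTS) throw err;
  }
}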

Other Issues

If you encounter an issue that hasn't been described, try to identify whether the issue is coming from Selenium. To do so, check for error messages in the terminal running in selenium-crawler. Make sure that you're connected to the internet, that both programs are running, and that the crawl is behaving as described above.

6. Thank You

We would like to thank our financial supporters!


Major financial support provided by Google.


Additional financial support provided by Wesleyan University and the Anil Fernando Endowment.


Conclusions reached or positions taken are our own and not necessarily those of our financial supporters, their trustees, officers, or staff.
