
OptMeowt logo

GPC Web Crawler

GPC Web Crawler code. The GPC Web Crawler is developed and maintained by the OptMeowt team.

1. Selenium OptMeowt Crawler
2. Development
3. Architecture
4. Limitations/Known Issues
5. Other Resources
6. Thank You!

1. Selenium OptMeowt Crawler

The Selenium OptMeowt Crawler is a crawler implemented using Selenium that analyzes sites using the OptMeowt analysis extension. More details about analysis functionality can be found here.

2. Development

  1. Clone this repo locally or download a zipped copy and unzip it.

  2. Set up the local SQL database by following the instructions in the wiki.

  3. Then, run the Rest API by following the instructions in the wiki.

  4. With the Rest API running, open a new terminal and navigate to the root directory of selenium-optmeowt-crawler by running:

cd selenium-optmeowt-crawler
  5. Open sites.csv and enter the links you want to analyze in the first column. (Some examples are included in the file.)

  6. Ensure Firefox Nightly is installed on your computer using this link. (If you receive the following error when completing step 8, update Firefox Nightly to the latest version: WebDriverError: Process unexpectedly closed with status 0.)

  7. Install the dependencies by running:

npm install
  8. To start the crawler, run:
node local-crawler.js
  9. To check the analysis results, open a browser and navigate to http://localhost:8080/analysis.

  10. If you modify the analysis extension, you should test it to make sure it still works properly. Some guidelines can be found in the wiki.

3. Architecture

crawler-architecture

Components:

  • Crawler Script:

    The flow of the crawler script is described in the diagram below.

analysis-flow

This script is stored and executed locally. The crawler also keeps a log of sites that cause errors. It stores these logs in a file called error-logging.json and updates this file after each error.
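A minimal sketch of what this logging could look like, assuming a simple site-keyed JSON object (the actual format of error-logging.json may differ):

const fs = require("fs");

// Hypothetical helper: record one error and rewrite error-logging.json
// after each error, as described above.
function logError(site, errorType, message) {
  const path = "./error-logging.json";
  const log = fs.existsSync(path)
    ? JSON.parse(fs.readFileSync(path, "utf8"))
    : {};
  log[site] = { errorType, message };
  fs.writeFileSync(path, JSON.stringify(log, null, 2));
}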

Types of Errors that may be logged:

  1. TimeoutError: A Selenium error thrown when either the page has not loaded or the page has not responded within 30 seconds. Timeouts are set in driver.setTimeouts (a sketch follows this list).
  2. HumanCheckError: A custom error that is thrown when the site has a title that we have observed means our VPN IP address is blocked or there is a human check on that site. See Limitations/Known Issues for more details.
  3. InsecureCertificateError: A Selenium error that indicates that the site will not be loaded, as it has an insecure certificate.
  4. WebDriverError: A Selenium error that indicates that the WebDriver has failed to execute some part of the script.
  5. WebDriverError: Reached Error Page: This indicates that an error page has been reached when Selenium tried to load the site.
  6. UnexpectedAlertOpenError: This indicates that a popup on the site disrupted Selenium's ability to analyze the site (such as a mandatory login).
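For reference, a minimal sketch of how the 30-second limits from item 1 could be set with selenium-webdriver (the values come from the text above; this is not necessarily the crawler's exact configuration):

const { Builder } = require("selenium-webdriver");

async function buildDriver() {
  const driver = await new Builder().forBrowser("firefox").build();
  // pageLoad: fail if the page has not loaded within 30 seconds;
  // script: fail if an injected script has not responded within 30 seconds.
  await driver.manage().setTimeouts({ pageLoad: 30000, script: 30000 });
  return driver;
}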
  • OptMeowt Analysis Extension:

    The OptMeowt Analysis extension is packaged as an .xpi file and installed on a Firefox Nightly browser by the crawler script. When a site loads, the OptMeowt Analysis extension automatically analyzes the site and sends the analysis data to the SQL database via a POST request. The analysis performed by the OptMeowt Analysis extension investigates the GPC compliance of a given site using a 4-step approach:

    1. The extension checks whether the site is subject to the CCPA by looking at Firefox's urlClassification object. Requests returned by this object are based on the Disconnect list, as described here.
    2. The extension checks the values of the US Privacy string, the GPP string, and OneTrust’s OptanonConsent, OneTrustWPCCPAGoogleOptOut, and OTGPPConsent cookies, if any of these exist.
    3. The extension sends a GPC signal to the site.
    4. The extension rechecks the value of the US Privacy string, OneTrust cookies, and GPP string.

    The information collected during this process is used to determine whether the site respects GPC. Note that legal obligations to respect GPC differ by geographic location. In order for a site to be GPC compliant, the following statements should be true after the GPC signal was sent, for each string or cookie that the site implements (a sketch follows this list):

    1. the third character of the US Privacy string is a Y
    2. the value of the OptanonConsent cookie is isGpcEnabled=1
    3. the opt out columns in the GPP string's relevant US section (i.e. SaleOptOut, TargetedAdvertisingOptOut, SharingOptOut) have a value of 1. Note that the columns and opt out requirements vary by state.
    4. the value of the OneTrustWPCCPAGoogleOptOut cookie is true
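    For illustration, a minimal sketch of these checks (not the extension's actual code; it assumes the values were already collected, and it only checks the strings/cookies the site implements, with the GPP opt out columns passed as an array):

function respectsGpc({ uspString, optanonConsent, gppOptOutColumns, oneTrustGoogleOptOut }) {
  const checks = [];
  if (uspString) checks.push(uspString.charAt(2) === "Y"); // third character, e.g. "1YYN"
  if (optanonConsent) checks.push(optanonConsent.includes("isGpcEnabled=1"));
  if (gppOptOutColumns) checks.push(gppOptOutColumns.every((v) => v === 1)); // e.g. SaleOptOut
  if (oneTrustGoogleOptOut) checks.push(oneTrustGoogleOptOut === "true"); // OneTrustWPCCPAGoogleOptOut
  // Compliant only if every string/cookie the site implements signals an opt out.
  return checks.length > 0 && checks.every(Boolean);
}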
  • Node.js Rest API:

    We use the Rest API to make GET, PUT, and POST requests to the SQL database. The Rest API is also local and is run in a separate terminal from the crawler.
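    For example, a hypothetical POST of one analysis entry to the local Rest API could look like the following (the endpoint path and payload shape are assumptions, not the documented API):

fetch("http://localhost:8080/analysis", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ domain: "example.com", sent_gpc: 1 }),
}).then((res) => console.log(res.status));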

  • SQL Database:

    The SQL database is a local database that stores analysis data. Instructions to set up an SQL database can be found in the wiki. The columns of our database tables are below:

    id, site_id, domain, sent_gpc, uspapi_before_gpc, uspapi_after_gpc, usp_cookies_before_gpc, usp_cookies_after_gpc, OptanonConsent_before_gpc, OptanonConsent_after_gpc, gpp_before_gpc, gpp_after_gpc, urlClassification, OneTrustWPCCPAGoogleOptOut_before_gpc, OneTrustWPCCPAGoogleOptOut_after_gpc, OTGPPConsent_before_gpc, OTGPPConsent_after_gpc

    The first few columns primarily pertain to identifying the site and verifying that the OptMeowt Analysis extension is working properly.

    • id: autoincrement primary key to identify the database entry
    • site_id: the id of the domain in the csv file that lists the sites to crawl. This is used for processing purposes (i.e. to identify domains that redirect to another domain) and is set by the crawler script.
    • domain: the domain name of the site
    • sent_gpc: a binary indicator of whether the OptMeowt Analysis extension sent a GPC opt out signal to the site

    The remaining columns pertain to the opt out status of a user, which is indicated by the value of the US Privacy String, OptanonConsent cookie, and GPP string. The US Privacy String can be implemented on a site via (1) the client-side JavaScript USPAPI, which returns the US Privacy String value when called, or (2) an HTTP cookie that stores its value. The OptMeowt Analysis extension checks each site for both implementations of the US Privacy String by calling the USPAPI and checking all cookies. The GPP string's value is obtained via the CMPAPI for GPP. (A sketch of these client-side calls follows the column list.)

    • uspapi_before_gpc: return value of calling the USPAPI before a GPC opt out signal was sent
    • uspapi_after_gpc: return value of calling the USPAPI after a GPC opt out signal was sent
    • usp_cookies_before_gpc: the value of the US Privacy String in an HTTP cookie before a GPC opt out signal was sent
    • usp_cookies_after_gpc: the value of the US Privacy String in an HTTP cookie after a GPC opt out signal was sent
    • OptanonConsent_before_gpc: the isGpcEnabled string from OneTrust’s OptanonConsent cookie before a GPC opt out signal was sent. The user is opted out if isGpcEnabled=1, and the user is not opted out if isGpcEnabled=0. If the cookie is present but does not have an isGpcEnabled string, we return “no_gpc”.
    • OptanonConsent_after_gpc: the isGpcEnabled string from OneTrust’s OptanonConsent cookie after a GPC opt out signal was sent. The user is opted out if isGpcEnabled=1, and the user is not opted out if isGpcEnabled=0. If the cookie is present but does not have an isGpcEnabled string, we return “no_gpc”.
    • gpp_before_gpc: the value of the GPP string before a GPC opt out signal was sent
    • gpp_after_gpc: the value of the GPP string after a GPC opt out signal was sent
    • urlClassification: the return value of Firefox's urlClassification object, sorted by category and filtered for the following categories: fingerprinting, tracking_ad, tracking_social, any_basic_tracking, any_social_tracking.
    • OneTrustWPCCPAGoogleOptOut_before_gpc: the value of the OneTrustWPCCPAGoogleOptOut cookie before a GPC signal was sent. This cookie is described by OneTrust here.
    • OneTrustWPCCPAGoogleOptOut_after_gpc: the value of the OneTrustWPCCPAGoogleOptOut cookie after a GPC signal was sent. This cookie is described by OneTrust here.
    • OTGPPConsent_before_gpc: the value of the OTGPPConsent cookie before a GPC signal was sent. This cookie is described by OneTrust here.
    • OTGPPConsent_after_gpc: the value of the OTGPPConsent cookie after a GPC signal was sent. This cookie is described by OneTrust here.
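    As referenced above, a minimal sketch of the two client-side calls used to read these values, the IAB USPAPI and the CMP API for GPP (whether these functions exist on a page depends on the site's CMP; this is not the extension's actual code):

// Read the US Privacy String via the USPAPI ("getUSPData" command).
window.__uspapi("getUSPData", 1, (uspData, success) => {
  if (success) console.log("US Privacy String:", uspData.uspString); // e.g. "1YNN"
});

// Read the GPP string via the GPP CMP API ("ping" command in GPP 1.1).
window.__gpp("ping", (pingData, success) => {
  if (success) console.log("GPP string:", pingData.gppString);
});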

4. Limitations/Known Issues

Since we are using Selenium and a VPN to visit the sites we analyze, there are some limits on which sites we can analyze. There are two main types of sites that we cannot analyze due to our methodology:

  1. Sites where the VPN’s IP address is blocked.

Instead of the real site, a page titled “Access Denied” loads, stating that we don’t have permission to access the site on this server.

  2. Sites that have some kind of human check.

Some sites can detect that we are using automation tools (i.e. Selenium) and do not let us access the real site. Instead, we’re redirected to a page with some kind of captcha or puzzle. We do not try to bypass any human checks.

Since the data collected from both of these types of sites will be incorrect, we list them under HumanCheckError in error-logging.json. We have observed a few different site titles that indicate we have reached a site in one of these categories. Most of the titles occur for multiple sites, with the most common being “Just a Moment…” on a captcha from Cloudflare. We detect when our crawler visits one of these sites by matching the title of the loaded site against a set of regular expressions that match the known titles. Naturally, we will miss some sites in this category if we have not yet seen their titles and added them to the set of regular expressions. We update the regular expressions as we encounter more such sites. For more information, see issue #51.
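A hypothetical sketch of this title matching (the crawler's actual set of regular expressions is larger and grows as we observe new titles):

// Titles observed on blocked or human-check pages.
const humanCheckTitles = [/just a moment/i, /access denied/i];

function isHumanCheck(pageTitle) {
  return humanCheckTitles.some((re) => re.test(pageTitle));
}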

  3. Sites that block script injection.

For instance, flickr.com blocks script injection and will not be analyzed successfully. In the debugging table, on the first attempt the last message will be runAnalysis-fetching, and on the second attempt the extension logs SQL POSTING: SOMETHING WENT WRONG.

  4. Sites that redirect between multiple domains throughout analysis.

For instance, https://spothero.com/ and https://parkingpanda.com/ are now one entity but can still use both domains. In the debugging table, you will see multiple debugging entries under each domain. Because we store analysis data by domain, the data will be incomplete and will not be added to the database.

5. Other Resources

  • Python Library for GPP String Decoding:

    GPP strings must be decoded. The IAB provides a JavaScript library here and an interactive HTML decoder to do this. To integrate decoding with our Colab notebooks, we rewrote the library in Python. The library can be found here.

  • .well-known/gpc.json Python Script:

We collect .well-known/gpc.json data after the whole crawl finishes, using a separate Python script. Start the script with python3 well-known-collection.py. This script should be run using a California VPN after all eight crawl batches are completed. Running it requires 3 input files: full-crawl-set.csv, which is in the repo, as well as redo-original-sites.csv and redo-sites.csv, which are not in the repo and should be created for that crawl using step 5. As explained in well-known-collection.py, the output is a csv called well-known-data.csv with 3 columns (Site URL, request status, json data) as well as an error json file called well-known-errors.json that logs all errors. To run this script on a csv file of sites without accounting for redo sites, comment out all lines between line 27 and line 40 except for line 34.

Purpose of the well-known Python script: analyze the full crawl set with the redo sites replaced

  • uses the full set of sites and the sites that we redid (replacing the original sites with the redo domains)

Output:

  1. If successful, a csv with 3 columns: Site URL, request status, json data
  2. If not, an error json file that logs all errors (printing the reason and 500 characters of the request text). Example of an error:
    • "Expecting value: line 1 column 1 (char 0)": the status was 200 (the site exists and loaded) but no json was found
    • Reason: some sites send all incorrect links to a generic error page instead of not serving the page

Code rundown:

  1. First, the script reads in the full site set, the redo original sites, and the redo sites.
  • sites_df.index(redo_original_sites[idx]): gets the index of the site we want to change
  • sites_list[x] = redo_new_sites[idx]: replaces the site with the new site
  2. r = requests.get(sites_df[site_idx] + '/.well-known/gpc.json', timeout=35): the request runs with a timeout of 35 seconds (to stay consistent with crawler timeouts). Then: (i) if json data is present, all 3 columns (site, status, and json data) are logged; (ii) if there is no json data, only the site and status are logged; (iii) if r.json() does not parse, the "Expecting value: line 1 column 1 (char 0)" error described above is logged along with the site and status; (iv) if the request does not finish within 35 seconds, the error is stored and only the site is logged.

Important code documentation:

  • "file1.write(sites_df[site_idx] + "," + str(r.status_code) + ',"' + str(r.json()) + '"\n')" : writing data to a file with 3 columns (site, status and json data)
  • "errors[sites_df[site_idx] = str(e)" -> store errors with original links
  • "with open("well-known-errors.json", "w") as outfile: json.dump(errors, outfile)" -> convert and write JSON object as containing errors to file

6. Thank You!

We would like to thank our financial supporters!


Major financial support provided by the National Science Foundation.

National Science Foundation Logo

Additional financial support provided by the Alfred P. Sloan Foundation, Wesleyan University, and the Anil Fernando Endowment.

Sloan Foundation Logo Wesleyan University Logo

Conclusions reached or positions taken are our own and not necessarily those of our financial supporters, their trustees, officers, or staff.

privacy-tech-lab logo
