Analyzing the evolution of ChatGPT's codebase through time with curated archives and scripts

0xdevalias/chatgpt-source-watch

ChatGPT Source Watch

Analyzing the evolution of ChatGPT's codebase through time with curated archives and scripts.

Or, to put it more poetically, in the eloquent words of ChatGPT itself:

ChatGPT Source Watch is a meticulously curated repository that serves as a treasure trove for those interested in observing the evolution of ChatGPT's webpack chunks. It gracefully preserves the historical webpack chunks in their original splendor, while also offering a breath of fresh air with unpacked and beautifully formatted versions of the chunk files. This thoughtful touch empowers you to effortlessly analyze the nuances between different builds.

But there's more - it's not just about the chunks. The repository is adorned with a detailed changelog that tracks the symphony of changes over time, and is equipped with a suite of automation scripts that act as your personal concierge.

ChatGPT Source Watch stands as a beacon of transparency and a portal to discovery.

tl;dr

If you're looking for a concise summary of the changes and updates in ChatGPT's codebase over time, the CHANGELOG.md is probably what you need. It's a comprehensive record of changes made in each build version and serves as a quick reference.

For those interested in a more detailed analysis or diving into the code, the repository also contains the original webpack chunks and unpacked, formatted versions of the chunk files.

If the particular build or chunk version you're looking for isn't archived in this repo, you can also check the Wayback Machine to see whether it has captured it.
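
One way to check for a capture is the Wayback Machine's availability endpoint. The following is a minimal sketch that only builds the query URL, assuming the public https://archive.org/wayback/available endpoint; `waybackAvailabilityUrl` and the chunk URL are hypothetical and not part of this repository:

```javascript
// Build a query URL for the Wayback Machine availability endpoint.
// `waybackAvailabilityUrl` is a hypothetical helper for illustration only.
function waybackAvailabilityUrl(targetUrl, timestamp) {
  const query = new URL("https://archive.org/wayback/available");
  query.searchParams.set("url", targetUrl);
  // An optional YYYYMMDD timestamp asks for the capture closest to that date.
  if (timestamp) query.searchParams.set("timestamp", timestamp);
  return query.toString();
}

// Hypothetical chunk URL; substitute the one you are looking for.
const lookup = waybackAvailabilityUrl(
  "https://chat.openai.com/_next/static/chunks/main-0123456789abcdef.js",
  "20230601"
);
console.log(lookup);
```

Opening the printed URL in a browser (or fetching it) should return a small JSON document describing the closest capture, if one exists.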

Announcement

A couple of the places we first announced this project:

Feel free to join in on the discussions or share your own thoughts and experiences with the repository. We value your feedback and contributions!

Repository Structure

  • CHANGELOG.md: A record of changes made in each build version.
  • scripts/: Helper scripts to streamline various tasks in the repository.
  • orig/: This directory contains the raw unmodified webpack chunks from each build, saved for historical reference.
  • unpacked/: This directory contains the unpacked, Biome-formatted versions of the chunk files, for easier diffing and analysis.

Helper Scripts

The scripts/ directory contains helper scripts that streamline common tasks in the repository:

  • filter-for-unsaved.js: A Node.js script that takes input URLs from stdin and outputs URLs of webpack chunks that are not already saved in the local orig directory. Ensures no duplicates in the output. Useful for fetching new chunks.
  • buildmanifest-to-json.js: A Node.js script that converts a build manifest file to JSON. Its output can then be processed to extract static asset URLs and prefix them with https://chat.openai.com/_next/, ready for fetching; the Getting Started examples use the --extract-urls flag for this.
  • unpack-files-from-orig.js: A Node.js script that processes input file paths from stdin, copies the corresponding files from the orig/ directory to an unpacked/ directory, and normalizes directory names and file names by removing hashes. It then runs biome format on all the files in the unpacked/ directory for formatting. Useful for preparing files for easier diffing between builds.
  • filter-urls-not-in-changelog.js: A Node.js script that filters input URLs from stdin and outputs only those URLs that are not already present in the CHANGELOG.md. Useful for identifying new URLs that have not been logged.

Please carefully read and comprehend the contents of each script, as detailed documentation is not provided. Understanding how the scripts operate is essential before executing them.

Getting Started

Clone this repository:

git clone https://github.com/0xdevalias/chatgpt-source-watch.git
cd chatgpt-source-watch

Start by obtaining a list of webpack chunks including the _buildManifest.js of a new build. For example, you can extract webpack chunk files and the build manifest from a webpage using Chrome DevTools and a CSS selector:

  1. Open Chrome and navigate to the target webpage.
  2. Press Ctrl + Shift + I (Cmd + Option + I on macOS) to open Chrome DevTools.
  3. Go to the Console tab in DevTools.
  4. Run the following JavaScript snippet to extract URLs from script tags inside the head element:
const scriptTags = document.querySelectorAll('html > head > script');
const urls = Array.from(scriptTags).map(tag => tag.src).filter(Boolean);
console.log(urls);
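
The console output includes every script in the head element, not only webpack chunks. A minimal sketch of narrowing the list, assuming chunk and manifest URLs contain /_next/static/ (an assumption based on the _next/ URL prefix this README mentions); `selectNextStaticUrls` and the sample URLs are hypothetical:

```javascript
// Keep only unique URLs that look like Next.js static assets.
// The "/_next/static/" marker is an assumption, not a documented rule.
function selectNextStaticUrls(urls) {
  const matching = urls.filter((url) => url.includes("/_next/static/"));
  return Array.from(new Set(matching));
}

// Hypothetical sample of what the DevTools snippet might print.
const sample = [
  "https://chat.openai.com/_next/static/chunks/webpack-aaaa.js",
  "https://example.com/analytics.js",
  "https://chat.openai.com/_next/static/chunks/webpack-aaaa.js",
  "https://chat.openai.com/_next/static/BUILD_ID/_buildManifest.js",
];

// One URL per line is convenient for piping into the helper scripts.
console.log(selectNextStaticUrls(sample).join("\n"));
```

In the DevTools console, the same function can be applied to the `urls` array from the snippet above, and Chrome's `copy()` console utility can place the joined result on the clipboard.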

Use the filter-for-unsaved.js script to filter this list of chunk URLs to output only the URLs of webpack chunks not already saved in the local orig/ directory:

echo "<input_urls>" | ./scripts/filter-for-unsaved.js
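
The filtering idea can be sketched roughly as follows. This is not the actual implementation of filter-for-unsaved.js; it assumes saved files can be identified by their URL path, and `filterUnsaved` and the sample data are hypothetical:

```javascript
// Return the unique URLs whose path is not in the set of already-saved paths.
// Hypothetical sketch; the real script reads stdin and inspects orig/ on disk.
function filterUnsaved(urls, savedPaths) {
  const seen = new Set();
  const result = [];
  for (const url of urls) {
    const path = new URL(url).pathname; // e.g. "/_next/static/chunks/a-1.js"
    if (seen.has(url) || savedPaths.has(path)) continue;
    seen.add(url);
    result.push(url);
  }
  return result;
}

// Hypothetical data: one chunk already saved, one new chunk listed twice.
const saved = new Set(["/_next/static/chunks/main-1111.js"]);
const incoming = [
  "https://chat.openai.com/_next/static/chunks/main-1111.js",
  "https://chat.openai.com/_next/static/chunks/pages/index-2222.js",
  "https://chat.openai.com/_next/static/chunks/pages/index-2222.js",
];

console.log(filterUnsaved(incoming, saved));
```

Tracking seen URLs in a Set keeps the output free of duplicates while preserving input order.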

Manually download the chunk files from the URLs output by the filter-for-unsaved.js script. Save them in the orig/ directory, making sure to match the original file structure and filenames (including hashes). As a sanity check, re-run filter-for-unsaved.js and confirm that none of the chunks you just saved are still listed.
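
To keep the file structure consistent when saving by hand, it can help to derive the local path from the URL. A minimal sketch, assuming orig/ mirrors the URL path exactly (check the existing orig/ layout before relying on this); `urlToOrigPath` is a hypothetical helper:

```javascript
// Map a chunk URL to the local path it would be saved under.
// Assumes orig/ mirrors the URL path; verify against the existing layout.
function urlToOrigPath(chunkUrl, origDir = "orig") {
  const { pathname } = new URL(chunkUrl); // always starts with "/"
  return origDir + pathname;
}

// Hypothetical chunk URL for illustration.
console.log(
  urlToOrigPath("https://chat.openai.com/_next/static/chunks/main-1111.js")
);
```

Because the hash stays in the file name, two builds of the same chunk never overwrite each other.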

Use the buildmanifest-to-json.js script to turn the build manifest into JSON and fetch build URLs. The script accepts the build hash or a full URL to a _buildManifest.js file as an argument:

# Run the script with a build hash
./scripts/buildmanifest-to-json.js <build-hash> --extract-urls

# Or run the script with a full URL to a _buildManifest.js file
./scripts/buildmanifest-to-json.js <url-to-buildmanifest> --extract-urls
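
Conceptually, the conversion and URL extraction might look like the sketch below. It is not the script's actual code: it assumes the usual Next.js manifest shape (a file that assigns to `self.__BUILD_MANIFEST`), and `parseBuildManifest`, `extractUrls`, and the shortened manifest are hypothetical:

```javascript
// Evaluate a _buildManifest.js source string and return the manifest object.
// Assumes the file assigns to self.__BUILD_MANIFEST (typical for Next.js).
// new Function executes the source, so only use it on files you trust.
function parseBuildManifest(source) {
  const fakeSelf = {};
  new Function("self", source)(fakeSelf);
  return fakeSelf.__BUILD_MANIFEST;
}

// Collect unique asset paths and prefix them with the _next base URL.
function extractUrls(manifest, base = "https://chat.openai.com/_next/") {
  const paths = Object.values(manifest)
    .filter(Array.isArray)
    .flat()
    .filter((p) => typeof p === "string" && p.startsWith("static/"));
  return [...new Set(paths)].map((p) => base + p);
}

// Hypothetical, heavily shortened manifest for illustration.
const source =
  'self.__BUILD_MANIFEST={"/":["static/chunks/pages/index-2222.js"],' +
  '"/_error":["static/chunks/pages/_error-3333.js"],' +
  'sortedPages:["/","/_error"]},' +
  "self.__BUILD_MANIFEST_CB&&self.__BUILD_MANIFEST_CB();";

const manifest = parseBuildManifest(source);
console.log(JSON.stringify(manifest, null, 2));
console.log(extractUrls(manifest));
```

Filtering on the `static/` prefix drops entries such as `sortedPages`, which list routes rather than asset paths.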

Use the unpack-files-from-orig.js script to unpack, normalize, and format files from the orig/ directory. Pass the list of URLs (or file paths) that need to be unpacked as input:

echo "<list_of_urls_or_file_paths>" | ./scripts/unpack-files-from-orig.js
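
The name normalization can be illustrated with a small sketch. The hash pattern here (a trailing dash followed by 16 or more hex digits) is an assumption about how chunk file names look, and `stripHash` is a hypothetical helper, not the logic used by unpack-files-from-orig.js:

```javascript
// Remove a trailing "-<hex hash>" from the file name, keeping the extension.
// The 16+ hex digit pattern is an assumption about how chunk hashes look.
function stripHash(filePath) {
  return filePath.replace(/-[0-9a-f]{16,}(?=\.[a-z]+$)/, "");
}

// Hypothetical paths for illustration.
console.log(stripHash("_next/static/chunks/main-0123456789abcdef.js"));
console.log(stripHash("_next/static/chunks/pages/index-fedcba9876543210.js"));
console.log(stripHash("_next/static/css/plain.css")); // no hash, unchanged
```

With hashes removed, the same logical chunk lands at the same path in every build, which is what makes diffs between builds meaningful.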

It's recommended to run biome format more than once, since a single pass may not leave every file in its final form. Repeat the following command until no files change:

npx biome format --write unpacked/
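
Re-running until nothing changes is a fixed-point loop. The sketch below shows the idea on a plain string; `formatUntilStable` is hypothetical, and `collapseOnce` is a toy stand-in for a formatting pass, not Biome's API:

```javascript
// Repeat a formatting pass until the output stops changing (a fixed point),
// with a cap on passes in case the formatter never settles.
function formatUntilStable(text, formatOnce, maxPasses = 10) {
  let current = text;
  for (let pass = 1; pass <= maxPasses; pass++) {
    const next = formatOnce(current);
    if (next === current) return { text: current, passes: pass };
    current = next;
  }
  throw new Error(`Output still changing after ${maxPasses} passes`);
}

// Toy stand-in for a formatter: collapses one doubled space per pass.
const collapseOnce = (s) => s.replace("  ", " ");

console.log(formatUntilStable("a    b", collapseOnce));
```

The pass limit is a safeguard: if the output is still changing after the cap, something other than incomplete formatting is probably going on.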

Manually update the CHANGELOG.md file with the information about the new build and changes observed. Ensure that you follow the existing format and include relevant details such as the build version, date, and a summary of the changes.

Finally, we can use the filter-urls-not-in-changelog.js script as a sanity check to ensure that all the URLs in our list are properly captured in the CHANGELOG.md file:

echo "<input_urls>" | ./scripts/filter-urls-not-in-changelog.js
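
The check itself amounts to a substring search over the changelog. A minimal sketch, assuming URLs appear verbatim in CHANGELOG.md; `urlsNotInChangelog` and the sample changelog text are hypothetical, and the real script may match differently:

```javascript
// Return the unique URLs that do not appear anywhere in the changelog text.
// Plain substring matching: a URL that is a prefix of a logged URL counts
// as present, which is a known limitation of this sketch.
function urlsNotInChangelog(urls, changelogText) {
  return [...new Set(urls)].filter((url) => !changelogText.includes(url));
}

// Hypothetical changelog excerpt and URL list.
const changelogText = [
  "## Build A (hypothetical entry)",
  "- https://chat.openai.com/_next/static/chunks/main-1111.js",
].join("\n");

const candidateUrls = [
  "https://chat.openai.com/_next/static/chunks/main-1111.js",
  "https://chat.openai.com/_next/static/chunks/pages/index-2222.js",
];

console.log(urlsNotInChangelog(candidateUrls, changelogText));
```

An empty result means every URL in the list is already recorded.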

After using the helper scripts, you can compare different builds by navigating to the unpacked/ directory and using tools like git diff:

cd unpacked
git diff <build_hash_1>..<build_hash_2> -- <filename>

Related Research and Notes

For additional context on the underlying concepts and techniques, the following resources may be useful:

These resources offer useful insights and should be used responsibly, in line with legal and ethical considerations.

License

This project is subject to multiple licenses with specific exceptions. For details, please refer to the LICENSE.md file.

Responsible Usage

This repository is provided for educational and research purposes. Ensure that you use the content, especially from the orig/ and unpacked/ directories, in a lawful and ethical manner. We are not responsible for any unauthorized or unlawful use of the materials contained in this repository.

OpenAI operates a Bug Bounty Program through Bugcrowd, aimed at enhancing the security of its services via responsible vulnerability disclosures. For detailed information and participation, please visit the official program page on Bugcrowd. You can also read the announcement blog post for an overview of the program.
