🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Collect and revisit web pages.
Run a high-fidelity browser-based crawler in a single Docker container
Streaming WARC/ARC library for fast web archive IO
Serverless replay of web archives directly in the browser
Bitextor generates translation memories from multilingual websites
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
News crawling with StormCrawler - stores content as WARC
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Parse and create Web ARChive (WARC) files with Node.js
A small tool that uses the Common Crawl URL index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
Chrome extension to "Create WARC files from any webpage"