#

warc

Here are 100 public repositories matching this topic...

ArchiveBox

ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Updated May 18, 2024
Python

internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

java warc heritrix webcrawling

Updated May 15, 2024
Java

ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

crawler spider archiving crawl warc

Updated Mar 22, 2024
Python

conifer

Rhizome-Conifer / conifer

Collect and revisit web pages.

python docker archives warc web-archiving wayback webrecorder pywb

Updated Nov 8, 2023
Python

webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

crawler web-crawler crawling warc web-archiving webrecorder wacz

Updated May 18, 2024
TypeScript

webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO

python warc web-archiving web-archives pywb

Updated May 15, 2024
Python

webrecorder / replayweb.page

Serverless replay of web archives directly in the browser

service-worker warc web-archiving wayback-machine web-archive replay-web-page web-replay wacz

Updated May 2, 2024
TypeScript

bitextor

bitextor / bitextor

Bitextor generates translation memories from multilingual websites

Updated Sep 14, 2023
Python

ipwb

oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

python docker service-worker ipfs memento warc web-archiving wayback memento-rfc

Updated Apr 24, 2024
Python

webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

electron warc web-archiving webrecorder pywb

Updated Sep 17, 2020
JavaScript

wail

machawk1 / wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

python gui warc web-archiving pyinstaller wayback heritrix openwayback

Updated May 16, 2024
Roff

commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC

crawler news web-crawler apache-storm warc commoncrawl common-crawl storm-crawler

Updated Dec 13, 2023
Java

cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

python warc web-archiving cdx web-archives commoncrawl cdx-api

Updated Feb 17, 2024
Python

webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

kubernetes cloud archiving warc web-archiving webrecorder web-archive wacz

Updated May 16, 2024
TypeScript

cocrawler / cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

screenshot crawler concurrency async-python python3 aiohttp warc aiohttp-client pluggable-modules

Updated Apr 29, 2022
Python

N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js

warc web-archiving webarchive web-archives webarchiving warc-files chrome-remote-interface pupeteer

Updated Jan 3, 2023
JavaScript

centic9 / CommonCrawlDocumentDownload

A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika

java mime-types warc cdx-files commoncrawl

Updated Apr 21, 2024
Java

helgeho / ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

spark internet-archive warc web-archiving webarchive archivespark spark-framework

Updated Apr 4, 2024
Scala

ArchiveTeam / wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

crawler scraper downloader spider lua ftp scraping crawling archiving wget crawl zstd crawlers warc webarchiving archiveteam wget-lua

Updated Jan 29, 2024
C

warcreate

machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"

chrome-extension warc web-archiving

Updated Dec 6, 2023
JavaScript

Improve this page

Add a description, image, and links to the warc topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the warc topic, visit your repo's landing page and select "manage topics."