
Releases: ArchiveBox/ArchiveBox

v0.8.0-rc: New REST API ✨, Django 5.0, S3/B2/SMB/NFS remote storage support, VNC viewer, and more

27 Mar 00:03

WIP pre-release for the upcoming ArchiveBox v0.8.0 release.

Warning

This is an unfinished pre-release. We're promoting it a little earlier than usual because it contains ✨ lots of big new features ✨ and we want brave early adopters to help us test it! If that sounds like you, make sure to back up your archive first, then let us know if you find bugs by opening a new issue!

Try this release early using docker or pip:

# with docker (pre-built)
docker pull archivebox/archivebox:dev
# with docker (built from source)
docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
# with pip (built from source)
pip install 'git+https://github.com/ArchiveBox/ArchiveBox@dev'

  • New ArchiveBox REST API
  • ArchiveBox Admin Webhooks UI
  • ArchiveBox Configuration Admin UI
  • S3/B2/SMB/NFS/GDrive Remote Storage Setup

Highlights

  • add gitea and other domains to default GIT_DOMAINS list to run git archiving on
  • check /, /data, and /data/archive in Docker and warn if running low on disk space
  • Add COOKIES_FILE support for singlefile extractor by @naoph in #1372
  • Use COOKIES_FILE to fetch page titles by @benmuth in #1364
  • Fallback to not chown'ing ./data/archive dir if it's a network mount that prevents ownership changes by @gnattu in #1312
  • Show the upgrade notification only in specific views by @benmuth in #1314
  • ability to populate is_staff and is_superuser flags at LDAP authentication by @vladimirdulov in #1335
  • Make it a little easier to run specific tests by @jimwins in #1371
  • disable chrome automatic self-updating when running headless
  • Add ability to populate is_staff and is_superuser flags during LDAP first auth
  • allow more restrictive NFS permission coercion on ./data/archive
  • bump yt-dlp, singlefile, wget, curl, and chrome versions
  • fix RESOLUTION being ignored when using Chrome headless in Docker
  • fix sorting by Size / Files in the Admin Snapshots list page UI
  • fix spinner icon showing on some Snapshots instead of favicon when only a few extractors are enabled
  • fix yt-dlp sometimes failing to archive media due to filenames being too long or containing special characters
  • fix wget extractor not finding output when :80 or :443 port is present in the original URL
  • fix /var/spool/cron/crontabs permissions when mounting it via Docker
  • fix /browsers chown on Docker armv7 entrypoint failing
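
The low-disk check on /, /data, and /data/archive listed above could be approximated with a few lines of shell. This is a minimal sketch, not ArchiveBox's actual entrypoint code: the function name warn_if_low_disk and the 1 GiB threshold are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not taken from ArchiveBox): prints a warning and
# returns 1 when the given directory has fewer than MIN_KB kilobytes free.
warn_if_low_disk() {
    check_dir="$1"
    min_kb="$2"
    # df -P prints one line per filesystem; column 4 is the available 1K blocks
    avail_kb=$(df -Pk "$check_dir" 2>/dev/null | awk 'NR==2 {print $4}')
    case "$avail_kb" in
        ''|*[!0-9]*) return 0 ;;   # path missing or value unreadable: skip it
    esac
    if [ "$avail_kb" -lt "$min_kb" ]; then
        echo "[!] Warning: $check_dir has only ${avail_kb} KB free" >&2
        return 1
    fi
    return 0
}

# Check the three locations named in the release note.
# 1048576 KB (1 GiB) is an assumed threshold, not ArchiveBox's real one.
for mount_dir in / /data /data/archive; do
    warn_if_low_disk "$mount_dir" 1048576 || true
done
```

Paths that do not exist (for example /data outside of Docker) are skipped silently rather than treated as errors.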

COMING SOON: new sci-dl scientific paper downloader being worked on by @benmuth


Full Changelog: v0.7.2...v0.8.0-rc

v0.7.2: Make scheduled imports taggable, fix admin buttons, readability, Docker permissions

04 Jan 19:25
315c9f3

Get this release via pip, docker, brew, or dpkg (apt & brew releases are delayed).

# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.2'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.2
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
pip install --upgrade 'archivebox==0.7.2'
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb

# then run pip install after
pip install --upgrade 'archivebox==0.7.2'

Note: this is not packaged using "proper" Debian techniques like 0.6.2 was; instead, it's just a wrapper for executing pip install archivebox with a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.

(Launchpad apt ppa & brew updates coming eventually, packaging all the vendored binaries that archivebox depends on has gotten harder lately)


# Then run this to upgrade an existing collection data dir to 0.7.2
cd ~/path/to/data/dir
archivebox init

What's Changed

  • add --tag=tag1,tag2,tag3 support to archivebox schedule command
  • allow PGID=0 root-group ownership of data dir (but PUID=0 is still not allowed)
  • improve error messages, hints, and logging about permissions issues in Docker
  • notify users when a new ArchiveBox version is available on GitHub (thanks @benmuth!)
  • bump dependency versions (yt-dlp, chrome, readability, node, python)
  • warn when Docker / or /data volume mounts don't have any space available
  • limit compatible Python versions to >= 3.8 and <= 3.11
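
The supported Python range above (>= 3.8 and <= 3.11) can be checked before installing. A minimal sketch, assuming a hypothetical helper named python_version_supported; it compares only the major and minor components and is not how ArchiveBox itself enforces the limit.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for Python 3.8 through 3.11 (any patch).
python_version_supported() {
    version="$1"
    major=${version%%.*}     # text before the first dot
    rest=${version#*.}       # text after the first dot
    minor=${rest%%.*}        # drop a patch component such as ".4" if present
    [ "$major" -eq 3 ] && [ "$minor" -ge 8 ] && [ "$minor" -le 11 ]
}

# Example: check whichever python3 is on PATH (skipped if none is installed)
if command -v python3 >/dev/null 2>&1; then
    current=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if python_version_supported "$current"; then
        echo "Python $current is within the supported range"
    else
        echo "Python $current is outside the supported range (3.8 to 3.11)" >&2
    fi
fi
```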

Bug Fixes

  • fix action buttons in Snapshot admin page not showing up correctly
  • tag links immediately in first stage of archivebox add instead of at the end (so that imports that are paused or interrupted still get tagged correctly)
  • fix config variables in CHROME_USER_AGENT format string not getting interpolated properly
  • switch readability to prefer Chrome DOM dumps for article text instead of singlefile (because singlefile output is often huge and crashes readability/times out)
  • make Docker image smaller by removing unneeded docs files
  • better current version detection and remove annoying +editable string and also add BUILD_TIME
  • fix /browsers/* does not exist warning on startup

v0.7.1: Minor new features, bugfixes, and new dependency versions

04 May 05:53

Get this release via pip, docker, brew, or dpkg (apt ppa update delayed).

# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.1'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.1
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb

Note: this is not packaged using "proper" Debian techniques like 0.6.2 was; instead, it's just a wrapper for executing pip install archivebox with a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.

(Launchpad apt ppa update coming eventually, packaging for apt has gotten harder lately)


# Then run this to upgrade an existing collection data dir to 0.7.1
cd ~/path/to/data/dir
archivebox init

What's Changed

Lots of bugfixes, speedups, and small convenience features.


Full Changelog: v0.6.2...v0.7.1

v0.6.2: >10x performance gain, new Admin UI & CLI features, and more

10 Apr 12:24

New features

  • new ArchiveResult log in the admin web UI, with full editing ability of individual extractor outputs + list of outputs under each Snapshot admin entry
  • ability to save multiple snapshots of the same URL over time using new Re-snapshot button
  • add init --quick and server --quick-init options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
  • add new archivebox setup command and archivebox init --setup flag to aid in automatically installing dependencies and creating a superuser during initial setup
  • new SNAPSHOTS_PER_PAGE=40 and MEDIA_MAX_SIZE=750m config options
  • allow hotlinking directly to specific extractor output on the snapshot detail page using URL #hash e.g. /archive/<timestamp>/index.html#git
  • add ability to view the snapshot matching a given URL by visiting /archive/https://example.com/some/url -> redirects to -> /archive/<timestamp>/index.html (also works without a scheme, e.g. /archive/example.com)
  • #660 add ability to tag URLs while adding them via the web UI and via the CLI using archivebox add --tag=tag1,tag2,tag3 ...
  • #659 add back ability to override visual styling with custom HTML and CSS using new config option CUSTOM_TEMPLATES_DIR
  • ability to add and remove multiple tags at once from the snapshot admin using autocompleting dropdown
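
The --tag=tag1,tag2,tag3 form introduced above takes a comma-separated list. As an illustration of how such a value breaks down into individual tags, here is a minimal sketch; split_tags is a hypothetical helper, and the whitespace trimming and dropping of empty entries are assumptions, not documented ArchiveBox behavior.

```shell
#!/bin/sh
# Hypothetical helper: print one tag per line from a comma-separated list.
split_tags() {
    printf '%s\n' "$1" \
        | tr ',' '\n' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Prints tag1, tag2, and tag3 on separate lines
split_tags 'tag1,tag2,tag3'
```

On the command line the list is passed as a single argument, as in the archivebox add --tag=tag1,tag2,tag3 ... form shown in the entry above.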

Enhancements

  • lots of performance improvements! (in testing with 100k entries, the main index was brought down from 10-14 second load times to ~110ms once cache warms up)
  • full text search now works on the public snapshot list
  • dates and times are now localized to your browser's timezone instead of showing in UTC
  • integrity and correctness improvements to readability, mercury, warc, and other extractors
  • video subtitles and description are now added to the full-text search index as well (including youtube's autogenerated transcripts in all languages)
  • log all errors with full tracebacks to new data/logs/errors.log file (so users no longer have to run in --debug mode to see error details)
  • better archivebox schedule logging and changed logfile location to ./logs/schedule.log
  • better docker-compose setup experience with sonic config example in docker-compose.yml
  • add Django Debug Toolbar + djdt_flamegraph for developers to profile UI performance
  • add --overwrite flag support to archivebox schedule, archived urls get added similarly to add --overwrite
  • #644 remove bootstrap and jquery network requests to CDNs by inlining them instead
  • #647 allow filtering by ArchiveResult status in the Snapshot admin UI to select only links that have been archived or not archived
  • #550 kill all orphan child processes after each extractor finishes to prevent dangling chromium/node subprocesses and memory leaks
  • 3276434 add new SEARCH_BACKEND_TIMEOUT config option to tune amount of time search backend can take before it gives up
  • more diagnostic info added to the Snapshot admin view including most recent status code, content type, detected server, etc
  • make the order of the table columns, layout, and spacing the same on the public view and private view (also remove DataTable, we're not using it)
  • better snapshot grid page (faster load times, nicer CSS for tags and cards, more actions supported and metadata shown)
  • added Cache-Control headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
  • new project releases page https://releases.archivebox.io and demo url https://demo.archivebox.io

Bugfixes

  • #673 fix searching by URL substring in Snapshot admin list
  • #658 fix Snapshot admin action buttons not working in Safari and some other browsers
  • #678 fix AssertionError when archivebox would attempt to archive with CHROME_BINARY=None when Chrome was not found on the host system
  • #654 fix some issues with sonic attempting to index massive text blobs or binary blobs on some pages and hanging
  • #674 fix UTF-8 encoding problems with file reading/writing on Windows (supporting a Python pkg on Windows is unreasonably painful y'all)
  • #433 fix deleted items sometimes reappearing on next import/update
  • #473 fix issue preventing use of archivebox python API inside raw REPL (not using archivebox shell)
  • fix stdin/stdout/stderr handling for some edge cases in Docker/Docker-Compose


v0.5.6: Bugfixes and packaging improvements

09 Feb 14:25
9766ea2
  • add ARMv7 and ARMv8 CPU support for apt / deb distribution on Launchpad PPA
  • fix nodesource apt repo not supported on i386 b90afc8
  • fix handling of skipped ArchiveResult entries with null output 0aea5ed
  • catch exception on import of old index.json into ArchiveResult 171bbeb
  • move debsign to release not build 66fb5b2
  • skip tests during debian build a32eac3
  • fix emptystrings in cmd_version causing exception a49884a
  • automate deb dist better and bump version 0e6ac39
  • fix assertion 6705354
  • change wording of db not found error 683a087

v0.5.4: New Snapshot detail UI, lots of bugfixes, speed improvements, and limit media downloads to 750mb by default

01 Feb 08:11

Thank you contributors who helped with the 181 commits in this release!
@cdvv7788, @jdcaballerov, @thedanbob, @aggroskater, @mAAdhaTTah, @mario-campos, @mikaelf

  • fix migration failing due to null cmd_versions in older archives a3008c8
  • Publish minor & major versions to DockerHub and set up CodeQL codeql-analysis.yml c5b7d9f, bbb6cc8
  • fix DATABASE_NAME posixpath, and dependencies dict bug 02bdb3b, 5c7842f
  • use relative imports for .util to fix windows import clash 72e2c7b
  • fix COOKIES_FILE config param breaking in wget ef7711f
  • Refactor should_save_extractor methods to accept overwrite parameter 5420903
  • Fix issue #617 by using mark_safe in combination with format_html … 1989275
  • make permission chowning on docker start less fancy, respect PUID/PGID #635
  • add createsuperuser flag to server command 39ec77e
  • fix files icons styling and use the db exclusively for rendering them, instead of filesystem f004058, 7d8fe66, 5c54bcc, 534ead2
  • limit youtubedl download size to 750m and stop splitting out audio files 3227f54
  • also search url, timestamp, tags on public index 8a4edb4
  • fix trailing slash problems and wget not detecting download path 9764a8e
  • add response status code to headers.json c089501
  • fix singlefile path used for sonic 24e2493
  • cleanup template layout in filesystem, new snapshot detail page UI
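
For a sense of what the 750m download limit above means in bytes, here is a small sketch that expands a size suffix. size_to_bytes is a hypothetical helper, it handles whole numbers only, and the use of binary multiples (1m = 1024 * 1024 bytes) is an assumption about how the downloader interprets the suffix.

```shell
#!/bin/sh
# Hypothetical helper: expand a size such as "750m" into a byte count.
# Assumes binary multiples (k = 1024, m = 1024^2, g = 1024^3).
size_to_bytes() {
    size="$1"
    num=${size%[kKmMgG]}   # strip a single trailing unit letter, if any
    case "$size" in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;   # no suffix: already a byte count
    esac
}

size_to_bytes 750m   # prints 786432000
```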


v0.5.3: New grid UI, full-text search, oneshot subcommand, Pocket API and Wallabag importers, bugfixes, and packaging improvements

06 Jan 19:46

v0.4.24: Packaging improvements, UI improvements, and bugfixes

03 Dec 16:57
b186e98

Last stable version for the v0.4 branch. Contains numerous final fixes and improvements to v0.4 before the leap to v0.5.

v0.4.21: Better Node dependency version checking and sdist PATH fixes

18 Aug 23:44

v0.4.17: Bugfixes and CLI experience improvements

18 Aug 13:50
  • Fix bugs with parsing long URLs as paths
  • html-encoded URLs
  • new generic HTML parser
  • new --init and --overwrite flags on add
  • improve stdout and hints
  • fix Pull title button
  • other small bugfixes