govau/wofg-web-filters

wofg-web-filters

Filters for processing Web ARChive (WARC) files as part of the WofG Web Reporting Service

Overview

These filters were originally developed with Funnelback for use in both in-crawl and post-crawl filtering of data gathered during a Whole-of-Australian Government web crawl.

Pre-gather workflow tasks generate mappings from domains to portfolios (drawn from the Australian Government Organisation Register) and augment those mappings with other external data sources.
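As a rough illustration of the domain-to-portfolio mapping step, the sketch below loads a small CSV extract and resolves a crawled URL to a portfolio. The column names (`domain`, `portfolio`), the sample rows, and both function names are hypothetical; the actual register format and the repository's own workflow tasks may differ.

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical extract; the real register's columns and values are not shown in this README.
REGISTER_CSV = """domain,portfolio
example.gov.au,Example Portfolio
sample.gov.au,Sample Portfolio
"""


def load_domain_portfolios(csv_text):
    """Build a lowercase domain -> portfolio lookup from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["domain"].strip().lower(): row["portfolio"].strip() for row in reader}


def portfolio_for_url(url, mapping):
    """Return the portfolio for a URL, trying the hostname and then each parent domain."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in mapping:
            return mapping[candidate]
    return None  # No portfolio known for this host.


mapping = load_domain_portfolios(REGISTER_CSV)
```

Matching on parent domains means a subdomain such as `www.example.gov.au` inherits the portfolio recorded for `example.gov.au`; whether that fallback is appropriate depends on how the register lists domains.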

Post-gather, several content checks are run. These are written in Groovy, and are run with Funnelback's filter framework. Tools for splitting WARC files are also included at this stage.
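To give a sense of what splitting a WARC file involves, here is a minimal Python sketch that walks the records of an uncompressed WARC stream and groups them into fixed-size parts. It is not the repository's splitting tool: the function names and the `records_per_part` parameter are assumptions, and it does not handle gzip-compressed WARC files, which are common in practice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, raw_bytes) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # End of stream.
        if line.strip() == b"":
            continue  # Blank lines separate consecutive records.
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        raw = [line]
        headers = {}
        while True:
            header_line = stream.readline()
            raw.append(header_line)
            if header_line in (b"\r\n", b"\n", b""):
                break  # An empty line ends the header block.
            name, _, value = header_line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the size of the record's content block in bytes.
        raw.append(stream.read(int(headers["content-length"])))
        raw.append(b"\r\n\r\n")  # Each record ends with two CRLF pairs.
        yield headers, b"".join(raw)


def split_records(stream, records_per_part):
    """Group records into parts of at most records_per_part records each."""
    parts, current = [], []
    for _, raw in iter_warc_records(stream):
        current.append(raw)
        if len(current) == records_per_part:
            parts.append(b"".join(current))
            current = []
    if current:
        parts.append(b"".join(current))
    return parts
```

Each returned part is itself a valid sequence of WARC records, so it could be written to its own file, for example with `open(path, "wb").write(part)`.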

Post-filtering, metadata is written to JSON for ingestion into Elasticsearch.
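One way the JSON output could be shaped for loading is the newline-delimited format accepted by the Elasticsearch bulk API, where each document is preceded by an action line. The sketch below is illustrative only: this README does not say which loading mechanism is used, and the index name, the `url` field used as the document ID, and the other metadata fields are hypothetical.

```python
import json


def to_bulk_ndjson(documents, index_name):
    """Serialise documents as bulk-style NDJSON: one action line, then one source line."""
    lines = []
    for doc in documents:
        # Assumption: each document carries a unique "url" that can serve as its ID.
        action = {"index": {"_index": index_name, "_id": doc["url"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical metadata records, as might be produced by the content checks.
sample_docs = [
    {"url": "https://www.example.gov.au/", "portfolio": "Example Portfolio", "title": "Home"},
    {"url": "https://www.example.gov.au/about", "portfolio": "Example Portfolio", "title": "About"},
]
payload = to_bulk_ndjson(sample_docs, "web-report")
```

Using the URL as the document ID means that re-loading the same page overwrites the earlier entry rather than creating a duplicate.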
