Filters for processing Web ARChive (WARC) files as part of the WofG Web Reporting Service
These filters were originally developed for Funnelback, for use in both in-crawl and post-crawl filtering of data gathered during a Whole-of-Australian-Government web crawl.
Pre-gather, workflow tasks are run to generate mappings from domains to portfolios (drawn from the Australian Government Organisation Register) and to augment these with other external data sources.
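The domain-to-portfolio mapping step can be sketched as below. This is a minimal illustration in Python (the project's filters themselves are Groovy); the CSV column layout, function names, and sample portfolio rows are assumptions, not the register's actual schema. Subdomains fall back to their parent domain's portfolio via longest-suffix matching.

```python
import csv
import io

def load_portfolio_map(csv_text):
    # Parse register-style rows of (domain, portfolio) into a lookup dict.
    # Hypothetical two-column layout, not the actual AGOR schema.
    reader = csv.reader(io.StringIO(csv_text))
    return {domain.lower(): portfolio for domain, portfolio in reader}

def portfolio_for(host, mapping):
    # Walk up the hostname's labels so, e.g., www.health.gov.au
    # inherits the mapping registered for health.gov.au.
    labels = host.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in mapping:
            return mapping[candidate]
    return None

# Illustrative rows only; real portfolio names come from the register.
rows = "health.gov.au,Health Portfolio\nag.gov.au,Attorney-General's Portfolio"
mapping = load_portfolio_map(rows)
print(portfolio_for("www.health.gov.au", mapping))  # → Health Portfolio
```

Matching on domain suffixes rather than exact hostnames keeps the mapping table small while still classifying every crawled subdomain.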
Post-gather, several content checks are run. These are written in Groovy and run within Funnelback's filter framework. Tools for splitting WARC files are also included at this stage.
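The WARC-splitting idea can be illustrated with a short Python sketch (the repository's own tools may work differently). It relies only on the WARC record layout from the spec: each record starts with a `WARC/1.0` version line, carries a `Content-Length` header giving the body size, and is terminated by two CRLFs. Function names and the chunking-by-record-count strategy are assumptions for illustration.

```python
import io

def iter_warc_records(stream):
    # Yield raw (headers + body) bytes for each record in an
    # uncompressed WARC stream.
    while True:
        line = stream.readline()
        if not line:
            return
        if line.strip() == b"":
            continue  # skip the blank lines between records
        header_lines = [line]  # the WARC/1.0 version line
        content_length = 0
        while True:
            line = stream.readline()
            header_lines.append(line)
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                content_length = int(value.strip())
        body = stream.read(content_length)
        yield b"".join(header_lines) + body

def split_warc(stream, records_per_chunk):
    # Group records into chunks of at most records_per_chunk,
    # restoring the trailing CRLF pair after each record.
    chunk, chunks = [], []
    for record in iter_warc_records(stream):
        chunk.append(record + b"\r\n\r\n")
        if len(chunk) == records_per_chunk:
            chunks.append(b"".join(chunk))
            chunk = []
    if chunk:
        chunks.append(b"".join(chunk))
    return chunks
```

Each returned chunk is itself a valid sequence of WARC records, so the pieces can be filtered independently or in parallel. A production tool would also handle gzip-compressed WARCs, which this sketch does not.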
Post-filtering, metadata is written as JSON for ingestion into Elasticsearch.
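The JSON output step might look like the following Python sketch, which serialises metadata records in the newline-delimited format expected by Elasticsearch's bulk API (an action line naming the target index, then the document source). The index name, field names, and sample document are assumptions, not the service's actual schema.

```python
import json

def to_bulk_ndjson(docs, index):
    # Emit Elasticsearch bulk-API NDJSON: for each document, an
    # action line followed by the document itself, one JSON object
    # per line, with a trailing newline as the bulk API requires.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["url"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Hypothetical metadata record; real field names depend on the filters.
docs = [{"url": "https://health.gov.au/", "portfolio": "Health Portfolio",
         "status": 200}]
payload = to_bulk_ndjson(docs, "wofg-pages")
print(payload)
```

Using the page URL as the document `_id` makes re-ingestion idempotent: re-running the pipeline overwrites each page's metadata rather than duplicating it.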