Releases: centic9/CommonCrawlDocumentDownload

1.0.0.9

15 Jan 14:58
  • Switch to Gradle 7.6 and to the new maven-publish plugin
  • Update third-party libraries
  • Update to a more recent CC-MAIN crawl
  • Parse newer fields
  • Adjust logging configuration

Full Changelog: 1.0.0.8...1.0.0.9

1.0.0.8

15 Jan 14:53

Intermediate release while switching to Gradle 7.6, not uploaded to Maven Central.

Full Changelog: 1.0.0.7...1.0.0.8

1.0.0.10

15 Jan 17:48
  • Re-publish with correct artifactId

Full Changelog: 1.0.0.9...1.0.0.10

1.0.0.7

13 Mar 07:30
  • Add extension .pot for PowerPoint
  • Switch to CC-MAIN-2019-39
  • Update third-party libraries

Full Changelog: 1.0.0.6...1.0.0.7

1.0.0.6

21 Mar 06:30
  • Update third-party libraries
  • Use Common Crawl CC-MAIN-2018-43 by default
  • Write the accumulated MIME types to a separate text file after each index file
  • Add some support for detecting duplicate files and moving them out of the list, so that the post-processing steps do not re-process the same file over and over
  • Some small adjustments for behavior changes in Java 11
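
As an illustration of the duplicate detection mentioned in the list above, the following is a minimal sketch that flags files whose content has already been seen under an earlier name. It assumes duplicates are identified by a SHA-256 digest of the content; the release notes do not say how the project actually detects them, and the names DuplicateSketch, sha256Hex and findDuplicates are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of CommonCrawlDocumentDownload.
class DuplicateSketch {

    // Returns the hex-encoded SHA-256 digest of the given content.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Returns the names whose content matches an earlier entry, in iteration order.
    // Assumption: byte-identical content is what counts as a duplicate.
    static List<String> findDuplicates(Map<String, byte[]> files) {
        Map<String, String> firstSeen = new HashMap<>();
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            String digest = sha256Hex(entry.getValue());
            // putIfAbsent returns the earlier name if this digest was already recorded.
            if (firstSeen.putIfAbsent(digest, entry.getKey()) != null) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.doc", "same".getBytes(StandardCharsets.UTF_8));
        files.put("b.doc", "other".getBytes(StandardCharsets.UTF_8));
        files.put("c.doc", "same".getBytes(StandardCharsets.UTF_8));
        System.out.println(findDuplicates(files)); // prints [c.doc]
    }
}
```

In a real run the returned names would be the files to move out of the download list. For large downloads, streaming each file through the digest would avoid holding whole documents in memory.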

1.0.0.5

30 Oct 20:47
  • Update third-party libraries
  • Download some more MIME types out of the box
  • Use a longer socket timeout
  • Switch to the new S3 public dataset URL
  • Handle new item "mime-detected" in JSON
  • Some refactoring