Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching From File #1914

SelfOnTheShelf · 2021-08-18T14:46:10Z

Description

This feature adds support for using Redis to skip already downloaded URLs. If you use RipMe over a long period to download many galleries and albums, the url_history.txt file gets quite large, and an O(n) scan through the entire list for every URL in a job becomes very expensive. My own url_history.txt file is approaching 3 million lines and 130 MB. Using Redis speeds up the ripping process considerably and lets power users coordinate jobs running across multiple machines on a network.

Users can optionally add the following lines to the rip.properties file:

# IP address or domain name of the Redis host
url_history.redis_cache.host = 192.168.0.123
# Redis port (optional; RipMe defaults to 6379)
url_history.redis_cache.port = 6379
# Prefix for the keys added to Redis (optional; defaults to an empty string)
url_history.redis_cache.key_prefix = RipMeURL:
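
As a rough illustration of how these three settings could be resolved, here is a minimal sketch. The property names and defaults come from the description above; the class RedisHistoryConfig, its methods, and the use of java.util.Properties as the loader are assumptions made for the example, not the code in this PR.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical holder for the three settings above; not RipMe's actual class.
final class RedisHistoryConfig {
    final String host;      // null when no Redis host is configured
    final int port;
    final String keyPrefix;

    private RedisHistoryConfig(String host, int port, String keyPrefix) {
        this.host = host;
        this.port = port;
        this.keyPrefix = keyPrefix;
    }

    // Reads the settings, falling back to port 6379 and an empty prefix.
    static RedisHistoryConfig fromProperties(Properties props) {
        String host = props.getProperty("url_history.redis_cache.host", "").trim();
        String port = props.getProperty("url_history.redis_cache.port", "6379").trim();
        String prefix = props.getProperty("url_history.redis_cache.key_prefix", "").trim();
        return new RedisHistoryConfig(host.isEmpty() ? null : host, Integer.parseInt(port), prefix);
    }

    // Redis is only used when a host has been set.
    boolean isEnabled() {
        return host != null;
    }

    // The key under which a URL would be stored: prefix followed by the URL.
    String keyFor(String url) {
        return keyPrefix + url;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(
                "url_history.redis_cache.host = 192.168.0.123\n"
              + "url_history.redis_cache.key_prefix = RipMeURL:\n"));
        RedisHistoryConfig config = fromProperties(props);
        System.out.println(config.host + ":" + config.port);
        System.out.println(config.keyFor("https://example.com/album/1"));
    }
}
```

A key prefix keeps RipMe's entries apart from other data when the Redis instance is shared with other applications.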

If users do not add this configuration, URL matching now uses a HashSet built from the url_history.txt file. This is memory intensive, but it is faster than the sequential scan.
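
To make the difference concrete, here is a small sketch contrasting the two lookups. The class UrlHistoryLookup and its method names are hypothetical and only illustrate the idea; the actual change in this PR may be structured differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; names are illustrative, not taken from RipMe.
final class UrlHistoryLookup {
    // Sequential scan: every lookup walks the list, O(n) per URL.
    static boolean containsByScan(List<String> historyLines, String url) {
        for (String line : historyLines) {
            if (line.trim().equals(url)) {
                return true;
            }
        }
        return false;
    }

    // Builds the set once; later contains() calls are O(1) on average.
    static Set<String> toSet(Stream<String> lines) {
        return lines.map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toCollection(HashSet::new));
    }

    // Streams the history file so the lines are not also held in a list.
    static Set<String> loadIntoSet(Path historyFile) throws IOException {
        try (Stream<String> lines = Files.lines(historyFile, StandardCharsets.UTF_8)) {
            return toSet(lines);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("url_history", ".txt");
        Files.write(file, Arrays.asList(
                "https://example.com/album/1",
                "https://example.com/album/2"));
        Set<String> seen = loadIntoSet(file);
        System.out.println(seen.contains("https://example.com/album/2"));
        System.out.println(seen.contains("https://example.com/album/3"));
        Files.delete(file);
    }
}
```

The trade-off is the one described above: the whole history is held in memory, in exchange for not rescanning the file contents for every URL in a job.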

Note: RipMe will continue to append new lines to the url_history.txt file, since this operation does not seem to slow down the job (at least at the scales that I have encountered).

Note 2: The easiest way to run Redis locally is to use Docker (something like docker run --name my-redis -d -p 6379:6379 redis). Alternatively, you could download and install Redis for your OS.

Testing

Required verification:

I've verified that there are no regressions in mvn test (there are no new failures or errors).
I've verified that this change works as intended.
- Downloads all relevant content.
- Downloads content from multiple pages (as necessary or appropriate).
- Saves content at reasonable file names (e.g. page titles or content IDs) to help easily browse downloaded content.
I've verified that this change did not break existing functionality (especially in the Ripper I modified).

Optional but recommended:

I've added a unit test to cover my change.

soloturn · 2021-09-11T06:31:51Z

what a cool pull request. not that i'd ever need it - but the principle is a great showcase :) tried to merge it here: https://github.com/ripmeapp2/ripme , but then i wondered how to see within a couple of seconds, now and in future, whether it works. would you mind adding a tiny unit test, maybe along the lines of:
https://www.baeldung.com/spring-embedded-redis

SelfOnTheShelf · 2021-10-07T02:26:14Z

@soloturn I've added some tests for this. Please let me know if you want me to change anything, especially regarding style. I'm both new to this codebase and the Java world in general!

MarcoBorrini99 · 2021-10-15T21:08:24Z

It seems to work; the only "downside" is that the project seems to be abandoned.

soloturn · 2021-10-17T08:31:39Z

thank you @SelfOnTheShelf ! 3 tiny things if you could adjust please:

- use the latest versions for your dependencies
- if you could, add the dependencies to the build.gradle.kts file as well; ripme2 has no maven build any more, only gradle
- reorder the HashSet import alphabetically so it merges into ripme2 without conflict

SelfOnTheShelf added 3 commits August 17, 2021 23:01

Add support for Redis URL caching

17dc115

add support for redis key prefixes

54ae5da

use hash set

8578ed3

SelfOnTheShelf changed the title ~~Power Users Can Use Redis to Filter Already Downloaded URLs~~ Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching Aug 19, 2021

SelfOnTheShelf force-pushed the master branch 2 times, most recently from 398ee03 to ea8ba4b Compare August 20, 2021 00:25

add hashset logging

7319e85

SelfOnTheShelf force-pushed the master branch from ea8ba4b to 7319e85 Compare August 20, 2021 00:25

SelfOnTheShelf changed the title ~~Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching~~ Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching From File Aug 20, 2021

Add tests for redis and hashset functionality

06b4d94
