Background:
When crawling a website, it is not uncommon to see output like
#0 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
#1 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
#2 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
#3 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
#4 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
The lines look identical because the URL appears to be truncated in the output. This makes it hard to tell whether these URLs point to different content, or whether they are all the same page with different query parameters (e.g. ?param=1) that redirect to the same place and could be avoided with a simple tweak to config.yaml.
DoD:
Implement an argument that allows the full URL to be printed, even if it takes up new lines.
(By the way, excellent, excellent work on this easy-to-use Docker image!)
+1, but full untruncated URLs might make following the output rather difficult as the lines wrap and jump around.
As an alternative while this is considered, you can tail the collections/<collection>/pages/pages.jsonl file to see the URLs as they're written. If the JSON output is distracting, you can pipe it to jq .url (quoted strings) or jq -r .url (raw strings). Finally, at the cost of a bit more complexity, you can also decode URL-encoded characters as they are written to the screen, which makes the URLs easier to read. Full example:
tail -f pages.jsonl | stdbuf -oL jq .url | { while read i; do echo -e "${i//\%/\\x}"; done; }
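If the one-liner gets unwieldy, the same idea can be split into two small bash functions. This is only a sketch: `urldecode` and `decode_page_urls` are hypothetical helper names, not part of the crawler, and it assumes bash, a jq that supports `--unbuffered`, and URLs that contain no literal backslashes.

```shell
#!/usr/bin/env bash

# Hypothetical helper: decode the percent-escapes in a single URL.
# Each %XX is rewritten to \xXX, which printf's %b format then expands.
# Assumes the URL contains no literal backslashes.
urldecode() {
  printf '%b\n' "${1//\%/\\x}"
}

# Hypothetical helper: read pages.jsonl records on stdin and print each
# record's url field, decoded, one per line. Records without a url field
# are skipped.
decode_page_urls() {
  jq --unbuffered -r '.url // empty' | while IFS= read -r url; do
    urldecode "$url"
  done
}
```

With both functions sourced into the shell, usage would look like `tail -f pages.jsonl | decode_page_urls`. Using `jq --unbuffered` stands in for the `stdbuf -oL` wrapper above, and `read -r` keeps backslashes in the input from being interpreted by `read`.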