Full URL in WORKER log #112

Open
KathrynN opened this issue Mar 5, 2022 · 1 comment

KathrynN commented Mar 5, 2022

Background:
When crawling a website, it is not uncommon to see output like

   #0 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
   #1 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
   #2 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
   #3 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
   #4 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN

The lines look identical because the logged URL appears to be truncated. That makes it hard to tell whether the URLs point to genuinely different content, or whether they are the same page with different query parameters (such as ?param=1) that all redirect to one place and could be avoided with a simple tweak to config.yaml.

DoD:
Implement an argument that allows the full URL to be printed, even if it wraps onto new lines.

(By the way, excellent, excellent work on this easy-to-use Docker image!)

@simonwiles
Contributor

+1, but full untruncated URLs might make following the output rather difficult as the lines wrap and jump around.

As an alternative while this is considered, you can tail the collections/<collection>/pages/pages.jsonl file to see the URLs as they are written. If the JSON output is distracting, you can pipe it to jq .url or jq -r .url. Finally, at the cost of a bit more complexity, you can also decode URL-encoded characters as they are written to the screen, for greater intelligibility. Full example:

tail -f pages.jsonl | stdbuf -oL jq .url | { while read i; do echo -e "${i//\%/\\x}"; done; }
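The same idea can be written a little more defensively. This is only a sketch, assuming bash and jq are installed. The function names urldecode and follow_urls are made up for illustration and are not part of the crawler. It reads lines with read -r so backslashes survive, uses jq's --unbuffered flag in place of stdbuf, and skips any record that has no url field.

```bash
#!/usr/bin/env bash

# Hypothetical helper: decode percent-encoded bytes (e.g. %20) in one URL.
# Assumption: every '%' in the input starts a two-digit hex escape; a stray
# '%' is not handled specially here.
urldecode() {
  local s=${1//\\/\\\\}            # double literal backslashes so %b keeps them
  printf '%b\n' "${s//\%/\\x}"     # turn %HH into \xHH and let printf expand it
}

# Hypothetical helper: follow a pages.jsonl file and print each URL decoded.
follow_urls() {
  tail -f "$1" \
    | jq --unbuffered -r '.url // empty' \
    | while IFS= read -r url; do
        urldecode "$url"
      done
}
```

With these defined in the current shell, follow_urls collections/<collection>/pages/pages.jsonl prints one decoded URL per line until interrupted with Ctrl-C.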
