URL percent encode

A Python script that makes the URLs in seed lists "URL-safe" (for example, by replacing Unicode characters with their percent-encoded equivalents) for compatibility with crawlers such as Heritrix.

Input

The script takes a list of URLs as input: a text file with one URL per line. See ./example_seed_lists/bad.txt and ./example_seed_lists/good.txt for examples.

bad.txt - An example list of URLs containing characters that are not URL-safe and so cannot be crawled.

good.txt - The same list of URLs, percent-encoded so that they can be crawled.

To run:

When run, the encode.py script prompts the user for an input txt file name and an output txt file name. No errors are produced if all characters in the seed URLs are already URL-safe, so there is no need to check the input file before running.

Input txt - The txt file that needs to be encoded (with file path if relevant).

Output txt - The desired name of the file produced by this script (with file path if relevant).

Output

The script produces a txt file (named by the user) containing the same URLs as the input file, with any characters that are not URL-safe percent-encoded.
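The input-to-output flow described above can be sketched roughly as follows. This is an illustrative sketch only, not the contents of encode.py: the function name `encode_seed_file`, the `SAFE_CHARS` constant, UTF-8 file encoding, and the skipping of blank lines are all assumptions made for the example.

```python
import urllib.parse
from pathlib import Path

# Assumed default, mirroring the "safe" list described in this README.
SAFE_CHARS = "/:?=&+#"


def encode_seed_file(input_path, output_path, safe=SAFE_CHARS):
    """Percent-encode every URL in input_path and write them to output_path.

    Hypothetical helper: expects one URL per line and skips blank lines.
    """
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace so trailing spaces are not encoded as %20.
    encoded = [
        urllib.parse.quote(line.strip(), safe=safe)
        for line in lines
        if line.strip()
    ]
    Path(output_path).write_text("\n".join(encoded) + "\n", encoding="utf-8")
    return encoded
```

A call such as `encode_seed_file("bad.txt", "encoded.txt")` would then write one encoded URL per line to `encoded.txt`; the output file name here is only a placeholder.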

Safe characters

Characters that should not be encoded or changed can be listed in the "safe" argument. In this example, the "/", ":", "?", "=", "&", "+" and "#" characters are left unchanged in the output:

encoded_seed = urllib.parse.quote(unencoded_seed, safe="/:?=&+#")
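To illustrate the effect of the "safe" list, the short sketch below encodes one made-up URL (example.com is a placeholder, not taken from the example seed lists) with the safe list shown above and with the default of `urllib.parse.quote`, which treats only "/" as safe.

```python
import urllib.parse

# Made-up seed URL containing a non-ASCII character and a space.
unencoded_seed = "https://example.com/na\u00efve path/?q=1&r=2#top"

# With the safe list from this README, URL delimiters are preserved.
encoded_seed = urllib.parse.quote(unencoded_seed, safe="/:?=&+#")
print(encoded_seed)
# https://example.com/na%C3%AFve%20path/?q=1&r=2#top

# With the default safe list ("/" only), the delimiters are encoded too.
encoded_default = urllib.parse.quote(unencoded_seed)
print(encoded_default)
# https%3A//example.com/na%C3%AFve%20path/%3Fq%3D1%26r%3D2%23top

# "%" is not in the safe list, so an already-encoded URL is encoded again.
print(urllib.parse.quote("a%20b", safe="/:?=&+#"))
# a%2520b
```

As the last line suggests, running this kind of encoding over a URL that is already percent-encoded would turn each "%" into "%25"; adding "%" to the safe list is one possible way to avoid that, at the cost of leaving stray "%" characters untouched.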

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
