text file crawler (tfc)

To quench my curiosity, I wanted to gauge the usage & adoption of the following pseudo-standard text files:

Given a domains.txt file containing one domain per line, the Node.js script fires off a request for each of the files on every domain. Because network I/O is the bottleneck, this can take a while.
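As a rough illustration of that fan-out, here is a minimal sketch of turning the contents of domains.txt into a list of request URLs. The helper name `buildUrls`, the `https://` scheme, and the two example paths are assumptions made for this sketch. They are not taken from tfc.js.

```javascript
// Hypothetical helper: expand the contents of domains.txt into request URLs.
// The paths and the https:// scheme are placeholders chosen for illustration.
const EXAMPLE_PATHS = ["/robots.txt", "/humans.txt"];

function buildUrls(domainsText, paths = EXAMPLE_PATHS) {
  return domainsText
    .split(/\r?\n/) // one domain per line
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines
    .flatMap((domain) => paths.map((path) => `https://${domain}${path}`));
}
```

With a real file, this could be called as `buildUrls(require("node:fs").readFileSync("domains.txt", "utf8"))`. The number of requests grows as domains × files, which is part of why larger lists take a while.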

NOTE: This script isn't particularly efficient in terms of memory usage. If you run out of memory, pass the --max-old-space-size flag like so: node --max-old-space-size=4096 tfc.
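If you want to confirm that the flag took effect, a small check like the one below may help. It is only a sketch: `heapLimitMiB` is a hypothetical helper name, and the reported limit covers the whole V8 heap, so it will usually not match the flag's value exactly.

```javascript
const v8 = require("node:v8");

// Hypothetical helper: report V8's current heap size limit in MiB.
function heapLimitMiB() {
  const bytes = v8.getHeapStatistics().heap_size_limit;
  return Math.round(bytes / (1024 * 1024));
}

console.log(`V8 heap limit: ${heapLimitMiB()} MiB`);
```

Running it once with and once without --max-old-space-size should show a noticeably different limit.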

Redirects are capped at 20, and validity is based on the HTTP status code, the Content-Type header, and the first few values of the response data. When the run completes, the statistics are printed out. Valid text files are written to files/, which is created & wiped for you each time the script is started.
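To make the three validity signals concrete, here is a minimal sketch of such a check. The specific rules (a 200 status, a text/plain Content-Type, and a body that is non-empty and does not open with "<") are assumptions for illustration, and `looksLikeValidTextFile` is a hypothetical name. The actual checks in tfc.js may differ.

```javascript
// Hypothetical validity check combining status code, Content-Type, and the
// start of the body. Every rule below is an assumption, not tfc.js's logic.
function looksLikeValidTextFile({ statusCode, contentType, body }) {
  // Assumption: only a 200 response counts as a hit.
  if (statusCode !== 200) return false;

  // Assumption: the server must declare a plain-text payload.
  const type = (contentType || "").toLowerCase();
  if (!type.startsWith("text/plain")) return false;

  // Assumption: an empty body, or one that opens with "<", is treated as an
  // empty or HTML page (for example a soft 404) rather than a text file.
  const head = (body || "").trimStart().slice(0, 16);
  return head.length > 0 && !head.startsWith("<");
}
```

Checking the start of the body matters because many servers answer unknown paths with a 200 status and an HTML error page, which the status code alone would count as a hit.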

If you're interested in a write-up about this along with the metrics, you should check out my article.

Usage

Create a domains.txt, either by writing your own or by symlinking one of the provided lists:

ln -s domains-faang.txt domains.txt

Then, grab the dependencies & start it up:

npm install && npm start

Some requests never receive a response & hang indefinitely. If it's been a while, just Ctrl + C the process, which will print out the stats before exiting.
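The print-on-interrupt behavior can be sketched with a SIGINT handler, as below. The `stats` object, its fields, and `formatStats` are hypothetical stand-ins. The statistics that tfc.js tracks, and how it prints them, may differ.

```javascript
// Hypothetical counters, updated elsewhere as responses come in.
const stats = { requested: 0, valid: 0 };

// Hypothetical formatter for the summary line.
function formatStats(s) {
  return `requested: ${s.requested}, valid: ${s.valid}`;
}

// Ctrl + C delivers SIGINT to the process. Registering a handler replaces
// Node's default behavior of exiting immediately, so the handler has to
// exit explicitly after printing.
process.on("SIGINT", () => {
  console.log(formatStats(stats));
  process.exit();
});
```

Without the explicit `process.exit()`, the hanging requests would keep the process alive after the stats are printed.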

Thanks

David. Jeff.

License

MIT.
