gutenparse

This project has a few parts, all of which are designed to help me curate the Labelled and Balanced Gutenberg Bookset (LabGub).

gutenparse.py

A little library to parse some common metadata from RDF metadata files in Project Gutenberg books.

remove_*.py scripts

Several scripts to prune books from a collection of Project Gutenberg books. For these to work, it is assumed that the book collection is structured like so:

collection
├── 9043
│   ├── pg9043.rdf
│   └── pg9043.txt.utf8
├── 9044
│   ├── pg9044.rdf
│   └── pg9044.txt.utf8
├── 9045
│   ├── pg9045.rdf
│   └── pg9045.txt.utf8
├── 9046
│   ├── pg9046.rdf
│   └── pg9046.txt.utf8
└── 9047
    ├── pg9047.rdf
    └── pg9047.txt.utf8

They are used by passing the RDF files as command line arguments. The results are logged in detail.

$ python remove_non_english.py collection/*/*.rdf
Processing 60 meta files...

Removing SA /mnt/c/Users/kdbanman/Dev/gutenparse/tiny/9000
Removing DE /mnt/c/Users/kdbanman/Dev/gutenparse/tiny/9046
Removing DE /mnt/c/Users/kdbanman/Dev/gutenparse/tiny/9049
Removing FR /mnt/c/Users/kdbanman/Dev/gutenparse/tiny/9053
Removing DE /mnt/c/Users/kdbanman/Dev/gutenparse/tiny/9058
Removing DE /mnt/c/Users/kdbanman/Dev/gutenparse/tiny/9066
Removing FI /mnt/c/Users/kdbanman/Dev/gutenparse/tiny/9082
Removing DE /mnt/c/Users/kdbanman/Dev/gutenparse/tiny/9083

Non-English languages encountered during removal:
DE (5)
FI (1)
SA (1)
FR (1)

Removed 8 books.

These scripts are destructive, and you are responsible for your own backups. If you're scared, pass --dry as the last parameter to do a dry run where nothing is actually deleted.

$ python remove_multiple_authors.py collection/*/*.rdf --dry

aggregate_stats.py

A final analysis script designed to log some aggregate stats and save a CSV containing labels for each book.

TODO

balance lcc subjects
- could get clever
  - parse out single letters. might not need to, or might be a bad idea. (currently doing this though)
  - remove specifics. (EX: E456)
- might be best to just leave codes as-is and
  - remove under-represented
  - trim over-represented
balance author birth years
- don't need to get perfect uniform distribution over years
- must truncate distribution tails
- might need to trim over-represented decades
subject and birth year might not be independent, so pruned distribution might be order dependent. hopefully not much.
license and headers
perform prune and balance
- make sure original is backed up safely
  - totally different directory
  - readonly permissions for all users
- run each processor script
  1. write a sentence or two saying why this script now
  2. run script and pipe to log file python do_shit.py previous_step/*/*.rdf > current_step_log.txt 2>&1 &
  3. copy log down to local machine scp lstm-server:data/big_drive/prunes/current_step_log.txt ./
  4. rename mutated directory mv previous_step current_step
  5. back up processed dir with tar -czf current_step.tar.gz current_step
- run analysis script
actual progress
- english_only
- labelled_english_only
- single_author_labelled_english_only
- balanced_subject_single_author_labelled_english_only
- balanced_birth_year_and_subject_single_author_labelled_english_only

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src

src

.gitignore

.gitignore

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

gutenparse

TODO

About

Releases

Packages

Languages

kdbanman/gutenparse

Folders and files

Latest commit

History

Repository files navigation

gutenparse

TODO

About

Topics

Resources

Stars

Watchers

Forks

Languages