Felix Archive Builder

This repo contains scripts used to extract and clean up texts from Felix archive in pdf format, as well as the finished output

extract_text.sh

Prerequisite: Have pdftotext from Xpdf command line tools installed
Usage: extract_text.sh [first issue inclusive] [last issue inclusive] [directory containing pdfs (default ".")] [output directory (default "text")]

This bash script extracts the OCRed text from Felix archive PDFs into text documents. Each issue has its own folder, in which there are text files representing pages in that issue. The text files are named with the page number in the issue PDF file, not the page number displayed in paper.

The output of this script is in text folder.

solrify.py

Prerequisite: Python 3.8 or above, Unix-like OS preferred
Usage: python3 solrify.py [first issue inclusive] [last issue inclusive] [directory containing output of extract_text.sh (default "text")] [output directory (default "output")]

This Python script reads the OCRed archive content, performs a cleanup and output json files ready to be imported into solr. The cleanup involves mainly two things:

Rebuild hyphen-segmented words at the end of a line
Strip off extra spaces in words (a severe problem in earlier issues)

Running this script on a Unix-like OS (such as Linux or macOS) is preferred as it can utilise multiprocessing. Running this on Windows will be extremely slow.

It outputs one json file per issue. It will try to find a file called felix_dates.csv to add date information into the json files.

output folder contains the output of this script

extract_date.py

Prerequisite: Python 3.8 or above, dateutil package
Usage: python3 extract_date.py [directory containing output of extract_text.sh] [output csv file name]

This Python script attempts to extract the issue date from the text of each issue. The result requires manual validation. A manually validated (although not error-free) mapping between issue number and issue dates is in felix_dates.csv

configsets/felix_archive

Prerequisite: Apache Solr
Usage: cp -r configsets/felix_archive [solr_root]/server/solr/configsets

This is the solr schema. It has 5 fields: id, date, issue, page, content. This should be copied to [solr_root]/sever/solr/configsets so that a new solr core can be created with this schema using bin/solr create -c [core name] -d felix_archive

missing_issues.txt

This text file lists the issue numbers that are missing from the archive we have

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
configsets/felix_archive/conf		configsets/felix_archive/conf
output		output
.gitignore		.gitignore
README.md		README.md
extract_date.py		extract_date.py
extract_text.sh		extract_text.sh
felix_dates.csv		felix_dates.csv
missing_issues.txt		missing_issues.txt
solrify.py		solrify.py
words-by-frequency.txt		words-by-frequency.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

configsets/felix_archive/conf

configsets/felix_archive/conf

output

output

.gitignore

.gitignore

README.md

README.md

extract_date.py

extract_date.py

extract_text.sh

extract_text.sh

felix_dates.csv

felix_dates.csv

missing_issues.txt

missing_issues.txt

solrify.py

solrify.py

words-by-frequency.txt

words-by-frequency.txt

Repository files navigation

Felix Archive Builder

extract_text.sh

solrify.py

extract_date.py

configsets/felix_archive

missing_issues.txt

About

Releases

Packages

Languages

FelixOnline/FelixArchiveBuilder

Folders and files

Latest commit

History

Repository files navigation

Felix Archive Builder

extract_text.sh

solrify.py

extract_date.py

configsets/felix_archive

missing_issues.txt

About

Resources

Stars

Watchers

Forks

Languages