NYULibraries/dlts-epub-manager

em - DLTS EPUB Manager

em is a command-line program for managing the NYU Press open access EPUBs made available online on Open Square.

Overview

Current functions:

  • Intake of EPUB files: creation of the normalized, exploded EPUBs that are stored in the nyu-press-readium-epub-content private repo
  • Solr indexing of EPUB metadata from the source files generated by the metadata command
  • Management of the legacy EPUB handles in the handle server
  • Creation of the normalized metadata files that are used for publication and stored in dlts-epub-metadata
  • Creation and editing of epub_library.json files, which are the files used by the ReadiumJS viewer to describe a library (note that epub_library.json is a legacy format that has been superseded by epub_library.opds, an OPDS XML file)
  • Writing out metadata dump files for analysis

em can operate in either immediate execution or interactive shell mode.

Getting Started

Prerequisites

  • Node.js
  • Java for running the bundled Solr v6.6.5 used for solr tests
  • yarn for installing the dependencies (npm install sometimes fails)

Installation and setup

To use em for processing NYU Press collections:

Step 1) Clone the repo and install NPM packages using yarn (npm sometimes fails):

git clone https://github.com/NYULibraries/dlts-epub-manager.git epub-manager
cd epub-manager
yarn

Step 2) Clone the metadata and exploded EPUB repos:

git clone https://github.com/NYULibraries/dlts-epub-metadata.git ~/epub-metadata
# This is a private repo and can only be accessed by DLTS and technical partners.
git clone https://github.com/nyudlts/nyu-press-readium-epub-content ~/nyu-press-readium-epub-content

Step 3) Make private configuration files for dev, stage, and prod in config-private/. Private configuration files hold sensitive information, such as the usernames and passwords for our restful handle servers and the API key for the Supafolio API, that cannot be committed to the repo in the dev.json, stage.json, and prod.json files in config/.

somebody@host:~/epub-manager$ cat config-private/dev.json 
{
    "restfulHandleServerUsername" : "[USERNAME FOR DEV RESTFUL HANDLE SERVER]",
    "restfulHandleServerPassword" : "[PASSWORD FOR DEV RESTFUL HANDLE SERVER]",

    "supafolioApiKey" : "[SUPAFOLIO API KEY]"
}
somebody@host:~/epub-manager$ cat config-private/stage.json 
{
    "restfulHandleServerUsername" : "[USERNAME FOR STAGE RESTFUL HANDLE SERVER]",
    "restfulHandleServerPassword" : "[PASSWORD FOR STAGE RESTFUL HANDLE SERVER]",

    "supafolioApiKey" : "[SUPAFOLIO API KEY]"
}
somebody@host:~/epub-manager$ cat config-private/prod.json 
{
    "restfulHandleServerUsername" : "[USERNAME FOR PROD RESTFUL HANDLE SERVER]",
    "restfulHandleServerPassword" : "[PASSWORD FOR PROD RESTFUL HANDLE SERVER]",

    "supafolioApiKey" : "[SUPAFOLIO API KEY]"
}

Step 4) Make a local configuration if needed. The intake and metadata commands currently require a local configuration file (see Special note about configuration of intake and Special note about configuration of metadata):

somebody@host:~/epub-manager$ ls config/
dev.json   prod.json   stage.json
somebody@host:~/epub-manager$ cat > config/local.json
{
    "cacheMetadataInMemory" : true,
    
    "intakeEpubDir"         : "/home/somebody/epubs/publish/nyupress/wip",
    "intakeEpubList"        : null,
    "intakeOutputDir"       : "/home/somebody/nyu-press-readium-epub-content/",    
    
    "metadataDir"           : "/home/somebody/epub-metadata/nyupress",
    "metadataEpubList"      : null,

    "readiumJsonFile"       : "/home/somebody/nyu-press-readium-epub-content/epub_library.json",

    "restfulHandleServerHost" : "localhost:9002",
    "restfulHandleServerPath" : "/id/handle",

    "solrHost"              : "localhost",
    "solrPort"              : 8080,
    "solrPath"              : "/solr"
}

Don't forget the private configuration file:

somebody@host:~/epub-manager$ cat config-private/local.json 
{
    "restfulHandleServerUsername" : "[USERNAME FOR CHOSEN RESTFUL HANDLE SERVER]",
    "restfulHandleServerPassword" : "[PASSWORD FOR CHOSEN RESTFUL HANDLE SERVER]",

    "supafolioApiKey" : "[SUPAFOLIO API KEY]"
}

See Configuration files for more details. Also see Special note about configuration of intake.

Quickstart

Intake new EPUBs - local configuration (see Special note about configuration of intake):

# Intake EPUB files and output normalized exploded EPUB directories.
./em intake add local

Create metadata files - local configuration (see Special note about configuration of metadata):

# Create metadata files.
./em metadata add local

Handles processing - prod configuration:

# Add all prod handles to handle server.
./em handles add prod

# Delete prod handles from handle server.
./em handles delete prod

Solr indexing - dev configuration:

# Add all dev EPUB metadata to Solr index.
./em solr add dev

# Delete dev EPUB metadata from Solr index.
./em solr delete dev

# Delete everything from Solr index.
./em solr delete all dev

# Same as `delete all` followed by `add`.
./em solr full-replace dev

epub_library.json file editing - local configuration:

# Add all local EPUB metadata to file.
./em readium-json add local

# Delete local EPUB metadata from file.
./em readium-json delete local

# Delete everything from file.
./em readium-json delete all local

# Same as `delete all` followed by `add`.
./em readium-json full-replace local

Load prod configuration metadata and write to file: start interactive shell, run load prod followed by load write.

somebody@host:~/epub-manager$ ./em
em$ load prod
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Already on 'master'
em$ load write
Metadata dumped to /home/somebody/epub-manager/cache/metadata.json.
em$ quit
somebody@host:~/epub-manager$ # Metadata for prod was written to JSON file in cache directory.
somebody@host:~/epub-manager$ ls cache/metadata.json
cache/metadata.json

Get help message (note publish and verify have not been implemented yet):

somebody@host:~/epub-manager$ ./em help

  Commands:

    help [command...]                          Provides help for a given command.
    exit                                       Exits application.
    handles add [configuration]                Bind EPUB handles.
    handles delete [configuration]             Unbind EPUB handles.
    intake add [configuration]                 Intake EPUBs and generate Readium versions.
    load <configuration>                       Read in configuration file and load resources.
    load write [file]                          Write metadata out to file.
    load clear                                 Clear all loaded metadata.
    metadata add [configuration]               Generate metadata files from Supafolio API.
    publish [options]                          Publish EPUBs.
    publish add [options]                      Add EPUBs.
    publish delete [options]                   Delete EPUBs.
    publish delete all [options]               Delete all EPUBs.
    publish full-replace [options]             Replace all EPUBs.
    readium-json add [configuration]           Add EPUBs to `epub_library.json` file.
    readium-json delete [configuration]        Delete EPUBs from `epub_library.json` file.
    readium-json delete all [configuration]    Delete all EPUBs from `epub_library.json` file.
    readium-json full-replace [configuration]  Replace entire `epub_library.json` file.
    solr add [configuration]                   Add EPUBs to Solr index.
    solr delete [configuration]                Delete EPUBs from Solr index.
    solr delete all [configuration]            Delete all EPUBs from Solr index.
    solr full-replace [configuration]          Replace entire Solr index.
    verify                                     Verify integrity of published collection, handles, and metadata indexes.


Get help for specific commands in interactive mode:

somebody@host:~/epub-manager$ ./em
em$ help load

  Usage: load [options] <configuration>

  Read in configuration file and load resources.

  Options:

    --help  output usage information

em$ help solr

  Commands:

    solr add [configuration]           Add EPUBs to Solr index.
    solr delete [configuration]        Delete EPUBs from Solr index.
    solr delete all [configuration]    Delete all EPUBs from Solr index.
    solr full-replace [configuration]  Replace entire Solr index.

em$

Usage

em is built using Vorpal, a Node.js framework for building interactive CLI applications. The various EPUB management functions are executed using specific commands: handles, intake, load, metadata, readium-json, and solr. Most em commands and subcommands can be run immediately from the command line by passing them as arguments to the em script. A relatively small subset of commands can only be run in the interactive shell because they must be run as part of a sequence of commands.

The help command lists all these function commands along with information about their subcommands and options. For help on individual commands, use help COMMAND. Note that the following commands are listed in help but are not yet implemented: publish, verify. These have been set up as placeholders only (and for testing).

While in interactive shell mode, the following features are available:

  • Autocompletion via the tab key. Commands can be autocompleted, as can their subcommands. In addition, for commands that take the [configuration] option, there is autocompletion for the names of the configuration files in config/ (minus their *.json suffixes).
  • Command history using the up and down arrows.
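
The configuration-name completion described above amounts to listing the basenames of the *.json files in config/. Below is a minimal Node.js sketch of that idea. The helper name configurationNames is hypothetical, and this is an illustration only, not em's or Vorpal's actual implementation:

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper (not em's actual code): returns the names that could be
// offered as completions for [configuration], i.e. the basenames of the
// *.json files in configDir, sorted alphabetically.
function configurationNames(configDir) {
    return fs.readdirSync(configDir)
        .filter((file) => path.extname(file) === ".json")
        .map((file) => path.basename(file, ".json"))
        .sort();
}
```

For a config/ directory containing dev.json, local.json, prod.json, and stage.json, configurationNames("config") returns the four names dev, local, prod, and stage.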

General note about operations

Most of the commands share a similar set of subcommands which run specific operations whose semantics are generally the same for all commands. In each case, EPUB-related data are first loaded by a load [configuration] operation (which is performed transparently if [configuration] is used with the current command). The subcommand then performs operations on the destination, which is usually a datastore of some kind or a filesystem.

  • add: add EPUB data to the destination, updating in place any EPUBs that already exist. Do not delete any existing EPUBs.
  • delete: delete the EPUB data specified by [configuration] from the destination. Do not delete any other data for EPUBs that are already there.
  • delete all: delete all EPUB data from the destination, regardless of whether the EPUBs are specified in [configuration].
  • full-replace: this is a delete all followed by an add.
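
The semantics of the four subcommands can be modeled on a simple in-memory destination. The sketch below is an illustration under stated assumptions: the destination is a hypothetical Map keyed by EPUB id, the function names are invented for this example, and the titles are placeholders. It is not how em talks to Solr, the handle server, or epub_library.json:

```javascript
// Hypothetical in-memory destination: a Map from EPUB id to EPUB record.
// em's real destinations are a Solr index, a handle server, or a JSON file.

// add: insert new EPUBs and update existing ones in place; nothing is deleted.
function add(destination, epubs) {
    for (const epub of epubs) {
        destination.set(epub.id, epub);
    }
}

// delete: remove only the EPUBs named by the loaded configuration.
function remove(destination, epubs) {
    for (const epub of epubs) {
        destination.delete(epub.id);
    }
}

// delete all: empty the destination regardless of the configuration.
function removeAll(destination) {
    destination.clear();
}

// full-replace: a delete all followed by an add.
function fullReplace(destination, epubs) {
    removeAll(destination);
    add(destination, epubs);
}

// One EPUB is already at the destination; two are loaded from a configuration.
const destination = new Map([
    ["9780814723418", { id: "9780814723418", title: "placeholder C" }],
]);
const loaded = [
    { id: "9780814707821", title: "placeholder A" },
    { id: "9780814707517", title: "placeholder B" },
];

add(destination, loaded);
add(destination, loaded);       // repeating add only rewrites the same entries
console.log(destination.size);  // 3

remove(destination, loaded);
console.log(destination.size);  // 1: the EPUB not in the configuration survives

fullReplace(destination, loaded);
console.log(destination.size);  // 2: only the loaded EPUBs remain
```

Keying the destination by EPUB id is what makes add safe to repeat, which matches the behavior shown in the readium-json example later in this document.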

Examples

See Quickstart for some basic usage examples. Below are some more detailed use cases. No detailed use case is provided for intake because this command has only one subcommand add and is usually run as a one-shot (see Special note about configuration of intake).

Note about invocation

Most of the examples given will employ the interactive shell. With few exceptions, the command invocations shown can also be performed in immediate execution mode. For example, the following command invocations do the same thing:

In em shell, using the tab key to get suggestions for [configuration]:

somebody@host:~/epub-manager$ ./em
em$ readium-json add
dev  local  prod  stage
em$ readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.

Immediately executed on the command line:

somebody@host:~/epub-manager$ ./em readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.

EXAMPLE: Update Solr index and epub_library.json for local (from Installation and setup), then add to Solr index for dev.


Note that local configuration specifies metadataDir while dev specifies metadataRepo, metadataRepoBranch, and metadataRepoSubdirectory.

somebody@host:~/epub-manager$ ./em
em$ load local
em$ solr add
Added 67 EPUBs to Solr index:
9780814707821
9780814707517
9780814725078
...
[SNIPPED]
em$ readium-json add
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
em$ load dev
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'develop'
em$ solr add
Added 67 EPUBs to Solr index:
9780814707821
9780814707517
9780814725078
...
[SNIPPED]
em$ quit

...or...

somebody@host:~/epub-manager$ ./em
em$ solr add local
Added 67 EPUBs to Solr index:
9780814707821
9780814707517
9780814725078
...
[SNIPPED]
em$ readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
em$ load dev
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'develop'
em$ solr add dev
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'develop'
Added 67 EPUBs to Solr index:
9780814707821
9780814707517
9780814725078
...
[SNIPPED]
em$ quit

Note that it is not possible to rewrite the remote epub_library.json file sitting on the dev server. The epub_library.json file rewrite is always local. Thus, in this use case, the user presumably switched the local repo /home/somebody/nyu-press-readium-epub-content/ to the develop branch before running the readium-json command.

Rewriting the epub_library.json file for a local instance of the ReadiumJS viewer would have involved changing the readiumJsonFile option in config/local.json from the path to the repo copy /home/somebody/nyu-press-readium-epub-content/epub_library.json to the path of the library content directory of a locally installed ReadiumJS viewer: e.g. /var/www/html/readium-js-viewer/cloud-reader/epub_content/epub_library.json.


EXAMPLE: Dump metadata for 3 EPUBs into file cache/3-epubs.json, then delete them from stage Solr index, then dump the metadata again to /tmp/3-epubs.json.


Note that load write [file] cannot be run in immediate execution mode, because it must be preceded by load [configuration].

Copy config/stage.json to config/ad-hoc.json (for example) and change:

"metadataEpubList"              : null,

...to:

"metadataEpubList"              : [ "9780814707821", "9780814707517", "9780814725078" ],

...then:

somebody@host:~/epub-manager$ ./em
em$ load ad-hoc
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'stage'
em$ load write cache/3-epubs.json
Metadata dumped to cache/3-epubs.json.
em$ quit
somebody@host:~/epub-manager$ ls cache/3-epubs.json
cache/3-epubs.json
somebody@host:~/epub-manager$ ./em
em$ solr delete ad-hoc
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'stage'
Deleted 9780814707821 from Solr index.
Deleted 9780814707517 from Solr index.
Deleted 9780814725078 from Solr index.
Deleted 3 EPUBs.
em$ quit
somebody@host:~/epub-manager$ cat cache/3-epubs.json
cat: cache/3-epubs.json: No such file or directory
somebody@host:~/epub-manager$ # Whoops, cache/ was cleared when `em` was restarted for `solr delete ad-hoc`.
somebody@host:~/epub-manager$ # Write the file again, this time to /tmp/:
somebody@host:~/epub-manager$ ./em
em$ load ad-hoc
em$ load write /tmp/3-epubs.json
Metadata dumped to /tmp/3-epubs.json.
em$ quit
somebody@host:~/epub-manager$ ls /tmp/3-epubs.json
/tmp/3-epubs.json

EXAMPLE: Add handles for prod, then delete handles specified in ad-hoc.


somebody@host:~/epub-manager$ ./em
em$ handles add prod
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Already on 'master'
Added 67 handles to handles server:
9780814707821: 2333.1/37pvmfhh
9780814707517: 2333.1/4tmpg641
9780814725078: 2333.1/zgmsbf5k
9780814723418: 2333.1/9s4mw88v
9780814786086: 2333.1/tqjq2dn7
9780814786123: 2333.1/ffbg7c4r
...
[SNIPPED]
em$ handles delete ad-hoc
Cloning into '/home/somebody/epub-manager/cache/metadataRepo'...
Switched to a new branch 'develop'
Added 3 handles to handles server:
9780814784891: 2333.1/b8gthvz5
9781479863570: 2333.1/73n5tfjs
9781479829712: 2333.1/brv15j8p
em$ quit

EXAMPLE: Delete all EPUBs in epub_library.json file for local, then add local EPUBs twice, then do a full replace.


somebody@host:~/epub-manager$ ./em
em$ readium-json delete all local
Deleted all EPUBs from /home/somebody/nyu-press-readium-epub-content/epub_library.json.
em$ quit
somebody@host:~/epub-manager$ cat /home/somebody/nyu-press-readium-epub-content/epub_library.json
[]
somebody@host:~/epub-manager$ # Accidentally add `local` EPUBs twice.  The second
somebody@host:~/epub-manager$ # `add` will simply update with the same content.
somebody@host:~/epub-manager$ ./em
em$ readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
em$ readium-json add local
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
em$ quit
somebody@host:~/epub-manager$ # Verify that the file only has 67 EPUBs in it, despite
somebody@host:~/epub-manager$ # having run `readium-json add local` twice.
somebody@host:~/epub-manager$ grep '"identifier":' /home/somebody/nyu-press-readium-epub-content/epub_library.json | wc -l
      67
somebody@host:~/epub-manager$ # But do a full replace anyway...
somebody@host:~/epub-manager$ ./em
em$ readium-json full-replace local
Deleted all EPUBs from /home/somebody/nyu-press-readium-epub-content/epub_library.json.
Added to Readium JSON file /home/somebody/nyu-press-readium-epub-content/epub_library.json for conf "local": 67 EPUBs.
Fully replaced all EPUBs in Readium JSON for conf local.
em$ quit
somebody@host:~/epub-manager$ grep '"identifier":' /home/somebody/nyu-press-readium-epub-content/epub_library.json | wc -l
      67
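
The grep-based count above can also be done by parsing the file, which additionally confirms that no identifier appears twice. The following is a minimal sketch, assuming that epub_library.json holds a top-level JSON array of objects that each carry an "identifier" property (suggested by the [] and grep output above, but not otherwise documented here). The function name summarizeLibrary is hypothetical:

```javascript
// Assumes a top-level JSON array of objects, each with an "identifier"
// property. Returns the number of entries and the number of distinct ids.
function summarizeLibrary(jsonText) {
    const entries = JSON.parse(jsonText);
    const identifiers = entries.map((entry) => entry.identifier);
    return {
        total: entries.length,
        unique: new Set(identifiers).size,
    };
}

// Small inline sample standing in for the real file contents.
const sample = JSON.stringify([
    { identifier: "9780814707821" },
    { identifier: "9780814707517" },
]);
console.log(summarizeLibrary(sample)); // { total: 2, unique: 2 }
```

To check a real file, pass the result of reading it as UTF-8 text to summarizeLibrary. If total and unique differ, the file contains duplicate entries.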

Running the tests

# Run all acceptance and unit tests
yarn test

# Run unit tests
yarn test:lib

# Run individual unit tests or group of tests
node_modules/.bin/jest [PATH TO *.test.js* FILE OR FILES]

# Run acceptance tests
yarn test:acceptance

# Run acceptance tests for individual commands
# Note that certain acceptance test suites cannot be run simultaneously --
# see https://jira.nyu.edu/jira/browse/NYUP-742.  For this reason, `yarn test:acceptance`
# uses the --runInBand Jest option.
node_modules/.bin/jest test/acceptance/handles
node_modules/.bin/jest test/acceptance/intake
node_modules/.bin/jest test/acceptance/load
node_modules/.bin/jest test/acceptance/metadata
node_modules/.bin/jest test/acceptance/readium-json
node_modules/.bin/jest test/acceptance/solr

Note that the solr tests require that the test/solr/ Solr instance be running. If it is not running, the test will produce an error message with instructions on how to start the test Solr:

somebody@host:~/epub-manager$ mocha test/acceptance/solr


  solr command
    1) "before all" hook


  0 passing (184ms)
  1 failing

  1) solr command "before all" hook:
     AssertionError:

Solr is not responding.  Try running Solr setup and start script:

	test/solr/start-solr-test-server.sh

Error: connect ECONNREFUSED 127.0.0.1:9001
      at Context.before (test/acceptance/solr.js:41:20)

The test/solr/start-solr-test-server.sh script is a modified version of django-haystack's. Running it will do the following:

  • Download the appropriate Solr archive to test/solr/download-cache/. This step is skipped if the archive exists already.
  • Unpack the archive, install the Solr server and configure it using the files in test/solr/config-files/.
  • Start Solr on port 9001 in the foreground. To start it in the background, set BACKGROUND_SOLR to a non-empty value:
BACKGROUND_SOLR=true test/solr/start-solr-test-server.sh

To stop the server, simply kill the process.

Configuration files

Configuration files are stored in config/ and config-private/. Each file in config/ must have a corresponding, identically named file in config-private/ for storing sensitive information related to that configuration. The basenames of the files in config/ are the configuration names that can be specified as options for various em commands, and are used as autocomplete possibilities for commands that take a configuration option.

The dev, stage, and prod configurations for NYU Press collections are already included in the repo in config/. Individual clones of this repo must have local config-private/ files corresponding to these three configurations. See Installation and setup, Step 3.

New configuration files can be created in config/ and will be ignored by git. The contents of config-private/ are ignored by git entirely.
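
The pairing of config/ and config-private/ files can be pictured as a merge of two JSON objects. The sketch below is an assumption about how such a merge might look, not a description of em's internals. The helper name loadMergedConfig and the error message are hypothetical:

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper (not em's actual code): reads config/<name>.json and
// config-private/<name>.json under rootDir and merges them into one object.
function loadMergedConfig(rootDir, name) {
    const publicFile = path.join(rootDir, "config", `${name}.json`);
    const privateFile = path.join(rootDir, "config-private", `${name}.json`);

    // Every config/ file needs an identically named config-private/ file.
    if (!fs.existsSync(privateFile)) {
        throw new Error(`Missing private configuration file: ${privateFile}`);
    }

    const publicConf = JSON.parse(fs.readFileSync(publicFile, "utf8"));
    const privateConf = JSON.parse(fs.readFileSync(privateFile, "utf8"));

    // Private values are spread last, so credentials come from config-private/.
    return { ...publicConf, ...privateConf };
}
```

Failing early when the private file is missing surfaces a forgotten config-private/ file before any request is sent to a handle server or to Supafolio.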

config/ file properties:

  • cacheMetadataInMemory: true to load all metadata at once into memory for faster processing, otherwise false. Currently only true is supported.
  • intakeEpubDir: directory containing the *.epub files to be processed by the intake system. The subdirectory names are also used as the ISBN list for the metadata command if intakeEpubList is not specified.
  • intakeEpubList: array of EPUB ids specifying the EPUBs to be processed by the intake system. All other EPUBs will be ignored. If this option is not specified then the names of the subdirectories in intakeEpubDir will be used for the EPUB list. Example: [ "9780814707821", "9780814707517", "9780814725078" ]
  • intakeOutputDir: directory to output the normalized, exploded EPUBs to
  • metadataDir: full path to the directory containing the metadata files. For NYU Press collections, this would be the nyupress directory in the local clone of the dlts-epub-metadata repo. If this option is specified, metadata repo options will be ignored. Example: "/home/somebody/epub-metadata/nyupress"
  • metadataEpubList: array of EPUB ids specifying the EPUBs to be processed by the metadata system. All other EPUBs will be ignored. If this option is not specified then the names of the subdirectories in metadataDir will be used for the EPUB list. Example: [ "9780814707821", "9780814707517", "9780814725078" ]
  • Metadata repo options -- these will be ignored if metadataDir has been specified.
    • metadataRepo: URL for the git repo containing the metadata. The repo will be cloned locally using git clone [metadataRepo]. Example: "https://github.com/NYULibraries/dlts-epub-metadata.git"
    • metadataRepoBranch: branch or commit to use. Will be checked out using git checkout [metadataRepoBranch]. Examples:
      • "master"
      • "0c18465a5c80c056088e98d45b6dd621e6001a7b"
    • metadataRepoSubdirectory: relative path to subdirectory containing the metadata to be processed. Example: "nyupress"
  • readiumJsonFile: full path to the epub_library.json file. Example: "/home/somebody/nyu-press-readium-epub-content/epub_library.json"
  • restfulHandleServerHost: hostname of the restful handle server. Example: "devhandle.dlib.nyu.edu"
  • restfulHandleServerPath: path to use for handle requests. Example: "/id/handle"
  • solrHost: hostname of Solr server. Example: "localhost"
  • solrPort: port that Solr is running on. Example: 8080
  • solrPath: path to use for Solr requests. Example: "/solr/nyupress"
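
The precedence between metadataDir and the metadata repo options can be sketched as follows. This is an illustration only: the helper name describeMetadataSource is hypothetical, the clone location cache/metadataRepo is taken from the sample output elsewhere in this document, and the function merely describes the git commands rather than running them:

```javascript
const path = require("path");

// Hypothetical helper (not em's actual code). Given a parsed config/ file and
// a cache directory, describes where the metadata would be read from.
function describeMetadataSource(conf, cacheDir) {
    // metadataDir takes precedence: the repo options are ignored when it is set.
    if (conf.metadataDir) {
        return { type: "directory", commands: [], dir: conf.metadataDir };
    }

    const cloneDir = path.join(cacheDir, "metadataRepo");
    return {
        type: "repo",
        // The checkout would be run inside cloneDir after the clone finishes.
        commands: [
            `git clone ${conf.metadataRepo} ${cloneDir}`,
            `git checkout ${conf.metadataRepoBranch}`,
        ],
        dir: path.join(cloneDir, conf.metadataRepoSubdirectory),
    };
}
```

With a configuration that sets metadataDir, the result has type "directory" and an empty command list. With only the repo options set, it lists the clone and checkout commands and points dir at the configured subdirectory of the clone.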

config-private/ file properties:

  • restfulHandleServerUsername: user authorized to add, update, and delete on the restful handle server.
  • restfulHandleServerPassword: password for the user authorized to add, update, and delete on the restful handle server.
  • supafolioApiKey: API key for the Supafolio Open Square catalog.

For example configuration files that illustrate the correct usage of all the above options, look in config/ and test/acceptance/fixtures/config/.

Special note about configuration of intake

intake configuration is not included in the dev, stage, and prod configurations because intake is a one-shot process that is done when a collection is first received by the publisher, and is never run again as part of a publication job to dev, stage, or prod. intake creates the data and metadata that are consumed by other commands, such as handles or solr, that are run on a more regular basis against the dev, stage, and prod servers.

Special note about configuration of metadata

The metadata command currently determines which ISBNs to fetch metadata for by reading the subdirectory names in intakeEpubDir or the ISBN list in intakeEpubList. Note that these options are not present in the dev, stage, and prod configurations - see Special note about configuration of intake.
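
The fallback described above can be sketched in a few lines of Node.js. The helper name resolveIsbnList is hypothetical, and treating a null intakeEpubList as "not specified" is an assumption based on the sample local configuration. This is not em's actual code:

```javascript
const fs = require("fs");

// Hypothetical helper (not em's actual code): intakeEpubList wins when it is
// an array; otherwise fall back to the subdirectory names in intakeEpubDir.
function resolveIsbnList(conf) {
    if (Array.isArray(conf.intakeEpubList)) {
        return conf.intakeEpubList;
    }

    return fs.readdirSync(conf.intakeEpubDir, { withFileTypes: true })
        .filter((entry) => entry.isDirectory()) // ignore stray files
        .map((entry) => entry.name)
        .sort();
}
```

Passing a configuration whose intakeEpubList is [ "9780814707821" ] returns that list unchanged, while a null list yields one entry per subdirectory of intakeEpubDir.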

This means that there are no dev, stage, or prod configurations for the metadata command either. This is intentional, as the metadata command targets a local directory and does not run against dev, stage, or prod servers or in dev, stage, or prod environments. This local directory will usually be a local clone of dlts-epub-metadata, with dev, stage, or prod branches checked out as needed.

Future enhancements

  • Solr indexing of EPUB full-text content
  • Creation and editing of epub_library.opds files
  • One-step publications of EPUBs: full processing of new EPUBs -- all functions performed in one step
  • Verification of collection: check that decompressed EPUB files, Solr index, and epub_library.json file are in sync