CMUEberlyCenter/eberly-docuscope-tagger

DocuScope Tagger Service

A web API for tagging text, based on the DocuScope Ity tagger.

Administration and Support

For any questions regarding the overall project or the language model used, please contact suguru@cmu.edu.

The project code is supported and maintained by the Eberly Center at Carnegie Mellon University. For help with this fork, project, or service please contact eberly-assist@andrew.cmu.edu.

Requirements

  1. A Neo4J database.
  2. A DocuScope dictionary stored in the Neo4J database, generated using the CMU_Sidecar/docuscope-dictionary-tools/docuscope-rules> docuscope-rule-neo4j tool and a DocuScope language model.
  3. A common-dict.json file that specifies a hierarchical organization of clusters (see the JSON Schema).
  4. A wordclasses.json file, which is the JSON version of a DocuScope language model's _wordclasses.txt file, converted using the CMU_Sidecar/docuscope-dictionary-tools/docuscope-rules> docuscope-wordclasses tool.
  5. A ${DICTIONARY}_tones.json.gz file, which is the compressed JSON version of a DocuScope _tones.txt file, converted using the CMU_Sidecar/docuscope-dictionary-tools/docuscope-tones> ds-tones tool.
  6. A MySQL database for storing CMU_Sidecar/docuscope-classroom> documents and performance measures.
  7. Optional: Memcached.
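
As a quick pre-flight check, the file requirements above can be verified before starting the service. The following is a minimal sketch, not part of the project: the function name `missing_dictionary_files` is hypothetical, and it assumes the three files sit directly inside the DICTIONARY_HOME directory, which this README does not state explicitly.

```python
from pathlib import Path


def missing_dictionary_files(dictionary_home, dictionary="default"):
    """Return the names of expected dictionary files that are absent.

    Assumption: all files live directly under `dictionary_home`.
    """
    expected = [
        "common-dict.json",
        "wordclasses.json",
        f"{dictionary}_tones.json.gz",  # name depends on DICTIONARY
    ]
    home = Path(dictionary_home)
    return [name for name in expected if not (home / name).is_file()]
```

An empty list means every expected file was found; any names returned point at files that still need to be generated with the tools listed above.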

Configuration

The following environment variables should be set so that the DocuScope tagger can access the required services. The defaults are reasonable for a development environment where everything is hosted locally; they do not reflect values that should be used in a production environment.

| Variable | Description | Default |
|---|---|---|
| DICTIONARY | String used in formulating tag labels and used to load the correct dictionary files. | default |
| DICTIONARY_HOME | Path to base directory of necessary runtime dictionary files specified above. | <Application's base directory>/dictionary |
| DB_HOST | Hostname of the MySQL database for storing processed documents. | 127.0.0.1 |
| DB_PORT | Port of the MySQL document database. | 3306 |
| DB_PASSWORD | Password for accessing the document database. [1] | [2] |
| DB_USER | Username for accessing the document database. [1] | docuscope |
| MEMCACHED_URL | Hostname for the optional caching service. | localhost |
| MEMCACHED_PORT | Port of the caching service. | 11211 |
| MYSQL_DATABASE | Identifier for document database. | docuscope |
| NEO4J_DATABASE | Identifier for dictionary database. | neo4j |
| NEO4J_PASSWORD | Password for accessing the dictionary database. [1] | [2] |
| NEO4J_USER | Username for accessing the dictionary database. [1] | neo4j |
| NEO4J_URI | URI of the dictionary database. | neo4j://localhost:7687/ [3] |
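
To illustrate how these variables and defaults fit together, here is a minimal sketch of resolving them from the environment. It is not the service's actual configuration code: `DEFAULTS` and `resolve_settings` are hypothetical names, values are kept as strings for simplicity, and DICTIONARY_HOME is left out because its default depends on the application's base directory.

```python
import os

# Defaults copied from the configuration listing above.
# Passwords default to None, as noted in the footnotes.
DEFAULTS = {
    "DICTIONARY": "default",
    "DB_HOST": "127.0.0.1",
    "DB_PORT": "3306",
    "DB_PASSWORD": None,
    "DB_USER": "docuscope",
    "MEMCACHED_URL": "localhost",
    "MEMCACHED_PORT": "11211",
    "MYSQL_DATABASE": "docuscope",
    "NEO4J_DATABASE": "neo4j",
    "NEO4J_PASSWORD": None,
    "NEO4J_USER": "neo4j",
    "NEO4J_URI": "neo4j://localhost:7687/",
}


def resolve_settings(env=None):
    """Return each setting from `env`, falling back to its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}
```

Calling `resolve_settings()` with no arguments reads from the current process environment; passing a plain dictionary instead makes it easy to check a deployment's values without touching the real environment.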

Usage

  1. Build the Docker image: docker build -t <tag> . When deployed, the service is bound to port 80 of the Docker container.
  2. Run locally: pipenv run hypercorn app.main:app --bind 0.0.0.0:8000

This is meant to work in conjunction with CMU_Sidecar/docuscope-classroom> which is designed for visualizing and analyzing the results in a classroom setting and with DocuScope Write & Audit.

Acknowledgements

This project was partially funded by the A.W. Mellon Foundation, Carnegie Mellon University's Simon Initiative Seed Grant, and the Berkman Faculty Development Fund.


Footnotes

  1. It is recommended to use Docker secrets to get these values. The application is able to retrieve values from specified files if the environment variable has the _FILE affix added.

  2. Passwords intentionally default to None value for security reasons.

  3. See Neo4J Python Driver information for more details on the various valid protocols.
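
The _FILE convention described in the first footnote can be sketched as follows. This is an illustration only, not the application's own code: the helper name `read_setting` is hypothetical, and both the `<NAME>_FILE` naming (for example `DB_PASSWORD_FILE`) and the choice to let the file take precedence over the plain variable are assumptions.

```python
import os
from pathlib import Path


def read_setting(name, default=None, env=None):
    """Look up a setting, preferring a secrets file when one is given.

    Assumption: if <name>_FILE is set, its value is a path to a file
    whose contents (minus surrounding whitespace) are the setting.
    """
    env = os.environ if env is None else env
    file_path = env.get(f"{name}_FILE")
    if file_path:
        return Path(file_path).read_text(encoding="utf-8").strip()
    # No file configured: fall back to the plain variable or the default.
    return env.get(name, default)
```

With Docker secrets, the file path would typically point at a mounted secret, so the password itself never appears in the container's environment.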