
Python tools for parsing documents and building an inverted index with enriched metadata. A Java version with slightly different features is available at https://github.com/USCDataScience/parser-indexer


USCDataScience/parser-indexer-py


Parser-Indexer

This project contains tools for parsing files and indexing them into Solr. It also includes tools for information extraction.

For instructions on training the custom Named Entity Recognition model with Stanford CoreNLP, see src/corenlp.

Requirements

  1. Solr
  2. Parser Server

1. Setting up Solr

Download Solr

mkdir workspace && cd workspace
wget http://archive.apache.org/dist/lucene/solr/6.1.0/solr-6.1.0.tgz
tar xvzf solr-6.1.0.tgz
cd solr-6.1.0

Start and Create a Core

PORT=8983
bin/solr start -p $PORT
bin/solr create_core -c docs -d $YOUR_PATH/conf/solr/docs -p $PORT
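After creating the core, it can be useful to confirm that Solr actually loaded it. The following is a minimal sketch, not part of this repository, that queries Solr's CoreAdmin STATUS endpoint using only the Python standard library. The function names (`core_status_url`, `core_is_loaded`, `check_core`) are hypothetical, and the sketch assumes Solr runs on localhost and returns an empty entry for a core that is not loaded.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def core_status_url(core, port=8983, host="localhost"):
    """Build the CoreAdmin STATUS URL for a single core (hypothetical helper)."""
    query = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return f"http://{host}:{port}/solr/admin/cores?{query}"


def core_is_loaded(status_response, core):
    """Return True if the parsed STATUS response has a non-empty entry for the core.

    Assumption: Solr reports an unknown core as an empty object under "status".
    """
    return bool(status_response.get("status", {}).get(core))


def check_core(core="docs", port=8983, host="localhost", timeout=5):
    """Query a running Solr instance; raises URLError if Solr is unreachable."""
    with urlopen(core_status_url(core, port, host), timeout=timeout) as resp:
        return core_is_loaded(json.load(resp), core)
```

With Solr started as above, calling `check_core("docs", port=8983)` should return `True` once the `docs` core exists.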

2. Parser Server

Refer to the README in the parser-server subdirectory.

Examples

Check out the docs folder.

  • To parse and index journals: docs/parser-index-journals.md
