ner-ternary

What is it?

ner-ternary is a command-line tool that annotates a directory of documents with key terms provided in a .tsv file or, for smaller tasks, supplied directly on the command line (this second option is not yet implemented).

The end result is a directory of .json files with the following annotation structure:

{
    "Annotations": {
        "phrases": {
            ...
        },
        "tokens": {
            "key1": {
                "line": {
                    "x": [
                        [start1_char_pos, end1_char_pos],
                        [start2_char_pos, end2_char_pos]
                    ],
                    "y": [
                        [..., ...]
                    ],
                    .
                    .
                    .
                }
            },
            "key2": {
                "line": {
                    "x": [
                        [..., ...]
                    ],
                    "y": [
                        [..., ...]
                    ],
                    .
                    .
                    .
                }
            }
        }
    }
}
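As a rough illustration of how this structure could be consumed, the sketch below flattens the "tokens" section into one tuple per span. It is not part of ner-ternary: the function name `iter_token_spans` and the sample data are hypothetical, and it assumes that the "x" and "y" placeholders are line identifiers stored as JSON object keys.

```python
import json
from typing import Iterator, Tuple


def iter_token_spans(annotation: dict) -> Iterator[Tuple[str, str, int, int]]:
    """Yield (key, line_id, start, end) for every span under Annotations.tokens."""
    tokens = annotation.get("Annotations", {}).get("tokens", {})
    for key, entry in tokens.items():
        # Assumption: each key maps to {"line": {line_id: [[start, end], ...]}}.
        for line_id, spans in entry.get("line", {}).items():
            for start, end in spans:
                yield key, line_id, start, end


# Hypothetical sample shaped like the structure above.
sample = json.loads("""
{
  "Annotations": {
    "phrases": {},
    "tokens": {
      "protein": {"line": {"3": [[10, 17], [42, 49]], "8": [[0, 7]]}}
    }
  }
}
""")

spans = list(iter_token_spans(sample))
for key, line_id, start, end in spans:
    print(f"{key}: line {line_id}, chars {start}-{end}")
```

Flattening to tuples makes it straightforward to load the spans into a table or to count occurrences per key.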

A basic example of the file structure this tool is aimed at:

root/
    key_words.tsv
    Documents/
        doc1.txt
        doc2.txt
        doc3.txt
        .
        .
        .
    out/

Clean Any Data

If your key words are stored in a .tsv file, you can extract them with the clean-tsv command.

Command: clean-tsv
- Options:
  - -s: Shuffle the words written to outfile.txt. If omitted, the words are written to outfile.txt in alphabetical order.
  - -c <x>: Required. Tells the program which column of the .tsv to extract.
  - The remaining arguments are the path to the input .tsv and the path where outfile.txt should be written.

nerternary clean-tsv -s -c <x> <path/to/input.tsv> <path/to/outfile.txt>
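For readers who want to see what this step amounts to, here is a minimal Python sketch of the same idea: pull one column out of a .tsv and write it one word per line, sorted or shuffled. It is not the tool's implementation. The function name `extract_column` is hypothetical, and the sketch assumes a zero-based column index, no header row, and no de-duplication, none of which is confirmed for clean-tsv.

```python
import csv
import random
import tempfile
from pathlib import Path


def extract_column(tsv_path, out_path, column, shuffle=False):
    """Write the values of one .tsv column to out_path, one per line."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        rows = csv.reader(handle, delimiter="\t")
        # Skip rows that are too short or whose cell is empty.
        words = [row[column].strip() for row in rows
                 if len(row) > column and row[column].strip()]
    if shuffle:
        random.shuffle(words)   # corresponds to the -s flag
    else:
        words.sort()            # plain code-point order (case-sensitive)
    Path(out_path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words


# Small demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tsv = Path(tmp) / "key_words.tsv"
    tsv.write_text("1\tzinc\n2\tactin\n3\tmyosin\n", encoding="utf-8")
    out = Path(tmp) / "outfile.txt"
    words = extract_column(tsv, out, column=1)
    written = out.read_text(encoding="utf-8").splitlines()
```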

Annotation

After cleaning the data (or if you already have a .txt file with each desired word on its own line), use the annotate command.

Command: annotate
- Options:
  - -p: Switch on parallel processing. The input .txt files will be split across the desired number of processors.
  - -t: The number of threads/processors to use. Please check your system's capabilities before requesting more processors than it has.
  - -w <"list, of, words">: NOT IMPLEMENTED. Supply a list of words to annotate for via the command line.
  - The remaining arguments are the .txt file of words/phrases to annotate for, the path to the input directory containing the .txt files to annotate, and the output directory, where an annotations/ subdirectory will be created and the .json annotations will be written.
FUTURE: You will also be able to supply a list of words to annotate for via the command line.
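The sketch below illustrates one way the annotation step could produce the "tokens" structure shown earlier. It is an assumption-laden illustration, not the tool's code: `annotate_text` and `TOKEN` are hypothetical names, matching is assumed to be case-insensitive, line numbers are assumed to start at 1, and character positions are assumed to be relative to the start of each line with an exclusive end.

```python
import re

# Assumed token pattern: a single word or a hyphenated word.
TOKEN = re.compile(r"\w+(?:-\w+)*")


def annotate_text(text, key_words):
    """Return {key: {"line": {line_no: [[start, end], ...]}}} for one document."""
    wanted = {word.lower() for word in key_words}
    result = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN.finditer(line):
            key = match.group().lower()
            if key in wanted:
                per_line = result.setdefault(key, {"line": {}})["line"]
                per_line.setdefault(str(line_no), []).append(
                    [match.start(), match.end()]
                )
    return result


# Made-up two-line document.
doc = "Actin binds myosin.\nThe actin-binding site is conserved.\n"
tokens = annotate_text(doc, ["actin", "actin-binding"])
print(tokens)
```

Because the pattern treats a hyphenated word as one token, "actin-binding" on the second line is recorded under its own key and does not also count as an occurrence of "actin".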

Limitations

Currently this CLI only supports annotating single or hyphenated words, because noun phrases are not built after tokenization.
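To make the limitation concrete, the following sketch shows why token-level matching misses multi-word terms. The tokenizer pattern and the name `matched_terms` are assumptions for illustration only and may differ from what ner-ternary does internally.

```python
import re

# Assumed tokenizer: single words and hyphenated words become one token each.
TOKEN = re.compile(r"\w+(?:-\w+)*")


def matched_terms(text, terms):
    """Return the terms that appear as a whole token in text."""
    tokens = {m.group().lower() for m in TOKEN.finditer(text)}
    return [term for term in terms if term.lower() in tokens]


text = "Long-term memory relies on synaptic plasticity."
found = matched_terms(text, ["memory", "long-term", "synaptic plasticity"])
print(found)
```

"synaptic plasticity" is split into two tokens, so it never equals any single token. Matching it would require rebuilding phrases from adjacent tokens after tokenization.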

