ner-ternary is a command line tool that will annotate a directory of documents with key terms provided in a .tsv
file or (for smaller tasks) supplied through the command.
The end result is a directory of .json
files with the following annotation structure:
{
"Annotations": {
"phrases": {
...
},
"tokens": {
"key1": {
"line": {
"x": [
[start1_char_pos, end1_char_pos],
[start2_char_pos, end2_char_pos]
],
"y": [
[..., ...]
],
.
.
.
},
"key2": {
"line": {
"x": [
[..., ...]
],
"y": [
[..., ...]
],
.
.
.
}
}
}
}
}
}
A basic example of the type of file structure this tool is aimed at would be:
root/
key_words.tsv
Documents/
doc1.txt
doc2.txt
doc3.txt
.
.
.
out/
If you have a .tsv
where your key words are located then you can extract those from the tsv using the clean-tsv
command.
- Command:
clean-tsv
- Options:
-s
: This is a flag to shuffle theoutfile.txt
, if ommitted then the words will be written to theoutfile.txt
in alphabetical order.-c <x>
: This flag is required and it is to tell the program which column in the.tsv
you wish to extract.- The remaining options are the path to the input
.tsv
and the path to where you want theoutfile.txt
written.
- Options:
nerternary clean-tsv -s -c <x> <path/to/input.tsv> <path/to/outfile.txt>
After cleaning data (or if you already have a .txt
file with your desired words on a new line), the annotate
command can be used.
-
Command:
annotate
- Options:
-p
: This is a flag to switch on parallel processing, the input.txt
files will be split up over the desired number of processors.-t
: The number of threads/processors to use. Please check your systems capabilities before attempting to overprescribe processors.-w <"list, of, words">
: NOT IMPLEMENTED Supply a list of words via the command line to annotate for- The remaining options are the
.txt
file of words/phrases to be annotated for, the path to theinput/dir
containing the.txt
files to annotate, and theoutput/dir
where anannotations/
sub directory will be created and the.json
annotations will be written.
- Options:
-
FUTURE: You will also be able to supply a list of words via the command line to annotate for.
Currently this CLI only supports annotating for single or hyphenated words owing to a lack of noun phrase building post tokenization.