crishoj/thcg

THCG2DEP implements a method for deriving (minimally labeled) dependency trees from the Thai CG Bank (Ruangrajitpakorn & Supnithi 2010). A lexical dictionary is used to assign dependency directions to the CG types associated with the grammatical entities in the CG Bank; for out-of-dictionary words, the converter falls back to a generic CG->CDG mapping.

Optionally, distributional (Brown) clusters obtained through unsupervised processing of large amounts of word-segmented raw text can be used in place of POS tags.

Requirements

A recent Ruby interpreter is required, as well as the following Ruby packages:

* commander (for command-line interface)
* conll (for generating dependency treebank files)

Usage

Treebank conversion

To extract dependency trees from a CG treebank, use the following command:

% ruby lib/thcg_converter.rb data/dict/merge.CDG data/map/CDG-CG.txt data/cg/sample.txt > data/conll/sample.conll

Explanation:

* data/dict/merge.CDG      - merged CDG lexical dictionary 
* data/map/CDG-CG.txt      - generic CG->CDG mapping
* data/cg/sample.txt       - source CG bank
* data/conll/sample.conll  - target file for dependency treebank (CoNLL format)

Sample output:

Dictionary file: data/dict/merge.CDG
Building dictionary from data/dict/merge.CDG
38250 forms, 2 CDGS/form on average
Building CG->CDG map
Ambiguous CG->CDG mapping for น่า: (s\np)/(s\np) could be either (s\<np)/>(s\<np) or (s\<np)/<(s\<np)
Ambiguous CG->CDG mapping for ตอนนั้น: s/s could be either s/>s or s/<s
Ambiguous CG->CDG mapping for กับ: np\np/np could be either np\<np/>np or np\>np/>np
Ambiguous CG->CDG mapping for ตอนนี้: s/s could be either s/>s or s/<s
Ambiguous CG->CDG mapping for ขณะนี้: s/s could be either s/>s or s/<s
Ambiguous CG->CDG mapping for ขณะนั้น: s/s could be either s/>s or s/<s
Map file: data/map/CDG-CG.txt
Ambiguous CG->CDG mapping: np\np/np could be either np\>np/>np or np\<np/>np
Ambiguous CG->CDG mapping: s/s could be either s/>s or s/<s
Ambiguous CG->CDG mapping: s\s could be either s\>s or s\<s
Ambiguous CG->CDG mapping: (s\np)/(s\np) could be either (s\<np)/>(s\<np) or (s\<np)/<(s\<np)
Ambiguous CG->CDG mapping: s/(s\np) could be either s/>(s\<np) or s/<(s\<np)
Ambiguous CG->CDG mapping: s/(s\np)/np could be either s/>(s\<np)/>np or s/<(s\<np)/>np
Ambiguous CG->CDG mapping: s\(s\np) could be either s\<(s\<np) or s\>(s\<np)
90 CG->CDG mapping(s)
Treebank file: data/cg/sample.txt
Parsing 10 lines...
done
Processing 5 sentences...
No mapping of ((s\np)\(s\np))/np for 'ใน' - falling back on CD->CDG map
No mapping of (np\np)\(np\np) for 'ๆ' - falling back on CD->CDG map
done
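
The warnings in the sample output point to two situations: a CG type that has several possible directed CDG types, and a word/type pair that is missing from the dictionary. The following is a minimal sketch of such a lookup with fallback, not the code in lib/thcg_converter.rb. The function name `resolve_cdg`, the hash layouts, the log messages and the choice of the first candidate in ambiguous cases are all assumptions made for illustration.

```ruby
# Hypothetical sketch of a dictionary lookup with fallback to a generic map.
# Assumed layouts (not taken from the project's file formats):
#   dictionary:  form    => { cg_type => [cdg_type, ...] }
#   generic_map: cg_type => [cdg_type, ...]
def resolve_cdg(form, cg_type, dictionary, generic_map, log: $stderr)
  candidates = dictionary.fetch(form, {})[cg_type]

  if candidates.nil? || candidates.empty?
    # Out-of-dictionary word (or unseen type for this word): use the generic map.
    log.puts "no dictionary entry for '#{form}' with #{cg_type}, using generic map"
    candidates = generic_map[cg_type]
  end

  return nil if candidates.nil? || candidates.empty?

  if candidates.size > 1
    log.puts "ambiguous: #{cg_type} could be #{candidates.join(' or ')}"
  end

  # Assumption: take the first candidate when several are possible.
  candidates.first
end
```

How the converter itself resolves an ambiguous mapping is not stated here, so the last line should be read as a placeholder policy.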

Please note that CG types remain in the resulting CoNLL file. These should be stripped out before using the dependency treebank for any experiments or real applications, as CG types cannot reliably be obtained automatically for unseen text.
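
One way to strip the CG types is to overwrite the column that holds them with the CoNLL placeholder `_`. The sketch below assumes tab-separated token lines and blank lines between sentences. Which column the converter writes the CG type to is not stated here, so the `column` argument (0-based) is a placeholder that must be checked against the actual output; `strip_column` is a hypothetical helper, not part of this repository.

```ruby
# Hypothetical helper: blank out one tab-separated column in CoNLL token lines.
# `column` is a 0-based index and is an assumption to verify against your file.
def strip_column(lines, column)
  lines.map do |line|
    row = line.chomp
    next row if row.empty?            # sentence separator, keep as-is

    fields = row.split("\t", -1)      # -1 keeps trailing empty fields
    fields[column] = '_' if column < fields.size
    fields.join("\t")
  end
end
```

For example, `puts strip_column(File.readlines('data/conll/sample.conll', encoding: 'UTF-8'), 3)` would print the treebank with the fourth column blanked, if that is where the CG types turn out to be.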

Word cluster augmentation

To add unsupervised Brown clusters to a dependency treebank, use the command:

% ruby lib/cluster_path_map.rb data/conll/sample.conll data/paths/best-all-spaces-c32-p1.out/paths > data/conll/sample-c32.conll

Explanation:

* data/conll/sample.conll      - input dependency treebank
* data/paths/best-all-spaces-c32-p1.out/paths  - output file from the wcluster utility (http://www.cs.berkeley.edu/~pliang/software/brown-cluster-1.2.zip)
* data/conll/sample-c32.conll  - output file for augmented treebank

A second paths file may be specified to include clusters of a different granularity as a second POS tag.
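
The augmentation step can be pictured as a lookup from word form to cluster bit string. The sketch below is not lib/cluster_path_map.rb; it assumes the wcluster paths layout of one `<bit string><TAB><word><TAB><count>` entry per line, and it assumes that the form is in column 1 and the tag to replace is in column 3 (0-based). The names `load_paths` and `add_clusters` and the `_` value for unknown words are illustrative choices.

```ruby
# Read wcluster "paths" lines of the form "<bit string>\t<word>\t<count>"
# into a word => bit string hash.
def load_paths(lines)
  lines.each_with_object({}) do |line, paths|
    bits, word, _count = line.chomp.split("\t")
    paths[word] = bits if bits && word
  end
end

# Write the cluster bit string of each word form into a tag column.
# form_col and tag_col are 0-based and are assumptions about the file layout.
def add_clusters(conll_lines, paths, form_col: 1, tag_col: 3, unknown: '_')
  conll_lines.map do |line|
    row = line.chomp
    next row if row.empty?            # sentence separator, keep as-is

    fields = row.split("\t", -1)
    fields[tag_col] = paths.fetch(fields[form_col], unknown)
    fields.join("\t")
  end
end
```

A second paths file could be handled by calling `add_clusters` again with a different `tag_col`.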

Dependency parser training

To train a dependency parser, the output from the above steps should first be stripped of the CG types, shuffled (to compensate for similar sentence types appearing next to each other in the treebank) and then split, e.g. in a 9:1 ratio, into training and test sets.
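
The shuffle and split can be done at sentence level, since CoNLL files separate sentences with blank lines. The following is a minimal sketch with hypothetical helper names (`read_sentences`, `shuffle_split`, `to_conll`); the fixed seed and the default test fraction of 0.1 are assumptions chosen to make the 9:1 split reproducible.

```ruby
# Group CoNLL lines into sentences (arrays of token lines), using blank
# lines as separators.
def read_sentences(lines)
  lines.map(&:chomp)
       .chunk_while { |a, b| !a.empty? && !b.empty? }
       .reject { |chunk| chunk.first.empty? }
end

# Shuffle sentences with a fixed seed and split off a test portion.
# Returns [train, test].
def shuffle_split(sentences, test_fraction: 0.1, seed: 42)
  shuffled = sentences.shuffle(random: Random.new(seed))
  n_test = (shuffled.size * test_fraction).round
  [shuffled.drop(n_test), shuffled.take(n_test)]
end

# Serialize sentences back to CoNLL text, one blank line after each sentence.
def to_conll(sentences)
  sentences.map { |tokens| tokens.join("\n") + "\n\n" }.join
end
```

A possible use is `train, test = shuffle_split(read_sentences(File.readlines(path, encoding: 'UTF-8')))`, followed by `File.write('data/train.conll', to_conll(train))` and the same for the test part.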

Given a training part and a test part, the standard MaltParser may be trained and tested as follows:

% java -jar ~/Tools/malt-1.4.1/malt.jar -c malt_test -i data/train.conll -m learn
% java -jar ~/Tools/malt-1.4.1/malt.jar -c malt_test -i data/test.conll -o malt_test.out.conll -m parse
% eval07.pl -q -g data/test.conll -s malt_test.out.conll

eval07.pl is the standard evaluation script from the CoNLL 2007 shared task on dependency parsing, available for download at e.g. nextens.uvt.nl/depparse-wiki/SoftwarePage#eval07.pl
