CoNLL-U plus #119

Open
Stormur opened this issue Nov 28, 2023 · 1 comment

Comments

@Stormur

Stormur commented Nov 28, 2023

Hi!

I cannot find this in the documentation: I was wondering whether Udapi already includes ways to deal with CoNLL-U Plus files (i.e. read, write...). In particular, I am interested in expanding an existing regular CoNLL-U file into a Plus one by adding new custom columns.

Thanks!

@martinpopel
Contributor

Udapi does not support CoNLL-U Plus yet. There is read.Conll with an attributes parameter, which lets you specify which columns a given file contains. However, it calls setattr(node, attribute_name, value) internally, so only existing node attribute names can be used as column names (or an underscore, which tells the reader to ignore that column).
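
To illustrate why a setattr-based reader cannot take arbitrary column names, here is a minimal sketch in plain Python. It does not use Udapi: TinyNode and read_row are hypothetical stand-ins, and the sketch assumes that the node class has a fixed set of attributes (modelled here with __slots__) and that the column list is a comma-separated string.

```python
class TinyNode:
    """Hypothetical stand-in for a node class with a fixed attribute set."""
    __slots__ = ("ord", "form", "lemma", "upos", "misc")


def read_row(line, attributes):
    """Fill a TinyNode from one tab-separated row.

    `attributes` is a comma-separated list of column names (an assumption
    of this sketch); "_" means the column is skipped.
    """
    node = TinyNode()
    names = attributes.split(",")
    values = line.rstrip("\n").split("\t")
    for name, value in zip(names, values):
        if name == "_":
            continue  # ignored column
        # Fails with AttributeError if `name` is not a known attribute.
        setattr(node, name, value)
    return node


# Known attribute names work, and "_" skips the lemma column.
node = read_row("1\tdogs\tdog\tNOUN", "ord,form,_,upos")
print(node.form, node.upos)  # dogs NOUN

# A custom column name is rejected, because the node has no such attribute.
try:
    read_row("1\tdogs\t*", "ord,form,parseme:mwe")
except AttributeError as err:
    print("rejected:", err)
```

Under these assumptions, a custom column such as PARSEME:MWE has nowhere to go unless the reader maps it to some other storage.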

I would welcome a PR adding read.Conlluplus (or read.Conllup, considering that .conllup is the recommended file extension) and write.Conlluplus. That would mean interpreting the global.columns header (perhaps storing it in document.meta['global.columns'], similarly to document.meta['global.Entity']). The open question is where to store the extra (non-standard) columns and how to name them (lowercase?). I would suggest storing them in node.misc, so that e.g. global.columns = ID FORM PARSEME:MWE results in node.misc["parseme:mwe"] containing the values from the last column.

When serializing this document using write.Conlluplus with document.meta['global.columns'] == "ID FORM MISC PARSEME:MWE", the parseme:mwe attribute would not be written to MISC, but to the last (PARSEME:MWE) column. This would let users easily convert between formats, e.g. with udapy read.Conllu files=input.conllu util.Eval doc='doc.meta["global.columns"] = "ID FORM LEMMA MISC PARSEME:MWE"' write.Conlluplus files=output.conllup.
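
The proposal above could be prototyped roughly as follows. This is only a sketch under stated assumptions, not Udapi code: nodes are plain dicts, read_conllup and write_conllup are hypothetical helper names, sentence boundaries and other comment lines are ignored, extra column names are lowercased, and "_" in an extra column is treated as empty.

```python
STANDARD = {"ID", "FORM", "LEMMA", "UPOS", "XPOS",
            "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"}


def parse_misc(value):
    """Turn 'A=1|B=2' into a dict; '_' means an empty MISC."""
    if value == "_":
        return {}
    pairs = (item.partition("=") for item in value.split("|"))
    return {key: val for key, _, val in pairs}


def format_misc(misc):
    """Serialize a dict back to 'A=1|B=2', or '_' if it is empty."""
    return "|".join(f"{k}={v}" for k, v in misc.items()) or "_"


def read_conllup(text):
    """Return (meta, nodes); extra columns end up in node['misc']."""
    meta, nodes, columns = {}, [], None
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            meta["global.columns"] = line.split("=", 1)[1].strip()
            columns = meta["global.columns"].split()
        elif line and not line.startswith("#"):
            if columns is None:
                raise ValueError("missing global.columns header")
            node = {"misc": {}}
            for name, value in zip(columns, line.split("\t")):
                if name == "MISC":
                    node["misc"].update(parse_misc(value))
                elif name in STANDARD:
                    node[name.lower()] = value
                elif value != "_":
                    node["misc"][name.lower()] = value  # e.g. parseme:mwe
            nodes.append(node)
    return meta, nodes


def write_conllup(meta, nodes):
    """Serialize nodes using the columns listed in meta['global.columns']."""
    columns = meta["global.columns"].split()
    extra_keys = {c.lower() for c in columns if c not in STANDARD}
    lines = ["# global.columns = " + meta["global.columns"]]
    for node in nodes:
        cells = []
        for name in columns:
            if name == "MISC":
                # Keys that have their own column are left out of MISC.
                rest = {k: v for k, v in node["misc"].items()
                        if k not in extra_keys}
                cells.append(format_misc(rest))
            elif name in STANDARD:
                cells.append(node.get(name.lower(), "_"))
            else:
                cells.append(node["misc"].get(name.lower(), "_"))
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


source = ("# global.columns = ID FORM PARSEME:MWE\n"
          "1\tkick\t1:VID\n2\tthe\t1\n3\tbucket\t1\n")
meta, nodes = read_conllup(source)
print(nodes[0]["misc"])  # {'parseme:mwe': '1:VID'}

# Changing the header changes the output layout, as suggested above.
meta["global.columns"] = "ID FORM MISC PARSEME:MWE"
output = write_conllup(meta, nodes)
print(output)
```

In this sketch the header stored in meta is the single source of truth for the writer, so converting between layouts only requires editing meta['global.columns'] before serializing.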
