Feature Suggestion: CoNLL 2012 Coreference format / more flexible handling #116

Open
polm opened this issue Apr 23, 2021 · 1 comment

polm commented Apr 23, 2021

I recently ran into a CoNLL format that this library doesn't handle: the CoNLL 2012 coreference resolution annotation format. This is basically the same as the standard CoNLL(-U?) format but with some extra fields at the end. Reference data page (note the data is dehydrated):

https://cemantix.org/conll/2012/data.html

Format details are described at the link - it's a little weird, with a variable number of fields, though not awful. I don't think this format was ever used for any other task, but this particular dataset remains one of the best of its kind in English as far as I'm aware. It continues to be actively used in research.
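To make the "variable number of fields" point concrete, here is a minimal sketch of splitting one CoNLL 2012 line. It assumes the layout described on the linked data page: eleven fixed leading columns (document ID through named entities), then zero or more predicate-argument columns, then a final coreference column. The names `Conll2012Token` and `split_conll2012_line` are hypothetical and not part of this library, and the sample line is made up for illustration.

```python
from typing import List, NamedTuple

# Assumption: 11 fixed leading columns (document ID, part number, word number,
# word, POS, parse bit, predicate lemma, frameset ID, word sense, speaker,
# named entities), as described on the CoNLL 2012 data page.
FIXED_COLUMNS = 11


class Conll2012Token(NamedTuple):
    fixed: List[str]           # the fixed leading columns
    predicate_args: List[str]  # variable-width block, one column per predicate
    coref: str                 # always the last column


def split_conll2012_line(line: str) -> Conll2012Token:
    """Split a whitespace-separated token line into its three regions."""
    columns = line.split()
    if len(columns) < FIXED_COLUMNS + 1:
        raise ValueError(
            f"expected at least {FIXED_COLUMNS + 1} columns, got {len(columns)}"
        )
    return Conll2012Token(
        fixed=columns[:FIXED_COLUMNS],
        predicate_args=columns[FIXED_COLUMNS:-1],
        coref=columns[-1],
    )


# Made-up sample line: 11 fixed columns, 1 predicate-argument column, 1 coref column.
sample = "doc/example 0 1 cats NNS (NP*) - - - - * (ARG0*) (1)"
token = split_conll2012_line(sample)
```

The number of predicate-argument columns depends on how many predicates the sentence contains, which is why the width varies from sentence to sentence.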

https://paperswithcode.com/sota/coreference-resolution-on-conll-2012

One thought I had when trying to load this: it would be nice if extra fields could be accommodated in a general way rather than throwing an error. Maybe an option for permissive parsing that just puts them in an "extras" field or something?
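As a rough illustration of what such a permissive option could look like, here is a small standalone sketch. It is not this library's API: `parse_token_line`, the `permissive` flag, and the `extras` key are all hypothetical names, and the sketch assumes tab-separated lines with the ten standard CoNLL-U columns first.

```python
from typing import Dict, List, Union

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_token_line(
    line: str, permissive: bool = False
) -> Dict[str, Union[str, List[str]]]:
    """Parse one tab-separated token line into a dict of raw string fields.

    In strict mode, any column beyond the standard ten raises ValueError.
    In permissive mode, surplus columns are kept, in order, under "extras".
    """
    values = line.rstrip("\n").split("\t")
    expected = len(CONLLU_FIELDS)
    if len(values) < expected:
        raise ValueError(f"expected {expected} columns, got {len(values)}")
    if len(values) > expected and not permissive:
        raise ValueError(
            f"{len(values) - expected} unexpected extra column(s); "
            "pass permissive=True to keep them"
        )
    token: Dict[str, Union[str, List[str]]] = dict(zip(CONLLU_FIELDS, values))
    token["extras"] = values[expected:]
    return token


# A line with two columns beyond the standard ten.
line = "1\tcats\tcat\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\t(ARG0*)\t(1)"
token = parse_token_line(line, permissive=True)
```

Keeping strict mode as the default would preserve the current error behavior for anyone relying on it, while the flag opts in to the looser handling.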

Anyway, not sure it makes sense to support this given its rather limited use, but thought I'd mention it.

@matgrioni
Collaborator

Thanks for the suggestion. There is an existing issue about supporting more CoNLL formats (#49). I have started work on it, but unfortunately, due to time constraints, I have had limited bandwidth for it recently.

This format is definitely interesting due to its variable columns. I had not even thought of that as a possibility for a CoNLL-related format. I'm not sure I totally understand the format from a brief look, but the structural difference is clear at least, and I think your suggestion to capture extra info in a dedicated field is a good one.

With the current version, 3.0, I added type annotations, which I think are a nice benefit, and the solution I am working on uses dataclasses to keep that benefit with user-defined formats. Adding a class decorator to these classes that says "capture everything else" should be possible, I believe. Most IDEs cannot follow the type hints on decorators, but users will still have the option to define their own custom typed structure if they desire, or fall back to an untyped, purely string-based one.
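To sketch the decorator idea in isolation: the snippet below is only a guess at the general shape, not the design being worked on. The `capture_extras` decorator, the `from_columns` constructor, and the `extras` attribute are hypothetical names, and all values are kept as raw strings with no type conversion.

```python
from dataclasses import dataclass, field, fields
from typing import List, Sequence


def capture_extras(cls):
    """Class decorator for a dataclass: add a `from_columns` constructor.

    Declared fields (other than `extras`) are filled positionally from the
    leading columns; any remaining columns are stored in `extras`.
    """
    names = [f.name for f in fields(cls) if f.name != "extras"]

    def from_columns(klass, columns: Sequence[str]):
        if len(columns) < len(names):
            raise ValueError(
                f"expected at least {len(names)} columns, got {len(columns)}"
            )
        obj = klass(*columns[: len(names)])
        obj.extras = list(columns[len(names):])
        return obj

    cls.from_columns = classmethod(from_columns)
    return cls


@capture_extras  # applied after @dataclass, so fields() can inspect the class
@dataclass
class Token:
    id: str
    form: str
    extras: List[str] = field(default_factory=list)


token = Token.from_columns(["1", "cats", "(ARG0*)", "(1)"])
```

Because `from_columns` is attached at runtime, a static type checker or IDE will likely not see it, which matches the caveat about type hints on decorators. Declaring `extras` on the dataclass itself at least keeps that attribute visible to tooling.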

Hopefully I'll be able to keep working on this sometime this month. The main issue I run into is that the parsing code is tied into the data layer, so the two have to be properly separated first, which is the bulk of the work.
