Feature Suggestion: CoNLL 2012 Coreference format / more flexible handling #116

polm · 2021-04-23T10:39:40Z

I recently ran into a CoNLL format that this library doesn't handle - the CoNLL 2012 coreference resolution annotation format. This is basically the same as standard CoNLL(-u?) format but with some extra fields at the end. Reference data page (note the data is dehydrated):

https://cemantix.org/conll/2012/data.html

Format details are described at the link - it's a little weird, with a variable number of fields, though not awful. I don't think this format was ever used for any other task, but this particular dataset remains one of the best of its kind in English as far as I'm aware. It continues to be actively used in research.

https://paperswithcode.com/sota/coreference-resolution-on-conll-2012

One thought I had when trying to load this is that it would be nice if extra fields were able to be accommodated in a general way somehow rather than throwing an error - maybe an option for permissive parsing that just added them in an "extras" field or something?
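
As a rough illustration of the permissive idea, here is a minimal standalone sketch. It is not this library's API: `parse_line_permissive`, the `named_fields` parameter, and the `"extras"` key are all hypothetical names, and the tab-separated layout is an assumption (the CoNLL 2012 files themselves may use a different delimiter).

```python
from typing import Dict, List, Sequence, Union


def parse_line_permissive(
    line: str, named_fields: Sequence[str]
) -> Dict[str, Union[str, List[str]]]:
    """Hypothetical helper: map leading columns to names, keep the rest.

    Assumes tab-separated columns. Columns beyond the named ones are
    collected in an "extras" list rather than raising an error.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(named_fields):
        # Too few columns is still treated as an error in this sketch.
        raise ValueError(
            f"expected at least {len(named_fields)} columns, got {len(columns)}"
        )
    token: Dict[str, Union[str, List[str]]] = dict(zip(named_fields, columns))
    token["extras"] = columns[len(named_fields):]
    return token


# Toy line with three named columns and two trailing extra columns.
token = parse_line_permissive("1\tdogs\tdog\t-\t(0)", ["id", "form", "lemma"])
print(token["form"], token["extras"])
```

With something like this, a strict mode could keep raising on unexpected columns while a permissive mode hands them back untouched for the caller to interpret.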

Anyway, I'm not sure it makes sense to support this given its rather limited use, but I thought I'd mention it.

matgrioni · 2021-05-03T04:50:03Z

Thanks for the suggestion. There is an existing issue about supporting more CoNLL formats (#49). I have started work on it, but unfortunately, due to time constraints, I have had limited bandwidth for it recently.

This format is definitely interesting due to its variable columns. I had not even thought of that as a possibility for a CoNLL-related format. From a brief look, I'm not sure I totally understand the format, but its structural difference is clear at least, and I think your suggestion to capture extra info in a dedicated field is a good one.

With the current version, 3.0, I added type annotations, which I think are a nice benefit, and the solution I am working on uses dataclasses to preserve that for user-defined formats. Adding a class decorator to these classes to say "capture everything else" should be possible, I believe (although most IDEs cannot follow type hints through decorators, users will still have the option to define their own typed structure if they desire, or fall back to an untyped, purely string-based one).
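
To make the decorator idea concrete, here is a minimal sketch of what it could look like. This is not the design in progress, just an illustration under assumptions: `capture_extras`, `from_columns`, the `Token` class, and the `extras` field are all hypothetical names, and every column is kept as a plain string.

```python
from dataclasses import dataclass, field, fields
from typing import List, Sequence


def capture_extras(cls):
    """Hypothetical class decorator for a dataclass with an `extras` field.

    Adds a from_columns() classmethod that fills the declared fields in
    order and stores any leftover columns in `extras`.
    """
    named = [f.name for f in fields(cls) if f.name != "extras"]

    def from_columns(klass, columns: Sequence[str]):
        if len(columns) < len(named):
            raise ValueError(
                f"expected at least {len(named)} columns, got {len(columns)}"
            )
        kwargs = dict(zip(named, columns))
        return klass(**kwargs, extras=list(columns[len(named):]))

    cls.from_columns = classmethod(from_columns)
    return cls


@capture_extras
@dataclass
class Token:
    # User-defined, typed columns (all strings in this sketch).
    id: str
    form: str
    lemma: str
    # Everything past the declared columns lands here.
    extras: List[str] = field(default_factory=list)


token = Token.from_columns(["1", "dogs", "dog", "-", "(0)"])
print(token.form, token.extras)
```

Because `from_columns` is attached at runtime, an IDE may not see it on `Token`, which is the type-hint limitation mentioned above; the declared fields themselves remain fully typed.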

Hopefully I'll be able to keep working on this later this month. The main issue I run into is that the parsing code is tied into the data layer, so that has to be properly separated first, which is the bulk of the work.
