To get a better idea of how well the models are doing, we could add human benchmarks, evaluated on the validation splits. Ideally, all human evaluations should be released openly.
To enable a fair comparison between the models and humans, we should create an evaluation platform (this could just be a simple Gradio app) that supplies the humans with only the same information the models receive. The NER task should be handled separately, e.g. by providing a field per entity category where annotators can write the entities (rather than having them write valid JSON, since we cannot use structured generation with humans).
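As a rough illustration, a minimal Gradio sketch of the NER part of such a platform could look like the following. The label set, example document, and helper name here are assumptions for the sketch, not the benchmark's actual ones:

```python
import gradio as gr

# Hypothetical entity categories; the benchmark's actual label set may differ.
NER_CATEGORIES = ["Person", "Location", "Organisation", "Miscellaneous"]


def collect_entities(text: str, *entity_fields: str) -> dict:
    """Turn the annotator's comma-separated fields into a per-category dict,
    mirroring the structure the models are asked to produce as JSON."""
    return {
        category: [entity.strip() for entity in field.split(",") if entity.strip()]
        for category, field in zip(NER_CATEGORIES, entity_fields)
    }


with gr.Blocks() as demo:
    gr.Markdown("Annotate the named entities in the document below.")
    document = gr.Textbox(
        label="Document",
        interactive=False,
        value="Anna works at Novo Nordisk in Copenhagen.",  # placeholder example
    )
    fields = [
        gr.Textbox(label=f"{category} (comma-separated)")
        for category in NER_CATEGORIES
    ]
    submit = gr.Button("Submit")
    output = gr.JSON(label="Parsed annotation")
    submit.click(collect_entities, inputs=[document, *fields], outputs=output)

demo.launch()
```

The point of the free-text fields is that the annotator never has to produce machine-readable output themselves; the app converts their input into the same structure used to score the models.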
- Evaluation platform built
- Danish
  - Named entity recognition
  - Sentiment classification
  - Linguistic acceptability
  - Question answering
  - Summarisation
  - Knowledge
  - Common-sense reasoning
- Swedish
  - Named entity recognition
  - Sentiment classification
  - Linguistic acceptability
  - Question answering
  - Summarisation
  - Knowledge
  - Common-sense reasoning
- Norwegian
  - Named entity recognition
  - Sentiment classification
  - Linguistic acceptability
  - Question answering
  - Summarisation
  - Knowledge
  - Common-sense reasoning
- Icelandic
  - Named entity recognition
  - Sentiment classification
  - Linguistic acceptability
  - Question answering
  - Summarisation
  - Knowledge
  - Common-sense reasoning
- Faroese
  - Named entity recognition
  - Sentiment classification
  - Linguistic acceptability
  - Question answering
  - Summarisation
  - Knowledge
  - Common-sense reasoning
- German
  - Named entity recognition
  - Sentiment classification
  - Linguistic acceptability
  - Question answering
  - Summarisation
  - Knowledge
  - Common-sense reasoning
- Dutch
  - Named entity recognition
  - Sentiment classification
  - Linguistic acceptability
  - Question answering
  - Summarisation
  - Knowledge
  - Common-sense reasoning
- English
  - Named entity recognition
  - Sentiment classification
  - Linguistic acceptability
  - Question answering
  - Summarisation
  - Knowledge
  - Common-sense reasoning