Evaluating ChatGPT’s Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

Bo Li, Gexiang Fang, Yang Yang, Quansen Wang, Wei Ye, Wen Zhao, and Shikun Zhang.

Abstract

In this paper, we assess the overall ability of ChatGPT on 7 fine-grained information extraction (IE) tasks. Specifically, we present a systematic analysis that measures ChatGPT's performance, explainability, calibration, and faithfulness, resulting in 15 keys collected from either ChatGPT or domain experts. Our findings reveal that ChatGPT's performance in the Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting, as evidenced by human evaluation. In addition, our research indicates that ChatGPT provides high-quality and trustworthy explanations for its decisions. However, ChatGPT is overconfident in its predictions, which results in low calibration. Furthermore, ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. We manually annotate and release the test sets of the 7 fine-grained IE tasks, covering 14 datasets, to further promote research.

Collected Keys

We collected 15 keys from both ChatGPT and domain experts: 10 keys are extracted from ChatGPT's output and the remaining 5 require human involvement. These keys systematically assess ChatGPT's ability from four aspects: performance, explainability, calibration, and faithfulness.
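To make the calibration aspect concrete, here is a minimal sketch of expected calibration error (ECE), a common way to quantify overconfidence. This README does not state which calibration metric the paper uses, so the choice of ECE, the function name `expected_calibration_error`, and the default of 10 bins are all assumptions made for illustration. The input is the 0-100 "confidence" value requested in the prompt, paired with whether each prediction was correct.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE (an assumed metric, not necessarily the paper's).

    confidences: integers in [0, 100], as in the prompt's "confidence" key.
    correct: booleans, True where the prediction matched the gold label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        p = conf / 100.0
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, 1.0 if ok else 0.0))

    # Weighted average of |mean confidence - accuracy| over non-empty bins.
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(a for _, a in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that reports 90 on every prediction but is right only half of the time gets an ECE of about 0.4 under this definition, which is the kind of gap that overconfidence produces.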

Dataset

Please access the datasets used in our paper from the following resources:

Entity Typing (ET): BBN, OntoNotes

Named Entity Recognition (NER): CoNLL2003, OntoNotes

Relation Classification (RC): TACRED, SemEval2010

Relation Extraction (RE): ACE05-R, SciERC

Event Detection (ED), Event Argument Extraction (EAE) and Event Extraction (EE): ACE05-E, ACE05-E+

An Example

We show an input example for the event detection (ED) task to help readers understand our implementation.

Input of Event Detection (ED)
Task Description: Given an input list of words, identify all triggers in the list, and categorize each of them into the predefined set of event types. A trigger is the main word that most clearly expresses the occurrence of an event in the predefined set of event types.
Pre-defined Label Set: The predefined set of event types includes: [Life.Be-Born, Life.Marry, Life.Divorce, Life.Injure, Life.Die, Movement.Transport, Transaction.Transfer-Ownership, Transaction.Transfer-Money, Business.Start-Org, Business.Merge-Org, Business.Declare-Bankruptcy, Business.End-Org, Conflict.Attack, Conflict.Demonstrate, Contact.Meet, Contact.Phone-Write, Personnel.Start-Position, Personnel.End-Position, Personnel.Nominate, Personnel.Elect, Justice.Arrest-Jail, Justice.Release-Parole, Justice.Trial-Hearing, Justice.Charge-Indict, Justice.Sue, Justice.Convict, Justice.Sentence, Justice.Fine, Justice.Execute, Justice.Extradite, Justice.Acquit, Justice.Appeal, Justice.Pardon].
Input and Task Requirement: Perform ED task for the following input list, and print the output: ['Putin', 'concluded', 'his', 'two', 'days', 'of', 'talks', 'in', 'Saint', 'Petersburg', 'with', 'Jacques', 'Chirac', 'of', 'France', 'and', 'German', 'Chancellor', 'Gerhard', 'Schroeder', 'on', 'Saturday', 'still', 'urging', 'for', 'a', 'central', 'role', 'for', 'the', 'United', 'Nations', 'in', 'a', 'post', '-', 'war', 'revival', 'of', 'Iraq', '.'] The output of ED task should be a list of dictionaries following json format. Each dictionary corresponds to the occurrence of an event in the input list and should consists of "trigger", "word_index", "event_type", "top3_event_type", "top5_event_type", "confidence", "if_context_dependent", "reason" and "if_reasonable" nine keys. The value of "word_index" key is an integer indicating the index (start from zero) of the "trigger" in the input list. The value of "confidence" key is an integer ranging from 0 to 100, indicating how confident you are that the "trigger" expresses the "event_type" event. The value of "if_context_dependent" key is either 0 (indicating the event semantic is primarily expressed by the trigger rather than contexts) or 1 (indicating the event semantic is primarily expressed by contexts rather than the trigger). The value of "reason" key is a string describing the reason why the "trigger" expresses the "event_type", and do not use any " mark in this string. The value of "if_reasonable" key is either 0 (indicating the reason given in the "reason" field is not reasonable) or 1 (indicating the reason given in the "reason" field is reasonable). Note that your answer should only contain the json string and nothing else.
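Since the prompt asks for a JSON list of dictionaries with nine fixed keys, a reply can be checked mechanically before it is scored. The sketch below is not taken from the Code folder of this repository: the function `validate_ed_output`, the shortened word list, the two-label subset, and the sample reply are all hypothetical and exist only to illustrate the output format described above.

```python
import json

# The nine keys the ED prompt requires in every output dictionary.
REQUIRED_KEYS = {
    "trigger", "word_index", "event_type", "top3_event_type",
    "top5_event_type", "confidence", "if_context_dependent",
    "reason", "if_reasonable",
}


def validate_ed_output(raw, words, label_set):
    """Parse a model reply; return (valid_events, problems). Hypothetical helper."""
    try:
        events = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], [f"not valid JSON: {exc}"]
    if not isinstance(events, list):
        return [], ["top-level value is not a list"]

    valid, problems = [], []
    for i, event in enumerate(events):
        if not isinstance(event, dict):
            problems.append(f"item {i}: not a dictionary")
            continue
        missing = REQUIRED_KEYS - event.keys()
        if missing:
            problems.append(f"item {i}: missing keys {sorted(missing)}")
            continue
        idx, conf = event["word_index"], event["confidence"]
        if not isinstance(idx, int) or not 0 <= idx < len(words):
            problems.append(f"item {i}: word_index out of range")
        elif words[idx] != event["trigger"]:
            problems.append(f"item {i}: trigger does not match the word at word_index")
        elif event["event_type"] not in label_set:
            problems.append(f"item {i}: event_type not in the predefined label set")
        elif not isinstance(conf, int) or not 0 <= conf <= 100:
            problems.append(f"item {i}: confidence must be an integer from 0 to 100")
        elif event["if_context_dependent"] not in (0, 1) or event["if_reasonable"] not in (0, 1):
            problems.append(f"item {i}: flag keys must be 0 or 1")
        else:
            valid.append(event)
    return valid, problems


# Shortened input and label set, for brevity only.
words = ["Putin", "concluded", "his", "two", "days", "of", "talks"]
labels = {"Contact.Meet", "Conflict.Attack"}

# A made-up reply in the requested format, NOT actual ChatGPT output.
sample_reply = json.dumps([{
    "trigger": "talks",
    "word_index": 6,
    "event_type": "Contact.Meet",
    "top3_event_type": ["Contact.Meet", "Contact.Phone-Write", "Conflict.Demonstrate"],
    "top5_event_type": ["Contact.Meet", "Contact.Phone-Write", "Conflict.Demonstrate",
                        "Movement.Transport", "Personnel.Nominate"],
    "confidence": 90,
    "if_context_dependent": 0,
    "reason": "The word talks refers to a meeting between leaders.",
    "if_reasonable": 1,
}])

valid_events, problems = validate_ed_output(sample_reply, words, labels)
```

Collecting problems instead of raising on the first one keeps well-formed events usable even when other items in the same reply are malformed, which matters when many replies are processed in a batch.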

Future Work

We will add more analysis on other popular LLMs in the next version.
