CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
| Benchmarks | CN-Lang | CSR | CN-specifics | Dual-Domain | Rea-Mem |
|---|---|---|---|---|---|
| Most benchmarks in davis2023benchmarks | ✘ | ✔ | ✘ | ✘ | ✘ |
| XNLI, XCOPA, XStoryCloze | ✔ | ✔ | ✘ | ✘ | ✘ |
| LogiQA, CLUE, CMMLU | ✔ | ✘ | ✔ | ✘ | ✘ |
| CORECODE | ✔ | ✔ | ✘ | ✘ | ✘ |
| **CHARM (ours)** | ✔ | ✔ | ✔ | ✔ | ✔ |
"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and memorization tasks.
- [2024.5.24] CHARM has been open-sourced! 🔥🔥🔥
- [2024.5.15] CHARM has been accepted to the main conference of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)! 🔥🔥🔥
- [2024.3.21] Paper available on ArXiv.
- Support inference and evaluation on OpenCompass.
Below are the steps to quickly download CHARM and evaluate it with OpenCompass.

First, install OpenCompass by following its installation steps. Then clone CHARM and link its data into the `opencompass` repo:

```shell
git clone https://github.com/opendatalab/CHARM CHARM
cd opencompass
mkdir -p data
ln -snf path_to_CHARM_repo/data/CHARM ./data/CHARM
```

```shell
# Inferring and evaluating CHARM with the hf_llama3_8b_instruct model
python run.py --models hf_llama3_8b_instruct --datasets charm_gen
```
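To evaluate a different model, you can first look up which model and dataset configs are registered. The sketch below assumes OpenCompass's `tools/list_configs.py` helper (verify the script name against your OpenCompass checkout) and is guarded so it only runs from the repo root:

```shell
# Run from the opencompass repo root. tools/list_configs.py is an
# OpenCompass helper script, not part of the CHARM repo.
if [ -f tools/list_configs.py ]; then
    python tools/list_configs.py charm      # dataset configs matching "charm"
    python tools/list_configs.py llama3     # model configs matching "llama3"
else
    echo "run this from the opencompass repo root"
fi
```

Any model name printed by the helper can be passed to `run.py --models` in place of `hf_llama3_8b_instruct`.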
```bibtex
@misc{sun2024benchmarking,
      title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
      author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
      year={2024},
      eprint={2403.14112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
This project is released under the Apache 2.0 license.