
llm_model_evaluation

Description

Python scripts for evaluating LLM models on two MMLU-like multiple-choice datasets.

Supported Datasets

I. mmlu dataset

II. tmmluplus dataset

How to use it?

  • Step 1: Download the model from Hugging Face. The following commands show an example for the Mistral-7B-v0.1 model:
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
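If you prefer not to use git lfs, the same model can also be fetched with the huggingface_hub Python package. A minimal sketch (the local_dir path is an assumption, chosen to match the --model layout used below):

from huggingface_hub import snapshot_download

# Download the full model repository from the Hugging Face Hub.
# The local_dir path is an assumption; point it wherever you keep models.
snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",
    local_dir="./models/Mistral-7B-v0.1",
)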
  • Step 2: Arrange the dataset from the tmmluplus data folder into the data_arrange folder, as shown in the sketch below.
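A minimal sketch of that copy step, assuming the raw tmmluplus release ships per-subject CSV files in a data/ folder (the folder layout and file extension are assumptions; adjust them to the actual dataset structure):

import shutil
from pathlib import Path

src = Path("./tmmluplus/data")                         # assumed raw dataset location
dst = Path("./llm_evaluation_tmmluplus/data_arrange")  # folder the evaluation script reads
dst.mkdir(parents=True, exist_ok=True)

# Copy every per-subject CSV into the data_arrange folder.
for csv_file in src.glob("*.csv"):
    shutil.copy(csv_file, dst / csv_file.name)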

  • Step 3: Run the following command to generate predictions (point --model at the model directory you downloaded):

python3 evaluation_hf_testing.py \
    --model ./models/llama2-7b-hf \
    --data_dir ./llm_evaluation_tmmluplus/data_arrange/ \
    --save_dir ./llm_evaluation_tmmluplus/results/
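For context, MMLU-style multiple-choice evaluation is commonly scored by comparing the model's next-token logits for the four answer letters. The snippet below is an illustrative sketch of that idea, not the actual implementation in evaluation_hf_testing.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/llama2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("./models/llama2-7b-hf")
model.eval()

def predict_choice(prompt: str) -> str:
    """Return the answer letter whose token gets the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    choices = ["A", "B", "C", "D"]
    ids = [tokenizer(c, add_special_tokens=False).input_ids[-1] for c in choices]
    return choices[int(torch.argmax(logits[ids]))]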
  • Step 4: Run the evaluation script to produce the output JSON file:

python3 catogories_result_eval.py \
    --catogory "mmlu" \
    --model ./models/llama2-7b-hf \
    --save_dir "./results/results_llama2-7b-hf"

Example Google Colab notebooks

  • mmlu dataset:
  1. Google Colab - mmlu
  2. Google Colab - mmlu with the phi-2 model (runs on the Colab free tier)
  • tmmluplus dataset:
  1. Google Colab - tmmluplus

Evaluation Results

  • mmlu dataset:

    Model            Weighted Accuracy  STEM    Humanities  Social Sciences  Other   Inference Time (s)
    Mistral-7B-v0.1  0.6254             0.5252  0.5637      0.7358           0.7036  15624.0

  • tmmluplus dataset:

    Model            Weighted Accuracy  STEM    Humanities  Social Sciences  Other   Inference Time (s)
    Mistral-7B-v0.1  -                  -       -           -                -       -
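Weighted Accuracy here is the overall accuracy across all questions, i.e. each category's accuracy weighted by its question count. A minimal sketch of that computation (the per-category question counts are illustrative placeholders, not the real dataset sizes):

# (accuracy, question_count) per category -- counts are illustrative only.
categories = {
    "STEM": (0.5252, 3000),
    "humanities": (0.5637, 4700),
    "social sciences": (0.7358, 3100),
    "other": (0.7036, 3200),
}

total_questions = sum(n for _, n in categories.values())
weighted_accuracy = sum(acc * n for acc, n in categories.values()) / total_questions
print(f"Weighted accuracy: {weighted_accuracy:.4f}")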
