Llama-3 8B - can't match MMLU performance #30694

Open

gioaca00 opened this issue May 7, 2024 · 2 comments
@gioaca00

gioaca00 commented May 7, 2024

System Info

  • transformers version: 4.39.0
  • Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.29.3
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Using the meta-llama/Meta-Llama-3-8B model to replicate Meta's reported performance on the cais/mmlu dataset
  2. Using a 5-shot prompt of the following form:
The following are multiple choice questions (with answers) about abstract_algebra
Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
A: 0
B: 1
C: 2
D: 3
Answer: B
[other 4 examples ...]
The following are multiple choice questions (with answers) about abstract_algebra
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A: 0
B: 4
C: 2
D: 6
Answer: 
  3. Taking the highest of the model's next-token probabilities p(A), p(B), p(C), p(D) and predicting the corresponding answer (see the sketch below)
  4. I can't match the performance reported in https://github.com/meta-llama/llama3/blob/main/eval_details.md. The scores I get are: macro average 59.66, micro average 58.57
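
For reference, here is a minimal sketch of the scoring in step 3 (not the exact script I run; in particular, encoding the answer letters with a leading space is an assumption, and such tokenization details are a known source of small score differences):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

CHOICES = ["A", "B", "C", "D"]
# Assumption: each option is scored as the single token " X" (with leading space)
# that would follow "Answer:".
CHOICE_IDS = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in CHOICES]

@torch.no_grad()
def predict_choice(prompt: str) -> str:
    """Return the option letter whose continuation token has the highest logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
    scores = next_token_logits[CHOICE_IDS]
    return CHOICES[int(scores.argmax())]
```

Small prompt details interact with this, for example whether the prompt ends with "Answer:" or "Answer: " (trailing space).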

Expected behavior

The expected micro-average is 65.4. Could the gap be due to a difference between the Hugging Face and Meta implementations, or is there a problem with my prompt?

@ArthurZucker
Collaborator

Hey! I don't think it is a difference in implementation, as the Llama modeling code has been quite thoroughly tested over the years 😓
What might be happening is that the generation_config does not include the same parameters, or something differs in the prompt. We ran the evaluations on the Open LLM Leaderboard using transformers as a backend and made sure that we were able to reproduce the results.
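
As a quick sanity check, you can print the generation defaults that ship with the checkpoint (a minimal sketch):

```python
from transformers import GenerationConfig

# Generation defaults bundled with the checkpoint (temperature, top_p, ...).
# They only affect .generate() calls, not the raw next-token logits used to
# rank A/B/C/D, so for a likelihood-based eval the prompt is the main suspect.
gen_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(gen_config)
```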

cc @clefourrier, the lead of the LLM Leaderboards!

@clefourrier
Member

Hi! Evaluation is notoriously fickle and prompt-sensitive.
To reproduce Meta's numbers, you would need to run the exact same setup (same batch size, same generation temperature, same few-shot samples in the exact same order, etc.). If you want to reproduce the results we get on the Open LLM Leaderboard, you can follow the steps in the reproducibility section of the About page.
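
For instance, the canonical MMLU 5-shot context is built from the subject's dev split, in its original order. A rough sketch with datasets is below (the column names are an assumption from the cais/mmlu dataset card, and details such as whether the instruction line is repeated per question, "A:" vs "A.", and underscores vs spaces in the subject name all affect scores, so match the exact script whose number you are comparing against):

```python
from datasets import load_dataset

# Assumed cais/mmlu layout: per-subject configs with a 5-example "dev" split and
# columns "question", "choices" (list of 4 strings), "answer" (index 0-3).
subject = "abstract_algebra"
dev = load_dataset("cais/mmlu", subject, split="dev")

letters = ["A", "B", "C", "D"]

def format_example(example, include_answer=True):
    lines = [example["question"]]
    lines += [f"{letter}: {choice}" for letter, choice in zip(letters, example["choices"])]
    lines.append("Answer:" + (f" {letters[example['answer']]}" if include_answer else ""))
    return "\n".join(lines)

# In the reference Hendrycks et al. setup the instruction line appears once at the
# top and the subject name uses spaces rather than underscores.
header = (
    "The following are multiple choice questions (with answers) about "
    f"{subject.replace('_', ' ')}.\n\n"
)
few_shot_context = header + "\n\n".join(format_example(ex) for ex in dev) + "\n\n"
```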
