Llama-3 8B - can't match MMLU performance #30694

Open

gioaca00 opened this issue May 7, 2024 · 2 comments
@gioaca00

gioaca00 commented May 7, 2024

System Info

  • transformers version: 4.39.0
  • Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.29.3
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Using the meta-llama/Meta-Llama-3-8B model to replicate Meta's reported performance on the cais/mmlu dataset
  2. Using a 5-shot prompt of the following form:
The following are multiple choice questions (with answers) about abstract_algebra
Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
A: 0
B: 1
C: 2
D: 3
Answer: B
[other 4 examples ...]
The following are multiple choice questions (with answers) about abstract_algebra
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A: 0
B: 4
C: 2
D: 6
Answer: 
  3. Taking the highest of the model's next-token probabilities p(A), p(B), p(C), p(D) and predicting the corresponding answer (see the sketch below)
  4. I can't match the performance reported in https://github.com/meta-llama/llama3/blob/main/eval_details.md. The scores I get are: macro average 59.66, micro average 58.57
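
For reference, here is a minimal sketch of the scoring in step 3 (not the exact script I run; in particular, encoding the answer letters with a leading space is an assumption, and such tokenization details are a known source of small score differences):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

CHOICES = ["A", "B", "C", "D"]
# Assumption: each option is scored as the single token " X" (with leading space)
# that would follow "Answer:".
CHOICE_IDS = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in CHOICES]

@torch.no_grad()
def predict_choice(prompt: str) -> str:
    """Return the option letter whose continuation token has the highest logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
    scores = next_token_logits[CHOICE_IDS]
    return CHOICES[int(scores.argmax())]
```

Small prompt details interact with this, for example whether the prompt ends with "Answer:" or "Answer: " (trailing space).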

Expected behavior

The expected micro-average is 65.4. Could the gap be due to a difference between the Hugging Face and Meta implementations, or is there a problem with my prompt?

@ArthurZucker
Collaborator

Hey! I don't think it is a difference in implementation, as the Llama modeling code has been quite thoroughly tested over the years 😓
What might be happening is that the generation_config does not include the same parameters, or something differs in the prompt. We ran the evaluations on the Open LLM Leaderboard using transformers as a backend and made sure that we were able to reproduce the results.
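
As a quick sanity check, you can print the generation defaults that ship with the checkpoint (a minimal sketch):

```python
from transformers import GenerationConfig

# Generation defaults bundled with the checkpoint (temperature, top_p, ...).
# They only affect .generate() calls, not the raw next-token logits used to
# rank A/B/C/D, so for a likelihood-based eval the prompt is the main suspect.
gen_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(gen_config)
```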

cc @clefourrier, the lead of the LLM Leaderboards!

@clefourrier
Member

Hi! Evaluation is notoriously fickle and prompt-sensitive.
To reproduce Meta's numbers, you would need to run the exact same setup (same batch size, same generation temperature, same few-shot samples in the exact same order, etc.). If you want to reproduce the results we get on the Open LLM Leaderboard, you can follow the steps in the reproducibility section of the About page.
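
For instance, the canonical MMLU 5-shot context is built from the subject's dev split, in its original order. A rough sketch with datasets is below (the column names are an assumption from the cais/mmlu dataset card, and details such as whether the instruction line is repeated per question, "A:" vs "A.", and underscores vs spaces in the subject name all affect scores, so match the exact script whose number you are comparing against):

```python
from datasets import load_dataset

# Assumed cais/mmlu layout: per-subject configs with a 5-example "dev" split and
# columns "question", "choices" (list of 4 strings), "answer" (index 0-3).
subject = "abstract_algebra"
dev = load_dataset("cais/mmlu", subject, split="dev")

letters = ["A", "B", "C", "D"]

def format_example(example, include_answer=True):
    lines = [example["question"]]
    lines += [f"{letter}: {choice}" for letter, choice in zip(letters, example["choices"])]
    lines.append("Answer:" + (f" {letters[example['answer']]}" if include_answer else ""))
    return "\n".join(lines)

# In the reference Hendrycks et al. setup the instruction line appears once at the
# top and the subject name uses spaces rather than underscores.
header = (
    "The following are multiple choice questions (with answers) about "
    f"{subject.replace('_', ' ')}.\n\n"
)
few_shot_context = header + "\n\n".join(format_example(ex) for ex in dev) + "\n\n"
```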
