Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Using the meta-llama/Meta-Llama-3-8B model to replicate Meta's reported performance on the cais/mmlu dataset.
Using a 5-shot prompt of the following form:
```
The following are multiple choice questions (with answers) about abstract_algebra
Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
A: 0
B: 1
C: 2
D: 3
Answer: B
[other 4 examples ...]
The following are multiple choice questions (with answers) about abstract_algebra
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A: 0
B: 4
C: 2
D: 6
Answer:
```
The predicted answer is the choice with the highest next-token probability among p(A), p(B), p(C), and p(D); a minimal sketch of this scoring step is shown below.
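For concreteness, here is a rough sketch of that scoring step with transformers (the loading options and the tokenization of the choice letters are my assumptions, not the exact evaluation script):

```python
# Minimal sketch: pick the answer whose letter has the highest
# next-token probability after the few-shot prompt ending in "Answer:".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def predict_choice(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    choices = ["A", "B", "C", "D"]
    # Token ids for " A", " B", ... — the leading space matters after "Answer:".
    ids = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in choices]
    return choices[int(next_token_logits[ids].argmax())]
```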
Results obtained: Macro average: 59.66, Micro average: 58.57.
Expected behavior
The expected micro-average is 65.4. Could the gap be due to a difference between the Hugging Face and Meta implementations, or is there a problem with my prompt?
Hey! I don't think it's a difference in implementation, as the model has been QUITE thoroughly tested over the past year 😓
What might be happening is that the generation_config does not include the same parameters, or something is off with the prompt. We ran the evaluations on the Open LLM Leaderboard using transformers as a backend and made sure we were able to reproduce the results.
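One quick way to check the first possibility is to print the generation config shipped with the checkpoint and compare it by eye against the settings your evaluation used (a sketch; the checkpoint id is the one from the issue):

```python
from transformers import GenerationConfig

# Load the generation defaults shipped with the checkpoint and inspect
# temperature, top_p, do_sample, etc.
gen_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(gen_config)
```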
Hi! Evaluation is notoriously fickle and prompt-sensitive.
To reproduce Meta's numbers, you would need to run the exact same setup (same batch size, same generation temperature, same few-shot samples in the exact same order, etc.). If you want to reproduce the results we get on the Open LLM Leaderboard, you can follow the steps in the reproducibility section of the About page.
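For reference, the leaderboard runs go through EleutherAI's lm-evaluation-harness; a command along these lines should get close, though the exact harness commit, task names, and flags used by the leaderboard are documented on the About page, so treat this as an approximation:

```
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16 \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 1
```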
System Info
transformers version: 4.39.0
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@ArthurZucker @younesbelkada