Skip to content

noway/labels-nanogpt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

nanogpt with labels

Nanogpt with labels implemented. Performance on the task of subtracting two single-digit numbers is increased for the comparable hyperparameters.

subtraction task results

ranking sum, lower is better

no_labels

39M parameter model, 512 context window

eval_type sum
0shot_cot 373
0shot_direct 294
nshot_cot 453
nshot_direct 376

with_labels

39M parameter model, 512 (effective) context window

eval_type sum
0shot_cot 173
0shot_direct 142
nshot_cot 178
nshot_direct 193

label_embeddings

39M parameter model, 512 context window,

eval_type sum
0shot_cot 421
0shot_direct 263
nshot_cot 382
nshot_direct 956

training

  • 6.5 hours compute budget on 1 H100 for each model
  • 132 batch_size for 512 context window, 38 batch_size for 1048 context window

tokenizer & labeler

original hypothesis & findings

The original hypothesis was that the label embedding model must increase eval performance over no labels. This has been proven false - label embedding model has slightly better/worse eval performance, overall mixed. On the other hand, having a label on every second token (with_labels model) does increase eval performance for the comparable hyperparameters. The intuition here is that the extra token is a chance for the model to "think".

caveats

  • The model may be memorizing the substraction result from the dataset, thus becoming more of a recall test rather than calculation test. Although there are instances where the model gets the right result that's not in the dataset.
  • No optimal hyperparameter search was done.
  • main branch is for evals, training is in real_run

weights

See here