
Add new model config for smaller tests #450

Open · wants to merge 7 commits into base: main
Conversation

@jesus-orozco commented May 8, 2024

Adds a new model configuration for text experiments. The goal is an early-termination config for fuji-test to accelerate infrastructure validation.

@jiya-zhang (Contributor):

@jesus-orozco is still working on this PR, but it would be helpful to get some early feedback from @markblee - Thanks!

cfg.mesh_shape = mesh_shape_from_axes(data=-1, fsdp=4)
cfg.summary_writer.write_every_n_steps = eval_every_n_steps
cfg.checkpointer.save_policy = config_for_function(every_n_steps_policy).set(
    n=eval_every_n_steps
)
Contributor:

Is it possible to save checkpoints more frequently than evals? Something like: save a checkpoint every 500 steps, eval every 1500 steps. This would let us identify issues separately if the job hangs.

@jesus-orozco (Author):

Thanks for the feedback! Added a custom policy that saves checkpoints more often, decoupling the checkpoint cadence from the eval cadence.
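For reference, the decoupling idea can be sketched with a simplified stand-in for axlearn's `every_n_steps_policy` (the helper below and the step values 500/1500 are illustrative, not the PR's actual code):

```python
def every_n_steps_policy(n: int):
    """Simplified stand-in for axlearn's every_n_steps_policy:
    returns a predicate that fires on every n-th step."""
    def policy(step: int) -> bool:
        return step > 0 and step % n == 0
    return policy

# Hypothetical cadences from the review discussion: save more often than eval.
save_policy = every_n_steps_policy(500)
eval_policy = every_n_steps_policy(1500)

# Steps where a checkpoint is written but no eval runs; seeing these
# checkpoints land tells a hung job apart from an eval-time failure.
save_only_steps = [s for s in range(1, 3001) if save_policy(s) and not eval_policy(s)]
```

With these values, steps 500, 1000, 2000, and 2500 save a checkpoint without triggering an eval, which is the separation the reviewer asked for.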

@markblee (Contributor) left a comment:

Thanks @jesus-orozco !

axlearn/experiments/text/gpt/c4_trainer.py (outdated; resolved)
@jesus-orozco jesus-orozco marked this pull request as ready for review May 10, 2024 00:29
@@ -140,6 +140,29 @@ def get_trainer_kwargs(model_size: str, *, vocab_size: int, version: Version) ->
),
),
)
elif model_size == "simple":
Contributor:

Thanks! Does this need to be separate from "test" (which is itself intended to be the testing configuration)?

In particular, we can configure mesh_rules for the accelerator that you are testing on. This way, it'll run on both CPU and the target testing hardware.

The only other differences seem to be batch sizes and eval/saving more frequently, which seem tolerable as defaults. WDYT?

Contributor:

Re mesh_rules: yeah, sometimes we test on v4-8, and we need something like (-1, 1, 4, 1, 1).

However, re eval/saving/max steps, we do want a config that terminates training early. As long as training runs for a few thousand steps without problems, we know the JAX testing passes.
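A minimal sketch of the early-termination idea, assuming the trainer config exposes a `max_step` field (both the field name and the value here are assumptions for illustration, not the PR's actual diff):

```python
# Stop the smoke-test run after a few thousand steps; if it reaches
# max_step without hanging, the infrastructure/JAX setup is considered
# validated. (max_step and the value 3000 are assumed for this sketch.)
cfg.max_step = 3000
```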

@jesus-orozco (Author):

Thanks! That sounds reasonable; I'm adding the new defaults for frequent saving/early termination to the "test" configuration instead.
On mesh rules, I'll leave the default so it works on CPU, but can you clarify how we can configure the rules for specific accelerators? As Maggie mentioned, we'd be testing mainly on smaller TPU shapes like v4-8.

Contributor:

Left a comment inline -- since it's a simple case, we probably do not need mesh rules. You can think of mesh rules as overrides to the default mesh. E.g.

mesh_rules=(
    ("tpu-v4-8", mesh_shape_from_axes(fsdp=-1)),
)

means that if the instance type matches tpu-v4-8, we use (1, 1, 4, 1, 1) instead of the default mesh_shape. Let me know whether this makes sense.
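To make the override behavior concrete, here is a hedged sketch of first-match rule selection; the function name, the tuple meshes, and the fallback are illustrative stand-ins, not axlearn's actual implementation:

```python
import re

# Default mesh: data=-1 fills all remaining devices
# (axes sketched as data, expert, fsdp, seq, model).
DEFAULT_MESH = (-1, 1, 1, 1, 1)

# First rule whose pattern matches the instance type wins.
mesh_rules = (
    ("tpu-v4-8", (1, 1, 4, 1, 1)),  # fsdp=-1 resolves to 4 chips on a v4-8
)

def select_mesh(instance_type: str, rules=mesh_rules, default=DEFAULT_MESH):
    """Illustrative mesh-rule matching: override the default mesh
    when a rule pattern matches the instance type."""
    for pattern, mesh in rules:
        if re.fullmatch(pattern, instance_type):
            return mesh
    return default
```

Here `select_mesh("tpu-v4-8")` returns the override `(1, 1, 4, 1, 1)`, while any other instance type falls back to `DEFAULT_MESH`.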

@jesus-orozco (Author):

Thanks @markblee!
Committed the changes you suggested; it makes sense to add data=-1 to the default configuration.

Co-authored-by: Mark Lee <mmaarrkklleeee@gmail.com>
Copy link
Contributor

@markblee markblee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, you might need to run golden config updates: https://github.com/apple/axlearn/blob/main/docs/01-start.md#testing


3 participants