Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print step time for each step #361

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

samos123
Copy link

@samos123 samos123 commented Mar 9, 2024

This is helpful in cases where there is variable step time and looking at the logs would quickly allow you to identify such cases.

@samos123 samos123 changed the title Print step time for each stip Print step time for each step Mar 9, 2024
Copy link
Contributor

@markblee markblee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm not sure that we want to print every step by default (since it'll introduce a significant amount of logs/summaries, and since avg step time is usually pretty stable after the first couple hundred steps)?

@samos123
Copy link
Author

samos123 commented Mar 9, 2024

That makes sense. Should I hide it behind an option? For me it was important to troubleshoot a variable step time issue. Or would you rather totally leave this out of the code base. Note I'm fine with that too.

@samos123 samos123 requested a review from markblee March 9, 2024 06:02
@samos123
Copy link
Author

samos123 commented Mar 9, 2024

I've added it as a config parameter and made it false by default

@samos123
Copy link
Author

samos123 commented May 13, 2024

I have been using this for 2 use cases:

  • Ensure step time is stable across steps. In the past on the GPU, the networking becomes unstable where collectives sometimes take longer between steps. E.g. step time varies between 5 second to 7 seconds.
  • Quickly evaluate whether there is a performance gain without having to wait for 100 steps

Please let me know if there is any interest in merging it @markblee

Comment on lines +144 to +145
# Log the step time for each individual step
log_step_time: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it'll be more general to do:

Suggested change
# Log the step time for each individual step
log_step_time: bool = False
# Interval to log average step time. If None, defaults to 100.
log_step_time_every_n_steps: Optional[int] = None

Note that we default to None instead of 100 to avoid triggering a large number of golden config updates.

step_time = time.perf_counter() - step_start_time
self._step_log("Step %s took %s seconds", self.step, step_time)
self.summary_writer(self.step, {"step_time": step_time})

num_steps += 1
if num_steps % 100 == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if num_steps % 100 == 0:
if num_steps % (cfg.log_step_time_every_n_steps or 100) == 0:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants