[v2.3.1] Release Tracker #125425

Open
atalman opened this issue May 2, 2024 · 26 comments
Labels
triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

@atalman
Contributor

atalman commented May 2, 2024

🐛 Describe the bug

This issue tracks cherry-picks to the release branch. The release branch for the 2.3.1 release is release/2.3.

Our plan from this point is roughly the following:

  • Phase 1 (until 5/27): Deadline for posting cherry-picks (end of day, 5 PM PST)
  • Phase 2 (after 5/27): Perform extended integration/stability/performance testing based on Release Candidate builds.

Only issues that have ‘cherry-picks’ in this tracker will be considered for the release.

Cherry-Pick Criteria

Phase 1 (until 5/27):

The Releng team relies on the cherry-pick process to manage risk to release quality: by porting only a small set of "must-have" commits from trunk into the release branch, we limit changes to the minimum needed to address pressing issues. Thus, not everything a developer lands in trunk will make it into the release. Please consider the criteria below and follow the cherry-picking process. Only low-risk changes may be cherry-picked from master:

  1. No feature work allowed
  2. Fixes to regressions against the most recent release (e.g. 2.3.0 for 2.3.1 release; see module: regression issue list)
  3. Low risk critical fixes for: silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks
  4. Critical fixes to new features introduced in the 2.3.0 release
  5. Documentation improvements
  6. Release branch specific changes (e.g. blocking CI fixes, changes to version identifiers)

Any other change requires special dispensation from the release managers (currently @atalman, @huydhn, @PaliC, @malfet). If this applies to your change, please write "Special Dispensation" in the "Criteria Category:" template below and explain why.

Phase 2 (after 5/27):

Note that changes in this phase require us to rebuild a Release Candidate and restart extended testing (likely delaying the release). Therefore, the only accepted changes are release-blocking critical fixes for: silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks.

Changes will likely require a discussion with the larger release team over VC or Slack.

Cherry-Pick Process

  1. Ensure your PR has landed in master. This does not apply to release-branch specific changes (see Phase 1 criteria).

  2. Create (but do not land) a PR against the release branch.

    # Find the hash of the commit you want to cherry pick
    # (for example, abcdef12345)
    git log
    
    git fetch origin release/2.3
    git checkout release/2.3
    git cherry-pick abcdef12345
    
    # Push the branch to your fork ('my-fork' is your fork's remote name),
    # then submit a PR against 'release/2.3' either
    # via the GitHub UI:
    git push my-fork
    
    # or via the GitHub CLI:
    gh pr create --base release/2.3

  3. Make a request below with the following format:

Link to landed trunk PR (if applicable):
* 

Link to release branch PR:
* 

Criteria Category:
* 
  4. Someone from the release team will reply with approved / denied or ask for more information.
  5. If approved, someone from the release team will merge your PR once the tests pass. Do not land the release branch PR yourself.
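
The request format from step 3 can be generated with a small shell function, which may help avoid dropping a field. This is only a sketch: `format_cherry_pick_request` is a hypothetical helper, not part of any PyTorch tooling, and the links in the example call are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of any PyTorch tooling): prints a request
# in the three-field format used in this tracker.
#   $1 = link to the landed trunk PR (pass "" for release-only changes)
#   $2 = link to the release branch PR
#   $3 = criteria category
format_cherry_pick_request() {
  local trunk_pr="$1" release_pr="$2" category="$3"
  printf 'Link to landed trunk PR (if applicable):\n* %s\n\n' "$trunk_pr"
  printf 'Link to release branch PR:\n* %s\n\n' "$release_pr"
  printf 'Criteria Category:\n* %s\n' "$category"
}

# Placeholder links: substitute the real PR URLs before posting.
format_cherry_pick_request \
  "<link to trunk PR>" \
  "<link to release branch PR>" \
  "Documentation improvements"
```

The printed text can be pasted into a comment on this issue as is.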

NOTE: Our normal tools (ghstack / ghimport, etc.) do not work on the release branch.

See HUD 2.3

Versions

2.3.1

@atalman atalman added this to the 2.3.1 milestone May 2, 2024
@atalman atalman pinned this issue May 2, 2024
@snadampal
Collaborator

snadampal commented May 2, 2024

Link to main PR (if applicable):

Link to 2.3.1 release branch PR:

Criteria Category:
Fixes performance regression issues pytorch/builder#1774 and #124922


@atalman merged, pending validation with test described here: pytorch/builder#1774 (comment)

@saitcakmak
Contributor

saitcakmak commented May 2, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Documentation improvements

@atalman merged

@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 3, 2024
@soulitzer
Contributor

soulitzer commented May 8, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • low risk critical fix

@atalman merged

@eqy
Collaborator

eqy commented May 8, 2024

Link to landed trunk PR (if applicable):

  • This is not a PR cherrypick but a revert commit: 51cf57c

Link to release branch PR:

Criteria Category:

  • Low risk critical fix

@atalman merged

@FFFrog FFFrog unpinned this issue May 10, 2024
@FFFrog FFFrog pinned this issue May 10, 2024
@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical Fix - torchdata compatibility

@huydhn merged

@atalman
Contributor Author

atalman commented May 13, 2024

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:


@huydhn merged

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:


@huydhn merged

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Fixes doc failure on Release branch

@huydhn merged

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:


@huydhn merged

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:


@huydhn I updated the cherry pick manually to remove changes from #124479, let's see if that works.
@atalman merged

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:


@huydhn merged

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:


@huydhn This cherry pick is complex: from what I can see, cherry-picking it blindly won't work. It depends on #122146, which enables dynamo for 3.12, and that stack is not a small one. So I think we need to rework the cherry pick #126107 manually if we want to go ahead with this (cc @williamwen42 @atalman @malfet @albanD)

@atalman Closed - #126107. Merged - #126235

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Fixes new feature

@huydhn merged

@atalman
Contributor Author

atalman commented May 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Required to fix Lint errors on release

@huydhn merged

@wanchaol
Contributor

wanchaol commented May 14, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Fixes to regressions against the 2.3.0 release

@atalman merged

@kiukchung kiukchung unpinned this issue May 14, 2024
@atalman atalman pinned this issue May 14, 2024
@atalman
Contributor Author

atalman commented May 14, 2024

Link to landed trunk PR (if applicable):
*

Link to release branch PR:

Criteria Category:

  • Release only - use triton 2.3.1 version rather than current 2.3.0

@atalman merged

@antoinebrl

Hello 👋!
I am facing an issue with non-persistent buffers in distributed scenarios. A solution was submitted and merged into main (#125337). I was following the cherry-picking process to submit this fix so that it becomes part of the next patch release. However, the solution was developed after a refactoring that tackled another problem introduced in 2.3 (a891779). This creates a conflict that prevents me from submitting the cherry-picked commit as is. Should I submit both modifications in the corresponding order for them to be included in 2.3.1?
Tagging @fegin who authored both contributions.
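
The kind of conflict described above can be reproduced in a throwaway repository. The sketch below is illustrative only: the file `state.txt`, the branch `release-demo`, and the two commits are hypothetical stand-ins, not the actual changes from #125337 or a891779, and whether both commits qualify for 2.3.1 remains a decision for the release managers.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository so the sketch never touches a real checkout.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Example Dev"

# Base commit shared by trunk and the (hypothetical) release branch.
printf 'line one\n' > state.txt
git add state.txt
git commit -q -m "base"
git branch release-demo   # stand-in for release/2.3

# Two trunk commits: a refactor, then a fix that edits the refactored line.
printf 'line one refactored\n' > state.txt
git commit -q -am "refactor"
refactor_sha="$(git rev-parse HEAD)"
printf 'line one refactored and fixed\n' > state.txt
git commit -q -am "fix"
fix_sha="$(git rev-parse HEAD)"

git checkout -q release-demo

# Picking only the fix conflicts: the line it edits was changed by the refactor.
if git cherry-pick "$fix_sha" >/dev/null 2>&1; then
  echo "fix applied alone"
else
  echo "fix alone conflicts"
  git cherry-pick --abort
fi

# Picking both commits in their original order applies cleanly.
git cherry-pick "$refactor_sha" "$fix_sha" >/dev/null
cat state.txt
```

In this toy setup the fix cannot be applied on its own, while picking the refactor first and the fix second succeeds. The alternative is to resolve the conflict by hand and submit a single reworked commit.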

@atalman
Contributor Author

atalman commented May 15, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Documentation

@atalman merged

@williamwen42
Member

williamwen42 commented May 15, 2024

@digantdesai digantdesai unpinned this issue May 16, 2024
@jithunnair-amd
Collaborator

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Low risk addition of hipify mappings to enable DeepSpeed transformer extensions on ROCm

@weifengpy
Contributor

weifengpy commented May 16, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Fixes to regressions against the 2.3.0 release

@atalman atalman pinned this issue May 17, 2024
@atalman
Contributor Author

atalman commented May 17, 2024

Link to landed trunk PR (if applicable):

  • NA

Link to release branch PR:

Criteria Category:

  • Release only changes - pin docker image for ROCm CI. Temporary PR. ROCm test jobs were failing with the MIOpen error because a subtle difference crept into the MIOpen kdb files when the docker images were rebuilt. Hence the pin, to make CI jobs green. Will unpin once the kdb issue is resolved.

@atalman merged #126452

@mvpatel2000
Contributor

Link to main PR (if applicable):

Link to 2.3.1 release branch PR:

Criteria Category:

  • Low risk critical fix for checkpointing. This PR removes an additional check introduced in torch 2.3 which is actually incorrect and causes checkpointing to fail if at least 1 forward/backward pass has not been run.

@mvpatel2000
Contributor

Link to main PR (if applicable):

Link to 2.3.1 release branch PR:

Criteria Category:

  • Low risk critical fix for checkpointing. When checkpointing with activation checkpointing, the names of the variables (FQNs) are changed with the tag _checkpoint_wrapped_module. This breaks checkpointing if you resume without activation checkpointing.

@mvpatel2000
Contributor

Link to main PR:

Link to 2.3.1 release branch PR:

Criteria Category:

  • Low risk critical fix for checkpointing. Torch 2.3 ignores _extra_state (whereas prior PyTorch versions correctly handle it), breaking checkpoint loading from prior checkpoints. This also breaks integration with Nvidia's TransformerEngine.
