Updated the UI for TPU support #201

Open · wants to merge 2 commits into main
Conversation

sayantan1410 (Contributor)

Description - Adding support for TPU

Fix #173

  • New feature

I have updated the UI but couldn't figure out the next step: if XLA-TPU is selected, training should be distributed across 8 processes.

Here's a screenshot of the UI: [Screenshot (16)]

netlify bot commented Jan 10, 2022

👷 Deploy request for code-generator pending review. Visit the deploys page to approve it.

🔨 Explore the source changes: 460772b

vfdev-5 (Member) commented Jan 10, 2022

I have updated the UI but couldn't figure out the next step: if XLA-TPU is selected, training should be distributed across 8 processes.

@ydcjeff can you help with that please? How to make the following:

when the user selects "xla-tpu", training should only be distributed across 8 processes, using "Run the training with torch.multiprocessing.spawn".
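
For illustration, here is a minimal sketch of what this would mean for the generated main.py, assuming PyTorch-Ignite's idist.Parallel (the training function and config names here are illustrative, not the template's actual code):

import ignite.distributed as idist

def training(local_rank, config):
    # rank-local training loop; idist.device() resolves to the XLA device here
    device = idist.device()
    ...

# with backend="xla-tpu" and nproc_per_node=8, idist.Parallel spawns the
# 8 processes itself (torch.multiprocessing.spawn / xmp.spawn under the hood),
# so no external launcher such as torch.distributed.launch is needed
with idist.Parallel(backend="xla-tpu", nproc_per_node=8) as parallel:
    parallel.run(training, config={})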

@sayantan1410 first, a comment about the UI: basically, we cannot and should not choose a backend if distributed training is not specified, so "Choose a Backend" should be a subpart of "Distributed Training" and should only be active when distributed training is selected.

sayantan1410 (Contributor, Author)

@vfdev-5 Okay, will change that!

sayantan1410 (Contributor, Author)

@vfdev-5 I have made the changes you suggested; the "choose a backend" option is now shown only when distributed training is selected.

vfdev-5 (Member) left a comment

Thanks for the update @sayantan1410
Now let's move on to updating the content once a backend is selected.

Review comments on src/components/TabTraining.vue and src/metadata/metadata.json (outdated, resolved)
sayantan1410 (Contributor, Author)

@vfdev-5 can you please guide me on how to do that?

vfdev-5 (Member) commented Jan 14, 2022

@sayantan1410 we are running a coding sprint right now (please check our Discord, #start-contributing channel). If you can join it, it could be a good opportunity to learn more about the projects and be guided.

sayantan1410 (Contributor, Author)

@vfdev-5 Sorry for the delay. I have made the changes requested above; can you please guide me on how to update the content once a backend is selected?

vfdev-5 (Member) commented Jan 22, 2022

@sayantan1410 no worries and thanks for the update!
I had an idea on how to progress on this in a gradual manner. Adding TPU support will require a lot of updates, and I'm not quite sure about all of them right now. However, what we can do right now is the following: let's add the "Gloo" backend, which is very similar to NCCL. Basically, we can create another PR with almost the same UI update as here, but instead of XLA-TPU we put GLOO, and we have to add if-else template conditions in the content wherever we currently put the "nccl" string. What do you think?
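
To see why Gloo is a good stepping stone, here is a minimal sketch (using the standard torch.distributed API, not the template's literal code) showing that, from the generated code's point of view, only the backend string differs:

import os
import torch.distributed as dist

# the launcher (torch.distributed.launch / torchrun) sets RANK, WORLD_SIZE,
# MASTER_ADDR and MASTER_PORT; the template only has to swap the string
backend = os.environ.get("BACKEND", "gloo")  # "gloo" (CPU) or "nccl" (GPU)
dist.init_process_group(backend=backend)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} initialized with {backend}")
dist.destroy_process_group()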

For example, here are the content changes we would like to make for each template; let me illustrate with the vision classification template.

  1. README.md

In the readme we give some launch commands depending on the config, for example:

python -m torch.distributed.launch \
  --nproc_per_node #:::= nproc_per_node :::# \
  --nnodes #:::= it.nnodes :::# \
  --node_rank 0 \
  --master_addr #:::= it.master_addr :::# \
  --master_port #:::= it.master_port :::# \
  --use_env main.py \
  --backend nccl

We can see that --nproc_per_node is taken from the UI with #:::= nproc_per_node :::#, etc.
We need to replace every --backend nccl with something similar: --backend #:::= backend :::#.

I think that's it for GLOO backend...

By the way, we also need to replace python -m torch.distributed.launch with torchrun, see #199.
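
For reference, the torchrun equivalent should look roughly like this (same template placeholders; a sketch only, since the actual migration happens in #199):

torchrun \
  --nproc_per_node #:::= nproc_per_node :::# \
  --nnodes #:::= it.nnodes :::# \
  --node_rank 0 \
  --master_addr #:::= it.master_addr :::# \
  --master_port #:::= it.master_port :::# \
  main.py \
  --backend #:::= backend :::#

(torchrun passes rank information through environment variables by default, so --use_env goes away.)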

sayantan1410 (Contributor, Author)

@vfdev-5 Okay, then I will open another draft PR with the updates you suggested, and we can take that forward. Also, should I close this PR or keep it as it is?

vfdev-5 (Member) commented Jan 23, 2022

Sounds good for another draft PR, and let's keep this one open.

sayantan1410 mentioned this pull request Jan 25, 2022
vfdev-5 closed this in #203 Jan 26, 2022
vfdev-5 reopened this Jan 26, 2022
vfdev-5 (Member) commented Jan 26, 2022

@sayantan1410 can you now sync this PR with the main branch so we can work on adding the TPU option?
If you would rather close this one and open another, feel free to do whatever you are comfortable with.

sayantan1410 (Contributor, Author)

@vfdev-5 Yeah sure, doing it!

sayantan1410 reopened this Jan 27, 2022
sayantan1410 (Contributor, Author)

@vfdev-5 Added XLA-TPU option in the backend dropdown.

vfdev-5 (Member) commented Jan 27, 2022

@vfdev-5 Added XLA-TPU option in the backend dropdown.

Sounds good, so what's next? :)

sayantan1410 (Contributor, Author) commented Jan 27, 2022

@vfdev-5 I have a very stupid question: say someone selects XLA-TPU as the backend. Do we then have to check whether XLA is present on the system and whether a TPU is available, and only generate the template code for them accordingly?

vfdev-5 (Member) commented Jan 27, 2022

We do not check for the infrastructure in the code-generator app. For example, when nccl and distributed training with 1000 nodes and 10000 processes are specified, we just say in the readme how to launch it, and that's it.
The user can download everything as a zip or open it in Colab, where we set GPU by default in the metadata.
The same should be done for XLA if exported to Colab; otherwise, we do not check the user's infrastructure.
You can try it from your side: take the code, execute it with the xla-tpu backend, and see what the error message is. I think that should be enough.
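
For example, a quick probe along those lines, assuming PyTorch-Ignite's idist.Parallel (the exact exception type may differ):

import ignite.distributed as idist

def training(local_rank):
    pass

# on a machine without torch_xla / a TPU this should fail fast with an
# informative error rather than being silently ignored
try:
    with idist.Parallel(backend="xla-tpu", nproc_per_node=8) as parallel:
        parallel.run(training)
except (RuntimeError, ValueError) as e:
    print("xla-tpu backend unavailable:", e)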

sayantan1410 (Contributor, Author)

@vfdev-5 Okay, I will try it and let you know.

Successfully merging this pull request may close these issues: Add support for TPU devices