[TPU VM] Attaching & Mounting Persistent Disk #3497

Open: wants to merge 6 commits into base: master

jackyk02 (Contributor) commented Apr 29, 2024

Issue
Reference: #2778

When launching a TPU VM with sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200, the resulting VM is still initialized with the default 100 GB disk. Because the boot disk of a TPU VM is not resizable, users have to attach a persistent disk to expand their local disk capacity.

tpu_vm.yaml:

resources:
   accelerators: tpu-v2-8
   accelerator_args:
      runtime_version: tpu-vm-base

Solution
We currently use the Cloud TPU API to manage TPU VMs (e.g. create_instance, set_labels, and delete_instance). However, this API lacks functionality for disk attachment. This PR therefore uses the GCP CLI to attach a persistent disk to TPU VMs (Documentation).
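For reviewers unfamiliar with the CLI route, here is a minimal sketch of what creating and attaching a disk could look like. It is not the code in this PR: the helper name build_attach_disk_commands and its parameters are hypothetical, and it assumes the gcloud commands `compute disks create` and `alpha compute tpus tpu-vm attach-disk` are the ones being wrapped. The function only builds the argument lists, so nothing is executed until the caller passes them to subprocess.

```python
from typing import List


def build_attach_disk_commands(tpu_name: str, disk_name: str, zone: str,
                               size_gb: int) -> List[List[str]]:
    """Build gcloud argument lists to create a disk and attach it to a TPU VM.

    Hypothetical helper for illustration; names and flags chosen here are
    assumptions, not the implementation in this PR.
    """
    # Create a zonal persistent disk of the requested size.
    create_cmd = [
        'gcloud', 'compute', 'disks', 'create', disk_name,
        f'--size={size_gb}GB',
        f'--zone={zone}',
    ]
    # Attach the disk to the TPU VM in read-write mode.
    attach_cmd = [
        'gcloud', 'alpha', 'compute', 'tpus', 'tpu-vm', 'attach-disk',
        tpu_name,
        f'--zone={zone}',
        f'--disk={disk_name}',
        '--mode=read-write',
    ]
    return [create_cmd, attach_cmd]
```

Each returned list can be run with subprocess.run(cmd, check=True). Keeping command construction separate from execution makes the flags easy to unit test without touching a GCP project.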

Test 1:

  1. Launch the TPU VM with a specified disk size, then stop it:
    sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200
    sky stop mycluster

  2. Restart the TPU VM with the same disk size:
    sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200

  3. Verified that an extra 100 GB disk has been created and attached to the TPU VM

  4. Ensured that the disk is mounted under the path /mnt/disks/persist
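
The mount check in step 4 can be scripted. The following is a small sketch, not part of this PR: is_mounted is a hypothetical helper that scans text in the /proc/mounts format for a given mount point. The device name /dev/sdb in the sample is an assumption for illustration only.

```python
def is_mounted(mount_point: str, mounts_text: str) -> bool:
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device mount_point fstype options dump pass
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


# Sample text in /proc/mounts format; /dev/sdb is a made-up device name.
SAMPLE_MOUNTS = (
    '/dev/sda1 / ext4 rw,relatime 0 0\n'
    '/dev/sdb /mnt/disks/persist ext4 rw,relatime,discard 0 0\n'
)
```

On the TPU VM itself, the same function can be fed the contents of /proc/mounts, for example is_mounted('/mnt/disks/persist', open('/proc/mounts').read()).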

Test 2:

  1. Relaunch the TPU VM multiple times
  2. Received the error: Disk creation failed: The resource 'projects/project_name/zones/zone_name/disks/mycluster-d9a3-tpu-extra-disk' already exists

Test 3:

  1. Launch the TPU VM with a disk size of less than 100 GB, then stop it:
    sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80
    sky stop mycluster

  2. Restart the TPU VM with the same disk size:
    sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80

  3. Verified that no extra disk has been created
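
Tests 1 and 3 together suggest a sizing rule: the extra disk covers whatever the requested size exceeds the 100 GB boot disk by, and no disk is created otherwise. The sketch below encodes that reading. It is inferred from the test results above, not taken from the PR's code; the function name is hypothetical, and the behavior at exactly 100 GB (no extra disk) is an assumption.

```python
DEFAULT_BOOT_DISK_GB = 100  # default TPU VM boot disk size stated in this PR


def extra_disk_size_gb(requested_gb: int,
                       boot_disk_gb: int = DEFAULT_BOOT_DISK_GB) -> int:
    """Size of the extra persistent disk, or 0 if none is needed.

    Assumes the extra disk makes up the difference between the requested
    size and the fixed boot disk size (an inference from Tests 1 and 3).
    """
    return max(0, requested_gb - boot_disk_gb)
```

With --disk-size 200 this gives 100, matching Test 1, and with --disk-size 80 it gives 0, matching Test 3.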

Test 4:
pytest tests/test_smoke.py --tpu

Note:

  1. Disk attachment only takes effect when the cluster is restarted.
