gpu_sharing_config missing from guest_accelerator #1430

Closed
hi-tal opened this issue Oct 18, 2022 · 4 comments
Labels: bug (Something isn't working), triaged (Scoped and ready for work), upstream (Work required on Terraform core or provider)

Comments


hi-tal commented Oct 18, 2022

TL;DR

Running our Terraform we get the error: Inappropriate value for attribute "guest_accelerator": element 0: attribute "gpu_sharing_config" is required.
The problem is that gpu_sharing_config cannot be passed through to Google by the module.

See the definition of gpu_sharing_config here:
https://github.com/hashicorp/terraform-provider-google/blob/main/google/node_config.go
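
For context, this is roughly what the provider schema expects when the block is set directly on the resource; the field names are as defined in the provider source linked above, and the resource name and values here are only illustrative:

resource "google_container_node_pool" "example" {
  # cluster, name, node_count, etc. omitted

  node_config {
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1

      # The block that provider 4.41.0 now treats as required
      gpu_sharing_config {
        gpu_sharing_strategy       = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }
    }
  }
}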

│ Error: Incorrect attribute value type
│
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 345, in resource "google_container_node_pool" "pools":
│  345:     guest_accelerator = [
│  346:       for guest_accelerator in lookup(each.value, "accelerator_count", 0) > 0 ? [{
│  347:         type               = lookup(each.value, "accelerator_type", "")
│  348:         count              = lookup(each.value, "accelerator_count", 0)
│  349:         gpu_partition_size = lookup(each.value, "gpu_partition_size", null)
│  350:         }] : [] : {
│  351:         type               = guest_accelerator["type"]
│  352:         count              = guest_accelerator["count"]
│  353:         gpu_partition_size = guest_accelerator["gpu_partition_size"]
│  354:       }
│  355:     ]
│     ├────────────────
│     │ each.value is map of string with 15 elements
│
│ Inappropriate value for attribute "guest_accelerator": element 0: attribute
│ "gpu_sharing_config" is required.
╵
ERRO[0012] 1 error occurred:
        * exit status 1

Expected behavior

I'd expect to be able to set it, or for Google to keep it optional as it used to be.

Observed behavior

Creating the environment fails.

Terraform Configuration

terraform {
  backend "gcs" {}
}


#provider "kubernetes" {
#  load_config_file       = false
#  host                   = "https://${module.gke.endpoint}"
#  token                  = data.google_client_config.default.access_token
#  cluster_ca_certificate = base64decode(module.gke.ca_certificate)
#}


module "gke" {
  source                     = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version = "21.1.0"
  project_id                 = var.project_id
  name                       = var.name
  region                     = var.region
  zones                      = var.zones
  kubernetes_version         = var.kubernetes_version
  release_channel            = var.release_channel
  network                    = format("%s-private-vpc",var.project_id)
  subnetwork                 = "gke-nodes"
  ip_range_services          = "gke-services"
  ip_range_pods              = "gke-pods"
  enable_private_nodes       = true
  enable_private_endpoint    = true
  horizontal_pod_autoscaling = true
  master_ipv4_cidr_block     = var.master_ipv4_cidr_block
  monitoring_service         = "monitoring.googleapis.com/kubernetes"
  logging_service            = "logging.googleapis.com/kubernetes"
  master_authorized_networks = var.master_authorized_networks
  filestore_csi_driver       = true
  grant_registry_access      = true
  default_max_pods_per_node      = var.default_max_pods_per_node

  node_pools = [
    {
      name               = "cpu"
      machine_type       = "e2-highmem-16"
      min_count          = var.cpu_nodes_min
      max_count          = var.cpu_nodes_max
      local_ssd_count    = 0
      disk_size_gb       = 100
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
    },
    {
      name               = "gpu"
      machine_type       = "custom-48-319488"
      min_count          = var.gpu_nodes_min
      max_count          = var.gpu_nodes_max
      local_ssd_count    = 0
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
      accelerator_count  = 1
      accelerator_type   = "nvidia-tesla-t4"
      node_locations     = "${var.region}-c"
    },
    {
      name               = "planner"
      machine_type       = "n1-highmem-16"
      min_count          = var.gpu_nodes_min
      max_count          = var.gpu_nodes_max
      local_ssd_count    = 0
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
      accelerator_count  = 1
      accelerator_type   = "nvidia-tesla-t4"
      node_locations     = "${var.region}-c"
    },
    {
      name               = "data"
      machine_type       = "e2-highmem-16"
      min_count          = var.data_nodes_min
      max_count          = var.data_nodes_max
      local_ssd_count    = 0
      disk_size_gb       = 50
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
    },
    {
      name               = "ui"
      machine_type       = "n1-highmem-4"
      min_count          = var.data_nodes_min
      max_count          = var.data_nodes_max
      local_ssd_count    = 0
      disk_size_gb       = 100
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
      sandbox_type       = "gvisor"
      sandbox_enabled     = true
    },
  ]

  node_pools_oauth_scopes = {
    all = [
    "https://www.googleapis.com/auth/cloud-platform",
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/servicecontrol",
    "https://www.googleapis.com/auth/service.management.readonly",
    "https://www.googleapis.com/auth/trace.append"
    ]

  }


  node_pools_labels = {
    all = {}

    cpu = {
      Environment = "cpu"
    }

    data = {
      cpu = "true"
    }

    gpu = {
      Environment = "gpu"
    }

    planner = {
      Environment = "planner"
    }

    ui = {
      Environment = "ui"
    }
  }

  node_pools_metadata = {
    all = {}

    default-node-pool = {
      node-pool-metadata-custom-value = "workers"
    }
  }

  node_pools_taints = {
    all = []

    gpu = [
      {
        key    = "dedicated"
        value  = "gpuGroup"
        effect = "NO_SCHEDULE"
      }
    ]
    planner = [
      {
        key    = "dedicated"
        value  = "plannerGroup"
        effect = "NO_SCHEDULE"
      }
    ]
    ui = [
      {
        key    = "dedicated"
        value  = "uiGroup"
        effect = "NO_SCHEDULE"
      }
    ]
    data = [
      {
        key    = "dedicated"
        value  = "metadata"
        effect = "NO_SCHEDULE"
      }
    ]

    default-node-pool = [
      {
        key    = "default-node-pool"
        value  = true
        effect = "PREFER_NO_SCHEDULE"
      },
    ]
  }

  node_pools_tags = {
    all = []
    private-workers = [
        "private",
    ]

  }

}

Terraform Version

Terraform v1.2.1
on linux_amd64

Additional information

This used to work with earlier provider versions.

hi-tal added the bug label Oct 18, 2022
@bharathkkb (Member)

Thanks for the report @hi-tal. I believe this is an upstream issue with the latest provider 4.41.0. Could you try pinning to 4.40.0 as a workaround? Ref: hashicorp/terraform-provider-google#12817
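
For anyone else hitting this before the module fix lands, a minimal sketch of pinning the provider in the root module (adjust the constraint to your setup):

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      # Pin below 4.41.0 until the module fix is released
      version = "= 4.40.0"
    }
  }
}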

bharathkkb added the upstream and triaged labels Oct 18, 2022

hi-tal commented Oct 19, 2022

@bharathkkb Thank you very much for the fast response.
Pinning the version to 4.40.0 works like a charm!

@bharathkkb (Member)

@hi-tal Glad to hear. We have switched to using a dynamic block in the module in #1428, so this should now be fixed in the main branch and will go out in a future release.
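
Roughly, the dynamic-block approach looks like the sketch below (illustrative only, not the exact module code); because the inner block is only generated when an accelerator is configured, optional attributes such as gpu_sharing_config can simply be omitted:

# Illustrative sketch, assuming the node pool resource uses for_each over node_pools
resource "google_container_node_pool" "pools" {
  # other arguments omitted

  node_config {
    dynamic "guest_accelerator" {
      for_each = lookup(each.value, "accelerator_count", 0) > 0 ? [1] : []
      content {
        type               = lookup(each.value, "accelerator_type", "")
        count              = lookup(each.value, "accelerator_count", 0)
        gpu_partition_size = lookup(each.value, "gpu_partition_size", null)
      }
    }
  }
}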

@staypuftman

I'm seeing this error in v4.64.0, and pinning back to v4.40.0 isn't an option for me because other features in my configs (managed Prometheus, cost management) would be missing.
