gpu_sharing_config missing from guest_accelerator #1430

Closed
hi-tal opened this issue Oct 18, 2022 · 4 comments
Labels: bug (Something isn't working), triaged (Scoped and ready for work), upstream (Work required on Terraform core or provider)

Comments


hi-tal commented Oct 18, 2022

TL;DR

Running our Terraform we get the error: Inappropriate value for attribute "guest_accelerator": element 0: attribute "gpu_sharing_config" is required.
The problem is that gpu_sharing_config cannot be passed through to Google by the module.

See the definition of gpu_sharing_config here:
https://github.com/hashicorp/terraform-provider-google/blob/main/google/node_config.go
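
For context, this is roughly what the provider schema expects when the block is set directly on the resource; the field names are as defined in the provider source linked above, and the resource name and values here are only illustrative:

resource "google_container_node_pool" "example" {
  # cluster, name, node_count, etc. omitted

  node_config {
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1

      # The block that provider 4.41.0 now treats as required
      gpu_sharing_config {
        gpu_sharing_strategy       = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }
    }
  }
}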

│ Error: Incorrect attribute value type
│
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 345, in resource "google_container_node_pool" "pools":
│  345:     guest_accelerator = [
│  346:       for guest_accelerator in lookup(each.value, "accelerator_count", 0) > 0 ? [{
│  347:         type               = lookup(each.value, "accelerator_type", "")
│  348:         count              = lookup(each.value, "accelerator_count", 0)
│  349:         gpu_partition_size = lookup(each.value, "gpu_partition_size", null)
│  350:         }] : [] : {
│  351:         type               = guest_accelerator["type"]
│  352:         count              = guest_accelerator["count"]
│  353:         gpu_partition_size = guest_accelerator["gpu_partition_size"]
│  354:       }
│  355:     ]
│     ├────────────────
│     │ each.value is map of string with 15 elements
│
│ Inappropriate value for attribute "guest_accelerator": element 0: attribute
│ "gpu_sharing_config" is required.
╵
ERRO[0012] 1 error occurred:
        * exit status 1

Expected behavior

I'd expect to be able to set it, or for Google to keep it optional as it used to be.

Observed behavior

Creating the environment fails.

Terraform Configuration

terraform {
  backend "gcs" {}
}


#provider "kubernetes" {
#  load_config_file       = false
#  host                   = "https://${module.gke.endpoint}"
#  token                  = data.google_client_config.default.access_token
#  cluster_ca_certificate = base64decode(module.gke.ca_certificate)
#}


module "gke" {
  source                     = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version = "21.1.0"
  project_id                 = var.project_id
  name                       = var.name
  region                     = var.region
  zones                      = var.zones
  kubernetes_version         = var.kubernetes_version
  release_channel            = var.release_channel
  network                    = format("%s-private-vpc",var.project_id)
  subnetwork                 = "gke-nodes"
  ip_range_services          = "gke-services"
  ip_range_pods              = "gke-pods"
  enable_private_nodes       = true
  enable_private_endpoint    = true
  horizontal_pod_autoscaling = true
  master_ipv4_cidr_block     = var.master_ipv4_cidr_block
  monitoring_service         = "monitoring.googleapis.com/kubernetes"
  logging_service            = "logging.googleapis.com/kubernetes"
  master_authorized_networks = var.master_authorized_networks
  filestore_csi_driver       = true
  grant_registry_access      = true
  default_max_pods_per_node      = var.default_max_pods_per_node

  node_pools = [
    {
      name               = "cpu"
      machine_type       = "e2-highmem-16"
      min_count          = var.cpu_nodes_min
      max_count          = var.cpu_nodes_max
      local_ssd_count    = 0
      disk_size_gb       = 100
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
    },
    {
      name               = "gpu"
      machine_type       = "custom-48-319488"
      min_count          = var.gpu_nodes_min
      max_count          = var.gpu_nodes_max
      local_ssd_count    = 0
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
      accelerator_count  = 1
      accelerator_type   = "nvidia-tesla-t4"
      node_locations     = "${var.region}-c"
    },
    {
      name               = "planner"
      machine_type       = "n1-highmem-16"
      min_count          = var.gpu_nodes_min
      max_count          = var.gpu_nodes_max
      local_ssd_count    = 0
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
      accelerator_count  = 1
      accelerator_type   = "nvidia-tesla-t4"
      node_locations     = "${var.region}-c"
    },
    {
      name               = "data"
      machine_type       = "e2-highmem-16"
      min_count          = var.data_nodes_min
      max_count          = var.data_nodes_max
      local_ssd_count    = 0
      disk_size_gb       = 50
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
    },
    {
      name               = "ui"
      machine_type       = "n1-highmem-4"
      min_count          = var.data_nodes_min
      max_count          = var.data_nodes_max
      local_ssd_count    = 0
      disk_size_gb       = 100
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      create_service_account = true
      #service_account    = format("%s@%s.iam.gserviceaccount.com", var.name, var.project_id)
      preemptible        = false
      initial_node_count = 1
      sandbox_type       = "gvisor"
      sandbox_enabled     = true
    },
  ]

  node_pools_oauth_scopes = {
    all = [
    "https://www.googleapis.com/auth/cloud-platform",
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/servicecontrol",
    "https://www.googleapis.com/auth/service.management.readonly",
    "https://www.googleapis.com/auth/trace.append"
    ]

  }


  node_pools_labels = {
    all = {}

    cpu = {
      Environment = "cpu"
    }

    data = {
      cpu = "true"
    }

    gpu = {
      Environment = "gpu"
    }

    planner = {
      Environment = "planner"
    }

    ui = {
      Environment = "ui"
    }
  }

  node_pools_metadata = {
    all = {}

    default-node-pool = {
      node-pool-metadata-custom-value = "workers"
    }
  }

  node_pools_taints = {
    all = []

    gpu = [
      {
        key    = "dedicated"
        value  = "gpuGroup"
        effect = "NO_SCHEDULE"
      }
    ]
    planner = [
      {
        key    = "dedicated"
        value  = "plannerGroup"
        effect = "NO_SCHEDULE"
      }
    ]
    ui = [
      {
        key    = "dedicated"
        value  = "uiGroup"
        effect = "NO_SCHEDULE"
      }
    ]
    data = [
      {
        key    = "dedicated"
        value  = "metadata"
        effect = "NO_SCHEDULE"
      }
    ]

    default-node-pool = [
      {
        key    = "default-node-pool"
        value  = true
        effect = "PREFER_NO_SCHEDULE"
      },
    ]
  }

  node_pools_tags = {
    all = []
    private-workers = [
        "private",
    ]

  }

}

Terraform Version

Terraform v1.2.1
on linux_amd64

Additional information

This used to work with earlier provider versions.

hi-tal added the bug label Oct 18, 2022
@bharathkkb (Member)

Thanks for the report @hi-tal. I believe this is an upstream issue with the latest provider 4.41.0. Could you try pinning to 4.40.0 as a workaround? Ref: hashicorp/terraform-provider-google#12817
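
For anyone else hitting this before the module fix lands, a minimal sketch of pinning the provider in the root module (adjust the constraint to your setup):

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      # Pin below 4.41.0 until the module fix is released
      version = "= 4.40.0"
    }
  }
}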

bharathkkb added the upstream and triaged labels Oct 18, 2022

hi-tal commented Oct 19, 2022

@bharathkkb Thank you very much for the fast response.
Pinning the version to 4.40.0 works like a charm!

@bharathkkb (Member)

@hi-tal Glad to hear. We have switched to using a dynamic block in the module in #1428, so this should now be fixed in the main branch and will go out in a future release.
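
Roughly, the dynamic-block approach looks like the sketch below (illustrative only, not the exact module code); because the inner block is only generated when an accelerator is configured, optional attributes such as gpu_sharing_config can simply be omitted:

# Illustrative sketch, assuming the node pool resource uses for_each over node_pools
resource "google_container_node_pool" "pools" {
  # other arguments omitted

  node_config {
    dynamic "guest_accelerator" {
      for_each = lookup(each.value, "accelerator_count", 0) > 0 ? [1] : []
      content {
        type               = lookup(each.value, "accelerator_type", "")
        count              = lookup(each.value, "accelerator_count", 0)
        gpu_partition_size = lookup(each.value, "gpu_partition_size", null)
      }
    }
  }
}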

@staypuftman

I'm seeing this error in v4.64.0, and pinning back to v4.40.0 isn't an option for me because other features in my configs (managed Prometheus, cost management) would be missing.
