
Issue with deploying via Hex-LLM, the TPU serving solution built with XLA that is being developed by Google Cloud #2768

Open
ariji1 opened this issue Mar 8, 2024 · 6 comments


ariji1 commented Mar 8, 2024

Expected Behavior

The model deploys successfully.

Actual Behavior

I am getting this error:

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/81995035742/locations/us-central1/endpoints/6941658909824253952/operations/4744238776585289728
Using model from: gs://19865_finetuned_models/gemma-keras-lora-train_20240308_200536
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/81995035742/locations/us-central1/endpoints/6941658909824253952
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/81995035742/locations/us-central1/endpoints/6941658909824253952')
INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/81995035742/locations/us-central1/models/6359646723312189440/operations/7818789947195785216
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/81995035742/locations/us-central1/models/6359646723312189440@1
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/81995035742/locations/us-central1/models/6359646723312189440@1')
INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/81995035742/locations/us-central1/endpoints/6941658909824253952

_InactiveRpcError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     71     try:
---> 72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:

11 frames
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Machine type "ct4p-hightpu-4t" is not supported."
debug_error_string = "UNKNOWN:Error received from peer ipv4:173.194.196.95:443 {created_time:"2024-03-08T21:11:42.821027279+00:00", grpc_status:3, grpc_message:"Machine type "ct4p-hightpu-4t" is not supported."}"

The above exception was the direct cause of the following exception:

InvalidArgument                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:
---> 74         raise exceptions.from_grpc_error(exc) from exc
     75
     76     return error_remapped_callable

InvalidArgument: 400 Machine type "ct4p-hightpu-4t" is not supported.

Steps to Reproduce the Problem

Run this code:

# @title Deploy

# @markdown This section uploads the model to Model Registry and deploys it on the Endpoint. It takes 15 minutes to 1 hour to finish.

# @markdown Hex-LLM is a High-Efficiency Large Language Model (LLM) TPU serving solution built with XLA, which is being developed by Google Cloud. This notebook uses TPU v5e machines. Click "Show code" to see more details.

if LOAD_MODEL_FROM != "Kaggle":
    print("Skipped: Expect to load model from Kaggle, got", LOAD_MODEL_FROM)
else:
    if "2b" in KAGGLE_MODEL_ID:
        # Sets ct5lp-hightpu-1t (1 TPU chip) to deploy Gemma 2B models.
        machine_type = "ct5lp-hightpu-1t"
    else:
        # Sets ct5lp-hightpu-4t (4 TPU chips) to deploy Gemma 7B models.
        machine_type = "ct4p-hightpu-4t"

    # Note that a larger max_num_batched_tokens will require more TPU memory.
    max_num_batched_tokens = 11264
    # Multiple of tokens for padding alignment. A higher value can reduce
    # re-compilation but can also increase the waste in computation.
    tokens_pad_multiple = 1024
    # Multiple of sequences for padding alignment. A higher value can reduce
    # re-compilation but can also increase the waste in computation.
    seqs_pad_multiple = 32

    print("Using model from: ", output_folder)
    model, endpoint = deploy_model_hexllm(
        model_name=get_job_name_with_datetime(prefix="gemma-serve-hexllm"),
        base_model_id=f"google/{KAGGLE_MODEL_ID}",
        model_id=output_folder,
        service_account=SERVICE_ACCOUNT,
        machine_type=machine_type,
        max_num_batched_tokens=max_num_batched_tokens,
        tokens_pad_multiple=tokens_pad_multiple,
        seqs_pad_multiple=seqs_pad_multiple,
    )
    print("endpoint_name:", endpoint.name)

Specifications

  • Version:
  • Platform: colab enterprise
@gericdong
Contributor

@KCFindstr: can you please take a look at this issue? Thanks.

@KCFindstr
Contributor

From the original notebook, the correct machine type is ct5lp-hightpu-4t - You might have accidentally modified the machine type. Please let me know if ct5lp-hightpu-4t works.
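
The mix-up above can be sketched as a small helper. This is only an illustration, not the notebook's actual code: the function name is hypothetical, and the model-size check mirrors the `"2b" in KAGGLE_MODEL_ID` test from the reproduction code.

```python
def select_machine_type(kaggle_model_id: str) -> str:
    """Pick a TPU v5e machine type for the Gemma model size (hypothetical helper)."""
    if "2b" in kaggle_model_id:
        # One TPU v5e chip is enough for Gemma 2B models.
        return "ct5lp-hightpu-1t"
    # Four TPU v5e chips for Gemma 7B models. Note the ct5lp (v5e) prefix:
    # "ct4p-hightpu-4t" is the typo that triggers the INVALID_ARGUMENT error above.
    return "ct5lp-hightpu-4t"
```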

@ariji1
Author

ariji1 commented Mar 12, 2024

I had originally tried with ct5lp-hightpu-4t; it still gives the same error, and ct5lp-hightpu-1t gives the same error as well. I have also raised a support request, but there is no issue with permissions.

@ariji1
Author

ariji1 commented Mar 12, 2024

[screenshot of the deployment error]
Still getting the same error

@KCFindstr
Contributor

Hi @kathyyu-google , would you please take a look at this hex-llm deployment failure?

@kathyyu-google
Collaborator

Based on the endpoint ID from the logs (projects/81995035742/locations/us-central1/endpoints/6941658909824253952), this endpoint was created in the us-central1 region. TPU deployment is supported only in the us-west1 region. Please update the variable REGION and re-attempt the deployment. Please also verify that there is available TPU quota (see the "Request for TPU quota" section of the notebook for more details).

ct5lp-hightpu-*t is the expected machine type here, such as ct5lp-hightpu-1t and ct5lp-hightpu-4t.
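
Both failure modes described in this thread (wrong region and wrong machine-type prefix) could be caught before the gRPC call with a small pre-flight check. This is only a sketch based on the constraints stated above; the function name and the assumption that us-west1 is the only valid region come from this conversation, not from an official API.

```python
# Constraints taken from this thread: TPU deployment here is limited to
# us-west1, and only ct5lp-hightpu-*t machine types are accepted.
SUPPORTED_TPU_REGION = "us-west1"

def check_tpu_deploy_config(region: str, machine_type: str) -> None:
    """Raise early with a clear message instead of a gRPC INVALID_ARGUMENT."""
    if region != SUPPORTED_TPU_REGION:
        raise ValueError(
            f"TPU deployment is only supported in {SUPPORTED_TPU_REGION}, "
            f"got {region!r}; update REGION and re-create the endpoint."
        )
    if not machine_type.startswith("ct5lp-hightpu-"):
        raise ValueError(
            f"Expected a ct5lp-hightpu-*t machine type, got {machine_type!r}"
        )
```

Calling this right before `deploy_model_hexllm(...)` would have surfaced both the us-central1 endpoint and the ct4p typo as readable errors.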
