
Issue with deploying via Hex-LLM, the TPU serving solution built with XLA that is being developed by Google Cloud #2768

Open
ariji1 opened this issue Mar 8, 2024 · 6 comments


ariji1 commented Mar 8, 2024

Expected Behavior

The model deploys successfully.

Actual Behavior

I am getting this error:

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/81995035742/locations/us-central1/endpoints/6941658909824253952/operations/4744238776585289728
Using model from: gs://19865_finetuned_models/gemma-keras-lora-train_20240308_200536
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/81995035742/locations/us-central1/endpoints/6941658909824253952
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/81995035742/locations/us-central1/endpoints/6941658909824253952')
INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/81995035742/locations/us-central1/models/6359646723312189440/operations/7818789947195785216
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/81995035742/locations/us-central1/models/6359646723312189440@1
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/81995035742/locations/us-central1/models/6359646723312189440@1')
INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/81995035742/locations/us-central1/endpoints/6941658909824253952

_InactiveRpcError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     71     try:
---> 72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:

11 frames
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Machine type "ct4p-hightpu-4t" is not supported."
debug_error_string = "UNKNOWN:Error received from peer ipv4:173.194.196.95:443 {created_time:"2024-03-08T21:11:42.821027279+00:00", grpc_status:3, grpc_message:"Machine type "ct4p-hightpu-4t" is not supported."}"

The above exception was the direct cause of the following exception:

InvalidArgument                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:
---> 74         raise exceptions.from_grpc_error(exc) from exc
     75
     76     return error_remapped_callable

InvalidArgument: 400 Machine type "ct4p-hightpu-4t" is not supported.

Steps to Reproduce the Problem

Run this code:

# @title Deploy

# @markdown This section uploads the model to Model Registry and deploys it on the Endpoint. It takes 15 minutes to 1 hour to finish.

# @markdown Hex-LLM is a High-Efficiency Large Language Model (LLM) TPU serving solution built with XLA, which is being developed by Google Cloud. This notebook uses TPU v5e machines. Click "Show code" to see more details.

if LOAD_MODEL_FROM != "Kaggle":
    print("Skipped: Expect to load model from Kaggle, got", LOAD_MODEL_FROM)
else:
    if "2b" in KAGGLE_MODEL_ID:
        # Sets ct5lp-hightpu-1t (1 TPU chip) to deploy Gemma 2B models.
        machine_type = "ct5lp-hightpu-1t"
    else:
        # Sets ct5lp-hightpu-4t (4 TPU chips) to deploy Gemma 7B models.
        machine_type = "ct4p-hightpu-4t"

    # Note that a larger max_num_batched_tokens will require more TPU memory.
    max_num_batched_tokens = 11264
    # Multiple of tokens for padding alignment. A higher value can reduce
    # re-compilation but can also increase the waste in computation.
    tokens_pad_multiple = 1024
    # Multiple of sequences for padding alignment. A higher value can reduce
    # re-compilation but can also increase the waste in computation.
    seqs_pad_multiple = 32

    print("Using model from: ", output_folder)
    model, endpoint = deploy_model_hexllm(
        model_name=get_job_name_with_datetime(prefix="gemma-serve-hexllm"),
        base_model_id=f"google/{KAGGLE_MODEL_ID}",
        model_id=output_folder,
        service_account=SERVICE_ACCOUNT,
        machine_type=machine_type,
        max_num_batched_tokens=max_num_batched_tokens,
        tokens_pad_multiple=tokens_pad_multiple,
        seqs_pad_multiple=seqs_pad_multiple,
    )
    print("endpoint_name:", endpoint.name)

Specifications

  • Version:
  • Platform: colab enterprise
@gericdong
Contributor

@KCFindstr: can you please take a look at this issue? Thanks.

@KCFindstr
Contributor

From the original notebook, the correct machine type is ct5lp-hightpu-4t - You might have accidentally modified the machine type. Please let me know if ct5lp-hightpu-4t works.
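
The mix-up above can be sketched as a small helper. This is only an illustration, not the notebook's actual code: the function name is hypothetical, and the model-size check mirrors the `"2b" in KAGGLE_MODEL_ID` test from the reproduction code.

```python
def select_machine_type(kaggle_model_id: str) -> str:
    """Pick a TPU v5e machine type for the Gemma model size (hypothetical helper)."""
    if "2b" in kaggle_model_id:
        # One TPU v5e chip is enough for Gemma 2B models.
        return "ct5lp-hightpu-1t"
    # Four TPU v5e chips for Gemma 7B models. Note the ct5lp (v5e) prefix:
    # "ct4p-hightpu-4t" is the typo that triggers the INVALID_ARGUMENT error above.
    return "ct5lp-hightpu-4t"
```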

@ariji1
Author

ariji1 commented Mar 12, 2024

I had originally tried with ct5lp-hightpu-4t; it still gives the same error, and ct5lp-hightpu-1t gives the same error as well. I have also raised a support request, but there is no issue with permissions.

@ariji1
Author

ariji1 commented Mar 12, 2024

[screenshot of the deployment error]
Still getting the same error

@KCFindstr
Contributor

Hi @kathyyu-google , would you please take a look at this hex-llm deployment failure?

@kathyyu-google
Collaborator

Based on the endpoint ID from the logs (projects/81995035742/locations/us-central1/endpoints/6941658909824253952), this endpoint was created in the us-central1 region. TPU deployment is supported only in the us-west1 region. Please update the variable REGION and re-attempt the deployment. Please also verify that there is available TPU quota (see the "Request for TPU quota" section of the notebook for more details).

ct5lp-hightpu-*t is the expected machine type here, such as ct5lp-hightpu-1t and ct5lp-hightpu-4t.
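
Both failure modes described in this thread (wrong region and wrong machine-type prefix) could be caught before the gRPC call with a small pre-flight check. This is only a sketch based on the constraints stated above; the function name and the assumption that us-west1 is the only valid region come from this conversation, not from an official API.

```python
# Constraints taken from this thread: TPU deployment here is limited to
# us-west1, and only ct5lp-hightpu-*t machine types are accepted.
SUPPORTED_TPU_REGION = "us-west1"

def check_tpu_deploy_config(region: str, machine_type: str) -> None:
    """Raise early with a clear message instead of a gRPC INVALID_ARGUMENT."""
    if region != SUPPORTED_TPU_REGION:
        raise ValueError(
            f"TPU deployment is only supported in {SUPPORTED_TPU_REGION}, "
            f"got {region!r}; update REGION and re-create the endpoint."
        )
    if not machine_type.startswith("ct5lp-hightpu-"):
        raise ValueError(
            f"Expected a ct5lp-hightpu-*t machine type, got {machine_type!r}"
        )
```

Calling this right before `deploy_model_hexllm(...)` would have surfaced both the us-central1 endpoint and the ct4p typo as readable errors.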
