Model cannot be loaded in the SageMaker endpoint after update of SageMaker SDK to 2.212 #4488

Open
Neptun332 opened this issue Mar 8, 2024 · 4 comments

Describe the bug
The model cannot be loaded in the SageMaker endpoint after the SageMaker SDK is updated to 2.212.

To reproduce

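from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.serve.mode.function_pointers import Mode

# model_path, sample_input, sample_output, role_arn and image are defined elsewhere.
# InputTranslator and InferenceSpec are presumably the reporter's own classes.
model_builder = ModelBuilder(
    model_path=model_path,
    schema_builder=SchemaBuilder(sample_input, sample_output, input_translator=InputTranslator()),
    content_type='application/x-image',
    mode=Mode.SAGEMAKER_ENDPOINT,
    role_arn=role_arn,
    image_uri=image,
    inference_spec=InferenceSpec(),
)
built_model = model_builder.build()
built_model.deploy(
    instance_type="ml.c6i.2xlarge",
    endpoint_name="my_endpoint_name",
    initial_instance_count=1,
)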

Expected behavior

  • By default, ModelBuilder should pin the SageMaker SDK version in the generated requirements.txt with == rather than >=
  • The model can be loaded with SageMaker SDK 2.212

Screenshots or logs

2024-03-07T10:23:04.572+01:00	Model server started.
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,338 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=64
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,341 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,351 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.10/site-packages/ts/configs/metrics.yaml.
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,351 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]64
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,352 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,352 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.10.9
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,357 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,366 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-03-07T10:23:04.572+01:00	2024-03-07T09:23:04,371 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1709803384370
2024-03-07T10:23:05.324+01:00	2024-03-07T09:23:04,409 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-03-07T10:23:05.324+01:00	2024-03-07T09:23:05,201 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,202 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died.
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,553 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in <module>
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,554 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 184, in handle_connection
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,555 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,555 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,556 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-03-07T10:23:05.575+01:00	2024-03-07T09:23:05,556 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
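
The excerpt above stops at the model_loader.load( frame, so the exception that actually killed the worker is not visible. A minimal sketch for pulling the worker's own output out of the full log is shown here; it assumes every relevant line carries the "MODEL_LOG - " prefix seen above, and the helper name model_log_messages is hypothetical, not part of TorchServe or the SageMaker SDK.

```python
import re

# Captures whatever follows the "MODEL_LOG - " marker on a log line.
# Assumption: worker stdout lines always use this marker, as in the excerpt.
_MODEL_LOG = re.compile(r"MODEL_LOG - (?P<message>.*)$")


def model_log_messages(log_text: str) -> list:
    """Return the message part of every MODEL_LOG line, in order."""
    messages = []
    for line in log_text.splitlines():
        match = _MODEL_LOG.search(line)
        if match:
            messages.append(match.group("message"))
    return messages
```

Running this over the complete CloudWatch stream and printing the messages from "Traceback (most recent call last):" onward should show the final exception line that the excerpt omits.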

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.212
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: -
  • Python version: 3.10
  • CPU or GPU: GPU and CPU
  • Custom Docker image (Y/N): N

Additional context
The SageMaker endpoint had been working for a while and processing requests successfully. After the endpoint restarted, it installed the latest version of the SageMaker SDK (2.212), stopped processing requests, and printed the logs above. I noticed that ModelBuilder creates a model package with a requirements.txt that contains sagemaker>=2.199. I changed that line to sagemaker==2.199, which solved the issue.
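
The manual edit described above can be sketched as a small script. This is only an illustration of the workaround, assuming the requirements file lists the SDK as a plain sagemaker>=<version> line; pin_sagemaker is a hypothetical helper, not something ModelBuilder provides.

```python
import re

# Matches a "sagemaker>=X" requirement (optionally with extras such as
# "sagemaker[huggingface]") at the start of a line. Other packages whose
# names merely begin with "sagemaker" are left alone.
_LOWER_BOUND = re.compile(
    r"^(?P<name>\s*sagemaker(?:\[[^\]]*\])?\s*)>=(?P<version>\s*[\w.]+)"
)


def pin_sagemaker(requirements_text: str) -> str:
    """Turn the sagemaker lower bound into an exact pin; keep all other lines."""
    lines = requirements_text.splitlines(keepends=True)
    return "".join(
        _LOWER_BOUND.sub(r"\g<name>==\g<version>", line) for line in lines
    )
```

With an exact pin, a container restart reinstalls the same SDK version instead of picking up whichever release is newest at that moment.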

Collaborator

samruds commented Mar 13, 2024

Taking a look. Will reproduce the error locally today.

samruds self-assigned this Mar 13, 2024

Collaborator

samruds commented Mar 30, 2024

Hello, we have identified and fixed the problem. It was caused by an extra dependency that was added to ModelBuilder.

If you are still seeing an issue with this version, please pull in the latest commit of the SDK, specifically #4549.

Collaborator

samruds commented Mar 30, 2024

Short-term mitigations are:

  1. Pass a custom dependency:

model_builder = ModelBuilder(
    mode=Mode.LOCAL_CONTAINER,  # for local testing; use Mode.SAGEMAKER_ENDPOINT to deploy to an endpoint
    model_path=resnet_model_dir,
    inference_spec=my_inference_spec,
    schema_builder=my_schema,
    role_arn=execution_role,
    dependencies={
        "custom": [
            "accelerate==0.24.1",
        ],
    },
)

  2. If you are using the ModelBuilder interface, install the huggingface extras, which bring in accelerate:

!pip install --force-reinstall --no-cache-dir --quiet "sagemaker[huggingface]>=2.212.0"
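
Both mitigations amount to making sure accelerate is installed where ModelBuilder runs. A quick way to check an environment before and after applying them might look like the sketch below; it uses only the standard library, and missing_distributions is a hypothetical helper name rather than an SDK function.

```python
from importlib import metadata


def missing_distributions(names):
    """Return the subset of distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            # Raises PackageNotFoundError when the distribution is absent.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: accelerate is the dependency in question, as the
    # mitigations above suggest.
    print(missing_distributions(["sagemaker", "accelerate"]))
```

An empty list means both packages are present in the current interpreter's environment; it says nothing about what is installed inside the endpoint container.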

Collaborator

samruds commented Mar 30, 2024

I will sync with the SDK team on Monday on next steps for working with the customer.
