
code_location parameter passed to tensorflow estimator is not all passed to create_model #4536

Open
HCharlie opened this issue Mar 25, 2024 · 0 comments

HCharlie commented Mar 25, 2024

Describe the bug
Hi, I am following this example.

I found that when running the deploy function, it asks for permission to create the default S3 bucket, even though the code_location parameter was passed to the TensorFlow estimator.

However, based on the source code, if code_location is passed when the model is initialized, it should not create a new S3 bucket; it should reuse the bucket parsed from the code_location value and store the model artifacts there.
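
As a simplified sketch of that code path (pieced together from the traceback below, not the SDK's exact source; the class and method names here are illustrative, while parse_s3_url is the real helper in sagemaker.s3):

from sagemaker.s3 import parse_s3_url

# Illustrative sketch of the model's bucket resolution, based on the
# traceback below -- not the SDK's exact source.
class FrameworkModelSketch:
    def __init__(self, code_location=None, sagemaker_session=None):
        if code_location:
            # When code_location reaches the model, its bucket is reused.
            self.bucket, self.key_prefix = parse_s3_url(code_location)
        else:
            # estimator.deploy() does not forward code_location, so the
            # model ends up with no bucket of its own ...
            self.bucket, self.key_prefix = None, None
        self.sagemaker_session = sagemaker_session

    def resolve_bucket(self):
        # ... and this fallback (tensorflow/model.py:391 in the traceback)
        # calls default_bucket(), which tries to create the
        # sagemaker-<region>-<account-id> bucket and fails with AccessDenied.
        return self.bucket or self.sagemaker_session.default_bucket()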

To reproduce
mys3bucket is an existing S3 bucket and the prefix is accessible:

import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

username = os.environ['USER']
base_job_name = f"users-{username}-tf-script-mode"
bucket = "mys3bucket"
prefix = f'data/users/{username}/tensorflow'

training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)


from sagemaker.tensorflow import TensorFlow
source_dir = 's3://{}/{}/source'.format(bucket, prefix)
output_path = 's3://{}/{}/output'.format(bucket, prefix)
print(f"{source_dir=}")
print(f"{output_path=}")
hyperparams = {
    'sagemaker_requirements': 'code/requirements.txt'
}

# code_location should make the SDK upload the repacked code to mys3bucket
# instead of the default sagemaker-<region>-<account-id> bucket
mnist_estimator = TensorFlow(
    entry_point='code/mnist.py',
    base_job_name=base_job_name,
    output_path=output_path,
    code_location=source_dir,
    hyperparameters=hyperparams,
    role=role,
    instance_count=2,
    instance_type='ml.m5.large',
    framework_version='2.1.0',
    py_version='py3',
    distribution={'parameter_server': {'enabled': True}},
)

## fit
print("start fitting")
mnist_estimator.fit(training_data_uri)

## deploy
print("start deploy")
predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')

Error message

WARNING:sagemaker.deprecations:update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker.tensorflow.model:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:531, in Session._create_s3_bucket_if_it_does_not_exist(self, bucket_name, region)
    529 try:
    530     # trying head bucket call
--> 531     s3.meta.client.head_bucket(Bucket=bucket.name)
    532 except ClientError as e:
    533     # bucket does not exist or forbidden to access

File /opt/conda/lib/python3.9/site-packages/botocore/client.py:553, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    552 # The "self" in this scope is referring to the BaseClient.
--> 553 return self._make_api_call(operation_name, kwargs)

File /opt/conda/lib/python3.9/site-packages/botocore/client.py:1009, in BaseClient._make_api_call(self, operation_name, api_params)
   1008     error_class = self.exceptions.from_code(error_code)
-> 1009     raise error_class(parsed_response, operation_name)
   1010 else:

ClientError: An error occurred (404) when calling the HeadBucket operation: Not Found

During handling of the above exception, another exception occurred:

ClientError                               Traceback (most recent call last)
Cell In[7], line 1
----> 1 predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')

File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:1509, in EstimatorBase.deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, serverless_inference_config, async_inference_config, volume_size, model_data_download_timeout, container_startup_health_check_timeout, inference_recommendation_id, explainer_config, **kwargs)
   1503 model.name = model_name
   1505 tags = update_inference_tags_with_jumpstart_training_tags(
   1506     inference_tags=tags, training_tags=self.tags
   1507 )
-> 1509 return model.deploy(
   1510     instance_type=instance_type,
   1511     initial_instance_count=initial_instance_count,
   1512     serializer=serializer,
   1513     deserializer=deserializer,
   1514     accelerator_type=accelerator_type,
   1515     endpoint_name=endpoint_name,
   1516     tags=tags or self.tags,
   1517     wait=wait,
   1518     kms_key=kms_key,
   1519     data_capture_config=data_capture_config,
   1520     serverless_inference_config=serverless_inference_config,
   1521     async_inference_config=async_inference_config,
   1522     explainer_config=explainer_config,
   1523     volume_size=volume_size,
   1524     model_data_download_timeout=model_data_download_timeout,
   1525     container_startup_health_check_timeout=container_startup_health_check_timeout,
   1526     inference_recommendation_id=inference_recommendation_id,
   1527 )

File /opt/conda/lib/python3.9/site-packages/sagemaker/tensorflow/model.py:335, in TensorFlowModel.deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, update_endpoint, async_inference_config, serverless_inference_config, volume_size, model_data_download_timeout, container_startup_health_check_timeout, inference_recommendation_id, explainer_config)
    332     msg = "The TensorFlow version %s doesn't support EIA." % self.framework_version
    333     raise AttributeError(msg)
--> 335 return super(TensorFlowModel, self).deploy(
    336     initial_instance_count=initial_instance_count,
    337     instance_type=instance_type,
    338     serializer=serializer,
    339     deserializer=deserializer,
    340     accelerator_type=accelerator_type,
    341     endpoint_name=endpoint_name,
    342     tags=tags,
    343     kms_key=kms_key,
    344     wait=wait,
    345     data_capture_config=data_capture_config,
    346     async_inference_config=async_inference_config,
    347     serverless_inference_config=serverless_inference_config,
    348     volume_size=volume_size,
    349     model_data_download_timeout=model_data_download_timeout,
    350     container_startup_health_check_timeout=container_startup_health_check_timeout,
    351     update_endpoint=update_endpoint,
    352     inference_recommendation_id=inference_recommendation_id,
    353     explainer_config=explainer_config,
    354 )

File /opt/conda/lib/python3.9/site-packages/sagemaker/model.py:1248, in Model.deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, async_inference_config, serverless_inference_config, volume_size, model_data_download_timeout, container_startup_health_check_timeout, inference_recommendation_id, explainer_config, **kwargs)
   1245     if self._base_name is not None:
   1246         self._base_name = "-".join((self._base_name, compiled_model_suffix))
-> 1248 self._create_sagemaker_model(
   1249     instance_type, accelerator_type, tags, serverless_inference_config
   1250 )
   1252 serverless_inference_config_dict = (
   1253     serverless_inference_config._to_request_dict() if is_serverless else None
   1254 )
   1255 production_variant = sagemaker.production_variant(
   1256     self.name,
   1257     instance_type,
   (...)
   1263     container_startup_health_check_timeout=container_startup_health_check_timeout,
   1264 )

File /opt/conda/lib/python3.9/site-packages/sagemaker/model.py:681, in Model._create_sagemaker_model(self, instance_type, accelerator_type, tags, serverless_inference_config)
    659 def _create_sagemaker_model(
    660     self, instance_type=None, accelerator_type=None, tags=None, serverless_inference_config=None
    661 ):
    662     """Create a SageMaker Model Entity
    663 
    664     Args:
   (...)
    679             not provided in serverless inference. So this is used to find image URIs.
    680     """
--> 681     container_def = self.prepare_container_def(
    682         instance_type,
    683         accelerator_type=accelerator_type,
    684         serverless_inference_config=serverless_inference_config,
    685     )
    687     if not isinstance(self.sagemaker_session, PipelineSession):
    688         # _base_name, model_name are not needed under PipelineSession.
    689         # the model_data may be Pipeline variable
    690         # which may break the _base_name generation
    691         self._ensure_base_name_if_needed(
    692             image_uri=container_def["Image"],
    693             script_uri=self.source_dir,
    694             model_uri=self.model_data,
    695         )

File /opt/conda/lib/python3.9/site-packages/sagemaker/tensorflow/model.py:391, in TensorFlowModel.prepare_container_def(self, instance_type, accelerator_type, serverless_inference_config)
    389 env = self._get_container_env()
    390 key_prefix = sagemaker.fw_utils.model_code_key_prefix(self.key_prefix, self.name, image_uri)
--> 391 bucket = self.bucket or self.sagemaker_session.default_bucket()
    393 if self.entry_point and not is_pipeline_variable(self.model_data):
    394     model_data = s3.s3_path_join("s3://", bucket, key_prefix, "model.tar.gz")

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:500, in Session.default_bucket(self)
    497 if not default_bucket:
    498     default_bucket = generate_default_sagemaker_bucket_name(self.boto_session)
--> 500 self._create_s3_bucket_if_it_does_not_exist(bucket_name=default_bucket, region=region)
    502 self._default_bucket = default_bucket
    504 return self._default_bucket

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:545, in Session._create_s3_bucket_if_it_does_not_exist(self, bucket_name, region)
    543         s3.create_bucket(Bucket=bucket_name)
    544     else:
--> 545         s3.create_bucket(
    546             Bucket=bucket_name,
    547             CreateBucketConfiguration={"LocationConstraint": region},
    548         )
    550     LOGGER.info("Created S3 bucket: %s", bucket_name)
    551 except ClientError as e:

File /opt/conda/lib/python3.9/site-packages/boto3/resources/factory.py:581, in ResourceFactory._create_action.<locals>.do_action(self, *args, **kwargs)
    580 def do_action(self, *args, **kwargs):
--> 581     response = action(self, *args, **kwargs)
    583     if hasattr(self, 'load'):
    584         # Clear cached data. It will be reloaded the next
    585         # time that an attribute is accessed.
    586         # TODO: Make this configurable in the future?
    587         self.meta.data = None

File /opt/conda/lib/python3.9/site-packages/boto3/resources/action.py:88, in ServiceAction.__call__(self, parent, *args, **kwargs)
     79 params.update(kwargs)
     81 logger.debug(
     82     'Calling %s:%s with %r',
     83     parent.meta.service_name,
     84     operation_name,
     85     params,
     86 )
---> 88 response = getattr(parent.meta.client, operation_name)(*args, **params)
     90 logger.debug('Response: %r', response)
     92 return self._response_handler(parent, params, response)

File /opt/conda/lib/python3.9/site-packages/botocore/client.py:553, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    549     raise TypeError(
    550         f"{py_operation_name}() only accepts keyword arguments."
    551     )
    552 # The "self" in this scope is referring to the BaseClient.
--> 553 return self._make_api_call(operation_name, kwargs)

File /opt/conda/lib/python3.9/site-packages/botocore/client.py:1009, in BaseClient._make_api_call(self, operation_name, api_params)
   1005     error_code = error_info.get("QueryErrorCode") or error_info.get(
   1006         "Code"
   1007     )
   1008     error_class = self.exceptions.from_code(error_code)
-> 1009     raise error_class(parsed_response, operation_name)
   1010 else:
   1011     return parsed_response

ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Access Denied
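
Until this is fixed, a possible workaround (an untested sketch, assuming TensorFlowModel forwards code_location to FrameworkModel, which parses the bucket from it) is to construct the model explicitly and deploy it, instead of calling estimator.deploy():

from sagemaker.tensorflow import TensorFlowModel

# Untested workaround sketch: build the model explicitly so code_location
# actually reaches it, rather than relying on estimator.deploy().
model = TensorFlowModel(
    model_data=mnist_estimator.model_data,  # artifacts from the training job
    role=role,
    entry_point='code/mnist.py',
    framework_version='2.1.0',
    code_location=source_dir,  # the model's bucket is parsed from this URI
    sagemaker_session=sagemaker_session,
)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.large')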

Expected behavior
No new S3 bucket should be created; the bucket parsed from code_location should be reused for the model artifacts.


System information

  • SageMaker Python SDK version: latest
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
  • Framework version: 2.1.0
  • Python version: 3.9
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context
A proposed solution is in #4537.
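
For illustration only, the shape such a fix could take (a hypothetical sketch, not the contents of #4537; the attribute names mirror the estimator fields used above):

def create_model_with_code_location(estimator, **kwargs):
    # Hypothetical helper: forward the estimator's code_location so the
    # model parses its bucket from it instead of calling default_bucket().
    from sagemaker.tensorflow import TensorFlowModel

    kwargs.setdefault('code_location', estimator.code_location)
    return TensorFlowModel(
        model_data=estimator.model_data,
        role=estimator.role,
        entry_point=estimator.entry_point,
        framework_version=estimator.framework_version,
        sagemaker_session=estimator.sagemaker_session,
        **kwargs,
    )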

@HCharlie HCharlie added the bug label Mar 25, 2024
@HCharlie HCharlie changed the title parameter passed to tensorflow estimator is not all passed to create_model code_location parameter passed to tensorflow estimator is not all passed to create_model May 2, 2024