Cannot use PyTorchProcessor with SageMaker Pipeline object on a GPU instance #4574
Comments
Hi @DataPsycho, thanks for reaching out to SageMaker. You mentioned that the workflow does not even start; is there an error message here, or some other details that could help?
Hi, I dug a bit deeper into the problem and found out what is actually happening under the hood. It is certainly not what I thought, but there is a need for improvement of the Pipeline object so that it returns the correct error message.

What happened?

```python
from sagemaker.processing import ProcessingInput

sklearn_processor.run(
    source_dir="src",
    code="main.py",
    inputs=[
        ProcessingInput(source=code_s3, destination="/opt/ml/processing/input/code"),
        ProcessingInput(source=reqs_s3, destination="/opt/ml/processing/input/requirements"),
        ProcessingInput(source=artifact_s3, destination="/opt/ml/processing/input/artifacts"),
    ],
)
```

which throws the error:
Because the job never even started, there is no log anywhere saying that this is the problem: there is no CloudWatch log, since the job never ran. But when running with the pipeline, the pipeline creates an execution ARN as follows:
But there is no trace of this execution ARN in the SageMaker processing job list or in CloudWatch, which makes it really difficult to debug the issue. A few suggestions on the documentation:
Hi, let me know: should I close this ticket? @mufaddal-rohawala
Hi @DataPsycho, thanks for providing the further info! IIUC, it seems that when running the pipeline, it failed without properly surfacing the create-job error (ResourceLimitExceeded) to you. That's odd: in normal cases, the pipeline should surface the error thrown when creating a downstream job. Would you mind sharing your pipeline execution ARN so I can check our service log to see what was going on?
Hi, I am not sure I am allowed to share an ARN containing my org's account number. By the way, since the resource limit has been increased, I am no longer able to reproduce this error. You could probably impose a resource limit on your account and try to reproduce the issue that way. I can provide the generic code if required.
Hi @DataPsycho, I see. After a second thought, and revisiting your description below, I get your point.
I guess what you meant is that the SageMaker Python SDK did not surface the issue for you. If so, then correct, this is a feature gap. Rather, users can check the SageMaker Studio UI to inspect the execution result on the Pipelines page. Or users can also call the
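The exact API call is cut off above, so the sketch below only assumes that *some* describe-like call returns the execution status; the toy `describe` callable and the status values mirror what a pipeline-execution description typically contains, and are stand-ins rather than the real SDK.

```python
import time

TERMINAL = {"Succeeded", "Failed", "Stopped"}

def wait_for_execution(describe, poll_seconds=0, max_polls=10):
    """Poll a describe() callable until the execution reaches a terminal
    status, then return the final description dict.

    `describe` stands in for whatever API returns the execution status;
    the real call name is not shown in the thread above.
    """
    for _ in range(max_polls):
        desc = describe()
        if desc["PipelineExecutionStatus"] in TERMINAL:
            return desc
        time.sleep(poll_seconds)
    raise TimeoutError("execution did not finish within the polling budget")

# Toy describe() that fails on the second poll, e.g. because the
# downstream job creation hit ResourceLimitExceeded.
_responses = iter([
    {"PipelineExecutionStatus": "Executing"},
    {"PipelineExecutionStatus": "Failed",
     "FailureReason": "ResourceLimitExceeded: ml.p3.2xlarge"},
])
final = wait_for_execution(lambda: next(_responses))
print(final["FailureReason"])
```

Checking the final description's failure reason this way is the programmatic equivalent of inspecting the execution in the Studio Pipelines page.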
Hi, thanks for the explanation. I will keep an eye on this and will be happy to help with any further info.
What did you find confusing? Please describe.
Hi, I was following the documentation to create a PyTorch-based processing pipeline in which I load a model from Hugging Face and process a bunch of text (a batch job). For testing purposes I took the Hugging Face embedding model sentence-transformers/all-MiniLM-L6-v2 and put it in S3 at the following location:
```shell
aws s3 cp artifacts s3://$bucket/$s3_prefix/artifacts
```
which will be downloaded into the container at /opt/ml/processing/input/artifacts.
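Each processing input is essentially a pairing of an S3 URI with a container path like the one above. A minimal pure-Python sketch of that mapping, shaped like the Source/Destination dicts a processing job consumes (the bucket and prefix names here are placeholders):

```python
def build_input_channels(pairs):
    """Turn (s3_uri, container_path) pairs into input-channel dicts,
    validating the shapes a processing job expects."""
    channels = []
    for s3_uri, container_path in pairs:
        if not s3_uri.startswith("s3://"):
            raise ValueError(f"not an S3 URI: {s3_uri!r}")
        if not container_path.startswith("/opt/ml/processing/"):
            raise ValueError(f"unexpected container path: {container_path!r}")
        channels.append({"Source": s3_uri, "Destination": container_path})
    return channels

# Placeholder bucket/prefix standing in for s3://$bucket/$s3_prefix above.
channels = build_input_channels([
    ("s3://my-bucket/my-prefix/artifacts", "/opt/ml/processing/input/artifacts"),
])
print(channels[0]["Destination"])
```

Validating the container path up front catches typos before the job is submitted, which matters here because a job that never starts leaves no logs to debug from.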
Now I have created some placeholder variables:
Then I defined a preprocessor:
Ignore the sklearn naming; I was experimenting in between.
Then I created a processing step:
Finally, I created the Pipeline and tried to run it:
When I use the instance type ml.m5.xlarge, this code runs without any issue. But when I switch to a GPU-based instance:

```python
processing_instance_type = ParameterString(name="ProcessingInstanceType", default_value="ml.p3.2xlarge")
```

the code does not even start.

Describe how documentation can be improved
A proper documentation update is required on how to use the Pipeline object from sagemaker.workflow.pipeline along with PyTorchProcessor on a GPU instance. It is fine if a PyTorch instance could be used instead to do this kind of pipelining when it is required to run a batch job.

Additional context
N/A