
Cannot use PyTorchProcessor with SageMaker Pipeline Object on a GPU Instance #4574

Open
DataPsycho opened this issue Apr 11, 2024 · 7 comments


DataPsycho commented Apr 11, 2024

What did you find confusing? Please describe.
Hi, I was following the documentation to create a PyTorch-based processing pipeline in which I load a model from Hugging Face and process a batch of text (a batch job). For testing purposes I took the Hugging Face embedding model sentence-transformers/all-MiniLM-L6-v2 and uploaded it to S3 with aws s3 cp artifacts s3://$bucket/$s3_prefix/artifacts, from where it is downloaded into the container at /opt/ml/processing/input/artifacts.

Now I have created some placeholder variables:

from sagemaker.workflow.parameters import ParameterInteger, ParameterString
import os
import boto3
import sagemaker

sess = sagemaker.session.Session()
bucket = sess.default_bucket() 
region = boto3.Session().region_name
s3_prefix = 'scheduled-processing'
rawdata_s3_prefix = '{}/data/raw'.format(s3_prefix)
raw_s3 = sess.upload_data(path='./data/raw/', key_prefix=rawdata_s3_prefix)
print(raw_s3)

## PARAMETERS
# processed_s3 = f"s3://{bucket}/{s3_prefix}/data/processed"
code_s3 = f"s3://{bucket}/{s3_prefix}/code"
artifact_s3 = f"s3://{bucket}/{s3_prefix}/artifacts"
reqs_s3 = f"s3://{bucket}/{s3_prefix}/requirements"

# Path to S3
# input_data = ParameterString(name="InputData", default_value=raw_s3)
input_code = ParameterString(name="InputCode", default_value=code_s3)
input_artifcts = ParameterString(name="InputArtifacts", default_value=artifact_s3)
input_reqs = ParameterString(name="InputReq", default_value=reqs_s3)

# processing_instance_type = ParameterString(name="ProcessingInstanceType", default_value="ml.p3.2xlarge")
processing_instance_type = ParameterString(name="ProcessingInstanceType", default_value="ml.m5.xlarge")
processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)

Then I defined a preprocessor:

import boto3
import sagemaker
# from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.pytorch.processing import PyTorchProcessor

role = sagemaker.get_execution_role()
framework_version = "2.1"
# framework_version = "1.2-1"

sklearn_processor = PyTorchProcessor(
    framework_version=framework_version,
    py_version="py310",
    instance_type=processing_instance_type, #ml.m5.xlarge
    instance_count=processing_instance_count, #1
    base_job_name="scheduled-processing",
    sagemaker_session=sess,
    role=role
)

Ignore the sklearn naming; I was experimenting in between.

Then I created a processing step:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

## PROCESSING STEP
# code_entrypoint holds the path to the entry-point script; with the Pipeline
# object this has to be a shell wrapper (e.g. "main.sh"), as discussed below.
step_process = ProcessingStep(
    name="ScheduledProcessing",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_code, destination="/opt/ml/processing/input/code", s3_data_distribution_type='ShardedByS3Key'),
        ProcessingInput(source=input_reqs, destination="/opt/ml/processing/input/requirements", s3_data_distribution_type='ShardedByS3Key'),
        ProcessingInput(source=input_artifcts, destination="/opt/ml/processing/input/artifacts", s3_data_distribution_type='ShardedByS3Key'),
    ],
    code=code_entrypoint
)

Finally I created the Pipeline and tried to run it:

import json
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "ScheduledProcessingPipeline"

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_reqs,
        input_artifcts,
        input_code,
        processing_instance_type,
        processing_instance_count,
    ],
    steps=[
        step_process
    ],
    sagemaker_session=sess
)


definition = json.loads(pipeline.definition())
print(definition)
pipeline.upsert(role_arn=role)
execution = pipeline.start()

When I use the instance type ml.m5.xlarge, this code runs without any issue. But when I switch to a GPU-based instance, processing_instance_type = ParameterString(name="ProcessingInstanceType", default_value="ml.p3.2xlarge"), the code does not even start.
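For reference, the same switch can also be expressed as a start-time parameter override instead of editing the default (a sketch, reusing the pipeline and parameter name defined above):

# Sketch: override the ProcessingInstanceType parameter per execution,
# rather than changing its default_value.
execution = pipeline.start(
    parameters={"ProcessingInstanceType": "ml.p3.2xlarge"}
)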

Describe how documentation can be improved
The documentation should be updated to explain how to use the Pipeline object from sagemaker.workflow.pipeline together with PyTorchProcessor on a GPU instance. It would also be fine if a PyTorch instance could be used instead for this kind of pipelining when a batch job has to run.

Additional context
N/A

mufaddal-rohawala (Member) commented

Hi @DataPsycho, thanks for reaching out to SageMaker. You mentioned that the workflow does not even start; is there an error message, or any other details that can help?

mufaddal-rohawala added the labels component: pipelines (Relates to the SageMaker Pipeline Platform) and bug on Apr 12, 2024
DataPsycho (Author) commented

Hi, I dug a bit deeper into the problem and found out what is happening under the hood. It is certainly not what I thought, but the Pipeline object needs improvement so that it returns the correct error message.

What happened?
There is a restriction on the service account: it has no quota to run processing jobs on GPU instances, so that is the main issue and, I guess, why I cannot use a GPU instance. But I only got this error when running the processing job in synchronous mode with the run method, as follows:

from sagemaker.processing import ProcessingInput

sklearn_processor.run(
    source_dir="src",
    code="main.py",
    inputs=[
        ProcessingInput(source=code_s3, destination="/opt/ml/processing/input/code"),
        ProcessingInput(source=reqs_s3, destination="/opt/ml/processing/input/requirements"),
        ProcessingInput(source=artifact_s3, destination="/opt/ml/processing/input/artifacts"),
    ]
)

Which throws the error:

An error occurred (ResourceLimitExceeded) when calling the CreateProcessingJob operation: The account-level service limit 'ml.g4dn.2xlarge for processing job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.
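As the message says, this is an AWS Service Quotas limit on the account. For reference, a hedged boto3 sketch of how one might locate the quota named in the error and file an increase request (the quota name is copied from the error text and may differ by region):

import boto3

sq = boto3.client("service-quotas")
target = "ml.g4dn.2xlarge for processing job usage"  # name taken from the error text

# Page through the SageMaker quotas to find the matching QuotaCode.
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] == target:
            print(quota["QuotaCode"], "current value:", quota["Value"])
            # Uncomment to actually file the increase request:
            # sq.request_service_quota_increase(
            #     ServiceCode="sagemaker",
            #     QuotaCode=quota["QuotaCode"],
            #     DesiredValue=1.0,
            # )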

Because the job never even started, there is no log anywhere saying that this is the problem; there is no CloudWatch log because the job never started. But when running through the pipeline, the pipeline creates an ARN object as follows:

_PipelineExecution(arn='arn:aws:sagemaker:<region>:<account>:pipeline/ScheduledProcessingPipeline/execution/3tpobd1xk9l9', sagemaker_session=<sagemaker.session.Session object at 0x7f175c333be0>)

But there is no trace of this execution ARN in the SageMaker processing job list or in CloudWatch, which makes it really difficult to debug the issue.

A few suggestions on the documentation:
There is no complete example of using PyTorchProcessor or HuggingFaceProcessor for batch processing. My use case was to run Mistral/Gemma locally and do some processing on certain documents, for which I do not want to use Bedrock or OpenAI because of org restrictions. The code with 4-bit quantization of Mistral certainly works, but for many days I had to fight with the pipeline and the job by trial and error to get the code running.

  • The documentation should be improved to showcase at least one example per case of how to use PyTorchProcessor and HuggingFaceProcessor for data preprocessing.
  • During the pipeline execution I had to figure out that when using the Pipeline object the entry point is a shell file (for example main.sh) containing a Python execution instruction such as python /opt/ml/processing/main.py, whereas this is not the case when calling the run method (see the sketch after this list).
    So there is a need for clear instructions on how to use PyTorchProcessor/HuggingFaceProcessor in a Pipeline (sagemaker.workflow.pipeline.Pipeline).
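For illustration, a sketch of that distinction, reusing the processor and S3 variables defined above (the entry-point file names are examples, not from official docs):

from sagemaker.processing import ProcessingInput
from sagemaker.workflow.steps import ProcessingStep

common_inputs = [
    ProcessingInput(source=code_s3, destination="/opt/ml/processing/input/code"),
    ProcessingInput(source=reqs_s3, destination="/opt/ml/processing/input/requirements"),
    ProcessingInput(source=artifact_s3, destination="/opt/ml/processing/input/artifacts"),
]

# Pipeline case: `code` points at a shell wrapper, e.g. a main.sh that runs
#   python /opt/ml/processing/input/code/main.py
step_process = ProcessingStep(
    name="ScheduledProcessing",
    processor=sklearn_processor,
    inputs=common_inputs,
    code="main.sh",
)

# run() case: a source_dir plus a Python entry point works directly.
sklearn_processor.run(
    source_dir="src",
    code="main.py",
    inputs=common_inputs,
)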

DataPsycho (Author) commented

Hi @mufaddal-rohawala, let me know whether I should close this ticket.

qidewenwhen (Member) commented May 11, 2024

Hi @DataPsycho , thanks for providing the further info!

IIUC, it seems that when running the pipeline, it failed without properly surfacing the create-job error (ResourceLimitExceeded) to you. That's odd; in normal cases, the pipeline should surface the error thrown when creating a downstream job.

Would you mind sharing your pipeline execution ARN with us so I can check our service log to see what was going on?
The ARN provided here has the region and account information redacted, so it's hard to locate the log.

_PipelineExecution(arn='arn:aws:sagemaker:<region>:<account>:pipeline/ScheduledProcessingPipeline/execution/3tpobd1xk9l9', sagemaker_session=<sagemaker.session.Session object at 0x7f175c333be0>)

DataPsycho (Author) commented May 11, 2024

Hi, I'm not sure I'm allowed to share an ARN containing the account number, since it is my org's account. By the way, now that the resource limit has been increased, I can no longer reproduce this error. You could probably impose a resource limit on your account and try to reproduce the issue. I can provide the generic code if required.

qidewenwhen (Member) commented

Hi @DataPsycho, I see. On second thought, and after revisiting your description below, I get your point.

But when running through the pipeline, the pipeline creates an ARN object as follows:

_PipelineExecution(arn='arn:aws:sagemaker:<region>:<account>:pipeline/ScheduledProcessingPipeline/execution/3tpobd1xk9l9', sagemaker_session=<sagemaker.session.Session object at 0x7f175c333be0>)

But there is no trace of this execution ARN in the SageMaker processing job list or in CloudWatch, which makes it really difficult to debug the issue.

I guess what you meant is that the SageMaker Python SDK did not surface the issue for you. If so, then correct, this is a feature gap.
In the SageMaker Python SDK, when users start a pipeline execution, it triggers an async pipeline execution running in our service. It's not easy for users of the Python SDK to check the execution results or the failure reason, because we did not expose the _PipelineExecution to customers. We have a feature request for it: #4391

Instead, users can check the SageMaker Studio UI to inspect the execution result on the Pipelines page, or call the AWS CLI (aws sagemaker describe-pipeline-execution) to check the execution results.
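For example, a minimal boto3 sketch of that describe call (execution being the object returned by pipeline.start() above):

import boto3

sm = boto3.client("sagemaker")

# Overall execution status and, if the execution failed, its failure reason.
desc = sm.describe_pipeline_execution(PipelineExecutionArn=execution.arn)
print(desc["PipelineExecutionStatus"], desc.get("FailureReason"))

# Per-step detail, including each step's failure reason if any.
steps = sm.list_pipeline_execution_steps(PipelineExecutionArn=execution.arn)
for step in steps["PipelineExecutionSteps"]:
    print(step["StepName"], step["StepStatus"], step.get("FailureReason"))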

DataPsycho (Author) commented

Hi, thanks for the explanation. I will keep an eye on it and will be happy to help with any further info.
