Describe the feature you'd like
To promote better end-user visibility of costs and organizational cost management, it would be useful if the SM Python SDK could (when configured by an administrator):
Estimate the maximum possible compute cost for training/processing jobs (with reference to the max_run stopping condition)
Estimate the maximum per-hour compute cost for real-time inference endpoints (or perhaps the cost at the initial_instance_count, if tracing the auto-scaling config is too hard)
Give an informational message including the projected/max cost
Optionally reject the operation if the projection exceeds an organizationally configured threshold
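As a rough illustration of the projection described above, here is a minimal sketch. It assumes the SDK has access to a per-instance-hour price table; the `HOURLY_PRICE_USD` mapping, its values, and both function names are hypothetical and not part of the SageMaker Python SDK.

```python
# Hypothetical price table (USD per instance-hour). The values are
# placeholders for illustration only, not real SageMaker pricing.
HOURLY_PRICE_USD = {
    "ml.m5.xlarge": 0.25,
    "ml.g4dn.xlarge": 0.75,
}


def max_job_compute_cost(instance_type, instance_count, max_run_seconds,
                         prices=HOURLY_PRICE_USD):
    """Upper bound on compute charges for a training/processing job.

    The job cannot run longer than its max_run stopping condition, so
    price * instance count * (max_run in hours) bounds the compute cost.
    """
    return prices[instance_type] * instance_count * max_run_seconds / 3600


def endpoint_hourly_compute_cost(instance_type, initial_instance_count,
                                 prices=HOURLY_PRICE_USD):
    """Per-hour compute cost of an endpoint at its initial instance count.

    Auto-scaling is deliberately ignored here.
    """
    return prices[instance_type] * initial_instance_count
```

For example, `max_job_compute_cost("ml.m5.xlarge", 2, 7200)` evaluates to 1.0 with the placeholder prices above (two instances for at most two hours at $0.25 per hour).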
How would this feature be used? Please describe.
An administrator would enable max-cost projection and optionally configure hard limits through the SDK intelligent defaults YAML config file
We could expose the options as arguments to e.g. Estimator/Predictor/etc as well, but I see minimal value unless an org can turn them on by default for their team.
I acknowledge the messaging (especially around training job max run time) could be complex and confusing for new users, so I would not suggest enabling either logging or limits for all SDK users by default; just offer it as a configurable feature
When creating a job or endpoint through the normal SDK methods, the data scientist would be notified of the projected (max) costs for the actions via log messages, and the action would fail with an error if the pre-configured threshold is exceeded
For example something like:
[INFO] With the configured max_run = 3600 seconds, this training job could
generate up to $2.12 in compute instance charges.
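The notify-then-reject behaviour could look roughly like the sketch below. It assumes a projected cost has already been computed and that an administrator-configured limit is available; `check_projected_cost`, `CostLimitExceededError`, and the `limit_usd` parameter are hypothetical names, not existing SDK APIs.

```python
import logging

logger = logging.getLogger("cost_projection_sketch")


class CostLimitExceededError(RuntimeError):
    """Hypothetical error raised when a projection exceeds the org limit."""


def check_projected_cost(projected_usd, max_run_seconds, limit_usd=None):
    """Log the projected max cost and optionally enforce a hard limit.

    limit_usd=None means "inform only": the message is logged and the
    operation is allowed to proceed.
    """
    message = (
        f"With the configured max_run = {max_run_seconds} seconds, this "
        f"training job could generate up to ${projected_usd:.2f} in "
        "compute instance charges."
    )
    logger.info(message)
    if limit_usd is not None and projected_usd > limit_usd:
        raise CostLimitExceededError(
            f"Projected max compute cost ${projected_usd:.2f} exceeds the "
            f"configured limit of ${limit_usd:.2f}."
        )
    return message
```

Raising a dedicated exception with the projected and permitted amounts in the text is one way to give users a clearer explanation than a generic permissions failure.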
The messaging would need to be carefully chosen to avoid confusion, because:
A total cost estimate would involve a range of other factors like configured job EBS size (known up-front) and S3 data access patterns / data transfer fees (impossible to know).
Enabling SageMaker Managed Spot could offer some (unknowable?) discount over the on-demand price
Describe alternatives you've considered
I appreciate that it's already possible to restrict IAM CreateTrainingJob & CreateProcessingJob permissions by both sagemaker:InstanceTypes and sagemaker:MaxRuntimeInSeconds conditions, for strictly-enforceable controls on this... But the resulting AccessDenied errors are challenging for end users to understand and don't foster cost-awareness beyond enforcing the hard limits.
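For reference, an IAM-based restriction of this kind might be assembled as in the following sketch. The two condition keys are the ones named above; the choice of condition operators, the `Resource` value, and the `build_policy` helper are assumptions for illustration and should be checked against the IAM documentation before use.

```python
import json


def build_policy(allowed_instance_types, max_runtime_seconds):
    """Build an illustrative IAM policy document as a Python dict.

    Assumption: ForAllValues:StringLike and NumericLessThanEquals are
    suitable operators for these condition keys in your setup.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:CreateProcessingJob",
                ],
                "Resource": "*",
                "Condition": {
                    "ForAllValues:StringLike": {
                        "sagemaker:InstanceTypes": list(allowed_instance_types)
                    },
                    "NumericLessThanEquals": {
                        "sagemaker:MaxRuntimeInSeconds": max_runtime_seconds
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(["ml.m5.*"], 3600), indent=2))
```

A request outside these bounds is simply denied, which is where the unhelpful AccessDenied message comes from.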
Any additional context
This was raised today by a customer of mine who is considering implementing similar functionality in their own internal Python utility for SageMaker, but it is not immediately clear how their internal SDK and the SageMaker Python SDK could integrate together for an effective user workflow.