Sagemaker creates separate experiment runs for training jobs #4523

Open
Omaraldarwish opened this issue Mar 21, 2024 · 0 comments
Describe the bug
Having followed virtually all available guidance on experiment tracking for a "bring your own script" training job, it seems SageMaker always creates a separate run for the training job. So for each run I end up with two runs: one that I initialize and log all metrics to, and a second 'pytorch-training--aws-training-job' run that contains the output model artifacts and the debug info.

To reproduce
Relevant excerpt from the code that initiates the training job:
```python
from sagemaker.pytorch import PyTorch
from sagemaker.experiments import Run

experiment_name = 'test_experiment'
run_name = 'test_run'

with Run(experiment_name=experiment_name, run_name=run_name) as run:
    est = PyTorch(
        entry_point="./job.py",
        role=role,
        model_dir=False,
        framework_version="2.2",
        py_version="py310",
        instance_type="ml.g5.12xlarge",
        instance_count=1,
        hyperparameters=hyperparameters,
    )
    est.fit()
```

Relevant excerpt from job.py:
```python
if __name__ == "__main__":
    import boto3
    from sagemaker.session import Session
    from sagemaker.experiments.run import load_run

    session = Session(boto3.session.Session(region_name='us-west-2'))
    with load_run(sagemaker_session=session) as run:
        # Log all parameters
        run.log_parameters({k: str(v) for k, v in vars(args).items()})
        run.log_parameter('job_name', str(job_name))

        execute(args, run)
```

Expected behavior
The experiment config passed to the estimator should correctly contain the run and the experiment, and the training job should be associated with the initialized run.

Screenshots or logs
(screenshot attached)

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.212.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 2.2
  • Python version: 3.10
  • CPU or GPU: GPU
  • Custom Docker image (Y/N):N

Additional context
I've tried virtually everything, including manually passing the experiment config.
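For reference, the manual attempt looked roughly like the sketch below. The key names follow SageMaker's `experiment_config` convention and the values mirror the reproduction above; whether `fit()` accepts this alongside an active `Run` context is exactly what seems not to work:

```python
# Hypothetical sketch: passing the experiment config to fit() explicitly
# instead of relying on the Run context manager to inject it.
# Values are illustrative, taken from the reproduction snippet above.
experiment_config = {
    "ExperimentName": "test_experiment",
    "RunName": "test_run",
}

# est.fit(experiment_config=experiment_config)  # still yields the extra
#                                               # auto-created training-job run
```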
