Describe the bug
Having followed virtually all available guidance on experiment tracking for a training job in "bring your own script" mode, SageMaker always seems to create a separate run for the training job. So for each run I end up with two runs: the one I initialize, which all the metrics are logged to, and a second 'pytorch-training--aws-training-job' run which contains the output model artifacts and the debug info.
To reproduce
Relevant excerpt from the code that initiates the training job:
```python
from sagemaker.pytorch import PyTorch
from sagemaker.experiments import Run

experiment_name = "test_experiment"
run_name = "test_run"

# role and hyperparameters are defined elsewhere in the launcher script
with Run(experiment_name=experiment_name, run_name=run_name) as run:
    est = PyTorch(
        entry_point="./job.py",
        role=role,
        model_dir=False,
        framework_version="2.2",
        py_version="py310",
        instance_type="ml.g5.12xlarge",
        instance_count=1,
        hyperparameters=hyperparameters,
    )
    est.fit()
```
Relevant excerpt from job.py:
```python
import boto3

if __name__ == "__main__":
    from sagemaker.session import Session
    from sagemaker.experiments.run import load_run

    session = Session(boto3.session.Session(region_name="us-west-2"))
    # args, job_name, and execute are defined earlier in the script
    with load_run(sagemaker_session=session) as run:
        # Log all parameters
        run.log_parameters({k: str(v) for k, v in vars(args).items()})
        run.log_parameter("job_name", str(job_name))
        execute(args, run)
```
Expected behavior
The experiment config passed to the estimator should correctly contain the experiment and run names, and the training job should be associated with the run I created rather than with a new auto-generated run.
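One way to check what actually got attached is to inspect the ExperimentConfig that SageMaker recorded on the training job. A minimal sketch (the training job name below is a placeholder; substitute the name reported by est.fit()):

```python
import boto3

# Hypothetical training job name; replace with the real one from est.fit() / the console.
training_job_name = "pytorch-training-2024-01-01-00-00-00-000"

sm = boto3.client("sagemaker", region_name="us-west-2")
desc = sm.describe_training_job(TrainingJobName=training_job_name)

# If the run was propagated, this should reference the experiment and run
# created in the launcher script rather than an auto-generated one.
print(desc.get("ExperimentConfig"))
```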
Screenshots or logs
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.212.0
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch
- Framework version: 2.2
- Python version: 3.10
- CPU or GPU: GPU
- Custom Docker image (Y/N): N
Additional context
I've tried virtually everything including manually passing the experiment config.
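For reference, this is roughly what the manual attempt might look like. A sketch only: the experiment_config keys shown here are my assumption about how the config was passed, not necessarily the exact call used.

```python
from sagemaker.pytorch import PyTorch
from sagemaker.experiments import Run

# Same experiment/run names as in the launcher excerpt above.
experiment_name = "test_experiment"
run_name = "test_run"

with Run(experiment_name=experiment_name, run_name=run_name) as run:
    est = PyTorch(
        entry_point="./job.py",
        role=role,                      # role and hyperparameters as defined elsewhere
        framework_version="2.2",
        py_version="py310",
        instance_type="ml.g5.12xlarge",
        instance_count=1,
        hyperparameters=hyperparameters,
    )
    # Explicitly pass the experiment config instead of relying on the Run
    # context being picked up automatically.
    est.fit(
        experiment_config={
            "ExperimentName": experiment_name,
            "RunName": run_name,
        }
    )
```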