KeyError: 'Message' when encountering an error in _send_metrics #4482

Open
sziem opened this issue Mar 6, 2024 · 4 comments
sziem commented Mar 6, 2024

Describe the bug
When an error occurs while calling run.log_metric, the SDK raises a KeyError instead of showing the actual error message.

To reproduce
It is a bit hard for me to describe this, as it occurred randomly after working for 42 epochs.

Expected behavior
Get a message describing the actual cause of the error.

Screenshots or logs

Train epoch 43:  68%|██████▊   | 622/921 [15:36<07:30,  1.51s/it]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ec2-user/train-clrnet/src/man_adt/model_training/train_clrnet/entrypoints/train_clrnet.py", line 69, in main
    runner.train()
  File "/home/ec2-user/train-clrnet/src/man_adt/model_training/train_clrnet/engine/runner.py", line 185, in train
    self._train_epoch(_sagemaker_run)
  File "/home/ec2-user/train-clrnet/src/man_adt/model_training/train_clrnet/engine/runner.py", line 269, in _train_epoch
    _log_training_metrics(
  File "/home/ec2-user/train-clrnet/src/man_adt/model_training/train_clrnet/engine/runner.py", line 406, in _log_training_metrics
    run.log_metric(name="Learning Rate", value=lr, step=step)
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_utils.py", line 90, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/run.py", line 297, in log_metric
    self._metrics_manager.log_metric(
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_metrics.py", line 138, in log_metric
    self.sink.log_metric(metric_data)
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_metrics.py", line 173, in log_metric
    self._drain()
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_metrics.py", line 187, in _drain
    self._send_metrics(available_metrics)
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/train-net-3hJI2r0s-py3.10/lib/python3.10/site-packages/sagemaker/experiments/_metrics.py", line 200, in _send_metrics
    message = errors[0]["Message"]
KeyError: 'Message'

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: '2.209.0'
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): -
  • Framework version: -
  • Python version: 3.10
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N

Additional context

@sziem sziem added the bug label Mar 6, 2024
@sziem sziem changed the title from "KeyError: 'Message" to "KeyError: 'Message' when encountering an error in _send_metrics" Mar 6, 2024
@ananth102
Collaborator

Hi sziem, are you repeatedly seeing this issue? If so, can you share some sample code that we can use to replicate it?

@sziem
Author

sziem commented Apr 1, 2024

Hi, thanks for your reply. After seeing this about 2-3 times, I wrapped my calls in a try-except and just ignored it, so I'm not sure if this is still an issue, sorry. Also, it's been a while since I looked at it.

As I said above, it is a bit hard to create a minimal example for the issue because of the long delay before it occurs. Unfortunately, I'm not at liberty to share my code, but this is how I've been using log_metric:

import boto3
from sagemaker.experiments import run
from sagemaker.session import Session

# sagemaker_session must be defined before creating the run (arguments elided):
# sagemaker_session = Session(boto_session=boto3.Session(...))

my_run = run.Run(
    experiment_name="experiment_foo",
    run_name="run_foo",
    tags={"tag_key": "tag_value"},
    sagemaker_session=sagemaker_session,
)

n_steps = 1000000
lr = 0.0001
with my_run as my_run_:
    # Log the same metric once per step (the loop bound is n_steps, defined above)
    for step in range(n_steps):
        my_run_.log_metric(name="Learning Rate", value=lr, step=step)

Then something (maybe a connection error?) must have caused _send_metrics to fail at some point.
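
The try-except workaround mentioned above can be sketched roughly as follows. This is only an illustration: `safe_log_metric` is a hypothetical helper that is not part of the SageMaker SDK, and `_FailingRun` is a stand-in object that imitates the KeyError from the traceback so the sketch runs without AWS access.

```python
import logging

logger = logging.getLogger(__name__)


def safe_log_metric(run, name, value, step):
    """Log a metric without letting a logging failure stop training.

    Hypothetical helper (not SDK code). Returns True if the metric was
    logged and False if the underlying call raised.
    """
    try:
        run.log_metric(name=name, value=value, step=step)
        return True
    except Exception:
        # Deliberately broad: this also swallows the KeyError raised in _send_metrics.
        logger.warning("log_metric failed for %r at step %s", name, step, exc_info=True)
        return False


class _FailingRun:
    """Stand-in for a run whose log_metric fails as in the traceback."""

    def log_metric(self, name, value, step):
        raise KeyError("Message")


class _WorkingRun:
    """Stand-in for a run whose log_metric succeeds and records its calls."""

    def __init__(self):
        self.logged = []

    def log_metric(self, name, value, step):
        self.logged.append((name, value, step))


failed = safe_log_metric(_FailingRun(), "Learning Rate", 0.0001, 0)
working_run = _WorkingRun()
succeeded = safe_log_metric(working_run, "Learning Rate", 0.0001, 1)
```

Catching every exception hides real failures too, so logging the traceback (as done here with `exc_info=True`) is probably worth keeping even if the error itself is ignored.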

@ananth102
Collaborator

This seems like an issue with the SDK.

message = errors[0]["Message"]

This statement needs to reference "Code" instead of "Message", as that is what the API returns (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-metrics/client/batch_put_metrics.html).

It would still error out in the next line:

raise Exception(f'{len(errors)} errors with message "{message}"')

but the error message would be more helpful.
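
A minimal sketch of what that change could look like, assuming each entry in the response's `Errors` list carries `Code` and `MetricIndex` as the linked boto3 page describes. `format_batch_put_errors` is a hypothetical helper, not the SDK's actual code, and the response below is simulated with illustrative values.

```python
def format_batch_put_errors(errors):
    """Build an error message from the Errors list of a BatchPutMetrics response.

    Hypothetical helper (not SDK code). Uses .get() so that a missing key
    produces a placeholder instead of a KeyError.
    """
    first = errors[0]
    code = first.get("Code", "<unknown>")
    index = first.get("MetricIndex", "<unknown>")
    return f'{len(errors)} errors, first with code "{code}" at metric index {index}'


# Simulated response: the shape follows the boto3 documentation, the values are made up.
response = {"Errors": [{"Code": "VALIDATION_ERROR", "MetricIndex": 0}]}

errors = response.get("Errors", [])
message = format_batch_put_errors(errors) if errors else None
```

Reading the keys with `.get()` is a defensive choice for this sketch; the SDK could equally index `"Code"` directly if the response shape is guaranteed.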

@sziem
Author

sziem commented Apr 3, 2024

Yes, I agree. That should be the fix and the correct behavior.
