[CLI]: Wandb agent fails with no log #7444

Closed
arkadiusz-czerwinski opened this issue Apr 21, 2024 · 9 comments

@arkadiusz-czerwinski

Describe the bug

After setting up the wandb launch environment and successfully adding a job to the queue with wandb job create, the run fails with no feedback other than: "The submitted job exited successfully but failed to call wandb.init". There is no feedback in the agent terminal.

wandb launch-agent -e ENTITY -q QUEUE -v

The repo is available here, with the code and Dockerfile.

Additional Files

No response

Environment

WandB version: 0.16.6

OS:

Python version:

Versions of relevant libraries:

Additional Context

No response

@anmolmann

Hey @arkadiusz-czerwinski , thanks for writing to W&B support. We'll investigate this on our end and get back to you soon.

@anmolmann

Hey @arkadiusz-czerwinski ,

  1. A job run is reported as a failure if it does not include a wandb.init call to create a run, which I think might have happened in your case. This was another gap in our documentation; apologies.
  2. If you look to the far right of the job run's view under the queue, do you see an error icon (example in the video below, bottom right)? This should immediately pop up an error modal. (We've heard from other customers who missed this, so we're working on making it more prominent and discoverable.)
launch_queue_error_modal.mov

Please let me know if this video helps you locate the stack trace on your end.

@anmolmann

Hi @arkadiusz-czerwinski , I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

@arkadiusz-czerwinski
Author

arkadiusz-czerwinski commented May 9, 2024

Hi. Thank you for the reminder. In our setup, the script starts with wandb.init. The error, however, states that the script failed before reaching wandb.init, and no further detail is shown even though the highest verbosity setting was selected. @anmolmann

@anmolmann

anmolmann commented May 9, 2024

Hi @arkadiusz-czerwinski , could you share your queue config and the script as well? This will help us in reproducing this issue on our end for further investigation.

In addition, launch jobs in the run queue do not always show the underlying error message, especially when the job fails before init. We have made progress here in the last few months, as seen in the video shared above. However, I can't guarantee that we will extract the exact failure message every time. We'll keep working on improvements and will investigate the case you brought up, though I don't anticipate us achieving perfection in that regard.
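
For anyone hitting the same thing, one possible workaround for failures that happen before init is to have the job script print its own traceback. The following is only a minimal sketch, not part of wandb: `run_with_visible_errors` and `train` are hypothetical names, and it assumes the job's stderr is reachable somewhere (for example in the container's output).

```python
import sys
import traceback


def run_with_visible_errors(main, stream=None):
    """Call main(); if it raises, write the full traceback to `stream` and return 1.

    Returns 0 when main() finishes without raising. `stream` defaults to stderr.
    """
    stream = stream if stream is not None else sys.stderr
    try:
        main()
    except Exception:
        # Print the traceback ourselves so the failure is recorded even if it
        # happens before wandb.init is reached.
        traceback.print_exc(file=stream)
        stream.flush()
        return 1
    return 0


def train():
    # Hypothetical job body. wandb is imported lazily so that an import error
    # is also caught by the wrapper; the run is created first.
    import wandb

    run = wandb.init()
    run.finish()
```

A job script could then end with `sys.exit(run_with_visible_errors(train))`, so that a non-zero exit code and a traceback are produced whenever the body fails.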

@arkadiusz-czerwinski
Author

arkadiusz-czerwinski commented May 9, 2024

Fair point. The code will be visible in this repo.

gpus: all
label:
  - tutorial
volume:
  - /mnt/space:/mnt/space

The error looks as follows:
image

The main issue is that the error is not very descriptive, and the agent prints no error even with the verbose flag (-v) specified.

@arkadiusz-czerwinski
Author

Update: the issue seems to be that when creating a job from git, an entry point is required but is then not passed on to wandb. This caused the error, because the entry point had to be defined in Dockerfile.wandb.

@anmolmann

I see, thanks for the context @arkadiusz-czerwinski . I will create a feature request to improve the error catching functionality for launch.

@arkadiusz-czerwinski
Author

arkadiusz-czerwinski commented May 20, 2024

Thank you for being so cooperative.
