axlearn on GPU started failing during init after upgrade #463
Comments
It seems there was a code change that requires setting the port. I can probably fix this on my end.
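A minimal sketch of what "setting the port" could look like, assuming the coordinator address is passed as a `host:port` string (the form that `jax.distributed.initialize` accepts for `coordinator_address`). The helper name `ensure_port` and the default port value are hypothetical and are not taken from axlearn; the helper only handles hostnames and IPv4 addresses.

```python
# Hypothetical default; pick whatever port your cluster actually exposes.
DEFAULT_COORDINATOR_PORT = 8476


def ensure_port(address: str, default_port: int = DEFAULT_COORDINATOR_PORT) -> str:
    """Return `address` in host:port form, appending `default_port` if no port is given.

    Only hostnames and IPv4 addresses are handled; IPv6 literals are out of scope.
    """
    _, sep, port = address.rpartition(":")
    if sep and port.isdigit():
        # A numeric port is already present, so leave the address untouched.
        return address
    return f"{address}:{default_port}"


if __name__ == "__main__":
    print(ensure_port("10.0.0.1"))
    print(ensure_port("coordinator.example:1234"))
```

The resulting string could then be passed on as the coordinator address; whether that is sufficient depends on the launcher in use.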
Hmm, setting the port on my side didn't work either:
@samos123 you may need to remove some deprecated XLA flags like
That got me past the "Fatal Python Error: aborted!" issue. Thanks a lot, Mark. The next issue:
This is what I do in my Dockerfile:
Inside my entrypoint script I set this:
and gke_fuji.py is copied to |
I think this is resolved after appending the version suffix to the config name. @samos123, are we good to close this?
There were 2 issues:
Thanks a lot for helping troubleshoot it. Closing.
One thing to note is that we should have better error reporting when there is a code error in a custom experiment. Right now it shows up as a config-not-found error, with no clear message about what could actually be wrong.
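To illustrate the kind of error reporting meant here, the sketch below separates "the config module does not exist" from "the config module exists but raised while being imported". It is a generic illustration built on the standard `importlib` module, not axlearn's actual loading code; `ConfigLoadError` and `load_config_module` are hypothetical names.

```python
import importlib


class ConfigLoadError(Exception):
    """Hypothetical error type that says why a config module could not be loaded."""


def load_config_module(module_name: str):
    """Import a config module, keeping 'missing' and 'broken' failures distinct."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        # e.name is the module that could not be found. It is either the
        # requested module (or one of its parent packages), or something the
        # requested module tried to import.
        if e.name == module_name or module_name.startswith(f"{e.name}."):
            raise ConfigLoadError(f"config module {module_name!r} not found") from e
        raise ConfigLoadError(
            f"config module {module_name!r} exists, but its import of {e.name!r} failed"
        ) from e
    except Exception as e:
        # Any other exception means the user's experiment code itself is broken.
        raise ConfigLoadError(
            f"config module {module_name!r} failed during import: "
            f"{type(e).__name__}: {e}"
        ) from e
```

Chaining with `from e` keeps the original traceback attached, so the line in the custom experiment that caused the failure stays visible instead of being collapsed into a generic lookup error.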
This is the error message I see when launching like this:
Error message: