Keras callback for pushing models to Hub #718
base: main
Conversation
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Looks great! I just have some concerns about blocking/waiting before pushing. Left some comments 🚀
src/huggingface_hub/keras_mixin.py
Outdated
while not self.last_job.is_done:
    time.sleep(1)
I see two options:
- We set `blocking=True`. This will do the same thing you're doing with the `while` loop to wait; see the doc, the function won't return until the push is finished.
- Buuut... I'm not sure we want the push to be blocking. With the current approach, the user has to wait for the last job until the push is complete, which can significantly slow down training. If the previous push is still ongoing, you could simply not start another one, as done in https://github.com/huggingface/transformers/blob/9947001e7cf302cbb3406bae87366d80121dc026/src/transformers/keras_callbacks.py#L275-L282. If we go with this approach, there would still be commits for the different epochs as expected, so it would still have nice version control.
@osanseviero We could return immediately for steps and block for epochs.
I usually prefer consistency, since it will confuse users if one method blocks while the other doesn't. Is there any downside to making the epoch one non-blocking?
@osanseviero Epochs take longer, so I think blocking won't cause any noticeable change in performance, but I'll change it now if you think being consistent is better.
Thanks for explaining! Yes, I feel consistency is better, and I would go with non-blocking, which will be appreciated for large models.
When
I'm a bit surprised by this. Do we get the same error in the transformers callback? The code is very similar too.
@osanseviero I think this is because of the length of epochs/steps and the state of the push; it doesn't happen in transformers (because epochs take long there). I tested this myself by making the model I'm testing with bigger, and the error doesn't appear. It also doesn't affect the pushes (I see everything pushed on the other end and the tests pass); it's just confusing for users.
Ah I see, this makes sense, thanks for explaining! I don't think it's a big issue if fixing it would require a significant change in the code, but if we can reduce the noise that would be great.
I think another difference between the implementation here and
@LysandreJik I get the same thing even when I return the
I'll just get the commit SHA and update the ref. It doesn't work.
Removing this PR from the v0.5 milestone as version v0.5 will be released in a bit. |
I will come up with a way to test this without messing with refs or making the model bigger.
Let me preface this by admitting that I have very little experience with Keras, but here are my thoughts: as a user, when I see a callback like this, I would expect it to work similarly to ModelCheckpoint. E.g. there, I have the option to monitor a specific metric like validation loss and only create a checkpoint if that metric improves. Implementation-wise, I took a look at
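For illustration, a monitor-and-push callback in the spirit of `ModelCheckpoint(save_best_only=True)` might look like the sketch below. It is purely hypothetical: `push_fn` stands in for whatever actually uploads the checkpoint and is not a real huggingface_hub function.

```python
class PushIfImproved:
    """Sketch: only push when the monitored metric improves,
    mirroring ModelCheckpoint's save_best_only behavior."""
    def __init__(self, push_fn, monitor="val_loss", mode="min"):
        self.push_fn = push_fn        # hypothetical upload hook
        self.monitor = monitor
        self.best = float("inf") if mode == "min" else float("-inf")
        self.better = (lambda a, b: a < b) if mode == "min" else (lambda a, b: a > b)

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None:
            return  # monitored metric not available this epoch
        if self.better(current, self.best):
            self.best = current
            self.push_fn(epoch)  # only push on improvement

pushed = []
cb = PushIfImproved(push_fn=pushed.append)
cb.on_epoch_end(0, {"val_loss": 0.9})  # improves over inf -> push
cb.on_epoch_end(1, {"val_loss": 1.2})  # worse -> no push
cb.on_epoch_end(2, {"val_loss": 0.7})  # improves -> push
print(pushed)  # -> [0, 2]
```

In a real Keras callback this class would subclass `tf.keras.callbacks.Callback`, and `logs` would be the metrics dict Keras passes to `on_epoch_end`.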
@BenjaminBossan thanks for the feedback! This is a really great idea; I ran into it too when I made a callback for PyTorch Lightning. It might almost be better to copy the default model checkpoint callback and add the few lines for pushing somewhere within. Would love to hear others' thoughts.
@merveenoyan I added a feature (this PR) to skorch to save model checkpoints to the HF Hub. I took a different approach there, which works with existing callbacks instead of writing a new one. Maybe that could be a more elegant solution for Keras as well? Take a look at this notebook to see how it works from a user's perspective. Whether such an approach would work with Keras
This PR by @BenjaminBossan uses
@merveenoyan I don't know what "soon" means and how urgent this callback feature is, but just an update to say that Keras mixins leveraging non-git uploads are on their way (see PR #847). In particular, it will be easier to build a callback based on
Yes, I don't think this has been widely requested or is urgent, and since waiting will make everyone's life easier, I would just wait.
The callback to push models to the Hub. It's a best of both worlds between the transformers Keras callback and `push_to_hub_callback()`. I didn't write any tests for it yet because I couldn't decide on the best way of doing it. I could either:

Here are a couple of examples (push per epoch to a Hub model ID and push by URL).
I don't want a review yet because 1. it's a branch derived from the unapproved model card PR branch (I did this intentionally because I needed the model card writing functions inside; so many merge conflicts upcoming lol), which needs to be merged first, and 2. I didn't write any tests. You're free to try it in your notebooks though.