Keras callback for pushing models to Hub #718

Open · wants to merge 77 commits into base: main
Conversation

@merveenoyan (Contributor) commented on Feb 24, 2022

The callback to push models to the Hub. It combines the best of both worlds between the transformers Keras callback and push_to_hub_callback(). I haven't written any tests for it yet because I couldn't decide on the best way to do it. I could either:

  1. Test by interfering with the training loop to check whether the model is pushed at specific intervals (which is probably a very, very bad idea).
  2. Check the commit history.

Here are a couple of examples (pushing per epoch to a Hub model ID and pushing by URL).

I don't want a review yet because: 1) it's a branch derived from the unapproved model card PR branch (I did this intentionally because I needed the model card writing functions inside, so many merge conflicts are upcoming, lol), which needs to be merged first, and 2) I haven't written any tests. You're free to try it in your notebooks, though.
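For readers of this thread, here is a rough, minimal sketch of the kind of callback being discussed; it is not the PR's actual implementation. The class name, constructor arguments, and commit messages are illustrative, and it assumes the git-based Repository workflow plus save_pretrained_keras from huggingface_hub.

```python
import tensorflow as tf
from huggingface_hub import Repository, save_pretrained_keras


class PushToHubCallback(tf.keras.callbacks.Callback):
    """Sketch: save the model into a local clone of a Hub repo and push it every epoch."""

    def __init__(self, local_dir, clone_from):
        super().__init__()
        # `clone_from` can be a Hub model ID ("user/model") or a full repo URL.
        self.repo = Repository(local_dir, clone_from=clone_from)
        self.local_dir = local_dir
        self.last_job = None  # handle of the last (non-blocking) push

    def on_epoch_end(self, epoch, logs=None):
        save_pretrained_keras(self.model, self.local_dir)
        # Non-blocking push: training continues while the upload runs.
        _, self.last_job = self.repo.push_to_hub(
            commit_message=f"Training in progress, epoch {epoch}", blocking=False
        )

    def on_train_end(self, logs=None):
        # The final push waits, so nothing is lost when the process exits.
        save_pretrained_keras(self.model, self.local_dir)
        self.repo.push_to_hub(commit_message="End of training", blocking=True)
```

Usage would then look something like `model.fit(x, y, epochs=3, callbacks=[PushToHubCallback("my-model", clone_from="user/my-model")])`.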

@osanseviero (Member) left a comment

Looks great! I just have some concerns about blocking/waiting before pushing. Left some comments 🚀

Comment on lines 323 to 324
while not self.last_job.is_done:
time.sleep(1)
@osanseviero (Member)

I see two options:

  • We set blocking=True. This will do the same thing you're doing with the while loop: see the doc, the function won't return until the push is finished.
  • Buuut... I'm not sure we want the push to be blocking. With the current approach, the user will need to wait for the last job until the push is complete, which can significantly slow down training. If the previous push is still ongoing, you could simply not start another one, as done in https://github.com/huggingface/transformers/blob/9947001e7cf302cbb3406bae87366d80121dc026/src/transformers/keras_callbacks.py#L275-L282 (see the sketch after this list). If we go with this approach, there would still be commits for the different epochs as expected, so it would still have nice version control.
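To make that second option concrete, here is a minimal sketch of the pattern used in the linked transformers code, written as a drop-in replacement for the on_epoch_end method of the sketch earlier in this thread (the last_job handle comes from the diff above; everything else is an assumption):

```python
def on_epoch_end(self, epoch, logs=None):
    # If the previous push is still running, skip this one; a later epoch
    # (or the final blocking push at the end of training) will catch up.
    if self.last_job is not None and not self.last_job.is_done:
        return
    save_pretrained_keras(self.model, self.local_dir)
    _, self.last_job = self.repo.push_to_hub(
        commit_message=f"Training in progress, epoch {epoch}", blocking=False
    )
```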

Resolved review threads: src/huggingface_hub/keras_mixin.py (outdated), tests/test_keras_integration.py, tests/test_keras_integration.py (outdated).
@merveenoyan (Contributor, Author) commented on Mar 17, 2022

@osanseviero We can return immediately (non-blocking) for steps and block for epochs.

@osanseviero (Member)

I usually prefer consistency, since it will confuse users if one method always waits for the push while the other doesn't. Is there any downside to making the epoch one non-blocking?

@merveenoyan (Contributor, Author)

@osanseviero Epochs take longer, so I don't think blocking will cause any noticeable change in performance, but I'll change it now if you think being consistent is better.

@osanseviero (Member)

Thanks for explaining! Yes, I feel consistency is better, and I would go with non-blocking, which will be appreciated for large models.

@merveenoyan (Contributor, Author)

When blocking is False the user will see:

remote: error: cannot lock ref 'refs/heads/main': is at 66d0b2cc64c2c1ad77084cac6c6414fdd2636fc6 but expected adb34ff5f99b94c48df5a435815e2399374bb559
To https://huggingface.co/merve/callback-batch-test
 ! [remote rejected] main -> main (failed to update ref)
error: failed to push some refs to '....'

So according to the refs, HEAD is at adb34 and there's a mismatch. I find it a bit confusing for users. I'll try to fix this by updating the ref; I also saw that pruning could help.
(I keep this like a diary, you don't have to answer.)

@osanseviero (Member)

I'm a bit surprised by this. Do we get the same error in the transformers callback? The code is very similar.

@merveenoyan (Contributor, Author)

@osanseviero I think this is because of the length of the epochs/steps and the state of the push; it doesn't happen in transformers (because epochs take longer there). I tested on my own by making the model I'm testing with bigger, and the error doesn't appear. It also doesn't affect the pushes (I see everything pushed on the other end and the tests pass), it's just confusing for users.
The models I was testing with previously would take only a second or two per step or epoch to train; I think that's why it happened.

@osanseviero (Member) commented on Mar 23, 2022

Ah I see, this makes sense, thanks for explaining! I don't think it's a big issue if fixing it would require a significant change in the code, but if we can reduce the noise, that would be great.

@LysandreJik (Member)

I think another difference between the implementation here and the one in transformers is that in transformers we specifically wait for the previous push to finish before starting another one. We keep a push_in_progress variable that we check before trying to push again. Could that be the source of the refs mismatch?

@merveenoyan (Contributor, Author) commented on Mar 24, 2022

@LysandreJik I get the same thing even when I return the last_job from the push (which is the is_done in the command output). I don't know whether people upload models that are super fast to train every step, so I don't know if it's worth updating refs manually inside the callback itself. The tests fail for this reason: I made the models very fast to train so that the tests pass quickly 😅 but now I get this 😄. I'll put last_job back regardless, thanks for the catch.
This is definitely an edge case 😅

Epoch 2/2
1/1 [==============================] - 0s 3ms/step - loss: 0.5654

I tried just getting the commit SHA and updating the ref; it doesn't work.

@LysandreJik (Member)

Removing this PR from the v0.5 milestone as version v0.5 will be released in a bit.

@LysandreJik removed this from the v0.5 milestone on Apr 5, 2022
@merveenoyan (Contributor, Author)

I will come up with a way to test this without messing with the refs or making the model bigger.

@BenjaminBossan (Member)

Let me preface this by admitting that I have very little experience with Keras, but here are my thoughts:

As a user, when I see a callback like this, I expect it to work similarly to ModelCheckpoint. For example, there I have the option to monitor a specific metric like validation loss and only create a checkpoint if that metric improves.

Implementation-wise, I took a look at the ModelCheckpoint code to see whether it could be used as a base class with a few changes, but it does not encapsulate the I/O part, so that would probably be a bad idea. Still, even if a lot of code would need to be copy-pasted, at least we can be confident that ModelCheckpoint has a long history of usage and is well tested, allowing PushToHubCallback to focus on the I/O.
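To make the ModelCheckpoint comparison concrete, here is a rough sketch of metric-monitored pushing. The monitor/mode bookkeeping mirrors ModelCheckpoint's save_best_only behaviour; the class name and the push mechanics (Repository plus save_pretrained_keras) are illustrative, not this PR's code.

```python
import numpy as np
import tensorflow as tf
from huggingface_hub import Repository, save_pretrained_keras


class PushBestToHubCallback(tf.keras.callbacks.Callback):
    """Sketch: only push when the monitored metric improves, like ModelCheckpoint(save_best_only=True)."""

    def __init__(self, local_dir, clone_from, monitor="val_loss", mode="min"):
        super().__init__()
        self.repo = Repository(local_dir, clone_from=clone_from)
        self.local_dir = local_dir
        self.monitor = monitor
        self.monitor_op = np.less if mode == "min" else np.greater
        self.best = np.inf if mode == "min" else -np.inf

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None or not self.monitor_op(current, self.best):
            return  # metric missing or no improvement: skip the push
        self.best = current
        save_pretrained_keras(self.model, self.local_dir)
        self.repo.push_to_hub(
            commit_message=f"Epoch {epoch}: {self.monitor} improved to {current:.4f}",
            blocking=False,
        )
```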

@nateraw (Contributor) commented on Jul 11, 2022

@BenjaminBossan thanks for the feedback! This is a really great idea, and I ran into this too when I made a callback for PyTorch Lightning. We're almost better off copying the default model checkpoint callback and adding the few lines for pushing somewhere within it. Would love to hear others' thoughts.

@BenjaminBossan (Member)

@merveenoyan I added a feature (this PR) to skorch to save model checkpoints to the HF Hub. I took a different approach there, which works with existing callbacks instead of writing a new one. Maybe that could be a more elegant solution for Keras as well? Take a look at this notebook to see how it works from a user's perspective. Whether such an approach would work with Keras's ModelCheckpoint, I can't say for sure.

@merveenoyan (Contributor, Author)

The PR by @BenjaminBossan uses upload_file, which makes sense for this case since it doesn't mess up the git refs (from what I see, and my guess). Should I wait until we migrate the mixins to that, or would you like me to do it soon? @osanseviero @LysandreJik @nateraw
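For illustration, a minimal sketch of what an upload_file-based callback could look like, assuming a reasonably recent huggingface_hub; the class name and file layout are made up, and this is not the skorch PR's code. Each call creates an HTTP commit on the Hub repo, so there is no local git ref to drift out of sync, which is the point being made above.

```python
import os
import tempfile

import tensorflow as tf
from huggingface_hub import upload_file


class UploadFileCallback(tf.keras.callbacks.Callback):
    """Sketch: push weights over HTTP with upload_file; no local git clone needed."""

    def __init__(self, repo_id, token=None):
        super().__init__()
        self.repo_id = repo_id
        self.token = token

    def on_epoch_end(self, epoch, logs=None):
        with tempfile.TemporaryDirectory() as tmp:
            weights_path = os.path.join(tmp, "model.h5")
            self.model.save_weights(weights_path)
            # One commit per epoch on the Hub repo, created server-side.
            upload_file(
                path_or_fileobj=weights_path,
                path_in_repo="model.h5",
                repo_id=self.repo_id,
                token=self.token,
                commit_message=f"Training in progress, epoch {epoch}",
            )
```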

@osanseviero (Member)

I would wait until #847 is merged; @Wauplin is actively working on this.

@Wauplin (Contributor) commented on Aug 8, 2022

@merveenoyan I don't know what "soon" means or how urgent this callback feature is, but just an update to say that the Keras mixins leveraging non-git uploads are on their way (see PR #847). In particular, it will be easier to build a callback based on push_to_hub_keras, since there will be no need to initialize with a git pull anymore. See also the API example from @LysandreJik in #847 (comment).
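If the callback waits for that migration, its core could become as small as the sketch below. This assumes the post-#847 push_to_hub_keras behaviour (serialize and upload over HTTP with no local clone); the class name is illustrative.

```python
import tensorflow as tf
from huggingface_hub import push_to_hub_keras


class PushToHubKerasCallback(tf.keras.callbacks.Callback):
    """Sketch: delegate the whole save-and-upload step to push_to_hub_keras."""

    def __init__(self, repo_id):
        super().__init__()
        self.repo_id = repo_id

    def on_epoch_end(self, epoch, logs=None):
        # Each call serializes the model and uploads it as a new commit;
        # no git pull or local repository initialization is required.
        push_to_hub_keras(self.model, self.repo_id)
```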

@osanseviero (Member)

Yes, I don't think this has been widely requested or is urgent, and since waiting will make everyone's life easier, I would just wait.
