Keras callback for pushing models to Hub #718

Open · wants to merge 77 commits into base: main
Conversation

@merveenoyan (Contributor) commented on Feb 24, 2022

The callback to push models to the Hub. It combines the best of both worlds between the transformers Keras callback and push_to_hub_callback(). I haven't written any tests for it yet because I couldn't decide on the best way to do it. I could either:

  1. Test by interfering with the training loop to check whether the model is pushed at specific intervals (which is probably a very, very bad idea).
  2. Check the commit history.

Here are a couple of examples (pushing per epoch to a Hub model ID and pushing by URL).

I don't want a review yet because: 1) it's a branch derived from the unapproved model card PR branch (I did this intentionally because I needed the model card writing functions inside, so many merge conflicts are upcoming, lol), which needs to be merged first, and 2) I haven't written any tests. You're free to try it in your notebooks, though.
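For readers of this thread, here is a rough, minimal sketch of the kind of callback being discussed; it is not the PR's actual implementation. The class name, constructor arguments, and commit messages are illustrative, and it assumes the git-based Repository workflow plus save_pretrained_keras from huggingface_hub.

```python
import tensorflow as tf
from huggingface_hub import Repository, save_pretrained_keras


class PushToHubCallback(tf.keras.callbacks.Callback):
    """Sketch: save the model into a local clone of a Hub repo and push it every epoch."""

    def __init__(self, local_dir, clone_from):
        super().__init__()
        # `clone_from` can be a Hub model ID ("user/model") or a full repo URL.
        self.repo = Repository(local_dir, clone_from=clone_from)
        self.local_dir = local_dir
        self.last_job = None  # handle of the last (non-blocking) push

    def on_epoch_end(self, epoch, logs=None):
        save_pretrained_keras(self.model, self.local_dir)
        # Non-blocking push: training continues while the upload runs.
        _, self.last_job = self.repo.push_to_hub(
            commit_message=f"Training in progress, epoch {epoch}", blocking=False
        )

    def on_train_end(self, logs=None):
        # The final push waits, so nothing is lost when the process exits.
        save_pretrained_keras(self.model, self.local_dir)
        self.repo.push_to_hub(commit_message="End of training", blocking=True)
```

Usage would then look something like `model.fit(x, y, epochs=3, callbacks=[PushToHubCallback("my-model", clone_from="user/my-model")])`.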

@osanseviero (Member) left a comment

Looks great! I just have some concerns about blocking/waiting before pushing. Left some comments 🚀

Comment on lines 323 to 324
while not self.last_job.is_done:
time.sleep(1)
@osanseviero (Member)

I see two options:

  • We set blocking=True. This will do the same thing you're doing with the while loop: see the doc, the function won't return until the push is finished.
  • Buuut... I'm not sure we want the push to be blocking. With the current approach, the user will need to wait for the last job until the push is complete, which can significantly slow down training. If the previous push is still ongoing, you could simply not start another one, as done in https://github.com/huggingface/transformers/blob/9947001e7cf302cbb3406bae87366d80121dc026/src/transformers/keras_callbacks.py#L275-L282 (see the sketch after this list). If we go with this approach, there would still be commits for the different epochs as expected, so it would still have nice version control.
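To make that second option concrete, here is a minimal sketch of the pattern used in the linked transformers code, written as a drop-in replacement for the on_epoch_end method of the sketch earlier in this thread (the last_job handle comes from the diff above; everything else is an assumption):

```python
def on_epoch_end(self, epoch, logs=None):
    # If the previous push is still running, skip this one; a later epoch
    # (or the final blocking push at the end of training) will catch up.
    if self.last_job is not None and not self.last_job.is_done:
        return
    save_pretrained_keras(self.model, self.local_dir)
    _, self.last_job = self.repo.push_to_hub(
        commit_message=f"Training in progress, epoch {epoch}", blocking=False
    )
```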

Resolved review threads: src/huggingface_hub/keras_mixin.py (outdated), tests/test_keras_integration.py, tests/test_keras_integration.py (outdated).
@merveenoyan (Contributor, Author) commented on Mar 17, 2022

@osanseviero We can return immediately (non-blocking) for steps and block for epochs.

@osanseviero (Member)

I usually prefer consistency, since it will confuse users if one method always waits for the push while the other doesn't. Is there any downside to making the epoch one non-blocking?

@merveenoyan (Contributor, Author)

@osanseviero Epochs take longer, so I don't think blocking will cause any noticeable change in performance, but I'll change it now if you think being consistent is better.

@osanseviero (Member)

Thanks for explaining! Yes, I feel consistency is better, and I would go with non-blocking, which will be appreciated for large models.

@merveenoyan (Contributor, Author)

When blocking is False the user will see:

remote: error: cannot lock ref 'refs/heads/main': is at 66d0b2cc64c2c1ad77084cac6c6414fdd2636fc6 but expected adb34ff5f99b94c48df5a435815e2399374bb559
To https://huggingface.co/merve/callback-batch-test
 ! [remote rejected] main -> main (failed to update ref)
error: failed to push some refs to '....'

So according to the refs, HEAD is at adb34 and there's a mismatch. I find it a bit confusing for users. I'll try to fix this by updating the ref; I also saw that pruning could help.
(I keep this like a diary, you don't have to answer.)

@osanseviero (Member)

I'm a bit surprised by this. Do we get the same error in the transformers callback? The code is very similar.

@merveenoyan (Contributor, Author)

@osanseviero I think this is because of the length of the epochs/steps and the state of the push; it doesn't happen in transformers (because epochs take longer there). I tested on my own by making the model I'm testing with bigger, and the error doesn't appear. It also doesn't affect the pushes (I see everything pushed on the other end and the tests pass), it's just confusing for users.
The models I was testing with previously would take only a second or two per step or epoch to train; I think that's why it happened.

@osanseviero (Member) commented on Mar 23, 2022

Ah I see, this makes sense, thanks for explaining! I don't think it's a big issue if fixing it would require a significant change in the code, but if we can reduce the noise, that would be great.

@LysandreJik (Member)

I think another difference between the implementation here and the one in transformers is that in transformers we specifically wait for the previous push to finish before starting another one. We keep a push_in_progress variable that we check before trying to push again. Could that be the source of the refs mismatch?

@merveenoyan (Contributor, Author) commented on Mar 24, 2022

@LysandreJik I get the same thing even when I return the last_job from the push (which is the is_done in the command output). I don't know whether people upload models that are super fast to train every step, so I don't know if it's worth updating refs manually inside the callback itself. The tests fail for this reason: I made the models very fast to train so that the tests pass quickly 😅 but now I get this 😄. I'll put last_job back regardless, thanks for the catch.
This is definitely an edge case 😅

Epoch 2/2
1/1 [==============================] - 0s 3ms/step - loss: 0.5654

I tried just getting the commit SHA and updating the ref; it doesn't work.

@LysandreJik (Member)

Removing this PR from the v0.5 milestone as version v0.5 will be released in a bit.

@LysandreJik removed this from the v0.5 milestone on Apr 5, 2022
@merveenoyan (Contributor, Author)

I will come up with a way to test this without messing with the refs or making the model bigger.

@BenjaminBossan (Member)

Let me preface this by admitting that I have very little experience with Keras, but here are my thoughts:

As a user, when I see a callback like this, I expect it to work similarly to ModelCheckpoint. For example, there I have the option to monitor a specific metric like validation loss and only create a checkpoint if that metric improves.

Implementation-wise, I took a look at the ModelCheckpoint code to see whether it could be used as a base class with a few changes, but it does not encapsulate the I/O part, so that would probably be a bad idea. Still, even if a lot of code would need to be copy-pasted, at least we can be confident that ModelCheckpoint has a long history of usage and is well tested, allowing PushToHubCallback to focus on the I/O.
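To make the ModelCheckpoint comparison concrete, here is a rough sketch of metric-monitored pushing. The monitor/mode bookkeeping mirrors ModelCheckpoint's save_best_only behaviour; the class name and the push mechanics (Repository plus save_pretrained_keras) are illustrative, not this PR's code.

```python
import numpy as np
import tensorflow as tf
from huggingface_hub import Repository, save_pretrained_keras


class PushBestToHubCallback(tf.keras.callbacks.Callback):
    """Sketch: only push when the monitored metric improves, like ModelCheckpoint(save_best_only=True)."""

    def __init__(self, local_dir, clone_from, monitor="val_loss", mode="min"):
        super().__init__()
        self.repo = Repository(local_dir, clone_from=clone_from)
        self.local_dir = local_dir
        self.monitor = monitor
        self.monitor_op = np.less if mode == "min" else np.greater
        self.best = np.inf if mode == "min" else -np.inf

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None or not self.monitor_op(current, self.best):
            return  # metric missing or no improvement: skip the push
        self.best = current
        save_pretrained_keras(self.model, self.local_dir)
        self.repo.push_to_hub(
            commit_message=f"Epoch {epoch}: {self.monitor} improved to {current:.4f}",
            blocking=False,
        )
```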

@nateraw (Contributor) commented on Jul 11, 2022

@BenjaminBossan thanks for the feedback! This is a really great idea, and I ran into this too when I made a callback for PyTorch Lightning. We're almost better off copying the default model checkpoint callback and adding the few lines for pushing somewhere within it. Would love to hear others' thoughts.

@BenjaminBossan (Member)

@merveenoyan I added a feature (this PR) to skorch to save model checkpoints to the HF Hub. I took a different approach there, which works with existing callbacks instead of writing a new one. Maybe that could be a more elegant solution for Keras as well? Take a look at this notebook to see how it works from a user's perspective. Whether such an approach would work with Keras's ModelCheckpoint, I can't say for sure.

@merveenoyan (Contributor, Author)

The PR by @BenjaminBossan uses upload_file, which makes sense for this case since it doesn't mess up the git refs (from what I see, and my guess). Should I wait until we migrate the mixins to that, or would you like me to do it soon? @osanseviero @LysandreJik @nateraw
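For illustration, a minimal sketch of what an upload_file-based callback could look like, assuming a reasonably recent huggingface_hub; the class name and file layout are made up, and this is not the skorch PR's code. Each call creates an HTTP commit on the Hub repo, so there is no local git ref to drift out of sync, which is the point being made above.

```python
import os
import tempfile

import tensorflow as tf
from huggingface_hub import upload_file


class UploadFileCallback(tf.keras.callbacks.Callback):
    """Sketch: push weights over HTTP with upload_file; no local git clone needed."""

    def __init__(self, repo_id, token=None):
        super().__init__()
        self.repo_id = repo_id
        self.token = token

    def on_epoch_end(self, epoch, logs=None):
        with tempfile.TemporaryDirectory() as tmp:
            weights_path = os.path.join(tmp, "model.h5")
            self.model.save_weights(weights_path)
            # One commit per epoch on the Hub repo, created server-side.
            upload_file(
                path_or_fileobj=weights_path,
                path_in_repo="model.h5",
                repo_id=self.repo_id,
                token=self.token,
                commit_message=f"Training in progress, epoch {epoch}",
            )
```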

@osanseviero (Member)

I would wait until #847 is merged; @Wauplin is actively working on this.

@Wauplin (Contributor) commented on Aug 8, 2022

@merveenoyan I don't know what "soon" means or how urgent this callback feature is, but just an update to say that the Keras mixins leveraging non-git uploads are on their way (see PR #847). In particular, it will be easier to build a callback based on push_to_hub_keras, since there will be no need to initialize with a git pull anymore. See also the API example from @LysandreJik in #847 (comment).
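If the callback waits for that migration, its core could become as small as the sketch below. This assumes the post-#847 push_to_hub_keras behaviour (serialize and upload over HTTP with no local clone); the class name is illustrative.

```python
import tensorflow as tf
from huggingface_hub import push_to_hub_keras


class PushToHubKerasCallback(tf.keras.callbacks.Callback):
    """Sketch: delegate the whole save-and-upload step to push_to_hub_keras."""

    def __init__(self, repo_id):
        super().__init__()
        self.repo_id = repo_id

    def on_epoch_end(self, epoch, logs=None):
        # Each call serializes the model and uploads it as a new commit;
        # no git pull or local repository initialization is required.
        push_to_hub_keras(self.model, self.repo_id)
```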

@osanseviero (Member)

Yes, I don't think this has been widely requested or is urgent, and since waiting will make everyone's life easier, I would just wait.
