
trainer.fit() Object Memory Error in Local Machine #253

Open
satyamnyati opened this issue Feb 2, 2024 · 4 comments

Comments

@satyamnyati

I run out of memory in trainer.fit(). I have 8 GB of RAM and an 8th-gen i7 CPU with 12 cores, and the error occurs inside trainer.fit(). Is there any way to reduce the load on RAM, or will I need more RAM to be able to run this?

@totovivi

totovivi commented Feb 9, 2024

These changes helped me:

  • setting num_workers to 1
  • setting ray.init(object_store_memory=10**9)
  • changing the batch_size to 32

But I am surprised that I had to make these changes, given my computer's specs.
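
For what it's worth, here is a minimal sketch of where those three settings would typically go, assuming a Ray Train TorchTrainer setup; the train loop body and config keys are placeholders, not this repo's actual code:

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Cap Ray's object store at ~1 GB so it leaves more RAM for training.
ray.init(object_store_memory=10**9)

def train_loop_per_worker(config):
    # Placeholder: build the model and DataLoader using config["batch_size"],
    # then run the training loop here.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"batch_size": 32},         # smaller batches -> less activation memory
    scaling_config=ScalingConfig(num_workers=1),  # one worker -> one copy of the model in RAM
)
result = trainer.fit()
```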

@satyamnyati
Author

Yes, I changed the batch size and could get it to run on Ubuntu. On Windows, however, I couldn't run it even with these changes; the path manipulations are causing some problems.

@capmichal

capmichal commented Feb 26, 2024

Same issue here, on both laptops (a powerful Dell XPS 15 and a weaker HP from work). After using num_workers=1 and batch_size=32 I could run just fine; it took about 30 minutes to train the model.

If I work on a single machine, does num_workers affect this OOM problem, or is manipulating the batch_size variable all I have to do? Does changing to 2 workers allow me to use a larger batch size? Or can my computer (each core) simply not handle more than batch_size=32?

I would love an explanation of how num_workers, resources_per_worker, and batch_size relate to my RAM.

In addition: using the GPU on my Dell solves this issue and trains the model in a minute, but I am interested in using only the CPU. While running the whole setup, my machine's RAM usage is about 70%, which leaves about 4-5 GB of RAM for training. That is simply not enough, so will no configuration (without the GPU) let me train easily?

@Koowah

Koowah commented May 4, 2024

@capmichal @totovivi I had the same reactions and questions.

After a bit of research and chatting with GPT, here's what I gathered:

  • When we train a model, we have to load it into memory. Here we're using SciBERT, which weighs around 442 MB according to the Hugging Face page for the model. This size roughly corresponds to the memory taken by all of the model's parameters (110 million).
  • One forward pass of one sample loads into memory all of the network's associated activations and gradients. When we pass a batch, we have to multiply this load by the batch size. Memory load is therefore largely a function of model size and grows roughly linearly with batch size.

So I guess the main driver of memory use is how much data is processed per iteration (batch_size * num_workers). Using fewer workers should nevertheless be better, as each worker has overhead and must load its own copy of the model.
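
A rough back-of-the-envelope estimate of this, using illustrative numbers (not measurements from this repo; the per-sample activation figure in particular is an assumption), might look like:

```python
# Back-of-the-envelope estimate of per-worker training memory.
# All numbers below are assumptions for illustration, not measurements.
n_params = 110_000_000      # SciBERT parameter count (~110M)
bytes_per_param = 4         # float32

weights = n_params * bytes_per_param                # ~440 MB, matches the ~442 MB checkpoint
gradients = weights                                 # one gradient per parameter
optimizer_state = 2 * weights                       # e.g. Adam keeps two extra values per parameter
fixed_cost = weights + gradients + optimizer_state  # paid once per worker (each worker holds a model copy)

# Activation memory grows roughly linearly with batch size.
# The per-sample figure is a made-up placeholder, not a measured value.
activation_per_sample = 30 * 2**20  # assume ~30 MiB of activations per sample

for batch_size in (8, 32, 128):
    total = fixed_cost + batch_size * activation_per_sample
    print(f"batch_size={batch_size:>3}: ~{total / 2**30:.1f} GiB per worker")
```

With those assumptions, batch_size=32 lands around 2.7 GiB per worker, while batch_size=128 is already over 5 GiB, which is roughly consistent with an 8 GB machine hitting OOM.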
