
trainer.fit() Object Memory Error in Local Machine #253

Open
satyamnyati opened this issue Feb 2, 2024 · 4 comments

Comments

@satyamnyati

I run out of memory in trainer.fit(). I have 8 GB of RAM and an 8th-gen i7 CPU with 12 cores, and the error occurs inside trainer.fit(). Is there any way to reduce the load on RAM, or will I need more RAM to be able to run this?

@totovivi

totovivi commented Feb 9, 2024

These changes helped me:

  • setting num_workers to 1
  • setting ray.init(object_store_memory=10**9)
  • changing the batch_size to 32

But I am surprised that I had to make these changes, given my computer's specs.
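
For what it's worth, here is a minimal sketch of where those three settings would typically go, assuming a Ray Train TorchTrainer setup; the train loop body and config keys are placeholders, not this repo's actual code:

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Cap Ray's object store at ~1 GB so it leaves more RAM for training.
ray.init(object_store_memory=10**9)

def train_loop_per_worker(config):
    # Placeholder: build the model and DataLoader using config["batch_size"],
    # then run the training loop here.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"batch_size": 32},         # smaller batches -> less activation memory
    scaling_config=ScalingConfig(num_workers=1),  # one worker -> one copy of the model in RAM
)
result = trainer.fit()
```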

@satyamnyati
Author

Yes, I changed the batch size and could get it to run on Ubuntu. On Windows, however, I couldn't run it even with these changes; the path manipulations are causing some problems.

@capmichal

capmichal commented Feb 26, 2024

Same issue here, on both laptops (a powerful Dell XPS 15 and a weaker HP from work). After using num_workers=1 and batch_size=32 I could run just fine; it took about 30 minutes to train the model.

If I work on a single machine, does num_workers affect this OOM problem, or is manipulating the batch_size variable all I have to do? Does changing to 2 workers allow me to use a larger batch size? Or can my computer (each core) simply not handle more than batch_size=32?

I would love an explanation of how num_workers, resources_per_worker, and batch_size relate to my RAM.

In addition: using the GPU on my Dell solves this issue and trains the model in a minute, but I am interested in using only the CPU. While running the whole setup, my machine's RAM usage is about 70%, which leaves about 4-5 GB of RAM for training. That is simply not enough, so will no configuration (without the GPU) let me train easily?

@Koowah

Koowah commented May 4, 2024

@capmichal @totovivi I had the same reactions and questions.

After a bit of research and chatting with GPT, here's what I gathered:

  • When we train a model, we have to load it into memory. Here we're using SciBERT, which weighs around 442 MB according to the Hugging Face page for the model. This size roughly corresponds to the memory taken by all of the model's parameters (110 million).
  • One forward pass of one sample loads into memory all of the network's associated activations and gradients. When we pass a batch, we have to multiply this load by the batch size. Memory load is therefore largely a function of model size and grows roughly linearly with batch size.

So I guess the main driver of memory use is how much data is processed per iteration (batch_size * num_workers). Using fewer workers should nevertheless be better, as each worker has overhead and must load its own copy of the model.
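
A rough back-of-the-envelope estimate of this, using illustrative numbers (not measurements from this repo; the per-sample activation figure in particular is an assumption), might look like:

```python
# Back-of-the-envelope estimate of per-worker training memory.
# All numbers below are assumptions for illustration, not measurements.
n_params = 110_000_000      # SciBERT parameter count (~110M)
bytes_per_param = 4         # float32

weights = n_params * bytes_per_param                # ~440 MB, matches the ~442 MB checkpoint
gradients = weights                                 # one gradient per parameter
optimizer_state = 2 * weights                       # e.g. Adam keeps two extra values per parameter
fixed_cost = weights + gradients + optimizer_state  # paid once per worker (each worker holds a model copy)

# Activation memory grows roughly linearly with batch size.
# The per-sample figure is a made-up placeholder, not a measured value.
activation_per_sample = 30 * 2**20  # assume ~30 MiB of activations per sample

for batch_size in (8, 32, 128):
    total = fixed_cost + batch_size * activation_per_sample
    print(f"batch_size={batch_size:>3}: ~{total / 2**30:.1f} GiB per worker")
```

With those assumptions, batch_size=32 lands around 2.7 GiB per worker, while batch_size=128 is already over 5 GiB, which is roughly consistent with an 8 GB machine hitting OOM.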
