Low GPU Usage #381

Open
upadsamay387 opened this issue Jun 7, 2023 · 5 comments

Comments

@upadsamay387

Hi! I've been using a slightly-modified version of the example OpenAI code, and I only get about 15% utilization on my GPU regardless of what step size I set. I've tried changing a few other values randomly, and nothing seems to increase my utilization. I do notice that about 75% of my VRAM is used, however.

I also wanted to know if there were plans to make a Discord server or something similar. It would be really helpful to have a community that helps each other similar to RLBot.

@Ball-Man

Ball-Man commented Jun 9, 2023

Hi, could you link the example here explicitly? That would make it easier to retrieve the code; also, there are at least two versions of it (one for the old API and one for the new).

Secondly, what GPU are you running this on? Of course this plays an important part as GPUs are very diverse.

Anyway, I can make some assumptions about what is going on.
First things first: GPU computational usage is not influenced by your step size. The step size is just a scalar that multiplies the gradients, so changing it does not change how much computation each training step requires.
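
For reference, in the usual gradient-descent update the step size $\alpha$ only scales a gradient that has already been computed:

$$\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta L(\theta_t)$$

Computing $\nabla_\theta L(\theta_t)$ is where the GPU does its work; multiplying by $\alpha$ is a negligible scalar operation, so changing the step size cannot move GPU utilization.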

Low usage could be related to a CPU or IO bottleneck, that is, generating/reading your data takes much more time than actually fitting your model on it. This is typical of reinforcement learning tasks, as data must be generated by playing games, and games are played on CPU cores.
The usual solution to this problem is to run data generation on multiple threads/cores (that is, play multiple games in parallel), gather all the data, and fit the model on it (see the sketch below). If I remember correctly, the example you are using relies on the keras-rl package (import rl ...). This library provides simple implementations of popular RL algorithms, but as far as I know they are all single threaded.
Moreover, the very nature of poke-env itself might contribute to this problem as well. poke-env generates game simulations by interacting with a (possibly local) instance of showdown. This means that each action taken must be transmitted to the (local) showdown server, and the client must wait for a response. Even though a local instance keeps delays minimal, this is still an IO operation and hence notoriously slow in terms of high performance computing.
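
To make the "play multiple games in parallel, gather the data, then fit" idea concrete, here is a minimal, library-agnostic sketch. `play_one_episode` is a hypothetical placeholder that fabricates dummy transitions just so the snippet runs; it does not use poke-env or keras-rl:

```python
import random
from multiprocessing import Pool

def play_one_episode(seed):
    # Hypothetical stand-in for a real rollout: here it just fabricates a short
    # episode of (state, action, reward, next_state) tuples so the sketch runs.
    rng = random.Random(seed)
    return [(rng.random(), rng.randrange(4), rng.random(), rng.random()) for _ in range(10)]

def collect_parallel(n_episodes, n_workers=4):
    # Play several episodes at once in separate processes, then flatten the results.
    with Pool(processes=n_workers) as pool:
        episodes = pool.map(play_one_episode, range(n_episodes))
    return [t for episode in episodes for t in episode]

if __name__ == "__main__":
    transitions = collect_parallel(n_episodes=16)
    print(len(transitions))  # fit your model on these transitions afterwards
```

In a real setup, the episode function would play one showdown battle and return its transitions instead of random numbers.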

To prove the point, you may try monitoring your CPU usage during execution. A single threaded application will only show a peak on one core at a time. On a unix machine, you can easily take a look with htop. On Windows, open Task Manager, go to the Performance tab, and right-click the CPU graph to switch from overall usage to logical cores. If you see a strong peak on one core at a time, the bottleneck is CPU computation (single threaded). If you don't see any peaks at all, the bottleneck is probably the showdown communication (so, potentially single threaded IO).
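
If you prefer checking from Python, here is a small snippet you can run in a separate terminal while training (it assumes the third-party psutil package is installed):

```python
# Print per-core CPU usage once per second: a single busy core suggests a
# single-threaded CPU bottleneck, while all cores idle suggests an IO bottleneck.
import psutil

while True:
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)  # blocks ~1s, then reports
    print(" ".join(f"{p:5.1f}%" for p in per_core))
```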
Unfortunately I don't currently have the time to replicate your scenario and monitor it personally.

Finally, what can you do about it?
In both of the discussed cases (CPU or IO bottleneck) the only real solution is parallelizing across more cores. Unfortunately I am not aware of such a feature in keras-rl, so improving the example directly might not be possible. One option is to simply accept it and assess how good the computation time is even with the GPU not fully used; you may find that it is good enough for your first tests.
At some point, though, you may want to settle this and build an actually parallel implementation. My advice in that case is to take a look at other RL libraries out there, like TF Agents, a well supported framework for these tasks with full support for distributing work across multiple cores, as well as multiple machines.
Of course, also take a look at keras-rl to see whether it has some multiprocessing support (I don't think so, but don't take my word for granted). However, note that it has not been receiving updates for quite a while (see the keras-rl GitHub page), so it may be a good idea in general to find an alternative.
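
As a rough illustration of the kind of parallelism TF Agents provides, here is a minimal sketch using a standard Gym task rather than poke-env (wiring poke-env into TF Agents is a separate problem, see the comments below). The usual pattern is to run several copies of an environment in separate processes behind a single batched interface:

```python
# Sketch of TF Agents' built-in environment parallelism: several environment
# copies run in separate processes and are exposed as one batched environment.
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.environments.parallel_py_environment import ParallelPyEnvironment

num_parallel_envs = 4
parallel_env = ParallelPyEnvironment(
    [lambda: suite_gym.load("CartPole-v0")] * num_parallel_envs
)
train_env = tf_py_environment.TFPyEnvironment(parallel_env)

time_step = train_env.reset()
print(time_step.observation.shape)  # (4, 4): one observation per parallel environment
```

The agent then steps all environments at once, so data generation scales with the number of worker processes instead of being tied to a single core.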

@Ball-Man

Ball-Man commented Jun 9, 2023

Another thing I didn't mention: you can squeeze extra usage out of your GPU by adjusting the batch size (i.e. how much data is sent to the GPU at each iteration). In the example, I believe no batch size is specified for the DQNAgent, so it gets the default (which should be 32). You can try doubling it by passing the parameter explicitly:

```python
# Example code...
agent = DQNAgent(
    ...,           # other arguments unchanged
    batch_size=64
)
# ... more example code
```

64 is the natural next step, but you can increase it further in subsequent experiments. Keep monitoring your CPU and GPU to see how the bottlenecks evolve. Since your VRAM occupation is already high, it may not be possible to increase the batch size much; in that case the whole script will crash with an out-of-memory error, and 32 was already your best value.

@mancho2000

Hi Ball-Man,

Could you provide a simple example parallel implementation using TF Agents? Would be helpful for newbies like me :)

I'd also like to ask whether you have managed to build a good random battles bot. I can't make any good ones and wonder if I am doing something wrong. My impression is that the agent is missing information about the possible moves the opponent could have from the random battles sets, so it plays blindly until each move is revealed.

@Kymawave

tf-agents is by far a more beginner-friendly framework than the now-ancient keras-rl, and it can execute all RL algorithm code on the GPU independently of Python. However, the tf-agents multiprocessing module doesn't work with poke-env. It should be possible to set up a multiprocessing pipeline regardless, and it so happens that this is what I'll be doing next for my project. You can contact me on Discord (kyma#2862) to find out how, and whether, I actually achieve this.

@Ball-Man

Thank you Kymawave for your help.
Unfortunately I currently don't have the time to experiment and provide a minimal working example. Moreover, I was unaware of the complications with the multiprocessing module.

@mancho2000 regarding your concerns about the random battles sets, I think you are right. With the minimal provided setup the agent learns solely from experience. This is similar to how a human newcomer with no previous knowledge would experience the format. Given a complex enough model and enough training time, the agent could start learning and recognizing the different sets on its own. Clearly, if we provided the agent with "a list" of the existing sets, we could expect faster and better training, even though encoding such knowledge is not necessarily trivial.
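
Purely as an illustration of one way such knowledge could be encoded (the species, moves and data below are made up, and nothing here comes from poke-env's API), the candidate moves of the opponent's species could be turned into a multi-hot feature vector and appended to the observation:

```python
# Hypothetical example: turn "which moves this species can carry in random battles"
# into a multi-hot feature vector. The move vocabulary and mapping are placeholders.
KNOWN_MOVES = ["tackle", "thunderbolt", "surf", "earthquake"]  # global move vocabulary
CANDIDATE_SETS = {  # made-up mapping: species -> moves seen in its random battle sets
    "pikachu": {"thunderbolt", "surf"},
}

def candidate_move_features(species: str) -> list[float]:
    # 1.0 where the move appears in the species' known sets, 0.0 otherwise.
    candidates = CANDIDATE_SETS.get(species, set())
    return [1.0 if move in candidates else 0.0 for move in KNOWN_MOVES]

print(candidate_move_features("pikachu"))  # [0.0, 1.0, 1.0, 0.0]
```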
