[Performance]: Qwen 7b chat model, under 128 concurrency, the CPU utilization rate is 100%, and the GPU SM utilization rate is only about 60%-75%. Is it a CPU bottleneck? #4806
Labels: performance (Performance-related issues)

Proposal to improve performance
No response

Report of performance regression
No response

Misc discussion on performance
I am using vLLM to serve the Qwen-7B-Chat model. Under very high concurrency (e.g. 128 concurrent requests), I found that CPU utilization reaches 100%, while GPU utilization stays below 60%.

My question: since much of vLLM's scheduling and bookkeeping logic is implemented with Python coroutines, it can effectively use only a single CPU core. In a scenario like this with 128 concurrent requests, is the CPU becoming the bottleneck, preventing the CUDA kernels from reaching higher GPU utilization?
Model download address: https://huggingface.co/Qwen/Qwen-7B-Chat/tree/main
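One way to test this hypothesis on the serving host is to watch per-core (rather than aggregate) CPU utilization while the server is under load: if a single core is pinned near 100% while the others sit mostly idle, a single-threaded Python scheduler is a plausible bottleneck. Below is a minimal, Linux-only sketch using only the standard library; the function names are my own, and it simply samples `/proc/stat` twice.

```python
# Minimal sketch (Linux-only, stdlib): sample /proc/stat twice and report
# per-core busy fractions. Run this while the vLLM server is under load.
import time


def read_core_times():
    """Return {core_name: (busy_ticks, total_ticks)} parsed from /proc/stat."""
    stats = {}
    with open("/proc/stat") as f:
        for line in f:
            # Per-core lines look like "cpu0 ...", "cpu1 ..."; skip the
            # aggregate "cpu " line and non-cpu lines.
            if line.startswith("cpu") and line[3].isdigit():
                name, *fields = line.split()
                ticks = [int(x) for x in fields]
                idle = ticks[3] + ticks[4]  # idle + iowait
                total = sum(ticks)
                stats[name] = (total - idle, total)
    return stats


def per_core_usage(interval=1.0):
    """Per-core busy fraction (0.0-1.0) measured over `interval` seconds."""
    t0 = read_core_times()
    time.sleep(interval)
    t1 = read_core_times()
    usage = {}
    for core in t0:
        busy = t1[core][0] - t0[core][0]
        total = t1[core][1] - t0[core][1]
        usage[core] = busy / total if total else 0.0
    return usage


if __name__ == "__main__":
    for core, u in sorted(per_core_usage().items()):
        print(f"{core}: {u:.0%}")
    # One core near 100% with the rest mostly idle suggests the
    # Python-side scheduler, not the GPU, is the limiting factor.
```

Pairing this with `nvidia-smi dmon` (to watch SM utilization over the same window) makes the comparison between CPU-side and GPU-side saturation more direct.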