
bug: Tokens per second calculation is wrong. #2923

Open
avianion opened this issue May 18, 2024 · 3 comments
Labels
type: bug Something isn't working

Comments

@avianion

Describe the bug
Tokens per second is currently calculated including the latency from the beginning of the API request and/or from hitting the start button.

However, tokens per second should be calculated like this:

(Total tokens) / (Time to last token - Time to first token)
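For illustration, a minimal sketch of that calculation in TypeScript, assuming hypothetical millisecond timestamps captured when the first and last tokens arrive (the names below are illustrative, not Jan's actual code):

```ts
// Proposed tokens/s: count only the time spent generating, not the request latency.
// Timestamps are in milliseconds, e.g. from performance.now(); names are illustrative.
function tokensPerSecond(totalTokens: number, firstTokenAt: number, lastTokenAt: number): number {
  const generationMs = lastTokenAt - firstTokenAt; // excludes time to first token
  return totalTokens / (generationMs / 1000);
}

// e.g. 17 tokens, first token at 1000 ms, last token at 1339 ms -> ~50 tokens/s
console.log(tokensPerSecond(17, 1000, 1339));
```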

Steps to reproduce
Steps to reproduce the behavior:

Use jan.ai and observe that the reported tokens-per-second figure is wrong.

Expected behavior
(Total tokens) / (Time to last token - Time to first token)

Screenshots
N/a

Environment details

  • Operating System: Windows 11

Logs
If the cause of the error is not clear, kindly provide your usage logs: https://jan.ai/docs/troubleshooting#how-to-get-error-logs

Additional context
Add any other context or information that could be helpful in diagnosing the problem.

avianion added the type: bug label on May 18, 2024
@Propheticus

How did you determine that it's including the latency? Looking at the code, the behaviour, and statements like these, it looks to me like it's already counting only the time of actual generation; it's not including the time to first token.

In the logs, separate timings are shown for prompt evaluation, token generation (eval time), and total time; only the second is used for the displayed tokens/s:

20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings: prompt eval time = 119.744 ms / 33 tokens (3.62860606061 ms per token, 275.587920898 tokens per second) - context/llama_server_context.h:448
20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings:        eval time = 339.311 ms / 17 runs   (19.9594705882 ms per token, 50.1015292755 tokens per second) - context/llama_server_context.h:455
20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings:       total time = 459.055 ms - context/llama_server_context.h:462

Last time I checked the UI, what I saw was that 50 t/s figure.
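As a quick sanity check, recomputing from the numbers quoted in that log:

```ts
// Values taken directly from the log lines above.
const evalTimeMs = 339.311;  // eval time: token generation only
const totalTimeMs = 459.055; // total time: prompt eval + generation
const runs = 17;             // generated tokens ("runs")

console.log(runs / (evalTimeMs / 1000));  // ~50.1 t/s -- the figure shown in the UI
console.log(runs / (totalTimeMs / 1000)); // ~37.0 t/s -- what you'd see if total time were used
```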

@avianion
Author

avianion commented May 18, 2024 via email

@Propheticus

It is not. The eval time is the eval time reported by llama.cpp for generating the tokens.
The time to first response is sometimes several seconds, and yet the tok/s value remains ~47-50. Adding a full second of time-to-first-response to the eval time would result in a drastically lower figure of around 13 t/s, which is not a figure I see in the GUI.
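A quick sketch of that arithmetic, reusing the 17 tokens / 339 ms eval time from the log quoted earlier (the one second of latency is a hypothetical value):

```ts
const runs = 17;
const evalTimeS = 0.339311;       // eval time from the log, in seconds
const timeToFirstResponseS = 1.0; // hypothetical full second of latency before the first token

console.log(runs / evalTimeS);                          // ~50.1 t/s (eval time only)
console.log(runs / (evalTimeS + timeToFirstResponseS)); // ~12.7 t/s -- roughly the ~13 t/s mentioned
```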
