
bug: Tokens per second calculation is wrong. #2923

Open
avianion opened this issue May 18, 2024 · 3 comments
Labels
type: bug Something isn't working

Comments

@avianion

Describe the bug
Tokens per second is currently calculated including the latency from the beginning of the API request and/or from hitting the start button.

However, tokens per second should be calculated like this:

(Total tokens) / (Time to last token - Time to first token)
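For illustration, a minimal sketch of that calculation in TypeScript, assuming hypothetical millisecond timestamps captured when the first and last tokens arrive (the names below are illustrative, not Jan's actual code):

```ts
// Proposed tokens/s: count only the time spent generating, not the request latency.
// Timestamps are in milliseconds, e.g. from performance.now(); names are illustrative.
function tokensPerSecond(totalTokens: number, firstTokenAt: number, lastTokenAt: number): number {
  const generationMs = lastTokenAt - firstTokenAt; // excludes time to first token
  return totalTokens / (generationMs / 1000);
}

// e.g. 17 tokens, first token at 1000 ms, last token at 1339 ms -> ~50 tokens/s
console.log(tokensPerSecond(17, 1000, 1339));
```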

Steps to reproduce
Steps to reproduce the behavior:

Use jan.ai and observe that the reported tokens-per-second figure is wrong.

Expected behavior
(Total tokens) / (Time to last token - Time to first token)

Screenshots
N/a

Environment details

  • Operating System: Windows 11

Logs
If the cause of the error is not clear, kindly provide your usage logs: https://jan.ai/docs/troubleshooting#how-to-get-error-logs

Additional context
Add any other context or information that could be helpful in diagnosing the problem.

avianion added the type: bug label on May 18, 2024
@Propheticus

How did you determine that it's including the latency? Looking at the code, the behaviour, and statements like these, it looks to me like it's already counting only the time of actual generation; it's not including the time to first token.

In the logs, separate timings are shown for prompt evaluation, token generation (eval time), and total time; only the second is used for the displayed tokens/s:

20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings: prompt eval time = 119.744 ms / 33 tokens (3.62860606061 ms per token, 275.587920898 tokens per second) - context/llama_server_context.h:448
20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings:        eval time = 339.311 ms / 17 runs   (19.9594705882 ms per token, 50.1015292755 tokens per second) - context/llama_server_context.h:455
20240514 08:31:51.048000 UTC 17652 DEBUG [print_timings] print_timings:       total time = 459.055 ms - context/llama_server_context.h:462

Last time I checked the UI, what I saw was that 50 t/s figure.
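As a quick sanity check, recomputing from the numbers quoted in that log:

```ts
// Values taken directly from the log lines above.
const evalTimeMs = 339.311;  // eval time: token generation only
const totalTimeMs = 459.055; // total time: prompt eval + generation
const runs = 17;             // generated tokens ("runs")

console.log(runs / (evalTimeMs / 1000));  // ~50.1 t/s -- the figure shown in the UI
console.log(runs / (totalTimeMs / 1000)); // ~37.0 t/s -- what you'd see if total time were used
```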

@avianion
Author

avianion commented May 18, 2024 via email

@Propheticus

It is not. The eval time is the eval time reported by llama.cpp for generating the tokens.
The time to first response is sometimes several seconds, and yet the tok/s value remains ~47-50. Adding a full second of time-to-first-response to the eval time would result in a drastically lower figure of around 13 t/s, which is not a figure I see in the GUI.
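A quick sketch of that arithmetic, reusing the 17 tokens / 339 ms eval time from the log quoted earlier (the one second of latency is a hypothetical value):

```ts
const runs = 17;
const evalTimeS = 0.339311;       // eval time from the log, in seconds
const timeToFirstResponseS = 1.0; // hypothetical full second of latency before the first token

console.log(runs / evalTimeS);                          // ~50.1 t/s (eval time only)
console.log(runs / (evalTimeS + timeToFirstResponseS)); // ~12.7 t/s -- roughly the ~13 t/s mentioned
```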
