Incorrect RPS reported in statistics and graphs (distributed mode) #2708
Comments
Hi!

Oh, and one more thing: is it only the graphs or also the command line output that is wrong? (either run
Hey! I've been testing using HttpUser and I get the same result. It's not only the chart; the statistics are also wrong, which is also visible when running the master with

I've run a smaller test (same settings but limited to 50 users, only 17 workers (1 per node)) and
Even if we ignore the warmup, or when the test didn't even start, and consider

If I take into account the timestamp where I press "start", which should be

Or we can look at a subset of the test:
I'm not sure if the issue is that some requests do not finish; would those requests be accounted for in the statistics total? Thank you for looking into this.
If requests never finish (and don't time out either), then they are not logged in Locust at all. You may want to reduce the timeouts (using FastHttpUser.network_timeout and .connection_timeout; the default is 60 seconds) and see if that makes a difference. You can also add the --stop-timeout argument to allow Locust to exit gracefully when shutting down (this could make it even more obvious if something has gone wrong and it is stuck somehow).

Also, can you share some logs (from master and worker)?
Sure, I'll share some logs tomorrow.
```python
class LoadTest(FastHttpUser):
    connection_timeout = 5
    network_timeout = 10
```
Hi! I've been trying to reproduce it with 1 or 2 servers without success. I'm attaching the logs from the following test, I tried not to trigger cpu warnings on any worker:
This is with a logging level of INFO. I could run an identical test at DEBUG level and share it if it doesn't contain sensitive data (or just remove it first).
Hi! I don't see anything strange in your logs. Try setting the DEBUG log level. If the log becomes overwhelming, try reducing the number of workers. Could it possibly be that workers don't properly report the last requests (they should, but there could be a bug or something)? Another thing you might try is installing locust-plugins and using its iteration limit (-i) and seeing if the numbers match then (https://github.com/SvenskaSpel/locust-plugins/blob/507ff46808d2945f2eea6ed46aa2d69117159ede/examples/cmd_line_examples.sh#L10). It shouldn't be needed, of course, but it might narrow down where the problem originates.
I enabled DEBUG log level and "reproduced" it again, but I don't see anything wrong in the logs. I also tried to reproduce it locally, without success, with a dummy Go HTTP server and 18 workers as Docker replicas. That test had a peak of 32k RPS and all the numbers matched. I cannot do many more tests on the real environment, so I'm inclined to think that it's related to the network. Is there a way to "cool down" the test? For example, increase the users for the first X seconds and then decrease them until 0 before ending the run.
You can use a load shape: https://docs.locust.io/en/latest/custom-load-shape.html#custom-load-shapes. But --stop-timeout should have been enough (you ran with that when the logs were generated, right?). With that setting, either users should finish their tasks or there would be an error message saying they had to be interrupted.
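For reference, the ramp-up/ramp-down ("cooldown") idea can be sketched as a plain schedule function. In Locust this logic would live in a LoadTestShape.tick() method returning (user_count, spawn_rate) tuples, or None to stop; the durations and user counts below are illustrative, not taken from the test above.

```python
# Sketch of a ramp-up / hold / ramp-down schedule (illustrative numbers).
# In Locust this would go into a LoadTestShape.tick() implementation;
# here it is a plain function so the schedule itself is easy to inspect.

RAMP_UP = 60      # seconds spent ramping up
HOLD = 180        # seconds at peak load
RAMP_DOWN = 60    # seconds spent ramping back down to 0
PEAK_USERS = 50   # hypothetical peak user count

def target_users(run_time):
    """Desired user count at `run_time` seconds, or None to end the test."""
    if run_time < RAMP_UP:
        # linear ramp up from 0 to PEAK_USERS
        return int(PEAK_USERS * run_time / RAMP_UP)
    if run_time < RAMP_UP + HOLD:
        # steady state at peak
        return PEAK_USERS
    if run_time < RAMP_UP + HOLD + RAMP_DOWN:
        # linear ramp back down to 0
        remaining = RAMP_UP + HOLD + RAMP_DOWN - run_time
        return int(PEAK_USERS * remaining / RAMP_DOWN)
    return None  # cooldown finished: stop the run

print(target_users(30), target_users(120), target_users(270), target_users(310))
# → 25 50 25 None
```

A LoadTestShape subclass would simply call something like this from tick(), using self.get_run_time() for the elapsed time.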
Ok, I think I found out what was happening, and it's not Locust's fault. I was checking how the stats are computed: lines 313 to 314 in 58515c4.
And messing around with this locustfile on the Docker environment to see how the stats reporting worked:

```python
import json
import logging
import time
from collections import Counter
from pathlib import Path

from locust import FastHttpUser, events, task

logger = logging.getLogger(__name__)
reports = Counter()


@events.report_to_master.add_listener
def on_report_to_master(client_id, data, **_):
    reports[client_id] += 1
    data["report_timestamp"] = time.time()
    data["report_num"] = reports[client_id]


@events.worker_report.add_listener
def on_worker_report(client_id, data, **_):
    out = Path(f"/mnt/locust/logs/{data['report_num']}_{client_id}.json")
    out.write_text(json.dumps(data))


@events.quit.add_listener
def on_quit(**_):
    for k, v in reports.items():
        out = Path(f"/mnt/locust/logs/{k}.reports")
        out.write_text(str(v))


class MyUser(FastHttpUser):
    connection_timeout = 5
    network_timeout = 10

    @task
    def hit_webserver(self):
        self.client.get("/")
```

Since each worker sends the RPS stored in a dict with the timestamp as the key, and the timestamp comes from each worker's server date, I checked the date on all 18 servers. I found that 4 of them had an offset of more than 1-2 seconds, with one server being 160 seconds behind! If this is the case, in theory we should be able to reproduce it in a Docker environment using https://github.com/wolfcw/libfaketime, setting an offset for each worker and overriding the date before reporting. I'm not sure if something like this would work.

I've now synced all the servers and run a couple of tests that previously reported incorrect RPS, and now the numbers seem to match. Does it make sense to you? If it does, feel free to close this. Thank you.
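To illustrate why a skewed worker clock dilutes the reported RPS, here is a toy model (not Locust's actual code; all names here are illustrative) of merging per-second request counts that each worker keys by its own wall clock:

```python
# Toy model (not Locust's actual implementation) of merging per-second
# request counts on the master, where each worker keys its counts by its
# OWN wall-clock second. A 160 s clock offset on one worker spreads the
# combined counts over a much longer apparent timeline.

def worker_counts(start, seconds, rps, clock_offset=0):
    """Per-second request counts as reported by one worker."""
    return {int(start + s + clock_offset): rps for s in range(seconds)}

start = 1_000_000
a = worker_counts(start, 10, 100)                    # well-synced worker
b = worker_counts(start, 10, 100, clock_offset=160)  # clock off by 160 s

merged = {}
for counts in (a, b):
    for ts, n in counts.items():
        merged[ts] = merged.get(ts, 0) + n

# Both workers really ran for the same 10 seconds at a combined 200 RPS,
# but the merged timeline never shows more than 100 RPS in any second
# and appears to span 170 seconds.
print(max(merged.values()))           # → 100
print(max(merged) - min(merged) + 1)  # → 170
```

With 18 workers and offsets of up to 160 seconds, this kind of smearing could plausibly turn a real 100k-200k RPS into a reported peak of 14k.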
There used to be a feature that detected time difference between master and worker, but it had to be disabled (4+ years ago) due to false positives. I’ll have a look at reintroducing it! |
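A minimal sketch of what such a detection could look like (an assumed design, not Locust's past or current implementation): the worker stamps each report with its own time, and the master flags the connection when the difference exceeds what network and processing delay alone could explain.

```python
# Hypothetical clock-skew check; this is NOT Locust's implementation,
# just a sketch of the detection idea discussed above.

MAX_EXPECTED_DELAY = 5.0  # seconds of lag attributable to network/processing


def check_skew(worker_sent_at, master_received_at, tolerance=MAX_EXPECTED_DELAY):
    """Return the suspected clock skew in seconds, or 0.0 if within tolerance."""
    skew = master_received_at - worker_sent_at
    return skew if abs(skew) > tolerance else 0.0


print(check_skew(1000.0, 1002.0))  # → 0.0   (2 s lag: within tolerance)
print(check_skew(1000.0, 1160.0))  # → 160.0 (worker clock ~160 s behind)
```

The tolerance is the tricky part: too tight and slow networks produce the false positives that got the original feature disabled, too loose and a genuinely skewed worker goes unnoticed.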
Prerequisites
Description
The test is executed across 17 machines with a total of 170 workers.
I've done this in the past and the RPS chart was accurate (I think Locust was on 1.x.x back then). All the servers, including the master, have version 2.27.0.

For context, the chart shows a maximum of 14,000 RPS; however, the real-time reporting tools in the webserver that is receiving all the traffic (and it is only receiving Locust traffic) report 100,000 to 200,000 RPS during the test (a Varnish instance, checked with varnishstat).

I also confirm this with the logs: I truncate them before each test, and Locust is sending all the requests but not reporting the RPS correctly.
EDIT: RPS is shown wrong both in the chart and in the statistics (web or terminal)
Command line

```shell
locust -f $FILE --master --master-bind-port 5557 --host "https://" -u 80000 -r 30 --tags hit300
```
Locustfile contents
Python version
3.12.1
Locust version
2.27.0
Operating system
Debian 12