Lots of "error processing requests from scheduler" in querier logs #8067

Closed

SharifiFaranak opened this issue May 6, 2024 · 3 comments

@SharifiFaranak

Describe the bug

In our querier logs, we are seeing a lot of "error contacting scheduler" and "error processing requests from scheduler" errors.
At the same time, we also see logs such as "Starting querier worker connected to query-scheduler", which suggest that the queriers are able to discover the query-schedulers.

To Reproduce

Steps to reproduce the behavior:

Deploy the Mimir Helm chart, version 5.1.4.
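
For reference, a minimal install of that chart version might look like the following; the release name and namespace are placeholders, and the standard Grafana Helm repository plus the mimir-distributed chart are assumed:

  # Add the Grafana Helm repository and install the mimir-distributed chart at version 5.1.4.
  helm repo add grafana https://grafana.github.io/helm-charts
  helm repo update
  helm install mimir grafana/mimir-distributed --version 5.1.4 --namespace mimir --create-namespace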

Expected behavior

These errors should not appear in the querier logs.

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Additional Context

Logs look like the following:

ts=2024-05-06T17:38:54.996195311Z caller=scheduler_processor.go:125 level=error msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=<ip>:9095

ts=2024-05-06T17:39:23.622425677Z caller=scheduler_processor.go:117 level=warn msg="error contacting scheduler" err="rpc error: code = Canceled desc = context canceled" addr=<ip>:9095

ts=2024-05-06T17:40:11.993595405Z caller=scheduler_processor.go:184 level=error user=user123 msg="error notifying scheduler about finished query" err=EOF addr=<ip>:9095
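
A rough way to gauge how frequent these errors are across the querier pods (the namespace and label selector are assumptions based on typical mimir-distributed defaults; adjust them to your deployment):

  # Count occurrences of the error across the querier pods' logs.
  kubectl -n mimir logs -l app.kubernetes.io/component=querier --tail=-1 \
    | grep -c "error processing requests from scheduler"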


@dimitarvdimitrov (Contributor)

I think this was fixed by #7168 and #6728. They are available in Mimir 2.12.0, but not in Mimir 2.10.5, which the 5.1.4 chart is running. Can you upgrade and check if this is still happening?
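
For reference, the upgrade itself would look roughly like this; the release name and namespace are placeholders, and the exact chart version that ships Mimir 2.12.0 should be taken from the mimir-distributed changelog:

  # Upgrade the existing release to a chart version bundling Mimir 2.12.0, keeping the current values.
  helm repo update
  helm upgrade mimir grafana/mimir-distributed --version <chart-version-with-mimir-2.12.0> \
    --namespace mimir --reuse-values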

@jmichalek132 (Contributor)

> I think this was fixed by #7168 and #6728. They are available in Mimir 2.12.0, but not in Mimir 2.10.5, which the 5.1.4 chart is running. Can you upgrade and check if this is still happening?

Hi, I wonder if a bug could have been introduced in one of the two linked PRs. We recently upgraded to 2.12 and started to see this issue, also mentioned on Slack.

ts=2024-04-28T18:53:10.959228333Z caller=logging.go:126 level=warn traceID=7d41f545d9deae29 msg="GET /prometheus/api/v1/label/scrape_job/values?start=1714330091&end=1714330391 (500) 1.149977ms Response: \"failed to enqueue request\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate; Accept-Language: en-US,en;q=0.9; Connection: close; User-Agent: Grafana/10.4.2; X-Forwarded-For: 10.226.5.1; X-Grafana-Org-Id: 1; X-Grafana-Referer: http://example.com:3000/d/OZ6xeJqVk/controller-metrics?orgId=1; X-Scope-Orgid: anonymous; "
ts=2024-04-28T18:53:11.069380654Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=anonymous level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-04-28T18:53:11.069471535Z caller=frontend_scheduler_worker.go:291 level=error msg="error sending requests to scheduler" err=EOF addr=10.226.4.53:9095
ts=2024-04-28T18:53:11.069546419Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=anonymous level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-04-28T18:53:11.069652937Z caller=frontend_scheduler_worker.go:291 level=error msg="error sending requests to scheduler" err=EOF addr=10.226.4.53:9095
ts=2024-04-28T18:53:11.069658194Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=anonymous level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-04-28T18:53:11.069707439Z caller=frontend_scheduler_worker.go:291 level=error msg="error sending requests to scheduler" err=EOF addr=10.226.6.52:9095
ts=2024-04-28T18:53:11.069750502Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=anonymous level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-04-28T18:53:11.069805732Z caller=frontend_scheduler_worker.go:291 level=error msg="error sending requests to scheduler" err=EOF addr=10.226.4.53:9095

More context from our side:

  • We don't use the ruler in remote mode, so the query pipeline has gaps with no utilization.
  • This is causing a small increase in latency (retries happen and eventually succeed) and a very small rate of 500 errors when all retries fail; see the sketch after this list for one way to quantify it.
  • I will be out until June, but a colleague will be able to provide more details (logs, traces, etc.) on request.
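
A rough way to quantify this from the query-frontend logs (the namespace and label selector are assumptions based on typical mimir-distributed defaults): the first count covers every failed send to a scheduler, which is retried internally, while the second counts only requests where all retries were exhausted and a 500 was returned.

  # Frontend->scheduler send failures (retried internally).
  kubectl -n mimir logs -l app.kubernetes.io/component=query-frontend --tail=-1 \
    | grep -c "error sending requests to scheduler"
  # Requests that gave up and returned "failed to enqueue request" (500) to the client.
  kubectl -n mimir logs -l app.kubernetes.io/component=query-frontend --tail=-1 \
    | grep -c "failed to enqueue request"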

@dimitarvdimitrov (Contributor)

> Hi, I wonder if a bug could have been introduced in one of the two linked PRs. We recently upgraded to 2.12 and started to see this issue, also mentioned on Slack.

This looks related to the query-frontend<>query-scheduler communication. The two linked PRs only make changes after a request has been enqueued, so it's unlikely (though not impossible) that they influence enqueuing. Can you open a new issue for this @jmichalek132?
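
One way to confirm which side is reporting the errors is to break the error logs down by the caller= field: scheduler_processor.go lines come from the querier workers, while frontend_scheduler_worker.go lines come from the query-frontend. A minimal sketch, assuming the relevant logs have been collected into a single file (mimir.log is a placeholder):

  # Count error-level log lines per caller file to see which component emits them.
  grep 'level=error' mimir.log | grep -o 'caller=[a-z_]*\.go' | sort | uniq -c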
