Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(cache): increased the default lock timeout to 10s that mlcache #12956

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ms2008
Copy link
Contributor

@ms2008 ms2008 commented Apr 29, 2024

Summary

Currently, this lock timeout is hardcoded and does not provide an optional configuration parameter out of the box. We have received multiple complaints from users with the following error in their logs:

failed to get from node cache: could not acquire callback lock: timeout

So, here we double the default timeout time

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix FTI-4299

@github-actions github-actions bot added the cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee label Apr 29, 2024
@ms2008 ms2008 marked this pull request as ready for review April 29, 2024 08:57
@bungle
Copy link
Member

bungle commented Apr 29, 2024

How feasible it is to expect that if "the thing" doesn't happen in 5 secs, it will happen in 10 secs?

@dubuqingfeng
Copy link

Are there any updates?

@ms2008
Copy link
Contributor Author

ms2008 commented May 11, 2024

How feasible it is to expect that if "the thing" doesn't happen in 5 secs, it will happen in 10 secs?

I don't have any quantifiable data to provide strong support for the fix here, much like setting a TCP timeout, which depends on the specific environment of each individual. However, in practice, increasing the value here to 10s doesn't seem to pose significant risks. Additionally, we have several users who, after independently increasing this limit, no longer encounter issues.

So, if we prefer not to expose this configuration to users, increasing this limit could indeed solve some users' problems.

@bungle
Copy link
Member

bungle commented May 13, 2024

I don't have any quantifiable data to provide strong support for the fix here, much like setting a TCP timeout, which depends on the specific environment of each individual. However, in practice, increasing the value here to 10s doesn't seem to pose significant risks.

Yes, but where is the limit. Next time we increase it to 30 secs? Is it risk that this starts showing up in max latencies or p99 latencies etc.? More things piling up? More memory used because of piling up? I know that current limit is magic number, just like this proposal is. I would see we somehow fix the root cause or something. Why does retrieving certificate take more than 5 secs. Why it even takes time, why isn't it cached already etc.

Additionally, we have several users who, after independently increasing this limit, no longer encounter issues.
So, if we prefer not to expose this configuration to users, increasing this limit could indeed solve some users' problems.

Yes, but at the same time hide the problems that we probably want to fix. Same question can be asked about any of our timeouts, why not just increase every timeout?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee size/XS skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants