Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make fd_limit an env var to let operater increase beyond 65k default #10963

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rtsainear
Copy link

This PR introduces the ability to customize the file descriptor (FD) limit for the neard process through an environment variable. Currently, neard has a hardcoded file descriptor hard limit of 65,000. While this default setting accommodates typical operation, edge cases such as the resharding event have demonstrated that archival nodes can exhaust this limit due to the opening of multiple RocksDB instances, leading to errors from too many open files.

Motivation
During intensive operations like resharding, archival nodes encounter the FD hard limit, resulting in failure due to "too many open files" errors. Providing operators with the flexibility to adjust the FD limit as needed based on their configuration and operational demands will enhance the stability and adaptability of neard.

Changes
Environment Variable FD_LIMIT: Operators can now specify an FD_LIMIT environment variable to set the file descriptor limit, offering a way to increase it beyond the default 65,000 when necessary.
Default Behavior: In the absence of this environment variable, the file descriptor limit will remain at the default setting of 65,000. This ensures backward compatibility and maintains current performance expectations under standard operational conditions.

@rtsainear rtsainear requested a review from a team as a code owner April 5, 2024 17:09
@rtsainear rtsainear requested a review from wacban April 5, 2024 17:09
Copy link

codecov bot commented Apr 5, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 71.52%. Comparing base (563ffe7) to head (1917996).
Report is 5 commits behind head on master.

Files Patch % Lines
neard/src/main.rs 85.71% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10963      +/-   ##
==========================================
+ Coverage   71.49%   71.52%   +0.02%     
==========================================
  Files         758      758              
  Lines      152276   152487     +211     
  Branches   152276   152487     +211     
==========================================
+ Hits       108868   109059     +191     
- Misses      38899    38918      +19     
- Partials     4509     4510       +1     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.42% <85.71%> (+<0.01%) ⬆️
integration-tests 37.07% <0.00%> (-0.06%) ⬇️
linux 69.96% <85.71%> (+0.01%) ⬆️
linux-nightly 70.96% <0.00%> (-0.03%) ⬇️
macos 54.51% <0.00%> (+0.03%) ⬆️
pytests 1.65% <85.71%> (+<0.01%) ⬆️
sanity-checks 1.44% <85.71%> (+<0.01%) ⬆️
unittests 67.14% <0.00%> (+0.02%) ⬆️
upgradability 0.30% <85.71%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is fine for me even though I would consider it a "use at own risk" kind of a config.

On implementation side it should be done as part of config, not an environment variable. Please change the code to read the value from config instead. You may need to move the code a bit further away from main but this is fine.

cc @posvyatokum FYI in case you have any comments, if not I'll approve once the comments are addressed.

Copy link
Member

@posvyatokum posvyatokum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Wac, that this limit should come from

pub struct StoreConfig {
.
Otherwise, no objections

@nagisa
Copy link
Collaborator

nagisa commented Apr 8, 2024

My opinion is that the node shouldn't need to bother with this at all, since this limit is ambient and can/should be set by the actual environment (e.g. systemd)

My understanding of this rlimit setter was effectively to circumvent issues arising from such ambient state misconfiguration by setting a known-good value. If we aren't able to come up with such a value correctly, then we have no business trying to modify this state at all.

At the very least we should read out the current limit and not reduce it if it's actually higher than our "reasonable limit".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants