[IMPROVEMENT] General stability and bug fixes. #3999

Merged
derekcollison merged 20 commits into main from interest-stresser on Mar 30, 2023

derekcollison

This PR has general improvements and fixes to filestore, raft, and the clustering layer.

Summary

  1. Additional support for preAck handling for interest-based streams when replicated acks arrive before the message itself.
  2. Better handling when checking state to determine whether to remove an interest-based message.
  3. Improved StepDown() and leadership transfer handling after restarts.
  4. Improved voting logic for high-load systems.
  5. Various improvements and fixes for filestore Compact(), which is used heavily in the raft layer when updating snapshots and the raft wal.
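
To make item 1 concrete, here is a minimal sketch of how acks that arrive before their message could be tracked. It is not the server's implementation: the `preAcks` type and its `register`, `complete`, and `clear` methods are hypothetical names chosen for illustration, and the real code may organize this state differently.

```go
package main

import "fmt"

// preAcks records acks that arrived before the message they refer to.
// It is keyed by stream sequence, then by consumer name.
// All names here are hypothetical and only illustrate the idea.
type preAcks map[uint64]map[string]struct{}

// register notes that consumer acked seq before the message was stored.
func (p preAcks) register(seq uint64, consumer string) {
	if p[seq] == nil {
		p[seq] = make(map[string]struct{})
	}
	p[seq][consumer] = struct{}{}
}

// complete reports whether every interested consumer has pre-acked seq.
func (p preAcks) complete(seq uint64, interested []string) bool {
	acked := p[seq]
	for _, c := range interested {
		if _, ok := acked[c]; !ok {
			return false
		}
	}
	return len(interested) > 0
}

// clear drops the pre-ack state once the message has been handled.
func (p preAcks) clear(seq uint64) { delete(p, seq) }

func main() {
	p := preAcks{}
	consumers := []string{"c1", "c2"}

	p.register(10, "c1")
	fmt.Println(p.complete(10, consumers)) // only c1 has acked so far

	p.register(10, "c2")
	fmt.Println(p.complete(10, consumers)) // both consumers have acked

	p.clear(10)
}
```

With state like this, a replica that later receives message 10 can tell that no consumer still needs it and avoid keeping it around.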

Signed-off-by: Derek Collison <derek@nats.io>

@derekcollison derekcollison requested a review from a team as a code owner March 29, 2023 18:19
@derekcollison derekcollison force-pushed the interest-stresser branch 3 times, most recently from 003b64a to 3581673 Compare March 29, 2023 19:41
1. Fixed a bug that would process a removal of a message after the message block was closed.
2. Improved removal of non-existent message when we know the store is empty.
3. Improved last write index size tracking when opening the file descriptor after being closed.
4. Improved Compact() by not loading messages for last block twice.
5. Improved Compact() determination of calling purge by determining last sequence under write lock.
6. Improved Compact() by only compacting underlying message block if over a certain size threshold.
7. Improved Compact() by writing the index file if needed while still holding the lock, avoiding an unnecessary re-lock.
8. Improved Compact() by not calling out to upper layers on no messages being purged.
9. Fixed a bug in Compact() that would not delete members from a block's delete map.
10. Fixed a bug in reset() when a callback was not registered (raft logs), which avoided msg block cleanup.
11. Improved consumer store Update() call for when to avoid an outdated update.
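
Items 6 and 9 can be illustrated with a small sketch. The `block` type, the `compactTo` method, and the `compactThreshold` value below are all assumptions made for this example and are not taken from `server/filestore.go`. The sketch only shows the two ideas: prune delete-map entries that fall below the new first sequence, and rewrite the block only when it is large enough to be worth it.

```go
package main

import "fmt"

// block is a simplified, hypothetical stand-in for a message block.
type block struct {
	first, last uint64              // sequence range covered by the block
	dmap        map[uint64]struct{} // sequences deleted inside that range
	bytes       uint64              // bytes currently held by the block
}

// compactThreshold is an assumed value, not the one used by the server.
const compactThreshold = 1024

// compactTo moves the first sequence forward to seq and prunes the delete
// map. It reports whether the block is large enough to be worth rewriting.
func (b *block) compactTo(seq uint64) bool {
	if seq <= b.first {
		return false // nothing to compact
	}
	if seq > b.last+1 {
		seq = b.last + 1
	}
	// Entries below the new first sequence are no longer meaningful.
	// Deleting from a map while ranging over it is safe in Go.
	for s := range b.dmap {
		if s < seq {
			delete(b.dmap, s)
		}
	}
	b.first = seq
	return b.bytes > compactThreshold
}

func main() {
	b := &block{
		first: 1,
		last:  10,
		dmap:  map[uint64]struct{}{2: {}, 5: {}, 8: {}},
		bytes: 4096,
	}
	rewrite := b.compactTo(6)
	fmt.Println(b.first, len(b.dmap), rewrite)
}
```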

Signed-off-by: Derek Collison <derek@nats.io>
1. If reset, ignore Applied() calls that are greater than our commit.
2. Improved StepDown() by placing at back of queue if preferred.
3. Improved handling of leadership transfer during StepDown().
4. Do not store EntryLeaderTransfer records on disk.
5. Remove unneeded processing of older terms.
6. If append entry has higher term, also inherit pterm.
7. Only inherit a candidate's term if we decide to vote for them.
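
Item 7 is the easiest of these to show in code. The sketch below is an assumption-laden illustration, not the code in `server/raft.go`: the `node` and `voteRequest` types and the `handleVoteRequest` method are hypothetical. It follows the behavior described in this list, which differs from textbook Raft, where a node adopts any higher term it sees.

```go
package main

import "fmt"

// voteRequest is a hypothetical, trimmed-down vote request.
type voteRequest struct {
	term      uint64 // candidate's term
	lastTerm  uint64 // term of the candidate's last log entry
	lastIndex uint64 // index of the candidate's last log entry
	candidate string
}

// node holds the minimal state needed to decide on a vote.
type node struct {
	term   uint64 // our current term
	vote   string // who we voted for in the current term, if anyone
	pterm  uint64 // term of our last log entry
	pindex uint64 // index of our last log entry
}

// handleVoteRequest grants or refuses a vote. The candidate's term is
// adopted only when the vote is granted.
func (n *node) handleVoteRequest(vr voteRequest) bool {
	if vr.term < n.term {
		return false // stale candidate
	}
	// The candidate's log must be at least as up to date as ours.
	upToDate := vr.lastTerm > n.pterm ||
		(vr.lastTerm == n.pterm && vr.lastIndex >= n.pindex)
	// We may already have voted for someone else in this term.
	alreadyVoted := vr.term == n.term && n.vote != "" && n.vote != vr.candidate
	if !upToDate || alreadyVoted {
		return false // our term is left untouched
	}
	n.term = vr.term
	n.vote = vr.candidate
	return true
}

func main() {
	n := &node{term: 5, pterm: 5, pindex: 100}

	// Higher term but a stale log: refused, and our term does not move.
	granted := n.handleVoteRequest(voteRequest{term: 9, lastTerm: 4, lastIndex: 50, candidate: "A"})
	fmt.Println(granted, n.term)

	// Up-to-date log: granted, and we adopt the candidate's term.
	granted = n.handleVoteRequest(voteRequest{term: 6, lastTerm: 5, lastIndex: 100, candidate: "B"})
	fmt.Println(granted, n.term)
}
```

The point of the design is that a candidate with a stale log cannot push other nodes into a higher term just by asking for votes.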

Signed-off-by: Derek Collison <derek@nats.io>
1. Do not process an ack if we are closed.
2. When checking whether a given consumer needs an ack, hold the lock the entire time.
3. During recovery and restarts we check if we need to replay acks to the parent stream.
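
The first two items follow a common locking pattern, sketched below. The `consumer` type and its fields are hypothetical and far simpler than the real consumer in `server/consumer.go`. The sketch shows an ack being dropped once the consumer is closed, and a "does this consumer still need seq" check that reads all of its state under one lock acquisition.

```go
package main

import (
	"fmt"
	"sync"
)

// consumer is a hypothetical, heavily simplified consumer.
type consumer struct {
	mu        sync.RWMutex
	closed    bool
	delivered uint64              // highest sequence delivered so far
	pending   map[uint64]struct{} // delivered but not yet acked
}

// processAck applies an ack and reports whether it had any effect.
// Acks are ignored once the consumer is closed.
func (c *consumer) processAck(seq uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return false
	}
	if _, ok := c.pending[seq]; !ok {
		return false
	}
	delete(c.pending, seq)
	return true
}

// needAck reports whether the consumer still needs seq. The lock is held
// for the whole check so delivered and pending are read consistently.
func (c *consumer) needAck(seq uint64) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if seq > c.delivered {
		return true // not delivered yet
	}
	_, ok := c.pending[seq]
	return ok
}

// close marks the consumer as shut down.
func (c *consumer) close() {
	c.mu.Lock()
	c.closed = true
	c.mu.Unlock()
}

func main() {
	c := &consumer{delivered: 5, pending: map[uint64]struct{}{3: {}, 4: {}}}
	fmt.Println(c.needAck(4), c.processAck(4), c.needAck(4))
	c.close()
	fmt.Println(c.processAck(3)) // ignored because the consumer is closed
}
```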

Signed-off-by: Derek Collison <derek@nats.io>
1. During ackMsg processing hold write lock to block concurrent access.
2. Check for the presence of preAcks beforehand and force removal if present.
3. Rework check for orphan msgs on startup to use checkStateForInterestStream().
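
As a rough illustration of item 3, the sketch below scans stored sequences and reports those that no consumer still needs. It is not the body of `checkStateForInterestStream()`. The `consumerState` type, its `needsAck` method, and the `orphans` function are hypothetical, and a single ack floor per consumer is a simplifying assumption.

```go
package main

import "fmt"

// consumerState is a hypothetical view of one consumer's progress.
// Everything at or below ackFloor is assumed to be acknowledged.
type consumerState struct{ ackFloor uint64 }

// needsAck reports whether this consumer still needs seq.
func (c consumerState) needsAck(seq uint64) bool { return seq > c.ackFloor }

// orphans returns the stored sequences that no consumer still needs.
// On an interest-based stream these would be candidates for removal.
func orphans(stored []uint64, consumers []consumerState) []uint64 {
	var out []uint64
	for _, seq := range stored {
		needed := false
		for _, c := range consumers {
			if c.needsAck(seq) {
				needed = true
				break
			}
		}
		if !needed {
			out = append(out, seq)
		}
	}
	return out
}

func main() {
	stored := []uint64{1, 2, 3, 4, 5}
	consumers := []consumerState{{ackFloor: 3}, {ackFloor: 2}}
	fmt.Println(orphans(stored, consumers))
}
```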

Signed-off-by: Derek Collison <derek@nats.io>
…rive before the actual msgs.

1. If we are retention based, make sure our consumers are running before entering into monitorStream logic.
2. If we skip messages and are interest based, make sure we check for a preAck state.
3. On finalization of recovery for consumers, have them check against the interest-based stream.
4. Do not process ack state updates if consumer is closed and shutting down.
5. When processing final state for a stream after upper layer catchup, check all attached consumers for ack skew.
6. During catchup of stream messages consult preAck state and skip messages as needed.
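
Item 6 can be sketched as a filter applied while replaying catchup messages. The `msg` type, the `applyCatchup` function, and the `fullyPreAcked` map are assumptions made for this example. In particular, collapsing pre-ack state into one boolean per sequence is a simplification of whatever the server actually tracks.

```go
package main

import "fmt"

// msg is a hypothetical catchup message.
type msg struct {
	seq  uint64
	data string
}

// applyCatchup stores catchup messages, skipping any sequence that every
// interested consumer has already acked. Pre-ack state for a skipped
// sequence is dropped because it is no longer needed.
func applyCatchup(msgs []msg, fullyPreAcked map[uint64]bool) (stored []msg, skipped []uint64) {
	for _, m := range msgs {
		if fullyPreAcked[m.seq] {
			skipped = append(skipped, m.seq)
			delete(fullyPreAcked, m.seq)
			continue
		}
		stored = append(stored, m)
	}
	return stored, skipped
}

func main() {
	msgs := []msg{{1, "a"}, {2, "b"}, {3, "c"}}
	pre := map[uint64]bool{2: true}
	stored, skipped := applyCatchup(msgs, pre)
	fmt.Println(len(stored), skipped, len(pre))
}
```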

Signed-off-by: Derek Collison <derek@nats.io>
…rs during rolling restarts.

Signed-off-by: Derek Collison <derek@nats.io>
… were getting an advantage on that after server restart.

This change further speeds up the raft layer to avoid timeouts.

Signed-off-by: Derek Collison <derek@nats.io>
… error

Signed-off-by: Derek Collison <derek@nats.io>

@neilalexander neilalexander left a comment

Overall looks fine to me, just a couple minor things!

server/filestore.go
server/filestore.go
server/jetstream_cluster.go (outdated)
server/stream.go (outdated)
…sed the lock

Signed-off-by: Derek Collison <derek@nats.io>

@wallyqs wallyqs left a comment

LGTM

derekcollison and others added 2 commits March 29, 2023 15:29
Pre-allocate

Co-authored-by: Neil <neil@nats.io>
Pre-allocate

Co-authored-by: Neil <neil@nats.io>
server/consumer.go (outdated)
Signed-off-by: Derek Collison <derek@nats.io>
…om underneath of us.

Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
@derekcollison derekcollison merged commit 02702e4 into main Mar 30, 2023
2 checks passed
@derekcollison derekcollison deleted the interest-stresser branch March 30, 2023 00:09

4 participants