
Simplify backup task spawning in safekeepers #7751

Closed
Tracked by #6220
petuhovskiy opened this issue May 14, 2024 · 0 comments · Fixed by #7768
Labels
c/storage/safekeeper Component: storage: safekeeper


petuhovskiy commented May 14, 2024

Currently, WAL backup task spawning works like this:

  • we have a background task that watches the wal_backup_launcher_rx channel
  • it receives a ttid from this channel and decides whether to spawn or kill the timeline's backup task
  • every piece of code that modifies timeline state is responsible for pinging wal_backup_launcher_tx
  • if we forget to notify it, the backup task can be left in the wrong state

I have an idea to change this scheme as follows:

  • each Timeline will spawn an async "manager" task that will live until the timeline is deleted
  • this task will receive all state updates, such as (commit_lsn, backup_lsn, num_computes, peers)
  • on each state update it will decide whether to stop or start the backup task (mostly the same logic as now)
  • Arc<Timeline> will maintain async channels with the current state, updated each time someone touches the shared state
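A minimal sketch of the proposed manager loop, using a plain std thread and mpsc channel in place of the real async/tokio machinery. `StateSnapshot`, `backup_needed`, and `run_manager` are hypothetical names, and the start/stop policy is a placeholder:

```rust
use std::sync::mpsc::Receiver;
use std::thread;

// Hypothetical snapshot of the shared per-timeline state; the fields follow
// the ones mentioned above (commit_lsn, backup_lsn, num_computes).
#[derive(Clone, Copy, Debug)]
pub struct StateSnapshot {
    pub commit_lsn: u64,
    pub backup_lsn: u64,
    pub num_computes: usize,
}

// Placeholder policy: back up while there is WAL not yet offloaded.
pub fn backup_needed(s: &StateSnapshot) -> bool {
    s.commit_lsn > s.backup_lsn
}

// Manager loop: runs until the channel is closed (timeline deleted) and
// flips the backup task on/off after every state update. Returns the
// final on/off decision so callers can inspect it.
pub fn run_manager(rx: Receiver<StateSnapshot>) -> thread::JoinHandle<bool> {
    thread::spawn(move || {
        let mut backup_running = false;
        for state in rx {
            let needed = backup_needed(&state);
            if needed && !backup_running {
                backup_running = true; // spawn the backup task here
            } else if !needed && backup_running {
                backup_running = false; // signal the backup task to stop
            }
        }
        backup_running
    })
}
```

Because every state change flows through the one channel, no call site can forget to "ping" the launcher: the manager sees each update and reconciles the backup task itself.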
@petuhovskiy petuhovskiy self-assigned this May 14, 2024
@petuhovskiy petuhovskiy added the c/storage/safekeeper Component: storage: safekeeper label May 14, 2024
petuhovskiy added a commit that referenced this issue May 22, 2024
In safekeepers we have several background tasks. Previously, the `WAL backup`
task was spawned by another task called `wal_backup_launcher`. That task
received notifications via `wal_backup_launcher_rx` and decided whether to
spawn or kill the backup task associated with the timeline. This was
inconvenient because each code segment that touched shared state was
responsible for pushing a notification into the `wal_backup_launcher_tx`
channel. It was error-prone because the notification was easy to miss, and
in some cases it could lead to a deadlock if notifications were pushed in
the wrong order.

We also had a similar issue with the `is_active` timeline flag. That flag
was calculated from the state, and any code modifying the state had to call
a function to update the flag. We had a few bugs where we forgot to update
`is_active` in places where it could change.

To fix these issues, this PR adds a new `timeline_manager` background
task associated with each timeline. This task is responsible for
managing all background tasks, including the `is_active` flag, which is
used for pushing broker messages. It subscribes to timeline state updates
in a loop and decides when to spawn or kill background tasks.

There is a new structure called `TimelinesSet`. It stores a set of
`Arc<Timeline>` and allows copying the set for iteration without holding
the mutex. This is what replaced the `is_active` flag for the broker: the
broker push task now holds a reference to a `TimelinesSet` of active
timelines and uses it instead of iterating over all timelines and
filtering by the `is_active` flag.
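A rough sketch of what such a set might look like, assuming timelines are keyed by a `ttid` string; `Timeline`, `set_present`, and `snapshot` are illustrative names, not the actual safekeeper API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical simplified Timeline; the real struct lives in the safekeeper crate.
pub struct Timeline {
    pub ttid: String,
}

// Sketch of a TimelinesSet: a mutex-guarded map of active timelines that
// can be copied into a Vec, so callers iterate without holding the lock.
#[derive(Default)]
pub struct TimelinesSet {
    timelines: Mutex<HashMap<String, Arc<Timeline>>>,
}

impl TimelinesSet {
    // The manager task calls this to mark a timeline active or inactive.
    pub fn set_present(&self, tli: Arc<Timeline>, present: bool) {
        let mut guard = self.timelines.lock().unwrap();
        if present {
            guard.insert(tli.ttid.clone(), tli);
        } else {
            guard.remove(&tli.ttid);
        }
    }

    // Copy out the current set; the broker push task iterates this snapshot
    // instead of filtering all timelines by an `is_active` flag.
    pub fn snapshot(&self) -> Vec<Arc<Timeline>> {
        self.timelines.lock().unwrap().values().cloned().collect()
    }
}
```

Cloning `Arc`s out of the map keeps the critical section tiny: the broker loop never blocks writers while it pushes messages for the snapshot it took.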

Also added some metrics for manager iterations and active backup tasks.
Ideally, the manager should not perform too many iterations, and we should
not have many backup tasks running at the same time.

Fixes #7751

---------

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>