[taskchampion] Drop old versions on the sync server #3

djmitche opened this issue Jul 11, 2022 · 18 comments


@djmitche
Collaborator

djmitche commented Jul 11, 2022

(Summary updated Jan 2024)

The sync model results in a linked sequence of versions on the server (whether taskchampion-sync-server or a cloud service). To avoid that history growing forever, there's some cleanup (in taskchampion/taskchampion/src/server/cloud/server.rs for cloud services) that applies a date-based heuristic to delete old versions. That heuristic trades extra storage against the possibility that a replica that hasn't synced in a long time will find itself unable to sync, because the last version it saw has been deleted.
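
For concreteness, a heuristic of this kind boils down to something like the following (a minimal sketch, not the actual code in server.rs; the retention window is an assumed, configurable parameter):

fn should_drop(version_created: chrono::DateTime<chrono::Utc>, retention_days: i64) -> bool {
    // Drop any version older than the retention window. A replica whose last-seen
    // version falls before this cutoff can no longer sync incrementally and must
    // re-initialize from a snapshot.
    let cutoff = chrono::Utc::now() - chrono::Duration::days(retention_days);
    version_created < cutoff
}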

The proposal here involves tracking all replicas and the last version they saw. The downside to this proposal is that it adds some required UI for users to manage replicas.

See GothenburgBitFactory/taskchampion#311 for more background.

@Shadow53

I'd like to take a crack at this. I want to make sure I have the right understanding of what's wanted, though.

Currently, a snapshot is an opaque, possibly encrypted, blob of data understood by the client, but not the server. The snapshot contains the current state of the user's tasks at the time of snapshotting, i.e. all non-deleted tasks and associated data, not including their modification history.

A "version" a set of modifications to the task database, e.g. "two tasks were added, one was delete, one was modified"?

The goal is to periodically delete versions that are "too old" -- possibly as defined in the server config -- if they are also older than the most recent snapshot. If "too old" is two weeks and the most recent snapshot was one week ago, one week of prior versions would be retained.

Looking at the sqlite storage backend as a reference, I am thinking (a rough sketch follows the list below):

  1. Add a date column to versions tracking when the version was created
    • Use an idempotent ALTER TABLE to add the column for backwards compatibility
    • For compatibility, also set date to "now" as of adding the column for all existing versions
    • Newly added versions will have date set to "now" as of insertion
  2. Add an index to the new date column for quicker lookup, at the expense of a little extra storage
  3. Whenever cleanup happens, delete all versions with date less than "today minus configured retention time" AND less than the most recent snapshot timestamp
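
A rough sketch of what (1)-(3) might look like with rusqlite; the table and column names here (versions.date, snapshots.date) are illustrative, not the actual schema:

use rusqlite::{params, Connection};

fn add_date_column_and_cleanup(conn: &Connection, retention_secs: i64) -> rusqlite::Result<()> {
    // (1) Backwards-compatible column addition: SQLite has no ADD COLUMN IF NOT EXISTS,
    // so ignore the error if the column is already there; existing rows get "now".
    if conn.execute("ALTER TABLE versions ADD COLUMN date INTEGER", []).is_ok() {
        conn.execute("UPDATE versions SET date = CAST(strftime('%s','now') AS INTEGER)", [])?;
    }
    // (2) Index the new column so cleanup queries stay cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS versions_date ON versions (date)", [])?;
    // (3) Delete versions older than the retention window AND older than the newest snapshot.
    conn.execute(
        "DELETE FROM versions
         WHERE date < strftime('%s','now') - ?1
           AND date < (SELECT MAX(date) FROM snapshots)",
        params![retention_secs],
    )?;
    Ok(())
}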

Questions:

  • Is there a preference for cleaning up on sync vs periodically, which I understand to be a background task?
    • I think on sync would be easier to implement and, if done right, shouldn't add too much time to the sync response
  • Is handling a too-infrequently synced client intended as part of this?
    • That is, adding special logic to the client and/or server to recognize that one or more intermediate versions are missing, and the client should download the snapshot instead

@djmitche
Collaborator Author

You've got the context pretty much correct. The snapshot is always encrypted, and also contains tasks of status Deleted.

One piece that's missing, and that guides setting the parameters: a client can only sync if there are enough versions on the server, where "sync" means to combine local changes on the client with changes on the server. If it can't do that, then it just downloads the latest snapshot and any local changes on the client are lost. "Enough versions" is a little complicated. The client stores the ID of the last version it saw on the server. When it begins a sync operation, the child of that version must still exist on the server. Support for this already exists (at least in Taskchampion; I don't recall if the "hey, you need to re-initialize" message is implemented in Taskwarrior).
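
To make that rule concrete, here's a hedged sketch using made-up types (not the actual TaskChampion API): sync can proceed only while the chain from the replica's last-seen version is unbroken.

// Hypothetical result of asking the server for the child of the replica's
// last-seen version -- illustrative names only.
enum ChildLookup {
    Found,      // the next version exists; the sync walk can proceed
    UpToDate,   // no child yet: the replica already has the latest version
    Truncated,  // the child was deleted: the replica must re-initialize from a snapshot
}

fn can_sync(lookup: ChildLookup) -> bool {
    !matches!(lookup, ChildLookup::Truncated)
}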

Deleting on sync seems reasonable. If it becomes too much load, we could make it probabilistic, with the probability a tunable parameter.

@ryneeverett

ryneeverett commented Oct 3, 2023

Here's a high-level proposal:

  1. First, implement guaranteed-lossless version deletions on sync:
    a. Is the oldest version equal to that of the newest snapshot? If so, we can't delete any versions.
    b. Are there any versions older than the oldest latest_version_id of any client? If so, delete them.
  2. Second, expose the functionality to manually delete a client from the clients database. Next time you sync, all the versions up to the next-most-out-of-date client will be deleted. This is also essential for key revocation, as otherwise one could use a compromised client's secret key to access all future tasks.
  3. Third, implement (optional) client deletion based on age of latest_version_id (or whatever parameters users demand). There's no point in having any record of the client if it cannot be synchronized.

Does that make sense? It seems like this path is simple and lets us postpone some of the more difficult questions until step (3) while still solving most of the use cases.

@djmitche
Collaborator Author

djmitche commented Oct 4, 2023

I think there may be some confusion about "client" in this case. A "client id" represents a task database on the server, and is the same for all replicas connected to that task database. For example, my laptop, desktop, and phone would all use the same client id. I think what you mean by "..of any client" is really "..of any replica", for example the oldest latest_version_id of my laptop, desktop, and phone.

That's currently not knowable at any time, since those replicas do not identify themselves individually. We could add a "replica id" and store the latest_version_id for each replica on the server, but that might lead to never deleting data -- for example, if I lost my phone and replaced it with a new one, but never told the server that the old phone was dead.

I think your point (3) addresses that by deleting clients with an old enough latest_version_id (or equivalently the last date they were heard from). Point (2) would involve deleting a replica from the replicas table, but I'm not sure what the task command to do this would look like.

Also note that encryption keys are per-client, not per-replica. That is, my laptop, desktop, and phone all share the same encryption key. We should have a process to handle a compromised key, but I think it would involve migrating all replicas to a new client_id (and new encryption key).

How would users generate replica id's? If they use the same replica id on two replicas, what happens?

What if we didn't store replica_id's, and just implemented a variant of (3) -- deleting versions when they are old enough that every actively-used client should have seen them (on the scale of months). Would that be simpler for the user? Would it be less effective?

@ryneeverett

ryneeverett commented Oct 4, 2023

Ah, I was in fact under the impression that devices (replicas) were clients with their own encryption keys. My whole idea was based on thinking the server already knew the state of the replicas. I'll have to give some more thought to whether maintaining that data is worth the trouble.

Also note that encryption keys are per-client, not per-replica. That is, my laptop, desktop, and phone all share the same encryption key. We should have a process to handle a compromised key, but I think it would involve migrating all replicas to a new client_id (and new encryption key).

This doesn't strike me as a great security story. Maybe it's daunting but would it be worth considering per-replica keys? I would argue that most people don't roll keys but they do roll devices.

I recall at least one person reporting that their team uses bugwarrior to aggregate their issues into a single synchronized taskwarrior database. I believe the entire team would be one "client" under the new model, so when somebody leaves the company everybody on the team would need to roll the key.

How would users generate replica id's? If they use the same replica id on two replicas, what happens?

One option would be to reject the initial sync if it's not unique. Another option would be to use non-human-readable uuid's and allow the human-readable device name to be a duplicate. This seems to be the way chat clients like Signal or Element work, though I've never liked that UX myself because it's impossible to know which device you're verifying/revoking without having the device in hand to check its key.

What if we didn't store replica_id's, and just implemented a variant of (3) -- deleting versions when they are old enough that every actively-used client should have seen them (on the scale of months). Would that be simpler for the user? Would it be less effective?

I don't see a strategy like this attracting much consensus. How eagerly you want potentially lossy cleanup operations to run depends on a combination of the storage space you have available, the rate at which you're adding task data, and the value of your task data. For example, a user adding their tasks manually is going to value their data a lot more than somebody using taskwarrior strictly as an issue aggregator with bugwarrior.

The simplest thing for the user is if taskwarrior could mostly clean up after itself without the possibility of loss and the user could configure more aggressive cleanup if and when they encounter a need to do so.

@djmitche
Collaborator Author

djmitche commented Oct 6, 2023

Let's split off the issue of key rotation. I don't know of a simple mechanism for each replica to have a different key but still be able to exchange information with other replicas.

I think it's impossible to clean up anything and ensure no data loss: a replica may just stop sync'ing, and it's impossible for the server to know if that replica will come back someday or is gone forever.

You suggested addressing this by deleting a replica when its last sync was "old enough", and I think that strategy would make a lot of sense for a user: imagine I dug an old laptop out of the closet and turned it on for the first time in over a year. As a user, I want that laptop to get back up to date, but I don't need any task updates that were sitting on that laptop, un-sync'd for over a year, to appear -- the utility of those changes has long ago faded. The expiration duration can be configurable, so a server with limited space might expire versions after a month, for example.

My variant of (3) accomplishes a similar thing (if a replica doesn't sync for a long time, it will have to reset to a snapshot), but without storing replica_id's.

@ryneeverett

it's impossible for the server to know if that replica will come back someday or is gone forever.

Under my proposed model, replicas would know when all the replicas had last been updated. An alternative to auto-deletion would be to notify users when a certain threshold of age is reached, either by adding an urgent task or a separate UI mechanism.

imagine I dug an old laptop out of the closet and turned it on for the first time in over a year. As a user, I want that laptop to get back up to date, but I don't need any task updates that were sitting on that laptop, un-sync'd for over a year, to appear -- the utility of those changes has long ago faded.

Imagine I'm using the sync server mostly for backup or I stop regularly using two devices and become a single-device user. For some reason the sync server stops working -- maybe the configuration is broken or the service daemon didn't restart it when it crashed, etc. I don't notice that the sync server is down for over a year, but when I fix it everything I've done for the last year on my laptop gets deleted.

My variant of (3) accomplishes a similar thing (if a replica doesn't sync for a long time, it will have to reset to a snapshot), but without storing replica_id's.

I don't think your variant is unworkable, but it seems like there might be a number of user stories we need to take into account and come up with solutions for.

From an ease-of-use perspective, it seems like with your variant users will need to consider and decide what kind of user they are at the time they set up the task server. This involves not only acquiring some understanding of how task synchronization works but also predicting what kind of user they will remain in the future. With my proposal, users can "lazily" learn about this issue when it confronts them as an actual problem -- either running out of storage space or getting warned that a device hasn't been seen for a long time. And they choose whether to manually address the issue by deleting the offending replica or whether to automate this process for the future. Many users will never be confronted with it at all.

Another issue is that the sync server is designed to support multiple clients and it seems like all clients will have to have the same configuration. From this "hosted taskwarrior" perspective, it seems like the server would want the ability to delete old versions based solely on the total size of a client's database. Maybe that would be better split off into a separate issue. My concern though is that your "simple" variant will provide a sub-optimal but "good enough" solution to all the user stories, complacency will set in, and we may not consider a more nuanced solution that might be easier to implement at an earlier stage of development.

@djmitche
Collaborator Author

djmitche commented Oct 7, 2023

We should definitely have a user consent prompt before resetting local state to a snapshot, so everything on your laptop wouldn't be deleted automatically! I think this is currently an error message suggesting deleting the local task DB. That UX could be better, but at least the consequences are clear and the user can choose.

Do you want to make a more precise model of your proposal? My concerns are in the details, so having those details drawn out would help.

@ryneeverett

ryneeverett commented Oct 10, 2023

Here's a more detailed and updated breakdown of my proposal above:

Step 1: guaranteed-lossless version deletions on sync

  1. Add replicas "table" to the sync server, mapping replica_uuid -> base_version.
  2. Add replica_uuid "table" to the replica storage model. On initial sync, replicas will generate a random UUID and store it.
  3. Add an api endpoint #[post("/v1/client/replica_base_version/{replica_uuid}/{base_version}")] which updates the replicas table.
  4. Replicas will post their base_version to the endpoint each time it is updated.
  5. At the end of the get_child_version and get_snapshot methods, a cleanup hook could be called on success:
{
    purge_versions();
    Ok(
...
  6. And the cleanup hook would look something like:
async fn purge_versions() {
    // Walk versions oldest-first, stopping at the first one that's still needed.
    'outer: for uuid in versions.sort_by(oldest) {
        // We can't delete this version if it's the one the newest snapshot was taken at.
        if uuid == snapshots.sort_by(newest).version_id {
            // Technically this version isn't needed, but this is probably the simplest
            // way to set a stopping condition without more robust tracking of timestamps.
            break;
        }
        // We can't delete this version if it's still needed by a replica -- and if it
        // is, everything newer is needed too, so stop entirely.
        for (_, base_version) in replicas {
            if server.get_child_version(base_version) == uuid {
                // This is the next version needed by this replica.
                break 'outer;
            }
        }
        // Otherwise, delete it.
        versions.delete(uuid);
    }
}

Step 2: add a command to manually delete a replica

  1. Add an api endpoint #[post("/v1/client/list_replicas")] which returns the entire "replicas" table.
  2. Add an api endpoint #[post("/v1/client/delete_replica/{replica_uuid}")] which deletes the given replica_uuid from the replicas "table".
  3. Add a task sync list-replicas subcommand which calls the api endpoint and prints the replicas alongside the timestamp of their latest revision.
// TODO Change the data model to make this more performant!
fn approximate_version_age(base_version: Uuid) -> Option<Timestamp> {
    // Walk the operation log; once we pass the replica's base_version, return the
    // timestamp of the next update operation as an approximation of the version's age.
    let mut found_op = false;
    for op in txn.operations() {
        if op.uuid == base_version {
            found_op = true;
        }
        if found_op {
            if let ReplicaOp::Update(update) = op {
                return Some(update.timestamp);
            }
        }
    }
    None
}
  4. Add a task sync delete-replica <replica-uuid> subcommand which calls the api endpoint to delete the replica.

Step 3: automatic replica deletion based on age

  1. Add a replica_max_age configuration option to set time period at which a replica will be considered "too old".
  2. At the end of a task sync, taskwarrior will check the replica ages using the function equivalent of task sync list-replicas and delete any that meet the criteria using the function equivalent of delete-replica:
for (replica_uuid, age) in list_replicas() {
    if age > config.get("replica_max_age") {
        delete_replica(replica_uuid);
    }
}

@djmitche
Collaborator Author

Nice! Can you add some detail on how replica UUIDs are registered, and how old replicas are expired?

@ryneeverett

I edited my comment to specify that replicas generate their own UUID on initial sync. There isn't really a special "registration" story on the server side because the server will store anything it's given in a /v1/client/replica_base_version/{replica_uuid}/{base_version} request. I also added more detailed accounts of steps (2) and (3).

@djmitche
Collaborator Author

Thanks!

  • Are the replica-related API calls authenticated by the shared client_id, or by something else? If so, that's still a shared, unchangeable secret among the replicas, so perhaps warrants further thought. For example, if a device is compromised, the attacker could extract the client_id and add a new replica for themselves.
  • If I understand correctly, step 3 has each replica performing task sync looking for old replicas and deleting them. I like putting control of that process on the client side!
  • We could easily store a timestamp with each version, or just a "last seen" timestamp for each replica, which would simplify the approximation of replica age.

@ryneeverett

* Are the replica-related API calls authenticated by the shared `client_id`, or by something else? If so, that's still a shared, unchangeable secret among the replicas, so perhaps warrants further thought. For example, if a device is compromised, the attacker could extract the `client_id` and add a new replica for themselves.

Key revocation for the purposes of authorization (a separate concern from encryption) is a real concern but maybe should be spun off into a separate issue that can be fixed later? I don't think this makes the situation worse than present -- if an attacker gets the client_id and encryption_secret they can add as many replicas as they want. And it would be nice to have a better idea where GothenburgBitFactory/taskchampion#371 is going because that would also change the way these id's/keys are used anyway and could make signatures an easy solution.

* If I understand correctly, step 3 has each replica performing `task sync` looking for old replicas and deleting them. I like putting control of that process on the client side!

Yes, that's correct!

* We could easily store a timestamp with each version, or just a "last seen" timestamp for each replica, which would simplify the approximation of replica age.

Yeah, I figured there were a few different ways we could change the data model to make that performant and that you'd have a better idea which was best.

Regardless of the implementation, the timestamp needs to be tied to the timestamp of the version itself, not the time when the replica received it. For example, a replica on a poor internet connection might only receive a few very old operations now and then but be very far behind -- we shouldn't give it a timestamp of "today" just because it received a version today.

Maybe the simplest approach would be to add a timestamp field to the replicas database and have the client add the timestamp of the most recent ReplicaOp::Update to the api call: #[post("/v1/client/replica_base_version/{replica_uuid}/{base_version}/{timestamp}")]. It would be a very short "walk" compared to what I have now.

@djmitche
Collaborator Author

Sorry, I was indeed crossing two different issues there. Still, worth considering the overlap!

What's the motivation to tie a version's timestamp to the timestamp of an operation within it? Is there an advantage over simply adding a timestamp for the version when the version is created?

@ryneeverett

What's the motivation to tie a version's timestamp to the timestamp of an operation within it? Is there an advantage over simply adding a timestamp for the version when the version is created?

The motivation of the current proposal is that it puts all the work on the client side without adding complexity to the overall system, but I think we'd agree that there are better options. I'll post an updated proposal momentarily.

@ryneeverett

Phase One: guaranteed-lossless version deletions on sync

  1. Add replicas table to the sync server, mapping replica_uuid -> base_version.
  2. Add replica_uuid table to the replica storage model. On initial sync, replicas will generate a random UUID and store it.
  3. Add an api endpoint #[post("/v1/client/replica_base_version/{replica_uuid}/{base_version}")] which updates the replicas table.
  4. Replicas will post their base_version to the endpoint each time it is updated.
  5. At the end of the get_child_version and get_snapshot methods, a cleanup hook could be called on success:
{
    purge_versions();
    Ok(
...
  6. And the cleanup hook would look something like:
async fn purge_versions() {
    // Walk versions oldest-first, stopping at the first one that's still needed.
    'outer: for uuid in versions.sort_by(oldest) {
        // We can't delete this version if it's the one the newest snapshot was taken at.
        if uuid == snapshots.sort_by(newest).version_id {
            // Technically this version isn't needed, but this is probably the simplest
            // way to set a stopping condition without more robust tracking of timestamps.
            break;
        }
        // We can't delete this version if it's still needed by a replica -- and if it
        // is, everything newer is needed too, so stop entirely.
        for (_, base_version) in replicas {
            if server.get_child_version(base_version) == uuid {
                // This is the next version needed by this replica.
                break 'outer;
            }
        }
        // Otherwise, delete it.
        versions.delete(uuid);
    }
}

Phase Two: add a command to list replicas and the timestamp of their base_version

  1. Add a timestamp field to the versions table on the server. (Maybe this could be taken from the date header of #[post("/v1/client/add-version/{parent_version_id}")]?)
  2. Add an api endpoint #[post("/v1/client/list_replicas")] which returns the replicas table, JOINed with the timestamps of the corresponding versions in the versions table (a query sketch follows this list).
  3. Add a task sync list-replicas subcommand which calls the api endpoint and prints the replicas alongside the timestamp of their latest revision.
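
For illustration, the JOIN behind (2) might look like the following; the column names (replicas.replica_uuid, replicas.base_version, versions.version_id, versions.timestamp) are assumptions, not the actual schema:

// Each replica alongside the timestamp of its base_version (NULL if that
// version has already been purged).
const LIST_REPLICAS_SQL: &str = "
    SELECT r.replica_uuid, r.base_version, v.timestamp
    FROM replicas r
    LEFT JOIN versions v ON v.version_id = r.base_version
";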

Phase Three: add a command to manually delete a replica

  1. Add an api endpoint #[post("/v1/client/delete_replica/{replica_uuid}")] which deletes the given replica_uuid from the replicas table.
  2. Add a task sync delete-replica <replica-uuid> subcommand which calls the api endpoint to delete the replica.

Phase Four: automatic replica deletion based on age

  1. Add a replica_max_age configuration option to set time period at which a replica will be considered "too old".
  2. At the end of a task sync, taskwarrior will check the replica ages using the function equivalent of task sync list-replicas and delete any that meet the criteria using the function equivalent of delete-replica:
for (replica_uuid, age) in list_replicas() {
    if age > config.get("replica_max_age") {
        delete_replica(replica_uuid);
    }
}

@djmitche
Collaborator Author

Apologies for the wait while I found some time to consider this.

I think this sounds like a good model! Getting version timestamps from the Date header sounds good, but the server could also just use its local clock. I see a conceptual advantage to trusting the client though!

We could support this for the cloud-storage-backed model, too, by for example storing each replica's latest version in an object named replica/<replica-uuid>/latest-version. Then listing the replicas is just listing objects matching replica/* and deleting a replica is just deleting its latest-version object. I think we could use the object timestamp for version objects (version/<uuid>) as the version timestamp, or if necessary store an additional object containing the datestamp (version/<uuid>/datestamp).
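
A sketch of that object layout, using a made-up object-store interface (the actual cloud service trait may differ):

// Hypothetical object-store interface, for illustration only.
trait ObjectStore {
    fn list_prefix(&self, prefix: &str) -> Vec<String>;
    fn delete(&self, key: &str);
}

// replica/<replica-uuid>/latest-version holds each replica's last-seen version id.
fn replica_key(replica_uuid: &str) -> String {
    format!("replica/{replica_uuid}/latest-version")
}

// Listing replicas is just listing objects under the "replica/" prefix.
fn list_replicas(store: &impl ObjectStore) -> Vec<String> {
    store.list_prefix("replica/")
}

// Deleting a replica is just deleting its latest-version object.
fn delete_replica(store: &impl ObjectStore, replica_uuid: &str) {
    store.delete(&replica_key(replica_uuid));
}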

Anyway, this is all just thinking about how this fits into other ideas. The proposal sounds good :)

@djmitche
Collaborator Author

I updated the summary of this issue with the latest status -- generalizing a little to cover cloud storage as well as the sync server.

@djmitche djmitche transferred this issue from GothenburgBitFactory/taskwarrior Apr 8, 2024