Support undoing archived batches #54

Open
dpriskorn opened this issue Feb 17, 2023 · 4 comments

Comments

@dpriskorn

https://editgroups.toolforge.org/b/CB/8f5438bc04fc/
I would like to undo that one. How do I do that?

@wetneb
Member

wetneb commented Feb 18, 2023

This is indeed not supported yet: it would first require unarchiving it. This means re-fetching all the edits in the batch, for instance by fetching user contributions. Then, the batch could be undone again.

@wetneb changed the title from "Undoing archived batch does not seem to work" to "Support undoing archived batches" on Jul 29, 2023

@dpriskorn
Author

I would like to help get this done. Can someone point me in the right direction?

@wetneb
Member

wetneb commented Aug 11, 2023

Of course!

When archiving a batch, we delete all of its edits from our database except the last 10. This keeps EditGroups' own database from growing too much over time. Batches are archived periodically by this method:

editgroups/store/models.py

Lines 267 to 277 in b674c84

    @classmethod
    def archive_old_batches(cls, batch_inspector):
        """
        Archive all batches which have not been modified for a long time
        and contain more edits than our archival threshold.
        This method is meant to be run periodically.
        """
        cutoff_date = datetime.utcnow().replace(tzinfo=UTC) - settings.BATCH_ARCHIVAL_DELAY
        for batch in cls.objects.filter(
            nb_edits__gt=settings.EDITS_KEPT_AFTER_ARCHIVAL,
            archived=False,
            ended__lt=cutoff_date,
        ):
            batch.archive(batch_inspector)

So, if we want to undo a batch that has been archived, we need to re-fetch the edits that are deleted by this method.
Because this is generally going to take a while, we'll want to make sure that the batch will not be archived again by the periodic task while we are doing it, and also for some time after the batch has been unarchived. This could be done by adding a date field on the Batch model indicating the date of latest unarchival (and making sure we only archive batches whose date of latest unarchival is null or older than some threshold). That could be the first step: add this new field to the Batch class, generate the corresponding migration with ./manage.py makemigrations and update the archival task to take it into account.
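
To make the archival condition concrete, here is a minimal sketch of the eligibility check in plain Python, without Django. The `last_unarchived` field, the `UNARCHIVAL_GRACE_PERIOD` setting, the `BatchInfo` stand-in class and the numeric values are all assumptions made for illustration; only `nb_edits`, `archived`, `ended`, `BATCH_ARCHIVAL_DELAY` and `EDITS_KEPT_AFTER_ARCHIVAL` appear in the method quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder values: the real ones live in the project's settings module.
BATCH_ARCHIVAL_DELAY = timedelta(days=365)
EDITS_KEPT_AFTER_ARCHIVAL = 10
# Hypothetical new setting: how long an unarchived batch stays protected.
UNARCHIVAL_GRACE_PERIOD = timedelta(days=30)


@dataclass
class BatchInfo:
    """Plain stand-in for the Batch fields used by the archival check."""
    nb_edits: int
    archived: bool
    ended: datetime
    last_unarchived: Optional[datetime] = None  # hypothetical new field


def is_archivable(batch: BatchInfo, now: datetime) -> bool:
    """Return True if the periodic task should archive this batch."""
    # Existing conditions: not yet archived, large enough, old enough.
    if batch.archived or batch.nb_edits <= EDITS_KEPT_AFTER_ARCHIVAL:
        return False
    if batch.ended >= now - BATCH_ARCHIVAL_DELAY:
        return False
    # New condition: never unarchived, or unarchived long enough ago.
    if batch.last_unarchived is None:
        return True
    return batch.last_unarchived < now - UNARCHIVAL_GRACE_PERIOD
```

In the actual queryset, the new condition could presumably be expressed with Django `Q` objects, along the lines of `Q(last_unarchived__isnull=True) | Q(last_unarchived__lt=threshold)`, added to the existing `filter` call.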

Then, for the unarchiving itself, as mentioned above we could look into querying the MediaWiki API to fetch the contributions of the user. In our batch metadata, we know when the batch started and ended, so we just need to fetch the contributions between those times. The tasks that come to mind here would be:

  • parse the API response to represent the edits in the same format as what we are retrieving from the EventStream API (so that we can use as much common ingestion logic as possible, between the EventStream and the MediaWiki API use cases).
  • filter out the edits which do not belong to the batch we are trying to unarchive (so that we are only adding those edits)
  • connect up all this code into a Celery task (so that this can be run asynchronously, independently from the web request that triggered the unarchival).
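
The first two tasks could look roughly like the sketch below. It converts one item returned by the MediaWiki `list=usercontribs` query into a dict shaped like an EventStream `recentchange` event, then checks whether the edit belongs to a given batch. Which fields the existing ingestion code actually needs is not stated here, so the field selection is an assumption, as are the function names and the idea that a batch can be recognised by an `editgroups/b/<tool>/<id>` marker in the edit summary.

```python
from datetime import datetime, timezone


def usercontrib_to_event(contrib: dict, server_url: str = "https://www.wikidata.org") -> dict:
    """Convert one `usercontribs` item into an EventStream-like dict (approximate)."""
    # usercontribs timestamps are ISO 8601 in UTC; EventStream uses Unix seconds.
    ts = datetime.strptime(contrib["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    new_len = contrib.get("size")
    diff = contrib.get("sizediff")
    old_len = new_len - diff if new_len is not None and diff is not None else None
    return {
        # A parent revision id of 0 means the edit created the page.
        "type": "new" if contrib.get("parentid", 0) == 0 else "edit",
        "title": contrib["title"],
        "namespace": contrib["ns"],
        "comment": contrib.get("comment", ""),
        "timestamp": int(ts.timestamp()),
        "user": contrib["user"],
        "revision": {"old": contrib.get("parentid"), "new": contrib["revid"]},
        "length": {"old": old_len, "new": new_len},
        "server_url": server_url,
    }


def belongs_to_batch(event: dict, tool_shortid: str, batch_uid: str) -> bool:
    """Assumption: the edit summary contains a link to the batch page."""
    return f"editgroups/b/{tool_shortid}/{batch_uid}" in event["comment"]
```

For the batch linked at the top of this issue, the check would be `belongs_to_batch(event, "CB", "8f5438bc04fc")`. Reusing whatever summary-matching logic the EventStream ingestion already has would be preferable to the substring test shown here.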

This is the core backend work for this issue and it'd be worth writing test cases for it (this code base is pretty extensively tested so it should not be too hard to imitate what is already there).

Finally, we'd need to expose the unarchiving feature in the frontend. This means adding a route that lets the user trigger the unarchiving, and exposing the corresponding button by modifying the template that displays batches: ./store/templates/store/batch.html.

I hope this description is not too daunting and I'd be happy to give more details where needed!

@wetneb
Member

wetneb commented Aug 11, 2023

I just thought of a less demanding approach: instead of adding support for un-archiving batches, we could "simply" fetch the edits from the MediaWiki API when undoing them, in a streaming fashion. This means that we don't even need to save them back to the database.

It probably also gives a better UX: the undo button would remain available on all batches, whether they are archived or not, and users would not need to first unarchive the batch before undoing it.
Arguably, unarchiving a batch would be useful for other purposes (for instance to download the CSV of all its edits), but it's probably minor compared to undoing.

With this approach, the core of the work would be to write a Python generator which iterates through the edits of a particular user between two timestamps, filtering out the edits which do not belong to a particular batch. This would then be used in the undoing task in place of the iteration over the edits ingested in the database. We'd need to make some small tweaks to update the number of undone edits in the Batch metadata (since currently, this is updated by the general edit ingestion logic, not the undoing code) but that does not seem too difficult.
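
A minimal sketch of such a generator follows. It pages through `list=usercontribs` with the standard `continue` mechanism of the MediaWiki Action API and yields only the contributions accepted by a predicate. The function name, the injected `fetch` callable (which takes the query parameters and returns the decoded JSON response) and the `keep` predicate are assumptions made so that the example runs without network access; they are not part of the existing code base.

```python
from typing import Callable, Iterator


def iter_user_edits(
    fetch: Callable[[dict], dict],
    user: str,
    start: str,
    end: str,
    keep: Callable[[dict], bool] = lambda contrib: True,
    limit: int = 500,
) -> Iterator[dict]:
    """Lazily yield a user's contributions between two ISO 8601 timestamps."""
    params = {
        "action": "query",
        "format": "json",
        "list": "usercontribs",
        "ucuser": user,
        "ucdir": "newer",  # oldest first, so ucstart is the earlier bound
        "ucstart": start,
        "ucend": end,
        "ucprop": "ids|title|timestamp|comment|size|sizediff",
        "uclimit": limit,
    }
    while True:
        response = fetch(params)
        for contrib in response.get("query", {}).get("usercontribs", []):
            if keep(contrib):  # drop edits that are not part of the batch
                yield contrib
        continuation = response.get("continue")
        if not continuation:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **continuation}
```

With the `requests` library, `fetch` could be something like `lambda p: requests.get("https://www.wikidata.org/w/api.php", params=p).json()`, and `start` and `end` would come from the batch's start and end times. Because the generator is lazy, the undoing task can begin processing edits before the whole contribution list has been downloaded. Whether the undo should process edits oldest first or newest first is a separate choice; switching `ucdir` to `older` (and swapping the bounds) would reverse the order.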
