Fix user deletion #8700
Conversation
h/services/user_delete.py (outdated)

```python
    ):
        self._db = db_session
        self._annotation_delete_service = annotation_delete_service
        job_name = "purge_user"
```
I reckon this belongs to line 19 in 44ec8fd:

```python
SYNC_ANNOTATION = "sync_annotation"
```
Hmm. I think this `JobName` enum is in the wrong place. Having it on `JobQueueService` is a pain because `JobQueueService` is mocked in `UserDeleteService`'s tests, so `JobQueueService.JobName.PURGE_USER` is a mock not a string, which causes SQLAlchemy to crash when `UserDeleteService` tries to write a mock to the DB. It can be fixed by moving `JobName` onto `models.Job` instead, which I think is where it belongs anyway: `JobQueueService` may not be the only code reading and writing the job table: d8fcf1e

I am slightly dubious about the value of these enums. The existing `JobName.SYNC_ANNOTATION` and `ANNOTATION_SLIM` aren't actually used anywhere! All the code just duplicates the strings.
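The mocking failure mode described above can be reproduced in a few lines. This is a hypothetical stand-in, not h's real `JobQueueService`: any attribute chain on a `Mock` (including an enum hung off the mocked class) yields another `Mock`, never a string.

```python
from unittest.mock import Mock

# Stand-in for a JobQueueService that has been mocked out in another
# service's unit tests.
job_queue_service = Mock()

# What a caller would get when reaching for the enum member:
job_name = job_queue_service.JobName.PURGE_USER

assert isinstance(job_name, Mock)     # a Mock, not a string...
assert not isinstance(job_name, str)  # ...so SQLAlchemy can't bind it as a value
```

This is why hanging the enum off the model (which is not mocked in those tests) avoids the crash.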
```python
                continue

            try:
                purger.delete_authtickets(user)
```
So this task will go as far down this list as the limit permits. When it finishes via the `break`, the transaction will be committed, and next time around it will continue down the list.
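A minimal sketch of that resume-from-the-top pattern, using hypothetical `LimitedWorker` and `purge` stand-ins (h's real classes operate on DB rows, not lists): each run deletes at most `limit` rows, stops early once the budget is spent, and the next run walks the same list of steps again, continuing further down.

```python
class LimitExceededError(Exception):
    """Raised when the per-run row budget has been used up."""


class LimitedWorker:
    def __init__(self, limit):
        self.limit = limit  # rows we may still delete in this run

    def delete(self, rows):
        """Delete up to `self.limit` items from `rows` (a stand-in for a table)."""
        if self.limit <= 0:
            raise LimitExceededError()
        deleted = rows[: self.limit]
        self.limit -= len(deleted)
        del rows[: len(deleted)]
        return len(deleted)


def purge(steps, limit):
    """Run as far down the ordered list of purge steps as `limit` permits."""
    worker = LimitedWorker(limit)
    for rows in steps:
        try:
            worker.delete(rows)
        except LimitExceededError:
            break  # budget spent: commit what we did, resume next run
```

For example, with `steps = [[1, 2, 3], [4], [5, 6]]` and a budget of 4, the first `purge()` call empties the first two steps and leaves `[5, 6]` untouched; a second call finishes the job.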
```python
    def delete_user(self, user):
        """Delete `user`."""
        self.worker.delete(User, select(User.id).where(User.id == user.id))
```
A nice-to-have would be a `UserDeletion.completed_at` column, updated here.
I think it would just be misleading, unfortunately. At the time when the user row is deleted the user may actually still have annotations (and annotation slims, etc) in the DB waiting to be purged by the `purge-deleted-annotations` task, and they may still have annotations in Elasticsearch waiting to be purged by the `sync-annotation` task.
All good, thanks for the detailed testing instructions.
```python
class JobName(str, Enum):
    SYNC_ANNOTATION = "sync_annotation"
    ANNOTATION_SLIM = "annotation_slim"
    PURGE_USER = "purge_user"
```
Moved this enum from the service onto the model and added `PURGE_USER`.

Enums on services get mocked in the unit tests of other services, which is a pain (e.g. SQLAlchemy crashes when trying to write mocks to the DB). I think the model is probably a better place for this enum anyway: `JobQueueService` may not be the only code that needs to work directly with the job table.

Note that the existing `SYNC_ANNOTATION` and `ANNOTATION_SLIM` in this enum aren't used anywhere; everywhere just uses the strings directly!
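For illustration, the reason the `str, Enum` mixin is safe to hand to SQLAlchemy (unlike a mocked attribute) is that each member really is a string, equal to its value:

```python
from enum import Enum


class JobName(str, Enum):
    SYNC_ANNOTATION = "sync_annotation"
    ANNOTATION_SLIM = "annotation_slim"
    PURGE_USER = "purge_user"


# Each member IS a str and compares equal to the raw string, so writing
# JobName.PURGE_USER to a varchar column behaves like writing "purge_user".
assert isinstance(JobName.PURGE_USER, str)
assert JobName.PURGE_USER == "purge_user"

# Lookup by value also works, e.g. when reading rows back out of the table:
assert JobName("purge_user") is JobName.PURGE_USER
```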
```python
            sa.func.count(Annotation.id)  # pylint:disable=not-callable

num_annotations=self.db.scalar(
    select(
        func.count(Annotation.id)  # pylint:disable=not-callable
```
Counting the `annotation` table here. Will this be fast enough? May want to change this to `AnnotationSlim` once back-filling it is finished.
```python
    def delete_authtickets(self, user):
        """Delete all AuthTicket's belonging to `user`."""
        self.worker.delete(
            AuthTicket, select(AuthTicket.id).where(AuthTicket.user == user)
        )

    def delete_tokens(self, user):
        """Delete all tokens belonging to `user`."""
        self.worker.delete(Token, select(Token.id).where(Token.user == user))

    def delete_flags(self, user):
        """Delete all flags created by `user`."""
        self.worker.delete(Flag, select(Flag.id).where(Flag.user_id == user.id))

    def delete_featurecohort_memberships(self, user):
        """Remove `user` from all feature cohorts."""
        self.worker.delete(
            FeatureCohortUser,
            select(FeatureCohortUser.id).where(FeatureCohortUser.user_id == user.id),
        )
```
I think all these methods may belong in model-specific services (`AuthTicketService` etc) where they can potentially be re-used by other code. But that's a refactoring that I don't want to bite off right now.
h/services/user_delete.py (outdated)

```python
    def delete_annotations(self, user):
        """Delete all of `user`'s annotations from both Postgres and Elasticsearch."""
        now = datetime.utcnow()

        deleted_annotation_ids = self.worker.update(
            Annotation,
            select(Annotation.id).where(Annotation.userid == user.userid),
            {
                # We don't actually delete the user's annotations from the DB,
                # we only mark them as deleted.
                # The marked-as-deleted annotations will later be purged by the
                # periodic purge_deleted_annotations() task.
                #
                # This is because there are parts of h that don't work if
                # annotations are deleted immediately, including the WebSocket
                # and the call to JobQueueService.add_by_id() below.
                #
                # This is the same as what happens when annotations are deleted
                # individually by the API.
                "deleted": True,
                # Bump the annotation's updated time when marking it as deleted.
                # This is to prevent the purge_deleted_annotations() task from
                # purging the annotation too soon: that task only purges
                # deleted annotations whose updated time is more than ten mins
                # ago.
                "updated": now,
            },
        )

        # Whenever we update annotations we also need to update the
        # corresponding annotation_slims.
        num_deleted_annotation_slims = self.db.execute(
            update(AnnotationSlim)
            .where(AnnotationSlim.pubid.in_(deleted_annotation_ids))
            .values({"deleted": True, "updated": now})
        ).rowcount
        log.info("Updated %d rows from annotation_slim", num_deleted_annotation_slims)

        # Add jobs to the queue so the annotations will eventually be deleted
        # from Elasticsearch.
        for annotation_id in deleted_annotation_ids:
            self.job_queue.add_by_id(
                name="sync_annotation",
                annotation_id=annotation_id,
                tag="UserDeleteService.delete_annotations",
                schedule_in=60,
            )
            log.info(
                "Enqueued job to delete annotation from Elasticsearch: %s",
                annotation_id,
            )
```
This duplicates code from `AnnotationDeleteService.delete()` (set `annotation.deleted = True`, update `annotation.updated`, update the corresponding `annotation_slim`) but does it in a more efficient way and with support for a limit (via `LimitedWorker`).

This should probably be refactored to remove the duplication: `AnnotationDeleteService.delete()` should probably be refactored to work more efficiently like this `delete_annotations()` method, with support for deleting multiple annotations and using a limited worker. But I don't think I want to bite that off right now.

`AnnotationDeleteService.delete()` also sends an `AnnotationEvent` for the deleted annotation, but I don't think I want to do that here: I don't think we want bulk-deleting annotations to cause an explosion of `AnnotationEvent`s being sent.
The `AnnotationEvent`s would:

- Cause the annotations to be deleted from Elasticsearch (lines 41 to 50 in f1669fd):

  ```python
  @subscriber(AnnotationEvent)
  def annotation_sync(event):
      """Ensure an annotation is synchronised to Elasticsearch."""
      # Checking feature flags opens a connection to the database. As this event
      # is processed after the main transaction has closed, we must open a new
      # transaction to ensure we don't leave an un-closed transaction
      with event.request.tm:
          search_index = event.request.find_service(name="search_index")
          search_index.handle_annotation_event(event)
  ```

  The job-queue path is shared by `AnnotationDeleteService.delete()` and `UserDeleteService`. But the `AnnotationEvent` causes an attempt to be made to delete the annotation from Elasticsearch synchronously, meaning the job queue only needs to delete it if this attempt fails. This is part of why the current method for deleting users doesn't work: it attempts to synchronously delete all the user's annotations from Elasticsearch during the admin pages form submission request, and times out. I don't think we want this synchronous attempt when bulk-deleting all the user's annotations: we actually just want to let the job queue handle it in time.

- Publish an annotation event to the message queue, for the WebSocket (lines 53 to 65 in f1669fd):

  ```python
  @subscriber(AnnotationEvent)
  def publish_annotation_event(event):
      """Publish an annotation event to the message queue."""
      data = {
          "action": event.action,
          "annotation_id": event.annotation_id,
          "src_client_id": event.request.headers.get("X-Client-Id"),
      }
      try:
          event.request.realtime.publish_annotation(data)
      except RealtimeMessageQueueError as err:
          report_exception(err)
  ```

- Cause reply notifications to be sent (lines 68 to 89 in f1669fd):

  ```python
  @subscriber(AnnotationEvent)
  def send_reply_notifications(event):
      """Queue any reply notification emails triggered by an annotation event."""
      request = event.request
      with request.tm:
          annotation = request.find_service(AnnotationReadService).get_annotation_by_id(
              event.annotation_id
          )
          notification = reply.get_notification(request, annotation, event.action)
          if notification is None:
              return
          send_params = emails.reply_notification.generate(request, notification)
          try:
              mailer.send.delay(*send_params)
          except OperationalError as err:  # pragma: no cover
              # We could not connect to rabbit! So carry on
              report_exception(err)
  ```

I think that's everything the `AnnotationEvent`s do.
```python
    def delete_groups(self, user):
        """
        Delete groups created by `user` that have no annotations.

        Delete groups that were created by `user` and that don't contain any
        non-deleted annotations.

        If delete_annotations() (above) is called first then all of `user`'s
        own annotations will already have been deleted, so ultimately any
        groups created by `user` that don't contain any annotations by *other*
        users will get deleted.

        Known issue: if this method does not delete a group because it contains
        annotations by other users, and those other users' annotations are
        later deleted, then the group will no longer contain any annotations
        but will never be deleted.

        Known issue: this will also delete all of the group's memberships due
        to a foreign key constraint with ondelete="cascade". If a group had a
        really large number of members this could take too long and cause a
        timeout.
        """
        # pylint:disable=not-callable,use-implicit-booleaness-not-comparison-to-zero
        self.worker.delete(
            Group,
            select(Group.id)
            .where(Group.creator_id == user.id)
            .outerjoin(
                Annotation,
                and_(
                    Annotation.groupid == Group.pubid,
                    Annotation.deleted.is_(False),
                ),
            )
            .group_by(Group.id)
            .having(func.count(Annotation.id) == 0),
        )

    def delete_group_memberships(self, user):
        """
        Delete `user`'s group memberships.

        Known issue: this can leave groups that were created by `user` in an
        odd state - `user` will no longer be a member of the group but will
        still be its creator. But this state can occur in other ways as well,
        for example the web interface currently allows a group's creator to
        leave the group.

        If `delete_group_creators()` (below) is called after this method then
        the situation will be only temporary: `user` will soon be removed as
        the group's creator as well.
        """
        self.worker.delete(
            GroupMembership,
            select(GroupMembership.id)
            .where(GroupMembership.user_id == user.id)
            .join(Group, GroupMembership.group_id == Group.id),
        )

    def delete_group_creators(self, user):
        """
        Delete `user` as the creator of any groups they created.

        Known issue: this will leave groups in an odd state - with no creator
        (group.creator = None).
        """
        self.worker.update(
            Group, select(Group.id).where(Group.creator == user), {"creator_id": None}
        )
```
These methods related to deleting groups should probably be moved to `GroupDeleteService`.

`GroupDeleteService` already has its own method for deleting a group (it's only used by the admin pages: users can't delete groups) but it deletes all the group's annotations (which is not what we want here) and isn't scalable (it tries to delete everything all at once, which will time out for large groups).
I've done another round of (lighter) testing and all looks good.
Not sure how this was passing coverage before, but it's not now.
Replace `UserDeleteService` with a completely new service that marks users as deleted and then purges their data incrementally in the background. This should fix issues with the user delete admin page timing out when trying to delete users with large amounts of data, because `UserDeleteService` tries to delete all the user's data at once during the web request and this takes too long.

The way the new service works is:

- The delete user admin page calls `UserDeleteService.delete_user()` which:

  - Sets `user.deleted = True`
  - Adds a `"purge_user"` job to the job table in the DB
  - Adds a row to the `user_deletion` table

  All three are done in the same DB transaction, so it's not possible to mark a user as deleted but fail to add the `"purge_user"` job. The `"purge_user"` job won't be deleted from the DB until all of the user's data has been purged.

- A periodic `purge_deleted_users()` task (Add `purge_deleted_users()` task to h, h-periodic#441) runs every hour and calls `UserDeleteService.purge_deleted_users()` which:

  - Reads `"purge_user"` jobs from the table and purges the corresponding users' data, up to a limit. If everything has been purged, `purge_deleted_users()` deletes the `"purge_user"` job. Otherwise it leaves the job on the queue and it will pick up the same job again next time and continue working on it.
  - `purge_deleted_users()` doesn't actually delete annotations from the DB, it just marks them as deleted by setting `Annotation.deleted=True`. The existing `purge_deleted_annotations()` task will come along later and actually delete them from the DB. This is the same way that annotations are deleted when deleting them one-by-one via the API, so it's good to be consistent. It's done this way because the WebSocket code can have issues if annotations are deleted immediately. Also, `JobQueueService.add_by_id()` can only add a `"sync_annotation"` job for an annotation if that annotation still exists in the DB.
  - Whenever `purge_deleted_users()` marks an annotation as deleted it also adds a `"sync_annotation"` job to the job queue. This is done in the same DB transaction, so it's not possible to mark an annotation as deleted without adding a `"sync_annotation"` job to the queue.
  - `sync_annotations()` will then consume these `"sync_annotation"` jobs and delete the annotations from Elasticsearch, and won't delete the `"sync_annotation"` job from the queue until it has verified that the annotation has been deleted from Elasticsearch.
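The atomicity claim above (user flag, `"purge_user"` job, and `user_deletion` row all land or none do) can be sketched with sqlite3 stand-in tables. None of this is h's real schema or service code; it only illustrates the one-transaction property:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, deleted INTEGER DEFAULT 0);
    CREATE TABLE job (id INTEGER PRIMARY KEY, name TEXT, user_id INTEGER);
    CREATE TABLE user_deletion (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO "user" (id) VALUES (1), (2);
    """
)


def delete_user(db, user_id):
    # All three writes happen in ONE transaction: `with db` commits on
    # success and rolls back if anything inside raises.
    with db:
        db.execute('UPDATE "user" SET deleted = 1 WHERE id = ?', (user_id,))
        db.execute(
            "INSERT INTO job (name, user_id) VALUES ('purge_user', ?)", (user_id,)
        )
        db.execute("INSERT INTO user_deletion (user_id) VALUES (?)", (user_id,))


delete_user(db, 1)  # user 1: flagged, job queued, audit row added

# Simulate a failure part-way through: nothing is left behind for user 2.
try:
    with db:
        db.execute('UPDATE "user" SET deleted = 1 WHERE id = ?', (2,))
        raise RuntimeError("simulated failure before the job is queued")
except RuntimeError:
    pass
```

After this runs, user 1 is marked deleted with a matching job and audit row, while user 2 is untouched because the failed transaction rolled back.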
Testing

1. Checkout "Add `purge_deleted_users()` task to h" (h-periodic#441) and run h-periodic's `make dev`.
2. You might want to hack h-periodic to run the `purge-deleted-annotations` and `purge-deleted-users` tasks more frequently.
3. You might want to hack h to delete fewer rows per run of the `purge_deleted_users()` task, to force it to take more than one task run to purge the test user.
4. Log in to http://localhost:5000/login as `devdata_admin` in one browser and as `devdata_user` in another browser (logging in will create two authtickets in the DB, one of which will be deleted when we delete `devdata_user` below).
5. Create a developer token for `devdata_user`: go to http://localhost:5000/account/developer and click Generate your API token.
6. Go to http://localhost:5000/docs/help as `devdata_user` and log in to the client. This will create an OAuth token in the DB.
7. Create a couple of annotations on http://localhost:5000/docs/help as `devdata_user`.
8. Go to http://localhost:5000/docs/help as `devdata_admin`, log in, and flag one of `devdata_user`'s annotations by clicking the flag icon on the annotation.
9. Create an annotation on http://localhost:5000/docs/help as `devdata_admin`.
10. Go to http://localhost:5000/docs/help as `devdata_user` and flag `devdata_admin`'s annotation.
11. As `devdata_admin` go to http://localhost:5000/admin/features/cohorts, create a feature cohort, and add `devdata_user` to the feature cohort.
12. As `devdata_user` go to http://localhost:5000/groups/new and create three groups.
13. As `devdata_user` go to http://localhost:5000/docs/help and create some annotations in the first group.
14. Have `devdata_admin` join the second group (you need to copy the group's invite link as `devdata_user` and open it as `devdata_admin`).
15. Have `devdata_admin` join the third group and create an annotation in it.
16. As `devdata_admin` go to http://localhost:5000/groups/new, create a group, copy the invite link, and have `devdata_user` join the group.

At this point you should be able to see lots of data belonging to `devdata_user` and their annotations in the DB: annotations in the `annotation` table; an auth ticket (used for logging in to h's web pages) in `authticket`; a feature cohort membership in `featurecohort_user`; flags in `flag`; groups in `group`; group memberships in `user_group`; and an OAuth 2 token (used by the client) and a developer token in `token`. There are some other related tables where users may have data, e.g. `annotation_slim`, `annotation_metadata`, `annotation_moderation`, `user_identity`, but these have foreign keys to `annotation` or `user` with `ON DELETE CASCADE` so any data in these tables will be deleted when the annotations or user are deleted.

You should also be able to see `devdata_user`'s annotations in Elasticsearch at http://localhost:9200/_search.

Go to http://localhost:5000/admin/users?username=devdata_user&authority=localhost as `devdata_admin` and delete `devdata_user`.

You should see output from the purge task in the terminal. You should also see messages from the `sync_annotations` task about syncing the deleted annotations; it can take a minute or two after deleting the annotations from the DB before these appear.

If you inspect the DB and Elasticsearch you should see that all the user's data and annotations are gone.