Slow delete/updates with database #714
Hi Shane. Thanks for the detailed feedback and the work you have done investigating this. When I looked at this previously I was seeing deletion times of 2-3 minutes for the test dataset I was using (most likely one of the example datasets pulled in by ord-interface during test builds). So it was slow compared to the addition times for the same dataset, but not as bad as 30-120 minutes. I'll have to retest on some larger datasets, but if you have a particular public ORD dataset in mind, just let us know. We'll discuss the slow deletions and foreign key indexing at our next ORD maintenance meeting and get back to you.
Thanks Ben, I'm running the code below and I've already passed 13 minutes. I'll update with the actual time once it finishes and try to get an output like the previous Excel file.

begin;
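The general pattern being run here, a delete wrapped in a transaction so it can be timed (and, if desired, rolled back without committing), can be sketched generically. This is not the real ORD schema: sqlite3 stands in for PostgreSQL, and the table and column names are hypothetical.

```python
# Sketch: time a cascade delete inside an explicit transaction, then roll
# it back so the data survives. Table/column names are hypothetical, NOT
# the actual ORD schema; sqlite3 is a stand-in for PostgreSQL.
import sqlite3
import time

DATASET_ID = "ord_dataset-7d8f5fd922d4497d91cb81489b052746"

# isolation_level=None gives manual transaction control (explicit BEGIN).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE dataset (id TEXT PRIMARY KEY);
    CREATE TABLE reaction (
        id INTEGER PRIMARY KEY,
        dataset_id TEXT REFERENCES dataset (id) ON DELETE CASCADE
    );
""")
conn.execute("INSERT INTO dataset VALUES (?)", (DATASET_ID,))
conn.executemany("INSERT INTO reaction (dataset_id) VALUES (?)",
                 [(DATASET_ID,)] * 1536)

start = time.perf_counter()
conn.execute("BEGIN")  # the begin; from the snippet above
conn.execute("DELETE FROM dataset WHERE id = ?", (DATASET_ID,))
elapsed = time.perf_counter() - start
conn.execute("ROLLBACK")  # undo the delete: keep the timing, keep the data

print(f"delete took {elapsed:.3f}s")
n_reactions = conn.execute("SELECT COUNT(*) FROM reaction").fetchone()[0]
print(n_reactions)
```

Because the transaction is rolled back, all 1536 child rows (and the dataset row) are still present afterwards, which makes this a safe way to measure delete cost repeatedly.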
The delete has been running for 3 hours so far. I'll let it run overnight so I can try to get the output.
The dataset id is ord_dataset-7d8f5fd922d4497d91cb81489b052746. This dataset contains 1536 datapoints.
Tomorrow I will have a go at deleting this dataset and timing it.
I did try your SQL deletion script, but I gave up after it hung for 10 minutes. It could be that it was just too slow, or it could be a mistake in the way I modified the script. Instead I have been trying to use the delete_dataset function in the ORD ORM. Below is the Python script I created to delete the HTE Pd-catalyzed cross-coupling screen dataset. While it was slow (compared to adding the same dataset), it only took 297 seconds to complete. Unfortunately it does not appear to actually be deleting the dataset from the database, which is strange since my notes show it was working when we looked at this a few months back. I'm going to look into this more.
Note that the dataset_id is changed to "ord_dataset-7d8f5fd922d4497d91cb81489b052746" for the HTE Pd-catalyzed cross-coupling screen dataset.
I'll take a look at this.
We have a test for this on the ORM side, but it might not be reflecting the reality of the underlying database: https://github.com/open-reaction-database/ord-schema/blob/main/ord_schema/orm/database_test.py#L37. I'll take a look and get back to you next week.
As Ben is aware from previous conversations, the delete times for datasets are very slow (the same would apply to updates). Deletes of a single dataset were sometimes taking 30 minutes, or in the most extreme cases 2 hours. It has been more time-efficient to drop my entire schema and re-add all my data than to delete/update specific datasets. I did some investigating using EXPLAIN in Postgres to see where the slowest steps of deleting a dataset were (I've attached my output in the Excel file). The slowness is caused by delete cascades having to look up foreign keys: when the referencing column has no index, every cascaded delete requires a scan of the child table. I iteratively added indexes on the foreign keys that were causing the largest slowdowns (one was 17 minutes...) and eventually the deletes were taking 10 seconds.
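The mechanism described above, a cascade delete doing a per-parent lookup in each child table, can be seen directly with EXPLAIN. A minimal, self-contained illustration, using sqlite3 as a stand-in for PostgreSQL and hypothetical table names rather than the real ORD schema:

```python
# Why unindexed foreign keys make cascade deletes slow: deleting a parent
# row forces a lookup of child rows referencing it, and without an index
# on the FK column that lookup is a full table scan. sqlite3 stands in
# for PostgreSQL here; table names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id TEXT PRIMARY KEY);
    CREATE TABLE reaction (
        id INTEGER PRIMARY KEY,
        dataset_id TEXT REFERENCES dataset (id) ON DELETE CASCADE
    );
""")

def child_lookup_plan():
    # The lookup a cascade delete performs for each deleted parent row.
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM reaction WHERE dataset_id = ?",
        ("x",),
    ).fetchall()
    return rows[0][3]  # human-readable plan detail

before = child_lookup_plan()  # full scan of reaction
conn.execute("CREATE INDEX ix_reaction_dataset_id ON reaction (dataset_id)")
after = child_lookup_plan()   # index search on the FK column
print(before)
print(after)
```

The same reasoning applies in PostgreSQL: running EXPLAIN ANALYZE on the DELETE shows sequential scans of the child tables until the referencing columns are indexed, which is exactly the pattern in the attached spreadsheet.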
I would like to suggest adding some additional indexes on foreign keys to help speed up these operations, as long as it does not cause significant increases in dataset addition time. As we all know, extra indexes can slow down adding data to a database, so I can understand if you don't wish to add these. My datasets are mainly 96-well plates or smaller, while some of the ORD datasets are large sets of patent data. I know that currently spinning up the entire ORD database can take days, so I understand if the team is worried about multiplying that time.
For my pipeline right now, I can apply the indexing when and where I need it by running some SQL. I don't want to modify the ORD ORM itself, so that I don't have to worry about future conflicts; if the indexing is added by ORD, then I won't have to run those extra lines of SQL. I have tested adding 10 datasets after this index change and I am not seeing significant changes to the addition times (my entire ETL pipeline for this dataset type takes ~20s, and it still takes ~20s after adding the indexing).
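For reference, the kind of SQL involved is just per-column index DDL along these lines. The table and column names below are illustrative assumptions, not the actual ORD schema; the real candidates would come from the EXPLAIN output in the attached spreadsheet.

```sql
-- Hypothetical examples only: index each foreign-key column that the
-- cascade from the dataset table has to traverse.
CREATE INDEX IF NOT EXISTS ix_reaction_dataset_id
    ON reaction (dataset_id);
CREATE INDEX IF NOT EXISTS ix_reaction_input_reaction_id
    ON reaction_input (reaction_id);
```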
ORD Slow Delete Explain Analysis.xlsx