Slow delete/updates with database #714

Stonemansrs · 2024-04-03T17:05:18Z

As Ben is aware from previous conversations, the delete times for datasets are very slow. (this would be the same for updates) Deletes of a single dataset were sometimes taking 30min or in the most extreme cases 2hr. It's been more time efficient to drop my entire schema and re-add all my data rather than delete/update specific datasets. I did some investigating by using "explain" in postgres to see where the slowest steps of deleting a dataset were (I've attached my output in the excel). This slowness is caused by delete cascades having to lookup fks. I iteratively added indexing to fks that were causing the largest slowdowns (one was 17min...) and eventually the deletes were taking 10s.

I would like to suggest adding some additional indexing on fks to help speed up these operations as long as it does not cause significant increases in dataset adding time. As we all know, adding fks can slow down adding data to a database. I can understand if you don't wish to add these. My datasets are mainly 96 well plates or smaller, while some of the ORD datasets are large sets of patent data. I know currently spinning up the entire ord data can take days. So I understand if the team is worried about multiplying that time.

For my pipeline right now, I can apply the indexing when and where I need it to by running some sql. I don't want to mess with modifying the ord orm so I don't have to worry about future conflicts. If the indexing is added by ord, then I won't have to run those extra lines of sql. I have tested adding 10 datasets after this index change and I am not seeing significant changes to the addition times. (my entire ETL pipeline for this dataset type takes ~20s and it is still taking ~20s after adding the indexing)

ORD Slow Delete Explain Analysis.xlsx

bdeadman · 2024-04-05T10:35:44Z

Hi Shane. Thanks for the detailed feedback and work you have done investigating this.

When I looked at this previously I was seeing deletion times of 2-3 min for the test dataset I was using (most likely one of the examples datasets pulled in by ord-interface during test builds). So it was slow compared to the addition times for the same dataset, but not so bad as 30 - 120 min. I'll have to retest it on some larger datasets but if you have a particular public ORD dataset in mind then just let us know.

We'll discuss the slow deletions and foreign key indexing at our next ORD maintenance meeting and get back to you.

Stonemansrs · 2024-04-10T16:46:05Z

Thanks Ben,
I think some of the HTE datasets are more representative. I'm running an explain on the delete of Name: "HTE Pd-catalyzed cross-coupling screen" Description: "Reactions from Experiment 2 of DOI: 10.1126/science.1259203" (I'm unsure the best way to point you to the same dataset, so let me know if you have issues finding it).

I'm running the below code and I've already passed 13min. I'll update with the actual time once it finishes and try to get an output like the previous excel.

begin;
explain (analyze,buffers,timing)
delete from ord.dataset d
where d.id = '16006';
rollback;

Stonemansrs · 2024-04-10T21:00:28Z

The delete has been running for 3 hours so far. I'll let it run overnight so I can try to get the output.

bdeadman · 2024-04-10T22:15:19Z

The dataset id is ord_dataset-7d8f5fd922d4497d91cb81489b052746
link to ORD Browser entry
link to GitHub data file

This dataset contains 1536 datapoints.

bdeadman · 2024-04-10T22:16:19Z

Tomorrow I will have a go at deleting this dataset and time it.

bdeadman · 2024-04-11T23:02:00Z

I did try your SQL deletion script but I gave up after it hang there for 10 minutes. It could be it was just too slow, or it could be a mistake in the way I modified the script.

Instead I have been trying to use the delete_dataset function in the ord orm. Below is the Python script I created to do the deletion of the HTE Pd-catalyzed cross-coupling screen dataset. While it was slow (compared to adding the same dataset) it only took 297 seconds to complete. Unfortunately it does not appear to be actually deleting the dataset from the database - which is strange since my notes show it was working when we looked at this a few months back.

I'm going to look into this more.

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from ord_schema.orm.database import delete_dataset

dataset_id = "ord_dataset-00005539a1e04c809a9a78647bea649c"
database = 'ord_database'
username = 'postgres'
password = 'postgres'
host = 'localhost'
port = 5432


connection_string = f"postgresql://{username}:{password}@{host}:{port}/{database}"
engine = create_engine(connection_string, future=True)
with Session(engine) as session:
    delete_dataset(dataset_id, session)
    session.flush()

bdeadman · 2024-04-11T23:03:11Z

Note that the dataset_id is changed to "ord_dataset-7d8f5fd922d4497d91cb81489b052746" for the HTE Pd-catalyzed cross-coupling screen dataset.

skearnes · 2024-04-19T14:25:32Z

I'll take a look at this.

skearnes · 2024-05-31T14:10:19Z

We have a test for this on the ORM side, but it might not be reflecting the reality of the underlying database: https://github.com/open-reaction-database/ord-schema/blob/main/ord_schema/orm/database_test.py#L37. I'll take a look and get back to you next week.

Stonemansrs added the enhancement New feature or request label Apr 3, 2024

bdeadman self-assigned this Apr 18, 2024

skearnes self-assigned this Apr 19, 2024

bdeadman assigned bdeadman and skearnes and unassigned skearnes and bdeadman Apr 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow delete/updates with database #714

Slow delete/updates with database #714

Stonemansrs commented Apr 3, 2024

bdeadman commented Apr 5, 2024

Stonemansrs commented Apr 10, 2024

Stonemansrs commented Apr 10, 2024

bdeadman commented Apr 10, 2024

bdeadman commented Apr 10, 2024

bdeadman commented Apr 11, 2024

bdeadman commented Apr 11, 2024

skearnes commented Apr 19, 2024

skearnes commented May 31, 2024

Slow delete/updates with database #714

Slow delete/updates with database #714

Comments

Stonemansrs commented Apr 3, 2024

bdeadman commented Apr 5, 2024

Stonemansrs commented Apr 10, 2024

Stonemansrs commented Apr 10, 2024

bdeadman commented Apr 10, 2024

bdeadman commented Apr 10, 2024

bdeadman commented Apr 11, 2024

bdeadman commented Apr 11, 2024

skearnes commented Apr 19, 2024

skearnes commented May 31, 2024