
Slow delete/updates with database #714

Open
Stonemansrs opened this issue Apr 3, 2024 · 9 comments
Labels: enhancement (New feature or request)
@Stonemansrs

As Ben is aware from previous conversations, delete times for datasets are very slow (the same applies to updates). Deleting a single dataset sometimes took 30 minutes, or in the most extreme cases 2 hours. It has been more time-efficient to drop my entire schema and re-add all my data than to delete or update specific datasets. I investigated by running EXPLAIN in Postgres to see which steps of a dataset delete were slowest (my output is in the attached spreadsheet). The slowness is caused by delete cascades having to look up foreign keys without index support. I iteratively added indexes on the foreign keys causing the largest slowdowns (one step alone took 17 minutes...) and eventually the deletes were taking ~10 seconds.
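To illustrate why an unindexed foreign key makes cascaded deletes expensive, here is a minimal, self-contained sketch. The `dataset`/`reaction` tables are a toy stand-in for the real ORD schema, and SQLite is used only because it reproduces the same behavior as Postgres here: neither engine automatically indexes the referencing side of a foreign key, so each cascaded delete must find the child rows somehow.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY)")
con.execute("""
    CREATE TABLE reaction (
        id INTEGER PRIMARY KEY,
        dataset_id INTEGER REFERENCES dataset (id) ON DELETE CASCADE
    )
""")
# Without this index, every cascaded delete scans the whole child table
# to find the rows referencing the deleted parent.
con.execute("CREATE INDEX ix_reaction_dataset_id ON reaction (dataset_id)")

con.execute("INSERT INTO dataset (id) VALUES (1)")
con.executemany("INSERT INTO reaction (dataset_id) VALUES (?)", [(1,)] * 96)

con.execute("DELETE FROM dataset WHERE id = 1")
remaining = con.execute("SELECT COUNT(*) FROM reaction").fetchone()[0]
print(remaining)  # 0: the cascade removed every child row
```

At ORD scale the cascades are several levels deep, so one missing index is multiplied by the row count at every level below it.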

I would like to suggest adding indexes on these foreign keys to speed up these operations, as long as it does not significantly increase dataset-addition time. As we all know, extra indexes can slow down inserts, so I can understand if you don't wish to add them. My datasets are mostly 96-well plates or smaller, while some of the ORD datasets are large sets of patent data. I know spinning up the entire ORD database can currently take days, so I understand if the team is worried about multiplying that time.

For my pipeline right now, I can apply the indexing when and where I need it by running some SQL. I don't want to modify the ORD ORM myself, so I don't have to worry about future conflicts; if the indexing is added by ORD, I won't have to run those extra lines of SQL. I have tested adding 10 datasets after this index change and I am not seeing significant changes to the addition times (my entire ETL pipeline for this dataset type took ~20 s before, and still takes ~20 s after adding the indexes).
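The insert-overhead question can be sandboxed the same way. A rough sketch, again with a toy two-table schema in in-memory SQLite; absolute numbers are illustrative only, the point is comparing the same bulk insert with and without the foreign-key index:

```python
import sqlite3
import time

def insert_time(with_index: bool) -> float:
    """Time a bulk insert of child rows, optionally with an FK index."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY)")
    con.execute(
        "CREATE TABLE reaction (id INTEGER PRIMARY KEY, "
        "dataset_id INTEGER REFERENCES dataset (id))"
    )
    if with_index:
        con.execute("CREATE INDEX ix_reaction_dataset_id ON reaction (dataset_id)")
    con.execute("INSERT INTO dataset (id) VALUES (1)")
    rows = [(1,)] * 10_000
    start = time.perf_counter()
    con.executemany("INSERT INTO reaction (dataset_id) VALUES (?)", rows)
    con.commit()
    return time.perf_counter() - start

baseline = insert_time(with_index=False)
indexed = insert_time(with_index=True)
print(f"no index: {baseline:.4f}s, with index: {indexed:.4f}s")
```

For write-heavy bulk loads (like rebuilding the whole ORD database) one alternative worth considering is creating the indexes after the load rather than before it.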

ORD Slow Delete Explain Analysis.xlsx

@Stonemansrs Stonemansrs added the enhancement New feature or request label Apr 3, 2024
@bdeadman
Collaborator

bdeadman commented Apr 5, 2024

Hi Shane. Thanks for the detailed feedback and the work you've done investigating this.

When I looked at this previously I was seeing deletion times of 2-3 minutes for the test dataset I was using (most likely one of the example datasets pulled in by ord-interface during test builds). So it was slow compared to the addition time for the same dataset, but not as bad as 30-120 minutes. I'll have to retest on some larger datasets, but if you have a particular public ORD dataset in mind then just let us know.

We'll discuss the slow deletions and foreign key indexing at our next ORD maintenance meeting and get back to you.

@Stonemansrs
Author

Thanks Ben,
I think some of the HTE datasets are more representative. I'm running an EXPLAIN on the delete of the dataset with Name: "HTE Pd-catalyzed cross-coupling screen" and Description: "Reactions from Experiment 2 of DOI: 10.1126/science.1259203" (I'm unsure of the best way to point you to the same dataset, so let me know if you have trouble finding it).

I'm running the code below and it has already been going for 13 minutes. I'll update with the actual time once it finishes and try to get an output like the previous spreadsheet.

BEGIN;
EXPLAIN (ANALYZE, BUFFERS, TIMING)
DELETE FROM ord.dataset d
WHERE d.id = '16006';
ROLLBACK;
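As a side note on reading the plans: the "index vs. scan" distinction shows up the same way outside Postgres, so the effect of the FK index can be prototyped cheaply. A minimal sketch with SQLite's EXPLAIN QUERY PLAN (toy tables, not the ORD schema), where the planner reports a SEARCH using the new index instead of a full-table SCAN:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY)")
con.execute(
    "CREATE TABLE reaction (id INTEGER PRIMARY KEY, "
    "dataset_id INTEGER REFERENCES dataset (id), payload TEXT)"
)
con.execute("CREATE INDEX ix_reaction_dataset_id ON reaction (dataset_id)")

# The last column of each plan row is the human-readable plan step.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM reaction WHERE dataset_id = 1"
).fetchall()
details = " ".join(row[-1] for row in plan)
print(details)  # e.g. "SEARCH reaction USING INDEX ix_reaction_dataset_id ..."
```

In Postgres the analogous tell in EXPLAIN (ANALYZE, BUFFERS) output is a Seq Scan on a large child table inside the cascade trigger turning into an Index Scan once the index exists.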

@Stonemansrs
Author

The delete has been running for 3 hours so far. I'll let it run overnight so I can try to get the output.

@bdeadman
Collaborator

The dataset id is ord_dataset-7d8f5fd922d4497d91cb81489b052746
link to ORD Browser entry
link to GitHub data file

This dataset contains 1536 datapoints.

@bdeadman
Collaborator

Tomorrow I will have a go at deleting this dataset and timing it.

@bdeadman
Collaborator

I did try your SQL deletion script, but I gave up after it hung there for 10 minutes. It may just have been too slow, or there may have been a mistake in the way I modified the script.

Instead I have been trying to use the delete_dataset function in the ORD ORM. Below is the Python script I created to delete the HTE Pd-catalyzed cross-coupling screen dataset. While it was slow (compared to adding the same dataset), it took only 297 seconds to complete. Unfortunately, it does not appear to actually delete the dataset from the database, which is strange since my notes show it was working when we looked at this a few months back.

I'm going to look into this more.

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from ord_schema.orm.database import delete_dataset

dataset_id = "ord_dataset-00005539a1e04c809a9a78647bea649c"
database = 'ord_database'
username = 'postgres'
password = 'postgres'
host = 'localhost'
port = 5432


connection_string = f"postgresql://{username}:{password}@{host}:{port}/{database}"
engine = create_engine(connection_string, future=True)
with Session(engine) as session:
    delete_dataset(dataset_id, session)
    session.flush()  # note: flush() emits the SQL but does not commit the transaction
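One possible explanation for the delete not persisting, assuming delete_dataset issues its deletes through the passed-in session: flush() sends the DELETE statements to the database but never commits, and SQLAlchemy rolls the open transaction back when the session closes. A minimal sketch of that behavior against an in-memory SQLite database (toy Dataset model, not the ORD ORM):

```python
from sqlalchemy import Column, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "dataset"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://", future=True)
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Dataset(id=1))
    session.commit()

with Session(engine) as session:
    session.delete(session.get(Dataset, 1))
    session.flush()  # DELETE is sent, but the transaction is never committed

# The session closed without commit, so the flushed delete was rolled back.
with Session(engine) as session:
    survivors = session.execute(select(Dataset)).scalars().all()
print(len(survivors))  # the row is still there
```

If that is the cause, replacing flush() with commit() in the script above (or committing inside delete_dataset) would make the delete stick.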

@bdeadman
Collaborator

Note that the dataset_id is changed to "ord_dataset-7d8f5fd922d4497d91cb81489b052746" for the HTE Pd-catalyzed cross-coupling screen dataset.

@bdeadman bdeadman self-assigned this Apr 18, 2024
@skearnes
Collaborator

I'll take a look at this.

@skearnes skearnes self-assigned this Apr 19, 2024
@bdeadman bdeadman assigned bdeadman and skearnes and unassigned skearnes and bdeadman Apr 19, 2024
@skearnes
Collaborator

We have a test for this on the ORM side, but it might not reflect the reality of the underlying database: https://github.com/open-reaction-database/ord-schema/blob/main/ord_schema/orm/database_test.py#L37. I'll take a look and get back to you next week.
