
Bug: Memory leak persists after update 1.13.1 #1142

Open
mchinchilla opened this issue Mar 17, 2023 · 7 comments
@mchinchilla

Bug Description

Hi Michael,
I'm using RepoDb.PostgreSql.BulkOperations version 1.13.1. This application parses a lot of CDR files and inserts them into a PostgreSQL database, but when my application executes this line:

await db.InsertAllAsync( tableName: "schema.table", recordsFinal, batchSize: insertBuffer, transaction: transaction );

Memory usage increases and never decreases, no matter if I force the GC to collect and clean.
This application runs on a Debian Linux server with 32 GB of RAM and PostgreSQL 13.x; I'm using the latest version of RepoDb.PostgreSql.BulkOperations and .NET 6.0.
When the application's memory usage reaches the 32 GB of RAM, the server starts to swap and the application stops working.

And it seems to be incremental: for example, it starts at 100 MB and with each file it processes it grows, 200 MB, 400 MB... 3 GB, etc.
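For context, a minimal sketch of the processing loop being described (the CdrRecord type, the ParseCdrFile helper, and the connection handling are illustrative placeholders, not the real application code; RepoDB initialization for PostgreSQL is assumed to have been done at startup):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Npgsql;
using RepoDb;

// Hypothetical entity representing one parsed CDR row.
public class CdrRecord
{
    public long Id { get; set; }
    public string Caller { get; set; } = string.Empty;
    // ... remaining columns elided ...
}

public static class CdrLoader
{
    // Sketch of the per-file loop described above.
    public static async Task ProcessFilesAsync(
        IEnumerable<string> files, string connectionString, int insertBuffer)
    {
        foreach (var file in files)
        {
            // Hypothetical parser producing the rows for one CDR file.
            List<CdrRecord> recordsFinal = ParseCdrFile(file);

            using var db = new NpgsqlConnection(connectionString);
            db.Open();
            using var transaction = db.BeginTransaction();

            // Memory grows across repeated calls to this line (see discussion below).
            await db.InsertAllAsync(
                tableName: "schema.table",
                recordsFinal,
                batchSize: insertBuffer,
                transaction: transaction);

            transaction.Commit();
        }
    }

    private static List<CdrRecord> ParseCdrFile(string path) => new(); // placeholder
}
```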

Images
[screenshot attached]

Library Version:
RepoDb.PostgreSql.BulkOperations 1.13.1

@mchinchilla mchinchilla added the bug Something isn't working label Mar 17, 2023
@mikependon mikependon added todo Things to be done in the future priority Top priority feature or things to do labels Mar 17, 2023
@mikependon (Owner)

For more information, I am referencing these discussions from Twitter.
Link: https://twitter.com/mike_pendon/status/1636489405098491910

@mikependon (Owner) commented Mar 17, 2023

I have investigated this and it sounds to me like it is working as expected. But, as a user, you must know the caveat of it. In short, this is not a memory leak!

RepoDB requires this caching in order to perform the batch operations' insertions more performantly.

Explanation

Below is a screenshot and a very small project that we used for simulation and replication. The project below requires a SQL Server instance, as we did the simulation there. Unfortunately, SQL Server and MDS only allow a maximum of 2100 parameters, so you will see that I can only hit a maximum batchSize of 20 for 20 rows with 50-column data entities.

Project:

This project is good for a small simulation of this kind of issue.
InsertAllMemoryLeaks-InsertAll.zip

Screenshot:

[screenshot: memory usage of the InsertAll simulation]

What the program does (see the sketch after this list):

  • It first creates a table named [dbo].[BigTable] with 120 columns.
  • The program iterates infinitely, doing an InsertAll operation on each pass.
  • In every iteration, the program creates a list of BigTable entities with a maximum of 50 columns. The length of the list varies from 1 up to the maximum batchSize. This is to ensure that RepoDB creates and caches each buffer's command text in memory.
  • It inserts the created list of BigTable entities into the table.
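A condensed sketch of that simulation (the BigTable entity, the connection string, and the loop shape are illustrative; the attached InsertAllMemoryLeaks-InsertAll.zip is the authoritative version, and RepoDB initialization for SQL Server is assumed to have been done beforehand):

```csharp
using System;
using System.Linq;
using Microsoft.Data.SqlClient;
using RepoDb;

// Illustrative stand-in for the 120-column table; only a few columns shown.
public class BigTable
{
    public long Id { get; set; }
    public string Column1 { get; set; } = string.Empty;
    public string Column2 { get; set; } = string.Empty;
    // ... up to Column120 in the real simulation ...
}

public static class Simulation
{
    public static void Run(string connectionString)
    {
        const int maxBatchSize = 20; // MDS caps a command at 2100 parameters

        using var connection = new SqlConnection(connectionString);
        connection.Open();

        var random = new Random();
        while (true)
        {
            // Vary the row count from 1..maxBatchSize so that every batch size
            // gets its own command text created and cached by the library.
            var rowCount = random.Next(1, maxBatchSize + 1);
            var rows = Enumerable.Range(0, rowCount)
                .Select(i => new BigTable { Column1 = $"Value{i}", Column2 = $"Value{i}" })
                .ToList();

            connection.InsertAll(rows, batchSize: rowCount);
        }
    }
}
```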

Observation

Fluctuations:

In the first few seconds, the memory fluctuates a lot. This is because, when the INSERT statement for a given number of rows is created, the library puts it in the cache.

Truth: If you insert 1 row, it creates 1 INSERT statement and caches it. If you insert 2 rows, it creates 2 INSERT statements and caches them, and so forth.

The batchSize is the maximum number of INSERT statements it will create and cache in memory. So, in short, you will have 20 cached INSERT command texts in memory (each of which has a different number of parameters, based on the columns provided on the data entities).

Flat-Lines:

You will notice the flat-lines after those 20 INSERT statements have been cached in memory. This is because the library is no longer creating INSERT statements; instead, it simply reuses the cached one that matches the number of rows you are inserting. A simplified model of this is sketched below.
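This is not RepoDB's actual internal code, only an illustration of a per-batch-size command-text cache and why the memory stops growing once every row count has been seen:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Simplified model: one command text is built per (table, row count) pair and
// then reused, so memory grows until every row count up to batchSize has been
// seen once, after which the allocations flat-line.
public static class CommandTextCacheModel
{
    private static readonly ConcurrentDictionary<(string Table, int Rows), string> Cache = new();

    public static string GetInsertAllText(string table, int rows, IReadOnlyList<string> columns) =>
        Cache.GetOrAdd((table, rows), key =>
        {
            var builder = new StringBuilder();
            for (var i = 0; i < key.Rows; i++)
            {
                // One INSERT per row, each row with its own parameter suffix.
                builder.Append($"INSERT INTO {key.Table} (")
                       .Append(string.Join(", ", columns))
                       .Append(") VALUES (")
                       .Append(string.Join(", ", columns.Select(c => $"@{c}_{i}")))
                       .Append(");");
            }
            return builder.ToString();
        });
}
```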

Behavior Extent

This kind of behavior is expected and is also present in both MergeAll and UpdateAll.

Conclusion

The number of cached statements varies with the batchSize you pass to the batch operations (i.e., InsertAll, MergeAll and UpdateAll). The size of each cached statement varies with the size of the entity schema (i.e., the number of columns).
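To get a rough feel for how that scales, here is an illustrative back-of-the-envelope calculation. The per-placeholder byte count is an assumption, and this counts only the cached command text itself, not parameter objects or other per-call allocations:

```csharp
using System;

// Assumed scenario: batchSize of 100 and 120 columns, as in the PGSQL repro further below.
int batchSize = 100;
int columns = 120;
int bytesPerPlaceholder = 24; // assumed average for "@ColumnName_NN, " stored as UTF-16

// One command text is cached per row count 1..batchSize; the text for k rows
// contains k * columns parameter placeholders.
long totalPlaceholders = 0;
for (var k = 1; k <= batchSize; k++)
{
    totalPlaceholders += (long)k * columns;
}

Console.WriteLine(totalPlaceholders);                        // 606000 placeholders
Console.WriteLine(totalPlaceholders * bytesPerPlaceholder);  // 14544000 bytes, roughly 14 MB of text alone
```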

Optimizations

Currently, RepoDB creates multiple INSERT statements per batch insertion, one for each row. See below.

  • InsertAll with 1 row:
    Statement cached:
    INSERT INTO [Table] VALUES (ColumnFirst, ..., ColumnLast);
    
  • InsertAll with 5 rows:
    Statements cached:
    INSERT INTO [Table] VALUES (ColumnFirst, ..., ColumnLast);
    INSERT INTO [Table] VALUES (ColumnFirst2, ..., ColumnLast2);
    INSERT INTO [Table] VALUES (ColumnFirst3, ..., ColumnLast3);
    INSERT INTO [Table] VALUES (ColumnFirst4, ..., ColumnLast4);
    INSERT INTO [Table] VALUES (ColumnFirst5, ..., ColumnLast5);
    
  • And so forth...

The statements above are verbose and also do not use the more optimal multi-row insert. This can be optimized as shown below.

  • InsertAll with 3 rows:
    Statements cached:
    INSERT INTO [Table]
    VALUES
    (ColumnFirst, ..., ColumnLast),
    (ColumnFirst2, ..., ColumnLast2),
    (ColumnFirst3, ..., ColumnLast3);
    

With that approach, it would eliminate so many characters from memory.
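A sketch of what generating that multi-row form could look like (the helper and column names are illustrative, not the library's actual statement builder):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class MultiRowInsertSketch
{
    // Builds a single multi-row INSERT instead of one INSERT statement per row,
    // which shrinks the text that would need to be cached per batch size.
    public static string Build(string table, IReadOnlyList<string> columns, int rows)
    {
        var builder = new StringBuilder()
            .Append($"INSERT INTO {table} (")
            .Append(string.Join(", ", columns))
            .Append(") VALUES ");

        for (var i = 0; i < rows; i++)
        {
            if (i > 0) builder.Append(", ");
            builder.Append('(')
                   .Append(string.Join(", ", columns.Select(c => $"@{c}_{i}")))
                   .Append(')');
        }

        return builder.Append(';').ToString();
    }
}

// MultiRowInsertSketch.Build("[Table]", new[] { "ColumnFirst", "ColumnLast" }, 3)
// returns (wrapped here for readability):
// INSERT INTO [Table] (ColumnFirst, ColumnLast) VALUES
//   (@ColumnFirst_0, @ColumnLast_0), (@ColumnFirst_1, @ColumnLast_1), (@ColumnFirst_2, @ColumnLast_2);
```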

@mikependon (Owner)

Referencing: #380

@mikependon (Owner)

We will create the simulation against a PGSQL database; we are hopeful that this 2100-parameter limit is not present there so we can simulate your use case. We will post the result here once done.

@mikependon (Owner) commented Mar 17, 2023

If you use BulkInsert (equivalent to BinaryBulkImport), you are leveraging the real bulk operations and these caches are skipped. In the screenshot below, you will notice that we adjusted the batchSize to 1000 with 100 columns, and yet the memory usage is not increasing.

[screenshot: memory usage stays flat with BulkInsert at batchSize 1000 and 100 columns]

You will also notice that the higher the batchSize, the faster it inserts the data. This is because a real bulk insert benefits from bringing all the data at once from the client into the database server - that is where it performs much better.

Note: If you bulk insert 1000 rows and set the batchSize to 1, it will iterate 1000 times (this is the behavior of SqlBulkCopy itself). Therefore, if you bulk insert 1000 rows and set the batchSize to 1000 as well, it will bring all the data at once.

EDIT: Attaching InsertAllMemoryLeaks-BulkInsert.zip.
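For reference, a minimal sketch of that bulk path on SQL Server, assuming the RepoDb.SqlServer.BulkOperations extension methods and reusing the illustrative BigTable entity from the simulation sketch above (the batchSize usage follows the discussion in this comment):

```csharp
using System.Collections.Generic;
using Microsoft.Data.SqlClient;
using RepoDb;

public static class BulkInsertSketch
{
    public static void Run(SqlConnection connection, List<BigTable> rows)
    {
        // BulkInsert rides on SqlBulkCopy, so no per-batch command text is cached.
        // With batchSize equal to the row count, everything is shipped at once;
        // with batchSize: 1 it would round-trip once per row instead.
        connection.BulkInsert(rows, batchSize: rows.Count);
    }
}
```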

@mikependon (Owner)

Hmmm - interestingly, it seems I can replicate your issue in PGSQL. I modified the program and enabled a maximum batch size of 100 with 120 columns in the table.

Below is a screenshot of the first few seconds of the run. It first bursts up to 900 MB, drops suddenly, and then slowly climbs, exactly as you explained on Twitter.

[screenshot: first few seconds, a burst to ~900 MB, then a drop and a slow climb]

And a few minutes later, the memory is not going down. It even reached 2 GB and is still climbing.

[screenshot: memory past 2 GB and still climbing]

But, once the library had cached all the row-batch statements (100 statements in total), the execution suddenly became faster and the memory allocations flat-lined.

[screenshot: flat memory line after all 100 statements are cached]

After 5 minutes - the line is now flat and is not climbing anymore.

[screenshot: flat memory line after 5 minutes]

Here is the project that replicates your case.
InsertAllMemoryLeaksPostgreSql.zip

In conclusion, this is not a memory leak. Since the statements cached for your big data entity are themselves big, the library needs that much memory to execute your insertions fast.

@mikependon mikependon added wontfix This will not be worked on working-as-expected The reported issue is working as expected. labels Mar 17, 2023
@mikependon (Owner)

The BinaryBulkInsert only requires 40 MB as it does not cache anything.

[screenshot: ~40 MB memory usage with BinaryBulkInsert]

Project: InsertAllMemoryLeaksPostgreSql-BinaryBulkInsert.zip
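For the original PostgreSQL scenario, switching the per-file loop from InsertAllAsync to the binary import path would look roughly like this. This is a sketch assuming the RepoDb.PostgreSql.BulkOperations BinaryBulkImportAsync extension method and the hypothetical CdrRecord type from the first sketch; RepoDB initialization for PostgreSQL is assumed to have been done elsewhere:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Npgsql;
using RepoDb;

public static class CdrBinaryImport
{
    public static async Task ImportAsync(string connectionString, List<CdrRecord> recordsFinal)
    {
        using var db = new NpgsqlConnection(connectionString);
        db.Open();

        // Streams the rows through the PostgreSQL COPY (binary) protocol, so no
        // batch command text is generated or cached per batch size.
        await db.BinaryBulkImportAsync("schema.table", recordsFinal);
    }
}
```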
