
Enhancement: Refactor Batch Operations SQL Statements #1143

Open
mikependon opened this issue Mar 18, 2023 · 8 comments
@mikependon (Owner)

Describe the enhancement

As a response to #1142, it is important for the library to limit its memory usage while caching the packed-statements per entity/model when calling the batch operations. This ensures more memory is left for the application to use for other purposes.

The rationale

What? In the issue mentioned above, a user of the PGSQL extended library reported a memory leak. They have an entity model that corresponds to a table with 112 columns. During a call to the InsertAll operation with a batchSize of 500, the library's memory usage suddenly spiked to 3GB.

Why? Given that parameter, RepoDB may cache up to a maximum of 500 big packed-statements (one per row count) for such a big table. This is only for a single entity and this single operation, not counting other entities or the other batch operations (i.e., MergeAll, UpdateAll).

In the current version of the library, when you call the InsertAll operation, it caches a packed-statement keyed by the number of rows passed to the operation. (The same is true for MergeAll and UpdateAll.)

To elaborate, if a user passes 3 entities to the operation, the statement below will be created.

INSERT INTO [Table] (Col1, ... Col112) VALUES (@Col1_1, ..., @Col112_1); RETURN SCOPE_IDENTITY();
...
INSERT INTO [Table] (Col1, ... Col112) VALUES (@Col1_3, ..., @Col112_3); RETURN SCOPE_IDENTITY();

Such a statement will be cached in host memory for the 3-row operation. It will be reused when the same entity model (or table) is passed again in the future, but only for 3 rows.

Then, if a user passes 5 entities to the operation, the statement below will also be cached in host memory.

INSERT INTO [Table] (Col1, ... Col112) VALUES (@Col1_1, ..., @Col112_1); RETURN SCOPE_IDENTITY();
...
...
...
INSERT INTO [Table] (Col1, ... Col112) VALUES (@Col1_5, ..., @Col112_5); RETURN SCOPE_IDENTITY();

If the user had set the batchSize to 500, up to 500 of these packed-statements (one for each distinct row count from 1 to 500, each row carrying 112 parameters) could be saved into host memory, resulting in much bigger memory requirements/consumption.

The more rows passed, the bigger the statement that gets cached.

Conclusion

As per the report, and also the screenshots we provided during the investigation, it has used up to 3GB of memory for that single entity. This is alarming, as it would require applications to request high resource limits whenever they need to use the batch capabilities of the library.

Though this is not a functional issue, it is something that requires attention and a revisit to optimize the memory usage.

mikependon added the enhancement, todo, priority, and pre-discussed labels on Mar 18, 2023
mikependon self-assigned this on Mar 18, 2023
@mikependon (Owner, Author)

Referencing: #380

@cajuncoding (Collaborator)

Yeah, 3GB is unacceptably high… but it also feels odd that an otherwise simple string cache would be so large in memory.

Have you tried a POC to read the byte (array) size of the cached strings, just to confirm the actual string size approaches that? While not apples to apples, it can help grasp the actual size of the cached strings, because 3GB would hold more than an entire encyclopedia, which used to fit (with images) on a CD-ROM…

@mikependon (Owner, Author)

> Have you tried a POC to read the byte (array) size of the cached string just to confirm the actual string size approaches that

Yes, we will do this.

In addition, as of this writing, we are investigating which approach is best. Since we cannot eliminate the packed-statement caching that the batch operations require, we should instead focus on the composition of the packed-statements.

Ideally, our initial resolution would be to create the following SQL for every Row-N insertion in a batch operation.

INSERT INTO [TableName] (Col1, ..., Col112)
VALUES
(@p1, ..., @p112),
(@p113, ..., @p224),
(@p225, ..., @p336);

However, renaming the parameters from @Col1_1 to @p1 is a MAJOR change and would likely challenge the library's architectural approach. The approach below can still be met, though.

INSERT INTO [TableName] (Col1, ..., Col112)
OUTPUT INSERTED.[PK/IK]
VALUES
(@Col1_1, ..., @Col112_1),
(@Col1_2, ..., @Col112_2),
(@Col1_3, ..., @Col112_3);

Based on our analysis, it would eliminate more than 75% of the size of the packed-statement.

@mikependon (Owner, Author)

See these 2 packed statements. The new one is only around 50% of the size of the old one. We only eliminated around 55% of the size because we are using a small table in our Regression Test; the bigger the table and the Row-N, the more bytes will be eliminated.

The challenge for us here is: how do we ensure that the RETURNING [Id] rows are in the proper order? This is the issue you reported in the past, which we solved using the extra __RepoDb_OrderColumn_10 column.

InsertAll-New-Packed-Statement.txt
InsertAll-Old-Packed-Statement.txt

@mikependon (Owner, Author)

Hey @cajuncoding, there is a bit of complexity, at least for SQL Server, when introducing the INSERT BULK (multi-row VALUES) approach in the packed-statement like the one below.

INSERT INTO [Table] ([Column1])
OUTPUT [INSERTED].[Id]
VALUES
(@Column1_0),
(@Column1_1),
(@Column1_2);

There is no way for me to order the OUTPUT [INSERTED].[Id] based on the extra parameter named @__RepoDb_OrderColumn_X.

Of course, unless I hack it like below.

WITH CTE AS
(
	SELECT @Column1_0 AS [Column1], @__RepoDb_OrderColumn_0 AS [__RepoDb_OrderColumn]
	UNION SELECT @Column1_1, @__RepoDb_OrderColumn_1
	UNION SELECT @Column1_2, @__RepoDb_OrderColumn_2
)
INSERT INTO [Table] ([Column1])
OUTPUT [INSERTED].[Id]
SELECT [Column1] FROM CTE ORDER BY [__RepoDb_OrderColumn];

Since you were the one who requested this capability, would you be able to validate whether the 1st statement above really corrupts the ordering of the insertion? I have tried it many times, but it does not seem to affect the order. In short, the result is the same as with the 2nd statement and with the old approach.

Would you be able to help?

@cajuncoding (Collaborator) commented Mar 19, 2023

@mikependon, I can’t remember exactly how I was validating it at the time, but it was definitely an issue… though, as you alluded to, that was related to using BulkInsert() (based on SqlBulkCopy) as the higher-performing alternative to InsertAll().

However, the issue still stands, because SQL Server makes no guarantee at all about the returned item order unless you tell it exactly how to order (via an ORDER BY clause).

A little Google-fu yielded this excellent Q&A on StackOverflow with a great response… it may be that the SQL Server implementation will need to use a MERGE statement instead of a simple INSERT, but it’s just a tweak to the script, because all values can still be packed and passed using the same VALUES (a, b, c), (d, e, f)… syntax as an input into the MERGE, potentially together with a #OutputTempTable.

https://dba.stackexchange.com/a/155737/168595
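
A minimal sketch of that MERGE-based pattern, reusing the @Column1_N / @__RepoDb_OrderColumn_N parameter naming from the snippets above (the @Result table variable and the BIGINT identity type are assumptions for illustration, not the library's actual implementation):

-- Capture the generated identity values together with the input order marker.
DECLARE @Result TABLE ([Id] BIGINT, [__RepoDb_OrderColumn] INT);

MERGE INTO [Table] AS T
USING (
    VALUES
        (@Column1_0, @__RepoDb_OrderColumn_0),
        (@Column1_1, @__RepoDb_OrderColumn_1),
        (@Column1_2, @__RepoDb_OrderColumn_2)
) AS S ([Column1], [__RepoDb_OrderColumn])
ON 1 = 0 -- never matches, so every source row is inserted
WHEN NOT MATCHED THEN
    INSERT ([Column1]) VALUES (S.[Column1])
-- Unlike a plain INSERT, MERGE lets OUTPUT reference source columns.
OUTPUT INSERTED.[Id], S.[__RepoDb_OrderColumn] INTO @Result;

SELECT [Id] FROM @Result ORDER BY [__RepoDb_OrderColumn];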

Thoughts on this?

@mikependon (Owner, Author)

@cajuncoding yeah, I read the link you shared, and it is such a good comment from the community to not rely on the default order of the output relative to the input. So it is still advisable to add that extra step, which I will also do. Thanks mate 🚀

mikependon added a commit that referenced this issue Mar 19, 2023
@mikependon (Owner, Author)

There seems to be a challenge in PGSQL: a query that works in SQL Server is failing in PGSQL.

[screenshot]

Reference: https://twitter.com/mike_pendon/status/1638654274237497346
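
For context, a minimal sketch of the PostgreSQL flavor (an assumption for illustration, not the exact query from the screenshot): PostgreSQL has no OUTPUT clause, so the multi-row statement would have to rely on RETURNING instead.

-- PostgreSQL sketch (assumed single-column table and Npgsql-style @ parameters)
INSERT INTO "Table" ("Column1")
VALUES
    (@Column1_0),
    (@Column1_1),
    (@Column1_2)
RETURNING "Id";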
