
Feat: check for memory errors on reading xlsx files and proper formatting of csv sample code #432

Closed
wants to merge 3 commits into from

Conversation

SgtPepper47
@SgtPepper47 SgtPepper47 commented May 8, 2024

Updates to the Excel range creation to check the machine memory against the memory required for the range.
Updates to the sample Excel to csv writer to identify the memory error and to write out a properly formatted csv file with commas instead of semicolons.
Resolves #433

Check the Vec size against the memory available on the machine.
Catch the error from the range variable without panicking.
Change the writer to produce a properly formatted csv with a comma delimiter and quoting.
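As an illustration of the quoting rule the last point describes, here is a minimal std-only sketch of RFC 4180-style csv field quoting. `quote_field` and `write_row` are hypothetical helper names for illustration, not the PR's actual code (which formats calamine cell values):

```rust
// Quote a field only when it contains a comma, a quote, or a newline,
// doubling any embedded quotes (RFC 4180 convention).
fn quote_field(field: &str) -> String {
    if field.contains(',') || field.contains('"') || field.contains('\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

// Join the quoted fields of one row with the comma delimiter.
fn write_row(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|f| quote_field(f))
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    let row = write_row(&["a", "b,c", "say \"hi\""]);
    println!("{row}"); // prints: a,"b,c","say ""hi"""
}
```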
@tafia
Owner

tafia commented May 11, 2024

Could you add a test showing how this can fail?

@SgtPepper47
Author

tafia,
Here is an example of a file with a sheet that fails to load into memory. Exactly where it fails depends on the machine, but an Excel sheet has a finite range of roughly 1 million rows by 16k columns. Just run the example script to convert 'Sheet1' to a csv file.
Large_Sheet.xlsx
Thanks.

@tafia
Owner

tafia commented May 15, 2024

Tests are failing.

@dimastbk
Contributor

dimastbk commented May 15, 2024

Maybe we should check cells.len() == isize::MAX in the loop, when adding a cell to cells? And this should be backported to the other formats too.

@SgtPepper47
Author

First, I should have opened this by saying I think calamine is a great piece of work, and I'm impressed with its performance.
I see in the build that I mistakenly missed a closing parenthesis. I can do another pull request to fix that if you like.
On dimastbk's comment: I think it would be cleaner to check in the cells loop whether it has exceeded the limit for the machine. However, cells.len() returns a usize, so it can't be compared directly against isize::MAX.
Of course, all of this is just checking for an error. I haven't come up with a solution to be able to load the Excel sheet without an error.

@dimastbk
Contributor

@SgtPepper47, sorry, cells.len() == usize::MAX
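For illustration, the two ideas being discussed here (a length cap checked while cells are accumulated, plus an allocation that reports failure instead of aborting) could be sketched like this. `Cell` and `push_checked` are hypothetical stand-ins, not calamine's actual types; Rust's `Vec::try_reserve` is what turns an out-of-memory condition into a recoverable `Err`:

```rust
// Hypothetical cell type standing in for calamine's real one.
struct Cell(u32);

// Push a cell, returning Err instead of panicking/aborting on failure.
fn push_checked(cells: &mut Vec<Cell>, cell: Cell) -> Result<(), String> {
    // Vec allocations are capped at isize::MAX *bytes*; len() is a usize,
    // so a length comparison against isize::MAX needs a cast.
    if cells.len() >= isize::MAX as usize {
        return Err("cell count limit reached".into());
    }
    // try_reserve surfaces allocation failure as a TryReserveError
    // instead of aborting the process.
    cells
        .try_reserve(1)
        .map_err(|e| format!("allocation failed: {e}"))?;
    cells.push(cell);
    Ok(())
}

fn main() {
    let mut cells = Vec::new();
    for i in 0..3 {
        push_checked(&mut cells, Cell(i)).expect("small pushes succeed");
    }
    assert_eq!(cells.len(), 3);
    println!("ok");
}
```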

Owner

@tafia tafia left a comment


I am not sure I understand why you would reserve some capacity on cells. If anything, you actually want to shrink_to_fit. The capacity reservation and the checks should live in the from_sparse function instead.

An alternative would also be to enhance the Range to be an enum, either sparse or dense. Basically there should be some checks to determine when using a sparse representation is better (like here, very large dimension but very few cells).

This is of course much more work but it would be better.
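The sparse/dense idea suggested above could be sketched roughly like this. The types are hypothetical and heavily simplified, not calamine's real `Range`; the point is only that a coordinate map stays cheap when the dimensions are huge but few cells are occupied:

```rust
use std::collections::BTreeMap;

// A range that is either dense (every slot materialized)
// or sparse (only occupied cells stored by coordinate).
enum Range<T> {
    Dense { width: usize, cells: Vec<Option<T>> },
    Sparse { width: usize, cells: BTreeMap<(usize, usize), T> },
}

impl<T> Range<T> {
    fn get(&self, row: usize, col: usize) -> Option<&T> {
        match self {
            Range::Dense { width, cells } => cells.get(row * width + col)?.as_ref(),
            Range::Sparse { cells, .. } => cells.get(&(row, col)),
        }
    }
}

fn main() {
    // Dense: all slots exist, even the empty ones.
    let dense = Range::Dense { width: 2, cells: vec![Some(1), None, None, Some(4)] };
    assert_eq!(dense.get(1, 1), Some(&4));

    // Sparse: one occupied cell at Excel's far corner costs one map entry,
    // not ~17 billion Vec slots.
    let mut map = BTreeMap::new();
    map.insert((1_048_575_usize, 16_383_usize), 42);
    let sparse = Range::Sparse { width: 16_384, cells: map };
    assert_eq!(sparse.get(1_048_575, 16_383), Some(&42));
    println!("ok");
}
```

Choosing between the variants at load time (e.g. by comparing occupied-cell count against the full dimension) is the extra work the comment alludes to.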

@@ -823,6 +824,13 @@ impl<RS: Read + Seek> Xlsx<RS> {
Err(e) => return Err(e),
}
}
/// Chcek for machine memory error
if cells.capacity() < usize_len {
@tafia
Owner

typo

@tafia
Owner

tafia commented May 24, 2024

I think this PR makes things worse as it allocates a potentially very large Vec for nothing (this is not the right Vec to reserve capacity).

@tafia
Owner

tafia commented Jun 4, 2024

Closing it for now, feel free to reopen and come up with an actual improvement.
Thanks

@tafia tafia closed this Jun 4, 2024
Successfully merging this pull request may close these issues.

Memory error and panic on large Excel sheets