Got "Invalid data: snappy::Uncompress failed" when decompressing raw file #24

Open
zxybazh opened this issue Oct 14, 2018 · 8 comments

Comments

@zxybazh

zxybazh commented Oct 14, 2018

I compressed a file in raw format with snzip -t raw file, but when I ran snzip -t raw -d file.raw to decompress it, I got the error "Invalid data: snappy::Uncompress failed".

@kubo
Owner

kubo commented Oct 15, 2018

Could you post more information?

  • OS version and CPU architecture.
  • Test data

It works for me.

$ ./snzip -t raw INSTALL
$ ./snzip -t raw -d INSTALL.raw 

My environment is:
OS: Linux (Ubuntu 16.04 x86_64)
Test data: INSTALL

@zxybazh
Author

zxybazh commented Oct 16, 2018

Hi, I ran the test on Ubuntu 16.04 with an Intel(R) Core(TM) i7-7700 CPU.
The test data is right here; it is part of a TPC-H dataset.
Please check, thanks!

@kubo
Owner

kubo commented Oct 16, 2018

Thanks. The file was compressed incorrectly because the input data is too large. According to this information, the maximum size of the uncompressed data in the raw format is 4G.
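
For context, here is a minimal sketch of where such a limit can come from. It assumes the information referred to is the Snappy format description, which says that a raw stream starts with the uncompressed length stored as a little-endian varint with a maximum of 2^32 - 1. The function names are hypothetical and are not part of snzip or the snappy library.

```python
from typing import Tuple

# Maximum uncompressed length given in the Snappy format description (assumed source).
MAX_RAW_UNCOMPRESSED = 2**32 - 1


def encode_length_preamble(n: int) -> bytes:
    """Encode an uncompressed length as a little-endian varint (hypothetical helper)."""
    if not 0 <= n <= MAX_RAW_UNCOMPRESSED:
        raise ValueError(f"length {n} does not fit the raw format preamble")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, high bit set: more bytes follow
        n >>= 7
    out.append(n)  # final byte has the high bit clear
    return bytes(out)


def decode_length_preamble(buf: bytes) -> Tuple[int, int]:
    """Return (uncompressed length, number of preamble bytes consumed)."""
    result = 0
    for i, byte in enumerate(buf[:5]):  # a 32-bit varint needs at most 5 bytes
        result |= (byte & 0x7F) << (7 * i)
        if byte < 0x80:
            if result > MAX_RAW_UNCOMPRESSED:
                raise ValueError("preamble exceeds 2^32 - 1")
            return result, i + 1
    raise ValueError("truncated or overlong varint")
```

Under this assumption, an input larger than the limit simply cannot be described by the preamble, so the stream written for it cannot be decompressed correctly.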

There are two choices.

  1. Make snzip -t raw fail when the file size is over 4G.
  2. Split file data by 4G and create a compressed file containing concatenated compressed split data.
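
The two choices can be sketched roughly as follows. This is only an illustration, not snzip's code: the names are hypothetical, and the limit of 2^32 - 1 bytes is an assumption about what "4G" means here.

```python
import os
from typing import List, Tuple

# Assumed limit for raw uncompressed data ("4G" in the discussion above).
RAW_LIMIT = 2**32 - 1


def check_raw_size(size: int, limit: int = RAW_LIMIT) -> None:
    """Choice 1: refuse input that is too large for the raw format."""
    if size > limit:
        raise ValueError(
            f"input is {size} bytes; the raw format supports at most {limit}"
        )


def check_raw_file(path: str) -> None:
    """Apply the choice 1 check to a file on disk before compressing it."""
    check_raw_size(os.path.getsize(path))


def split_ranges(size: int, limit: int = RAW_LIMIT) -> List[Tuple[int, int]]:
    """Choice 2: (offset, length) pairs covering the input, each at most limit bytes."""
    return [(off, min(limit, size - off)) for off in range(0, size, limit)]
```

With a small limit for illustration, split_ranges(10, 4) yields three ranges covering bytes 0-3, 4-7 and 8-9. Each range would then be compressed separately.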

@zxybazh
Author

zxybazh commented Oct 16, 2018

Got it, thanks.

@kubo
Owner

kubo commented Oct 24, 2018

  1. Make snzip -t raw fail when the file size is over 4G.
  2. Split file data by 4G and create a compressed file containing concatenated compressed split data.

The latter is impossible. I can create a file containing concatenated raw compressed data. However, I cannot decompress it, because snappy uses decompressor->eof() to check whether all input data has been consumed. When two raw compressed blocks are concatenated, there is no way to know where the boundary between them is.

@zxybazh
Author

zxybazh commented Oct 24, 2018

I believe we would have to define a new file format that stores the length of each split of raw compressed data when the input is over 4G, so that we can separate the splits again when decompressing.
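
A rough sketch of what I mean, assuming each split is compressed separately and treated as an opaque block. The layout (an 8-byte little-endian length before each block) and the function names are hypothetical, not an existing snzip format. I use 8 bytes because the compressed form of a split close to 4G could itself be larger than 4G.

```python
import struct
from typing import Iterable, List

# Hypothetical layout: [8-byte little-endian length][block bytes], repeated.
_LEN = struct.Struct("<Q")


def pack_blocks(blocks: Iterable[bytes]) -> bytes:
    """Concatenate compressed blocks, recording each block's length first."""
    out = bytearray()
    for block in blocks:
        out += _LEN.pack(len(block))
        out += block
    return bytes(out)


def unpack_blocks(blob: bytes) -> List[bytes]:
    """Recover the individual blocks using the stored lengths as boundaries."""
    blocks = []
    pos = 0
    while pos < len(blob):
        if pos + _LEN.size > len(blob):
            raise ValueError("truncated length prefix")
        (n,) = _LEN.unpack_from(blob, pos)
        pos += _LEN.size
        if pos + n > len(blob):
            raise ValueError("truncated block")
        blocks.append(blob[pos:pos + n])
        pos += n
    return blocks
```

Each recovered block could then be passed to the raw decompressor on its own, so every call consumes exactly one block of input.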

@kubo
Owner

kubo commented Oct 25, 2018

What merit would a new file format have? I won't reinvent the wheel unless it has a clear benefit.

@zxybazh
Author

zxybazh commented Oct 25, 2018

Well, you're right. Let's not reinvent the wheel. I just want to make sure that we can find the boundary of every split when we decompress the file. If something like that already exists, even better. For now, you may just make it fail when the file size is over 4G.
