Question about matches in snzip tool #30

Open
shulib opened this issue Jan 11, 2023 · 11 comments

Comments

@shulib

shulib commented Jan 11, 2023

Hi,
Do you support window size for match offset > 64k when packet is greater?
what are the parameters I should insert to do that
I am running snzip version 1.0.4.
Modes I use: framing2 and framing.
Thanks,

@shulib
Author

shulib commented Jan 11, 2023

Can I use snzip to compress with a window size (match offset) greater than 64k?
If so, how?
Thanks,

@kubo
Owner

kubo commented Jan 12, 2023

I'm not sure what you mean by window size. If you mean kBlockSize, you should ask on the Google snappy mailing list.

@shulib
Author

shulib commented Jan 12, 2023

I created the following file data to compress:
part a: 64k bytes of random digits, for example: 71451745376545378
part b: 128k, a block of a single repeated digit
part c: 64k, a copy of part a
After compressing it with the tool, I expected the start of part c to be encoded as a match with offset 192k,
but I compared the compressed output of part c with that of part a and they are the same.
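
For readers who want to reproduce the layout described above, here is a minimal sketch in Python. The exact contents of the poster's file are not known, so the seed, the filler digit `1`, and the helper name `build_test_data` are assumptions made for illustration only.

```python
import random

KIB = 1024


def build_test_data(seed: int = 0) -> bytes:
    """Build a 256 KiB buffer laid out as part a / part b / part c."""
    rng = random.Random(seed)  # fixed seed so the data is reproducible
    # part a: 64 KiB of random ASCII digits
    part_a = "".join(rng.choice("0123456789") for _ in range(64 * KIB)).encode("ascii")
    # part b: 128 KiB of one repeated digit (the digit '1' is an assumption)
    part_b = b"1" * (128 * KIB)
    # part c: an exact copy of part a
    part_c = part_a
    return part_a + part_b + part_c


data = build_test_data()
# Distance from the start of part c back to the start of part a.
offset_to_part_a = len(data) - 64 * KIB
print(len(data), offset_to_part_a)
```

Writing `data` to a file and compressing it with snzip should give a comparable input. The distance of 192 KiB computed here is the match offset the poster was hoping to see.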

@kubo
Owner

kubo commented Jan 12, 2023

Could you post a concrete explanation?
Did you compress a single file containing the three parts?
Could you post your data to a gist and post what you did here (as the commands you executed, not in words)?

@shulib
Author

shulib commented Jan 12, 2023

ok,
out.txt
I took this file and ran:
snzip.exe -t framing2 -k out.txt
I translate out.txt.sz to hex format and compared the blocks
new line per digit - expected to match in from character number 61066 and see that you implemented it as literals section.

@shulib
Author

shulib commented Jan 12, 2023

character on snzip file

@kubo
Owner

kubo commented Jan 13, 2023

I haven't got your question yet. Your explanation is unclear.

I translate out.txt.sz to hex format

I follow you up to here. You did something similar to the following command:

od -t x1z out.txt.sz > out.txt.sz.hex  # od is a command line tool on linux

and compared the blocks
new line per digit - expected to match in from character number 61066 and see that you implemented it as literals section.

I'm not sure what you did.
Could you post what you expected with more details and what you see?
What is "literals section"?

@shulib
Author

shulib commented Jan 13, 2023

I expected the first sequence of the last block to be a snappy match sequence:
match mode: 3
match offset: 192k
match length: 0x40
Instead I get a literal sequence:
0xf4 ...
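
To make the difference between the expected and the observed tag byte concrete, here is a minimal sketch that classifies a Snappy element tag according to the published Snappy format description (the low two bits select the element type). The function name `describe_tag` is hypothetical, and reading "match mode 3" as tag type `11` (copy with a 4-byte offset) is an assumption.

```python
def describe_tag(tag: int) -> str:
    """Classify one Snappy element tag byte."""
    kind = tag & 0x03   # low two bits: element type
    upper = tag >> 2    # remaining six bits
    if kind == 0:
        # Literal: values 0..59 hold (length - 1) directly;
        # 60..63 mean the length is stored in the next 1..4 bytes.
        if upper < 60:
            return f"literal, length {upper + 1}"
        return f"literal, length stored in the next {upper - 59} byte(s)"
    if kind == 1:
        # Copy with 1-byte offset: bits 2..4 hold (length - 4).
        return f"copy with 1-byte offset, length {4 + (upper & 0x07)}"
    if kind == 2:
        return f"copy with 2-byte offset, length {upper + 1}"
    return f"copy with 4-byte offset, length {upper + 1}"


print(describe_tag(0xF4))  # the tag observed in the output
print(describe_tag(0xFF))  # type 3 with length 0x40, as expected by the poster
```

Under this reading, `0xf4` is a literal whose length follows in two bytes, while a type-3 copy of length 0x40 would start with `0xff`.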

@kubo
Owner

kubo commented Jan 13, 2023

  1. Could you post what you see with your own eyes, not what you interpreted?
    Without it, I cannot understand your interpretations.
    Even when you and I see the same thing, we may interpret it differently.
  2. Could you post out.txt.sz compressed by your snzip?
    If the snappy library version used by your snzip is different from mine, the output may differ slightly.

I'd like you to post something similar to the following. If you cannot copy and paste the hex dump as text, paste images instead.


Head of hex data dumped by od -t x1z -A x out.txt.sz > out.txt.sz.hex:

000000 ff 06 00 00 73 4e 61 50 70 59 00 57 d6 00 11 1c  >....sNaPpY.W....<
000010 33 c7 80 80 04 f4 8d 0a 39 30 32 35 39 33 35 31  >3.......90259351<
000020 35 35 39 39 39 33 37 33 32 38 35 35 39 35 38 31  >5599937328559581<
000030 32 37 37 37 32 32 31 36 36 37 39 32 35 32 39 31  >2777221667925291<
000040 36 33 39 35 39 33 30 30 34 33 38 35 32 38 33 33  >6395930043852833<

Lines 3815-3820 of out.txt.sz.hex (byte offsets 0x00ee60 - 0x00eebf of out.txt.sz):

00ee60 00 fe 01 00 fe 01 00 fe 01 00 fe 01 00 fe 01 00  >................<
00ee70 fe 01 00 fe 01 00 fe 01 00 fe 01 00 fa 01 00 00  >................<
00ee80 57 d6 00 11 1c 33 c7 80 80 04 f4 8d 0a 39 30 32  >W....3.......902<
00ee90 35 39 33 35 31 35 35 39 39 39 33 37 33 32 38 35  >5935155999373285<
00eea0 35 39 35 38 31 32 37 37 37 32 32 31 36 36 37 39  >5958127772216679<
00eeb0 32 35 32 39 31 36 33 39 35 39 33 30 30 34 33 38  >2529163959300438<

I interpreted it as:

The stream identifier (chunk type 0xff) starts at offset 0x000000. The chunk data size is 0x000006. The total chunk size is 4 + 0x000006 = 0x00000a.
The first compressed data (chunk type 0x00) starts at 0x00000a. The chunk data size is 0x00d657.

The first chunk of part c starts at 0x00ee7f. It is a compressed data chunk. The subsequent bytes look the same as those of the first compressed data chunk at 0x00000a.

00ee70 fe 01 00 fe 01 00 fe 01 00 fe 01 00 fa 01 00 00  >................<
                                                    ``-- part c starts here.
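
The chunk walk described above can be sketched in a few lines of Python. This assumes the layout used in the interpretation (a 1-byte chunk type followed by a 3-byte little-endian data length); the function name `iter_chunks` is hypothetical, and the sample bytes are the first 14 bytes of the dump.

```python
from typing import Iterator, Tuple


def iter_chunks(stream: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (file_offset, chunk_type, data_length) for each chunk header."""
    pos = 0
    while pos + 4 <= len(stream):
        chunk_type = stream[pos]
        # 3-byte little-endian length of the chunk data
        data_len = int.from_bytes(stream[pos + 1:pos + 4], "little")
        yield pos, chunk_type, data_len
        pos += 4 + data_len  # 4-byte header plus chunk data


# Stream identifier chunk plus the header of the first compressed data chunk.
head = bytes.fromhex("ff060000734e61507059" "0057d600")
chunks = list(iter_chunks(head))
for offset, chunk_type, data_len in chunks:
    print(hex(offset), hex(chunk_type), hex(data_len))
# prints:
# 0x0 0xff 0x6
# 0xa 0x0 0xd657
```

Running the same loop over the whole out.txt.sz would list every chunk boundary, which makes it easier to locate the chunk holding part c.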

@shulib
Author

shulib commented Jan 14, 2023

See line 00ee80: the block starts at byte 7 with 80 80 04 (this is where part c starts).
After that you get: f4 8d 0a 39 30 32 ...
These are the same bytes as in line 000010, starting at byte 5.
Why is it not a match sequence?
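
As a sketch of how the bytes quoted above decode under the Snappy format description: the block starts with a varint holding the uncompressed length, followed by the first element tag. The helper name `read_varint` is hypothetical, and the six sample bytes are copied from the dump.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a little-endian base-128 varint; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry data
        if byte & 0x80 == 0:              # high bit clear: last byte
            return result, pos
        shift += 7


block = bytes.fromhex("808004f48d0a")
uncompressed_len, pos = read_varint(block)
tag = block[pos]
# Tag 0xf4 is a literal whose (length - 1) is stored in the next two bytes.
literal_len = int.from_bytes(block[pos + 1:pos + 3], "little") + 1
print(uncompressed_len, hex(tag), literal_len)  # 65536 0xf4 2702
```

The preamble `80 80 04` decodes to 65536, so the chunk holding part c describes a self-contained 64k block that opens with a 2702-byte literal.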

@kubo
Owner

kubo commented Jan 15, 2023

I finally understand your question.

Do you support window size for match offset > 64k when packet is greater?
what are the parameters I should insert to do that

No parameters. The snappy library divides input data into 64k blocks (*1). Each block is compressed separately (*2). Byte sequences in a block cannot be encoded as a match against data in previous blocks.

*1: https://github.com/google/snappy/blob/1.1.9/snappy.cc#L1477-L1529
*2: Table for compression is cleared for each block. https://github.com/google/snappy/blob/1.1.9/snappy.cc#L713
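
As a simplified model of this behavior (not snappy's actual code), the sketch below maps byte positions to 64k blocks and checks whether a match source and destination fall in the same block. `BLOCK_SIZE` is assumed to equal snappy's kBlockSize, and the names `block_index` and `can_reference` are hypothetical.

```python
BLOCK_SIZE = 64 * 1024  # assumed to equal snappy's kBlockSize


def block_index(pos: int) -> int:
    """Index of the independently compressed block containing byte `pos`."""
    return pos // BLOCK_SIZE


def can_reference(src_pos: int, dst_pos: int) -> bool:
    """In this model a match is only possible within a single block."""
    return block_index(src_pos) == block_index(dst_pos)


part_a_start = 0
part_c_start = 192 * 1024

print(block_index(part_a_start), block_index(part_c_start))  # 0 3
print(can_reference(part_a_start, part_c_start))             # False
```

With the test file from this issue, part a lies in block 0 and part c in block 3, so part c is compressed without any knowledge of part a and produces the same bytes as part a did.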

To increase the block size, you need to change not only snzip but also snappy so that it handles offsets larger than 16 bits, as described here.
