Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log file inside zip file erroneously guessed as "tar" instead of text #2316

Open
krackout opened this issue Feb 15, 2024 · 14 comments
Open

Log file inside zip file erroneously guessed as "tar" instead of text #2316

krackout opened this issue Feb 15, 2024 · 14 comments
Labels

Comments

@krackout
Copy link

Log file inside zip file erroneously guessed as "tar" instead of text. Visidata version 3.0.2.

On 2.11 it's correctly opened as text.

@krackout krackout added the bug label Feb 15, 2024
@anjakefala
Copy link
Collaborator

anjakefala commented Feb 15, 2024

Hi @krackout!

Thanks for filing this issue. Please provide some sample data, so that we can reproduce this. We might not be able to fix it otherwise.

@krackout
Copy link
Author

Inside somelogs.zip file, there's a freshclam.log which produces the error.
somelogs.zip

I also attach a screenshot of the error.
visidata-logfile-error

@ajkerrigan
Copy link
Collaborator

ajkerrigan commented Feb 15, 2024

Seeing the same thing here, repro:

echo "hiya" > test.log
zip test.zip test.log
vd -N test.zip

Then hitting Enter gives me the same thing @krackout showed. Full traceback:

Traceback (most recent call last):
File "/home/aj/.local/pipx/venvs/visidata/lib/python3.12/site-packages/visidata/threads.py", line 220, in _toplevelTryFunc
t.status = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/aj/.local/pipx/venvs/visidata/lib/python3.12/site-packages/visidata/sheets.py", line 260, in reload
self.loader()
File "/home/aj/.local/pipx/venvs/visidata/lib/python3.12/site-packages/visidata/sheets.py", line 285, in loader
for r in self.iterload():
File "/home/aj/.local/pipx/venvs/visidata/lib/python3.12/site-packages/visidata/loaders/archive.py", line 140, in iterload
with tarfile.open(name=str(self.source)) as tf:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/tarfile.py", line 1808, in open
raise ReadError(f"file could not be opened successfully:\n{error_msgs_summary}")
tarfile.ReadError: file could not be opened successfully:
- method gz: ReadError('not a gzip file')
- method bz2: ReadError('not a bzip2 file')
- method xz: ReadError('not an lzma file')
- method tar: ReadError('truncated header')

Bisect tells me that this started in eb9d9c4.

If I use p.open_bytes() instead of p.fp here locally it seems to do good things, in that:

  • A .log inside a .zip will still open as txt
  • Tar files still open properly, and renaming test.tar to test.taradjacent gets properly guessed/opened as tar

Guessing doesn't work for a tar file named test.taradjacent inside a zip file... but I can't tell if we expect that level of nested guessing to work 🤔

@anjakefala
Copy link
Collaborator

Thanks for investigating! =)

Wrt that commit, for any fixes, we'll also have to check up on this behaviour: #1978

@ajkerrigan
Copy link
Collaborator

ajkerrigan commented Feb 16, 2024

Thanks for investigating! =)

Wrt that commit, for any fixes, we'll also have to check up on this behaviour: #1978

Good call, thanks! Ok repro matrix time 😅 . There are a few candidates for that line:

   if tarfile.is_tarfile(p):
   if tarfile.is_tarfile(p.fp):
   if tarfile.is_tarfile(p.open_bytes()):

And a few cases to check:

  1. Do we drop rows reading from stdin, in a test case like this?
  2. If we rename a tar file to an unknown extension, does it open as a tar?
    2a. What if it's in a zip file?
    2b. Or if we open it via stdin?
  3. if we open a non-tar file with an unknown extension, does it open as txt?
    3a. What if it's in a zip file?
    3b. Or if we open it via stdin? (e.g. vd -N <(cat <file>) or cat <file> | vd -N)

And from some testing the results look like...


   if tarfile.is_tarfile(p):
  1. ❌ Drops rows when reading from stdin, disqualified 😅

   if tarfile.is_tarfile(p.fp):
  1. ✔️
  2. guess_tar: nothing to open
    2a. ✔️
    2b. ❌ guess_tar: illegal seek
  3. ✔️
    3a. ❌ Incorrectly guesses as tar and then throws ReadErrors (the core of this GH issue)
    3b. ✔️

   if tarfile.is_tarfile(p.open_bytes()):
  1. ✔️
  2. ✔️
    2a. ❌ guess_tar: a RepeatFile holds text and cannot be reopened in binary mode, opens as txt
    2b. ❌ guess_tar: a RepeatFile holds text and cannot be reopened in binary mode, opens as txt
  3. ✔️
    3a. ✔️
    3b. ✔️

So I think none of the options I listed are perfect, but p.open_bytes() seems to fail the least 😅 . Its main drawback is that it can't properly identify a tar file with a non-tar extension, being opened from stdin or inside an archive. Maybe that's an issue that's worth digging into separately, or it's a rare enough combination of edge cases that we consciously punt on it? 🤔

@saulpw
Copy link
Owner

saulpw commented Feb 16, 2024

So I think none of the options I listed are perfect, but p.open_bytes() seems to fail the least 😅 . Its main drawback is that it can't properly identify a tar file with a non-tar extension, being opened from stdin or inside an archive. Maybe that's an issue that's worth digging into separately, or it's a rare enough combination of edge cases that we consciously punt on it? 🤔

I think it should be rare enough that we can punt on it for now. Ultimately we should head towards filetype being composable, e.g. zip+tar+txt so that you can specify the filetype for each layer of container independently. Until then, sorry about your log file inside a tarball inside a zip file.

@krackout
Copy link
Author

It must be an edge case indeed; I supposed it would be easy to fix, since it worked on v2.11.

@anjakefala
Copy link
Collaborator

It must be an edge case indeed; I supposed it would be easy to fix, since it worked on v2.11.

v2.11 did not have the filetype guesser logic! That's why the behaviour changed. =)

@midichef
Copy link
Contributor

I've been sitting on a mostly-finished patch that addresses the issue of is_tarfile() needing a binary file. It also would allow Visidata to guess binary filetypes, like guess_zip(), when reading data piped to stdin.

Currently, visidata saves stdin data into a RepeatFile, which reads the data as text, and saves it to be read multiple times for "files" opened in text mode. The change is to make a RepeatBinaryFile that saves the incoming data as binary bytes instead. Then it can be reread directly as bytes for binary "files", or encoded and served as text for text "files".

If there's interest, I can put up what I've got as a starting point. But it needs to be finished, and then tested thoroughly. I can't do that in the near future.

@saulpw
Copy link
Owner

saulpw commented Feb 17, 2024

I would love to see that, @midichef. I don't have time myself right now either, but maybe it will inspire another enterprising VisiDatan to kick the tires on it.

@ajkerrigan
Copy link
Collaborator

I'm curious too. I would definitely need some time to spin up on the changes and ripple effects. Happy to kick the tires and work through issues though. I'm onboard with the motivation for sure, so it's very cool that you've already got something started! ❤️

@krackout
Copy link
Author

Until then, sorry about your log file inside a tarball inside a zip file.

@saulpw Apparently you misunderstood. It's a text file with .log extension, inside a zip file. The text/log file is guessed erroneously as tar, no tar involved.

@saulpw
Copy link
Owner

saulpw commented Feb 20, 2024

@krackout I was being a little cheeky, the 'you' here was referring to the hypothetical user for a tar file within an archive:

Its main drawback is that it can't properly identify a tar file with a non-tar extension, being opened from stdin or inside an archive.

@midichef
Copy link
Contributor

midichef commented May 10, 2024

An update to my comment a couple of months ago:

I've been sitting on a mostly-finished patch

I've been prioritizing fixing Visidata bugs, but now I can resume work on this feature. I will likely have a workable version ready within a week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants