This issue was moved to a discussion.

EOFError when reading a second file in an archive #240

Closed
girardbaptiste opened this issue Sep 16, 2020 · 6 comments
Labels
for extraction (Issue on extraction, decompression or decryption), invalid (This doesn't seem right), question (Further information is requested)

Comments

@girardbaptiste

girardbaptiste commented Sep 16, 2020

Describe the bug
The first file requested with the first call to zip_file.read() is returned correctly, but the second call raises an EOFError.

To Reproduce

    import py7zr

    zip_file = py7zr.SevenZipFile(r"py7zr-0.9.5\tests\data\lzma2_1.7z",
                                  mode='r')

    for file in zip_file.files:
        if not file.emptystream:
            file_dict = zip_file.read(file.filename)
            for line in file_dict[file.filename].readlines():
                print(line)
            file_dict[file.filename].close()

The first file content is printed:
b'#!/usr/bin/env python\n'
b'\n'
b'import sys\n'
b'\n'
b'from py7zr import main\n'
b"if __name__ == '__main__':\n"
b' sys.exit(main())\n'
b'\n'

But the second call raises an EOFError exception:

Traceback (most recent call last):
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 735, in _extract
    self.worker.extract(self.fp, parallel=(not self.password_protected and not self._filePassed))
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 954, in extract
    self.extract_single(fp, self.files, self.src_start, src_end, q)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 1024, in extract_single
    exc_q.put(exc_tuple)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 1015, in extract_single
    if f.crc32 is not None and crc32 != f.crc32:
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\py7zr.py", line 1058, in decompress
    tmp = decompressor.decompress(inp, max_length)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\compressor.py", line 763, in decompress
    folder_data = self.cchain.decompress(data, max_length=max_length)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\compressor.py", line 687, in decompress
    tmp = self._decompress(data, max_length)
  File "\Continuum\miniconda3\envs\test7zip\lib\site-packages\py7zr\compressor.py", line 668, in _decompress
    raise EOFError
EOFError

Expected behavior
A clear and concise description of what you expected to happen.

Environment (please complete the following information):

  • OS: [e.g. Windows 10
  • Python 3.8.3
  • py7zr version: 0.9.5

Test data (please attach in the report):
The test 7z file comes from the git repo: py7zr-0.9.5\tests\data\lzma2_1.7z

Additional context

@miurahr
Owner

miurahr commented Sep 17, 2020

Once SevenZipFile.read() has been called, all the files have been processed and the file pointer is at EOF.
If you want to read again from the start, SevenZipFile.reset() resets the file pointer and the decompressor state.

Caution: if you handle a 1 GB archive and call read(), reset(), and read() again, you read 2 GB from disk.
It is recommended to call read(file-spec) only once.

For example:

import py7zr
import re

# Pick the wanted members by name first, then extract them in a single pass.
filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')
with py7zr.SevenZipFile('archive.7z', 'r') as archive:
    allfiles = archive.getnames()
    selective_files = [f for f in allfiles if filter_pattern.match(f)]
    # A single read() call returns a dict keyed by member name.
    dict_data = archive.read(targets=selective_files)
    for entry in dict_data:
        target_data = dict_data[entry].read()
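
If the members really do have to be read one at a time, the reset() pattern mentioned above might look roughly like the sketch below. The helper name read_one_by_one and the StubArchive class are hypothetical: the stub only mimics the behaviour described in this comment (read() leaves the archive at EOF, reset() rewinds it) so that the sketch runs without an archive on disk. With py7zr, an opened SevenZipFile would be passed in place of the stub.

```python
import io


class StubArchive:
    """Hypothetical stand-in, NOT py7zr: read() consumes the stream,
    a second read() without reset() raises EOFError."""

    def __init__(self, members):
        self._members = dict(members)
        self._at_eof = False
        self.passes = 0  # how many times the whole payload was processed

    def read(self, targets=None):
        if self._at_eof:
            raise EOFError
        self._at_eof = True
        self.passes += 1
        names = list(self._members) if targets is None else targets
        return {name: io.BytesIO(self._members[name]) for name in names}

    def reset(self):
        self._at_eof = False


def read_one_by_one(archive, names):
    """Yield (name, bytes) pairs, rewinding the archive after every read()."""
    for name in names:
        data = archive.read(targets=[name])[name].read()
        archive.reset()  # rewind so the next read() starts from the beginning
        yield name, data


archive = StubArchive({"a.txt": b"alpha\n", "b.txt": b"beta\n"})
contents = dict(read_one_by_one(archive, ["a.txt", "b.txt"]))
```

Every reset() costs another full pass over the payload (archive.passes ends at 2 here), which is why a single read(targets=...) call is the cheaper choice when all the names are known up front.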

@miurahr miurahr added for extraction Issue on extraction, decompression or decryption invalid This doesn't seem right labels Sep 17, 2020
@girardbaptiste
Author

girardbaptiste commented Sep 17, 2020 via email

@miurahr
Owner

miurahr commented Sep 17, 2020

7-Zip uses "solid compression": https://en.wikipedia.org/wiki/Solid_compression
That is why read() has to process the whole payload chunk even when it extracts only some of the archived files.

py7zr is designed so that readall() and read(name-spec) both read all chunks and return either all of the archived files or the requested subset.
When read() runs, it reads the payload from the start to the end of a single chunk; data that was not requested is dropped, and only the data of the specified files is returned.

By contrast, the ZIP format is not solid, because it stores each file compressed separately,
so it allows random access to an archived file without reading the other parts.
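
To make the difference concrete, here is a toy illustration using only the Python standard library. It is not the 7z container format and not how py7zr is implemented: the "solid" case is simply every file concatenated and compressed as one LZMA stream, and the helper read_from_solid is a hypothetical name.

```python
import io
import lzma
import zipfile

files = {"a.txt": b"alpha\n" * 100, "b.txt": b"beta\n" * 100}

# "Solid" style: all members are concatenated and compressed as ONE stream.
solid_blob = lzma.compress(b"".join(files.values()))
layout = [(name, len(data)) for name, data in files.items()]


def read_from_solid(blob, layout, wanted):
    """Decode the stream from its start, then slice out the wanted member."""
    plain = lzma.decompress(blob)  # data in front of `wanted` is decoded too
    offset = 0
    for name, size in layout:
        if name == wanted:
            return plain[offset:offset + size]
        offset += size  # decoded but not requested: dropped
    raise KeyError(wanted)


solid_b = read_from_solid(solid_blob, layout, "b.txt")

# ZIP style: every member is compressed on its own and can be read directly.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

with zipfile.ZipFile(zip_buf) as zf:
    zip_b = zf.read("b.txt")  # a.txt is never decompressed
```

Both paths return the same bytes for b.txt; the difference is only in how much has to be decompressed to get there.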

@girardbaptiste
Author

girardbaptiste commented Sep 17, 2020

That’s clear. A single file cannot be accessed without reading all the other files.
But is it possible to reset a single file that has already been read, without resetting the complete archive?

For example:

import py7zr
import re

filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')
with py7zr.SevenZipFile('archive.7z', 'r') as archive:
    allfiles = archive.getnames()
    selective_files = [f for f in allfiles if filter_pattern.match(f)]
    dict_data = archive.read(targets=selective_files)
    for entry in dict_data:
        target_data = dict_data[entry]
        for lines in target_data.readlines():
            print(lines)

        target_data.reset()
        #  or
        target_data.seek(0)
        # or  
        target_data = open(target_data, 'r')
       
        # and then
        for lines in target_data.readlines():
            print(lines)

@miurahr
Owner

miurahr commented Sep 17, 2020

dict_data[entry] in the example is a BytesIO object,
so you can do:

bio = dict_data[entry]
data = bio.read()
bio.seek(0)
data = bio.read()
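
The same applies to the readlines() loop from the question. A small sketch with a plain io.BytesIO standing in for dict_data[entry] (the sample bytes are made up for illustration):

```python
import io

# Stand-in for dict_data[entry]; the content is invented for this example.
bio = io.BytesIO(b"first line\nsecond line\n")

first_pass = bio.readlines()   # consumes the buffer up to EOF
exhausted = bio.readlines()    # already at EOF, so nothing is left to read
bio.seek(0)                    # rewind this in-memory buffer only
second_pass = bio.readlines()  # same lines again, the archive is not touched
raw = bio.getvalue()           # full content, independent of the position
```

Rewinding with seek(0) works on the in-memory copy, so no reset() of the archive is involved.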

@miurahr miurahr added the question Further information is requested label Sep 17, 2020
@miurahr miurahr closed this as completed Sep 19, 2020
@miurahr
Owner

miurahr commented Sep 19, 2020

Question is answered.

@miurahr miurahr pinned this issue Oct 28, 2020
Repository owner locked and limited conversation to collaborators Jan 31, 2024
@miurahr miurahr converted this issue into discussion #573 Jan 31, 2024
