
Memory leaks when uncompressing multi-volume archives #575

Open
aquirin opened this issue Mar 3, 2024 · 2 comments
Labels
for extraction (Issue on extraction, decompression or decryption) · help wanted (Extra attention is needed) · Speed/Performance

Comments


aquirin commented Mar 3, 2024

Describe the bug

It seems that decompressing a multi-volume archive with a relatively large number of files (321 "outer" volumes, 8000 compressed files, 400 KB each, so roughly 3 GB in total) produces a memory leak.

The basic code that fails is:

import multivolumefile
import py7zr

count_files = 0  # zip_path is defined earlier in uncompress.py.txt
with multivolumefile.open(zip_path, mode='rb') as multizip_handler:
    with py7zr.SevenZipFile(multizip_handler, 'r') as zip_handler:
        for fname, fcontent in zip_handler.read(targets=None).items():
            count_files += 1

See complete function: uncompress.py.txt

The corresponding archive is a multi-volume archive of 8000 files, 400 KB per file, filled with random data, and split every 10 MB. No filters, specific headers, encryption or password have been set; the compression options are the defaults.

A copy of the archive is available here: multi.zip. Please note that the first level needs to be uncompressed manually before the test. The actual archive to be tested is the folder with the 321 "7z" volumes.

Better still, it is possible to reproduce this archive (modulo the random data) using the following code: compress.py.txt. Several tests indicate that the behavior is not related to the random content, only to the size of the files.
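For illustration, here is a minimal sketch of roughly how such an archive can be generated; the attached compress.py.txt is the authoritative script, the directory and file names below are placeholders, and the volume parameter of multivolumefile.open is assumed to take the volume size in bytes as in the py7zr documentation examples:

import os

import multivolumefile
import py7zr

# Generate 8000 files of ~400 KB of random data (placeholder layout),
# then pack them into a 7z archive split every 10 MB with default options.
src_dir = "payload"
os.makedirs(src_dir, exist_ok=True)
for i in range(8000):
    with open(os.path.join(src_dir, f"file_{i:05d}.bin"), "wb") as f:
        f.write(os.urandom(400 * 1000))

with multivolumefile.open("test.7z", mode="wb", volume=10 * 1000 * 1000) as vol:
    with py7zr.SevenZipFile(vol, "w") as archive:
        archive.writeall(src_dir, "payload")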

If enough memory is available, the archive can be decompressed without any issue. The process still uses a lot of memory (i.e., 3.3 GB), which is not expected, as each compressed file is quite small and the decompression script discards the data immediately on the fly.

$ ps up <pid>
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
<user>     <pid> 17.3 39.6 3420292 3206732 pts/1 S+   00:33   0:55 python3 code.py
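The same growth can also be sampled from inside the script itself; below is a minimal sketch using the standard library resource module (on Linux, ru_maxrss is reported in kilobytes; the path and print interval are placeholders):

import resource

import multivolumefile
import py7zr

zip_path = "multi/test.7z"  # placeholder: base name of the multi-volume set
count_files = 0
with multivolumefile.open(zip_path, mode='rb') as multizip_handler:
    with py7zr.SevenZipFile(multizip_handler, 'r') as zip_handler:
        for fname, fcontent in zip_handler.read(targets=None).items():
            count_files += 1
            if count_files % 500 == 0:
                # Peak resident set size so far (KB on Linux).
                peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                print(f"{count_files} files read, peak RSS ~{peak_kb // 1024} MB")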

If not enough memory is available, the decompression script crashes with a CRC error (see the log below) or a Bad7zFile: invalid header data error. It seems the CRC error is only a consequence of the lack of memory, as the archive itself is perfectly fine.

7z-crc-error.log

We can see that the archive is error-free:

7z t test.7z.0001

[...]
Everything is Ok

Files: 8000
Size:       3201920000
Compressed: 3202141273

Note that, for testing purposes, it is possible to deliberately fill the memory using a command such as: head -c 5G /dev/zero | tail

Related issues

These issues might be related to this one, but none of the existing tickets mention multi-volume archives and OOM at the same time:

To Reproduce

  1. Download the archive from the Google Drive link and uncompress it to obtain a single folder with 321 7z files inside. Or better: use the compress.py.txt script to generate a random archive with the correct sizes.
  2. Run the following code with Python 3: uncompress.py.txt
  3. Run ps up <pid> in another terminal to watch how the memory increases

Expected behavior

Even though the archive has a total size of about 3 GB, decompressing it file by file, where each file is only 400 KB, should not fill the memory. Decompressing a multi-volume archive should have a very low memory footprint, as it should be possible to write the bytes directly to disk, regardless of the size of the archive, the size of the individual files, the number of volumes, or the number of compressed files in the archive.
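One way to approximate that streaming behavior with the current API is to extract to disk in small batches of targets instead of reading everything into a dict. This is a rough sketch, not the library's recommended method, assuming getnames(), extract(path=..., targets=...) and reset() behave as documented; paths and batch size are placeholders, and for solid archives each targeted pass may re-decompress earlier data:

import multivolumefile
import py7zr

zip_path = "multi/test.7z"  # placeholder: base name of the multi-volume set
out_dir = "extracted"       # placeholder output directory
batch = 100                 # arbitrary batch size

with multivolumefile.open(zip_path, mode="rb") as multizip_handler:
    with py7zr.SevenZipFile(multizip_handler, "r") as zip_handler:
        names = zip_handler.getnames()
        for i in range(0, len(names), batch):
            # Write this batch straight to disk; no per-file buffers are kept.
            zip_handler.extract(path=out_dir, targets=names[i:i + batch])
            # Rewind the archive before the next call with targets.
            zip_handler.reset()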

Environment (please complete the following information):

  • OS: Ubuntu 18.04.6 LTS (Bionic Beaver)
  • Python: 3.8.0
  • py7zr version: 0.21.0
  • multivolumefile version: 0.2.3
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"

$ python --version
Python 3.8.0

$ pip freeze | grep py7zr
py7zr==0.21.0

$ pip freeze | grep multivolumefile
multivolumefile==0.2.3

Test data (please attach in the report):

See provided archive or script to generate it above.

Additional context

@aquirin changed the title from "Memory leaks when uncompressing multipart archives" to "Memory leaks when uncompressing multi-volume archives" Mar 3, 2024

aquirin commented Mar 3, 2024

Checking this a bit more, it seems to me that the issue might reside here:

self._dict[fname] = _buf

Filling the dict in the loop inside the _extract function prevents a low memory footprint for a large number of dict entries, even if we close the buffers or remove the dict entries later in the caller. Would it be possible to have a true iterator, using yield for instance?

Note that my code runs fine with extractall or extract instead of read, but these functions pass return_dict=False, which does not fill any dict and thus saves memory.
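Until such an iterator exists in py7zr itself, here is a hedged sketch of an application-level workaround: wrap read(targets=[...]) and reset() in a generator so that only one file's buffer is alive at a time. The path is a placeholder, and for solid archives each targeted read may re-decompress preceding data, so this trades CPU for memory:

import multivolumefile
import py7zr

def iter_archive_files(zip_path):
    """Yield (name, content) one archive member at a time."""
    with multivolumefile.open(zip_path, mode="rb") as multizip_handler:
        with py7zr.SevenZipFile(multizip_handler, "r") as zip_handler:
            for name in zip_handler.getnames():
                extracted = zip_handler.read(targets=[name]) or {}
                if name in extracted:  # directories yield no buffer
                    yield name, extracted[name].read()
                zip_handler.reset()  # rewind before the next targeted read

count_files = 0
for fname, fcontent in iter_archive_files("multi/test.7z"):  # placeholder path
    count_files += 1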

@miurahr added the help wanted, for extraction and Speed/Performance labels Mar 4, 2024

miurahr commented Apr 2, 2024

Duplicate of #579
