tar archive containing sparse file shows wrong file size, silently extracts corrupted file #2125

Open
M-a-r-k opened this issue Apr 13, 2024 · 6 comments

@M-a-r-k

M-a-r-k commented Apr 13, 2024

I have come across an issue relating to extracting a tar archive which contains a sparse file. On extracting, no error is reported but the extracted file is corrupted. Perhaps all the holes are "collapsed" in the extracted file?

Testing on Windows x64.

>bsdtar --version
bsdtar 3.7.3 - libarchive 3.7.3 zlib/1.3 liblzma/5.4.4 bz2lib/1.1.0 libzstd/1.5.5

bsdtar shows the wrong file size for the archive it cannot extract correctly:

>bsdtar -vtf NetBSD_4GB_HD_after_install_A3000_zeroed_ADOS_partition.tar
-rw-rw-r--  0 1000   1000 462721024 Jan 23  2013 NetBSD_4GB_HD.bin

But it shows the correct size for an archive it can extract correctly:

>bsdtar -vtf NetBSD_4GB_HD_template.tar
-rw-rw-r--  0 mark   mark 4000000000 Mar 02  2013 NetBSD_4GB_HD_template.bin

With GNU tar the file lists (and extracts) correctly:

# tar --version
tar (GNU tar) 1.34
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by John Gilmore and Jay Fenlason.

# tar -vtf NetBSD_4GB_HD_after_install_A3000_zeroed_ADOS_partition.tar
-rw-rw-r-- 1000/1000 4000000000 2013-01-23 17:24 NetBSD_4GB_HD.bin

Circa 2013 I created some tar archives which contain a sparse file. If I recall correctly, I used either star or GNU tar on Linux.

There are three files. Two extract correctly but the third does not. The file in the third archive contains a large number of holes. Each archive contains a single 4,000,000,000-byte sparse file.

You can download the files for testing from

https://www.mediafire.com/file/5frl8btr182au3q/NetBSD_4GB_HD_template.tar/file
(10KB)

https://www.mediafire.com/file/981eoc791c8l91a/NetBSD_4GB_HD_base.tar/file
(30KB)

https://www.mediafire.com/file/zvskd27ylq7xjaa/NetBSD_4GB_HD_after_install_A3000_zeroed_ADOS_partition.tar.xz/file
(83.89MB)

@kientzle
Contributor

kientzle commented Apr 13, 2024

Your third example here is using the Old GNU Tar Sparse Format. No one has yet implemented support for that in libarchive. Currently, libarchive supports the GNU Tar Pax Sparse formats: versions 0.0, 0.1, and 1.0.

The Old GNU Tar Sparse Format uses an unusual extended header that libarchive currently interprets as part of the file contents. So the extracted file that you're seeing contains a map of the file holes followed by the file data. If you look at a hex dump of the resulting file, you'll see this. (Aside: The fact that the sparse file data + hole map in this case is larger than the file contents stored directly without sparse compression is interesting. No it's not. I mis-read the numbers.)
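As a rough illustration of the hole map described above (not libarchive's code), the sketch below decodes the sparse fields of a single 512-byte header block. The field offsets are assumptions taken from the `oldgnu_header` layout in GNU tar's `tar.h`, the function names are made up for this example, and the header it parses is synthetic rather than taken from the archives in this report.

```python
# Offsets follow the oldgnu_header layout in GNU tar's tar.h (assumption:
# they describe GNU-format archives, not the archive from this report).
SIZE_OFF = 124        # 12-byte octal: bytes actually stored in the archive
TYPEFLAG_OFF = 156    # b"S" marks an old-GNU sparse entry
SPARSE_OFF = 386      # four 24-byte (offset, numbytes) pairs
ISEXTENDED_OFF = 482  # non-zero when more map blocks follow the header
REALSIZE_OFF = 483    # 12-byte octal: size of the file once expanded


def parse_octal(field: bytes) -> int:
    """Parse a NUL- or space-padded octal field (base-256 is not handled)."""
    text = field.split(b"\0", 1)[0].strip()
    return int(text, 8) if text else 0


def read_old_gnu_sparse(header: bytes) -> dict:
    """Return the stored size, real size, and data segments of one header."""
    if len(header) != 512:
        raise ValueError("a tar header block is exactly 512 bytes")
    if header[TYPEFLAG_OFF:TYPEFLAG_OFF + 1] != b"S":
        raise ValueError("typeflag is not 'S'")
    segments = []
    for i in range(4):
        base = SPARSE_OFF + 24 * i
        numbytes_field = header[base + 12:base + 24]
        if numbytes_field[:1] == b"\0":
            break  # unused slot: the in-header map ends here
        segments.append((parse_octal(header[base:base + 12]),
                         parse_octal(numbytes_field)))
    return {
        "stored_size": parse_octal(header[SIZE_OFF:SIZE_OFF + 12]),
        "real_size": parse_octal(header[REALSIZE_OFF:REALSIZE_OFF + 12]),
        "segments": segments,
        "more_map_blocks": header[ISEXTENDED_OFF] != 0,
    }


def octal_field(value: int) -> bytes:
    """Encode value as an 11-digit octal field with a trailing NUL."""
    return b"%011o\0" % value


# Synthetic header for illustration only: a 1 MiB file whose only data is
# 4096 bytes at offset 8192.
demo = bytearray(512)
demo[SIZE_OFF:SIZE_OFF + 12] = octal_field(4096)
demo[TYPEFLAG_OFF] = ord("S")
demo[SPARSE_OFF:SPARSE_OFF + 12] = octal_field(8192)
demo[SPARSE_OFF + 12:SPARSE_OFF + 24] = octal_field(4096)
demo[REALSIZE_OFF:REALSIZE_OFF + 12] = octal_field(1 << 20)
info = read_old_gnu_sparse(bytes(demo))
print(info)
```

The point of the sketch is that such a header carries two sizes: the number of bytes stored in the archive and the size of the file once the holes are restored. A reader that does not recognize the sparse fields only sees the former.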

Implementing support for the old GNU sparse format in libarchive shouldn't be too difficult, since the infrastructure for handling GNU conventions and handling general sparse files already exists. (If you'd like to work on this, I would recommend waiting a couple of weeks until I land my current overhaul of the tar header parsing code. Though I might just take a quick crack at it myself while I'm in this part of the code. ;-)

@kientzle
Contributor

Note: The tar header reading overhaul is currently #2127

@kientzle
Contributor

I've started looking into this and it seems we do have code to support the old GNU sparse format, but apparently it's not working correctly for this specific example.

@kientzle
Contributor

Here's the real oddity: Your third file is using a GNU sparse file extension, but isn't marked as using GNU format. Rather, it's marked as being a standard "ustar" format file. That seems to be the core problem -- libarchive only expects this particular GNU sparse file format when reading GNU tar files. Do you happen to know what program created this file?
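For readers who want to check an archive for the same mismatch, here is a minimal sketch that inspects the magic, version, and typeflag fields of a header block. It assumes the usual conventions (POSIX ustar stores `ustar\0` followed by `00`, while GNU tar stores `ustar` followed by two spaces and a NUL); the function names are hypothetical, and libarchive's own format detection is more involved than this.

```python
MAGIC_OFF = 257     # 6-byte magic field
VERSION_OFF = 263   # 2-byte version field
TYPEFLAG_OFF = 156  # 1-byte entry type


def classify_magic(header: bytes) -> str:
    """Guess the header dialect from the magic and version fields only."""
    magic = header[MAGIC_OFF:MAGIC_OFF + 6]
    version = header[VERSION_OFF:VERSION_OFF + 2]
    if magic == b"ustar\0" and version == b"00":
        return "posix-ustar"
    if magic + version == b"ustar  \0":
        return "gnu"
    return "unknown"


def has_sparse_typeflag(header: bytes) -> bool:
    """True when the entry type is 'S', the sparse-file extension."""
    return header[TYPEFLAG_OFF:TYPEFLAG_OFF + 1] == b"S"


def is_suspicious(header: bytes) -> bool:
    """Flag the combination described above: an 'S' entry in a ustar header."""
    return has_sparse_typeflag(header) and classify_magic(header) == "posix-ustar"


# Synthetic headers for illustration only.
ustar_sparse = bytearray(512)
ustar_sparse[MAGIC_OFF:MAGIC_OFF + 8] = b"ustar\x0000"
ustar_sparse[TYPEFLAG_OFF] = ord("S")

gnu_sparse = bytearray(512)
gnu_sparse[MAGIC_OFF:MAGIC_OFF + 8] = b"ustar  \x00"
gnu_sparse[TYPEFLAG_OFF] = ord("S")

print(classify_magic(bytes(ustar_sparse)), is_suspicious(bytes(ustar_sparse)))
print(classify_magic(bytes(gnu_sparse)), is_suspicious(bytes(gnu_sparse)))
```

To try it on a real archive, read the first 512 bytes of the uncompressed `.tar` file and pass them to `is_suspicious`.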

@kientzle
Contributor

Ah. This seems to be a "star" format archive, which uses an "S" header that is just different enough from GNU's "S" header that libarchive's logic for GNU "S" headers won't work with it. Probably your other archives are actually GNU tar archives.

I was rather confused at first because the documentation I found for star format shows a "tar" signature at the end of the header which your example does not have. It looks like star dropped that signature at some point in favor of a slightly more complex check for its special header format.
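The trailing-signature check mentioned above can be sketched in a few lines. The offset (the last four bytes of the 512-byte header, holding `tar` plus a NUL) is an assumption based on older descriptions of the star header, and `has_star_trailer` is a hypothetical helper; the "slightly more complex check" that newer star archives need is not reproduced here.

```python
STAR_SIGNATURE_OFF = 508  # assumed position, from older star format notes


def has_star_trailer(header: bytes) -> bool:
    """True when the header block ends with the old star 'tar\\0' signature."""
    if len(header) != 512:
        raise ValueError("a tar header block is exactly 512 bytes")
    return header[STAR_SIGNATURE_OFF:STAR_SIGNATURE_OFF + 4] == b"tar\0"


# Synthetic headers for illustration only.
with_trailer = bytearray(512)
with_trailer[STAR_SIGNATURE_OFF:STAR_SIGNATURE_OFF + 4] = b"tar\0"
without_trailer = bytes(512)

print(has_star_trailer(bytes(with_trailer)))
print(has_star_trailer(without_trailer))
```

A header like the one in this report would fail this check, which is why the signature alone is not enough to recognize it.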

@M-a-r-k
Author

M-a-r-k commented Apr 27, 2024

Well done for figuring that out! I do remember experimenting/playing with star, which has various options to specify the archive type/variant.

It looks like I posted this in 2011: https://lists.gnu.org/archive/html/bug-tar/2011-02/msg00010.html
There I mentioned star options, which might have been the same I used to create the example archive here:
$ star -no-fifo -v -c f=archive.tar artype=xustar -numeric -sparse filename
