unpack and decompress archive formats #111

pabs3 · 2021-12-14T09:39:28Z

There are thousands of files in Debian source packages that are in formats that contain other files: compressed files (*.gz, *.bz2, *.xz, etc.), tarballs (*.tar, *.tar.*, *.t*z), zip files (*.zip, *.jar), filesystem images (*.iso), and other archive formats. For situations like the log4shell security issue in log4j, where there may be many embedded code copies inside archive files, it would be useful if the Debian Code Search system could recursively (within limits) unpack all the archive files in Debian source packages and index those too.

I see from #80 that there is already some unpacking going on, but I assume that dcs isn't unpacking everything and isn't unpacking recursively.
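To make "recursively (within limits)" concrete, here is a minimal Python sketch of a depth-limited walk over nested archives. It is not how DCS works (DCS is a separate codebase); the function names, the `!/` path separator, and the default `max_depth` are all assumptions for illustration, and only gzip, tar, and zip/jar are handled.

```python
import gzip
import io
import tarfile
import zipfile
import zlib
from typing import Iterator, List, Optional, Tuple


def _try_gunzip(data: bytes) -> Optional[bytes]:
    """Return the decompressed payload, or None if data is not valid gzip."""
    if data[:2] != b"\x1f\x8b":  # gzip magic bytes
        return None
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        return None


def _try_tar(data: bytes) -> List[Tuple[str, bytes]]:
    """Return (name, content) for regular files in a tar, or [] if not a tar."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
            return [(m.name, tf.extractfile(m).read())
                    for m in tf.getmembers() if m.isfile()]
    except tarfile.TarError:
        return []


def _try_zip(data: bytes) -> List[Tuple[str, bytes]]:
    """Return (name, content) for files in a zip/jar, or [] if not a zip."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [(i.filename, zf.read(i)) for i in zf.infolist() if not i.is_dir()]


def walk_archive(name: str, data: bytes, max_depth: int = 4,
                 depth: int = 0) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for every leaf file, descending into archives.

    max_depth is a hypothetical limit; each decompression or unpacking
    layer counts as one level. Anything at the limit is yielded as-is.
    """
    if depth < max_depth:
        inner = _try_gunzip(data)
        if inner is not None:
            # A compression layer keeps the same path and only unwraps the data.
            yield from walk_archive(name, inner, max_depth, depth + 1)
            return
        members = _try_tar(data) or _try_zip(data)
        if members:
            for member_name, content in members:
                yield from walk_archive(f"{name}!/{member_name}", content,
                                        max_depth, depth + 1)
            return
    yield name, data
```

An indexer built this way would receive paths such as `outer.zip!/inner.tar.gz!/hello.txt`, so a search hit can still be traced back to the archive it came from. The sketch holds everything in memory, which a real importer would likely want to avoid for large files.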

stapelberg · 2021-12-14T14:53:23Z

Is unpacking recursively really required? It sounds unclean to me.

In general, I would have expected the dpkg-source command we run in importer.go to result in an indexable source directory, not requiring any further unpacking. As you pointed out, in #80 we added explicit unpacking for packages which unpack during build.

Are you saying there are packages that unpack recursively during build? Are these one-offs or entire build systems?

For log4j, wouldn’t it be a more promising strategy (in particular because I can’t promise any changes to DCS to happen quickly) to code search for usages of the log4j library (Java imports?) rather than copies of the log4j library?

pabs3 · 2021-12-15T01:22:09Z

Debian source packages are not exactly proper pristine clean source only trees. Even Debian trees aren't proper pristine clean source only trees. The same applies to other distros and upstream source tarballs and VCS trees. There are thousands of prebuilt files, compressed files and archive files themselves containing more prebuilt files, compressed files or archive files, possibly recursively for multiple levels. Indexing the files at each of the recursion levels would be useful for many different situations, including log4j. For example:

$ apt-file search -I dsc --regex '\.(tar(\.(gz|bz2|xz))|tgz|tbz|txz|iso|zip|jar|rar|gz|bz2|xz)$' | wc -l
42576

For log4j, as I understand it the vulnerability is in log4j itself and the fix just disables the vulnerable feature, so you first need to find instances of the vulnerable code in copies of log4j, then determine if they are built into .jar files and/or exported to binary packages in other ways, then determine if any source packages build against those binary packages, and then check if they copy files out of the log4j-containing binary packages.

https://opensourcesecurity.io/2021/12/12/log4j-is-hard-to-find-and-harder-to-fix/

For the opposite situation, where there is a common vulnerability caused by an API with a bad design, you need to find usages of the API, determine if they are dead code or not, and then fix the ones that end up in binary packages.

For both of the above scenarios, the vulnerable code could be inside archive files, so indexing them is still a useful thing to do. Many of the archive files are probably unused, but there is no way to tell, since any part of the source package, including the upstream build system, test suite or even code from other binary packages, could be triggering unpacking and potentially the files could be used.

I don't think the Debian security team intends to go any further than fixing the Debian source packages of log4j, so the above feature request is unlikely to be of assistance to them.
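For readers who want to break the count above down by archive type, the following Python sketch applies the same regular expression to a list of paths. It works on an in-memory list rather than on the apt-file index; `count_archive_suffixes` is a hypothetical helper, not part of any Debian tool. Note that the pattern, as written in the command, requires a compression suffix after `.tar`, so plain `.tar` files are not counted.

```python
import re
from collections import Counter
from typing import Iterable

# Same pattern as in the apt-file command quoted above.
ARCHIVE_RE = re.compile(
    r"\.(tar(\.(gz|bz2|xz))|tgz|tbz|txz|iso|zip|jar|rar|gz|bz2|xz)$"
)


def count_archive_suffixes(paths: Iterable[str]) -> Counter:
    """Count paths per matched archive suffix (hypothetical helper)."""
    counts: Counter = Counter()
    for path in paths:
        match = ARCHIVE_RE.search(path)
        if match:
            # group(1) is the suffix without the leading dot, e.g. "tar.gz".
            counts[match.group(1)] += 1
    return counts
```

Feeding it the file list of a single source package would show whether the archives are mostly `.jar` files, compressed test data, or nested tarballs.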

…

-- bye, pabs https://bonedaddy.net/pabs3/

stapelberg · 2021-12-18T16:45:19Z

> Debian source packages are not exactly proper pristine clean source only trees. Even Debian trees aren't proper pristine clean source only trees. The same applies to other distros and upstream source tarballs and VCS trees. There are thousands of prebuilt files, compressed files and archive files themselves containing more prebuilt files, compressed files or archive files, possibly recursively for multiple levels.

While Debian source packages aren’t always clean source-only trees, they usually are, and I remain unconvinced that just blindly unpacking all archives will result in more valuable data in the search index afterwards.

I scrolled through a number of file names based on the command you provided, and most files look like testdata, samples, etc.

When faced with embedded copies of other software, it’s generally in the package maintainer’s interest to get rid of this problem, as all the tooling works to your disadvantage otherwise. Your package will be harder to maintain, trigger more lintian warnings, etc.

I could find just one example in favor of your feature request, which is piespy, where a dependency upstream distributes their software in a jar that contains sources and binary data, and the package rebuilds from source to be DFSG-compliant: https://sources.debian.org/src/piespy/0.4.0-5/debian/rules/. Those sources are not indexed by Debian Code Search, because they’re in a .jar file until build time. Ideally, of course, that dependency would be in a separate Debian source package.

Debian Code Search hasn’t had to take a strong position regarding vendoring of dependency sources thus far (it’s happening so little within the Debian archive that we could just pretend the problem doesn’t exist). Generally, I’ve tried to keep the search results as high-quality as possible, so my first instinct is to avoid vendored sources as much as possible, but I can see that for some use-cases it would be valuable to include all vendored sources. It might be another axis of searching altogether (include/exclude vendored code).

So, to summarize: I can see the point of extracting archives, but given the numbers, I think it’d do more harm than good, and if we wanted to extract archives after all, we should probably allow including/excluding vendored code (and recognizing it as such!) first.

pabs3 · 2021-12-19T02:14:29Z

Summary: I agree with the approach mentioned in the final paragraph, but disagree on the incidence of embedding and the usefulness of indexing archives.

I'm not sure about the incidence of compressed or archived embedded dependencies; the results for log4j look like about 5/6 source packages embed log4j .jar files. For me that was enough that I thought I should at least start a discussion about this.

$ apt-file search -I dsc -i --regex log4j.*jar

On the topic of embedded copies in general, I think you are mistaken about how common it is to embed copies of dependencies in upstream tarballs distributed by Debian. For example, the Firefox source package embeds at least 64 different Python modules. The record for embedding in Debian that I saw was about 5 layers deep of projects embedding projects embedding projects; IIRC that was in a Qt/KDE fork of Chromium. There are many, many copies known to the Debian security team and many, many copies that they do not know about. I've come across a lot myself; I don't bother to report them as there are basically too many to deal with manually. Since the Debian Technical Committee decision that approved vendored dependencies in Kubernetes, and due to the popularity of vendoring in some communities like Golang, and due to the declining popularity of distros amongst application authors, I believe that this trend is only going to increase in the future.

https://wiki.debian.org/EmbeddedCopies
https://lwn.net/Articles/835599/

On the topic of hiding embedded copies from the Debian Code Search interface, I think that is a great idea. During my use of the service to find common typos I came across lots of embedded copies (especially in Firefox/Chromium) that I would like to have been able to hide. At the same time, I think it is important to have an option to show them, for use-cases where they are important (like security issues).

On the topic of automatically detecting embedded copies, I would love to have a tool for this, so if you write something for this, please make it a separate project, with a command-line tool included.

On the topic of how to detect embedded copies, the check-all-the-things project has heuristics to find them and a TODO item for some other ideas for detecting them that Debian folks came up with, quoted below.

https://github.com/collab-qa/check-all-the-things

[embed-readme]
flags = embed
files = *README*
comment = Please check if these README files belong to embedded code/data copies.
command = find {cwd} -mindepth 2 -iname '*README*'

[embed-dirs]
flags = embed
comment = Please check if these directories contain embedded code/data copies.
command = find {cwd} -type d -name 'vendor*' -o -iname '*rd*party' -o -iname 3rdp -o -name contrib -o -name imports -o -name node_modules -o -iname external -o -iname externals -o -iname deps -o -name inc -o -name __pypackages__

[embed-auto-tools]
flags = todo
# I've seen configure.ac in legit subdirs, not sure how false-positivy that would be. will add anyway
# you can look for the known-metadata files in subdir
# .gitmodules :D
# setup.py, *.gemspec, package.json etc
# how about detecting mentions of subdir projects in top-level build scripts?
# no idea about details though
# also, cmake has ExternalProject_Add
# another one: difference license block than the majority of files in the package
# (should work well for license blocks that name the copyright owner)
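As a rough illustration of how the `[embed-dirs]` name heuristic could feed an include/exclude-vendored filter, here is a Python sketch that applies the same name patterns while walking a tree. The function names are hypothetical, and there is one deliberate difference from the quoted command: because `-type d` binds only to the first `-name` test in that `find` expression, the original also matches non-directories for the later patterns, whereas this sketch looks at directories only.

```python
import fnmatch
import os
from typing import List

# Patterns taken from the [embed-dirs] command quoted above.
CASE_SENSITIVE = ["vendor*", "contrib", "imports", "node_modules",
                  "inc", "__pypackages__"]          # find -name
CASE_INSENSITIVE = ["*rd*party", "3rdp", "external",
                    "externals", "deps"]            # find -iname


def looks_embedded(dirname: str) -> bool:
    """Return True if a directory name matches one of the heuristics."""
    if any(fnmatch.fnmatchcase(dirname, p) for p in CASE_SENSITIVE):
        return True
    lowered = dirname.lower()
    return any(fnmatch.fnmatchcase(lowered, p) for p in CASE_INSENSITIVE)


def find_embedded_dirs(root: str) -> List[str]:
    """List directories under root (relative paths) that look embedded."""
    hits = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if looks_embedded(name):
                hits.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(hits)
```

A search frontend could tag files under the returned directories as vendored and hide or show them on request. Name matching alone will produce false positives and misses, which is why the TODO notes above suggest additional signals such as metadata files and license blocks.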

…

-- bye, pabs https://bonedaddy.net/pabs3/

stapelberg added the enhancement label Dec 18, 2021
