unpack and decompress archive formats #111

Open
pabs3 opened this issue Dec 14, 2021 · 4 comments

Comments

@pabs3
Member

pabs3 commented Dec 14, 2021

There are thousands of files in Debian source packages that are in formats that contain other files, for example compressed files (.gz, .bz2, .xz, etc.), tarballs (.tar, .tar.*, .tz), zip files (.zip, .jar), filesystems (.iso) and other archive formats. For situations like the log4shell security issue in log4j, where there may be many embedded code copies inside archive files, it would be useful if the Debian code search system could recursively (within limits) unpack all the archive files in Debian source packages and index those too. I see from #80 that there is already some unpacking going on, but I assume that dcs isn't unpacking everything and isn't unpacking recursively.
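
To make "recursively (within limits)" concrete, here is a minimal sketch in Go. It is not how dcs works today. The names `walk`, `readLimited` and `sampleArchive`, the limits `maxDepth` and `maxSize`, and the `!` separator in nested paths are all assumptions made up for illustration. It only recognises gzip, zip/jar and ustar-style tar by their magic bytes and keeps everything in memory.

```go
package main

import (
	"archive/tar"
	"archive/zip"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

const (
	maxDepth = 3        // hypothetical nesting limit
	maxSize  = 64 << 20 // hypothetical cap per unpacked member (64 MiB)
)

// readLimited reads at most maxSize bytes; longer members are truncated.
func readLimited(r io.Reader) ([]byte, error) {
	return io.ReadAll(io.LimitReader(r, maxSize))
}

// walk calls visit for every file that is not unpacked any further: it is not
// a recognised container, it could not be opened, or maxDepth was reached.
func walk(path string, data []byte, depth int, visit func(path string, data []byte)) {
	if depth < maxDepth {
		switch {
		case bytes.HasPrefix(data, []byte{0x1f, 0x8b}): // gzip magic
			if zr, err := gzip.NewReader(bytes.NewReader(data)); err == nil {
				if inner, err := readLimited(zr); err == nil {
					walk(path, inner, depth+1, visit)
					return
				}
			}
		case bytes.HasPrefix(data, []byte("PK\x03\x04")): // zip and jar
			if zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data))); err == nil {
				for _, f := range zr.File {
					if f.FileInfo().IsDir() {
						continue
					}
					rc, err := f.Open()
					if err != nil {
						continue
					}
					inner, err := readLimited(rc)
					rc.Close()
					if err == nil {
						walk(path+"!"+f.Name, inner, depth+1, visit)
					}
				}
				return
			}
		case len(data) > 262 && string(data[257:262]) == "ustar": // tar with ustar magic
			tr := tar.NewReader(bytes.NewReader(data))
			for {
				hdr, err := tr.Next()
				if err != nil {
					break // io.EOF or a damaged archive: stop here
				}
				if hdr.Typeflag != tar.TypeReg {
					continue
				}
				if inner, err := readLimited(tr); err == nil {
					walk(path+"!"+hdr.Name, inner, depth+1, visit)
				}
			}
			return
		}
	}
	visit(path, data)
}

// sampleArchive builds a zip holding a README and a gzip-compressed tarball.
// Errors are ignored because writes to a bytes.Buffer do not fail.
func sampleArchive() []byte {
	var tarBuf, gzBuf, zipBuf bytes.Buffer
	src := []byte("import org.apache.logging.log4j.Logger;\n")

	tw := tar.NewWriter(&tarBuf)
	tw.WriteHeader(&tar.Header{Name: "src/Log.java", Typeflag: tar.TypeReg, Mode: 0o644, Size: int64(len(src))})
	tw.Write(src)
	tw.Close()

	gw := gzip.NewWriter(&gzBuf)
	gw.Write(tarBuf.Bytes())
	gw.Close()

	zw := zip.NewWriter(&zipBuf)
	readme, _ := zw.Create("README")
	readme.Write([]byte("hello\n"))
	nested, _ := zw.Create("vendor/lib.tar.gz")
	nested.Write(gzBuf.Bytes())
	zw.Close()
	return zipBuf.Bytes()
}

func main() {
	walk("pkg.zip", sampleArchive(), 0, func(path string, data []byte) {
		fmt.Printf("%s (%d bytes)\n", path, len(data))
	})
}
```

The `visit` callback stands in for whatever would feed a file to the indexer. With `maxDepth` at 3 the sample is unpacked all the way down to the Java file; lowering it would leave the inner tarball as an opaque leaf.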

@stapelberg
Contributor

Is unpacking recursively really required? It sounds unclean to me.

In general, I would have expected the dpkg-source command we run in importer.go to result in an indexable source directory, not requiring any further unpacking. As you pointed out, in #80 we added explicit unpacking for packages which unpack during build.

Are you saying there are packages that unpack recursively during build? Are these one-offs or entire build systems?


For log4j, wouldn’t it be a more promising strategy (in particular because I can’t promise any changes to DCS to happen quickly) to code search for usages of the log4j library (Java imports?) rather than copies of the log4j library?
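
As a rough illustration of what searching for usages rather than copies could look like, the sketch below matches Java import statements for log4j with a regular expression. The pattern and the names `log4jImport` and `findLog4jImports` are assumptions for this example, not an existing DCS feature. It covers the log4j 2.x package `org.apache.logging.log4j` and the 1.x package `org.apache.log4j`, and it does not catch fully qualified uses without an import.

```go
package main

import (
	"fmt"
	"regexp"
)

// log4jImport matches Java import statements for log4j 2.x
// (org.apache.logging.log4j) and log4j 1.x (org.apache.log4j),
// including static and wildcard imports.
var log4jImport = regexp.MustCompile(
	`(?m)^\s*import\s+(?:static\s+)?(org\.apache\.(?:logging\.)?log4j\b[\w.]*(?:\.\*)?)\s*;`)

// findLog4jImports returns the imported names found in one Java source file.
func findLog4jImports(src string) []string {
	var names []string
	for _, m := range log4jImport.FindAllStringSubmatch(src, -1) {
		names = append(names, m[1])
	}
	return names
}

func main() {
	src := `package demo;
import java.util.List;
import org.apache.logging.log4j.LogManager;
import static org.apache.logging.log4j.Level.*;
`
	// Prints the two log4j imports and skips java.util.List.
	fmt.Println(findLog4jImports(src))
}
```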

@pabs3
Member Author

pabs3 commented Dec 15, 2021 via email

@stapelberg
Contributor

Debian source packages are not exactly proper pristine clean source only trees. Even Debian trees aren't proper pristine clean source only trees. The same applies to other distros and upstream source tarballs and VCS trees. There are thousands of prebuilt files, compressed files and archive files themselves containing more prebuilt files, compressed files or archive files, possibly recursively for multiple levels.

While Debian source packages aren’t always clean source-only trees, they usually are, and I remain unconvinced that just blindly unpacking all archives will result in more valuable data in the search index afterwards.

I scrolled through a number of file names based on the command you provided, and most files look like testdata, samples, etc.

When faced with embedded copies of other software, it’s generally in the package maintainer’s interest to get rid of this problem, as all the tooling works to your disadvantage otherwise. Your package will be harder to maintain, trigger more lintian warnings, etc.

I could find just one example in favor of your feature request, which is piespy, where a dependency upstream distributes their software in a jar that contains sources and binary data, and the package rebuilds from source to be DFSG-compliant: https://sources.debian.org/src/piespy/0.4.0-5/debian/rules/. Those sources are not indexed by Debian Code Search, because they’re in a .jar file until build time. Ideally, of course, that dependency would be in a separate Debian source package.

Debian Code Search hasn’t had to take a strong position regarding vendoring of dependency sources thus far (it’s happening so little within the Debian archive that we could just pretend the problem doesn’t exist). Generally, I’ve tried to keep the search results as high-quality as possible, so my first instinct is to avoid vendored sources as much as possible, but I can see that for some use-cases it would be valuable to include all vendored sources. It might be another axis of searching altogether (include/exclude vendored code).

So, to summarize: I can see the point of extracting archives, but given the numbers, I think it’d do more harm than good, and if we wanted to extract archives after all, we should probably allow including/excluding vendored code (and recognizing it as such!) first.

@pabs3
Member Author

pabs3 commented Dec 19, 2021 via email
