-
Notifications
You must be signed in to change notification settings - Fork 621
Improve DB metadata regarding data provenance #2529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with some pedantic comments about checksums
grype/vulnerability/provider.go
Outdated
From string `json:"from,omitempty"` | ||
Built string `json:"built,omitempty"` | ||
Path string `json:"path,omitempty"` | ||
Checksum string `json:"checksum,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you highlighted, I'm not sure what value the post-hydration checksum provides here -- it doesn't match the checksum that someone could see if looking at the online archives, and there is a high probability that it will not match across different hydrations in various cases. I think this fact will actually end up making this checksum more confusing than good. Maybe this could be the checksum of the original download, which would be something a user could at least validate later.
Related, I think it's actually not correct to include the checksum in the From
, because I believe the go-getter library is actually stripping that out of the request and only using it for post-download validation.
Just my 2 cents, I'll leave it to your judgement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still value in the checksum in the sense that at startup we validate the DB integrity with it. It can also still be used to verify the same exact DB was used for comparison on the same system.
One problem with changing the semantics of the DB status checksum field to mean the archive is that historically it's been used to describe the DB and it's an object that is describing the DB status. So I could easily see folks trying to use the checksum field against the DB path and get confused as to why it never lines up.
I do agree that the usefulness of the checksum field has overall been diminished, and probably has a place (along with some timestamps) to be considered with #522 , but I don't think I'm trying to say that there is no value in having it.
re: From field... I realize that go-getter takes out this query param on the request, but the nice thing about it is that I have a branch which implements #2134 and having the query param means with a simple jq .descriptor.db.status.from
against a grype json doc and use that with grype db import
to get a specific DB with checksum validation without the need to have a history.json lookup:
go run ./cmd/grype db import $(cat mygrype.json | jq '.descriptor.db.status.from')
✔ Vulnerability DB [imported]
or if it's not what you'd expect:
...
✔ Vulnerability DB [validating]
[0004] ERROR unable to import vulnerability database: unable to update vulnerability database: unable to download db: Checksums did not match for /tmp/getter3555198916/archive.
Expected: d4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b7
Got: 8a179b9568141aad1092b899fe7c1da09af5a69d4654e3b21b97957c4a7c4d68
I'll think about it over the weekend and chat it through on monday.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mulling it over and regarding our sync this morning, I'm going to remove the checksum field entirely since even it's existence might raise user-surprise, which is not ideal. For the meantime I'll leave the query param for the archive checksum in the URL since it doesn't seem to be harming anything.
@@ -3,37 +3,39 @@ package commands | |||
import ( | |||
"bytes" | |||
"errors" | |||
"github.com/anchore/grype/grype/vulnerability" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: formatting
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
With the distribution changes in DB v6 schema, the DB archive contains a DB with no indexes and upon download the new indexes are built before the startup checksum is captured. The process of building indexes (or really writing into a SQLite DB of any kind) does not guarantee any stability of a digest of the resulting DB.
This is a small problem in terms of reproducible documents, since the grype JSON output provides a digest of the DB file as proof that two scans were scanned with the same dataset. With v5 no writing to the DB was done after unarchiving, so multiple scans occurring on different systems could be comparable (typically comparisons rely on the same database to be used, otherwise the comparison should be considered potentially invalid... potentially an apples-to-oranges comparison). With v6 this kind of comparison across systems is not possible.
This PR does a few things in the area of data provenance:
descriptor.db
block. This is the point in time the data was captured from the canonical/upstream vulnerability provider as well as a digest of all of the files that were considered when creating vulnerability records from that upstream source. This has two advantages: a) if there is a change between scans, you can tell exactly which providers changed, b) operationally if we rebuild and redistribute multiple builds of the same DB within a day (which has happened many times before) then scan results during that time frame would still be comparable.descriptor.db
block (Addressing Add the DB url to the JSON descriptor block #356). This makes it easier to reproduce results (and hints at doing Allow DB import from a URL #2134 in the near future)db status
output anddescriptor.db
block in the grype json report.Here's
what the descriptor block used to look like:
And with the current changes:
In terms of the
db status
command theFrom
attribute is also shown in the text/json formats:At a lower level there were a few changes made to implement all of this:
vulnerability.StoreMetadataProvider
interface was added which vulnerability providers can opt into implementing. This provides at least provider-level information, and with future refactoring could provide status information as well (I decided to not tackle that in this PR)Load()
DB function (i.e. for v5, v6, v7, etc...) instead of relying on per-schema struct definitions here.Note: if the user is importing the DB from the local filesystem (either a DB directly or an archive) then
manual import
is shown instead:Closes #356
I'll probably pick up #522 next to close the gap in terms of making the output reproducible (removing timestamps and digests that are not stable).