Improve DB metadata regarding data provenance #2529

wagoodman · 2025-03-14T19:57:26Z

With the distribution changes in the DB v6 schema, the DB archive contains a DB with no indexes; upon download, the indexes are built before the startup checksum is captured. Building indexes (or really any write into a SQLite DB) does not guarantee a stable digest of the resulting DB file.

This is a small problem in terms of reproducible documents, since the grype JSON output provides a digest of the DB file as proof that two scans were run against the same dataset. With v5 no writing to the DB was done after unarchiving, so multiple scans occurring on different systems could be compared (typically comparisons rely on the same database being used, otherwise the comparison should be considered potentially invalid... potentially an apples-to-oranges comparison). With v6 this kind of comparison across systems is not possible.

This PR does a few things in the area of data provenance:

Adds per-provider date and input-digest values to the descriptor.db block. This is the point in time the data was captured from the canonical/upstream vulnerability provider as well as a digest of all of the files that were considered when creating vulnerability records from that upstream source. This has two advantages: a) if there is a change between scans, you can tell exactly which providers changed, b) operationally if we rebuild and redistribute multiple builds of the same DB within a day (which has happened many times before) then scan results during that time frame would still be comparable.
Adds the URL the DB was downloaded from to the descriptor.db block (addressing #356, "Add the DB url to the JSON descriptor block"). This makes it easier to reproduce results (and hints at doing #2134, "Allow DB import from a URL", in the near future).
Removes the existing checksum field from the db status output and descriptor.db block in the grype json report.

Here's what the descriptor block used to look like:

$ grype -q alpine:latest -o json | jq '.descriptor.db'

{
  "schemaVersion": "v6.0.2",
  "built": "2025-03-14T04:07:07Z",
  "path": "/Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db",
  "checksum": "xxh64:ae6202871c2abc5e",
  "error": ""
}

And with the current changes:

{
  "status": {
    "schemaVersion": "v6.0.2",
    "from": "https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8",
    "built": "2025-03-14T04:07:07Z",
    "path": "/Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db",
    "valid": true
  },
  "providers": {
    "alpine": {
      "captured": "2025-03-14T01:32:06Z",
      "input": "xxh64:4d4e5fc2d251d291"
    },
    "amazon": {
      "captured": "2025-03-14T01:32:30Z",
      "input": "xxh64:b2e9a0009edb4e5b"
    },
    "chainguard": {
      "captured": "2025-03-14T01:31:06Z",
      "input": "xxh64:f12f651b0b61b670"
    },
    "debian": {
      "captured": "2025-03-14T01:32:35Z",
      "input": "xxh64:5452e317668309bb"
    },
    "epss": {
      "captured": "2025-03-14T01:32:26Z",
      "input": "xxh64:da84696be83705ac"
    },
    "github": {
      "captured": "2025-03-14T01:32:38Z",
      "input": "xxh64:f2a3b2b908fc6b78"
    },
    "kev": {
      "captured": "2025-03-14T01:32:23Z",
      "input": "xxh64:c73cc13946f1659a"
    },
    "mariner": {
      "captured": "2025-03-14T01:32:15Z",
      "input": "xxh64:c993cbddc9768c71"
    },
    "nvd": {
      "captured": "2025-03-14T01:37:39Z",
      "input": "xxh64:0f6fef6b4be95891"
    },
    "oracle": {
      "captured": "2025-03-14T01:32:14Z",
      "input": "xxh64:29175752bbed7eb2"
    },
    "rhel": {
      "captured": "2025-03-14T01:33:03Z",
      "input": "xxh64:027af8200ecf9d68"
    },
    "sles": {
      "captured": "2025-03-14T01:32:25Z",
      "input": "xxh64:0b9e802a0262600f"
    },
    "ubuntu": {
      "captured": "2025-03-14T01:32:59Z",
      "input": "xxh64:096ed5534524b39c"
    },
    "wolfi": {
      "captured": "2025-03-14T01:32:22Z",
      "input": "xxh64:352df829a48d7298"
    }
  }
}
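
As a rough illustration of how the per-provider values could be used to tell which providers changed between two scans, here is a minimal Go sketch. The `dbDescriptor` and `providerInfo` types and the `changedProviders` helper are hypothetical names made up for this example (they are not grype types), and the second document uses a made-up `nvd` digest to simulate an upstream change.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// providerInfo mirrors one entry of the "providers" object shown above.
type providerInfo struct {
	Captured string `json:"captured"`
	Input    string `json:"input"`
}

// dbDescriptor is a hypothetical, minimal view of .descriptor.db; only the
// "providers" key is decoded here.
type dbDescriptor struct {
	Providers map[string]providerInfo `json:"providers"`
}

// changedProviders returns the sorted names of providers whose input digest
// differs between a and b, including providers present on only one side.
func changedProviders(a, b dbDescriptor) []string {
	seen := map[string]bool{}
	var changed []string
	for name, pa := range a.Providers {
		seen[name] = true
		if pb, ok := b.Providers[name]; !ok || pa.Input != pb.Input {
			changed = append(changed, name)
		}
	}
	for name := range b.Providers {
		if !seen[name] {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}

// mustParse decodes a descriptor.db JSON document and panics on bad input.
func mustParse(doc string) dbDescriptor {
	var d dbDescriptor
	if err := json.Unmarshal([]byte(doc), &d); err != nil {
		panic(err)
	}
	return d
}

func main() {
	first := mustParse(`{"providers":{
		"alpine":{"captured":"2025-03-14T01:32:06Z","input":"xxh64:4d4e5fc2d251d291"},
		"nvd":{"captured":"2025-03-14T01:37:39Z","input":"xxh64:0f6fef6b4be95891"}}}`)
	// Made-up nvd digest and timestamp, standing in for a later capture.
	second := mustParse(`{"providers":{
		"alpine":{"captured":"2025-03-14T01:32:06Z","input":"xxh64:4d4e5fc2d251d291"},
		"nvd":{"captured":"2025-03-15T01:37:39Z","input":"xxh64:0000000000000000"}}}`)

	fmt.Println(changedProviders(first, second)) // prints "[nvd]"
}
```

Comparing only the `input` digests (and not `built` or `path`) is what would keep two scans comparable across a same-day rebuild of the DB, assuming the upstream inputs did not change.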

In terms of the db status command the From attribute is also shown in the text/json formats:

$ go run ./cmd/grype db status                                       

Path:      /Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db
Schema:    v6.0.2
Built:     2025-03-14T04:07:07Z
From:      https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8
Status:    valid

{
 "schemaVersion": "v6.0.2",
 "from": "https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8",
 "built": "2025-03-14T04:07:07Z",
 "path": "/Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db",
 "valid": true
}
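
For anyone consuming the `from` value programmatically, the archive checksum can be pulled out of the query string with the standard library alone. This is only a sketch: `splitFrom` is a hypothetical helper written for this example, not something grype or go-getter provides.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitFrom separates a "from" URL into the plain archive URL (without the
// checksum query parameter), the checksum algorithm, and the hex digest.
func splitFrom(from string) (string, string, string, error) {
	u, err := url.Parse(from)
	if err != nil {
		return "", "", "", err
	}
	q := u.Query()
	// Query() percent-decodes values, so "sha256%3A..." becomes "sha256:...".
	algo, digest, ok := strings.Cut(q.Get("checksum"), ":")
	if !ok {
		return "", "", "", fmt.Errorf("no algo:digest checksum in %q", from)
	}
	q.Del("checksum")
	u.RawQuery = q.Encode()
	return u.String(), algo, digest, nil
}

func main() {
	from := "https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8"

	archiveURL, algo, digest, err := splitFrom(from)
	if err != nil {
		panic(err)
	}
	fmt.Println("archive:  ", archiveURL)
	fmt.Println("algorithm:", algo)
	fmt.Println("digest:   ", digest)
}
```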

At a lower level there were a few changes made to implement all of this:

A new vulnerability.StoreMetadataProvider interface was added, which vulnerability providers can opt into implementing. This provides at least provider-level information, and with future refactoring could provide status information as well (I decided not to tackle that in this PR).
Standardizes the DB status returned from any Load() DB function (i.e. for v5, v6, v7, etc...) instead of relying on per-schema struct definitions.
The DB client now exposes the URL used to download new archives, and the curator bakes that URL into the import metadata file (see below).

$ cat ~/Library/Caches/grype/db/6/import.json

{
 "digest": "xxh64:ae6202871c2abc5e",
 "source": "https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8",
 "client_version": "v6.0.2"
}

Note: if the user is importing the DB from the local filesystem (either a DB directly or an archive) then "manual import" is shown as the From value instead:

$ go run ./cmd/grype db import ~/code/grype-db/build/vulnerability.db
 ✔ Vulnerability DB                [imported]  

$ go run ./cmd/grype db status                                  
Path:      /Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db
Schema:    v6.0.2
Built:     2025-03-13T15:21:43Z
From:      manual import
Status:    valid
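
To illustrate the opt-in interface mentioned in the list of lower-level changes, here is a small Go sketch of the general pattern (an optional interface discovered with a type assertion). Everything in it is hypothetical: the `metadataProvider` interface, its `DataProvenance` method, and the provider types are stand-ins invented for this example, and the real vulnerability.StoreMetadataProvider may look different.

```go
package main

import "fmt"

// provenance holds the per-provider values surfaced in the report.
type provenance struct {
	Captured string
	Input    string
}

// provider is a stand-in for the base vulnerability provider interface.
type provider interface {
	Name() string
}

// metadataProvider is a hypothetical opt-in interface; the method set of the
// real vulnerability.StoreMetadataProvider is an assumption here.
type metadataProvider interface {
	DataProvenance() (map[string]provenance, error)
}

// v6Provider opts in by implementing both interfaces.
type v6Provider struct{}

func (v6Provider) Name() string { return "v6" }

func (v6Provider) DataProvenance() (map[string]provenance, error) {
	return map[string]provenance{
		"alpine": {Captured: "2025-03-14T01:32:06Z", Input: "xxh64:4d4e5fc2d251d291"},
	}, nil
}

// legacyProvider does not opt in.
type legacyProvider struct{}

func (legacyProvider) Name() string { return "legacy" }

// describe returns provenance when p opts in, or nil when it does not.
func describe(p provider) (map[string]provenance, error) {
	mp, ok := p.(metadataProvider)
	if !ok {
		return nil, nil
	}
	return mp.DataProvenance()
}

func main() {
	for _, p := range []provider{v6Provider{}, legacyProvider{}} {
		info, err := describe(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d provider entries\n", p.Name(), len(info))
	}
}
```

The appeal of this shape is that callers never need to know the schema version; providers that cannot supply provenance simply yield an empty `providers` block.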

Closes #356

I'll probably pick up #522 next to close the gap in terms of making the output reproducible (removing timestamps and digests that are not stable).

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

kzantow

LGTM, with some pedantic comments about checksums

kzantow · 2025-03-14T20:10:12Z

grype/vulnerability/provider.go

+		From          string `json:"from,omitempty"`
+		Built         string `json:"built,omitempty"`
+		Path          string `json:"path,omitempty"`
+		Checksum      string `json:"checksum,omitempty"`


As you highlighted, I'm not sure what value the post-hydration checksum provides here -- it doesn't match the checksum that someone could see if looking at the online archives, and there is a high probability that it will not match across different hydrations in various cases. I think this fact will actually end up making this checksum more confusing than helpful. Maybe this could be the checksum of the original download, which would be something a user could at least validate later.

Related, I think it's actually not correct to include the checksum in the From, because I believe the go-getter library is actually stripping that out of the request and only using it for post-download validation.

Just my 2 cents, I'll leave it to your judgement.

wagoodman · 2025-03-15

There is still value in the checksum in the sense that at startup we validate the DB integrity with it. It can also still be used to verify the same exact DB was used for comparison on the same system.

One problem with changing the semantics of the DB status checksum field to mean the archive is that historically it's been used to describe the DB and it's an object that is describing the DB status. So I could easily see folks trying to use the checksum field against the DB path and get confused as to why it never lines up.

I do agree that the usefulness of the checksum field has overall been diminished, and it should probably be reconsidered (along with some timestamps) as part of #522, but I wouldn't say that there is no value in having it.

re: From field... I realize that go-getter strips this query param from the request, but the nice thing about keeping it is that I have a branch which implements #2134, and having the query param means you can run a simple jq .descriptor.db.status.from against a grype JSON doc and pass the result to grype db import to get a specific DB with checksum validation, without needing a history.json lookup:

$ go run ./cmd/grype db import $(cat mygrype.json | jq -r '.descriptor.db.status.from')
 ✔ Vulnerability DB                [imported]

or if it's not what you'd expect:

...
 ✔ Vulnerability DB                [validating]
[0004] ERROR unable to import vulnerability database: unable to update vulnerability database: unable to download db: Checksums did not match for /tmp/getter3555198916/archive.
Expected: d4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b7
Got: 8a179b9568141aad1092b899fe7c1da09af5a69d4654e3b21b97957c4a7c4d68

I'll think about it over the weekend and chat it through on monday.

wagoodman · 2025-03-17

After mulling it over and following our sync this morning, I'm going to remove the checksum field entirely, since even its existence might surprise users, which is not ideal. For the meantime I'll leave the query param for the archive checksum in the URL since it doesn't seem to be harming anything.

kzantow · 2025-03-14T20:39:51Z

cmd/grype/cli/commands/db_status_test.go

@@ -3,37 +3,39 @@ package commands
 import (
 	"bytes"
 	"errors"
+	"github.com/anchore/grype/grype/vulnerability"


nit: formatting

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

improve metadata around data provenance

75e0c39

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

wagoodman added the enhancement label Mar 14, 2025

kzantow approved these changes Mar 14, 2025

wagoodman mentioned this pull request Mar 17, 2025

Allow import DB from URL #2532

Merged

wagoodman enabled auto-merge (squash) March 17, 2025 16:14

wagoodman merged commit 2eb0c33 into main Mar 17, 2025
10 checks passed

wagoodman deleted the db-provenance-improvements branch March 17, 2025 16:34

wagoodman mentioned this pull request Mar 17, 2025

Make timestamp in output configurable (so that results are more reproducible) #522

Open

BrewTestBot mentioned this pull request Mar 17, 2025

grype 0.90.0 Homebrew/homebrew-core#211281

Merged
