Skip to content

Improve DB metadata regarding data provenance #2529

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 17, 2025

Conversation

wagoodman
Copy link
Contributor

@wagoodman wagoodman commented Mar 14, 2025

With the distribution changes in DB v6 schema, the DB archive contains a DB with no indexes and upon download the new indexes are built before the startup checksum is captured. The process of building indexes (or really writing into a SQLite DB of any kind) does not guarantee any stability of a digest of the resulting DB.

This is a small problem in terms of reproducible documents, since the grype JSON output provides a digest of the DB file as proof that two scans were scanned with the same dataset. With v5 no writing to the DB was done after unarchiving, so multiple scans occurring on different systems could be comparable (typically comparisons rely on the same database to be used, otherwise the comparison should be considered potentially invalid... potentially an apples-to-oranges comparison). With v6 this kind of comparison across systems is not possible.

This PR does a few things in the area of data provenance:

  • Adds per-provider date and input-digest values to the descriptor.db block. This is the point in time the data was captured from the canonical/upstream vulnerability provider as well as a digest of all of the files that were considered when creating vulnerability records from that upstream source. This has two advantages: a) if there is a change between scans, you can tell exactly which providers changed, b) operationally if we rebuild and redistribute multiple builds of the same DB within a day (which has happened many times before) then scan results during that time frame would still be comparable.
  • Adds the URL which the DB was downloaded from to the descriptor.db block (Addressing Add the DB url to the JSON descriptor block #356). This makes it easier to reproduce results (and hints at doing Allow DB import from a URL #2134 in the near future)
  • Removes the existing checksum field from the db status output and descriptor.db block in the grype json report.

Here's
what the descriptor block used to look like:

$ grype -q alpine:latest -o json | jq '.descriptor.db'
{
  "schemaVersion": "v6.0.2",
  "built": "2025-03-14T04:07:07Z",
  "path": "/Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db",
  "checksum": "xxh64:ae6202871c2abc5e",
  "error": ""
}

And with the current changes:

{
  "status": {
    "schemaVersion": "v6.0.2",
    "from": "https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8",
    "built": "2025-03-14T04:07:07Z",
    "path": "/Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db",
    "valid": true
  },
  "providers": {
    "alpine": {
      "captured": "2025-03-14T01:32:06Z",
      "input": "xxh64:4d4e5fc2d251d291"
    },
    "amazon": {
      "captured": "2025-03-14T01:32:30Z",
      "input": "xxh64:b2e9a0009edb4e5b"
    },
    "chainguard": {
      "captured": "2025-03-14T01:31:06Z",
      "input": "xxh64:f12f651b0b61b670"
    },
    "debian": {
      "captured": "2025-03-14T01:32:35Z",
      "input": "xxh64:5452e317668309bb"
    },
    "epss": {
      "captured": "2025-03-14T01:32:26Z",
      "input": "xxh64:da84696be83705ac"
    },
    "github": {
      "captured": "2025-03-14T01:32:38Z",
      "input": "xxh64:f2a3b2b908fc6b78"
    },
    "kev": {
      "captured": "2025-03-14T01:32:23Z",
      "input": "xxh64:c73cc13946f1659a"
    },
    "mariner": {
      "captured": "2025-03-14T01:32:15Z",
      "input": "xxh64:c993cbddc9768c71"
    },
    "nvd": {
      "captured": "2025-03-14T01:37:39Z",
      "input": "xxh64:0f6fef6b4be95891"
    },
    "oracle": {
      "captured": "2025-03-14T01:32:14Z",
      "input": "xxh64:29175752bbed7eb2"
    },
    "rhel": {
      "captured": "2025-03-14T01:33:03Z",
      "input": "xxh64:027af8200ecf9d68"
    },
    "sles": {
      "captured": "2025-03-14T01:32:25Z",
      "input": "xxh64:0b9e802a0262600f"
    },
    "ubuntu": {
      "captured": "2025-03-14T01:32:59Z",
      "input": "xxh64:096ed5534524b39c"
    },
    "wolfi": {
      "captured": "2025-03-14T01:32:22Z",
      "input": "xxh64:352df829a48d7298"
    }
  }
}

In terms of the db status command the From attribute is also shown in the text/json formats:

$ go run ./cmd/grype db status                                       

Path:      /Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db
Schema:    v6.0.2
Built:     2025-03-14T04:07:07Z
From:      https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8
Status:    valid
{
 "schemaVersion": "v6.0.2",
 "from": "https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8",
 "built": "2025-03-14T04:07:07Z",
 "path": "/Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db",
 "valid": true
}

At a lower level there were a few changes made to implement all of this:

  • a new vulnerability.StoreMetadataProvider interface was added which vulnerability providers can opt into implementing. This provides at least provider-level information, and with future refactoring could provide status information as well (I decided to not tackle that in this PR)
  • Standardizes DB status returned from any Load() DB function (i.e. for v5, v6, v7, etc...) instead of relying on per-schema struct definitions here.
  • The DB client now expose the URL used to download new archives and the curator bakes in that URL into the import metadata file (see below).
cat ~/Library/Caches/grype/db/6/import.json
{
 "digest": "xxh64:ae6202871c2abc5e",
 "source": "https://grype.anchore.io/databases/v6/vulnerability-db_v6.0.2_2025-03-14T01:31:06Z_1741925227.tar.zst?checksum=sha256%3Ad4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b8",
 "client_version": "v6.0.2"
}

Note: if the user is importing the DB from the local filesystem (either a DB directly or an archive) then manual import is shown instead:

$ go run ./cmd/grype db import ~/code/grype-db/build/vulnerability.db
 ✔ Vulnerability DB                [imported]  

$ go run ./cmd/grype db status                                  
Path:      /Users/wagoodman/Library/Caches/grype/db/6/vulnerability.db
Schema:    v6.0.2
Built:     2025-03-13T15:21:43Z
From:      manual import
Status:    valid

Closes #356

I'll probably pick up #522 next to close the gap in terms of making the output reproducible (removing timestamps and digests that are not stable).

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@wagoodman wagoodman added the enhancement New feature or request label Mar 14, 2025
Copy link
Contributor

@kzantow kzantow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with some pedantic comments about checksums

From string `json:"from,omitempty"`
Built string `json:"built,omitempty"`
Path string `json:"path,omitempty"`
Checksum string `json:"checksum,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you highlighted, I'm not sure what value the post-hydration checksum provides here -- it doesn't match the checksum that someone could see if looking at the online archives, and there is a high probability that it will not match across different hydrations in various cases. I think this fact will actually end up making this checksum more confusing than good. Maybe this could be the checksum of the original download, which would be something a user could at least validate later.

Related, I think it's actually not correct to include the checksum in the From, because I believe the go-getter library is actually stripping that out of the request and only using it for post-download validation.

Just my 2 cents, I'll leave it to your judgement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still value in the checksum in the sense that at startup we validate the DB integrity with it. It can also still be used to verify the same exact DB was used for comparison on the same system.

One problem with changing the semantics of the DB status checksum field to mean the archive is that historically it's been used to describe the DB and it's an object that is describing the DB status. So I could easily see folks trying to use the checksum field against the DB path and get confused as to why it never lines up.

I do agree that the usefulness of the checksum field has overall been diminished, and probably has a place (along with some timestamps) to be considered with #522 , but I don't think I'm trying to say that there is no value in having it.

re: From field... I realize that go-getter takes out this query param on the request, but the nice thing about it is that I have a branch which implements #2134 and having the query param means with a simple jq .descriptor.db.status.from against a grype json doc and use that with grype db import to get a specific DB with checksum validation without the need to have a history.json lookup:

go run ./cmd/grype db import $(cat mygrype.json | jq '.descriptor.db.status.from')
 ✔ Vulnerability DB                [imported]  

or if it's not what you'd expect:

...
 ✔ Vulnerability DB                [validating]  
[0004] ERROR unable to import vulnerability database: unable to update vulnerability database: unable to download db: Checksums did not match for /tmp/getter3555198916/archive.
Expected: d4654e3b212f1d8a1aaab979599691099af541568d687c4a7c4e7c1da079b9b7
Got:      8a179b9568141aad1092b899fe7c1da09af5a69d4654e3b21b97957c4a7c4d68

I'll think about it over the weekend and chat it through on monday.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mulling it over and regarding our sync this morning, I'm going to remove the checksum field entirely since even it's existence might raise user-surprise, which is not ideal. For the meantime I'll leave the query param for the archive checksum in the URL since it doesn't seem to be harming anything.

@@ -3,37 +3,39 @@ package commands
import (
"bytes"
"errors"
"github.com/anchore/grype/grype/vulnerability"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: formatting

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@wagoodman wagoodman enabled auto-merge (squash) March 17, 2025 16:14
@wagoodman wagoodman merged commit 2eb0c33 into main Mar 17, 2025
10 checks passed
@wagoodman wagoodman deleted the db-provenance-improvements branch March 17, 2025 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add the DB url to the JSON descriptor block
2 participants