Consider using VDB6 as a data source #1155

prabhu · 2024-03-21T22:12:27Z

AppThreat vulnerability-db is an MIT-licensed database used by tools such as depscan for scanning. VDB6 is now available as a downloadable SQLite database. This data would help DT support containers, Linux OS, and some C/C++ with purl-based searches.

The easiest way to download the databases is with the ORAS CLI tool.

oras pull ghcr.io/appthreat/vdbxz:v6
tar -xvf data.vdb6.tar.xz
tar -xvf data.index.vdb6.tar.xz

Use any SQLite browser tool to inspect and query the databases.
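For a scripted look instead of a GUI browser, Python's built-in sqlite3 module is enough to list the tables and their columns. This is a minimal sketch; the helper name describe_tables is made up for illustration and nothing here is specific to VDB6.

```python
import sqlite3


def describe_tables(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Return {table_name: [column_name, ...]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        schema[table] = [
            column[1] for column in conn.execute(f"PRAGMA table_info({quoted})")
        ]
    return schema
```

To run it against the extracted file, open the database read-only, for example with sqlite3.connect("file:data.vdb6?mode=ro", uri=True), and pass the connection to describe_tables.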

Proposed integration

Fork the vdb repo and publish vdb6 artefacts under DT org
Use a compatible OCI registry library to periodically download the VDB6 databases
Populate internal tables with the information from the index database (data.index.vdb6)
Where the search results refer to CVE IDs already present in DT, use the information from the DT database. Fall back to the SQLite database (data.vdb6) for new IDs such as ALSA- or RLSA-

Possible challenges

data.vdb6 uses BLOB columns to store JSONB data in the CVE 5.0 JSON format. This requires SQLite >= 3.45.2 to be installed. If this proves too challenging, we can offer a patch to create the database with JSON columns compatible with older versions of SQLite
Some percentage of the purls generated by cdxgen for containers will not be compatible. This is a bug/feature in cdxgen which will be addressed in due course
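The SQLite version requirement can be checked up front before opening data.vdb6. The sketch below does this for the SQLite library linked into Python; the 3.45.2 threshold is simply the number quoted above, and the function name is hypothetical. From another runtime, the same information is available through the SQL function sqlite_version().

```python
import sqlite3

# Minimum version quoted in this issue; treat it as an assumption to re-check.
MIN_SQLITE_FOR_VDB6 = (3, 45, 2)


def sqlite_is_new_enough(
    version: tuple[int, ...] = sqlite3.sqlite_version_info,
    minimum: tuple[int, ...] = MIN_SQLITE_FOR_VDB6,
) -> bool:
    """Return True if the given SQLite version meets the stated minimum."""
    # Tuples compare element by element, so (3, 45, 2) >= (3, 45, 2) is True.
    return tuple(version) >= tuple(minimum)
```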

nscuro · 2024-03-22T09:43:59Z

Thanks for the suggestion @prabhu, will definitely have a look!

If we end up pulling in a pre-compiled / curated database, it would be great to have the possibility of only fetching deltas. As in: "Only give me data that changed since I last checked". Having to pull in an entire blob of 1-N GB for only minor changes in the dataset will be expensive both on the network and on the processing side of things.

I know this is a tricky problem which may not work when distributing the data as SQLite. Did you look into this aspect before, by chance?

prabhu · 2024-03-22T11:31:03Z

@nscuro Thank you so much for looking into this.

Note the entire compressed database is only 188MB.

total 188M
-rw-r--r-- 1 prabhu prabhu  45M Mar 22 10:05 data.index.vdb6.tar.xz
-rw-r--r-- 1 prabhu prabhu 144M Mar 22 10:05 data.vdb6.tar.xz

Regarding the delta database, the larger database has a source_data_hash column that could be used for this in the future. I am happy to collaborate and improve this.

nscuro · 2024-04-10T12:57:25Z

Apologies for the delay. The compression definitely is a good thing here, thanks for pointing it out!

Regarding the source_data_hash, what would be even more helpful is an updated_at column. This way, when syncing with our internal database, we can drastically reduce the number of records we have to enumerate. We could then do a SELECT ... WHERE updated_at > :lastSync.

Would that be a viable thing for VDB to add? I reckon it would require some sort of state-keeping between successive builds of the DB...
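To make the incremental query concrete, here is a small sketch using Python's sqlite3 module. It assumes a hypothetical schema: updated_at does not exist in VDB6 today, the cve_id column name is a guess, and timestamps are taken to be ISO 8601 UTC strings so that string comparison matches chronological order.

```python
import sqlite3


def changed_since(conn: sqlite3.Connection, last_sync: str) -> list[str]:
    """Return IDs of rows whose (hypothetical) updated_at is after last_sync."""
    rows = conn.execute(
        "SELECT cve_id FROM cve_data "
        "WHERE updated_at > :lastSync ORDER BY updated_at",
        {"lastSync": last_sync},
    )
    return [row[0] for row in rows]
```

The caller would persist the highest updated_at it has seen and pass it as last_sync on the next run.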

nscuro · 2024-04-12T10:14:08Z

Responding to my own question above, I think the point

Fork the vdb repo and publish vdb6 artefacts under DT org

from the issue description kinda covers that already. Essentially we can do the state-keeping and enrichment with updated_at ourselves, in our fork.
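The state-keeping could look roughly like the sketch below: each build compares an entry's hash with the hash recorded by the previous build and only bumps updated_at when they differ. The function and parameter names are hypothetical; the hash could be source_data_hash or a hash over the fully built entry, which is a design decision the thread leaves open.

```python
def carry_forward_timestamps(
    previous: dict[str, tuple[str, str]],
    current_hashes: dict[str, str],
    build_time: str,
) -> dict[str, tuple[str, str]]:
    """Merge the previous build's state with the current build's hashes.

    previous maps entry ID to (hash, updated_at) from the last build.
    current_hashes maps entry ID to the hash computed in this build.
    Returns the new state as entry ID to (hash, updated_at).
    """
    merged = {}
    for entry_id, entry_hash in current_hashes.items():
        old = previous.get(entry_id)
        if old is not None and old[0] == entry_hash:
            # Unchanged entry: keep the timestamp from the earlier build.
            merged[entry_id] = old
        else:
            # New or modified entry: stamp it with this build's time.
            merged[entry_id] = (entry_hash, build_time)
    return merged
```

Entries that disappear from the current build are dropped from the state, so deletions would need separate handling.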

prabhu · 2024-04-13T10:15:06Z

@nscuro I will look into the updated timestamp to see if there is a way to expose it as a column. At this point, I am not sure whether all the sources update this timestamp correctly, and some sources have no timestamps at all, which is why I went with the hash of the metadata.

Shall we explore alternatives to syncing the database like having a temp table for VDB6 or searching the sqlite directly for any hits from the index database?

prabhu · 2024-04-13T11:12:27Z

Another option is to use sqldiff to find the differing rows, but I have not tried this command yet.

Update:

Download sqldiff from here - https://www.sqlite.org/download.html

To get a quick summary:

sqldiff --summary --table cve_data data.vdb6 data.vdb6.bak

cve_data: 1901635 changes, 0 inserts, 18060 deletes, 195753 unchanged

To create SQL update statements for only the changed rows (this took a few minutes for me):

sqldiff --table cve_data data.vdb6 data.vdb6.bak > out.sql
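If the summary is going to drive a decision, such as whether a delta is worth shipping at all, it can be parsed programmatically. The sketch below only handles the line format pasted above; other sqldiff versions may print something different, and the function name is made up.

```python
import re

# Matches e.g. "cve_data: 12 changes, 0 inserts, 3 deletes, 40 unchanged".
SUMMARY_RE = re.compile(
    r"^(?P<table>\w+): (?P<changes>\d+) changes, (?P<inserts>\d+) inserts, "
    r"(?P<deletes>\d+) deletes, (?P<unchanged>\d+) unchanged$"
)


def parse_sqldiff_summary(line: str) -> dict[str, object]:
    """Turn one `sqldiff --summary` line into a dict of counts."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised sqldiff summary line: {line!r}")
    fields = match.groupdict()
    result: dict[str, object] = {"table": fields.pop("table")}
    # The remaining named groups are all integer counts.
    result.update({name: int(value) for name, value in fields.items()})
    return result
```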

nscuro · 2024-04-13T12:22:11Z

sqldiff definitely looks closer to what we'd need.

At this point, I am not sure if all the sources correctly update this timestamp and there are sources with no timestamps too, and hence went with the hash of the metadata.

While the created/updated timestamps of the upstream sources are nice to have, for our use case we are more interested in when VDB6 updated a given entry. Say we fix a bug in how verses are assembled, how certain fields from upstream sources are parsed, or manual corrections are applied. Essentially we need to know when either the upstream data, or the VDB6 logic changed.

prabhu · 2024-04-13T14:13:03Z

@nscuro, can this be achieved by fixing the version in the pipeline here?

nscuro · 2024-04-16T12:02:19Z

Side note: the selection of ORAS clients is rather sparse right now. The library proposed in the issue description might work, but would pull in Kotlin as an additional dependency. It's also fairly new, with only a single maintainer.

Considering we won't need the full capabilities of ORAS, we should implement the "pull" functionality ourselves, without adding new dependencies. In the end it's just an HTTP API. The spec is here: https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pull

sahibamittal · 2024-04-16T13:28:05Z

Some observations:

1. Each vulnerability ID in cve_data has a mapped source_data (CVE_JSON_5.0_schema), which does not have an EPSS score/percentile, so EPSS mirroring will still be required.
2. It doesn't give you an option to select specific OSV ecosystems, so we'll need to fetch all of them. Maybe we can enable this in a config to limit the ecosystems to mirror.
3. Overall it seems to ease the different datasource conversions into cdx in the Hyades repo (we will need only CVE_JSON_5.0_schema -> cdx), but it also adds additional processing of blob data for each vulnerability, which can be addressed by using the JDBC driver for SQLite (thanks @nscuro for the suggestion).

prabhu · 2024-04-16T15:07:59Z

@sahibamittal, thank you. Re (1), I am not a fan of EPSS, so I am unlikely to ever add support for it. For (2), we can enhance this code to accept a comma-separated list of OSV keys and create a new osv_url_dict which will get used subsequently.
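The filtering idea for (2) could be sketched as follows. Only the name osv_url_dict comes from the comment above; the function name, the behaviour for unknown keys, and the shape of the dictionary (ecosystem key mapped to download URL) are assumptions.

```python
def select_osv_urls(osv_url_dict: dict[str, str], wanted_keys: str) -> dict[str, str]:
    """Build a reduced osv_url_dict from a comma-separated list of keys.

    An empty or blank wanted_keys keeps every ecosystem.
    """
    keys = [key.strip() for key in wanted_keys.split(",") if key.strip()]
    if not keys:
        return dict(osv_url_dict)
    unknown = [key for key in keys if key not in osv_url_dict]
    if unknown:
        # Failing loudly avoids silently mirroring less than was asked for.
        raise KeyError(f"unknown OSV keys: {', '.join(unknown)}")
    return {key: osv_url_dict[key] for key in keys}
```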

nscuro · 2024-04-16T15:26:07Z

Inclusion of EPSS is something we could add as additional enrichment on our side.

@prabhu Any thoughts on resolving alias relationships? We did some research on this a while back, and found that alias data from some sources is wild west (mostly OSV), however data from GHSA is usually reliable. I'd assume the same to be true for Linux distro feeds.

Alias resolution is something that is easiest when all relevant data is present, so VDB6 is in a great position to make this happen as a post-build enrichment.
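One way such a post-build step might work is to treat alias claims as edges and merge them with a union-find structure, accepting edges only from sources considered reliable. This is a sketch of the general technique, not something VDB does today; the trusted-source set reflects the observation above about GHSA, and the identifiers in the usage note are made up.

```python
from collections import defaultdict

# Assumption based on this thread: only GHSA alias claims are trusted.
TRUSTED_SOURCES = frozenset({"GHSA"})


def group_aliases(claims, trusted=TRUSTED_SOURCES):
    """Group identifiers connected by trusted alias claims.

    claims is an iterable of (source, id_a, id_b) tuples.
    Returns a sorted list of sorted identifier groups.
    """
    parent = {}

    def find(identifier):
        parent.setdefault(identifier, identifier)
        while parent[identifier] != identifier:
            # Path halving keeps the trees shallow.
            parent[identifier] = parent[parent[identifier]]
            identifier = parent[identifier]
        return identifier

    for source, id_a, id_b in claims:
        if source not in trusted:
            continue  # Ignore alias claims from unreliable sources.
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for identifier in list(parent):
        groups[find(identifier)].add(identifier)
    return sorted(sorted(group) for group in groups.values())
```

For example, two GHSA claims linking GHSA-aaaa to CVE-0000-0001 and CVE-0000-0001 to RLSA-0000:0001 end up in one group, while an OSV claim linking CVE-0000-0001 to another CVE is ignored.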

prabhu · 2024-04-16T15:35:17Z

@nscuro, interesting idea! Aliases are currently set in the description section for some sources. VDB tries to resolve the CVE id if available to reduce duplicates. But definitely an idea for a future enhancement.

prabhu · 2024-05-27T12:09:45Z

@nscuro, now that CVE 5.1 is released with support for purl, I am thinking of prioritizing VDB 6.1, which will use 5.1 schema with a couple of breaking changes. Additionally, we can support vulnrichment repo (auto-upgraded to 5.1 format).

Are you OK with parking this issue and revisiting it around September 2024?

nscuro · 2024-05-28T16:30:31Z

@prabhu Most certainly.

nscuro added the p2, size/M, and spike/research labels Mar 22, 2024

sahibamittal self-assigned this Apr 10, 2024

sahibamittal removed their assignment Apr 16, 2024

nscuro added the blocked label May 28, 2024
