Consider using VDB6 as a data source #1155

Open
6 tasks
prabhu opened this issue Mar 21, 2024 · 15 comments
Labels
blocked; p2 (Non-critical bugs, and features that help organizations to identify and reduce risk); size/M (Medium effort); spike/research (Requires more research before implementation)

Comments


prabhu commented Mar 21, 2024

AppThreat vulnerability-db is an MIT-licensed database used by tools such as depscan for scanning. VDB6 is now available as a downloadable SQLite database. This data would help DT support containers, Linux OS packages, and some C/C++ components with purl-based searches.

The easiest way to download the databases is using the ORAS cli tool.

oras pull ghcr.io/appthreat/vdbxz:v6
tar -xvf data.vdb6.tar.xz
tar -xvf data.index.vdb6.tar.xz

Use any SQLite browser tool to inspect and query the databases.
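
For a scripted alternative to a browser tool, here is a minimal sketch that lists every table and its columns in one of the extracted files. It makes no assumption about the VDB6 schema; it only reads SQLite's own `sqlite_master` catalog. The function name `describe_database` is made up for illustration.

```python
import sqlite3


def describe_database(path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    conn = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
            quoted = table.replace('"', '""')
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

Calling `describe_database("data.vdb6")` after extraction should show which tables and columns are actually present before any integration code is written against them.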

Proposed integration

  • Fork the vdb repo and publish vdb6 artefacts under DT org
  • Use a compatible OCI registry library to periodically download the VDB6 databases
  • Populate internal tables with the information from the index database (data.index.vdb6)
  • Where the search results refer to CVE IDs already present in DT, use the information from the DT database. Fall back to the SQLite database (data.vdb6) for new IDs such as ALSA- or RLSA-
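
The last step above could look roughly like the sketch below. It is only an illustration: `dt_records` stands in for whatever DT already knows (a plain dict here), and the `cve_data` table with `cve_id` and `source_data` columns is an assumption about the data.vdb6 layout that should be checked against the real file.

```python
import sqlite3


def lookup_vulnerability(vuln_id, dt_records, vdb_conn):
    """Prefer DT's own record; fall back to VDB6 for IDs DT does not know.

    dt_records: hypothetical mapping of vulnerability ID -> DT record.
    vdb_conn:   open sqlite3 connection to data.vdb6 (schema assumed).
    Returns (origin, record), or None when neither side has the ID.
    """
    if vuln_id in dt_records:
        return ("dt", dt_records[vuln_id])
    row = vdb_conn.execute(
        "SELECT source_data FROM cve_data WHERE cve_id = ?", (vuln_id,)
    ).fetchone()
    if row is None:
        return None
    return ("vdb6", row[0])
```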

Possible challenges

  • data.vdb6 uses BLOB columns to store JSONB data in the CVE 5.0 JSON format. This requires SQLite >= 3.45.2 to be installed. If this proves too challenging, we can offer a patch to create the database with JSON columns compatible with older versions of SQLite
  • Some percentage of purls generated by cdxgen for containers will not be compatible. This is a bug/feature in cdxgen which will be addressed in time
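
A consumer could guard against the first challenge at startup. A minimal sketch, assuming the 3.45.2 minimum stated above; `supports_vdb6_jsonb` is a made-up helper name, and `sqlite3.sqlite_version_info` reports the SQLite library the Python runtime is linked against.

```python
import sqlite3

# Minimum SQLite version quoted in the issue for reading the JSONB BLOBs.
MIN_JSONB_VERSION = (3, 45, 2)


def supports_vdb6_jsonb(version_info=None):
    """Return True when the given (or linked) SQLite version is new enough."""
    if version_info is None:
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) >= MIN_JSONB_VERSION
```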
@nscuro nscuro added p2 Non-critical bugs, and features that help organizations to identify and reduce risk size/M Medium effort spike/research Requires more research before implementation labels Mar 22, 2024
Member

nscuro commented Mar 22, 2024

Thanks for the suggestion @prabhu, will definitely have a look!

If we end up pulling in a pre-compiled / curated database, it would be great to have the possibility of fetching only deltas, as in: "Only give me data that changed since I last checked". Having to pull in an entire blob of 1-N GB for only minor changes in the dataset will be expensive both on the network and on the processing side of things.

I know this is a tricky problem which may not work when distributing the data as SQLite. Did you look into this aspect before, by chance?

Author

prabhu commented Mar 22, 2024

@nscuro Thank you so much for looking into this.

Note the entire compressed database is only 188MB.

total 188M
-rw-r--r-- 1 prabhu prabhu  45M Mar 22 10:05 data.index.vdb6.tar.xz
-rw-r--r-- 1 prabhu prabhu 144M Mar 22 10:05 data.vdb6.tar.xz

Regarding the delta database, the larger database has a source_data_hash column, which could be used to build deltas in the future. I am happy to collaborate and improve this.
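
One way a hash column could drive a delta, sketched under assumptions: both snapshots have a `cve_data` table with `cve_id` and `source_data_hash` columns, and each `cve_id` appears at most once. The real table may key rows differently, so treat the join condition as a placeholder.

```python
import sqlite3


def changed_ids(old_path, new_path):
    """Return IDs that are new, or whose hash differs, in the newer snapshot."""
    conn = sqlite3.connect(new_path)
    try:
        # Attach the older snapshot so both can be joined in one query.
        conn.execute("ATTACH DATABASE ? AS old", (old_path,))
        rows = conn.execute(
            """
            SELECT n.cve_id
            FROM main.cve_data AS n
            LEFT JOIN old.cve_data AS o ON o.cve_id = n.cve_id
            WHERE o.cve_id IS NULL
               OR o.source_data_hash IS NOT n.source_data_hash
            ORDER BY n.cve_id
            """
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

Rows deleted between snapshots are not reported here; that would need the same join in the opposite direction.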

Member

nscuro commented Apr 10, 2024

Apologies for the delay. The compression definitely is a good thing here, thanks for pointing it out!

Regarding the source_data_hash, what would be even more helpful is an updated_at column. This way, when syncing with our internal database, we can drastically reduce the number of records we have to enumerate. We could then do a SELECT ... WHERE updated_at > :lastSync.

Would that be a viable thing for VDB to add? I reckon it would require some sort of state-keeping between successive builds of the DB...
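
To make the query above concrete, here is a minimal sketch. The `updated_at` column does not exist in VDB6 today; the sketch assumes it were added to `cve_data` as an ISO-8601 UTC string, which compares correctly as plain text.

```python
import sqlite3


def rows_updated_since(conn, last_sync):
    """Return (cve_id, updated_at) rows changed after the last sync point.

    updated_at is a hypothetical column holding ISO-8601 UTC timestamps.
    """
    cursor = conn.execute(
        "SELECT cve_id, updated_at FROM cve_data "
        "WHERE updated_at > :lastSync ORDER BY updated_at",
        {"lastSync": last_sync},
    )
    return cursor.fetchall()
```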

@sahibamittal sahibamittal self-assigned this Apr 10, 2024
Member

nscuro commented Apr 12, 2024

Responding to my own question above, I think the point

> Fork the vdb repo and publish vdb6 artefacts under DT org

from the issue description kinda covers that already. Essentially we can do the state-keeping and enrichment with updated_at ourselves, in our fork.
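
A rough sketch of what that state-keeping could look like between two builds. It is an assumption-laden illustration, not an agreed design: the state is a plain dict of ID to (hash, updated_at), and an entry only gets a fresh timestamp when its hash differs from the previous build.

```python
def stamp_updated_at(previous, current_hashes, build_time):
    """Carry updated_at forward for unchanged entries, refresh it otherwise.

    previous:       {vuln_id: (hash, updated_at)} from the last build (hypothetical state)
    current_hashes: {vuln_id: hash} computed by the current build
    build_time:     timestamp string to assign to new or changed entries
    """
    state = {}
    for vuln_id, digest in current_hashes.items():
        prior = previous.get(vuln_id)
        if prior is not None and prior[0] == digest:
            # Unchanged since the last build: keep the old timestamp.
            state[vuln_id] = prior
        else:
            # New entry, or upstream data / build logic changed the hash.
            state[vuln_id] = (digest, build_time)
    return state
```

For this to reflect logic changes as well as upstream changes, the hash would need to cover the generated output rather than only the upstream metadata.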

Author

prabhu commented Apr 13, 2024

@nscuro I will look into the updated timestamp to see if there is a way to expose it as a column. At this point, I am not sure if all the sources correctly update this timestamp and there are sources with no timestamps too, and hence went with the hash of the metadata.

Shall we explore alternatives to syncing the database like having a temp table for VDB6 or searching the sqlite directly for any hits from the index database?

Author

prabhu commented Apr 13, 2024

Another option is to use sqldiff to find the differing rows, but I have not tried this command yet.

Update:

Download sqldiff from https://www.sqlite.org/download.html

To quickly get a summary:

sqldiff --summary --table cve_data data.vdb6 data.vdb6.bak

cve_data: 1901635 changes, 0 inserts, 18060 deletes, 195753 unchanged

To create SQL update statements for only the changed rows (this took a few minutes for me):

sqldiff --table cve_data data.vdb6 data.vdb6.bak > out.sql
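
If the summary were consumed by automation, the line could be parsed as sketched below. The pattern is derived only from the single output line shown above, so other sqldiff versions or outputs may need adjustments; `parse_sqldiff_summary` is a made-up helper name.

```python
import re

# Matches lines such as:
# "cve_data: 1901635 changes, 0 inserts, 18060 deletes, 195753 unchanged"
SUMMARY_RE = re.compile(
    r"^(?P<table>[^:]+):\s+(?P<changes>\d+) changes,\s+"
    r"(?P<inserts>\d+) inserts,\s+(?P<deletes>\d+) deletes,\s+"
    r"(?P<unchanged>\d+) unchanged$"
)


def parse_sqldiff_summary(line):
    """Turn one `sqldiff --summary` line into a dict of counts."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised sqldiff summary line: {line!r}")
    result = {
        key: int(value)
        for key, value in match.groupdict().items()
        if key != "table"
    }
    result["table"] = match.group("table")
    return result
```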

Member

nscuro commented Apr 13, 2024

sqldiff definitely looks closer to what we'd need.

> At this point, I am not sure if all the sources correctly update this timestamp and there are sources with no timestamps too, and hence went with the hash of the metadata.

While the created/updated timestamps of the upstream sources are nice to have, for our use case we are more interested in when VDB6 updated a given entry. Say we fix a bug in how verses are assembled or in how certain fields from upstream sources are parsed, or we apply manual corrections. Essentially, we need to know when either the upstream data or the VDB6 logic changed.

Author

prabhu commented Apr 13, 2024

@nscuro, can this be achieved by fixing the version in the pipeline here?

Member

nscuro commented Apr 16, 2024

Side note: the selection of ORAS clients is rather sparse right now. The library proposed in the issue description might work, but it would pull in Kotlin as an additional dependency. It's also fairly new, with only a single maintainer.

Considering we won't need the full capabilities of ORAS, we should implement the "pull" functionality ourselves, without adding new dependencies. In the end it's just an HTTP API. The spec is here: https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pull
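
To show how small the "pull" side is, here is a sketch using only the standard library. The `/v2/<name>/manifests/<reference>` and `/v2/<name>/blobs/<digest>` paths come from the distribution spec linked above, and `appthreat/vdbxz` with tag `v6` comes from the `oras pull` command in the issue description. Everything else is an assumption: that file names are carried in the `org.opencontainers.image.title` layer annotation (the convention ORAS uses), and that the caller obtains any bearer token the registry requires, which is not shown.

```python
import json
import urllib.request

REGISTRY = "https://ghcr.io"
MANIFEST_TYPE = "application/vnd.oci.image.manifest.v1+json"


def manifest_url(repository, reference, registry=REGISTRY):
    """URL of the manifest for a tag or digest, per the distribution spec."""
    return f"{registry}/v2/{repository}/manifests/{reference}"


def blob_url(repository, digest, registry=REGISTRY):
    """URL of a single blob (layer), per the distribution spec."""
    return f"{registry}/v2/{repository}/blobs/{digest}"


def layer_files(manifest):
    """Map file name -> digest for layers that carry a title annotation."""
    files = {}
    for layer in manifest.get("layers", []):
        title = layer.get("annotations", {}).get("org.opencontainers.image.title")
        if title:
            files[title] = layer["digest"]
    return files


def fetch_manifest(repository, reference, token=None):
    """Download and decode a manifest; token handling is left to the caller."""
    headers = {"Accept": MANIFEST_TYPE}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(
        manifest_url(repository, reference), headers=headers
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A full pull would call `fetch_manifest("appthreat/vdbxz", "v6")`, pass the result to `layer_files`, and then download each digest from `blob_url`, verifying the SHA-256 of every blob against its digest.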

Collaborator

sahibamittal commented Apr 16, 2024

Some observations:

  • Each vulnerability ID in cve_data has mapped source_data (CVE_JSON_5.0_schema), which doesn't have the EPSS score/percentile, so EPSS mirroring will still be required.
  • It doesn't give you the option to select specific OSV ecosystems, so we'll need to fetch all of them. Maybe we can enable this in a config to limit the ecosystems to mirror.
  • Overall it seems to ease the different datasource conversions into cdx in the Hyades repo (we will need only CVE_JSON_5.0_schema -> cdx), but it also adds additional processing of blob data for each vulnerability, which can be addressed by using the JDBC driver for SQLite (thanks @nscuro for the suggestion).
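
On the blob processing point, one option is to let SQLite do the decoding: its `json()` function returns text JSON, and from SQLite 3.45 onwards it also accepts JSONB blobs as input. The sketch below uses Python rather than JDBC purely for illustration, and assumes a `cve_data` table with `cve_id` and `source_data` columns; only the first matching row is read.

```python
import json
import sqlite3


def load_source_data(conn, vuln_id):
    """Return the CVE record for one ID as a Python dict, or None.

    json(source_data) asks SQLite to render the stored value as text JSON;
    reading JSONB blobs this way needs a sufficiently recent SQLite.
    """
    row = conn.execute(
        "SELECT json(source_data) FROM cve_data WHERE cve_id = ?", (vuln_id,)
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return json.loads(row[0])
```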

@sahibamittal sahibamittal removed their assignment Apr 16, 2024
Author

prabhu commented Apr 16, 2024

@sahibamittal, thank you. Re (1), I am not a fan of EPSS, so I am unlikely to ever add support for it. For (2), we can enhance this code to accept a comma-separated list of OSV keys and create a new osv_url_dict, which will get used subsequently.
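
A minimal sketch of the filtering idea for (2). Only the name `osv_url_dict` comes from the comment above; the function name, its arguments, and the behaviour on unknown keys are assumptions made for illustration.

```python
def select_osv_urls(osv_url_dict, keys_csv):
    """Build a reduced ecosystem -> URL dict from a comma-separated key list.

    An empty or missing list keeps every ecosystem. Unknown keys raise an
    error so that typos in configuration do not silently skip a feed.
    """
    if not keys_csv or not keys_csv.strip():
        return dict(osv_url_dict)
    wanted = [key.strip() for key in keys_csv.split(",") if key.strip()]
    unknown = [key for key in wanted if key not in osv_url_dict]
    if unknown:
        raise KeyError(f"unknown OSV ecosystem keys: {', '.join(unknown)}")
    return {key: osv_url_dict[key] for key in wanted}
```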

Member

nscuro commented Apr 16, 2024

Inclusion of EPSS is something we could add as additional enrichment on our side.

@prabhu Any thoughts on resolving alias relationships? We did some research on this a while back and found that alias data from some sources (mostly OSV) is the wild west, whereas data from GHSA is usually reliable. I'd assume the same to be true for Linux distro feeds.

Alias resolution is something that is easiest when all relevant data is present, so VDB6 is in a great position to make this happen as a post-build enrichment.
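
As an illustration of why having all the data in one place helps, alias pairs can be merged into groups with a union-find pass once every source has been loaded. This is a generic technique, not something VDB does today, and the identifiers in the usage example are invented. It also treats every pair as trustworthy, which the research mentioned above suggests is not true for all sources.

```python
def group_aliases(pairs):
    """Merge (id, alias) pairs into sorted groups of equivalent identifiers."""
    parent = {}

    def find(item):
        # Register unseen items as their own root, then walk up to the root
        # while shortening the path for later lookups.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())
```

For example, `group_aliases([("CVE-2024-0001", "GHSA-aaaa"), ("GHSA-aaaa", "RLSA-2024:1")])` puts all three identifiers into a single group, even though the CVE and the RLSA entry never reference each other directly.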

Author

prabhu commented Apr 16, 2024

@nscuro, interesting idea! Aliases are currently set in the description section for some sources. VDB tries to resolve the CVE id if available to reduce duplicates. But definitely an idea for a future enhancement.

Author

prabhu commented May 27, 2024

@nscuro, now that CVE 5.1 is released with support for purl, I am thinking of prioritizing VDB 6.1, which will use the 5.1 schema with a couple of breaking changes. Additionally, we can support the vulnrichment repo (auto-upgraded to the 5.1 format).

Are you ok with parking this issue and revisiting it around September 2024?

Member

nscuro commented May 28, 2024

@prabhu Most certainly.

@nscuro nscuro added the blocked label May 28, 2024