Document the properties of a high quality OSV record
andrewpollock committed May 17, 2024
1 parent 2e1117e commit 12cd78b
Showing 2 changed files with 120 additions and 1 deletion.
7 changes: 6 additions & 1 deletion docs/data.md
@@ -62,6 +62,7 @@ Additionally, the OSV.dev team maintains a conversion pipeline for:
[here](https://github.com/google/osv.dev/tree/master/vulnfeeds/cmd/alpine).

## Covered Ecosystems

Between the data served in OSV and the data converted to OSV the following ecosystems are covered.

- AlmaLinux
@@ -89,6 +90,10 @@ Between the data served in OSV and the data converted to OSV the following ecosy
- RubyGems
- SwiftURL

## Data Quality

The quality of the data in OSV.dev [is very important to us](https://google.github.io/osv.dev/faq/#ive-found-something-wrong-with-the-data). The minimum quality bar for OSV records acceptable for import is documented [here](data_quality.md).

## Data dumps

For convenience, these sources are aggregated and continuously exported to a GCS
@@ -115,4 +120,4 @@ A list of all current ecosystems is available at
## Contributing Data
If you work with a project such as a Linux distribution and would like to contribute your security advisories, please follow the steps outlined in [CONTRIBUTING.md](https://github.com/google/osv.dev/blob/master/CONTRIBUTING.md#contributing-data).

Data can be supplied either through a public Git repository, a public GCS bucket or to [REST API endpoints](contributing/rest-api-contribution.md).
Data can be supplied either through a public Git repository, a public GCS bucket or to [REST API endpoints](contributing/rest-api-contribution.md).
114 changes: 114 additions & 0 deletions docs/data_quality.md
@@ -0,0 +1,114 @@
# Properties of a High Quality OSV Record

## Version

1.0.0 (SEMVER)

## Purpose

Describe the “good enough” OSV record that will be imported by OSV.dev.

### Out of scope

This does not discuss the problem of record bit rot over time, after initial successful import. The problem of continuous revalidation and treatment of records that have been successfully imported will be dealt with separately in .

Deferred to a future iteration: validating the existence of vulnerable functions in the `ecosystem_specific` field, if supplied.

## Audience

1. Current and aspiring OSV record producers
2. Downstream OSV.dev record consumers

## Rationale

OSV.dev seeks to be a comprehensive, accurate and timely database of known vulnerabilities that is highly automation friendly. To meet this accuracy goal, a quality bar needs to be both defined and sustainably enforced.

## Properties of a High Quality OSV Record

### Valid

As a prerequisite, it is assumed that a record passes [JSON Schema validation](#appendix-a-osv-schema-validation) for the version of the OSV Schema it declares itself to comply with in the `schema_version` field, or for version 1.0.0 if that field is absent.
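
As a rough illustration of the fallback described above, the sketch below picks the schema version a record should be validated against and performs a minimal pre-check for the two required top-level fields. The helper names `schema_version_for` and `missing_required` are hypothetical and are not part of any OSV.dev tooling. A real importer would run a full JSON Schema validator against the published schema for that version rather than this pre-check.

```python
from typing import Any

# Version assumed when a record does not declare `schema_version`.
DEFAULT_SCHEMA_VERSION = "1.0.0"

# Top-level fields that must always be present (see Appendix A).
REQUIRED_FIELDS = ("id", "modified")


def schema_version_for(record: dict[str, Any]) -> str:
    """Return the declared schema version, or the default if none is declared."""
    return record.get("schema_version", DEFAULT_SCHEMA_VERSION)


def missing_required(record: dict[str, Any]) -> list[str]:
    """Return the required top-level fields that the record lacks."""
    return [field for field in REQUIRED_FIELDS if field not in record]


# Hypothetical record, invented for this example.
record = {"id": "EXAMPLE-2024-0001", "modified": "2024-05-17T00:00:00Z"}
print(schema_version_for(record))
print(missing_required(record))
```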

### Precise

A high quality OSV record allows a consumer of that record to be able to answer the following questions in an **automated** way, at scale:

* “Does this vulnerability, as described, impact me?”
* “What version do I need to upgrade to for it not to impact me?”

The definition of “impact” will vary depending on how fine-grained the information available is (i.e. package-level or symbol-level for software library packages). Package-level precision is the minimum standard.

#### Properties

* for version and commit ranges
  * `affected[]`.`ranges[]`.`introduced` is defined
  * prefer `affected[]`.`ranges[]`.`fixed` over `affected[]`.`ranges[]`.`last_affected`
    * this minimizes false negatives
  * distinct ranges for `introduced..fixed` and/or `introduced..last_affected` *(i.e. introduced and fixed commits can't be the same)*
  * values in `introduced` are before/less than `fixed`/`last_affected`
* for version (`ECOSYSTEM` and `SEMVER`) ranges
  * the versions exist in the specific package ecosystem
* for commit (`GIT`) ranges
  * the commits exist in the specified `repo` *(i.e. they are not from another GitHub fork)*
* the `package.ecosystem`, and a unique `identifier` prefix for it, are defined in the OSV Schema
* the `package.name` exists within the defined `package.ecosystem`, and is canonically encoded for unambiguity *(i.e. normalized)*
* Package URLs in the `package.purl` field conform to the [specification](https://github.com/package-url/purl-spec)
* `reference` URLs return a 2xx or 3xx response
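
Some of the range properties above can be checked from the record alone, without consulting a package registry or a Git repository. The sketch below is one possible way to do that. It assumes the OSV Schema layout in which `introduced`, `fixed` and `last_affected` appear as entries of each range's `events` array. The function name `check_ranges` and the wording of the findings are hypothetical. It does not attempt the ordering check (`introduced` before `fixed`), since comparing versions is ecosystem-specific and comparing commits needs the repository history.

```python
from typing import Any


def check_ranges(record: dict[str, Any]) -> list[str]:
    """Return findings for range properties that can be checked offline."""
    findings: list[str] = []
    for a_idx, affected in enumerate(record.get("affected", [])):
        for r_idx, rng in enumerate(affected.get("ranges", [])):
            where = f"affected[{a_idx}].ranges[{r_idx}]"
            events = rng.get("events", [])
            introduced = [e["introduced"] for e in events if "introduced" in e]
            fixed = [e["fixed"] for e in events if "fixed" in e]
            last_affected = [e["last_affected"] for e in events if "last_affected" in e]

            # Every range needs a starting point.
            if not introduced:
                findings.append(f"{where}: no introduced event")
            # `fixed` is preferred because it minimizes false negatives.
            if last_affected and not fixed:
                findings.append(f"{where}: uses last_affected; prefer fixed")
            # A range whose start equals its end is not a distinct range.
            if set(introduced) & set(fixed + last_affected):
                findings.append(f"{where}: introduced equals an upper bound")
    return findings


# Hypothetical record, invented for this example.
record = {
    "affected": [
        {"ranges": [{"type": "SEMVER", "events": [{"introduced": "0"}, {"fixed": "1.2.3"}]}]}
    ]
}
print(check_ranges(record))
```

An empty list means none of these offline checks raised a finding; it says nothing about whether the versions or commits actually exist.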

### Identifiable

#### Properties

* Where relevant, an `alias` to the equivalent CVE record is present
* Where an OSV record consolidates multiple vulnerabilities in another ecosystem (or universe), multiple `related` identifiers are present
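
A minimal sketch of how these two properties might be checked follows. It assumes the identifiers live in the list-valued `aliases` and `related` fields of the record, and that CVE IDs follow the usual `CVE-<year>-<number>` shape. The helper `cve_ids` and both sample records are hypothetical. Whether a CVE alias is "relevant" for a given record is a judgment the code cannot make; it only reports what is present.

```python
import re
from typing import Any

# CVE IDs look like CVE-<four-digit year>-<sequence of at least four digits>.
CVE_ID = re.compile(r"^CVE-\d{4}-\d{4,}$")


def cve_ids(record: dict[str, Any], field: str) -> list[str]:
    """Return the CVE IDs found in a list-valued field such as `aliases` or `related`."""
    return [
        value
        for value in record.get(field, [])
        if isinstance(value, str) and CVE_ID.match(value)
    ]


# Hypothetical records, invented for this example.
single = {"id": "EXAMPLE-2024-0001", "aliases": ["CVE-2024-0001"]}
consolidated = {"id": "EXAMPLE-2024-0002", "related": ["CVE-2024-0002", "CVE-2024-0003"]}

print(cve_ids(single, "aliases"))
print(len(cve_ids(consolidated, "related")) > 1)
```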

## Examples

* [GO-2024-2687](https://api.osv.dev/v1/vulns/GO-2024-2687)
  * Has `introduced` and `fixed` versions
  * Has an alias to a CVE record ID
  * Has a purl
* [OSV-2024-98](https://api.osv.dev/v1/vulns/OSV-2024-98)
  * Has `introduced` and `fixed` commits
  * commits exist in repo
* [DSA-5678-1](https://api.osv.dev/v1/vulns/DSA-5678-1)
  * Has `introduced` and `fixed` versions
  * Has multiple `related` CVE record IDs

## Appendix A: OSV Schema validation

(As at version 1.6.3, generated by Gemini)

**Top-Level Information:**

* **id:** A unique string identifier for the vulnerability.
* **modified:** A timestamp (in a specific format) indicating when the vulnerability information was last updated.

**Optional, but validated when present:**

* **schema\_version:** A string specifying the version of the schema being used.
* **published/withdrawn:** Timestamps for when the vulnerability was published or withdrawn.
* **aliases/related:** Arrays of strings for alternate identifiers or related vulnerabilities.
* **summary/details:** String descriptions of the vulnerability.
* **severity:** An array of objects detailing the severity using different scoring systems (e.g., CVSS v2, v3, or v4), if available.
* **affected:** An array of objects describing which packages are affected, including details like:
  * **package:** The ecosystem (e.g., npm, PyPI), name, and Package URL (PURL) of the affected package.
  * **severity:** Severity for the specific package (if different from the overall severity).
  * **ranges:** Information on the affected version ranges, commit ranges, or ecosystem-specific identifiers.
  * **versions:** A list of specific affected versions.
  * **ecosystem\_specific/database\_specific:** Additional data specific to the package ecosystem or the vulnerability database.
* **references:** An array of objects providing URLs to external resources about the vulnerability, categorized by type (e.g., advisory, article, discussion).
* **credits:** An array of objects giving credit to individuals or organizations involved in discovering, reporting, or fixing the vulnerability.
* **database\_specific:** A flexible object for any extra information specific to the database using this schema.

**Additional Validation Rules:**

* **timestamp:** A custom definition that ensures timestamps adhere to a specific date-time format (e.g., "2023-11-15T12:34:56Z").
* **additionalProperties: false:** This prevents any extra properties from being added to the JSON object beyond those defined in the schema.
* **Specific Requirements in `affected` Array:**
  * There are conditional validations based on the `type` of range, ensuring the correct properties are present (e.g., `repo` is required when `type` is `GIT`).
  * A logical check ensures that if `last_affected` is specified in `events`, then `fixed` cannot be present in the same `events` array.
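
The rules above are enforced by the JSON Schema itself, but their intent can be shown with a short hand-written sketch. The timestamp pattern below is an assumption modelled on the example format given above (UTC, ending in `Z`, with optional fractional seconds) and is not copied from the schema. The function name `check_rules` and the sample record are hypothetical.

```python
import re
from typing import Any

# Assumed timestamp shape, e.g. "2023-11-15T12:34:56Z" (not taken from the schema).
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")


def check_rules(record: dict[str, Any]) -> list[str]:
    """Return violations of the additional validation rules described above."""
    problems: list[str] = []

    # Timestamps must match the expected date-time format when present.
    for field in ("modified", "published", "withdrawn"):
        value = record.get(field)
        if value is not None and not (isinstance(value, str) and TIMESTAMP.match(value)):
            problems.append(f"{field}: not a valid timestamp")

    for a_idx, affected in enumerate(record.get("affected", [])):
        for r_idx, rng in enumerate(affected.get("ranges", [])):
            where = f"affected[{a_idx}].ranges[{r_idx}]"
            # GIT ranges must name the repository their commits belong to.
            if rng.get("type") == "GIT" and "repo" not in rng:
                problems.append(f"{where}: GIT range without repo")
            # `fixed` and `last_affected` must not share an events array.
            keys = {key for event in rng.get("events", []) for key in event}
            if {"fixed", "last_affected"} <= keys:
                problems.append(f"{where}: both fixed and last_affected present")
    return problems


# Hypothetical record, invented for this example.
record = {
    "id": "EXAMPLE-2024-0001",
    "modified": "2024-05-17 00:00:00",
    "affected": [{"ranges": [{"type": "GIT", "events": [{"introduced": "0"}]}]}],
}
for problem in check_rules(record):
    print(problem)
```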

**Overall:**

This schema enforces a consistent and detailed structure for representing open source vulnerabilities, including information about affected packages, severity assessments, references, and credits. It helps ensure that the data is accurate and comprehensive while remaining flexible enough to accommodate various package ecosystems and additional data specific to the database using the schema.
