
v1.0

@sofiaoreis sofiaoreis released this 07 Mar 18:48
· 173 commits to main since this release
3d974be

The dataset integrates 5941 security patches (multi-language, single-commit, and multi-commit patches).

Dataset       Patches   Commits   Language   Refs
CVE-Details   2224      1816      multi      1
SecBench      659       659       multi      2
SAP           1127      565       Java       3
Big-Vul       4047      3433      C/C++      4

The vulnerabilities in the dataset span 146 different types and 20 programming languages. See the paper for more details.

Dataset Schema

  • cve_id: Common Vulnerabilities and Exposures (CVE) identifier.
  • project: GitHub project name.
  • sha: Commit key or identifier of the version in the project repository.
  • cwe_id: Common Weakness Enumeration (CWE) identifier of the vulnerability type.
  • score: Severity score of the vulnerability.
  • files: Set of files changed by the patch. Schema: {path: ..., additions: ..., deletions: ..., changes: ..., status: ...}.
  • github: Link to the commit on GitHub.
  • parents: Commit keys for the previous software version.
  • date: Date of the changes.
  • author: Author of the changes.
  • ext_files: Extension of the files.
  • lang: Programming language.
  • summary: Summary of the vulnerability.
  • message: Commit message.
  • comments: Developers' comments. Schema: {author: ..., date: ..., body: ...}.
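
As an illustration, the schema above could be modeled in Python roughly as follows. This is a minimal sketch, not the project's own loader: the class names, the field types (for example, `score` as a float and `parents` as a list of strings), and the sample values are all assumptions, and the release notes do not say which file format the records are stored in.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FileChange:
    """One entry of `files` (field names from the schema; types are assumed)."""
    path: str
    additions: int
    deletions: int
    changes: int
    status: str


@dataclass
class Comment:
    """One entry of `comments` (field names from the schema; types are assumed)."""
    author: str
    date: str
    body: str


@dataclass
class PatchRecord:
    """One patch record; the Python types here are assumptions."""
    cve_id: str
    project: str
    sha: str
    cwe_id: str
    score: float
    github: str
    date: str
    author: str
    lang: str
    summary: str
    message: str
    parents: list[str] = field(default_factory=list)
    ext_files: list[str] = field(default_factory=list)
    files: list[FileChange] = field(default_factory=list)
    comments: list[Comment] = field(default_factory=list)


def parse_record(raw: dict[str, Any]) -> PatchRecord:
    """Build a PatchRecord from a plain dict whose nested entries are dicts."""
    data = dict(raw)  # shallow copy so the caller's dict is not modified
    files = [FileChange(**f) for f in data.pop("files", [])]
    comments = [Comment(**c) for c in data.pop("comments", [])]
    return PatchRecord(files=files, comments=comments, **data)


# Hypothetical record with placeholder values, for illustration only.
SAMPLE = {
    "cve_id": "CVE-0000-0000",
    "project": "example/project",
    "sha": "abc123",
    "cwe_id": "CWE-79",
    "score": 4.3,
    "github": "https://github.com/example/project/commit/abc123",
    "date": "2000-01-01",
    "author": "someone",
    "lang": "PHP",
    "summary": "Placeholder summary.",
    "message": "Placeholder commit message.",
    "parents": ["def456"],
    "ext_files": ["php"],
    "files": [{"path": "index.php", "additions": 2, "deletions": 1,
               "changes": 3, "status": "modified"}],
    "comments": [{"author": "reviewer", "date": "2000-01-02", "body": "LGTM"}],
}

record = parse_record(SAMPLE)
```

Keeping the nested `files` and `comments` entries as their own small classes means a misspelled or missing key fails at parse time rather than later during analysis.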

Programming Language

Language Commits
C/C++ 3944
Java 1369
PHP 1350

See the paper for details about the other programming languages covered.

CWE/Weakness

CWE Commits
CWE-79 870
CWE-20 712
CWE-119 705
CWE-200 419
CWE-125 380

See the paper for details about the other weaknesses covered.
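
Per-language and per-CWE tallies like the ones above can be recomputed from the records with a simple counter. The sketch below assumes each record is a dict with `lang` and `cwe_id` keys and corresponds to one commit. The helper name `count_by` and the three sample records are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from typing import Any, Iterable


def count_by(records: Iterable[dict[str, Any]], key: str) -> Counter:
    """Count records per value of `key`, skipping records where it is missing or empty."""
    return Counter(r[key] for r in records if r.get(key))


# Hypothetical records, for illustration only.
records = [
    {"lang": "C/C++", "cwe_id": "CWE-119"},
    {"lang": "C/C++", "cwe_id": "CWE-125"},
    {"lang": "PHP", "cwe_id": "CWE-79"},
]

by_lang = count_by(records, "lang")
by_cwe = count_by(records, "cwe_id")

# Print languages from most to least frequent.
for lang, n in by_lang.most_common():
    print(lang, n)
```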