TPC-H benchmark data & queries generation utilities - dbgen & qgen

This repository holds the data generation and query generation utilities (dbgen and qgen respectively) for the TPC-H analytics benchmark for DBMSes.

Repository build status on Github Workflow:

Table of contents
"What? Another fork of tpch-dbgen? Why?" About the TPC-H benchmark Building the generation utilities Using `dbgen` to generate data

"What? Another fork of tpch-dbgen? Why?"

The tpch-dbgen utility is a minor(ish) m odification of the original utility available on the TPC-H benchmark's dedicated page. It's somewhat crufty code with all kinds of warts. It's not the intention of this repository to make all of those disappear, or to effect a rewrite - but avoiding compiler warnings, ensuring the build passes on more platforms, and other such minor issues are what this endeavor is about.

Now, there are several repositories on GitHub with such modifications of the official TPC-H dbgen - each resolving some issues and ignoring others; some adding customizations relevant only to some users; and each starting out with a potentially outdated version of the official utility. This particular repository is an attempt to unify all of them, applying all changes to the code which - in my opinion - are generally useful, and merging them all together. Details of what's already been done can be found on the Closed Issues Page and of course by examining the commit comments.

If you are the author of one of the other repositories - please contact me for better coordination of this effort.

About the TPC-H benchmark

TPC benchmark H is one of several benchmarks created and maintained by the Transaction Processiong Council - TPC for short. The TPC, is industrial consortium with the major DBMS vendors as members, which is intended to standardize the performance benchmarking of DBMSes in various usage scenarios. Other TPC-H benchmarks have been named TPC-A, TPC-B, TPC-R, TPC-DS and so on; some are deprecated and no longer irrelevant. You can read about the actively-supported benchmarks here.

TPC-H is intended for benchmarking the Decision Support performance of a DBMS, or in other words - analytic query processing performance. It follows a commercial business scenario involving suppliers, customers, parts, suppliers of parts and orders of parts shipped to clients, over the course of several business years (1992-1998). Its schema is not very complicated, but not trivial either (e.g. it is not a "star schema"). It has 22 queries ranging from simpler to relatively long and involved - although none of them utilizes more advanced or esoteric SQL features (such as window functions, UDFs and so on). It is in wide use both in its "proper" form - of sending random variants of the queries from the set to a running DBMS over the course of an hour - and in the "artificial" form of running individual queries on a cold DBMS.

Building the generation utility

The original, TPC-distributed version of the code in this repository required manually creating a Makefile from am template (makefile.suite) - that is not necessary with this repository.

The build process is automated using Kitware CMake. There are several settings which you must make before building, which CMake will guide you through if you invoke it properly - that is, using the GUI or terminal-based user interface that presents the configuration options for you to choose from.

Following is an explanation of the values you would need to set using the GUI or TUI:

Variable	Used by	How to set it?	List of options
`DATABASE`	qgen	Select the name of the DBMS closest to the one you're benchmarking in terms of syntax. If unsure, choose `DB2`	`INFORMIX` `DB2` `TDAT` `SQLSERVER` `SYBASE` `ORACLE` `VECTORWISE` `POSTGRES`
`CSV_OUTPUT_FORMAT`	dbgen	Set to `ON` for output in the CSV format - with comma-delimited fields and double-quoted strings.	`ON` `OFF`
`WORKLOAD`	dbgen, qgen	Use `TPCH`
`EOL_HANDLING`	dbgen	Set to `OFF` if your DBMS doesn't support loading the data if it has a separator character at the end of each line	`ON` or `OFF` (it's a boolean really)

CMake will generate files (including a Makefile) which you can then use with your platform-specific build tools, e.g. NMake for Windows or GNU Make on Unix-like systems. If you're not sure how to use them, consult the documentation or Goole.

Note: The PLATFORM or MACHINE variable, which in TPC's own version had to be manually set within the Makefile, is now not necessary: Platform features will be auto-detected and taken into account (like any normal software package these days).

Using `dbgen` to generate schema data

Typically, dbgen is invoked with a specific scale factor, and it must be directed at the dists.dss file. Thus the command-line should look something like this:

$ /path/to/dbgen -v -s 10 -b /path/to/dists.dss

which will create the various table files (e.g. customer.tbl, nation.tbl, region.tbl, supplier.tbl and so on) in the current directory, with a scale factor of 10, i.e. 300,000 customer lines. Here are the first few lines of a resulting customer.tbl:

1|Customer#000000001|j5JsirBM9P|MOROCCO  0|MOROCCO|AFRICA|25-989-741-2988|BUILDING|
2|Customer#000000002|487LW1dovn6Q4dMVym|JORDAN   1|JORDAN|MIDDLE EAST|23-768-687-3665|AUTOMOBILE|
3|Customer#000000003|fkRGN8n|ARGENTINA7|ARGENTINA|AMERICA|11-719-748-3364|AUTOMOBILE|
4|Customer#000000004|4u58h f|EGYPT    4|EGYPT|MIDDLE EAST|14-128-190-5944|MACHINERY|

the fields are separated by a pipe character (|), and there's a trailing pipe at the end of the line - unless you set EOL_HANDLING option to OFF.

After generating .tbl files for all tables, you should now either load them directly into your DBMS, or apply some textual processing on them.

Note: On Unix-like systems, it is also possible to write the generated data into a FIFO filesystem node, reading from the other side with a compression utility, so as to only write compressed data to disk. This may be useful of disk space is limited and you are using a particularly high scale factor.

Have you encountered some other issue with dbgen or qgen? Please open a new issue on the Issues Page; be sure to list exactly what you did and enter a copy of the terminal output of the commands you used. Note that the issue might be with the original, unmodified TPC-H dbgen/qgen code, in which case the issue may be forwarded rather than resolved.

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
.github/workflows		.github/workflows
answers		answers
check_answers		check_answers
doc		doc
queries		queries
reference		reference
src		src
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
dists.dss		dists.dss

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

answers

answers

check_answers

check_answers

doc

doc

queries

queries

reference

reference

src

src

tests

tests

.gitignore

.gitignore

CMakeLists.txt

CMakeLists.txt

README.md

README.md

dists.dss

dists.dss

Repository files navigation

TPC-H benchmark data & queries generation utilities - dbgen & qgen

"What? Another fork of tpch-dbgen? Why?"

About the TPC-H benchmark

Building the generation utility

Using `dbgen` to generate schema data

About

Releases

Packages

Languages

eyalroz/tpch-dbgen

Folders and files

Latest commit

History

Repository files navigation

TPC-H benchmark data & queries generation utilities - dbgen & qgen

About

Topics

Resources

Stars

Watchers

Forks

Languages