A benchmark for time-series systems using a simple HTAP/IoT workload.

Jameak/OccupancyBench

OccupancyBench: A benchmark for time-series systems using a simple HTAP/IoT workload

OccupancyBench is a benchmark I developed during a project at the IT University of Copenhagen (ITU) and expanded during my thesis.

The benchmark is designed to model the WiFi occupancy system at ITU. It generates synthetic WiFi access point data from seed data and supports scaling the number of floors, access points, and connected clients. It runs a small mix of traditional IoT queries and analytical queries inspired by the queries that students and staff run against the ITU dataset.
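To make the idea of scaling seed data concrete, here is a minimal sketch in Python. It is not the benchmark's actual generator: the `Reading` type, the `scale_readings` function, its parameters, the 0-indexed floor numbering, and the ±10% jitter are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Hypothetical seed record: one client count for one access point."""
    floor: int
    access_point: str
    clients: int


def scale_readings(seed, floor_factor=1, ap_factor=1, client_factor=1.0, rng=None):
    """Replicate seed readings onto extra floors and access points.

    Assumes floors in the seed are numbered from 0, so each floor copy is
    shifted by the number of seed floors to avoid collisions.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    seed_floors = max(r.floor for r in seed) + 1
    scaled = []
    for floor_copy in range(floor_factor):
        for ap_copy in range(ap_factor):
            for r in seed:
                jitter = rng.uniform(0.9, 1.1)  # small variation between copies
                scaled.append(Reading(
                    floor=r.floor + floor_copy * seed_floors,
                    access_point=f"{r.access_point}-f{floor_copy}-c{ap_copy}",
                    clients=max(0, round(r.clients * client_factor * jitter)),
                ))
    return scaled
```

Calling `scale_readings(seed, floor_factor=2, ap_factor=3)` on a two-reading seed would yield twelve readings spread over twice as many floors.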

The benchmark can run ingestion and querying either individually or as a mixed workload. The workload is highly configurable: you can change the data access pattern, query distribution, data sampling rate, schema, and more.

Supported databases

The benchmark supports 3 database systems:

  • InfluxDB, a columnar time-series DBMS.
  • TimescaleDB, a relational time-series DBMS built as a PostgreSQL extension.
  • Apache Kudu, a distributed columnar storage engine designed for real-time analytics.

For limitations of the benchmark's implementations for these systems, see the known limitations file.

Why create a new benchmark?

The existing HTAP/IoT benchmarks that I encountered during my studies mainly focused on large-scale systems with thousands of sensors and high data cardinality. While the performance of such big-data use-cases is important, I believe that benchmarks for smaller systems, whose sensor counts and cardinality are lower, have been neglected. That is the niche this benchmark attempts to fill.

Concretely, because this benchmark models the ITU occupancy system workload and allows scaling of the seed data that is given as input, it can be used to evaluate whether it makes sense to replace the existing ITU occupancy setup with an alternative time-series system.

Generated data complexity and data cardinality

The data generated by the benchmark has extremely low data-complexity and low data-cardinality because it is designed to model the ITU occupancy system where only a single metric, the number of clients connected to each access point, is stored.

The access points in the original ITU setup expose more metrics that could be of interest, but only the connected-clients metric is stored in the occupancy system due to privacy concerns.

Repo structure

For ease-of-use, this repository is laid out as follows:

  • The benchmark folder contains the benchmark code, build instructions, and documentation regarding the benchmark configuration.
  • The seed folder contains sample seed data that documents the seed data format, as well as a seed data generator to help people get started with the benchmark. Note that this generator does not generate realistic data and should not be used for serious benchmarking, because the cache and compression behavior of a database is likely to differ between fake and real data.
  • The scripts folder contains some of the scripts used during development.

Known implementation issues are documented in the known limitations file, and suggested future work is documented in the future work file.

Getting started

To get started with the benchmark:

  1. Follow the build instructions to compile the benchmark.
  2. Extract seed data from your system and create the metadata-seed files as described in the sample seed readme. If you do not have your own seed data, follow the build instructions to compile the sample seed generator program and generate seed data.
  3. Install your target databases and configure them as desired. Create database-users that can authenticate with a password and create a database for the benchmark to use.
    • If you're targeting InfluxDB, you need to enable the http-endpoint and http-auth.
    • If you're targeting TimescaleDB and want to run the database and the benchmark on different hosts, you need to configure the database to allow remote connections.
    • If you're targeting Apache Kudu and want to run the database and the benchmark on different hosts, you may need to extend the default list of trusted subnets since Kudu does not use username/password auth.
  4. Generate the default benchmark configuration file and fill it with your database credentials and seed-data paths.
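Before running the benchmark against a remote host, it can help to confirm that the database ports are reachable at all. The following is a minimal sketch and not part of the benchmark: the `is_reachable` helper and the `db.example.org` host name are hypothetical, and the ports listed are the usual defaults (8086 for the InfluxDB HTTP endpoint, 5432 for PostgreSQL/TimescaleDB, 7051 for the Kudu master RPC), which your installation may have changed.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usual default ports; adjust if your installation uses different ones.
DEFAULT_PORTS = {
    "InfluxDB (HTTP endpoint)": 8086,
    "TimescaleDB (PostgreSQL)": 5432,
    "Apache Kudu (master RPC)": 7051,
}

if __name__ == "__main__":
    host = "db.example.org"  # hypothetical host name
    for name, port in DEFAULT_PORTS.items():
        status = "reachable" if is_reachable(host, port) else "NOT reachable"
        print(f"{name} on {host}:{port}: {status}")
```

This only checks TCP reachability. It does not verify credentials, InfluxDB's http-auth setting, or Kudu's trusted subnets.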

Acknowledgements

Thank you to my supervisor, Pinar Tözün, for help and guidance during the initial benchmark project and throughout my thesis.