
Big Data utilities for running queries on WSPR DataSets using Apache Arrow, Spark, PySpark, Scala and Java on CSV, Parquet or Avro file formats.


WSPR Analytics

In the early days (March 2008), WSPR spots numbered in the hundreds of thousands per month. Today, that number exceeds 75 million per month and shows no sign of abating. By any reasonable definition, it is safe to say that WSPR has entered the realm of Big Data.

Features

  • Full Tool Chain Installation and Environment Setup Guide(s)
  • Tutorials, experiments, and tests on large data sets
  • Exposure to leading-edge technologies in the realm of Big Data Processing
  • Hints, tips, and tricks for keeping your Linux distro running smoothly
  • And eventually, produce useful datasets for the greater Amateur Radio Community to consume

Primary Focus

The focus of this project is to provide a set of tools to download, manage, transform and query WSPR DataSets using modern Big Data frameworks.

Setup and Installation

The setup section in the documentation below walks users through everything they need to set up their systems for big data processing. The guide has been well tested on three Linux distributions: Ubuntu 20.04, Arch Linux, and Alpine.

For newer Linux users, I'd highly recommend Ubuntu 20.04, as Arch and Alpine can be difficult if you are not accustomed to their installation methods.

Documentation

Each folder contains a series of README files that explain its contents and, where warranted, usage. Additionally, the project documentation website provides a more extensive and exhaustive explanation of the project and its contents.

Folder Descriptions

Several frameworks are used in this repository. The following matrix provides a short description of each and its intended purpose.

| Folder     | Frameworks           | Description                                      |
|------------|----------------------|--------------------------------------------------|
| docs       | Python, MkDocs       | General repository documentation                 |
| golang     | Golang               | General-purpose command line apps and utilities  |
| java       | Java, Maven, SBT     | Java apps for RDD and Avro examples              |
| notebooks  | Jupyter Notebooks    | Notebooks for basic tests and visualization      |
| pyspark    | Python, PyArrow      | Scripts that interact with CSV and Parquet files |
| spark      | Scala                | Scala programs to perform ETL tasks              |
| wsprdaemon | Python, Scala, Psql  | Utilities related to the WSPR Daemon project     |
| wsprana    | Python               | (soon to be retired)                             |

Base Tool Requirements

You must have Python, Java, PySpark / Spark (Scala) and SBT available from the command line.

  • Java OpenJDK version 1.8.0_275 or later
  • Python 3.7 or 3.8 (PyArrow has issues with 3.9 at present)
  • PySpark from PyPi
  • Apache Arrow 2.0+
  • Scala 2.12.12 (patch versions 2.12.10 through 2.12.13 also work with Spark 3.0.1 / 3.1.1)
  • Spark 3.0.1
  • PostgreSQL Database (local, remote, Docker, Vagrant, etc)
  • Optional: ClickHouse high-performance database

IMPORTANT: The Spark / Scala combinations are version-sensitive. Check the Spark download page for recommended version combinations if you deviate from what is listed here. As of this writing, Spark 3.0.1 and later were built with Scala 2.12.10. For the least amount of frustration, stick with what is known to work (any of the 2.12.x series).
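
As a quick sanity check that the tool chain fits together, the short sketch below starts a local SparkSession with Arrow-based pandas conversion enabled and prints the Spark and PyArrow versions. The config key assumes PySpark 3.0 or later.

```python
# toolchain_check.py - confirm PySpark and PyArrow are installed and compatible (minimal sketch)
import pyarrow
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("wspr-toolchain-check")
    .master("local[*]")
    # Spark 3.x key that turns on Arrow-based pandas <-> Spark conversion
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

print("Spark version  :", spark.version)
print("PyArrow version:", pyarrow.__version__)
spark.stop()
```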

Data Sources and Processing

The main data source will be the monthly WSPRNet archives, along with data from the WSPR Daemon project. At present, there is no plan to pull nightly updates; that could change if a reasonable API is identified.
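
For illustration, the sketch below pulls one monthly archive using nothing more than the Python standard library. The wsprspots-YYYY-MM.csv.gz file-name pattern reflects the archive layout at the time of writing, so verify the exact URL before automating anything around it.

```python
# fetch_archive.py - download a monthly WSPRNet archive (illustrative sketch;
# the URL pattern is an assumption, confirm it on wsprnet.org before relying on it)
import urllib.request

year, month = 2020, 12
name = f"wsprspots-{year:04d}-{month:02d}.csv.gz"
url = f"http://wsprnet.org/archive/{name}"

print(f"Downloading {url} ...")
urllib.request.urlretrieve(url, name)  # saves the gzip'd CSV next to the script
print(f"Saved {name}")
```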

The tools (apps/scripts) will be used to convert the raw CSV files into a format better suited for parallel processing, namely Parquet. Read speeds, storage footprint, and ingestion improve dramatically with this storage format. There is a drawback, however: a binary file cannot simply be viewed the way a raw text file can. The original CSV files will remain in place, but all bulk processing will be pulled from Parquet or a high-performance database such as ClickHouse. It is during these transformations that PyArrow, PySpark, or Spark will earn their keep.
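
A minimal sketch of that conversion step using PyArrow is shown below. The archives ship without a header row, so column names must be supplied; the list used here follows the commonly documented WSPRNet spot layout but should be treated as an assumption and checked against the actual files.

```python
# csv_to_parquet.py - convert a raw WSPRNet CSV into Parquet with PyArrow
# (sketch; the column list is an assumption, verify it against the archive documentation)
from pyarrow import csv, parquet

columns = [
    "spot_id", "timestamp", "reporter", "reporter_grid", "snr", "frequency",
    "call_sign", "grid", "power", "drift", "distance", "azimuth", "band",
    "version", "code",
]

table = csv.read_csv(
    "wsprspots-2020-12.csv",
    read_options=csv.ReadOptions(column_names=columns),
)
parquet.write_table(table, "wsprspots-2020-12.parquet", compression="snappy")
print(f"Wrote {table.num_rows} rows")
```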

Persistent Storage

A PostgreSQL database server will be needed. There are many ways to perform this installation (local, remote, Dockerized PostgreSQL, PostgreSQL with Vagrant, etc.).
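
However the server is provisioned, a quick connectivity check from Python can look like the sketch below; the host, database name, and credentials are placeholders, and psycopg2 is simply one common driver choice.

```python
# pg_check.py - confirm the PostgreSQL server is reachable (placeholder credentials)
import psycopg2

conn = psycopg2.connect(
    host="localhost",   # wherever your server runs
    dbname="wspr",      # placeholder database name
    user="wspr",
    password="changeme",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```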

High Performance Database

While PostgreSQL is a highly capable RDBMS, ClickHouse, a database better suited to big data and extremely fast queries, will also be used.

It is column-oriented and allows analytical reports to be generated from SQL queries in real time.

  • Blazingly fast
  • Linearly scalable
  • Feature rich
  • Hardware efficient
  • Fault-tolerant
  • Highly reliable

More details are available from the ClickHouse organization on GitHub.
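
To make the query side concrete, here is a small sketch using the clickhouse-driver Python package; the wspr.spots table and its columns are hypothetical placeholders for whatever schema the ETL step eventually produces.

```python
# ch_query.py - sample analytical query against ClickHouse
# (the wspr.spots table and column names are hypothetical placeholders)
from clickhouse_driver import Client

client = Client(host="localhost")  # native protocol, default port 9000

# Spot counts per band: the kind of aggregation a column store answers quickly.
rows = client.execute(
    """
    SELECT band, count() AS spots
    FROM wspr.spots
    GROUP BY band
    ORDER BY spots DESC
    """
)
for band, spots in rows:
    print(f"band {band}: {spots} spots")
```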