cmd/ursrv: Improve data storage and aggregation (fixes #9212) #9305

Draft · er-pa wants to merge 20 commits into main

Conversation

@er-pa (Member) commented Dec 26, 2023

Purpose

Currently, ursrv stores all incoming usage reports directly in a PostgreSQL database. Once a day, only part of the data from these reports is aggregated and stored in three separate tables. Apart from that aggregation, all of the previous day's reports are loaded and parsed in the caching layer. This isn't very efficient, and it forces us to keep storing all the usage reports (currently approx. 136k per day).

With this change, most of the intensive work moves into the aggregation step, which runs once a day. The aggregation now compiles a single aggregated report that combines the data previously produced by the caching mechanism for the previous day's reports with the aggregated data (the three separate tables) for that same day. This new kind of report is stored in a blob store (either locally on disk or in S3-compatible storage). The usage reports themselves are still stored as-is (in the new storage) and are not cleaned up by anything in this change. The idea is that some sort of external retention setting can be used to clean out outdated usage reports, whenever and however that is preferred, but the goal is to probably not keep more than two days' worth of usage reports around.
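
To illustrate the shape of this, a minimal sketch of the blob-store abstraction (the interface and names here are placeholders, not necessarily the actual code in this PR):

package blobstore

import (
	"context"
	"io"
)

// Store is the minimal surface the aggregator and the server need: put a
// serialized report, read it back, and list keys by prefix. A local-disk
// implementation and an S3-compatible implementation would both satisfy it.
type Store interface {
	Put(ctx context.Context, key string, data io.Reader) error
	Get(ctx context.Context, key string) (io.ReadCloser, error)
	List(ctx context.Context, prefix string) ([]string, error)
}

An aggregated report for a given day would then live under a key such as aggregated/2023-12-26.json (the key layout, too, is just an example).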

On the web side itself very little changed, really only some variable names. Most of the logic remained the same as well, mostly because it has proven to work and I didn't see much need to change it significantly. However, a lot of the functionality moved from the serve side to the aggregate side. The serve side now practically only contains logic to format the data into something the front-end can handle easily. Reports from very old Syncthing instances are now simply ignored, instead of having odd work-arounds here and there.

The caching logic changed too. The cache holds the latest aggregated report (from the day before) and data extracted from a whole series of daily aggregated reports for use in graphs (versions, performance and blockstats). The report is updated whenever a new one is found; the graph data is updated whenever there is a new report or when an inconsistency is found (the number of reports in the store versus the number of cached data points), as these two should end up being equal.
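
As a rough sketch of that refresh rule (names are illustrative only, not the actual code):

package serve

import "time"

// shouldRebuildGraphs reports whether the cached graph data needs to be
// regenerated: either a newer aggregated report has appeared, or the number
// of aggregated reports in the store no longer matches the number of cached
// data points.
func shouldRebuildGraphs(latestInStore, latestCached time.Time, reportsInStore, cachedPoints int) bool {
	if latestInStore.After(latestCached) {
		return true // a new aggregated report was published
	}
	return reportsInStore != cachedPoints // store and cache are out of sync
}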

Migrate

Temporary migration support was added (a --migrate option to the aggregate command). It's quite contained, so it should be easy to remove from the code later, together with any remaining database-related content. This migration method reads all the usage reports from the old postgres db, compiles a (new) aggregated report per day and stores it in the new location. The continuous aggregation won't start while doing this.
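
Roughly, the per-day migration loop looks like the sketch below. The table and column names ("reports", "received", "report") and the newAggregation/addJSON helpers are placeholders rather than the actual schema and API, and Store is the blob-store interface sketched earlier:

package main

import (
	"bytes"
	"context"
	"database/sql"
	"encoding/json"
	"time"
)

// migrate walks the old postgres database one day at a time, compiles an
// aggregated report for each day and writes it to the new blob store.
func migrate(ctx context.Context, db *sql.DB, store Store, from, to time.Time) error {
	for day := from; !day.After(to); day = day.AddDate(0, 0, 1) {
		rows, err := db.QueryContext(ctx,
			`SELECT report FROM reports WHERE received >= $1 AND received < $2`,
			day, day.AddDate(0, 0, 1))
		if err != nil {
			return err
		}
		agg := newAggregation(day) // hypothetical per-day accumulator
		for rows.Next() {
			var raw []byte
			if err := rows.Scan(&raw); err != nil {
				rows.Close()
				return err
			}
			agg.addJSON(raw) // hypothetical: parse and fold in one usage report
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			return err
		}
		buf, err := json.Marshal(agg)
		if err != nil {
			return err
		}
		key := "aggregated/" + day.Format("2006-01-02") + ".json"
		if err := store.Put(ctx, key, bytes.NewReader(buf)); err != nil {
			return err
		}
	}
	return nil
}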

Old data is -not- adjusted or changed. With this in mind it is probably possible to fire up the new version alongside the old one, listening on different ports, run the migration and see if it all makes sense and works as expected.

Testing

That's a bit tricky. I did fire up a local ursrv and pointed some local Syncthing instances at it; this all seemed to work as expected. Needless to say, it's a bit difficult to test more comprehensively.

(The reason this is still a draft is that I wanted to run the checks, and I'll run a couple more tests of the recent changes.)

@calmh (Member) commented Dec 27, 2023

Old data is -not- adjusted or changed. With this in mind it is probably possible to fire up the new version alongside the old one, listening on different ports, run the migration and see if it all makes sense and works as expected.

Yes; we should do precisely this. Perhaps we can even rig up a thing to post incoming reports to both old and new and let them run concurrently while finalising/tweaking the PR, and then shift over smoothly.
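
For example, a tiny fan-out proxy in front of the two instances could do it; a sketch, with placeholder URLs and ports:

package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

// Forward every incoming usage report to both the old and the new ursrv.
// The target URLs and the endpoint path are placeholders.
var targets = []string{
	"http://old-ursrv:8080/newdata",
	"http://new-ursrv:8081/newdata",
}

func fanout(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	for _, url := range targets {
		resp, err := http.Post(url, r.Header.Get("Content-Type"), bytes.NewReader(body))
		if err != nil {
			log.Printf("forwarding to %s failed: %v", url, err)
			continue
		}
		resp.Body.Close()
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/", fanout)
	log.Fatal(http.ListenAndServe(":8080", nil))
}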

(I have lots of code comments! We'll get to those in time.)

@AudriusButkevicius (Member) commented Dec 27, 2023

Half of the code is adding new storage stuff. What's wrong with storing things in the storage engine we already have, i.e., postgres? Why do we need to pull more machinery into this?

@calmh (Member) commented Dec 27, 2023

I would like to move to something that runs more easily as a stateless container, with something cloud-based for storage. S3-style blob stores are a dime a dozen, and we wouldn't need to maintain a PostgreSQL server.

@calmh (Member) commented Jan 13, 2024

I did some hacks and started running this. The migration is not great at the moment: it generates "empty" aggregated JSON files (i.e., all values null and zero), possibly because it ignores the old versions that generated those reports. (It also takes about an hour per day of data, after multiple hours to find the minimum date to start from... We should either optimise that query to use an index, or add the start date as a command line parameter and default it to 2014-06-11, which is what the query results in :p)

In any case, migration looks like it would take months to complete with the current method, so we may need to think about how to do it faster.

20140613.json

@calmh (Member) commented Jan 16, 2024

Upon further consideration, I think the current aggregation is too lossy. It's fine for its current purpose: displaying the data we're currently interested in, in the GUI. However, it's rather limiting to say that we will never want to do any other form of analysis of the data. Instead, let's store a higher-fidelity aggregation. Starting with contract.Report, we already have a couple of convenience methods to flatten it (FieldPointers and FieldNames). I would like to store an aggregated report that has data for each and every one of these fields:

  • For int and float64 fields, we should store min, max, median, average, 5%ile, 95%ile, sum, and the count of non-zero values (a computation sketch follows below)
  • For map-of-int fields, the above per field
  • For boolean fields, count of true
  • For string fields, a histogram

Now, not all of the metrics will be relevant/sensible for each field (e.g., the sum of NumCPU does not measure anything we care about, and for many of the feature-enabled metrics we mostly care about count-of-nonzero vs total count), but if we do the same thing everywhere we get sensible handling by default for any new fields that may be introduced, something we're otherwise likely to forget.
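
For the numeric fields, the statistics listed above could be computed with something like the sketch below (nearest-rank percentiles; the exact method and all names are just an example):

package aggregate

import "sort"

// floatStatistic mirrors the FloatStatistic message proposed below.
type floatStatistic struct {
	Count         int64   // count of non-zero reports
	Min, Max, Med float64
	Fp, Nfp       float64 // 5th and 95th percentile
	Sum           float64 // divide by Count for the average
}

func computeFloatStatistic(values []float64) floatStatistic {
	// Keep only the non-zero values, matching "count of non-zero".
	nz := make([]float64, 0, len(values))
	for _, v := range values {
		if v != 0 {
			nz = append(nz, v)
		}
	}
	var s floatStatistic
	if len(nz) == 0 {
		return s
	}
	sort.Float64s(nz)
	s.Count = int64(len(nz))
	s.Min = nz[0]
	s.Max = nz[len(nz)-1]
	s.Med = nz[len(nz)/2]
	s.Fp = nz[len(nz)*5/100]   // nearest-rank 5th percentile
	s.Nfp = nz[len(nz)*95/100] // nearest-rank 95th percentile
	for _, v := range nz {
		s.Sum += v
	}
	return s
}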

I would express this aggregated report as a protobuf message, something like this:

message Aggregation {
  int64 date = 1;
  int64 count = 2;
  repeated Statistic statistics = 3;
}

message Statistic {
  string key = 1;
  oneof statistic {
    FloatStatistic float = 2; // e.g., for HashPerf
    IntegerStatistic integer = 3; // e.g., for NumCPU
    Histogram histogram = 4; // e.g., for Version, or Location...
    IntegerStatisticMap map = 5; // e.g., for FilesystemType
  }
}

// Map fields can't appear directly inside a oneof (and field numbers must be
// unique within the message), so the maps are wrapped in small messages.
message Histogram {
  map<string, int64> values = 1;
}

message IntegerStatisticMap {
  map<string, IntegerStatistic> values = 1;
}

message FloatStatistic {
  int64 count = 1; // count of non-zero reports
  double min = 2;
  double max = 3;
  double med = 4;
  double nfp = 5; // 95th percentile
  double fp = 6; // fifth percentile
  double sum = 7; // divide by count for average
}

message IntegerStatistic {
  // Same as FloatStatistic, but the values are int64.
  int64 count = 1; // count of non-zero reports
  int64 min = 2;
  int64 max = 3;
  int64 med = 4;
  int64 nfp = 5; // 95th percentile
  int64 fp = 6; // fifth percentile
  int64 sum = 7; // divide by count for average
}

That way, most data would be preserved in a usable format for graphing and tabulation. The actual API format towards the GUI might well remain as-is, of course, or be changed if that's more convenient.

Generating this should be doable without hardcoding, by iterating over FieldNames and FieldPointers and switching on the actual type of each value.
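
A sketch of that iteration, assuming FieldNames and FieldPointers return matching slices of names and pointers to the report's fields (the accumulator type and its methods are hypothetical):

package aggregate

import "github.com/syncthing/syncthing/lib/ur/contract"

// accumulator is a hypothetical per-field collector feeding the
// Float/Integer/histogram statistics from the proposed message.
type accumulator struct {
	ints   []int64
	floats []float64
	hist   map[string]int64
}

func (a *accumulator) addInt(v int64)     { a.ints = append(a.ints, v) }
func (a *accumulator) addFloat(v float64) { a.floats = append(a.floats, v) }
func (a *accumulator) addHistogram(v string) {
	if a.hist == nil {
		a.hist = make(map[string]int64)
	}
	a.hist[v]++
}

// addReport folds one usage report into the per-field accumulators,
// dispatching on the concrete type of each field rather than hardcoding
// field names.
func addReport(acc map[string]*accumulator, rep *contract.Report) {
	names := rep.FieldNames()
	ptrs := rep.FieldPointers()
	for i, name := range names {
		a := acc[name]
		if a == nil {
			a = &accumulator{}
			acc[name] = a
		}
		switch v := ptrs[i].(type) {
		case *int:
			a.addInt(int64(*v)) // IntegerStatistic
		case *float64:
			a.addFloat(*v) // FloatStatistic
		case *bool:
			if *v {
				a.addInt(1) // count of true
			}
		case *string:
			a.addHistogram(*v) // histogram
		case *map[string]int:
			// A per-key IntegerStatistic (the map case in the message)
			// would be accumulated here.
		default:
			// New field types would need explicit handling here.
		}
	}
}

Any field type not covered by the switch then becomes an explicit reminder to extend the aggregation, rather than something that silently falls through.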
